Technical noteOptimal partitioning of soil transects with R
D G Rossiter
October 16 2013
Contents
1 Example transect 1
2 Split moving-window 321 Functions for the SMW method 422 Options for the SMW method 4
221 Standardized principal components 5222 Original variables 5223 Window width 6224 Window parity 7225 Mullion 7226 What Mahanalobis distance to report 8
23 Analyzing the example transect 8231 Using PCs 8232 Using original variables 11
24 Visualizing the boundaries 12
3 When SMW fails to find a solution 14
4 Maximum Level Variance 16
References 17
A Example transect 18
B Split moving-window functions 19B1 PCA 19
Version 21 Copyright copy 2004 2009 2013 D G Rossiter All rights re-served Reproduction and dissemination of the work as a whole (notparts) freely permitted if this original copyright notice is included Saleor placement on a web site where payment must be made to access thisdocument is strictly prohibited To adapt or translate please contact theauthor (httpwwwitcnlpersonalrossiter)
B2 Find boundaries 19B3 Plot the transect 21
C Insight into Mahalanobis distance 22
Index of R concepts 26
ii
This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]
Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note
Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms
1 Example transect
We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA
Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)
dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463
Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)
1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA
1
gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)
0 10 20 30 40
020
040
060
080
0
Particleminussize fraction along transect
black A blue B green C solid sand dashed claystation
g kg
minus1
It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases
It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges
Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)
2
0 10 20 30 40
minus4
minus2
02
4
PCs of particleminussize fraction along transect
solid PC1 dashed PC2station
PC
sco
re
The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect
For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences
2 Split moving-window
The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as
D2 = (xs minus xd)TWminus1(xs minus xd) (1)
where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2
1 Variables with lower pooled within-half residual variances are givenmore weight
2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)
2 See numeric example in Appendix D
3
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
B2 Find boundaries 19B3 Plot the transect 21
C Insight into Mahalanobis distance 22
Index of R concepts 26
ii
This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]
Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note
Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms
1 Example transect
We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA
Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)
dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463
Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)
1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA
1
gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)
0 10 20 30 40
020
040
060
080
0
Particleminussize fraction along transect
black A blue B green C solid sand dashed claystation
g kg
minus1
It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases
It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges
Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)
2
0 10 20 30 40
minus4
minus2
02
4
PCs of particleminussize fraction along transect
solid PC1 dashed PC2station
PC
sco
re
The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect
For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences
2 Split moving-window
The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as
D2 = (xs minus xd)TWminus1(xs minus xd) (1)
where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2
1 Variables with lower pooled within-half residual variances are givenmore weight
2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)
2 See numeric example in Appendix D
3
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
This note describes procedures to identify soil boundaries along a tran-sect where soil samples have been taken at regular intervals It is basedon the work of Webster [3ndash6] and used as an example in the text of Davis[1 pp 234-243]
Two partitioning methods are described by Webster [4] Split moving-window (SMW) (sect2) and Maximum Level Variance (MLV) (sect4) Only thefirst is developed in this technical note
Note The code in this document was tested with R version 301 (2013-05-16) and packages from that version or later running on Mac OS X1075 The text and graphical output you see here was written as aNoWeb file including both R code and regular LATEX source and then runthrough the excellent knitr package Version 141 [7] on R to generatea LATEX document that includes formatted R code input and the resultsof running the code both text results and graphs Then the LATEX docu-ment was compiled into the PDF version you are now reading If you runthe R code from this document your output may be slightly different ondifferent versions and on different platforms
1 Example transect
We illustrate optimal partitioning with a transect of 42 stations spacedat 25 m intervals in an undisturbed forest near Juruena Mato Grosso inthe southern Amazon collected by Steven Jirka of Cornell University aspart of the LBA1 project Both field and laboratory measurements weremade for illustrative purposes we use only the sand and clay contentmeasured in g kg-1 of three layers These are organized as an R dataframe with the rows being the stations and the columns the variablesThe sample dataset is provided as a comma-separated value file trcsvand is reproduced in Appendix sectA
Task 1 Read the transect from the CSV file into R and display itsstructure bullgt transect lt- readcsv(trcsv)gt str(transect)
dataframe 42 obs of 6 variables$ sandA num 438 449 560 549 428 $ clayA num 283 316 216 261 472 $ sandB num 349 349 449 516 404 $ clayB num 372 405 305 272 429 $ sandC num 483 616 616 516 271 $ clayC num 239 205 172 239 463
Task 2 Plot the six variables along the transect bullgt plot(transect$sandA type=l ylim=c(0900)+ main=Particle-size fraction along transect+ xlab=station ylab=g kg-1+ sub=black A blue B green C solid sand dashed clay)gt lines(transect$sandB type=l col=blue)
1 Large-scale Biosphere-Atmosphere Experiment in Amazonia see httpearthobservatorynasagovStudyLBA
1
gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)
0 10 20 30 40
020
040
060
080
0
Particleminussize fraction along transect
black A blue B green C solid sand dashed claystation
g kg
minus1
It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases
It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges
Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)
2
0 10 20 30 40
minus4
minus2
02
4
PCs of particleminussize fraction along transect
solid PC1 dashed PC2station
PC
sco
re
The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect
For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences
2 Split moving-window
The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as
D2 = (xs minus xd)TWminus1(xs minus xd) (1)
where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2
1 Variables with lower pooled within-half residual variances are givenmore weight
2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)
2 See numeric example in Appendix D
3
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
gt lines(transect$sandC type=l col=green)gt lines(transect$clayA type=l col=1 lty=2)gt lines(transect$clayB type=l col=blue lty=2)gt lines(transect$clayC type=l col=green lty=2)
0 10 20 30 40
020
040
060
080
0
Particleminussize fraction along transect
black A blue B green C solid sand dashed claystation
g kg
minus1
It is evident that there is spatial autocorrelation ie nearby stations arelikely to be similar It also appears that the sequence can be broken upinto several more homogeneous sections A clear difference is that thebeginning of the sequence has fairly equal sand and clay whereas afterabout station 15 there is much more sand than clay However there is abreak in this pattern near station 21 where again the clay increases
It is also evident that there is much redundant information this is be-cause an increase in one particle-size fraction must be compensated by adecrease in one of the others Further the sand and clay contents in thethree horizons at each station are similar We can check the redundancyand reduce the number of variables with principal components analysis(PCA) Although all the variables are in the same units of measure usingstandardized components gives equal weight to all variables regardlessof their ranges
Task 3 Compute the standardized principal components of the sixvariables and plot the scores of the first two PCs along the transect bullgt pc lt- prcomp(transect center=T scale=T retx=T)$xgt plot(pc[1] type=l col=blue+ main=PCs of particle-size fraction along transect+ xlab=station ylab=PC score+ sub=solid PC1 dashed PC2+ ylim=c(-45 45))gt lines(pc[2] type=l lty=2 col=blue)gt abline(h=0)
2
0 10 20 30 40
minus4
minus2
02
4
PCs of particleminussize fraction along transect
solid PC1 dashed PC2station
PC
sco
re
The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect
For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences
2 Split moving-window
The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as
D2 = (xs minus xd)TWminus1(xs minus xd) (1)
where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2
1 Variables with lower pooled within-half residual variances are givenmore weight
2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)
2 See numeric example in Appendix D
3
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
0 10 20 30 40
minus4
minus2
02
4
PCs of particleminussize fraction along transect
solid PC1 dashed PC2station
PC
sco
re
The graph of the PCs shows that much more variance is captured by thefirst PC than by the second since the PC scores are much larger for PC1than PC2 The first PC has many sharp peaks but there is nonetheless aclear trend from negative to positive scores moving along the transect
For both the original variables and the PCs the question is where toput boundaries in particular which algorithm will find the ones that webelieve correspond to real soil type differences
2 Split moving-window
The split moving-window (SMW) approach computes the contrast be-tween two halves of a window as it moves along the transect and re-ports this contrast The higher the contrast the more likely it is thatthe station at which the window is split is a soil boundary The contrastis measured by the squared Mahalanobis distance between halves of thewindow This is defined as
D2 = (xs minus xd)TWminus1(xs minus xd) (1)
where xs is the average vector of the left half xd is is the average vectorof the right half and W is the pooled within-half variance-covariance ma-trix after correcting for the respective means The squared differencesbetween the mean vectors of each window half are thus corrected in twoways2
1 Variables with lower pooled within-half residual variances are givenmore weight
2 Two variables with positively-correlated residuals are not ldquodouble-countedrdquo that is a difference with the same sign in both axes iscorrected downwards by contrast a difference with opposite signis emphasized (because it is unexpected)
2 See numeric example in Appendix D
3
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
The variables are weighted by their discriminating power in the particu-lar window this changes as the window moves
If W is replaced by the identity matrix I this is equivalent to the squaredEuclidean distance
E2 = (xs minus xd)T I(xs minus xd) (2)
where all axes are weighted equally and there is no correction for dis-criminating power in the window either for different residual varianceor for covariances
Note If the pooled within-half variance-covariance matrix W is singularit can not be inverted and thus the Mahalanobis distance is not definedhowever the Euclidean distance is zero and it makes sense to also makethe Mahalanobis distance zero
21 Functions for the SMW method
I have written three R functions to implement the SMW method they arecontained in file smwR and loaded with the source function The sourcecode is also shown in Appendix sectBgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
These must be called in the following order
1 smwpc (optional) extract PCs
2 smwdw SMW analysis of PCs or original variables
3 smwgraph visualise results of SMW analysis
22 Options for the SMW method
The analyst must make several important choices that have a large effecton the presumed boundaries
1 Whether to use the original variables or some number of their stan-dardized principal components
2 Whether to use Euclidean or Mahalanobis distances
3 How wide a window to use
4 Window parity ie whether to use an odd or even window
5 Whether to use a ldquomullionrdquo ie leave out some stations near themiddle of the window to allow for a more diffuse boundary
6 What distance in feature space (as a proportion of the maximumin the whole transect) probably signals a boundary and should bereported
4
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
221 Standardized principal components
Function smwpc computes the standardized principal components fromthe correlation matrix3 and replaces the original variables with one ormore synthetic variables that are uncorrelated and arranged in descend-ing order of the amount of variance of the original data that they explainThis corrects for different measurement scales and also for correlationbetween variables (redundancy) The default is to extract two compo-nents this can be over-ridden with the npc named argumentgt pc lt- smwpc(transect)
[1] PCA selected components explain 0923 of the variance[1] Loadings for the first 2 components
PC1 PC2sandA 04061 047370clayA -03968 -062764sandB 04133 -025254clayB -04149 002935sandC 04095 -036782clayC -04087 042631
gt str(pc)
dataframe 42 obs of 2 variables$ PC1 num -2019 -1757 -0401 -0782 -4039 $ PC2 num -013327 -068999 -008597 -000673 003252
Here we see that the two PCs explain most of the variance
The PCs can be interpreted by examining the loadings These revealwhich soil variables on this transect are combined into each componentInterpretation in this case is easy The first component is associated withhigh sand (positive loading) and low clay (negative loading) ie withcoarser textures throughout the profile the second component is asso-ciated with higher sand in the topsoil than subsoil and the reverse forclay In other words the first component is the overall texture and thesecond is textural contrast
The structure returned by smwpc is a data frame with the principal com-ponent scores for each station on the transect these replace the originalvariables
222 Original variables
Distances can be computed in the multivariate space of the original vari-ables rather than in the space of their PCs To compute the Euclideandistances between original variables these must be scaled corrected in-dividually to zero mean and unit variance computation of Mahalanobisdistances implicitly scales them A data frame can be scaled with thescale function the result must be converted back into a data framegt trscale lt- dataframe(scale(transect))gt str(trscale)
dataframe 42 obs of 6 variables$ sandA num -096 -0882 -0104 -0181 -1031
3 rather than the variance-covariance matrix
5
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
$ clayA num 05823 08658 00153 03933 21888 $ sandB num -1341 -1341 -0722 -0309 -1002 $ clayB num 1228 1496 0692 0424 1689 $ sandC num -0616 0214 0214 -0409 -1936 $ clayC num 02005 -00768 -03542 02005 20642
Note all the variables have zero mean and unit variance and the covari-ance matrix is a correlation matrixgt round(apply(trscale 2 mean) 4)
sandA clayA sandB clayB sandC clayC0 0 0 0 0 0
gt var(trscale)
sandA clayA sandB clayB sandC clayCsandA 10000 -09016 08317 -08175 08237 -07670clayA -09016 10000 -07596 08417 -07541 07697sandB 08317 -07596 10000 -09064 08661 -08600clayB -08175 08417 -09064 10000 -08221 08581sandC 08237 -07541 08661 -08221 10000 -09129clayC -07670 07697 -08600 08581 -09129 10000
223 Window width
The narrower the window the narrower the soil unit that can be distin-guished but the more likely are spurious boundaries (and more likelythat the system will be computationally singular) Webster [4] recom-mends 23 of the expected distance between boundaries in this case theexpected width of a soil unit across the transect
This can be estimated by auto-correlation of one or more variables usu-ally the first PC By default the acf (ldquoAuto- and Cross-Covariance and-Correlation Function Estimationrdquo) function of the default stats pack-age displays a graph of autocorrelation by lag the results can also beprinted on the consolegt acf(pc[ 1] main = Serial auto-correlation PC1)gt print(acf(pc[ 1] plot = FALSE))
Autocorrelations of series pc[ 1] by lag
0 1 2 3 4 5 6 7 8 91000 0687 0637 0531 0422 0440 0310 0318 0300 0331
10 11 12 13 14 15 160182 0171 0027 -0039 -0055 -0200 -0131
6
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Serial autominuscorrelation PC1
The auto-correlation decreases to zero near lag 12 suggesting that thereare about three boundaries in this 42-station transect the suggested win-dow size is then 8
If the named argument wid is not supplied to smwdw by the user it iscomputed as a quarter of the total length of the transect
224 Window parity
If the window width n is even the transect is split at station l + n2where l is the left-most station of the window and any inferred boundaryis there ie neither half of the window uses the data from that stationFor example with window width 8 the first window (at the beginningof the transect) is from station 1 9 the middle of which is station5The left half of the split window includes stations 1 4 and the righthalf 6 9 station 5 is the potential boundary If the window width nis odd the transect is split half-way between stations l + (n minus 1)2 andl+ (n+ 1)2 For examplewith window width 7 the first window (at thebeginning of the transect) is from station 1 8 The left half of the splitwindow includes stations 1 4 and the right half 5 8 the potentialboundary is between stations 4 and 5 This is reported as station 4 butis actually located at position 45
225 Mullion
Especially for wide windows we may accept a more diffuse boundary infact the lsquoboundaryrsquo we are looking for could be identified as a separatesoil unit if a narrower window were used In this case we may increasethe discriminating power of the method by leaving out some observa-tions on either side of the potential boundary this is the so-called ldquomul-lionrdquo In this implementation of SMW the user-specified mullion (if any)applies to both halves of the window ie it is the number of stationsto omit The default for the mull named argument to smwdw is 0 note
7
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
that the middle station is omitted from an even transect in any case Itdoesnrsquot make sense for the mullion to be more than 14 of the windowhalf width and the program checks for this In the present example witha suggested window width of 8 the mullion can be 0 or 1
226 What Mahanalobis distance to report
The smwdw function returns the distances for each possible boundaryhowever in its printed summary it only reports those above a certainthreshold which is a proportion of the maximum feature-space distancefound in the transect By default this is 025 to see more boundaries setto a lower value with the pmmd named argument eg pmmd=01 Thisdoes not affect the calculation only the printed output
23 Analyzing the example transect
The core of the SMW analysis is performed by the smwdw function Thishas as its first argument the data frame representing the transect (prob-ably transformed to PCs) and optional arguments for the window widthmullion and boundary sensitivity
231 Using PCs
The SMW method is often applied to the first few PCs since these holdmost of the information about the soil properties in compressed formHowever they are not guaranteed to be the most discriminating
Task 4 Analyze the transect using the first two standardized PCs anda window width of 8 bullgt d lt- smwdw(pc wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 254892 14 249233 7 162184 25 155645 6 79786 24 77887 16 75288 26 67749 38 6584
There are three main suggested boundaries at stations 15 7 and 25these are somewhat diffuse because the alternative adjacent stations(14 or 16) 6 and (24 or 26) are also identified A less likely boundary isfound towards the right of the transect at station 38
Task 5 Analyze the transect using the first two standardized PCs anda window width of 7 ie an odd window size one unit smaller bull
8
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
Using the next smaller odd window width raises the absolute distancesand sharpens the same three boundaries here found at positions 65145 and 245 (rather than at stations 7 15 and 25 with window width8) Station 38 is not found at this reporting threshold
Task 6 Analyze the transect using the first two standardized PCs anda window width of 10 ie slightly wider bullgt d lt- smwdw(pc wid = 10)
[1] Window 10 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 15 280242 14 143563 24 101964 25 88235 16 82626 18 78057 26 7409
Widening the window misses the boundaries near stations 7 and 38 (be-cause they are closer than the half-width to the ends) but otherwiseagrees with the previous results
Task 7 Analyze the transect using the first two standardized PCs awindow width of 10 and a one-unit mullion bullgt d lt- smwdw(pc wid = 10 mull = 1)
[1] Window 10 stations mullion 1[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 16 313192 15 221263 26 177744 17 176775 14 156826 13 114227 24 95068 25 8639
Adding a mullion to this wider window finds the same boundary manytimes suggesting that the sharper analysis without a mullion was moreappropriate
Using a narrow window changes the picture radically
Task 8 Analyze the transect using the first two standardized PCs anda window width of 6 ie narrower bullgt d lt- smwdw(pc wid = 6)
[1] Window 6 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 6 4764
9
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
The narrow window finds a very sharp contrast at station 6 which over-whelms all the others The boundary near the right of the transect (39)is now prominent rather than secondary Stations 25 and 14 (and theirneighbours) are also found As the window gets smaller the results tendto be more erratic and indeed the system may be computationally singu-lar at some window positions
Task 9 Analyze the transect using only the first standardized PCs andthe recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC1) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 14 206992 15 109723 25 94034 24 5239
With this suggested window width (8) using only one PC lowers the ab-solute distances a bit and finds the boundaries at station 14 (shiftedfrom 15 as suggested with two PCs) and 25 but misses the boundarynear station 7 This latter boundary must therefore be associated withchanges in PC2 (textural contrast) rather than PC1 (overall texture)
Task 10 Analyze the transect using only the second standardized PCsand the recommended window width of 8 bullgt d lt- smwdw(asdataframe(pc$PC2) wid = 8)
[1] Window 8 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary at stationstation D2
1 25 88412 7 70663 6 65704 18 52695 17 48186 24 45757 26 39768 23 32569 19 2329
And indeed looking for boundaries with PC2 only we see station 25 (with23 24 and 26) but then station 7 (with 6) a new boundary is suggestedat station 18 (with 17 and 19) which may also be associated with texturalcontrast
Notes on use of the Mahalonobis distance with principal componentsThe Mahalonobis distance uses the covariance matrix of the pooled ob-servations (corrected for their means) in both halves This would bediag(npc) if there were only one window (since there is by definitionno correlation between PCs) with the variances on the diagonal pro-
10
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
portional to the eigenvalues within the selected set The PCs would beweighted according to their importance However for each pooled win-dow this will be somewhat different the off-diagonals will usually notbe zero (there may be correlation between the PCs for just these obser-vations) and the proportions will be different from that for the wholetransect
232 Using original variables
The SMW method can also be applied to original variables either un-scaled or scaled A disadvantage is that more variables are needed tocapture the same information as the PCs so that wider windows areusually needed to avoid ill-conditioned covariance matrices at some win-dow positions For Mahalanobis distances unscaled and scaled variablesgive the same
Task 11 Analyze the transect using the original variables and withthe first two PCs both using a window size 9 bull
Note Using all six PCs would give exactly the same result as originalvariables since the information content is the same
gt d lt- smwdw(transect wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 6 298922 24 208503 25 160124 9 113355 35 108066 14 105037 28 101518 17 7756
gt d lt- smwdw(pc wid = 9)
[1] Window 9 stations mullion 0[1] Distances up to 025 of the maximum[1] Boundary to the right of stationstation D2
1 14 229532 15 98113 24 83184 17 75715 25 7160
Both methods suggest a boundary near stations 24 and 25 but using theoriginal variables suggests boundaries near stations 6 9 and 35 as wellwhereas using the first two PCs suggests a boundary near station 14 thisis also found with the original variables but with lower priority
11
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
24 Visualizing the boundaries
The results returned by smwdw can be visualized with the smwgraphfunction
Task 12 Display the boundaries for the analysis using two PCs and awindow size of 9 as computed just above bullgt smwgraph(d)
0 10 20 30 40
05
1015
20
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
We can compare various approaches visually by writing a small scriptand putting the graphs in one frame this repeats the examples of sect23with a few more variations The following code can be entered at the Rcommand line but is more easily placed in a file and loaded with source
The results are quite variable most approaches find the boundary nearpositions 14ndash16 but some miss this entirely again the boundary near po-sition 24ndash26 is often found as the second-most important but not alwaysThe extreme case is given by window width 6 two PCs which only findsa boundary at position 6 using a window width 7 with these two PCs stillhas this as the most important but does find positions 14 and 24 Thisexample shows the importance of (1) selecting relevant variables whosetransition across the transect should be used to defined boundaries (2)deciding whether to combine these as standardized PCs or use originalvariables (3) selecting a window width (and eventually a mullion) whichcorresponds to the width of the boundary in nature
12
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
gt par(mfrow=c(43))gt d lt- smwdw(pc wid=8) smwgraph(d text=window8 PCs 2)gt d lt- smwdw(asdataframe(pc$PC1) wid=8) smwgraph(d text=window8 PC 1 only)gt d lt- smwdw(asdataframe(pc$PC2) wid=8) smwgraph(d text=window8 PC 2 only)gt d lt- smwdw(pc wid=9) smwgraph(d text=window9 PCs 2)gt d lt- smwdw(pc wid=10) smwgraph(d text=window10 PCs 2)gt d lt- smwdw(pc wid=10 mull=1) smwgraph(d text=window10 mullion1 PCs 2)gt d lt- smwdw(pc wid=6) smwgraph(d text=window6 PCs 2)gt d lt- smwdw(pc wid=7) smwgraph(d text=window7 PCs 2)gt d lt- smwdw(pc wid=11) smwgraph(d text=window11 PCs 2)gt d lt- smwdw(pc wid=9 ident=T) smwgraph(d text=window9 PCs 2 Euclidean)gt d lt- smwdw(transect wid=9) smwgraph(d text=window9 Original vars 6)gt d lt- smwdw(transect wid=9 ident=T) smwgraph(d text=window9 Original vars 6 Euclidean)gt par(mfrow=c(11))
0 10 20 30 40
05
1015
2025
Boundaries
window8 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
7
1415
16 24
25
26 38
0 10 20 30 40
05
1015
20
Boundaries
window8 PC 1 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
14
15
24
25
0 10 20 30 40
02
46
8
Boundaries
window8 PC 2 onlystation
squa
red
dist
ance
bet
wee
n ha
lves
67
1718
19
23
24
25
26
0 10 20 30 40
05
1015
20
Boundaries
window9 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17 2425
0 10 20 30 40
05
1015
2025
Boundaries
window10 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
16 1824
2526
0 10 20 30 40
05
1015
2025
30
Boundaries
window10 mullion1 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
13
14
15
16
17
2425
26
0 10 20 30 40
010
020
030
040
0
Boundaries
window6 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
0 10 20 30 40
010
2030
40
Boundaries
window7 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
6
14
24
0 10 20 30 40
05
1015
Boundaries
window11 PCs 2station
squa
red
dist
ance
bet
wee
n ha
lves
14
15
17
23
24
25
0 10 20 30 40
05
1015
Boundaries
window9 PCs 2 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
2425
26
0 10 20 30 40
050
100
200
300
Boundaries
window9 Original vars 6station
squa
red
dist
ance
bet
wee
n ha
lves
6
9 1417
24
25
28 35
0 10 20 30 40
010
0000
2000
0030
0000
Boundaries
window9 Original vars 6 Euclideanstation
squa
red
dist
ance
bet
wee
n ha
lves
12
13
14
15
23
24
25
26
13
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
3 When SMW fails to find a solution
The core of the SMW algorithm is the between-halves distance calcula-tion this is always possible using Euclidean distance (ie a diagonalvariance-covariance matrix as in Equation 2) but if using Mahalanobisdistance (as in Equation 1) with non-zero off-diagonals the matrix maybe singular In that case the smwdw function will report something like
Position 3 Singular pooled covariance matrix not a boundary
This implies a non-stable analytical context Among the causes are
bull the window is too narrow so the two halves are in fact quite simi-lar after subtracting the respective window half means the pooledcovariance matrix may have diagonal elements close to zero
bull the chosen variables do not in fact differ much at the window widthie the variables are not discriminating This is quite likely forhigher PCs which can be pure noise (explaining none of the truevariability The solution here is to only use the first few PCs (theones contributing most to the variance) if using original variablesuse only those that show clear boundaries not those that seem tofluctuate randomly
Both of these situations can be anticipated by examining the structure ofthe local spatial correlation with acf We have already seen how to useit to set an appropriate window width If the autocorrelation is alreadylow at short lags this is evidence that the variable has no structure andany boundaries that the SMW algorithms may find are an artefact and donot represent real boundaries
Task 13 Compute and display the autocorrelation of a uniform ran-dom variable along a transect of length 42 bull
The runif function returns a vector of uniformly-distributed randomnumbers on [0 1]
Note We use setseed so your results will be identical to these notesin practice you would not use this
gt setseed(321)gt acf(runif(42) main = Autocorrelation of a uniform random variable)gt acf(runif(42) plot = FALSE)
Autocorrelations of series runif(42) by lag
0 1 2 3 4 5 6 7 8 91000 0064 -0169 -0001 0098 -0179 -0242 0144 0127 0097
10 11 12 13 14 15 16-0036 0122 0111 -0197 -0292 0000 0040
14
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
0 5 10 15
minus0
20
00
20
40
60
81
0
Lag
AC
F
Autocorrelation of a uniform random variable
Note that even at the first lag the autcorrelation is almost zero and wellwithin the confidence limits around 0 (shown by the dashed blue lines)
Task 14 Build a dataframe with two uniformly-random variables andplot these along the transect bullgt tr lt- dataframe(x=runif(42) y=runif(42))gt trpc lt- smwpc(tr)
[1] PCA selected components explain 1 of the variance[1] Loadings for the first 2 components
PC1 PC2x -07071 07071y 07071 07071
gt plot(trpc$PC1 type=b+ main=+ xlab=station ylab=+ sub=black PC1 blue PC2)gt lines(trpc$PC2 type=b col=blue)gt abline(h=0 lty=2)
15
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
0 10 20 30 40
minus2
minus1
01
2
black PC1 blue PC2station
There is evidently no structure However maybe the automatic proce-dure will find structure which arose purely by chance
Task 15 Attempt to find its boundaries bullgt d lt- smwdw(trpc wid = 8)gt smwgraph(d)
0 10 20 30 40
010
2030
40
Boundaries
station
squa
red
dist
ance
bet
wee
n ha
lves
12
1415
16
17
1920 29
This illustrates the danger of relying on automatic procedures
4 Maximum Level Variance
The Maximum Level Variance (MLV) approach [4] based on the workof Hawkins and Merriam [2] considers the whole transect together andlooks for the best way to divide it into a user-specified number of seg-ments so that the pooled within-segment variance is as small as pos-sible The advantage of MLV is that it can not be ldquotunedrdquo with windowwidth and mullion only one classification can be found In addition itwill find diffuse boundaries if they otherwise separate very contrastingzones whereas SMW must be given a wide enough window and possibly a
16
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
mullion to find these MLV can work on either Euclidean or Mahalanobisdistances and the classification can be stopped at any number of groupsthat is the user can specify the number of expected boundaries
I have not yet implemented this the mathematics are presented in theappendix to Webster [4]
References
[1] J C Davis Statistics and data analysis in geology John Wiley amp SonsNew York 3rd edition 2002 ISBN 0-471-17275-8 1
[2] D M Hawkins and D F Merriam Zonation of multivariate sequencesof digitized geologic data Mathematical Geology 6263ndash269 197416
[3] R Webster Automatic soil-boundary location from transect dataMathematical Geology 527ndash37 1973 1
[4] R Webster Optimally partitioning soil transects Journal of Soil Sci-ence 29388ndash402 1978 1 6 16 17
[5] R Webster DIVIDE a FORTRAN IV program for segmenting multi-variate one-dimensional spatial series Computers amp Geosciences 6(1)61ndash68 1980
[6] R Webster and I F T Wong A numerical procedure for testing soilboundaries interpreted from air photographs Photogrammetria 2459ndash72 1969 1
[7] Yihui Xie knitr Elegant flexible and fast dynamic report generationwith R 2011 URL httpyihuinameknitr 1
17
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
A Example transect
sandAclayAsandBclayBsandCclayC4382428313493637199482692386644936316433493640532616022053356046216444493630532616021719954935260885160227199516022386642802471984040242932270746265498693155448269405323826937199427133719948269371996160230532460473386554935238665826823866449363164354935238665826823866271584830934936405322827438652827416432493637199282740532360473608724936371992827405324049133865582683386544936238663373639598337363959840402362656160226088649352053361602205336715720533482692053368268305325853521199749341053371601138667493411644716011386661602238665937921644416023053241602271996853514533585352786668535145333853621199252033453228536445325520217866485352119938536345326186878665520224532518692786663735229324613527199570682293278534786675201112785341786655202145336853511268535112685351126853578666853511255202211997186845337186845337186811271868786671868786675201786671868786671868786678534453385201128520112618681453365201112585351453365201112718681127186811278534453381867128186712585351126186811261868112764011026676401102668520181337520178666853511268535112752017866718681127186811268535145336853511268535112585352453261868178665853517866552022453251869278665186927866585351786658535211996186821199
18
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
B Split moving-window functionsgt source(smwR)gt ls(pattern = smw)
[1] smwdw smwgraph smwpc
B1 PCA
gt print(smwpc)
function(transect npc=2)
cant ask for more PCs than variablesnpc lt- min(npc length(transect[1]))
standardized PCs with scorespc lt- prcomp(transect center=T scale=T retx=T)
frame to return with named synthetic variablespcs lt- asdataframe(pc$x[1npc])
print some diagnosticsprint(paste(PCA selected component
ifelse(npc==1s) explainifelse(npc==1s )round(sum((pc$sdev^2(sum(pc$sdev^2)))[1npc])3) of the variance sep=))
print(paste(Loadings for the firstifelse(npc==1paste(npc)) componentifelse(npc==1s) sep=))
print(pc$rotation[1npc]) return frame of synthetic variables (PC scores)
return(pcs)
B2 Find boundaries
gt print(smwdw)
function(tsect wid=floor(length(tsect[1])4) mull=0 pmmd=25ident=FALSE)
silently correct unreasonable arguments
mull=floor(mull) wid=floor(wid)pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)
check if arguments make senseif (mull lt 0) print(Error mullion must be positive)return(NULL)
if (mull gt (wid8)) print(Error mullion must be less than 14 of the half-window width)return(NULL)
length of sequencelen lt- dim(tsect)[1]if ((wid lt 0) || (wid gt floor(len2))) print(paste(
Error width must be positive and less than 12 of transect lengthlen))
return(NULL) number of variables
nvar lt- dim(tsect)[2] offset from left to possible boundary for odd widths this is the station to its left
boffset lt- floor(wid2) half width of interval maybe mullioned remove one (common) point for even widths
hwid lt- boffset - ((wid2)) - mull collect ds with their centre
19
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
both ends will be 0 (not wide enough for a window)d lt- vector(mode=numeric length=len)
move left boundary of window along transectfor (l in 1(len-wid))
right boundary of windowr lt- l + wid
option Mahalanobis distance extract halves
winl lt- tsect[l(l+hwid)] winr lt- tsect[(r-hwid)r] compute column-wise means or one mean for a vector
if (isnull(dim(winl))) winlmean lt- mean(winl) winrmean lt- mean(winr)
else
winlmean lt- apply(winl 2 mean)winrmean lt- apply(winr 2 mean)
if (ident==F)
pool halves after subtracting meanspool lt- (rbind(t(t(winl) - winlmean) t(t(winr) - winrmean)))
store D^2 by boundary locationtmp lt- try(d[l + boffset] lt-
mahalanobis(winlmean winrmean cov(pool))silent=TRUE)
if (class(tmp) == try-error) d[l + boffset] lt- 0print(paste(Position l+hwid
Singular pooled covariance matrix not a boundarysep=))
option Euclidean distanceelse
d[l + boffset] lt- mahalanobis(winlmean winrmean diag(nvar))
sort in decreasing order of probable boundaryds lt- sort(d index=T decreasing=T)ds lt- dataframe(station=ds$ix[ds$xgt0] D2=ds$x[ds$xgt0])print(paste(Windowwidstations mullion mull))print(paste(Distances up to pmmd of the maximum))print(paste(Boundaryifelse((wid2)atto the right of)station))print(ds[ds$D2 gt= (ds$D2[1]pmmd)])return(dataframe(station=seq(1length(d)) D2=d))
Notes on this code
1 The expression if (isnull(dim(winl))) tests whether the win-dow half is a two-dimensional array (in which case the expressionis FALSE) or a one-dimensional vector Vectors do not have dimen-sion attributes The mean function can only be applied directly to avector for a multi-dimensional array the mean of each dimensionmust be found by using the apply function to compute the meanacross the second dimension (ie column-wise) of the array
2 The window half vector or array is transposed with the t functionto make it conformable with the window half mean then the dif-ference between these is again transposed to row-wise in the formexpected by cov
3 The try function surrounds code which may cause an error andwhich the programmer wants to handle In this case it traps er-
20
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
rors caused by the attempt to invert a singular pooled within-halfcovariance matrix with mahalanobis If there is such an error asdetected by the class function the code sets the between-half dis-tance to zero see explanation at Equation 2 In addition a warningmessage is printed explaining what happened
B3 Plot the transect
gt print(smwgraph)
function (ds pmmd=25 text= vcol=blue ymax=NULL) sanity check on proportion of maximum to call a boundary
pmmd lt- min(1 pmmd) pmmd lt- max(005 pmmd)if (devcur() gt1)
the transectylim lt- (if (isnull(ymax)) NULL else c(0 ymax))plot(ds xlab=station ylab=squared distance between halves
type=l main=Boundaries sub=text ylim = ylim)abline(h=0 lty=2)
main boundaries -- compare to imposed maximum if any otherwise within-transect maximum
dsig lt- ds[ds$D2 gt ifelse(isnull(ymax) max(ds$D2) ymax)pmmd]for (i in 1length(dsig$D2)) s lt- dsig$station[i] d2 lt- dsig$D2[i]lines(c(s s) c(0 d2) col=vcol)text(s d2 s pos=4 col=vcol)
21
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
C Insight into Mahalanobis distance
This appendix is to give a feeling for the Mahalanobis distance as op-posed to Euclidean distance and how it is affected by the variance-covariance structure of a window
We compare two windows from a scaled and centred matrix computedfrom the trees example dataset provided in the default datasets pack-age
Note This dataset provides measurements of the girth (diameter atbreast height) height and volume of timber in 31 felled black cherry treessee trees It is sorted in order of girth so forms a sequence
We consider the distance between the windows in bivariate space formedby the girth and height
Task 16 Load the data scale and centre it compute the covariance fortwo of variables and then extract two sequences of four observations(1) trees 1 4 and (2) trees 5 8 as window halves bullgt data(trees)gt test lt- asdataframe(scale(trees))gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
gt (left lt- test[14 12])
Girth Height1 -15769 -094162 -14813 -172643 -14175 -204024 -08758 -06278
gt (right lt- test[58 12])
Girth Height5 -08121 078476 -07802 109867 -07165 -156948 -07165 -01569
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt lmean - rmean
Girth Height-05816 -13732
Over the whole sequence the variables both have unit variance (becauseof scaling) and moderate positive correlation (r = 052) There seems tobe a good contrast between the windows because of the varying heights
The Euclidean distance is simply the RMS of the difference this can alsobe computed by the mahalanobis function using an identity matrix I inplace of the variance-covariance matrix W
22
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
gt sqrt(sum((lmean - rmean)^2))
[1] 1491
gt sqrt(mahalanobis(lmean rmean diag(2)))
[1] 1491
Euclidean distance does not weight the variables by their discriminatingpower In this case we see that the height shows more difference betweenthe two sequences than does the girth For this we use the pooled within-half variance-covariance matrix W which takes into account the varianceof each variable (lower is more discriminating) and their covariance
Task 17 Compute the pooled within-half variance-covariance matrixW bullgt note the transposition to compute column meansgt (leftresid lt- t(left) - lmean)
1 2 3 4Girth -02390 -01434 -007967 04621Height 03924 -03924 -070624 07062
gt (rightresid lt- t(right) - rmean)
5 6 7 8Girth -005577 -00239 003983 003983Height 074547 10594 -160865 -019618
gt (resid lt- (rbind(t(leftresid) t(rightresid))))
Girth Height1 -023900 039242 -014340 -039243 -007967 -070624 046206 070625 -005577 074556 -002390 105947 003983 -160868 003983 -01962
gt apply(resid 2 sd)
Girth Height02085 08952
gt (W lt- cov(resid))
Girth HeightGirth 004348 002947Height 002947 080137
gt var(test[ 12])
Girth HeightGirth 10000 05193Height 05193 10000
So for this particular window after subtracting the respective mean fromeach half the first scaled variable has much less residual variance thanthe second that is observations of the first variable are closer in eachhalf to that halfrsquos mean than is the case for the second variable The
23
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
whole-sequence residual variance is of course 1 for each variable this isthe same as the variance because each variable has zero mean from thescaling
This shows that the variance-covariance matrix of a window can varygreatly from that of the whole sequence In this case variances are 0043and 0801 respectively for the two variables rather than 1 for the wholesequence) and the covariance is 0029 (instead of 0519 for the wholesequence) The two residuals are weakly correlated (ie girth is not agood predictor of height and vice-versa) so this will hardly affect thedistance
We can now examine the contribution of each variable to the distance
Task 18 Compute the distance step-by-step according to the definitionof Mahalanobis distance then with the direct call to mahalanobis bullgt (diff lt- lmean - rmean)
Girth Height-05816 -13732
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 235855 -08674Height -08674 12798
gt Winv[1 1]Winv[2 2]
Girth1843
gt (tmp lt- drop(Winv diff))
Girth Height-12525 -1253
gt note drop removes redundant dimensions from arraysgt (dsquared lt- drop(diff tmp))
[1] 9005
gt sqrt(dsquared)
[1] 3001
gt sqrt(mahalanobis(lmean rmean W))
[1] 3001
Notice that the first scaled variable has a much higher weight (over 18times) than the second This is because a variable with low variance inthe residuals has higher discriminating power in the mean The residualsof the two variables were slightly positively correlated this lowers thedistance slightly because of the negative entries in the inverse The finalresult is about twice the Euclidean distance mainly because of the largediscriminating power (low residual variance) of the first variable
24
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
Task 19 Extract another two sequences of four observations (1) trees22 25 and (2) treess 26 29 as window halves and compute theirMahalanobis distance bull
These windows turn out to have higher variances (because the trees arelarger in this section of the dataframe) and a slight negative correlationbetween the residuals The negative correlation increases the weight ofboth variables because it becomes positive in the inverse Both variablesdiscriminate well but the first has about three times the weight as thesecondgt (left lt- test[2225 12])
Girth Height22 03032 0627823 03988 -0313924 08768 -0627825 09724 01569
gt (right lt- test[2629 12])
Girth Height26 1291 0784727 1355 0941628 1482 0627829 1514 06278
gt lmean lt- apply(left 2 mean)gt rmean lt- apply(right 2 mean)gt (diff lt- lmean - rmean)
Girth Height-07728 -07847
gt sqrt(sum((lmean - rmean)^2))
[1] 1101
gt leftresid lt- t(left) - lmeangt rightresid lt- t(right) - rmeangt resid lt- (rbind(t(leftresid) t(rightresid)))gt apply(resid 2 sd)
Girth Height02303 03728
gt (W lt- cov(resid))
Girth HeightGirth 005306 -00384Height -003840 01390
gt (Winv lt- solve(W diag(2)))
[1] [2]Girth 23559 6509Height 6509 8993
gt Winv[1 1]Winv[2 2]
Girth262
gt tmp lt- drop(Winv diff)gt (dsquared lt- drop(diff tmp))
25
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
[1] 275
gt sqrt(dsquared)
[1] 5244
gt sqrt(mahalanobis(lmean rmean W))
[1] 5244
In this case the Euclidean distance 1101 is increased dramatically tothe Mahalanobis distance 5244
26
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27
Index of R Concepts
acf 6 14apply 21
class 22cov 21
datasets package 23
knitr package 1
mahalanobis 21 23 25mean 21
runif 14
scale 5setseed 14source 4 12stats package 6
t 21trees dataset 23try 21
27