Multivariate Data Analysis: a survey of data reduction and data
association techniques
Principal Components Analysis
For example:
• Data reduction approaches
– Cluster analysis
– Principal components analysis
– Principal coordinates analysis
– Multidimensional scaling
• Hypothesis testing approaches
– Discriminant analysis
– MANOVA
– ANOSIM
– Canonical correlation
– PERMANOVA
Objects
• Things we wish to compare
– sampling or experimental units
– e.g. quadrats, animals, plants, cages etc.
Variables
• Characteristics measured from each
object
– usually continuous variables
– e.g. counts of species, size of body parts
etc.
Ecological data
• Objects:
– sampling units (SU’s, e.g. quadrats, plots
etc.)
• Variables:
– species abundances and/or environmental
data
• Common in community ecology
Wisconsin forests (Peet & Loucks 1977)
• Plots (quadrats) in Wisconsin forests
• Number of individuals of each species
of tree recorded in each quadrat
• Objects:
– quadrats
• Variables:
– abundances of each tree species
Plot  Bur oak  Black oak  White oak  Red oak  etc.
1     9        8          5          3
2     8        9          4          4
3     3        8          9          0
4     5        7          9          6
5     6        0          7          9
6     0        0          7          8
etc.
Data
Garroch Head dumping ground
(Clarke & Ainsworth 1993)
• Sewage sludge dumping ground in bay
• Transect across dumping ground
• Core of mud at each of 10 stations along transect
• Objects:
– stations
• Variables:
– metal concentrations in ppm
Station  Cu   Mn    Co  Ni  Zn   Cd   etc.
1        26   2470  14  34  160  0
2        30   1170  15  32  156  0.2
3        37   394   12  38  182  0.2
4        74   349   12  41  227  0.5
5        115  317   10  37  329  2.2
etc.
Data
Morphological data
• Objects:
– usually organisms or specimens
• Variables:
– morphological measurements
Morphological data
• Morphological variation between dog
species/types
• Objects:
– dog types (7)
• Variables:
– sizes of 6 different parts of mandible
– mandible breadth, mandible height, etc.
                 Variable
Dog type          1     2     3     4     5     6
Modern dog        9.7   21.0  19.4  7.7   32.0  36.5
Jackal            8.1   16.7  18.3  7.0   30.3  32.9
Chinese wolf      13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf       11.5  24.3  24.5  9.3   40.0  44.6
Cuon              10.7  23.5  21.4  8.5   28.8  37.6
Dingo             9.6   22.6  21.1  8.3   34.4  43.1
Prehistoric dog   10.3  22.1  19.1  8.1   32.3  25.0
Data
Presentation of Multivariate Data
• Hard to visualize complex multivariate datasets
(more than 3 dimensions)
– For example, how do you visualize the 7 attributes
of a dog skull?
• Easier to visualize relationships between
objects (e.g. similarity, dissimilarity,
correlation, scaled distance)
Presentation of Multivariate Data
[Diagram: a raw data matrix (objects O1 . . Op by variables V1 . . Vn) is
converted to a resemblance matrix (O1 . . Op by O1 . . Op), created using
correlations, covariances or dissimilarity indices, which is then used for
ordination or classification]
Principal Components Analysis
• Aims to reduce a large number of variables to a smaller
number of summary variables, called Principal
Components (or factors), that explain most of the
variation in the data.
• Is basically a rotation of axes after centering to the
means of the variables, the rotated axes being the
Principal Components.
• Is usually carried out using a matrix algebra
technique called eigenanalysis.
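The centre-then-rotate idea can be sketched numerically; this is a minimal illustration (the data values below are made up for demonstration), assuming NumPy is available:

```python
import numpy as np

# Illustrative toy data: 5 objects (rows) x 3 variables (columns);
# the numbers are made up for demonstration only
X = np.array([[2.0,  4.1, 1.0],
              [3.1,  5.9, 0.8],
              [4.2,  8.1, 1.3],
              [5.0,  9.8, 0.9],
              [6.1, 12.2, 1.1]])

# Centre each variable on its mean, then eigenanalysis of the
# covariance matrix gives the rotated axes (the Principal Components)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
order = np.argsort(eigenvalues)[::-1]             # largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Scores position each object on the rotated axes
scores = Xc @ eigenvectors
```

The columns of `eigenvectors` are the rotated axes, the eigenvalues measure the variation each component explains (their sum equals the total variance), and `scores` positions each object on the components.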
Regression
Ordinary least squares (OLS) estimation allows the best prediction of Y
given X (minimizes the distance to the line in the y direction)
[Figure: least squares regression line of Y on X, showing an observed
value yi at xi, the predicted value ŷi on the line, and the residual yi − ŷi]
PCA: association among variables (minimizes the
distance to the line in both the x and y directions)
[Figure: scatter of Y against X with the first principal component line;
an observed point (xi, yi) is marked]
[Figure: the same scatter comparing the regression line (Y on X)
with Component 1 (Factor 1)]
Comparison
PCA: association among variables (minimizes the
distance to the line in both the x and y directions)
[Figure: scatter of Y against X showing Principal component 1 (Factor 1)
and PC2 at right angles to it]
Can be done in N dimensions
Maximum # of PCs = number of original variables
(or number of objects - 1, if that is smaller)
Steps in PCA
1) From the raw data matrix, calculate the correlation matrix
(equivalently, the covariance matrix of the standardized variables)

Raw data matrix:
         NO3   Total Organic N   Total N   . . . .
Site 1
Site 2
Site 3
:
:

Correlation matrix:
      NO3   TON   TN
NO3   1
TON   0.37  1
TN    0.84  0.13  1
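Step 1 can be sketched in NumPy (the site values below are illustrative, not the actual forest data):

```python
import numpy as np

# Illustrative soil-chemistry data: rows are sites, columns are
# NO3, Total Organic N (TON) and Total N (TN); values are made up
data = np.array([[1.2, 10.5, 11.7],
                 [0.8,  9.9, 10.7],
                 [1.5, 12.1, 13.6],
                 [0.9,  8.7,  9.6],
                 [1.1, 11.0, 12.1]])

# Correlation matrix of the variables
R = np.corrcoef(data, rowvar=False)

# Equivalently: the covariance matrix of the standardized variables
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)
```

`R` and `C` come out identical, which is why the slide's "correlation matrix, or covariance matrix on standardized variables" are two routes to the same starting point.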
Steps in PCA
2) Calculate eigenvectors
(weightings of each original variable on each component)
and eigenvalues (= "latent roots")
(relative measures of the variation explained by each
component)
Eigenvectors
zik = c1k yi1 + c2k yi2 + . . . + cjk yij + . . . + cpk yip
where zik = score on component k for object i
yij = value of original variable j for object i
cjk = factor score coefficient (weight) of variable j on
component k
Example: soil chemistry in a forest
zik = c1k(NO3) + c2k(total organic N) + c3k(total N) + . . .
• the objects are sampling sites
• the variables are chemical measurements, e.g. total N
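The score equation is just a dot product of an object's (standardized) variable values with a component's coefficient vector; a minimal sketch with made-up numbers:

```python
import numpy as np

# Standardized values of NO3, TON and TN for one object (illustrative)
y_i = np.array([0.42, -1.10, 0.35])

# Hypothetical factor score coefficients c_jk for component k = 1
c_1 = np.array([0.58, 0.21, 0.61])

# z_i1 = c_11*y_i1 + c_21*y_i2 + c_31*y_i3  -- a dot product
z_i1 = float(c_1 @ y_i)
```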
Steps in PCA - continued
3) Decide how many components to retain
(scree plot of eigenvalues)
[Scree plot: eigenvalue (y-axis, 0 to 5) against factor number (x-axis, 1 to 8)]
An eigenvalue of 1 means the Factor explains as much variation in the
dataset as an original variable. Values greater than 1 indicate useful
Factors.
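The eigenvalue-greater-than-1 rule can be sketched directly (the eigenvalues below are illustrative):

```python
import numpy as np

# Illustrative eigenvalues from a correlation-matrix PCA of 8 variables
eigenvalues = np.array([3.8, 1.9, 0.9, 0.5, 0.4, 0.3, 0.15, 0.05])

# For a correlation matrix the eigenvalues sum to the number of
# variables, so an eigenvalue of 1 is one variable's worth of variation
retain = eigenvalues > 1

# Proportion of the total variation explained by the retained factors
explained = eigenvalues[retain].sum() / eigenvalues.sum()
```

Here two factors pass the cutoff and together explain 5.7 of the 8 units of variation.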
Steps in PCA
4) Using the factor score coefficients, calculate each object's
factor scores:
factor score = sum over variables of
coefficient × (standardized) variable
Steps in PCA
5) Position objects on scatterplot, using factor
scores on first two (or three) Principal
Components
[Scatterplot: sites (Site 1, Site 2, Site 3, . . .) positioned by their
scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis)]
What are loadings?
• Correlations between the original variables and the Factors (r’s)
– For example, the correlation between variable X and Factor 1
– Correlations range from +1 to –1
– +1 indicates a perfect positive relationship, with no scatter around the line
– –1 indicates a perfect negative relationship, with no scatter around the line
[Figure: scatterplots of an original variable against Factor 1 illustrating
the interpretation of r (correlation coefficient): r = 1 (r² = 1),
r = .77 (r² = .59), r = 0 (r² = 0), r = –.77 (r² = .59), r = –1 (r² = 1)]
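That loadings are correlations can be checked numerically: run a PCA, then correlate each original variable with each column of factor scores (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 illustrative objects measured on 3 variables
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)

# PCA by eigenanalysis of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(vals)[::-1]
scores = Xc @ vecs[:, order]

# Loading of variable j on Factor k = r between X[:, j] and scores[:, k]
loadings = np.array([[np.corrcoef(Xc[:, j], scores[:, k])[0, 1]
                      for k in range(3)]
                     for j in range(3)])
```

With all components retained, each variable's squared loadings sum to 1 across the factors, so every loading sits in the [–1, +1] range described above.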
Worked example
• Using the ourworld dataset
• Variables sampled are population in 1983,
1986 and 1990, military spending, Gross
National Product, birth rate in 1982 and death
rate in 1982 (7 total)
• Can these variables be reduced to fewer
composite factors?
Case 1, Factor 1 = 3.4(.560) + 3.6(.564) + 3.5(.566) + 20(.114) + 9(.086) + 5150(-.130) + 95.83(-.092)
Case 1, Factor 2 = 3.4(.141) + 3.6(.123) + 3.5(.104) + 20(-.520) + 9(-.326) + 5150(.574) + 95.83(.495)
Case  POP83  POP86  POP90     Birth82  Death82  GNP   Mil
1     3.4    3.6    3.500212  20       9        5150  95.83333
2     7.5    7.6    7.644275  12       12       9880  127.2368
Factor Coefficients
Raw Data
Multiply Raw Data by coefficients
to get factor scores
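The Case 1, Factor 1 arithmetic is a dot product of the case's values with the factor coefficients; a sketch using the numbers shown above (note the raw-data result is dominated by GNP's large scale, which is why step 4 specifies standardized variables in practice):

```python
import numpy as np

# Case 1 raw values: POP83, POP86, POP90, Birth82, Death82, GNP, Mil
case1 = np.array([3.4, 3.6, 3.5, 20, 9, 5150, 95.83])

# Factor 1 coefficients from the worked example above
coef_f1 = np.array([0.560, 0.564, 0.566, 0.114, 0.086, -0.130, -0.092])

# Factor score for Case 1 on Factor 1: multiply and sum
factor1_score = float(case1 @ coef_f1)
```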
Determine how many components
(composite factors) to retain
~80% of variance explained by 2 (of 7)
components
Using PCA
• Run simple PCA, no rotation
• Examine loadings – correlations between
factors and original variables
Rotation - Varimax
PCA - ourworld
• What have we found out?
– The seven examined variables can be reduced to 2 and still retain ~80% of the original information
• What have we not found out?
– Any relationships with predictor variables
• Remember: PCA is a data reduction technique, NOT a hypothesis testing technique
• Can it be used to examine hypotheses?
– Overlay predictor groups on Factor plots
– For example, is there a relationship between the Factor scores and Urban (rural, city) or Group (Europe, Islamic or New World)?
[Scatterplot: factor scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis),
points labelled by GROUP (New World, Islamic, Europe)]
Any contribution of Factor 1?
[Scatterplot: factor scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis),
points labelled by URBAN (rural, city)]