Multivariate Data Analysis: a survey of data reduction and data
association techniques
Principal Components Analysis
For example:
• Data reduction approaches
– Cluster analysis
– Principal components analysis
– Principal coordinates analysis
– Multidimensional scaling
• Hypothesis testing approaches
– Discriminant analysis
– MANOVA
– ANOSIM
– Canonical correlation
– PERMANOVA
Objects
• Things we wish to compare
– sampling or experimental units
– e.g. quadrats, animals, plants, cages etc.
Variables
• Characteristics measured from each
object
– usually continuous variables
– e.g. counts of species, size of body parts
etc.
Ecological data
• Objects:
– sampling units (SU’s, e.g. quadrats, plots
etc.)
• Variables:
– species abundances and/or environmental
data
• Common in community ecology
Wisconsin forests (Peet & Loucks 1977)
• Plots (quadrats) in Wisconsin forests
• Number of individuals of each species
of tree recorded in each quadrat
• Objects:
– quadrats
• Variables:
– abundances of each tree species
Plot  Bur oak  Black oak  White oak  Red oak  etc.
1     9        8          5          3
2     8        9          4          4
3     3        8          9          0
4     5        7          9          6
5     6        0          7          9
6     0        0          7          8
etc.
Data
Garroch Head dumping ground
(Clarke & Ainsworth 1993)
• Sewage sludge dumping ground in bay
• Transect across dumping ground
• Core of mud at each of 10 stations along transect
• Objects:
– stations
• Variables:
– metal concentrations in ppm
Station  Cu   Mn    Co  Ni  Zn   Cd   etc.
1        26   2470  14  34  160  0
2        30   1170  15  32  156  0.2
3        37   394   12  38  182  0.2
4        74   349   12  41  227  0.5
5        115  317   10  37  329  2.2
etc.
Data
Morphological data
• Objects:
– usually organisms or specimens
• Variables:
– morphological measurements
Morphological data
• Morphological variation between dog
species/types
• Objects:
– dog types (7)
• Variables:
– sizes of 6 different parts of mandible
– mandible breadth, mandible height, etc.
                 Variable
Dog type          1     2     3     4     5     6
Modern dog        9.7   21.0  19.4  7.7   32.0  36.5
Jackal            8.1   16.7  18.3  7.0   30.3  32.9
Chinese wolf      13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf       11.5  24.3  24.5  9.3   40.0  44.6
Cuon              10.7  23.5  21.4  8.5   28.8  37.6
Dingo             9.6   22.6  21.1  8.3   34.4  43.1
Prehistoric dog   10.3  22.1  19.1  8.1   32.3  25.0
Data
Presentation of Multivariate Data
• Hard to visualize complex multivariate datasets
(more than 3 dimensions)
– For example, how do you visualize the 7 attributes
of a dog skull?
• Easier to visualize relationships between
objects (e.g. similarity, dissimilarity,
correlation, scaled distance)
Presentation of Multivariate Data
[Diagram: a raw data matrix (objects O1 . . Op by variables V1 . . Vn) is
converted to a resemblance matrix (O1 . . Op by O1 . . Op), created using
correlations, covariances or dissimilarity indices, which is then used for
ordination or classification]
Principal Components Analysis
• Aims to reduce a large number of variables to a smaller
number of summary variables, called Principal
Components (or factors), that explain most of the
variation in the data.
• Is basically a rotation of axes after centering to the
means of the variables, the rotated axes being the
Principal Components.
• Is usually carried out using a matrix algebra
technique called eigenanalysis.
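The centre-then-rotate idea can be sketched numerically; this is a minimal illustration (the data values below are made up for demonstration), assuming NumPy is available:

```python
import numpy as np

# Illustrative toy data: 5 objects (rows) x 3 variables (columns);
# the numbers are made up for demonstration only
X = np.array([[2.0,  4.1, 1.0],
              [3.1,  5.9, 0.8],
              [4.2,  8.1, 1.3],
              [5.0,  9.8, 0.9],
              [6.1, 12.2, 1.1]])

# Centre each variable on its mean, then eigenanalysis of the
# covariance matrix gives the rotated axes (the Principal Components)
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
order = np.argsort(eigenvalues)[::-1]             # largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Scores position each object on the rotated axes
scores = Xc @ eigenvectors
```

The columns of `eigenvectors` are the rotated axes, the eigenvalues measure the variation each component explains (their sum equals the total variance), and `scores` positions each object on the components.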
Regression
Ordinary least squares (OLS) estimation allows the best prediction of Y
given X (minimizes the distance to the line in the y direction)
[Figure: least squares regression line of Y on X, showing an observed
value yi at xi, the predicted value ŷi on the line, and the residual yi − ŷi]
PCA: association among variables (minimizes the
distance to the line in both the x and y directions)
[Figure: scatter of Y against X with the first principal component line;
an observed point (xi, yi) is marked]
[Figure: the same scatter comparing the regression line (Y on X)
with Component 1 (Factor 1)]
Comparison
PCA: association among variables (minimizes the
distance to the line in both the x and y directions)
[Figure: scatter of Y against X showing Principal component 1 (Factor 1)
and PC2 at right angles to it]
Can be done in N dimensions
Maximum # of PCs = number of original variables
(or number of objects - 1, if that is smaller)
Steps in PCA
1) From the raw data matrix, calculate the correlation matrix
(equivalently, the covariance matrix of the standardized variables)

Raw data matrix:
         NO3   Total Organic N   Total N   . . . .
Site 1
Site 2
Site 3
:
:

Correlation matrix:
      NO3   TON   TN
NO3   1
TON   0.37  1
TN    0.84  0.13  1
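Step 1 can be sketched in NumPy (the site values below are illustrative, not the actual forest data):

```python
import numpy as np

# Illustrative soil-chemistry data: rows are sites, columns are
# NO3, Total Organic N (TON) and Total N (TN); values are made up
data = np.array([[1.2, 10.5, 11.7],
                 [0.8,  9.9, 10.7],
                 [1.5, 12.1, 13.6],
                 [0.9,  8.7,  9.6],
                 [1.1, 11.0, 12.1]])

# Correlation matrix of the variables
R = np.corrcoef(data, rowvar=False)

# Equivalently: the covariance matrix of the standardized variables
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
C = np.cov(Z, rowvar=False)
```

`R` and `C` come out identical, which is why the slide's "correlation matrix, or covariance matrix on standardized variables" are two routes to the same starting point.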
Steps in PCA
2) Calculate eigenvectors
(weightings of each original variable on each component)
and eigenvalues (= "latent roots")
(relative measures of the variation explained by each
component)
Eigenvectors
zik = c1k yi1 + c2k yi2 + . . . + cjk yij + . . . + cpk yip
where zik = score on component k for object i
yij = value of original variable j for object i
cjk = factor score coefficient (weight) of variable j on
component k
Example: soil chemistry in a forest
zik = c1k(NO3) + c2k(total organic N) + c3k(total N) + . . .
• the objects are sampling sites
• the variables are chemical measurements, e.g. total N
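The score equation is just a dot product of an object's (standardized) variable values with a component's coefficient vector; a minimal sketch with made-up numbers:

```python
import numpy as np

# Standardized values of NO3, TON and TN for one object (illustrative)
y_i = np.array([0.42, -1.10, 0.35])

# Hypothetical factor score coefficients c_jk for component k = 1
c_1 = np.array([0.58, 0.21, 0.61])

# z_i1 = c_11*y_i1 + c_21*y_i2 + c_31*y_i3  -- a dot product
z_i1 = float(c_1 @ y_i)
```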
Steps in PCA - continued
3) Decide how many components to retain
(scree plot of eigenvalues)
[Scree plot: eigenvalue (y-axis, 0 to 5) against factor number (x-axis, 1 to 8)]
An eigenvalue of 1 means the Factor explains as much variation in the
dataset as an original variable. Values greater than 1 indicate useful
Factors.
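The eigenvalue-greater-than-1 rule can be sketched directly (the eigenvalues below are illustrative):

```python
import numpy as np

# Illustrative eigenvalues from a correlation-matrix PCA of 8 variables
eigenvalues = np.array([3.8, 1.9, 0.9, 0.5, 0.4, 0.3, 0.15, 0.05])

# For a correlation matrix the eigenvalues sum to the number of
# variables, so an eigenvalue of 1 is one variable's worth of variation
retain = eigenvalues > 1

# Proportion of the total variation explained by the retained factors
explained = eigenvalues[retain].sum() / eigenvalues.sum()
```

Here two factors pass the cutoff and together explain 5.7 of the 8 units of variation.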
Steps in PCA
4) Using the factor score coefficients, calculate each object's
factor scores:
factor score = sum over variables of
coefficient × (standardized) variable
Steps in PCA
5) Position objects on scatterplot, using factor
scores on first two (or three) Principal
Components
[Scatterplot: sites (Site 1, Site 2, Site 3, . . .) positioned by their
scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis)]
What are loadings?
• Correlations between the original variables and the Factors (r’s)
– For example, the correlation between variable X and Factor 1
– Correlations range from +1 to –1
– +1 indicates a perfect positive relationship, with no scatter around the line
– –1 indicates a perfect negative relationship, with no scatter around the line
[Figure: scatterplots of an original variable against Factor 1 illustrating
the interpretation of r (correlation coefficient): r = 1 (r² = 1),
r = .77 (r² = .59), r = 0 (r² = 0), r = –.77 (r² = .59), r = –1 (r² = 1)]
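That loadings are correlations can be checked numerically: run a PCA, then correlate each original variable with each column of factor scores (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 illustrative objects measured on 3 variables
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)

# PCA by eigenanalysis of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(vals)[::-1]
scores = Xc @ vecs[:, order]

# Loading of variable j on Factor k = r between X[:, j] and scores[:, k]
loadings = np.array([[np.corrcoef(Xc[:, j], scores[:, k])[0, 1]
                      for k in range(3)]
                     for j in range(3)])
```

With all components retained, each variable's squared loadings sum to 1 across the factors, so every loading sits in the [–1, +1] range described above.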
Worked example
• Using the ourworld dataset
• Variables sampled are population in 1983,
1986 and 1990, military spending, Gross
National Product, birth rate in 1982 and death
rate in 1982 (7 total)
• Can these variables be reduced to fewer
composite factors?
Case 1, Factor 1 = 3.4(.560) + 3.6(.564) + 3.5(.566) + 20(.114) + 9(.086) + 5150(-.130) + 95.83(-.092)
Case 1, Factor 2 = 3.4(.141) + 3.6(.123) + 3.5(.104) + 20(-.520) + 9(-.326) + 5150(.574) + 95.83(.495)
Case  POP83  POP86  POP90     Birth82  Death82  GNP   Mil
1     3.4    3.6    3.500212  20       9        5150  95.83333
2     7.5    7.6    7.644275  12       12       9880  127.2368
Factor Coefficients
Raw Data
Multiply Raw Data by coefficients
to get factor scores
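The Case 1, Factor 1 arithmetic is a dot product of the case's values with the factor coefficients; a sketch using the numbers shown above (note the raw-data result is dominated by GNP's large scale, which is why step 4 specifies standardized variables in practice):

```python
import numpy as np

# Case 1 raw values: POP83, POP86, POP90, Birth82, Death82, GNP, Mil
case1 = np.array([3.4, 3.6, 3.5, 20, 9, 5150, 95.83])

# Factor 1 coefficients from the worked example above
coef_f1 = np.array([0.560, 0.564, 0.566, 0.114, 0.086, -0.130, -0.092])

# Factor score for Case 1 on Factor 1: multiply and sum
factor1_score = float(case1 @ coef_f1)
```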
Determine how many components
(composite factors) to retain
~80% of variance explained by 2 (of 7)
components
Using PCA
• Run simple PCA, no rotation
• Examine loadings – correlations between
factors and original variables
Rotation - Varimax
PCA - ourworld
• What have we found out?
– The seven examined variables can be reduced to 2 and still retain ~80% of the original information
• What have we not found out?
– Any relationships with predictor variables
• Remember: PCA is a data reduction technique, NOT a hypothesis testing technique
• Can it be used to examine hypotheses?
– Overlay predictor groups on Factor plots
– For example, is there a relationship between the Factor scores and Urban (rural, city) or Group (Europe, Islamic or New World)?
[Scatterplot: factor scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis),
points labelled by GROUP (New World, Islamic, Europe)]
Any contribution of Factor 1?
[Scatterplot: factor scores on FACTOR(1) (x-axis) and FACTOR(2) (y-axis),
points labelled by URBAN (rural, city)]