Gradient Analysis
Multivariate Fundamentals: Rotation/Distance
Objective: Use one dataset to explain another
Use the spatial patterns in each dataset to understand the structure and variation of the data as gradients, across space, of variables at multiple levels
E.g. environmental factors, species populations and characteristics of communities
“Gradient Analysis” is an umbrella term which includes both rotation-based and distance-based techniques
All of them aim to answer: “What in dataset A explains dataset B?”
Gradient analysis
Data gets separated into 2 distinct datasets that have a spatial link:
Which set of variables is classified as the response vs the predictors is not always clear
You have to use some logic to say what set of variables potentially influences another
Predictor Variables: MAT, MAP, AHI, …
Response Variables: Sp1, Sp2, Sp3, …
Direct gradient analysis
Ordinates the data according to the independent variable (e.g. climate) and then investigates how the dependent variables (e.g. plant species) correlate to the ordination scores.
[Figure: ordination plot showing the predictor vectors AHI, MAP and MAT with the species Sp1, Sp2 and Sp3]
Example: Species 1 and 2 are associated with greater MAT and less MAP (warm & dry)
[Figure: ordination plot showing the predictor vectors AHI, MAP and MAT with the species Sp1, Sp2 and Sp3]
Indirect gradient analysis
Ordinates your dependent variables (e.g. community data) according to their similarity in species composition. The relationship between species frequencies and environmental gradients is then investigated in a second step by correlating the ordination scores with the environmental variables.
Example: Species 1 and 2 are associated with moderate MAT and moderate MAP (mild & moist)
Direct vs Indirect gradient analysis
Can extend any ordination technique to a gradient analysis
But the easiest way to look at both a direct and indirect gradient analysis is to use a PCA (rotation) or NMDS (distance) plot and simply add a second set of vectors to infer relationships between datasets (we do this in Lab 7)
NMDS – Indirect Gradient Analysis NMDS – Direct Gradient Analysis
Indirect: Lodgepole pine (PINUCON) is associated with environments with greater growing season precipitation (lnMSP)
Direct: Lodgepole pine is still associated with wetter summer environments, but so is white spruce (PICEGLA)
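The "ordinate first, then add a second set of vectors" idea can be sketched in R as follows. This is a minimal illustration of the indirect case only, assuming the vegan package and its bundled example data (varespec for species, varechem for environment), not the course data; the choice of Al, P and K as predictors is arbitrary.

```r
library(vegan)

# Example data shipped with vegan (an assumption, not the lecture dataset)
data(varespec)   # response: species cover per site
data(varechem)   # predictors: soil chemistry per site

set.seed(1)      # NMDS uses random starts
# Step 1: ordinate the response (community) data by compositional similarity
ord <- metaMDS(varespec, distance = "bray", k = 2, trace = FALSE)

# Step 2: correlate the ordination scores with environmental variables
fit <- envfit(ord ~ Al + P + K, data = varechem, permutations = 999)

# Plot the sites, then overlay the fitted environmental vectors
plot(ord, display = "sites")
plot(fit)
```

For a direct analysis the roles would be swapped: ordinate the predictor table and fit the response variables onto that ordination.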
Direct vs Indirect gradient analysis
Direct and indirect ordinations may detect different trends hidden in your data structure
Or by only considering one orientation you may miss an important pattern in your plant community data
E.g. soil attributes or climate variables you did not measure
Therefore it is good to use both direct and indirect gradient analysis to get the full picture of relationships within your data
This will allow you to see the full data structure and correctly interpret environmental drivers and data responses
Sometimes in the literature:
Indirect gradient analysis Direct gradient analysis Constrained gradient analysis
Indirect gradient analysis
Direct gradient analysis
Constrained gradient analysis (CGA)
Goal of CGA is to utilize both datasets to infer (as in regression) patterns in species composition from patterns in environmental variables
CGA identifies which environmental variables are most important in structuring the community
E.g. brings out pairs of variables between datasets that are highly associated with each other
Further describes how the environmental variables are related and how the community varies along these most important gradients
BUT:
CGA loses all structure between predictor and response variables
You will just pull out the cross-correlated components from the datasets and ignore everything else
Not necessarily a problem – but it is something to be aware of
Canonical – something being optimized against some other constraint
Canonical Correlation Analysis (CANCOR)
Rotation based technique for constrained gradient analysis
Rotate both the predictor and response datasets independently to maximize correlation between corresponding variables among datasets
Once correlation is maximized between datasets on the first axes – the axis for each dataset is fixed and rotation for the second predictor/response variable is carried out
Repeated for all variables
You do not need to have the same number of response and predictor variables
Canonical functions (rotations) will be built for the smaller number of variables
Harold Hotelling (1895-1973)
All CGA techniques are typically based on an underlying community model
CANCOR (statistical test) assumes that when variables are sampled over a sufficient range, responses will be linear or unimodal
Linearity is an important assumption for CANCOR
[Figure: three panels, each plotting Response against Predictor]
Panel 1: Golden
Panel 2: Can fit linear curve BUT hard to fit – should transform response or predictor variables
Panel 3: CANCOR (and others) will fail to detect a relationship – use MRT (Lab 8)
Canonical Correlation Analysis (CANCOR)
CANCOR in R
cancor(predictorData, responseData) (stats package, included with R)
cc(predictorData, responseData) (CCA package)
predictorData: data table of your predictor variables, e.g. Environmental Variables
responseData: data table of your response variables, e.g. Species Community Variables
To run cc() you need to install and load the CCA package: library(CCA)
The cancor() function will display the correlation values
The cc() function outputs a number of statistics that we can query to provide some more information from the analysis AND use to test significance
Predictor data table and Response data table need to have the same number of rows BUT they do not need to have the same number of columns
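A minimal sketch of the call, using cancor() from the stats package that ships with R. The data are simulated and the names Env1 to Env3 and Spec1 to Spec3 are illustrative assumptions, not the course dataset.

```r
# Simulated data (hypothetical; stands in for real predictor/response tables)
set.seed(42)
n <- 100
predictorData <- data.frame(Env1 = rnorm(n), Env2 = rnorm(n), Env3 = rnorm(n))
responseData <- data.frame(
  Spec1 = 0.9 * predictorData$Env1 + rnorm(n, sd = 0.4),  # tied to Env1
  Spec2 = 0.6 * predictorData$Env3 + rnorm(n, sd = 0.8),  # weakly tied to Env3
  Spec3 = rnorm(n)                                        # unrelated noise
)

# Both tables must describe the same sites, so row counts have to match
stopifnot(nrow(predictorData) == nrow(responseData))

fit <- cancor(predictorData, responseData)

fit$cor    # canonical correlations, largest first
fit$xcoef  # rotation coefficients for the predictor variables
fit$ycoef  # rotation coefficients for the response variables
```

The number of canonical correlations equals the smaller of the two column counts, which is why the tables may differ in width but not in length.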
CANCOR in R
Correlations between the rotated axes
The first correlation will be the maximum value and all successive correlations will be smaller
Estimates of the predictor and response coefficients from the rotation model (matrix algebra)
The value used to adjust each predictor and response variable under rotation
CANCOR in R
We can use the output from the cc function to individually test if the correlation values between our axes are significant
P-values test the hypothesis that the true correlation is equal to 0, i.e. there is no correlation
Therefore small p-values reject this hypothesis and there is a true significant correlation between the axes
Based on our CANCOR analysis of 3 predictor and 3 response variables, the correlations for rotation 1 (0.93) and rotation 2 (0.7) were significant BUT the correlation for rotation 3 (0.12) was NOT significant
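One common way to obtain such p-values is Bartlett's chi-square approximation, sketched below in base R. This is an assumption about the kind of test involved, not necessarily the one used to produce the slide output, and the helper name bartlett_cancor and the simulated data are hypothetical. Row k tests the hypothesis that canonical correlations k through the last are all zero.

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3)                 # simulated predictors
Y <- cbind(X[, 1] + rnorm(n, sd = 0.5),         # strongly related to X1
           X[, 2] + rnorm(n, sd = 1),           # moderately related to X2
           rnorm(n))                            # unrelated

r <- cancor(X, Y)$cor
p <- ncol(X)
q <- ncol(Y)

# Hypothetical helper: Bartlett's sequential chi-square tests
bartlett_cancor <- function(r, n, p, q) {
  s <- length(r)
  t(sapply(seq_len(s), function(k) {
    # Wilks' lambda for correlations k..s, on the log scale
    stat <- -(n - 1 - (p + q + 1) / 2) * sum(log(1 - r[k:s]^2))
    df <- (p - k + 1) * (q - k + 1)
    c(from = k, chisq = stat, df = df,
      p.value = pchisq(stat, df, lower.tail = FALSE))
  }))
}

tests <- bartlett_cancor(r, n, p, q)
tests
```

The approximation relies on multivariate normality, so permutation-based alternatives may be preferable for skewed community data.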
CANCOR in R
If the correlation for the canonical functions (rotated axes) is significant we can look at the loadings to see what each new rotated axis is related to in our original predictor and response variables
In our case we only have to look at Can1 and Can2
We now have to interpret the loading values for the predictor and response variables together (e.g. associate high loadings together)
Can1: When Env3 is rare, Spec2 and Spec3 have lower frequency (both negative)
Spec2 and Spec3 prefer Env3 (reverse)
Can2: When Env1 is abundant, Spec1 has a higher frequency (both positive)
Spec1 prefers Env1
Nothing really likes Env2
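Loadings of this kind can be computed as the correlations between the original variables and the canonical scores. The sketch below does this in base R on simulated data; the variable names mirror the slide but the values are invented, so the signs and sizes will not match the lecture output.

```r
set.seed(7)
n <- 150
env <- cbind(Env1 = rnorm(n), Env2 = rnorm(n), Env3 = rnorm(n))
spec <- cbind(Spec1 = env[, "Env1"] + rnorm(n, sd = 0.5),
              Spec2 = env[, "Env3"] + rnorm(n, sd = 0.5),
              Spec3 = env[, "Env3"] + rnorm(n, sd = 0.7))

fit <- cancor(env, spec)

# Canonical scores: centre each table, then apply its rotation coefficients
env_scores  <- scale(env,  center = fit$xcenter, scale = FALSE) %*% fit$xcoef
spec_scores <- scale(spec, center = fit$ycenter, scale = FALSE) %*% fit$ycoef
colnames(env_scores) <- colnames(spec_scores) <- paste0("Can", 1:3)

# Loadings: correlation of each original variable with each canonical axis
env_loadings  <- cor(env,  env_scores)
spec_loadings <- cor(spec, spec_scores)

round(env_loadings[, 1:2], 2)
round(spec_loadings[, 1:2], 2)
```

Reading the two tables side by side, variables with large loadings of the same sign on an axis move together, and opposite signs indicate a reversed relationship.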
Canonical Correspondence Analysis (CCA)
Rotation-based technique for constrained gradient analysis
Like CANCOR, CCA aims to maximize the correlations between response and predictor variables, BUT response scores are constrained to be linear combinations of predictor variables in an effort to maximize the variance explained by the predictor data in all ordination axes (e.g. CCA1, CCA2, etc.)
Multiple linear regression is used to solve the linear combinations of predictor variables
Categorical variables can be used in CCA – converted to “dummy” variables where each class is assigned a numeric value (should be addressed with caution)
CCA is considered an improvement over CANCOR in some fields
CCA was developed for Ecology
Like CANCOR – linearity of the relationship between response and predictor variables is assumed
CCA may be able to detect some non-linear responses, BUT there are better techniques for that (MRT, RandomForest – Lab 8)
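The conversion of categorical predictors to dummy variables mentioned above can be illustrated in base R. When a factor appears in a model formula, R expands it into 0/1 indicator columns, with the first level used as the reference; the small site table below is a hypothetical example.

```r
# Hypothetical site table with one categorical and one numeric predictor
site <- data.frame(
  soil = factor(c("clay", "sand", "loam", "sand")),
  MAT  = c(2.1, 3.4, 1.8, 2.9)
)

# Expand the formula into the numeric design matrix used by linear methods
dummies <- model.matrix(~ soil + MAT, data = site)
dummies
# Columns: (Intercept), soilloam, soilsand, MAT
# "clay" is the reference level, so it has no column of its own
```

Because each class becomes its own column, a factor with many levels adds many predictors at once, which is one reason for the caution noted above.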
CCA in R
cca(responseData ~ Predictor1 + Predictor2 + …, data = predictorData) (vegan package)
To run CCA you need to install and load the vegan package: library(vegan)
responseData: data table of your response variables, e.g. Species Community Variables
Predictor1 + Predictor2 + …: a linear equation including the predictor variables (e.g. Environmental Variables) that you feel are related to the response variable outputs (e.g. Species Occurrence)
You can include as many predictors as you wish HOWEVER, the more predictors you include the more complex the analysis and the capacity to detect strong relationships is reduced (so pick your predictor variables mindfully)
predictorData: data table of your predictor variables, e.g. Environmental Variables
Predictor data table and Response data table need to have the same number of rows BUT they do not need to have the same number of columns
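A minimal sketch of a CCA call with vegan's formula interface. It assumes the example datasets bundled with vegan (varespec and varechem) rather than the course data, and the three predictors are picked arbitrarily for illustration.

```r
library(vegan)

data(varespec)   # response table: species cover, one row per site
data(varechem)   # predictor table: soil chemistry, same sites

# Same number of rows is required; column counts may differ
stopifnot(nrow(varespec) == nrow(varechem))

# Response table on the left of ~, predictor columns on the right
mod_cca <- cca(varespec ~ Al + P + K, data = varechem)

# Printing the model shows total, constrained and unconstrained inertia
mod_cca
```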
CCA in R
We analyzed a model where we included 3 environmental variables to explain species frequency
Variance Explained:
Total variance – total amount of variance in the response variables (e.g. species data)
Constrained variance – how much is explained by the predictor variables (e.g. environmental data)
Unconstrained variance – how much variance is left in the response variables (unexplained)
Eigenvalues – how much of the variance is explained by the individual axes of the ordination (you can plot these axes with the plot command – Lab 8)
You will have to figure out the % of variance explained by yourself
Simply divide value / total variance
Constrained CCA1 = 0.005395/0.007953 = ~68%
Constrained CCA2 = 0.000214/0.007953 = ~3%
Unconstrained defaults to a correspondence analysis (unconstrained)
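The percentage calculation is simple enough to check in base R. The numbers below are the eigenvalues and total variance quoted on the slide; only the variable names are assumptions.

```r
# Values taken from the slide output
total_variance  <- 0.007953
constrained_eig <- c(CCA1 = 0.005395, CCA2 = 0.000214)

# Share of total variance explained by each constrained axis, in percent
pct_explained <- 100 * constrained_eig / total_variance
round(pct_explained, 1)
# CCA1 CCA2
# 67.8  2.7
```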
CCA in R
To determine if a significant relationship between our response and predictor variables exists we can run our CCA output through an ANOVA
Generic ANOVA tells us if a significant relationship between the response and predictor variables exists
anova() (dispatches to anova.cca in the vegan package)
For the by option – selecting "term" means p-values will be produced for each predictor term
For the permu option – the number of permutations to use to generate the p-values (current vegan releases name this argument permutations)
P-values test the hypothesis that the correlation between species variables and each environmental variable is 0
From p-values Env2 and Env3 are significantly associated with species occurrences
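A sketch of the permutation tests, again assuming vegan's bundled varespec and varechem data rather than the course dataset. The permutations argument is the name used by current vegan releases.

```r
library(vegan)

data(varespec)
data(varechem)
mod_cca <- cca(varespec ~ Al + P + K, data = varechem)

set.seed(123)  # permutation p-values vary slightly between runs

# Overall test: is there any relationship between response and predictors?
overall <- anova(mod_cca, permutations = 999)

# Sequential test with one p-value per predictor term
by_term <- anova(mod_cca, by = "term", permutations = 999)

overall
by_term
```

With by = "term" the predictors are tested in the order they appear in the formula, so the p-values can change if the terms are reordered.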
CCA in R
From the image we can see:
Env3 appears positively associated with Spec3 and negatively associated with Spec2 and Spec1
Env2 appears positively associated with Spec3 and Spec1 and negatively associated with Spec2
Env1 appears negatively associated with Spec2 – but from ANOVA this is not a significant relationship
Redundancy Analysis (RDA)
Rotation-based technique for constrained gradient analysis
The goal of RDA is to apply linear regression in order to find linear combinations of predictor variables to represent as much variance in the response variables as possible
CCA focuses more on species composition, i.e. relative abundance: if you have a gradient along which all species are positively correlated, RDA will detect such a gradient while CCA will not
With RDA, it is possible to use 'species' that are measured in different units, BUT in this case, the data must be centered and standardized
RDA can be useful when gradients are short or you are conducting a short-term experimental study
Like CCA, categorical variables can be used in RDA – converted to “dummy” variables where each class is assigned a numeric value (should be addressed with caution)
Like CANCOR and CCA – linearity of the relationship between response and predictor variables is assumed
RDA in R
rda(responseData ~ Predictor1 + Predictor2 + …, data = predictorData) (vegan package)
To run RDA you need to install and load the vegan package: library(vegan)
responseData: data table of your response variables, e.g. Species Community Variables
Predictor1 + Predictor2 + …: a linear equation including the predictor variables (e.g. Environmental Variables) that you feel are related to the response variable outputs (e.g. Species Occurrence)
predictorData: data table of your predictor variables, e.g. Environmental Variables
Predictor data table and Response data table need to have the same number of rows BUT they do not need to have the same number of columns
With RDA it is possible to use response variables that are measured in different units, BUT in this case the dependent data must be centered and standardized before executing the analysis
To do this you can specify the option scale=TRUE (default is FALSE)
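A minimal sketch of an RDA call with standardized responses, assuming vegan's bundled varespec and varechem data in place of the course dataset; the predictors are chosen arbitrarily.

```r
library(vegan)

data(varespec)   # response table
data(varechem)   # predictor table

# scale = TRUE standardizes each response column to unit variance,
# which matters when responses are measured in different units
mod_rda <- rda(varespec ~ Al + P + K, data = varechem, scale = TRUE)

mod_rda
```

Leaving scale at its default of FALSE keeps the original variances, so abundant or high-variance responses dominate the ordination.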
RDA in R
We analyzed a model where we included 4 environmental variables to explain species frequency for 6 species
Eigenvalues – how much of the variance is explained by the individual axes of the ordination (you can plot these axes with the plot command – Lab 8)
You will have to figure out the % of variance explained by yourself
Simply divide value / total variance
Constrained RDA1 = 74.52/112.88889 = ~66%
Constrained RDA2 = 24.94/112.88889 = ~22%
Constrained RDA3 = 8.88/112.88889 = ~8%
Variance Explained:
Total variance – total amount of variance in the response variables (e.g. species data)
Constrained variance – how much is explained by the predictor variables (e.g. environmental data)
Unconstrained variance – how much variance is left in the response variables (unexplained)
Unconstrained defaults to a PCA (unconstrained)
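Rather than dividing by hand, the same proportions can be pulled from the fitted object. The sketch below assumes vegan's bundled example data and reads the tot.chi and eig components that vegan stores on cca and rda results.

```r
library(vegan)

data(varespec)
data(varechem)
mod_rda <- rda(varespec ~ Al + P + K, data = varechem)

total         <- mod_rda$tot.chi       # total variance
constrained   <- mod_rda$CCA$tot.chi   # explained by the predictors
unconstrained <- mod_rda$CA$tot.chi    # left unexplained

# Proportion of total variance on each constrained axis (RDA1, RDA2, ...)
axis_share <- mod_rda$CCA$eig / total
round(100 * axis_share, 1)

# Constrained and unconstrained parts together account for the total
round(100 * c(constrained = constrained, unconstrained = unconstrained) / total, 1)
```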
RDA in R
To determine if a significant relationship between our response and predictor variables exists we can run our RDA output through an ANOVA (like CCA)
Generic ANOVA tells us if a significant relationship between the response and predictor variables exists
anova() (dispatches to anova.cca in the vegan package)
For the by option – selecting "term" means p-values will be produced for each predictor term
For the permu option – the number of permutations to use to generate the p-values (current vegan releases name this argument permutations)
P-values test the hypothesis that the correlation between species variables and each environmental variable is 0
From p-values Depth, Sand, and Coral are all significantly associated with fish occurrences
“Other substrate” was removed from the analysis due to collinearity (last column entered)
RDA in R
From the image we can see:
All species were found to dislike environments characterized by Sand (i.e. all species respond the same way along this environmental gradient)
Sp3 and Sp4 associated with Coral environments
Sp1 and Sp2 associated with environments with greater water Depth