DATA HANDLING AND PRESENTATION
MULTIVARIATE ANALYSIS
Dr. Anne van Dam
Multivariate data analysis: lecture outline
• What is it?
  Why/when use it
  Classification of techniques
• Structured approach to multivariate analysis
  Problem definition, objectives
  Analysis plan
  Assumptions
  Model estimation
  Interpretation
  Validation
• Examples
  multiple regression analysis
  factor analysis (Dr. Peter Kelderman)
Multivariate data analysis – what is it?
• Univariate analysis: single-variable distributions
• Bivariate analysis: analysis of relationships between two variables
  cross-classification
  correlation
  one-way analysis of variance (ANOVA)
  simple regression
• Multivariate analysis : analysis of > 2 variables
Univariate versus bivariate analysis: examples
Bivariate analysis: phosphorus and chlorophyll a in lakes
Univariate analysis: dissolved oxygen in a fishpond
[Figure: histogram of DO concentration (mg/l) against no. of observations. Mean = 4.78, S.D. = 0.95]
Multivariate data analysis: example
Predicting the phosphorus concentrations of lakes
What is a variate?
Building block of multivariate analysis is the “variate”
Definition: a variate is a linear combination of variables with empirically determined weights
Variate value = w1X1 + w2X2 + w3X3 + …. + wnXn
Xn : observed variable
wn : weight determined by multivariate technique
The value of the variate represents the combination of the entire set of variables
Variate: example from multiple regression
log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985
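The definition above can be sketched in a few lines of Python (a language the slides themselves do not use). The weights and the constant are taken from the regression example; the input values for TP, pH and DR are hypothetical, and reading log as base 10 is an assumption.

```python
import math

def variate_value(weights, values, constant=0.0):
    """Linear combination w1*X1 + ... + wn*Xn, plus an optional constant."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values)) + constant

# Weights and constant from the regression example on the slide.
WEIGHTS = [1.148, 0.137, 0.286]
CONSTANT = -1.985

# Hypothetical observations (not from the source): TP, pH and DR.
tp, ph, dr = 10.0, 7.0, 1.0

# Assumption: "log" on the slide means the base-10 logarithm.
x = [math.log10(tp), ph, math.log10(dr)]

log_spm = variate_value(WEIGHTS, x, CONSTANT)
print(f"log(SPM) = {log_spm:.3f}")
```

The single number returned stands in for the whole set of variables, which is the sense in which the variate "represents the combination of the entire set".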
Metric and non-metric data
Non-metric or qualitative: attributes, characteristics, or categorical properties; types, classes, absence/presence
  Example: male/female
Metric or quantitative: differing in amount or degree
  Example: temperature
Nominal (or categorical) scale: class symbols have no quantitative meaning (e.g., female = 1, male = 2)
Ordinal scale: variables can be ranked according to scale (e.g., level of agreement in survey: don’t agree = 1, don’t know = 2, agree = 3)
Interval scale: arbitrary zero point (e.g., temperature in Celsius versus Fahrenheit)
Ratio scale: absolute zero point (e.g., weight or length)
Classification of multivariate techniques
Dependence techniques: a variable or set of variables is identified as the dependent variable to be predicted or explained by independent variables
  Example: multiple regression analysis
Interdependence techniques: a set of variables is analysed simultaneously without defining dependence relationships
  Example: factor analysis
Multivariate dependence methods (one dependent variable)
• Analysis of variance
  Y1 = X1 + X2 + X3 + ... + Xn
  Y: metric; X: nonmetric
• Multiple discriminant analysis
  Y1 = X1 + X2 + X3 + ... + Xn
  Y: nonmetric; X: metric
• Multiple regression analysis
  Y1 = X1 + X2 + X3 + ... + Xn
  Y: metric; X: metric, nonmetric
Multivariate dependence methods (multiple dependent variables)
• Canonical correlation
  Y1 + Y2 + Y3 + ... + Yn = X1 + X2 + X3 + ... + Xn
  Y: metric, nonmetric; X: metric, nonmetric
• Multivariate analysis of variance
  Y1 + Y2 + Y3 + ... + Yn = X1 + X2 + X3 + ... + Xn
  Y: metric; X: nonmetric
[Figure: decision tree for selecting a dependence technique. Questions asked: type of relationship (dependence or interdependence)? How many variables are predicted (multiple relationships of dependent and independent variables; several dependent variables in a single relationship; one dependent variable in a single relationship)? Measurement scale of the dependent variable (metric or nonmetric)? Measurement scale of the predictor variable (metric or nonmetric)? Techniques at the end points: structural equation modelling; canonical correlation analysis; canonical correlation with dummy variables; multivariate analysis of variance; multiple regression; conjoint analysis; multiple discriminant analysis; linear probability models.]
[Figure: decision tree for selecting an interdependence technique. Questions asked: is the structure of relationships among variables, cases/respondents, or objects? How are the attributes measured (metric or nonmetric)? Techniques at the end points: factor analysis; cluster analysis; multidimensional scaling; correspondence analysis.]
Source: Hair et al. 1998. Multivariate data analysis, 5th ed.
Classification of multivariate techniques
Exploratory methods: main objective is to identify interrelationships and structures among variables. Reduction of a large number of variables to a few key components.
  Examples: principal component and factor analysis; cluster analysis; multidimensional scaling
Confirmatory methods: main objective is to test hypothesized relationships between variables. Researcher has an a-priori understanding of relationships.
  Examples: correlation analysis; multiple regression; canonical correlation; analysis of variance; discriminant analysis; conjoint analysis; structural equation modelling
Structured approach to multivariate analysis
Stage 1: Define problem, objectives, and choose technique
Stage 2: Develop analysis plan and evaluate assumptions of multivariate technique
Stage 3: Estimate the model
Stage 4: Interpret the variates
Stage 5: Validate the model
1. Define problem, objectives, and technique
• Develop a conceptual model
  relationships between variables
  structure, similarities
  cause and effect
• Define the objective(s) of the model
  prediction or exploration
  understanding of the system
• Choose technique in relation to objectives and data types
  dependence/interdependence
  metric/nonmetric
2. Develop analysis plan and evaluate assumptions underlying technique
• Analysis plan:
  minimum/desired sample size
  allowable/required variable types
  special variable formulation
• Assumptions:
  normal distribution
  linearity
  independence of error terms
  equality of variance (homoscedasticity)
• Data manipulation / transformation
  e.g., logarithmic or arc sine transformation
  dummy variables
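The data manipulations named above can be sketched in Python as follows. The function names and sample values are hypothetical; the arcsine square-root form is one common version of the arc sine transformation for percentages, and dropping the first category as the reference level is a design choice made here for illustration.

```python
import math

def log_transform(x):
    """Base-10 logarithm; defined only for positive values."""
    if x <= 0:
        raise ValueError("log transform requires x > 0")
    return math.log10(x)

def arcsine_transform(percentage):
    """Arcsine square-root transform of a percentage (0-100), in radians."""
    p = percentage / 100.0
    if not 0.0 <= p <= 1.0:
        raise ValueError("percentage must lie between 0 and 100")
    return math.asin(math.sqrt(p))

def dummy_variables(categories):
    """Code a nonmetric variable as 0/1 columns.

    The first level (in sorted order) is dropped and acts as the
    reference category, so k levels give k - 1 dummy columns.
    """
    levels = sorted(set(categories))
    return {level: [1 if c == level else 0 for c in categories]
            for level in levels[1:]}

# Hypothetical values, for illustration only.
stocking_density = [2000.0, 5000.0, 10000.0]
log_density = [log_transform(v) for v in stocking_density]
recovery_pct = [25.0, 50.0, 100.0]
asin_recovery = [arcsine_transform(v) for v in recovery_pct]
feed = ["fed", "unfed", "fed"]
feed_dummies = dummy_variables(feed)
```

A two-level variable such as fed/unfed yields a single 0/1 column, which can then enter a regression alongside metric variables.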
3. Estimate the model
• Model estimation
• Assessment of model fit
  R2
  if necessary, improve fit (e.g., rotation in factor analysis)
  iteration
• Evaluation of the model
  statistical significance (model, parameters)
  outliers?
  “robustness”
4. Interpret the variates
• Examination of estimated coefficients (weights) for each variable in the variate
  e.g., multiple regression: log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985
• Interpretation of multiple variates as underlying “dimensions”
  e.g., principal components
• If necessary: reformulation of model
5. Validate the model
• How general is the model?
  does it apply to the total population?
  can the model be used for prediction?
• Method: check model fit with independent dataset
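A minimal Python sketch of this check: predictions from an already fitted model are compared with observations the model has not seen, and the fit is summarized as R2 on that independent dataset. The model coefficients and the data below are hypothetical.

```python
def predict(intercept, coefs, row):
    """Predicted Y for one case: a + b1*X1 + ... + bk*Xk."""
    return intercept + sum(b * x for b, x in zip(coefs, row))

def r_squared(y_obs, y_pred):
    """Share of the variation in y_obs reproduced by y_pred."""
    mean_y = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_y) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical model, fitted earlier on a separate estimation dataset.
intercept, coefs = 1.0, [2.0]

# Hypothetical independent dataset, not used in estimation.
X_new = [[1.0], [2.0], [3.0], [4.0]]
y_new = [3.1, 4.9, 7.2, 8.8]

y_hat = [predict(intercept, coefs, row) for row in X_new]
r2_validation = r_squared(y_new, y_hat)
print(f"R2 on independent data: {r2_validation:.3f}")
```

A validation R2 well below the R2 obtained during estimation suggests the model does not generalize beyond the data it was fitted on.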
Structured approach to multivariate analysis
Define problem, objectives, and choose technique
Analysis plan and evaluate assumptions, data manipulation
Estimate the model
Interpret the variates
Validate the model
iteration!!
Examples of multivariate analysis
  Principal components and factor analysis
  Multiple regression analysis
  Multiple discriminant analysis
  Multivariate analysis of variance and covariance
  Conjoint analysis
  Canonical correlation
  Cluster analysis
  Multidimensional scaling
  Correspondence analysis
  Linear probability models
  Structural equation modelling
  Data mining and warehousing
  Neural networks
  Resampling
Multiple regression analysis of rice-fish data*
• Objectives
  explain variation in fish and rice production from input and climate data
  exploratory analysis, “meta analysis”
*Source: van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture experiments: a rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.
• Background
  50 experiments on rice-fish culture in unpublished reports
  problem: low power of single pond experiments (many type II errors)
  multivariate analysis allows a new look at the data
• Choice of technique: multiple linear regression
  relate one metric variable (rice, fish yield) to multiple metric explanatory variables
Multiple regression analysis of rice-fish data in the Philippines
System description
Nile tilapia (Oreochromis niloticus L.)
Multiple regression analysis of rice-fish data
Analysis plan / methodology / assumptions
• Database management
• First analysis: plots, correlation matrix
• Transformations/re-expression
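The correlation matrix used in a first analysis can be sketched in plain Python. The variable names echo the rice-fish dataset, but the values below are hypothetical and serve only to show the calculation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict with r for every pair of named variables."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical observations (not from the source).
data = {
    "stocking_size": [5.0, 10.0, 15.0, 20.0, 25.0],
    "recovery": [40.0, 55.0, 60.0, 70.0, 85.0],
    "growth_rate": [0.50, 0.42, 0.35, 0.33, 0.25],
}

corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, " ".join(f"{r:6.2f}" for r in row.values()))
```

Strong correlations between candidate independent variables are worth noting at this stage, because they complicate the interpretation of partial regression coefficients later on.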
Multiple regression analysis of rice-fish data
Database management
[Figure: Data plot; trends. Fish recovery (%) and growth rate (g/day) plotted against fish stocking size (g).]
[Figure: Data plot; trends. Net fish yield (kg/ha) plotted against recovery (%), with fitted line y = 1.8516x - 50.993, R2 = 0.4605.]
DESCRIPTIVE STATISTICS OF RICE-FISH DATASET. NO. OF CASES (N) = 198

Name                               Mean      SD        Min.     Max.

Dependent variables
Gross fish yield (kg ha-1)         122.86    77.66     2.5      390.0
Net fish yield (kg ha-1)           51.35     68.62     -105.0   300.0
Fish recovery percentage (%)       55.62     25.49     1        100
Fish growth rate (g d-1)           0.347     0.123     0.14     0.84
Rice yield (kg ha-1)               4337.46   1689.08   600      8250

Independent variables
Plot size (m2)                     201.52    27.55     100      400
Period (d)                         78.97     17.73     50       114
Log period (d)                     1.89      0.0995    1.70     2.06
Stocking density (no. ha-1)        5878.79   1883.80   2000     10000
Log stocking density (no. ha-1)    3.75      0.128     3.30     4.00
Stocking size (g)                  12.80     9.41      1        44
Log stocking size (g)              0.958     0.404     0        1.64
Basal N application (kg ha-1)      63.11     13.18     40.2     79.5
Log basal N applic. (kg ha-1)      1.79      0.0961    1.60     1.90
Top dress N applic. (kg ha-1)      23.18     31.40     0        89.3
Basal P application (kg ha-1)      30.84     13.66     10.5     55.2
Log basal P applic. (kg ha-1)      1.44      0.205     1.02     1.74
Top dress P applic. (kg ha-1)      3.23      8.02      0        28
Herbicide application (dummy)      0.136     0.344     0        1
Basal insect. applic. (dummy)      0.707     0.456     0        1
No. of insecticide sprayings       1.19      0.879     0        3
Feed (dummy)                       0.222     0.417     0        1
Avg. max. air temperature (°C)     32.29     1.30      30.2     34.7
Avg. min. air temperature (°C)     22.84     0.690     21.2     24.0
Avg. daily wind speed (knots)      4.97      1.04      3.2      7.0
Avg. daily evaporation (mm)        5.50      1.39      3.5      7.4
Avg. daily rainfall (mm)           3.68      2.81      0.1      8.8
Avg. daily sunshine (min)          476.85    85.51     350      611
Results: summary of data
Multiple regression analysis of rice-fish data
Y = a + b1X1 + b2X2 + ... + bkXk + ε
with
Y : dependent variable
X1..k : independent or explanatory variables
b1..k : partial regression coefficients (slopes)
a : constant (intercept)
ε : residual
Estimate the model and assess it with R2 and F, α (significance of model)
Evaluate sign and significance of coefficients (t-test)
Standardize b’s: beta-weights
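As a rough sketch of how a and the b's can be estimated, the Python code below solves the least-squares normal equations with Gaussian elimination and then computes R2. It is an illustration under stated assumptions, not the procedure used in the original study: the function names are hypothetical, and the data are generated from a known equation (Y = 1 + 2 X1 + 3 X2, no residual) so the result can be checked by eye.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Least-squares estimates [a, b1, ..., bk] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def r_squared(X, y, coefs):
    """Coefficient of determination of the fitted model."""
    preds = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 without residuals.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]

coefs = fit_ols(X, y)  # [a, b1, b2]
r2 = r_squared(X, y, coefs)
```

For real datasets a statistics package is the better choice, since it also reports standard errors, t-tests and diagnostics that this sketch leaves out.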
Multiple regression analysis of rice-fish data
Results: estimation of 5 models

Dependent variable             R2       F-value   Sign. (α)
Gross fish yield (kg ha-1)     0.6571   52.013    <0.001
Net fish yield (kg ha-1)       0.5536   33.659    <0.001
Fish recovery (%)              0.4469   21.935    <0.001
Fish growth rate (g d-1)       0.3549   21.129    <0.001
Rice yield (kg ha-1)           0.7088   66.080    <0.001

Conclusion: significant models; 35-71% of variation in Y’s explained
Multiple regression analysis of rice-fish data
Multiple regression models for gross fish yield (kg ha-1). All partial regression coefficients (b's) were significant at the 0.1% level, except when marked * (5%) or ** (1%). Also given are the standard errors (SE) of the b's and the standardized b's or beta weights (rankings between brackets). Number of cases = 198

                                     Model 1                       Model 2
Independent variable                 b         SE      beta        b         SE      beta
Period (d)                           1.57      0.225   0.359 (4)
Log period (d)                                                     230.26    42.23   0.295 (5)
Stocking density (no. ha-1)          0.012     0.002   0.279 (6)
Log stocking density (no. ha-1)                                    136.31    27.31   0.225 (6)
Stocking size (g)                    3.78      0.432   0.458 (1)   3.82      0.423   0.463 (2)
Basal N application (kg ha-1)        1.74      0.283   0.296 (5)
Log basal N application (kg ha-1)                                  276.07    36.36   0.342 (4)
Basal P application (kg ha-1)        -2.05     0.318   -0.361 (3)
Log basal P application (kg ha-1)                                  -166.37   21.16   -0.439 (3)
No. of insecticide sprayings         -10.03*   4.15    -0.114 (7)
Maximum air temperature (°C)         26.97     3.21    0.452 (2)   28.67     3.17    0.481 (1)

Constant (a)                         -1022.83                      -2051.25
Coeff. of determination (R2)         0.6571                        0.6676
F-value                              52.013                        63.946
Probability                          <0.001                        <0.001
Durbin-Watson statistic              1.5120                        1.5738
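The F-values in this table are consistent with the standard relationship between R2 and the overall F statistic, F = (R2 / k) / ((1 - R2) / (n - k - 1)), with n cases and k independent variables. The short Python check below uses the reported R2 values, n = 198, and the number of independent variables listed for each model (7 and 6); small differences arise because R2 is rounded to four decimals.

```python
def f_statistic(r2, n_cases, n_predictors):
    """Overall F of a regression from its R2, sample size and predictor count."""
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_resid)

# Values reported for the gross fish yield models.
f_model1 = f_statistic(0.6571, 198, 7)  # reported F-value: 52.013
f_model2 = f_statistic(0.6676, 198, 6)  # reported F-value: 63.946

print(f"Model 1: F = {f_model1:.2f}")
print(f"Model 2: F = {f_model2:.2f}")
```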
Partial regression coefficients: b’s
Y = a + b1X1 + b2X2 + ... + bkXk + ε
Multiple regression analysis of rice-fish data
Model: gross fish yield

Independent variable              b        s.e.    beta (rank)
Period (d)                        1.57     0.225   0.359 (4)
Stocking density (ha-1)           0.012    0.002   0.279 (6)
Stocking size (g)                 3.78     0.432   0.458 (1)
Basal N application (kg ha-1)     1.74     0.283   0.296 (5)
Basal P application (kg ha-1)     -2.05    0.318   -0.361 (3)
No. of insecticide sprayings      -10.03   4.15    -0.114 (7)
Maximum air temperature (°C)      26.97    3.21    0.452 (2)
βk = bk • (sdXk / sdYk)
Multiple regression analysis of rice-fish data
[Figure 1: bar charts of beta weights (axis from -0.3 to 0.3) for the variables SS, SD, PER, N, P, H, B, I and T, in five panels: Gross yield, Net yield, Recovery, Growth rate, Rice yield.]
Figure 1. Beta-weights of variables in all models. Pesticides (dotted bars) were of minor importance for yield and recovery, had a strong negative effect on fish growth rate and positive effects on rice yields. Phosphorus fertilization (striped bars) showed a negative effect on all fish variables (PER = length of the culture period, SD = stocking density, SS = stocking size, N = basal nitrogen application, P = basal phosphorus application, H = herbicide application, B = basal insecticide application, I = number of insecticide sprayings, T = maximum air temperature).
Beta weights: standardized regression coefficients
Allow straight comparison between effects
Y = a + b1X1 + b2X2 + ... + bkXk + ε
βk = bk • (sdXk / sdYk)
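The beta-weight formula can be checked against the figures given for the gross fish yield model. The Python sketch below combines the partial regression coefficients (b) of that model with the standard deviations from the descriptive statistics (SD of gross fish yield = 77.66). Stocking density is left out because its b is reported only to two significant digits (0.012), which is too coarse to reproduce the published beta; matching the table's "Maximum air temperature" to the "Avg. max. air temperature" row is an assumption.

```python
SD_Y = 77.66  # SD of gross fish yield (kg ha-1), from the descriptive statistics

# name: (b from the gross fish yield model, SD of X from the descriptive statistics)
PREDICTORS = {
    "Period (d)": (1.57, 17.73),
    "Stocking size (g)": (3.78, 9.41),
    "Basal N application (kg ha-1)": (1.74, 13.18),
    "Basal P application (kg ha-1)": (-2.05, 13.66),
    "No. of insecticide sprayings": (-10.03, 0.879),
    "Maximum air temperature (°C)": (26.97, 1.30),
}

def beta_weight(b, sd_x, sd_y):
    """Standardized regression coefficient: beta = b * (sd_x / sd_y)."""
    return b * sd_x / sd_y

betas = {name: beta_weight(b, sd_x, SD_Y)
         for name, (b, sd_x) in PREDICTORS.items()}

# Rank by absolute size, since the sign only gives the direction of the effect.
ranking = sorted(betas, key=lambda name: abs(betas[name]), reverse=True)

for name in ranking:
    print(f"{name:32s} {betas[name]:7.3f}")
```

Because each beta is expressed in standard-deviation units, variables measured in days, grams or degrees can be compared directly, which the raw b's do not allow.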
Factor analysis
• Analyze interrelationships among a large no. of variables
• Explain in terms of underlying relationships (= factors = variates)
• Data reduction (reduce large no. of variables to 2-4 factors)
Example: water quality in fishponds (Bangladesh)
12 ponds, 4 treatments (fish stocked)
9 WQ parameters: temp, transp, pH, DO, alk, PO4, NH4, NO2, NO3, ChlA
Samples on 20 dates (1 May – 14 November), every 10 days
Dataset: 20 * 12 * 9 = 2160 numbers
Factor analysis
WQ Factors               WQF1            WQF2             WQF3            WQF4

Secchi                   0.65            0.20             0.34            -0.15
alkalinity               0.70            -0.08            -0.19           -0.02
pH                       0.61            0.04             -0.44           0.36
oxygen                   0.11            -0.47            -0.28           0.36
ammonia                  -0.57           0.49             -0.06           0.04
nitrite                  0.15            0.01             0.83            0.23
nitrate                  0.37            0.52             0.20            0.36
phosphate                -0.21           0.58             -0.23           0.51
chlorophyll              -0.30           -0.52            0.27            0.61

Variance explained (%)   21              16               14              12

Interpretation           liming effect   photosynthesis/  partial         P-limiting
                                         decomposition    nitrification   algae
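The "variance explained" row can be approximately reproduced from the loadings. The Python sketch below assumes that the nine variables listed were standardized (so that total variance equals the number of variables) and that they are the full set used in the analysis; neither assumption is stated in the slides. Under those assumptions the share explained by a factor is the sum of its squared loadings divided by the number of variables.

```python
# Factor loadings as reported in the table (WQF1..WQF4).
LOADINGS = {
    "Secchi":      [0.65, 0.20, 0.34, -0.15],
    "alkalinity":  [0.70, -0.08, -0.19, -0.02],
    "pH":          [0.61, 0.04, -0.44, 0.36],
    "oxygen":      [0.11, -0.47, -0.28, 0.36],
    "ammonia":     [-0.57, 0.49, -0.06, 0.04],
    "nitrite":     [0.15, 0.01, 0.83, 0.23],
    "nitrate":     [0.37, 0.52, 0.20, 0.36],
    "phosphate":   [-0.21, 0.58, -0.23, 0.51],
    "chlorophyll": [-0.30, -0.52, 0.27, 0.61],
}

def variance_explained_pct(loadings):
    """Percent of total variance per factor: 100 * sum(loading^2) / n_vars."""
    n_vars = len(loadings)
    n_factors = len(next(iter(loadings.values())))
    return [100.0 * sum(row[j] ** 2 for row in loadings.values()) / n_vars
            for j in range(n_factors)]

def communalities(loadings):
    """Share of each variable's variance captured by all factors together."""
    return {name: sum(v ** 2 for v in row) for name, row in loadings.items()}

pct = variance_explained_pct(LOADINGS)
comm = communalities(LOADINGS)

for j, p in enumerate(pct, start=1):
    print(f"WQF{j}: {p:.1f}% of variance")
```

The results land within about one percentage point of the reported 21, 16, 14 and 12%; the remaining gap is expected given that the loadings are rounded to two decimals.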
Factor analysis
For each pond and date, the value of the factor can be calculated, e.g.:
Measured values on 1-May:
  temp 31.3
  secchi 31
  alk 154
  pH 8.03
  DO 9.8
  NHx 0.8
  NO2 0.08
  NO3 2.0
  ChlA 177
  PO4 1.5

Factor values on 1-May:
  WQ1     WQ2     WQ3     WQ4
  0.81    -1.00   0.23    1.08
Factors can be plotted, e.g., in relation to time (see pdf)
Model estimation: software
Some popular statistics software packages:
SPSS         http://www.spss.com
SAS          http://www.sas.com
SYSTAT       http://www.systat.com
Statistica   http://www.statsoftinc.com
Minitab      http://www.minitab.com
LISREL       http://www.ssicentral.com/lisrel/mainlis.htm
Analyse-it   www.analyse-it.com (software add-in for Excel)
Etcetera!
Some further reading
Doucet, P. and P.B. Sloep (1992) Mathematical modeling in the life sciences. Ellis Horwood, New York. 490 p.
Hair, J.F., R. E. Anderson, R.L. Tatham and W.C. Black (1998) Multivariate data analysis, 5th Edition. Prentice Hall International, Inc., New Jersey.
Kelly, L.A., A. Bergheim and M.M. Hennesy (1994) Predicting output of ammonium from fish farms. Water Research 28, 1403-1405.
Lindstrom, M., L. Hakanson, O. Abrahamsson and H. Johansson (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185-198.
Milstein, A., M.A. Wahab and M.M. Rahman (2002) Environmental effects of common carp Cyprinus carpio (L.) and mrigal Cirrhinus mrigala (Hamilton) as bottom feeders in major Indian carp polycultures. Aquaculture Research 33, 1103-1117.
Prein, M., G. Hulata and D. Pauly (1993) Multivariate methods in aquaculture research: case studies of tilapias in experimental and commercial systems. ICLARM Stud. Rev. 20, 221 p.
Van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture experiments: a rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.