DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS Dr. Anne van Dam

Page 1: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

DATA HANDLING AND PRESENTATION

MULTIVARIATE ANALYSIS

Dr. Anne van Dam

Page 2: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multivariate data analysis: lecture outline

• What is it?
  - Why/when use it
  - Classification of techniques

• Structured approach to multivariate analysis
  - Problem definition, objectives
  - Analysis plan
  - Assumptions
  - Model estimation
  - Interpretation
  - Validation

• Examples
  - multiple regression analysis
  - factor analysis (Dr. Peter Kelderman)

Page 3: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multivariate data analysis – what is it?

• Univariate analysis: single-variable distributions

• Bivariate analysis: analysis of relationships between two variables
  - cross-classification
  - correlation
  - one-way analysis of variance (ANOVA)
  - simple regression

• Multivariate analysis : analysis of > 2 variables

Page 4: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Univariate versus bivariate analysis: examples

[Figure: Univariate analysis: histogram of dissolved oxygen in a fishpond. x-axis: DO concentration (mg/l); y-axis: no. of observations (frequency). Mean = 4.78, S.D. = 0.95]

[Figure: Bivariate analysis: phosphorus and chlorophyll a in lakes]

Page 5: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multivariate data analysis: example

Predicting the phosphorus concentrations of lakes

Page 6: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

What is a variate?

Building block of multivariate analysis is the “variate”

Definition: a variate is a linear combination of variables with empirically determined weights

Variate value = w1X1 + w2X2 + w3X3 + ... + wnXn

Xn : observed variable
wn : weight determined by multivariate technique

The value of the variate represents the combination of the entire set of variables
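The definition above can be illustrated with a minimal sketch. The weights and observed values below are hypothetical and only show the arithmetic; in practice the weights come out of the multivariate technique.

```python
def variate_value(weights, values):
    """Return w1*X1 + w2*X2 + ... + wn*Xn for one observation."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical weights and observed variables, for illustration only.
weights = [0.5, -1.0, 2.0]
observed = [4.0, 1.5, 0.25]

score = variate_value(weights, observed)
print(score)  # 0.5*4.0 - 1.0*1.5 + 2.0*0.25 = 1.0
```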

Page 7: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Variate: example from multiple regression

log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985
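A small sketch of how this regression variate could be evaluated for one lake. The coefficients are those on the slide; the base-10 logarithm and the input values for TP, pH and DR are assumptions made for illustration (the slide does not state the log base or the units).

```python
import math


def predict_log_spm(tp, ph, dr):
    """Evaluate the regression variate from the slide.

    Assumes base-10 logarithms; the slide does not specify the base.
    """
    return 1.148 * math.log10(tp) + 0.137 * ph + 0.286 * math.log10(dr) - 1.985


# Hypothetical inputs, chosen so the arithmetic is easy to follow.
log_spm = predict_log_spm(tp=100.0, ph=7.0, dr=1.0)
spm = 10 ** log_spm  # back-transform to the original scale

print(round(log_spm, 3))  # 1.148*2 + 0.137*7 + 0 - 1.985 = 1.27
```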

Page 8: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Metric and non-metric data

Non-metric or qualitative:
  - attributes, characteristics, or categorical properties
  - types, classes, absence/presence
  - Example: male/female
  - Nominal (or categorical) scale: class symbols have no quantitative meaning (e.g., female = 1, male = 2)
  - Ordinal scale: variables can be ranked according to scale (e.g., level of agreement in survey: don’t agree = 1, don’t know = 2, agree = 3)

Metric or quantitative:
  - differing in amount or degree
  - Example: temperature
  - Interval scale: arbitrary zero point (e.g., temperature in Celsius versus Fahrenheit)
  - Ratio scale: absolute zero point (e.g., weight or length)

Page 9: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Classification of multivariate techniques

Dependence techniques:
variable or set of variables is identified as the dependent variable to be predicted or explained by independent variables
Example: multiple regression analysis

Interdependence techniques:
set of variables is analysed simultaneously without defining dependence relationships
Example: factor analysis

Page 10: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multivariate dependence methods (one dependent variable)

• Analysis of variance
  Y1 (metric) = X1 + X2 + X3 + ... + Xn (nonmetric)

• Multiple discriminant analysis
  Y1 (nonmetric) = X1 + X2 + X3 + ... + Xn (metric)

• Multiple regression analysis
  Y1 (metric) = X1 + X2 + X3 + ... + Xn (metric, nonmetric)

Page 11: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multivariate dependence methods (multiple dependent variables)

• Canonical correlation
  Y1 + Y2 + Y3 + ... + Yn (metric, nonmetric) = X1 + X2 + X3 + ... + Xn (metric, nonmetric)

• Multivariate analysis of variance
  Y1 + Y2 + Y3 + ... + Yn (metric) = X1 + X2 + X3 + ... + Xn (nonmetric)
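The five dependence methods listed on these two slides can be written down as a small lookup, which makes the overlap between techniques visible. This is only a sketch of the slides' classification; the function name and the "one"/"several" labels are illustrative choices.

```python
# (technique, number of dependent variables, allowed Y scales, allowed X scales)
# as listed on the slides.
DEPENDENCE_METHODS = [
    ("analysis of variance", "one", {"metric"}, {"nonmetric"}),
    ("multiple discriminant analysis", "one", {"nonmetric"}, {"metric"}),
    ("multiple regression analysis", "one", {"metric"}, {"metric", "nonmetric"}),
    ("canonical correlation", "several",
     {"metric", "nonmetric"}, {"metric", "nonmetric"}),
    ("multivariate analysis of variance", "several", {"metric"}, {"nonmetric"}),
]


def candidate_methods(n_dependent, y_scale, x_scale):
    """Return the listed techniques compatible with the given data types."""
    return [
        name
        for name, n_dep, y_scales, x_scales in DEPENDENCE_METHODS
        if n_dep == n_dependent and y_scale in y_scales and x_scale in x_scales
    ]


print(candidate_methods("one", "metric", "metric"))
print(candidate_methods("one", "metric", "nonmetric"))
```

More than one technique can match a given combination, so the lookup narrows the choice but does not replace the objectives-based selection discussed in the structured approach.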

Page 12: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

[Decision tree for selecting a technique, dependence branch]

Type of relationship? dependence
  How many variables are predicted?
    - Multiple relationships of dependent and independent variables: structural equation modelling
    - Several dependent variables in a single relationship. Measurement scale of dependent variable?
        metric, then measurement scale of predictor variable?
          metric: canonical correlation analysis
          nonmetric: multivariate analysis of variance
        nonmetric: canonical correlation w/ dummy variables
    - One dependent variable in a single relationship. Measurement scale of dependent variable?
        metric: multiple regression; conjoint analysis
        nonmetric: multiple discriminant analysis; linear probability models

Source: Hair et al. 1998. Multivariate data analysis, 5th ed.

Page 13: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

[Decision tree for selecting a technique, interdependence branch]

Type of relationship? interdependence
  Structure of relationships among:
    - variables: factor analysis
    - cases / respondents: cluster analysis
    - objects. How are the attributes measured?
        metric: multidimensional scaling
        nonmetric: multidimensional scaling; correspondence analysis

Source: Hair et al. 1998. Multivariate data analysis, 5th ed.

Page 14: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Classification of multivariate techniques

Exploratory methods:
Main objective is to identify interrelationships and structures among variables. Reduction of large number of variables to a few key components
Examples: principal component and factor analysis; cluster analysis; multidimensional scaling

Confirmatory methods:
Main objective is to test hypothesized relationships between variables. Researcher has a-priori understanding of relationships
Examples: correlation analysis; multiple regression; canonical correlation; analysis of variance; discriminant analysis; conjoint analysis; structural equation modelling

Page 15: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Structured approach to multivariate analysis

Stage 1: Define problem, objectives, and choose technique

Stage 2: Develop analysis plan and evaluate assumptions of multivariate technique

Stage 3: Estimate the model

Stage 4: Interpret the variates

Stage 5: Validate the model

Page 16: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

1. Define problem, objectives, and technique

• Develop a conceptual model
  - relationships between variables
  - structure, similarities
  - cause and effect

• Define the objective(s) of the model
  - prediction or exploration
  - understanding of the system

• Choose technique in relation to objectives and data types
  - dependence/interdependence
  - metric/nonmetric

Page 17: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

2. Develop analysis plan and evaluate assumptions underlying technique

• Analysis plan:
  - minimum/desired sample size
  - allowable/required variable types
  - special variable formulation

• Assumptions:
  - normal distribution
  - linearity
  - independence of error terms
  - equality of variance (homoscedasticity)

• Data manipulation / transformation
  - e.g., logarithmic or arc sine transformation
  - dummy variables
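The transformations named above can be sketched as follows. The base-10 logarithm and the arcsin(sqrt(p)) form of the arc sine transformation for percentages are the usual conventions and are assumed here; the slide does not spell them out. The function names are illustrative.

```python
import math


def log10_transform(x):
    """Logarithmic transformation (base 10 assumed) for positive values."""
    return math.log10(x)


def arcsine_transform(percent):
    """Arc sine square-root transformation for a percentage in 0..100.

    Assumes the common arcsin(sqrt(p)) form, result in radians.
    """
    return math.asin(math.sqrt(percent / 100.0))


def dummy(applied):
    """Dummy variable: 1 if a treatment was applied, 0 otherwise."""
    return 1 if applied else 0


print(log10_transform(10000))            # 4.0
print(round(log10_transform(2000), 2))   # 3.3
print(arcsine_transform(100.0) == math.pi / 2)
print(dummy(True), dummy(False))         # 1 0
```

Percentages such as fish recovery are the kind of variable for which the arc sine form is typically considered, and yes/no inputs such as herbicide application are natural dummy variables.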

Page 18: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

3. Estimate the model

• Model estimation

• Assessment of model fit
  - R2
  - if necessary, improve fit (e.g., rotation in factor analysis)
  - iteration

• Evaluation of the model
  - statistical significance (model, parameters)
  - outliers?
  - “robustness”

Page 19: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

4. Interpret the variates

• Examination of estimated coefficients (weights) for each variable in the variate

e.g., multiple regression:
log(SPM) = 1.148 log(TP) + 0.137 pH + 0.286 log(DR) - 1.985

• Interpretation of multiple variates as underlying “dimensions”
  - e.g., principal components

• If necessary: reformulation of model

Page 20: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

5. Validate the model

• How general is the model?
  - does it apply to the total population?
  - can the model be used for prediction?

• Method: check model fit with independent dataset
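A minimal sketch of this kind of check, assuming NumPy and entirely synthetic data: a regression is estimated on one half of the data and its fit (R2) is then computed on the other half, which plays the role of the independent dataset. The helper names and the 50/50 split are illustrative choices, not a prescribed procedure.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares estimate of [a, b1, ..., bk] for y = a + X b."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r_squared(coef, X, y):
    """Coefficient of determination of a fitted model on data (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))


# Synthetic data: two predictors, known coefficients, random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

train, holdout = slice(0, 100), slice(100, 200)
coef = fit_ols(X[train], y[train])

r2_train = r_squared(coef, X[train], y[train])
r2_holdout = r_squared(coef, X[holdout], y[holdout])
print(r2_train, r2_holdout)
```

A holdout R2 that is much lower than the R2 on the estimation data would suggest that the model does not generalize well.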

Page 21: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Structured approach to multivariate analysis

Define problem, objectives, and choose technique

Analysis plan and evaluate assumptions, data manipulation

Estimate the model

Interpret the variates

Validate the model

iteration!!

Page 22: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Examples of multivariate analysis

  - Principal components and factor analysis
  - Multiple regression analysis
  - Multiple discriminant analysis
  - Multivariate analysis of variance and covariance
  - Conjoint analysis
  - Canonical correlation
  - Cluster analysis
  - Multidimensional scaling
  - Correspondence analysis
  - Linear probability models
  - Structural equation modelling
  - Data mining and warehousing
  - Neural networks
  - Resampling

Page 23: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data*

• Objectives
  - explain variation in fish and rice production from input and climate data
  - exploratory analysis, “meta analysis”

• Background
  - 50 experiments on rice-fish culture in unpublished reports
  - problem: low power of single pond experiments (many type II errors)
  - multivariate analysis allows new look at data

• Choice of technique: multiple linear regression
  - relate one metric variable (rice, fish yield) to multiple metric explaining variables

*Source: van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture experiments: a rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.

Page 24: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data in the Philippines

System description

Nile tilapia (Oreochromis niloticus L.)

Page 25: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Analysis plan / methodology / assumptions

Multiple regression analysis of rice-fish data

• Database management

• First analysis: plots, correlation matrix

• Transformations/re-expression

Page 26: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data

Database management

Page 27: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Data plot; trends

[Figure: fish recovery (%) and growth rate (g/day) plotted against fish stocking size (g)]

Page 28: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Data plot; trends

[Figure: net fish yield (kg/ha) plotted against recovery (%), with fitted line y = 1.8516x - 50.993, R2 = 0.4605]

Page 29: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data

Results: summary of data

DESCRIPTIVE STATISTICS OF RICE-FISH DATASET. NO. OF CASES (N) = 198

Name                                    Mean        SD      Min.      Max.

Dependent variables
Gross fish yield (kg ha-1)            122.86     77.66       2.5     390.0
Net fish yield (kg ha-1)               51.35     68.62    -105.0     300.0
Fish recovery percentage (%)           55.62     25.49         1       100
Fish growth rate (g d-1)               0.347     0.123      0.14      0.84
Rice yield (kg ha-1)                 4337.46   1689.08       600      8250

Independent variables
Plot size (m2)                        201.52     27.55       100       400
Period (d)                             78.97     17.73        50       114
Log period (d)                          1.89    0.0995      1.70      2.06
Stocking density (no. ha-1)          5878.79   1883.80      2000     10000
Log stocking density (no. ha-1)         3.75     0.128      3.30      4.00
Stocking size (g)                      12.80      9.41         1        44
Log stocking size (g)                  0.958     0.404         0      1.64
Basal N application (kg ha-1)          63.11     13.18      40.2      79.5
Log basal N applic. (kg ha-1)           1.79    0.0961      1.60      1.90
Top dress N applic. (kg ha-1)          23.18     31.40         0      89.3
Basal P application (kg ha-1)          30.84     13.66      10.5      55.2
Log basal P applic. (kg ha-1)           1.44     0.205      1.02      1.74
Top dress P applic. (kg ha-1)           3.23      8.02         0        28
Herbicide application (dummy)          0.136     0.344         0         1
Basal insect. applic. (dummy)          0.707     0.456         0         1
No. of insecticide sprayings            1.19     0.879         0         3
Feed (dummy)                           0.222     0.417         0         1
Avg. max. air temperature (°C)         32.29      1.30      30.2      34.7
Avg. min. air temperature (°C)         22.84     0.690      21.2      24.0
Avg. daily wind speed (knots)           4.97      1.04       3.2       7.0
Avg. daily evaporation (mm)             5.50      1.39       3.5       7.4
Avg. daily rainfall (mm)                3.68      2.81       0.1       8.8
Avg. daily sunshine (min)             476.85     85.51       350       611

Page 30: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data*

Y = a + b1X1 + b2X2 + ... + bkXk + ε

with

Y : dependent variable
X1..k : independent or explaining variables
b1..k : partial regression coefficients (slopes)
a : constant (intercept)
ε : residual

Estimate model with R2 and F, α (significance of model)
Evaluate sign and significance of coefficients (t-test)
Standardize b’s: beta-weights
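The estimation step can be sketched with ordinary least squares in NumPy. The data below are synthetic and the function name is illustrative; a statistics package would additionally report the t-tests and the significance level α, which are left out here.

```python
import numpy as np


def ols_summary(X, y):
    """Fit Y = a + b1X1 + ... + bkXk by least squares.

    Returns the intercept a, the slopes b, R2 and the F-value of the model.
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])        # column of ones for the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef                        # residuals (epsilon)
    ss_res = float(resid @ resid)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    f_value = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return coef[0], coef[1:], r2, f_value


# Synthetic example with known coefficients a = 3, b1 = 2, b2 = -1.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

a, b, r2, f_value = ols_summary(X, y)
print(a, b, r2, f_value)
```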

Page 31: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data

Results: estimation of 5 models

Dependent variable                R2    F-value   Sign. (α)
Gross fish yield (kg ha-1)    0.6571     52.013      <0.001
Net fish yield (kg ha-1)      0.5536     33.659      <0.001
Fish recovery (%)             0.4469     21.935      <0.001
Fish growth rate (g d-1)      0.3549     21.129      <0.001
Rice yield (kg ha-1)          0.7088     66.080      <0.001

Conclusion: significant models; 35-71% of variation in Y’s explained

Page 32: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data

Multiple regression models for gross fish yield (kg ha-1). All partial regression coefficients (bs) were significant at the 0.1% level, except when marked * (5%) or ** (1%). Also given are the standard errors (SE) of the bs and the standardized bs or beta weights (rankings between brackets). Number of cases = 198

                                            Model 1                          Model 2
Independent variables                      b      SE       beta            b      SE       beta
Period (d)                              1.57   0.225   0.359(4)            -       -          -
Log period (d)                             -       -          -       230.26   42.23   0.295(5)
Stocking density (no. ha-1)            0.012   0.002   0.279(6)            -       -          -
Log stocking density (no. ha-1)            -       -          -       136.31   27.31   0.225(6)
Stocking size (g)                       3.78   0.432   0.458(1)         3.82   0.423   0.463(2)
Basal N application (kg ha-1)           1.74   0.283   0.296(5)            -       -          -
Log basal N application (kg ha-1)          -       -          -       276.07   36.36   0.342(4)
Basal P application (kg ha-1)          -2.05   0.318  -0.361(3)            -       -          -
Log basal P application (kg ha-1)          -       -          -      -166.37   21.16  -0.439(3)
No. of insecticide sprayings         -10.03*    4.15  -0.114(7)            -       -          -
Maximum air temperature (°C)           26.97    3.21   0.452(2)        28.67    3.17   0.481(1)

                                         Model 1     Model 2
Constant (a)                            -1022.83    -2051.25
Coeff. of determination (R2)              0.6571      0.6676
F-value                                   52.013      63.946
Probability                               <0.001      <0.001
Durbin-Watson statistic                   1.5120      1.5738
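As a consistency check on the table above, the F-value of a regression model can be recovered from R2, the number of cases and the number of independent variables, using the standard relation F = (R2 / k) / ((1 - R2) / (n - k - 1)). The sketch below applies it to the reported values, counting 7 independent variables in Model 1 and 6 in Model 2; the function name is illustrative.

```python
def f_from_r2(r2, n_cases, n_predictors):
    """F-value of a regression model from its R2 and degrees of freedom."""
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


# Values taken from the table: N = 198 cases.
f_model1 = f_from_r2(0.6571, 198, 7)  # Model 1: 7 independent variables
f_model2 = f_from_r2(0.6676, 198, 6)  # Model 2: 6 independent variables

print(round(f_model1, 2), round(f_model2, 2))
```

Both results agree with the reported F-values (52.013 and 63.946) to within the rounding of R2 to four decimals.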

Page 33: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Partial regression coefficients: b’s

Y = a + b1X1 + b2X2 + ... + bkXk + ε

Multiple regression analysis of rice-fish data

Model: gross fish yield

Variable                                 b    s.e.   beta (rank)
Period (d)                            1.57   0.225      0.359(4)
Stocking density (ha-1)              0.012   0.002      0.279(6)
Stocking size (g)                     3.78   0.432      0.458(1)
Basal N application (kg ha-1)         1.74   0.283      0.296(5)
Basal P application (kg ha-1)        -2.05   0.318     -0.361(3)
No. of insecticide sprayings        -10.03    4.15     -0.114(7)
Maximum air temperature (°C)         26.97    3.21      0.452(2)

βk = bk • (sdXk / sdY)
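The beta-weight formula can be checked against the numbers in this deck: the b values come from the gross fish yield model, and the standard deviations from the descriptive statistics table (SD of gross fish yield = 77.66). The function name below is illustrative.

```python
def beta_weight(b, sd_x, sd_y):
    """Standardized regression coefficient: beta_k = b_k * sd(X_k) / sd(Y)."""
    return b * sd_x / sd_y


SD_GROSS_YIELD = 77.66  # SD of gross fish yield (kg ha-1)

beta_size = beta_weight(3.78, 9.41, SD_GROSS_YIELD)     # stocking size (g)
beta_period = beta_weight(1.57, 17.73, SD_GROSS_YIELD)  # period (d)
beta_temp = beta_weight(26.97, 1.30, SD_GROSS_YIELD)    # max. air temperature

print(round(beta_size, 3), round(beta_period, 3), round(beta_temp, 3))
```

The results are close to the reported beta weights (0.458, 0.359 and 0.452); small differences are expected because the b values and standard deviations are themselves rounded.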

Page 34: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Multiple regression analysis of rice-fish data

Beta weights: standardized regression coefficients

Allow straight comparison between effects

Y = a + b1X1 + b2X2 + ... + bkXk + ε

βk = bk • (sdXk / sdY)

[Figure 1: bar charts of beta weights (axis from -0.3 to 0.3) for the variables SS, SD, PER, N, P, H, B, I and T, in five panels: Gross yield, Net yield, Recovery, Growth rate, Rice yield]

Figure 1. Beta-weights of variables in all models. Pesticides (dotted bars) were of minor importance for yield and recovery, had a strong negative effect on fish growth rate and positive effects on rice yields. Phosphorus fertilization (striped bars) showed a negative effect on all fish variables (PER=length of the culture period, SD=stocking density, SS=stocking size, N=basal nitrogen application, P=basal phosphorus application, H=herbicide application, B=basal insecticide application, I=number of insecticide sprayings, T=maximum air temperature).

Page 35: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Factor analysis

• Analyze interrelationships among large no. of variables
• Explain in terms of underlying relationships (= factors = variates)
• Data reduction (reduce large no. of variables to 2-4 factors)

Example: water quality in fishponds (Bangladesh)

12 ponds, 4 treatments (fish stocked)
9 WQ parameters: temp, transp, pH, DO, alk, PO4, NH4, NO2, NO3, ChlA

Samples on 20 dates (1 May – 14 November), every 10 days

Dataset: 20 * 12 * 9 = 2160 numbers
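To give a feel for the data reduction step, here is a minimal sketch that extracts principal components from the correlation matrix of a dataset with the same shape (240 pond-date rows, 9 variables). It uses NumPy and random synthetic numbers, not the pond data, and it performs an unrotated principal component extraction, which is only one of several ways a factor analysis can be carried out; it is not necessarily the method used for the results in this lecture.

```python
import numpy as np


def principal_components(data):
    """Unrotated principal components of the correlation matrix.

    Returns loadings (variables x components), the fraction of variance
    explained by each component, and the component scores per row.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    explained = eigvals / eigvals.sum()
    scores = z @ eigvecs
    return loadings, explained, scores


# Synthetic stand-in for the 20 dates x 12 ponds x 9 parameters dataset.
rng = np.random.default_rng(1)
data = rng.normal(size=(20 * 12, 9))
data[:, 1] += data[:, 0]  # make two variables correlated

loadings, explained, scores = principal_components(data)
print(np.round(explained[:4] * 100))  # % variance of the first four components
```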

Page 36: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Factor analysis

WQ Factors                 WQF1     WQF2     WQF3     WQF4

Secchi                     0.65     0.20     0.34    -0.15
alkalinity                 0.70    -0.08    -0.19    -0.02
pH                         0.61     0.04    -0.44     0.36
oxygen                     0.11    -0.47    -0.28     0.36
ammonia                   -0.57     0.49    -0.06     0.04
nitrite                    0.15     0.01     0.83     0.23
nitrate                    0.37     0.52     0.20     0.36
phosphate                 -0.21     0.58    -0.23     0.51
chlorophyll               -0.30    -0.52     0.27     0.61

Variance explained (%)       21       16       14       12

Interpretation:
  WQF1: liming effect
  WQF2: photosynthesis / decomposition
  WQF3: partial nitrification
  WQF4: P-limiting algae

Page 37: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Factor analysis

For each pond and date, the value of the factor can be calculated, e.g.:

1-May, observed values: temp 31.3; secchi 31; alk 154; pH 8.03; DO 9.8; NHx 0.8; NO2 0.08; NO3 2.0; ChlA 177; PO4 1.5

1-May, factor values: WQ1 = 0.81; WQ2 = -1.00; WQ3 = 0.23; WQ4 = 1.08

Factors can be plotted, e.g., in relation to time (see pdf)

Page 38: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Model estimation: software

Some popular statistics software packages:

SPSS http://www.spss.com
SAS http://www.sas.com
SYSTAT http://www.systat.com
Statistica http://www.statsoftinc.com
Minitab http://www.minitab.com
LISREL http://www.ssicentral.com/lisrel/mainlis.htm
Analyse-it www.analyse-it.com (software add-in for Excel)

Etcetera!

Page 39: DATA HANDLING AND PRESENTATION MULTIVARIATE ANALYSIS

Some further reading

Doucet, P. and P.B. Sloep (1992) Mathematical modeling in the life sciences. Ellis Horwood, New York. 490 p.

Hair, J.F., R.E. Anderson, R.L. Tatham and W.C. Black (1998) Multivariate data analysis, 5th Edition. Prentice Hall International, Inc., New Jersey.

Kelly, L.A., A. Bergheim and M.M. Hennesy (1994) Predicting output of ammonium from fish farms. Water Research 28, 1403-1405.

Lindstrom, M., L. Hakanson, O. Abrahamsson and H. Johansson (1999) An empirical model for prediction of lake water suspended particulate matter. Ecological Modelling 121, 185-198.

Milstein, A., M.A. Wahab and M.M. Rahman (2002) Environmental effects of common carp Cyprinus carpio (L.) and mrigal Cirrhinus mrigala (Hamilton) as bottom feeders in major Indian carp polycultures. Aquaculture Research 33, 1103-1117.

Prein, M., G. Hulata and D. Pauly (1993) Multivariate methods in aquaculture research: case studies of tilapias in experimental and commercial systems. ICLARM Stud. Rev. 20, 221 p.

Van Dam, A.A. (1990) Multiple regression analysis of accumulated data from aquaculture experiments: a rice-fish culture example. Aquaculture and Fisheries Management 21, 1-15.