
NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 1 Introduction John Birks


TEACHING OF THE COURSE

Course leader: Gavin Simpson (UCL)
Lectures 1, 4, 5, 8, 10, 12: John Birks (Bergen & UCL)
Lectures 2, 3, 6, 7, 9, 11: Gavin Simpson (UCL)
Practicals 1-10: Gavin Simpson (UCL)
Course administration: Adam Young (UCL)

INTRODUCTION

• Book list
• Level of course
• Aims of course
• What are multivariate data?
• What is multivariate data analysis?
• Aims of multivariate data analysis
• Why do multivariate data analysis?
• Terminology
• Types of variables
• Geometrical models and concept of similarity (dissimilarity or distance)
• Computing
• Course topics

LEVEL OF THE COURSE

Approach is from a practical biological and geological viewpoint, not from the viewpoint of statistical theory.

Assumes no background in matrix algebra, eigenanalysis, or statistical theory.

Emphasis is on techniques that are ecologically realistic and useful and that are computationally feasible.


“Truths which can be proved can also be known by faith. The proofs are difficult and can only be understood by the learned; but faith is necessary also to the young, and to those who, from practical preoccupations, have not the leisure to learn. For them, revelation suffices.” 

Bertrand Russell (1946) The History of Western Philosophy

“It cannot be too strongly emphasised that a long mathematical argument can be fully understood on first reading only when it is very elementary indeed, relative to the reader’s mathematical knowledge. If one wants only the gist of it, he may read such material once only, but otherwise he may expect to read it at least once again. Serious reading of mathematics is best done sitting bolt upright on a hard chair at a desk. Pencil and paper are indispensable.”

L Savage (1972) The Foundations of Statistics

BUT:

“A journey of a thousand miles begins with a single step”

Lao Tsu


STATUS OF MULTIVARIATE NUMERICAL DATA ANALYSIS

Basic mathematics of correlation, regression, analysis of variance, eigenanalysis, randomisation, etc. is not new; it was worked out in the 1920s-1930s.

The arithmetic manipulations and calculations involved are so numerous and so time-consuming that it was virtually impossible to work with anything other than the smallest data sets on a hand calculator or early computer.

Development of numerical data analysis closely linked to development of computers.

Now possible to do in seconds what would have taken hours, days, even weeks.

Increased availability of computer program packages has advantages and disadvantages.

Advantages

• fast

• painless

• simple

Disadvantages

• too fast

• too easy

• too simple

Need to understand a technique well before one can critically evaluate results. Sound interpretation requires a good understanding of the technique.

AIMS

Provide an introductory understanding of the most appropriate methods for the numerical analysis of complex multivariate biological and environmental data. Recent maturation of methods.

Provide an introduction to what these methods do and do not do.

Provide some guidance as to when and when not to use particular methods.

Provide an outline of the major assumptions, limitations, strengths, and weaknesses of different methods.

Indicate to you when to seek expert advice.

Encourage numerical thinking (ideas, reasons, potentialities behind the techniques). Not so concerned here with numerical arithmetic (the numerical manipulations involved).

ON THE USES AND METHODS OF STATISTICS

By Professor F. Y. Edgeworth, M.A., D.C.L.

Syllabus for Edgeworth’s 1892 Newmarch Lectures, University College London

I. FIRST PRINCIPLES

The extent of the subject here treated is that which is denoted by two leading definitions of statistics, viz: the study of numerical statements relating to society, and the theory of means. The subject may be divided according as the element of induction is more or less prevalent. First come general directions as to the acquisition of data; e.g., that figures should be accurate, and terms unambiguous. Examples of the violation of these rules; together with other precepts and cautions. Use of relative figures (per head, per cent, &c.). Analysis of the data.

References: Conférences sur la Statistique (Rozier Editeur), 1891; Pidgin, Practical Statistics, 1888; Giffen, International Statistical Comparisons, Economic Journal, June, 1892.

II. GRAPHICAL METHODS

The Cartesian system of co-ordinates. Integration and interpolation. Case where several dependent variables ( i.e. diseases from different causes) are referred to one independent variable ( i.e. the time). The case of one variable dependent on two independent variables is properly represented by a surface; but curves of level and variously coloured planes are more convenient. Methods of expressing variation of a quantity relative to its initial, or average, value. Miscellaneous devices for exhibiting numerical relations to the eye.  

References: Marey, La Méthode Graphique, 1885; Favaro, Leçons de Statique Graphique (translated into French by Terrier), Ch. V. with appendix by the translator. Levasseur, La Statistique Graphique, Journal of the Statistical Society, Jubilee vol., 1885; Marshall, The Graphic Method of Statistics, Ibid; Cheysson, Les Cartogrammes à teintes graduées, Journal de la Société de Statistique de Paris, 1887; Scribner’s Statistical Atlas of the United States; Longstaff, Studies in Statistics, 1891.


III. THE DOCTRINE OF AVERAGES

The general idea of a mean comprehends innumerable species, of which the most important are, the Arithmetic Mean, the Median, the Greatest Ordinate (or centre of greatest condensation) and the Geometric Mean. A cross division is between simple and weighted means. Concrete instances of these varieties. Subtle distinction between so-called objective and subjective means. Peculiar prestige attaches to the means of which the constituents are grouped according to the Probability Curve, or law of error. A priori demonstration, and empirical verification, that this form arises under certain conditions.

References: Venn, Logic of Chance, Third Edition, 1888, chap. xviii. and xix.; On….Averages. Journal of the Statistical Society, 1891; Galton, Statistics by inter-comparison, Philosophical Magazine, 1875; Bertillon, Moyenne, Dictionnaire Encyclopédique des Sciences Médicales; Edgeworth, On the Choice of Means, Phil. Mag., 1887, On the empirical proof of the law of error, Ib., 1887.

IV. TYPES AND CORRELATIONS

The ‘mean man’ has for stature, length of cubit, height of knee, &c, the respective means of the statures, lengths, &c., of a greater number of men. Reply to the objection that such a combination of partial means may not form a possible whole. Relation between the deviation of one organ or attribute, e.g. length of cubit, from its mean; as established by Mr. Galton, and illustrated by Mr. H. Dickson. Abridged method of ascertaining the co-efficient which expresses the correlation between three attributes, e.g. stature, length of cubit and height of knee. The formula for the most probable attribute, e.g. stature corresponding to assigned values of two other attributes, e.g. length of cubit and height of knee, may be ascertained either from three simple correlations, between stature and cubit, stature and height of knee, cubit and height of knee; or by observations special to the case of three variables. Correlation between any number of attributes.

References: Quetelet, Anthropométrie; Galton, Family Likeness in Stature, Proceedings of the Royal Society, 1886; Co-relations and their measurements Ibid. 1888; Weldon, Correlated Variations, Ibid, 1892.


V. THE STATISTICAL PART OF INDUCTIVE LOGIC

Passing Insurance and other direct applications of statistics, we come to the investigation of causes. The inductive method to which statistics lends itself, the Method of Agreement, is liable to the fallacy Post hoc propter hoc; of which numerous examples occur. The Method of Concomitant variations is facilitated by the use of parallel curves. The Method of Residues is exemplified when in comparing the death rates of different classes, we make allowance for their different ages; and in similar cases.

 References: Mill, Logic; Giffen, Essays on Finance, and Article in June No. of Economic Journal; Humphreys, Value of death rates as a test of Sanitary conditions, Journal of the Statistical Society, 1874, Class Mortality Statistics, Ibid, 1887.

VI. THE ELIMINATION OF CHANCE

 One case of the Method of Residues, for which there exists a technical apparatus, is where the agency allowed for consists of those “fleeting causes” called chance. The simple method of eliminating chance, described by Mill (Logic, iii, xviii, 4) and the higher method derived from the theory of error. The latter method is particularly applicable where the deviation from the average value of a ratio – e.g. that between male and female births – follows the analogy of the simpler games of chance. In other cases the higher theory affords rather regulative ideas than exact conclusions; in this respect, comparable to the use of the mathematical theory of economics.

References: Westergaard, Grundzüge der Theorie der Statistik, 1891; Duesing, Das geschlechtverhaltniss in Preussen, 1890; Edgeworth, Methods of Statistics, Journal of the Statistical Society, Jubilee vol., 1885.

[The lectures were presented on six consecutive Wednesdays at 5:00 P.M., beginning 11 May 1892, admission free.]

AIMS

At the end of the semester, could my students fully understand all of the statistical methods used in a typical issue of Ecology? Probably not, but they did have the foundation to consider the methods if authors clearly described their approach. Statistics can still mislead students, but students are less apt to see all statistics as lies and more apt to constructively criticise questionable methods. They can dissect any approach by applying the conceptual terms used throughout the semester. Students leave the course believing that statistics does, after all, have relevance, and that it is more accessible than they believed at the beginning of the semester.

At its best, statistical analysis sharpens thinking about data, reveals new patterns, prompts creative thinking, and stimulates productive discussions in multi-disciplinary research groups. For many scientists, these positive possibilities of statistics are over-shadowed by negatives; abstruse assumptions, emphasis of things one can’t do, and convoluted logic based on hypothesis rejection. One colleague’s reaction to this Special Feature (on statistical analysis of ecosystem studies) was that “statistics is the scientific equivalent of a trip to the dentist.”

This view is probably widespread. It leads to insufficient awareness of the fact that statistics, like ecology, is a vital, evolving discipline with ever-changing capabilities.

A typical page from a field notebook. This one records observations on the ground vegetation in Populus balsamifera woodland in the flood plain of the Athabasca River, Alberta.

July 18, 1998. Plot 6 (quadrats) (Rt. Bank, c 300 m S of mouth of Steepbank R., 40 m inland)

Species                          #11 #12 #13 #14 #15 #16 #17 #18 #19 #20
Equisetum pratense                 4   -   1   2   -   7  10  13  18  17
Rubus pubescens                   11   4  13  18   4   7  17   -  13   2
R. strigosus                       1   8   1   2  19   8   3   5   2   8
Cornus stolonifera                 6   -   -   1   -   -   1   1   -   1
C. canadensis                      -   -   2   -  12   -   -   1   -   -
Rosa acicularis                    2   2   1   6  11   2   1   -   3   3
Galium boreale                     -   -  12   3  22   -   2   -   1   -
Ribes oxyacanthoides               -   1   -   4  15   -   -   8   -   3
R. triste                          2   9  13   2   -   4  10   6  16   9
Mitella nuda                       -   6   -   -   1   9   -  16  25  19
Mertensia nudicaulis               -  11   6  10   -   2  10   4   1  12
Aralia nudicaulis                  4   -   6   1   3   -   -   1   -   1
Viburnum edule                     2  15   5   6   -   7   4   5   3   4
Calamagrostis canescens            3   3   -   1   1   6  11   8   4   4
Populus balsamifera (seedling)     2   1   -   1   1   2   2   -   1   -
Prunus virginiana (seedling)       -   -   1   -   -   -   -   -   1   -
Populus tremuloides (seedling)     -   -   1   -   1   -   -   1   -   -
Actaea rubra                       -   -   1   -   1   -   -   -   -   1
Circaea alpina                     4   -   1  18   1   3   -   -   2  11
Thalictrum venulosum               3   -   -   -   -   1   1   -   -   -
Matteuccia struthiopteris          -   -   -   -   -   -   -   -   -   2
NO. OF SPECIES                    12  10  14  14  12  12  12  12  13  14

TYPES OF MULTIVARIATE DATA

Field                    Object (n)                Variable (m)
Botany (plant ecology)   Quadrat, Relevé, Plot     Plant species
Archaeology              Sites                     Artefacts
Geology                  Samples                   Particle-size classes
Chemistry                Stream sediments          Trace elements
Zoology                  Geographical localities   Morphometric characters
Pollen analysis          Sediment samples          Pollen types
Diatom analysis          Sediment samples          Diatom types
Palaeontology            Rock samples              Fossil taxa
...                      ...                       ...

Features in common:

MANY OBJECTS n

MANY VARIABLES m

CAN BE ARRANGED IN DATA MATRIX of SAMPLES or OBJECTS x VARIABLES

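DATA MATRIX

Samples (n samples) form the columns, numbered 1, 2, 3, 4, ..., n. Variables (m variables) form the rows, numbered 1, 2, 3, 4, ..., m. A general element is written x_ik.

\[
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & x_{ik} & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix}
\]

Matrix X has m rows and n columns. By the usual rows-by-columns convention it is an m x n matrix, of order (m x n).

Example with two rows and three columns:

\[
X = \begin{pmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23}
\end{pmatrix}
\]

x_21: element in row two, column one

x_ik: element in row i, column k (i and k are the subscripts)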
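The subscript convention can be tried out in a few lines of code. This is only a sketch: the course practicals use R, Python is used here purely for illustration, and the 2 x 3 matrix and the helper name `element` are hypothetical, not taken from the lecture. Python counts from 0, so the helper converts from the 1-based subscripts used on the slide.

```python
# Hypothetical data matrix X: m = 2 variables (rows) by n = 3 samples (columns).
# The values are invented purely for illustration.
X = [
    [5, 0, 12],  # variable 1 measured in samples 1, 2, 3
    [3, 7, 0],   # variable 2 measured in samples 1, 2, 3
]


def element(matrix, i, k):
    """Return x_ik (row i, column k) using the 1-based subscripts of the slide."""
    return matrix[i - 1][k - 1]


m = len(X)      # number of rows (variables)
n = len(X[0])   # number of columns (samples)
x_21 = element(X, 2, 1)  # element in row two, column one

print(f"order: {m} x {n}; x_21 = {x_21}")
```

In R, which counts from 1, the same element would simply be `X[2, 1]`.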

FEATURES OF MULTIVARIATE DATA

Complex

Show:
• noise
• redundancy
• internal relationships
• outliers

Some information in the data is only indirectly interpretable

BIOLOGICAL DATA
• many species
• +/–, quantitative, often %, many zero values, skewed
• non-linear responses to environment

ENVIRONMENTAL DATA
• fewer variables
• +/–, ranks, quantitative
• non-normal
• linear inter-relationships, often high correlations, some redundancy

STATISTICS AND DATA ANALYSIS

1. Hypothesis testing: ‘confirmatory data analysis’ (CDA).

2. Model building
   explanatory
   empirical
   [statistical]

   Pielou (1981) Quart. Rev. Biol.

   “Models are often displayed with little or no effort to link them with the real world. As a result the whole body of knowledge and theory has grown top-heavy with models... Models are not useless but too much should not be expected of them. Modelling is only a part, and a subordinate part, of research.”

3. Hypothesis generation: ‘exploratory data analysis’ (EDA).

   Detective work

CDA & EDA - different aims, philosophies, methods

“We need both exploratory and confirmatory”.

J W Tukey 1980

EXPLORATORY DATA ANALYSIS

[Flow diagram; box labels in reading order]
• Real world ‘facts’
• Observations, measurements
• Data
• Data analysis
• Patterns
• ‘Information’
• Hypotheses
• Decisions

CONFIRMATORY DATA ANALYSIS

[Flow diagram; box labels in reading order]
• Hypotheses
• Real world ‘facts’
• Observations, measurements
• Data
• Statistical testing
• Hypothesis testing
• Theory

[Diagram; labels in reading order]
• Underlying statistical model (e.g. linear or unimodal response)
• Exploratory data analysis
• Biological data Y
• Description
• Confirmatory data analysis
• Testable ‘null hypothesis’
• Additional (e.g. environmental data) X
• Rejected hypotheses

[Figure: The Popperian hypothetico-deductive method, after Underwood and others.]

Labels in the figure, in reading order:
• Observation
• Data collection
• Analysis
• Evaluate statistical H0, HA
• Evaluate prediction
• Evaluate scientific H0, HA
• Evaluate theory/paradigm
• Theory/Paradigm
• Prediction
• Scientific H0
• Scientific HA
• Statistical H0
• Statistical HA
• Conceptual design of study, choice of format (experimental, non-experimental) and classes of data
• Sampling or experimental design
• induction
• deduction
• deduction

H0 = null hypothesis; HA = alternative hypothesis

EXPLORATORY DATA ANALYSIS (EDA) versus CONFIRMATORY DATA ANALYSIS (CDA)

EDA: How can I optimally describe or explain variation in data set?
CDA: Can I reject the null hypothesis that the species are unrelated to a particular environmental factor or set of factors?

EDA: Samples can be collected in many ways, including subjective sampling.
CDA: Samples must be representative of universe of interest – random, stratified random, systematic.

EDA: ‘Data-fishing’ permissible, post-hoc analyses, explanations, hypotheses, narrative okay.
CDA: Analysis must be planned a priori.

EDA: P-values only a rough guide.
CDA: P-values meaningful.

EDA: Stepwise techniques (e.g. forward selection) useful and valid.
CDA: Stepwise techniques not strictly valid.

EDA: Main purpose is to find ‘pattern’ or ‘structure’ in nature. Inherently subjective, personal activity. Interpretations not repeatable.
CDA: Main purpose is to test hypotheses about patterns. Inherently analytical and rigorous. Interpretations repeatable.
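To make the confirmatory question concrete, here is a minimal sketch of a randomisation (permutation) test of the null hypothesis that a species is unrelated to one environmental factor. It is an illustration under stated assumptions, not the procedure used in the course: the practicals use R, the data below are invented, and the function names, the choice of Pearson's r as the test statistic, and the 999 permutations are all hypothetical choices.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=999, seed=1):
    """Two-sided permutation p-value for H0: x and y are unrelated."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any link between x and y
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            as_extreme += 1
    # The observed arrangement counts as one of the permutations.
    return (as_extreme + 1) / (n_perm + 1)


# Invented example: lake-water pH and abundance of one hypothetical species.
ph = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
abundance = [2, 3, 5, 4, 8, 9, 12, 11]

p_value = permutation_p_value(ph, abundance)
print(f"permutation p-value: {p_value:.3f}")
```

In the confirmatory setting the hypothesis, the test statistic, and the number of permutations would be fixed before looking at the data; in the exploratory setting the same number would be only a rough guide.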


A WELL-DESIGNED MODERN ECOLOGICAL STUDY COMBINES BOTH.

1) Two-phase study

- Initial phase is exploratory, perhaps involving subjectively located plots or previous data to generate hypotheses.

- Second phase is confirmatory, collection of new data from defined sampling scheme, planned data analysis.

2) Split-sampling

- Large data set (>100 objects), randomly split into two (75/25) – exploratory set and confirmatory set.

- Generate hypotheses from exploratory set (allow data fishing); test hypotheses with confirmatory set.

- Rarely done in ecology.
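The split-sampling idea can be sketched as follows. This is an assumption-laden illustration rather than a prescribed procedure: the course practicals use R, the function name `split_sample`, the plot identifiers, and the fixed random seed are hypothetical, and only the 75/25 proportion comes from the slide.

```python
import random


def split_sample(objects, exploratory_fraction=0.75, seed=42):
    """Randomly split objects into an exploratory set and a confirmatory set."""
    rng = random.Random(seed)     # fixed seed so the split can be reproduced
    shuffled = list(objects)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 120 objects (more than the 100 suggested on the slide).
plots = [f"plot_{i:03d}" for i in range(1, 121)]

exploratory, confirmatory = split_sample(plots)
print(len(exploratory), "objects for hypothesis generation")
print(len(confirmatory), "objects held back for hypothesis testing")
```

The key design point is that the confirmatory set is put aside before any exploration starts, so that hypotheses generated by data fishing are tested on objects that played no part in suggesting them.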


Data diving with cross-validation: an investigation of broad-scale gradients in Swedish weed communities.

ERIK HALLGREN, MICHAEL W. PALMER and PER MILBERG. Journal of Ecology, 1999, 87, 1037-1051.

[Figure] Flow chart for the sequence of analyses. Solid lines represent the flow of data and dashed lines the flow of analysis.

Labels in the flow chart, in reading order:
• Full data set
• Some previously removed data
• Clean data set
• Exploratory data set
• Combined data set
• Confirmatory data set
• RESULTS
• Remove observations with missing data
• Random split
• Hypotheses
• Ideas for more analysis
• Choice of variables
• Analyses for display
• Hypothesis tests

EUROPEAN FOOD

(From A Survey of Europe Today, The Reader’s Digest Association Ltd.) Percentage of all households with various foods in house at time of questionnaire. Foods by countries.

Code Food                  D   I   F  NL   B   L  GB   P   A  CH   S  DK   N  SF   E IRL
GC   ground coffee        90  82  88  96  94  97  27  72  55  73  97  96  96  98  70  13
IC   instant coffee       49  10  42  62  38  61  86  26  31  72  13  17  17  12  40  52
TB   tea or tea bags      88  60  63  98  48  86  99  77  61  85  93  92  83  84  40  99
SS   sugarless sugar      19   2   4  32  11  28  22   2  15  25  31  35  13  20   -  11
BP   packaged biscuits    57  55  76  62  74  79  91  22  29  31   -  66  62  64  62  80
SP   soup (packages)      51  41  53  67  37  73  55  34  33  69  43  32  51  27  43  75
ST   soup (tinned)        19   3  11  43  25  12  76   1   1  10  43  32   4  10   2  18
IP   instant potatoes     21   2  23   7   9   7  17   5   5  17  39  11  17   8  14   2
FF   frozen fish          27   4  11  14  13  26  20  20  15  19  54  51  30  18  23   5
VF   frozen vegetables    21   2   5  14  12  23  24   3  11  15  45  42  15  12   7   3
AF   fresh apples         81  67  87  83  76  85  76  22  49  79  56  81  61  50  59  57
OF   fresh oranges        75  71  84  89  76  94  68  51  42  70  78  72  72  57  77  52
FT   tinned fruit         44   9  40  61  42  83  89   8  14  46  53  50  34  22  30  46
JS   jam (shop)           71  46  45  81  57  20  91  16  41  61  75  64  51  37  38  89
CG   garlic clove         22  80  88  16  29  91  11  89  51  64   9  11  11  15  86   5
BR   butter               91  66  94  31  84  94  95  65  51  82  68  92  63  96  44  97
ME   margarine            85  24  47  97  80  94  94  78  72  48  32  91  94  94  51  25
OO   olive, corn oil      74  94  36  13  83  84  57  92  28  61  48  30  28  17  91  31
YT   yoghurt              30   5  57  53  20  31  11   6  13  48   2  11   2   -  16   3
CD   crispbread           26  18   3  15   5  24  28   9  11  30  93  34  62  64  13   9

Classification

[Figure] Dendrogram showing the results of minimum variance agglomerative cluster analysis of the 16 European countries for the 20 food variables listed in the table.

Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, DK Denmark, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland

Ordination

[Figure] Correspondence analysis of percentages of households in 16 European countries having each of 20 types of food.

Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, DK Denmark, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland


Minimum spanning tree fitted to the full 15-dimensional correspondence analysis solution superimposed on a rotated plot of countries from previous figure.

Percentages of people employed in nine different industry groups in Europe. (AGR = agriculture, MIN = mining, MAN = manufacturing, PS = power supplies, CON = construction, SER = service industries, FIN = finance, SPS = social and personal services, TC = transport and communications).

Country            AGR   MIN   MAN    PS   CON   SER   FIN   SPS    TC
Belgium            3.3   0.9  27.6   0.9   8.2  19.1   6.2  26.6   7.2
Denmark            9.2   0.1  21.8   0.6   8.3  14.6   6.5  32.2   7.1
France            10.8   0.8  27.5   0.9   8.9  16.8   6.0  22.6   5.7
W. Germany         6.7   1.3  35.8   0.9   7.3  14.4   5.0  22.3   6.1
Ireland           23.2   1.0  20.7   1.3   7.5  16.8   2.8  20.8   6.1
Italy             15.9   0.6  27.6   0.5  10.0  18.1   1.6  20.1   5.7
Luxembourg         7.7   3.1  30.8   0.8   9.2  18.5   4.6  19.2   6.2
Netherlands        6.3   0.1  22.5   1.0   9.9  18.0   6.8  28.5   6.8
UK                 2.7   1.4  30.2   1.4   6.9  16.9   5.7  28.3   6.4
Austria           12.7   1.1  30.2   1.4   9.0  16.8   4.9  16.8   7.0
Finland           13.0   0.4  25.9   1.3   7.4  14.7   5.5  24.3   7.6
Greece            41.4   0.6  17.6   0.6   8.1  11.5   2.4  11.0   6.7
Norway             9.0   0.5  22.4   0.8   8.6  16.9   4.7  27.6   9.4
Portugal          27.8   0.3  24.5   0.6   8.4  13.3   2.7  16.7   5.7
Spain             22.9   0.8  28.5   0.7  11.5   9.7   8.5  11.8   5.5
Sweden             6.1   0.4  25.9   0.8   7.2  14.4   6.0  32.4   6.8
Switzerland        7.7   0.2  37.8   0.8   9.5  17.5   5.3  15.4   5.7
Turkey            66.8   0.7   7.9   0.1   2.8   5.2   1.1  11.9   3.2
Bulgaria          23.6   1.9  32.3   0.6   7.9   8.0   0.7  18.2   6.7
Czechoslovakia    16.5   2.9  35.5   1.2   8.7   9.2   0.9  17.9   7.0
E. Germany         4.2   2.9  41.2   1.3   7.6  11.2   1.2  22.1   8.4
Hungary           21.7   3.1  29.6   1.9   8.2   9.4   0.9  17.2   8.0
Poland            31.1   2.5  25.7   0.9   8.4   7.5   0.9  16.1   6.9
Romania           34.7   2.1  30.1   0.6   8.7   5.9   1.3  11.7   5.0
USSR              23.7   1.4  25.8   0.6   9.2   6.1   0.5  23.6   9.3
Yugoslavia        48.7   1.5  16.8   1.1   4.9   6.4  11.3   5.3   4.0

Source: Euromonitor (1979, pp. 76-7) with the percentage employed in finance in Spain reduced from 14.7 to the more reasonable figure of 8.5
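Because each row holds percentages across all nine groups, row totals are a simple screening check that can be run before any multivariate analysis. The sketch below uses three rows copied from the table (Belgium, Spain, Yugoslavia); the helper name `row_total` is hypothetical, Python is used only for illustration (the practicals use R), and the check is offered as one plausible way to spot an odd value, not as the reason given by the source for adjusting the Spanish figure.

```python
GROUPS = ["AGR", "MIN", "MAN", "PS", "CON", "SER", "FIN", "SPS", "TC"]

# Three rows copied from the employment table.
employment = {
    "Belgium": [3.3, 0.9, 27.6, 0.9, 8.2, 19.1, 6.2, 26.6, 7.2],
    "Spain": [22.9, 0.8, 28.5, 0.7, 11.5, 9.7, 8.5, 11.8, 5.5],
    "Yugoslavia": [48.7, 1.5, 16.8, 1.1, 4.9, 6.4, 11.3, 5.3, 4.0],
}


def row_total(values):
    """Sum of the percentages in one row, rounded to the precision of the table."""
    return round(sum(values), 1)


totals = {country: row_total(values) for country, values in employment.items()}

# Spain with the unadjusted finance figure quoted in the source note.
spain_unadjusted = list(employment["Spain"])
spain_unadjusted[GROUPS.index("FIN")] = 14.7
spain_unadjusted_total = row_total(spain_unadjusted)

for country, total in totals.items():
    print(f"{country}: {total}")
print(f"Spain with FIN = 14.7: {spain_unadjusted_total}")
```

With these values Belgium and Yugoslavia total 100.0 and Spain totals 99.9, whereas substituting the unadjusted 14.7 gives 106.1.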

Correspondence analysis

[Two figure slides, both titled "Correspondence analysis"]

WHY DO MULTIVARIATE DATA ANALYSIS?

1: Data simplification and data reduction - “signal from noise”

2: Detect features that might otherwise escape attention.

3: Hypothesis generation and prediction.

4: Data exploration as aid to further data collection.

5: Communication of results of complex data.

Ease of display of complex data.

6: Aids communication and forces us to be explicit.

“The more orthodox amongst us should at least reflect that many of the same imperfections are implicit in our own cerebrations and welcome the exposure which numbers bring to the muddle which words may obscure”.

  D Walker (1972)

7: Tackle problems not otherwise soluble. Hopefully better science.

8: Fun!


“General impressions are never to be trusted. Unfortunately when they are of long standing they become fixed rules of life, and assume a prescriptive right not to be questioned. Consequently those who are not accustomed to original inquiry entertain a hatred and a horror of statistics. They cannot endure the idea of submitting their sacred impressions to cold-blooded verification. But it is the triumph of scientific men to rise superior to their superstitions, to desire tests by which the value of their beliefs may be ascertained, and to feel sufficiently masters of themselves to discard contemptuously whatever may be found untrue.”

Francis Galton

Quoted from Quotes, Damned Quotes and... compiled by J Bibby. Edinburgh: John Bibby (Books)

TERMINOLOGY

Sample, object, individual “sampling unit”

                      Statistician     Others
Single unit           Sampling unit    Sample
Collection of units   Sample           Sample set

Variable, character, attribute

Algorithms, methods, models, programs

Classification, clustering, partitioning, scaling, gradient analysis

[assignment, identification, discrimination]

[dissection]

 Objective, repeatable

TYPES OF VARIABLES

1) Numeric, quantitative, continuous variables

2) Nominal and ordinal variables (qualitative multistate)
   Nominal: “disordered multistate” (e.g. red, white, blue)
   Ordinal: “ordered multistate” (e.g. dry, moist, wet)

3) Binary or dichotomous variables: +/– (e.g. male, female)

4) Conditionally present variables
   e.g. 3 species: A, B, C. Only A & B have petals. A has pink petals, B has white petals.

                  A   B   C
   Pink petals    +   -   -
   White petals   -   +   -     (nominal disordered)
   No petals      -   -   +

5) Mixed data: see Lecture 12
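The petal example amounts to recoding one nominal (disordered multistate) variable as a set of +/- variables, one per state. A minimal sketch of that recoding, assuming 1 stands for + and 0 for -; the function name `dummy_code` and the state labels are hypothetical, and Python is used only for illustration (the practicals use R).

```python
def dummy_code(states, observations):
    """Recode one nominal variable as one presence/absence (1/0) variable per state."""
    return {
        state: [1 if observed == state else 0 for observed in observations]
        for state in states
    }


species = ["A", "B", "C"]
petals = ["pink", "white", "none"]   # petal state of species A, B, C
states = ["pink", "white", "none"]   # disordered: the order carries no meaning

coded = dummy_code(states, petals)

print("state   " + "  ".join(species))
for state in states:
    row = "  ".join("+" if value else "-" for value in coded[state])
    print(f"{state:<7} {row}")
```

Each species scores + on exactly one of the three new variables, which reproduces the small table on the slide.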

GEOMETRICAL MODELS

Pollen data - 2 pollen types x 15 samples

Depths are in centimetres, and the units for pollen frequencies may be either in grains counted or percentages.

Rows are samples; columns (Type A, Type B) are variables.

Sample   Depth   Type A   Type B
  1         0      10       50
  2        10      12       42
  3        20      15       47
  4        30      17       38
  5        40      18       43
  6        50      22       37
  7        60      23       35
  8        70      26       26
  9        80      35       23
 10        90      37       22
 11       100      43       18
 12       110      38       17
 13       120      47       15
 14       130      42       12
 15       140      50       10

Adam (1970)

ALTERNATE REPRESENTATIONS OF THE POLLEN DATA

[Figure with two panels: (a) Palynological representation; (b) Geometrical representation]

In (a) the data are plotted as a standard diagram, and in (b) they are plotted using the geometric model. Units along the axes may be either pollen counts or percentages.

Adam (1970)
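In the geometric model each sample is a point whose coordinates are its values for the variables, so the closeness of two samples can be expressed as a distance between points. The sketch below uses the 15 pollen samples from the Adam (1970) table and straight-line (Euclidean) distance. Euclidean distance is an assumption made here for illustration, since the slides do not prescribe a particular coefficient at this point; the function name `euclidean` is hypothetical, and Python stands in for the R used in the practicals.

```python
import math

# (Type A, Type B) values for samples 1..15, copied from the pollen table.
pollen = [
    (10, 50), (12, 42), (15, 47), (17, 38), (18, 43),
    (22, 37), (23, 35), (26, 26), (35, 23), (37, 22),
    (43, 18), (38, 17), (47, 15), (42, 12), (50, 10),
]


def euclidean(p, q):
    """Straight-line distance between two samples plotted as points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Adjacent samples 1 and 2 versus the top and bottom samples 1 and 15.
d_1_2 = euclidean(pollen[0], pollen[1])
d_1_15 = euclidean(pollen[0], pollen[14])

print(f"distance between samples 1 and 2:  {d_1_2:.2f}")
print(f"distance between samples 1 and 15: {d_1_15:.2f}")
```

Samples that lie close together in the plot have a small distance, and samples far apart have a large one, which is the link between the geometrical model and the concept of dissimilarity.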


Geometrical model of a vegetation space containing 52 records (stands).

A: A cluster within the cloud of points (stands) occupying vegetation space.

B: 3-dimensional abstract vegetation space: each dimension represents an element (e.g. proportion of a certain species) in the analysis (X Y Z axes).

A, the results of a classification approach (here attempted after ordination) in which similar individuals are grouped and considered as a single cell or unit.

B, the results of an ordination approach in which similar stands nevertheless retain their unique properties and thus no information is lost (X1 Y1 Z1 axes).

N. B. Abstract space has no connection with real space from which the records were initially collected.

Concept of Similarity, Dissimilarity, Distance and Proximity

s_ij: how similar object i is to object j

Proximity measure: DC (dissimilarity coefficient) or SC (similarity coefficient)

Dissimilarity = Distance

_________________________________

Convert s_ij to d_ij:

\[ s_{ij} = C - d_{ij}, \quad \text{where } C \text{ is a constant} \]

\[ d_{ij} = 1 - s_{ij} \]

\[ d_{ij} = \sqrt{1 - s_{ij}} \]

\[ s_{ij} = \frac{1}{1 + d_{ij}} \]
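The conversions between similarity and dissimilarity are one-line calculations. The sketch below assumes similarities scaled between 0 and 1 and implements three common conversions, d = C - s (with C = 1 by default), d = sqrt(1 - s), and s = 1 / (1 + d), as one reading of the formulas on this slide. The function names are hypothetical, and Python is used only for illustration (the practicals use R).

```python
import math


def distance_linear(s, c=1.0):
    """d_ij = C - s_ij, the rearrangement of s_ij = C - d_ij (C = 1 by default)."""
    return c - s


def distance_sqrt(s):
    """d_ij = sqrt(1 - s_ij); assumes 0 <= s_ij <= 1."""
    return math.sqrt(1.0 - s)


def similarity_from_distance(d):
    """s_ij = 1 / (1 + d_ij); maps d = 0 to s = 1 and large d towards s = 0."""
    return 1.0 / (1.0 + d)


s_ij = 0.75  # invented similarity between two objects
print("linear:     ", distance_linear(s_ij))
print("square root:", distance_sqrt(s_ij))
print("back from d = 1:", similarity_from_distance(1.0))
```

All three keep the essential property that identical objects (s = 1) have zero distance; they differ in how strongly intermediate similarities are stretched.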

COMPUTING

The 10 practicals mainly use R, a free, open-source statistical-computing environment, rather than specific commercial packages such as MINITAB or SYSTAT.

Relatively steep learning curve but worth it.

Recommended as excellent guides:
• Fox (2002) An R and S-PLUS companion to applied regression (Sage)
• Crawley (2005) Statistics: An introduction using R (Wiley)
• Crawley (2007) The R Book (Wiley)
• Everitt (2005) An R and S-PLUS companion to multivariate analysis (Springer)
• Verzani (2005) Using R for introductory statistics (Chapman & Hall/CRC)

Will also use specialised software for specific methods (e.g. TWINSPAN, CANOCO and CANODRAW, C2, ZONE, etc.)

Computing practicals are an integral and essential part of the course.

COURSE TOPICS

Topic                                               Lecture          Practical
Introduction                                        Lecture 1        -
Exploratory Data Analysis                           Lecture 2        Practical 1
Cluster Analysis                                    Lecture 3        Practical 2
Regression Analysis                                 Lectures 4 & 5   Practicals 3 & 4
Ordination (Indirect Gradient Analysis)             Lecture 6        Practical 5
Constrained Ordination (Direct Gradient Analysis)   Lecture 7        Practical 6
Calibration and Environmental Reconstructions       Lecture 8        Practical 7
Classification                                      Lecture 9        Practical 8
Analysis of Stratigraphical and Spatial Data        Lecture 10       Practical 9
Hypothesis Testing                                  Lecture 11       Practical 10
Overview and Future Developments                    Lecture 12       -

COURSE POWERPOINTS

In some of the lectures, some of the slides are rather technical.

They are included for the sake of completeness on the topic under discussion.

They are for reference only and are marked REF.