
Pattern Recognition for the Natural Sciences Explorative Data Analysis Principal Component Analysis (PCA) Lutgarde Buydens, IMM, Analytical Chemistry


Page 1: Pattern Recognition for the Natural Sciences Explorative Data Analysis Principal Component Analysis (PCA) Lutgarde Buydens, IMM, Analytical Chemistry

Pattern Recognition for the Natural Sciences

Explorative Data Analysis

Principal Component Analysis (PCA)

Lutgarde Buydens, IMM, Analytical Chemistry

Page 2:

Why Explorative Data Analysis?

Classical science: hypothesis driven

[Figure: a "System" box with a question mark]

Paradigm change in natural sciences

Page 3:

Why Explorative Data Analysis?

Classical science: hypothesis driven

Science with advanced technologies: data driven (explorative analysis of data)

[Figure: two "System" boxes, one for each approach, each with a question mark]

Paradigm change in natural sciences

Page 4:

Explorative Data Analysis

Advanced technology: high-throughput (high-quality) analysis

NMR, HPLC, GC, MS/MS, immunoassays, hybrids, nano/sensor technology

Genomics (gene expression profiling)

Proteomics, Metabolomics

Fingerprinting

Profiling in drug design

Overwhelming amount of data

Page 5:

Explorative Data Analysis

Visualization (principal component analysis, projections)

Unsupervised Pattern recognition (clustering)

Supervised Pattern recognition (classification)

Quantitative analysis (correlations, predictions)

Page 6:

Principal Component Analysis: an Example

150 samples of Italian wines from the same region, from 3 different cultivars

Is it possible to characterise the cultivars?

Which variables are relevant for which cultivars?

Page 7:

Data Matrix

X is the n × p data matrix: n = 150 wine samples (objects, rows) and p = 13 properties (variables, columns).

x_ij is the value of property j for sample i. Example: x_ij with i = 75 and j = 7 is the flavanoid concentration of sample 75.

x_i is row i of X (all properties of one sample); x_j is column j of X (one property, e.g. flavanoid concentration, over all samples).
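As a small illustration of the data matrix layout, the sketch below builds a 150 × 13 array and picks out one element, one row, and one column. The values are random placeholders (the wine measurements themselves are not part of these slides), and the variable names are chosen here for illustration only.

```python
import numpy as np

# Synthetic stand-in for the wine data: random numbers, NOT the real measurements.
rng = np.random.default_rng(0)
n, p = 150, 13                 # n objects (samples), p variables (properties)
X = rng.normal(size=(n, p))

# The slides count from 1, NumPy counts from 0.
i, j = 75, 7                   # sample 75, property 7 (flavanoids on the slides)
x_ij = X[i - 1, j - 1]         # single element x_ij
x_i = X[i - 1, :]              # row: all 13 properties of sample 75
x_j = X[:, j - 1]              # column: property 7 across all 150 samples

print(X.shape, x_i.shape, x_j.shape)
```

A row of `X` is one sample described by 13 numbers, and a column is one property described by 150 numbers; `X.T` simply swaps the two views.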

Page 8:

Principal Component Analysis

[Figure: bar plot of 1 wine sample]

Page 9:

Principal Component Analysis

[Figure: line plot and bar plot of 1 wine sample]

Page 10:

Principal Component Analysis

[Figure: line plot and bar plot of 1 wine sample]

Page 11:

Principal Component Analysis

[Figure: line plot and bar plot of 1 wine sample]

Page 12:

Data Matrix Representation

X is an n × p matrix with elements x_ij: n = # samples (rows x_i, i = 1 ... n) and p = # properties (columns x_j, j = 1 ... p).

Page 13:

Data Matrix Representation

X: 150 × 13 matrix with elements x_ij, rows x_i, columns x_j.

p (13)-dimensional variable space, Sp (13): the 150 samples are points in this space. Each row x_i, for example sample 75, is one point.

Page 14:

Data Matrix Representation

X: 150 × 13 matrix with elements x_ij, rows x_i, columns x_j.

p (13)-dimensional variable space, Sp (13): the 150 samples are points in this space. Each row x_i, for example sample 75, is one point.

n (150)-dimensional object space, Sn (150): the 13 variables are points in this space. Each column x_j, for example property 7 (flavanoids), is one point.

Page 15:

Explorative Data Analysis

Page 16:

Principal Component Analysis

PCA: visualization: projection in 2 dimensions

[Figure: the 150 samples in the p (13)-dim. space of variables, Sp (13), are projected onto an r (2)-dim. space S2 with axes lv1 and lv2; the 13 variables in the n (150)-dim. space of objects, Sn (150), are projected onto an r (2)-dim. space S2 with axes lv1 and lv2]

Page 17:

Principal Component Analysis

[Figure: 12 samples plotted in S3, the space of 3 variables (axes x1, x2, x3)]

Page 18:

Principal Component Analysis

[Figure: 12 samples plotted in S3, the space of 3 variables (axes x1, x2, x3)]

Page 19:

Principal Component Analysis

$$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$$

[Figure: 12 samples in S3 (axes x1, x2, x3) with the PC1 axis drawn through the point cloud]

Page 20:

Principal Component Analysis

$$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$$

Criterion: maximum variance of the projections (x)

[Figure: 12 samples in S3 (axes x1, x2, x3) and their projections (x) onto the PC1 axis]

Page 21:

Principal Component Analysis

$$\mathrm{PC}_1 = l_{11} x_1 + l_{12} x_2 + l_{13} x_3$$

$$\mathrm{PC}_2 = l_{21} x_1 + l_{22} x_2 + l_{23} x_3$$

Criterion: maximum variance of the projections (x)

[Figure: 12 samples in S3 (axes x1, x2, x3) with the PC1 and PC2 axes and the projections (x) of the samples]
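The maximum-variance criterion can be checked numerically. The sketch below is one common way to obtain the loadings l11 ... l23, via an eigendecomposition of the covariance matrix; the slides do not say which algorithm was used, and the 12 samples here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
# 12 synthetic samples in 3 variables (x1, x2, x3); not the data from the slides.
mix = np.array([[2.0, 0.5, 0.1],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.normal(size=(12, 3)) @ mix
Xc = X - X.mean(axis=0)                      # column mean-centering

C = np.cov(Xc, rowvar=False)                 # 3 x 3 covariance matrix (ddof=1)
eigvals, eigvecs = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

l1, l2 = eigvecs[:, 0], eigvecs[:, 1]        # loadings of PC1 and PC2
pc1 = Xc @ l1                                # PC1 = l11*x1 + l12*x2 + l13*x3
pc2 = Xc @ l2                                # PC2 = l21*x1 + l22*x2 + l23*x3

var_pc1 = pc1.var(ddof=1)
var_pc2 = pc2.var(ddof=1)

# Any other unit direction gives projections with variance no larger than PC1's.
w = rng.normal(size=3)
w /= np.linalg.norm(w)
var_random = (Xc @ w).var(ddof=1)
print(var_pc1, var_pc2, var_random)
```

PC2 is the direction of largest remaining variance that is orthogonal to PC1, which is why `l1 @ l2` is zero up to rounding.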

Page 22:

Principal Components Space

[Figure: the 12 samples plotted in S2, the space spanned by PC1 and PC2]

Page 23:

Principal Component Analysis

Score plot

[Figure: the 150 samples in the p (13)-dim. space of variables, Sp (13), are projected onto the r (2)-dim. space S2 with axes pc1 and pc2]

Page 24:

Principal Component Analysis

Score plot

[Figure: the 150 samples in the p (13)-dim. space of variables, Sp (13), are projected onto the r (2)-dim. space S2 with axes pc1 and pc2]

[Figure: Wine data: score plot; axes PC1 (38%) and PC2 (20%)]

Page 25:

Principal Component Analysis

Loading plot

[Figure: the 13 variables in the n (150)-dim. space of objects, Sn (150), are projected onto the 2-dim. space S2 with axes pc1 and pc2]

Page 26:

Principal Component Analysis

Loading plot

[Figure: the 13 variables in the n (150)-dim. space of objects, Sn (150), are projected onto the 2-dim. space S2 with axes pc1 and pc2]

[Figure: Wine data: loading plot; axes PC1 (38%) and PC2 (20%)]

Page 27:

Singular Value Decomposition (SVD)

$$X_{n \times p} = U_{n \times r} \, D_{r \times r} \, V^{T}_{r \times p}$$

U: left singular vectors (PC scores)

V: right singular vectors (PC loadings)

$$U^{T} U = V^{T} V = I$$
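A minimal NumPy sketch of this decomposition, on synthetic data of the same shape as the wine matrix, is given below. One assumption to flag: the scores are computed here as U scaled by the singular values (U D), a widely used convention; the slide labels the left singular vectors themselves as scores, and both conventions differ only by that scaling.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 13))          # synthetic stand-in, n = 150, p = 13
Xc = X - X.mean(axis=0)                 # column mean-centering before PCA

# Thin SVD: U is 150 x 13, d holds the 13 singular values, Vt is 13 x 13.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

r = 2                                   # number of PCs kept for plotting
scores = U[:, :r] * d[:r]               # 150 x 2, coordinates for a score plot
loadings = Vt[:r, :].T                  # 13 x 2, coordinates for a loading plot

X_hat = scores @ loadings.T             # rank-2 approximation of Xc
print(scores.shape, loadings.shape, X_hat.shape)
```

Because the columns of V are orthonormal, the same scores are obtained by projecting the centered data onto the loadings, `Xc @ loadings`.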

Page 28:

Principal Component Analysis: Biplot

[Figure: the score plot (150 samples from Sp (13), axes pc1 and pc2) and the loading plot (13 variables from Sn (150), axes pc1 and pc2) are overlaid in one S2 plot]

BIPLOT: 150 samples + 13 variables in the same pc1/pc2 plot

Page 29:

Principal Component Analysis: an Example

[Figure: wine data plotted on PC1 (38%) and PC2 (20%)]

Page 30:

Principal Component Analysis: Some Issues

• How many PCs?

• Scaling

• Outliers

Page 31:

How many PCs?

Cumulative % of variance, where the % of variance of PC i follows from the singular values d_i:

$$\frac{d_i^2}{\sum_{j=1}^{p} d_j^2} \times 100\%$$

[Figure: cumulative % of variance (up to 100%) against the number of PCs]

Scree plot

[Figure: log variance against the number of PCs]
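The percentage of variance per PC, its cumulative sum, and the log-variance values for a scree plot can all be computed from the singular values. The sketch below assumes column mean-centered data and uses synthetic numbers; the function name is made up for this example.

```python
import numpy as np

def explained_variance_percent(Xc):
    """Return (% variance per PC, cumulative % variance) for centered data Xc."""
    d = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    pct = 100.0 * d**2 / np.sum(d**2)
    return pct, np.cumsum(pct)

rng = np.random.default_rng(3)
# Synthetic data whose 13 columns have decreasing spread.
X = rng.normal(size=(150, 13)) * np.linspace(5.0, 0.5, 13)
Xc = X - X.mean(axis=0)

pct, cum = explained_variance_percent(Xc)

# Values for a scree plot: log of the variance captured by each PC.
d = np.linalg.svd(Xc, compute_uv=False)
log_var = np.log10(d**2 / (Xc.shape[0] - 1))

for k in range(3):
    print(f"PC{k + 1}: {pct[k]:.1f}%  cumulative: {cum[k]:.1f}%")
```

A common reading of these plots is to keep the PCs before the cumulative curve flattens or before the scree plot shows an elbow; where exactly to cut remains a judgment call.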

Page 32:

How many PCs?

[Figure: wine data]

Page 33:

How many PCs?

Page 34:

PCA: Scaling

Scaling is applied for better interpretation, but it may also obscure results.

• Raw data

• Mean-centering (column-wise, row-wise, double)

• Auto-scaling (column-wise, row-wise)

• …
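The scaling options listed here can be sketched in a few lines of NumPy. The helper names below are invented for this example, auto-scaling is taken to mean mean-centering followed by division by the standard deviation, and the data are synthetic, with one column given a deliberately large spread to show why scaling matters.

```python
import numpy as np

def mean_center(X, axis=0):
    """axis=0: column-wise centering, axis=1: row-wise centering."""
    return X - X.mean(axis=axis, keepdims=True)

def double_center(X):
    """Remove row means and column means at the same time."""
    return (X - X.mean(axis=0, keepdims=True)
              - X.mean(axis=1, keepdims=True) + X.mean())

def autoscale(X, axis=0):
    """Mean-center, then divide by the standard deviation along the same axis."""
    return mean_center(X, axis) / X.std(axis=axis, ddof=1, keepdims=True)

def pc1_percent(Z):
    """% of variance captured by the first PC of an already scaled matrix Z."""
    d = np.linalg.svd(Z, compute_uv=False)
    return 100.0 * d[0]**2 / np.sum(d**2)

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 13))
X[:, 12] *= 1000.0                     # one variable measured on a much larger scale

pc1_centered = pc1_percent(mean_center(X))
pc1_auto = pc1_percent(autoscale(X))
print(f"PC1, mean-centered: {pc1_centered:.2f}%  autoscaled: {pc1_auto:.2f}%")
```

With mean-centering only, the large-scale variable dominates PC1; after auto-scaling every variable has unit variance and contributes on an equal footing.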

Page 35:

PCA: Scaling

[Figure: wine data, mean-centered, and wine data, autoscaled]

Page 36:

PCA: Scaling

[Figure: wine data, raw, and wine data, mean-centered; both panels have axes PC1 (99.79%) and PC2 (0.20%)]

Page 37:

PCA: Outliers

[Figure: 12 samples in S3, the space of 3 variables (axes x1, x2, x3), with the PC1 axis]

Page 38:

PCA: Outliers

[Figure: 12 samples + 1 outlier in S3 (axes x1, x2, x3), with the PC1 axis]

Page 39:

PCA: Outliers

Leverage effect

[Figure: the same points in S3 (axes x1, x2, x3) with two PC1 axes drawn]
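The leverage effect can be reproduced with a toy example: 12 synthetic samples spread mainly along x1, plus one invented outlier far away along x3. The numbers are illustrative only and are not taken from the slides.

```python
import numpy as np

def first_pc(X):
    """Loading vector of PC1 for column mean-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(4)
# 12 samples in 3 variables, with most of the spread along x1.
X = np.column_stack([3.0 * rng.normal(size=12),
                     0.3 * rng.normal(size=12),
                     0.3 * rng.normal(size=12)])

outlier = np.array([[0.0, 0.0, 40.0]])      # one extreme sample along x3
X_out = np.vstack([X, outlier])             # 12 + 1 outlier

pc1_clean = first_pc(X)
pc1_out = first_pc(X_out)

# Angle between the two PC1 directions (the sign of a PC is arbitrary).
cos_angle = np.clip(abs(pc1_clean @ pc1_out), 0.0, 1.0)
angle = np.degrees(np.arccos(cos_angle))
print(f"PC1 rotates by about {angle:.0f} degrees when the outlier is added")
```

Because PCA maximises variance, a single distant point can contribute more variance than all other samples together, so PC1 turns toward it.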

Page 40:

Principal Component Analysis: a Recent Research Example

X: matrix of gene expression values x_ij, with 50,000 genes (rows) and 4 treatments (columns x_j, j = 1 ... 4)

Organon

Department of Cell Biology

Page 41:

PCA Interaction Gene Treatment