Multivariate Resolution in Chemistry
Lecture 1
Roma Tauler
IIQAB-CSIC, Spain
e-mail: [email protected]
Lecture 1
• Introduction to data structures and soft-modelling methods.
• Factor Analysis of two-way data: Bilinear models.
• Rotation and intensity ambiguities.
• Pseudo-rank, local rank and rank deficiency.
• Evolving Factor Analysis.
Chemical sensors and analytical data structures
one variable: x1, e.g. pH
two variables: x1, x2, e.g. pH and T
three variables: x1, x2 and x3, e.g. pH, T and P
n variables: ?
[Figure: samples plotted along one axis (pH), in a plane (pH, T) and in three dimensions (pH, T, P)]
Data Structures: Zero order / Zero-way data
x; one sample gives one scalar (tensor of order 0)
Examples:
- selective electrodes, pH
- absorption at one wavelength
- height/area of a chromatographic peak
Assumptions:
- total selectivity
- known linear response
Tools:
- univariate algebra and statistics
Advantages:
- simple and easy to understand
Disadvantages:
- information on only one compound
- total selectivity is required
- one sensor for every analyte
- low information content
[Figure: univariate signal plots with axes labelled c, h and Time, and a sample reading (ci, hi)]
Data Structures: First order / One-way data
x1, x2, ..., xn; one sample gives one vector (tensor of order 1)
Examples:
- matrix of sensors
- absorption at many wavelengths (spectra)
- chromatograms at a single wavelength
- current intensities at many potentials E
- readings with time (kinetics)
- ...
Assumptions:
- known linear responses
- different and independent responses
Tools:
- linear algebra
- multivariate statistics
- spectral analysis
- chemometrics (PCA, MLR, PCR, PLS, ...)
Advantages:
- calibration in the presence of interferences is possible
- multicomponent analysis is possible
Disadvantages:
- interferences must be present in the calibration samples
[Figure: a spectrum (absorbance versus wavelength, nm) and a chromatogram (signal versus time, min)]
Data Structures: Second order / Two-way data
$x_{ij}$; each sample gives a data table/matrix (tensor of order 2): $\mathbf{X} = \sum_k \mathbf{x}_k \mathbf{y}_k^T$
Examples:
- LC-DAD; LC-FTIR; GC-MS; LC-MS; FIA-DAD; CE-MS, ... (hyphenated techniques)
- excitation/emission spectroscopy (fluorescence)
- MS/MS, 2D NMR, GCxGC-MS, ...
- spectroscopic/voltammetric monitoring of chemical reactions/processes with pH, time, T, etc.
Assumptions:
- linear responses
- sufficient rank (of the data matrices)
Tools:
- linear algebra
- chemometrics
Advantages:
- calibration for the analyte in the presence of interferences not modelled in the calibration samples is possible
- full characterization of the analyte and interferents may be possible
- few calibration samples are needed (only one sample calibration)
[Figure: data matrix with elution times as rows and wavelengths as columns; each row is a spectrum and each column a chromatogram]
Data Structures: Third order / Three-way data
$x_{ijk}$; each sample gives a data cube (tensor of order 3): $\mathbf{X} = \sum_k \mathbf{x}_k \otimes \mathbf{y}_k \otimes \mathbf{z}_k$
Examples:
- several spectroscopic matrices
- several hyphenated chromatographic runs
- hyphenated multidimensional chromatography (GC x GC / MS)
- excitation/emission/time
- ...
Assumptions:
- bilinear/trilinear responses
- sufficient rank (of the data matrices)
Tools:
- multilinear and tensor algebra
- chemometrics
Advantages:
- unique solutions (no ambiguities)
- calibration for the analyte in the presence of interferences not modelled in the calibration samples is possible
- full characterization of the analyte and interferents is possible
- few calibration samples are needed (only one sample calibration)
[Figure: individual data matrices Di (time by variable) stacked along the run number into a cube D, analysed by multi-way data analysis (PARAFAC, GRAM) or by extended multivariate resolution]
0th order data: ISE, pH, ...
1st order data: spectra
2nd order data: LC/DAD, GC/MS, fluorescence
3rd order data: time/excitation/emission
Examples
Monitoring chemical reactions.
Chemical reaction systems monitored using spectroscopic measurements (even at the femtosecond scale) to follow the evolution of a reaction with time, pH, temperature, etc., and to detect the formation and disappearance of intermediate and transient species.
[Figure: automated spectrophotometric titration set-up (thermostatic bath at T = 37 °C, stirrer, pH meter, autoburette, peristaltic pump, spectrophotometer, computer and printer) producing a data matrix D(NR,NC) with rows indexed by pH and columns by wavelength]
Examples
Quality control and optimisation of industrial batch reactions and processes, where on-line measurements are applied to monitor the process.
Process analysis
[Figure: process stream measured with a probe connected to a spectrometer and a computer, producing a data matrix D(NR,NC) with rows indexed by time and columns by wavelength]
Examples: Analytical characterisation of complex environmental, industrial and food mixtures using hyphenated techniques (chromatography or continuous-flow methods with spectroscopic detection).
Chromatographic Hyphenated techniques
LC-DAD, GC-MS, LC-MS, LC-MS/MS, ...
[Figure: data matrix D(NR,NC) with rows indexed by time and columns by wavelength]
Examples FIA-DAD-UV with pH gradient for the analysis of a mixture of drugs.
[Figure: data matrix D(NR,NC) with rows indexed by pH and columns by wavelength]
Examples: Analytical characterisation of complex sea-water samples by means of excitation-emission spectra, for an unknown containing triphenyltin (in its reaction with flavonol).
Excitation emission (fluorescence) EEM techniques
[Figure: data matrix D(NR,NC) with rows indexed by excitation wavelength and columns by emission wavelength]
Examples: Protein folding and dynamic protein-nucleic acid interaction processes. In the post-genomic era, understanding these complex, evolving biochemical processes is one of the main challenges of current proteomics research.
Conformation changes
[Figure: data matrix D(NR,NC) with rows indexed by temperature and columns by wavelength]
[Figure: levels of protein structure: primary (amino acid sequence, e.g. Val-Leu-Ser-Ala-Asp-Ala-Trp-Gly-Val-His), secondary (helix and sheet formation: α-helix, β-sheet, turn, random coil), tertiary (globule formation) and quaternary (assembled subunits)]
Examples Image analysis of spatially distributed chemicals on 2D surfaces measured using coupled microscopy-spectroscopy techniques in geological samples, biological tissues or food samples.
Spectroscopic Image analysis
[Figure: spectroscopic image with pixel coordinates x and y (10 to 60) and a colour scale from 0.5 to 4; one mode of the data holds the total number of pixels (x × y)]
Data Structures in Chemistry
Experimental data: two orders/ways/modes of measurement
$$\mathbf{D}(NR,NC) = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,NC} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,NC} \\ \vdots & \vdots & & \vdots \\ d_{NR,1} & d_{NR,2} & \cdots & d_{NR,NC} \end{bmatrix}$$
Row order (way, mode): usually a change in chemical composition (concentration order).
Column order (way, mode): usually a change in system properties, as in spectroscopy, voltammetry, ... (spectral order).
Chemical data tables (two-way data)
[Figure: data table or matrix D with I rows (spectra recorded at different times) and J columns (variables, e.g. wavelengths). The column direction holds the instrumental measurements (spectra, voltammograms, ...); the row direction holds the concentration changes (time, temperature, pH, ...).]
[Figure: plot of spectra (rows) and plot of elution profiles (columns)]
Chemical data modelling
Chemical data modelling methods may be divided into:
- Hard-modelling methods (deterministic)
- Soft-modelling methods (data driven)
- Hybrid hard-soft modelling methods
[Figure: hard modelling goes from the data through a physical (hard) model to analytical information; soft modelling goes from the data through a data-driven (soft) model to analytical information, with the physical model left open (?)]
Hard-modelling approaches for chemical (stationary, dynamic, evolving, ...) systems are based on an accurate physical description of the system and on the solution of complex systems of (differential) equations, which are fitted to the experimental measurements describing the evolution and dynamics of these systems. They are deterministic models.
Hard-modelling methods usually use non-linear least-squares regression (Marquardt algorithm) and optimisation methods to find the best values for the parameters of the model.
Hard-modelling usually deals with univariate data. It was the usual approach until the advent of modern instrumentation and computers, which give large amounts of data.
Hard-modelling is often successful for laboratory experiments, where all the variables are under control and the physicochemical nature of the dynamic model is known and can be fully described using a known mathematical model.
Hard-modelling
However, even at the laboratory level, there are examples where hard-modelling requirements and constraints are not totally fulfilled or where no physicochemical model is known to describe the process (e.g. in chromatographic separations or in protein folding experiments).
Data sets obtained from the study of natural and industrial evolving processes are often too complex and difficult to analyse using hard-modelling methods. In these cases, either no physical model is known or it is too complex to be set up in a general way.
Advanced hard-modelling in industrial applications has attempted to account for experimental difficulties such as changes in temperature, pH, ionic strength and activity coefficients. This is a very difficult task!
Data Fitting in the Chemical Sciences, P. Gans, John Wiley and Sons, New York, 1992
Hard modelling
[Figure: data matrix D (absorbance versus wavelength), concentration profiles C (concentration versus time) and pure spectra S^T (absorptivities versus wavelength)]
Non-linear model fitting: $\min_{k_1,k_2} \lVert (\mathbf{I} - \mathbf{C}\mathbf{C}^{+})\,\mathbf{D} \rVert$, with $\mathbf{C} = f(k_1, k_2)$
Linear least-squares step: LS(D, C) gives $\mathbf{S}^T = \mathbf{C}^{+}\mathbf{D}$
Output: C, S and model parameters.
The model should describe all the variation in the experimental measurements.
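The fitting loop above can be sketched as follows. This is a minimal illustration, not the method of any particular package: it assumes a hypothetical first-order reaction A -> B with a single rate constant k (the slide's model has two, k1 and k2), simulated Gaussian spectra, and a plain grid search in place of the Marquardt algorithm. All names (`conc_profiles`, `residual_ssq`, `k_grid`) are illustrative.

```python
import numpy as np

def conc_profiles(k, t):
    """Concentration matrix C = f(k) for the assumed reaction A -> B."""
    a = np.exp(-k * t)
    return np.column_stack([a, 1.0 - a])

def residual_ssq(k, t, D):
    """Sum of squares of (I - C C+) D for a trial rate constant k."""
    C = conc_profiles(k, t)
    St = np.linalg.pinv(C) @ D           # linear step: S^T = C+ D
    return np.sum((D - C @ St) ** 2)

# Simulated data (assumption): two Gaussian spectra plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
wl = np.linspace(0, 1, 80)
S_true = np.vstack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                    np.exp(-((wl - 0.6) / 0.15) ** 2)])
k_true = 0.4
D = conc_profiles(k_true, t) @ S_true + 1e-3 * rng.standard_normal((50, 80))

# Non-linear step: pick the k with the smallest residual (grid step 0.005).
k_grid = np.linspace(0.05, 2.0, 391)
ssq = np.array([residual_ssq(k, t, D) for k in k_grid])
k_fit = k_grid[np.argmin(ssq)]
print(f"fitted k = {k_fit:.3f}")
```

Only the non-linear parameter is searched; the spectra are eliminated at each step by linear least squares, which is what keeps the number of fitted parameters small.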
Soft-modelling
Soft-modelling, instead, attempts to describe these systems without postulating an a priori physical or (bio)chemical model. The goal of these methods is to explain the variation observed in the system using minimal and softer assumptions about the data. They are data-driven models.
Soft models usually give an improved analytical description of the analysed process.
Soft modelling needs more data than hard-modelling. Soft-modelling methods deal with multivariate data. Their use has increased in recent years because modern analytical instrumentation and computers provide large amounts of data.
The disadvantage of soft models is their poorer extrapolating capabilities (compared with hard-modelling).
A soft model can hardly predict the behaviour of the system under conditions very different from those under which it was derived.
Complex multivariate soft-modelling data analysis methods, such as those derived from Factor Analysis, have been introduced for the study of chemical processes/systems.
Factor Analysis is a multivariate technique for reducing matrices of data to their lowest dimensionality by the use of orthogonal factor space and transformations that yield predictions and/or recognizable factors.
Factor Analysis in Chemistry, 3rd Edition, E.R. Malinowski, Wiley, New York, 2002
Soft modelling
[Figure: data matrix D (absorbance versus wavelength), concentration profiles C (concentration versus time) and pure spectra S^T (absorptivities versus wavelength)]
Constrained ALS optimisation:
LS(D, C) gives S*
LS(D, S*) gives C*
$\min \lVert \mathbf{D} - \mathbf{C}^{*}\mathbf{S}^{*T} \rVert$
Output: C and S.
All absorbing contributions in and out of the process are modelled.
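The alternating least-squares loop above can be sketched in a few lines. This is a rough sketch under stated assumptions: non-negativity is imposed by clipping the unconstrained least-squares solution to zero, a crude stand-in for a proper non-negative least-squares solver, and the data, the initial estimate `C0` and the function name `mcr_als` are all made up for the illustration.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

def mcr_als(D, C0, n_iter=50):
    """Alternate LS(D, C) -> S* and LS(D, S*) -> C* with non-negativity."""
    C = C0.copy()
    for _ in range(n_iter):
        St = np.linalg.lstsq(C, D, rcond=None)[0]        # S^T from D and C
        St = np.clip(St, 0.0, None)                      # non-negative spectra
        C = np.linalg.lstsq(St.T, D.T, rcond=None)[0].T  # C from D and S^T
        C = np.clip(C, 0.0, None)                        # non-negative conc.
    return C, St

# Simulated two-component data set (assumption).
rng = np.random.default_rng(0)
t = np.arange(50.0)
wl = np.arange(80.0)
C_true = np.column_stack([gauss(t, 20, 5), gauss(t, 28, 5)])
St_true = np.vstack([gauss(wl, 25, 10), gauss(wl, 55, 10)])
D = C_true @ St_true + 1e-3 * rng.standard_normal((50, 80))

# Rough initial estimate of C; in practice it could come from EFA or SIMPLISMA.
C0 = C_true + 0.05 * rng.random(C_true.shape)
C_als, St_als = mcr_als(D, C0)
lack_of_fit = np.linalg.norm(D - C_als @ St_als) / np.linalg.norm(D)
print(f"relative lack of fit = {lack_of_fit:.4f}")
```

A good fit does not by itself mean that the recovered profiles are the true ones: the solution is still subject to the rotation and intensity ambiguities of the bilinear model.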
Factor Analysis (Bilinear Model)
$d_{ij} = u_{i1}v_{1j} + u_{i2}v_{2j} + \dots + u_{in}v_{nj} = \sum_{k=1}^{n} u_{ik}v_{kj}$
In matrix form: $\mathbf{D} = \mathbf{U}\mathbf{V}^T$ (data = scores × loadings)
The experimental data are modelled as a linear sum of factors (loadings) weighted by their scores.
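Bilinearity
$\mathbf{D} = \mathbf{A} + \mathbf{B} + \mathbf{C} + \dots + \mathbf{E}$
[Figure: the data matrix D drawn as the sum of the contribution matrices A, B, C, ... and E]
Assumption: bilinearity (the contributions of the components in the two orders of measurement are additive).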
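A small numerical check of the additivity assumption, using simulated Gaussian profiles (an assumption for illustration; the names `conc` and `spectra` are hypothetical): each component contributes a rank-one matrix, and the data matrix is their sum.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

t = np.arange(50.0)     # first mode, e.g. elution time
wl = np.arange(80.0)    # second mode, e.g. wavelength

# One profile per component in each mode.
conc = np.column_stack([gauss(t, 15, 4), gauss(t, 25, 5), gauss(t, 35, 4)])
spectra = np.vstack([gauss(wl, 20, 8), gauss(wl, 40, 10), gauss(wl, 60, 8)])

# Each component gives a rank-one contribution (outer product of its profiles).
contributions = [np.outer(conc[:, n], spectra[n]) for n in range(3)]
D = sum(contributions)

# The sum of rank-one terms equals the matrix product of the two profile sets.
print(np.allclose(D, conc @ spectra))
print([int(np.linalg.matrix_rank(A, tol=1e-8)) for A in contributions])
print(int(np.linalg.matrix_rank(D, tol=1e-8)))
```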
Goals of the bilinear model
Recovery of the responses of every component (chemical species) in the different modes of measurement.
[Figure: the data matrix shown as equal to the product of the component profiles recovered in the two modes of measurement]
Soft-modelling: Factor Analysis (traditional approach)
[Flowchart. Stages: experimental data matrix, principal components, factor identification (target testing, cluster analysis), real factor models, predictions. Objects: data matrix, covariance matrix, abstract factors, new abstract factors, real factors. Operations linking them: matrix multiplication, decomposition, abstract reproduction, abstract rotation, target transformation, combination.]
Soft-modelling methods (I)
Factor Analysis methods based on the use of latent variables or eigenvalue/singular value data matrix decompositions. Examples:
• PCA, SVD, rotation FA methods
• Evolving Factor Analysis methods
• Rank Annihilation methods
• Window Factor Analysis methods
• Heuristic Evolving Latent Projections methods
• Subwindow Factor Analysis methods
• ...
Soft-modelling methods (II)
Multivariate Resolution methods decompose a data matrix into its 'pure' components without explicitly using latent variable analysis techniques. Examples:
• SIMPLISMA
• Orthogonal Projection Approach (OPA)
• Positive Matrix Factorization methods (and Multilinear Engine extensions)
• Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS)
• Gentle
• ...
Soft-modelling methods (III)
Three-way and multiway methods, which decompose three-way or multiway data structures. Examples:
• Multiway and multiset extensions of PCA
• Generalized Rank Annihilation, GRAM; Direct Trilinear Decomposition (DTD, TLD)
• Multiway and multiset extensions of MCR-ALS methods
• PARAFAC-ALS
• Tucker3-ALS
• ...
Factor Analysis in Chemistry, 3rd Ed., E.R.Malinowski, John Wiley & Sons, New York, 2002
Principal Component Analysis, I.T. Jolliffe, 2nd Ed., Springer, Berlin, 2002
Multiway Analysis, Applications in the Chemical Sciences, A.Smilde, R.Bro and P.Geladi, John Wiley & Sons, New York, 2004
Multivariate Image Analysis, P.Geladi, John Wiley and Sons, 1996
Soft modeling of Analytical Data. A.de Juan, E.Casassas and R.Tauler, Encyclopedia of Analytical Chemistry: Instrumentation and Applications, Edited by R.A.Meyers, John Wiley & Sons, 2000, Vol 11, 9800-9837
Data structures and types of models
One-way data (vectors): linear and non-linear models
$d_i = b_0 + b\,c_i$; $d_i = f_{\text{non-linear}}(c_i)$
Two-way data (matrices): bilinear and non-bilinear models
$d_{ij} = \sum_{n=1}^{N} c_{in}s_{nj} + e_{ij}$
Non-bilinear data can still be linear in one of the two modes.
Three-way data (cubes): trilinear and non-trilinear models
Trilinear: $d_{ijk} = \sum_{n=1}^{N} c_{in}s_{jn}z_{kn} + e_{ijk}$
Non-trilinear: $d_{ijk} = \sum_{p=1}^{N}\sum_{q=1}^{N}\sum_{r=1}^{N} g_{pqr}\,c_{ip}s_{jq}z_{kr} + e_{ijk}$
Non-trilinear data can still be bilinear in two modes.
Indices: i = 1, ..., I samples; j = 1, ..., J variables; k = 1, ..., K conditions.
Bilinear models for two-way data:
$d_{ij} = \sum_{n=1}^{N} c_{in}s_{nj} + e_{ij}$
$\mathbf{D} = \mathbf{C}\mathbf{S}^T + \mathbf{E}$
- $d_{ij}$ is the data measurement (response) of variable j in sample i
- n = 1, ..., N indexes the components (species, sources, ...)
- $c_{in}$ is the concentration of component n in sample i
- $s_{nj}$ is the response of component n at variable j
[Figure: data matrix D with I rows and J columns, and element d_ij]
Bilinear models for two-way data
D (I × J) = U or C (I × N) times V^T or S^T (N × J), plus E (I × J)
N << I or J
PCA: D = U V^T + E
- U orthogonal, V^T orthonormal
- V^T in the direction of maximum variance
- Unique solutions but without physical meaning
- Useful for interpretation but not for resolution!
MCR: D = C S^T + E
- C (in place of U) and S^T (in place of V^T) non-negative, ...
- C or S^T normalization
- Other constraints (non-negativity, unimodality, local rank, ...)
- Non-unique solutions but with physical meaning
- Useful for resolution (and obviously for interpretation)!
PCA Model (Principal Component Analysis)
X = U V^T + E
- U: 'scores' matrix (orthogonal)
- V^T: 'loadings' matrix (orthonormal)
SVD Model (Singular Value Decomposition)
D = U* S V^T + E
- U*: 'scores' matrix (orthonormal)
- S: diagonal matrix of the singular values $s_i$, with $s_i = \lambda_i^{1/2}$, where $\lambda_i$ are the eigenvalues of the covariance matrix $\mathbf{D}\mathbf{D}^T$
- V^T: 'loadings' matrix (orthonormal)
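The relation between the two models can be checked numerically. The sketch below uses a random matrix as stand-in data (an assumption) and NumPy's `numpy.linalg.svd`; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.random((20, 8))

# SVD model: D = U* S V^T, with orthonormal U* and V^T.
U_star, s, Vt = np.linalg.svd(D, full_matrices=False)

# PCA scores: U = U* S, so the columns of U are orthogonal but not normalised.
U = U_star * s

# Singular values are the square roots of the eigenvalues of D D^T.
eigvals = np.linalg.eigvalsh(D @ D.T)[::-1][:len(s)]

print(np.allclose(U @ Vt, D))                 # D is reproduced
print(np.allclose(eigvals, s ** 2))           # eigenvalue = (singular value)^2
print(np.allclose(U.T @ U, np.diag(s ** 2)))  # orthogonal scores
```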
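PCA Model: D = U V^T
$\mathbf{D} = \mathbf{U}\mathbf{V}^T + \mathbf{E} = \mathbf{u}_1\mathbf{v}_1^T + \mathbf{u}_2\mathbf{v}_2^T + \dots + \mathbf{u}_n\mathbf{v}_n^T + \mathbf{E}$
- each term $\mathbf{u}_k\mathbf{v}_k^T$ (scores times loadings, or projections) is a rank-1 matrix
- E is the unexplained variance
- n is the number of components (<< number of variables in D)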
PCA Model
X = U V^T + E
X = structure + noise
It is an approximation to the experimental data matrix X.
• Loadings, projections: V^T. Relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in V^T (loadings) are orthonormal (orthogonal and normalized).
• Scores, targets: U. Relationships between the samples (coordinates of samples or objects in the space defined by the principal components). The vectors in U (scores) are orthogonal.
• Noise: E. Experimental error, non-explained variances.
Summary of Principal Component Analysis (PCA)
1. Formulation of the problem to solve.
2. Plot of the original data.
3. Data pretreatment (data centering, autoscaling, logarithmic transformation, ...).
4. Build the PCA model. Determination of the number of components. Graphical inspection of explained/residual variance plots.
5. Study of the PCA model. Multivariate data exploration:
- 'loadings' plot ==> map of the variables
- 'scores' plot ==> map of the samples
6. Interpretation of the PCA model. Identification of the main sources of data variance.
7. Analysis of the residuals matrix E = D - U V^T.
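Steps 3 to 7 can be sketched with NumPy as follows. This is a minimal sketch, assuming column mean-centering as the only pretreatment and random two-factor data as a stand-in for a real data set; the function name `pca` and its return values are illustrative.

```python
import numpy as np

def pca(D, n_components):
    """Mean-centred PCA by SVD: scores, loadings, explained variance, residuals."""
    Dc = D - D.mean(axis=0)                              # step 3: centering
    U_star, s, Vt = np.linalg.svd(Dc, full_matrices=False)
    scores = U_star[:, :n_components] * s[:n_components]  # U, map of samples
    loadings = Vt[:n_components]                           # V^T, map of variables
    explained = s ** 2 / np.sum(s ** 2)                    # variance per component
    E = Dc - scores @ loadings                             # step 7: residuals
    return scores, loadings, explained, E

# Simulated data (assumption): 22 samples, 96 variables, two latent factors.
rng = np.random.default_rng(4)
latent = rng.standard_normal((22, 2))
weights = rng.standard_normal((2, 96))
D = latent @ weights + 0.05 * rng.standard_normal((22, 96))

scores, loadings, explained, E = pca(D, n_components=2)
print("explained variance of PC1 and PC2:", np.round(explained[:2], 3))
print("residual sum of squares:", float(np.sum(E ** 2)))
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the scores plot of step 5, and the rows of `loadings` give the loadings plot.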
[Figure: PCA of a pollutant concentration data set D (22 sampling sites, Site 1 to Site 22, by 96 organic compounds, [org]1 to [org]96), decomposed into scores U and loadings V^T. Panels: scores plot, loadings plot and biplot, each with axes PC1 (41%) and PC2 (27%).]
Multivariate Curve Resolution (MCR)
D (mixed information) is resolved into C and S^T (pure component information).
- C = [c1 ... cn]: pure concentration profiles (e.g. against retention time, tR). They give the chemical model, the process evolution, the compound contributions and relative quantitation.
- S^T = [s1; ...; sn]: pure signals (e.g. against wavelength). They give the compound identity, source identification and interpretation.
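Factor Analysis Ambiguities in the analysis of a data matrix (two-way data)
Rotation Ambiguities
Factor Analysis (PCA) data matrix decomposition: $\mathbf{D} = \mathbf{U}\mathbf{V}^T + \mathbf{E}$
'True' data matrix decomposition: $\mathbf{D} = \mathbf{C}\mathbf{S}^T + \mathbf{E}$
Rotation and scale/intensity ambiguities
$\mathbf{D} = \mathbf{U}\mathbf{T}\mathbf{T}^{-1}\mathbf{V}^T + \mathbf{E} = \mathbf{C}\mathbf{S}^T + \mathbf{E}$
$\mathbf{C} = \mathbf{U}\mathbf{T}$; $\mathbf{S}^T = \mathbf{T}^{-1}\mathbf{V}^T$
How to find the rotation matrix T?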
Rotation ambiguities
$\mathbf{D} = \mathbf{C}\mathbf{S}^T + \mathbf{E} = \mathbf{D}^{*} + \mathbf{E}$
$\mathbf{C}_{new} = \mathbf{C}\,\mathbf{T}$, with dimensions (NR,N) = (NR,N)(N,N)
$\mathbf{S}^T_{new} = \mathbf{T}^{-1}\mathbf{S}^T$, with dimensions (N,NC) = (N,N)(N,NC)
$\mathbf{D}^{*} = \mathbf{C}\mathbf{S}^T = \mathbf{C}_{new}\mathbf{S}^T_{new}$
Matrix decomposition is not unique! T(N,N) is any non-singular matrix.
Rotational freedom for any T.
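The non-uniqueness is easy to verify numerically. In this sketch the profiles and the matrix `T` are arbitrary made-up values; any non-singular `T` would do.

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.random((30, 2))     # (NR, N) concentration profiles
St = rng.random((2, 40))    # (N, NC) spectra

# Any non-singular T(N, N) gives an equally good decomposition.
T = np.array([[1.0, 0.3],
              [0.2, 1.0]])
C_new = C @ T
St_new = np.linalg.inv(T) @ St

print(np.allclose(C @ St, C_new @ St_new))   # same D*
print(np.allclose(C, C_new))                 # but different profiles
```

Constraints such as non-negativity restrict which matrices T remain acceptable, which is how the range of feasible solutions is narrowed.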
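Rotation ambiguities and rotation matrix T(N,N)
For two components:
$[\mathbf{c}_{new,1}\ \ \mathbf{c}_{new,2}] = [\mathbf{c}_{old,1}\ \ \mathbf{c}_{old,2}] \begin{bmatrix} t_{1,1} & t_{1,2} \\ t_{2,1} & t_{2,2} \end{bmatrix}$
$\begin{bmatrix} \mathbf{s}^T_{new,1} \\ \mathbf{s}^T_{new,2} \end{bmatrix} = \begin{bmatrix} t^{-1}_{1,1} & t^{-1}_{1,2} \\ t^{-1}_{2,1} & t^{-1}_{2,2} \end{bmatrix} \begin{bmatrix} \mathbf{s}^T_{old,1} \\ \mathbf{s}^T_{old,2} \end{bmatrix}$
where $t^{-1}_{i,j}$ denotes element (i,j) of $\mathbf{T}^{-1}$.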
Intensity (scale) ambiguities:
$d_{ij} = \sum_n c_{in}s_{nj} = \sum_n (k\,c_{in})\left(\tfrac{1}{k}\,s_{nj}\right)$
for any non-zero scalar k.
Intensity/scale ambiguities make it difficult to obtain quantitative information. When they are solved, quantitative information can also be obtained.
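Intensity (scale) ambiguities:
$\mathbf{c}_{new} = k\,\mathbf{c}_{old}$
$\mathbf{s}_{new} = \tfrac{1}{k}\,\mathbf{s}_{old}$
$\mathbf{c}_{old}\,\mathbf{s}_{old} = (k\,\mathbf{c}_{old})\left(\tfrac{1}{k}\,\mathbf{s}_{old}\right) = \mathbf{c}_{new}\,\mathbf{s}_{new}$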
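A one-component numerical illustration of the scale ambiguity, with made-up profiles and an arbitrary factor k (both assumptions):

```python
import numpy as np

c_old = np.array([0.0, 0.2, 0.8, 1.0, 0.5])   # concentration profile
s_old = np.array([0.1, 0.9, 0.4])             # pure spectrum
k = 5.0                                       # any non-zero scalar

c_new = k * c_old
s_new = s_old / k

# Both pairs reproduce exactly the same rank-one contribution to D.
same_contribution = np.allclose(np.outer(c_old, s_old), np.outer(c_new, s_new))
print(same_contribution)
```

Fixing the scale of one factor, for example by normalising each spectrum or by including a sample of known concentration, removes this freedom.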
Questions to answer:
1. Is it possible to have unique solutions?
2. What are the conditions to have unique solutions?
3. If totally unique solutions are not possible:
a) Is it still possible to find at least some of the possible solutions?
b) Is it possible to estimate the band or range of possible/feasible solutions?
c) How can this range of feasible solutions be reduced?
Definitions
Mathematical rank of a data matrix is the minimum number of linearly independent rows or columns describing the variance of the whole data set, i.e. the minimum number of basis vectors spanning the row and column vector spaces. It may be obtained by SVD or PCA.
Pseudo-rank or chemical rank is the mathematical rank in the absence of experimental error/noise. It is usually equal to the number of chemical/physical components contributing to the observed data variance, apart from experimental noise/error. It is obtained from the number of larger components in PCA, SVD or other FA methods.
Local rank is the chemical rank of data submatrices. It is obtained from EFA, EFF, SIMPLISMA, OPA or other FA submatrix analysis methods.
Rank deficiency occurs when the chemical rank is lower than the known number of contributions. It may be broken/solved by data matrix augmentation and perturbation strategies.
Rank overlap is rank deficiency caused by equal vector profiles of different chemical/physical components in one or more modes.
Pseudo Rank: Number of contributions (factors, components)
Principal Component Analysis
$\mathbf{D} = \mathbf{U}\mathbf{V}^T$, with $\mathbf{U} = [\mathbf{u}_1 \dots \mathbf{u}_n]$ and $\mathbf{V}^T = [\mathbf{v}_1 \dots \mathbf{v}_n]^T$
PCA gives an abstract (orthogonal) bilinear model that describes the variation in our data set optimally.
Useful chemical information: the size of the model (chemical rank) gives the number of chemical contributions.
Principal Component Analysis (SVD algorithm)
$\mathbf{D} = \mathbf{T}\mathbf{P}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T$
S is the diagonal matrix of singular values. The magnitude of a singular value reflects the importance of the corresponding contribution.
Principal Component Analysis (SVD algorithm)
$\mathbf{D} = \mathbf{T}\mathbf{P}^T = \mathbf{U}\mathbf{S}\mathbf{V}^T$, where S is the diagonal matrix of singular values.
The pseudo-rank can be read from a plot of the singular values or of log(eigenvalues), with eigenvalue = (singular value)$^2$.
[Figure: log(eigenvalues) versus number of components (1 to 10); the eigenvalues of large size are separated from those of small size, indicating 4 contributions]
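The reading of the eigenvalue plot can be imitated numerically. The sketch below simulates a four-component data set (an assumption) and estimates the pseudo-rank from the largest ratio between consecutive singular values; this gap criterion is only one simple heuristic among many and is not prescribed by the slides.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

rng = np.random.default_rng(3)
t = np.arange(50.0)
wl = np.arange(80.0)
C = np.column_stack([gauss(t, c, 4) for c in (12, 20, 28, 36)])
St = np.vstack([gauss(wl, c, 8) for c in (15, 30, 45, 60)])
D = C @ St + 1e-3 * rng.standard_normal((50, 80))

s = np.linalg.svd(D, compute_uv=False)   # singular values, descending
eigenvalues = s ** 2                     # eigenvalue = (singular value)^2

# Heuristic: the pseudo-rank is where consecutive singular values drop most.
gaps = s[:-1] / s[1:]
pseudo_rank = int(np.argmax(gaps)) + 1

print(np.round(np.log10(eigenvalues[:8]), 2))
print("estimated pseudo-rank:", pseudo_rank)
```

With higher noise or minor components the gap becomes less clear, which is where the over- and underestimation problems arise.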
Overestimations of rank (overfitting):
- Large overestimation: the measurements may not follow a bilinear model.
- Small overestimation: presence of structured noise or high noise levels.
Underestimations of rank (rank deficiency):
- Linear dependencies.
- Contributions with very similar signals or concentration profiles.
- Compounds with non-measurable signals.
- Minor compounds.
Rank deficiency
• Are all the signals distinguishable and independent?
• Are all the concentration profiles distinguishable and independent?
If not, the system is rank-deficient: detectable rank < number of process contributions.
Examples:
1) 2nd order reaction A = B + C with [B] = [C]: 3 chemical species/contributions, but rank = 2.
2) Enantiomer conversion monitored by UV, where spectrum D = spectrum L: two chemical species/components, but rank = 1 (rank overlap).
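Both examples can be reproduced with noise-free simulated data. In this sketch the concentration profiles are hypothetical exponential curves chosen only so that [B] = [C] at all times, and the spectra are made-up Gaussians; the names are illustrative.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

t = np.linspace(0, 10, 40)
wl = np.arange(60.0)

# Example 1: three species with different spectra, but [B] = [C] at all times.
a = np.exp(-0.5 * t)
C_reaction = np.column_stack([a, 1 - a, 1 - a])
St_reaction = np.vstack([gauss(wl, 15, 6), gauss(wl, 30, 6), gauss(wl, 45, 6)])
D_reaction = C_reaction @ St_reaction

# Example 2: two enantiomers with different profiles but identical UV spectra.
c_L = np.exp(-0.3 * t)
C_enant = np.column_stack([c_L, 1 - c_L])
spectrum = gauss(wl, 30, 8)
D_enant = C_enant @ np.vstack([spectrum, spectrum])

rank_reaction = np.linalg.matrix_rank(D_reaction, tol=1e-8)
rank_enant = np.linalg.matrix_rank(D_enant, tol=1e-8)
print(rank_reaction, rank_enant)
```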
Rank deficiency
• Closed reaction systems. Some concentration profiles are described as linear combinations of others.
System HA / A-, HB / B-
C_A = [HA] + [A-]
C_B = [HB] + [B-]
C_B = k C_A
[HA], [HB]
[A-] = C_A - [HA]
[B-] = C_B - [HB] = k C_A - [HB]
[HA], [HB], [A-], [B-] = f([HA], [HB], C_A): Rank 3
[Figure: total concentrations C_A and C_B plotted against pH (1 to 11)]
Breaking rank-deficiency by matrix augmentation
• Matrix augmentation in the rank-deficient direction
[Figure: augmented data set built from the matrix for HA alone and the matrix for the HA / HB mixture, both measured along pH]
C_B ≠ k C_A
[B-] = C_B - [HB] ≠ k C_A - [HB]
[HA], [HB], [A-], [B-] ≠ f([HA], [HB], C_A)
Rank 4
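The effect of augmentation on the rank can be checked on simulated concentration matrices. This is a sketch under stated assumptions: hypothetical pKa values of 4 and 8, total concentrations that decrease by dilution during the titration, and a second experiment containing HA alone. Only the concentration matrices are examined, since the rank of D cannot exceed the rank of C.

```python
import numpy as np

def fraction_protonated(pH, pKa):
    """Fraction of a monoprotic acid present in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

pH = np.linspace(1, 12, 30)
alpha = fraction_protonated(pH, 4.0)   # HA (assumed pKa)
beta = fraction_protonated(pH, 8.0)    # HB (assumed pKa)
k = 0.5

def conc_matrix(CA, CB):
    # Columns: [HA], [HB], [A-], [B-]
    return np.column_stack([CA * alpha, CB * beta,
                            CA * (1 - alpha), CB * (1 - beta)])

# Experiment 1: mixture with C_B = k C_A throughout (closed system).
CA1 = np.linspace(0.10, 0.05, pH.size)
C1 = conc_matrix(CA1, k * CA1)

# Experiment 2: HA alone, so C_B is no longer proportional to C_A.
C2 = conc_matrix(CA1, np.zeros(pH.size))

# Augmentation: both experiments stacked along the pH direction.
C_aug = np.vstack([C1, C2])

rank_single = np.linalg.matrix_rank(C1, tol=1e-8)
rank_aug = np.linalg.matrix_rank(C_aug, tol=1e-8)
print(rank_single, rank_aug)
```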
Local exploratory analysis
Study of the variation of the number of contributions in the process or system. Study of the rank variation during the process.
Evolving Factor Analysis (EFA)
Fixed Size Moving Window - Evolving Factor Analysis (FSMW-EFA)
Evolving Factor Analysis
• Stepwise chemometric monitoring of a process.
– Forward Evolving FA (from beginning to end)
– Backward Evolving FA (from end to beginning)
Working procedure
Display of subsequent PCA analyses along gradually increasing data set windows.
HPLC-DAD example
[Figure: data matrix D with retention times as rows and wavelengths as columns; each row is a spectrum and each column a chromatogram]
Forward Evolving Factor Analysis
PCA is repeated on windows that grow from the first row towards the end of the data set.
[Figure: log(eigenvalues) versus retention time (0 to 50) for forward EFA]
The plot shows:
- the total number of compounds (PCA in the last window)
- the noise level
- the location of the emergence of compounds
- the selective zone
Backward Evolving Factor Analysis
PCA is repeated on windows that grow from the last row towards the beginning of the data set.
[Figure: log(eigenvalues) versus retention time (0 to 50) for backward EFA]
The plot shows:
- the total number of compounds (PCA in the last, total window)
- the noise level
- the location of the disappearance of compounds
- the selective zone
Combined EFA plot (forward and backward EFA)
[Figure: forward and backward log(eigenvalues) curves overlaid against retention time (5 to 50)]
- Location of emergence and decay of compounds
- Detection of selective zones (extremes)
- Total number of components (PCA of extreme windows)
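Forward and backward EFA can be sketched as repeated SVDs on growing windows. This is a minimal sketch on a simulated two-compound elution (an assumption); it stores singular values, whose squares are the eigenvalues plotted on the slides, and the function name `efa` and the 0.1 threshold are illustrative choices.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

def efa(D, n_sv=4):
    """Singular values of growing windows, forward and backward."""
    n_rows = D.shape[0]
    fwd = np.full((n_rows, n_sv), np.nan)
    bwd = np.full((n_rows, n_sv), np.nan)
    for i in range(1, n_rows + 1):
        s = np.linalg.svd(D[:i], compute_uv=False)[:n_sv]           # rows 0..i-1
        fwd[i - 1, :len(s)] = s
        s = np.linalg.svd(D[n_rows - i:], compute_uv=False)[:n_sv]  # last i rows
        bwd[n_rows - i, :len(s)] = s
    return fwd, bwd

# Simulated HPLC-DAD-like data: two overlapping elution peaks plus noise.
rng = np.random.default_rng(5)
t = np.arange(50.0)
wl = np.arange(80.0)
C = np.column_stack([gauss(t, 20, 4), gauss(t, 28, 4)])
St = np.vstack([gauss(wl, 25, 10), gauss(wl, 55, 10)])
D = C @ St + 1e-3 * rng.standard_normal((50, 80))

fwd, bwd = efa(D)
# Number of compounds from the last forward window (threshold is an assumption).
n_compounds = int(np.sum(fwd[-1] > 0.1))
print("compounds detected:", n_compounds)
```

Plotting `np.log10(fwd ** 2)` and `np.log10(bwd ** 2)` against retention time gives the combined EFA plot; the rows where a curve rises above the noise level mark the emergence (forward) or decay (backward) of a compound.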
Approximate concentration profiles
Sequential processes: consecutive emergence-decay profiles. No embedded compounds.
[Figure: combined EFA plot, log(eigenvalues) versus retention time, marking the noise level, the concentration window of each compound and the zero-component windows]
[Figure: EFA-derived concentration profiles compared with the real concentration profiles, both against retention time]
Fixed Size Moving Window-Evolving FA (FSMW-EFA)
• Local rank map along the process direction or the signal direction.
Working procedure
Subsequent PCA in fixed-size windows moving stepwise along the data set.
Window size: at least (number of components + 1)
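The moving-window procedure can be sketched as follows, again on a simulated two-compound elution (an assumption). The local rank is taken here as the number of singular values above a fixed threshold, which is an illustrative choice that would normally be set from the observed noise level; the function name `fsmw_efa` is hypothetical.

```python
import numpy as np

def gauss(x, centre, width):
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

def fsmw_efa(D, window=5, n_sv=3):
    """Leading singular values of each fixed-size window of consecutive rows."""
    n_windows = D.shape[0] - window + 1
    sv = np.empty((n_windows, n_sv))
    for start in range(n_windows):
        block = D[start:start + window]
        sv[start] = np.linalg.svd(block, compute_uv=False)[:n_sv]
    return sv

rng = np.random.default_rng(6)
t = np.arange(50.0)
wl = np.arange(80.0)
C = np.column_stack([gauss(t, 20, 3), gauss(t, 30, 3)])
St = np.vstack([gauss(wl, 25, 10), gauss(wl, 55, 10)])
D = C @ St + 1e-4 * rng.standard_normal((50, 80))

sv = fsmw_efa(D, window=5, n_sv=3)
threshold = 0.01                             # assumed noise threshold
local_rank = np.sum(sv > threshold, axis=1)  # local rank map along elution
print(local_rank)
```

Windows with local rank 1 are candidate selective zones, and windows with local rank 0 contain no detectable compound.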
FSMW-EFA
[Figure: log(eigenvalues) versus retention time (0 to 50) from PCA in a fixed-size window moved along the data set; annotations mark the noise level and the local rank (0, 1 or 2) of each zone]
- Detection of selective zones along the whole process
- Variation of local rank along the process direction (complexity, degree of overlap among compounds)
[Figure: local rank detection by EFA and by FSMW-EFA, with window size 5 and window size 8]
FSMW-EFA vs. EFA
EFA
• Displays the evolution of the process.
• The compounds are well identified (concentration windows).
• Local rank information is not easily interpreted.
FSMW-EFA
• Clear definition of local rank.
• Sensitive to detection of minor compounds.
• The idea of process evolution is not preserved.
Getting local rank information from Evolving Factor Analysis methods
• Detection of the selective windows or regions where only one species exists (total selectivity)
• Detection of zero concentration windows or regions (no species is present)
• Detection of windows or regions where a particular species is not present
• Detection of the concentration windows or regions where one species is present (other species can coexist)
References
• EFA
– H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbühler. Talanta, 32, 1133-1139 (1985).
– M. Maeder. Anal. Chem., 59, 527-530 (1987).
• FSMW-EFA
– H.R. Keller and D.L. Massart. Anal. Chim. Acta, 246, 379-390 (1991).
• SIMPLISMA
– W. Windig and J. Guilment. Anal. Chem., 63, 1425-1432 (1991).