Three-way analysis
Batch organic synthesis
Paul Geladi
Head of Research NIRCE
Chairperson NIR Nord
Unit of Biomass Technology and Chemistry
Swedish University of Agricultural Sciences, Umeå
Technobothnia, Vasa
paul.geladi @ btk.slu.se paul.geladi @ syh.fi
THREE-WAY ARRAY
[Figure: three-way array with dimensions I x J x K]
A = batch
B = variable
C = time
Literature
Geladi P. & Åberg P., Three-way modeling of a batch organic synthesis process monitored by near infrared spectroscopy, Journal of Near Infrared Spectroscopy, 9, 1-9, 2001
Geladi P. & Forsström J., Monitoring of a batch organic synthesis by infrared spectroscopy: modeling and interpretation of three-way data, Journal of Chemometrics, 16, 329-338, 2002.
Three-way arrays
• GC-MS
• LC-UV
• Fluorescence
• Batch processing
• many others
Properties
• Components / pseudorank
• 3 types, not 2
• No orthogonality
• Parsimonious model
[Figure: All batches and times. Pseudoabsorbance (0 to 4) against wavelength (400 to 2400 nm).]
BATCH REACTION
• ester synthesis by refluxing alcohol and acid
• many batches as experimental design
• measure NIR spectrum with transflectance fiberoptic probe at regular intervals
• 400-2500 nm every 2 nm, 32 scans average
• reference = air
REACTION
[Scheme: acid-catalysed esterification mechanism (H+, Amberlite); the water formed is removed with a molecular sieve (-H2O).]
REACTION
C5H11OH + CH3COOH -> C5H11OCOCH3 + H2O
-acid catalysis H+
-remove water to shift equilibrium
[Figure: transflectance probe with fibers in, fibers out and a mirror.]
[Figure: One batch, all times. Pseudoabsorbance (0 to 2) against wavelength (400 to 2200 nm); labels A, B, C, D.]
Parsimony = not using too many model parameters = no overfitting
10 Stations x 13 Variables x 22 Times
2 Components
MODEL PARAMETERS
Model     Matrix     Parameters
PCA1      10 x 286   20 + 572 = 592
PCA2      13 x 220   26 + 440 = 466
PCA3      22 x 130   44 + 260 = 304
PARAFAC              20 + 26 + 44 = 90
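The parameter counts on this slide follow from simple arithmetic: a two-way PCA of an unfolded array needs R scores per row plus R loadings per column, while PARAFAC needs R loadings per mode. The sketch below reproduces the counts; the function names are made up for illustration.

```python
def pca_parameter_count(n_rows, n_cols, n_components):
    """Scores (n_rows x R) plus loadings (n_cols x R) of a two-way PCA model."""
    return n_components * n_rows + n_components * n_cols


def parafac_parameter_count(shape, n_components):
    """One loading vector per mode and per component."""
    return n_components * sum(shape)


I, J, K = 10, 13, 22   # stations x variables x times, as on the slide
R = 2                  # number of components

counts = {
    "PCA1 (10 x 286)": pca_parameter_count(I, J * K, R),
    "PCA2 (13 x 220)": pca_parameter_count(J, I * K, R),
    "PCA3 (22 x 130)": pca_parameter_count(K, I * J, R),
    "PARAFAC": parafac_parameter_count((I, J, K), R),
}

for name, n in counts.items():
    print(f"{name}: {n} parameters")
```

The PARAFAC count is 2 x (10 + 13 + 22) = 90, several times smaller than any of the unfolded PCA models, which is the parsimony argument.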
IMPORTANT QUESTIONS
- can we learn something about reaction kinetics?
- can we see difference between batches?
- can we interpret the spectra?
- how does it all fit together?
REACTION 1
14 x 701 x 13 array.
Source of SS % explained
Rank 3 model 97.1
Residual 2.9
Total 100
Component 1 48.0
Component 2 15.3
Component 3 4.0
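The component percentages in this table do not add up to the percentage of the full model. One plausible reading, given the earlier "No orthogonality" property, is that PARAFAC components overlap, so their sums of squares are not additive. The sketch below illustrates that effect on made-up positive loadings; the array sizes and random values are assumptions, not the slide's data.

```python
import numpy as np


def rank_one(a, b, c):
    """Outer product of three loading vectors: one PARAFAC component."""
    return np.einsum("i,j,k->ijk", a, b, c)


rng = np.random.default_rng(0)
# Hypothetical positive loadings, so the two components are correlated.
a = rng.uniform(0.1, 1.0, size=(14, 2))
b = rng.uniform(0.1, 1.0, size=(30, 2))
c = rng.uniform(0.1, 1.0, size=(13, 2))

comp1 = rank_one(a[:, 0], b[:, 0], c[:, 0])
comp2 = rank_one(a[:, 1], b[:, 1], c[:, 1])
model = comp1 + comp2

ss_model = np.sum(model ** 2)                       # SS of the two-component model
ss_parts = np.sum(comp1 ** 2) + np.sum(comp2 ** 2)  # sum of per-component SS
cross = 2.0 * np.sum(comp1 * comp2)                 # overlap term, zero only if orthogonal

print(f"SS model          : {ss_model:.3f}")
print(f"sum of SS per comp: {ss_parts:.3f}")
print(f"cross term        : {cross:.3f}")
```

With orthogonal components, as in PCA, the cross term vanishes and the percentages add up; here they do not.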
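1050 wavelengths
14 batches
13 times
[Diagram: PARAFAC decomposition of the three-way array X (I x J x K) into rank-one components plus a residual array E:]
$X = a_1 \otimes b_1 \otimes c_1 + a_2 \otimes b_2 \otimes c_2 + E$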
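A PARAFAC decomposition of a three-way array into rank-one components is commonly fitted by alternating least squares (ALS). The following is a minimal NumPy sketch of that idea, not the software used for the results in these slides; `khatri_rao`, `parafac_als`, the random initialisation and the stopping rule are all assumptions made for illustration, and the test array is synthetic.

```python
import numpy as np


def khatri_rao(C, B):
    """Column-wise Kronecker product; row k*J + j holds C[k, r] * B[j, r]."""
    K, R = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)


def parafac_als(X, rank, n_iter=1000, tol=1e-12, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by plain ALS."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfold the array once per mode; column order matches khatri_rao above.
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)
    total = np.sum(X ** 2)
    prev = np.inf
    for _ in range(n_iter):
        # Each update is an ordinary least-squares problem in one mode.
        A = X1 @ np.linalg.pinv(khatri_rao(C, B)).T
        B = X2 @ np.linalg.pinv(khatri_rao(C, A)).T
        C = X3 @ np.linalg.pinv(khatri_rao(B, A)).T
        ss_res = np.sum((X1 - A @ khatri_rao(C, B).T) ** 2)
        if prev - ss_res < tol * total:
            break
        prev = ss_res
    return A, B, C, 1.0 - ss_res / total


# Synthetic rank-2 array (hypothetical sizes), just to exercise the sketch.
rng = np.random.default_rng(1)
A0 = rng.standard_normal((14, 2))
B0 = rng.standard_normal((30, 2))
C0 = rng.standard_normal((13, 2))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C, explained = parafac_als(X, rank=2)
print(f"fraction of SS explained: {explained:.4f}")
```

Plain ALS can converge slowly when components are strongly correlated, which is common with spectra; practical implementations add constraints, line search or better starting values.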
[Figure: Line plots of C-loadings. PARAFAC loading against time for Comp 1, Comp 2 and Comp 3.]
[Figure: Scatter plot of C-loadings 1 and 2; points labelled by time, 1 to 13.]
[Figure: B-loading 1, B-loading 2 and B-loading 3 against wavelength (600 to 1800 nm); panels A, B and C.]
[Figure: First derivative spectra against wavelength (600 to 1800 nm).]
[Figure: PARAFAC loadings b1 and mean of acid and alcohol pure spectra.]
[Figure: scatter plot, Comp 1 (45%) and Comp 2 (17%), of b1, b2 and b3 together with ALCOHOL, ACID, ESTER, WATER, ACETOPHENONE and REACTION MIXTURE.]
[Figure: Batch mode loadings: first and second (a1 and a2). Points labelled 28, 55, 120, 90, 100, 150, 70, 78, 60, 90, 63, 95, 90, 95; regions marked SLOW and FAST.]
REACTION 2
6 x 40 x 776 array
Component   %SS    SS
1           62     2.73
2           18     0.78
3           16     0.71
4           3.2    0.14
Model       99.2   4.38
Residual    0.8    0.038
Total       100    4.42
[Figure: spectra shown as wavelength (400 nm to 2498 nm) against time (0 min to 120 min).]
[Figure: Standard deviation (Absorbance) against time.]
[Figure: Standard deviation (Absorbance) against wavelength (400 to 2400 nm).]
11 Batches
40 Times
A few hundred wavelengths
[Diagram: three-component PARAFAC model with loadings a1, a2, a3; b1, b2, b3; c1, c2, c3.]
3^2 design with extra center points
[Figure: a1 against batch number (1 to 11), showing a block effect.]
Fig 10.51
[Figure: score plot (t1, t2) showing the block effect: early batches and late batches.]
[Figure: loading p1 against wavelength (400 to 2400 nm).]
[Figure: "Block" loading compared with standards: Ester, Ester/C6H6, C6H6, Acetic acid, Acetic acid/C6H6, Alcohol, Alcohol/C6H6, Water, Reaction, Reaction mixture/C6H6.]
[Figure: Dendrogram Using Unscaled Data. Batches 1 to 11; Distance to K-Means Nearest Group.]
[Figure: SGT first derivative against wavelength (400 to 2400 nm).]
3-level Koshal, or reduced 3^2, design
[Figure: design points. Reactant ratio (molar): 1, 1.5, 2. Catalyst (g): 0.15, 0.45, 0.75.]
Fig 10.55
[Figure: component size against pseudorank (1 to 5).]
[Figure: batch loadings a2 and a3. Points labelled reagent ratio / catalyst: 1.5 / 0.45, 1.5 / 0.75, 2 / 0.75, 1 / 0.75, 1 / 0.15, 1 / 0.45; directions marked Catalyst and Reagent ratio.]
[Figure: scatter plot (u1, u2) of loadings c1, c2, c3 and c4 together with standards: C6H6, Ester, Ester/C6H6, CH3COOH, CH3COOH/C6H6, C5H11OH, C5H11OH/C6H6, Mixture, Mixture/C6H6, H2O.]
[Figure: loadings against wavelength (600 to 2000 nm), annotated: H+ on CH3COOH, C6H6, C5H11OH in C6H6.]
[Figure: loadings b1, b2, b3 and b4 against time (0 to 120).]
[Figure: scatter plot of b1 and b3; points labelled by time 0, 30, 60, 90.]
[Figure: Bias against wavelength (600 to 1800 nm).]
[Figure: Sum of squares against wavelength (600 to 1800 nm).]
CONCLUSIONS
It is possible: rank 3-4
Preprocessing needed (derivative)
Interpretation of time (reaction kinetics)
Interpretation of batch mode (design)
Interpretation of spectral mode needs pure standards
What is the mystery chemical?
Visual interpretation as line or loading plots
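The derivative preprocessing named in these conclusions can be sketched as a smoothing first derivative. The example assumes that "SGT" in the figures refers to a Savitzky-Golay type filter; the window length of 7 points, the quadratic polynomial and the function names are illustrative choices, not taken from the slides. Only the 2 nm spacing comes from the measurement description.

```python
from math import factorial

import numpy as np


def savgol_weights(window, polyorder, deriv, delta):
    """Weights of a least-squares polynomial filter evaluated at the window centre."""
    half = window // 2
    x = np.arange(-half, half + 1) * delta           # positions relative to the centre
    V = np.vander(x, polyorder + 1, increasing=True)  # columns 1, x, x^2, ...
    # Row `deriv` of pinv(V) maps the window to the polynomial coefficient c_deriv;
    # the derivative of that order at x = 0 is deriv! * c_deriv.
    return factorial(deriv) * np.linalg.pinv(V)[deriv]


def sg_derivative(spectra, window=7, polyorder=2, delta=2.0):
    """First derivative along the last axis; the edge points are dropped."""
    w = savgol_weights(window, polyorder, deriv=1, delta=delta)
    return np.apply_along_axis(
        lambda y: np.correlate(y, w, mode="valid"), -1, spectra
    )


# Check on a curve with a known derivative: d/dx x^2 = 2x (2 nm spacing as in the slides).
wl = np.arange(400.0, 500.0, 2.0)
d = sg_derivative(wl ** 2)
print(np.allclose(d, 2.0 * wl[3:-3]))
```

The same function applies row-wise to a batch x wavelength matrix, or to each time slice of a three-way array, before the PARAFAC model is fitted.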
Plotting
Especially for 3-way analysis
Paul Geladi
Plotting techniques
• Line / bar plots
• Box plots
• Quantile plots
• Autocorrelation plots
• Two-dimensional plots
• Three-dimensional plots
• Joint plots / biplots
Plotting techniques
• Response surfaces
• Imaging and mapping
• Movies
• Correlation spectroscopy
• Dendrograms
• Advanced interactive visualization in more dimensions
What do we want to do?
• Inspect raw data
• Detect outliers / groupings
• Select a model
• Build the model = calculate parameters
• Choose a pseudorank
What do we want to do?
• Inspect and use the model parameters
• Study the residuals
• Use the model for predictions
• More??
Properties
• Rectangular shape
• Every point exists
• Projection
• Resolution?
Properties
• Distances are correct
• Angles are meaningful
Topology
Do all points have a continuum of close neighbours?
[Figure: Average NIR spectrum. Absorbance (0.05 to 0.5) against wavenumber (4000 to 10000 cm-1).]
What do we see?
• Data?
• Interpolation?
• Model?
• Are data fuzzy?
• Are models fuzzy?
The human eye is superb at detecting things
But also very subjective
The remedies
• Background information
• Experience
• Objective techniques
Chemometrics is poisoned by (bad) line and scatter plots
The biggest problem is with the scatter plots
Grain example
FTNIR 10000-4000 cm-1
112 x 1501
Flour
5 Locations
10 Cultivars
PCA after mean-centering
[Figure: Line plot, shown in three versions. Horizontal: # comp. Vertical: singular value. Labels: True, Easiest.]
%SS explained based on eigenvalues
# %SS Cumulative
1 78.89 78.89
2 18.21 97.10
3 1.56 98.66
4 0.77 99.43
5 0.11 99.54
6 0.08 99.62
7 0.06 99.68
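A table of this kind can be computed from the singular values of the mean-centered data matrix, since the eigenvalues are the squared singular values. The sketch below shows the calculation on random numbers of the same size as the grain data (112 x 1501); the flour spectra themselves are not reproduced here and `pca_ss_explained` is a made-up name.

```python
import numpy as np


def pca_ss_explained(X, n_components):
    """%SS and cumulative %SS per component of a PCA after mean-centering."""
    Xc = X - X.mean(axis=0)                  # column mean-centering
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    eigenvalues = s ** 2                     # SS captured by each component
    pct = 100.0 * eigenvalues / eigenvalues.sum()
    return pct[:n_components], np.cumsum(pct)[:n_components]


# Placeholder data with the dimensions of the grain example.
rng = np.random.default_rng(0)
X = rng.standard_normal((112, 1501))

pct, cumulative = pca_ss_explained(X, n_components=7)
print("#   %SS   Cumulative")
for i, (p, c) in enumerate(zip(pct, cumulative), start=1):
    print(f"{i}  {p:6.2f}  {c:6.2f}")
```

Plotting `s` rather than `pct` against the component number gives the singular-value line plot shown before the table.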
[Figure: score plot of t2 (18%) and t3 (1.6%), drawn twice: once with axes from -0.35 to 0.15 and -0.2 to 0.2, once with axes from -0.4 to 0.3 and -0.06 to 0.08.]
[Figure: Scores Plot. Predicted Y(:,2) against Measured Y(:,2), both axes 9 to 17.]
Protein in flour
PLS 6 components
Scatter plot requirements?
• Zero indicated?
• Orthonormal base?
• Equal scales?
• Mirroring?
PCA
• Never gives true spectra
• Never finds pure constituents
• Always rotates
• So why would scatter plots from it be useful?
• Factor analysis is much better
• Factors are chemically meaningful
• Curve resolution
• PARAFAC
Making PARAFAC loadings look good
$X = A\,(C \odot B)' + E = \hat{X} + E$
$X = U S V' + E$
US is the space of A in the orthonormal basis of V
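The relation between the PARAFAC model and its singular value decomposition can be checked numerically. The sketch below builds the unfolded model part from loadings A, B and C, takes its SVD and uses US as plotting coordinates; the loadings are random placeholders, `khatri_rao` and `orthonormal_view` are hypothetical names, and the symbol between C and B is assumed to be the column-wise Kronecker (Khatri-Rao) product.

```python
import numpy as np


def khatri_rao(C, B):
    """Column-wise Kronecker product; row k*J + j holds C[k, r] * B[j, r]."""
    K, R = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)


def orthonormal_view(A, B, C):
    """Return (US, V, Xhat) for the model part Xhat = A (C kr B)'."""
    R = A.shape[1]
    Xhat = A @ khatri_rao(C, B).T                  # I x JK unfolded model
    U, s, Vt = np.linalg.svd(Xhat, full_matrices=False)
    scores = U[:, :R] * s[:R]                      # US: A-mode coordinates
    V = Vt[:R].T                                   # orthonormal basis, JK x R
    return scores, V, Xhat


# Placeholder loadings (hypothetical sizes, two components).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
B = rng.standard_normal((7, 2))
C = rng.standard_normal((4, 2))

scores, V, Xhat = orthonormal_view(A, B, C)

# Distances between objects are the same in the US plot as in the full model space.
d_model = np.linalg.norm(Xhat[0] - Xhat[1])
d_plot = np.linalg.norm(scores[0] - scores[1])
print(np.isclose(d_model, d_plot))
```

Because V has orthonormal columns, a scatter plot of the columns of US keeps distances and angles between the A-mode objects, which the raw, non-orthogonal PARAFAC loadings do not guarantee.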