Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Data Preprocessing for Quantitative and Qualitative Models
Based on NIR Spectroscopy
1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650
2.5
3
3.5
4
4.5
5
5.5
Wavelength (nm)
Raw
Sig
nal
Tablet NIR Spectra Raw Data
Colored by Assay Value160
170
180
190
200
210
220
230
1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650
−0.015
−0.01
−0.005
0
0.005
0.01
0.015
0.02
Wavelength (nm)
Prep
roce
ssed
Sig
nal
Tablet NIR Spectra Preprocessed with MSC, First Derivative and Mean Centering
160
170
180
190
200
210
220
230
Barry M. Wise, Ph.D.President
Eigenvector Research, Inc.Manson, WA USA
• Preprocessing Objective• Definition of Clutter• Linearization• Mean Centering and Autoscaling• Baseline Removal• Normalization, Multiplicative Scatter Correction
(MSC)• Smoothing, Filtering and Derivatives• Orthogonalization Filters: EPO, GLS• Conclusions
Outline
2
Goal of Preprocessing
3
• Data preprocessing is what you do to the data before it hits the modeling algorithm (PCA, PLS, MCR, SIMCA, etc.)
• The goal of preprocessing is to remove variation you don’t care about, i.e. clutter, in order to let the analysis focus on the variation you do care about
• Examples• Systems with scattering: physical vs. chemical effects• Classification: intra-class vs. inter-class variation
Measured Signal
• Clutter is present in all measurements (X & Y)• clutter = interferences + noise not of interest
X = csT + Xc +E
Measured SignalTarget Signal
Clutter Signal
Interference Signal
Noise
csT
Xc
Xc +E E
4
Sources of Clutter• Systematic background variability
• Variation in chemical interferents• Physical effects such as scattering due to particles
• Other changes in the system being observed• T, P changes, variable sample matrix, “dark current”
• Variance due to physics of instrument• e.g., drift, instrument changes, variable baseline or gain• Non-linearity, saturation
• Non-systematic random noise• homoscedastic, heteroscedastic
5
Reasons to Preprocess
• Reduces variance from extraneous sources• Makes relevant variance more obvious• Makes statistics work better• Aids interpretation• Avoids numerical problems
6
Transformation to Linear Form• Within X-block (predictor variables, e.g. spectra)
• PCA works best with linear relationships• Between X- & Y-block (predicted variable)
• PLS regression assumes linear relation• If possible, non-linear data should be converted to
a linear form (e.g., use known physics of the system)• Example:
• Typically work with absorbance rather than transmittance
• Log(I/I0)
7
Mean Centering• Often we are most interested in how the
data varies around the mean• Mean centering is done by subtracting the
mean off each column, thus forming a matrix where each column has mean of zero
8
Centering is an Axis Translation
• Geometry for 2 variables
Variable 1
Varia
ble
2
Mean Vector
Variable 1Va
riabl
e 2
9
Mean Centering on Spectra
10
600700800900100011001200130014001500Wavenumber (1/cm)
0
0.2
0.4
0.6
0.8
1
Abso
rban
ce
FTIR of Edible Oils
CornOliveSafflCMarg
600700800900100011001200130014001500Wavenumber (1/cm)
-0.05
0
0.05
0.1
Abso
rban
ce
FTIR of Edible Oils Mean Centered
CornOliveSafflCMarg
Variable Scaling• Scaling is done to change variance of variables,
and thus the weight given to them in modeling• Most common is autoscaling, which makes
variables unit variance and mean zero• Mean center variables• Divide by standard deviation
• Autoscaling removes all scale information• What’s left is only how the variables correlate
with each other• it is the “correlation matrix”
11
Autoscaling on Spectra
12
600700800900100011001200130014001500Wavenumber (1/cm)
-0.05
0
0.05
0.1
Abso
rban
ce
FTIR of Edible Oils Mean Centered
CornOliveSafflCMarg
600700800900100011001200130014001500Wavenumber (1/cm)
-3
-2
-1
0
1
2
3
Abso
rban
ce (A
utos
cale
d)
FTIR of Edible Oils Autoscaled
CornOliveSafflCMarg
4
3
2
1
0
-1
Raw FTIR Spectra
4000 3500 3000 2500 2000 1500 1000 500Wavenumbers
Baseline Close Up0.15
0.1
0.05
0
-0.054000 3500 3000 2500 2000 1500 1000 500
Wavenumbers
Sample-to-Sample BaselineBaselines can exhibit simple offsets, slopes, polynomials or more complicated functions.
In the example, the offset is larger than the absorbance features of interest.
Adds variance that can inhibit predictive capability and make extraction of chemical information (e.g., via multivariate curve resolution) difficult.
13
Background/Baseline Subtraction
Removal of broad (low-frequency) interferences while retaining higher-frequency features. Only low-order polynomials are used to model the background.• Detrend: fit polynomial to entire spectrum • Selected-Points baselining: fit polynomial to selected points in
spectrum• Weighted Least-squares (a.k.a. asymmetric) baselining: fit to automatically selected points on the bottom of the spectrum• Windowed: Whittaker, Rolling Ball, Median, Minimum, etc.• Etc.
14
Selected-Points Baseline• Detrend based on points in spectrum known to be
only baseline. Subtract the result from all channels.• good when zero points are known a priori
15
2003004005006007008009001000Variables
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
Mean
Baseline Order: 3
Weighted Least-Squares Baselining• Automatic selection of baseline points by fitting polynomial to
the “bottom” (or “top”) of the spectrum à asymmetric fit.• Starts with a fit to all points (detrend) then de-weights points above the baseline (those with large
positive residuals).• Iterates until only points w/in a defined tolerance on the residuals are kept. (Need to define tolerance
on the residuals.)• Easy approach for simple baselines (e.g., polynomials).• Can also include known baseline functions.
600 800 1000 1200 1400 1600 18000
1
2
3
4
5
Raman Shift (cm-1)
16
Sample Normalization Methods• Previous examples removed an offset. How is variance due
to changing magnitude removed?• variable source or lighting magnitude• scattering effects
• Row Normalization: removes magnitude• Standard Normal Variate (SNV): subtracts the row mean
from each row and scales to unit variance• Autoscaling of the rows
• Multiplicative Scatter Correction: Determines scale factor that best fits new spectrum to reference
• Be aware that these can “blow up” low signal noisy samples to have more variance
17
Normalization• Normalize each row / spectrum • Order of normalization (p-norm)
• 1-norm : normalize to unit AREA (area = 1)• 2-norm : normalize to unit LENGTH (vector length = 1)• inf-norm : normalize to unit MAXIMUM (max value = 1)
1/
1
pN
p
jjx
=
æ ö= ç ÷
è øåx x
p-norm
18 600700800900100011001200130014001500Wavenumber (1/cm)
-0.05
0
0.05
0.1
Abso
rban
ce
FTIR of Edible Oils Mean Centered
CornOliveSafflCMarg
600700800900100011001200130014001500Wavenumber (1/cm)
-1.5
-1
-0.5
0
0.5
1
1.5
2
2.5
Abso
rban
ce
10-3FTIR of Edible Oils Normalized and Centered
CornOliveSafflCMarg
Scatter / Signal Correction
• Multiplicative Scatter Correction (MSC)• Attempts to remove offset and row magnitude
variability• Result is less signal related to scattering artifacts
and more signal related to analyte(s) of interest
19
MSC Example
20
-0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5Scores on PC 1 (75.52%)
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
0.2
0.25
0.3
Scor
es o
n PC
2 (1
4.36
%)
PCA Scores of Edible Oils Centered
CMargCornOliveSaffl
-0.2 -0.15 -0.1 -0.05Scores on PC 1 (75.52%)
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0.04
0.06Sc
ores
on
PC 2
(14.
36%
)Zoom in on Olive Oils
10 11 12
13 14
15 16
17 18
19
20
21
22
23
24
21
Spectra (Selected Wavelengths)Samples 19 & 22
Sample 22
Sample 19
Sample 22 looks uniformly larger than Sample 19
22
Plot Sample 22 vs. Sample 19
Identity Line
Slope = 1.0584Int = 0
23
Multiplicative Effect:Spectra are Identical except one
is a Multiple of the Other• Changing sample pathlength, e.g. changing
light scattering with particle size.• Changing sample density, e.g. changing
temperature of sample.• Changing gain of the instrument.
24
MSC Multiplicative Signal (Scatter) Correction
Identity Line
Divide each absorbance of Sample 22 by slope = 1.0584
With MSC
25
-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (75.67%)
-0.1
-0.05
0
0.05
0.1
0.15
0.2Sc
ores
on
PC 2
(22.
33%
)PCA Scores of Edible Oils with MSC and Centered
CMargCornOliveSaffl
Savitzky-Golay Smoothing and Derivatives
• Derivatives wrt λ can be used to remove offsets/slopes
• Savitzky-Golay smoothing and derivatives• piece-wise fit of polynomials to
each spectrum• use fit directly for smoothing• use derivative in each window for
estimate of derivative wrt λ• smooth + derivative can be boiled
down to a set of coefficients
800 900 1000 1100 1200 1300 1400 1500 16000
0.5
1
1.5
2
2.5
Wavelength, l
Abs
orba
nce
with offset and slope
with offset
original spectrum
26
Savitzky-Golay First Derivative
x = cST
x = cST +α1T
dxdλ
= cdST
dλ
800 900 1000 1100 1200 1300 1400 1500 1600-0.04
-0.02
0
0.02
0.04
0.06
0.08
0.1
Wavelength, l
dA/dl
multicomponent Beer’s Law
first derivative removes the offset
with offset and slope
with offset
original spectrum
27
800 900 1000 1100 1200 1300 1400 1500 1600-0.015
-0.01
-0.005
0
0.005
0.01
0.015
Wavelength, l
d2 A/dl2
Savitzky-Golay Second Derivative
x = cST
x = cST +α1T + βλdxdλ
= cdST
dλ+ β
d2xdλ 2 = c
d2ST
dλ 2
multicomponent Beer’s Law
second derivative remove the offset and slope
with offset and slope
with offset
original spectrum
28
EPO and GLS Filters• EPO = External Parameter Orthogonalization• GLS = Generalized Least Squares filter• Both use samples that characterize the clutter
• Variation not related to the problem of interest• Classification problems: inter-class variance• Regression problems: samples with same property
• EPO makes PCA model of clutter, orthogonalizes data against first few PCs – hard filter
• GLS calculates weighted inverse of clutter covariance, applies to all data – soft filter
29
With MSC and GLS
30
-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (75.67%)
-0.1
-0.05
0
0.05
0.1
0.15
0.2Sc
ores
on
PC 2
(22.
33%
)PCA Scores of Edible Oils with MSC and Centered
CMargCornOliveSaffl
-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (73.38%)
-0.1
-0.05
0
0.05
0.1
0.15
0.2Sc
ores
on
PC 2
(25.
05%
)PCA Scores of Edible Oils with MSC, GLS & Centered
-0.1 -0.05 0 0.05 0.1 0.15Scores on PC 1 (73.38%)
-0.05
0
0.05
0.1
0.15
Scor
es o
n PC
2 (2
5.05
%)
PCA Scores of Test Samples with MSC, GLS & Centered
NIR Shootout 2002• Estimate tablet assay value from NIR transmittance
• Calibration (155 samples), Test (460 samples)
311100 1200 1300 1400 1500 1600
Wavelength (nm)
2.5
3
3.5
4
4.5
5
5.5
Sign
al
Raw Tablet Data Colored by Assay Value
160
170
180
190
200
210
220
230
1100 1200 1300 1400 1500 1600Variables
2.5
3
3.5
4
4.5
Data
MSC Preprocessed Tablet Data Colored by Assay Value
160
170
180
190
200
210
220
230
1100 1200 1300 1400 1500 1600Variables
-0.1
-0.08
-0.06
-0.04
-0.02
0
0.02
0.04
0.06
0.08
Data
MSC & 1st Derivative Preprocessed Tablet Data
160
170
180
190
200
210
220
230
1100 1200 1300 1400 1500 1600Variables
-0.015
-0.01
-0.005
0
0.005
0.01
Data
MSC, 1st Derivative and Mean Centered Tablet Data
160
170
180
190
200
210
220
230
Prediction Error on Validation Set
0
1
2
3
4
5
6
7
8
9
10
Preprocessing Method
RM
SEP
for V
alid
atio
n Se
t
Mea
n C
ente
r
MSC
1-N
orm
2-N
orm
SavG
ol-1
,SN
V
SavG
ol-2
,SN
V
SNV EM
SC
SavG
ol-2
,EM
SC
GLS
W
SavG
ol-2
,SN
V,G
LSW
150 160 170 180 190 200 210 220 230 240 250150
160
170
180
190
200
210
220
230
240
250Assay SavGol-2,SNV,GLSW
Measured
Pred
icte
d
RMSEP = 2.7
CalibrationValidation
32
Perspectives on Preprocessing• Order matters. The general approach is:
1. Background and offset removal2. Normalization3. Centering4. Scaling
• Always keep in mind: “what is each preprocessing step supposed to be doing?....”
• Plot data after pre-preprocessing and color code!• Always compare the effect of the pre-processing
(classification or regression error rates) with the results from a model based on the raw data
33
Pre-processing will offer…
• Models with better predictive or classification performance and/or
• Simpler models that are more robust and/or more easy to interpret
• But there is a risk that you can remove useful information from data
• Preprocessing must be validated as part of the model development process
34