5.5 230 5 220 Data Preprocessing 210 for Quantitative and Raw …€¦ · Data Preprocessing for Quantitative and Qualitative Models Based on NIR Spectroscopy 1100 1150 1200 1250

Data Preprocessing for Quantitative and Qualitative Models

Based on NIR Spectroscopy

1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650

2.5

3

3.5

4

4.5

5

5.5

Wavelength (nm)

Raw

Sig

nal

Tablet NIR Spectra Raw Data

Colored by Assay Value160

170

180

190

200

210

220

230

1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650

−0.015

−0.01

−0.005

0

0.005

0.01

0.015

0.02

Wavelength (nm)

Prep

roce

ssed

Sig

nal

Tablet NIR Spectra Preprocessed with MSC, First Derivative and Mean Centering

160

170

180

190

200

210

220

230

Barry M. Wise, Ph.D.President

Eigenvector Research, Inc.Manson, WA USA

• Preprocessing Objective• Definition of Clutter• Linearization• Mean Centering and Autoscaling• Baseline Removal• Normalization, Multiplicative Scatter Correction

(MSC)• Smoothing, Filtering and Derivatives• Orthogonalization Filters: EPO, GLS• Conclusions

Outline

2

Goal of Preprocessing

3

• Data preprocessing is what you do to the data before it hits the modeling algorithm (PCA, PLS, MCR, SIMCA, etc.)

• The goal of preprocessing is to remove variation you don’t care about, i.e. clutter, in order to let the analysis focus on the variation you do care about

• Examples• Systems with scattering: physical vs. chemical effects• Classification: intra-class vs. inter-class variation

Measured Signal

• Clutter is present in all measurements (X & Y)• clutter = interferences + noise not of interest

X = csT + Xc +E

Measured SignalTarget Signal

Clutter Signal

Interference Signal

Noise

csT

Xc

Xc +E E

4

Sources of Clutter• Systematic background variability

• Variation in chemical interferents• Physical effects such as scattering due to particles

• Other changes in the system being observed• T, P changes, variable sample matrix, “dark current”

• Variance due to physics of instrument• e.g., drift, instrument changes, variable baseline or gain• Non-linearity, saturation

• Non-systematic random noise• homoscedastic, heteroscedastic

5

Reasons to Preprocess

• Reduces variance from extraneous sources• Makes relevant variance more obvious• Makes statistics work better• Aids interpretation• Avoids numerical problems

6

Transformation to Linear Form• Within X-block (predictor variables, e.g. spectra)

• PCA works best with linear relationships• Between X- & Y-block (predicted variable)

• PLS regression assumes linear relation• If possible, non-linear data should be converted to

a linear form (e.g., use known physics of the system)• Example:

• Typically work with absorbance rather than transmittance

• Log(I/I0)

7

Mean Centering• Often we are most interested in how the

data varies around the mean• Mean centering is done by subtracting the

mean off each column, thus forming a matrix where each column has mean of zero

8

Centering is an Axis Translation

• Geometry for 2 variables

Variable 1

Varia

ble

2

Mean Vector

Variable 1Va

riabl

e 2

9

Mean Centering on Spectra

10

600700800900100011001200130014001500Wavenumber (1/cm)

0

0.2

0.4

0.6

0.8

1

Abso

rban

ce

FTIR of Edible Oils

CornOliveSafflCMarg

600700800900100011001200130014001500Wavenumber (1/cm)

-0.05

0

0.05

0.1

Abso

rban

ce

FTIR of Edible Oils Mean Centered

CornOliveSafflCMarg

Variable Scaling• Scaling is done to change variance of variables,

and thus the weight given to them in modeling• Most common is autoscaling, which makes

variables unit variance and mean zero• Mean center variables• Divide by standard deviation

• Autoscaling removes all scale information• What’s left is only how the variables correlate

with each other• it is the “correlation matrix”

11

Autoscaling on Spectra

12

600700800900100011001200130014001500Wavenumber (1/cm)

-0.05

0

0.05

0.1

Abso

rban

ce


CornOliveSafflCMarg

600700800900100011001200130014001500Wavenumber (1/cm)

-3

-2

-1

0

1

2

3

Abso

rban

ce (A

utos

cale

d)

FTIR of Edible Oils Autoscaled

CornOliveSafflCMarg

4

3

2

1

0

-1

Raw FTIR Spectra

4000 3500 3000 2500 2000 1500 1000 500Wavenumbers

Baseline Close Up0.15

0.1

0.05

0

-0.054000 3500 3000 2500 2000 1500 1000 500

Wavenumbers

Sample-to-Sample BaselineBaselines can exhibit simple offsets, slopes, polynomials or more complicated functions.

In the example, the offset is larger than the absorbance features of interest.

Adds variance that can inhibit predictive capability and make extraction of chemical information (e.g., via multivariate curve resolution) difficult.

13

Background/Baseline Subtraction

Removal of broad (low-frequency) interferences while retaining higher-frequency features. Only low-order polynomials are used to model the background.• Detrend: fit polynomial to entire spectrum • Selected-Points baselining: fit polynomial to selected points in

spectrum• Weighted Least-squares (a.k.a. asymmetric) baselining: fit to automatically selected points on the bottom of the spectrum• Windowed: Whittaker, Rolling Ball, Median, Minimum, etc.• Etc.

14

Selected-Points Baseline• Detrend based on points in spectrum known to be

only baseline. Subtract the result from all channels.• good when zero points are known a priori

15

2003004005006007008009001000Variables

200

400

600

800

1000

1200

1400

1600

1800

2000

2200

Mean

Baseline Order: 3

Weighted Least-Squares Baselining• Automatic selection of baseline points by fitting polynomial to

the “bottom” (or “top”) of the spectrum à asymmetric fit.• Starts with a fit to all points (detrend) then de-weights points above the baseline (those with large

positive residuals).• Iterates until only points w/in a defined tolerance on the residuals are kept. (Need to define tolerance

on the residuals.)• Easy approach for simple baselines (e.g., polynomials).• Can also include known baseline functions.

600 800 1000 1200 1400 1600 18000

1

2

3

4

5

Raman Shift (cm-1)

16

Sample Normalization Methods• Previous examples removed an offset. How is variance due

to changing magnitude removed?• variable source or lighting magnitude• scattering effects

• Row Normalization: removes magnitude• Standard Normal Variate (SNV): subtracts the row mean

from each row and scales to unit variance• Autoscaling of the rows

• Multiplicative Scatter Correction: Determines scale factor that best fits new spectrum to reference

• Be aware that these can “blow up” low signal noisy samples to have more variance

17

Normalization• Normalize each row / spectrum • Order of normalization (p-norm)

• 1-norm : normalize to unit AREA (area = 1)• 2-norm : normalize to unit LENGTH (vector length = 1)• inf-norm : normalize to unit MAXIMUM (max value = 1)

1/

1

pN

p

jjx

=

æ ö= ç ÷

è øåx x

p-norm

18 600700800900100011001200130014001500Wavenumber (1/cm)

-0.05

0

0.05

0.1

Abso

rban

ce


CornOliveSafflCMarg

600700800900100011001200130014001500Wavenumber (1/cm)

-1.5

-1

-0.5

0

0.5

1

1.5

2

2.5

Abso

rban

ce

10-3FTIR of Edible Oils Normalized and Centered

CornOliveSafflCMarg

Scatter / Signal Correction

• Multiplicative Scatter Correction (MSC)• Attempts to remove offset and row magnitude

variability• Result is less signal related to scattering artifacts

and more signal related to analyte(s) of interest

19

MSC Example

20

-0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5Scores on PC 1 (75.52%)

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

0.2

0.25

0.3

Scor

es o

n PC

2 (1

4.36

%)

PCA Scores of Edible Oils Centered

CMargCornOliveSaffl

-0.2 -0.15 -0.1 -0.05Scores on PC 1 (75.52%)

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

0.06Sc

ores

on

PC 2

(14.

36%

)Zoom in on Olive Oils

10 11 12

13 14

15 16

17 18

19

20

21

22

23

24

21

Spectra (Selected Wavelengths)Samples 19 & 22

Sample 22

Sample 19

Sample 22 looks uniformly larger than Sample 19

22

Plot Sample 22 vs. Sample 19

Identity Line

Slope = 1.0584Int = 0

23

Multiplicative Effect:Spectra are Identical except one

is a Multiple of the Other• Changing sample pathlength, e.g. changing

light scattering with particle size.• Changing sample density, e.g. changing

temperature of sample.• Changing gain of the instrument.

24

MSC Multiplicative Signal (Scatter) Correction

Identity Line

Divide each absorbance of Sample 22 by slope = 1.0584

With MSC

25

-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (75.67%)

-0.1

-0.05

0

0.05

0.1

0.15

0.2Sc

ores

on

PC 2

(22.

33%

)PCA Scores of Edible Oils with MSC and Centered

CMargCornOliveSaffl

Savitzky-Golay Smoothing and Derivatives

• Derivatives wrt λ can be used to remove offsets/slopes

• Savitzky-Golay smoothing and derivatives• piece-wise fit of polynomials to

each spectrum• use fit directly for smoothing• use derivative in each window for

estimate of derivative wrt λ• smooth + derivative can be boiled

down to a set of coefficients

800 900 1000 1100 1200 1300 1400 1500 16000

0.5

1

1.5

2

2.5

Wavelength, l

Abs

orba

nce

with offset and slope

with offset

original spectrum

26

Savitzky-Golay First Derivative

x = cST

x = cST +α1T

dxdλ

= cdST

dλ

800 900 1000 1100 1200 1300 1400 1500 1600-0.04

-0.02

0

0.02

0.04

0.06

0.08

0.1

Wavelength, l

dA/dl

multicomponent Beer’s Law

first derivative removes the offset


with offset

original spectrum

27

800 900 1000 1100 1200 1300 1400 1500 1600-0.015

-0.01

-0.005

0

0.005

0.01

0.015

Wavelength, l

d2 A/dl2

Savitzky-Golay Second Derivative

x = cST

x = cST +α1T + βλdxdλ

= cdST

dλ+ β

d2xdλ 2 = c

d2ST

dλ 2

multicomponent Beer’s Law

second derivative remove the offset and slope


with offset

original spectrum

28

EPO and GLS Filters• EPO = External Parameter Orthogonalization• GLS = Generalized Least Squares filter• Both use samples that characterize the clutter

• Variation not related to the problem of interest• Classification problems: inter-class variance• Regression problems: samples with same property

• EPO makes PCA model of clutter, orthogonalizes data against first few PCs – hard filter

• GLS calculates weighted inverse of clutter covariance, applies to all data – soft filter

29

With MSC and GLS

30

-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (75.67%)

-0.1

-0.05

0

0.05

0.1

0.15

0.2Sc

ores

on

PC 2

(22.

33%

)PCA Scores of Edible Oils with MSC and Centered

CMargCornOliveSaffl

-0.15 -0.1 -0.05 0 0.05 0.1 0.15 0.2Scores on PC 1 (73.38%)

-0.1

-0.05

0

0.05

0.1

0.15

0.2Sc

ores

on

PC 2

(25.

05%

)PCA Scores of Edible Oils with MSC, GLS & Centered

-0.1 -0.05 0 0.05 0.1 0.15Scores on PC 1 (73.38%)

-0.05

0

0.05

0.1

0.15

Scor

es o

n PC

2 (2

5.05

%)

PCA Scores of Test Samples with MSC, GLS & Centered

NIR Shootout 2002• Estimate tablet assay value from NIR transmittance

• Calibration (155 samples), Test (460 samples)

311100 1200 1300 1400 1500 1600

Wavelength (nm)

2.5

3

3.5

4

4.5

5

5.5

Sign

al

Raw Tablet Data Colored by Assay Value

160

170

180

190

200

210

220

230

1100 1200 1300 1400 1500 1600Variables

2.5

3

3.5

4

4.5

Data

MSC Preprocessed Tablet Data Colored by Assay Value

160

170

180

190

200

210

220

230

1100 1200 1300 1400 1500 1600Variables

-0.1

-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

0.06

0.08

Data

MSC & 1st Derivative Preprocessed Tablet Data

160

170

180

190

200

210

220

230

1100 1200 1300 1400 1500 1600Variables

-0.015

-0.01

-0.005

0

0.005

0.01

Data

MSC, 1st Derivative and Mean Centered Tablet Data

160

170

180

190

200

210

220

230

Prediction Error on Validation Set

0

1

2

3

4

5

6

7

8

9

10

Preprocessing Method

RM

SEP

for V

alid

atio

n Se

t

Mea

n C

ente

r

MSC

1-N

orm

2-N

orm

SavG

ol-1

,SN

V

SavG

ol-2

,SN

V

SNV EM

SC

SavG

ol-2

,EM

SC

GLS

W

SavG

ol-2

,SN

V,G

LSW

150 160 170 180 190 200 210 220 230 240 250150

160

170

180

190

200

210

220

230

240

250Assay SavGol-2,SNV,GLSW

Measured

Pred

icte

d

RMSEP = 2.7

CalibrationValidation

32

Perspectives on Preprocessing• Order matters. The general approach is:

1. Background and offset removal2. Normalization3. Centering4. Scaling

• Always keep in mind: “what is each preprocessing step supposed to be doing?....”

• Plot data after pre-preprocessing and color code!• Always compare the effect of the pre-processing

(classification or regression error rates) with the results from a model based on the raw data

33

Pre-processing will offer…

• Models with better predictive or classification performance and/or

• Simpler models that are more robust and/or more easy to interpret

• But there is a risk that you can remove useful information from data

• Preprocessing must be validated as part of the model development process

34

Documents

5.5 230 5 220 Data Preprocessing 210 for Quantitative and Raw …€¦ · Data Preprocessing for Quantitative and Qualitative Models Based on NIR Spectroscopy 1100 1150 1200 1250