Data Preprocessing for Quantitative and Qualitative Models Based on NIR Spectroscopy

Barry M. Wise, Ph.D.
President, Eigenvector Research, Inc.
Manson, WA USA


[Figure: Tablet NIR Spectra, Raw Data, Colored by Assay Value. x-axis: Wavelength (nm), 1100 to 1650; y-axis: Raw Signal; color scale: assay value, 160 to 230.]

[Figure: Tablet NIR Spectra Preprocessed with MSC, First Derivative and Mean Centering. x-axis: Wavelength (nm), 1100 to 1650; y-axis: Preprocessed Signal; color scale: assay value, 160 to 230.]

Outline

• Preprocessing Objective
• Definition of Clutter
• Linearization
• Mean Centering and Autoscaling
• Baseline Removal
• Normalization, Multiplicative Scatter Correction (MSC)
• Smoothing, Filtering and Derivatives
• Orthogonalization Filters: EPO, GLS
• Conclusions

Goal of Preprocessing

• Data preprocessing is what you do to the data before it hits the modeling algorithm (PCA, PLS, MCR, SIMCA, etc.)
• The goal of preprocessing is to remove variation you don’t care about, i.e. clutter, in order to let the analysis focus on the variation you do care about
• Examples
  • Systems with scattering: physical vs. chemical effects
  • Classification: intra-class vs. inter-class variation

Measured Signal

• Clutter is present in all measurements (X & Y)
  • clutter = interferences + noise not of interest

$$\mathbf{X} = \mathbf{c}\mathbf{s}^T + \mathbf{X}_c + \mathbf{E}$$

• Measured signal: $\mathbf{X}$
• Target signal: $\mathbf{c}\mathbf{s}^T$
• Interference signal: $\mathbf{X}_c$
• Noise: $\mathbf{E}$
• Clutter signal: $\mathbf{X}_c + \mathbf{E}$

Sources of Clutter

• Systematic background variability
  • Variation in chemical interferents
  • Physical effects such as scattering due to particles
• Other changes in the system being observed
  • T, P changes, variable sample matrix, “dark current”
• Variance due to physics of instrument
  • e.g., drift, instrument changes, variable baseline or gain
  • Non-linearity, saturation
• Non-systematic random noise
  • homoscedastic, heteroscedastic

Reasons to Preprocess

• Reduces variance from extraneous sources
• Makes relevant variance more obvious
• Makes statistics work better
• Aids interpretation
• Avoids numerical problems

Transformation to Linear Form

• Within X-block (predictor variables, e.g. spectra)
  • PCA works best with linear relationships
• Between X- & Y-block (predicted variable)
  • PLS regression assumes a linear relation
• If possible, non-linear data should be converted to a linear form (e.g., use known physics of the system)
• Example:
  • Typically work with absorbance rather than transmittance
  • $A = -\log_{10}(I/I_0) = \log_{10}(I_0/I)$
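
The absorbance conversion can be sketched as follows. This is a minimal illustration, not code from the talk: the function name, the `floor` guard, and the synthetic Beer's-law numbers are all assumptions made for the example.

```python
import numpy as np

def transmittance_to_absorbance(intensity, reference, floor=1e-12):
    """Convert measured intensity I and reference intensity I0 to absorbance.

    A = -log10(I / I0). `floor` is a hypothetical guard that keeps the
    logarithm finite when a detector reading is zero or negative.
    """
    t = np.asarray(intensity, dtype=float) / np.asarray(reference, dtype=float)
    return -np.log10(np.clip(t, floor, None))

# Beer's law: transmitted intensity decays exponentially with concentration,
# while absorbance grows linearly with it.
conc = np.array([1.0, 2.0, 3.0])
I0 = 100.0
I = I0 * 10.0 ** (-0.2 * conc)   # assumed absorptivity * pathlength = 0.2
A = transmittance_to_absorbance(I, I0)
```

In this synthetic case `A` is proportional to `conc`, which is the linear relation PCA and PLS work best with.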

Mean Centering

• Often we are most interested in how the data varies around the mean
• Mean centering is done by subtracting the column mean from each column, thus forming a matrix where each column has a mean of zero
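
A minimal sketch of mean centering, assuming rows are samples and columns are variables. The function name and the small matrices are illustrative. The calibration mean is returned so that the same translation can be applied to new samples.

```python
import numpy as np

def mean_center(X, mean=None):
    """Subtract the column means; return centered data and the means used."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)      # learn the mean from this data set
    return X - mean, mean

X_cal = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 17.0]])
X_cal_c, mu = mean_center(X_cal)

# New samples are centered with the calibration mean, not their own.
X_new_c, _ = mean_center(np.array([[2.0, 13.0]]), mean=mu)
```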

Centering is an Axis Translation

• Geometry for 2 variables

[Figure: two panels, each with axes Variable 1 and Variable 2; the mean vector is marked.]

Mean Centering on Spectra

[Figure: FTIR of Edible Oils. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance; classes: Corn, Olive, Saffl, CMarg.]

[Figure: FTIR of Edible Oils Mean Centered. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance; classes: Corn, Olive, Saffl, CMarg.]

Variable Scaling

• Scaling is done to change the variance of variables, and thus the weight given to them in modeling
• Most common is autoscaling, which makes variables unit variance and mean zero
  • Mean center variables
  • Divide by standard deviation
• Autoscaling removes all scale information
  • What’s left is only how the variables correlate with each other
  • The covariance matrix of the autoscaled data is the “correlation matrix” of the original data
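
Autoscaling and its link to the correlation matrix can be checked numerically. The sketch below uses random synthetic data with deliberately different column scales; the function name and the use of the sample standard deviation (`ddof=1`) are assumptions of this example.

```python
import numpy as np

def autoscale(X, mean=None, std=None):
    """Mean center each column, then divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0, ddof=1)   # sample standard deviation
    return (X - mean) / std, mean, std

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
Xa, mu, sd = autoscale(X)

# Covariance of the autoscaled data equals the correlation of the raw data.
cov_scaled = np.cov(Xa, rowvar=False)
corr_raw = np.corrcoef(X, rowvar=False)
```

A constant column has zero standard deviation and would cause a division by zero here; such variables need to be excluded or handled separately.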

Autoscaling on Spectra

[Figure: FTIR of Edible Oils Mean Centered. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance; classes: Corn, Olive, Saffl, CMarg.]

[Figure: FTIR of Edible Oils Autoscaled. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance (Autoscaled); classes: Corn, Olive, Saffl, CMarg.]

Sample-to-Sample Baseline

Baselines can exhibit simple offsets, slopes, polynomials or more complicated functions.

In the example, the offset is larger than the absorbance features of interest.

A varying baseline adds variance that can inhibit predictive capability and make extraction of chemical information (e.g., via multivariate curve resolution) difficult.

[Figure: Raw FTIR Spectra. x-axis: Wavenumbers, 4000 to 500.]

[Figure: Baseline Close Up. x-axis: Wavenumbers, 4000 to 500.]

Background/Baseline Subtraction

Removal of broad (low-frequency) interferences while retaining higher-frequency features. Only low-order polynomials are used to model the background.

• Detrend: fit polynomial to entire spectrum
• Selected-Points baselining: fit polynomial to selected points in spectrum
• Weighted Least-squares (a.k.a. asymmetric) baselining: fit to automatically selected points on the bottom of the spectrum
• Windowed: Whittaker, Rolling Ball, Median, Minimum, etc.
• Etc.

Selected-Points Baseline

• Detrend based on points in spectrum known to be only baseline. Subtract the result from all channels.
  • good when zero points are known a priori

[Figure: selected-points baseline example. x-axis: Variables, 200 to 1000; y-axis: Mean; Baseline Order: 3.]
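
Detrending and selected-points baselining differ only in which channels enter the polynomial fit. The sketch below, on a synthetic one-peak spectrum, illustrates that difference. The function name, the axis rescaling, and the choice of baseline-only channels are assumptions of this example, not the talk's implementation.

```python
import numpy as np

def polynomial_baseline(spectrum, axis, order=1, points=None):
    """Fit a polynomial baseline and subtract it from every channel.

    points=None fits all channels (detrend); otherwise `points` is an index
    array or boolean mask of channels assumed to contain only baseline.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    axis = np.asarray(axis, dtype=float)
    # Scale the axis to [-1, 1] so the polynomial fit is well conditioned.
    t = 2.0 * (axis - axis.min()) / (axis.max() - axis.min()) - 1.0
    if points is None:
        points = np.arange(spectrum.size)
    coeffs = np.polyfit(t[points], spectrum[points], order)
    baseline = np.polyval(coeffs, t)
    return spectrum - baseline, baseline

# Synthetic example: one Gaussian peak on a sloping baseline.
x = np.linspace(200.0, 1000.0, 401)
peak = np.exp(-0.5 * ((x - 600.0) / 15.0) ** 2)
true_baseline = 0.5 + 0.001 * x
y = peak + true_baseline

# Channels far from the peak are assumed to be baseline only.
baseline_mask = np.abs(x - 600.0) > 150.0
corrected, fitted = polynomial_baseline(y, x, order=1, points=baseline_mask)

# Plain detrend fits through the peak as well, so it over-subtracts.
detrended, _ = polynomial_baseline(y, x, order=1)
```

With the mask, the fitted line matches the true baseline and the peak is recovered; the plain detrend pulls the baseline up toward the peak and leaves negative values on either side of it.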

Weighted Least-Squares Baselining

• Automatic selection of baseline points by fitting polynomial to the “bottom” (or “top”) of the spectrum → asymmetric fit.
• Starts with a fit to all points (detrend), then de-weights points above the baseline (those with large positive residuals).
• Iterates until only points within a defined tolerance on the residuals are kept. (Need to define tolerance on the residuals.)
• Easy approach for simple baselines (e.g., polynomials).
• Can also include known baseline functions.

[Figure: example spectrum. x-axis: Raman Shift (cm⁻¹), 600 to 1800; y-axis: 0 to 5.]
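
The iteration described above can be sketched with a simple zero/one weighting scheme. This is one possible reading of the procedure, not the algorithm of any particular package: the function name, the hard 0/1 weights, the stopping rule, and the synthetic spectrum are all assumptions.

```python
import numpy as np

def wls_baseline(spectrum, axis, order=2, tol=0.01, max_iter=100):
    """Asymmetric polynomial baseline fitted to the bottom of a spectrum."""
    spectrum = np.asarray(spectrum, dtype=float)
    axis = np.asarray(axis, dtype=float)
    t = 2.0 * (axis - axis.min()) / (axis.max() - axis.min()) - 1.0
    weights = np.ones(spectrum.size)          # first pass is a plain detrend
    for _ in range(max_iter):
        coeffs = np.polyfit(t, spectrum, order, w=weights)
        baseline = np.polyval(coeffs, t)
        residuals = spectrum - baseline
        # Points more than `tol` above the current baseline get zero weight.
        new_weights = (residuals <= tol).astype(float)
        if np.array_equal(new_weights, weights):
            break                             # selected points stopped changing
        weights = new_weights
    return spectrum - baseline, baseline

# Synthetic spectrum: two peaks on a quadratic baseline.
x = np.linspace(600.0, 1800.0, 601)
t = 2.0 * (x - 600.0) / 1200.0 - 1.0
true_baseline = 1.0 + 0.5 * t + 0.3 * t ** 2
peaks = (2.0 * np.exp(-0.5 * ((x - 900.0) / 20.0) ** 2)
         + 3.0 * np.exp(-0.5 * ((x - 1400.0) / 20.0) ** 2))
corrected, fitted = wls_baseline(true_baseline + peaks, x, order=2, tol=0.01)
```

The tolerance should be chosen relative to the noise level: too small and noisy baseline points are rejected, too large and peak tails are fitted as baseline.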

Sample Normalization Methods

• Previous examples removed an offset. How is variance due to changing magnitude removed?
  • variable source or lighting magnitude
  • scattering effects
• Row Normalization: removes magnitude
• Standard Normal Variate (SNV): subtracts the row mean from each row and scales to unit variance
  • Autoscaling of the rows
• Multiplicative Scatter Correction: determines scale factor that best fits new spectrum to reference
• Be aware that these can “blow up” low-signal, noisy samples to have more variance
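
SNV is autoscaling applied along rows instead of columns. A minimal sketch, with an invented four-channel spectrum and the sample standard deviation (`ddof=1`) assumed:

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center and scale each row (spectrum)."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)
    row_std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - row_mean) / row_std

base = np.array([0.2, 0.5, 0.9, 0.4])
# Second spectrum has the same shape but a different gain and offset.
X = np.vstack([base, 2.0 * base + 0.3])
X_snv = snv(X)
```

After SNV the two rows coincide, since the gain and offset differences are removed. A row that is mostly noise would be scaled up to unit variance in the same way, which is the "blow up" risk noted above.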

Normalization

• Normalize each row / spectrum
• Order of normalization (p-norm)
  • 1-norm: normalize to unit AREA (area = 1)
  • 2-norm: normalize to unit LENGTH (vector length = 1)
  • inf-norm: normalize to unit MAXIMUM (max value = 1)

p-norm:

$$\|\mathbf{x}\|_p = \left( \sum_{j=1}^{N} |x_j|^p \right)^{1/p}$$

[Figure: FTIR of Edible Oils Mean Centered. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance; classes: Corn, Olive, Saffl, CMarg.]

[Figure: FTIR of Edible Oils Normalized and Centered. x-axis: Wavenumber (1/cm), 600 to 1500; y-axis: Absorbance (×10⁻³); classes: Corn, Olive, Saffl, CMarg.]
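
Row normalization by a p-norm can be sketched with `numpy.linalg.norm`. The function name and the two-channel rows are illustrative only.

```python
import numpy as np

def normalize_rows(X, p=2):
    """Divide each row by its p-norm (p = 1, 2, or np.inf)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    return X / norms

X = np.array([[3.0, 4.0], [1.0, 1.0]])
X_area = normalize_rows(X, p=1)        # rows sum (in absolute value) to 1
X_length = normalize_rows(X, p=2)      # rows have unit Euclidean length
X_max = normalize_rows(X, p=np.inf)    # largest absolute value in each row is 1
```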

Scatter / Signal Correction

• Multiplicative Scatter Correction (MSC)
  • Attempts to remove offset and row magnitude variability
  • Result is less signal related to scattering artifacts and more signal related to analyte(s) of interest

MSC Example

[Figure: PCA Scores of Edible Oils Centered. x-axis: Scores on PC 1 (75.52%); y-axis: Scores on PC 2 (14.36%); classes: CMarg, Corn, Olive, Saffl.]

[Figure: Zoom in on Olive Oils. x-axis: Scores on PC 1 (75.52%); y-axis: Scores on PC 2 (14.36%); points labeled by sample number, 10 to 24.]

Spectra (Selected Wavelengths), Samples 19 & 22

• Sample 22 looks uniformly larger than Sample 19

[Figure: spectra of Sample 19 and Sample 22 at selected wavelengths.]

Plot Sample 22 vs. Sample 19

[Figure: Sample 22 plotted against Sample 19, with the identity line and a fitted line; Slope = 1.0584, Intercept = 0.]

Multiplicative Effect: Spectra are Identical except one is a Multiple of the Other

• Changing sample pathlength, e.g. changing light scattering with particle size.
• Changing sample density, e.g. changing temperature of sample.
• Changing gain of the instrument.

MSC: Multiplicative Signal (Scatter) Correction

• Divide each absorbance of Sample 22 by slope = 1.0584

[Figure: Sample 22 vs. Sample 19 after correction, with the identity line.]
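
The regress-and-divide step can be sketched as below. The slope of 1.0584 is taken from the example; the five absorbance values for the reference spectrum are invented for illustration. The function name and the default of using the mean spectrum as the reference are assumptions of this sketch.

```python
import numpy as np

def msc(X, reference=None):
    """Regress each spectrum on a reference, then undo offset and slope."""
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = X.mean(axis=0)     # assumed default: mean spectrum
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        # Least-squares fit of x ≈ intercept + slope * reference.
        slope, intercept = np.polyfit(reference, x, 1)
        corrected[i] = (x - intercept) / slope
    return corrected, reference

sample_19 = np.array([0.10, 0.35, 0.80, 0.55, 0.20])   # synthetic values
sample_22 = 1.0584 * sample_19                         # purely multiplicative difference
corrected, _ = msc(np.vstack([sample_19, sample_22]), reference=sample_19)
```

After correction the second row coincides with the reference. For new samples, the reference spectrum stored from calibration should be reused rather than recomputed.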

With MSC

[Figure: PCA Scores of Edible Oils with MSC and Centered. x-axis: Scores on PC 1 (75.67%); y-axis: Scores on PC 2 (22.33%); classes: CMarg, Corn, Olive, Saffl.]

Savitzky-Golay Smoothing and Derivatives

• Derivatives with respect to λ can be used to remove offsets/slopes
• Savitzky-Golay smoothing and derivatives
  • piece-wise fit of polynomials to each spectrum
  • use fit directly for smoothing
  • use derivative in each window for estimate of derivative with respect to λ
  • smooth + derivative can be boiled down to a set of coefficients

[Figure: x-axis: Wavelength, λ, 800 to 1600; y-axis: Absorbance; curves: original spectrum, with offset, with offset and slope.]
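
The "set of coefficients" can be derived from a local least-squares polynomial fit. The sketch below uses NumPy only and assumes evenly spaced channels. The function names, the default window and order, and the trimming of edge points (`mode="valid"`) are choices made for this example.

```python
import numpy as np

def savgol_coefficients(window, order, deriv=0, delta=1.0):
    """Savitzky-Golay weights for the center point of an odd-length window."""
    if window % 2 != 1 or order >= window or deriv > order:
        raise ValueError("need odd window > order >= deriv")
    half = window // 2
    z = np.arange(-half, half + 1, dtype=float)
    A = np.vander(z, order + 1, increasing=True)   # columns: z**0, z**1, ...
    # Row `deriv` of pinv(A) yields the fitted coefficient of z**deriv; the
    # deriv-th derivative at z = 0 is deriv! times that coefficient.
    factorial = float(np.prod(np.arange(1, deriv + 1)))
    return np.linalg.pinv(A)[deriv] * factorial / delta ** deriv

def savgol(spectrum, window=11, order=2, deriv=0, delta=1.0):
    """Apply the weights along a spectrum; edge points are dropped."""
    coeffs = savgol_coefficients(window, order, deriv, delta)
    # np.convolve flips its kernel, so reverse the weights first.
    return np.convolve(spectrum, coeffs[::-1], mode="valid")

wl = np.linspace(800.0, 1600.0, 401)
step = wl[1] - wl[0]
original = np.exp(-0.5 * ((wl - 1200.0) / 60.0) ** 2)
with_offset = original + 0.5
with_slope = original + 0.5 + 0.001 * wl

d1_original = savgol(original, deriv=1, delta=step)
d1_offset = savgol(with_offset, deriv=1, delta=step)     # offset removed
d2_original = savgol(original, deriv=2, delta=step)
d2_slope = savgol(with_slope, deriv=2, delta=step)       # offset and slope removed
```

In this synthetic case the first derivative is unchanged by the offset and the second derivative is unchanged by the offset plus slope. Wider windows smooth more but broaden and flatten narrow features.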

Savitzky-Golay First Derivative

Multicomponent Beer’s Law:

$$\mathbf{x} = \mathbf{c}\mathbf{S}^T$$

With an offset:

$$\mathbf{x} = \mathbf{c}\mathbf{S}^T + \alpha\mathbf{1}^T$$

The first derivative removes the offset:

$$\frac{d\mathbf{x}}{d\lambda} = \mathbf{c}\frac{d\mathbf{S}^T}{d\lambda}$$

[Figure: x-axis: Wavelength, λ, 800 to 1600; y-axis: dA/dλ; curves: original spectrum, with offset, with offset and slope.]

Savitzky-Golay Second Derivative

Multicomponent Beer’s Law:

$$\mathbf{x} = \mathbf{c}\mathbf{S}^T$$

With an offset and a slope:

$$\mathbf{x} = \mathbf{c}\mathbf{S}^T + \alpha\mathbf{1}^T + \beta\boldsymbol{\lambda}^T$$

$$\frac{d\mathbf{x}}{d\lambda} = \mathbf{c}\frac{d\mathbf{S}^T}{d\lambda} + \beta\mathbf{1}^T$$

The second derivative removes the offset and the slope:

$$\frac{d^2\mathbf{x}}{d\lambda^2} = \mathbf{c}\frac{d^2\mathbf{S}^T}{d\lambda^2}$$

[Figure: x-axis: Wavelength, λ, 800 to 1600; y-axis: d²A/dλ²; curves: original spectrum, with offset, with offset and slope.]

EPO and GLS Filters

• EPO = External Parameter Orthogonalization
• GLS = Generalized Least Squares filter
• Both use samples that characterize the clutter
  • Variation not related to the problem of interest
  • Classification problems: intra-class (within-class) variance
  • Regression problems: variation among samples with the same property value
• EPO makes a PCA model of the clutter and orthogonalizes the data against the first few PCs (hard filter)
• GLS calculates a weighted inverse of the clutter covariance and applies it to all data (soft filter)
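
The hard versus soft distinction can be illustrated with a small sketch. Both functions are assumptions of this example rather than the implementation used for the slides. In particular, the GLS weighting form (eigenvalues s of the clutter covariance mapped to 1/sqrt(s/alpha + 1)) and the parameter name `alpha` are one commonly described choice, stated here as an assumption. The clutter data is synthetic.

```python
import numpy as np

def epo_filter(X_clutter, n_components):
    """Hard filter: projection that removes the leading clutter directions."""
    Xc = X_clutter - X_clutter.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                  # loadings of the clutter PCA model
    return np.eye(Xc.shape[1]) - P @ P.T

def gls_filter(X_clutter, alpha=0.01):
    """Soft filter: shrink each clutter direction according to its variance.

    Assumed form: G = V diag(1 / sqrt(s / alpha + 1)) V^T, with V and s the
    eigenvectors and eigenvalues of the clutter covariance.
    """
    Xc = X_clutter - X_clutter.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    s, V = np.linalg.eigh(cov)
    s = np.clip(s, 0.0, None)                # guard against tiny negative values
    d = 1.0 / np.sqrt(s / alpha + 1.0)
    return (V * d) @ V.T

# Synthetic clutter: strong variation along one direction plus weak noise.
rng = np.random.default_rng(0)
clutter_dir = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
X_clutter = rng.normal(size=(30, 1)) * clutter_dir + 0.001 * rng.normal(size=(30, 3))

x = np.array([1.0, 1.0, 1.0])                # a row to be filtered
x_epo = x @ epo_filter(X_clutter, n_components=1)
x_gls = x @ gls_filter(X_clutter, alpha=0.01)
```

EPO removes the clutter direction entirely, while the GLS-type filter only shrinks it and leaves directions with little clutter variance nearly untouched. Smaller `alpha` gives stronger shrinkage. The same filter matrix is applied to calibration and new data.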

With MSC and GLS

[Figure: PCA Scores of Edible Oils with MSC and Centered. x-axis: Scores on PC 1 (75.67%); y-axis: Scores on PC 2 (22.33%); classes: CMarg, Corn, Olive, Saffl.]

[Figure: PCA Scores of Edible Oils with MSC, GLS & Centered. x-axis: Scores on PC 1 (73.38%); y-axis: Scores on PC 2 (25.05%).]

[Figure: PCA Scores of Test Samples with MSC, GLS & Centered. x-axis: Scores on PC 1 (73.38%); y-axis: Scores on PC 2 (25.05%).]

NIR Shootout 2002

• Estimate tablet assay value from NIR transmittance
  • Calibration (155 samples), Test (460 samples)

[Figure: Raw Tablet Data Colored by Assay Value. x-axis: Wavelength (nm), 1100 to 1600; y-axis: Signal; color scale: assay value, 160 to 230.]

[Figure: MSC Preprocessed Tablet Data Colored by Assay Value. x-axis: Variables, 1100 to 1600; y-axis: Data; color scale: assay value, 160 to 230.]

[Figure: MSC & 1st Derivative Preprocessed Tablet Data. x-axis: Variables, 1100 to 1600; y-axis: Data; color scale: assay value, 160 to 230.]

[Figure: MSC, 1st Derivative and Mean Centered Tablet Data. x-axis: Variables, 1100 to 1600; y-axis: Data; color scale: assay value, 160 to 230.]

Prediction Error on Validation Set

[Figure: bar chart of RMSEP for Validation Set (axis 0 to 10) by Preprocessing Method: Mean Center; MSC; 1-Norm; 2-Norm; SavGol-1,SNV; SavGol-2,SNV; SNV; EMSC; SavGol-2,EMSC; GLSW; SavGol-2,SNV,GLSW.]

[Figure: Assay, SavGol-2,SNV,GLSW. Predicted vs. Measured, both axes 150 to 250; Calibration and Validation samples; RMSEP = 2.7.]
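
RMSEP, the figure of merit in the plots above, is the root mean squared difference between predicted and measured values on samples not used for calibration. A minimal sketch with invented assay values:

```python
import numpy as np

def rmsep(y_measured, y_predicted):
    """Root mean squared error of prediction."""
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((y_predicted - y_measured) ** 2)))

# Made-up measured and predicted assay values, for illustration only.
value = rmsep([200.0, 210.0, 190.0], [203.0, 206.0, 190.0])
```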

Perspectives on Preprocessing

• Order matters. The general approach is:
  1. Background and offset removal
  2. Normalization
  3. Centering
  4. Scaling
• Always keep in mind: “What is each preprocessing step supposed to be doing?”
• Plot data after preprocessing and color code!
• Always compare the effect of the preprocessing (classification or regression error rates) with the results from a model based on the raw data
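
The ordering can be sketched as a small pipeline: row-wise steps (background removal, normalization) first, then column-wise steps (centering, optional scaling) with statistics learned from calibration data only. The specific choices here (a first-order detrend, SNV, the function names, and the synthetic random-walk "spectra") are assumptions for illustration.

```python
import numpy as np

def detrend(X, order=1):
    """Step 1: remove a polynomial background from each row."""
    t = np.linspace(-1.0, 1.0, X.shape[1])
    coeffs = np.polyfit(t, X.T, order)           # shape (order + 1, n_samples)
    return X - (np.vander(t, order + 1) @ coeffs).T

def snv(X):
    """Step 2: normalize each row."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def fit_preprocessing(X_cal, scale=False):
    """Learn the column statistics for steps 3 and 4 from calibration data."""
    rows_done = snv(detrend(X_cal))
    mean = rows_done.mean(axis=0)
    std = rows_done.std(axis=0, ddof=1) if scale else np.ones(X_cal.shape[1])
    return mean, std

def apply_preprocessing(X, mean, std):
    """Run all steps in order, reusing the calibration statistics."""
    return (snv(detrend(X)) - mean) / std

rng = np.random.default_rng(1)
X_cal = rng.normal(size=(20, 50)).cumsum(axis=1)   # synthetic stand-in for spectra
X_test = rng.normal(size=(5, 50)).cumsum(axis=1)

mean, std = fit_preprocessing(X_cal)
X_cal_p = apply_preprocessing(X_cal, mean, std)
X_test_p = apply_preprocessing(X_test, mean, std)
```

Plotting `X_cal_p` colored by the property of interest after each step is a quick way to check that a step is doing what it is supposed to do.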

Preprocessing will offer…

• Models with better predictive or classification performance and/or
• Simpler models that are more robust and/or easier to interpret
• But there is a risk that you can remove useful information from the data
• Preprocessing must be validated as part of the model development process