PLS Tutorial


Partial Least Squares: A Tutorial

Lutgarde Buydens

12/9/2013

Outline

- Multivariate regression
  - Multiple Linear Regression (MLR)
  - Principal Component Regression (PCR)
  - Partial Least Squares (PLS)
- Validation
- Preprocessing

Multivariate Regression

Data: X (n×p) and Y (n×k).

- Rows: cases / observations, e.g. analytical observations of different samples, experimental runs, persons, …
- Columns of X: the p variables, e.g. spectral variables or analytical measurements
- Columns of Y: the k classes / tags, e.g. class information or concentrations
- X: independent variables (will always be available)
- Y: dependent variables (to be predicted later from X)

[Figure: raw spectra, absorbance vs. wavenumber (cm⁻¹)]


Y = f(X): predict Y from X

- MLR: Multiple Linear Regression
- PCR: Principal Component Regression
- PLS: Partial Least Squares

From univariate to Multiple Linear Regression (MLR)

Least squares regression (univariate):

y = b0 + b1·x1 + e    (b0: intercept, b1: slope)

MLR: Multiple Linear Regression

Extending least squares to p variables:

y = b0 + b1·x1 + b2·x2 + … + bp·xp + e

Least squares chooses b0 … bp so that the correlation r(ŷ, y) between fitted and observed responses is maximal (equivalently, the sum of squared residuals is minimal).


MLR: Multiple Linear Regression, matrix form

y = b0 + b1·x1 + b2·x2 + … + bp·xp + e

In matrix notation (a leading column of ones in X carries the intercept b0):

y(n×1) = X(n×(p+1)) b((p+1)×1) + e(n×1)

and, for k response variables simultaneously:

Y(n×k) = X(n×p) B(p×k) + E(n×k)

The least-squares solution is

b = (Xᵀ X)⁻¹ Xᵀ y
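As a sketch, the least-squares solution above can be computed directly with NumPy; the data below are hypothetical, generated noise-free so that the true coefficients are known:

```python
import numpy as np

# Hypothetical, noise-free data: n = 8 samples, p = 2 variables,
# generated with b0 = 1, b1 = 2, b2 = -3.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the intercept b0 is estimated with the slopes.
X1 = np.hstack([np.ones((len(X), 1)), X])

# Normal equations b = (X^T X)^{-1} X^T y; solve() avoids an explicit inverse.
b = np.linalg.solve(X1.T @ X1, X1.T @ y)
# b recovers [1, 2, -3]
```

Because the toy data are exact, the normal equations recover the generating coefficients up to floating-point error.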

MLR: disadvantages

Computing (Xᵀ X)⁻¹ requires:
- X-variables that are not (nearly) collinear
- more samples than variables: n ≥ p + 1

When r(x1, x2) → 1, the data points lie almost on a line in the x1-x2 plane, and MLR fits a plane through a line! The fitted plane, and hence the coefficients, become extremely unstable.

Example: two data sets that are identical except for a single value of x2 (0.21 vs. 0.23):

    x1       x2 (Set A)   x2 (Set B)   y
     3.67     3.76         3.76        11.29
    -2.87    -2.91        -2.91        -8.09
     0.23     0.21         0.23         2.19
     5.49     5.55         5.55        19.09
     3.23     3.25         3.25        10.33
    -1.01    -0.99        -0.99        -1.89

Model: y = b1·x1 + b2·x2 + e

            Set A             Set B
            b1      b2        b1      b2
    MLR     10.3   -6.92      2.96    0.28
            R² = 0.98         R² = 0.98

One barely noticeable change in the data flips the coefficients completely, although both fits describe the data equally well.
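This instability can be reproduced numerically. The sketch below (values transcribed from the table above, fitted without an intercept as on the slide) refits the model for both sets and compares the coefficients:

```python
import numpy as np

x1  = np.array([3.67, -2.87, 0.23, 5.49, 3.23, -1.01])
x2a = np.array([3.76, -2.91, 0.21, 5.55, 3.25, -0.99])  # Set A
x2b = np.array([3.76, -2.91, 0.23, 5.55, 3.25, -0.99])  # Set B: one value differs
y   = np.array([11.29, -8.09, 2.19, 19.09, 10.33, -1.89])

def mlr_no_intercept(x2):
    X = np.column_stack([x1, x2])
    # lstsq is more stable than forming (X^T X)^{-1} when X is near-collinear
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

b_a = mlr_no_intercept(x2a)
b_b = mlr_no_intercept(x2b)
# b_a and b_b differ strongly, even though the data barely changed
```

Both fits have high R², yet the coefficient on x2 swings by several units between the two sets: exactly the pathology the slide illustrates.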

MLR: remedies

(Xᵀ X)⁻¹ also fails when X is "fat" (p + 1 > n). Both problems are handled by reducing the dimension of X first:
- variable selection
- latent variables (PCR, PLS)

PCR: Principal Component Regression

Step 1: perform PCA on the original X (n×p), giving scores T (n×a).
Step 2: use the orthogonal PC scores as independent variables in an MLR model for y, giving a-coefficients a1 … aa (and b0).
Step 3: calculate the b-coefficients b1 … bp for the original variables from the a-coefficients.

  • 7/26/2019 Pls Tutorial

    3/12

    12/9/2013

    3

PCR: Principal Component Regression

Dimension reduction: use the scores (projections) on latent variables (PC1, PC2, …) that explain maximal variance in X.

PCR: the algorithm

Step 0: mean-center X.
Step 1: perform PCA: X = T Pᵀ; keep a components, i.e. MLR is done on the reconstructed X* = (T Pᵀ)*.
Step 2: perform MLR on the scores: Y = T A, so A = (Tᵀ T)⁻¹ Tᵀ Y.
Step 3: calculate B for the original variables: Y = X* B = (T Pᵀ) B, so A = Pᵀ B; because P is orthonormal (Pᵀ P = I), B = P A. Finally, calculate the intercepts b0 from the variable means.
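The three steps can be sketched in NumPy as follows; the function and the toy data are illustrative, not part of the original tutorial:

```python
import numpy as np

def pcr_fit(X, y, a):
    """Principal Component Regression sketch: PCA via SVD, MLR on the scores."""
    # Step 0: mean-center (centering y and computing b0 from the means
    # is equivalent to carrying an explicit intercept)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step 1: PCA, keep a components: X ~ T P^T
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:a].T               # loadings (p x a)
    T = Xc @ P                 # scores   (n x a), orthogonal columns
    # Step 2: MLR on the scores: A = (T^T T)^{-1} T^T y
    A = np.linalg.solve(T.T @ T, T.T @ yc)
    # Step 3: back-transform to coefficients for the original variables
    b = P @ A
    b0 = y_mean - x_mean @ b
    return b0, b

# Hypothetical noise-free data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 3.0
b0, b = pcr_fit(X, y, a=5)     # with all 5 PCs, PCR coincides with MLR here
```

With all components retained PCR reproduces the MLR solution; in practice a < p is chosen by cross-validation, as described below.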

PCR: optimal number of PCs

Calculate the cross-validation RMSE for different numbers of PCs and choose the minimum:

RMSECV = √( Σᵢ (ŷᵢ − yᵢ)² / n )

PLS: Partial Least Squares Regression

As in PCR, X (n×p) is replaced by a small set of scores T (n×a), and Y (n×k) is regressed on them:

Phase 1: calculate the new independent variables (the scores T).
Phase 2: MLR of Y on the scores, giving a-coefficients a1 … aa.
Phase 3: calculate the b-coefficients b0, b1 … bp for the original variables.

PLS: Projection to Latent Structure

PCR uses PCs, which maximize the variance in X.
PLS uses LVs (LV1 = w), which maximize the covariance of X and y:

cov(Xw, y) = sd(Xw) · sd(y) · cor(Xw, y)

so an LV balances variance in X against correlation with y.

PLS: the sequential algorithm

Latent variables and their scores are calculated sequentially.

Step 0: mean-center X.
Step 1 (Phase 1, calculate the new independent variables T): find LV1 = w1, the direction maximizing the covariance of X and Y, via an SVD of XᵀY:

(Xᵀ Y)(p×k) = W(p×a) D(a×a) Zᵀ(a×k),  with w1 = 1st column of W.


Step 2: calculate t1, the scores (projections) of X on w1:

t(n×1) = X(n×p) w(p×1)
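A sketch of this SVD step with random (hypothetical) data:

```python
import numpy as np

# Hypothetical data: 30 samples, 6 X-variables, 2 Y-variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
Y = rng.normal(size=(30, 2))
Xc = X - X.mean(axis=0)        # Step 0: mean-center
Yc = Y - Y.mean(axis=0)

# Step 1: SVD of X^T Y; the first left singular vector is w1,
# the direction of maximal covariance between X-scores and Y.
W, D, Zt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
w1 = W[:, 0]                   # unit-length weight vector

# Step 2: scores t1 = projections of X on w1
t1 = Xc @ w1
```

By the properties of the SVD, no other unit direction v gives a larger value of ‖Yᵀ X v‖ than w1, which is what "maximizes the covariance" means here.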

PLS: optimal number of LVs

As for PCR, calculate the cross-validation RMSE for different numbers of LVs and choose the minimum:

RMSECV = √( Σᵢ (ŷᵢ − yᵢ)² / n )

PLS: the collinearity example revisited

For the two data sets A and B above (identical except for one value of x2), with y = b1·x1 + b2·x2 + e:

            Set A             Set B
            b1      b2        b1      b2
    MLR     10.3   -6.92      2.96    0.28
    PCR     1.60    1.62      1.60    1.62
    PLS     1.60    1.62      1.60    1.62

Unlike MLR, the latent-variable methods give stable, nearly identical coefficients for both sets.

MLR, PCR, PLS: VALIDATION

Estimating prediction error. Basic principle: test how well your model works on new data it has not seen yet! A common measure of prediction error is the RMSEP (root mean squared error of prediction).


A biased approach

Computing the prediction error on the samples the model was built on gives a biased error: those samples were also used to build the model, so the model is biased towards accurate prediction of these specific samples. Hence the basic principle: test how well your model works on new data it has not seen yet!

Validation: basic principle

Split the data into a training set and a test set. Several ways:
- one large test set
- leave one out and repeat (LOO)
- leave n objects out and repeat (LnO)
- …

The entire modelling procedure is applied within the training data; the test set is used only for evaluation.

Validation

Full data set → training set + test set. Build the model (b0 … bp) on the training set, predict the test set (ŷ) and compute the RMSEP.

Remark: for the final model, use the whole data set.

Training and test sets

Split into a training set and a test set. The test set should be representative of the training set; a random choice is often the best, but check for extremely unlucky divisions. Apply the whole procedure to each training/test division.

Cross-validation

Simplest case: leave-one-out (LOO, segment = 1 sample). Normally 10-20% is left out per segment (LnO).

Remark: for the final model, use the whole data set.
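The leave-one-out scheme can be sketched for any fit/predict pair; here with MLR on hypothetical data (the helper names are mine, not from the tutorial):

```python
import numpy as np

def rmsecv_loo(X, y, fit, predict):
    """Leave-one-out cross-validation RMSE (sketch)."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i              # everything except sample i
        model = fit(X[mask], y[mask])         # build on the training part
        errs[i] = predict(model, X[i:i + 1])[0] - y[i]  # predict the left-out sample
    return np.sqrt(np.mean(errs ** 2))

# MLR with intercept as the model being validated
def fit_mlr(X, y):
    X1 = np.hstack([np.ones((len(X), 1)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b

def predict_mlr(b, X):
    return np.hstack([np.ones((len(X), 1)), X]) @ b

# Hypothetical data with a little noise
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=25)
rmsecv = rmsecv_loo(X, y, fit_mlr, predict_mlr)
```

Each sample is predicted by a model that never saw it, so the resulting RMSE is an (almost) unbiased estimate of the prediction error.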

    Cross-validation: an example

    The data


Split the data into a training set and a test (validation) set, build a model on the training set, and predict the held-out samples. Then split again, and repeat until every sample has been in the validation set exactly once. Common choice: leave-one-out (LOO).



Cross-validation: a warning

Data: 13 × 5 = 65 NIR spectra (1102 wavelengths); 13 samples with different compositions of NaOH, NaOCl and Na2CO3, each sample measured at 5 temperatures.

    Sample   NaOH (wt%)   NaOCl (wt%)   Na2CO3 (wt%)   Temperature (°C)
    1        18.99        0             0              15, 21, 27, 34, 40
    2         9.15        9.99          0.15           15, 21, 27, 34, 40
    3        15.01        0             4.01           15, 21, 27, 34, 40
    4         9.34        5.96          3.97           15, 21, 27, 34, 40
    …
    13       16.02        2.01          1.00           15, 21, 27, 34, 40

Because each of the 13 samples appears five times in the 65 × 1102 data matrix (once per temperature), cross-validation must leave the whole SAMPLE out, i.e. all five of its spectra at once, rather than single spectra; otherwise near-identical spectra of the held-out sample remain in the training set and the error estimate is far too optimistic.
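For the data above this means cross-validating over the 13 samples, not the 65 spectra. A sketch of the split generator (index layout assumed: 5 consecutive temperatures per sample):

```python
import numpy as np

# 13 samples x 5 temperatures = 65 spectra; one group label per spectrum.
sample_id = np.repeat(np.arange(13), 5)

def leave_sample_out_splits(sample_id):
    """Yield (train, test) index arrays, holding out one whole sample at a time."""
    for s in np.unique(sample_id):
        test = sample_id == s
        yield np.where(~test)[0], np.where(test)[0]

splits = list(leave_sample_out_splits(sample_id))  # 13 splits, 5 test spectra each
```

Each split keeps all replicate spectra of the held-out sample together, which is the point of the warning.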


Selection of the number of LVs

Through validation: choose the number of LVs that gives the model with the lowest prediction error. The test set reserved for assessing the final model cannot be used for this! Instead, divide the training set further (cross-validation):

Full data set → training set + test set.
1) Determine the number of LVs on validation splits of the training set.
2) Build the model (b0 … bp) on the training set and compute the RMSEP on the untouched test set.

Remark: for the final model, use the whole data set.

Double Cross-Validation

Full data set → training set + test set, repeated in two nested CV loops (CV1 and CV2):
1) inner CV loop: determine the number of LVs;
2) outer CV loop: build the model (b0 … bp) and estimate the RMSEP.

Remark: for the final model, use the whole data set.

Double cross-validation: an example

Split the data into a training set and a validation set; the validation set is used only later, to assess model performance!


Apply cross-validation to the rest: split the training set into a (new) training set and a test set, fit models with 1, 2, 3, … LVs on each split, and choose the number of LVs with the lowest RMSECV.


Repeat the procedure until all samples have been in the validation set once.

Double cross-validation

In this way:
- the number of LVs is determined using samples not used to build the model;
- the prediction error is also determined using samples the model has not seen before.

Remark: for the final model, use the whole data set.
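A minimal numerical sketch of the double loop; the fold construction and the toy "complexity" knob (number of variables used) are illustrative stand-ins for the number of LVs:

```python
import numpy as np

def kfold(n, k):
    """Deterministic contiguous folds (illustrative)."""
    return np.array_split(np.arange(n), k)

def double_cv(X, y, complexities, fit, predict, k_outer=5, k_inner=4):
    """Nested CV: the inner loop picks the complexity, the outer loop
    estimates the prediction error on data the chosen model never saw."""
    sq_errs = []
    for test_idx in kfold(len(y), k_outer):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        Xtr, ytr = X[train_idx], y[train_idx]
        # Inner loop: cross-validate each candidate on the training part only.
        inner_rmse = []
        for a in complexities:
            errs = []
            for val_idx in kfold(len(ytr), k_inner):
                fit_idx = np.setdiff1d(np.arange(len(ytr)), val_idx)
                m = fit(Xtr[fit_idx], ytr[fit_idx], a)
                errs.extend(predict(m, Xtr[val_idx]) - ytr[val_idx])
            inner_rmse.append(np.sqrt(np.mean(np.square(errs))))
        best = complexities[int(np.argmin(inner_rmse))]
        # Outer loop: refit with the chosen complexity, score the test fold.
        m = fit(Xtr, ytr, best)
        sq_errs.extend((predict(m, X[test_idx]) - y[test_idx]) ** 2)
    return np.sqrt(np.mean(sq_errs))

# Toy model family: MLR on the first `a` variables (a stand-in for #LVs).
def fit_first(X, y, a):
    b, *_ = np.linalg.lstsq(X[:, :a], y, rcond=None)
    return (a, b)

def predict_first(model, X):
    a, b = model
    return X[:, :a] @ b

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 6))
y = X[:, :2] @ np.array([1.5, -1.0]) + 0.1 * rng.normal(size=40)
rmsep = double_cv(X, y, [1, 2, 3, 4], fit_first, predict_first)
```

The returned RMSEP is computed only on outer-loop folds that played no part in choosing the complexity, which is exactly the property the slide emphasizes.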

Raw + mean-centered data

[Figure: raw spectra and mean-centered spectra, absorbance (a.u.) vs. wavenumber (cm⁻¹)]

PLS: an example

[Figure: RMSECV vs. number of LVs (1-10) for the prediction of NaOH]

Regression coefficients

[Figure: raw spectra and the corresponding regression coefficients vs. wavenumber (cm⁻¹)]


True vs. predicted

[Figure: predicted vs. true NaOH concentration (wt%)]

Why pre-processing? Data artefacts

[Figure: original spectrum and versions distorted by offset, slope and scatter]

Typical corrections: baseline correction, alignment, scatter correction, noise removal, scaling, normalisation, transformation, …
Other issues: missing values, outliers.

[Figure: simulated spectra showing the original and versions with an offset, offset + slope, multiplicative, and combined offset + slope + multiplicative artefacts]

Pre-processing methods

4914 combinations, all reasonable:

STEP 1, baseline (7×): no baseline correction; detrending with polynomial order 2, 3 or 4 (3×); derivatisation, 1st or 2nd derivative (2×); AsLS.
STEP 2, scatter (10×): no scatter correction; scaling by mean, median, max or L2 norm (4×); SNV; RNV at 15, 25 or 35 % (3×); MSC.
STEP 3, noise (10×): no noise removal; Savitzky-Golay smoothing with window 5, 9 or 11 points and polynomial order 2, 3 or 4 (9×).
STEP 4, scaling & transformation (7×): mean centering; autoscaling; range scaling; Pareto scaling; Poisson scaling; level scaling; log transformation.

Supervised pre-processing methods: OSC or DOSC, combined with no noise removal and one of the scaling/transformation options above (mean centering, autoscaling, range scaling, Pareto scaling, Poisson scaling, level scaling, log scaling).
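As one concrete example from the scatter-correction step, SNV (standard normal variate) centers and scales each spectrum individually; the sketch below shows it removing a simulated offset and multiplicative effect exactly:

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: subtract each spectrum's mean and divide by
    its standard deviation (row-wise), a common scatter correction."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mu) / sd

# Simulated spectrum plus a copy with offset and multiplicative scatter.
base = np.sin(np.linspace(0.0, 3.0, 200))
spectra = np.vstack([base,                 # original
                     2.5 * base + 0.7])    # scaled + shifted copy
corrected = snv(spectra)
# After SNV both rows coincide: the offset and scale artefacts are gone.
```

Because SNV is invariant to any per-spectrum affine distortion with positive scale, the two rows become identical after correction.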

Pre-processing results

[Figure: classification accuracy (%) vs. model complexity (number of LVs) for all pre-processing combinations, with the raw data marked for reference]

J. Engel et al., TrAC 2013


SOFTWARE

- PLS Toolbox (Eigenvector Inc.), www.eigenvector.com: for use in MATLAB (or standalone!)
- XLSTAT-PLS (XLSTAT), www.xlstat.com: for use in Microsoft Excel
- Package pls for R, free software: http://cran.r-project.org