
An Introduction to Partial Least Squares Regression

Randall D. Tobias, SAS Institute Inc., Cary, NC

Abstract

Partial least squares is a popular method for soft modelling in industrial applications. This paper introduces the basic concepts and illustrates them with a chemometric example. An appendix describes the experimental PLS procedure of SAS/STAT software.

Introduction

Research in science and engineering often involves using controllable and/or easy-to-measure variables (factors) to explain, regulate, or predict the behavior of other variables (responses). When the factors are few in number, are not significantly redundant (collinear), and have a well-understood relationship to the responses, then multiple linear regression (MLR) can be a good way to turn data into information. However, if any of these three conditions breaks down, MLR can be inefficient or inappropriate. In such so-called soft science applications, the researcher is faced with many variables and ill-understood relationships, and the object is merely to construct a good predictive model. For example, spectrographs are often used to estimate the amount of different compounds in a chemical sample. (See Figure 2.) In this case, the factors are the measurements that comprise the spectrum; they can number in the hundreds but are likely to be highly collinear. The responses are component amounts that the researcher wants to predict in future samples.

[Figure 2: Spectrograph for a mixture]


Partial least squares (PLS) is a method for constructing predictive models when the factors are many and highly collinear. Note that the emphasis is on predicting the responses and not necessarily on trying to understand the underlying relationship between the variables. For example, PLS is not usually appropriate for screening out factors that have a negligible effect on the response. However, when prediction is the goal and there is no practical need to limit the number of measured factors, PLS can be a useful tool.

PLS was developed in the 1960s by Herman Wold as an econometric technique, but some of its most avid proponents (including Wold's son Svante) are chemical engineers and chemometricians. In addition to spectrometric calibration as discussed above, PLS has been applied to monitoring and controlling industrial processes; a large process can easily have hundreds of controllable variables and dozens of outputs.

The next section gives a brief overview of how PLS works, relating it to other multivariate techniques such as principal components regression and maximum redundancy analysis. An extended chemometric example is presented that demonstrates how PLS models are evaluated and how their components are interpreted. A final section discusses alternatives and extensions of PLS. The appendices introduce the experimental PLS procedure for performing partial least squares and related modeling techniques.

How Does PLS Work?

In principle, MLR can be used with very many factors. However, if the number of factors gets too large (for example, greater than the number of observations), you are likely to get a model that fits the sampled data perfectly but that will fail to predict new data well. This phenomenon is called over-fitting. In such cases, although there are many manifest factors, there may be only a few underlying or latent factors that account for most of the variation in the response. The general idea of PLS is to try to extract these latent factors, accounting for as much of the manifest factor variation


as possible while modeling the responses well. For this reason, the acronym PLS has also been taken to mean "projection to latent structure." It should be noted, however, that the term "latent" does not have the same technical meaning in the context of PLS as it does for other multivariate techniques. In particular, PLS does not yield consistent estimates of what are called latent variables in formal structural equation modelling (Dijkstra 1983, 1985).

Figure 3 gives a schematic outline of the method. The overall goal (shown in the lower box) is to use

[Figure 3: Indirect modeling]

the factors to predict the responses in the population. This is achieved indirectly by extracting latent variables T and U from sampled factors and responses, respectively. The extracted factors T (also referred to as X-scores) are used to predict the Y-scores U, and then the predicted Y-scores are used to construct predictions for the responses. This procedure actually covers various techniques, depending on which source of variation is considered most crucial:

• Principal Components Regression (PCR): The X-scores are chosen to explain as much of the factor variation as possible. This approach yields informative directions in the factor space, but they may not be associated with the shape of the predicted surface.

• Maximum Redundancy Analysis (MRA) (van den Wollenberg 1977): The Y-scores are chosen to explain as much of the predicted Y variation as possible. This approach seeks directions in the factor space that are associated with the most variation in the responses, but the predictions may not be very accurate.

• Partial Least Squares: The X- and Y-scores are chosen so that the relationship between


successive pairs of scores is as strong as possible. In principle, this is like a robust form of redundancy analysis, seeking directions in the factor space that are associated with high variation in the responses but biasing them toward directions that are accurately predicted.

Another way to relate the three techniques is to note that PCR is based on the spectral decomposition of X'X, where X is the matrix of factor values; MRA is based on the spectral decomposition of Y'Y, where Y is the matrix of (predicted) response values; and PLS is based on the singular value decomposition of X'Y. In SAS® software, both the REG procedure and SAS/INSIGHT® software implement forms of principal components regression; redundancy analysis can be performed using the TRANSREG procedure.

If the number of extracted factors is greater than or equal to the rank of the sample factor space, then PLS is equivalent to MLR. An important feature of the method is that usually far fewer factors are required. The precise number of extracted factors is usually chosen by some heuristic technique based on the amount of residual variation. Another approach is to construct the PLS model for a given number of factors on one set of data and then to test it on another, choosing the number of extracted factors for which the total prediction error is minimized. Alternatively, van der Voet (1994) suggests choosing the least number of extracted factors whose residuals are not significantly greater than those of the model with minimum error. If no convenient test set is available, then each observation can be used in turn as a test set; this is known as cross-validation.

Example: Spectrometric Calibration

Suppose you have a chemical process whose yield has five different components. You use an instrument to predict the amounts of these components based on a spectrum. In order to calibrate the instrument, you run 20 different known combinations of the five components through it and observe the spectra. The results are twenty spectra with their associated component amounts, as in Figure 2.

PLS can be used to construct a linear predictive model for the component amounts based on the spectrum. Each spectrum is comprised of measurements at 1,000 different frequencies; these are the factor levels, and the responses are the five component amounts.
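To make the connection to the singular value decomposition noted above concrete for this example, the following SAS/IML sketch computes the first PLS weight direction and score vector directly from the SVD of X'Y. It assumes the calibration data are in the SPECTRA data set described in Appendix 2; it illustrates only the first extracted factor, not the full PLS algorithm.

   proc iml;
      /* Read the spectra (factors) and component amounts (responses) */
      /* from the SPECTRA data set described in Appendix 2.           */
      use spectra;
      read all var ("x1":"x1000") into x;
      read all var ("y1":"y5")    into y;
      close spectra;

      /* Center the columns of X and Y */
      x = x - repeat(x[:,], nrow(x), 1);
      y = y - repeat(y[:,], nrow(y), 1);

      /* PLS is based on the singular value decomposition of X'Y:     */
      /* the leading left singular vector gives the first weight      */
      /* direction, and the corresponding X-score is its projection.  */
      call svd(u, q, v, x`*y);
      w1 = u[, 1];          /* first X-weight direction               */
      t1 = x * w1;          /* first X-score (extracted factor)       */
      print (t1`*t1)[label="Sum of squares of the first X-score"];
   quit;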


Table 2: PLS analysis of spectral calibration, with cross-validation

   Number of      Percent Variation Accounted For           Cross-validation
   PLS Factors    Factors              Responses            Comparison
                  Current    Total     Current    Total     PRESS     Prob
       0                                                    1.067     0
       1          39.35     39.35      28.70     28.70      0.929     0
       2          29.93     69.28      25.57     54.27      0.851     0
       3           7.94     77.22      21.87     76.14      0.728     0
       4           6.40     83.62       6.45     82.59      0.600     0.002
       5           2.07     85.69      16.95     99.54      0.312     0.261
       6           1.20     86.89       0.38     99.92      0.305     0.428
       7           1.15     88.04       0.04     99.96      0.305     0.478
       8           1.12     89.16       0.02     99.98      0.306     0.023
       9           1.06     90.22       0.01     99.99      0.304     1.000
      10           1.02     91.24       0.01    100.00      0.306     0.091

The left-hand side of Table 2 shows the individual and cumulative variation accounted for by the first ten PLS factors, for both the factors and the responses. Notice that the first five PLS factors account for almost all of the variation in the responses, with the fifth factor accounting for a sizable proportion. This gives a strong indication that five PLS factors are appropriate for modeling the five component amounts. The cross-validation analysis confirms this: although the model with nine PLS factors achieves the absolute minimum predicted residual sum of squares (PRESS), it is insignificantly better than the model with only five factors.

The PLS factors are computed as certain linear combinations of the spectral amplitudes, and the responses are predicted linearly based on these extracted factors. Thus, the final predictive function for each response is also a linear combination of the spectral amplitudes. The trace for the resulting predictor of the first response is plotted in Figure 4.
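In matrix terms, this composition of two linear maps can be sketched as follows; the weight matrix W and response-loading matrix Q are generic PLS notation introduced here only for illustration, not symbols defined in the paper:

   T = XW, \qquad \hat{Y} = TQ' = X(WQ') = XB, \qquad B = WQ'

so each column of B (one per response) contains the coefficients of the corresponding linear predictor; the first column is the trace plotted in Figure 4.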

[Figure 4: PLS predictor coefficients for one response]

Notice that a PLS prediction is not associated with a single frequency or even just a few, as would be the case if we tried to choose optimal frequencies for predicting each response (stepwise regression). Instead, PLS prediction is a function of all of the input factors.


In this case, the PLS predictions can be interpreted as contrasts between broad bands of frequencies.

Discussion

As discussed in the introductory section, soft science applications involve so many variables that it is not practical to seek a "hard" model explicitly relating them all. Partial least squares is one solution for such problems, but there are others, including

• other factor extraction techniques, like principal components regression and maximum redundancy analysis

• ridge regression, a technique that originated within the field of statistics (Hoerl and Kennard 1970) as a method for handling collinearity in regression

• neural networks, which originated with attempts

in computer science and biology to simulate the way animal brains recognize patterns (Haykin 1994, Sarle 1994).

Ridge regression and neural nets are probably the strongest competitors for PLS in terms of flexibility and robustness of the predictive models, but neither of them explicitly incorporates dimension reduction, that is, linearly extracting a relatively few latent factors that are most useful in modeling the response. For more discussion of the pros and cons of soft modeling alternatives, see Frank and Friedman (1993).

There are also modifications and extensions of partial least squares. The SIMPLS algorithm of de Jong


(1993) is a closely related technique. It is exactly the same as PLS when there is only one response and invariably gives very similar results, but it can be dramatically more efficient to compute when there are many factors. Continuum regression (Stone and Brooks 1990) adds a continuous parameter α, where 0 ≤ α ≤ 1, allowing the modeling method to vary continuously between MLR (α = 0), PLS (α = 0.5), and PCR (α = 1). De Jong and Kiers (1992) describe a related technique called principal covariates regression.

In any case, PLS has become an established tool in chemometric modeling, primarily because it is often possible to interpret the extracted factors in terms of the underlying physical system, that is, to derive "hard" modeling information from the soft model. More work is needed on applying statistical methods to the selection of the model. The idea of van der Voet (1994) for randomization-based model comparison is a promising advance in this direction.

For Further Reading

PLS is still evolving as a statistical modeling technique, and thus there is no standard text yet that gives it in-depth coverage. Geladi and Kowalski (1986) is a standard reference introducing PLS in chemometric applications. For technical details, see Naes and Martens (1985) and de Jong (1993), as well as the references in the latter.

References

Dijkstra, T. (1983), "Some comments on maximum likelihood and partial least squares methods," Journal of Econometrics, 22, 67-90.

Dijkstra, T. (1985), Latent Variables in Linear Stochastic Models: Reflections on Maximum Likelihood and Partial Least Squares Methods, 2nd ed., Amsterdam, The Netherlands: Sociometric Research Foundation.

Geladi, P. and Kowalski, B. (1986), "Partial least squares regression: A tutorial," Analytica Chimica Acta, 185, 1-17.

Frank, I. and Friedman, J. (1993), "A statistical view of some chemometrics regression tools," Technometrics, 35, 109-135.


Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, New York: Macmillan.

Helland, I. (1988), "On the structure of partial least squares regression," Communications in Statistics, Simulation and Computation, 17(2), 581-607.

Hoerl, A. and Kennard, R. (1970), "Ridge regression: biased estimation for non-orthogonal problems," Technometrics, 12, 55-67.

de Jong, S. and Kiers, H. (1992), "Principal covariates regression," Chemometrics and Intelligent Laboratory Systems, 14, 155-164.

de Jong, S. (1993), "SIMPLS: An alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, 18, 251-263.

Naes, T. and Martens, H. (1985), "Comparison of prediction methods for multicollinear data," Communications in Statistics, Simulation and Computation, 14(3), 545-576.

Rannar, Lindgren, Geladi, and Wold (1994), "A PLS kernel algorithm for data sets with many variables and fewer objects," Journal of Chemometrics, 8, 111-125.

Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, 1538-1550.

Stone, M. and Brooks, R. (1990), "Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares, and principal components regression," Journal of the Royal Statistical Society, Series B, 52(2), 237-269.

van den Wollenberg, A.L. (1977), "Redundancy Analysis--An Alternative to Canonical Correlation Analysis," Psychometrika, 42, 207-219.

van der Voet, H. (1994), "Comparing the predictive accuracy of models using a simple randomization test," Chemometrics and Intelligent Laboratory Systems, 25, 313-323.

SAS, SAS/INSIGHT, and SAS/STAT are registered trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


Appendix 1: PROC PLS: An Experimental SAS Procedure for Partial Least Squares

An experimental SAS/STAT software procedure, PROC PLS, is available with Release 6.11 of the SAS System for performing various factor-extraction methods of modeling, including partial least squares. Other methods currently supported include alternative algorithms for PLS, such as the SIMPLS method of de Jong (1993) and the RLGW method of Rannar et al. (1994), as well as principal components regression. Maximum redundancy analysis will also be included in a future release. Factors can be specified using GLM-type modeling, allowing for polynomial, cross-product, and classification effects. The procedure offers a wide variety of methods for performing cross-validation on the number of factors, with an optional test for the appropriate number of factors. There are output data sets for cross-validation and model information as well as for predicted values and estimated factor scores.

You can specify the following statements with the PLS procedure. Items within angle brackets are optional.

   PROC PLS < options >;
      CLASS class-variables;
      MODEL responses = effects < / options >;
      OUTPUT OUT=SAS-data-set keyword=names < ... keyword=names >;

PROC PLS Statement

   PROC PLS < options >;

You use the PROC PLS statement to invoke the PLS procedure and optionally to indicate the analysis data and method. The following options are available:

DATA=SAS-data-set
   specifies the input SAS data set that contains the factor and response values.

METHOD=factor-extraction-method
   specifies the general factor extraction method to be used. You can specify any one of the following:

   METHOD=PLS < (PLS-options) >
      specifies partial least squares. This is the default factor extraction method.

   METHOD=SIMPLS
      specifies the SIMPLS method of de Jong (1993). This is a more efficient algorithm than standard PLS; it


is equivalent to standard PLS when there is only one response, and it invariably gives very similar results.

   METHOD=PCR

      specifies principal components regression.

You can specify the following PLS-options in parentheses after METHOD=PLS:

ALGORITHM=PLS-algorithm

   gives the specific algorithm used to compute PLS factors. Available algorithms are

   ITER    the usual iterative NIPALS algorithm
   SVD     singular value decomposition of X'Y, the most exact but least efficient approach
   EIG     eigenvalue decomposition of Y'XX'Y
   RLGW    an iterative approach that is efficient when there are many factors

MAXITER=number
   gives the maximum number of iterations for the ITER and RLGW algorithms. The default is 200.

EPSILON=number
   gives the convergence criterion for the ITER and RLGW algorithms. The default is 10^-12.

CV=cross-validation-method
   specifies the cross-validation method to be used. If you do not specify a cross-validation method, the default action is not to perform cross-validation. You can specify any one of the following:

   CV=ONE

      specifies one-at-a-time cross-validation

   CV=SPLIT < (n) >

      specifies that every nth observation be excluded. You may optionally specify n; the default is 1, which is the same as CV=ONE.

   CV=BLOCK < (n) >

      specifies that blocks of n observations be excluded. You may optionally specify n; the default is 1, which is the same as CV=ONE.


   CV=RANDOM < (cv-random-opts) >
      specifies that random observations be excluded.

   CV=TESTSET(SAS-data-set)
      specifies a test set of observations to be used for cross-validation.

   You also can specify the following cv-random-opts in parentheses after CV=RANDOM:

   NITER=number
      specifies the number of random subsets to exclude.

   NTEST=number
      specifies the number of observations in each random subset chosen for exclusion.

   SEED=number
      specifies the seed value for random number generation.

CVTEST < (cv-test-options) >
   specifies that van der Voet's (1994) randomization-based model comparison test be performed on each cross-validated model. You also can specify the following cv-test-options in parentheses after CVTEST:

   PVAL=number
      specifies the cut-off probability for declaring a significant difference. The default is 0.10.

   STAT=test-statistic

      specifies the test statistic for the model comparison. You can specify either T2, for Hotelling's T² statistic, or PRESS, for the predicted residual sum of squares. T2 is the default.

   NSAMP=number
      specifies the number of randomizations to perform. The default is 1000.

LV=number
   specifies the number of factors to extract. The default number of factors to extract is the number of input factors, in which case the analysis is equivalent to a regular least squares regression of the responses on the input factors.

OUTMODEL=SAS-data-set
   specifies a name for a data set to contain information about the fitted model.

OUTCV=SAS-data-set
   specifies a name for a data set to contain information about the cross-validation.
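As an illustrative sketch of how several of these options might be combined in one invocation (the data set and variable names here are hypothetical, not from the paper's example):

   proc pls data=mydata                     /* hypothetical input data set    */
            method=pls(algorithm=rlgw)      /* iterative RLGW algorithm       */
            lv=10                           /* extract at most 10 factors     */
            cv=random(niter=10 ntest=5 seed=12345)
            cvtest(pval=0.05 stat=press)    /* van der Voet comparison test   */
            outmodel=mdl outcv=cvinfo;
      model y1-y3 = x1-x100;
   run;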

CLASS Statement

   CLASS class-variables;

You use the CLASS statement to identify classification variables, which are factors that separate the observations into groups.

Class-variables can be either numeric or character. The PLS procedure uses the formatted values of class-variables in forming model effects. Any variable in the model that is not listed in the CLASS statement is assumed to be continuous. Continuous variables must be numeric.

MODEL Statement

   MODEL responses = effects < / INTERCEPT >;

You use the MODEL statement to specify the response variables and the independent effects used to model them. Usually you will just list the names of the independent variables as the model effects, but you can also use the effects notation of PROC GLM to specify polynomial effects and interactions. By default the factors are centered and thus no intercept is required in the model, but you can specify the INTERCEPT option to override this behavior.
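For example, a sketch of a MODEL statement using GLM-type effects together with a CLASS variable (the data set and variable names are hypothetical and only illustrate the notation):

   proc pls data=process;                /* hypothetical data set             */
      class catalyst;                    /* classification (grouping) factor  */
      model yield purity =               /* two responses                     */
            temp pressure                /* continuous main effects           */
            temp*pressure                /* cross-product (interaction)       */
            temp*temp                    /* polynomial (quadratic) effect     */
            catalyst                     /* classification effect             */
            / intercept;                 /* request an explicit intercept     */
   run;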


OUTPUT Statement

   OUTPUT OUT=SAS-data-set keyword=names < ... keyword=names >;

You use the OUTPUT statement to specify a data set to receive quantities that can be computed for every input observation, such as extracted factors and predicted values. The following keywords are available:

   PREDICTED    predicted values for responses
   YRESIDUAL    residuals for responses
   XRESIDUAL    residuals for factors
   XSCORE       extracted factors (X-scores, latent vectors, T)
   YSCORE       extracted responses (Y-scores, U)
   STDY         standard error for Y predictions
   STDX         standard error for X predictions
   H            approximate measure of influence
   PRESS        predicted residual sum of squares
   T2           scaled sum of squares of scores


   XQRES        sum of squares of scaled residuals for factors
   YQRES        sum of squares of scaled residuals for responses
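A brief sketch of an OUTPUT statement for the spectrometric calibration example (the output data set name and the names given to the new variables are arbitrary choices, not from the paper):

   proc pls data=spectra method=simpls lv=5;
      model y1-y5 = x1-x1000;
      output out=outpls
             predicted=yhat1-yhat5       /* predicted component amounts */
             yresidual=res1-res5         /* response residuals          */
             xscore=xscr1-xscr5;         /* extracted X-scores          */
   run;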

Appendix 2: Example Code

The data for the spectrometric calibration example is in the form of a SAS data set called SPECTRA with 20 observations, one for each test combination of the five components. The variables are

   X1-X1000    the spectrum for this combination
   Y1-Y5       the component amounts

There is also a test data set of 20 more observations available for cross-validation. The following statements use PROC PLS to analyze the data, using the SIMPLS algorithm and selecting the number of factors with cross-validation.

   proc pls data=spectra method=simpls lv=9
            cv=testset(test5)
            cvtest(stat=press);
      model y1-y5 = x1-x1000;
   run;

The listing has two parts (Figure 5), the first part summarizing the cross-validation and the second part showing how much variation is explained by each extracted factor, for both the factors and the responses. Note that the extracted factors are labeled latent variables in the listing.


                            The PLS Procedure
            Cross Validation for the Number of Latent Variables

                  Number of          Root Mean
                  Latent Variables       PRESS
                          0             1.0670
                          1             0.9286
                          2             0.8510
                          3             0.7282
                          4             0.6001
                          5             0.3123
                          6             0.3051
                          7             0.3047
                          8             0.3055
                          9             0.3045
                         10             0.3061

         Minimum Root Mean PRESS = 0.304457 for 9 latent variables
         Smallest model with p-value > 0.1: 5 latent variables

                            The PLS Procedure
                     Percent Variation Accounted For

         Number of          Model Effects        Dependent Variables
         Latent Variables   Current    Total     Current    Total
                 1          39.3526   39.3526    28.7022   28.7022
                 2          29.9369   69.2895    25.5759   54.2780
                 3           7.9333   77.2228    21.8631   76.1411
                 4           6.4014   83.6242     6.4502   82.5913
                 5           2.0679   85.6920    16.9573   99.5486

Figure 5: PROC PLS output for spectrometric calibration example
