Chemometrics and Intelligent Laboratory Systems 128 (2013) 17–24

journal homepage: www.elsevier.com/locate/chemolab

Key wavelengths selection from near infrared spectra using Monte Carlo sampling–recursive partial least squares

Mingjin Zhang a,⁎, Shizhi Zhang b, Jibran Iqbal c

a Department of Chemistry, Qinghai Normal University, Xining 810008, PR China
b College of Chemistry Life Science, Qinghai University for Nationalities, Xining 810007, PR China
c Interdisciplinary Research Centre in Biomedical Materials, COMSATS Institute of Information Technology, Lahore, Pakistan

⁎ Corresponding author. Tel.: +86 971 6307635. E-mail address: [email protected] (M. Zhang).

0169-7439/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.chemolab.2013.07.009

Article info

Article history:
Received 14 April 2012
Received in revised form 13 July 2013
Accepted 20 July 2013
Available online 27 July 2013

Keywords:
Near infrared
Feature selection
Monte Carlo sampling
Recursive partial least squares

Abstract

Variable selection is a critical step in data analysis for near infrared spectroscopy. Many studies on variable selection have been reported recently, and researchers have proposed a large number of methods to identify variables (wavelengths) that contribute useful information. In the present study, a key wavelength selection method named Monte Carlo sampling–recursive partial least squares (MCS-RPLS) is proposed. The method comprises three main steps: (1) Monte Carlo sampling; (2) feature selection for each subset; and (3) determination of the optimum feature set for the dataset. The method was used for feature selection and multivariate calibration on four near infrared spectroscopic datasets: corn moisture, corn protein, and HSA and γ-globulin in biological samples. The 10-fold cross validation results are compared with those obtained by full-spectrum PLS, Moving Window Partial Least Squares (MWPLS), Monte Carlo-based Uninformative Variable Elimination (MC-UVE) and CARS. The results show that both the data dimensionality and the RMSECV values are greatly reduced when the selected variables are used, so MCS-RPLS is suitable for feature selection from NIR data.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, near-infrared (NIR) spectroscopy has developed rapidly as an analytical method for both simple and complex matrices. Because it is simple, fast and nondestructive, NIR spectroscopy is widely applied in the petrochemical industry [1,2], pharmacy [3,4], environmental analysis [5,6], agriculture [7–9], the food industry [10] and biomedicine [11]. Multivariate calibration methods such as partial least squares (PLS) [12] and principal component regression (PCR) [13] have been widely used for NIR data analysis.

In spectroscopic analysis, multivariate calibration models can extract chemically meaningful information, e.g. structure-related wavelengths, from over-determined systems. However, spectral data measured on modern instruments, such as ultraviolet or near infrared spectrometers, are usually highly collinear, which is one of the most common problems faced by analytical chemists [14]. Consequently, researchers have proposed a variety of latent variable (LV)-based techniques to address this problem, of which PCR and PLS are the most commonly used. In these methods the new variables are combinations of the original variables, which achieves dimensionality reduction. In other words, these methods construct new features as functions of the initial features; hence the new features contain information from all the initial features. A calibration model for spectroscopic data therefore usually includes all the measured wavelengths. Such a full-spectrum model is likely to contain much redundant information, which can degrade the prediction ability of the model [14]. In addition, from the point of view of model interpretation, it is difficult to determine which wavelengths or combinations of wavelengths are responsible for the property of interest. It has been demonstrated both experimentally and theoretically that the performance of a calibration model can be improved by using selected informative wavelengths instead of the full spectrum [14–18].

In essence, models built with selected features are more interpretable because they reflect the relationship between the digitized spectra and the property under investigation, e.g. concentration. Introducing wavelengths that are irrelevant to the investigated property may bias the regression model. Several researchers have discussed the importance of feature selection in NIR data analysis. Xu et al. [15] showed that the accuracy of quantitative analysis by Vis–NIR spectroscopy can be improved through appropriate wavelength selection. Balabin and Smirnov [19] pointed out that variable selection is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). Boaz and Ronald [20] provided a theoretical justification for performing feature selection on the input data before applying multivariate regression algorithms. Zou et al. [21] reviewed variable selection methods in NIRS. In summary, wavelength selection is a key factor in constructing a reliable and interpretable calibration model with good prediction accuracy.

For the multivariate calibration model:

y = Xb + e    (1)

where y denotes the vector of property values (e.g. concentration), X the spectral data matrix, b the vector of regression coefficients and e the prediction error. The absolute value of the ith element of b, denoted |bi| (1 ≤ i ≤ p, with p the number of wavelengths), reflects the contribution of the ith wavelength to y. Thus, the larger |bi| is, the more important the ith variable can be considered to be. In other words, a larger |bi| reflects better linearity between absorbance and concentration at the ith wavelength; hence modeling with the wavelengths that have the larger |bi| should improve model performance.

For this reason, a stepwise variable elimination method based on regression coefficients is proposed and used in this study for key wavelength selection from NIR data. Stepwise elimination here means that a model is first built with all variables, the variables that contribute little to the model are then eliminated, and the model is rebuilt with the retained variables in the next step. The elimination is repeated for many steps until the optimum variable set is obtained. Because the choice of samples affects model robustness, the Monte Carlo sampling technique is adopted. We therefore call the method Monte Carlo sampling–recursive partial least squares (MCS-RPLS).

2. Methods

Fig. 1 shows the workflow of the MCS-RPLS method, which comprises three main steps: Monte Carlo sampling, feature selection for subsets and determination of the optimum feature set.

2.1. Monte Carlo sampling

In each sampling run, a PLS model is built using samples randomly selected from the raw dataset. In total Nm sampling runs are performed, and thus Nm models are built. In this study, 90% of the samples are selected for modeling in each sampling run. From the sampling point of view, this process can be regarded as sampling in the model space with a Monte Carlo strategy.
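As a minimal sketch of this sampling step, the following Python draws Nm random subsets of sample indices, each holding 90% of the samples. The function name `monte_carlo_subsets`, the fixed seed and the use of sampling without replacement are illustrative assumptions; the text only states that 90% of the samples are randomly selected in each run.

```python
import random

def monte_carlo_subsets(n_samples, n_runs, fraction=0.9, seed=0):
    """Return n_runs lists of sample indices, each a random draw of
    `fraction` of the samples (without replacement; an assumption)."""
    rng = random.Random(seed)
    size = int(round(fraction * n_samples))
    return [sorted(rng.sample(range(n_samples), size)) for _ in range(n_runs)]

# 80 corn samples and Nm = 100 sampling runs give 100 subsets of 72 samples.
subsets = monte_carlo_subsets(80, 100)
print(len(subsets), len(subsets[0]))
```

Each index list would then be used to slice the rows of X and y before the feature selection described in Section 2.2.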

2.2. Feature selection for subsets

Fig. 1. Workflow of MCS-RPLS.

For each data subset, a best variable subset is selected using the RPLS-based stepwise selection method. As a result, Nm best variable subsets are obtained from the Nm data subsets. The analysis procedure for each data subset is as follows:

(1) 80% of the samples in the subset are randomly selected to form the calibration set; the remaining samples form the test set.

(2) PLS modeling: the PLS model y = Xb is built on the calibration set and used for prediction on the test set, giving RMSEC and RMSEP.

(3) Stepwise variable elimination: according to the regression model, the variables with smaller |bi| are eliminated and only m variables are retained; the calibration set and test set are then reconstructed with the m retained variables. The procedure returns to step (2) until m reaches a threshold (the minimum number of variables finally retained; 2 in this study). The number of retained variables m in each step is determined by the EDF described below.

Variable elimination is a process of continuous refinement: most of the uninformative variables are eliminated first, and progressively fewer variables are discarded in later steps, so the number of variables eliminated per step diminishes steadily until the number of retained variables equals the threshold. This can be regarded approximately as an exponentially decreasing process. The number of variables retained in each elimination step is therefore determined by the exponentially decreasing function (EDF) introduced in reference [14], where the EDF is described in detail.

In the stepwise elimination, PLS modeling is applied in each step (i.e. the PLS function is called repeatedly) until the boundary condition is met: only two variables are retained. From this point of view, the procedure is a recursive computation, which is why we call it RPLS.

(4) For each subset, after N elimination steps, a set of RMSEC and RMSEP values is obtained, and the variable set with the minimum error is selected as the best variable subset. The Nm data subsets thus yield Nm best variable subsets.
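The elimination loop in steps (2) to (4) can be sketched as follows. The EDF form used here, r_i = a·exp(−k·i) with a = (p/2)^(1/(N−1)) and k = ln(p/2)/(N−1), is the one commonly given for CARS and is assumed rather than taken from this text; it retains all p variables at the first step and two at the last. The PLS fit itself is not implemented: `fit_and_score` is a hypothetical callable standing in for "build a PLS model on the calibration set and return the regression coefficients and the test-set error", and the toy version below exists only to make the sketch runnable.

```python
import math

def edf_schedule(p, n_steps, minimum=2):
    """Number of variables retained at each step (assumed CARS-style EDF)."""
    a = (p / minimum) ** (1.0 / (n_steps - 1))
    k = math.log(p / minimum) / (n_steps - 1)
    return [int(round(a * math.exp(-k * i) * p)) for i in range(1, n_steps + 1)]

def recursive_elimination(p, schedule, fit_and_score):
    """Refit on a shrinking variable set; return (best_variables, best_error).

    fit_and_score(active) is a hypothetical stand-in for PLS: it returns
    (coefficients aligned with `active`, prediction error).
    """
    active = list(range(p))
    coefs = None
    best_vars, best_err = None, float("inf")
    for m in schedule:
        if coefs is not None and m < len(active):
            # Keep the m variables with the largest |b_i| from the last fit.
            order = sorted(range(len(active)),
                           key=lambda j: abs(coefs[j]), reverse=True)
            active = sorted(active[j] for j in order[:m])
        coefs, err = fit_and_score(active)
        if err < best_err:
            best_vars, best_err = list(active), err
    return best_vars, best_err

# Toy stand-in (not PLS): fixed coefficients; the error counts retained
# noise variables and penalizes losing an informative one (indices 1, 3).
true_b = [0.0, 5.0, 0.1, 3.0, 0.2, 0.05]

def toy_fit(active):
    coefs = [true_b[i] for i in active]
    noise = sum(1 for i in active if abs(true_b[i]) < 1)
    lost = sum(1 for i in (1, 3) if i not in active)
    return coefs, 0.1 * noise + lost

schedule = edf_schedule(700, 100)   # starts at 700 variables, ends at 2
best_vars, best_err = recursive_elimination(6, [6, 4, 3, 2], toy_fit)
print(schedule[0], schedule[-1], best_vars, best_err)
```

Written as a loop rather than a literal recursive call, this still matches the described boundary condition: the last entry of the schedule is the threshold of two retained variables.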

2.3. Determination of optimum feature set

Statistical analysis is conducted on the Nm best variable subsets, and the optimum variable set is obtained from the frequency with which each variable is selected across all subsets. The variable with the highest selection frequency is first evaluated with Monte Carlo cross validation (MCCV); the variable with the next highest frequency is then merged into the variable set and the corresponding RMSECV is calculated, and so on, until all variables have been merged. The variable set with the minimum RMSECV is selected as the optimum variable set.
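The frequency count and the forward merging described above can be sketched in a few lines of Python. The names `frequency_ranking` and `forward_merge` are illustrative, ties in frequency are broken by variable index (an assumption), and `cv_error` is a hypothetical callable that would return the cross-validation error of a PLS model built on the given variables; a toy scorer is used here so the sketch runs.

```python
from collections import Counter

def frequency_ranking(best_subsets):
    """Rank variables by how many best subsets contain them."""
    counts = Counter(v for subset in best_subsets for v in set(subset))
    ranking = sorted(counts, key=lambda v: (-counts[v], v))
    return ranking, counts

def forward_merge(ranking, cv_error):
    """Add variables in ranked order; keep the set with the lowest error."""
    current, best_set, best_err = [], [], float("inf")
    for v in ranking:
        current.append(v)
        err = cv_error(current)
        if err < best_err:
            best_set, best_err = list(current), err
    return best_set, best_err

# Four hypothetical best subsets (variable indices), for illustration only.
best_subsets = [[1, 3], [1, 3, 4], [1, 2, 3], [1, 3]]
ranking, counts = frequency_ranking(best_subsets)

def toy_cv_error(variables):
    # Toy scorer (not a real RMSECV): variables 1 and 3 are informative.
    chosen = set(variables)
    return (2 - len(chosen & {1, 3})) + 0.1 * len(chosen - {1, 3})

optimum, err = forward_merge(ranking, toy_cv_error)
print(ranking, optimum)
```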

It should be pointed out that MCS-RPLS is similar to, but distinct from, the CARS (competitive adaptive reweighted sampling) method [14]. The similarities are that regression coefficients are used to measure the importance of variables, and that Monte Carlo sampling (MCS) and the EDF are used for data reconstruction. The differences are, first, that CARS applies a Monte Carlo strategy to both samples and variables, whereas MCS-RPLS applies it to samples only; and second, that after N sampling runs CARS obtains N variable subsets and chooses the subset with the lowest RMSECV as the optimum. In MCS-RPLS, by contrast, once the N variable subsets are obtained, all variables in the N subsets are pooled, sorted by selection frequency, and selected stepwise by forward selection with 10-fold cross validation. MCS-RPLS can thus be regarded as an ensemble of CARS.

3. Datasets

3.1. Corn dataset

This benchmark data set consists of NIR spectra of 80 corn samples measured on different types of NIR spectrometer. Each spectrum contains 700 data points measured in the wavelength range 1100–2498 nm at 2 nm intervals. The dataset also gives the moisture, oil, protein and starch contents of the samples. To investigate the performance of MCS-RPLS, the NIR spectra of the 80 corn samples measured on the m5 instrument are used as X, and the moisture and protein values as the dependent variable y. The dataset can be freely downloaded from http://www.eigenvector.com/data/index.htm. The original spectra of the 80 samples are shown in Fig. 2a.

3.2. Biological solutions dataset

This data set consists of 125 NIR spectra of biological solutions measured in the 7758.9–5382.8 cm−1 region with a Bruker Vector 22/N FT-NIR spectrometer (Bruker Optics Inc.) equipped with an InGaAs detector. The spectral resolution was 4 cm−1 and each spectrum contains 1234 variables. The sample temperature was kept at 37 ± 0.2 °C. The 125 solution samples were prepared by dissolving appropriate amounts of human serum albumin (HSA), γ-globulin and glucose in a 0.1 M phosphate buffer solution (pH 7.0). Five concentration levels of HSA (0.00–6.00 g/dL), γ-globulin (0.00–4.00 g/dL) and glucose (0.00–2.00 g/dL) were used in the experimental design. Details of this data set can be found in references [22,23].

The concentrations of γ-globulin and HSA were used for analysis inthis study. The original spectra of the 125 samples are shown in Fig. 2b.

4. Results and discussions

4.1. Influence of number of Monte Carlo sampling runs

Fig. 2. Original spectra of corn samples (a) and biological solutions (b).

To investigate the influence of the number of Monte Carlo sampling runs on the performance of the method, each of the four datasets described above was sampled 50, 100, 200 and 500 times, and the iteration number of the EDF was set to 100. The distributions of RMSECV values for each dataset are shown as box plots in Fig. 3, which present the maximum, minimum, median, upper quartile and lower quartile. The more concentrated the RMSECV distribution, the more stable the result. Fig. 3 shows that the RMSECV distributions at different numbers of sampling runs are generally consistent, which indicates that the number of Monte Carlo sampling runs does not significantly influence the performance of MCS-RPLS. For the HSA data, although the RMSECV distribution at 50 sampling runs is somewhat lower than the others, it contains several outliers, whereas there are no outliers for 100 and 200 sampling runs. Considering computational efficiency and the stability of the results, the number of sampling runs is set to 100 by default in the following sections.

4.2. Variable stepwise elimination based on EDF

The EDF was used for stepwise variable elimination. In the recursive procedure, the data dimension was reduced in an exponentially decreasing manner. As illustrated in reference [14], the wavelength reduction process can be roughly divided into two stages: ‘fast selection’ and ‘refined selection’. Wavelengths carrying little or no information can therefore be removed from the full spectrum in a stepwise and efficient way.

For the number of iterations (i.e. the number of steps in which the variables are reduced from the original number to the minimum), we took 100 as the default for two reasons. On the one hand, the decreasing trend of the number of variables is consistent across different numbers of recursions (i.e. it decreases exponentially at 50, 100, 200 and 500 recursions), which means that the number of recursions does not significantly influence the way the variables decrease. On the other hand, the number of original variables in the present study is not very large (700 and 1234, respectively), so too many recursions would lead to an excessively refined selection procedure and increase the computational cost. Conversely, too few recursions may lose useful information because the dimensionality is reduced too coarsely.

4.3. Corn moisture data analysis

One hundred data subsets were constructed from the corn moisture data with MCS, and MCS-RPLS was used to analyze them. The best variable subset was selected from each data subset. Fig. 4 shows the selection frequency of each variable over the 100 subsets. The two wavelength bands around 1908 nm and 2108 nm are selected in all 100 computations without exception; moreover, the selection frequencies of 1908 nm and 2108 nm are clearly higher than those of the other wavelengths. As a result, 1908 nm and 2108 nm are selected as the optimum variables for the corn moisture data, in accordance with the results of CARS and Monte Carlo-based Uninformative Variable Elimination (MC-UVE) [24,25] (MC-UVE selected two wavelength intervals around 1908 nm and 2108 nm). The main reason may be that CARS, MC-UVE and MCS-RPLS are all based on regression coefficients, and the variables 1908 nm and 2108 nm are highly significant in terms of their regression coefficients.

The optimum variables selected by MCS-RPLS were used for 10-fold cross validation on the dataset, and the results were compared with those obtained by full-range PLS, MC-UVE-PLS, moving window partial least squares (MWPLS) [26] and CARS-PLS (Table 1).
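For readers unfamiliar with the indicator, the sketch below shows one common way to compute an RMSECV by k-fold cross validation: the squared errors of all held-out predictions are pooled and the root of their mean is taken. This definition, the interleaved fold assignment and all function names are assumptions, since the text does not spell them out. A univariate least-squares line is swapped in for PLS purely to keep the example self-contained.

```python
import math

def kfold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds (no shuffling)."""
    return [list(range(start, n, k)) for start in range(k)]

def rmsecv(x, y, k, fit, predict):
    """Root mean squared error of held-out predictions pooled over k folds."""
    sq_err = 0.0
    for test in kfold_indices(len(y), k):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        sq_err += sum((predict(model, x[i]) - y[i]) ** 2 for i in test)
    return math.sqrt(sq_err / len(y))

# Stand-in model: a univariate least-squares line (NOT PLS).
def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def predict_line(model, x_new):
    slope, intercept = model
    return slope * x_new + intercept

# Noise-free synthetic data, so the 10-fold RMSECV is essentially zero.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
score = rmsecv(x, y, 10, fit_line, predict_line)
print(score)
```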

Fig. 5 shows the variables selected by MC-UVE, MWPLS and MCS-RPLS from the corn moisture dataset. The variables selected by MC-UVE-PLS and MCS-RPLS are largely the same (i.e. around 1908 nm and 2108 nm); the only difference is the width of the wavelength intervals. MCS-RPLS selects 1908 nm and 2108 nm as the optimum variables, while MC-UVE-PLS picks out two intervals: the band 1894–1922 nm corresponds to water absorption [27] and the band 2098–2122 nm corresponds to the O–H combination band [28]. The selected variables are thus chemically meaningful.

Fig. 3. Box plots of RMSECV for each data set at 50, 100, 200 and 500 Monte Carlo sampling runs. (a) Corn moisture data. (b) Corn protein data. (c) HSA data. (d) γ-Globulin data.

Fig. 6 shows how the RMSECV values of 10-fold cross validation (Fig. 6a) and the regression coefficients of each wavelength (Fig. 6b) change as the number of recursive modeling runs increases. As Fig. 6a shows, the RMSECV decreases quickly in the initial stage, mainly because most of the irrelevant variables are eliminated according to the EDF in this stage, which significantly improves model performance; this can be regarded as the fast selection stage. In the later stage the decrease in RMSECV levels off, which can be regarded as the refined selection stage.

Fig. 4. Selection frequency of each wavelength in 100 MCS runs on the corn moisture data.

Fig. 6b shows the regression coefficient path of each wavelength during the recursions, with the number of sampling runs set to 100. Each line represents the regression coefficient of one wavelength, and the x axis represents the number of iterations. With each additional iteration, the regression coefficients of a few variables drop to zero, and these variables are eliminated in that iteration. In other words, in the recursive modeling process the variables with nonzero regression coefficients are the ones retained. Read together with Fig. 6a, the variables retained at the recursion step with the minimum RMSECV are the best variables. For this dataset, the RMSECV reaches its minimum when two variables are retained, 1908 nm and 2108 nm, which are therefore the optimum variables for this dataset.

4.4. Corn protein data analysis

Fig. 7 shows the selection frequency of each variable in the 100 MCS runs on the corn protein dataset. The wavelength intervals around 1180, 1980, 2050 and 2160 nm are clearly selected more frequently, while the great majority of wavelength ranges are not selected in any of the data subsets.

The trends of the RMSECV values of 10-fold CV and of the regression coefficient paths on the corn protein data are similar to those in Fig. 6: the RMSECV decreases quickly in the initial stage while the regression coefficients of some variables become zero, and the decrease levels off as the data dimensionality is reduced. Unlike in Fig. 6, however, the RMSECV on the corn protein data rises suddenly after it has decreased to a certain extent. The primary cause is as follows. For the corn moisture data, the number of optimum variables is two (i.e. the boundary condition of the iteration), so the RMSECV does not rise, because no useful variables are lost during the whole procedure (Fig. 6a). For the corn protein data, in contrast, the minimum RMSECV corresponds to an optimum variable set with more than two variables, so useful variables are eliminated in the last stage of the recursion, which makes the RMSECV increase sharply.

The optimum variables selected by MCS-RPLS were used for 10-fold cross validation, and the results were compared with those obtained by full-range PLS, MC-UVE-PLS, MWPLS and CARS-PLS. As shown in Table 1, different methods chose different variables. Some of the selected variables are consistent across methods; for instance, the wavebands around 1200 and 1980 nm are selected as optimum variables by the four methods. Others differ: the regions around 1760 and 2180 nm are selected by MWPLS and MC-UVE, while the region around 2050 nm is selected by MWPLS and MCS-RPLS. The main reason for this inconsistency is that the selected wavebands cover a relatively wide range of wavelengths (i.e. 1100–2498 nm), which reflects the complicated structural characteristics of proteins, such as the different vibration modes of the C–H, O–H and N–H bonds (stretching or bending vibrations), the complexity of the microenvironment of these bonds, and their interactions [14], while different methods have different emphases.

Table 1
Results on four datasets for different methods.

Dataset        Indicator  Full range-PLS  MC-UVE-PLS  MWPLS    CARS-PLS  MCS-RPLS
Corn moisture  RMSECV     0.01822         0.005723    0.03826  0.0006    0.0002988
               NLvs       10              4           10       2         2
               Nvars      700             28          119      2         2
Corn protein   RMSECV     0.1462          0.1214      0.1325   0.1067    0.06863
               NLvs       10              8           9        8         10
               Nvars      700             175         106      19        76
HSA            RMSECV     0.08506         0.05873     0.06313  0.06208   0.05280
               NLvs       5               5           5        5         5
               Nvars      1234            95          59       45        37
γ-Globulin     RMSECV     0.1218          0.05378     0.09490  0.06059   0.05201
               NLvs       5               5           5        5         5
               Nvars      1234            60          48       33        51

4.5. HSA and γ-globulin data analysis

The results on the HSA and γ-globulin data obtained with full-range PLS, MC-UVE-PLS, MWPLS, CARS-PLS and MCS-RPLS can be found in Table 1, and the variables selected by the different methods are shown in Fig. 8. Fig. 8 shows that most of the selected variables are consistent across the two datasets (i.e. the interval around 5750 cm−1).

To further illustrate the analysis process, MWPLS and MCS-RPLS are taken as examples, and the details of the analyses on the HSA and γ-globulin datasets are discussed below.

4.5.1. Analysis results of MWPLS

The HSA and γ-globulin data were analyzed by MWPLS with a window size of 15, and the residuals are shown in Fig. 9a and c. Fig. 9a shows that the residuals of four wavebands are clearly lower, so these four wavebands are selected as feature variables for the HSA data. To investigate the modeling performance of different combinations of the four wavebands, all of their combinations were used for 10-fold CV; the results are shown in Table 2. The RMSECV values clearly differ between combinations, so the combination with the minimum RMSECV is selected as the optimum variables (i.e. 5853–5816.4 cm−1, 5795.2–5743.2 cm−1 and 5700.8–5681.5 cm−1, 59 variables in total). The RMSECV of 10-fold CV with 5 latent variables is 0.06313, which is lower than that obtained with the full spectrum (0.08506).

Fig. 5. The wavelengths selected from the corn moisture data with MCS-RPLS, UVE and MWPLS, respectively.

For the γ-globulin data (Fig. 9c), the four wavebands with the lowest residuals were selected as feature variables: 6539–6498.6 cm−1, 6381–6336.7 cm−1, 6024.5–5970.5 cm−1 and 5824–5775.9 cm−1. All of their combinations were investigated with 10-fold CV (Table 2). As a result, the wavebands 6539–6498.6 cm−1 and 5824–5775.9 cm−1 (48 variables in total) were taken as the optimum variables; the RMSECV of 10-fold CV with five latent variables is 0.0949, which is lower than that obtained with the full spectrum (0.1218).
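The exhaustive search over waveband combinations is small enough to enumerate directly. The sketch below lists every non-empty combination of the four γ-globulin wavebands quoted above; treating "all combinations" as all non-empty subsets is an assumption, as are the function names. `cv_error` would be a hypothetical callable returning the 10-fold RMSECV of a PLS model built on the chosen bands, and it is left to the caller rather than invented here.

```python
from itertools import combinations

# The four wavebands (cm^-1) selected for the gamma-globulin data.
bands = [(6539.0, 6498.6), (6381.0, 6336.7), (6024.5, 5970.5), (5824.0, 5775.9)]

def band_combinations(bands):
    """Every non-empty combination of the candidate wavebands."""
    combos = []
    for r in range(1, len(bands) + 1):
        combos.extend(combinations(bands, r))
    return combos

def best_combination(combos, cv_error):
    """Return the combination with the lowest cross-validation error."""
    return min(combos, key=cv_error)

combos = band_combinations(bands)
print(len(combos))  # 2**4 - 1 = 15 candidate combinations
```

With only 15 candidates, evaluating each one by cross validation is cheap, which is presumably why an exhaustive comparison is practical at this stage.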

4.5.2. Analysis results of MCS-RPLS on HSA and γ-globulin data

Fig. 9b shows the selection frequency of each variable in the 100 MCS runs on the HSA data. As a result, 37 wavenumbers are selected as optimum variables (7338.8 cm−1, 6307.8 cm−1, 5849.1–5826 cm−1 and 5752.8–5712.3 cm−1). The RMSECV of 10-fold CV with five latent variables is 0.05280, which is lower than that obtained by MWPLS.

The shaded areas 1 and 2 in Fig. 9a and b mark the wavenumber bands around 5850–5820 cm−1 and 5760–5740 cm−1, which are selected consistently by both MWPLS and MCS-RPLS. In addition, 7338.8 cm−1 and 6307.8 cm−1 are selected as features by MCS-RPLS only. To investigate the usefulness of these two variables, 10-fold CV on the HSA data was compared for the 37 variables and for 35 variables (i.e. with 7338.8 cm−1 and 6307.8 cm−1 removed). The RMSECV for the 35 variables is 0.1106, which is distinctly higher than that for the 37 variables (0.05280). Therefore, 7338.8 cm−1 and 6307.8 cm−1 are considered significant for the prediction of HSA, and may reflect some information about the structural characteristics of HSA.

Fig. 6. Trends of the RMSECV values of 10-fold CV (a) and the regression coefficient path of each variable (b) with increasing number of recursive modeling runs on the corn moisture data.

Fig. 7. Selection frequency of each wavelength in 100 MCS runs on the corn protein data.

For the γ-globulin dataset, 51 variables (6685.5 cm−1, 6612.3–6598.8 cm−1, 5851.1–5826 cm−1, 5812.5–5797.1 cm−1, 5752.8–5743.2 cm−1, 5733.5–5729.7 cm−1 and 5716.2–5698.8 cm−1) are selected by MCS-RPLS, and the RMSECV of 10-fold CV with the 51 variables is 0.05201, which is clearly lower than that obtained by the other methods (Table 1).

Fig. 8. The variables selected by different methods on two datasets: (a) HSA dataset, (b) γ-globulin dataset.

Fig. 9. Residual plots (a for HSA data and c for γ-globulin data) and selected frequency plots (b for HSA data and d for γ-globulin data).

The wavenumber bands selected by MWPLS and MCS-RPLS do not completely coincide. It can be seen from Fig. 9c and d that 5812.5–5797.1 cm−1 and 5752.8–5743.2 cm−1, which were selected by MCS-RPLS, partly coincide with 5824–5775.9 cm−1 selected by MWPLS. Moreover, the wavenumber regions with lower residuals agree comparatively well with those frequently selected by MCS-RPLS, which suggests that the most abundant information about γ-globulin concentration is contained in these regions. For the same data, different variable selection methods may select different variables because they use different strategies. For this reason, several methods need to be investigated and evaluated for feature selection in practical applications. It should be pointed out that the result of MCS-RPLS is compared with that of MWPLS in the present study. The aim of this study is not to


Table 2
Results of 10-fold cross validation using all combinations of band 1–band 4.
(RMSECV: 10-fold CV RMSECV; Vars: number of variables; LVs: number of latent variables.)

                   HSA dataset                 γ-Globulin dataset
Wavebands          RMSECV    Vars   LVs        RMSECV    Vars   LVs
1                  0.4528    14     7          0.3576    22     9
2                  0.4975    20     3          0.5298    24     9
3                  0.3459    28     3          0.4055    29     5
4                  0.7058    11     6          0.1628    26     5
1 + 2              0.3910    34     5          0.2468    46     9
1 + 3              0.1738    42     3          0.2829    51     8
1 + 4              0.2022    25     4          0.09490   48     5
2 + 3              0.1583    48     6          0.2943    53     8
2 + 4              0.1397    31     5          0.1360    50     4
3 + 4              0.1197    39     3          0.1241    55     8
1 + 2 + 3          0.1081    62     5          0.2357    75     10
1 + 2 + 4          0.1875    45     5          0.1385    70     4
1 + 3 + 4          0.08915   53     5          0.1097    77     6
2 + 3 + 4          0.06313   59     5          0.1343    79     4
1 + 2 + 3 + 4      0.07704   73     5          0.1175    101    5

The wavebands are denoted as follows:
For the HSA dataset: 1: 5995.6–5970.5 cm−1; 2: 5853–5816.4 cm−1; 3: 5795.2–5743.2 cm−1; 4: 5700.8–5681.5 cm−1.
For the γ-globulin dataset: 1: 6539–6498.6 cm−1; 2: 6381–6336.7 cm−1; 3: 6024.5–5970.5 cm−1; 4: 5824–5775.9 cm−1.


demonstrate which method is superior, but to show that MCS-RPLS can also be used for feature selection.

5. Conclusions

This paper proposes the MCS-RPLS method for key wavelength selection from NIR spectra. First, a number of sub-datasets are created using the Monte Carlo sampling technique; then PLS models are built repeatedly on each subset and a feature subset is selected on each, taking the regression coefficient as the criterion; finally, the optimum feature set is determined through statistical analysis of the feature subsets. The corn moisture and protein contents and the HSA and γ-globulin in biological samples were analyzed with the proposed method, and the results were compared with those obtained by MWPLS, MC-UVE and CARS. The results showed that the data dimensionalities and the RMSECV values of the selected variables are greatly reduced; thus MCS-RPLS can select variables efficiently from NIR data. In addition, the robustness of the proposed method can be enhanced by the Monte Carlo strategy.
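The three steps summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the PLS1 model is a plain NIPALS routine, the recursive step simply drops the variables with the smallest absolute regression coefficients until a fixed number remain, and the parameters `n_keep`, `drop_frac` and `ratio`, as well as the synthetic data, are hypothetical choices.

```python
import numpy as np


def pls1_coef(X, y, n_comp):
    """Regression coefficients of a PLS1 model (NIPALS) on centered data."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, q = [], [], []
    for _ in range(min(n_comp, X.shape[1])):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:          # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                # scores
        tt = t @ t
        p = Xk.T @ t / tt         # X loadings
        c = (yk @ t) / tt         # y loading
        Xk = Xk - np.outer(t, p)  # deflation
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def recursive_select(X, y, n_keep=5, n_comp=3, drop_frac=0.2):
    """Assumed recursive step: refit PLS and drop the smallest |coefficients|."""
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        b = pls1_coef(X[:, active], y, n_comp)
        n_drop = max(1, int(drop_frac * active.size))
        n_drop = min(n_drop, active.size - n_keep)
        order = np.argsort(np.abs(b))  # ascending by |coefficient|
        active = np.sort(active[order[n_drop:]])
    return active


def selection_frequency(X, y, n_runs=100, ratio=0.8, seed=0, **kwargs):
    """Monte Carlo sampling: fraction of runs in which each variable is kept."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(ratio * n), replace=False)
        counts[recursive_select(X[idx], y[idx], **kwargs)] += 1
    return counts / n_runs


# Synthetic data (assumption): only variables 4 and 17 are informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))
y = 3.0 * X[:, 4] - 3.0 * X[:, 17] + 0.1 * rng.normal(size=80)

freq = selection_frequency(X, y, n_runs=50)
print(np.argsort(freq)[::-1][:5])  # most frequently selected variables
```

The final feature set would then be chosen from `freq`, for example by keeping variables whose selection frequency exceeds a chosen threshold; the threshold itself is a design choice that this sketch leaves open.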

In this study, we used MCS-RPLS for key wavelength selection and multivariate calibration on NIR datasets; it can also be extended to other aims such as pattern recognition and classification. For example, partial least squares discriminant analysis coupled with the Monte Carlo sampling technique can be used for potential biomarker discovery from proteomic and genomic data.

Acknowledgment

This work was supported by Qinghai Provincial Natural Science Fund (No. 2012-Z-937Q).

References

[1] A. Murugesan, C. Umarani, T. Chinnusamy, M. Krishnan, R. Subramanian, N.Neduzchezhain, Production and analysis of bio-diesel from non-edible oils — areview, Renewable & Sustainable Energy Reviews 13 (2009) 825–834.

[2] L. Meher, D. Vidya Sagar, S. Naik, Technical aspects of biodiesel production bytransesterification — a review, Renewable & Sustainable Energy Reviews 10 (2006)248–268.

[3] C. Gendrin, Y. Roggo, C. Spiegel, C. Collet, Monitoring galenical process developmentbynear infrared chemical imaging: one case study, European Journal of Pharmaceuticsand Biopharmaceutics 68 (2008) 828–837.

[4] Y. Roggo, P. Chalus, L. Maurer, C. Lema-Martinez, A. Edmond, N. Jent, A review ofnear infrared spectroscopy and chemometrics in pharmaceutical technologies,Journal of Pharmaceutical and Biomedical Analysis 44 (2007) 683–700.

[5] K.D. Shepherd, M.G. Walsh, Infrared spectroscopy — enabling an evidence-baseddiagnostic surveillance approach to agricultural and environmental managementin developing countries, Journal of Near Infrared Spectroscopy 15 (2007) 1–19.

[6] J. Nyström, E. Dahlquist, Methods for determination of moisture content inwoodchips for power plants — a review, Fuel 83 (2004) 773–779.

[7] C.E.Miller, Chemical principles of near infrared technology, Near infrared technologyin the agricultural and food industries, 2001, pp. 19–37.

[8] C. Connolly, NIR spectroscopy for foodstuff monitoring, Sensor Review 25 (2005)192–194.

[9] G. Moreda, J. Ortiz-Ca avate, F.J. Garcra-Ramos, M. Ruiz-Altisent, Non-destructivetechnologies for fruit and vegetable size determination—a review, Journal of FoodEngineering 92 (2009) 119–136.

[10] R. Karoui, J. De Baerdemaeker, A review of the analytical methods coupled withchemometric tools for thedetermination of thequality and identity of dairy products,Food Chemistry 102 (2007) 621–640.

[11] S. Landau, T. Glasser, L. Dvash, Monitoring nutrition in small ruminants with theaid of near infrared reflectance spectroscopy (NIRS) technology: a review, SmallRuminant Research 61 (2006) 1–11.

[12] M. Sjostrom, S.Wold,W. Lindberg, J.-A. Persson, H.Martens, Amultivariate calibrationproblem in analytical chemistry solved by partial least-squares models in latentvariables, Analytica Chimica Acta 150 (1983) 61–70.

[13] P.J. Gemperline, A. Salt, Principal components regression for routine multicomponentUVdeterminations: a validation protocol, Journal of Chemometrics 3 (1989) 343–357.

[14] H. Li, Y. Liang, Q. Xu, D. Cao, Key wavelengths screening using competitive adaptivereweighted sampling method for multivariate calibration, Analytica Chimica Acta648 (2009) 77–84.

[15] H. Xu, B. Qi, T. Sun, X. Fu, Y. Ying, Variable selection in visible and near-infraredspectra: application to on-line determination of sugar content in pears, Journal ofFood Engineering 109 (2012) 142–147.

[16] Y. Qin, X. Ding, H. Gong, Application of high dimensional feature selection in nearinfrared spectroscopy of cigarettes qualitative evaluation, Spectroscopy Letters 46(2013) 397–402.

[17] K. Zheng, Q. Li, J. Wang, J. Geng, P. Cao, T. Sui, X. Wang, Y. Du, Stability competitiveadaptive reweighted sampling (SCARS) and its applications tomultivariate calibrationof NIR spectra, Chemometrics and Intelligent Laboratory Systems 112 (2012) 48–54.

[18] X. Shao, G. Du, M. Jing, W. Cai, Application of latent projective graph in variableselection for near infrared spectral analysis, Chemometrics and Intelligent LaboratorySystems 114 (2012) 44–49.

[19] R.M. Balabin, S.V. Smirnov, Variable selection in near-infrared spectroscopy:benchmarking of feature selection methods on biodiesel data, Analytica ChimicaActa 692 (2011) 63–72.

[20] N. Boaz, R.C. Ronald, Theprediction error in CLS andPLS: the importance of feature se-lection prior tomultivariate calibration, Journal of Chemometrics 19 (2005) 107–118.

[21] X. Zou, J. Zhao, M.J.W. Povey, M. Holmes, H. Mao, Variable selection methods innear-infrared spectroscopy, Analytica Chimica Acta 667 (2010) 14–32.

[22] S. Kasemsumran, Y. Du, K. Murayama, M. Huehne, Y. Ozaki, Simultaneous determi-nation of human serum albumin, γ-globulin, and glucose in a phosphate buffersolution by near-infrared spectroscopy with moving window partial least-squaresregression, Analyst 128 (2003) 1471–1477.

[23] Y.P. Du, S. Kasemsumran, K. Maruo, T. Nakagawa, Y. Ozaki, Ascertainment of thenumber of samples in the validation set in Monte Carlo cross validation and theselection of model dimension with Monte Carlo cross validation, Chemometricsand Intelligent Laboratory Systems 82 (2006) 83–89.

[24] W. Cai, Y. Li, X. Shao, A variable selection method based on uninformative variableelimination for multivariate calibration of near-infrared spectra, Chemometricsand Intelligent Laboratory 90 (2008) 188.

Page 8: Key wavelengths selection from near infrared spectra using Monte Carlo sampling–recursive partial least squares

24 M. Zhang et al. / Chemometrics and Intelligent Laboratory Systems 128 (2013) 17–24

[25] Q.J. Han, H.L. Wu, C.B. Cai, L. Xu, R.Q. Yu, An ensemble of Monte Carlo uninformativevariable elimination for wavelength selection, Analytica Chimica Acta 612 (2008)121.

[26] J.H. Jiang, R.J. Berry, H.W. Siesler, Y. Ozaki, Wavelength interval selection in multi-component spectral analysis by moving window partial least-squares regressionwith applications to mid-infrared and near-infrared spectroscopic data, AnalyticalChemistry 74 (2002) 3555.

[27] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. De Noord, Genetic algorithms as atool for wavelength selection in multivariate calibration, Analytical Chemistry 67(1995) 4295–4301.

[28] J.H. Jiang, R.J. Berry, H.W. Siesler, Y. Ozaki, Wavelength interval selection in multi-component spectral analysis by moving window partial least-squares regressionwith applications to mid-infrared and near-infrared spectroscopic data, AnalyticalChemistry 74 (2002) 3555–3565.