Analytica Chimica Acta 635 (2009) 45–52

journal homepage: www.elsevier.com/locate/aca

Variable selection in visible/near infrared spectra for linear and nonlinear calibrations: A case study to determine soluble solids content of beer

Fei Liu, Yihong Jiang ∗, Yong He ∗

College of Biosystems Engineering and Food Science, Zhejiang University, 268 Kaixuan Road, Hangzhou, Zhejiang 310029, China

∗ Corresponding authors. Tel.: +86 571 86971143; fax: +86 571 86971143. E-mail addresses: [email protected] (Y.H. Jiang), [email protected] (Y. He).

0003-2670/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2009.01.017

article info

Article history: Received 9 September 2008; Received in revised form 28 December 2008; Accepted 12 January 2009; Available online 17 January 2009

Keywords: Visible/near infrared spectroscopy; Successive projections algorithm; Independent component analysis; Variable selection; Least squares-support vector machine; Beer

abstract

Three effective wavelength (EW) selection methods combined with visible/near infrared (Vis/NIR) spectroscopy were investigated to determine the soluble solids content (SSC) of beer: successive projections algorithm (SPA), regression coefficient analysis (RCA) and independent component analysis (ICA). A total of 360 samples were prepared for the calibration (n = 180), validation (n = 90) and prediction (n = 90) sets. The performance of different preprocessing methods was compared. Three calibrations using EWs selected by SPA, RCA and ICA were developed, including linear regression by partial least squares analysis (PLS) and multiple linear regression (MLR), and nonlinear regression by least squares-support vector machine (LS-SVM). Ten EWs selected by SPA achieved the optimal linear SPA-MLR model compared with SPA-PLS, RCA-MLR, RCA-PLS, ICA-MLR and ICA-PLS. The correlation coefficient (r) and root mean square error of prediction (RMSEP) by SPA-MLR were 0.9762 and 0.1808, respectively. Moreover, the newly proposed SPA-LS-SVM model obtained almost the same excellent performance as the RCA-LS-SVM and ICA-LS-SVM models, with an r value and RMSEP of 0.9818 and 0.1628, respectively. The nonlinear model SPA-LS-SVM outperformed the SPA-MLR model. The overall results indicated that SPA was a powerful way to select EWs, and that Vis/NIR spectroscopy combined with SPA-LS-SVM was successful for the accurate determination of SSC of beer.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Recently, variable selection or uninformative variable elimination has attracted more and more attention for the development of multicomponent calibrations using spectroscopic techniques. Some preprocessing methods aimed at reducing noise and correcting light pathlength and baseline shift have been applied to the ordinary full-spectrum multivariate calibration. The commonly used preprocessing methods include multiplicative scatter correction (MSC) [1], Savitzky–Golay smoothing (SG) [2], standard normal variate (SNV) [3], the first and second derivative (1-Der and 2-Der), and direct orthogonal signal correction (DOSC) [4]. However, the selection of variables or elimination of uninformative variables is still necessary to obtain a parsimonious model using relevant spectral variables with the least collinearity, redundancy and noise.

The recently developed methods for variable selection include generalized simulated annealing (SA) [5], genetic algorithm (GA) [6], correlation coefficients and B-matrix coefficients [7], x-loading weights [8,9], uninformative variable elimination (UVE) [10], regression coefficient analysis (RCA) [11–13], independent component analysis (ICA) [12,14,15], modeling power [12,16] and successive projections algorithm (SPA) [17,18]. Among these methods, SPA employs simple projection operations for variable selection with a minimum of collinearity and redundancy. Normally, SPA was combined with linear calibration methods, such as multiple linear regression (MLR) and partial least squares (PLS) analysis, as in previous studies [17,18]. This kind of combination was helpful for the interpretation of the developed models, such as SPA-MLR. However, the accuracy and prediction precision of the model would be impaired to some extent without considering the latent nonlinear information in the spectral data, although the SPA-MLR model performed as well as the full-spectrum PLS model in some case studies [18,19]. Therefore, a new combination of SPA with least squares-support vector machine (LS-SVM) was proposed as a nonlinear calibration model for quantitative analysis using spectroscopic techniques.

LS-SVM could handle the linear and nonlinear relationships between the spectra and response chemical constituents [20,21]. The performance of SPA-LS-SVM was evidenced by a case study to determine the soluble solids content (SSC) of beer. Near infrared spectroscopy had been applied to beer, for example for the determination of original and real extract [22,23], alcohol content [22–25], sugar content and pH [24,26], and fermentation monitoring [25]. In these applications, PLS was the most used regression method and there was no variable selection procedure except the covariance procedure and GA by McLeod et al. [25]. Hence, it was also necessary to develop a fast and accurate nonlinear model using fewer selected variables for the determination of quality parameters and fermentation monitoring of beer.

The objective of this paper is to study the performance of SPA and the newly proposed SPA-LS-SVM model by comparing different preprocessing methods (SG, SNV, 1-Der and 2-Der), variable selection methods (SPA, RCA and ICA), and linear regression methods (MLR and PLS).

2. Materials and methods

2.1. Sample preparation

Six commonly consumed varieties of beer were obtained from a supermarket: Budweiser, Heineken, Tsingtao, Yanjing, Snow and Siwo. A total of 360 samples were prepared, with 60 samples for each variety. The samples were randomly separated into calibration, validation and prediction sets. The calibration set consisted of 180 samples (30 samples for each brand), the validation set was composed of 90 samples (15 samples for each brand), and the remaining 90 samples were used in the prediction set. The validation set was only used to validate the calibration model and make sure that a stable model was achieved. The prediction set was used to evaluate the prediction performance of the developed models. Each sample was used in only one data set.

The reference value of SSC was measured by an Abbe benchtop refractometer (Model: WAY-2S, Shanghai Precision & Scientific Instrument Co. Ltd., Shanghai, China). The refractive index accuracy is ±0.0002 and the °Brix (%) range is 0–95% with temperature correction.
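The per-brand random split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_by_brand`, the sample identifiers and the random seed are hypothetical, and the paper does not state which tool was used for the randomization.

```python
import random

def split_by_brand(sample_ids_by_brand, n_cal=30, n_val=15, seed=0):
    """Randomly assign each brand's samples to calibration, validation and prediction sets."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    cal, val, pred = [], [], []
    for ids in sample_ids_by_brand.values():
        shuffled = list(ids)
        rng.shuffle(shuffled)
        cal += shuffled[:n_cal]                 # 30 per brand
        val += shuffled[n_cal:n_cal + n_val]    # 15 per brand
        pred += shuffled[n_cal + n_val:]        # remaining 15 per brand
    return cal, val, pred

# Hypothetical sample identifiers: 6 brands x 60 samples each.
brands = ["Budweiser", "Heineken", "Tsingtao", "Yanjing", "Snow", "Siwo"]
samples = {b: [f"{b}-{i:02d}" for i in range(60)] for b in brands}
cal, val, pred = split_by_brand(samples)
print(len(cal), len(val), len(pred))  # 180 90 90
```

Splitting within each brand keeps the 30/15/15 proportion for every variety, so no brand is missing from any of the three sets.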

2.2. Spectral collection and preprocessing

A handheld FieldSpec Pro FR (325–1075 nm)/A110070 spectroradiometer, a trademark of Analytical Spectral Devices, Inc. (Analytical Spectral Devices, BO, USA), was used for spectral scanning. The field of view (FOV) of the spectroradiometer is 25°. A Lowell pro-lam interior light source assemble/128930 with a Lowell pro-lam 14.5 V Bulb/128690 tungsten halogen bulb was used as the light source, which could be used in both the visible and near infrared regions (325–1075 nm). The energy of the light source was adjusted according to the standard curve of the spectroradiometer. The transmission mode was applied for this experiment. Each beer sample was placed in a cuvette with a 2 mm light path length. The transmission spectra were measured from 325 to 1075 nm with an average reading of 30 scans for each spectrum. For each sample, three replicate spectra were collected and the averaged spectrum of these three replicates was used as the data of this sample. All spectral data were stored in a personal computer for later analysis and processed using the RS3 software for Windows (Analytical Spectral Devices, BO, USA), which is designed with a graphical user interface.

Using the RS3 software, the transmission spectra were transformed into absorbance spectra by log(1/T) (T = transmittance). The different preprocessing methods were implemented in “The Unscrambler® 9.6” (CAMO AS, Oslo, Norway) to study the influences of SG, SNV, first derivative (1-Der) and second derivative (2-Der). SG smoothing could be applied to reduce the noise [2]. Standard normal variate (SNV) could be applied for light scatter correction and for reducing the changes of light pathlength [3]. The derivative treatment could remove the influence of baseline variation, at the cost of moderately amplifying the noise of a variable [27].
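The absorbance conversion and two of the pretreatments named above can be sketched in a few lines of NumPy. This is a hedged illustration, not the Unscrambler implementation: the function names are hypothetical, a base-10 logarithm is assumed for log(1/T), and a plain finite difference stands in for whatever derivative algorithm the software used.

```python
import numpy as np

def absorbance(transmittance):
    """Convert transmittance T (0 < T <= 1) to absorbance log10(1/T); base 10 is assumed."""
    return np.log10(1.0 / np.asarray(transmittance, dtype=float))

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) by its own mean and std."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def first_derivative(spectra, step_nm=1.0):
    """Finite-difference first derivative along the wavelength axis (columns)."""
    return np.gradient(np.asarray(spectra, dtype=float), step_nm, axis=1)
```

Because SNV works row by row, each spectrum is corrected using only its own values, which is why it can compensate for sample-to-sample scatter and pathlength differences.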

2.3. Selection of effective wavelengths

In this paper, three effective wavelength selection methods were investigated for relevant variable selection: successive projections algorithm, regression coefficient analysis and independent component analysis.

2.3.1. Successive projections algorithm

SPA is a forward variable selection algorithm that applies vector projection operations in a vector space to select relevant variables with small collinearity for multivariate calibration [17,18]. In the algorithm, the instrumental response data are arranged in a matrix X of dimensions (N × K) such that the kth variable x_k corresponds to the kth column vector x_k ∈ ℝ^N. Let M = min(N − 1, K) be the maximum number of selected variables used in later calibration models. First, projections are carried out on the X matrix, which generates K chains of M variables each. Each element in a chain is selected so as to display the least collinearity with the previous ones. The construction of each chain starts from one of the variables x_k, k = 1, ..., K, and follows a comparison step of projections until the needed relevant variables are selected. The details of these steps can be found in previous studies [17,18]. The selected variables, named EWs, were then used as the inputs of the MLR, PLS and LS-SVM models.
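The projection step that builds one chain can be sketched as below. This is a minimal illustration of the idea only: the function name `spa_chain` is hypothetical, and the later stage in which the starting column and chain length are chosen by validation error (described in [17,18]) is not shown.

```python
import numpy as np

def spa_chain(X, start, m):
    """Build one SPA chain of m column indices of X, starting from column `start`."""
    Xp = np.asarray(X, dtype=float).copy()
    selected = [start]
    for _ in range(m - 1):
        v = Xp[:, selected[-1]].copy()
        # Project every column onto the subspace orthogonal to the last selected column.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # already selected columns cannot be chosen again
        # The column with the largest remaining norm is the least collinear one.
        selected.append(int(np.argmax(norms)))
    return selected

# Toy data: column 1 is almost collinear with column 0, so it is picked last or not at all.
X = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 2.0, 0.0]])
print(spa_chain(X, start=0, m=3))  # [0, 2, 3]
```

In the toy matrix, column 1 keeps only a small component after column 0 is projected out, so the chain skips it in favour of the columns that carry new directions.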

2.3.2. Regression coefficient analysis

Regression coefficient analysis (RCA) is derived from partial least squares (PLS) analysis [12,13]. The regression coefficients were obtained from a PLS model using the software “The Unscrambler® 9.6” (CAMO AS, Oslo, Norway). In this procedure, full cross-validation was used to develop the PLS regression model. The regression coefficients in the PLS model are used to calculate the response value of the Y-variable (soluble solids content of beer) from the X-variables (spectral data of beer). The size of a coefficient indicates how much impact the corresponding variable has on the response variable (Y): a large absolute value indicates that the variable is important and significant for the prediction of the Y-variable. Hence, RCA could be used for EW selection. In this paper, two assumed principles were employed for the EW selection: (1) the absolute regression coefficient (RC) value of a selected EW should be larger than a certain threshold value and (2) the selected EWs should lie at peaks and valleys of the regression coefficient curve. The peaks and valleys represented the extrema of the regression coefficient plot and were identified visually as a rough selection. These two hypotheses were successfully used in previous studies [11,12]. Therefore, the EWs selected by RCA could be employed as the input data matrix of the MLR, PLS and LS-SVM models.
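The two selection principles can be expressed as a short routine. Note the assumption: the paper located peaks and valleys by visual inspection, whereas this sketch substitutes a simple automatic local-extremum test; the function name `rca_select` and the toy coefficients are hypothetical.

```python
import numpy as np

def rca_select(coef, threshold):
    """Return indices of local peaks/valleys of `coef` whose absolute value exceeds `threshold`."""
    coef = np.asarray(coef, dtype=float)
    inner = coef[1:-1]
    is_peak = (inner > coef[:-2]) & (inner > coef[2:])
    is_valley = (inner < coef[:-2]) & (inner < coef[2:])
    keep = (is_peak | is_valley) & (np.abs(inner) > threshold)
    idx = np.where(keep)[0] + 1  # shift back to positions in the full vector
    # Order from the highest to the lowest absolute coefficient.
    return idx[np.argsort(-np.abs(coef[idx]))]

coef = [0.0, 0.2, 0.1, -0.3, 0.0, 0.05, 0.0]
print(list(rca_select(coef, threshold=0.15)))  # [3, 1]
```

The small peak at index 5 is an extremum but falls below the threshold, which shows why both conditions are needed together.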

2.3.3. Independent component analysis

Independent component analysis (ICA) is a newly developed signal processing technique aimed at blind (unobserved) source separation (BSS) [14]. ICA can separate unobserved, independent source variables from the observed variables, which are combinations (or mixtures) of these source variables. The source variables, the so-called ICs, can give more chemical explanation because statistical independence is a higher-order statistical property and a much stronger condition than orthogonality. The noise-free ICA model can be written as

$$X = AS \tag{1}$$

where X denotes the recorded data matrix, and S and A represent the independent components and the coefficient matrix, respectively. The goal of ICA is to find a proper linear representation of non-Gaussian vectors so that the estimated vectors are as independent as possible, and the mixed signals can be denoted by linear combinations of these independent components. In this procedure, the coefficient matrix A could be used for the selection of EWs. For each IC, the wavelength with the largest absolute coefficient value was selected as the EW for this IC. Hence, certain EWs could be chosen for the development of MLR, PLS and LS-SVM models. A fast fixed-point algorithm (FastICA) developed by Hyvärinen and Oja [28] was employed for the ICA procedure. All the calculations were carried out in Matlab 7.0 (The Math Works, Natick, USA).
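Once the ICA decomposition is available, the "largest absolute coefficient per IC" rule reduces to a column-wise argmax. The sketch below assumes the coefficients are arranged with one row per wavelength and one column per IC; that layout, the function name and the toy numbers are assumptions, and the FastICA step itself is not shown.

```python
import numpy as np

def select_ews_from_ica(weights, wavelengths):
    """For each IC (column of `weights`), pick the wavelength with the largest absolute value."""
    weights = np.asarray(weights, dtype=float)
    idx = np.argmax(np.abs(weights), axis=0)  # one row index per IC
    return [wavelengths[i] for i in idx]

# Toy example: 3 wavelengths (rows) and 2 ICs (columns).
wavelengths = [400, 648, 886]
weights = [[0.1, -0.2],
           [-0.9, 0.3],
           [0.2, -0.7]]
print(select_ews_from_ica(weights, wavelengths))  # [648, 886]
```

Using the absolute value matters because a strongly negative coefficient marks a wavelength as just as influential as a strongly positive one.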

2.4. MLR, PLS and LS-SVM calibrations

Multiple linear regression (MLR) is a commonly used calibration method which is simple and easy to interpret, but it is strongly affected by the collinearity between the variables [29]. Hence, the EWs selected by SPA, RCA and ICA could be evaluated by building SPA-MLR, RCA-MLR and ICA-MLR models, respectively. In the MLR procedure, all selected EWs were used in the model, and the number of EWs should be less than the number of samples and larger than the number of response chemical variables. For comparison purposes, full-spectrum PLS models were developed with different preprocessing methods. A PLS model also develops a relationship between the spectral data and the response chemical variable. However, PLS employs latent variables (LVs) instead of the real variables (spectral data) to develop the calibration model. In MLR and PLS, the samples in the validation set were used to validate the calibration model. The samples in the prediction set were applied to evaluate the performance of the developed models.
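An MLR calibration on the selected EWs amounts to an ordinary least-squares fit with an intercept. The following is a minimal sketch with hypothetical function names; it is not the software used in the paper, and the explicit check simply encodes the requirement that there be fewer EWs than samples.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + X @ b; returns [b0, b1, ..., bk]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_ews = X.shape
    if n_ews >= n_samples:
        raise ValueError("MLR needs fewer selected wavelengths than samples")
    design = np.column_stack([np.ones(n_samples), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict the response for new absorbance values at the same wavelengths."""
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]
```

Since every selected wavelength enters the model directly, strongly collinear wavelengths make the fitted coefficients unstable, which is the motivation for selecting EWs with low collinearity first.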

Moreover, it is worth noting that the MLR and PLS methods only handle linear problems and build a linear relationship between the spectral variables and the target chemical response. Considering the latent nonlinear information that exists in the spectral data, least squares-support vector machine (LS-SVM) was applied to compare the prediction performance.

LS-SVM handles both linear and nonlinear multivariate problems and resolves these relationships in a relatively fast way [20,21]. LS-SVM, a state-of-the-art learning algorithm, has a good theoretical foundation in statistical learning. It solves a set of linear equations using support vectors (SVs) instead of a quadratic programming (QP) problem, which reduces the complexity of the optimization process. The details of the LS-SVM algorithm can be found in the literature [21,30]. The final LS-SVM model can be expressed as

$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \tag{2}$$

where α_i are Lagrange multipliers, K(x, x_i) is the kernel function, and b is the bias value.

During the application of LS-SVM, the input data are the EWs selected by SPA, RCA and ICA. The most commonly used kernel is the radial basis function (RBF) kernel, which was also applied in this paper. The function can be expressed as

$$K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right) \tag{3}$$

where x_i is the input data (selected EWs) and σ is the RBF kernel parameter. The RBF kernel, as a nonlinear function, is a more compactly supported kernel and is able to reduce the computational complexity of the training procedure. At the same time, the RBF kernel can handle the nonlinear relationships between the spectra and target attributes and gives good performance under general smoothness assumptions. Thus, the RBF kernel was chosen as the kernel function of LS-SVM in this paper. The two important parameters in LS-SVM with the RBF kernel were the regularization parameter gam (γ) and the width parameter sig2 (σ²). The regularization parameter γ determined the tradeoff between minimizing the training error and minimizing model complexity.
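To make the roles of γ and σ² concrete, the sketch below trains an LS-SVM regressor by solving the standard LS-SVM linear system with the RBF kernel of Eq. (3) and predicts with Eq. (2). It is an illustrative NumPy version under stated assumptions, not the LS-SVM toolbox used in the paper; the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sig2):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / sig2), following Eq. (3)."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sig2)

def lssvm_fit(X, y, gam, sig2):
    """Solve [[0, 1^T], [1, K + I/gam]] [b; alpha] = [0; y] for alpha and b."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sig2) + np.eye(n) / gam
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]  # alpha (Lagrange multipliers), b (bias)

def lssvm_predict(X_new, X_train, alpha, b, sig2):
    """Evaluate y(x) = sum_i alpha_i K(x, x_i) + b, following Eq. (2)."""
    return rbf_kernel(X_new, X_train, sig2) @ alpha + b
```

A large γ shrinks the I/γ term, so the model follows the training data more closely; a small γ favours a smoother, more regularized fit. Training needs only one linear solve, which is the computational advantage over the QP problem of a standard SVM.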

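Table 1
The statistics of soluble solids content of beers in three data sets.

| Data set | Sample no. | Range (°Brix) | Mean (°Brix) | Standard deviation (°Brix) |
| --- | --- | --- | --- | --- |
| Calibration | 180 | 6.5–9.3 | 8.18 | 0.823 |
| Validation | 90 | 6.5–9.3 | 8.19 | 0.835 |
| Prediction | 90 | 6.6–9.2 | 8.19 | 0.823 |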

The width parameter σ² was the bandwidth and implicitly defined the nonlinear mapping from input space to some high dimensional feature space. In this paper, a two-step grid search technique was employed to obtain the optimal combination of (γ, σ²). Leave-one-out cross-validation was used to avoid overfitting in the selection of the optimal combination of (γ, σ²). The ranges of γ and σ² were set to 10⁻³–10³ based on experience and previous research [11,12]. The first grid search step was a crude search with a large step size, and the second step was a refined search with a small step size. After the grid search, the optimal combination of (γ, σ²) was obtained for the LS-SVM models. The samples in the validation set were used to validate the calibration model. The samples in the prediction set were applied to evaluate the performance of the developed models. All the calculations were performed using MATLAB® 7.0 (The Math Works, Natick, USA). The free LS-SVM toolbox (LS-SVM v 1.5, Suykens, Leuven, Belgium) was applied with MATLAB 7.0 to develop the calibration models.
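The two-step grid search over (γ, σ²) can be sketched generically as below. The assumptions are worth stating: `score` stands for any error function of the two parameters (in the paper this would be the leave-one-out cross-validation error of the LS-SVM model), the grid sizes are arbitrary choices, and the function name is hypothetical rather than the toolbox routine.

```python
import numpy as np

def two_step_grid_search(score, lo=-3.0, hi=3.0, coarse=7, fine=9):
    """Minimise score(gam, sig2) on a coarse log10 grid, then refine around the best point."""
    def best_on(gam_exps, sig_exps):
        trials = [(score(10.0 ** g, 10.0 ** s), g, s)
                  for g in gam_exps for s in sig_exps]
        return min(trials)  # tuple with the lowest error first

    # Step 1: crude search with a large step over 10^lo .. 10^hi.
    grid = np.linspace(lo, hi, coarse)
    step = grid[1] - grid[0]
    _, g0, s0 = best_on(grid, grid)

    # Step 2: refined search with a small step around the coarse optimum.
    g_fine = np.clip(np.linspace(g0 - step, g0 + step, fine), lo, hi)
    s_fine = np.clip(np.linspace(s0 - step, s0 + step, fine), lo, hi)
    err, g1, s1 = best_on(g_fine, s_fine)
    return 10.0 ** g1, 10.0 ** s1, err
```

Searching in log10 space matches the wide 10⁻³–10³ range, and the second pass costs only a small fixed number of extra evaluations instead of a dense grid over the full range.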

The evaluation indices of predictive capability for all developed models were the correlation coefficient (r) and the root mean square error (RMSE) of the calibration set (RMSEC), validation set (RMSEV) and prediction set (RMSEP), as in previous papers [17,18]. Generally, a good model should have a higher r value and lower RMSEC, RMSEV and RMSEP values. RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{4}$$

where n is the number of samples, and y_i and ŷ_i are the reference and predicted values of the ith sample, respectively.
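Both evaluation indices are straightforward to compute. The snippet below is a small sketch of Eq. (4) and of the Pearson correlation coefficient, with hypothetical function names; the same `rmse` function yields RMSEC, RMSEV or RMSEP depending on which data set is passed in.

```python
import numpy as np

def rmse(y_ref, y_pred):
    """Root mean square error between reference and predicted values, Eq. (4)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

def correlation(y_ref, y_pred):
    """Pearson correlation coefficient r between reference and predicted values."""
    return float(np.corrcoef(y_ref, y_pred)[0, 1])
```

The two indices are complementary: r measures how well predictions track the reference values, while RMSE is expressed in °Brix and also penalizes systematic offsets.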

3. Results and discussion

3.1. Spectral features

The raw absorbance spectra of beer are shown in Fig. 1a. The spectra preprocessed by SG and SNV, 1-Der and 2-Der are shown in Fig. 1b–d, respectively. As can be seen, the trends of the raw beer spectra of different brands were similar. A sharp descent is shown in the region of 350–400 nm. A small absorbance peak can be found around 960–980 nm. The preprocessing by SG and SNV enhanced the features in the regions stated above (see Fig. 1b). In Fig. 1c and d, most absorbance values were close to zero, but at the beginning and end of the spectra there was some noise introduced by the spectral differentiation process. The performance of these preprocessing methods was compared in the later calibration stage.

The statistical values of the SSC in the calibration, validation and prediction sets are shown in Table 1. The range of the calibration set was 6.5–9.3 °Brix, which covered the largest scale in all three data sets. This was helpful for developing a stable and robust calibration model.

3.2. Full-spectrum PLS models

Before the selection of EWs, the performance of different

preprocessing methods was compared using PLS analysis. The pre-treating methods included SG smoothing, SNV, 1-Der and 2-Der.Full-spectrum PLS models were developed using the raw and pre-processed spectra in calibration and validation sets. The modelperformance was evaluated by the samples in prediction set. The
Page 4: Variable selection in visible/near infrared spectra for linear and nonlinear calibrations: A case study to determine soluble solids content of beer

48 F. Liu et al. / Analytica Chimica Acta 635 (2009) 45–52

d spe

efi(rcioRwHStTmtc

3

3

dmtc

TT

P

RS12

processing or some EWs were quite close to each other. Wavelengthat 352 nm was selected by both raw and 2-Der spectra; while it wasquite close to 351 nm selected by SG + SNV spectra. Wavelength at355 and 1045 nm were selected by both raw and SG + SNV, SG + SNV

Fig. 1. The raw absorbance spectra (a) and preprocesse

valuation standards were the aforementioned correlation coef-cient (r), RMSEC, RMSEV and RMSEP. Different latent variablesLVs) were applied in the full-spectrum PLS models. The predictionesults are shown in Table 2. As can be seen, the spectra prepro-essed by SG smoothing and SNV achieved the optimal performancen validation and prediction sets. The 1-Der spectra obtained theptimal performance only in the calibration set (with r = 0.9863,MSEC = 0.1352). Considering the main indices, more attentionould be paid on the performance in prediction and validation sets.ence, it was concluded that full-spectrum PLS model with SG andNV was the best one for the prediction of SSC of beer. The correla-ion coefficient (r) and RMSEP were 0.9811 and 0.1808, respectively.he reason for the poor performance of 1-Der and 2-Der spectraight be that the differentiation spectra brought in some noise to

he variable matrix. This also could be discovered in Fig. 1c and dompared with Fig. 1b.

.3. Selection of EWs

.3.1. EWs selected by SPA

As stated above, SPA was used for the selection of EWs for the

etermination of SSC of beer. The aforementioned preprocessingethods were also taken into consideration. It was worth noting

hat the validation set was applied for the guidance of selection ofandidate subsets of variables. The prediction set was utilized in

able 2he prediction results of SSC by full-spectrum PLS models.

reprocessing LVs Calibration Validation Prediction

r RMSEC r RMSEV r RMSEP

aw 10 0.9835 0.1482 0.9752 0.2025 0.9684 0.2532G + SNV 5 0.9841 0.1456 0.9833 0.1515 0.9811 0.1808-Der 11 0.9863 0.1352 0.9705 0.2391 0.9522 0.3301-Der 5 0.9129 0.3345 0.9233 0.3190 0.9168 0.3271

ctra by SG and SNV (b), 1-Der (c) and 2-Der (d) of beer.

the final performance evaluation of the resulting models. It was notapplied in any step of the calibration and validation procedures.The EW selection procedure employed the samples in calibrationand validation sets. The maximum number of EWs selected bySPA was set as 30 according to experience and practical consid-eration. The selected EWs by different preprocessing are specifiedin Table 3. As can be seen, different number of EWs was obtainedby different pretreating. The EWs were sequenced in the order ofimportance in the projection procedure for each preprocessing.Individual EWs instead of spectral ranges were selected in orderto develop more parsimonious models with least number of rel-evant variables. These individual EWs were more convenient forfurther applications such as the development of portable instru-ments. Take SG and SNV for instance, wavelength at 406 nm wasthe most relevant variable in these 10 selected EWs in the SPAprocedure. Some wavelengths were both selected by different pre-

Table 3The selected EWs by SPA with different preprocessing, RCA and ICA.

Methods Preprocessing EWs Selected EWs (nm)

SPA Raw 21 763, 1019, 973, 392, 360, 624, 1028,940, 1046, 362, 1049, 634, 1036, 355,1047, 1050, 1043, 1038, 1033, 1030, 352

SG + SNV 10 406, 1045, 637, 753, 1048, 878, 351,355, 633, 359

1-Der 6 449, 958, 1002, 979, 678, 10452-Der 4 352, 737, 880, 995

RCA SG + SNV 10 353, 362, 400, 406, 371, 1045, 964, 960,980, 365

ICA SG + SNV 10 886, 952, 480, 648, 464, 888, 933, 456,849, 865

Page 5: Variable selection in visible/near infrared spectra for linear and nonlinear calibrations: A case study to determine soluble solids content of beer

F. Liu et al. / Analytica Chimica Acta 635 (2009) 45–52 49

mar

as6wpsFt

swaccTbEoffwp

3

tfottom

Fa

Fig. 2. The RMSEV plot (a) and selected EWs (shown in ×

nd 1-Der spectra, respectively. Some other EWs were quite close,uch as 359 (SG + SNV) and 360 nm (Raw), and 633 (SG + SNV) and34 nm (Raw). More EWs selected by raw and SG + SNV spectraere identical or quite close, the reason might be that the SG + SNVreprocessing remained the main features of raw spectra, andimultaneously reduced some noise of the raw spectra (seen inig. 1a and b). Hence, less number of EWs was selected by SG + SNVhan the raw spectra.

The RMSEV plot of preprocessed spectra by SG and SNV arehown in Fig. 2a, and the related EWs are shown in Fig. 2b. Fig. 2as used for the explanation of the selection procedure by SPA,

nd the distribution of selected EWs in the spectral curve plot. Asan be seen, a sharp fall is shown at the beginning of the RMSEVurve as the number of selected EWs is increased from one to four.his might indicate that the least number of selected EWs shoulde four to resolve the spectral overfitting features. From four to sixWs, the RMSEV curve was level off, but from six to eight, the trendsf RMSEV curve was descent quickly. Then a gradually descent wasrom eight to ten, and the improvement becomes marginal withurther increasing number of selected EWs. Thus, the curve tendas level off after the determination of selected EWs by SPA cutoffrocedure at the tenth EW [18].

.3.2. EWs selected by RCARegression coefficient analysis (RCA) was implemented during

he PLS regression. Only the best prediction performance was usedor the selection of EWs by RCA. The optimal performance was

btained by SG + SNV spectra with five LVs. The aforementionedwo hypotheses were applied with a threshold value of ±0.15. Thishreshold value was settled based on experience and many trials inrder to select the least number of relevant variables to representost of the useful information of full spectral region. The plots of

ig. 3. The plots of regression coefficient (a) and selected EWs (shown in × markers) (b) bnd lower cutoff threshold value.

kers) (b) by SPA with preprocessing of SG + SNV spectra.

regression coefficient and distribution of selected EWs in SG + SNVspectral curve are shown in Fig. 3a and b, respectively. The specificEWs are shown in Table 3. The EWs were in the sequence from thehigh to the low absolute regression coefficients. As can be seen,wavelengths at 406 and 1045 nm were identical with SPA selec-tion process in the SG + SNV spectra. Wavelength at 362 nm wasthe same with SPA in raw spectra, and 1045 nm was also the samewith SPA in 1-Der spectra.

3.3.3. EWs selected by ICA

ICA was applied to the spectra without considering the SSC value. After the comparison, the preprocessing of SG and SNV was optimal for the prediction. Hence, only the EWs selected from the SG + SNV spectra are stated in this paper. During the process of ICA, the coefficient matrix for each IC and the weights matrix for each wavelength were obtained. The coefficients and the weight plot of the first IC (IC 1) are shown in Fig. 4a and b, respectively. The EWs were selected as the wavelengths with the largest absolute weight value of each IC. Taking the first IC as an example, Fig. 4b shows the weight plot of IC 1; the largest absolute weight value corresponded to the wavelength at 648 nm, hence the wavelength at 648 nm was selected as the EW for IC 1. For comparison with SPA and RCA under the same conditions, 10 EWs corresponding to ten ICs (ICs 1–10) were selected and specified in Table 3. Moreover, the performance using more or fewer EWs was compared, and the results indicated that 10 EWs were more suitable for the ICA methods. As can be seen, the EWs selected by ICA were different from those selected by SPA and RCA. The reason might be that SPA and RCA took the chemical component (SSC) into consideration, whereas ICA only calculated with the spectral data. On further inspection of Fig. 4a, the EWs selected by ICA corresponded to wavelengths with a larger absolute coefficient value, such as those within the wavelength region 880–1000 nm.
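The selection rule described above (one EW per IC, taken at the largest absolute weight) can be written as a short sketch. It assumes the ICA weights matrix has already been computed by some ICA routine; the function name and the toy weights are hypothetical and are not taken from Fig. 4b or Table 3.

```python
def select_by_ic_weights(wavelengths, weights):
    """For each independent component, return the wavelength whose weight
    has the largest absolute value.

    weights[k][j] is the weight of wavelength j for the k-th IC.
    """
    selected = []
    for row in weights:
        if len(row) != len(wavelengths):
            raise ValueError("each weight row must match the wavelength axis")
        best_index = max(range(len(row)), key=lambda j: abs(row[j]))
        selected.append(wavelengths[best_index])
    return selected


# Toy example with two ICs and four wavelengths (nm), invented for illustration.
toy_wavelengths = [600, 650, 700, 900]
toy_weights = [
    [0.10, -0.90, 0.20, 0.30],   # IC 1: largest |weight| at 650 nm
    [0.40, 0.10, 0.05, -0.70],   # IC 2: largest |weight| at 900 nm
]
ica_selected = select_by_ic_weights(toy_wavelengths, toy_weights)
print(ica_selected)
```

Two ICs may point to the same wavelength, so in practice duplicates would need to be handled when a fixed number of distinct EWs is required.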


Fig. 4. The coefficients (a) and weight plot (b) of the first IC (IC 1) with SG and SNV spectra.

3.4. MLR, PLS and LS-SVM models using selected EWs

According to the variable selection methods stated above, the EWs selected by SPA, RCA and ICA were employed as the inputs to develop MLR, PLS and LS-SVM models. The calibration and validation sets were used for the calibration stage, and the prediction set was applied for the performance evaluation.

3.4.1. Linear MLR and PLS models

In order to make a full evaluation of SPA, SPA-MLR and SPA-PLS models with different preprocessing were developed. In the MLR model, all EWs selected under each preprocessing were employed to develop the calibration model, whereas a certain number of latent variables (LVs) were used in the PLS models. The prediction results are shown in Table 4. As can be seen, the SG + SNV spectra achieved the optimal performance in all SPA-MLR and SPA-PLS models. The results were consistent with full-spectrum PLS with SG + SNV spectra in validation and prediction sets. Moreover, the SPA-MLR model was slightly better than the SPA-PLS model with SG + SNV spectra, except that the correlation coefficient (r) in the prediction set was lower (0.9762 for SPA-MLR vs 0.9773 for SPA-PLS). The reason for the poor performance of 1-Der and 2-Der spectra might be that the differentiation brought some noise into the variable matrix, which was in agreement with the full-spectrum PLS model. Furthermore, the performance of the SPA-MLR model was slightly better than that of the full-spectrum PLS model using the SG + SNV spectra, except for the correlation coefficient (r) in the prediction set (0.9762 for SPA-MLR < 0.9811 for full-spectrum PLS). However, the overall performance of SPA-MLR for calibration, validation and prediction sets was not impaired even though only ten EWs were used, compared with 701 wavelengths in the full-spectrum PLS model. The SPA-MLR model was more parsimonious and simpler to interpret for further applications. Therefore, it could be concluded that SPA was a powerful approach for the relevant variable selection. SPA reduced the collinearity, redundancies and noise for the whole spectral data.

For comparison, the EWs selected by RCA and ICA were employed to develop MLR and PLS models. Only the optimal results using SG + SNV spectra are stated in this paper. The performance of SG + SNV spectra was optimal, and this was consistent with the SPA selection procedure and the full-spectrum PLS model. The prediction results using RCA and ICA are shown in Table 4. As can be seen, the SPA-MLR (SG + SNV) model was slightly better than all RCA and ICA models, since the SPA-MLR model had a higher r value and lower RMSE value in calibration, validation and prediction sets. The results reconfirmed the success of SPA for the selection of the most relevant variables from the full-spectrum region.

Table 4
The prediction results by MLR and PLS models.

Methods   Preprocessing  EWs/LVs  Calibration       Validation        Prediction
                                  r       RMSEC     r       RMSEV     r       RMSEP
SPA-MLR   Raw            21/–     0.9770  0.1748    0.9703  0.2010    0.9594  0.2388
SPA-MLR   SG + SNV       10/–     0.9849  0.1420    0.9853  0.1432    0.9762  0.1808
SPA-MLR   1-Der          6/–      0.9447  0.2688    0.9625  0.2257    0.9255  0.3256
SPA-MLR   2-Der          4/–      0.8651  0.4111    0.8760  0.4014    0.8401  0.4742
RCA-MLR   SG + SNV       10/–     0.9704  0.1979    0.9635  0.2257    0.9566  0.2446
ICA-MLR   SG + SNV       10/–     0.9624  0.2226    0.9588  0.2419    0.9358  0.2959
SPA-PLS   Raw            21/7     0.9692  0.2018    0.9579  0.2401    0.9367  0.2923
SPA-PLS   SG + SNV       10/5     0.9813  0.1576    0.9800  0.1654    0.9773  0.1916
SPA-PLS   1-Der          6/2      0.9398  0.2834    0.9569  0.2416    0.9259  0.3259
SPA-PLS   2-Der          4/2      0.8432  0.4406    0.8832  0.3893    0.8209  0.4746
RCA-PLS   SG + SNV       10/4     0.9741  0.1854    0.9740  0.1889    0.9650  0.2153
ICA-PLS   SG + SNV       10/5     0.9728  0.1898    0.9729  0.1925    0.9617  0.2251

3.4.2. Nonlinear LS-SVM models

After the determination of EWs by SPA, RCA and ICA, LS-SVM models were developed to determine the SSC of beer. To our knowledge, the SPA-LS-SVM combination was newly proposed in this paper. The selected EWs were employed as the inputs, the RBF kernel was recommended as the kernel function, and the model parameters (γ, σ²) were determined by a two-step grid search technique. The EWs were used as the inputs of LS-SVM models in order to reduce the training time, because the training time using LS-SVM increased with the square of the number of training samples and linearly with the number of variables (dimension of spectra) [31]. The calibration and validation sets were applied for the calibration stage, and the prediction set was used to validate the model performance. The optimal combinations of (γ, σ²) were (79.5, 55.5), (9.5, 6.1) and (121.5, 10.1) for SPA-LS-SVM, RCA-LS-SVM and ICA-LS-SVM models, respectively. The prediction results of all LS-SVM models are shown in Table 5. The predicted vs reference values of SSC are shown in Fig. 5a–c for SPA, RCA and ICA models, respectively.

Table 5
The prediction results of SSC of beer by LS-SVM models.

Methods      (γ, σ²)        Calibration       Validation        Prediction
                            r       RMSEC     r       RMSEV     r       RMSEP
SPA-LS-SVM   (79.5, 55.5)   0.9905  0.1106    0.9915  0.1084    0.9818  0.1628
RCA-LS-SVM   (9.5, 6.1)     0.9915  0.1064    0.9910  0.1106    0.9869  0.1330
ICA-LS-SVM   (121.5, 10.1)  0.9910  0.1079    0.9915  0.1074    0.9808  0.1600

Fig. 5. The predicted vs reference values of SSC by SPA-LS-SVM (a), RCA-LS-SVM (b) and ICA-LS-SVM (c) models.

As can be seen, all three developed LS-SVM models were better than the linear models developed using selected EWs, since the LS-SVM models had higher r values and lower RMSE values. Compared with the full-spectrum PLS model, SPA-LS-SVM and RCA-LS-SVM models had slightly better performance. Notably, 701 variables were used in the full-spectrum PLS model, whereas only 10 variables were employed in the LS-SVM models. The performance of LS-SVM was not impaired and was even slightly better. Moreover, ten variables were more convenient for further applications. From this point of view, the proposed variable selection methods were quite helpful for potential applications. The reason might be that the LS-SVM model took both linear and latent nonlinear useful information into consideration, whereas MLR and PLS models only dealt with the linear relationship between the spectral data and SSC. The results were also in agreement with previous studies [11,32]. Comparing these three LS-SVM models, RCA-LS-SVM obtained slightly better results in calibration and prediction sets, whereas SPA-LS-SVM and ICA-LS-SVM had the same r value in the validation set, where ICA-LS-SVM also had a lower RMSEV value. However, all three LS-SVM models obtained almost the same excellent performance, with r values higher than 0.9800 and RMSE values lower than 0.1650. Considering only the performance on the prediction set, EWs selected by RCA achieved the best performance with r = 0.9869 and RMSEP = 0.1330. This result was also better than those of previous studies. Wang et al. [26] obtained prediction results with r = 0.9539 and RMSEP = 0.2559 by back propagation-artificial neural network (BP-ANN), and r = 0.9829 and RMSEP = 0.1506 by principal component analysis-least squares-support vector machine (PCA-LS-SVM) model. In conclusion, the new combination of SPA-LS-SVM was quite powerful and would be useful for potential applications; SPA selected the most relevant variables as EWs with least collinearity to determine the SSC of beer.

Comparing the computational time cost of the three variable selection methods, the time for RCA, ICA and SPA (processed on the same computer) was 300, 60 and 35 s, respectively. SPA was relatively fast compared with RCA and ICA. The computational time for RCA-MLR, ICA-MLR and SPA-MLR was 310, 70 and 45 s, respectively. The total computational time for PLS models using EWs was quite close to that of MLR models using EWs. When using LS-SVM for calibration and validation, less than 1 s was needed once the parameters of LS-SVM were settled. Hence, the computational time for RCA-LS-SVM, ICA-LS-SVM and SPA-LS-SVM was about 300, 60 and 35 s, respectively. From this point of view, SPA-LS-SVM was a relatively fast way for the determination of SSC of beer. The results indicated that Vis/NIR spectroscopy combined with chemometrics could be successfully applied for the determination of SSC of beer. The newly proposed SPA-LS-SVM would have potential applications for the determination of other quality parameters of beer.
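To make the role of the two LS-SVM parameters concrete, the following is a minimal pure-Python sketch of LS-SVM regression with an RBF kernel, following the standard linear-system formulation of Suykens and Vandewalle [20]. It is not the software used in this study. The kernel convention exp(−‖x − z‖²/σ²), the helper names, and the toy data and parameter values are assumptions made for the example, and the two-step grid search is not shown.

```python
import math


def rbf_kernel(a, b, sigma2):
    # Assumed convention: K(a, b) = exp(-||a - b||^2 / sigma2)
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma2)


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting on a dense square system."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_lssvm(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    system = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k = rbf_kernel(X[i], X[j], sigma2)
            row.append(k + (1.0 / gamma if i == j else 0.0))
        system.append(row)
    solution = solve_linear_system(system, [0.0] + list(y))
    return solution[0], solution[1:]  # bias b, support values alpha


def predict_lssvm(X_train, bias, alphas, sigma2, x_new):
    return bias + sum(a * rbf_kernel(x_new, xi, sigma2)
                      for a, xi in zip(alphas, X_train))


# Toy one-variable data (NOT beer spectra); gamma and sigma2 are arbitrary.
toy_X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
toy_y = [x[0] ** 2 for x in toy_X]
toy_gamma, toy_sigma2 = 100.0, 1.0
bias, alphas = fit_lssvm(toy_X, toy_y, toy_gamma, toy_sigma2)
fitted = [predict_lssvm(toy_X, bias, alphas, toy_sigma2, x) for x in toy_X]
```

Because every training sample receives its own support value, the system has one row per sample plus one for the bias. A dense direct solve such as the one above scales poorly with the number of samples, which is one reason a small set of EWs and a modest calibration set keep the model cheap.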

4. Conclusion

Vis/NIR spectroscopy was successfully utilized for the determination of SSC of beer. SPA with different preprocessing methods (SG smoothing, SNV, 1-Der and 2-Der) was applied to select the most relevant EWs, in comparison with RCA and ICA. A new combination of SPA-LS-SVM was proposed and compared with linear MLR and PLS models. The developed SPA-MLR (SG + SNV) model obtained better results than RCA-MLR, ICA-MLR, SPA-PLS, RCA-PLS and ICA-PLS models. The linear models indicated that SPA was a powerful way to select the most relevant variables with least collinearity and redundancies. The newly proposed SPA-LS-SVM model obtained almost the same excellent performance as RCA-LS-SVM and ICA-LS-SVM models. The correlation coefficient (r) and RMSEP by SPA-LS-SVM were 0.9818 and 0.1628, respectively. The overall results demonstrated that SPA was powerful for variable selection, and that the newly proposed SPA-LS-SVM could be applied as an alternative fast and accurate method for the determination of SSC of beer.

Acknowledgements

This study was supported by National Science and Technology Support Program (2006BAD10A09), the Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of MOE, PR China, Natural Science Foundation of China (Project No. 30671213), Science and Technology Department of Zhejiang Province (Project No. 2005C12029).

References

[1] I.S. Helland, T. Naes, T. Isaksson, Chemom. Intell. Lab. Syst. 29 (1995) 233.
[2] P.A.G. Gorry, Anal. Chem. 62 (1990) 570.
[3] R. Barnes, M. Dhanoa, J. Lister, Appl. Spectrosc. 43 (1989) 772.
[4] J.A. Westerhuis, S. de Jong, A.K. Smilde, Chemom. Intell. Lab. Syst. 56 (2001) 13.
[5] J.H. Kalivas, N. Roberts, J.M. Sutter, Anal. Chem. 61 (1989) 2024.
[6] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295.
[7] M. Min, W.S. Lee, Trans. ASABE 48 (2005) 455.
[8] K.H. Esbensen, Multivariate Data Analysis in Practice, 5th ed., CAMO Process AS, Oslo, 2002.
[9] F. Liu, Y. He, L. Wang, H.M. Pan, J. Food Eng. 83 (2007) 430.
[10] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851.
[11] F. Liu, Y. He, L. Wang, Anal. Chim. Acta 610 (2008) 196.
[12] F. Liu, Y. He, L. Wang, Anal. Chim. Acta 615 (2008) 10.
[13] I.G. Chong, C.H. Jun, Chemom. Intell. Lab. Syst. 78 (2005) 103.
[14] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, New York, 2001.
[15] C. Krier, F. Rossi, D. Francois, M. Verleysen, Chemom. Intell. Lab. Syst. 91 (2008) 43.
[16] S. Sagrado, M.T.D. Cronin, Anal. Chim. Acta 609 (2008) 169.
[17] M.C.U. Araújo, T.C.B. Saldanha, R.K.H. Galvão, T. Yoneyama, H.C. Chame, V. Visani, Chemom. Intell. Lab. Syst. 57 (2001) 65.
[18] R.K.H. Galvão, M.C.U. Araújo, W.D. Fragoso, E.C. Silva, G.E. José, S.F.C. Soares, H.M. Paiva, Chemom. Intell. Lab. Syst. 92 (2008) 83.
[19] M.C. Breitkreitz, I.M. Raimundo Jr., J.J.R. Rohwedder, C. Pasquini, H.A.D. Filho, G.E. José, M.C.U. Araújo, Analyst 128 (2003) 1204.
[20] J.A.K. Suykens, J. Vandewalle, Neural Process. Lett. 9 (1999) 293.
[21] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[22] F.A. Inón, S. Garrigues, M. de la Guardia, Anal. Chim. Acta 571 (2006) 167.
[23] R. Llario, F. Inón, S. Garrigues, M. de la Guardia, Talanta 69 (2006) 469.
[24] D.W. Lachenmeier, Food Chem. 101 (2007) 825.
[25] G. McLeod, K. Clelland, H. Tapp, E.K. Kemsley, R.H. Wilson, G. Poulter, D. Coombs, C.J. Hewitt, J. Food Eng. 90 (2009) 300.
[26] L. Wang, Y. He, F. Liu, X.F. Ying, J. Infrared Millim. Waves 27 (2008) 51.
[27] S.F. Ye, D. Wang, S.G. Min, Chemom. Intell. Lab. Syst. 91 (2008) 194.
[28] A. Hyvärinen, E. Oja, Neural Netw. 13 (2000) 411.
[29] T. Naes, B.H. Mevik, J. Chemom. 15 (2001) 413.
[30] H. Guo, H.P. Liu, L. Wang, J. Syst. Simul. 18 (2006) 2033.
[31] F. Chauchard, R. Cogdill, S. Roussel, J.M. Roger, V. Bellon-Maurel, Chemom. Intell. Lab. Syst. 71 (2004) 141.
[32] F. Liu, Y. He, J. Agric. Food Chem. 83 (2007) 430.