Method for the Classification of Biological FT-IR Spectra Prior to Quantitative Analysis


FRÉDÉRIC CADET Laboratoire de Biochimie, Faculté des Sciences, Université de la Réunion, 15 avenue René Cassin, BP 7151, 97715 Saint-Denis Messag Cedex 9, Réunion, France-DOM.

Several methods have been proposed with the aim of improving the precision of quantitative measurements of biological components (baseline correction, classification, elimination of unwanted components, etc.). In this context, we propose a method for classifying biological samples (raw sugar cane juices) before sucrose content prediction is performed. The method consists of isolating the two most dissimilar individuals from a large calibration family of mid-FT-IR spectra and then, by successive principal component analysis (PCA) and principal component regression (PCR), constituting a family composed of a few individuals. Each individual from this family represents the first spectrum of one of the classes that are ultimately formed. The remaining samples from the calibration family are classified by the mobile centers method, that is, by measurement of the Euclidian distances. This procedure improved the precision of the predictions. The mean and standard deviation (SD) of the differences between predicted and reference values were, respectively, -1.62 × 10⁻³ and 0.308 before classification and 2.38 × 10⁻³ and 0.254 after classification. The procedure developed in this paper first allowed a qualitative classification of spectra without knowledge of their chemical composition and, second, improved the precision of the quantitative predictions.

Index Headings: Biological spectra; Classification; Mid-FT-IR; PCA; PCR.

INTRODUCTION

The management of spectral data is not straightforward. A collection of a hundred spectra could contain 300,000 absorbance values. Mid-FT-IR spectra of natural products are often very similar. It is therefore necessary to emphasize their differences by processing the data and by applying a particular mathematical distance to measure the dissimilarities between spectra. Searching a spectral database for spectra similar to an unidentified spectrum requires a fast, well-adapted procedure.

Near-infrared reflectance spectroscopy (NIR) is the most widely used spectroscopic method for the quantitative analysis of major biochemical constituents in food products, and it is now widely applied in order to predict the composition of various food products. 1 More and more products and compounds are being analyzed. 2,3 Only a few food products have been analyzed by MIR spectroscopy (complex spectra, strong IR absorption by water, and weak penetration of incident rays). With recent developments in the field of computer and software engineering and the advent of new techniques such as ATR (attenuated total reflectance) cells, the development and application of physical quantitative analysis methods such as mid-infrared spectroscopy have expanded considerably. 4-7 The interest in using principal component analysis (PCA) for studies in mid-infrared spectroscopy has already been established, 8,9 and PCA was applied for the first time to the analysis of diffuse reflectance NIR spectra as reported by Bertrand et al. 10 and Cowe and McNichol. 11 We have recently proposed the use of multidimensional analysis combined with mid-FT-IR for the prediction of sucrose content in biological samples (raw sugar cane juices). We report the use of one equation for the sucrose content prediction of the whole spectral collection. 12

Received 23 January 1996; accepted 3 July 1996.

With the aim of improving the precision of the measured sucrose content values, a classification algorithm adapted to mid-FT-IR spectra has been developed. The most dissimilar individuals are obtained from the whole spectral collection. The other samples are then classified according to their Euclidian distance to these individuals. Multidimensional analysis, principal component analysis, and principal component regression (PCR) are used extensively for this purpose. The algorithms developed are tested on raw sugar cane juices in order to improve the precision of the quantitative determination of sucrose.

MATERIALS AND METHODS

Biological Samples. Sugar cane is sampled by coring. The average core is about 7000 g. After pulverization, a subsample of approximately 1000 g is removed. A hydraulic press is used to extract juice from the samples obtained from coring and from the disintegrator. The subsample is pressed for two and one-half minutes at 250 bar. The raw juices obtained, which contain impurities and fibers, are filtered immediately through a highly porous plastic filter when the ATR cells are filled.

Corrected Polarization. The reference sucrose content values were determined by corrected polarimetric measures. 7

Mathematical Treatment. Mathematical treatments were performed on a Compaq personal computer with software written in "C" language and developed in our laboratory. Multidimensional statistical analyses, such as principal component analysis, describe variation in multidimensional data by a few synthetic variables. These synthetic variables are linear combinations of all the original variables and have the advantage of having no correlation with each other. Simpler descriptions of data sets are thus obtained with minimal loss of information. These treatments were used for morphological analysis of spectra 13 and for graphical representation of spectral similarity. 14

PCA was applied to the spectra from 800 to 1250 cm⁻¹ (with 235 data points used as principal variables). Spectra were centered prior to PCA according to:

$$X_{ij} = A_{ij} - A_i - A_j + A$$

1590 Volume 50, Number 12, 1996 0003-7028/96/5012-159052.00/0 APPLIED SPECTROSCOPY © 1996 Society for Applied Spectroscopy

where $X_{ij}$ is the centered data; $A_{ij}$ is the spectral data (log 1/R) of spectrum i at wavelength j; $A_j$ is the mean value of the spectral data at wavelength j over all spectra; $A_i$ is the mean value of the spectral data of spectrum i over all wavelengths; and $A$ is the overall mean of all spectral data in the collection.
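The centering formula can be illustrated with a minimal Python sketch. The function name `double_center` and the toy absorbance values are assumptions made for illustration only; they are not taken from the original C software.

```python
def double_center(spectra):
    """Return X with X[i][j] = A[i][j] - A_i - A_j + A (row, column, grand means)."""
    n = len(spectra)        # number of spectra
    k = len(spectra[0])     # number of wavelengths
    row_means = [sum(row) / k for row in spectra]                             # A_i
    col_means = [sum(spectra[i][j] for i in range(n)) / n for j in range(k)]  # A_j
    grand_mean = sum(row_means) / n                                           # A
    return [
        [spectra[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(k)]
        for i in range(n)
    ]

# Hypothetical toy data: 3 spectra x 4 wavelengths (made-up absorbances).
A = [
    [0.50, 0.62, 0.71, 0.55],
    [0.48, 0.60, 0.75, 0.53],
    [0.52, 0.66, 0.70, 0.58],
]
X = double_center(A)
```

After this centering, every row and every column of X sums to zero, which is a convenient check on an implementation.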

Principal component regression was used to establish a prediction equation. PCR is basically a multilinear regression applied to scores assessed by PCA. 15,16 Interest in the introduction of scores according to their predictive ability has already been shown. 17,18

Concentrations are predicted according to:

$$C_{n,1} = X_{n,k} \cdot V_{k,p} \cdot R_{p,1}$$

where C is the column vector of predicted concentrations, X is the centered matrix of spectral data, V is the matrix of latent vectors of PCA, and R is the column vector of the regression coefficients of the prediction equation; n, k, and p are, respectively, the number of samples, the number of wavelengths, and the number of significant principal components. The product V·R is a vector, the components of which may be interpreted in terms of absorption bands. Plotting the components against the corresponding wavelengths gives a spectral pattern. Peaks correspond to absorption bands that are characteristic of the measured chemical constituents. Hollows indicate that, when the concentration increases, the corresponding absorption bands will decrease. 19
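As an illustration of the matrix product in the prediction equation, the following sketch multiplies small made-up matrices in plain Python. The helper `matmul` and all numerical values are hypothetical; no constant term is included, following the equation as printed.

```python
def matmul(a, b):
    """Multiply matrix a (m x q) by matrix b (q x r), both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical toy sizes: n = 2 samples, k = 3 wavelengths, p = 2 components.
X = [[0.1, -0.2, 0.1],
     [-0.1, 0.2, -0.1]]   # centered spectra (n x k)
V = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]          # latent vectors of PCA (k x p)
R = [[2.0],
     [-1.0]]              # regression coefficients (p x 1)

scores = matmul(X, V)     # n x p scores
C = matmul(scores, R)     # n x 1 predicted values
pattern = matmul(V, R)    # k x 1 vector V.R, one value per wavelength
```

Because matrix multiplication is associative, `matmul(X, pattern)` gives the same predictions as `C`; the vector `pattern` is the quantity that the text proposes to plot against wavelength.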

Calibration and Verification Sets. The calibration and verification sets consist, respectively, of 107 and 30 spectra of raw sugar cane juices.

RESULTS AND DISCUSSION

Reference Values. The 800-1250-cm⁻¹ spectral region is characteristic of C-O, O-H, and C-C bonds of saccharides. However, the spectra in this zone result from the absorption bands of the three major sugars (sucrose, glucose, and fructose), which are present in the biological sample. The sucrose content values range from 11.842% to 23.154% (g/100 mL) for the calibration set, with a mean and SD of 19.656 and 2.168, respectively. For the verification set, they range from 17.04% to 24.27%, with a mean of 20.33 and an SD of 1.55.

THEORY

Our investigations were limited to a fixed interval $I_p$ = 800-1250 cm⁻¹ (Fig. 1). The collection of spectra S defined on $I_p$ forms a vector space of finite dimension. If the number of discrete points is p, then the dimension of S is p + 1.

We define f as a finite function defined on S and with values in R. In the present study, f represents "sucrose content".

Under the hypothesis that f is a linear function: if $(e_j)_{j = 0, \ldots, p}$ is a basis of S, then f can be decomposed into

$$f(s) = \sum_{j=0}^{p} \alpha_j \, f(e_j) \qquad (1)$$

where $s \in S$ and

$$s = \sum_{j=0}^{p} \alpha_j \, e_j.$$

The function f is hence entirely defined by the $f(e_j)$.

PCA is applied to a family (F) containing n samples (n = 107). Hence a system B of orthonormalized spectra is obtained. The spectra are the eigenvectors of the covariance matrix:

$$B = (a_j)_{j = 1, \ldots, n}$$
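To make the idea of an eigenvector of the covariance matrix concrete, here is a small sketch that finds the first principal axis by power iteration. Power iteration is a stand-in chosen for brevity and is not stated to be the method used in the original software; the function names and the two-variable toy data are assumptions, and the data are taken to be already column-centered.

```python
import math

def covariance(X):
    """Covariance matrix (k x k) of column-centered data X (n x k)."""
    n, k = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1) for b in range(k)]
            for a in range(k)]

def first_axis(cov, iterations=200):
    """Dominant eigenvector of cov, approximated by power iteration."""
    k = len(cov)
    v = [1.0 / math.sqrt(k)] * k              # arbitrary unit starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]             # renormalize at every step
    return v

# Hypothetical centered data lying exactly on the line y = 2x.
X = [[-2.0, -4.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
axis = first_axis(covariance(X))
```

For these toy data the first axis points along the direction (1, 2), normalized to unit length. Further axes would require deflation or a full eigen-solver, which this sketch omits.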

The regression formula is given by

$$f(s) = \mathrm{cst} + \sum_{j=1}^{n} \beta_j \, f(a_j) \qquad (2)$$

where $s \in S$ and

$$s = \sum_{j=1}^{n} \beta_j \, a_j.$$

When the two equations are compared, it can be observed that p - n terms of Eq. 1 are missing in Eq. 2.

It should be noted that the $a_j$ in Eq. 2 are ordered according to their relative weight, which means that

$$\mathrm{weight}\, f(a_1) > \mathrm{weight}\, f(a_2) > \cdots > \mathrm{weight}\, f(a_n).$$

Let us consider a function $g : S \times \{1, \ldots, n\} \to \mathbb{R}$, such that

$$g(s, x) = \mathrm{cst} + \sum_{j=1}^{x} \beta_j \, f(a_j) \qquad (3)$$

where $s \in S$. It should be noted that: (1) For the autoprediction, i.e., $s \in F$, g(s, x) converges to f(s) when x tends to n. (2) For the other spectra, i.e., $s \in (S - F)$, g(s, x) converges to h(s) when x tends to n. The variation when x tends to n is hence small. However, h(s) differs from f(s). (3) With regard to the ordering of the regression coefficients, the weight of $a_j$ depends on the number of defined principal components. The order of the $a_j$ is thus not absolute. For example, with 10 principal components, the $a_j$ set can be as follows: 1, 2, 3, 4, 9, 6, 10, 5, 8, 7; and with 30 principal axes the $a_j$ set would be: 1, 2, 3, 4, 18, 14, 19, 12, 29, 9, 6, 28, 10, 27, 30, 20, 15, 21, 5, 23, 11, 22, 24, 13, 8, 17, 26, 25, 7, 16.

It should be noted that the 7th axis has a weak weight and that, if the study is restricted to the first 10 axes, several terms that have greater weight than $a_7$ are omitted.

A Possible Explanation for the Deviation between f(s) and h(s). When Eqs. 1 and 2 are compared, it can be noticed that p - n terms have been neglected in the prediction. In the present case, where p = 235 and n = 107, 128 terms were neglected.

These incomplete coefficients are hence responsible for the convergence to other f values.

From the third observation above, it is most probable that some of the most heavily weighted terms are among the p - n terms omitted from the prediction equation, which contains only n terms.

The first n most heavily weighted terms can be obtained when PCA is carried out on a family that contains p mutually independent samples and the first n terms are then chosen.

The statement (E) which follows can thus be made:

FIG. 1. Mid-FT-IR (800-1250 cm⁻¹) spectra of a raw sugar cane juice sample.

PCA should be applied to a basis of S. The difficulty in applying this statement is that infinitely many bases of S exist. Looking for a basis is a complicated task. Moreover, to be able to carry out a prediction, the f values must be known. This is a very delicate task, since the sucrose content value cannot be obtained for an arbitrary spectrum. However, it is easier to obtain f on a system (S) of spectra. The cardinality of such a system can be much greater than p. One method consists of extracting a basis from this system of spectra. The difficulty now lies in verifying the independence of a family of spectra.

Proposition for an Approximate Method of Construction. A procedure for the construction of families that correspond as closely as possible to statement (E) is proposed. How can the dependence of a spectrum on a family of spectra be described? $s \in S$ is a linear combination of a family $F = (a_j)_{j = 1, \ldots, n}$ of spectra if and only if:

$$s = \sum_{j=1}^{n} \beta_j \, a_j$$

If the sucrose content values can be determined exactly with Eq. 2, the same can be applied to the spectrum s, by decomposing s with Eq. 3. In other words, for a given spectrum s, the more exact Eq. 2 is, the greater the chance that s is a linear combination of the spectra in the calibration family F.

Conversely and algebraically, the greater the deviation, the greater the independence of the spectrum s with regard to the family F. If a function $\delta$ is defined as $\delta = |f - h|$, $\delta$ can be considered a measure of the independence of a spectrum with respect to a family F of spectra. The aim of this manipulation is to find the family of spectra that best represents the space S: statement (E').

The spectra are very dependent on each other. Such a set can be approximated by a vector subspace of S. However, the dimension of this subspace cannot be easily determined. It can be assumed that, in order to obtain good precision, it is not necessary to use all p coefficients.

FIG. 2. Factorial map associated with the first two axes as assessed by PCA on the calibration set.

TABLE I. Chemical values of the samples constitutive of the different sub-classes.

Sucrose content (g/100 mL)

Sub-class 1 Sub-class 2 Sub-class 3 Sub-class 4

11.984 12.93

19.062 20.797 21.113 14.230 15.916 17.422 23.174 20.931 20.049 19.365 19.514 16.455 22.193 20.509 22.289 13.673 15.540 18.010 22.393 22.378 22.271 17.365 19.969 19.386 20.421 21.742 20.311 20.091 18.516 21.798 22.021 17.132 20.337 23.417 19.155

The construction algorithm is as follows: Let us consider a large family, F, of spectra, from which a smaller family will be extracted. A family, $F_2$, composed of the two most dissimilar (or most linearly independent) spectra, is built by choosing the pair of spectra from F whose scalar product is the minimum.
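The choice of the starting pair can be sketched in a few lines of Python. The function name `most_dissimilar_pair` and the toy spectra are hypothetical, and the sketch assumes the scalar product is taken on the spectra exactly as supplied, since the text does not state whether they are normalized beforehand.

```python
from itertools import combinations

def dot(u, v):
    """Scalar product of two spectra given as equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def most_dissimilar_pair(spectra):
    """Indices (i, j), i < j, of the two spectra whose scalar product is smallest."""
    return min(combinations(range(len(spectra)), 2),
               key=lambda pair: dot(spectra[pair[0]], spectra[pair[1]]))

# Hypothetical toy family of four 3-point "spectra".
F = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.9],
    [0.5, 0.5, 0.5],
]
F2 = most_dissimilar_pair(F)
```

Here the pair (0, 2) is selected because its scalar product (0.11) is the smallest of the six possible pairs. The exhaustive search costs on the order of n² scalar products, which is modest for a family of about a hundred spectra.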

The construction method is iterative: a progressive series $F_i$ of families is built so that

cardinal of $F_i$ = i (4)

$F_i$ is in accordance with the statement (E'). (5)

FIG. 3. Factorial map associated with the first two axes as assessed by PCA on the calibration set after classification.

One iterative step consists of:

(1) Applying PCA to the calibration family $F_i$.
(2) Finding a prediction equation with i terms.
(3) Calculating $\delta(s)$ for each spectrum $s \in F - F_i$.
(4) Choosing the spectrum $s_{i+1}$ which verifies:

$$\max_{s \in F - F_i} \delta(s) = F(i)$$

(5) Considering: $F_{i+1} = F_i + s_{i+1}$.

The algorithm can be stopped when F(i) < ε. It should be noted that the quality of $F_i$ strongly depends on the size of the family F. This is why family F should be as large as possible.
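The iterative construction can be outlined as a greedy loop. In the sketch below the PCA and PCR steps are deliberately hidden behind a caller-supplied function `deviation`, a hypothetical placeholder for "fit the prediction equation on the current family, then return the deviation |f(s) - h(s)| for a candidate s". The toy deviation used in the example (distance to the nearest selected point) is only a stand-in that makes the loop runnable; it is not the paper's measure.

```python
def build_family(F, seed, deviation, epsilon):
    """Greedy construction of a sub-family of F.

    F         : the large family (only its length is used here)
    seed      : indices of the starting family, for example the pair F2
    deviation : callable(family_indices, candidate_index) -> float
                (hypothetical stand-in for the PCA/PCR-based deviation)
    epsilon   : stop when the largest deviation falls below this value
    """
    family = list(seed)
    remaining = [i for i in range(len(F)) if i not in family]
    while remaining:
        scores = {i: deviation(family, i) for i in remaining}
        best = max(scores, key=scores.get)   # candidate with the largest deviation
        if scores[best] < epsilon:           # stopping rule
            break
        family.append(best)                  # the family grows by one spectrum
        remaining.remove(best)
    return family

# Toy illustration with one-dimensional "spectra" (made-up values).
points = [0.0, 10.0, 5.0, 5.2, 9.9, 2.5]

def toy_deviation(family, i):
    """Placeholder: distance from point i to the nearest selected point."""
    return min(abs(points[i] - points[j]) for j in family)

selected = build_family(points, seed=[0, 1], deviation=toy_deviation, epsilon=1.0)
```

With these toy values the loop adds indices 2 and 5 and then stops, because every remaining candidate lies within 1.0 of a selected point. In a real application each call to `deviation` would refit the prediction equation on the current family, so the cost grows with both the family size and the number of candidates.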

Samples Classification. From the classification procedure described above, a new set of samples that are the most dissimilar is constructed. These can be considered the starting point for the classification of all the other spectra.

Several classification procedures can be considered; 20 we have adopted the mobile centers method. The distance used is the most natural one known, that is, the Euclidian distance.
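A minimal sketch of a mobile centers classification with the Euclidian distance is given below. It assumes the usual formulation in which each spectrum is assigned to the nearest center and each center is then moved to the mean (gravity center) of its class; the function names, the stopping rule, and the toy data are illustrative assumptions rather than details taken from the original software.

```python
import math

def euclidean(u, v):
    """Euclidian distance between two spectra given as equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mobile_centers(spectra, seeds, max_iter=100):
    """Assign each spectrum to its nearest center, move each center to the
    mean of its class, and repeat until the assignment no longer changes."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: euclidean(s, centers[c]))
                      for s in spectra]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [s for s, lab in zip(spectra, labels) if lab == c]
            if members:  # an empty class keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy data: two well-separated groups of 2-point "spectra".
spectra = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = mobile_centers(spectra, seeds=[spectra[0], spectra[3]])
```

In the procedure described in the text, the seeds would be the most dissimilar individuals obtained from the construction algorithm rather than arbitrary spectra.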

In a second step, these classes were reduced to sub-classes according to the procedure described above, since some individuals do not bring additional information to their respective classes. Prediction equations are established by PCR on the most representative individuals, which constitute the sub-classes.

APPLICATION

From a calibration family composed of 107 individuals (calibration set), the four most dissimilar individuals are chosen. These individuals serve as the starting points for the distribution of the other 103 samples into the 4 classes thus formed.


TABLE II. Predicted sucrose content values before and after classification.

            Before classification     After classification
Reference   Predicted   Deviation     Sub-class   Predicted   Deviation
19.793      20.408       0.615        4           20.184       0.391
19.482      19.511       0.029        4           19.547       0.065
21.187      21.101      -0.086        2           20.738      -0.449
21.137      20.975      -0.162        2           21.041      -0.096
18.169      17.932      -0.237        4           18.141      -0.028
20.171      20.014      -0.157        4           19.789      -0.382
18.763      19.115       0.352        4           18.878       0.115
21.029      21.163       0.134        2           20.851      -0.178
22.144      22.042      -0.102        2           22.153       0.009
20.029      19.976      -0.053        2           20.167       0.138
21.487      21.511       0.024        2           21.337      -0.15
18.838      19.033       0.195        4           18.734      -0.104
21.699      21.607      -0.092        2           21.643      -0.056
19.373      19.773       0.4          4           19.38        0.007
23.035      23.352       0.317        2           23.339       0.304
21.017      21.343       0.326        2           21.232       0.215
21.197      21.234       0.037        2           21.174      -0.023
20.804      20.997       0.193        2           21.107       0.303
20.884      20.461      -0.423        2           20.632      -0.252
20.16       20.396       0.236        4           20.288       0.128
19.48       18.925      -0.555        4           19.235      -0.245
20.223      19.915      -0.308        4           20.214      -0.009
20.281      20.42        0.139        2           20.361       0.08
17.04       16.29       -0.75         4           16.412      -0.628
19.781      19.4        -0.381        4           19.41       -0.371
18.311      18.258      -0.053        4           18.528       0.217
17.588      17.677       0.089        4           17.767       0.179
21.581      21.953       0.372        2           21.945       0.364
24.273      23.954      -0.319        2           24.522       0.249
20.851      21.021       0.17         2           21.131       0.28
Mean                    -0.002                                 0.002
SD                       0.308                                 0.254

Classes 1, 2, 3, and 4 consist of 2, 54, 8, and 43 samples, respectively.

The factorial map associated with the first two axes of PCA on the calibration family is given in Fig. 2.

The different classes are represented on this map. In each class, some individuals are clearly redundant.

Crocombe et al. 5 showed the importance of the calibration set for the precision of the predicted values; in particular, the effect of over-determination of calibration samples is pointed out.

The construction procedure is applied to each of these classes. The most representative individuals for each class are retained, so that the number of individuals is reduced. The selection of the representative individuals is based on the SD (SD ≤ 0.2) of the deviations obtained by autoprediction (the individuals from a class are predicted from the sub-class that is being elaborated).

The composition of each sub-class is given in Table I. From 107 spectra, the number of individuals is reduced to 37.

The factorial map from this new family is given in Fig. 3. The samples are distributed into 4 distinct classes that contain only the most representative individuals and are distributed along a sucrose concentration gradient.

The sucrose content of the samples from the verification set was predicted (1) before classification, in which case a prediction equation was established by PCR on the whole calibration set (107 individuals); and (2) after classification, from the prediction equations established from each sub-class.

The observed and predicted values, together with the difference between observed and predicted values before and after classification, are given in Table II. The sub-classes associated with each spectrum are also given.

The predicted values are improved after classification. The mean and standard deviation of the difference between observed and predicted values are, respectively, -1.62 × 10⁻³ and 0.308 before classification and 2.38 × 10⁻³ and 0.254 after classification.
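The summary statistics can be recomputed from the deviations listed in Table II. The sketch below uses Python's `statistics` module and assumes that the SD is the sample standard deviation (n - 1 denominator), which is an inference from the numbers rather than something stated in the text; the variable names are illustrative.

```python
import statistics

# Deviations (predicted minus reference) transcribed from Table II.
before = [0.615, 0.029, -0.086, -0.162, -0.237, -0.157, 0.352, 0.134, -0.102,
          -0.053, 0.024, 0.195, -0.092, 0.4, 0.317, 0.326, 0.037, 0.193,
          -0.423, 0.236, -0.555, -0.308, 0.139, -0.75, -0.381, -0.053, 0.089,
          0.372, -0.319, 0.17]
after = [0.391, 0.065, -0.449, -0.096, -0.028, -0.382, 0.115, -0.178, 0.009,
         0.138, -0.15, -0.104, -0.056, 0.007, 0.304, 0.215, -0.023, 0.303,
         -0.252, 0.128, -0.245, -0.009, 0.08, -0.628, -0.371, 0.217, 0.179,
         0.364, 0.249, 0.28]

mean_before = statistics.mean(before)
sd_before = statistics.stdev(before)   # sample SD, n - 1 denominator
mean_after = statistics.mean(after)
sd_after = statistics.stdev(after)

print(f"before: mean = {mean_before:.5f}, SD = {sd_before:.3f}")
print(f"after:  mean = {mean_after:.5f}, SD = {sd_after:.3f}")
```

The results should be close to the reported means and SDs; small differences in the means are expected because the tabulated deviations are rounded to three decimals.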

It can be noticed that some individuals have sucrose content values that differ from those of the other spectra of their respective sub-classes. For example, in sub-class 3, one individual has a sucrose content value of 19.365%, while the other individuals do not exceed 17.36%. These classifications, a priori surprising, suggest that the distance used in the classification procedure is not adequate; the Euclidian distance with respect to the gravity center of the corresponding class is not the most appropriate distance in this case. The major difficulty in every classification procedure lies in the choice of an appropriate mathematical distance. 21-23 Other mathematical distances should be looked for, and this area is currently being investigated.

CONCLUSION

The classification procedure developed in this paper allowed, first, a qualitative classification of mid-FT-IR spectra of biological samples without knowledge of their chemical composition and, second, the improvement of the precision of quantitative results.

Improving precision is fundamental in cases where quantitative analysis of biological constituents is used as the basis for determining the market value of the product.

ACKNOWLEDGMENTS

This work was supported by a grant from the Ministère de la Recherche et de la Technologie and the Conseil Régional de la Réunion.

1. B. G. Osborne and T. Fearn, Near Infrared Spectroscopy in Food Analysis (Longman Scientific and Technical, Wiley and Sons, New York, 1986).

2. B. G. Osborne, J. Food Technol. 16, 13 (1981).

3. P. Williams and K. Norris, Near Infrared Technology in the Agricultural and Food Industry (American Association of Cereal Chemists, Saint Paul, Minnesota, 1987).

4. C. Depecker, P. Legrand, J. C. Merlin, and B. Sombret, "Contribution de la Réflexion Diffuse Infrarouge à l'étude des Composés Biologiques", in Spectroscopy of Biological Molecules, A. J. P. Alix, L. Bernard, and M. Manfait, Eds. (John Wiley and Sons, Chichester, 1985), p. 69.

5. R. A. Crocombe, N. L. Olson, and S. L. Hills, Quantitative Fourier Transform Methods for Real Complex Samples (American Society for Testing and Materials, Philadelphia, Pennsylvania, 1987), p. 95.

6. F. R. Van de Voort and J. Sedman, Food Chem. 48, 213 (1993).

7. F. Cadet, D. Bertrand, P. Robert, J. Maillot, J. Dieudonné, and C. Rouch, Appl. Spectrosc. 45, 166 (1991).

8. M. K. Antoon, L. D'Esposito, and J. L. Koenig, Appl. Spectrosc. 33, 351 (1979).

9. P. C. Gillette and J. L. Koenig, Appl. Spectrosc. 36, 535 (1982).

10. D. Bertrand, P. Robert, and V. Tran, "Traitements Mathématiques des Spectres NIR de Mélanges", in 11ème Congrès de l'Association Internationale de Chimie Céréalière, Vienna, Austria (1984).

11. I. A. Cowe and J. W. McNichol, Appl. Spectrosc. 39, 257 (1985).

12. F. Cadet and B. Offmann, J. Agric. Food Chem., revised paper submitted for publication (1996).

13. J. Le Nouvel, Thesis of third cycle, Université de Rennes I, France (1981).

14. M. F. Devaux, D. Bertrand, P. Robert, and M. Qannari, Appl. Spectrosc. 42, 1015 (1988).

15. J. Lefebvre, Introduction aux Analyses Statistiques Multidimensionnelles (Masson, Paris, 1983), p. 137.

16. L. Lebart, A. Morineau, and N. Tabard, Techniques de la Description Statistique (Dunod, Paris, 1977), p. 7.

17. P. Dagnelie, Analyse Statistique à Plusieurs Variables (Les Presses Agronomiques de Gembloux, Belgium, 1975), p. 185.

18. D. Bertrand, M. Lila, V. Furtoss, P. Robert, and G. Downey, J. Sci. Food Agric. 41, 299 (1987).

19. D. Bertrand, P. Robert, M. F. Devaux, and J. Abecassis, "Assignment of Near-Infrared Absorption Bands by Multidimensional Analyses of Spectral Data", in Analytical Applications of Spectroscopy, C. S. Creaser and A. M. C. Davies, Eds. (Royal Society of Chemistry, London, 1988), p. 450.

20. M. Roux, Algorithmes de Classification (Masson, Paris, 1985).

21. J. H. Ward, J. Am. Statist. Assoc. 58, 236 (1963).

22. D. Bertrand, P. Robert, and C. Rouxel, "Hierarchical Clustering of NIR Spectra", in Proceedings of the 2nd International NIRS Conference, Tsukuba, Japan (1989), p. 241.

23. M. A. Sharaf, D. L. Illman, and B. R. Kowalski, in Chemometrics, P. J. Elving and J. D. H. Winefordner, Eds. (John Wiley and Sons, New York, 1986), p. 82.
