
Comparison of Pattern Recognition Techniques for Sample Classification Using Elemental Composition: Applications for ICP-AES

WAYNE BRANAGH, HUINAN YU,* and ERIC D. SALIN†
Department of Chemistry, McGill University, Montreal, Quebec, H3A 2K6, Canada

Received 25 July 1994; accepted 23 March 1995.
* Current address: Department of Chemistry, University of California, Santa Barbara, CA.
† Author to whom correspondence should be sent.

Pattern recognition is very important for many aspects of data analysis and robotic control. Three pattern recognition techniques were examined: k-Nearest Neighbors, Bayesian analysis, and the C4.5 inductive learning algorithm. Their abilities to classify 71 different reference materials were compared. Each training and test example consisted of 79 different elemental concentrations. Different data sets were generated with relative standard deviations of 1, 3, 5, 10, 30, 100, and 500%. Each data set consisted of 2000 examples. These sets were used in both the training stages and in the test stages. It was found that C4.5's inductive learning algorithm had a higher classification accuracy than either Bayesian or k-Nearest Neighbors techniques, especially when large amounts of noise were present in the systems.

Index Headings: Pattern recognition; Induction; ICP-AES.

INTRODUCTION

In several recent papers 1,2 we have discussed the concept of the "autonomous instrument". Among the characteristics expected of such an instrument are:

• general sample recognition
• the ability to learn from experience
• the ability to choose an appropriate calibration methodology.

We recently discussed the development of an expert system specifically to address the last characteristic; 1 however, we did not address in any detail the question, How will the methodology expert system recognize a sample? We did discuss the potential power of a semi-quantitative scan, but we did not describe how that data might be used in a decision-making process. Recognizing patterns and working by analogy are important human capabilities. With a human expert operator, one could envision a number of reasonable (but not exhaustive) scenarios for selection of operating conditions and calibration methodology (e.g., standard additions, internal standards):

1. Using the sample description, recognizing that it belongs to a certain sample class (e.g., iron ore), and knowing that the sample class has been successfully run in the past with the use of a certain methodology, the operator adopts the same methodology.

2. By doing a preliminary analysis (semi-quantitative scan), the operator recognizes that the apparent elemental composition is similar to a sample class that has been successfully run with the use of a certain methodology. The operator adopts the same methodology.


The first case involves a simple database look-up. The second is considerably more interesting. While humans have "judgement", "experience", and "common sense", computers do not. Computers do, however, have the ability to perform highly complex mathematical operations in a short time. In the following paper we have evaluated the potential of several numeric processing techniques for the identification of samples. We have selected a wide variety of reference materials for use as potential sample types ranging from clinical, through botanical, to geological. We used their reported elemental concentrations and relative standard deviations to generate validation (test) sets and training sets. We were interested in both the general classification capability (e.g., soil vs. plant vs. alloy) and the fine discerning power (high-purity steel vs. mild steel) of these processing techniques.

There is a wide variety of pattern recognition techniques available. We selected two common, powerful, "chemometric"-type techniques and a pattern recognition technique whose origin is found in the artificial intelligence (A.I.) literature. All these techniques are "robust" techniques in that they have a tolerance for noise. Of the chemometric techniques, k-Nearest Neighbors 3 and Bayesian analysis 4 seemed well suited to the task. Inductive learning 5,6 is the A.I.-based technique included in the comparison. Their ability to classify samples with different amounts of noise (i.e., different standard deviations) was compared with the use of simulated data based on a variety of sample types.

BACKGROUND

Since some readers may be unfamiliar with the techniques, each will be discussed briefly. While the techniques vary in their approach, all have certain common characteristics:

1. They use supervised learning; i.e., the examples used in training must have predefined classes and have been classified beforehand.

2. The classes must be discrete. For a set of conditions, an example either belongs to a class or it doesn't.

3. The data attributes must make up a fixed collection. A training or test example cannot have variable structure. All examples must have the same number and type of attributes.

Each technique uses a training set and a test set of unseen values.


The training set consists of a representative sampling of training examples, with each training example consisting of a number of attributes, a1 to an, and the class the example belongs to:

(a1, b1, c1, ..., z1, class 1)
(a2, b2, c2, ..., z2, class 1)
(a3, b3, c3, ..., z3, class 2)
...
(an, bn, cn, ..., zn, class n).

For example, if class 1 was a particular soil type, a1 might be the concentration of iron for that soil standard, b1 might be the concentration of lead in that soil standard, and so on. The concentrations of iron and lead in the second example would be a2 and b2. This example also belongs to the first class. In a training set, there can be several examples of the same class; i.e., there can be multiple examples of a particular type of alloy or soil in the training set.
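As a concrete illustration of this layout (a minimal sketch only; the element symbols, concentrations, and class names below are hypothetical, and the authors' actual programs were written in Pascal and C), each example can be represented as a fixed-length list of attribute values paired with a class label:

    # Minimal sketch of the training-example layout described above.
    # Every example carries the same fixed set of attributes plus a class label.
    ATTRIBUTES = ["Fe", "Pb", "Zn"]        # the a, b, ..., z of the text (hypothetical subset)

    training_set = [
        #  attribute values (same order as ATTRIBUTES)   class
        ([52000.0, 35.0, 120.0],                          "soil SO-2"),   # example 1, class 1
        ([48000.0, 41.0, 110.0],                          "soil SO-2"),   # example 2, same class
        ([310000.0, 2.0, 15.0],                           "iron ore"),    # example 3, class 2
    ]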

k-Nearest Neighbors. k-Nearest Neighbors 3 (kNN) is a very simple technique that classifies an example using its k nearest neighboring examples. To decide which class a test example belongs to, one calculates the Euclidean distance between it and each training example. Considering the first two examples, the distance between them can be calculated as:

d_{12} = \sqrt{\Delta a_{12}^2 + \Delta b_{12}^2 + \Delta c_{12}^2 + \cdots + \Delta z_{12}^2}.     (1)

The nearest k neighbors are examined, and the class to which the majority of training examples belong is assigned to the test example. In the case of a tie, the closest of the two classes is selected. In the present work, the nearest 10 neighboring samples were consulted.
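A minimal sketch of this procedure (an illustrative Python reimplementation, not the authors' Turbo Pascal program) using the Euclidean distance of Eq. 1 and a majority vote over the k = 10 nearest neighbors, with ties broken by the closest neighbor:

    import math
    from collections import Counter

    def knn_classify(test_attrs, training_set, k=10):
        """Classify one example by majority vote among its k nearest training examples."""
        # Euclidean distance (Eq. 1) to every training example.
        distances = sorted(
            (math.sqrt(sum((t - a) ** 2 for t, a in zip(test_attrs, attrs))), label)
            for attrs, label in training_set
        )
        nearest = distances[:k]
        votes = Counter(label for _, label in nearest)
        best_count = max(votes.values())
        tied = {label for label, count in votes.items() if count == best_count}
        # In a tie, take the tied class that contributes the single closest neighbor.
        for _, label in nearest:                  # already sorted by distance
            if label in tied:
                return label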

Bayesian Classification. Bayesian classification 4 is a probabilistic technique of pattern recognition based on the assumptions that the decision problem is posed in probabilistic terms and that all the relevant probability values are known. By using the mean and standard deviation of training examples for each class and knowing the distribution form of the training examples, it is possible to determine the probability of a test example belonging to a particular class. In the present work, the distributions were assumed to be Gaussian in nature.
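A sketch of this idea (again illustrative Python rather than the original program), estimating a Gaussian mean and standard deviation per attribute within each class and picking the class with the highest likelihood; treating the attributes as independent is our simplification, since the paper does not state how correlations were handled:

    import math
    from collections import defaultdict

    def train_bayes(training_set):
        """Per-class mean and standard deviation of every attribute."""
        by_class = defaultdict(list)
        for attrs, label in training_set:
            by_class[label].append(attrs)
        stats = {}
        for label, rows in by_class.items():
            columns = list(zip(*rows))
            means = [sum(col) / len(col) for col in columns]
            sds = [max(1e-9, math.sqrt(sum((x - m) ** 2 for x in col) / len(col)))
                   for col, m in zip(columns, means)]
            stats[label] = (means, sds)
        return stats

    def bayes_classify(test_attrs, stats):
        """Choose the class with the largest Gaussian log-likelihood."""
        def log_likelihood(means, sds):
            return sum(-math.log(s) - 0.5 * ((x - m) / s) ** 2
                       for x, m, s in zip(test_attrs, means, sds))
        return max(stats, key=lambda label: log_likelihood(*stats[label]))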

C4.5 Inductive Learning. Inductive learning is an approach commonly used to generate rule-based expert systems. It is, in fact, a powerful pattern recognizer and operates quite differently from neural networks, another "self-teaching" expert system technology. Despite its widespread use in other areas, inductive learning has received very little attention in the chemical literature. As with human induction, it involves a process of examining specific examples and inferring general relationships about these examples.

We have used the C4.5 induction engine developed by Quinlan. 5 The output of the C4.5 induction algorithm is an induction decision tree (Fig. 1). Consider the task of classifying several different alloys on the basis of their major elemental components. If one were to input training examples of elemental compositions for different alloys into an induction system, one would end up with a tree structure similar to the one in Fig. 1.

FIG. 1. Typical C4.5 decision tree.

To "read" it, one starts at the top of the tree (TOP) with an example to be classified. At each node or connection point, a decision is made regarding some aspect of the example being classified, and the appropriate branch is taken. The first decision made in Fig. 1 is based on the concentration of chromium in the sample. If the concentration of chromium in the alloy is less than or equal to 17.15%, then the left subtree is examined and another decision has to be made: Is the concentration of chromium less than or equal to 9.04%? If so, then the alloy is classified as "steel"; if not, then further decisions must be made. When one reaches a circular end node (leaf), i.e., the class, the classification of the alloy is complete. Classifying with a decision tree is computationally trivial and consequently very fast.
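To make this concrete, the fragment below walks a small hand-built tree encoding the two chromium thresholds mentioned above; the node layout and the non-steel branch labels are our own invention for illustration:

    # A leaf is a class name; an internal node is (attribute, threshold, left, right),
    # where the left branch is taken when the value is <= the threshold (as in Fig. 1).
    tree = ("Cr", 17.15,
            ("Cr", 9.04, "steel", "other alloy (further tests in the real tree)"),
            "high-Cr alloy (further tests in the real tree)")

    def classify(example, node):
        while not isinstance(node, str):          # descend until a leaf (class) is reached
            attribute, threshold, left, right = node
            node = left if example[attribute] <= threshold else right
        return node

    print(classify({"Cr": 8.2}, tree))            # -> steel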

In order to generate a decision tree, C4.5 applies information theory to the training examples to determine which data attribute best divides the training set into distinct classes. It divides the training set of examples so that any pattern in the data is made apparent, resulting in a hierarchical structure of decisions. Using this decision tree, one can classify future, as yet unseen, samples. In order to be effective, a minimum number of divisions must be made, and there must be a reasonable number of examples present at each leaf of the tree.

C4.5 looks at the gain ratio generated from a particular division of the training cases. To understand the gain ratio and how a set of cases is divided into subsets, consider the data present in Table I. Suppose one were interested in determining whether a sample to be analyzed fits into the class "simple" or "difficult" to analyze. C4.5 contains mechanisms for dealing with both symbolic and numeric data attributes so that any combination of actual measurements and descriptions could be used. For ease of demonstration, Table I consists of discrete, symbolic data, i.e., names only. Numeric values would be dealt with analogously with the use of ranges to group numeric values into discrete attributes.

In this demonstration set, there are 14 examples of chemical samples that have been analyzed. Some samples have been easy to analyze (simple), while others have been more problematic (difficult). The decision about whether a sample is simple or difficult to analyze is made


TABLE I. C4.5 induction example.

Client      Source   pH     Appearance   Difficulty of analysis
NTEX        River    7-8    Cloudy       Simple
NTEX        Mine     7-8    Cloudy       Difficult
NTEX        Lake     8-9    Clear        Difficult
NTEX        Mine     7-8    Clear        Difficult
NTEX        River    6-7    Clear        Simple
Alumco      Mine     7-8    Cloudy       Simple
Alumco      Lake     8-9    Clear        Simple
Alumco      River    6-7    Cloudy       Simple
Alumco      Mine     8-9    Clear        Simple
Tracelab    Lake     7-8    Cloudy       Difficult
Tracelab    River    6-7    Cloudy       Difficult
Tracelab    Mine     7-8    Clear        Simple
Tracelab    Lake     6-7    Clear        Simple
Tracelab    River    7-8    Clear        Simple

by a human operator when preparing a training set for C4.5. Each sample contains four attributes that the human analyst wants C4.5 to use to classify samples--client, source, pH, and appearance--and a classification with respect to whether the sample was "simple" or "difficult" to analyze.

If it were applied to the Table I data, C4.5 would initially determine the information present in the entire training set. This is done by calculating the fraction of examples that belong to a class and multiplying it by the information of that fraction:

\text{info}(S) = -\sum_{j=1}^{k} \frac{|C_j|}{|S|} \log_2 \frac{|C_j|}{|S|}     (2)

where |C_j| is the number of examples belonging to class j; |S| is the total number of examples in the training set S; and k is the number of different classes. For the set of examples in Table I, there are nine examples that are classified as "simple" and five that are classified as "difficult". The information present in this set is:

\text{info}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}.     (3)

The unit of information is the bit. The bit is a scalar value used here to measure the usefulness of a particular division of data. If there were an even split of seven examples in each class, there would be more information in the set (1.0 bit). If the class sizes were very different (13 examples in one class and one example in the other), then little information would exist in the set (0.37 bits).
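The three values quoted above can be checked with a few lines (a sketch of Eq. 2, not the C4.5 source):

    import math

    def info(class_counts):
        """Entropy of a set of examples, Eq. 2: -sum (|Cj|/|S|) log2 (|Cj|/|S|)."""
        total = sum(class_counts)
        return -sum((c / total) * math.log2(c / total)
                    for c in class_counts if c > 0)

    print(round(info([9, 5]), 3))    # 0.94  -- the Table I training set
    print(round(info([7, 7]), 3))    # 1.0   -- even split
    print(round(info([13, 1]), 3))   # 0.371 -- very uneven split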

C4.5 then looks at each of the attributes and calculates the information that would be gained if the training set were divided up according to that attribute. For this subdivided set of samples, the information can be calculated as follows:

\text{info}_x(S) = \sum_{i=1}^{k} \frac{|S_i|}{|S|} \times \text{info}(S_i)     (4)

where info(S_i) is the information in subset S_i, calculated as above; x is the attribute used to subdivide the set; S is the original set of examples; and k is the number of different values of the attribute x.

Consider the first attribute, client. There are three possible values for client: NTEX, Alumco, and Tracelab. For NTEX there are five samples; two are simple and three are difficult. For Alumco, there are four samples with all four being simple. For Tracelab, there are five samples with three being simple and two being difficult. This setup would result in the following information for the subset:

\text{info}_{\text{client}}(S) = \text{info}_{\text{client}}(\text{NTEX}) + \text{info}_{\text{client}}(\text{Alumco}) + \text{info}_{\text{client}}(\text{Tracelab})

\text{info}_{\text{client}}(S) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right)

\text{info}_{\text{client}}(S) = 0.694 \text{ bits}.     (5)

The information that would be gained if the set of samples were divided up according to the client attribute is:

\text{gain}_{\text{client}}(S) = \text{info}(S) - \text{info}_{\text{client}}(S) = 0.940 - 0.694 = 0.246 \text{ bits}.     (6)
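Continuing with the same worked numbers, the weighted information of Eq. 4 and the gain of Eq. 6 for the client attribute can be reproduced as follows (an illustrative check; the unrounded gain comes out as 0.247, versus 0.246 when the rounded values 0.940 and 0.694 are subtracted as in Eq. 6):

    import math

    def info(class_counts):
        total = sum(class_counts)
        return -sum((c / total) * math.log2(c / total)
                    for c in class_counts if c > 0)

    # (simple, difficult) counts within each value of the client attribute (Table I)
    client_subsets = {"NTEX": (2, 3), "Alumco": (4, 0), "Tracelab": (3, 2)}
    n_total = 14

    # Eq. 4: information of the subdivision, weighted by subset size
    info_client = sum((sum(counts) / n_total) * info(counts)
                      for counts in client_subsets.values())
    print(round(info_client, 3))                    # 0.694

    # Eq. 6: gain from splitting on the client attribute
    print(round(info((9, 5)) - info_client, 3))     # 0.247 (0.246 from the rounded values)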

Gain is a good measure of which attribute should be used to divide the data set and was used by algorithms developed prior to C4.5. However, there is a problem. If a particular division generates many subsets, it will generate a higher gain value even though it may not be the most desirable split. In order to prevent the process from being biased in favor of divisions with many results, a split information value is calculated for each of the attributes. The split information value performs a sort of normalization with respect to the number of results. For the client attribute:

\text{split info}(\text{client}) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} = 1.577 \text{ bits}.     (7)

The gain ratio is then calculated:

gain(client) gain ratio(clien0 =

split info(client)

0.246 - 1.57---7 - 0.156. (8)

This process is performed for all the attributes. For the set of samples in Table I, the gain ratios are as follows: gain ratio(client) = 0.156; gain ratio(source) = 0.031; gain ratio(pH) = 0.012; and gain ratio(appearance) = 0.075.
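The split information and gain ratio of Eqs. 7 and 8 follow the same pattern; the split information is simply the entropy of the subset sizes (5, 4, and 5 examples for the three clients). Again, this is an illustrative check rather than the C4.5 source:

    import math

    def info(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    subset_sizes = [5, 4, 5]        # NTEX, Alumco, and Tracelab rows in Table I
    gain_client = 0.246             # gain for the client attribute, Eq. 6

    split_info = info(subset_sizes)                 # Eq. 7
    print(round(split_info, 3))                     # 1.577

    print(round(gain_client / split_info, 3))       # 0.156 -- Eq. 8, the gain ratio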

Since the client attribute yields the largest gain ratio, it is used to subdivide the set of examples (Table II).


TABLE II. Subdivided induction examples.

Client      Source   pH     Appearance   Difficulty of analysis

NTEX        River    7-8    Cloudy       Simple
NTEX        Mine     7-8    Cloudy       Difficult
NTEX        Lake     8-9    Clear        Difficult
NTEX        Mine     7-8    Clear        Difficult
NTEX        River    6-7    Clear        Simple

Alumco      Mine     7-8    Cloudy       Simple
Alumco      Lake     8-9    Clear        Simple
Alumco      River    6-7    Cloudy       Simple
Alumco      Mine     8-9    Clear        Simple

Tracelab    Lake     7-8    Cloudy       Difficult
Tracelab    River    6-7    Cloudy       Difficult
Tracelab    Mine     7-8    Clear        Simple
Tracelab    Lake     6-7    Clear        Simple
Tracelab    River    7-8    Clear        Simple

This entire process of examining and comparing possible subdivisions, then dividing up the remaining data into further subgroups, is applied to each of the resulting subsets and continues until further divisions yield no improvement in the information gain, or until a threshold number of examples is present in a subset. At this point, C4.5 is finished and the decision tree has been generated. The final tree is shown in Fig. 2. The tree could now be used to classify novel samples as being simple or difficult to analyze with only the four attributes discussed above.
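In compressed form, the recursive procedure just described might look like the sketch below for purely symbolic attributes (the stopping threshold, the helper names, and the tuple-based tree representation are our own; real C4.5 additionally handles numeric thresholds, missing values, and pruning):

    import math
    from collections import Counter

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gain_ratio(examples, attr):
        """Gain ratio for splitting examples [(attr_dict, class), ...] on one attribute."""
        groups = {}
        for attr_dict, cls in examples:
            groups.setdefault(attr_dict[attr], []).append(cls)
        n = len(examples)
        base = entropy(Counter(cls for _, cls in examples).values())
        info_x = sum(len(g) / n * entropy(Counter(g).values()) for g in groups.values())
        split_info = entropy([len(g) for g in groups.values()])
        return (base - info_x) / split_info if split_info else 0.0

    def build_tree(examples, attributes, min_examples=2):
        classes = Counter(cls for _, cls in examples)
        majority = classes.most_common(1)[0][0]
        if len(classes) == 1 or len(examples) <= min_examples or not attributes:
            return majority                                   # leaf
        best = max(attributes, key=lambda a: gain_ratio(examples, a))
        if gain_ratio(examples, best) <= 0:
            return majority                                   # no further improvement
        branches = {}
        for attr_dict, cls in examples:
            branches.setdefault(attr_dict[best], []).append((attr_dict, cls))
        remaining = [a for a in attributes if a != best]
        return (best, {value: build_tree(subset, remaining, min_examples)
                       for value, subset in branches.items()})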

A more detailed tutorial on inductive systems is available, 6 as is more thorough background on the C4.5 algorithm. 5

EXPERIMENTAL

The data sets used in this work consisted of the simulated elemental compositions for 71 different reference materials (Table III). Each example consisted of 79 elemental concentrations. With the use of a program written in our lab, the training and test sets were generated 8 on the basis of reference material compositions. Unfortunately, the concentration of each of the 79 elements was not available for each reference material. Since all the pattern recognition techniques that were used required all attributes to be known or, in the best case, the overwhelming majority of attributes to be known, the unknown attributes needed to be set to some value. Rather than guess at their true values, we set unknown attributes to both 0 and an arbitrarily large value, 10,000. Therefore, a training set with a total of 2000 examples would have 1000 examples with unknown concentrations set to 0, and 1000 examples with unknown concentrations set to 10,000. Since each unknown attribute in the training set has both a high and a low value, it was reasoned that the pattern recognition techniques would not rely on these attributes to make a classification.

Two "seed" examples were generated for each type of reference material--one with unknown values set to 0, one with unknown values set to 10,000. These were then used to generate a larger number of examples used in training and testing. It was assumed that the distribution of each elemental concentration in an example would be Gaussian in nature with a particular standard deviation. Each training and validation data set contained two thousand examples, and data sets with relative standard deviations (RSDs) of 1, 3, 5, 10, 30, I00, and 500% were generated.

FIG. 2. C4.5 example, final decision tree (the root node tests the client attribute).

Computer programs for data generation, Bayesian classification, and kNN classification were all written in Borland International's Turbo Pascal 7.0. C4.5 was written in Watcom International's C/C++ v. 10.0 for OS/2. All computer programs were executed on a 33-MHz 486 PC-type computer.

RESULTS

Since kNN classification and Bayesian classification are relatively conventional techniques, their performance will be compared with the somewhat novel technique of inductive learning.

As one can see in Figs. 3 and 4, kNN and Bayesian classification performed best (i.e., classified the test set most correctly) when the training and test sets had similar standard deviations or the training set contained more noise than the test set. C4.5 (Fig. 5) outperformed both of these techniques, classifying ~100% of the examples in the test set, while both kNN and Bayesian techniques classified ~73% of test examples when small standard deviations were present in the training and test sets. Both Bayesian and C4.5 performed poorly when the training set had a small standard deviation while the test sets had a large standard deviation. This is what one would expect. If the training groups are well defined while the test sets are very ambiguous, then correct identification would be more difficult and the number of correct identifications would be lower.

FIG. 3. Reference material classification using k-Nearest Neighbors (axes: % deviation in training set, % deviation in test set).


TABLE III. Reference materials classified.

Reference material                 Source of composition information        Number of examples in training and test sets

Flint Clay 97b                     NIST                                     10
Plastic Clay 98b                   NIST                                     10
Brick Clay 979                     NIST                                     10
Portland Cement Black 1880         NIST                                     10
Portland Cement White 1881         NIST                                     10
Portland Cement Orange 1882        NIST                                     10
Portland Cement Silver 1883        NIST                                     10
Portland Cement Ivory 1884         NIST                                     10
Glass K-456                        NIST                                     10
Glass K-493                        NIST                                     10
Glass K-523                        NIST                                     10
Glass K-453                        NIST                                     10
Glass K-491                        NIST                                     10
Glass K-968                        NIST                                     10
Mild Steel                         Hieftje 7                                30
High Si Steel                      Hieftje 7                                30
High Mn Steel                      Hieftje 7                                30
Tool Steel                         Hieftje 7                                30
Soil SO-2                          CANMET                                   10
Soil SO-3                          CANMET                                   10
Soil SO-4                          CANMET                                   10
Lake Sediment LKSD-1               CANMET                                   10
Lake Sediment LKSD-2               CANMET                                   10
Lake Sediment LKSD-3               CANMET                                   10
Lake Sediment LKSD-4               CANMET                                   10
Iron Formations FeR-1              Geolog. Survey of Canada                 10
Iron Formations FeR-2              Geolog. Survey of Canada                 10
Iron Formations FeR-3              Geolog. Survey of Canada                 10
Iron Formations FeR-4              Geolog. Survey of Canada                 10
Harbor Sediment MESS-1             NRC Marine Analytical Chem Stds Prg      10
Harbor Sediment BCSS-1             NRC Marine Analytical Chem Stds Prg      10
Harbor Sediment PACS-1             NRC Marine Analytical Chem Stds Prg      10
Lobster Tomalley TORT-1            NRC Marine Analytical Chem Stds Prg      10
Lobster Hepatopancreas LUTS-1      NRC Marine Analytical Chem Stds Prg      10
Stream Sediment STSD-1             CANMET                                   10
Stream Sediment STSD-2             CANMET                                   10
Stream Sediment STSD-3             CANMET                                   10
Stream Sediment STSD-4             CANMET                                   10
Freeze Dried Urine 2670            NIST                                     20
Leaded Steel                       Hieftje 7                                50
High-Purity Steel                  Hieftje 7                                40
Carbon Steel                       Hieftje 7                                40
Mild Steel                         Hieftje 7                                50
Ca Treated Steel                   Hieftje 7                                20
Resulfurated Steel                 Hieftje 7                                30
Low-Alloy Steel                    Hieftje 7                                30
Coal Fly Ash 1633a                 NIST                                     10
Coal Fly Ash 2690                  NIST                                     10
Coal Fly Ash 2691                  NIST                                     10
Dogfish Muscle DORM-1              NRC Marine Analytical Chem Stds Prg      10
Dogfish Liver DOLT-1               NRC Marine Analytical Chem Stds Prg      10
Gold Ore MA1b                      CANMET                                   10
Gold Ore MA3                       CANMET                                   10
Gold Ore CH1                       CANMET                                   10
Gold Ore CH2                       CANMET                                   10
Apple Leaves 1515                  NIST                                     10
Peach Leaves 1547                  NIST                                     10
Pine Needles 1575                  NIST                                     10
Spruce Needles CLV2                CANMET                                   10
Spruce Twigs CLV1                  CANMET                                   10
Corn Stalk RM 8412                 NIST                                     10
Corn Kernel RM 8413                NIST                                     10
Bituminous Coal 1632b              NIST                                     10
Sub-bituminous Coal 1635           NIST                                     10
Fuel Oil 1634b                     NIST                                     10
Bovine Liver 1577b                 NIST                                     10
Rice Flour 1568b                   NIST                                     10
Oyster Tissue 1566a                NIST                                     10
Nonfat Milk Powder 1549            NIST                                     10
Uranium Ore BL-2a, BL-4a           CANMET                                   20
U/Th Ore DL-1a, DH-1a              CANMET                                   20


FIG. 4. Reference material classification using Bayesian classifier (axes: % deviation in training set, % deviation in test set).

FIG. 5. Reference material classification using C4.5 inductive learning (axes: % deviation in training set, % deviation in test set).

k-Nearest Neighbor classification also performed poorly in these cases but degraded more gracefully, yielding better classifications than the Bayesian classification when the test sets contained large amounts of noise (Fig. 3). kNN performed least effectively when it was trained with a large amount of noise, i.e., >100% RSD.

In actual analyses, the accuracy of determination depends on the concentration of the analyte, the element being analyzed, and the method of sample preparation. Reported standard deviations for elements in the materials listed varied from <1 to ~30% RSD. Since standard deviations were not listed for all of the elements in a material, it was assumed that all analytes would have a standard deviation of ~30% RSD. With highly conservative training sets (elemental standard deviations of 30%), it can be seen that kNN and C4.5 performed well, correctly classifying 72% and ~100%, respectively, of the test set examples. Bayesian analysis had the poorest performance, classifying only about 46% of the test set examples correctly under these rather stringent conditions.

In test sets where the classification schemes failed, misclassifications occurred between very similar compounds. For instance, apple leaves and peach leaves would be confused, as would two types of gold ores or two types of steels. Misclassifications between "coarse" material classes (e.g., organics vs. alloys) were much less common.

Since these techniques can also be used in data exploration, it is important to consider the output generated by each of them. kNN generates a list of the k nearest samples and a measure of the distance between the example being classified and its neighbors. Bayesian classification yields an answer and a measure of the certainty with which the answer is correct. C4.5 produces a decision tree (Fig. 1) and, optionally, a list of rules showing how the classification was produced. For performing data exploration, determining what characteristics of the data are being used in the classification can be extremely important. C4.5 provides clear rules demonstrating what attributes (elements in this example) are being used in the classification and what levels of those elements are used in the decision-making process.

Placing these rules in a computer program is trivial. The utility of this information should not be underestimated. For example, the information in a classification tree could be used to decide in real time the order in which elements would be analyzed in a sequential spectrometer system. Such a scheme would drastically reduce the number of elemental determinations (spectral lines) that might be required during a semiquantitative scan.
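As a small illustration of that point (using the same hypothetical tuple-based tree layout as the sketch in the Background section, not the authors' software), the elements a tree actually consults, in the order it first consults them, can be read off directly:

    # Leaf = class name; internal node = (element, threshold, left, right).
    # The example tree below is hypothetical.
    tree = ("Cr", 17.15,
            ("Cr", 9.04, "steel", ("Ni", 8.0, "alloy A", "alloy B")),
            "high-Cr alloy")

    def elements_used(node, seen=None):
        """List, in order of first use, the elements the tree tests."""
        seen = [] if seen is None else seen
        if isinstance(node, str):                 # leaf: nothing to measure here
            return seen
        element, _, left, right = node
        if element not in seen:
            seen.append(element)
        elements_used(left, seen)
        elements_used(right, seen)
        return seen

    print(elements_used(tree))    # ['Cr', 'Ni'] -- only these lines need to be measured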

An additional consideration is the time required for each program to execute. As new samples are analyzed and added to the instrument's database, it should be possible for the instrument to re-train using all the previously analyzed samples--including samples which were recently analyzed. In order for this to be possible, the pattern classification technique must be reasonably fast. C4.5 trained on a training set and processed a test set in about 5 min. Bayesian classification required 3 min, and kNN required ~ 30 min.

Because of its excellent classification ability and clarity of results, inductive learning appears to be the best pattern recognition technique for use in the autonomous instrument project. Since it is also quite fast, as new samples are run and the results added to the database, the program could be re-run to provide continual improvement in its classification ability.

CONCLUSION

It has been shown that, for classification of 71 reference materials with the use of their elemental compositions and real-world standard deviations, kNN, Bayesian classification, and C4.5 all perform very well. When relative standard deviations became extremely high, all three techniques' performances degraded, but at these high RSD levels C4.5 outperformed the other techniques, yielding a higher percentage of correct classifications. Because of its speed, higher classification accuracy, and clarity of results, inductive learning appears to be superior for both general and fine sample recognition.

With respect to the development of more intelligent instruments, these results suggest that modern techniques and computer hardware will allow:

1. Trained spectrometer systems to rapidly (1 s) classify a wide variety of sample types by collecting semi-quantitative data.

2. Spectrometer systems to learn (be trained) to classify samples on site using either (a) information extracted from previously run samples; or (b) information provided in other forms.

3. The possibility of classifying samples, even when some attribute (concentration) information is missing, by using the scheme that we have proposed or some other.

A warning is in order. All these systems will classify a sample whether or not the sample "reasonably" belongs in that class. For example, a system trained only with alloy samples will classify a biological sample as one of the alloys. There are, however, provisos which emerge when we consider each of the classification systems.

While Bayesian learning gave the poorest results in our studies, it does have several important assets. It does provide "goodness-of-fit" information which could be used to reduce the probability of a gross misclassification. It also provides a very short training time. The kNN does not directly provide a goodness-of-fit indicator, but it does provide the distance between the example being classified and its nearest neighbors.

This, or other readily calculable information, could be used to provide a measure of goodness of fit. Inductive learning provides an output which is easily comprehensible to humans; however, it provides limited statistical information which can be used as a goodness-of-fit measure. This limitation must be balanced against the superior performance of the inductive system in our experiments. This observation suggests to us that a combination of the Bayesian and inductive systems might provide superior performance. Both are very fast, and the Bayesian system provides statistical information which should minimize the possibility of making a glaring error in classification.

1. D. P. Webb, J. Hamier, and E. D. Salin, Trends Anal. Chem. 13, 44 (1994).

2. D. P. Webb and E. D. Salin, Intelligent Instruments and Computers 5, 185 (1992).

3. M. A. Sharaf, D. L. Illman, and B. R. Kowalski, Chemometrics (John Wiley and Sons, New York, 1986).

4. R.O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, Toronto, 1973).

5. J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers, San Mateo, 1993).

6. E. D. Salin and P. H. Winston, Anal. Chem. 64, 49A (1992).
7. M. Glick and G. Hieftje, Appl. Spectrosc. 45, 1706 (1991).
8. T. H. Naylor, J. L. Balintfy, D. S. Burdick, and K. Chu, Computer Simulation Techniques (John Wiley and Sons, New York, 1967).
