Sorting Spam with K-Nearest-Neighbor and Hyperspace Classifiers
William Yerazunis1, Fidelis Assis2, Christian Siefkes3, Shalendra Chhabra,1,4
1: Mitsubishi Electric Research Laboratories, Cambridge MA 2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin, Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA
Abstract: We consider the well-known K-nearest-neighbor (KNN) classifier as a spam filter. We compare KNN-based classification in both equal-vote and decreasing-rank forms to a well-tested Markov Random Field (MRF) spam classifier. As KNN classification is known to be asymptotically bounded to no worse than twice the error rate of the best possible probabilistic classifier, we can approximate how closely the MRF classifier approaches the performance bound. We then consider a variation of KNN classification based on a high-dimensioned feature radiation-propagation model (termed a "hyperspace" classifier), and compare the hyperspace classifier's performance to the KNN and MRF classifiers.
Introduction
Spam classification continues to provide an interesting, if not vexing, field of research. This classifier problem is unique in the machine-learning field, as most other problems in machine learning are not continuously made more difficult by intelligent and motivated minds.
A number of different approaches have been taken to spam filtering beyond the single-ended machine-learning filter, and a reasonable survey really requires a full book [Zdziarski 2005]; in this paper we restrict ourselves to post-SMTP-acceptance filtering and the sorting of email into two classes: good and spam.
Even within post-acceptance, single-ended filtering, a number of techniques are available. One of the most common is a Naive Bayesian filter (usually using a limiting window of the N most significant words) [Graham 2002]. Other common variations use chi-squared analysis, a Markov random field [Chhabra 2004, Bratko 2005], or even the compressibility of the unknown text given basis vectors representing the good and spam classes [Willets 2003].
Although KNN filters have been considered for spam classification in the past (Graham-Cumming's early POPFile used a KNN), they have fallen into disfavor among most filter authors. We reconsider the use of KNNs for classification and attempt to quantify their qualities.
Pure versus incrementally trained KNNs
One disadvantage of KNNs is that in the Cover and Hart configuration, every known input is added to the stored data; this can cause very long compute times. To mitigate this, we have used selectively trained KNNs; rather than adding every known text immediately to the stored data, we incrementally test each known text and only add the known text to the stored data if the known text was judged incorrectly. This speeds up filter classification tremendously. However, we must also consider the time needed to iteratively train these filters.
A disadvantage of this is that the Cover and Hart limit theorem does not necessarily apply to these KNNs. We will consider extension of the Cover and Hart theorem to cover incrementally trained KNNs in future work.
Details of the Filters
We used several different filtering algorithms in our tests: the K-Nearest-Neighbor (KNN) filters with neighborhood sizes of 3, 7, and 21, which we compare with the Markov Random Field (MRF) filter and with a new filter variation based on luminance in a high-dimensional space, called the "hyperspace" filter. We discuss the algorithms in detail below.
All of these filters are constructed within the framework of CRM114 with only minor tweaks to the source code. CRM114 is GPL'ed open source software and can be freely downloaded from:
http://crm114.sourceforge.net
so the reader should feel free to examine the actual algorithms.
Feature Extraction
All of the filters tested were configured to use the same set of features extracted from the text of the spam and good email messages. This is the OSB feature set described in [Siefkes PKDD], and experimentally verified to be of high quality as compared to other filter feature sets [Assis TREC].
To summarize how these features are generated: an initial regex (the POSIX-format regex [[:graph:]]+) is run repeatedly against the text to obtain a stream of variable-length words. This stream of words is then subjected to a combination operator such that each additional word in the series provides four combination output strings. These output strings are then hashed to provide a stream of unsigned 32-bit features. These features are not truly unique, but for our purposes they are "unique enough". All three filters were given identical streams of 32-bit features.
The string combination operator that generates the 32-bit tokens can best be described as repeated skipping and concatenation. Each word in the stream is sequentially considered the "current" word. The current word is concatenated with the first following word to form the first combination output string. The same current word is then concatenated with a "skip" placeholder followed by the 2nd following word to form the second combination output string. The current word is then concatenated with a "skip skip" placeholder followed by the 3rd following word to form the third combination output string. Then the current word is concatenated with a "skip skip skip" placeholder followed by the 4th following word to form the fourth combination output string. Finally, the current word is discarded and the next word in the input stream becomes current. This process repeats until the input stream is exhausted.
As an example, let’s use “For example, let’s look at this sentence.” as a sample input stream. The input stream would be broken into the word stream:
"For", "example,", "let's", "look", "at", "this", "sentence."
That word stream would produce the following concatenated strings, which would then be hashed to 32-bit features for input to each of the classifiers:

For example,
For skip let's
For skip skip look
For skip skip skip at
example, let's
example, skip look
example, skip skip at
example, skip skip skip this
let's look
let's skip at
let's skip skip this
let's skip skip skip sentence.
at which point the input stream is exhausted. In the actual system, "null" placeholders are used to both pre-fill and post-drain the pipe, so all words are counted equally.
This method of feature extraction and token generation has been shown to be both efficient and capable of producing classifier accuracies far superior to single word-at-a-time tokenization [Chhabra 2004]. In all tests described in this document, we used the "unique" option, so that only the first occurrence of any token was considered significant.
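The skip-and-concatenate operator above can be sketched in a few lines of Python. This is an illustrative reconstruction, not CRM114's actual code: the `<skip>` placeholder spelling and the CRC32 hash are stand-ins (CRM114 uses its own hash), and the "null" pre-fill/post-drain step is omitted for brevity.

```python
import re
import zlib

def osb_features(text, window=5):
    """Generate OSB combination strings for each "current" word and hash
    each to an unsigned 32-bit feature. Sketch only: placeholder spelling
    and hash function are illustrative, and null pre-fill/drain is omitted."""
    words = re.findall(r'\S+', text)          # approximates [[:graph:]]+
    features = []
    for i, current in enumerate(words):
        for gap in range(1, window):          # up to 4 combinations per word
            if i + gap >= len(words):
                break
            combo = ' '.join([current] + ['<skip>'] * (gap - 1) + [words[i + gap]])
            features.append(zlib.crc32(combo.encode()) & 0xFFFFFFFF)
    return features
```

For the seven-word sample stream above, each of the first three words yields four combinations and the tail yields progressively fewer, for 18 features in total (the paper's listing stops earlier because of the pipe-draining detail omitted here).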
KNN Filter Configuration
The KNN filter was configured in several different ways. All configurations were based on an N-dimensional Hamming distance metric; that is, the presence of the same feature in both a known and an unknown document is meaningless; rather, the number of differences (specifically, the feature hashes found in one document but not the other) determines distance. Within the three neighborhood sizes of 3, 7, and 21 members, we tested two different configurations – equal-weight (the standard Cover and Hart model), and a declining-weight model based on distance.
In the equal-weight configurations, the K closest matches to the unknown text are considered as "votes"; each vote is cast in favor of the unknown text being a member of that example's class, with each member of the K closest matches getting an equal vote. In the distance-based weighting, the weight of each vote is the reciprocal of the Euclidean distance (defined as the square root of the Hamming distance) from the unknown text to each known text. In both configurations, the votes are summed and the unknown text is assigned the class of the majority vote.
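Both voting schemes can be sketched as follows, representing each document as a set of 32-bit features so that the Hamming distance is the size of the set symmetric difference. This is a minimal sketch under those assumptions, not the CRM114 implementation; tie-breaking and zero distances are handled naively.

```python
import math

def knn_classify(unknown, known_docs, k=7, weighted=False):
    """Vote among the k nearest stored documents.
    unknown: set of feature hashes; known_docs: list of (feature_set, label).
    Distance is the symmetric-difference size (the Hamming distance above);
    weighted=True uses the reciprocal of the Euclidean distance per vote."""
    scored = sorted((len(unknown ^ features), label)
                    for features, label in known_docs)
    votes = {}
    for hamming, label in scored[:k]:
        if weighted:
            w = 1.0 / max(math.sqrt(hamming), 1e-9)  # reciprocal Euclidean
        else:
            w = 1.0                                  # equal vote (Cover & Hart)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```

A usage sketch: with three stored "good" examples and two "spam" examples, an unknown text two features away from both spam examples and six away from the good ones is classified as spam for k=3 under either scheme.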
MRF Filter Configuration
The MRF filter configuration was the CRM114 Markov random field classifier using the OSB feature set, otherwise unmodified. This is the same classifier as described in [Assis TREC] and was not modified for these tests. In this classifier design, the clique size of the Markov random field is the same as the length of the feature generator – in the current implementation, this is 5.
Hyperspace Filter Configuration
The hyperspace filter is based on the KNN filter concept, but sets K = |corpus|; that is, the neighborhood is the entire corpus, and it does not use an equal-weight vote system. In the limit of equal-weight voting, this would cause all texts to evaluate identically to the Bayesian prior (that is, the probability of an unknown text being a member of a class would be proportional to the known membership counts of that class). Since all members of the learned corpus are "members" of the neighborhood, there is no question of determining the neighborhood size needed for best filtering.
We tested several weighting variants for this hyperspace filter, among them weights proportional to the similarity between two texts (that is, the number of shared features), to dissimilarity (that is, the Euclidean distance between texts mapped into a hyperspace where each feature is its own dimension), and variants and mixtures of these weights.
The Test Set
All of the filter configurations were tested against the SA2 corpus as distributed by Gordon Cormack [Cormack]. The SA2 corpus can be freely downloaded from:
http://plg.uwaterloo.ca/~trlynam/spamjig/
This test set was also used in the TREC 2005 conference filter evaluation. However, comparisons between these test results and those published in TREC are only somewhat appropriate: the published TREC tests trained filters in a single pass only, while here we do multi-pass training to observe asymptotic behavior, so we should expect multi-pass accuracy to exceed TREC single-pass accuracy.
This test set contains approximately 4150 good emails and 1900 spams; this asymmetry of content is significant in that the test itself must accommodate unequal class sizes in the training corpus. In preliminary testing, we first considered an alternating scheme (training one good email, then one spam email) that ceased training when one class was exhausted. We also tested a training scheme that "reloaded" an empty training class whenever all its members had been trained, and a third scheme in which the next text to be trained was chosen pseudorandomly, so that both classes were exhausted simultaneously. None of these variants made any significant difference in base testing, so for simplicity we consider only the pseudorandom sequencing, as it most closely mimics actual spam filter usage.
Testing Procedure:
Each of the subject filters was tested with 10-fold validation – that is, the filter was trained on 90% of the SA2 corpus with 10% held back, and the remaining 10% was then tested to give an accuracy result. As training was not permitted on the held-back 10%, we then repeated the training phase. We allowed five such training passes for each filter, to observe any convergence to asymptotic behavior. In the one case where convergence did not seem to occur, we ran an additional five passes and found that convergence had in fact occurred.
After these five passes to converge, the statistical memory is cleared, and the process is repeated with a different 10% selected to be held back from training and used for testing. In total, ten different training corpora are tested, and each text in the corpus is trained nine times and used as a test case exactly once.
The training procedure for each filter is an SSTTTR (Single-Sided Thick Threshold Train and Retest) process. Each text is sequentially tested to verify that it is correctly classified. If the text is classified correctly by at least the thickness of the threshold, no further training is needed on that text. Otherwise, the text is trained into the correct class and the test classification is repeated. As before, if the text is still not classified correctly by at least the thickness of the threshold, the text is trained into the correct class once more, and execution for that particular text ends. Except where noted, all of the results presented here use a very thin "thick" threshold of 0.001 weighting units; this threshold is well below the resolution limit of 1 part in 42 possible in a K=21 KNN classifier (a K=21 classifier is a voting classifier with 21 votes; since votes are quantized, it cannot represent any finer gradation of certainty than half a vote, which is 1 part in 42). The reason for this very thin but nonzero threshold is so that none of the classifiers will "stick at zero" in the symmetrical situation. We have tested thicker and thinner thresholds, and conclude that this thickness is trivial and the results are statistically similar to true zero-threshold training.
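The SSTTTR loop above can be summarized in a short sketch. The `classify` and `learn` callables are placeholders for whichever filter is under test (here `classify` is assumed to return a signed score that is positive by at least the threshold when the text is correctly classified); this is an illustration of the procedure, not the CRM114 harness.

```python
def sst_train(texts, classify, learn, threshold=0.001):
    """Single-Sided Thick Threshold Train and Retest (SSTTTR) sketch.
    classify(text) -> signed score (>= threshold means correctly classified
    with sufficient margin); learn(text) trains the text into its true class.
    Each text is trained at most twice, as described above.
    Returns the number of training events performed."""
    trained = 0
    for text in texts:
        if classify(text) >= threshold:
            continue                      # already correct by the margin
        learn(text)                       # first training
        trained += 1
        if classify(text) < threshold:    # retest; if still wrong, train once
            learn(text)                   # more, then move on regardless
            trained += 1
    return trained
```

Note the single-sided character: a text that scores correctly is never trained, which is what lets the well-behaved KNNs store only a fraction of the corpus.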
The reader will note that this is a somewhat different training regimen than is typically used for KNN classifiers; many KNN classifiers simply "dump" all known texts into the class database and classify against that large database. We found that the better-behaved learning KNNs learned about ¼ of the documents (thus gaining considerable speed); the weaker-learning KNNs essentially learned the entire corpus and arrived at the simple full-corpus KNN database behavior by learning individual texts.
Testing Results
Overall testing was performed on a Dell 700m 1.6 GHz subcompact laptop running Fedora Core 4 updated to the January 15 releases, and CRM114 version 20060209-ReaverSecondBreakfast (TRE 0.7.2 (GPL)). This machine was equipped with only 512 megabytes of memory and a slow (laptop) disk, so the processing speeds given below will almost certainly be at the slow end of the scale for modern servers.
KNN Classification Results
Our testing gave the surprising result that the KNN equal-vote filter was very weak in performance for all tested neighborhood sizes (3, 7, and 21 members). The learning curves showed that for neighborhood sizes of 3 and 7 neighbors, learning did not really occur past the second and third passes respectively; the overall accuracy was no better than 93% (standard deviations 1.55% and 0.85% respectively), with a distinct accuracy split of 98.6% for good email and 79.8% for spam. This implies that the best possible classifier would be only about 96.5% accurate.
With a neighborhood of 21 members, accuracy continued to improve past the fifth training pass, and the KNN accuracy moved to 94% overall (standard deviation 1.06%). Again, there was a distinct accuracy split: 96.1% for good email versus 90.9% for spam. The initial score for all classifiers (one pass rather than the full five passes) was abysmal, often less than 90% overall. All three of these KNNs added most of the input corpora to their databases, yielding slow runs (47 to 72 milliseconds per classification).
Fig 1: Learning curves for equal-vote KNNs with neighborhoods of 3, 7, and 21 respectively
We now consider non-equal-weight KNN voting schemes. The simplest of these uses the Euclidean distance (the square root of the Hamming distance) as the measure of closeness; as we wish to weight votes according to closeness, the actual per-document weight is the reciprocal of the Euclidean distance for each of the K documents within the neighborhood. This gives an improvement for K=7 and K=21 to 94% and 92% overall accuracy respectively (standard deviations of 1.05% and 1.93%), with a symmetry split of 85% and 79% for spam versus 98% and 98% for good email. This decrease in accuracy with increasing neighborhood size is slightly paradoxical, but again, the error bars on these results overlap, so no real conclusion should be drawn.
Figure 2: Learning curves for KNNs with neighborhoods of 7 and 21 members with a per-document weight of the reciprocal of the Euclidean distance
Full-corpus ( K = |corpus| ) KNN classification
We now consider non-equal-weight KNN classifier systems with K = |corpus|; that is, where all known documents are considered to be "in the neighborhood" and the weighting function determines the significance of each document. This provided significantly improved accuracy and a much better
asymptotic behavior. Because the filter accuracy increases, the learning time decreases and the known-text database shrinks; the overall filter not only becomes more accurate, but also faster.
Full-corpus KNN classification with reciprocal Euclidean weighting
Here, we weight each known document by 1 / (Euclidean distance between the known and unknown documents). The final behavior for this configuration is 97.8% overall accuracy (0.76% standard deviation), with good mail at 98.2% accuracy and spam at 96.9%, using 32 milliseconds per text. This result is much closer to what we expect of a modern spam filter, and in line with the results reported in TREC 2005 for the middle-group spam filters.
Fig 3: KNN with full-corpus neighborhoods and voting weights of the reciprocal of the Euclidean distance
Full-corpus KNN classification with similarity vote weighting
Another weighting for full-corpus KNN classification is proportional to the similarity between the unknown text and each known text example; the improvement is immediately visible. This results in a final 5-pass accuracy of 99.24%, with a standard deviation of 0.4% (good mail accuracy of 99.4% and spam accuracy of 98.84%) at 44 milliseconds per text. Because the system did not seem to have reached an asymptote at 5 passes, we continued this experiment out to 10 passes, yielding only a statistically insignificant change to 99.27% overall accuracy (0.49% standard deviation), with good email at 99.49% and spam at 99.27% accuracy. Note that the scale of this graph (and all subsequent graphs) is changed from the previous scaling; the Y axis now starts at 95% accuracy rather than 70%.
Note that a 99.27% accurate system is in line with the better TREC evaluations; however, this system
was trained with 10 full passes, while the TREC evaluations were trained on only a single pass.
Fig 4: Full-corpus KNN classification with similarity vote weighting. This plot shows the full run of 10 passes, as the filter seemed not to have reached asymptotic performance at five passes; however, there is no statistical improvement between 5 and 10 passes. Note the Y-axis scale change; full height is now 95% to 100% accuracy.
Hyperspace-based weighting in KNN classifiers
We now consider a new variation on the K=|corpus| classification, where the per-vote weights contain not only a distance term but also a term related to the feature-wise similarity of the documents. Because these weighting models are mathematically similar to radiative energy transfer in high-dimensional spaces, we call them hyperspace models, for lack of a better term. Other than that change, they are very similar to the other K=|corpus| KNN classifiers.
The actual radiative transfer from each known text in hyperspace to the unknown text is a property of the similarity of the known and unknown documents (as before, this is the count of unique features found in both documents). These models all use a weighting of the form:

W_i = (document similarity)^d / (Euclidean distance)^2

where d is an independent parameter determining how important the document similarity is. The first three values of d (d = 1, 2, and 3) all produce remarkably similar results.
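The weighting above can be sketched directly, again treating documents as feature sets: the similarity is the intersection size, and the squared Euclidean distance equals the Hamming distance (symmetric-difference size), so no square root is needed. This is an illustrative sketch, not the CRM114 hyperspace code; the epsilon clamp for identical documents is an assumption.

```python
def hyperspace_scores(unknown, known_docs, d=1):
    """Sum per-class radiative weights W_i = similarity**d / distance**2.
    unknown: set of feature hashes; known_docs: list of (feature_set, label).
    The squared Euclidean distance is the Hamming (symmetric-difference)
    count; identical documents are clamped to a small epsilon (assumption)."""
    totals = {}
    for features, label in known_docs:
        similarity = len(unknown & features)       # shared unique features
        sq_distance = len(unknown ^ features)      # Euclidean distance squared
        w = (similarity ** d) / max(sq_distance, 1e-9)
        totals[label] = totals.get(label, 0.0) + w
    return totals
```

The class with the larger total is assigned to the unknown text; note that a known document sharing no features contributes zero weight regardless of d, which is what distinguishes this from the pure reciprocal-distance weighting.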
Testing with d = 1, 2, and 3 yields final overall accuracies of 99.34%, 99.27%, and 99.26% respectively (with standard deviations of 0.5%, 0.3%, and 0.5% respectively), with good-mail accuracies of 99.64%, 99.66%, and
99.59% accuracy, and spam-mail accuracies of 98.68%, 98.42%, and 98.52% respectively. It cannot be determined from this experiment which d-value is better, because all three d-values yield results with substantially the same error bars. These are also among the fastest KNN classifiers tested here, with per-message processing times of 32, 22, and 22 milliseconds respectively.
Fig 5 – KNN hyperspace weighting, for d = 1, 2, and 3 respectively. The vertical scale is again from 95% to 100% accuracy.
Comparison with Markov OSB filtering
We now compare a KNN filter against a Markov OSB filter with the same training regimen and the same data. We performed two experiments: one with thin-threshold training (that is, identical to the best KNN training above) and another with true thick-threshold training, corresponding to 10 pR units. Because pR relates to a probability uncorrected for the naive Bayes assumptions, it is impossible to give an exact translation to probability, but empirically, scores within ±10 pR units of zero correspond very roughly to the bracket between 10% certainty and 90% certainty. It is known that thin-threshold training is suboptimal for Markov OSB filters, but it is the "apples to apples" comparison.
The results are shown below. Note that the thin-threshold Markov OSB performance asymptote is 99.11% (standard deviation 0.26%), with good email accuracy of 99.64% and spam at 97.94%, at 31 milliseconds per message; as such it is probably inferior to the best hyperspace and KNN classifiers.
However, thick-threshold Markov OSB, trained with a thickness of 10 pR units, yields the best-so-far accuracy of 99.52% overall (standard deviation of 0.28%), with good email accuracy of 99.64% and spam accuracy of 99.26%. This was also one of the fastest classifiers, spending just 19 milliseconds on each text.
Fig 6 – Markov OSB classification in the same training harness and test set as the KNN and hyperspace filters previously tested. Left: thin threshold of 0.001 pR units; right: thick threshold trained at 10.0 pR units. Note that with a thick threshold, Markov OSB learned very rapidly.
Conclusions
Although small-neighborhood K-nearest-neighbor classifiers are not well suited for spam classification, KNNs using the entire corpus as the neighborhood and a distance weighting for the voting summation deliver good results. The hyperspatial variant, using a similarity term based on shared document features as well as a distance term based on the document Hamming distance, can yield asymptotic accuracies comparable to the best Naive Bayesian and Markov Random Field classifiers, at speeds equal to or faster than those classifiers operated in thin-threshold configurations. For a thick-threshold configuration, a Bayesian filter is probably superior to KNN or hyperspace filters.
K-nearest-neighbor classification with hyperspatial weighting may itself be useful for spam filtering, as the KNN algorithm does not depend on linear separability of classes as Naive Bayesian classification does.
It is unlikely that KNN-based filters would be useful in single-pass TOE systems, because of their weaker error rate in single-pass learning. However, their great speed may compensate for the learning rate in single-filter-many-users deployments.
References
Bratko, A. and Filipič, B., 2005, "Spam Filtering using Compression Models", NIST TREC 2005, downloadable from http://ai.ijs.si/andrej/papers/ijs_dp9227.html
Chhabra, S., Yerazunis, W.S., and Siefkes, C., 2004, "Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas", ICDM '04, http://www.siefkes.net/papers/mrfspamfiltering.pdf
Cover, T.M. and Hart, P.E., 1967, "Nearest Neighbor Pattern Classification", IEEE Transactions on Information Theory, 13:21-27.
Graham, P., 2002, "A Plan For Spam", http://www.paulgraham.com/spam.html
Willets, K., 2003, "Spam Filtering with gzip", http://www.kuro5hin.org/story/2003/1/25/224415/367
Zdziarski, J., 2005, Ending Spam, No Starch Press, ISBN 1593270526.