Sorting Spam with K-Nearest-Neighbor and Hyperspace
William Yerazunis 1, Fidelis Assis 2, Christian Siefkes 3, Shalendra Chhabra 1,4
1: Mitsubishi Electric Research Laboratories, Cambridge MA
2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin,
Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA
Abstract: We consider the well-known K-nearest-neighbor (KNN) classifier as a spam filter. We
compare KNN-based classification in both equal-vote and decreasing-rank forms to a well-tested
Markov Random Field (MRF) spam classifier. As KNN classification is known to be asymptotically
bounded to an error rate no worse than twice that of the best possible probabilistic classifier, we
can approximate how closely the MRF classifier approaches the performance bound. We then consider a
variation of KNN classification based on a high-dimensional feature radiation-propagation model
(termed a “hyperspace” classifier), and compare the hyperspace classifier's performance to that of the
KNN and MRF classifiers.

Introduction
Spam classification continues to provide an interesting, if not vexing, field of research. This particular
classification problem is unique in the machine-learning field, as most other problems in machine learning
are not continuously made more difficult by intelligent and motivated minds.
A number of different approaches have been taken for spam filtering beyond the single-ended machine-learning
filter, and a reasonable survey really requires a full book [Zdziarski 2005]; in this paper we
will restrict ourselves to post-SMTP acceptance filtering and the sorting of email into two classes: good
mail and spam.
Even within post-acceptance, single-ended filtering, there are a number of techniques available. One
of the most common is a Naive Bayesian filter, usually using a limiting window of the most significant
N words [Graham 2001]. Other common variations use chi-squared analysis, a Markov random field
[Chhabra 2004, Bratko 2005], or even compressibility of the unknown text given basis vectors
representing the good and spam classes [Willets 2003].
Although KNN filters have been considered for spam classification in the past (Graham-Cumming's
early POPFile used a KNN), they have fallen into disfavor among most filter authors. We reconsider
the use of KNNs for classification and attempt to quantify their qualities.
Pure versus incrementally trained KNNs
One disadvantage of KNNs is that in the Cover and Hart configuration, every known input is added to
the stored data; this can cause very long compute times. To mitigate this, we have used selectively
trained KNNs: rather than adding every known text immediately to the stored data, we incrementally
test each known text and add it to the stored data only if it was judged incorrectly. This speeds up
filter classification tremendously. However, we must also consider the time needed to iteratively
train these filters.
A disadvantage of this approach is that the Cover and Hart limit theorem does not necessarily apply to
these KNNs. We will consider extending the Cover and Hart theorem to cover incrementally trained KNNs
in future work.
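The selective training protocol above can be sketched as a short loop. The 1-nearest-neighbor classifier here is a toy stand-in, not the CRM114 implementation; all names are illustrative:

```python
# A minimal sketch of selective ("train only on error") KNN training.
# The stored data begins empty; a text is added only when the current
# classifier judges it incorrectly.
stored = []  # list of (feature_set, label) exemplars

def classify(features):
    """Toy 1-NN by Hamming-style (symmetric-difference) distance;
    returns None when nothing has been stored yet."""
    if not stored:
        return None
    return min(stored, key=lambda ex: len(ex[0] ^ features))[1]

def train_on_error(corpus, max_passes=5):
    """Iterate over (features, label) pairs, storing a text only when
    it is misclassified; stop once a full pass is error-free."""
    for _ in range(max_passes):
        errors = 0
        for features, label in corpus:
            if classify(features) != label:
                stored.append((features, label))  # keep only the mistakes
                errors += 1
        if errors == 0:  # converged: every known text judged correctly
            return
```

Note that the stored set ends up smaller than the training corpus whenever some texts are already classified correctly, which is the source of the speedup described above.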
Details of the Filters
We used several different filtering algorithms in our tests: the K-nearest-neighbor (KNN)
filters with neighborhood sizes of 3, 7, and 21, which we compare with the Markov Random Field (MRF)
filter and with a new filter variation based on luminance in a high-dimensional space, called the
“hyperspace” filter. We will discuss the algorithms in detail below.
All of these filters are constructed within the framework of CRM114 with only minor tweaks to the
source code. CRM114 is GPL-licensed open source software and can be freely downloaded,
so the reader should feel free to examine the actual algorithms.
All of the filters tested were configured to use the same set of features extracted from the text of the
spam and good email messages. This is the OSB feature set described in [Siefkes PKDD], and
experimentally verified to be of high quality as compared to other filter feature sets [Assis TREC].
To summarize how these features are generated, an initial regex is run repeatedly against the text to
obtain words (that is, the POSIX-format regex [[:graph:]]+ ). This provides a stream of
variable-length words. This stream of words is then subjected to a combination operator such that each
additional word in the series provides four combination output strings. These output strings are then
hashed to provide a stream of unsigned 32-bit features. These features are not truly unique, but for our
purposes they are “unique enough”. All three filters were given identical streams of 32-bit features.
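The hashing step can be sketched as follows; here `zlib.crc32` is an illustrative stand-in for CRM114's internal hash function, which this sketch does not reproduce:

```python
import zlib

def hash32(combination_string):
    """Map a combination output string to an unsigned 32-bit feature.
    zlib.crc32 stands in for CRM114's own hash; as noted above, such
    features are not truly unique (collisions are possible), but they
    are "unique enough" for classification purposes."""
    return zlib.crc32(combination_string.encode("utf-8")) & 0xFFFFFFFF
```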
The string combination operator that generates the 32-bit tokens can best be described as repeated
skipping and concatenation. Each word in the stream is sequentially considered the “current” word.
The “current” word is concatenated with the first following word to form the first combination output
string. The same “current” word is then concatenated with a “skip” placeholder followed by the 2nd
following word to form the second combination output string. The current word is then concatenated
with a “skip skip” placeholder followed by the 3rd following word to form the third combination output
string. Then the current word is concatenated with a “skip skip skip” placeholder followed by the 4th
following word to form the fourth combination output string. Finally, the “current” word is discarded
and the next word in the input stream becomes “current”. This process repeats until the input stream is
exhausted.
As an example, let’s use “For example, let’s look at this sentence.” as a sample input stream. The
input stream would be broken into the word stream:

For
example,
let’s
look
at
this
sentence.

That word stream would produce the following concatenated strings, which would then be hashed to
32-bit features for input to each of the classifiers:
For example,
For skip let’s
For skip skip look
For skip skip skip at
example, let’s
example, skip look
example, skip skip at
example, skip skip skip this
let’s look
let’s skip at
let’s skip skip this
let’s skip skip skip sentence
at which point the input stream is exhausted. In the actual system, “null” placeholders are used to both
pre-fill and post-drain the pipe, so all words are counted equally.
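The skip-and-concatenate operator can be sketched as a short loop; `osb_combinations` is an illustrative name, and this sketch omits the “null” pre-fill and post-drain placeholders mentioned above:

```python
def osb_combinations(words):
    """For each "current" word, emit four output strings pairing it with
    the 1st through 4th following words, with "skip" placeholders
    marking the gapped positions, as described above."""
    out = []
    for i, cur in enumerate(words):
        for gap in range(4):          # pair with the next four words
            j = i + gap + 1
            if j < len(words):        # stop when the stream is exhausted
                out.append(" ".join([cur] + ["skip"] * gap + [words[j]]))
    return out
```

Running this on the sample sentence reproduces the listing above, including the direct (no-skip) concatenations.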
This method of feature extraction and token generation has been shown to be both efficient and capable
of producing classifier accuracies far superior to single word-at-a-time tokenization [Chhabra 2004].
In all tests described in this document, we used the “unique” option, so that only the first occurrence of
any token was considered significant.
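Assuming the token stream is an ordered list of 32-bit features, the “unique” option can be sketched in one line (`unique_features` is an illustrative name):

```python
def unique_features(features):
    """Keep only the first occurrence of each token, discarding repeats,
    as with the "unique" option described above; order is preserved."""
    return list(dict.fromkeys(features))
```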
KNN Filter Configuration
The KNN filter used was configured in several different ways. All configurations were based on an
N-dimensional Hamming distance metric; that is, the presence of the same feature in both a known and an
unknown document is meaningless; rather, the number of differences (specifically, the feature hashes
found in one document but not the other) determines the distance. Within the three neighborhood sizes of
3, 7, and 21 members, we tested two different configurations: equal-weight (the standard Cover and
Hart model), and a declining-weight model based on distance.
In the equal-weight configurations, the set of the K closest matches to the unknown text are considered
as “votes”; each vote is cast in favor of the unknown text being a member of that example’s class, with
each member of the K closest matches getting an equal vote. In the distance-based weighting, the weight
of each vote was the reciprocal of the Euclidean distance (defined as the square root of the Hamming
distance) from the unknown text.
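The two voting schemes can be sketched together; `knn_vote` and its parameters are illustrative names rather than the CRM114 interface, and in this sketch an exact match (distance zero) gets a capped weight of 1.0 in the weighted variant:

```python
import math
from collections import Counter

def knn_vote(unknown, exemplars, k=7, weighted=False):
    """Classify an unknown feature set by its k nearest stored exemplars.
    Distance is the Hamming-style count of features present in one
    document but not the other (set symmetric difference).  The weighted
    variant votes with 1/sqrt(distance), i.e. the reciprocal of the
    Euclidean distance as defined above."""
    ranked = sorted(exemplars, key=lambda ex: len(ex[0] ^ unknown))[:k]
    votes = Counter()
    for feats, label in ranked:
        d = len(feats ^ unknown)
        # Equal-vote: every neighbor counts 1.0; weighted: 1/sqrt(d),
        # with exact matches (d == 0) capped at 1.0 in this sketch.
        votes[label] += 1.0 / math.sqrt(d) if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]
```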