Sorting Spam with K-Nearest-Neighbor and Hyperspace
William Yerazunis 1, Fidelis Assis 2, Christian Siefkes 3, Shalendra Chhabra 1,4
1: Mitsubishi Electric Research Laboratories, Cambridge MA
2: Empresa Brasileira de Telecomunicações Embratel, Rio de Janeiro, RJ, Brazil
3: Database and Information Systems Group, Freie Universität Berlin,
Berlin-Brandenburg Graduate School in Distributed Information Systems
4: Computer Science and Engineering, University of California, Riverside CA
Abstract: We consider the well-known K-nearest-neighbor (KNN) classifier as a spam filter. We
compare KNN-based classification in both equal-vote and decreasing-rank forms to a well-tested
Markov Random Field (MRF) spam classifier. As KNN classification is known to be asymptotically
bounded to an error rate no worse than twice that of the best possible probabilistic classifier, we
can approximate how closely the MRF classifier approaches the performance bound. We then consider a
variation of KNN classification based on a high-dimensional feature radiation-propagation model
(termed a “hyperspace” classifier), and compare the hyperspace classifier's performance to that of the
KNN and MRF classifiers.

Introduction
Spam classification continues to provide an interesting, if not vexing, field of research. This particular
classification problem is unique in the machine-learning field, as most other problems in machine learning
are not continuously made more difficult by intelligent and motivated minds.
A number of different approaches have been taken for spam filtering beyond the single-ended machine-learning
filter, and a reasonable survey really requires a full book [Zdziarski 2005]; in this paper we
will restrict ourselves to post-SMTP acceptance filtering and the sorting of email into two classes: good
mail and spam.
Even within post-acceptance, single-ended filtering, there are a number of techniques available. One
of the most common is a Naive Bayesian filter, usually using a limiting window of the most significant
N words [Graham 2001]. Other common variations use chi-squared analysis, a Markov random field
[Chhabra 2004, Bratko 2005], or even compressibility of the unknown text given basis vectors
representing the good and spam classes [Willets 2003].
Although KNN filters have been considered for spam classification in the past (Graham-Cumming's
early POPFile used a KNN), they have fallen into disfavor among most filter authors. We reconsider
the use of KNNs for classification and attempt to quantify their qualities.
Pure versus incrementally trained KNNs
One disadvantage of KNNs is that in the Cover and Hart configuration, every known input is added to
the stored data; this can cause very long compute times. To mitigate this, we have used selectively
trained KNNs: rather than adding every known text immediately to the stored data, we incrementally
test each known text and add it to the stored data only if it was judged incorrectly. This speeds up
filter classification tremendously. However, we must also consider the time needed to iteratively
train these filters.
A disadvantage of this approach is that the Cover and Hart limit theorem does not necessarily apply to
these KNNs. We will consider extending the Cover and Hart theorem to cover incrementally trained KNNs
in future work.
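The selective training protocol above can be sketched as a short loop. The 1-nearest-neighbor classifier here is a toy stand-in, not the CRM114 implementation; all names are illustrative:

```python
# A minimal sketch of selective ("train only on error") KNN training.
# The stored data begins empty; a text is added only when the current
# classifier judges it incorrectly.
stored = []  # list of (feature_set, label) exemplars

def classify(features):
    """Toy 1-NN by Hamming-style (symmetric-difference) distance;
    returns None when nothing has been stored yet."""
    if not stored:
        return None
    return min(stored, key=lambda ex: len(ex[0] ^ features))[1]

def train_on_error(corpus, max_passes=5):
    """Iterate over (features, label) pairs, storing a text only when
    it is misclassified; stop once a full pass is error-free."""
    for _ in range(max_passes):
        errors = 0
        for features, label in corpus:
            if classify(features) != label:
                stored.append((features, label))  # keep only the mistakes
                errors += 1
        if errors == 0:  # converged: every known text judged correctly
            return
```

Note that the stored set ends up smaller than the training corpus whenever some texts are already classified correctly, which is the source of the speedup described above.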
Details of the Filters
We used several different filtering algorithms in our tests: the K-nearest-neighbor (KNN)
filters with neighborhood sizes of 3, 7, and 21, which we compare with the Markov Random Field (MRF)
filter and with a new filter variation based on luminance in a high-dimensional space, called the
“hyperspace” filter. We will discuss the algorithms in detail below.
All of these filters are constructed within the framework of CRM114 with only minor tweaks to the
source code. CRM114 is GPL-licensed open source software and can be freely downloaded,
so the reader should feel free to examine the actual algorithms.
All of the filters tested were configured to use the same set of features extracted from the text of the
spam and good email messages. This is the OSB feature set described in [Siefkes PKDD], and
experimentally verified to be of high quality as compared to other filter feature sets [Assis TREC].
To summarize how these features are generated, an initial regex is run repeatedly against the text to
obtain words (that is, the POSIX-format regex [[:graph:]]+ ). This provides a stream of
variable-length words. This stream of words is then subjected to a combination operator such that each
additional word in the series provides four combination output strings. These output strings are then
hashed to provide a stream of unsigned 32-bit features. These features are not truly unique, but for our
purposes they are “unique enough”. All three filters were given identical streams of 32-bit features.
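The hashing step can be sketched as follows; here `zlib.crc32` is an illustrative stand-in for CRM114's internal hash function, which this sketch does not reproduce:

```python
import zlib

def hash32(combination_string):
    """Map a combination output string to an unsigned 32-bit feature.
    zlib.crc32 stands in for CRM114's own hash; as noted above, such
    features are not truly unique (collisions are possible), but they
    are "unique enough" for classification purposes."""
    return zlib.crc32(combination_string.encode("utf-8")) & 0xFFFFFFFF
```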
The string combination operator that generates the 32-bit tokens can best be described as repeated
skipping and concatenation. Each word in the stream is sequentially considered the “current” word.
The “current” word is concatenated with the first following word to form the first combination output
string. The same “current” word is then concatenated with a “skip” placeholder followed by the 2nd
following word to form the second combination output string. The current word is then concatenated
with a “skip skip” placeholder followed by the 3rd following word to form the third combination output
string. Then the current word is concatenated with a “skip skip skip” placeholder followed by the 4th
following word to form the fourth combination output string. Finally, the “current” word is discarded
and the next word in the input stream becomes “current”. This process repeats until the input stream is
exhausted.
As an example, let’s use “For example, let’s look at this sentence.” as a sample input stream. The
input stream would be broken into the word stream:

For
example,
let’s
look
at
this
sentence.

That word stream would produce the following concatenated strings, which would then be hashed to
32-bit features for input to each of the classifiers:
For example,
For skip let’s
For skip skip look
For skip skip skip at
example, let’s
example, skip look
example, skip skip at
example, skip skip skip this
let’s look
let’s skip at
let’s skip skip this
let’s skip skip skip sentence
at which point the input stream is exhausted. In the actual system, “null” placeholders are used to both
pre-fill and post-drain the pipe, so all words are counted equally.
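The skip-and-concatenate operator can be sketched as a short loop; `osb_combinations` is an illustrative name, and this sketch omits the “null” pre-fill and post-drain placeholders mentioned above:

```python
def osb_combinations(words):
    """For each "current" word, emit four output strings pairing it with
    the 1st through 4th following words, with "skip" placeholders
    marking the gapped positions, as described above."""
    out = []
    for i, cur in enumerate(words):
        for gap in range(4):          # pair with the next four words
            j = i + gap + 1
            if j < len(words):        # stop when the stream is exhausted
                out.append(" ".join([cur] + ["skip"] * gap + [words[j]]))
    return out
```

Running this on the sample sentence reproduces the listing above, including the direct (no-skip) concatenations.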
This method of feature extraction and token generation has been shown to be both efficient and capable
of producing classifier accuracies far superior to single word-at-a-time tokenization [Chhabra 2004].
In all tests described in this document, we used the “unique” option, so that only the first occurrence of
any token was considered significant.
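Assuming the token stream is an ordered list of 32-bit features, the “unique” option can be sketched in one line (`unique_features` is an illustrative name):

```python
def unique_features(features):
    """Keep only the first occurrence of each token, discarding repeats,
    as with the "unique" option described above; order is preserved."""
    return list(dict.fromkeys(features))
```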
KNN Filter Configuration
The KNN filter used was configured in several different ways. All configurations were based on an
N-dimensional Hamming distance metric; that is, the presence of the same feature in both a known and an
unknown document is meaningless; rather, the number of differences (specifically, the feature hashes
found in one document but not the other) determines the distance. Within the three neighborhood sizes of
3, 7, and 21 members, we tested two different configurations: equal-weight (the standard Cover and
Hart model), and a declining-weight model based on distance.
In the equal-weight configurations, the set of the K closest matches to the unknown text are considered
as “votes”; each vote is cast in favor of the unknown text being a member of that example’s class, with
each member of the K closest matches getting an equal vote. In the distance-based weighting, the weight
of each vote was the reciprocal of the Euclidean distance (defined as the square root of the Hamming
distance) from the unknown text.
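The two voting schemes can be sketched together; `knn_vote` and its parameters are illustrative names rather than the CRM114 interface, and in this sketch an exact match (distance zero) gets a capped weight of 1.0 in the weighted variant:

```python
import math
from collections import Counter

def knn_vote(unknown, exemplars, k=7, weighted=False):
    """Classify an unknown feature set by its k nearest stored exemplars.
    Distance is the Hamming-style count of features present in one
    document but not the other (set symmetric difference).  The weighted
    variant votes with 1/sqrt(distance), i.e. the reciprocal of the
    Euclidean distance as defined above."""
    ranked = sorted(exemplars, key=lambda ex: len(ex[0] ^ unknown))[:k]
    votes = Counter()
    for feats, label in ranked:
        d = len(feats ^ unknown)
        # Equal-vote: every neighbor counts 1.0; weighted: 1/sqrt(d),
        # with exact matches (d == 0) capped at 1.0 in this sketch.
        votes[label] += 1.0 / math.sqrt(d) if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]
```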