14
Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme Chi-Yao Tseng, Pin-Chieh Sung, and Ming-Syan Chen, Fellow, IEEE Abstract—E-mail communication is indispensable nowadays, but the e-mail spam problem continues growing drastically. In recent years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spams. On purpose of achieving efficient similarity matching and reducing storage utilization, prior works mainly represent each e-mail by a succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot fully catch the evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this paper, we propose a novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can more effectively capture the near-duplicate phenomenon of spams. Moreover, we design a complete spam detection system Cosdes (standing for COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables system Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set collected from a real e-mail server and show that our system outperforms the prior approaches in detection results and is applicable to the real world. Index Terms—Spam detection, e-mail abstraction, near-duplicate matching. Ç 1 INTRODUCTION E -MAIL communication is prevalent and indispensable nowadays. However, the threat of unsolicited junk e- mails, also known as spams, becomes more and more serious. According to a survey by the website TopTenRE- VIEWS [11], 40 percent of e-mails were considered as spams in 2006. The statistics collected by MessageLabs 1 show that recently the spam rate is over 70 percent and persistently remains high. The primary challenge of spam detection problem lies in the fact that spammers will always find new ways to attack spam filters owing to the economic benefits of sending spams. Note that existing filters generally perform well when dealing with clumsy spams, which have duplicate content with suspicious keywords or are sent from an identical notorious server. Therefore, the next stage of spam detection research should focus on coping with cunning spams which evolve naturally and continuously. Although the techniques used by spammers vary constantly, there is still one enduring feature: spams with identical or similar content are sent in large quantities and successively. Since only a small amount of e-mail users will order products or visit websites advertised in spams, spammers have no choice but to send a great quantity of spams to make profits. It means that even with developing and employing unexpected new tricks, spammers still have to send out large quantities of identical or similar spams simultaneously and in succession. This specific feature of spams can be designated as the near-duplicate phenomenon, which is a significant key in the spam detection problem. In view of above facts, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has recently received much attention. The primary idea of the near-duplicate matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent spams with similar content. Collabora- tive filtering indicates that user knowledge of what spam may subsequently appear is collected to detect following spams. Overall, there are three key points of this type of spam detection approach we have to be concerned about. First, an effective representation of e-mail (i.e., e-mail abstraction) is essential. Since a large set of reported spams has to be stored in the known spam database, the storage size of e-mail abstraction should be small. Moreover, the e- mail abstraction should capture the near-duplicate phe- nomenon of spams, and should avoid accidental deletion of nonspam e-mails (also known as hams). Second, every incoming e-mail has to be matched with the large database, meaning that the near-duplicate matching process should be substantially efficient. Finally, the latest spams have to be IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 669 . C.-Y. Tseng and M.-S. Chen are with the Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, and the Research Center for Information Technology Innovation (CITI), Academia Sinica, No. 128, Sec. 2, Academia Road, Nankang, Taipei 11529, Taiwan. E-mail: [email protected], [email protected]. . P.-C. Sung is with the Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan. E-mail: [email protected]. Manuscript received 7 Mar. 2009; revised 9 Aug. 2009; accepted 7 Dec. 2009; published online 24 Aug. 2010. Recommended for acceptance by D. Cook. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2009-03-0118. Digital Object Identifier no. 10.1109/TKDE.2010.147. 1. http://www.messagelabs.com/globalthreats. 1041-4347/11/$26.00 ß 2011 IEEE Published by the IEEE Computer Society

Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

Embed Size (px)

DESCRIPTION

Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

Citation preview

Page 1: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

Cosdes: A Collaborative Spam DetectionSystem with a Novel E-Mail Abstraction

SchemeChi-Yao Tseng, Pin-Chieh Sung, and Ming-Syan Chen, Fellow, IEEE

Abstract—E-mail communication is indispensable nowadays, but the e-mail spam problem continues growing drastically. In recent

years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary

idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block

subsequent near-duplicate spams. On purpose of achieving efficient similarity matching and reducing storage utilization, prior works

mainly represent each e-mail by a succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot

fully catch the evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this paper, we propose a

novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a procedure to generate the

e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can more effectively capture the near-duplicate

phenomenon of spams. Moreover, we design a complete spam detection system Cosdes (standing for COllaborative Spam DEtection

System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update

scheme enables system Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live

data set collected from a real e-mail server and show that our system outperforms the prior approaches in detection results and is

applicable to the real world.

Index Terms—Spam detection, e-mail abstraction, near-duplicate matching.

Ç

1 INTRODUCTION

E-MAIL communication is prevalent and indispensablenowadays. However, the threat of unsolicited junk e-

mails, also known as spams, becomes more and moreserious. According to a survey by the website TopTenRE-VIEWS [11], 40 percent of e-mails were considered as spamsin 2006. The statistics collected by MessageLabs1 show thatrecently the spam rate is over 70 percent and persistentlyremains high. The primary challenge of spam detectionproblem lies in the fact that spammers will always find newways to attack spam filters owing to the economic benefits ofsending spams. Note that existing filters generally performwell when dealing with clumsy spams, which haveduplicate content with suspicious keywords or are sentfrom an identical notorious server. Therefore, the next stageof spam detection research should focus on coping withcunning spams which evolve naturally and continuously.

Although the techniques used by spammers varyconstantly, there is still one enduring feature: spams withidentical or similar content are sent in large quantities andsuccessively. Since only a small amount of e-mail users willorder products or visit websites advertised in spams,spammers have no choice but to send a great quantity ofspams to make profits. It means that even with developingand employing unexpected new tricks, spammers still haveto send out large quantities of identical or similar spamssimultaneously and in succession. This specific feature ofspams can be designated as the near-duplicate phenomenon,which is a significant key in the spam detection problem.

In view of above facts, the notion of collaborative spamfiltering with near-duplicate similarity matching schemehas recently received much attention. The primary idea ofthe near-duplicate matching scheme for spam detection is tomaintain a known spam database, formed by user feedback,to block subsequent spams with similar content. Collabora-tive filtering indicates that user knowledge of what spammay subsequently appear is collected to detect followingspams. Overall, there are three key points of this type ofspam detection approach we have to be concerned about.First, an effective representation of e-mail (i.e., e-mailabstraction) is essential. Since a large set of reported spamshas to be stored in the known spam database, the storagesize of e-mail abstraction should be small. Moreover, the e-mail abstraction should capture the near-duplicate phe-nomenon of spams, and should avoid accidental deletion ofnonspam e-mails (also known as hams). Second, everyincoming e-mail has to be matched with the large database,meaning that the near-duplicate matching process shouldbe substantially efficient. Finally, the latest spams have to be

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 669

. C.-Y. Tseng and M.-S. Chen are with the Department of ElectricalEngineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road,Taipei 10617, Taiwan, and the Research Center for Information TechnologyInnovation (CITI), Academia Sinica, No. 128, Sec. 2, Academia Road,Nankang, Taipei 11529, Taiwan.E-mail: [email protected], [email protected].

. P.-C. Sung is with the Department of Electrical Engineering, NationalTaiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan.E-mail: [email protected].

Manuscript received 7 Mar. 2009; revised 9 Aug. 2009; accepted 7 Dec. 2009;published online 24 Aug. 2010.Recommended for acceptance by D. Cook.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-2009-03-0118.Digital Object Identifier no. 10.1109/TKDE.2010.147.

1. http://www.messagelabs.com/globalthreats.

1041-4347/11/$26.00 � 2011 IEEE Published by the IEEE Computer Society

Page 2: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

included instantly and successively into the database so asto effectively block subsequent near-duplicate spams.

Although previous researchers have developed variousmethods on near-duplicate spam detection [7], [8], [12], [16],[17], [22], [23], [24], [25], [30], [31], these works are stillsubject to some drawbacks. To achieve the objectives ofsmall storage size and efficient matching, prior worksmainly represent each e-mail by a succinct abstractionderived from e-mail content text. Moreover, hash-based textrepresentation is applied extensively. One major problem ofthese abstractions is that they may be too brief and thusmay not be robust enough to withstand intentional attacks.A common attack to this type of representation is to insert arandom normal paragraph without any suspicious key-words into unobvious position of an e-mail. In such acontext, if the whole e-mail content is utilized for hash-based representation, the near-duplicate part of spamscannot be captured. In addition, the false positive rate (i.e.,the rate of classifying hams as spams) may increase becausethe random part of e-mail content is also involved in e-mailabstraction. On the other hand, hash-based text representa-tion also suffers from the problem of not being suitable forall languages. Finally, images and hyperlinks are importantclues to spam detection, but both of them are unable to beincluded in hash-based text representation.

In this paper, we explore to devise a more sophisticated e-mail abstraction, which can more effectively capture the near-duplicate phenomenon of spams. Motivated by the fact that e-mail users are capable of easily recognizing similar spams byobserving the layouts of e-mails, we attempt to represent eache-mail based on the e-mail layout structure. Fortunately,almost all e-mails nowadays are in Multipurpose InternetMail Extensions (MIME) format with the text/html content-type. That is, HTML content is available in an e-mail andprovides sufficient information about e-mail layout structure.In view of this observation, we propose the specific procedureStructure Abstraction Generation (SAG), which generates anHTML tag sequence to represent each e-mail. Different fromprevious works, SAG focuses on the e-mail layout structureinstead of detailed content text. In this regard, each paragraphof text without any HTML tag embedded will be transformedto a newly defined tag <mytext=>.

Definition 1 (<mytext=>). <mytext=> is a newly definedtag that represents a paragraph of text without any HTML tagembedded.

Since we ignore the semantics of the text, the proposedabstraction scheme is inherently applicable to e-mails in alllanguages. This significant feature is superior to mostexisting methods. Once e-mails are represented by ournewly devised e-mail abstractions, two e-mails are viewedas near-duplicate if their HTML tag sequences are exactlyidentical to each other. Note that even when spammers insertrandom tags into e-mails, the proposed e-mail abstractionscheme will still retain efficacy since arbitrary tag insertion isprone to syntax errors or tag mismatching, meaning that theappearance of the e-mail content will be greatly altered.Moreover, the proposed procedure SAG also adopts someheuristics to better guarantee the robustness of our approach.

While a more sophisticated e-mail abstraction is intro-duced, one challenging issue arises: how to efficiently

match each incoming e-mail with an existing huge spamdatabase. To resolve this issue, we devise an innovative treestructure, SpTrees, to store large amounts of the e-mailabstractions of reported spams, and SpTrees contribute tosubstantially promoting the efficiency of matching. In thedesign of the near-duplicate matching scheme based onSpTrees, we aim at reducing the number of spams and tagswhich are required to be compared.

By integrating above techniques, in this paper, we designa complete spam detection system COllaborative SpamDEtection System (Cosdes). Cosdes possesses an efficientnear-duplicate matching scheme and a progressive updatescheme. The progressive update scheme not only adds innew reported spams, but also removes obsolete ones in thedatabase. With Cosdes maintaining an up-to-date spamdatabase, the detection result of each incoming e-mail canbe determined by the near-duplicate similarity matchingprocess. In addition, to withstand intentional attacks, areputation mechanism is also provided in Cosdes to ensurethe truthfulness of user feedback.

To the best of our knowledge, there is no prior researchin considering e-mail layout structure to represent e-mailsin the field of near-duplicate spam detection. In summary,the contributions of this paper are as follows:

1. We propose the specific procedure SAG to generatethe e-mail abstraction using HTML content in e-mail,and this newly devised abstraction can moreeffectively capture the near-duplicate phenomenonof spams.

2. We devise an innovative tree structure, SpTrees, tostore large amounts of the e-mail abstractions ofreported spams. SpTrees contribute to the accom-plishment of the efficient near-duplicate matchingwith a more sophisticated e-mail abstraction.

3. We design a complete spam detection systemCosdes with an efficient near-duplicate matchingscheme and a progressive update scheme. Theprogressive update scheme enables system Cosdesto keep the most up-to-date information for near-duplicate detection.

The rest of this paper is outlined as follows: In Section 2,preliminaries including the definition of near-duplicate andthe related works are given. In Section 3, we introduce thenovel e-mail abstraction scheme. In Section 4, the completesystem model of Cosdes is depicted. The experimentalresults are shown in Section 5, and finally, this paper isconcluded with Section 6.

2 PRELIMINARIES

In this section, the definition of near-duplicate, in thispaper, is presented in Section 2.1. We then review therelated works on spam detection in Section 2.2.

2.1 Definition of Near-Duplicate

The central idea of near-duplicate spam detection is to exploitreported known spams to block subsequent ones which havesimilar content. For different forms of e-mail representation,the definitions of similarity between two e-mails are diverse.Unlike most prior works representing e-mails based mainly

670 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Page 3: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

on content text, we investigate representing each e-mailusing an HTML tag sequence, which depicts the layoutstructure of e-mail, and look forward to more effectivelycapturing the near-duplicate phenomenon of spams. Initi-ally, the definition of <anchor> tag is given as follows:

Definition 2 (<anchor>). The tag <anchor> is one type ofnewly defined tag that records the domain name or the e-mailaddress in an anchor tag.

For example, the anchor tag <a href=“http://arbor.ee.ntu.edu.tw/index.htm”> is transformed to <arbor.ee.ntu.edu.tw>. The anchor tag <a href=“mailto:[email protected]”> is transformed to <[email protected]>. The purpose of creating the <anchor> tag isto minimize the false positive rate when the number of tagsin an e-mail abstraction is short. The less the number of tagsin an e-mail abstraction, the more possible that a ham maybe matched with known spams and be misclassified as aspam. Therefore, when the number of tags in an e-mailabstraction is smaller than a predefined threshold, for eachanchor tag <a>, we specifically record the targeted domainname or e-mail address, which is a significant clue foridentifying spams.

On the other hand, in this paper, the detailed definitionof near-duplicate is given as follows:

Definition 3 (Near-Duplicate). Let I ¼ ft1; t2; . . . ; ti; . . . ; tn;<mytext=>;<anchor>g be the set of all valid HTML tagswith two types of newly created tags, <mytext=> and<anchor>, included. An e-mail abstraction derived fromprocedure SAG is denoted as <e1; e2; . . . ; ei; . . . ; em>, whichis an ordered list of tags, where ei 2 I. The definition of near-duplicate is: “Two e-mail abstractions � ¼ <a1; a2; . . . ;ai; . . . ; an> and � ¼ <b1; b2; . . . ; bi; . . . ; bm> are viewed asnear-duplicate if 8ai ¼ bi and n ¼ m.”

Definition 4 (Tag Length). The tag length of an e-mailabstraction is defined as the number of tags in an e-mailabstraction.

Note that we strictly define that two e-mail abstractionsare near-duplicate only if they are exactly identical to eachother. The major reason is that there are numerous HTMLtag patterns appearing commonly and frequently. Partialmatching of HTML tag sequences will cause much higherrate of false positive error, and the complexity will be toohigh to achieve efficient matching. In addition, for furtherspeed-up, while the tag length of an e-mail abstraction islonger, we even apply a looser matching criterion, whichdoes not degrade detection results.

2.2 Related Works

Since the e-mail spam problem is increasingly seriousnowadays, various techniques have been explored to relievethe problem. Based on what features of e-mails are beingused, previous works on spam detection can be generallyclassified into three categories: 1) content-based methods,2) noncontent-based methods, and 3) others. Initially,researchers analyze e-mail content text and model thisproblem as a binary text classification task. Representativesof this category are Naive Bayes [14], [20] and SupportVector Machines (SVMs) [1], [10], [15], [27] methods. In

general, Naive Bayes methods train a probability modelusing classified e-mails, and each word in e-mails will begiven a probability of being a suspicious spam keyword. Asfor SVMs, it is a supervised learning method, whichpossesses outstanding performance on text classificationtasks. Traditional SVMs [10] and improved SVMs [1], [15],[27] have been investigated. While above conventionalmachine learning techniques have reported excellent resultswith static data sets, one major disadvantage is that it iscost-prohibitive for large-scale applications to constantlyretrain these methods with the latest information to adapt tothe rapid evolving nature of spams. The spam detection ofthese methods on the e-mail corpus with various languagehas been less studied yet. In addition, other classificationtechniques, including markov random field model [3],neural network [6] and logic regression [2], and certainspecific features, such as URLs [26] and images [19], [29]have also been taken into account for spam detection.

The other group attempts to exploit noncontent informa-tion such as e-mail header, e-mail social network [4], [28],and e-mail traffic [5], [9] to filter spams. Collectingnotorious and innocent sender addresses (or IP addresses)from e-mail header to create black list and white list is acommonly applied method initially. MailRank [4] examinesthe feasibility of rating sender addresses with algorithmPageRank in the e-mail social network, and in [28], amodified version with update scheme is introduced. Sincee-mail header can be altered by spammers to conceal theidentity, the main drawback of these methods is thehardness of correctly identifying each user. In [5], [9], theauthors intend to analyze e-mail traffic flows to detectsuspicious machines and abnormal e-mail communication.It is noted that these approaches have to operate incoordination with other complementary methods to gainbetter results. Moreover, some researchers consider com-bining the merits of several techniques [2], [13], [18]. Eventhough the performance of classifier integration seemsprominent, there is still no conclusion on what is the bestcombination. In addition, how to efficiently update thewhole included classifiers is another unsolved issue.

On the other hand, collaborative spam filtering withnear-duplicate similarity matching scheme has been stu-died extensively in recent years. Regarding collaborativemechanism, P2P-based architecture [8], [12], [31], centra-lized server-based system [16], [21], [22], [30], and others[17], [23] are generally employed. Note that no matterwhich mechanism is applied, the most critical factor is howto represent each e-mail for near-duplicate matching. The e-mail abstraction not only should capture the near-duplicatephenomenon of spams, but should avoid accidentaldeletion of hams. In [30], the first N hash values of eachlength L substring are used as vector representation of thee-mail. In [7], [8], [17], a 32-byte code derived from avariation of Nilsimsa digest technique is utilized torepresent the distribution of word trigrams in e-mail. In[23], [24], [25], the authors improve the open digesttechnique [7] by representing each e-mail with multipledigests produced from the strings of fixed length sampledat randomized positions within e-mail. In [12], [31], afeature vector of a block text fingerprint generated from theset of checksums of each length L substring is exploited. In

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 671

Page 4: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

[22], the authors make use of spam-vocabulary patternsproduced by Teiresias pattern discovery algorithm. In [16],the I-Match signature determined by a set of unique termsshared by spams and the I-Match lexicon is put to use. In[21], the content similarity of e-mails computed usingextracted words is measured. It is noted that most existingmethods generate e-mail abstractions based mainly oncontent text. However, randomized and normal paragraphsare commonly inserted in spams nowadays, and thus if ane-mail abstraction is generated by the whole content text,the near-duplicate part of spams cannot be captured.Moreover, generating e-mail abstraction with the contenttext also suffers from the problem of not being applicable toall languages.

In light of the above problems, it deserves further studiesto design a better e-mail abstraction approach that is morerobust to withstand intentional attacks.

3 E-MAIL ABSTRACTION SCHEME

In this section, a novel e-mail abstraction scheme isintroduced. In Section 3.1, procedure SAG is presented todepict the generation process of an e-mail abstraction. Thedevised data structures SpTable and SpTrees are illustratedin Section 3.2. Finally, the robustness issue is discussed inSection 3.3.

3.1 Structure Abstraction Generation

We propose the specific procedure SAG to generate the e-mailabstraction using HTML content in e-mail. SAG is elaboratedwith the example of Fig. 3, and the algorithmic form of SAG isoutlined in Fig. 1. Procedure SAG is composed of three majorphases, Tag Extraction Phase, Tag Reordering Phase, and<anchor> Appending Phase. In Tag Extraction Phase, thename of each HTML tag is extracted, and tag attributes andattribute values are eliminated. In addition, each paragraphof text without any tag embedded is transformed to<mytext=>. In lines 4-5, <anchor> tags are then insertedintoAnchorSet, and the first 1,023 valid tags are concatenated

to form the tentative e-mail abstraction. Note that we retainonly the first 1,023 tags as the tag sequence. The main reason isthat the rear part of long e-mails can be ignored withoutaffecting the effectiveness of near-duplicate matching. Sub-sequently, in line 6 of Fig. 1, we preprocess the tag sequence ofthe tentative e-mail abstraction. One objective of thispreprocessing step is to remove tags that are common butnot discriminative between e-mails. The other objective is toprevent malicious tag insertion attack, and thus the robust-ness of the proposed abstraction scheme can be furtherenhanced. The following sequence of operations is performedin the preprocessing step.

1. Front and rear tags (as shown in the gray area of theexample e-mail in the top of Fig. 3) are excluded.

2. Nonempty tags2 that have no corresponding starttags or end tags are deleted. Besides, mismatchednonempty tags are also deleted.

3. All empty tags2 are regarded as the same and arereplaced by the newly created <empty=> tag.Moreover, successive <empty=> tags are prunedand only one <empty=> tag is retained.

4. The pairs of nonempty tags enclosing nothing areremoved.

Example 1. Consider the example e-mail abstraction in Fig. 2

that has been produced through the execution of lines 1-5

in procedure SAG. The first operation of the preprocessing

step is to remove tags which are in front of the<body> tag

and which are in rear of the <=body> tag. With regard to

operation 2, since there is no end tag of <a>, this tag is

deleted. Besides, the tags<font> and< =font > are also

deleted because the position of<=font> is incorrect. Note

that we can utilize the stack data structure to determine

whether nonempty tags are mismatched. After that, empty

tags are transformed to < empty=> in operation 3. More-

over, since <mytext=><img=><hr=> appear consecu-

tively, only one <empty=> tag is retained. Finally,

<div><font><=font><=div> are removed due to the

lack of content.

672 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Fig. 1. Algorithmic form of procedure SAG.

Fig. 2. An example of the preprocessing step in Tag Extraction Phase ofprocedure SAG.

2. In the HTML Document Type Definition (DTD), empty tags contain notext and have no end tags. For example, <br=>, <hr=>, <img=>, <input=>are empty tags. The newly created tag, <mytext=>, is also regarded as anempty tag in this paper. On the other hand, all tags that are not declared inthe DTD as empty tags are nonempty tags. Nonempty tags must have anend tag.

Page 5: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

The middle part of Fig. 3 shows an example of a tentativee-mail abstraction and AnchorSet (i.e., <spam:com>)derived from Tag Extraction Process.

On purpose of accelerating the near-duplicate matchingprocess, we reorder the tag sequence of an e-mail abstrac-tion in Tag Reordering Phase. Note that since the arrange-ment of HTML tags is regular and in pairs, varioussequential patterns of tags are contained in e-mails. In theworst case, if we consider two e-mail abstractions whichhave the same tag length and differ only in their last tags,the difference cannot be detected until the last tags arecompared. To handle this problem, we destroy theregularity by rearranging the order of tag sequence tolower the number of tag comparisons. Note that this processensures that the newly assigned position numbers of e-mailabstractions with the same number of tags are completelyidentical. As such, the matching process can be acceleratedwithout violating the definition of near-duplicate in thispaper. In lines 8-11 of Fig. 1, each tag is assigned a newposition number by function ASSIGN_PN (PN denotes forPosition Number) with following expressions,

b ¼ dffiffiffiffi

Lpe;

r ¼ ðPNorig � 1Þ%b;q ¼ bðPNorig � 1Þ=bc þ 1;

PNnew ¼ ðb� rÞ þ ðb� q þ 1Þ;

where L is the tag length of an e-mail abstraction, and PNorig

is the original position number. Variable b is the number ofbuckets. Variable r indicates which bucket should be placed,and variable q is the number of shift counts from the end ofthis bucket. Fig. 3 demonstrates the assignment of the firstsix tags. The final e-mail abstraction is the concatenation ofall tags with new position numbers (the vacant positions,e.g., positions 9 and 13 in Fig. 3, are ignored). Additionally, ifthe tag length of an e-mail abstraction is smaller than apredefined tag length threshold (set as 16 in the experi-ments) of the short e-mail, the tags in AnchorSet will be

appended in front of the e-mail abstraction. The mainobjective of appending <anchor> tags is to reduce theprobability that a ham is successfully matched with reportedspams when the tag length of an e-mail abstraction is short.An example e-mail abstraction produced by procedure SAGis shown in the bottom of Fig. 3.

3.2 Design of SpTable and SpTrees

One major focus of this work is to design the innovativedata structure to facilitate the process of near-duplicatematching. SpTable and SpTrees (sp stands for spam) areproposed to store large amounts of the e-mail abstractionsof reported spams. As shown in Fig. 4, several SpTrees arethe kernel of the database, and the e-mail abstractions ofcollected spams are maintained in the correspondingSpTrees. According to Definition 3, two e-mail abstractionsare possible to be near-duplicate only when the numbers oftheir tags are identical. Thus, if we distribute e-mailabstractions with different tag lengths into diverse SpTrees,the quantity of spams required to be matched will decrease.However, if each SpTree is only mapped to one single taglength, it is too much of a burden for a server to maintainsuch thousands of SpTrees. In view of this concern, eachSpTree is designed to take charge of e-mail abstractionswithin a range of tag lengths. As can be seen in Fig. 4,SpTable is created to record overall information of SpTrees.The ith column of SpTable links to the root of SpTree_i by apointer, and e-mail abstractions with tag lengths rangingfrom 2i to 2iþ1 � 1 belong to SpTree_i.

Regarding how an e-mail abstraction is stored in SpTree,Fig. 5 gives an example with the same e-mail abstractionderived from Fig. 3. An e-mail abstraction is segmented intoseveral subsequences, and these subsequences are consecu-tively put into the corresponding nodes from low levels tohigh levels. As such, an e-mail abstraction is stored in onepath from the root node to a leaf node of SpTree, and hencethe matching between a testing e-mail and known spams isprocessed from root to leaf. As shown in Fig. 5, the examplee-mail abstraction is stored in a path from the root node a tothe leaf node d. The primary goal of applying the tree datastructure for storage is to reduce the number of tagsrequired to be matched when processing from root to leaf.Since only subsequences along the matching path from rootto leaf should be compared, the matching efficiency issubstantially increased. Note that if each type of HTML tag

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 673

Fig. 3. An example procedure flow of SAG.

Fig. 4. The data structures of SpTable and SpTrees.

Page 6: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

determines a branch direction (i.e., the tree degree will bethe number of HTML tag types) and each level of SpTreecontains merely one HTML tag (i.e., the tree height will bethe tag length of the longest e-mail abstraction), the numberof tag comparisons will be minimum. However, it isinfeasible because the degrees and the heights of SpTreeswill be too large, and SpTrees will be extremely unbalanced.

To achieve efficient matching with balanced tree struc-ture, SpTrees are designed to be binary trees. The branchdirection of each SpTree is determined by a binary hashfunction. If the first tag of a subsequence is a start tag (e.g.,<div>), this subsequence will be placed into the left childnode. A subsequence whose first tag is an end tag (e.g.,<=div>) will be placed into the right child node. Since mostHTML tags are in pairs and the proposed e-mail abstractionis reordered in procedure SAG, subsequences are expectedto be uniformly distributed. Moreover, on level i of eachSpTree (with the root on level 0), each node storessubsequences whose tag lengths are equal to 2i. For instance,as shown in Fig. 3, the subsequence <spam:com> (whosetag length is 20) is placed into level 0, the subsequence<=p><a> (whose tag length is 21) is placed into level 1, andso forth. Note that since SpTree_i takes charge of e-mailabstractions with tag lengths ranging from 2i to 2iþ1 � 1,based on the above-mentioned arrangement, the lastsubsequence of each e-mail abstraction in an SpTree willbe stored in the leaf nodes on the same level. Also note thatthe tag lengths of subsequences stored in leaf nodes of level jrange from 1 to 2j. As described in Section 3.1, we designthat the longest length of an e-mail abstraction is 1,023,meaning that there are totally 10 SpTrees (from SpTree_0 toSpTree_9) in our database while the proposed arrangementis applied. In addition, to further accelerate the process ofmatching, we employ a hash function to map eachsubsequence to an integer. The key idea is that onlysubsequences which look like the testing subsequenceshould be exactly matched. The hash function is definedas follows:

hashðseqÞ ¼ fðseq½0�Þ � 2m�1 þ fðseq½1�Þ � 2m�2

þ � � � þ fðseq½m� 1�Þ � 20;

wherem is the number of tags in this subsequence and seq½n�denotes the tag type of the nth tag. The function f convertseach type of tag to a unique integer. Moreover, for thesubsequence which contains more than eight tags, we justuse the first eight tags to generate the hash value (i.e.,m � 8).

With the hash function, most subsequence matching istransformed to the integer matching, and hence the complex-ity of matching process can be substantially reduced.

Overall, the advantageous features of this innovativearrangement are as follows: 1) The height of an SpTree isequal to lgLb c, where L is the tag length of the longest e-mail abstraction in this SpTree. 2) Owing to the fact thatparent nodes store less number of subsequences thanchildren nodes, we design that longer subsequences areput into higher levels (the tag length of the subsequenceon level i is 2i). Thus, the number of tags matched fromroot to leaf is markedly decreased. Moreover, with thehash function, the matching efficiency is substantiallyincreased. 3) The numbers of tags stored in the nodes ofan SpTree are expected to be similar, and hence SpTreesare balanced binary trees. Assume that there are N e-mailabstractions in SpTree_i and subsequences on the samelevel are uniformly distributed. For each node on level j,there are N

2j subsequences (the number of nodes on level jis 2j) whose tag lengths are equal to 2j. Thus, the numberof tags stored in each node is N

2j � 2j ¼ N , which is notcorrelated with j. It is noted that the number of tags ineach leaf node is less than N because not all subse-quences in leaf nodes are with the longest allowed taglength.

On the other hand, as shown in the bottom of Fig. 5,certain additional information is required and kept in eachsubsequence of a node. For subsequences in internal nodes,spam id and timestamp are included. For subsequences inexternal nodes, spam id, user id, timestamp, length EA (thelength of the e-mail abstraction), and SR (the reputationscore) are involved.

3.3 Robustness Issue

The main difficulty of near-duplicate spam detection is towithstand malicious attack by spammers. Prior approachesgenerate e-mail abstractions based mainly on hash-basedcontent text. These methods primarily differ in whatgranularity is used as the input of the hash function. Forexample, the authors in [16], [21], [22] extract words orterms to generate the e-mail abstraction. Besides, substringsextracted by various techniques are widely employed in [7],[8], [12], [17], [23], [25], [30], [31]. However, this type of e-mail representation inherently has following disadvantages.First, the insertion of a randomized and normal paragraphcan easily defeat this type of spam filters. Moreover, sincethe structures and features of different languages arediverse, word and substring extraction may not be applic-able to e-mails in all languages. Concretely speaking, forinstance, trigrams of substrings used in [7], [8], [17] are notsuitable for nonalphabetic languages, such as Chinese.

In this paper, we devise a novel e-mail abstraction schemethat considers e-mail layout structure to represent e-mails.To assess the robustness of the proposed scheme, we modelpossible spammer attacks and organize these attacks asfollowing three categories. Examples and the outputs ofpreprocessing of procedure SAG are shown in Fig. 6.

3.3.1 Random Paragraph Insertion

This type of spammer attack is commonly used nowa-days. As shown in Fig. 6, normal contents without any

674 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Fig. 5. The illustration of SpTree_3 with an example e-mail abstraction.

Page 7: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

advertisement keywords are inserted to confuse text-based spam filtering techniques. It is noted that ourscheme transforms each paragraph into a newly createdtag <mytext=>, and consecutive empty tags will then betransformed to <empty=>. As such, the representation ofeach random inserted paragraph is identical, and thus ourscheme is resistant to this type of attack.

3.3.2 Random HTML Tag Insertion

If spammers know that the proposed scheme is based onHTML tag sequences, random HTML tags will be insertedrather than random paragraphs. On the one hand, arbitrarytag insertion will cause syntax errors due to tag mismatch-ing. This may lead to abnormal display of spam content thatspammers do not wish this to happen. On the other hand,procedure SAG also adopts some heuristics (as depicted inSection 3.1) to deal with the random insertion of empty tagsand the tag mismatching of nonempty tags. Fig. 6 showstwo example outputs and the details of each step can befound in Fig. 2. With the proposed method, most randominserted tags will be removed, and thus the effectiveness ofthe attack of random tag insertion is limited. We shall verifythis inference in Section 5.4.

3.3.3 Sophisticated HTML Tag Insertion

Suppose that spammers are more sophisticated, they mayinsert legal HTML tag patterns. As shown in Fig. 6, if tagpatterns that do conform to syntax rules are inserted, theywill not be eliminated. However, although some craftytricks may be conceivable, it is not intuitive for spammers togenerate a large number of spams with completely distincte-mail layout structure.

Note that due to space limitation, we are not able todiscuss all possible situations. Nevertheless, representing e-mails with layout structure is more robust to most existingattacks than text-based approaches. Even though new attackhas been designed, we can react against it by adjusting thepreprocessing step of procedure SAG. On the other hand,our approach extracts only HTML tag sequences andtransforms each paragraph with no tag embedded to<mytext=>, meaning that the proposed abstraction schemecan be applied to e-mails in all languages without modifyingany components. This important feature also enables system

Cosdes to perform more robustly. We shall assess theeffectiveness of our approach with real e-mail streams inSection 5.2.

4 COLLABORATIVE SPAM DETECTION SYSTEM

COSDES

A complete collaborative spam detection system Cosdes isintroduced in this section. The system model of Cosdes isgiven in Section 4.1. We then elaborate the processinghandlers of Cosdes in Section 4.2. Finally, we describe thereputation mechanism of Cosdes in Section 4.3.

4.1 System Model of Cosdes

The system model of Cosdes is illustrated in Fig. 7, and thealgorithmic form is outlined in Fig. 8. Initially, threeparameters, Tm (the maximum time span for reported spamsbeing retained in the system), Td (the time span for triggeringDeletion Handler), and Sth (the score threshold for deter-mining spams) should be given for Cosdes. Before startingto do the spam detection, Cosdes collects feedback spams fortime Tm in advance to construct an initial database. Threemajor modules, Abstraction Generation Module, DatabaseMaintenance Module, and Spam Detection Module, are

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 675

Fig. 6. Examples of possible spammer attacks.

Fig. 7. System model of Cosdes.

Fig. 8. Algorithmic form of system Cosdes.

Page 8: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

included in Cosdes. With regard to Abstraction GenerationModule, each e-mail is converted to an e-mail abstraction byStructure Abstraction Generator with procedure SAG. Threetypes of action handlers, Deletion Handler, InsertionHandler, and Error Report Handler, are involved inDatabase Maintenance Module. Note that although the term“database” is used, the collection of reported spams can beessentially stored in main memory to facilitate the process ofmatching. In addition, Matching Handler in Spam DetectionModule takes charge of determining results.

There are three types of e-mails, reported spam, testinge-mail, and misclassified ham, required to be dealt with byCosdes. When receiving a reported spam, InsertionHandler adds the e-mail abstraction of this spam into thedatabase except that the reputation score of this reporter istoo low. Whenever a new testing e-mail arrives, MatchingHandler performs the near-duplicate detection with col-lected spams to do the judgment. Meanwhile, if a testing e-mail is classified as a spam, this e-mail will be viewed as areported spam and be added into the database. Moreover,Error Report Handler copes with feedback misclassifiedhams and adjusts Cosdes by degrading the reputation ofrelated reporters to prevent malicious attacks. For every Td,Deletion Handler is triggered to delete obsolete spamswhich exist over time Tm. The main functionalities ofdeleting outdated spams are not only to alleviate theoverhead of the server, but to reduce the risk of accidentaldeletion of hams. Due to the evolving nature of spams, it isinappropriate to utilize old spams to filter current ones.Overall, Cosdes is self-adjusting and retains the most up-to-date spams for near-duplicate detection.

4.2 Procedures of System Cosdes

In this section, we elucidate each procedure of systemCosdes. Cosdes deals with four circumstances by handlers(the algorithmic forms are shown in Fig. 9), and the detailedprocedure flow will be explained as follows: For InsertionHandler in Fig. 9a, initially, the corresponding SpTree isfound in SpTable according to the tag length of the insertedspam, and nowNode is assigned as the root of this SpTree. Inlines 3-8, we iteratively insert the subsequences of thee-mail abstraction along the path from root to leaf. IfnowNode is an internal node, the subsequence with 2i tagsis inserted into level i, which is illustrated in Fig. 5.Meanwhile, the hash value of this subsequence is com-puted. Then, nowNode is assigned as the correspondingchild node based on the type of the next tag. If the next tagis a start (end) tag, nowNode is assigned as the left (right)child node. Finally, when nowNode is processed to a leafnode, the subsequence with remaining tags is stored.

Matching Handler (as shown in Fig. 9b) is the mostsignificant procedure in Cosdes to achieve efficient match-ing between every testing e-mail and the known spamdatabase. There are two major phases in the matchingprocess: Approximate Matching Phase and Exact MatchingPhase. As mentioned in Section 3.2, the tag lengths of e-mailabstractions in an SpTree may not be identical. However,two e-mail abstractions are possible to be near-duplicateonly when the numbers of their tags are identical. For thisreason, in Approximate Matching Phase, we traversedirectly to the targeted leaf node based on the types of tags

at positions 2i without doing tag comparisons. It is certain

that a testing e-mail may merely be near-duplicate with

spams which have the same tag length and are in the same

path. Therefore, we tentatively record the information of

spams, which appear in the targeted leaf node and have the

same tag length, into a candidate set candSet. The main

objectives of the approximate matching are: 1) to reduce

676 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Fig. 9. Algorithmic forms of all processing handlers. (a) Algorithmic formof Insertion Handler. (b) Algorithmic form of Matching Handler.(c) Algorithmic form of Error Report Handler. (d) Algorithmic form ofDeletion Handler.

Page 9: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

unnecessary tag comparisons of e-mails with different taglengths, and 2) to exclude e-mails which can be determinedwithout the exact tag matching. Subsequently, the processstarts Exact Matching Phase from the root of the SpTree. Foreach level, in lines 11-17, the hash values of subsequencesare matched first. Then, we do the exact matching ofsubsequences only if their hash values are matched. Theunmatched information of spams will be deleted fromcandSet. Moreover, we design that the exact tag matching isonly processed to level f level (set as 3 in the experiments),namely, only the first 2f levelþ1 � 1 tags are exactly matched.It means that the looser matching criterion is applied whenthe length of an e-mail abstraction is longer than 2f levelþ1.This looser criterion substantially promotes the efficiency ofmatching but does not influence detection results owing tothe effects of the preceding approximate matching and thetag reordering process of procedure SAG. Finally, if the sumof SR of all candidate spams in candSet exceeds Sth, thetesting e-mail will be classified as a spam.

To handle e-mails with only plain text, we maintain ablacklist of not only sender addresses but also URLs ande-mail addresses included in the content of reportedspams. Gathering these feedback information to blocktext/plain spams is very effective since spammers have tolet receivers connect to them so as to make profits.Therefore, Cosdes still can cope with spams with merelyplain text even though we concentrate on e-mail layoutstructure in this paper.

When receiving a misclassified ham, Error ReportHandler (shown in Fig. 9c) first finds the correspondingSpTree and does the matching process as the same inMatching Handler. For the spams matched with thereported misclassified ham, we reset SR of these spams as0 to avoid subsequent misclassification incurred by theidentical group of spams. In addition, the reputation scoresof reporters who cause the false positive error are halved toprevent continuous attacks by specific users.

Moreover, to delete obsolete spams, for every Td,Deletion Handler (as shown in Fig. 9d) traverses eachSpTree in inorder (traverses the left subtree, visit the root,and then traverses the right subtree) to visit all nodes inSpTrees. For each subsequence, if the existing time exceedsTm, it will be viewed as outdated and be deleted from thisnode. As such, all obsolete spams are removed from theknown spam database after Deletion Handler is processed.

4.3 Reputation Mechanism

The principal concept of collaborative spam detection is tocollect human judgment to block subsequent near-duplicatespams. To ensure the truthfulness of spam reports and toprevent malicious attacks, we propose the reputationmechanism to evaluate the credit of each reporter. Thefundamental idea of the reputation mechanism is to utilize areputation table to maintain a reputation score SR of eachreporter according to the previous reliability record. Eachinserted spam is given a suspicion score equal to SR of thereporter. In such a context, when doing near-duplicatedetection, if the sum of suspicion scores of matched spamsexceeds a predefined threshold, the testing e-mail will beclassified as a spam. The reputation mechanism is describedin detail as follows:

1. Each reporter is assigned an initial score Sinitial whenhe submits a reported spam at the first time.

2. If a reporter submits any feedback spam once more,the reputation score will be incremented by a smallerincremental score Sincre. The value of Sincre is set asSinitial

10 in the experiments.3. If a reporter is charged that his previous feedback

spam is mistaken, the reputation score will behalved.

To prevent malicious error reports and to attain a near-zerofalse positive rate, we cautiously increase the reputationscore but drop it drastically while a false positive error isissued. On the other hand, when SR of a reporter is smallerthan Sinitial, his subsequent feedback spams will not beadded into the database until SR is equal to or larger thanSinitial. Regarding the parameter Sth, we simply use a fixedsmall value (set as three in the experiments) instead ofdetermining the threshold according to the ratio of totalusers. The reason is that as long as there are certain trustyusers reporting the e-mails with the same e-mail abstractionas spams, it is sufficiently reliable to classify the subsequentnear-duplicate e-mails as spams.

5 PERFORMANCE EVALUATION

To assess the feasibility of system Cosdes, we conduct severalexperiments to explore its efficiency and detection results.The real spam data sets used in the experiments are from thee-mail servers of Computer Center in National TaiwanUniversity, which has over 30,000 students. Since the groundtruth of real e-mail streams is unavailable, spams areextracted from the well-known existing system, SpamAssas-sin.3 Concerning hams, we not only include public data sets(around 4,000 e-mails) provided by SpamAssassin,4 but alsoobtain from volunteers. There are about 60,000 spams per dayand a set of 7,000 or so hams in the data set. Note thatnumerous related works have evaluated the proposedmethods with static databases. However, to access theperformance of spam detection system with near-duplicatematching scheme, real e-mail streams are more appropriatethan static data sets. Therefore, in this paper, we useuniversity-scale e-mail streams as the experimental data setsto better simulate the e-mail environment. On the other hand,three representative approaches [7], [24], [30] of near-duplicate spam detection are employed for comparison.The authors of [8], [17] also adopt the same e-mailrepresentation approach as in [7] but with different sharingmechanisms. For ease of presentation, Damiani’s work isabbreviated as Digest. Sarafijanovic’s work is abbreviated asMultiDigest, and Yoshida’s work is abbreviated as Density. Itis worth mentioning that Sarafijanovic’s work [24] improvesDamiani’s one [7] by representing each e-mail with multipledigests produced from the strings of fixed length sampled atrandomized positions within e-mail. The processes ofgenerating each digest in Digest and MultiDigest areidentical. Although Sarafijanovic’s work claims that usingmultiple digests can enhance the robustness of near-duplicatespam detection system, these two works have not been

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 677

3. http://spamassassin.apache.org.4. http://spamassassin.apache.org/publiccorpus. We include three files,

that is, 20030228_easy_ham.tar.bz2, 20030228_easy_ham_2.tar.bz2, and20030228_hard_ham.tar.bz2.

Page 10: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

validated by real e-mail streams. Besides, Yoshida’s workconsiders maintaining a direct-mapped cache to facilitate theprocess of matching. In the experiments of his work [30], thereare 10 million spams in the database and 10 percent of hashvalues are copied in the cache. However, to fairly compare thedetection performance, the cache mechanism of Yoshida’swork is not included in our experiments, meaning that allspams in the database are used for detection. We implementCosdes and comparative techniques with C++ language, andthe programs are executed in Windows XP professionalplatform with Pentium 4—3GHz CPU and 1GB RAM. Theprograms of Digest and MultiDigest are implemented withsource codes shared by original authors of [7] and [24].Initially, the efficiency and the space usage of generating e-mail representation are investigated in Section 5.1. Wecompare and analyze the detection results of four approachesin Section 5.2. The detailed efficiency analysis is presented inSection 5.3. Finally, Section 5.4 simulates the spammer attackof random HTML tag insertion.

5.1 E-mail Representation

The processing time of the generation of e-mail abstractionswith the number of e-mails varied is shown in Fig. 10a. Asmentioned in Section 2.2, most prior works on near-duplicatespam detection represent e-mails based mainly on contenttext. Cosdes is the first work to attempt to utilize HTMLcontent, which depicts the layout structure of an e-mail, forrepresentation. Regarding Digest, word trigrams are ex-tracted consecutively along the whole e-mail content. As forDensity, the authors acquire the first N (in [30], N is set as100) hash values of each length L substring with a fixed-sizewindow sliding through an e-mail. As can be seen in Fig. 10a,Density takes the least time since it computes only the first Nhash values. As for Digest, hash values of word trigrams inthe whole e-mail are required to be computed. RegardingMultiDigest, each e-mail is represented by a set of digests,meaning that each e-mail is separated into multiple stringswith fixed length (in [24], the length is set as 60 characters)sampled at randomized positions. Although the length ofeach string is much shorter, the overall complexity stillincreases since each e-mail has multiple strings needed to beprocessed. Fig. 10a shows that the processing time ofMultiDigest is about four times longer than that of Digest,and thus MultiDigest takes the longest time among fourapproaches. In procedure SAG of Cosdes, HTML tags areextracted and each paragraph of text is transformed to thenewly created tag <mytext=>. If only these operations are

performed, Cosdes will be the most efficient. However,sequence preprocessing and tag reordering are also executedin Cosdes, and therefore, the execution time slightlyincreases. Overall, the cost of generating e-mail abstractionsof four approaches is very low.

With regard to the space issue, Fig. 10b shows thememory usage of four approaches with the number ofe-mails varied in the database. Note that we estimate a hashvalue or an integer number as one unit of memory usage,which approximates to 2 bytes. Digest represents eache-mail with a 32-byte code, which is equal to 16 units. Asmentioned above, MultiDigest utilizes multiple 32-bytecodes for the representation of each e-mail. In the experi-mental data set, the average number of digests in eache-mail is approximately 12, and thus the memory usage ofMultiDigest is larger than that of Digest by around 12 times.Regarding Density, N hash values, that is, 100 units, are therepresentation of each e-mail. However, since some e-mailsare too short to be extracted N hash values, the averagememory usage of each e-mail in Density is smaller that100 units. On the other hand, a sequence of HTML tags isthe representation of each e-mail in Cosdes . We can replaceeach type of tag with a unique integer, and thus each tagcan be viewed as 1 unit. It is calculated that the averagelength of the HTML tag sequence produced by procedureSAG of Cosdes is approximately 35. In addition, as stated inSection 3.2, additional information is required in SpTableand SpTrees of Cosdes. We include this memory overheadas well. It can be observed in Fig. 10b that the memoryusage of Digest is the least, and MultiDigest uses the largestmemory space among four approaches. As for Cosdes, thememory usage is larger than that of Digest by two to fourtimes. Although a fairly succinct e-mail abstraction cangreatly reduce the overhead of near-duplicate matching, wecan find in the following section that the effectiveness ofDigest cannot be validated.

5.2 Accuracy Evaluation

In this section, we evaluate the detection performance ofCosdes and three competitive approaches. The mostimportant requirement for a spam detection system is thecapability to resist malicious attack that evolves continu-ously. To examine this capability, two recent streams ofspams (collected from National Taiwan University inSeptember 2007 and February 2009) are utilized as theexperimental data sets. Regarding the language of contenttext, 80 percent of all e-mails are in Chinese, and 15 percentof them are in English. The minority of e-mails are inJapanese, French, and so forth. Since Chinese is anonalphabetic language and English is an alphabetic one,the data set used in the experiments can verify theeffectiveness of spam detection system with different kindsof languages to a certain extent. Before a system starts to dothe near-duplicate spam detection, a set of known spams isinserted into the system. We consider situations with theparameter Tm varied from one to five days. That is, asshown in the left side of Fig. 11, the detection results areproduced by inserting spams within Tm days first, and thenthe following one-day spams are tested. Note that eachspam is inserted into the database after the process ofmatching. On the other hand, the entire set of hams is tested

678 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Fig. 10. The execution time and the memory usage of generating e-mailabstractions with the number of e-mails varied. (a) Execution time of e-mail representation. (b) Memory usage of e-mail representation.

Page 11: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

in each situation. True positive rate (i.e., TP, a real spam isclassified as a spam) and false positive rate (i.e., FP, a realham is misclassified as a spam) are listed in Fig. 11.

As can be seen in Fig. 12a, Cosdes reports 96.47 percentTP rate and 0.46 percent FP rate on average, which has themost outstanding performance. The TP rate of Digest isextremely high but the FP rate is unacceptable. In order toaccelerate the process of near-duplicate matching, only a 32-byte code is used in Digest to represent each e-mail.Moreover, as defined in [7], two e-mails are determined asnear-duplicate if more than 182 bits of their 32-byte (i.e.,256 bits) codes have the same value. It is shown in Fig. 12athat as the size of spam database is large, the 32-byte code isnot discriminative to clearly distinguish each e-mail, andthus hams are easily mismatched with known spams. As forMultiDigest, although the authors claim in [24] that usingmultiple digests to represent each e-mail can be more robustagainst increased obfuscation effort by spammers, the FPrate of MultiDigest is even worse than that of Digest as thesize of spam database is large. This is owing to the reasonthat MultiDigest separates each e-mail into a set of shortstrings. As long as one digest in the huge spam database issimilar to one of digests in the testing e-mail, this e-mail willbe classified as a spam. In [7] and [24], there are only2,500 spams and 2,500 hams in the data set, which might notsuffice to reflect the real situation. In addition, the effective-ness of Digest and MultiDigest has not been validated byreal e-mail streams. On the other hand, the effectiveness ofDensity has been evaluated in [30] with 10 million spams inthe database. One problem of Density is that a huge numberof known spams are required to make the proposed cachemechanism of Density work well. However, in our experi-ments, even though we do not consider the cache mechan-ism of Density and all reported spams are used for near-duplicate spam detection, the effectiveness of Density onmore recent e-mail streams cannot be validated. Moreover,several parameters should be given for Density and beadjusted according to different environments. Since theauthors in [30] do not provide the parameter tuning method,in this experiment, we follow the same setting as in [30] andobtain the results in Fig. 11. Note that various tricks targetingat nullifying the approaches of hash-based text representa-tion have been increasingly employed recently. Besides,most spams in our data set are in Chinese, which is anonalphabetic language. However, Digest and MultiDigestgenerate hash values with trigrams of substrings. Themethod of this kind is essentially designed to resist theword obfuscation of nonalphabetic languages. Therefore, theresults in this section also verify the language restriction of

comparative approaches. These factors could partiallyaccount for the results shown in Fig. 11.

Concerning Cosdes, we extract more essential informa-tion to represent each e-mail, and the newly devised e-mailabstraction can more effectively capture the near-duplicatephenomenon of spams with an acceptable FP rate. More-over, since we represent each e-mail using HTML tagsequence rather than content text, Cosdes can intrinsicallyapply to both alphabetic and nonalphabetic languageswithout modifying any components of the system. Thisadvantageous property is verified with our data set thatconsists of 15 percent English e-mails and 80 percentChinese ones. In addition, to further investigate thecomponents of Cosdes, we evaluate the detection perfor-mance when either the sequence preprocessing step or theanchor-appending step of procedure SAG is removed. It canbe observed in Fig. 12b that the performance almost doesnot degrade as we exclude the sequence preprocessing step.This consequence reveals that the proposed abstractionscheme has not been countered. Nevertheless, we proposeelaborate design to guarantee the robustness of Cosdes. Onthe other hand, as depicted in Section 3.1, the main objectiveof appending <anchor> tags in front of the e-mailabstraction is to reduce the probability that a ham issuccessfully matched with reported spams when the taglength of an e-mail abstraction is short. Anchor-appendingprocess will be included as the length of an e-mailabstraction is smaller than 16. It can be seen in the rightside of Fig. 12b that the TP rate increases slightly but the FPrate is twice higher than that of the original situation as thisprocess is removed. This is because there are several hamscontaining only a URL that normal users want to share withtheir friends. If the anchor-appending process is removed,these e-mails will be misclassified as spams, and thereby theFP rate deteriorates.

Subsequently, we explore the impact of score thresholdthat determines the number of matched spams that Cosdeswill detect a testing e-mail as a spam. Fig. 12c shows the TPrate and the FP rate of Cosdes with score threshold Sth

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 679

Fig. 12. The comparison of detection performance and the performanceof Cosdes with and without sequence preprocessing in procedure SAG.(a) Average detection results. (b) Impact of sequence preprocessing andanchor-appending. (c) Impact of score threshold (Sth) in Cosdes.

Fig. 11. Performance of detection results.

Page 12: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

varied from 1 to 5. It can be observed that as Sth increasesbut remains small, the detection results of Cosdes are stillconsistent. This means that no complex heuristic method isrequired to determine the appropriate value of Sth. More-over, this threshold is not required to be adjusted accordingto the size of the data set. Note that as the FP rate increasesto a certain unacceptable value, our system can simplyresponse by slightly decreasing the value of Sth. Theproperty of simple threshold setting is also an advanta-geous feature of Cosdes.

5.3 Efficiency Analysis

In the succeeding experiments, we initially examine theefficiency of near-duplicate matching. Owing to the fact thateach incoming e-mail has to be matched with a huge spamdatabase, the efficiency of near-duplicate matching iscrucial to a collaborative spam detection system. Onevaluation of matching performance, we consider thesituation of matching with the number of e-mails variedwhile there are identical e-mails in the database. As shownin Fig. 13a, the execution time of Digest is minimal sinceonly two 32-byte codes of each pair of e-mails have to becompared. However, as shown in Fig. 12a, the detectionresults of Digest are not satisfied even if their matchingprocesses are very fast. As for MultiDigest, it is defined in[24] that the similarity measure between two e-mails is themaximum number of equal bits over all pairs of digests.According to this definition, processing all pairs of digestsbetween two e-mails requires n2 comparisons, where n isthe average number of digests in an e-mail. It is calculatedthat n is close to 12 in our data set, meaning that thematching time of MultiDigest is larger than that of Digest byover 100 times. This indicates that the matching process ofMultiDigest is the least efficient among four approaches.Regarding Density, the sequences of the first 100 hashvalues, which are longer than Digest and Cosdes, arematched, and therefore Density takes more time than Digestand Cosdes. It is noted that the authors in [30] propose acache mechanism to avoid matching each e-mail with thehuge database. The cache mechanism enables Density tomarkedly enhance the efficiency of matching. However, thiswill degrade the detection performance while the spamdatabase is not as huge as in [30]. To fairly compare theperformance, we ignore the cache mechanism of Density inthis experiment. On the other hand, Cosdes has to match alonger sequence than Digest, and in essence Cosdesrequires more time for matching. Nevertheless, with the

devised data structure and the customized matchingscheme, the execution time is still controlled in two to fourtimes of Digest. Moreover, regarding the effectiveness of theproposed hash function in the matching process, Fig. 13bshows that a speedup of approximately three times isachieved. According to above results, it is shown that withthe design of a more sophisticated e-mail abstraction,Cosdes gains better near-duplicate detection results withstill efficient matching time.

Subsequently, we conduct the efficiency investigation ofCosdes on inserting e-mail abstractions into the databaseand deleting outdated spams from the database. Owing tothe fact that competitive approaches, Digest and Density,did not isolate insertion parts from the systems and did nottake account of deletion, we only study the performance ofInsertion Handler and Deletion Handler in Cosdes. Fig. 14ashows the execution time of Insertion Handler of Cosdeswith the number of e-mails varied. The execution timegrows linearly and costs merely 3.5 seconds for inserting100,000 spams into the database. Moreover, as can be seenin Fig. 10a, the process of generating 100,000 e-mailabstractions costs about 10 seconds. On the other hand,the performance of Deletion Handler is shown in Fig. 14b.We evaluate the execution time of deleting spams in oneday while the number of e-mails in the database varied. Themain purpose of this experiment is to examine whether theefficiency of deletion will be influenced by the amount of e-mails stored in SpTrees. It is shown that the deletionprocess costs only 2 to 3 seconds in each situation, and theexecution time slightly increases with the amount ofe-mails. Therefore, we can observe that both the processesof insertion and deletion in Cosdes are efficient and incurvery little overhead.

To further evaluate the proposed e-mail abstractionscheme, we consider the sequence preprocessing step andthe reordering step of procedure SAG. The primaryobjective of the sequence preprocessing is to preventmalicious tag insertion attack, and thus the robustness ofCosdes can be enhanced. Fig. 15a shows that generatinge-mail abstractions with the sequence preprocessing stepleads to little increment of execution time. Although thedetection results in Fig. 12b can be inferred that spammersstill do not intend to obfuscate HTML content, thisprotection process enables Cosdes to perform morerobustly in the future. On the other hand, the main purposeof the reordering step is to differentiate e-mails with similartag sequences in the earlier stage of matching. As shown in

680 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011

Fig. 13. The comparison of matching efficiency and the efficiency ofCosdes with and without the hash function in the matching process.(a) Execution time of matching of 4 approaches. (b) Impact of hash inCosdes

Fig. 14. Performance of Insertion Handler and Deletion Handler inCosdes. (a) Execution time of insertion of Cosdes. (b) Execution time ofdeletion of Cosdes.

Page 13: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

Fig. 15b, 10 percent of efficiency improvement is achieved.The improvement is limited since we map each subse-quence in a node of an SpTree to a hash value. Therefore,the subsequences that have some prefix tags in commonstill can be differentiated with one comparison. Never-theless, the reordering step certainly can further acceleratethe matching process.

5.4 Spammer Attack

To further verify the robustness of Cosdes, in this section,we simulate the spammer attack of random HTML taginsertion. We consider the situation that a spammer sendsn identical e-mails at a time, where n is varied from 1,000 to100,000. It is assumed that a sequence of random HTML tagsis inserted into the beginning of each e-mail. The number oftags in a sequence is a random number between 1 and 50.5

Regarding the type of tag, we randomly choose them fromHTML tag list.6 As shown in Fig. 16a, around 90 percent of e-mails are matched with other e-mails, meaning that only10 percent of spams have completely distinct HTML tagsequences as random HTML tag insertion is applied. This isbecause the sequence preprocessing step of procedure SAGwill delete nonempty tags that have no corresponding starttags or end tags. Random HTML tag insertion cannotgenerate legal tag sequences and thus most tags will beeliminated. Concerning the efficiency analysis, it can beobserved in Fig. 16b that the sequence preprocessing stepincurs very little overhead. This also indicates that if newspammer attack occurs, Cosdes still has capacity to reactagainst it by applying a more sophisticated countermeasure.

6 CONCLUSION

In the field of collaborative spam filtering by near-duplicatedetection, a superior e-mail abstraction scheme is required tomore certainly catch the evolving nature of spams. Com-pared to the existing methods in prior research, in this paper,we explore a more sophisticated and robust e-mail abstrac-tion scheme, which considers e-mail layout structure torepresent e-mails. The specific procedure SAG is proposed togenerate the e-mail abstraction using HTML content ine-mail, and this newly-devised abstraction can moreeffectively capture the near-duplicate phenomenon ofspams. Moreover, a complete spam detection system Cosdeshas been designed to efficiently process the near-duplicate

matching and to progressively update the known spam

database. Consequently, the most up-to-date information

can be invariably kept to block subsequent near-duplicate

spams. In the experimental results, we show that Cosdes

significantly outperforms competitive approaches, which

indicates the feasibility of Cosdes in real-world applications.

ACKNOWLEDGMENTS

This work was supported in part by the National Science

Council of Taiwan under Contracts NSC97-2221-E-002-

172-MY3.

REFERENCES

[1] E. Blanzieri and A. Bryl, “Evaluation of the Highest ProbabilitySVM Nearest Neighbor Classifier with Variable Relative ErrorCost,” Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007.

[2] M.-T. Chang, W.-T. Yih, and C. Meek, “Partitioned LogisticRegression for Spam Filtering,” Proc. 14th ACM SIGKDD Int’l Conf.Knowledge Discovery and Data mining (KDD), pp. 97-105, 2008.

[3] S. Chhabra, W.S. Yerazunis, and C. Siefkes, “Spam Filtering Usinga Markov Random Field Model with Variable WeightingSchemas,” Proc. Fourth IEEE Int’l Conf. Data Mining (ICDM),pp. 347-350, 2004.

[4] P.-A. Chirita, J. Diederich, and W. Nejdl, “Mailrank: UsingRanking for Spam Detection,” Proc. 14th ACM Int’l Conf.Information and Knowledge Management (CIKM), pp. 373-380, 2005.

[5] R. Clayton, “Email Traffic: A Quantitative Snapshot,” Proc. of theFourth Conf. Email and Anti-Spam (CEAS), 2007.

[6] A.C. Cosoi, “A False Positive Safe Neural Network; The Followersof the Anatrim Waves,” Proc. MIT Spam Conf., 2008.

[7] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati,“An Open Digest-Based Technique for Spam Detection,” Proc. Int’lWorkshop Security in Parallel and Distributed Systems, pp. 559-564,2004.

[8] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati,“P2P-Based Collaborative Spam Detection and Filtering,” Proc.Fourth IEEE Int’l Conf. Peer-to-Peer Computing, pp. 176-183, 2004.

[9] P. Desikan and J. Srivastava, “Analyzing Network Traffic toDetect E-Mail Spamming Machines,” Proc. ICDM Workshop Privacyand Security Aspects of Data Mining, pp. 67-76, 2004.

[10] H. Drucker, D. Wu, and V.N. Vapnik, “Support VectorMachines for Spam Categorization,” Proc. IEEE Trans. NeuralNetworks, pp. 1048-1054, 1999.

[11] D. Evett, “Spam Statistics,” http://spam-filter-review.toptenreviews.com/spam-statistics.html, 2006.

[12] A. Gray and M. Haahr, “Personalised, Collaborative SpamFiltering,” Proc. First Conf. Email and Anti-Spam (CEAS), 2004.

[13] S. Hershkop and S.J. Stolfo, “Combining Email Models for FalsePositive Reduction,” Proc. 11th ACM SIGKDD Int’l Conf. KnowledgeDiscovery and Data Mining (KDD), pp. 98-107, 2005.

[14] J. Hovold, “Naive Bayes Spam Filtering Using Word-Position-Based Attributes,” Proc. Second Conf. Email and Anti-Spam (CEAS),2005.

TSENG ET AL.: COSDES: A COLLABORATIVE SPAM DETECTION SYSTEM WITH A NOVEL E-MAIL ABSTRACTION SCHEME 681

Fig. 15. Efficiency analysis on the sequence preprocessing and thereordering steps of Cosdes. (a) Impact of sequence preprocessing inCosdes. (b) Impact of reordering in Cosdes.

Fig. 16. Effectiveness and efficiency analysis of the sequencepreprocessing step under the spammer attack of random HTML taginsertion. (a) Probability of matching. (b) Execution time of the sequencepreprocessing step.

5. It is calculated that the average length of the HTML tag sequenceproduced by procedure SAG of Cosdes is approximately 35.

6. http://www.w3schools.com/tags/default.asp.

Page 14: Cosdes a collaborative spam detection system with a novel e mail abstraction scheme

[15] A. Kolcz and J. Alspector, “SVM-Based Filtering of Email Spamwith Content-Specific Misclassification Costs,” Proc. ICDM Work-shop Text Mining, 2001.

[16] A. Kolcz, A. Chowdhury, and J. Alspector, “The Impact of FeatureSelection on Signature-Driven Spam Detection,” Proc. First Conf.Email and Anti-Spam (CEAS), 2004.

[17] J.S. Kong, P.O. Boykin, B.A. Rezaei, N. Sarshar, and V.P.Roychowdhury, “Scalable and Reliable Collaborative SpamFilters: Harnessing the Global Social Email Networks,” Proc.Second Conf. Email and Anti-Spam (CEAS), 2005.

[18] T.R. Lynam and G.V. Cormack, “On-Line Spam Filter Fusion,”Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development inInformation Retrieval (SIGIR), pp. 123-130, 2006.

[19] B. Mehta, S. Nangia, M. Gupta, and W. Nejdl, “Detecting ImageSpam Using Visual Features and Near Duplicate Detection,” Proc.17th Int’l Conf. World Wide Web (WWW), pp. 497-506, 2008.

[20] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam Filteringwith Naive Bayes—Which Naive Bayes?” Proc. Third Conf. Emailand Anti-Spam (CEAS), 2006.

[21] M.S. Pera and Y.-K. Ng, “Using Word Similarity to Eradicate JunkEmails,” Proc. 16th ACM Int’l Conf. Information and KnowledgeManagement (CIKM), pp. 943-946, 2007.

[22] I. Rigoutsos and T. Huynh, “Chung-Kwei: A Pattern-Discovery-Based System for the Automatic Identification of Unsolicited E-Mail Messages (SPAM),” Proc. First Conf. Email and Anti-Spam(CEAS), 2004.

[23] S. Sarafijanovic and J.-Y.L. Boudec, “Artificial Immune System forCollaborative Spam Filtering,” Proc. Second Workshop NatureInspired Cooperative Strategies for Optimization (NICSO), 2007.

[24] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Improving Digest-Based Collaborative Spam Detection,” Proc. MIT Spam Conf., 2008.

[25] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, “Resolving FP-TPConflict in Digest-Based Collaborative Spam Detection by Use ofNegative Selection Algorithm,” Proc. Fifth Conf. Email and Anti-Spam (CEAS), 2008.

[26] K.M. Schneider, “Brightmail URL Filtering,” Proc. MIT Spam Conf.,2004.

[27] D. Sculley and G.M. Wachman, “Relaxed Online SVMs for SpamFiltering,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research andDevelopment in Information Retrieval (SIGIR), pp. 415-422, 2007.

[28] C.-Y. Tseng, J.-W. Huang, and M.-S. Chen, “Promail: UsingProgressive Email Social Network for Spam Detection,” Proc. 10thPacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD),pp. 833-840, 2007.

[29] Z. Wang, W. Josephson, Q. Lv, and K.L.M. Charikar, “FilteringImage Spam with Near-Duplicate Detection,” Proc. Fourth Conf.Email and Anti-Spam (CEAS), 2007.

[30] K. Yoshida, F. Adachi, T. Washio, H. Motoda, T. Homma, A.Nakashima, H. Fujikawa, and K. Yamazaki, “Density-Based SpamDetector,” Proc. 10th ACM SIGKDD Int’l Conf. Knowledge Discoveryand Data Mining (KDD), pp. 486-493, 2004.

[31] F. Zhou, L. Zhuang, B.Y. Zhao, L. Huang, A.D. Joseph, and J.D.Kubiatowicz, “Approximate Object Location and Spam Filteringon Peer-to-Peer Systems,” Proc. ACM/IFIP/USENIX Int’l Middle-ware Conf., pp. 1-20, 2003.

Chi-Yao Tseng received the BS degree inelectrical engineering from the National TaiwanUniversity, Taipei, in 2004, and is currently a PhDcandidate in the Graduate Institute of ElectricalEngineering of the National Taiwan University.His research interests include sequential patternmining, e-mail spam detection, social networkmining, and multimedia data mining.

Pin-Chieh Sung received the BS degree incomputer science from the National Tsing HuaUniversity, Hsinchu, in 2006 and the MS degreein the network database laboratory (led byProfessor Ming-Syan Chen), Graduate Instituteof Electrical Engineering, National Taiwan Uni-versity, Taipei, in 2008. Currently, he is asoftware engineer in Cazoodle Inc. His researchinterests include web mining, recommendationsystem, and multimedia data mining.

Ming-Syan Chen received the BS degree inelectrical engineering from the National TaiwanUniversity, Taipei, and the MS and PhD degreesin computer, information and control engineeringfrom the University of Michigan, Ann Arbor, in1985 and 1988, respectively. He is now adistinguished research fellow and the directorof Research Center for Information TechnologyInnovation (CITI) in the Academia Sinica, Tai-wan, and is also a distinguished professor jointly

appointed by EE Department, CSIE Department, and Graduate Instituteof Communication Engineering (GICE) at the National Taiwan Uni-versity. He was a research staff member at IBM Thomas J. WatsonResearch Center, Yorktown Heights, New York, from 1988 to 1996, thedirector of GICE from 2003 to 2006, and also the president/CEO of theInstitute for Information Industry (III), which is one of the largestorganizations for information technology in Taiwan, from 2007 to 2008.His research interests include databases, data mining, mobile comput-ing systems, and multimedia networking, and he has published morethan 280 papers in his research areas. In addition to serving as programchairs/vice chairs, and keynote/tutorial speakers in many internationalconferences, he was an associate editor of IEEE TKDE, VLDB Journal,KAIS, and also JISE, is currently the editor-in-chief of the InternationalJournal of Electrical Engineering (IJEE), and is a distinguished visitor ofthe IEEE Computer Society for Asia-Pacific from 1998 to 2000, and alsofrom 2005 to 2007. He is now also serving as the chief executive officerof Networked Communication Program, which is a national programcoordinating several primary activities in information and communicationtechnologies in Taiwan. He holds, or has applied for, 18 US patents andseven ROC patents in his research areas. He is a recipient of theAcademic Award of the Ministry of Education, the National ScienceCouncil (NSC) Distinguished Research Award, Pan Wen YuanDistinguished Research Award, Teco Award, Honorary Medal ofInformation, and K.-T. Li Research Breakthrough Award for his researchwork, and also the Outstanding Innovation Award from IBM Corporatefor his contribution to a major database product. He also receivednumerous awards for his research, teaching, inventions, and patentapplications. He is a fellow of the ACM and the IEEE.

. For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

682 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011