

Botnet Detection Through Fine Flow Classification

Xiaonan Zang, Athichart Tangpong, George Kesidis and David J. Miller

Departments of CS&E and EE, The Pennsylvania State University

University Park, PA, 16802

CSE Dept Technical Report No. CSE11-001, Jan. 31, 2011

Abstract

Botnets, groups of infected machines under common control, have become the predominant factor behind malicious Internet attacks such as DDoS, spam, and click fraud. The number of botnets is steadily increasing, and their characteristic C&C channels have evolved from IRC to HTTP, FTP, DNS, and other protocols, and from a centralized structure to P2P and Fast Flux Service Networks. To counter these developments, the Internet security community has designed many botnet detection and disruption systems, which fall into two categories: honeynet-based and passive traffic monitoring. Passive traffic monitoring can be further divided into behavior-based, DNS-based, and mining-based techniques. Among these Intrusion Detection System designs, the mining-based method, which operates on flow-level Internet traffic, has shown promising resilience against botnet evolution. This paper reports a preliminary experiment that observes the discriminating capabilities of the hierarchical and K-means clustering algorithms and explores an RTT adjustment procedure for mixing a botnet trace with background Internet traffic.

I. INTRODUCTION

The term Botnet denotes a network of compromised end hosts (bots) under the remote command of a botmaster [37]. Once a botnet has been constructed, its bots are controlled autonomously and automatically, in some cases to carry out illicit money-making activities.

A. Botnet Life Cycle

The general life cycle of a botnet, shown in Figure 1, contains four phases: initial infection, secondary injection, maintenance & update, and malicious activities [13].

[Figure 1 diagram. Nodes: Botmaster, Command and Control (C&C) Server, Botnet, Vulnerable End Host. Edge labels: Initial Infection; Secondary Injection; Connection & Update; Malicious Commands; Maintenance & Update.]

Fig. 1: A General Botnet Life Cycle.

This material is based upon work supported by the National Science Foundation under Grant No. 0915552 and a Cisco Systems URP gift.


1) Initial Infection: A computer can be infected in several ways: by inadvertently executing malicious code, through exploitation of system vulnerabilities, and through access via engineered backdoors. Users may accidentally download and execute malicious programs while viewing a Web site, opening an email attachment, or clicking a link in an incoming instant message. Every patch released for popular operating systems, such as Windows XP and Windows 7, is followed by a flurry of reverse engineering in the hacker community aimed at exploiting the problems the patch has fixed, because millions of users do not update their computers promptly and properly. In addition, some ports used for remote access or file sharing services are constantly scanned by other bots for vulnerabilities, for example port 135 (Microsoft Remote Procedure Call (RPC) service) and port 139 (NetBIOS file sharing service) [3].

The term 'backdoor' denotes a port that is forcibly opened by malicious software to allow remote connections, thereby giving up administrative control of the compromised computer. In practice, a vulnerable computer is usually infected by multiple malicious programs. To take advantage of this fact, a single piece of malicious software routinely examines a list of ports for backdoors left by others, including port 2745 (backdoor of the Bagle worm) and port 3410 (backdoor of the Optix Pro remote access trojan) [3].

2) Secondary Injection: Although a particular botnet may use backdoors left by other botnets, this does not mean that botmasters want a common shared pool of bots. Most communication and command protocols are therefore botnet-specific, and most source code is kept confidential. Although some of the most popular botnets have publicly available source code (e.g., Agobot, SDBot, and GT Bot), the complexity and modularity of the code architecture, along with the constant evolution of botnets, mean that there are no standardized command and control functions [4]. Therefore, after a successful initial infection, the next step is to download and run the botnet code so that the machine becomes a bot under the control of a specific botmaster. This download can be carried out using Trivial File Transfer Protocol (TFTP), File Transfer Protocol (FTP), HyperText Transfer Protocol (HTTP), or CSend [3].

3) Maintenance and Update: The first two stages involve communication only between bots and the targeted computer. After becoming a bot, the infected machine starts to 1) log into the command and control server and 2) create a protected session that parses and executes the topics in the channel. These two steps are repeated periodically and require authentication. Before the botmaster authorizes a malicious activity, such as a Distributed Denial of Service (DDoS) attack, it usually sends an update command to the C&C server, which in turn contacts the bots to give the botmaster updated status feedback on the botnet. These Internet flows, especially the periodic log/listen sessions, are of great interest for botnet detection, since a passive intrusion detection system (IDS) aims to recognize the suspicious patterns and disrupt the botnet before the actual attacks take place.
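One way a passive monitor could exploit the periodic log/listen sessions is to measure how regular the gaps between a host's flows to the same server are. The following Python sketch is only an illustration under stated assumptions: the function names, the use of the coefficient of variation, and the 0.1 threshold are hypothetical choices and are not taken from this report.

```python
from statistics import mean, pstdev


def interarrival_cv(start_times):
    """Coefficient of variation of the gaps between consecutive flow starts.

    Returns None when there are too few flows to judge regularity.
    """
    times = sorted(start_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        return None
    average_gap = mean(gaps)
    if average_gap == 0:
        return None
    return pstdev(gaps) / average_gap


def looks_periodic(start_times, cv_threshold=0.1):
    """Flag flow sequences with nearly constant spacing (assumed threshold)."""
    cv = interarrival_cv(start_times)
    return cv is not None and cv < cv_threshold


# Illustrative data: flow start times in seconds from one host to one server.
beacon = [0.0, 60.2, 119.9, 180.1, 240.0, 299.8]   # roughly every 60 s
browsing = [0.0, 3.1, 4.0, 95.0, 97.5, 400.0]      # irregular human activity

print(looks_periodic(beacon), looks_periodic(browsing))
```

A low coefficient of variation only indicates regular timing; benign software such as update checkers is also periodic, so a check like this would be one input among several rather than a detector on its own.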

4) Malicious Activities: The aforementioned definition of a botnet indicates that botnets are mostly used for criminally motivated activities, including Distributed Denial of Services, Click Fraudulence, Spamming, and Identity Theft.

a) Distributed Denial of Services (DDoS): The earliest use of a botnet was to launch DDoS attacks, which cause a loss of service to users. Because a botnet often contains thousands of bots, the botmaster can direct all the online bots to flood a specific web server or system with packets. These packets consume the bandwidth of the victim network, overload the computational resources of the victim system, or even congest general Internet traffic and cause massive public damage [3]. Most DDoS attack implementations are categorized as TCP SYN or UDP flood attacks. A TCP SYN flood attack sends an overwhelming number of SYN messages to a web receiver. The TCP specification requires the receiver to allocate a chunk of memory for a certain time to maintain each TCP connection. Gradually, the harmful SYN messages exhaust the memory of the receiver buffer and make TCP-based services, such as Web traffic, FTP, Telnet, and SMTP, unavailable. A UDP flood attack sends a large number of UDP packets to random ports (chargen, echo, daytime, etc.) of a server to block other legitimate traffic to it. Other protocols have also been used in DDoS attacks, such as recursive HTTP [3] and ICMP flood attacks.
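From the defender's side, the memory exhaustion described for a TCP SYN flood shows up as a growing number of half-open connections at the victim. The Python sketch below counts them from a simplified event stream. It is an assumption-laden illustration: the event tuple format, the function names, and the limit of 100 are hypothetical, and real handshake tracking would also use sequence numbers and timeouts.

```python
from collections import defaultdict


def half_open_per_server(events):
    """Count handshakes that saw a SYN but no closing ACK or RST.

    events: iterable of (client_ip, client_port, server_ip, flag), where flag
    is "SYN", "ACK", or "RST" as sent by the client (simplified model).
    """
    pending = defaultdict(set)  # server_ip -> {(client_ip, client_port)}
    for client_ip, client_port, server_ip, flag in events:
        key = (client_ip, client_port)
        if flag == "SYN":
            pending[server_ip].add(key)
        elif flag in ("ACK", "RST"):
            pending[server_ip].discard(key)
    return {server: len(keys) for server, keys in pending.items()}


def flooded_servers(events, limit=100):
    """Servers whose half-open count exceeds an assumed limit."""
    counts = half_open_per_server(events)
    return [server for server, count in counts.items() if count > limit]


# Illustrative data: 150 SYNs that are never completed, plus one normal handshake.
events = [(f"198.51.100.{i}", 40000 + i, "10.0.0.1", "SYN") for i in range(150)]
events += [
    ("192.0.2.7", 51000, "10.0.0.2", "SYN"),
    ("192.0.2.7", 51000, "10.0.0.2", "ACK"),
]

print(flooded_servers(events))
```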

b) Click Fraudulence: Instead of attacking a web site all at once, bots can also be controlled to automatically and periodically access particular links in order to artificially increase the number of clicks or manipulate the outcomes of online polls. An example is the abuse of Google's AdSense program, in which web sites, possibly set up by the botmaster, display Google advertisements and are paid for the fraudulent clicks on those advertisements [3][46].

c) Spamming: Another nefarious task of botnets is to spread junk email, which is called spamming. In general, bots may open a SOCKS v4/v5 proxy or transmit their junk email to an email spam proxy to avoid blacklists of static spammers [34]. Certain botnets, such as Agobot, also harvest email addresses and download lists of email addresses shared among bots. One special form of spam is phishing, which lures users to disguised web sites and attempts to steal sensitive personal information.


d) Identity Theft: Besides the aforementioned phishing technique, there are other methods for retrieving valuable information from users, such as traffic sniffing and keylogging. Bots can download small, specialized password grabbers, such as pwdump2, to collect username and password data from the hosts. Alternatively, a bot can use tools such as Cain and Abel to masquerade as the gateway of a subnet and gather passwords from other computers. Interestingly, bots may also collaborate: for example, one bot can harvest encrypted password data, reformat it into a UNIX-like password file, and send it to a presumably faster bot, which cracks the passwords using software such as L0phtCrack. In addition to these brute-force password cracking tools, key logger programs, which capture keystroke sequences near certain keywords such as 'paypal.com', can also be used to steal confidential data [3].

Overall, botnets are a hybrid of previous Internet threats with the defining characteristic of C&C channel usage. They can propagate like worms, hide from detection like many viruses, and use attack methods like those of published toolkits [13].

B. Botnet History and Trends

Internet Relay Chat (IRC) was invented in August of 1988 by Jarkko Oikarinen of the University of Oulu, Finland [35]. This protocol provides a platform for data dissemination among large numbers of end users by supporting multiple forms of communication (point-to-point, point-to-multipoint, etc.) [37]. As the IRC protocol developed, administering busy channels, such as handling tedious 24-hours-a-day requests from users, became time consuming. The bot, or robot, was then created as a benign assistant for IRC channel management. In 1989, Greg Lindahl, an IRC server operator, created a benevolent bot called GM, which would play a game of Hunt the Wumpus with IRC users. Starting from this simple example, bots evolved from code that helps a single user to code that manages and runs IRC operations on the local host, as well as code that provides services for other users. Bots were gradually developed into comprehensive tools that act as IRC channel operators; for example, Eggdrop was written in 1993 to assist channel operators. In time, IRC bots with more nefarious purposes emerged, when some IRC servers and bots began offering access to an OS shell, which permits users to run commands on the IRC host. By the late 1990s, large numbers of trojan-infected computers were being grouped together and remotely controlled by a botmaster connected to an IRC server. Version 2.1 of the SubSeven Trojan, released in June 1999, included the typical malicious functions (such as stealing passwords, logging keystrokes, and hiding its identity) and provided a significant new feature that permits the SubSeven server to be remotely controlled via an IRC channel. This link between trojan server and IRC channels set the stage for all malicious botnets to come.
In 2005, during a four-month span of botnet research conducted by the Honeynet Project, over a million computers were observed as members of botnets [3].

For over a decade, IRC-based botnets were predominant among all existing botnets. However, as botnet detection has escalated, botnets have also evolved. In terms of protocol, more and more botnets have started to use HTTP and Fast Flux networks based on DNS servers. In terms of topology, instead of the traditional centralized structure with a single server, more sophisticated structures have been implemented, such as a group of interlinked IRC servers or a hybrid P2P system. In 2009, Conficker, arguably the most influential and sophisticated botnet, appeared [8]. Conficker uses DNS as its C&C protocol and a P2P C&C structure. In the following sections, the predominant IRC botnets are examined, with examples of some of the most popular IRC-based botnets; botnet technologies based on other protocols are mentioned; and finally, the evolution of the botnet C&C channel structure is presented.

II. IRC-BASED BOTNETS

IRC provides a common protocol that is widely deployed across the Internet for activities among large numbers of machines, such as remote control and data distribution [35]. Many existing IRC networks lack strong authentication, and a number of tools that provide anonymity on IRC servers are available. IRC also has a simple text-based command syntax, which makes it easy to extend with custom functionality. These features have made IRC the most suitable choice for a botmaster, because it provides a simple, low-latency, widely available, and anonymous command and control channel for botnet communication [13]. In this section, four of the most commonly used IRC-based botnets are introduced: Agobot, SDBot, Spybot, and GT Bot.

A. Agobot

Agobot, named after its creator Ago, was first released in C++ in 2002 [3]. Because of its cross-platform capability, modular functionality, and publicly available source code, there are now thousands of variants of Agobot, and this number is steadily increasing [4]. The modularity principle is used throughout the design of the botnet process. Unlike other botnets, which commonly infect the target machine all at once, Agobot corrupts the target system with three modules, each of which retrieves the next module after completing its primary tasks. First, Agobot infects the computer with the bot client and opens a backdoor to allow the attacker to communicate with and control the machine; second, Agobot attempts to shut down processes associated with antivirus and security systems; finally, Agobot tries to block access from the infected computer to a variety of antivirus and security-related web sites by altering the DNS entries of these sites to point to the compromised local host. Furthermore, Agobot also includes commands and functions for fortifying the local system against other malicious attacks, such as closing NetBIOS shares and RPC-DCOM, and for detecting debuggers (e.g., SoftICE, OllyDbg, and procdump) and virtual machines (e.g., VMware and Virtual PC). Along with these functions, Agobot has an elaborate set of malicious attack commands, which offer multiple types of DDoS attacks (UDP, TCP SYN, HTTP, PHAT SYN, PHAT ICMP, PHATwonk, and targa3 floods), the capability to steal sensitive information by using libpcap and Perl Compatible Regular Expressions (PCRE) to sniff and sort traffic, and multiple scanning methods (Bagle, Dcom, MyDoom, Dameware, NetBIOS, Radmin, and MS-SQL scanners). Most variants of Agobot use standard IRC for C&C channel communications. However, one branch, referred to as Phatbot, uses a distributed and organized WASTE chat network as its C&C protocol. WASTE is a P2P protocol designed by AOL to use encryption for more secure file transfers via P2P. Using WASTE avoids botnet disruption through an IRC channel shutdown, but it also limits the scalability of the bot army, because WASTE can only manage 50 to 100 client nodes at a time. Overall, with its monolithic code architecture, creative modular principles, and standard data structures and code documentation, Agobot is arguably the most sophisticated and best-written of the existing botnet source codes.

B. SDBot

SDBot was originally written in C and released by a Russian programmer known as sd [3]. The standard compact package of SDBot source code behaves more like a benign tool, providing a utilitarian IRC-based command and control system. The only potentially malicious activities included in the original package are UDP and ICMP DDoS attacks. Public collaboration and evolution have generated a large number of patches that add specific malicious capabilities such as scanning, DDoS attacks, sniffers, and information harvesting routines. Similar to Agobot, SDBot includes some typical exploits targeting specific vulnerabilities. The most active ones are the brute-force password guessing attacks at port 139 (NetBIOS sharing service), port 445 (Crypt32.dll), and port 1433 (MSSQL) [4]. Once the hacker gains complete access to a compromised system, the Remote Access Trojan (RAT) component of SDBot connects to an IRC server and waits silently for instructions from the botmaster. This code structure, a standard core package extended with customized patches, has made SDBot arguably the most active and popular botnet. As of August 2004, SDBot was reported to have more than 4,000 variants. In June 2006, a Microsoft report on the Malicious Software Removal Tool listed SDBot as having been detected on 678,000 infected machines.

C. Spybot

Spybot, a derivative of SDBot, first emerged in 2003. Like SDBot, the Spybot code is open source and available for the public to modify and extend with further functionality [3]. The main difference between SDBot and Spybot is that Spybot was originally designed solely for malicious purposes [4]. First, Spybot adds a number of spyware-like capabilities such as keystroke logging and email address harvesting. Second, Spybot includes features to broadcast Spam over Instant Messaging (SPIM) and to modify the registry to prevent installation of Windows XP SP2 or to disable the Windows XP Security Center. This difference makes Spybot more efficient than SDBot in some aspects of its malicious activities, and it is the main reason that Spybot has evolved into another botnet family as influential as SDBot.

D. GT Bot

GT Bot, which can be traced back to as early as 1998, stands for Global Threat and is the common name used for all mIRC-scripted botnet code. mIRC is an IRC client software package with two important characteristics for botnet construction [4]: it can run scripts in response to events on the IRC server, and it supports raw TCP and UDP socket connections for remote control and access. GT Bot also includes a characteristic HideWindow program, which keeps the bot hidden on the local system [3]. GT Bot can be easily modified to suit a specific malicious purpose. However, the extensibility of this botnet is quite limited. As a result, it appears that different versions have been generated for specific malicious intents, instead of a general comprehensive package that provides an elaborate set of malicious capabilities.

III. FURTHER BOTNET DEVELOPMENTS

The proliferation of botnets has drawn more and more attention in the Internet security community. Multiple studies of the botnet phenomenon have been conducted, and many botnet detection and disruption techniques have been invented. To keep botnets operating as Internet security tools improve, botnet technologies have also advanced. For example, one of the earliest and simplest botnet detection techniques is to set up a signature-matching system that monitors and inspects all live traffic passing through known IRC ports (e.g., TCP port 6667). Once known botnet commands have been matched in the payload, the operator is able to identify the corresponding IRC channel and shut it down to disable the whole botnet. To avoid this disruption, botnets have adopted technologies operating on non-standard ports.


More protocols have also been tried as remote control mechanisms. For example, File Transfer Protocol (FTP) has been used as the C&C channel for botnets such as Dumador and Haxdoor, which perform keylogging to steal sensitive information [28]. These botnets sniff the communications of the compromised machine and present the user with fake local web sites when the user visits HTTPS (encrypted) Web sites, in order to steal the user's credentials. Once the credentials have been retrieved, the FTP C&C channel (also called a drop zone) feeds them directly to the botmaster. There are also some HTTP-based botnets in the wild. One example is the spam bot module in the Rustock rootkit, which uses encrypted HTTP for its C&C mechanism [10]. The use of encrypted HTTP has made detection and deobfuscation more difficult. Aside from the ordinary spam bot functionality, this spam module can also be extended with other nefarious functions. BlackEnergy is another typical HTTP-based bot, which is used solely for DDoS attacks. Finally, a click fraud bot, Clickbot.A, also ran its C&C channel over HTTP [15]. In general, HTTP-based botnets have encrypted C&C channels, which are often obscured with Base64. It is also easier for HTTP-based botnets to pass through firewalls than for IRC-based botnets.

A. P2P Bot

Besides the protocols that carry the C&C mechanism, the structures of C&C channels have also evolved. The commonly used botnet families described in the previous section are not only all IRC based, but also all have a centralized C&C structure, which is characterized by a central point that forwards messages among clients [13]. From the perspective of a botmaster, the centralized C&C structure has become the fundamental weak point [47]. First, shutting down a limited number of C&C servers could compromise the entire botnet. Second, C&C servers can be easily detected based on the incoming traffic from a large number of bots, or simply by tracing back from a single captured bot. Third, once a C&C server has been captured or hijacked, the entire botnet is exposed. To overcome these major weaknesses inherent to the centralized architecture, a peer-to-peer (P2P) framework is a natural improvement. In a P2P architecture, bots communicate with other peer bots rather than with a central server. These peer bots act as both clients and servers, so that there is no centralized coordination point that can be incapacitated. Because there is no central server, the botmaster cannot directly control all the bots. Instead, a set of commands is defined in the P2P system. When the botmaster attempts to launch an attack, it publishes one of the predefined commands on the P2P system, and all the bots that have subscribed to the set are able to execute this command.

In the last several years, botnets such as Slapper [2], Sinit [42], Nugache [43], and Conficker [8] have implemented multiple forms of P2P control architectures. Along with the inherent structure and processes of traditional P2P systems, each botnet has its own advanced design features and weaknesses. Sinit uses public key cryptography for update authentication and random probing for communication with other Sinit bots. The extensive probing traffic has resulted in easy detection and poor connectivity for the constructed botnets. Slapper builds a list of known bots for each infected computer during propagation, which removes the bootstrap process that defenders can easily exploit to shut down a botnet. However, the lack of encryption and command authentication has made Slapper vulnerable to hijacking by others. Nugache, on the other hand, has implemented an encrypted/obfuscated C&C channel. However, its reliance on a seed list of 22 IP addresses during the bootstrap process has also made Nugache an easy target for detection. Conficker encrypts its C&C channel with the most sophisticated algorithms, and its list of possible C&C server domain names/IP addresses contains around 5000 entries and is updated on a daily basis.

Compared with a centralized system, a P2P communication system is much harder to disrupt. However, P2P systems are more complicated, and there are typically no guarantees on message delivery or latency.

B. Fast Flux Service Network

A new technology that uses the Domain Name System (DNS) protocol within C&C communications, referred to as the Fast Flux Service Network (FFSN), has emerged in recent years. In general, two techniques are used with the DNS protocol to map domain names to IP addresses: Round Robin DNS (RRDNS) [9] and content distribution networks (CDN) [16]. In response to a DNS request, RRDNS returns a list of DNS A records (i.e., hostname to IP address mappings). The DNS server cycles through this list and returns the records in a round robin fashion. Every A record also has a Time To Live (TTL) for the mapping, specifying the number of seconds the response remains valid. The recommended TTL for RRDNS is 1 to 5 days, according to RFC 1912 [5]. Instead of multiple A records, a CDN applies sophisticated techniques based on network topology and current link characteristics to find the edge server nearest to the requesting client, and returns IP addresses that belong to the network of this server. The typical TTL of an A record for a CDN is significantly lower than that for RRDNS, because the CDN needs to react promptly to changes in link characteristics.
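The round robin behavior and the role of the TTL can be illustrated with a small Python model. This is a toy sketch under stated assumptions, not a DNS server implementation: the class and function names are hypothetical, the addresses come from the documentation range 192.0.2.0/24, and the TTL of 86400 seconds (1 day) is chosen only because it falls within the range quoted above.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of one hostname served with several A records."""

    def __init__(self, addresses, ttl_seconds):
        self._records = deque(addresses)
        self.ttl_seconds = ttl_seconds

    def answer(self):
        """Return (address, ttl) pairs in the current order, then rotate."""
        response = [(addr, self.ttl_seconds) for addr in self._records]
        self._records.rotate(-1)  # the next query starts with the next address
        return response


def is_still_valid(received_at, ttl_seconds, now):
    """A cached answer may be reused until its TTL has elapsed."""
    return now - received_at < ttl_seconds


zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"], ttl_seconds=86400)
first = [addr for addr, _ in zone.answer()]
second = [addr for addr, _ in zone.answer()]

print(first)
print(second)
```

Successive queries see the same set of addresses in a rotated order, which spreads clients across the servers, while the TTL bounds how long a resolver may keep reusing one answer.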

With the help of the mapping techniques mentioned above, an FFSN can be constructed as a distributed proxy network, consisting of compromised machines (flux agents), which routes traffic to the controlling element (control node/mothership) using the characteristic short TTLs and multiple A record assignments of RRDNS. An FFSN acts more like a super botnet constructed from multiple sub-botnets. A few examples of FFSNs have already been detected in the wild, such as the spam email domain thearmynext.info, found in July 2007 [24]. Additionally fluxing the NS records (the authoritative name servers for the domain) gives the FFSN one more layer of protection. This type of FFSN, referred to as a double flux FFSN, has been implemented as a robust phishing botnet, which created a bogus web site called login.mylspacee.com to harvest Myspace user authentication credentials [38]. As FFSNs gain popularity, other botnets are also taking advantage of this technique. Besides using DNS as the carrier of the C&C mechanism, botmasters also use FFSNs to host malicious content. The P2P bot Storm Worm, one of the most prevalent botnets, uses fast flux domains to host the actual bot binary [25]. Moreover, beyond regular DNS services, other services such as HTTP, SMTP, POP, and IMAP can be delivered via FFSN, because fast flux techniques use blind TCP and UDP redirects, which are suitable for any directional service protocol with a single port. Finally, even conventional IRC-based botnets have used Dynamic DNS algorithms to alternate frequently between several IP addresses of IRC servers.

Overall, botnets are gradually using more protocols for specific malicious attacks and adopting more decentralized C&C structures. To reduce susceptibility to the next generation of botnets, a few advanced botnet designs and models have also been proposed for defensive purposes. An advanced hybrid P2P botnet architecture uses multiple classes of bots, including a characteristic class of servant bots, which behave as both clients and servers. This hybrid P2P architecture provides robust network connectivity, individualized encryption and control traffic dispersion, limited botnet exposure by each captured bot, and easy monitoring and recovery by its botmaster [47].

IV. BOTNET DETECTION

Along with the prevalence of botnet-related nefarious activities, increasing numbers of botnet detection and tracking techniques have been developed in recent years. These methods can be categorized into two approaches: honeynet-based methods and methods based on passive traffic monitoring.

A. Honeynet-based Methods

The general structure of a honeynet-based method consists of a honeypot and a honeywall [3]. A honeypot is an end host that is very vulnerable to malicious attacks and is often successfully compromised within a very short time span. A honeywall is software, such as Snort, that is used to monitor, collect, control, and modify the traffic through the honeypot. The Honeynet Project used unpatched versions of Windows 2000 or Windows XP systems as honeypots and snort_inline as the honeywall device to track botnets on a daily basis (i.e., the honeynet was rebuilt every 24 hours). This project also listed a set of suggestions on how to write a useful botnet-tracking IRC client. First, the client should have SOCKS v4 and multi-server support. Second, some useful packages, such as lbadns, libcurl, and Perl Compatible Regular Expressions (PCRE), should be included in the client. Finally, modularity and certain design choices, such as avoiding threading, should be taken into consideration throughout the design of the client. A similar honeynet has been constructed in [13], consisting of three vulnerable machines and a transparent proxy device (a FreeBSD bridge). This project demonstrates some key features of the honeywall/proxy device. First, the honeywall element should be able to capture and inspect all traffic payloads to retrieve botnet information, such as the DNS/IP address of the C&C server with the corresponding port number and the authentication data needed to join the C&C channel. Second, the honeywall element should be capable of isolating the honeypots from other machines in the local network by blocking outgoing connections that contain suspicious keywords linked to possible malicious activities.

These projects offer only a single vantage point on botnet activities, and thus miss a substantial portion of botnet spreading behavior. To capture the comprehensive actions of botnets, Rajab et al. [37] constructed a multifaceted and distributed measurement infrastructure by combining a modified version of the nepenthes platform with honeynets. Although a honeynet is a powerful tool for understanding botnet technology and characteristics and for tracking botnet behavior, it is not very effective for botnet disruption. Also, the increasing use of anti-detection techniques in botnets, along with propagation techniques that tend towards social engineering, has made it more and more challenging to emulate a bot and more resource consuming to set up a honeynet system.

B. Passive Traffic Monitoring

Instead of purposefully setting up a honeynet to attract and collect botnet data, another approach is to set up vantage points to passively monitor real Internet traffic and to detect or extract the botnet-related packets. Based on different types of Internet traffic data, such as DNS data, BGP route views, Netflow data, and proprietary enterprise data, and on the complexity and response time requirements, many Intrusion Detection System (IDS) designs have been proposed. These techniques can be classified as behavior-based, DNS-based, and data-mining-based, as described and summarized in the following sections.

1) Behavior-based Detection: Behavior-based detection methods can be further categorized as signature-based and anomaly-based.

a) Signature-based Detection: Knowledge of useful signatures of existing and captured botnets has provided great guidance in botnet detection. A library of specific botnet commands and function names can be compiled and included in the proposed IDS. Once the IDS finds matching keywords while inspecting the payload content, it can trigger an alert and take further action against the botnet. For example, Snort [39] is an open source IDS that monitors network traffic for signs of intrusion by searching for matches against a predefined set of rules and signatures. A major weakness of signature-based detection is that it can detect only known botnets.
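To make the keyword-matching idea concrete, the following is a minimal sketch of payload inspection against a small signature library. The two patterns and their names are hypothetical illustrations and are not taken from Snort or from any real rule set.

```python
import re
from typing import List

# Hypothetical signature library: patterns are illustrative only,
# not drawn from any real IDS rule set.
SIGNATURES = {
    "irc-join-with-key": re.compile(rb"^JOIN #\S+ \S+", re.MULTILINE),
    "bot-scan-command": re.compile(rb"\.(?:adv)?scan\b", re.IGNORECASE),
}

def match_signatures(payload: bytes) -> List[str]:
    """Return the names of all signatures found in one packet payload."""
    return [name for name, pattern in SIGNATURES.items()
            if pattern.search(payload)]

# Example: an IRC message carrying a scan command triggers one alert.
alerts = match_signatures(b"PRIVMSG #chan :.advscan lsass 200 5 0 -r\r\n")
```

As the text notes, such a matcher can only flag payloads for which a signature already exists, and it cannot inspect encrypted payloads.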

b) Anomaly-based Detection: Unlike normal Internet traffic, botnets often generate a high volume of traffic that may cause high network latency, as well as traffic on unusual ports. These network traffic anomalies, along with other unique botnet behaviors, have been utilized for botnet detection. Binkley and Singh [7] proposed an effective TCP-based anomaly detection technique that uses IRC tokenization and IRC message statistics to detect botnet clients and reveal botnet servers. First, this anomaly-based system implements an IRC parsing component to collect information on TCP packets and to identify IRC channels. Next, the IRC channel traffic is correlated over a large set of sampled data in search of scanning activity. Finally, the IRC channels with high scanning counts are flagged as possible botnet channels. Akiyama et al. [1] proposed a three-metric measurement to detect abnormal botnet behavior under the assumption that bots from the same botnet exhibit regularities in relationship, response, and synchronization. Gu et al. [20] proposed a botnet detection system (Bothunter) that recognizes the bot infection phase by running a correlation algorithm with the help of a user-defined bot infection life cycle model. Although Bothunter is a C&C protocol- and structure-independent IDS, its performance is greatly affected by the accuracy of the predefined infection dialog model. From the same authors, Botsniffer [21] was developed as an anomaly-based algorithm designed to detect botnet C&C channels in a local area network, using the observation that bots within the same botnet demonstrate strong synchronization in their responses and activities (e.g., sending spam, scanning, and binary downloading). This algorithm does not require prior knowledge of a botnet and has low false positive and false negative rates.

2) DNS-based Detection: DNS-based detection is a hybrid of behavior-based and data-mining-based techniques performed on DNS traffic. The significant robustness and serious potential threat of FFSNs make it necessary to emphasize detection algorithms on DNS traffic. For a botmaster to maintain and hide its bots, DNS queries are used in multiple botnet stages, such as the rallying process after infection, malicious attack initiation, and C&C server update. There are two major factors that distinguish botnet DNS queries from legitimate DNS queries. A first weakness is that queries to C&C servers, often in the form of DDNS, come only from botnet members. In 2005, Dagon [14] proposed a mechanism to identify the domain names of C&C servers with abnormally high or temporally concentrated DDNS query rates. However, this technique can be easily evaded by using faked DNS queries, and it generates many false positives due to misclassification of legitimate and popular domain names that use DNS with short TTLs. An improved approach was proposed in 2006 that additionally uses NXDOMAIN reply rates [40]. This algorithm is based on the observation that abnormally recurring name error (NXDOMAIN) responses to DDNS queries are mostly due to shutdowns of the C&C servers. Compared with the previous method, this method is more effective in revealing suspicious domain names and generates fewer false positives because NXDOMAIN replies are more likely to refer to DDNS names than to other names. A second weakness is that bots usually generate highly correlated DNS queries. In 2007, Choi et al. [11] proposed a botnet detection mechanism that monitors group activities, which often consist of DNS queries sent simultaneously by a large number of distributed bots. This method is more robust than the aforementioned two and is botnet-type independent. Furthermore, it can also detect botnets with encrypted channels since it uses information in IP headers. The main drawback of this approach is the high processing time required for detailed monitoring of large-scale network traffic.
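The query-rate and NXDOMAIN-rate ideas can be illustrated with a short sketch. This is not the algorithm of [14] or [40]; it only counts responses per domain and flags domains that are both heavily queried and mostly answered with NXDOMAIN. The input format and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter

def suspicious_domains(responses, min_queries=100, min_nx_ratio=0.5):
    """Flag domains with many queries and a high NXDOMAIN share.

    responses: iterable of (domain, rcode) pairs, e.g. ("x.example", "NXDOMAIN").
    min_queries and min_nx_ratio are hypothetical thresholds.
    """
    total = Counter()
    nxdomain = Counter()
    for domain, rcode in responses:
        total[domain] += 1
        if rcode == "NXDOMAIN":
            nxdomain[domain] += 1
    return sorted(
        domain for domain, count in total.items()
        if count >= min_queries and nxdomain[domain] / count >= min_nx_ratio
    )
```

A popular legitimate domain with a short TTL would pass the query-count test but fail the NXDOMAIN-ratio test, which is the intuition behind the reduced false positive rate described above.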

3) Data-mining based Detection: Although abnormal DNS traffic has been successfully distinguished from legitimate DNS traffic, botnet C&C communication pattern recognition or detection remains one of the most challenging tasks in IDS design. In fact, since botnets use regular protocols for C&C communications, the traffic is similar to regular traffic. Moreover, the C&C traffic is not high volume and does not cause high network latency. With the continuous evolution of botnets, the previous behavior-based detection algorithms are not useful for identifying C&C traffic. Several data mining techniques, including data classification and clustering, have been explored to distinguish botnet C&C traffic. Goebel and Holz [18] introduced Rishi, a mining-based system, in 2007. Rishi constructs its data set by collecting IRC server nicknames and port numbers, and implements an n-gram analysis and a scoring system to detect bots that use uncommon communication channels and have evaded detection by other conventional IDSs. However, Rishi can easily be misled by disguised nicknames and cannot detect encrypted communication or non-IRC botnets. Mazzariello [32] provides another IRC botnet classification algorithm to differentiate human IRC traffic from automated IRC traffic in IRC log files. This algorithm applies Support Vector Machine (SVM) and J48 decision tree classifiers to a data set with the features listed in Table I. Although the experimental results indicate almost perfect separation of botnet C&C traffic from normal traffic, the classification process of this algorithm depends on predefined IRC models, which limits its effectiveness across different types of botnets.

TABLE I: A List of Features used in the IRC botnet classification [32].

| Feature Name | Feature Detail |
| --- | --- |
| User Number | Total number of users in the IRC channel |
| Average Words Number | Average number of unique words in a sequence |
| Average/Variance of Channel Dictionary Cardinality | Mean and variance of the vocabulary's cardinality |
| Unusual Nickname | Nickname rarely seen among all existing nicknames |
| Equal Answers | Number of sentences with a common ordered subset of words |
| Control Command Number | Count of control commands issued |
| Join Number | JOIN rate in the IRC channel |
| SetMode Number | SetMode rate in the IRC channel |
| Nickname Changes | Count of nickname changes in an IRC channel |
| Ping Number | Ping rate in the IRC channel |
| IRC Commands Number | Overall IRC command rate |
| Active User Number | Number of users active in the IRC channel |

TABLE II: A List of Features used in the IRC botnet classification [30].

| Feature Name | Feature Detail |
| --- | --- |
| start/end | Flow start/end times |
| IP-proto | IP protocol of flow |
| TCP flags | Summary of TCP SYN/FIN/ACK flags |
| pkts | Total packets exchanged in flow |
| Bytes | Total bytes exchanged in flow |
| pushed pkts | Total packets pushed in flow |
| duration | Flow duration |
| maxwin | Maximum initial congestion window |
| role | Whether client or server initiated flow |
| bpp | Average bytes per packet for flow |
| bps | Average bits per second for flow |
| pps | Average packets per second for flow |
| PctPktsPushed | Percentage of packets pushed in flow |
| PCTBppHistBin0-7 | Percent of packets in one of the eight packet size bins; these variables collectively form a histogram of packet size for flow |
| varIAT | Variance of packet inter-arrival time for flow |
| varBpp | Variance of bytes per packet for flow |

The Internetwork research department at BBN Technologies has also proposed a machine learning technique for IRC botnet detection. The system uses flow-level statistical characteristics, where a network flow is defined as a group of packets with identical IP protocol, source and destination IP addresses, and source and destination port numbers within a predefined time interval. This system has implemented multiple classification algorithms (J48 decision tree, naive Bayes, and Bayesian network) on a data set containing the features in Table II. The outcome of this IDS shows successful classification of IRC-based C&C traffic.
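The flow definition above can be sketched as follows. The `Packet` record, the fixed-window bucketing, and the 60-second default are assumptions made for illustration; the original system may delimit the time interval differently. The statistics computed correspond to a few of the Table II features (pkts, Bytes, duration, bpp).

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical per-packet record extracted from a capture."""
    ts: float     # capture timestamp in seconds
    proto: str    # IP protocol, e.g. "tcp"
    src: str
    sport: int
    dst: str
    dport: int
    size: int     # bytes

def group_flows(packets, window=60.0):
    """Group packets by 5-tuple plus the index of a fixed time window."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.proto, p.src, p.sport, p.dst, p.dport, int(p.ts // window))
        flows[key].append(p)
    return flows

def flow_stats(pkts):
    """Compute a few flow-level features for one group of packets."""
    total_bytes = sum(p.size for p in pkts)
    duration = max(p.ts for p in pkts) - min(p.ts for p in pkts)
    return {
        "pkts": len(pkts),
        "bytes": total_bytes,
        "duration": duration,
        "bpp": total_bytes / len(pkts),  # average bytes per packet
    }
```

Because only header fields and timing are used, this kind of feature extraction works without access to packet payloads.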

Masud et al. [31] proposed a robust and effective flow-based botnet traffic detection method that considers correlations between multiple log files. This method does not require access to payload content, does not impose any restriction on the C&C protocol, and is effective even if C&C channels are encrypted. One common characteristic of all the aforementioned detection schemes is that they implement classification algorithms that require a well-defined training set to achieve good performance. This dependence on the training set restricts these methods to detecting only captured or known botnets. A recent approach, Botminer [19], has addressed this limitation by using an unsupervised X-means clustering algorithm that does not require any training data. After the initial clustering, further correlations are processed to identify botnet C&C traffic. Botminer is an advanced botnet detection system that is independent of botnet protocol and structure and requires no botnet signature or training set, so it is able to detect real-world botnets, including centralized IRC and HTTP botnets and distributed P2P botnets, with a very low false positive rate.

Overall, there is no universal botnet detection method that achieves high performance on all evaluation criteria, such as response time, accuracy, and resource consumption. From the previous sections we conclude that botnets tend to be diversified across protocols and structures, the number of botnet variants is steadily increasing, and botnet communication techniques have become increasingly encrypted and disguised. Recent detection approaches that apply clustering algorithms to netflow data without payload content have shown that, in order to counter the escalation of botnet development, botnet detection techniques need to be independent of protocol, structure, and payload content. Although time-related features have been used extensively in both classification and clustering mechanisms, the question of how best to mix a botnet trace with normal traffic in a properly timed manner has never drawn enough attention. Even in the most recent Botminer system, botnet traces with possibly international routing times, which are usually several times larger than intra-country routing times, have been directly mixed with campus-wide background traffic. Given the use of features such as the number of flows per hour and the average bytes per second, this neglect of timing factors calls into question the claimed performance of Botminer. In the next section, we describe how simulated botnet traces have been mixed with real-world Internet traffic by using round trip time (RTT) attunement, and how clustering algorithms have been applied to the resulting salted trace.

V. EXPERIMENT SET UP AND RESULTS

As stated in the previous section, clustering-based detection on flow-level Internet traffic without packet content inspection has shown promising resilience to the rapid escalation of botnet development. In this section, some preliminary experiments, inheriting the traffic filtering ideas and mining-based algorithms from previous approaches, have been conducted with a novel introduction of RTT adjustment. One of the referenced approaches was introduced by Karasaridis et al. in 2007 [27]. It collects a specific type of netflow data, the candidate controller conversation (CCC), which is a conversation between a suspected bot and a remote host that satisfies certain criteria consistent with control traffic, and applies a hierarchical scoring system to distinguish CCCs from normal traffic. This approach has demonstrated the effectiveness of the hierarchical algorithm and the discriminating power of the control traffic. A second referenced method is the extension of the aforementioned data-mining-based method proposed by the Internetwork research department at BBN Technologies [44]. In this approach, a botnet testbed consisting of an IRC server and 10 bots has been constructed. A reverse-engineered Kaiten botnet source code [23], which is also used in our experiment, has been run to generate the simulated botnet traffic. After the botnet traffic is mixed with the normal background trace, and before the classification stage, a filtering stage is introduced. In this filtering stage, a first filter selects TCP-based flows; a second filter removes port scanning traffic; a third filter eliminates flows with a high bit rate; a fourth filter excludes flows containing packets whose size is larger than 300 bytes; a fifth filter rejects all short flows (less than 2 packets or 60 seconds). The filtering design, although time consuming, has provided the idea of extracting useful flows with the proper configurations. In our experiment, a filtering stage has also been developed to extract suitable RTT information. Our approach mixes the simulated botnet traces with normal Internet traffic by unifying the RTT extracted from real candidate traffic after filtering. Hierarchical and K-means clustering algorithms are then applied to distinguish the botnet C&C traffic.
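The five-stage filtering pipeline of the referenced BBN approach can be sketched as a single predicate over flow records. The flow dictionary layout, the `is_scan` flag (assumed to be set by some upstream scan detector), and the `max_bps` threshold are hypothetical; the text gives concrete values only for the 300-byte packet size limit and the 2-packet/60-second minimum.

```python
def keep_flow(flow, max_bps=8000.0):
    """Return True if a flow passes all five filters described in the text.

    flow is a dict with hypothetical keys: proto, is_scan, bps,
    max_pkt_size, pkts, duration. max_bps is an illustrative threshold.
    """
    if flow["proto"] != "tcp":                    # filter 1: TCP-based flows only
        return False
    if flow["is_scan"]:                           # filter 2: drop port scanning traffic
        return False
    if flow["bps"] > max_bps:                     # filter 3: drop high-bit-rate flows
        return False
    if flow["max_pkt_size"] > 300:                # filter 4: drop flows with packets > 300 bytes
        return False
    if flow["pkts"] < 2 or flow["duration"] < 60: # filter 5: drop short flows
        return False
    return True

def filter_flows(flows, max_bps=8000.0):
    """Apply keep_flow to an iterable of flow records."""
    return [f for f in flows if keep_flow(f, max_bps)]
```

Ordering the cheap checks first (protocol, flags) keeps the cost of the more expensive statistics low, which matters given that the text describes the filtering as time consuming.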

A. Botnet Trace

A testbed consisting of one botmaster and three bots was constructed using VMware by A. Tangpong [45]. The Kaiten botnet source code was run for one hour to generate the C&C traffic described below: after each bot starts, it initiates a connection to the botmaster and sends a NICK IRC message to convey that the client is online with a certain ID. After receiving the corresponding ACK message from the botmaster, the bots idle for 20 minutes waiting for commands before they reinitiate connections to the botmaster. The original IP addresses assigned in this botnet are 192.168.158.134 (botmaster), 192.168.158.131, 192.168.158.133, and 192.168.158.135, and the port number used is 6668. Overall, 6 botnet flows have been generated and captured with Wireshark [12].

B. Background Internet Traffic

The background traffic used in our experiment was captured by the internal Lawrence Berkeley National Laboratory (LBNL) router from 16:43 to 17:43 on December 15, 2004 [36]. It has 6,591,383 packets and 2,662 unique flows (i.e., groups of packets sharing the same IP addresses, port numbers, and protocol). In preparation for the salting process, a filtering stage has been designed to extract RTTs from candidate IP addresses. First, an IP fan-out filter selects IP addresses connected to at least 4 other IP addresses (3 bots and 1 normal host). Second, in order to calculate the RTTs, the IP addresses left by the previous filter must have bidirectional TCP-based flows. In practice, the filter is written as tcp.flags == 0x02 || tcp.flags == 0x12 || (tcp.flags == 0x10 && tcp.seq == 1 && tcp.ack == 1). Third, in order to calculate the RTT as accurately as possible, the candidate IP addresses must have TCP-based connections with at least 3 IP addresses sharing the same prefix of 28 bits (i.e., the same subnet). Finally, RTTs are computed as the sum of δ1 and δ2, as shown in Figure 2. There are 73 candidate IP addresses that qualify after the filtering stage. One example candidate IP address, demonstrated in Figure 3, is 148.19.5.188, which has TCP-based connections with 131.243.92.207, 131.243.92.148, 131.243.94.62, and 131.243.95.50.
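The candidate selection and RTT computation can be sketched as below, using Python's standard `ipaddress` module. The function names and the input format (pairs of endpoint addresses) are hypothetical, and the prefix length is left as a parameter. The mapping of δ1 and δ2 to the gaps between the SYN, SYN-ACK, and ACK packets is an assumption, suggested by the handshake flags in the filter expression, since Figure 2 is not reproduced here.

```python
import ipaddress
from collections import Counter, defaultdict

def build_peers(conversations):
    """Map each IP to the set of distinct IPs it exchanged TCP traffic with."""
    peers = defaultdict(set)
    for a, b in conversations:
        peers[a].add(b)
        peers[b].add(a)
    return peers

def subnet_of(ip, prefix_bits):
    """Return the network that contains ip for the given prefix length."""
    return ipaddress.ip_network(f"{ip}/{prefix_bits}", strict=False)

def is_candidate(ip, peers, min_fan_out=4, min_same_subnet=3, prefix_bits=28):
    """Fan-out filter plus the shared-prefix requirement from the text."""
    neighbours = peers.get(ip, set())
    if len(neighbours) < min_fan_out:
        return False
    per_subnet = Counter(subnet_of(n, prefix_bits) for n in neighbours)
    return max(per_subnet.values()) >= min_same_subnet

def handshake_rtt(t_syn, t_synack, t_ack):
    """Assumed reading of Figure 2: delta1 is the SYN to SYN-ACK gap,
    delta2 is the SYN-ACK to ACK gap, and RTT is their sum."""
    delta1 = t_synack - t_syn
    delta2 = t_ack - t_synack
    return delta1, delta2, delta1 + delta2
```

Averaging the RTTs over all qualifying handshakes of a candidate address gives one representative value per candidate, comparable to the average row in Figure 3.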

Fig. 2: RTT Calculation Demonstration.

| Source Port | Source IP | Destination IP | Destination Port | RTT | Slave to Master | Master to Slave |
| --- | --- | --- | --- | --- | --- | --- |
| 49403 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00349900 | 0.00302100 | 0.00047800 |
| 49405 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00347900 | 0.00299072 | 0.00048828 |
| 48903 | 131.243.92.148 | 148.19.5.188 | 22 | 0.00329590 | 0.00305176 | 0.00024414 |
| 49407 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00366211 | 0.00305176 | 0.00061035 |
| 55340 | 131.243.95.50 | 148.19.5.188 | 22 | 0.00378418 | 0.00329590 | 0.00048828 |
| 4842 | 131.243.94.62 | 148.19.5.188 | 22 | 0.00561523 | 0.00329590 | 0.00231934 |
| 49409 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00354004 | 0.00305176 | 0.00048828 |
| 4843 | 131.243.94.62 | 148.19.5.188 | 22 | 0.00451660 | 0.00292969 | 0.00158691 |
| 4844 | 131.243.94.62 | 148.19.5.188 | 22 | 0.00476074 | 0.00292969 | 0.00183105 |
| 49411 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00341797 | 0.00292969 | 0.00048828 |
| 49412 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00366211 | 0.00317383 | 0.00048828 |
| 49414 | 131.243.92.207 | 148.19.5.188 | 110 | 0.00341797 | 0.00292969 | 0.00048828 |
| Average Values: | | | | 0.00388757 | 0.00305428 | 0.00083329 |
| Variance: | | | | 0.00070615 | 0.00013427 | 0.00067481 |

Fig. 3: RTTs of 12 Flows in One Subset.

C. Salting the Background Trace with Botnet Trace

Similar to one of the design ideas used in TCPopera [26], the timestamp of an acknowledgement packet depends on the corresponding data packet, and the timestamp of the next data packet depends on the previous acknowledgement. In order to attune the RTT of the botnet trace to the background candidate traffic, the modification procedure, shown in Figure 4, is:

1. The initial timestamp stays the same, w1 = t1;

2. The timestamp of the first acknowledgement packet, w2, is changed to w2 = w1 + δ1, where δ1 has been calculated following Figure 2;

3. The timestamp of the second data packet, w3, is modified to w3 = w2 + δ2, where δ2 is also computed from Figure 2;

4. The rest of the timestamps are calculated correspondingly following the above two cases.
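The timestamp modification procedure can be sketched as a simple recurrence over the packets of one botnet flow. The packet labels ("data" and "ack") and the function name are hypothetical; the sketch follows only the two cases stated in the procedure and does not treat idle gaps specially.

```python
def attune_timestamps(kinds, t1, delta1, delta2):
    """Recompute capture timestamps for one flow.

    kinds: sequence of "data" / "ack" labels, one per packet, in order.
    t1: original timestamp of the first packet (kept unchanged, w1 = t1).
    delta1: gap added before an acknowledgement (w2 = w1 + delta1).
    delta2: gap added before the next data packet (w3 = w2 + delta2).
    """
    new_ts = [t1]
    for kind in kinds[1:]:
        step = delta1 if kind == "ack" else delta2
        new_ts.append(new_ts[-1] + step)
    return new_ts

# Example with deltas of the same order as the averages in Figure 3.
w = attune_timestamps(["data", "ack", "data", "ack"], 100.0, 0.003, 0.0008)
```

Each data/acknowledgement pair then spans δ1 + δ2, so the salted botnet flow exhibits the same RTT as the background candidate traffic.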

Fig. 4: RTT Modification Algorithm.

Finally, the values of the timestamps inside each packet are modified. The timestamps option in the TCP header consists of two 32-bit fields: the Timestamp Value (TSval) and the Timestamp Echo Reply (TSecr). When the timestamps option is activated, TSval records the current value of the sender's timestamp clock. The increment of the timestamp clock is usually proportional to the real time increment. An increment of 1 in the TSval field corresponds to a 1 millisecond increment in real time.
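As a small worked illustration, the TSval of a packet can be shifted consistently with its new capture time. The function name is hypothetical; the 1 ms tick is the value stated above and is exposed as a parameter, and the result wraps modulo 2^32 because the field is 32 bits wide.

```python
def shift_tsval(tsval, elapsed_s, ms_per_tick=1.0):
    """Advance a 32-bit TSval by elapsed_s seconds of real time.

    ms_per_tick defaults to the 1 ms per tick stated in the text.
    """
    ticks = int(round(elapsed_s * 1000.0 / ms_per_tick))
    return (tsval + ticks) % 2**32

# A packet moved 250 ms later gets a TSval that is 250 ticks larger.
new_tsval = shift_tsval(1000, 0.25)
```

The TSecr field of the peer's next packet would need the same value echoed back so that the two directions of the flow stay consistent.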

Fig. 5: One Packet Modification Example.

Fig. 6: One Packet Modification Example.

There are two ways to modify the pcap files. The first method is to convert the pcap file into a text file, including all the packet byte details, and then modify the corresponding field values in it. After all the modifications are finished, text2pcap.exe is used to convert the text document back into a pcap file, as demonstrated in Figure 5 and Figure 6.

Another method is to use the netdude [29] software to modify the pcap file directly. In Figure 7, the timestamp field value is being changed. Under the IPv4 and TCP tabs, the IP addresses and port numbers can be changed correspondingly.

Fig. 7: Netdude Modification Example.

D. Experiment Results

1) Data Set: The data set used in the clustering stage consists of the LBNL trace mentioned above and the 6 5-tuple botnet flows. Since Internet applications are identified by the destination port number, 4-tuple flows (with the same source IP, destination IP, destination port number, and protocol) have been extracted from the LBNL background trace. Overall, there are 8803 flows in the data set. In hierarchical and K-means clustering, a total of 16 features, as shown in Table III, have been used for each flow, which means that the data set is an 8803×16 matrix.

TABLE III: A List of Features used in the Experiment.

| Feature Name | Feature Detail |
| --- | --- |
| AvgSize C | average size of IP payload sent by client |
| AvgSize S | average size of IP payload sent by server |
| AvgSize C/S | the ratio of AvgSize C over AvgSize S |
| VarSize C | variance of size of IP payload sent by client |
| VarSize S | variance of size of IP payload sent by server |
| VarSize C/S | the ratio of VarSize C over VarSize S |
| SizeHomoC/S | the ratio of SizeHomoC over SizeHomoS |
| SizeHomoS/C | the ratio of SizeHomoS over SizeHomoC |
| AvgDiffSize C | average of absolute difference in IP payload size of two consecutive packets sent by client |
| AvgDiffSize S | average of absolute difference in IP payload size of two consecutive packets sent by server |
| AvgIntv C | average time difference of two consecutive packets sent by client |
| AvgIntv S | average time difference of two consecutive packets sent by server |
| VarIntv C | variance of time difference of two consecutive packets sent by client |
| IntvHomo C | the ratio of MaxIntv C over MinIntv S |
| IntvHomo S | the ratio of MaxIntv S over MinIntv C |

2) Hierarchical Clustering: Hierarchical clustering partitions the data set iteratively using agglomerative or divisive techniques [33][17]. The agglomerative (bottom-up) technique starts with as many clusters as data points and repeatedly combines the most similar points into a single cluster. The divisive (top-down) technique starts with a single cluster containing all the data points and splits off the most dissimilar data point as a new cluster in each iteration. The workflow of the hierarchical clustering algorithm implemented in this experiment is described below:

1. Initialize n clusters for the n data points. The center of each cluster is initialized to the 1×16 feature array of the corresponding data point.

2. Compute Euclidean distance (Eq 1) between all clusters. Each cluster stores the label of its nearest neighbor after thisstep.

$$D_E(i) = \sum_{j=1}^{16} \left(X_{ij} - C_{ij}\right)^2 \tag{1}$$

3. Merge the two most similar (i.e., overall smallest Euclidean distance) clusters into a new cluster. The center of the new cluster is the average of the two centers of the merged clusters, weighted by their sizes (Eq 2). A Euclidean distance computation is then performed on all the other clusters to find the updated nearest neighbors.

$$C_{new}(i) = \frac{n_{old1} \times C_{old1}(i) + n_{old2} \times C_{old2}(i)}{n_{old1} + n_{old2}}, \quad i = 1, \ldots, 16 \tag{2}$$

4. Repeat from step 2 until the specified number of clusters has been reached.
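The four steps above can be sketched in a few lines of Python. This is a naive illustration, not the program used in the experiment: it recomputes all pairwise distances in every iteration instead of caching nearest neighbors, compares squared distances (which gives the same ordering as Euclidean distances), and merges centers as a size-weighted average. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature arrays."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, n_clusters):
    """Bottom-up clustering; returns a list of (center, member_indices)."""
    # Step 1: one cluster per data point, centered on that point.
    clusters = [(list(p), [i]) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        # Step 2: find the closest pair of cluster centers.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sq_dist(clusters[a][0], clusters[b][0])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 3: merge them; the new center is the size-weighted average.
        (ca, ma), (cb, mb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        center = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[a] = (center, ma + mb)
        del clusters[b]
        # Step 4: repeat until the requested number of clusters remains.
    return clusters
```

The full pairwise search costs on the order of n² distance evaluations per merge, which is consistent with hierarchical clustering being much slower than K-means on a data set of several thousand flows.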

Since the botnet flows are statistically almost identical with respect to the selected features, they are clustered together at an early stage of the hierarchical processing. With the final number of clusters set to 420 (the optimal number of clusters for K-means clustering, derived in the following section), 44 other flows are combined with the 6 botnet flows in cluster number 26. However, out of these 44 flows, only one flow is TCP-based, with a destination port number of 111 (Open Network Computing Remote Procedure Call [41]). With an additional boolean feature that filters out non-TCP flows, this result indicates that hierarchical clustering should be able to distinguish the botnet traces from the legitimate background trace. Overall, it takes 1237 seconds to run the program over the flow-level data set.

3) K-means Clustering: K-means is another common and easy-to-implement partitioning algorithm. Its workflow is given as follows [22]:

1. All data points are randomly divided into a predefined number of clusters. For each cluster, the centroid is computed as the average of the arrays of all the data points in the cluster. The labels of the partitioned data points are also stored in the lists of the corresponding clusters.

2. Every data point is reassigned to the cluster whose centroid has the smallest Euclidean distance to it.

3. The centroids and group lists of the clusters are updated.

4. Repeat from step 2 until a stopping condition has been satisfied.
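A minimal sketch of these steps is given below. The stopping condition (no label changes, or a maximum number of iterations) and the handling of clusters that are empty after the random partition (they keep a randomly chosen data point as placeholder centroid) are assumptions, since the text does not specify them; the function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature arrays."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(rows):
    """Component-wise mean of a non-empty list of feature arrays."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centers) after clustering points into k groups."""
    rng = random.Random(seed)
    # Step 1: random initial partition of the data points.
    labels = [rng.randrange(k) for _ in points]
    # Placeholder centers, used only for clusters that start out empty.
    centers = [list(rng.choice(points)) for _ in range(k)]
    for _ in range(max_iter):
        # Steps 1 and 3: (re)compute the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = centroid(members)
        # Step 2: reassign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers
```

Each iteration costs on the order of n·k distance evaluations, which helps explain why K-means completes far faster than the hierarchical method on the same data set.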

One key issue for K-means clustering is how to find the number of clusters that minimizes a specific objective. In our experiment, the objective function is the total distortion, which is the sum of squared Euclidean distances between all the data points and the centroids of the clusters they belong to. To find the minimum distortion, the data set has been randomly split into a training set (7000 flows) and a testing set (1803 flows) five times, for each assumed number of clusters. After the positions of the centroids have been computed using the training set, the corresponding distortion for the testing set is calculated. Based on the distortion trend plot in Figure 8, while the distortion of the training set is monotonically decreasing, that of the testing set appears to reach its optimum at 420 clusters, which is the number of clusters chosen for further K-means clustering analysis.
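The distortion measure and the random train/test split can be sketched as follows. Assigning each test point to its nearest trained centroid is an assumption about how the test distortion is evaluated, and the function names are hypothetical; the centroids can come from any K-means implementation.

```python
import random

def distortion(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        for p in points
    )

def random_split(points, n_train, seed):
    """Shuffle a copy of the data and split it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Example: with a single center at the origin, the point (3, 4)
# contributes a squared distance of 25 to the total distortion.
total = distortion([[0, 0], [3, 4]], [[0, 0]])
```

Repeating the split with different seeds and averaging the test distortion for each candidate number of clusters yields a curve of the kind described for Figure 8; training distortion keeps falling as clusters are added, so only the held-out curve is useful for choosing the number of clusters.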

Fig. 8: Distortion vs. Number of clusters.

Among the 420 clusters, all 6 botnet flows have been clustered into one cluster (cluster 212). Similar to the hierarchical result, there are 18 other flows in that cluster. Nevertheless, with the boolean feature that eliminates all non-TCP flows, only two TCP flows running on port 111 are left as false positives. Since the objective function of the K-means clustering algorithm in our experiment minimizes the distortion of the entire traffic, other numbers of clusters have also been tested for botnet cluster purity (i.e., a cluster with only botnet flows). For example, when the entire data set is partitioned into 512 clusters, all 6 botnet flows are clustered into cluster 202, which does not include any other flows. This shows that K-means clustering can, for a suitable number of clusters, isolate the botnet flows perfectly. K-means also takes much less time to complete than the hierarchical method: it takes 111 seconds to reach the steady state for 420 clusters, and 128 seconds for 512 clusters. Furthermore, for feature reduction, a relative score (standard deviation/average value) can be computed for each feature within the clusters. Basically, features with a small intra-cluster relative score (and possibly a large inter-cluster relative score) should be considered the features with great discriminating power, as shown in Figure 9.
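The relative score (standard deviation divided by average value) can be computed per feature within one cluster as sketched below. The use of the population standard deviation and the treatment of zero-mean features are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev

def relative_scores(rows):
    """Per-feature relative score (std / |mean|) for the flows of one cluster.

    rows: list of equal-length feature arrays belonging to the cluster.
    """
    scores = []
    for column in zip(*rows):
        m = mean(column)
        s = pstdev(column)  # population standard deviation (assumption)
        if m == 0:
            # Constant zero feature scores 0; otherwise the ratio is undefined.
            scores.append(0.0 if s == 0 else float("inf"))
        else:
            scores.append(s / abs(m))
    return scores

# Example: the first feature is constant within the cluster (score 0),
# the second varies by half of its mean (score 0.5).
scores = relative_scores([[10, 1], [10, 3]])
```

A feature whose score is small inside the botnet cluster but large across clusters separates botnet flows well, which is the selection criterion described above.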

Overall, this preliminary experiment has shown the capability of hierarchical and K-means clustering to detect botnet flows, and has provided an RTT adjustment method for mixing the botnet trace with normal background Internet traffic.

VI. CONCLUSION

Since 1989, botnets have evolved from a benign assistant tool into the predominant threat on the modern Internet. Although the number of bots per botnet seems to be decreasing, the monetary damage that botnets can cause is continuously increasing given the growth of Internet bandwidth. Instead of using a centralized, IRC-based C&C channel to perform multiple nefarious attacks, botnets have gradually developed into more complicated, stealthy, and modular packages that perform particular malicious activities with diverse C&C protocols and structures. In order to counter the escalation of botnet evolution, mining-based detection methods operating on flow-level Internet traffic have demonstrated some promising performance. However, feature extraction from the raw data, the huge number of possible features, and the proper mixture of botnet traces with background Internet traffic make it difficult to use this method as an online IDS. Even for conventional IRC-based botnets, there are rarely invariant features among all the C&C traffic. Instead of designing a universal IDS, a particular solution needs to be developed for different circumstances, such as Internet data type, response time, and complexity.

Acknowledgements: We wish to thank Berkay Celik for his feedback on this manuscript.

REFERENCES

[1] M. Akiyama, T. Kawamoto, M. Shimamura, T. Yokoyama, Y. Kadobayashi, and S. Yamaguchi. A proposal of metrics for botnet detection based on its cooperative behavior. In Applications and the Internet Workshops, 2007. SAINT Workshops 2007. International Symposium on, pages 82–82, 2007.

[2] I. Arce and E. Levy. An analysis of the slapper worm. IEEE Security & Privacy, 1(1):82–87, 2003.

[3] P. Bacher, T. Holz, M. Kotter, and G. Wicherski. Know your enemy: Tracking botnets. http://www.honeynet.org/papers/bots, 2005.

[4] P. Barford and V. Yegneswaran. An inside look at botnets. Malware Detection, pages 171–191, 2006.

[5] D. Barr. RFC 1912: Common DNS operational and configuration errors. http://www.ietf.org, Feb. 1996. Obsoletes RFC1537 [6]. Status: INFORMATIONAL.

[6] P. Beertema. RFC 1537: Common DNS data file configuration errors. http://www.ietf.org, Oct. 1993. Obsoleted by RFC1912 [5]. Status: INFORMATIONAL.

Fig. 9: SizeHomoC, SizeHomoS, and VarSizeC/s are good choices.

[7] J. Binkley and S. Singh. An algorithm for anomaly-based botnet detection. In Proceedings of USENIX Steps to Reducing Unwanted Traffic on the Internet Workshop (SRUTI), pages 43–48, 2006.

[8] M. Bowden. The enemy within. http://www.theatlantic.com/magazine/archive/2010/06/the-enemy-within/8098/, June 2010.

[9] T. Brisco. RFC 1794: DNS support for load balancing. http://www.ietf.org, Apr. 1995. Status: INFORMATIONAL.

[10] K. Chiang and L. Lloyd. A case study of the rustock rootkit and spam bot. In The First Workshop in Understanding Botnets, 2007.

[11] H. Choi, H. Lee, H. Lee, and H. Kim. Botnet detection by monitoring group activities in DNS traffic. In Proceedings of the 7th IEEE International Conference on Computer and Information Technology, pages 715–720. IEEE Computer Society, 2007.

[12] G. Combs et al. Wireshark. http://www.wireshark.org, 2007.

[13] E. Cooke, F. Jahanian, and D. McPherson. The zombie roundup: Understanding, detecting, and disrupting botnets. In Proceedings of the USENIX SRUTI Workshop, pages 39–44, 2005.

[14] D. Dagon. Botnet detection and response. In OARC Workshop, 2005, 2005.

[15] N. Daswani and M. Stoppelman. The anatomy of Clickbot.A. In Proceedings of the first conference on First Workshop on Hot Topics in Understanding Botnets, page 11. USENIX Association, 2007.

[16] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally distributed content delivery. IEEE Internet Computing, pages 50–58, 2002.

[17] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000.

[18] J. Goebel and T. Holz. Rishi: Identify bot contaminated hosts by irc nickname evaluation. In USENIX Workshop on Hot Topics in Understanding Botnets (HotBots 07), 2007.

[19] G. Gu, R. Perdisci, J. Zhang, W. Lee, et al. BotMiner: Clustering analysis of network traffic for protocol-and structure-independent botnet detection. In Proceedings of the 17th USENIX Security Symposium (Security08), 2008.

[20] G. Gu, P. Porras, V. Yegneswaran, M. Fong, and W. Lee. Bothunter: Detecting malware infection through ids-driven dialog correlation. In Proceedings of the 16th USENIX Security Symposium, pages 167–182, 2007.

[21] G. Gu, J. Zhang, and W. Lee. BotSniffer: Detecting botnet command and control channels in network traffic. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS08). Citeseer, 2008.

[22] A. Hinneburg and D. Keim. An efficient approach to clustering in large multimedia databases with noise. Knowledge Discovery and Data Mining, 5865, 1998.

[23] T. Holz. A short visit to the bot zoo. IEEE Security and Privacy, 3:76–79, 2005.

[24] T. Holz, C. Gorecki, K. Rieck, and F. Freiling. Measuring and detecting fast-flux service networks. In Symposium on Network and Distributed System Security. Citeseer, 2008.

[25] T. Holz, M. Steiner, F. Dahl, E. Biersack, and F. Freiling. Measurements and mitigation of peer-to-peer-based botnets: a case study on storm worm. In Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, pages 1–9. USENIX Association, 2008.

[26] S. Hong and S. Wu. On interactive Internet traffic replay. In Recent Advances in Intrusion Detection, pages 247–264. Springer, 2005.

[27] A. Karasaridis, B. Rexroad, and D. Hoeflin. Wide-scale botnet detection and characterization. In USENIX Workshop on Hot Topics in Understanding Botnets (HotBots 07), 2007.

[28] M. Kola. Botnets: Overview and Case Study. PhD thesis, IBM Research, 2008.

[29] C. Kreibich. Design and implementation of netdude, a framework for packet trace manipulation. In Proc. USENIX/FREENIX, 2004.

[30] C. Livadas, R. Walsh, D. Lapsley, and W. Strayer. Using machine learning techniques to identify botnet traffic. In 2nd IEEE LCN Workshop on Network Security (WoNS2006). Citeseer, 2006.

[31] M. Masud, T. Al-khateeb, L. Khan, B. Thuraisingham, and K. Hamlen. Flow-based identification of botnet traffic by mining multiple log files. In Distributed Framework and Applications, First International Conference on, pages 200–206, 2008.

[32] C. Mazzariello. IRC traffic analysis for botnet detection. In Information Assurance and Security, 2008. ISIAS’08. Fourth International Conference on, pages 318–323, 2008.

[33] J. Navarro, C. Frenk, and S. White. Hierarchical Clustering. The Astrophysical Journal, 490:493–508, 1997.

[34] J. Nazario. Blackenergy DDoS bot analysis. Arbor, 2007.

[35] J. Oikarinen and D. Reed. RFC 1459: Internet Relay Chat Protocol. http://www.ietf.org, 1993.

[36] R. Pang, M. Allman, V. Paxson, and J. Lee. The devil and packet trace anonymization. ACM SIGCOMM Computer Communication Review, 36(1):38,

2006.[37] A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. InProceedings of the 6th ACM

SIGCOMM Conference on Internet Measurement, page 52. ACM, 2006.[38] J. Riden. Know Your Enemy: Fast-flux Service Networks. http://www.honeynet.org/papers/ff, 2008.[39] M. Roesch. Snort-lightweight intrusion detection fornetworks. InProceedings of the 13th USENIX conference on System administration, pages 229–238.

Seattle, Washington, 1999.[40] A. Schonewille and D. van Helmond. The domain name service as an IDS.Research Project for the Master System-and Network Engineering at the

University of Amsterdam, 2006.[41] R. Srinivasan. RFC 1831: RPC: Remote procedure call protocol specification version 2. www.ietf.org, Aug. 1995. Status: PROPOSED STANDARD.[42] J. Stewart. Sinit P2P trojan analysis. http://www.secureworks.com/research/threats/sinit, 2003.[43] S. Stover, D. Dittrich, J. Hernandez, and S. Dietrich. Analysis of the Storm and Nugache Trojans: P2P is here.USENIX; login, 32(6):2007–12, 2007.[44] W. Strayer, D. Lapsely, R. Walsh, and C. Livadas. Botnetdetection based on network behavior.Botnet Detection, pages 1–24, 2006.[45] A. Tangpong and G. Kesidis. A controlled environment for botnet traffic generation. http://www.cse.psu.edu/∼tangpong/botnet/, April 2009.[46] R. Vogt, J. Aycock, and M. Jacobson. Army of botnets. InProceedings of the 2007 Network and Distributed System Security Symposium (NDSS 2007),

pages 111–123. Citeseer, 2007.[47] P. Wang, S. Sparks, and C. C. Zou. An advanced hybrid peer-to-peer botnet. InUSENIX Workshop on Hot Topics in Understanding Botnets (HotBots07),

2007.