
SUT: Quantifying and mitigating URL typosquatting


Computer Networks 55 (2011) 3001–3014


Anirban Banerjee, Md Sazzadur Rahman (corresponding author), Michalis Faloutsos
StopTheHacker, Jaal LLC, School of Computer Science, University of California, Riverside, Riverside, CA 92507, United States

Article info

Article history: Received 17 February 2011; Received in revised form 26 May 2011; Accepted 2 June 2011; Available online 25 June 2011

Keywords: Typosquatting; Phony sites; Classification

doi:10.1016/j.comnet.2011.06.005
Corresponding author: M.S. Rahman. E-mail addresses: [email protected] (A. Banerjee), [email protected] (M.S. Rahman), [email protected] (M. Faloutsos).

Abstract

One form of profiting from the web is URL typosquatting: people register phony sites that are common misspellings of popular sites. These phony sites advertise and sell products or, in the worst case, con users into identity theft. In this work, we quantify the extent of this phenomenon and propose SUT, a practical countermeasure based on network metrics. We start with an initial set of 900 popular websites and create 3 million name variations in a systematic and exhaustive way. We find that URL typosquatting is a widespread phenomenon and identify common practices and preferred targets of typosquatters. Second, we find that phony websites exhibit significantly different network-layer behavior, such as the number of HTTP redirections, compared to regular sites. Based on this insight, we develop SUT, an automated approach to detect phony websites. We find that the power of SUT lies in the use of the network-layer profile of the phony sites, and less in the perceived popularity of the site. We find that SUT can identify phony websites with near-perfect accuracy and recall in our controlled tests. We conclude that our approach is a promising step towards protecting users from URL typosquatting.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Typosquatting is a relatively recent but serious phenomenon, which is attracting significant attention [12,13]. Typosquatters register domain names similar to those of prominent websites and expect to take advantage of users mistyping the URL address or of spam campaigns. Once at these sites, which we call phony, typosquatters attempt to advertise and sell products or to extract personal information from users, essentially starting an identity theft process. Note that typosquatting as defined above is not technically illegal: the perpetrator lawfully owns the misspelled domain. For example, samachar.com is a popular news portal, which when mistyped as samchar.com opens up an adult site. In an effort to address this problem, popular sites often buy URLs which are similar to their own URL. For example, gogle.com leads to google.com. Interestingly, eliminating phony sites through buy-outs indirectly encourages typosquatting.

Typosquatting is an enabling factor for cyber-crime. First, the average user is not careful and could use any help they can get. For example, users are often careless in typing or in visiting a site through an e-mail link. A study by Garfinkel and Miller [10] indicates the gullibility of users in trusting spoofed email and spam: the name of the sender and the content matter more than the email address of the sender. Second, typosquatting can be seen as an enabling component of the larger cyber-crime problem [1-6,8,9,11-14], which includes a range of activities, from annoying behaviors, like pop-up windows and spam, to e-customer poaching, all the way to identity theft. Extensive studies performed by the Gartner Group in 2007 [7] put the cost of Internet-based identity theft at around $3.2 billion per year in the US alone.

In more detail, typosquatting is parasitic or even harmful in four ways, as it usually leads users to: (a) ad-portal or parked domains, which bring monetary benefit to the owner of the ad-portal domains by arguably misusing the ad-syndication business [42], (b) a competitor of the intended site, (c) malicious sites, which will infect the users' computers and turn them into bots, and (d) phishing sites, which subject the users to identity theft.

Despite the limited research studies, typosquatting has attracted sufficient attention in practice to evoke commercial and community-based countermeasures. First, there are a few commercial proprietary products and services that attempt to detect phishing and malware sites, as we discuss in Section 7. Second, there are several databases that identify phishing and malware sites [38-41], but not typosquatting explicitly, while their completeness and accuracy are an open question. Finally, there have been some efforts that attempt to estimate the "reputation" of a site based on its ranking on popular search engines [17]. Recently, some efforts have examined typosquatting along with domain parking and other issues [15,16,42], but none of these efforts has developed a deployable countermeasure to typosquatting.

In this paper, we focus on typosquatting: (a) we quantify its extent through a massive measurement study, (b) we create a profile of phony websites, including network-layer metrics, and (c) we develop SUT, a practical countermeasure to detect phony sites. We start with a corpus of 900 popular websites, which we refer to as initial URLs, and generate roughly 3 million URLs by varying the initial names systematically and exhaustively. Each of these new websites is either a phony site, or a legitimate site that happens to be similar, which we refer to as an incidentally similar (IS) website.

To the best of our knowledge, this work is the first in terms of: (a) the sheer volume of the website names examined, and (b) its focus on a network-based analysis of phony sites. Our main results can be summarized in the following points.

1.1. Typosquatting is a prevalent phenomenon among popular sites

Our measurements quantify the aggressiveness and the preferred practices of typosquatters.

Over 99% of sites with similar names to our initial set are phony. We find that for nearly 57% of all initial URLs in our corpus, more than 35% of all possible variations for each initial URL exist in the Internet. Among these existing sites, we find that 99% are phony.

One-character variations are most popular among typosquatters. We find that 99% of phony URLs per initial URL differ from the initial ones by just one character, in length or in spelling. Further, URLs with fewer than 10 characters are more prone to being typosquatted. Finally, URLs which belong to US and German banks suffer most from typosquatting.

Footnote 1: Our criteria seem to work well currently. If typosquatters make an effort to alter the network properties of their sites, our profiles may need to be updated. However, this is a typical case in security: a method is successful if it raises the bar and makes the "bad guys" work harder.

1.2. The network-based profiles of phony sites are significantly different from those of "good" sites

We establish profiles for phony sites using network-layer metrics, such as content, network behavior, and network location.

The size of the page of the initial sites is larger than that of phony sites. We find that for initial sites, the average page size is 20 KB, while for phony sites, the average size is 9 KB.

Reaching a phony website involves a larger number of HTTP redirections. We find that the number of HTTP redirections for connections which end up at phony sites is 6-9 times larger than for connections that arrive at initial websites.

1.3. SUT: a practical countermeasure to detect typosquatting

We develop SUT (short for Stop URL Typosquatting) to detect phony sites based on the insights from our measurement study. SUT consists of two modules: (a) SUT-net, which transforms the empirical network-based observations into a set of criteria, and (b) SUT-pop, which estimates the perceived popularity of a website using a search engine.

Network-based criteria are key to SUT's success. We find that SUT-net outperformed SUT-pop. In a controlled experiment with manually-verified sites, SUT-net identifies phony websites perfectly, meaning 100% accuracy and recall. We find that the network-based criteria seem more robust and harder to evade (see footnote 1), and they do not rely on some external database of websites.

SUT outperforms SiteAdvisor by McAfee. SUT performs significantly better than McAfee's SiteAdvisor, which is particularly weak in identifying phony sites. Specifically, SiteAdvisor has a recall of 5% for phony sites and 76% for initial sites. In contrast, SUT had 100% recall in both groups. Interestingly, both approaches had 100% accuracy in our experiment with manually-verified sites. Furthermore, initial measurements suggest that SUT detects phony sites that are not listed in databases of blacklisted sites.

SUT can be the basis for a practical, effective and user-based approach against typosquatting. We envision SUT having an advisory role as a plug-in to a web browser or a mail application. In addition, it could be part of an offline effort to characterize a large body of sites and create a database. Furthermore, typosquatting is related to the general problem of URL hijacking, which includes: (a) browser hijacking, where the browser of the user is compromised and its bookmarks can be redirected to phony sites, and (b) server hijacking, where the hijacker compromises a DNS server or a search engine. Although our work is focused on typosquatting, SUT detects phony sites, and thus it could be beneficial for the other types of URL hijacking as well, as we discuss in Section 5.

The rest of the paper is organized as follows. We discuss the experimental setup in Section 2. In Section 3, we quantify the extent of typosquatting. Then, we present the network metrics employed by SUT and justify the network-based profiles we use in SUT in Section 4. This is followed by the SUT architecture in Section 5 and its evaluation in Section 6. We discuss the related work in Section 7 and draw concluding remarks in Section 8.


2. Experimental setup

Here, we explain the methodology for our measurements.

2.1. Building the corpus

We collect approximately 900 URLs among the most popular websites [30-35]. These initial URLs were manually categorized into: brokerage firms (37 URLs), credit card firms (23), eCommerce sites (40), eMail providers (13), travel services (40), software vendors (32) and banking institutions, which range from the US (253) to Canada (21), to Europe (220) and Asia (45), etc. The average length of initial URLs (without extension) was 8.9 characters, while the median was 8.

2.2. Obtaining URL name variations

These initial URLs are modified by either replacing, inserting or deleting one or more characters at a time and then probing to ascertain if a URL with the modified name is registered. For example, consider that we intend to find how many similarly named sites exist for Google. We substitute the first character with all possible alphabet letters. We repeat this with each character in the URL to obtain all variations of the initial URL with one character modification. We term this method 1-mod-inplace, since it changes only 1 character in the initial URL, without changing the length of the URL. We also remove one character from the URL, which we call 1-mod-deflate, or increase the length of the URL by one character, which we call 1-mod-inflate. We also experiment with 2 and 3 character modifications for the inplace, inflate and deflate schemes.

Further, we modify the extension of the initial URLs to find which particular extensions (say .com) are targeted aggressively by typosquatters. After generating phony URLs from initial URLs using the previously described schemes, we attach an extension to these URL variations. This allows us to check whether the presence of a particular extension has an effect on the existence of phony sites. We experiment with .com, .gov, .org, .net, .biz, .edu, and .mil.
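To make these variation schemes concrete, the following Python sketch (our own illustration; the function names and the restriction to lowercase letters are assumptions, not part of the paper's tooling) enumerates the 1-mod-inplace, 1-mod-deflate and 1-mod-inflate candidates for a name and attaches the probed extensions. An actual measurement run would additionally probe DNS or HTTP to test which candidates are registered.

```python
import string
from itertools import product

ALPHABET = string.ascii_lowercase
EXTENSIONS = [".com", ".gov", ".org", ".net", ".biz", ".edu", ".mil"]

def one_mod_inplace(name: str) -> set:
    """Replace each character in turn with every other letter (length preserved)."""
    return {name[:i] + c + name[i + 1:]
            for i in range(len(name)) for c in ALPHABET if c != name[i]}

def one_mod_deflate(name: str) -> set:
    """Delete one character (length shrinks by one)."""
    return {name[:i] + name[i + 1:] for i in range(len(name))}

def one_mod_inflate(name: str) -> set:
    """Insert one letter at every position (length grows by one)."""
    return {name[:i] + c + name[i:]
            for i in range(len(name) + 1) for c in ALPHABET}

def candidate_urls(name: str) -> set:
    """All single-character variations of `name`, one per probed extension."""
    variants = one_mod_inplace(name) | one_mod_deflate(name) | one_mod_inflate(name)
    return {v + ext for v, ext in product(variants, EXTENSIONS)}

if __name__ == "__main__":
    candidates = candidate_urls("google")
    print(len(candidates), "candidate URLs, e.g.", sorted(candidates)[:3])
```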

2.3. Distinguishing IS from phony sites

Our goal is to find phony sites, but we have to make sure that we do not consider incidentally similar sites as phony. Given our URL variations, we want to distinguish between phony and incidental sites. In fact, we need an automated way to do this for our 3 M site variations. We develop the following approach that uses the content of the page, and we use manual inspection to finetune and validate our method.

Building a keyword set: we use keyword matching [18,19] to compute similarity. To do this we build a keyword set (K-set) by manually stripping off words from the HTML headers of a subset of the initial sites. At least 20% of the keywords used are present in each initial site. We use keywords which can be easily related to the particular categories defined above. This allows us to observe semantic differences in the content. We are aware of the possibility that our results may vary depending on the choice of keywords. However, we believe our approach is sufficient for our needs, as it has been successfully used in [19]. Other mechanisms to classify sites [20,21] use more features, such as image content, but some of those approaches were not applicable in our case; for example, most suspect sites did not have significant image content. We use two methods and multiple metrics to compute similarity:

Method 1 (M1): we compare the frequency of keywords found in an initial page with the frequency in a suspected phony page. We compute the Root Mean Square Error (RMSE) of the number of times a keyword appears in a suspected phony site with respect to the number of appearances in the initial site. This is a type of L2 metric, formally represented by $\sum_i (x_i - o_i)^2$, where $x_i$ represents the frequency of occurrence of keyword $i$ in a suspected phony site and $o_i$ represents its frequency of occurrence in the initial site. Note that only keywords found in the initial sites are compared with ones found in suspect sites.

Method 2 (M2): we use a bit-vector-based Hamming distance (HD) metric. This is a popular L1 metric defined by $\sum_i |x_i - o_i|$, where $x_i$ and $o_i$ represent the $i$th bits in the two vectors [24,25]. A 1 in the bit vector represents the presence of the specific keyword in the site. Via this method, we develop heuristics based on a bipartite graph mapping. Here we represent every keyword $K_n$ as a node in the bipartite graph and every site (phony or initial) as a node $S_n$. We then observe which keywords appear on which sites. This simple formulation allows us to calculate the keyword-span (KS) and the weighted keyword-span (WKS) of each site. The weighted keyword-span is simply the sum of the number of occurrences of the keywords which appear on a site. We analyze the similarity of suspected phony sites with the initial sites by using this graph representation to generate a bit vector for each suspect site and comparing its Hamming distance to the bit vector of the initial site. Here, all keywords from the K-set are considered.

We need to choose thresholds for these metrics for separating phony sites from IS and initial sites. We sample 75 sites from our initial sites. Then, we consider the URL name variations for these and identify phony and IS sites via manual inspection. From there, we define the thresholds which can distinguish the phony sites from IS and initial sites: (i) 12 < KS < 19, (ii) 60 < WKS < 100, (iii) RMSE <= 190 and (iv) 0 < HDavg < 30. As we will see later, these prove to be effective.
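A minimal sketch of how these content metrics could be computed for a single suspect page is given below. The whitespace tokenization, the helper names, and the decision to require all four ranges jointly are simplifying assumptions on our part; the paper describes the metrics but not a concrete implementation.

```python
from collections import Counter

def keyword_counts(page_text: str, k_set: set) -> dict:
    """Frequency of each K-set keyword in a page (naive whitespace tokenization)."""
    words = Counter(page_text.lower().split())
    return {k: words.get(k, 0) for k in k_set}

def l2_score(initial: dict, suspect: dict) -> int:
    """M1: sum of squared frequency differences over keywords present in the initial site."""
    return sum((suspect[k] - v) ** 2 for k, v in initial.items() if v > 0)

def bit_vector(counts: dict) -> list:
    """Presence bit vector over the K-set (fixed order via sorting)."""
    return [1 if counts[k] > 0 else 0 for k in sorted(counts)]

def hamming(u: list, v: list) -> int:
    """M2: L1 distance between two presence bit vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def spans(counts: dict) -> tuple:
    """KS = number of distinct keywords present, WKS = total keyword occurrences."""
    return sum(1 for c in counts.values() if c > 0), sum(counts.values())

def looks_phony(initial: dict, suspect: dict) -> bool:
    """Apply the Section 2.3 ranges to a single suspect page (joint test is our simplification)."""
    ks, wks = spans(suspect)
    hd = hamming(bit_vector(initial), bit_vector(suspect))
    return (12 < ks < 19 and 60 < wks < 100
            and l2_score(initial, suspect) <= 190 and 0 < hd < 30)
```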

Due to space constraints, we present a subset of results in Table 1, which depicts the values of these metrics for phony sites. We observe two kinds of phony sites: (a) phony sites which try to imitate initial sites closely and (b) phony sites which advertise everything from bathroom fixtures to financial instruments. A low RMSE value in column 2 implies that typosquatters try to make their phony sites resemble initial sites closely; these phony sites belong to the first kind. This is observed for Deutsche Bank, Blockbuster and Batchmates. For the second kind, we observe higher RMSE values. This is seen for AMD, Bank of America, BestBuy, Costco and Adobe. For these shotgun-style sites, typosquatters attempt to insert a large variety of keywords advertising products and/or services, resulting in large KS values. The ranges we have chosen allow us to identify both kinds of phony sites.

Table 1
Content similarity between initial and manually-verified phony sites using keywords (M1 covers the RMSE columns; M2 covers the KS, WKS and HD columns).

Initial site       RMSE avg  RMSE SD  Orig KS  KS mode  KS avg  KS SD   WKS avg  WKS SD   HD avg  HD SD
Blockbuster        13.28     12.11    10       17       15.44   9.67    57.33    31.37    17.22   7.95
DeutscheBank       16.69     14.92    25       25       18.39   7.33    101.56   57.6     14.39   11.84
Batchmates         26.27     34.73    19       17       15.64   6.59    60.5     25.66    17.21   9.54
Abercrombie        44.24     72.81    13       17       19.45   13.76   124.2    150.44   25.55   10.95
Adidas             60.26     80.87    11       17       14.1    8.1     62.02    55.87    17.07   4.92
GoldmanSachs       76.07     82.06    16       17       14.89   8.39    75.53    68.4     18.49   7.33
AmericanExpress    77.82     62.09    17       17       14.71   8.19    66.01    55.65    17.43   3.99
Amazon             81.73     96.67    36       17       16.08   10.62   80.93    77.89    28.37   7.82
Delta              89.74     107.16   17       17       12.63   7.81    55.34    52.32    18.86   6.06
Apple              93.34     87.86    22       17       14.19   8.6     65.28    60.39    19.91   2.87
Costco             94.89     164.5    36       17       14.6    10.14   72.1     85.02    33.78   3.8
BestBuy            98.05     96.46    29       17       14.28   9.31    65.71    69.6     26.39   2.35
Adobe              101.73    110.02   17       17       14.1    8.43    64.11    57.99    19.34   4.84
Dell               101.78    106.41   17       17       13.89   8.41    62.52    60.73    19.04   6.41
BankOfAmerica      155.87    287.9    32       17       8.14    8.37    67.65    65.45    23.54   5.04
AMD                162.23    108.24   14       17       12.58   8.46    57.13    63.42    20.3    5.79

3. Quantifying typosquatting

We analyze the extent of typosquatting along several dimensions: URL length, top-level domain (extension) and site category. This is a follow-up of our earlier 5-page work [43], which we report for completeness.

3.1. The effect of URL modification

Observation 1: Short initial URLs suffer more from typosquatting. As depicted in Fig. 1, for initial sites which have URL length less than 10 characters, more than 10% of all possible phony URLs are registered in the Internet. This is an indication of typosquatters targeting sites with shorter names. This is somewhat expected, as popular sites often have short names.

Fig. 1. Percentage of phony sites versus the length of the initial URL.

Observation 2: Significant numbers of phony sites exist in the Internet. We observe from Fig. 2 that modifying each of the initial URLs using 1-char-mod schemes leads to the discovery of significant numbers of phony sites. Employing the 1-mod-inplace scheme, for nearly 30% of corpus URLs, we find that between 30-90% of all possible phony sites exist in the Internet. Using the 1-mod-inflate scheme, for about 25% of initial URLs, we observe that between 20-90% of all possible phony sites exist. Similarly, using the 1-mod-deflate scheme, for 57% of corpus URLs, we observe that between 35-90% of all possible phony sites exist. We see that most of these phony sites are obtained via the 1-mod-deflate scheme. We also experiment with schemes exploring multi-character spelling changes, URL inflation and deflation. We find that for 2 or 3 character schemes the percentage of phony sites existing per initial URL is below 0.5%.

Observation 3: Typosquatters prefer 1-character modifications of popular URLs. We present Fig. 3, where we see that only 1-character modifications correspond to significant numbers of phony sites. Multi-character modifications uncover minuscule numbers of phony sites. Approximately 99.5% of phony sites identified are obtained via 1-mod schemes. Understandably, typosquatters do not expect Internet users to significantly mistype URLs and concentrate on registering domains with small differences.

As an example, let us analyze some popular URLs. Consider Google.com, depicted in Fig. 4(a). We identify three types of sites: (1) phony sites, (2) variations that point to the initial site, e.g. gogle.com (re-directed URLs), and (3) IS sites (goole.com). Another example, presented in Fig. 4(b), depicts Citicards.com. Here, we see that 66.6% of all possible URL variations belong to phony sites. Nearly all these phony sites are 1-character modifications of citicards.com.

Fig. 2. The percentage of all phony sites which exist per initial URL, per type of single-character variation.

Fig. 3. Percentage of phony sites existing (out of all possible variations) for about 400 initial URLs from the corpus.

Fig. 4. (a) Phony, IS and re-directed URLs found for Google.com. (b) Phony, IS and re-directed URLs found for Citicards.com.

3.2. URL top-level domain analysis

Here, we analyze how the extension or top-level domain of an initial URL, such as .com, influences its probability of being typosquatted.

The number of phony URLs obtained by the deflate scheme is highest. As described in Section 2, we attach different extensions to the URL name variations. We present Table 2, which describes the average percentage of phony sites which exist for each scheme and extension. For the 1-mod-deflate scheme, the average percentages of existing phony sites are (i) .com: 52%, (ii) .org: 30.5%, and (iii) .biz: 14.5%. This is much higher than for any of the other schemes. Also, for the .com, .org and .biz extensions, the averages for the inplace and deflate schemes are clearly higher than for the inflate scheme. Fig. 6(a) depicts the case when a .com extension is attached to a URL name variation and Fig. 6(b) depicts the case for .org.

Table 2
Percentage of phony sites with various extensions.

Scheme and site extension    Avg. percentage of phony sites existing
1-mod-inplace (.com)         29
1-mod-inflate (.com)         19
1-mod-deflate (.com)         52
1-mod-inplace (.org)         17.5
1-mod-inflate (.org)         6.7
1-mod-deflate (.org)         30.5
1-mod-inplace (.biz)         11
1-mod-inflate (.biz)         2.7
1-mod-deflate (.biz)         14.5

Fig. 5. Percentages of existing phony sites with URLs obtained from the 1-mod-inplace scheme applied on initial .com URLs.

Fig. 6. Percentage of phony sites existing with various extensions. (a) .com extension. (b) .org extension.

Observation 4: Initial sites with a .com extension are most typosquatted, more than initial sites with .org/.net/.biz extensions. We find that the .com domain is the most prone to typosquatting and thereby present a detailed analysis of the .com scenario. We present Fig. 5, which depicts the percentage of existing phony URLs obtained by inplace modifications of initial .com URLs. We find that for nearly a quarter of all initial .com URLs, at least 50% of all possible phony sites exist. This indicates that a URL with a .com extension has a high chance of being typosquatted. Further, we analyze how sites with .com extensions are typosquatted across different domains, namely .org, .gov, .biz, .net, .edu and .mil. We present Fig. 7(a) and (b), which display how .com sites are typosquatted across the complete range of URL extensions for 1-mod-inplace and 1-mod-inflate. Fig. 7(c) and (d) display the case for corpus URLs without a .com extension. We observe that the percentage of typosquatting of initial .com URLs is maximum in the .net domain. In fact, the level of .com typosquatting in the .net domain is at least 17% higher than in the .org domain. To summarize:

1. Initial URLs with a .com extension are typosquatted with decreasing probability in the following domains: (i) .net, (ii) .org and (iii) .biz.

2. Initial URLs without a .com extension are typosquatted with decreasing probability in the following domains: (i) .com, (ii) .net, (iii) .org and (iv) .biz.

3.3. Effect of URL category

We now study the effect of the initial URL category on the percentage of phony sites discovered per initial URL. Recall that we have defined several categories at the beginning of Section 2; some are displayed in Table 3.

Observation 5: URLs which belong to German banks suffer most from typosquatting, followed by URLs which belong to US banks. Other significantly typosquatted categories are UK banks, software and technology companies and travel-related sites. We list some of the most typosquatted sites in each category in Table 3. Apart from the mentioned categories, we find that for Chinese banking institutions the maximum typosquatting percentage is close to 50%, and for the email providers Gmail, Yahoo and Hotmail, at least 60% of all possible phony URLs exist. Further, they are typosquatted primarily in the .com domain. German banks are most typosquatted in the .net domain. Japanese banks suffer from very low levels of typosquatting. The average percentages of phony sites for highly typosquatted categories such as German and US banks and travel-related sites are displayed in Fig. 8(a)-(c). These results correlate well with findings in previous research efforts [15]. Due to space constraints we list results for a subset of the categories. We find that for all the cases, again, the 1-mod-deflate scheme contributes most of the phony URLs in comparison to the other schemes.

Table 3
Highly typosquatted websites in various categories.

Category             Highly typosquatted entities
Brokerage firms      MerrillLynch, Credit Suisse, LehmanBros
Canadian banks       Banque Nationale Du Canada
Credit card firms    Chase and jcbusa
eCommerce firms      Adidas, Gamestop, Walmart, Samsung
eMail providers      Gmail, Yahoo and Hotmail
Indian banks         UTI Bank
Social-net sites     Orkut and batchmates
Travel sites         US Air and Southwest
Software and tech.   Adobe, Sun, and Lexmark

Fig. 7. Effect of the extension of an initial URL on the probability of it being typosquatted by phony sites in various top-level domains (.com, .org, etc.). (a) and (b) represent the percentage of phony sites (with various extensions) whose URLs are derived from initial .com sites; (c) and (d) represent the percentage of phony sites whose URLs are derived from initially non-.com sites. Panels (a) and (c) use 1-mod-inplace; (b) and (d) use 1-mod-inflate.

Fig. 8. Average percentage of phony URLs existing per legitimate URL, in each category. (a)-(c) represent the 1-mod-inplace, 1-mod-inflate and 1-mod-deflate schemes for German banks, travel-related sites and US banks.

3.4. Defense employed by initial sites

To counter typosquatting, companies buy out similar-sounding URLs and make their web page available under these URLs. This is called URL redirection and uses the "redirect" feature of the HTTP protocol. When a browser requests a URL from a web server, the server can return an HTTP redirection status code "3XX" [27]. Upon receiving a response with this status code, the browser finds the "Location" header and issues a request for the URL specified in that field. We observe that less than 5% of initial sites own variations of their initial URLs. When probing for the existence of phony sites using URL variations obtained from the 1-mod-deflate scheme, we see that HTTP connections terminate at about 250 unique sites. We find that for the same scheme, accessing URL variations of 1800flowers.com leads to their home site. In fact, 22.7% of all possible URL variations (obtained by 1-mod-deflate) are controlled by 1800flowers.com. For the inflate and inplace schemes we observe that the number of URLs controlled by initial sites is lower: 1800flowers.com controls about 8.6% and 12.3% of the URL variations obtained via the inflate and inplace schemes, respectively. In this section, apart from quantifying the extent of typosquatting, we have also identified preferences and patterns of typosquatters, which we use in SUT to detect phony sites.
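Counting the hops of such a 3XX chain is straightforward with only the Python standard library, as the sketch below illustrates; it disables automatic redirect handling so each hop can be counted, and is one plausible way to gather the redirection counts analyzed in Section 4.3. The timeout and hop limit are arbitrary choices of ours.

```python
import urllib.request
import urllib.error
from urllib.parse import urljoin

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None   # do not follow 3XX automatically; surface it as an HTTPError

def count_redirections(url: str, limit: int = 25) -> tuple:
    """Follow an HTTP redirection chain manually; return (number_of_hops, final_url)."""
    opener = urllib.request.build_opener(_NoRedirect)
    hops = 0
    while hops < limit:
        try:
            opener.open(url, timeout=10)       # a 2XX answer ends the chain
            return hops, url
        except urllib.error.HTTPError as err:
            if 300 <= err.code < 400 and "Location" in err.headers:
                url = urljoin(url, err.headers["Location"])   # next hop
                hops += 1
                continue
            return hops, url                    # non-redirect HTTP error ends the chain
        except urllib.error.URLError:
            return hops, url                    # DNS/connection failure ends the chain
    return hops, url
```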

4. Profiling via network metrics

We now present a profile of phony sites based on network metrics. In this section we analyze HTML page sizes, HTTP redirections, AS ranks (obtained via an IP-to-AS lookup [22]) and the effect of pop-up windows.

4.1. HTML page size analysis

We begin by comparing the sizes of the HTML pages for phony sites versus initial sites. In Fig. 9(a) and (b), we present the CDF plots of the sizes of HTML pages for legitimate and phony sites. These graphs clearly show the comparatively even spread of web page sizes for legitimate sites compared to the highly clustered sizes for phony sites.

Fig. 9. CDF of HTML page sizes for (a) legitimate and (b) phony sites.

Observation 6: For 90% of phony sites, HTML page sizes are less than 31 KB. We find that 30% of initial sites have a page size larger than 31 KB. The most popular file sizes for phony sites are 305 bytes, 2.8 KB, 13 KB, and 74 KB. We find that for initial sites, the average page size is 20 KB, while for phony sites, the average size is 9 KB.

We observe an interesting characteristic of 5032 phony sites which have a 305-byte page size. All these 305-byte phony sites follow exactly the same HTML page structure, employing one frame within the HTML body, with exactly the same tag indentations. This similarity hints at the fact that these were mass-produced using HTML source generators. All these pages have exactly one hyperlink embedded inside them, which points to the source from which the page dynamically loads the contents of the HTML frame. This frame contains the actual hyperlinks, which advertise everything from mortgage services to flowers. Embedding hyperlinks which load the content of the HTML page dynamically allows the controller of all these pages to update HTML content effortlessly. Based on this information, one of the heuristics used by SUT to detect phony pages incorporates HTML page size. In the future, if typosquatters pad the size, we could still attempt to use the page structure to detect a phony page.
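As an illustration, the page-size and single-frame structure checks described above could be approximated as follows. The 400-byte cutoff and the helper names are our own assumptions, and a robust implementation would need to tolerate malformed HTML.

```python
from html.parser import HTMLParser

class _FrameLinkCounter(HTMLParser):
    """Count frames/iframes and outgoing links in a page."""
    def __init__(self):
        super().__init__()
        self.frames = 0
        self.links = 0
    def handle_starttag(self, tag, attrs):
        attr_names = {name for name, _ in attrs}
        if tag in ("frame", "iframe"):
            self.frames += 1
            if "src" in attr_names:
                self.links += 1
        elif tag == "a" and "href" in attr_names:
            self.links += 1

def page_profile(html_text: str) -> tuple:
    """Return (size_in_bytes, frame_count, link_count) for a fetched HTML page."""
    parser = _FrameLinkCounter()
    parser.feed(html_text)
    return len(html_text.encode("utf-8")), parser.frames, parser.links

def matches_splash_template(html_text: str, max_size: int = 400) -> bool:
    """True for tiny single-frame, single-link pages like the 305-byte splash pages."""
    size, frames, links = page_profile(html_text)
    return size <= max_size and frames == 1 and links == 1
```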

4.2. AS-centric analysis

Observation 7: ASes which have ranks in the range (a) 1-900, (b) 1600-1800, (c) 2600-2800 and (d) 5100 and beyond host most of the phony sites. This is true for all domains. We would like to see, within the AS hierarchy, which ASes host phony sites. Thus, we use information about the degree-based ranks of the ASes which host phony sites. Using the latest CAIDA [36] AS-Rank dataset and the mechanisms employed in [22], we are able to observe the relative ranks of ASes by mapping Autonomous System Numbers to ranks. We present Fig. 10, which depicts the AS ranks of the various unique IPs recorded for the .com domain.

Fig. 10. Histogram of AS ranks of the ASes which host typosquatters for the .com domain.

Table 4
The top 4 ASes hosting phony sites per domain (extn). Each tuple represents (AS number: % of phony sites hosted in that domain).

Extn    Top 1         Top 2         Top 3          Top 4
.com    19318: 8.1    26496: 5.3    33626: 4.4     13768: 4.4
.org    19318: 6.1    26496: 4.8    8560: 4.3      33626: 4.2
.net    11486: 9.9    19318: 4.6    8560: 3.9      26496: 3.8
.biz    8560: 13      26496: 8.7    33070: 8.3     33626: 5.0
.gov    2152: 5.3     7018: 4.4     2714: 3.9      3136: 3.3
.edu    16966: 2.9    8560: 2.3     6325: 2.1      4323: 2.0
.mil    721: 18.5     2152: 14.8    27046: 11.1    724: 7.4

Observation 8: A few ASes host many phony sites. We present Table 4, which lists the top 4 ASes for each domain. Each entry in the table is a tuple separated by a colon: the first part is the Autonomous System Number, while the second is the percentage of phony sites hosted in that AS with respect to the total number of phony sites with that extension. Interestingly, just the top 4 ASes for the .com domain host 22.2% of all phony sites, and the case is similar for the other domains. Intrigued by this observation, we manually examined all the ASes listed in Table 4 that exhibited high typosquatting behavior. We found that four of them are also marked for hosting parked domains in the top-10 parked-domain list shown in [42], and parked domains are known to exhibit typosquatting behavior [42].
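The per-AS aggregation behind Table 4 can be reproduced, in outline, by counting phony sites per origin AS, as sketched below with made-up data; in practice each phony domain would first be resolved to an IP and the IP mapped to its AS using a prefix-to-AS dataset such as CAIDA's.

```python
from collections import Counter

def top_hosting_ases(site_to_asn: dict, k: int = 4) -> list:
    """Rank ASes by the share of phony sites they host, as (asn, percent) pairs."""
    counts = Counter(site_to_asn.values())
    total = sum(counts.values())
    return [(asn, round(100.0 * n / total, 1)) for asn, n in counts.most_common(k)]

if __name__ == "__main__":
    # Illustrative mapping only; the domain names and ASNs below are placeholders.
    demo = {"examp1e.com": 19318, "exampl.com": 19318,
            "exmple.com": 26496, "eexample.com": 13768}
    print(top_hosting_ases(demo, k=3))
```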

4.3. HTTP connection analysis

Observation 9: The number of URL redirections for phony URLs is usually more than 6. We find that for nearly 50% of the phony sites discovered using the 1-mod-inplace scheme, the number of URL redirections is greater than 6. For the 1-mod-inflate scheme, about 30% of all phony sites had more than 6 URL redirections. The CDF of the number of redirections for each scheme, when applied to initial websites, is depicted in Fig. 11. We use this heuristic within the SUT framework.

Fig. 11. CDF of the number of HTTP redirections for phony URLs obtained by applying the various schemes to each initial URL.

Why is the number of redirections high? URL redirection is often employed to direct users to updated site content [23]. This could imply that the majority of phony URLs which have a high number of URL redirections do not host native content and simply act as forwarding agents, since URL forwarding is cheaper than hosting a web page for each URL. Only a barebones subscription needs to be bought from an ISP/domain registrar for URL forwarding; no hosting space is needed.

We also examined the number of HTTP redirections for legitimate sites from the top 500 domains listed in Alexa [49] and found that the median number of HTTP redirections is 0 and the 95th-percentile domain has less than one redirection.

Interestingly, we find that for URLs obtained via the 1-mod-deflate scheme the average number of redirections is 1. Most such domains are "throwaway" domains: they may be registered and then let go within 5-50 days of first use. Malicious hackers use these multiple throwaway domains to point users to phishing sites or fake anti-virus sites, which stay up for longer periods of time. This is similar to the case where hackers inject multiple bad websites via malware injected into benign websites: all the bad websites are used as throwaway redirections, but they ultimately point to a few long-standing phishing/fake anti-virus sites.

4.4. Phony sites and pop-up windows

Observation 10: Nearly every phony site attempted to open a pop-up pointing to a small number of sites. Pop-ups can be thought of as a delivery mechanism for Internet advertisements. A pop-up is a new browser window that opens up when visiting a site. Pop-up windows containing advertisements are usually generated by JavaScript programs, but they can be generated by other means as well. As an example, we present an analysis of the most prolific typosquatter.

We find that all the phony sites with a 305-byte HTML splash page attempted to open pop-up windows linked to http://b.casalemedia.com. Further, most of them obtained the contents for the main page frame from www.searchnut.com, while the domain that is advertised on the main HTML page is http://mortgages-rates.com. This entity is the most aggressive typosquatter we could identify. We discover that this entity owns 1380 (via inplace), 1549 (via deflate) and 295 (via inflate) sites within the .com domain. Similarly, it is responsible for 479 (inplace), 401 (deflate) and 82 (inflate) sites within .org and 414 (inplace), 350 (deflate) and 82 (inflate) sites in the .net domain. Clearly, this typosquatter targets sites with .com, .org and .net extensions. Most URLs registered by this entity are discovered by applying the inplace and deflate schemes to initial URLs. This suggests that typosquatters expect users to miss out or misspell characters while entering URL addresses.
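A crude way to test a fetched page for this pop-up behavior is to scan its source for literal window.open calls that target the ad-serving hosts mentioned above; the host list and regular expression below are illustrative only, and pop-ups generated by other means would be missed.

```python
import re

# Hosts observed behind the pop-ups on the mass-produced pages discussed above;
# the list and the pattern are illustrative, not exhaustive.
POPUP_AD_HOSTS = ("casalemedia.com", "sedoparking.com", "searchnut.com")
_WINDOW_OPEN = re.compile(r"window\.open\s*\(\s*['\"]([^'\"]+)['\"]", re.IGNORECASE)

def popup_targets(html_text: str) -> list:
    """URLs passed to literal window.open(...) calls in the page source."""
    return _WINDOW_OPEN.findall(html_text)

def opens_ad_popup(html_text: str) -> bool:
    """True if any pop-up target points at one of the known ad-serving hosts."""
    return any(host in url for url in popup_targets(html_text) for host in POPUP_AD_HOSTS)
```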

Delving deeper, we categorize these results based on the kind of URLs that this entity attempts to "attack" by typosquatting. We present Fig. 12, which depicts the numbers of phony URLs controlled by this single entity across the .com domain. We find that the numbers of phony URLs peak for the following categories: credit-card companies and eCommerce retailers (categories 4 and 5), German banks (category 8), social networking sites and software companies (categories 14 and 15) and US banks (category 20). We find that searchnut.com hosts a large number of phony sites. These sites feature a large number of keywords; in fact, the KS value for these phony sites turns out to be 17, which we also observe as a popular value in the KS-mode column of Table 1. The second most prolific typosquatter is sedoparking.com/sedo.com, which declares clearly on its pages that its business model is based on registering sites with phony URLs in order to display ads to unsuspecting users. This information can be used to develop simple filters, integrated with browsers, to warn users that they might be visiting phony sites.

Fig. 12. Number of URLs operated by Searchnut.com within the .com domain, sub-classified according to URL categories.

5. The architecture of SUT

In this section, we describe our tool SUT, which detects phony sites using a two-layer structure: modules, which independently evaluate a site, and a deciding logic that combines the decisions of the different modules into a final decision.

In its current implementation, SUT consists of two detection modules: SUT-net and SUT-pop, depicted in Fig. 13. SUT-net uses network-level metrics, and this is a key novelty of SUT compared to most previous non-proprietary approaches. The network-level profile of a site is compared to the profile of phony sites, which is based on our measurement study. The second module, SUT-pop, uses the popularity of the site as an indication of its validity. Popularity can be defined and calculated in many different ways, and this calculation is done by many search engines. Thus, instead of replicating such an effort, SUT-pop leverages the services provided by search engines and, in this version, uses Google.

Fig. 13. SUT is composed of SUT-net and SUT-pop. It is supplemented by a dynamic code update system. Modules in dotted lines are under development.

The decision logic module considers the results from the two modules, SUT-net and SUT-pop. The decision process can be a simple logic operation (and, or, etc.), or based on a weighted sum of the decisions of the modules. Having multiple modules, we have the capability to finetune the relative weights, as we do in Section 6, in order to improve the overall performance.

Note that the decisions for each module and the final decision can range from a binary good/bad to a fine-grained score of suspicion, such as a 20-point scale. Clearly, the latter can provide more flexibility and provide "more information" to the user. At the same time, a lay person often does not want all the information, and may prefer a simple suggestion. Although both approaches are possible with SUT, we use a binary classification, since this makes the comparison easier to present. Note that we use the value 0 to indicate a good site, while the value 1 suggests a suspicious site, i.e. we use the score as a flag.
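A minimal sketch of such a decision module is shown below; the equal default weights and the 0.5 threshold are illustrative choices of ours, under which the combination reduces to a simple OR of the module flags.

```python
def combine(scores: dict, weights: dict = None, threshold: float = 0.5) -> int:
    """Combine per-module flags (0 = good, 1 = suspicious) into a final decision.
    With equal weights and a 0.5 threshold this reduces to an OR of the modules."""
    if weights is None:
        weights = {name: 1.0 for name in scores}
    weighted = sum(weights[name] * scores[name] for name in scores)
    return 1 if weighted / sum(weights.values()) >= threshold else 0

# combine({"sut_net": 1, "sut_pop": 0})  ->  1  (flag the site)
# combine({"sut_net": 0, "sut_pop": 0})  ->  0  (treat as good)
```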

SUT as an adaptive, tunable framework. Although we have developed SUT as a readily-deployable tool, we could also envision SUT as an evolving approach enriched by learning capabilities and a customizable user interface, as shown in Fig. 13. Note that some of the features described below are not yet supported by our tool, but they illustrate its full potential.

We envision SUT as a plug-in to web browsers. The plug-in can monitor the requests to connect to a web server and examine its network metrics, while at the same time it can (a) query web search engines, (b) consult any reliable database, and (c) examine any local white or black lists. The result can be either a warning to the user or, if the user permits it, blocking of the HTTP connection.

Handling browser hijacking. Note that SUT could also be used to monitor web access even when using a bookmark. This would essentially provide the same protection against browser hijacking, where the browser has been compromised [26]. Of course, in such a case, the hijacker could circumvent the activation of the SUT plug-in. In this case, using SUT as a standalone program outside the browser may be preferable.

5.1. SUT-net: using network-level metrics

The SUT-net module uses a set of network-level criteria to decide if a site is phony. We describe each criterion and provide the observation that supports it. We list these criteria in Table 5.

(a) URL length and edit distance: as seen in Observations 1 and 3 (Section 3.1), URLs which (i) are less than 10 characters long and (ii) differ from the initial corpus of URLs by one character are most likely phony.

(b) Number of HTTP redirections: most original websites did not have any HTTP redirections. In contrast, sites with 6 or more HTTP redirections are most likely phony, as seen in Observation 9.

(c) URL extension: if typing a site with a .com extension leads to a .biz/.net/.org site, the final site is possibly phony.

(d) Page size: in Observation 6, we saw the difference in the size of phony and initial pages, which on average is 20 KB for the initial sites and 9 KB for phony sites. In addition, 90% of phony sites have page sizes less than 31 KB, while 30% of initial sites have sizes more than 31 KB. Although increasing the page size is fairly easy for a typosquatter, we use this information until that happens.

(e) Pop-up: phony sites often try to open pop-ups, mostly pointing to advertising and parked webpages, as we saw in Section 4. In our experience, the pop-ups were often from casalemedia.com and sedoparking.com, which are well-known online advertising networks that potentially attempt to capitalize on typos and the use of large numbers of parked domain names.

(f) Keyword analysis: in Section 2, we derived rules to classify phony sites and we use them here. We use the following: (i) 12 < KS < 19, (ii) 60 < WKS < 100, (iii) 0 < HDavg < 30 and (iv) RMSE <= 190.

SUT-net classifies a site as phony if at least three criteria are found to be true. We experimented with other configurations and found this to be the best.

Table 5
Criteria used by SUT-net to detect phony sites.

Feature-set          Primary criterion
URL name             Suspect-site URL length <= 10 and suspect URL differs from initial URL by 1 character
HTTP redirections    Number of HTTP redirections when accessing the suspect site is >= 6 and suspect_URL_length >= initial_URL_length
URL extension        Accessing a suspect site with a .com extension finally opens a site with a .biz/.net/.org extension
AS-Rank              HTML content of suspect site hosted by ASes with ranks 2600-2800 or 5100 and beyond
HTML page size       HTML page size of suspect site is <= 31 KB / 305 B
Behavior             Accessing the suspect site leads to opening a pop-up linking to casalemedia/sedoparking.com
Keyword analysis     For suspect-site HTML content, 12 < KS < 19, 60 < WKS < 100 and 0 < HDavg < 30
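Putting the criteria of Table 5 together, the classification rule could be sketched as follows. The feature names in the site dictionary are our own invention, and the code assumes the measurements (redirection count, AS rank, page size, keyword metrics, pop-up behavior) have already been collected.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used for the one-character-difference test."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sut_net_flag(site: dict, initial_name: str) -> int:
    """Evaluate the Table 5 criteria for a measured suspect site; flag if >= 3 hold.
    `site` is a dict of already-collected features (key names are illustrative only)."""
    name = site["name"]                     # suspect URL without its extension
    criteria = [
        len(name) <= 10 and edit_distance(name, initial_name) == 1,            # URL name
        site["redirections"] >= 6 and len(name) >= len(initial_name),          # HTTP redirections
        site["extension"] == ".com" and site["final_extension"] in (".biz", ".net", ".org"),
        site["as_rank"] is not None
            and (2600 <= site["as_rank"] <= 2800 or site["as_rank"] >= 5100),  # AS rank
        site["page_size"] <= 31 * 1024,                                        # HTML page size
        site["opens_ad_popup"],                                                # behavior
        12 < site["ks"] < 19 and 60 < site["wks"] < 100 and 0 < site["hd"] < 30,
    ]
    return 1 if sum(criteria) >= 3 else 0
```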

5.2. SUT-pop: using the popularity

The SUT-pop module uses the community knowledge that is harnessed by search engines when they display their results for a query. Web searching has reached such a maturity level that it would have been foolish to (a) not leverage this work and (b) try to develop our own method. This approach has some similarities with the iTrustPage work [17], but we modified its behavior for our needs.

In the current implementation, SUT-pop analyzes the search results obtained from google.com when the URL of the site in question is given as a query: "sitename.dom", where ".dom" is the suffix we type, say ".com". SUT-pop issues the query, receives the results and analyzes the reply. We examine the results for two things.

(i) Did You Mean Rule (DYM): Does "Did you mean:" appear in the results? We use Google's mechanism for suspecting a typo or an unlikely query. This mechanism suggests an alternative query and returns: "Did you mean: <Alternative-url>". Note that this clause does not seem to be used by iTrustPage.

(ii) Top 10 Rule (T10): Do the first k results include a link to the site? If the site in question is not phony, we expect that the results should contain a direct HTTP link to it. For example, we expect to see "www.sitename.dom" or a link to a directory within that site, such as "www.sitename.dom/index.html". So, if the first k results do not have a link to the website, we become suspicious. Currently, we have k = 10, which corresponds to the default limit of results in a response from Google. Clearly, this is a parameter that can be finetuned; for example, iTrustPage uses the top five hits. Our choice of 10 seemed to work well, as we will see in our evaluation.

The above two clauses can be combined for a final decision. Currently, we classify a site as phony (score = 1) if clause (i), DYM, is true, or clause (ii), T10, is false. In other words, if Google asks "Did you mean" or the site is not in the top 10 results, it is considered phony. Note that we have experimented with various combinations of DYM and T10 and have found the above combination to be most effective.
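A sketch of the SUT-pop rule is given below. The search callable is a placeholder supplied by the caller, returning whether a "Did you mean" suggestion appeared and the list of result URLs; we deliberately do not model the actual search-engine query here.

```python
def sut_pop_flag(url: str, search) -> int:
    """Apply the DYM and T10 rules to `url`. `search` is a caller-supplied function
    returning (did_you_mean: bool, result_urls: list) for a query string; issuing
    the real search-engine query is outside the scope of this sketch."""
    did_you_mean, results = search(url)
    site_in_top10 = any(url in hit for hit in results[:10])   # direct link or sub-page
    # Phony (1) if the engine suggests a correction OR the site is absent from the
    # first ten results; otherwise good (0).
    return 1 if (did_you_mean or not site_in_top10) else 0
```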

6. Evaluating SUT

We evaluate SUT and compare it with commercial tools. However, first, we study and finetune the performance of the internal modules of SUT.

In our tests, we use three "control" sets: (a) IN900: the 900 initial sites in our corpus, (b) PH100: 100 phony sites, and (c) IS100: 100 IS sites. For PH100 and IS100, we first selected sites randomly among our initial corpus, and then we examined them manually to ensure correct classification. In addition, we use 80 K, which consists of 80,000 randomly selected suspect URLs. However, for this set, we do not know the ground truth.

6.1. Evaluating SUT-net

Observation 11: SUT-net can identify phony sites with high accuracy and recall. In fact, SUT exhibits perfect performance in our experiments. We observe in Table 6 that SUT-net is 100% accurate at identifying initial, phony and IS sites for the IN900, IS100 and PH100 test sets. Recall that SUT-net employs the network metrics discussed in the previous sections.

It would be naive to use this evaluation to conclude that SUT is perfect. The performance of SUT in our experiments is impressive, but it is possible that with significantly more diverse websites the performance will vary. However, we can conclude that the performance of SUT is promising. In addition, the evaluation can be used to compare the relative performance of the different approaches, as we do below.

6.2. Evaluating SUT-pop

Observation 12: SUT-pop is not as accurate as SUT-net. We find that SUT-pop does considerably more poorly in comparison to SUT-net, especially for the incidentally similar (IS) sites. With the IS100 test set, SUT-pop misclassifies 49% of IS sites as unsafe. In contrast, it does reasonably well with phony sites in PH100: SUT-pop did not identify 28% of the sites.

Table 6
Correlation between SUT-net (Snet) and SUT-pop (Spop) classification of sites (safe: 0, unsafe: 1; all values in %).

             IS100            IN900            PH100            80 K
             Snet0   Snet1    Snet0   Snet1    Snet0   Snet1    Snet0   Snet1
Spop0        51      0        89      0        0       28       3       29
Spop1        49      0        11      0        0       72       4       64

Why does SUT-pop perform so poorly with IS sites? We attribute this to the fact that IS sites are variations of popular sites: thus, they are most likely non-popular sites. This is supported by the experiment on the IN900 set of initial sites, where SUT-pop has an accuracy of 89%, with 11% of sites being misclassified, as shown in Table 6.

Examining other variations for SUT-pop. In our efforts to improve SUT-pop's accuracy we have varied SUT-pop's classification criteria. Recall that SUT-pop uses the DYM rule (existence of "Did You Mean") and the T10 rule (site appearing in the top 10 results), and a site is deemed phony if DYM is true or T10 is false. We tried different ways to combine the two rules, but none of the combinations improved SUT-pop's performance.

Specifically, we examine the following variations for SUT-pop: (v.1) the DYM rule only, (v.2) the T10 rule only, and (v.3) if DYM is true and T10 is false, then the site is phony. For the PH100 set, the accuracy was lower, with v.1 at 63%, v.2 at 48% and v.3 at 29%. We have also tested these criteria with the IS100 set and found that SUT-pop's accuracy remains at 89% for versions v.1 and v.3. Interestingly, version v.2 is 100% accurate, but performs poorly with the PH100 set.

6.3. SUT and blacklisted IPs

Observation 13: SUT identifies phony sites not listed in popular IP-blacklist databases. Another important benefit of SUT is that it can identify possibly unsafe sites which are not listed in popular IP blacklists, such as Google's phishing database [39], malware.com [40] or the Project Honeypot spam lists [41]. In fact, only 2.3% of all phony sites discovered by our experiments were listed in these blacklists. Using SUT can lead to the identification of a large number of possibly malicious sites which are not yet listed on these IP blacklists. This also reinforces the point that phishing and typosquatting may not be the same problem.

6.4. SUT vs. McAfee’s SiteAdvisor

Observation 14: SUT performs significantly better than SiteAdvisor with regard to recall. We compare SUT with the SiteAdvisor service [37] by McAfee, an industry leader in network security.

However, we faced a practical limitation while querying SiteAdvisor for the initial sites (IN900): McAfee's site blocked us from requesting a large number of security reports, possibly fearing a DoS attack. Therefore, we report on only 100 sites from the IN900 set, which we call IN100.



Table 7
Comparison of SUT and McAfee's SiteAdvisor service (R: recall, A: accuracy, FAIL: no reply from McAfee). All entries are percentages.

         McAfee                   SUT
         FAIL     R      A        R      A
PH100     57      5     100      100    100
IN100     23     76     100      100    100
IS100     38     61     100      100    100



We present the results in Table 7. Each entry in the table reports the recall and accuracy of the two tools. The main observation is that SiteAdvisor does poorly in terms of recall in all categories, although it is perfectly accurate when it does respond. In contrast, SUT is perfect in both accuracy and recall. Regarding recall, SiteAdvisor correctly classified 5% of the phony sites, 76% of the initial sites, and 61% of the IS sites. Further, SiteAdvisor did not provide an answer for a large number of (grey) sites: 57% of phony, 23% of initial, and 38% of IS sites.

7. Related work

Efforts characterizing typosquatting have been limited to identifying ''typosquatters'' [14] and developing a limited profile [15,16]. Experimental studies such as the one performed by Jagatic et al. [11], in which a social network was used, showed that more than 80% of recipients followed a URL pointer that they believed a friend had sent them, and over 70% of the recipients continued to enter credentials at the corresponding site. This is an indication of the gullible nature of most Internet users.

Significant work has been done on various web-classification problems, such as detecting phishing pages, malicious pages, or ad-portal domains, but not on typosquatting. Regarding studies on phishing, reports such as the ones by Mailfrontier [9,6] lend credence to the fact that malicious impersonation is a real threat. An important piece of work by Jakobsson et al. [5,8] describes how to set up a phishing experiment to measure how users might respond to an unsafe environment. Articles and reports quoting various statistics are a testament to the problem we attempt to address [2–4,13]. Whittaker et al. [44] propose a machine-learning-based classifier for detecting phishing websites; this classifier is used to update Google's phishing blacklist automatically. To quantify the effectiveness of such blacklists, [45] performs a study of phishing blacklists using an anti-phishing testbed. Ma et al. [46] propose another classification technique for malicious URLs that relies only on URL features. Almishari et al. [42] propose a classifier for the identification of ad-portal domains.

Unfortunately, none of the above work provides a comprehensive analysis of typosquatting. The problem is so severe that heavyweights like The Coca-Cola Company, McDonald's Corporation, PepsiCo, Inc., The Washington Post Company, and others have all been forced into litigation with entities that registered URLs closely resembling their official URLs [29]. To provide genuine websites with more clout when countering URL squatters, the US passed the 'Anticybersquatting Consumer Protection Act of 1999' [28].

Despite these legal mechanisms, the problem persists. Previous research [15] presents a tool, Typo-patrol, which identifies typosquatting sites by analyzing third-party redirections and the use of cookies. This does not ensure an accurate result: for example, http://www.Bankofamerica.com is highlighted as red (possibly meaning unsafe) by Typo-patrol because of three redirections and the use of cookies. Almishari et al. [42] show that 41% of the (two-level) *.com and 39.8% of the *.net ad portals on the web are typos. However, they use third-party services provided by Google and Yahoo to identify typos, which are not an accurate means of identifying typosquatting and protecting the user in the way we propose here.

Linari et al. [48] show, in their experiments on the '.co.uk' registry, that a strong correlation exists between the popularity of a domain name and the size of its syntactical and visual neighborhoods. Moore et al. [47] investigate the funders of typosquatting and show that 80% of the 285,000 typosquatting domains they study are supported by pay-per-click ads. Unlike these works, we show that typosquatting domains exhibit distinct network-layer behavior and, based on that, propose an automated approach to detect such domains.

In [16], a large-scale study of typosquatting is conducted; our experiments are more extensive and our mechanism performs better. iTrust [17] uses a Google-based ranking mechanism to provide a threat level for a phishing site. Again, our work differs from these previous approaches in that we conduct a detailed analysis of typosquatting: which sites are affected, where the fake sites are hosted, and what can be done to combat it. CitizenHawk [38], a startup, provides typosquatting analysis for URLs, but its methodology is not publicly available and is therefore difficult to compare against. We have also compared the results obtained via SUT with entries in three popular IP blacklists: we use Google's Safe Browsing service [39] to compare phishing entries, lists from Malware.com [40] to compare malware distributors, and the Project Honeypot spam domain list [41] to compare spammer IPs.

8. Conclusion

Our research has focused on quantifying the typosquatting phenomenon. We conduct an extensive measurement-based analysis, probing more than 3 million sites obtained by modifying URLs from a corpus of 900 popular sites. We uncover that most phony URLs differ from legitimate ones by just one character, in length or in spelling. We find that typosquatting is a widespread problem: 99% of all sites with URLs obtained by our various modification schemes were found to be phony. Interestingly, URLs belonging to US and German banks suffer most from typosquatting, followed by software and technology companies and travel-related sites.
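The one-character difference noted above is easy to reproduce. The sketch below enumerates edit-distance-1 variations of a domain label (a dropped, substituted, or inserted character); it is only an illustration of this kind of systematic variation, not the exact generation scheme used to build our 3 million candidates.

import string

def one_char_variations(label):
    # All labels that differ from `label` by exactly one character.
    alphabet = string.ascii_lowercase + string.digits + "-"
    variants = set()
    for i in range(len(label)):
        variants.add(label[:i] + label[i + 1:])            # drop one character
        for c in alphabet:
            variants.add(label[:i] + c + label[i + 1:])    # substitute one character
    for i in range(len(label) + 1):
        for c in alphabet:
            variants.add(label[:i] + c + label[i:])        # insert one character
    variants.discard(label)
    return variants

print(len(one_char_variations("example")))   # number of one-character variants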

To combat this problem, we develop SUT, a combination of the SUT-net and SUT-pop modules. SUT-net uses a plethora of meaningful features, ranging from URL names and HTTP redirections to criteria obtained from keyword analysis of sites, while SUT-pop employs Google's search results to provide a threat score.


We find that SUT can successfully identify phony sites with very high accuracy and significant recall. SUT performs competitively when compared with commercial security services such as McAfee's SiteAdvisor. Further, SUT can warn users about the large number of possibly malicious sites not yet listed in popular IP blacklists.

Acknowledgement

This work was supported in part by NSF CyberTrust 0831530.

References

[1] <http://www.antiphishing.org>.
[2] <http://www.crime-research.org>.
[3] <http://www.csmonitor.com/p13s01-stin.html>.
[4] <http://www.cs.cmu.edu/help/security/>.
[5] M. Jakobsson, J. Ratkiewicz, Designing ethical phishing experiments: a study of (ROT13) rOnl auction query features, in: WWW 2006.
[6] R. Dhamija, J.D. Tygar, The battle against phishing: dynamic security skins, in: Proceedings of ACM SOUPS, 2005.
[7] Gartner Inc., Gartner survey shows phishing attacks escalated in 2007. <http://www.gartner.com> (Dec. 2007).
[8] M. Jakobsson, S. Myers, Phishing and Counter-Measures, John Wiley and Sons Inc., 2006.
[9] Mailfrontier phishing IQ test. <http://survey.mailfrontier.com/survey/quiztest.html>.
[10] S. Garfinkel, R. Miller, Johnny 2: a user test of key continuity management with S/MIME and Outlook Express, in: Symposium on Usable Privacy and Security.
[11] T. Jagatic, N. Johnson, M. Jakobsson, F. Menczer, Social phishing, 2006.
[12] <www.ngssoftware.com/ThePharmingGuide.pdf>.
[13] <http://www.drive-bypharming.com/>.
[14] <http://www.caida.org/BojanZdrnjaCompSci780Project>.
[15] <siteadvisor.com/studies/typo_squatters_nov2007>.
[16] Y.M. Wang, D. Beck, J. Wang, C. Verbowski, B. Daniels, Strider Typo-Patrol: discovery and analysis of systematic typo-squatting, in: SRUTI '06.
[17] T. Ronda, S. Saroiu, A. Wolman, iTrustPage: a user-assisted anti-phishing tool, in: Proceedings of EuroSys 2008.
[18] <http://www.cs.auckland.ac.nz/trebor/papers/CHEN02.pdf>.
[19] T. Honda, M. Yamamoto, A. Ohuchi, Automatic classification of websites based on keyword extraction of nouns, in: Information and Communication Technologies in Tourism 2006, Springer Vienna, 2007.
[20] S. Roy, S. Joshi, R. Krishnapuram, Automatic categorization of web sites based on source types, in: Proceedings of the Fifteenth ACM Conference on Hypertext and Hypermedia, 2004, pp. 38–39.
[21] G. Kening, Y. Leiming, Z. Bin, C. Qiaozi, M. Anxiang, Automatic classification of web information based on site structure, in: Proceedings of the International Conference on Cyberworlds, 2005.
[22] A. Banerjee, A. Mitra, M. Faloutsos, Dude, where's my peer?, in: Proceedings of Globecom, ISET, 2006.
[23] J.A. Kunze, Towards electronic persistence using ARK identifiers, ARK Motivation and Overview, 2003.
[24] S. Berkovich, M. Inayatullah, A fuzzy find matching tool for image text analysis, in: Proceedings of AIPR '04, 2004.
[25] Y. Zhang, N. Zincir-Heywood, E. Milios, Narrative text classification for automatic key phrase extraction in web document corpora, in: Proceedings of WIDM '05, 2005.
[26] R.S. Cox, S.D. Gribble, H.M. Levy, J.G. Hansen, A safety-oriented platform for web applications, in: Proceedings of Security & Privacy, 2006.
[27] <http://www.w3.org/Protocols/rfc2616/rfc2616>.
[28] <http://www.uspto.gov/web/repcongress.pdf>.
[29] <http://www.nysd.uscourts.gov/02-08168.PDF>.
[30] <http://www.pharming.org>.
[31] <http://www.alexa.com>.
[32] <http://www.forbes.com>.
[33] <http://www.netvalley.com>.
[34] <http://www.wired.com>.
[35] <http://www.consumersearch.com>.
[36] <http://www.caida.org>.
[37] <http://www.siteadvisor.com/sites/sedoparking.com>.
[38] <http://www.citizenhawk.com/typosquasher.html>.
[39] <http://code.google.com/apis/safebrowsing/>.
[40] <www.malware.com.br>.
[41] <www.phsdl.net>.
[42] M. Almishari, X. Yang, Ads-portal domains: identification and measurements, ACM Transactions on the Web 4 (2) (2010).
[43] A. Banerjee, D. Barman, M. Faloutsos, L. Bhuyan, Cyber-fraud is one typo away, in: Proceedings of IEEE INFOCOM 2008 Mini-Conference, Phoenix, AZ, USA, April 2008.
[44] C. Whittaker, B. Ryner, M. Nazif, Large-scale automatic classification of phishing pages, in: NDSS, San Diego, CA, Feb 28–Mar 3, 2010.
[45] S. Sheng, B. Wardman, G. Warner, L. Cranor, J. Hong, C. Zhang, An empirical analysis of phishing blacklists, in: CEAS, Mountain View, CA, 2009.
[46] J. Ma, L.K. Saul, S. Savage, G.M. Voelker, Beyond blacklists: learning to detect malicious web sites from suspicious URLs, in: Proceedings of the ACM SIGKDD Conference, Paris, France, June 2009.
[47] T. Moore, B. Edelman, Measuring the perpetrators and funders of typosquatting, in: Proceedings of Financial Crypto '10, Canary Islands, Spain, 2010.
[48] A. Linari, F. Mitchell, D. Duce, S. Morris, Typo-squatting: the ''curse'' of popularity, in: Proceedings of WebSci '09: Society On-Line, Athens, Greece, March 2009.
[49] <http://www.alexa.com/topsites>.

Anirban Banerjee (Ph.D. 2008, UC Riverside) is the Chief Security Officer and co-founder of stopthehacker.com. Anirban has authored more than 15 scientific papers on Internet measurements, security and privacy. He won the best paper award for his paper titled ''Is someone tracking P2P users?'' at IFIP Networking 2007. His work has received wide publicity in the popular press, such as digg.com and Ars Technica.

Md Sazzadur Rahman is a Ph.D. student in the Department of Computer Science, UC Riverside. Before joining UC Riverside, he completed his M.S. at the School of Computer Science, University of Oklahoma. His research interests lie in web classification, Internet measurement, overlay networks, and systems.

Michalis Faloutsos is a faculty member in the Department of Computer Science at the University of California, Riverside. He received his bachelor's degree from the National Technical University of Athens and his M.Sc. and Ph.D. from the University of Toronto. His interests include Internet protocols and measurements, peer-to-peer networks, network security, BGP routing, and ad hoc networks. He is actively involved in the community as a reviewer and a TPC member for many conferences and journals. With his two brothers, he co-authored the paper on power laws of the Internet topology (SIGCOMM '99), which is one of the top 10 most cited papers of 1999. His recent work on peer-to-peer measurements has been widely cited in the popular printed and electronic press, such as Slashdot, ACM Electronic News, USA Today, and Wired. Most recently he has focused on the classification of traffic and the identification of abnormal network behavior. He also works on Internet routing (BGP), ad hoc network routing, and network security, with an emphasis on routing.