Textual and Visual Content Based Anti-Phishing First Review

8/10/2019 Textual and Visual Content Based Anti-Phishing First Review


Textual And Visual Content Based Anti-Phishing A SVM Approach

Abstract

A novel framework using an SVM (Support Vector Machine) approach for
content-based phishing web page detection is presented. Our model takes into
account textual and visual contents to measure the similarity between the
protected web page and suspicious web pages. A text classifier, an image
classifier, and an algorithm fusing the results from classifiers are
introduced. An outstanding feature of this paper is the exploration of an SVM
model to estimate the matching threshold. This is required in the classifier
for determining the class of the web page and identifying whether the web page
is phishing or not. In the text classifier, the naive SVM rule is used to
calculate the probability that a web page is phishing. In the image
classifier, the earth mover's distance is employed to measure the visual
similarity, and our SVM model is designed to determine the threshold. In the
data fusion algorithm, the SVM theory is used to synthesize the classification
results from textual and visual content. The effectiveness of our proposed
approach was examined in a large-scale dataset collected from real phishing
cases. Experimental results demonstrated that the text classifier and the
image classifier we designed deliver promising results, the fusion algorithm
outperforms either of the individual classifiers, and our model can be adapted
to different phishing cases.



Introduction

Malicious people, also known as phishers, create phishing web pages, i.e.,
forgeries of real web pages, to steal individuals' personal information such
as bank account, password, credit card number, and other financial data.
Unwary online users can be easily deceived by these phishing web pages because
of their high similarities to the real ones. The Anti-Phishing Working Group
reported that there were at least 55,698 phishing attacks between January 1,
2009, and June 30, 2009. The latest statistics show that phishing remains a
major criminal activity involving great losses of money and personal data.

Automatically detecting phishing web pages has attracted much attention from
security and software providers, financial institutions, and academic
researchers. Methods for detecting phishing web pages can be classified into
industrial toolbar-based anti-phishing, user-interface-based anti-phishing,
and web page content-based anti-phishing. To date, techniques for phishing
detection used by the industry mainly include authentication, filtering,
attack tracing and analyzing, phishing report generating, and network law
enforcement. These anti-phishing internet services are built into e-mail
servers and web browsers and are available as web browser toolbars.

These industrial services, however, do not efficiently detect all phishing
attacks. Wu et al. conducted a thorough study and analysis of the
effectiveness of anti-phishing toolbars, covering three security toolbars and
other commonly used browser security indicators. The study indicates that all
examined toolbars were ineffective at protecting users from phishing attacks.
Reports show that 20 out of 30 subjects were spoofed by at least one phishing
attack, 85% of the spoofed subjects indicated that the websites looked
legitimate or exactly the same as the ones they had visited before, and 40% of
the spoofed subjects were tricked due to poorly designed web sites. Cranor et
al. performed another study evaluating 10 anti-phishing tools. They indicated
that only one tool could consistently detect more than 60% of phishing web
sites without a high rate of false positives, whilst four tools were not able
to recognize 50% of the tested web sites. Apart from these studies on the
effectiveness of anti-phishing toolbars, Li and



The method works by first finding the associated web pages of the given web
page and then constructing an SLN from all those web pages. A mechanism of
reasoning on the SLN is exploited to identify the phishing target. Zhang et
al. developed a content-based approach, i.e., the Carnegie Mellon
Anti-phishing and Network Analysis Tool, for anti-phishing by employing the
idea of robust hyperlinks [1 ]. Given a web page, this method first calculates
the TF-IDF of each term, an algorithm usually used in information retrieval,
generates a lexical signature by selecting a few terms, supplies this
signature to a search engine, and then matches the domain name of the current
web page against several top search results to evaluate whether the current
web page is legitimate or not. Another content-based technique, BAPT, is
designed to identify phishing websites by using an open-source Bayesian filter
on the basis of tokens which are extracted by a document object model (DOM)
analyzer.

The concept of a visual approach to phishing detection was first introduced by
Liu et al. This approach, which is oriented by the DOM-based visual similarity
of web pages, first decomposes the web pages into salient block regions. The
visual similarity between two web pages is then evaluated by three metrics,
namely, block level similarity, layout similarity, and overall style
similarity, which are based on the matching of the salient block regions. Fu
et al. followed the overall strategy, but proposed another method to calculate
the visual similarity of web pages. They first converted



The main objectives of this project are as follows:

To detect phishing web pages by using the SVM algorithm.

To classify the web page by using textual and visual SVM classification
algorithms.

To combine the classification results for textual and visual content by using
the fusion algorithm.

To compare the fused results for genuine and fake web pages by computing the
probability, in order to determine whether the given web page is phishing or
not.

Scope Of Project

The main scope of the project is as follows:

To detect whether the website is a phishing website or not.

To detect whether the website has been hacked by the attacker or not.

To compare the genuine and attacked websites by examining their fusion
results.

Project Description

Existing System

The phishing detection techniques used by the existing system mainly include
authentication, filtering, attack tracing and analyzing. Toolbar-based
anti-phishing guides the user to interact with trusted websites. Toolbars such
as security toolbars and browser security toolbars are used in the system.
Methods for detecting phishing web pages can be classified into industrial
toolbar-based anti-phishing, user-interface-based anti-phishing, and web page
content-based anti-phishing. Techniques for phishing detection used by the
industry mainly include authentication, filtering, attack tracing and
analyzing, phishing report generating, and network law enforcement. These
anti-phishing internet services are built into e-mail servers and web browsers
and are available as web browser toolbars.

Content-based anti-phishing, which refers to using the features of web pages,
draws on surface-level characteristics, textual content, and visual content.
We clarify that the content of a web page discussed here includes the whole
information of a web page, such as the domain name, URL, hyperlinks, terms,
images, and forms embedded in the web page. Surface-level characteristics have
been commonly used by industrial toolbars to detect phishing. For example,
Spoof-Guard inspects the age of the domain, well-known logos, the URL, and
links to acquire the characteristics of phishing web pages. Liu et al.
proposed the use of a semantic link network (SLN) to automatically identify
the phishing target of a given web page.

The method works by first finding the associated web pages of the given web
page and then constructing an SLN from all those web pages. A mechanism of
reasoning on the SLN is exploited to identify the phishing target. Zhang et
al. developed a content-based approach, i.e., the Carnegie Mellon
Anti-phishing and Network Analysis Tool, for anti-phishing by employing the
idea of robust hyperlinks. Given a web page, this method first calculates the
TF-IDF of each term, an algorithm usually used in information retrieval,
generates a lexical signature by selecting a few terms, supplies this
signature to a search engine, and then matches the domain name of the current
web page against several top search results to evaluate whether the current
web page is legitimate or not. Another content-based technique, BAPT, is
designed to identify phishing websites by using an open-source Bayesian filter
on the basis of tokens which are extracted by a document object model
analyzer.

The concept of a visual approach to phishing detection was first introduced by
Liu et al. This approach, which is oriented by the DOM-based visual similarity
of web pages, first decomposes the web pages into salient block regions. The
visual similarity between two web pages is then evaluated by three metrics,
namely, block level similarity, layout similarity, and overall style
similarity, which are based on the matching of the salient block regions. Fu
et al. followed the overall strategy, but proposed another method to calculate
the visual similarity of web pages. They first converted



Not all phishing attacks are detected by the existing detection techniques.

The toolbar technique is an ineffective way to protect users from phishing
attacks.

Online traffic decreases the quality of web pages and their applications.

The existing approach only investigates phishing detection at the pixel level
of web pages without considering the text level.

Existing systems such as CANTINA and toolbar-based techniques are very
difficult to implement.

Not all phishing web pages are detected by the CANTINA and toolbar-based
systems.

Proposed System

The content representation of the proposed system is divided into two
categories.

1) Textual content: "Textual content" in this paper is defined as the terms or
words that appear in a given web page, except for the stop words. We first
separate the main text content from



The system includes a training section, which estimates the statistics of
historical data, and a testing section, which examines the incoming testing
web pages. The statistics of the web page training set consist of the
probabilities that a textual web page belongs to the categories, the matching
thresholds of classifiers, and the posterior probability of data fusion.
Through the preprocessing, content representations, i.e., textual and visual,
are rapidly extracted from a given testing web page. The text classifier is
used to classify the given web page into the corresponding category based on
the textual features. The image classifier is used to classify the given web
page into the corresponding category based on the visual content. Then the
fusion algorithm is used to combine the detection results delivered by the two
classifiers. The detection results are eventually transmitted to the online
users or the web browsers.

In preprocessing, the main contents of a given web page are first separated
from



of original words. We store the stemmed words to construct the vocabulary.
Given a web page, we then form a histogram vector, where each component
represents the term frequency and n denotes the total number of components in
the vector. We explain three points here.

1) We do not extract words from all the web pages in a dataset to construct
the vocabulary, because phishers usually only use the words from a targeted
web page to scam unwary users.

2) For the sake of simplicity, we do not use any feature extraction algorithms
in the process of vocabulary construction.

3) We do not take the semantic associations of web pages into account, because
the sizes of most phishing web pages are small.
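
The vocabulary and histogram vector described above can be illustrated with a
small Python sketch. It is not the authors' implementation: the stop-word list
is a tiny hypothetical one, `naive_stem` is a crude stand-in for a real
stemmer, and all function names are assumptions made for illustration. The
vocabulary is built from the protected (targeted) page only, as point 1 states.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "your", "for"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def build_vocabulary(protected_text):
    """Vocabulary of stemmed words taken from the protected page only."""
    return sorted(set(tokenize(protected_text)))


def histogram_vector(page_text, vocabulary):
    """Term-frequency vector with one component per vocabulary term (n = len(vocabulary))."""
    counts = Counter(tokenize(page_text))
    return [counts[term] for term in vocabulary]


protected = "Sign in to your bank account. Enter your password."
suspect = "Bank account suspended! Enter password, enter password now."
vocabulary = build_vocabulary(protected)
vector = histogram_vector(suspect, vocabulary)
```

Words on the suspicious page that are absent from the protected page (here
"suspended" and "now") simply do not contribute to the vector, which keeps n
equal to the size of the protected page's vocabulary.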

In reality, using only text content is insufficient to detect phishing web
pages. This method will usually result in high false positives, because
phishing web pages are highly similar to the targeted web pages not only in
textual content but also in visual content such as famous logos, layout, and
overall style. In this system, we use the same approach as in earlier work,
using the SVM to measure the visual similarity between an incoming web page
and a protected web page.

First, we retrieve the suspected web pages and protected web pages from the
web. Second, we generate their signatures, which are used for the calculation
of the SVM between them. Thus all the web page images are normalized into
fixed-size square images. We use these normalized images to generate the
signature of each web page.

The image classifier is implemented by setting a threshold, which is estimated
in a subsequent section. If the visual similarity between a suspected web page
and the protected web page exceeds the threshold, the web page is classified
as phishing; otherwise, it is classified as normal.

The overall implementation process of the image classifier is summarized as
follows.

Step 1: Obtain the image of a web page from its URL and perform normalization.

Step 2: Generate the visual signature of the input image, including the color
and coordinate features.

Step 3: Calculate the visual similarity between the input web page image and
the protected web page image using the SVM approach.

Step 4: Classify the input web page into the corresponding category according
to the comparison of the visual similarity and the threshold.
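
The four steps above can be sketched in plain Python as follows. This is only
an illustration under stated assumptions: a page image is assumed to be
already rendered as a 2-D grid of (r, g, b) tuples, `normalize` uses simple
nearest-neighbour resampling, and `similarity` is a hypothetical stand-in
score that compares color weights and centroid positions. It is neither the
earth mover's distance nor an SVM, and every function name here is invented
for the example.

```python
from collections import defaultdict
from math import hypot, sqrt


def normalize(image, size=8):
    """Step 1: nearest-neighbour resample of a pixel grid to a size x size square."""
    height, width = len(image), len(image[0])
    return [[image[y * height // size][x * width // size] for x in range(size)]
            for y in range(size)]


def signature(image, levels=4):
    """Step 2: map each quantized color to (weight, centroid_x, centroid_y)."""
    step = 256 // levels
    buckets = defaultdict(list)
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            buckets[(r // step, g // step, b // step)].append((x, y))
    total = len(image) * len(image[0])
    return {
        color: (len(points) / total,
                sum(p[0] for p in points) / len(points),
                sum(p[1] for p in points) / len(points))
        for color, points in buckets.items()
    }


def similarity(sig_a, sig_b, size=8):
    """Step 3 (stand-in): score in [0, 1]; higher means more alike."""
    max_dist = sqrt(2) * (size - 1)
    score = 0.0
    for color in sig_a.keys() & sig_b.keys():
        weight_a, xa, ya = sig_a[color]
        weight_b, xb, yb = sig_b[color]
        closeness = 1.0 - hypot(xa - xb, ya - yb) / max_dist
        score += min(weight_a, weight_b) * closeness
    return score


def classify(suspect, protected, threshold, size=8):
    """Step 4: compare the similarity with the threshold."""
    sim = similarity(signature(normalize(suspect, size)),
                     signature(normalize(protected, size)), size)
    return "phishing" if sim > threshold else "normal"


# Toy images: white top half and blue bottom half versus a solid red page.
page = [[(255, 255, 255)] * 16 for _ in range(8)] + [[(0, 0, 255)] * 16 for _ in range(8)]
other = [[(255, 0, 0)] * 16 for _ in range(16)]
```

With these toy inputs, a page compared against itself scores the maximum
similarity, while the solid red page shares no quantized color with the
protected page and scores zero.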

The overall implementation procedures of the fusion algorithm are summarized
as follows.

Step 1: Input the training set, train a text classifier and an image
classifier, and then collect similarity measurements from the different
classifiers.

Step 2: Partition the interval of similarity measurements into sub-intervals.

Step 3: Estimate the posterior probabilities conditioning on all the
sub-intervals for the text classifier.

Step 4: Estimate the posterior probabilities conditioning on all the
sub-intervals for the image classifier.

Step 5: For a new testing web page, classify it into the corresponding
category by using the text classifier and the image classifier.

Step 6: Display the result, i.e., whether the given web page is phishing or
not.
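
A minimal Python sketch of the fusion procedure above is given here. The
source does not state how the two posterior probabilities are combined, so
this sketch assumes a plain average and a 0.5 decision cut-off; the add-one
smoothing, the number of sub-intervals, and all names are likewise
assumptions. Similarity measurements are assumed to lie in [0, 1].

```python
def bin_index(score, bins):
    """Index of the sub-interval of [0, 1] that contains score (Step 2)."""
    return min(int(score * bins), bins - 1)


def estimate_posteriors(scores, labels, bins=5):
    """Steps 3 and 4: P(phishing | score in sub-interval), with add-one smoothing.

    labels use 1 for phishing and 0 for normal.
    """
    phishing = [0] * bins
    total = [0] * bins
    for score, label in zip(scores, labels):
        i = bin_index(score, bins)
        total[i] += 1
        phishing[i] += label
    return [(phishing[i] + 1) / (total[i] + 2) for i in range(bins)]


def fuse(text_score, image_score, text_post, image_post):
    """Steps 5 and 6: look up both posteriors and combine them (assumed: average)."""
    bins = len(text_post)
    p_text = text_post[bin_index(text_score, bins)]
    p_image = image_post[bin_index(image_score, bins)]
    p = (p_text + p_image) / 2
    return ("phishing" if p > 0.5 else "normal"), p


# Step 1 (assumed already done): similarity measurements collected from both classifiers.
labels = [1, 1, 0, 0, 1, 0]
text_scores = [0.9, 0.85, 0.1, 0.25, 0.95, 0.15]
image_scores = [0.85, 0.9, 0.3, 0.1, 0.7, 0.25]

text_post = estimate_posteriors(text_scores, labels)
image_post = estimate_posteriors(image_scores, labels)
verdict_a, p_a = fuse(0.92, 0.75, text_post, image_post)
verdict_b, p_b = fuse(0.12, 0.25, text_post, image_post)
```

Sub-intervals with no training samples fall back to a posterior of 0.5 because
of the smoothing, so an unseen similarity range does not push the decision
either way.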

Advantages

The data fusion framework enables us to directly incorporate the multiple
results produced by different classifiers.

The SVM algorithm is used for classifying both the textual and visual content.

All phishing websites will be detected by using this approach.

Literature Survey

Detecting phishing web pages with visual similarity assessment based on earth
mover's distance

An effective approach to phishing Web page detection is proposed, which uses
Earth Mover's Distance (EMD) to measure Web page visual similarity. We first
convert the involved Web pages into low resolution images and then use color
and coordinate features to represent the image signatures. We use EMD to
calculate the signature distances of the images of the Web pages. We train an
EMD threshold vector for classifying a Web page as a phishing or a normal one.
Large-scale experiments with 10,281 suspected Web pages are carried out to
show high classification precision, phishing recall, and applicable time
performance for online enterprise solution. We also compare our method with
two others to manifest its advantage. We also built up a real system which is
already used online and it has caught many real phishing cases.

Phishing web pages are forged web pages that are created by malicious people
to mimic web pages of real web sites. Most of these kinds of web pages have
high visual similarities to scam their victims. Some of these kinds of web
pages look exactly like the real ones. Unwary Internet users may be easily
deceived by this kind of scam. Victims of phishing web pages may expose their
bank account, password, credit card number, or other important information to
the phishing Web page owners. Phishing is a relatively new Internet crime in
comparison with other forms, e.g., virus and hacking. More and more phishing
Web pages have been found in recent years in an accelerative way. A report
from the Anti-Phishing Working Group shows that the number of phishing Web
pages is increasing each month by 50 percent and usually 5 percent of the
phishing e-mail receivers will respond to the scams. Also, there were 15,050
phishing cases reported simply in one month in June 2005. This problem has
drawn high attention from both industry and the academic research domain since
it is a severe security and privacy problem and has caused huge negative
impacts on the Internet world. It is threatening people's confidence to use
the Web to conduct online finance-related activities.

In this system, we propose an effective approach for detecting phishing Web
pages, which employs the Earth Mover's Distance (EMD) to calculate the visual
similarity of Web pages. The most important reason that Internet users could
become phishing victims is that phishing Web pages always have high visual
similarity with the real Web pages, such as visually similar block layouts,
dominant colors, images, and fonts, etc. We follow the anti-phishing strategy
in earlier work to obtain suspected Web pages, which are supposed to be
collected from URLs in those e-mails containing keywords associated with
protected Web pages. We first convert them into normalized images and then
represent their image signatures with features composed of dominant color
category and its corresponding centroid coordinate to calculate the visual
similarity of two Web pages.

The linear programming algorithm for EMD is applied to visual similarity
computation of the two signatures. An anti-phishing system may be requested to
protect many Web pages. A threshold is calculated for each protected Web page
using supervised training. If the EMD-based visual similarity of a Web page
exceeds the threshold of a protected Web page, we classify the Web page as a
phishing one.
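
The per-page threshold step can be sketched as below. The sketch assumes the
EMD-based similarities have already been computed elsewhere and only shows one
plausible supervised rule, namely choosing the threshold that misclassifies
the fewest labeled training pages. That rule, the function name
`train_threshold`, and the page name in the usage lines are assumptions, not
the method of the surveyed paper.

```python
def train_threshold(similarities, labels):
    """Return (threshold, training_errors) for one protected page.

    A page is called phishing when its similarity exceeds the threshold.
    labels use 1 for phishing and 0 for normal.
    """
    values = sorted(set(similarities))
    # Candidate cut points: below the minimum, between neighbours, above the maximum.
    candidates = [values[0] - 1e-9]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1e-9)

    best_threshold, best_errors = None, None
    for t in candidates:
        errors = sum((s > t) != bool(y) for s, y in zip(similarities, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# One threshold per protected page, keyed by a hypothetical page name.
training = {
    "example-bank-login": ([0.95, 0.9, 0.88, 0.4, 0.35, 0.6], [1, 1, 1, 0, 0, 0]),
}
thresholds = {page: train_threshold(sims, labels)[0]
              for page, (sims, labels) in training.items()}

threshold, errors = train_threshold(*training["example-bank-login"])
is_phishing = 0.91 > thresholds["example-bank-login"]
```

Keeping one threshold per protected page, as the text describes, lets pages
that are frequently imitated with small variations use a looser cut-off than
pages whose look-alikes are near-exact copies.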

Evolving with the anti-phishing techniques, various phishing techniques and
more complicated and hard-to-detect methods are used by phishers. The most
straightforward way for a phisher to scam people is to make the phishing Web
pages similar to their targets.

A phishing strategy includes both Web link obfuscation and Web page
obfuscation. Web link obfuscation can be carried out in four basic ways,
including adding a suffix to a domain name of the URL, using an actual link
different from the visible link, and utilizing system bugs in real Web sites
to redirect the link to the phishing Web pages. Previous research works on
duplicated document detection approaches focus on plain text documents and use
pure text features in similarity measure, such as collection statistics,
syntactic analysis, displaying structure, visual-based understanding, vector
space model, etc.



Phishing is considered as one of the most serious threats for the Internet and
e-commerce. Phishing attacks abuse trust with the help of deceptive e-mails,
fraudulent web sites and malware. In order to prevent phishing attacks some
organizations have implemented Internet browser tool-bars for identifying
deceptive activities.



of target name in the URL or only IP address without host name. These
ambiguous domain names are hazardous for careless consumers.

Because of careless usability and security design, phishers can easily take
advantage of poor usability design. In order to offer more reliable security,
anti-phishing tool-bars should be easier to use. Moreover, as end-users must
be able to use the toolbars and make correct choices, usability evaluation of
these toolbars is important. Our research objective was to find out general
usability design principles for anti-phishing client side applications. Such
information may result in valuable information for improving usability and
security of anti-phishing applications. Based on this motivation, we conducted
the heuristic usability evaluation of five toolbars.



spoofed.


Web wallet: Preventing phishing attacks by revealing user intentions

We introduce a new anti-phishing solution, the Web Wallet. The Web Wallet is a
browser sidebar which users can use to submit their sensitive information
online. It detects phishing attacks by determining where users intend to
submit their information and suggests an alternative safe path to their
intended site if the current site does not match it. It integrates security
questions into the user's workflow so that its protection cannot be ignored by
the user. We conducted a user study on the Web Wallet prototype and found that
the Web Wallet is a promising approach. In the study, it significantly
decreased the spoof rate of typical phishing attacks from 63% to 7%, and it
effectively prevented all phishing attacks as long as it was used. A majority
of the subjects successfully learned to depend on the Web Wallet to submit
their login information.



sensitive data, she presses a dedicated security key on the keyboard to open
the Web Wallet. Using the Web Wallet, she may type her data or retrieve her
stored data. The data is then filled into the web form. But before the
fill-in, the Web Wallet checks if the current site is good enough to receive
the sensitive data. If the current site is not qualified, the Web Wallet
requires the user to explicitly indicate where she wants the data to go. If
the user's intended site is not the current site, the Web Wallet shows a
warning to the user about this discrepancy, and gives her a safe path to her
intended site. There is one simple rule to correctly use the Web Wallet:
"Always use the Web Wallet to submit sensitive information by pressing the
security key first." Equivalently, "never submit sensitive information
directly through a web form because it is not a secure practice."

We have run a user study to test the Web Wallet interface. The results are
promising:

• The Web Wallet significantly decreased the spoof rate of normal phishing
attacks from 63% to 7%.

• All the simulated phishing attacks in the study were effectively prevented
by the Web Wallet as long as it was used.

• By disabling direct input into web forms and thus making itself the only way
to input sensitive information, the Web Wallet successfully trained a majority
of the subjects to use it to protect their sensitive information submission.

But there are also negative results which we plan to deal with in future
research:

• The subjects totally failed to differentiate the authentic Web Wallet
interface from a fake Web Wallet presented by a phishing site. This is a new
type of phishing attack. Instead of mimicking a legitimate site's appearance,
the attacker fakes the interface of security software that is run by the user.

• It is not easy to completely stop all subjects from typing sensitive
information directly into web forms. Users are familiar with web form
submission and have a strong tendency to use it.

Phishing attacks exploit the gap between the way a user perceives a
communication and the actual effect of the communication. The computer system
and the human user have two different understandings of a web site. The user
recognizes a site based on its visual appearance and the semantic meaning of
its content. But the browser recognizes a site based on system properties,
e.g., whether the site has an SSL certificate, when and where this site
registered, etc. As a result, neither the computer system nor the human user
alone can effectively prevent phishing attacks.

On the one hand, it is hard, if not impossible, for the computer to always
correctly derive the semantic meaning of the content. On the other hand,
ordinary users do not know how to correctly interpret the system properties.
The user interface is thus the exact place to bridge the gap between the
user's mental model and the system model by letting the human user and the
system share what they individually know about the current site. The Web
Wallet helps the users transfer their real intention to the browser,
especially when they are doing phishing-critical actions, such as submitting
sensitive data to web sites. When a user uses the Web Wallet, a dedicated
interface for sensitive information submission, she implicitly indicates that
the submitting data is sensitive. The user further indicates the sensitive
data type by using the appropriate card in the Web Wallet.


Intelligent phishing website detection system using fuzzy techniques

Detecting and identifying e-banking phishing websites is really a complex and
dynamic problem involving many factors and criteria. Because of the subjective
considerations and the ambiguities involved in the detection, Fuzzy Data
Mining Techniques can be an effective tool in assessing and identifying
e-banking phishing websites since it offers a more natural way of dealing with
quality factors rather than exact values. In this system, we present a novel
approach to overcome the fuzziness in the e-banking phishing website
assessment and propose an intelligent, resilient and effective model for
detecting e-banking phishing websites. The proposed model is based on Fuzzy
logic combined with Data Mining algorithms to characterize the e-banking
phishing website factors and to investigate its techniques by classifying the
phishing types and defining six e-banking phishing website attack criteria
with a layer structure. A case study was applied to illustrate and simulate
the phishing process. Our experimental results showed the significance and
importance of the e-banking phishing website criteria represented by layer one
and the varied influence of the phishing characteristic layers on the final
e-banking phishing website rate.

E-banking phishing websites are forged websites that are created by malicious
people to mimic real e-banking websites. Most of these kinds of Web pages have
high visual similarities to scam their victims. Some of these Web pages look
exactly like the real ones. Unwary Internet users may be easily deceived by
this kind of scam. Victims of e-banking phishing Websites may expose their
bank account, password, credit card number, or other important information to
the phishing Web page owners. The impact is the breach of information security
through the compromise of confidential data, and the victims may finally
suffer losses of money or other kinds. Phishing is a relatively new Internet
crime in comparison with other forms, e.g., virus and hacking.

The e-banking phishing website is a very complex issue to understand and to
analyze, since it joins technical and social problems with each other, for
which there is no known single silver bullet to entirely solve it. The
motivation behind this study is to create a resilient and effective method
that uses Fuzzy Data Mining algorithms and tools to detect e-banking phishing
websites in an automated manner. DM approaches such as neural networks, rule
induction, and decision trees can be a useful addition to the fuzzy logic
model. It can deliver answers to business questions that traditionally were
too time consuming to resolve, such as "Which are the most important e-banking
phishing website characteristic indicators and why?" by analyzing massive
databases and historical data for training purposes.

Fuzzy Data Mining Algorithms & Techniques

The approach described here is to apply fuzzy logic and data mining algorithms
to assess e-banking phishing website risk on the 27 characteristics and
factors which stamp the forged website. The essential advantage offered by
fuzzy logic techniques is the use of linguistic variables to represent key
phishing characteristic indicators and relating e-banking phishing website
probability.

1) Fuzzification

In this step, linguistic descriptors such as



between classes. The degree of belongingness of the values of the variables to
any selected class is called the degree of membership. A membership function
is designed for each phishing characteristic indicator, which is a curve that
defines how each point in the input space is mapped to a membership value
between [0, 1]. Linguistic values are assigned for each phishing indicator as
Low, Moderate, and


3) Aggregation of the rule outputs

This is the process of unifying the outputs of all discovered rules, combining
the membership functions of all the rule consequents previously scaled into a
single fuzzy set.

4) Defuzzification

This is the process of transforming a fuzzy output of a fuzzy inference system
into a crisp output. Fuzziness helps to evaluate the rules, but the final
output has to be a crisp number. The input for the defuzzification process is
the aggregate output fuzzy set and the output is a number. This step was done
using the Centroid technique since it is a commonly used method.
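
The aggregation and centroid defuzzification steps can be illustrated with a
short Python sketch. The triangular membership functions, the 0 to 100 risk
scale, the three output sets, and the numeric discretization are all
assumptions chosen for the example; the surveyed system's actual membership
curves and rule base are not given in this section.

```python
def triangle(x, left, peak, right):
    """Triangular membership function with value 1 at peak and 0 outside (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Assumed output sets for a phishing risk score on a 0..100 scale.
OUTPUT_SETS = {
    "low": (-50, 0, 50),
    "moderate": (0, 50, 100),
    "high": (50, 100, 150),
}


def aggregate(x, strengths):
    """Scale each output set by its rule strength, then unify with the maximum."""
    return max(strengths.get(name, 0.0) * triangle(x, *params)
               for name, params in OUTPUT_SETS.items())


def centroid(strengths, steps=1000):
    """Crisp output: centre of gravity of the aggregated fuzzy set, sampled on 0..100."""
    xs = [100 * i / steps for i in range(steps + 1)]
    mus = [aggregate(x, strengths) for x in xs]
    area = sum(mus)
    if area == 0:
        raise ValueError("no rule fired, so there is nothing to defuzzify")
    return sum(x * m for x, m in zip(xs, mus)) / area


mostly_high = centroid({"moderate": 0.2, "high": 0.8})
only_moderate = centroid({"moderate": 1.0})
only_low = centroid({"low": 0.9})
```

The centroid always lands inside the 0 to 100 range and moves toward whichever
output set fired most strongly, which is why it is a common default for
turning the aggregated set into a single risk number.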

There are a number of challenges posed by doing post-hoc classification of
e-banking phishing websites. Most of these challenges only apply to the
e-banking phishing websites data and materialize as a form of information,
which has the net effect of increasing the false negative rate. The age of the
dataset is the most significant problem, which is particularly relevant with
the phishing corpus. E-banking phishing websites are short-lived, often
lasting only in the order of 48 hours. Some of our features can therefore not
be extracted from older websites, making our tests difficult. The average
phishing site stays live for approximately 2.25 days. Furthermore, the process
of transforming the original e-banking phishing website archives into record
feature datasets is not without error. It requires the use of heuristics at
several steps. Thus high accuracy from the data mining algorithms cannot be
expected.


CANTINA: A content-based approach to detecting phishing web sites

Phishing is a significant problem involving fraudulent email and web sites
that trick unsuspecting users into revealing private information. In this
paper, we present the design, implementation, and evaluation of CANTINA, a
novel, content-based approach to detecting phishing web sites, based on the
TF-IDF information retrieval algorithm. We also discuss the design and
evaluation of several heuristics we developed to reduce false positives. Our
experiments show that CANTINA is good at detecting phishing sites, correctly
labeling approximately 95% of phishing sites.

Recently, there has been a dramatic increase in phishing, a kind of attack in
which victims are tricked by spoofed emails and fraudulent web sites into
giving up personal information. Phishing is a rapidly growing problem, with
9,255 unique phishing sites reported in June of 2006 alone. It is unknown
precisely how much phishing costs each year since impacted industries are
reluctant to release figures; estimates range from $1 billion to 2.8 billion
per year. To respond to this threat, software vendors and companies have
released a variety of anti-phishing toolbars.

For example, eBay offers a free toolbar that can positively identify
eBay-owned sites, and Google offers a free toolbar aimed at identifying any
fraudulent site. As of September 2006, the free software download site
download.com listed 84 anti-phishing toolbars.



In this system, we present the design, implementation, and evaluation of CANTINA, a novel content-based approach for detecting phishing web sites. CANTINA examines the content of a web page to determine whether it is legitimate or not, in contrast to other approaches that look at surface characteristics of a web page, for example the URL and its domain name. CANTINA makes use of the well-known TF-IDF algorithm used in information retrieval, and more specifically, the Robust Hyperlinks algorithm developed by Phelps and Wilensky for overcoming broken hyperlinks. Our results show that CANTINA is quite good at detecting phishing sites, detecting 94-97% of phishing sites.

We also show that we can use CANTINA in conjunction with heuristics used by other tools to reduce false positives, while lowering phish detection rates only slightly. We present a summary evaluation, comparing CANTINA to two popular anti-phishing toolbars that are representative of the most effective tools for detecting phishing sites currently available. Our experiments show that CANTINA has comparable or better performance than SpoofGuard with far fewer false positives, and does about as well as Netcraft. Finally, we show that CANTINA combined with heuristics is effective at detecting phishing URLs in users' actual email, and that its most frequent mistake is labeling spam-related URLs as phishing.

TF-IDF is an algorithm often used in information retrieval and text mining. TF-IDF yields a weight that measures how important a word is to a document in a corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. The term frequency (TF) is simply the number of times a given term appears in a specific document. This count is usually normalized to prevent a bias towards longer documents, giving a measure of the importance of the term within the particular document. The inverse document frequency (IDF) is a measure of the general importance of the term. Roughly speaking, the IDF measures how common a term is across an entire collection of documents. Thus, a term has a high TF-IDF weight by having a high term frequency in a given document and a low document frequency across the whole collection.
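The weighting described above can be sketched in a few lines of Java. This is only an illustrative sketch, not the implementation used by CANTINA or by this project: the class name `TfIdfSketch`, its method names, the simple letter-only tokenizer, and the smoothed IDF formula log((1 + N) / (1 + df)) are all assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative TF-IDF sketch; class and method names are hypothetical. */
final class TfIdfSketch {

    /** Lowercases the text and splits it on anything that is not a letter a-z. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** TF-IDF weight of every term in doc, measured against corpus (which should contain doc). */
    static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) counts.merge(term, 1, Integer::sum);

        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Normalized term frequency: occurrences divided by document length.
            double tf = e.getValue() / (double) doc.size();
            // Document frequency: number of corpus documents that contain the term.
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            // Smoothed IDF (an assumed variant): a term found in every document gets weight 0.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + df));
            result.put(e.getKey(), tf * idf);
        }
        return result;
    }

    /** The k terms with the highest weights; ties are broken alphabetically. */
    static List<String> topTerms(Map<String, Double> weights, int k) {
        return weights.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Calling `topTerms(weights(page, corpus), 5)` would give a five-term signature of the kind CANTINA feeds to a search engine. A term that occurs in every document of the corpus receives weight zero under this formula, which matches the intuition that common words carry little identifying information.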

CANTINA works as follows:

Given a web page, calculate the TF-IDF scores of each term on that web page.

Generate a lexical signature by taking the five terms with highest TF-IDF weights.

Feed this lexical signature to a search engine, which in our case is Google.

If the domain name of the current web page matches the domain name of the N top search results, we consider it to be a legitimate web site. Otherwise, we consider it a phishing site.
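The final step above, matching the page's domain against the top N search results, can be sketched as follows. This is a minimal sketch under stated assumptions: the search results are passed in as a plain list of URLs (no search engine API is called), whole host names are compared after stripping a leading "www.", and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

/** Illustrative domain-match check; names are hypothetical. */
final class DomainMatchSketch {

    /** Host of a URL, lowercased and without a leading "www."; null if it cannot be determined. */
    static String host(String url) {
        try {
            String h = URI.create(url).getHost();
            if (h == null) return null;
            h = h.toLowerCase(Locale.ROOT);
            return h.startsWith("www.") ? h.substring(4) : h;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    /** True when the page's host appears among the first n search results. */
    static boolean looksLegitimate(String pageUrl, List<String> searchResults, int n) {
        String pageHost = host(pageUrl);
        if (pageHost == null) return false;
        int limit = Math.min(n, searchResults.size());
        for (int i = 0; i < limit; i++) {
            if (pageHost.equals(host(searchResults.get(i)))) return true;
        }
        return false; // not found in the top n: treated as a possible phishing page
    }
}
```

Comparing full host names is stricter than comparing registered domains; a production system would probably normalize subdomains first, which is left out here to keep the sketch short.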

Our technique makes the assumption that Google indexes the vast majority of legitimate web sites, and that legitimate sites will be ranked higher than phishing sites. Combined, these assumptions suggest that a phishing scam will rarely, if ever, be highly ranked. At the end of this paper, however, we discuss some ways of possibly subverting CANTINA.

Age of Domain

This heuristic checks the age of the domain name. Many phishing sites have domains that are registered only a few days before phishing emails are sent out. We use a W

measures the number of months from when the domain name was first registered. If the domain has been registered longer than 12 months, the heuristic returns 1, deeming it legitimate; otherwise it returns -1, deeming it phishing. If the W


retrieving the page. Combined with the limited size of the browser address bar, this makes it possible to write URLs that appear legitimate within the address bar, but actually cause the browser to retrieve a different page. This heuristic is used by Mozilla Firefox. Dashes are also rarely used by legitimate sites, so we use this as another heuristic. SpoofGuard checks for both at symbols and dashes in URLs.
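A check for at symbols and dashes, as described above, might look like the following sketch. It assumes that an at symbol anywhere in the URL is suspicious and that dashes are only examined in the host part; the source text does not state these details, and the class and method names are hypothetical.

```java
import java.net.URI;

/** Illustrative at-symbol and dash check; names are hypothetical. */
final class SuspiciousUrlSketch {

    /** True when the URL contains '@' anywhere or a '-' in its host name. */
    static boolean isSuspicious(String url) {
        if (url.indexOf('@') >= 0) {
            return true; // the text before '@' may be shown to the user but is not the real host
        }
        try {
            String host = URI.create(url).getHost();
            return host != null && host.indexOf('-') >= 0;
        } catch (IllegalArgumentException e) {
            return false; // assumption: a URL that cannot be parsed is left to other checks
        }
    }
}
```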

Suspicious Links

This heuristic applies the URL check above to all the links on the page. If any link on a page fails this URL check, then the page is labeled as a possible phishing scam. This heuristic is also used by SpoofGuard.

IP Address

This heuristic checks if a page's domain name is an IP address. This heuristic is also used in PILFER.

Dots in URL

This heuristic checks the number of dots in a page's URL. We found that phishing pages tend to use many dots in their URLs but legitimate sites usually do not. Currently, this heuristic labels a page as phish if there are 5 or more dots. This heuristic is also used in PILFER.
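The IP-address and dot-count heuristics are both simple string tests on the URL and can be sketched together. The sketch makes several assumptions: only dotted-decimal IPv4 hosts are recognized, dots are counted over the whole URL string, and the dot threshold is passed in as a parameter. The names are hypothetical.

```java
import java.net.URI;

/** Illustrative URL-shape heuristics; names are hypothetical. */
final class UrlShapeSketch {

    /** True when the host is a dotted-decimal IPv4 address such as 192.0.2.7. */
    static boolean hostIsIpAddress(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) return false;
        String[] parts = host.split("\\.");
        if (parts.length != 4) return false;
        for (String p : parts) {
            // Each part must be 1-3 digits with a value of at most 255.
            if (!p.matches("\\d{1,3}") || Integer.parseInt(p) > 255) return false;
        }
        return true;
    }

    /** Number of '.' characters in the whole URL string. */
    static int countDots(String url) {
        return (int) url.chars().filter(c -> c == '.').count();
    }

    /** True when the URL has at least 'threshold' dots. */
    static boolean tooManyDots(String url, int threshold) {
        return countDots(url) >= threshold;
    }
}
```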

Forms

This heuristic checks if a page contains any

Software Description

Java

Java is a programming language originally developed by James Gosling at Sun Microsystems (now a subsidiary of Oracle Corporation) and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities. Java applications are typically compiled to bytecode (class file) that can run on any Java Virtual Machine (JVM) regardless of computer architecture. Java is a general-purpose, concurrent, class-based, object-oriented language that is specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere." Java is currently one of the most popular programming languages in use, particularly for client-server web applications.

The original and reference implementation Java compilers, virtual machines, and class libraries were developed by Sun from 1995. As of May 2007, in compliance with the specifications of the Java Community Process, Sun relicensed most of its Java technologies under the GNU General Public License. Others have also developed alternative implementations of these Sun technologies, such as the GNU Compiler for Java and GNU Classpath.

Java Platform:

One characteristic of Java is portability, which means that computer programs written in the Java language must run similarly on any hardware/operating-system platform. This is achieved by compiling the Java language code to an intermediate representation called Java bytecode, instead of directly to platform-specific machine code. Java bytecode instructions are analogous to machine code, but are intended to be interpreted by a virtual machine (VM) written specifically for the host hardware. End-users commonly use a Java Runtime Environment (JRE) installed on their own machine for standalone Java applications, or in a Web browser for Java applets.

Standardized libraries provide a generic way to access host-specific features such as graphics, threading, and networking.

A major benefit of using bytecode is porting.


the NetBeans runtime container is an execution environment that understands what a module is, handles its lifecycle, and enables it to interact with other modules in the same application.

Registration of various objects, files and hints into layer is pretty central to the way NetBeans based applications handle communication between modules. This page summarizes the list of such extension points defined by modules with API.

Context menu actions are read from the layer folder Loaders/text/x-ant+xml/Actions.

The Keymaps folder contains subfolders for individual keymaps (Emacs, JBuilder, NetBeans). The name of a keymap can be localized; use the "SystemFileSystem.localizingBundle" attribute of your folder for this purpose. Each keymap folder contains shadows to actions. A shortcut is mapped to the name of the file. The Emacs shortcut format is used, and multikeys are separated by space characters ("C-X P" means Ctrl+X followed by P). The "currentKeymap" property of the "Keymaps" folder contains the original (not localized) name of the current keymap.

The Shortcuts folder contains registrations of shortcuts. It is supported for backward compatibility only. All new shortcuts should be registered in the "Keymaps/NetBeans" folder. Shortcuts installed in the Shortcuts folder will be added to all keymaps if there is no conflict. This means that if the same shortcut is mapped to different actions in the Shortcuts folder and in the current keymap folder (like Keymap/NetBeans), the Shortcuts folder mapping will be ignored.

* DatabaseExplorerLayerAPI in Database Explorer

* Loaders-text-dbschema-Actions in Database Explorer

* Loaders-text-sql-Actions in Database Explorer

* PluginRegistration in Java EE Server Registry

XML layer contract for registration of server plug-ins and instances that implement optional capabilities of server plug-ins. Plug-ins with server-specific deployment descriptor files should declare the full list in XML layer as specified in the document plugin-layer-file.html from the above link.

"Projects/org-netbeans-modules-java-j2seproject/Customizer" folder's content is used to construct the project's customizer. Its content is expected to be ProjectCustomizer.CompositeCategoryProvider instances. The lookup passed to the panels contains an instance of Project and org.netbeans.modules.java.j2seproject.ui.customizer.J2SEProjectProperties. Please note that the latter is not part of any public APIs and you need an implementation dependency to make use of it.

"Projects/org-netbeans-modules-java-j2seproject/Nodes" folder's content is used to construct the project's child nodes. Its content is expected to be NodeFactory instances.

"Projects/org-netbeans-modules-java-j2seproject/Lookup" folder's content is used to construct the project's additional lookup. Its content is expected to be LookupProvider instances. J2SE project provides LookupMergers for Sources, PrivilegedTemplates and RecommendedTemplates. Implementations added by 3rd parties will be merged into a single instance in the project's lookup.

Use OptionsDialog folder for registration of custom top level options panels. Register your implementation of OptionsCategory there (*.instance file). Standard file systems sorting mechanism is used.

Use OptionsDialog/Advanced folder for registration of custom panels to Miscellaneous Panel. Register your implementation of AdvancedCategory there (*.instance file). Standard file systems sorting mechanism is used.

Use OptionsExport/<MyCategory> folder for registration of items for export/import of options. Registration in layers looks as follows

Source files must be named after the public class they contain, appending the suffix .java, for example,


The keyword void indicates that the main method does not return any value to the caller. If a Java program is to exit with an error code, it must call System.exit() explicitly.

The method name "main" is not a keyword in the Java language. It is simply the name of the method the Java launcher calls to pass control to the program. Java classes that run in managed environments such as applets and Enterprise JavaBeans do not use or need a main() method. A Java program may contain multiple classes that have main methods, which means that the VM needs to be explicitly told which class to launch from.

The main method must accept an array of String objects. By convention, it is referenced as args although any other legal identifier name can be used. Since Java 5, the main method can also use variable arguments, in the form of public static void main(String... args), allowing the main method to be invoked with an arbitrary number of String arguments. The effect of this alternate declaration is semantically identical (the args parameter is still an array of String objects), but allows an alternative syntax for creating and passing the array.

The Java launcher launches Java by loading a given class (specified on the command line or as an attribute in a JAR) and starting its public static void main(String[]) method. Stand-alone programs must declare this method explicitly. The String[] args parameter is an array of String objects containing any arguments passed to the class. The parameters to main are often passed by means of a command line.

Printing is part of a Java standard library: The System class defines a public static field called out. The out object is an instance of the PrintStream class and provides many methods for printing data to standard out, including println(String) which also appends a new line to the passed string.
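The points above about main, its String arguments, System.exit() and System.out.println can be seen together in one small program. The class name ArgsDemo and the helper method describe are hypothetical and chosen only for illustration.

```java
/** Illustrative program; the class and helper names are hypothetical. */
final class ArgsDemo {

    /** Builds a one-line summary of the arguments, joined by single spaces. */
    static String describe(String... args) {
        return args.length + " argument(s): " + String.join(" ", args);
    }

    // The varargs form is equivalent to main(String[] args): args is still a String array.
    public static void main(String... args) {
        if (args.length == 0) {
            System.err.println("usage: java ArgsDemo <word>...");
            System.exit(1); // main returns void, so an error code needs an explicit exit
        }
        System.out.println(describe(args)); // println appends a line terminator
    }
}
```

After compiling, running `java ArgsDemo hello world` passes the two words to main as the args array. Because a program may contain several classes with a main method, the class name given to the launcher decides where execution starts.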

Java High-level Language:

A high-level programming language developed by Sun Microsystems. Java was originally called OAK, and was designed for handheld devices and set-top boxes. Oak was unsuccessful so in 1995 Sun changed the name to Java and modified the language to take advantage of the burgeoning World Wide Web.

Java is an object-oriented language similar to C++, but simplified to eliminate language features that cause common programming errors. Java source code files (files with a .java extension) are compiled into a format called bytecode (files with a .class extension), which can then be executed by a Java interpreter. Compiled Java code can run on most computers because Java interpreters and runtime environments, known as Java Virtual Machines (VMs), exist for most operating systems, including UNIX, the Macintosh OS, and Windows. Bytecode can also be converted directly into machine language instructions by a just-in-time compiler (JIT).

Java is a general purpose programming language with a number of features that make the language well suited for use on the World Wide Web. Small Java applications are called Java applets and can be downloaded from a Web server and run on your computer by a Java-compatible Web browser, such as Netscape Navigator or Microsoft Internet Explorer.

Object-oriented software development matured significantly during the past several years. The convergence of object-oriented modeling techniques and notations, the development of object-oriented frameworks and design patterns, and the evolution of object-oriented programming languages have been essential in the progression of this technology.

Object-Oriented Software Development using Java: Principles, Patterns, and Frameworks has a very applied focus that develops skills in designing software, particularly in writing well-designed, medium-sized object-oriented programs. It provides a broad and coherent coverage of object-oriented technology, including object-oriented modeling using the Unified Modeling Language (UML), object-oriented design using Design Patterns, and object-oriented programming using Java.

NetBeans

The NetBeans Platform is a reusable framework for simplifying the development of Java Swing desktop applications. The NetBeans IDE bundle for Java SE contains what is needed to start developing NetBeans plug-ins and NetBeans Platform based applications; no additional SDK is required.

Applications can install modules dynamically. Any application can include the Update Center module to allow users of the application to download digitally-signed upgrades and new features directly into the running application. Reinstalling an upgrade or a new release does not force users to download the entire application again.

The platform offers reusable services common to desktop applications, allowing developers to focus on the logic specific to their application. Among the features of the platform are:

User interface management (e.g. menus and toolbars)

User settings management

Storage management (saving and loading any kind of data)

Window management

Wizard framework (supports step-by-step dialogs)

NetBeans Visual Library

Integrated Development Tools


Wamp Server

WAMPs are packages of independently-created programs installed on computers that use a Microsoft Windows operating system. WAMP is an acronym formed from the initials of the operating system Microsoft Windows and the principal components of the package: Apache, MySQL and one of P


System Architecture


Modules

Loading web page training set.

Textual and visual content feature extraction.

Text and image classification.

Fusing of detected results.

Comparison of detected fusion results.

Module Description

Loading web page training set

Loading the phishing web pages into the database.

Loading the protected web pages into the database.

Textual and visual content feature extraction

Extraction of textual content of web page by using extraction algorithms.

Extraction of visual content of web page by using extraction algorithms.


The textual feature extraction is done by using

The fusion algorithm is used for merging the textual and visual classification results.

Comparison of detected fusion results

The detected fusion results will be compared with the original web page.

The posterior probability will be found by using the similarity.

Using this probability, the fusion results of false and true web pages will be compared.

The false web page is compared with the true web page.

The detected results will be shown to the user.


System Requirements

Software Requirement

Operating System : Windows XP

Language : Core Java

Version : JDK 1.

IDE : NetBeans .2

Database : MySQL

Hardware Requirements

Processor : Pentium IV

Clock Speed : 2.7 GHz

RAM Capacity : 1 GB


Conclusion

A new content-based anti-phishing system has been thoroughly developed. In this system, we presented a new framework to solve the anti-phishing problem. The new features of this framework can be represented by a text classifier, an image classifier, and a fusion algorithm. Based on the textual content, the text classifier is able to classify a given web page into corresponding categories as phishing or normal. This text classifier was modeled by SVM rule. Based on the visual content, the image classifier, which relies on SVM, is able to calculate the visual similarity between the given web page and the protected web page efficiently. The matching threshold used in both text classifier and image classifier is effectively estimated by using a probabilistic model derived from the SVM theory. A novel data fusion model using the SVM theory was developed and the corresponding fusion algorithm presented. This data fusion framework enables us to directly incorporate the multiple results produced by different classifiers. This fusion method provides insights for other data fusion applications. More importantly, it is worth noting that our content-based model can be easily embedded into current industrial anti-phishing systems.


Future Enhancement

Our future work will include adding more features to the content representations in our current model.

Investigating incremental learning models to solve the knowledge updating problem in the current probabilistic model.

Adding more data sets with textual and visual content of web pages for both true and false web pages.



