pethaperumal-perumal
8/10/2019 Textual and Visual Content Based Anti-Phishing First Review
1/53
Textual and Visual Content Based Anti-Phishing: An SVM Approach
Abstract
A novel framework using an SVM (Support Vector Machine) approach for content-based
phishing web page detection is presented. Our model takes into account
textual and visual contents to measure the similarity between the protected web
page and suspicious web pages. A text classifier, an image classifier, and an
algorithm fusing the results from the classifiers are introduced. An outstanding feature
of this paper is the exploration of an SVM model to estimate the matching threshold.
This is required in the classifier for determining the class of the web page and
identifying whether the web page is phishing or not. In the text classifier, the naive
SVM rule is used to calculate the probability that a web page is phishing. In the
image classifier, the earth mover's distance is employed to measure the visual
similarity, and our SVM model is designed to determine the threshold. In the data
fusion algorithm, the SVM theory is used to synthesize the classification results
from textual and visual content. The effectiveness of our proposed approach was
examined in a large-scale dataset collected from real phishing cases. Experimental
results demonstrated that the text classifier and the image classifier we designed
deliver promising results, the fusion algorithm outperforms either of the individual
classifiers, and our model can be adapted to different phishing cases.
Introduction
Malicious people, also known as phishers, create phishing web pages, i.e.,
forgeries of real web pages, to steal individuals' personal information such as bank
account, password, credit card number, and other financial data. Unwary online
users can be easily deceived by these phishing web pages because of their high
similarities to the real ones. The Anti-Phishing Working Group reported that there
were at least 55,698 phishing attacks between January 1, 2009, and June 30, 2009.
The latest statistics show that phishing remains a major criminal activity involving
great losses of money and personal data.
Automatically detecting phishing web pages has attracted much attention from
security and software providers, financial institutions, and academic researchers.
Methods for detecting phishing web pages can be classified into industrial
toolbar-based anti-phishing, user-interface-based anti-phishing, and web page
content-based anti-phishing. To date, techniques for phishing detection used by the industry
mainly include authentication, filtering, attack tracing and analyzing, phishing
report generating, and network law enforcement. These anti-phishing internet
services are built into e-mail servers and web browsers and available as web
browser toolbars.
These industrial services, however, do not efficiently detect all phishing attacks.
Wu et al. conducted a thorough study and analysis of the effectiveness of
anti-phishing toolbars, which consist of three security toolbars and other commonly used
browser security indicators. The study indicates that all examined toolbars were
ineffective at protecting users from phishing attacks. Reports show that 20 out
of 30 subjects were spoofed by at least one phishing attack, 85% of the spoofed
subjects indicated that the websites looked legitimate or exactly the same as those they had visited
before, and 40% of the spoofed subjects were tricked due to poorly designed web
sites. Cranor et al. performed another study evaluating 10 anti-phishing
tools. They indicated that only one tool could consistently detect more than 60% of
phishing web sites without a high rate of false positives, whilst four tools were not
able to recognize 50% of the tested web sites. Apart from these studies on the
effectiveness of anti-phishing toolbars, Li and
The method works by first finding the associated web pages of the given webpage
and then constructing an SLN from all those web pages. A mechanism of reasoning
on the SLN is exploited to identify the phishing target. Zhang et al. developed a
content-based approach, i.e., the Carnegie Mellon Anti-phishing and Network Analysis
Tool (CANTINA), for anti-phishing by employing the idea of robust hyperlinks [1 ]. Given a
web page, this method first calculates the TF-IDF of each term, a weighting
commonly used in information retrieval, generates a lexical signature by selecting a
few terms, supplies this signature to a search engine, and then matches the domain
name of the current web page against several top search results to evaluate whether the current
web page is legitimate or not. Another content-based technique, B-APT, is designed
to identify phishing websites by using an open-source Bayesian filter on the basis
of tokens which are extracted by a document object model (DOM) analyzer.
The concept of a visual approach to phishing detection was first introduced by Liu et
al. This approach, which is oriented by the DOM-based visual similarity of web
pages, first decomposes the web pages into salient block regions. The visual
similarity between two web pages is then evaluated by three metrics, namely, block
level similarity, layout similarity, and overall style similarity, which are based on
the matching of the salient block regions. Fu et al. followed the overall strategy,
but proposed another method to calculate the visual similarity of web pages. They
first converted
The main objectives of this project are as follows:
To detect phishing web pages by using the SVM algorithm.
To classify the web page by using textual and visual SVM classification algorithms.
To combine the classification results for textual and visual content by using a
fusion algorithm.
To compare the fused results for genuine and fake web pages by finding the
probability, in order to determine whether the given web page is phishing or not.
Scope of Project
The main scope of the project is as follows:
To detect whether the website is a phishing website or not.
To detect whether the website has been hacked by an attacker or not.
To compare the genuine and attacked websites by examining their fusion results.
Project Description
Existing System
The phishing detection techniques used by the existing system mainly include authentication,
filtering, attack tracing and analyzing. Toolbar-based anti-phishing
guides the user to interact with trusted websites. Toolbars such as security
toolbars and browser security toolbars are used in the system. Methods for
detecting phishing web pages can be classified into industrial toolbar-based
anti-phishing, user-interface-based anti-phishing, and web page content-based
anti-phishing. Techniques for phishing detection used by the industry mainly
include authentication, filtering, attack tracing and analyzing, phishing report
generating, and network law enforcement. These anti-phishing internet services
are built into e-mail servers and web browsers and available as web browser
toolbars.
Content-based anti-phishing, which refers to using the features of web
pages, draws on surface-level characteristics, textual content, and visual
content. We clarify that the content of a web page discussed here includes the
whole information of a web page, such as the domain name, URL, hyperlinks,
terms, images, and forms embedded in the web page. Surface-level
characteristics have been commonly used by industrial toolbars to detect
phishing. For example, SpoofGuard inspects the age of the
domain, well-known logos, the URL, and links to acquire the characteristics of
phishing web pages. Liu et al. proposed the use of a semantic link network (SLN) to
automatically identify the phishing target of a given webpage.
The method works by first finding the associated web pages of the given
webpage and then constructing an SLN from all those web pages. A mechanism
of reasoning on the SLN is exploited to identify the phishing target. Zhang et al.
developed a content-based approach, i.e., the Carnegie Mellon Anti-phishing and
Network Analysis Tool (CANTINA), for anti-phishing by employing the idea of robust
hyperlinks. Given a web page, this method first calculates the TF-IDF of each
term, a weighting commonly used in information retrieval, generates a lexical
signature by selecting a few terms, supplies this signature to a search engine,
and then matches the domain name of the current web page against several top search
results to evaluate whether the current web page is legitimate or not. Another
content-based technique, B-APT, is designed to identify phishing websites by using an
open-source Bayesian filter on the basis of tokens which are extracted by a
document object model (DOM) analyzer.
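The lexical-signature idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CANTINA implementation: the tokenizer, the smoothed IDF formula, the function names, and the sample URLs are all hypothetical, and the search-engine query is left out, so `result_urls` stands in for whatever the top search results would be.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF weight.

    corpus_texts is a hypothetical background collection used only to
    estimate document frequencies.
    """
    tf = Counter(tokenize(page_text))
    docs = [set(tokenize(t)) for t in corpus_texts]
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        # Smoothed IDF (an assumption; other variants are equally valid).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def looks_legitimate(page_url, result_urls, top_n=3):
    """True if the page's host name appears among the top search results."""
    domain = urlparse(page_url).netloc.lower()
    return any(urlparse(u).netloc.lower() == domain for u in result_urls[:top_n])


corpus = ["the bank offers loans", "the weather is nice today", "the cat sat on the mat"]
signature = lexical_signature("paypal login paypal account the the", corpus, k=3)
```

In this sketch the signature terms would be sent to a search engine, and the page is treated as suspicious when its own host name is absent from the first few results.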
The concept of a visual approach to phishing detection was first introduced by
Liu et al. This approach, which is oriented by the DOM-based visual similarity
of web pages, first decomposes the web pages into salient block regions. The
visual similarity between two web pages is then evaluated by three metrics,
namely, block level similarity, layout similarity, and overall style similarity,
which are based on the matching of the salient block regions. Fu et al. followed
the overall strategy, but proposed another method to calculate the visual
similarity of web pages. They first converted
Not all phishing attacks are detected by the existing detection techniques.
The toolbar technique is an ineffective way to protect users from
phishing attacks.
Online traffic decreases the quality of web pages and their applications.
The existing approach only investigates phishing detection at the pixel level
of web pages without considering the text level.
Existing systems such as CANTINA and toolbar-based techniques are very
difficult to implement.
Not all phishing web pages are detected by the CANTINA and
toolbar-based systems.
Proposed System
The content representation of the proposed system is divided into two categories.
1) Textual content: "Textual content" in this paper is defined as the terms or
words that appear in a given web page, except for the stop words. We first separate
the main text content from
The system includes a training section, which estimates the statistics of
historical data, and a testing section, which examines the incoming testing web
pages. The statistics of the web page training set consist of the probabilities that a
textual web page belongs to the categories, the matching thresholds of the classifiers,
and the posterior probability of data fusion. Through preprocessing, content
representations, i.e., textual and visual, are rapidly extracted from a given testing
web page. The text classifier is used to classify the given web page into the
corresponding category based on the textual features. The image classifier is used
to classify the given web page into the corresponding category based on the visual
content. Then the fusion algorithm is used to combine the detection results
delivered by the two classifiers. The detection results are eventually transmitted to
the online users or the web browsers.
In preprocessing, the main contents of a given web page are first separated from
of original words. We store the stemmed words to construct the vocabulary. Given
a web page, we then form a histogram vector, where each component represents
the term frequency and n denotes the total number of components in the vector. We
explain three points here.
1) We do not extract words from all the web pages in a dataset to construct the
vocabulary, because phishers usually only use the words from a targeted web page
to scam unwary users.
2) For the sake of simplicity, we do not use any feature extraction algorithms in
the process of vocabulary construction.
3) We do not take the semantic associations of web pages into account, because
the sizes of most phishing web pages are small.
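A minimal sketch of the histogram vector described above, assuming a tiny hand-picked stop-word list and a crude suffix stripper in place of a real stemmer (both are illustrative choices, not the ones used in the project). Following point 1, the vocabulary is built from the protected page only.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "your", "for"}


def stem(word):
    """Very crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Tokenize, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_vocabulary(protected_text):
    """Vocabulary of stemmed words taken from the protected (targeted) page only."""
    return sorted(set(terms(protected_text)))


def histogram(text, vocabulary):
    """Term-frequency vector with one component per vocabulary entry (n = len(vocabulary))."""
    counts = Counter(terms(text))
    return [counts[t] for t in vocabulary]


vocabulary = build_vocabulary("Sign in to your account. Accounts are protected.")
vector = histogram("Please sign in: account sign", vocabulary)
```

Words that do not occur in the protected page (here "please") simply do not appear in the vector, which keeps n small.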
In reality, using only text content is insufficient to detect phishing web pages. This
method will usually result in high false positives, because phishing web pages are
highly similar to the targeted web pages not only in textual content but also in
visual content such as famous logos, layout, and overall style. In this system, we
use the SVM to measure the visual similarity
between an incoming web page and a protected web page.
First, we retrieve the suspected web pages and protected web pages from the web.
Second, we generate their signatures, which are used for the calculation of the
SVM similarity between them. All the web page images are first normalized into fixed-size
square images. We use these normalized images to generate the signature of each
web page.
The image classifier is implemented by setting a threshold, which is
estimated in a subsequent section. If the visual similarity between a suspected
web page and the protected web page exceeds the threshold, the web page is
classified as phishing; otherwise, it is classified as normal.
The overall implementation process of the image classifier is summarized as
follows.
Step 1: Obtain the image of a web page from its URL and perform
normalization.
Step 2: Generate the visual signature of the input image, including the color and
coordinate features.
Step 3: Calculate the visual similarity between the input web page image and
the protected web page image using the SVM approach.
Step 4: Classify the input web page into the corresponding category according to
the comparison of the visual similarity and the threshold.
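Steps 2 and 4 can be illustrated with a small pure-Python sketch. It assumes the page image has already been fetched and normalized (Step 1) and is given as rows of (r, g, b) tuples. The color quantization, the `color_overlap` score, and the threshold value are hypothetical stand-ins chosen for brevity; in particular, `color_overlap` is not the similarity measure of Step 3.

```python
from collections import defaultdict


def quantize(rgb, levels=4):
    """Map an (r, g, b) pixel with channels in 0..255 to a coarse color bucket."""
    step = 256 // levels
    return tuple(c // step for c in rgb)


def visual_signature(image, top_k=3):
    """Return [(color_bucket, weight, (centroid_x, centroid_y)), ...] for the
    top_k most frequent color buckets of a normalized image."""
    sums = defaultdict(lambda: [0, 0, 0])  # pixel count, sum of x, sum of y
    total = 0
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            entry = sums[quantize(pixel)]
            entry[0] += 1
            entry[1] += x
            entry[2] += y
            total += 1
    ranked = sorted(sums.items(), key=lambda kv: (-kv[1][0], kv[0]))[:top_k]
    return [(color, n / total, (sx / n, sy / n)) for color, (n, sx, sy) in ranked]


def color_overlap(sig_a, sig_b):
    """Stand-in similarity in [0, 1]: shared weight of matching color buckets.
    It ignores the centroid coordinates and is NOT the measure from Step 3."""
    weights_b = {color: w for color, w, _ in sig_b}
    return sum(min(w, weights_b.get(color, 0.0)) for color, w, _ in sig_a)


def classify(similarity, threshold):
    """Step 4: compare the similarity with the matching threshold."""
    return "phishing" if similarity > threshold else "normal"


RED, WHITE = (255, 0, 0), (255, 255, 255)
protected = visual_signature([[RED, RED], [RED, WHITE]])
blank = visual_signature([[WHITE, WHITE], [WHITE, WHITE]])
```

A page whose signature matches the protected page closely is flagged, while an unrelated page falls below the (assumed) threshold.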
The overall implementation procedures of the fusion algorithm are summarized as
follows.
Step 1: Input the training set, train a text classifier and an image classifier, and
then collect similarity measurements from the different classifiers.
Step 2: Partition the interval of similarity measurements into sub-intervals.
Step 3: Estimate the posterior probabilities conditioning on all the sub-intervals
for the text classifier.
Step 4: Estimate the posterior probabilities conditioning on all the sub-intervals
for the image classifier.
Step 5: For a new testing web page, classify it into the corresponding category by
using the text classifier and the image classifier.
Step 6: Display the result, i.e., whether the given web page is phishing or not.
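The fusion steps above can be sketched as follows. The sketch assumes similarity scores in [0, 1], four equal sub-intervals, add-one smoothing, and a combination rule that treats the two classifiers as independent with equal class priors. These are illustrative assumptions, and the function names are hypothetical; the project's own fusion rule may differ.

```python
def bin_index(score, n_bins):
    """Sub-interval index for a similarity score in [0, 1] (Step 2)."""
    return min(int(score * n_bins), n_bins - 1)


def estimate_posteriors(scores, labels, n_bins=4):
    """Estimate P(phishing | sub-interval) for one classifier (Steps 3 and 4).
    labels use 1 for phishing and 0 for normal; add-one smoothing is assumed."""
    phishing = [0] * n_bins
    total = [0] * n_bins
    for score, label in zip(scores, labels):
        b = bin_index(score, n_bins)
        total[b] += 1
        phishing[b] += label
    return [(p + 1) / (t + 2) for p, t in zip(phishing, total)]


def fuse(p_text, p_image):
    """Combine two posteriors, assuming independence and equal class priors."""
    numerator = p_text * p_image
    return numerator / (numerator + (1 - p_text) * (1 - p_image))


def detect(text_score, image_score, text_post, image_post, n_bins=4):
    """Steps 5 and 6: look up both posteriors, fuse them, and report the label."""
    p = fuse(text_post[bin_index(text_score, n_bins)],
             image_post[bin_index(image_score, n_bins)])
    return ("phishing" if p > 0.5 else "normal"), p


train_scores = [0.1, 0.2, 0.9, 0.95]
train_labels = [0, 0, 1, 1]
text_post = estimate_posteriors(train_scores, train_labels)
image_post = estimate_posteriors(train_scores, train_labels)
label_hi, p_hi = detect(0.92, 0.97, text_post, image_post)
label_lo, p_lo = detect(0.05, 0.10, text_post, image_post)
```

With this toy training set, two moderately confident classifiers (0.75 each) fuse into a more confident decision, which is the intended effect of combining them.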
Advantages
The data fusion framework enables us to directly incorporate the multiple
results produced by different classifiers.
The SVM algorithm is used for classifying both the textual and visual
content.
The approach aims to detect all phishing websites.
Literature Survey
Detecting phishing web pages with visual similarity assessment based on
earth mover's distance
An effective approach to phishing Web page detection is proposed, which uses
Earth Mover's Distance (EMD) to measure Web page visual similarity. We first
convert the involved Web pages into low resolution images and then use color
and coordinate features to represent the image signatures. We use EMD to
calculate the signature distances of the images of the Web pages. We train an
EMD threshold vector for classifying a Web page as a phishing or a normal one.
Large-scale experiments with 10,281 suspected Web pages are carried out to
show high classification precision, phishing recall, and applicable time
performance for online enterprise solution. We also compare our method with
two others to manifest its advantage. We also built up a real system which is
already used online and it has caught many real phishing cases.
Phishing web pages are forged web pages that are created by malicious people
to mimic web pages of real web sites. Most of these kinds of web pages have
high visual similarities to scam their victims. Some of these kinds of web pages
look exactly like the real ones. Unwary Internet users may be easily deceived
by this kind of scam. Victims of phishing web pages may expose their bank
account, password, credit card number, or other important information to the
phishing Web page owners. Phishing is a relatively new Internet crime in
comparison with other forms, e.g., virus and hacking. More and more phishing
Web pages have been found in recent years in an accelerative way. A report
from the Anti-Phishing Working Group shows that the number of phishing Web
pages is increasing each month by 50 percent and usually 5 percent of the
phishing e-mail receivers will respond to the scams. Also, there were 15,050
phishing cases reported simply in one month in June 2005. This problem has
drawn high attention from both industry and the academic research domain
since it is a severe security and privacy problem and has caused huge negative
impacts on the Internet world. It is threatening people's confidence to use the
Web to conduct online finance-related activities.
In this system, we propose an effective approach for detecting phishing Web
pages, which employs the Earth Mover's Distance (EMD) to calculate the
visual similarity of Web pages. The most important reason that Internet users
could become phishing victims is that phishing Web pages always have high
visual similarity with the real Web pages, such as visually similar block layouts,
dominant colors, images, and fonts, etc. We follow an existing anti-phishing strategy
to obtain suspected Web pages, which are supposed to be collected from URLs
in those e-mails containing keywords associated with protected Web pages. We
first convert them into normalized images and then represent their image
signatures with features composed of dominant color category and its
corresponding centroid coordinate to calculate the visual similarity of two Web
pages.
The linear programming algorithm for EMD is applied to visual similarity
computation of the two signatures. An anti-phishing system may be requested to
protect many Web pages. A threshold is calculated for each protected Web page
using supervised training. If the EMD-based visual similarity of a Web page
exceeds the threshold of a protected Web page, we classify the Web page as a
phishing one.
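For intuition, the sketch below computes EMD in the one-dimensional special case: two histograms over the same ordered bins with unit distance between neighbouring bins. In that case the optimal transport cost reduces to a running sum of differences, so no linear-programming solver is needed. This is a simplification for illustration, not the color-and-coordinate signature computation described above, and the similarity normalization and threshold are assumptions.

```python
def emd_1d(hist_a, hist_b):
    """EMD between two equal-length histograms over the same ordered bins.
    Both histograms are normalized to total mass 1 before comparison."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    carried = 0.0  # mass that still has to be moved to later bins
    work = 0.0     # accumulated mass * distance
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b
        work += abs(carried)
    return work


def emd_similarity(hist_a, hist_b):
    """Map the distance to a similarity in [0, 1]; 1 means identical.
    Requires at least two bins."""
    max_distance = len(hist_a) - 1  # all mass moved from the first to the last bin
    return 1.0 - emd_1d(hist_a, hist_b) / max_distance


def is_phishing(similarity, threshold):
    """Classify as phishing when the similarity exceeds the page's threshold."""
    return similarity > threshold


far = emd_1d([1, 0, 0], [0, 0, 1])
near = emd_1d([2, 0], [1, 1])
```

The general case with two-dimensional centroid coordinates is a transportation problem and would need a linear-programming solver.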
Evolving with the anti-phishing techniques, various phishing techniques and
more complicated and hard-to-detect methods are used by phishers. The most
straightforward way for a phisher to scam people is to make the phishing Web
pages similar to their targets.
A phishing strategy includes both Web link obfuscation and Web page
obfuscation. Web link obfuscation can be carried out in four basic ways, including adding
a suffix to a domain name of the URL, using an actual link different from the
visible link, and utilizing system bugs in real Web sites to redirect the link to the
phishing Web pages. Previous research works on duplicated document detection
approaches focus on plain text documents and use pure text features in
similarity measure, such as collection statistics, syntactic analysis, displaying
structure, visual-based understanding, vector space model, etc.
Phishing is considered as one of the most serious threats for the Internet and
e-commerce. Phishing attacks abuse trust with the help of deceptive e-mails,
fraudulent web sites and malware. In order to prevent phishing attacks some
organizations have implemented Internet browser tool-bars for identifying
deceptive activities.
of target name in the URL or only IP address without host name. These
ambiguous domain names are hazardous for careless consumers.
Because of the careless usability security design, phishers can easily take
advantage of poor usability design. In order to offer more reliable security,
anti-phishing tool-bars should be easier to use. Moreover, as end-users must be able
to use the toolbars and make correct choices, usability evaluation of these
toolbars is important. Our research objective was to find out general usability
design principles for anti-phishing client side applications. Such information
may result in valuable information for improving usability and security of
anti-phishing applications. Based on this motivation, we conducted the heuristic
usability evaluation of five toolbars.
spoofed.
Web wallet: Preventing phishing attacks by revealing user intentions
We introduce a new anti-phishing solution, the Web Wallet. The Web Wallet is a
browser sidebar which users can use to submit their sensitive information
online. It detects phishing attacks by determining where users intend to submit
their information and suggests an alternative safe path to their intended site if
the current site does not match it. It integrates security questions into the user's
workflow so that its protection cannot be ignored by the user. We conducted a
user study on the Web Wallet prototype and found that the Web Wallet is a
promising approach. In the study, it significantly decreased the spoof rate of
typical phishing attacks from 63% to 7%, and it effectively prevented all
phishing attacks as long as it was used. A majority of the subjects successfully
learned to depend on the Web Wallet to submit their login information.
sensitive data, she presses a dedicated security key on the keyboard to open the
Web Wallet. Using the Web Wallet, she may type her data or retrieve her stored
data. The data is then filled into the web form. But before the fill-in, the Web
Wallet checks if the current site is good enough to receive the sensitive data.
If the current site is not qualified, the Web Wallet requires the user to explicitly
indicate where she wants the data to go. If the user's intended site is not the
current site, the Web Wallet shows a warning to the user about this discrepancy,
and gives her a safe path to her intended site. There is one simple rule to
correctly use the Web Wallet: "Always use the Web Wallet to submit sensitive
information by pressing the security key first." Equivalently, "never submit
sensitive information directly through a web form because it is not a secure
practice."
We have run a user study to test the Web Wallet interface. The results are
promising:
• The Web Wallet significantly decreased the spoof rate of normal phishing
attacks from 63% to 7%.
• All the simulated phishing attacks in the study were effectively prevented by
the Web Wallet as long as it was used.
• By disabling direct input into web forms and thus making itself the only way
to input sensitive information, the Web Wallet successfully trained a majority of
the subjects to use it to protect their sensitive information submission.
But there are also negative results which we plan to deal with in future research:
• The subjects totally failed to differentiate the authentic Web Wallet interface
from a fake Web Wallet presented by a phishing site. This is a new type of
phishing attack. Instead of mimicking a legitimate site's appearance, the
attacker fakes the interface of security software that is run by the user.
• It is not easy to completely stop all subjects from typing sensitive information
directly into web forms. Users are familiar with web form submission and have
a strong tendency to use it.
Phishing attacks exploit the gap between the way a user perceives a
communication and the actual effect of the communication. The computer
system and the human user have two different understandings of a web site. The
user recognizes a site based on its visual appearance and the semantic meaning
of its content. But the browser recognizes a site based on system properties,
e.g., whether the site has an SSL certificate, when and where this site registered,
etc. As a result, neither the computer system nor the human user alone can
effectively prevent phishing attacks.
On the one hand, it is hard, if not impossible, for the computer to always
correctly derive the semantic meaning of the content. On the other hand,
ordinary users do not know how to correctly interpret the system properties.
The user interface is thus the exact place to bridge the gap between the user's
mental model and the system model by letting the human user and the system
share what they individually know about the current site. The Web Wallet helps
the users transfer their real intention to the browser, especially when they are
doing phishing-critical actions, such as submitting sensitive data to web sites.
When a user uses the Web Wallet, a dedicated interface for sensitive information
submission, she implicitly indicates that the data being submitted is sensitive. The
user further indicates the sensitive data type by using the appropriate card in the
Web Wallet.
Intelligent phishing website detection system using fuzzy techniques
Detecting and identifying e-banking phishing websites is really a complex and
dynamic problem involving many factors and criteria. Because of the subjective
considerations and the ambiguities involved in the detection, Fuzzy Data
Mining techniques can be an effective tool in assessing and identifying
e-banking phishing websites since they offer a more natural way of dealing with
quality factors rather than exact values. In this system, we present a novel
approach to overcome the fuzziness in the e-banking phishing website
assessment and propose an intelligent, resilient and effective model for detecting
e-banking phishing websites. The proposed model is based on fuzzy logic
combined with data mining algorithms to characterize the e-banking phishing
website factors and to investigate its techniques by classifying the phishing
types and defining six e-banking phishing website attack criteria with a layer
structure. A case study was applied to illustrate and simulate the phishing
process. Our experimental results showed the significance and importance of
the e-banking phishing website criteria represented by layer one and the varied
influence of the phishing characteristic layers on the final e-banking phishing
website rate.
E-banking phishing websites are forged websites that are created by malicious
people to mimic real e-banking websites. Most of these kinds of Web pages
have high visual similarities to scam their victims. Some of these Web pages
look exactly like the real ones. Unwary Internet users may be easily deceived
by this kind of scam. Victims of e-banking phishing websites may expose their
bank account, password, credit card number, or other important information to
the phishing Web page owners. The impact is the breach of information security
through the compromise of confidential data and the victims may finally suffer
losses of money or other kinds. Phishing is a relatively new Internet crime in
comparison with other forms, e.g., virus and hacking.
E-banking phishing is a very complex issue to understand and to
analyze, since it joins technical and social problems with each other, for
which there is no known single silver bullet to entirely solve it. The motivation
behind this study is to create a resilient and effective method that uses Fuzzy
Data Mining algorithms and tools to detect e-banking phishing websites in an
automated manner. DM approaches such as neural networks, rule induction, and
decision trees can be a useful addition to the fuzzy logic model. They can deliver
answers to business questions that traditionally were too time consuming to
resolve, such as "Which are the most important e-banking phishing website
characteristic indicators and why?", by analyzing massive databases and
historical data for training purposes.
Fuzzy Data Mining Algorithms & Techniques
The approach described here is to apply fuzzy logic and data mining algorithms
to assess e-banking phishing website risk on the 27 characteristics and factors
which stamp the forged website. The essential advantage offered by fuzzy logic
techniques is the use of linguistic variables to represent key phishing
characteristic indicators and relate them to e-banking phishing website probability.
1) Fuzzification
In this step, linguistic descriptors such as
between classes. The degree of belongingness of the values of the variables
to any selected class is called the degree of membership. A membership
function is designed for each phishing characteristic indicator, which is a
curve that defines how each point in the input space is mapped to a
membership value between [0, 1]. Linguistic values are assigned for each
phishing indicator as Low, Moderate, and
3) Aggregation of the rule outputs
This is the process of unifying the outputs of all discovered rules: the
membership functions of all the rule consequents, previously scaled, are
combined into a single fuzzy set.
4) De-fuzzification
This is the process of transforming a fuzzy output of a fuzzy inference
system into a crisp output. Fuzziness helps to evaluate the rules, but the final
output has to be a crisp number. The input for the de-fuzzification process is
the aggregate output fuzzy set and the output is a number. This step was
done using the Centroid technique since it is a commonly used method.
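The membership, aggregation, and centroid steps can be sketched in plain Python. The triangular membership shape, the 0 to 100 risk scale, the three output sets, and the rule strengths below are all assumptions made for illustration; the source does not specify them.

```python
def triangle(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical output sets for a phishing-risk score on a 0..100 scale.
OUTPUT_SETS = {
    "low": (-50, 0, 50),
    "moderate": (0, 50, 100),
    "high": (50, 100, 150),
}


def aggregate(rule_strengths, x):
    """Aggregated membership at x: each rule's consequent is clipped at the
    rule's firing strength, and the clipped sets are combined with max."""
    return max(
        (min(strength, triangle(x, *OUTPUT_SETS[name]))
         for name, strength in rule_strengths.items()),
        default=0.0,
    )


def defuzzify(rule_strengths, step=1):
    """Centroid (center of gravity) of the aggregated fuzzy set, sampled on a grid."""
    xs = range(0, 101, step)
    memberships = [aggregate(rule_strengths, x) for x in xs]
    total = sum(memberships)
    if total == 0:
        return 0.0
    return sum(x * m for x, m in zip(xs, memberships)) / total


# Fuzzification of a crisp indicator value, then a crisp risk score.
moderate_degree = triangle(25, 0, 50, 100)
risk = defuzzify({"high": 0.8, "low": 0.1})
```

Here two fired rules ("high" at strength 0.8 and "low" at 0.1) are combined into one fuzzy set, and the centroid turns that set into a single risk number.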
There are a number of challenges posed by doing post-hoc classification of
e-banking phishing websites. Most of these challenges only apply to the e-banking
phishing websites data and materialize as a form of missing information, which has the net
effect of increasing the false negative rate. The age of the dataset is the most
significant problem, which is particularly relevant with the phishing corpus.
E-banking phishing websites are short-lived, often lasting only on the order of 48
hours. Some of our features can therefore not be extracted from older websites,
making our tests difficult. The average phishing site stays live for approximately
2.25 days. Furthermore, the process of transforming the original e-banking
phishing website archives into record feature datasets is not without error. It
requires the use of heuristics at several steps. Thus high accuracy from the data
mining algorithms cannot be expected.
CANTINA: A content-based approach to detecting phishing web sites
Phishing is a significant problem involving fraudulent email and web sites that
trick unsuspecting users into revealing private information. In this paper, we
present the design, implementation, and evaluation of CANTINA, a novel,
content-based approach to detecting phishing web sites, based on the TF-IDF
information retrieval algorithm. We also discuss the design and evaluation of
several heuristics we developed to reduce false positives. Our experiments show
that CANTINA is good at detecting phishing sites, correctly labeling
approximately 95% of phishing sites.
7ecentl!" there has been a dramatic increase in phishing" a kind of attack in
which victims are tricked b! spoofed emails and fraudulent web sites into
giving up personal information. *hishing is a rapidl! growing problem" with
/"3 uni$ue phishing sites reported in 1une of 344 alone. %t is unknown
precisel! how much phishing costs each !ear since impacted industries are
reluctant to release figures estimates range from Q2 billion to 3.0 billion per
!ear. #o respond to this threat" software vendors and companies have released a
variet! of anti-phishing toolbars.
=or e ample" eCa! offers a free toolbar that can positivel! identif! eCa!-owned
sites" and ,oogle offers a free toolbar aimed at identif!ing an! fraudulent site.
As of September 344 " the free software download site download.com" listed 09
anti-phishing toolbars.
In this system, we present the design, implementation, and evaluation of
CANTINA, a novel content-based approach for detecting phishing web sites.
CANTINA examines the content of a web page to determine whether it is
legitimate or not, in contrast to other approaches that look at surface
characteristics of a web page, for example the URL and its domain name.
CANTINA makes use of the well-known TF-IDF algorithm used in information
retrieval, and more specifically, the Robust Hyperlinks algorithm
developed by Phelps and Wilensky for overcoming broken hyperlinks. Our
results show that CANTINA is quite good at detecting phishing sites, detecting
94-97% of phishing sites.
We also show that we can use CANTINA in conjunction with heuristics used
by other tools to reduce false positives, while lowering phish detection rates
only slightly. We present a summary evaluation, comparing CANTINA to two
popular anti-phishing toolbars that are representative of the most effective tools
for detecting phishing sites currently available. Our experiments show that
CANTINA has comparable or better performance to SpoofGuard with far fewer
false positives, and does about as well as Netcraft. Finally, we show that
CANTINA combined with heuristics is effective at detecting phishing URLs in
users' actual email, and that its most frequent mistake is labeling spam-related
URLs as phishing.
TF-IDF is an algorithm often used in information retrieval and text mining.
TF-IDF yields a weight that measures how important a word is to a document in a
corpus. The importance increases proportionally to the number of times a word
appears in the document, but is offset by the frequency of the word in the
corpus. The term frequency (TF) is simply the number of times a given term
appears in a specific document. This count is usually normalized to prevent a
bias towards longer documents to give a measure of the importance of the term
within the particular document. The inverse document frequency (IDF) is a
measure of the general importance of the term. Roughly speaking, the IDF
measures how common a term is across an entire collection of documents.
Thus, a term has a high TF-IDF weight by having a high term frequency in a
given document and a low document frequency across the collection.
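The weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this system: it assumes documents are already tokenized into lists of words, that the document being scored is itself part of the corpus, and it uses the common tf * log(N / df) variant. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one tokenized document.

    doc_tokens -- list of terms in the document being scored
    corpus     -- list of tokenized documents; assumed to include doc_tokens,
                  so every term has a document frequency of at least 1
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)               # normalized term frequency
        df = sum(1 for d in corpus if term in d)   # documents containing the term
        idf = math.log(n_docs / df)                # rarer terms get a larger IDF
        weights[term] = tf * idf
    return weights


# Toy corpus of three tokenized pages (illustrative only).
corpus = [
    ["paypal", "login", "account", "paypal"],
    ["news", "account", "login"],
    ["weather", "news"],
]
weights = tf_idf(corpus[0], corpus)
```

In this toy corpus "paypal" is frequent in the first page and absent elsewhere, so it receives the largest weight, while a term that appears in every document would receive a weight of zero.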
CANTINA works as follows:
Given a web page, calculate the TF-IDF scores of each term on that web
page.
Generate a lexical signature by taking the five terms with highest
TF-IDF weights.
Feed this lexical signature to a search engine, which in our case is
Google.
If the domain name of the current web page matches the domain name of
the N top search results, we consider it to be a legitimate web site.
Otherwise, we consider it a phishing site.
Our technique makes the assumption that Google indexes the vast majority
of legitimate web sites, and that legitimate sites will be ranked higher than
phishing sites. Combined, these assumptions suggest that a phishing scam will
rarely, if ever, be highly ranked. At the end of this paper, however, we discuss
some ways of possibly subverting CANTINA.
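The steps above can be sketched as follows. This is a hedged Python illustration, not the original CANTINA code: the search engine is replaced by a caller-supplied function (`search` is a hypothetical stand-in that returns result URLs, since no real search API is assumed), the TF-IDF weights are taken as given, and domain matching is simplified to comparing host names with a leading "www." removed.

```python
from urllib.parse import urlparse


def lexical_signature(weights, size=5):
    """Return the `size` terms with the highest TF-IDF weights."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


def hostname(url):
    """Lower-cased host name of a URL, without a leading 'www.' (simplification)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_legitimate(page_url, signature, search, top_n):
    """True if the page's host appears among the top_n search results."""
    results = search(" ".join(signature))[:top_n]
    page_host = hostname(page_url)
    return any(hostname(result) == page_host for result in results)


# Hypothetical stand-in for a search engine: always returns the same results.
def fake_search(query):
    return ["https://www.paypal.com/signin", "https://en.wikipedia.org/wiki/PayPal"]


weights = {"paypal": 0.55, "login": 0.10, "account": 0.09,
           "secure": 0.08, "verify": 0.07, "the": 0.0}
signature = lexical_signature(weights)
genuine = looks_legitimate("https://www.paypal.com/login", signature, fake_search, top_n=2)
spoofed = looks_legitimate("http://paypa1-secure.example/login", signature, fake_search, top_n=2)
```

Passing the search function in as a parameter keeps the sketch testable offline; a real deployment would have to supply its own query mechanism and choose N.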
Age of Domain
This heuristic checks the age of the domain name. Many phishing sites have
domains that are registered only a few days before phishing emails are sent
out. We use a W
measures the number of months from when the domain name was first
registered. If the page has been registered longer than 12 months, the
heuristic will return 1, deeming it as legitimate and otherwise returns -1,
deeming it as phishing. If the W
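The 12-month rule can be sketched in Python as below. The registration date is passed in directly; how it is looked up, and what happens when no record is found, are outside this sketch. The function name and the example dates are assumptions for illustration.

```python
from datetime import date


def domain_age_heuristic(registered_on, today):
    """Return 1 (legitimate) if the domain is older than 12 months, else -1."""
    try:
        one_year_later = registered_on.replace(year=registered_on.year + 1)
    except ValueError:
        # Registered on 29 February: the following year has no such day.
        one_year_later = registered_on.replace(year=registered_on.year + 1, day=28)
    return 1 if today > one_year_later else -1


old_domain = domain_age_heuristic(date(2005, 1, 10), date(2006, 9, 1))
new_domain = domain_age_heuristic(date(2006, 8, 28), date(2006, 9, 1))
```

A domain registered exactly 12 months ago is not yet "longer than 12 months" and therefore still scores -1 in this sketch.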
retrieving the page. Combined with the limited size of the browser address
bar, this makes it possible to write URLs that appear legitimate within the
address bar, but actually cause the browser to retrieve a different page. This
heuristic is used by Mozilla Firefox. Dashes are also rarely used by
legitimate sites, so we use this as another heuristic. SpoofGuard checks for
both at symbols and dashes in URLs.
Suspicious Links
This heuristic applies the URL check above to all the links on the page. If
any link on a page fails this URL check, then the page is labeled as a
possible phishing scam. This heuristic is also used by SpoofGuard.
IP Address
This heuristic checks if a page's domain name is an IP address. This
heuristic is also used in PILFER.
Dots in URL
This heuristic checks the number of dots in a page's URL. We found that
phishing pages tend to use many dots in their URLs but legitimate sites
usually do not. Currently, this heuristic labels a page as phish if there are
5 or more dots. This heuristic is also used in PILFER.
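The URL-based heuristics above are simple enough to sketch directly. The Python below is an illustration under stated assumptions, not the code of any of the named tools: the dash check is applied to the host name only, the dot threshold is a parameter (the value 5 in the usage lines is an assumption), and all function names and example URLs are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def has_at_or_dash(url):
    """Suspicious URL: an '@' anywhere in the URL, or a '-' in the host name."""
    host = urlparse(url).hostname or ""
    return "@" in url or "-" in host


def has_suspicious_link(links):
    """Suspicious Links: apply the URL check to every link on the page."""
    return any(has_at_or_dash(link) for link in links)


def host_is_ip(url):
    """IP Address: True if the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def too_many_dots(url, threshold):
    """Dots in URL: True if the URL contains `threshold` or more dots."""
    return url.count(".") >= threshold


at_trick = has_at_or_dash("http://www.paypal.com@203.0.113.7/login")
ip_host = host_is_ip("http://203.0.113.7/login")
dotted = too_many_dots("http://a.b.c.d.e.example/", 5)
plain = too_many_dots("https://www.example.com/", 5)
```

Each function returns a boolean so the results can be combined or weighted by whatever scoring scheme the classifier uses.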
Forms
This heuristic checks if a page contains any
Software Description
Java
Java is a programming language originally developed by James Gosling at Sun
Microsystems (now a subsidiary of Oracle Corporation) and released in 1995 as a
core component of Sun Microsystems' Java platform. The language derives much
of its syntax from C and C++ but has a simpler object model and fewer low-level
facilities. Java applications are typically compiled to byte code (class file) that can
run on any Java Virtual Machine (JVM) regardless of computer architecture. Java
is a general-purpose, concurrent, class-based, object-oriented language that is
specifically designed to have as few implementation dependencies as possible. It is
intended to let application developers "write once, run anywhere." Java is currently
one of the most popular programming languages in use, particularly for
client-server web applications.
The original and reference implementation Java compilers, virtual machines, and
class libraries were developed by Sun from 1995. As of May 2007, in compliance
with the specifications of the Java Community Process, Sun relicensed most of its
Java technologies under the GNU General Public License. Others have also
developed alternative implementations of these Sun technologies, such as the GNU
Compiler for Java and GNU Classpath.
Java Platform:
One characteristic of Java is portability, which means that computer programs
written in the Java language must run similarly on any hardware/operating-system
platform. This is achieved by compiling the Java language code to an intermediate
representation called Java byte code, instead of directly to platform-specific
machine code. Java byte code instructions are analogous to machine code, but are
intended to be interpreted by a virtual machine (VM) written specifically for the
host hardware. End-users commonly use a Java Runtime Environment (JRE)
installed on their own machine for standalone Java applications, or in a Web
browser for Java applets.
Standardized libraries provide a generic way to access host-specific features such
as graphics, threading, and networking.
A major benefit of using byte code is porting.
the NetBeans runtime container is an execution environment that understands
what a module is, handles its lifecycle, and enables it to interact with other
modules in the same application.
Registration of various objects, files and hints into layer is pretty central to the way
NetBeans based applications handle communication between modules. This page
summarizes the list of such extension points defined by modules with API.
Context menu actions are read from the layer folder
Loaders/text/x-ant+xml/Actions.
Keymaps folder contains subfolders for individual keymaps (Emacs, JBuilder,
NetBeans). The name of keymap can be localized. Use
"SystemFileSystem.localizingBundle" attribute of your folder for this purpose.
Individual keymap folder contains shadows to actions. Shortcut is mapped to the
name of file. Emacs shortcut format is used, multikeys are separated by space chars
("C-X P" means Ctrl+X followed by P). "currentKeymap" property of "Keymaps"
folder contains original (not localized) name of current keymap.
This folder contains registration of shortcuts. It is supported for backward
compatibility purposes only. All new shortcuts should be registered in
"Keymaps/NetBeans" folder. Shortcuts installed in Shortcuts folder will be added
to all keymaps, if there is no conflict. It means that if the same shortcut is mapped
to different actions in Shortcut folder and current keymap folder (like
Keymap/NetBeans), the Shortcuts folder mapping will be ignored.
* DatabaseExplorerLayerAPI in Database Explorer
* Loaders-text-dbschema-Actions in Database Explorer
* Loaders-text-sql-Actions in Database Explorer
* PluginRegistration in Java EE Server Registry
XML layer contract for registration of server plug-ins and instances that
implement optional capabilities of server plug-ins. Plug-ins with server-specific
deployment descriptor files should declare the full list in XML layer as specified in
the document plugin-layer-file.html from the above link.
"Projects/org-netbeans-modules-java-j2seproject/Customizer" folder's content
is used to construct the project's customizer. Its content is expected to be
ProjectCustomizer.CompositeCategoryProvider instances. The lookup passed to
the panels contains an instance of Project and
org.netbeans.modules.java.j2seproject.ui.customizer.J2SEProjectProperties. Please
note that the latter is not part of any public APIs and you need implementation
dependency to make use of it.
"Projects/org-netbeans-modules-java-j2seproject/Nodes" folder's content is
used to construct the project's child nodes. Its content is expected to be
NodeFactory instances.
"Projects/org-netbeans-modules-java-j2seproject/Lookup" folder's content is
used to construct the project's additional lookup. Its content is expected to be
LookupProvider instances. J2SE project provides LookupMergers for Sources,
PrivilegedTemplates and RecommendedTemplates. Implementations added by 3rd
parties will be merged into a single instance in the project's lookup.
Use OptionsDialog folder for registration of custom top level options panels.
Register your implementation of OptionsCategory there (*.instance file). Standard
file systems sorting mechanism is used.
Use OptionsDialog/Advanced folder for registration of custom panels to
Miscellaneous Panel. Register your implementation of AdvancedCategory there
(*.instance file). Standard file systems sorting mechanism is used.
Use OptionsExport/<MyCategory> folder for registration of items for
export/import of options. Registration in layers looks as follows
Source files must be named after the public class they contain, appending the suffix
.java, for example,
The keyword void indicates that the main method does not return any value to the
caller. If a Java program is to exit with an error code, it must call System.exit()
explicitly.
The method name "main" is not a keyword in the Java language. It is simply the
name of the method the Java launcher calls to pass control to the program. Java
classes that run in managed environments such as applets and Enterprise
JavaBeans do not use or need a main() method. A Java program may contain
multiple classes that have main methods, which means that the VM needs to be
explicitly told which class to launch from.
The main method must accept an array of String objects. By convention, it is
referenced as args although any other legal identifier name can be used. Since Java
5, the main method can also use variable arguments, in the form of public static
void main(String... args), allowing the main method to be invoked with an arbitrary
number of String arguments. The effect of this alternate declaration is semantically
identical (the args parameter is still an array of String objects), but allows an
alternative syntax for creating and passing the array.
The Java launcher launches Java by loading a given class (specified on the
command line or as an attribute in a JAR) and starting its public static void
main(String[]) method. Stand-alone programs must declare this method explicitly.
The String[] args parameter is an array of String objects containing any arguments
passed to the class. The parameters to main are often passed by means of a
command line.
Printing is part of a Java standard library: The System class defines a public static
field called out. The out object is an instance of the PrintStream class and provides
many methods for printing data to standard out, including println(String) which
also appends a new line to the passed string.
Java: High-level Language
A high-level programming language developed by Sun Microsystems. Java was
originally called OAK, and was designed for handheld devices and set-top boxes.
Oak was unsuccessful so in 1995 Sun changed the name to Java and modified the
language to take advantage of the burgeoning World Wide Web.
Java is an object-oriented language similar to C++, but simplified to eliminate
language features that cause common programming errors. Java source code files
(files with a .java extension) are compiled into a format called byte code (files with
a .class extension), which can then be executed by a Java interpreter. Compiled
Java code can run on most computers because Java interpreters and runtime
environments, known as Java Virtual Machines (VMs), exist for most operating
systems, including UNIX, the Macintosh OS, and Windows. Byte code can also be
converted directly into machine language instructions by a just-in-time compiler
(JIT).
Java is a general purpose programming language with a number of features that
make the language well suited for use on the World Wide Web. Small Java
applications are called Java applets and can be downloaded from a Web server and
run on your computer by a Java-compatible Web browser, such as Netscape
Navigator or Microsoft Internet Explorer.
Object-oriented software development matured significantly during the past
several years. The convergence of object-oriented modeling techniques and
notations, the development of object-oriented frameworks and design patterns, and
the evolution of object-oriented programming languages have been essential in the
progression of this technology.
Object-Oriented Software Development using Java: Principles, Patterns, and
Frameworks contains a very applied focus that develops skills in designing
software, particularly in writing well-designed, medium-sized object-oriented
programs. It provides a broad and coherent coverage of object-oriented technology,
including object-oriented modeling using the Unified Modeling Language (UML),
object-oriented design using Design Patterns, and object-oriented programming
using Java.
NetBeans
The NetBeans Platform is a reusable framework for simplifying the development
of Java Swing desktop applications. The NetBeans IDE bundle for Java SE
contains what is needed to start developing NetBeans plug-ins and NetBeans
Platform based applications; no additional SDK is required.
Applications can install modules dynamically. Any application can include the
Update Center module to allow users of the application to download
digitally-signed upgrades and new features directly into the running application.
Reinstalling an upgrade or a new release does not force users to download the
entire application again.
The platform offers reusable services common to desktop applications, allowing
developers to focus on the logic specific to their application. Among the features of
the platform are:
User interface management (e.g. menus and toolbars)
User settings management
Storage management (saving and loading any kind of data)
Window management
Wizard framework (supports step-by-step dialogs)
NetBeans Visual Library
Integrated Development Tools
Wamp Server
WAMPs are packages of independently-created programs installed on computers
that use a Microsoft Windows operating system. WAMP is an acronym formed
from the initials of the operating system Microsoft Windows and the principal
components of the package: Apache, MySQL and one of P
System Architecture
Modules
Loading web page training set.
Textual and visual content feature extraction.
Text and image classification.
Fusing of detected results.
Comparison of detected fusion results.
Module Description
Loading web page training set
Loading the phishing web pages into the database.
Loading the protected web pages into the database.
Textual and visual content feature extraction
Extraction of textual content of web page by using extraction algorithms.
Extraction of visual content of web page by using extraction algorithms.
The textual feature extraction is done by using
Fusion algorithm is used for merging or joining the textual and visual
classified results.
Comparison of detected fusion results
The detected fusion results will be compared with original web page.
The posteriori probability will be found by using the similarity.
By this probability the fusion results of false and true web pages will be
compared.
The false web page is compared with the true web page.
The detected results will be shown to the user.
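The report does not spell out the fusion rule, so the following Python sketch only illustrates one common way to combine two classifier outputs into a single posterior probability. It is not the SVM-based fusion of this system: it swaps in a plain Bayes product rule, assumes the text and image classifiers are conditionally independent, and assumes each already reports a probability that the page is phishing. The function name, the prior, and the example scores are all hypothetical.

```python
def fuse_posteriors(p_text, p_image, prior=0.5):
    """Combine two phishing probabilities with a Bayes product rule.

    p_text, p_image -- P(phishing | evidence) from each classifier
    prior           -- assumed prior probability of phishing, strictly in (0, 1)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    phish = p_text * p_image / prior
    legit = (1.0 - p_text) * (1.0 - p_image) / (1.0 - prior)
    if phish + legit == 0.0:
        raise ValueError("classifiers are in total contradiction (one says 0, the other 1)")
    return phish / (phish + legit)


agree = fuse_posteriors(0.9, 0.8)      # both classifiers lean towards phishing
disagree = fuse_posteriors(0.9, 0.1)   # the classifiers cancel each other out
```

When both classifiers agree, the fused probability is more confident than either alone; when they disagree symmetrically, the result falls back to the prior.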
System Requirements
Software Requirement
Operating System : Windows XP
Language : Core Java
Version : JDK 1.
IDE : NetBeans .2
Database : MySQL
Hardware Requirements
Processor : Pentium IV
Clock Speed : 2.7 GHz
RAM Capacity : 1 GB
Conclusion
A new content-based anti-phishing system has been thoroughly developed. In this
system, we presented a new framework to solve the anti-phishing problem. The
new features of this framework can be represented by a text classifier, an image
classifier, and a fusion algorithm. Based on the textual content, the text classifier is
able to classify a given web page into corresponding categories as phishing or
normal. This text classifier was modeled by SVM rule. Based on the visual content,
the image classifier, which relies on SVM, is able to calculate the visual similarity
between the given web page and the protected web page efficiently. The matching
threshold used in both text classifier and image classifier is effectively estimated
by using a probabilistic model derived from the SVM theory. A novel data fusion
model using the SVM theory was developed and the corresponding fusion
algorithm presented. This data fusion framework enables us to directly incorporate
the multiple results produced by different classifiers. This fusion method provides
insights for other data fusion applications. More importantly, it is worth noting that
our content-based model can be easily embedded into current industrial
anti-phishing systems.
Future Enhancement
Our future work will include adding more features to the content
representations in our current model.
Investigating incremental learning models to solve the knowledge updating
problem in the current probabilistic model.
Adding more data sets with textual and visual content of web pages for both
true and false web pages.