Learning to remove Internet advertisements. Nicholas Kushmerick, Department of Computer Science, University College Dublin, Ireland. Presented by Bo Zhang, Department of Computer Science, Michigan Technological University.
Dec 6, 2004, Michigan Technological University
Nicholas Kushmerick
Department of Computer Science,
University College Dublin, Ireland
Learning to remove Internet advertisements
Presented by Bo Zhang, Department of Computer Science, Michigan Technological University
Overview
- Background
- Introduction of ADEATER
- Design of ADEATER
- Evaluation
- Related Work
- Conclusion and Future Work
Background
Negative impact of advertisement images on the Internet
- Slow down browsing
- Consume computer resources
- Impose extra costs on users
[Figure: three example advertisement images]
Introduction of ADEATER
Definition:
- A browsing assistant that automatically removes advertisement images from Internet pages.
Property:
- Its rules are generated by a learning algorithm rather than written by hand.
Introduction of ADEATER
Examples
Design of ADEATER
System Architecture
Design of ADEATER
Encoding instance
Fixed-width feature vector
An image enclosed in an anchor tag <A> is a candidate advertisement
Geometric features of an image:
- Height: <IMG height=90>
- Width: <IMG width=90>
- Aspect ratio (ratio of width to height)
Local feature:
- Whether the destination URL and the image URL are in the same Internet domain
- www.ee.mtu.edu/page.html and www.cs.mtu.edu/image.jpg: YES
- www.dell.com/notebook.html and www.cs.mtu.edu/image.jpg: NO
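The geometric and local features above can be sketched in a few lines of Python. This is an illustration only, not ADEATER's code: the function names are hypothetical, and "same Internet domain" is approximated by comparing the last two labels of each host name, which matches the two examples on the slide but is too naive for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat the leading part as the host
    return (urlsplit(url).hostname or "").lower()


def domain_of(url):
    """Assumption: the 'domain' is the last two labels of the host name."""
    return ".".join(host_of(url).split(".")[-2:])


def same_domain(dest_url, img_url):
    """Local feature: do the destination and image URLs share a domain?"""
    return domain_of(dest_url) == domain_of(img_url)


def geometric_features(height, width):
    """Height, width and aspect ratio (width / height); None if unknown."""
    aspect = width / height if height and width else None
    return {"height": height, "width": width, "aspect_ratio": aspect}
```

With the slide's examples, `same_domain("www.ee.mtu.edu/page.html", "www.cs.mtu.edu/image.jpg")` is true and the www.dell.com pair is false. Returning `None` for a missing dimension is a design choice of this sketch.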
Design of ADEATER
Encoding instance
Fixed-width feature vector
Caption feature:
- Words occurring in the enclosing <A> tag, with phrase length ≤ K and phrase count ≥ M
- K is the maximum phrase length
- M is the minimum phrase count
Alt feature:
- Set of "alternate" words in the <IMG> tag (<IMG alt="ad">), with phrase length ≤ K and phrase count ≥ M
- K is the maximum phrase length
- M is the minimum phrase count
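A minimal sketch of how such phrase features could be turned into a fixed-width vector follows. It rests on assumptions that are not in the slides: words are lower-cased alphanumeric runs, a phrase's count is the number of training texts containing it, and phrases are joined with "+" (the notation the example rules use, as in "click+here"). All function names are hypothetical.

```python
import re
from collections import Counter


def phrases(text, k):
    """All word phrases of length 1..k in text, joined with '+'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    for n in range(1, k + 1):
        for i in range(len(words) - n + 1):
            found.add("+".join(words[i:i + n]))
    return found


def build_vocabulary(texts, k, m):
    """Keep phrases that occur in at least m of the training texts."""
    counts = Counter()
    for text in texts:
        counts.update(phrases(text, k))  # each text counts a phrase once
    return sorted(p for p, c in counts.items() if c >= m)


def encode(text, vocabulary, k):
    """Fixed-width 0/1 vector: one position per vocabulary phrase."""
    present = phrases(text, k)
    return [1 if p in present else 0 for p in vocabulary]
```

For example, with the texts "Click here to win", "click here now" and "our campus map", K = 2 and M = 2, only "click", "click+here" and "here" are kept, so every instance is encoded as a three-element vector regardless of how many words it contains.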
Design of ADEATER
Encoding instance
Fixed-width feature vector
Ubase, Udest, Uimg:
- Words occurring in the base URL, destination URL, and image URL, with phrase length ≤ K and phrase count ≥ M
- K is the maximum phrase length
- M is the minimum phrase count
Stop list:
- Low-information terms ("http", "www", "jpg", etc.)
Design of ADEATER
Encoding instance
Samples of HTML page
Design of ADEATER
Encoding of samples
Design of ADEATER
Encoding of samples (cont)
Design of ADEATER
Gathering examples
AD samples are generated by the ADGRABBER browsing assistant
- Identify candidate advertisements
- Generate vector encoding
NON-AD samples are generated by a custom-built Internet spider
- Extract images from randomly generated URLs
Design of ADEATER
Learning rules
Algorithm
- C4.5 decision tree learning algorithm
Properties
- Quick on-line execution of the classifier
- Not overly sensitive to missing features or noise
- Scales well and is insensitive to irrelevant features
Examples of rules
- If aspect ratio > 4.5833, alt doesn't contain "to" but does contain "click+here", and Udest doesn't contain "http+www", then the instance is an AD
- If Ubase does not contain "messier", and Udest contains "redirect+cgi", then the instance is an AD
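To make the two example rules concrete, here they are written as plain Python predicates. This is a hand transcription for illustration, not output of C4.5: the instance layout (a dict holding an aspect ratio and sets of phrases named `alt`, `ubase` and `udest`) is hypothetical, and letting a missing aspect ratio simply fail the first rule is an assumption of this sketch.

```python
def rule_click_here(inst):
    """Wide image whose alt text says 'click here' (first example rule)."""
    ratio = inst["aspect_ratio"]
    return (
        ratio is not None          # assumption: unknown ratio never matches
        and ratio > 4.5833
        and "to" not in inst["alt"]
        and "click+here" in inst["alt"]
        and "http+www" not in inst["udest"]
    )


def rule_redirect(inst):
    """Destination URL goes through a redirect script (second example rule)."""
    return "messier" not in inst["ubase"] and "redirect+cgi" in inst["udest"]


def is_ad(inst):
    """An instance is classified as an AD if any rule fires."""
    return rule_click_here(inst) or rule_redirect(inst)
```

Each rule is a handful of comparisons and set lookups, which is consistent with the "quick on-line execution" property listed above.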
Design of ADEATER
Removing advertisements
Process
- Fetch HTML pages from the Internet
- Identify candidate advertisements
- Classify instances with the learned rules
- Replace the image's URL with the URL of an inconspicuous low-bandwidth image
Implementation
- Removal module runs as a proxy server
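The replacement step can be sketched as a rewrite of the page's HTML. The code below is an assumption-laden illustration, not the proxy's implementation: it uses regular expressions (fragile on real-world HTML) to find `<img>` tags inside anchors, takes the classifier as a caller-supplied function, and uses a made-up placeholder URL.

```python
import re

# Hypothetical URL of a tiny blank image served in place of advertisements.
PLACEHOLDER_URL = "http://localhost/blank.gif"

ANCHOR_RE = re.compile(r"(<a\b[^>]*>)(.*?)(</a>)", re.IGNORECASE | re.DOTALL)
IMG_SRC_RE = re.compile(r"(<img\b[^>]*?\bsrc=)([\"'])(.*?)\2",
                        re.IGNORECASE | re.DOTALL)


def remove_ads(html, is_ad):
    """Rewrite src of anchored images for which is_ad(anchor_tag, src) is true.

    Only images enclosed in an <a> tag are candidates; all others are kept.
    """
    def fix_anchor(match):
        open_tag, inner, close_tag = match.groups()

        def fix_img(img):
            prefix, quote, src = img.groups()
            if is_ad(open_tag, src):
                return prefix + quote + PLACEHOLDER_URL + quote
            return img.group(0)  # leave non-ad images untouched

        return open_tag + IMG_SRC_RE.sub(fix_img, inner) + close_tag

    return ANCHOR_RE.sub(fix_anchor, html)
```

Because only the `src` value changes, the page layout (width, height, surrounding link) is preserved while the advertisement itself is never downloaded.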
Evaluation
Speed and accuracy
Experiment setting
Total samples
- AD: 459 examples
- NON-AD: 2820 examples
10-fold cross-validation
- Training set: 90% of the examples
- Test set: 10% of the examples
Off-line training phase: 5.8 minutes
On-line classification phase: 70 msec/image
Average accuracy: 97.1%
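The 10-fold protocol above can be sketched as follows. The splitting scheme (shuffle once, deal indices round-robin into folds, no stratification by class) is an assumption, since the slides do not say how the folds were formed. The sample counts are the ones reported above; the baseline is simple arithmetic on them: always answering NON-AD would be right for 2820 of 3279 images, about 86.0%, which gives the 97.1% figure some context.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


N_AD, N_NON_AD = 459, 2820
N_TOTAL = N_AD + N_NON_AD                 # 3279 examples in total
MAJORITY_BASELINE = N_NON_AD / N_TOTAL    # accuracy of always saying NON-AD
```

Each of the ten rounds trains on roughly 2951 examples and tests on the remaining 327 or 328, and every example is tested exactly once.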
Evaluation
Learning curves
Simple methodology
- Does not recalculate the feature set
Realistic methodology
- Recalculates the feature set
Evaluation
Alternative encodings
Related Work
Muffin: Filtering web pages
ImageKill filter: hand-crafted rules
ImageKill.minheight
- Only remove images that are at least n pixels high
ImageKill.minwidth
- Only remove images that are at least n pixels wide
ImageKill.ratio
- Remove images that are more than n times as wide as they are high
ImageKill.exclude
- Don't remove images that match the given string/regexp
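For contrast with learned rules, the four hand-set options above can be mimicked in a short Python sketch. This is not Muffin's code (Muffin is a Java program), and how the real filter combines the options is not stated on the slide; the sketch assumes an image is removed only when it passes the exclude check, meets both minimum sizes, and exceeds the ratio. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageKillConfig:
    """User-chosen thresholds, mirroring the four options on the slide."""
    minheight: int = 0
    minwidth: int = 0
    ratio: float = 0.0
    exclude: Optional[str] = None  # regexp for images that must be kept


def should_remove(cfg, src, width, height):
    """Assumed combination: exclude wins, then all size tests must pass."""
    if cfg.exclude and re.search(cfg.exclude, src):
        return False
    if height < cfg.minheight or width < cfg.minwidth:
        return False
    return width > cfg.ratio * height
```

Every threshold here has to be picked and maintained by the user, which is the burden a learned classifier is meant to remove.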
Related Work
WebFilter: Filtering web pages
Solution
- User provides a list of URL templates and corresponding filter scripts
Related Work
Junkbuster: Filtering web pages
Solution
- User provides a block file
Related Work
Smokey: Detect abusive messages
Solution
- Collect training samples and generate rules by training
- Parse messages and generate a feature vector for each
- Classify the feature vector with the generated rules
Conclusion and Future Work
Conclusion
- High accuracy
- Modest resource cost (processing time, training samples)
Future Work
- Incremental learning algorithm
- More efficient feature selection mechanism
Thank you!