Shady Paths: Leveraging Surfing Crowds to Detect
Malicious Web Pages
Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna
University of California, Santa Barbara
The Web is a Dangerous Place
• Drive-by downloads
• Social engineering
Current Detection Techniques
Static Analysis
Suspicious elements in
• URLs
• JavaScript
• Flash
Weakness: obfuscation
Dynamic Analysis
Visit the web page (honeyclients)
• Signs of exploitation
Weakness: cloaking
Can only detect attacks that exploit vulnerabilities!
Our Technique
Redirection Graphs
By analyzing the characteristics of the set of visitors and of the redirection graph, we can determine if the destination page is malicious
No need to analyze the final page!
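As a minimal sketch of how individual redirection chains could be merged into one redirection graph: the function name `build_redirection_graph` and the example chains are illustrative assumptions, not SpiderWeb's actual implementation.

```python
from collections import defaultdict


def build_redirection_graph(chains):
    """Merge redirection chains into one directed graph (node -> set of successors).

    Hypothetical helper: each chain is a list of pages, ordered from the
    referrer to the final page.
    """
    graph = defaultdict(set)
    for chain in chains:
        if not chain:
            continue  # skip empty chains
        # Add one edge per consecutive pair of pages in the chain.
        for src, dst in zip(chain, chain[1:]):
            graph[src].add(dst)
        # Make sure the final page is a node even though it has no successors.
        graph[chain[-1]]
    return dict(graph)


# Illustrative chains (assumed): two users reach d.com through c.com.
chains = [
    ["a.com", "c.com", "d.com"],
    ["b.com", "c.com", "d.com"],
]
graph = build_redirection_graph(chains)
```

In this toy graph, c.com receives traffic from two different referrers, and d.com is the final page whose maliciousness would be judged from the graph and its visitors rather than from its content.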
Legitimate Uses of Redirections
• Inform that a web page has moved
• Login functionalities
• Advertisements
We cannot flag all redirections as malicious
Luckily, malicious redirection graphs look different
Malicious Redirection Graphs
Uniform software configuration
Malicious Redirection Graphs
Cross-domain redirections
Example domains: evil.co.cc, malicious.ru
Malicious Redirection Graphs
“Hubs” to aggregate traffic
Malicious Redirection Graphs
“Infected” websites
System Overview
Our System: SpiderWeb
We leverage the differences between legitimate and malicious redirection graphs for detection
Three components:
• Data collection
• Creation of redirection graphs
• Classification component
Data Collection
SpiderWeb needs a set of navigation data from a diverse population of users
Dataset obtained from a large AV vendor
• Users of a browser security tool
• Data collection was opt-in only
• Data was anonymized
Creation of Redirection Graphs
When we specify the final page, we allow wildcards (e.g., malicious.com/*) → Groupings
We need to discard groupings that are too general
[Figure: example redirection graphs over the domains a.com, b.com, c.com, and d.com]
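The wildcard grouping can be sketched as follows. This assumes, for illustration only, a single grouping rule that keeps the host of the final page and replaces the rest with `*`. The names `grouping_key` and `group_chains` are hypothetical, and the criterion for discarding groupings that are too general is not given on the slides, so it is left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def grouping_key(final_url):
    """Return a 'host/*' wildcard pattern for a final page (assumed grouping rule)."""
    # urlsplit only fills in the host when a scheme is present.
    url = final_url if "://" in final_url else "http://" + final_url
    host = urlsplit(url).hostname or ""
    return host + "/*"


def group_chains(chains):
    """Group redirection chains by the wildcard pattern of their final page."""
    groups = defaultdict(list)
    for chain in chains:
        if chain:
            groups[grouping_key(chain[-1])].append(chain)
    return dict(groups)


# Illustrative chains (assumed): two different pages on the same final host.
example_chains = [
    ["a.com", "http://malicious.com/x.php"],
    ["b.com", "http://malicious.com/y.php?id=1"],
    ["a.com", "http://other.org/"],
]
groups = group_chains(example_chains)
```

Here both chains ending on malicious.com fall into the grouping `malicious.com/*`, so they would contribute to the same redirection graph.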
Classification Component
Five categories of features
• Client features (3 features)
• Referrer features (4 features)
• Landing page features (4 features)
• Final page features (5 features)
  Examples: distinct URLs, parameters, TLD, domain is an IP
• Redirection graph features (12 features)
  Examples: length of chains, same country across referrer and final page, intra-domain redirections, hubs
Note: the features capture how diverse these elements are
We use Support Vector Machines for classification
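A diversity feature of this kind could be computed as in the sketch below. The ratio of distinct values to observations is one plausible definition, not necessarily the one SpiderWeb uses, and the field names `country`, `browser`, and `os` are assumptions chosen for illustration.

```python
def diversity(values):
    """Distinct values divided by number of observations (0.0 if there are none)."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


# Illustrative visits (assumed): users from several countries, but all with
# the same browser and operating system.
visits = [
    {"country": "US", "browser": "Firefox", "os": "Windows"},
    {"country": "DE", "browser": "Firefox", "os": "Windows"},
    {"country": "BR", "browser": "Firefox", "os": "Windows"},
    {"country": "US", "browser": "Firefox", "os": "Windows"},
]

# One diversity value per client attribute; values like these would form
# part of the feature vector handed to the classifier.
client_features = {
    key: diversity(v[key] for v in visits) for key in ("country", "browser", "os")
}
```

In this toy example the browser and OS diversity are low while the country diversity is high, which is the kind of uniform software configuration the earlier slides associate with malicious redirection graphs.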
Evaluation
Evaluation Dataset
388,098 redirection chains, collected over two months
• 34,011 final URLs
• 13,780 distinct user IP addresses per week
• 145 countries
Labeled dataset for training
• 2,533 redirection chains leading to 1,854 malicious URLs
• 2,466 redirection chains leading to 510 legitimate URLs
Analysis of the Classifier
SpiderWeb’s performance depends on the redirection graph complexity
• Complexity ≥ 6 causes no FPs and no FNs
• Our dataset is limited → we discard graphs with complexity < 4
We need to accept a certain amount of FPs and FNs
Full URL grouping: 1.2% FP rate, 17% FN rate
Redirection-graph-specific features are the most important
Without them, FNs rise to 67%
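For reference, FP and FN rates are conventionally computed from a confusion matrix as sketched below. The counts are hypothetical, picked only so that the arithmetic yields 1.2% and 17%. They are not taken from the SpiderWeb evaluation.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion counts."""
    # FP rate: share of legitimate items that were wrongly flagged.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    # FN rate: share of malicious items that were missed.
    fn_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return fp_rate, fn_rate


# Hypothetical counts for illustration only (not the paper's data):
# 250 legitimate items with 3 flagged, 100 malicious items with 17 missed.
fp_rate, fn_rate = error_rates(tp=83, fp=3, tn=247, fn=17)
```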
Detection in the Wild
3,549 redirection graphs with complexity ≥ 4
564 flagged as malicious → 3,368 URLs
778 URLs undetected by the AV vendor
• We could not confirm 1.5% of them
• Effectively complements state of the art
Comparison with Previous Work
A few previous systems leverage redirection information to detect malicious web pages
These systems also use other types of information
• WarningBird: uses Twitter profile information
• SURF: SEO specific
If this additional information is not present, SpiderWeb outperforms previous systems
Possible Use Cases
Offline detection (blacklist)
Online detection
Users get infected until the required “complexity” is reached
We performed a chronological experiment
SpiderWeb would have protected 93% of users
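The idea behind a chronological replay can be sketched as follows. The sketch assumes, purely for illustration, that a graph reaches the required complexity after a fixed number of observed visits (`visits_needed`). The slides do not define complexity this way, and the function name `replay` and the sample visits are hypothetical.

```python
def replay(visits, visits_needed):
    """Replay (timestamp, user) visits in time order.

    Users who arrive before the graph can be classified are exposed;
    everyone arriving afterwards counts as protected. Using a visit count
    as the trigger is an assumption standing in for graph complexity.
    """
    ordered = sorted(visits, key=lambda visit: visit[0])
    exposed = [user for _, user in ordered[:visits_needed]]
    protected = [user for _, user in ordered[visits_needed:]]
    return exposed, protected


# Illustrative visits (assumed), given out of order on purpose.
sample_visits = [(3, "u3"), (1, "u1"), (2, "u2"), (4, "u4"), (5, "u5")]
exposed, protected = replay(sample_visits, visits_needed=2)
protected_share = len(protected) / len(sample_visits)
```

In this toy run the first two visitors are exposed and the remaining three are protected, so the protected share is 3 out of 5.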
Discussion
Limitations
• Graphs with high complexity are required
• Groupings are not perfect
• Attackers might redirect users to legitimate pages
Attackers might make their redirections look legitimate
• Stop using cloaking (easier to detect by previous work)
• Stop using hubs (raises the bar)
Conclusions
• We showed that malicious and legitimate redirection graphs differ
• We presented a system that analyzes redirection graphs to detect malicious web pages
• We showed that our system is effective, and complements existing systems
@gianlucaSB