© 2011 Cisco Systems, Inc. All rights reserved.
Applications of Machine Learning in Cisco Web Security
Richard Wheeldon PhD BSc
Cisco Web Security
• Cisco, Ironport and ScanSafe
• Request time filtering
  • Categorization and classification
  • Reputation
• Response time filtering
  • Malware types and attack vectors
  • Malware detection
  • Dynamic classification
• Other challenges
The Ubiquitous Speaker Slide
• Richard Wheeldon
  • UCL Graduate in 1999
  • PhD from Birkbeck in 2003
  • Joined Cisco December 2009
  • http://www.rswheeldon.com/
• Acknowledgements
  • Steve Poulson
  • Bryan Feeney
Cisco, Ironport and ScanSafe
• Cisco
  • World’s leading network company
• Ironport
  • Leader in Anti-spam
  • Provide Web Security Appliances
• ScanSafe
  • World leader in “Security as a Service”
  • Scans 1.8 billion web requests a day
  • Blocks 32 million of them
We’re local
Previous MSc projects
• Tree Kernels for CFG similarity
  • Guangyan Song, 2010
• Fast computation of the Kernel of a Tree and applications to Semi-Supervised Learning
  • Malcolm Reynolds, 2009
• Comparing N-gram features for web page classification
  • Noureen Tejani, 2007
We’re hiring
• Positions
  • Software Developers
  • QA, Operations, Research
• Locations
  • ScanSafe
  • UK - Bedfont Lakes, Reading, Staines, Edinburgh
  • Galway, EMEA, US, Worldwide
• Graduate recruitment
  • http://www.cisco.com/go/universityjobs
  • http://www.cisco.com/careers/
ScanSafe’s SaaS
1. Availability
Time our service is available to scan traffic
99.999% guaranteed availability
2. Latency
Additional load time attributable to services
Evaluated by 3rd party analysis
3. False Positives
Pages that were blocked but should not have been
4. False Negatives
Pages that were not blocked, but should have been
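As a rough illustration of what these four metrics mean in numbers, the sketch below converts an availability percentage into allowed downtime per year and computes false positive and false negative rates from raw counts. The function names and the example counts in the usage note are hypothetical and do not come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years

def allowed_downtime_minutes(availability):
    """Minutes per year a service may be down at the given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def false_positive_rate(false_positives, true_negatives):
    """Share of pages that should have been allowed but were blocked."""
    total = false_positives + true_negatives
    return false_positives / total if total else 0.0

def false_negative_rate(false_negatives, true_positives):
    """Share of pages that should have been blocked but were allowed."""
    total = false_negatives + true_positives
    return false_negatives / total if total else 0.0

print(round(allowed_downtime_minutes(0.99999), 2))  # about 5.26 minutes per year
```

For example, `false_positive_rate(5, 995)` gives 0.005 for a made-up sample of 1000 legitimate pages of which 5 were blocked.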
Risks of Unfiltered Content
• Software threats
  • Malware
  • Phishing
  • Botnets
• Business threats
  • Productivity Loss
  • Bandwidth congestion
  • Legal liability
  • Data Leaks
The Web vs. Email
Web | Email
Most web traffic is good | Most e-mail is bad
Easy to find safe sites | Easy to get Spam
Harder to get dangerous URLs | Harder to get examples of good mail
Blocking web sites is visible | Blocking email is invisible
Performance gain from white-listing | Performance gain from blocking
Very Real-Time (<2s) | Not Real-Time (<N hrs)
Request time filtering
• Motivation
  • Quicker blocks save bandwidth and processing time
  • If the request is made, the damage may be done
• Techniques
  • Databases
  • Reputation
  • Rules
  • Trained systems
Category-based filtering
• Responsible for most blocks
• High-risk and high-traffic
• Manual categorizers
• 10 million URLs
• 97% of traffic
• 2 million porn sites
Web Reputation
[Diagram: 3rd party feeds, spam hosts and databases combine into a score between -10 and +10 (Bad, Neutral or Good)]
• Feeds
  • Phishing sites
  • Malware sites
• Heuristics
  • In spam but not in ham
  • Age of domain registration
  • High traffic – e.g. Alexa 1000
  • Scanned but never blocked
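One way to picture how feeds and heuristics could combine into a single score in the -10 to +10 range is a weighted sum that is then clamped. The sketch below is only an illustration: the `HostEvidence` fields mirror the bullets above, but every weight and the 30-day threshold are assumptions, not values from the talk or from any Cisco product.

```python
from dataclasses import dataclass

@dataclass
class HostEvidence:
    """Hypothetical evidence record for one host."""
    on_phishing_feed: bool = False
    on_malware_feed: bool = False
    in_spam_not_ham: bool = False
    domain_age_days: int = 365
    high_traffic: bool = False
    scanned_never_blocked: bool = False

def reputation_score(evidence):
    """Combine evidence into a score clamped to [-10, +10]; weights are made up."""
    score = 0.0
    if evidence.on_phishing_feed or evidence.on_malware_feed:
        score -= 10.0  # a feed hit is treated as decisive
    if evidence.in_spam_not_ham:
        score -= 4.0
    if evidence.domain_age_days < 30:  # assumed threshold for a "new" domain
        score -= 3.0
    if evidence.high_traffic:
        score += 5.0
    if evidence.scanned_never_blocked:
        score += 3.0
    return max(-10.0, min(10.0, score))

print(reputation_score(HostEvidence(high_traffic=True, scanned_never_blocked=True)))
```

A trained system would learn such weights from labelled traffic rather than fix them by hand.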
Web Reputation in the WSA
Keyword-based URL filtering
• Keyword rules
  • Fitness -> Health
  • Basketball -> Sport
  • Pizzeria -> Food
  • Restaurant -> Food
  • Whore -> Porn
• Strange URLs
  • whorepresents.com
  • therapistfinder.com
  • speedofart.com
  • expertsexchange.com
  • penisland.com
  • powergenitalia.it
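The failure mode behind the strange URLs can be reproduced with a naive substring matcher. This is a minimal sketch, assuming the rules are applied as plain case-insensitive substring tests; the rule table is copied from the slide, while the function name and `joespizzeria.com` are made up for illustration.

```python
# Keyword-to-category rules from the slide.
KEYWORD_RULES = {
    "fitness": "Health",
    "basketball": "Sport",
    "pizzeria": "Food",
    "restaurant": "Food",
    "whore": "Porn",
}

def categorize_by_keyword(hostname):
    """Return the category of the first keyword found as a substring, else None."""
    host = hostname.lower()
    for keyword, category in KEYWORD_RULES.items():
        if keyword in host:
            return category
    return None

print(categorize_by_keyword("joespizzeria.com"))   # Food
print(categorize_by_keyword("whorepresents.com"))  # Porn, although the site reads "who represents"
```

Substring rules have no notion of word boundaries, which is why the segmentation of the hostname matters.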
Recognizing Porn URLs
• http://www.penisland.com
• Example of segmentation problem
  P('peni') × P('sland')
  P('penis') × P('land')
  P('pen') × P('island')
• Extends to classification
  P('penis') × P('land') × P(porn|'penis') × P(porn|'land')
  P('pen') × P('island') × P(not_porn|'pen') × P(not_porn|'island')
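The products above can be sketched as a joint search over segmentations and labels. In this minimal sketch every probability is invented for illustration (a real system would estimate them from a corpus), only dictionary words are considered as segments, and the names `WORD_PROB`, `P_PORN`, `segmentations` and `classify` are hypothetical.

```python
import math

# Hypothetical unigram probabilities P(word).
WORD_PROB = {"pen": 1e-4, "island": 1e-4, "penis": 5e-5, "land": 2e-4}
# Hypothetical per-word class probabilities P(porn | word).
P_PORN = {"pen": 0.01, "island": 0.01, "penis": 0.9, "land": 0.02}

def segmentations(text):
    """Yield every split of text into words found in WORD_PROB."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WORD_PROB:
            for rest in segmentations(text[i:]):
                yield [head] + rest

def score(words, label):
    """Log of the product P(w1) x ... x P(label|w1) x ... for one candidate."""
    total = 0.0
    for word in words:
        p_label = P_PORN[word] if label == "porn" else 1.0 - P_PORN[word]
        total += math.log(WORD_PROB[word]) + math.log(p_label)
    return total

def classify(text):
    """Return the (segmentation, label) pair with the highest score, or None."""
    candidates = [
        (score(words, label), words, label)
        for words in segmentations(text)
        for label in ("porn", "not_porn")
    ]
    return max(candidates)[1:] if candidates else None

print(classify("penisland"))
```

With these made-up numbers the "pen island" reading wins; different estimates could flip the result, which is the point of treating segmentation and classification together.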
Phishing and Malware Examples
• Phishing examples
  • http://pavpals-com-usaprewiwerluithaniirse.345.pl
  • http://82.195.143.18/onlinepaypal.com/
  • http://www.jetboatflush.com/~nfioemro/www.paypal.fr/webscrcmd=...
• Malicious examples:
  • www1.scan-projectrf.cz.cc
  • www1.scan-projectsi.cz.cc
  • www1.scan-projectst.cz.cc
  • www1.scan-projectte.cz.cc
  • www1.scan-projectti.cz.cc
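Several of these URLs give themselves away through their structure alone. The following is a minimal sketch of lexical URL features of the kind a trained system might consume; the feature set, the `BRAND` constants and the function name are assumptions for illustration, not the features used by ScanSafe.

```python
import ipaddress
from urllib.parse import urlsplit

BRAND = "paypal"             # hypothetical brand to protect
BRAND_DOMAIN = "paypal.com"  # assumed legitimate domain for that brand

def url_features(url):
    """Extract a few crude lexical features from a URL string."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    on_brand_domain = host == BRAND_DOMAIN or host.endswith("." + BRAND_DOMAIN)
    return {
        "host_is_ip": host_is_ip,
        # Brand name present somewhere in the URL, but not on the brand's own domain.
        "brand_off_domain": BRAND in url.lower() and not on_brand_domain,
        "host_labels": host.count(".") + 1 if host else 0,
        "has_tilde_path": "~" in parts.path,
    }

print(url_features("http://82.195.143.18/onlinepaypal.com/"))
```

Note that the first phishing example spells the brand as "pavpals", so an exact string match on the brand name would miss it; look-alike spellings need fuzzier features.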
Searchahead
• If we can identify bad URLs we can warn before the user clicks.
• Over 90% of new sites are visited as the result of an Internet search
[Figure: search results annotated as Acceptable, Uncategorized, Prohibited or Malicious]
Response Time Scanning
• Trusted sites are targets
• Strength-in-depth combination of commercial scanners and in-house technology.
[Figure: attack vectors (Graphics, Webmail, New Web Pages, Blogs, Ad Links, Links, Comments, Banner Ads) and malware types (Backdoors, Rootkits, Trojan Horses, Keyloggers, Worms)]
Exploited sites in recent years
• Times India
• Miami Dolphins
• Samsung
Nothing is safe – not even Twitter!
http://www.youtube.com/fslabs
Signature Databases
[Chart: Signatures (millions) by year, 2006 to 2008]
• From 2006 to 2008, the F-Secure signature database grew from 250,000 entries to 1.5 million
• The rate at which variants of viruses come out is growing rapidly
• No vendor can rely exclusively on signatures
Zero-hour protection
• Vendors take time to release signature updates
  • Win32.IstBar.jl trojan
• Outbreak Intelligence (OI) provides proactive threat detection
• A huge data set of traffic to be leveraged
How does OI use Machine Learning?
• Approaches
  • Malware detection
  • Anomaly detection
  • Dynamic categorization
• Techniques Employed
  • Supervised Learning
  • Unsupervised Learning
  • Sandboxing
Dynamic Classification
• Document classification across 80 categories
  • Increases coverage
  • Language identification
• Identifies inappropriate content
  • Porn is relatively easy
  • Phishing is harder – but not impossible?
  • Hate speech is harder still
DC for identifying malicious sites
• Automated tools generate malicious sites
  • Fake escrow
  • Fake pharmacy
  • Mule recruitment
• Examples from Richard Clayton’s 2010 FOSDEM talk
  • http://www.google.com/search?q=%22before+that+was+a+commercial+manager+of+a+large+corporation+engaged+in+electronics+production%22
  • http://www.google.com/search?q=%22as+the+most+trusted+escrow+service+on+the+internet%22
Malicious Executable Files
• The final stage of an attack is frequently downloading an executable
• Traditionally blocked using signatures
• We use a combination of signature-based scanners and machine-learning
Drive-by attacks
• Almost no-one opens executables from odd sources any more, so attackers use drive-by attacks instead.
• A normal file (e.g. Flash, PDF, JavaScript, image file) is crafted to exploit a vulnerability in a viewer or library and execute code embedded within the file.
Flash
“Symantec recently highlighted Flash for having one of the worst security records in 2009. We also know first hand that Flash is the number one reason Macs crash. We have been working with Adobe to fix these problems, but they have persisted for several years now. We don’t want to reduce the reliability and security of our iPhones, iPods and iPads by adding Flash”
Steve Jobs, April 2010
http://www.apple.com/hotnews/thoughts-on-flash/
The growing threat of Java
• Almost as common as Flash
  • 90% of PCs have Java
  • 700,000 JDK downloads per month
  • 3.48 million JRE downloads per month
• Growth in known vulnerabilities
  • 29 patched in a single update (Oct 2010)
  • Growth in exploits reported by Sophos, Symantec, Microsoft and Cisco
• Signatures + Trained Scanlet
Detecting Malicious JavaScript
• Sandboxing
  • Behavioural checking
  • Good way to beat obfuscation techniques
  • Difficult to constrain
• Trained classification
  • Analyse features
JavaScript Features
v46f658f5e2260(v46f658f5e3226){ function v46f658f5e4207 () {return 16;} return(parseInt(v46f658f5e3226,v46f658f5e4207()));}function v46f658f5e61f4(v46f658f5e7174){ function v46f658f5ea0cd () {return 2;} var v46f658f5e813e=\'\';for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){ v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));}return v46f658f5e813e;} document.write(v46f658f5e61f4(\'3C5343524950543E77696E646F772E7374617475733D2\'));
The above is JavaScript, but where are the features?
An exercise for the reader!
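One possible starting point for the exercise is to turn a script into a handful of crude numeric features that a classifier could be trained on. The sketch below is an assumption-laden illustration: the feature names, the list of "suspicious" calls and the regular expressions are all hypothetical choices, not the features the talk had in mind.

```python
import math
import re
from collections import Counter

# Assumed list of calls worth counting; not taken from the talk.
SUSPICIOUS_CALLS = ("eval", "document.write", "String.fromCharCode", "unescape")

def shannon_entropy(text):
    """Bits per character of the character distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in Counter(text).values())

def js_features(source):
    """Map a script to a flat dictionary of crude numeric features."""
    # Crude tokenization: anything shaped like an identifier, keywords included.
    identifiers = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    features = {
        "length": len(source),
        "entropy": shannon_entropy(source),
        "mean_identifier_length": (
            sum(len(name) for name in identifiers) / len(identifiers) if identifiers else 0.0
        ),
        # Long runs of hex digits often indicate an encoded payload.
        "long_hex_runs": len(re.findall(r"\b[0-9A-Fa-f]{16,}\b", source)),
    }
    for call in SUSPICIOUS_CALLS:
        features["count_" + call] = source.count(call)
    return features

sample = "document.write(decode('3C5343524950543E77696E646F77'));"
print(js_features(sample))
```

Because legitimate vendors also obfuscate their code, no single feature of this kind is decisive on its own.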
Obfuscation
• Attackers use obfuscation
  • But so do legitimate vendors (e.g. Google)
  • And large Web 2.0 libraries
• Techniques include
  • Name changes
  • String concatenation (eval)
  • Dynamically loaded/generated/decrypted code (eval)
  • Splitting functionality across files
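The obfuscated sample on the JavaScript Features slide is an instance of dynamically decoded code: it walks a hex string two characters at a time, parses each pair as base 16 and converts it to a character before passing the result to document.write. A minimal Python sketch of that decoding step follows; the function name is hypothetical, and unlike the original script it simply ignores a trailing unpaired digit (the payload shown on the slide appears to be truncated).

```python
def decode_hex_pairs(payload):
    """Decode a string of two-digit hex codes into text, ignoring an odd trailing digit."""
    usable = len(payload) - len(payload) % 2
    return "".join(chr(int(payload[i:i + 2], 16)) for i in range(0, usable, 2))

# Payload as it appears on the slide.
print(decode_hex_pairs("3C5343524950543E77696E646F772E7374617475733D2"))
```

The output is the opening of a script tag followed by an assignment to window.status, which shows why static keyword matching on the raw source sees nothing suspicious.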
Malicious Non-Executable Files
• There are a lot of file formats out there – documents, pictures, videos.
• For zero-day attacks, we have no data to compare against.
• Basically this is anomaly detection.
Development Constraints
• Low False Positive Rate
• Robust
  • Tolerant against malformed data
  • Language-agnostic
• Scalable
  • 1.8 billion requests per day on 1000 servers
• Low latency
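To put the scalability figure in perspective, the arithmetic below converts requests per day into an average per-server rate. It is a back-of-the-envelope sketch that assumes load is spread evenly over servers and over the day, which real traffic is not; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_requests_per_server_per_second(requests_per_day, servers):
    """Average request rate one server must sustain under a uniform-load assumption."""
    return requests_per_day / SECONDS_PER_DAY / servers

rate = average_requests_per_server_per_second(1_800_000_000, 1000)
print(round(rate, 1))  # 20.8
```

Peak rates will be well above this average, so the per-request budget for any classifier is tighter than the figure suggests.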
Back-end processing
[Diagram: AM scanners, URL blacklists and AV scanners mark files as bad; file whitelists, no AV hits and URL whitelists mark files as good; behavioural features and content features feed ML, which outputs bad or good]
• If a technique is too slow for real-time scanning, that doesn’t make it useless.
• Back end processing can generate lists of good and bad files and help evaluate new techniques.
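The idea of using slow back-end checks to generate lists of good and bad files can be sketched as a labelling step that feeds a training set. Everything here is an assumption made for illustration: the evidence keys, the precedence of "bad" over "good" and both function names are hypothetical readings of the diagram, not a description of the actual pipeline.

```python
def backend_label(evidence):
    """Return 'bad', 'good' or None from offline evidence about one file."""
    if evidence.get("am_hit") or evidence.get("av_hit") or evidence.get("url_blacklisted"):
        return "bad"
    if evidence.get("file_whitelisted") or evidence.get("url_whitelisted"):
        return "good"
    return None  # undecided: left for the ML stage or later review

def build_training_set(samples):
    """Keep only samples the back end could label, as (features, label) pairs."""
    labelled = []
    for sample in samples:
        label = backend_label(sample["evidence"])
        if label is not None:
            labelled.append((sample["features"], label))
    return labelled

samples = [
    {"features": [0.9, 3], "evidence": {"av_hit": True}},
    {"features": [0.1, 0], "evidence": {"url_whitelisted": True}},
    {"features": [0.5, 1], "evidence": {}},
]
print(build_training_set(samples))
```

The unlabelled third sample is exactly the kind of file on which a new technique can be evaluated offline before it is trusted in real time.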
Want to know more?
• Cisco 2Q10 Global Threat Report http://www.cisco.com/web/about/security/intelligence/cisco_threat_072610_959.pdf
• Richard Clayton : Evil on the Internet http://www.securitytube.net/Phishing-(Evil-on-the-Internet)-FOSDEM-Talk-video.aspx
• Kaspersky Lab Security News Service http://threatpost.com/
• A plan for Spam http://www.paulgraham.com/spam.html
Still want to know more?
• Identifying Suspicious URLs : An Application of Large-Scale Online Learning http://videolectures.net/icml09_ma_isu/
• Peter Norvig Google : Statistical Learning as the Ultimate Agile Development Tool http://videolectures.net/cikm08_norvig_slatuad/
• Writing ClamAV Signatures Alain Zidouemba http://www.clamav.net/doc/webinars/Webinar-Alain-2009-03-04.ppt
Take Home Messages
• Web Security
  • Challenging and interesting domain
  • Many applications for Machine Learning
• ScanSafe and Cisco
  • Many opportunities for collaboration
  • Several opportunities for student projects