BY IBRAHIM MOSAAD
SUPERVISED BY OSAMA KAMAL
VISUAL FINGERPRINTING FOR MALICIOUS DOMAINS
OUTLINE
• Introduction
• Statistics of malicious Domains/URLs
• Goal
• How
• Conceptually
• Theoretically
• Practically
• Testing And Results
• Challenges
• Future Work
INTRODUCTION
• Statistics
• In 2014, Kaspersky Lab’s web antivirus detected 123,054,503 unique malicious objects: scripts, exploits, executable files, etc.
INTRODUCTION
• Exploit kits
• How Common Are Exploit Kits?
• 6000 infections/0.2 hour
• 2B visitors/month
• Two-thirds of all malware is delivered by exploit kits
GOAL
“Create an automated system to differentiate between benign and malicious websites”
HOW - CONCEPTUALLY
• How do malicious websites behave?
• Lack of a good training set
• How do benign websites behave?
• Testing top 250 websites from different categories in Alexa
• Scoring system
HOW – THEORETICALLY
• Browsing websites using real/emulated system
• Store/visualize the collected data
• Score it
HOW - PRACTICALLY
• Browsing websites using honeyclients
• Low-interaction
• Thug
• HoneySpider Network 2.0
• High-interaction
• Capture-HPC
• HoneyClient
HOW - PRACTICALLY
• HSN
• Modular framework, extendable
• Wappalyzer module (Developed)
• Peepdf Module (Developed)
• Cuckoo sandbox module (Updated)
• Yara module (Updated)
HOW - PRACTICALLY
• Storing collected data
• Graph database neo4j
• GraphDB driver to HSN using Py2neo
• Scoring System
• Mix of first- and second-degree functions
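A minimal sketch of what a score built from a mix of first- and second-degree functions could look like. The slides do not give the actual functions or coefficients, so every weight below is hypothetical and chosen only for illustration. The feature names are borrowed from the feature-extraction slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Coefficients for one feature: linear * x + quadratic * x**2."""
    linear: float = 0.0
    quadratic: float = 0.0


# Hypothetical weights (NOT from the slides), for illustration only.
WEIGHTS = {
    "levels": Term(linear=1.0),                        # first-degree only
    "resources": Term(linear=0.1),                     # first-degree only
    "redirections": Term(linear=2.0, quadratic=0.5),   # mixed
    "iframes": Term(quadratic=1.0),                    # second-degree only
}


def score(features):
    """Sum a first- and/or second-degree term per feature; missing features count as 0."""
    total = 0.0
    for name, term in WEIGHTS.items():
        x = features.get(name, 0)
        total += term.linear * x + term.quadratic * x * x
    return total
```

A second-degree term penalizes large counts much more than small ones, which may suit features such as iframes where a few are normal and many are suspicious. Whether the original system used the terms this way is an assumption.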
FIRST RUN - TRAINING
• Number of websites: 1500
MOZILLA.ORG
AVG.COM
ORACLE.COM
APPLE.COM
FIRST RUN
• Feature Extraction
• Number of levels
• Number of resources
• Number of redirections
• Number of iframes
• Website Topology
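The features listed above can be sketched as a single walk over the tree of objects collected for one website. The `Node` type, its `kind` labels, and the choice to count every non-root node as a resource are assumptions made for this illustration. They are not HSN's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One crawled object. Hypothetical structure, not HSN's schema."""
    url: str
    kind: str = "resource"  # assumed labels: "page", "resource", "redirect", "iframe"
    children: List["Node"] = field(default_factory=list)


def extract_features(root: Node) -> Dict[str, int]:
    """Count levels, resources, redirections and iframes in one crawl tree."""
    counts = {"levels": 0, "resources": 0, "redirections": 0, "iframes": 0}
    stack = [(root, 1)]  # (node, depth), with the root at depth 1
    while stack:
        node, depth = stack.pop()
        counts["levels"] = max(counts["levels"], depth)
        if node is not root:
            counts["resources"] += 1  # assumption: every non-root node is a resource
        if node.kind == "redirect":
            counts["redirections"] += 1
        elif node.kind == "iframe":
            counts["iframes"] += 1
        stack.extend((child, depth + 1) for child in node.children)
    return counts
```

The walk uses an explicit stack rather than recursion, so deep redirect chains do not hit Python's recursion limit.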
BABYLON.COM
SECOND RUN – REAL CASE
• Top domains that looked malicious
• http://dictionary.reverso.net
• http://n4hr.com
• http://s02.arab.sh
• http://dc11.arabsh.com
CHALLENGES
• HSN
• Lack of good documentation
• Last version was released in 2013
• Code written in 3 languages: C, Python, Java
• Lack of community support
CHALLENGES
• Graph Database (py2neo)
• Insertion
• Library is still immature
• REST-API can’t handle it
• 7000 URLs * 30 * 2 = 420,000 ≈ 0.5M nodes
• Store the queries in one request?!
• Huge POST request
• Querying
• 7000 URL => 7000*20 = 140K
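One possible way around a single huge POST request is to split the insert workload into fixed-size batches and send each batch separately. The sketch below only reproduces the slide's arithmetic and shows the batching step. The batch size is a hypothetical value, and the call that submits a batch to neo4j is left out because it depends on the py2neo version in use.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Figures from the slide: 7000 URLs with multipliers 30 and 2.
URLS = 7000
MULTIPLIERS = (30, 2)
total_nodes = URLS * MULTIPLIERS[0] * MULTIPLIERS[1]  # 420000, roughly 0.5M

BATCH_SIZE = 1000  # hypothetical; would be tuned against the server's limits

# Each batch would become one request instead of one request for everything.
batches = sum(1 for _ in chunked(range(total_nodes), BATCH_SIZE))
```

With these numbers the workload becomes 420 requests of 1000 nodes each. A smaller batch size trades more round trips for smaller request bodies.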
FUTURE WORK
• HSN
• Enhance the web-client module
• Enhance SWF emulation module
• Scoring System
• Machine learning
• Graph Database
• Adopt Giraph rather than neo4j
• Monitoring governmental websites
BIGGER PICTURE
Questions?