
Page 1: Honey Pot for Web Crawlers

Honey Pot for Web Crawlers

Karolina Lewandowska, Boguslawa Piekarska, Ioannis Zografakis

Page 2: Honey Pot for Web Crawlers

Honey Pot:

• Is a trap set to detect, deflect, or otherwise counteract attempts at unauthorized use of information systems.

• It consists of a computer, data, or a network site that appears to be part of a network but is actually isolated and monitored (and often deliberately under-protected), and which seems to contain information or a resource of value to attackers.

Page 3: Honey Pot for Web Crawlers

Web Crawler:

• Is a computer program that browses the World Wide Web in a methodical, automated manner.

• This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.

Page 4: Honey Pot for Web Crawlers

Web Crawler – How they work

• The process is initiated by adding a list of hypertext documents (the seeds) to the crawl frontier.

• Documents in the frontier are ranked; the top-ranked document is selected as the next to crawl and removed from the frontier.

• An HTTP request is issued for the document; after it has been retrieved, its contents are processed and its outward links are extracted.

• All links that are not already in the frontier and have not yet been crawled are added to the frontier.

• The process continues recursively (a minimal sketch in PHP follows below).
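A minimal sketch of this loop, written in PHP since that is the language used elsewhere in the project; the seed URL, the crawl cap, and the FIFO stand-in for the ranking step are our own simplifications:

<?php
// Frontier-based crawl loop: seeds -> select -> fetch -> extract links -> extend frontier.
$frontier = array('http://example.com/');   // the seeds (placeholder URL)
$crawled  = array();
$limit    = 50;                             // safety cap for this sketch

while (!empty($frontier) && count($crawled) < $limit) {
    // "Ranking" step: a plain FIFO queue stands in for a real scoring policy.
    $url = array_shift($frontier);
    $crawled[$url] = true;

    // Issue the HTTP request and retrieve the document.
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    // Process the contents and extract outward links
    // (naive regex; a real crawler would parse the HTML properly).
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);

    // Add links that are neither crawled yet nor already in the frontier.
    foreach ($matches[1] as $link) {
        if (!isset($crawled[$link]) && !in_array($link, $frontier)) {
            $frontier[] = $link;
        }
    }
}
?>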

Page 5: Honey Pot for Web Crawlers

Web Crawler – Politeness policy

• Robots exclusion protocol: a standard that lets administrators indicate which parts of their Web servers should not be accessed by crawlers.

• Resource-level, through a META tag in HTML files:
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
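The server-level mechanism from the first bullet is a robots.txt file placed in the site root (our actual file appears on page 21); a generic illustration, with a hypothetical /private/ directory:

User-agent: *
Disallow: /private/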

Page 6: Honey Pot for Web Crawlers

Honey Pots – more details

• Honey Pots serve several purposes, including distracting attackers from more valuable machines on the network, providing early warning about an attack, and allowing in-depth examination of adversaries during and after exploitation of the Honey Pot.

• Honey Pots are supposed to interact only with intruders: all transactions and interactions with a honey pot are by definition unauthorized, and the information gathered should be analyzed.

• Honey Pots can be distinguished into types based on their purpose and level of interaction.

Page 7: Honey Pot for Web Crawlers

Honey Pots – distinguished by purpose

• Research Honey Pot: designed to gain information about the blackhat community.

• Production Honey Pot: mostly used within organizations to protect them and help mitigate risk.

• Honeytoken: like a honeypot, but a digital entity rather than a computer.

Page 8: Honey Pot for Web Crawlers

Honey Pots – distinguished by level of interaction

• Low-interaction honeypots: have no operating system for the attacker to interact with.

• Medium-interaction honeypots: provide the attacker with the illusion of an operating system, giving the attacker more to interact with.

• High-interaction honeypots: the most advanced honeypots; they provide the attacker with a real operating system to interact with, where nothing is simulated or restricted.

Page 9: Honey Pot for Web Crawlers

Milestones of the project

• Create a Web page

• Get a high search position for the Web page (making it easier for Web crawlers to find)

• Create a PHP script that logs website visits and divides them into normal guests and Web crawlers

• Observe and compile statistics about attacks

• Compare attacks on the two servers

Page 10: Honey Pot for Web Crawlers

Honey Pot – Web Page "Healthy lifestyle"

Page 11: Honey Pot for Web Crawlers

Content of Web Page:

• Good advice about healthy lifestyle
  • What to eat
  • How to prepare the food
  • Counting BMI

• Exercises (available after log in)
  • Plank on Elbows and Toes
  • Long arm crunch
  • Bicycle exercise

• Diets (available after log in)
  • Blood type diet
  • Grapefruit diet
  • Low fat diet

• Calorie Table

• Contact

Page 12: Honey Pot for Web Crawlers

Main specification

• Languages: XHTML, PHP, MySQL, elements of Flash

• Servers: orfi.uwm.edu.pl, x10hosting.com

• URL addresses: orfi.uwm.edu.pl/~bagietka/int, healthylifestyle.x10hosting.com

Page 13: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers

• Text links instead of buttons
  o Crawlers can't read text from images

Page 14: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers – cont'd

• Not too many images – crawlers prefer text

Page 15: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers – cont'd

• Links from page to page of the same web site
  o Crawlers move from one web page to another through the navigational links.

• Minimize JavaScript effects
  o Crawlers don't get the content between <script> ... </script> tags.

• Search engine optimization (SEO)
  o Building a good position for the web site: many links from pages that themselves have a good position.

Page 16: Honey Pot for Web Crawlers

Search engine optimization (SEO)

• SEO directories

• Blog about health – presell page

• Finding the most popular keywords with Google AdWords

• Keywords: health, healthy, exercises, diets, exercise fitness, weight loss diet, diet plan, health food, healthy lifestyle, calorie table, diet food

Page 17: Honey Pot for Web Crawlers

Examples of directories and the blog we used

• Directories for the page, e.g.: dmoz.org, click4choice.com, internet-web-directory.com, thalesdirectory.com, canlinks.net, politicalforecast.net, nashvillebbb.org, and more

• Blog: http://health4us.yolasite.com

• Directories for the blog, e.g.: shane-english.com, pegasusdirectory.com, skoobe.biz, tsection.com, and more

Page 18: Honey Pot for Web Crawlers

Hidden links – Home Page

http://orfi.uwm.edu.pl/~bagietka/int/index.php?content=flash

Page 19: Honey Pot for Web Crawlers

Hidden links – left-bottom side of the footer
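The slides show the hidden links only as screenshots; as an illustration (our own sketch, not necessarily the exact markup used), a link can be hidden from human visitors while remaining in the XHTML source for crawlers to follow:

<!-- Hypothetical hidden link: invisible to people, but crawlers still see the href.
     It points at the "flash" page from the URL above, which no human can navigate to. -->
<a href="index.php?content=flash" style="display: none">flash</a>

Any visitor who requests such a page could not have clicked the link, which marks them as a bot.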

Page 20: Honey Pot for Web Crawlers

Robots.txt

• "Robots.txt" is a regular text file that through its name, has special meaning to the majority of "honorable" robots on the web.

• By defining a few rules in this text file, we can instruct robots to not crawl and index certain files, directories within our site, or at all.

• File robots.txt is uploaded to the root accessible directory of our site

Page 21: Honey Pot for Web Crawlers

Robots.txt – our file

User-agent: *
Disallow: http://orfi.uwm.edu.pl/~bagietka/int/index.php?content=calorie_table
Disallow: /bagietka/int/calorie_table.php

Page 22: Honey Pot for Web Crawlers

Logfiles – filtering the users

function getIsCrawler($userAgent) {
    // One long alternation of known crawler and bot names, matched against the user agent.
    $crawlers = 'Webduniabot|UnChaos|SitiDi|DIE-KRAEHE|comAgent|anything.com|neofonie|A-Online|miggibot|'.
        'aardvark-crawler|AbachoBOT|ABCdatos|Aberja|abot|About|Ack|Acorn|AESOP|SharewarePlaza|AIBOT|aipbot|Aladin|Aleksika Spider|AlkalineBOT|'.
        'Allesklar|AltaVista Intranet|AmfibiBOT|Amfibibot|AnnoMille spider|antibot|AnswerBus|AnzwersCrawl|Aport|appie|ArabyBot|'.
        'MLBot|Mouse-House|MQbot|msnbot|MSRBOT|crawler|Crawler|Ask Jeeves|MuscatFerret|Webinator|inktomi|Vagabondo|Crawl|galaxy|DAUMOA|hizbang|'.
        'WhizBang|BecomeBot|Diffbot|Digger|Exabot|FatBot|Galbot|heritrix|ShunixBot|Slurp|VoilaBot|Blogbot|NABOT|nabot|Bot|NetLookout|NetResearchServer|'.
        'googlebot|nsyght|NuSearch|Nutch|ObjectsSearch|Openfind|OpenWebSpider|OrangeSpider|PicoSearch|Pompos|PrivacyFinder|Progressive|QuepasaCreep|'.
        'bot|Reaper|RedCarpet|RedCell|RedKernel|RoboPal|ScanWeb|ScoutAbout|Scrubby|SearchGuild|Searchmee|Seekbot|ShopWiki|silk|Skimpy|Sphider|'.
        'Sphere|Spider|spider|Spinne|Sqworm|Steeler|Szukacz|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|'.
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|Tagword|TCDBOT|Teemer|Teoma|terraminds|thumbshots|tivraSpider|Toutatis|Trampelpfad-Spider|'.
        'Twiceler|TygoProwler|Ultraseek|verzamelgids|voyager|VSE|vspider|Waypath|Webclipping|webcrawl|WebFilter|Websquash|worio|WSB|yacybot|Yeti|yoono|'.
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';

Page 23: Honey Pot for Web Crawlers

Logfiles – filtering the users

    // Case-insensitive match: any listed name appearing in the user agent marks a crawler.
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

$isCrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

// Crawler and human traffic go to separate log files
// ($f is a per-page log-file prefix defined elsewhere in the script).
if ($isCrawler) {
    $file  = $f . "c.txt";
    $filea = "logs/allc.txt";
} else {
    $file  = $f . '.txt';
    $filea = "logs/all.txt";
}
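A quick sanity check of the filter (our own example calls, not from the slides): a Googlebot user-agent string should match, an ordinary browser string should not:

// Hypothetical test calls:
var_dump(getIsCrawler('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'));  // bool(true)
var_dump(getIsCrawler('Mozilla/5.0 (Windows; U; Windows NT 5.1; pl) Firefox/3.0.15'));               // bool(false)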

Page 24: Honey Pot for Web Crawlers

Logfiles – writing into a file

$date  = date("H:i:s d-m-Y");
$IP    = $_SERVER['REMOTE_ADDR'];
$user  = $_SERVER['HTTP_USER_AGENT'];
$where = $_SERVER['REQUEST_URI'];
$data  = "IP: $IP, Date: $date, UserAgent: $user\n Where: $where\n";

// Newest entries go on top: read the old log, then write the new entry followed by it.
$fpa = fopen($filea, "r+");
flock($fpa, LOCK_EX);                        // lock before reading, not after (fixes a race)
$size = filesize($filea);
$data = $data . ($size > 0 ? fread($fpa, $size) : '');
rewind($fpa);
fwrite($fpa, $data);
flock($fpa, LOCK_UN);
fclose($fpa);
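One caveat (our observation, not mentioned in the slides): fopen() with mode "r+" fails if the log file does not yet exist, so the log files must be created beforehand; a hypothetical guard:

if (!file_exists($filea)) {
    touch($filea);   // create an empty log so fopen($filea, "r+") succeeds
}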

Page 25: Honey Pot for Web Crawlers

Logs - examples

• IP: 91.184.196.22, Date: 20:16:30 25-11-2009, UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.15) Gecko/2009101601 Firefox/3.0.15 (.NET CLR 3.5.30729) Where: /~bagietka/int/index.php?content=diets

• IP: 128.30.52.71, Date: 20:39:06 25-11-2009, UserAgent: W3C_Validator/1.654 Where: /~bagietka/int/index.php

• IP: 80.50.235.70, Date: 20:46:10 25-11-2009, UserAgent: Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Where: /~bagietka/int/index.php?content=contact

Page 26: Honey Pot for Web Crawlers

Logs – Web crawlers – orfi.uwm.edu.pl

• Googlebot:

• IP: 66.249.65.53, Date: 13:56:13 28-11-2009, UserAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Where: /~bagietka/int/

• Next pages visited by Googlebot:
  • 15:55:33 28-11-2009 /~bagietka/int/index.php
  • 16:04:59 28-11-2009 /~bagietka/int/index.php?content=flash
  • 16:10:08 28-11-2009 /~bagietka/int/index.php?content=contact
  • 16:20:25 28-11-2009 /~bagietka/int/index.php?content=calorie_table

Page 27: Honey Pot for Web Crawlers

Remarks – Web crawlers – orfi.uwm.edu.pl:

• Googlebot:
  • visited a page that was forbidden in robots.txt
  • went to the "flash" page, which is invisible to human visitors
  • "jumped" between pages slowly – the bot spent about 10 minutes per page on average
  • did not follow links in their order of appearance – on the page calorie_table comes before contact, yet the bot visited contact first
  • supposition: the contact page contains many more links

Page 28: Honey Pot for Web Crawlers

Logs – Web crawlers – x10hosting.com

• IP: 208.255.176.240, Date: 14:30:03 25-11-2009, UserAgent: -;

• IP: 209.150.130.33, Date: 12:07:26 26-11-2009, UserAgent: Custom Spider www.homepageseek.com/1.0;

• IP: 66.249.67.165, Date: 05:29:05 28-11-2009, UserAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Page 29: Honey Pot for Web Crawlers

Remarks – Web crawlers – x10hosting.com

• Some crawlers don't leave their name

• Crawlers use more than one IP address

• So far no crawler has moved from one server to the other, but we are still waiting…