
Page 1: Honey Pot for Web Crawlers

Honey Pot for Web Crawlers

Karolina Lewandowska, Boguslawa Piekarska, Ioannis Zografakis

Page 2: Honey Pot for Web Crawlers

Honey Pot:

• Is a trap set to detect, deflect, or otherwise counteract attempts at unauthorized use of information systems.

• It consists of a computer, data, or a network site that appears to be part of a network but is actually isolated and monitored (and often deliberately under-protected), and which seems to contain information or a resource of value to attackers.

Page 3: Honey Pot for Web Crawlers

Web Crawler:

• Is a computer program that browses the World Wide Web in a methodical, automated manner.

• This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.

Page 4: Honey Pot for Web Crawlers

Web Crawler – How they work

• The process is initiated by adding a list of hypertext documents (the seeds) to the crawl frontier.

• Documents in the frontier are ranked; the top-ranked document is selected as the next to crawl and removed from the frontier.

• An HTTP request is issued for the document; after it has been retrieved, its contents are processed and its outward links are extracted.

• All links that are not already in the frontier and have not yet been crawled are added to the frontier.

• The process continues recursively (a minimal sketch in PHP follows below).
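A minimal sketch of this loop, written in PHP since that is the language used elsewhere in the project; the seed URL, the crawl cap, and the FIFO stand-in for the ranking step are our own simplifications:

<?php
// Frontier-based crawl loop: seeds -> select -> fetch -> extract links -> extend frontier.
$frontier = array('http://example.com/');   // the seeds (placeholder URL)
$crawled  = array();
$limit    = 50;                             // safety cap for this sketch

while (!empty($frontier) && count($crawled) < $limit) {
    // "Ranking" step: a plain FIFO queue stands in for a real scoring policy.
    $url = array_shift($frontier);
    $crawled[$url] = true;

    // Issue the HTTP request and retrieve the document.
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    // Process the contents and extract outward links
    // (naive regex; a real crawler would parse the HTML properly).
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);

    // Add links that are neither crawled yet nor already in the frontier.
    foreach ($matches[1] as $link) {
        if (!isset($crawled[$link]) && !in_array($link, $frontier)) {
            $frontier[] = $link;
        }
    }
}
?>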

Page 5: Honey Pot for Web Crawlers

Web Crawler – Politeness policy

• Robots exclusion protocol: a standard that lets administrators indicate which parts of their Web servers should not be accessed by crawlers.

• Resource-level, through a META tag in HTML files:
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
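The server-level mechanism from the first bullet is a robots.txt file placed in the site root (our actual file appears on page 21); a generic illustration, with a hypothetical /private/ directory:

User-agent: *
Disallow: /private/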

Page 6: Honey Pot for Web Crawlers

Honey Pots – more details

• Honey Pots serve several purposes, including distracting attackers from more valuable machines on the network, providing early warning about an attack, and allowing in-depth examination of adversaries during and after exploitation of the Honey Pot.

• Honey Pots are supposed to interact only with intruders: all transactions and interactions with a honey pot are by definition unauthorized, and the information gathered should be analyzed.

• Honey Pots can be distinguished into types based on their purpose and level of interaction.

Page 7: Honey Pot for Web Crawlers

Honey Pots – distinguished by purpose

• Research Honey Pot: designed to gain information about the blackhat community.

• Production Honey Pot: mostly used within organizations to protect them and help mitigate risk.

• Honeytoken: like a honeypot, but a digital entity rather than a computer.

Page 8: Honey Pot for Web Crawlers

Honey Pots – distinguished by level of interaction

• Low-interaction honeypots: have no operating system for the attacker to interact with.

• Medium-interaction honeypots: provide the attacker with the illusion of an operating system, giving the attacker more to interact with.

• High-interaction honeypots: the most advanced honeypots; they provide the attacker with a real operating system to interact with, where nothing is simulated or restricted.

Page 9: Honey Pot for Web Crawlers

Milestones of the project

• Create a Web page

• Get a high search position for the Web page (making it easier for Web crawlers to find)

• Create a PHP script that logs website visits and divides them into normal guests and Web crawlers

• Observe and compile statistics about attacks

• Compare attacks on the two servers

Page 10: Honey Pot for Web Crawlers

Honey Pot – Web Page "Healthy lifestyle"

Page 11: Honey Pot for Web Crawlers

Content of Web Page:

• Good advice about healthy lifestyle
  • What to eat
  • How to prepare the food
  • Counting BMI

• Exercises (available after log in)
  • Plank on Elbows and Toes
  • Long arm crunch
  • Bicycle exercise

• Diets (available after log in)
  • Blood type diet
  • Grapefruit diet
  • Low fat diet

• Calorie Table

• Contact

Page 12: Honey Pot for Web Crawlers

Main specification

• Languages: XHTML, PHP, MySQL, elements of Flash

• Servers: orfi.uwm.edu.pl, x10hosting.com

• URL addresses: orfi.uwm.edu.pl/~bagietka/int, healthylifestyle.x10hosting.com

Page 13: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers

• Text links instead of buttons
  o Crawlers can't read text from images

Page 14: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers – cont'd

• Not too many images – crawlers prefer text

Page 15: Honey Pot for Web Crawlers

Our steps to make the page attractive to Web Crawlers – cont'd

• Links from page to page of the same web site
  o Crawlers move from one web page to another through the navigational links.

• Minimize JavaScript effects
  o Crawlers don't get the content between <script> ... </script> tags.

• Search engine optimization (SEO)
  o Building a good position for the web site: many links from pages that themselves have a good position.

Page 16: Honey Pot for Web Crawlers

Search engine optimization (SEO)

• SEO directories

• Blog about health – presell page

• Finding the most popular keywords with Google AdWords

• Keywords: health, healthy, exercises, diets, exercise fitness, weight loss diet, diet plan, health food, healthy lifestyle, calorie table, diet food

Page 17: Honey Pot for Web Crawlers

Examples of directories and the blog we used

• Directories for the page, e.g.: dmoz.org, click4choice.com, internet-web-directory.com, thalesdirectory.com, canlinks.net, politicalforecast.net, nashvillebbb.org, and more

• Blog: http://health4us.yolasite.com

• Directories for the blog, e.g.: shane-english.com, pegasusdirectory.com, skoobe.biz, tsection.com, and more

Page 18: Honey Pot for Web Crawlers

Hidden links – Home Page

http://orfi.uwm.edu.pl/~bagietka/int/index.php?content=flash

Page 19: Honey Pot for Web Crawlers

Hidden links – left-bottom side of the footer
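The slides show the hidden links only as screenshots; as an illustration (our own sketch, not necessarily the exact markup used), a link can be hidden from human visitors while remaining in the XHTML source for crawlers to follow:

<!-- Hypothetical hidden link: invisible to people, but crawlers still see the href.
     It points at the "flash" page from the URL above, which no human can navigate to. -->
<a href="index.php?content=flash" style="display: none">flash</a>

Any visitor who requests such a page could not have clicked the link, which marks them as a bot.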

Page 20: Honey Pot for Web Crawlers

Robots.txt

• "Robots.txt" is a regular text file that through its name, has special meaning to the majority of "honorable" robots on the web.

• By defining a few rules in this text file, we can instruct robots to not crawl and index certain files, directories within our site, or at all.

• File robots.txt is uploaded to the root accessible directory of our site

Page 21: Honey Pot for Web Crawlers

Robots.txt – our file

User-agent: *
Disallow: http://orfi.uwm.edu.pl/~bagietka/int/index.php?content=calorie_table
Disallow: /bagietka/int/calorie_table.php

Page 22: Honey Pot for Web Crawlers

Logfiles – filtering the users

function getIsCrawler($userAgent) {
    // One long alternation of known crawler and bot names, matched against the user agent.
    $crawlers = 'Webduniabot|UnChaos|SitiDi|DIE-KRAEHE|comAgent|anything.com|neofonie|A-Online|miggibot|'.
        'aardvark-crawler|AbachoBOT|ABCdatos|Aberja|abot|About|Ack|Acorn|AESOP|SharewarePlaza|AIBOT|aipbot|Aladin|Aleksika Spider|AlkalineBOT|'.
        'Allesklar|AltaVista Intranet|AmfibiBOT|Amfibibot|AnnoMille spider|antibot|AnswerBus|AnzwersCrawl|Aport|appie|ArabyBot|'.
        'MLBot|Mouse-House|MQbot|msnbot|MSRBOT|crawler|Crawler|Ask Jeeves|MuscatFerret|Webinator|inktomi|Vagabondo|Crawl|galaxy|DAUMOA|hizbang|'.
        'WhizBang|BecomeBot|Diffbot|Digger|Exabot|FatBot|Galbot|heritrix|ShunixBot|Slurp|VoilaBot|Blogbot|NABOT|nabot|Bot|NetLookout|NetResearchServer|'.
        'googlebot|nsyght|NuSearch|Nutch|ObjectsSearch|Openfind|OpenWebSpider|OrangeSpider|PicoSearch|Pompos|PrivacyFinder|Progressive|QuepasaCreep|'.
        'bot|Reaper|RedCarpet|RedCell|RedKernel|RoboPal|ScanWeb|ScoutAbout|Scrubby|SearchGuild|Searchmee|Seekbot|ShopWiki|silk|Skimpy|Sphider|'.
        'Sphere|Spider|spider|Spinne|Sqworm|Steeler|Szukacz|Google|msnbot|Rambler|Yahoo|AbachoBOT|accoona|'.
        'AcioRobot|ASPSeek|CocoCrawler|Dumbot|FAST-WebCrawler|Tagword|TCDBOT|Teemer|Teoma|terraminds|thumbshots|tivraSpider|Toutatis|Trampelpfad-Spider|'.
        'Twiceler|TygoProwler|Ultraseek|verzamelgids|voyager|VSE|vspider|Waypath|Webclipping|webcrawl|WebFilter|Websquash|worio|WSB|yacybot|Yeti|yoono|'.
        'GeonaBot|Gigabot|Lycos|MSRBOT|Scooter|AltaVista|IDBot|eStyle|Scrubby';

Page 23: Honey Pot for Web Crawlers

Logfiles – filtering the users

    // Case-insensitive match: any listed name appearing in the user agent marks a crawler.
    $isCrawler = (preg_match("/$crawlers/i", $userAgent) > 0);
    return $isCrawler;
}

$isCrawler = getIsCrawler($_SERVER['HTTP_USER_AGENT']);

// Crawler and human traffic go to separate log files
// ($f is a per-page log-file prefix defined elsewhere in the script).
if ($isCrawler) {
    $file  = $f . "c.txt";
    $filea = "logs/allc.txt";
} else {
    $file  = $f . '.txt';
    $filea = "logs/all.txt";
}
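A quick sanity check of the filter (our own example calls, not from the slides): a Googlebot user-agent string should match, an ordinary browser string should not:

// Hypothetical test calls:
var_dump(getIsCrawler('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'));  // bool(true)
var_dump(getIsCrawler('Mozilla/5.0 (Windows; U; Windows NT 5.1; pl) Firefox/3.0.15'));               // bool(false)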

Page 24: Honey Pot for Web Crawlers

Logfiles – writing into a file

$date  = date("H:i:s d-m-Y");
$IP    = $_SERVER['REMOTE_ADDR'];
$user  = $_SERVER['HTTP_USER_AGENT'];
$where = $_SERVER['REQUEST_URI'];
$data  = "IP: $IP, Date: $date, UserAgent: $user\n Where: $where\n";

// Newest entries go on top: read the old log, then write the new entry followed by it.
$fpa = fopen($filea, "r+");
flock($fpa, LOCK_EX);                        // lock before reading, not after (fixes a race)
$size = filesize($filea);
$data = $data . ($size > 0 ? fread($fpa, $size) : '');
rewind($fpa);
fwrite($fpa, $data);
flock($fpa, LOCK_UN);
fclose($fpa);
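One caveat (our observation, not mentioned in the slides): fopen() with mode "r+" fails if the log file does not yet exist, so the log files must be created beforehand; a hypothetical guard:

if (!file_exists($filea)) {
    touch($filea);   // create an empty log so fopen($filea, "r+") succeeds
}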

Page 25: Honey Pot for Web Crawlers

Logs - examples

• IP: 91.184.196.22, Date: 20:16:30 25-11-2009, UserAgent: Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.15) Gecko/2009101601 Firefox/3.0.15 (.NET CLR 3.5.30729) Where: /~bagietka/int/index.php?content=diets

• IP: 128.30.52.71, Date: 20:39:06 25-11-2009, UserAgent: W3C_Validator/1.654 Where: /~bagietka/int/index.php

• IP: 80.50.235.70, Date: 20:46:10 25-11-2009, UserAgent: Mozilla/5.0 (Windows; U; Windows NT 6.0; pl; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Where: /~bagietka/int/index.php?content=contact

Page 26: Honey Pot for Web Crawlers

Logs – Web crawlers – orfi.uwm.edu.pl

• Googlebot:

• IP: 66.249.65.53, Date: 13:56:13 28-11-2009, UserAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Where: /~bagietka/int/

• Next pages visited by Googlebot:
  • 15:55:33 28-11-2009 /~bagietka/int/index.php
  • 16:04:59 28-11-2009 /~bagietka/int/index.php?content=flash
  • 16:10:08 28-11-2009 /~bagietka/int/index.php?content=contact
  • 16:20:25 28-11-2009 /~bagietka/int/index.php?content=calorie_table

Page 27: Honey Pot for Web Crawlers

Remarks – Web crawlers – orfi.uwm.edu.pl:

• Googlebot:
  • visited a page that was forbidden in robots.txt
  • went to the "flash" page, which is invisible to human visitors
  • "jumped" between pages slowly – the bot spent about 10 minutes per page on average
  • did not follow links in their order of appearance – on the page calorie_table comes before contact, yet the bot visited contact first
  • supposition: the contact page contains many more links

Page 28: Honey Pot for Web Crawlers

Logs – Web crawlers – x10hosting.com

• IP: 208.255.176.240, Date: 14:30:03 25-11-2009, UserAgent: -;

• IP: 209.150.130.33, Date: 12:07:26 26-11-2009, UserAgent: Custom Spider www.homepageseek.com/1.0;

• IP: 66.249.67.165, Date: 05:29:05 28-11-2009, UserAgent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Page 29: Honey Pot for Web Crawlers

Remarks – Web crawlers – x10hosting.com

• Some crawlers don't leave their name

• Crawlers use more than one IP address

• So far no crawler has moved from one server to the other, but we are still waiting…