Exploring The World Wide Web
Khusbu Mishra, Siddhartha Anand
Outline
- Introduction
- Motivation of Crawlers
- Basic Crawlers and Implementation Issues
- Crawling Algorithms
- Crawler Ethics and Conflicts
- Demonstration
Introduction
- A web crawler exploits the graph structure of the Web in a sequential, automated manner.
- Crawlers are also known as spiders, bots, wanderers, ants, fish, and worms.
- A crawler starts from a seed page and then follows the links within it to reach other pages.
- The links found are stored in a repository to be crawled later.
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Motivation For Crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Gather real-world data on a large scale
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors and partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing...
Can you think of some others?
Building A Crawling Infrastructure
- This is a sequential crawler
- Seeds can be any list of starting URLs
- The order of page visits is determined by the frontier data structure
- The stop criterion can be anything
Frontier
- The to-do list of a crawler: it contains the URLs of unvisited pages.
- It may be implemented as a FIFO queue.
- Stored URLs must be unique.
- It can be maintained as a hash table with the URL as key.
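A minimal sketch of such a frontier, using Python's collections.deque as the FIFO queue and a set as the hash table of URLs already seen (the class and method names are illustrative, not from the slides):

from collections import deque

class Frontier(object):
    """To-do list of the crawler: a FIFO queue of unvisited, unique URLs."""
    def __init__(self, seeds):
        self.queue = deque(seeds)     # FIFO: append to the right, pop from the left
        self.seen = set(seeds)        # hash table keyed by URL, guarantees uniqueness

    def add(self, url):
        if url not in self.seen:      # only enqueue URLs never seen before
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()   # next URL to crawl

    def __len__(self):
        return len(self.queue)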
Fetching Pages
- We need a client that sends a request and receives a response.
- It should have timeouts.
- It needs exception handling and error checking.
- It must follow the Robot Exclusion Protocol.
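A hedged sketch of such a fetcher in Python 2 (matching the urllib2-based example later in the slides), with a timeout and basic error checking; robots.txt handling is shown separately further on:

import urllib2

def fetch(url, timeout=10, max_bytes=100 * 1024):
    """Fetch a page, returning its content or None on failure (sketch only)."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)  # don't hang forever
        return response.read(max_bytes)                   # only the first ~100 KB
    except urllib2.HTTPError, e:
        print 'HTTP error %d for %s' % (e.code, url)      # soft fail, keep crawling
    except (urllib2.URLError, IOError), e:
        print 'Could not fetch %s: %s' % (url, e)
    return None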
Parsing
- Once a page has been fetched, we need to parse its content to extract information.
- Parsing may mean extracting URLs, or extracting data.
- Extracted URLs must be canonicalized before being added to the frontier, so that the same page is not revisited.
A Basic Crawler in Python: Breadth-First Search
(queue: a FIFO list, shift and push)

import urllib2                              # Python 2 HTTP client
from collections import deque
from BeautifulSoup import BeautifulSoup     # BeautifulSoup 3

seed = 'http://www.seedurl.com'
queue = deque([seed])                       # frontier: FIFO queue of URLs to visit
while len(queue) != 0:
    url = queue.popleft()                   # shift: take the next URL from the front
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', href=True):
        queue.append(link['href'])          # push: add discovered links to the back
URL Extraction and Canonicalization
Relative vs. absolute URLs
- The crawler must translate relative URLs into absolute URLs.
- The base URL is obtained from the HTTP headers or the HTML <base> tag, or else defaults to the current page's path.
Examples
- Base: http://www.cnn.com/linkto/
- Relative URL: intl.html -> Absolute URL: http://www.cnn.com/linkto/intl.html
- Relative URL: /US/ -> Absolute URL: http://www.cnn.com/US/
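In Python this resolution is available in the standard library; a small sketch using urlparse.urljoin (Python 2) that reproduces the examples above:

from urlparse import urljoin

base = 'http://www.cnn.com/linkto/'
print urljoin(base, 'intl.html')   # -> http://www.cnn.com/linkto/intl.html
print urljoin(base, '/US/')        # -> http://www.cnn.com/US/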
More On Canonical URLs
- Convert the protocol and hostname to lowercase.
- Remove the anchor (fragment) part of the URL.
- Apply URL encoding consistently for some commonly used characters (e.g. '~').
- Resolve '..' against its parent directory and remove both from the URL path.
More On Canonical URLs
Some transformations are trivial, for example:
http://informatics.indiana.edu/index.html#fragment -> http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/ -> http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/~fil/ -> http://informatics.indiana.edu/%7Efil/
http://INFORMATICS.INDIANA.EDU/fil/ -> http://informatics.indiana.edu/fil/
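A partial sketch of these rules in Python 2 (lowercasing, default-port and fragment removal, and '.'/'..' resolution only; percent-encoding of characters such as '~' is left out). The helper name canonicalize is an assumption for illustration:

from urlparse import urlparse, urlunparse
import posixpath

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()                  # lowercase protocol
    netloc = parts.netloc.lower()                  # lowercase hostname
    if netloc.endswith(':80'):                     # drop the default HTTP port
        netloc = netloc[:-3]
    path = posixpath.normpath(parts.path or '/')   # resolve '.' and '..'
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                                # keep a meaningful trailing slash
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ''))  # drop #fragment

print canonicalize('http://INFORMATICS.INDIANA.EDU/dir1/./../dir2/')
# -> http://informatics.indiana.edu/dir2/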
URL Canonicalization
All of these:
  http://www.cnn.com/TECH/
  http://WWW.CNN.COM/TECH/
  http://www.cnn.com:80/TECH/
  http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
  http://www.cnn.com/TECH/
To avoid duplication, the crawler must transform all URLs into canonical form.
Implementation Issues
- Don't want to fetch the same page twice!
  Keep a lookup table (hash) of visited pages.
- The frontier grows very fast!
  May need to prioritize for large crawls.
- The fetcher must be robust!
  Don't crash if a download fails; use a timeout mechanism.
- Determine the file type to skip unwanted files.
  Extensions can be used, but they are not reliable.
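Since extensions are unreliable, a crawler can instead look at the Content-Type header of the response. A small Python 2 sketch (the helper name is illustrative):

import urllib2

def is_html(url, timeout=10):
    """Return True if the server reports the resource as HTML (sketch only)."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        return response.info().gettype() == 'text/html'   # e.g. 'text/html', 'image/png'
    except (urllib2.URLError, IOError):
        return False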
More Implementation Issues: Fetching
- Get only the first 10-100 KB per page.
- Take care to detect and break redirection loops.
- Soft fail for timeouts, a server not responding, file not found, and other errors.
More Implementation Issues: Parsing
- HTML has the structure of a DOM (Document Object Model) tree.
- Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
- Crawlers, like browsers, must be robust and forgiving.
- Must pay attention to HTML entities and Unicode in text.
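BeautifulSoup (used in the crawler above) is one such forgiving parser; a small sketch with deliberately broken markup and an HTML entity (the sample string is made up for illustration):

from BeautifulSoup import BeautifulSoup

broken = '<html><body><p>AT&amp;T news <a href=index.html>home</body>'  # unclosed tags, unquoted attribute
soup = BeautifulSoup(broken, convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup.find('a')['href']              # 'index.html' despite the sloppy markup
print soup.find('p').findAll(text=True)   # text nodes, roughly [u'AT&T news ', u'home']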
Graph Traversal (BFS or DFS?)
Breadth-First Search
- Implemented with a QUEUE (FIFO)
- Finds pages along shortest paths
- If we start with "good" pages, this keeps us close to them
Depth-First Search
- Implemented with a STACK (LIFO)
- Tends to wander away ("lost in cyberspace")
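The only difference between the two strategies is which end of the frontier the next URL comes from; a tiny sketch (helper names are illustrative):

from collections import deque

frontier = deque(['http://www.seedurl.com'])

def next_url(strategy='BFS'):
    # BFS: treat the frontier as a queue (FIFO); DFS: treat it as a stack (LIFO)
    return frontier.popleft() if strategy == 'BFS' else frontier.pop()

# newly extracted links are appended to the right in both cases
frontier.append('http://www.seedurl.com/page1.html')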
Breadth First Search: worked example
[Animated slides: BFS over a small web graph of pages URL1-URL7, crawled from the seed URL1. Nodes are marked Undiscovered, Discovered, and Finished, and each node is labelled with its shortest-path distance from the seed.]
- Pop order: URL1, URL2, URL3, URL5, URL4, URL6, URL7.
- Queue after each pop: [URL2, URL3, URL5] -> [URL3, URL5, URL4] -> [URL5, URL4, URL6] -> [URL4, URL6] -> [URL6] -> [URL7] -> [] (empty).
- A URL that is already discovered (e.g. URL5, reached again from URL2) is not enqueued a second time.
- Shortest-path distances from the seed: URL1 = 0; URL2, URL3, URL5 = 1; URL4, URL6 = 2; URL7 = 3.
BFS vs. DFS
Best-First Search
- Assigns a score to the unvisited URLs on a page, based on similarity measures.
- URLs are added to a frontier that is maintained as a priority queue.
- The best-scoring URL is crawled first.
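A priority-queue frontier can be sketched with Python's heapq; the scores are assumed to come from some similarity function, which is not shown here:

import heapq

frontier = []   # heap of (negative score, url): the highest-scoring URL pops first

def add_url(url, score):
    heapq.heappush(frontier, (-score, url))

def next_best():
    score, url = heapq.heappop(frontier)
    return url

add_url('http://example.com/relevant.html', 0.9)
add_url('http://example.com/offtopic.html', 0.1)
print next_best()   # -> http://example.com/relevant.html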
Shark Search
- Uses a similarity measure to estimate the relevance of an unvisited URL.
- Relevance is estimated from the anchor text and the link context.
- It assigns an anchor score and a context score to each URL.
- It maintains a depth bound.
Crawler Ethics And Conflicts
- Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical".
- For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  The server administrator and users will be upset, and the crawler developer's or admin's IP address may be blacklisted.
Crawler Etiquette (Important!)
Identify yourself
- Use the 'User-Agent' HTTP header to identify the crawler, point to a website describing it, and give contact information for the crawler developer (see the sketch after this slide).
- Use the 'From' HTTP header to specify the crawler developer's email.
- Do not disguise the crawler as a browser by using a browser's 'User-Agent' string.
Always check that HTTP requests are successful; in case of error, use the HTTP error code to determine and immediately address the problem.
Pay attention to anything that may lead to too many requests to any one server, even unwillingly.
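With urllib2 this amounts to setting the request headers explicitly; a sketch in which the crawler name, URLs, and email address are placeholders:

import urllib2

headers = {
    'User-Agent': 'MyCrawler/0.1 (+http://www.example.com/crawler.html)',  # who we are
    'From': 'crawler-admin@example.com',                                   # how to reach us
}
request = urllib2.Request('http://www.example.com/', headers=headers)
try:
    response = urllib2.urlopen(request, timeout=10)
except urllib2.HTTPError, e:
    print 'Request failed with HTTP error', e.code   # use the code to decide what to do next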
Crawler Etiquette (Important!)
Spread the load; do not overwhelm a server
- Make sure that no more than some maximum number of requests are sent to any single server per unit time, say fewer than one per second (a rate-limiting sketch follows this slide).
Honor the Robot Exclusion Protocol
- A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.apple.com/robots.txt.
- A crawler should always check, parse, and obey this file before sending any requests to a server.
More info at:
  http://www.google.com/robots.txt
  http://www.robotstxt.org/wc/exclusion.html
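One common way to enforce such a per-server limit is to remember the time of the last request to each host and sleep if needed; a sketch (the helper names are illustrative):

import time
from urlparse import urlparse

last_request = {}    # host -> time of the most recent request to it
MIN_DELAY = 1.0      # at most one request per second per server

def wait_politely(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)   # back off before hitting the same server again
    last_request[host] = time.time()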
More On Robot Exclusion
- Make sure URLs are canonical before checking them against robots.txt.
- Avoid fetching robots.txt for each request to a server by caching its policy as it applies to this crawler (a code sketch follows the robots.txt examples below).
- Let's look at some examples to understand the protocol...
www.apple.com/robots.txt
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers... can go anywhere!
www.microsoft.com/robots.txt
# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc...

All crawlers... are not allowed in these paths.
www.springer.com/robots.txt
# Robots.txt for http://www.springer.com (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Google's crawler is allowed everywhere except these paths.
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (crawl delay of 2 seconds).
AltaVista (scooter) has no limits.
Everyone else: keep off!
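Python's standard library ships a parser for this protocol (module robotparser in Python 2); a sketch that checks and caches the policy per host, with illustrative names:

import robotparser
from urlparse import urlparse

robots_cache = {}    # host -> parsed robots.txt policy, fetched once per host

def allowed(url, agent='MyCrawler'):
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url('http://%s/robots.txt' % host)
        rp.read()                      # fetch and parse robots.txt once
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)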
Demonstration
- Crawler
- Data
- Visualize the graph (using Gephi)
Thank You!