
Web Crawlers - Exploring the WWW


Page 1: Web Crawlers - Exploring the WWW

Exploring The World Wide Web

1

-- Khusbu Mishra
-- Siddhartha Anand

Page 2: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

2

Page 3: Web Crawlers - Exploring the WWW

Introduction

Web crawlers exploit the graph structure of the web in a sequential, automated manner.

They are also known as spiders, bots, wanderers, ants, fish and worms.

A crawler starts from a seed page and follows the links within it to reach other pages.

The links found along the way are stored in a repository to be crawled later.

3

Page 4: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

4

Page 5: Web Crawlers - Exploring the WWW

Q: How does a search engine know that all these pages contain the query terms?

A: Because all of those pages have been crawled.

5

Page 6: Web Crawlers - Exploring the WWW
Page 7: Web Crawlers - Exploring the WWW

Motivation For Crawlers

Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)

Gather real-world data at large scale

Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.

Business intelligence: keep track of potential competitors and partners

Monitor Web sites of interest

Evil: harvest emails for spamming, phishing…

Can you think of some others?

7

Page 8: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

8

Page 9: Web Crawlers - Exploring the WWW

Building A Crawling Infrastructure

The figure on this slide shows a sequential crawler:
  Seeds can be any list of starting URLs
  The order of page visits is determined by the frontier data structure
  The stop criterion can be anything

Page 10: Web Crawlers - Exploring the WWW

Frontier

The frontier is the to-do list of a crawler: it contains the URLs of unvisited pages.

It may be implemented as a FIFO queue.

URLs stored in it must be unique.

Uniqueness can be enforced with a hash table keyed by URL.
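As a rough sketch (not from the slides), a frontier could pair a FIFO deque with a hash set of already-seen URLs; the class and method names below are illustrative:

from collections import deque

class Frontier(object):
    """To-do list of unvisited URLs, with duplicate detection."""
    def __init__(self, seeds):
        self.queue = deque()          # FIFO: gives breadth-first order
        self.seen = set()             # hash table of every URL ever added
        for url in seeds:
            self.add(url)

    def add(self, url):
        if url not in self.seen:      # store each URL only once
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft()   # oldest URL first

    def __len__(self):
        return len(self.queue)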

Page 11: Web Crawlers - Exploring the WWW

Fetching Pages

We need a client that sends a request and receives a response.

It should use timeouts.

It needs exception handling and error checking.

It must follow the Robot Exclusion Protocol.
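One way such a fetcher might look with urllib2 (the function name and the 10-second timeout are illustrative, not from the slides; the robots.txt check is shown later in the deck):

import urllib2

def fetch(url, timeout=10):
    """Fetch a page, returning its content or None on failure."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)    # bounded wait
        return response.read()
    except urllib2.HTTPError as e:
        print 'HTTP error %d for %s' % (e.code, url)        # e.g. 404, 500
        return None
    except urllib2.URLError as e:
        print 'Could not reach %s: %s' % (url, e.reason)    # DNS failure, refused, timeout
        return None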

Page 12: Web Crawlers - Exploring the WWW

Parsing

Once a page has been fetched, we need to parse its content to extract information.

Parsing may mean extracting URLs (to crawl next) or extracting data of interest.

Each extracted URL must be canonicalized before it is added to the frontier, so that the same page is not revisited.

Page 13: Web Crawlers - Exploring the WWW

A Basic Crawler in Python (Breadth First Search)
queue: a FIFO list (pop from the front, append at the back)

import urllib2
from collections import deque
from BeautifulSoup import BeautifulSoup    # BeautifulSoup 3

seed = 'http://www.seedurl.com'
queue = deque([seed])                      # FIFO frontier seeded with one URL

while len(queue) != 0:
    url = queue.popleft()                  # next URL to crawl
    page = urllib2.urlopen(url)            # fetch the page
    soup = BeautifulSoup(page)             # parse the HTML
    for link in soup.findAll('a'):         # extract all anchor tags
        href = link.get('href')
        if href:
            queue.append(href)             # add the link target to the frontier

13

Page 14: Web Crawlers - Exploring the WWW

URL Extraction and Canonicalization

Relative vs. Absolute URLs

The crawler must translate relative URLs into absolute URLs.

The base URL is obtained from an HTTP header or the HTML <base> tag, or else defaults to the current page's path.

Examples
Base: http://www.cnn.com/linkto/
Relative URL: intl.html  ->  Absolute URL: http://www.cnn.com/linkto/intl.html
Relative URL: /US/  ->  Absolute URL: http://www.cnn.com/US/

14
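In Python this resolution is handled by urljoin from the standard library; the values below simply repeat the slide's examples (a sketch, not part of the original deck):

from urlparse import urljoin

base = 'http://www.cnn.com/linkto/'
print urljoin(base, 'intl.html')   # http://www.cnn.com/linkto/intl.html
print urljoin(base, '/US/')        # http://www.cnn.com/US/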

Page 15: Web Crawlers - Exploring the WWW

More On Canonical URLs

Convert the protocol and hostname to lowercase.

Remove the anchor (fragment) part of the URL.

Perform URL encoding consistently for commonly used characters.

Resolve '..' segments by removing them together with their parent directory.

15

Page 16: Web Crawlers - Exploring the WWW

More On Canonical URLs
Some transformations are trivial, for example:

http://informatics.indiana.edu/index.html#fragment
  ->  http://informatics.indiana.edu/index.html

http://informatics.indiana.edu/dir1/./../dir2/
  ->  http://informatics.indiana.edu/dir2/

http://informatics.indiana.edu/~fil/
  ->  http://informatics.indiana.edu/%7Efil/

http://INFORMATICS.INDIANA.EDU/fil/
  ->  http://informatics.indiana.edu/fil/

16

Page 17: Web Crawlers - Exploring the WWW

URL Extraction and Canonicalization

URL canonicalization: all of these
  http://www.cnn.com/TECH/
  http://WWW.CNN.COM/TECH/
  http://www.cnn.com:80/TECH/
  http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
  http://www.cnn.com/TECH/

In order to avoid duplication, the crawler must transform all URLs into canonical form.

17
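A minimal canonicalization sketch implementing the rules above (the function name is illustrative, and the character-encoding rule is left out for brevity):

import posixpath
from urlparse import urlparse, urlunparse, urldefrag

def canonicalize(url):
    """Normalize a URL so that equivalent forms map to a single string."""
    url, _fragment = urldefrag(url)           # remove the anchor/fragment part
    parts = urlparse(url)
    scheme = parts.scheme.lower()             # protocol to lowercase
    netloc = parts.netloc.lower()             # hostname to lowercase
    if scheme == 'http' and netloc.endswith(':80'):
        netloc = netloc[:-3]                  # drop the default port
    path = posixpath.normpath(parts.path)     # resolve '.' and '..' segments
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                           # normpath strips trailing slashes
    if path == '.':
        path = '/'                            # empty path becomes '/'
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ''))

print canonicalize('http://WWW.CNN.COM:80/bogus/../TECH/')   # http://www.cnn.com/TECH/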

Page 18: Web Crawlers - Exploring the WWW

Implementation Issues

Don't want to fetch the same page twice!
  Keep a lookup table (hash) of visited pages

The frontier grows very fast!
  May need to prioritize for large crawls

The fetcher must be robust!
  Don't crash if a download fails
  Use a timeout mechanism

Determine the file type to skip unwanted files
  Can try using file extensions, but they are not reliable

18

Page 19: Web Crawlers - Exploring the WWW

More Implementation Issues: Fetching

Get only the first 10-100 KB per page

Take care to detect and break redirection loops

Soft fail for timeouts, server not responding, file not found, and other errors

19
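A fetcher respecting these points might cap the download size and soft-fail on any error; the 100 KB limit mirrors the slide, the rest is an illustrative sketch:

import urllib2

MAX_BYTES = 100 * 1024                 # fetch at most ~100 KB per page

def fetch_prefix(url, timeout=10):
    """Return up to MAX_BYTES of a page, or None on any error (soft fail)."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        if response.info().gettype() != 'text/html':   # skip images, PDFs, ...
            return None
        return response.read(MAX_BYTES)                # read only a prefix
    except Exception:
        return None        # timeout, 404, server down, ... : skip and move on

Note that urllib2's redirect handler already limits how many redirects it follows, which guards against most redirection loops.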

Page 20: Web Crawlers - Exploring the WWW

More Implementation Issues: Parsing

HTML has the structure of a DOM (Document Object Model) tree

Unfortunately, actual HTML is often incorrect in a strict syntactic sense

Crawlers, like browsers, must be robust and forgiving

Must pay attention to HTML entities and Unicode in text

Page 21: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

21

Page 22: Web Crawlers - Exploring the WWW

Graph Traversal (BFS or DFS?)

Breadth First Search
  Implemented with a QUEUE (FIFO)
  Finds pages along shortest paths
  If we start with "good" pages, this keeps us close to them

Depth First Search
  Implemented with a STACK (LIFO)
  Tends to wander away ("lost in cyberspace")

22

Page 23-39: Web Crawlers - Exploring the WWW

Breadth First Search

[These slides step through an animation of BFS on a small graph of seven pages, URL1 through URL7, crawled from seed URL1. Each node is marked Undiscovered, Discovered, or Finished, and labelled with its shortest-path distance from the seed. The frontier queue evolves as follows:]

Queue: URL1                             (seed URL1, distance 0)
Pop URL1, enqueue URL2, URL3, URL5  ->  Queue: URL2 URL3 URL5   (each at distance 1)
Pop URL2, enqueue URL4              ->  Queue: URL3 URL5 URL4   (URL5 already discovered: don't enqueue; URL4 at distance 2)
Pop URL3, enqueue URL6              ->  Queue: URL5 URL4 URL6   (URL6 at distance 2)
Pop URL5, no new links              ->  Queue: URL4 URL6
Pop URL4, no new links              ->  Queue: URL6
Pop URL6, enqueue URL7              ->  Queue: URL7             (URL7 at distance 3)
Pop URL7, queue empty               ->  crawl of this graph finished

Page 40: Web Crawlers - Exploring the WWW

BFS vs. DFS

Page 41: Web Crawlers - Exploring the WWW

Best First Search

Assigns a score to the unvisited URLs on the page, based on similarity measures.

URLs are added to the frontier, which is maintained as a priority queue.

The best-scoring URL is crawled first.
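A best-first frontier can be built on a binary heap; this sketch (class and method names illustrative) pops the highest-scoring URL first:

import heapq

class PriorityFrontier(object):
    """Frontier ordered by estimated relevance."""
    def __init__(self):
        self.heap = []                # (negated score, url) pairs
        self.seen = set()

    def add(self, url, score):
        if url not in self.seen:
            self.seen.add(url)
            # heapq is a min-heap, so negate the score to pop the best URL first
            heapq.heappush(self.heap, (-score, url))

    def next_url(self):
        _neg_score, url = heapq.heappop(self.heap)
        return url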

Page 42: Web Crawlers - Exploring the WWW

Shark Search

• It uses a similarity measure to estimate the relevance of an unvisited URL.

• Relevance is measured by:
  - anchor text
  - the text around the link (its context)

• It assigns an anchor score and a context score to each URL.

• It maintains a depth bound.
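A heavily simplified sketch of the scoring idea only (not the full Shark Search algorithm); similarity() is a placeholder word-overlap measure, and the weight and depth bound are arbitrary:

def similarity(topic, text):
    """Placeholder similarity: fraction of topic words that appear in the text."""
    topic_words = set(topic.lower().split())
    text_words = set(text.lower().split())
    if not topic_words:
        return 0.0
    return len(topic_words & text_words) / float(len(topic_words))

def link_score(topic, anchor_text, context_text, depth, max_depth=3, anchor_weight=0.8):
    """Combine an anchor score and a context score, honoring a depth bound."""
    if depth > max_depth:
        return 0.0                # beyond the depth bound: do not follow
    anchor_score = similarity(topic, anchor_text)
    context_score = similarity(topic, context_text)
    return anchor_weight * anchor_score + (1 - anchor_weight) * context_score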

Page 43: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

43

Page 44: Web Crawlers - Exploring the WWW

Crawler Ethics And Conflicts

Crawlers can cause trouble, even unintentionally, if not properly designed to be "polite" and "ethical".

For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  The server administrator and users will be upset
  The crawler developer's/administrator's IP address may be blacklisted

44

Page 45: Web Crawlers - Exploring the WWW

Crawler Etiquette (Important!)

Identify yourself
  Use the 'User-Agent' HTTP header to identify the crawler and point to a website with a description of the crawler and contact information for its developer
  Use the 'From' HTTP header to specify the crawler developer's email
  Do not disguise the crawler as a browser by using a browser's 'User-Agent' string

Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem

Pay attention to anything that may lead to too many requests to any one server, even unintentionally

45
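With urllib2, identifying the crawler might look like this (the crawler name, URL, and email address are placeholders):

import urllib2

headers = {
    # Identify the crawler and where to learn more about it (placeholder values)
    'User-Agent': 'ExampleCrawler/1.0 (+http://www.example.com/crawler.html)',
    # Contact address of the crawler's developer (placeholder)
    'From': 'crawler-admin@example.com',
}
request = urllib2.Request('http://www.example.com/', headers=headers)
response = urllib2.urlopen(request, timeout=10)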

Page 46: Web Crawlers - Exploring the WWW

Crawler Etiquette (Important!)

Spread the load; do not overwhelm a server
  Send no more than some maximum number of requests to any single server per unit time, say fewer than one per second

Honor the Robot Exclusion Protocol
  A server can specify which parts of its document tree any crawler is or is not allowed to crawl via a file named 'robots.txt' placed in the HTTP root directory
  A crawler should always check, parse, and obey this file before sending any requests to a server

More info at:
  http://www.google.com/robots.txt
  http://www.robotstxt.org/wc/exclusion.html

46
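A sketch of checking robots.txt with the standard library's robotparser, caching the parsed policy per host so it is fetched only once (the user-agent string and helper name are illustrative):

import robotparser
from urlparse import urlparse

USER_AGENT = 'ExampleCrawler'           # placeholder crawler name
_robots_cache = {}                      # host -> parsed robots.txt policy

def allowed(url):
    """True if this crawler may fetch url according to the host's robots.txt."""
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url('http://%s/robots.txt' % host)
        rp.read()                       # fetch and parse the policy once per host
        _robots_cache[host] = rp
    return rp.can_fetch(USER_AGENT, url)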

Page 47: Web Crawlers - Exploring the WWW

More On Robot Exclusion

Make sure URLs are canonical before checking them against robots.txt.

Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler.

Let's look at some examples to understand the protocol…

47

Page 48: Web Crawlers - Exploring the WWW

www.apple.com/robots.txt

48

# robots.txt for http://www.apple.com/

User-agent: *
Disallow:

All crawlers…
…can go anywhere!

Page 49: Web Crawlers - Exploring the WWW

www.microsoft.com/robots.txt

49

# Robots.txt file for http://www.microsoft.com

User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc…

All crawlers…
…are not allowed in these paths…

Page 50: Web Crawlers - Exploring the WWW

www.springer.com/robots.txt

50

# Robots.txt for http://www.springer.com (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Google's crawler is allowed everywhere except these paths

Yahoo and MSN/Windows Live are allowed everywhere but should slow down

AltaVista has no limits

Everyone else keep off!

Page 51: Web Crawlers - Exploring the WWW

Outline
Introduction
Motivation Of Crawlers
Basic Crawlers And Implementation Issues
Crawling Algorithms
Crawler Ethics And Conflicts
Demonstration

51

Page 52: Web Crawlers - Exploring the WWW

Demonstration
Crawler -> Data -> Visualize Graph (using Gephi)

Page 53: Web Crawlers - Exploring the WWW

Thank You!