Exploring The World Wide Web
Khusbu Mishra, Siddhartha Anand
Outline
- Introduction
- Motivation of Crawlers
- Basic Crawlers and Implementation Issues
- Crawling Algorithms
- Crawler Ethics and Conflicts
- Demonstration
Introduction
- A web crawler exploits the graph structure of the Web in a sequential, automated manner.
- Crawlers are also known as spiders, bots, wanderers, ants, fish, and worms.
- A crawler starts from a seed page and then follows the links within it to reach other pages.
- The links found are stored in a repository to be crawled later.
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Motivation For Crawlers
- Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
- Gather real-world data on a large scale
- Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
- Business intelligence: keep track of potential competitors and partners
- Monitor Web sites of interest
- Evil: harvest emails for spamming, phishing...
Can you think of some others?
Building A Crawling Infrastructure
- This is a sequential crawler
- Seeds can be any list of starting URLs
- The order of page visits is determined by the frontier data structure
- The stop criterion can be anything
Frontier
- The to-do list of a crawler: it contains the URLs of unvisited pages.
- It may be implemented as a FIFO queue.
- Stored URLs must be unique.
- It can be maintained as a hash table with the URL as key.
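A minimal sketch of such a frontier, using Python's collections.deque as the FIFO queue and a set as the hash table of URLs already seen (the class and method names are illustrative, not from the slides):

from collections import deque

class Frontier(object):
    """To-do list of the crawler: a FIFO queue of unvisited, unique URLs."""
    def __init__(self, seeds):
        self.queue = deque(seeds)     # FIFO: append to the right, pop from the left
        self.seen = set(seeds)        # hash table keyed by URL, guarantees uniqueness

    def add(self, url):
        if url not in self.seen:      # only enqueue URLs never seen before
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()   # next URL to crawl

    def __len__(self):
        return len(self.queue)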
Fetching Pages
- We need a client that sends a request and receives a response.
- It should have timeouts.
- It needs exception handling and error checking.
- It must follow the Robot Exclusion Protocol.
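A hedged sketch of such a fetcher in Python 2 (matching the urllib2-based example later in the slides), with a timeout and basic error checking; robots.txt handling is shown separately further on:

import urllib2

def fetch(url, timeout=10, max_bytes=100 * 1024):
    """Fetch a page, returning its content or None on failure (sketch only)."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)  # don't hang forever
        return response.read(max_bytes)                   # only the first ~100 KB
    except urllib2.HTTPError, e:
        print 'HTTP error %d for %s' % (e.code, url)      # soft fail, keep crawling
    except (urllib2.URLError, IOError), e:
        print 'Could not fetch %s: %s' % (url, e)
    return None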
Parsing
- Once a page has been fetched, we need to parse its content to extract information.
- Parsing may mean extracting URLs, or extracting data.
- Extracted URLs must be canonicalized before being added to the frontier, so that the same page is not revisited.
A Basic Crawler in Python: Breadth-First Search
(queue: a FIFO list, shift and push)

import urllib2                              # Python 2 HTTP client
from collections import deque
from BeautifulSoup import BeautifulSoup     # BeautifulSoup 3

seed = 'http://www.seedurl.com'
queue = deque([seed])                       # frontier: FIFO queue of URLs to visit
while len(queue) != 0:
    url = queue.popleft()                   # shift: take the next URL from the front
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    for link in soup.findAll('a', href=True):
        queue.append(link['href'])          # push: add discovered links to the back
URL Extraction and Canonicalization
Relative vs. absolute URLs
- The crawler must translate relative URLs into absolute URLs.
- The base URL is obtained from the HTTP headers or the HTML <base> tag, or else defaults to the current page's path.
Examples
- Base: http://www.cnn.com/linkto/
- Relative URL: intl.html -> Absolute URL: http://www.cnn.com/linkto/intl.html
- Relative URL: /US/ -> Absolute URL: http://www.cnn.com/US/
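In Python this resolution is available in the standard library; a small sketch using urlparse.urljoin (Python 2) that reproduces the examples above:

from urlparse import urljoin

base = 'http://www.cnn.com/linkto/'
print urljoin(base, 'intl.html')   # -> http://www.cnn.com/linkto/intl.html
print urljoin(base, '/US/')        # -> http://www.cnn.com/US/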
More On Canonical URLs
- Convert the protocol and hostname to lowercase.
- Remove the anchor (fragment) part of the URL.
- Apply URL encoding consistently for some commonly used characters (e.g. '~').
- Resolve '..' against its parent directory and remove both from the URL path.
More On Canonical URLs
Some transformations are trivial, for example:
http://informatics.indiana.edu/index.html#fragment -> http://informatics.indiana.edu/index.html
http://informatics.indiana.edu/dir1/./../dir2/ -> http://informatics.indiana.edu/dir2/
http://informatics.indiana.edu/~fil/ -> http://informatics.indiana.edu/%7Efil/
http://INFORMATICS.INDIANA.EDU/fil/ -> http://informatics.indiana.edu/fil/
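A partial sketch of these rules in Python 2 (lowercasing, default-port and fragment removal, and '.'/'..' resolution only; percent-encoding of characters such as '~' is left out). The helper name canonicalize is an assumption for illustration:

from urlparse import urlparse, urlunparse
import posixpath

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()                  # lowercase protocol
    netloc = parts.netloc.lower()                  # lowercase hostname
    if netloc.endswith(':80'):                     # drop the default HTTP port
        netloc = netloc[:-3]
    path = posixpath.normpath(parts.path or '/')   # resolve '.' and '..'
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'                                # keep a meaningful trailing slash
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ''))  # drop #fragment

print canonicalize('http://INFORMATICS.INDIANA.EDU/dir1/./../dir2/')
# -> http://informatics.indiana.edu/dir2/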
URL Canonicalization
All of these:
  http://www.cnn.com/TECH/
  http://WWW.CNN.COM/TECH/
  http://www.cnn.com:80/TECH/
  http://www.cnn.com/bogus/../TECH/
are really equivalent to this canonical form:
  http://www.cnn.com/TECH/
To avoid duplication, the crawler must transform all URLs into canonical form.
Implementation Issues
- Don't want to fetch the same page twice!
  Keep a lookup table (hash) of visited pages.
- The frontier grows very fast!
  May need to prioritize for large crawls.
- The fetcher must be robust!
  Don't crash if a download fails; use a timeout mechanism.
- Determine the file type to skip unwanted files.
  Extensions can be used, but they are not reliable.
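Since extensions are unreliable, a crawler can instead look at the Content-Type header of the response. A small Python 2 sketch (the helper name is illustrative):

import urllib2

def is_html(url, timeout=10):
    """Return True if the server reports the resource as HTML (sketch only)."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        return response.info().gettype() == 'text/html'   # e.g. 'text/html', 'image/png'
    except (urllib2.URLError, IOError):
        return False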
More Implementation Issues: Fetching
- Get only the first 10-100 KB per page.
- Take care to detect and break redirection loops.
- Soft fail for timeouts, a server not responding, file not found, and other errors.
More Implementation Issues: Parsing
- HTML has the structure of a DOM (Document Object Model) tree.
- Unfortunately, actual HTML is often incorrect in a strict syntactic sense.
- Crawlers, like browsers, must be robust and forgiving.
- Must pay attention to HTML entities and Unicode in text.
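BeautifulSoup (used in the crawler above) is one such forgiving parser; a small sketch with deliberately broken markup and an HTML entity (the sample string is made up for illustration):

from BeautifulSoup import BeautifulSoup

broken = '<html><body><p>AT&amp;T news <a href=index.html>home</body>'  # unclosed tags, unquoted attribute
soup = BeautifulSoup(broken, convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup.find('a')['href']              # 'index.html' despite the sloppy markup
print soup.find('p').findAll(text=True)   # text nodes, roughly [u'AT&T news ', u'home']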
Graph Traversal (BFS or DFS?)
Breadth-First Search
- Implemented with a QUEUE (FIFO)
- Finds pages along shortest paths
- If we start with "good" pages, this keeps us close to them
Depth-First Search
- Implemented with a STACK (LIFO)
- Tends to wander away ("lost in cyberspace")
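The only difference between the two strategies is which end of the frontier the next URL comes from; a tiny sketch (helper names are illustrative):

from collections import deque

frontier = deque(['http://www.seedurl.com'])

def next_url(strategy='BFS'):
    # BFS: treat the frontier as a queue (FIFO); DFS: treat it as a stack (LIFO)
    return frontier.popleft() if strategy == 'BFS' else frontier.pop()

# newly extracted links are appended to the right in both cases
frontier.append('http://www.seedurl.com/page1.html')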
Breadth First Search: worked example
[Animated slides: BFS over a small web graph of pages URL1-URL7, crawled from the seed URL1. Nodes are marked Undiscovered, Discovered, and Finished, and each node is labelled with its shortest-path distance from the seed.]
- Pop order: URL1, URL2, URL3, URL5, URL4, URL6, URL7.
- Queue after each pop: [URL2, URL3, URL5] -> [URL3, URL5, URL4] -> [URL5, URL4, URL6] -> [URL4, URL6] -> [URL6] -> [URL7] -> [] (empty).
- A URL that is already discovered (e.g. URL5, reached again from URL2) is not enqueued a second time.
- Shortest-path distances from the seed: URL1 = 0; URL2, URL3, URL5 = 1; URL4, URL6 = 2; URL7 = 3.
BFS vs. DFS
Best-First Search
- Assigns a score to the unvisited URLs on a page, based on similarity measures.
- URLs are added to a frontier that is maintained as a priority queue.
- The best-scoring URL is crawled first.
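A priority-queue frontier can be sketched with Python's heapq; the scores are assumed to come from some similarity function, which is not shown here:

import heapq

frontier = []   # heap of (negative score, url): the highest-scoring URL pops first

def add_url(url, score):
    heapq.heappush(frontier, (-score, url))

def next_best():
    score, url = heapq.heappop(frontier)
    return url

add_url('http://example.com/relevant.html', 0.9)
add_url('http://example.com/offtopic.html', 0.1)
print next_best()   # -> http://example.com/relevant.html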
Shark Search
- Uses a similarity measure to estimate the relevance of an unvisited URL.
- Relevance is estimated from the anchor text and the link context.
- It assigns an anchor score and a context score to each URL.
- It maintains a depth bound.
Crawler Ethics And Conflicts
- Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical".
- For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  The server administrator and users will be upset, and the crawler developer's or admin's IP address may be blacklisted.
Crawler Etiquette (Important!)
Identify yourself
- Use the 'User-Agent' HTTP header to identify the crawler, point to a website describing it, and give contact information for the crawler developer (see the sketch after this slide).
- Use the 'From' HTTP header to specify the crawler developer's email.
- Do not disguise the crawler as a browser by using a browser's 'User-Agent' string.
Always check that HTTP requests are successful; in case of error, use the HTTP error code to determine and immediately address the problem.
Pay attention to anything that may lead to too many requests to any one server, even unwillingly.
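With urllib2 this amounts to setting the request headers explicitly; a sketch in which the crawler name, URLs, and email address are placeholders:

import urllib2

headers = {
    'User-Agent': 'MyCrawler/0.1 (+http://www.example.com/crawler.html)',  # who we are
    'From': 'crawler-admin@example.com',                                   # how to reach us
}
request = urllib2.Request('http://www.example.com/', headers=headers)
try:
    response = urllib2.urlopen(request, timeout=10)
except urllib2.HTTPError, e:
    print 'Request failed with HTTP error', e.code   # use the code to decide what to do next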
Crawler Etiquette (Important!)
Spread the load; do not overwhelm a server
- Make sure that no more than some maximum number of requests are sent to any single server per unit time, say fewer than one per second (a rate-limiting sketch follows this slide).
Honor the Robot Exclusion Protocol
- A server can specify which parts of its document tree any crawler is or is not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.apple.com/robots.txt.
- A crawler should always check, parse, and obey this file before sending any requests to a server.
More info at:
  http://www.google.com/robots.txt
  http://www.robotstxt.org/wc/exclusion.html
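One common way to enforce such a per-server limit is to remember the time of the last request to each host and sleep if needed; a sketch (the helper names are illustrative):

import time
from urlparse import urlparse

last_request = {}    # host -> time of the most recent request to it
MIN_DELAY = 1.0      # at most one request per second per server

def wait_politely(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_request.get(host, 0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)   # back off before hitting the same server again
    last_request[host] = time.time()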
More On Robot Exclusion
- Make sure URLs are canonical before checking them against robots.txt.
- Avoid fetching robots.txt for each request to a server by caching its policy as it applies to this crawler (a code sketch follows the robots.txt examples below).
- Let's look at some examples to understand the protocol...
www.apple.com/robots.txt
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:

All crawlers... can go anywhere!
www.microsoft.com/robots.txt
# Robots.txt file for http://www.microsoft.com
User-agent: *
Disallow: /canada/Library/mnp/2/aspx/
Disallow: /communities/bin.aspx
Disallow: /communities/eventdetails.mspx
Disallow: /communities/blogs/PortalResults.mspx
Disallow: /communities/rss.aspx
Disallow: /downloads/Browse.aspx
Disallow: /downloads/info.aspx
Disallow: /france/formation/centres/planning.asp
Disallow: /france/mnp_utility.mspx
Disallow: /germany/library/images/mnp/
Disallow: /germany/mnp_utility.mspx
Disallow: /ie/ie40/
Disallow: /info/customerror.htm
Disallow: /info/smart404.asp
Disallow: /intlkb/
Disallow: /isapi/
# etc...

All crawlers... are not allowed in these paths.
www.springer.com/robots.txt
# Robots.txt for http://www.springer.com (fragment)

User-agent: Googlebot
Disallow: /chl/*
Disallow: /uk/*
Disallow: /italy/*
Disallow: /france/*

User-agent: slurp
Disallow:
Crawl-delay: 2

User-agent: MSNBot
Disallow:
Crawl-delay: 2

User-agent: scooter
Disallow:

# all others
User-agent: *
Disallow: /

Google's crawler is allowed everywhere except these paths.
Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (crawl delay of 2 seconds).
AltaVista (scooter) has no limits.
Everyone else: keep off!
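Python's standard library ships a parser for this protocol (module robotparser in Python 2); a sketch that checks and caches the policy per host, with illustrative names:

import robotparser
from urlparse import urlparse

robots_cache = {}    # host -> parsed robots.txt policy, fetched once per host

def allowed(url, agent='MyCrawler'):
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = robotparser.RobotFileParser()
        rp.set_url('http://%s/robots.txt' % host)
        rp.read()                      # fetch and parse robots.txt once
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(agent, url)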
Demonstration
- Crawler
- Data
- Visualize the graph (using Gephi)
Thank You!