Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram: search engines crawl the web, build an index, and serve user searches against that index.]
What’s Wrong?
• Users have a limited search interface.
• Today’s web is dynamic and growing:
  - Timely re-crawls are required.
  - Re-crawling is not feasible for all web sites.
• Search engines control your search results:
  - They decide which sites get crawled:
    ◦ An estimated 550 billion documents in 2001 (BrightPlanet); Google indexes 3.3 billion documents.
  - They decide which sites get re-crawled more frequently.
  - They may censor or skew result rankings.
Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
• P2P users donate excess bandwidth and computation resources to crawl the web.
• Crawlers are organized using distributed hash tables (DHTs).
• A DHT- and query-processor-agnostic crawler:
  - Designed to work over any DHT.
  - Crawls can be expressed as declarative recursive queries:
    ◦ Easy for users to customize.
    ◦ Queries can be executed over PIER, a DHT-based relational P2P query processor.
[Diagram: the crawlers are PIER nodes; the crawlees are web servers.]
Potential
• Infrastructure for crawl personalization:
  - User-defined focused crawlers.
  - Collaborative crawling/filtering (special-interest groups).
• Other possibilities:
  - A bigger, better, faster web crawler.
  - Enables new search and indexing technologies:
    ◦ P2P web search.
    ◦ Web archival and storage (with OceanStore).
• A generalized crawler for querying distributed graph structures:
  - Monitoring file-sharing networks, e.g. Gnutella.
  - P2P network maintenance:
    ◦ Routing information.
    ◦ OceanStore metadata.
Challenges that We Investigated
• Scalability and throughput:
  - DHT communication overheads.
  - Balancing network load on crawlers:
    ◦ Two components of network load: download bandwidth and DHT bandwidth.
  - Network proximity: exploiting the network locality of crawlers.
• Limiting download rates on web sites (a minimal throttling sketch follows this list):
  - Prevents denial-of-service attacks on crawled sites.
• Main tradeoff: tension between coordination and communication:
  - Balance load either on crawlers or on crawlees!
  - Exploit network proximity at the cost of extra communication.
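As a concrete illustration of per-site rate limiting, here is a minimal sketch of a per-host throttle a crawler could consult before each download. The class name and the min_interval_s parameter are illustrative assumptions, not the policy used in the actual system.

```python
import time
from urllib.parse import urlparse

class HostRateThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_s=1.0):
        self.min_interval_s = min_interval_s
        self.last_fetch = {}  # host -> time of last fetch (monotonic seconds)

    def wait_time(self, url):
        """Seconds to wait before this URL may be fetched politely."""
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval_s - (time.monotonic() - last))

    def record_fetch(self, url):
        """Remember when we last hit this host."""
        self.last_fetch[urlparse(url).netloc] = time.monotonic()
```

In the distributed setting, this state only works as a control point if all URLs of a host are routed to the same crawler, which is exactly what the hostname-based partitioning below provides.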
Crawl as a Recursive Query
[Dataflow diagram: Seed URLs and a DHT scan of WebPage(url) feed a Rate Throttle & Reorder stage; each crawler thread runs a CrawlWrapper (Downloader, Extractor, Redirect) that turns input URLs into output links; after Filters and DupElim, links are published as Link(sourceUrl, destUrl), and the projection Π: Link.destUrl → WebPage(url) publishes newly discovered URLs back as WebPage(url) tuples, closing the recursion. A minimal imperative sketch follows.]
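The dataflow above can be read as a recursive query: newly published Link tuples are projected back into WebPage tuples and re-enter the crawl. Below is a minimal single-process sketch of that recursion; fetch_page, extract_links, and the in-memory frontier are illustrative stand-ins for the DHT scan/publish operators, not PIER's actual API.

```python
from collections import deque
from urllib.parse import urljoin
import re

def extract_links(base_url, html):
    # Crude href extraction; a real Extractor would use an HTML parser.
    return [urljoin(base_url, m) for m in re.findall(r'href="([^"#]+)"', html)]

def crawl(seed_urls, fetch_page, max_pages=1000):
    webpage = deque(seed_urls)   # WebPage(url): the crawl frontier
    seen = set(seed_urls)        # DupElim over WebPage(url)
    links = []                   # Link(sourceUrl, destUrl)
    fetched = 0

    while webpage and fetched < max_pages:
        url = webpage.popleft()  # "DHT scan" over WebPage(url)
        html = fetch_page(url)   # Downloader (e.g. an HTTP GET; may return None)
        fetched += 1
        if html is None:
            continue
        for dest in extract_links(url, html):   # Extractor
            links.append((url, dest))           # publish Link(sourceUrl, destUrl)
            if dest not in seen:                # DupElim
                seen.add(dest)
                webpage.append(dest)            # Π: Link.destUrl → WebPage(url)
    return links
```

Calling crawl(["http://www.google.com"], my_fetch) with any fetch function reproduces the recursion; in the real system the frontier and the link table live in the DHT rather than in process memory, which is what makes the partitioning choices below matter.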
Crawl Distribution Strategies
• Partition by URL:
  - Ensures an even distribution of crawler workload.
  - High DHT communication traffic.
• Partition by hostname:
  - One crawler per hostname.
  - Creates a “control point” for per-server rate throttling.
  - May lead to an uneven crawler load distribution.
  - Single point of failure:
    ◦ A “bad” choice of crawler hurts per-site crawl throughput.
  - Slight variation: X crawlers per hostname.
A sketch of how each scheme maps a URL to a DHT key follows this list.
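This is a minimal sketch of the two key mappings, assuming a SHA-1-based key space; key_by_url, key_by_hostname, and the replicas parameter (the “X crawlers per hostname” variant) are illustrative names, not the system’s actual interface.

```python
import hashlib
from urllib.parse import urlparse

def dht_key(s):
    """Map a string to a point in the DHT key space (SHA-1 as an example)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

def key_by_url(url):
    # Partition by URL: every URL hashes independently -> even spread of work,
    # but almost every extracted link is shipped to a different node.
    return dht_key(url)

def key_by_hostname(url, replicas=1):
    # Partition by hostname: all URLs of a site map to one node (or, in the
    # "X crawlers per hostname" variant, to one of `replicas` nodes), giving
    # a control point for per-server rate throttling.
    host = urlparse(url).netloc
    if replicas == 1:
        return dht_key(host)
    return dht_key("%s#%d" % (host, dht_key(url) % replicas))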
Redirection
• A simple technique that lets a crawler redirect, i.e. pass on, its assigned work to another crawler (and so on…).
• A second-chance distribution mechanism, orthogonal to the partitioning scheme.
• Example: partition by hostname.
  - The node responsible for www.google.com (red) dispatches work, by URL, to the grey nodes.
• Combines the load-balancing benefits of partition-by-URL with the control benefits of partition-by-hostname.
• When to redirect? Policy-based:
  - Crawler load (queue size).
  - Network proximity.
• Why not always? Cost of redirection:
  - Increased DHT control traffic.
  - Hence, limit the number of redirections per URL.
A sketch of such a redirection check follows.
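This is a minimal sketch of a policy-based redirection check under the assumptions above (redirect only on overload, cap redirections per URL); the threshold values and the pick_less_loaded_peer helper are illustrative, not the paper’s exact policy.

```python
MAX_REDIRECTS = 1       # e.g. one level of redirection, as in the experiments
QUEUE_THRESHOLD = 500   # redirect only when the local crawl queue is overloaded

def maybe_redirect(url, redirects_so_far, local_queue_len, pick_less_loaded_peer):
    """Return a peer to forward `url` to, or None to crawl it locally."""
    if redirects_so_far >= MAX_REDIRECTS:
        return None                     # cap per-URL DHT control traffic
    if local_queue_len < QUEUE_THRESHOLD:
        return None                     # not overloaded: keep the control point
    return pick_less_loaded_peer(url)   # e.g. hash the URL to another node
```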
Experiments
• Deployment:
  - Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  - 3 crawl threads per crawler; 15-minute crawl duration.
• Distribution (partition) schemes:
  - URL.
  - Hostname.
  - Hostname with 8 crawlers per unique host.
  - Hostname with one level of redirection on overload.
• Crawl workloads:
  - Exhaustive crawl:
    ◦ Seed URL: http://www.google.com; 78,244 distinct web servers.
  - Crawl of a fixed number of sites:
    ◦ Seed URL: http://www.google.com; 45 web servers within google.com.
  - Crawl of a single site: http://groups.google.com.
Crawl of Multiple Sites I
[Figures: crawl throughput scaleup; CDF of per-crawler downloads (80 nodes).]
• Hostname partitioning can exploit at most 45 crawlers (one per crawled site).
• Redirect (the hybrid hostname/URL scheme) performs best.
• Partition by hostname shows poor load balance (70% of crawlers idle); schemes that keep more crawlers busy do better.
Crawl of Multiple Sites II
[Figure: per-URL DHT overheads.]
• Redirection incurs higher overheads only after the queue size exceeds its threshold.
• Hostname partitioning incurs low overheads, since the crawl only visits google.com, which has many self-links.
• Redirect: the per-URL DHT overheads reach their maximum at around 70 nodes.
Network Proximity
• Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
• Partition by hostname approximates a random assignment of targets to crawlers.
• Best-of-3 random is “close enough” to best-of-5 random (see the sketch after this list).
• Sanity check: what if a single host crawls all targets?
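This is a minimal sketch of the “best-of-k random” idea: hash a crawl target to k candidate crawlers and keep the one with the lowest measured round-trip time. The crawlers list and the rtt_ms measurements are assumed inputs (e.g., from the ping sampling above), not part of the system’s interface.

```python
import hashlib

def best_of_k(target_host, crawlers, rtt_ms, k=3):
    """Pick the closest of k hash-selected candidate crawlers.

    crawlers: list of crawler node ids.
    rtt_ms:   rtt_ms[crawler][target_host] -> measured ping time in ms.
    """
    def h(s):
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

    # k independent hash choices emulate k random candidates for this target.
    candidates = [crawlers[h("%s#%d" % (target_host, i)) % len(crawlers)]
                  for i in range(k)]
    return min(candidates, key=lambda c: rtt_ms[c][target_host])
```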
Summary of Schemes

Scheme   | Load-balance download bandwidth | Load-balance DHT bandwidth | DHT communication overheads | Network proximity | Rate-limit crawlees
URL      | +                               | +                          | -                           | -                 | -
Hostname | -                               | -                          | +                           | ?                 | +
Redirect | +                               | ?                          | --                          | +                 | +
Related Work
• Herodotus (MIT, Chord-based):
  - Partition by URL.
  - Batching with ring-based forwarding.
  - Experiments on 4 local machines.
• Apoidea (Georgia Tech, Chord-based):
  - Partition by hostname.
  - Forwards crawl work to the DHT neighbor closest to the website.
  - Experiments on 12 local machines.
Conclusion
Our main contributions:
• Propose a DHT- and QP-agnostic distributed crawler.
• Express a crawl as a query:
  - Permits user-customizable refinement of crawls.
• Discover important trade-offs in distributed crawling:
  - Coordination comes with extra communication costs.
• Deployment and experimentation on PlanetLab:
  - Examined crawl distribution strategies under different workloads on live web sources.
  - Measured the potential benefits of network proximity.
Backup slides
Existing Crawlers
• Cluster-based crawlers:
  - Google: a centralized dispatcher sends URLs to be crawled.
  - Hash-based parallel crawlers.
• Focused crawlers:
  - BINGO!: crawls the web given a basic training set.
• Peer-to-peer:
  - Grub: a SETI@Home-style infrastructure; 23,993 members.
Exhaustive Crawl
• Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
• Little difference in throughput: most crawler threads are kept busy.
Single Site
• URL partitioning is best, followed by redirect and hostname.
Future Work
• Fault tolerance.
• Security.
• Single-node throughput.
• Work-sharing between crawl queries:
  - Essential for overlapping users.
• Global crawl prioritization:
  - A requirement of personalized crawls.
  - Online relevance feedback.
• Deep-web retrieval.