Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram: search engines crawl the web, build an index, and serve user searches against that index.]
What’s Wrong?
• Users have a limited search interface.
• Today’s web is dynamic and growing:
  - Timely re-crawls are required.
  - Re-crawling is not feasible for all web sites.
• Search engines control your search results:
  - They decide which sites get crawled:
    ◦ An estimated 550 billion documents in 2001 (BrightPlanet); Google indexes 3.3 billion documents.
  - They decide which sites get re-crawled more frequently.
  - They may censor or skew result rankings.
Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
• P2P users donate excess bandwidth and computation resources to crawl the web.
• Crawlers are organized using distributed hash tables (DHTs).
• A DHT- and query-processor-agnostic crawler:
  - Designed to work over any DHT.
  - Crawls can be expressed as declarative recursive queries:
    ◦ Easy for users to customize.
    ◦ Queries can be executed over PIER, a DHT-based relational P2P query processor.
[Diagram: the crawlers are PIER nodes; the crawlees are web servers.]
Potential
• Infrastructure for crawl personalization:
  - User-defined focused crawlers.
  - Collaborative crawling/filtering (special-interest groups).
• Other possibilities:
  - A bigger, better, faster web crawler.
  - Enables new search and indexing technologies:
    ◦ P2P web search.
    ◦ Web archival and storage (with OceanStore).
• A generalized crawler for querying distributed graph structures:
  - Monitoring file-sharing networks, e.g. Gnutella.
  - P2P network maintenance:
    ◦ Routing information.
    ◦ OceanStore metadata.
Challenges that We Investigated
• Scalability and throughput:
  - DHT communication overheads.
  - Balancing network load on crawlers:
    ◦ Two components of network load: download bandwidth and DHT bandwidth.
  - Network proximity: exploiting the network locality of crawlers.
• Limiting download rates on web sites (a minimal throttling sketch follows this list):
  - Prevents denial-of-service attacks on crawled sites.
• Main tradeoff: tension between coordination and communication:
  - Balance load either on crawlers or on crawlees!
  - Exploit network proximity at the cost of extra communication.
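As a concrete illustration of per-site rate limiting, here is a minimal sketch of a per-host throttle a crawler could consult before each download. The class name and the min_interval_s parameter are illustrative assumptions, not the policy used in the actual system.

```python
import time
from urllib.parse import urlparse

class HostRateThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_s=1.0):
        self.min_interval_s = min_interval_s
        self.last_fetch = {}  # host -> time of last fetch (monotonic seconds)

    def wait_time(self, url):
        """Seconds to wait before this URL may be fetched politely."""
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval_s - (time.monotonic() - last))

    def record_fetch(self, url):
        """Remember when we last hit this host."""
        self.last_fetch[urlparse(url).netloc] = time.monotonic()
```

In the distributed setting, this state only works as a control point if all URLs of a host are routed to the same crawler, which is exactly what the hostname-based partitioning below provides.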
Crawl as a Recursive Query
[Dataflow diagram: Seed URLs and a DHT scan of WebPage(url) feed a Rate Throttle & Reorder stage; each crawler thread runs a CrawlWrapper (Downloader, Extractor, Redirect) that turns input URLs into output links; after Filters and DupElim, links are published as Link(sourceUrl, destUrl), and the projection Π: Link.destUrl → WebPage(url) publishes newly discovered URLs back as WebPage(url) tuples, closing the recursion. A minimal imperative sketch follows.]
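The dataflow above can be read as a recursive query: newly published Link tuples are projected back into WebPage tuples and re-enter the crawl. Below is a minimal single-process sketch of that recursion; fetch_page, extract_links, and the in-memory frontier are illustrative stand-ins for the DHT scan/publish operators, not PIER's actual API.

```python
from collections import deque
from urllib.parse import urljoin
import re

def extract_links(base_url, html):
    # Crude href extraction; a real Extractor would use an HTML parser.
    return [urljoin(base_url, m) for m in re.findall(r'href="([^"#]+)"', html)]

def crawl(seed_urls, fetch_page, max_pages=1000):
    webpage = deque(seed_urls)   # WebPage(url): the crawl frontier
    seen = set(seed_urls)        # DupElim over WebPage(url)
    links = []                   # Link(sourceUrl, destUrl)
    fetched = 0

    while webpage and fetched < max_pages:
        url = webpage.popleft()  # "DHT scan" over WebPage(url)
        html = fetch_page(url)   # Downloader (e.g. an HTTP GET; may return None)
        fetched += 1
        if html is None:
            continue
        for dest in extract_links(url, html):   # Extractor
            links.append((url, dest))           # publish Link(sourceUrl, destUrl)
            if dest not in seen:                # DupElim
                seen.add(dest)
                webpage.append(dest)            # Π: Link.destUrl → WebPage(url)
    return links
```

Calling crawl(["http://www.google.com"], my_fetch) with any fetch function reproduces the recursion; in the real system the frontier and the link table live in the DHT rather than in process memory, which is what makes the partitioning choices below matter.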
Crawl Distribution Strategies
• Partition by URL:
  - Ensures an even distribution of crawler workload.
  - High DHT communication traffic.
• Partition by hostname:
  - One crawler per hostname.
  - Creates a “control point” for per-server rate throttling.
  - May lead to an uneven crawler load distribution.
  - Single point of failure:
    ◦ A “bad” choice of crawler hurts per-site crawl throughput.
  - Slight variation: X crawlers per hostname.
A sketch of how each scheme maps a URL to a DHT key follows this list.
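This is a minimal sketch of the two key mappings, assuming a SHA-1-based key space; key_by_url, key_by_hostname, and the replicas parameter (the “X crawlers per hostname” variant) are illustrative names, not the system’s actual interface.

```python
import hashlib
from urllib.parse import urlparse

def dht_key(s):
    """Map a string to a point in the DHT key space (SHA-1 as an example)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

def key_by_url(url):
    # Partition by URL: every URL hashes independently -> even spread of work,
    # but almost every extracted link is shipped to a different node.
    return dht_key(url)

def key_by_hostname(url, replicas=1):
    # Partition by hostname: all URLs of a site map to one node (or, in the
    # "X crawlers per hostname" variant, to one of `replicas` nodes), giving
    # a control point for per-server rate throttling.
    host = urlparse(url).netloc
    if replicas == 1:
        return dht_key(host)
    return dht_key("%s#%d" % (host, dht_key(url) % replicas))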
Redirection
• A simple technique that lets a crawler redirect, i.e. pass on, its assigned work to another crawler (and so on…).
• A second-chance distribution mechanism, orthogonal to the partitioning scheme.
• Example: partition by hostname.
  - The node responsible for www.google.com (red) dispatches work, by URL, to the grey nodes.
• Combines the load-balancing benefits of partition-by-URL with the control benefits of partition-by-hostname.
• When to redirect? Policy-based:
  - Crawler load (queue size).
  - Network proximity.
• Why not always? Cost of redirection:
  - Increased DHT control traffic.
  - Hence, limit the number of redirections per URL.
A sketch of such a redirection check follows.
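This is a minimal sketch of a policy-based redirection check under the assumptions above (redirect only on overload, cap redirections per URL); the threshold values and the pick_less_loaded_peer helper are illustrative, not the paper’s exact policy.

```python
MAX_REDIRECTS = 1       # e.g. one level of redirection, as in the experiments
QUEUE_THRESHOLD = 500   # redirect only when the local crawl queue is overloaded

def maybe_redirect(url, redirects_so_far, local_queue_len, pick_less_loaded_peer):
    """Return a peer to forward `url` to, or None to crawl it locally."""
    if redirects_so_far >= MAX_REDIRECTS:
        return None                     # cap per-URL DHT control traffic
    if local_queue_len < QUEUE_THRESHOLD:
        return None                     # not overloaded: keep the control point
    return pick_less_loaded_peer(url)   # e.g. hash the URL to another node
```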
Experiments
• Deployment:
  - Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  - 3 crawl threads per crawler; 15-minute crawl duration.
• Distribution (partition) schemes:
  - URL.
  - Hostname.
  - Hostname with 8 crawlers per unique host.
  - Hostname with one level of redirection on overload.
• Crawl workloads:
  - Exhaustive crawl:
    ◦ Seed URL: http://www.google.com; 78,244 distinct web servers.
  - Crawl of a fixed number of sites:
    ◦ Seed URL: http://www.google.com; 45 web servers within google.com.
  - Crawl of a single site: http://groups.google.com.
Crawl of Multiple Sites I
[Figures: crawl throughput scaleup; CDF of per-crawler downloads (80 nodes).]
• Hostname partitioning can exploit at most 45 crawlers (one per crawled site).
• Redirect (the hybrid hostname/URL scheme) performs best.
• Partition by hostname shows poor load balance (70% of crawlers idle); schemes that keep more crawlers busy do better.
Crawl of Multiple Sites II
[Figure: per-URL DHT overheads.]
• Redirection incurs higher overheads only after the queue size exceeds its threshold.
• Hostname partitioning incurs low overheads, since the crawl only visits google.com, which has many self-links.
• Redirect: the per-URL DHT overheads reach their maximum at around 70 nodes.
Network Proximity
• Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
• Partition by hostname approximates a random assignment of targets to crawlers.
• Best-of-3 random is “close enough” to best-of-5 random (see the sketch after this list).
• Sanity check: what if a single host crawls all targets?
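This is a minimal sketch of the “best-of-k random” idea: hash a crawl target to k candidate crawlers and keep the one with the lowest measured round-trip time. The crawlers list and the rtt_ms measurements are assumed inputs (e.g., from the ping sampling above), not part of the system’s interface.

```python
import hashlib

def best_of_k(target_host, crawlers, rtt_ms, k=3):
    """Pick the closest of k hash-selected candidate crawlers.

    crawlers: list of crawler node ids.
    rtt_ms:   rtt_ms[crawler][target_host] -> measured ping time in ms.
    """
    def h(s):
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

    # k independent hash choices emulate k random candidates for this target.
    candidates = [crawlers[h("%s#%d" % (target_host, i)) % len(crawlers)]
                  for i in range(k)]
    return min(candidates, key=lambda c: rtt_ms[c][target_host])
```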
Summary of Schemes

Scheme   | Load-balance download bandwidth | Load-balance DHT bandwidth | DHT communication overheads | Network proximity | Rate-limit crawlees
URL      | +                               | +                          | -                           | -                 | -
Hostname | -                               | -                          | +                           | ?                 | +
Redirect | +                               | ?                          | --                          | +                 | +
Related Work
• Herodotus (MIT, Chord-based):
  - Partition by URL.
  - Batching with ring-based forwarding.
  - Experiments on 4 local machines.
• Apoidea (Georgia Tech, Chord-based):
  - Partition by hostname.
  - Forwards crawl work to the DHT neighbor closest to the website.
  - Experiments on 12 local machines.
Conclusion
Our main contributions:
• Propose a DHT- and QP-agnostic distributed crawler.
• Express a crawl as a query:
  - Permits user-customizable refinement of crawls.
• Discover important trade-offs in distributed crawling:
  - Coordination comes with extra communication costs.
• Deployment and experimentation on PlanetLab:
  - Examined crawl distribution strategies under different workloads on live web sources.
  - Measured the potential benefits of network proximity.
Backup slides
Existing Crawlers
• Cluster-based crawlers:
  - Google: a centralized dispatcher sends URLs to be crawled.
  - Hash-based parallel crawlers.
• Focused crawlers:
  - BINGO!: crawls the web given a basic training set.
• Peer-to-peer:
  - Grub: a SETI@Home-style infrastructure; 23,993 members.
Exhaustive Crawl
• Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
• Little difference in throughput: most crawler threads are kept busy.
Single Site
• URL partitioning is best, followed by redirect and hostname.
Future Work
• Fault tolerance.
• Security.
• Single-node throughput.
• Work-sharing between crawl queries:
  - Essential for overlapping users.
• Global crawl prioritization:
  - A requirement of personalized crawls.
  - Online relevance feedback.
• Deep-web retrieval.