33
Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam [email protected] AND

Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam [email protected] AND

Embed Size (px)

Citation preview

Page 1: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Parallel Crawlers

Efficient URL Caching for World Wide Web Crawling

PresenterSawood [email protected]

AND

Page 2: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Parallel Crawlers

Hector Garcia-MolinaStanford University

[email protected]

Junghoo ChoUniversity of California

[email protected]

Page 3: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

ABSTRACTDesign an effective and scalable parallel

crawlerPropose multiple architectures for a parallel

crawlerIdentify fundamental issues related to

parallel crawlingMetrics to evaluate a parallel crawlerCompare the proposed architectures using

40 million pages

Page 4: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Challenges for parallel crawlersOverlapQualityCommunication bandwidth

Page 5: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

AdvantagesScalabilityNetwork-load dispersionNetwork-load reduction

CompressionDifferenceSummarization

Page 6: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Related workGeneral architecturePage selectionPage update

Page 7: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Geographical categorizationIntra-site parallel crawlerDistributed crawler

Page 8: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

CommunicationIndependentDynamic assignmentStatic assignment

Page 9: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Crawling modes (Static)Firewall modeCross-over modeExchange mode

Page 10: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

URL exchange minimizationBatch communicationReplication

Page 11: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Partitioning functionURL-hash basedSite-hash basedHierarchical

Page 12: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Evaluation modelsOverlapCoverageQualityCommunication overhead

Page 13: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Firewall mode and coverage

Page 14: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Cross-over mode and overlap

Page 15: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Exchange mode and communication

Page 16: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Quality and batch communication

Page 17: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

ConclusionFirewall mode is good if processes <= 4URL exchange poses network overhead <

1%Quality is maintained even in the batch

communicationReplicating 10,000 to 100,000 popular

URLs can reduce 40% communication overhead

Page 18: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Efficient URL Caching for World Wide Web Crawling

Andrei Z. BroderIBM TJ Watson Research Center

[email protected]

Janet L. WienerHewlett Packard [email protected]

Marc NajorkMicrosoft Research

[email protected]

Page 19: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

IntroductionFetch a pageParse it to extract all linked URLsFor all the URLs not seen before, repeat the

process

Page 20: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

ChallengesThe web is very large (coverage)

doubling every 9-12 monthsWeb pages are changing rapidly (freshness)

all changes (40% weekly)changes by a third or more (7% weekly)

Page 21: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

CrawlersIA crawlerOriginal Google crawlerMercator web crawlerCho and Garcia-Molina’s crawlerWebFountainUbiCrawlerShkapenyuk and Suel’s crawler

Page 22: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

CachingAnalogous to OS cacheNon-uniformity of requestsTemporal correlation or locality of reference

Page 23: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Caching algorithmsInfinite cache (INFINITE)Clairvoyant caching (MIN)Least recently used (LRU)CLOCKRandom replacement (RANDOM)Static caching (STATIC)

Page 24: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Experimental setup

Page 25: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

URL Streamsfull tracecross sub-trace

Page 26: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Result plots

Page 27: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Result plots

Page 28: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Result plots

Page 29: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

ResultsLRU & CLOCK performed equally well but slightly

worse than MIN except for critical region (for both traces)

RANDOM is slightly inferior to CLOCK and LRU, while STATIC is generally much worse

Concludes considerable locality of reference in the traces

For very large cache STATIC is better than MIN (excluding initial k misses)

STATIC is relatively better for cross traceLack of deep links, often pointing to home pages.Intersection between the most popular URLs and the

cross trace tends to be larger

Page 30: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Critical regionMiss rate for all efficient algorithms is

constant (~70%) in k = 2^14 - 2^18Above k = 2^18 miss rate drops abruptly

to ~20%

Page 31: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Cache Implementation

Page 32: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

Conclusions and future directions1,800 simulations over 26.86 billion URLs

resulted cache of 50,000 entries gives 80% hit rate

Cache size of 100 ~ 500 entries per thread is recommended

CLOCK or RANDOM implementation using scatter table with circular chain is recommended

To what order graph traversal method affects caching?

Global cache or per thread cache is better?

Page 33: Parallel Crawlers Efficient URL Caching for World Wide Web Crawling Presenter Sawood Alam salam@cs.odu.edu AND

THANKS