Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
Sziasztok résztvevők! (Hello, participants!)
Task
• Crawl the Spanish web to gather statistics about hosts and their sizes.
• Limit the crawl to the .es zone.
• Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on.
• Stopping condition: no hosts remain with fewer than 100 crawled documents.
• Low costs.
Spanish internet (.es) in 2012
• Domain names registered: 1.56M (39% growth per year)
• Web servers in the zone: 283.4K (33.1%)
• Hosts: 4.2M (21%)
• Spanish web sites in the DMOZ catalog: 22,043
* Source: OECD Communications Outlook 2013
Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• Twisted.Internet - library of async primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
* Network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.
1. Big and small hosts problem
• When the crawler encounters a huge number of links from a single host, and a simple prioritization model is used, the queue ends up flooded with URLs from that host.
• That causes underuse of spider resources.
• We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory (see the sketch below).
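A minimal sketch of the metering idea (class name, thresholds, and data structures are illustrative, not Frontera's actual implementation): each host gets a bounded in-memory buffer, and every scheduling round takes only a small slice per host, so big hosts cannot crowd out small ones.

```python
from collections import defaultdict, deque

class PerHostQueue:
    """Illustrative per-host metering: each host contributes at most
    `batch_per_host` URLs per scheduling round; the rest stay cached
    in memory until a later round."""

    def __init__(self, batch_per_host=10, max_cached_per_host=10000):
        self.batch_per_host = batch_per_host
        self.max_cached_per_host = max_cached_per_host
        self.buffers = defaultdict(deque)  # host -> cached URLs

    def push(self, host, url):
        buf = self.buffers[host]
        if len(buf) < self.max_cached_per_host:  # drop overflow from huge hosts
            buf.append(url)

    def next_batch(self):
        """Take a metered slice from every host, so small hosts are
        not starved by big ones."""
        batch = []
        for host, buf in list(self.buffers.items()):
            for _ in range(min(self.batch_per_host, len(buf))):
                batch.append((host, buf.popleft()))
            if not buf:
                del self.buffers[host]
        return batch
```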
3. DDoSing the Amazon AWS DNS service
• The breadth-first strategy means previously unknown hosts are visited first, which generates a huge amount of DNS requests.
• Fix: a recursive DNS server on each downloading node, with upstream set to Verizon and OpenDNS.
• We used dnsmasq.
4. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IPs.
• When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks.
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for adjusting the thread pool size and timeout (see below).
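In present-day Scrapy these knobs are exposed as the settings below (names per current Scrapy documentation; the patch described in the talk predates them, so treat this as the modern equivalent):

```python
# settings.py - enlarge the reactor thread pool that serves blocking
# DNS lookups, and bound how long a single resolution may take.
REACTOR_THREADPOOL_MAXSIZE = 20  # default is 10; more concurrent lookups
DNS_TIMEOUT = 60                 # seconds before a lookup is abandoned
```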
5. Overloaded HBase region servers during state check
• The crawler extracts hundreds of links per document on average.
• Before adding these links to the queue, each has to be checked against the set of already crawled pages (to avoid repeated visits).
• On small volumes SSDs were just fine. After the table grew, we had to move to HDDs, and response times went up dramatically.
• Fix: a host-local fingerprint function for keys in HBase (sketched below).
• Fix: tuning the HBase block cache so that the average host's states fit into one block.
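The idea behind the host-local fingerprint: prefix every URL key with a hash of its hostname, so all rows of one host are stored adjacently and tend to share an HBase block. Frontera ships a similar hostname_local_fingerprint; the exact layout below is only a sketch (Python 3 syntax):

```python
import hashlib
from urllib.parse import urlparse
from zlib import crc32

def host_local_fingerprint(url: str) -> bytes:
    """Key = 4-byte host hash + 16-byte URL hash. URLs from the same
    host share a key prefix, so a single block read covers most of a
    host's state rows."""
    host = urlparse(url).netloc.encode("utf-8")
    host_part = (crc32(host) & 0xFFFFFFFF).to_bytes(4, "big")
    url_part = hashlib.sha1(url.encode("utf-8")).digest()[:16]
    return host_part + url_part
```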
6. Intensive network traffic from workers to services
• We observed up to 1 Gbit/s of throughput between the workers, Kafka, and HBase.
• Switched to the Thrift compact protocol for HBase communication.
• Message compression in Kafka using Snappy (see below).
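With the kafka-python client, for example, Snappy compression is a single producer option (a sketch: the broker address and topic name are placeholders, and the python-snappy package must be installed on both producer and consumer sides):

```python
from kafka import KafkaProducer

# Batches are compressed with Snappy before leaving the worker.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",  # placeholder address
    compression_type="snappy",
)
producer.send("spider-log", b"serialized crawl event")  # placeholder topic
producer.flush()
```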
7. Further query and traffic optimizations to HBase
• The state check consumed the lion's share of requests and network throughput.
• Consistency was another requirement.
• We created a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host, to avoid cache overlap between workers.
State cache
• All operations are batched:
• if a key is absent from the cache, it is requested from HBase,
• every ~4K documents the cache is flushed to HBase.
• On reaching 3M elements (~1 GB), a flush and cleanup happens.
• A Least-Recently-Used (LRU) eviction policy seems a good fit here (sketched below).
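A minimal sketch of such a cache (the sizes match the slide; the HBase round-trips are left as placeholders, and the real strategy worker batches them):

```python
from collections import OrderedDict

class StateCache:
    """LRU cache of per-URL crawl states with periodic batched flushes."""

    def __init__(self, flush_every=4000, max_elements=3_000_000):
        self.states = OrderedDict()  # fingerprint -> state, in LRU order
        self.dirty = {}              # writes pending for the next flush
        self.flush_every = flush_every
        self.max_elements = max_elements

    def get(self, fingerprint):
        if fingerprint in self.states:
            self.states.move_to_end(fingerprint)  # mark recently used
            return self.states[fingerprint]
        state = self._fetch_from_hbase(fingerprint)  # cache miss
        self.states[fingerprint] = state
        return state

    def set(self, fingerprint, state):
        self.states[fingerprint] = state
        self.states.move_to_end(fingerprint)
        self.dirty[fingerprint] = state
        if len(self.dirty) >= self.flush_every:   # ~4K docs -> flush
            self._flush()
        if len(self.states) > self.max_elements:  # ~3M -> flush + cleanup
            self._flush()
            while len(self.states) > self.max_elements // 2:
                self.states.popitem(last=False)   # evict least recently used

    def _flush(self):
        self._write_batch_to_hbase(self.dirty)
        self.dirty = {}

    # placeholders for the actual HBase I/O
    def _fetch_from_hbase(self, fingerprint): ...
    def _write_batch_to_hbase(self, batch): ...
```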
Spider priority queue (slot)
• Each cell holds an array of:
- fingerprint,
- Crc32(hostname),
- URL,
- score.
• Dequeueing the top N.
• Such a design is vulnerable to huge hosts.
• This can be partially mitigated by a scoring model that takes the known per-host document count into account (see the sketch below).
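An illustrative dequeue over such cells, plus a host-aware score damping (both are sketches; the field order follows the slide):

```python
import heapq

def dequeue_top_n(cells, n):
    """cells: iterable of (fingerprint, host_crc32, url, score) tuples.
    Returns the n highest-scoring entries as the next spider batch."""
    return heapq.nlargest(n, cells, key=lambda cell: cell[3])

def host_aware_score(base_score, docs_known_for_host):
    """Damp scores of URLs from hosts we already know much about, so a
    single huge host cannot monopolize the batch."""
    return base_score / (1.0 + docs_known_for_host) ** 0.5
```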
8. Problem of big and small hosts (strikes back!)
• During crawling we found a few very large hosts (>20M docs).
• All queue partitions were flooded with pages from those few hosts, because of the queue design and the scoring model used.
• We made two MapReduce jobs (the limiting logic is sketched below):
• queue shuffling,
• limiting all hosts to no more than 100 documents.
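The limiting job reduces to a group-by-host with a cap; a single-machine sketch of the same reduce-side logic (the real jobs ran as MapReduce over the HBase queue table):

```python
from itertools import groupby

def limit_per_host(queue_entries, cap=100):
    """queue_entries: (host, url, score) tuples. Keep at most `cap`
    highest-scoring URLs per host."""
    by_host = sorted(queue_entries, key=lambda e: (e[0], -e[2]))
    limited = []
    for _host, group in groupby(by_host, key=lambda e: e[0]):
        limited.extend(list(group)[:cap])
    return limited
```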
Hardware requirements
• A single-threaded Scrapy spider gives 1,200 pages/min from about 100 websites crawled in parallel.
• Spiders-to-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example:
• 12 spiders ~ 14.4K pages/min,
• 3 SW and 3 DB workers,
• 18 cores total.
Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS service.
CDH (100% open source Hadoop package)
Maintaining Cloudera Hadoop on Amazon EC2
• CDH is very sensitive to free space on the root partition, where parcels and Cloudera Manager storage live.
• We moved them to a separate EBS partition using symbolic links.
• The EBS volume should be at least 30 GB; base IOPS proved sufficient.
• Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB RAM, 2x40 GB SSD).
• After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3x2 TB HDD).
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than 50M pages.
Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Canonical URL resolution abstraction: each document has many URLs; which one to use?
• Scrapy ecosystem: good documentation, big community, ease of customization.
Main features
• Communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model are coded in a separate module (see the sketch below).
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
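To show the shape of such a strategy module, here is a hedged illustration for the breadth-first crawl from the beginning of the talk (class and method names are made up for this sketch; consult the distributed-frontera documentation for the real interface):

```python
class BreadthFirstStrategy:
    """Illustrative strategy: score pages by click distance and stop
    once every known host has at least 100 crawled documents."""

    def add_seeds(self, seeds):
        # seeds enter at depth 0 with maximal priority
        return {seed: 1.0 for seed in seeds}

    def page_crawled(self, url, depth, extracted_links):
        # each extracted link is one click further away: lower score
        return {link: 1.0 / (depth + 2) for link in extracted_links}

    def finished(self, host_stats):
        # host_stats: mapping host -> crawled document count
        return all(count >= 100 for count in host_stats.values())
```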
References
• Distributed Frontera: https://github.com/scrapinghub/distributed-frontera
• Frontera: https://github.com/scrapinghub/frontera
• Documentation:
• http://distributed-frontera.readthedocs.org/
• http://frontera.readthedocs.org/
Future plans
• Lighter version, without HBase and Kafka, communicating over sockets.
• Out-of-the-box revisiting strategy.
• Watchdog solution: tracking website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub services.
• Testing on larger volumes.
Contribute!
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• A truly resource-intensive task: CPU, network, disks.
• Made in Scrapinghub, the company where Scrapy was created.
• There are plans to become an Apache Software Foundation project.
Köszönöm! (Thank you!)
Alexander Sibiryakov, sibiryakov@scrapinghub.com