Crawling the world

@mmoreram

Apparently...

Nobody uses parsing in their applications

Not even Chuck Norris

Many businesses

need crawling

Crawling brings you knowledge

Knowledge is

power

And power is

Money

What is crawling? Or parsing

Crawling

We just download a URL with a request (HTML, XML…)

We process the response, searching for the desired data, such as links, headers or any kind of text or label

Once we have the content we need, we can update our database and decide what to do next, for example parsing some of the links we found.
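The first two steps above can be sketched in Python with only the standard library. This is a minimal illustration, not the implementation used in the talk: the names `fetch`, `extract_links` and `LinkExtractor` are hypothetical, and links are assumed to be the data we are looking for.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag found in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def fetch(url):
    """Step 1: download one URL with a plain HTTP request."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Step 2: search the response for the data we want (here, links)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The third step (updating the database and deciding which links to parse next) would consume the list returned by `extract_links(fetch(url), url)`.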

and that’s it!

–Marc Morera, yesterday

“Machines will do what humans do before they realize”

Let’s see an example

Step by step

chicplace.com

Our goal is to parse all available products, saving name, description, price, shop and categories

We will use a linear strategy. There are several strategies to choose from when a site must be parsed

Let’s see all available strategies

Parsing Strategies

Linear. Just one script. If any page fails (crawling error, server timeout, …) some kind of exception can be thrown and caught.

Advantages: Just one script is needed. Easier? Not even close…

Problems: Cannot be distributed. Just one script for 1M requests. Memory problem?
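A linear crawl could look roughly like the sketch below. It assumes two hypothetical callables supplied by the caller: `fetch(url)`, which returns a page body or raises, and `parse(url, body)`, which returns the extracted data plus any newly found URLs.

```python
from collections import deque


def crawl_linear(start_urls, fetch, parse):
    """Visit every URL in one loop; failures are caught and recorded."""
    pending = deque(start_urls)
    seen = set(pending)
    results, failed = [], []
    while pending:
        url = pending.popleft()
        try:
            body = fetch(url)
        except Exception as error:  # crawling error, server timeout, ...
            failed.append((url, str(error)))
            continue
        data, new_urls = parse(url, body)
        results.append(data)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
    return results, failed
```

Note that `pending`, `seen` and `results` all live in one process, which is where the memory concern for 1M requests comes from.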

Parsing Strategies

Distributed. One script for each case. If any page fails, it can be recovered by simply running the same script again.

Advantages: All cases are encapsulated in an individual script, low memory. Can be easily distributed by using queues.

Problems: None
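The recovery idea of the distributed strategy can be sketched as a worker that puts a failed job back on its queue. Here Python's in-memory `queue.Queue` stands in for a real queue system, and `run_worker`, `handle` and `max_attempts` are hypothetical names chosen for the illustration.

```python
import queue


def run_worker(jobs, handle, max_attempts=3):
    """Consume jobs one at a time; a failed job is simply enqueued again."""
    done, dropped = [], []
    while True:
        try:
            url, attempt = jobs.get_nowait()
        except queue.Empty:
            break  # nothing left to do
        try:
            done.append(handle(url))
        except Exception:
            if attempt < max_attempts:
                jobs.put((url, attempt + 1))  # recover by running it again
            else:
                dropped.append(url)  # give up after max_attempts tries
    return done, dropped
```

Each job carries only its own URL and attempt counter, so a worker keeps almost no state between jobs and several copies can consume the same queue.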

Crawling steps

Analyzing. Think like Google does. Find the fastest way through the labyrinth

Scripting. Build scripts using queues for a distributed strategy. Each queue corresponds to one kind of page

Running. Keep in mind the impact of your actions: DDoS attacks, copyright

Analyzing

Every parsing process should be evaluated as a simple crawler. For example, Google

How can we access all the pages we need with the lowest server impact?

Usually, serious websites are designed so that every page can be reached within 3 clicks
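One way to reason about the "within 3 clicks" idea is a breadth-first walk that stops following links once the click budget is spent. The sketch below assumes a hypothetical `get_links(page)` callable that returns the links found on a page.

```python
from collections import deque


def pages_within_clicks(start, get_links, max_clicks=3):
    """Breadth-first walk: map each reachable page to its click distance."""
    distance = {start: 0}
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        if distance[page] == max_clicks:
            continue  # do not request links beyond the click budget
        for link in get_links(page):
            if link not in distance:
                distance[link] = distance[page] + 1
                frontier.append(link)
    return distance
```

Every page is expanded at most once, which keeps the number of requests, and therefore the server impact, as low as the site structure allows.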

Analyzing

We will use the category map to access all available products

Analyzing

Each category will list all available products

Analyzing

Do we also need to parse the product page?

In fact, we do. We already have name, price and category, but we also need description and shop

So we have the main page to parse all category links, we have the category page with all its products (it can be paginated), and we also need the product page to get all the information

Parsing the product page is the step responsible for saving all data in the database

Scripting

We will use a distributed strategy, using queues and supervisord

Supervisord is responsible for managing X instances of a process running at the same time.

Using a distributed queue system, we will have 3 workers.

Worker?

Yep, worker. In a queue system, a worker is like a box (a script) with some parameters (input values) that just does something.

We have 3 kinds of workers. One of them, the CategoryWorker, just receives a category URL, parses the related content (HTML) and detects all products. Each product generates a new job for the ProductWorker

Running

We just enable all workers and force the first one to run.

The first worker will find all category URLs and enqueue them into a queue named categories-queue

The second worker (for example, 10 instances) will just consume categories-queue, taking URLs from it and parsing their content.

Here, their content means just product URLs

Running

Each URL is enqueued to another queue named products-queue

The third and last worker (50 instances) just consumes this queue, parses the content of each URL and gets the needed data (name, description, shop, category and price).
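The chain of three workers can be simulated in a single process to see how data moves between the queues. This is only a sketch: `FAKE_SITE` and its contents are made-up stand-ins for the real pages, `queue.Queue` replaces the real queue system, a plain list replaces the database, and each worker function handles exactly one job, as a supervisord-managed instance would do in a loop.

```python
import queue

# Hypothetical stand-in for the site: main page -> categories -> products.
FAKE_SITE = {
    "/": ["/cat/lamps", "/cat/bags"],
    "/cat/lamps": ["/p/1", "/p/2"],
    "/cat/bags": ["/p/3"],
    "/p/1": {"name": "Lamp A", "description": "Desk lamp", "price": 10.0, "shop": "Shop X"},
    "/p/2": {"name": "Lamp B", "description": "Floor lamp", "price": 25.0, "shop": "Shop Y"},
    "/p/3": {"name": "Bag C", "description": "Tote bag", "price": 15.0, "shop": "Shop X"},
}

categories_queue = queue.Queue()  # plays the role of categories-queue
products_queue = queue.Queue()    # plays the role of products-queue
database = []                     # stand-in for the real database


def main_worker(url="/"):
    """First worker: find all category URLs and enqueue them."""
    for category_url in FAKE_SITE[url]:
        categories_queue.put(category_url)


def category_worker():
    """Second worker: take one category URL and enqueue its product URLs."""
    category_url = categories_queue.get()
    for product_url in FAKE_SITE[category_url]:
        products_queue.put((product_url, category_url))


def product_worker():
    """Third worker: take one product URL, extract its data and save it."""
    product_url, category_url = products_queue.get()
    record = dict(FAKE_SITE[product_url], category=category_url)
    database.append(record)


def run_all():
    """Force the first worker to run, then drain both queues in order."""
    main_worker()
    while not categories_queue.empty():
        category_worker()
    while not products_queue.empty():
        product_worker()
    return database
```

In a real deployment the two `while` loops disappear: many instances of each worker run at the same time and block on their queue until a job arrives.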

OK. Call me God

but…

–Some bored man

“Don't shoot the messenger”

warning!

50 workers requesting chicplace in parallel. This is a big problem

@Gonzalo (CTO) will be angry and he will detect that something is happening

So we must be careful not to alert him, or at least avoid being discovered

Warning

Do not try this at home

Be invisible

To be invisible, we can just parse the whole site slowly (days)

To be faster, we can mask our IP using proxies (how about a different proxy for every request?)

To be faster, we can also use an anonymizing network, like Tor.

To be stupid, we can just parse chicplace with our own IP (most companies will not even notice)
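The first two options above (going slowly and using a different proxy per request) can be sketched as a request plan. The proxy addresses and the helper names `polite_delays`, `build_opener_for` and `plan_requests` are assumptions for the example; `ProxyHandler` and `build_opener` are standard `urllib.request` calls.

```python
import itertools
import random
import urllib.request


def polite_delays(min_seconds, max_seconds, seed=None):
    """Yield endless random pauses so requests show no fixed rhythm."""
    rng = random.Random(seed)
    while True:
        yield rng.uniform(min_seconds, max_seconds)


def build_opener_for(proxy_url):
    """Return an opener that sends HTTP(S) traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def plan_requests(urls, proxies, delays):
    """Pair each URL with the next proxy (round robin) and a pause."""
    proxy_cycle = itertools.cycle(proxies)
    return [(url, next(proxy_cycle), next(delays)) for url in urls]
```

A worker would then walk the plan, call `time.sleep(delay)` before each entry and fetch the page with `build_opener_for(proxy).open(url, timeout=10)`. Longer delays are what keep the load on the target server low.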

They are attacking me!

–Matthew 21:22

“And whatever you ask in prayer, you will receive, if you have faith”

My prayer!

A good crawling implementation is infallible

The server will receive dozens of requests per second and will not recognize any pattern that distinguishes crawler requests from ordinary user requests

So…?

Welcome to the amazing world of

Crawling

Where no one is SAFE