Crawler


Page 1: Crawler

[email protected]

Anything can be a crawler

November 11, 2012


Page 2: Crawler

What’s a Crawler?

Crawlers walk the network, search for anything they can find, and do anything they want...

- Search engine
- Data finder / collector
- Anything else...


Page 3: Crawler

Concept

A crawler can easily be separated into three steps...

- Download
- Data operation
- Find the next seed


Page 4: Crawler

Pseudo Code

Fetch the web page, parse it, extract the useful information, and repeat.

for url in nextSeed():
    info = fetch(url)
    data, seeds = operate(info)
    pushSeed(seeds)


Page 5: Crawler

Greedy

But these easy things are always harder to solve than they look...

- Web servers always block the crawler!
- The data is almost never structured!
- How do we find the next seed?
- The crawler is always bounded by network speed...


Page 6: Crawler

Operation

When we connect to the target...

- Download the web page and parse the HTML code
- Download the database and parse the DB format
- Finally, record everything into our DB


Page 7: Crawler

Pseudo Code

Parse the HTML code and, for example, search for what you need...

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(webpage)
## Print the main body
print soup.html.body
## Print the first tag <a> in body
print soup.html.body.a
## Find a particular tag
tags = soup.findAll('form')


Page 8: Crawler

Operation (cont’d)

Moreover, you can also do other things, such as sending a payload, while operating on the web page...

- POST / GET methods based on the HTML forms (see the sketch below)
- Find the next seed on the web page
- Something good / bad
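
For example, a minimal sketch in the same BeautifulSoup / urllib2 style as the other slides, assuming webpage holds the fetched HTML and base_url its address (both hypothetical names, as is the form field):

import urllib, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(webpage)          ## webpage: assumed fetched HTML

## Collect the next seeds from every <a href=...> on the page
seeds = [urlparse.urljoin(base_url, a['href'])
         for a in soup.findAll('a', href=True)]

## POST the first <form> back to the server with our own payload
form = soup.find('form')
if form is not None:
    action = urlparse.urljoin(base_url, form.get('action') or base_url)
    payload = urllib.urlencode({'q': 'crawler'})   ## hypothetical field name
    data = urllib2.urlopen(action, payload).read()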


Page 9: Crawler

Link to Site

Before we can operate on the web page, we need to...

- Connect to the web site
- Get the web page

But server admins hate net crawlers, because they...

- Provide no functionality
- Slow down / burn out the server's resources
- Behave like thieves


Page 10: Crawler

Fetch

If you are not Google,
you must be a human.


Page 11: Crawler

Be a Human

Behave like a human being...

- No one can press anything within 0.1 second
- No one can read a page in just a few seconds
- No one can work all day long


Page 12: Crawler

Rules

Use a framework / tool to emulate the browser

- Change the default settings
- Simulate an existing browser
- Cookie support
- Timing issues and random delays


Page 13: Crawler

Pseudo Code

Simple fetch code

import urllib2
from cookielib import CookieJar
import time, random

for n in range(MAX_LOOP):
    ## Cookie
    ck = CookieJar()
    ck = urllib2.HTTPCookieProcessor(ck)
    req = urllib2.build_opener(ck)
    ## User-Agent
    req.addheaders = [('User-Agent', 'crawler cmj')]
    data = req.open(url).read()
    ## Wait
    time.sleep(random.randint(0, 5))


Page 14: Crawler

Seed

The last one, but the hardest one...

We never know who the next sheep will be


Page 15: Crawler

Find Sheep

Using a well-known search engine

- Search engines also block other crawlers
- The crawler needs to parse the garbage code
- The result may just be JS code...

Using a random / enumeration method (see the sketch after this list)

- Too hard to find useful targets
- Costs lots of time
- Cannot catch new sheep (targets) immediately
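
A minimal sketch of the enumeration idea; the URL pattern and the ID range are hypothetical, not taken from the slides:

import random

BASE = 'http://www.example.com/item?id=%d'   ## hypothetical target pattern

def random_seeds(count, max_id=1000000):
    ## Most guesses are dead ends, which is why this costs so much time.
    return [BASE % random.randint(1, max_id) for n in range(count)]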


Page 16: Crawler

Based on a Search Engine

Design another crawler

- Given an initial keyword as the seed
- Fetch the search engine
- Parse the results, and get the next seed if possible
- Repeat until stopped or blocked (see the sketch below).
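
A minimal sketch of this loop; fetch() is meant to be the fetch routine from page 13, while parse_results() and the search URL are hypothetical placeholders:

import urllib

def crawl(keyword, max_rounds=10):
    seeds = [keyword]
    seen = set()
    for n in range(max_rounds):
        if not seeds:
            break                              ## stopped: nothing left to search
        query = seeds.pop(0)
        if query in seen:
            continue
        seen.add(query)
        url = 'http://search.example.com/?' + urllib.urlencode({'q': query})
        page = fetch(url)                      ## re-use the fetch code from page 13
        if page is None:
            break                              ## blocked by the search engine
        seeds.extend(parse_results(page))      ## next seeds, if possible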


Page 17: Crawler

Tricky

Use a distributed model

- Separate each part
- More volunteers can speed it up


Page 18: Crawler

Pyro4

Pyro4 can help you remotely control Python objects...

- Exposed objects can be accessed as if they were on the local side
- Use remote resources to do the processing
- Provides the M-n model (see the sketch below)
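
A minimal sketch of exposing a crawler worker with Pyro4; the Worker class and its fetch() method are hypothetical, while Daemon, register, requestLoop and Proxy are the Pyro4 calls being illustrated:

import Pyro4

@Pyro4.expose
class Worker(object):
    def fetch(self, url):
        ## the real download / parsing would run on this remote host
        return 'fetched ' + url

daemon = Pyro4.Daemon()             ## Pyro daemon on the volunteer machine
uri = daemon.register(Worker())     ## expose the object and get its URI
print uri                           ## hand this URI to the master
daemon.requestLoop()                ## wait for remote calls

## On the master side the remote object can be used as if it were local:
##   worker = Pyro4.Proxy(uri)
##   data = worker.fetch('http://www.example.com/')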


Page 19: Crawler

Thanks for your participation

Q & A
