Crawler


Page 1: Crawler

[email protected]

Anything can be a crawler

November 11, 2012


Page 2: Crawler

What’s a Crawler?

Crawlers walk the network, search for anything they can find, and do anything they want...

- Search engine
- Data finder / collector
- Anything else...


Page 3: Crawler

Concept

A crawler can easily be separated into three steps...

- Download
- Data operation
- Find the next seed


Page 4: Crawler

Pseudo Code

Fetch the web page, parse it, extract the useful information, and repeat.

for url in nextSeed():
    info = fetch(url)
    data, seeds = operate(info)
    pushSeed(seeds)


Page 5: Crawler

Greedy

But these easy things are always harder to solve than they look...

- Web servers always block the crawler!
- The data is almost never structured!
- How do we find the next seed?
- The crawler is always bounded by network speed...


Page 6: Crawler

Operation

When we connect to the target...

- Download the web page and parse the HTML code
- Download the database and parse the DB format
- Finally, record everything into our DB


Page 7: Crawler

Pseudo Code

Parse the HTML code and, for example, search for what you need...

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(webpage)
## Print the main body
print soup.html.body
## Print the first tag <a> in body
print soup.html.body.a
## Find a particular tag
tags = soup.findAll('form')


Page 8: Crawler

Operation (cont’d)

Moreover, you can also do other things, such as sending a payload, while operating on the web page...

- POST / GET methods based on the HTML forms (see the sketch below)
- Find the next seed on the web page
- Something good / bad
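
For example, a minimal sketch in the same BeautifulSoup / urllib2 style as the other slides, assuming webpage holds the fetched HTML and base_url its address (both hypothetical names, as is the form field):

import urllib, urllib2, urlparse
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(webpage)          ## webpage: assumed fetched HTML

## Collect the next seeds from every <a href=...> on the page
seeds = [urlparse.urljoin(base_url, a['href'])
         for a in soup.findAll('a', href=True)]

## POST the first <form> back to the server with our own payload
form = soup.find('form')
if form is not None:
    action = urlparse.urljoin(base_url, form.get('action') or base_url)
    payload = urllib.urlencode({'q': 'crawler'})   ## hypothetical field name
    data = urllib2.urlopen(action, payload).read()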


Page 9: Crawler

Link to Site

Before we can operate on the web page, we need to...

- Connect to the web site
- Get the web page

But server admins hate net crawlers, because they...

- Provide no functionality
- Slow down / burn out the server's resources
- Behave like thieves


Page 10: Crawler

Fetch

If you are not Google,
you must be a human.


Page 11: Crawler

Be a Human

Behave like a human being...

- No one can press anything within 0.1 second
- No one can read a page in just a few seconds
- No one can work all day long


Page 12: Crawler

Rules

Use a framework / tool to emulate the browser

- Change the default settings
- Simulate an existing browser
- Cookie support
- Timing issues and random delays


Page 13: Crawler

Pseudo Code

Simple fetch code

import urllib2
from cookielib import CookieJar
import time, random

for n in range(MAX_LOOP):
    ## Cookie
    ck = CookieJar()
    ck = urllib2.HTTPCookieProcessor(ck)
    req = urllib2.build_opener(ck)
    ## User-Agent
    req.addheaders = [('User-Agent', 'crawler cmj')]
    data = req.open(url).read()
    ## Wait
    time.sleep(random.randint(0, 5))


Page 14: Crawler

Seed

The last one, but the hardest one...

We never know who the next sheep will be


Page 15: Crawler

Find Sheep

Using a well-known search engine

- Search engines also block other crawlers
- The crawler needs to parse the garbage code
- The result may just be JS code...

Using a random / enumeration method (see the sketch after this list)

- Too hard to find useful targets
- Costs lots of time
- Cannot catch new sheep (targets) immediately
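
A minimal sketch of the enumeration idea; the URL pattern and the ID range are hypothetical, not taken from the slides:

import random

BASE = 'http://www.example.com/item?id=%d'   ## hypothetical target pattern

def random_seeds(count, max_id=1000000):
    ## Most guesses are dead ends, which is why this costs so much time.
    return [BASE % random.randint(1, max_id) for n in range(count)]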


Page 16: Crawler

Based on a Search Engine

Design another crawler

- Given an initial keyword as the seed
- Fetch the search engine
- Parse the results, and get the next seed if possible
- Repeat until stopped or blocked (see the sketch below).
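
A minimal sketch of this loop; fetch() is meant to be the fetch routine from page 13, while parse_results() and the search URL are hypothetical placeholders:

import urllib

def crawl(keyword, max_rounds=10):
    seeds = [keyword]
    seen = set()
    for n in range(max_rounds):
        if not seeds:
            break                              ## stopped: nothing left to search
        query = seeds.pop(0)
        if query in seen:
            continue
        seen.add(query)
        url = 'http://search.example.com/?' + urllib.urlencode({'q': query})
        page = fetch(url)                      ## re-use the fetch code from page 13
        if page is None:
            break                              ## blocked by the search engine
        seeds.extend(parse_results(page))      ## next seeds, if possible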


Page 17: Crawler

Tricky

Use a distributed model

- Separate each part
- More volunteers can speed it up


Page 18: Crawler

Pyro4

Pyro4 can help you remotely control Python objects...

- Exposed objects can be accessed as if they were on the local side
- Use remote resources to do the processing
- Provides the M-n model (see the sketch below)
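
A minimal sketch of exposing a crawler worker with Pyro4; the Worker class and its fetch() method are hypothetical, while Daemon, register, requestLoop and Proxy are the Pyro4 calls being illustrated:

import Pyro4

@Pyro4.expose
class Worker(object):
    def fetch(self, url):
        ## the real download / parsing would run on this remote host
        return 'fetched ' + url

daemon = Pyro4.Daemon()             ## Pyro daemon on the volunteer machine
uri = daemon.register(Worker())     ## expose the object and get its URI
print uri                           ## hand this URI to the master
daemon.requestLoop()                ## wait for remote calls

## On the master side the remote object can be used as if it were local:
##   worker = Pyro4.Proxy(uri)
##   data = worker.fetch('http://www.example.com/')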


Page 19: Crawler

Thanks for your participation

Q & A
