What's a Crawler?
Crawlers walk the network, fetching whatever they find and doing whatever they want with it...
- Search engine
- Data finder / collector
- Anything else...
2 / 19
Conception
A crawler can easily be separated into three steps...
- Download
- Data operation
- Find the next seed
3 / 19
Pseudo Code
Fetch the web page, parse it, get the useful information, and repeat.

for url in nextSeed():
    info = fetch(url)
    data, seeds = operate(info)
    pushSeed(seeds)
4 / 19
Greedy
But easy things always turn out to be hard to solve...
- Web servers always block the crawler!
- The data is almost never structured!
- How do we find the next seed?
- The crawler is always bound by network speed...
5 / 19
Operation
Once we connect to the target...
- Download the web page and parse the HTML code
- Download the database and parse the DB format
- Finally, record everything into our DB
6 / 19
Pseudo Code
Parse the HTML code and, for example, search for what you need...

from BeautifulSoup import *

soup = BeautifulSoup(webpage)
## Print the main body
print soup.html.body
## Print the first <a> tag in the body
print soup.html.body.a
## Find a particular tag
tags = soup.findAll('form')
7 / 19
Operation (cont’d)
Moreover, you can do other things while operating on the web page, such as sending a payload...
- POST / GET methods based on the HTML
- Find the next seed on the web page (sketched below)
- Something good / bad
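A rough sketch of both ideas, assuming the page was already fetched into webpage as on the earlier slide; base_url, the form field name q, and the page layout are only placeholders for illustration.

from urlparse import urljoin
import urllib, urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(webpage)
## Collect the next seeds: every <a> tag that carries an href
seeds = [urljoin(base_url, tag['href'])
         for tag in soup.findAll('a', href=True)]
## Send a payload by posting a form found on the page
## (assumes the page really has a <form> with an action and a field named 'q')
form = soup.find('form')
payload = urllib.urlencode({'q': 'some keyword'})
result = urllib2.urlopen(urljoin(base_url, form['action']), payload).read()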
8 / 19
Link to Site
Before we can operate on the web page, we need to...
- Connect to the web site
- Get the web page
But server admins hate web crawlers, because they...
- Provide no functionality
- Slow down / burn out the resources
- Act like thieves
9 / 19
Fetch
If you are not Google,
you must be a human.
10 / 19
Be a Human
Behave like a real human being...
- No one can press anything within 0.11 seconds
- No one can read a page in just a few seconds
- No one can work all day long
11 / 19
Rules
Use a framework / tool to emulate the browser
- Change the default settings
- Simulate an existing browser
- Cookie support
- Timing issues and random variables
12 / 19
Pseudo Code
Simple fetch code

import urllib2
from cookielib import CookieJar
import time, random

for n in range(MAX_LOOP):
    ## Cookie support
    ck = CookieJar()
    ck = urllib2.HTTPCookieProcessor(ck)
    req = urllib2.build_opener(ck)
    ## User-Agent
    req.addheaders = [('User-Agent', 'crawler cmj')]
    data = req.open(url).read()
    ## Wait a random number of seconds
    time.sleep(random.randint(0, 5))
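To follow the "simulate an existing browser" rule more closely, you can reuse a real browser's User-Agent string and a non-integer pause; this reuses req, url, time, and random from the snippet above, and the header values are just example strings.

## Example: borrow a real desktop browser's identity
UA = 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'
req.addheaders = [('User-Agent', UA),
                  ('Accept-Language', 'en-US,en;q=0.5')]
data = req.open(url).read()
## Human-looking, non-integer pause
time.sleep(random.uniform(2.0, 7.5))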
13 / 19
Seed
The last one, but the hardest one...
We never know the next sheep
14 / 19
Find Sheep
Using a well-known search engine
- The search engine also blocks other crawlers
- The crawler needs to parse the garbage code
- The result may just be JS code...
Using a random / enumeration method (sketched below)
- Too hard to find useful targets
- Costs lots of time
- Cannot shoot sheep immediately
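A small sketch of the enumeration idea, assuming the targets sit behind a predictable URL pattern; the pattern and the range below are invented, and most probes will simply miss, which is exactly why this costs so much time.

import urllib2

pattern = 'http://www.example.com/item?id=%d'      ## hypothetical URL pattern
found = []
for n in range(1, 1000):
    try:
        urllib2.urlopen(pattern % n, timeout=5)
        found.append(pattern % n)                   ## a sheep answered
    except urllib2.HTTPError:
        continue                                    ## 404 and friends: not a sheep
    except urllib2.URLError:
        continue                                    ## host unreachable, keep enumerating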
15 / 19
Based on a Search Engine
Design another crawler (sketched below)
- Given an initial keyword as the seed
- Fetch the search engine
- Parse the result, and get the next seed if possible
- Repeat until stopped or blocked.
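Putting those steps together, the loop might look like the sketch below; the search URL and the query parameter are assumptions, and a real engine will reshuffle its markup or block you long before the loop finishes.

import urllib, urllib2
from BeautifulSoup import BeautifulSoup

def searchCrawler(keyword, maxRound=10):
    ## the initial keyword becomes the first seed
    seeds = ['http://search.example.com/?' + urllib.urlencode({'q': keyword})]
    results = []
    for n in range(maxRound):
        if not seeds:
            break                                   ## stopped: nothing left to fetch
        url = seeds.pop(0)
        try:
            page = urllib2.urlopen(url).read()
        except urllib2.URLError:
            break                                   ## blocked or unreachable
        soup = BeautifulSoup(page)
        ## every link on the result page is a candidate next seed
        links = [tag['href'] for tag in soup.findAll('a', href=True)]
        results.extend(links)
        seeds.extend(links)
    return results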
16 / 19
Tricky
Use a distributed model
- Separate each part
- More volunteers can speed it up
17 / 19
Pyro4
Pyro4 can help you remotely control Python objects, as sketched below...
- Expose an object so it can be accessed as if it were local
- Use remote resources for the processing
- Provide an M-n model
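A minimal Pyro4 sketch, assuming each volunteer runs the worker below; the Fetcher class and its fetch method are placeholders, while expose, Daemon, register and Proxy are the usual Pyro4 calls for publishing and using a remote object.

## worker.py -- run on each volunteer machine
import Pyro4, urllib2

@Pyro4.expose
class Fetcher(object):
    def fetch(self, url):
        ## download on the volunteer side, return the raw page
        return urllib2.urlopen(url).read()

daemon = Pyro4.Daemon()
uri = daemon.register(Fetcher())
print uri                          ## hand this URI to the master
daemon.requestLoop()

## master side: the master just calls the remote object, e.g.
## worker = Pyro4.Proxy(uri); page = worker.fetch(someUrl)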
18 / 19
Thanks for participating
Q & A
19 / 19