








































A tutorial on crawling tools

Jianguo Lu

September 25, 2014

1 overview

2 wget

3 crawler4j

4 nutch

types of crawlers

classify crawling according to the type of the web:

surface web crawler: obtains web pages by following hyperlinks

deep web crawler

programmable web APIs

classify crawling according to the content:

general purpose, e.g., Google, archive

focused crawlers, e.g., academic crawlers (Google Scholar) and social networks (Twitter, Weibo)

surface web and deep web are intertwined

most large web sites provide both surface web and deep web (e.g., Google Scholar, Twitter)

tools for surface web crawling

command line

wget (www get), preinstalled on Ubuntu (our CS machines)

curl (client URL), preinstalled on OS X

Simple crawling APIs

Java: crawler4j

Python: scrapy

large scale crawling

Heritrix, crawler for

get a webpage (in java)

    import java.net.*;
    import java.io.*;

    public class URLReader {

        public static void main(String[] args) throws Exception {
            URL oracle = new URL("http://www.oracle.com/");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(oracle.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null)
                System.out.println(inputLine);
            in.close();
        }
    }



how to analyze the page to get other urls

how to control the process

how deep to crawl

how often to send the request

...
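The first question above, how to analyze a page to get other urls, can be sketched as follows. This is a minimal illustration, not a full HTML parser: it assumes links appear as quoted href attributes inside <a> tags, and the class name LinkExtractor and the example.com/example.org URLs are made up for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag.
    // Group 1 is the link without any #fragment. A regex is a simplification;
    // real-world HTML is better handled by a proper parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"'#]+)[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http/https links found in html, resolved against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                // Keep only web links; this drops mailto:, javascript:, etc.
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link (e.g. contains spaces): skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a> "
                + "<a href='http://example.org/x'>X</a>";
        System.out.println(extractLinks("http://www.example.com/index.html", html));
    }
}
```

Relative links are resolved against the page's own URL, so the crawler can add the results directly to its list of pages to visit.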

wget stands for www get.

developed in 1996

preinstalled on most Linux-like machines

example to download a single file:

    $ wget http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
    Saving to: 'strx25-0.9.2.1.tar.bz2.1'
    31% [=================>      1,213,592   68.2K/s  eta 34s

get more pages

get a single page


support http, ftp etc., e.g.


More complex usage includes automatic download of multiple URLs into a directory hierarchy.

wget -e robots=off -r -l1 --no-parent -A.gif

Recursive retrieval using -r

The program begins following links from the website and downloading them too. If the start page has a link to another page, wget will download that page too when we use recursive retrieval.

It will also follow any other links: if there was a link to http://uwo.ca somewhere on that page, it would follow that and download it as well.

By default, -r follows links to a depth of five levels from the starting page, that is, up to five clicks away from the first page.

not beyond the last parent directory, using --no-parent

The double dash indicates the long form of an option. Many options also have a short form; this one can be written -np.

wget should follow links, but not beyond the last parent directory.

won't go anywhere that is not part of the
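The idea behind --no-parent can be sketched as a simple test on URL paths. This is an illustration of the rule only, not wget's actual implementation; the class NoParentFilter and the example.com URLs are hypothetical.

```java
import java.net.URI;

class NoParentFilter {
    /** Directory part of a URL path: "/papers/index.html" -> "/papers/". */
    static String directoryOf(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        return path.substring(0, path.lastIndexOf('/') + 1);
    }

    /**
     * True if candidateUrl is on the same host as startUrl and lies inside
     * the directory of startUrl (never above it).
     */
    static boolean isAllowed(String startUrl, String candidateUrl) {
        // normalize() removes "." and ".." segments so "../" cannot escape.
        URI start = URI.create(startUrl).normalize();
        URI cand = URI.create(candidateUrl).normalize();
        if (start.getHost() == null
                || !start.getHost().equalsIgnoreCase(cand.getHost())) {
            return false;
        }
        String candPath = (cand.getPath() == null || cand.getPath().isEmpty())
                ? "/" : cand.getPath();
        return candPath.startsWith(directoryOf(start.getPath()));
    }

    public static void main(String[] args) {
        String start = "http://example.com/papers/";
        System.out.println(isAllowed(start, "http://example.com/papers/2014/a.pdf"));
        System.out.println(isAllowed(start, "http://example.com/about.html"));
    }
}
```

Note that with a start URL of http://example.com/papers (no trailing slash) the directory is "/", so under this sketch every page on the host would be allowed.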

how far you want to go

The default: follow each link and carry on to a limit of five pages away from the first page.

wget -l 2, which takes us to a depth of two web-pages.

Note this is a lower-case L, not a number 1.
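What a depth limit such as -l 2 means can be sketched with a breadth-first traversal over a small in-memory link graph. The graph, the page names and the class DepthLimitedCrawl are assumptions made for the illustration; no network access is involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class DepthLimitedCrawl {
    /**
     * Visits pages breadth-first starting at seed and returns them in visit
     * order. Links found on pages at depth maxDepth are not followed.
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Map<String, Integer> depthOf = new HashMap<>();  // also serves as the "seen" set
        List<String> visited = new ArrayList<>();
        Queue<String> frontier = new ArrayDeque<>();
        depthOf.put(seed, 0);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.remove();
            visited.add(page);
            int depth = depthOf.get(page);
            if (depth == maxDepth) {
                continue;  // depth limit reached: do not expand this page
            }
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (!depthOf.containsKey(next)) {
                    depthOf.put(next, depth + 1);
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("A"));
        links.put("D", Arrays.asList("E"));
        System.out.println(crawl(links, "A", 2));
    }
}
```

With the graph in main, a limit of 2 reaches A, B, C and D but not E, which is three clicks away from A.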

-w 10

adds a ten second wait in between server requests. You can shorten this, as ten seconds is quite long. You can also use the parameter --random-wait to let wget choose a random number of seconds to wait.

wget --random-wait -r -p -e robots=off -U mozilla
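The wget manual describes --random-wait as varying the pause between 0.5 and 1.5 times the -w value. A rough sketch of that calculation, with a hypothetical class name PoliteDelay, might look like this:

```java
import java.util.Random;

class PoliteDelay {
    /**
     * Returns a randomized delay in milliseconds, between 0.5 and 1.5 times
     * waitSeconds, mirroring the documented behaviour of --random-wait.
     */
    static long randomWaitMillis(double waitSeconds, Random rng) {
        double factor = 0.5 + rng.nextDouble();  // in [0.5, 1.5)
        return Math.round(waitSeconds * factor * 1000);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // With -w 10 the pause falls between 5 and 15 seconds.
        System.out.println(randomWaitMillis(10.0, rng) + " ms");
    }
}
```

A crawler would sleep for this long before each request, so that its traffic is gentler on the server and looks less regular.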


--limit-rate=20k limits the maximum download speed to 20 KB/s. Opinion varies on what a good limit rate is, but you are probably good up to about 200 KB/s.

some sites are protective

if the robots.txt does not allow you to crawl anything

use robots=off

wget -r -p -e robots=off
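For context, this is roughly the check that a polite crawler performs and that robots=off switches off. The sketch is deliberately simplified: it reads only the "User-agent: *" group, treats each Disallow value as a plain path prefix, and ignores Allow lines and wildcards. The class name RobotsCheck is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsCheck {
    /** Collects the Disallow path prefixes listed under "User-agent: *". */
    static List<String> disallowedFor(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim();  // strip comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
        return disallowed;
    }

    /** True if path is not covered by any Disallow prefix for "*". */
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedFor(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A robots.txt that does not allow crawling anything.
        String robots = "User-agent: *\nDisallow: /\n";
        System.out.println(isAllowed(robots, "/index.html"));
    }
}
```

A robots.txt containing "Disallow: /" under "User-agent: *" blocks every path, which is the situation described above.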

mask user agent

if a web site checks a browser identity

    $ wget -r -p -e robots=off -U mozilla http://www.example.com

    $ wget --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" URL-TO-DOWNLOAD

Increase Total Number of Retry Attempts

By default wget retries 20 times to make the download successful.

If the internet connection is unreliable, you may want to increase the number of tries:

wget --tries=75 DOWNLOAD-URL
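The same retry idea in a Java crawler could be sketched as below. This is an assumption-laden illustration: the helper Retry.withRetries is hypothetical, it retries only on IOException, and it does not pause between attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class Retry {
    /**
     * Runs download up to maxTries times and returns its first successful
     * result. If every attempt fails with an IOException, the last one is
     * rethrown. Other exceptions are not retried.
     */
    static <T> T withRetries(Callable<T> download, int maxTries) throws Exception {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return download.call();
            } catch (IOException e) {
                last = e;  // possibly transient network failure: try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download that fails twice before succeeding.
        String body = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection reset");
            }
            return "page content";
        }, 75);
        System.out.println(body + " after " + calls[0] + " tries");
    }
}
```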

Download Multiple Files / URLs Using Wget -i

First, store all the URLs to download in a text file:

    $ cat > download-file-list.txt
    URL1
    URL2
    URL3
    URL4

Next, give download-file-list.txt as an argument to wget using the -i option:

    $ wget -i download-file-list.txt

Download Only Certain File Types Using wget -r -A

You can use this in the following situations:

Download all images from a website

Download all videos from a website

Download all PDF files from a website

    $ wget -r -A.pdf http://url-to-webpage-with-pdfs/

download a directory

task: download all the files under the papers directory

wget -r --no-parent -w 2 --limit-rate=20k

Note that the trailing slash on the URL is critical

if you omit it, wget will think that papers is a file rather than a directory.

When it is done, you should have a directory labeled ActiveHistory.ca that contains the /papers/ sub-directory, perfectly mirrored on your system.

This directory will appear in the location that you ran the command from in your command line.

If you also pass -k (--convert-links), links will be replaced with internal links to the other pages you've downloaded, so you can actually have a fully working ActiveHistory.ca site on your computer.

mirror a website using -m

If you want to mirror an entire website, wget has a built-in option for it: -m.

This option means mirror, and is especially useful for backing up an entire website.

it looks at the time stamps, and does not repeat the download if the local copy is at least as recent as the remote file.

it supports infinite recursion (it will go as many layers into the site as necessary).

The command for mirroring would be:

wget -m -w 2 --limit-rate=20k

download in the background

    $ wget -b http://www.openss7.org/repos/tarballs/strx25-0.9.2.1.tar.bz2
    Continuing in background, pid 1984.
    Output will be written to 'wget-log'.

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class BasicCrawlController {

        public static void main(String[] args) throws Exception {
            String crawlStorageFolder = args[0];               // folder for intermediate crawl data
            int numberOfCrawlers = Integer.parseInt(args[1]);  // number of concurrent crawler threads

            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);
            config.setPolitenessDelay(1000);     // wait 1000 ms between requests
            config.setMaxDepthOfCrawling(2);     // follow links at most 2 levels from the seed
            config.setMaxPagesToFetch(1000);     // stop after 1000 pages
            config.setResumableCrawling(false);  // do not resume an interrupted crawl

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://www.ics.uci.edu/");
            // BasicCrawler (not shown) is the WebCrawler subclass that handles each page
            controller.start(BasicCrawler.class, numberOfCrawlers);
        }
    }


nutch overview

Apache Nutch is an open source Web crawler written in Java.

It can find Web page hyperlinks in an automated manner, which reduces a lot of maintenance work (for example, checking broken links),

and create a copy of all the visited pages for searching over.

Install Nutch

Option 1: Set up Nutch from a binary distribution

Download a binary package. Unzip your binary Nutch package. There should be a folder apache-nutch-1.X/

${NUTCH_RUNTIME_HOME} refers to the current directory (apache-nutch-1.X/).

Option 2: Set up Nutch from a source distribution

Advanced users may also use the source distribution:

Download a source package and unzip it


cd apache-nutch-1.X/

Run ant in this folder (cf. RunNutchInEclipse)

Now there is a directory runtime/local which contains a ready-to-use Nutch installation. When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that config files should be modified in apache-nutch-1.X/runtime/local/conf/

ant clean will remove this directory (keep copies of modified config files)

Verify your Nutch installation

Run "bin/nutch". You can confirm a correct installation if you see something similar to the following:

    Usage: nutch COMMAND
    where command is one of:
      crawl        one-step crawler for intranets (DEPRECATED)
      readdb       read / dump crawl db
      mergedb      merge crawldb-s, with optional filtering
      readlinkdb   read / dump link db
      inject       inject new urls into the database
      generate     generate new segments to fetch from crawl db
      freegen      generate new segments to fetch from text files
      fetch        fetch a segment's pages

Some troubleshooting tips

Run the following command if you are seeing "Permission denied":

chmod +x bin/nutch

Set JAVA_HOME if you are seeing "JAVA_HOME not set".

On Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed


Crawl your first website

Nutch requires two configuration changes before a website can be crawled:

Customize your crawl properties: at a minimum, provide a name for your crawler for external servers to recognize

Set a seed list of URLs to crawl

tutorial for wget.


Data Mining the Web Via Crawling, by Kate Matsudaira, July 26, 2012


How to crawl a quarter billion webpages in 40 hours, by Michael Nielsen, August 10, 2012

crawl using Amazon EC2

Beautiful Soup, a screen scraping tool

