Web Crawler


[email protected]@eSobi.com

Agenda
• Foreword on web crawlers
  – HTML Parser
• Practice
  – Feed Crawler
• Prototype demo
• Conclusion

HTML Parser
• HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
• The first step is to clean up the mess and bring order to tags, attributes, and ordinary text.

Well-known Parsers
• Access the information using standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML

Parser inner structure
• HTML scanner
  – Pre-processing action
• Tag balancer
  – Reorders individual elements
  – Produces well-formed XML
• Extraction
• Transformation
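
The tag-balancer stage above can be sketched in a few lines. This is a toy illustration using Python's standard `html.parser`, not the implementation used by HtmlCleaner, HtmlParser, or NekoHTML (which are Java libraries); the names `TagBalancer`, `balance`, and the `VOID` set are assumptions made for this example. It closes unclosed elements, drops stray end tags, and escapes text, but it does not apply HTML's implicit-close rules (for example, a second `<p>` ends up nested inside the first).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Toy balancer: closes unclosed tags and ignores stray end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the balanced output
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") get their own name as value.
            rendered += f" {name}={quoteattr(value if value is not None else name)}"
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: nothing to close
        # Close every element opened after the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def balance(html):
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

When the input has a single root element, the output can be handed to an ordinary XML parser, which is the point of the balancing step.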

Example

Extraction
• Text extraction
  – for use as input for text search engine databases, for example
• Link extraction
  – for crawling through web pages or harvesting email addresses
• Screen scraping
  – for programmatic data input from web pages
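
Text and link extraction, the first two items above, can be sketched in one pass over the page. The example uses Python's standard `html.parser` as a stand-in for the parsers named earlier; `Extractor`, `extract`, and the example URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Extractor(HTMLParser):
    """Collects visible text and absolute link targets in one pass."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (visible_text, list_of_absolute_links)."""
    parser = Extractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

A crawler would push the returned links onto its frontier queue and pass the text to the indexer.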

Extraction (continued)
• Resource extraction
  – collecting images or sound
• A browser front end
  – the preliminary stage of page display
• Link checking
  – ensuring links are valid
• Site monitoring
  – checking for page differences beyond simplistic diffs
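
One way to read "beyond simplistic diffs" for site monitoring is to compare the visible text of two snapshots rather than their raw bytes, so that markup-only changes do not raise an alert. The sketch below assumes that interpretation; `text_fingerprint` and `has_changed` are hypothetical names, and only the Python standard library is used.

```python
import hashlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps character data and discards all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_fingerprint(html):
    """Hash of the page text with markup removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    normalized = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html, new_html):
    """True only when the text content differs between two snapshots."""
    return text_fingerprint(old_html) != text_fingerprint(new_html)
```

A monitor would store one fingerprint per URL and re-fetch on a schedule; a real one would probably also skip script and style content, which this sketch does not.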

Transformation
• URL rewriting
  – modifying some or all links on a page
• Site capture
  – moving content from the web to local disk
• Censorship
  – removing offending words and phrases from pages
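
URL rewriting can be illustrated by turning every relative `href` and `src` into an absolute URL, which is a common first step before site capture. This is a minimal regex-based sketch with hypothetical names (`ATTR_RE`, `rewrite_links`); it handles only quoted attribute values and would also touch look-alike text outside tags, so a parser-based rewriter is the safer choice for real pages.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src='...' with a quoted value (unquoted values are skipped).
ATTR_RE = re.compile(
    r"""(?P<prefix>\b(?:href|src)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE,
)


def rewrite_links(html, base_url):
    """Replace each quoted href/src value with its absolute form."""

    def absolutize(match):
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("quote")
        return f"{match.group('prefix')}{quote}{absolute}{quote}"

    return ATTR_RE.sub(absolutize, html)
```

Swapping `urljoin` for a function that maps URLs to local file paths would give the link-fixing half of site capture.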

Transformation (continued)
• HTML cleanup
  – correcting erroneous pages
• Ad removal
  – excising URLs referencing advertising
• Conversion to XML
  – moving existing web pages to XML
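
Ad removal, as described above, comes down to deciding whether a URL points at an advertising host and dropping the element that references it. The sketch below assumes a host blocklist; the hosts in `AD_HOSTS` are made up for illustration, and `is_ad_url` and `strip_ad_images` are hypothetical names. Only `<img>` tags with quoted `src` values are handled.

```python
import re
from urllib.parse import urlparse

# Hypothetical blocklist; a real one would be loaded from a maintained list.
AD_HOSTS = {"ads.example.com", "tracker.example.net"}

IMG_RE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*["'](?P<url>[^"']*)["'][^>]*>""",
    re.IGNORECASE,
)


def is_ad_url(url):
    """True when the URL's host is a blocklisted host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def strip_ad_images(html):
    """Remove <img> tags whose src points at a blocklisted host."""
    return IMG_RE.sub(
        lambda m: "" if is_ad_url(m.group("url")) else m.group(0), html
    )
```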

Practice
• Feed Crawler
  – HTML
    • Bloglines, Feedage
  – XML
    • RssMountain
  – JSON
    • Google AJAX Feed API
• Prototype
  – Demo
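
A feed crawler working from HTML and XML sources needs two small pieces: finding feed URLs advertised in a page, and reading items out of a feed. The sketch below is not the prototype from the demo and does not call any of the services listed above; it assumes the common `<link rel="alternate">` autodiscovery convention and an RSS 2.0 document, and the names `FeedLinkFinder`, `discover_feeds`, and `parse_rss` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate" ...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds


def parse_rss(xml_text):
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        }
        for item in root.iter("item")
    ]
```

A JSON source would replace `parse_rss` with `json.loads` plus a mapping from that service's field names, which are not assumed here.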

Conclusion
• Page search, image search, news search, blog search, feed search ...
• Fault tolerance of text processing
• Text mining on the web
• Q & A
