Web Crawler

Page 1: Web Crawler

Web crawler

[email protected]@eSobi.com

Page 2: Web Crawler

Agenda
• Foreword on Web Crawlers
  – HTML Parser
• Practice
  – Feed Crawler
• Prototype demo
• Conclusion

Page 3: Web Crawler

HTML Parser
• HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing.
• First clean up the mess and bring order to tags, attributes and ordinary text.

Page 4: Web Crawler

Well-known Parsers
• Access the information using standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML

Page 5: Web Crawler

Parser inner structure
• HTML scanner
  – Pre-processing action
• Tag balancer
  – Reorders individual elements
  – Produces well-formed XML
• Extraction
• Transformation
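
To make the scanner and tag balancer stages concrete, here is a minimal sketch in Python, assuming the standard-library html.parser module stands in for the HTML scanner. The class name TagBalancer and the function balance are hypothetical names chosen for illustration; they are not the API of HtmlCleaner, HtmlParser or NekoHTML. The sketch closes elements left open and drops stray closing tags. It does not apply HTML's implied-end-tag rules, and input with several top-level elements would still need a wrapper element to form a single XML document.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Toy balancer (hypothetical): re-emits sloppy HTML with balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the output document
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            # A valueless attribute (e.g. "disabled") repeats its own name.
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def balance(html):
    """Return the input with all tags balanced and text escaped."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

For example, balance("<p><b>bold<i>both</b> tail") closes the inner i element before b and appends the missing closing p tag.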

Page 6: Web Crawler

Example

Page 7: Web Crawler

Extraction
• Text extraction
  – for use as input for text search engine databases, for example
• Link extraction
  – for crawling through web pages or harvesting email addresses
• Screen scraping
  – for programmatic data input from web pages
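
Text and link extraction can be sketched in a few lines of Python, again assuming the standard-library html.parser module. PageExtractor and extract are hypothetical names, and the example URLs are placeholders. The sketch collects visible text (skipping script and style content), resolves link targets against a base URL, and sets mailto: addresses aside separately.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects visible text, absolute links and mailto addresses."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.emails = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any "?subject=..." suffix.
                self.emails.append(href[len("mailto:"):].split("?")[0])
            else:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (text, links, emails) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links, parser.emails
```

The text would feed a search index, while the links list is what a crawler would push onto its queue of pages to visit next.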

Page 8: Web Crawler

Extraction
• Resource extraction
  – collecting images or sound
• A browser front end
  – the preliminary stage of page display
• Link checking
  – ensuring links are valid
• Site monitoring
  – checking for page differences beyond simplistic diffs
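
One way to read "beyond simplistic diffs" is to compare what the page says rather than its raw markup. The following is a minimal sketch under that assumption: it diffs only the whitespace-normalised text chunks of two snapshots, so changes to tag case, attributes or indentation are ignored. The names visible_text and content_changes are hypothetical, and script or style content is not filtered out here.

```python
import difflib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects non-empty text chunks with whitespace normalised."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(" ".join(data.split()))


def visible_text(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


def content_changes(old_html, new_html):
    """Return removed ('- ') and added ('+ ') text chunks between snapshots."""
    old, new = visible_text(old_html), visible_text(new_html)
    return [
        line
        for line in difflib.ndiff(old, new)
        if line.startswith(("- ", "+ "))  # skip unchanged and hint lines
    ]
```

A page whose markup was restyled but whose wording is unchanged yields an empty list, while a changed price or headline shows up as one removed and one added chunk.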

Page 9: Web Crawler

Transformation
• URL rewriting
  – modifying some or all links on a page
• Site capture
  – moving content from the web to local disk
• Censorship
  – removing offending words and phrases from pages
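
URL rewriting and word filtering are both transformations that re-emit the page while changing parts of it. Here is a minimal sketch, assuming Python's standard-library html.parser module; PageRewriter and rewrite are hypothetical names, only href and src are treated as link attributes, and the word list is supplied by the caller. Relative links become absolute (useful for site capture) and listed words in text are masked with asterisks. Text inside script elements is not treated specially.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"href", "src"}  # attributes treated as links in this sketch


class PageRewriter(HTMLParser):
    """Re-emits HTML with absolute links and masked words."""

    def __init__(self, base_url, banned_words):
        # convert_charrefs=False keeps entities such as &amp; intact in text.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []
        pattern = "|".join(re.escape(word) for word in banned_words)
        self.banned = (
            re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE) if pattern else None
        )

    def _format_attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # valueless attribute, e.g. <input disabled>
                parts.append(f" {name}")
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._format_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._format_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.banned:
            data = self.banned.sub(lambda m: "*" * len(m.group(0)), data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite(html, base_url, banned_words=()):
    parser = PageRewriter(base_url, banned_words)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite(page, "http://example.com/dir/", ["darn"]) turns href="page2.html" into href="http://example.com/dir/page2.html" and replaces each occurrence of the listed word with asterisks of the same length.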

Page 10: Web Crawler

Transformation
• HTML cleanup
  – correcting erroneous pages
• Ad removal
  – excising URLs referencing advertising
• Conversion to XML
  – moving existing web pages to XML
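
Ad removal comes down to deciding which URLs reference advertising. A minimal sketch of that decision, assuming a host-based block list: the hosts in AD_HOSTS are placeholders invented for illustration, and is_ad_url and strip_ad_urls are hypothetical names. Real filters typically also match on URL paths and patterns.

```python
from urllib.parse import urlsplit

# Hypothetical block list; a real one would be loaded from a maintained source.
AD_HOSTS = {"ads.example.com", "adserver.example.net"}


def is_ad_url(url, ad_hosts=AD_HOSTS):
    """True if the URL's host is a listed ad host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in ad_hosts)


def strip_ad_urls(urls, ad_hosts=AD_HOSTS):
    """Return the URLs that do not point at a listed ad host."""
    return [url for url in urls if not is_ad_url(url, ad_hosts)]
```

Matching on the parsed host rather than on the raw string avoids false hits on URLs that merely mention an ad host in their path.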

Page 11: Web Crawler

Practice
• Feed Crawler
  – HTML
    • Bloglines, Feedage
  – XML
    • RssMountain
  – JSON
    • Google AJAX Feed API
• Prototype
  – Demo
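
A feed crawler has to normalise entries that arrive in different formats. The sketch below, which is not the prototype from the demo, shows the XML and JSON cases using Python's standard library. It assumes RSS 2.0 style item elements without namespaces, and the JSON shape ({"entries": [...]}) is an assumption made for illustration rather than the response format of any particular service. The function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def entries_from_rss(xml_text):
    """Extract title and link from each <item> of an RSS-style document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return entries


def entries_from_json(json_text):
    """Extract entries from an assumed {"entries": [{"title", "link"}]} shape."""
    data = json.loads(json_text)
    return [
        {"title": entry.get("title", ""), "link": entry.get("link", "")}
        for entry in data.get("entries", [])
    ]
```

Both functions return the same list-of-dictionaries structure, so the rest of the crawler does not need to know which format a feed was served in. Feeds published only as HTML pages would first go through the parser stages described earlier.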

Page 12: Web Crawler

Conclusion
• Page search, image search, news search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the web
• Q & A