allan-huang
Web crawler
[email protected]@eSobi.com
Agenda
• Forward of Web Crawler
  – HTML Parser
• Practice
  – Feed Crawler
• Prototype demo
• Conclusion
HTML Parser
• HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing.
• First clean up the mess and bring order to the tags, attributes, and ordinary text.
Well-known Parsers
• Access the information using standard XML interfaces.
• HtmlCleaner
• HtmlParser
• NekoHTML
Parser inner structure
• HTML scanner
  – Pre-processing action
• Tag balancer
  – Reorders individual elements
  – Produces well-formed XML
• Extraction
• Transformation
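The scanner and tag balancer stages above can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions, not how HtmlCleaner, HtmlParser, or NekoHTML are implemented: the names `TagBalancer`, `balance`, and `VOID_TAGS` are hypothetical, the void-element list is a small subset, and comments and doctypes are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed subset for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagBalancer(HTMLParser):
    """Scans tag soup and emits balanced, XML-style markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append("<{}{}/>".format(tag, attr_text))
        else:
            self.out.append("<{}{}>".format(tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after `tag`, then `tag` itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</{}>".format(open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        # Close whatever is still open at end of input.
        while self.stack:
            self.out.append("</{}>".format(self.stack.pop()))
        return "".join(self.out)


def balance(html_text):
    parser = TagBalancer()
    parser.feed(html_text)
    return parser.result()
```

For instance, `balance("<p><b>bold<i>both</b> tail</i><br>")` closes the overlapping `<i>` before `</b>`, drops the stray `</i>`, and closes `<p>` at the end. A production balancer would also know which elements implicitly close others (for example a second `<li>`), which this sketch does not attempt.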
Example
Extraction
• Text extraction
  – for use as input for text search engine databases, for example
• Link extraction
  – for crawling through web pages or harvesting email addresses
• Screen scraping
  – for programmatic data input from web pages
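Text and link extraction can be illustrated in a few lines. The sketch below assumes Python's standard `html.parser` and `urllib.parse`; the `Extractor` class and `extract` function are hypothetical names, only `<a href>` counts as a link, and "visible text" is approximated as everything outside `<script>` and `<style>`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class Extractor(HTMLParser):
    """Collects visible text and absolute link targets from one page."""

    SKIP = {"script", "style"}  # content that is not visible text

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

    def text(self):
        return " ".join(self.text_parts)


def extract(html_text, base_url):
    """Return (visible_text, absolute_links) for one HTML document."""
    parser = Extractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.text(), parser.links
```

A crawler would push the returned links onto its frontier, while the text would go to the search index.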
Extraction (continued)
• Resource extraction
  – collecting images or sound
• A browser front end
  – the preliminary stage of page display
• Link checking
  – ensuring links are valid
• Site monitoring
  – checking for page differences beyond simplistic diffs
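One way to read "beyond simplistic diffs" is to compare what a page says rather than its raw markup. The sketch below is an assumption about how site monitoring could work, not a method stated in the slides: it fingerprints the whitespace-normalized visible text, so markup-only changes are ignored. `content_fingerprint` and `page_changed` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps character data outside <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def content_fingerprint(html_text):
    """SHA-256 of the page's visible text with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()
    normalized = " ".join(" ".join(parser.parts).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def page_changed(old_html, new_html):
    """True only when the visible text differs between two snapshots."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)
```

Storing only the fingerprint per URL keeps the monitor cheap; a line-based diff of the same two snapshots would flag every template or whitespace change.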
Transformation
• URL rewriting
  – modifying some or all links on a page
• Site capture
  – moving content from the web to local disk
• Censorship
  – removing offending words and phrases from pages
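URL rewriting and word filtering can both be done in a single pass that re-emits the page. The following is a rough sketch using Python's standard library; `Rewriter`, `rewrite`, and `URL_ATTRS` are hypothetical names, only `href` and `src` are treated as links, banned words are masked with asterisks rather than removed, and comments and doctypes are not preserved.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"href", "src"}  # attributes treated as links in this sketch


class Rewriter(HTMLParser):
    """Re-emits a page with absolute links and masked banned words."""

    def __init__(self, base_url, banned_words=()):
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []
        words = "|".join(re.escape(w) for w in banned_words)
        self.banned = (
            re.compile(r"\b(?:%s)\b" % words, re.IGNORECASE) if words else None
        )

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}>".format(" ".join(parts)))

    def handle_startendtag(self, tag, attrs):
        # Emit self-closed tags as plain start tags.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.banned:
            data = self.banned.sub(lambda m: "*" * len(m.group()), data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))  # pass entities through

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))


def rewrite(html_text, base_url, banned_words=()):
    parser = Rewriter(base_url, banned_words)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For site capture, the same hook in `handle_starttag` could map each absolute URL to a local file path instead of calling `urljoin`.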
Transformation (continued)
• HTML cleanup
  – correcting erroneous pages
• Ad removal
  – excising URLs referencing advertising
• Conversion to XML
  – moving existing web pages to XML
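Ad removal can be approximated by dropping every element whose `href` or `src` points at a known advertising host. The sketch below rests on assumptions not in the slides: `AD_HOSTS` is a made-up blocklist, `AdStripper` and `strip_ads` are hypothetical names, and matching is done on the host name only.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

AD_HOSTS = {"ads.example.com"}  # hypothetical blocklist
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def is_ad_url(url, ad_hosts=AD_HOSTS):
    """True when the URL's host is a blocklisted host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in ad_hosts)


class AdStripper(HTMLParser):
    """Copies a page through, skipping elements that reference ad hosts."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self._drop_tag = None  # outermost element currently being dropped
        self._depth = 0        # nesting depth of that same tag name

    @staticmethod
    def _is_ad(attrs):
        return any(v and k in ("href", "src") and is_ad_url(v) for k, v in attrs)

    def handle_starttag(self, tag, attrs):
        if self._drop_tag:
            if tag == self._drop_tag:
                self._depth += 1
        elif self._is_ad(attrs):
            if tag not in VOID_TAGS:  # void tags have no content to skip
                self._drop_tag, self._depth = tag, 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._drop_tag and not self._is_ad(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._drop_tag:
            if tag == self._drop_tag:
                self._depth -= 1
                if self._depth == 0:
                    self._drop_tag = None
        else:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self._drop_tag:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self._drop_tag:
            self.out.append("&{};".format(name))

    def handle_charref(self, name):
        if not self._drop_tag:
            self.out.append("&#{};".format(name))


def strip_ads(html_text):
    parser = AdStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Real ad filters usually match URL patterns as well as hosts; the host-suffix test here is only the simplest workable rule.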
Practice
• Feed Crawler
  – HTML
    • Bloglines, Feedage
  – XML
    • RssMountain
  – JSON
    • Google AJAX Feed API
• Prototype
  – Demo
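The HTML and XML cases of a feed crawler can be sketched as two small steps: discover feed URLs advertised in a page's `<link>` tags, then read the items of an RSS document. This is an illustrative sketch only; it does not reproduce the response formats of Bloglines, Feedage, RssMountain, or the Google AJAX Feed API, and `FeedLinkFinder`, `discover_feeds`, and `parse_rss` are hypothetical names. The parser assumes plain RSS 2.0 without XML namespaces.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Finds <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html_text, base_url):
    """Return absolute feed URLs advertised by an HTML page."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html_text)
    finder.close()
    return finder.feeds


def parse_rss(xml_text):
    """Return title and link for each <item> of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]
```

A JSON source would replace `parse_rss` with `json.loads` plus a mapping from that service's field names to the same `title` and `link` keys.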
Conclusion
• Page search, image search, news search, blog search, feed search ...
• Fault tolerance in text processing
• Text mining on the web
• Q & A