Upload
bennett-morgan
View
214
Download
1
Embed Size (px)
Citation preview
It is impossible to guarantee that all relevant pages are returned (even inspected) (Figure 1):
• Millions of pages available, many of them not indexed in any search engine ("hidden web"),• Pages come and go; what is there one day may not be there next.
Web
Relevant Pages
RetrievedPages
It is impossible to guarantee that all pages returned are relevant (Figure 1):
• Quality of content cannot always be guaranteed.• A page may contain noisy content (adds, links, etc.).• Only part of a page may contain relevant data.• Information may be old.
Figure 1
There is no perfect search engine:• Mostly limited to keyword based search.• Absence or existence of a word does not necessarily imply irrelevancy or relevancy: ‘vessel’, ‘ready to ship’.• You cannot ask for pages about a topic: ‘rust in ships’.• Millions of results are returned.
In this project, we developed a focused web crawler to retrieve web resources on a user-provided topic. Given a thesaurus and a set of URLs by the user, we used Google API to get more seed URLs for the topic. We mainly used Oracle DB to store and index crawled pages, and utilized the provided thesaurus in ranking and filtering the content.
We review some of the problems encountered, roughly dividing them into practical or engineering issues and conceptual issues.
FOCUSED CRAWLING: EXPERIENCES IN A REAL WORLD PROJECTFOCUSED CRAWLING: EXPERIENCES IN A REAL WORLD PROJECTAntonio Badia, Tulay Muezzinoglu, Olfa Nasraoui
University of Louisville{abadia, t0muez01, olfa.nasraoui}@louisville.edu
ChallengesChallengesAbstractAbstract
The National Surface Treatment Center partners with the Navy, DoD Operations, and industry to fight corrosion and solve coating problems. Its web site aims to be the main reference point for people and organizations involved in these problems.
The goal is to help the NST Center achieve its objectives for the web portal by
gathering relevant information from the web, distributing this information to users and interested
parties.
GoalGoal
Focused Web Crawler: A program that explores the web trying to get pages on a particular topic,
Thesaurus: Series of semantic networks built-up by a collection of keywords connected by various relationships.
ToolsTools
Difficulty in knowledge representation• No formal definition of what a topic is.• No consistent assignment of relationships• Automated searches demand for thesauri with relationships that are subcategorized.
ApproachesApproaches
Relevancy to a topic:• Difficult to determine.• Utilize thesauri: easy to maintain.
Network structure: • If page p is about a topic t and p links to p’, there is a chance that page p’ is about t.• Start with pages about the desired topic.• Beware of topic drift.• Find the pages linking to a given page to determine the topic: backlinks. • Web is not a fully connected network: allow new source discovery mechanism.• Keep track of good sites and hubs.
Unit of work:• Page may contain several topics: page is too coarse unit.• Entire site may be devoted to a certain topic: page is not an apt unit, but site is.
Useful information:• Extract real content (Clean forms, footers, advertisements, site navigation menus, etc.)• Detect good hubs: No actual content.• Eliminate forums, blogs: no easy way to determine their quality.
Freshness of information:•Very important in many cases, disregarded by most search engines.• Last modified date in http response header: not always available.• For announcements last modified date is not useful.• Page itself may contain time (date) information, but it is very difficult to find - Information Extraction techniques are needed.
Anatomy of the SystemAnatomy of the System
Search potential pages using broad terms in the thesaurus; many results returned.
GoogleGoogle
CrawlCrawl
ThesaurusThesaurus
Classifier(News,
Technical Docs, etc.)
Classifier(News,
Technical Docs, etc.)
Back-End Database
Back-End Database
Filter irrelevant web pages using text algorithms that analyze page's topic. Also update site list according to overall score.
Extract & Index Content
Extract & Index Content
Crawl pages: eliminate old or duplicate ones.
FilterFilter
Clear noisy content (ads, menus, etc) & index text.
Generate queries
Generate queries
ThesaurusThesaurus Improve the thesaurus with relevancy (i.e. ranking) feedback.
User provided/Discovered
Sites
User provided/Discovered
Sites
API
Web PortalWeb Portal
Feed the web portal with the new results.
Classify discovered pages.
Use Google to increase recall. Initial gathering of pages guided by several
keyword searches; each one uses a few top thesaurus terms ("cast a wide net").
Keep a list of good sites, hub pages. Increase weights of the keywords with their depth in
the thesaurus when scoring. Variety of keywords are rewarded. Negative word list is constructed: if a word from the
list exists, score of page is significantly lowered .
Highlights of the AlgorithmHighlights of the Algorithm
Written in Java. Web interface with JSP. Many PL/SQL functions to support effective
thesaurus usage in Oracle. Incorporated open source packages such as
Htmlparser, JTidy, Lucene.
Figure 2: Visualization of thesaurus in Oracle Text system
Figure 3: Part of an interface used to set system parameters
Figure 4: Interface used for system evaluation
Application InterfaceApplication Interface
This research is supported by Innovative Productivity, Inc., a nonprofit Kentucky company that runs the National Surface Treatment Center for the U.S. Navy.
AcknowledgementsAcknowledgements
The work of A. Badia was supported by National Science Foundation CAREER award IIS-0347555. The work of O. Nasraoui was supported by National Science Foundation CAREER Award IIS-0133948.
Find ways to automatically update thesaurus. Enrich thesaural relationships for effective
relevancy judgment. Take temporal information into account when
scoring relevancy. Introduce IE and QA tools into web crawling.
Further Research Further Research
Knowledge Discovery & Web MiningKnowledge Discovery & Web MiningResearch / E-commerce LabResearch / E-commerce Lab
Database LabDatabase LabCECS DepartmentCECS Department Director: Dr. Olfa NasraouiDirector: Dr. Olfa NasraouiDirector: Dr. Antonio BadiaDirector: Dr. Antonio Badia