




FOCUSED CRAWLING: EXPERIENCES IN A REAL WORLD PROJECT

Antonio Badia, Tulay Muezzinoglu, Olfa Nasraoui
University of Louisville
{abadia, t0muez01, olfa.nasraoui}@louisville.edu

Challenges

It is impossible to guarantee that all relevant pages are returned, or even inspected (Figure 1):

• Millions of pages are available, many of them not indexed in any search engine (the "hidden web").
• Pages come and go; what is there one day may not be there the next.


It is impossible to guarantee that all pages returned are relevant (Figure 1):

• Quality of content cannot always be guaranteed.
• A page may contain noisy content (ads, links, etc.).
• Only part of a page may contain relevant data.
• Information may be old.

Figure 1: The Web, the set of relevant pages, and the set of retrieved pages; the latter two overlap only partially.

There is no perfect search engine:
• Search is mostly limited to keyword-based queries.
• The absence or presence of a word does not necessarily imply irrelevance or relevance: ‘vessel’, ‘ready to ship’.
• You cannot ask for pages about a topic: ‘rust in ships’.
• Millions of results are returned.

Abstract

In this project, we developed a focused web crawler to retrieve web resources on a user-provided topic. Given a thesaurus and a set of seed URLs supplied by the user, we used the Google API to obtain additional seed URLs for the topic. We mainly used an Oracle database to store and index crawled pages, and utilized the provided thesaurus to rank and filter the content.

We review some of the problems encountered, roughly dividing them into practical or engineering issues and conceptual issues.



Goal

The National Surface Treatment Center partners with the Navy, DoD Operations, and industry to fight corrosion and solve coating problems. Its web site aims to be the main reference point for people and organizations involved in these problems.

The goal of this project is to help the NST Center achieve its objectives for the web portal by gathering relevant information from the web and distributing this information to users and interested parties.


Tools

Focused Web Crawler: A program that explores the web trying to retrieve pages on a particular topic.

Thesaurus: A series of semantic networks built up from a collection of keywords connected by various relationships.
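Such a thesaurus can be thought of as a small term graph. The following is a minimal sketch of that idea in Java; the class and relationship names (Thesaurus, Relation, etc.) are our own assumptions for illustration, not the project's actual data model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory thesaurus: keywords linked by typed relationships.
public class Thesaurus {
    enum Relation { BROADER, NARROWER, RELATED, SYNONYM }

    static class Edge {
        final String target;
        final Relation relation;
        Edge(String target, Relation relation) { this.target = target; this.relation = relation; }
    }

    private final Map<String, List<Edge>> graph = new HashMap<>();

    public void addLink(String from, Relation relation, String to) {
        graph.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(to, relation));
    }

    // Depth of a term below a given root term (assumes NARROWER links form a tree,
    // i.e. no cycles). Deeper terms are more specific; returns -1 if not found.
    public int depth(String root, String term) {
        return depth(root, term, 0);
    }

    private int depth(String current, String term, int level) {
        if (current.equals(term)) return level;
        for (Edge e : graph.getOrDefault(current, List.of())) {
            if (e.relation == Relation.NARROWER) {
                int d = depth(e.target, term, level + 1);
                if (d >= 0) return d;
            }
        }
        return -1;
    }
}
```

The depth of a term is reused later when scoring pages, where more specific terms receive higher weight.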


Difficulty in knowledge representation:
• There is no formal definition of what a topic is.
• Relationships are not assigned consistently.
• Automated searches demand thesauri whose relationships are subcategorized.

Approaches

Relevancy to a topic:
• Difficult to determine.
• Utilize thesauri: they are easy to maintain.

Network structure:
• If page p is about a topic t and p links to p’, there is a chance that p’ is also about t (see the frontier sketch below).
• Start with pages about the desired topic.
• Beware of topic drift.
• Find the pages linking to a given page to determine its topic: backlinks.
• The web is not a fully connected network: allow a mechanism for discovering new sources.
• Keep track of good sites and hubs.
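A minimal sketch of how link structure can drive a focused crawl: outlinks inherit part of the fetched page's relevance score, and a decay factor limits topic drift. This is an illustration only, with invented names (Frontier, Candidate, LINK_DECAY); it is not the project's actual crawler.

```java
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Illustrative focused-crawl frontier: URLs are visited in order of estimated relevance.
public class Frontier {
    static class Candidate implements Comparable<Candidate> {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
        public int compareTo(Candidate other) { return Double.compare(other.score, this.score); }
    }

    private final PriorityQueue<Candidate> queue = new PriorityQueue<>();
    private final Set<String> seen = new HashSet<>();
    private static final double LINK_DECAY = 0.5; // dampens inherited scores to limit topic drift

    public void addSeed(String url, double score) {
        if (seen.add(url)) queue.add(new Candidate(url, score));
    }

    // After fetching and scoring a page, enqueue its outlinks with a decayed score.
    public void expand(double pageScore, List<String> outlinks) {
        for (String link : outlinks) {
            if (seen.add(link)) queue.add(new Candidate(link, pageScore * LINK_DECAY));
        }
    }

    public Candidate next() { return queue.poll(); }
    public boolean isEmpty() { return queue.isEmpty(); }
}
```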

Unit of work:
• A page may contain several topics: the page is too coarse a unit.
• An entire site may be devoted to a certain topic: the page is not an apt unit, but the site is (see the sketch below).
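Since the site can be a better unit than the single page, the crawler can keep per-site statistics. The sketch below averages the scores of a site's pages; the averaging scheme is an assumption made for illustration, not the project's actual bookkeeping.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative site-level bookkeeping: aggregate page scores per host so that
// whole sites (not just single pages) can be promoted or demoted.
public class SiteScores {
    private final Map<String, double[]> stats = new HashMap<>(); // host -> {sum, count}

    public void recordPage(String url, double pageScore) {
        String host = URI.create(url).getHost();
        double[] s = stats.computeIfAbsent(host, k -> new double[2]);
        s[0] += pageScore;
        s[1] += 1;
    }

    public double siteScore(String host) {
        double[] s = stats.get(host);
        return (s == null || s[1] == 0) ? 0.0 : s[0] / s[1];
    }
}
```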

Useful information:
• Extract the real content (clean out forms, footers, advertisements, site navigation menus, etc.).
• Detect good hubs: pages of links with no actual content of their own.
• Eliminate forums and blogs: there is no easy way to determine their quality.

Freshness of information:
• Very important in many cases, yet disregarded by most search engines.
• The Last-Modified date in the HTTP response header is not always available (see the sketch below).
• For announcements, the last-modified date is not useful.
• The page itself may contain time (date) information, but it is very difficult to find; Information Extraction techniques are needed.
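As a small illustration of the header-based freshness check and its main limitation, the following Java snippet reads the Last-Modified header and falls back to "keep the page" when the server does not send one. The method name and the age threshold are assumptions for the example, not part of the original system.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.time.Duration;
import java.time.Instant;

public class FreshnessCheck {
    // Returns true if the page was modified within maxAge, false if it is older.
    // When the server sends no Last-Modified header, the age is unknown and we keep the page.
    public static boolean isFresh(String pageUrl, Duration maxAge) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("HEAD");
        long lastModified = conn.getLastModified(); // 0 when the header is absent
        conn.disconnect();
        if (lastModified == 0) {
            return true; // unknown age: defer to content-based checks
        }
        Instant modified = Instant.ofEpochMilli(lastModified);
        return Duration.between(modified, Instant.now()).compareTo(maxAge) <= 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isFresh("https://example.com/", Duration.ofDays(365)));
    }
}
```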

Anatomy of the System

The system pipeline (shown as a block diagram on the poster) connects the following components:
• Generate queries (Thesaurus to Google, via its API): search for potential pages using broad terms from the thesaurus; many results are returned.
• Crawl (seeded from user-provided and discovered sites): crawl pages, eliminating old or duplicate ones.
• Filter: filter out irrelevant web pages using text algorithms that analyze each page's topic; also update the site list according to the overall score.
• Extract & Index Content: clear noisy content (ads, menus, etc.) and index the text.
• Classifier: classify discovered pages (news, technical docs, etc.).
• Back-End Database: store the crawled, indexed, and classified pages.
• Thesaurus: improve the thesaurus with relevancy (i.e., ranking) feedback.
• Web Portal: feed the web portal with the new results.
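The first stage, generating seed queries from a few broad thesaurus terms, could look roughly like the sketch below. The SearchClient interface stands in for the Google API used in the project; its shape and the pairwise query format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative seed-query generation: combine a few top-level thesaurus terms
// into keyword queries ("cast a wide net"), then collect the result URLs as seeds.
public class SeedFinder {

    // Placeholder for the web search service (the project used the Google API).
    interface SearchClient {
        List<String> search(String query, int maxResults);
    }

    private final SearchClient client;

    public SeedFinder(SearchClient client) { this.client = client; }

    public List<String> findSeeds(List<String> topTerms, int resultsPerQuery) {
        List<String> seeds = new ArrayList<>();
        // Pair each broad term with every later one to form two-keyword queries.
        for (int i = 0; i < topTerms.size(); i++) {
            for (int j = i + 1; j < topTerms.size(); j++) {
                String query = topTerms.get(i) + " " + topTerms.get(j);
                seeds.addAll(client.search(query, resultsPerQuery));
            }
        }
        return seeds;
    }
}
```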

Highlights of the Algorithm

• Use Google to increase recall: the initial gathering of pages is guided by several keyword searches, each using a few top thesaurus terms ("cast a wide net").
• Keep a list of good sites and hub pages.
• When scoring, increase the weights of keywords with their depth in the thesaurus (a sketch of such a scoring function follows this list).
• Reward a variety of keywords.
• Construct a negative word list: if a word from the list appears on a page, the page's score is significantly lowered.
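A minimal sketch of a scoring function that follows these rules: deeper (more specific) thesaurus terms get higher weight, keyword variety is rewarded, and any term from the negative list applies a heavy penalty. The constants and the single-word matching are invented for illustration and are not the values or logic used in the project.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative page scorer following the highlighted rules (single-word terms only, for simplicity).
public class PageScorer {
    private final Map<String, Integer> termDepth;   // thesaurus term -> depth (specific terms are deeper)
    private final Set<String> negativeWords;        // words that strongly suggest irrelevance

    public PageScorer(Map<String, Integer> termDepth, Set<String> negativeWords) {
        this.termDepth = termDepth;
        this.negativeWords = negativeWords;
    }

    public double score(String pageText) {
        String[] tokens = pageText.toLowerCase().split("\\W+");
        double score = 0.0;
        Set<String> distinctMatches = new HashSet<>();
        for (String token : tokens) {
            if (negativeWords.contains(token)) {
                score -= 100.0;                      // negative-list hit: lower the score significantly
            }
            Integer depth = termDepth.get(token);
            if (depth != null) {
                score += 1.0 + depth;                // deeper thesaurus terms weigh more
                distinctMatches.add(token);
            }
        }
        score += 5.0 * distinctMatches.size();       // reward variety of matched keywords
        return score;
    }
}
```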


• The system is written in Java, with a web interface in JSP.
• Many PL/SQL functions support effective thesaurus usage in Oracle.
• Open-source packages such as Htmlparser, JTidy, and Lucene are incorporated (a minimal indexing sketch follows).
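For the indexing step, a minimal Lucene sketch is shown below, assuming a current Lucene release (the API available at the time of the project differed); the field names and index path are arbitrary choices for the example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

// Index a cleaned page under its URL so the filtered text can be searched later.
public class PageIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page", Field.Store.YES));
            doc.add(new TextField("content", "cleaned page text ...", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```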

Application Interface

Figure 2: Visualization of the thesaurus in the Oracle Text system.
Figure 3: Part of an interface used to set system parameters.
Figure 4: Interface used for system evaluation.


Acknowledgements

This research is supported by Innovative Productivity, Inc., a nonprofit Kentucky company that runs the National Surface Treatment Center for the U.S. Navy. The work of A. Badia was supported by National Science Foundation CAREER award IIS-0347555. The work of O. Nasraoui was supported by National Science Foundation CAREER award IIS-0133948.

Further Research

• Find ways to automatically update the thesaurus.
• Enrich thesaural relationships for effective relevancy judgment.
• Take temporal information into account when scoring relevancy.
• Introduce Information Extraction (IE) and Question Answering (QA) tools into web crawling.


Knowledge Discovery & Web Mining Research / E-commerce Lab (Director: Dr. Olfa Nasraoui)
Database Lab (Director: Dr. Antonio Badia)
CECS Department, University of Louisville