Dynamic Reference Sifting Jonathan Shakes, Marc Langheinrich, and Oren Etzioni University of Washington Department of Computer Science and Engineering A Case Study in the Homepage Domain

Dynamic Reference Sifting

Jonathan Shakes, Marc Langheinrich, and Oren Etzioni

University of Washington

Department of Computer Science and Engineering

A Case Study in the Homepage Domain

Outline

Introduction
- Softbots and Dynamic Reference Sifters
- Searching the Web

Case Study: Personal Homepages
- Ahoy! The Homepage Finder
- Experimental Results

Future and Related Work
- Other Domains for DRS

Softbots and Dynamic Reference Sifters

Dynamic Reference Sifters
- Part of the “Internet Softbots Project” [Etzioni and Weld, 1994]

Softbots
- Person states what
- Softbot determines how and where

Information Retrieval Definitions

Precision: measure of a search service's accuracy

Recall: measure of a search service's comprehensiveness

Precision

Precision = Relevant Search Results / All Search Results

[Figure: diagram of the Search Space, labeled Relevant Documents, Irrelevant Documents, and All Search Results]

Recall

Recall = Relevant Search Results / All Relevant Documents

[Figure: diagram of the Search Space, labeled Relevant Documents, Irrelevant Documents, and All Search Results]
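The two ratios above can be illustrated with a minimal Python sketch. The document identifiers and the example sets are hypothetical and are not data from the talk.

```python
def precision(results, relevant):
    """Relevant search results divided by all search results."""
    results, relevant = set(results), set(relevant)
    if not results:
        return 0.0
    return len(results & relevant) / len(results)


def recall(results, relevant):
    """Relevant search results divided by all relevant documents."""
    results, relevant = set(results), set(relevant)
    if not relevant:
        return 0.0
    return len(results & relevant) / len(relevant)


# Hypothetical example: 4 results returned, 2 of them relevant,
# out of 5 relevant documents in the whole search space.
results = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7", "d8", "d9"]
print(precision(results, relevant))  # 0.5
print(recall(results, relevant))     # 0.4
```

A service can raise recall by returning more documents, but unless the extra documents are relevant its precision falls. That is the trade-off the following slides describe.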

Searching the Web

Web Indices (AltaVista, Hotbot)
- Automated: high recall
- Keyword based: low precision

Web Directories (Yahoo, A2Z)
- Classified manually: high precision
- Low recall

Manual Search
- Slow

Searching the Web

Dynamic Reference Sifter

An information retrieval tool that uses:
- multiple, complementary data sources for high recall,
- domain-specific filtering techniques for high precision, and
- machine learning to improve performance over time.
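The first two ingredients can be sketched in a few lines of Python. This is only an illustration of the idea, not Ahoy!'s implementation. The function names, the stand-in sources, and the tilde-based filter are all assumptions, and the learning component is left out.

```python
from typing import Callable, Iterable, List

Source = Callable[[str], Iterable[str]]  # query -> candidate references
Filter = Callable[[str, str], bool]      # (query, reference) -> keep?


def sift(query: str, sources: List[Source], filters: List[Filter]) -> List[str]:
    """Union the candidates from every source, then keep only those
    that pass every domain-specific filter."""
    seen = set()
    candidates = []
    for source in sources:  # several sources: aims at high recall
        for ref in source(query):
            if ref not in seen:
                seen.add(ref)
                candidates.append(ref)
    # Filtering: aims at high precision.
    return [ref for ref in candidates
            if all(keep(query, ref) for keep in filters)]


# Hypothetical stand-ins for real data sources and filters.
def index_a(query):
    return ["http://example.edu/~jdoe/", "http://example.com/news"]


def index_b(query):
    return ["http://example.edu/~jdoe/", "http://example.org/doe-fans"]


def looks_like_homepage(query, ref):
    return "~" in ref


print(sift("John Doe", [index_a, index_b], [looks_like_homepage]))
# ['http://example.edu/~jdoe/']
```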

Case Study: The Personal Homepage Domain

“Conventional” Search Services
- Indices find too much
- Directories find too little
- Manual Search takes too long
- Failures are expensive

Ahoy! The Homepage Finder attempts to provide
- High Recall
- High Precision
- Speed

Ahoy! Architecture

[Figure: architecture diagram with components labeled User Input, Web Page Reference Source, Institutional Information Source, E-mail Address Sources, Filters, and Output]

Performance Analysis

Test using lists of known homepages
- Researchers sample: 582 homepages
- Transportation sample: 53 homepages

Compare against MetaCrawler, Hotbot, AltaVista, Yahoo!

Maximize competitors’ performance by
- using “expert” options
- allowing up to 200 references
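The charts that follow report the share of targets found in the highest-ranked reference, in the top 10 references, and in all references. One way to compute such a measure is sketched below in Python. The function name, the data layout, and the toy data are assumptions made for illustration, not the authors' evaluation code.

```python
def targets_found(ranked_results, targets, k=None):
    """Fraction of targets that appear among the first k references
    returned for their query (k=None means all references)."""
    if not targets:
        return 0.0
    found = 0
    for query, target in targets.items():
        refs = ranked_results.get(query, [])
        if k is not None:
            refs = refs[:k]
        if target in refs:
            found += 1
    return found / len(targets)


# Hypothetical toy data: one known homepage (target) per query.
targets = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
ranked_results = {
    "q1": ["a", "x"],         # target ranked first
    "q2": ["y", "b"],         # target ranked second
    "q3": ["z"] * 11 + ["c"], # target ranked twelfth
    "q4": ["w"],              # target never returned
}
print(targets_found(ranked_results, targets, k=1))   # 0.25
print(targets_found(ranked_results, targets, k=10))  # 0.5
print(targets_found(ranked_results, targets))        # 0.75
```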

Performance Analysis

“Precision” - Researcher Sample

[Chart: Targets Found (0% to 90%) by Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: Highest-ranked Reference]

Performance Analysis

Top 10 References - Researcher Sample

[Chart: Targets Found (0% to 90%) by Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: Highest 10 References, Highest-ranked Reference]

Performance Analysis

Recall (all References) - Researcher Sample

[Chart: Targets Found (0% to 90%) by Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: All References, Highest 10 References, Highest-ranked Reference]

Performance Analysis

Recall (all References) - Transportation Sample

[Chart: Targets Found (0% to 90%) by Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: All References, Highest 10 References, Highest-ranked Reference]

Learning in Ahoy!

Learns URL ‘patterns’
- http://sdcc3.ucsd.edu/home-pages/<Login>/
- 50,000+ patterns in 3 months

Indexes patterns by institution
- 11,000+ institutions indexed in 3 months

Performance Impact
- Up to 8% gain in recall
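A rough Python sketch of what learning such a pattern and indexing it by institution could look like is given below. It is an assumption-laden illustration, not the Ahoy! code. The logins `jdoe` and `asmith`, the institution key `ucsd.edu`, and both function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# institution -> set of learned URL patterns (hypothetical index layout)
patterns_by_institution = defaultdict(set)


def learn_pattern(url, login, institution):
    """Generalize a known homepage URL by replacing the login in its
    path with the placeholder <Login>, and index it by institution."""
    parsed = urlparse(url)
    if login not in parsed.path:
        return None  # nothing to generalize
    path = parsed.path.replace(login, "<Login>")
    pattern = f"{parsed.scheme}://{parsed.netloc}{path}"
    patterns_by_institution[institution].add(pattern)
    return pattern


def guess_urls(login, institution):
    """Instantiate every pattern known for an institution with a new login."""
    patterns = patterns_by_institution.get(institution, set())
    return sorted(p.replace("<Login>", login) for p in patterns)


print(learn_pattern("http://sdcc3.ucsd.edu/home-pages/jdoe/", "jdoe", "ucsd.edu"))
# http://sdcc3.ucsd.edu/home-pages/<Login>/
print(guess_urls("asmith", "ucsd.edu"))
# ['http://sdcc3.ucsd.edu/home-pages/asmith/']
```

Guessed URLs of this kind would give a sifter candidates to check even when no web index lists the page, which fits the recall gain reported above.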

Domain Characteristics

Many elements

Easily identifiable target

Some targets found in web indices

User can form specific query

Domain Examples

- Personal Homepages
- Articles or Papers
- Product Reviews
- Price Lists
- Transportation Schedules
- Recipes
- Jokes
- and more

(un)Related Work

Automated Index Generation
- WebCrawler, Lycos, AltaVista, ...

Automated Directory Generation
- IAF, OKRA, WhoWhere?

Dynamic Internet Search
- Netfind

Learning User Preferences on web
- WebWatcher, Syskill & Webert, Firefly

Learning about the web
- ShopBot, auto-generated wrappers

Summary and Conclusions

Dynamic Reference Sifting
- domain-specific, high precision, high recall, fast

Ahoy! the Homepage Finder
- 2000 searches per day
- 1-2 references returned per search
- 50-75% targets found
- 25% not found, often correctly so
- 10-15 seconds per search

Future domains
- Academic Papers, Jokes

Ahoy! the Homepage Finder

http://www.cs.washington.edu/research/ahoy/
