Dynamic Reference Sifting
A Case Study in the Homepage Domain
Jonathan Shakes, Marc Langheinrich, and Oren Etzioni
University of Washington
Department of Computer Science and Engineering
Outline
Introduction
  Softbots and Dynamic Reference Sifters
  Searching the Web
Case Study: Personal Homepages
  Ahoy! The Homepage Finder
  Experimental Results
Future and Related Work
  Other Domains for DRS
Softbots and Dynamic Reference Sifters
Dynamic Reference Sifters
  Part of the “Internet Softbots Project” [Etzioni and Weld, 1994]
Softbots
  The person states what; the softbot determines how and where
Information Retrieval Definitions
Precision
  Measure of search service accuracy
Recall
  Measure of search service comprehensiveness
Precision
Precision = Relevant Search Results / All Search Results
[Diagram: the Search Space divided into Relevant Documents and Irrelevant Documents, with All Search Results marked]
Recall
Recall = Relevant Search Results / All Relevant Documents
[Diagram: the Search Space divided into Relevant Documents and Irrelevant Documents, with All Search Results marked]
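The two ratios defined on the precision and recall slides can be computed directly from sets of document identifiers. The following is a minimal sketch, not code from the talk; the function names and the toy document IDs are assumptions made for illustration.

```python
def precision(results, relevant):
    """Fraction of the returned results that are relevant."""
    results, relevant = set(results), set(relevant)
    if not results:
        return 0.0  # convention chosen here for an empty result list
    return len(results & relevant) / len(results)


def recall(results, relevant):
    """Fraction of all relevant documents that were returned."""
    results, relevant = set(results), set(relevant)
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(results & relevant) / len(relevant)


# Toy data (hypothetical): four relevant documents, three results returned.
relevant_docs = {"a", "b", "c", "d"}
search_results = ["a", "b", "x"]

p = precision(search_results, relevant_docs)  # 2 of 3 results are relevant
r = recall(search_results, relevant_docs)     # 2 of 4 relevant documents found
```

Both functions share the same numerator (relevant search results); only the denominator differs, which is why a service can score high on one measure and low on the other.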
Searching the Web
Web Indices (AltaVista, Hotbot)
  Automated - high recall
  Keyword based - low precision
Web Directories (Yahoo, A2Z)
  Classified manually - high precision, low recall
Manual Search
  Slow
Searching the Web
Dynamic Reference Sifter
  An information retrieval tool that uses:
    multiple, complementary data sources for high recall,
    domain-specific filtering techniques for high precision, and
    machine learning to improve performance over time.
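The first two ingredients of this definition can be sketched as a small pipeline: pool references from several sources to raise recall, then apply domain-specific filters to raise precision. This is an illustrative sketch only, not the Ahoy! implementation; `sift`, the toy sources, the toy filter, and the example URLs are all hypothetical, and the machine-learning component is left out.

```python
from typing import Callable, Iterable, List

# A source maps a query to candidate references (URLs).
Source = Callable[[str], Iterable[str]]
# A filter decides whether a candidate reference fits the domain.
Filter = Callable[[str, str], bool]


def sift(query: str, sources: List[Source], filters: List[Filter]) -> List[str]:
    """Pool candidates from all sources, then keep those passing every filter."""
    seen = set()
    candidates = []
    for source in sources:
        for ref in source(query):
            if ref not in seen:  # drop duplicates, keep first-seen order
                seen.add(ref)
                candidates.append(ref)
    return [ref for ref in candidates if all(f(query, ref) for f in filters)]


# Hypothetical stand-ins for a web index and a web directory.
def toy_index(query: str) -> List[str]:
    return ["http://example.edu/~smith/", "http://example.com/smith-fan-page"]


def toy_directory(query: str) -> List[str]:
    return ["http://example.edu/~smith/", "http://example.org/people/smith.html"]


# Hypothetical domain-specific filter for the homepage domain.
def looks_like_homepage(query: str, ref: str) -> bool:
    return "~" in ref or "/people/" in ref


hits = sift("smith", [toy_index, toy_directory], [looks_like_homepage])
```

Here the union of the two sources yields three distinct candidates, and the filter discards the one that does not look like a personal homepage.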
Case Study: The Personal Homepage Domain
“Conventional” Search Services
  Indices find too much
  Directories find too little
  Manual Search takes too long
  Failures are expensive
Ahoy! The Homepage Finder attempts to provide
  High Recall
  High Precision
  Speed
Ahoy! Architecture
[Diagram labels: User Input; Web Page Reference Source; Institutional Information Source; E-mail Address Sources; Filters; Output]
Performance Analysis
Test using lists of known homepages
  Researchers sample: 582 homepages
  Transportation sample: 53 homepages
Compare against MetaCrawler, Hotbot, AltaVista, Yahoo!
Maximize competitors’ performance by
  using “expert” options
  allowing up to 200 references
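A test of this kind amounts to checking, for each known homepage, whether it appears among the first N references a service returns. The sketch below shows one way to compute that fraction; `targets_found`, the data layout, and the toy runs are assumptions for illustration, and the numbers are not results from the study.

```python
def targets_found(runs, cutoff):
    """Fraction of known targets that appear in the first `cutoff` references.

    `runs` is a list of (target_url, ranked_references) pairs, one per query.
    """
    if not runs:
        return 0.0
    found = sum(1 for target, refs in runs if target in refs[:cutoff])
    return found / len(runs)


# Toy runs (hypothetical): four known homepages and what a service returned.
runs = [
    ("http://a.example/~x/", ["http://a.example/~x/", "http://other.example/"]),
    ("http://b.example/~y/", ["http://other.example/", "http://b.example/~y/"]),
    ("http://c.example/~z/", []),
    ("http://d.example/~w/", ["http://d.example/~w/"]),
]

top1 = targets_found(runs, 1)      # target is the highest-ranked reference
top10 = targets_found(runs, 10)    # target is within the first 10 references
top200 = targets_found(runs, 200)  # target is within the 200-reference limit
```

Raising the cutoff can only keep the fraction the same or increase it, so the value at the 200-reference limit bounds what a service can achieve at the smaller cutoffs.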
Performance Analysis
[Chart: “Precision” - Researcher Sample. Y-axis: Targets Found (0% to 90%). X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: Highest-ranked Reference.]
Performance Analysis
[Chart: Top 10 References - Researcher Sample. Y-axis: Targets Found (0% to 90%). X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: Highest 10 References, Highest-ranked Reference.]
Performance Analysis
[Chart: Recall (all References) - Researcher Sample. Y-axis: Targets Found (0% to 90%). X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: All References, Highest 10 References, Highest-ranked Reference.]
Performance Analysis
[Chart: Recall (all References) - Transportation Sample. Y-axis: Targets Found (0% to 90%). X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Series: All References, Highest 10 References, Highest-ranked Reference.]
Learning in Ahoy!
Learns URL ‘patterns’
  http://sdcc3.ucsd.edu/home-pages/<Login>/
  50,000+ patterns in 3 months
Indexes patterns by institution
  11,000+ institutions indexed in 3 months
Performance Impact
  Up to 8% gain in recall
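One plausible reading of the pattern idea: when a homepage URL is found for a person whose login is known, replace the login with a placeholder, store the result under the institution, and later fill in other logins to guess candidate URLs. The sketch below follows that reading and is not the Ahoy! code; `generalize`, `PatternIndex`, the institution key, and the logins `jdoe` and `asmith` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

PLACEHOLDER = "<Login>"


def generalize(url, login):
    """Replace the login in the URL path with a placeholder.

    Returns None when the login does not occur in the path.
    """
    parsed = urlparse(url)
    if login not in parsed.path:
        return None
    path = parsed.path.replace(login, PLACEHOLDER)
    return f"{parsed.scheme}://{parsed.netloc}{path}"


class PatternIndex:
    """Stores learned URL patterns, keyed by institution (assumed design)."""

    def __init__(self):
        self._by_institution = defaultdict(set)

    def learn(self, institution, url, login):
        pattern = generalize(url, login)
        if pattern is not None:
            self._by_institution[institution].add(pattern)

    def candidates(self, institution, login):
        """Instantiate every pattern known for the institution with a login."""
        patterns = self._by_institution.get(institution, ())
        return sorted(p.replace(PLACEHOLDER, login) for p in patterns)


index = PatternIndex()
# Hypothetical observation: a homepage found for the made-up login "jdoe".
index.learn("ucsd.edu", "http://sdcc3.ucsd.edu/home-pages/jdoe/", "jdoe")
# Guess where another (made-up) user's homepage might live.
guesses = index.candidates("ucsd.edu", "asmith")
```

Guessed URLs of this kind would still need to be fetched and filtered before being returned, since a pattern only suggests where a page might be.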
Domain Characteristics
Many elements
Easily identifiable target
Some targets found in web indices
User can form specific query
Domain Examples
Personal Homepages
Articles or Papers
Product Reviews
Price Lists
Transportation Schedules
Recipes
Jokes
and more
(un)Related Work
Automated Index Generation
  WebCrawler, Lycos, AltaVista, ...
Automated Directory Generation
  IAF, OKRA, WhoWhere?
Dynamic Internet Search
  Netfind
Learning User Preferences on the web
  WebWatcher, Syskill & Webert, Firefly
Learning about the web
  ShopBot, auto-generated wrappers
Summary and Conclusions
Dynamic Reference Sifting
  domain-specific, high precision, high recall, fast
Ahoy! the Homepage Finder
  2000 searches per day
  1-2 references returned per search
  50-75% targets found
  25% not found, often correctly so
  10-15 seconds per search
Future domains
  Academic Papers, Jokes