Distributed IR for Digital Libraries

Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
[email protected]

ECDL 2003, Trondheim, August 20, 2003




Overview

• The problem area
• Distributed searching tasks and issues
• Our approach to resource characterization and search
• Experimental evaluation of the approach
• Application and use of this method in working systems


The Problem

• Prof. Casarosa's definition of the Digital Library vision in yesterday afternoon's plenary session: access for everyone to "all human knowledge"
• Lyman and Varian's estimates of the "Dark Web"
• Hundreds or thousands of servers with databases ranging widely in content, topic, and format
  – Broadcast search is expensive, both in bandwidth and in processing too many irrelevant results
  – How to select the "best" ones to search?
    • Which resource to search first?
    • Which to search next if more is wanted?
  – Topical/domain constraints on the search selections
  – Variable contents of databases (metadata only, full text, multimedia…)


Distributed Search Tasks

• Resource Description
  – How to collect metadata about digital libraries and their collections or databases
• Resource Selection
  – How to select relevant digital library collections or databases from a large number of databases
• Distributed Search
  – How to perform parallel or sequential searching over the selected digital library databases
• Data Fusion
  – How to merge query results from different digital libraries with their different search engines, differing record structures, etc.


An Approach for Distributed Resource Discovery

• Distributed resource representation and discovery
  – A new approach to building resource descriptions based on Z39.50
  – Instead of using broadcast search across resources, we use two Z39.50 services:
    • Identification of database metadata using Z39.50 Explain
    • Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
  – How efficiently can we build distributed indexes?
  – How effectively can we choose databases using the index?
  – How effective is merging search results from multiple sources?
  – Can we build hierarchies of servers (general / meta-topical / individual)?


Z39.50 Overview

[Diagram: user-interface clients map queries and results across the Internet to a remote search engine]


Z39.50 Explain

• Explain supports searches for:
  – Server-level metadata
    • Server name
    • IP addresses
    • Ports
  – Database-level metadata
    • Database name
    • Search attributes (indexes and combinations)
  – Support metadata (record syntaxes, etc.)


Z39.50 SCAN

• Originally intended to support browsing
• Query for:
  – Database
  – Attributes plus term (i.e., index and start point)
  – Step size
  – Number of terms to retrieve
  – Position in response set
• Results:
  – Number of terms returned
  – List of terms and their frequency in the database (for the given attribute combination)
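The harvesting step described later drives SCAN repeatedly to pull an entire index out of a server. As a rough illustration of the request parameters listed above, here is a minimal Python sketch; the `scan` callable is a hypothetical stand-in for whatever Z39.50 client actually issues the request, so no particular library API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

@dataclass
class ScanRequest:
    """Parameters of a Z39.50 SCAN request, as listed above."""
    database: str        # target database
    attributes: dict     # attribute combination identifying the index
    term: str            # start point within the index
    step_size: int = 1
    num_terms: int = 20  # terms to return per response
    position: int = 1    # preferred position of the start term in the response

def walk_index(scan: Callable[[ScanRequest], List[Tuple[str, int]]],
               database: str, attributes: dict,
               start: str = "", page: int = 200) -> Iterator[Tuple[str, int]]:
    """Page through an entire index by issuing repeated SCAN requests,
    each one starting at the last term of the previous response."""
    term, last = start, None
    while True:
        response = scan(ScanRequest(database, attributes, term, num_terms=page))
        if last is not None and response and response[0][0] == last:
            response = response[1:]        # drop the overlap with the previous page
        if not response:
            return
        yield from response                # (term, frequency) pairs
        last = response[-1][0]
        term = last
```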


Z39.50 SCAN Results

% zscan title cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}
{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}{catalonia 8}{catalyt 2}{catania 1}{cataract 1}{catch 173}{catch-all 3}{catch-up 2} …

% zscan topic cat 1 20 1
{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}
{cat 706}{cat-and-mouse 19}{cat-burglar 1}{cat-carrying 1}{cat-egory 1}{cat-fight 1}{cat-gut 1}{cat-litter 1}{cat-lovers 2}{cat-pee 1}{cat-run 1}{cat-scanners 1} …

Syntax: zscan indexname term stepsize number_of_terms pref_pos
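To make the response format concrete, here is a small Python sketch (not part of the Cheshire distribution) that turns output like the examples above into a term-frequency map; it assumes single-token terms as in the examples.

```python
import re

def parse_zscan(output: str) -> dict:
    """Parse a zscan response like the examples above into a
    term -> frequency dictionary."""
    # Drop the leading {SCAN {Status 0}{Terms ...}{StepSize ...}{Position ...}} block
    body = re.sub(r"\{SCAN.*?\}\}", "", output)
    return {term: int(freq)
            for term, freq in re.findall(r"\{([^{}\s]+)\s+(\d+)\}", body)}

sample = ("{SCAN {Status 0}{Terms 20}{StepSize 1}{Position 1}}"
          "{cat 27}{cat-fight 1}{catalan 19}{catalogu 37}")
print(parse_zscan(sample))  # {'cat': 27, 'cat-fight': 1, 'catalan': 19, 'catalogu': 37}
```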


Resource Index Creation

• For all servers, or a topical subset…
  – Get Explain information
  – For each index:
    • Use SCAN to extract terms and frequencies
    • Add term + frequency + source index + database metadata to the XML "Collection Document" for the resource (a sketch of this assembly follows below)
  – Planned extensions:
    • Post-process indexes (especially Geo Names, etc.) for special types of data
      – e.g., create "geographical coverage" indexes
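The list above maps directly onto a small harvesting loop. The sketch below, which reuses the `walk_index` helper from the SCAN sketch earlier, shows one way the XML "Collection Document" could be assembled; the element names and the `explain`/`scan` callables are illustrative assumptions, not the actual Cheshire II schema or API.

```python
import xml.etree.ElementTree as ET

def build_collection_document(server, databases, explain, scan):
    """Assemble an XML 'collection document' for one resource.
    `explain` returns database-level metadata (including the list of
    searchable indexes and their attribute combinations); `scan` issues
    a single SCAN request; `walk_index` pages through a whole index."""
    root = ET.Element("collectionDocument", server=server)
    for db in databases:
        meta = explain(server, db)                       # database-level Explain metadata
        db_el = ET.SubElement(root, "database", name=db)
        for index in meta["indexes"]:                    # e.g. title, topic, author ...
            idx_el = ET.SubElement(db_el, "index", name=index["name"])
            for term, freq in walk_index(scan, db, index["attributes"]):
                ET.SubElement(idx_el, "term", freq=str(freq)).text = term
    return ET.tostring(root, encoding="unicode")
```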


MetaSearch Approach

[Diagram: a MetaSearch server holds a Distributed Index built by mapping Explain and SCAN queries across the Internet to the individual search engines and their databases (DB 1 through DB 6); user queries and results are then mapped between the MetaSearch server and those search engines]


Known Issues and Problems

• Not all Z39.50 servers support SCAN or Explain
• Solutions that appear to work well:
  – Probing for attributes instead of Explain (e.g., DC attributes or analogs)
  – We also support OAI and can extract OAI metadata for servers that support it
  – Query-based sampling (Callan), sketched below
• Collection Documents are static and need to be replaced when the associated collection changes
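For servers that only support searching, query-based sampling builds an approximate collection representative from ordinary queries. The sketch below is a generic rendering of the idea attributed to Callan; the `search` callable, seed terms, and parameter values are illustrative assumptions rather than the published procedure's exact settings.

```python
import random

def query_based_sample(search, seed_terms, num_queries=75, docs_per_query=4):
    """Sample a search-only server: issue single-term queries, keep the
    top documents, and draw the next query term from the documents
    sampled so far (in the spirit of Callan's query-based sampling)."""
    sampled, vocabulary = [], list(seed_terms)
    for _ in range(num_queries):
        term = random.choice(vocabulary)
        for doc in search(term)[:docs_per_query]:     # top-ranked document texts
            if doc not in sampled:
                sampled.append(doc)
                vocabulary.extend(set(doc.split()))   # grow the term pool from the sample
    return sampled
```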


Evaluation

• Test environment
  – TREC Tipster data (approx. 3 GB)
  – Partitioned into 236 smaller collections based on source and date by month (no DOE)
    • High size variability (from 1 to thousands of records)
    • Same data as used in other distributed search studies by J. French and J. Callan, among others
  – Used TREC topics 51-150 for evaluation (these are the only topics with relevance judgements for all 3 TIPSTER disks)


Test Database Characteristics

TREC Disk   Source         Size (MB)   Size (docs)
1           WSJ (86-89)    270         98,732
1           AP (89)        259         84,678
1           ZIFF           245         75,180
1           FR (89)        262         25,960
2           WSJ (90-92)    247         74,520
2           AP (88)        241         79,919
2           ZIFF           178         56,920
2           FR (88)        211         19,860
3           AP (90)        242         78,321
3           SJMN (91)      290         90,257
3           PAT            245         6,711
Totals                     2,690       691,058


Test Database Characteristics

TREC Disk   Source         Num DB        Total per disk
1           WSJ (86-89)    29            Disk 1: 67
1           AP (89)        12
1           ZIFF           14
1           FR (89)        12
2           WSJ (90-92)    22            Disk 2: 54
2           AP (88)        11
2           ZIFF           11 (1 dup)
2           FR (88)        10
3           AP (90)        12            Disk 3: 116
3           SJMN (91)      12
3           PAT            92
Totals                     237 - 1 duplicate = 236


Harvesting Efficiency

• Tested using the databases on the previous slide plus the full FT database (210,158 records, ~600 MB)
• Average of 23.07 seconds per database to SCAN each database (3.4 indexes on average) and create a collection representative, over the network
• Average of 14.07 seconds
• Also tested larger databases: e.g., the TREC FT database (~600 MB, 7 indexes) was harvested in 131 seconds


Our Collection Ranking Approach

• We attempt to estimate the probability of relevance for a given collection with respect to a query, using the logistic regression method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
• Estimates from multiple extracted indexes are combined to provide an overall ranking score for a given resource (i.e., fusion of multiple query results); a simple illustration follows below
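The exact combination rule is described in the proceedings; purely as an illustration of the fusion step (an assumption, not the paper's actual rule), a weighted sum of the per-index estimates would look like this:

```python
def resource_score(index_estimates, weights=None):
    """Fuse per-index relevance estimates (e.g. from the title and topic
    indexes of one collection) into a single ranking score.  The weighted
    sum here only illustrates the fusion step, not the published method."""
    weights = weights or {name: 1.0 for name in index_estimates}
    return sum(weights[name] * p for name, p in index_estimates.items())

# e.g. resource_score({"title": 0.42, "topic": 0.65})
```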


Probabilistic Retrieval: Logistic Regression

The probability of relevance for a given index is estimated by logistic regression, with the coefficients determined from a sample set of documents (TREC). At retrieval time the probability estimate is obtained by:

$P(R \mid Q, C) = c_0 + \sum_{i=1}^{6} c_i X_i$


Statistics Used for Regression Variables

• Average Absolute Query Frequency
• Query Length
• Average Absolute Collection Frequency
• Collection size estimate
• Average Inverse Collection Frequency
• Number of terms in common between query and collection representative

(Details in the proceedings)
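Taken together with the coefficients fitted on TREC data, these statistics give a collection score of the form c0 + sum of ci*Xi. The sketch below assumes the variable definitions as reconstructed on the appendix slide (averages of logs for the frequency statistics, square roots for the two length statistics); the coefficient values themselves are given in the proceedings and are not reproduced here.

```python
from math import log, sqrt

def collection_score(coeffs, qaf, caf, icf, query_len, coll_len):
    """Score one collection representative against a query.
    coeffs    : [c0, c1, ..., c6] fitted by logistic regression on TREC data
    qaf, caf  : absolute query / collection frequencies of the matching terms
    icf       : inverse collection frequencies of the matching terms
    query_len : query length; coll_len : collection size estimate"""
    M = len(qaf)                           # terms in common between query and collection
    if M == 0:
        return coeffs[0]
    x = [
        sum(log(f) for f in qaf) / M,      # X1: average absolute query frequency
        sqrt(query_len),                   # X2: query length
        sum(log(f) for f in caf) / M,      # X3: average absolute collection frequency
        sqrt(coll_len),                    # X4: collection size estimate
        sum(log(f) for f in icf) / M,      # X5: average inverse collection frequency
        log(M),                            # X6: number of matching terms
    ]
    return coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))
```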


Other Approaches

• GlOSS: developed by the Digital Library project at Stanford University; uses fairly conventional TF-IDF ranking
• CORI: developed by J. Callan and students at CIIR; uses a ranking that exploits some of the features of the INQUERY system in merging evidence


Evaluation

• Effectiveness
  – Tested using the collection representatives described above (as harvested over the network) and the TIPSTER relevance judgements
  – Tested by comparing our approach to known algorithms for ranking collections
  – Results were measured against reported results for the Ideal and CORI algorithms and against the optimal "Relevance Based Ranking" (MAX)
  – Recall analog (how many of the relevant documents occurred in the top n databases, averaged)


Titles only (short query)

[Results chart]


Long Queries

[Results chart]


Very Long Queries

[Results chart]


Current Usage

• Mersey Libraries
• Distributed Archives Hub
• Related approaches
  – JISC Resource Discovery Network
    • (OAI-PMH harvesting with Cheshire search)
  – Planned use with TEL by the BL


Future

• Logically clustering servers by topic
• Meta-meta servers (treating the MetaSearch database as just another database)


Distributed Metadata Servers

[Diagram: a hierarchy of replicated general servers, meta-topical servers, and individual database servers]


Conclusion

• A practical method for metadata harvesting and an effective algorithm for distributed resource discovery
• Further research
  – Continuing development of the Cheshire III system
  – Applicability of language modelling methods to resource discovery
  – Developing and evaluating methods for merging cross-domain results, such as text and image or text and GIS datasets (or, perhaps, when to keep them separate)


Further Information

• Full Cheshire II client and server source is available at ftp://cheshire.berkeley.edu/pub/cheshire/
  – Includes HTML documentation
• Project web site: http://cheshire.berkeley.edu/


Probabilistic Retrieval: Logistic Regression Attributes

$X_1 = \frac{1}{M}\sum_{j=1}^{M} \log QAF_{t_j}$  (Average Absolute Query Frequency)

$X_2 = \sqrt{QL}$  (Query Length)

$X_3 = \frac{1}{M}\sum_{j=1}^{M} \log CAF_{t_j}$  (Average Absolute Collection Frequency)

$X_4 = \sqrt{CL}$  (Collection size estimate)

$X_5 = \frac{1}{M}\sum_{j=1}^{M} \log ICF_{t_j}$  (Average Inverse Collection Frequency)

$X_6 = \log M$

where $ICF_t = \frac{N - n_t}{n_t}$ is the inverse document frequency (N = number of collections, n_t = number of collections containing term t) and M is the number of terms in common between the query and the collection representative.


CORI Ranking

$p(r_k \mid db_i) = 0.4 + 0.6 \cdot T \cdot I$

where

$T = \frac{df}{df + 50 + 150 \cdot cw/\overline{cw}}$

$I = \frac{\log\left(\frac{|DB| + 0.5}{cf}\right)}{\log\left(|DB| + 1.0\right)}$

and

df is the number of documents containing r_k
cf is the number of databases containing r_k
|DB| is the number of databases being ranked
cw is the number of words in db_i
$\overline{cw}$ is the average cw of the databases being ranked
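A direct transcription of the formula above into Python, for a single query term against one database; combining the per-term beliefs over a whole query is left to the CORI/INQUERY merging step and is not shown here.

```python
from math import log

def cori_belief(df, cf, cw, avg_cw, num_dbs):
    """CORI belief p(r_k | db_i) for one query term r_k.
    df      : number of documents in db_i containing r_k
    cf      : number of databases containing r_k
    cw      : number of words in db_i
    avg_cw  : average cw over the databases being ranked
    num_dbs : |DB|, the number of databases being ranked"""
    T = df / (df + 50.0 + 150.0 * cw / avg_cw)
    I = log((num_dbs + 0.5) / cf) / log(num_dbs + 1.0)
    return 0.4 + 0.6 * T * I
```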


Measures for Evaluation

• Assume each database has some merit for a given query, q
• Given a baseline ranking B and an estimated (test) ranking E for q
• Let db_bi and db_ei denote the database in the i-th ranked position of rankings B and E
• Let B_i = merit(q, db_bi) and E_i = merit(q, db_ei)
• We can define some measures:


Measures for Evaluation – Recall Analogs

$R_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n} B_i}$

$\hat{R}_n = \frac{\sum_{i=1}^{n} E_i}{\sum_{i=1}^{n^*} B_i}$, where $n^* = \max\{k \mid B_k \neq 0\}$


Measures for Evaluation – Precision Analog

$P_n = \frac{|\{db \in Top_n(E) \mid merit(q, db) > 0\}|}{|Top_n(E)|}$

i.e., the fraction of the top n databases in the estimated ranking that have non-zero merit.
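The recall and precision analogs above reduce to a few lines of code once the per-database merits are known. A minimal sketch, assuming B and E are the merit values listed in baseline and estimated ranking order for one query:

```python
def selection_measures(B, E, n):
    """Recall analogs R_n and R^_n and the precision analog P_n from the
    slides above.  B[i] and E[i] are merit(q, db) for the database at
    rank i+1 of the baseline and estimated rankings respectively."""
    n_star = max((k + 1 for k, b in enumerate(B) if b != 0), default=0)
    top_e = E[:n]
    r_n = sum(top_e) / sum(B[:n]) if sum(B[:n]) else 0.0
    r_hat_n = sum(top_e) / sum(B[:n_star]) if n_star else 0.0
    p_n = sum(1 for e in top_e if e > 0) / len(top_e) if top_e else 0.0
    return r_n, r_hat_n, p_n
```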