
From federated to aggregated search

Fernando Diaz, Mounia Lalmas and Milad Shokouhi


Outline
• Introduction and Terminology
• Architecture
• Resource Representation
• Resource Selection
• Result Presentation
• Evaluation
• Open Problems
• Bibliography


Introduction

• What is federated search?
• What is aggregated search?
• Motivations
• Challenges
• Relationships

A classical example of federated search

www.theeuropeanlibrary.org

[Figure: one query is sent to the collections to be searched, and a merged list of results is returned.]

Motivation for federated search

• Search a number of independent collections, with a focus on hidden web collections
  - Collections not easily crawlable (and often should not be crawled)
• Access to up-to-date information and data
• Parallel search over several collections
• Effective tool for enterprise and digital library environments

Challenges for federated search

• How to represent collections, so that we know what documents each contains?
• How to select the collection(s) to be searched for relevant documents?
• How to merge results retrieved from several collections, to return one list of results to the users?

• Cooperative environment
• Uncooperative environment

From federated search to aggregated search

• “Federated search on the web”
• A peer-to-peer network connects distributed peers (usually for file sharing), where each peer can be both server and client
• A metasearch engine combines the results of different search engines into a single result list
• Vertical search, also known as aggregated search, adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results

A classical example of aggregated search

[Figure: a result page combining News, Homepage, Wikipedia, Real-time results, Video, Twitter and Structured Data.]

Motivation for aggregated search

• Increasingly different types of information are available, sought and relevant
  - e.g. news, image, wiki, video, audio, blog, map, tweet
• Search engines allow accessing these through so-called verticals
• Two “ways” to search
  - Users can directly search the verticals
  - Or rely on so-called aggregated search

Google universal search 2007: [ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results.
http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html

Motivation for aggregated search

[Figure: 25K editorially classified queries (Arguello et al, 09)]

Challenges in aggregated search

• Extremely heterogeneous collections
• What is/are the vertical intent(s)? This involves:
  - Handling ambiguous (query | vertical) intent
  - Handling non-stationary intent (e.g. news, local)
• How many results from each to return and where to position them in the result page?
  - Slotting results
  - Users looking at the first result page
• Page optimization and its evaluation

Ambiguous non-stationary intent

Query:    Travel, Mollusk, Paul
Vertical: Wikipedia, News, Image

Recap – Introduction

                           federated search   aggregated search
heterogeneity              low                high
scale (documents, users)   small              large
user feedback              little             a lot

Terminology

1.  federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network

2.  resource, vertical, database, collection, source, server, domain, genre

3.  merging, blending, fusion, aggregation, slotted, tiled


Problem definition

Present the “querier” with a summary of search results from one or more resources.

General architecture

[Figure: the user sends a raw query to the search interface/portal/broker, which sends queries to the sources/servers/verticals.]

Peer-to-peer network

[Figure: peers and a directory server.]

Peer-to-peer (P2P) networks
• Broker-based
  - Single centralized broker with document lists shared by peers (e.g. Napster, original version)
• Decentralized
  - Each peer acts as both client and server (e.g. Gnutella v0.4)
• Structure-based
  - Use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
• Hierarchical
  - Use local directory services for routing and merging (e.g. Swapper.NET)

Federated search

[Figure: a broker holds summaries (Sum A to Sum E) of Collections A to E, sends the query to the collections, and returns merged results.]

Federated search

• Also known as a distributed information retrieval (DIR) system
• Provides one portal for searching information from multiple sources
  - corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
• Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/)

http://funnelback.com/pdfs/brochures/enterprise.pdf

Metasearch

[Figure: the user's raw query goes to a metasearch engine, which sends queries to several search engines on the WWW.]

Metasearch

• A search engine that queries several different search engines and either combines their results (blended) or displays them separately (non-blended)
• Does not crawl the web but relies on data gathered by other search engines
• Dogpile, Metacrawler, Search.com, etc (see http://www.cryer.co.uk/resources/searchengines/meta.htm)

Aggregated search

[Figure: the user's query (example: Angelina Jolie) is sent to several indexes, including the WWW (text) index, and the results are returned together.]

Aggregated search

• Specific to a web search engine
• “Increasingly” more than one type of information is relevant to an information need
  - mostly web page + image, map, blog, etc
• These types of information are indexed and ranked using dedicated approaches (verticals)
• Presenting the results from verticals in an aggregated way is believed to be more useful
• All major search engines do some level of aggregated search

Data fusion

[Figure: one query is run over one document collection (GOV2) with different document representations (anchor only, title only) and different retrieval models (BM25, KL, Inquery); merging produces one ranked list of results (e.g. Voorhees et al, 95).]

Data fusion

• Search one collection
• Document can be indexed in different ways
  - Title index, abstract index, etc (poly-representation)
  - Weighting scheme
• Different retrieval models
• Rankings generated by different retrieval models (or different document representations) are merged to produce the final ranking
• Has often been shown to improve retrieval performance (TREC)

Terminology - Resource
• Source
• Server
• Database
• Collection (federated search)
• Vertical (aggregated search)
• Domain
• Genre

Terminology - Aggregation
• Merging
• Blending
• Fusion
• Slotted
• Tiled

Aggregated search (tiled)

[Screenshots: http://au.alpha.yahoo.com/ and Naver.com]

Aggregated search (slotted)

Others

• Clustering
• Faceted search
• Multi-document summarization
• Document generation
• Entity search

(see the special issue, in press, on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman et al, 10))

Examples
• Clustering: Yippy, a clustering search engine from Vivisimo (clusty.com)
• Faceted search
• Multi-document summarization: http://newsblaster.cs.columbia.edu/
• “Fictitious” document generation (Paris et al, 10)
• Entity search: http://sandbox.yahoo.com/Correlator

Recap

• Showed the relations between federated search, aggregated search, and other approaches
• Exposed the various terminologies used
• In the rest of the tutorial, we concentrate on federated search and aggregated search
• Focus is on “effective search”

Architecture: what are the general components of federated and aggregated search systems.

Federated search architecture

Aggregated search architecture

• Pre-retrieval aggregation: decide verticals before seeing results
• Post-retrieval aggregation: decide verticals after seeing results
• Pre-web aggregation: decide verticals before seeing web results
• Post-web aggregation: decide verticals after seeing web results

[Figure: post-retrieval, pre-web architecture]

[Figure: pre and post-retrieval, pre-web architecture]

Resource representation: how to represent resources, so that we know what documents each contains.

Resource representation in federated search (also known as resource summary/description)

Resource representation

• Cooperative environments
  - Comprehensive term statistics
  - Collection size information
• Uncooperative environments
  - Query-based sampling
  - Collection size estimation

Resource representation (cooperative environments)

• STARTS Protocol (Gravano et al, 97)
  - Source metadata
  - Rich query language
• Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)
• Anchor-text
  - HARP (Hawking and Thomas, 05)

Resource representation (uncooperative environments)

• Query-based sampling (Callan and Connell, 01)
  - Select a query, probe collection
  - Download the top n documents
  - Select the next query, repeat

[Figure: a query selector sends a query to the collection; the sampled documents feed back into the query selector.]
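The sampling loop above can be sketched in a few lines. This is a minimal illustration, not the procedure of Callan and Connell as published: the `search` callable, the uniform random choice of the next query term from the sampled vocabulary, and the stopping thresholds are all assumptions made for the example.

```python
import random


def query_based_sampling(search, seed_term, n_docs=4, max_queries=100,
                         target_docs=300, rng=None):
    """Sample documents from an uncooperative collection.

    `search(term, n)` is a hypothetical interface returning up to n
    (doc_id, text) pairs for a one-term query.
    """
    rng = rng or random.Random(0)
    sample = {}                 # doc_id -> text
    vocabulary = [seed_term]    # candidate query terms learned so far
    known = {seed_term}
    tried = set()
    for _ in range(max_queries):
        candidates = [t for t in vocabulary if t not in tried]
        if not candidates or len(sample) >= target_docs:
            break
        term = rng.choice(candidates)        # select a query
        tried.add(term)
        for doc_id, text in search(term, n_docs):   # download the top n documents
            if doc_id in sample:
                continue
            sample[doc_id] = text
            for word in text.lower().split():       # grow the query vocabulary
                if word not in known:
                    known.add(word)
                    vocabulary.append(word)
    return sample
```

The returned sample plays the role of the resource summary; term statistics for selection are then computed over it.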

• Query selector (Callan and Connell, 01)
  - Other resource description (ord)
  - Learned resource description (lrd)
    • Average tf, random, df, ctf
• Query logs (Craswell, 00; Shokouhi et al, 07d)
• Focused probing (Ipeirotis and Gravano, 02)
• Adaptive sampling
  - (Shokouhi et al, 06a): rate of visiting new vocabulary
  - (Baillie et al, 06a): rate of sample quality improvement (reference query log)
  - (Caverlee et al, 06): proportional document ratio (PD), proportional vocabulary ratio (PV), vocabulary growth (VG)


• Improving incomplete samples
  - Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04): topically related collections should share similar terms
  - Q-pilot (Sugiura and Etzioni, 00): sampled documents + backlinks + front page

Resource representation (collection size estimation)

• Capture-recapture (Liu et al, 01)

[Figure: Sample A (capture) and Sample B (recapture). Image source: http://www.dorlingkindersley-uk.co.uk/static/cs/uk/11/clipart/nature/image_nature040.html]
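As a rough illustration of the capture-recapture idea, the classic two-sample (Lincoln-Petersen) estimator divides the product of the two sample sizes by the size of their overlap. The sketch below assumes two independent random samples of document identifiers; it is not necessarily the exact estimator used in the cited work.

```python
def capture_recapture_size(sample_a, sample_b):
    """Estimate collection size from two independent document samples.

    Uses the two-sample Lincoln-Petersen estimate |A| * |B| / |A intersect B|.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; draw larger samples")
    return len(a) * len(b) / overlap


# Two samples of 100 documents sharing 10 documents suggest about 1000 documents.
estimate = capture_recapture_size(range(0, 100), range(90, 190))
```

In practice samples obtained through a search interface are biased towards highly ranked documents, which is one reason the samplers listed on the following slides were proposed.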

Resource representation (collection size estimation)

• Multiple queries sampler (Thomas and Hawking, 07)
• Random-walk sampler, and pool-based sampler (Bar-Yossef and Gurevich, 06)
• Collection overlap estimation (Shokouhi and Zobel, 07)

Resource representation (updating summaries)
(Ipeirotis et al, 05; Shokouhi et al, 07a)

Resource representation in aggregated search

• Vertical content
  - samples or access to vertical API
  - represents content supply
• Vertical query logs
  - samples or access to historic vertical searches
  - represents content demand

Vertical content includes text (example: NEWS)

Vertical content includes structure (example: SPORTS)

Vertical content includes images (example: IMAGES)

Issues with vertical content

• Dynamics
  - some verticals become stale quickly
• Heterogeneous content
  - heterogeneous ranking algorithms
• Non-free text APIs
  - affects query-based sampling

Addressing content dynamics

• sample the most recently indexed documents (Diaz, 09)
• assumes users are more likely to be interested in recent content
• in practice, only a fraction of the corpus is needed to perform well
(Konig et al, 09)

Addressing heterogeneous content

1. use text available with documents (e.g. captions)
2. manually map to surrogates (e.g. Wikipedia pages)

[Figure: performance of two different methods of dealing with heterogeneous content]

Vertical query logs

• Queries issued directly to a vertical represent explicit vertical intent
• This is similar to having a large body of labeled queries

Issues with vertical query logs

• Dynamics
  - some verticals require temporally-sensitive sampling
  - for example, we do not want to sample news query logs for a whole year
• Non-free text APIs
  - affects query modeling

Hybrid approaches

• Should only sample documents likely to be useful for vertical selection/merging
  - e.g. a document which is never requested is not useful for representing a vertical
• Suggests log-biased sampling
(Shokouhi et al, 06; Arguello et al, 09)

Recap – Resource representation

                              federated search               aggregated search
Representation completeness   low                            low-high
Representation generation     sampling/shared dictionaries   sampling, API
Freshness                     important                      critical


Resource selection: how to select the resource(s) to be searched for relevant documents.

Resource selection for federated search

[Figure: the broker holds summaries (Sum A to Sum E) of Collections A to E and forwards the query to the selected collections.]

Resource selection (lexicon-based methods)

• “Big-document” bag-of-words summaries
  - CORI (Callan et al, 95)
  - GlOSS (Gravano et al, 94b)
  - CVV (Yuwono and Lee, 97)

[Figure: Collections A, B and C are sampled and the samples are held by the broker.]

Resource selection (lexicon-based methods)

• CORI
• GlOSS

Resource selection (document-surrogate methods)

• Sample documents with retained boundaries
  - ReDDE (Si and Callan, 03a)
  - CRCS (Shokouhi, 07a)
  - SUSHI (Thomas and Shokouhi, 09)

[Figure: Collections A, B and C are sampled and the samples are held by the broker.]

• ReDDE
  - ReDDE assumes that the top-ranked sampled documents are relevant.
  - ReDDE estimates the size of collections by sample-resample.
  - Assuming that all collections have the same size, the example gives: yellow > blue > red
• CRCS is inspired by ReDDE but assigns different probabilities of relevance based on document position: red > yellow, blue

[Figure: the broker ranks the sampled documents for the query; colours identify the source collection.]
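The ReDDE idea described above can be sketched as follows. This is a simplified reading, with assumed inputs: `ranked_sample` lists the source collection of each sampled document in the broker's ranking for the query, and `ratio` (the fraction of the imagined centralized ranking treated as relevant) is an illustrative parameter rather than a recommended setting.

```python
def redde_scores(ranked_sample, est_size, sample_size, ratio=0.003):
    """Score collections from a ranking over the broker's sampled documents.

    Each sampled document from collection c stands for
    est_size[c] / sample_size[c] documents of the full collection.
    """
    total = sum(est_size.values())
    threshold = ratio * total    # top documents assumed relevant
    scores = {c: 0.0 for c in est_size}
    covered = 0.0                # estimated centralized rank reached so far
    for c in ranked_sample:
        if covered >= threshold:
            break
        weight = est_size[c] / sample_size[c]
        scores[c] += weight      # this sample counts as `weight` relevant documents
        covered += weight
    return scores


sizes = {"yellow": 400, "blue": 400, "red": 400}
samples = {"yellow": 4, "blue": 4, "red": 4}
ranking = ["yellow", "blue", "yellow", "red", "blue", "red"]
scores = redde_scores(ranking, sizes, samples, ratio=0.25)
```

With equal collection sizes the score reduces to counting how many top-ranked sampled documents each collection contributes; unequal sizes shift the weight towards larger collections.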

• SUSHI (image source: http://www.monthly.se/nucleus/index.php?itemid=1464)
  - Different regression functions for each collection and query
  - Scores are comparable (estimated over the same index)


Resource selection (supervised methods)

• Utility maximization techniques
  - Model the search effectiveness
  - DTF (Nottelmann and Fuhr, 03), UUM (Si and Callan, 04a), RUM (Si and Callan, 05b)
• Classification-based methods
  - Classify collections/queries for better selection
  - Classification-aware server selection (Ipeirotis and Gravano, 08), classification-based resource selection (Arguello et al, 09a), learning from past queries (Cetintas et al, 09)

Resource selection in aggregated search

• Content-based predictors
  - derived from (sampled) vertical content
• Query string-based predictors
  - derived from query text, independent of any resource associated with a vertical
• Query log-based predictors
  - derived from previous requests issued by users to the vertical portal

Content-based predictors

• Distributed information retrieval (DIR) predictors
• Simple result set predictors
  - numresults, score distributions, etc (Diaz, 09; Konig et al, 09)
• Complex result set predictors
  - Clarity (Cronen-Townsend et al, 02)
  - Autocorrelation (Diaz, 07)
  - Many, many more (Hauff, 10)

Issues with content-based predictors

• DIR (usually) assumes homogeneous content types
• performance predictors (usually) assume text corpora
• assumes ranking function consistency
  - between verticals
  - between vertical selector machine and vertical ranker machine
• verticals have different dynamics (e.g. news vs. image)
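Of the complex result set predictors listed above, Clarity is commonly described as the KL divergence between a query language model and the collection language model. The sketch below is a simplification under stated assumptions: the query model is just the average of the unigram distributions of the top-ranked documents (no query-likelihood weighting, no smoothing), the top documents are assumed to belong to the collection, and all documents are non-empty token lists.

```python
import math
from collections import Counter


def clarity(top_docs, collection_docs):
    """Simplified Clarity: KL divergence (in bits) between a query model
    estimated from the top-ranked documents and the collection model."""
    coll = Counter(t for doc in collection_docs for t in doc)
    coll_total = sum(coll.values())
    query_model = Counter()
    for doc in top_docs:
        for term, count in Counter(doc).items():
            # Average of per-document unigram distributions (an assumption).
            query_model[term] += count / len(doc) / len(top_docs)
    return sum(p * math.log2(p / (coll[t] / coll_total))
               for t, p in query_model.items())


collection = [["jaguar", "car"], ["jaguar", "cat"], ["car", "engine"], ["cat", "zoo"]]
focused = clarity(collection[:1], collection)   # results about one topic
broad = clarity(collection, collection)         # results spread over everything
```

A higher value suggests a result set that is topically distinct from the vertical as a whole, which is the signal such predictors feed into vertical selection.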

String-based predictors

• Dictionary lookups
  - terms correlated with a vertical (e.g., movie titles)
• Regular expressions
  - patterns correlated with explicit vertical requests (e.g., obama news)
• Named entities
  - automatically-detected entity types (e.g., geographic entities)
• Issues
  - curating lists and expressions (manual or automatic)
  - terms included in dictionary manually vetted for relevance
  - high precision/low recall
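A small sketch of what string-based predictors might look like in code. The dictionary contents, the regular expressions and the feature names are all hypothetical, and the five-digit pattern is only a crude stand-in for a real geographic entity detector.

```python
import re

# Hypothetical, hand-curated resources for illustration only.
MOVIE_TITLES = {"inception", "toy story 3"}
NEWS_PATTERN = re.compile(r"\b(news|headlines)\b", re.IGNORECASE)
ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def string_features(query):
    """Binary query-string features that depend only on the query text."""
    q = query.lower().strip()
    return {
        "movie_dictionary": any(title in q for title in MOVIE_TITLES),
        "news_regex": bool(NEWS_PATTERN.search(q)),
        "geo_entity": bool(ZIP_PATTERN.search(q)),
    }


features = string_features("obama news")
```

Because the lists are curated, a feature that fires is usually reliable, but most queries trigger none of them, which matches the high precision/low recall issue noted above.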

Log-based predictors

• Classification approaches (Beitzel et al, 07; Li et al, 08)
• Language model approaches (Arguello et al, 09)
• Issues
  - verticals with structured queries (e.g. local)
  - query logs with dynamics (e.g. news) (Diaz, 09)

Comparing predictor performance

Predictor cost

• Pre-retrieval predictors
  - computed without sending the query to the vertical
  - no network cost
• Post-retrieval predictors
  - computed on the results from the vertical
  - requires vertical support of web scale query traffic
  - incurs network latency
  - can be mitigated with vertical content caches

Combining predictors

• Use predictors as features for a machine-learned model
• Training data
  1. editorial data
  2. behavioral data (e.g. clicks)
  3. other vertical data
(Diaz, 09; Arguello et al, 09; Konig et al, 09)

Editorial data

• Data: <query,vertical,{+,-}>
• Features: predictors based on f(query,vertical)
• Models:
  - log-linear (Arguello et al, 09)
  - boosted decision trees (Arguello et al, 10)

[Figure: combining predictors (Arguello et al, 09)]
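To make the log-linear option concrete, here is a minimal logistic regression trained by stochastic gradient ascent on <query,vertical,{+,-}> style examples. The feature vectors, learning rate and number of epochs are illustrative assumptions; this is not the training setup of the cited papers.

```python
import math


def predict(weights, bias, x):
    """Probability that the vertical is relevant, given predictor values x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_log_linear(examples, epochs=200, lr=0.5):
    """examples: list of (feature_vector, label) with label 1 (+) or 0 (-)."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(weights, bias, x)   # gradient of the log-likelihood
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


# Hypothetical features: [string predictor fired, content-based predictor value]
training = [([1.0, 0.9], 1), ([1.0, 0.2], 1), ([0.0, 0.8], 0), ([0.0, 0.1], 0)]
weights, bias = train_log_linear(training)
```

The same feature vectors could be passed to a boosted decision tree learner instead; only the model family changes, not the data layout.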

Click data

• Data: <query,vertical,{click,skip}>, <query,vertical,click through rate>
• Features: predictors based on f(query,vertical)
• Models:
  - log-linear (Diaz, 09)
  - boosted decision trees (Konig et al, 09)

Gathering click data

• Exploration bucket:
  - show suboptimal presentations in order to gather positive (and negative) click/skip data
• Cold start problem:
  - without a basic model, the best exploration is random
  - random exploration results in poor user experience

Gathering click data

• Solutions
  - reduce impact to small fraction of traffic/users
  - train a basic high-precision non-click model (perhaps with editorial data)
• Other issues
  - Presentation bias: different verticals have different click-through rates a priori
  - Position bias: different presentation positions have different click-through rates a priori

Click precision and recall (Konig et al, 09)

[Figure: ability to predict queries using thresholded click-through rate to infer relevance]

Non-target data

[Figure: source verticals have training data; the target vertical has no data.]

• Data: <query,source vertical,{+,-}>
• Features: predictors based on f(query,target vertical)
• Models:
  - generic model + adaptation
(Arguello et al, 10)

Generic model (Arguello et al, 10)

• Objective
  - train a single model that performs well for all source verticals
• Assumption
  - if it performs well across all source verticals, it will perform well on the target vertical

Adapted model (Arguello et al, 10)

• Objective
  - learn non-generic relationship between features and the target vertical
• Assumption
  - can bootstrap from labels generated by the generic model

Non-target query classification (Arguello et al, 10)

[Figure: average precision on target query classification; red (blue) indicates statistically significant improvements (degradations) compared to the single predictor]

Training set characteristics

• What is the cost of generating training data?
  - how much money?
  - how much time?
  - how many negative impressions as a result of exploration?
• Are targets normalized?
  - can we compare classifier output?

Training set cost summary

Online adaptation

• Production vertical selection systems receive a variety of feedback signals
  - clicks, skips
  - reformulations
• A machine-learned system can adjust predictions based on real-time user feedback
  - very important for dynamic verticals
(Diaz, 09; Diaz and Arguello, 09)

Online adaptation

• Passive feedback: adjust prediction/parameters in response to feedback
  - allows recovery from false positives
  - difficult to recover from false negatives
• Active feedback/explore-exploit: opportunistically present suboptimal verticals for feedback
  - allows recovery from both errors
  - incurs exploration cost
• Issues
  - setting learning rate for dynamic intent verticals
  - normalizing feedback signal across verticals
  - resolving feedback and training signal (click≠relevance)
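The difference between passive and active feedback can be illustrated with a small sketch. The class, its parameters and the exponential update rule are assumptions made for this example: with `epsilon=0` the selector is purely passive, and with `epsilon>0` it occasionally shows a vertical it would otherwise suppress (an epsilon-greedy form of explore-exploit).

```python
import random


class OnlineVerticalSelector:
    """Adjusts per-query vertical predictions from click/skip feedback."""

    def __init__(self, prior, learning_rate=0.2, epsilon=0.1,
                 threshold=0.5, rng=None):
        self.estimate = dict(prior)   # query -> predicted probability of relevance
        self.learning_rate = learning_rate
        self.epsilon = epsilon        # exploration probability
        self.threshold = threshold
        self.rng = rng or random.Random(0)

    def show_vertical(self, query):
        if self.estimate.get(query, 0.0) >= self.threshold:
            return True
        # Exploration: sometimes present a vertical predicted to be irrelevant.
        return self.rng.random() < self.epsilon

    def feedback(self, query, clicked):
        p = self.estimate.get(query, 0.0)
        target = 1.0 if clicked else 0.0
        self.estimate[query] = p + self.learning_rate * (target - p)


passive = OnlineVerticalSelector({"q": 0.9, "r": 0.1}, epsilon=0.0)
for _ in range(10):
    if passive.show_vertical("q"):
        passive.feedback("q", clicked=False)   # users keep skipping
```

In the passive run the false positive for "q" is corrected after a few skips, while "r" is never shown and so can never collect the clicks that would correct a false negative; a positive `epsilon` trades some exploration cost for that recovery.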

Recap – Resource selection

                            federated search   aggregated search
Features and content type   often textual      diverse
Collection size             unavailable (uncooperative)
Training data               none               some-much

Result presentation: how to return results retrieved from several resources to users.

Result merging (metasearch engines)

• Same source (web), different overlapping indexes
• Document scores may not be available
• Title, snippet, position and timestamps
  - D-WISE (Yuwono and Lee, 96)
  - Inquirus (Glover et al, 99)
  - SavvySearch (Dreilinger and Howe, 97)

Result merging (data fusion)

• Same corpus
• Different retrieval models
• Document scores/positions available
• Unsupervised techniques
  - CombSUM, CombMNZ (Fox and Shaw, 93, 94)
  - Borda fuse (Aslam and Montague, 01)
• Supervised techniques
  - Bayes-fuse, weighted Borda fuse (Aslam and Montague, 01)
  - Segment-based fusion (Lillis et al, 06, 08; Shokouhi, 07b)
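The unsupervised fusion techniques named above are short enough to sketch directly. The min-max score normalization and the handling of unranked documents in Borda fuse (sharing the leftover points equally) are choices made for this illustration; other normalizations are possible.

```python
def min_max(run):
    """Rescale one system's scores to [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in run.items()}


def comb_sum(runs):
    """CombSUM: sum of normalized scores over all systems."""
    fused = {}
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused


def comb_mnz(runs):
    """CombMNZ: CombSUM times the number of systems that retrieved the document."""
    sums = comb_sum(runs)
    return {d: s * sum(d in run for run in runs) for d, s in sums.items()}


def borda_fuse(rankings):
    """Borda count over ranked lists (best first); unranked documents
    share the remaining points of that list equally."""
    candidates = {d for ranking in rankings for d in ranking}
    c = len(candidates)
    points = {d: 0.0 for d in candidates}
    for ranking in rankings:
        for i, doc in enumerate(ranking):
            points[doc] += c - i
        leftover = (c - len(ranking) + 1) / 2   # mean of the unused points
        for doc in candidates - set(ranking):
            points[doc] += leftover
    return points


runs = [{"a": 10.0, "b": 5.0, "c": 0.0}, {"b": 2.0, "c": 1.0}]
fused_sum = comb_sum(runs)
fused_mnz = comb_mnz(runs)
fused_borda = borda_fuse([["a", "b", "c"], ["b", "c"]])
```

CombMNZ rewards documents returned by several systems, which is why "b" pulls further ahead of "a" than under CombSUM in this toy input.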

Result merging in federated search

[Figure: the user's query goes to the broker, which holds summaries (Sum A to Sum E) of Collections A to E, forwards the query to the selected collections, and returns merged results.]

Result merging

• CORI (Callan et al, 95)
  - Normalized collection score + normalized document score.
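One commonly reported form of the CORI merging heuristic min-max normalizes the collection score C and the document score D, then combines them as D'' = (D' + 0.4 · D' · C') / 1.4. The sketch below assumes that form and that constant; the input layout (`results`, `collection_scores`) is hypothetical.

```python
def normalize(x, lo, hi):
    return (x - lo) / (hi - lo) if hi > lo else 0.0


def cori_merge(results, collection_scores):
    """results: {collection: [(doc_id, score), ...]} from each selected collection.
    collection_scores: {collection: resource selection score}."""
    c_lo = min(collection_scores.values())
    c_hi = max(collection_scores.values())
    merged = []
    for coll, docs in results.items():
        c_norm = normalize(collection_scores[coll], c_lo, c_hi)
        scores = [s for _, s in docs]
        d_lo, d_hi = min(scores), max(scores)
        for doc_id, score in docs:
            d_norm = normalize(score, d_lo, d_hi)
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


results = {"A": [("a1", 2.0), ("a2", 1.0)], "B": [("b1", 9.0), ("b2", 3.0)]}
merged = cori_merge(results, {"A": 0.8, "B": 0.2})
```

Documents from a highly ranked collection are boosted relative to equally well-scored documents from a weakly ranked one, regardless of the raw score scales each collection uses.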

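Result merging

• SSL (Si and Callan, 2003b)

[Figure: the broker ranks its sampled documents for the query; documents that also appear in the results of the selected resources have both a broker score and a source-specific score.]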

Result merging

[Figure: linear regression from source-specific score (x-axis) to broker score (y-axis). Image source: http://upload.wikimedia.org/wikipedia/en/1/13/Linear_regression.png]
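The regression idea behind SSL can be sketched as follows: for each source, documents that appear both in its result list and in the broker's sample index give (source-specific score, broker score) pairs, a line is fitted to those pairs, and the line maps every returned document onto the broker's scale. The data layout and the plain least-squares fit are assumptions for illustration; the sketch also assumes at least two overlapping documents with distinct scores per source.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ssl_merge(source_results, broker_scores):
    """source_results: {source: [(doc_id, source_score), ...]}
    broker_scores: {doc_id: score in the broker's sample index}."""
    merged = []
    for docs in source_results.values():
        overlap = [(s, broker_scores[d]) for d, s in docs if d in broker_scores]
        slope, intercept = fit_line([x for x, _ in overlap],
                                    [y for _, y in overlap])
        merged.extend((d, slope * s + intercept) for d, s in docs)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


source_results = {
    "A": [("a1", 10.0), ("a2", 6.0), ("a3", 2.0)],
    "B": [("b1", 0.9), ("b2", 0.5), ("b3", 0.1)],
}
broker_scores = {"a1": 1.0, "a3": 0.2, "b1": 0.7, "b3": 0.3}
order = [d for d, _ in ssl_merge(source_results, broker_scores)]
```

Because all mapped scores live on the broker's scale, documents from sources with very different score ranges become directly comparable.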

Result merging (miscellaneous scenarios)

• Multi-lingual result merging
  - SSL with logistic regression (Si and Callan, 05a; Si et al, 08)
• Personalized metasearch (Thomas, 08)
• Merging overlapped collections
  - COSCO (Hernandez and Kambhampati, 05): exact duplicates
  - GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates

Slotted vs tiled result presentation

[Figure: six layouts: images on top, images in the middle, images at the bottom, images at top-right, images on the left, images at the bottom-right.]

3 verticals, 3 positions, 3 degrees of vertical intent (Sushmita et al, 10)

Slotted vs tiled

Designers of aggregated search interfaces should account for the aggregation styles
• for both, vertical intent is key to deciding on the position and type of “vertical” results
• slotted: accurate estimation of the best position of “vertical” results
• tiled: accurate selection of the type of “vertical” results

Recap – Result presentation

                  federated search               aggregated search
Content type      homogeneous (text documents)   heterogeneous
Document scores   depends on environment         heterogeneous
Oracle            centralized index              none

Evaluation: how to measure the effectiveness of federated and aggregated search systems.

Resource representation (summaries) evaluation – Federated search

• CTF ratio (Callan and Connell, 01)
• Spearman rank correlation coefficient (SRCC) (Callan and Connell, 01)
• Kullback-Leibler divergence (KL) (Baillie et al, 06b; Ipeirotis et al, 05), topical KL (Baillie et al, 09)
• Predictive likelihood (Baillie et al, 06a)
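Two of these summary-quality measures are easy to sketch. The CTF ratio is taken here as the share of term occurrences in the actual collection that is covered by the learned vocabulary, and the rank correlation uses the textbook Spearman formula without tie correction (so it assumes distinct frequencies and at least two shared terms). Function names and inputs are illustrative.

```python
def ctf_ratio(learned_vocab, actual_ctf):
    """Fraction of the collection's term occurrences covered by the learned vocabulary."""
    total = sum(actual_ctf.values())
    covered = sum(f for term, f in actual_ctf.items() if term in learned_vocab)
    return covered / total


def spearman(actual_freq, learned_freq):
    """Spearman rank correlation over shared terms (no tie correction)."""
    terms = [t for t in actual_freq if t in learned_freq]
    n = len(terms)

    def ranks(freq):
        ordered = sorted(terms, key=lambda t: freq[t], reverse=True)
        return {t: i for i, t in enumerate(ordered)}

    actual_rank, learned_rank = ranks(actual_freq), ranks(learned_freq)
    d2 = sum((actual_rank[t] - learned_rank[t]) ** 2 for t in terms)
    return 1 - 6 * d2 / (n * (n * n - 1))


actual = {"search": 50, "query": 30, "index": 15, "zebra": 5}
learned = {"search": 6, "query": 4, "index": 1}
coverage = ctf_ratio(set(learned), actual)
agreement = spearman(actual, learned)
```

The first measure rewards covering the frequent vocabulary; the second checks whether the learned summary orders terms the same way as the full collection does.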

Resource selection evaluation – Federated search

Result merging evaluation – Federated search

• Oracle
  - Correct merging (centralized index ranking) (Hawking and Thistlewaite, 99)
  - Perfect merging (ordered by relevance labels) (Hawking and Thistlewaite, 99)
• Metrics
  - Precision
  - Correct matches (Chakravarthy and Haase, 95)

Vertical selection evaluation – Aggregated search

• Majority of publications focus on single vertical selection
  - vertical accuracy, precision, recall
• Evaluation data
  - editorial data
  - behavioral data

Editorial data

• Guidelines
  - judge relevance based on vertical results (implicit judging of retrieval/content quality)
  - judge relevance based on vertical description (assumes idealized retrieval/content quality)
• Evaluation metric derived from binary or graded relevance judgments
(Arguello et al, 09; Arguello et al, 10)

Behavioral data

• Infer relevance from behavioral data (e.g. click data)
• Evaluation metric
  - regression error on predicted CTR
  - infer binary or graded relevance
(Diaz, 09; Konig et al, 09)

Test collections (a la TREC)

Statistics on topics
number of topics                                      150
average rel docs per topic                            110.3
average rel verticals per topic                       1.75
ratio of “General Web” topics                         29.3%
ratio of topics with two vertical intents             66.7%
ratio of topics with more than two vertical intents   4.0%

quantity/media        text         image     video    total
size (G)              2125         41.1      445.5    2611.6
number of documents   86,186,315   670,439   1,253*   86,858,007

* There are on average more than 100 events/shots contained in each video clip (document)

(Zhou & Lalmas, 10)

Test collections (a la TREC)

[Figure: existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, ...), each with per-topic judgments (R/N) for documents d1 ... dn, are used as (simulated) verticals (Blog, Reference (Encyclopedia), Image, General Web, Shopping, ...). For a topic t1 the result is a set of judgments per vertical V1 ... Vk over that vertical's documents.]

Recap – Evaluation

                  federated search               aggregated search
Editorial data    document relevance judgments   query labels
Behavioral data   none                           critical


Open problems in federated search

• Beyond big document
  - Classification-based server selection (Arguello et al, 09a)
  - Topic modeling
• Query expansion
  - Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09)
• Evaluating federated search
  - Confounding factors
• Federated search in other contexts
  - Blog search (Elsas et al, 08; Seo and Croft, 08)
• Effective merging
  - Supervised techniques

Open problems in aggregated search

• Evaluation metrics
  - slotted presentation
  - tiled presentation
  - metrics based on behavioral signals
• Models for multiple verticals
• Minimizing the cost for new verticals, markets

Bibliography

• J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In SIGIR 2009 (2009).
• J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceedings of ACM CIKM, pages 1277--1286, Hong Kong, China, 2009a.
• J. Arguello, F. Diaz, and J.-F. Paiement. Vertical selection in the presence of unlabeled verticals. In SIGIR 2010 (2010).
• J. Aslam and M. Montague. Models for metasearch. In Proceedings of ACM SIGIR, pages 276--284, New Orleans, LA, 2001.
• M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections. In Proceedings of SPIRE, pages 316--328, Glasgow, UK, 2006a.
• M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings of the First International Conference on Scalable Information Systems, page 41, Hong Kong, 2006b.
• M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR, pages 485--496, Toulouse, France, 2009.

• Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of WWW, pages 367--376, Edinburgh, UK, 2006.
• S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, and O. Frieder. Automatic classification of web queries using very large unlabeled query logs. ACM Trans. Inf. Syst. 25, 2 (2007), 9.
• Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. In Proceedings of SPIRE, pages 110--121, Glasgow, UK, 2006.
• J. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97--130, 2001.
• J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference networks. In Proceedings of ACM SIGIR, pages 21--28, Seattle, WA, 1995.
• J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of ACM SIGIR, pages 340--347, Seattle, WA, 2006.
• S. Cetintas, L. Si, and H. Yuan. Learning from past queries for resource selection. In Proceedings of ACM CIKM, pages 1867--1870, Hong Kong, China, 2009.

• B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic combination of multiple ranked retrieval systems. ACM SIGIR, pp 173--181, 1994.
• C. Baumgarten. A probabilistic solution to the selection and fusion problem in distributed information retrieval. ACM SIGIR, pp 246--253, 1999.
• N. Craswell. Methods for distributed information retrieval. PhD thesis, Australian National University, 2000.
• S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM SIGIR, pp 299--306, 2002.
• A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet information archives. ACM SIGIR, pp 4--11, Seattle, WA, 1995.
• F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR, pp 583--590, 2007.
• F. Diaz. Integration of news content into web results. ACM International Conference on Web Search and Data Mining, 2009.
• F. Diaz and J. Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. ACM SIGIR, 2009.
• D. Dreilinger and A. Howe. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3):195--222, 1997.
• J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search. ACM SIGIR, pp 347--354, Singapore, 2008.

• E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs. ACM CIKM, pp 210--216, 1999.
• L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International Conference on Parallel and Distributed Information Systems, pp 103--106, Austin, TX, 1994a.
• L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN, 1994b.
• L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching. ACM SIGMOD, pp 207--218, Tucson, AZ, 1997.
• L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet. ACM Transactions on Database Systems, 24(2):229--264, 1999.
• E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, pp 243--252, Gaithersburg, MD, 1993.
• E. Fox and J. Shaw. Combination of multiple searches. Third Text REtrieval Conference, pp 105--108, Gaithersburg, MD, 1994.
• J. French and A. Powell. Metrics for evaluating database selection techniques. World Wide Web, 3(3):153--163, 2000.
• C. Hauff. Predicting the effectiveness of queries and retrieval systems. PhD thesis, University of Twente, 2010.

• D. Hawking and P. Thomas. Server selection methods in hybrid portal search. ACM SIGIR, pp 75--82, Salvador, Brazil, 2005.
• D. Hawking and P. Thistlewaite. Methods for information server selection. ACM Transactions on Information Systems, 17(1):40--76, 1999.
• T. Hernandez and S. Kambhampati. Improving text collection selection with coverage and overlap statistics. WWW, pp 1128--1129, Chiba, Japan, 2005.
• P. Ipeirotis and L. Gravano. When one sample is not enough: improving text database selection using shrinkage. ACM SIGMOD, pp 767--778, Paris, France, 2004.
• P. Ipeirotis and L. Gravano. Distributed search over the hidden web: hierarchical database sampling and selection. VLDB, pp 394--405, Hong Kong, China, 2002.
• P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection. ACM Transactions on Information Systems, 26(2):1--66, 2008.
• P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content changes in text databases. 21st International Conference on Data Engineering, pp 606--617, Tokyo, Japan, 2005.
• A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries. ACM SIGIR, 2009.

Bibliography

  X. Li, Y.-Y. Wang, and A. Acero, Learning query intent from regularized click graphs, ACM SIGIR, pp. 339–346.

  D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion, ACM SIGIR, pp 139-146, Seattle, WA, 2006.

  K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM CIKM, pp 652-654, McLean, VA, 2002.

  N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying Vertical Search Intention of Query through Social Tagging Propagation, WWW, Madrid, 2009.

  W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for metasearch, ACM Transactions on Information Systems, 19(3):310-335, 2001.

  W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48-89, 2002.

  V. Murdock and M. Lalmas. Workshop on aggregated search, SIGIR Forum 42(2): 80-83, 2008.

  H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for advanced resource selection, ECIR, pp 138-153, Sunderland, UK, 2004.

  P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval, ACM CIKM, pp 183-190, Atlanta, GA, 2001.

  C. Paris, S. Wan and P. Thomas. Focused and aggregated search: a perspective from natural language generation, Journal of Information Retrieval, Special Issue, 2010.

  S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine, Library & Information Science Research 31(2): 126-133, 2009.

  F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and ponds, Journal of the Tennessee Academy of Science, 18:228-249, 1943.

  M. Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval, ECIR, pp 160-172, Rome, Italy, 2007a.

  J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp 1053-1062, Napa Valley, CA, 2008.

  M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR, pp 185-197, Rome, Italy, 2007b.

  M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates, ACM Transactions on Information Systems, 27(3):1-29, 2009.

  M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections, ACM SIGIR, pp 495-502, Amsterdam, Netherlands, 2007.

  M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference, pp 63-75, Harbin, China, 2006a.

  M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval, ACM SIGIR, pp 316-323, Seattle, WA, 2006b.

  M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval, Information Processing and Management, 43(1):169-180, 2007d.

  M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search, ACM SIGIR, pp 427-434, Boston, MA, 2009.

  L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM CIKM, pp 32-41, Washington, DC, 2004a.

  L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm

  L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments, Information Retrieval, 11(1):1-24, 2008.

  L. Si and J. Callan. Relevant document distribution estimation method for resource selection, ACM SIGIR, pp 298-305, Toronto, Canada, 2003a.

  L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM SIGIR, pp 83-90, Salvador, Brazil, 2005b.

  L. Si and J. Callan. A semisupervised learning method to merge search engine results, ACM Transactions on Information Systems, 21(4):457-491, 2003b.

  A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments, WWW, pp 417-429, Amsterdam, Netherlands, 2000.

  S. Sushmita, H. Joho and M. Lalmas. A Task-Based Evaluation of an Aggregated Search Interface, SPIRE, Saariselkä, Finland, 2009.

  S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors Affecting Click-Through Behavior in Aggregated Search Interfaces, ACM CIKM, Toronto, Canada, 2010.

  S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of Genre and Domain Intents, Technical Report, University of Glasgow, 2010.

  S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance" in web search, WWW 2009 Workshop on Web Search Result Summarization and Presentation, Madrid, Spain, 2009.

  P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections, ACM SIGIR, pp 503-510, Amsterdam, Netherlands, 2007.

  P. Thomas. Server characterisation and selection for personal metasearch, PhD thesis, Australian National University, 2008.

  P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection, ACM SIGIR, pp 419-426, Boston, MA, 2009.

  A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research in focused retrieval and result aggregation, Special Issue in the Journal of Information Retrieval, Springer, 2010.

  T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the Web, ACM CIKM, pp 181-189, Atlanta, Georgia, 2001.

  E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategies, ACM SIGIR, pp 172-179, 1995.

  B. Yuwono and D. Lee. WISE: A world wide web resource database system. IEEE Transactions on Knowledge and Data Engineering, 8(4):548-554, 1996.

  B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. Fifth International Conference on Database Systems for Advanced Applications, 6, pp 41-50, Melbourne, Australia, 1997.

  J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp 112-120, Melbourne, Australia, 1998.

  A. Zhou and M. Lalmas. Building a Test Collection for Aggregated Search, Technical Report, University of Glasgow, 2010.

  J. Zobel. Collection selection via lexicon inspection, Australian Document Computing Symposium, pp 74-80, Melbourne, Australia, 1997.
