Search Engines
Information Retrieval in Practice
© W.B. Croft, D. Metzler, T. Strohman, 2015. This book was previously
published by Pearson Education, Inc.
Preface
This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,¹ but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.
The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.
The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.

¹ In keeping with common usage, most uses of the word "web" in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.
Contents
In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user's information need.
Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.
Information retrieval theory and the design, implementation,
evaluation, and use of search engines cover too many topics to
describe them all in depth in one book. We have tried to focus on
the most important topics while giving some coverage to all aspects
of this challenging and rewarding subject.
Supplements
A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:
• Extensive lecture slides (in PDF and PPT format)
• Solutions to selected end-of-chapter problems (instructors only)
• Test collections for exercises
• Galago search engine
The supplements are available at www.search-engines-book.com.
Acknowledgments
First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.
B.C.
D.M.
T.S.
2015 Update
This version of the book is being made available for free download. It has been edited to correct the minor errors noted in the five years since the book's publication. The authors, meanwhile, are working on a second edition.
Contents

1 Search Engines and Information Retrieval
  1.1 What Is Information Retrieval?
  1.2 The Big Issues
  1.3 Search Engines
  1.4 Search Engineers

2 Architecture of a Search Engine
  2.1 What Is an Architecture?
  2.2 Basic Building Blocks
  2.3 Breaking It Down
    2.3.1 Text Acquisition
    2.3.2 Text Transformation
    2.3.3 Index Creation
    2.3.4 User Interaction
    2.3.5 Ranking
    2.3.6 Evaluation
  2.4 How Does It Really Work?

3 Crawls and Feeds
  3.1 Deciding What to Search
  3.2 Crawling the Web
    3.2.1 Retrieving Web Pages
    3.2.2 The Web Crawler
    3.2.3 Freshness
    3.2.4 Focused Crawling
    3.2.5 Deep Web
    3.2.6 Sitemaps
    3.2.7 Distributed Crawling
  3.3 Crawling Documents and Email
  3.4 Document Feeds
  3.5 The Conversion Problem
    3.5.1 Character Encodings
  3.6 Storing the Documents
    3.6.1 Using a Database System
    3.6.2 Random Access
    3.6.3 Compression and Large Files
    3.6.4 Update
    3.6.5 BigTable
  3.7 Detecting Duplicates
  3.8 Removing Noise

4 Processing Text
  4.1 From Words to Terms
  4.2 Text Statistics
    4.2.1 Vocabulary Growth
    4.2.2 Estimating Collection and Result Set Sizes
  4.3 Document Parsing
    4.3.1 Overview
    4.3.2 Tokenizing
    4.3.3 Stopping
    4.3.4 Stemming
    4.3.5 Phrases and N-grams
  4.4 Document Structure and Markup
  4.5 Link Analysis
    4.5.1 Anchor Text
    4.5.2 PageRank
    4.5.3 Link Quality
  4.6 Information Extraction
    4.6.1 Hidden Markov Models for Extraction
  4.7 Internationalization

5 Ranking with Indexes
  5.1 Overview
  5.2 Abstract Model of Ranking
  5.3 Inverted Indexes
    5.3.1 Documents
    5.3.2 Counts
    5.3.3 Positions
    5.3.4 Fields and Extents
    5.3.5 Scores
    5.3.6 Ordering
  5.4 Compression
    5.4.1 Entropy and Ambiguity
    5.4.2 Delta Encoding
    5.4.3 Bit-Aligned Codes
    5.4.4 Byte-Aligned Codes
    5.4.5 Compression in Practice
    5.4.6 Looking Ahead
    5.4.7 Skipping and Skip Pointers
  5.5 Auxiliary Structures
  5.6 Index Construction
    5.6.1 Simple Construction
    5.6.2 Merging
    5.6.3 Parallelism and Distribution
    5.6.4 Update
  5.7 Query Processing
    5.7.1 Document-at-a-time Evaluation
    5.7.2 Term-at-a-time Evaluation
    5.7.3 Optimization Techniques
    5.7.4 Structured Queries
    5.7.5 Distributed Evaluation
    5.7.6 Caching

6 Queries and Interfaces
  6.1 Information Needs and Queries
  6.2 Query Transformation and Refinement
    6.2.1 Stopping and Stemming Revisited
    6.2.2 Spell Checking and Suggestions
    6.2.3 Query Expansion
    6.2.4 Relevance Feedback
    6.2.5 Context and Personalization
  6.3 Showing the Results
    6.3.1 Result Pages and Snippets
    6.3.2 Advertising and Search
    6.3.3 Clustering the Results
  6.4 Cross-Language Search

7 Retrieval Models
  7.1 Overview of Retrieval Models
    7.1.1 Boolean Retrieval
    7.1.2 The Vector Space Model
  7.2 Probabilistic Models
    7.2.1 Information Retrieval as Classification
    7.2.2 The BM25 Ranking Algorithm
  7.3 Ranking Based on Language Models
    7.3.1 Query Likelihood Ranking
    7.3.2 Relevance Models and Pseudo-Relevance Feedback
  7.4 Complex Queries and Combining Evidence
    7.4.1 The Inference Network Model
    7.4.2 The Galago Query Language
  7.5 Web Search
  7.6 Machine Learning and Information Retrieval
    7.6.1 Learning to Rank
    7.6.2 Topic Models and Vocabulary Mismatch
  7.7 Application-Based Models

8 Evaluating Search Engines
  8.1 Why Evaluate?
  8.2 The Evaluation Corpus
  8.3 Logging
  8.4 Effectiveness Metrics
    8.4.1 Recall and Precision
    8.4.2 Averaging and Interpolation
    8.4.3 Focusing on the Top Documents
    8.4.4 Using Preferences
  8.5 Efficiency Metrics
  8.6 Training, Testing, and Statistics
    8.6.1 Significance Tests
    8.6.2 Setting Parameter Values
    8.6.3 Online Testing
  8.7 The Bottom Line

9 Classification and Clustering
  9.1 Classification and Categorization
    9.1.1 Naïve Bayes
    9.1.2 Support Vector Machines
    9.1.3 Evaluation
    9.1.4 Classifier and Feature Selection
    9.1.5 Spam, Sentiment, and Online Advertising
  9.2 Clustering
    9.2.1 Hierarchical and K-Means Clustering
    9.2.2 K Nearest Neighbor Clustering
    9.2.3 Evaluation
    9.2.4 How to Choose K
    9.2.5 Clustering and Search

10 Social Search
  10.1 What Is Social Search?
  10.2 User Tags and Manual Indexing
    10.2.1 Searching Tags
    10.2.2 Inferring Missing Tags
    10.2.3 Browsing and Tag Clouds
  10.3 Searching with Communities
    10.3.1 What Is a Community?
    10.3.2 Finding Communities
    10.3.3 Community-Based Question Answering
    10.3.4 Collaborative Searching
  10.4 Filtering and Recommending
    10.4.1 Document Filtering
    10.4.2 Collaborative Filtering
  10.5 Peer-to-Peer and Metasearch
    10.5.1 Distributed Search
    10.5.2 P2P Networks

11 Beyond Bag of Words
  11.1 Overview
  11.2 Feature-Based Retrieval Models
  11.3 Term Dependence Models
  11.4 Structure Revisited
    11.4.1 XML Retrieval
    11.4.2 Entity Search
  11.5 Longer Questions, Better Answers
  11.6 Words, Pictures, and Music
  11.7 One Search Fits All?

References

Index
List of Figures
1.1 Search engine design and the core information retrieval issues

2.1 The indexing process
2.2 The query process

3.1 A uniform resource locator (URL), split into three parts
3.2 Crawling the Web. The web crawler connects to web servers to find pages. Pages may link to other pages on the same server or on different servers.
3.3 An example robots.txt file
3.4 A simple crawling thread implementation
3.5 An HTTP HEAD request and server response
3.6 Age and freshness of a single page over time
3.7 Expected age of a page with mean change frequency λ = 1/7 (one week)
3.8 An example sitemap file
3.9 An example RSS 2.0 feed
3.10 An example of text in the TREC Web compound document format
3.11 An example link with anchor text
3.12 BigTable stores data in a single logical table, which is split into many smaller tablets
3.13 A BigTable row
3.14 Example of fingerprinting process
3.15 Example of simhash fingerprinting process
3.16 Main content block in a web page
3.17 Tag counts used to identify text blocks in a web page
3.18 Part of the DOM structure for the example web page

4.1 Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)
4.2 A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.
4.3 Vocabulary growth for the TREC AP89 collection compared to Heaps' law
4.4 Vocabulary growth for the TREC GOV2 collection compared to Heaps' law
4.5 Result size estimate for web search
4.6 Comparison of stemmer output for a TREC query. Stopwords have also been removed.
4.7 Output of a POS tagger for a TREC query
4.8 Part of a web page from Wikipedia
4.9 HTML source for example Wikipedia page
4.10 A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.
4.11 Pseudocode for the iterative PageRank algorithm
4.12 Trackback links in blog postings
4.13 Text tagged by information extraction
4.14 Sentence model for statistical entity extractor
4.15 Chinese segmentation and bigrams

5.1 The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores
5.2 A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.
5.3 An inverted index for the documents (sentences) in Table 5.1
5.4 An inverted index, with word counts, for the documents in Table 5.1
5.5 An inverted index, with word positions, for the documents in Table 5.1
5.6 Aligning posting lists for "tropical" and "fish" to find the phrase "tropical fish"
5.7 Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document
5.8 Pseudocode for a simple indexer
5.9 An example of index merging. The first and second indexes are merged together to produce the combined index.
5.10 MapReduce
5.11 Mapper for a credit card summing algorithm
5.12 Reducer for a credit card summing algorithm
5.13 Mapper for documents
5.14 Reducer for word postings
5.15 Document-at-a-time query evaluation. The numbers (x:y) represent a document number (x) and a word count (y).
5.16 A simple document-at-a-time retrieval algorithm
5.17 Term-at-a-time query evaluation
5.18 A simple term-at-a-time retrieval algorithm
5.19 Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.
5.20 A term-at-a-time retrieval algorithm with conjunctive processing
5.21 A document-at-a-time retrieval algorithm with conjunctive processing
5.22 MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.
5.23 Evaluation tree for the structured query #combine(#od:1(tropical fish) #od:1(aquarium fish) fish)

6.1 Top ten results for the query "tropical fish"
6.2 Geographic representation of Cape Cod using bounding rectangles
6.3 Typical document summary for a web search
6.4 An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm
6.5 Advertisements displayed by a search engine for the query "fish tanks"
6.6 Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster.
6.7 Categories returned for the query "tropical fish" in a popular online retailer
6.8 Subcategories and facets for the "Home & Garden" category
6.9 Cross-language search
6.10 A French web page in the results list for the query "pecheur france"

7.1 Term-document matrix for a collection of four documents
7.2 Vector representation of documents and queries
7.3 Classifying a document as relevant or non-relevant
7.4 Example inference network model
7.5 Inference network with three nodes
7.6 Galago query for the dependence model
7.7 Galago query for web data

8.1 Example of a TREC topic
8.2 Recall and precision values for two rankings of six relevant documents
8.3 Recall and precision values for rankings from two different queries
8.4 Recall-precision graphs for two queries
8.5 Interpolated recall-precision graphs for two queries
8.6 Average recall-precision graph using standard recall levels
8.7 Typical recall-precision graph for 50 queries from TREC
8.8 Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.
8.9 Example distribution of query effectiveness improvements

9.1 Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.2 Illustration of how documents are represented in the multinomial event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.3 Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not.
9.4 Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles.
9.5 Generative process used by the Naïve Bayes model. First, a class is chosen according to P(c), and then a document is chosen according to P(d|c).
9.6 Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.
9.7 Example output of SpamAssassin email spam filter
9.8 Example of web page spam, showing the main page and some of the associated term and link spam
9.9 Example product review incorporating sentiment
9.10 Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes "Aquariums", "Fish", and "Supplies" are example nodes within a semantic hierarchy. The web page is classified as "Aquariums - Fish" and the ad is classified as "Supplies - Fish". Here, "Aquariums" is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.
9.11 Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.12 Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.13 Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12
9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.
9.15 Illustration of how various clustering cost functions are computed
9.16 Example of overlapping clustering using nearest neighbor clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.
9.17 Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.
9.18 Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents.

10.1 Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish". The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.
10.2 Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.
10.3 Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm.
10.4 Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p → q
10.5 Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.
10.6 Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.
10.7 Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).
10.8 A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks.
10.9 Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users.
10.10 Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged.
10.11 Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes.
10.12 Neighborhoods (Ni) of a hub node (H) in a hierarchical P2P network

11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence (bottom right)
11.2 Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query "hubble telescope achievements"
11.3 Functions provided by a search engine interacting with a simple database system
11.4 Example of an entity search for organizations using the TREC Wall Street Journal 1987 Collection
11.5 Question answering system architecture
11.6 Examples of OCR errors
11.7 Examples of speech recognizer errors
11.8 Two images (a fish and a flower bed) with color histograms. The horizontal axis is hue value.
11.9 Three examples of content-based image retrieval. The collection for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query.
11.10 Key frames extracted from a TREC video clip
11.11 Examples of automatic text annotation of images
11.12 Three representations of Bach's "Fugue #10": audio, MIDI, and conventional music notation
List of Tables
3.1 UTF-8 encoding

4.1 Statistics for the AP89 collection
4.2 Most frequent 50 words from AP89
4.3 Low-frequency words from AP89
4.4 Example word frequency ranking
4.5 Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209.
4.6 Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179.
4.7 Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.
4.8 Examples of words with the Arabic root ktb
4.9 High-frequency noun phrases from a TREC collection and U.S. patents from 1996
4.10 Statistics for the Google n-gram sample

5.1 Four sentences from the Wikipedia entry for tropical fish
5.2 Elias-γ code examples
5.3 Elias-δ code examples
5.4 Space requirements for numbers encoded in v-byte
5.5 Sample encodings for v-byte
5.6 Skip lengths (k) and expected processing steps

6.1 Partial entry for the Medical Subject (MeSH) Heading "Neck Pain"
6.2 Term association measures
6.3 Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.4 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.5 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.

7.1 Contingency table of term occurrences for a particular query
7.2 BM25 scores for an example document
7.3 Query likelihood scores for an example document
7.4 Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)
7.5 Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)
7.6 Conditional probabilities for example network
7.7 Highest-probability terms from four topics in LDA model

8.1 Statistics for three example text collections. The average number of words per document is calculated without stemming.
8.2 Statistics for queries from example text collections
8.3 Sets of documents defined by a simple search with binary relevance
8.4 Precision values at standard recall levels calculated using interpolation
8.5 Definitions of some important efficiency metrics
8.6 Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B - A gives the difference in effectiveness.

9.1 A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given.

10.1 Example questions submitted to Yahoo! Answers
10.2 Translations automatically learned from a set of question and answer pairs. The 10 most likely translations for the terms "everest", "xp", and "search" are given.
10.3 Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given.
10.4 Contingency table for the possible outcomes of a filtering system. Here, TP (true positive) is the number of relevant documents retrieved, FN (false negative) is the number of relevant documents not retrieved, FP (false positive) is the number of non-relevant documents retrieved, and TN (true negative) is the number of non-relevant documents not retrieved.

11.1 Most likely one- and two-word concepts produced using latent concept expansion with the top 25 documents retrieved for the query "hubble telescope achievements" on the TREC ROBUST collection
11.2 Example TREC QA questions and their corresponding question categories
1 Search Engines and Information Retrieval

“Mr. Helpmann, I’m keen to get into Information Retrieval.”
Sam Lowry, Brazil
1.1 What Is Information Retrieval?
This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval.¹

So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book.
Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):

    Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate.

¹ Information retrieval is often abbreviated as IR. In this book, we mostly use the full term. This has nothing to do with the fact that many people think IR means "infrared" or something else.
The term "information" is very general, and information retrieval includes work on a wide range of types of information and a variety of applications related to search.
The primary focus of the field since the 1950s has been on text and text documents. Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author, date, and abstract information associated with the content of papers that appear in scientific journals. The elements of this structure are called attributes, or fields, when referring to database records. The important distinction between a document and a typical database record, such as a bank account record or a flight reservation, is that most of the information in the document is in the form of text, which is relatively unstructured.
To illustrate this difference, consider the information contained in two typical attributes of an account record, the account number and current balance. Both are very well defined, both in terms of their format (for example, a six-digit integer for an account number and a real number with two decimal places for balance) and their meaning. It is very easy to compare values of these attributes, and consequently it is straightforward to implement algorithms to identify the records that satisfy queries such as "Find account number 321456" or "Find accounts with balances greater than $50,000.00".
Now consider a news story about the merger of two banks. The story will have some attributes, such as the headline and source of the story, but the primary content is the story itself. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most of the queries submitted to a web search engine such as Google2 that relate to this story will be of the form “bank merger” or “bank takeover”. To do this search, we must design algorithms that can compare the text of the queries with the text of the story and decide whether the story contains the information that is being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much more difficult than defining an account number, and consequently comparing text is not easy. Understanding and modeling how people compare texts, and designing computer algorithms to accurately perform this comparison, is at the core of information retrieval.
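To make the contrast concrete, here is a minimal sketch; all class, field, and method names are invented for illustration, and this is not code from any system described in this book. Matching well-defined attributes is a one-line comparison, while a naive text match over the story can fail even when the story is exactly what the user wants:

    import java.util.List;

    public class MatchingExample {
        // A structured record: its attributes have well-defined formats and meanings.
        record Account(int accountNumber, double balance) {}

        // Exact comparison of well-defined attributes is trivial to implement.
        static boolean matchesQuery(Account a) {
            return a.accountNumber() == 321456 || a.balance() > 50000.00;
        }

        // Naive matching over unstructured text: easy to write, but it says
        // nothing about whether the story actually contains what the user wants.
        static boolean naiveTextMatch(String storyText, String query) {
            return storyText.toLowerCase().contains(query.toLowerCase());
        }

        public static void main(String[] args) {
            List<Account> accounts = List.of(new Account(321456, 1200.50),
                                             new Account(987654, 75000.00));
            accounts.stream()
                    .filter(MatchingExample::matchesQuery)
                    .forEach(System.out::println);

            String story = "The two banks announced a merger on Tuesday...";
            System.out.println(naiveTextMatch(story, "bank takeover")); // false
        }
    }

The substring test returns false for “bank takeover” even though the story is about exactly that event, a first hint of the vocabulary mismatch problem discussed later in this chapter.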
Increasingly, applications of information retrieval involve multimedia documents with structure, significant text content, and other media. Popular information media include pictures, video, and audio, including music and speech. In some applications, such as in legal support, scanned document images are also important. These media have content that, like text, is difficult to describe and compare. The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example.

2 http://www.google.com
In addition to a range of media, information retrieval involves a range of tasks and applications. The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order. Although searching the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains. Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic. Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices. Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on. Any application that involves a collection3 of text or other unstructured information will need to organize and search that information.
Search based on a user query (sometimes called ad hoc search because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering. Filtering or tracking involves detecting stories of interest based on a person’s interests and providing an alert using email or some other mechanism. Classification or categorization uses a defined set of labels or classes (such as the categories listed in the Yahoo! Directory4) and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific questions, such as “What is the height of Mt. Everest?”. The goal of question answering is to return a specific answer found in the text, rather than a list of documents. Table 1.1 summarizes some of these aspects or dimensions of the field of information retrieval.

3 The term database is often used to refer to a collection of either structured or unstructured data. To avoid confusion, we mostly use the term document collection (or just collection) for text. However, the terms web database and search engine database are so common that we occasionally use them in this book.
Table 1.1. Examples of content, applications, and tasks in the field of information retrieval

    Examples of Content   Examples of Applications   Examples of Tasks
    Text                  Web search                 Ad hoc search
    Images                Vertical search            Filtering
    Video                 Enterprise search          Classification
    Scanned documents     Desktop search             Question answering
    Audio                 Peer-to-peer search
    Music
1.2 The Big Issues
Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text. One of these issues is relevance. Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person’s decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.

4 http://dir.yahoo.com/
It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query “severe weather events”. The person who asked the question (often called the user) may not consider the story relevant, however, if she has seen that story before, or if the story is five years old, or if the story is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.
To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.
An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.
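As a small sketch of this statistical view of text (a hypothetical fragment, not part of the Galago engine described later), the following simply counts word occurrences; counts like these, rather than part-of-speech information, are the raw material of most ranking algorithms:

    import java.util.HashMap;
    import java.util.Map;

    public class TermFrequency {
        // Count word occurrences, ignoring case and punctuation. Ranking
        // algorithms typically start from counts like these rather than from
        // any analysis of whether a word is a noun or an adjective.
        static Map<String, Integer> termCounts(String text) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            String doc = "The tropical fish tank holds tropical fish.";
            System.out.println(termCounts(doc));
            // e.g. {tank=1, the=1, holds=1, tropical=2, fish=2} (order may vary)
        }
    }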
Another core issue for information retrieval is evaluation. Since the quality of a document ranking depends on how well it matches a person’s expectations, it was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure, and is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. A test collection5 for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC6 evaluation forum.
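As a minimal worked example (the document identifiers and judgments below are made up, not drawn from any real test collection), precision and recall follow directly from their definitions:

    import java.util.HashSet;
    import java.util.Set;

    public class PrecisionRecall {
        public static void main(String[] args) {
            // Documents returned by the search engine for one query.
            Set<String> retrieved = Set.of("d1", "d2", "d3", "d4", "d5");
            // Relevance judgments: documents known to be relevant to that query.
            Set<String> relevant = Set.of("d2", "d4", "d6", "d7");

            // Intersection: relevant documents that were actually retrieved.
            Set<String> relevantRetrieved = new HashSet<>(retrieved);
            relevantRetrieved.retainAll(relevant);

            double precision = (double) relevantRetrieved.size() / retrieved.size();
            double recall = (double) relevantRetrieved.size() / relevant.size();

            System.out.printf("precision = %.2f, recall = %.2f%n", precision, recall);
            // precision = 0.40 (2 of the 5 retrieved documents are relevant)
            // recall    = 0.50 (2 of the 4 relevant documents were retrieved)
        }
    }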
Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data is strongly correlated with relevance, so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as “cats” could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.
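A highly simplified sketch of one such technique, query expansion, might look like the following; the term associations here are invented for illustration, whereas real systems derive them from query logs, thesauri, user feedback, or term co-occurrence statistics:

    import java.util.List;
    import java.util.Map;

    public class QueryExpansion {
        // Hypothetical term associations; a real system would learn these
        // from interaction data and context rather than hard-coding them.
        static final Map<String, List<String>> related = Map.of(
                "cats", List.of("musical", "broadway", "kittens", "pet adoption"));

        static String expand(String query) {
            StringBuilder expanded = new StringBuilder(query);
            for (String term : related.getOrDefault(query, List.of())) {
                expanded.append(' ').append(term);
            }
            return expanded.toString();
        }

        public static void main(String[] args) {
            // The ambiguous one-word query becomes a richer description of
            // possible information needs.
            System.out.println(expand("cats"));
            // cats musical broadway kittens pet adoption
        }
    }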
These issues will come up throughout this book, and will be discussed in considerably more detail. We now have sufficient background to start talking about the main product of research in information retrieval—namely, search engines.
1.3 Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections. A web search engine is the obvious example, but as has been mentioned, search engines can be found in many different applications, such as desktop search or enterprise search. Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s. The term “search engine” was originally used to refer to specialized hardware for text search. From the mid-1980s onward, however, it gradually came to be used in preference to “information retrieval system” as the name for the software system that compares queries to documents and produces ranked result lists of documents. There is much more to a search engine than the ranking algorithm, of course, and we will discuss the general architecture of these systems in the next chapter.

5 Also known as an evaluation corpus (plural corpora).
6 Text REtrieval Conference—http://trec.nist.gov/
Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!,7 must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. Enterprise search engines—for example, Autonomy8—must be able to process the large variety of information sources in a company and use company-specific knowledge as part of search and related tasks, such as data mining. Data mining refers to the automatic discovery of interesting structure in data and includes techniques such as clustering. Desktop search engines, such as the Microsoft Vista™ search feature, must be able to rapidly incorporate new documents, web pages, and email as the person creates or looks at them, as well as provide an intuitive interface for searching this very heterogeneous mix of information. There is overlap between these categories with systems such as Google, for example, which is available in configurations for enterprise and desktop search.
Open source search engines are another important class of systems that have somewhat different design goals than the commercial search engines. There are a number of these systems, and the Wikipedia page for information retrieval9 provides links to many of them. Three systems of particular interest are Lucene,10 Lemur,11 and the system provided with this book, Galago.12 Lucene is a popular Java-based search engine that has been used for a wide range of commercial applications. The information retrieval techniques that it uses are relatively simple.

7 http://www.yahoo.com
8 http://www.autonomy.com
9 http://en.wikipedia.org/wiki/Information_retrieval
10 http://lucene.apache.org
11 http://www.lemurproject.org
12 http://www.search-engines-book.com
Lemur is an open source toolkit that includes the Indri C++-based search engine. Lemur has primarily been used by information retrieval researchers to compare advanced search techniques. Galago is a Java-based search engine that is based on the Lemur and Indri projects. The assignments in this book make extensive use of Galago. It is designed to be fast, adaptable, and easy to understand, and incorporates very effective information retrieval techniques.
The “big issues” in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this book.
Another important performance measure is how fast new data can be incorporated into the indexes. Search applications typically deal with dynamic, constantly changing information. Coverage measures how much of the existing information in, say, a corporate information environment has been indexed and stored in the search engine, and recency or freshness measures the “age” of the stored information.
Search engines can be used with small collections, such as a few hundred emails and documents on a desktop, or extremely large collections, such as the entire Web. There may be only a few users of a given application, or many thousands. Scalability is clearly an important issue for search engine design. Designs that work for a given application should continue to work as the amount of data and the number of users grow. In section 1.1, we described how search engines are used in many applications and for many tasks. To do this, they have to be customizable or adaptable. This means that many different aspects of the search engine, such as the ranking algorithm, the interface, or the indexing strategy, must be able to be tuned and adapted to the requirements of the application.
Practical issues that impact search engine design also occur for specific applications. The best example of this is spam in web search. Spam is generally thought of as unwanted email, but more generally it could be defined as misleading, inappropriate, or non-relevant information in a document that is designed for some commercial benefit.

Fig. 1.1. Search engine design and the core information retrieval issues
Based on this discussion of the relationship between information retrieval and search engines, we now consider what roles computer scientists and others play in the design and use of search engines.
1.4 Search Engineers
Information retrieval research involves the development of mathematical models of text and language, large-scale experiments with test collections or users, and a lot of scholarly paper writing. For these reasons, it tends to be done by academics or people in research laboratories. These people are primarily trained in computer science, although information science, mathematics, and, occasionally, social science and computational linguistics are also represented. So who works with search engines? To a large extent, it is the same sort of people but with a more practical emphasis. The computing industry has started to use the term search engineer to describe this type of person. Search engineers are primarily people trained in computer science, mostly with a systems or database background. Surprisingly few of them have training in information retrieval, which is one of the major motivations for this book.
What is the role of a search engineer? Certainly the people who work in the major web search companies designing and implementing new search engines are search engineers, but the majority of search engineers are the people who modify, extend, maintain, or tune existing search engines for a wide range of commercial applications. People who design or “optimize” content for search engines are also search engineers, as are people who implement techniques to deal with spam. The search engines that search engineers work with cover the entire range mentioned in the last section: they primarily use open source and enterprise search engines for application development, but also get the most out of desktop and web search engines.
The importance and pervasiveness of search in modern computer applications has meant that search engineering has become a crucial profession in the computer industry. There are, however, very few courses being taught in computer science departments that give students an appreciation of the variety of issues that are involved, especially from the information retrieval perspective. This book is intended to give potential search engineers the understanding and tools they need.
References and Further Reading
In each chapter, we provide some pointers to papers and books that give more detail on the topics that have been covered. This additional reading should not be necessary to understand material that has been presented, but instead will give more background, more depth in some cases, and, for advanced topics, will describe techniques and research results that are not covered in this book.
The classic references on information retrieval, in our opinion, are the books by Salton (1968, 1983) and van Rijsbergen (1979). Van Rijsbergen’s book remains popular, since it is available on the Web.13 All three books provide excellent descriptions of the research done in the early years of information retrieval, up to the late 1970s. Salton’s early book was particularly important in terms of defining the field of information retrieval for computer science. More recent books include Baeza-Yates and Ribeiro-Neto (1999) and Manning et al. (2008).

13 http://www.dcs.gla.ac.uk/Keith/Preface.html
Research papers on all the topics covered in this book can be found in the Proceedings of the Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference. These proceedings are available on the Web as part of the ACM Digital Library.14 Good papers on information retrieval and search also appear in the European Conference on Information Retrieval (ECIR), the Conference on Information and Knowledge Management (CIKM), and the Web Search and Data Mining Conference (WSDM). The WSDM conference is a spin-off of the World Wide Web Conference (WWW), which has included some important papers on web search. The proceedings from the TREC workshops are available online and contain useful descriptions of new research techniques from many different academic and industry groups. An overview of the TREC experiments can be found in Voorhees and Harman (2005). An increasing number of search-related papers are beginning to appear in database conferences, such as VLDB and SIGMOD. Occasional papers also show up in language technology conferences, such as ACL and HLT (Association for Computational Linguistics and Human Language Technologies), machine learning conferences, and others.
Exercises
1.1. Think up and write down a small number of queries for a web search engine. Make sure that the queries vary in length (i.e., they are not all one word). Try to specify exactly what information you are looking for in some of the queries. Run these queries on two commercial web search engines and compare the top 10 results for each query by doing relevance judgments. Write a report that answers at least the following questions: What is the precision of the results? What is the overlap between the results for the two search engines? Is one search engine clearly better than the other? If so, by how much? How do short queries perform compared to long queries?
1.2. Site search is another common application of search engines. In this case, search is restricted to the web pages at a given website. Compare site search to web search, vertical search, and enterprise search.

14 http://www.acm.org/dl
1.3. Use the Web to find as many examples as you can of open source search engines, information retrieval systems, or related technology. Give a brief description of each search engine and summarize the similarities and differences between them.
1.4. List five web services or sites that you use that appear to use search, not including web search engines. Describe the role of search for that service. Also describe whether the search is based on a database or grep style of matching, or if the search is using some type of ranking.
2 Architecture of a Search Engine

“While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant.”
The Architect, Matrix Reloaded
2.1 What Is an Architecture?
In this chapter, we describe the basic software architecture of a search engine. Although there is no universal agreement on the definition, a software architecture generally consists of software components, the interfaces provided by those components, and the relationships between them. An architecture is used to describe a system at a particular level of abstraction. An example of an architecture used to provide a standard for integrating search and related language technology components is UIMA (Unstructured Information Management Architecture).1 UIMA defines interfaces for components in order to simplify the addition of new technologies into systems that handle text and other unstructured data.

Our search engine architecture is used to present high-level descriptions of the important components of the system and the relationships between them. It is not a code-level description, although some of the components do correspond to software modules in the Galago search engine and other systems. We use this architecture in this chapter and throughout the book to provide context to the discussion of specific techniques.
An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are:

• Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possible.

1 http://www.research.ibm.com/UIMA
We may have more specific goals, too, but usually these fall into the categories of effectiveness or efficiency (or both). For instance, the collection of documents we want to search may be changing; making sure that the search engine immediately reacts to changes in documents is both an effectiveness issue and an efficiency issue.

The architecture of a search engine is determined by these two requirements. Because we want an efficient system, search engines employ specialized data structures that are optimized for fast retrieval. Because we want high-quality results, search engines carefully process text and store text statistics that help improve the relevance of results.

Many of the components we discuss in the following sections have been used for decades, and this general design has been shown to be a useful compromise between the competing goals of effective and efficient retrieval. In later chapters, we will discuss these components in more detail.
2.2 Basic Building Blocks
Search engine components support two major functions, which we call the indexing process and the query process. The indexing process builds the structures that enable searching, and the query process uses those structures and a person’s query to produce a ranked list of documents. Figure 2.1 shows the high-level “building blocks” of the indexing process. These major components are text acquisition, text transformation, and index creation.
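A deliberately tiny sketch of how these three components fit together (the method names here are illustrative only and do not correspond to Galago’s actual API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class IndexingPipeline {
        // Index creation: a toy inverted index mapping each index term
        // to the identifiers of the documents that contain it.
        static Map<String, List<Integer>> invertedIndex = new HashMap<>();

        // Text acquisition (stub): a real system would crawl or scan the Web,
        // an intranet, or a desktop; here we just take the text directly.
        static String acquire(String rawDocument) {
            return rawDocument;
        }

        // Text transformation: convert a document into index terms.
        static String[] transform(String text) {
            return text.toLowerCase().split("\\W+");
        }

        static void index(int docId, String rawDocument) {
            for (String term : transform(acquire(rawDocument))) {
                invertedIndex.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
        }

        public static void main(String[] args) {
            index(1, "Tropical fish include fish found in tropical environments");
            index(2, "Fish tanks for the home");
            System.out.println(invertedIndex.get("fish"));
            // [1, 1, 2] — document 1 appears twice because "fish" occurs twice in it
        }
    }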
The task of the text acquisition component is to identify and make available the documents that will be searched. Although in some cases this will involve simply using an existing collection, text acquisition will more often require building a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information. In addition to passing documents to the next component in the indexing process, the text acquisition component creates a document data store, which contains the text and metadata for all the documents. Metadata is information about a document that is not part of the text content, such as the document type (e.g., email or web page), document structure, and other features, such as document length.
The text transformation component transforms documents into index terms or features. Index terms, as the name implies, are the parts of a document that are stored in the index and used in searching. The simplest index term is a word, but not every word may be used for searching. A “feature” is more often used in
Fig. 2.1. The indexing process (inputs: email, web pages, news articles, memos, letters; components: Text Acquisition, Text Transformation, Index Creation)