W. Bruce Croft, Donald Metzler, Trevor Strohman

Search Engines: Information Retrieval in Practice

©W.B. Croft, D. Metzler, T. Strohman, 2015. This book was previously published by: Pearson Education, Inc.

Search Engines
Information Retrieval in Practice
©W.B. Croft, D. Metzler, T. Strohman, 2015. This book was previously published by: Pearson Education, Inc.
Preface
This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,1 but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.
The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.
The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.

1 In keeping with common usage, most uses of the word “web” in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.
Contents
In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user’s information need.
Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.
Information retrieval theory and the design, implementation, evaluation, and use of search engines cover too many topics to describe them all in depth in one book. We have tried to focus on the most important topics while giving some coverage to all aspects of this challenging and rewarding subject.
Supplements
A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:
• Extensive lecture slides (in PDF and PPT format)
• Solutions to selected end-of-chapter problems (instructors only)
• Test collections for exercises
• Galago search engine
The supplements are available at www.search-engines-book.com.
Acknowledgments
First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.
Bruce Croft
Donald Metzler
Trevor Strohman
2015 Update
This version of the book is being made available for free download. It has been edited to correct the minor errors noted in the 5 years since the book’s publication. The authors, meanwhile, are working on a second edition.
1 Search Engines and Information Retrieval   . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What Is Information Retrieval? . . . 1  1.2 The Big Issues . . . 4  1.3 Search Engines . . . 6  1.4 Search Engineers . . . 9
2 Architecture of a Search Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1 What Is an Architecture? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Breaking It Down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Text Acquisition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2 Text Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Index Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.4 User Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3.5 Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 How Does It Really Work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Crawls and Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1 Deciding What to Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Crawling the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Retrieving Web Pages . . . 33  3.2.2 The Web Crawler . . . 35
3.2.3 Freshness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.4 Focused Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.5 Deep Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.6 Sitemaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.7 Distributed Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Crawling Documents and Email . . . 46  3.4 Document Feeds . . . 47  3.5 The Conversion Problem . . . 49
3.5.1 Character Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Storing the Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6.1 Using a Database System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.6.2 Random Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.6.3 Compression and Large Files . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.6.4 Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.6.5 BigTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 Detecting Duplicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.8 Removing Noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Processing Text   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.1 From Words to Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.2 Text Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.1 Vocabulary Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.2 Estimating Collection and Result Set Sizes . . . . . . . . . . . . . . 83
4.3 Document Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.3.1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.3.2 Tokenizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.3.3 Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.4 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.3.5 Phrases and N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Document Structure and Markup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.5 Link Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5.1 Anchor Text . . . 105  4.5.2 PageRank . . . 105  4.5.3 Link Quality . . . 111
4.6 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4.6.1 Hidden Markov Models for Extraction . . . . . . . . . . . . . . . . . 115
4.7 Internationalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5 Ranking with Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.1 Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.2 Abstract Model of Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.3 Inverted Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.1 Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.3.2 Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.3.3 Positions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.3.4 Fields and Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 5.3.5 Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.3.6 Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.4 Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 5.4.1 Entropy and Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.4.2 Delta Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.4.3 Bit-Aligned Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.4.4 Byte-Aligned Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.4.5 Compression in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.4.6 Looking Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.4.7 Skipping and Skip Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5 Auxiliary Structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 5.6 Index Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.6.1 Simple Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5.6.2 Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.6.3 Parallelism and Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 158 5.6.4 Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.7 Query Processing . . . 165  5.7.1 Document-at-a-time Evaluation . . . 166  5.7.2 Term-at-a-time Evaluation . . . 168  5.7.3 Optimization Techniques . . . 170  5.7.4 Structured Queries . . . 178  5.7.5 Distributed Evaluation . . . 180  5.7.6 Caching . . . 181
6 Queries and Interfaces . . . 187  6.1 Information Needs and Queries . . . 187
6.2 Query Transformation and Refinement . . . 190  6.2.1 Stopping and Stemming Revisited . . . 190  6.2.2 Spell Checking and Suggestions . . . 193
6.2.3 Query Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 6.2.4 Relevance Feedback. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 6.2.5 Context and Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.3 Showing the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 6.3.1 Result Pages and Snippets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 6.3.2 Advertising and Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218 6.3.3 Clustering the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.4 Cross-Language Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7 Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 7.1 Overview of Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.1.1 Boolean Retrieval . . . 235  7.1.2 The Vector Space Model . . . 237
7.2 Probabilistic Models . . . 243  7.2.1 Information Retrieval as Classification . . . 244  7.2.2 The BM25 Ranking Algorithm . . . 250
7.3 Ranking Based on Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 252 7.3.1 Query Likelihood Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 254 7.3.2 Relevance Models and Pseudo-Relevance Feedback . . . . . . 261
7.4 Complex Queries and Combining Evidence . . . 267  7.4.1 The Inference Network Model . . . 268  7.4.2 The Galago Query Language . . . 273
7.5 Web Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 7.6 Machine Learning and Information Retrieval. . . . . . . . . . . . . . . . . . 283
7.6.1 Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 7.6.2 Topic Models and Vocabulary Mismatch . . . . . . . . . . . . . . . . 288
7.7 Application-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
8 Evaluating Search Engines . . . 297  8.1 Why Evaluate? . . . 297  8.2 The Evaluation Corpus . . . 299  8.3 Logging . . . 305  8.4 Effectiveness Metrics . . . 308
8.4.1 Recall and Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 8.4.2 Averaging and Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 313 8.4.3 Focusing on the Top Documents . . . . . . . . . . . . . . . . . . . . . . 318 8.4.4 Using Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8.5 Efficiency Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322 8.6 Training, Testing, and Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.6.1 Significance Tests . . . 325  8.6.2 Setting Parameter Values . . . 330  8.6.3 Online Testing . . . 332
8.7 The Bottom Line . . . 333
9 Classification and Clustering . . . 339  9.1 Classification and Categorization . . . 340
9.1.1 Naïve Bayes . . . 342  9.1.2 Support Vector Machines . . . 351  9.1.3 Evaluation . . . 359  9.1.4 Classifier and Feature Selection . . . 359
9.1.5 Spam, Sentiment, and Online Advertising . . . 364  9.2 Clustering . . . 373  9.2.1 Hierarchical and K-Means Clustering . . . 375  9.2.2 K Nearest Neighbor Clustering . . . 384  9.2.3 Evaluation . . . 386  9.2.4 How to Choose K . . . 387  9.2.5 Clustering and Search . . . 389
10 Social Search   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 10.1 What Is Social Search? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 10.2 User Tags and Manual Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.2.1 Searching Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 10.2.2 Inferring Missing Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 10.2.3 Browsing and Tag Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
10.3 Searching with Communities . . . 408  10.3.1 What Is a Community? . . . 408  10.3.2 Finding Communities . . . 409  10.3.3 Community-Based Question Answering . . . 415  10.3.4 Collaborative Searching . . . 420
10.4 Filtering and Recommending . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 10.4.1 Document Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 10.4.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.5 Peer-to-Peer and Metasearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438 10.5.1 Distributed Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
10.5.2 P2P Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
11 Beyond Bag of Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 11.2 Feature-Based Retrieval Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452 11.3 Term Dependence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454 11.4 Structure Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.4.1 XML Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461 11.4.2 Entity Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
11.5 Longer Questions, Better Answers . . . 466  11.6 Words, Pictures, and Music . . . 470  11.7 One Search Fits All? . . . 479
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487 Index   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
List of Figures
1.1 Search engine design and the core information retrieval issues . . . 9
2.1 The indexing process . . . 15  2.2 The query process . . . 16
3.1 A uniform resource locator (URL), split into three parts . . . 33  3.2 Crawling the Web. The web crawler connects to web servers to
find pages. Pages may link to other pages on the same server or on different servers. . . . 34
3.3 An example robots.txt file . . . 36  3.4 A simple crawling thread implementation . . . 37  3.5 An HTTP HEAD request and server response . . . 38  3.6 Age and freshness of a single page over time . . . 39  3.7 Expected age of a page with mean change frequency λ = 1/7
(one week) . . . 40  3.8 An example sitemap file . . . 43  3.9 An example RSS 2.0 feed . . . 48  3.10 An example of text in the TREC Web compound document
format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.11 An example link with anchor text . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.12 BigTable stores data in a single logical table, which is split into
many smaller tablets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.13 A BigTable row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.14 Example of fingerprinting process . . . 62  3.15 Example of simhash fingerprinting process . . . 64  3.16 Main content block in a web page . . . 65
3.17 Tag counts used to identify text blocks in a web page . . . . . . . . . . . 66 3.18 Part of the DOM structure for the example web page . . . . . . . . . . 67
4.1 Rank versus probability of occurrence for words assuming  Zipf ’s law (rank × probability = 0.1) . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 A log-log plot of Zipf’s law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks. . . . 79
4.3 Vocabulary growth for the TREC AP89 collection compared to Heaps’ law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Vocabulary growth for the TREC GOV2 collection compared to Heaps’ law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5 Result size estimate for web search . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.6 Comparison of stemmer output for a TREC query. Stopwords have also been removed. . . . 95  4.7 Output of a POS tagger for a TREC query . . . 98  4.8 Part of a web page from Wikipedia . . . 102  4.9 HTML source for example Wikipedia page . . . 103  4.10 A sample “Internet” consisting of just three web pages. The
arrows denote links between the pages. . . . . . . . . . . . . . . . . . . . . . . 108 4.11 Pseudocode for the iterative PageRank algorithm. . . . . . . . . . . . . . 110 4.12 Trackback links in blog postings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 4.13 Text tagged by information extraction . . . . . . . . . . . . . . . . . . . . . . . 114 4.14 Sentence model for statistical entity extractor . . . . . . . . . . . . . . . . . 116
4.15 Chinese segmentation and bigrams . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.1 The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores . . . 127
5.2 A more concrete model of ranking. Notice how both the query  and the document have feature functions in this model. . . . . . . . . 128
5.3 An inverted index for the documents (sentences) in Table 5.1 . . . 132 5.4 An inverted index, with word counts, for the documents in
Table 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.5 An inverted index, with word positions, for the documents in
Table 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.6 Aligning posting lists for “tropical” and “fish” to find the phrase “tropical fish” . . . 136
5.7 Aligning posting lists for “fish” and title to find matches of the word “fish” in the title field of a document . . . 138
5.8 Pseudocode for a simple indexer . . . 157  5.9 An example of index merging. The first and second indexes are
merged together to produce the combined index. . . . 158  5.10 MapReduce . . . 161  5.11 Mapper for a credit card summing algorithm . . . 162  5.12 Reducer for a credit card summing algorithm . . . 162  5.13 Mapper for documents . . . 163  5.14 Reducer for word postings . . . 164  5.15 Document-at-a-time query evaluation. The numbers (x:y)
represent a document number (x) and a word count (y). . . . . . . . 166 5.16 A simple document-at-a-time retrieval algorithm . . . . . . . . . . . . . . 167
5.17 Term-at-a-time query evaluation . . . 168  5.18 A simple term-at-a-time retrieval algorithm . . . 169  5.19 Skip pointers in an inverted list. The gray boxes show skip
 pointers, which point into the white boxes, which are inverted list postings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.20 A term-at-a-time retrieval algorithm with conjunctive processing 173 5.21 A document-at-a-time retrieval algorithm with conjunctive
processing . . . 174  5.22 MaxScore retrieval with the query “eucalyptus tree”. The gray
boxes indicate postings that can be safely ignored during scoring. 176 5.23 Evaluation tree for the structured query #combine(#od:1(tropical
fish) #od:1(aquarium fish) fish) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1 Top ten results for the query “tropical fish” . . . 209  6.2 Geographic representation of Cape Cod using bounding
rectangles . . . 214  6.3 Typical document summary for a web search . . . 215  6.4 An example of a text span of words (w) bracketed by significant
words (s) using Luhn’s algorithm . . . 216  6.5 Advertisements displayed by a search engine for the query “fish
tanks” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 6.6 Clusters formed by a search engine from top-ranked documents
for the query “tropical fish”. Numbers in brackets are the number of documents in the cluster. . . . 222
6.7 Categories returned for the query “tropical fish” in a popular
6.8 Subcategories and facets for the “Home & Garden” category . . . . 225 6.9 Cross-language search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 6.10 A French web page in the results list for the query “pecheur
france” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.1 Term-document matrix for a collection of four documents . . . . . . 239 7.2 Vector representation of documents and queries . . . . . . . . . . . . . . . 240 7.3 Classifying a document as relevant or non-relevant . . . . . . . . . . . . 245 7.4 Example inference network model . . . . . . . . . . . . . . . . . . . . . . . . . . 269 7.5 Inference network with three nodes . . . . . . . . . . . . . . . . . . . . . . . . . 271 7.6 Galago query for the dependence model . . . . . . . . . . . . . . . . . . . . . 282
7.7 Galago query for web data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282 8.1 Example of a TREC topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 8.2 Recall and precision values for two rankings of six relevant
documents . . . 311  8.3 Recall and precision values for rankings from two different queries . . . 314  8.4 Recall-precision graphs for two queries . . . 315  8.5 Interpolated recall-precision graphs for two queries . . . 316  8.6 Average recall-precision graph using standard recall levels . . . 317  8.7 Typical recall-precision graph for 50 queries from TREC . . . 318  8.8 Probability distribution for test statistic values assuming the
null hypothesis. The shaded area is the region of rejection for a one-sided test. . . . 327  8.9 Example distribution of query effectiveness improvements . . . 335
9.1 Illustration of how documents are represented in the multiple- Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a
 vocabulary that consists of the terms “cheap”, “buy”, “banking”, “dinner”, and “the”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
9.2 Illustration of how documents are represented in the multinomial event space. In this example, there are 10
documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms “cheap”, “buy”, “banking”, “dinner”, and “the”. . . . 349
9.3 Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not. . . . 352
9.4 Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles. . . . 353
9.5 Generative process used by the Naïve Bayes model. First, a class is chosen according to P (c), and then a document is chosen according to P (d|c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
9.6 Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and
negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear. . . . 361
9.7 Example output of SpamAssassin email spam filter . . . 365  9.8 Example of web page spam, showing the main page and some
of the associated term and link spam . . . . . . . . . . . . . . . . . . . . . . . . . 367 9.9 Example product review incorporating sentiment . . . . . . . . . . . . . 370 9.10 Example semantic class match between a web page about
rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes “Aquariums”, “Fish”, and “Supplies” are example nodes within a semantic hierarchy. The web page is classified as “Aquariums - Fish” and the ad is classified as “Supplies - Fish”. Here, “Aquariums” is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity. . . . 372
9.11 Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters. . . . 376
9.12 Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters. . . . 377
9.13 Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value. . . . . . . . . . . . . 379
9.15 Illustration of how various clustering cost functions are computed . . . 381  9.16 Example of overlapping clustering using nearest neighbor
clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly. . . . 385
9.17 Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly. . . . 388
9.18 Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents. . . . 390
10.1 Search results used to enrich a tag representation. In this example, the tag being expanded is “tropical fish”. The query “tropical fish” is run against a search engine, and the snippets returned are then used to generate a distribution over related terms. . . . 403
10.2 Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity. . . . 407
10.3 Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm. . . . 412
10.4 Example of how nodes within a directed graph can be represented as vectors. For a given node  p, its vector representation has component q  set to 1 if  p → q . . . . . . . . . . . . . . 413
10.5 Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time. . . . 421
10.6 Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. . . . 425
10.7 Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs). . . . 428
10.8 A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks. . . . 434
10.9 Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users. . . . 435
10.10 Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged. . . . 439
10.11 Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
10.12 Neighborhoods (Ni) of a hub node (H) in a hierarchical P2P network . . . 445
11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence (bottom right) . . . 455
11.2 Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query “hubble telescope achievements” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.3 Functions provided by a search engine interacting with a simple database system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
11.4 Example of an entity search for organizations using the TREC Wall Street Journal  1987 Collection . . . . . . . . . . . . . . . . . . . . . . . . . 464
11.5 Question answering system architecture . . . 467
11.6 Examples of OCR errors . . . 472  11.7 Examples of speech recognizer errors . . . 473  11.8 Two images (a fish and a flower bed) with color histograms.
The horizontal axis is hue value. . . . 474  11.9 Three examples of content-based image retrieval. The collection
for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query. . . . 475
11.10 Key frames extracted from a TREC video clip . . . 476  11.11 Examples of automatic text annotation of images . . . 477  11.12 Three representations of Bach’s “Fugue #10”: audio, MIDI, and
conventional music notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
List of Tables

3.1 UTF-8 encoding . . . 51
4.1 Statistics for the AP89 collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2 Most frequent 50 words from AP89 . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.3 Low-frequency words from AP89 . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.4 Example word frequency ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5 Proportions of words occurring n times in 336,310 documents
from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209. . . . 80
4.6 Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N ) is 25,205,179. . . . . . . . . . . . . . . . . . 84
4.7 Examples of errors made by the original Porter stemmer. False  positives are pairs of words that have the same stem. False negatives are pairs that have different stems. . . . . . . . . . . . . . . . . . . 93
4.8 Examples of words with the Arabic root ktb . . . . . . . . . . . . . . . . . . 96 4.9 High-frequency noun phrases from a TREC collection and
U.S. patents from 1996 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.10 Statistics for the Google n-gram sample . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Four sentences from the Wikipedia entry for tropical fish . . . 132
5.2 Elias-γ code examples . . . 146
5.3 Elias-δ code examples . . . 147  5.4 Space requirements for numbers encoded in v-byte . . . 149
5.5 Sample encodings for v-byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.6 Skip lengths (k) and expected processing steps . . . . . . . . . . . . . . . . 152
6.1 Partial entry for the Medical Subject (MeSH) Heading “Neck  Pain” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.2 Term association measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 6.3 Most strongly associated words for “tropical” in a collection of 
TREC news stories. Co-occurrence counts are measured at the document level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.4 Most strongly associated words for “fish” in a collection of TREC news stories. Co-occurrence counts are measured at the document level. . . . 205
6.5 Most strongly associated words for “fish” in a collection of
TREC news stories. Co-occurrence counts are measured in windows of five words. . . . 205
7.1 Contingency table of term occurrences for a particular query . . . 248 7.2 BM25 scores for an example document . . . . . . . . . . . . . . . . . . . . . . 252 7.3 Query likelihood scores for an example document . . . . . . . . . . . . . 260 7.4 Highest-probability terms from relevance model for four
example queries (estimated using top 10 documents) . . . . . . . . . . . 266 7.5 Highest-probability terms from relevance model for four
example queries (estimated using top 50 documents) . . . . . . . . . . . 267 7.6 Conditional probabilities for example network . . . . . . . . . . . . . . . 272
7.7 Highest-probability terms from four topics in LDA model . . . 290
8.1 Statistics for three example text collections. The average number
of words per document is calculated without stemming. . . . 301  8.2 Statistics for queries from example text collections . . . 301  8.3 Sets of documents defined by a simple search with binary
relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309 8.4 Precision values at standard recall levels calculated using 
interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 8.5 Denitions of some important efficiency metrics . . . . . . . . . . . . . . 323 8.6 Articial effectiveness data for two retrieval algorithms (A and
B) over 10 queries. e column B – A gives the difference ineffectiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
9.1 A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given. . . 357
10.1 Example questions submitted to Yahoo! Answers . . . . . . . . . . . . . . 416 10.2 Translations automatically learned from a set of question and
answer pairs. The 10 most likely translations for the terms “everest”, “xp”, and “search” are given. . . . 419
10.3 Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given. . . . 430
10.4 Contingency table for the possible outcomes of a filtering system. Here, TP (true positive) is the number of relevant documents retrieved, FN (false negative) is the number of relevant documents not retrieved, FP (false positive) is the
number of non-relevant documents retrieved, and TN (true negative) is the number of non-relevant documents not retrieved. . . . 431
11.1 Most likely one- and two-word concepts produced using latent concept expansion with the top 25 documents retrieved for the query “hubble telescope achievements” on the TREC ROBUST collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
11.2 Example TREC QA questions and their corresponding  question categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
1 Search Engines and Information Retrieval

“Mr. Helpmann, I’m keen to get into Information Retrieval.”
Sam Lowry, Brazil 
1.1 What Is Information Retrieval?
This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval.1

So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book.
Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):
Information retrieval is a eld concerned with the structure, analysis, or- ganization, storage, searching, and retrieval of information.
Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate. The term “information” is very general, and information retrieval includes work on a wide range of types of information and a variety of applications related to search.

1 Information retrieval is often abbreviated as IR. In this book, we mostly use the full term. This has nothing to do with the fact that many people think IR means “infrared” or something else.
The primary focus of the field since the 1950s has been on text and text documents. Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author, date, and abstract information associated with the content of papers that appear in scientific journals. The elements of this structure are called attributes, or fields, when referring to database records. The important distinction between a document and a typical database record, such as a bank account record or a flight reservation, is that most of the information in the document is in the form of text, which is relatively unstructured.
To illustrate this difference, consider the information contained in two typical attributes of an account record, the account number and current balance. Both are very well defined, both in terms of their format (for example, a six-digit integer for an account number and a real number with two decimal places for balance) and their meaning. It is very easy to compare values of these attributes, and consequently it is straightforward to implement algorithms to identify the records that satisfy queries such as “Find account number 321456” or “Find accounts with balances greater than $50,000.00”.
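To make the contrast concrete, the following is a minimal sketch of the kind of exact-match processing such queries need. It is our illustration rather than anything from the book, and the Account record with just these two attributes is a hypothetical simplification, not a real banking schema.

// Hypothetical sketch: well-defined attributes can be compared exactly or
// numerically, so answering structured queries over records is straightforward.
import java.util.List;

public class StructuredQueryExample {

    record Account(int accountNumber, double balance) {}

    // "Find account number 321456": exact match on a well-defined integer field.
    static List<Account> byNumber(List<Account> accounts, int number) {
        return accounts.stream()
                       .filter(a -> a.accountNumber() == number)
                       .toList();
    }

    // "Find accounts with balances greater than $50,000.00": numeric comparison.
    static List<Account> byMinBalance(List<Account> accounts, double min) {
        return accounts.stream()
                       .filter(a -> a.balance() > min)
                       .toList();
    }

    public static void main(String[] args) {
        List<Account> accounts = List.of(
            new Account(321456, 1200.50),
            new Account(987654, 75000.00));
        System.out.println(byNumber(accounts, 321456));       // the single account 321456
        System.out.println(byMinBalance(accounts, 50000.00)); // the account with balance 75,000.00
    }
}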
Now consider a news story about the merger of two banks. The story will have some attributes, such as the headline and source of the story, but the primary content is the story itself. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most of the queries submitted to a web search engine such as Google2 that relate to this story will be of the form “bank merger” or “bank takeover”. To do this search, we must design algorithms that can compare the text of the queries with the text of the story and decide whether the story contains the information that is being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much more difficult than defining an account number, and consequently comparing text is not easy. Understanding and modeling how people compare texts, and designing computer algorithms to accurately perform this comparison, is at the core of information retrieval.
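By contrast, the obvious database-style treatment of text, sketched below, simply counts which query words occur in the story. This is again our own illustration, not a method from the book: it scores “bank merger” reasonably, but gives “bank takeover” little credit even though the story is about exactly that, which is one reason text comparison needs the more sophisticated models developed in later chapters.

// Hypothetical sketch: naive word-overlap comparison between a query and a
// news story. Exact matching on words misses documents that describe the
// same event with different vocabulary ("merger" vs. "takeover").
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NaiveTextMatch {

    // Crude tokenization: lowercase the text and split on non-word characters.
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Score = number of distinct query terms that appear anywhere in the document.
    static long overlap(String query, String document) {
        Set<String> docTerms = terms(document);
        return terms(query).stream().filter(docTerms::contains).count();
    }

    public static void main(String[] args) {
        String story = "The two banks announced a merger that creates "
                     + "the largest bank in the region.";
        System.out.println(overlap("bank merger", story));    // 2: both query words occur in the story
        System.out.println(overlap("bank takeover", story));  // 1: "takeover" never occurs, only "bank" does
    }
}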
Increasingly, applications of information retrieval involve multimedia documents with structure, significant text content, and other media. Popular information media include pictures, video, and audio, including music and speech.
2 http://www.google.com
In some applications, such as in legal support, scanned document images are also important. These media have content that, like text, is difficult to describe and compare. The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example.
In addition to a range of media, information retrieval involves a range of tasks and applications. The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order. Although searching the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains. Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic. Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices. Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on. Any application that involves a collection3 of text or other unstructured information will need to organize and search that information.
Search based on a user query (sometimes called ad hoc search because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering. Filtering or tracking involves detecting stories of interest based on a person’s interests and providing an alert using email or some other mechanism. Classification or categorization uses a defined set of labels or classes
3 The term database is often used to refer to a collection of either structured or unstructured data. To avoid confusion, we mostly use the term document collection (or just collection) for text. However, the terms web database and search engine database are so common that we occasionally use them in this book.
(such as the categories listed in the Yahoo! Directory4) and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific questions, such as “What is the height of Mt. Everest?”. The goal of question answering is to return a specific answer found in the text, rather than a list of documents. Table 1.1 summarizes some of these aspects or dimensions of the field of information retrieval.
4 http://dir.yahoo.com/
Table 1.1. Some dimensions of information retrieval

Examples of Content     Examples of Applications    Examples of Tasks
Text                    Web search                  Ad hoc search
Images                  Vertical search             Filtering
Video                   Enterprise search           Classification
Scanned documents       Desktop search              Question answering
Audio                   Peer-to-peer search
Music
1.2 The Big Issues
Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions
of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text. One of these issues is relevance. Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person’s decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance.
One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.
It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query “severe weather events”. The person who asked the question (often called the user) may not consider the story relevant, however, if she has seen that story before, or if the story is five years old, or if the story is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.
To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of  matching a query and a document. It is the basis of the ranking algorithm that is
used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person
 who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.
An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.
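As a small sketch of ours illustrating this word-count view of text (the sentence is just a toy example):

    // A minimal sketch of the statistical view of text: a document is
    // reduced to counts of word occurrences, ignoring grammar.
    import java.util.Map;
    import java.util.TreeMap;

    public class WordCounts {
        public static void main(String[] args) {
            String text = "to be or not to be";
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                counts.merge(token, 1, Integer::sum);
            }
            System.out.println(counts); // {be=2, not=1, or=1, to=2}
        }
    }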
Another core issue for information retrieval is evaluation. Since the quality of  a document ranking depends on how well it matches a person’s expectations, it
was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure, and is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web
search environment, but with smaller test collections of documents, this measure can be useful. A test collection5 for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC6 evaluation forum.
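To make precision and recall concrete, here is a small worked example of ours (the document identifiers and relevance judgments are invented): if five documents are retrieved, two of which are relevant, and four relevant documents exist in total, precision is 2/5 = 0.4 and recall is 2/4 = 0.5.

    // A small worked example (invented data) of the precision and recall
    // measures described above.
    import java.util.Set;

    public class PrecisionRecallExample {
        public static void main(String[] args) {
            Set<String> retrieved = Set.of("d1", "d2", "d3", "d4", "d5");
            Set<String> relevant  = Set.of("d1", "d3", "d7", "d9");

            long relevantRetrieved =
                retrieved.stream().filter(relevant::contains).count(); // d1 and d3

            double precision = (double) relevantRetrieved / retrieved.size(); // 2/5 = 0.4
            double recall    = (double) relevantRetrieved / relevant.size();  // 2/4 = 0.5

            System.out.printf("precision = %.2f, recall = %.2f%n", precision, recall);
        }
    }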
Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data is strongly correlated with relevance so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.
The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as “cats” could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.
These issues will come up throughout this book, and will be discussed in considerably more detail. We now have sufficient background to start talking about the main product of research in information retrieval—namely, search engines.
1.3 Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections. A web search engine is the obvious example, but as
5 Also known as an evaluation corpus (plural corpora).
6 Text REtrieval Conference—http://trec.nist.gov/
has been mentioned, search engines can be found in many different applications, such as desktop search or enterprise search. Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s. The term “search engine” was originally used to refer to specialized hardware for text search. From the mid-1980s onward, however, it gradually came to be used in preference to “information retrieval system” as the name for the software system that compares queries to documents and produces ranked result lists of documents. There is much more to a search engine than the ranking algorithm, of course, and we will discuss the general architecture of these systems in the next chapter.
Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!,7 must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. Enterprise search engines—for example, Autonomy8—must be able to process the large variety of information sources in a company and use company-specific knowledge as part of search and related tasks, such as data mining. Data mining refers to the automatic discovery of interesting structure in data and includes techniques such as clustering. Desktop search engines, such as the Microsoft Vista™ search feature, must be able to rapidly incorporate new documents, web pages, and email as the person creates or looks at them, as well as provide an intuitive interface for searching this very heterogeneous mix of information. There is overlap between these categories with systems such as Google, for example, which is available in configurations for enterprise and desktop search.
Open source search engines are another important class of systems that have somewhat different design goals than the commercial search engines. There are a number of these systems, and the Wikipedia page for information retrieval9 provides links to many of them. Three systems of particular interest are Lucene,10 Lemur,11 and the system provided with this book, Galago.12 Lucene is a popular Java-based search engine that has been used for a wide range of commercial applications. The information retrieval techniques that it uses are relatively simple.
7 http://www.yahoo.com
8 http://www.autonomy.com
9 http://en.wikipedia.org/wiki/Information_retrieval
10 http://lucene.apache.org
11 http://www.lemurproject.org
12 http://www.search-engines-book.com
Lemur is an open source toolkit that includes the Indri C++-based search engine. Lemur has primarily been used by information retrieval researchers to compare advanced search techniques. Galago is a Java-based search engine that is based on the Lemur and Indri projects. The assignments in this book make extensive use of Galago. It is designed to be fast, adaptable, and easy to understand, and incorporates very effective information retrieval techniques.
The “big issues” in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this book.
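As a very rough sketch of ours (the class and method names are invented; this is not Galago's implementation), the core idea of an index is a map from each word to the documents containing it, so a query only has to touch the lists for its own words:

    // A minimal sketch of an inverted index: a map from each word to the
    // identifiers of the documents containing it. Production indexes use
    // compressed, disk-based structures rather than in-memory maps.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    public class TinyInvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        void add(int docId, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }

        Set<Integer> lookup(String word) {
            return postings.getOrDefault(word.toLowerCase(), Set.of());
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.add(1, "bank merger announced");
            index.add(2, "severe weather events in Kansas");
            System.out.println(index.lookup("merger")); // [1]
        }
    }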
Another important performance measure is how fast new data can be incorporated into the indexes. Search applications typically deal with dynamic, constantly changing information. Coverage measures how much of the existing information in, say, a corporate information environment has been indexed and stored in the search engine, and recency or freshness measures the “age” of the stored information.
Search engines can be used with small collections, such as a few hundred emails and documents on a desktop, or extremely large collections, such as the entire Web. There may be only a few users of a given application, or many thousands. Scalability is clearly an important issue for search engine design. Designs that work for a given application should continue to work as the amount of data and the number of users grow. In section 1.1, we described how search engines are used in many applications and for many tasks. To do this, they have to be customizable or adaptable. This means that many different aspects of the search engine, such as the ranking algorithm, the interface, or the indexing strategy, must be able to be tuned and adapted to the requirements of the application.
Practical issues that impact search engine design also occur for specific applications. The best example of this is spam in web search. Spam is generally thought of as unwanted email, but more generally it could be defined as misleading, inappropriate, or non-relevant information in a document that is designed for some
[Fig. 1.1. Search engine design and the core information retrieval issues]
Based on this discussion of the relationship between information retrieval and search engines, we now consider what roles computer scientists and others play in the design and use of search engines.
1.4 Search Engineers
Information retrieval research involves the development of mathematical models of text and language, large-scale experiments with test collections or users, and a lot of scholarly paper writing. For these reasons, it tends to be done by academics or people in research laboratories. These people are primarily trained in computer science, although information science, mathematics, and, occasionally, social science and computational linguistics are also represented. So who works
with search engines? To a large extent, it is the same sort of people but with a more practical emphasis. The computing industry has started to use the term search engineer to describe this type of person. Search engineers are primarily people trained in computer science, mostly with a systems or database background. Surprisingly few of them have training in information retrieval, which is one of the major motivations for this book.
What is the role of a search engineer? Certainly the people who work in the major web search companies designing and implementing new search engines are search engineers, but the majority of search engineers are the people who modify, extend, maintain, or tune existing search engines for a wide range of commercial applications. People who design or “optimize” content for search engines are also search engineers, as are people who implement techniques to deal with spam. The search engines that search engineers work with cover the entire range mentioned
in the last section: they primarily use open source and enterprise search engines for application development, but also get the most out of desktop and web search engines.
The importance and pervasiveness of search in modern computer applications has meant that search engineering has become a crucial profession in the computer industry. There are, however, very few courses being taught in computer science departments that give students an appreciation of the variety of issues that are involved, especially from the information retrieval perspective. This book is intended to give potential search engineers the understanding and tools they need.
References and Further Reading
In each chapter, we provide some pointers to papers and books that give more detail on the topics that have been covered. This additional reading should not be necessary to understand material that has been presented, but instead will give more background, more depth in some cases, and, for advanced topics, will describe techniques and research results that are not covered in this book.
The classic references on information retrieval, in our opinion, are the books by Salton (1968; 1983) and van Rijsbergen (1979). Van Rijsbergen’s book remains popular, since it is available on the Web.13 All three books provide excellent descriptions of the research done in the early years of information retrieval, up to the late 1970s.
13 http://www.dcs.gla.ac.uk/Keith/Preface.html
Salton’s early book was particularly important in terms of defining the field of information retrieval for computer science. More recent books include Baeza-Yates and Ribeiro-Neto (1999) and Manning et al. (2008).
Research papers on all the topics covered in this book can be found in the Proceedings of the Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) Conference. These proceedings are available on the Web as part of the ACM Digital Library.14 Good papers on information retrieval and search also appear in the European Conference on Information Retrieval (ECIR), the Conference on Information and Knowledge Management (CIKM), and the Web Search and Data Mining Conference (WSDM). The WSDM conference is a spin-off of the World Wide Web Conference (WWW), which has included some important papers on web search. The proceedings from the TREC workshops are available online and contain useful descriptions of new research techniques from many different academic and industry groups. An overview of the TREC experiments can be found in Voorhees and Harman (2005). An increasing number of search-related papers are beginning to appear in database conferences, such as VLDB and SIGMOD. Occasional papers also show up in language technology conferences, such as ACL and HLT (Association for Computational Linguistics and Human Language Technologies), machine learning conferences, and others.
Exercises
1.1. Think up and write down a small number of queries for a web search engine. Make sure that the queries vary in length (i.e., they are not all one word). Try to specify exactly what information you are looking for in some of the queries. Run these queries on two commercial web search engines and compare the top 10 results for each query by doing relevance judgments. Write a report that answers at least the following questions: What is the precision of the results? What is the overlap between the results for the two search engines? Is one search engine clearly better than the other? If so, by how much? How do short queries perform compared to long queries?
1.2. Site search is another common application of search engines. In this case, search is restricted to the web pages at a given website. Compare site search to
web search, vertical search, and enterprise search.
14 http://www.acm.org/dl
1.3. Use the Web to find as many examples as you can of open source search engines, information retrieval systems, or related technology. Give a brief description of each search engine and summarize the similarities and differences between them.
1.4. List five web services or sites that you use that appear to use search, not including web search engines. Describe the role of search for that service. Also describe whether the search is based on a database or grep style of matching, or if the search is using some type of ranking.
2 Architecture of a Search Engine
“While your first question may be the most pertinent, you may or may not realize it is also the most irrelevant.”
The Architect, Matrix Reloaded
2.1 What Is an Architecture?
In this chapter, we describe the basic software architecture of a search engine. Although there is no universal agreement on the definition, a software architecture generally consists of software components, the interfaces provided by those components, and the relationships between them. An architecture is used to describe a system at a particular level of abstraction. An example of an architecture used to provide a standard for integrating search and related language technology components is UIMA (Unstructured Information Management Architecture).1 UIMA defines interfaces for components in order to simplify the addition of new technologies into systems that handle text and other unstructured data.
Our search engine architecture is used to present high-level descriptions of the important components of the system and the relationships between them. It is not a code-level description, although some of the components do correspond to software modules in the Galago search engine and other systems. We use this architecture in this chapter and throughout the book to provide context to the discussion of specific techniques.
An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are:
• Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possible.
1 http://www.research.ibm.com/UIMA
We may have more specific goals, too, but usually these fall into the categories of effectiveness or efficiency (or both). For instance, the collection of documents we want to search may be changing; making sure that the search engine immediately reacts to changes in documents is both an effectiveness issue and an efficiency issue.
The architecture of a search engine is determined by these two requirements. Because we want an efficient system, search engines employ specialized data structures that are optimized for fast retrieval. Because we want high-quality results, search engines carefully process text and store text statistics that help improve the relevance of results.
Many of the components we discuss in the following sections have been used for decades, and this general design has been shown to be a useful compromise between the competing goals of effective and efficient retrieval. In later chapters,
 we will discuss these components in more detail.
2.2 Basic Building Blocks
Search engine components support two major functions, which we call the indexing process and the query process. The indexing process builds the structures that enable searching, and the query process uses those structures and a person’s query to produce a ranked list of documents. Figure 2.1 shows the high-level “building blocks” of the indexing process. These major components are text acquisition, text transformation, and index creation.
The task of the text acquisition component is to identify and make available the documents that will be searched. Although in some cases this will involve simply using an existing collection, text acquisition will more often require building a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information. In addition to passing documents to the next component in the indexing process, the text acquisition component creates a document data store, which contains the text and metadata for all the documents. Metadata is information about a document that is not part of the text content, such as the document type (e.g., email or web page), document structure, and other features, such as document length.
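As a rough sketch of ours (the field names and values are invented and are not Galago's actual storage format), an entry in a document data store might look like this:

    // A minimal sketch of a stored document: the text plus metadata that
    // is not part of the text content itself.
    import java.util.Map;

    public class DocumentStoreExample {
        record StoredDocument(String id, String text, Map<String, String> metadata) {}

        public static void main(String[] args) {
            StoredDocument doc = new StoredDocument(
                "doc-42",
                "Two banks announced a merger on Monday ...",
                Map.of("type", "web page",
                       "crawlDate", "2019-08-17"));

            System.out.println(doc.metadata().get("type")); // web page
        }
    }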
The text transformation component transforms documents into index terms or features. Index terms, as the name implies, are the parts of a document that are stored in the index and used in searching. The simplest index term is a word, but not every word may be used for searching. A “feature” is more often used in
[Fig. 2.1. The indexing process: input documents (email, web pages, news articles, memos, letters) flow through the Text Acquisition, Text Transformation, and Index Creation components.]