Boolean retrieval & basics of indexing
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Fall 2014
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Boolean retrieval model
Query: Boolean expressions
Boolean queries use AND, OR and NOT to join query terms
Views each doc as a set of words
Term-incidence matrix is sufficient
Shows presence or absence of terms in each doc
Perhaps the simplest model to build an IR system on
Boolean queries: Exact match
In the pure Boolean model, retrieved docs are not ranked
The result is a set of docs
It is a precise, exact match (docs either match the condition or not)
Primary commercial retrieval tool for 3 decades (until the 1990s)
Many search systems you still use are Boolean:
Email, library catalog, Mac OS X Spotlight
Example: Plays of Shakespeare
Which plays of Shakespeare contain the words Brutus
AND Caesar but NOT Calpurnia?
One could scan all of Shakespeare’s plays for Brutus and Caesar, then
strip out those containing Calpurnia
But this solution cannot work for large corpora (it is computationally
expensive)
Efficiency is also an important issue (along with effectiveness)
Index: data structure built on the text to speed up the searches
Example: Plays of Shakespeare
Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1                  1             0          0       0        1
Brutus              1                  1             0          1       0        0
Caesar              1                  1             0          1       1        1
Calpurnia           0                  1             0          0       0        0
Cleopatra           1                  0             0          0       0        0
mercy               1                  0             1          1       1        1
worser              1                  0             1          1       1        0
Incidence vectors
So we have a 0/1 vector for each term.
Brutus AND Caesar but NOT Calpurnia
To answer query: take the vectors for Brutus, Caesar
and Calpurnia (complemented) bitwise AND.
110100 AND 110111 AND 101111 = 100100.
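A minimal Python sketch of this bit-vector approach (the vectors are hardcoded from the incidence matrix above):

# Bit i (from the left) corresponds to the i-th play in the matrix above.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Complement Calpurnia within 6 bits, then bitwise-AND everything.
result = brutus & caesar & (~calpurnia & 0b111111)   # 0b100100

matches = [plays[i] for i in range(6) if result & (1 << (5 - i))]
print(matches)   # ['Antony and Cleopatra', 'Hamlet']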
Answers to query: Brutus AND Caesar AND NOT Calpurnia

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the
Capitol; Brutus killed me.
Bigger collections
Number of docs: N = 10⁶
Average length of a doc ≈ 1,000 words
No. of distinct terms: M = 500,000
Average length of a word ≈ 6 bytes (including spaces/punctuation)
⟹ about 6 GB of data
Can’t build the matrix
A 500K x 1M matrix has half a trillion 0’s and 1’s.
But it has no more than one billion 1’s (at most 10⁶ docs x 1,000 words each),
so the matrix is extremely sparse: at least 99.8% of the cells are zero.
What’s a better representation?
Record only the positions of the 1’s.
Inverted index
For each term t, store a list of all docs that contain t.
Identify each by a docID, a document serial number
Can we use fixed-size arrays for this?
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to doc 14?
Inverted index
We need variable-size postings lists
On disk, a contiguous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion
Dictionary → Postings (sorted by docID; each docID in a list is a posting):

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Inverted index construction

Docs to be indexed:  “Friends, Romans, countrymen.”
    ↓ Tokenizer
Token stream:        Friends  Romans  Countrymen
    ↓ Linguistic modules (more on these later)
Modified tokens:     friend  roman  countryman
    ↓ Indexer
Inverted index:
    friend     → 2 4
    roman      → 1 2
    countryman → 13 16
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Indexer steps: Sort
Sort by terms
and then by docID
Core indexing step
Indexer steps: Dictionary & Postings
Multiple term entries in a single doc are merged.
Split into Dictionary and Postings
Document frequency information is added.
Why frequency?
Will discuss later.
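As a rough Python sketch of these indexer steps (the whitespace tokenization and lowercasing are deliberately naive stand-ins for the linguistic modules):

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok.lower(), doc_id) for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge duplicate (term, docID) pairs into postings lists,
# and record the document frequency for each term.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)

df = {term: len(plist) for term, plist in postings.items()}
print(postings["caesar"], df["caesar"])   # [1, 2] 2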
Where do we pay in storage?
Dictionary: terms and counts
Postings: lists of docIDs
Plus pointers from dictionary entries to their postings lists
A naïve dictionary
An array of structs:
char[20] (term) | int (freq.) | Postings* (pointer to postings)
Dictionary data structures
Two main choices:
Hashtables
Search trees
Some IR systems use hashtables, some trees
Hashtables
Each vocabulary term is hashed to an integer
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants (judgment/judgement)
No prefix search (needed for tolerant retrieval)
If the vocabulary keeps growing, need to occasionally rehash everything
Binary tree
Example: the root splits the dictionary into a-m / n-z; the next level splits those into a-hu / hy-m and n-sh / si-z.
The dictionary search structure stores each term with its frequency and a pointer to its postings list:

Term     Freq.     Postings ptr.
a        656,265   →
aachen   65        →
…        …         …
zulu     221       →
Trees
Simplest: binary tree
More usual: B-trees
Pros:
Solves the prefix problem (e.g., terms starting with hyp*)
Cons:
Slower lookup: O(log M), and this requires a balanced tree
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
The index we just built
So far, we built the index
How do we process a query?
What kinds of queries can we process?
Query processing: AND
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the dictionary;
Retrieve its postings.
Locate Caesar in the dictionary;
Retrieve its postings.
“Merge” (intersect) the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
The merge
Walk through the two postings simultaneously, in time
linear in the total number of postings entries
If list lengths are x and y, merge takes O(x+y) operations.
Crucial: postings are sorted by docID.

Brutus → 2 4 8 41 48 64 128
Caesar → 1 2 3 8 11 17 21 31
Intersection → 2 8
Intersecting two postings lists
(a “merge” algorithm)
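A Python sketch of the merge (postings lists assumed sorted by docID):

def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
# [2, 8]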
Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x + y)?
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we merge in “linear” time for general Boolean
queries?
Linear in what?
Can we do better?
Query optimization
What is the best order for query processing?
Consider a query that is an AND of 𝑛 terms.
For each of the 𝑛 terms, get its postings, then AND
them together.
Query: Brutus AND Calpurnia AND Caesar

Brutus    → 1 2 3 5 8 16 21 34
Caesar    → 2 4 8 16 32 64 128
Calpurnia → 13 16
Query optimization example
Process in order of increasing freq:
start with smallest set, then keep cutting further.
This is why we kept
document freq. in dictionary
Execute the query as (Calpurnia AND Brutus) AND Caesar.
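A sketch of this heuristic in Python, reusing the intersect function from the merge sketch above (the doc frequencies come from the dictionary):

postings = {
    "brutus":    [1, 2, 3, 5, 8, 16, 21, 34],
    "caesar":    [2, 4, 8, 16, 32, 64, 128],
    "calpurnia": [13, 16],
}
df = {t: len(p) for t, p in postings.items()}   # doc freq per term

def intersect_all(terms):
    """AND together all terms' postings, rarest term first."""
    terms = sorted(terms, key=lambda t: df[t])  # increasing doc frequency
    result = postings[terms[0]]
    for t in terms[1:]:
        if not result:        # early exit: intersection already empty
            break
        result = intersect(result, postings[t])
    return result

print(intersect_all(["brutus", "calpurnia", "caesar"]))   # [16]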
More general optimization
Example:
(madding OR crowd) AND (ignoble OR strife)
Get doc frequencies for all terms.
Estimate the size of each OR by the sum of its
doc. freq.’s (conservative).
Process in increasing order of OR sizes.
Exercise
Recommend a query processing order for:

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq.
eyes          213,312
kaleidoscope  87,009
marmalade     107,913
skies         271,658
tangerine     46,653
trees         316,812
Query processing exercises
Exercise: If the query is friends AND romans AND (NOT
countrymen), how could we use the freq of
countrymen?
Exercise: Extend the merge to an arbitrary Boolean query.
Can we always guarantee execution in time linear in the
total postings size?
Hint: Begin with the case of a Boolean formula query
where each term appears only once in the query.
Example of extended Boolean model:
WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal
search service (started 1975; ranking added 1992)
Tens of terabytes of data; 700,000 users
The majority of users still use Boolean queries
Example query:
What is the statute of limitations in cases involving the
federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
/k = within k words, /S = in same sentence
Advantages of exact match
It can be implemented very efficiently
Predictable, easy to explain
precise semantics
Structured queries for pinpointing precise docs
neat formalism
Works well when you know exactly (or roughly) what the collection
contains and what you are looking for
Disadvantages of the Boolean Model
Query formulation (as a Boolean expression) is difficult for most users
Most users pose overly simplistic Boolean queries
AND, OR as opposite extremes in a precision/recall tradeoff
As a consequence, frequently returns either too few or too many docs in response to a user query
Difficulty increases with collection size
Retrieval based on binary decision criteria
No ranking of the docs is provided (absence of a grading scale)
Index term weighting can provide a substantial improvement
Ranking results in advanced IR models
Boolean queries give only inclusion or exclusion of docs;
the result of a query is an unordered set
Modern information retrieval systems are no longer
based on the Boolean model
Often we want to rank/group results
Need to measure proximity from query to each doc.
Text operations
Recall the basic indexing pipeline
Document:            “Friends, Romans, countrymen.”
    ↓ Tokenizer
Token stream:        Friends  Romans  Countrymen
    ↓ Linguistic modules
Modified tokens:     friend  roman  countryman
    ↓ Indexer
Inverted index:      friend → 2 4;  roman → 1 2;  countryman → 13 16
Text operations
Tokenization
Stop word removal
Normalization
Stemming or lemmatization
Equivalence classes
Example 1: case folding
Example 2: using thesauri (or Soundex) to find equivalence classes of
synonyms and homonyms [later lectures]
Parsing a document
What format is it in?
pdf/word/excel/html?
What language is it in?
What character set is in use?
Each of these is a classification problem,
which we will study later in the course.
But these tasks are often done heuristically …
Complications: Format/language
Corpus can include docs from different languages
A single index may have to contain terms of several languages.
Sometimes a doc or its components can contain multiple
languages/formats
French email with a German pdf attachment.
What is a unit document? (indexing granularity)
A file?
An email? (Perhaps one of many in an mbox.)
An email with 5 attachments?
A group of files (PPT or LaTeX as HTML pages)
Tokenization
Input: “Friends, Romans, Countrymen”
Output: Tokens
Friends
Romans
Countrymen
Each such token is now a candidate for an index entry,
after further processing
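A deliberately naive tokenizer sketch in Python; the issues on the next slides show why real tokenizers are more involved:

import re

def tokenize(text):
    """Naive tokenizer: keep only runs of letters (numbers, hyphens,
    apostrophes, and non-Latin scripts all need special handling)."""
    return re.findall(r"[A-Za-z]+", text)

print(tokenize("Friends, Romans, Countrymen"))
# ['Friends', 'Romans', 'Countrymen']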
Tokenization
Issues in tokenization:
Finland’s capital → Finland? Finlands? Finland’s?
Hewlett-Packard → Hewlett and Packard as two tokens?
co-education
lower-case
state-of-the-art: break up hyphenated sequence.
It can be effective to get the user to put in possible hyphens
San Francisco: one token or two?
How do you decide it is one token?
Tokenization: Numbers
Examples: 3/12/91; Mar. 12, 1991; 12/3/91
55 B.C.
B-52
My PGP key is 324a3df234cb23e
(800) 234-2333
Often have embedded spaces
Older IR systems may not index numbers, but they are often very useful
e.g., looking up error codes/stack traces on the web
Will often index “metadata” separately: creation date, format, etc.
Tokenization: Language issues
French
L'ensemble: one token or two?
L ? L’ ? Le ?
German noun compounds are not segmented
Lebensversicherungsgesellschaftsangestellter
‘life insurance company employee’
German retrieval systems benefit greatly from a compound
splitter module
Can give a 15% performance boost for German
Tokenization: Language issues
Chinese and Japanese have no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple alphabets
intermingled
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
Tokenization: Language issues
Arabic (or Hebrew) is basically written right to left, but with
certain items like numbers written left to right
Words are separated, but letter forms within a word form
complex ligatures
‘Algeria achieved its independence in 1962 after 132 years of French
occupation.’
With Unicode, the surface presentation is complex, but the
stored form is straightforward
Stop words
Stop list: exclude from dictionary the commonest words.
They have little semantic content: ‘the’, ‘a’, ‘and’, ‘to’, ‘be’
There are a lot of them: ~30% of postings for top 30 words
But the trend is away from doing this:
Good compression techniques (IIR, Chapter 5)
the space for including stop words in a system is very small
Good query optimization techniques (IIR, Chapter 7)
pay little at query time for including stop words.
You need them for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
Relational queries: “flights to London”
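A minimal stop-word filter sketch (the stop list here is a tiny illustrative sample, not a real one):

STOP_WORDS = {"the", "a", "an", "and", "to", "be", "of"}   # illustrative only

def remove_stop_words(tokens):
    """Drop tokens on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["King", "of", "Denmark"]))   # ['King', 'Denmark']
# The danger: the phrase query "King of Denmark" can no longer match exactly.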
Normalization to terms
Normalize words in indexed text (and in queries)
U.S.A. → USA
Term is a (normalized) word type, which is an entry in our IR
system dictionary
We most commonly implicitly define equivalence classes of
terms by, e.g.,
deleting periods to form a term: U.S.A., USA → USA
deleting hyphens to form a term: anti-discriminatory, antidiscriminatory → antidiscriminatory
Normalization: Other languages
Accents: e.g., French résumé vs. resume.
Umlauts: e.g., German: Tuebingen vs. Tübingen
Should be equivalent
Most important criterion:
How are users likely to write their queries for these words?
Users often do not type accents (even in languages that standardly have them)
Often best to normalize to a de-accented term
Tuebingen, Tübingen, Tubingen → Tubingen
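One common way to de-accent terms is via Unicode decomposition, e.g. with Python’s standard unicodedata module. Note that this maps Tübingen → Tubingen but leaves Tuebingen unchanged, so it only partially builds the equivalence class above:

import unicodedata

def deaccent(term):
    """Decompose characters (NFKD), then drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for t in ["résumé", "Tübingen", "Tuebingen"]:
    print(deaccent(t).lower())   # resume / tubingen / tuebingen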
Normalization: Other languages
Normalization of things like date forms
7月30日 vs. 7/30
Japanese use of kana vs. Chinese characters
Tokenization and normalization may depend on the
language (intertwined with language detection)
Crucial: Need to “normalize” indexed text as well as
query terms into the same form
Example: “Morgen will ich in MIT …” Is this German “mit”?
Case folding
Reduce all letters to lower case
exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization…
Google example: Query C.A.T.
#1 result was for “cat” not Caterpillar Inc.
Normalization to terms
An alternative to equivalence classing is to do asymmetric
expansion
An example of where this may be useful
Enter: window   → Search: window, windows
Enter: windows  → Search: Windows, windows, window
Enter: Windows  → Search: Windows
Potentially more powerful, but less efficient
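A sketch of asymmetric expansion as a query-time lookup table (the table entries are hand-built illustrations):

# Hand-built, illustrative expansion table: query term -> terms to search.
EXPANSIONS = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query_term(term):
    """Asymmetric expansion: what we search depends on how the user typed it."""
    return EXPANSIONS.get(term, [term])

print(expand_query_term("windows"))   # ['Windows', 'windows', 'window']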
Thesauri and soundex
How do we handle synonyms and homonyms?
E.g., by hand-constructed equivalence classes:
car = automobile, color = colour
We can rewrite to form equivalence-class terms
When the doc contains automobile, index it under car-automobile (and/or vice-versa)
Or we can expand a query
When the query contains automobile, look under car as well
What about spelling mistakes?
One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics (more in Chapters 3 and 9)
Lemmatization
Reduce inflectional/variant forms to base form, e.g.,
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to
dictionary headword form
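For illustration, NLTK’s WordNet lemmatizer does this kind of dictionary-based reduction (a third-party library; the WordNet data must be downloaded once):

# pip install nltk; then run nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))    # be
print(lemmatizer.lemmatize("cars", pos="n"))   # car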
Stemming
Reduce terms to their “roots” before indexing
Stemming: crude affix chopping
language dependent
e.g., automate(s), automatic, automation all reduced to
automat.
Original: for example compressed and compression are both accepted as equivalent to compress
Stemmed:  for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
Commonest algorithm for stemming English
Results suggest it’s at least as good as other stemming options
Conventions + 5 phases of reductions
phases applied sequentially
each phase consists of a set of commands
sample convention: Of the rules in a compound command, select
the one that applies to the longest suffix.
Porter’s algorithm: Typical rules
sses → ss
ies → i
ational → ate
tional → tion

Rules sensitive to the measure (m) of words:
(m > 1) EMENT → ""
replacement → replac
cement → cement (the stem "c" has m = 0, so the rule does not apply)
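For reference, a ready-made Porter implementation ships with NLTK (a third-party library), which reproduces the behavior of rules like these:

# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, replacement -> replac, cement -> cement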
Other stemmers
Other stemmers exist, e.g., Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Single-pass, longest suffix removal (about 250 rules)
Full morphological analysis – at most modest benefits for retrieval
Do stemming and other normalizations help?
English: very mixed results. Helps recall for some queries but harms precision on others:
operative (dentistry) ⇒ oper
operational (research) ⇒ oper
operating (systems) ⇒ oper
Definitely useful for Spanish, German, Finnish, …
(30% performance gains for Finnish!)
Language-specificity
Many of the above features embody transformations that
are
Language-specific
Often, application-specific
These are “plug-in” addenda to the indexing process
Both open source and commercial plug-ins are available
for handling these
Dictionary entries – first cut
Sample entries (term.language):
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english

These may be grouped by language (or not…).
More on this in ranking/query processing.
Resources
IIR, Chapter 1
IIR, Sections 2.1–2.2
MIR, Section 9.2
Porter’s stemmer:
http://www.tartarus.org/~martin/PorterStemmer/