Information Retrieval and Applications
Knowledge Discovery and Data Mining 2 (VU) (707.004)
Roman Kern, Heimo Gursch
KTI, TU Graz
2014-03-19
Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications 2014-03-19 1 / 101
Outline
1 Basics: Inverted Index, Probabilistic IR
2 Advanced Concepts: Latent Semantic Indexing, Relevance Feedback, Query Processing
3 Evaluation
4 Applications of IR: Classification, Content-based Recommender Systems
5 Solutions: Open Source Solutions, Commercial Solutions
Basics
Inverted Index, Text-Preprocessing, Language-based & Probabilistic IR
Functional features
Search for single keyword
Search for multiple keywords
Search for phrases, i. e. keywords in particular order
Boolean combination of keywords (AND, OR, NOT)
Rank results by number of keyword occurrences
Find the positions of keywords in a document
Compose snippets of text around the keyword positions
Motivation
How can these goals be achieved?
Structured, Semi-Structured Data
Structured data
Format and type of information is known
IR task: normalize representations between different sources
Example: SQL databases
Semi-structured data
Combination of structured and unstructured data
IR task: extract information from the unstructured part and combine it with the structured information
Example: email, document with meta-data
Unstructured data
Unstructured data
Format of information is not known, information is hidden in (human-readable) text
IR task: extract any information at all
Example: blog-posts, Wikipedia articles
This lecture focuses on unstructured data.
Simple Approach—Sequential Search
Advantage Simple
Disadvantages Slow and hard to scale
Usage Small texts; frequently changing text; not enough space available for an index
Example grep on the UNIX command line
Not feasible for large texts. A different approach is needed. But how?
Definition
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
The term “term” is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...)
Assumptions
Basic assumptions of Information Retrieval
Collection Fixed set of documents
Goal Retrieve documents with information that is relevant to the user’s information need and helps them complete a task.
Overview Indexed Searching
Split process
Indexing Convert all input documents into a data-structure suitable for fast searching
Searching Take an input query and map it onto the pre-processed data-structure to retrieve the matching documents
Inverted index
Reverses the relationship between documents and contained words (terms)
Central data-structure of a search engine
Three Example Documents
Document 1
Sepp was stopping off in Hawaii.
Document 2
Sepp is a co-worker of Heidi.
Document 3
Heidi stopped going to Hawaii.
Inverted Index Matrix
Term-document incidence
Matrix entries
0 The document does not contain the corresponding term
1 The document does contain the corresponding term
Example Matrix
              Document 1   Document 2   Document 3
co-worker          0            1            0
going              0            0            1
Hawaii             1            0            1
Heidi              0            1            1
Sepp               1            1            0
stopped            0            0            1
stopping off       1            0            0
Properties of the Inverted Index Matrix
Vectorization of Documents
Column Words of a document
Row All documents where a word occurs
Possible Queries
Documents containing all terms of a query Take the row vectors for each term and do a bitwise AND
Documents containing any term of a query Take the row vectors for each term and do a bitwise OR
Documents not containing a term Use a bitwise NOT
Examples
All documents containing the term Hawaii
Take the row vector of Hawaii, look for a one: Document 1 & Document 3
All documents containing the terms Sepp & Heidi
1 Take the rows for the terms Sepp & Heidi
2 Combine them by a bitwise AND
3 The position in the result vector indicates the correct document: Document 2
More complex queries
Use Boolean algebra on the row vectors
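The row-vector queries above can be sketched in a few lines; a minimal illustration (not from the lecture) that encodes each term's incidence row as bits of a Python integer for the three example documents:

```python
# Term-document incidence rows as 3-bit integers
# (bits left to right = Document 1, 2, 3).
rows = {
    "co-worker":    0b010,
    "going":        0b001,
    "Hawaii":       0b101,
    "Heidi":        0b011,
    "Sepp":         0b110,
    "stopped":      0b001,
    "stopping off": 0b100,
}

def matching_docs(bits):
    """Translate a 3-bit row vector back into document names."""
    return [f"Document {i + 1}" for i in range(3) if bits & (1 << (2 - i))]

# Documents containing both Sepp AND Heidi: bitwise AND of the rows.
print(matching_docs(rows["Sepp"] & rows["Heidi"]))    # ['Document 2']
# Documents containing Hawaii OR going: bitwise OR.
print(matching_docs(rows["Hawaii"] | rows["going"]))  # ['Document 1', 'Document 3']
```

Arbitrary boolean queries then reduce to `&`, `|` and `~` on these integers.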
What we have so far...
Advantages
Search for terms
Formulate boolean queries
Works very fast
Disadvantages
Cannot search for phrases (i. e. terms in a particular order)
No information where the term occurred in the document
No snippets
No ranking possible, documents either match or do not match
Matrix is usually very sparse, possible waste of memory
Make the inverted index memory efficient
Dictionary
Store a (sorted) list of all terms in the documents. (The sorting is not necessary for it to work, but it is necessary to find terms fast.)
Postings
For every term: store a list in which documents the term occurs
Posting list
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers
Example
Dictionary → Postings
co-worker → Document 2
going → Document 3
Hawaii → Document 1, Document 3
Heidi → Document 2, Document 3
Sepp → Document 1, Document 2
stopped → Document 3
stopping off → Document 1
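A minimal sketch (illustrative, not the lecture's implementation) of building such a dictionary-to-postings mapping, with naive whitespace tokenization over lower-cased versions of the three example documents:

```python
# Build an inverted index: term -> sorted posting list of document IDs.
docs = {
    1: "sepp was stopping off in hawaii",
    2: "sepp is a co-worker of heidi",
    3: "heidi stopped going to hawaii",
}

index = {}
for doc_id, text in sorted(docs.items()):   # ascending IDs keep postings sorted
    for term in set(text.split()):          # each document contributes a term once
        index.setdefault(term, []).append(doc_id)

print(index["hawaii"])  # [1, 3]
print(index["sepp"])    # [1, 2]
```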
Boolean Retrieval
AND Intersection of postings for two or more terms
OR Union of postings for two or more terms
NOT All documents not in the postings of a term
Make it efficient
Sort the dictionary
Sort the postings by document IDs
But: long and tricky boolean combinations can still be expensive to evaluate!
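Sorting the postings by document ID is what makes AND cheap: two sorted lists can be intersected in a single linear merge pass. A sketch, assuming postings are sorted lists of integer document IDs:

```python
def intersect(p1, p2):
    """AND of two posting lists sorted by document ID (linear merge)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller ID
        else:
            j += 1
    return result

print(intersect([1, 2], [2, 3]))   # [2]  (e.g. Sepp AND Heidi)
```

OR is the analogous merge that keeps every ID; NOT needs the full list of document IDs to complement against.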
Properties
Prerequisite
Every document needs a unique, sortable ID, e. g. a unique integer number
Advantages
Very efficient storage
Search for terms
Formulate boolean queries
Disadvantages
Cannot search for phrases (i. e. terms in a particular order)
No information where the term occurred in the document
No snippets
No ranking possible, documents either match or do not match
Positional Information
Store not only that a document contains a term, but also where the term occurs in the document.
Dictionary → Postings
term 1 → Document ID, Position(s) in document
term 2 → Document ID, Position(s) in document
... → ...
term n → Document ID, Position(s) in document
Search for phrases (i. e. consecutive words)
1 Get each word from the dictionary and retrieve its posting list
2 If the document IDs are equal for two or more words, search for consecutive positions
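The two steps above can be sketched as follows, using a hypothetical in-memory positional index for the example documents:

```python
# Positional postings: term -> {document ID: [positions]}.
postings = {
    "heidi":   {2: [6], 3: [1]},
    "stopped": {3: [2]},
}

def phrase_docs(t1, t2):
    """Documents where term t2 occurs directly after term t1."""
    docs = []
    common = postings[t1].keys() & postings[t2].keys()  # step 1+2: equal doc IDs
    for doc in sorted(common):
        # consecutive positions: some position of t1 followed by t1's position + 1
        if any(p + 1 in postings[t2][doc] for p in postings[t1][doc]):
            docs.append(doc)
    return docs

print(phrase_docs("heidi", "stopped"))  # [3]
```

Longer phrases chain the same pairwise check across all adjacent query terms.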
Example
Dictionary → Postings
co-worker → (Doc 2, Pos 4)
going → (Doc 3, Pos 3)
Hawaii → (Doc 1, Pos 5); (Doc 3, Pos 5)
Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)
Sepp → (Doc 1, Pos 1); (Doc 2, Pos 1)
stopped → (Doc 3, Pos 2)
stopping off → (Doc 1, Pos 3)
Search for the phrase Heidi stopped
1 Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)
2 stopped → (Doc 3, Pos 2)
3 Consecutive positions for Document 3
Properties
Advantages
Still efficient storage
Search for terms
Formulate boolean queries
Search for phrases (i. e. terms in a particular order)
Information where the term occurred in the document
Possible creation of snippets
Disadvantages
Ranking a bit tricky, but possible
Term Frequencies
Implicit term frequencies
By storing the positions where a term occurs in a document, the information how often it occurs is stored as well: the more often a term occurs, the more position markers there are.
Problem: Counting needs time
Explicit term frequencies
Store in the posting list, how often the term occurs in the document
Possible to leave the positional information out, if not needed
Term Frequencies
Structure
Store in the posting list, how often the term occurs in the document (this is often called Term Frequency (TF))
Possible to leave the positional information out, if not needed
Dictionary → Postings
term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
Properties
Advantages
Still efficient storage
Search for terms
Formulate boolean queries
Search for phrases (i. e. terms in a particular order)
Information where the term occurred in the document
Possible creation of snippets
Documents can be ranked by the number of terms they contain
Disadvantages
Takes up space, not only storing the original documents, but also the inverted index
Steps for building an index
Step 1 Build the term-document table
Step 2 Sort by terms
Step 3 Add the term frequency; multiple entries from a single document get merged
Correspondingly, get the positions of the terms in the documents
Step 4 The result is split into a dictionary file and a description file.
Dictionary → Postings
term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
Preprocessing of Text
To make a text indexable, a couple of steps are necessary:
1 Detect the language of the text
2 Detect sentence boundaries
3 Detect term (word) boundaries
4 Stemming and normalization
5 Remove stop words
Language Detection
Letter Frequencies
Percentage of occurrence:
Letter   English   German
A          8.17      6.51
E         12.70     17.40
I          6.97      7.55
O          7.51      2.51
U          2.76      4.35
N-Gram Frequencies
Better, more elaborate methods use statistics over more than one letter, e. g. statistics over two, three or even more consecutive letters.
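A toy illustration of the letter-frequency idea, comparing observed vowel frequencies against the table above. Scoring by smallest sum of squared deviations is an assumption for this sketch, not the lecture's method; a real detector would use full letter or n-gram statistics:

```python
# Guess English vs. German from vowel frequencies (percentages from the table).
from collections import Counter

PROFILES = {
    "English": {"a": 8.17, "e": 12.70, "i": 6.97, "o": 7.51, "u": 2.76},
    "German":  {"a": 6.51, "e": 17.40, "i": 7.55, "o": 2.51, "u": 4.35},
}

def guess_language(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    freq = {c: 100.0 * counts[c] / len(letters) for c in "aeiou"}
    # The profile with the smallest sum of squared deviations wins.
    return min(PROFILES, key=lambda lang: sum(
        (freq[c] - PROFILES[lang][c]) ** 2 for c in "aeiou"))

print(guess_language("the quick brown fox jumps over the lazy dog"))  # English
```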
Sentence Detection
First approach Every period marks the end of a sentence
Problem Periods also mark abbreviations, decimal points, email-addresses, etc.
Utilize other features
Is the next letter capitalized?
Is the term in front of the period a known abbreviation?
Is the period surrounded by digits?
...
Tokenization – Problem
Goal
Split sentences into tokens
Throw away punctuation
Possible Pitfalls
White spaces are not a safe delimiter for word boundaries (e. g. NewYork)
Non-alphanumerical characters can separate words, but need not (e. g. co-worker)
Different forms (e. g. white space vs. whitespace)
German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)
Tokenization – Solutions
Supervised Learning
Train a model with annotated training data, use the trained model to tokenize unknown text
Hidden Markov Models and conditional random fields are commonly used
Dictionary Approach
Build a dictionary (i. e. a list) of tokens
Go over the sentence and always take the longest fitting token (greedy algorithm!)
Remark: Works well for Asian languages with short words and without white spaces. Problematic for European languages.
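The dictionary approach can be sketched as a greedy longest-match scan. The mini-vocabulary below is hypothetical, and the whitespace-free English string only stands in for scripts without word delimiters:

```python
# Greedy longest-match tokenization against a token dictionary.
VOCAB = {"new", "york", "newyork", "in", "hotel", "a"}

def greedy_tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Always take the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token found at position {i}")
    return tokens

print(greedy_tokenize("ahotelinnewyork"))  # ['a', 'hotel', 'in', 'newyork']
```

Note the greedy choice: newyork is preferred over new even though both fit, which is exactly where this heuristic can go wrong.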
Normalization
Some definitions
Token Sequence of characters representing a useful semantic unit, instance of a type
Type Common concept for tokens, element of the vocabulary
Term Type that is stored in the dictionary of an index
Task
1 Map all possible tokens to the corresponding type
2 Store all representing terms of one type in the dictionary
Possible Ways of Normalization
Ignore cases
Rule based removal of hyphens, periods, white spaces, accents, diacritics (be aware of possibly different meanings after removal)
Provide a list of all possible representations of a type
Map different spelling versions by using a phonetic algorithm. Phonetic algorithms translate the spelling into its pronunciation equivalent (e. g. Double Metaphone, Soundex).
Stemming & Lemmatization
Definition
Goal of both Reduce different grammatical forms of a word to their common infinitive. (The common infinitive is the form of a word as it is written in a dictionary. This form is called lemma, hence the name Lemmatization.)
Stemming Usage of heuristics to chop off / replace the last part of words (example: Porter's algorithm).
Lemmatization Usage of proper grammatical rules to recreate the infinitive.
Examples
Map do, doing, done to the common infinitive do
Map digitalizing, digitalized to digital
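A toy suffix-stripping stemmer in the spirit of, but far simpler than, Porter's algorithm. The rules below are illustrative only; note that stems such as stopp need not be real words:

```python
# Ordered suffix rules: (suffix to strip, replacement).
RULES = [("ization", "ize"), ("ational", "ate"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # Only strip if a reasonably long stem (>= 3 chars) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem("stopping"))    # 'stopp'
print(stem("stopped"))     # 'stopp'
print(stem("relational"))  # 'relate'
```

The point of stemming is only that different forms map to the same index term, not that the stem is grammatically correct.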
Stop Words
Definition Extremely common words that appear in nearly every text
Problem As stop words are so common, their occurrence does not characterize a text
Solution Just drop them, i. e. do not put them in the index dictionary
Common stop words
Write a list of stop words (the list might be topic specific!)
Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.
Solution
Stop word list Ignore words that are given on a list (black list)
Problem Special names and phrases (The Who, Let It Be, ...)
Solution Make another list... (white list)
Information flow
The classic search model
Probabilistic Information Retrieval
Every transformation can hold a possible information loss
1 Task to query
2 Query to document
Information loss leads to uncertainty
Probability theory can deal with uncertainty
Basic idea behind probabilistic IR
Model the uncertainty, which originates in the imperfect transformations
Use this uncertainty as a measure of how relevant a document is, given a certain query
Probabilistic Information Retrieval Models
Binary Independence Model
Binary: documents and queries represented as binary term incidence vectors
Independence: terms are assumed to be independent of each other
Okapi BM25
Takes term frequencies and document length into account
Parameters tune the influence of term frequencies and document length
Bayesian networks for text retrieval
Language Models Coming now...
Language Models
Basic Idea
Train a probabilistic model Md for document d, i. e. as many trained models as documents
Md ranks documents by how likely it is that the user’s query q was generated by the document d
Results are document models (or documents, respectively) which have a high probability of generating the user’s query
Examples for Language Models
Successive probability of query terms (chain rule)
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t1 t2) ... P(tn|t1 ... tn−1)
Unigram language model
Puni(t1 t2 ... tn) = P(t1) P(t2) ... P(tn)
Bigram language model
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t2) ... P(tn|tn−1)
Probability P(tn) Probabilities are estimated from the term occurrences in the document in question
And many more... A lot more, and more complex, models are available
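A sketch of ranking the three example documents by unigram query likelihood. The add-one (Laplace) smoothing is an assumption added here so unseen terms do not zero out the product; it is not prescribed by the slides:

```python
# Unigram query likelihood: P_uni(q | M_d) = product over query terms of P(t | d).
docs = {
    1: "sepp was stopping off in hawaii".split(),
    2: "sepp is a co-worker of heidi".split(),
    3: "heidi stopped going to hawaii".split(),
}
vocab = {t for terms in docs.values() for t in terms}

def query_likelihood(query, terms):
    p = 1.0
    for t in query.split():
        # Add-one smoothed estimate of P(t | d) from the term counts in d.
        p *= (terms.count(t) + 1) / (len(terms) + len(vocab))
    return p

scores = {d: query_likelihood("heidi hawaii", terms) for d, terms in docs.items()}
best = max(scores, key=scores.get)
print(best)  # 3 -- Document 3 contains both query terms
```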
Advanced Concepts
LSA, Relevance Feedback, Query Processing, ...
Index Construction
Going beyond simple indexing
Term-document matrices are very large
But the number of topics that people talk about is small (in some sense)
Clothes, movies, politics, ...
Can we represent the term-document space by a lower-dimensional latent space?
⇒ Latent Semantic Indexing (LSI)
Mathematically speaking
Latent Semantic Indexing - Introduction
• This is taking a 3-D point and projecting it into 2-D, e. g. the point (x, y, z) = (10, 10, 10) is projected onto (x, y) = (10, 10)
• The arrow in the original figure acts like the overhead projector
Mathematically speaking
Latent Semantic Indexing - Introduction
• We can project from any number of dimensions into any other number of dimensions
• Increasing dimensions adds redundant information, but is sometimes useful: Support Vector Machines (kernel methods) do this effectively
• Latent Semantic Indexing always reduces the number of dimensions
Mathematically speaking
Latent Semantic Indexing - Introduction
• Latent Semantic Indexing always reduces the number of dimensions, e. g. the point (x, y) = (10, 10) is projected onto (x) = (10)
Mathematically speaking
Latent Semantic Indexing - Introduction
• Latent Semantic Indexing can project on an arbitrary axis, not just a principal axis
Mathematically speaking
Latent Semantic Indexing - Introduction
• Our documents were just points in an N-dimensional term space
• We can project them also
(The original figure plots the Shakespeare plays Antony and Cleopatra, Julius Caesar, Tempest, Hamlet, Othello and MacBeth along the term axes Antony and Brutus, before and after projection.)
Latent Semantic Indexing
LSI - expected behaviour
A term vector that is projected on new vectors may uncover deeper meanings
For example, transforming the 3 axes of a term matrix from ball, bat and cave to
An axis that merges ball and bat
An axis that merges bat and cave
Should be able to separate differences in meaning of the term bat
Bonus: fewer dimensions are faster
Singular Value Decomposition
SVD overview
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
A = U Σ V^T
where U is an m × m unitary matrix, Σ is an m × n diagonal matrix, and V^T is an n × n unitary matrix
The columns of U are orthogonal eigenvectors of A A^T
The columns of V are orthogonal eigenvectors of A^T A
The eigenvalues λ1 ... λr of A A^T are the eigenvalues of A^T A.
Singular Value Decomposition
Matrix Decomposition
SVD enables lossy compression of a term-document matrix
Reduces the dimensionality or the rank
Arbitrarily reduce the dimensionality by putting zeros in the bottom right of Σ
This is a mathematically optimal way of reducing dimensions
Singular Value Decomposition
Reduced SVD
If we retain only k singular values, and set the rest to 0, then we don't need the corresponding parts of U and V^T
Then Σ is k × k, U is m × k, V^T is k × n, and Ak is m × n
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual form for computationalapplications
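The reduced SVD is readily computed with NumPy (assuming NumPy is available); the small term-document matrix below is made up for illustration and is not from the lecture:

```python
# Rank-k truncated SVD of a small term-document matrix.
import numpy as np

A = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 0.],
              [0., 0., 1.]])   # 4 terms x 3 documents, rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(A_k))  # 2
print(np.linalg.norm(A - A_k))     # Frobenius error = the dropped singular value
```

By the Eckart-Young theorem this A_k is the best rank-2 approximation of A in the Frobenius norm, which is exactly the optimality claim above.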
Singular Value Decomposition
Approximation Error
How good (bad) is this approximation?
It’s the best possible, measured by the Frobenius norm of the error
Singular Value Decomposition
Reduced SVD
Whereas the term-doc matrix A may have m = 50000, n = 10 million (and rank close to 50000)
We can construct an approximation A100 with rank 100.
Of all rank 100 matrices, it would have the lowest Frobenius error.
⇒ Latent Semantic Indexing
Latent Semantic Indexing
Application of SVD - LSI
From term-document matrix A, we compute the approximation Ak .
There is a row for each term and a column for each document in Ak
Thus documents live in a space of k << r dimensions
These dimensions are not the original axes
Latent Semantic Indexing
Approximation Error
Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
A query q is also mapped into this space
Latent Semantic Indexing
LSI - Results
Similar terms map to similar location in low dimensional space
Noise reduction by dimension reduction
Relevance Feedback
Integrate feedback from the user
Given a search result for a user’s query
... allow the user to rate the retrieved results (good vs. not-so-good match)
This information can then be used to re-rank the results to match the user’s preference
Often automatically inferred from the user’s behaviour using the search query logs
... and ultimately improve the search experience
The query logs (optimally) contain the queries plus the items the user has clicked on (the user’s session)
Relevance Feedback
Pseudo relevance feedback
No user interaction is needed for the pseudo relevance feedback
... the first n results are simply assumed to be a good match
As if the user had clicked on the first results and marked them as a good match
→ often improves the quality of the search results
Pseudo relevance feedback is sometimes also called blind relevance feedback
More Like This
A common feature of search engines is the more like this search
Given a search result
... the user wants items similar to one of the presented results
This can be seen as: taking a document as query
→ which is a common implementation (but restricting the query to the most discriminative terms)
Query Reformulation
Modify the query, during or after the user interaction
Query suggestion & completion
(Automatic) query correction
(Automatic) query expansion
Query Suggestion
The user starts typing a query
... automatically gets suggestions
→ to complete the currently typed word
→ to complete a whole phrase (e.g. the next word)
→ suggest related queries
Query Correction
Query misspellings
Our principal focus here
E.g., the query Britni Spears
We can either
Retrieve documents indexed by the correct spelling,
Rewrite the input query (replace, expand),
britni spears 7→ (britni OR britney) spears
Return several suggested alternative queries with the correct spelling: Did you mean Britney Spears?
Query Correction
Isolated word correction
Fundamental premise – there is a lexicon from which the correct spellings come
Basic choices for this:
A standard lexicon, such as Webster's English Dictionary, or an industry-specific, hand-maintained lexicon
The lexicon of the indexed corpus, e.g. all words on the web, all names, acronyms, etc. (including the mis-spellings)
A query log: use other users' behaviour
Query Correction
Isolated word correction
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What's closest?
We'll study several alternatives:
1 Edit distance (Levenshtein distance)
2 Weighted edit distance
3 n-gram overlap
Query Correction
Word correction: edit distance
Given two strings S1 and S2, the minimum number of operations to convert one to the other
Operations are typically character-level
Insert, Delete, Replace, (Transposition)
Examples
From dof to dog is 1
From cat to act is 2 (just 1 with transposition)
From cat to dog is 3
See http://www.merriampark.com/ld.htm for a nice example plus an applet.Notebook with various distances (Levenshtein being the best known):http://nbviewer.ipython.org/gist/MajorGressingham/7691723
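The classic dynamic-programming computation of the (unweighted) edit distance, reproducing the examples above:

```python
def levenshtein(s1, s2):
    """Minimum number of insert/delete/replace operations (no transposition)."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,         # delete c1
                            curr[j - 1] + 1,     # insert c2
                            prev[j - 1] + cost)) # replace (or match)
        prev = curr
    return prev[-1]

print(levenshtein("dof", "dog"))  # 1
print(levenshtein("cat", "act"))  # 2
print(levenshtein("cat", "dog"))  # 3
```

Note that cat to act costs 2 here because plain Levenshtein has no transposition operation; the Damerau variant adds it.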
Query Correction
Word correction: weighted edit distance
As above, but the weight of an operation depends on the character(s) involved
Meant to capture OCR or keyboard errors, e.g. m is more likely to be mis-typed as n than as q
Therefore, replacing m by n is a smaller edit distance than replacing it by q
This may be formulated as a probability model
Requires weight matrix as input
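A sketch of how such weights plug into the same recurrence; the SUB_COST entries below are invented illustrative values, not a real confusion matrix:

```python
# Illustrative substitution weights: confusing the keyboard-adjacent
# letters m and n is cheaper than an arbitrary substitution. A real
# system would load a full weight matrix estimated from error data.
SUB_COST = {("m", "n"): 0.5, ("n", "m"): 0.5}

def weighted_edit_distance(s1, s2, ins=1.0, dele=1.0):
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        curr = [i * dele]
        for j, c2 in enumerate(s2, 1):
            sub = 0.0 if c1 == c2 else SUB_COST.get((c1, c2), 1.0)
            curr.append(min(prev[j] + dele,      # delete c1
                            curr[j - 1] + ins,   # insert c2
                            prev[j - 1] + sub))  # substitute c1 -> c2
        prev = curr
    return prev[-1]

print(weighted_edit_distance("norm", "morm"))  # 0.5: n -> m is cheap
print(weighted_edit_distance("norm", "qorm"))  # 1.0: default cost
```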
Word correction: n-gram overlap
Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexiconterms matching any of the query n-grams
Threshold by number of matching n-grams
Variants – weight by keyboard layout, etc.
Word correction: n-gram overlap, example
Suppose the text is november
Trigrams are: nov ove vem emb mbe ber
The query is december
Trigrams are: dec ece cem emb mbe ber
So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalised measure of overlap?
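The trigram counts above can be reproduced in a few lines (a sketch, not part of the slides):

```python
def ngrams(term, n=3):
    # Set of all character n-grams of the term
    return {term[i:i + n] for i in range(len(term) - n + 1)}

nov = ngrams("november")  # {'nov','ove','vem','emb','mbe','ber'}
dec = ngrams("december")  # {'dec','ece','cem','emb','mbe','ber'}
print(len(nov & dec))     # 3 overlapping trigrams: emb, mbe, ber
```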
n-gram overlap, Jaccard Coefficient
A commonly-used measure of overlap
Let X and Y be two sets; then the Jaccard Coefficient is
JC(X, Y) = |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements
Equals 0 when they are disjoint
X and Y don’t have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if JC > 0.8 ⇒ declare a match
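A sketch of the coefficient and the thresholding step, reusing the november/december trigram sets:

```python
def jaccard(x, y):
    # Jaccard Coefficient of two sets: |intersection| / |union|
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)

nov = {"nov", "ove", "vem", "emb", "mbe", "ber"}
dec = {"dec", "ece", "cem", "emb", "mbe", "ber"}
jc = jaccard(nov, dec)
print(jc)            # 3 / 9, well below the 0.8 threshold
print(jc > 0.8)      # False: no match declared
```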
Query Expansion
The user has entered a query
... automatically (related) words are added to the query
→ Global query expansion - only uses the query (plus corpus or background knowledge)
→ Local query expansion - conducts the search and analyses the search results
Global Query Expansion
Uses only the query
Example: Use of a thesaurus
Query: [buy car]
Use synonyms from the thesaurus
Expanded query: [(buy OR purchase) (car OR auto)]
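The thesaurus-based expansion can be sketched as follows; THESAURUS is a toy stand-in for a real resource such as WordNet:

```python
# Toy thesaurus for illustration only
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}

def expand_query(terms):
    # Wrap each term that has synonyms in an OR group
    parts = []
    for t in terms:
        syns = THESAURUS.get(t, [])
        parts.append("(" + " OR ".join([t] + syns) + ")" if syns else t)
    return " ".join(parts)

print(expand_query(["buy", "car"]))  # (buy OR purchase) (car OR auto)
```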
Local query expansion
Uses the query plus the search results
... conceptually similar to pseudo relevance feedback
Look for terms in the first n search results and add them to the query
Re-run the query with the added terms
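A simplified sketch of these steps; a real system would filter stop words and weight the added terms:

```python
from collections import Counter

def local_expansion(query_terms, top_results, extra=2):
    """Add the most frequent terms of the first n search results to the
    query (pseudo-relevance-feedback style sketch)."""
    counts = Counter(t for doc in top_results for t in doc.split()
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(extra)]

results = ["cheap car dealer graz", "car dealer offers", "used car dealer"]
print(local_expansion(["car"], results, extra=1))  # ['car', 'dealer']
```

The expanded term list would then be turned back into a query and re-run against the index.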
Evaluation
Assess the quality of search
Overview: Evaluation
Two basic questions / goals
Are the retrieved documents relevant for the user’s information need?
Have all relevant documents been retrieved?
In practice these two goals contradict each other.
Precision & Recall
Evaluation Scenario
Fixed document collection
Number of queries
Set of documents marked as relevant for each query
Main Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
F1 = (2 · Precision · Recall) / (Precision + Recall)
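The three metrics follow directly from the retrieved and relevant document sets; a small sketch:

```python
def evaluate(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)           # relevant items retrieved
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# 4 retrieved, 3 relevant, 2 of them overlap
p, r, f1 = evaluate({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, f1)  # precision 0.5, recall ≈ 0.67, F1 ≈ 0.57
```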
Advanced Evaluation Measures
Mean Average Precision - MAP
... mean of the average precision over multiple queries
MAP = (1/|Q|) ∑_{q∈Q} AP_q
AP_q = (1/n) ∑_{k=1}^{n} Precision(k)
(Normalised) discounted cumulative gain (nDCG)
... used where a reference ranking of the relevant results exists
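A sketch of the two formulas; here AP is computed in the common textbook variant that takes Precision(k) at the rank of each relevant retrieved document and divides by the number of relevant documents:

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of Precision(k) at every rank k where a
    relevant document appears, normalised by #relevant."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, relevant-set) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs found at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
```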
Applications of IR
Selected Applications of IR
Classification & Recommender Systems
Applications of IR Classification
Overview
Classification
Supervised Learning: labelled training data is needed (effort for labelling)
Classification problems: classify an example into a given set of categories
Algorithm learns a model from the labelled training set and applies this model to unknown data
Used in real applications
Clustering
Unsupervised Learning
Find groups in the data, where the similarity within the group is high and the similarity between groups is low
Hardly ever used in real applications
Supervised learning
Given examples of a function (x, f(x))
Predict function f(x) for new examples x
Depending on f(x) we have:
1 Classification if f(x) is a discrete function
2 Regression if f(x) is a continuous function
3 Probability estimation if f(x) is a probability distribution
Classification
Definition
Given: Training examples (x, f(x)) for some unknown discrete function f
Find: A good approximation for f
x is data, e. g. a vector, text documents, emails, words in text, etc.
Classification approaches
Logistic regression
Decision trees
Support vector machines
Neural networks
Statistical inference such as Bayesian learning
Nearest neighbour algorithms (Here: kNN-algorithm)
General kNN-Algorithm
1 For each desired class: at least one example must be provided
2 Take an unclassified document: Look at its k nearest neighbours
3 Assign the unknown element to the class with the most neighbours in it
4 Goto 2
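The generic algorithm above, as an in-memory sketch with a pluggable distance function:

```python
from collections import Counter

def knn_classify(item, labelled, k, distance):
    """Assign `item` the majority class among its k nearest labelled
    examples. `labelled` is a list of (example, label) pairs."""
    nearest = sorted(labelled, key=lambda pair: distance(item, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D example with absolute difference as the distance
data = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.1, "b")]
print(knn_classify(1.1, data, k=3, distance=lambda x, y: abs(x - y)))  # a
```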
Document Classification
1 For each desired class: at least one document must be provided
2 Find the k nearest neighbours of a document:
  1 Take all terms of the document
  2 Combine them to a query (e. g. with a logical OR)
  3 Take the first k results from an index
3 Assign the unknown document to the class with the most neighbours in it
4 Goto 2
Applications of IR Content-based Recommender systems
Overview
Recommender systems should provide usable suggestions to users
Recommender systems should show new, unknown—butsimilar—documents
Recommender systems can utilize different information sources
Content-based Recommender: finds similar documents on the basis of document content
Collaborative Filtering: employs user ratings to find new and useful documents
Knowledge-based Recommender: makes decisions based on pre-programmed rules
Hybrid Recommender: combination of two or three of the other approaches
Types of Content-based Recommender
Document-to-Document Recommender
Document-to-User Recommender
Building a Recommender by Using an Index
Document-to-Document Recommender
1 Take the terms from a document
2 Construct a (weighted) query with them
3 Query the index
4 Results are possible candidates for recommending
Document-to-User Recommender
1 Create a user model from
Terms typed by the user
Documents viewed by the user
⇒ Simplest user model is just a set of terms
2 Construct a (weighted) query with them
3 Query the index
4 Results are possible candidates for recommending
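The document-to-document variant above can be sketched in memory, with cosine similarity over term-frequency vectors standing in for querying a real index:

```python
from collections import Counter
import math

def cosine(a, b):
    # a, b: term-frequency Counters
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(doc, corpus, top=2):
    """Treat the document's terms as the query and rank all other
    documents by similarity; the results are recommendation candidates."""
    q = Counter(doc.split())
    scored = [(cosine(q, Counter(d.split())), d) for d in corpus if d != doc]
    return [d for _, d in sorted(scored, reverse=True)[:top]]

corpus = ["ir search index", "search engine index", "cooking pasta recipe"]
print(recommend("ir search index", corpus, top=1))  # ['search engine index']
```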
Solutions
Open-source and commercial IR solutions
Solutions Open Source Solutions
Open Source Search Engine
Apache Lucene
Very popular, high quality
Backbone of many other search engines, e.g. Apache Solr,ElasticSearch
Lucene is the core search engine
Solr provides a service & configuration infrastructure (plus a few extra features)
Implemented in Java, bindings for many other programming languages
Solr Index Management
API
HTTP Server
Various formats (XML, binary, JavaScript, ...)
Document life-cycle
There is no update
Delete (done automatically by Solr)
Insert
Implications
A unique id is necessary
Use batch updates
Commit, rollback (and optimize)
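A sketch of the resulting update call against Solr's HTTP API: documents are POSTed as JSON to the core's /update handler, and a document with an existing unique id is deleted and re-inserted. Host, port, and core name are assumptions, and the request itself is left commented out:

```python
import json

core_url = "http://localhost:8983/solr/mycore"  # assumed host and core

# Every document carries a unique id; sending a batch in one request
# is cheaper than one request per document.
docs = [
    {"id": "doc-1", "title": "Information Retrieval"},
    {"id": "doc-2", "title": "Data Mining"},
]
payload = json.dumps(docs)
update_url = core_url + "/update?commit=true"
# e.g. requests.post(update_url, data=payload,
#                    headers={"Content-Type": "application/json"})
print(update_url)
```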
Solr Text Processing
Scope
During indexing & query
Tokenization
Split text into tokens
Lower-case alignment
Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic, ...)
Synonyms (via Thesaurus)
Stop-word filtering
Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
n-grams, soundex, umlauts
Solr Query Processing
Query parsers
Lucene query parser (rich syntax)
AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
Boosting of individual parts
Example: ((boltzmann OR schroedinger) NOT einstein)
Dismax query parser
No query syntax
Searches over multiple fields (separate boost for each field)
Configure the amount of terms to be mandatory
Distance between terms is used for ranking (phrase boosting)
Dismax is a good starting point, but may become expensive
Search Features
Query filter
Additional query
No impact on ranking
Results are cached
Boosting query
Only in Dismax
Query elevation
Fix certain queries
Request handler
Pre-define clauses
Invariants
Function queries
Score is computed on field values
Solr Search Result
Ranking
Relevance
Sort on field value (only single term per document)
Available data & features
Sequence of IDs & score
Stored fields
Snippets (plus highlighting)
Facets
Count the search hits
Types: field value, dates, queries
Sort, prefix, ...
Could be used for term suggestion (aka query suggestion)
Field collapsing (grouping)
Spell checking (did-you-mean)
Additional Solr Features
Query by Example
More like this
Stats
Per field
Min, max, sum, missing, ...
Admin-GUI
Webapp to troubleshoot queries
Browse schema
JMX
Read properties & statistics
Can be accessed remotely
Solr Integration
Deployment
Within a web application server
Embedded
Monitor
Log output
Access
Various language bindings
Java, Ruby, JavaScript, PHP, ...
Other Open-Source Search Engines
Xapian
Support for many (programming) languages
Used by many open-source projects
Lemur & Indri
Language models
Academic background
Terrier
Many algorithms
Academic background
Whoosh
for Pythonistas
Solutions Commercial Solutions
What customers want
Indexing pipeline
And also: user interface, access control mechanisms, logging, etc.
Commercial Search Engines
Autonomy
Guarantees not to truncate documents
Offers a wide range of source connectivity
Sinequa
Expert Search
Creates new views, i. e. does not only show a result list, but combines results into a new unit
Attivio
Use SQL statements in queries
IntraFIND
Lucene-based
And many, many more...
The End
Next: Preprocessing Platform & Practical Applications