Information Retrieval and Applications
Knowledge Discovery and Data Mining 2 (VU) (707.004)
Roman Kern, Heimo Gursch
KTI, TU Graz
2014-03-19
Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications 2014-03-19 1 / 101
Outline
1 Basics: Inverted Index, Probabilistic IR
2 Advanced Concepts: Latent Semantic Indexing, Relevance Feedback, Query Processing
3 Evaluation
4 Applications of IR: Classification, Content-based Recommender Systems
5 Solutions: Open Source Solutions, Commercial Solutions
Basics
Inverted Index, Text-Preprocessing, Language-based & Probabilistic IR
Functional features
Search for single keyword
Search for multiple keywords
Search for phrases, i. e. keywords in particular order
Boolean combination of keywords (AND, OR, NOT)
Rank results by number of keyword occurrences
Find the positions of keywords in a document
Compose snippets of text around the keyword positions
Motivation
How can these goals be achieved?
Structured, Semi-Structured Data
Structured data
Format and type of information is known
IR task: normalize representations between different sources
Example: SQL databases
Semi-structured data
Combination of structured and unstructured data
IR task: extract information from the unstructured part and combine it with the structured information
Example: email, document with meta-data
Unstructured data
Unstructured data
Format of information is not known, information is hidden in (human-readable) text
IR task: extract any information at all
Example: blog-posts, Wikipedia articles
This lecture focuses on unstructured data.
Simple Approach—Sequential Search
Advantage Simple
Disadvantages Slow and hard to scale
Usage Small texts; frequently changing text; not enough space available for an index
Example grep on the UNIX command line
Not feasible for large texts. A different approach is needed. But how?
Definition
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
The term “term” is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...)
Assumptions
Basic assumptions of Information Retrieval
Collection Fixed set of documents
Goal Retrieve documents with information that is relevant to the user’s information need and helps them complete a task.
Overview Indexed Searching
Split process
Indexing Convert all input documents into a data-structure suitable for fast searching
Searching Take an input query and map it onto the pre-processed data-structure to retrieve the matching documents
Inverted index
Reverses the relationship between documents and contained words (terms)
Central data-structure of a search engine
Three Example Documents
Document 1
Sepp was stopping off in Hawaii.
Document 2
Sepp is a co-worker of Heidi.
Document 3
Heidi stopped going to Hawaii.
Inverted Index Matrix
Term-document incidence
Matrix entries
0 The document does not contain the corresponding term
1 The document does contain the corresponding term
Example Matrix
              Document 1   Document 2   Document 3
co-worker          0            1            0
going              0            0            1
Hawaii             1            0            1
Heidi              0            1            1
Sepp               1            1            0
stopped            0            0            1
stopping off       1            0            0
Properties of the Inverted Index Matrix
Vectorization of Documents
Column Words of a document
Row All documents where a word occurs
Possible Queries
Documents containing all terms of a query Take the row vectors for each term and do a bitwise AND
Documents containing any term of a query Take the row vectors for each term and do a bitwise OR
Documents not containing a term Use a bitwise NOT
Examples
All documents containing the term Hawaii
Take the row vector of Hawaii, look for a one: Document 1 & Document 3
All documents containing the terms Sepp & Heidi
1 Take the rows for the terms Sepp & Heidi
2 Combine them by a bitwise AND
3 The position in the result vector indicates the correct document: Document 2
More complex queries
Use Boolean algebra on the row vectors
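The row-vector queries above can be sketched in a few lines; a minimal illustration (not from the lecture) that encodes each term's incidence row as bits of a Python integer for the three example documents:

```python
# Term-document incidence rows as 3-bit integers
# (bits left to right = Document 1, 2, 3).
rows = {
    "co-worker":    0b010,
    "going":        0b001,
    "Hawaii":       0b101,
    "Heidi":        0b011,
    "Sepp":         0b110,
    "stopped":      0b001,
    "stopping off": 0b100,
}

def matching_docs(bits):
    """Translate a 3-bit row vector back into document names."""
    return [f"Document {i + 1}" for i in range(3) if bits & (1 << (2 - i))]

# Documents containing both Sepp AND Heidi: bitwise AND of the rows.
print(matching_docs(rows["Sepp"] & rows["Heidi"]))    # ['Document 2']
# Documents containing Hawaii OR going: bitwise OR.
print(matching_docs(rows["Hawaii"] | rows["going"]))  # ['Document 1', 'Document 3']
```

Arbitrary boolean queries then reduce to `&`, `|` and `~` on these integers.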
What we have so far...
Advantages
Search for terms
Formulate boolean queries
Works very fast
Disadvantages
Cannot search for phrases (i. e. terms in a particular order)
No information where the term occurred in the document
No snippets
No ranking possible, documents either match or do not match
Matrix is usually very sparse, possible waste of memory
Make the inverted index memory efficient
Dictionary
Store a (sorted) list of all terms in the documents. (The sorting is not necessary for it to work, but it is necessary to find terms fast.)
Postings
For every term: store a list in which documents the term occurs
Posting list
Linked lists generally preferred to arrays
Dynamic space allocation
Insertion of terms into documents easy
Space overhead of pointers
Example
Dictionary → Postings
co-worker → Document 2
going → Document 3
Hawaii → Document 1, Document 3
Heidi → Document 2, Document 3
Sepp → Document 1, Document 2
stopped → Document 3
stopping off → Document 1
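A minimal sketch (illustrative, not the lecture's implementation) of building such a dictionary-to-postings mapping, with naive whitespace tokenization over lower-cased versions of the three example documents:

```python
# Build an inverted index: term -> sorted posting list of document IDs.
docs = {
    1: "sepp was stopping off in hawaii",
    2: "sepp is a co-worker of heidi",
    3: "heidi stopped going to hawaii",
}

index = {}
for doc_id, text in sorted(docs.items()):   # ascending IDs keep postings sorted
    for term in set(text.split()):          # each document contributes a term once
        index.setdefault(term, []).append(doc_id)

print(index["hawaii"])  # [1, 3]
print(index["sepp"])    # [1, 2]
```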
Boolean Retrieval
AND Intersection of postings for two or more terms
OR Union of postings for two or more terms
NOT All documents not in the postings of a term
Make it efficient
Sort the dictionary
Sort the postings by document IDs
But: long and tricky boolean combinations can still be expensive to evaluate!
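Sorting the postings by document ID is what makes AND cheap: two sorted lists can be intersected in a single linear merge pass. A sketch, assuming postings are sorted lists of integer document IDs:

```python
def intersect(p1, p2):
    """AND of two posting lists sorted by document ID (linear merge)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller ID
        else:
            j += 1
    return result

print(intersect([1, 2], [2, 3]))   # [2]  (e.g. Sepp AND Heidi)
```

OR is the analogous merge that keeps every ID; NOT needs the full list of document IDs to complement against.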
Properties
Prerequisite
Every document needs a unique, sortable ID, e. g. a unique integer number
Advantages
Very efficient storage
Search for terms
Formulate boolean queries
Disadvantages
Cannot search for phrases (i. e. terms in a particular order)
No information where the term occurred in the document
No snippets
No ranking possible, documents either match or do not match
Positional Information
Store not only that a document contains a term, but also where the term occurs in the document.
Dictionary → Postings
term 1 → Document ID, Position(s) in document
term 2 → Document ID, Position(s) in document
... → ...
term n → Document ID, Position(s) in document
Search for phrases (i. e. consecutive words)
1 Get each word from the dictionary and retrieve its posting list
2 If the document IDs are equal for two or more words, search for consecutive positions
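The two steps above can be sketched as follows, using a hypothetical in-memory positional index for the example documents:

```python
# Positional postings: term -> {document ID: [positions]}.
postings = {
    "heidi":   {2: [6], 3: [1]},
    "stopped": {3: [2]},
}

def phrase_docs(t1, t2):
    """Documents where term t2 occurs directly after term t1."""
    docs = []
    common = postings[t1].keys() & postings[t2].keys()  # step 1+2: equal doc IDs
    for doc in sorted(common):
        # consecutive positions: some position of t1 followed by t1's position + 1
        if any(p + 1 in postings[t2][doc] for p in postings[t1][doc]):
            docs.append(doc)
    return docs

print(phrase_docs("heidi", "stopped"))  # [3]
```

Longer phrases chain the same pairwise check across all adjacent query terms.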
Example
Dictionary → Postings
co-worker → (Doc 2, Pos 4)
going → (Doc 3, Pos 3)
Hawaii → (Doc 1, Pos 5); (Doc 3, Pos 5)
Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)
Sepp → (Doc 1, Pos 1); (Doc 2, Pos 1)
stopped → (Doc 3, Pos 2)
stopping off → (Doc 1, Pos 3)
Search for the phrase Heidi stopped
1 Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)
2 stopped → (Doc 3, Pos 2)
3 Consecutive positions for Document 3
Properties
Advantages
Still efficient storage
Search for terms
Formulate boolean queries
Search for phrases (i. e. terms in a particular order)
Information where the term occurred in the document
Possible creation of snippets
Disadvantages
Ranking a bit tricky, but possible
Term Frequencies
Implicit term frequencies
By storing the positions where a term occurs in a document, the information how often it occurs is stored as well: the more often a term occurs, the more position markers there are.
Problem: Counting needs time
Explicit term frequencies
Store in the posting list, how often the term occurs in the document
Possible to leave the positional information out, if not needed
Term Frequencies
Structure
Store in the posting list, how often the term occurs in the document (this is often called Term Frequency (TF))
Possible to leave the positional information out, if not needed
Dictionary → Postings
term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
Properties
Advantages
Still efficient storage
Search for terms
Formulate boolean queries
Search for phrases (i. e. terms in a particular order)
Information where the term occurred in the document
Possible creation of snippets
Documents can be ranked by the number of terms they contain
Disadvantages
Takes up space, not only storing the original documents, but also the inverted index
Steps for building an index
Step 1 Build the term-document table
Step 2 Sort by terms
Step 3 Add the term frequency; multiple entries from a single document get merged
Correspondingly, get the positions of the terms in the documents
Step 4 The result is split into a dictionary file and a description file.
Dictionary → Postings
term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
Preprocessing of Text
To make a text indexable, a couple of steps are necessary:
1 Detect the language of the text
2 Detect sentence boundaries
3 Detect term (word) boundaries
4 Stemming and normalization
5 Remove stop words
Language Detection
Letter Frequencies
Percentage of occurrence:
Letter   English   German
A          8.17      6.51
E         12.70     17.40
I          6.97      7.55
O          7.51      2.51
U          2.76      4.35
N-Gram Frequencies
Better, more elaborate methods use statistics over more than one letter, e. g. statistics over two, three or even more consecutive letters.
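A toy illustration of the letter-frequency idea, comparing observed vowel frequencies against the table above. Scoring by smallest sum of squared deviations is an assumption for this sketch, not the lecture's method; a real detector would use full letter or n-gram statistics:

```python
# Guess English vs. German from vowel frequencies (percentages from the table).
from collections import Counter

PROFILES = {
    "English": {"a": 8.17, "e": 12.70, "i": 6.97, "o": 7.51, "u": 2.76},
    "German":  {"a": 6.51, "e": 17.40, "i": 7.55, "o": 2.51, "u": 4.35},
}

def guess_language(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    freq = {c: 100.0 * counts[c] / len(letters) for c in "aeiou"}
    # The profile with the smallest sum of squared deviations wins.
    return min(PROFILES, key=lambda lang: sum(
        (freq[c] - PROFILES[lang][c]) ** 2 for c in "aeiou"))

print(guess_language("the quick brown fox jumps over the lazy dog"))  # English
```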
Sentence Detection
First approach Every period marks the end of a sentence
Problem Periods also mark abbreviations, decimal points, email-addresses, etc.
Utilize other features
Is the next letter capitalized?
Is the term in front of the period a known abbreviation?
Is the period surrounded by digits?
...
Tokenization – Problem
Goal
Split sentences into tokens
Throw away punctuation
Possible Pitfalls
White spaces are not a safe delimiter for word boundaries (e. g. NewYork)
Non-alphanumerical characters can separate words, but need not (e. g. co-worker)
Different forms (e. g. white space vs. whitespace)
German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)
Tokenization – Solutions
Supervised Learning
Train a model with annotated training data, use the trained model to tokenize unknown text
Hidden Markov Models and conditional random fields are commonly used
Dictionary Approach
Build a dictionary (i. e. a list) of tokens
Go over the sentence and always take the longest fitting token (greedy algorithm!)
Remark: Works well for Asian languages with short words and without white spaces. Problematic for European languages.
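The dictionary approach can be sketched as a greedy longest-match scan. The mini-vocabulary below is hypothetical, and the whitespace-free English string only stands in for scripts without word delimiters:

```python
# Greedy longest-match tokenization against a token dictionary.
VOCAB = {"new", "york", "newyork", "in", "hotel", "a"}

def greedy_tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Always take the longest dictionary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token found at position {i}")
    return tokens

print(greedy_tokenize("ahotelinnewyork"))  # ['a', 'hotel', 'in', 'newyork']
```

Note the greedy choice: newyork is preferred over new even though both fit, which is exactly where this heuristic can go wrong.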
Normalization
Some definitions
Token Sequence of characters representing a useful semantic unit, instance of a type
Type Common concept for tokens, element of the vocabulary
Term Type that is stored in the dictionary of an index
Task
1 Map all possible tokens to the corresponding type
2 Store all representing terms of one type in the dictionary
Possible Ways of Normalization
Ignore cases
Rule based removal of hyphens, periods, white spaces, accents, diacritics (be aware of possibly different meanings after removal)
Provide a list of all possible representations of a type
Map different spelling versions by using a phonetic algorithm. Phonetic algorithms translate the spelling into its pronunciation equivalent (e. g. Double Metaphone, Soundex).
Stemming & Lemmatization
Definition
Goal of both Reduce different grammatical forms of a word to their common infinitive. (The common infinitive is the form of a word as it is written in a dictionary. This form is called lemma, hence the name Lemmatization.)
Stemming Usage of heuristics to chop off / replace the last part of words (example: Porter's algorithm).
Lemmatization Usage of proper grammatical rules to recreate the infinitive.
Examples
Map do, doing, done to the common infinitive do
Map digitalizing, digitalized to digital
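A toy suffix-stripping stemmer in the spirit of, but far simpler than, Porter's algorithm. The rules below are illustrative only; note that stems such as stopp need not be real words:

```python
# Ordered suffix rules: (suffix to strip, replacement).
RULES = [("ization", "ize"), ("ational", "ate"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        # Only strip if a reasonably long stem (>= 3 chars) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem("stopping"))    # 'stopp'
print(stem("stopped"))     # 'stopp'
print(stem("relational"))  # 'relate'
```

The point of stemming is only that different forms map to the same index term, not that the stem is grammatically correct.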
Stop Words
Definition Extremely common words that appear in nearly every text
Problem As stop words are so common, their occurrence does not characterize a text
Solution Just drop them, i. e. do not put them in the index dictionary
Common stop words
Write a list of stop words (the list might be topic specific!)
Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.
Solution
Stop word list Ignore words that are given on a list (black list)
Problem Special names and phrases (The Who, Let It Be, ...)
Solution Make another list... (white list)
Information flow
The classic search model
Probabilistic Information Retrieval
Every transformation can hold a possible information loss
1 Task to query
2 Query to document
Information loss leads to uncertainty
Probability theory can deal with uncertainty
Basic idea behind probabilistic IR
Model the uncertainty, which originates in the imperfect transformations
Use this uncertainty as a measure of how relevant a document is, given a certain query
Probabilistic Information Retrieval Models
Binary Independence Model
Binary: documents and queries represented as binary term incidence vectors
Independence: terms are assumed to be independent of each other
Okapi BM25
Takes term frequencies and document length into account
Parameters tune the influence of term frequencies and document length
Bayesian networks for text retrieval
Language Models Coming now...
Language Models
Basic Idea
Train a probabilistic model Md for document d, i. e. as many trained models as documents
Md ranks documents by how likely it is that the user’s query q was generated by the document d
Results are document models (or documents, respectively) which have a high probability of generating the user’s query
Examples for Language Models
Successive probability of query terms (chain rule)
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t1 t2) ... P(tn|t1 ... tn−1)
Unigram language model
Puni(t1 t2 ... tn) = P(t1) P(t2) ... P(tn)
Bigram language model
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t2) ... P(tn|tn−1)
Probability P(tn) Probabilities are estimated from the term occurrences in the document in question
And many more... A lot more, and more complex, models are available
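A sketch of ranking the three example documents by unigram query likelihood. The add-one (Laplace) smoothing is an assumption added here so unseen terms do not zero out the product; it is not prescribed by the slides:

```python
# Unigram query likelihood: P_uni(q | M_d) = product over query terms of P(t | d).
docs = {
    1: "sepp was stopping off in hawaii".split(),
    2: "sepp is a co-worker of heidi".split(),
    3: "heidi stopped going to hawaii".split(),
}
vocab = {t for terms in docs.values() for t in terms}

def query_likelihood(query, terms):
    p = 1.0
    for t in query.split():
        # Add-one smoothed estimate of P(t | d) from the term counts in d.
        p *= (terms.count(t) + 1) / (len(terms) + len(vocab))
    return p

scores = {d: query_likelihood("heidi hawaii", terms) for d, terms in docs.items()}
best = max(scores, key=scores.get)
print(best)  # 3 -- Document 3 contains both query terms
```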
Advanced Concepts
LSA, Relevance Feedback, Query Processing, ...
Index Construction
Going beyond simple indexing
Term-document matrices are very large
But the number of topics that people talk about is small (in some sense)
Clothes, movies, politics, ...
Can we represent the term-document space by a lower-dimensional latent space?
⇒ Latent Semantic Indexing (LSI)
Mathematically speaking
Latent Semantic Indexing - Introduction
• This is taking a 3-D point and projecting it into 2-D, e. g. the point (x, y, z) = (10, 10, 10) is projected onto (x, y) = (10, 10)
• The arrow in the original figure acts like the overhead projector
Mathematically speaking
Latent Semantic Indexing - Introduction
• We can project from any number of dimensions into any other number of dimensions
• Increasing dimensions adds redundant information, but is sometimes useful: Support Vector Machines (kernel methods) do this effectively
• Latent Semantic Indexing always reduces the number of dimensions
Mathematically speaking
Latent Semantic Indexing - Introduction
• Latent Semantic Indexing always reduces the number of dimensions, e. g. the point (x, y) = (10, 10) is projected onto (x) = (10)
Mathematically speaking
Latent Semantic Indexing - Introduction
• Latent Semantic Indexing can project on an arbitrary axis, not just a principal axis
Mathematically speaking
Latent Semantic Indexing - Introduction
• Our documents were just points in an N-dimensional term space
• We can project them also
(The original figure plots the Shakespeare plays Antony and Cleopatra, Julius Caesar, Tempest, Hamlet, Othello and MacBeth along the term axes Antony and Brutus, before and after projection.)
Latent Semantic Indexing
LSI - expected behaviour
A term vector that is projected on new vectors may uncover deeper meanings
For example, transforming the 3 axes of a term matrix from ball, bat and cave to
An axis that merges ball and bat
An axis that merges bat and cave
Should be able to separate differences in meaning of the term bat
Bonus: fewer dimensions are faster
Singular Value Decomposition
SVD overview
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
A = U Σ V^T
where U is an m × m unitary matrix, Σ is an m × n diagonal matrix, and V^T is an n × n unitary matrix
The columns of U are orthogonal eigenvectors of A A^T
The columns of V are orthogonal eigenvectors of A^T A
The eigenvalues λ1 ... λr of A A^T are the eigenvalues of A^T A.
Singular Value Decomposition
Matrix Decomposition
SVD enables lossy compression of a term-document matrix
Reduces the dimensionality or the rank
Arbitrarily reduce the dimensionality by putting zeros in the bottom right of Σ
This is a mathematically optimal way of reducing dimensions
Singular Value Decomposition
Reduced SVD
If we retain only k singular values, and set the rest to 0, then we don't need the corresponding parts of U and V^T
Then Σ is k × k, U is m × k, V^T is k × n, and Ak is m × n
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual form for computationalapplications
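The reduced SVD is readily computed with NumPy (assuming NumPy is available); the small term-document matrix below is made up for illustration and is not from the lecture:

```python
# Rank-k truncated SVD of a small term-document matrix.
import numpy as np

A = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 0.],
              [0., 0., 1.]])   # 4 terms x 3 documents, rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(A_k))  # 2
print(np.linalg.norm(A - A_k))     # Frobenius error = the dropped singular value
```

By the Eckart-Young theorem this A_k is the best rank-2 approximation of A in the Frobenius norm, which is exactly the optimality claim above.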
Singular Value Decomposition
Approximation Error
How good (bad) is this approximation?
It’s the best possible, measured by the Frobenius norm of the error
Singular Value Decomposition
Reduced SVD
Whereas the term-doc matrix A may have m = 50000, n = 10 million (and rank close to 50000)
We can construct an approximation A100 with rank 100.
Of all rank 100 matrices, it would have the lowest Frobenius error.
⇒ Latent Semantic Indexing
Latent Semantic Indexing
Application of SVD - LSI
From term-document matrix A, we compute the approximation Ak .
There is a row for each term and a column for each document in Ak
Thus documents live in a space of k << r dimensions
These dimensions are not the original axes
Latent Semantic Indexing
Approximation Error
Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
A query q is also mapped into this space
Latent Semantic Indexing
LSI - Results
Similar terms map to similar location in low dimensional space
Noise reduction by dimension reduction
Relevance Feedback
Integrate feedback from the user
Given a search result for a user’s query
... allow the user to rate the retrieved results (good vs. not-so-good match)
This information can then be used to re-rank the results to match the user’s preference
Often automatically inferred from the user’s behaviour using the search query logs
... and ultimately improve the search experience
The query logs (optimally) contain the queries plus the items the user has clicked on (the user’s session)
Relevance Feedback
Pseudo relevance feedback
No user interaction is needed for the pseudo relevance feedback
... the first n results are simply assumed to be a good match
As if the user had clicked on the first results and marked them as a good match
→ often improves the quality of the search results
Pseudo relevance feedback is sometimes also called blind relevance feedback
More Like This
A common feature of search engines is the more like this search
Given a search result
... the user wants items similar to one of the presented results
This can be seen as: taking a document as query
→ which is a common implementation (but restricting the query to the most discriminative terms)
Query Reformulation
Modify the query, during or after the user interaction
Query suggestion & completion
(Automatic) query correction
(Automatic) query expansion
Query Suggestion
The user starts typing a query
... automatically gets suggestions
→ to complete the currently typed word
→ to complete a whole phrase (e.g. the next word)
→ suggest related queries
Query Correction
Query misspellings
Our principal focus here
E.g., the query Britni Spears
We can either
Retrieve documents indexed by the correct spelling,
Rewrite the input query (replace, expand),
britni spears 7→ (britni OR britney) spears
Return several suggested alternative queries with the correct spelling: Did you mean Britney Spears?
Query Correction
Isolated word correction
Fundamental premise – there is a lexicon from which the correct spellings come
Basic choices for this:
A standard lexicon, such as Webster's English Dictionary, or an industry-specific, hand-maintained lexicon
The lexicon of the indexed corpus, e.g. all words on the web, all names, acronyms, etc. (including the mis-spellings)
A query log: use other users' behaviour
Query Correction
Isolated word correction
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What's closest?
We'll study several alternatives:
1 Edit distance (Levenshtein distance)
2 Weighted edit distance
3 n-gram overlap
Query Correction
Word correction: edit distance
Given two strings S1 and S2, the minimum number of operations to convert one to the other
Operations are typically character-level
Insert, Delete, Replace, (Transposition)
Examples
From dof to dog is 1
From cat to act is 2 (just 1 with transposition)
From cat to dog is 3
See http://www.merriampark.com/ld.htm for a nice example plus an applet.Notebook with various distances (Levenshtein being the best known):http://nbviewer.ipython.org/gist/MajorGressingham/7691723
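The classic dynamic-programming computation of the (unweighted) edit distance, reproducing the examples above:

```python
def levenshtein(s1, s2):
    """Minimum number of insert/delete/replace operations (no transposition)."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,         # delete c1
                            curr[j - 1] + 1,     # insert c2
                            prev[j - 1] + cost)) # replace (or match)
        prev = curr
    return prev[-1]

print(levenshtein("dof", "dog"))  # 1
print(levenshtein("cat", "act"))  # 2
print(levenshtein("cat", "dog"))  # 3
```

Note that cat to act costs 2 here because plain Levenshtein has no transposition operation; the Damerau variant adds it.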
Query Correction
Word correction: weighted edit distance
As above, but the weight of an operation depends on the character(s) involved
Meant to capture OCR or keyboard errors, e.g. m is more likely to be mis-typed as n than as q
Therefore, replacing m by n is a smaller edit distance than replacing it by q
This may be formulated as a probability model
Requires weight matrix as input
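A sketch of how such weights plug into the same recurrence; the SUB_COST entries below are invented illustrative values, not a real confusion matrix:

```python
# Illustrative substitution weights: confusing the keyboard-adjacent
# letters m and n is cheaper than an arbitrary substitution. A real
# system would load a full weight matrix estimated from error data.
SUB_COST = {("m", "n"): 0.5, ("n", "m"): 0.5}

def weighted_edit_distance(s1, s2, ins=1.0, dele=1.0):
    prev = [j * ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        curr = [i * dele]
        for j, c2 in enumerate(s2, 1):
            sub = 0.0 if c1 == c2 else SUB_COST.get((c1, c2), 1.0)
            curr.append(min(prev[j] + dele,      # delete c1
                            curr[j - 1] + ins,   # insert c2
                            prev[j - 1] + sub))  # substitute c1 -> c2
        prev = curr
    return prev[-1]

print(weighted_edit_distance("norm", "morm"))  # 0.5: n -> m is cheap
print(weighted_edit_distance("norm", "qorm"))  # 1.0: default cost
```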
Word correction: n-gram overlap
Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexiconterms matching any of the query n-grams
Threshold by number of matching n-grams
Variants – weight by keyboard layout, etc.
Word correction: n-gram overlap, example
Suppose the text is november
Trigrams are: nov ove vem emb mbe ber
The query is december
Trigrams are: dec ece cem emb mbe ber
So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalised measure of overlap?
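The trigram counts above can be reproduced in a few lines (a sketch, not part of the slides):

```python
def ngrams(term, n=3):
    # Set of all character n-grams of the term
    return {term[i:i + n] for i in range(len(term) - n + 1)}

nov = ngrams("november")  # {'nov','ove','vem','emb','mbe','ber'}
dec = ngrams("december")  # {'dec','ece','cem','emb','mbe','ber'}
print(len(nov & dec))     # 3 overlapping trigrams: emb, mbe, ber
```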
n-gram overlap, Jaccard Coefficient
A commonly-used measure of overlap
Let X and Y be two sets; then the Jaccard Coefficient is
JC(X, Y) = |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements
Equals 0 when they are disjoint
X and Y don’t have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if JC > 0.8 ⇒ declare a match
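A sketch of the coefficient and the thresholding step, reusing the november/december trigram sets:

```python
def jaccard(x, y):
    # Jaccard Coefficient of two sets: |intersection| / |union|
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)

nov = {"nov", "ove", "vem", "emb", "mbe", "ber"}
dec = {"dec", "ece", "cem", "emb", "mbe", "ber"}
jc = jaccard(nov, dec)
print(jc)            # 3 / 9, well below the 0.8 threshold
print(jc > 0.8)      # False: no match declared
```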
Query Expansion
The user has entered a query
... automatically (related) words are added to the query
→ Global query expansion - only uses the query (plus corpus or background knowledge)
→ Local query expansion - conducts the search and analyses the search results
Global Query Expansion
Uses only the query
Example: Use of a thesaurus
Query: [buy car]
Use synonyms from the thesaurus
Expanded query: [(buy OR purchase) (car OR auto)]
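The thesaurus-based expansion can be sketched as follows; THESAURUS is a toy stand-in for a real resource such as WordNet:

```python
# Toy thesaurus for illustration only
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}

def expand_query(terms):
    # Wrap each term that has synonyms in an OR group
    parts = []
    for t in terms:
        syns = THESAURUS.get(t, [])
        parts.append("(" + " OR ".join([t] + syns) + ")" if syns else t)
    return " ".join(parts)

print(expand_query(["buy", "car"]))  # (buy OR purchase) (car OR auto)
```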
Local query expansion
Uses the query plus the search results
... conceptually similar to pseudo relevance feedback
Look for terms in the first n search results and add them to the query
Re-run the query with the added terms
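A simplified sketch of these steps; a real system would filter stop words and weight the added terms:

```python
from collections import Counter

def local_expansion(query_terms, top_results, extra=2):
    """Add the most frequent terms of the first n search results to the
    query (pseudo-relevance-feedback style sketch)."""
    counts = Counter(t for doc in top_results for t in doc.split()
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(extra)]

results = ["cheap car dealer graz", "car dealer offers", "used car dealer"]
print(local_expansion(["car"], results, extra=1))  # ['car', 'dealer']
```

The expanded term list would then be turned back into a query and re-run against the index.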
Evaluation
Assess the quality of search
Overview: Evaluation
Two basic questions / goals
Are the retrieved documents relevant for the user’s information need?
Have all relevant documents been retrieved?
In practice these two goals contradict each other.
Precision & Recall
Evaluation Scenario
Fixed document collection
Number of queries
Set of documents marked as relevant for each query
Main Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
F1 = (2 · Precision · Recall) / (Precision + Recall)
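The three metrics follow directly from the retrieved and relevant document sets; a small sketch:

```python
def evaluate(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)           # relevant items retrieved
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# 4 retrieved, 3 relevant, 2 of them overlap
p, r, f1 = evaluate({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r, f1)  # precision 0.5, recall ≈ 0.67, F1 ≈ 0.57
```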
Advanced Evaluation Measures
Mean Average Precision - MAP
... mean of the average precision over multiple queries
MAP = (1/|Q|) ∑_{q∈Q} AP_q
AP_q = (1/n) ∑_{k=1}^{n} Precision(k)
(Normalised) discounted cumulative gain (nDCG)
... used where a reference ranking of the relevant results exists
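A sketch of the two formulas; here AP is computed in the common textbook variant that takes Precision(k) at the rank of each relevant retrieved document and divides by the number of relevant documents:

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of Precision(k) at every rank k where a
    relevant document appears, normalised by #relevant."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, relevant-set) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs found at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
```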
Applications of IR
Selected Applications of IR
Classification & Recommender Systems
Applications of IR Classification
Overview
Classification
Supervised Learning: labelled training data is needed (effort for labelling)
Classification problems: classify an example into a given set of categories
Algorithm learns a model from the labelled training set and applies this model to unknown data
Used in real applications
Clustering
Unsupervised Learning
Find groups in the data, where the similarity within the group is high and the similarity between groups is low
Hardly ever used in real applications
Supervised learning
Given examples of a function (x, f(x))
Predict function f(x) for new examples x
Depending on f(x) we have:
1 Classification if f(x) is a discrete function
2 Regression if f(x) is a continuous function
3 Probability estimation if f(x) is a probability distribution
Classification
Definition
Given: Training examples (x, f(x)) for some unknown discrete function f
Find: A good approximation for f
x is data, e. g. a vector, text documents, emails, words in text, etc.
Classification approaches
Logistic regression
Decision trees
Support vector machines
Neural networks
Statistical inference such as Bayesian learning
Nearest neighbour algorithms (Here: kNN-algorithm)
General kNN-Algorithm
1 For each desired class: at least one example must be provided
2 Take an unclassified document: Look at its k nearest neighbours
3 Assign the unknown element to the class with the most neighbours in it
4 Goto 2
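The generic algorithm above, as an in-memory sketch with a pluggable distance function:

```python
from collections import Counter

def knn_classify(item, labelled, k, distance):
    """Assign `item` the majority class among its k nearest labelled
    examples. `labelled` is a list of (example, label) pairs."""
    nearest = sorted(labelled, key=lambda pair: distance(item, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 1-D example with absolute difference as the distance
data = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.1, "b")]
print(knn_classify(1.1, data, k=3, distance=lambda x, y: abs(x - y)))  # a
```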
Document Classification
1 For each desired class: at least one document must be provided
2 Find the k nearest neighbours of a document:
  1 Take all terms of the document
  2 Combine them to a query (e. g. with a logical OR)
  3 Take the first k results from an index
3 Assign the unknown document to the class with the most neighbours in it
4 Goto 2
Applications of IR Content-based Recommender systems
Overview
Recommender systems should provide usable suggestions to users
Recommender systems should show new, unknown—butsimilar—documents
Recommender systems can utilize different information sources
Content-based Recommender: finds similar documents on the basis of document content
Collaborative Filtering: employs user ratings to find new and useful documents
Knowledge-based Recommender: makes decisions based on pre-programmed rules
Hybrid Recommender: combination of two or three of the other approaches
Types of Content-based Recommender
Document-to-Document Recommender
Document-to-User Recommender
Building a Recommender by Using an Index
Document-to-Document Recommender
1 Take the terms from a document
2 Construct a (weighted) query with them
3 Query the index
4 Results are possible candidates for recommending
Document-to-User Recommender
1 Create a user model from
Terms typed by the user
Documents viewed by the user
⇒ Simplest user model is just a set of terms
2 Construct a (weighted) query with them
3 Query the index
4 Results are possible candidates for recommending
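The document-to-document variant above can be sketched in memory, with cosine similarity over term-frequency vectors standing in for querying a real index:

```python
from collections import Counter
import math

def cosine(a, b):
    # a, b: term-frequency Counters
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(doc, corpus, top=2):
    """Treat the document's terms as the query and rank all other
    documents by similarity; the results are recommendation candidates."""
    q = Counter(doc.split())
    scored = [(cosine(q, Counter(d.split())), d) for d in corpus if d != doc]
    return [d for _, d in sorted(scored, reverse=True)[:top]]

corpus = ["ir search index", "search engine index", "cooking pasta recipe"]
print(recommend("ir search index", corpus, top=1))  # ['search engine index']
```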
Solutions
Open-source and commercial IR solutions
Solutions Open Source Solutions
Open Source Search Engine
Apache Lucene
Very popular, high quality
Backbone of many other search engines, e.g. Apache Solr,ElasticSearch
Lucene is the core search engine
Solr provides a service & configuration infrastructure (plus a few extra features)
Implemented in Java, bindings for many other programming languages
Solr Index Management
API
HTTP Server
Various formats (XML, binary, JavaScript, ...)
Document life-cycle
There is no update
Delete (done automatically by Solr)
Insert
Implications
A unique id is necessary
Use batch updates
Commit, rollback (and optimize)
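A sketch of the resulting update call against Solr's HTTP API: documents are POSTed as JSON to the core's /update handler, and a document with an existing unique id is deleted and re-inserted. Host, port, and core name are assumptions, and the request itself is left commented out:

```python
import json

core_url = "http://localhost:8983/solr/mycore"  # assumed host and core

# Every document carries a unique id; sending a batch in one request
# is cheaper than one request per document.
docs = [
    {"id": "doc-1", "title": "Information Retrieval"},
    {"id": "doc-2", "title": "Data Mining"},
]
payload = json.dumps(docs)
update_url = core_url + "/update?commit=true"
# e.g. requests.post(update_url, data=payload,
#                    headers={"Content-Type": "application/json"})
print(update_url)
```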
Solr Text Processing
Scope
During indexing & query
Tokenization
Split text into tokens
Lower-case alignment
Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic, ...)
Synonyms (via Thesaurus)
Stop-word filtering
Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
n-grams, soundex, umlauts
Solr Query Processing
Query parsers
Lucene query parser (rich syntax)
AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
Boosting of individual parts
Example: ((boltzmann OR schroedinger) NOT einstein)
Dismax query parser
No query syntax
Searches over multiple fields (separate boost for each field)
Configure the amount of terms to be mandatory
Distance between terms is used for ranking (phrase boosting)
Dismax is a good starting point, but may become expensive
Search Features
Query filter
Additional query
No impact on ranking
Results are cached
Boosting query
Only in Dismax
Query elevation
Fix certain queries
Request handler
Pre-define clauses
Invariants
Function queries
Score is computed on field values
Solr Search Result
Ranking
Relevance
Sort on field value (only single term per document)
Available data & features
Sequence of IDs & score
Stored fields
Snippets (plus highlighting)
Facets
Count the search hits
Types: field value, dates, queries
Sort, prefix, ...
Could be used for term suggestion (aka query suggestion)
Field collapsing (grouping)
Spell checking (did-you-mean)
Additional Solr Features
Query by Example
More like this
Stats
Per field
Min, max, sum, missing, ...
Admin-GUI
Webapp to troubleshoot queries
Browse schema
JMX
Read properties & statistics
Can be accessed remotely
Solr Integration
Deployment
Within a web application server
Embedded
Monitor
Log output
Access
Various language bindings
Java, Ruby, JavaScript, PHP, ...
Other Open-Source Search Engines
Xapian
Support for many (programming) languages
Used by many open-source projects
Lemur & Indri
Language models
Academic background
Terrier
Many algorithms
Academic background
Whoosh
for Pythonistas
Solutions Commercial Solutions
What customers want
Indexing pipeline
And also: user interface, access control mechanisms, logging, etc.
Commercial Search Engines
Autonomy
Guarantees not to truncate documents
Offers a wide range of source connectivity
Sinequa
Expert Search
Creates new views, i. e. does not only show a result list, but combines results into a new unit
Attivio
Use SQL statements in queries
IntraFIND
Lucene-based
And many, many more...
The End
Next: Preprocessing Platform & Practical Applications