
Source: kti.tugraz.at/staff/rkern/courses/kddm2/2014/ir.pdf

Information Retrieval and Applications
Knowledge Discovery and Data Mining 2 (VU) (707.004)

Roman Kern, Heimo Gursch

KTI, TU Graz

2014-03-19

Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications 2014-03-19 1 / 101


Outline

1 Basics: Inverted Index, Probabilistic IR

2 Advanced Concepts: Latent Semantic Indexing, Relevance Feedback, Query Processing

3 Evaluation

4 Applications of IR: Classification, Content-based Recommender Systems

5 Solutions: Open Source Solutions, Commercial Solutions


Basics

Inverted Index, Text-Preprocessing, Language-based & Probabilistic IR


Functional features

Search for single keyword

Search for multiple keywords

Search for phrases, i. e. keywords in particular order

Boolean combination of keywords (AND, OR, NOT)

Rank results by number of keyword occurrences

Find the positions of keywords in a document

Compose snippets of text around the keyword positions

Motivation

How can these goals be achieved?


Structured, Semi-Structured Data

Structured data

Format and type of information is known

IR task: normalize representations between different sources

Example: SQL databases

Semi-structured data

Combination of structured and unstructured data

IR task: extract information from the unstructured part and combine it with the structured information

Example: email, document with meta-data


Unstructured data

Unstructured data

Format of the information is not known; the information is hidden in (human-readable) text

IR task: extract any information at all

Example: blog-posts, Wikipedia articles

This lecture focuses on unstructured data.


Simple Approach—Sequential Search

Advantage Simple

Disadvantages Slow and hard to scale

Usage Small texts; frequently changing text; not enough space available for an index

Example grep on the UNIX command line

Not feasible for large texts. A different approach is needed. But how?
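The grep-style sequential scan can be sketched in a few lines of Python (a minimal illustration; the document collection and the function name are made up for this example):

```python
def sequential_search(documents, keyword):
    """Scan every document in full -- simple, but O(total text size) per query."""
    hits = []
    for doc_id, text in documents.items():
        if keyword in text:
            hits.append(doc_id)
    return hits

docs = {
    1: "Sepp was stopping off in Hawaii.",
    2: "Sepp is a co-worker of Heidi.",
    3: "Heidi stopped going to Hawaii.",
}
print(sequential_search(docs, "Hawaii"))  # [1, 3]
```

Every query re-reads the whole collection, which is exactly why an index is needed for large texts.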


Definition

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

The term “term” is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...)


Assumptions

Basic assumptions of Information Retrieval

Collection Fixed set of documents

Goal Retrieve documents with information that is relevant to the user’s information need and helps them complete a task.


Overview Indexed Searching

Split process

Indexing Convert all input documents into a data-structure suitable for fast searching

Searching Take an input query and map it onto the pre-processed data-structure to retrieve the matching documents

Inverted index

Reverses the relationship between documents and contained words (terms)

Central data-structure of a search engine


Three Example Documents

Document 1

Sepp was stopping off in Hawaii.

Document 2

Sepp is a co-worker of Heidi.

Document 3

Heidi stopped going to Hawaii.


Inverted Index Matrix

Term-document incidence

Matrix entries

0 The document does not contain the corresponding term

1 The document does contain the corresponding term


Example Matrix

                Document 1   Document 2   Document 3
co-worker            0            1            0
going                0            0            1
Hawaii               1            0            1
Heidi                0            1            1
Sepp                 1            1            0
stopped              0            0            1
stopping off         1            0            0

Matrix entries

0 The document does not contain the corresponding term

1 The document does contain the corresponding term


Properties of the Inverted Index Matrix

Vectorization of Documents

Column Words of a document

Row All documents where a word occurs

Possible Queries

Documents containing all terms of a query Take the row vectors for each term and do a bitwise AND

Documents containing any term of a query Take the row vectors for each term and do a bitwise OR

Documents not containing a term Use a bitwise NOT


Examples

All documents containing the term Hawaii

Take the row vector of Hawaii, look for a one: Document 1 & Document 3

All documents containing the terms Sepp & Heidi

1 Take the rows for the terms Sepp & Heidi

2 Combine them by a bitwise AND

3 The position of the one in the result vector indicates the matching document: Document 2

More complex queries

Use Boolean algebra on the row vectors
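The row-vector queries above can be sketched by packing each incidence row into an integer bit vector, so that the Boolean operators become single bitwise instructions (the bit layout, with bit i standing for document i+1, is an assumption of this example):

```python
# Term-document incidence rows as bit vectors, mirroring the example matrix.
rows = {
    "co-worker":    0b010,  # bit order: Doc3 Doc2 Doc1
    "going":        0b100,
    "Hawaii":       0b101,
    "Heidi":        0b110,
    "Sepp":         0b011,
    "stopped":      0b100,
    "stopping off": 0b001,
}
ALL = 0b111  # all three documents

def docs_from(bits):
    """Translate a result bit vector back into a list of document numbers."""
    return [i + 1 for i in range(3) if bits >> i & 1]

# Documents containing both Sepp AND Heidi: bitwise AND of the rows
print(docs_from(rows["Sepp"] & rows["Heidi"]))  # [2]
# Sepp OR Heidi
print(docs_from(rows["Sepp"] | rows["Heidi"]))  # [1, 2, 3]
# NOT Hawaii (mask with ALL so only valid document bits survive)
print(docs_from(ALL & ~rows["Hawaii"]))         # [2]
```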


What we have so far...

Advantages

Search for terms

Formulate boolean queries

Works very fast

Disadvantages

Cannot search for phrases (i. e. terms in a particular order)

No information where the term occurred in the document

No snippets

No ranking possible, documents either match or do not match

Matrix is usually very sparse, possible waste of memory


Make the inverted index memory efficient

Dictionary

Store a (sorted*) list of all terms in the documents.

*The sorting is not necessary for it to work, but it is necessary to find terms fast.

Postings

For every term: store a list of the documents in which the term occurs

Posting list

Linked lists generally preferred to arrays

Dynamic space allocation

Easy insertion when terms are added to documents

Space overhead of pointers


Example

Dictionary → Postings

co-worker → Document 2

going → Document 3

Hawaii → Document 1, Document 3

Heidi → Document 2, Document 3

Sepp → Document 1, Document 2

stopped → Document 3

stopping off → Document 1


Boolean Retrieval

AND Intersection of postings for two or more terms

OR Union of postings for two or more terms

NOT All documents not in the postings of a term

Make it efficient

Sort the dictionary

Sort the postings by document IDs

But: long, tricky Boolean combinations can still become expensive!
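The dictionary-plus-sorted-postings scheme and the linear-time merge used for AND queries can be sketched as follows (lower-cased toy documents; names are illustrative):

```python
def build_index(docs):
    """Dictionary: term -> sorted list of document IDs (the posting list)."""
    index = {}
    for doc_id in sorted(docs):  # visiting docs in ID order keeps postings sorted
        for term in docs[doc_id].split():
            postings = index.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

def intersect(p1, p2):
    """AND of two sorted posting lists in O(len(p1) + len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

docs = {1: "sepp was stopping off in hawaii",
        2: "sepp is a co-worker of heidi",
        3: "heidi stopped going to hawaii"}
index = build_index(docs)
print(intersect(index["sepp"], index["heidi"]))    # [2]
print(intersect(index["hawaii"], index["heidi"]))  # [3]
```

The merge only works because both posting lists are sorted by document ID, which is exactly why the slide insists on sorted postings.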


Properties

Prerequisite

Every document needs a unique, sortable ID, e. g. a unique integer number

Advantages

Very efficient storage

Search for terms

Formulate boolean queries

Disadvantages

Cannot search for phrases (i. e. terms in a particular order)

No information where the term occurred in the document

No snippets

No ranking possible, documents either match or do not match


Positional Information

Store not only that a document contains a term, but also where the term occurs in the document.

Dictionary → Postings

term 1 → Document ID, Position(s) in document

term 2 → Document ID, Position(s) in document

... → ...

term n → Document ID, Position(s) in document

Search for phrases (i. e. consecutive words)

1 Get each word from the dictionary and retrieve its posting list

2 If the document IDs are equal for two or more words, search for consecutive positions


Example

Dictionary → Postings

co-worker → (Doc 2, Pos 4)

going → (Doc 3, Pos 3)

Hawaii → (Doc 1, Pos 5); (Doc 3, Pos 5)

Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)

Sepp → (Doc 1, Pos 1); (Doc 2, Pos 1)

stopped → (Doc 3, Pos 2)

stopping off → (Doc 1, Pos 3)

Search for the phrase Heidi stopped

1 Heidi → (Doc 2, Pos 6); (Doc 3, Pos 1)

2 stopped → (Doc 3, Pos 2)

3 Consecutive positions for Document 3
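The positional postings and the phrase search procedure above can be sketched as follows (a toy illustration; positions are counted from 1 as on the slide):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions]}, positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Documents where the words of `phrase` occur at consecutive positions."""
    words = phrase.split()
    postings = [index.get(w, {}) for w in words]
    common = set(postings[0])          # documents containing the first word ...
    for p in postings[1:]:
        common &= set(p)               # ... and every other word as well
    hits = []
    for doc_id in sorted(common):
        starts = postings[0][doc_id]
        if any(all(pos + k in postings[k][doc_id] for k in range(1, len(words)))
               for pos in starts):
            hits.append(doc_id)
    return hits

docs = {1: "sepp was stopping off in hawaii",
        2: "sepp is a co-worker of heidi",
        3: "heidi stopped going to hawaii"}
index = build_positional_index(docs)
print(phrase_search(index, "heidi stopped"))  # [3]
print(phrase_search(index, "stopped heidi"))  # []
```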


Properties

Advantages

Still efficient storage

Search for terms

Formulate boolean queries

Search for phrases (i. e. terms in a particular order)

Information where the term occurred in the document

Possible creation of snippets

Disadvantages

Ranking a bit tricky, but possible


Term Frequencies

Implicit term frequencies

By storing the positions where a term occurs in a document, the information how often it occurs is stored as well: there must be more position markers if the term occurs more often.

Problem: Counting needs time

Explicit term frequencies

Store in the posting list, how often the term occurs in the document

Possible to leave the positional information out, if not needed


Term Frequencies

Structure

Store in the posting list how often the term occurs in the document (this is often called Term Frequency (TF))

Possible to leave the positional information out, if not needed

Dictionary → Postings

term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...


Properties

Advantages

Still efficient storage

Search for terms

Formulate boolean queries

Search for phrases (i. e. terms in a particular order)

Information where the term occurred in the document

Possible creation of snippets

Documents can be ranked by the number of terms they contain

Disadvantages

Takes up space: not only the original documents are stored, but also the inverted index


Steps for building an index

Step 1 Build term-document table

Step 2 Sort by terms

Step 3 Add term frequency; multiple entries from a single document get merged

Correspondingly, get the positions of the terms in the documents

Step 4 Result is split into a dictionary file and a postings file.

Dictionary → Postings

term 1 → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2 → Doc ID, TF, Position(s); ...
... → ...
term n → Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...


Preprocessing of Text

To make a text indexable, a couple of steps are necessary:

1 Detect the language of the text

2 Detect sentence boundaries

3 Detect term (word) boundaries

4 Stemming and normalization

5 Remove stop words


Language Detection

Letter Frequencies

Percentage of occurrence

Letter    English    German
A           8.17       6.51
E          12.70      17.40
I           6.97       7.55
O           7.51       2.51
U           2.76       4.35

N-Gram Frequencies

Better, more elaborate methods use statistics over more than one letter, e. g. statistics over two, three or even more consecutive letters.
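A toy language guesser along these lines, using only the five vowel frequencies from the table above (real detectors use full letter or n-gram statistics; the squared-error comparison is an assumption of this sketch):

```python
# Expected vowel frequencies (percent), taken from the table above.
PROFILES = {
    "English": {"a": 8.17, "e": 12.70, "i": 6.97, "o": 7.51, "u": 2.76},
    "German":  {"a": 6.51, "e": 17.40, "i": 7.55, "o": 2.51, "u": 4.35},
}

def guess_language(text):
    """Pick the profile whose vowel frequencies are closest (squared error)."""
    letters = [c for c in text.lower() if c.isalpha()]
    best, best_err = None, float("inf")
    for lang, profile in PROFILES.items():
        err = 0.0
        for letter, expected in profile.items():
            observed = 100.0 * letters.count(letter) / max(len(letters), 1)
            err += (observed - expected) ** 2
        if err < best_err:
            best, best_err = lang, err
    return best

print(guess_language("the quick brown fox jumps over the lazy dog"))  # English
print(guess_language("die würde des menschen ist unantastbar"))       # German
```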


Sentence Detection

First approach Every period marks the end of a sentence

Problem Periods also mark abbreviations, decimal points, email-addresses, etc.

Utilize other features

Is the next letter capitalized?
Is the term in front of the period a known abbreviation?
Is the period surrounded by digits?
...
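These heuristics can be combined into a small rule-based splitter (the abbreviation list and the exact rules are illustrative, not a production sentence detector):

```python
ABBREVIATIONS = {"e.g", "i.e", "dr", "prof", "etc", "vs"}  # illustrative list

def split_sentences(text):
    """A period ends a sentence unless it follows a known abbreviation,
    sits between digits, or the next letter is not capitalized."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        words_before = text[:i].split()
        before = words_before[-1].rstrip(".").lower() if words_before else ""
        after = text[i + 1:].lstrip()
        between_digits = (0 < i < len(text) - 1
                          and text[i - 1].isdigit() and text[i + 1].isdigit())
        if before in ABBREVIATIONS or between_digits or (after and not after[0].isupper()):
            continue  # probably not a sentence boundary
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Heidi paid 3.50 for the trip. Dr. Sepp stayed home."))
# ['Heidi paid 3.50 for the trip.', 'Dr. Sepp stayed home.']
```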


Tokenization – Problem

Goal

Split sentences into tokens

Throw away punctuation

Possible Pitfalls

White spaces are not a safe delimiter for word boundaries (e. g. NewYork)

Non-alphanumerical characters can separate words, but need not (e. g. co-worker)

Different forms (e. g. white space vs. whitespace)

German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)


Tokenization – Solutions

Supervised Learning

Train a model with annotated training data, use the trained model to tokenize unknown text

Hidden Markov Models and conditional random fields are commonly used

Dictionary Approach

Build a dictionary (i. e. a list) of tokens

Go over the sentence and always take the longest fitting token (greedy algorithm!)

Remark: works well for Asian languages with short words written without white spaces; problematic for European languages
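A sketch of the greedy dictionary approach (vocabulary and input are contrived; characters not covered by any dictionary entry fall back to single-character tokens):

```python
def greedy_tokenize(text, dictionary):
    """At each position, take the longest dictionary entry that matches
    (the greedy strategy used for scripts written without white space)."""
    tokens, i = [], 0
    max_len = max(map(len, dictionary))
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

vocab = {"new", "york", "newyork", "in", "stopping", "off"}
print(greedy_tokenize("stoppingoffinnewyork", vocab))
# ['stopping', 'off', 'in', 'newyork']
```

Note how the greedy rule picks "newyork" over "new" + "york"; being greedy is also why the approach can go wrong when a long match swallows the start of the next word.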


Normalization

Some definitions

Token Sequence of characters representing a useful semantic unit, instance of a type

Type Common concept for tokens, element of the vocabulary

Term Type that is stored in the dictionary of an index

Task

1 Map all possible tokens to the corresponding type

2 Store the representing term of each type in the dictionary


Example

[Figure: normalization example]


Possible Ways of Normalization

Ignore cases

Rule-based removal of hyphens, periods, white spaces, accents, diacritics (be aware of a possibly different meaning after removal)

Provide a list of all possible representations of a type

Map different spelling versions by using a phonetic algorithm. Phonetic algorithms translate the spelling into its pronunciation equivalent (e. g. Double Metaphone, Soundex).
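The first two rules above (case folding, removal of hyphens, periods, and diacritics) can be sketched with the standard library alone; the exact rule set chosen here is an assumption for illustration:

```python
import unicodedata

def normalize(token):
    """Map a token to its type: case-fold, drop hyphens and periods,
    then strip accents/diacritics via Unicode decomposition."""
    token = token.casefold().replace("-", "").replace(".", "")
    decomposed = unicodedata.normalize("NFD", token)
    # "Mn" = nonspacing combining marks, i.e. the detached accents
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(normalize("Co-Worker"))  # coworker
print(normalize("Résumé"))     # resume
print(normalize("U.S.A."))     # usa
```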


Stemming & Lemmatization

Definition

Goal of both Reduce different grammatical forms of a word to their common infinitive*

Stemming Usage of heuristics to chop off / replace the last part of words (example: Porter’s algorithm).

Lemmatization Usage of proper grammatical rules to recreate the infinitive.

*The common infinitive is the form of a word as it is written in a dictionary. This form is called lemma, hence the name lemmatization.

Examples

Map do, doing, done to the common infinitive do

Map digitalizing, digitalized to digital
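A deliberately naive suffix-stripping stemmer, standing in for Porter's algorithm only to show the idea of chopping off word endings (the suffix list and length guard are arbitrary, and the result is not always a proper stem):

```python
# Longer suffixes must be tried first, otherwise "ing" would fire before "izing".
SUFFIXES = ["izing", "ized", "ing", "ed", "es", "s"]

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # The length guard keeps very short remainders from being produced.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for w in ["doing", "stopped", "digitalizing", "digitalized"]:
    print(w, "->", stem(w))
# doing -> do, stopped -> stopp, digitalizing -> digital, digitalized -> digital
```

"stopped -> stopp" shows why real stemmers need more rules than pure suffix chopping.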


Stop Words

Definition Extremely common words that appear in nearly every text

Problem As stop words are so common, their occurrence does not characterize a text

Solution Just drop them, i. e. do not put them into the index dictionary

Common stop words

Write a list with stop words (List might be topic specific!)

Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.

Solution

Stop word list Ignore words that are given on a list (black list)

Problem Special names and phrases (The Who, Let It Be, ...)

Solution Make another list... (white list)
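The black-list-plus-white-list filtering can be sketched as follows (both lists are illustrative; the white list protects phrases that would otherwise be swallowed entirely):

```python
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "in", "for", "from",
              "is", "of", "to", "was"}
KEEP_PHRASES = {"the who", "let it be"}  # white list of protected phrases

def remove_stop_words(tokens):
    """Drop black-listed tokens, but keep any white-listed phrase intact."""
    out, i = [], 0
    while i < len(tokens):
        for phrase in KEEP_PHRASES:
            words = phrase.split()
            if [t.lower() for t in tokens[i:i + len(words)]] == words:
                out.extend(tokens[i:i + len(words)])  # keep the whole phrase
                i += len(words)
                break
        else:
            if tokens[i].lower() not in STOP_WORDS:
                out.append(tokens[i])
            i += 1
    return out

print(remove_stop_words("The Who played in the park".split()))
# ['The', 'Who', 'played', 'park']
```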


Information flow

The classic search model


Probabilistic Information Retrieval

Every transformation can introduce information loss

1 Task to query

2 Query to document

Information loss leads to uncertainty

Probability theory can deal with uncertainty

Basic idea behind probabilistic IR

Model the uncertainty, which originates in the imperfect transformations

Use this uncertainty as a measure of how relevant a document is, given a certain query


Probabilistic Information Retrieval Models

Binary Independence Model

Binary: documents and queries are represented as binary term incidence vectors
Independence: terms are assumed to be independent of each other

Okapi BM25

Takes term frequencies and document length into account
Parameters tune the influence of term frequencies and document length

Bayesian networks for text retrieval

Language Models Coming now...
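The Okapi BM25 scoring mentioned above can be sketched as follows (the non-negative idf variant and the defaults k1 = 1.2, b = 0.75 are common choices, assumed here rather than taken from the slide):

```python
import math

def bm25_score(query_terms, doc_id, index, doc_lens, k1=1.2, b=0.75):
    """Okapi BM25: term frequency saturated by k1, length-normalised by b.
    `index` maps term -> {doc_id: term frequency}."""
    N = len(doc_lens)
    avgdl = sum(doc_lens.values()) / N
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue
        df = len(postings)  # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative variant
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl))
        score += idf * norm
    return score

docs = {1: "sepp was stopping off in hawaii",
        2: "sepp is a co-worker of heidi",
        3: "heidi stopped going to hawaii"}
index, doc_lens = {}, {}
for doc_id, text in docs.items():
    terms = text.split()
    doc_lens[doc_id] = len(terms)
    for t in terms:
        index.setdefault(t, {})
        index[t][doc_id] = index[t].get(doc_id, 0) + 1

ranked = sorted(docs, key=lambda d: bm25_score(["hawaii", "heidi"], d, index, doc_lens),
                reverse=True)
print(ranked)  # document 3 matches both query terms and ranks first
```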


Language Models

Basic Idea

Train a probabilistic model Md for document d, i. e. as many trained models as there are documents

Md scores documents by how probable it is that the user’s query q was generated by the document d

Results are document models (or documents, respectively) which have a high probability of generating the user’s query


Examples for Language Models

Successive probability of query terms
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t1 t2) ... P(tn|t1 ... tn−1)

Unigram Language Model
Puni(t1 t2 ... tn) = P(t1) P(t2) ... P(tn)

Bigram Language Model
P(t1 t2 ... tn) = P(t1) P(t2|t1) P(t3|t2) ... P(tn|tn−1)

Probability P(tn) Probabilities are estimated from the term occurrences in the document in question

And many more... Many more, and more complex, models are available
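A sketch of query-likelihood retrieval with a unigram model per document; the Jelinek-Mercer smoothing against the collection model (weight lam) is an assumption added so unseen query terms do not zero out the whole product:

```python
def query_likelihood(query, doc, collection, lam=0.8):
    """P(query | Md) under a unigram model of `doc`, smoothed with the
    collection-wide term distribution."""
    doc_terms = doc.split()
    coll_terms = collection.split()
    p = 1.0
    for t in query.split():
        p_doc = doc_terms.count(t) / len(doc_terms)    # maximum likelihood
        p_coll = coll_terms.count(t) / len(coll_terms)  # smoothing fallback
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

docs = {1: "sepp was stopping off in hawaii",
        2: "sepp is a co-worker of heidi",
        3: "heidi stopped going to hawaii"}
collection = " ".join(docs.values())
scores = {d: query_likelihood("heidi hawaii", text, collection)
          for d, text in docs.items()}
print(max(scores, key=scores.get))  # document 3 generates the query with the highest probability
```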


Advanced Concepts

LSA, Relevance Feedback, Query Processing, ...


Index Construction

Going beyond simple indexing

Term-document matrices are very large

But the number of topics that people talk about is small (in some sense)

Clothes, movies, politics, ...

Can we represent the term-document space by a lower-dimensional latent space?

⇒ Latent Semantic Indexing (LSI)


Mathematically speaking

Latent Semantic Indexing - Introduction

• This is taking a 3-D point and projecting it into 2-D

• The arrow in this picture acts like the overhead projector

[Figure: the 3-D point (10, 10, 10), i. e. (x, y, z), is projected onto the 2-D point (10, 10), i. e. (x, y)]


Mathematically speaking

Latent Semantic Indexing - Introduction

• We can project from any number of dimensions into any other number of dimensions.

• Increasing dimensions adds redundant information

• But sometimes useful

• Support Vector Machines (kernel methods) do this effectively

• Latent Semantic Indexing always reduces the number of dimensions


Advanced Concepts Latent Semantic Indexing

Mathematically speaking

Latent Semantic Indexing - Introduction

• Latent Semantic Indexing always reduces the number of dimensions

[Figure: the point (10, 10) in the (x, y) plane is projected to (10) on the x axis.]


Advanced Concepts Latent Semantic Indexing

Mathematically speaking

Latent Semantic Indexing - Introduction

• Latent Semantic Indexing can project on an arbitrary axis, not just a principal axis


Advanced Concepts Latent Semantic Indexing

Mathematically speaking

Latent Semantic Indexing - Introduction

• Our documents were just points in an N-dimensional term space

• We can project them also

[Figure: Shakespeare plays (Antony and Cleopatra, Julius Caesar, Tempest, Hamlet, Othello, MacBeth) as points in the Antony/Brutus term space, before and after projection.]


Advanced Concepts Latent Semantic Indexing

Latent Semantic Indexing

LSI - expected behaviour

A term vector that is projected on new vectors may uncover deeper meanings

For example, transforming the 3 axes of a term matrix from ball, bat and cave to

An axis that merges ball and bat

An axis that merges bat and cave

Should be able to separate differences in meaning of the term bat

Bonus: fewer dimensions means faster computation


Advanced Concepts Latent Semantic Indexing

Singular Value Decomposition

SVD overview

For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

A = UΣV^T

... U is an m × m unitary matrix

... Σ is an m × n diagonal matrix

... V^T is an n × n unitary matrix

The columns of U are orthogonal eigenvectors of AA^T

The columns of V are orthogonal eigenvectors of A^TA

The eigenvalues λ1 ... λr of AA^T are also the eigenvalues of A^TA.
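The factorization and the rank-k truncation discussed on these slides can be checked numerically; a small sketch using NumPy (the toy 4×3 term-document matrix below is made up for illustration):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])

# Reduced SVD: U is m x r, s holds the singular values, Vt is r x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factors reconstruct A (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(A_k))  # 2
```

Among all rank-k matrices, A_k minimises the Frobenius norm of the error, which is exactly the optimality claim made for LSI.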


Advanced Concepts Latent Semantic Indexing

Singular Value Decomposition

Matrix Decomposition

SVD enables lossy compression of a term-document matrix

Reduces the dimensionality or the rank
Arbitrarily reduce the dimensionality by putting zeros in the bottom right of Σ
This is a mathematically optimal way of reducing dimensions


Advanced Concepts Latent Semantic Indexing

Singular Value Decomposition

Reduced SVD

If we retain only k singular values, and set the rest to 0, then we don't need the corresponding parts of the factor matrices

Then Σ is k × k, U is m × k, V^T is k × n, and A_k is m × n

This is referred to as the reduced SVD

It is the convenient (space-saving) and usual form for computational applications


Advanced Concepts Latent Semantic Indexing

Singular Value Decomposition

Approximation Error

How good (bad) is this approximation?

It’s the best possible, measured by the Frobenius norm of the error


Advanced Concepts Latent Semantic Indexing

Singular Value Decomposition

Reduced SVD

Whereas the term-document matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000)

We can construct an approximation A_100 with rank 100.

Of all rank-100 matrices, it would have the lowest Frobenius error.

⇒ Latent Semantic Indexing


Advanced Concepts Latent Semantic Indexing

Latent Semantic Indexing

Application of SVD - LSI

From the term-document matrix A, we compute the approximation A_k.

There is a row for each term and a column for each document in A_k

Thus documents live in a space of k ≪ r dimensions

These dimensions are not the original axes


Advanced Concepts Latent Semantic Indexing

Latent Semantic Indexing

Approximation Error

Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.

Claim – this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval.

A query q is also mapped into this space


Advanced Concepts Latent Semantic Indexing

Latent Semantic Indexing

LSI - Results

Similar terms map to similar location in low dimensional space

Noise reduction by dimension reduction


Advanced Concepts Relevance Feedback

Relevance Feedback

Integrate feedback from the user

Given a search result for a user’s query

... allow the user to rate the retrieved results (good vs. not-so-good match)

This information can then be used to re-rank the results to match the user's preference

Often automatically inferred from the user's behaviour using the search query logs

... and ultimately improve the search experience

The query logs (optimally) contain the queries plus the items the user has clicked on (the user's session)


Advanced Concepts Relevance Feedback

Relevance Feedback

Pseudo relevance feedback

No user interaction is needed for the pseudo relevance feedback

... the first n results are simply assumed to be a good match

As if the user had clicked on the first results and marked them as a good match

→ often improves the quality of the search results

Pseudo relevance feedback is sometimes also called blind relevance feedback


Advanced Concepts Relevance Feedback

More Like This

A common feature of search engines is the more like this search

Given a search result

... the user wants items similar to one of the presented results

This can be seen as taking a document as the query

→ which is a common implementation (but restricting the query to the most discriminative terms)
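The "document as query" idea, restricted to the most discriminative terms, can be sketched with tf-idf term selection; the function name and the corpus statistics below are illustrative, not a real engine API:

```python
from collections import Counter
import math

def more_like_this_query(doc_tokens, doc_freq, n_docs, max_terms=5):
    # Score each term of the document by tf-idf and keep the top ones.
    tf = Counter(doc_tokens)
    scores = {t: tf[t] * math.log(n_docs / doc_freq.get(t, 1)) for t in tf}
    top = sorted(scores, key=scores.get, reverse=True)[:max_terms]
    return " OR ".join(top)

# doc_freq maps term -> number of documents containing it (made-up corpus stats).
q = more_like_this_query(["lsi", "lsi", "index", "the"],
                         doc_freq={"the": 1000, "index": 40, "lsi": 3},
                         n_docs=1000)
print(q)  # lsi OR index OR the
```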


Advanced Concepts Query Processing

Query Reformulation

Modify the query, during or after the user interaction

Query suggestion & completion

(Automatic) query correction

(Automatic) query expansion


Advanced Concepts Query Processing

Query Suggestion

The user starts typing a query

... automatically gets suggestions

→ to complete the currently typed word

→ to complete a whole phrase (e.g. the next word)

→ suggest related queries


Advanced Concepts Query Processing

Query Correction

Query misspellings

Our principal focus here

E.g., the query Britni Spears

We can either

Retrieve documents indexed by the correct spelling,

Rewrite the input query (replace, expand),

britni spears ↦ (britni OR britney) spears

Return several suggested alternative queries with the correct spelling – Did you mean Britney Spears?


Advanced Concepts Query Processing

Query Correction

Isolated word correction

Fundamental premise – there is a lexicon from which the correct spellings come

Basic choices for this lexicon
A standard lexicon, such as Webster's English Dictionary
An industry-specific lexicon – hand-maintained

The lexicon of the indexed corpus

E.g., all words on the web; all names, acronyms, etc. (including the mis-spellings)

The use of a query log

Use other users' behaviour


Advanced Concepts Query Processing

Query Correction

Isolated word correction

Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q

What's closest?

We'll study several alternatives:
1 Edit distance (Levenshtein distance)
2 Weighted edit distance
3 n-gram overlap


Advanced Concepts Query Processing

Query Correction

Word correction: edit distance

Given two strings S1 and S2, the minimum number of operations to convert one to the other

Operations are typically character-level

Insert, Delete, Replace, (Transposition)

Examples

From dof to dog is 1
From cat to act is 2 (just 1 with transpose)
From cat to dog is 3

See http://www.merriampark.com/ld.htm for a nice example plus an applet.
Notebook with various distances (Levenshtein being the best known): http://nbviewer.ipython.org/gist/MajorGressingham/7691723
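A compact sketch of the standard dynamic-programming algorithm (insert/delete/replace only; handling transposition would require the Damerau variant):

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete from s1
                           cur[j - 1] + 1,              # insert into s1
                           prev[j - 1] + (c1 != c2)))   # replace (free if equal)
        prev = cur
    return prev[-1]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2 (would be 1 with transpositions)
print(edit_distance("cat", "dog"))  # 3
```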


Advanced Concepts Query Processing

Query Correction

Word correction: weighted edit distance

As above, but the weight of an operation depends on the character(s) involved

Meant to capture OCR or keyboard errors, e.g. m more likely to be mis-typed as n than as q

Therefore, replacing m by n is a smaller edit distance than by q

This may be formulated as a probability model

Requires weight matrix as input


Advanced Concepts Query Processing

Query Correction

Word correction: n-gram overlap

Enumerate all the n-grams in the query string as well as in the lexicon

Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams

Threshold by number of matching n-grams

Variants – weight by keyboard layout, etc.


Advanced Concepts Query Processing

Query Correction

Word correction: n-gram overlap, example

Suppose the text is november

Trigrams are: nov ove vem emb mbe ber

The query is december

Trigrams are: dec ece cem emb mbe ber

So 3 trigrams overlap (of 6 in each term)

How can we turn this into a normalised measure of overlap?


Advanced Concepts Query Processing

Query Correction

n-gram overlap, Jaccard Coefficient

A commonly-used measure of overlap

Let X and Y be two sets; then the Jaccard Coefficient is

JC(X, Y) = |X ∩ Y| / |X ∪ Y|

Equals 1 when X and Y have the same elements
Equals 0 when they are disjoint

X and Y don't have to be of the same size

Always assigns a number between 0 and 1

Now threshold to decide if you have a match
E.g., if JC > 0.8 ⇒ declare a match
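The november/december example from the previous slide can be reproduced in a few lines:

```python
def ngrams(word, n=3):
    """All character n-grams of a word, as a set."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two sets."""
    return len(x & y) / len(x | y)

nov = ngrams("november")  # {'nov','ove','vem','emb','mbe','ber'}
dec = ngrams("december")  # {'dec','ece','cem','emb','mbe','ber'}
print(len(nov & dec))                # 3 overlapping trigrams
print(round(jaccard(nov, dec), 3))   # 0.333
```

With a 0.8 threshold, november and december would (correctly) not be declared a match.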


Advanced Concepts Query Processing

Query Expansion

The user has entered a query

... automatically (related) words are added to the query

→ Global query expansion - only use the query (plus corpus or background knowledge)

→ Local query expansion - conduct the search and analyse the search results


Advanced Concepts Query Processing

Global Query Expansion

Uses only the query

Example: Use of a thesaurus

Query: [buy car]

Use synonyms from the thesaurus
Expanded query: [(buy OR purchase) (car OR auto)]
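A minimal sketch of this expansion with a hand-written thesaurus; the toy dictionary below is a stand-in for a real resource such as WordNet:

```python
# Hypothetical, hand-written thesaurus: term -> list of synonyms.
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}

def expand_query(terms, thesaurus):
    """OR each query term with its synonyms, keeping terms without synonyms as-is."""
    parts = []
    for t in terms:
        syns = thesaurus.get(t, [])
        parts.append("(" + " OR ".join([t] + syns) + ")" if syns else t)
    return " ".join(parts)

print(expand_query(["buy", "car"], THESAURUS))  # (buy OR purchase) (car OR auto)
```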


Advanced Concepts Query Processing

Local query expansion

Uses the query plus the search results

... conceptually similar to the pseudo relevance feedback

Look for terms in the first n search results and add them to the query

Re-run the query with the added terms
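The loop above can be sketched generically; `search` below is a stand-in for any ranked-retrieval function returning documents as token lists (a hypothetical hook, not a real API):

```python
from collections import Counter

def local_expansion(query_terms, search, top_n=5, add_k=3):
    """Pseudo-relevance-feedback style expansion: assume the first top_n
    results are relevant and add their most frequent new terms to the query."""
    counts = Counter()
    for doc in search(query_terms)[:top_n]:
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(add_k)]

fake_search = lambda q: [["ir", "index"], ["ir", "ranking"]]
print(local_expansion(["search"], fake_search))  # ['search', 'ir', 'index', 'ranking']
```

The expanded query would then be re-run against the index.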


Evaluation

Evaluation
Assess the quality of search


Evaluation

Overview: Evaluation

Two basic questions / goals

Are the retrieved documents relevant for the user's information need?

Have all relevant documents been retrieved?

In practice these two goals contradict each other.


Evaluation

Precision & Recall

Evaluation Scenario

Fixed document collection

Number of queries

Set of documents marked as relevant for each query

Main Evaluation Metrics

Precision = #(relevant items retrieved) / #(retrieved items)

Recall = #(relevant items retrieved) / #(relevant items)

F1 = (2 · Precision · Recall) / (Precision + Recall)
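The three metrics can be computed directly from the retrieved and relevant document sets; a small sketch (the document ids are made up):

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 for one query, given retrieved and relevant doc sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant items retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1(retrieved=["d1", "d2", "d3", "d4"],
                               relevant=["d1", "d3", "d5"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```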



Evaluation

Advanced Evaluation Measures

Mean Average Precision - MAP

... the mean of the average precision over multiple queries

MAP = (1 / |Q|) · Σ_{q ∈ Q} AP_q

AP_q = (1 / n) · Σ_{k=1}^{n} Precision(k)

(Normalised) discounted cumulative gain (nDCG)

... used where there is a reference ranking of the relevant results
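One common reading of AP_q — precision accumulated at the ranks where relevant documents appear, with n the number of relevant documents — can be sketched as follows (the ranked lists are made up):

```python
def average_precision(ranked, relevant):
    """AP for one query: Precision@k summed at each rank k holding a relevant
    document, normalised by the number of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: one (ranked_results, relevant_set) pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
print(round(ap, 3))  # (1/1 + 2/3) / 2 = 0.833
```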


Applications of IR

Selected Applications of IR
Classification & Recommender Systems


Applications of IR Classification

Overview

Classification

Supervised Learning: labelled training data is needed (effort for labelling)

Classification problems: classify an example into a given set of categories

Algorithm learns a model from the labelled training set and applies this model to unknown data

Used in real applications

Clustering

Unsupervised Learning

Find groups in the data, where the similarity within the group is high and the similarity between groups is low

Hardly ever used in real applications


Applications of IR Classification

Supervised learning

Given examples of a function (x, f(x))

Predict function f(x) for new examples x

Depending on f(x) we have:
1 Classification if f(x) is a discrete function
2 Regression if f(x) is a continuous function
3 Probability estimation if f(x) is a probability distribution


Applications of IR Classification

Classification

Definition

Given: Training examples (x, f(x)) for some unknown discrete function f

Find: A good approximation for f

x is data, e. g. a vector, text documents, emails, words in text, etc.

Classification approaches

Logistic regression

Decision trees

Support vector machines

Neural networks

Statistical inference such as Bayesian learning

Nearest neighbour algorithms (Here: kNN-algorithm)


Applications of IR Classification

General kNN-Algorithm

1 For each desired class: at least one example must be provided

2 Take an unclassified document: look at its k nearest neighbours

3 Assign the unknown element to the class with the most neighbours in it

4 Goto 2
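A minimal sketch of this loop for one document, using cosine similarity over term-weight dictionaries (the training data and helper names are illustrative):

```python
from collections import Counter

def knn_classify(doc_vec, labelled, k=3):
    """labelled: list of (vector, label) pairs; vectors are dicts term -> weight."""
    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    # Take the k most similar labelled examples and vote by majority class.
    neighbours = sorted(labelled, key=lambda vl: cosine(doc_vec, vl[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [({"ball": 1, "goal": 2}, "sports"),
         ({"goal": 1, "match": 1}, "sports"),
         ({"vote": 2, "party": 1}, "politics")]
print(knn_classify({"goal": 1, "ball": 1}, train, k=3))  # sports
```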


Applications of IR Classification

Document Classification

1 For each desired class: at least one document must be provided
2 Find the k nearest neighbours of a document:

1 Take all terms of the document
2 Combine them into a query (e.g. with a logical OR)
3 Take the first k results from the index

3 Assign the unknown document to the class with the most neighbours in it

4 Goto 2


Applications of IR Content-based Recommender systems

Overview

Recommender systems should provide usable suggestions to users

Recommender systems should show new, unknown – but similar – documents

Recommender systems can utilize different information sources

Content-based Recommender: finds similar documents on the basis of document content

Collaborative Filtering: employs user ratings to find new and useful documents

Knowledge-based Recommender: makes decisions based on pre-programmed rules

Hybrid Recommender: a combination of two or three other approaches


Applications of IR Content-based Recommender systems

Types of Content-based Recommender

Document-to-Document Recommender

Document-to-User Recommender


Applications of IR Content-based Recommender systems

Building a Recommender by Using an Index

Document-to-Document Recommender

1 Take the terms from a document

2 Construct a (weighted) query with them

3 Query the index

4 Results are possible candidates for recommending

Document-to-User Recommender

1 Create a user model from

Terms typed by the user
Documents viewed by the user
⇒ Simplest user model is just a set of terms

2 Construct a (weighted) query with them

3 Query the index

4 Results are possible candidates for recommending
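The document-to-user variant can be sketched as follows; the `term^weight` boost syntax is Lucene/Solr-style, and the user data is made up:

```python
from collections import Counter

def user_model(typed_terms, viewed_docs):
    """Simplest user model: a weighted set of terms, counted across sources."""
    model = Counter(typed_terms)
    for doc in viewed_docs:
        model.update(doc)
    return model

def to_weighted_query(model, max_terms=4):
    # Turn the heaviest terms into a boosted OR query (Lucene/Solr boost syntax).
    top = model.most_common(max_terms)
    return " OR ".join(f"{t}^{w}" for t, w in top)

m = user_model(["solr"], [["lucene", "index"], ["lucene", "query"]])
print(to_weighted_query(m))
```

Querying the index with this string yields recommendation candidates, exactly as in the steps above.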


Solutions

Solutions
Open-source and commercial IR solutions


Solutions Open Source Solutions

Open Source Search Engine

Apache Lucene

Very popular, high quality

Backbone of many other search engines, e.g. Apache Solr, ElasticSearch

Lucene is the core search engine
Solr provides a service & configuration infrastructure (plus a few extra features)

Implemented in Java, bindings for many other programming languages


Solutions Open Source Solutions

Solr Index Management

API

HTTP Server

Various formats (XML, binary, JavaScript, ...)

Document life-cycle

There is no update

Delete (done automatically by Solr)

Insert

Implications

A unique id is necessary
Use batch updates

Commit, rollback (and optimize)


Solutions Open Source Solutions

Solr Text Processing

Scope

During indexing & query

Tokenization

Split text into tokens

Lower-case alignment

Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...)

Synonyms (via Thesaurus)

Stop-word filtering

Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)

n-grams, soundex, umlauts


Solutions Open Source Solutions

Solr Query Processing

Query parsers

Lucene query parser (rich syntax)

AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
Boosting of individual parts
Example: ((boltzmann OR schroedinger) NOT einstein)

Dismax query parser

No query syntax
Searches over multiple fields (separate boost for each field)
Configure the number of terms that must match
Distance between terms is used for ranking (phrase boosting)

Dismax is a good starting point, but may become expensive


Solutions Open Source Solutions

Search Features

Query filter

Additional query

No impact on ranking

Results are cached

Boosting query

Only in Dismax

Query elevation

Fix certain queries

Request handler

Pre-define clauses

Invariants

Function queries

Score is computed on field values


Solutions Open Source Solutions

Solr Search Result

Ranking

Relevance

Sort on field value (only single term per document)

Available data & features

Sequence of IDs & score

Stored fields

Snippets (plus highlighting)

Facets

Count the search hits

Types: field value, dates, queries

Sort, prefix, ...

Can be used for term suggestion (aka query suggestion)
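Faceting is again just request parameters, repeated where needed; an illustrative sketch with hypothetical field names (author, year):

```python
from urllib.parse import urlencode

# Repeated keys are allowed in Solr requests, hence a list of tuples.
params = [
    ("q", "*:*"),
    ("rows", 0),                # only facet counts, no documents
    ("facet", "true"),
    ("facet.field", "author"),  # field-value facet
    ("facet.field", "year"),
    ("facet.prefix", "ker"),    # prefix counts, usable for term suggestion
    ("facet.sort", "count"),
]
print(urlencode(params))
```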

Field collapsing (grouping)

Spell checking (did-you-mean)


Additional Solr Features

Query by Example

More like this

Stats

Per field

Min, max, sum, missing, ...

Admin-GUI

Webapp to troubleshoot queries

Browse schema

JMX

Read properties & statistics

Can be accessed remotely


Solr Integration

Deployment

Within a web application server

Embedded

Monitor

Log output

Access

Various language bindings

Java, Ruby, JavaScript, PHP, ...
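Without a dedicated binding, plain HTTP suffices, since Solr answers ordinary GET requests; a minimal sketch where the URL, port, and core name ("core1") are hypothetical and no request is actually sent here:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_select_url(query, base="http://localhost:8983/solr/core1/select"):
    """Build a Solr select URL that asks for a JSON response."""
    return base + "?" + urlencode({"q": query, "wt": "json"})

def solr_search(query):
    """Issue the request and parse the JSON reply."""
    with urlopen(build_select_url(query)) as resp:
        return json.load(resp)

print(build_select_url("title:boltzmann"))
```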


Other Open-Source Search Engines

Xapian

Support for many (programming) languages

Used by many open-source projects

Lemur & Indri

Language models

Academic background

Terrier

Many algorithms

Academic background

Whoosh

For Pythonistas


Solutions Commercial Solutions

What customers want

Indexing pipeline

And also: user interface, access control mechanisms, logging, etc.


Commercial Search Engines

Autonomy

Guarantees not to truncate documents

Offers a wide range of source connectors

Sinequa

Expert Search

Creates new views, i.e. does not only show a result list, but combines results into a new unit

Attivio

Use SQL statements in queries

IntraFIND

Lucene-based

And many, many more...


The End

Next: Preprocessing Platform & Practical Applications
