66
Information Retrieval and Evaluation Knowledge Discovery and Data Mining 2 (VU) (707.004) Heimo Gursch, Roman Kern ISDS, TU Graz 2019-04-04 Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 1 / 66

Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Information Retrieval and EvaluationKnowledge Discovery and Data Mining 2 (VU) (707.004)

Heimo Gursch, Roman Kern

ISDS, TU Graz

2019-04-04

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 1 / 66

Page 2: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Outline1 Basics

FoundationsInverted IndexText Preprocessing

2 Advanced ConceptsRelevance FeedbackQuery ProcessingLearning-To-RankProbabilistic IRLatent Semantic Indexing

3 Evaluation4 Applications & Solutions

Distributed & Federated SearchContent-based Recommender systemsOpen Source SolutionsCommercial Solutions

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 2 / 66

Page 3: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

BasicsInverted Index & Text-Preprocessing

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 3 / 66

Page 4: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Foundations

Definition

Information Retrieval (IR) is findingmaterial (usually documents) of anunstructured nature (usually text) that satisfies an information need fromwithin large collections (usually stored on computers).

The term “term” is very common in Information Retrieval and typically refers to single words (alsocommon: bi-grams, phrases, ...)

Basic assumptions of Information RetrievalCollection Fixed set of documents

Goal Retrieve documents with information that is relevant touser’s information need and help her complete a task.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 4 / 66

Page 5: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Structured, Semi-Structured Data

Structured dataFormat and type of information is knownIR task: normalize representations between different sourcesExample: SQL databases

Semi-structured dataCombination of structured and unstructured dataIR task: extract information from the unstructured part and combine itwith the structured informationExample: email, document with meta-data

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 5 / 66

Page 6: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Unstructured data

Unstructured dataFormat of information is not know, information is hidden in(human-readable) textIR task: extract information at allExample: blog-posts, Wikipedia articles

This lecture focuses on unstructured data.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 6 / 66

Page 7: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Overview Indexed Searching

Split processIndexing Convert all input documents into a data-structure suitable for

fast searchingSearching Take an input query andmap it onto the pre-processed

data-structure to retrieve the matching documents

Inverted indexReverses the relationship between documents and contained terms(words)Central data-structure of a search engine

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 7 / 66

Page 8: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Information Stored in an Inverted Index

DictionarySorted List of all terms found in all documents

PostingsInformation about the occurrence of a term, consisting of:Document (DocID) Document identifier, i. e. , in which document the term

was foundPosition(s)(Pos) Location(s) of the term in den original documentTerm Frequency (TF) How often the term occurs in the documenta

aThe term frequencies could also be calculated by counting the positionentries. As computing time is more precious thanmemory, the term frequenciesare stored as well.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 8 / 66

Page 9: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Example

Document 1Sepp was stopping off in Hawaii. Sepp is a co-worker of Heidi.

Document 2Heidi stopped going to Hawaii. Heidi is a co-worker of Sepp.

Dictionary 7→ Postings

co-worker 7→ (Doc:1; TF:1; Pos:10); (Doc:2; TF:1; Pos:9)going 7→ (Doc:2; TF:1; Pos:3)Hawaii 7→ (Doc:1; TF:1; Pos:6); (Doc:2; TF:1; Pos:5)Heidi 7→ (Doc:1; TF:1; Pos:12); (Doc:2; TF:2; Pos:1,6)Sepp 7→ (Doc:1; TF:2; Pos:1,7); (Doc:2; TF:1; Pos:11)stopped 7→ (Doc:2; TF:1; Pos:2)stopping off 7→ (Doc1; TF:1; Pos 3)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 9 / 66

Page 10: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Properties

PropertiesSearch for termsFormulate boolean queriesSearch for phrases (i. e. , terms in a particular order)Information where the term occurred in the documentPossible creation of snippetsDocuments can be ranked by the number of terms they contain

Keep in mind that an inverted index... takes up additional space, not only storing the original documents,but also the inverted index... must be kept up to date

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 10 / 66

Page 11: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Text Preprocessing

Tomake a text indexable, a couple of steps are necessary:1 Detect the language of the text2 Detect sentence boundaries3 Detect term (word) boundaries4 Stemming and normalization5 Remove stop words

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 11 / 66

Page 12: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Language Detection

Letter Frequencies

Percentage of occurrenceLetter English German

A 8.17 6.51E 12.70 17.40I 6.97 7.55O 7.51 2.51U 2.76 4.35

N-Gram FrequenciesBetter–more elaborate–methods use statistics over more than one letter,e. g. statistics over two, three or evenmore consecutive letters.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 12 / 66

Page 13: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Sentence Detection

First approachEvery period marks the end of a sentence

ProblemPeriods also mark abbreviations, decimal points,email-addresses, etc.

Utilize other featuresIs the next letter capitalized?Is the term in front of the period a known abbreviation?Is the period surround by digits?...

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 13 / 66

Page 14: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Tokenization – Problem

GoalSplit sentences into tokensThrow away punctuations

Possible Pit FallsWhite spaces are no safe delimiter for word boundaries(e. g. New York)Non-alphanumerical characters can separate words, but need not(e. g. co-worker)Different forms (e. g.white space vs. whitespace)German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft,meaning Danube Steamboat Shipping Company)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 14 / 66

Page 15: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Tokenization – Solutions

Supervised LearningTrain a model with annotated training data, use the trainedmodel totokenize unknown textHidden-Markov-Models and conditional random fields are commonlyused

Dictionary ApproachBuild a dictionary (i. e. list ) of tokensGo over the sentence and always take the longest fitting token (greedyalgorithm!)Remark: Works well for Asian languages without white spaces andshort words. Problematic for European languages

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 15 / 66

Page 16: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Normalization

Some definitionsToken Sequence of characters representing a useful semantic unit,

instance of a typeType Common concept for tokens, element of the vocabularyTerm Type that is stored in the dictionary of an index

Task1 Map all possible tokens to the corresponding type2 Store all representing terms of one type in the dictionary

SolutionsProvide list of different representationsProvide rules for mapping representations to main formMap different spelling version by using a phonetic algorithm

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 16 / 66

Page 17: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Example

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 17 / 66

Page 18: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Stemming & Lemmatization

DefinitionGoal of both Reduce different grammatical forms of a word to their

common infinitivea

Stemming Usage of heuristics to chop off / replace last part of words(Example: Porter’s algorithm).

Lemmatization Usage of proper grammatical rules to recreate theinfinitive.

aThe common infinitive is the form of a word how it is written in a dictionary.This form is called lemma, hence the name Lemmatization.

ExamplesMap do, doing, done, to common infinitive doMap going,went, gone to go

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 18 / 66

Page 19: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Stop Words

Definition Extremely common words that appear in nearly every textProblem As stop words are so common, their occurrence does not

characterise a textSolution Just drop them, i. e. , do not put them in the index directory

Common stop wordsWrite a list with stop words (List might be topic specific!)Usual suspects: articles (a, an, the), conjunction (and, or, but, ...), pre-and postposition (in, for, from), etc.

SolutionStop word list Ignore word that are given on a list (black list)

Problem Special names and phrases (The Who, Let It Be, ...)Solution Make another list... (white list)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 19 / 66

Page 20: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Advanced ConceptsRelevance Feedback, Query Processing, ...

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 20 / 66

Page 21: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Relevance Feedback

Integrate feedback from the userGiven a search result for a user’s query... allow the user to rate the retrieved results (good vs. not-so-goodmatch)This information can then be used to re-rank the results to match theuser’s preferenceOften automatically inferred from the user’s behaviour using thesearch query logs... and ultimately improve the search experience

The query logs (optimally) contain the queries plus the items the user has clicked on (the user’s session)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 21 / 66

Page 22: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Relevance Feedback

Pseudo relevance feedbackNo user interaction is needed for the pseudo relevance feedback... the first n results are simply assumed to be a goodmatchAs if the users has clicked on the first results andmarked them as goodmatch→ often improves the quality of the search results

Pseudo relevance feedback is sometimes also called blind relevance feedback

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 22 / 66

Page 23: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Query Reformulation

Modify the query, during of after the user interactionQuery suggestion & completion(Automatic) query correction(Automatic) query expansion

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 23 / 66

Page 24: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Query Suggestion

The user start typing a query... automatically gets suggestions→ to complete the currently typed word→ to complete a whole phrase (e.g. the next word)→ suggest related queries

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 24 / 66

Page 25: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Query Correction

The user entered a query, which may contain spelling errors... automatically correct or suggest possible rectified queries... also known as Query Reformulation→ use other users behaviour (harvest query logs)→ use the frequency of terms found in the indexed documents (andthe similarity with the entered words)

Notebook with various distances (Levenshtein being the best known):http://nbviewer.ipython.org/gist/MajorGressingham/7691723

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 25 / 66

Page 26: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Using Google as Spell Checker

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 26 / 66

Page 27: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Query Expansion

The user has entered a query... automatically add (related) words to the query→ Global query expansion - only use the query plus corpus orbackground knowledge→ Local query expansion - conduct the search and analyse the searchresults

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 27 / 66

Page 28: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Global Query Expansion

Uses only the query and background knowledgeExample: Use of a thesaurus

I Query: [buy car]I Use synonyms from the thesaurusI Expanded query: [(buy OR purchase) (car OR auto)]

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 28 / 66

Page 29: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Local query expansion

Uses the query plus the search results... conceptually similar to the pseudo relevance feedbackLook for terms in the first n search results and add them to the queryRe-run the query with the added terms

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 29 / 66

Page 30: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

More Like This

A common feature of search engines is themore like this searchGiven a search results... the user wants items similar to one of the presented resultsThis can be seen as: taking a document as query→which is a common implementation (but restricting the query to themost discriminative terms)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 30 / 66

Page 31: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Learning-To-Rank

Motivation fromWeb-SearchRelevance might be due to a lot of different reasons

I Keywords in the text or the meta-dataI Anchor textsI PageRankI ... many other

Basic idea:I Use evidence (from users)I ... which source of informationI ... is the most beneficialI ... by the use of Machine Learning

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 31 / 66

Page 32: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Learning-To-Rank

Supervised Learning - SettingNumber of queries -Q = {q1, q2, . . . , qm}Set of documents -D = {d1, d2, . . . , dN}

I Each queries has labelled relevant documentsI With labelsY = {1, 2, . . . , l}I being ranked, i. e. l � l − 1 � · · · � 1I Relevant documents for query qi: Di = {di,1, di,2, . . . , di,ni}I Labels for the query qi: yyy i = {yi,1, yi,2, . . . , yi,ni}

Training data set: S = {(qi,Di),yyy i}mi=1I By replacing (qi,Di) by representative features

H. Li, “A Short Introduction to Learning to Rank,” IEICE Trans. Inf. Syst., vol. E94-D, no. 1, pp.1–2, 2011.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 32 / 66

Page 33: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Learning-To-Rank

Supervised Learning - Data LabellingWhere to take the labelling information from?

I Directly from human judgementsF e. g. , perfect, excellent, good, fair, bad

I Indirectly via user behaviourF e. g. , via Web logs analysis

Note: Ordinal classification (ordinal regression) vs. ranking, i. e. , ranking is about recreatingthe expected ordering

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 33 / 66

Page 34: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Learning-To-Rank

Supervised Learning - Main ApproachesPointwise approach

I Transform the problem in a classification, regression or ordinalclassification

I e. g. , SVM for ordinal classification, i. e. , find parallel hyperplanesPairwise approach

I Transform the problem in a pairwise classification, pairwise regressionI e. g. , Ranking SVM, i. e. , to classify the order of pairs

Listwise approachI Directly learn/optimise the rankingI e.g. SVM MAP, i.e. to optimise on a scoring function (goodness of

ranking)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 34 / 66

Page 35: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Information flow

The classic search model

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 35 / 66

Page 36: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Probabilistic Information Retrial

Every transformation can hold a possible information loss1 Task to query2 Query to document

Information loss leads to uncertaintyProbability theory can deal with uncertainty

Basic idea behind probabilistic IRModel the uncertainty, which originates in the imperfecttransformationsUse this uncertainty as a measurement how relevant a document is,given a certain query

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 36 / 66

Page 37: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Probabilistic Information Retrial

Basic modelGiven a query - q

I ... and an (unknown/uncertain) set of relevant document -DI ... and irrelevant documents - ¬D

Representation of a document - diP(D|di) is the probability that di is relevant and P(¬D|di) that it isirrelevantRank documents in decreasing order of probability of relevance -P(D|q, di)

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 37 / 66

Page 38: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Probabilistic Information Retrial Models

Binary Independence ModelBinary: documents and queries represented as binaryterm incidence vectorsIndependence: terms are assumed to be independentof each other

Note: Computing the “true” probabilities would not be feasible

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 38 / 66

Page 39: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Language Models

Basic IdeaTrain a probabilistic modelMd for document d, i. e. , as many trainedmodels as documentsMd ranks documents on how possible it is that the user’s query qwascreated by the document dResults are documentmodels (or documents, respectively) which havea high probability to generate the user’s query

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 39 / 66

Page 40: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Language Models

Alternative point of viewIntuitively, each document defines a “language”What is the probability to generate the given query, given a languagemodelWhat is the probability to generate the given document, given alanguage model

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 40 / 66

Page 41: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Examples for Language Models

Successive Probability of query termsP(t1t2 . . . tn) = P(t1)P(t2|t1)P(t3|t1t2) . . . P(tn|t1 . . . tn−1)

Unigram Languge ModelPuni(t1t2 . . . tn) = P(t1)P(t2) . . . P(tn)

Bigram Languge ModelP(t1t2 . . . tn) = P(t1)P(t2|t1)P(t3|t2) . . . P(tn|tn−1)

Probability P(tn) Probabilities are generated from the term occurrence inthe document in question

Andmanymore... A lot more, andmore complex models are available

Note: Smoothing plays an important role in languagemodels

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 41 / 66

Page 42: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Index Construction

Going beyond simple indexingTerm-document matrices are very largeBut the number of topics that people talk about is small (in somesense)

I Clothes, movies, politics, ...

Can we represent the term-document space by a lower dimensionallatent space?

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 42 / 66

Page 43: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Index Construction

Latent Semantic Indexing (LSI)is based on Latent Semantic Analysis (see KDDM1), whichis based on Singular Value Decomposition (SVD), whichcreates a new space (new orthonormal base), whichallows to bemapped to lower dimensions, whichmight improve the search performance

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 43 / 66

Page 44: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Singular Value Decomposition

SVD overviewFor anm× nmatrix A of rank r there exists a factorization (SingularValue Decomposition = SVD) as follows:

I M = UΣVT

I ... U is anm×m unitary matrixI ... Σ is anm× n diagonal matrixI ... VT is an n× n unitary matrix

The columns of U are orthogonal eigenvectors of AAT

The columns of V are orthogonal eigenvectors of ATAEigenvalues λ1...λr of AAT are the eigenvalues of ATA.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 44 / 66

Page 45: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Singular Value Decomposition

Matrix DecompositionSVD enables lossy compression of a term-document matrix

I Reduces the dimensionality or the rankI Arbitrarily reduce the dimensionality by putting zeros in the bottom

right of sigmaI This is a mathematically optimal way of reducing dimensions

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 45 / 66

Page 46: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Singular Value Decomposition

Reduced SVDIf we retain only k singular values, and set the rest to 0ThenΣ is k × k, U ism× k, VT is k × n, and Ak ism× nThis is referred to as the reduced SVDIt is the convenient (space-saving) and usual form for computationalapplications

Approximation ErrorHow good (bad) is this approximation?It’s the best possible, measured by the Frobenius norm of the error

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 46 / 66

Page 47: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Latent Semantic Indexing

Application of SVD - LSIFrom term-document matrix A, we compute the approximation Ak.There is a row for each term and a column for each document in AkThus documents live in a space of k << r dimensions (thesedimensions are not the original axes)Each row and column of A gets mapped into the k-dimensional LSIspace, by the SVD.Claim – this is not only the mapping with the “best” approximation toA, but in fact improves retrieval.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 47 / 66

Page 48: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Latent Semantic Indexing

Application of SVD - LSIA query q is also mapped into this space... within this space the query q is compared to all the documents... the document closest to the query are returned as search result

LSI - ResultsSimilar termsmap to similar location in low dimensional spaceNoise reduction by dimension reduction

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 48 / 66

Page 49: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

EvaluationAssess the quality of search

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 49 / 66

Page 50: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Overview: Evaluation

Two basic questions / goalsAre retrieved documents relevant for the user’s information need?Have all relevant documents been retrieved?

In practice these two goal contradict each other.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 50 / 66

Page 51: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Precision & Recall

Evaluation ScenarioFixed document collectionNumber of queriesSet of documents marked as relevant for each query

Main Evaluation Metrics

Precision = #(relevant items retrieved)#(retrieved items)

Recall = #(relevant items retrieved)#(relevant items)

F1 = 2PrecisionRecallPrecision+Recall

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 51 / 66

Page 52: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Precision & Recall

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 52 / 66

Page 53: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Advanced Evaluation Measures

Mean Average Precision - MAP... mean of all average precision of multiple queriesMap = 1

|Q|∑

q∈Q APq

APq = 1n∑n

k=1 Precision(k)(Normalised) discounted cumulative gain (nDCG)

... where there is a reference ranking the relevant results

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 53 / 66

Page 54: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Application & SolutionsFederated Search, Recommender Systems, Open-source, &

commercial IR solutions

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 54 / 66

Page 55: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Distributed Search - Architectures

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 55 / 66

Page 56: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Term Based Index Split

Index CreationThe index is split based on dictionary termsCalled Global Inverted Files or Global Index OrganizationExample:

1 Terms beginning with A to B2 Terms beginning with C to G3 ...

PropertiesSingle index is split over multiple machinesEach index returns part of its postings to brokerBroker merges postings to generate result⇒ High network traffic, high load at broker

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 56 / 66

Page 57: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Document Based Index Split

Index CreationThe index is split based on document categoriesCalled local inverted fileExample:

1 Documents from computer science2 Documents from physics3 ...

PropertiesEach partition is a full indexBroker unifies the result lists⇒ re-ranking problem at broker

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 57 / 66

Page 58: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Federated Search

DefinitionSimultaneous search in multiple search enginesSingle user interface or API to access multiple search engineExamples:

1 Google Scholar2 Expedia3 ...

PropertiesUnified access to multiple search enginesEach search engines is a full search on its own, not only an index⇒ Re-ranking problem at broker⇒ Access control

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 58 / 66

Page 59: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Recommender systems - Overview

Recommender systems should provide usable suggestion to usersRecommender systems should show new, unknown–butsimilar–documentsRecommender systems can utilize different information sourcesContent-based Recommender

Finds similar documents on the basis of documentcontent

Collaborative FileringEmploys user rating to find new and useful documents

Knowledge-based RecommenderMakes decisions based on pre-programmed rules

Hybrid RecommenderCombination of two or three other approaches

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 59 / 66

Page 60: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Types of Content-based Recommender

Document-to-Document Recommender

Document-to-User Recommender

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 60 / 66

Page 61: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Building a Recommender by Using an Index

Document-to-Document Recommender1 Take the terms from a document2 Construct a (weighted) query with them3 Query the index4 Results are possible candidates for recommending

Document-to-User Recommender1 Create a user model from

I Terms typed by the userI Documents viewed by the userI ⇒ Simplest user model is just a set of terms

2 Construct a (weighted) query with them3 Query the index4 Results are possible candidates for recommending

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 61 / 66

Page 62: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Open Source Search Engine

Apache LuceneVery popular, high qualityBackbone of many other search engines, e.g. Apache Solr (fully),Elasticsearch

I Lucene is the core search engine libraryI Solr provides a search solution (service & configuration infrastructure,

distribution, indexing, etc.)I Elasticsearch provides a search solution and document database

focusses on distributed

Implemented in Java, bindings for many other programminglanguages

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 62 / 66

Page 63: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Other Open-Source Search Engines

XapianWritten in C++, Support for many (programming) languagesUsed by many open-source projects

TerrierMany algorithmsAcademic background

SphinxWritten in C++Offers also features comparable to database

Whooshfor Pythonistas

Lemur & IndriLanguage modelsAcademic background

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 63 / 66

Page 64: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

What customers want

Indexing pipeline

And also: user interface, access control mechanisms, logging, etc.

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 64 / 66

Page 65: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

Commercial Search Engines

Dassault ExaleadDassault acquired the Exalead projectOffers specialised features for CAD/CAE/CAM files

SinequaExpert SearchCreates new views, i. e. do not only show result list, but combineresults to new unit

AttivioUse SQL statements in queries

IntraFindElasticearch-based commercial search solution

Andmany, manymore...

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 65 / 66

Page 66: Information Retrieval and Evaluation - KTI – Knowledge …kti.tugraz.at/staff/rkern/courses/kddm2/ir.pdf · 2019. 4. 4. · InformationRetrievalandEvaluation KnowledgeDiscoveryandDataMining2(VU)(707.004)

The EndNext: Pattern Mining

Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 66 / 66