Intro to Information Retrieval
By the end of the lecture you should be able to:
- explain the differences between database and information retrieval technologies
- describe the basic maths underlying the set-theoretic and vector models of classical IR.
Reminder: efficiency is vital
Google finds documents which match your keywords; this must be done EFFICIENTLY – you can't just scan each document from start to end for every keyword.
So the cache stores a copy of each document, and also a "cut-down" version of the document for searching: just a "bag of words", a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document).
Keywords are matched against this list; if they are found, the full document is returned.
Even cleverer: a dictionary and inverted file…
Inverted file structure
[Figure: inverted file structure. The dictionary lists each term with its document frequency – Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4)… – and points into the inverted (or postings) file, which holds the IDs of the documents containing each term; the postings in turn point into the data file holding the documents themselves (Doc 1, Doc 2, Doc 3, …).]
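A minimal in-memory sketch of the dictionary and postings file, assuming whitespace tokenisation and the three example documents used later in the lecture; a production engine would keep these structures on disk and compressed:

```python
# Sketch: dictionary (term -> document frequency) + postings (term -> doc IDs).
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to (document frequency, sorted list of doc IDs)."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}

docs = ["recipe for jam pudding",
        "report on traffic lanes",
        "traffic jam in pudding lane"]
index = build_inverted_index(docs)
print(index["jam"])   # (2, [1, 3]): 'jam' occurs in documents 1 and 3
```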
IR vs DBMS
                      DBMS             IR
match                 exact            partial or best match
inference             deduction        induction
model                 deterministic    probabilistic
data                  record/field     text document
query language        artificial       natural?
query specification   complete         incomplete
items wanted          matching         relevant
error response        sensitive        insensitive
informal introduction
- IR was developed for bibliographic systems. We shall refer to 'documents', but the technique extends beyond items of text.
- Central to IR is the representation of a document by a set of 'descriptors' or 'index terms' ("words in the document").
- Searching for a document is carried out (mainly) in the 'space' of index terms.
- We need a language for formulating queries, and a method for matching queries with document descriptors.
architecture
[Figure: IR system architecture. The user sends a query to the query-matching component, which searches the object base (objects and their descriptions) and returns hits; feedback from the user feeds a learning component that refines the matching.]
basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the i-th keyword and the j-th document.

For the j-th document, we define an index term vector $d_j$:

$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0)   – recipe for jam pudding
d2 = (0, 0, 1, 1, 0)   – DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)   – radio item on traffic jam in Pudding Lane
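A minimal sketch of building such binary term vectors, assuming each document has already been reduced to its set of index terms (so stemming issues such as 'lanes' vs 'lane' are ignored):

```python
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def term_vector(index_terms):
    """Binary weights: the i-th entry is 1 iff T[i] describes the document."""
    return tuple(1 if t in index_terms else 0 for t in T)

d1 = term_vector({"pudding", "jam"})                      # jam pudding recipe
d2 = term_vector({"traffic", "lane"})                     # DoT report
d3 = term_vector({"pudding", "jam", "traffic", "lane"})   # radio item
print(d1, d2, d3)  # (1, 1, 0, 0, 0) (0, 0, 1, 1, 0) (1, 1, 1, 1, 0)
```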
set theoretic, Boolean model
Queries are Boolean expressions formed using keywords, e.g.:
('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic'
The query is re-expressed in disjunctive normal form (DNF), e.g.
(1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
To match a document with a query:
$$\mathrm{sim}(d, q_{\mathrm{DNF}}) = \begin{cases} 1 & \text{if } d \text{ is equal to a component of } q_{\mathrm{DNF}} \\ 0 & \text{otherwise} \end{cases}$$
CF: T = {pudding, jam, traffic, lane, treacle}
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)
q_DNF = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
[Figure: Venn diagram over the index terms pudding, jam, traffic, lane and treacle, showing the regions picked out by the DNF components.]
collecting results
T = {pudding, jam, traffic, lane, treacle}
Query: ('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic', i.e. the set (jam ∪ treacle) ∩ pudding − lane − traffic
Answer: d1 = (1, 1, 0, 0, 0) – the jam pudding recipe
[Figure: the same Venn diagram over pudding, jam, traffic, lane and treacle, with the query region highlighted.]
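A sketch of the matching rule, assuming the query has already been converted to DNF and each component is given as a binary vector over T:

```python
# Boolean model: a document matches iff its vector equals some DNF component.
def sim_boolean(d, q_dnf):
    """1 if the document vector equals a component of the DNF query, else 0."""
    return 1 if d in q_dnf else 0

q_dnf = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]
for name, d in [("d1", (1, 1, 0, 0, 0)),
                ("d2", (0, 0, 1, 1, 0)),
                ("d3", (1, 1, 1, 1, 0))]:
    print(name, sim_boolean(d, q_dnf))  # d1 -> 1, d2 -> 0, d3 -> 0
```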
Statistical vector model
Weights $0 \le w_{i,j} \le 1$ are no longer binary-valued; the query is also represented by a vector:
$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$, e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8)
CF: T = {pudding, jam, traffic, lane, treacle}
To match the j-th document with a query:
$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\, \sqrt{\sum_{i=1}^{n} w_{iq}^2}}$$
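As a sanity check on the formula, here is a direct transcription in Python – a minimal sketch, assuming dense weight vectors of equal length; the zero-vector guard is an addition, since the formula above leaves that case undefined:

```python
from math import sqrt

def cosine_sim(d, q):
    """Cosine coefficient: dot product over the product of vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = sqrt(sum(w * w for w in d))
    norm_q = sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```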
Cosine coefficient
[Figure: document vector D1 = (w11, w21) and query vector Q = (w1q, w2q) drawn in a two-dimensional term space with axes T1 and T2, separated by angle θ.]
The similarity is the cosine of the angle θ between the document and query vectors:
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\, \sqrt{\sum_{i=1}^{n} w_{iq}^2}} = \cos(\theta)$$
When D1 and Q point in the same direction, θ = 0 and sim = cos(0) = 1.
When D1 and Q are orthogonal (e.g. w1q = 0 and w21 = 0, so the vectors lie along different axes), θ = 90° and sim = cos(90°) = 0.
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) – jam pudding recipe

$\sum_i w_{i1} w_{iq} = 0.8{\times}1.0 + 0.8{\times}0.6 + 0.0{\times}0.0 + 0.0{\times}0.0 + 0.2{\times}0.8 = 1.44$
$\sum_i w_{i1}^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$
$\sum_i w_{iq}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_1, q) = 1.44 / \sqrt{1.32 \times 2.0} = 0.89$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) – DoT report

$\sum_i w_{i2} w_{iq} = 0.0{\times}1.0 + 0.0{\times}0.6 + 0.9{\times}0.0 + 0.8{\times}0.0 + 0.0{\times}0.8 = 0.0$
$\sum_i w_{i2}^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$
$\sum_i w_{iq}^2 = 2.0$
$\mathrm{sim}(d_2, q) = 0.0 / \sqrt{1.45 \times 2.0} = 0.0$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) – radio traffic report

$\sum_i w_{i3} w_{iq} = 0.6{\times}1.0 + 0.9{\times}0.6 + 1.0{\times}0.0 + 0.6{\times}0.0 + 0.0{\times}0.8 = 1.14$
$\sum_i w_{i3}^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$
$\sum_i w_{iq}^2 = 2.0$
$\mathrm{sim}(d_3, q) = 1.14 / \sqrt{2.53 \times 2.0} = 0.51$
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)

Rank  document vector                      document (sim)
1.    d1 = (0.8, 0.8, 0.0, 0.0, 0.2)       jam pudding recipe (0.89)
2.    d3 = (0.6, 0.9, 1.0, 0.6, 0.0)       radio traffic report (0.51)
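Using the cosine_sim sketch from earlier, this ranking can be reproduced from the example vectors:

```python
# Rank the three example documents against q by cosine similarity.
q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {"d1 (jam pudding recipe)":   (0.8, 0.8, 0.0, 0.0, 0.2),
        "d2 (DoT report)":           (0.0, 0.0, 0.9, 0.8, 0.0),
        "d3 (radio traffic report)": (0.6, 0.9, 1.0, 0.6, 0.0)}
for name, d in sorted(docs.items(), key=lambda kv: -cosine_sim(kv[1], q)):
    print(f"{name}: {cosine_sim(d, q):.2f}")
# d1 (jam pudding recipe): 0.89
# d3 (radio traffic report): 0.51
# d2 (DoT report): 0.00
```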
Discussion: Set theoretic model
The Boolean model is simple and its queries have precise semantics, but it is an 'exact match' model and does not rank results.
The Boolean model is popular with bibliographic systems and is available on some search engines.
Users find Boolean queries hard to formulate.
There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
Discussion: Vector Model
The vector model is simple and fast, and results show it leads to 'good' results.
Partial matching leads to ranked output; it is a popular model with search engines.
Its underlying assumption of term independence is not realistic (phrases, collocations, grammar).
The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
questions raised
- Where do the index terms come from? (ALL the words in the source documents?)
- What determines the weights?
- How well can we expect these systems to work for practical applications? How can we improve them?
- How do we integrate IR into more traditional DB management?
Questions to think about
- Why is a traditional database unsuited to the retrieval of unstructured information?
- How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
- For the matching coefficient sim(·, ·), show that 0 ≤ sim(·, ·) ≤ 1, and that sim(a, a) = 1.
- Compare and contrast the 'vector' and 'set theoretic' models in terms of power of representation of documents and queries.
Recommended