Intro to Information Retrieval
By the end of the lecture you should be able to:
- explain the differences between database and information retrieval technologies
- describe the basic maths underlying the set-theoretic and vector models of classical IR.
Reminder: efficiency is vital
Google finds documents which match your keywords; this must be done EFFICIENTLY – you can't just scan each document from start to end for every keyword.
So the cache stores a copy of each document, and also a "cut-down" version of the document for searching: just a "bag of words", a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document).
Keywords are matched against this list; if they are found, the full document is returned.
Even cleverer: a dictionary and inverted file…
Inverted file structure
[Figure: inverted file structure. The dictionary lists each term with its document frequency – Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4)… – and points into the inverted (or postings) file, which holds the IDs of the documents containing each term; the postings in turn point into the data file holding the documents themselves (Doc 1, Doc 2, Doc 3, …).]
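A minimal in-memory sketch of the dictionary and postings file, assuming whitespace tokenisation and the three example documents used later in the lecture; a production engine would keep these structures on disk and compressed:

```python
# Sketch: dictionary (term -> document frequency) + postings (term -> doc IDs).
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to (document frequency, sorted list of doc IDs)."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}

docs = ["recipe for jam pudding",
        "report on traffic lanes",
        "traffic jam in pudding lane"]
index = build_inverted_index(docs)
print(index["jam"])   # (2, [1, 3]): 'jam' occurs in documents 1 and 3
```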
IR vs DBMS
                      DBMS             IR
match                 exact            partial or best match
inference             deduction        induction
model                 deterministic    probabilistic
data                  record/field     text document
query language        artificial       natural?
query specification   complete         incomplete
items wanted          matching         relevant
error response        sensitive        insensitive
informal introduction
- IR was developed for bibliographic systems. We shall refer to 'documents', but the technique extends beyond items of text.
- Central to IR is the representation of a document by a set of 'descriptors' or 'index terms' ("words in the document").
- Searching for a document is carried out (mainly) in the 'space' of index terms.
- We need a language for formulating queries, and a method for matching queries with document descriptors.
architecture
[Figure: IR system architecture. The user sends a query to the query-matching component, which searches the object base (objects and their descriptions) and returns hits; feedback from the user feeds a learning component that refines the matching.]
basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the i-th keyword and the j-th document.

For the j-th document, we define an index term vector $d_j$:

$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0)   – recipe for jam pudding
d2 = (0, 0, 1, 1, 0)   – DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)   – radio item on traffic jam in Pudding Lane
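A minimal sketch of building such binary term vectors, assuming each document has already been reduced to its set of index terms (so stemming issues such as 'lanes' vs 'lane' are ignored):

```python
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def term_vector(index_terms):
    """Binary weights: the i-th entry is 1 iff T[i] describes the document."""
    return tuple(1 if t in index_terms else 0 for t in T)

d1 = term_vector({"pudding", "jam"})                      # jam pudding recipe
d2 = term_vector({"traffic", "lane"})                     # DoT report
d3 = term_vector({"pudding", "jam", "traffic", "lane"})   # radio item
print(d1, d2, d3)  # (1, 1, 0, 0, 0) (0, 0, 1, 1, 0) (1, 1, 1, 1, 0)
```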
set theoretic, Boolean model
Queries are Boolean expressions formed using keywords, e.g.:
('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic'
The query is re-expressed in disjunctive normal form (DNF), e.g.
(1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
To match a document with a query:
$$\mathrm{sim}(d, q_{\mathrm{DNF}}) = \begin{cases} 1 & \text{if } d \text{ is equal to a component of } q_{\mathrm{DNF}} \\ 0 & \text{otherwise} \end{cases}$$
CF: T = {pudding, jam, traffic, lane, treacle}
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)
q_DNF = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
[Figure: Venn diagram over the index terms pudding, jam, traffic, lane and treacle, showing the regions picked out by the DNF components.]
collecting results
T = {pudding, jam, traffic, lane, treacle}
Query: ('Jam' ∨ 'Treacle') ∧ 'Pudding' ∧ ¬'Lane' ∧ ¬'Traffic', i.e. the set (jam ∪ treacle) ∩ pudding − lane − traffic
Answer: d1 = (1, 1, 0, 0, 0) – the jam pudding recipe
[Figure: the same Venn diagram over pudding, jam, traffic, lane and treacle, with the query region highlighted.]
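A sketch of the matching rule, assuming the query has already been converted to DNF and each component is given as a binary vector over T:

```python
# Boolean model: a document matches iff its vector equals some DNF component.
def sim_boolean(d, q_dnf):
    """1 if the document vector equals a component of the DNF query, else 0."""
    return 1 if d in q_dnf else 0

q_dnf = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]
for name, d in [("d1", (1, 1, 0, 0, 0)),
                ("d2", (0, 0, 1, 1, 0)),
                ("d3", (1, 1, 1, 1, 0))]:
    print(name, sim_boolean(d, q_dnf))  # d1 -> 1, d2 -> 0, d3 -> 0
```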
Statistical vector model
Weights $0 \le w_{i,j} \le 1$ are no longer binary-valued; the query is also represented by a vector:
$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$, e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8)
CF: T = {pudding, jam, traffic, lane, treacle}
To match the j-th document with a query:
$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\, \sqrt{\sum_{i=1}^{n} w_{iq}^2}}$$
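As a sanity check on the formula, here is a direct transcription in Python – a minimal sketch, assuming dense weight vectors of equal length; the zero-vector guard is an addition, since the formula above leaves that case undefined:

```python
from math import sqrt

def cosine_sim(d, q):
    """Cosine coefficient: dot product over the product of vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = sqrt(sum(w * w for w in d))
    norm_q = sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```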
Cosine coefficient
[Figure: document vector D1 = (w11, w21) and query vector Q = (w1q, w2q) drawn in a two-dimensional term space with axes T1 and T2, separated by angle θ.]
The similarity is the cosine of the angle θ between the document and query vectors:
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^2}\, \sqrt{\sum_{i=1}^{n} w_{iq}^2}} = \cos(\theta)$$
When D1 and Q point in the same direction, θ = 0 and sim = cos(0) = 1.
When D1 and Q are orthogonal (e.g. w1q = 0 and w21 = 0, so the vectors lie along different axes), θ = 90° and sim = cos(90°) = 0.
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) – jam pudding recipe

$\sum_i w_{i1} w_{iq} = 0.8{\times}1.0 + 0.8{\times}0.6 + 0.0{\times}0.0 + 0.0{\times}0.0 + 0.2{\times}0.8 = 1.44$
$\sum_i w_{i1}^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$
$\sum_i w_{iq}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_1, q) = 1.44 / \sqrt{1.32 \times 2.0} = 0.89$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) – DoT report

$\sum_i w_{i2} w_{iq} = 0.0{\times}1.0 + 0.0{\times}0.6 + 0.9{\times}0.0 + 0.8{\times}0.0 + 0.0{\times}0.8 = 0.0$
$\sum_i w_{i2}^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$
$\sum_i w_{iq}^2 = 2.0$
$\mathrm{sim}(d_2, q) = 0.0 / \sqrt{1.45 \times 2.0} = 0.0$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) – radio traffic report

$\sum_i w_{i3} w_{iq} = 0.6{\times}1.0 + 0.9{\times}0.6 + 1.0{\times}0.0 + 0.6{\times}0.0 + 0.0{\times}0.8 = 1.14$
$\sum_i w_{i3}^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$
$\sum_i w_{iq}^2 = 2.0$
$\mathrm{sim}(d_3, q) = 1.14 / \sqrt{2.53 \times 2.0} = 0.51$
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)

Rank  document vector                      document (sim)
1.    d1 = (0.8, 0.8, 0.0, 0.0, 0.2)       jam pudding recipe (0.89)
2.    d3 = (0.6, 0.9, 1.0, 0.6, 0.0)       radio traffic report (0.51)
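Using the cosine_sim sketch from earlier, this ranking can be reproduced from the example vectors:

```python
# Rank the three example documents against q by cosine similarity.
q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {"d1 (jam pudding recipe)":   (0.8, 0.8, 0.0, 0.0, 0.2),
        "d2 (DoT report)":           (0.0, 0.0, 0.9, 0.8, 0.0),
        "d3 (radio traffic report)": (0.6, 0.9, 1.0, 0.6, 0.0)}
for name, d in sorted(docs.items(), key=lambda kv: -cosine_sim(kv[1], q)):
    print(f"{name}: {cosine_sim(d, q):.2f}")
# d1 (jam pudding recipe): 0.89
# d3 (radio traffic report): 0.51
# d2 (DoT report): 0.00
```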
Discussion: Set theoretic model
The Boolean model is simple and its queries have precise semantics, but it is an 'exact match' model and does not rank results.
The Boolean model is popular with bibliographic systems and is available on some search engines.
Users find Boolean queries hard to formulate.
There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
Discussion: Vector Model
The vector model is simple and fast, and results show it leads to 'good' results.
Partial matching leads to ranked output; it is a popular model with search engines.
Its underlying assumption of term independence is not realistic (phrases, collocations, grammar).
The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
questions raised
- Where do the index terms come from? (ALL the words in the source documents?)
- What determines the weights?
- How well can we expect these systems to work for practical applications? How can we improve them?
- How do we integrate IR into more traditional DB management?
Questions to think about
- Why is a traditional database unsuited to the retrieval of unstructured information?
- How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
- For the matching coefficient sim(·, ·), show that 0 ≤ sim(·, ·) ≤ 1, and that sim(a, a) = 1.
- Compare and contrast the 'vector' and 'set theoretic' models in terms of power of representation of documents and queries.
Recommended