Upload
sobhan-dasari
View
218
Download
0
Embed Size (px)
Citation preview
8/12/2019 Multimedia IR - Ppts 50
1/50
E.G.M Petrakis Multimedia InformationRetrieval
1
Information Retrieval (IR)
Deals with the representation, storageand retrieval of unstructured dataTopics of interest: systems, languages,
retrieval, user interfaces, datavisualization, distributed data setsClassical IR deals mainly with text
The evolution of multimedia databasesand of the web have given new interestto IR
8/12/2019 Multimedia IR - Ppts 50
2/50
E.G.M Petrakis Multimedia InformationRetrieval
2
Multimedia IR
A multimedia information system thatcan store and retrieveattributes
text
2D grey-scale and color images
1D time series
digitized voice or musicvideo
8/12/2019 Multimedia IR - Ppts 50
3/50
E.G.M Petrakis Multimedia InformationRetrieval
3
Applications
Financial, marketing:stock prices, sales etc.find companies whose stock prices move similarly
Scientific databases:sensor data
whether, geological, environmental dataOffice automation, electronic encyclopedias,
electronic booksMedical databasesX-rays, CT, MRI scans
Criminal investigationsuspects, fingerprints,Personal archives, text andcolor images
8/12/2019 Multimedia IR - Ppts 50
4/50
E.G.M Petrakis Multimedia InformationRetrieval 4
Queries
Formulation of users information needFree-text query
By example document (e.g., text, image)
e.g.,in a collection of color photos findthose showing a tree close to a house
Retrievalis based on the understanding
of the contentof documents and oftheir components
8/12/2019 Multimedia IR - Ppts 50
5/50
E.G.M Petrakis Multimedia InformationRetrieval 5
Goals of Retrieval
Accuracy: retrieve documents that theuser expects in the answerWith as few incorrect answers as possible
All relevant answers are retrieved
Speed: retrievals has to be fastThe system responds in real time
8/12/2019 Multimedia IR - Ppts 50
6/50
Accuracy of Retrieval
Depends on query criteria, querycomplexity and specificityAttributes / Content criteria?
Depends on what is matched with thequeryHow documents content is represented
Typically feature vectors
Depends also on matching function Euclidean distance
E.G.M Petrakis Multimedia InformationRetrieval 6
8/12/2019 Multimedia IR - Ppts 50
7/50
E.G.M Petrakis Multimedia InformationRetrieval 7
Speed of Retrieval
The query is compared (matched) with allstored documents
The definition of similarity criteria (similarity
or distance function between documents) is animportant issue:Similarity or distance function between documentsMatching has to fast: document matching has to be
computationally efficient and sequential searchingmust be avoided
Indexing: search documents that are likely tomatch the query
8/12/2019 Multimedia IR - Ppts 50
8/50
Doc1-FeactureVector
Doc2-FeactureVector
DocN-FeatureVector
Database
E.G.M Petrakis Multimedia InformationRetrieval 8
database
descriptions documents
Doc1
Doc2
DocN
index
query
8/12/2019 Multimedia IR - Ppts 50
9/50
E.G.M Petrakis Multimedia InformationRetrieval 9
Main Idea
Descriptionsare extracted and storedSame or separate storage with documentsQueries address the stored descriptions
rather the documents themselvesImages & video: the majority of stored
data
Solution: retrieval by textText retrievalis a well researched areaMost systems work with text, attributes
8/12/2019 Multimedia IR - Ppts 50
10/50
E.G.M Petrakis Multimedia InformationRetrieval 10
Problem
Text descriptions for images and videocontained in documents are not availablee.g., the text in a web site is not always
descriptive of every particular imagecontained in the web site
Two approaches
human annotationsfeature extraction
8/12/2019 Multimedia IR - Ppts 50
11/50
E.G.M Petrakis Multimedia InformationRetrieval 11
Human Annotations
Humans insert attributes, captions forimages and videos inconsistent or subjectiveover time and
users (different users do not give thesame descriptions)
retrievals will failif queries are
formulated using different keywords ordescriptions
time consuming process, expensiveorimpossible for large databases
8/12/2019 Multimedia IR - Ppts 50
12/50
E.G.M Petrakis Multimedia InformationRetrieval 12
Feature Extraction
Features are extracted from audio,image and video cheaper and replicable approach
consistent descriptions but
inexact
low level feature (patterns, colors etc.)
difficult to extract meaning different techniques for different data
8/12/2019 Multimedia IR - Ppts 50
13/50
E.G.M Petrakis Multimedia InformationRetrieval 13
Furht at.al. 96
Architecture of a IRSystem for Images
8/12/2019 Multimedia IR - Ppts 50
14/50
E.G.M Petrakis Multimedia InformationRetrieval 14
Multimedia IR in Practice
Relies on text descriptionsNon-text IRthere is nothing similar to text-based
retrieval systemsautomatic IRis sometimes impossiblerequires human intervention at some level
significant progress have been madedomain specific interpretations
8/12/2019 Multimedia IR - Ppts 50
15/50
E.G.M Petrakis Multimedia InformationRetrieval 15
Ranking of IR Methods
Ranking in terms of complexity and accuracyattributestext
audioimagevideo
CombinedIRbased on two or more data types
for more accurate retrievalsComplex data types (e.g.,video) are rich
sources of information
accuracy complexity
8/12/2019 Multimedia IR - Ppts 50
16/50
E.G.M Petrakis Multimedia InformationRetrieval 16
Similarity Searching
IRis not an exact processtwo documents are never identical!
they can be similar
searching must be approximateThe effectiveness of IRdepends on thetypes and correctness of descriptions used
types of queries allowed
user uncertainly as to what he is looking for
efficiency of search techniques
8/12/2019 Multimedia IR - Ppts 50
17/50
E.G.M Petrakis Multimedia InformationRetrieval 17
Approximate IR
A query is specified and all documentsup to a pre-specified degree ofsimilarity are retrieved and presentedto the user ordered by similarityTwo common types of similarity queries:range queries: retrieve all documents up to
a distance threshold Tnearest-neighbor queries: retrieve the k
best matches
8/12/2019 Multimedia IR - Ppts 50
18/50
E.G.M Petrakis Multimedia InformationRetrieval 18
Distance - Similarity
Decide whether two documents aresimilardistance: the lower the distance, the more
similar the documents aresimilarity: the higher the similarity, the
more similar the documents are
Key issue for successful retrieval: themore accurate the descriptions are, themore accurate the retrieval is
8/12/2019 Multimedia IR - Ppts 50
19/50
E.G.M Petrakis Multimedia InformationRetrieval 19
Retrieval Quality Criteria
Retrieval with as few errors as possibleTwo types of errors:false dismissals or misses: qualifying but
non retrieved documentsfalse positives or false drops: retrieved
but not qualifying documents
A good method minimizes bothRanking quality: retrieve qualifyingdocuments before non-qualifying ones
8/12/2019 Multimedia IR - Ppts 50
20/50
E.G.M Petrakis Multimedia InformationRetrieval 20
Evaluation of Retrieval
Given a collection of documents, a set ofqueries and human experts responses to theabove queries, the ideal system will retrieve
exactly what the human dictatedThe deviations from the above are measured
collectiontheinrelevantornumbertotal
answerinrelevantretrievedornumber=
retrievedofnumber
answerinrelevantretrievedofnumber=
recall
precision
8/12/2019 Multimedia IR - Ppts 50
21/50
E.G.M Petrakis Multimedia InformationRetrieval 21
Precision-Recall Diagram
Measure precision-recall for 1, 2 ..N answers
High precision means few false alarms
High recall mean few false dismissals
precision
recall
1
1
0,5
0,5
the ideal method
8/12/2019 Multimedia IR - Ppts 50
22/50
E.G.M Petrakis Multimedia InformationRetrieval 22
Harmonic Mean
A single measure combining precision and recall
r: recall,p: precision Ftakes values in [0,1] F 1as more retrieved documents are relevant F0as few retrieved documents are relevant Fis high when both precision and recall are high F expresses a compromise between precision and
recall
pr
F11
+
2=
8/12/2019 Multimedia IR - Ppts 50
23/50
E.G.M Petrakis Multimedia InformationRetrieval 23
Ranking Quality
The higher the Rnormthe better the ability ofa method to retrieve correct answer beforeincorrect
A human expert evaluates the answer setThen take answers in pairs: (relevant,irrelevant)S+: pairs ranked correctly (the relevantentry
has retrieved before the irrelevantone)S-: pairs ranked incorrectlySmax+: total ranked pairs
otherwise
SifR S
SS
norm
1
0)1( max-
21
max
-
8/12/2019 Multimedia IR - Ppts 50
24/50
E.G.M Petrakis Multimedia InformationRetrieval 24
Searching Method
Sequential scanning: the query ismatched with all stored documentsslow if the database is large or
if matching is slowThe documents must be indexedhashing, inverted files, B-trees, R-trees
different data types are indexed andsearched separately
8/12/2019 Multimedia IR - Ppts 50
25/50
E.G.M Petrakis Multimedia InformationRetrieval 25
Goals of IR
Summarizing, there are two generalgoals common to all IRsystemseffectiveness: IRmust be accurate
(retrieves what the user expects to see inthe answer)
efficiency: IRmust be fast(faster than
sequential scanning)
8/12/2019 Multimedia IR - Ppts 50
26/50
E.G.M Petrakis Multimedia InformationRetrieval 26
Approximate IR
A query is specified and all documentsup to a pre-specified degree ofsimilarity are retrieved and presented
to the user ordered by similarityTwo common types of similarity queries:range queries: retrieve all documents up to
a distance threshold Tnearest-neighbor queries: retrieve the k
best matches
8/12/2019 Multimedia IR - Ppts 50
27/50
E.G.M Petrakis Multimedia InformationRetrieval 27
Query Formulation
Must be flexible and convenient
SQLqueries constraints on allattributes and data types may becomevery complex
Queries by examplee.g., by providing anexample document or image
Browsing: display headers, summaries,miniatures etc. or for refining theretrieved results
8/12/2019 Multimedia IR - Ppts 50
28/50
E.G.M Petrakis Multimedia InformationRetrieval 28
Access Methods for Text
Access methods for text areinteresting for at least 3 reasonsmultimedia documents contain text, e.g.,
images often have captionstext retrieval has several applications in
itself (library automation, web search etc.)text retrieval research has led to useful
ideas like, vector space model, informationfiltering, relevance feedback etc.
8/12/2019 Multimedia IR - Ppts 50
29/50
8/12/2019 Multimedia IR - Ppts 50
30/50
E.G.M Petrakis Multimedia InformationRetrieval 30
Keyword Matching
The whole collection is searchedno preprocessing
no space overhead
updates are easy
The user specifies a string (regularexpression) and the text is parsed usinga finite state automatonKMP, BMH algorithms
8/12/2019 Multimedia IR - Ppts 50
31/50
E.G.M Petrakis Multimedia InformationRetrieval 31
Error Tolerant Methods
Methods that can tolerate errors
Scan text one character at a time bykeeping track of matched charactersand retrieve strings within a desiredediting distance from queryWu, Manber 92, Baeza-Yates, Connet 92
Extension: regular expressions built-upby strings and operators: pro(blem|tein) (s| ) | (0| 1| 2)
8/12/2019 Multimedia IR - Ppts 50
32/50
E.G.M Petrakis Multimedia InformationRetrieval 32
Error Counting Methods
Retrieve similar words or phrases e.g.,misspelled words or words withdifferent pronunciation
editing distance: a numerical estimateof the similarity between 2 strings
phonetic coding: search words withsimilar pronunciation (Soundex, Phonix)N-grams: count common N-length
substrings
8/12/2019 Multimedia IR - Ppts 50
33/50
E.G.M Petrakis Multimedia InformationRetrieval 33
Editing Distance
Minimum number of edit operationsthatare needed to transform the firststring to the second
edit operation: insert, delete, substituteand (sometimes) transposition ofcharacters
d(si,ti): distance between charactersusually d(si,ti)= 1if si < >tiand0otherwiseedit(cordis,codis) = 1: ris deletededit(cordis, codris) = 1: transposition of rd
8/12/2019 Multimedia IR - Ppts 50
34/50
E.G.M. Petrakis Multimedia InformationRetrieval 34
i 0 1 2 3 4 5 6
j a b b c d d
0 0 1 2 3 4 5 6
1 a 1 0 1 2 3 4 5
2 a 2 1 1 2 3 4 5
3 a 3 2 2 2 3 4 5
4 b 4 3 2 2 3 4 5
5 c 5 4 3 2 2 3 4
6 c 6 5 4 4 3 3 4
7 c 7 6 5 5 4 4 4
8 d 8 7 6 6 5 4 4
initialization cost
total cost
8/12/2019 Multimedia IR - Ppts 50
35/50
8/12/2019 Multimedia IR - Ppts 50
36/50
E.G.M. Petrakis Multimedia InformationRetrieval 36
Matching Algorithm
Compute D(A,B)#A, #Blengths of A, B 0: null symbol R: cost of an edit operationD(0,0) = 0
for i = 0to #A: D(i:0) = D(i-1,0) + R(A[i]0);forj = 0to #B: D(0:j) = D(0,j-1) + R(0B[j]);for i = 0to #A
forj = 0to #B{1. m1= D(i,j-1) + R(0B[j]);2. m
2= D(i-1,j) + R(A[i]0);
3. m3 = D(i-1,j-1) + R(A[i]B[j]);4. D(i,j) = min{m1, m2, m3};
}
8/12/2019 Multimedia IR - Ppts 50
37/50
E.G.M Petrakis Multimedia InformationRetrieval 37
Classical IRModels
Boolean modelsimple based on set theory
queries as Boolean expressions
Vector space modelqueries and documents as vectors in term
space
Probabilistic modela probabilistic approach
8/12/2019 Multimedia IR - Ppts 50
38/50
E.G.M Petrakis Multimedia InformationRetrieval 38
Text Indexing
A document is represented by a set ofindex termsthat summarize document
contents
adjectives, adverbs, connectives are lessuseful
mainly nouns (lexicon look-up)
requires text pre-processing (off-line)
8/12/2019 Multimedia IR - Ppts 50
39/50
E.G.M Petrakis Multimedia InformationRetrieval 39
Indexing
index
Data Repository
8/12/2019 Multimedia IR - Ppts 50
40/50
E.G.M Petrakis Multimedia InformationRetrieval 40
Text Preprocessing
Extract index terms from textProcessing stagesword separation, sentence splitting
change terms to a standard form (e.g.,lowercase)eliminate stop-words (e.g. and, is, the, )
reduce terms to their base form (e.g.,eliminate prefixes, suffixes)construct mapping between terms and
documents (indexing)
8/12/2019 Multimedia IR - Ppts 50
41/50
E.G.M Petrakis Multimedia InformationRetrieval 41
Text Preprocessing Chart
from Baeza Yates & Ribeiro Neto, 1999
8/12/2019 Multimedia IR - Ppts 50
42/50
8/12/2019 Multimedia IR - Ppts 50
43/50
E.G.M Petrakis Multimedia InformationRetrieval 43
Inverted Files
Dictionary: terms stored alphabeticallyPosting list: pointers to documents
containing the term
Posting info: document id, frequency ofoccurrence, location in document, etc.dictionary indexed by B-trees, tries or
binary searchedPros: fast and easy to implementCons: large space overhead (up to 300%)
8/12/2019 Multimedia IR - Ppts 50
44/50
E.G.M Petrakis Multimedia InformationRetrieval 44
Inverted Index
index posting list
(1,2)(3,4)
(4,3)(7,5)
(10,3)
1
2
3
4
5
6
7
8
9
10
11
documents
8/12/2019 Multimedia IR - Ppts 50
45/50
E.G.M Petrakis Multimedia InformationRetrieval 45
Query Processing
1. Parse query and extract query terms2. Dictionary lookup binary search
3. Get postings from inverted file one term at a time4. Accumulate postings record matching document ids, frequencies, etc.
compute weights5. Rank answers by weights (e.g., frequencies)
8/12/2019 Multimedia IR - Ppts 50
46/50
E.G.M Petrakis Multimedia InformationRetrieval 46
Signature Files
A filter for eliminating most of the non-qualifying documentssignature: short hash-coded representation
of document or querythe signatures are stored and searched
sequentially
search returns all qualifying documents plussome false alarms
the answer is searched to eliminate falsealarms
8/12/2019 Multimedia IR - Ppts 50
47/50
E.G.M Petrakis Multimedia InformationRetrieval 47
Superimposed Coding
Each word yields a bit pattern of size Fwith mbits set to 1and the rest left as 0size of Faffects the false drop rate
bit patterns are OR-ed to form the doc signature
find documents having 1in the same bits as thequery
word signature
data 001 000 110 010base 000 010 101 001
document signature 001 010 111 011
8/12/2019 Multimedia IR - Ppts 50
48/50
E.G.M Petrakis Multimedia InformationRetrieval
48
Bitmaps
For every term, a bit vector is storedeach bit corresponds to a document
set to 1if term appears in document, 0
otherwisee.g.,word 10100: the word appears in
documents 1and 3in a collection of 5
documentsVery efficient for Boolean queriesretrieve bit-vectors for each query term
combine bit vectors with Boolean operators
C i f T t I d i
8/12/2019 Multimedia IR - Ppts 50
49/50
E.G.M Petrakis Multimedia InformationRetrieval
49
Comparison of Text IndexingMethods
Bitmaps:pros: intuitive, easy to implement and to process
cons: space overhead
Signature files:pros: easy to implement, compact index, no lexicon
cons: many unnecessary accesses to documents
Inverted files:pros: used by most search engines
cons: large space overhead, index can becompressed
8/12/2019 Multimedia IR - Ppts 50
50/50
E G M Petrakis Multimedia Information 50
References
Searching Multimedia Databases by Content, C.Faloutsos, Kluwer Academic Publishers, 1996
Modern Information Retrieval, R. Baeza-Yates, B.Ribeiro-Neto, Addison Wesley, 1999
Automatic Text Processing, Gerard Salton, AddisonWesley, 1989 Information Retrieval Links:
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.htmlhttp://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.htmlhttp://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.htmlhttp://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.htmlhttp://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.htmlhttp://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html