Information Retrieval: aka “Google-lite”
CMSC 16100
November 27, 2006
Roadmap
● Information Retrieval (IR)
  – Goal: Match Information Need to Document Concept
  – Solution: Vector Space Model
    ● Representation of Documents and Queries
    ● Computing Similarity
● Implementation:
  – Indexing: Documents -> Vectors
  – Query Construction: Query -> Vector
  – Retrieval: Finding “Best” match: Query/Document
The Information Retrieval Task
● Goal:
  – Match the information need expressed by the user
    ● (the Query)
  – With concepts in documents
    ● (the Document collection)
● Issues:
  – How do we represent documents and queries?
  – How do we know if they're “similar”? Match?
Vector Space Model
● Represent documents and queries with
  – Pattern of words
    ● I.e., queries and documents with lots of the same words should match
  – Vector of word occurrences:
  – Each position in vector = word
    ● Value of position x in vector = # times word x occurs
● Similarity:
  – Dot product of document vector & query vector
  – Biggest wins
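The similarity rule above can be written as a formula. This is a sketch in which q is the query vector, d is a document vector, and n is the number of words in the dictionary; these symbols are chosen here for illustration and do not appear in the slides.

```latex
\mathrm{sim}(q, d) = q \cdot d = \sum_{i=1}^{n} q_i \, d_i
```

For instance, with hypothetical vectors q = (1, 0, 1) and d = (2, 3, 1), the score is 1·2 + 0·3 + 1·1 = 3. Only words that occur in both the query and the document contribute to the sum.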
Vector Space Model
[Figure: three-dimensional vector space with axes Computer, Tv, Program]
Two documents: computer program, tv program
Query:
  computer program: matches 1st doc: exact: distance=2 vs 0
  educational program: matches both equally: distance=1
Information Retrieval in Scheme
● Representation:
  – A vector-rep is (vectorof number)
  – (define-struct doc-rep (id vec))
  – A doc is (make-doc-rep id vec)
    ● Where id: symbol; vec: vector-rep
  – A doc-index is (listof doc)
  – A query is vector-rep
● A simple-web-page (swp) is:
  ● (make-swp h b)
  ● Where (define-struct swp (header body)); h: symbol; b: (listof symbol)
  ● Field names header and body give the accessors swp-header and swp-body used in indexing
Three Steps to IR
● Three phases:
  – Indexing: Build collection of document representations
    ● Convert web pages to doc-rep: vectors of word counts
  – Query construction:
    ● Convert query text to vector of word counts
  – Retrieval:
    ● Compute similarity between query and doc representation
    ● Return closest match
Words-to-vector
(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds 1 to the count in wvec for each word in wlist; dict is the global dictionary
  (cond ((null? wlist) wvec)
        (else
         (let* ((wpos (posn (car wlist) dict))
                (cur-count (vector-ref wvec wpos)))
           (vector-set! wvec wpos (+ cur-count 1))
           (words-to-vector (cdr wlist) wvec)))))

(define (posn wd dict)
  ;; Returns the vector position recorded for wd in dict; error if wd is absent
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
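The slides do not show how dict is built or how map-wd and map-num are defined. The sketch below assumes each dictionary entry is a (word . position) pair, so the two accessors are just car and cdr. The names make-dict, lookup, and count-words are hypothetical helpers introduced only for this illustration; lookup repeats the search done by posn so that the block runs on its own.

```scheme
;; Assumption: a dictionary entry is a (word . position) pair.
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))

;; make-dict: (listof symbol) -> (listof entry)
;; Hypothetical helper: numbers the words 0, 1, 2, ... in list order.
(define (make-dict words)
  (let loop ((ws words) (n 0))
    (if (null? ws)
        '()
        (cons (cons (car ws) n)
              (loop (cdr ws) (+ n 1))))))

(define dict (make-dict '(computer tv program)))
(define dict-size (length dict))

;; lookup: symbol (listof entry) -> number
;; Same linear search as posn; signals an error for unknown words.
(define (lookup wd d)
  (cond ((null? d) (error "missing word" wd))
        ((eq? (map-wd (car d)) wd) (map-num (car d)))
        (else (lookup wd (cdr d)))))

;; count-words: (listof symbol) -> (vectorof number)
;; Builds a fresh count vector and bumps one slot per word.
(define (count-words wlist)
  (let ((wvec (make-vector dict-size 0)))
    (for-each (lambda (w)
                (let ((i (lookup w dict)))
                  (vector-set! wvec i (+ 1 (vector-ref wvec i)))))
              wlist)
    wvec))

(count-words '(tv program program)) ; => #(0 1 2)
```

Because the search is linear in the dictionary size, converting a page costs roughly (words on page) × (dictionary size) steps; that is fine for a small teaching dictionary, though a larger one would likely want a hash table.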
Indexing
(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-header (car swp-list))
                             (words-to-vector (swp-body (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))
Query Construction
(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))
Retrieval
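(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query; index must be non-empty
  ;; (max works on numbers, not on a list of scores, so track the best doc directly)
  (let loop ((best (car index)) (rest (cdr index)))
    (cond ((null? rest) (doc-rep-id best))
          ((> (dot-product (doc-rep-vec (car rest)) query)
              (dot-product (doc-rep-vec best) query))
           (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))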
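The retrieval step relies on dot-product, which the slides use but do not define. The following is a minimal sketch of scoring every document and picking the best one. It makes several assumptions: doc-rep is modeled with plain pairs in place of define-struct so the block runs in any Scheme, score-all and best-id are hypothetical helper names, and the two-page index is invented data over a three-word dictionary.

```scheme
;; Stand-ins for (define-struct doc-rep (id vec)), written with pairs;
;; the constructor and accessor names match the slides.
(define (make-doc-rep id vec) (cons id vec))
(define (doc-rep-id doc) (car doc))
(define (doc-rep-vec doc) (cdr doc))

;; dot-product: vector-rep vector-rep -> number
;; Assumes both vectors have the same length.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; score-all: vector-rep (listof doc-rep) -> (listof (id . score))
;; Hypothetical helper: pairs each document id with its similarity score.
(define (score-all query index)
  (map (lambda (doc)
         (cons (doc-rep-id doc)
               (dot-product (doc-rep-vec doc) query)))
       index))

;; best-id: (listof (id . score)) -> symbol
;; Returns the id with the largest score; on a tie the earlier entry wins.
;; Assumes a non-empty list.
(define (best-id scores)
  (let loop ((best (car scores)) (rest (cdr scores)))
    (cond ((null? rest) (car best))
          ((> (cdr (car rest)) (cdr best)) (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))

;; Invented example data: counts over a three-word dictionary.
(define index
  (list (make-doc-rep 'page-a (vector 2 0 1))
        (make-doc-rep 'page-b (vector 0 3 1))))
(define query (vector 1 0 1))

(score-all query index)            ; => ((page-a . 3) (page-b . 1))
(best-id (score-all query index))  ; => page-a
```

Keeping the id next to each score is what lets the final step return a document id rather than a bare number.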