Information Retrieval: aka “Google-lite” CMSC 16100 November 27, 2006


Information Retrieval: aka “Google-lite”

CMSC 16100

November 27, 2006


Roadmap

● Information Retrieval (IR)
  – Goal: Match Information Need to Document Concept
  – Solution: Vector Space Model
    ● Representation of Documents and Queries
    ● Computing Similarity
● Implementation:
  – Indexing: Documents -> Vectors
  – Query Construction: Query -> Vector
  – Retrieval: Finding “Best” match: Query/Document


The Information Retrieval Task

● Goal:
  – Match the information need expressed by the user
    ● (the Query)
  – With concepts in documents
    ● (the Document collection)
● Issues:
  – How do we represent documents and queries?
  – How do we know if they're “similar”? Do they match?


Vector Space Model

● Represent documents and queries with
  – Pattern of words
    ● I.e., queries and documents that share lots of the same words are similar
  – Vector of word occurrences:
  – Each position in vector = word
    ● Value of position x in vector = # times word x occurs
● Similarity:
  – Dot product of document vector & query vector
  – Biggest dot product wins
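
As a sketch of the similarity computation above: for a query vector q and a document vector d over an n-word vocabulary, the dot product sums the products of matching word counts. The vectors below are made-up illustrations, not data from the slides.

```latex
\[
\mathrm{sim}(q, d) = q \cdot d = \sum_{i=1}^{n} q_i \, d_i
\]
\[
q = (1, 0, 1),\quad d_a = (2, 0, 1),\quad d_b = (0, 3, 1)
\;\Rightarrow\;
q \cdot d_a = 3,\quad q \cdot d_b = 1
\]
```

Here d_a has the larger dot product, so it would be the better match for q.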


Vector Space Model

[Figure: vector space with axes Computer, TV, Program]

Two documents: “computer program”, “tv program”

Query:
  – “computer program”: matches 1st doc exactly: distance = 2 vs 0
  – “educational program”: matches both equally: distance = 1


Information Retrieval in Scheme

● Representation:
  – A vector-rep is (vectorof number)
  – (define-struct doc-rep (id vec))
  – A doc is (make-doc-rep id vec)
    ● Where id: symbol; vec: vector-rep
  – A doc-index is (listof doc)
  – A query is vector-rep
● A simple-web-page (swp) is:
  ● (make-swp h b)
  ● Where (define-struct swp (header body)); h: symbol; b: (listof symbol)


Three Steps to IR

● Three phases:
  – Indexing: Build collection of document representations
    ● Convert web pages to doc-rep
      – Vectors of word counts
  – Query construction:
    ● Convert query text to vector of word counts
  – Retrieval:
    ● Compute similarity between query and doc representation
    ● Return closest match


Words-to-vector

(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds 1 to the slot of wvec for each word in wlist; mutates and returns wvec
  (cond ((null? wlist) wvec)
        (else (let ((wpos (posn (car wlist) dict)))
                (let ((cur-count (vector-ref wvec wpos)))
                  (vector-set! wvec wpos (+ cur-count 1))
                  (words-to-vector (cdr wlist) wvec))))))

(define (posn wd dict)
  ;; posn: symbol dictionary -> number
  ;; Returns the vector position of wd; signals an error if wd is not in dict
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
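
The slides do not show how `dict` is built or how `map-wd` and `map-num` are defined. The following is a minimal sketch, assuming (hypothetically) that each dictionary entry is a pair of a word and its vector position; the three-word dictionary and the names `counts` and `dict-size` as used here are illustrative only.

```scheme
;; ASSUMPTION: a dictionary entry is a (word . position) pair. The slides
;; only show the accessors map-wd and map-num, not the entry definition.
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))

;; Hypothetical three-word dictionary: computer = 0, tv = 1, program = 2.
(define dict (list (cons 'computer 0) (cons 'tv 1) (cons 'program 2)))
(define dict-size (length dict))

;; Look up the vector position of a word, as on the slide.
(define (posn wd dict)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; Count each word of wlist into wvec (mutates and returns wvec).
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else (let ((wpos (posn (car wlist) dict)))
                (vector-set! wvec wpos (+ (vector-ref wvec wpos) 1))
                (words-to-vector (cdr wlist) wvec)))))

;; "tv program program" has zero counts of computer, one tv, two program.
(define counts
  (words-to-vector '(tv program program) (make-vector dict-size 0)))
```

With this dictionary, `counts` holds 0, 1, and 2 in positions 0 through 2. A word that is not in the dictionary makes `posn` signal an error, so the dictionary would have to cover every word in the collection and in the queries.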


Indexing

(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  ;; dict-size: number of words in the dictionary (same constant as in build-query)
  (cond ((null? swp-list) '())
        (else (cons (make-doc-rep (swp-header (car swp-list))
                                  (words-to-vector (swp-body (car swp-list))
                                                   (make-vector dict-size 0)))
                    (build-index (cdr swp-list))))))


Query Construction

(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))


Retrieval

(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query (index must be non-empty)
  (doc-rep-id
   (foldl (lambda (doc best)
            (if (> (dot-product (doc-rep-vec doc) query)
                   (dot-product (doc-rep-vec best) query))
                doc
                best))
          (car index)
          (cdr index))))
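
`retrieve` calls `dot-product`, which the slides do not define. A minimal sketch of one possible definition follows, assuming both vectors have the same length; the loop structure and the name `query` are illustrative choices, and the word order (computer = 0, tv = 1, program = 2) is an assumption.

```scheme
;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the pairwise products of the two vectors.
;; ASSUMPTION: v1 and v2 have the same length (one slot per dictionary word).
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; Hypothetical word order: computer = 0, tv = 1, program = 2.
(define query (vector 1 0 1))          ; query "computer program"
(dot-product (vector 1 0 1) query)     ; doc "computer program" scores 2
(dot-product (vector 0 1 1) query)     ; doc "tv program" scores 1
```

Under this scoring the document “computer program” has the larger dot product, so it is the one `retrieve` would return for this query.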