Introduction to Digital Libraries Information Retrieval

Page 1: Introduction to Digital Libraries Information Retrieval

Introduction to Digital Libraries

Information Retrieval

Page 2: Introduction to Digital Libraries Information Retrieval

Sample Statistics of Text Collections

• Dialog: claims to have >12 terabytes of data in >600 databases, >800 million unique records

• LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers

Page 3: Introduction to Digital Libraries Information Retrieval

Information Retrieval

• Motivation

– the larger the holdings of the archive, the more useful it is

– however, the larger the archive, the harder it is to find what you want

Page 4: Introduction to Digital Libraries Information Retrieval

Simple IR Model

[Figure: block diagram of a simple IR model. Boxes: User, Query, Results, Pre-Processing, Post-Processing, Searching, Storage, Collection & Processing, with a Feedback path. Technique labels attached to the boxes: Stemming, Thesaurus, Signature; Ranking, Clustering, Weighting; Boolean, Vector; Flat Files, Inverted Files, Signature Files, PAT Trees; Stemming, Stoplist.]

Page 5: Introduction to Digital Libraries Information Retrieval

IR problem

• In libraries

ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>

• External attributes and internal attribute (content)

• Search by external attributes = search in DB

• IR: search by content

Page 6: Introduction to Digital Libraries Information Retrieval

Basic concepts

• Document is described by a set of representative keywords (index terms)

• Keywords may have binary weights or weights calculated from statistics of their frequency in text

• Retrieval is a ‘matching’ process between document keywords and words in queries
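The two weighting options above can be sketched as follows. This is a minimal illustration, not a reference implementation: the whitespace tokenizer, the function names, and the sample document are all assumptions made for the example.

```python
from collections import Counter

def index_terms(text):
    """Split a document into lowercase keywords (a deliberately naive tokenizer)."""
    return text.lower().split()

def binary_weights(text):
    """Weight 1 for every keyword that occurs in the document."""
    return {term: 1 for term in set(index_terms(text))}

def frequency_weights(text):
    """Weight each keyword by how often it occurs in the document."""
    return dict(Counter(index_terms(text)))

def match_score(query, weights):
    """'Matching': sum the weights of the query words found in the document."""
    return sum(weights.get(word, 0) for word in index_terms(query))

# Hypothetical sample document.
doc = "parallel computer systems and parallel computer networks"
```

With binary weights a document either matches a query word or it does not; with frequency weights, repeated keywords raise the score.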

Page 7: Introduction to Digital Libraries Information Retrieval

IR Outline

• Index Storage
– flat files, inverted files, signature files, PAT trees

• Processing
– Stemming, stop-words

• Searching & Queries
– Boolean, vector (including ranking, weighting, feedback)

• Results
– clustering

Page 8: Introduction to Digital Libraries Information Retrieval

Flat Files Index

• Simple files, no additional processing or storage needed

• Worst case keyword search time: O(DW)
– D = # of documents
– W = # words per document
– linear search

• Clearly only acceptable for small collections
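A rough sketch of why the flat-file search is O(DW): with no index, every word of every document may have to be inspected. The function name and the three sample documents are assumptions for illustration only.

```python
def flat_file_search(documents, keyword):
    """Return the ids of documents containing keyword, scanning every word."""
    hits = []
    for doc_id, text in documents.items():   # D documents
        for word in text.lower().split():    # up to W words per document
            if word == keyword:
                hits.append(doc_id)
                break                        # one hit per document is enough
    return hits

# Hypothetical collection.
docs = {
    1: "parallel computer architecture",
    2: "distributed computing system",
    3: "computer networks",
}
```

A keyword that occurs nowhere is the worst case: all D x W words are compared before the search gives up.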

Page 9: Introduction to Digital Libraries Information Retrieval

Inverted Files

• All input files are read, and a list of which words appear in which documents (records) is made

• Extra space required can be up to 100% of original input files

• Worst case keyword search time is now O(log(DW))

• Almost all indexing systems in popular usage use inverted files

Page 10: Introduction to Digital Libraries Information Retrieval

Sample Inverted File

Term         Record  Frequency
computer     1       3
computer     3       5
computing    2       1
distributed  2       1
parallel     1       2
system       2       1
...          ...     ...
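One way an inverted file like the sample above could be built is sketched here. The three documents are hypothetical, written only so that their term counts reproduce the sample table; the sorted vocabulary with binary search is one assumed way to get logarithmic lookup.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

def build_inverted_file(documents):
    """Map each term to {record id: frequency} and keep a sorted vocabulary."""
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            postings[term][doc_id] = freq
    return sorted(postings), dict(postings)

def lookup(vocabulary, postings, term):
    """Binary search in the sorted vocabulary, then return the postings."""
    i = bisect_left(vocabulary, term)
    if i < len(vocabulary) and vocabulary[i] == term:
        return postings[term]
    return {}

# Hypothetical records chosen to match the sample table's counts.
docs = {
    1: "parallel computer parallel computer computer",
    2: "distributed computing system",
    3: "computer computer computer computer computer",
}
vocabulary, postings = build_inverted_file(docs)
```

All the reading and counting happens once at indexing time; a query then touches only the vocabulary and one postings list.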

Page 11: Introduction to Digital Libraries Information Retrieval

Structure of inverted index

• May be a hierarchical set of addresses, e.g.

word number within sentence number within paragraph number within chapter number within volume number within document number

• Consider as a vector (d,v,c,p,s,w)

Page 12: Introduction to Digital Libraries Information Retrieval

Inverted File Index

Store appearance of terms in documents (like index of a book)

Term            Postings
alphabet        (15,42); (26,186); (31,86)
database        (41,10)
index           (15,76); (51,164); (76,641); (81,64)
information     (16,76)
retrieval       (16,88)
semistructured  (5,61); (15,174); (25,41)
XML             (1,108); (2,65); (15,741); (21,421)
XPath           (5,90); (21,301)

Each posting is (document-ID, position in the doc)

Answer queries like "xml and index", "information near retrieval"

But: not suitable for evaluating path expressions
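The two example queries can be sketched against the postings from this slide. The pairing of terms with postings follows the order in which they appear on the slide, which is an assumption; the function names and the `window` of 20 positions for "near" are also assumptions.

```python
# (document-ID, position) postings, paired with terms in slide order (assumed).
index = {
    "alphabet": [(15, 42), (26, 186), (31, 86)],
    "database": [(41, 10)],
    "index": [(15, 76), (51, 164), (76, 641), (81, 64)],
    "information": [(16, 76)],
    "retrieval": [(16, 88)],
    "semistructured": [(5, 61), (15, 174), (25, 41)],
    "xml": [(1, 108), (2, 65), (15, 741), (21, 421)],
    "xpath": [(5, 90), (21, 301)],
}

def docs_with(term):
    """Set of document ids in which the term occurs."""
    return {doc for doc, _ in index.get(term.lower(), [])}

def query_and(a, b):
    """Documents containing both terms."""
    return sorted(docs_with(a) & docs_with(b))

def query_near(a, b, window=20):
    """Documents where the two terms occur within `window` positions."""
    hits = set()
    for doc_a, pos_a in index.get(a.lower(), []):
        for doc_b, pos_b in index.get(b.lower(), []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= window:
                hits.add(doc_a)
    return sorted(hits)
```

The AND query needs only the document ids; the NEAR query is what the stored positions are for.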

Page 13: Introduction to Digital Libraries Information Retrieval

An Inverted File

• Search for
– “databases”
– “microsoft”

term         docURL
data         http://www-inst.eecs.berkeley.edu/~cs186
database     http://www-inst.eecs.berkeley.edu/~cs186
date         http://www-inst.eecs.berkeley.edu/~cs186
day          http://www-inst.eecs.berkeley.edu/~cs186
dbms         http://www-inst.eecs.berkeley.edu/~cs186
decision     http://www-inst.eecs.berkeley.edu/~cs186
demonstrate  http://www-inst.eecs.berkeley.edu/~cs186
description  http://www-inst.eecs.berkeley.edu/~cs186
design       http://www-inst.eecs.berkeley.edu/~cs186
desire       http://www-inst.eecs.berkeley.edu/~cs186
developer    http://www.microsoft.com
differ       http://www-inst.eecs.berkeley.edu/~cs186
disability   http://www.microsoft.com
discussion   http://www-inst.eecs.berkeley.edu/~cs186
division     http://www-inst.eecs.berkeley.edu/~cs186
do           http://www-inst.eecs.berkeley.edu/~cs186
document     http://www-inst.eecs.berkeley.edu/~cs186

Page 14: Introduction to Digital Libraries Information Retrieval

Other indexing structures

• Signature files
– Each document has an associated signature, generated by hashing each term it contains
– Leads to possible matches; further processing to resolve

• Bitmaps
– One-to-one hash function; each distinct term in the collection has a bit vector with one bit for each document
– Special case of signature file; storage expensive

Page 15: Introduction to Digital Libraries Information Retrieval

Signature Files

Signature size. Number of bits in a signature, F.

Word signature. A bit pattern of size F with exactly m bits set to 1 and the others 0.

Block. A sequence of text that contains D distinct words.

Block signature. The logical OR of all the word signatures in a block of text.

Page 16: Introduction to Digital Libraries Information Retrieval

Signature File

• Each document is divided into “logical blocks”: pieces of text that contain a constant number D of distinct, non-common words

• Each word yields a “word signature”, which is a bit pattern of size F, with m bits set to 1 and the rest to 0
– F and m are design parameters

Page 17: Introduction to Digital Libraries Information Retrieval

Sample Signature File

Word Signature

free 001 000 110 010

text 000 010 101 001

block signature 001 010 111 011

Figure, D=2, F=12, m=4

Page 18: Introduction to Digital Libraries Information Retrieval

data 0000 0000 0000 0010 0000

base 0000 0001 0000 0000 0000

management 0000 1000 0000 0000 0000

system 0000 0000 0000 0000 1000

----------------------------------------

block signature 0000 1001 0000 0010 1000

Figure, D=4, F=20, m=1

Page 19: Introduction to Digital Libraries Information Retrieval

Signature File

• Searching

– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1.

• False drop

– A word signature may match the block signature even though the word is not in the block. This is a “false hit” or “false drop”.

– The false drop probability is the probability that the signature test “fails” in this way.
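The signature test and a false drop can be sketched with the D=2, F=12, m=4 sample from the earlier slide. The word signatures for "free" and "text" are taken from that slide; the signature `other` is a made-up pattern for a hypothetical word that is not in the block, and the function names are assumptions.

```python
def parse_signature(bits):
    """Turn a bit pattern such as '001 000 110 010' into an integer."""
    return int(bits.replace(" ", ""), 2)

def block_signature(word_signatures):
    """Superimpose (logical OR) all word signatures of a block."""
    sig = 0
    for word_sig in word_signatures:
        sig |= word_sig
    return sig

def may_contain(block_sig, word_sig):
    """True if every 1 bit of the word signature is also set in the block."""
    return (block_sig & word_sig) == word_sig

words = {
    "free": parse_signature("001 000 110 010"),
    "text": parse_signature("000 010 101 001"),
}
block = block_signature(words.values())

# Hypothetical word that is NOT in the block, also with m=4 bits set.
other = parse_signature("001 010 100 001")
```

Here `may_contain(block, other)` is true although the word is absent: each of its bits happens to be covered by "free" or "text". That is a false drop, so a positive signature test still has to be confirmed against the text itself.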

Page 20: Introduction to Digital Libraries Information Retrieval

Sistrings

• Original text:

”The traditional approach for searching a regular expression…”

• Sistrings

1. “The traditional approach for searching …”

2. “he traditional approach for searching a…”

3. “e traditional approach for searching a …”

…

12. “onal approach for searching a regular …”

Page 21: Introduction to Digital Libraries Information Retrieval

Sistrings

• Once upon a time, in a far away land ...
– sistring1: Once upon a time ...
– sistring2: nce upon a time ...
– sistring8: on a time, in a ...
– sistring11: a time, in a far ...
– sistring22: a far away land ...
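The numbering above is simply the 1-based character position where each sistring starts. A small sketch, with the function name and the optional `length` cut-off as assumptions (a real sistring is semi-infinite, so the cut-off is only for display):

```python
def sistring(text, position, length=None):
    """Semi-infinite string starting at a 1-based character position.

    `length` only truncates the result for display purposes.
    """
    start = position - 1
    return text[start:] if length is None else text[start:start + length]

text = "Once upon a time, in a far away land ..."

# A text of n characters has n sistrings, one per starting position.
all_sistrings = [sistring(text, p) for p in range(1, len(text) + 1)]
```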

Page 22: Introduction to Digital Libraries Information Retrieval

PAT Trees

• PAT Tree:
– a Patricia tree constructed over all the possible sistrings of a document
– bits of the key decide branching
  • 0 is branch to left subtree
  • 1 is branch to right subtree
  • internal node decides which bit of the key to use
  • at leaf node, check any skipped bits

• The PAT (suffix) tree of a string S is a compacted trie that represents all the semi-infinite strings (sistrings) of S, and therefore all of its substrings.

Page 23: Introduction to Digital Libraries Information Retrieval

PATRICIA TREE

• A particular type of “trie”

• Example, trie and PATRICIA TREE with content ‘010’, ‘011’, and ‘101’.

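[Figure: a binary trie with levels Lv0, Lv1, and Lv2 holding 010, 011, and 101, next to the equivalent PATRICIA tree, which tests only Lv0 and Lv2: at Lv0, branch 1 leads directly to 101; branch 0 leads to an Lv2 test that separates 010 (branch 0) from 011 (branch 1).]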
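The difference between the trie and the PATRICIA tree for ‘010’, ‘011’, and ‘101’ can be sketched in code. This is a simplified, path-compressed binary trie rather than a full PATRICIA implementation (which stores keys in internal nodes and uses back pointers); the tuple representation and function names are assumptions, and keys are assumed to be distinct, non-empty in number, and of equal length.

```python
def build(keys, bit=0):
    """Build a path-compressed binary trie over distinct, equal-length bit strings.

    A leaf is the key itself; an internal node is (bit_to_test, left, right).
    """
    keys = sorted(keys)
    if len(keys) == 1:
        return keys[0]
    # Skip bit positions on which all remaining keys agree.
    while len({k[bit] for k in keys}) == 1:
        bit += 1
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return (bit, build(zeros, bit + 1), build(ones, bit + 1))

def search(node, key):
    """Follow only the tested bits, then verify the skipped bits at the leaf."""
    while isinstance(node, tuple):
        bit, left, right = node
        node = left if key[bit] == "0" else right
    return node == key

tree = build(["010", "011", "101"])
```

The resulting `tree` tests bit 0 and then bit 2 and never looks at bit 1, which is why the final comparison at the leaf is needed: ‘001’ reaches the leaf ‘011’ and is rejected only there.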

Page 24: Introduction to Digital Libraries Information Retrieval

PAT Tree

[Figure: PAT tree over the bit string below, with sistrings 1-8 already indexed. Legend: one node shape marks a sistring, the other marks the bit position to check.]

Text:     01100100010111...
Position: 123456789....

Query: 00101

Page 25: Introduction to Digital Libraries Information Retrieval

Try to build the Patricia tree

• A 00001
• S 10011
• E 00101
• R 10010
• C 00011
• H 01000
• I 01001
• N 01110
• G 00111
• X 11000
• M 01101
• P 10000

Page 26: Introduction to Digital Libraries Information Retrieval

PAT Tree

[Figure: the resulting Patricia tree. Node labels by level, as they appear on the slide: A; E, S; R, X, C, H; G, I, N; M; P.]

Page 27: Introduction to Digital Libraries Information Retrieval

Example

Text        01100100010111 …
sistring 1  01100100010111 …
sistring 2  1100100010111 …
sistring 3  100100010111 …
sistring 4  00100010111 …
sistring 5  0100010111 …
sistring 6  100010111 …
sistring 7  00010111 …
sistring 8  0010111 ...

[Figure: successive stages of building the PAT tree from these sistrings, with branches labelled 0 and 1. Legend: external node = sistring (integer displacement); internal node = skip counter & pointer (total displacement of the bit to be inspected).]

Page 28: Introduction to Digital Libraries Information Retrieval

SISTRING

• The bit level is too abstract and its usefulness depends on the application, so sistrings are rarely built at bit level. The character level is a better idea!
– e.g. CUHK
– The corresponding sistrings would be
  • CUHK000…
  • UHK000…
  • HK000…
  • K000…
– We require that each sistring be at least 4 characters long.
– (Why do we pad with 0/NULL at the end of a sistring?)

Page 29: Introduction to Digital Libraries Information Retrieval

SISTRING (USAGE)

• We may instead store the sistrings of ‘CUHK’, which requires O(n^2) storage.

– CUHK <- represents C, CU, CUH, and CUHK at the same time
– UHK0 <- represents U, UH, and UHK at the same time
– HK00 <- represents H and HK at the same time
– K000 <- represents K only

• Prefix matching on sistrings is equivalent to exact matching on substrings.

• Conclusion: sistrings are a better representation for storing substring information.
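The equivalence between prefix matching on sistrings and exact substring matching can be sketched with a sorted list of sistrings and binary search. This is a stand-in for the PAT tree, not the tree itself; the function names are assumptions, and the NULL padding is omitted because Python strings carry their own length.

```python
from bisect import bisect_left

def sistrings(text):
    """All sistrings of the text, sorted so that shared prefixes are adjacent."""
    return sorted(text[i:] for i in range(len(text)))

def contains_substring(sorted_sistrings, pattern):
    """A pattern is a substring iff it is a prefix of some sistring."""
    i = bisect_left(sorted_sistrings, pattern)
    return i < len(sorted_sistrings) and sorted_sistrings[i].startswith(pattern)

table = sistrings("CUHK")
```

Only the first sistring at or after the pattern in sorted order has to be checked, because every sistring that starts with the pattern sorts immediately after the pattern itself.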

Page 30: Introduction to Digital Libraries Information Retrieval

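PAT Tree (Example)

• By digitizing the strings, we can work out by hand what the PAT tree looks like.

• The following are the actual bit patterns (8-bit ASCII, NULL padding) of the four sistrings:

CUHK  01000011 01010101 01001000 01001011
UHK0  01010101 01001000 01001011 00000000
HK00  01001000 01001011 00000000 00000000
K000  01001011 00000000 00000000 00000000

[Figure: PAT tree over the four sistrings. Counting bits from 0, the root tests bit 3 (1 leads to UHK0); its 0 branch tests bit 4 (0 leads to CUHK); the 1 branch of that node tests bit 6 (0 leads to HK00, 1 leads to K000).]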

Page 31: Introduction to Digital Libraries Information Retrieval

PAT Tree (Example)

• This works! BUT…
– We still need O(n^2) memory for storing those sistrings

• We may reduce the memory to O(n) by making use of pointers.

Hello This document is simple  01001000 …
This document is simple        01010100 …
document is simple             01100100 …
is simple                      01101001 …
simple                         01110011 …

[Figure: PAT tree of a REAL (but very simple) document. Counting bits from 0, the root tests bit 2. Its 0 branch tests bit 3 and separates “Hello This document is simple” (0) from “This document is simple” (1). Its 1 branch also tests bit 3: 1 leads to “simple”, and 0 leads to a bit 4 test that separates “document is simple” (0) from “is simple” (1).]

Page 32: Introduction to Digital Libraries Information Retrieval

Space/Time Tradeoffs

[Figure: flat files, signature files, inverted files, and PAT trees plotted on Space versus Time axes.]

Page 33: Introduction to Digital Libraries Information Retrieval

Stemming

• Reason:
– Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them

• Stemming:
– Removing some endings of words

computer, compute, computes, computing, computed, computation → comput

Page 34: Introduction to Digital Libraries Information Retrieval

Inverted File, Stemmed

Term       Record  Frequency
comput     1       3
comput     3       5
comput     2       1
distribut  2       1
parallel   1       2
system     2       1
...        ...     ...
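A toy suffix-stripping stemmer that reproduces the stems in the table above is sketched below. The suffix list and the minimum stem length are assumptions chosen for this example only; this is not the Porter algorithm or any other published stemmer.

```python
# Assumed suffix list, ordered so that longer endings are tried first.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

forms = ["computer", "compute", "computes", "computing", "computed", "computation"]
stems = {stem(w) for w in forms}
```

Such a crude rule set conflates unrelated words as easily as related ones, which is one reason effectiveness studies of stemming give mixed results.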

Page 35: Introduction to Digital Libraries Information Retrieval

Stemming 

• am, are, is → be

• car, cars, car's, cars' → car

• the boy's cars are different colors → the boy car be differ color

Page 36: Introduction to Digital Libraries Information Retrieval

Stemming

• Manual or Automatic

• Can reduce index file size by up to 50%

• Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall

Page 37: Introduction to Digital Libraries Information Retrieval

Stopwords

• Stopwords exist in stoplists or negative dictionaries

• Idea: remove low semantic content
– index should only have “important stuff”

• What not to index is domain dependent, but often includes:
– “small” words: a, and, the, but, of, an, very, etc.
– case is removed
– punctuation
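The three removals listed above (small words, case, punctuation) can be sketched as one pre-processing step. The stoplist holds only the words named on the slide; a real stoplist is domain dependent, and the function name is an assumption.

```python
import string

# Only the "small" words named on the slide; real stoplists are longer.
STOPLIST = {"a", "and", "the", "but", "of", "an", "very"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPLIST]

tokens = preprocess("The index of a book, and the very BEST of it!")
```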

Page 38: Introduction to Digital Libraries Information Retrieval

Stop words

• Very common words that have no discriminatory power

• e.g. in Arabic: إلى (to), من (from), في (in), ...

Page 39: Introduction to Digital Libraries Information Retrieval

Normalization

• Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A vs USA
– Anti-discriminatory vs antidiscriminatory
– Car vs automobile?

Page 40: Introduction to Digital Libraries Information Retrieval

Capitalization/case folding

• Good for
– Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
– Helping a search engine when most users type ferrari when they are interested in a Ferrari car

• Bad for
– Proper names vs common nouns
– General Motors, Associated Press, Black

• Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning

Page 41: Introduction to Digital Libraries Information Retrieval

Performance of search

• 3 major classes of performance measures
– precision / recall
  • TREC conference series, http://trec.nist.gov/
– space / time
  • see Esler & Nelson, JNCA for an example
  • http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability
  • probably the most important measure, but largely ignored

Page 42: Introduction to Digital Libraries Information Retrieval

Precision and Recall

• Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)

• Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in database)

Page 43: Introduction to Digital Libraries Information Retrieval

Standard Evaluation Measures

Starts with a CONTINGENCY table

               retrieved    not retrieved
relevant       w            x               n1 = w + x
not relevant   y            z
               n2 = w + y                   N

Page 44: Introduction to Digital Libraries Information Retrieval

Precision and Recall

Recall = w / (w + x)

From all the documents that are relevant out there, how many did the IR system retrieve?

Precision = w / (w + y)

From all the documents that are retrieved by the IR system, how many are relevant?
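Both measures can be computed from two sets of document ids, as in this small sketch. The function name and the sample ids are assumptions; returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 5 relevant in the database, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

Retrieving more documents can only keep recall the same or raise it, while precision usually falls, which is why the two are reported together.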

Page 45: Introduction to Digital Libraries Information Retrieval

User-Centered IR Evaluation

• More user-oriented measures
– Satisfaction, informativeness

• Other types of measures
– Time, cost-benefit, error rate, task analysis

• Evaluation of user characteristics

• Evaluation of interface

• Evaluation of process or interaction