Lecture 4: Indexing Files

Page 1: Lecture 4: Indexing Files

CS5286 Search Engine Technology and Algorithms/Xiaotie Deng

Lecture 4: Indexing Files

  • Inverted File
  • Lexical Analysis
  • Stop lists

Page 2: Lecture 4: Indexing Files

Indexing

Arrangement of data (data structure) to permit fast searching

Which list is easier to search?

  sow fox pig eel yak hen ant cat dog hog

  ant cat dog eel fox hen hog pig sow yak
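
The difference between the two lists can be sketched in code. The following is a minimal illustration, not part of the original slides: the function names are hypothetical, and it only assumes that the second list is kept in alphabetical order so that binary search applies.

```python
from bisect import bisect_left

unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = "ant cat dog eel fox hen hog pig sow yak".split()

def linear_search(words, target):
    """Scan left to right; return (index, comparisons made), index -1 if absent."""
    for i, word in enumerate(words):
        if word == target:
            return i, i + 1
    return -1, len(words)

def binary_search(words, target):
    """Requires `words` to be sorted; return the index of target or -1."""
    i = bisect_left(words, target)
    return i if i < len(words) and words[i] == target else -1

print(linear_search(unsorted_words, "hog"))  # last word: every entry is compared
print(binary_search(sorted_words, "hog"))    # found after about log2(10) probes
```

On the unsorted list a search may have to look at every word; on the sorted list each probe halves the remaining range.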

Page 3: Lecture 4: Indexing Files

Creating inverted files

Original Documents → Document IDs

Word Extraction → Word IDs

Inverted Files

  W1: d1, d2, d3
  W2: d2, d4, d7, d9
  Wn: di,…dn

Page 4: Lecture 4: Indexing Files

Creating Inverted file

Map the file names to file IDs. Consider the following original documents:

  D1  The Department of Computer Science was established in 1984.
  D2  The Department launched its first BSc(Hons) in Computer Studies in 1987.
  D3  followed by the MSc in Computer Science which was started in 1991.
  D4  The Department also produced its first PhD graduate in 1994.
  D5  Our staff have contributed intellectually and professionally to the advancements in these fields.

Page 5: Lecture 4: Indexing Files

Creating Inverted file

  D1  The Department of Computer Science was established in 1984.
  D2  The Department launched its first BSc(Hons) in Computer Studies in 1987.
  D3  followed by the MSc in Computer Science which was started in 1991.
  D4  The Department also produced its first PhD graduate in 1994.
  D5  Our staff have contributed intellectually and professionally to the advancements in these fields.

Red: stop word (stop words are marked in red on the original slide)

Page 6: Lecture 4: Indexing Files

Creating Inverted file

  D1  depart comput scienc establish
  D2  depart launch bsc hons comput studi
  D3  follow msc comput scienc start
  D4  depart produc phd graduat
  D5  staff contribut intellectu profession advanc field

After stemming: make lowercase (optional), delete numbers (optional)

Page 7: Lecture 4: Indexing Files

Creating Inverted file (unsorted)

  Words        Documents
  depart       d1,d2,d4
  comput       d1,d2,d3
  scienc       d1,d3
  establish    d1
  launch       d2
  bsc          d2
  hons         d2
  studi        d2
  follow       d3
  msc          d3
  start        d3
  produc       d4
  phd          d4
  graduat      d4
  staff        d5
  contribut    d5
  intellectu   d5
  profession   d5
  advanc       d5
  field        d5

Page 8: Lecture 4: Indexing Files

Creating Inverted file (sorted)

  Words        Documents
  advanc       d5
  bsc          d2
  comput       d1,d2,d3
  contribut    d5
  depart       d1,d2,d4
  establish    d1
  field        d5
  follow       d3
  graduat      d4
  hons         d2
  intellectu   d5
  launch       d2
  msc          d3
  phd          d4
  produc       d4
  profession   d5
  scienc       d1,d3
  staff        d5
  start        d3
  studi        d2
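
The construction shown on the last few slides can be sketched as follows. This is a minimal sketch rather than the lecture's own implementation: it starts from the already stemmed terms of D1 to D5, and the function name `build_inverted_file` is hypothetical.

```python
from collections import defaultdict

# Stemmed, stop-word-free terms for D1..D5, copied from the slides.
docs = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}

def build_inverted_file(docs):
    """Map each term to the list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            if doc_id not in postings[term]:
                postings[term].append(doc_id)
    return dict(postings)

# Terms appear in order of first occurrence (the unsorted inverted file).
unsorted_file = build_inverted_file(docs)
# Sorting by term gives the sorted inverted file.
sorted_file = dict(sorted(unsorted_file.items()))

for term, doc_ids in sorted_file.items():
    print(f"{term:<12}{','.join(doc_ids)}")
```

Because Python dictionaries preserve insertion order, `unsorted_file` lists the terms in the order they are first seen, and `sorted_file` lists them alphabetically.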

Page 9: Lecture 4: Indexing Files

Searching on Inverted File

  • Binary search
      - Used at small scale
  • Create a thesaurus and combine it with techniques such as:
      - Hashing
      - B+ tree
      - Pointer to the address in the indexed file
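
Two of these lookup strategies can be sketched briefly. The example below is an illustration under stated assumptions: it uses a handful of entries from the running example, the function names are hypothetical, and a Python `dict` stands in for the hashing approach. A B+ tree is not shown.

```python
from bisect import bisect_left

# A few entries of the sorted inverted file from the running example.
terms = ["advanc", "bsc", "comput", "depart", "scienc", "studi"]
postings = [["d5"], ["d2"], ["d1", "d2", "d3"], ["d1", "d2", "d4"], ["d1", "d3"], ["d2"]]

def lookup_binary(term):
    """Binary search over the sorted term list: O(log n) comparisons."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

# Hashing alternative: a dict gives expected constant-time lookup.
hashed = dict(zip(terms, postings))

def lookup_hashed(term):
    """Hash lookup; returns an empty list for unknown terms."""
    return hashed.get(term, [])

print(lookup_binary("comput"))
print(lookup_hashed("depart"))
```

In a disk-based index the posting list would usually be replaced by a pointer (file offset) to where the postings are stored, as the last bullet suggests.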

Page 10: Lecture 4: Indexing Files

Lexical Analysis for indexing

  • Word extraction
      - Spaces as English word boundaries
      - Chinese word segmentation
  • Stop word elimination
      - “a”, “an”, “the”, “about”, “etc”, “every”, “you”, etc.
  • Word stemming
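
The first two steps can be sketched for English text as follows. This is a rough sketch with assumptions: the stoplist is illustrative (the slide's examples plus a few words needed for the sample sentence), words are split on any non-alphanumeric character rather than on spaces alone, and stemming and Chinese segmentation are left out.

```python
import re

# Illustrative stoplist: the slide's examples plus a few extra words (an assumption).
STOP_WORDS = {"a", "an", "the", "about", "etc", "every", "you", "of", "was", "in"}

def extract_words(text):
    """Split on anything that is not a letter or digit, then lowercase."""
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", text) if w]

def remove_stop_words(words):
    """Drop every word that appears in the stoplist."""
    return [w for w in words if w not in STOP_WORDS]

words = extract_words("The Department of Computer Science was established in 1984.")
print(remove_stop_words(words))
```

A stemmer would then reduce the remaining words to forms such as "depart" and "comput".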

Page 11: Lecture 4: Indexing Files

Lexical Analysis

Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens

Lexical analysis is the first stage of:

  • Automatic indexing
  • Query processing

Page 12: Lecture 4: Indexing Files

Lexical Analysis for Automatic Indexing

What counts as a word or token in the indexing scheme? (An easy problem?)

  • Digits: “Year 2000”, “Y2K”
  • Hyphens: “F-16”, “MS-DOS”
  • Other punctuation: “COMMAND.COM”, “max_size” (often in C code)
  • Case: IBM or ibm
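
Each of these decisions can be exposed as a switch in the tokenizer. The sketch below is one possible design, not the lecture's: the function `tokenize` and its flags are hypothetical names, and the regular expression assumes that hyphens and periods may join parts of a token.

```python
import re

def tokenize(text, split_hyphens=False, fold_case=True, keep_digits=True):
    """Toy tokenizer exposing the indexing decisions as flags (hypothetical names)."""
    # Letters, digits and underscores form a token; '-' and '.' may join parts.
    pattern = r"[A-Za-z0-9_]+(?:[-.][A-Za-z0-9_]+)*"
    tokens = re.findall(pattern, text)
    if split_hyphens:
        tokens = [part for t in tokens for part in t.split("-")]
    if fold_case:
        tokens = [t.lower() for t in tokens]
    if not keep_digits:
        # Drop tokens that consist of digits only.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

sample = "F-16 MS-DOS COMMAND.COM max_size Y2K Year 2000"
print(tokenize(sample))
print(tokenize(sample, split_hyphens=True, fold_case=False))
```

The same flags must be applied to queries as to documents, otherwise query terms will not match index terms.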

Page 13: Lecture 4: Indexing Files

Lexical Analysis for Automatic Indexing (cont.)

  • No technical difficulty in solving any of these problems
  • Must think about them carefully
  • Tradeoff between recall and precision
      - Breaking up hyphenated terms increases recall but decreases precision
      - Preserving case distinctions enhances precision but decreases recall
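
The hyphen tradeoff can be made concrete with a toy collection. The three documents and the relevance judgments below are invented purely for illustration, and all function names are hypothetical; the sketch only shows the direction of the effect.

```python
# Hypothetical mini-collection; "a" and "b" are assumed relevant to the query "MS-DOS".
docs = {
    "a": "MS-DOS boot disk",
    "b": "MS DOS commands",           # relevant, written without the hyphen
    "c": "DOS attack on MS servers",  # not relevant
}
relevant = {"a", "b"}

def terms(text, split_hyphens):
    """Lowercase the text and split on spaces, optionally breaking hyphenated terms."""
    text = text.lower()
    if split_hyphens:
        text = text.replace("-", " ")
    return set(text.split())

def search(query, split_hyphens):
    """Return the documents that contain every query term."""
    q = terms(query, split_hyphens)
    return {d for d, text in docs.items() if q <= terms(text, split_hyphens)}

def precision_recall(found):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(found & relevant)
    precision = hits / len(found) if found else 0.0
    return precision, hits / len(relevant)

for split in (False, True):
    found = search("MS-DOS", split)
    print(split, sorted(found), precision_recall(found))
```

Keeping the hyphen retrieves only document "a" (high precision, lower recall); breaking it up retrieves all three (full recall, lower precision).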

Page 14: Lecture 4: Indexing Files

Lexical Analysis for Query Processing

  • Depends on the design of the lexical analyzer for automatic indexing
  • Distinguish operators (Boolean operators, weighting function operators, etc.)
  • Process certain characters:
      - Control characters: “” for phrase search, {} for priority
      - Disallowed punctuation characters (error)
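
A query scanner along these lines might look as follows. This is a sketch under assumptions, not a definitive design: the operator set (AND, OR, NOT), the token kinds, and the function name `scan_query` are all hypothetical, and terms are lowercased on the assumption that the indexer folds case as well.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set

TOKEN = re.compile(r'''
    "(?P<phrase>[^"]*)"      # quoted phrase search
  | (?P<brace>[{}])          # braces for priority grouping
  | (?P<word>[A-Za-z0-9]+)   # operator or search term
  | (?P<space>\s+)           # whitespace is skipped
  | (?P<bad>.)               # anything else is disallowed
''', re.VERBOSE)

def scan_query(query):
    """Turn a query string into (kind, value) tokens; reject disallowed characters."""
    tokens = []
    for m in TOKEN.finditer(query):
        kind = m.lastgroup
        if kind == "space":
            continue
        if kind == "bad":
            raise ValueError(f"disallowed character {m.group()!r} at position {m.start()}")
        if kind == "phrase":
            tokens.append(("PHRASE", m.group("phrase").lower()))
        elif kind == "brace":
            tokens.append(("OPEN" if m.group() == "{" else "CLOSE", m.group()))
        elif m.group() in BOOLEAN_OPERATORS:
            tokens.append(("OP", m.group()))
        else:
            tokens.append(("TERM", m.group().lower()))
    return tokens

print(scan_query('{unix OR "MS DOS"} AND NOT windows'))
```

A query such as `unix & dos` raises `ValueError`, which corresponds to the "disallowed punctuation characters (error)" case.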

Page 15: Lecture 4: Indexing Files

STOPLISTS

  • Many of the most frequently occurring words in English (“the”, “of”, etc.) are worthless as index terms
  • Eliminating such words:
      - Speeds processing
      - Saves huge amounts of space in indexes
      - Does not damage retrieval effectiveness
  • Stoplists are used to eliminate such words. E.g.,
      - http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
      - http://bll.epnet.com/help/ehost/Stop_Words.htm
      - http://www.syger.com/jsc/docs/stopwords/english.htm

Page 16: Lecture 4: Indexing Files

STOPLISTS

  • The choice of words in a stop list may vary from person to person.
  • The general idea is to find words that occur so often that they are not good terms for information retrieval.
  • How can the vector space model be used to find a list of stop words?
  • How can stop words be found in Chinese?
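
One possible answer to the vector space question, offered only as a sketch, is to use the inverse document frequency (idf) weight: a term that occurs in almost every document has an idf near zero and contributes little to any document vector. The threshold `max_idf`, the function name, and the four sample documents below are assumptions made for illustration.

```python
import math
from collections import Counter

def stop_word_candidates(docs, max_idf=0.5):
    """Return terms whose idf = log(N / df) is at most max_idf (threshold is an assumption)."""
    n = len(docs)
    df = Counter()
    for text in docs:
        # Count each term once per document (document frequency).
        df.update(set(text.lower().split()))
    return sorted(t for t, count in df.items() if math.log(n / count) <= max_idf)

# Hypothetical documents, invented for this example.
docs = [
    "the department of computer science",
    "the history of the university",
    "the study of computer languages",
    "a department of the university",
]
print(stop_word_candidates(docs))
```

For Chinese text the same counting could be applied after word segmentation, since the words are not separated by spaces.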