Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 4 (book chapter 8):
Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com
2
Previous Chapter: Conclusions
Main measures: Precision & Recall.
o For sets
o Rankings are evaluated through initial subsets
There are measures that combine them into one
o Involve user-defined preferences
Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful
Reference collections with known answers exist to evaluate new algorithms
3
Previous Chapter: Research topics
Different types of interfaces
Interactive systems:
o What measures to use?
o Such as informativeness
4
Types of searching
Indexed
o Semi-static
o Space overhead
Sequential
o Small texts
o Volatile, or space limited
Combined
o Index into large portions, then sequential inside portion
o Best combination of speed / overhead
5
Inverted files
Vocabulary: sqrt(n) (Heaps' law). 1 GB → 5 MB
Occurrences: n * 40% (stopwords)
o positions (word, char), files, sections...
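A minimal sketch of the structure described on this slide: a vocabulary mapping each word to its list of occurrences (word positions here). The stopword list, the tokenizer and the function name are illustrative assumptions, not taken from the lecture.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def build_inverted_index(text):
    """Map each non-stopword to the ascending list of its word positions."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        word = match.group()
        if word not in STOPWORDS:
            index[word].append(position)
    return dict(index)

index = build_inverted_index("The text is made of many words and the words repeat")
print(index["words"])  # word positions of "words"
```

Storing character offsets, file numbers or section numbers instead of word positions only changes what is appended to each list.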
6
Compression: Block addressing
Block addressing: 5% overhead
o 256, 64K, ... blocks (1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)
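Block addressing can be sketched as follows, assuming equal-size blocks measured in words (the block size and function names are hypothetical). The index stores only block numbers; exact positions are recovered by a sequential scan inside each candidate block.

```python
from collections import defaultdict

def build_block_index(words, words_per_block):
    """Map each word to the set of block numbers that contain it."""
    index = defaultdict(set)
    for position, word in enumerate(words):
        index[word].add(position // words_per_block)
    return index

def search(words, index, words_per_block, query):
    """Find exact positions: index lookup first, then sequential scan per block."""
    hits = []
    for block in sorted(index.get(query, ())):
        start = block * words_per_block
        for offset, word in enumerate(words[start:start + words_per_block]):
            if word == query:
                hits.append(start + offset)
    return hits

words = "to be or not to be that is the question".split()
block_index = build_block_index(words, words_per_block=4)
print(search(words, block_index, 4, "be"))
```

With few blocks each pointer fits in one or two bytes, which is where the space saving comes from; the price is the scan inside the block.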
7
Searching in inverted files
Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt (n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging occurrences
o For AND: one list is usually shorter (Zipf law), so sublinear!
Only inverted files allow both space and time to be sublinear
o Suffix trees and signature files don't
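A small sketch of lexicographic vocabulary search, assuming the vocabulary is kept as a sorted list in memory. It shows why a sorted vocabulary supports prefix queries while hashing does not: all words sharing a prefix are contiguous. The function names are illustrative.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Binary search: index of word in the sorted vocabulary, or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(vocabulary, prefix):
    """All vocabulary words starting with prefix (a contiguous slice)."""
    lo = bisect_left(vocabulary, prefix)
    # Assumption: no word contains the highest Unicode code point.
    hi = bisect_left(vocabulary, prefix + "\U0010ffff")
    return vocabulary[lo:hi]

vocabulary = sorted(["file", "index", "information", "inverted", "search", "suffix"])
print(lookup(vocabulary, "search"))
print(prefix_range(vocabulary, "in"))
```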
8
Building inverted file: 1
Infinite memory? Use trie to store vocabulary. O(n)
o append positions
Finite memory? Build in chunks, merge. Almost O(n)
Insertion: index + merge. Deleting: O(n). Very fast.
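The "build in chunks, merge" idea might look like the sketch below. Two simplifications are assumptions of this example: a Python dict stands in for the trie, and the partial indexes stay in memory, whereas a real system would write each one to disk and merge the files.

```python
import heapq
from collections import defaultdict

def index_chunk(words, offset):
    """Index one chunk; positions are global (offset + local position)."""
    partial = defaultdict(list)
    for i, word in enumerate(words):
        partial[word].append(offset + i)
    return sorted(partial.items())  # sorted by word, ready for merging

def merge_partial_indexes(partials):
    """Merge sorted partial indexes, concatenating the lists of equal words."""
    merged = {}
    for word, positions in heapq.merge(*partials):
        merged.setdefault(word, []).extend(positions)
    return merged

def build_in_chunks(words, chunk_size):
    partials = [index_chunk(words[start:start + chunk_size], start)
                for start in range(0, len(words), chunk_size)]
    return merge_partial_indexes(partials)

merged_index = build_in_chunks("a b a c b a".split(), chunk_size=2)
print(merged_index)
```

Insertion of new text follows the same pattern: index the new portion, then merge it with the existing index.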
9
Suffix trees
Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space
For text retrieval, inverted files are better
10
11
Info for tree comes from the text itself
12
Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
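A minimal sketch of a suffix array and of binary search over it. The construction below simply sorts the suffixes, which is fine for a short string but is not the construction method for large texts; the function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "abra"))
```

Because the pattern is compared against suffixes, not words, the same search works for prefixes, phrases and arbitrary substrings.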
13
Suffix tree and suffix array: Searching. Construction
Searching
o Patterns, prefixes, phrases. Not only words
o Suffix tree: O(m), but: space (m = query size)
o Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
o Large text: n^2 log(M)/M (M = memory size), more than for inverted files
o Skip details
Addition: n n' log(M)/M (n' is the size of the new portion). Deletion: n
14
Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of its pattern are set
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block
Good to search ANDs or proximity queries
o bit patterns are ORed
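The scheme on this slide can be sketched as below. The signature width, the number of bits per word and the use of SHA-256 as the hash are assumptions chosen for illustration; the slide only says that words are mapped to bit patterns.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word

def word_signature(word):
    """Bit pattern of a word: a few bit positions derived from a hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(words):
    """OR of the signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, query):
    """Scan all block signatures; verify candidates to discard false drops."""
    pattern = word_signature(query)
    hits = []
    for number, (words, signature) in enumerate(zip(blocks, signatures)):
        if signature & pattern == pattern:  # all bits of the pattern are set
            if query in words:              # traverse the block itself
                hits.append(number)
    return hits

blocks = [["text", "retrieval"], ["suffix", "array"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "suffix"))
```

A block that passes the bit test but does not contain the word is a false drop; only the traversal of the block detects it.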
15
False drop: letters in 2nd block
16
Boolean operations
Merging file (occurrences) lists
o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge
There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
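A sketch of AND over two sorted occurrence lists, in two variants: the plain linear merge, and the optimization mentioned above, where each element of the short list is looked up in the long one by binary search. The function names are illustrative.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Linear merge of two sorted lists: O(len(a) + len(b))."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """Binary-search each element of the short list in the long one."""
    result, lo = [], 0
    for item in short:
        lo = bisect_left(long, item, lo)  # never look left of the last hit
        if lo == len(long):
            break
        if long[lo] == item:
            result.append(item)
    return result

rare = [2, 40, 77]
frequent = list(range(0, 100, 2))
print(intersect_by_search(rare, frequent))
```

The second variant costs roughly len(short) * log(len(long)) comparisons, which pays off when one list is much shorter than the other.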
17
Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
MANY faster algorithms, but more complicated
o See the book
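For reference, the brute-force algorithm can be written as below (a plain sketch; the function name is illustrative). The inner loop usually stops after one or two comparisons, which is why the average case is O(n) even though the worst case is O(nm).

```python
def brute_force_search(text, pattern):
    """Try every start position; compare character by character."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched
            hits.append(start)
    return hits

print(brute_force_search("abababa", "aba"))
```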
18
Approximate string matching
Match with k errors, select the one with min k
Levenshtein distance between strings s1 and s2
o The minimum number of editing operations to make one from another
o Symmetric for standard sets of operations
o Operations: deletion, addition, change
o Sometimes weighted
Solution: dynamic programming. O(mn), O(kn)
o m, n are lengths of the two strings
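A sketch of the O(mn) dynamic programming solution with unit costs, followed by the standard variant for searching a pattern inside a text with at most k errors (a match may start anywhere, so the first row of the table is all zeros). Function names are illustrative; the faster O(kn) algorithm is not shown.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost deletion, addition and change."""
    previous = list(range(len(s2) + 1))  # distance from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # addition
                               previous[j - 1] + (c1 != c2)))  # change or match
        previous = current
    return previous[-1]

def search_with_errors(pattern, text, k):
    """End positions in text where pattern matches with at most k errors."""
    column = list(range(len(pattern) + 1))
    ends = []
    for position, c in enumerate(text):
        new_column = [0]  # a match may start at any text position
        for i, p in enumerate(pattern, start=1):
            new_column.append(min(column[i] + 1,
                                  new_column[i - 1] + 1,
                                  column[i - 1] + (p != c)))
        column = new_column
        if column[-1] <= k:
            ends.append(position)
    return ends

print(levenshtein("kitten", "sitting"))
print(search_with_errors("abc", "xxabcxx", 0))
```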
19
Regular expressions
Regular expressions
o Automaton: O(m 2^m) + O(n), bad for long patterns
o There are better methods, see book
Using indices to search for words with errors
o Inverted files: search in vocabulary
o Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path
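Searching with errors through an inverted file can be sketched as a sequential scan of the vocabulary, which is small compared to the text, followed by retrieval of the occurrence lists of the matching words. The toy index and the function names below are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance, computed row by row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]

def search_vocabulary_with_errors(index, query, k):
    """Occurrence lists of all vocabulary words within k errors of query."""
    return {word: positions for word, positions in index.items()
            if edit_distance(word, query) <= k}

# Toy inverted index: word -> positions (made-up data for illustration).
toy_index = {"color": [3], "colour": [10, 42], "collar": [7], "cooler": [20]}
print(search_vocabulary_with_errors(toy_index, "color", 1))
```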
20
Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Look up each query word in the vocabulary to get its code
o More sophisticated algorithms
Compressed inverted files: less disk, less time
Text and index compression can be combined
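A simplified sketch of "compress the query and search". Two assumptions differ from the slide: the Huffman code here is binary rather than byte-oriented, and the compressed text is kept as a list of codewords instead of a packed byte stream. The point it illustrates is that the query word is replaced by its code and compared against codes, with no decompression of the text.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(words):
    """Word-based Huffman code: frequent words get shorter codes."""
    frequencies = Counter(words)
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    tiebreak = count()  # unique counter so that dicts are never compared
    heap = [(freq, next(tiebreak), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def compress(words, codes):
    """Replace each word by its codeword (kept unpacked for clarity)."""
    return [codes[word] for word in words]

def search_compressed(compressed, codes, query):
    """Look up the query's code in the vocabulary, then compare codes only."""
    code = codes.get(query)
    if code is None:
        return []
    return [i for i, c in enumerate(compressed) if c == code]

words = "to be or not to be to".split()
codes = huffman_codes(words)
compressed = compress(words, codes)
print(search_compressed(compressed, codes, "be"))
```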
21
...compression
Suffix trees can be compressed almost to size of suffix arrays
Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression
Signature files are sparse, so can be compressed
o ratios up to 70%
22
23
Research topics
Perhaps, new details in integration of compression and search
"Linguistic" indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms
24
Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching
Compression can be integrated with search
25
Thank you!
Till April 26, 6 pm