Hsin-Hsi Chen 4-1
Chapter 4 Query Language
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Hsin-Hsi Chen 4-2
Introduction
• Goals
– Which queries can be formulated
– How the formulation is related to the underlying information retrieval models
• Query languages
Hsin-Hsi Chen 4-3
[Figure: types of queries. Labels: basic queries; keywords and context; pattern matching; words; phrases; proximity; Boolean queries; fuzzy Boolean; natural language; errors; substrings; prefixes; suffixes; regular expressions; extended patterns; structured queries]
Hsin-Hsi Chen 4-4
Keyword-Based Querying
• single-word queries
– A query is formulated as a single word
– A document is formulated as a long sequence of words
– A word is a sequence of letters surrounded by separators
– What are letters and separators?
• e.g., ‘on-line’
– Chinese sentences are composed of characters without word boundaries
– The division of the text into words is not arbitrary
(This topic will be dealt with in a special talk for Chinese IR)
Hsin-Hsi Chen 4-5
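Word Segmentation Problem
• Problem
– Chinese sentences have no explicit delimiters between words.
– Example: 這名記者會說國語。
• 這 名 記者 會 說 國語。 (reading with 記者, "reporter", followed by 會, "can")
• 這 名 記者會 說 國語。 (reading with 記者會, "press conference")
• Definition of a word
– A string that has an independent meaning and plays a specific syntactic role should be treated as one word.
• Segmentation standards
– Mainland China: 信息處理用現代漢語分詞規範 (word segmentation specification for modern Chinese in information processing)
• drafted in 1989
• submitted as a national standard in 1993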
Hsin-Hsi Chen 4-6
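Word Segmentation Problem (Continued)
– Taiwan: 資訊處理用中文分詞標準草案 (draft Chinese word segmentation standard for information processing)
• drafted in 1996 by the Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)
• basic principles
– A string whose meaning cannot be obtained by directly adding up the meanings of its components should form one segmentation unit, e.g., 撞期 vs. 撞山
– A string whose part of speech cannot be derived directly from its components should be merged into one segmentation unit, e.g., 好喝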
Hsin-Hsi Chen 4-7
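Processing Model
• A dictionary is an indispensable resource
– List "all" possible words
• 把他的確實行動作了分析 → 把, 他, 的, 確實, 實行, 行動, 動作, 了, 分析
• 電子計算機是會計算題目的機器 → 電子, 計算, 計算機, 電子計算機, 是, 會, 會計, 計算, 計算題, 題目, 目的, 的, 機器
– word lattice
[Figure: word lattice built over the characters 電 子 計 算 機 是 會 計 算 題 目 的 機 器]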
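The "list all possible words" step can be sketched as a dictionary lookup over every substring of the sentence. This is a minimal sketch, not the author's implementation: the function name `dictionary_matches` and the toy `LEXICON` (copied from the words listed on the slide) are assumptions made for illustration.

```python
def dictionary_matches(sentence, lexicon):
    """Return (start, end, word) for every lexicon word found in the sentence.

    The resulting spans are the edges of a word lattice over the characters.
    """
    max_len = max(len(word) for word in lexicon)
    matches = []
    for start in range(len(sentence)):
        longest_end = min(len(sentence), start + max_len)
        for end in range(start + 1, longest_end + 1):
            candidate = sentence[start:end]
            if candidate in lexicon:
                matches.append((start, end, candidate))
    return matches


# Toy lexicon (assumption): just the words listed on the slide.
LEXICON = {"電子", "計算", "計算機", "電子計算機", "是", "會", "會計",
           "計算題", "題目", "目的", "的", "機器"}

lattice = dictionary_matches("電子計算機是會計算題目的機器", LEXICON)
for start, end, word in lattice:
    print(start, end, word)
```

Each tuple is one edge of the lattice; a segmentation is a path of edges that covers the sentence from the first character to the last.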
Hsin-Hsi Chen 4-8
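Processing Model (Continued)
• Disambiguation mechanism
– Pick the best combination
– Strategies
• Rule-based
– Longest word first: 台灣大學 是 有名 的 學府
– Caveat: a long word can mask shorter words, e.g., 記者會 masks 記者 會 in 這 名 記者 會 說 國語。
– Remove word segments that break the path
– Heuristics: prefer three-character words, ...
– Parser
• Statistical
– Markov models, relaxation, ...
– Performance: every system claims an accuracy above 95%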
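The longest-word-first rule can be illustrated with forward maximum matching, one common way to realize it. This is a sketch under assumptions: the function name and the toy `LEXICON` are hypothetical, and characters not covered by any dictionary word are emitted as single-character words.

```python
def forward_maximum_matching(sentence, lexicon):
    """Greedy left-to-right segmentation that always takes the longest word."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (assumption), chosen to reproduce the two slide examples.
LEXICON = {"台灣", "大學", "台灣大學", "有名", "學府", "記者", "記者會", "國語"}

good = forward_maximum_matching("台灣大學是有名的學府", LEXICON)
bad = forward_maximum_matching("這名記者會說國語", LEXICON)
print(good)
print(bad)
```

The first sentence is segmented as on the slide. In the second, the greedy rule picks 記者會 and so hides the intended reading 記者 會, which is why rule-based systems add further heuristics or fall back on statistical models.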
Hsin-Hsi Chen 4-9
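Processing Model (Continued)
• Where the problem lies
– Does the dictionary contain all possible words?
• e.g., A-錢, 凍蒜
– Strategies
• morphological rules
• (semi-)automatic construction of new dictionaries
• unknown word handling models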
Hsin-Hsi Chen 4-10
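Morphological Rules
• Formation of numerals and measure words
– 一個個, 一條條
• Dates and times
– 八十五年十月四日
• Prefixes or suffixes of nouns or verbs
– 學生們
• Special verbs
– 丟丟看, 吃吃看, 寫寫看
– 高高興興, 歡歡喜喜, 漂漂亮亮, 迷迷糊糊
– 打打球, 跑跑步, 寫寫字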
• ...
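Some of these reduplication patterns are regular enough to be recognized without a dictionary entry. The following is a rough sketch using backreferences in Python's `re` module; the pattern names `AABB` and `AA_KAN` are assumptions, and real morphological rules would need further constraints (for example on part of speech).

```python
import re

# AABB reduplication, e.g. 高高興興: two doubled characters in a row.
AABB = re.compile(r"(.)\1(.)\2")

# Doubled verb followed by 看, e.g. 吃吃看.
AA_KAN = re.compile(r"(.)\1看")


def is_reduplicated(word):
    """True if the whole word fits one of the two patterns above."""
    return bool(AABB.fullmatch(word) or AA_KAN.fullmatch(word))


for word in ["高高興興", "迷迷糊糊", "吃吃看", "高興"]:
    print(word, is_reduplicated(word))
```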
Hsin-Hsi Chen 4-11
Context Queries
• definition
– Search for words in a given context, e.g., near other words
• types
– phrase
• a sequence of single-word queries
• e.g., enhance retrieval
– proximity
• a sequence of single words or phrases, together with a maximum allowed distance between them
• e.g., within distance (enhance, retrieval, 4) will match ‘… enhance the power of retrieval …’
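A proximity query can be sketched as a check on token positions. The sketch below assumes that distance is counted in words and that the second term must follow the first; the slide does not fix either choice, and the function name `within_distance` is hypothetical.

```python
def within_distance(tokens, first, second, max_distance):
    """True if `second` occurs at most `max_distance` tokens after `first`."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(0 < j - i <= max_distance
               for i in first_positions
               for j in second_positions)


tokens = "we enhance the power of retrieval".split()
print(within_distance(tokens, "enhance", "retrieval", 4))
print(within_distance(tokens, "enhance", "retrieval", 3))
```

Under this convention a two-word phrase query is the special case `max_distance=1`.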
Hsin-Hsi Chen 4-12
Boolean Queries
• definition
– A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands
– e.g., translation AND (syntax OR syntactic)
Query syntax tree:
AND
├ translation
└ OR
  ├ syntax
  └ syntactic
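The syntax tree can be evaluated bottom-up against an inverted index: each atom retrieves a set of documents and each operator combines the sets of its operands. This is a minimal sketch; the tuple encoding of the tree and the document ids in `INDEX` are invented for illustration.

```python
# Toy inverted index (assumption): word -> set of document ids.
INDEX = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}


def evaluate(node, index):
    """Evaluate a query tree whose leaves are words and whose inner
    nodes are tuples of the form (operator, left, right)."""
    if isinstance(node, str):
        return index.get(node, set())
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "AND":
        return left_docs & right_docs
    if operator == "OR":
        return left_docs | right_docs
    raise ValueError(f"unknown operator: {operator}")


query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, INDEX)
print(sorted(result))
```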
Hsin-Hsi Chen 4-13
Boolean Queries (Continued)
• operators
– (e1 OR e2)
• Select all documents which satisfy e1 or e2. Duplicates are eliminated.
– (e1 AND e2)
• Select all documents which satisfy both e1 and e2.
– (e1 BUT e2)
• Select all documents which satisfy e1 but not e2.
• “fuzzy Boolean”
– Retrieve documents in which only some of the operands appear
(the AND may require more of the operands to appear than the OR)
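BUT is a set difference, and the fuzzy variant can be sketched by counting how many operands each document satisfies. The threshold parameter `min_operands` is an assumption made for this sketch: the slide only says that a fuzzy AND may require more operands than a fuzzy OR, not how many.

```python
def but(e1_docs, e2_docs):
    """(e1 BUT e2): documents that satisfy e1 and do not satisfy e2."""
    return e1_docs - e2_docs


def fuzzy_match(operand_results, min_operands):
    """Documents that appear in at least `min_operands` of the operand sets."""
    counts = {}
    for docs in operand_results:
        for doc in docs:
            counts[doc] = counts.get(doc, 0) + 1
    return {doc for doc, count in counts.items() if count >= min_operands}


# Hypothetical result sets of three operands.
operands = [{1, 2, 4}, {2, 3}, {2, 4, 5}]

print(sorted(but(operands[0], operands[1])))
print(sorted(fuzzy_match(operands, 1)))  # behaves like a strict OR
print(sorted(fuzzy_match(operands, 2)))  # relaxed AND
print(sorted(fuzzy_match(operands, 3)))  # behaves like a strict AND
```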
Hsin-Hsi Chen 4-14
Natural Language
• generalization of “fuzzy Boolean”
• A query is an enumeration of words and context queries.
• All the documents matching a portion of the user query are retrieved.
Hsin-Hsi Chen 4-15
Pattern Matching
• A pattern is a set of syntactic features that must occur in a text segment
• types
– words
– prefixes, e.g., ‘comput’ → ‘computer’, ‘computation’, ‘computing’, etc.
– suffixes, e.g., ‘ters’ → ‘computers’, ‘testers’, ‘painters’, etc.
– substrings, e.g., ‘tal’ → ‘coastal’, ‘talk’, ‘metallic’, etc.
– ranges (lexicographic order), e.g., between ‘held’ and ‘hold’ → ‘hoax’ and ‘hissing’
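These four pattern types map directly onto string operations. The sketch below runs them over a small vocabulary assembled from the slide's own examples; the list `VOCABULARY` and the variable names are assumptions, and a real system would run these tests against its index vocabulary rather than a Python list.

```python
# Toy vocabulary (assumption), built from the words used on the slide.
VOCABULARY = ["coastal", "computation", "computer", "computers", "computing",
              "hissing", "hoax", "metallic", "painters", "talk", "testers"]

prefix_hits = [w for w in VOCABULARY if w.startswith("comput")]
suffix_hits = [w for w in VOCABULARY if w.endswith("ters")]
substring_hits = [w for w in VOCABULARY if "tal" in w]
# Python compares strings lexicographically, which is what a range needs.
range_hits = [w for w in VOCABULARY if "held" <= w <= "hold"]

print(prefix_hits)
print(suffix_hits)
print(substring_hits)
print(range_hits)
```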
Hsin-Hsi Chen 4-16
Pattern Matching (Continued)
– allowing errors
• Retrieve all text words which are ‘similar’ to the given word
• edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’ are at edit distance 1
• maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
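Edit distance as defined here can be computed with the standard dynamic-programming recurrence. This is a minimal sketch; the helper `matches_with_errors` and its parameter name `max_errors` are hypothetical names for the "maximum allowed edit distance" test.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def matches_with_errors(word, pattern, max_errors):
    """True if word is within max_errors edits of pattern."""
    return edit_distance(word, pattern) <= max_errors


print(edit_distance("flower", "flo wer"))
print(matches_with_errors("flo wer", "flower", 1))
```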
Hsin-Hsi Chen 4-17
Pattern Matching (Continued)
– regular expressions
• union: if e1 and e2 are regular expressions, then (e1 | e2) matches what e1 or e2 matches
• concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2
• repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e
• e.g., ‘pro (blem | tein) (s | ε) (0 | 1 | 2)*’ matches ‘problem2’ and ‘proteins’
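The example pattern can be tried directly with Python's `re` module, whose syntax supports the same three operations. In this sketch the spaces of the slide notation are dropped and the optional "s" is written as `(s|)`, an alternative with an empty branch.

```python
import re

# Union (|), concatenation (juxtaposition) and repetition (*).
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

for word in ["problem2", "proteins", "problems12", "program"]:
    print(word, bool(PATTERN.fullmatch(word)))
```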
Hsin-Hsi Chen 4-18
Pattern Matching (Continued)
– extended patterns
• subsets of the regular expressions expressed with a simpler syntax
• classes of characters
• conditional expressions
• wild characters which match any sequence in the text
• combinations
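Shell-style wildcards are one familiar example of such a simpler syntax. The sketch below uses Python's standard `fnmatch` module as a stand-in; it is not the pattern language of any particular retrieval system. Here `*` matches any sequence, `?` matches any single character, and `[...]` is a class of characters.

```python
from fnmatch import fnmatchcase

checks = [
    ("computers", "comput*"),  # wild character matching any sequence
    ("hold", "h?ld"),          # wild character matching one character
    ("talk", "[tw]alk"),       # class of characters
    ("balk", "[tw]alk"),       # 'b' is not in the class
    ("walking", "[tw]alk*"),   # combination of a class and a wildcard
]

results = [fnmatchcase(word, pattern) for word, pattern in checks]
for (word, pattern), matched in zip(checks, results):
    print(word, pattern, matched)
```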
Hsin-Hsi Chen 4-19
Structural Queries
• mixing contents and structure in queries
– contents: words, phrases, or patterns
– structural constraints: containment, proximity, or other restrictions on structural elements
• issues
– what structure a text may have
– what queries can be made on which structures
• three main structures
– form-like fixed structure
– hypertext structure
– hierarchical structure
Hsin-Hsi Chen 4-20
Form-like fixed structure
• Document: a fixed set of fields
• For example, a mail has a sender, a receiver, a date, a subject, and a body field.
• e.g., search for the mails sent to a given person with “football” in the Subject field
[Figure: a document as a set of fields, each holding text]
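The mail example can be sketched with a record type that has exactly the five fields named above. The class `Mail`, the function `search`, and the sample addresses are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mail:
    sender: str
    receiver: str
    date: str
    subject: str
    body: str


def search(mails, receiver, subject_word):
    """Mails sent to `receiver` whose Subject field contains `subject_word`."""
    wanted = subject_word.lower()
    return [mail for mail in mails
            if mail.receiver == receiver
            and wanted in mail.subject.lower().split()]


# Hypothetical sample data.
mails = [
    Mail("bob@example.org", "alice@example.org", "2000-01-05",
         "Football on Sunday", "Are you coming?"),
    Mail("carol@example.org", "alice@example.org", "2000-01-06",
         "Meeting notes", "We also talked about football."),
    Mail("bob@example.org", "dave@example.org", "2000-01-07",
         "Football tickets", "I have two."),
]

hits = search(mails, "alice@example.org", "football")
print([mail.subject for mail in hits])
```

Only the first mail matches: the second mentions football in the body rather than the subject, and the third has a different receiver.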
Hsin-Hsi Chen 4-21
Hypertext structure
• A hypertext is a directed graph where
– nodes hold some text (text contents)
– the links represent connections between nodes or between positions inside nodes (structural connectivity)
• WebGlimpse: combines browsing and searching on the Web
Hsin-Hsi Chen 4-22
WebGlimpse (http://glimpse.cs.arizona.edu/webglimpse/index.html)
• WebGlimpse is a fast, flexible search engine for finding information in a related web of pages.
• The ability to index pages on remote sites provides a level of power one step above most search engine tools.
• You can define your own sub-area of the web simply by making a page of links to all relevant sites.
• WebGlimpse will search by following your links, to whatever 'depth' you specify.
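The idea of searching by following links up to a chosen depth can be sketched as a breadth-first traversal of the hypertext graph. This is not WebGlimpse's implementation or interface: the function `search_linked_pages` and the toy `PAGES` and `LINKS` dictionaries are assumptions used only to illustrate the idea.

```python
from collections import deque


def search_linked_pages(pages, links, start, word, max_depth):
    """Visit pages reachable from `start` within `max_depth` links and
    return those whose text contains `word` (expected in lowercase)."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        page, depth = queue.popleft()
        if word in pages.get(page, "").lower().split():
            hits.append(page)
        if depth < max_depth:
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return hits


# Toy hypertext (assumption): node -> text, node -> outgoing links.
PAGES = {"home": "my links", "a": "query languages",
         "b": "boolean query models", "c": "structured query"}
LINKS = {"home": ["a", "b"], "b": ["c"]}

print(search_linked_pages(PAGES, LINKS, "home", "query", 1))
print(search_linked_pages(PAGES, LINKS, "home", "query", 2))
```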
Hsin-Hsi Chen 4-23
Hierarchical Structure
Recursive decomposition of the text
Hsin-Hsi Chen 4-24
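[Figure: example of a hierarchical structure]
• Text: Chapter 4; 4.1 Introduction; We cover in this chapter the different kinds of …; …; 4.4 Structural Queries; …
• Structure: chapter → section (title “Introduction”, text “We cover …”), …, section (title “Structural …”, figure)
• Query tree: figure in (section with (title with “structural”))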
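The query in this example (the figure in a section whose title contains "structural") can be sketched over a small tree of nested dictionaries. The node layout, the function names, and the placeholder figure text are assumptions; the slide only shows the shape of the tree.

```python
# Toy document tree (assumption) mirroring the slide's example.
DOCUMENT = {"type": "chapter", "children": [
    {"type": "section", "children": [
        {"type": "title", "text": "Introduction"},
        {"type": "text", "text": "We cover ..."},
    ]},
    {"type": "section", "children": [
        {"type": "title", "text": "Structural Queries"},
        {"type": "figure", "text": "(figure)"},
    ]},
]}


def descendants(node):
    """Yield every node below `node`, depth first."""
    for child in node.get("children", []):
        yield child
        yield from descendants(child)


def figures_in_sections_titled(root, word):
    """Figures inside any section that has a title containing `word`."""
    figures = []
    for node in [root, *descendants(root)]:
        if node["type"] != "section":
            continue
        inside = list(descendants(node))
        if any(n["type"] == "title" and word.lower() in n["text"].lower()
               for n in inside):
            figures.extend(n for n in inside if n["type"] == "figure")
    return figures


print(figures_in_sections_titled(DOCUMENT, "structural"))
```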
Hsin-Hsi Chen 4-25
Issues
• static or dynamic structure
– static: there are one or more explicit hierarchies
– dynamic: the required elements are built on the fly using text markup
• restrictions on the structure
– The text or the answers may have restrictions about nesting and/or overlapping
Hsin-Hsi Chen 4-26
Issues (Continued)
• integration with text
– integration of queries on text content with queries on text structure
• query language
– features
• selection of areas that contain (or not) other areas
• selection of areas that are contained (or not) in other areas
• selection of areas that follow (or are followed by) other areas
• selection of areas that are close to other areas
• set manipulation
– standardization, expressiveness taxonomy or formal categorization
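Several of these selection features can be sketched by modelling an area as a `(start, end)` pair of text positions. This representation and the function names are assumptions made for illustration; the hierarchical models listed in this chapter each define areas and operators in their own way.

```python
def containing(areas, others):
    """Areas that contain at least one of the other areas."""
    return [a for a in areas
            if any(a[0] <= o[0] and o[1] <= a[1] for o in others)]


def contained_in(areas, others):
    """Areas that lie inside at least one of the other areas."""
    return [a for a in areas
            if any(o[0] <= a[0] and a[1] <= o[1] for o in others)]


def following(areas, others):
    """Areas that start at or after the end of one of the other areas."""
    return [a for a in areas if any(o[1] <= a[0] for o in others)]


# Hypothetical positions: two sections and one figure in the second.
sections = [(0, 100), (100, 250)]
figures = [(180, 220)]

print(containing(sections, figures))
print(contained_in(figures, sections))
print(following(sections, [(0, 100)]))
```

The negated variants ("or not") follow by taking the set difference between the input areas and these results.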
Hsin-Hsi Chen 4-27
A Sample of Hierarchical Models
• PAT Expressions
• Overlapped Lists
• Proximal Nodes
• Tree Matching