Chapter 4 Ql


  • 8/12/2019 Chapter 4 Ql

    1/29

    Chapter 4 : Query Languages

    Baeza-Yates, 1999

    Modern Information Retrieval


    Outline

    Keyword-Based Querying

    Pattern Matching

    Structural Queries

    Query Protocols

    Trends and Research Issues


    Keyword-Based Querying

    A query is the formulation of a user's information need

    Keyword-based queries are popular

    1. Single-Word Queries

    2. Context Queries

    3. Boolean Queries

    4. Natural Language

    Data Retrieval

    Information Retrieval


    Single-Word Queries

    A query is formulated as a single word

    A document is modeled as a long sequence of words

    A word is a sequence of letters surrounded by separators

    What counts as a letter and what counts as a separator? e.g., on-line

    The division of the text into words is not arbitrary
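    The effect of the letter/separator choice can be sketched in a few lines of Python. This is only an illustration: the function name `tokenize` and the `extra_letters` parameter are assumptions made for this example, and letters are assumed to be ASCII letters and digits.

```python
import re

def tokenize(text, extra_letters=""):
    """Split text into words: maximal runs of letters, digits, and any
    extra characters the caller chooses to treat as letters."""
    letter_class = "[A-Za-z0-9" + re.escape(extra_letters) + "]+"
    return re.findall(letter_class, text)

# Hyphen treated as a separator: "on-line" becomes two words.
print(tokenize("on-line retrieval"))       # ['on', 'line', 'retrieval']

# Hyphen treated as a letter: "on-line" stays one word.
print(tokenize("on-line retrieval", "-"))  # ['on-line', 'retrieval']
```

    A single-word query for "line" matches the text under the first convention but not under the second.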


    Context Queries

    Definition

    - Search for words in a given context

    Types

    Phrase

    > a sequence of single-word queries

    > e.g., enhance retrieval

    Proximity

    > a sequence of single words or phrases, together with a maximum allowed distance between them

    > e.g., within distance (enhance, retrieval, 4) will match "enhance the power of retrieval"
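    Both context query types can be sketched over a list of words. The function names are hypothetical, and two assumptions are made that the slide does not fix: distance is counted in word positions, and the first word must precede the second.

```python
def phrase_positions(words, phrase):
    """Return the start positions where the words of `phrase` occur consecutively."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]

def within_distance(words, first, second, max_distance):
    """True if `second` occurs at most `max_distance` word positions after `first`."""
    firsts = [i for i, w in enumerate(words) if w == first]
    seconds = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= max_distance for i in firsts for j in seconds)

text = "enhance the power of retrieval".split()
print(within_distance(text, "enhance", "retrieval", 4))  # True
print(within_distance(text, "enhance", "retrieval", 3))  # False
print(phrase_positions("we enhance retrieval quality".split(),
                       ["enhance", "retrieval"]))        # [1]
```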


    Boolean Queries

    Definition

    A syntax composed of atoms that retrieve documents, and of Boolean operators that work on their operands

    e.g., translation AND syntax OR syntactic

    Fuzzy Boolean

    Retrieves documents that appear in some of the operands (the AND may require a document to appear in more operands than the OR)
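    A Boolean query can be evaluated with set operations over the documents each atom retrieves. The sketch below uses a made-up four-document collection and hypothetical helper names, and it assumes the example query is grouped as translation AND (syntax OR syntactic).

```python
# Toy collection, invented for this example.
docs = {
    1: "translation of syntax trees",
    2: "syntactic analysis",
    3: "machine translation",
    4: "syntactic translation rules",
}

def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def atom(index, word):
    """An atom retrieves the set of documents containing the word."""
    return index.get(word, set())

index = build_index(docs)

# AND is set intersection, OR is set union.
result = atom(index, "translation") & (atom(index, "syntax") | atom(index, "syntactic"))
print(sorted(result))  # [1, 4]
```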


    Natural Language

    Generalization of fuzzy Boolean

    A query is an enumeration of words and context queries

    All the documents matching a portion of the user query are retrieved
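    One way to picture partial matching is to retrieve every document that contains at least one query word and to order the results by how many query words they contain. The ordering criterion, the function name, and the toy collection are assumptions made for this sketch; the slide only states that partial matches are retrieved.

```python
def rank_by_matches(docs, query_words):
    """Return ids of documents matching at least one query word,
    ordered by number of matched words (ties broken by id)."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = sum(1 for w in query_words if w in words)
        if matched:
            scores[doc_id] = matched
    return sorted(scores, key=lambda d: (-scores[d], d))

collection = {
    1: "enhance retrieval quality",
    2: "retrieval of images",
    3: "cooking recipes",
}
print(rank_by_matches(collection, ["enhance", "retrieval"]))  # [1, 2]
```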


    Pattern Matching

    Data retrieval

    A pattern is a set of syntactic features that must occur in a text segment

    Types

    Words

    Prefixes

    e.g., comput -> computer, computation, computing, etc.

    Suffixes

    e.g., ters -> computers, testers, painters, etc.

    Substrings

    e.g., tal -> coastal, talk, metallic, etc.

    Ranges

    e.g., between held and hold -> hoax and hissing
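    The four pattern types map directly onto string operations. The sketch below runs them over a small made-up vocabulary; the function names are hypothetical, and treating the range bounds as inclusive is an assumption.

```python
vocabulary = [
    "computer", "computation", "computers", "testers", "painters",
    "coastal", "talk", "metallic", "held", "hissing", "hoax", "hold", "house",
]

def by_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def by_substring(words, substring):
    return [w for w in words if substring in w]

def by_range(words, low, high):
    """Words lying between low and high in lexicographic order (bounds included)."""
    return [w for w in words if low <= w <= high]

print(by_prefix(vocabulary, "comput"))      # ['computer', 'computation', 'computers']
print(by_suffix(vocabulary, "ters"))        # ['computers', 'testers', 'painters']
print(by_substring(vocabulary, "tal"))      # ['coastal', 'talk', 'metallic']
print(by_range(vocabulary, "held", "hold")) # ['held', 'hissing', 'hoax', 'hold']
```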


    Allowing errors

    Retrieve all text words that are similar to the given word

    edit distance:

    the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., "flower" and "flo wer"

    maximum allowed edit distance:

    the query specifies the maximum number of allowed errors for a word to match the pattern
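    The edit distance defined above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch, not an efficient text-searching algorithm; the function names are chosen for this example.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]                   # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca by cb, or keep it
            ))
        previous = current
    return previous[-1]

def matches_with_errors(word, pattern, max_errors):
    """True if word is within the maximum allowed edit distance of pattern."""
    return edit_distance(word, pattern) <= max_errors

print(edit_distance("flower", "flo wer"))           # 1
print(matches_with_errors("flo wer", "flower", 1))  # True
```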


    Regular expressions

    union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches

    concatenation: if e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 immediately followed by those of e2

    repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e

    e.g., pro(blem|tein)(s|)(0|1|2)* matches "problem2" and "proteins"
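    The example expression uses only union, concatenation, and repetition, so it can be tried unchanged with Python's `re` module. The candidate words other than "problem2" and "proteins" are added here for illustration.

```python
import re

# Same expression as on the slide; fullmatch requires the whole word to match.
pattern = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

for candidate in ["problem2", "proteins", "problems012", "protein", "program"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

    Only "program" fails to match, because neither branch of the union (blem|tein) can follow "pro" in it.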


    Structural Queries

    Mixing contents and structure in queries

    - contents: words, phrases, or patterns

    - structural constraints: containment, proximity, or other restrictions on structural elements

    Three main structures

    - Fixed structure

    - Hypertext structure

    - Hierarchical structure


    Fixed Structure

    Document: a fixed set of fields

    e.g., a mail has a sender, a receiver, a date, a subject, and a body field

    Query example: search for the mails sent to a given person with "football" in the Subject field
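    The mail query can be sketched with each document represented as a dictionary holding the five fields. The sample mails, the function name, and the choice of whole-word, case-insensitive matching on the subject are assumptions made for this example.

```python
# Sample mails, invented for this example.
mails = [
    {"sender": "ann", "receiver": "bob", "date": "1999-05-01",
     "subject": "Football on Sunday", "body": "Are you coming?"},
    {"sender": "carl", "receiver": "bob", "date": "1999-05-02",
     "subject": "Meeting notes", "body": "We talked about football."},
    {"sender": "bob", "receiver": "ann", "date": "1999-05-03",
     "subject": "Re: football on Sunday", "body": "Yes."},
]

def search_mails(mails, receiver, subject_word):
    """Mails sent to `receiver` whose subject field contains `subject_word`."""
    wanted = subject_word.lower()
    return [
        m for m in mails
        if m["receiver"] == receiver and wanted in m["subject"].lower().split()
    ]

for mail in search_mails(mails, "bob", "football"):
    print(mail["sender"], "|", mail["subject"])  # ann | Football on Sunday
```

    The second mail is not returned even though "football" occurs in its body, because the query restricts the word to the subject field.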


    Hypertext

    A hypertext is a directed graph where nodes hold some text (text contents)

    The links represent connections between nodes or between positions inside nodes (structural connectivity)
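    A hypertext of this kind can be held as two dictionaries, one for node text and one for outgoing links. The sketch below searches for a word among the nodes reachable within a given number of links from a start node; the node names, texts, and the function `search_neighborhood` are invented for this illustration, and links between positions inside nodes are not modeled.

```python
from collections import deque

# Text contents: node name -> text held by the node.
nodes = {
    "home": "information retrieval course",
    "ch4": "query languages and pattern matching",
    "ch8": "indexing and searching",
    "refs": "bibliography on query languages",
}

# Structural connectivity: node name -> nodes it links to.
links = {"home": ["ch4", "ch8"], "ch4": ["refs"], "ch8": [], "refs": []}

def search_neighborhood(start, word, max_links):
    """Nodes containing `word` that are at most `max_links` links away from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, depth = queue.popleft()
        if word in nodes[node].split():
            hits.append(node)
        if depth < max_links:
            for target in links[node]:
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return hits

print(search_neighborhood("home", "query", 1))  # ['ch4']
print(search_neighborhood("home", "query", 2))  # ['ch4', 'refs']
```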


    Hypertext : WebGlimpse

    WebGlimpse: combines browsing and searching on the Web


    Hierarchical Structure

    PAT Expressions

    Overlapped Lists

    Lists of References

    Proximal Nodes

    Tree Matching


    Query Protocols

    Z39.50

    WAIS (Wide Area Information Service)


    Z39.50

    American National Standard Information Retrieval Application Service Definition

    Can be implemented on any platform

    Queries bibliographical information using a standard interface between the client and the host database manager

    Z39.50 protocol is part of WAIS


    Z39.50 Brief history

    Z39.50-1988 (version 1)

    Z39.50-1992 (version 2)

    Z39.50-1995 (version 3)

    Version 4: development began in autumn 1995


    Using Z39.50 over the WWW

    [Figure: components shown are a WWW client, a WWW / Z39.50 component containing a Z39.50 client, a Z39.50 server, and a repository (digital library)]


    WAIS (Wide Area Information Service)

    Beginning in the 1990s

    Query databases through the Internet


    Trends and Research Issues

    Relationship between types of queries and models

    Model            Queries allowed

    Boolean          word, set operations

    Vector           words

    Probabilistic    words

    BBN              words


    Query Language Taxonomy

    The types of queries covered and how they are structured


    Overlapped Lists

    The model allows the areas of a region to overlap, but not to nest

    It is not clear whether overlapping is good or not for capturing the structural properties


    Lists of References

    Overlapping and nesting are not allowed

    All elements must be of the same type, e.g., only sections, or only paragraphs

    A reference is a pointer to a region of the database
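    The constraints of the overlapped-lists and lists-of-references models can be compared by treating each area as a (start, end) pair of text positions. The representation, the inclusive end positions, and all function names are assumptions made for this sketch; neither model prescribes them here.

```python
def overlaps(a, b):
    """Partial overlap: the areas share text but neither contains the other."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s1 < s2 <= e1 < e2

def nests(a, b):
    """One area lies completely inside the other."""
    (s1, e1), (s2, e2) = a, b
    return (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)

def pairs(areas):
    return [(a, b) for i, a in enumerate(areas) for b in areas[i + 1:]]

def valid_overlapped_list(areas):
    """Overlapped lists: areas may overlap but must not nest."""
    return not any(nests(a, b) for a, b in pairs(areas))

def valid_reference_list(areas):
    """Lists of references: areas may neither overlap nor nest."""
    return not any(nests(a, b) or overlaps(a, b) for a, b in pairs(areas))

chained = [(0, 10), (5, 15), (12, 20)]  # each area overlaps the next one
print(valid_overlapped_list(chained))   # True
print(valid_reference_list(chained))    # False
print(valid_overlapped_list([(0, 10), (2, 5)]))  # False: (2, 5) is nested
```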


    Tree Matching

    The leaves of the query can be not only structural elements but also text patterns, meaning that the ancestor of the leaf must contain that pattern.
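    A simplified form of tree matching can be sketched as follows. The matching rule is an assumption made for this illustration: a query node matches a document node with the same name if every structural child of the query matches some descendant, and every text-pattern leaf occurs in the text of the matched node. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Doc:
    name: str
    text: str = ""
    children: List["Doc"] = field(default_factory=list)

@dataclass
class Query:
    name: Optional[str] = None     # structural element name, None for a text leaf
    pattern: Optional[str] = None  # text pattern, used only by text leaves
    children: List["Query"] = field(default_factory=list)

def full_text(node):
    """Text of a node together with the text of everything below it."""
    return " ".join([node.text] + [full_text(c) for c in node.children])

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def matches(query, node):
    if query.name != node.name:
        return False
    for q in query.children:
        if q.name is None:
            # Text leaf: the element matched by its parent must contain the pattern.
            if q.pattern not in full_text(node):
                return False
        elif not any(matches(q, d) for d in descendants(node)):
            return False
    return True

def find(query, root):
    """All nodes of the document tree that match the query tree."""
    return [n for n in [root, *descendants(root)] if matches(query, n)]

chapter = Doc("chapter", children=[
    Doc("section", children=[Doc("title", "Keyword-Based Querying"),
                             Doc("paragraph", "phrase and proximity queries")]),
    Doc("section", children=[Doc("title", "Pattern Matching"),
                             Doc("paragraph", "regular expressions")]),
])

# Sections that have a title and contain the text "proximity".
query = Query("section", children=[Query("title"), Query(pattern="proximity")])
for hit in find(query, chapter):
    print(hit.children[0].text)  # Keyword-Based Querying
```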