XML Keyword Search - Penn Engineeringzives/03s/cis650/keyword.pdfOnce the tables are filled XML-QL...

XML Keyword Search

CIS650Yong Cha

Introduction

� One of the goals of XML is to represent document information and we might want to perform a keyword search on particular elements.

� The goal of the paper is to combine database-style query language with free-text search.

Role of RDBMS in XML keyword search

� XML is rapidly becoming data format of choice, but retrieving data from large XML sources can be a painfully slow experience.

� What if you want to search 100MB XML file?

� To improve XML query performance, RDBMS is used to extend XML-QL.

Why use RDBMS

� It is easy to build an extended XML query processor on top of RDBMS because most of the functionality are already built in.

� RDBMSs are universally available.� RDBMSs allow to mix XML data with other

types of data.� RDBMSs have very high performance.

� Extensible Markup Language� XML and HTML are subsets of SGML

(Standard Generalized Markup Language)� Comparison of SGML and XML� http://www.w3.org/TR/NOTE-sgml-xml-

971215

Sample XML

<author><name>Adam Dingle</name></author><author><name>Peter Sturmh</name></author><author><name>Li Zhang</name></author><title>Analysis and Characterization of Large-Scale Web Server Access Pattern</title><year>1999</year><booktitle>World Wide Web Journal</booktitle>

</article></document>

XML as Tree

elId16name

elId11author

elId17name

elId12author

elId13title

elId14year

elId15booktitle

elId1article

elid26name

elId21author

elid27name

elId22author

elId23title

elId24year

elId25booktitle

elId2article

elId0Root

XML-QL

� where (XML-pattern [ELEMENT_AS $elem_var]) IN fileName, (predicate)* construct XML-pattern | variable

� Example: Retrive all articles written by Mr. Dingle in 1999 about the web.

where <article><author><name>$N</name></author><title>$T</title><year>1999</year></article> ELEMENT_AS $E IN “bib.xml”, $N like *Dingle* , $T like *web* construct $E

Extending XML-QL with Keyword Search

� XML-QL is a good way to query XML document if you know the structure of the document.

� If user is not familiar with the structure queries get messy.

What if user wants to search for keyword ‘name’ without knowing the structure

Things get messy!!!/* search for “name” as a subelement, at any depth */where <article><author><*><name>$N</name></*></author>

<title>$T</title><year>1999</year></article>ELEMENT_AS $E IN “bib.xml”$N like *Dingle*, $T like *web*

construct $Eunion/* search for “name” as an attribute name, at any depth */where <article<author><*><! name=$N></_></*></author>

<title>$T</title><year>1999</year></article> ELEMENT_AS $E IN “bib.xml”,$N like *Dingle*, $T like *web*

construct $E

Contains predicate

� Contains predicate is introduced to simplify queries.� contains (tag_name, attribute_name content,

attribute_value)where <article> </article> ELEMENT_AS $E IN “bib.xml”, contains($E,”Dingle”, 3, any), contains($E, “1999”, 3, any), contains($E, “web”, 3, any)construct $E

Inverted Files

� What happens when there are too many big documents?

� Use inverted files.

� The classic index structure for keyword search.

� In the simplest form, inverted files contain word and document.

� For XML data, the paper used form <word, elId, depth, location>

<“article”, elId1, 0, tag>

<“id”, elId1, 1, attr>

How to create inverted files

� You want to index following text:� A cookie is just a cookie, but fig newtons

are fruit and cake.� <a, 1> <a, 17>� <cookie, 3> <cookie, 19>� <is, 10>� <just, 13> etc.

Storing Inverted Files in RDBMS

� Inverted files can be easily stored as tables.

� To improve the performance, inverted files are partitioned by words.

� Further partitioning is done by combining word with type.

� word-type(elId, depth, location)� Adam-name(elId16, 4, 1)

Problem with Inverted Files

� Indexes can grow to be bigger than the source.

� Size of original XML file: 7.7MB� Size of relation database: 90MB

Extended XML-QL Query Processing

� First, create binary table for all XML tags. � tag-name(source, target, value)� article(“article”, “author”, null)� author(“author”, “name”, null)� name(“name”, null, “Adam Dingle”)� title(“title”, null, “Analysis … “)� Once the tables are filled XML-QL queries can be executed as SQL. select art.sourcefrom article art, author aut, name n, title t, year ywhere art.target=aut.source and art.target=t.source and art.target = y.source and aut.target=n.source and y.value=1999 and t.value like “web” and n.value like “Dingle”

Keyword search

� Once you have inverted file, keyword search become simple select.

� where <article> </article> ELEMENT_AS $E IN “bib.xml”, contains($E, “Dingle”, 3, any) construct $E

� simply becomes� select elId from Dingle-article

Combining XML Query with Keyword Search

� Both query and keyword search can be combined easily into a single SQL statement.

� Where <article> <year>1999</year><author>$A</author> </article> ELEMENT_AS $E IN “bib.xml”, contains($E, “Web”, 4, any) construct $A

� becomes� select aut.target from article art, year y, author aut,

web-article c where art.target=year.source and art.target=aut.source and art.target=c.elId and year.value=“1999” and c.depth<=4

Distributed XML Query Processing

� Problem arises when XML data cannot be stored in RDBMS.

� Web site might give a permission to index the data without giving a permission to store the content.

� Two scenarios arise.� XML data source without query

capabilities and XML data source with query capabilities.

XML data sources without query capabilities

� Prefilter: User the inverted file stored in the RDBMS to find the relevant documents and/or elIDs.

� Retrieve: Get the relevant documents (or elements) from the data sources.

� Execute: Extract the query results from the retrieved documents (or elements).

XML data sources with query capabilities

� Prefilter: User the inverted file stored in the RDBMS to find the relevant documents and/or elIDs.

� Push Down: Pass elIds to the data sources and let the data sources execute the whole or parts of the query on these elIds.

� Refinement: Refine the results returned by the data sources if the data sources could not execute the whole query.

Query performance

� Experiments were done on three scenarios.

� Structured: The query exploits the full XML structure.

� Partially structured: The query exploits some structure.

� Unstructured: The query has no structure.

Performance

Query 1 Query 2

Running Time Query Result Running Time Query Result

Structured 3 secs 38 lines 7 secs 208 scenes

Part.Structured

1.5 secs 19 scenes 2.5 secs 1108 scenes

Unstructured 4 secs 56 elements 10 secs 10909elements

Questions

XML Keyword Search - Penn Engineeringzives/03s/cis650/keyword.pdfOnce the tables are filled XML-QL...

Documents

ASHIMA KALRA INTRODUCTION OF XML INTRODUCTION OF XML XML FEATURES XML FEATURES XML SYNTAX XML SYNTAX XML ELEMENTS XML ELEMENTS XML ATTRIBUTES

XML Schema and Stylus Studio. Introduction to XML Schema XML Schema defines building blocks of a XML document XML Schemas are alternative to DTD Why XML

XML DOCUMENTS & DATABASES. Summary of Introduction to XML HTML vs. XML HTML vs. XML Types of Data Types of Data Basics of XML Basics of XML XML Syntax,

1. XML Structure of XML Data XML Document Schema Querying and Transformation Application Program Interfaces to XML Storage of XML Data XML Applications

Information Retrieval Methods - Penn Engineeringzives/03s/cis650/ir.pdf · Databases vs. Information Retrieval DATABASES We know the schema in advance, so semantic correlation between

XML databases xml-db XML databases

SDPL 20112: XML Basics1 2 Basics of XML and XML documents Survivor's Guide to XML, or XML for Computer Scientists / Dummies 2.1 XML and XML documents 2.2

Prepared by...Unit – II: XML Introduction to XML: What is XML, XML verses HTML, XML terminology, XML standards, XML syntax checking, The idea of mark-up, XML Structure, Organizing

Adaptive Query Processing - Penn Engineeringzives/research/aqp-survey.pdf · operations, and combining text search with structured query capabili-ties, the weaknesses of the traditional

Chapter 10: XML. 10.2 XML Structure of XML Data XML Document Schema Querying and Transformation Application Program Interfaces to XML Storage of XML Data

XML – 1h. XML: Contents What is XML? What is “Well Formed” XML? What is “Valid” XML? –Document Type Definitions –Scalable Vector Graphics (SVG) XML in

XML Technologies XML Basics What is XML? Why use XML? How to write XML? 1XML Technologies - 2012 - David Raponi

Data and Schema Matching - Penn Engineeringzives/03s/cis650/schema... · 2003. 8. 29. · Data and Schema Matching Zachary G. Ives University of Pennsylvania April 16, 2003 “A survey

Introduction to XML · Introduction to XML XML Document Type Deﬁnition, DTD XML Namespaces XML Schema XML Processors Other XML Standards What Is XML? I XML is ameta-markup languagethat

The XML Standards Landscapedoveltech.com/wp-content/uploads/2017/09/XML... · Origins of XML: HTML HyperText Markup Language ... XHTML XML Query XML

Introduction to XML History of XML Advantages of XML Syntax and structure of XML Rules of well formed XML document Components of XML XML Basics

CIS 670 Fall 2001 (LN 5)1 XML 4 Introduction to XML –XML basics –DTDs –XML and semistructured data 4 Query languages for XML XML-QL, XQL, XSL 4 XML extensions

The Semantic Web-on the respective Roles of XML and RDFzives/03s/cis650/semantic.pdfWhy Semantic Web? ¡ To date, the World Wide Web has developed most rapidly as a medium of documents

XML EXtensible Markup Language. Agenda Introduction to XML XML Rules XML Elements XML Attributes XML Validation XML Exercises XML Namespaces XML CDATA

XML to XML through XML - Eindhoven University of Technology