
XML Developer’s Conference, August 19, 1999

Making XML Documents Searchable through the Web

Dongwook Shin

[email protected]

Lister Hill National Center for Biomedical Informatics

National Library of Medicine


Importance of XML Search Engine

More and more documents are beginning to be provided in XML formats.

XML documents are supposed to have certain structures.

Current Web search engines do not provide structural search capability.


Searching Characteristics

Content Searching
    Searching for certain words in the element hierarchy
    Retrieve CHAPTER whose TITLE contains “servlet” and PARAGRAPH contains “session”.

Structural Searching
    Searching for elements satisfying certain relations
    Retrieve SECTION that has at least two FIGUREs.


Searching Characteristics (Cont’d)

Combined Searching
    Content + Structural Searching
    Retrieve SECTION that has TITLE containing “XML” and contains at least one FIGURE.
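The combined query above can be illustrated with a minimal sketch. It uses Python's standard `xml.etree.ElementTree` on a made-up document; the sample XML and the helper names `text_contains` and `combined_search` are assumptions for illustration and are not part of XRS.

```python
import xml.etree.ElementTree as ET

# Made-up sample document (not from the talk).
DOC = """
<BOOK>
  <SECTION><TITLE>XML basics</TITLE><FIGURE/></SECTION>
  <SECTION><TITLE>XML tools</TITLE></SECTION>
  <SECTION><TITLE>Servlets</TITLE><FIGURE/><FIGURE/></SECTION>
</BOOK>
"""


def text_contains(element, word):
    """Content condition: the element's text, descendants included, contains word."""
    return word.lower() in "".join(element.itertext()).lower()


def combined_search(root, target, title_word, min_figures):
    """Return target elements whose TITLE contains title_word (content condition)
    and that hold at least min_figures FIGURE descendants (structural condition)."""
    hits = []
    for elem in root.iter(target):
        content_ok = any(text_contains(t, title_word) for t in elem.findall("TITLE"))
        structure_ok = len(elem.findall(".//FIGURE")) >= min_figures
        if content_ok and structure_ok:
            hits.append(elem)
    return hits


root = ET.fromstring(DOC)
# "Retrieve SECTION that has TITLE containing XML and contains at least one FIGURE"
print([hit.findtext("TITLE") for hit in combined_search(root, "SECTION", "XML", 1)])
```

This sketch scans the whole tree for every query. XRS instead answers such queries from an index, as the later slides describe.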


Other XML Search Engines on the Web

Most engines provide search only in a fixed set of fields.
Users cannot search in arbitrary elements of the document hierarchy.
    http://www.goxml.com
    http://www.scoobs.com
    http://www.xmlTree.com


XRS (XML Retrieval System)

Provides a variety of structural search functions
    Users can search in any element of the document hierarchy
    Content + Structural Searching

Allows less index overhead and quick retrieval time
    BUS (Bottom Up Scheme) is used

Applicable to valid documents, but not to documents that are only well-formed
    The DTD is used when composing queries and retrieving results
    Examples are the Shakespeare and Bible data


Architecture of XRS

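[Figure: XRS architecture. Client side (Web browser): User Interface (applet) and Rendering Component. Server side: Query Mediator Servlet and Search Engine; the Search Engine runs as a Java process reached through socket communication. Labeled flows: Initiate Applet, query, search result, XML result with XSL, HTML format.]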


User Interface (Initialization)

DTD can be browsed here

Search results are shown here with similarity value

Query conditions appear here


Query Composition

Principle
    Any element can be a target, that is, the element to be retrieved
    Search conditions can be imposed on any element

EXAMPLE
    Retrieve SPEECH whose SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
    Target: SPEECH
    Search conditions: SPEAKER contains ‘Hamlet’, LINE contains ‘Denmark’


User Interface (Query Composition)


User Interface (with Search Results)


Browser Side

Show XML results


Browsing a List of Elements


XML Result


Query at Another Target Element

Retrieve SCENE whose TITLE contains ‘Castle’ and SPEAKER contains ‘Horatio’


XML Result


Query Mediator Servlet

Mediate the query and results
    Convey the user query to the backend search engine
    Transmit the retrieved results to the applet or the rendering component
    Send the result sets with brief information to the applet
    Send the XML content with a proper XSL to the rendering component so that it can be transformed into HTML format

Session tracking and result set reclamation
    Keep session tracking so that a user can use his/her session continuously until he/she quits
    Detect dead sessions periodically and reclaim the corresponding result sets


Query Language

INIT
    Get the DBs and their DTDs available on the server
    It is sent to the server when the applet is initialized

SEARCH db_name search_cond
    db_name is one of the DBs available on the server
    search_cond includes the target and the search conditions

PRES num
    Get the XML results
    num selects the n-th result in the result set
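To make the three commands concrete, here is a minimal sketch of how a server might split an incoming command line. The slides give only the command names and arguments, so the whitespace-separated wire format, the `Command` record, and the `parse_command` helper are assumptions; the syntax of `search_cond` is not specified in the talk and is passed through as an opaque string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    verb: str
    db_name: Optional[str] = None
    search_cond: Optional[str] = None  # opaque: its syntax is not given in the slides
    num: Optional[int] = None


def parse_command(line: str) -> Command:
    """Parse 'INIT', 'SEARCH db_name search_cond', or 'PRES num'."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    verb = parts[0].upper()
    if verb == "INIT" and len(parts) == 1:
        return Command("INIT")
    if verb == "SEARCH" and len(parts) == 3:
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if verb == "PRES" and len(parts) == 2 and parts[1].isdigit():
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"malformed command: {line!r}")


# The condition text below is invented purely to exercise the parser.
print(parse_command("SEARCH shakespeare SPEECH SPEAKER=Hamlet LINE=Denmark"))
print(parse_command("PRES 3"))
```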


Result Set

A result set is assigned to each session
    The Query Mediator does session tracking

The backend search engine keeps multiple result sets
    Multi-thread safe code is required

When a session is relinquished, the result set is reclaimed
    Garbage collection for the result set is required


The Content of a Result Set

DB_name
    The name of the database where the search is performed and the result is obtained

DB_path
    The directory path from the root where the DB resides

ptr_to_result_set
    Pointer to the dynamic arrays holding the search results

num_result
    Number of elements retrieved

ptr_to_K_ary_table
    Pointer to the table that keeps the k-ary information for the DB


RS (Result Set) Management

[Figure: RS management. Components: Query Mediator, Session Monitor, backend search engine, and a Result Set Index Table whose entries are marked Used or Unused. Labeled steps: session comes; RS index requested; RS index returned (i); i-th RS accessed; i-th RS returned with the actual result.]


Periodical RS Reclamation

[Figure: periodical RS reclamation. Components: Query Mediator, Session Monitor, backend search engine, and a Result Set Index Table whose entries are marked Used, Unused, or Used but to be reclaimed (entries i and j are shown). Labeled steps: session ends; alive sessions sent; RS indices to be reclaimed returned (j, ...); reclamation requested (j, ...); j-th RS reclaimed; reclamation done.]
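The result set bookkeeping described on the last few slides can be sketched as a small thread-safe table. This is an illustration under assumptions, not the XRS implementation (which is native C): the class name `ResultSetTable`, its methods, and the fixed capacity are all hypothetical.

```python
import threading


class ResultSetTable:
    """Hypothetical result set index table: one slot per session, None means unused."""

    def __init__(self, capacity):
        self._lock = threading.Lock()      # several servlet threads may call in
        self._slots = [None] * capacity
        self._by_session = {}              # session id -> slot index

    def assign(self, session_id, results):
        """Store results for a session and return its RS index."""
        with self._lock:
            index = self._by_session.get(session_id)
            if index is None:
                index = self._slots.index(None)  # raises ValueError when the table is full
                self._by_session[session_id] = index
            self._slots[index] = list(results)
            return index

    def get(self, index):
        """Return the i-th result set, or None if the slot is unused."""
        with self._lock:
            return self._slots[index]

    def reclaim(self, alive_sessions):
        """Free the slots of every session not listed as alive; return their indices."""
        alive = set(alive_sessions)
        with self._lock:
            dead = [s for s in self._by_session if s not in alive]
            reclaimed = []
            for session_id in dead:
                index = self._by_session.pop(session_id)
                self._slots[index] = None
                reclaimed.append(index)
            return sorted(reclaimed)


table = ResultSetTable(capacity=2)
i = table.assign("session-A", ["<SPEECH>...</SPEECH>"])
j = table.assign("session-B", ["<SCENE>...</SCENE>"])
print(table.reclaim(alive_sessions=["session-A"]))  # indices freed for dead sessions
```

A periodic task would call `reclaim` with the list of sessions the mediator still considers alive, mirroring the "alive sessions sent" and "reclamation requested" steps.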


Backend Search Engine

Less indexing overhead and quick retrieval
    Uses BUS (Bottom Up Scheme)
    Most of the code is written in native C

Multi-thread support
    Multi-thread safe C code
    The C code is compiled into a shared library

Index information is saved in files


BUS (Bottom Up Scheme)

Main Idea
    Index only at the lowest level of the document structure
    Weight information at higher levels is computed at retrieval time

Benefits
    Minimizes the indexing overhead
    Supports term weights and full-blown structural search
    Guarantees quick retrieval time


Principle of BUS

[Figure: the same document tree (chapter with children section1 and section2; section1 with children para1 and para2) drawn twice, once labeled "Document tree with index terms" and once labeled "Bottom Up Scheme". Term labels recovered from the figure: hypertext(2) browser(4); hypertext(3) internet(3); multimedia(5); java(7); at chapter level: hypertext(10) browser(4) internet(5) multimedia(5) java(7).]






Indexing is performed at leaf nodes only

Term frequency is computed at run time.
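A minimal sketch of the bottom-up idea: only leaves carry postings, and the frequencies of any higher element are summed on demand. The tree shape and the para1/para2 counts follow the labels that survive from the figure; the leaf `para3` under section2 and its counts are assumptions chosen so that the chapter totals match the figure.

```python
from collections import Counter

# Leaf-level index only: leaf element -> term frequencies.
LEAF_INDEX = {
    "para1": Counter({"hypertext": 2, "browser": 4}),
    "para2": Counter({"hypertext": 3, "internet": 3}),
    # Assumed leaf, not recoverable from the slide:
    "para3": Counter({"hypertext": 5, "internet": 2, "multimedia": 5, "java": 7}),
}

# Document structure: element -> children.
CHILDREN = {
    "chapter": ["section1", "section2"],
    "section1": ["para1", "para2"],
    "section2": ["para3"],
}


def term_frequencies(node):
    """Compute the term frequencies of node at retrieval time by summing its leaves."""
    if node in LEAF_INDEX:
        return Counter(LEAF_INDEX[node])
    total = Counter()
    for child in CHILDREN.get(node, []):
        total += term_frequencies(child)
    return total


print(term_frequencies("section1"))
print(term_frequencies("chapter"))
```

Nothing is stored for section1, section2, or chapter, which is where the index saving comes from. The price is the summation at query time.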


Key Issues in BUS

How to figure out the ancestor elements of a leaf element efficiently?

How to accumulate the term frequency effectively?


UID (Unique element IDentifier)

[Figure: a document tree with elements a to j embedded in a 3-ary complete tree; real nodes and virtual nodes are drawn differently.]

Result of assigning UIDs

element  UID
a        1
b        2
c        3
d        4
e        5
f        8
g        9
h        14
i        15
j        16

parent(i) = floor((i - 2) / k) + 1

Represent each document as a k-ary complete tree and assign a UID to each node
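The parent function from this slide, parent(i) = floor((i - 2) / k) + 1, lets ancestors be found by arithmetic alone, with no stored tree. A minimal sketch follows; the helper names `ancestors` and `level` are assumptions added for illustration.

```python
def parent(uid, k):
    """UID of the parent of node uid in a k-ary complete tree whose root has UID 1."""
    if uid <= 1:
        raise ValueError("the root has no parent")
    return (uid - 2) // k + 1


def ancestors(uid, k):
    """UIDs of all ancestors of uid, nearest first, ending at the root (UID 1)."""
    chain = []
    while uid > 1:
        uid = parent(uid, k)
        chain.append(uid)
    return chain


def level(uid, k):
    """Level of a node, counting the root as level 1."""
    return len(ancestors(uid, k)) + 1


# In a 3-ary tree, node 15 sits under 5, which sits under 2, which sits under the root.
print(ancestors(15, 3))
```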


K-ary table

Each document is assigned a value k, which is the maximum number of siblings in its document tree.

Each collection has a K-ary table, each entry of which records the k of one document.

Each result set has a pointer to the K-ary table.
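As a sketch of how k and the UIDs might be derived for one document: k is taken as the largest number of children under any node, and the j-th child (counting from 0) of a node with UID u gets UID k * (u - 1) + 2 + j, which is the inverse of the parent function. The child formula is derived here rather than quoted from the talk, and the tree shape below is an assumption chosen to be consistent with the UIDs on the UID slide.

```python
# Assumed tree shape: element -> children, in document order.
TREE = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["f", "g"],
    "e": ["h", "i", "j"],
}


def fanout_k(children):
    """k for a document: the maximum number of siblings under any node."""
    return max((len(kids) for kids in children.values()), default=1)


def assign_uids(children, root):
    """Assign each real node its UID in the k-ary complete tree; return (k, uids)."""
    k = fanout_k(children)
    uids = {root: 1}
    stack = [root]
    while stack:
        node = stack.pop()
        for j, child in enumerate(children.get(node, [])):
            # Positions not taken by a real child remain virtual nodes.
            uids[child] = k * (uids[node] - 1) + 2 + j
            stack.append(child)
    return k, uids


k, uids = assign_uids(TREE, "a")
print(k, sorted(uids.items(), key=lambda item: item[1]))
```

A K-ary table for a collection would then simply map each document number to the k computed this way.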


Level and Element Type Number

Level
    The level of an element in the document tree
    It indicates how many times the parent function must be applied to reach a target element

Element type number
    A unique number is assigned to each element type in the DTD (not to the elements in documents)
    It makes it possible to filter out unnecessary elements and accumulate the correct frequencies


Level and Element Type Number (Cont’d)

User query: Retrieve sections that contain “hypertext”




[Figure: document tree with nodes chapter, title, section1, para1, and para2 spread over Level 1 to Level 3, with the user level and the text level marked. Index information shown in the figure: hypertext(9) browser(1) internet(5); multimedia(5); java(7).]

The level difference tells how many times the parent function is applied.

The element type number lets unnecessary index information be filtered out.
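The two roles can be sketched together: the level difference gives the number of parent steps, and the element type number decides whether a posting is relevant at all. Postings are written as (document, UID, level, element type number). The `ANCESTOR_TYPES` table, standing in for what would be derived from the DTD, is an assumption, as are the helper names; the talk does not show how the filtering is implemented.

```python
def parent(uid, k):
    return (uid - 2) // k + 1


def lift(uid, text_level, user_level, k):
    """Apply the parent function (text_level - user_level) times."""
    for _ in range(text_level - user_level):
        uid = parent(uid, k)
    return uid


# Hypothetical DTD-derived table: element type number -> type numbers of its ancestors.
ANCESTOR_TYPES = {7: {6, 3, 1}, 5: {3, 1}, 2: {1}}


def candidates(postings, target_type, user_level, k):
    """Map each relevant posting to the element it belongs to at the user level."""
    found = []
    for doc, uid, lvl, etype in postings:
        if lvl < user_level:
            continue  # posting sits above the user level
        if target_type not in ANCESTOR_TYPES.get(etype, set()):
            continue  # this element type never occurs under the target type
        found.append((doc, lift(uid, lvl, user_level, k)))
    return found


postings = [(5, 32, 4, 7), (5, 35, 4, 7), (5, 8, 3, 5), (5, 2, 2, 2)]
print(candidates(postings, target_type=6, user_level=3, k=3))
```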


Representing a Document Tree

[Figure: document tree for one document. Each node carries a tuple that reads, by position, as document number, UID, level, and element type number. The parent links below follow from parent(i) with k = 3.]

<5,1,1,1>
    <5,2,2,2>
    <5,3,2,3>
        <5,8,3,5>  <5,9,3,5>
    <5,4,2,3>
        <5,11,3,6>
            <5,32,4,7>  <5,33,4,7>
        <5,12,3,6>
            <5,35,4,7>  <5,36,4,7>

Index terms attached to the leaf nodes in the figure:
    hypertext(1) model(1) retrieval(1) semantics(1)
    index(3) lexical(1) noun(4) stem(2)
    document(4) index(3) precision(2) term(5)
    browser(2) hypertext(2) java(5) link(6)
    anchor(2) browser(1) html(3) internet(5)
    basian(3) inquiry(2) link(3) matrix(3)


Query Evaluation

Create accumulators at the user level
    Accumulators correspond to the elements at the user level

Compute the TF (Term Frequency) and DF (Document Frequency) of a term
    Sum up all the term frequencies of the descendant elements into the corresponding accumulators
    The number of non-zero accumulators is the DF of the term

Calculate the term weight

Compute the similarity of the elements and rank them
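The four steps can be sketched end to end. Postings here are (document, UID, level, frequency) and `k_table` plays the role of the K-ary table. The talk does not give the weighting formula, so the tf * log(1 + N / df) weight below is an assumption, as are the function names and the sample frequencies.

```python
import math
from collections import defaultdict


def parent(uid, k):
    return (uid - 2) // k + 1


def accumulate(postings, user_level, k_table):
    """Sum leaf term frequencies into accumulators keyed by (doc, UID at user level)."""
    acc = defaultdict(int)
    for doc, uid, level, freq in postings:
        if level < user_level:
            continue
        for _ in range(level - user_level):
            uid = parent(uid, k_table[doc])
        acc[(doc, uid)] += freq
    return dict(acc)


def rank(index, query_terms, user_level, k_table, n_elements):
    """Score user-level elements by summed term weights; best first."""
    scores = defaultdict(float)
    for term in query_terms:
        acc = accumulate(index.get(term, []), user_level, k_table)
        df = len(acc)  # number of non-zero accumulators
        if df == 0:
            continue
        idf = math.log(1 + n_elements / df)  # assumed weighting, not from the talk
        for element, tf in acc.items():
            scores[element] += tf * idf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented frequencies for 'browser' at four leaves of document 5 (k = 3).
index = {"browser": [(5, 32, 4, 2), (5, 33, 4, 1), (5, 35, 4, 1)]}
print(accumulate(index["browser"], user_level=3, k_table={5: 3}))
print(rank(index, ["browser"], user_level=3, k_table={5: 3}, n_elements=2))
```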


Accumulating Term Frequency

[Figure: subtree of the tree from "Representing a Document Tree", with section nodes <5,11,3,6> and <5,12,3,6> and leaf nodes <5,32,4,7>, <5,33,4,7>, <5,35,4,7>, <5,36,4,7>. Query: find sections containing ‘browser’. Index terms shown: browser(4) index(3) precision(1) term(5). The labels <5,1>, <5,11>, and <5,12> mark the elements above the leaves.]


Performance Data (in Ultra Sparc 2)

Index Overhead

Collection      Data size (MB)   Posting file (MB)   Index overhead (%)   Index time (hh/mm)
PATENT (SGML)   256              120                 46.87                1/30
SHAKE (XML)     7                2.8                 40.00                < /02
CLIN (XML)      3                1.36                45.33                < /01

Retrieval time
    Almost all single-term queries are evaluated within one second
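The index overhead column is consistent with posting file size divided by data size, which the following short check reproduces. The function name is an assumption; the figures are the ones from the table.

```python
def index_overhead_percent(data_mb, posting_mb):
    """Posting file size as a percentage of the collection size."""
    return 100.0 * posting_mb / data_mb


collections = [("PATENT", 256, 120), ("SHAKE", 7, 2.8), ("CLIN", 3, 1.36)]
for name, data_mb, posting_mb in collections:
    print(f"{name}: {index_overhead_percent(data_mb, posting_mb):.3f}%")
```

PATENT works out to exactly 46.875, which the table shows truncated as 46.87.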


Advantage of XRS

Provides a variety of structural search functions.

Less indexing overhead and quick retrieval time

Easy to port
    Java + native C code
    The C code is built as shared libraries


Alternative Architecture of XRS

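[Figure: alternative XRS architecture. Client side (Web browser): User Interface (applet) and Rendering Component. Server side: Query Mediator Servlet, which reaches the Search Engine, built as a shared library, through a JNI interface. Labeled flows: Initiate Applet, query, search result, XML result with XSL, HTML format.]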


Benefit and Problem

Benefit
    Simpler and easier to port than the current implementation
    Does not need an independent Java process

Problem
    Current Java servlet engines cannot run the shared libraries
    Apache JServ, JRun, and Jigsaw fail to run it!


Current Status

Development of the content retrieval part is finished
    Available on the Web at the end of August 1999
    http://dlb2.nlm.nih.gov/~dwshin

The structural retrieval part is in development and will be finished soon.