XML Developer’s conf. August 19 1999
Making XML Documents Searchable through the Web
Dongwook Shin
Lister Hill National Center for Biomedical Informatics
National Library of Medicine
2 Dongwook Shin, National Library of Medicine
Importance of XML Search Engine
More and more documents are beginning to be provided in XML formats.
XML documents are supposed to have certain structures.
Current Web search engines do not provide structural search capability.
Searching Characteristics
Content Searching
  Searching for certain words in the element hierarchy
  Retrieve CHAPTER whose TITLE contains “servlet” and PARAGRAPH contains “session”.
Structural Searching
  Searching for elements satisfying certain relations
  Retrieve SECTION that has at least two FIGUREs
Searching Characteristics (Cont’d)
Combined Searching
  Content + Structural Searching
  Retrieve SECTION that has TITLE containing “XML” and contains at least one FIGURE.
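The three kinds of search can be illustrated with a minimal Python sketch over a tiny in-memory document. This is not how XRS evaluates queries (it uses an index, not a tree walk); the sample document, the helper names `contains` and `count`, and the reading of "TITLE contains" as "any descendant TITLE contains" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; element names follow the examples above.
DOC = """
<BOOK>
  <CHAPTER>
    <TITLE>Java servlet basics</TITLE>
    <SECTION>
      <TITLE>XML in servlets</TITLE>
      <PARAGRAPH>Each session is tracked by the server.</PARAGRAPH>
      <FIGURE/>
      <FIGURE/>
    </SECTION>
    <SECTION>
      <TITLE>Deployment</TITLE>
      <PARAGRAPH>No figures here.</PARAGRAPH>
    </SECTION>
  </CHAPTER>
</BOOK>
"""

def contains(elem, tag, word):
    """True if some descendant <tag> of elem has text containing word."""
    return any(word.lower() in "".join(d.itertext()).lower()
               for d in elem.iter(tag) if d is not elem)

def count(elem, tag):
    """Number of descendant <tag> elements below elem."""
    return sum(1 for d in elem.iter(tag) if d is not elem)

root = ET.fromstring(DOC)

# Content search: CHAPTER whose TITLE contains "servlet" and PARAGRAPH contains "session".
content_hits = [c for c in root.iter("CHAPTER")
                if contains(c, "TITLE", "servlet") and contains(c, "PARAGRAPH", "session")]

# Structural search: SECTION that has at least two FIGUREs.
structural_hits = [s for s in root.iter("SECTION") if count(s, "FIGURE") >= 2]

# Combined search: SECTION whose TITLE contains "XML" and that has at least one FIGURE.
combined_hits = [s for s in root.iter("SECTION")
                 if contains(s, "TITLE", "XML") and count(s, "FIGURE") >= 1]
```

Each list holds the matching target elements; a real engine would also rank them.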
Other XML Search Engines on the Web
Most engines provide search in a fixed set of fields.
Users cannot search in arbitrary elements of the document hierarchy.
  http://www.goxml.com
  http://www.scoobs.com
  http://www.xmlTree.com
XRS (XML Retrieval System)
Providing a variety of structural search functions
  Users can search in any element in the document hierarchy
  Content + Structural Searching
Allowing less index overhead and quick retrieval time
  BUS (Bottom Up Scheme) is used
Applicable to valid documents, but not to documents that are only well-formed
  Using the DTD when making queries and retrieving
  Examples are Shakespeare or Bible data
Architecture of XRS
[Figure: XRS architecture. Client side (Web browser): User Interface (applet) and Rendering Component. Server side: Query Mediator Servlet (servlet) and Search Engine (Java process), linked by socket communication. Labeled flows: initiate applet, query, search result, XML result with XSL, HTML format.]
User Interface (Initialization)
DTD can be browsed here
Search results are shown here with similarity value
Query conditions appear here
Query Composition
Principle
  Any element can be a target - the element to be retrieved
  Search conditions can be imposed on any elements
EXAMPLE
  Retrieve SPEECH whose SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
  Target: SPEECH
  Search conditions: SPEAKER contains ‘Hamlet’; LINE contains ‘Denmark’
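One way to picture a composed query is as a target element type plus a list of (element, word) conditions. The sketch below is an assumption about a convenient representation, not the XRS applet's internal format; the `Query` class and the sample speeches (whose lines are made up, not quoted from the play) are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Query:
    target: str                                      # element type to retrieve
    conditions: list = field(default_factory=list)   # (element type, word) pairs

    def matches(self, elem):
        """True if every condition holds somewhere inside elem."""
        return all(
            any(word.lower() in "".join(d.itertext()).lower() for d in elem.iter(tag))
            for tag, word in self.conditions)

    def run(self, root):
        """Return all target elements under root that satisfy the conditions."""
        return [e for e in root.iter(self.target) if self.matches(e)]

# Made-up sample data in the style of the Shakespeare collection.
PLAY = ET.fromstring("""
<SCENE>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>A made-up line that mentions Denmark.</LINE></SPEECH>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>A made-up line about something else.</LINE></SPEECH>
  <SPEECH><SPEAKER>HORATIO</SPEAKER><LINE>Another made-up line about Denmark.</LINE></SPEECH>
</SCENE>
""")

query = Query("SPEECH", [("SPEAKER", "Hamlet"), ("LINE", "Denmark")])
hits = query.run(PLAY)
```

Changing `target` to another element type (for example SCENE) reuses the same conditions against a different level of the hierarchy.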
User Interface (Query Composition)
User Interface (with Search Results)
Browser Side
Show XML results
Browsing a List of Elements
XML Result
Query at Another Target Element
Retrieve SCENE whose TITLE contains ‘Castle’ and SPEAKER contains ‘Horatio’
XML Result
Query Mediator Servlet
Mediate the query and results
  Convey the user query to the backend search engine
  Transmit the retrieved results to the applet or the rendering component
  Send the result sets with brief information to the applet
  Send the XML content with a proper XSL to the rendering component so that it can transform the content into HTML format
Session tracking and result set reclamation
  Keep session tracking so that a user can use his/her session continuously until he/she quits.
  Detect dead sessions periodically and reclaim the corresponding result sets.
Query Language
INIT
  Get the DBs and their DTDs available in the server
  It is sent to the server when the applet is initialized
SEARCH db_name search_cond
  db_name is one of the DBs available in the server
  search_cond includes the target and search conditions
PRES num
  Get the XML results
  num selects the num-th result in the result set
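A minimal sketch of how a server might split these three commands into a verb and its arguments. The wire format is an assumption (one whitespace-separated command per line); the slides do not give the syntax of `search_cond`, so it is kept as an opaque string, and `parse_command` is a hypothetical helper.

```python
def parse_command(line):
    """Split one request line into (verb, arguments).

    Assumed format: 'INIT', 'SEARCH db_name search_cond', or 'PRES num'.
    """
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    verb = parts[0].upper()
    if verb == "INIT" and len(parts) == 1:
        return ("INIT", {})
    if verb == "SEARCH" and len(parts) == 3:
        # search_cond carries the target and the search conditions, unparsed here.
        return ("SEARCH", {"db_name": parts[1], "search_cond": parts[2]})
    if verb == "PRES" and len(parts) == 2 and parts[1].isdigit():
        return ("PRES", {"num": int(parts[1])})
    raise ValueError(f"malformed command: {line!r}")
```

For example, `parse_command("PRES 3")` yields the verb `PRES` with `num` set to 3, and anything that does not fit one of the three shapes raises `ValueError`.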
Result Set
A result set is assigned to each session
  Query Mediator does session tracking
Backend search engine keeps multiple result sets
  Multi-thread safe code is required
When a session is relinquished, the result set is reclaimed
  Garbage collection for the result set is required
The Content of a Result Set
DB_name
  The name of the database where the search is performed and the result is obtained
DB_path
  The directory path from the root where the DB resides
ptr_to_result_set
  Pointer to the dynamic arrays holding the search results
num_result
  Number of elements retrieved
ptr_to_K_ary_table
  Pointer to the table that keeps the k-ary information for the DB
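The fields above map naturally onto a small record type. The following Python sketch is only an analogy for the C structure the slides describe: object references stand in for the two pointers, `num_result` is derived instead of stored, and the sample path is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ResultSet:
    db_name: str                                      # database that was searched
    db_path: str                                      # directory where the DB resides
    results: list = field(default_factory=list)      # stands in for ptr_to_result_set
    k_ary_table: dict = field(default_factory=dict)  # stands in for ptr_to_K_ary_table

    @property
    def num_result(self):
        """Number of elements retrieved."""
        return len(self.results)

# The k-ary table belongs to the DB, so result sets share it by reference.
shared_table = {5: 3}   # document number -> k (sample values)
rs = ResultSet("SHAKE", "/hypothetical/path/shake",
               results=[(5, 11), (5, 12)], k_ary_table=shared_table)
```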
RS (Result Set) Management
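[Figure: RS management between the Query Mediator (with its Session Monitor) and the backend search engine (with its Result Set Index Table, whose slots are marked Used or Unused). Labeled steps: session comes; RS index requested; RS index returned (i); i-th RS accessed; i-th RS returned with the actual result.]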
Periodical RS Reclamation
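[Figure: periodical RS reclamation between the Query Mediator (with its Session Monitor) and the backend search engine (with its Result Set Index Table, whose slots are marked Used, Unused, or Used but to be reclaimed; slot i stays in use, slot j is to be reclaimed). Labeled steps: session ends; alive sessions sent; RS indices to be reclaimed returned (j, ...); reclamation requested (j, ...); j-th RS reclaimed; reclamation done.]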
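The allocation and reclamation of result-set slots can be sketched as a small thread-safe table. This is a simplified illustration, not the XRS backend (which is C code): the class and method names are hypothetical, and the two-step exchange in the figure (indices returned, then reclamation requested) is collapsed into one `reclaim_dead` call.

```python
import threading

class ResultSetTable:
    """Fixed-size table of result-set slots shared by all sessions."""

    def __init__(self, size):
        self._slots = [None] * size      # None marks an unused slot
        self._lock = threading.Lock()    # guards the table across threads

    def allocate(self, result_set):
        """Store result_set in the first unused slot and return its index."""
        with self._lock:
            for i, slot in enumerate(self._slots):
                if slot is None:
                    self._slots[i] = result_set
                    return i
        raise RuntimeError("no unused result-set slot")

    def get(self, i):
        """Return the i-th result set (None if the slot is unused)."""
        with self._lock:
            return self._slots[i]

    def reclaim_dead(self, alive):
        """Free every used slot whose index is not in alive; return the freed indices."""
        with self._lock:
            dead = [i for i, s in enumerate(self._slots)
                    if s is not None and i not in alive]
            for i in dead:
                self._slots[i] = None
            return dead

table = ResultSetTable(size=2)
i = table.allocate(["result of session A"])
j = table.allocate(["result of session B"])
freed = table.reclaim_dead(alive={i})    # session B has ended
```

After reclamation, slot `j` is unused again and the next session can be given the same index.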
Backend Search Engine
Less Indexing Overhead and Quick Retrieval
  Use BUS (Bottom Up Scheme)
  Most of the code is written in native C
Support Multi-threading
  Multi-thread safe C code
  Compile the C code into a shared library
Save index information in files
BUS (Bottom Up Scheme)
Main Idea
  Index only at the lowest level of the document structure
  Weight information at higher levels is computed at retrieval time
Benefits
  Minimize the indexing overhead
  Support term weights and full-blown structural search
  Guarantee quick retrieval time
Principle of BUS
[Figure: a document tree (a chapter with section1 and section2; one section contains para1 and para2) shown twice.
Document tree with index terms: the three leaf elements carry the terms
  hypertext, browser
  hypertext, internet, multimedia
  hypertext, internet, java
Bottom Up Scheme: the leaf postings carry term frequencies
  hypertext(2) browser(4)
  hypertext(3) internet(3) multimedia(5)
  hypertext(5) internet(2) java(7)
and the frequencies of the higher elements are summed from them:
  section with two paragraphs: hypertext(8) internet(5) multimedia(5) java(7)
  chapter: hypertext(10) browser(4) internet(5) multimedia(5) java(7)]
Indexing is performed at leaf nodes only
Term frequency is computed at run time.
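The idea can be sketched with the frequencies from this slide: postings exist only for leaf elements, and the frequencies of any higher element are summed on demand. The tree shape below is an assumption (section1 holding the two paragraphs and section2 being the leaf section; the figure residue does not say which is which), and the top-down recursion is for illustration only, since XRS climbs from the leaves using UIDs.

```python
from collections import Counter

# Assumed tree shape; only the sums depend on it, not the totals for the chapter.
CHILDREN = {"chapter": ["section1", "section2"], "section1": ["para1", "para2"]}

# Leaf-level index: the only postings that are stored.
LEAF_INDEX = {
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
    "section2": Counter(hypertext=2, browser=4),
}

def term_frequencies(node):
    """Sum the leaf postings below node; nothing is stored for inner nodes."""
    if node in LEAF_INDEX:
        return Counter(LEAF_INDEX[node])
    total = Counter()
    for child in CHILDREN.get(node, []):
        total += term_frequencies(child)
    return total

section_tf = term_frequencies("section1")
chapter_tf = term_frequencies("chapter")
```

The computed totals agree with the figure: hypertext(8) for the section with two paragraphs and hypertext(10) for the chapter.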
Key Issues in BUS
How to figure out the ancestor elements of a leaf element efficiently?
How to accumulate the term frequency effectively?
UID (Unique element IDentifier)
Represent each document as a k-ary complete tree and assign a UID to each node
[Figure: a 3-ary complete tree with real nodes a to j and virtual nodes in the unused positions]
Result of assigning UIDs
  element  a  b  c  d  e  f  g  h   i   j
  UID      1  2  3  4  5  8  9  14  15  16
$\mathrm{parent}(i) = \lfloor (i-2)/k + 1 \rfloor$
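The parent function needs only integer arithmetic, so the ancestors of any node can be found without storing the tree. A minimal sketch, using the UIDs from this slide with k = 3; the function names are illustrative, and the square brackets in the formula are read as the floor.

```python
def parent(i, k):
    """UID of the parent of node i in a k-ary complete tree (the root has UID 1)."""
    if i <= 1:
        raise ValueError("the root has no parent")
    return (i - 2) // k + 1

def ancestors(i, k):
    """UIDs from the parent of i up to the root."""
    chain = []
    while i > 1:
        i = parent(i, k)
        chain.append(i)
    return chain

# UIDs assigned on the slide (3-ary tree).
UIDS = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5,
        "f": 8, "g": 9, "h": 14, "i": 15, "j": 16}

path_from_j = ancestors(UIDS["j"], 3)   # j -> e -> b -> a
path_from_g = ancestors(UIDS["g"], 3)   # g -> c -> a
```

Because virtual nodes also consume UIDs, the numbering has gaps (6, 7, 10 to 13), which is what keeps the formula valid for every real node.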
K-ary table
Each document is assigned k, which is the maximum number of siblings in the document tree.
Each collection has a K-ary table, each entry of which records the k of one document.
Each result set has a pointer to the K-ary table.
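A sketch of how k could be derived for one document and recorded in a K-ary table, followed by the UID assignment that the chosen k implies. The tree shape is an assumption inferred from the UIDs in the UID slide (a has children b, c, d; b has e; c has f, g; e has h, i, j), and the function names and the document number 5 are illustrative.

```python
def max_fanout(tree, node):
    """k for a document: the largest number of children under any element."""
    kids = tree.get(node, [])
    return max([len(kids)] + [max_fanout(tree, c) for c in kids])

def assign_uids(tree, root, k):
    """Root gets UID 1; the j-th child (j = 1..k) of node p gets k*(p-1)+1+j."""
    uids = {root: 1}
    stack = [root]
    while stack:
        node = stack.pop()
        for j, child in enumerate(tree.get(node, []), start=1):
            uids[child] = k * (uids[node] - 1) + 1 + j
            stack.append(child)
    return uids

# Assumed document tree (children lists in document order).
TREE = {"a": ["b", "c", "d"], "b": ["e"], "c": ["f", "g"], "e": ["h", "i", "j"]}

k = max_fanout(TREE, "a")
K_ARY_TABLE = {5: k}                 # document number -> k
uids = assign_uids(TREE, "a", k)
```

With k = 3 this reproduces the UIDs listed for elements a to j.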
Level and Element Type Number
Level
  The level of an element in the document tree
  It tells how many times the parent function must be applied to reach a target element
Element type number
  A unique number is assigned to each element type in the DTD (not to the elements in documents)
  It makes it possible to filter out unnecessary elements and accumulate the correct frequencies
Level and Element Type Number (Cont’d)
User query: Retrieve sections that contain “hypertext”
[Figure: index information for a chapter (Level 1) with a title and section1 (Level 2), where section1 contains para1 and para2 (Level 3).
Leaf postings:
  title: hypertext(1) browser(1)
  paragraphs: hypertext(3) internet(3) multimedia(5) and hypertext(5) internet(2) java(7)
Frequencies accumulated at retrieval time:
  section1: hypertext(8) internet(5) multimedia(5) java(7)
  chapter: hypertext(9) browser(1) internet(5) multimedia(5) java(7)
The user level is the level of the queried sections; the text level is the level of the indexed text.]
Level difference tells how many times the parent function is applied.
Element type number lets unnecessary index information be filtered out.
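The two fields can be combined in a small lifting step: the level difference gives the number of parent applications, and the element type number decides whether a posting counts at all. In the sketch below the tuple layout (document number, UID, level, element type number) is an assumption based on the tuples in "Representing a Document Tree", and so is the filtering rule (a set of leaf types that the DTD allows under the target type); the slides do not spell out either.

```python
def parent(uid, k):
    """Parent UID in a k-ary complete tree."""
    return (uid - 2) // k + 1

def lift(posting, user_level, allowed_types, k):
    """Map a leaf posting (doc, uid, level, etype) to its user-level ancestor.

    Returns (doc, ancestor_uid), or None when the posting is filtered out.
    """
    doc, uid, level, etype = posting
    if etype not in allowed_types or level < user_level:
        return None
    for _ in range(level - user_level):   # level difference = number of parent() calls
        uid = parent(uid, k)
    return (doc, uid)

# Sample postings with k = 3; type 7 leaves are assumed to sit under the target type.
lifted = [lift(p, user_level=3, allowed_types={7}, k=3)
          for p in [(5, 32, 4, 7), (5, 36, 4, 7), (5, 8, 3, 5), (5, 2, 2, 2)]]
```

The first two postings climb one level to UIDs 11 and 12; the other two are dropped because their element types are not in the allowed set.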
Representing a Document Tree
[Figure: a document tree whose nodes carry tuples that appear to be <document number, UID, level, element type number>; the UIDs are consistent with a 3-ary tree.
  Level 1: <5,1,1,1>
  Level 2: <5,2,2,2> <5,3,2,3> <5,4,2,3>
  Level 3: <5,8,3,5> <5,9,3,5> <5,11,3,6> <5,12,3,6>
  Level 4: <5,32,4,7> <5,33,4,7> <5,35,4,7> <5,36,4,7>
Postings attached to the leaf elements, in figure order:
  hypertext(1) model(1) retrieval(1) semantics(1)
  index(3) lexical(1) noun(4) stem(2)
  document(4) index(3) precision(2) term(5)
  document(4) index(3) precision(1) term(5)
  browser(2) hypertext(2) java(5) link(6)
  anchor(2) browser(1) html(3) internet(5)
  basian(3) inquiry(2) link(3) matrix(3)]
Query Evaluation
Create accumulators at user level
  Accumulators correspond to the elements at the user level
Compute the TF (Term Frequency) and DF (Document Frequency) of a term
  Sum all the term frequencies of the descendant elements into the corresponding accumulators.
  The number of non-zero accumulators is the DF of the term.
Calculate the term weight
Compute the similarity of the elements and rank
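These steps can be sketched end to end for a single-term query, using the ‘browser’ frequencies from the "Accumulating Term Frequency" example. Several things here are assumptions: the posting layout (document number, UID, level, element type number, term frequency), the pairing of postings with leaves in figure order, k = 3, and above all the weighting formula, since the slides do not say how the term weight is computed (a tf times log(1 + N/df) weight is used as a stand-in).

```python
import math
from collections import defaultdict

K = 3             # k of document 5, inferred from its UIDs
USER_LEVEL = 3    # level of the section elements <5,11,3,6> and <5,12,3,6>

# Leaf postings for 'browser': (doc, uid, level, etype, tf).
BROWSER_POSTINGS = [(5, 32, 4, 7, 4), (5, 33, 4, 7, 2), (5, 35, 4, 7, 1)]

def accumulate(postings, user_level, k):
    """Sum leaf term frequencies into accumulators keyed by user-level element."""
    acc = defaultdict(int)
    for doc, uid, level, _etype, tf in postings:
        for _ in range(level - user_level):   # climb to the user level
            uid = (uid - 2) // k + 1
        acc[(doc, uid)] += tf
    return dict(acc)

def rank(acc, n_elements):
    """Weight each non-zero accumulator and sort by descending weight."""
    df = sum(1 for tf in acc.values() if tf > 0)   # DF = non-zero accumulators
    if df == 0:
        return []
    idf = math.log(1 + n_elements / df)            # assumed weighting, not from the slides
    scored = [(elem, tf * idf) for elem, tf in acc.items() if tf > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

acc = accumulate(BROWSER_POSTINGS, USER_LEVEL, K)
ranking = rank(acc, n_elements=2)
```

Accumulators are created lazily here, so every accumulator that exists is non-zero; a version that pre-creates one per user-level element would count the non-zero ones to get the DF, as the slide describes.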
Accumulating Term Frequency
Query: find sections containing ‘browser’
[Figure: a subtree of the tree in “Representing a Document Tree”, with the section elements <5,11,3,6> and <5,12,3,6> and the leaves <5,32,4,7>, <5,33,4,7>, <5,35,4,7> and <5,36,4,7>.
Leaf postings, in figure order:
  browser(4) index(3) precision(1) term(5)
  browser(2) hypertext(2) java(5) link(6)
  anchor(2) browser(1) html(3) internet(5)
  basian(3) inquiry(2) link(3) matrix(3)
The accumulators for <5,11> and <5,12> end at 6 and 1.]
Performance Data (on an Ultra Sparc 2)
Retrieval time
  Almost all single-term queries are evaluated within one second
Index Overhead
Collection      Data size (MB)  Posting file (MB)  Index overhead (%)  Index time (hh/mm)
PATENT (SGML)   256             120                46.87               1/30
SHAKE (XML)     7               2.8                40.00               < / 02
CLIN (XML)      3               1.36               45.33               < / 01
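The index overhead column is the posting file size as a percentage of the data size. A short check of that arithmetic on the three rows (the function name is illustrative):

```python
# (collection, data size, posting file size), both sizes in the table's unit.
ROWS = [("PATENT", 256, 120), ("SHAKE", 7, 2.8), ("CLIN", 3, 1.36)]

def index_overhead(data_size, posting_size):
    """Posting file size as a percentage of the data size."""
    return 100.0 * posting_size / data_size

overheads = {name: index_overhead(d, p) for name, d, p in ROWS}
```

The PATENT row works out to 46.875, so the table's 46.87 appears to be truncated rather than rounded.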
Advantage of XRS
Provides a variety of structural search functions.
Less indexing overhead and quick retrieval time
Easy to port
  Java + native C code
  C code is built as shared libraries
Alternative Architecture of XRS
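[Figure: alternative XRS architecture. Client side (Web browser): User Interface (applet) and Rendering Component. Server side: Query Mediator Servlet (servlet) calling the Search Engine, built as a shared library, through the JNI interface. Labeled flows: initiate applet, query, search result, XML result with XSL, HTML format.]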
Benefit and Problem
Benefit
  Simpler and easier to port than the current implementation
  Does not need an independent Java process
Problem
  Current Java servlet engines cannot run the shared libraries
  Apache JServ, JRun and Jigsaw fail to run it!
Current Status
Finish the development of the content retrieval part
  Available on the Web at the end of August 1999.
  http://dlb2.nlm.nih.gov/~dwshin
The structural retrieval part is in development and will be finished soon.