XML Developer’s Conf., August 19, 1999

Making XML Documents Searchable through the Web

Dongwook Shin

dwshin@nlm.nih.gov

Lister Hill National Center for Biomedical Informatics

National Library of Medicine

Importance of XML Search Engine

- More and more documents are being provided in XML format.
- XML documents are expected to have a defined structure.
- Current Web search engines do not provide structural search capability.

Searching Characteristics

- Content searching: searching for certain words in the element hierarchy
  - Retrieve CHAPTER whose TITLE contains “servlet” and PARAGRAPH contains “session”.
- Structural searching: searching for elements satisfying certain relations
  - Retrieve SECTION that has at least two FIGUREs.

Searching Characteristics (Cont’d)

- Combined searching: content + structural searching
  - Retrieve SECTION that has a TITLE containing “XML” and contains at least one FIGURE.
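The three kinds of query can be illustrated with a brute-force scan over a small document. This is only a sketch using Python's standard `xml.etree.ElementTree`; the sample document and the helper `contains` are made up for illustration, and an indexed engine such as XRS would not scan documents this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
DOC = """
<BOOK>
  <CHAPTER>
    <TITLE>Servlet basics</TITLE>
    <SECTION>
      <TITLE>XML output</TITLE>
      <PARAGRAPH>Each session is tracked by the servlet.</PARAGRAPH>
      <FIGURE/>
      <FIGURE/>
    </SECTION>
    <SECTION>
      <TITLE>Plain text output</TITLE>
      <PARAGRAPH>No figures here.</PARAGRAPH>
    </SECTION>
  </CHAPTER>
</BOOK>
""".strip()


def contains(element, tag, word):
    """True if any <tag> element under `element` has text containing `word`."""
    return any(
        word.lower() in "".join(d.itertext()).lower() for d in element.iter(tag)
    )


root = ET.fromstring(DOC)

# Content search: CHAPTER whose TITLE contains "servlet" and PARAGRAPH contains "session".
content_hits = [
    c for c in root.iter("CHAPTER")
    if contains(c, "TITLE", "servlet") and contains(c, "PARAGRAPH", "session")
]

# Structural search: SECTION that has at least two FIGUREs.
structural_hits = [s for s in root.iter("SECTION") if len(list(s.iter("FIGURE"))) >= 2]

# Combined search: SECTION whose TITLE contains "XML" and that has at least one FIGURE.
combined_hits = [
    s for s in root.iter("SECTION")
    if contains(s, "TITLE", "XML") and s.find(".//FIGURE") is not None
]

print(len(content_hits), len(structural_hits), len(combined_hits))
```

In each query the element being collected (CHAPTER or SECTION) is the target, and the conditions are imposed on elements below it.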

Other XML Search Engines on the Web

- Most engines provide search in a fixed set of fields.
- Users cannot search in arbitrary elements of the document hierarchy.
  - http://www.goxml.com
  - http://www.scoobs.com
  - http://www.xmlTree.com

XRS (XML Retrieval System)

- Provides a variety of structural search functions
  - Users can search in any element of the document hierarchy
  - Content + structural searching
- Low index overhead and quick retrieval time
  - BUS (Bottom Up Scheme) is used
- Applicable to valid documents, but not to documents that are only well-formed
  - The DTD is used when composing queries and retrieving
  - Example collections are the Shakespeare and Bible data

Architecture of XRS

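[Figure: XRS architecture. Regions: Client Side (Web browser) and Server Side. Components: User Interface, Rendering Component, Query Mediator Servlet (servlet), Search Engine (Java process). The servlet and the search engine are linked by socket communication. Arrow labels: Initiate Applet, query, Search result, XML result with XSL, HTML format.]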

User Interface (Initialization)

[Screenshot with callouts:]
- The DTD can be browsed here.
- Search results are shown here with their similarity values.
- Query conditions appear here.

Query Composition

- Principle
  - Any element can be a target, that is, the element to be retrieved
  - Search conditions can be imposed on any element
- Example
  - Retrieve SPEECH whose SPEAKER contains ‘Hamlet’ and LINE contains ‘Denmark’
  - Target: SPEECH
  - Search conditions: SPEAKER contains ‘Hamlet’; LINE contains ‘Denmark’

User Interface (Query Composition)

User Interface (with Search Results)

Browser Side

Show XML results

Browsing a List of Elements

XML Result

Query at Another Target Element

Retrieve SCENE whose TITLE contains ‘Castle’ and SPEAKER contains ‘Horatio’

XML Result

Query Mediator Servlet

- Mediates the query and the results
  - Conveys the user query to the backend search engine
  - Transmits the retrieved results to the applet or the rendering component
    - Sends the result sets with brief information to the applet
    - Sends the XML content with a proper XSL to the rendering component so that it can be transformed into HTML
- Session tracking and result set reclamation
  - Tracks sessions so that a user can keep using his/her session until he/she quits
  - Detects dead sessions periodically and reclaims the corresponding result sets

Query Language

- INIT
  - Gets the DBs and their DTDs available on the server
  - Sent to the server when the applet is initialized
- SEARCH db_name search_cond
  - db_name is one of the DBs available on the server
  - search_cond includes the target and the search conditions
- PRES num
  - Gets the XML result
  - num selects the n-th result in the result set
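As a rough illustration of how a server might dispatch these three commands, the sketch below parses one command line into a small record. The slides do not give the wire format or the syntax of search_cond, so the whitespace-separated layout, the `Command` class, `parse_command`, and the sample condition string are all assumptions; search_cond is kept as an opaque string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    name: str                          # "INIT", "SEARCH" or "PRES"
    db_name: Optional[str] = None      # SEARCH only
    search_cond: Optional[str] = None  # SEARCH only; kept opaque here
    num: Optional[int] = None          # PRES only


def parse_command(line: str) -> Command:
    """Parse one command line (assumed to be whitespace-separated)."""
    parts = line.strip().split(None, 2)
    if not parts:
        raise ValueError("empty command")
    name = parts[0].upper()
    if name == "INIT" and len(parts) == 1:
        return Command("INIT")
    if name == "SEARCH" and len(parts) == 3:
        return Command("SEARCH", db_name=parts[1], search_cond=parts[2])
    if name == "PRES" and len(parts) == 2:
        return Command("PRES", num=int(parts[1]))
    raise ValueError(f"malformed command: {line!r}")


# The condition text below is a made-up placeholder, not XRS syntax.
print(parse_command("SEARCH shake target=SPEECH SPEAKER=Hamlet"))
print(parse_command("PRES 3"))
```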

Result Set

- A result set is assigned to each session
  - The Query Mediator does the session tracking
- The backend search engine keeps multiple result sets
  - Thread-safe code is required
- When a session is relinquished, its result set is reclaimed
  - Garbage collection for result sets is required

The Content of a Result Set

- DB_name: the name of the database where the search is performed and the result is obtained
- DB_path: the directory path from the root where the DB resides
- ptr_to_result_set: pointer to the dynamic arrays holding the search results
- num_result: number of elements retrieved
- ptr_to_K_ary_table: pointer to the table that keeps the k-ary information for the DB
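The fields above map naturally onto a small record type. The following is a sketch only: the backend is written in C and uses pointers, whereas this Python version uses plain references, and the field types, the sample path, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ResultSet:
    db_name: str   # database in which the search was performed
    db_path: str   # directory where the DB resides
    # Stands in for ptr_to_result_set: here, a list of (doc id, UID) pairs.
    results: List[Tuple[int, int]] = field(default_factory=list)
    # Stands in for ptr_to_K_ary_table: document id -> k.
    k_ary_table: Dict[int, int] = field(default_factory=dict)

    @property
    def num_result(self) -> int:
        """Number of elements retrieved, derived from the result list."""
        return len(self.results)


# Hypothetical values for illustration.
rs = ResultSet("SHAKE", "/data/shake", results=[(5, 11), (5, 12)], k_ary_table={5: 3})
print(rs.num_result)
```

Deriving num_result from the list, rather than storing it separately as the C structure does, is a choice made for this sketch so the count cannot drift from the data.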

RS (Result Set) Management

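[Figure: result set (RS) management. Components: Query Mediator, Session Monitor, Result Set Index Table (entries marked Used or Unused), Backend search engine (holding the i-th RS with the actual result). Flow labels: Session comes; RS Index requested; RS Index returned (i); i-th RS accessed; i-th RS returned.]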

Periodical RS Reclamation

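[Figure: periodic result set (RS) reclamation. Components: Query Mediator, Session Monitor, Result Set Index Table (entries marked Used, Unused, or Used but to be reclaimed; entries i and j are highlighted), Backend search engine (holding the i-th RS with the actual result and the j-th RS, which is reclaimed). Flow labels: Session ends; alive sessions sent; RS Indices to be reclaimed returned (j, ...); Reclamation requested (j, ...); Reclamation done.]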
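The index table and its periodic reclamation can be sketched as a fixed-size table of slots guarded by a lock. This is a minimal illustration under stated assumptions, not the XRS code: the class name `ResultSetIndexTable`, its methods, and the use of string session ids are all invented here.

```python
import threading


class ResultSetIndexTable:
    """Maps sessions to result set slots; unused slots have owner None."""

    def __init__(self, size):
        self._lock = threading.Lock()
        self._owner = [None] * size    # slot i -> session id, or None if unused
        self._results = [None] * size  # slot i -> stored result set

    def acquire(self, session_id):
        """Return the slot of the session, allocating an unused slot if needed."""
        with self._lock:
            if session_id in self._owner:
                return self._owner.index(session_id)
            i = self._owner.index(None)  # raises ValueError when the table is full
            self._owner[i] = session_id
            return i

    def store(self, i, results):
        with self._lock:
            self._results[i] = results

    def reclaim(self, alive_sessions):
        """Free every used slot whose session is not alive; return freed indices."""
        alive = set(alive_sessions)
        freed = []
        with self._lock:
            for i, owner in enumerate(self._owner):
                if owner is not None and owner not in alive:
                    self._owner[i] = None
                    self._results[i] = None
                    freed.append(i)
        return freed


table = ResultSetIndexTable(4)
i = table.acquire("session-1")
j = table.acquire("session-2")
table.store(j, ["some result"])
# Periodic pass: only session-1 is still alive, so slot j is reclaimed.
freed = table.reclaim(alive_sessions=["session-1"])
print(i, j, freed)
```

A single lock around the table is the simplest way to keep concurrent servlet threads from handing the same slot to two sessions; a real implementation might use finer-grained locking.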

Backend Search Engine

- Low indexing overhead and quick retrieval
  - Uses BUS (Bottom Up Scheme)
  - Most of the code is written in native C
- Multi-thread support
  - Thread-safe C code
  - The C code is compiled into a shared library
- Index information is saved in files

BUS (Bottom Up Scheme)

- Main idea
  - Index only at the lowest level of the document structure
  - Weight information at higher levels is computed at retrieval time
- Benefits
  - Minimizes the indexing overhead
  - Supports term weighting and full structural search
  - Keeps retrieval time short

Principle of BUS

[Figure: the same document tree (chapter, section1, section2, para1, para2) drawn twice: once as a document tree with index terms stored at every node, and once under the Bottom Up Scheme, where only the leaf nodes carry index terms.]

Index terms at the three leaf nodes, with term frequencies:
- hypertext(2) browser(4)
- hypertext(3) internet(3) multimedia(5)
- hypertext(5) internet(2) java(7)

Frequencies at the higher nodes:
- parent of the last two leaves: hypertext(8) internet(5) multimedia(5) java(7)
- chapter: hypertext(10) browser(4) internet(5) multimedia(5) java(7)

- Indexing is performed at leaf nodes only.
- Term frequencies at higher nodes are computed at run time.
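The bottom-up idea can be checked with a few lines of arithmetic: store term frequencies at the leaves only and sum them on demand for any higher node. This sketch uses the leaf values from the figure; which leaf belongs to which named node is an assumption, since the extracted figure does not make the assignment explicit.

```python
from collections import Counter

# Leaf-level index (term -> frequency). The node names are an assumed assignment.
leaves = {
    "section2": Counter(hypertext=2, browser=4),
    "para1": Counter(hypertext=3, internet=3, multimedia=5),
    "para2": Counter(hypertext=5, internet=2, java=7),
}

# Assumed tree shape: chapter -> section1, section2; section1 -> para1, para2.
children = {
    "chapter": ["section1", "section2"],
    "section1": ["para1", "para2"],
}


def term_frequencies(node):
    """Term frequencies of a node, computed at query time from the leaves below it."""
    if node in leaves:
        return leaves[node]
    total = Counter()
    for child in children[node]:
        total += term_frequencies(child)
    return total


print(term_frequencies("section1"))
print(term_frequencies("chapter"))
```

Nothing is stored for section1 or chapter, which is where the saving in index size comes from; the cost is the summation at query time.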

Key Issues in BUS

- How can the ancestor elements of a leaf element be found efficiently?
- How can the term frequencies be accumulated effectively?

UID (Unique element IDentifier)

[Figure: a document drawn as a 3-ary complete tree. Real nodes are labeled a to j; virtual nodes fill the remaining positions.]

Result of assigning UIDs:

element  UID
a        1
b        2
c        3
d        4
e        5
f        8
g        9
h        14
i        15
j        16

parent(i) = ⌊(i - 2)/k + 1⌋, where ⌊ ⌋ rounds down to an integer.

Represent each document as a k-ary complete tree and assign a UID to each node.
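The parent formula makes ancestor lookup pure arithmetic, with no need to consult the document. A minimal sketch follows; the function names are chosen for this example, and `ancestors` is simply the parent formula applied repeatedly.

```python
def parent(uid: int, k: int) -> int:
    """UID of the parent of node `uid` in a k-ary complete tree (root has UID 1)."""
    if uid <= 1:
        raise ValueError("the root (UID 1) has no parent")
    return (uid - 2) // k + 1


def ancestors(uid: int, k: int) -> list:
    """UIDs of all ancestors of `uid`, nearest first, ending at the root."""
    chain = []
    while uid > 1:
        uid = parent(uid, k)
        chain.append(uid)
    return chain


# With k = 3 as in the figure: node h (UID 14) sits under e (5), b (2) and a (1).
print(ancestors(14, 3))  # [5, 2, 1]
print(parent(8, 3))      # 3
```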

K-ary table

- Each document is assigned a value k, the maximum number of sibling elements under any one parent in its document tree.
- Each collection has a K-ary table, each entry of which records k for one document.
- Each result set has a pointer to the K-ary table.
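To make the role of k concrete, the sketch below computes k for a parsed document and numbers its elements as positions in the k-ary complete tree. It is an illustration only: the child-UID expression `k * (uid - 1) + 2 + j` is derived here as the inverse of the parent formula rather than taken from the slides, and the sample document and document id are made up.

```python
import xml.etree.ElementTree as ET


def assign_uids(root):
    """Return (k, {uid: tag}) for the tree under `root`, with the root at UID 1."""
    # k is the largest number of children under any single element (at least 1).
    k = max(1, max(len(el) for el in root.iter()))
    uids = {}

    def visit(el, uid):
        uids[uid] = el.tag
        for j, child in enumerate(el):
            # j-th child (0-based) of node `uid` in the k-ary complete tree.
            visit(child, k * (uid - 1) + 2 + j)

    visit(root, 1)
    return k, uids


doc = ET.fromstring(
    "<chapter><title/><section><para/><para/></section><section/></chapter>"
)
k, uids = assign_uids(doc)
k_ary_table = {5: k}  # hypothetical document id 5 -> k
print(k, sorted(uids.items()))
```

UIDs that are never assigned (for example 5, 6 and 7 here) correspond to virtual nodes.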

Level and Element Type Number

- Level
  - The level of an element in the document tree
  - It tells how many times the parent function must be applied to reach a target element
- Element type number
  - A unique number is assigned to each element type in the DTD (not to the element instances in documents)
  - It makes it possible to filter out unnecessary elements and accumulate the correct frequencies

Level and Element Type Number (Cont’d)

User query: Retrieve sections that contain “hypertext”

[Figure: a document tree with the nodes chapter, title, section1, para1 and para2, drawn across Level 1, Level 2 and Level 3. One level is marked as the user level and one as the text level.]

Index information at the leaf nodes:
- hypertext(1) browser(1)
- hypertext(3) internet(3) multimedia(5)
- hypertext(5) internet(2) java(7)

Accumulated frequencies:
- hypertext(8) internet(5) multimedia(5) java(7)
- chapter: hypertext(9) browser(1) internet(5) multimedia(5) java(7)

- The level difference tells how many times the parent function is applied.
- The element type number lets unnecessary index information be filtered out.

Representing a Document Tree

[Figure: a document tree in which every node carries a tuple of four numbers and every leaf carries a list of index terms. Reading the tuples against the preceding slides, the fields appear to be document number, UID, level and element type number.]

Node tuples:
- <5,1,1,1>
- <5,2,2,2> <5,3,2,3> <5,4,2,3>
- <5,8,3,5> <5,9,3,5> <5,11,3,6> <5,12,3,6>
- <5,32,4,7> <5,33,4,7> <5,35,4,7> <5,36,4,7>

Leaf term lists, in the order extracted:
- hypertext(1) model(1) retrieval(1) semantics(1)
- index(3) lexical(1) noun(4) stem(2)
- document(4) index(3) precision(2) term(5)
- document(4) index(3) precision(1) term(5)
- browser(2) hypertext(2) java(5) link(6)
- anchor(2) browser(1) html(3) internet(5)
- basian(3) inquiry(2) link(3) matrix(3)

Query Evaluation

- Create accumulators at the user level
  - Accumulators correspond to the elements at the user level
- Compute the TF (term frequency) and DF (document frequency) of a term
  - Sum the term frequencies of the descendant elements into the corresponding accumulators
  - The number of non-zero accumulators is the DF of the term
- Calculate the term weight
- Compute the similarity of the elements and rank them
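The first two steps can be sketched with the postings for ‘browser’ from the example subtree used in this talk (leaf UIDs 32, 33 and 35 of document 5, all at level 4). This is a sketch under stated assumptions: the posting layout, the `ELEMENT_TYPE` lookup used to check the type of an ancestor, and the function names are invented for illustration, since the slides do not say how XRS resolves an ancestor's element type. Term weighting and similarity are left out because no formula is given.

```python
from collections import defaultdict

K_ARY_TABLE = {5: 3}  # document id -> k

# Assumed posting layout: (doc, uid, level, element type, term frequency).
POSTINGS = {
    "browser": [(5, 32, 4, 7, 4), (5, 33, 4, 7, 2), (5, 35, 4, 7, 1)],
}

# Hypothetical lookup: (doc, uid) -> element type number of non-leaf nodes.
ELEMENT_TYPE = {(5, 1): 1, (5, 4): 3, (5, 11): 6, (5, 12): 6}


def parent(uid, k):
    return (uid - 2) // k + 1


def accumulate(term, user_level, target_type):
    """Sum leaf term frequencies into accumulators at the user level."""
    acc = defaultdict(int)
    for doc, uid, level, etype, tf in POSTINGS.get(term, []):
        steps = level - user_level  # how many times to apply parent()
        if steps < 0:
            continue                # posting lies above the user level
        for _ in range(steps):
            uid = parent(uid, K_ARY_TABLE[doc])
        found_type = etype if steps == 0 else ELEMENT_TYPE.get((doc, uid))
        if found_type == target_type:  # filter out unrelated element types
            acc[(doc, uid)] += tf
    return dict(acc)


# Query: find the elements of type 6 at level 3 that contain 'browser'.
tf_by_element = accumulate("browser", user_level=3, target_type=6)
df = sum(1 for tf in tf_by_element.values() if tf > 0)
print(tf_by_element, df)
```

The two accumulators end up holding 6 and 1, which agrees with the sums of the leaf frequencies under UIDs 11 and 12.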

Accumulating Term Frequency

Query: find sections containing ‘browser’

[Figure: the subtree of the tree from the “Representing a Document Tree” slide rooted in document <5,1>, with the nodes <5,11,3,6> and <5,12,3,6> and their leaves <5,32,4,7>, <5,33,4,7>, <5,35,4,7> and <5,36,4,7>. An accumulator array has one slot for <5,11> and one for <5,12>; after accumulation the slots appear to hold 6 and 1.]

Leaf term lists:
- browser(4) index(3) precision(1) term(5)
- browser(2) hypertext(2) java(5) link(6)
- anchor(2) browser(1) html(3) internet(5)
- basian(3) inquiry(2) link(3) matrix(3)

Performance Data (on an Ultra SPARC 2)

Index overhead

Collection      Data size (MB)   Posting file (MB)   Index overhead (%)   Index time (hh/mm)
PATENT (SGML)   256              120                 46.87                1/30
SHAKE (XML)     7                2.8                 40.00                < /02
CLIN (XML)      3                1.36                45.33                < /01

Retrieval time
- Almost all single-term queries are evaluated within one second
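The overhead column appears to be the posting file size as a percentage of the data size; the slides do not state the formula, so this reading is an assumption. The short check below recomputes the column from the other two.

```python
# (data size, posting file size) per collection, taken from the table.
sizes = {"PATENT": (256, 120), "SHAKE": (7, 2.8), "CLIN": (3, 1.36)}

# Assumed definition: overhead (%) = 100 * posting file size / data size.
overhead = {name: 100 * posting / data for name, (data, posting) in sizes.items()}

for name, pct in overhead.items():
    print(f"{name}: {pct:.3f}%")
```

Under this reading PATENT comes out at 46.875%, which the table shows cut off to 46.87.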

Advantages of XRS

- Provides a variety of structural search functions
- Low indexing overhead and quick retrieval time
- Easy to port
  - Java + native C code
  - The C code is built as shared libraries

Alternative Architecture of XRS

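[Figure: alternative XRS architecture. Regions: Client Side (Web browser) and Server Side. Components: User Interface, Rendering Component, Query Mediator Servlet (servlet), Search Engine (shared library). The servlet reaches the search engine through a JNI interface. Arrow labels: Initiate Applet, query, Search result, XML result with XSL, HTML format.]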

Benefit and Problem

- Benefit
  - Simpler and easier to port than the current implementation
  - Does not need an independent Java process
- Problem
  - Current Java servlet engines cannot run the shared libraries
  - Apache JServ, JRun and Jigsaw all fail to run it

Current Status

- Development of the content retrieval part is finished
  - Available on the Web at the end of August 1999
  - http://dlb2.nlm.nih.gov/~dwshin
- The structural retrieval part is in development and will be finished soon