Effective XML Keyword Search with Relevance Oriented Ranking Zhifeng Bao, Tok Wang Ling, Bo Chen,...

Effective XML Keyword Search with Relevance Oriented Ranking

Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu

Introduction

• XML keyword search
  – Inspired by IR-style keyword search on the web
  – Enables users to access information in XML databases
  – XML data is modeled as a rooted, labeled tree
  – Recent research efforts
    • Efficiency
    • Effectiveness

Effectiveness

• Capture the user's search intention
  – Identify the target that the user intends to search for
  – Infer the predicate constraint that the user intends to search via
• Result ranking
  – Rank the query results according to their objective relevance to the user's search intention

State of the Art

• Search semantics design
  – LCA (Lowest Common Ancestor)
    • Node v is an LCA of keyword set K = {w1, w2, …, wk} if the sub-tree rooted at v contains at least one occurrence of every keyword in K, after excluding the sub-elements that already contain all keywords in K
  – SLCA (Smallest LCA)
    • Node v is an SLCA of keyword set K = {w1, w2, …, wk} if
      – (1) v is an LCA of K
      – (2) no proper descendant of v is an LCA of K
  – XSeek
    • Infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
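The SLCA semantics can be illustrated with a small brute-force sketch. It assumes each keyword match is given as a Dewey label (a tuple of child positions from the root); the labels in the usage note are hypothetical. It uses the common reading of SLCA (a node whose sub-tree contains every keyword while no proper descendant does) and does not implement the exclusion step of the LCA definition above, nor any of the optimized algorithms such as XKSearch.

```python
from typing import Dict, List, Optional, Set, Tuple

Dewey = Tuple[int, ...]  # e.g. (1, 1, 3) = 3rd child of the 1st child of the root


def ancestors_or_self(node: Dewey) -> Set[Dewey]:
    """All prefixes of a Dewey label, i.e. the node and its ancestors."""
    return {node[:i] for i in range(1, len(node) + 1)}


def slca(matches: Dict[str, List[Dewey]]) -> List[Dewey]:
    """Brute-force SLCA over keyword -> list of matching node labels."""
    covering: Optional[Set[Dewey]] = None
    for nodes in matches.values():
        # Nodes whose sub-tree contains at least one match of this keyword.
        cover_k: Set[Dewey] = set()
        for n in nodes:
            cover_k |= ancestors_or_self(n)
        covering = cover_k if covering is None else covering & cover_k
    covering = covering or set()
    # Keep only nodes that have no proper descendant covering all keywords.
    smallest = [
        v for v in covering
        if not any(u != v and u[:len(v)] == v for u in covering)
    ]
    return sorted(smallest)
```

For example, if "rock" matches two nodes and "music" matches only one of them, `slca` returns that single deepest node, which mirrors how SLCA can return a lone value-bearing node rather than a whole object.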

State of the Art (cont)

• Efficient result retrieval
  – Designed based on a certain search semantics
  – XKSearch, Multiway SLCA, etc.
• Result ranking
  – XRANK, XKSearch, EASE
  – They only consider
    • Structural compactness of matching results
    • Keyword proximity
    • Similarity at node level

Problems Unaddressed

• User search intention is not adequately addressed
  – Meaningfulness of query results
    • SLCA is less meaningful in many cases
  – Keyword ambiguity problems
    1. A keyword can appear both as an XML node type and as the text value of some other nodes
    2. A keyword can appear in the text values of different XML node types and carry different meanings
• Neither SLCA nor XSeek addresses keyword ambiguity well

Meaningfulness
• Keyword query “rock music”
  – Search intention: find customers interested in “rock music” (C3)
  – SLCA returns: the interest node of C3

[Figure: example XML tree rooted at storeDB. Customers: C1 (name “Mary Smith”, interest “fashion”, address street “Art Street”), C2 (name “John Martin”, interest “street art”), C3 (name “Art Smith”, interest “rock music”), C4 (name “Rock Davis”, interest “art”). Books: B1 (title “Art of Customer Interest Care”, authors “Daniel Jones” and “John Williams”), B2 (authors “Edward Martin” and “Sophia Jones”); a publisher name is “Oxford”.]

Keyword Ambiguity
• Q = “customer, interest, art”
  – Ambiguity 1: customer, interest; Ambiguity 2: art
  – Intention: find customers whose interest is art
  – Less relevant or irrelevant results are also returned: C1, C3, B1's title

[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]

Keyword Ambiguity (cont)

• Q = “customer, art”
  – “art” can be the value of an interest node (C2, C4), a name node (C3), the street node of a customer (C1), or the title node of a book (B1)
  – “customer” can be the tag name of a customer node, or (part of) the value of the title of B1
  – How to rank C1 to C4 and B1?

[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]

Objectives & Challenges

• Challenges
  I. How to decide which sub-tree(s) with appropriate node types can capture the information the user desires
  II. How to return sub-trees of an appropriate size (i.e. containing enough but not overwhelming information)
  III. How to rank those sub-trees by their relevance
• Address the following as a single problem
  – Search intention identification
  – Query result retrieval
  – Result ranking
• Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data

Difficulty in applying TF*IDF to XML
• An XML DB carries semantic information, while a text DB contains pure text information. XML TF*IDF must be aware of the underlying semantics.
• All contents of XML data are stored in leaf nodes only
  – What is the analogy of a “flat document” in XML?
  o Sub-trees are classified according to their prefix paths
• The normalization factor is not simply the size of the sub-tree
  o The structure of sub-trees may also affect the ranks

TF*IDF Recap

• Rule 1: A keyword appearing in many documents should not be regarded as more important than a keyword appearing in a few. --- IDF

• Rule 2: A document with more occurrences of a query keyword should not be regarded as less important for that keyword than a document that has less. --- TF

• Rule 3: A normalization factor is needed to balance between long and short documents, as Rule 2 discriminates against short documents, which may have less chance to contain more occurrences of keywords.
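The three rules can be sketched as a small cosine-style TF*IDF score for flat text. The particular weighting functions (a smoothed logarithmic IDF, `1 + ln(tf)` for TF, and Euclidean-norm normalization) are one common choice assumed here for illustration; the function and parameter names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_similarity(query: List[str], doc: List[str],
                     doc_freq: Dict[str, int], num_docs: int) -> float:
    """Cosine-style TF*IDF similarity between a keyword query and one document."""
    tf = Counter(doc)
    # Rule 1 (IDF): keywords found in fewer documents get a larger weight.
    w_q = {k: math.log(1 + num_docs / (1 + doc_freq.get(k, 0)))
           for k in set(query)}
    # Rule 2 (TF): more occurrences in the document give a larger weight.
    w_d = {k: 1 + math.log(f) for k, f in tf.items()}
    # Rule 3 (normalization): divide by the vector lengths so that long
    # documents are not favored merely for containing more words.
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    norm_d = math.sqrt(sum(w * w for w in w_d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    dot = sum(w_q[k] * w_d[k] for k in w_q if k in w_d)
    return dot / (norm_q * norm_d)
```

A document that does not contain any query keyword scores 0, and a document with repeated occurrences of a keyword can outscore a shorter one with a single occurrence.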

Our Approach
– Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents
  • by analyzing statistics of the underlying XML data
– Major contributions
  1. Identify the user's desired search-for node and search-via node(s) in a heuristic way
     • Define XML TF (term frequency) and XML DF (document frequency)
     • Confidence formulas for search-for/search-via candidates
  2. Define XML TF*IDF similarity
     • Propose 3 guidelines specifically for XML keyword search
     • Take keyword ambiguity problems into account
  3. Design a keyword search engine, XReal

Data Model
• Node type: two nodes are of the same node type if they share the same prefix path
  – e.g. /storeDB/customers/customer/name vs. /storeDB/books/book/publisher/name are two different node types, although both have the tag name

[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]

• Value node: the text value contained in a leaf node
• Structural node
  – Single-valued node type, multi-valued node type
  – Grouping type: all its children are of the same multi-valued type
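A minimal sketch of the node-type notion, assuming the data is loaded with Python's standard `xml.etree.ElementTree`. The `SAMPLE` document is a tiny fragment modeled on the storeDB example, and `group_by_node_type` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Tiny illustrative fragment modeled on the storeDB example.
SAMPLE = """
<storeDB>
  <customers>
    <customer><name>Mary Smith</name></customer>
    <customer><name>Art Smith</name></customer>
  </customers>
  <books>
    <book><publisher><name>Oxford</name></publisher></book>
  </books>
</storeDB>
"""


def group_by_node_type(root: ET.Element) -> Dict[str, List[ET.Element]]:
    """Group elements by prefix path; equal paths mean the same node type."""
    groups: Dict[str, List[ET.Element]] = defaultdict(list)

    def visit(node: ET.Element, parent_path: str) -> None:
        path = f"{parent_path}/{node.tag}"
        groups[path].append(node)
        for child in node:
            visit(child, path)

    visit(root, "")
    return dict(groups)


types = group_by_node_type(ET.fromstring(SAMPLE))
```

Here `types` holds two elements under `/storeDB/customers/customer/name` and one under `/storeDB/books/book/publisher/name`, so the two kinds of `name` are kept apart even though they share a tag.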

XML TF and IDF

• XML DF (document frequency)
  – The number of T-typed nodes that contain keyword k in their sub-trees in the XML database
    • The granularity of similarity measurement is the sub-tree of a certain node type T
• XML TF (term frequency)
  – The number of occurrences of a keyword k in a given value node a in the XML database
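The two statistics can be collected in one pass over the tree. The sketch below assumes `xml.etree.ElementTree`, lower-cased word tokens, and that an element's own tag name counts as a keyword occurrence in its sub-tree (an assumption made here so that keywords can also match node types); `xml_df` and `xml_tf` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def tokenize(text: Optional[str]) -> List[str]:
    """Lower-cased word tokens; None or whitespace-only text gives []."""
    return re.findall(r"\w+", (text or "").lower())


def xml_df(root: ET.Element) -> Dict[Tuple[str, str], int]:
    """Map (node type, keyword) to the number of nodes of that type
    whose sub-tree contains the keyword."""
    df: Counter = Counter()

    def visit(node: ET.Element, parent_path: str) -> Set[str]:
        path = f"{parent_path}/{node.tag}"
        # Assumption: the tag name itself is a matchable keyword.
        words = set(tokenize(node.text)) | {node.tag.lower()}
        for child in node:
            words |= visit(child, path)
        for w in words:
            df[(path, w)] += 1  # each node is counted once per keyword
        return words

    visit(root, "")
    return dict(df)


def xml_tf(value_text: str, keyword: str) -> int:
    """Number of occurrences of a keyword inside one value node."""
    return tokenize(value_text).count(keyword.lower())
```

Because each node contributes at most one count per keyword, a customer whose name and interest both contain "art" still adds only 1 to the DF of "art" for the customer type.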

Infer the desired search-for node
• Guidelines: a node type T is considered a desired search-for node if
  1. T is intuitively related to every query keyword
  2. XML nodes of type T are informative enough to contain enough relevant information
  3. XML nodes of type T are not so overwhelming as to contain too much irrelevant information
• Confidence of T as the search-for node w.r.t. query q
  – A product instead of a sum is used to follow the 1st guideline
  – The log part is designed to follow the 3rd guideline
  – The exponential part is designed to follow the 2nd guideline
  – r is a decay factor in (0, 1]

$$C_{for}(T, q) = \log_e\Big(1 + \prod_{k \in q} f_k^T\Big) \cdot r^{depth(T)}$$
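As a hedged illustration of the search-for confidence, read as log_e(1 + product of f_k^T over the query keywords) times r^depth(T), the sketch below takes the XML DF values as a nested dictionary. The DF numbers in the usage note, the default `r = 0.8`, and the convention that depth is the number of steps in the prefix path are all assumptions made for this example.

```python
import math
from typing import Dict, List


def confidence_for(node_type: str, query: List[str],
                   df: Dict[str, Dict[str, int]], r: float = 0.8) -> float:
    """Search-for confidence of a node type (given as a prefix path)."""
    # Assumption: depth is the number of steps in the prefix path.
    depth = node_type.strip("/").count("/") + 1
    product = 1
    for k in query:
        # A keyword unrelated to this type has DF 0 and zeroes the product,
        # which is how the first guideline is enforced.
        product *= df.get(node_type, {}).get(k, 0)
    return math.log(1 + product) * r ** depth
```

With illustrative statistics such as `{"customer": 4, "interest": 4, "art": 3}` for the customer type and all ones for the book type, the customer type gets the higher confidence for the query customer, interest, art, while a type that misses any keyword gets 0.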

Infer the Search-Via Nodes
• Infer the structural node to search via
  – A structural node n is a good candidate if it is related to as many (but not necessarily all) keywords as possible
  – The search-via node type is normally not unique
• Infer the individual value node to search via
  – Statistics alone are not adequate to infer the likelihood of a value node as (part of) a search-via node
  – Capture keyword co-occurrence
• Confidence of node type T as a search-via node w.r.t. query q

$$C_{via}(T, q) = \log_e\Big(1 + \sum_{k \in q} f_k^T\Big)$$
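A short sketch of the type-level search-via confidence, read as log_e(1 + sum of f_k^T over the query keywords). Unlike a product, the sum keeps a type in play when it relates to only some of the keywords. The DF values used in the usage note are illustrative assumptions, not statistics from the deck.

```python
import math
from typing import Dict, List


def confidence_via(node_type: str, query: List[str],
                   df: Dict[str, Dict[str, int]]) -> float:
    """Search-via confidence of a node type from its XML DF values."""
    total = sum(df.get(node_type, {}).get(k, 0) for k in query)
    return math.log(1 + total)


def rank_via_types(query: List[str],
                   df: Dict[str, Dict[str, int]]) -> List[str]:
    """Candidate search-via types, most confident first."""
    return sorted(df, key=lambda t: confidence_via(t, query, df), reverse=True)
```

For the query customer, name, rock, interest, art, both the name and interest types would typically receive non-zero confidence, which reflects the remark that the search-via node type is normally not unique.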

[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]

• E.g. Q = “customer, name, rock, interest, art”
  – It is easy to find that name and interest have high confidence to be the search-via nodes
  – But it is hard to know whether rock is a value of name or of interest, and whether art is a value of interest or of name
  – How to differentiate customer C4 from C3?
  – Capture keyword co-occurrence

Capture keyword co-occurrence
• Proximity factors for a value node v that has a kt-typed ancestor and contains keyword k
  – Given a query q and a value node v, suppose there are two keywords kt and k in q such that kt matches the type of an ancestor node of v and k matches a keyword in v
  – In-query distance
    • Distance between keyword k and node type kt in query q
    • Favors the case where kt appears before k
  – Structural distance
    • Depth distance between v and the nearest kt-typed ancestor node of v
  – Value-type distance
    • Max of the above two

$$C_{via}(q, v, k) = 1 + \sum_{k_t \in q \cap ancType(v)} \frac{1}{Dist(q, v, k_t, k)}$$
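The proximity factors can be sketched as follows. Several details are assumptions for illustration only: node types are matched by tag name rather than full prefix path, the element holding the text counts as the nearest ancestor at distance 1, and the in-query distance is taken as the position gap when kt precedes k and as infinite otherwise (one way to "favor" kt appearing before k).

```python
import math
from typing import List, Sequence


def in_query_distance(query: List[str], kt: str, k: str) -> float:
    """Position gap between kt and k; infinite unless kt precedes k (assumed)."""
    if kt not in query or k not in query:
        return math.inf
    gap = query.index(k) - query.index(kt)
    return gap if gap > 0 else math.inf


def structural_distance(ancestor_tags: Sequence[str], kt: str) -> float:
    """Depth distance to the nearest kt-tagged ancestor.
    ancestor_tags runs from the enclosing element (distance 1) to the root."""
    for dist, tag in enumerate(ancestor_tags, start=1):
        if tag == kt:
            return dist
    return math.inf


def value_via_confidence(query: List[str], ancestor_tags: Sequence[str],
                         k: str) -> float:
    """C_via(q, v, k) = 1 + sum over matching kt of 1 / Dist(q, v, kt, k)."""
    conf = 1.0
    for kt in set(query) & set(ancestor_tags):
        if kt == k:
            continue
        dist = max(in_query_distance(query, kt, k),
                   structural_distance(ancestor_tags, kt))
        conf += 1.0 / dist  # an infinite distance contributes 0
    return conf
```

For Q = customer, name, rock, interest, art, a match of rock inside a name value scores 1 + 1/1 + 1/2 = 2.5 under these assumptions, whereas a match of rock inside an interest value scores only 1 + 1/3, because interest appears after rock in the query. This is the kind of signal that separates C4 from C3.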

Principles of XML keyword search
• Principle 1
  – When searching for D-typed nodes via a single-valued type V, ideally only the values and structures nested in V-typed nodes can affect the relevance, regardless of the size of other typed nodes nested in D-typed nodes.
    • However, TF*IDF similarity in IR normalizes the relevance score of each document w.r.t. its size
• Principle 2 (addresses keyword Ambiguity 2)
  – When searching for nodes of type D via a multi-valued type V’, the relevance of a D-typed node that contains a query-relevant V’-typed node should not be affected (i.e. normalized) too much by other query-irrelevant V’-typed nodes.
    • Example: for query “art”, C4 should not be less relevant than C1

Principles of XML keyword search

• Principles 1 and 2
  – Especially useful for interpreting a pure keyword query: find the search-via node correctly
• Principle 3
  – The order of keywords in a query is important to indicate the search intention
    • Incorporate the search-via confidence Cvia defined before

XML TF*IDF Similarity
• To calculate the similarity between the search-for node and the query q
  – Base case: similarity between value node a and q
    • Apply the original TF*IDF directly, since a contains keywords only, without any structure
  – Recursive case: similarity between structural node n and q
    • Based on the similarities of its children c and the confidence level of c as the node type to search via
• Base case formula: IDF part times TF part, divided by the normalization factor

$$similarity(q, a) = \frac{\sum_{k \in q \cap a} W_{q,k}^{T_a} \cdot W_{a,k}}{W_q^{T_a} \cdot W_a}$$

$$W_{q,k}^{T_a} = C_{via}(q, a, k) \cdot \ln\big(1 + N_{T_a} / (1 + f_k^{T_a})\big), \qquad W_{a,k} = 1 + \ln(f_{a,k})$$

$$W_q^{T_a} = \sqrt{\sum_{k \in q} \big(W_{q,k}^{T_a}\big)^2}, \qquad W_a = \sqrt{\sum_{k \in a} W_{a,k}^2}$$
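A sketch of the base case under stated assumptions: `n_type` stands for the N term of the IDF part (taken here to be a count of nodes of the relevant type), `df_type` holds the XML DF values for that type, and `via_conf` is a caller-supplied function returning the value-level search-via confidence of a keyword (defaulting to 1 for every keyword). All three names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def base_similarity(query: List[str], value_tokens: List[str],
                    n_type: int, df_type: Dict[str, int],
                    via_conf: Callable[[str], float] = lambda k: 1.0) -> float:
    """XML TF*IDF similarity between a query and one value node."""
    tf = Counter(value_tokens)
    # IDF part, scaled by the search-via confidence of each keyword.
    w_q = {k: via_conf(k) * math.log(1 + n_type / (1 + df_type.get(k, 0)))
           for k in set(query)}
    # TF part: 1 + ln(number of occurrences in the value node).
    w_a = {k: 1 + math.log(f) for k, f in tf.items()}
    # Normalization: Euclidean lengths of the two weight vectors.
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    norm_a = math.sqrt(sum(w * w for w in w_a.values()))
    if norm_q == 0 or norm_a == 0:
        return 0.0
    dot = sum(w_q[k] * w_a[k] for k in w_q if k in w_a)
    return dot / (norm_q * norm_a)
```

Raising the search-via confidence of a keyword that the value node actually contains increases the score, which is how keyword order and co-occurrence feed into the ranking.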

XML TF*IDF Similarity (cont.)

• Recursive case
  – Intuition 2. An internal node n is relevant to q if n has a child c such that the type of c has high confidence to be a search-via node w.r.t. q (i.e. large Cvia(Tc, q)), and c is highly relevant to q (i.e. large sim(q, c)).
  – Intuition 3. An internal node n is more relevant to q if n has more query-relevant children, all else being equal.

$$similarity(q, n) = \frac{\sum_{c \in chd(n)} sim(q, c) \cdot C_{via}(T_c, q)}{W_n^q}$$

  – Numerator: weighted sum of the similarities of all of n's children and their confidence to be the search-via node
  – Denominator: overall weight of node n w.r.t. query q, which essentially plays the role of a normalization factor
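The recursive case can be sketched bottom-up over a small tree structure. The slides do not spell out how the overall weight of n is computed, so the sketch takes it as a caller-supplied function; the `unit_weight` placeholder is an assumption for illustration and is not the authors' definition. `Node`, `base_sim` and `via_conf` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    node_type: str
    children: List["Node"] = field(default_factory=list)
    # Pre-computed base-case similarity for nodes holding a value.
    base_sim: Optional[float] = None


def unit_weight(node: Node) -> float:
    """Placeholder overall weight (assumption, not the paper's formula)."""
    return 1.0


def similarity(node: Node, via_conf: Dict[str, float],
               overall_weight: Callable[[Node], float] = unit_weight) -> float:
    """Similarity of a node to the query, computed recursively."""
    if node.base_sim is not None:
        return node.base_sim  # base case: value node
    # Weighted sum of children similarities, each weighted by the
    # search-via confidence of the child's type.
    total = sum(similarity(c, via_conf, overall_weight)
                * via_conf.get(c.node_type, 0.0)
                for c in node.children)
    w = overall_weight(node)
    return total / w if w > 0 else 0.0
```

With the placeholder weight, a customer node gains score both from a highly relevant child and from having more relevant children, matching Intuitions 2 and 3; a real normalization would temper the second effect.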

Flowchart of answering a query

1. Identify the user's search intention
  – Compute the confidence of all possible candidate node types and choose the desired search-for node type Tfor
2. Relevance-oriented ranking
  – Compute the XML TF*IDF similarity bottom-up, from the value nodes containing keywords up to the nodes of type Tfor
  – Return a ranked list of sub-trees rooted at nodes of type Tfor
• If more than one search-for node type has comparable confidence, a ranked list for each search-for node type is returned

Experimental Result

• Data sets
  – DBLP, XMark, WSU, eBay
• Comparison
  – Compare XReal with SLCA and XSeek
• Equipment
  – Implemented in Java
  – Run on a 3.6 GHz Pentium IV PC with 1 GB memory and Windows XP
  – Berkeley DB Java Edition for storing keyword inverted lists and the keyword frequency table

Search Effectiveness

• Accuracy in inferring the search-for node
  – Evaluated by a user survey
  – Tested queries contain at least one of the two ambiguity problems
  – Conclusion
    • XReal works well, especially when the search-for node is not given explicitly in the query

Search Effectiveness

• Result effectiveness
  – Measured by precision, recall, F-measure
  – Observations
    • XReal achieves higher precision than SLCA and XSeek for queries that contain ambiguities
    • XReal performs as well as XSeek when queries have no ambiguity in the XML data
    • XReal: Top-100 precision is higher than overall precision
    • F-measure also shows good overall effectiveness of both XReal and XSeek

Ranking Effectiveness

• Metrics
  – Number of Top-1 answers that are relevant
  – Reciprocal Rank (R-Rank)
  – Mean Average Precision (MAP)
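For reference, the three ranking metrics can be computed from per-query relevance judgments as sketched below. Each query is represented as a list of booleans in rank order (an assumed input format), and average precision is normalized by the number of relevant results that appear in the list, which is one common convention and an assumption here.

```python
from typing import List


def top1_relevant(runs: List[List[bool]]) -> int:
    """Number of queries whose top-ranked answer is relevant."""
    return sum(1 for r in runs if r and r[0])


def reciprocal_rank(relevance: List[bool]) -> float:
    """1 / rank of the first relevant answer, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: List[bool]) -> float:
    """Mean of precision@rank taken at each relevant answer."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs: List[List[bool]]) -> float:
    """Average of average_precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)
```

A ranking that places its first relevant answer second has a reciprocal rank of 0.5, regardless of what follows.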

Efficiency & Scalability

• Compare three index variants for XReal against SLCA
  – Dup
    • Stores only the Dewey ID and XML TF
  – DupType
    • Stores an extra node type (i.e. its prefix path)
  – DupTypeNorm
    • Stores an extra normalization factor Wa for the value node

[Figure: efficiency and scalability charts, with panels for XMark and DBLP]

Thank You

