[IEEE 2011 IEEE Consumer Communications and Networking Conference (CCNC) - Las Vegas, NV, USA (2011.01.9-2011.01.12)] 2011 IEEE Consumer Communications and Networking Conference (CCNC)

Shopping search and the semantic web

Aravind M. Canthadai

SmarTek21 Kirkland, WA 98034

[email protected]

Abstract— Faceted search is becoming very common in commerce search. We introduce a new type of facet: a predefined schema for a product which can be used for the drilldown process. The actual search process which ensues from using such a facet is what is commonly known as a keytree search. Keytree search is a search on a library of XML documents which need to be ranked so they satisfy both the keyword and the tree structure constraints. This is a position paper which proposes combining existing vector space models (for keyword match) with tree isomorphism codes (for structural match).

Keywords- semantic web, eCommerce, information retrieval

I. INTRODUCTION

Two ongoing trends will make it possible for consumers to do shopping search much more easily. The first trend is the encouragement from search engines to introduce semantic markup into the contents of web pages. For example, Google and Yahoo encourage webmasters to publish semantic markup for their web pages [1, 2]. While the initial focus is on introduction of semantic markup to custom site-specific search, the idea can be extended to broad web search. The second trend is the growing use of faceted search in eCommerce. This is seen in how faceted search has moved from being used primarily in eCommerce websites (such as Amazon.com) to being used in shopping search engines. As an example, faceting and drill down is implemented in Google, Bing and Yahoo shopping search portals.

Figure 1 Specifying product details using a tree structure

For the webmaster, adding rich semantic markup within comments would be simpler to implement, since it would not affect the existing page layout and can be incorporated into existing web layouts more easily. Let us suppose the webmaster for an eCommerce site adds semantic markup for each product page on the site. Suppose also that there are existing schemas for the specific product being searched. As an example, a user issuing a query for a mobile phone may be shown a search suggestion where the user could fill out values for specific keys corresponding to a predefined schema for mobile phones, as shown in Figure 1.

The user may choose to ignore inputting values for some keys and specify values for the other keys. Once this information is input, the input for the search engine is not just a keyword but rather a keytree. In other words, the best product match must be most similar to the input keytree in terms of content as well as structure.
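To make the notion of a keytree concrete, the following is a minimal sketch of how such an input might be represented. The schema keys (`phone`, `brand`, `display`, and so on) and the names `KeyNode` and `filled_pairs` are assumptions made purely for illustration; they are not taken from Figure 1. A value of `None` stands for a key the user chose to leave blank.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class KeyNode:
    """One node of a keytree: a schema key plus an optional user-supplied value."""
    key: str
    value: Optional[str] = None          # None means the user left this key blank
    children: List["KeyNode"] = field(default_factory=list)


def filled_pairs(node: KeyNode, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (tree path, value) for every node the user actually filled in."""
    here = f"{path}/{node.key}"
    if node.value is not None:
        yield here, node.value
    for child in node.children:
        yield from filled_pairs(child, here)


# Hypothetical keytree for a mobile phone query; "touchscreen" is left blank.
query = KeyNode("phone", children=[
    KeyNode("brand", "Nokia"),
    KeyNode("display", children=[
        KeyNode("size", "3.5 inch"),
        KeyNode("touchscreen"),
    ]),
    KeyNode("camera", children=[KeyNode("megapixels", "5")]),
])

for path, value in filled_pairs(query):
    print(path, "=", value)
```

The tree keeps the blank node in place, so the structure of the schema is preserved even where the user supplied no content.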

II. RELATED WORK

The notion of using XML fragments themselves as input for the search has been explored in previous work. In [3], the authors model this as an unordered tree inclusion problem and provide solution approaches. While this is possibly the most accurate approach, the implementation does not scale well for large search implementations, since every XML document must be compared against the input XML tree. Indeed, this search scalability issue is the reason why web search engines use the inverted document index and the associated vector space models [4]. In [5], the authors approach this problem from the other direction, by constructing an inverted index and modifying the vector space model by a) processing the input XML documents into bags of words with the tree path as the field name, and b) modifying the scoring system suitably. This is an improvement over the approach in [3], but here the document itself is heavily modified to permit the search. Our approach is also based on using an existing search library (Lucene) which uses the vector space model. However, we introduce very few modifications to the input XML document, as discussed in Section V.
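As a rough illustration of the preprocessing step described for [5], the sketch below flattens an XML document into bags of words keyed by tree path. It is only an interpretation of the one-sentence description above, not the implementation from [5]; the function name `path_fields` and the sample product XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def path_fields(xml_text):
    """Flatten an XML document into {tree path: list of lower-cased words}."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            # The tree path acts as the field name; the text becomes a bag of words.
            fields[here].extend(text.lower().split())
        for child in elem:
            walk(child, here)

    walk(root, "")
    return dict(fields)


# Hypothetical product markup used only for this example.
PRODUCT_XML = (
    "<phone><brand>Nokia</brand>"
    "<display><size>3.5 inch</size></display></phone>"
)

print(path_fields(PRODUCT_XML))
```

Note that the tree itself is discarded after flattening; only the paths survive as field names, which is the sense in which the document is heavily modified.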

For the purposes of this discussion, a keytree search can be summarized as follows: the input to the search engine is a keytree whose nodes consist of (key, value) pairs. The keys are based on a predefined XML schema, while the values are user inputs. The document collection consists of XML documents which are generated for individual products in the catalogs of the merchant websites. The output from the search is expected to be the most relevant product matching the keytree which has been supplied as the input.

The 8th Annual IEEE Consumer Communications and Networking Conference - Work in Progress (Short Papers)

978-1-4244-8790-5/11/$26.00 ©2011 IEEE 699

III. MATCHING THE CONTENT: LUCENE SEARCH LIBRARY

The Lucene search library is a free, open source information retrieval library which allows users to create custom search implementations for their needs [6]. Lucene is based on a document-field approach for searching textual content. A directory of documents is created for the information which needs to be searched. Lucene allows this directory to be created in memory as well as in the file system. Each document comprises a set of fields. A field has a name and a value. For example, a document describing a single product will have a field named URL, with the value being the hyperlink to the product page. In addition, it may have a field named UPC number, with the value being the UPC number for the product, and so on.

Lucene allows users to search within these documents and uses a vector space model to assign scores to each document which is retrieved in the search result. In our approach, the existing Lucene search mechanism will be used for the content search. A nice feature of Lucene is its flexibility in querying. It is possible, for example, to specify that all the keywords in the input must be present in a document. Similarly, filtering the search results based on specific field names and their values is also easy to implement.
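The document-field model and the querying options just described can be mimicked in a few lines of plain Python. The sketch below is not Lucene and does not reproduce Lucene's scoring formula; it uses a simplified TF-IDF weight as a stand-in for a vector space score. The field names, the placeholder URLs and UPC values, and the `search` function with its `require_all` and `field_filter` parameters are all assumptions made for illustration.

```python
import math
from collections import Counter

# Each "document" is a set of named fields, as in the document-field model.
DOCS = [
    {"url": "http://example.com/p/1", "upc": "000000000001",
     "text": "nokia touchscreen phone 5 megapixel camera"},
    {"url": "http://example.com/p/2", "upc": "000000000002",
     "text": "nokia phone with keypad"},
    {"url": "http://example.com/p/3", "upc": "000000000003",
     "text": "touchscreen tablet"},
]


def search(docs, keywords, require_all=False, field_filter=None):
    """Rank docs by a simplified TF-IDF score over the 'text' field."""
    n = len(docs)
    counts_per_doc = [Counter(d["text"].split()) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for counts in counts_per_doc for term in counts)
    results = []
    for doc, counts in zip(docs, counts_per_doc):
        # Optional exact-match filter on other fields (e.g. the UPC number).
        if field_filter and any(doc.get(k) != v for k, v in field_filter.items()):
            continue
        # Optional constraint that every keyword must be present.
        if require_all and not all(k in counts for k in keywords):
            continue
        score = sum(counts[k] * math.log(1 + n / df[k])
                    for k in keywords if k in counts)
        if score > 0:
            results.append((score, doc["url"]))
    # Highest score first; ties broken by URL so the order is deterministic.
    return [url for score, url in sorted(results, key=lambda r: (-r[0], r[1]))]


print(search(DOCS, ["nokia", "touchscreen"]))
print(search(DOCS, ["nokia", "touchscreen"], require_all=True))
print(search(DOCS, ["nokia"], field_filter={"upc": "000000000002"}))
```

In an actual Lucene deployment the same three behaviours would be expressed through the library's own query and filter facilities rather than hand-written loops.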

IV. MATCHING THE STRUCTURE: TREE ISOMORPHISM

Imagine a tree T1 with nodes in (key, value) format. Although the (key, value) notation is used, this is not a dictionary: the names of the keys may repeat. When determining if another tree T2 is isomorphic to the first tree T1, we can define the isomorphism test in two ways. In the case of ordered isomorphism, we simply wish to find out whether the two tree structures are identical, without permitting any reordering. If they are, the same traversal of both trees yields two identical lists of (key, value) pairs. We could also perform unordered isomorphism tests. Here, the two trees count as isomorphic if they become identical after some reordering of sibling subtrees. In this case, we determine the isomorphism codes of tree T1 and tree T2 and check whether they are the same (due to space constraints, the reader is referred to [7] for an extensive discussion on the tree isomorphism problem and the tree isomorphism code). In the case of unordered isomorphism testing, the isomorphism codes should match, and the values of the keys should also match at corresponding nodes.
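The difference between the ordered and unordered tests can be illustrated with a small sketch. The code below uses a sorted parenthesis encoding as a stand-in for an isomorphism code; it is one common canonical form for rooted unordered trees and is not necessarily the encoding defined in [7]. Trees are written as `(key, value, children)` tuples, and the names `iso_code` and `labeled_code`, as well as the sample keys, are hypothetical.

```python
def iso_code(node):
    """Structure-only code: identical for trees that differ only in sibling order."""
    _key, _value, children = node
    return "(" + "".join(sorted(iso_code(c) for c in children)) + ")"


def labeled_code(node):
    """Canonical form that also records keys and values at each node."""
    key, value, children = node
    return (key, value, tuple(sorted(labeled_code(c) for c in children)))


# T2 is T1 with the two children of the root swapped.
T1 = ("phone", "", [
    ("brand", "Nokia", []),
    ("display", "", [("size", "3.5", [])]),
])
T2 = ("phone", "", [
    ("display", "", [("size", "3.5", [])]),
    ("brand", "Nokia", []),
])
# T3 has the same shape as T1 but a different value for "brand".
T3 = ("phone", "", [
    ("brand", "Samsung", []),
    ("display", "", [("size", "3.5", [])]),
])

print(T1 == T2)                               # ordered test: sibling order matters
print(iso_code(T1) == iso_code(T2))           # unordered structural test
print(labeled_code(T1) == labeled_code(T2))   # structure plus keys and values
print(labeled_code(T1) == labeled_code(T3))   # same shape, different value
```

Sorting the child codes before concatenation is what makes the code independent of sibling order. Repeated keys are handled naturally, since siblings with the same key are simply sorted next to each other.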

V. COMBINING CONTENT AND STRUCTURE SEARCH

Suppose document D is an XML document describing the semantic markup of product P. Since an XML document is a tree, we can determine its isomorphism code. In order to combine both content and structure search, we will store the isomorphism code for this XML document when creating a Lucene document for this particular product.¹

¹ During this step, we also need to make some decisions about handling XML whitespace, empty tags, as well as the attributes within elements. While this information can be added to the Lucene document, for the sake of simplicity, we will suppose the XML document itself is completely in (key, value) format without empty values.

During the search process, the schema is used as the facet. The user inputs for the values are used to assemble the query which is submitted to the Lucene search library. Lucene returns a set of results which match the search terms. In addition, we also submit the structure of the XML input as an isomorphism code, and keep only those results where the isomorphism code matches the one which is stored in the document. This process produces a refinement of the search result whereby both the content and the structure of the input XML fragment are matched in the search results.
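The two-stage process described in this section can be sketched end to end as follows. This is a toy under stated assumptions, not the proposed system: a simple set-containment test stands in for the Lucene content query, a sorted parenthesis string stands in for the isomorphism code, and the function names (`index_product`, `keytree_search`) and sample product markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def iso_code(elem):
    """Structure-only code for an XML element, independent of sibling order."""
    return "(" + "".join(sorted(iso_code(child) for child in elem)) + ")"


def leaf_terms(elem):
    """Collect the lower-cased words found in the text of all elements."""
    terms = set()
    for e in elem.iter():
        terms.update((e.text or "").lower().split())
    return terms


def index_product(url, xml_text):
    """Build one index entry, storing the isomorphism code next to the content."""
    root = ET.fromstring(xml_text)
    return {"url": url, "iso_code": iso_code(root), "terms": leaf_terms(root)}


def keytree_search(index, query_xml):
    query = ET.fromstring(query_xml)
    q_code, q_terms = iso_code(query), leaf_terms(query)
    # Stage 1: content match (stand-in for the Lucene keyword query).
    content_hits = [d for d in index if q_terms <= d["terms"]]
    # Stage 2: keep only hits whose stored code equals the query's code.
    return [d["url"] for d in content_hits if d["iso_code"] == q_code]


INDEX = [
    index_product("A", "<phone><brand>Nokia</brand>"
                       "<display><size>3.5</size></display></phone>"),
    index_product("B", "<phone><display><size>3.5</size></display>"
                       "<brand>Nokia</brand></phone>"),
    # Same words as A and B, but a flat structure.
    index_product("C", "<phone><brand>Nokia</brand><size>3.5</size></phone>"),
]

QUERY = ("<phone><display><size>3.5</size></display>"
         "<brand>nokia</brand></phone>")
print(keytree_search(INDEX, QUERY))
```

Product C passes the content stage but is removed by the structure filter, which is the refinement the section describes. Because the filter requires the codes to be equal, a product document with more nodes than the query would also be rejected.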

VI. FUTURE WORK

In many ways, this is a very simplistic approach for the proposed keytree search implementation. There are many issues to tackle in order to produce more relevant search results. As an example, extending the isomorphism code check to perform a subtree isomorphism search would make the search results much more useful, since the document describing the product will likely contain much more information than the keytree used for the input and would be a superset of the input information.
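One simple reading of the subtree extension is sketched below: the query keytree is accepted if it can be embedded in the product tree starting at the root, with each query child mapped to a distinct child of the corresponding product node, regardless of sibling order. This is only one possible variant, chosen for brevity; it is not the tree inclusion formulation of [3], and the backtracking used here would not scale to large trees. Trees are `(key, value, children)` tuples, a query value of `None` means the user left the key blank, and the names `includes` and `_match` are hypothetical.

```python
def includes(doc, query):
    """True if `query` embeds in `doc` at the root, ignoring sibling order."""
    d_key, d_val, d_children = doc
    q_key, q_val, q_children = query
    # Keys must agree; values are compared only where the query specifies one.
    if d_key != q_key or (q_val is not None and q_val != d_val):
        return False
    return _match(list(q_children), list(d_children))


def _match(q_children, d_children):
    """Assign each query child to a distinct doc child, with backtracking."""
    if not q_children:
        return True
    first, rest = q_children[0], q_children[1:]
    for i, candidate in enumerate(d_children):
        remaining = d_children[:i] + d_children[i + 1:]
        if includes(candidate, first) and _match(rest, remaining):
            return True
    return False


# The product document carries more information than the query asks about.
DOC = ("phone", None, [
    ("brand", "Nokia", []),
    ("display", None, [("size", "3.5", []), ("touchscreen", "yes", [])]),
    ("camera", None, [("megapixels", "5", [])]),
])
QUERY_OK = ("phone", None, [
    ("display", None, [("size", "3.5", [])]),
    ("brand", "Nokia", []),
])
QUERY_WRONG_VALUE = ("phone", None, [("brand", "Samsung", [])])

print(includes(DOC, QUERY_OK))
print(includes(DOC, QUERY_WRONG_VALUE))
```

Unlike an equality test on isomorphism codes, this check tolerates extra nodes in the product document, at the cost of comparing trees pairwise rather than comparing precomputed codes.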

In addition, it is possible that the schema itself may be defined in such a way that not all the nodes in the tree will have values (but they should have keys). In such a case, more sophisticated approaches need to be used for the search process.

The scoring scheme used in Lucene is well suited to free-form textual content. It is fair to expect that, when searching on structure as well as content, some changes will need to be applied to the Lucene scoring system to produce the desired search results.

REFERENCES

[1] http://code.google.com/apis/customsearch/docs/snippets.html as seen on 09/06/2010

[2] http://developer.yahoo.com/searchmonkey/smguide/semantic_web.html as seen on 09/06/2010

[3] Schlieder, T. and Naumann, F. Approximate tree embedding for querying XML data. In ACM SIGIR workshop on XML and information retrieval, Athens, Greece, July 2000.

[4] http://infolab.stanford.edu/~backrub/google.html as seen on 09/06/2010

[5] Carmel, D., Maarek, Y., Mass, Y., Efraty, N. and Landau, G., An extension of the vector space model for querying XML documents via XML fragments. In ACM SIGIR 2002 Workshop on XML and Information Retrieval.

[6] http://en.wikipedia.org/wiki/Lucene as seen on 09/06/2010

[7] Valiente, G. Algorithms on Trees and Graphs, Springer, 2002.
