4
Ranking, Aggregation, and Reachability in Faceted Search with SemFacet ? Evgeny Kharlamov 1 Luca Giacomelli 2 Evgeny Sherkhonov 1 Bernardo Cuenca Grau 1 Egor V. Kostylev 1 Ian Horrocks 1 1 University of Oxford, UK 2 Sapienza University of Rome, Italy 1 Introduction Motivation. Faceted search is a prominent search paradigm that became the standard in many Web applications including online shopping and real-estate portals, where users can progressively narrow down the search results by applying filters, called facets [9] which are organised in faceted interfaces. Faceted search has also been proposed in the Semantic Web context for exploring and querying RDF graphs and a number of RDF- based faceted search systems have been developed (see [3,7,4,5,2,6] for an overview). One of the main challenges that hampers usability of faceted search systems is in- formation overload [9]: when the size of the faceted interface becomes comparable to the size of data over which the search is performed. The overload is already a challenge in the case of the classical faceted search and it becomes even stronger when faceted search is applied to RDF data. Indeed, observe that in both classical and RDF settings the search is over annotated entities, at the same time, in the latter case the number of annotations and possible values is typically much larger: any predicate occurring in an RDF data set can give a facet during a search, and any entity or value that it points to in the RDF data can become a value in this facet. At the same time, in the classical case the list of possible facets is typically predefined and controlled. Moreover, differently from the classical case, in the RDF settings entities are also interconnected. Thus, in an RDF driven faceted interface one has to nest facets in order to reflect this interconnections. In Figure 1, left, one can see a possible way to nest facets which is implemented in our SemFacet system. This not only raises a non-trivial problem of how to arrange nested facets within an interface in an intuitive and user oriented way, but also leads to an unavoidable overload of the interface with various paths of nested facets. Thus, faceted navigation over RDF requires from users to be experts in the structure of the underlying RDF graph in order to be able to find the required facet which can be deeply nested. In order to see why it might be hard to find a required nested facet, consider in Figure 2 an example RDF data. Assume that one is looking for a smartphone with the price within £500–£900 and the processor that is manufactured by a company with headquarters in Asia. One can see in Figure 2 that Samsung S8 is such a smartphone. At the same time, finding this smartphone via a faceted interface requires one to traverse the data via 4 nested facets (Figure 1, left). SemFacet System. In our RDF-based faceted search system SemFacet we propose to address the information overload by ranking facets and their values and thus offering to users top-k most prominent facets and values; ? This work was supported by the Royal Society under a University Research Fellowship and the EPSRC under an IAA award and the projects DBOnto, MaSI 3 , ED 3 , and VADA.

Ranking, Aggregation, and Reachability in Faceted …ceur-ws.org/Vol-1963/paper650.pdfRanking Handler Range Handler Shortcuts Handler Facet Generator SPARQL Q. Constructor Facet Composer

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Ranking, Aggregation, and Reachability in Faceted …ceur-ws.org/Vol-1963/paper650.pdfRanking Handler Range Handler Shortcuts Handler Facet Generator SPARQL Q. Constructor Facet Composer

Ranking, Aggregation, and Reachability inFaceted Search with SemFacet?

Evgeny Kharlamov1 Luca Giacomelli2 Evgeny Sherkhonov1

Bernardo Cuenca Grau1 Egor V. Kostylev1 Ian Horrocks1

1University of Oxford, UK 2Sapienza University of Rome, Italy

1 IntroductionMotivation. Faceted search is a prominent search paradigm that became the standard inmany Web applications including online shopping and real-estate portals, where userscan progressively narrow down the search results by applying filters, called facets [9]which are organised in faceted interfaces. Faceted search has also been proposed in theSemantic Web context for exploring and querying RDF graphs and a number of RDF-based faceted search systems have been developed (see [3,7,4,5,2,6] for an overview).

One of the main challenges that hampers usability of faceted search systems is in-formation overload [9]: when the size of the faceted interface becomes comparable tothe size of data over which the search is performed. The overload is already a challengein the case of the classical faceted search and it becomes even stronger when facetedsearch is applied to RDF data. Indeed, observe that in both classical and RDF settingsthe search is over annotated entities, at the same time, in the latter case the number ofannotations and possible values is typically much larger: any predicate occurring in anRDF data set can give a facet during a search, and any entity or value that it points to inthe RDF data can become a value in this facet. At the same time, in the classical casethe list of possible facets is typically predefined and controlled.

Moreover, differently from the classical case, in the RDF settings entities are alsointerconnected. Thus, in an RDF driven faceted interface one has to nest facets in orderto reflect this interconnections. In Figure 1, left, one can see a possible way to nestfacets which is implemented in our SemFacet system. This not only raises a non-trivialproblem of how to arrange nested facets within an interface in an intuitive and useroriented way, but also leads to an unavoidable overload of the interface with variouspaths of nested facets. Thus, faceted navigation over RDF requires from users to beexperts in the structure of the underlying RDF graph in order to be able to find therequired facet which can be deeply nested.

In order to see why it might be hard to find a required nested facet, consider inFigure 2 an example RDF data. Assume that one is looking for a smartphone withthe price within £500–£900 and the processor that is manufactured by a company withheadquarters in Asia. One can see in Figure 2 that Samsung S8 is such a smartphone. Atthe same time, finding this smartphone via a faceted interface requires one to traversethe data via 4 nested facets (Figure 1, left).

SemFacet System. In our RDF-based faceted search system SemFacet we propose toaddress the information overload by

– ranking facets and their values and thus offering to users top-k most prominentfacets and values;

? This work was supported by the Royal Society under a University Research Fellowship andthe EPSRC under an IAA award and the projects DBOnto, MaSI3, ED3, and VADA.

Page 2: Ranking, Aggregation, and Reachability in Faceted …ceur-ws.org/Vol-1963/paper650.pdfRanking Handler Range Handler Shortcuts Handler Facet Generator SPARQL Q. Constructor Facet Composer

ANY

Processor

LaptopSmartPhone

Searchphone

type

hasPart

keywords

facet predicate

Samsung Galaxy S8Galaxy S8 is a phone that offers an exceptional experience for any user. The large screen is a real turning point in flagship phone design and should usher in the end of large bezels, and the camera and slick performance work brilliantly under the finger.

answerwithHQ

producedByANY

ANY

facet values

price

inCountry

ANY

inContinent

ANY

North AmericaAsia

1.600500 900

Search Task: Find smartphones ▪ whose processor▪ is manufactured by a

company ▪ with headquarters in Asia▪ and whose price is within

£500-£9000

ANY

Processor

LaptopSmartPhone

Searchphone

San DiegoSuwon

type

hasPart

withHQ

producedBy

Reachable facets

refocussed answer

maxprice

0 1.600

aggregate facets

refocusing

enter facet

ANY

Exycon 8 ProcessorGalaxy S8 is built around the new Exynos 8895, featuring new custom CPU cores, new Mali GPU, and with the whole thing fabricated via a cutting-edge 10nm process intended to maximize performance and power efficiency.

ANY

500 900

(8.270)

(5.850)

(2.490)

(1.800)

(630)

(8)

(5)

Search Task: Show me processors of smartphones ▪ whose processor▪ is manufactured by a company ▪ with headquarters in Asia▪ and whose max price among

providers is £500-£900

inContinentAsiaN. America

(1)

(3)

Fig. 1. Left: SemFacet with deep nesting over RDF data; Right: SemFacet enhanced with rank-ing, aggregation, and reachability

– minimising the number of values within a facet by first grouping them according tothe corresponding entities and then aggregating them with the standard aggregatefunctions max, min, count, sum, avg;

– shortcutting paths of nested facets with the help of a reachability operator.

Observe in Figure 1, right, the SemFacet interface where all these three featuresare incorporated. The facets and values are now ranked and only the top 10 of themare displayed. The users can choose whether to look for a maximal price of entities byactivating aggregate functions within facets with numerical values. In the screenshotthe user selected max and thus the system is searching for smartphones whose maximalprice across providers is within the range £500–£900. Finally, the users can enter thename of a desired predicate instead of choosing a facet value in order to ask the systemto find a shortcut (a reachable facet). In the screenshot the user entered inContinentinstead of selecting SanDiego or Suwon.

Demo Overview. The goal of the demo is to show the attendees how our novel features,that is, ranking, aggregation, and shortcutting, make hard faceted search—over RDFdatasets that are highly interconnected and with many data values—easier. In order toexperience the impact of these features the attendees will be able to explore the demodataset using two versions of our SemFacet systems: with and without the features.

2 SemFacet SystemWe first give an overview of SemFacet with the focus on its novel components and thenpresent technical details behind ranking, aggregation, and shortcuts.

System Overview. SemFacet is implemented in Java and available under an academiclicense [1]. In Figure 2 there is the architecture of SemFacet where the arrows denotethe data flow between the systems’ components.

On the client side, SemFacet has an HTML 5 based GUI consisting of three mainparts: a free text search box for keywords, a hierarchically organised faceted interface,and a scrollable panel containing snippet-shaped answers. User keywords are sent bythe client to the server, evaluated and this gives the initial faceted interface. User se-lections in the faceted interface are compiled into a SPARQL query using the SPARQLquery constructor and then sent to the server for evaluation. The snippet composer andfacet composer receive information about facets and answers that should be displayedto the user and update the currently displayed interface and query answers. The sys-tem updates the faceted interface incrementally: only the parts of the interface that areaffected by users’ actions are updated, which allows for a significantly faster response

2

Page 3: Ranking, Aggregation, and Reachability in Faceted …ceur-ws.org/Vol-1963/paper650.pdfRanking Handler Range Handler Shortcuts Handler Facet Generator SPARQL Q. Constructor Facet Composer

4

:SamsungS8

800 900

:Exynos

290 270

:Samsung :Suwon :S .Korea

:AsiaSmartphone Processor Company

:HP Elite X3

730

:Snapdragon

300280

:Qualcomm :SanDiego :USA

:NorthAmerica

type

type

type

type

type

type

price price

hasPart producedBy withHQ inCountry

inContinent

price price

price

hasPart producedBy

priceprice

withHQ inCountry

inContinent

Fig. 2. Example RDF graph about products

Let O = [F1, . . . , Fn] be a possible ordering of S. Our function f computes thescore of O by combining the following three characteristics of O:

(1) selectivity of facets in O,(2) diversity of facets in O, and(3) nesting depth of facets in O.

In particular, for (1) we prefer those facets that are more selective, i.e., tickingvalues in the facet narrow down the search result more rapidly than doing soin other facets. For (2), we prefer those facets that lead to results which arenot covered by selecting other facets. Finally, for (3), we prefer facets that allowdeeper nesting thus allowing explore the graph structure of the underlying data.We now define the functions corresponding each of (1)-(3) and then combinethem into the final scoring function f .

Let F denote a facet, A the set of current answers, r(F ) ✓ A the set ofanswers obtained by selecting any value in F , and d > 0 a predefined thresholdparameter for nesting depth. To account for (1), we compute the selectivity scoreof F given A as follows:

sel(F, A) = 1 � log|A| |r(F )|.

In particular, the less search results is obtained by applying F , the higher theselectivity score of F is. To account for (2), we consider a variation of the Set-Cover ranking [5] to compute the overlap score of Fi 2 O w.r.t. O as follows:

overlap(Fi, O) =1

n ⇥ |r(Fi)|nX

j=1

| r(Fi) \j[

m=1,m 6=i

r(Fm)|

In particular, a high overlap score of facet Fi means that selecting values in itleads to results that are not present in the first highly ranked facets. Thus, highlypositioned facets (according to the overlap score) provide more diverse results.To account for (3), we compute the depth score depth(F, d) of F given d as the

RDF Data, Rules

Client

ServerSnippet Generator

Search Engine

Aggregation Handler

Ranking Handler

RangeHandler

ShortcutsHandler

Facet Generator

SPARQL Q.Constructor

Facet Composer

RDFox: Storage and Query Execution

Processor

LaptopSmartPhone

type

hasPart

(8.270)

(5.850)

(2.490)

Exycon 8 ProcessorGalaxy S8 is built around the new Exynos 8895, featuring new custom CPU cores, new Mali GPU, and with the whole thing fabricated via a cutting-edge 10nm process intended to maximize performance and power efficiency.

Snippet Composer

phonephone

Fig. 2. Left: Example RDF graph about products; Right: Architecture of SemFacet

time. The user is able to do refocusing. This feature of SemFacet can be observed inFigure 1, right, where the user can click on a box in the hasPart facet to change theanswers from phones to their parts, that is, from Samsung S8 to its processor Exycon 8.

On the server side, the system relies on an in-memory triple store RDFox [8] tostore the inverted index, input RDF data, query answers, and all necessary auxiliaryinformation such as materialisation rules which we discuss later in this section. One ofthe server components is an inverted index based full-text search engine; in order toensure a better integration between full-text and faceted search and thus achieve goodefficiency of SemFacet we implemented our own search engine. Another backend com-ponent is snippet composer that for given answer entities retrieves their textual descrip-tions, images, and links. The next component is facet generator that constructs facetedinterfaces in response to user actions. This component is backed by four handlers: ag-gregation, shortcuts, ranking, and ranges. The range handler computes and stores leftand right bounds for the range sliders for numerical facet values, like the price-sliderin Figure 1, that correspond to the lower and upper bounds of possible values for thisproperty name in the underlying RDF data.

Ranking facets. Whenever SemFacet updates the interface it should decide in whichorder to present relevant facets. This is done in two steps. First, we compute the setS of all relevant facets that should be displayed in the same level of nesting. Assumethat S has n facets. Then, for each of n! possible orderings of S we find the optimalorder using our scoring function f and SemFacet displays the facets from S (or someof them) according to this order. Our function f computes the score of a given orderingof facet by combining the following three characteristics of the order: (1) selectivityof its facets, (2) diversity of its facets, and (3) nesting depth of its facets. In particular,for (1) we prefer those facets that are more selective, i.e., ticking values in the facetnarrow down the search result more rapidly than doing so in other facets. For (2), weprefer those facets that lead to results which are not covered by selecting other facets.Finally, for (3), we prefer facets that allow deeper nesting thus allowing explore thegraph structure of the underlying data. We now define the functions corresponding eachof (1)-(3) and then combine them into the final scoring function f .

Ranking facet values. Besides ranking of facets, it is important to rank facet valuesas well. We adopt the count-based ranking (see also [10]): facet values that lead to alarger number of results are ranked higher. Although the computation of counts is con-ceptually trivial, the main challenge is in their update. Indeed, the integers associatedto each facet value in the interface must be updated every time the user interacts withthe faceted interface without affecting the performance of the system. Additional chal-lenges come from (i) graph structure of the data and (ii) interplay of conjunctive anddisjunctive interpretation of facet values. In our experience, implementing the straight-

3

Page 4: Ranking, Aggregation, and Reachability in Faceted …ceur-ws.org/Vol-1963/paper650.pdfRanking Handler Range Handler Shortcuts Handler Facet Generator SPARQL Q. Constructor Facet Composer

forward approach to updating counts leads to a significant increase of the user interfaceresponse time. To mitigate this problem, we adopted a multi-threading solution: eachthread receives a set of facet values for which counts need to be updated and, addition-ally, the load among the threads is balanced. For instance, we avoid the situation whenone thread is busy with updating counts for values with a high number of results only,while another gets away with updating counts for values with a low number of results.This approach led to a significant response time decrease.

Aggregation. Recall that every interface update performed by the user (i.e., refocusing,selection of a facet or a facet value) results in formulating a corresponding SPARQLquery on the fly that is then issued to RDFox. Then the interface is updated (i.e., withsearch results and available facets at this point) depending on the result of this query.When aggregate facets are considered, it is possible to follow the same approach andformulate aggregate SPARQL queries. However, we decided to do materialisation ofaggregate information at loading time instead since (1) RDFox, our back-end system,supports efficient materialisation of aggregate information and, more importantly, (2)non-aggregate SPARQL queries are usually faster to answer which is essential for re-sponsive user interface updates.

Reachability. SemFacet provides the shortcut functionality described in Section 1. Inthe user interface, SemFacet offers a search box within each facet that allows users tosearch for reachable facets. As the user has typed in a facet name F in the box, thesystem checks if such a facet is reachable from the current facet. For this we performbreadth-first search to find all reachable nodes with an outgoing property F and westore corresponding witnessing paths. These paths are important in constructing thecorresponding SPARQL query for fetching possible facet values for F and for furtherfacet navigation. For a faster response time, we predefine a parameter B and checkfacet reachability up to length B instead. In our example, in Figure 1 (right) the usercan search for the inContinent facet, which in this case is reachable within 2 steps, andthen select ‘Asia’, thus selecting processors produced by an asian company.

References

1. SemFacet Project Page. http://www.cs.ox.ac.uk/isg/tools/SemFacet/2. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Towards

Semantic Faceted Search. In: Proc. of WWW (Companion Volume). pp. 219–220 (2014)3. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Faceted

search over RDF-based knowledge graphs. J. Web Semantics 37, 55–74 (2016)4. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D.: Enabling

Faceted Search over OWL 2 with SemFacet. In: Proc. of OWLED. pp. 121–132 (2014)5. Arenas, M., Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D., Jimenez-

Ruiz, E.: SemFacet: Semantic Faceted Search over Yago. In: Proc. of WWW (CompanionVolume). pp. 123–126 (2014)

6. Cuenca Grau, B., Kharlamov, E., Marciuska, S., Zheleznyakov, D., Zhou, Y.: Querying LifeScience Ontologies with SemFacet. In: Proc. of SWAT4LS (2014)

7. Grau, B.C., Kharlamov, E., Marciuska, S., Zheleznyakov, D., Arenas, M.: Semfacet: Facetedsearch over ontology enhanced knowledge graphs. In: ISWC Posters & Demos (2016)

8. Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel Materialisation of DatalogPrograms in Centralised, Main-Memory RDF Systems. In: Proc. of AAAI (2014)

9. Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, andServices, Morgan & Claypool Publishers (2009)

10. Wagner, A.J.: Faceted semantic search, technical report. Tech. rep.

4