A Survey of Entity Ranking over RDF Graphs


An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").


A Survey of Entity Ranking over RDF Graphs

Nikita Zhiltsov

Kazan Federal University, Russia

November 29, 2013

1 / 60

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

2 / 60

Motivation

- An increasing amount of valuable semi-structured data has become available online, e.g.
  - RDF graphs: the Linking Open Data (LOD) cloud
  - Web pages enhanced with microformats, RDFa, etc.: CommonCrawl, Web Data Commons
  - Google: Freebase Annotations of the ClueWeb Corpora
- More than half of the queries in real query logs have an entity-centric user intent
- Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands ⇒

3 / 60

Google Knowledge Graph

4 / 60

Facebook Graph Search

5 / 60

Yandex Islands

6 / 60

Overview of Semantic Search Approaches

T. Tran, P. Mika. Semantic Search - Systems, Concepts, Methods and Communities behind It

7 / 60

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

8 / 60

In this talk, we focus on entity ranking over RDF graphs given a keyword search query.

9 / 60

Key Issues in Entity Ranking

- Ambiguity in names
- Related entities from heterogeneous data sources
- Complex queries with clarifying terms

10 / 60

Key Issues in Entity Ranking
Ambiguity in names

Given a query "university of michigan",

- University of Michigan, Ann Arbor (relevant)
- Central Michigan University, Michigan Technological University, Michigan State University (not relevant)

11 / 60

Key Issues in Entity Ranking
Related entities from heterogeneous data sources

Given a query "harry potter movie",

Semantic link information can effectively enhance term context.

12 / 60

Key Issues in Entity Ranking
Complex queries with clarifying terms

Given a query "shobana masala", the user intent is likely about Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre.

13 / 60

Ad-hoc Object Retrieval in the Web of Data

Jeffrey Pound, Peter Mika, Hugo Zaragoza

WWW 2010

14 / 60

Query Categories

- Entity query (∼ 40%∗), e.g. "1978 cj5 jeep"
- Type query† (∼ 12%), e.g. "doctors in barcelona"
- Attribute query (∼ 5%), e.g. "zip code atlanta"
- Other query (∼ 36%)
  - however, ∼ 14% of them contain a context entity or type

∗ estimated on real query logs from Yahoo!
† a.k.a. list search query

15 / 60

Repeatable and Reliable Search System Evaluation using Crowdsourcing

Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh D. Tran

SIGIR 2011

16 / 60

Data Collection

- Billion Triples Challenge 2009 RDF data set
- The size of the uncompressed data is 247 GB: 1.4B triples describing 114 million objects
- It was composed by combining crawls of multiple RDF search engines

17 / 60

Data Collection
Classes

18 / 60

Data Collection
Properties

19 / 60

Data Collection
Sources

20 / 60

Query Set Preparation

1 Emulate top queries
  - Given a Microsoft Live Search log containing queries repeated by at least 10 different users
  - Sample 50 queries prefiltered with a NER and a gazetteer
2 Emulate long-tailed queries
  - Given the Yahoo! Search Query Log Tiny Sample v1.0 - 4,500 queries
  - Sample and manually filter out ambiguous queries ⇒ 42 queries
3 ⇒ a list of 92 queries

21 / 60

Crowdsourcing Judgements

- A purpose-built rendering tool to present the search results
- The evaluation (MT1) was conducted and then repeated (MT2) after 6 months
- Using Amazon Mechanical Turk HITs
- Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts
- 64 workers for MT1 and 69 workers for MT2

22 / 60

Rendering Tool

23 / 60

Analysis of Results
Repeatability

- The level of agreement is the same for the two pools
- The rank order of the systems is unchanged

24 / 60

Targeting Evaluation Measures I

All the measures are usually computed on the top-10 search results (k = 10).

1 P@k (precision at k):

P@k(π, l) = (1/k) · ∑_{t=1}^{k} I{l_π(t) = 1}

2 MAP (mean average precision):

AP(π, l) = (1/m₁) · ∑_{k=1}^{m} P@k · I{l_π(k) = 1},

where m is the number of retrieved results and m₁ is the number of relevant ones.

MAP = mean of AP over all queries

25 / 60

Targeting Evaluation Measures II

3 NDCG: normalized discounted cumulative gain

DCG@k(π, l) = ∑_{j=1}^{k} G(l_π(j)) · η(j),

where G(·), the gain of a result with rating l_π(j) ∈ {0, 1, 2}, is usually G(z) = 2^z − 1, and the discount is η(j) = 1/log(j + 1).

NDCG@k(π, l) = (1/Z_k) · DCG@k(π, l),

where Z_k normalizes so that the ideal ranking gets NDCG@k = 1.
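For concreteness, the three measures can be sketched in Python, taking relevance labels listed in rank order; the function names and input format are mine, not from the talk:

```python
import math

def precision_at_k(labels, k=10):
    """P@k: fraction of the top-k results judged relevant (label >= 1)."""
    return sum(1 for l in labels[:k] if l >= 1) / k

def average_precision(labels):
    """AP: mean of P@k taken at the rank k of each relevant result."""
    m1 = sum(1 for l in labels if l >= 1)   # number of relevant results
    if m1 == 0:
        return 0.0
    return sum(precision_at_k(labels, k)
               for k, l in enumerate(labels, start=1) if l >= 1) / m1

def ndcg_at_k(labels, k=10):
    """NDCG@k with gain G(z) = 2^z - 1 and discount 1 / log(j + 1)."""
    def dcg(ls):
        return sum((2 ** l - 1) / math.log(j + 1)
                   for j, l in enumerate(ls[:k], start=1))
    ideal = dcg(sorted(labels, reverse=True))   # Z_k: DCG of the ideal ranking
    return dcg(labels) / ideal if ideal > 0 else 0.0
```

MAP is then the mean of `average_precision` over all queries in the benchmark.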

26 / 60

Analysis of Results
Reliability

Metric | Difference
MAP    | 1.8%
NDCG   | 3.5%
P@10   | 12.8%

- In this setting, experts rate more results negatively than workers do
- P@10 is more fragile than MAP and NDCG

27 / 60

Yahoo! SemSearch Challenge (YSC) 2010 & 2011
http://semsearch.yahoo.com

28 / 60

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

29 / 60

Entity Search Track Submission byYahoo! Research Barcelona

Roi Blanco, Peter Mika, Hugo Zaragoza

SSW at WWW 2010

30 / 60

YSC 2010 Winner Approach

- Only RDF S-P-O triples with literal values are considered
- Triples are filtered by predicates from a predefined list of 300 predicates
- Triples about the same subject are grouped into a pseudo-document with multiple fields
- The BM25F ranking formula is applied (the weighting scheme w_c is handcrafted):

BM25F = ∑_{t ∈ q∩d} tf(t, d) / (k1 + b · tf(t, d)) · idf(t),

where tf(t, d) = ∑_{c ∈ d} w_c · tf_c(t, d)
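A minimal sketch of the field-weighted scoring above, assuming tokenized fields and precomputed idf values; the field names, weights, and parameters are illustrative, not the handcrafted ones from the submission:

```python
def bm25f_score(query_terms, doc_fields, field_weights, idf, k1=1.2, b=0.75):
    """Field-weighted BM25F: per-field term frequencies tf_c(t, d) are combined
    with field weights w_c before the saturating tf / (k1 + b * tf) transform."""
    score = 0.0
    for t in set(query_terms):
        # tf(t, d) = sum over fields c of w_c * tf_c(t, d)
        tf = sum(w * doc_fields[c].count(t) for c, w in field_weights.items())
        if tf > 0:
            score += tf / (k1 + b * tf) * idf.get(t, 0.0)
    return score
```

Usage on a toy pseudo-document: `bm25f_score(["obama", "president"], {"name": ["barack", "obama"], "attributes": ["president", "usa"]}, {"name": 2.0, "attributes": 1.0}, {"obama": 3.0, "president": 1.5})`.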

31 / 60

Sindice BM25MF at SemSearch 2011

Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello

SSW at WWW 2011

32 / 60

YSC 2011 Winner Approach I

- URI resolution for triple objects
- Extended BM25F approach with additional normalization of term frequencies per predicate type:
  - The weighting scheme is handcrafted
  - The proportion of query terms in entity literals

33 / 60

YSC 2011 Winner Approach II

RDF graph example:

34 / 60

YSC 2011 Winner Approach III

Star-shaped query matching the entity:

35 / 60

YSC 2011 Winner Approach IV

Empirical weights:

36 / 60

On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data

Robert Neumayer, Krisztian Balog, Kjetil Nørvåg

ECIR 2012

37 / 60

Approach to entity representation I

RDF graph example:

38 / 60

Approach to entity representation II

a) Unstructured Entity Model; b) Structured Entity Model:

39 / 60

Main Findings

- Two generative language models (LMs) for the task:
  - Unstructured Entity Model
  - Structured Entity Model
- The evaluation on the YSC data shows that representing relations as a mixture of predicate-type LMs can contribute significantly to overall performance

40 / 60

LM Retrieval Framework

P(e|q) = P(q|e) · P(e) / P(q) ∝_rank P(q|e) · P(e),

where P(e|q) is the probability that entity e is relevant given query q.

Further assumptions: (i) P(e) is uniform; (ii) query terms are i.i.d.

Let θ_e be the entity model that predicts how likely the entity is to produce a given term t; then the query likelihood is

P(q|θ_e) = ∏_{t ∈ q} P(t|θ_e)^tf(t,q)

41 / 60

Unstructured Entity Model

Idea: collapse all text values of properties associated with the entity into a single document and apply standard IR techniques.

The entity model is a Dirichlet-smoothed multinomial distribution:

P(t|θ_e) = (tf(t, e) + μ · P(t|θ_c)) / (|e| + μ)
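A sketch of scoring with this smoothed model, computed in log space for numerical stability; the tokenization and the µ value here are assumptions, not fixed by the talk:

```python
import math
from collections import Counter

def dirichlet_lm_score(query_terms, entity_tokens, collection_tokens, mu=2000):
    """Log query likelihood under the Dirichlet-smoothed entity model:
    P(t | theta_e) = (tf(t, e) + mu * P(t | theta_c)) / (|e| + mu)."""
    e_tf, c_tf = Counter(entity_tokens), Counter(collection_tokens)
    e_len, c_len = len(entity_tokens), len(collection_tokens)
    logp = 0.0
    for t in query_terms:
        p_c = c_tf[t] / c_len                    # collection (background) model
        p_t = (e_tf[t] + mu * p_c) / (e_len + mu)
        if p_t == 0.0:                           # term unseen everywhere
            return float("-inf")
        logp += math.log(p_t)
    return logp
```

Here `entity_tokens` is the single pseudo-document obtained by collapsing all the entity's literal values, as the slide describes.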

42 / 60

Structured Entity Model
Folding Predicates

Group RDF triples by the following predicate types pt:

- Name, e.g. literal values of foaf:name, rdfs:label
- Attributes, i.e. the remaining datatype properties
- OutRelations: resolving "object" (O) URIs in S-P-O triples to get their names
- InRelations: resolving "subject" (S) URIs in S-P-O triples to get their names

43 / 60

Structured Entity Model
Mixture of Language Models

Each group has its own LM P(t|θ_e^pt):

P(t|θ_e^pt) = (tf(t, pt, e) + μ_pt · P(t|θ_c^pt)) / (|pt, e| + μ_pt)

Then the entity model is a linear mixture of the predicate-type LMs:

P(t|θ_e) = ∑_pt P(t|θ_e^pt) · P(pt)
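The mixture can be sketched by reusing a per-field Dirichlet-smoothed LM; the field names and the priors P(pt) below are illustrative assumptions:

```python
from collections import Counter

def field_lm(term, field_tokens, coll_tokens, mu):
    """Dirichlet-smoothed LM for a single predicate-type field pt."""
    p_c = Counter(coll_tokens)[term] / max(len(coll_tokens), 1)
    return (Counter(field_tokens)[term] + mu * p_c) / (len(field_tokens) + mu)

def mixture_lm(term, entity_fields, coll_fields, field_priors, mu=100):
    """P(t | theta_e) = sum over pt of P(t | theta_e^pt) * P(pt)."""
    return sum(field_priors[pt] * field_lm(term, toks, coll_fields[pt], mu)
               for pt, toks in entity_fields.items())
```

With fields such as Name, Attributes, OutRelations, and InRelations, each field gets its own smoothing parameter and prior, which is what lets relation evidence contribute separately from literal text.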

44 / 60

Comparative Evaluation

Model | MAP | P@10 | NDCG

YSC 2010
UEM | 0.207 | 0.314 | 0.383
SEM | 0.282 (+36.2%) | 0.400 (+27.4%) | 0.494 (+29.0%)

YSC 2011
UEM | 0.207 | 0.188 | 0.295
SEM | 0.261 (+26.1%) | 0.242 (+28.7%) | 0.400 (+35.6%)

The multi-fielded document approach improves the targeted measures by 26-36%.

45 / 60

Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs

He Hu, Xiaoyang Du

FSKD 2012

46 / 60

Approach I

- Considering 2- to 5-grams while indexing entity URIs as well as literals
- Thinking of URIs as hierarchical names
- Computing the entity-query similarity scores:

sim_URI(Q) = ngram_hit_count / ((| |Q| − |URI.path| | + 1) · (URI.depth + 1))

sim_LITERAL(Q) = ngram_hit_count / (| |Q| − |LITERAL.length| | + 1)
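One possible reading of these formulas in Python, assuming whitespace-tokenized queries and URI path segments as tokens; the variable names are mine:

```python
def word_ngrams(tokens, n_min=2, n_max=5):
    """All word n-grams of length 2..5, matching the indexing step above."""
    return {tuple(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def sim_uri(query_tokens, uri_path_tokens, uri_depth):
    """Entity-query similarity for a URI viewed as a hierarchical name."""
    hits = len(word_ngrams(query_tokens) & word_ngrams(uri_path_tokens))
    return hits / ((abs(len(query_tokens) - len(uri_path_tokens)) + 1)
                   * (uri_depth + 1))

def sim_literal(query_tokens, literal_tokens):
    """Analogous similarity for a literal value."""
    hits = len(word_ngrams(query_tokens) & word_ngrams(literal_tokens))
    return hits / (abs(len(query_tokens) - len(literal_tokens)) + 1)
```

Deeper URIs are penalized by the `(URI.depth + 1)` factor, which reflects the hierarchical-name intuition.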

47 / 60

Approach II

- Ranking score:

Score_URI(Q) = 1 − e^(−sim(Q))

- Taking advantage of iterative PageRank-like weight propagation:

W_URI_hit(i+1) = α · W_URI_hit(i)

W_URI_unhit(i+1) = (1 − α) · W_URI_hit_neighbors(i) / N_URI_hit_neighbors

- Improvement of up to 80% w.r.t. the plain n-gram ranker
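A toy sketch of the propagation rules, assuming an adjacency-list graph and treating "hit" nodes as those matched by the n-gram ranker; α and the iteration count here are illustrative:

```python
def propagate_weights(graph, hit_weights, alpha=0.85, iterations=3):
    """One interpretation of the update rules above: 'hit' nodes retain an
    alpha fraction of their weight; 'unhit' nodes receive a (1 - alpha)
    fraction of the average weight of their hit neighbours."""
    weights = {node: hit_weights.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        new = {}
        for node, neighbours in graph.items():
            if node in hit_weights:
                new[node] = alpha * weights[node]
            else:
                hit_nb = [nb for nb in neighbours if nb in hit_weights]
                new[node] = ((1 - alpha) * sum(weights[nb] for nb in hit_nb)
                             / len(hit_nb)) if hit_nb else 0.0
        weights = new
    return weights
```

This lets entities that were not matched directly by any n-gram still accumulate weight from matched neighbours in the RDF graph.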

48 / 60

Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval

Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux

SIGIR 2012

49 / 60

Hybrid Search System

50 / 60

Structured Inverted Index

Consider the following property values as fields:

- URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama', etc.
- Labels: values of a list of manually selected datatype properties
- Attributes: other properties

BM25F is used as the ranking function.

51 / 60

Graph-based Entity Search

1 Given a query q, obtain a list of entities Retr = {e1, e2, . . . , en} ranked by their BM25F scores
2 Use the top-N elements as seeds for graph traversal
3 To get StructRetr = {e′1, . . . , e′m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e′) > τ
4 Combine the two rankings:

finalScore(q, e′) = λ × BM25(q, e) + (1 − λ) × JW(q, e′)

‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect

52 / 60
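Jaro-Winkler is a standard string-similarity metric; a self-contained sketch of it together with the score combination above (λ is a free parameter, and the implementation details here are mine, not from the paper):

```python
def jaro(s1, s2):
    """Jaro similarity: shared characters within a sliding window,
    penalized by transpositions between the matched sequences."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    flags1, flags2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, f in zip(s1, flags1) if f]
    seq2 = [c for c, f in zip(s2, flags2) if f]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boosts the Jaro score for strings sharing a common prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def final_score(bm25_score, jw_score, lam=0.5):
    """finalScore(q, e') = lambda * BM25(q, e) + (1 - lambda) * JW(q, e')."""
    return lam * bm25_score + (1 - lam) * jw_score
```

In the hybrid system, `jw_score` is computed between the query and a candidate entity's label, which filters traversal candidates (JW > τ) before the linear combination.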

Evaluation

- The graph-based approach (S1_1) outperforms BM25 scoring, with a 25% improvement in MAP on the 2010 data set
- No significant improvement over the baseline on the 2011 data set; this may be explained by the scarcity of the exploited predicates (owl:sameAs volume < 0.7%)

53 / 60

Improving Entity Search over Linked Data by Modeling Latent Semantics

Nikita Zhiltsov, Eugene Agichtein

CIKM 2013

54 / 60

Key Contributions

- A tensor-factorization-based approach to incorporate semantic link information into the ranking model
- Outperforms the state-of-the-art baseline in NDCG/MAP/P@10
- A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set

⇒ more details in the next talk

55 / 60

Negative results

The ideas that do not work out

56 / 60

Negative Results

The ideas from standard IR that do not work out:

- WordNet-based query expansion [Tonon et al., SIGIR 2012]
- Pseudo-relevance feedback [Tonon et al., SIGIR 2012]
- Query suggestions from a commercial search engine [Tonon et al., SIGIR 2012]
- Direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2010; Dali et al., 2012]

57 / 60

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

58 / 60

Wrap up

- Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR
- There is a robust and consistent evaluation methodology for it
- State-of-the-art approaches revolve around applications of well-known IR methods
- There is a lack of approaches for leveraging semantic links
- Lots of data: scalability really matters

59 / 60

Thanks for your attention!

60 / 60
