A Survey of Entity Ranking over RDF Graphs Nikita Zhiltsov Kazan Federal University Russia November 29, 2013 1 / 60

A Survey of Entity Ranking over RDF Graphs

DESCRIPTION

An increasing amount of valuable semi-structured data has become available online. In this talk, we give an overview of the state of the art in entity ranking over structured data ("linked data").

Page 1: A Survey of Entity Ranking over RDF Graphs

A Survey of Entity Ranking over RDF Graphs

Nikita Zhiltsov

Kazan Federal University, Russia

November 29, 2013

1 / 60

Page 2: A Survey of Entity Ranking over RDF Graphs

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

2 / 60

Page 3: A Survey of Entity Ranking over RDF Graphs

Motivation

- An increasing amount of valuable semi-structured data has become available online, e.g.
  - RDF graphs: Linking Open Data (LOD) cloud
  - Web pages enhanced with microformats, RDFa etc.: CommonCrawl, Web Data Commons
  - Google: Freebase Annotations of the ClueWeb Corpora
- More than half of the queries in real query logs have an entity-centric user intent
- Examples from industry: Google Knowledge Graph, Facebook Graph Search, Yandex Islands ⇒

3 / 60

Page 4: A Survey of Entity Ranking over RDF Graphs

Google Knowledge Graph

4 / 60

Page 5: A Survey of Entity Ranking over RDF Graphs

Facebook Graph Search

5 / 60

Page 6: A Survey of Entity Ranking over RDF Graphs

Yandex Islands

6 / 60

Page 7: A Survey of Entity Ranking over RDF Graphs

Overview of Semantic Search Approaches

T. Tran, P. Mika. Semantic Search - Systems, Concepts, Methods and Communities behind It

7 / 60

Page 8: A Survey of Entity Ranking over RDF Graphs

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

8 / 60

Page 9: A Survey of Entity Ranking over RDF Graphs

In this talk, we focus on entity ranking over RDF graphs given a keyword search query

9 / 60

Page 10: A Survey of Entity Ranking over RDF Graphs

Key Issues in Entity Ranking

- Ambiguity in names
- Related entities from heterogeneous data sources
- Complex queries with clarifying terms

10 / 60

Page 11: A Survey of Entity Ranking over RDF Graphs

Key Issues in Entity Ranking: Ambiguity in names

Given a query university of michigan,

- University of Michigan, Ann Arbor (the intended result)
- Central Michigan University, Michigan Technological University, Michigan State University (not intended)

11 / 60

Page 12: A Survey of Entity Ranking over RDF Graphs

Key Issues in Entity Ranking: Related entities from heterogeneous data sources

Given a query harry potter movie,

Semantic link information can effectively enhance the term context

12 / 60

Page 13: A Survey of Entity Ranking over RDF Graphs

Key Issues in Entity Ranking: Complex queries with clarifying terms

Given a query shobana masala, the user intent is likely about Shobana Chandrakumar, an Indian actress starring in movies of the Masala genre

13 / 60

Page 14: A Survey of Entity Ranking over RDF Graphs

Ad-hoc Object Retrieval in the Web of Data

Jeffrey Pound, Peter Mika, Hugo Zaragoza

WWW 2010

14 / 60

Page 15: A Survey of Entity Ranking over RDF Graphs

Query Categories

- Entity query (∼ 40%∗), e.g. 1978 cj5 jeep
- Type query† (∼ 12%), e.g. doctors in barcelona
- Attribute query (∼ 5%), e.g. zip code atlanta
- Other query (∼ 36%)
  - however, ∼ 14% of them contain a context entity or type

∗ estimated on real query logs from Yahoo!
† a.k.a. list search query

15 / 60

Page 16: A Survey of Entity Ranking over RDF Graphs

Repeatable and Reliable Search System Evaluation using Crowdsourcing

Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh D. Tran

SIGIR 2011

16 / 60

Page 17: A Survey of Entity Ranking over RDF Graphs

Data Collection

- Billion Triples Challenge 2009 RDF data set
- The uncompressed data is 247 GB in size: 1.4B triples describing 114 million objects
- It was composed by combining crawls of multiple RDF search engines

17 / 60

Page 18: A Survey of Entity Ranking over RDF Graphs

Data Collection: Classes

18 / 60

Page 19: A Survey of Entity Ranking over RDF Graphs

Data Collection: Properties

19 / 60

Page 20: A Survey of Entity Ranking over RDF Graphs

Data Collection: Sources

20 / 60

Page 21: A Survey of Entity Ranking over RDF Graphs

Query Set Preparation

1. Emulate top queries
   - Given a Microsoft Live Search log containing queries repeated by at least 10 different users
   - Sample 50 queries prefiltered with a named-entity recognizer (NER) and a gazetteer
2. Emulate long-tailed queries
   - Given Yahoo! Search Query Log Tiny Sample v1.0 (4,500 queries)
   - Sample and manually filter out ambiguous queries ⇒ 42 queries
3. ⇒ a list of 92 queries

21 / 60

Page 22: A Survey of Entity Ranking over RDF Graphs

Crowdsourcing Judgements

- A purpose-built rendering tool to present the search results
- The evaluation (MT1) was conducted and then repeated (MT2) after 6 months
- Using Amazon Mechanical Turk HITs
- Each HIT consists of 12 query-result pairs: 10 real ones and 2 from a "gold standard" annotated by experts
- 64 workers for MT1 and 69 workers for MT2

22 / 60

Page 23: A Survey of Entity Ranking over RDF Graphs

Rendering Tool

23 / 60

Page 24: A Survey of Entity Ranking over RDF Graphs

Analysis of Results: Repeatability

- The level of agreement is the same for the two pools
- The rank order of the systems is unchanged

24 / 60

Page 25: A Survey of Entity Ranking over RDF Graphs

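Targeting Evaluation Measures I

All the measures are usually computed on the top-10 search results (k = 10)

1. P@k (precision at k):

$$P@k(\pi, l) = \frac{\sum_{t \le k} I\{l_{\pi(t)} = 1\}}{k}$$

2. MAP (mean average precision):

$$AP(\pi, l) = \frac{\sum_{k=1}^{m} P@k \cdot I\{l_{\pi(k)} = 1\}}{m_1}$$

Here $l_{\pi(t)}$ is the relevance label of the result at rank $t$, $m$ is the number of ranked results, and $m_1$ is the number of relevant ones.

MAP = mean of AP over all queries

25 / 60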
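The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the cited work: the function names are made up here, labels are assumed to be binary and listed in rank order, and $m_1$ is assumed to count only the relevant results that appear in the ranked list.

```python
from typing import Sequence


def precision_at_k(labels: Sequence[int], k: int) -> float:
    """P@k: fraction of the top-k results whose label is 1.

    labels[i] is the binary relevance label of the result at rank i + 1.
    """
    return sum(1 for label in labels[:k] if label == 1) / k


def average_precision(labels: Sequence[int]) -> float:
    """AP: sum of P@k over the ranks k holding a relevant result, divided by m_1."""
    m1 = sum(1 for label in labels if label == 1)
    if m1 == 0:
        return 0.0  # assumption: a query with no relevant results scores 0
    total = 0.0
    for k, label in enumerate(labels, start=1):
        if label == 1:
            total += precision_at_k(labels, k)
    return total / m1


def mean_average_precision(runs: Sequence[Sequence[int]]) -> float:
    """MAP: mean of AP over all queries (one label list per query)."""
    return sum(average_precision(labels) for labels in runs) / len(runs)
```

For the ranked labels `[1, 0, 1, 0]`, relevant results sit at ranks 1 and 3, so AP averages P@1 = 1 and P@3 = 2/3.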

Page 26: A Survey of Entity Ranking over RDF Graphs

Targeting Evaluation Measures II

3. NDCG: normalized discounted cumulative gain

$$DCG@k(\pi, l) = \sum_{j=1}^{k} G(l_{\pi(j)}) \cdot \eta(j),$$

where $G(\cdot)$, the rating of a document, is usually $G(z) = 2^z - 1$, $\eta(j) = \frac{1}{\log(j+1)}$, $l_{\pi(j)} \in \{0, 1, 2\}$

$$NDCG@k(\pi, l) = \frac{1}{Z_k} DCG@k(\pi, l)$$
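A small Python sketch of NDCG@k under the definitions above. Two assumptions are made here that the slide leaves open: the logarithm is taken base 2 (the base cancels in the normalized score), and the normalizer $Z_k$ is taken to be the DCG@k of the ideal ordering of the same label list, which is the usual convention. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG@k with gain G(z) = 2^z - 1 and discount 1 / log2(j + 1)."""
    return sum(
        (2 ** label - 1) / math.log2(j + 1)
        for j, label in enumerate(labels[:k], start=1)
    )


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """NDCG@k: DCG@k divided by Z_k, the DCG@k of the ideal ordering."""
    z_k = dcg_at_k(sorted(labels, reverse=True), k)
    if z_k == 0.0:
        return 0.0  # assumption: no relevant results means a score of 0
    return dcg_at_k(labels, k) / z_k
```

In a real evaluation the ideal ordering would be built from all judged results for the query, not only from those the system returned.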

26 / 60

Page 27: A Survey of Entity Ranking over RDF Graphs

Analysis of Results: Reliability

Metric   Difference
MAP      1.8%
NDCG     3.5%
P@10     12.8%

- In this setting, experts rate more results as negative than workers do
- P@10 is more fragile than MAP and NDCG

27 / 60

Page 28: A Survey of Entity Ranking over RDF Graphs

Yahoo! SemSearch Challenge (YSC) 2010 & 2011

http://semsearch.yahoo.com

28 / 60

Page 29: A Survey of Entity Ranking over RDF Graphs

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

29 / 60

Page 30: A Survey of Entity Ranking over RDF Graphs

Entity Search Track Submission by Yahoo! Research Barcelona

Roi Blanco, Peter Mika, Hugo Zaragoza

SSW at WWW 2010

30 / 60

Page 31: A Survey of Entity Ranking over RDF Graphs

YSC 2010 Winner Approach

- Only RDF S-P-O triples with literals are considered
- Triples are filtered by predicates from a predefined list of 300 predicates
- Triples about the same subject are grouped into a pseudo document with multiple fields
- The BM25F ranking formula is applied (the weighting scheme $w_c$ is handcrafted):

$$BM25F = \sum_{t \in q \cap d} \frac{tf(t, d)}{k_1 + b \cdot tf(t, d)} \cdot idf(t),$$

$$tf(t, d) = \sum_{c \in d} w_c \cdot tf_c(t, d)$$
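The scoring step can be sketched as follows. This follows the simplified formula as it appears on the slide, not the winning system's actual code: the pseudo document is assumed to be a mapping from field name to a token list, the `idf` values are assumed to be precomputed, and the defaults for `k1` and `b` as well as all names are placeholders chosen for illustration.

```python
from typing import Dict, List, Sequence


def weighted_tf(term: str,
                doc_fields: Dict[str, List[str]],
                field_weights: Dict[str, float]) -> float:
    """tf(t, d) = sum over fields c of w_c * tf_c(t, d)."""
    return sum(
        field_weights.get(field, 0.0) * tokens.count(term)
        for field, tokens in doc_fields.items()
    )


def bm25f_score(query_terms: Sequence[str],
                doc_fields: Dict[str, List[str]],
                field_weights: Dict[str, float],
                idf: Dict[str, float],
                k1: float = 1.2,
                b: float = 1.0) -> float:
    """Sum over terms in both query and document of tf / (k1 + b * tf) * idf."""
    score = 0.0
    for term in set(query_terms):
        tf = weighted_tf(term, doc_fields, field_weights)
        if tf > 0:  # only terms in q ∩ d contribute
            score += tf / (k1 + b * tf) * idf.get(term, 0.0)
    return score
```

Fuller BM25F formulations also normalize each field's term frequency by field length; that step is left out here because the slide's formula does not show it.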

31 / 60

Page 32: A Survey of Entity Ranking over RDF Graphs

Sindice BM25MF at SemSearch 2011

Stephane Campinas, Renaud Delbru, Nur A. Rakhmawati, Diego Ceccarelli, Giovanni Tummarello

SSW at WWW 2011

32 / 60

Page 33: A Survey of Entity Ranking over RDF Graphs

YSC 2011 Winner Approach I

- URI resolution for triple objects
- Extended BM25F approach with additional normalization for term frequencies per predicate type:
  - The weighting scheme is handcrafted
  - The proportion of query terms in entity literals

33 / 60

Page 34: A Survey of Entity Ranking over RDF Graphs

YSC 2011 Winner Approach II

RDF graph example:

34 / 60

Page 35: A Survey of Entity Ranking over RDF Graphs

YSC 2011 Winner Approach III

Star-shaped query matching the entity:

35 / 60

Page 36: A Survey of Entity Ranking over RDF Graphs

YSC 2011 Winner Approach IV

Empirical weights:

36 / 60

Page 37: A Survey of Entity Ranking over RDF Graphs

On the Modeling of Entities for Ad-Hoc Entity Search in the Web of Data

Robert Neumayer, Krisztian Balog, Kjetil Nørvåg

ECIR 2012

37 / 60

Page 38: A Survey of Entity Ranking over RDF Graphs

Approach to entity representation I

RDF graph example:

38 / 60

Page 39: A Survey of Entity Ranking over RDF Graphs

Approach to entity representation II

a) Unstructured Entity Model; b) Structured Entity Model:

39 / 60

Page 40: A Survey of Entity Ranking over RDF Graphs

Main Findings

- Two generative language models (LMs) for the task:
  - Unstructured Entity Model
  - Structured Entity Model
- The evaluation on the YSC data shows that representing relations as a mixture of predicate type LMs can contribute significantly to overall performance

40 / 60

Page 41: A Survey of Entity Ranking over RDF Graphs

LM Retrieval Framework

$$P(e|q) = \frac{P(q|e) P(e)}{P(q)} \stackrel{rank}{=} P(q|e) P(e),$$

where $P(e|q)$ is the probability of entity $e$ being relevant given query $q$

Further Assumptions: (i) $P(e)$ is uniform; (ii) query terms are i.i.d.

Let $\theta_e$ be the entity model that predicts how likely the entity would produce a given term $t$; then the query likelihood is

$$P(q|\theta_e) = \prod_{t \in q} P(t|\theta_e)^{tf(t, q)}$$

41 / 60

Page 42: A Survey of Entity Ranking over RDF Graphs

Unstructured Entity Model

Idea: Collapse all text values of properties associated with the entity into a single document and apply standard IR techniques

The entity model is a Dirichlet-smoothed multinomial distribution:

$$P(t|\theta_e) = \frac{tf(t, e) + \mu P(t|\theta_c)}{|e| + \mu}$$
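The Unstructured Entity Model combined with the query likelihood from the LM retrieval framework can be sketched as below. It is an illustration under assumptions: the entity is assumed to be already flattened into a token list, the collection model `collection_prob` is assumed to be given as a term-to-probability mapping, and the default `mu` is a commonly used placeholder, not a value taken from the paper. Scores are computed in log space to avoid underflow.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def dirichlet_prob(term: str,
                   entity_tf: Counter,
                   entity_len: int,
                   collection_prob: Dict[str, float],
                   mu: float = 2000.0) -> float:
    """P(t | theta_e) = (tf(t, e) + mu * P(t | theta_c)) / (|e| + mu)."""
    return (entity_tf[term] + mu * collection_prob.get(term, 0.0)) / (entity_len + mu)


def log_query_likelihood(query_terms: Sequence[str],
                         entity_tokens: Sequence[str],
                         collection_prob: Dict[str, float],
                         mu: float = 2000.0) -> float:
    """log P(q | theta_e) = sum over query terms of tf(t, q) * log P(t | theta_e)."""
    entity_tf = Counter(entity_tokens)
    log_p = 0.0
    for term, tf_q in Counter(query_terms).items():
        p = dirichlet_prob(term, entity_tf, len(entity_tokens), collection_prob, mu)
        if p == 0.0:
            return float("-inf")  # term unseen in both entity and collection
        log_p += tf_q * math.log(p)
    return log_p
```

Because $P(e)$ is assumed uniform, ranking entities by this log-likelihood is equivalent to ranking them by $P(e|q)$.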

42 / 60

Page 43: A Survey of Entity Ranking over RDF Graphs

Structured Entity Model: Folding Predicates

Group RDF triples by the following predicate types pt:

- Name, e.g. literal values of foaf:name, rdfs:label
- Attributes, i.e. remaining datatype properties
- OutRelations: resolving "object" (O) URIs in S-P-O triples to get their names
- InRelations: resolving "subject" (S) URIs in S-P-O triples to get their names
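The folding step can be sketched as a single pass over the triples. The data layout is an assumption made for this illustration: each triple is a tuple `(subject, predicate, object, object_is_literal)`, and `names` is a hypothetical lookup from entity URI to its name. The set of name predicates contains only the two examples from the slide.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str, bool]  # (subject, predicate, object, object_is_literal)

# Only the two example predicates from the slide; a real list would be longer.
NAME_PREDICATES = {"foaf:name", "rdfs:label"}


def fold_predicates(entity: str,
                    triples: Iterable[Triple],
                    names: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the text associated with `entity` into the four predicate types."""
    fields: Dict[str, List[str]] = {
        "Name": [], "Attributes": [], "OutRelations": [], "InRelations": [],
    }
    for s, p, o, is_literal in triples:
        if s == entity and is_literal:
            key = "Name" if p in NAME_PREDICATES else "Attributes"
            fields[key].append(o)
        elif s == entity and o in names:
            # outgoing link: replace the object URI with that entity's name
            fields["OutRelations"].append(names[o])
        elif o == entity and not is_literal and s in names:
            # incoming link: replace the subject URI with that entity's name
            fields["InRelations"].append(names[s])
    return fields
```

Each of the four lists then serves as one field of the entity's multi-fielded representation.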

43 / 60

Page 44: A Survey of Entity Ranking over RDF Graphs

Structured Entity Model: Mixture of Language Models

Each group has its own LM $P(t|\theta_e^{pt})$:

$$P(t|\theta_e^{pt}) = \frac{tf(t, pt, e) + \mu_{pt} P(t|\theta_c^{pt})}{|pt, e| + \mu_{pt}}$$

Then, the entity model is a linear mixture of the predicate type LMs:

$$P(t|\theta_e) = \sum_{pt} P(t|\theta_e^{pt}) P(pt)$$
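A minimal sketch of the mixture, assuming the entity is stored as a mapping from predicate type to token list. The per-type collection models, the smoothing parameters `mus` and the priors `P(pt)` are all assumed to be supplied by the caller; how they are estimated is not covered here.

```python
from typing import Dict, List


def field_prob(term: str,
               field_tokens: List[str],
               field_collection_prob: Dict[str, float],
               mu: float) -> float:
    """Dirichlet-smoothed P(t | theta_e^pt) for one predicate type."""
    tf = field_tokens.count(term)
    return (tf + mu * field_collection_prob.get(term, 0.0)) / (len(field_tokens) + mu)


def mixture_prob(term: str,
                 entity_fields: Dict[str, List[str]],
                 collection_probs: Dict[str, Dict[str, float]],
                 mus: Dict[str, float],
                 priors: Dict[str, float]) -> float:
    """P(t | theta_e) = sum over pt of P(t | theta_e^pt) * P(pt)."""
    return sum(
        priors[pt] * field_prob(term, entity_fields.get(pt, []),
                                collection_probs[pt], mus[pt])
        for pt in priors
    )
```

If the priors sum to one and each per-type model is a proper distribution, the mixture is a proper distribution too.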

44 / 60

Page 45: A Survey of Entity Ranking over RDF Graphs

Comparative Evaluation

Model   MAP              P@10             NDCG
YSC 2010
UEM     0.207            0.314            0.383
SEM     0.282 (+36.2%)   0.400 (+27.4%)   0.494 (+29.0%)
YSC 2011
UEM     0.207            0.188            0.295
SEM     0.261 (+26.1%)   0.242 (+28.7%)   0.400 (+35.6%)

The multi-fielded document approach improves the targeted measures by 26-36%
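The percentages in parentheses are relative improvements of SEM over UEM. A short check, using only the scores reported in the table (the helper name is made up for this illustration):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (UEM, SEM) score pairs copied from the table above
ysc_2010 = {"MAP": (0.207, 0.282), "P@10": (0.314, 0.400), "NDCG": (0.383, 0.494)}
ysc_2011 = {"MAP": (0.207, 0.261), "P@10": (0.188, 0.242), "NDCG": (0.295, 0.400)}

for year, scores in (("YSC 2010", ysc_2010), ("YSC 2011", ysc_2011)):
    for metric, (uem, sem) in scores.items():
        print(f"{year} {metric}: {relative_improvement(uem, sem):+.1f}%")
```

The smallest gain is MAP on YSC 2011 and the largest is MAP on YSC 2010.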

45 / 60

Page 46: A Survey of Entity Ranking over RDF Graphs

Combining N-gram Retrieval with Weights Propagation on Massive RDF Graphs

He Hu, Xiaoyong Du

FSKD 2012

46 / 60

Page 47: A Survey of Entity Ranking over RDF Graphs

Approach I

- Considering 2- to 5-grams while indexing entity URIs as well as literals
- Thinking of URIs as hierarchical names
- Computing the entity-query similarity scores:

$$sim_{URI}(Q) = \frac{engram\_hit\_count}{(||Q| - |URI.path|| + 1) \cdot (URI.depth + 1)}$$

$$sim_{LITERAL}(Q) = \frac{engram\_hit\_count}{||Q| - |LITERAL.length|| + 1}$$
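A rough sketch of the two similarity scores. The slide does not say how the hit count or the lengths are measured, so several assumptions are made here: n-grams are taken to be character n-grams, the hit count is the number of distinct 2- to 5-grams shared by the query and the target, lengths are measured in characters, and `uri_depth` is passed in by the caller. All function names are illustrative.

```python
from typing import Set


def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> Set[str]:
    """All distinct character n-grams of `text` for n_min <= n <= n_max."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def ngram_hit_count(query: str, target: str) -> int:
    """Assumed meaning of the hit count: n-grams shared by query and target."""
    return len(char_ngrams(query) & char_ngrams(target))


def sim_literal(query: str, literal: str) -> float:
    """hit_count / (| |Q| - |literal| | + 1)."""
    return ngram_hit_count(query, literal) / (abs(len(query) - len(literal)) + 1)


def sim_uri(query: str, uri_path: str, uri_depth: int) -> float:
    """hit_count / ((| |Q| - |path| | + 1) * (depth + 1))."""
    length_penalty = abs(len(query) - len(uri_path)) + 1
    return ngram_hit_count(query, uri_path) / (length_penalty * (uri_depth + 1))
```

Both denominators penalize a length mismatch between query and target; the URI score additionally penalizes deeply nested paths.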

47 / 60

Page 48: A Survey of Entity Ranking over RDF Graphs

Approach II

- Ranking score:

$$Score_{URI}(Q) = 1 - e^{-sim(Q)}$$

- Taking advantage of iterative PageRank-like weight propagation:

$$W_{URI\_hit}(i + 1) = \alpha \cdot W_{URI\_hit}(i)$$

$$W_{URI\_unhit}(i + 1) = (1 - \alpha) \cdot \frac{W_{URI\_hit\_neighbors}(i)}{N_{URI\_hit\_neighbors}}$$

- Improvement up to 80% w.r.t. the plain n-gram ranker
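One possible reading of the ranking score and of a single propagation step is sketched below. The interpretation is an assumption: $W_{URI\_hit\_neighbors}$ is read as the total weight of a node's hit neighbors and $N_{URI\_hit\_neighbors}$ as their count, an unhit node without hit neighbors is given weight 0, and the default `alpha` is a placeholder. The graph layout (`neighbors` as an adjacency mapping) and the function names are also made up for the example.

```python
import math
from typing import Dict, Iterable, Set


def ranking_score(sim: float) -> float:
    """Score = 1 - exp(-sim), which maps a similarity in [0, inf) to [0, 1)."""
    return 1.0 - math.exp(-sim)


def propagate_once(weights: Dict[str, float],
                   hits: Set[str],
                   neighbors: Dict[str, Iterable[str]],
                   alpha: float = 0.85) -> Dict[str, float]:
    """One iteration of the weight propagation, under the reading stated above."""
    updated: Dict[str, float] = {}
    for node, weight in weights.items():
        if node in hits:
            # hit nodes keep an alpha share of their own weight
            updated[node] = alpha * weight
        else:
            hit_nbrs = [n for n in neighbors.get(node, ()) if n in hits]
            if hit_nbrs:
                mean_weight = sum(weights[n] for n in hit_nbrs) / len(hit_nbrs)
                updated[node] = (1.0 - alpha) * mean_weight
            else:
                updated[node] = 0.0  # assumption: nothing to receive
    return updated
```

Under this reading, entities that did not match the query directly can still receive a score from matching neighbors.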

48 / 60

Page 49: A Survey of Entity Ranking over RDF Graphs

Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval

Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux

SIGIR 2012

49 / 60

Page 50: A Survey of Entity Ranking over RDF Graphs

Hybrid Search System

50 / 60

Page 51: A Survey of Entity Ranking over RDF Graphs

Structured Inverted Index

Consider the following property values as fields:

- URI: tokens from the entity URI, e.g. http://dbpedia.org/page/Barack_Obama ⇒ 'barack', 'obama' etc.
- Labels: values of a list of manually selected datatype properties
- Attributes: other properties

BM25F is used as a ranking function
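The URI field can be illustrated with a small tokenizer. This is a sketch under an assumption: only the last path segment of the URI is tokenized, split on non-alphanumeric characters and lowercased. The slide's "etc." suggests the original system may extract further tokens, which this example does not attempt.

```python
import re
from typing import List
from urllib.parse import urlparse


def uri_tokens(uri: str) -> List[str]:
    """Lowercased tokens from the last path segment of an entity URI."""
    path = urlparse(uri).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", last_segment) if tok]


print(uri_tokens("http://dbpedia.org/page/Barack_Obama"))  # ['barack', 'obama']
```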

51 / 60

Page 52: A Survey of Entity Ranking over RDF Graphs

Graph-based Entity Search

1. Given a query q, obtain a list of entities Retr = {e1, e2, . . . , en} ranked by the BM25F scores
2. Use the top-N elements as seeds for graph traversal
3. To get StructRetr = {e′1, . . . , e′m}, exploit promising LOD properties‡ as well as Jaro-Winkler string similarity scores JW(q, e′) > τ
4. Combine the two rankings:

$$finalScore(q, e') = \lambda \times BM25(q, e) + (1 - \lambda) \times JW(q, e')$$

‡ owl:sameAs, dbpedia:disambiguates, dbpedia:redirect

52 / 60
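The filtering and combination steps can be sketched as follows. The graph traversal and the Jaro-Winkler computation are not reimplemented: each candidate is assumed to arrive as a tuple of the reached entity, the BM25 score of the seed it was reached from, and its precomputed Jaro-Winkler similarity to the query. The function names and the example values of `tau` and `lam` are placeholders.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float, float]  # (entity e', BM25 score of its seed e, JW(q, e'))


def final_score(bm25_seed_score: float, jw_score: float, lam: float) -> float:
    """finalScore = lambda * BM25(q, e) + (1 - lambda) * JW(q, e')."""
    return lam * bm25_seed_score + (1.0 - lam) * jw_score


def rank_struct_retr(candidates: Sequence[Candidate],
                     tau: float,
                     lam: float) -> List[Tuple[str, float]]:
    """Keep candidates with JW > tau and sort them by the combined score."""
    kept = [
        (entity, final_score(bm25, jw, lam))
        for entity, bm25, jw in candidates
        if jw > tau
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Since BM25 scores are unbounded while Jaro-Winkler lies in [0, 1], a practical system would probably normalize the BM25 scores before mixing; the slide does not say whether this is done.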

Page 53: A Survey of Entity Ranking over RDF Graphs

Evaluation

- The graph-based approach (S1_1) outperforms BM25 scoring with a 25% improvement in MAP on the 2010 data set
- No significant improvement over the baseline on the 2011 data set. This may be explained by the scarcity of the predicates used (owl:sameAs volume < 0.7%)

53 / 60

Page 54: A Survey of Entity Ranking over RDF Graphs

Improving Entity Search over Linked Data by Modeling Latent Semantics

Nikita Zhiltsov, Eugene Agichtein

CIKM 2013

54 / 60

Page 55: A Survey of Entity Ranking over RDF Graphs

Key Contributions

- A tensor factorization based approach to incorporate semantic link information into the ranking model
- Outperforms the state-of-the-art baseline in NDCG/MAP/P@10
- A thorough evaluation of the proposed techniques by acquiring thousands of manual labels to augment the YSC benchmark data set

⇒ more details in the next talk

55 / 60

Page 56: A Survey of Entity Ranking over RDF Graphs

Negative results

The ideas that do not work out

56 / 60

Page 57: A Survey of Entity Ranking over RDF Graphs

Negative Results

The ideas from standard IR that do not work out:

- WordNet-based query expansion [Tonon et al., SIGIR 2012]
- Pseudo-relevance feedback [Tonon et al., SIGIR 2012]
- Query suggestions of a commercial search engine [Tonon et al., SIGIR 2012]
- Direct application of centrality measures, such as PageRank and HITS [Campinas et al., SSW WWW 2011; Dali et al., 2012]

57 / 60

Page 58: A Survey of Entity Ranking over RDF Graphs

Outline

1 Introduction

2 Task Statement and Evaluation Methodology

3 Approaches

4 Conclusion

58 / 60

Page 59: A Survey of Entity Ranking over RDF Graphs

Wrap up

- Entity search over RDF graphs, a.k.a. ad-hoc object retrieval, has emerged as a new task in IR
- There is a robust and consistent evaluation methodology for it
- State-of-the-art approaches revolve around applications of well-known IR methods along
- Lack of approaches for leveraging semantic links
- Lots of data: scalability really matters

59 / 60

Page 60: A Survey of Entity Ranking over RDF Graphs

Thanks for your attention!

60 / 60