
ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking


Conference talk at WWW 2012, April 2012, Lyon, France.


Page 1: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland

Page 2: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Motivation

• Linked Open Data (LOD)
• Linking entities from text to LOD
  – Rich snippets
  – Entity-centric search
• Linking
  – Algorithmic
  – Manual (NYT articles)

[Diagram: HTML+RDFa pages connected to the LOD Cloud]


Page 3: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking


[Figure: candidate entity links for a news story — http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram, an owl:sameAs link to an equivalent Instagram entity in another dataset, and further entities Google and Android]

HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

Page 4: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Crowdsourcing

• Exploit human intelligence to solve
  – Tasks simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, image tagging
• Incentives
  – Financial, fun, visibility


Page 5: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

ZenCrowd

• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework


[Diagram: the Crowd on one side, algorithmic Machines on the other, combined by ZenCrowd]

Page 6: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Outline

• Related approaches
• ZenCrowd: system architecture
• Probabilistic model to combine automatic matching and crowdsourcing results
• Experimental evaluation


Page 7: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Related Approaches

• Entity linking
• Ad-hoc object retrieval
  – Keyword queries
  – Looking for a specific entity
  – IR indexing and ranking over RDF data
• Crowdsourcing
  – Training set, tagging, annotation
  – IR evaluation
  – CrowdDB


Page 8: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

ZenCrowd Architecture

[Architecture diagram: HTML pages enter ZenCrowd, where entity extractors feed algorithmic matchers (backed by an LOD index over the LOD Open Data Cloud) and a micro-task manager that posts micro matching tasks to a crowdsourcing platform; worker decisions flow into a probabilistic network and decision engine, whose output is HTML+RDFa pages]
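To make the data flow concrete, here is an illustrative outline in Python. Every function name below is an assumption of this sketch, not ZenCrowd's actual API, and each stage is stubbed so the example runs end to end.

# Illustrative outline of the ZenCrowd flow; all names and stubs are
# assumptions made for this sketch, not the system's real interfaces.
def extract_entities(page):
    """Entity extractors: find entity mentions in an HTML page."""
    return ["Facebook", "Instagram"]

def algorithmic_match(mention):
    """Algorithmic matchers: query the LOD index for candidate URIs."""
    return ["http://dbpedia.org/resource/" + mention]

def crowdsource(mention, candidates):
    """Micro-task manager: collect worker votes per candidate URI."""
    return {uri: 1 for uri in candidates}

def decide(candidates, votes):
    """Probabilistic network + decision engine: keep supported links."""
    return [uri for uri in candidates if votes.get(uri, 0) > 0]

def zencrowd_pipeline(page):
    enriched = []
    for mention in extract_entities(page):
        candidates = algorithmic_match(mention)
        votes = crowdsource(mention, candidates)
        enriched.append((mention, decide(candidates, votes)))
    return enriched

print(zencrowd_pipeline("<html>...</html>"))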


Page 9: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Algorithmic Matching

• Inverted index over LOD entities
  – DBPedia, Freebase, Geonames, NYT
• TF-IDF (IR ranking function)
• Top-ranked URIs linked to entities in docs
• Threshold on the ranking function, or top N (see the sketch below)
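A rough sketch of this matcher, assuming a toy three-entity label index in place of the real index over roughly 40M LOD entities; all names are illustrative.

# TF-IDF ranking over entity labels, with a score threshold and a
# top-N cutoff, as a stand-in for the inverted index over LOD entities.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

entities = {
    "http://dbpedia.org/resource/Facebook": "Facebook social networking service",
    "http://dbpedia.org/resource/Instagram": "Instagram photo sharing application",
    "http://dbpedia.org/resource/Google": "Google web search engine company",
}

uris = list(entities)
vectorizer = TfidfVectorizer()
label_matrix = vectorizer.fit_transform(entities.values())

def candidate_links(mention, top_n=3, threshold=0.1):
    """Return the top-N candidate URIs scoring above the threshold."""
    scores = cosine_similarity(vectorizer.transform([mention]), label_matrix)[0]
    ranked = sorted(zip(uris, scores), key=lambda pair: pair[1], reverse=True)
    return [(uri, score) for uri, score in ranked[:top_n] if score >= threshold]

print(candidate_links("Instagram photo app"))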


Page 10: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Entity Factor Graphs

• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link factors
  – Constraints
• Probabilistic inference
  – Select all links with posterior probability > τ

[Factor graph with 2 workers, 6 clicks, and 3 candidate links: worker nodes w1, w2 with worker priors pw1(), pw2(); candidate-link nodes l1–l3 with link priors pl1()–pl3() and link factors lf1()–lf3(); observed click variables c11–c23 connecting workers to links; a SameAs constraint sa1-2() between l1 and l2, and a dataset unicity constraint u2-3() between l2 and l3]
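To make the inference step concrete, here is a heavily simplified stand-in: it drops the SameAs and unicity constraints, treats workers as independent, and scores a single link by Bayes' rule. All names and numbers are illustrative, not the paper's actual inference.

# Posterior that one link is valid, given each worker's click and an
# estimated reliability; links with posterior > tau would be selected.
def link_posterior(prior, votes):
    """votes: (worker_reliability, clicked_valid) pairs for one link."""
    p_valid, p_invalid = prior, 1.0 - prior
    for reliability, clicked_valid in votes:
        # A worker of reliability r answers correctly with probability r.
        p_valid *= reliability if clicked_valid else 1.0 - reliability
        p_invalid *= (1.0 - reliability) if clicked_valid else reliability
    return p_valid / (p_valid + p_invalid)

tau = 0.5
votes = [(0.9, True), (0.6, False), (0.8, True)]  # illustrative clicks
if link_posterior(prior=0.5, votes=votes) > tau:
    print("select link")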


Page 11: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Entity Factor Graphs

• Training phase
  – Initialize worker priors
  – with k matches on known answers
• Updating worker priors
  – Use link decisions as new observations
  – Compute new worker probabilities
• Identify (and discard) unreliable workers (see the sketch below)
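A minimal sketch of this loop, assuming a simple pseudo-count estimate of worker reliability; the actual system recomputes worker probabilities through the factor graph, and the thresholds here are invented.

# Worker priors as pseudo-counts, updated with each finalized decision.
from dataclasses import dataclass

@dataclass
class Worker:
    correct: int = 1   # pseudo-counts initialized from k known-answer matches
    total: int = 2

    @property
    def reliability(self):
        return self.correct / self.total

def update_prior(worker, agreed_with_decision):
    """Treat each finalized link decision as a new observation."""
    worker.total += 1
    if agreed_with_decision:
        worker.correct += 1

def is_unreliable(worker, threshold=0.5, min_tasks=5):
    """Flag workers whose estimated reliability stays below the threshold."""
    return worker.total >= min_tasks and worker.reliability < threshold

w = Worker()
for agreed in [True, True, False, True]:
    update_prior(w, agreed)
print(w.reliability, is_unreliable(w))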


Page 12: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Datasets
  – 25 news articles from
    • CNN.com (global news)
    • NYTimes.com (global news)
    • WashingtonPost.com (US local news)
    • Timesofindia.indiatimes.com (India local news)
    • Swissinfo.com (Switzerland local news)
  – 40M entities (Freebase, DBPedia, Geonames, NYT)
• 80 workers on Amazon MTurk, $0.01 per entity
• Standard evaluation measures: Precision, Recall, Accuracy (definitions sketched below)
• Gold standard: editorial selection
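Purely for reference, a tiny helper with the standard definitions of the three measures; nothing here is ZenCrowd-specific, and the counts are made up.

# Precision, recall, and accuracy from per-entity confusion counts.
def precision_recall_accuracy(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

print(precision_recall_accuracy(tp=70, fp=30, fn=15, tn=35))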


Page 13: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

The micro-task


Page 14: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Automatic approach: P/R trade-off

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5"

Precision)/)Re

call)

Top)N)Results)

P"R"


Page 15: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Entity linking with crowdsourcing and agreement vote (at least 2 out of 5 workers select the same URI), as sketched below
  Top-1 precision: 0.70
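A minimal sketch of the agreement-vote baseline, assuming five recorded picks per entity; the 2-of-5 threshold matches the slide, and the prefixed URIs are illustrative.

# Accept every URI picked by at least min_votes of the workers.
from collections import Counter

def agreement_vote(worker_picks, min_votes=2):
    """worker_picks: the URI each worker selected for one entity."""
    counts = Counter(worker_picks)
    return [uri for uri, n in counts.items() if n >= min_votes]

picks = ["dbpedia:Instagram", "dbpedia:Instagram", "dbpedia:Facebook",
         "dbpedia:Instagram", "dbpedia:Facebook"]
print(agreement_vote(picks))  # ['dbpedia:Instagram', 'dbpedia:Facebook']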

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"

Precision)/)Re

call)

Matching)Probability)Threshold)

P"R"

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5"

Precision)/)Re

call)

Top)N)Results)

P"R"

Figure 5: Performance results (Precision, Recall) for the automatic approach.

The URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end, since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.

Table 2: Performance results for crowdsourcing with agreement vote over linkable entities.

            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.79  0.85  0.77    0.60  0.80  0.60
US News     0.52  0.61  0.54    0.50  0.74  0.47
IN News     0.62  0.76  0.65    0.64  0.86  0.63
SW News     0.69  0.82  0.69    0.50  0.69  0.56
All News    0.74  0.82  0.73    0.57  0.78  0.59

The first question we examine is whether there is a difference in reliability between the various populations of workers. In Figure 6 we show the performance for tasks performed by workers located in the USA and in India (each point corresponds to the average precision and recall over all entities in one document). On average, we observe that tasks performed by workers located in the USA lead to higher precision values. As we can see in Table 2, Indian workers obtain higher precision and recall on local Indian news as compared to US workers. The biggest difference in terms of accuracy between the two communities can be observed on the global interest news.

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.2" 0.4" 0.6" 0.8" 1"

Recall&

Precision&

US"India"

Figure 6: Per document task e↵ectiveness.

A second question we examine is how the textual context given for an entity influences the worker performance. In Figure 7, we compare the tasks for which only the entity label is given to those for which a context consisting of all the sentences containing the entity is shown to the worker (snippets). Surprisingly, we could not observe a significant difference in effectiveness caused by the different textual contexts given to the workers. Thus, we focus on only one type of context for the remaining experiments (we always give the snippet context).

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5" 6" 7" 8" 9" 10"

Precision)

Document)

Simple"

Snippet"

Figure 7: Crowdsourcing results with two di↵erenttextual contexts

Entity Linking with ZenCrowd. We now focus on the performance of the probabilistic inference network as proposed in this paper. We consider the method described in Section 4, with an initial training phase consisting of 5 entities, and a second, continuous training phase consisting of 5% of the other entities being offered to the workers (i.e., the workers are given a task whose solution is known by the system every 20 tasks on average). In order to reduce the number of tasks having little influence on the final results, a simple technique of blacklisting bad workers is used. A bad worker (who can be considered a spammer) is a worker who randomly and rapidly clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase are enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement-vote approach.

Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities.

            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.84  0.87  0.90    0.67  0.64  0.78
US News     0.64  0.68  0.78    0.55  0.63  0.71
IN News     0.84  0.82  0.89    0.75  0.77  0.80
SW News     0.72  0.80  0.85    0.61  0.62  0.73
All News    0.80  0.81  0.88    0.64  0.62  0.76

Finally, we compare ZenCrowd to the state-of-the-art crowdsourcing approach (using the optimal agreement vote) and to our best automatic approach on a per-task basis in Figure 8. The comparison is given for each document in the test collection. We observe that in most cases the human intelligence contribution improves the precision of the automatic approach. We also observe that ZenCrowd dominates the other two approaches in most cases.


Page 16: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Worker community and textual context

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.2" 0.4" 0.6" 0.8" 1"

Recall&

Precision&

US"India"

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5" 6" 7" 8" 9" 10"Precision)

Document)

Simple"

Snippet"


Page 17: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Worker selection

[Plots: per-worker precision vs. number of tasks completed for US and Indian workers, with the top US worker highlighted; and overall precision (roughly 0.6 to 0.8) vs. the number K of top workers selected]


Page 18: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Experimental Evaluation

• Entity linking with ZenCrowd
  – Training with first 5 entities + 5% afterwards
  – 3 consecutive bad answers lead to blacklisting (sketched below)
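A minimal sketch of that blacklisting rule, assuming training answers arrive as correct/incorrect booleans; the function name and encoding are assumptions of this sketch.

# Three consecutive wrong answers on known-answer (training) tasks
# flag the worker as a spammer.
def should_blacklist(training_answers, max_consecutive_bad=3):
    streak = 0
    for correct in training_answers:
        streak = 0 if correct else streak + 1
        if streak >= max_consecutive_bad:
            return True
    return False

print(should_blacklist([True, False, False, False]))  # True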

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"

Precision)/)Re

call)

Matching)Probability)Threshold)

P"R"

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5"

Precision)/)Re

call)

Top)N)Results)

P"R"

Figure 5: Performance results (Precision, Recall) forthe automatic approach.

the URIs with at least 2 votes are selected as valid links(we tried various thresholds and manually picked 2 in theend since it leads to the highest precision scores while keep-ing good recall values for our experiments). We report onthe performance of this crowdsourcing technique in Table 2.The values are averaged over all linkable entities for di↵erentdocument types and worker communities.

Table 2: Performance results for crowdsourcing withagreement vote over linkable entities.

US Workers Indian WorkersP R A P R A

GL News 0.79 0.85 0.77 0.60 0.80 0.60US News 0.52 0.61 0.54 0.50 0.74 0.47IN News 0.62 0.76 0.65 0.64 0.86 0.63SW News 0.69 0.82 0.69 0.50 0.69 0.56All News 0.74 0.82 0.73 0.57 0.78 0.59

The first question we examine is whether there is a di↵er-ence in reliability between the various populations of work-ers. In Figure 6 we show the performance for tasks per-formed by workers located in USA and India (each pointcorresponds to the average precision and recall over all en-tities in one document). On average, we observe that tasksperformed by workers located in the USA lead to higherprecision values. As we can see in Table 2, Indian workersobtain higher precision and recall on local Indian news ascompared to US workers. The biggest di↵erence in terms ofaccuracy between the two communities can be observed onthe global interest news.

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

0" 0.2" 0.4" 0.6" 0.8" 1"

Recall&

Precision&

US"India"

Figure 6: Per document task e↵ectiveness.

A second question we examine is how the textual contextgiven for an entity influences the worker performance. InFigure 7, we compare the tasks for which only the entitylabel is given to those for which a context consisting of all

the sentences containing the entity are shown to the worker(snippets). Surprisingly, we could not observe a significantdi↵erence in e↵ectiveness caused by the di↵erent textual con-texts given to the workers. Thus, we focus on only one typeof context for the remaining experiments (we always givethe snippet context).

0"0.1"0.2"0.3"0.4"0.5"0.6"0.7"0.8"0.9"1"

1" 2" 3" 4" 5" 6" 7" 8" 9" 10"

Precision)

Document)

Simple"

Snippet"

Figure 7: Crowdsourcing results with two di↵erenttextual contexts

Entity Linking with ZenCrowd.We now focus on the performance of the probabilistic in-

ference network as proposed in this paper. We consider themethod described in Section 4, with an initial training phaseconsisting of 5 entities, and a second, continuous trainingphase, consisting of 5% of the other entities being o↵ered tothe workers (i.e., the workers are given a task whose solutionis known by the system every 20 tasks on average).In order to reduce the number of tasks having little influ-

ence in the final results, a simple technique of blacklistingof bad workers is used. A bad worker (who can be consid-ered as a spammer) is a worker who randomly and rapidlyclicks on the links, hence generating noise in our system.In our experiments, we consider that 3 consecutive bad an-swers in the training phase is enough to identify the workeras a spammer and to blacklist him/her. We report the aver-age results of ZenCrowd when exploiting the training phase,constraints, and blacklisting in Table 3. As we can observe,precision and accuracy values are higher in all cases whencompared to the agreement vote approach.

Table 3: Performance results for crowdsourcing withZenCrowd over linkable entities.

US Workers Indian WorkersP R A P R A

GL News 0.84 0.87 0.90 0.67 0.64 0.78US News 0.64 0.68 0.78 0.55 0.63 0.71IN News 0.84 0.82 0.89 0.75 0.77 0.80SW News 0.72 0.80 0.85 0.61 0.62 0.73All News 0.80 0.81 0.88 0.64 0.62 0.76

Finally, we compare ZenCrowd to the state of the artcrowdsourcing approach (using the optimal agreement vote)and our best automatic approach on a per-task basis in Fig-ure 8. The comparison is given for each document in thetest collection. We observe that in most cases the humanintelligence contribution improves the precision of the auto-matic approach. We also observe that ZenCrowd dominates


Page 19: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Comparing 3 matching techniques

• ZenCrowd best for 75% of documents


0"

0.1"

0.2"

0.3"

0.4"

0.5"

0.6"

0.7"

0.8"

0.9"

1"

1" 2" 3" 4" 5" 6" 7" 8" 9" 10"11"12"13"14"15"16"17"18"19"20"21"22"23"24"25"

Precision)

)

Document)

Agr."Vote"

ZenCrowd"

Top"1"

Simple  Crowdsourcing  ZenCrowd  AutomaFc  

Page 20: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Lessons Learnt

• Crowdsourcing + probabilistic reasoning works!
• But
  – Different worker communities perform differently
  – Many low-quality workers
  – No differences with different textual contexts
  – Completion time may vary (based on reward)


Page 21: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Conclusions

• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking

• Standard crowdsourcing improves 6% over the automatic approach
• ZenCrowd: 4% to 35% improvement over standard crowdsourcing
• ZenCrowd: 14% average improvement over automatic approaches

• Next steps
  – Long-term worker behavior analysis
  – More efficient and effective linking by LOD dataset pre-selection

http://diuf.unifr.ch/xi/zencrowd/
