4
Collection Ranking & Selection for Federated Entity Search Krisztian Balog , Robert Neumayer, and Kjetil Nørvåg Norwegian University of Science and Technology 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012) Cartagena de Indias, Colombia, October 2012 Motivation - Entities are ubiquitous Many information needs revolve around entities (people, products, organisations, places, etc.) - Growing amount of structured data Again, organised around entities - Entities are often searched by their name If we know (heard of) it, we just ask for it by the name Top searches of 2011* Web search 1. iPhone 2. Casey Anthony 3. Kim Kardashian 4. Katy Perry 5. Jennifer Lopez 6. Lindsay Lohan 7. American Idol 8. Jennifer Aniston 9. Japan earthquake 10. Osama bin Laden Mobile search 1. iPhone 5 2. Powerball 3. MLB 4. Scrabble cheat 5. Casey Anthony 6. Hurricane Irene 2011 7. Kim Kardashian 8. Translator 9. Amy Winehouse 10. May 21, 2011 Rapture * http://yearinreview.yahoo.com/ Top searches of 2011* Web search 1. iPhone 2. Casey Anthony 3. Kim Kardashian 4. Katy Perry 5. Jennifer Lopez 6. Lindsay Lohan 7. American Idol 8. Jennifer Aniston 9. Japan earthquake 10. Osama bin Laden Mobile search 1. iPhone 5 2. Powerball 3. MLB 4. Scrabble cheat 5. Casey Anthony 6. Hurricane Irene 2011 7. Kim Kardashian 8. Translator 9. Amy Winehouse 10. May 21, 2011 Rapture http://yearinreview.yahoo.com/ “users have learned that search engine relevance decreases with longer queries and have grown accustomed to reducing their query (at least initially) to the name of an entity” [Blanco et al., 2011] The Web of Data 0 100 200 300 2007 2008 2009 2010 2011 # Linking Open Data datasets SW Conference Corpus DBpedia RDF Book Mashup DBLP Berlin Revyu Project Guten- berg FOAF Geo- names Music- brainz Magna- tune Jamendo World Fact- book DBLP Hannover SIOC Sem- Web- Central Euro- stat ECS South- ampton BBC Later + TOTP Fresh- meat Open- Guides Gov- Track US Census Data W3C WordNet flickr wrappr Wiki- company Open Cyc NEW! lingvoj Onto- world NEW! NEW! NEW! LOD in 2007 As of September 2011 Music Brainz (zitgist) P20 Turismo de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact- book El Viajero Tourism WordNet (W3C) WordNet (VUA) VIVO UF VIVO Indiana VIVO Cornell VIAF URI Burner Sussex Reading Lists Plymouth Reading Lists UniRef UniProt UMBEL UK Post- codes legislation data.gov.uk Uberblic UB Mann- heim TWC LOGD Twarql transport data.gov. uk Traffic Scotland theses. fr Thesau- rus W totl.net Tele- graphis TCM Gene DIT Taxon Concept Open Library (Talis) tags2con delicious t4gm info Swedish Open Cultural Heritage Surge Radio Sudoc STW RAMEAU SH statistics data.gov. uk St. Andrews Resource Lists ECS South- ampton EPrints SSW Thesaur us Smart Link Slideshare 2RDF semantic web.org Semantic Tweet Semantic XBRL SW Dog Food Source Code Ecosystem Linked Data US SEC (rdfabout) Sears Scotland Geo- graphy Scotland Pupils & Exams Scholaro- meter WordNet (RKB Explorer) Wiki UN/ LOCODE Ulm ECS (RKB Explorer) Roma RISKS RESEX RAE2001 Pisa OS OAI NSF New- castle LAAS KISTI JISC IRIT IEEE IBM Eurécom ERA ePrints dotAC DEPLOY DBLP (RKB Explorer) Crime Reports UK Course- ware CORDIS (RKB Explorer) CiteSeer Budapest ACM riese Revyu research data.gov. uk Ren. Energy Genera- tors reference data.gov. uk Recht- spraak. nl RDF ohloh Last.FM (rdfize) RDF Book Mashup Rådata nå! PSH Product Types Ontology Product DB PBAC Poké- pédia patents data.go v.uk Ox Points Ord- nance Survey Openly Local Open Library Open Cyc Open Corpo- rates Open Calais OpenEI Open Election Data Project Open Data Thesau- rus Ontos News Portal OGOLOD Janus AMP Ocean Drilling Codices New York Times NVD ntnusc NTU Resource Lists Norwe- gian MeSH NDL subjects ndlna my Experi- ment Italian Museums medu- cator MARC Codes List Man- chester Reading Lists Lotico Weather Stations London Gazette LOIUS Linked Open Colors lobid Resources lobid Organi- sations LEM Linked MDB LinkedL CCN Linked GeoData LinkedCT Linked User Feedback LOV Linked Open Numbers LODE Eurostat (Ontology Central) Linked EDGAR (Ontology Central) Linked Crunch- base lingvoj Lichfield Spen- ding LIBRIS Lexvo LCSH DBLP (L3S) Linked Sensor Data (Kno.e.sis) Klapp- stuhl- club Good- win Family National Radio- activity JP Jamendo (DBtune) Italian public schools ISTAT Immi- gration iServe IdRef Sudoc NSZL Catalog Hellenic PD Hellenic FBD Piedmont Accomo- dations GovTrack GovWILD Google Art wrapper gnoss GESIS GeoWord Net Geo Species Geo Names Geo Linked Data GEMET GTAA STITCH SIDER Project Guten- berg Medi Care Euro- stat (FUB) EURES Drug Bank Disea- some DBLP (FU Berlin) Daily Med CORDIS (FUB) Freebase flickr wrappr Fishes of Texas Finnish Munici- palities ChEMBL FanHubz Event Media EUTC Produc- tions Eurostat Europeana EUNIS EU Insti- tutions ESD stan- dards EARTh Enipedia Popula- tion (En- AKTing) NHS (En- AKTing) Mortality (En- AKTing) Energy (En- AKTing) Crime (En- AKTing) COEmission (En- AKTing) EEA SISVU educatio n.data.g ov.uk ECS South- ampton ECCO- TCP GND Didactal ia DDC Deutsche Bio- graphie data dcs Music Brainz (DBTune) Magna- tune John Peel (DBTune) Classical (DB Tune) Audio Scrobbler (DBTune) Last.FM artists (DBTune) DB Tropes Portu- guese DBpedia dbpedia lite Greek DBpedia DBpedia data- open- ac-uk SMC Journals Pokedex Airports NASA (Data Incu- bator) Music Brainz (Data Incubator) Moseley Folk Metoffice Weather Forecasts Discogs (Data Incubator) Climbing data.gov.uk intervals Data Gov.ie data bnf.fr Cornetto reegle Chronic- ling America Chem2 Bio2RDF Calames business data.gov. uk Bricklink Brazilian Poli- ticians BNB UniSTS UniPath way UniParc Taxono my UniProt (Bio2RDF) SGD Reactome PubMed Pub Chem PRO- SITE ProDom Pfam PDB OMIM MGI KEGG Reaction KEGG Pathway KEGG Glycan KEGG Enzyme KEGG Drug KEGG Com- pound InterPro Homolo Gene HGNC Gene Ontology GeneID Affy- metrix bible ontology BibBase FTS BBC Wildlife Finder BBC Program mes BBC Music Alpine Ski Austria LOCAH Amster- dam Museum AGROV OC AEMET US Census (rdfabout) Media Geographic Publications Government Cross-domain Life sciences User-generated content LOD in 2011 - Task Given a keyword query, targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs) - Queries Sampled from web search engine logs (142 in total) - Data collection Billion Triple Challenge 2009 (BTC) dataset - Relevance judgments On a 3-point scale, collected using crowdsourcing Ad-hoc entity search At the 2010/11 Semantic Search Challenge

LOD in 2007Music Brainz Turismo (zitgist)P20 de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact-book El Viajero Tourism WordNet (W3C) WordNet (VUA)VIVO UF VIVO Indiana VIVO Cornell

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LOD in 2007Music Brainz Turismo (zitgist)P20 de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact-book El Viajero Tourism WordNet (W3C) WordNet (VUA)VIVO UF VIVO Indiana VIVO Cornell

Collection Ranking & Selection for Federated Entity SearchKrisztian Balog, Robert Neumayer, and Kjetil NørvågNorwegian University of Science and Technology

19th International Symposium on String Processing and Information Retrieval (SPIRE 2012)Cartagena de Indias, Colombia, October 2012

Motivation- Entities are ubiquitous

Many information needs revolve around entities (people, products, organisations, places, etc.)

- Growing amount of structured dataAgain, organised around entities

- Entities are often searched by their nameIf we know (heard of) it, we just ask for it by the name

Top searches of 2011* Web search1. iPhone 2. Casey Anthony 3. Kim Kardashian 4. Katy Perry 5. Jennifer Lopez 6. Lindsay Lohan 7. American Idol 8. Jennifer Aniston 9. Japan earthquake 10. Osama bin Laden

Mobile search1. iPhone 5 2. Powerball 3. MLB 4. Scrabble cheat 5. Casey Anthony 6. Hurricane Irene 2011 7. Kim Kardashian 8. Translator 9. Amy Winehouse 10. May 21, 2011 Rapture

* http://yearinreview.yahoo.com/

Top searches of 2011* Web search1. iPhone 2. Casey Anthony 3. Kim Kardashian 4. Katy Perry 5. Jennifer Lopez 6. Lindsay Lohan 7. American Idol 8. Jennifer Aniston 9. Japan earthquake 10. Osama bin Laden

Mobile search1. iPhone 5 2. Powerball 3. MLB 4. Scrabble cheat 5. Casey Anthony 6. Hurricane Irene 2011 7. Kim Kardashian 8. Translator 9. Amy Winehouse 10. May 21, 2011 Rapture

* http://yearinreview.yahoo.com/

“users have learned that search engine relevance decreases with longer queries and have grown accustomed to reducing their query (at least initially) to the name of an entity”

[Blanco et al., 2011]

The Web of Data

0

100

200

300

2007 2008 2009 2010 2011

# Linking Open Data datasets

SW

Conference

Corpus

DBpedia

RDF Book Mashup

DBLPBerlin

Revyu

Project Guten-berg

FOAF

Geo-names

Music-brainz

Magna-tune

Jamendo

World

Fact-

book

DBLPHannover

SIOC

Sem-

Web-

Central

Euro-

stat

ECS

South-

amptonBBC

Later +TOTP

Fresh-meat

Open-

Guides

Gov-Track

US Census Data

W3CWordNet

flickrwrappr

Wiki-

company

OpenCyc

NEW! lingvoj

Onto-world

NEW!

NEW!

NEW!

LOD in 2007

As of September 2011

MusicBrainz

(zitgist)

P20

Turismo de

Zaragoza

yovisto

Yahoo! Geo

Planet

YAGO

World Fact-book

El ViajeroTourism

WordNet (W3C)

WordNet (VUA)

VIVO UF

VIVO Indiana

VIVO Cornell

VIAF

URIBurner

Sussex Reading

Lists

Plymouth Reading

Lists

UniRef

UniProt

UMBEL

UK Post-codes

legislationdata.gov.uk

Uberblic

UB Mann-heim

TWC LOGD

Twarql

transportdata.gov.

uk

Traffic Scotland

theses.fr

Thesau-rus W

totl.net

Tele-graphis

TCMGeneDIT

TaxonConcept

Open Library (Talis)

tags2con delicious

t4gminfo

Swedish Open

Cultural Heritage

Surge Radio

Sudoc

STW

RAMEAU SH

statisticsdata.gov.

uk

St. Andrews Resource

Lists

ECS South-ampton EPrints

SSW Thesaur

us

SmartLink

Slideshare2RDF

semanticweb.org

SemanticTweet

Semantic XBRL

SWDog Food

Source Code Ecosystem Linked Data

US SEC (rdfabout)

Sears

Scotland Geo-

graphy

ScotlandPupils &Exams

Scholaro-meter

WordNet (RKB

Explorer)

Wiki

UN/LOCODE

Ulm

ECS (RKB

Explorer)

Roma

RISKS

RESEX

RAE2001

Pisa

OS

OAI

NSF

New-castle

LAASKISTI

JISC

IRIT

IEEE

IBM

Eurécom

ERA

ePrints dotAC

DEPLOY

DBLP (RKB

Explorer)

Crime Reports

UK

Course-ware

CORDIS (RKB

Explorer)CiteSeer

Budapest

ACM

riese

Revyu

researchdata.gov.

ukRen. Energy Genera-

tors

referencedata.gov.

uk

Recht-spraak.

nl

RDFohloh

Last.FM (rdfize)

RDF Book

Mashup

Rådata nå!

PSH

Product Types

Ontology

ProductDB

PBAC

Poké-pédia

patentsdata.go

v.uk

OxPoints

Ord-nance Survey

Openly Local

Open Library

OpenCyc

Open Corpo-rates

OpenCalais

OpenEI

Open Election

Data Project

OpenData

Thesau-rus

Ontos News Portal

OGOLOD

JanusAMP

Ocean Drilling Codices

New York

Times

NVD

ntnusc

NTU Resource

Lists

Norwe-gian

MeSH

NDL subjects

ndlna

myExperi-ment

Italian Museums

medu-cator

MARC Codes List

Man-chester Reading

Lists

Lotico

Weather Stations

London Gazette

LOIUS

Linked Open Colors

lobidResources

lobidOrgani-sations

LEM

LinkedMDB

LinkedLCCN

LinkedGeoData

LinkedCT

LinkedUser

FeedbackLOV

Linked Open

Numbers

LODE

Eurostat (OntologyCentral)

Linked EDGAR

(OntologyCentral)

Linked Crunch-

base

lingvoj

Lichfield Spen-ding

LIBRIS

Lexvo

LCSH

DBLP (L3S)

Linked Sensor Data (Kno.e.sis)

Klapp-stuhl-club

Good-win

Family

National Radio-activity

JP

Jamendo (DBtune)

Italian public

schools

ISTAT Immi-gration

iServe

IdRef Sudoc

NSZL Catalog

Hellenic PD

Hellenic FBD

PiedmontAccomo-dations

GovTrack

GovWILD

GoogleArt

wrapper

gnoss

GESIS

GeoWordNet

GeoSpecies

GeoNames

GeoLinkedData

GEMET

GTAA

STITCH

SIDER

Project Guten-berg

MediCare

Euro-stat

(FUB)

EURES

DrugBank

Disea-some

DBLP (FU

Berlin)

DailyMed

CORDIS(FUB)

Freebase

flickr wrappr

Fishes of Texas

Finnish Munici-palities

ChEMBL

FanHubz

EventMedia

EUTC Produc-

tions

Eurostat

Europeana

EUNIS

EU Insti-

tutions

ESD stan-dards

EARTh

Enipedia

Popula-tion (En-AKTing)

NHS(En-

AKTing) Mortality(En-

AKTing)

Energy (En-

AKTing)

Crime(En-

AKTing)

CO2 Emission

(En-AKTing)

EEA

SISVU

education.data.g

ov.uk

ECS South-ampton

ECCO-TCP

GND

Didactalia

DDC Deutsche Bio-

graphie

datadcs

MusicBrainz

(DBTune)

Magna-tune

John Peel

(DBTune)

Classical (DB

Tune)

AudioScrobbler (DBTune)

Last.FM artists

(DBTune)

DBTropes

Portu-guese

DBpedia

dbpedia lite

Greek DBpedia

DBpedia

data-open-ac-uk

SMCJournals

Pokedex

Airports

NASA (Data Incu-bator)

MusicBrainz(Data

Incubator)

Moseley Folk

Metoffice Weather Forecasts

Discogs (Data

Incubator)

Climbing

data.gov.uk intervals

Data Gov.ie

databnf.fr

Cornetto

reegle

Chronic-ling

America

Chem2Bio2RDF

Calames

businessdata.gov.

uk

Bricklink

Brazilian Poli-

ticians

BNB

UniSTS

UniPathway

UniParc

Taxonomy

UniProt(Bio2RDF)

SGD

Reactome

PubMedPub

Chem

PRO-SITE

ProDom

Pfam

PDB

OMIMMGI

KEGG Reaction

KEGG Pathway

KEGG Glycan

KEGG Enzyme

KEGG Drug

KEGG Com-pound

InterPro

HomoloGene

HGNC

Gene Ontology

GeneID

Affy-metrix

bible ontology

BibBase

FTS

BBC Wildlife Finder

BBC Program

mes BBC Music

Alpine Ski

Austria

LOCAH

Amster-dam

Museum

AGROVOC

AEMET

US Census (rdfabout)

Media

Geographic

Publications

Government

Cross-domain

Life sciences

User-generated content

LOD in 2011

- Task Given a keyword query, targeting a particular entity, provide a ranked list of relevant entities (i.e., URIs)

- QueriesSampled from web search engine logs (142 in total)

- Data collectionBillion Triple Challenge 2009 (BTC) dataset

- Relevance judgmentsOn a 3-point scale, collected using crowdsourcing

Ad-hoc entity searchAt the 2010/11 Semantic Search Challenge

Page 2: LOD in 2007Music Brainz Turismo (zitgist)P20 de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact-book El Viajero Tourism WordNet (W3C) WordNet (VUA)VIVO UF VIVO Indiana VIVO Cornell

In this talk- Address the ad-hoc entity retrieval task in a

distributed setting- The Web of Data is inherently distributed- Some data sources may not be crawleable at all

- Specifically, our focus is on the collection ranking and collection selection steps

Federated searchA typical broker-based architecture

1 Collection ranking

Summary ASummary BSummary C

Central broker

A

C

B

Q 1

Federated searchA typical broker-based architecture

1

Collection selection2

Collection A

Collection B

Collection C

Summary A

Summary B

Summary C

Central broker

A

C

Q

B2

Q 1

Q

Collection ranking

Federated searchA typical broker-based architecture

Collection ranking1

Result merging2

3

Collection A

Collection B

Collection C

Summary A

Summary B

Summary C

Central broker

A

C

Q

B2

3

Q 1

Q

Collection selection

Next: baseline modelsfor collection ranking and selection

Result merging

1 Collection rankingCollection selection2

3

Collection A

Collection B

Collection C

Summary A

Summary B

Summary C

Central broker

A

C

Q

B2

3

Q 1

Q

- Lexicon-based methodTreat and score each collection as if it was a single, large document

Collection rankingCollection-centric (CC)

QCollection A

Collection B

Collection C

... ...

P (c|q) / P (c) ·Yt2q

P (t|✓c)

- Document-surrogate methodModel and query individual documents (entities) and aggregate their relevance scores

Collection rankingEntity-centric (EC)

Q

Collection A

Collection B

Collection C

... ...

P (c|q) /X

e2c,r(e,q)<�

P (e|q)

- Choosing a fixed cutoff (K) ahead of timeK is usually set between 5 and 20

Collection selectionTop-K selection

A

D

C

E

B

F

Page 3: LOD in 2007Music Brainz Turismo (zitgist)P20 de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact-book El Viajero Tourism WordNet (W3C) WordNet (VUA)VIVO UF VIVO Indiana VIVO Cornell

- Exploit that entities are searched by their name- The central broker maintains a complete dictionary

of entity names (and corresponding identifiers)- Utilise this information in the collection selection step

to dynamically adjust the #collections selected

Our method: AENN“All that an Entity Needs is a Name”

AENN collection ranking- Key observation

Different methods—collection-centric (CC) vs. entity-centric (EC)—work best for different queries

- IdeaCombination should give better results than any of the two methods alone

AENN(c, q) = (1� �) · CC(c, q) + � · EC(c, q)

AENN collection rankingExample

A

B

C

A

D

C

E

B

F

0.65

0.3

0.05

0.35

0.3

0.2

0.15

0.1

0.05

A

B

D

C

E

F

0.5

0.2

0.15

0.125

0.075

0.025

AENNCC EC

AENN collection selection- Key observation

CC has higher recall, while EC has better precision- Idea

Use the collection rankings generated by EC and/or CC to dynamically adjust the set of collections selected- Precision-oriented selection- Recall-oriented selection- Balanced selection

- Only select collections returned by EC

AENN collection selectionPrecision-oriented selection (AENN(p))

A

B

C

A

D

C

E

B

F

0.65

0.3

0.05

0.35

0.3

0.2

0.15

0.1

0.05

A

B

D

C

E

F

0.5

0.2

0.15

0.125

0.075

0.025

AENNCC EC AENN(p)

A

B

C

0.5

0.2

0.125

- Include collections from CC until all from EC are covered. This defines the cutoff point for AENN

AENN collection selectionRecall-oriented selection (AENN(r))

A

B

C

A

D

C

E

B

F

0.65

0.3

0.05

0.35

0.3

0.2

0.15

0.1

0.05

0.5

0.2

0.15

0.125

0.075

0.025

AENNCC EC AENN(r)

A

B

D

C

E

0.5

0.2

0.15

0.125

0.075

A

B

D

C

E

F

- Include collections from AENN until all from EC are covered

AENN collection selectionBalanced selection (AENN(b))

A

B

C

0.65

0.3

0.05

0.35

0.3

0.2

0.15

0.1

0.05

0.5

0.2

0.15

0.125

0.075

0.025

AENNCC EC

A

B

D

C

E

F

A

D

C

E

B

F

A

B

D

C

0.5

0.2

0.15

0.125

AENN(b)

AENN collection selectionComparison of approaches

A

B

D

C

0.5

0.2

0.15

0.125

AENN(b)AENN(r)AENN(p)

A

B

C

0.5

0.2

0.125

A

B

D

C

E

0.5

0.2

0.15

0.125

0.075

Page 4: LOD in 2007Music Brainz Turismo (zitgist)P20 de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact-book El Viajero Tourism WordNet (W3C) WordNet (VUA)VIVO UF VIVO Indiana VIVO Cornell

Experimental setupBased on the 2010/11 Semantic Search Challenge

- Distributed environmentTop 100 largest second-level domains from BTC- Three sets with different handling of DBpedia

- RelevanceConsidered the #relevant entities from each collection

- Metrics- Collection ranking: Standard IR metrics (MAP, MRR, nDCG)- Collection selection: Analogues of precision and recall, plus

the avg. #coll. selected

Test collections

BTC BTC\DBpedia DBpedia

#Entities 68.8M 60.5M 8.8M#Collections 100 99 100#Queries 136 116 130Avg. #rel. entities/query 14.9 4.8 10.1Avg. #rel. entities/collection 3.4 2.8 9.4

ResultsCollection ranking (BTC)

0

0.25

0.50

0.75

Name-only Full content

MAP

CC EC AENN

ResultsDifferent collection selection strategies (BTC\DBpedia)

0

0.3

0.6

0.9

1 3 10 20 50 100

Precision

K

0.6

0.7

0.9

1.0

1 3 10 20 50 100

Recall

K

0

17

33

50

1 3 10 20 50 100

Avg. #coll. selected

K

AENN(p) AENN(r) AENN(b)

ResultsCollection selection (DBpedia)

0

0.2

0.3

0.5

1 3 10 20 50 100

Precision

K

0

0.3

0.7

1.0

1 3 10 20 50 100

Recall

K

0

33

67

100

1 3 10 20 50 100

Avg. #coll. selected

K

CC-N EC-C AENN(b)

Summary- Addressed the task of ad-hoc entity retrieval in a

distributed setting- Introduced AENN, a novel collection ranking and

selection method based on a lean name-based entity representation

- Showed experimentally that AENN can outperform standard baselines that consider all entity content

- Further, AENN can be geared towards high precision, high recall, or a balanced setting

Questions?Resources are available at http://bit.ly/OzfYK2

Contact@krisztianbalog

krisztianbalog.com