ESTER Efficient Search on Text, Entities, and Relations

Holger Bast

Max-Planck-Institut für Informatik

Saarbrücken, Germany

joint work with

Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR’07 in Amsterdam, July 26th

ESTER

It’s about:

Fast Semantic Search

Keyword Search vs. Semantic Search

Keyword search

– Query: john lennon

– Answer: documents containing the words john and lennon

Semantic search

– Query: musician

– Answer: documents containing an instance of musician

Combined search

– Query: beatles musician

– Answer: documents containing the word beatles and an instance of musician

Useful by itself or as a component of a QA system

Semantic Search: Challenges + Our System

1. Entity recognition

– approach 1: let users annotate (semantic web)

– approach 2: annotate (semi-)automatically

– our system: uses Wikipedia links + learns from them

2. Query Processing

– build a space-efficient index

– which enables fast query answers

– our system: as compact and fast as a standard full-text engine

3. User Interface

– easy to use

– yet powerful query capabilities

– our system: standard interface with interactive suggestions

Focus of the paper and of this talk: 2. Query Processing

In the Rest of this Talk …

Efficiency

– three simple ideas (which all fail)

– our approach (which works)

Queries supported

– essentially all SPARQL queries, and

– seamless integration with ordinary full-text search

Experiments

– efficiency (great)

– quality (not so great yet)

Conclusions

– lots of interesting + challenging open problems

Efficiency: Simple Idea 1

Add “semantic tags” to the document

– e.g., add the special word tag:musician before every occurrence of a musician in a document

Problem 1: Index blowup

– e.g., John Lennon is a: Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, … (28 classes)

Problem 2: Limited querying capabilities

– e.g., could not produce list of musicians that occur in documents that also contain the word beatles

– in particular, could not do all SPARQL queries (more on that later)
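A minimal sketch of Idea 1 and the blowup it causes, assuming a toy in-memory inverted index. The class list is the subset named above; `tag_document` and `build_index` are hypothetical helper names, not part of ESTER.

```python
from collections import defaultdict

# Hypothetical toy ontology: entity -> classes (the 7 of 28 named above).
CLASSES = {
    "john lennon": ["musician", "singer", "composer", "artist",
                    "vegetarian", "person", "pacifist"],
}

def tag_document(tokens, classes=CLASSES):
    """Insert one tag:<class> word before every two-word entity occurrence."""
    out = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in classes:
            out.extend("tag:" + c for c in classes[pair])
            out.extend(tokens[i:i + 2])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def build_index(docs):
    """Plain inverted index: word -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

doc = "legend says that john lennon of the beatles smoked gitanes".split()
tagged = tag_document(doc)
index = build_index({1: tagged})
```

One mention adds one extra word per class, so the 10-word sentence grows by 7 words here. The list for `tag:musician` holds only document ids, so it cannot say which musician occurred, which is Problem 2.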


Efficiency: Simple Idea 2

Query Expansion

– e.g., replace query word musician by disjunction

musician:aaron_copland OR … OR musician:zarah_leander

(7,593 musicians in Wikipedia)

Problem: Inefficient query processing

– one intersection per element of the disjunction needed
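A minimal sketch of why query expansion is costly, assuming a toy index of sorted doc-id lists. The index contents and the helper names are hypothetical; the point is only that the number of list intersections grows with the size of the disjunction.

```python
def intersect(a, b):
    """Intersect two sorted doc-id lists with a standard merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def expanded_query(index, word, expansions):
    """Answer `word AND (e1 OR e2 OR ...)` with one intersection per disjunct."""
    hits = set()
    n_intersections = 0
    for e in expansions:
        hits.update(intersect(index.get(word, []), index.get(e, [])))
        n_intersections += 1
    return sorted(hits), n_intersections

# Hypothetical toy index: word -> sorted doc ids.
index = {
    "beatles": [1, 4, 7],
    "musician:john_lennon": [1, 2],
    "musician:ringo_starr": [4],
    "musician:aaron_copland": [3],
}
musicians = [w for w in index if w.startswith("musician:")]
docs, cost = expanded_query(index, "beatles", musicians)
```

With three musicians the query costs three intersections; with the 7,593 musicians mentioned above it would cost 7,593.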


Efficiency: Simple Idea 3

Use a database

– map semantic queries to SQL queries on suitably constructed tables

– that’s what the Artificial-Intelligence / Semantic-Web people usually do

Problem: Inefficient + Lack of control

– building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both

– very limited control regarding efficiency aspects

Efficiency: Our Approach

Two basic operations

– prefix search of a special kind [will be explained by example]

– join [will be explained by example]

An index data structure

– which supports these two operations efficiently

Artificial words in the documents

– such that a large class of semantic queries reduces to a combination of (few of) these operations

Processing the query “beatles musician”

Document “Gitanes”

… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

Document “John Lennon” (numbers are positions)

0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer …

Two prefix queries

– beatles entity:* gives entity:john_lennon, entity:1964, entity:liverpool, etc.

– entity:* . relation:is_a . class:musician gives entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.

One join

– of the two result lists gives entity:john_lennon, etc.
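The example above can be sketched in a few lines, assuming a toy corpus of (position, word) postings. Reading `.` as "at the next position" is an assumption based on the position numbers in the John Lennon document, and the join is simplified to a set intersection on entity words (a real join would also carry document ids). All names and the corpus are hypothetical.

```python
# Hypothetical toy corpus: document -> list of (position, word).
DOCS = {
    "Gitanes": [(0, "john"), (1, "lennon"), (1, "entity:john_lennon"),
                (2, "beatles"), (3, "liverpool"), (3, "entity:liverpool")],
    "John Lennon": [(0, "entity:john_lennon"), (1, "relation:is_a"),
                    (2, "class:musician"), (2, "class:singer")],
    "Johann Sebastian Bach": [(0, "entity:johann_sebastian_bach"),
                              (1, "relation:is_a"), (2, "class:musician")],
}

def cooccurring(docs, word, prefix):
    """Prefix query `word prefix*`: prefix matches in docs containing `word`."""
    out = set()
    for postings in docs.values():
        words = {w for _, w in postings}
        if word in words:
            out.update(w for w in words if w.startswith(prefix))
    return out

def sequence(docs, prefix, *rest):
    """Prefix query `prefix* . w1 . w2 ...`: prefix matches that are followed
    by w1, w2, ... at the next positions of the same document."""
    out = set()
    for postings in docs.values():
        at = {}
        for pos, w in postings:
            at.setdefault(pos, set()).add(w)
        for pos, w in postings:
            if w.startswith(prefix) and all(
                r in at.get(pos + k, ()) for k, r in enumerate(rest, start=1)
            ):
                out.add(w)
    return out

left = cooccurring(DOCS, "beatles", "entity:")                          # prefix query 1
right = sequence(DOCS, "entity:", "relation:is_a", "class:musician")    # prefix query 2
answer = sorted(left & right)                                           # the join
```

Here `left` holds the entities mentioned next to beatles, `right` holds the musicians, and the join keeps only entities in both.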


Problem: entity:* has a huge number of occurrences

– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences

– prefix search efficient only for up to ≈ 1% (explanation follows)

Solution: frontier classes

– classes at “appropriate” level in the hierarchy

– e.g.: artist, believer, worker, vegetable, animal, …

Document “Gitanes”

… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Document “John Lennon” (numbers are positions)

0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician …

Two prefix queries

– beatles artist:* gives artist:john_lennon, artist:graham_greene, artist:pete_best, etc.

– artist:* . relation:is_a . class:musician gives artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

One join

– of the two result lists gives artist:john_lennon, etc.

First figure out the frontier class: musician → artist

(easy)
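A minimal sketch of the "figure out musician → artist" step, assuming the class hierarchy is available as a child-to-parent map. The hierarchy fragment is hypothetical, and a single parent per class is a simplifying assumption (a real taxonomy may allow several). Classes above the frontier are not handled here.

```python
# Hypothetical fragment of the class hierarchy: class -> parent class.
PARENT = {
    "singer": "musician",
    "musician": "artist",
    "artist": "person",
    "pacifist": "believer",
    "believer": "person",
}
# Frontier classes named on the slide above.
FRONTIER = {"artist", "believer", "worker", "vegetable", "animal"}

def frontier_of(cls, parent=PARENT, frontier=FRONTIER):
    """Walk up the hierarchy until a frontier class is reached."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls in frontier:
            return cls
        seen.add(cls)
        cls = parent.get(cls)
    return None  # class lies above the frontier or is unknown

def rewrite(query_word, cls):
    """Build the two prefix queries with the frontier class as the prefix."""
    f = frontier_of(cls)
    return (f"{query_word} {f}:*", f"{f}:* . relation:is_a . class:{cls}")

queries = rewrite("beatles", "musician")
```

For the query beatles musician this yields the two prefix queries shown above, with artist:* in place of entity:*.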

The HYB Index [Bast/Weber, SIGIR’06]

Maintains lists for word ranges (not words)

abl-abt    Doc. 12      Doc. 83      Doc. 83      Doc. 187     …
           Pos. 5       Pos. 14      Pos. 124     Pos. 88      …
           Scor. 0.5    Scor. 0.2    Scor. 0.7    Scor. 0.4    …
           able         ablaze       abroad       abnormal

Looks like this for person:*

person:*   Doc. 17               Doc. 23               Doc. 72                 Doc. 72               …
           Pos. 12               Pos. 3                Pos. 55                 Pos. 59               …
           person:john_lennon    person:ringo_starr    person:graham_greene    person:john_lennon

Provably efficient

– no more space than an inverted index (on the same data)

– each query = scan of a moderate number of (compressed) items

Extremely versatile

– can do all kinds of things an inverted index cannot do (efficiently)

– autocompletion, faceted search, query expansion, error correction, select and join, …
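A minimal sketch of a prefix search over one HYB-style block, using the person:* postings shown above. It assumes an uncompressed list of (doc id, position, word) entries and is not the actual HYB implementation; `prefix_search` and `completions` are hypothetical names.

```python
# Hypothetical block for the word range person:*, one entry per posting.
BLOCK_PERSON = [
    (17, 12, "person:john_lennon"),
    (23, 3, "person:ringo_starr"),
    (72, 55, "person:graham_greene"),
    (72, 59, "person:john_lennon"),
]

def prefix_search(block, prefix, candidate_docs=None):
    """One scan over a block: keep postings whose word starts with `prefix`
    and, if given, whose doc id is in `candidate_docs`."""
    hits = []
    for doc, pos, word in block:
        if word.startswith(prefix) and (
            candidate_docs is None or doc in candidate_docs
        ):
            hits.append((doc, pos, word))
    return hits

def completions(hits):
    """Distinct matching words with the number of docs each occurs in."""
    docs_per_word = {}
    for doc, _, word in hits:
        docs_per_word.setdefault(word, set()).add(doc)
    return {w: len(d) for w, d in docs_per_word.items()}

hits = prefix_search(BLOCK_PERSON, "person:", candidate_docs={23, 72})
lennon = completions(prefix_search(BLOCK_PERSON, "person:john"))
```

Because the words are stored alongside the doc ids, a single scan returns both the matching documents and the matching words, which is what a per-word inverted list cannot provide directly.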

Queries we can handle

We prove the following theorem:

– Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations

– SPARQL = SPARQL Protocol And RDF Query Language (yes, it’s recursive)

Example: musicians born in the same year as John Lennon

SELECT ?who WHERE {
  ?who is_a Musician .
  ?who born_in_year ?when .
  John_Lennon born_in_year ?when
}

ESTER achieves seamless integration with full-text search

– SPARQL has no means for dealing with full-text search

– XQuery can handle full-text search, but is not really suitable for semantic search

More about supported queries in the paper
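To make the example query concrete, here is a minimal sketch that evaluates its three edges over a toy fact list, with one lookup per edge followed by joins on the shared variables. This is a plain in-memory evaluation and not ESTER's reduction to prefix and join operations; the fact list and helper names are hypothetical.

```python
# Hypothetical toy fact list: (subject, relation, object).
FACTS = [
    ("John_Lennon", "is_a", "Musician"),
    ("Ringo_Starr", "is_a", "Musician"),
    ("Wolfgang_Amadeus_Mozart", "is_a", "Musician"),
    ("John_Lennon", "born_in_year", "1940"),
    ("Ringo_Starr", "born_in_year", "1940"),
    ("Wolfgang_Amadeus_Mozart", "born_in_year", "1756"),
]

def subjects(facts, relation, obj):
    """All ?s with (?s relation obj)."""
    return {s for s, r, o in facts if r == relation and o == obj}

def objects(facts, subj, relation):
    """All ?o with (subj relation ?o)."""
    return {o for s, r, o in facts if s == subj and r == relation}

def pairs(facts, relation):
    """All (?s, ?o) with (?s relation ?o)."""
    return {(s, o) for s, r, o in facts if r == relation}

musicians = subjects(FACTS, "is_a", "Musician")                # ?who is_a Musician
born = pairs(FACTS, "born_in_year")                            # ?who born_in_year ?when
lennon_years = objects(FACTS, "John_Lennon", "born_in_year")   # John_Lennon born_in_year ?when

# Join on ?who (with musicians) and on ?when (with lennon_years).
who = sorted(s for s, year in born if s in musicians and year in lennon_years)
```

As in SPARQL, John_Lennon matches his own birth year and therefore appears in the result alongside Ringo_Starr.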

Experiments: Corpus, Ontology, Index

Corpus: English Wikipedia (XML dump from Nov. 2006)

≈ 8 GB raw XML

≈ 2.8 million documents

≈ 1 billion words

Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW’07)

≈ 2.5 million facts

derived from clever combination of Wikipedia + WordNet (Entities from Wikipedia, Taxonomy from WordNet)

Our Index

≈ 1.5 billion words (original + artificial)

≈ 3.3 GB total index size; ontology-only is a mere 100 MB

Note: our system works for an arbitrary corpus + ontology

Experiments: Efficiency — What Baseline?

SPARQL engines

– can’t do text search

– and slow for ontology-only too (on Wikipedia: seconds)

XQuery engines

– extremely slow for text search (on Wikipedia: minutes)

– and slow for ontology-only too (on Wikipedia: seconds)

Other prototypes which do semantic + full-text search

– efficiency is hardly considered

– e.g., the system of Castells/Fernandez/Vallet (TKDE’07)

“… average informally observed response time on a standard professional desktop computer [of] below 30 seconds [on 145,316 documents and an ontology with 465,848 facts] …”

– our system: ~100ms, 2.8 million documents, 2.5 million facts

Experiments: Efficiency — Stress Test 1

Compare to ontology-only system

– the YAGO engine from WWW’07

– Onto Simple: when was [person] born [1000 queries]

– Onto Advanced: list all people from [profession] [1000 queries]

– Onto Hard: when did people die who were born in the same year as [person] [1000 queries]

Note: comparison very unfair (for our system)

                 Our system            Onto-Only
                 avg.      max.        avg.      max.
Onto Simple      2 ms      5 ms        3 ms      20 ms
Onto Advanced    9 ms      31 ms       3 ms      794 ms
Onto Hard        64 ms     208 ms      78 ms     550 ms
Index size       100 MB                4 GB

Experiments: Efficiency — Stress Test 2

                  Our system            Full-Text Only
                  avg.      max.        avg.      max.
Onto+Text Easy    224 ms    772 ms      90 ms     498 ms
Onto+Text Hard    279 ms    502 ms      44 ms     85 ms

Compare to text-only search engine

– state-of-the-art system from SIGIR’06

– Onto+Text Easy: counties in [US state] [50 queries]

– Onto+Text Hard: computer scientists [nationality] [50 queries]

– Full-text query: e.g. german computer scientists

Note: hardly finds relevant documents

Note: comparison extremely unfair (for our system)

Experiments: Quality — Entity Recognition

Use Wikipedia links as hints

– “… following [[John Lennon | Lennon]] and Paul McCartney, two of the Beatles, …”

– “… The southern terminus is located south of the town of [[Lennon, Michigan | Lennon]] …”

Learn other links

– use words in neighborhood as features

Accuracy

all words    2 senses    3 senses    ≥4 senses
93.4%        88.2%       84.4%       80.3%
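A minimal sketch of learning from linked mentions with neighborhood words as features. The slides do not say which learner is used, so this stands in a simple word-overlap score as an assumption; the function names and the window width are hypothetical. The two training examples are the linked sentences quoted above, lowercased and stripped of punctuation.

```python
from collections import Counter, defaultdict

def neighborhood(tokens, i, width=3):
    """Words within `width` positions of token i, excluding token i itself."""
    return tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]

def train(examples, width=3):
    """examples: (tokens, index of mention, linked entity).
    Returns per-entity counts of neighborhood words."""
    model = defaultdict(Counter)
    for tokens, i, entity in examples:
        model[entity].update(neighborhood(tokens, i, width))
    return model

def predict(model, tokens, i, width=3):
    """Pick the entity whose training neighborhoods overlap most with this one."""
    ctx = neighborhood(tokens, i, width)
    return max(sorted(model), key=lambda e: sum(model[e][w] for w in ctx))

examples = [
    ("following lennon and paul mccartney two of the beatles".split(),
     1, "John Lennon"),
    ("the southern terminus is located south of the town of lennon".split(),
     10, "Lennon, Michigan"),
]
model = train(examples)
guess_a = predict(model, "lennon and mccartney wrote songs".split(), 0)
guess_b = predict(model, "a small town of lennon".split(), 4)
```

An unlinked "Lennon" near "and" and "mccartney" resolves to the musician, while one near "town of" resolves to the place in Michigan.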

Experiments: Quality — Relevance

2 Query Sets

– People associated with [american university] [100 queries]

– Counties of [american state] [50 queries]

Ground truth

– Wikipedia has corresponding lists

e.g., List of Carnegie Mellon University People

Precision and Recall

            precision@10    recall
PEOPLE      37.3%           89.7%
COUNTIES    66.5%           97.8%
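For reference, a minimal sketch of how precision at a cutoff k and recall are commonly computed against a ground-truth list. The exact evaluation protocol behind the numbers above is not given on the slides; dividing by k (rather than by the number of results returned) is one common convention and an assumption here.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked results that are in the ground truth."""
    if k <= 0:
        return 0.0
    return sum(1 for r in ranked[:k] if r in relevant) / k

def recall(ranked, relevant):
    """Fraction of the ground truth that appears anywhere in the results."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)

# Hypothetical result list and ground truth.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p2 = precision_at_k(ranked, relevant, k=2)
rec = recall(ranked, relevant)
```

High recall with lower precision, as in the PEOPLE row, means most ground-truth entities are found somewhere in the result list but the top of the ranking still contains many wrong hits.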

Conclusions

Semantic Retrieval System ESTER

– fast and scalable via reduction to prefix search and join

– can handle all basic SPARQL queries

– seamless integration with full-text search

– standard user interface with (semantic) suggestions

Lots of interesting and challenging problems

– simultaneous ranking of entities and documents

– proper snippet generation and highlighting

– search result quality

– …

Dank je wel! (Thank you!)
