Entity Categorization Over Large Document Collections

Arnd Christian König, Venkatesh Ganti, Rares Vernica

Microsoft Research

Relationship Extraction from Text

Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.

… Donald Knuth works in research …  =>  is-a-researcher(Donald_Knuth)

… Yao Ming plays for the Houston Rockets …  =>  works-for(Yao_Ming, Houston_Rockets)

(Callouts on the examples: Entity, Context.)

Motivation:
  Going from unstructured data to structured data
  Applications in search, business intelligence, etc.

Focus:
  Open relationship extraction vs. targeted extraction
  Large document collections (> 10^7 documents)

Using Aggregate Context

Single-context Extraction:
  Extraction logic: ‘[E] works … research’
  => ([Entity], is-a-researcher)

Multi-context Extraction:
  “…[Entity] works in research…”
  “…[Entity] published…”
  “…[Entity]’s paper…”
  “…[Entity] gave a talk…”
  Aggregate Context Features: ([Entity], ‘paper’), ([Entity], ‘talk’), ([Entity], ‘published’)
  Multi-Feature Relation Extractor
  => ([Entity], is-a-researcher)

We track an entity across contexts, allowing us to combine less predictive features.
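The idea of tracking an entity across contexts can be illustrated with a minimal sketch. It assumes entity recognition has already produced (entity, context text) pairs; the function names, the cue-word set, and the `min_cues` threshold are hypothetical, and the toy decision rule only stands in for the multi-feature relation extractor.

```python
from collections import defaultdict

# Hypothetical cue words; the slide names 'paper', 'talk' and 'published'.
CUE_WORDS = {"paper", "talk", "published"}


def aggregate_context_features(contexts):
    """Collect the cue words seen near each entity across all of its contexts.

    `contexts` is an iterable of (entity, context_text) pairs.
    """
    features = defaultdict(set)
    for entity, text in contexts:
        tokens = {tok.strip(".,;:'\"").lower() for tok in text.split()}
        features[entity] |= tokens & CUE_WORDS
    return dict(features)


def is_researcher(feature_set, min_cues=2):
    # Toy stand-in for a classifier: require several weak cues together.
    return len(feature_set) >= min_cues


contexts = [
    ("Donald_Knuth", "published a new volume"),
    ("Donald_Knuth", "gave a talk at Stanford"),
    ("Yao_Ming", "gave a talk to fans"),
]
feats = aggregate_context_features(contexts)
```

No single context here is decisive on its own; only the combination of 'published' and 'talk' for the same entity tips the toy rule.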

Using Co-occurrence Features

Leverage co-occurrence of entity classes (e.g. directors likely co-occur with actors) for extraction.

Example: Extraction of is-a-director relation:
  Actor-List: Alan Alba, Richard Gere, Julia Roberts, …
  Text: … Julia Roberts starred in a Robert Altman film in 1994 …
  Aggregate Context Features: (Robert_Altman, co-occurs with actor name), …

Co-occurrence features can be between:
  Entities of different classes.
  Entities of one class.

Combination with text-features possible: e.g., ‘[Entity] plays for [Team_Name]’.
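As a rough illustration of the co-occurrence feature in the director example, the sketch below emits a feature for every entity that shares a context with a member of a known class list. The function name and the data layout (a dict of class name to member set) are assumptions made for this example only.

```python
def cooccurrence_features(sentence_entities, class_lists):
    """Emit (entity, 'co-occurs with <class>') pairs for one context.

    sentence_entities: entity strings recognized in a single context.
    class_lists: dict mapping a class name to the set of its known members.
    """
    pairs = set()
    for entity in sentence_entities:
        for other in sentence_entities:
            if other == entity:
                continue
            for cls, members in class_lists.items():
                if other in members:
                    pairs.add((entity, f"co-occurs with {cls}"))
    return pairs


actor_list = {"Alan Alba", "Richard Gere", "Julia Roberts"}
found = cooccurrence_features(
    ["Julia Roberts", "Robert Altman"], {"actor": actor_list}
)
```

Here only Robert Altman receives a feature, because Julia Roberts is the only list member in the context.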

Two Questions:

(a) What difference do the aggregate contexts make for extraction accuracy?

(b) This means keeping track of contexts across documents - can we make this efficient?

Processing Large Document Collections

Architecture

[Diagram. Single-Context Extraction: Document Corpus D, Entity-Relation Pairs, Aggregation with COUNT(entity, relation) > Δ. Agg. Feature Extraction: Document Corpus D, Context Feature Extraction, Co-Occurrence Detection against the Co-Occurrence List corpus L, Entity-Feature Pairs, Aggregation, Classification.]

=> Duplicated overhead from
  - Document scanning
  - Document processing
  - Entity Extraction
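The aggregation step for single-context extraction, COUNT(entity, relation) > Δ, can be sketched in a few lines. The function name and the in-memory `Counter` are assumptions for illustration; at the corpus sizes discussed here a real system would need a scalable aggregation.

```python
from collections import Counter


def aggregate_pairs(pairs, delta):
    """Keep the (entity, relation) pairs that were extracted more than `delta` times."""
    counts = Counter(pairs)
    return {pair for pair, n in counts.items() if n > delta}


extracted = [("Donald_Knuth", "is-a-researcher")] * 3 + [("Yao_Ming", "is-a-researcher")]
kept = aggregate_pairs(extracted, delta=2)
```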

New Architecture

Challenges:
1. Fast & accurate co-occurrence detection using the synopsis.
2. Pruning of redundant output.

[Diagram components: Document Corpus D, Context Feature Extraction, Rule-based Extraction, List-Member Extraction using the Synopsis of L, Entity – Candidate Context Pairs, Entity-List Pairs, Entity-Feature Pairs, Aggregation, Delete false Positives against the Co-Occurrence List corpus L, Agg. Feature Extraction, Classification.]

Fast identification of candidate matches through 2-stage filtering.

Use of Bloom-Filters to trade off memory footprint with false positive rate.
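A minimal Bloom filter sketch shows the memory versus false-positive trade-off. The two stages below (a cheap probe of the synopsis, then an exact check that deletes false positives) are one plausible reading of the slides, not a confirmed description of the system; the class, the parameters, and the SHA-256 based hashing are assumptions chosen for clarity rather than speed.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def find_list_members(candidates, synopsis, exact_list):
    """Stage 1: Bloom-filter probe. Stage 2: exact check removes false positives."""
    maybe = [c for c in candidates if c in synopsis]
    return [c for c in maybe if c in exact_list]


actors = {"Alan Alba", "Richard Gere", "Julia Roberts"}
synopsis = BloomFilter()
for name in actors:
    synopsis.add(name)
members = find_list_members(["Julia Roberts", "Robert Altman"], synopsis, actors)
```

Fewer bits shrink the synopsis but let more non-members pass stage 1, which increases the verification work in stage 2. A Bloom filter never rejects a true member.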

Frequency-distribution of entities very skewed.

Pruning based on retaining most frequent entities and list members in memory.

Challenge: Determining frequencies online.

=> Compact hash-synopses of frequencies (CM-Sketch) perform well.
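For determining frequencies online, a Count-Min sketch keeps a small table of counters instead of one counter per entity. The following is a generic textbook-style sketch, not the configuration used in the experiments; `width`, `depth`, and the hashing scheme are assumptions.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Minimum over rows: never below the true count, may overestimate on collisions.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for _ in range(50):
    sketch.add("George Bush")
sketch.add("Rare Person")
```

Because estimates only err upward, a frequent entity is never mistaken for a rare one. With a skewed frequency distribution, that is the property needed when deciding which entities to retain in memory.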

Potentially very large output:

Duplication via very many co-occurrences, e.g. actor-actor.

Duplication, e.g. Entity: “George Bush”, Feature: ‘President’.

Experiments

Experimental Evaluation

Task: Categorization of entities into professions (actor, writer, painter, etc.)

Document-Corpus: 3.2 Million Wikipedia pages

Training data generated using Wikipedia lists of famous painters, writers, etc.

Aggregate-Context Classifier: linear SVM using text n-gram & co-occurrence features (binary)

Single-Context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD’06]).

Co-occurrence list: contains 10% of entity strings in training data.
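To make "extraction rules (incl. gaps)" concrete, here is a hedged sketch that compiles a rule written like the earlier ‘[E] works … research’ example into a regular expression. The rule syntax, the `max_gap` limit, and the capitalized-word entity pattern are assumptions for illustration; the rule format of the cited algorithm is not described in these slides.

```python
import re


def compile_gap_rule(rule, max_gap=3):
    """Turn a rule such as '[E] works … research' into a compiled regex.

    '[E]' is the entity slot; '…' stands for a gap of up to `max_gap` tokens.
    """
    tokens = rule.split()
    pieces = []
    for i, token in enumerate(tokens):
        if token == "…":
            # A gap consumes zero or more whole tokens, each with trailing whitespace.
            pieces.append(r"(?:\S+\s+){0,%d}" % max_gap)
            continue
        if token == "[E]":
            # Assumption: an entity is a run of capitalized words.
            pieces.append(r"(?P<entity>[A-Z]\w*(?:\s[A-Z]\w*)*)")
        else:
            pieces.append(re.escape(token))
        if i < len(tokens) - 1:
            pieces.append(r"\s+")
    return re.compile("".join(pieces))


rule = compile_gap_rule("[E] works … research")
hit = rule.search("Donald Knuth works in research")
miss = rule.search("Yao Ming plays for the Houston Rockets")
```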

Experimental Evaluation: Accuracy

[Figure: Precision (y-axis, 75% to 100%) vs. Recall (x-axis, 0% to 100%). Series: Painters - aggregate n-gram features only; Painters - n-gram and list-membership features; Rule-based extraction (60% conf.); Rule-based extraction (80% conf.).]

Experimental Evaluation: Overhead

[Figure: Percentage of Baseline Overhead (y-axis, 0% to 100%) for Baseline, Size = 1%, Size = 2%, Size = 5%, Size = 10%. Series: Aggregation, Verification, Write-Overhead.]

Main remaining overhead: writing of entity-feature pairs.

Simple caching strategy reduces this overhead by an order of magnitude.
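The slides do not say which caching strategy was used. As one hypothetical possibility, a bounded LRU cache of recently written pairs can suppress repeated writes of the same entity-feature pair (such as “George Bush” / ‘President’). The class name, the capacity, and the list-like sink are assumptions.

```python
from collections import OrderedDict


class DedupWriter:
    """Suppress repeated (entity, feature) pairs using a bounded LRU cache."""

    def __init__(self, sink, capacity=100_000):
        self.sink = sink            # any object with an append() method
        self.capacity = capacity
        self.cache = OrderedDict()

    def write(self, entity, feature):
        key = (entity, feature)
        if key in self.cache:
            self.cache.move_to_end(key)      # refresh recency, skip the write
            return False
        self.sink.append(key)
        self.cache[key] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used pair
        return True


out = []
writer = DedupWriter(out, capacity=2)
first = writer.write("George Bush", "President")
second = writer.write("George Bush", "President")
```

Since the classifier features are binary, a duplicate pair carries no extra information, so skipping it is safe. Pairs evicted from the cache may be written again, and a later aggregation step would still need to merge them.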

Conclusions

Studied the effect of aggregate context in relation extraction.

Proposed efficient processing techniques for large text corpora.

Both aggregate and co-occurrence features provide significant increase in extraction accuracy compared to single-context classifiers.

The use of pruning techniques and approximate filters results in significant reduction in the overall extraction overhead.

Questions?