Page 1: Predicting Accuracy of Extracting Information from Unstructured Text Collections

Eugene Agichtein and Silviu Cucerzan
Microsoft Research
Text Mining, Search, and Navigation Group

Page 2: Extracting and Managing Information in Text

[Diagram: text sources (text document collections, web documents, blogs, news alerts, …) feed into an Information Extraction System, which outputs Events, Entities (E1, E2, E3, E4), and Relations (e.g., (E3, E4), (E4, E1))]

• Varying input properties: different languages, varying consistency, noise/errors, …

• Complex problem: usually many parameters; tuning often required

Success ~ Accuracy

Page 3: The Goal: Predict Extraction Accuracy

Estimate the expected success of an IE system that relies on contextual patterns before:
• running expensive experiments
• tuning parameters
• training the system

Useful when adapting an IE system to:
• a new task
• a new document collection
• a new language

Page 4: Specific Extraction Tasks

• Named Entity Recognition (NER)

• Relation Extraction (RE)

European champions Liverpool paved the way to the group stages of the Champions League taking a 3-1 lead over CSKA Sofia on Wednesday [...] Gerard Houllier's men started the match in Sofia on fire with Steven Gerrard scoring [...]

Entity types: Organization, Person, Location, Misc

Abraham Lincoln was born on Feb. 12, 1809, in a log cabin in Hardin (now Larue) County, Ky

BORN relation: Who = Abraham Lincoln; When = Feb. 12, 1809; Where = Hardin County, KY

Page 5: Contextual Clues

• Entity context (left context, right context):
  … yesterday, Mrs Clinton told reporters the move to the East Room …

• Relation context (left context, middle context, right context):
  … engineers Orville and Wilbur Wright built the first working airplane in 1903 …

Page 6: Approach: Language Modelling

• Presence of contextual clues for a task appears related to extraction difficulty

• The more “obvious” the clues, the easier the task

• Can be modelled as the "unexpectedness" of a word

• Use Language Modelling (LM) techniques to quantify this intuition

Page 7: Language Models (LM)

• An LM is a summary of the word distribution in text
• Can define unigram, bigram, trigram, …, n-gram models
• More complex models exist
  – distance, syntax, word classes
  – but: not robust for the web, other languages, …

• LMs are used in IR, ASR, text classification, and clustering:
  – Query clarity: predicting query performance [Cronen-Townsend et al., SIGIR 2002]
  – Context modelling for NER [Cucerzan et al., EMNLP 1999], [Klein et al., CoNLL 2003]

Page 8: Document Language Models

• A basic LM is a normalized word histogram for the document collection

• Unigram (word) models commonly used

• Higher-order n-grams (bigrams, trigrams) can be used

word        frequency
the         0.0584
to          0.0269
and         0.0199
said        0.0147
…           …
's          0.0018
company     0.0014
mrs         0.0003
won         0.0003
president   0.0003

[Chart: the same unigram frequencies plotted as a bar chart]

Page 9: Context Language Models

• Senator Christopher Dodd, D-Conn., named general chairman of the Democratic National Committee last week by President Bill Clinton, said it was premature to talk about lifting the U.S. embargo against Cuba…

• Although the Clinton's health plan failed to make it through Congress this year, Mrs Clinton vowed continued support for the proposal.

• A senior White House official, who accompanied Clinton, told reporters…

• By the fall of 1905, the Wright brothers' experimental period ended. With their third powered airplane, they now routinely made flights of several …

• Against this backdrop, we see the Wright brothers' efforts to develop an airplane …
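A sketch of how a context model over such examples can be built, assuming pre-tokenized text and known entity spans, and reusing the unigram_lm helper from the earlier sketch; sample_entity_spans in the usage comment is a hypothetical variable:

```python
def context_windows(tokens, entity_spans, k=3):
    """Collect up to k tokens to the left and right of each entity occurrence."""
    window_tokens = []
    for start, end in entity_spans:                       # spans as (start, end) token indices
        window_tokens += tokens[max(0, start - k):start]  # left context
        window_tokens += tokens[end:end + k]              # right context
    return window_tokens

# Context language model for a sample of entity occurrences:
# lm_c = unigram_lm(context_windows(tokens, sample_entity_spans, k=3))
```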

Page 10: Key Observation

• If normally rare words consistently appear in contexts around entities, the extraction task tends to be "easier".

• Contexts are an intrinsic property of the collection and the extraction task, not of a specific information extraction system.

[Chart: unigram word frequencies, repeated from the Document Language Models slide]

Page 11: Divergence Measures

• Cosine divergence:

  Cosine(LM_C, LM_BG) = 1 - (LM_C · LM_BG) / (||LM_C||_2 ||LM_BG||_2)

• Relative entropy (KL divergence):

  KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) log( LM_C(w) / LM_BG(w) )
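Both measures translate directly into code. The sketch below is an illustration under stated assumptions: the small epsilon that smooths words unseen in the background model is not specified on the slides.

```python
import math

def kl_divergence(lm_c, lm_bg, epsilon=1e-9):
    """KL(LM_C || LM_BG), summed over the words of the context model."""
    return sum(p * math.log(p / lm_bg.get(w, epsilon))   # epsilon is an assumed smoothing choice
               for w, p in lm_c.items() if p > 0)

def cosine_divergence(lm_c, lm_bg):
    """1 - cosine similarity of the two models viewed as vectors over the vocabulary."""
    dot = sum(p * lm_bg.get(w, 0.0) for w, p in lm_c.items())
    norm_c = math.sqrt(sum(p * p for p in lm_c.values()))
    norm_bg = math.sqrt(sum(p * p for p in lm_bg.values()))
    return 1.0 - dot / (norm_c * norm_bg)
```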

Page 12: Interpreting Divergence: Reference LM

• Need to calibrate the observed divergence
• Compute a reference model LM_R:
  – pick K random non-stopwords R_1, …, R_K and compute the context language model around each R_i

… the five-star Hotel Astoria is a symbol of elegance and comfort. With an unbeatable location in St Isaac's Square in the heart of St Petersburg, ...

• Normalized KL(LM_C) = KL(LM_C || LM_BG) / KL(LM_R || LM_BG)

• Normalization corrects for the bias introduced by small sample sizes
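A sketch of this calibration step under the same assumptions, reusing unigram_lm, context_windows, and kl_divergence from the earlier sketches; the stopword list is supplied by the caller:

```python
import random

def normalized_kl(lm_c, lm_bg, tokens, sample_size, stopwords, k=3):
    """KL(LM_C)/KL(LM_R): divergence of the entity context model, calibrated
    against the context model of an equal-sized random sample of non-stopwords."""
    candidates = [i for i, w in enumerate(tokens) if w.lower() not in stopwords]
    picks = random.sample(candidates, sample_size)
    # Treat each random pick as a single-token "entity" and build its context model
    lm_r = unigram_lm(context_windows(tokens, [(i, i + 1) for i in picks], k))
    return kl_divergence(lm_c, lm_bg) / kl_divergence(lm_r, lm_bg)
```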

Page 13: Reference LM (cont.)

[Chart: average KL divergence of the reference model LM_R vs. random sample size (1, 2, 3, 4, 5, 10, 20, 50, 100), for 1-, 2-, and 3-word contexts]

• LM_R converges to LM_BG for large sample sizes

• Divergence of LM_R is substantial for small samples


Page 14: Predicting Extraction Accuracy: The Algorithm

1. Start with a small sample S of entities (or relation tuples) to be extracted

2. Find occurrences of S in the given collection

3. Compute LM_BG for the collection

4. Compute LM_C for S over the collection

5. Pick |S| random words R from LM_BG

6. Compute the context LM for R: LM_R

7. Compute KL(LM_C || LM_BG) and KL(LM_R || LM_BG)

8. Return the normalized KL(LM_C) (see the sketch below)
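Putting the steps together, a minimal end-to-end sketch built from the helpers introduced on the earlier slides; the sample occurrences are assumed to already be located as token spans (step 2):

```python
def predict_extraction_difficulty(tokens, sample_spans, stopwords, k=3):
    """Higher return values suggest an easier extraction task (steps 3-8 above)."""
    lm_bg = unigram_lm(tokens)                                    # step 3: background model
    lm_c = unigram_lm(context_windows(tokens, sample_spans, k))   # step 4: context model for S
    return normalized_kl(lm_c, lm_bg, tokens, len(sample_spans), stopwords, k)  # steps 5-8
```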

Page 15: Experimental Evaluation

• How to measure success?

– Compare predicted ease of task vs. observed extraction accuracy

• Extraction Tasks: NER and RE

– NER: Datasets from the CoNLL 2002, 2003 evaluations

– RE: Binary relations between NEs and generic phrases

Page 16: Extraction Task Accuracy

NER (F1, %):

          English  Spanish  Dutch
LOC        90.21    79.84   79.19
MISC       78.83    55.82   73.90
ORG        81.86    79.69   69.48
PER        91.47    86.83   78.83
Overall    86.77    79.20   75.24

RE (accuracy):

Relation   Strict  Partial  Task Difficulty
BORN        0.73    0.96    Easy
DIED        0.34    0.97    Easy
INVENT      0.35    0.64    Hard
WROTE       0.12    0.50    Hard

Page 17: Document Collections

Task  Collection                                  Size
NER   Reuters RCV1, 1/100                         3,566,125 words
NER   Reuters RCV1, 1/10                          35,639,471 words
NER   EFE newswire articles, May 2000 (Spanish)   367,589 words
NER   "De Morgen" articles (Dutch)                268,705 words
RE    Encarta document collection                 64,187,912 words

Note that the Spanish and Dutch corpora are much smaller.

Page 18: Predicting NER Performance (English)

          Florian et al.  Chieu et al.  Klein et al.  Zhang et al.  Carreras et al.  Average
LOC           91.15          91.12         89.98         89.54          89.26         90.21
MISC          80.44          79.16         80.15         75.87          78.54         78.83
ORG           84.67          84.32         80.48         80.46          79.41         81.86
PER           93.85          93.44         90.72         90.44          88.93         91.47
Overall       88.76          88.31         86.31         85.50          85.00         86.77

Absolute and normalized KL divergence (Reuters 1/10, context = 3 words, stopwords discarded, averaged):

          Absolute  Normalized
LOC         0.98       1.07
MISC        1.29       1.40
ORG         2.83       3.08
PER         4.10       4.46
RANDOM      0.92        -

LOC exception: large overlap between locations in the training and test collections (i.e., simple gazetteers are effective).

Page 19: NER – Robustness / Different Dimensions

• Counting stopwords (w) or not (w/o): Reuters 1/100, context ±3, averaged

        LOC   MISC  ORG   PER   RAND
  F1    90.2  78.8  81.9  91.5   -
  w     0.93  1.09  2.68  3.91  0.78
  w/o   1.48  1.83  3.81  5.62  1.27

• Context size: Reuters 1/100, no stopwords, averaged

        LOC   MISC  ORG   PER   RAND
  ±1    0.88  1.26  2.12  2.94  2.43
  ±2    1.06  1.47  2.95  4.11  1.14
  ±3    1.07  1.40  3.08  4.46  0.92

• Corpus size: Reuters, context ±3, no stopwords, averaged

         LOC   MISC  ORG   PER   RAND
  1/10   1.07  1.40  3.08  4.46  0.92
  1/100  1.48  1.83  3.81  5.62  1.27

Page 20: Other Dimensions: Sample Size

[Chart: normalized KL divergence of LM_C for LOC, MISC, ORG, and PER vs. entity sample size (10 to 50)]

[Chart: average KL divergence of the reference model LM_R vs. random sample size (1 to 100) for 1-, 2-, and 3-word contexts, repeated from the Reference LM slide]

• Normalized divergence of LM_C remains high
  – contrast with LM_R for larger sample sizes

Page 21: Other Dimensions: N-gram Size

Higher order n-grams may help in some cases.

[Chart: normalized KL divergence for LOC, MISC, ORG, and PER vs. n-gram size (1 to 3)]

Actual F1 (English NER):
LOC   90.21
MISC  78.83
ORG   81.86
PER   91.47

Page 22: Other Languages

• Spanish (actual F1):

  LOC   79.84
  MISC  55.82
  ORG   79.69
  PER   86.83

• Dutch (actual F1):

  LOC   79.19
  MISC  73.90
  ORG   69.48
  PER   78.83

Normalized KL divergence by context size:

          Context=1  Context=2  Context=3
  LOC       1.44       1.65       1.61
  MISC      1.97       2.02       1.91
  ORG       1.53       1.86       1.92
  PER       2.25       2.63       2.60
  RANDOM    2.59       1.89       1.71

          Context=1  Context=2  Context=3
  LOC       1.18       1.39       1.42
  MISC      1.73       2.12       2.35
  ORG       1.42       1.59       1.64
  PER       2.01       2.31       2.56
  RANDOM    2.42       1.82       1.53

Problem: very small collections

Page 23: Predicting RE Performance (English)

Relation   Context size 1  Context size 2  Context size 3
BORN            2.02            2.17            2.39
DIED            1.89            1.86            1.83
INVENT          1.94            1.75            1.72
WROTE           1.59            1.59            1.53
RANDOM          6.87            6.24            5.79

Relation   Accuracy (strict)  Accuracy (partial)
BORN             0.73               0.96
DIED             0.34               0.97
INVENT           0.35               0.64
WROTE            0.12               0.50

• 2- and 3-word contexts correctly distinguish the "easy" tasks (BORN, DIED) from the "difficult" tasks (INVENT, WROTE)

• A 1-word context appears insufficient for predicting RE accuracy

Page 24: Other Dimensions: Sample Size (RE)

• Divergence increases with sample size

[Chart: normalized KL divergence for BORN, DIED, INVENT, and WROTE vs. relation sample size (10 to 40)]

[Chart: average KL divergence of the reference model LM_R vs. random sample size (1 to 100) for 1-, 2-, and 3-word contexts, repeated from the Reference LM slide]

Page 25: Results Summary

• Context models can be effective in predicting the success of information extraction systems

• Even a small sample of available entities can be sufficient for making accurate predictions

• The availability of a large collection is the most important limiting factor

Page 26: Other Applications and Future Work

• Could use the results for:
  – active learning / training IE systems
  – improved boundary detection for NER
  – improved confidence estimation of extractions
    • e.g., Culotta and McCallum [HLT 2004]

• For better results, could incorporate:
  – internal contexts, gazetteers (e.g., for LOC entities)
    • e.g., Agichtein & Ganti [KDD 2004], Cohen & Sarawagi [KDD 2004]
  – syntactic/logical distance
  – coreference resolution
  – word classes

Page 27: Summary

• Presented the first attempt to predict information extraction accuracy for a given task and collection

• Developed a general, system-independent method utilizing Language Modelling techniques

• Estimates for extraction accuracy can help:
  – deploy information extraction systems
  – port information extraction systems to new tasks, domains, collections, and languages

Page 28: For More Information

Text Mining, Search, and Navigation Group: http://research.microsoft.com/tmsn/