
Page 1: Using Semantic Relations to Improve Information Retrieval Tom Morton

Using Semantic Relations to Improve Information Retrieval

Tom Morton

Page 2: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

NLP techniques have been largely unsuccessful at information retrieval. Why?
– Document retrieval has been the primary measure of information retrieval success. Document retrieval reduces the need for NLP techniques:
 – Discourse factors can be ignored.
 – Query words perform word-sense disambiguation.
– Lack of robustness: NLP techniques are typically not as robust as word indexing.

Page 3: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Paragraph retrieval for natural-language questions.
– Paragraphs can be influenced by discourse factors.
– Correctness of answers to natural language questions can be accurately determined automatically.
– Standard precursor to the TREC question answering task.

What NLP technologies might help at this information retrieval task and are they robust enough?

Page 4: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Question Analysis:
– Questions tend to specify the semantic type of their answer. This component tries to identify this type.

Named-Entity Detection:
– Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.

Page 5: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Question Analysis:
– The category predicted is appended to the question.

Named-Entity Detection:
– The NE categories found in text are included as new terms.

This approach requires additional question terms to be in the paragraph.

What party is John Major in? (ORGANIZATION)

It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. +ORGANIZATION +PERSON

Page 6: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Coreference Relations:
– Interpretation of a paragraph may depend on the context in which it occurs.

Syntactically-based Categorical Relation Extraction:
– Appositive and predicate nominative constructions provide descriptive terms about entities.

Page 7: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Coreference:
– Use coreference relationships to introduce new terms referred to but not present in the paragraph’s text.

How long was Margaret Thatcher the prime minister? (DURATION)

The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. +MARGARET +THATCHER +PRIME +MINISTER +DURATION

Page 8: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Categorical Relation Extraction:
– Identifies the DESCRIPTION category.
– Allows descriptive terms to be used in term expansion (see the sketch after the examples below).

Famed architect Frank Lloyd Wright… +DESCRIPTION

Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT

Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON)
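To make the term-expansion idea concrete, here is a minimal sketch in Python (with a hypothetical helper name, expansion_terms; an illustration of the idea, not the thesis implementation) that takes an entity's mentions, the descriptive terms recovered from categorical relations, and a paragraph's own tokens, and returns the extra terms that would be indexed with that paragraph:

```python
# Minimal sketch: propagate descriptive terms learned from a categorical
# relation ("Famed architect Frank Lloyd Wright") to other paragraphs that
# mention the same entity. Names and data structures are hypothetical.

def expansion_terms(entity_mentions, descriptions, paragraph_tokens):
    """Return extra terms to index alongside a paragraph's own words."""
    extra = []
    for entity in entity_mentions:                 # e.g. "Frank Lloyd Wright"
        extra.extend(entity.upper().split())       # +FRANK +LLOYD +WRIGHT
        for desc in descriptions.get(entity, []):  # e.g. "famed architect"
            extra.extend(desc.upper().split())     # +FAMED +ARCHITECT
    # Only add terms that are not already present in the paragraph itself.
    present = {t.upper() for t in paragraph_tokens}
    return [t for t in extra if t not in present]

descriptions = {"Frank Lloyd Wright": ["famed architect"]}
para = "Buildings he designed include the Guggenheim Museum in New York".split()
print(expansion_terms(["Frank Lloyd Wright"], descriptions, para))
# ['FRANK', 'LLOYD', 'WRIGHT', 'FAMED', 'ARCHITECT']
```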

Page 9: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

[Architecture diagram. Indexing: Documents → Pre-processing → NE Detection, Coreference Resolution, Categorical Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]

Page 10: Using Semantic Relations to Improve Information Retrieval Tom Morton

Introduction

Will these semantic relations improve paragraph retrieval?
– Are the implementations robust enough to see a benefit across large document collections and question sets?
– Are there enough questions where these relationships are required to find an answer? Questions need only be answered once.

Short Answer: Yes!

Page 11: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Question Analysis
Paragraph Retrieval
Conclusion
Proposed Work

Page 12: Using Semantic Relations to Improve Information Retrieval Tom Morton

Preprocessing

Paragraph Detection Sentence Detection Tokenization POS Tagging NP-Chunking

Page 13: Using Semantic Relations to Improve Information Retrieval Tom Morton

Preprocessing

Paragraph finding:
– Explicitly marked: newline, <p>, blank line, etc.
– Implicitly marked: What is the column width of this document? Would this capitalized, likely sentence-initial word fit on the previous line? (See the sketch below.)

Sentence Detection:
– Is this [.?!] the end of a sentence?
– Use software developed in Reynar & Ratnaparkhi 97.
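The column-width test above can be read as a simple heuristic; the sketch below is one possible rendering of it (my interpretation, not the author's code): estimate the column width from the longest line, then treat a line break as an implicit paragraph break when the capitalized word that starts the next line would have fit on the current line.

```python
# Sketch of the implicit paragraph-break heuristic (hypothetical implementation).
def find_paragraph_breaks(lines):
    """Return indices i such that a paragraph break follows lines[i]."""
    if not lines:
        return []
    width = max(len(line) for line in lines)         # estimated column width
    breaks = []
    for i in range(len(lines) - 1):
        nxt = lines[i + 1].strip()
        if not nxt:                                   # blank line: explicit break
            breaks.append(i)
            continue
        first_word = nxt.split()[0]
        fits = len(lines[i]) + 1 + len(first_word) <= width
        # If the capitalized word starting the next line would have fit on this
        # line, the line break was deliberate: treat it as a paragraph break.
        if first_word[0].isupper() and fits:
            breaks.append(i)
    return breaks
```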

Page 14: Using Semantic Relations to Improve Information Retrieval Tom Morton

Preprocessing

Tokenization:
– Are there additional tokens in this initial space-delimited set of tokens?
– Use techniques described in Reynar 98.

POS Tagging:
– Use software developed in Ratnaparkhi 96.

Page 15: Using Semantic Relations to Improve Information Retrieval Tom Morton

Preprocessing

NP-Chunking
– Developed a maxent tagging model where each token is assigned one of three tags: Start-NP, Continue-NP, or Other (decoded into chunks in the sketch below).
– Software is very similar to the POS tagger.
– Performance was evaluated to be at or near state-of-the-art.
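To make the tagging scheme concrete, the sketch below (not the thesis code) decodes a Start-NP / Continue-NP / Other tag sequence back into noun-phrase chunks; in the actual system a maxent model predicts one of these three tags for each token.

```python
# Decode a Start-NP/Continue-NP/Other tag sequence into NP chunks (sketch).
def tags_to_chunks(tokens, tags):
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "Start-NP":              # open a new chunk
            if current:
                chunks.append(current)
            current = [token]
        elif tag == "Continue-NP" and current:
            current.append(token)          # extend the open chunk
        else:                              # Other: close any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

print(tags_to_chunks(
    ["Japanese", "automaker", "Mazda", "Motor", "Corp.", "reported", "a", "loss"],
    ["Start-NP", "Continue-NP", "Continue-NP", "Continue-NP", "Continue-NP",
     "Other", "Start-NP", "Continue-NP"]))
# ['Japanese automaker Mazda Motor Corp.', 'a loss']
```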

Page 16: Using Semantic Relations to Improve Information Retrieval Tom Morton

Preprocessing

Producing Robust Components
– Sentence, Tokenization, and POS-tagging components were all retrained:
 Added small samples of text from the paragraph retrieval domains to the WSJ-based training data.
 Allowed components to deal with editorial conventions which differed from the Wall Street Journal.

Page 17: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Paragraph Retrieval
Question Analysis
Conclusion
Proposed Work

Page 18: Using Semantic Relations to Improve Information Retrieval Tom Morton

Named-Entity Detection

Task Approach 1 Approach 2

Page 19: Using Semantic Relations to Improve Information Retrieval Tom Morton

Named-Entity Detection

Task:
– Identify the following categories: Person, Location, Organization, Money, Percentage, Time Point.

Approach 1:
– Use an existing NE-detector.
 Performance on some genres of text was poor.
 Couldn’t add new categories.
 Couldn’t retrain the classifier.

Page 20: Using Semantic Relations to Improve Information Retrieval Tom Morton

Named-Entity Detection

Approach 2:
– Train a maxent classifier on the output of an existing NE-detector.
 Used BBN’s MUC NE tagger (Bikel et al. 1997) to create a corpus.
 Combined Time and Date tags to create the “Time Point” category.
 Added a small sample of tagged text from the paragraph retrieval domains.
– Constructed rule-based models for additional categories: Distance and Amount (see the sketch below).
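The slides do not show the rules themselves, so the following is only a guess at what the rule-based Distance and Amount recognizers might look like: regular expressions over a number followed by a unit word.

```python
import re

# Hypothetical rule-based recognizers for the Distance and Amount categories;
# the real rules used in the thesis are not shown in the slides.
DISTANCE = re.compile(
    r"\b\d[\d,.]*\s*(?:miles?|kilometers?|km|meters?|feet|foot|inches?|yards?)\b",
    re.IGNORECASE)
AMOUNT = re.compile(
    r"\b\d[\d,.]*\s*(?:tons?|gallons?|liters?|pounds?|barrels?|acres?)\b",
    re.IGNORECASE)

def rule_based_categories(text):
    cats = set()
    if DISTANCE.search(text):
        cats.add("DISTANCE")
    if AMOUNT.search(text):
        cats.add("AMOUNT")
    return cats

print(rule_based_categories("Mount Shasta is 14,179 feet high."))  # {'DISTANCE'}
```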

Page 21: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Question Analysis
Paragraph Retrieval
Conclusion
Proposed Work

Page 22: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Task Approach Results Related Work

Page 23: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Task:
– Determine the space of entity extents:
 Basal noun phrases.
  – Named entities consisting of multiple basal noun phrases are treated as a single entity.
 Pre-nominal proper nouns.
 Possessive pronouns.
– Determine which extents refer to the same entity in the world.

Page 24: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Approach (Morton 2000)
– Divide referring expressions into three classes:
 Singular third person pronouns.
 Proper nouns.
 Definite noun phrases.
– Create a separate resolution approach for each class.
– Apply the resolution approaches to text in an interleaved fashion.

Page 25: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Singular Third Person Pronouns
– Compare the pronoun to each entity in the current sentence and the previous two sentences.
– Compute argmax_i p(coref | pronoun, entity_i) using a maxent model.
– Compute p(nonref | pronoun) using a maxent model.
– If p(coref_i) > p(nonref), resolve the pronoun to entity_i (see the sketch below).
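A minimal sketch of that decision rule, with hypothetical scoring functions p_coref and p_nonref standing in for the two maxent models:

```python
# Decision rule for singular third-person pronoun resolution (sketch).
# p_coref(pronoun, entity) and p_nonref(pronoun) stand in for the two
# maxent models described on this slide; they are not real APIs.
def resolve_pronoun(pronoun, candidate_entities, p_coref, p_nonref):
    """Return the best candidate entity, or None if non-referential wins."""
    if not candidate_entities:
        return None
    best = max(candidate_entities, key=lambda e: p_coref(pronoun, e))
    return best if p_coref(pronoun, best) > p_nonref(pronoun) else None

# Toy usage with made-up probabilities:
scores = {"Margaret Thatcher": 0.70, "John Major": 0.20, "the Conservative Party": 0.10}
best = resolve_pronoun("she", list(scores),
                       p_coref=lambda p, e: scores[e],
                       p_nonref=lambda p: 0.05)
print(best)  # Margaret Thatcher
```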

Page 26: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Candidate entities in context:
1. John Major, a truly worthy…
2. Margaret Thatcher, her, …
3. The Conservative Party
4. the undoubted exception
5. Winston Churchill
6. …

[Diagram: the pronoun “she” is scored against each candidate entity and the non-referential option, with probabilities such as 70%, 20%, 10%, 10%, and 5%.]

The pronoun is resolved to an entity rather than to the most recent extent.

Page 27: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Classifier Features:
– Distance: in NPs, in sentences, left-to-right, right-to-left.
– Syntactic Context: the NP’s position in the sentence, the NP’s surrounding context, the pronoun’s syntactic context.
– Salience: number of times the entity has been mentioned.
– Gender: pairings of the pronoun’s gender and the lexical items in the entity.

Page 28: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Proper Nouns:
– Remove honorifics, corporate designators, determiners, and pre-nominal appositives.
– Compare the proper noun to each entity preceding it.
– Resolve it to the first preceding proper noun extent for which this proper noun is a substring (observing word boundaries), as sketched below.

Bob Smith <- Mr. Smith <- Bob <- Smith
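A rough sketch of the substring rule, assuming mentions have already been stripped of honorifics, designators, determiners, and pre-nominal appositives (the normalization itself is omitted):

```python
# Resolve a proper noun to the first preceding proper-noun extent of which it
# is a word-bounded substring (sketch; normalization of honorifics omitted).
def resolve_proper_noun(mention, preceding_extents):
    """preceding_extents is ordered from most recent to least recent."""
    words = mention.split()
    n = len(words)
    for extent in preceding_extents:
        ewords = extent.split()
        # Word-boundary substring test: the mention's words appear contiguously.
        if any(ewords[i:i + n] == words for i in range(len(ewords) - n + 1)):
            return extent
    return None  # no antecedent: start a new entity

print(resolve_proper_noun("Smith", ["Bob", "Bob Smith"]))  # Bob Smith
```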

Page 29: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Definite Noun Phrases
– Remove determiners.
– Resolve to the first preceding entity which shares the same head word and modifiers (see the sketch below).
 the big mean man <- the big man <- the man.
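A small sketch under one reading of the rule and the example above (the definite NP's modifiers must all appear among the antecedent's modifiers and the head words must match); this interpretation is mine, not necessarily the exact condition used:

```python
# Resolve a definite NP to the first preceding extent with the same head word
# whose modifiers cover the definite NP's modifiers (sketch, one interpretation).
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def head_and_modifiers(np):
    words = [w for w in np.lower().split() if w not in DETERMINERS]
    if not words:
        return None, set()
    return words[-1], set(words[:-1])      # head = last word, rest = modifiers

def resolve_definite_np(mention, preceding_extents):
    head, mods = head_and_modifiers(mention)
    if head is None:
        return None
    for extent in preceding_extents:       # most recent first
        e_head, e_mods = head_and_modifiers(extent)
        if head == e_head and mods <= e_mods:
            return extent
    return None

print(resolve_definite_np("the man", ["the big man", "the big mean man"]))
# the big man
```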

Page 30: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Results:
– Trained the pronominal model on 200 WSJ documents with only pronouns annotated.
 Interleaved with the other resolution approaches to compute mention statistics.
– Evaluated using 10-fold cross validation.
– P 94.4%, R 76.0%, F 84.2%.

Page 31: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Results:
– Evaluated the proper noun and definite noun phrase approaches on 80 hand-annotated WSJ files.
 Proper Nouns: P 92.1%, R 88.0%, F 90.0%.
 Definite NPs: P 82.5%, R 47.4%, F 60.2%.
– Combined Evaluation: MUC6 Coreference Task.
 Annotation guidelines are not identical.
 Ignored headline and dateline coreference.
 Included appositives and predicate nominatives.
 P 79.6%, R 44.5%, F 57.1%.

Page 32: Using Semantic Relations to Improve Information Retrieval Tom Morton

Coreference

Related Work
– Ge et al. 1998:
 Presents a similar statistical treatment.
 Assumes non-referential pronouns are pre-marked.
 Assumes mention statistics are pre-computed.
– Soon et al. 2001: Targets MUC tasks. P 65.5-67.3%, R 56.1-58.3%, F 60.4-62.6%.
– Ng and Cardie 2002: Targets MUC tasks. P 70.8-78.0%, R 55.7-64.2%, F 63.1-70.4%.

Our approach favors precision over recall:
– Coreference relationships are used in passage retrieval.

Page 33: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Question Analysis
Paragraph Retrieval
Conclusion
Proposed Work

Page 34: Using Semantic Relations to Improve Information Retrieval Tom Morton

Categorical Relation Extraction

Task Approach Results Related Work

Page 35: Using Semantic Relations to Improve Information Retrieval Tom Morton

Categorical Relation Extraction

Task
– Identify whether a categorical relation exists between NPs in the following contexts:
 Appositives: NP, NP.
 Predicate Nominatives: NP copula NP.
 Pre-nominal appositives: (NP (SNP Japanese automaker) Mazda Motor Corp.)

Page 36: Using Semantic Relations to Improve Information Retrieval Tom Morton

Categorical Relation Extraction

Approach:
– Appositives and predicate nominatives:
 Create a single binary maxent classifier to determine when NPs in the appropriate syntactic context express a categorical relationship.
– Pre-nominal appositives:
 Create a maxent classifier to determine where the split exists between the appositive and the rest of the noun phrase.
– Use the lexical and POS-based features of the noun phrases:
 Use word/POS pair features.
 Differentiate between head and modifier words.
 The pre-nominal appositive classifier also uses a word’s presence on a list of 69 titles as a feature.
(A rough sketch of candidate extraction follows below.)
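As an illustration of the syntactic contexts fed to the classifier, the sketch below extracts candidate appositive (NP , NP) and predicate nominative (NP copula NP) pairs from a chunked sentence; the chunk representation and copula list are assumptions, and the maxent classifier itself is not shown.

```python
# Sketch: find candidate appositive (NP , NP) and predicate nominative
# (NP copula NP) contexts from a chunked sentence (hypothetical representation).
COPULAS = {"is", "was", "are", "were", "be", "been", "being"}

def candidate_pairs(chunks):
    """chunks: list of (text, label) with label 'NP' or 'TOK' for other tokens."""
    pairs = []
    for i in range(len(chunks) - 2):
        (t1, l1), (t2, l2), (t3, l3) = chunks[i:i + 3]
        if l1 == "NP" and l3 == "NP":
            if l2 == "TOK" and t2 == ",":
                pairs.append(("appositive", t1, t3))
            elif l2 == "TOK" and t2.lower() in COPULAS:
                pairs.append(("predicate_nominative", t1, t3))
    return pairs

chunks = [("Mazda Motor Corp.", "NP"), (",", "TOK"),
          ("the Japanese automaker", "NP"), (",", "TOK"),
          ("said", "TOK"), ("it", "NP")]
print(candidate_pairs(chunks))
# [('appositive', 'Mazda Motor Corp.', 'the Japanese automaker')]
```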

Page 37: Using Semantic Relations to Improve Information Retrieval Tom Morton

Categorical Relation Extraction

Results
– Appositives and predicate nominatives:
 Training: 1000/1200 examples; Test: 3-fold cross validation.
 Appositives: P 90.9%, R 79.1%, F 84.6%.
 Predicate Nominatives: P 78.8%, R 74.4%, F 76.5%.
– Pre-nominal appositives:
 Training: 2000 examples; used active learning to select new examples for annotation (884 positive).
 Test: 1500 examples (81 positive).
 P 98.6%, R 85.2%, F 91.4%.

Page 38: Using Semantic Relations to Improve Information Retrieval Tom Morton

Categorical Relation Extraction

Related Work
– Soon et al. (2001) defines a specific feature to identify appositive constructions.
– Hovy et al. (2001) uses syntactic patterns to identify “DEFINITION” and “WHY FAMOUS” types.

Our work is unique in that it:
– Provides a statistical treatment of extracting categorical relations.
– Uses categorical relations for term expansion in paragraph indexing.

Page 39: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Question Analysis
Paragraph Retrieval
Conclusion
Proposed Work

Page 40: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Task Approach Results Related Work

Page 41: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Task
– Map natural language questions onto the following categories: Person, Location, Organization, Time Point, Duration, Money, Percentage, Distance, Amount, Description, Other.
– Where is West Point Military Academy? (Location)
– When was ice cream invented? (Time Point)
– How high is Mount Shasta? (Distance)

Page 42: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Approach
– Identify the question word: Who, What, When, Where, Why, Which, Whom, How (JJ|RB)*, Name.
– Identify the focus noun: the noun phrase which specifies the type of the answer. Use a series of syntactic patterns to identify it.
– Train a maxent classifier to predict which category the answer falls into.

Page 43: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Focus Noun Syntactic Patterns
– Who copula (np)
– What copula* (np)
– Which copula (np)
– Which of (np)
– How (JJ|RB) (np)
– Name of (np)
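One possible rendering of these patterns over an NP-chunked question (an illustration of the pattern idea only; the actual matching is done over the system's own chunked representation):

```python
# Illustrative matcher for the focus-noun patterns over an NP-chunked question.
# Tokens are plain strings; NP chunks are ("NP", text) tuples. A sketch only.
COPULAS = {"is", "are", "was", "were", "be"}

def is_np(x):
    return isinstance(x, tuple) and len(x) == 2 and x[0] == "NP"

def focus_noun(chunked):
    words = [x.lower() if isinstance(x, str) else x for x in chunked]
    if not words:
        return None
    qw, rest = words[0], words[1:]
    # Who copula (np) / Which copula (np)
    if qw in {"who", "which"} and len(rest) >= 2 and rest[0] in COPULAS and is_np(rest[1]):
        return rest[1][1]
    # What copula* (np)
    if qw == "what":
        if rest and rest[0] in COPULAS:
            rest = rest[1:]
        if rest and is_np(rest[0]):
            return rest[0][1]
    # Which of (np)
    if qw == "which" and len(rest) >= 2 and rest[0] == "of" and is_np(rest[1]):
        return rest[1][1]
    # How (JJ|RB) (np), e.g. "How many (NP people) ..."
    if qw == "how" and len(rest) >= 2 and isinstance(rest[0], str) and is_np(rest[1]):
        return rest[1][1]
    # Name of (np)
    if qw == "name" and len(rest) >= 2 and rest[0] == "of" and is_np(rest[1]):
        return rest[1][1]
    return None

print(focus_noun(["Who", "is", ("NP", "Colin Powell"), "?"]))                 # Colin Powell
print(focus_noun(["How", "many", ("NP", "people"), "live", "there", "?"]))    # people
```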

Page 44: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Classifier Features
– Lexical features: question word, matrix verb, head noun of the focus noun phrase, modifiers of the focus noun.
– Word-class features: WordNet synsets and entry number of the focus noun.
– Location of the focus noun: is it the last NP?
 Who is (NP-Focus Colin Powell)?

Page 45: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Question:
– Which poet was born in 1572 and appointed Dean of St. Paul's Cathedral in 1621?

Features:
– def qw=which verb=which_was rw=was rw=born rw=in rw=1572 rw=and rw=appointed rw=Dean rw=of rw=St rw=. rw=Paul rw='s rw=Cathedral rw=in rw=1621 rw=? hw=poet ht=NN s0=poet1 s0=writer1 s0=communicator1 s0=person1 s0=life_form1 s0=causal_agent1 s0=entity1 fnIsLast=false
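The s0= features are the WordNet hypernym chain of the focus noun. A small sketch of producing such a chain with NLTK's WordNet interface (assuming NLTK and its WordNet data are installed; synset names differ slightly from the poet1/writer1 labels above, and the exact chain depends on the WordNet version):

```python
from nltk.corpus import wordnet as wn

# Walk the hypernym chain of the focus noun's first sense (sketch).
# Requires: pip install nltk; nltk.download('wordnet').
def hypernym_chain(noun):
    senses = wn.synsets(noun, pos=wn.NOUN)
    if not senses:
        return []
    chain, synset = [], senses[0]
    while synset is not None:
        chain.append(synset.name())              # e.g. 'poet.n.01'
        hypernyms = synset.hypernyms()
        synset = hypernyms[0] if hypernyms else None
    return chain

print(hypernym_chain("poet"))
# ['poet.n.01', 'writer.n.01', 'communicator.n.01', 'person.n.01', ...]
```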

Page 46: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Results:
– Training: 1888 hand-tagged examples from web-logs and web searches.
– Test:
 TREC8 Questions – 89.0%.
 TREC9 Questions – 76.6%.

Page 47: Using Semantic Relations to Improve Information Retrieval Tom Morton

Question Analysis

Related Work
– Ittycheriah et al. 2001:
 Similar:
  – Uses a maximum entropy model.
  – Uses focus nouns and WordNet.
 Differs:
  – Assumes the first NP is the focus noun.
  – 3300 annotated questions.
  – Uses MUC NE categories plus PHRASE and REASON.
  – Uses feature selection with held-out data.

Page 48: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Named-Entity Detection
Coreference
Categorical Relation Extraction
Question Analysis
Paragraph Retrieval
Conclusion
Proposed Work

Page 49: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Task Approach Results Related Work

Page 50: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Task
– Given a natural language question:
 TREC-9 question collection.
– And a collection of documents:
 ~1M documents: AP, LA Times, WSJ, Financial Times, FBIS, and SJM.
– Return a paragraph which answers the question.
 Used TREC-9 answer patterns to evaluate.

Page 51: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Approach:
– Indexing (see the sketch below):
 Use the named-entity detector to supplement paragraphs with terms for each NE category present in the text.
 Use coreference relationships to introduce new terms referred to but not present in the paragraph’s text.
 Use syntactically-based categorical relations to create a DESCRIPTION category and for term expansion.
 Used an open source tf*idf based search engine for retrieval (Lucene).
  – No length normalization.
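A condensed sketch of the indexing-side augmentation (my reconstruction from the slides, not the thesis code): each paragraph is indexed with its own text plus NE category markers, coreference-expanded terms, and descriptive terms, and the result can then be handed to any tf*idf engine such as Lucene. The inputs ne_categories, coref_terms, and description_terms are assumed to come from the components described earlier.

```python
# Build the augmented text that gets indexed for one paragraph (sketch).
# ne_categories, coref_terms, and description_terms are assumed to be produced
# by the NE, coreference, and categorical-relation components, respectively.
def augmented_paragraph(paragraph_text, ne_categories, coref_terms, description_terms):
    extra = []
    extra += sorted(ne_categories)      # e.g. ["ORGANIZATION", "PERSON"]
    extra += coref_terms                # e.g. ["MARGARET", "THATCHER"]
    extra += description_terms          # e.g. ["FAMED", "ARCHITECT"]
    return paragraph_text + " " + " ".join(extra)

doc = augmented_paragraph(
    "It probably won't be clear for some time whether the Conservative Party has "
    "chosen in John Major a truly worthy successor to Margaret Thatcher...",
    ne_categories={"ORGANIZATION", "PERSON"},
    coref_terms=[],
    description_terms=[])
print(doc)
# On the retrieval side the query is treated symmetrically: the category
# predicted by question analysis is appended to the question terms.
```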

Page 52: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Approach:
– Retrieval:
 Use the question analysis component to predict the answer category and append it to the question.
– Evaluate using TREC-9 questions and answer patterns: 500 questions.

Page 53: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

[Architecture diagram, as in the Introduction. Indexing: Documents → Pre-processing → NE Detection, Coreference Resolution, Syntactic Relation Extraction → Paragraphs+ → Search Engine. Retrieval: Question → Question Analysis → Search Engine → Paragraphs.]

Page 54: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Results:

[Chart: Number of Questions Answered (y-axis, roughly 205 to 405) vs. Number of Passages retrieved (x-axis, 5 to 45) for three configurations: w/ Term Expansion, w/ Semantic Categories, and Baseline.]

Page 55: Using Semantic Relations to Improve Information Retrieval Tom Morton

Paragraph Retrieval

Related Work
– Prager et al. 2000: indexes NE categories as terms for question answering passage retrieval.

Our approach is unique in that it:
– Uses coreference and categorical relation extraction to perform term expansion.
– Demonstrates that this improves performance.

Page 56: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Question Analysis
Named-Entity Detection
Coreference
Categorical Relation Extraction
Paragraph Retrieval
Conclusion
Proposed Work

Page 57: Using Semantic Relations to Improve Information Retrieval Tom Morton

Conclusion

Developed and evaluated new techniques in:
– Coreference Resolution.
– Categorical Relation Extraction.
– Question Analysis.

Integrated these techniques with existing NLP components:
– NE detection, POS tagging, sentence detection, etc.

Demonstrated that these techniques can be used to improve performance in an information retrieval task:
– Paragraph retrieval for natural language questions.

Page 58: Using Semantic Relations to Improve Information Retrieval Tom Morton

Overview

Introduction
Pre-processing
Question Analysis
Named-Entity Detection
Coreference
Categorical Relation Extraction
Paragraph Retrieval
Conclusion
Proposed Work

Page 59: Using Semantic Relations to Improve Information Retrieval Tom Morton

Proposed Work

Named Entity Detection:
– Evaluate existing NE performance.
 Use MUC NE evaluation data.
– Add additional NE categories: Age.
 Use active learning to annotate data for classifiers.

Page 60: Using Semantic Relations to Improve Information Retrieval Tom Morton

Proposed Work

Coreference:
– Annotate a 200-document corpus with all NP coreference. (done)
– Create a statistical model for proper nouns and definite noun phrases. (in progress)
– Incorporate named-entity information into the coreference model. (in progress)
– Evaluate using the new corpus, and MUC 6 and 7 data.

Page 61: Using Semantic Relations to Improve Information Retrieval Tom Morton

Proposed Work

Categorical Relation Extraction:
– Incorporate named-entity information and WordNet classes for common nouns.
 Similar to the approach used in the Question Analysis component.

Page 62: Using Semantic Relations to Improve Information Retrieval Tom Morton

Proposed Work

Question Analysis:
– Use a parser to provide a richer set of features for the classifier (implemented; Ratnaparkhi 97).
– Construct a model to identify the focus noun phrase.
 Where did Hillary Clinton go to (NP-Focus college)?
– Expand the set of answer categories.
 How old is Dick Clark? (Age)

Page 63: Using Semantic Relations to Improve Information Retrieval Tom Morton

Proposed Work

Paragraph Retrieval:
– Rerun the paragraph retrieval evaluation after completion of the proposed work.
– Evaluate using TREC X questions.