Lecture 22: NLP for IR

Prof. Ray Larson, University of California, Berkeley, School of Information. Principles of Information Retrieval. Lecture 22: NLP for IR. Today: MiniTREC results; Review (Cheshire III Design, GRID-based DLs); NLP for IR; Text Summarization.

2010.04.21 - SLIDE 1 - IS 240 – Spring 2010

Prof. Ray Larson, University of California, Berkeley

School of Information

Principles of Information Retrieval

Lecture 22: NLP for IR

2010.04.21 - SLIDE 2 - IS 240 – Spring 2010

Today
• MiniTREC results
• Review
  – Cheshire III Design – GRID-based DLs
• NLP for IR
• Text Summarization

Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

2010.04.21 - SLIDE 3 - IS 240 – Spring 2010

MiniTREC 2010 - All Submissions

2010.04.21 - SLIDE 4 - IS 240 – Spring 2010

MiniTREC 2010 - Best Runs

2010.04.21 - SLIDE 5 - IS 240 – Spring 2010

Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)

[Figure: layered Grid architecture, with "Grid middleware" as a side label]
• Layers: Applications; Application Toolkits; Grid Services; Grid Fabric
• Labels above the service layers: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High energy physics, Cosmology, Astrophysics, Combustion, ..., Portals, Remote sensors, ...
• Grid Services: Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

2010.04.21 - SLIDE 6 - IS 240 – Spring 2010

Grid Architecture (ECAI/AS Grid Digital Library Workshop)

[Figure: the same layered Grid architecture, with "Grid middleware" as a side label, extended with humanities and digital library labels]
• Layers: Applications; Application Toolkits; Grid Services; Grid Fabric
• Labels above the service layers: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High energy physics, Cosmology, Astrophysics, Combustion, Humanities computing, Digital Libraries, Bio-Medical, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, ...
• Grid Services: Protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services

2010.04.21 - SLIDE 7 - IS 240 – Spring 2010

Grid IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is a challenge for sub-second retrieval
• Unlike most other typical Grid processes, IR is potentially less computing intensive and more data intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
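
One way to check that retrieval performance is preserved is to compute precision and recall for the same query on the single-node and the distributed run and compare the two. A minimal sketch, assuming results and relevance judgments are available as collections of document IDs; the function name and the sample IDs are illustrative only:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: iterable of document IDs returned by the system
    relevant:  iterable of document IDs judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"])
```

Running the same function over both result lists gives directly comparable figures, so any loss introduced by distribution shows up as a drop in either value.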

2010.04.21 - SLIDE 8 - IS 240 – Spring 2010

Context

[Figure: system context in three layers: Application Layer, Digital Library Layer, Data Grid Layer]
Components shown:
• User Interface: Web Browser, Multivalent, Dedicated Client
• Protocol Handler: Apache + Mod_Python + Cheshire3
• Process Management: Kepler, Cheshire3
• Document Parsers: Multivalent, ...
• Text Mining Tools (Tsujii Labs, ...): Natural Language Processing, Information Extraction
• Data Mining Tools (Orange, Weka, ...): Classification, Clustering
• Information System (Cheshire3): Search / Retrieve, Index / Store
• Term Management: Termine, WordNet, ...
• Data Grid: SRB, iRODS
• User Interface: MySRB, PAWN
• Process Management: Kepler, iRODS rules
• Arrow labels: Query, Results, Export, Parse, Store

2010.04.21 - SLIDE 9 - IS 240 – Spring 2010

Grid Architecture

[Figure: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage]
• Master Task to each Slave Task: (workflow, process, document)
• Slave Task to Data Grid: fetch document
• Data Grid to Slave Task: document
• Slave Task to GPFS Temporary Storage: extracted data

2010.04.21 - SLIDE 10 - IS 240 – Spring 2010

Grid Architecture - Phase 2

[Figure: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage]
• Master Task to each Slave Task: (index, load)
• Slave Task and GPFS Temporary Storage: fetch extracted data
• Slave Task and Data Grid: store index
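
The two phases can be sketched with a thread pool standing in for the slave tasks. This is only an illustration under stated assumptions: the dictionaries `data_grid` and `temp_storage` replace the real Data Grid and GPFS storage, all function names are hypothetical, and the master merges the postings itself for simplicity, whereas the figure shows each slave task storing its index.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins (assumptions) for the Data Grid and GPFS temporary storage.
data_grid = {"doc1": "grid retrieval systems", "doc2": "grid text mining"}
temp_storage = {}


def extract(doc_id):
    """Phase 1 task: fetch a document and write its extracted data to temp storage."""
    text = data_grid[doc_id]              # fetch document
    temp_storage[doc_id] = text.split()   # extracted data
    return doc_id


def postings(doc_id):
    """Phase 2 task: fetch extracted data and turn it into (term, doc_id) postings."""
    return [(term, doc_id) for term in temp_storage[doc_id]]


def master(doc_ids, workers=2):
    """Master task: run phase 1 on all documents, then phase 2, then merge."""
    index = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(extract, doc_ids))            # phase 1 completes first
        for result in pool.map(postings, doc_ids):  # phase 2, in input order
            for term, doc_id in result:
                index.setdefault(term, []).append(doc_id)
    return index


index = master(["doc1", "doc2"])
```

Keeping the phases strictly sequential mirrors the slides: no indexing starts until every document's extracted data is in temporary storage.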

2010.04.21 - SLIDE 11 - IS 240 – Spring 2010

Today
• Natural Language Processing and IR
  – Based on Papers in Reader and on
    • David Lewis & Karen Sparck Jones, “Natural Language Processing for Information Retrieval,” Communications of the ACM, 39(1), Jan. 1996
• Text summarization: Lecture from Ed Hovy (USC)

2010.04.21 - SLIDE 12 - IS 240 – Spring 2010

Natural Language Processing and IR
• The main approach in applying NLP to IR has been to attempt to address
  – Phrase usage vs individual terms
  – Search expansion using related terms/concepts
  – Attempts to automatically exploit or assign controlled vocabularies

2010.04.21 - SLIDE 13 - IS 240 – Spring 2010

NLP and IR
• Much early research showed that (at least in the restricted test databases tested)
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

2010.04.21 - SLIDE 14 - IS 240 – Spring 2010

NLP and IR
• It is not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
  – E.g., use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

2010.04.21 - SLIDE 15 - IS 240 – Spring 2010

General Framework of NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDES 16-20 - IS 240 – Spring 2010

General Framework of NLP

[Diagram, built up over five slides: each processing stage is shown with its output for the input "John runs."]
• Input: John runs.
• Morphological and Lexical Processing: John run+s. (John: P-N; run: V or N; +s: 3-pre or plu)
• Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]]
• Semantic Analysis: Pred: RUN, Agent: John
• Context Processing / Interpretation: John is a student. He runs.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 21 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Morphological and Lexical Processing
– Tokenization
– Part of Speech Tagging
– Inflection/Derivation
– Compounding
– Term recognition (Ananiadou)

Domain Analysis (Appelt: 1999)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 22 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 23 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge
– Incomplete Lexicons
  • Open class words
  • Terms: Term recognition
  • Named Entities: Company names, Locations, Numerical expressions

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 24 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge
– Incomplete Grammar
  • Syntactic Coverage
  • Domain Specific Constructions
  • Ungrammatical Constructions

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 25 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge
– Incomplete Domain Knowledge
  • Interpretation Rules

Predefined Aspects of Information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 26 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 27 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion

Most words in English are ambiguous in terms of their parts of speech.
• runs: v/3pre, n/plu
• clubs: v/3pre, n/plu, and two meanings

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
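
To see why part-of-speech ambiguity leads to combinatorial explosion, note that the number of candidate tag sequences for a sentence is the product of the number of tags each word allows. A minimal sketch, assuming a toy lexicon; the entries for "runs" and "clubs" follow the slide, while the entry for "john" and the function name are illustrative:

```python
from math import prod

# Toy lexicon. "runs" and "clubs" follow the slide (v/3pre, n/plu);
# the entry for "john" is an assumption for illustration.
LEXICON = {
    "john": ["p-n"],
    "runs": ["v/3pre", "n/plu"],
    "clubs": ["v/3pre", "n/plu"],
}


def count_tag_sequences(words, lexicon=LEXICON):
    """Number of distinct tag sequences before any disambiguation."""
    return prod(len(lexicon[w.lower()]) for w in words)


n_sequences = count_tag_sequences(["John", "runs", "clubs"])
```

With only two tags per ambiguous word the count doubles with every such word, which is why later stages cannot simply enumerate all readings.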

2010.04.21 - SLIDE 28 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion
– Structural Ambiguities
– Predicate-argument Ambiguities

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 29 - IS 240 – Spring 2010

Structural Ambiguities

(1) Attachment Ambiguities
    John bought a car with large seats.
    John bought a car with $3000.

(2) Scope Ambiguities
    young women and men in the room

(3) Analytical Ambiguities
    Visiting relatives can be boring.

    The manager of Yaxing Benz, a Sino-German joint venture
    The manager of Yaxing Benz, Mr. John Smith

    John bought a car with Mary.
    $3000 can buy a nice car.

Semantic Ambiguities (1)

Semantic Ambiguities (2)
    Every man loves a woman.

Co-reference Ambiguities

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 30 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities: Combinatorial Explosion
– Structural Ambiguities
– Predicate-argument Ambiguities

Combinatorial Explosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 31 - IS 240 – Spring 2010

Note: Ambiguities vs Robustness
• More comprehensive knowledge: More Robust
  – big dictionaries
  – comprehensive grammar
• More comprehensive knowledge: More ambiguities
• Adaptability: Tuning, Learning

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 32 - IS 240 – Spring 2010

Framework of IE

IE as compromise NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 33 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Difficulties of NLP

(1) Robustness: Incomplete Knowledge
– Incomplete Domain Knowledge
  • Interpretation Rules

Predefined Aspects of Information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 35 - IS 240 – Spring 2010

Techniques in IE

(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted
(2) Ambiguities: Ignoring irrelevant ambiguities; Simpler NLP techniques
(3) Robustness: Coping with incomplete dictionaries (open class words); Ignoring irrelevant parts of sentences
(4) Adaptation Techniques: Machine Learning, Trainable systems

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2010.04.21 - SLIDE 36 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

Open class words: Named entity recognition
(ex) Locations, Persons, Companies, Organizations, Position names
– Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
– Machine Learning: HMM, Decision Trees
– Rules + Machine Learning

Part of Speech Tagger
– FSA rules
– Statistical taggers
– 95 %

F-Value 90
Domain Dependent
Local Context, Statistical Bias

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
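
The two domain specific rule templates above can be read as regular expressions. A rough sketch under stated assumptions: `<Word>` is taken to mean a capitalized word and `<Cpt-L>` a single capital letter, and the sample sentence, labels, and function name are hypothetical.

```python
import re

# Hypothetical regex readings of the two rule templates on the slide.
# <Word> is taken as a capitalized word, <Cpt-L> as a single capital letter.
COMPANY = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+, Inc\.")
PERSON = re.compile(r"\bMr\. [A-Z]\. [A-Z][a-z]+")


def find_entities(text):
    """Return (label, matched text) pairs, grouped by rule, then by position."""
    found = [("COMPANY", m.group()) for m in COMPANY.finditer(text)]
    found += [("PERSON", m.group()) for m in PERSON.finditer(text)]
    return found


# Made-up sample sentence for illustration.
entities = find_entities("Mr. J. Smith joined Acme Widgets, Inc. last year.")
```

Rules like these are cheap and precise for the patterns they cover, which is why the slide pairs them with machine learning for everything they miss.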

2010.04.21 - SLIDE 37 - IS 240 – Spring 2010

General Framework of NLP
[Diagram: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation]

FASTUS
Based on finite state automata (FSA)

1. Complex Words: Recognition of multi-words and proper names
2. Basic Phrases: Simple noun groups, verb groups and particles
3. Complex Phrases: Complex noun groups and verb groups
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
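
As an illustration of a finite-state stage such as step 2 (Basic Phrases), the sketch below recognizes simple noun groups of the form "optional determiner, any adjectives, one or more nouns" over part-of-speech tags. The pattern, the tag names (taken from the tagged example sentence used later in this lecture), and the function name are assumptions; this is not FASTUS's actual grammar.

```python
def noun_groups(tagged):
    """Return noun groups matching dt? jj* (nn|np)+ over (word, tag) pairs."""
    groups, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "dt":                      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "jj":         # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in ("nn", "np"):  # one or more nouns
            k += 1
        if k > j:                                     # accept: at least one noun seen
            groups.append([word for word, _ in tagged[i:k]])
            i = k
        else:                                         # reject: restart one token later
            i += 1
    return groups


TAGGED = [("The", "dt"), ("former", "jj"), ("Soviet", "jj"), ("President", "nn"),
          ("has", "vbz"), ("been", "vbn"), ("a", "dt"), ("local", "jj"),
          ("hero", "nn"), ("ever", "rb"), ("since", "in"), ("a", "dt"),
          ("Russian", "jj"), ("tank", "nn"), ("invaded", "vbd"),
          ("Wisconsin", "np"), (".", "per")]
groups = noun_groups(TAGGED)
```

Later stages in a cascade of this kind would take these groups as single units, so each stage works on a shorter and less ambiguous sequence than the one before.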

2010.04.21 - SLIDE 40 - IS 240 – Spring 2010

Using NLP
• Strzalkowski (in Reader)

[Diagram: Text -> NLP -> repres -> Dbase search; NLP: TAGGER -> PARSER -> TERMS]

2010.04.21 - SLIDE 41 - IS 240 – Spring 2010

Using NLP

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

2010.04.21 - SLIDE 42 - IS 240 – Spring 2010

Using NLP

TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
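
The step from the tagged sentence to the tagged and stemmed sentence can be sketched as follows. The small lookup table stands in for a real morphological analyzer, which the slides do not specify, and the function names are illustrative.

```python
# Tiny lookup table standing in for a real morphological analyzer (assumption).
LEMMAS = {"has": "have", "been": "be", "invaded": "invade"}


def parse_tagged(sentence):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs


def normalize(sentence):
    """Lowercase each word and replace it with its base form when known."""
    out = []
    for word, tag in parse_tagged(sentence):
        word = word.lower()
        out.append(f"{LEMMAS.get(word, word)}/{tag}")
    return " ".join(out)


TAGGED = ("The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt "
          "local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn "
          "invaded/vbd Wisconsin/np ./per")
stemmed = normalize(TAGGED)
```

Note that the tags are left untouched: "have" keeps vbz and "invade" keeps vbd, exactly as on the slide.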

2010.04.21 - SLIDE 43 - IS 240 – Spring 2010

Using NLP

PARSED SENTENCE
[assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]

2010.04.21 - SLIDE 44 - IS 240 – Spring 2010

Using NLP

EXTRACTED TERMS & WEIGHTS
President         2.623519    soviet            5.416102
President+soviet 11.556747    president+former 14.594883
Hero              7.896426    hero+local       14.314775
Invade            8.435012    tank              6.848128
Tank+invade      17.402237    tank+russian     16.030809
Russian           7.383342    wisconsin         7.785689
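
Head+modifier terms of the kind listed above (for example president+former or tank+russian) can be approximated from tags alone. The sketch below pairs each adjective with the next noun to its right. That rule is an assumption for illustration, not the parser-based method the slides describe: it does not produce pairs such as tank+invade, which need the parse, and it does not compute weights.

```python
def head_modifier_terms(tagged):
    """Return nouns plus head+modifier pairs from (word, tag) pairs.

    Each adjective is paired with the next noun to its right; any other
    tag in between discards the pending adjectives.
    """
    terms, pending = [], []
    for word, tag in tagged:
        if tag == "jj":
            pending.append(word)
        elif tag in ("nn", "np"):
            terms.append(word)
            terms.extend(f"{word}+{adj}" for adj in pending)
            pending = []
        else:
            pending = []
    return terms


STEMMED = [("the", "dt"), ("former", "jj"), ("soviet", "jj"), ("president", "nn"),
           ("have", "vbz"), ("be", "vbn"), ("a", "dt"), ("local", "jj"),
           ("hero", "nn"), ("ever", "rb"), ("since", "in"), ("a", "dt"),
           ("russian", "jj"), ("tank", "nn"), ("invade", "vbd"),
           ("wisconsin", "np"), (".", "per")]
terms = head_modifier_terms(STEMMED)
```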

2010.04.21 - SLIDE 45 - IS 240 – Spring 2010

Same Sentence, different system

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

2010.04.21 - SLIDE 46 - IS 240 – Spring 2010

Same Sentence, different system

CHUNKED SENTENCE (chunkparser – Tsujii)
(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )
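
Bracketed output like this is easy to load into a nested structure, from which candidate phrases for indexing can be read off. A minimal sketch in plain Python; the function names are illustrative and the reader assumes well-formed, balanced input:

```python
def read_tree(text):
    """Parse a bracketed tree such as '(NP (DT a) (NN tank))' into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # child subtree
            else:
                node.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                          # skip ")"
        return node

    return parse()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def noun_phrases(node):
    """Collect the text of every NP node in pre-order."""
    found = [" ".join(leaves(node))] if node[0] == "NP" else []
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(noun_phrases(child))
    return found


TREE = ("(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) "
        "(VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) "
        "(ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) "
        "(VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )")
phrases = noun_phrases(read_tree(TREE))
```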

2010.04.21 - SLIDE 47 - IS 240 – Spring 2010

Same Sentence, different system

Enju Parser
ROOT ROOT ROOT ROOT -1 ROOT been be VBN VB 5
been be VBN VB 5 ARG1 President president NNP NNP 3
been be VBN VB 5 ARG2 hero hero NN NN 8
a a DT DT 6 ARG1 hero hero NN NN 8
a a DT DT 11 ARG1 tank tank NN NN 13
local local JJ JJ 7 ARG1 hero hero NN NN 8
The the DT DT 0 ARG1 President president NNP NNP 3
former former JJ JJ 1 ARG1 President president NNP NNP 3
Russian russian JJ JJ 12 ARG1 tank tank NN NN 13
Soviet soviet NNP NNP 2 MOD President president NNP NNP 3
invaded invade VBD VB 14 ARG1 tank tank NN NN 13
invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15
has have VBZ VB 4 ARG1 President president NNP NNP 3
has have VBZ VB 4 ARG2 been be VBN VB 5
since since IN IN 10 MOD been be VBN VB 5
since since IN IN 10 ARG1 invaded invade VBD VB 14
ever ever RB RB 9 ARG1 since since IN IN 10
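
Each relation in this output has eleven whitespace-separated fields, so it can be read into a small record. The field names below are assumptions inferred from the columns of the example (word, base form, POS, base POS, and position for the predicate, then the relation label, then the same five fields for the argument); they are not taken from the parser's documentation.

```python
from collections import namedtuple

# Field names are assumptions inferred from the columns of the output above.
Relation = namedtuple(
    "Relation",
    "pred pred_base pred_pos pred_base_pos pred_idx label "
    "arg arg_base arg_pos arg_base_pos arg_idx",
)


def parse_relation(line):
    """Parse one predicate-argument line into a Relation record."""
    f = line.split()
    if len(f) != 11:
        raise ValueError(f"expected 11 fields, got {len(f)}")
    return Relation(*f[:4], int(f[4]), f[5], *f[6:10], int(f[10]))


rel = parse_relation("invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15")
# A base-form triple of this kind could serve as a structured index term.
triple = (rel.pred_base, rel.label, rel.arg_base)
```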

2010.04.21 - SLIDE 48 - IS 240 – Spring 2010

NLP & IR
• Indexing
  – Use of NLP methods to identify phrases
    • Test weighting schemes for phrases
  – Use of more sophisticated morphological analysis
• Searching
  – Use of two-stage retrieval
    • Statistical retrieval
    • Followed by more sophisticated NLP filtering
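
Two-stage retrieval can be sketched as a cheap bag-of-words ranking followed by a stricter filter on the candidates. In the sketch below the second stage is only a stand-in for NLP filtering: it keeps documents where the query words occur as an adjacent phrase. The documents, scoring, and function names are all illustrative assumptions.

```python
def stage_one(query_terms, docs, k=2):
    """Statistical stage: rank documents by the number of query terms they contain."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(1 for term in query_terms if term in words)
        if score:
            scored.append((score, doc_id))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by ID
    return [doc_id for _, doc_id in scored[:k]]


def contains_phrase(words, phrase_words):
    """True if phrase_words occur as consecutive tokens in words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def stage_two(query_terms, candidates, docs):
    """Filtering stage: keep candidates containing the query as an adjacent phrase."""
    return [d for d in candidates
            if contains_phrase(docs[d].lower().split(), query_terms)]


# Hypothetical mini-collection.
DOCS = {
    "d1": "a russian tank invaded wisconsin",
    "d2": "the tank was built by a russian engineer",
    "d3": "local hero wins election",
}
query = ["russian", "tank"]
candidates = stage_one(query, DOCS)
final = stage_two(query, candidates, DOCS)
```

The expensive analysis runs only on the short candidate list, which is the point of the two-stage design.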

2010.04.21 - SLIDE 49 - IS 240 – Spring 2010

NLP & IR
• Lewis and Sparck Jones suggest research in three areas
  – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
  – The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
  – Using NLP-based methods for searching and matching

2010.04.21 - SLIDE 50 - IS 240 – Spring 2010

NLP & IR Issues
• Is natural language indexing using more NLP knowledge needed?
• Or should controlled vocabularies be used?
• Can NLP in its current state provide the improvements needed?
• How to test?

2010.04.21 - SLIDE 51 - IS 240 – Spring 2010

NLP & IR
• New “Question Answering” track at TREC has been exploring these areas
  – Usually statistical methods are used to retrieve candidate documents
  – NLP techniques are used to extract the likely answers from the text of the documents

2010.04.21 - SLIDE 52 - IS 240 – Spring 2010

Mark’s idle speculation
• What people think is going on always

[Figure labels: Keywords, NLP]

From Mark Sanderson, University of Sheffield

2010.04.21 - SLIDE 53 - IS 240 – Spring 2010

Mark’s idle speculation
• What’s usually actually going on

[Figure labels: Keywords, NLP]

From Mark Sanderson, University of Sheffield
