
Page 1: Lecture 22: NLP for IR


Prof. Ray Larson University of California, Berkeley

School of Information

Principles of Information Retrieval

Lecture 22: NLP for IR

Page 2: Lecture 22: NLP for IR


Today

• Review
  – Cheshire3 Design – Grid-based DLs

• NLP for IR

• Text Summarization

Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

Page 3: Lecture 22: NLP for IR

Grid Architecture – (Dr. Eric Yen, Academia Sinica, Taiwan)

[Layered diagram – recoverable labels:]
• Applications: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High Energy Physics, Cosmology, Astrophysics, Combustion, Portals, Remote Sensors, …
• Application Toolkits
• Grid Services: protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
• Grid middleware

Page 4: Lecture 22: NLP for IR

Grid Architecture (ECAI/AS Grid Digital Library Workshop)

[Layered diagram – recoverable labels:]
• Applications: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High Energy Physics, Cosmology, Astrophysics, Combustion, Humanities Computing, Digital Libraries, Bio-Medical, Text Mining, Metadata Management, Search & Retrieval, Portals, Remote Sensors, …
• Application Toolkits
• Grid Services: protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
• Grid middleware

Page 5: Lecture 22: NLP for IR


Grid IR Issues

• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)

• Very large-scale distribution of resources is a challenge for sub-second retrieval

• Different from most other typical Grid processes, IR is potentially less computing intensive and more data intensive

• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

Page 6: Lecture 22: NLP for IR


Context

• Environmental requirements:
  – Very large scale information systems
    • Terabyte scale (Data Grid)
    • Computationally expensive processes (Comp. Grid)
    • Digital preservation
    • Analysis of data, not just retrieval (Data/Text Mining)
    • Ease of extensibility, customizability (Python)
    • Open source
    • Integrate, not re-implement
    • "Web 2.0" – interactivity and dynamic interfaces

Page 7: Lecture 22: NLP for IR

Context

[Architecture diagram – recoverable labels:]
• Application Layer: web browser, Multivalent dedicated client, user interface
• Digital Library Layer:
  – Apache + mod_python + Cheshire3 (protocol handler, user interface)
  – Process management: Kepler, Cheshire3
  – Document parsers: Multivalent, …
  – Text mining tools: Tsujii Labs, … (natural language processing, information extraction)
  – Data mining tools: Orange, Weka, … (classification, clustering)
  – Term management: Termine, WordNet, …
  – Information system: Cheshire3 (query/results, search/retrieve, index/store, export/parse)
• Data Grid Layer: SRB / iRODS data grid; MySRB, PAWN; process management: Kepler, iRODS rules; store

Page 8: Lecture 22: NLP for IR

Cheshire3 Object Model

[Object-model diagram – recoverable classes:]
Server, Database, ProtocolHandler, Query, ResultSet, User, UserStore, ConfigStore, RecordStore, DocumentStore, IndexStore, Index, Terms, DocumentGroup, Document, PreParser, Parser, Record, Transformer, Normaliser, Extracter

(Ingest process, roughly: a DocumentGroup yields Documents, which pass through one or more PreParsers and a Parser to become Records in the RecordStore; Extracter and Normaliser feed Terms into Indexes held in the IndexStore; a Transformer turns Records back into Documents.)

Page 9: Lecture 22: NLP for IR


Object Configuration

• One XML 'record' per non-data object
• Very simple base schema, with extensions as needed
• Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases)
• Allows workflows to reference objects by identifier but act appropriately within different contexts
• Allows multiple administrators to define objects without reference to each other

Page 10: Lecture 22: NLP for IR


Grid

• Focus on ingest, not discovery (yet)
• Instantiate the architecture on every node
• Assign one node as master, the rest as slaves; the master then divides the processing as appropriate
• Calls between slaves are possible
• Calls are kept as small and simple as possible:
  (objectIdentifier, functionName, *arguments)
• Typically:
  ('workflow-id', 'process', 'document-id')
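A minimal sketch of this call convention (the dispatch helper, registry, and ToyWorkflow class below are illustrative assumptions, not Cheshire3's actual grid code):

# Minimal sketch of the (objectIdentifier, functionName, *arguments) call convention.
# The registry and ToyWorkflow are hypothetical stand-ins for Cheshire3 objects.
def handle_call(registry, message):
    """A slave receives (objectIdentifier, functionName, *arguments) and dispatches it."""
    object_id, function_name, *args = message
    target = registry[object_id]                 # look up the named object
    return getattr(target, function_name)(*args) # call the named function on it

class ToyWorkflow:
    def process(self, document_id):
        return 'processed ' + document_id

registry = {'workflow-id': ToyWorkflow()}
# Typical call, as on the slide: ('workflow-id', 'process', 'document-id')
print(handle_call(registry, ('workflow-id', 'process', 'document-id')))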

Page 11: Lecture 22: NLP for IR

Grid Architecture

[Diagram: a Master Task coordinating Slave Tasks 1…N over the Data Grid, with GPFS temporary storage]
• The master sends (workflow, process, document) calls to each slave task
• Each slave fetches its document from storage, processes it, and returns the extracted data

Page 12: Lecture 22: NLP for IR

Grid Architecture – Phase 2

[Diagram: a Master Task coordinating Slave Tasks 1…N over the Data Grid, with GPFS temporary storage]
• The master sends (index, load) calls to each slave task
• Each slave fetches the extracted data from temporary storage, builds its index, and stores the index

Page 13: Lecture 22: NLP for IR


Workflow Objects

• Written as XML within the configuration record
• Rewritten and compiled to Python code on object instantiation
• Current instructions:
  – object
  – assign
  – fork
  – for-each
  – break / continue
  – try / except / raise
  – return
  – log (= send text to the default logger object)
• Yes, no if!

Page 14: Lecture 22: NLP for IR


Workflow example

<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>"Loaded Record:" + input.id</log>
  </workflow>
</subConfig>
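For illustration only, the kind of Python such a workflow might compile down to could look roughly like this (object names and method signatures are assumptions, not Cheshire3's actual generated code):

# Hypothetical sketch of the Python a Cheshire3-style workflow compiler might emit
# for the buildSingleWorkflow example above; names and APIs are illustrative only.
def build_single_workflow(session, server, database, input_document):
    """Parse a document, store the record, and index it, mirroring the XML steps."""
    # <object type="workflow" ref="PreParserWorkflow"/>
    doc = server.get_object(session, 'PreParserWorkflow').process(session, input_document)

    # <try> ... <except> wrapped around the parser call
    try:
        record = server.get_object(session, 'NsSaxParser').process_document(session, doc)
    except Exception:
        session.logger.log('Unparsable Record')
        raise

    # recordStore.create_record, database.add_record, database.index_record
    server.get_object(session, 'recordStore').create_record(session, record)
    database.add_record(session, record)
    database.index_record(session, record)
    session.logger.log('Loaded Record:' + str(record.id))
    return record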

Page 15: Lecture 22: NLP for IR


Text Mining

• Integration of Natural Language Processing tools

• Including:
  – Part-of-Speech taggers (noun, verb, adjective, ...)
  – Phrase extraction
  – Deep parsing (subject, verb, object, preposition, ...)
  – Linguistic stemming (is/be, fairy/fairy vs. is/is, fairy/fairi) – see the sketch after this list

• Planned: Information Extraction tools
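A minimal sketch of the linguistic-stemming contrast above, assuming NLTK and its WordNet data are installed (algorithmic stemming clips suffixes, while lemmatization maps words to dictionary forms):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Algorithmic stemming vs. linguistic (dictionary-based) normalization
print(stemmer.stem("fairy"), lemmatizer.lemmatize("fairy"))      # fairi  fairy
print(stemmer.stem("is"), lemmatizer.lemmatize("is", pos="v"))   # is     be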

Page 16: Lecture 22: NLP for IR


Data Mining

• Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes

• Focus on automatic classification for predefined categories rather than clustering

• Algorithms integrated/implemented:
  – Perceptron, Neural Network (pure Python)
  – Naïve Bayes (pure Python; sketched below)
  – SVM (libsvm integrated with a Python wrapper)
  – Classification Association Rule Mining (Java)
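As an illustration of classification over sparse text vectors (a toy sketch, not the Cheshire3 implementation), a multinomial Naïve Bayes classifier can work directly on {term: count} dictionaries:

# Toy multinomial Naive Bayes over sparse term-frequency vectors ({term: count} dicts).
import math
from collections import defaultdict

def train(docs):
    """docs: list of (sparse_vector, label) pairs."""
    class_counts = defaultdict(int)
    term_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    vocab = set()
    for vec, label in docs:
        class_counts[label] += 1
        for term, count in vec.items():
            term_counts[label][term] += count
            totals[label] += count
            vocab.add(term)
    return class_counts, term_counts, totals, vocab

def predict(model, vec):
    class_counts, term_counts, totals, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(class_counts[label] / n_docs)
        for term, count in vec.items():
            p = (term_counts[label][term] + 1) / (totals[label] + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([({"protein": 3, "cell": 1}, "bio"), ({"grid": 2, "node": 2}, "computing")])
print(predict(model, {"cell": 2, "protein": 1}))  # -> "bio"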

Page 17: Lecture 22: NLP for IR


Data Mining

• Modelled as multi-stage PreParser object (training phase, prediction phase)

• Plus a need for an AccumulatingDocumentFactory to merge document vectors together into a single output for training some algorithms (e.g., SVM)

• The prediction phase attaches metadata (the predicted class) to the document object, which can be stored in a DocumentStore

• Document vectors are generated per index per document, so integrated NLP document normalization comes for free

Page 18: Lecture 22: NLP for IR


Data Mining + Text Mining

• Testing the integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies
• Computational grid used for distributing expensive NLP analysis
• Results show better accuracy with fewer attributes:

  Vector Source                        Avg. Attributes   TCV Accuracy
  Every word in document                      99             85.7%
  Stemmed words in document                   95             86.2%
  Part-of-speech filtered words               69             85.2%
  Stemmed part-of-speech filtered             65             86.3%
  Genia filtered                              68             85.5%
  Genia stem filtered                         64             87.2%

Page 19: Lecture 22: NLP for IR


Applications (1)

Automated Collection Strength Analysis

Primary aim: test whether data mining techniques could be used to develop a coverage map of items available in the London libraries.

The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic-level metadata records.

This involved very large scale processing of records to:
– Deduplicate millions of records
– Enrich deduplicated records against a database of 45 million records
– Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

Page 20: Lecture 22: NLP for IR


Applications (1)

• Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems.

• The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

[Chart: Records per Library for All of Psychology – Original vs. Enhanced record counts (0–6000) for Goldsmiths, Kings, Queen Mary, Senate, UCL, and Westminster]

Page 21: Lecture 22: NLP for IR


Applications (2)

Assessing the Grade Level of NSDL Education Material

• The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid.
• Working with SDSC, we assessed grade-level relevance by examining the vocabulary used in the material present at each registered URL.
• We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (a TF-IDF derived fast domain classifier).
• This processing was done on the TeraGrid cluster at SDSC.

Page 22: Lecture 22: NLP for IR


Applications (2)

• The formula for the Flesch Reading Ease Score:
  FRES = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

• The Flesch-Kincaid Grade Level Formula:
  FKGLF = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

  (Both formulas are sketched in code after this list.)

• The domain was determined by:
  – Domains used were based upon the AAAS Benchmarks
  – Taking samples from each of the domain areas being examined and producing scored and ranked lists of vocabulary for each domain
  – Each token in a document is passed through a lookup function against this table and tallies are calculated for the entire document
  – These tallies are then used to rank the likelihood of the document being about each topic, and a statistical pass over the results returns only those topics that are above a certain threshold
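A minimal sketch of the two readability formulas above (the syllable counter here is a crude heuristic of my own, not the one used in the NSDL work):

# Readability formulas from the slide; syllable counting is a rough approximation.
import re

def count_syllables(word):
    # Count vowel groups, with a small correction for a trailing silent 'e'.
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text):
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences          # words per sentence
    spw = syllables / len(words)          # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl

print(readability("The former Soviet President has been a local hero ever since "
                  "a Russian tank invaded Wisconsin."))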

Page 23: Lecture 22: NLP for IR


Today

• Natural Language Processing and IR
  – Based on papers in the Reader and on:
    • David Lewis & Karen Sparck Jones, “Natural Language Processing for Information Retrieval”, Communications of the ACM, 39(1), Jan. 1996
• Text summarization: lecture from Ed Hovy (USC)

Page 24: Lecture 22: NLP for IR


Natural Language Processing and IR

• The main approach in applying NLP to IR has been to attempt to address

– Phrase usage vs individual terms

– Search expansion using related terms/concepts

– Attempts to automatically exploit or assign controlled vocabularies

Page 25: Lecture 22: NLP for IR


NLP and IR

• Much early research showed that (at least in the restricted test databases tested):
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

Page 26: Lecture 22: NLP for IR


NLP and IR

• It is not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
  – E.g., use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

Page 27: Lecture 22: NLP for IR


General Framework of NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 28: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Input: John runs.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 29: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Input: John runs.
Morphological and lexical processing: John run+s.  (P-N  V  3-pre  N  plu)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 30: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Input: John runs.
Morphological and lexical processing: John run+s.  (P-N  V  3-pre  N  plu)
Syntactic analysis: S → NP (P-N: John) + VP (V: run)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 31: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Input: John runs.
Morphological and lexical processing: John run+s.  (P-N  V  3-pre  N  plu)
Syntactic analysis: S → NP (P-N: John) + VP (V: run)
Semantic analysis: Pred: RUN, Agent: John

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 32: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Input: John runs.
Morphological and lexical processing: John run+s.  (P-N  V  3-pre  N  plu)
Syntactic analysis: S → NP (P-N: John) + VP (V: run)
Semantic analysis: Pred: RUN, Agent: John
Context processing: John is a student. He runs.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
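As a rough, modern illustration of the first two stages for “John runs.” (not part of the original slides), NLTK's tagger plus a toy chunk grammar reproduces the slide's S → NP VP analysis, assuming the NLTK tokenizer and tagger data are installed:

# Rough illustration of tokenization, tagging and a toy NP/VP chunking for "John runs."
import nltk

tokens = nltk.word_tokenize("John runs.")
tagged = nltk.pos_tag(tokens)          # [('John', 'NNP'), ('runs', 'VBZ'), ('.', '.')]

# Toy grammar mirroring the slide's S -> NP VP analysis
grammar = r"""
  NP: {<NNP>}
  VP: {<VBZ>}
"""
print(nltk.RegexpParser(grammar).parse(tagged))   # (S (NP John/NNP) (VP runs/VBZ) ./.)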

Page 33: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Morphological and lexical processing: Tokenization, Part-of-Speech Tagging, Term recognition (Ananiadou), Inflection/Derivation, Compounding
Context processing / interpretation: Domain Analysis (Appelt 1999)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 34: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 35: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
    Incomplete lexicons: open class words, terms (term recognition), named entities (company names, locations, numerical expressions)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 36: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
    Incomplete grammar: syntactic coverage, domain-specific constructions, ungrammatical constructions

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 37: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
    Incomplete domain knowledge: interpretation rules, predefined aspects of information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 38: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 39: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
    Most words in English are ambiguous in terms of their parts of speech.
      runs: v/3pre, n/plu
      clubs: v/3pre, n/plu – and two meanings

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 40: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
    Structural ambiguities, predicate-argument ambiguities

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 41: Lecture 22: NLP for IR

Structural Ambiguities

(1) Attachment ambiguities: John bought a car with large seats. / John bought a car with $3000.
(2) Scope ambiguities: young women and men in the room
(3) Analytical ambiguities: Visiting relatives can be boring.

Semantic ambiguities (1): John bought a car with Mary. / $3000 can buy a nice car.
Semantic ambiguities (2): Every man loves a woman.
Co-reference ambiguities: The manager of Yaxing Benz, a Sino-German joint venture / The manager of Yaxing Benz, Mr. John Smith

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 42: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: structural ambiguities and predicate-argument ambiguities → Combinatorial Explosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 43: Lecture 22: NLP for IR

Note: Ambiguities vs. Robustness

• More comprehensive knowledge (big dictionaries, comprehensive grammars): more robust
• More comprehensive knowledge: more ambiguities
• Adaptability: tuning, learning

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 44: Lecture 22: NLP for IR


Framework of IE

IE as compromise NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 45: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
    Incomplete domain knowledge: interpretation rules, predefined aspects of information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 46: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Difficulties of NLP:
(1) Robustness: Incomplete Knowledge
    Incomplete domain knowledge: interpretation rules, predefined aspects of information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 47: Lecture 22: NLP for IR

Techniques in IE

(1) Domain-specific partial knowledge: knowledge relevant to the information to be extracted
(2) Ambiguities: ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation techniques: machine learning, trainable systems

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 48: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

Open class words – named entity recognition (e.g. locations, persons, companies, organizations, position names):
  – Domain-specific rules: <Word> <Word>, Inc.;  Mr. <Cpt-L>. <Word>
  – Machine learning: HMM, decision trees
  – Rules + machine learning
  – F-value around 90–95%, domain dependent
Part-of-speech tagging: FSA rules, statistical taggers (local context, statistical bias)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 49: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

FASTUS (based on finite state automata, FSA):
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 50: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

FASTUS (based on finite state automata, FSA):
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 51: Lecture 22: NLP for IR

General Framework of NLP

Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing / Interpretation

FASTUS (based on finite state automata, FSA):
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are to be built
5. Merging Structures: templates from different parts of the texts are merged if they provide information about the same entity or event

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 52: Lecture 22: NLP for IR

Using NLP

• Strzalkowski (in Reader)

Text → NLP representation → Database search
NLP: Tagger → Parser → Terms

Page 53: Lecture 22: NLP for IR


Using NLP

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

Page 54: Lecture 22: NLP for IR


Using NLP

TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
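For comparison, a minimal sketch of tagging plus linguistic stemming with NLTK (a different tagger and tagset than the system shown above; assumes the NLTK data packages are installed):

# Tag the sentence with NLTK, then lemmatize each token using a Penn-to-WordNet POS map.
import nltk
from nltk.stem import WordNetLemmatizer

sentence = ("The former Soviet President has been a local hero ever since "
            "a Russian tank invaded Wisconsin.")

tagged = nltk.pos_tag(nltk.word_tokenize(sentence))   # Penn Treebank tags

lemmatizer = WordNetLemmatizer()
def wn_pos(tag):
    # Map Penn Treebank tags onto the WordNet POS codes the lemmatizer expects.
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

print([(lemmatizer.lemmatize(tok.lower(), wn_pos(tag)), tag) for tok, tag in tagged])
# e.g. ('has', 'VBZ') becomes ('have', 'VBZ') and ('invaded', 'VBD') becomes ('invade', 'VBD')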

Page 55: Lecture 22: NLP for IR


Using NLP

PARSED SENTENCE

[assert

[[perf [have]][[verb[BE]]

[subject [np[n PRESIDENT][t_pos THE]

[adj[FORMER]][adj[SOVIET]]]]

[adv EVER]

[sub_ord[SINCE [[verb[INVADE]]

[subject [np [n TANK][t_pos A]

[adj [RUSSIAN]]]]

[object [np [name [WISCONSIN]]]]]]]]]

Page 56: Lecture 22: NLP for IR


Using NLP

EXTRACTED TERMS & WEIGHTS

President          2.623519      soviet            5.416102
President+soviet  11.556747      president+former 14.594883
Hero               7.896426      hero+local       14.314775
Invade             8.435012      tank              6.848128
Tank+invade       17.402237      tank+russian     16.030809
Russian            7.383342      wisconsin         7.785689

Page 57: Lecture 22: NLP for IR


Same Sentence, different sys

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

Page 58: Lecture 22: NLP for IR


Same Sentence, different sys

CHUNKED SENTENCE (chunkparser – Tsujii)
(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )

Page 59: Lecture 22: NLP for IR


Same Sentence, different sys

Enju Parser
ROOT ROOT ROOT ROOT -1      ROOT  been be VBN VB 5
been be VBN VB 5            ARG1  President president NNP NNP 3
been be VBN VB 5            ARG2  hero hero NN NN 8
a a DT DT 6                 ARG1  hero hero NN NN 8
a a DT DT 11                ARG1  tank tank NN NN 13
local local JJ JJ 7         ARG1  hero hero NN NN 8
The the DT DT 0             ARG1  President president NNP NNP 3
former former JJ JJ 1       ARG1  President president NNP NNP 3
Russian russian JJ JJ 12    ARG1  tank tank NN NN 13
Soviet soviet NNP NNP 2     MOD   President president NNP NNP 3
invaded invade VBD VB 14    ARG1  tank tank NN NN 13
invaded invade VBD VB 14    ARG2  Wisconsin wisconsin NNP NNP 15
has have VBZ VB 4           ARG1  President president NNP NNP 3
has have VBZ VB 4           ARG2  been be VBN VB 5
since since IN IN 10        MOD   been be VBN VB 5
since since IN IN 10        ARG1  invaded invade VBD VB 14
ever ever RB RB 9           ARG1  since since IN IN 10

Page 60: Lecture 22: NLP for IR


NLP & IR

• Indexing
  – Use of NLP methods to identify phrases
    • Test weighting schemes for phrases
  – Use of more sophisticated morphological analysis
• Searching
  – Use of two-stage retrieval (see the sketch below)
    • Statistical retrieval
    • Followed by more sophisticated NLP filtering
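A toy sketch of the two-stage idea (illustrative only; the tiny corpus, the overlap scoring, and the phrase filter stand in for a real statistical ranker and a real NLP module):

# Two-stage retrieval: cheap statistical pass over the whole collection,
# then phrase-level filtering applied only to the top-ranked candidates.
from collections import Counter

docs = {
    "d1": "a russian tank invaded wisconsin last week",
    "d2": "the tank was russian but it never invaded anything",
    "d3": "wisconsin cheese production rose sharply",
}

def stage1_statistical(query, k=2):
    """Rank documents by simple term-overlap counts (stand-in for tf-idf/BM25)."""
    q_terms = query.lower().split()
    scores = {d: sum(Counter(text.split())[t] for t in q_terms) for d, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def stage2_phrase_filter(candidates, phrase):
    """Keep only candidates containing the exact phrase (stand-in for NLP filtering)."""
    return [d for d in candidates if phrase in docs[d]]

top = stage1_statistical("russian tank invaded wisconsin")
print(top)                                        # e.g. ['d1', 'd2']
print(stage2_phrase_filter(top, "russian tank"))  # -> ['d1']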

Page 61: Lecture 22: NLP for IR


NLP & IR

• Lewis and Sparck Jones suggest research in three areas:
  – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
  – The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
  – Using NLP-based methods for searching and matching

Page 62: Lecture 22: NLP for IR


NLP & IR Issues

• Is natural language indexing using more NLP knowledge needed?

• Or should controlled vocabularies be used instead?

• Can NLP in its current state provide the improvements needed?

• How do we test this?

Page 63: Lecture 22: NLP for IR


NLP & IR

• The new “Question Answering” track at TREC has been exploring these areas
  – Usually statistical methods are used to retrieve candidate documents
  – NLP techniques are then used to extract the likely answers from the text of the documents