2009.04.29 - SLIDE 1IS 240 – Spring 2009
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 22: NLP for IR
2009.04.29 - SLIDE 2IS 240 – Spring 2009
Today
• Review
– Cheshire III Design
– GRID-based DLs
• NLP for IR
• Text Summarization
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer
2009.04.29 - SLIDE 3IS 240 – Spring 2009
Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Figure: layered Grid architecture, top to bottom]
• Applications: Chemical Engineering, Climate, High Energy Physics, Cosmology, Astrophysics, Combustion, ...
• Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote Sensors, ...
• Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
2009.04.29 - SLIDE 4IS 240 – Spring 2009
Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Figure: layered Grid architecture, top to bottom, extended for digital libraries]
• Applications: Chemical Engineering, Climate, High Energy Physics, Cosmology, Astrophysics, Combustion, Humanities Computing, Digital Libraries, Bio-Medical, ...
• Application Toolkits: Data Grid, Remote Computing, Remote Visualization, Collaboratories, Portals, Remote Sensors, Text Mining, Metadata Management, Search & Retrieval, ...
• Grid Services (Grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
• Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services
2009.04.29 - SLIDE 5IS 240 – Spring 2009
Grid IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
• Very large-scale distribution of resources is a challenge for sub-second retrieval
• Unlike most typical Grid processes, IR is potentially less compute-intensive and more data-intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
2009.04.29 - SLIDE 6IS 240 – Spring 2009
Context
• Environmental Requirements:
– Very large scale information systems
• Terabyte scale (Data Grid)
• Computationally expensive processes (Comp. Grid)
• Digital Preservation
• Analysis of data, not just retrieval (Data/Text Mining)
• Ease of Extensibility, Customizability (Python)
• Open Source
• Integrate not Re-implement
• "Web 2.0" – interactivity and dynamic interfaces
2009.04.29 - SLIDE 7IS 240 – Spring 2009
Context
[Figure: Cheshire3 environment architecture in three layers]
• Application Layer
– User Interface: Web Browser, Multivalent, Dedicated Client
• Digital Library Layer
– Protocol Handler: Apache + Mod_Python + Cheshire3
– Process Management: Kepler, Cheshire3
– Document Parsers: Multivalent, ...
– Text Mining Tools (Natural Language Processing, Information Extraction): Tsujii Labs, ...
– Data Mining Tools (Classification, Clustering): Orange, Weka, ...
– Term Management: Termine, WordNet, ...
– Information System (Search/Retrieve, Index/Store): Cheshire3
• Data Grid Layer
– Data Grid: SRB, iRODS
– User Interface: MySRB, PAWN
– Process Management: Kepler, iRODS rules
• Arrows labeled Query, Results, Export, Parse, and Store connect the components.
2009.04.29 - SLIDE 8IS 240 – Spring 2009
Cheshire3 Object Model
[Figure: Cheshire3 object model. Objects shown: Server, Database, ProtocolHandler, User, UserStore, ConfigStore, Object, Query, ResultSet, Record, Records, RecordStore, Document, DocumentGroup, DocumentStore, Index, IndexStore, Terms, PreParser (chained), Parser, Extracter, Normaliser, Transformer. The ingest process runs over Documents.]
2009.04.29 - SLIDE 9IS 240 – Spring 2009
Object Configuration
• One XML 'record' per non-data object
• Very simple base schema, with extensions as needed
• Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases)
• Allows workflows to reference by identifier but act appropriately within different contexts.
• Allows multiple administrators to define objects without reference to each other
2009.04.29 - SLIDE 10IS 240 – Spring 2009
Grid
• Focus on ingest, not discovery (yet)
• Instantiate architecture on every node
• Assign one node as master, rest as slaves. Master then divides the processing as appropriate.
• Calls between slaves possible
• Calls as small, simple as possible: (objectIdentifier, functionName, *arguments)
• Typically: ('workflow-id', 'process', 'document-id')
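The call convention above can be illustrated with a minimal sketch. This is not Cheshire3's actual messaging code: the `Workflow` class, the `registry` dictionary, and the `dispatch` function are hypothetical names introduced here only to show how a node might resolve an (objectIdentifier, functionName, *arguments) tuple.

```python
class Workflow:
    """Hypothetical stand-in for a configured workflow object on a node."""

    def __init__(self, identifier):
        self.identifier = identifier

    def process(self, document_id):
        # A real workflow would fetch and process the document;
        # this sketch only reports what it was asked to do.
        return f"{self.identifier} processed {document_id}"


def dispatch(registry, call):
    """Resolve an (objectIdentifier, functionName, *arguments) tuple."""
    object_id, function_name, *arguments = call
    target = registry[object_id]            # look the object up by identifier
    method = getattr(target, function_name)  # find the named function on it
    return method(*arguments)


# Each node holds its own registry of instantiated objects (assumption).
registry = {"workflow-id": Workflow("workflow-id")}
result = dispatch(registry, ("workflow-id", "process", "document-id"))
print(result)
```

Because the message carries only identifiers, the master never ships objects or documents over the wire; each slave resolves the identifiers against its own instantiated architecture.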
2009.04.29 - SLIDE 11IS 240 – Spring 2009
Grid Architecture
[Figure: the Master Task sends (workflow, process, document) calls to Slave Task 1 ... Slave Task N. Each slave fetches its document from the Data Grid and writes the extracted data to GPFS temporary storage.]
2009.04.29 - SLIDE 12IS 240 – Spring 2009
Grid Architecture - Phase 2
[Figure: the Master Task sends (index, load) calls to Slave Task 1 ... Slave Task N. Each slave fetches the extracted data and stores the index; the figure also shows the Data Grid and GPFS temporary storage.]
2009.04.29 - SLIDE 13IS 240 – Spring 2009
Workflow Objects
• Written as XML within the configuration record.
• Rewrites and compiles to Python code on object instantiation
Current instructions:
– object
– assign
– fork
– for-each
– break/continue
– try/except/raise
– return
– log (= send text to default logger object)
Yes, no if!
2009.04.29 - SLIDE 14IS 240 – Spring 2009
Workflow example
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>"Loaded Record:" + input.id</log>
  </workflow>
</subConfig>
2009.04.29 - SLIDE 15IS 240 – Spring 2009
Text Mining
• Integration of Natural Language Processing tools
• Including:
– Part of Speech taggers (noun, verb, adjective,...)
– Phrase Extraction
– Deep Parsing (subject, verb, object, preposition,...)
– Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi)
• Planned: Information Extraction tools
2009.04.29 - SLIDE 16IS 240 – Spring 2009
Data Mining
• Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes
• Focus on automatic classification for predefined categories rather than clustering
• Algorithms integrated/implemented:
– Perceptron, Neural Network (pure python)
– Naïve Bayes (pure python)
– SVM (libsvm integrated with python wrapper)
– Classification Association Rule Mining (Java)
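The point about sparse vectors can be made concrete with a small sketch. It is an illustration only, not the Cheshire3 vector format: the vocabulary, the function names, and the dictionary representation are assumptions chosen to show why only the non-zero entries need to be stored.

```python
from collections import Counter


def to_sparse_vector(tokens, vocabulary):
    """Map tokens to {feature_index: count}, storing only non-zero entries."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}


def sparse_dot(a, b):
    """Dot product that touches only the features present in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(index, 0) for index, value in a.items())


# Toy vocabulary (hypothetical); a real text collection has many thousands of terms.
vocabulary = {"grid": 0, "retrieval": 1, "parser": 2, "index": 3, "protein": 4}

doc_a = to_sparse_vector("grid retrieval on the grid".split(), vocabulary)
doc_b = to_sparse_vector("index for retrieval".split(), vocabulary)

print(doc_a)                     # only 2 of the 5 dimensions are stored
print(sparse_dot(doc_a, doc_b))  # the documents overlap only on "retrieval"
```

A toolkit that insists on dense input would need one slot per vocabulary term for every document, which is why dense-only toolkits are hard to integrate with text.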
2009.04.29 - SLIDE 17IS 240 – Spring 2009
Data Mining
• Modelled as multi-stage PreParser object (training phase, prediction phase)
• Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM)
• Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore
• Document vectors generated per index per document, so integrated NLP document normalization for free
2009.04.29 - SLIDE 18IS 240 – Spring 2009
Data Mining + Text Mining
• Testing integrated environment with 500,000 medline abstracts, using various NLP tools, classification algorithms, and evaluation strategies.
• Computational grid for distributing expensive NLP analysis
• Results show better accuracy with fewer attributes:

Vector Source                      Avg Attributes   TCV Accuracy
Every word in document             99               85.7%
Stemmed words in document          95               86.2%
Part of Speech filtered words      69               85.2%
Stemmed Part of Speech filtered    65               86.3%
Genia filtered                     68               85.5%
Genia Stem filtered                64               87.2%
2009.04.29 - SLIDE 19IS 240 – Spring 2009
Applications (1)
Automated Collection Strength Analysis
Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries.
The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records.
This involved very large scale processing of records to:
– Deduplicate millions of records
– Enrich deduplicated records against database of 45 million
– Automatically reclassify enriched records using machine learning processes (Naïve Bayes)
2009.04.29 - SLIDE 20IS 240 – Spring 2009
Applications (1)
• Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems.
• The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining
[Figure: bar chart "Records per Library for All of Psychology", comparing Original and Enhanced record counts (y-axis 0 to 6000) for Goldsmiths, Kings, Queen Mary, Senate, UCL, and Westminster.]
2009.04.29 - SLIDE 21IS 240 – Spring 2009
Applications (2)
Assessing the Grade Level of NSDL Education Material
• The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid.
• Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL.
• We determined the vocabulary-based grade-level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier).
• This processing was done on the Teragrid cluster at SDSC.
2009.04.29 - SLIDE 22IS 240 – Spring 2009
Applications (2)
• The formula for the Flesch Reading Ease Score:
FRES = 206.835 – 1.015 * ((total words)/(total sentences)) – 84.6 * ((total syllables)/(total words))
• The Flesch-Kincaid Grade Level Formula:
FKGLF = 0.39 * ((total words)/(total sentences)) + 11.8 * ((total syllables)/(total words)) – 15.59
• The Domain was determined by:
– Domains used were based upon the AAAS Benchmarks
– Samples from each of the domain areas being examined are taken in to produce scored and ranked lists of vocabularies for each domain.
– Each token in a document is passed through a lookup function against this table and tallies are calculated for the entire document.
– These tallies are then used to rank the order of likelihood of the document being about each topic, and a statistical pass of the results returns only those topics that are above a certain threshold.
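The two readability formulas and the tally-based domain ranking can be sketched as follows. The formulas are taken directly from the slide; everything else is an assumption made for illustration: the word, sentence, and syllable counts are supplied by the caller, and the `domain_vocab` table, its scores, and the threshold value are hypothetical, not the ones used in the NSDL work.

```python
def flesch_reading_ease(words, sentences, syllables):
    """FRES = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rank_domains(tokens, domain_vocab, threshold):
    """Tally per-domain scores for each token, rank, and keep those above threshold."""
    tallies = {}
    for token in tokens:
        # domain_vocab maps token -> {domain: score}; unknown tokens add nothing.
        for domain, score in domain_vocab.get(token, {}).items():
            tallies[domain] = tallies.get(domain, 0.0) + score
    ranked = sorted(tallies.items(), key=lambda item: item[1], reverse=True)
    return [(domain, score) for domain, score in ranked if score > threshold]


# A 100-word passage with 5 sentences and 150 syllables (made-up counts).
fres = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)

# Hypothetical scored vocabulary table for two domains.
domain_vocab = {
    "cell": {"biology": 2.0},
    "force": {"physics": 1.5},
    "energy": {"physics": 1.0, "biology": 0.5},
}
domains = rank_domains("the cell uses energy".split(), domain_vocab, threshold=1.5)
print(round(fres, 3), round(grade, 2), domains)
```

Counting syllables reliably is the hard part in practice and is deliberately left out of this sketch.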
2009.04.29 - SLIDE 23IS 240 – Spring 2009
Today
• Natural Language Processing and IR
– Based on Papers in Reader and on
• David Lewis & Karen Sparck Jones “Natural Language Processing for Information Retrieval” Communications of the ACM, 39(1) Jan. 1996
• Text summarization: Lecture from Ed Hovy (USC)
2009.04.29 - SLIDE 24IS 240 – Spring 2009
Natural Language Processing and IR
• The main approach in applying NLP to IR has been to attempt to address
– Phrase usage vs individual terms
– Search expansion using related terms/concepts
– Attempts to automatically exploit or assign controlled vocabularies
2009.04.29 - SLIDE 25IS 240 – Spring 2009
NLP and IR
• Much early research showed that (at least in the restricted test databases tested)
– Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
– Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements
2009.04.29 - SLIDE 26IS 240 – Spring 2009
NLP and IR
• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
– E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches
2009.04.29 - SLIDE 27IS 240 – Spring 2009
General Framework of NLP
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
2009.04.29 - SLIDE 28IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Example input: John runs.
2009.04.29 - SLIDE 29IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Example input: John runs.
Morphological analysis: John run+s. (John: P-N; run+s: V + 3-pre, or N + plu)
2009.04.29 - SLIDE 30IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Example input: John runs.
Morphological analysis: John run+s. (John: P-N; run+s: V + 3-pre, or N + plu)
Syntactic analysis: (S (NP (P-N John)) (VP (V run)))
2009.04.29 - SLIDE 31IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Example input: John runs.
Morphological analysis: John run+s. (John: P-N; run+s: V + 3-pre, or N + plu)
Syntactic analysis: (S (NP (P-N John)) (VP (V run)))
Semantic analysis: Pred: RUN, Agent: John
2009.04.29 - SLIDE 32IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Example input: John runs.
Morphological analysis: John run+s. (John: P-N; run+s: V + 3-pre, or N + plu)
Syntactic analysis: (S (NP (P-N John)) (VP (V run)))
Semantic analysis: Pred: RUN, Agent: John
Context processing: John is a student. He runs.
2009.04.29 - SLIDE 33IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
– Tokenization
– Part of Speech Tagging
– Term recognition (Ananiadou)
– Inflection/Derivation
– Compounding
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Domain Analysis (Appelt: 1999)
2009.04.29 - SLIDE 34IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
2009.04.29 - SLIDE 35IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
– Incomplete Lexicons
• Open class words
• Terms (term recognition)
• Named Entities: company names, locations, numerical expressions
2009.04.29 - SLIDE 36IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
– Incomplete Grammar
• Syntactic Coverage
• Domain Specific Constructions
• Ungrammatical Constructions
2009.04.29 - SLIDE 37IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
– Incomplete Domain Knowledge: Interpretation Rules
– Predefined Aspects of Information
2009.04.29 - SLIDE 38IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
2009.04.29 - SLIDE 39IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
Most words in English are ambiguous in terms of their parts of speech.
– runs: v/3pre, n/plu
– clubs: v/3pre, n/plu and two meanings
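A short sketch shows why part-of-speech ambiguity leads to combinatorial explosion: if each word is analyzed independently, the number of candidate tag sequences is the product of the number of tags per word. The lexicon below is a toy example built from the tags on the slide; the function names are hypothetical.

```python
from itertools import product
from math import prod

# Toy lexicon: each word maps to its possible part-of-speech analyses.
LEXICON = {
    "John": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}


def count_tag_sequences(words, lexicon):
    """Number of candidate tag sequences = product of per-word tag counts."""
    return prod(len(lexicon[word]) for word in words)


def enumerate_tag_sequences(words, lexicon):
    """List every combination of tags, one tag per word."""
    return list(product(*(lexicon[word] for word in words)))


sentence = ["John", "runs", "clubs"]
n_readings = count_tag_sequences(sentence, LEXICON)
readings = enumerate_tag_sequences(sentence, LEXICON)

print(n_readings)
for reading in readings:
    print(reading)

# Ten two-way ambiguous words already give 2**10 candidate sequences.
print(count_tag_sequences(["runs"] * 10, LEXICON))
```

Later stages (syntax, semantics) multiply these candidates further, which is why taggers prune early using local context.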
2009.04.29 - SLIDE 40IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
– Structural Ambiguities
– Predicate-argument Ambiguities
2009.04.29 - SLIDE 41IS 240 – Spring 2009
Structural Ambiguities
(1) Attachment Ambiguities
John bought a car with large seats.
John bought a car with $3000.
(2) Scope Ambiguities
young women and men in the room
(3) Analytical Ambiguities
Visiting relatives can be boring.
The manager of Yaxing Benz, a Sino-German joint venture
The manager of Yaxing Benz, Mr. John Smith
John bought a car with Mary.
$3000 can buy a nice car.
Semantic Ambiguities (1)
Semantic Ambiguities (2)
Every man loves a woman.
Co-reference Ambiguities
2009.04.29 - SLIDE 42IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
– Structural Ambiguities
– Predicate-argument Ambiguities
Together these produce a Combinatorial Explosion.
2009.04.29 - SLIDE 43IS 240 – Spring 2009
Note:
Ambiguities vs Robustness
More comprehensive knowledge: More Robust
– big dictionaries
– comprehensive grammar
More comprehensive knowledge: More ambiguities
Adaptability: Tuning, Learning
2009.04.29 - SLIDE 44IS 240 – Spring 2009
Framework of IE
IE as compromise NLP
2009.04.29 - SLIDE 45IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
– Incomplete Domain Knowledge: Interpretation Rules
– Predefined Aspects of Information
2009.04.29 - SLIDE 47IS 240 – Spring 2009
Techniques in IE
(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted
(2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: Machine Learning, trainable systems
2009.04.29 - SLIDE 48IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
Open class words: Named entity recognition (e.g., Locations, Persons, Companies, Organizations, Position names)
– Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
– Machine Learning: HMM, Decision Trees
– Rules + Machine Learning
Part of Speech Tagger
– FSA rules
– Statistic taggers
Annotations on the slide: 95 %; F-Value 90; Domain Dependent; Local Context; Statistical Bias
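The domain-specific rules for named entities can be approximated with regular expressions. This is a rough sketch under stated assumptions, not the rule formalism used in any particular IE system: `<Word>` is read here as a capitalized word and `<Cpt-L>` as a single capital letter, and the sample sentence is made up.

```python
import re

# "<Word><Word>, Inc."  -> two capitalized words followed by ", Inc."
COMPANY_RULE = re.compile(r"\b([A-Z][A-Za-z]+ [A-Z][A-Za-z]+), Inc\.")

# "Mr. <Cpt-L>. <Word>" -> "Mr.", a capital initial, then a capitalized surname
PERSON_RULE = re.compile(r"\bMr\. ([A-Z])\. ([A-Z][a-z]+)")


def find_entities(text):
    """Return company names and (initial, surname) pairs matched by the rules."""
    companies = COMPANY_RULE.findall(text)
    persons = PERSON_RULE.findall(text)
    return companies, persons


text = "Mr. J. Smith joined Acme Widgets, Inc. last year."
companies, persons = find_entities(text)
print(companies)
print(persons)
```

Rules like these cope with open class words because they key on the surrounding pattern rather than on a dictionary entry, but they are domain dependent and miss anything that does not fit the template.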
2009.04.29 - SLIDE 49IS 240 – Spring 2009
General Framework of NLP
• Morphological and Lexical Processing
• Syntactic Analysis
• Semantic Analysis
• Context Processing / Interpretation
FASTUS: based on finite state automata (FSA)
1. Complex Words: Recognition of multi-words and proper names
2. Basic Phrases: Simple noun groups, verb groups and particles
3. Complex Phrases: Complex noun groups and verb groups
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
2009.04.29 - SLIDE 52IS 240 – Spring 2009
Using NLP
• Strzalkowski (in Reader)
[Figure: Text → NLP → representation → Database search, where the NLP step consists of TAGGER → PARSER → TERMS]
2009.04.29 - SLIDE 53IS 240 – Spring 2009
Using NLP
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per
2009.04.29 - SLIDE 54IS 240 – Spring 2009
Using NLP
TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
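The word/tag notation is easy to process downstream. The sketch below reads the tagged and stemmed sentence from the slide and keeps only tokens whose tags suggest content words. The choice of tags to keep is an assumption for illustration; it is not the selection rule used in the system described in the reader.

```python
def parse_tagged(sentence):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the last "/", so the tag is always the final field.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj "
    "hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd "
    "wisconsin/np ./per"
)

pairs = parse_tagged(tagged)

# Hypothetical filter: keep nouns, proper nouns, adjectives, and past-tense verbs.
CONTENT_TAGS = {"nn", "np", "jj", "vbd"}
content_terms = [word for word, tag in pairs if tag in CONTENT_TAGS]
print(content_terms)
```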
2009.04.29 - SLIDE 55IS 240 – Spring 2009
Using NLP
PARSED SENTENCE
[assert
[[perf [have]][[verb[BE]]
[subject [np[n PRESIDENT][t_pos THE]
[adj[FORMER]][adj[SOVIET]]]]
[adv EVER]
[sub_ord[SINCE [[verb[INVADE]]
[subject [np [n TANK][t_pos A]
[adj [RUSSIAN]]]]
[object [np [name [WISCONSIN]]]]]]]]]
2009.04.29 - SLIDE 56IS 240 – Spring 2009
Using NLP
EXTRACTED TERMS & WEIGHTS
President 2.623519 soviet 5.416102
President+soviet 11.556747 president+former 14.594883
Hero 7.896426 hero+local 14.314775
Invade 8.435012 tank 6.848128
Tank+invade 17.402237 tank+russian 16.030809
Russian 7.383342 wisconsin 7.785689
2009.04.29 - SLIDE 57IS 240 – Spring 2009
Same Sentence, different sys
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.
2009.04.29 - SLIDE 58IS 240 – Spring 2009
Same Sentence, different sys
CHUNKED Sentence (chunkparser – Tsujii)
(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )
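For indexing, the bracketed output is mostly useful as a source of phrases. The following sketch parses the bracketed tree into nested tuples and pulls out the noun phrases. The parser and helper names are hypothetical and written for this notation only; they are not part of the chunkparser distribution.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # tokens[pos] is "(" and tokens[pos + 1] is the node label.
        label = tokens[pos + 1]
        children = []
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        return (label, children), pos + 1

    tree, _ = read(0)
    return tree


def leaves(node):
    """Collect the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(node, wanted):
    """Return the text of every subtree labeled `wanted`, in preorder."""
    label, children = node
    found = [" ".join(leaves(node))] if label == wanted else []
    for child in children:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


chunked = (
    "(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) "
    "(VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) "
    "(ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) "
    "(VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )"
)

noun_phrases = phrases(parse_tree(chunked), "NP")
print(noun_phrases)
```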
2009.04.29 - SLIDE 59IS 240 – Spring 2009
Same Sentence, different sys
Enju Parser
ROOT ROOT ROOT ROOT -1 ROOT been be VBN VB 5
been be VBN VB 5 ARG1 President president NNP NNP 3
been be VBN VB 5 ARG2 hero hero NN NN 8
a a DT DT 6 ARG1 hero hero NN NN 8
a a DT DT 11 ARG1 tank tank NN NN 13
local local JJ JJ 7 ARG1 hero hero NN NN 8
The the DT DT 0 ARG1 President president NNP NNP 3
former former JJ JJ 1 ARG1 President president NNP NNP 3
Russian russian JJ JJ 12 ARG1 tank tank NN NN 13
Soviet soviet NNP NNP 2 MOD President president NNP NNP 3
invaded invade VBD VB 14 ARG1 tank tank NN NN 13
invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15
has have VBZ VB 4 ARG1 President president NNP NNP 3
has have VBZ VB 4 ARG2 been be VBN VB 5
since since IN IN 10 MOD been be VBN VB 5
since since IN IN 10 ARG1 invaded invade VBD VB 14
ever ever RB RB 9 ARG1 since since IN IN 10
2009.04.29 - SLIDE 60IS 240 – Spring 2009
NLP & IR
• Indexing
– Use of NLP methods to identify phrases
• Test weighting schemes for phrases
– Use of more sophisticated morphological analysis
• Searching
– Use of two-stage retrieval
• Statistical retrieval
• Followed by more sophisticated NLP filtering
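Two-stage retrieval can be sketched in a few lines: a cheap statistical pass selects candidates, and a more expensive NLP step runs only on those candidates. Everything below is illustrative. The term-count scoring stands in for a real ranking function, and the phrase-adjacency check stands in for real NLP filtering; the documents and function names are hypothetical.

```python
def statistical_scores(query_terms, documents):
    """Stage 1: score each document by raw query-term frequency."""
    scores = {}
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in query_terms)
        if score:
            scores[doc_id] = score
    return scores


def two_stage_search(query_terms, documents, nlp_filter, k=10):
    """Stage 2: apply the (expensive) NLP filter to the top-k candidates only."""
    scores = statistical_scores(query_terms, documents)
    # Sort by descending score, breaking ties by document id for determinism.
    candidates = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]
    return [doc_id for doc_id in candidates if nlp_filter(documents[doc_id])]


documents = {
    "d1": "a russian tank invaded wisconsin",
    "d2": "the tank was not russian",
    "d3": "a local hero",
}


def has_phrase(text):
    """Stand-in for NLP filtering: require 'russian' to modify 'tank' directly."""
    return "russian tank" in text.lower()


hits = two_stage_search(["russian", "tank"], documents, has_phrase)
print(hits)
```

Both d1 and d2 pass the bag-of-words stage with the same score; only the second stage separates them, which is the argument for applying NLP after, rather than instead of, statistical retrieval.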
2009.04.29 - SLIDE 61IS 240 – Spring 2009
NLP & IR
• Lewis and Sparck Jones suggest research in three areas
– Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
– The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
– Using NLP-based methods for searching and matching
2009.04.29 - SLIDE 62IS 240 – Spring 2009
NLP & IR Issues
• Is natural language indexing using more NLP knowledge needed?
• Or, should controlled vocabularies be used?
• Can NLP in its current state provide the improvements needed?
• How to test?
2009.04.29 - SLIDE 63IS 240 – Spring 2009
NLP & IR
• New “Question Answering” track at TREC has been exploring these areas
– Usually statistical methods are used to retrieve candidate documents
– NLP techniques are used to extract the likely answers from the text of the documents