2010.04.21 - SLIDE 1IS 240 – Spring 2010
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 22: NLP for IR
2010.04.21 - SLIDE 2IS 240 – Spring 2010
Today
• MiniTREC results
• Review
  – Cheshire III Design – GRID-based DLs
• NLP for IR
• Text Summarization
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer
2010.04.21 - SLIDE 3IS 240 – Spring 2010
MiniTREC 2010-All Submissions
2010.04.21 - SLIDE 4IS 240 – Spring 2010
MiniTREC 2010 - Best Runs
2010.04.21 - SLIDE 5IS 240 – Spring 2010
Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Figure: layered Grid architecture. Layers, top to bottom: Applications; Application Toolkits; Grid Services (protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.); Grid Fabric (storage, networks, computers, display devices, etc. and their associated local services). Side label: Grid middleware. Labels in the upper layers: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High energy physics, Cosmology, Astrophysics, Combustion, Portals, Remote sensors, ...]
2010.04.21 - SLIDE 6IS 240 – Spring 2010
Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Figure: layered Grid architecture extended for digital libraries. Layers, top to bottom: Applications; Application Toolkits; Grid Services (protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.); Grid Fabric (storage, networks, computers, display devices, etc. and their associated local services). Side label: Grid middleware. Labels in the upper layers: Chemical Engineering, Climate, Data Grid, Remote Computing, Remote Visualization, Collaboratories, High energy physics, Cosmology, Astrophysics, Combustion, Humanities computing, Digital Libraries, Bio-Medical, Portals, Remote sensors, Text Mining, Metadata management, Search & Retrieval, ...]
2010.04.21 - SLIDE 7IS 240 – Spring 2010
Grid IR Issues
• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e. speed)
• Very large-scale distribution of resources is a challenge for sub-second retrieval
• Unlike most other typical Grid processes, IR is potentially less computing intensive and more data intensive
• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
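The parallel with distributed search can be made concrete with a small sketch. This is an illustration only, not Cheshire3 code: the per-node index layout (term -> {document id: weight}) and the function names are assumptions. Each node scores its own documents, and the caller merges the per-node rankings. The merge step assumes scores are comparable across nodes, which is one of the metasearch problems referred to above.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)
NodeIndex = Dict[str, Dict[str, float]]  # term -> {document id: weight} (assumed layout)

def search_node(index: NodeIndex, query: List[str], k: int) -> List[Hit]:
    """Score documents held by one node by summing per-term weights; return the top k."""
    scores: Dict[str, float] = {}
    for term in query:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # nlargest returns hits sorted from highest to lowest score
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def distributed_search(nodes: List[NodeIndex], query: List[str], k: int) -> List[Hit]:
    """Query every node, then merge the already-sorted partial rankings."""
    partial = [search_node(index, query, k) for index in nodes]
    merged = heapq.merge(*partial, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:k]

# Two hypothetical nodes, each holding part of the collection
nodes = [
    {"grid": {"d1": 2.0, "d2": 1.0}},
    {"grid": {"d3": 1.5}, "ir": {"d3": 1.0, "d4": 0.5}},
]
top = distributed_search(nodes, ["grid", "ir"], 3)
```

In a real deployment the per-node calls would run in parallel, and the weights would need collection-wide statistics to be comparable.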
2010.04.21 - SLIDE 8IS 240 – Spring 2010
Context
[Figure: Cheshire3 in context. Layers: Application Layer; Digital Library Layer; Data Grid Layer. Components shown: User Interface (Web Browser, Multivalent, Dedicated Client); Protocol Handler (Apache + Mod_Python + Cheshire3); Process Management (Kepler, Cheshire3); Document Parsers (Multivalent, ...); Text Mining Tools for Natural Language Processing and Information Extraction (Tsujii Labs, ...); Data Mining Tools for Classification and Clustering (Orange, Weka, ...); Term Management (Termine, WordNet, ...); Information System (Cheshire3) for Search/Retrieve and Index/Store; Data Grid (SRB, iRODS); User Interface (MySRB, PAWN); Process Management (Kepler, iRODS rules). Arrow labels: Query, Results, Export, Parse, Store.]
2010.04.21 - SLIDE 9IS 240 – Spring 2010
Grid Architecture
[Figure: a Master Task sends (workflow, process, document) to Slave Task 1 ... Slave Task N. Each slave task fetches its document from the Data Grid and passes extracted data to GPFS Temporary Storage.]
2010.04.21 - SLIDE 10IS 240 – Spring 2010
Grid Architecture - Phase 2
[Figure: the Master Task sends (index, load) to Slave Task 1 ... Slave Task N. Each slave task fetches extracted data from GPFS Temporary Storage and stores the resulting index.]
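A minimal sketch of the two-phase master/worker pattern in the diagrams, under loud assumptions: plain dictionaries stand in for the Data Grid and the GPFS temporary storage, threads stand in for the slave tasks, "extraction" is just lowercased tokenization, and the master merges the partial indexes itself. None of the names below come from Cheshire3.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def extract(doc_id, data_grid, temp_store):
    """Phase 1 worker: fetch one document and store its extracted data."""
    text = data_grid[doc_id]                     # "fetch document"
    temp_store[doc_id] = text.lower().split()    # "extracted data" (here: tokens)
    return doc_id

def index_partition(doc_ids, temp_store):
    """Phase 2 worker: fetch extracted data and build a partial inverted index."""
    partial = defaultdict(set)
    for doc_id in doc_ids:
        for token in temp_store[doc_id]:         # "fetch extracted data"
            partial[token].add(doc_id)
    return partial

def run_master(data_grid, workers=2):
    """Master task: dispatch phase 1, then phase 2, then merge the partial indexes."""
    temp_store = {}                              # stand-in for GPFS temporary storage
    doc_ids = sorted(data_grid)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Phase 1: one extraction task per document
        list(pool.map(lambda d: extract(d, data_grid, temp_store), doc_ids))
        # Phase 2: one indexing task per partition of the document ids
        partitions = [doc_ids[i::workers] for i in range(workers)]
        partials = list(pool.map(lambda p: index_partition(p, temp_store), partitions))
    index = defaultdict(set)
    for partial in partials:                     # "store index" (merged in memory here)
        for token, ids in partial.items():
            index[token] |= ids
    return dict(index)

index = run_master({"a": "Grid IR", "b": "grid search"})
```

Separating extraction from indexing lets the expensive per-document work finish and be cached before any index is built, which is the point of the temporary storage in the diagrams.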
2010.04.21 - SLIDE 11IS 240 – Spring 2010
Today
• Natural Language Processing and IR
  – Based on Papers in Reader and on
    • David Lewis & Karen Sparck Jones “Natural Language Processing for Information Retrieval” Communications of the ACM, 39(1) Jan. 1996
• Text summarization: Lecture from Ed Hovy (USC)
2010.04.21 - SLIDE 12IS 240 – Spring 2010
Natural Language Processing and IR
• The main approach in applying NLP to IR has been to attempt to address
  – Phrase usage vs individual terms
  – Search expansion using related terms/concepts
  – Attempts to automatically exploit or assign controlled vocabularies
2010.04.21 - SLIDE 13IS 240 – Spring 2010
NLP and IR
• Much early research showed that (at least in the restricted test databases examined)
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements
2010.04.21 - SLIDE 14IS 240 – Spring 2010
NLP and IR
• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
  – E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches
2010.04.21 - SLIDE 15IS 240 – Spring 2010
General Framework of NLP
Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
2010.04.21 - SLIDE 20IS 240 – Spring 2010
General Framework of NLP
Processing stages, illustrated on the input "John runs.":
• Morphological and Lexical Processing: John run+s. (P-N; V; 3-pre / N plu)
• Syntactic Analysis: [S [NP [P-N John]] [VP [V run]]]
• Semantic Analysis: Pred: RUN, Agent: John
• Context processing / Interpretation: John is a student. He runs.
2010.04.21 - SLIDE 21IS 240 – Spring 2010
General Framework of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing / Interpretation
Tasks shown alongside the stages:
• Tokenization
• Part of Speech Tagging
• Inflection/Derivation
• Compounding
• Term recognition (Ananiadou)
• Domain Analysis (Appelt:1999)
2010.04.21 - SLIDE 23IS 240 – Spring 2010
General Framework of NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Incomplete Lexicons
  – Open class words
  – Terms: Term recognition
  – Named Entities: Company names, Locations, Numerical expressions
2010.04.21 - SLIDE 24IS 240 – Spring 2010
General Framework of NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Incomplete Grammar
  – Syntactic Coverage
  – Domain Specific Constructions
  – Ungrammatical Constructions
2010.04.21 - SLIDE 25IS 240 – Spring 2010
General Framework of NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
• Incomplete Domain Knowledge
  – Interpretation Rules
  – Predefined Aspects of Information
2010.04.21 - SLIDE 27IS 240 – Spring 2010
General Framework of NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
• Most words in English are ambiguous in terms of their parts of speech.
  – runs: v/3pre, n/plu
  – clubs: v/3pre, n/plu and two meanings
2010.04.21 - SLIDE 29IS 240 – Spring 2010
Structural Ambiguities
(1) Attachment Ambiguities
  John bought a car with large seats.
  John bought a car with $3000.
(2) Scope Ambiguities
  young women and men in the room
(3) Analytical Ambiguities
  Visiting relatives can be boring.
The manager of Yaxing Benz, a Sino-German joint venture
The manager of Yaxing Benz, Mr. John Smith
John bought a car with Mary.
$3000 can buy a nice car.
Semantic Ambiguities (1)
Semantic Ambiguities (2)
  Every man loves a woman.
Co-reference Ambiguities
2010.04.21 - SLIDE 30IS 240 – Spring 2010
General Framework of NLP
Difficulties of NLP
(1) Robustness: Incomplete Knowledge
(2) Ambiguities: Combinatorial Explosion
• Structural Ambiguities
• Predicate-argument Ambiguities
• Result: Combinatorial Explosion
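A small worked illustration of why ambiguity leads to combinatorial explosion: if each word has several candidate parts of speech, the number of tag sequences for a sentence is the product of the per-word counts. The tiny lexicon below uses only the tags given on these slides (John: P-N; runs and clubs: v/3pre or n/plu); the function names are hypothetical.

```python
from itertools import product
from math import prod

# Candidate part-of-speech tags per word, taken from the slide examples
LEXICON = {
    "john": ["P-N"],
    "runs": ["V/3pre", "N/plu"],
    "clubs": ["V/3pre", "N/plu"],
}

def count_tag_sequences(tokens):
    """Number of candidate tag sequences: the product of per-word ambiguity."""
    return prod(len(LEXICON[token]) for token in tokens)

def enumerate_tag_sequences(tokens):
    """All candidate tag sequences, before any syntactic filtering."""
    return list(product(*(LEXICON[token] for token in tokens)))

two_words = enumerate_tag_sequences(["john", "runs"])
three_words = count_tag_sequences(["john", "runs", "clubs"])
```

With two candidate tags per word, a 20-word sentence already has 2**20 (over a million) tag sequences, before structural and predicate-argument ambiguities multiply the count further.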
2010.04.21 - SLIDE 31IS 240 – Spring 2010
Note: Ambiguities vs Robustness
• More comprehensive knowledge: More Robust
  – big dictionaries
  – comprehensive grammar
• More comprehensive knowledge: More ambiguities
• Adaptability: Tuning, Learning
2010.04.21 - SLIDE 32IS 240 – Spring 2010
Framework of IE
IE as compromise NLP
2010.04.21 - SLIDE 35IS 240 – Spring 2010
Techniques in IE
(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted
(2) Ambiguities: Ignoring irrelevant ambiguities; simpler NLP techniques
(3) Robustness: Coping with incomplete dictionaries (open class words); ignoring irrelevant parts of sentences
(4) Adaptation Techniques: Machine Learning, trainable systems
2010.04.21 - SLIDE 36IS 240 – Spring 2010
General Framework of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing / Interpretation
• Part of Speech Tagger: FSA rules; statistical taggers
• Open class words: Named entity recognition, e.g. Locations, Persons, Companies, Organizations, Position names
  – Domain specific rules: <Word><Word>, Inc.; Mr. <Cpt-L>. <Word>
  – Machine Learning: HMM, Decision Trees
  – Rules + Machine Learning
• Other labels on the slide: 95 %; F-Value 90; Domain Dependent; Local Context; Statistical Bias
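The two domain-specific rule templates on this slide can be approximated with regular expressions. This is a rough sketch, assuming that <Word> means a capitalized word and <Cpt-L> a single capital letter; the slide does not define either placeholder, and the entity labels and example sentence are made up for illustration.

```python
import re

# "<Word><Word>, Inc."  -> two capitalized words followed by ", Inc." (assumed reading)
COMPANY = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+), Inc\.")
# "Mr. <Cpt-L>. <Word>" -> "Mr.", a capital initial, then a capitalized word (assumed reading)
PERSON = re.compile(r"\bMr\. ([A-Z]\. [A-Z][a-z]+)")

def find_entities(text):
    """Return (label, matched text) pairs for the two hand-written rules."""
    entities = [("COMPANY", m.group(0)) for m in COMPANY.finditer(text)]
    entities += [("PERSON", m.group(0)) for m in PERSON.finditer(text)]
    return entities

found = find_entities("Mr. J. Smith joined Acme Widgets, Inc. last year.")
```

Rules like these are cheap and precise within one domain but miss any name that does not fit the template, which is why the slide pairs them with machine learning.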
2010.04.21 - SLIDE 37IS 240 – Spring 2010
General Framework of NLP
Stages: Morphological and Lexical Processing; Syntactic Analysis; Semantic Analysis; Context processing / Interpretation
FASTUS
Based on finite-state automata (FSA)
1. Complex Words: Recognition of multi-words and proper names
2. Basic Phrases: Simple noun groups, verb groups and particles
3. Complex Phrases: Complex noun groups and verb groups
4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
2010.04.21 - SLIDE 40IS 240 – Spring 2010
Using NLP
• Strzalkowski (in Reader)
[Figure: pipeline Text -> NLP -> repres -> Dbase search, where NLP: TAGGER -> PARSER -> TERMS]
2010.04.21 - SLIDE 41IS 240 – Spring 2010
Using NLP
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per
2010.04.21 - SLIDE 42IS 240 – Spring 2010
Using NLP
TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per
2010.04.21 - SLIDE 43IS 240 – Spring 2010
Using NLP
PARSED SENTENCE
[assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]
2010.04.21 - SLIDE 44IS 240 – Spring 2010
Using NLP
EXTRACTED TERMS & WEIGHTS
President         2.623519    soviet            5.416102
President+soviet 11.556747    president+former 14.594883
Hero              7.896426    hero+local       14.314775
Invade            8.435012    tank              6.848128
Tank+invade      17.402237    tank+russian     16.030809
Russian           7.383342    wisconsin         7.785689
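The head+modifier terms in the list above can be roughly reproduced from the tagged and stemmed sentence alone. The sketch below is a simple heuristic, not Strzalkowski's method, which works from the full parse: it pairs each noun (nn) with the adjectives (jj) immediately preceding it. Pairs that depend on the parse, such as tank+invade, are out of its reach, and it computes no weights.

```python
def head_modifier_pairs(tagged):
    """Pair each nn token with the run of jj tokens directly before it."""
    pairs = []
    modifiers = []
    for item in tagged.split():
        word, _, tag = item.rpartition("/")
        if tag == "jj":
            modifiers.append(word)            # collect adjectives waiting for a head
        elif tag == "nn":
            pairs.extend(f"{word}+{m}" for m in modifiers)
            modifiers = []
        else:
            modifiers = []                    # any other tag breaks the noun group
    return pairs

TAGGED_STEMMED = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj "
    "hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per"
)
pairs = head_modifier_pairs(TAGGED_STEMMED)
```

The result covers the adjective-noun pairs on the slide (president+former, president+soviet, hero+local, tank+russian), while the subject-verb pair needs the syntactic analysis shown in the parsed sentence.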
2010.04.21 - SLIDE 45IS 240 – Spring 2010
Same Sentence, Different System

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE (using uptagger from Tsujii)
The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.
2010.04.21 - SLIDE 46IS 240 – Spring 2010
Same Sentence, Different System

CHUNKED SENTENCE (chunkparser – Tsujii)
(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )
2010.04.21 - SLIDE 47IS 240 – Spring 2010
Same Sentence, Different System

Enju Parser
ROOT ROOT ROOT ROOT -1 ROOT been be VBN VB 5
been be VBN VB 5 ARG1 President president NNP NNP 3
been be VBN VB 5 ARG2 hero hero NN NN 8
a a DT DT 6 ARG1 hero hero NN NN 8
a a DT DT 11 ARG1 tank tank NN NN 13
local local JJ JJ 7 ARG1 hero hero NN NN 8
The the DT DT 0 ARG1 President president NNP NNP 3
former former JJ JJ 1 ARG1 President president NNP NNP 3
Russian russian JJ JJ 12 ARG1 tank tank NN NN 13
Soviet soviet NNP NNP 2 MOD President president NNP NNP 3
invaded invade VBD VB 14 ARG1 tank tank NN NN 13
invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15
has have VBZ VB 4 ARG1 President president NNP NNP 3
has have VBZ VB 4 ARG2 been be VBN VB 5
since since IN IN 10 MOD been be VBN VB 5
since since IN IN 10 ARG1 invaded invade VBD VB 14
ever ever RB RB 9 ARG1 since since IN IN 10
2010.04.21 - SLIDE 48IS 240 – Spring 2010
NLP & IR
• Indexing
  – Use of NLP methods to identify phrases
    • Test weighting schemes for phrases
  – Use of more sophisticated morphological analysis
• Searching
  – Use of two-stage retrieval
    • Statistical retrieval
    • Followed by more sophisticated NLP filtering
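Two-stage retrieval can be sketched as a cheap statistical pass that selects candidates, followed by a stricter filter applied only to those candidates. In the sketch below the first stage is plain term-frequency scoring and the second stage is an exact phrase check standing in for "more sophisticated NLP filtering"; both choices, and all names, are assumptions made for illustration.

```python
def statistical_retrieval(docs, query_terms, k):
    """Stage 1: rank documents by summed query-term frequency; keep the top k."""
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then id
    return [doc_id for _, doc_id in scored[:k]]

def contains_phrase(tokens, phrase):
    """True if the phrase tokens occur contiguously in the token list."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def two_stage_search(docs, phrase, k):
    """Stage 2: keep only stage-1 candidates that contain the full phrase."""
    candidates = statistical_retrieval(docs, phrase, k)
    return [d for d in candidates if contains_phrase(docs[d].lower().split(), phrase)]

docs = {
    "d1": "a russian tank invaded",
    "d2": "the tank was russian",
    "d3": "local hero",
}
candidates = statistical_retrieval(docs, ["russian", "tank"], 10)
filtered = two_stage_search(docs, ["russian", "tank"], 10)
```

The expensive analysis runs only on the short candidate list, which is what makes the second stage affordable on large collections.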
2010.04.21 - SLIDE 49IS 240 – Spring 2010
NLP & IR
• Lewis and Sparck Jones suggest research in three areas
  – Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms
  – The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching
  – Using NLP-based methods for searching and matching
2010.04.21 - SLIDE 50IS 240 – Spring 2010
NLP & IR Issues
• Is natural language indexing using more NLP knowledge needed?
• Or, should controlled vocabularies be used?
• Can NLP in its current state provide the improvements needed?
• How to test?
2010.04.21 - SLIDE 51IS 240 – Spring 2010
NLP & IR
• New “Question Answering” track at TREC has been exploring these areas
  – Usually statistical methods are used to retrieve candidate documents
  – NLP techniques are used to extract the likely answers from the text of the documents
2010.04.21 - SLIDE 52IS 240 – Spring 2010
Mark’s idle speculation
• What people think is going on always
[Figure: Keywords and NLP]
From Mark Sanderson, University of Sheffield
2010.04.21 - SLIDE 53IS 240 – Spring 2010
Mark’s idle speculation
• What’s usually actually going on
[Figure: Keywords and NLP]
From Mark Sanderson, University of Sheffield