Overview of Statistical NLP
IR Group Meeting
March 7, 2006
03/07/2006 IR Group Meeting -- NLP 2

Outline
- Some basic/important NLP problems
- Topics that have recently attracted much interest
- NLP research groups
- Discussion on the relation between NLP and IR

Levels of Analysis in NLP (from Dan Roth’s CS598)
- Morphology
  - How words are constructed
- Syntax
  - Structural relations between words
- Semantics
  - The meaning of words and of combinations of words
- Pragmatics
  - How is a sentence used? What is its purpose?
- Discourse (sometimes distinguished as a subfield of Pragmatics)
  - Relationships between sentences; global context

Some NLP Problems
- N-gram Models
- Word Sense Disambiguation
- Lexical Acquisition
- (POS) Tagging
- (Syntactic) Parsing
- Semantic Role Labeling (Semantic Parsing)
- Named Entity Recognition
- Textual Entailment
- …

N-gram Models
- The task: to estimate P(w_n | w_1, …, w_{n-1})
- Approaches:
  - Maximum likelihood estimation
  - Various smoothing methods
- Applications:
  - Automatic speech recognition
  - Spelling correction
  - Handwriting recognition
  - Statistical machine translation
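
To make the two estimation approaches concrete, here is a minimal sketch of a bigram model (n = 2) with a maximum likelihood estimate and add-one (Laplace) smoothing. Add-one is only one of the "various smoothing methods" and is chosen here for brevity; the toy corpus and all function names are assumptions for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences, padded with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        histories.update(padded[:-1])            # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # (previous word, word) pairs
    # Words that can be predicted: all observed tokens plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return histories, bigrams, vocab

def p_mle(word, prev, histories, bigrams):
    """Maximum likelihood estimate: count(prev, word) / count(prev)."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

def p_add_one(word, prev, histories, bigrams, vocab):
    """Add-one smoothing: every possible next word gets one extra pseudo-count."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))

# Toy corpus (hypothetical).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
histories, bigrams, vocab = train_bigram(corpus)
```

On this toy corpus, `p_mle("cat", "the", histories, bigrams)` is 2/3, while the unseen bigram "cat dog" gets probability 0 under MLE and 1/8 under add-one smoothing. The zero is the reason smoothing is needed before the model is used in applications such as speech recognition.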

Word Sense Disambiguation (WSD)
- The task: to determine which of the senses of an ambiguous word is involved in a particular use of the word
- Approaches:
  - Supervised:
    - Log-linear models
    - Information-theoretic
    - Memory-based learning (kNN)
  - Dictionary-based:
    - Sense definitions
    - Thesauri
    - Translations in a second language
  - Unsupervised:
    - Clustering using EM algorithm
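
As an illustration of the dictionary-based approach using sense definitions, the sketch below picks the sense whose definition shares the most words with the context (a simplified Lesk-style overlap). The two-sense inventory for "bank", the stopword list, and the function name are hypothetical; a real system would take definitions from a dictionary.

```python
def simplified_lesk(context_tokens, sense_definitions, stopwords=frozenset()):
    """Return the sense whose definition overlaps most with the context words."""
    context = {t.lower() for t in context_tokens} - stopwords
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        gloss = set(definition.lower().split()) - stopwords
        overlap = len(context & gloss)
        if overlap > best_overlap:  # ties keep the sense listed first
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical sense inventory for the ambiguous word "bank".
BANK_SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "that", "as", "such", "to", "on"})

river_use = "he sat on the bank of the river watching the water".split()
money_use = "she went to the bank to deposit money".split()
```

Here `simplified_lesk(river_use, BANK_SENSES, STOPWORDS)` selects "bank/river" (overlap on "river" and "water"), and the second context selects "bank/finance" (overlap on "money"). Exact string matching misses "deposit" vs. "deposits", which hints at why plain overlap is a weak signal on its own.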

Word Sense Disambiguation (WSD), continued
- Accuracy: word-specific
  - Easy words: > 90%
  - Hard words: 50-70%
- Applications:
  - Statistical machine translation
  - Information retrieval

Lexical Acquisition
- The task: to develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora
- Examples:
  - Verb subcategorization
  - Prepositional phrase attachment disambiguation
  - Selectional preferences
  - Semantic similarity

Semantic Similarity
- The task: to acquire a relative measure of similarity between two words
- Approaches:
  - Vector space measures (document space, word space, modifier space, etc.)
  - Probabilistic measures (KL-divergence, etc.)
- Applications:
  - Information retrieval (query expansion)
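
The two families of measures can be sketched on word-space vectors, where each word is represented by counts of the words it co-occurs with. Cosine stands in for the vector space measures and KL divergence for the probabilistic ones. The co-occurrence counts are invented for illustration, and the add-alpha smoothing step is an assumption added so that KL divergence stays finite.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def to_distribution(counts, vocab, alpha=1.0):
    """Add-alpha smoothed probability distribution over a shared vocabulary."""
    total = sum(counts.get(k, 0) for k in vocab) + alpha * len(vocab)
    return {k: (counts.get(k, 0) + alpha) / total for k in vocab}

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical co-occurrence counts in "word space".
car = {"drive": 4, "road": 3, "engine": 2}
automobile = {"drive": 3, "road": 2, "engine": 3}
banana = {"eat": 5, "yellow": 3}

vocab = set(car) | set(automobile) | set(banana)
p_car, p_auto, p_banana = (to_distribution(c, vocab) for c in (car, automobile, banana))
```

With these counts, `cosine(car, automobile)` is high and `cosine(car, banana)` is 0, and `kl_divergence(p_car, p_auto)` is smaller than `kl_divergence(p_car, p_banana)`. Note that KL divergence is a dissimilarity and is not symmetric, so lower means more similar and the argument order matters.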

POS Tagging
- The task: labeling each word in a sentence with its appropriate part of speech
- Major approaches:
  - HMM
  - Transformation-based
    - Advantages: speed and storage
- Other approaches:
  - Neural networks, decision trees, memory-based learning, maximum entropy models
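
For the HMM approach, tagging amounts to finding the most probable tag sequence under transition and emission probabilities, which the Viterbi algorithm does by dynamic programming. The sketch below is a minimal version in log space; the three-tag model and all probabilities are made up for illustration, and the small probability floor for unseen events is an assumption standing in for proper smoothing.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def lp(x):
        return math.log(max(x, floor))  # floor avoids log(0) for unseen events

    # best[i][t] = (log prob of best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical toy model.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 1.0},
    "NOUN": {"dog": 0.6, "cat": 0.3, "barks": 0.1},
    "VERB": {"barks": 0.7, "runs": 0.25, "dog": 0.05},
}
```

Calling `viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)` returns `["DET", "NOUN", "VERB"]`. In a trained tagger the three probability tables would be estimated from a tagged corpus rather than written by hand.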

POS Tagging, continued
- Accuracy:
  - 95-97%
  - Achieved only when the application text and the training text are from a similar source
- Applications:
  - For higher-level NLP tasks: partial parsing, parsing, NER, etc.
  - “…the best lexicalized probabilistic parsers are now good enough that they perform better starting with untagged text and doing the tagging themselves, rather than using a tagger as preprocessor.” (Charniak 1997)

(Syntactic) Parsing
- The task: to find the most likely syntactic parse tree of a sentence
- Approaches:
  - Probabilistic context-free grammar (PCFG)
    - Supervised
    - Unsupervised
  - Lexicalized models
  - Dependency-based models
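
To show what "most likely parse tree" means under a PCFG, here is a minimal sketch of probabilistic CKY for a grammar in Chomsky normal form. The grammar, its probabilities, and the example sentence are invented for illustration; in the supervised setting the rule probabilities would be estimated from a treebank.

```python
from collections import defaultdict

def pcky(words, lexical, binary, start="S"):
    """Best parse under a CNF PCFG.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns (probability, tree), or (0.0, None) if no parse exists.
    """
    n = len(words)
    # chart[(i, j)][A] = (best prob of A spanning words[i:j], back-pointer)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w and p > chart[(i, i + 1)].get(a, (0.0, None))[0]:
                chart[(i, i + 1)][a] = (p, w)
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for (a, b, c), p in binary.items():
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        prob = p * chart[(i, k)][b][0] * chart[(k, j)][c][0]
                        if prob > chart[(i, j)].get(a, (0.0, None))[0]:
                            chart[(i, j)][a] = (prob, (k, b, c))

    def build(i, j, a):
        _, back = chart[(i, j)][a]
        if isinstance(back, str):  # lexical rule
            return (a, back)
        k, b, c = back
        return (a, build(i, k, b), build(k, j, c))

    if start not in chart[(0, n)]:
        return 0.0, None
    return chart[(0, n)][start][0], build(0, n, start)

# Hypothetical toy grammar with a prepositional phrase attachment ambiguity.
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {
    ("NP", "she"): 0.3, ("NP", "stars"): 0.3, ("NP", "telescopes"): 0.2,
    ("V", "saw"): 1.0, ("P", "with"): 1.0,
}

prob, tree = pcky("she saw stars with telescopes".split(), LEXICAL, BINARY)
```

With these numbers the verb-attachment reading (probability 0.00378) beats the noun-attachment reading (0.00252), so `tree` attaches the PP to the VP. A plain PCFG decides this from rule probabilities alone, without looking at the words involved, which is one motivation for the lexicalized models listed above.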

(Syntactic) Parsing, continued
- Accuracy:
  - Charniak 1997: recall 0.875, precision 0.874
  - Collins 1997: recall 0.881, precision 0.886
- Applications:
  - For other NLP tasks such as semantic role labeling and relation extraction
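
The slide does not say how recall and precision are computed. Assuming they are the usual labeled-bracket scores (a predicted constituent counts as correct when its label and span match a gold constituent), a minimal sketch looks like this. The constituent lists are hypothetical and describe two analyses of the five-word sentence "she saw stars with telescopes".

```python
from collections import Counter

def bracket_scores(gold, predicted):
    """Labeled precision and recall over (label, start, end) constituents."""
    matched = sum((Counter(gold) & Counter(predicted)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Gold tree attaches the PP to the verb phrase; the prediction attaches it to the noun.
gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("VP", 1, 3),
        ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)]
predicted = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5),
             ("NP", 2, 3), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall = bracket_scores(gold, predicted)
```

Six of the seven brackets match on each side, so precision and recall are both 6/7: one attachment error costs one bracket in each direction.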

Semantic Role Labeling
- The task: to identify the predicate-argument structures in sentences
- Approaches:
  - Supervised learning
- Accuracy:
  - Best ~70% (CoNLL 04 shared task)
- Applications:
  - Information extraction
  - Question answering

Textual Entailment
- The task: given two text fragments, to recognize whether the meaning of one text is entailed by (can be inferred from) the other text
- Approaches:
  - Word overlap
  - Statistical lexical relations
  - Syntactic matching
  - Logic inference
- Accuracy:
  - ~0.56, best ~0.60 (PASCAL Challenge 05)
- Applications:
  - Question answering
  - Multi-document summarization
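
The simplest of the listed approaches, word overlap, can be sketched as the fraction of hypothesis words that also appear in the text, with entailment predicted when that fraction passes a threshold. The threshold value, the whitespace tokenization, and the example sentences are assumptions for illustration only.

```python
def overlap_score(text, hypothesis):
    """Fraction of distinct hypothesis words that also occur in the text."""
    text_words = set(text.lower().split())
    hyp_words = set(hypothesis.lower().split())
    return len(text_words & hyp_words) / len(hyp_words) if hyp_words else 0.0

def entails(text, hypothesis, threshold=0.75):
    """Predict entailment when enough of the hypothesis is covered by the text."""
    return overlap_score(text, hypothesis) >= threshold

text = "John bought a new car yesterday"
covered = "John bought a car"   # every hypothesis word appears in the text
uncovered = "Mary sold a car"   # only half of the hypothesis words appear
negated = "John never bought a new car"
```

`entails(text, covered)` is true and `entails(text, uncovered)` is false, as hoped. But `entails(negated, covered)` is also true, because overlap ignores negation and word order; weaknesses of this kind are what the lexical, syntactic, and logic-based approaches try to address.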

Tools
- Brill Tagger
- Charniak Parser
- Collins Parser
- MiniPar
- Semantic Parser
  - ASSERT Parser
  - CCG’s demo

Corpora
- WordNet
- Penn Treebank (Sample)
- PropBank
- FrameNet

Other Tasks
- Automatic Speech Recognition
- Natural Language Generation
- Automatic Summarization
- …

Recent topics
- Unsupervised and semi-supervised approaches
  - Knowledge acquisition bottleneck
- Semantic role labeling
  - Improve the performance of SRL
  - Use the results for other tasks
- Relation extraction
- WSD
- Parsing
- Statistical machine translation
  - Word alignment

NLP Research Groups
- USC/ISI
- Stanford
- UPenn
- Johns-Hopkins
- UIUC
- …