Corpus-based computational linguistics or computational corpus linguistics?
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Outline
• Different worlds?
  – Corpus-based computational linguistics
  – Computational corpus linguistics
  – Similarities and differences
  – Opportunities for collaboration
• Computational linguistics – an example
  – Dependency-based syntactic analysis
  – Machine learning
Different worlds?
Corpora and computers
• The empirical revolution in (computational) linguistics:
  – Increased use of empirical data
  – Development of large corpora
  – Annotation of corpus data (syntactic, semantic)
• Underlying causes:
  – Technical development:
    • Availability of machine-readable text (and digitized speech)
    • Computational capacity:
      – Storage
      – Processing
  – Scientific shift:
    • Criticism of armchair linguistics
    • Development of statistical language models
Computational corpus linguistics
• Goal:
  – Knowledge of language
    • Descriptive studies
    • Theoretical hypothesis testing
• Means:
  – Corpus data as a source of knowledge of language
    • Descriptive statistics
    • Statistical inference for hypothesis testing
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and visualization (for humans)
    • Statistical analysis (descriptive and inferential)
Corpus-based computational linguistics
• Goal:
  – Computer programs that process natural language
    • Practical applications (translation, summarization, …)
    • Models of language learning and use
• Means:
  – Corpus data as a source of knowledge of language:
    • Statistical inference for model parameters (estimation)
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and information extraction (for computers)
    • Statistical analysis (estimation/machine learning)
Corpus processing 1
• Corpus development:
  – Tokenization (minimal units, words, etc.)
  – Segmentation (on several levels)
  – Normalization (e.g., abbreviations, orthography, multi-word units; graphical elements, metadata, etc.)
• Annotation:
  – Part-of-speech tagging (word → word class)
  – Lemmatization (word → base form/lemma)
  – Syntactic analysis (sentence → syntactic representation)
  – Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
  – Automatic analysis (often based on other corpus data; see the sketch below)
  – Manual validation (and correction)
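A minimal sketch of the automatic-analysis step, using NLTK's off-the-shelf tokenizer and Penn Treebank-style tagger as stand-ins for whatever corpus-specific tools a project would actually use (NLTK and its resource names are assumptions for illustration, not part of the original presentation):

# Hypothetical illustration: tokenization + part-of-speech tagging with NLTK.
import nltk

# The tokenizer and tagger models must be installed once; resource names can
# differ slightly between NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Economic news had little effect on financial markets."
tokens = nltk.word_tokenize(sentence)   # segmentation into minimal units
tagged = nltk.pos_tag(tokens)           # word -> word class (Penn Treebank tags)
print(tagged)                           # e.g. [('Economic', 'JJ'), ('news', 'NN'), ...]

Manual validation would then go over such output and correct tagger errors before the annotation is released.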
Corpus processing 2
• Searching and sorting:
  – Search methods:
    • String matching
    • Regular expressions
    • Dedicated query languages
    • Special-purpose programs
  – Results (see the sketch below):
    • Concordances
    • Frequency lists
• Visualization:
  – Textual:
    • Concordances, etc.
  – Graphical:
    • Diagrams, syntax trees, etc.
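A minimal sketch of regular-expression search over a corpus, producing the two result types named above: a KWIC-style concordance and a frequency list (the toy text, the search pattern, and the context-window size are illustrative choices):

# Hypothetical illustration: concordance and frequency list via regular expressions.
import re
from collections import Counter

text = ("Economic news had little effect on financial markets . "
        "The news about the markets was not good news .")

# Concordance: every match of the pattern with a window of surrounding context.
pattern = re.compile(r"\bnews\b")
for m in pattern.finditer(text):
    left = text[max(0, m.start() - 25):m.start()]
    right = text[m.end():m.end() + 25]
    print(f"{left:>25} [{m.group()}] {right:<25}")

# Frequency list: token counts over the whole text, most frequent first.
freq = Counter(re.findall(r"\w+|[^\w\s]", text.lower()))
print(freq.most_common(5))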
Corpus processing 3
• Statistical analysis:
  – Descriptive statistics
    • Frequency tables and diagrams
  – Statistical inference
    • Hypothesis testing (t-test, χ², Mann-Whitney, etc.; see the example below)
    • Machine learning:
      – Probabilistic: estimate probability distributions
      – Discriminative: approximate the mapping from input to output
  – Induction of lexical and grammatical resources (e.g., collocations, valency frames)
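As an illustration of the hypothesis testing listed above, here is a hand-computed χ² test of independence on a toy 2×2 frequency table, the kind of test used to ask whether two words co-occur more often than chance would predict (the counts are invented for the example):

# Hypothetical illustration: chi-square test of independence on a 2x2 table.
# Rows: word A present / absent; columns: word B present / absent.
observed = [[30, 70],
            [20, 380]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # count expected under independence
        chi2 += (o - expected) ** 2 / expected

print(f"chi-square = {chi2:.2f}")
# With 1 degree of freedom the 5% critical value is 3.84, so a value above that
# lets us reject the hypothesis that the two words are independent.

For real studies, scipy.stats.chi2_contingency performs this kind of test on arbitrary contingency tables and also returns a p-value.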
User Requirements
• Corpus linguists
  – Software
    • Accessible
    • Easy to use
    • General
  – Output
    • Suitable for humans
    • Perspicuous (graphical visualization)
  – Functions
    • Specific search
    • Descriptive statistics
• Computational linguists
  – Software
    • Efficient
    • Modifiable
    • Specific
  – Output
    • Suitable for computers
    • Well-defined format (annotated text)
  – Functions
    • Exhaustive search
    • Statistical learning
Summary
• Different goals:
  – Study language
  – Create computer programs
• … lead to (partly) different requirements:
  – Accessible and usable (for humans)
  – Efficient and standardized (for computers)
• … but (partly) the same needs:
  – Corpus development and annotation
  – Searching, sorting, and statistical analysis
Symbiosis?
• What can computational linguists do for corpus linguists?
  – Technical and general linguistic competence
  – Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
  – Linguistic and language-specific competence
  – Manual validation of automatic analysis
• What can they achieve together?
  – Automatic annotation improves precision in corpus linguistics
  – Manual validation improves precision in computational linguistics
  – A virtuous circle?
Computational linguistics – an example
Dependency analysis
[Figure: dependency graph for the sentence 'Economic news had little effect on financial markets .', with word positions 0–9 (0 = artificial ROOT), part-of-speech tags JJ NN VBD JJ NN IN JJ NNS ., and the dependency labels NMOD, SBJ, ROOT, NMOD, OBJ, NMOD, NMOD, PMOD, P on its arcs]
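For reference, the same analysis written out token by token as (position, word form, part of speech, head position, dependency label); the head positions follow the arcs in the figure, with 0 standing for the artificial ROOT, and the tuple format is just an illustrative choice:

# The example analysis as data; head = 0 means the word depends on ROOT.
SENTENCE = [
    (1, "Economic",  "JJ",  2, "NMOD"),
    (2, "news",      "NN",  3, "SBJ"),
    (3, "had",       "VBD", 0, "ROOT"),
    (4, "little",    "JJ",  5, "NMOD"),
    (5, "effect",    "NN",  3, "OBJ"),
    (6, "on",        "IN",  5, "NMOD"),
    (7, "financial", "JJ",  8, "NMOD"),
    (8, "markets",   "NNS", 6, "PMOD"),
    (9, ".",         ".",   3, "P"),
]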
Inductive dependency parsing
• Deterministic syntactic analysis (parsing):
  – Algorithm for deriving dependency structures
  – Requires a decision function in choice situations
  – All decisions are final (deterministic)
• Inductive machine learning:
  – Decision function based on previous experience
  – Generalize from examples (successive refinement)
  – Examples = annotated sentences (treebank)
  – No grammar – just analogy
Algorithm
• Data structures:
  – Queue of unanalyzed words (next = first in queue)
  – Stack of partially analyzed words (top = on top of stack)
• Start state:
  – Empty stack
  – All words in queue
• Algorithm steps (see the sketch below):
  – Shift: put next on top of stack (push)
  – Reduce: remove top from stack (pop)
  – Right: put next on top of stack (push); link top → next
  – Left: remove top from stack (pop); link next → top
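A minimal sketch of these data structures and the four transitions in Python; the class and method names are illustrative, and the decision function that chooses which transition to apply is deliberately left out:

# Hypothetical illustration of the transition system described above.
from collections import deque

class Parser:
    def __init__(self, words):
        self.queue = deque(words)   # unanalyzed words; queue[0] is "next"
        self.stack = []             # partially analyzed words; stack[-1] is "top"
        self.arcs = []              # (head, label, dependent) triples

    def shift(self):                 # put next on top of the stack (push)
        self.stack.append(self.queue.popleft())

    def reduce(self):                # remove top from the stack (pop)
        self.stack.pop()

    def right(self, label):          # link top -> next, then push next
        self.arcs.append((self.stack[-1], label, self.queue[0]))
        self.shift()

    def left(self, label):           # link next -> top, then pop top
        self.arcs.append((self.queue[0], label, self.stack[-1]))
        self.reduce()

On the example sentence, the first two transitions would be shift() (pushing "Economic") followed by left("NMOD"), which adds the arc news → Economic and pops "Economic" from the stack.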
Algorithm example
[Figure: step-by-step trace of the algorithm on 'Economic news had little effect on financial markets .' (with the artificial ROOT at position 0). The transition sequence SHIFT, LA(NMOD), SHIFT, LA(SBJ), SHIFT, SHIFT, LA(NMOD), RA(OBJ), RA(NMOD), SHIFT, LA(NMOD), RA(PMOD), REDUCE, REDUCE, REDUCE, RA(P) derives the dependency graph shown on the earlier slide]
Decision function
• Non-determinism:
  – e.g., after 'eats pizza', the next word 'with' can be attached to 'pizza' (RA(ATT)?) or the parser can REDUCE and attach it higher up
• Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
  – Grammar?
  – Inductive generalization!
Machine learning
• Decision function:
  – (Queue, Stack, Graph) → Step
• Model:
  – (Queue, Stack, Graph) → (f1, …, fn)
• Classifier:
  – (f1, …, fn) → Step
• Learning:
  – { ((f1, …, fn), Step) } → Classifier
Model
• Parts of speech: t1, top, next, n1, n2, n3
• Dependency types: t.hd, t.ld, t.rd, n.ld
• Word forms: top, next, top.hd, n1
[Figure: positions used by the feature model – the stack (…, t1, top) and the queue (next, n1, n2, n3, …), together with the head (hd), leftmost dependent (ld), and rightmost dependent (rd) of top and the leftmost dependent (ld) of next]
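A sketch of how this feature model could turn a parser state into the feature vector (f1, …, fn) that the classifier sees, assuming tokens are small dicts and the partially built graph is represented by head and label dictionaries; the concrete names and representations are assumptions for illustration, not the original implementation:

# Hypothetical illustration of feature extraction for the model above.
NONE = "_"   # value used when a position or relation is undefined

def feats(stack, queue, heads, labels, tokens):
    # stack, queue : lists of token dicts with "id", "form", "pos"
    # heads, labels: dicts from dependent id to head id / dependency label
    # tokens       : dict from id to token dict (the whole sentence)
    def tok(i):   return tokens.get(i) if i is not None else None
    def pos(t):   return t["pos"] if t else NONE
    def form(t):  return t["form"] if t else NONE
    def label(t): return labels.get(t["id"], NONE) if t else NONE
    def head(t):  return tok(heads.get(t["id"])) if t else None
    def dep(t, pick):   # leftmost (min) or rightmost (max) dependent of t
        ds = [d for d, h in heads.items() if t and h == t["id"]]
        return tok(pick(ds)) if ds else None

    top = stack[-1] if stack else None
    t1 = stack[-2] if len(stack) > 1 else None
    nxt, n1, n2, n3 = (queue[i] if len(queue) > i else None for i in range(4))
    return (
        # parts of speech: t1, top, next, n1, n2, n3
        pos(t1), pos(top), pos(nxt), pos(n1), pos(n2), pos(n3),
        # dependency types: top's own arc, top's leftmost/rightmost dependents,
        # next's leftmost dependent
        label(top), label(dep(top, min)), label(dep(top, max)), label(dep(nxt, min)),
        # word forms: top, next, top's head, n1
        form(top), form(nxt), form(head(top)), form(n1),
    )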
Memory-based learning
• Memory-based learning and classification:
  – Learning is storing experiences in memory.
  – Problem solving is achieved by reusing solutions of similar problems experienced in the past.
• TIMBL (Tilburg Memory-Based Learner):
  – Basic method: k-nearest neighbor
  – Parameters:
    • Number of neighbors (k)
    • Distance metrics
    • Weighting of attributes, values, and instances
Learning example
• Instance base:1. (a, b, a, c) A
2. (a, b, c, a) B
3. (b, a, c, c) C
4. (c, a, b, c) A
1. New instance:1. (a, b, b, a)
• Distances:1. D(1, 5) = 2
2. D(2, 5) = 1
3. D(3, 5) = 4
4. D(4, 5) = 3
• k-NN:1. 1-NN(5) = B
2. 2-NN(5) = A/B
3. 3-NN(5) = A
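A minimal re-creation of this worked example in Python, using the simple overlap distance (number of mismatching attribute values) and majority voting among the k nearest stored instances; it ignores TIMBL's distance metrics and weighting options, and reports ties explicitly:

# Hypothetical illustration of memory-based (k-NN) classification.
from collections import Counter

instance_base = [                   # (attribute values, class)
    (("a", "b", "a", "c"), "A"),    # instance 1
    (("a", "b", "c", "a"), "B"),    # instance 2
    (("b", "a", "c", "c"), "C"),    # instance 3
    (("c", "a", "b", "c"), "A"),    # instance 4
]
new = ("a", "b", "b", "a")          # instance 5

def overlap(x, y):                  # number of positions where the values differ
    return sum(a != b for a, b in zip(x, y))

def knn(query, base, k):
    ranked = sorted(base, key=lambda inst: overlap(query, inst[0]))
    votes = Counter(cls for _, cls in ranked[:k])
    best = max(votes.values())
    return "/".join(sorted(c for c, v in votes.items() if v == best))

for x, _ in instance_base:
    print("distance:", overlap(new, x))            # 2, 1, 4, 3 as above
for k in (1, 2, 3):
    print(f"{k}-NN:", knn(new, instance_base, k))  # B, A/B, A as above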
Experimental evaluation
• Inductive dependency analysis:
  – Deterministic algorithm
  – Memory-based decision function
• Data:
  – English:
    • Penn Treebank, WSJ (1M words)
    • Converted to dependency structure
  – Swedish:
    • Talbanken, professional prose (100k words)
    • Dependency structure based on MAMBA annotation
Results
• English:
  – 87.3% of all words got the correct head
  – 85.6% of all words got the correct head and label
• Swedish:
  – 85.9% of all words got the correct head
  – 81.6% of all words got the correct head and label
Dependency types: English
• High precision (F ≥ 86%):
  – VC (auxiliary verb → main verb) 95.0%
  – NMOD (noun modifier) 91.0%
  – SBJ (verb → subject) 89.3%
  – PMOD (complement of preposition) 88.6%
  – SBAR (complementizer → verb) 86.1%
• Medium precision (73% ≤ F ≤ 83%):
  – ROOT 82.4%
  – OBJ (verb → object) 81.1%
  – VMOD (adverbial) 76.8%
  – AMOD (adj/adv modifier) 76.7%
  – PRD (predicative complement) 73.8%
• Low precision (F ≤ 70%):
  – DEP (other)
Dependency types: Swedish
• High precision (F ≥ 84%):
  – IM (infinitive marker → infinitive) 98.5%
  – PR (preposition → noun) 90.6%
  – UK (complementizer → verb) 86.4%
  – VC (auxiliary verb → main verb) 86.1%
  – DET (noun → determiner) 89.5%
  – ROOT 87.8%
  – SUB (verb → subject) 84.5%
• Medium precision (76% ≤ F ≤ 80%):
  – ATT (noun modifier) 79.2%
  – CC (coordination) 78.9%
  – OBJ (verb → object) 77.7%
  – PRD (verb → predicative) 76.8%
  – ADV (adverbial) 76.3%
• Low precision (F ≤ 70%):
  – INF, APP, XX, ID
Corpus annotation
• How good is 85%?
  – Good enough to save time for manual annotators
  – Good enough to improve search precision
  – Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
  – By annotating more data, which facilitates machine learning
  – By refined linguistic analysis of the structures to be annotated and the errors made
MaltParser
• Software for inductive dependency parsing:
  – Freely available (open source)
    • http://maltparser.org
  – Evaluated on close to 30 different languages
  – Used for annotating corpora at Uppsala University