Corpus-based computational linguistics or computational corpus linguistics?
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Outline
• Different worlds?
  – Corpus-based computational linguistics
  – Computational corpus linguistics
  – Similarities and differences
  – Opportunities for collaboration
• Computational linguistics – an example
  – Dependency-based syntactic analysis
  – Machine learning
Different worlds?
Corpora and computers
• The empirical revolution in (computational) linguistics:
  – Increased use of empirical data
  – Development of large corpora
  – Annotation of corpus data (syntactic, semantic)
• Underlying causes:
  – Technical development:
    • Availability of machine-readable text (and digitized speech)
    • Computational capacity:
      – Storage
      – Processing
  – Scientific shift:
    • Criticism of armchair linguistics
    • Development of statistical language models
Computational corpus linguistics
• Goal:
  – Knowledge of language
    • Descriptive studies
    • Theoretical hypothesis testing
• Means:
  – Corpus data as a source of knowledge of language
    • Descriptive statistics
    • Statistical inference for hypothesis testing
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and visualization (for humans)
    • Statistical analysis (descriptive and inferential)
Corpus-based computational linguistics
• Goal:
  – Computer programs that process natural language
    • Practical applications (translation, summarization, …)
    • Models of language learning and use
• Means:
  – Corpus data as a source of knowledge of language:
    • Statistical inference for model parameters (estimation)
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and information extraction (for computers)
    • Statistical analysis (estimation/machine learning)
Corpus processing 1
• Corpus development:
  – Tokenization (minimal units, words, etc.)
  – Segmentation (on several levels)
  – Normalization (e.g., abbreviations, orthography, multi-word units; graphical elements, metadata, etc.)
• Annotation:
  – Part-of-speech tagging (word → word class)
  – Lemmatization (word → base form/lemma)
  – Syntactic analysis (sentence → syntactic representation)
  – Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
  – Automatic analysis (often based on other corpus data; see the sketch below)
  – Manual validation (and correction)
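A minimal sketch of the automatic-analysis step, using NLTK's off-the-shelf tokenizer and Penn Treebank-style tagger as stand-ins for whatever corpus-specific tools a project would actually use (NLTK and its resource names are assumptions for illustration, not part of the original presentation):

# Hypothetical illustration: tokenization + part-of-speech tagging with NLTK.
import nltk

# The tokenizer and tagger models must be installed once; resource names can
# differ slightly between NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Economic news had little effect on financial markets."
tokens = nltk.word_tokenize(sentence)   # segmentation into minimal units
tagged = nltk.pos_tag(tokens)           # word -> word class (Penn Treebank tags)
print(tagged)                           # e.g. [('Economic', 'JJ'), ('news', 'NN'), ...]

Manual validation would then go over such output and correct tagger errors before the annotation is released.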
Corpus processing 2
• Searching and sorting:
  – Search methods:
    • String matching
    • Regular expressions
    • Dedicated query languages
    • Special-purpose programs
  – Results (see the sketch below):
    • Concordances
    • Frequency lists
• Visualization:
  – Textual:
    • Concordances, etc.
  – Graphical:
    • Diagrams, syntax trees, etc.
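A minimal sketch of regular-expression search over a corpus, producing the two result types named above: a KWIC-style concordance and a frequency list (the toy text, the search pattern, and the context-window size are illustrative choices):

# Hypothetical illustration: concordance and frequency list via regular expressions.
import re
from collections import Counter

text = ("Economic news had little effect on financial markets . "
        "The news about the markets was not good news .")

# Concordance: every match of the pattern with a window of surrounding context.
pattern = re.compile(r"\bnews\b")
for m in pattern.finditer(text):
    left = text[max(0, m.start() - 25):m.start()]
    right = text[m.end():m.end() + 25]
    print(f"{left:>25} [{m.group()}] {right:<25}")

# Frequency list: token counts over the whole text, most frequent first.
freq = Counter(re.findall(r"\w+|[^\w\s]", text.lower()))
print(freq.most_common(5))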
Corpus processing 3
• Statistical analysis:
  – Descriptive statistics
    • Frequency tables and diagrams
  – Statistical inference
    • Hypothesis testing (t-test, χ², Mann-Whitney, etc.; see the example below)
    • Machine learning:
      – Probabilistic: estimate probability distributions
      – Discriminative: approximate the mapping from input to output
  – Induction of lexical and grammatical resources (e.g., collocations, valency frames)
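As an illustration of the hypothesis testing listed above, here is a hand-computed χ² test of independence on a toy 2×2 frequency table, the kind of test used to ask whether two words co-occur more often than chance would predict (the counts are invented for the example):

# Hypothetical illustration: chi-square test of independence on a 2x2 table.
# Rows: word A present / absent; columns: word B present / absent.
observed = [[30, 70],
            [20, 380]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # count expected under independence
        chi2 += (o - expected) ** 2 / expected

print(f"chi-square = {chi2:.2f}")
# With 1 degree of freedom the 5% critical value is 3.84, so a value above that
# lets us reject the hypothesis that the two words are independent.

For real studies, scipy.stats.chi2_contingency performs this kind of test on arbitrary contingency tables and also returns a p-value.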
User Requirements
• Corpus linguists
  – Software
    • Accessible
    • Easy to use
    • General
  – Output
    • Suitable for humans
    • Perspicuous (graphical visualization)
  – Functions
    • Specific search
    • Descriptive statistics
• Computational linguists
  – Software
    • Efficient
    • Modifiable
    • Specific
  – Output
    • Suitable for computers
    • Well-defined format (annotated text)
  – Functions
    • Exhaustive search
    • Statistical learning
Summary
• Different goals:
  – Study language
  – Create computer programs
• … lead to (partly) different requirements:
  – Accessible and usable (for humans)
  – Efficient and standardized (for computers)
• … but (partly) the same needs:
  – Corpus development and annotation
  – Searching, sorting, and statistical analysis
Symbiosis?
• What can computational linguists do for corpus linguists?
  – Technical and general linguistic competence
  – Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
  – Linguistic and language-specific competence
  – Manual validation of automatic analysis
• What can they achieve together?
  – Automatic annotation improves precision in corpus linguistics
  – Manual validation improves precision in computational linguistics
  – A virtuous circle?
Computational linguistics – an example
Dependency analysis
[Figure: dependency graph for the sentence 'Economic news had little effect on financial markets .', with word positions 0–9 (0 = artificial ROOT), part-of-speech tags JJ NN VBD JJ NN IN JJ NNS ., and the dependency labels NMOD, SBJ, ROOT, NMOD, OBJ, NMOD, NMOD, PMOD, P on its arcs]
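For reference, the same analysis written out token by token as (position, word form, part of speech, head position, dependency label); the head positions follow the arcs in the figure, with 0 standing for the artificial ROOT, and the tuple format is just an illustrative choice:

# The example analysis as data; head = 0 means the word depends on ROOT.
SENTENCE = [
    (1, "Economic",  "JJ",  2, "NMOD"),
    (2, "news",      "NN",  3, "SBJ"),
    (3, "had",       "VBD", 0, "ROOT"),
    (4, "little",    "JJ",  5, "NMOD"),
    (5, "effect",    "NN",  3, "OBJ"),
    (6, "on",        "IN",  5, "NMOD"),
    (7, "financial", "JJ",  8, "NMOD"),
    (8, "markets",   "NNS", 6, "PMOD"),
    (9, ".",         ".",   3, "P"),
]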
Inductive dependency parsing
• Deterministic syntactic analysis (parsing):
  – Algorithm for deriving dependency structures
  – Requires a decision function in choice situations
  – All decisions are final (deterministic)
• Inductive machine learning:
  – Decision function based on previous experience
  – Generalize from examples (successive refinement)
  – Examples = annotated sentences (treebank)
  – No grammar – just analogy
Algorithm
• Data structures:
  – Queue of unanalyzed words (next = first in queue)
  – Stack of partially analyzed words (top = on top of stack)
• Start state:
  – Empty stack
  – All words in queue
• Algorithm steps (see the sketch below):
  – Shift: put next on top of stack (push)
  – Reduce: remove top from stack (pop)
  – Right: put next on top of stack (push); link top → next
  – Left: remove top from stack (pop); link next → top
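A minimal sketch of these data structures and the four transitions in Python; the class and method names are illustrative, and the decision function that chooses which transition to apply is deliberately left out:

# Hypothetical illustration of the transition system described above.
from collections import deque

class Parser:
    def __init__(self, words):
        self.queue = deque(words)   # unanalyzed words; queue[0] is "next"
        self.stack = []             # partially analyzed words; stack[-1] is "top"
        self.arcs = []              # (head, label, dependent) triples

    def shift(self):                 # put next on top of the stack (push)
        self.stack.append(self.queue.popleft())

    def reduce(self):                # remove top from the stack (pop)
        self.stack.pop()

    def right(self, label):          # link top -> next, then push next
        self.arcs.append((self.stack[-1], label, self.queue[0]))
        self.shift()

    def left(self, label):           # link next -> top, then pop top
        self.arcs.append((self.queue[0], label, self.stack[-1]))
        self.reduce()

On the example sentence, the first two transitions would be shift() (pushing "Economic") followed by left("NMOD"), which adds the arc news → Economic and pops "Economic" from the stack.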
Algorithm example
[Figure: step-by-step trace of the algorithm on 'Economic news had little effect on financial markets .' (with the artificial ROOT at position 0). The transition sequence SHIFT, LA(NMOD), SHIFT, LA(SBJ), SHIFT, SHIFT, LA(NMOD), RA(OBJ), RA(NMOD), SHIFT, LA(NMOD), RA(PMOD), REDUCE, REDUCE, REDUCE, RA(P) derives the dependency graph shown on the earlier slide]
Decision function
• Non-determinism:
  – e.g., after 'eats pizza', the next word 'with' can be attached to 'pizza' (RA(ATT)?) or the parser can REDUCE and attach it higher up
• Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
  – Grammar?
  – Inductive generalization!
Machine learning
• Decision function:
  – (Queue, Stack, Graph) → Step
• Model:
  – (Queue, Stack, Graph) → (f1, …, fn)
• Classifier:
  – (f1, …, fn) → Step
• Learning:
  – { ((f1, …, fn), Step) } → Classifier
Model
• Parts of speech: t1, top, next, n1, n2, n3
• Dependency types: t.hd, t.ld, t.rd, n.ld
• Word forms: top, next, top.hd, n1
[Figure: positions used by the feature model – the stack (…, t1, top) and the queue (next, n1, n2, n3, …), together with the head (hd), leftmost dependent (ld), and rightmost dependent (rd) of top and the leftmost dependent (ld) of next]
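A sketch of how this feature model could turn a parser state into the feature vector (f1, …, fn) that the classifier sees, assuming tokens are small dicts and the partially built graph is represented by head and label dictionaries; the concrete names and representations are assumptions for illustration, not the original implementation:

# Hypothetical illustration of feature extraction for the model above.
NONE = "_"   # value used when a position or relation is undefined

def feats(stack, queue, heads, labels, tokens):
    # stack, queue : lists of token dicts with "id", "form", "pos"
    # heads, labels: dicts from dependent id to head id / dependency label
    # tokens       : dict from id to token dict (the whole sentence)
    def tok(i):   return tokens.get(i) if i is not None else None
    def pos(t):   return t["pos"] if t else NONE
    def form(t):  return t["form"] if t else NONE
    def label(t): return labels.get(t["id"], NONE) if t else NONE
    def head(t):  return tok(heads.get(t["id"])) if t else None
    def dep(t, pick):   # leftmost (min) or rightmost (max) dependent of t
        ds = [d for d, h in heads.items() if t and h == t["id"]]
        return tok(pick(ds)) if ds else None

    top = stack[-1] if stack else None
    t1 = stack[-2] if len(stack) > 1 else None
    nxt, n1, n2, n3 = (queue[i] if len(queue) > i else None for i in range(4))
    return (
        # parts of speech: t1, top, next, n1, n2, n3
        pos(t1), pos(top), pos(nxt), pos(n1), pos(n2), pos(n3),
        # dependency types: top's own arc, top's leftmost/rightmost dependents,
        # next's leftmost dependent
        label(top), label(dep(top, min)), label(dep(top, max)), label(dep(nxt, min)),
        # word forms: top, next, top's head, n1
        form(top), form(nxt), form(head(top)), form(n1),
    )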
Memory-based learning
• Memory-based learning and classification:
  – Learning is storing experiences in memory.
  – Problem solving is achieved by reusing solutions of similar problems experienced in the past.
• TIMBL (Tilburg Memory-Based Learner):
  – Basic method: k-nearest neighbor
  – Parameters:
    • Number of neighbors (k)
    • Distance metrics
    • Weighting of attributes, values, and instances
Learning example
• Instance base:1. (a, b, a, c) A
2. (a, b, c, a) B
3. (b, a, c, c) C
4. (c, a, b, c) A
1. New instance:1. (a, b, b, a)
• Distances:1. D(1, 5) = 2
2. D(2, 5) = 1
3. D(3, 5) = 4
4. D(4, 5) = 3
• k-NN:1. 1-NN(5) = B
2. 2-NN(5) = A/B
3. 3-NN(5) = A
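A minimal re-creation of this worked example in Python, using the simple overlap distance (number of mismatching attribute values) and majority voting among the k nearest stored instances; it ignores TIMBL's distance metrics and weighting options, and reports ties explicitly:

# Hypothetical illustration of memory-based (k-NN) classification.
from collections import Counter

instance_base = [                   # (attribute values, class)
    (("a", "b", "a", "c"), "A"),    # instance 1
    (("a", "b", "c", "a"), "B"),    # instance 2
    (("b", "a", "c", "c"), "C"),    # instance 3
    (("c", "a", "b", "c"), "A"),    # instance 4
]
new = ("a", "b", "b", "a")          # instance 5

def overlap(x, y):                  # number of positions where the values differ
    return sum(a != b for a, b in zip(x, y))

def knn(query, base, k):
    ranked = sorted(base, key=lambda inst: overlap(query, inst[0]))
    votes = Counter(cls for _, cls in ranked[:k])
    best = max(votes.values())
    return "/".join(sorted(c for c, v in votes.items() if v == best))

for x, _ in instance_base:
    print("distance:", overlap(new, x))            # 2, 1, 4, 3 as above
for k in (1, 2, 3):
    print(f"{k}-NN:", knn(new, instance_base, k))  # B, A/B, A as above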
Experimental evaluation
• Inductive dependency analysis:
  – Deterministic algorithm
  – Memory-based decision function
• Data:
  – English:
    • Penn Treebank, WSJ (1M words)
    • Converted to dependency structure
  – Swedish:
    • Talbanken, professional prose (100k words)
    • Dependency structure based on MAMBA annotation
Results
• English:
  – 87.3% of all words got the correct head
  – 85.6% of all words got the correct head and label
• Swedish:
  – 85.9% of all words got the correct head
  – 81.6% of all words got the correct head and label
Dependency types: English
• High precision (F ≥ 86%):
  – VC (auxiliary verb → main verb) 95.0%
  – NMOD (noun modifier) 91.0%
  – SBJ (verb → subject) 89.3%
  – PMOD (complement of preposition) 88.6%
  – SBAR (complementizer → verb) 86.1%
• Medium precision (73% ≤ F ≤ 83%):
  – ROOT 82.4%
  – OBJ (verb → object) 81.1%
  – VMOD (adverbial) 76.8%
  – AMOD (adj/adv modifier) 76.7%
  – PRD (predicative complement) 73.8%
• Low precision (F ≤ 70%):
  – DEP (other)
Dependency types: Swedish
• High precision (F ≥ 84%):
  – IM (infinitive marker → infinitive) 98.5%
  – PR (preposition → noun) 90.6%
  – UK (complementizer → verb) 86.4%
  – VC (auxiliary verb → main verb) 86.1%
  – DET (noun → determiner) 89.5%
  – ROOT 87.8%
  – SUB (verb → subject) 84.5%
• Medium precision (76% ≤ F ≤ 80%):
  – ATT (noun modifier) 79.2%
  – CC (coordination) 78.9%
  – OBJ (verb → object) 77.7%
  – PRD (verb → predicative) 76.8%
  – ADV (adverbial) 76.3%
• Low precision (F ≤ 70%):
  – INF, APP, XX, ID
Corpus annotation
• How good is 85%?
  – Good enough to save time for manual annotators
  – Good enough to improve search precision
  – Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
  – By annotating more data, which facilitates machine learning
  – By refined linguistic analysis of the structures to be annotated and the errors made
MaltParser
• Software for inductive dependency parsing:
  – Freely available (open source)
    • http://maltparser.org
  – Evaluated on close to 30 different languages
  – Used for annotating corpora at Uppsala University