Machine Learning for (Psycho-)Linguistics Walter Daelemans [email protected] http://cnts.uia.ac.be CNTS, University of Antwerp ILK, Tilburg University QITL-02


Page 1: Machine Learning for  (Psycho-)Linguistics

Machine Learning for (Psycho-)Linguistics

Walter [email protected]

http://cnts.uia.ac.be

CNTS, University of Antwerp

ILK, Tilburg University

QITL-02

Page 2: Machine Learning for  (Psycho-)Linguistics

Outline

• Machine Learning of Language
  – Induction of rules and classes
  – Learning by Analogy
• Case Studies
  – Discovery of phonological categories and morphological rules
  – A single-route model of morphological processing
• Issues
  – Probabilities versus symbolic structure induction
  – Nativism versus empiricism
  – Exemplar analogy versus rules

Page 3: Machine Learning for  (Psycho-)Linguistics

[Diagram: a generic machine learning architecture. A Learning Component searches a space of hypotheses (Ri, Rj, Rk, Rl) guided by Experience and by its BIAS; the resulting Performance Component maps Input to Output.]

Page 4: Machine Learning for  (Psycho-)Linguistics

Problems with Probabilities

• Explanation
  – Also applies to neural networks
• Event relevance
  – Especially in unsupervised learning (clustering)
• Incorporation of linguistic knowledge
• Smoothing zero-frequency events

Page 5: Machine Learning for  (Psycho-)Linguistics

(Symbolic) Machine Learning

• Rule induction (understandable induced theories)
• Inductive Logic Programming (incorporating linguistic knowledge)
• Memory-based learning (similarity-based smoothing of sparse data, feature weighting)
• …

Page 6: Machine Learning for  (Psycho-)Linguistics

Common Fallacies

• Rules = nativism (and connections = empiricism)

• Generalization = abstraction (and memory = table-lookup)

Page 7: Machine Learning for  (Psycho-)Linguistics

Rule-Based ≠ Innate

• Rules can be induced from primary linguistic data as well
• Applications in Linguistics
  – Evaluation and comparison of linguistic hypotheses
  – Discovery of linguistic generalizations and categories

Page 8: Machine Learning for  (Psycho-)Linguistics

Allomorphy in the Dutch Diminutive

• “one of the more spectacular phenomena of modern Dutch morphophonemics” (Trommelen, 1983)
• Base form of noun + [tje] (5 variants)
• Linguistic theory (from Te Winkel 1862)
  – Rime of the last syllable, stress, morphological structure, …
  – Trommelen 1983: a local phenomenon; stress and morphological structure do not play a role
• CELEX data (3900 nouns)
  - b i = - z @ = + m A nt je

Page 9: Machine Learning for  (Psycho-)Linguistics

Allomorphs

-tje    kikker-tje   1896   48.0   50.9
-etje   roman-etje    395   10.0   10.9
-pje    lichaam-pje   104    2.6    4.0
-kje    koning-kje     77    1.9    3.8
-je     wereld-je    1478   37.4   30.4

Page 10: Machine Learning for  (Psycho-)Linguistics

Decision Tree Learning

• Given a data set, construct a decision tree that reflects the structure of the domain
• A decision tree is a tree where
  – non-leaf nodes represent features (tests)
  – branches leading out of a test represent possible values for the feature
  – leaf nodes represent outcomes (classes)
• A decision tree can be translated into a set of IF-THEN rules (with further optimization)
• Value grouping

Page 11: Machine Learning for  (Psycho-)Linguistics

Decision Tree Construction

Given a set of examples T:
– If T contains one or more cases all belonging to the same class C, then the decision tree for T is a leaf node with category C.
– If T contains different classes, then
  • Choose a feature, and partition T into subsets that have the same value for the feature chosen. The decision tree consists of a node containing the feature name, and a branch for each value leading to a subset.
  • Apply the procedure recursively to the subsets created this way.
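This recursive procedure can be sketched in a few lines of Python. This is a minimal sketch, not the learner used in the slides: the feature-selection step is simplified to taking the first available feature, whereas real decision-tree learners choose by a heuristic such as information gain.

```python
from collections import Counter

def build_tree(examples, features):
    """examples: list of (feature_dict, cls); features: list of feature names."""
    classes = [cls for _, cls in examples]
    # Base case: all examples belong to one class C -> a leaf node with C.
    if len(set(classes)) == 1:
        return classes[0]
    # No features left to test: fall back to the majority class.
    if not features:
        return Counter(classes).most_common(1)[0][0]
    # Choose a feature (simplified: the first one) and partition the
    # examples into subsets that share a value for that feature.
    feat, rest = features[0], features[1:]
    tree = {feat: {}}
    for v in {fd[feat] for fd, _ in examples}:
        subset = [(fd, c) for fd, c in examples if fd[feat] == v]
        # Apply the procedure recursively to each subset.
        tree[feat][v] = build_tree(subset, rest)
    return tree

def classify(tree, fd):
    # Walk from the root test down a branch per feature value to a leaf.
    while isinstance(tree, dict):
        feat = next(iter(tree))
        tree = tree[feat][fd[feat]]
    return tree
```

On a toy diminutive-style data set, `build_tree` splits on the coda first and only consults the nucleus where the coda is ambiguous.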

Page 12: Machine Learning for  (Psycho-)Linguistics

Induced Rule Set

Default class is -tje.
1. IF coda last is /lm/ or /rm/ THEN -pje
2. IF nucleus last is [+bimoraic] AND coda last is /m/ THEN -pje
3. IF coda last is /N/ THEN
     IF nucleus penultimate is empty or schwa THEN -etje ELSE -kje
4. IF nucleus last is [+short] AND coda last is [+nas] or [+liq] THEN -etje
5. IF coda last is [+obstruent] THEN -je
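Read as an ordered decision list, the induced rule set could be sketched as follows. The feature encoding (strings such as "lm", "bimoraic", "N" for the velar nasal) is an illustrative assumption, not the CELEX representation used in the experiments.

```python
def diminutive_suffix(coda_last, nucleus_last, nucleus_penult):
    """Decision-list reading of the induced rules (illustrative encoding)."""
    if coda_last in ("lm", "rm"):                          # rule 1
        return "-pje"
    if nucleus_last == "bimoraic" and coda_last == "m":    # rule 2
        return "-pje"
    if coda_last == "N":                                   # rule 3
        return "-etje" if nucleus_penult in ("", "@") else "-kje"
    if nucleus_last == "short" and coda_last in ("nas", "liq"):  # rule 4
        return "-etje"
    if coda_last == "obstruent":                           # rule 5
        return "-je"
    return "-tje"                                          # default class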

Page 13: Machine Learning for  (Psycho-)Linguistics

Results

• The problem is almost perfectly learnable (98.4%)
• More than the last syllable is needed for a full solution
• Only the rime of the last syllable (not stress or onset) is relevant
• Induced categories
  – Nasals, liquids, obstruents, short vowels, bimoraic vowels (vowels, diphthongs, schwa)
  – Task-dependent categories? Category formation is dependent on the task to be learned; it is not absolute and not language-independent

Page 14: Machine Learning for  (Psycho-)Linguistics

Conclusions: Rule Induction in Linguistics

• Falsify existing linguistic theories
• Evaluate the role of linguistic information sources
• (Re)discover interesting linguistic rules (= supervised learning)
• (Re)discover interesting linguistic categories (= unsupervised learning)
• An empiricist alternative to (mostly nativist) rule-based systems

Page 15: Machine Learning for  (Psycho-)Linguistics

There is one small problem …

• The current methodology for comparative machine learning experiments is not reliable (especially with small data)
  – Different runs of the algorithm provide different resulting rule sets
  – The algorithm can be tweaked to get high performance with any information source combination
  – The algorithm is highly sensitive to training data, feature selection, algorithm parameter settings, …
• Only to be used as a heuristic
  – As with your own rule induction module

Page 16: Machine Learning for  (Psycho-)Linguistics

Word Sense Disambiguation (do)

Similar: experience, material, say, then, …

                          Local Context   + keywords
Default                        49.0          47.9
Optimized parameters LC        60.8          59.5
Optimized parameters           60.8          61.0

Page 17: Machine Learning for  (Psycho-)Linguistics

Generalisation ≠ Abstraction

                  + generalisation               - generalisation
+ abstraction     Rule Induction,                Handcrafting
                  Connectionism,                 … (fill in your most
                  Inductive Logic Programming,   hated linguist here)
                  Statistics
- abstraction     Memory-Based Learning          Table Lookup

Page 18: Machine Learning for  (Psycho-)Linguistics

This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952, p.43)

MBL: Use memory traces of experiences as a basis for analogical reasoning, rather than using rules or other abstractions extracted from experience and replacing the experiences.

Page 19: Machine Learning for  (Psycho-)Linguistics

[Diagram: training instances plotted in a two-feature space (coda of last syllable × nucleus of last syllable), with -etje and -kje regions. Rule induction abstracts a decision boundary separating the two classes.]

Page 20: Machine Learning for  (Psycho-)Linguistics

[Diagram: the same instance space (coda of last syllable × nucleus of last syllable). MBL keeps all instances in memory; a new item “?” is classified by its nearest neighbours rather than by an induced boundary.]

Page 21: Machine Learning for  (Psycho-)Linguistics

Memory-Based Learning

• Basis: the k nearest neighbour algorithm:
  – store all examples in memory
  – to classify a new instance X, look up the k examples in memory with the smallest distance D(X,Y) to X
  – let each nearest neighbour vote with its class
  – classify instance X with the class that has the most votes in the nearest neighbour set
• Choices:
  – similarity metric
  – number of nearest neighbours (k)
  – voting weights
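The basic algorithm fits in a few lines. This is a minimal sketch (overlap distance over symbolic features, simple majority voting), not the full memory-based learner used in the case studies:

```python
from collections import Counter

def overlap_distance(x, y):
    # Count mismatching symbolic feature values (the overlap metric).
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def knn_classify(memory, x, k=1):
    """memory: list of (feature_tuple, cls); x: feature tuple to classify."""
    # Look up the k stored examples with the smallest distance to x ...
    neighbours = sorted(memory, key=lambda ex: overlap_distance(ex[0], x))[:k]
    # ... and let each nearest neighbour vote with its class.
    votes = Counter(cls for _, cls in neighbours)
    return votes.most_common(1)[0][0]
```

Note that “learning” here is nothing more than storing `memory`; all the work happens at classification time.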

Page 22: Machine Learning for  (Psycho-)Linguistics

Metrics

Weighted overlap distance between instances X and Y:

  Δ(X,Y) = Σ_{i=1}^{n} w(f_i) δ(x_i, y_i)

For numeric features (scaled difference):

  δ(x_i, y_i) = |x_i − y_i| / (max_i − min_i)

For symbolic features (overlap):

  δ(x_i, y_i) = 0 if x_i = y_i, else 1

Information-gain feature weighting:

  w(f) = −Σ_C P(C) log2 P(C) − Σ_{v∈V_f} P(v) (−Σ_C P(C|v) log2 P(C|v))

Modified Value Difference Metric between feature values v1 and v2:

  δ(v1, v2) = Σ_C |P(C|v1) − P(C|v2)|
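The information-gain feature weight w(f) can be computed directly from class counts. A minimal sketch, assuming symbolic feature values:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # H(C) = -sum_C P(C) log2 P(C)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels):
    """w(f) = H(C) - sum_{v in V_f} P(v) H(C | f = v)."""
    total = len(labels)
    by_value = defaultdict(list)
    for v, c in zip(values, labels):
        by_value[v].append(c)
    # Expected class entropy after observing the value of feature f.
    remainder = sum((len(subset) / total) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

A feature whose values fully determine the class gets weight H(C); an uninformative feature gets weight 0, so it contributes nothing to the distance Δ(X,Y).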

Page 23: Machine Learning for  (Psycho-)Linguistics

Metrics (2)

Voting options:

• Equal weight for each nearest neighbour
• Distance-weighted voting
  – Inverse distance 1/D(X,Y) (Wettschereck, 1994)
  – RBF-style Gaussian voting function (Shepard, 1987)
  – Linear voting function (Dudani, 1976)

(NB: the weighted NN class distribution can be used as a conditional probability)
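For example, inverse-distance voting can be sketched as follows; the epsilon guard against exact matches (D = 0) is my addition, not something specified in the slides:

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """neighbours: list of (distance, cls) pairs for the k nearest neighbours."""
    weights = defaultdict(float)
    for d, cls in neighbours:
        # Inverse-distance weight 1/D; epsilon avoids division by zero
        # when an identical exemplar is found in memory.
        weights[cls] += 1.0 / (d + 1e-9)
    return max(weights, key=weights.get)
```

Normalizing the per-class weights to sum to one yields the conditional class distribution mentioned in the NB above.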

Page 24: Machine Learning for  (Psycho-)Linguistics

MBL Acquisition

• An inflectional process is represented by a set of exemplars in memory
  – Exemplars act as models
  – Learning is incremental storage of exemplars
  – Compression and metrics
• An exemplar consists of a set of (mostly symbolic) features

Page 25: Machine Learning for  (Psycho-)Linguistics

MBL Processing

• New instances of a performance process are solved through
  – Memory lookup
  – Analogical (similarity-based) reasoning
• Similarity metric
  – Language (faculty) independent
  – Adaptive (feature and exemplar weighting)

Page 26: Machine Learning for  (Psycho-)Linguistics

The properties of language processing tasks …

• Language processing tasks are mappings between linguistic representation levels that are
  – context-sensitive (but mostly local!)
  – complex (sub/ir/regularity), with pockets of exceptions
• Similar representations at one linguistic level correspond to similar representations at the other level
• Several information sources interact in (often) unpredictable ways at the same level
• Data is sparse

Page 27: Machine Learning for  (Psycho-)Linguistics

… fit the bias of MBL

• Inference is based on Similarity-Based / Analogical Reasoning

• Adaptive data fusion / relevance assignment is available through feature weighting

• It is a non-parametric approach

• Similarity-based smoothing is implicit

• Regularities and subregularities / exceptions can be modeled uniformly

Page 28: Machine Learning for  (Psycho-)Linguistics

German and Dutch plurals

Page 29: Machine Learning for  (Psycho-)Linguistics

Data & Representation

• Symbolic features
  – segmental information (syllable structure)
  – stress
  – gender
• German plural (~25,000 nouns from CELEX)
  Vorlesung (lecture): l e - z U N F en
  Classes: e, (e)n, s, er, -, U-, Uer, Ue
• Dutch plural (~62,000 nouns from CELEX)
  ontruiming (evacuation): 0 - O nt 1 r L - 0 m I N en
  Classes: (e)n, s, (-eren, -i, -a, …)

Page 30: Machine Learning for  (Psycho-)Linguistics

Cognitive Architectures of Inflectional Morphology

• Dual route (Pinker, Clahsen, Marcus, …)
  – Rules for regular cases
    • (over)generalization
    • default behaviour
  – Associative memory for exceptions
    • irregularization / family effects
• Single route (R&M, MacWhinney, Plunkett, Elman, …)
  – Frequency-based regularity

[Diagram: the dual-route architecture. Input features feed a pattern associator (associative memory); on memory failure, a rule component determines the suffix class.]

Page 31: Machine Learning for  (Psycho-)Linguistics

German Plural

• Notoriously complex but routinely acquired (by age 5)
• Evidence for a dual route?
  – The -s suffix is the default/regular suffix (novel words, surnames, acronyms, …)
  – The -s suffix is infrequent (the least frequent of the five most important suffixes)

Page 32: Machine Learning for  (Psycho-)Linguistics

Class   Frequency   Umlaut   Frequency   Example
(e)n    11920                            Abart
e        6656       no           4646    Abbau
                    yes          2010    Abdampf
-        4651       no           4402    Aasgeier
                    yes           249    Abwasser
er        974       no            287    Abbild
                    yes           687    Abgang
s         967                            Abonnement

Page 33: Machine Learning for  (Psycho-)Linguistics

The default status of -s

• Similar item missing          Fnöhk-s
• Surname, product name         Mann-s
• Borrowings                    Kiosk-s
• Acronyms                      BMW-s
• Lexicalized phrases           Vergissmeinnicht-s
• Onomatopoeia, truncated roots, derived nouns, …

Page 34: Machine Learning for  (Psycho-)Linguistics
Page 35: Machine Learning for  (Psycho-)Linguistics

Discussion

• Three “classes” of plurals: ((-en -)(-e -er))(s)
  – The former four suffixes seem “regular” and can be accurately learned using information from phonology and gender
  – -s is learned reasonably well, but information is lacking
• Hypothesis: more “features” are needed (syntactic, semantic, meta-linguistic, …) to enrich the “lexical similarity space”
• No difference in accuracy and speed of learning with and without Umlaut
• Overall generalization accuracy very high: 95%
• Schema-based learning (Köpcke), e.g. *,*,*,*,i,r,M → e

Page 36: Machine Learning for  (Psycho-)Linguistics
Page 37: Machine Learning for  (Psycho-)Linguistics
Page 38: Machine Learning for  (Psycho-)Linguistics

Acquisition Data: Summary of Previous Studies

• Existing nouns (Park 78; Veit 86; Mills 86; Schaner-Wolles 88; Clahsen et al. 93; Sedlak et al. 98)
  – Children mainly overapply -e or -(e)n
  – -s plurals are learned late
• Novel words (Mugdan 77; MacWhinney 78; Phillips & Bouma 80; Schöler & Kany 89)
  – Children inflect novel words with -e or -(e)n
  – More “irregular” plural forms are produced than “defaults”

Page 39: Machine Learning for  (Psycho-)Linguistics

MBL simulation

[Chart: overgeneralization during learning, per suffix: -en, -e, -, -s, -er]

• The model overapplies mainly -en and -e
• -s is learned late and imperfectly
• Mainly but not completely parallel to input frequency (more -s overgeneralization than -er overgeneralization)

Page 40: Machine Learning for  (Psycho-)Linguistics

Bartke, Marcus, Clahsen (1995)

[Bar chart: percentage of -en versus -s responses for roots, in rhyme versus non-rhyme conditions; y-axis 0–70]

• 37 children, age 3;6 to 6;6
• Pictures of imaginary things, presented as neologisms
  – names or roots
  – rhymes of existing words or not
  – choice between -en and -s
• Results:
  – children are aware that unusual-sounding words require the default
  – children are aware that names require the default

Page 41: Machine Learning for  (Psycho-)Linguistics

MBL simulation

[Bar chart: overgeneralization to -en versus -s for all words and for non-rhyming words; y-axis 0–25]

• Sort the CELEX data according to rhyme
• Compare overgeneralization
  – to -en versus to -s
  – as a percentage of the total number of errors
• Results:
  – when new words don’t rhyme, more errors are made
  – overgeneralization to -en drops below the level of overgeneralization to -s

Page 42: Machine Learning for  (Psycho-)Linguistics

Dutch Plural

• Suffixes -en and -s are both defaults, and are in complementary distribution
• Selection of -en or -s is governed by
  – the phonological structure of the base noun (stressed vs. unstressed last syllable)
  – morphological structure (suffix of the base noun)
  – loan word status
  – the semantic feature person vs. thing
  – both are possible after //
(Baayen et al. 2001)

Page 43: Machine Learning for  (Psycho-)Linguistics

Feature Relevance

[Bar chart: Gain Ratio (0–12) for the features stress, onset, nucleus, and coda of the last three syllables (stress-2, onset-2, nucleus-2, coda-2, stress-1, onset-1, nucleus-1, coda-1, stress, onset, nucleus, coda)]

Page 44: Machine Learning for  (Psycho-)Linguistics

Accuracy on CELEX

• Methodology
  – “Leave-one-out”
• Results:
  – MBL: 94.9% accuracy

           Prec   Rec    F
    -(e)n  95.8   97.2   96.4
    -s     93.8   91.4   92.6
    -i     82.0   77.2   79.5

  – without stress: 94.9% accuracy
  – last syllable with stress: 92.6% accuracy
  – last syllable without stress: 92.4% accuracy
  – rhyme of last syllable: 89.6% accuracy
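Leave-one-out evaluation is especially natural for a memory-based learner: since “training” is just storage, no retraining is needed, and each exemplar can simply be classified against all the others. A minimal sketch (1-NN with the overlap distance; toy data, not CELEX):

```python
def nn_classify(memory, x):
    # 1-nearest neighbour with the overlap (mismatch-count) distance.
    return min(memory, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))[1]

def leave_one_out_accuracy(memory):
    # Hold out each exemplar in turn and classify it against the rest.
    correct = sum(
        nn_classify(memory[:i] + memory[i + 1:], x) == cls
        for i, (x, cls) in enumerate(memory))
    return correct / len(memory)
```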

Page 45: Machine Learning for  (Psycho-)Linguistics

Accuracy on pseudo-words

• Methodology
  – Train: CELEX (all) and CELEX (1000 most frequent types)
  – Test: 8 × 10 pseudo-words (Baayen et al., 2001)
    dreip - workel - bastus - bestroeting - kloertje - stape - stree - kadisme
• Results: accuracy = number of decisions equal to the subject majority for each item
  – Subjects: 87.5%
  – MBL (all): 83.8%
  – MBL (top 1000): 90.0%

Page 46: Machine Learning for  (Psycho-)Linguistics

Word (majority)     Measure            -s    -en
dreip (-en)         subjects            4     96
                    mblp-decisions      0    100
                    mblp-support        7     93
workel (-s)         subjects           98      2
                    mblp-decisions    100      0
                    mblp-support      100      0
bastus (-en)        subjects            0    100
                    mblp-decisions      0     90
                    mblp-support        0    100
bestroeting (-en)   subjects            1     99
                    mblp-decisions      0    100
                    mblp-support        0    100
kloertje (-s)       subjects          100      0
                    mblp-decisions    100      0
                    mblp-support       81     19
stape (?)           subjects           30     70
                    mblp-decisions     90     10
                    mblp-support       86     14
stree (?)           subjects           30     70
                    mblp-decisions     80     20
                    mblp-support       31     69
kadisme (?)         subjects           25     75
                    mblp-decisions      0    100
                    mblp-support        6     94

Margin notes on the slide: “muidus, muidinn: modus, modi”; “CELEX bias”; “low frequency and loan word nearest neighbours”.

Page 47: Machine Learning for  (Psycho-)Linguistics

Conclusions: Memory-Based Single Route

• MBLP picks up the main “schemata” of Dutch and German plural formation, and their exceptions, without recourse to explicit rules or a dual-route architecture
• MBLP trained on (part of) CELEX matches subject behaviour on pseudo-words and on acquisition data
• Segmental information suffices to reliably predict the plural in Dutch and most plurals in German; additional information is needed for German -s
• Heterogeneity and density in the lexical exemplar space as a source of behaviour predictions

Page 48: Machine Learning for  (Psycho-)Linguistics

Overall Conclusions

• Advantages of symbolic machine learning methods over ‘pure statistics’:
  – As a methodology for inducing interpretable linguistic generalizations and categories
  – As a way of introducing an operationalisation of analogy-based methods into (psycho)linguistics