LING 696B: Gradient phonotactics and well-formedness


Page 2: Vote on remaining topics

Topics that have been fixed:

Morpho-phonological learning (Emily) + (LouAnn’s lecture) + Bayesian learning

Rule induction (Mans) + decision tree

Learning and self-organization (Andy’s lecture)

Page 3: Voting on remaining topics

Select 2-3 from the following (need a ranking):

OT and Stochastic OT

Alternatives to OT: random fields/maximum entropy

Minimal Description Length word chopping

Feature-based lexical access

Page 5: Well-formedness of words (following Mike’s talk)

A word “sounds like English” if:

It is a close neighbor of some words that sound really English, e.g. “pand” is a neighbor of sand, band, pad, pan, …

It agrees with what English grammar says an English word should look like, e.g. gradient phonotactics says blick > bnick

Today: relate these two ideas to the non-parametric and parametric perspectives

Page 8: Many ways of calculating probability of a sequence

Unigrams, bigrams, trigrams, syllable parts, transition probabilities, …

No bound on the number of creative ways

What does it mean to say the “probability” of a phonological word?

Objective/frequentist vs. subjective/Bayesian: philosophical (but important)

Thinking “parametrically” may clarify things

“Likelihood” = “probability” calculated from a model

Page 10: Parametric approach to phonotactics

Example: “bag of sounds” assumption / exchangeable distributions: p(blik) = p(lbik) = p(kbli)

Unigram model: N − 1 parameters {p(w)}

B L I K

What is θ? How do we get the estimate θ̂? How do we assign a probability to “blick”?
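To make this concrete, here is a minimal sketch (not from the original slides) of maximum-likelihood estimation for the bag-of-sounds unigram model and of scoring a new word; the toy lexicon is invented for illustration.

```python
from collections import Counter

# Toy lexicon (invented); in practice this would be a dictionary of English words.
lexicon = ["blik", "sand", "band", "pan", "pad", "kit"]

# Maximum-likelihood estimate of the unigram parameters:
# p(w) = count(w) / total count, one probability per segment type (N - 1 free parameters).
counts = Counter(seg for word in lexicon for seg in word)
total = sum(counts.values())
p = {seg: c / total for seg, c in counts.items()}

def unigram_prob(word):
    """Probability of a word under the bag-of-sounds model: product of segment probabilities."""
    prob = 1.0
    for seg in word:
        prob *= p.get(seg, 0.0)   # unseen segments get probability 0 (the sparsity problem)
    return prob

print(unigram_prob("blik"))   # same value as unigram_prob("lbik"): order is ignored
```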

Page 11: Parametric approach to phonotactics

Unigram model with overlapping observations: N² − 1 parameters

B L I K

Note: input is #B BL LI IK K#

What is θ? How do we get the estimate θ̂? How do we assign a probability to “blick”?
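A minimal sketch of this variant, assuming the “#” boundary symbol from the slide: each overlapping pair is treated as an atomic observation drawn from a single distribution over pair types (contrast the conditional bigram model a few slides below); the toy lexicon is invented.

```python
from collections import Counter

def pair_observations(word, boundary="#"):
    """Turn a word into overlapping two-segment observations, e.g. 'blik' -> ['#b','bl','li','ik','k#']."""
    s = boundary + word + boundary
    return [s[i:i + 2] for i in range(len(s) - 1)]

lexicon = ["blik", "sand", "band"]          # toy data, invented for the example
counts = Counter(obs for w in lexicon for obs in pair_observations(w))
total = sum(counts.values())
p_pair = {obs: c / total for obs, c in counts.items()}   # one distribution over pair types

def word_prob(word):
    # Each overlapping pair is treated as an independent draw from the pair distribution.
    prob = 1.0
    for obs in pair_observations(word):
        prob *= p_pair.get(obs, 0.0)
    return prob

print(word_prob("blik"))
```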

Page 12: Parametric approach to phonotactics

Unigram with annotated observations (Coleman and Pierrehumbert)

BL = onset of a strong initial/final syllable (“osif”); IK = rhyme of a strong initial/final syllable (“rsif”)

Input: segment annotated with a syllable parse
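A rough sketch in the spirit of this model (not Coleman and Pierrehumbert’s exact formulation): constituent probabilities are estimated per annotation label and multiplied; the parse annotations and toy counts are invented.

```python
from collections import Counter, defaultdict

# Toy parsed lexicon: each word is a list of (constituent-label, segment-string) pairs.
# Labels like "osif" (onset of strong initial/final syllable) follow the slide; data are invented.
parsed_lexicon = [
    [("osif", "bl"), ("rsif", "ik")],
    [("osif", "s"),  ("rsif", "and")],
    [("osif", "b"),  ("rsif", "and")],
]

# Estimate p(segments | label) by relative frequency within each annotation label.
counts = defaultdict(Counter)
for word in parsed_lexicon:
    for label, segs in word:
        counts[label][segs] += 1

def constituent_prob(label, segs):
    total = sum(counts[label].values())
    return counts[label][segs] / total if total else 0.0

def word_prob(parsed_word):
    # Probability of the word = product over its annotated constituents.
    prob = 1.0
    for label, segs in parsed_word:
        prob *= constituent_prob(label, segs)
    return prob

print(word_prob([("osif", "bl"), ("rsif", "ik")]))
```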

Page 13: Parametric approach to phonotactics

Bigram model: N(N − 1) parameters {p(wₙ | wₙ₋₁)} (how many for a trigram model?)

B L I K

Input: segment sequence
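A minimal sketch of bigram estimation and scoring, assuming “#” word boundaries as above; here each segment is conditioned on the preceding one rather than pairs being treated as atomic units. The toy lexicon is invented.

```python
from collections import Counter, defaultdict

lexicon = ["blik", "sand", "band"]     # toy lexicon, invented for the example
BOUND = "#"

# Count bigram transitions, including word boundaries.
bigram = defaultdict(Counter)
for w in lexicon:
    s = BOUND + w + BOUND
    for prev, cur in zip(s, s[1:]):
        bigram[prev][cur] += 1

def cond_prob(cur, prev):
    """Maximum-likelihood estimate of p(w_n | w_{n-1})."""
    total = sum(bigram[prev].values())
    return bigram[prev][cur] / total if total else 0.0

def word_prob(word):
    s = BOUND + word + BOUND
    prob = 1.0
    for prev, cur in zip(s, s[1:]):
        prob *= cond_prob(cur, prev)
    return prob

print(word_prob("blik"), word_prob("bnik"))   # "bnik" scores lower: the b-n transition is unattested
```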

Page 15: Ways that theory might help calculate probability

Probability calculation must be based on an explicit model

Need a story about what sequences are

How can phonology help with calculating sequence probability?

More delicate representations

More complex models

But: phonology is not quite about what sequences are …

Page 16: More delicate representations

Would CV phonology help? Autosegmental tiers, features, gestures?

The chains are no longer independent: more sophisticated models are needed

Limit: a generative model of speech production (very hard)

B L I K I T

Page 17: More complex models

Mixture of unigrams: used in document classification

(Figure: a latent “lexical strata” variable selects a unigram component, which generates the segments B L I K)
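A minimal sketch of scoring under a mixture of unigrams: p(word) = Σₕ p(h) Πᵢ p(segmentᵢ | h). The two strata and all parameter values are invented; in practice they would be estimated (e.g. by EM).

```python
# A mixture of unigrams: p(word) = sum over strata h of p(h) * prod_i p(segment_i | h).
# The two "strata" and their parameters are invented for illustration.

strata = {
    "native":  {"weight": 0.7, "p": {"b": 0.2, "l": 0.2, "i": 0.3, "k": 0.3}},
    "foreign": {"weight": 0.3, "p": {"b": 0.1, "l": 0.1, "i": 0.4, "k": 0.4}},
}

def mixture_prob(word):
    total = 0.0
    for h in strata.values():
        component = 1.0
        for seg in word:
            component *= h["p"].get(seg, 0.0)   # unigram probability within the stratum
        total += h["weight"] * component        # weight by the prior on the stratum
    return total

print(mixture_prob("blik"))
```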

Page 18: More complex models

More structure in the Markov chain

Can also model the length distribution with the so-called semi-Markov models

(Figure: states “onset”, “rhyme V”, “rhyme VC” over the chunks BL and IK)

Page 19: More complex models

Probabilistic context-free grammar:

Syllable --> C + VC (0.6)
Syllable --> C + V (0.35)
Syllable --> C + C (0.05)
C --> _ (0.01)
C --> b (0.05)
…

See 439/539
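A toy sketch of how a PCFG assigns probability to a derivation as the product of its rule probabilities; the Syllable rules and “C --> b” come from the slide, while the VC expansion and its probability are invented.

```python
# Toy PCFG in the spirit of the slide: each rule has a probability, and the probability
# of a derivation is the product of the probabilities of the rules it uses.

rules = {
    ("Syllable", ("C", "VC")): 0.60,
    ("Syllable", ("C", "V")):  0.35,
    ("Syllable", ("C", "C")):  0.05,
    ("C", ("b",)):             0.05,
    ("VC", ("i", "k")):        0.02,   # invented for illustration
}

def derivation_prob(derivation):
    """Probability of a derivation given as a list of (lhs, rhs) rule applications."""
    prob = 1.0
    for rule in derivation:
        prob *= rules.get(rule, 0.0)
    return prob

# "bik" as Syllable -> C + VC, C -> b, VC -> i k
print(derivation_prob([("Syllable", ("C", "VC")), ("C", ("b",)), ("VC", ("i", "k"))]))
```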

Page 20: What’s the benefit of doing more sophisticated things?

Recall: maximum likelihood needs more data to produce a better estimate

Data sparsity problem: training data are often insufficient for estimating all the parameters, e.g. zero counts

Lexicon size: we don’t have infinitely many words from which to estimate phonotactics

Smoothing: properly done, it has a Bayesian interpretation (though often it is not)
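As one common illustration (not specifically endorsed by the slides), add-λ smoothing of the unigram counts removes zero probabilities and corresponds to the posterior mean under a symmetric Dirichlet(λ) prior; the λ value and toy data are arbitrary.

```python
from collections import Counter

lexicon = ["blik", "sand", "band"]                    # toy data, invented for the example
inventory = sorted(set("".join(lexicon)) | {"t"})     # "t" has a zero count in this toy lexicon

counts = Counter(seg for w in lexicon for seg in w)
total = sum(counts.values())
lam = 0.5                                             # smoothing constant, arbitrary here

def smoothed_prob(seg):
    """Add-lambda smoothed unigram estimate: (count + lambda) / (total + lambda * N).
    This is the posterior mean under a symmetric Dirichlet(lambda) prior."""
    return (counts[seg] + lam) / (total + lam * len(inventory))

print(smoothed_prob("b"), smoothed_prob("t"))         # "t" no longer gets probability zero
```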

Page 21: Probability and well-formedness

Generative modeling: characterize a distribution over strings

Why should we care about this distribution?

Hope: this may have something to do with grammaticality judgements

But: judgements are also affected by what other words “sound like” (the puzzle of mrupect/mrupation)

It may be easier to model a function with input = string, output = judgements

Page 23: Bailey and Hahn

Tried all kinds of ways of calculating phonotactics and neighborhood density, and checked which combination “works the best”

Typical reasoning: “metrics X and Y as factors explain 15% of the variance”

Methodology: ANOVA

Model (1-way): data = overall mean + effect + error

What can ANOVA do for us? How do we check whether ANOVA makes sense? What is the “explained variance”?
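A minimal sketch of the one-way decomposition behind “explained variance”: the total sum of squares splits into a between-group (effect) part and a within-group (error) part, and the explained proportion is SS_between / SS_total. The two groups and their ratings are invented.

```python
# One-way ANOVA decomposition: SS_total = SS_between + SS_within.

groups = {
    "high_prob_words": [5.8, 6.1, 5.9, 6.3],   # e.g. acceptability ratings, invented
    "low_prob_words":  [3.2, 3.9, 3.5, 3.4],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups.values())
ss_within = ss_total - ss_between

explained = ss_between / ss_total   # proportion of variance explained by the group factor
print(f"explained variance = {explained:.2f}")
```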

Page 24: Non-parametric approach to similarity neighborhood

A hint from B&H: in the neighborhood model, dᵢⱼ is a weighted edit distance, and A, B, C, D are estimated by polynomial regression

Recall: radial basis functions F(x) = Σᵢ aᵢ K(x, xᵢ), with K(x, xᵢ) = exp(−d(x, xᵢ))

The quadratic weighting is ad hoc; one should just do general nonlinear regression with RBFs
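A minimal sketch of this kind of regression over strings, assuming a plain (unweighted) Levenshtein distance and a small ridge term as one way to fit the coefficients aᵢ; the training words and ratings are invented.

```python
import math
import numpy as np

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def kernel(x, y):
    """Edit-distance kernel K(x, y) = exp(-edit(x, y))."""
    return math.exp(-edit_distance(x, y))

train_words = ["blick", "bnick", "sand", "mrupation"]   # toy stimuli, invented
ratings = np.array([6.0, 2.5, 6.5, 4.0])                # toy wordlikeness ratings, invented

# Fit the RBF coefficients by solving (K + lambda*I) a = y (a ridge term for stability).
K = np.array([[kernel(x, y) for y in train_words] for x in train_words])
alpha = np.linalg.solve(K + 0.1 * np.eye(len(train_words)), ratings)

def predict(word):
    """F(word) = sum_i a_i * K(word, x_i): a 'soft neighborhood' average of training ratings."""
    return sum(a * kernel(word, x) for a, x in zip(alpha, train_words))

print(predict("plick"))
```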

Page 25: Non-parametric approach to similarity neighborhood

Recall: RBF as a “soft” neighborhood model

Now think of strings also as data points, with neighborhood defined by some string distance (e.g. edit distance)

Same kind of regression with RBF

Page 26: Non-parametric approach to similarity neighborhood

Key technical point: choosing the right kernel

Edit-distance kernel: K(x, xᵢ) = exp(−edit(x, xᵢ))

Sub-string kernel: measuring the length of common sub-sequences (mrupation)

Key experimental data: controlled stimuli, split into training and test sets (equal phonotactic probability)

No need to transform the rating scale
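One simple way to realize a sub-string kernel (a simplification of the gappy string kernels in the literature, not necessarily the one intended here) is to score a pair of words by the length of their longest common subsequence; the comparison words below are chosen only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def substring_kernel(x, y):
    # Longer shared (possibly non-contiguous) material -> higher similarity.
    return lcs_length(x, y)

# "mrupation" shares a long subsequence with "corruption" but almost nothing with "blick";
# this captures a kind of similarity that whole-word edit distance can underplay.
print(substring_kernel("mrupation", "corruption"), substring_kernel("mrupation", "blick"))
```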

Page 27: Non-parametric approach to similarity neighborhood

A range of questions opens up with the non-parametric perspective:

Would a yes/no task lead to word “anchors”, like support vectors?

Would the new words interact with each other, as seen in transductive inference?

What type of metric is most appropriate for inferring well-formedness from neighborhoods?

Page 28: Integration

Hard to integrate with a probabilistic (parametric) model: neighborhood density has a strong non-parametric character -- it grows with the data

Possible to integrate phonotactic probability in a non-parametric model: kernel algebra

aK₁(x, y) + bK₂(x, y) and K₁(x, y)·K₂(x, y) are also kernels (for a, b ≥ 0)

Probability kernel: K(x₁, x₂) = Σₕ p(x₁|h) p(x₂|h) p(h), where p comes from a parametric model
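A self-contained sketch of the kernel algebra above: an edit-distance kernel is combined with a probability kernel built from a toy parametric (mixture-of-unigrams) model; all weights and parameter values are invented.

```python
import math

# Kernel algebra: if K1 and K2 are kernels, so are a*K1 + b*K2 (a, b >= 0) and K1*K2.
# Here K1 is a neighborhood-style kernel and K2 a probability kernel from a parametric model.

def edit_distance(a, b):
    # Levenshtein distance (same as in the earlier sketch).
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def k_edit(x, y):
    return math.exp(-edit_distance(x, y))

# Toy parametric model: latent strata h with prior p(h) and a unigram distribution p(segment | h).
strata = [
    {"p_h": 0.7, "p": {"b": 0.2, "l": 0.2, "i": 0.3, "k": 0.3}},
    {"p_h": 0.3, "p": {"b": 0.1, "l": 0.1, "i": 0.4, "k": 0.4}},
]

def p_word_given_h(word, h):
    prob = 1.0
    for seg in word:
        prob *= h["p"].get(seg, 0.0)
    return prob

def k_prob(x1, x2):
    """Probability kernel: K(x1, x2) = sum_h p(x1 | h) p(x2 | h) p(h)."""
    return sum(h["p_h"] * p_word_given_h(x1, h) * p_word_given_h(x2, h) for h in strata)

def k_combined(x, y, a=1.0, b=1.0):
    """a*K1 + b*K2 is again a kernel for a, b >= 0."""
    return a * k_edit(x, y) + b * k_prob(x, y)

print(k_combined("blik", "blik"), k_combined("blik", "klib"))
```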