Probabilistic models of language acquisition Klinton Bicknell & Roger Levy ESSLLI 26 University of Tübingen 22 August 2014



Page 1

Probabilistic models of language acquisition

Klinton Bicknell & Roger Levy

ESSLLI 26, University of Tübingen

22 August 2014

Page 2

Situating language acquisition

day 1: sound categorization

[Figure 5.2: Likelihood functions for /b/–/p/ phoneme categorization, with μb = 0, μp = 50, σb = σp = 12. For the input x = 27, the likelihoods favor /p/. Axes: VOT vs. probability density.]

[Figure 5.3: Posterior probability of /b/ for Bayesian phoneme discrimination as a function of VOT.]

the conditional distributions over acoustic representations, Pb(x) and Pp(x) for /b/ and /p/ respectively (the likelihood functions), and the prior distribution over /b/ versus /p/. We further simplify the problem by characterizing any acoustic representation x as a single real-valued number representing the VOT, and the likelihood functions for /b/ and /p/ as normal density functions (Section 2.10) with means μb, μp and standard deviations σb, σp respectively.

Figure 5.2 illustrates the likelihood functions for the choices μb = 0, μp = 50, σb = σp = 12. Intuitively, the phoneme that is more likely to be realized with a VOT in the vicinity of a given input is a better choice for the input, and the greater the discrepancy in the likelihoods, the stronger the categorization preference. An input with non-negligible likelihood for each phoneme is close to the "categorization boundary", but may still have a preference. These intuitions are formally realized in Bayes' Rule:

$$P(\text{/b/} \mid x) = \frac{P(x \mid \text{/b/})\,P(\text{/b/})}{P(x)} \qquad (5.10)$$

and since we are considering only two alternatives, the marginal likelihood is simply the weighted sum of the likelihoods under the two phonemes: P(x) = P(x | /b/) P(/b/) + P(x | /p/) P(/p/). If we plug in the normal probability density function we get

$$P(\text{/b/} \mid x) = \frac{\frac{1}{\sqrt{2\pi\sigma_b^2}}\exp\!\left[-\frac{(x-\mu_b)^2}{2\sigma_b^2}\right]P(\text{/b/})}{\frac{1}{\sqrt{2\pi\sigma_b^2}}\exp\!\left[-\frac{(x-\mu_b)^2}{2\sigma_b^2}\right]P(\text{/b/}) + \frac{1}{\sqrt{2\pi\sigma_p^2}}\exp\!\left[-\frac{(x-\mu_p)^2}{2\sigma_p^2}\right]P(\text{/p/})} \qquad (5.11)$$

In the special case where σb = σp = σ we can simplify this considerably by cancelling the common factor 1/√(2πσ²) in the numerator and denominator.

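As a minimal worked example of equation (5.11): the short Python sketch below computes P(/b/ | x) for the Figure 5.2 parameters (μb = 0, μp = 50, σb = σp = 12) and a flat prior P(/b/) = 0.5. The code is illustrative and is not part of the original lecture materials.

import math

def normal_pdf(x, mu, sigma):
    # Normal density N(x; mu, sigma^2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_b(x, mu_b=0.0, mu_p=50.0, sigma_b=12.0, sigma_p=12.0, prior_b=0.5):
    # Bayes' Rule, equation (5.11): P(/b/ | x)
    num_b = normal_pdf(x, mu_b, sigma_b) * prior_b
    num_p = normal_pdf(x, mu_p, sigma_p) * (1.0 - prior_b)
    return num_b / (num_b + num_p)

print(posterior_b(27.0))  # roughly 0.33, so the input x = 27 favors /p/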


Page 4

Situating language acquisition

day 1: inferring sound category c from sound token S

c → S   (c: category; S: sound value)

• c ~ discrete choice, e.g., p(/p/) = p(/b/) = 0.5
• S | c ~ Gaussian(μc, σ²c)

Page 5

Statistical Model

day 2a: sound similarity


Page 7

Statistical Model

day 2a: inferring target production T from sound token S

c → T → S

• Choose a category c with probability p(c)
• Articulate a target production T with probability p(T | c) = N(μc, σ²c)
• Listener hears speech sound S with probability p(S | T) = N(T, σ²S)
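As a minimal sketch of this three-stage generative process, the Python snippet below forward-samples c, T, and S. The specific numbers (category means, standard deviations, and the perceptual-noise SD) are illustrative assumptions, not values given in the lecture.

import random

# Illustrative parameters (assumed, not from the lecture): per-category VOT
# means/SDs and the SD of the listener's perceptual noise.
CATEGORIES = {"/b/": (0.0, 12.0), "/p/": (50.0, 12.0)}
SIGMA_S = 5.0

def sample_sound():
    # Forward-sample the day 2a model: c -> T -> S
    c = random.choice(list(CATEGORIES))   # c ~ discrete choice, p(/b/) = p(/p/) = 0.5
    mu_c, sigma_c = CATEGORIES[c]
    t = random.gauss(mu_c, sigma_c)       # T | c ~ N(mu_c, sigma_c^2)
    s = random.gauss(t, SIGMA_S)          # S | T ~ N(T, sigma_S^2)
    return c, t, s

print(sample_sound())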

Page 8

Statistical Model

day 2b: incremental parsing

Garden path models

• The most famous garden paths: reduced relative clauses (RRCs) versus main clauses (MCs)
• From the valence + simple-constituency perspective, the MC and RRC analyses differ in two places:

The horse (that was) raced past the barn fell.


Page 10

Statistical Model

day 2b: inferring syntactic structure T from words w

T → w

• T ~ PCFG
• the words w are the leaves of the tree T
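As a minimal sketch of what 'T ~ PCFG' means, here is a toy probabilistic context-free grammar together with a sampler that draws a tree and reads the words off its leaves. The grammar and its probabilities are invented for illustration; they are not the grammar used in the lecture.

import random

# Toy PCFG (invented): each left-hand side maps to (right-hand side, probability) pairs.
# Symbols not listed as left-hand sides are terminals (words).
PCFG = {
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 0.7), (("Det", "N", "RC"), 0.3)],
    "VP":  [(("V", "NP"), 0.5), (("V",), 0.5)],
    "RC":  [(("V", "PP"), 1.0)],
    "PP":  [(("P", "NP"), 1.0)],
    "Det": [(("the",), 1.0)],
    "N":   [(("horse",), 0.5), (("barn",), 0.5)],
    "V":   [(("raced",), 0.5), (("fell",), 0.5)],
    "P":   [(("past",), 1.0)],
}

def sample_tree(symbol="S"):
    # Sample a tree T top-down from the PCFG
    if symbol not in PCFG:
        return symbol                      # terminal: a leaf word
    rhss, probs = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return (symbol, [sample_tree(s) for s in rhs])

def leaves(tree):
    # The words w are the leaves of the tree T
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

t = sample_tree()
print(" ".join(leaves(t)))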

Page 11

Statistical Model

day 3: sentence processing and eye movements in reading

The coach smiled at the player tossed the frisbee
The coach smiled at the player thrown the frisbee
The coach smiled toward the player tossed the frisbee
The coach smiled toward the player thrown the frisbee



Page 14

Statistical Model

day 3: inferring words w and trees T from perceptual input I

T → w → I

• a tree T ~ PCFG
• the words w are the leaves of the tree T
• visual input I ~ noise(w)
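To make 'I ~ noise(w)' concrete, here is one deliberately simple stand-in for the perceptual noise process, assuming independent character-level substitutions; the actual noise model used on day 3 is not specified here, so this is only an illustration.

import random
import string

def noisy_input(words, sub_prob=0.1):
    # Corrupt each character of each word independently with probability sub_prob
    noisy = []
    for word in words:
        chars = [random.choice(string.ascii_lowercase) if random.random() < sub_prob else ch
                 for ch in word]
        noisy.append("".join(chars))
    return noisy

print(noisy_input("the coach smiled at the player".split()))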

Page 15

Situating language acquisition

day 4: sentence meaning judgments


Page 17

Situating language acquisition

day 4: inferring meaning M from linguistic form F and processing pressures P

M, P → F

• F ~ function(M, P)


Page 24

Situating language acquisition

in all these cases

• step 1: identify the relevant sources of information
• step 2: make a generative model in which
  • the thing to be inferred was a latent variable
  • the relevant information was used to specify a prior for the latent variable or the likelihood of the data given that latent variable
• step 3: apply Bayesian inference (relatively easy here: we are given the prior and likelihood)


Page 26

Problems in acquisition

let's apply step 1 (identify the relevant information sources) to problems in acquisition


Page 29

Problems in acquisition

learning to segment words

[yuwanttusiðəbʊk]?

-> you want to see the book?

[e.g., Goldwater et al., 2009]


Page 31

Problems in acquisition

learning word meanings

[e.g., Frank et al., 2009]


Page 35

Problems in acquisition

learning phonotactics

[blɪk]
[mdwɨ] (cf. Polish mdły 'tasteless')

[e.g., Hayes & Wilson, 2008]


Page 40

Problems in acquisition

learning syntax

The boy is hungry. -> Is the boy hungry?
The boy who is smiling is happy. -> ???
Is the boy who is smiling happy?
*Is the boy who smiling is happy?

[e.g., Perfors et al., 2011]


Page 43

Problems in acquisition

learning the sound categories in the language from the sounds

• are long-VOT [t] and short-VOT [d] functionally different sounds, or just natural variation?

[e.g., Pajak et al., 2013]


Page 48

Problems in acquisition

steps 2 and 3: construct a generative model, perform inference

• a bit different from before
• we'll go in depth through an example of sound category learning
• note: many of my visualizations were made by Dr. Bozena Pajak (University of Rochester, Brain & Cognitive Sciences; soon to be my colleague at Northwestern University)


Page 54

Distributional information

learning sound categories

• how do we learn that [t] and [d] are different categories?
• information source 1: distributional information

[figure: distribution of sounds along the VOT dimension]


Page 59

Distributional information

experimental evidence

• we know that babies and adults use distributional information to help infer category structure [Maye & Gerken, 2000; Maye et al., 2002]
• Pajak & Levy (2011) performed an experiment replicating this with adults, which I'll describe

[figure: the length dimension from /s/ (singleton) to /ss/ (geminate); sound class: fricatives]

Page 60

Experimental data (Pajak & Levy 2011)

▪ Distributional training:
  ▪ adult English native speakers were exposed to words in a new language, where the middle consonant varied along the length dimension

  [aja] 145ms    [ina] 205ms    [ila] 115ms    [ama] 160ms

Page 61

Experimental data (Pajak & Levy 2011)

[figure: familiarization frequency (0–16) along the stimulus length continuum (100–280 msec) for bimodal vs. unimodal training; Expt 1: sonorants, Expt 2: fricatives]
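To make the training manipulation concrete, the sketch below builds bimodal and unimodal familiarization-frequency profiles over a 100–280 ms continuum. The counts are invented for illustration only; they are not the stimulus frequencies actually used by Pajak & Levy (2011).

# Illustrative counts only, NOT the actual Pajak & Levy (2011) stimulus frequencies.
continuum_ms = list(range(100, 281, 20))               # 100, 120, ..., 280 ms

bimodal_counts  = [2, 6, 10, 6, 2, 2, 6, 10, 6, 2]     # two peaks along the continuum
unimodal_counts = [1, 2, 5, 9, 12, 12, 9, 5, 2, 1]     # a single central peak

for ms, b, u in zip(continuum_ms, bimodal_counts, unimodal_counts):
    print(f"{ms:3d} ms  bimodal: {'*' * b:<10}  unimodal: {'*' * u}")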


Page 64

Experimental data (Pajak & Levy 2011)

▪ Testing:
  ▪ participants made judgments about pairs of words
    Example: [ama]–[amma] "Are these two different words in this language or two repetitions of the same word?"
  ▪ dependent measure: proportion of 'different' responses (as opposed to 'same') on 'different' trials
  ▪ if learning is successful, we expect:
    'different' responses: bimodal training > unimodal training

Page 65

Experimental data (Pajak & Levy 2011)

[figure: the same bimodal vs. unimodal familiarization distributions over the length continuum (100–280 msec), with the test stimuli marked]

Page 66

Experimental data (Pajak & Levy 2011)

[figure: proportion of 'different' responses (0.0–1.0) for Expt 1 trained and Expt 1 untrained pairs, by training condition (bimodal vs. unimodal)]


Page 70

Distributional information

distributional information is used

• adults and babies use distributional information to infer categories
• next steps: put this information into a generative model and perform inference


Page 77

A generative model

our model from before of where sounds come from

c → S   (c: category; S: sound value)

• c ~ discrete choice, e.g., p(/p/) = p(/b/) = 0.5
• S | c ~ Gaussian(μc, σ²c)

[figure: distribution of sounds along the VOT dimension]

still pretty appropriate!

Page 78

A generative model

[Figure 9.3: Graphical model for simple mixture of Gaussians. (a) Simple model; (b) Bayesian model, priors over category parameters; (c) learning category probabilities as well.]

[Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters: φ1 = 0.35, μ1 = 0, σ1 = 1, μ2 = 4, σ2 = 2.]

θ = ⟨φ, μ, Σ⟩. This model is known as a mixture of Gaussians, since the observations are drawn from some mixture of individually Gaussian distributions. An illustration of this generative model in one dimension is given in Figure 9.4, with the Gaussian mixture components drawn above the observations. At the bottom of the graph is one sample from this Gaussian mixture in which the underlying categories are distinguished; at the top of the graph is another sample in which the underlying categories are not distinguished. The relatively simple supervised problem would be to infer the means and variances of the Gaussian mixture components from the category-distinguished sample at the bottom of the graph; the much harder unsupervised problem is to infer these means and variances (and potentially the number of mixture components!) from the category-undistinguished sample at the top of the graph. The graphical model corresponding to this unsupervised learning problem is given in Figure 9.3a.

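As a minimal sketch of the Figure 9.4 setup, the following draws observations from the stated one-dimensional mixture (φ1 = 0.35, μ1 = 0, σ1 = 1, μ2 = 4, σ2 = 2); the code itself is illustrative and not from the text.

import random

# Mixture-of-Gaussians parameters from Figure 9.4
phi   = [0.35, 0.65]        # mixing probabilities (phi_2 = 1 - phi_1)
mu    = [0.0, 4.0]
sigma = [1.0, 2.0]

def sample_mixture(n):
    # Draw n observations, returning (category, y) pairs; dropping the categories
    # gives the 'category-undistinguished' sample described in the text.
    data = []
    for _ in range(n):
        i = random.choices([0, 1], weights=phi)[0]   # pick a mixture component
        y = random.gauss(mu[i], sigma[i])            # y ~ N(mu_i, sigma_i^2)
        data.append((i, y))
    return data

labeled = sample_mixture(10)
unlabeled = [y for _, y in labeled]   # what the unsupervised learner sees
print(unlabeled)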

Page 79: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

A generative model

Page 80: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)

A generative model

Page 81: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)(S before)

A generative model

Page 82: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)(S before)(σ2 before)

A generative model

Page 83: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(σ2 before)

A generative model

Page 84: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

(σ2 before)

A generative model

Page 85: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observations

(σ2 before)

A generative model

Page 86: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

A generative model

Page 87: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

A generative model

Page 88: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

ϕ ~ distribution()

A generative model

Page 89: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

ϕ ~ distribution()μi ~ distribution()

A generative model

Page 90: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

ϕ ~ distribution()μi ~ distribution()Σi ~ distribution()

A generative model

Page 91: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

ϕ ~ distribution()μi ~ distribution()Σi ~ distribution()

A generative model

'mixture of Gaussians'

Page 92: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

y

µΣ

m

n

(a) Simple model

y

µΣ

α

m

n

(b) Bayesian model priorsover category parameters

y

µΣ

Σφ

α

m

n

(c) Learning category proba-bilities as well

Figure 9.3: Graphical model for simple mixture of Gaussians

−5 0 5 10

0.0

0.1

0.2

0.3

0.4

x

P(x)

Figure 9.4: The generative mixture of Gaussians in one dimension. Model parameters are:φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ,µ,Σ⟩. This model is known as a mixture of Gaussians, since the observa-tions are drawn from some mixture of individually Gaussian distributions. An illustrationof this generative model in one dimension is given in Figure 9.4, with the Gaussian mixturecomponents drawn above the observations. At the bottom of the graph is one sample fromthis Gaussian mixture in which the underlying categories are distinguished; at the top of thegraph is another sample in which the underlying categories are not distinguished. The rela-tively simple supervised problem would be to infer the means and variances of the Gaussianmixture components from the category-distinguished sample at the bottom of the graph; themuch harder unsupervised problem is to infer these means and variances—and potentiallythe number of mixture components!—from the category-undistinguished sample at the topof the graph. The graphical model corresponding to this unsupervised learning problem isgiven in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

y ~ N(μi, Σi)i ~ discrete(ϕ)

(S before)(c before)

n = # observationsm = # categories

(σ2 before)

ϕ ~ distribution()μi ~ distribution()Σi ~ distribution()

A generative model

'mixture of Gaussians' made up!}

Page 93: Probabilistic models of language acquisitionidiom.ucsd.edu/~rlevy/teaching/esslli2014/lectures/...Probabilistic models of language acquisition Klinton Bicknell & Roger Levy!! ESSLLI

a model of where categories and sounds come from

29

Figure 9.3: Graphical model for simple mixture of Gaussians. (a) Simple model; (b) Bayesian model with priors over category parameters; (c) learning category probabilities as well. (Nodes include observations y in a plate of size n, category parameters µ, Σ in a plate of size m, mixing weights φ, and prior hyperparameters α in the richer variants.)


Figure 9.4: The generative mixture of Gaussians in one dimension (density P(x) plotted against x). Model parameters are: φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.

θ = ⟨φ, µ, Σ⟩. This model is known as a mixture of Gaussians, since the observations are drawn from some mixture of individually Gaussian distributions. An illustration of this generative model in one dimension is given in Figure 9.4, with the Gaussian mixture components drawn above the observations. At the bottom of the graph is one sample from this Gaussian mixture in which the underlying categories are distinguished; at the top of the graph is another sample in which the underlying categories are not distinguished. The relatively simple supervised problem would be to infer the means and variances of the Gaussian mixture components from the category-distinguished sample at the bottom of the graph; the much harder unsupervised problem is to infer these means and variances—and potentially the number of mixture components!—from the category-undistinguished sample at the top of the graph. The graphical model corresponding to this unsupervised learning problem is given in Figure 9.3a.

Roger Levy – Probabilistic Models in the Study of Language draft, November 6, 2012 210

A generative model: the 'mixture of Gaussians'
i ~ discrete(ϕ)    (the category; c before)
y ~ N(μi, Σi)    (the observation; S before, with Σ playing the role of σ² before)
n = # observations; m = # categories (set in advance)
ϕ ~ distribution(), μi ~ distribution(), Σi ~ distribution()    (the priors: made up!)
(a sampling sketch of this generative story follows below)
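To make the generative story concrete, here is a minimal sampling sketch in Python/NumPy, using the parameter values from Figure 9.4 (φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2); the code and variable names are illustrative, not taken from the course materials.

import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters (values from Figure 9.4); in the unsupervised setting these are unknown.
phi = np.array([0.35, 0.65])    # category probabilities
mu = np.array([0.0, 4.0])       # category means
sigma = np.array([1.0, 2.0])    # category standard deviations

n = 500                         # number of observations (set in advance)

# Generative story: first draw a category, then draw an observation from that category.
i = rng.choice(len(phi), size=n, p=phi)   # i ~ discrete(phi)
y = rng.normal(mu[i], sigma[i])           # y ~ N(mu_i, sigma_i^2)

# The learner observes only y; the category labels i are latent.

The unsupervised learning problem described above is then to recover phi, mu, and sigma (and, in the harder case, even the number of categories) from y alone.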


final step: inference

• now we have generative model with prior on variables of interest and well-defined likelihood

• but inference is usually quite hard

• two common classes of methods for inference, which I'll describe now

30

Inference


inference method 1: Expectation Maximization (EM)

• if we knew the category parameters (means, variance, probability), it would be easy to categorize datapoints like before

• if we knew how the datapoints were categorized, it would be easy to find category parameters

• Q: how?

• idea behind EM:

• randomly initialize category assignments of datapoints

• estimate category parameters from current assignments

• now assign datapoints to categories based on category parameters

• estimate category parameters from current assignments

• … (repeat; see the sketch below)
31

Inference
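A minimal sketch of this alternation for a one-dimensional, two-category mixture, using hard category assignments at each step (the simplest EM-style version of the recipe above; everything here is illustrative rather than the exact algorithm from any particular paper):

import numpy as np

def hard_em(y, m=2, iters=50, seed=0):
    """Alternate between estimating category parameters and reassigning datapoints."""
    rng = np.random.default_rng(seed)
    z = rng.integers(m, size=len(y))     # randomly initialize category assignments
    for _ in range(iters):
        # estimate category parameters from current assignments
        # (sketch: assumes every category keeps at least one datapoint)
        phi = np.array([np.mean(z == k) for k in range(m)])
        mu = np.array([y[z == k].mean() for k in range(m)])
        sigma = np.array([y[z == k].std() + 1e-6 for k in range(m)])
        # assign datapoints to categories based on category parameters
        log_post = (np.log(phi) - np.log(sigma)
                    - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
        z = log_post.argmax(axis=1)
    return phi, mu, sigma, z

On bimodal data like the sample generated earlier, this typically recovers two categories near the true means, but the outcome depends on the random initialization, which is exactly the local-optima issue raised on the next slides.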


EM illustrations (on board)

• bimodal data

• two categories

32

Inference


EM issues

• only finds most likely value for variables, not posterior distribution on them

• can get stuck in 'local optima'

33

Inference


inference method 2: Markov chain Monte Carlo

• like EM, the model is in some state at each point in time (with current guesses for latent variables)

• like EM, the model transitions between states

• unlike EM, the model sometimes moves to lower probability states

• if you calculate the transition probabilities correctly, the amount of time the model spends in each state is proportional to its posterior probability

• thus: don't just get most likely variables out, but full posterior distribution

• also: guaranteed to converge (not getting stuck in local optima)

• however, this guarantee only holds with infinite time… (see the sketch below)
34

Inference
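As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sketch targeting the posterior over the two category means of a one-dimensional mixture, with the mixing weights and standard deviations assumed known; the prior and all settings are illustrative assumptions, not values from the lecture.

import numpy as np

def mh_mixture_means(y, phi, sigma, iters=20000, step=0.3, seed=0):
    """Random-walk Metropolis-Hastings over the two category means."""
    rng = np.random.default_rng(seed)

    def log_post(mu):
        # illustrative vague N(0, 10^2) prior on each mean, plus the mixture log-likelihood
        log_prior = -0.5 * np.sum((mu / 10.0) ** 2)
        dens = phi * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / sigma
        return log_prior + np.sum(np.log(dens.sum(axis=1)))

    mu = np.zeros(2)                 # current state (current guesses for the latent means)
    cur = log_post(mu)
    samples = []
    for _ in range(iters):
        prop = mu + rng.normal(0, step, size=2)   # propose a transition to a nearby state
        new = log_post(prop)
        # accept with probability min(1, posterior ratio): sometimes moves to lower-probability states
        if np.log(rng.uniform()) < new - cur:
            mu, cur = prop, new
        samples.append(mu.copy())
    return np.array(samples)

Histogramming the samples (after discarding an initial burn-in) approximates the full posterior over the means rather than a single most likely value, because the sampler spends time in each state in proportion to its posterior probability.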


mixture of Gaussian models to learn phonetic categories

• many groups have had success, at least on simple problems (Vallabha et al., 2007; McMurray et al., 2009; Feldman et al., 2009)

• but it seems clear that other information sources are needed too, because categories overlap too much

• next up: work by Pajak and colleagues on another information source

35

Modeling category learning


[slides on Pajak, Bicknell, & Levy, 2013]

36

Modeling category learning

[Slide graphic: the Bulgarian consonant inventory table (Wikipedia: Bulgarian phonology), in which most obstruents come in voiced/voiceless pairs and most consonants come in hard/soft (palatalized) pairs.]

 

BULGARIAN  consonants  

1  

What  may  help  solve  this  problem  

(Wikipedia:  Bulgarian  phonology)  

VOICING  PALATALIZATION  

§ Learning may be facilitated by languages' extensive re-use of a set of phonetic dimensions (Clements 2003)

 

§ Learning may be facilitated by languages' extensive re-use of a set of phonetic dimensions (Clements 2003)

 

2  

What  may  help  solve  this  problem  

(Wikipedia:  Thai  language)  

[Slide graphic: Thai vowel inventory tables (Tingsabadh & Abramson 1993, via Wikipedia), in which the vowels come in long/short pairs along the LENGTH dimension, e.g. /a/ vs. /aː/, /i/ vs. /iː/.]

 

THAI  vowels      

LENGTH  

§ Learning may be facilitated by languages' extensive re-use of a set of phonetic dimensions (Clements 2003)
§ Existing experimental evidence supports this view
  § both infants and adults generalize newly learned phonetic category distinctions to untrained sounds along the same dimension (McClaskey et al. 1983, Maye et al. 2008, Perfors & Dunbar 2010, Pajak & Levy 2011)

3  

What  may  help  solve  this  problem  

/s/ /ss/ length

singleton   geminate  

/n/ /nn/ length

voicing /b/ /p/  

voicing /g/ /k/ voiced   voiceless  

§ What are the mechanisms underlying generalization?
§ How do learners make use of information about some phonetic categories when learning other categories?

/s/ /ss/ length

/n/ /nn/ length

4  

How do people learn phonetic categories?

How do people learn phonetic categories?
§ Pajak et al.'s proposal:
  § in addition to learning specific categories, people also learn category types

/s/ /ss/ length /n/ /nn/ length

5  

singleton <-> geminate

§ Experimental data illustrating generalization across analogous distinctions (from Pajak & Levy 2011)
§ Our computational proposal for how this kind of generalization might be accomplished
§ Simulations of Pajak & Levy's data

6  

outline  


7  

Experimental  data  (Pajak  &  Levy  2011)  

/s/ /ss/ length

sound  class:  FRICATIVES  

/n/ /nn/ length

sound  class:  SONORANTS  

singleton   geminate  

8
Experimental data (Pajak & Levy 2011)
§ Distributional training:
  § adult English native speakers exposed to words in a new language, where the middle consonant varied along the length dimension
[aja] 145ms    [ina] 205ms    [ila] 115ms    [ama] 160ms

…  

9
[Figure: familiarization frequency histograms over the stimulus length continuum (100-280 msec), comparing bimodal and unimodal training distributions, for Expt 1 (sonorants: [n]-…-[nn], [m]-…-[mm], [l]-…-[ll], [j]-…-[jj]) and Expt 2 (fricatives: [s]-…-[ss], [f]-…-[ff], [ʃ]-…-[ʃʃ], [θ]-…-[θθ]).]

this difference reflects natural distributions of length in different sound classes

Experimental data (Pajak & Levy 2011): Training

10
§ Testing:
  § participants made judgments about pairs of words
  § dependent measure: proportion of 'different' responses (as opposed to 'same') on 'different' trials
§ if learning is successful, we expect:
§ to assess generalization, testing included both trained and untrained sound classes (i.e., both sonorants and fricatives)

Example: [ama]-[amma] “Are these two different words in this language or two repetitions of the same word?”

 

Experimental data (Pajak & Levy 2011)
'DIFFERENT' RESPONSES: Bimodal training > Unimodal training

11
[Figure: the familiarization/length continuum annotated for testing. EXPT 1: ALIGNED CATEGORIES; EXPT 2: MISALIGNED CATEGORIES. In each experiment, testing covered both the trained sound class and the untrained sound class (e.g., asa-assa vs. ama-amma), across sonorants ([n]-…-[nn], [m]-…-[mm], [l]-…-[ll], [j]-…-[jj]) and fricatives ([s]-…-[ss], [f]-…-[ff], [ʃ]-…-[ʃʃ], [θ]-…-[θθ]).]
Experimental data (Pajak & Levy 2011): Testing

12
Experimental data (Pajak & Levy 2011)
[Figure: bar plots of the proportion of 'different' responses after bimodal vs. unimodal training, for the trained sound class ('learning') and the untrained sound class ('generalization'), in EXPT 1: ALIGNED CATEGORIES and EXPT 2: MISALIGNED CATEGORIES.]

Computational modeling
① How can we account for distributional learning?
② How can we account for generalization across sound classes?

13  

§ Mixture of Gaussians approach (de Boer & Kuhl 2003, Vallabha et al. 2007, McMurray et al. 2009, Feldman et al. 2009, Toscano & McMurray 2010, Dillon et al. 2013)
14
Modeling phonetic category learning
[Graphical model: phonetic category z_i generates datapoint (perceptual token) d_i, for i ∈ {1..n}:]
d_i ~ N(µ_{z_i}, σ²_{z_i})
[Illustration: length categories /s/ (µ_s, σ_s²) and /ss/ (µ_ss, σ_ss²); inference from tokens 120ms, 201ms, 165ms, 182ms, 115ms, …]

§ Our general approach (following Feldman et al. 2009):
  § learning via nonparametric Bayesian inference
  § using Dirichlet processes, which allow the model to learn the number of categories from the data (see the sketch below)
15
Modeling phonetic category learning
[Graphical model: prior H and concentration γ generate G0; phonetic category z_i generates datapoint (perceptual token) d_i, i ∈ {1..n}:]
H: µ ~ N(µ0, σ²/κ0), σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
z_i ~ G0
d_i ~ N(µ_{z_i}, σ²_{z_i})
[Illustration: length categories /s/ (µ_s, σ_s²) and /ss/ (µ_ss, σ_ss²) among candidate categories drawn from the prior.]
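A minimal sketch of the property highlighted here (a Dirichlet process lets the number of categories grow with the data), using the equivalent Chinese-restaurant-process scheme for sampling category assignments from the prior; the function and parameter names are illustrative.

import numpy as np

def crp_assignments(n, gamma=1.0, seed=0):
    """Sample category assignments from a Chinese restaurant process with concentration gamma."""
    rng = np.random.default_rng(seed)
    z = [0]                          # the first datapoint starts category 0
    for _ in range(1, n):
        counts = np.bincount(z)      # datapoints per existing category
        # existing category k with prob counts[k]/(i + gamma); a new category with prob gamma/(i + gamma)
        probs = np.append(counts, gamma) / (len(z) + gamma)
        z.append(int(rng.choice(len(probs), p=probs)))
    return np.array(z)

z = crp_assignments(500, gamma=1.0)
print("categories used in this prior sample:", z.max() + 1)

In the full model, this prior over assignments is combined with the Gaussian likelihood above, so the posterior determines how many phonetic categories the data support.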

Computational modeling
✓ ① How can we account for distributional learning?
② How can we account for generalization across sound classes?
16

§ In addition to acquiring specific categories, learners infer category types, which can be shared across sound classes
§ This means that already learned categories can be directly re-used to categorize other sounds
§ To implement this proposal, we use a hierarchical Dirichlet process, which allows for sharing categories across data groups (here, sound classes)

17  

Proposal  

/s/ /ss/ length /n/ /nn/ length

singleton <-> geminate

fricatives    sonorants

Modeling generalization: HDP
18
[Graphical model: prior H and γ generate the global category inventory G0; each sound class c gets its own Gc; phonetic category z_ic generates datapoint (perceptual token) d_ic, i ∈ {1..n_c}, c ∈ C:]
H: µ ~ N(µ0, σ²/κ0), σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
Gc ~ DP(α0, G0)
z_ic ~ Gc
d_ic ~ N(µ_{z_ic}, σ²_{z_ic})
[Illustration: shared singleton/geminate category types ('sg', 'gem') re-used for c = fricatives (/s/ vs /ss/: µ_fric-sg, σ_fric-sg²; µ_fric-gem, σ_fric-gem²) and for c = sonorants (/n/ vs /nn/: son-sg, son-gem) along the length dimension.]

§ But people are able to generalize even when analogous category types are implemented phonetically in different ways
§ We want the model to account for potential differences between sound classes
19
Modeling generalization: HDP

length /s/ /ss/

fricatives

length /n/ /nn/

sonorants  

Modeling generalization: HDP
20
Accounting for differences between sound classes: learnable class-specific 'offsets' by which data in a class are shifted along a phonetic dimension (cf. Dillon et al. 2013)
[Graphical model as before, with a class-specific offset f_c added; i ∈ {1..n_c}, c ∈ C:]
H: µ ~ N(µ0, σ²/κ0), σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
Gc ~ DP(α0, G0)
z_ic ~ Gc
f_c ~ N(0, σ_f²)
d_ic ~ N(µ_{z_ic}, σ²_{z_ic}) + f_c
(a small generative sketch of this offset model follows below)

fricatives

length /s/ /ss/

length

sonorants  

/n/ /nn/
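To illustrate the offset idea, here is a minimal generative sketch: two sound classes share the same two category types along the length dimension, but each class's data are shifted by its own offset f_c. All numerical values are illustrative, not fitted to the experimental stimuli.

import numpy as np

rng = np.random.default_rng(1)

# shared category types (e.g., singleton vs. geminate) on the length dimension, in ms
type_means = np.array([140.0, 230.0])
type_sd = 15.0

classes = ["fricatives", "sonorants"]
offsets = {c: rng.normal(0, 30.0) for c in classes}   # f_c ~ N(0, sigma_f^2)

data = {}
for c in classes:
    z = rng.integers(2, size=200)                               # z_ic: which shared category type
    data[c] = rng.normal(type_means[z], type_sd) + offsets[c]   # d_ic ~ N(mu_z, sigma_z^2) + f_c
    print(c, "offset:", round(offsets[c], 1),
          "class-specific category means:", np.round(type_means + offsets[c], 1))

Because the category types are shared across classes, evidence about the singleton/geminate split in one class informs the same split in the other, while the offsets absorb overall differences in where each class sits on the phonetic dimension.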

21
Simulation results
[Figure: proportion of 2-category inferences by the extended model (bimodal vs. unimodal training, trained vs. untrained sound class) for Experiment 1 (ALIGNED CATEGORIES) and Experiment 2 (MISALIGNED CATEGORIES), shown alongside the human data (proportion of 'different' responses).]

22
Simulation results: NO OFFSET PARAMETER
[Figure: proportion of 2-category inferences by the basic model (no offsets) for Experiment 1 (ALIGNED CATEGORIES) and Experiment 2 (MISALIGNED CATEGORIES), shown alongside the human data.]
we need the shift parameter to account for generalization across misaligned categories!


other work on category learning

• Feldman et al. (2009) add a (latent) lexicon to category learning

37

Modeling category learning

Figure 3: Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from men's vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm. (Axes: Second Formant (Hz) vs. First Formant (Hz); 12 English vowel categories: i, ɪ, e, ɛ, ɝ, æ, u, ʊ, o, ɔ, a, ʌ.)

…boring vowel categories. Positing the presence of a lexicon therefore showed evidence of helping the ideal learner disambiguate overlapping vowel categories, even though the phonological forms contained in the lexicon were not given explicitly to the learner.

Pairwise accuracy and completeness measures were computed for each learner as a quantitative measure of model performance (Table 1). For these measures, pairs of vowel tokens that were correctly placed into the same category were counted as a hit; pairs of tokens that were incorrectly assigned to different categories when they should have been in the same category were counted as a miss; and pairs of tokens that were incorrectly assigned to the same category when they should have been in different categories were counted as a false alarm. The accuracy score was computed as hits / (hits + false alarms) and the completeness score as hits / (hits + misses). Both measures were high for the lexical-distributional learner, but accuracy scores were substantially lower for the purely distributional learners, reflecting the fact that these models mistakenly merged several overlapping categories.

Results suggest that as predicted, a model that uses the input to learn word categories in addition to phonetic categories produces better phonetic category learning results than a model that only learns phonetic categories. Note that the distributional learners are likely to show better performance if they are given dimensions beyond just the first two formants (Vallabha et al., 2007) or if they are given more data points during learning. These two solutions actually work against each other: as dimensions are added, more data are necessary to maintain the same learning outcome.
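A quick sketch of the pairwise accuracy and completeness measures described in the excerpt above (a generic illustration, not the authors' code):

import itertools

def pairwise_scores(true_labels, inferred_labels):
    """Pairwise accuracy = hits / (hits + false alarms); completeness = hits / (hits + misses)."""
    hits = misses = false_alarms = 0
    for i, j in itertools.combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            hits += 1                 # correctly placed in the same category
        elif same_true:
            misses += 1               # should have been together, but split apart
        elif same_inferred:
            false_alarms += 1         # merged, but should have been apart
    return hits / (hits + false_alarms), hits / (hits + misses)

Merging two overlapping categories produces many false alarms, which is why the accuracy scores of the purely distributional learners drop in the comparison described above.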

Figure 4: Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from all speakers' vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm. (Axes: Second Formant (Hz) vs. First Formant (Hz).)

Nevertheless, we do not wish to suggest that a purely distributional learner cannot acquire phonetic categories. The simulations presented here are instead meant to demonstrate that in a language where phonetic categories have substantial overlap, an interactive system, where learners can use information from words that contain particular speech sounds, can increase the robustness of phonetic category learning.

Discussion
This paper has presented a model of phonetic category acquisition that allows interaction between speech sound and word categorization. The model was not given a lexicon a priori, but was allowed to begin learning a lexicon from the data at the same time that it was learning to categorize individual speech sounds, allowing it to take into account the distribution of speech sounds in words. This lexical-distributional learner outperformed a purely distributional learner on a corpus whose categories were based on English vowel categories, showing better disambiguation of overlapping categories from the same number of data points.

Infants learn to segment words from fluent speech around the same time that they begin to show signs of acquiring native language phonetic categories, and they are able to map these segmented words onto tokens heard in isolation (Jusczyk & Aslin, 1995), suggesting that they are performing some sort of rudimentary categorization on the words they hear. Infants may therefore have access to information from words that can help them disambiguate overlapping categories. If information from words can feed back to constrain phonetic category learning, the large degree of overlap be-



probabilistic models in acquisition

• in general: still working on incorporating many sources of information

• it seems that infants, children, and adults use many sources of information to learn language

• one very exciting information source: other languages

38

Conclusions


computational psycholinguistics

• this course gave a very broad overview of many areas: perception, comprehension, production, acquisition

• gave a sense for how probabilistic models can be used

• also covered much of the essential technical knowledge to use probabilistic models

• this is still very much the beginning of probabilistic modeling in the study of language: very many open questions and models yet to build!

• hope you enjoyed!

39

Conclusions