Probabilistic models of language acquisition
Klinton Bicknell & Roger Levy
ESSLLI 26 University of Tübingen
22 August 2014
Situating language acquisition
day 1: sound categorization
[Figure 5.2: Likelihood functions for /b/–/p/ phoneme categorization, with µb = 0, µp = 50, σb = σp = 12. For the input x = 27, the likelihoods favor /p/. (Plot of probability density against VOT.)]
[Figure 5.3: Posterior probability curve for Bayesian phoneme discrimination as a function of VOT. (Plot of the posterior probability of /b/ against VOT.)]
[…] the conditional distributions over acoustic representations, Pb(x) and Pp(x) for /b/ and /p/ respectively (the likelihood functions), and the prior distribution over /b/ versus /p/. We further simplify the problem by characterizing any acoustic representation x as a single real-valued number representing the VOT, and the likelihood functions for /b/ and /p/ as normal density functions (Section 2.10) with means µb, µp and standard deviations σb, σp respectively.
Figure 5.2 illustrates the likelihood functions for the choices µb = 0, µp = 50, σb = σp = 12. Intuitively, the phoneme that is more likely to be realized with VOT in the vicinity of a given input is a better choice for the input, and the greater the discrepancy in the likelihoods the stronger the categorization preference. An input with non-negligible likelihood for each phoneme is close to the "categorization boundary", but may still have a preference. These intuitions are formally realized in Bayes' Rule:
P(/b/ | x) = P(x | /b/) P(/b/) / P(x)   (5.10)

and since we are considering only two alternatives, the marginal likelihood is simply the weighted sum of the likelihoods under the two phonemes: P(x) = P(x|/b/)P(/b/) + P(x|/p/)P(/p/). If we plug in the normal probability density function we get

P(/b/ | x) = [ (1/√(2πσb²)) exp(−(x − µb)²/(2σb²)) P(/b/) ] / [ (1/√(2πσb²)) exp(−(x − µb)²/(2σb²)) P(/b/) + (1/√(2πσp²)) exp(−(x − µp)²/(2σp²)) P(/p/) ]   (5.11)
In the special case where σb = σp = σ we can simplify this considerably by cancelling the […]
(excerpt from Roger Levy, Probabilistic Models in the Study of Language, draft of November 6, 2012, p. 85)
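A minimal numerical sketch of Equations (5.10)-(5.11), assuming equal priors P(/b/) = P(/p/) = 0.5 and the parameter values from Figure 5.2 (function and variable names are illustrative, not from the text):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Normal density N(x; mu, sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

def posterior_b(x, mu_b=0.0, mu_p=50.0, sigma=12.0, prior_b=0.5):
    """P(/b/ | VOT = x) via Bayes' Rule, Eq. (5.11), with sigma_b = sigma_p."""
    num_b = normal_pdf(x, mu_b, sigma) * prior_b
    num_p = normal_pdf(x, mu_p, sigma) * (1.0 - prior_b)
    return num_b / (num_b + num_p)

print(posterior_b(27.0))  # about 0.33: for VOT = 27 the posterior favors /p/, as in Figure 5.2
```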
Situating language acquisition
day 1: inferring sound category c from sound token S
(graphical model: category c → sound value S)
• c ~ discrete choice, e.g., p(p) = p(b) = 0.5
• S|c ~ Gaussian(μc, σ²c)
Statistical Model
day 2a: sound similarity
day 2a: inferring target production T from sound token S
(graphical model: category c → target production T → speech sound S; a sampling sketch follows below)
• Choose a category c with probability p(c)
• Articulate a target production T with probability p(T|c) = N(μc, σ²c)
• Listener hears speech sound S with probability p(S|T) = N(T, σ²S)
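A sketch of this generative process as forward sampling; the specific category means, standard deviations, and noise value below are made up for illustration:

```python
import random

# Illustrative category parameters (made up): VOT-like means/SDs for /b/ and /p/.
CATEGORIES = {"b": (0.0, 12.0), "p": (50.0, 12.0)}
P_CATEGORY = {"b": 0.5, "p": 0.5}
SIGMA_S = 5.0  # noise on the heard sound S given the target production T

def sample_utterance():
    """Forward-sample (c, T, S): choose a category, articulate a target, hear a sound."""
    c = random.choices(list(P_CATEGORY), weights=list(P_CATEGORY.values()))[0]
    mu_c, sigma_c = CATEGORIES[c]
    t = random.gauss(mu_c, sigma_c)   # T | c ~ N(mu_c, sigma_c^2)
    s = random.gauss(t, SIGMA_S)      # S | T ~ N(T, sigma_S^2)
    return c, t, s

print(sample_utterance())
```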
Statistical Model
day 2b: incremental parsing
• The most famous garden paths: reduced relative clauses (RRCs) versus main clauses (MCs)
• From the valence + simple-constituency perspective, MC and RRC analyses differ in two places:
  The horse (that was) raced past the barn fell.
  …
day 2b: inferring syntactic structure T from words w
• T ~ PCFG (a toy sampler follows below)
• words w are the leaves of the tree T
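To make "T ~ PCFG, words are the leaves of the tree" concrete, here is a toy PCFG sampler; the grammar below is invented for illustration and is not the grammar used in the course:

```python
import random

# Toy PCFG: nonterminal -> list of (probability, right-hand side) pairs.
GRAMMAR = {
    "S":   [(1.0, ["NP", "VP"])],
    "NP":  [(0.7, ["Det", "N"]), (0.3, ["Det", "N", "RRC"])],
    "RRC": [(1.0, ["VBN", "PP"])],          # reduced relative clause
    "VP":  [(0.6, ["V"]), (0.4, ["V", "PP"])],
    "PP":  [(1.0, ["P", "NP"])],
    "Det": [(1.0, ["the"])],
    "N":   [(0.5, ["horse"]), (0.5, ["barn"])],
    "V":   [(0.5, ["fell"]), (0.5, ["raced"])],
    "VBN": [(1.0, ["raced"])],
    "P":   [(1.0, ["past"])],
}

def sample_tree(symbol="S"):
    """Sample a tree T top-down from the PCFG; words are the leaves."""
    if symbol not in GRAMMAR:                 # terminal symbol: a word
        return symbol
    probs, rhss = zip(*GRAMMAR[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return (symbol, [sample_tree(child) for child in rhs])

def leaves(tree):
    """Read the word string off the leaves of the tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

tree = sample_tree()
print(" ".join(leaves(tree)))   # occasionally yields RRC garden paths like the example above
```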
Statistical Model
day 3: sentence processing and eye movements in reading
• The coach smiled at the player tossed the frisbee
• The coach smiled at the player thrown the frisbee
• The coach smiled toward the player tossed the frisbee
• The coach smiled toward the player thrown the frisbee
day 3: inferring words w and trees T from perceptual input I
• a tree T ~ PCFG
• words w are leaves of the tree
• visual input I ~ noise(w)
Situating language acquisition
day 4: sentence meaning judgments
day 4: inferring meaning M from linguistic form F and processing pressures P
(graphical model: M and P jointly generate F)
• F ~ function(M, P)
Situating language acquisition
in all these cases
• step 1: identify the relevant sources of information
• step 2: make a generative model in which
  • the thing to be inferred was a latent variable
  • the relevant information was used to specify a prior for the latent variable or the likelihood of the data given that latent variable
• step 3: apply Bayesian inference (relatively easy here: given prior and likelihood)
Problems in acquisition
step 1
let's apply step 1 (identify relevant information sources) to problems in acquisition
Problems in acquisition
learning to segment words
[yuwanttusiðəbʊk]?
-> you want to see the book?
[e.g., Goldwater et al., 2009]
Problems in acquisition
learning word meanings
[e.g., Frank et al., 2009]
Problems in acquisition
learning phonotactics
[blɪk]
[mdwɨ] Polish mdły: 'tasteless'
[e.g., Hayes & Wilson, 2008]
Problems in acquisition
learning syntax
The boy is hungry. -> Is the boy hungry?
The boy who is smiling is happy. -> ???
Is the boy who is smiling happy?
*Is the boy who smiling is happy?
[e.g., Perfors et al., 2011]
Problems in acquisition
learning the sound categories in the language from the sounds
• are long VOTs [t] and short VOTs [d] functionally different sounds or just natural variation?
[e.g., Pajak et al., 2013]
Problems in acquisition
steps 2 and 3: construct a generative model, perform inference
• a bit different than before
• we'll go in depth through an example of sound category learning
• note: many of my visualizations were made by Dr. Bozena Pajak (University of Rochester, Brain & Cognitive Sciences; soon to be my colleague at Northwestern University)
Distributional information
learning sound categories
• how do we learn that [t] and [d] are different categories?
• information source 1: distributional information
[figure: the distribution of sound tokens along the VOT dimension]
Distributional information
experimental evidence
• we know that babies and adults use distributional information to help infer category structure [Maye & Gerken, 2000; Maye et al., 2002]
• Pajak & Levy (2011) performed an experiment replicating this with adults, which I'll describe
Experimental data (Pajak & Levy 2011)
sound class: FRICATIVES: /s/ (singleton) vs. /ss/ (geminate), contrasting along the length dimension
▪ Distributional training:
▪ adult English native speakers exposed to words in a new language, where the middle consonant varied along the length dimension
[aja] 145ms   [ina] 205ms   [ila] 115ms   [ama] 160ms
…
Experimental data (Pajak & Levy 2011)
[figure: familiarization frequency over the stimulus length continuum (100–280 ms), bimodal vs. unimodal training distributions, for Expt 1 (sonorants) and Expt 2 (fricatives)]
Experimental data (Pajak & Levy 2011)
▪ Testing:
▪ participants made judgments about pairs of words
  Example: [ama]-[amma] "Are these two different words in this language or two repetitions of the same word?"
▪ dependent measure: proportion of 'different' responses (as opposed to 'same') on 'different' trials
▪ if learning is successful, we expect:
‘DIFFERENT’ RESPONSES
Bimodal training Unimodal training>
25
Experimental data (Pajak & Levy 2011)
[figure: the bimodal vs. unimodal familiarization distributions over the stimulus length continuum, with the test stimuli marked]
Experimental data (Pajak & Levy 2011)
[figure: proportion of 'different' responses in Expt 1 for trained and untrained sound classes, bimodal vs. unimodal training]
Distributional information
distributional information is used
• adults and babies use distributional information to infer categories
• next steps: put this information into a generative model and perform inference
A generative model
our model from before of where sounds come from
(graphical model: category c → sound value S)
• c ~ discrete choice, e.g., p(p) = p(b) = 0.5
• S|c ~ Gaussian(μc, σ²c)
[figure: distribution of tokens along the VOT dimension]
still pretty appropriate!
A generative model
a model of where categories and sounds come from: the 'mixture of Gaussians'

[Figure 9.3: Graphical models for a simple mixture of Gaussians: (a) simple model, with category probabilities φ, category parameters µ and Σ, category indices i, and observations y (n observations, m categories); (b) Bayesian model with priors over category parameters (hyperparameter α); (c) learning the category probabilities as well (hyperparameter Σ_φ).]

[Figure 9.4: The generative mixture of Gaussians in one dimension, plotting P(x) against x. Model parameters: φ1 = 0.35, µ1 = 0, σ1 = 1, µ2 = 4, σ2 = 2.]

θ = ⟨φ, µ, Σ⟩. This model is known as a mixture of Gaussians, since the observations are drawn from some mixture of individually Gaussian distributions. An illustration of this generative model in one dimension is given in Figure 9.4, with the Gaussian mixture components drawn above the observations. At the bottom of the graph is one sample from this Gaussian mixture in which the underlying categories are distinguished; at the top of the graph is another sample in which the underlying categories are not distinguished. The relatively simple supervised problem would be to infer the means and variances of the Gaussian mixture components from the category-distinguished sample at the bottom of the graph; the much harder unsupervised problem is to infer these means and variances (and potentially the number of mixture components!) from the category-undistinguished sample at the top of the graph. The graphical model corresponding to this unsupervised learning problem is given in Figure 9.3a.
(excerpt from Roger Levy, Probabilistic Models in the Study of Language, draft of November 6, 2012, p. 210)

In the slide's notation (a sampling sketch follows below):
• i ~ discrete(ϕ)   (c before)
• y ~ N(μi, Σi)   (S before; Σi plays the role of σ² before)
• ϕ ~ distribution(), μi ~ distribution(), Σi ~ distribution()   } made up!
• n = # observations; m = # categories, set in advance
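A forward-sampling sketch of this generative model in Python, using the parameter values from Figure 9.4 (the code itself is illustrative, not from the course materials):

```python
import random

# Mixture-of-Gaussians parameters from Figure 9.4 (one-dimensional case).
PHI   = [0.35, 0.65]   # category probabilities phi
MU    = [0.0, 4.0]     # category means
SIGMA = [1.0, 2.0]     # category standard deviations

def sample_observation():
    """Draw category i ~ discrete(phi), then observation y ~ N(mu_i, sigma_i^2)."""
    i = random.choices(range(len(PHI)), weights=PHI)[0]
    y = random.gauss(MU[i], SIGMA[i])
    return i, y

# "Category-distinguished" sample: (i, y) pairs; dropping i gives the
# "category-undistinguished" sample that the unsupervised learner sees.
data = [sample_observation() for _ in range(1000)]
observations_only = [y for _, y in data]
```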
Inference
final step: inference
• now we have a generative model with a prior on the variables of interest and a well-defined likelihood
• but inference is usually quite hard
• two common classes of methods for inference, which I'll describe now
Inference
inference method 1: Expectation Maximization (EM)
• if we knew the category parameters (means, variance, probability), it would be easy to categorize datapoints like before
• if we knew how the datapoints were categorized, it would be easy to find category parameters
• Q: how?
• idea behind EM (see the sketch below):
  • randomly initialize category assignments of datapoints
  • estimate category parameters from current assignments
  • now assign datapoints to categories based on category parameters
  • estimate category parameters from current assignments
  • …
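A minimal EM sketch for a two-component 1D Gaussian mixture (bimodal data, two categories). Standard EM uses soft responsibilities in place of the hard re-assignments in the bullets above; the variable names and initialization choices here are illustrative:

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def em_two_gaussians(data, n_iter=50):
    """EM for a two-component 1D mixture of Gaussians."""
    mu = random.sample(data, 2)          # random initialization of the two means
    s0 = statistics.stdev(data)
    sigma = [s0, s0]
    phi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each category for each datapoint.
        resp = []
        for x in data:
            p = [phi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate category parameters from the current (soft) assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            phi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return phi, mu, sigma

# Bimodal data, two categories (as in the on-board illustration).
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(4, 2) for _ in range(200)]
print(em_two_gaussians(data))
```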
Inference
EM illustrations (on board)
• bimodal data
• two categories
Inference
EM issues
• only finds most likely value for variables, not posterior distribution on them
• can get stuck in 'local optima'
Inference
inference method 2: Markov chain Monte Carlo (MCMC)
• like EM, the model is in some state at each point in time (with current guesses for latent variables)
• like EM, the model transitions between states
• unlike EM, the model sometimes moves to lower probability states
• if you calculate the transition probabilities correctly, the amount of time the model spends in each state is proportional to its posterior probability (see the sketch below)
• thus: don't just get most likely variables out, but full posterior distribution
• also: guaranteed to converge (not getting stuck in local optima)
• however, this guarantee is only with infinite time…
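A minimal Metropolis sampler illustrating these points on a toy problem: the posterior over a single category mean mu, given Gaussian data with known sigma and a Gaussian prior. All values are illustrative and this is not the sampler used in the papers cited later:

```python
import math
import random

def log_posterior(mu, data, sigma=12.0, prior_mu=0.0, prior_sigma=50.0):
    """Unnormalized log posterior for a category mean mu: Gaussian prior x Gaussian likelihood."""
    log_prior = -(mu - prior_mu) ** 2 / (2 * prior_sigma ** 2)
    log_lik = sum(-(x - mu) ** 2 / (2 * sigma ** 2) for x in data)
    return log_prior + log_lik

def metropolis(data, n_samples=5000, step=2.0):
    mu = 0.0                                 # current state: a guess for the latent variable
    samples = []
    for _ in range(n_samples):
        proposal = random.gauss(mu, step)    # propose a transition to a new state
        log_accept = log_posterior(proposal, data) - log_posterior(mu, data)
        if random.random() < math.exp(min(0.0, log_accept)):
            mu = proposal                    # note: sometimes moves to a lower-probability state
        samples.append(mu)
    # The time spent at each value of mu is (asymptotically) proportional to its posterior
    # probability, so the samples approximate the full posterior distribution over mu.
    return samples

data = [random.gauss(30.0, 12.0) for _ in range(25)]   # illustrative VOT-like observations
draws = metropolis(data)
print(sum(draws) / len(draws))                          # posterior mean estimate for mu
```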
Modeling category learning
mixture of Gaussian models to learn phonetic categories
• many groups have had success, at least on simple problems (Vallabha et al., 2007; McMurray et al., 2009; Feldman et al., 2009)
• but it seems clear that other information sources are needed too, because categories overlap too much
• next up: work by Pajak and colleagues on another information source
Modeling category learning
[slides on Pajak, Bicknell, & Levy, 2013]
What may help solve this problem
BULGARIAN consonants (Wikipedia: Bulgarian phonology)
[table: the Bulgarian consonant inventory by place of articulation, with hard vs. soft (palatalized) rows and voiced vs. voiceless obstruent pairs; the obstruents form 12 voiced/voiceless pairs, the voiceless velar fricative /x/ has no voiced counterpart, and the voicing contrast is neutralized word-finally]
VOICING   PALATALIZATION
§ Learning may be facilitated by languages' extensive re-use of a set of phonetic dimensions (Clements 2003)
What may help solve this problem
THAI vowels (Wikipedia: Thai language; Tingsabadh & Abramson 1993)
[table: the Thai monophthong inventory (front/back, rounded/unrounded), with every vowel quality occurring in a long/short pair; vowel length is phonemic, distinguishing otherwise unrelated words, and the basic vowels also combine into diphthongs]
LENGTH
§ Learning may be facilitated by languages' extensive re-use of a set of phonetic dimensions (Clements 2003)
§ Existing experimental evidence supports this view
§ both infants and adults generalize newly learned phonetic category distinctions to untrained sounds along the same dimension (McClaskey et al. 1983, Maye et al. 2008, Perfors & Dunbar 2010, Pajak & Levy 2011)
[diagram: analogous distinctions along shared dimensions: length (/s/ vs. /ss/, /n/ vs. /nn/: singleton vs. geminate); voicing (/b/ vs. /p/, /g/ vs. /k/: voiced vs. voiceless)]
§ What are the mechanisms underlying generalization?
§ How do learners make use of information about some phonetic categories when learning other categories?

How do people learn phonetic categories?
§ Pajak et al.'s proposal:
§ in addition to learning specific categories, people also learn category types (e.g., singleton <-> geminate: /s/ vs. /ss/ and /n/ vs. /nn/ along the length dimension)

outline
§ Experimental data illustrating generalization across analogous distinctions (from Pajak & Levy 2011)
§ Our computational proposal for how this kind of generalization might be accomplished
§ Simulations of Pajak & Levy's data
Experimental data (Pajak & Levy 2011)
sound class: FRICATIVES: /s/ vs. /ss/ (length)
sound class: SONORANTS: /n/ vs. /nn/ (length)
(singleton vs. geminate)
§ Distributional training:
§ adult English native speakers exposed to words in a new language, where the middle consonant varied along the length dimension
[aja] 145ms   [ina] 205ms   [ila] 115ms   [ama] 160ms
…
Experimental data (Pajak & Levy 2011): Training
[figure: familiarization frequency over the stimulus length continuum (100–280 ms), bimodal vs. unimodal training, for Expt 1 (sonorants: [n]-[nn], [m]-[mm], [l]-[ll], [j]-[jj]) and Expt 2 (fricatives: [s]-[ss], [f]-[ff], [ʃ]-[ʃʃ], [θ]-[θθ])]
this difference reflects natural distributions of length in different sound classes
Experimental data (Pajak & Levy 2011)
§ Testing:
§ participants made judgments about pairs of words
  Example: [ama]-[amma] "Are these two different words in this language or two repetitions of the same word?"
§ dependent measure: proportion of 'different' responses (as opposed to 'same') on 'different' trials
§ to assess generalization, testing included both trained and untrained sound classes (i.e., both sonorants and fricatives)
§ if learning is successful, we expect: 'DIFFERENT' RESPONSES: Bimodal training > Unimodal training
Experimental data (Pajak & Levy 2011): Testing
[figure: the training distributions with the test stimuli marked, for both the trained and the untrained sound class (e.g., asa-assa and ama-amma); EXPT 1: ALIGNED CATEGORIES, EXPT 2: MISALIGNED CATEGORIES]
Experimental data (Pajak & Levy 2011)
[figure: proportion of 'different' responses (bimodal vs. unimodal) for trained sound classes (learning) and untrained sound classes (generalization); EXPT 1: ALIGNED CATEGORIES, EXPT 2: MISALIGNED CATEGORIES]

Computational modeling
① How can we account for distributional learning?
② How can we account for generalization across sound classes?
Modeling phonetic category learning
§ Mixture of Gaussians approach (de Boer & Kuhl 2003, Vallabha et al. 2007, McMurray et al. 2009, Feldman et al. 2009, Toscano & McMurray 2010, Dillon et al. 2013)
(graphical model: phonetic category z_i → datapoint (perceptual token) d_i, for i ∈ {1..n})
d_i ~ N(μ_zi, σ²_zi)
[illustration: categories /s/ (μ_s, σ_s²) and /ss/ (μ_ss, σ_ss²) along the length dimension; inference runs from observed tokens (120ms, 201ms, 165ms, 182ms, 115ms, …) back to the categories]
Modeling phonetic category learning
§ Our general approach (following Feldman et al. 2009):
§ learning via nonparametric Bayesian inference
§ using Dirichlet processes, which allow the model to learn the number of categories from the data (see the sketch below)
(graphical model: prior H → G0 (concentration γ) → phonetic category z_i → datapoint (perceptual token) d_i, for i ∈ {1..n})
prior H: μ ~ N(μ0, σ²/κ0), σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
z_i ~ G0
d_i ~ N(μ_zi, σ²_zi)
[illustration: inferring categories /s/ (μ_s, σ_s²) and /ss/ (μ_ss, σ_ss²) along the length dimension]
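The Dirichlet process prior over category assignments can be sampled with the Chinese restaurant process, which makes the "learn the number of categories from the data" point concrete. The slides do not spell this out, so the following is a minimal illustrative sketch rather than the models' actual implementation:

```python
import random

def crp_assignments(n, gamma=1.0):
    """Sample category assignments z_1..z_n from a Chinese restaurant process.

    Datapoint i joins an existing category with probability proportional to that
    category's current size, or opens a new category with probability
    proportional to the concentration parameter gamma.
    """
    assignments = []
    counts = []                        # counts[k] = number of points in category k
    for _ in range(n):
        weights = counts + [gamma]     # existing categories, plus one new category
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)           # open a new category
        counts[k] += 1
        assignments.append(k)
    return assignments

# The number of categories is not fixed in advance: different runs (and different
# gamma values) yield partitions with different numbers of categories.
print(crp_assignments(20, gamma=1.0))
```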
Computational modeling
✓ ① How can we account for distributional learning?
② How can we account for generalization across sound classes?

Proposal
§ In addition to acquiring specific categories, learners infer category types, which can be shared across sound classes
§ This means that already learned categories can be directly re-used to categorize other sounds
§ To implement this proposal, we use a hierarchical Dirichlet process, which allows for sharing categories across data groups (here, sound classes)
(e.g., the singleton <-> geminate distinction shared across fricatives, /s/ vs. /ss/, and sonorants, /n/ vs. /nn/, along the length dimension)
Modeling generalization: HDP
(graphical model: prior H → G0 (concentration γ) → class-specific Gc (concentration α0) → phonetic category z_ic → datapoint (perceptual token) d_ic, for i ∈ {1..n_c}, c ∈ C)
prior H: μ ~ N(μ0, σ²/κ0), σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
Gc ~ DP(α0, G0)
z_ic ~ Gc
d_ic ~ N(μ_zic, σ²_zic)
[illustration: shared category types re-used across sound classes; for c = fricatives, fric-sg (μ_fric-sg, σ²_fric-sg) and fric-gem (μ_fric-gem, σ²_fric-gem) along the /s/-/ss/ length dimension; for c = sonorants, son-sg and son-gem along the /n/-/nn/ length dimension]
Modeling generalization: HDP

§ But people are able to generalize even when analogous category types are implemented phonetically in different ways
§ We want the model to account for potential differences between sound classes

[Illustration: singleton/geminate length distributions for fricatives (/s/ vs. /ss/) and sonorants (/n/ vs. /nn/)]
Modeling generalization: HDP

Accounting for differences between sound classes: learnable class-specific 'offsets' by which data in a class are shifted along a phonetic dimension (cf. Dillon et al. 2013)

[Graphical model: prior H generates G0 (concentration γ); each sound class c ∈ C has its own Gc (concentration α0); phonetic category z_ic and class offset f_c (scale σ_f) generate datapoint (perceptual token) d_ic, for i ∈ {1..n_c}]

H:  μ ~ N(μ0, σ²/κ0),  σ² ~ InvChiSq(ν0, σ0²)
G0 ~ DP(γ, H)
Gc ~ DP(α0, G0)
z_ic ~ Gc
f_c ~ N(0, σ²_f)
d_ic ~ N(μ_zic, σ²_zic) + f_c

[Illustration: with class offsets, the same singleton/geminate category types generate the fricative (/s/ vs. /ss/) and sonorant (/n/ vs. /nn/) length distributions]
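A small forward simulation of this generative story may help: shared category types are reused across sound classes, and every datapoint in a class is shifted by that class's offset f_c. For brevity the shared types here are a fixed finite set rather than draws from DP(γ, H), so this is a sketch of the generative assumptions rather than an implementation of the inference model; all numbers are illustrative.

```python
# Forward simulation of shared category types plus class-specific offsets.
import numpy as np

rng = np.random.default_rng(3)

# Shared category types (e.g., singleton vs. geminate) on the length dimension (ms),
# standing in for draws from the base distribution H over (mean, variance) pairs.
shared_types = [
    {"mu": rng.normal(140, 20), "sigma2": 15.0**2},   # singleton-like type
    {"mu": rng.normal(230, 20), "sigma2": 15.0**2},   # geminate-like type
]

sigma2_f = 30.0**2   # prior variance of class-specific offsets

def sample_class(n_tokens, class_weights):
    """Sample tokens for one sound class: pick a shared type per token,
    then add the class's offset f_c to every datapoint."""
    f_c = rng.normal(0.0, np.sqrt(sigma2_f))          # f_c ~ N(0, sigma2_f)
    z = rng.choice(len(shared_types), size=n_tokens, p=class_weights)
    d = np.array([rng.normal(shared_types[k]["mu"],
                             np.sqrt(shared_types[k]["sigma2"])) for k in z]) + f_c
    return d, z, f_c

# Fricatives and sonorants reuse the same singleton/geminate types, but their
# data are shifted to different absolute durations by their offsets.
fric_d, _, fric_offset = sample_class(100, [0.5, 0.5])
son_d, _, son_offset = sample_class(100, [0.5, 0.5])
print("fricative offset:", round(fric_offset, 1), "sonorant offset:", round(son_offset, 1))
```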
Simulation results

[Figure: extended model (Experiments 1 and 2): proportion of 2-category inferences (0.00-1.00) for bimodal vs. unimodal input, for trained and untrained sound classes, shown alongside the human data (proportion of 'different' responses); EXPT 1: ALIGNED CATEGORIES, EXPT 2: MISALIGNED CATEGORIES]
Simulation results: NO OFFSET PARAMETER

[Figure: basic model (no offsets; Experiments 1 and 2): proportion of 2-category inferences for bimodal vs. unimodal input, for trained and untrained sound classes, shown alongside the human data; EXPT 1: ALIGNED CATEGORIES, EXPT 2: MISALIGNED CATEGORIES]

we need the shift parameter to account for generalization across misaligned categories!
Modeling category learning

other work on category learning
• Feldman et al. (2009) add a (latent) lexicon to category learning

[Excerpt from Feldman et al. (2009), as reproduced on the slide:]
[Figure 3 (F1 x F2 vowel spaces, in Hz): Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from men's vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm.]
…neighboring vowel categories. Positing the presence of a lexicon therefore showed evidence of helping the ideal learner disambiguate overlapping vowel categories, even though the phonological forms contained in the lexicon were not given explicitly to the learner.
Pairwise accuracy and completeness measures were computed for each learner as a quantitative measure of model performance (Table 1). For these measures, pairs of vowel tokens that were correctly placed into the same category were counted as a hit; pairs of tokens that were incorrectly assigned to different categories when they should have been in the same category were counted as a miss; and pairs of tokens that were incorrectly assigned to the same category when they should have been in different categories were counted as a false alarm. The accuracy score was computed as hits/(hits + false alarms) and the completeness score as hits/(hits + misses). Both measures were high for the lexical-distributional learner, but accuracy scores were substantially lower for the purely distributional learners, reflecting the fact that these models mistakenly merged several overlapping categories.
Results suggest that, as predicted, a model that uses the input to learn word categories in addition to phonetic categories produces better phonetic category learning results than a model that only learns phonetic categories. Note that the distributional learners are likely to show better performance if they are given dimensions beyond just the first two formants (Vallabha et al., 2007) or if they are given more data points during learning. These two solutions actually work against each other: as dimensions are added, more data are necessary to maintain the same learning outcome. Nevertheless, we do not wish to suggest that a purely distributional learner cannot acquire phonetic categories. The simulations presented here are instead meant to demonstrate that in a language where phonetic categories have substantial overlap, an interactive system, where learners can use information from words that contain particular speech sounds, can increase the robustness of phonetic category learning.

[Figure 4 (F1 x F2 vowel spaces, in Hz): Ellipses delimit the area corresponding to 90% of vowel tokens for Gaussian categories (a) computed from all speakers' vowel productions from Hillenbrand et al. (1995) and learned by the (b) lexical-distributional model, (c) distributional model, and (d) gradient descent algorithm.]
Discussion

This paper has presented a model of phonetic category acquisition that allows interaction between speech sound and word categorization. The model was not given a lexicon a priori, but was allowed to begin learning a lexicon from the data at the same time that it was learning to categorize individual speech sounds, allowing it to take into account the distribution of speech sounds in words. This lexical-distributional learner outperformed a purely distributional learner on a corpus whose categories were based on English vowel categories, showing better disambiguation of overlapping categories from the same number of data points.

Infants learn to segment words from fluent speech around the same time that they begin to show signs of acquiring native language phonetic categories, and they are able to map these segmented words onto tokens heard in isolation (Jusczyk & Aslin, 1995), suggesting that they are performing some sort of rudimentary categorization on the words they hear. Infants may therefore have access to information from words that can help them disambiguate overlapping categories. If information from words can feed back to constrain phonetic category learning, the large degree of overlap be[…]
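For concreteness, here is a small sketch of the pairwise accuracy and completeness measures described in the excerpt above; this is our illustration, not code from Feldman et al. (2009).

```python
# Pairwise accuracy and completeness: pairs in the same true and inferred
# category are hits, same true but different inferred are misses, and
# different true but same inferred are false alarms.
from itertools import combinations

def pairwise_scores(true_labels, inferred_labels):
    hits = misses = false_alarms = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            hits += 1
        elif same_true and not same_inferred:
            misses += 1
        elif same_inferred and not same_true:
            false_alarms += 1
    accuracy = hits / (hits + false_alarms) if hits + false_alarms else 1.0
    completeness = hits / (hits + misses) if hits + misses else 1.0
    return accuracy, completeness

# Merging two true categories lowers accuracy; splitting one lowers completeness.
print(pairwise_scores(["i", "i", "e", "e"], [0, 0, 0, 0]))  # merged -> low accuracy
print(pairwise_scores(["i", "i", "e", "e"], [0, 1, 2, 2]))  # split  -> low completeness
```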
Conclusions

probabilistic models in acquisition

• in general: still working on trying to incorporate many sources of information
• it seems that infants, children, and adults use many sources of information to learn language
• one very exciting information source: other languages
Conclusions

computational psycholinguistics

• this course gave a very broad overview of many areas: perception, comprehension, production, acquisition
• gave a sense for how probabilistic models can be used
• also covered much of the essential technical knowledge to use probabilistic models
• this is still very much the beginning of probabilistic modeling in the study of language: very many open questions and models yet to build!
• hope you enjoyed!