Incorporating syllables into feature-based distributions describing phonotactic patterns
Cesar Koirala
(University of Delaware)
LSA 2013 Boston, Massachusetts
In this presentation…
• I will introduce a method of integrating syllables into a feature-based model of phonotactic learning (Heinz & Koirala 2010).
• I will demonstrate that incorporating syllables provides important contextual information to the model.
• I will discuss how this project exemplifies the incorporation of feature interactions into feature-based models.
Outline: Introduction · Feature Interaction · Heinz & Koirala Model · Syllables · Discussions
General schema of the presentation
• Feature interaction
• Broad picture: What is a feature-based model with syllables? Where does it fit in the broad categorization of learning models of phonotactics?
• A brief introduction to the Heinz & Koirala model and its baseline version
• A method of integrating syllables into the model
• An illustrative example
Feature Interaction
• Features F and G interact if there is a phonotactic constraint in a language that targets the co-occurrence of F and G in some context.
• In English, the features [dorsal] and [nasal] interact because of the existence of the constraint *[#ŋ], i.e. *#[+dorsal, +nasal], while there is no constraint *#[+dorsal] or *#[+nasal].
• Statistically speaking, the two features are not independent.
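To make "not independent" concrete, here is a minimal sketch with invented toy counts (the numbers are hypothetical, chosen only to mirror the English pattern above):

```python
# Toy sketch (hypothetical counts): feature interaction as statistical
# non-independence of [dorsal] and [nasal] in word-initial position.
# Counts of word-initial segments, cross-classified by the two features.
counts = {
    ("+dorsal", "+nasal"): 0,    # *#ŋ: velar nasals never occur word-initially
    ("+dorsal", "-nasal"): 40,   # e.g. word-initial k, g
    ("-dorsal", "+nasal"): 30,   # e.g. word-initial m, n
    ("-dorsal", "-nasal"): 130,  # everything else
}
total = sum(counts.values())

# Marginal probabilities of each feature value
p_dorsal = sum(v for (f, _), v in counts.items() if f == "+dorsal") / total
p_nasal = sum(v for (_, g), v in counts.items() if g == "+nasal") / total

# Under independence we would expect P(+dorsal, +nasal) = P(+dorsal) * P(+nasal)
p_joint = counts[("+dorsal", "+nasal")] / total
p_expected = p_dorsal * p_nasal

print(p_joint)     # 0.0: the constraint *#[+dorsal, +nasal]
print(p_expected)  # 0.03: an independence model wrongly predicts #ŋ is possible
```

The observed joint probability is zero while the product of the marginals is not, which is exactly the sense in which the two features fail to be independent.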
Theories of Feature Interaction
• The most restrictive theory of feature interaction:
  - No two (or more) features interact.
  - Statistically speaking, all features are independent.
• The least restrictive theory of feature interaction:
  - Any group of features may (or may not) interact.
Space of possibilities of interaction
No features interact (most restrictive): F1, F2, F3, F4, F5, F6, F7, F8, …
Two features interact: F1F2, F1F3, F1F4, …
…
All features interact (least restrictive): F1F2…Fn (this is the n-gram model)
In between lies a huge space of possible feature interactions.
Research Question
• What is the nature of feature interaction in the world's languages?
• General hypothesis: the nature of feature interaction is not arbitrary.
• Research strategy: the huge space of possibilities does not need to be searched. If we understand the nature of feature interaction, we can search a much smaller, more constrained space.
• My thesis examines this question typologically, experimentally, and with a computational model.
• In this talk I will focus only on the computational model.
Computational models of Phonotactic learning
Phonotactic learner: The task of the model (a computer program) is to observe data from a corpus and learn a phonotactic grammar that reflects the linguistic generalizations present in the data.
Broad Categorization of Phonotactic Models

Lexicalist learning models of phonotactics fall into three groups:

• Non supra-segmental models: assign well-formedness on the basis of the lexical frequency of non supra-segmental units.
  - Segment-based: n-gram models; Vitevitch & Luce (2004)
  - Feature-based: Hayes & Wilson (2008); Albright (2009); Heinz & Koirala (2010)
• Supra-segmental models: assign well-formedness on the basis of the lexical frequency of supra-segmental units, e.g. syllable-based models: Coleman & Pierrehumbert (1997)
• Models combining both: use both segmental and supra-segmental information: the Heinz & Koirala model with syllables; Daland et al. (2011)
Motivation for feature-based models with syllables
• Feature-based models (Hayes & Wilson 2008; Albright 2009) uniformly benefit from incorporating syllables when predicting sonority projection effects (Daland et al. 2011).
• Models based on syllables alone (e.g., Coleman & Pierrehumbert 1997) require the onset and coda constituents to be independently well-formed in order to impose any kind of phonotactics.
Heinz & Koirala Model
• Heinz & Koirala (2010) present a computational learning model of phonotactics that generalizes on the basis of phonological features. Its baseline version assumes statistical independence of the individual features (no feature interaction).
Heinz & Koirala Model
• The basic idea:
  1. Probability distributions over segments can be defined as a normalized product of simple distributions over features.
  2. These simple distributions can be distributions of individual features or distributions of combinations of features.
  3. Probabilistic Deterministic Finite Automata (PDFAs) can be used to represent the features and their distributions.
• Advantages:
  1. Fewer parameters than an n-gram model, so less training is necessary.
  2. Mathematically sound: a well-formed probability distribution.
  3. Captures intuitions transparently: which sounds count as "like sounds" (having like distributions in like contexts) is determined by the features that do NOT interact. In the baseline version, this means all like sounds have like distributions in like contexts.
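The first point of the basic idea can be sketched with the feature system used in the illustration below (∑ = {a, b, c}, features F and G); the per-feature probabilities here are hypothetical placeholders:

```python
# Sketch of the basic idea: a distribution over segments defined as the
# normalized product of independent distributions over feature values.
# Feature system from the illustration: sigma = {a, b, c}, features F and G.
features = {"a": ("+F", "-G"), "b": ("+F", "+G"), "c": ("-F", "+G")}

# Hypothetical per-feature distributions (in practice, estimated from a corpus)
p_F = {"+F": 0.7, "-F": 0.3}
p_G = {"+G": 0.6, "-G": 0.4}

# Unnormalized score of each segment = product of its feature probabilities
score = {s: p_F[f] * p_G[g] for s, (f, g) in features.items()}

# Z normalizes over the segments that actually exist in sigma, so the
# result is a well-formed probability distribution.
Z = sum(score.values())
p_segment = {s: v / Z for s, v in score.items()}

print(p_segment)
```

Note that the normalization Z is needed because not every combination of feature values corresponds to a segment in the alphabet (here, (-F, -G) names no segment).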
Illustration
(1) A binary feature system with ∑ = {a, b, c} and two features F and G:

   F  G
a  +  -
b  +  +
c  -  +

(2) PDFAs for features F and G (figure).

(3) A fragment of the product machine of the PDFAs for F and G (figure). Z is a normalizing term that ensures a well-formed probability distribution.
Illustration
• The maximum likelihood estimate of the model given the corpus is obtained by passing the corpus through the factor PDFAs (not the product machine).
• The parse of the sample through the factor machines is counted and normalized to obtain the distributions of the individual features.
• The normalized product of these PDFAs gives the actual probability distribution.
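A minimal sketch of this training step, using a hypothetical three-word toy corpus over the illustration's alphabet {a, b, c}; each factor machine is modeled here as a simple bigram over one feature's values:

```python
from collections import defaultdict

# Sketch of MLE training (hypothetical toy corpus): the corpus is passed
# through a factor PDFA for each feature; here the factor machine for a
# single feature is a bigram model over that feature's values.
features = {"a": ("+F", "-G"), "b": ("+F", "+G"), "c": ("-F", "+G")}
corpus = ["ab", "abc", "cab"]

def train_factor(corpus, idx):
    """Count and normalize transitions over the values of one feature."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        prev = "#"  # word-boundary state
        for seg in word:
            val = features[seg][idx]
            counts[prev][val] += 1
            prev = val
    # Normalize each state's outgoing counts into a distribution (MLE)
    return {state: {v: n / sum(out.values()) for v, n in out.items()}
            for state, out in counts.items()}

p_F = train_factor(corpus, 0)  # factor machine for feature F
p_G = train_factor(corpus, 1)  # factor machine for feature G
print(p_F["#"])  # two of the three words start with a (+F) segment
```

The actual model's factor machines and their product are richer than this bigram sketch, but the counting-then-normalizing step is the same in spirit.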
How restrictive is the baseline version in terms of feature interaction?
• As mentioned, the baseline version implements the most restrictive theory: no features interact!
• The baseline version is unable to learn feature interaction.
• For instance, since the baseline version observes nasals and velars word-initially in English, it incorrectly predicts that the velar nasal can occur there too.
• A non-baseline version that a priori allows two features to interact can learn that [+dorsal] and [+nasal] cannot co-occur word-initially in English.
• In general, the model can allow any fixed number of features to interact.
Allowing featural interaction in the model
• Generalizations can be made on the basis of more than one feature.
• If F is [dorsal] and G is [nasal], the model learns that [+dorsal] and [+nasal] cannot co-occur word-initially in English, while non-nasal velars and non-velar nasals can.

   F  G
a  +  -
b  +  +
c  -  +
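A toy sketch of the contrast (invented mini-corpus; "N" stands in for the velar nasal): estimating a joint distribution over (dorsal, nasal) value pairs captures the word-initial gap that independent marginals miss:

```python
from collections import Counter

# Toy sketch (hypothetical mini-corpus): letting [dorsal] and [nasal]
# interact means estimating one joint distribution over value *pairs*,
# rather than two independent distributions over single values.
feats = {"k": ("+dors", "-nas"), "n": ("-dors", "+nas"),
         "N": ("+dors", "+nas"), "t": ("-dors", "-nas")}  # N = velar nasal
initials = ["k", "n", "t", "k", "n", "t", "t"]  # word-initial segments; no "N"

# Independent model: product of marginal word-initial probabilities
dors = Counter(feats[s][0] for s in initials)
nas = Counter(feats[s][1] for s in initials)
n = len(initials)
p_indep = (dors["+dors"] / n) * (nas["+nas"] / n)

# Interacting model: joint distribution over (dorsal, nasal) pairs
joint = Counter(feats[s] for s in initials)
p_joint = joint[("+dors", "+nas")] / n

print(p_indep > 0)   # True: independence wrongly licenses word-initial ŋ
print(p_joint == 0)  # True: the joint model learns *#[+dorsal, +nasal]
```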
How restrictive is the baseline version in terms of feature interaction?
The baseline version sits at the most restrictive point of this space, where no features interact (F1, F2, F3, F4, …). The version presented here a priori allows two features to interact (F1F2, F1F3, F1F4, …). At the least restrictive end, all features interact (F1F2…Fn), and the space of possible interactions in between is huge.
Our strategy: we do not need to search the whole space, as it is a priori restricted in some fashion by UG.
Factoring syllables into the model
• We use a single multivalued feature SYLL with the values onset, coda, and nucleus.
• A consonant in the onset position of a syllable has the SYLL value onset; a consonant in the coda position has the SYLL value coda; syllabic sounds have the SYLL value nucleus.
• This representation enables us to obtain the distribution of syllable structure in the language, which can be represented by a PDFA.
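A minimal sketch of such a syllable-structure PDFA (all transition probabilities are hypothetical; states track the previous SYLL value):

```python
# Minimal sketch (hypothetical probabilities) of a PDFA over the SYLL
# values onset, nucleus, coda. Missing arcs (e.g. onset -> coda) get
# probability zero, encoding that an onset must be followed by a nucleus.
transitions = {
    "#":       {"onset": 0.7, "nucleus": 0.3},           # word start
    "onset":   {"onset": 0.2, "nucleus": 0.8},           # complex onsets
    "nucleus": {"coda": 0.4, "onset": 0.3, "end": 0.3},
    "coda":    {"coda": 0.1, "onset": 0.5, "end": 0.4},  # complex codas
}

def word_prob(syll_values):
    """Probability of a SYLL-value sequence under the PDFA."""
    p, state = 1.0, "#"
    for v in syll_values + ["end"]:
        p *= transitions[state].get(v, 0.0)
        state = v if v != "end" else state
    return p

print(word_prob(["onset", "nucleus", "coda"]))  # a CV C word: > 0
print(word_prob(["onset", "coda"]))             # ill-formed: 0.0
```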
The Sonority Sequencing Principle
• Sonority Sequencing Principle/Generalisation (Selkirk 1984:116): In any syllable, there is a segment constituting a sonority peak that is preceded and/or followed by a sequence of segments with progressively decreasing sonority values.
• This typological generalization is a statement of featural interaction. Features important to sonority interact with syllabic features.
Factoring syllables into the model
• We use a single multivalued feature Sonority with the values stop, fricative, affricate, nasal, liquid, glide, and vowel.
• A non-baseline version that a priori allows the syllabic feature and the sonority feature to interact is able to learn SSP-like behavior.
• This means training is done on the product of the syllabic PDFA and the sonority PDFA.
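A sketch of what training on the product machine amounts to (toy data, hypothetical annotation): each state is now a pair of a SYLL value and a sonority class, so co-occurrence restrictions between the two features become learnable:

```python
from collections import defaultdict

# Sketch (toy data) of training on the product of the syllabic and
# sonority machines: the state tracks a *pair* (SYLL value, sonority
# class), so the two features interact and SSP-like gaps can be learned.
word = [("onset", "stop"), ("onset", "liquid"),  # e.g. a "pr..." onset
        ("nucleus", "vowel"), ("coda", "nasal")]

counts = defaultdict(lambda: defaultdict(int))
prev = ("#", "#")
for pair in word:
    counts[prev][pair] += 1
    prev = pair

# An onset stop -> liquid transition was observed; its mirror image,
# onset liquid -> stop, was not, so it receives zero count (and hence
# zero probability after normalization), as the SSP predicts.
print(counts[("onset", "stop")][("onset", "liquid")])  # 1
print(counts[("onset", "liquid")][("onset", "stop")])  # 0
```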
(Figure: the sonority PDFA)
Factoring syllables into the model
Training data
• CELEX2 English lemma corpus with pronunciations taken from the CMU Pronouncing Dictionary (Chandlee 2012).
• Stress markings were removed.
• The training corpus consisted of 23,911 words.
• Training was done on the version of the corpus that reflected information about onsets and codas.
P R AH P OW Z AH L → P_ons R_ons AH P_ons OW Z_ons AH L_cod
P R AH P OW Z → P_ons R_ons AH P_ons OW Z_cod
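The annotation step can be sketched as follows; the helper below is hypothetical and assumes the syllabification is already given (real CMU transcriptions use a larger vowel set than this abbreviated one):

```python
# Sketch of the corpus annotation step (hypothetical helper): given a
# CMU-style transcription and a syllabification, label each consonant
# with its SYLL value, as in P R AH P OW Z -> P_ons R_ons AH P_ons OW Z_cod.
VOWELS = {"AH", "OW", "IY", "EH", "AE"}  # abbreviated for the sketch

def annotate(syllables):
    """syllables: list of phone lists, one list per syllable."""
    out = []
    for syl in syllables:
        nucleus = next(i for i, p in enumerate(syl) if p in VOWELS)
        for i, p in enumerate(syl):
            if p in VOWELS:
                out.append(p)            # nuclei keep their symbol
            elif i < nucleus:
                out.append(p + "_ons")   # before the nucleus: onset
            else:
                out.append(p + "_cod")   # after the nucleus: coda
    return out

# "propose": P R AH . P OW Z
print(annotate([["P", "R", "AH"], ["P", "OW", "Z"]]))
```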
Factoring syllables into the model
Illustrative example (English consonant clusters): the model assigns different probability values to the same clusters in onset and coda positions.
Factoring syllables into the model
Illustrative example (English consonant clusters): the results show that the model does not just assign different probabilities to the same consonant clusters in onset and coda positions; it has also learned SSP-like behavior. For example:
1. The model assigns higher probabilities to X-stop coda clusters than to X-stop onset clusters, as predicted by the SSP. The only X-stop onset clusters with a non-zero probability are fricative-stop clusters (probably due to s-stop onsets).
Factoring syllables into the model
Illustrative example (English consonant clusters), continued:
2. Similarly, the model assigns higher probabilities to X-fricative coda clusters than to X-fricative onset clusters, as predicted by the SSP. The only two onset clusters with non-zero probabilities are fricative-fricative and stop-fricative.
3. The only X-affricate clusters with non-zero values are nasal-affricate clusters ('bench') and liquid-affricate clusters ('arch') in coda position.
Conclusion
• We introduced a way of integrating syllables into the Heinz & Koirala model.
• Unlike the baseline version of the Heinz & Koirala model, this version a priori allowed certain features (from the syllable PDFA and the sonority PDFA) to interact.
• This approach directly incorporated the structure of the SSP into the model via this interaction.
• Consequently, the model finds different probability distributions for the same consonant cluster in different syllabic positions.
Thank you!
This research is supported by NSF Linguistics DDRIG grant #1226793. I thank Jeff Heinz for valuable discussion that set the foundation for this work. I would also like to thank the University of Delaware's phonology/phonetics group and the audiences at NECPhon 2012 for their comments and suggestions.