Word segmentation in Russian and English: A computational comparison. Robert Daland. Presented to the 3rd Annual Meeting of the Slavic Linguistics Society in Columbus, Ohio, June 11, 2008.

Word segmentation in Russian and English: A computational ...gradstudents.wcas.northwestern.edu/~rtd885/papers/... · Word segmentation in Russian and English: A computational comparison


Word segmentation in Russian and English: A computational comparison

Robert Daland
Presented to the 3rd Annual Meeting of the Slavic Linguistics Society in Columbus, Ohio
June 11, 2008


The Research Problem: Word Segmentation


Fluent listeners hear speech as a sequence of discrete words.

But there are no pauses in the waveform...


Word segmentation

• Listener's Problem:

• Solution:
  – Find all word boundaries
  – Don't find any word non-boundaries

yesterdayiwenttoasinamaliandisawthecutestdress

yesterdayiwenttoasinamaliandisawthecutestdress


What kind of information?

• Acoustic cues not reliable within-phrase
• Remaining possibilities:
  – Lexical (word recognition)
  – Phonetic/Phonological (statistical cues)
  – Some combination of both
• Acquisition: segmentation before recognition
  – From little to no segmentation (6 months) to unfamiliar polysyllabic iambs (10.5 months)
    • Saffran, Aslin, & Newport, 1996; Jusczyk, Hohne & Baumann, 1999; Jusczyk, Houston, & Newsome, 1999
  – Mother reports: infants recognize 20-40 words


Proposed Solution: Diphone Segmentation Hypothesis


Diphone probabilities

• Diphone – pair of successive segments
  – e.g. [pd] as in top.dog
  – e.g. [ŋg] as in an.ger
• Proposal: listener estimates probability of word boundary between constituent phones
  – p_xy = freq[x#y] / freq[xy]
  – Hear word boundary when p is high
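As a minimal sketch of the estimate on this slide, the following counts diphones in a toy corpus of already-segmented phrases. It follows the slide's formula literally, so freq[xy] is taken to count every occurrence of x followed by y, whether or not a word boundary intervenes. The function name and the toy data are hypothetical, and letters stand in for phonetic segments purely for illustration.

```python
from collections import Counter

def diphone_boundary_probs(phrases):
    """Estimate p_xy = freq[x#y] / freq[xy] from segmented phrases.

    phrases: list of phrases, each a list of words, each word a string
    of one-character segments (an illustrative simplification).
    """
    total = Counter()     # freq[xy]: x followed by y, boundary or not
    spanning = Counter()  # freq[x#y]: x followed by y across a word boundary
    for words in phrases:
        segments = "".join(words)
        # Indices into `segments` at which a new word starts.
        starts = set()
        pos = 0
        for w in words[:-1]:
            pos += len(w)
            starts.add(pos)
        for i in range(1, len(segments)):
            xy = segments[i - 1] + segments[i]
            total[xy] += 1
            if i in starts:
                spanning[xy] += 1
    return {xy: spanning[xy] / total[xy] for xy in total}

# Toy data: "pd" spans a boundary in "top dog" but is word-internal in "update".
toy = [["top", "dog"], ["hot", "dog"], ["update"]]
probs = diphone_boundary_probs(toy)
print(probs["pd"], probs["td"], probs["do"])
```

In this toy corpus "pd" comes out at 0.5, "td" at 1.0, and "do" at 0.0; a real estimate would of course need a phonetically transcribed corpus.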


 

Empirical Evidence

[Figure: word-internal (WI) vs. word-spanning (WS) diphones, from Hockema (2006). See also Mattys & Jusczyk (2001).]


Experiment I: English


Diphone Segmentation: English

• Corpus: Map British National Corpus (BNC) to phonetic representation with CELEX
• Model: Calculate diphone probabilities from phonetic BNC as in Hockema (2006)
• Test: Try to reconstruct word boundaries in BNC
  – guess word boundary only if p_WB > .5
• Analysis: Signal detection theory
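The test step can be sketched as follows: walk through an unsegmented string and posit a word boundary wherever the diphone's boundary probability exceeds .5. The probabilities below are hypothetical values chosen for illustration, not figures from the BNC, and treating unseen diphones as "no boundary" is an assumption of this sketch.

```python
def segment(segments, probs, threshold=0.5):
    """Insert a word boundary between x and y whenever p_xy > threshold."""
    if not segments:
        return []
    words = []
    current = segments[0]
    for prev, nxt in zip(segments, segments[1:]):
        # Unseen diphones default to 0.0 (no boundary); this is an assumption.
        if probs.get(prev + nxt, 0.0) > threshold:
            words.append(current)
            current = nxt
        else:
            current += nxt
    words.append(current)
    return words

# Hypothetical probabilities for illustration only.
toy_probs = {"to": 0.0, "op": 0.1, "pd": 0.9, "do": 0.0, "og": 0.0}
print(segment("topdog", toy_probs))  # ['top', 'dog']
```

Because a boundary is guessed only when the evidence favors it, a sketch like this errs toward leaving words joined rather than splitting them.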


Signal Detection Theory

Model              Model says: WB     Model says: Not WB       Recall Rates
Correct: WB        Hits (#)           Misses (#)               Recall
Correct: Not WB    False alarms (#)   Correct Rejections (#)   Corecall
Precision Rates    Precision          Coprecision              Accuracy

• Hit: WB is there, model finds it
• Miss: WB is there, model misses it
• False alarm: WB is not there, model says it is
• Correct rejection: WB is not there and model says so
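The five rates in the table can be computed from the four counts. The sketch below reads recall and corecall along the rows and precision and coprecision down the columns, which is how the table appears to be laid out. The function name and the example counts are hypothetical.

```python
def sdt_rates(hits, misses, false_alarms, correct_rejections):
    """Return the five rates named in the table, as fractions."""
    total = hits + misses + false_alarms + correct_rejections
    return {
        # Row-wise: of the true (non-)boundaries, how many were labeled correctly.
        "recall": hits / (hits + misses),
        "corecall": correct_rejections / (false_alarms + correct_rejections),
        # Column-wise: of the model's guesses, how many were right.
        "precision": hits / (hits + false_alarms),
        "coprecision": correct_rejections / (misses + correct_rejections),
        "accuracy": (hits + correct_rejections) / total,
    }

# Hypothetical counts for illustration.
rates = sdt_rates(hits=8, misses=2, false_alarms=1, correct_rejections=39)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```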


Results – English

English            Model says: WB   Model says: Not WB   Recall Rates
Correct: WB        60.6             19.4                 75.8%
Correct: Not WB    8.9              283.9                95.0%
Precision Rates    87.2%            93.6%                92.4%

• Chance: same # of word boundaries occur randomly
• Overall better performance than chance
• Especially: Lower false alarms/Higher Corecall
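One way to make the chance comparison concrete is sketched below. It assumes that "chance" means placing as many boundaries as the model guessed (hits plus false alarms) uniformly at random over all positions; the slide does not spell out the baseline, so this reading and the function name are assumptions. The counts are taken from the English table. Note that the counts as shown give a corecall of about 97.0% (283.9 / 292.8) rather than the 95.0% in the table; the sketch works from the counts.

```python
def chance_rates(true_wb, true_not_wb, guessed_wb):
    """Expected rates if guessed_wb boundaries were placed uniformly at random."""
    total = true_wb + true_not_wb
    base_rate = true_wb / total        # share of positions that are true boundaries
    hits = guessed_wb * base_rate      # expected hits under random placement
    false_alarms = guessed_wb - hits   # expected false alarms
    return {
        "recall": hits / true_wb,
        "corecall": 1 - false_alarms / true_not_wb,
        "precision": base_rate,
    }

# Counts from the English table (units as given on the slide).
hits, misses, false_alarms, correct_rejections = 60.6, 19.4, 8.9, 283.9
chance = chance_rates(
    true_wb=hits + misses,
    true_not_wb=false_alarms + correct_rejections,
    guessed_wb=hits + false_alarms,
)
for name, value in chance.items():
    print(f"{name}: {value:.1%}")
```

Under this assumed baseline, chance recall is about 19%, chance precision about 21%, and chance corecall about 81%, all well below the model's figures.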


Discussion – English

• Implications
  – Diphones are a highly reliable cue to the absence of a word boundary
  – Fairly reliable cue to the presence of a word boundary
• Predicted perceptual pattern: Undersegmentation
  – True words are not split up in perception
  – True words may be glommed together
    • E.g. want to → wantto
  – Consistent with usage-based and construction grammar approaches (Fillmore & Kay, 1996; Bybee & Hopper, 2001)


Limitations

• Is diphone segmentation cross-linguistically robust?
  – If it only works for English, limited interest
  – If it works for other languages, promising support for cognitive universal
• Could success be English-specific?
  – Highly complex syllable structure
  – Impoverished inflectional system
• Test model on Slavic language
  – Need comparable language resources


Russian National Corpus (RNC)

• ~70 million words, balanced textual corpus
• Explicitly modeled after BNC
• Sample:
    Коммунистическая   A   коммунистический
    партия             S   партия
    Российской         A   российский
    Федерации          S   федерация


Zalizniak's Dictionary (Словарь Зализняка), electronic version

• Contains phonological and inflectional paradigm code for headwords
  – Stress in headword
  – Stress pattern (and inflection table)
• Sample:
    абрис            1   м        1а    contour
    абрисный         1   п        1*а
    абсорбировать    7   св-нсв   2а    absorb


Experiment II: Russian


Diphone Segmentation: Russian

• Corpus: Map Russian National Corpus (RNC) to phonetic representation with Zalizniak
• Model: Calculate diphone probabilities from phonetic RNC as in Hockema (2006)
• Test: Try to reconstruct word boundaries in RNC
  – guess word boundary only if p_WB > .5
• Analysis: Signal detection theory


Generate phonetic representation

1) Orthographic-to-phonemic mapping
  • Palatalization/soft/hard sign
2) Calculate/Guess stress position
  • Lookup stress pattern from Zalizniak
3) Phonemic-to-phonetic mapping
  • Voicing, palatalization, place, manner assimilation
  • Vowel reduction

• Phrases: punctuation --> phrase boundaries, spaces --> word boundaries (except в, с, к)
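The boundary convention in the last bullet can be sketched as follows: punctuation splits the text into phrases, spaces split phrases into words, and the prepositions в, с, к are attached to the following word instead of standing alone. The function name, the particular punctuation set, and the handling of a stranded preposition at the end of a phrase are assumptions of this sketch. It works on orthography only and does not attempt the phonemic or phonetic mapping steps.

```python
import re

# Single-consonant prepositions that the slide exempts from word boundaries.
PROCLITICS = {"в", "с", "к"}

def phrases_and_words(text):
    """Split text into phrases at punctuation and into words at spaces,
    attaching в/с/к to the following word."""
    phrases = []
    # The punctuation set here is an assumption, not taken from the slides.
    for chunk in re.split(r'[.,;:!?()"«»]+|\s[-–]\s', text):
        words = []
        prefix = ""
        for token in chunk.split():
            if token.lower() in PROCLITICS:
                prefix += token          # no word boundary after the preposition
            else:
                words.append(prefix + token)
                prefix = ""
        if prefix:                       # preposition stranded at the phrase end
            words.append(prefix)
        if words:
            phrases.append(words)
    return phrases

example = phrases_and_words("Он пришёл к нам, и мы пошли в парк.")
print(example)
```

Here the comma and the final period yield two phrases, and "к нам" and "в парк" each count as a single word, so no word boundary is recorded after the preposition.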


Results – Russian

Russian            Model says: WB   Model says: Not WB   Recall Rates
Correct: WB        11.4             14.0                 44.8%
Correct: Not WB    2.4              180.9                98.7%
Precision Rates    82.4%            92.8%                92.1%

• Same pattern of performance
• Overall better performance than chance
• Especially: Lower false alarms/Higher Corecall


Results – Both Languages

Russian            Model says: WB   Model says: Not WB   Recall Rates
Correct: WB        11.4             14.0                 44.8%
Correct: Not WB    2.4              180.9                98.7%
Precision Rates    82.4%            92.8%                92.1%

English            Model says: WB   Model says: Not WB   Recall Rates
Correct: WB        60.6             19.4                 75.8%
Correct: Not WB    8.9              283.9                95.0%
Precision Rates    87.2%            93.6%                92.4%


General Discussion

• Model exhibited a very similar pattern of results on both languages
  – Very low false alarm rate
  – Undersegmentation
• Fundamental differences in language structure
  – Prosodic: syllable structure
  – Morphological: inflectional structure
• Promising for Diphone Segmentation Hypothesis
  – Same cognitive mechanism, different language experience, same perceptual outcome
  – Plausible candidate for cross-linguistic universal


Future directions

• Investigate error patterns
  – False alarms and misses
  – Word-learning

• Bootstrapping diphone statistics from phrase boundaries

• Natural phonetic variation


Acknowledgements

Elisabeth Elliott
Natalia Gagarina
Janet Pierrehumbert
Serge Sharoff
Andrea Sims


References

• Bybee, J. & Hopper, P. (2001). Frequency and the Emergence of Linguistic Structure. Amsterdam: John Benjamins.
• Fillmore, C.J. & Kay, P. (1996). Construction Grammar. Manuscript, University of California at Berkeley Department of Linguistics.
• Jusczyk, P.W., Hohne, E.A., & Baumann, A. (1999). Infants' sensitivity to allophonic cues for word segmentation. Perception & Psychophysics, 61, 1465-1476.
• Jusczyk, P.W., Houston, D., & Newsome, M. (1999). The beginnings of word segmentation in English-learning infants. Cognitive Psychology, 39, 159-207.
• Hockema, S.A. (2006). Finding words in speech: An investigation of American English. Language Learning and Development, 2(2), 119-146.
• Mattys, S.L. & Jusczyk, P.W. (2001). Phonotactic cues for segmentation of fluent speech by infants. Cognition, 78, 91-121.
• Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.