
Syllables and Concepts in Large Vocabulary Continuous Speech Recognition

Paul De Palma
Ph.D. Candidate, Department of Linguistics, University of New Mexico
Slides available at: www.cs.gonzaga.edu/depalma

An Engineered Artifact

Syllables
- A principled word segmentation scheme
- No claim about human syllabification

Concepts
- Words and phrases with similar meanings
- No claim about cognition

Reducing the Search Space

ASR answers the question: what is the most likely sequence of words given an acoustic signal? To do so, it considers many candidate word sequences.

To reduce the search space, reduce the number of candidates:
- Using syllables in the language model
- Using concepts in a concept component

Syllables in the LM: Why?

[Figure: Switchboard corpus, cumulative frequency as a function of frequency rank (Greenberg, 1999, p. 167)]

Most Frequent Words Are Monosyllabic

Syllables per Word | % of Corpus by Token | % of Corpus by Type
1 | 81.04 | 22.39
2 | 14.30 | 39.76
3 | 3.50 | 24.26
4 | .96 | 9.91
5 | .18 | 3.21
6 | .02 | .40

(Greenberg, 1999, p. 167)

- Polysyllabic words are easier to recognize (Hamalainen, et al., 2007)
- And (of course) there are fewer syllables than words

Reduce the Search Space 2: Concept Component

Word map (GO) -> syllable map (GO):
A flight -> ax f_l_ay_td
A ticket -> ax t_ih k_ax_td
book airline travel -> b_uh_kd eh_r l_ay_n t_r_ae v_ax_l
book reservations -> b_uh_kd r_eh s_axr v_ey sh_ax_n_z
Create a reservation -> k_r_iy ey_td ax r_eh z_axr v_ey sh_ax_n
Departing -> d_ax p_aa_r dx_ix_ng
Fly -> f_l_ay
Flying -> f_l_ay ix_ng
get -> g_eh_td
I am leaving -> ay ae_m l_iy v_ix_ng

The (Simplified) Architecture of an LVCSR System

- Feature Extractor: transforms an acoustic signal into a sequence of 39-dimensional feature vectors; the province of digital signal processing
- Acoustic Model: a collection of probabilities of acoustic observations given word sequences
- Language Model: a collection of probabilities of word sequences
- Decoder: guesses a probable sequence of words given an acoustic signal by searching the product of the probabilities found in the acoustic and language models

Simplified Schematic

signal -> Feature Extractor -> Decoder (Acoustic Model + Language Model) -> words

Enhanced Recognizer

signal -> Feature Extractor (assumed) -> Decoder (assumed), with Acoustic Model P(O|S) (assumed) and Syllable Language Model P(S) (my work) -> syllables -> Concept Component (my work) -> syllables, concepts

How ASR Works

Input is a sequence of acoustic observations:
O = o_1, o_2, ..., o_t
Output is a string of words:
W = w_1, w_2, ..., w_n
Then:

The hypothesized word sequence is that string W in the target language with the greatest probability given a sequence of acoustic observations:

Ŵ = argmax_{W ∈ L} P(W | O)    (1)

Operationalizing Equation 1

Using Bayes' rule, the posterior in (1) can be rewritten:

P(W | O) = P(O | W) P(W) / P(O)    (2)

so that

Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)    (3)

Since the acoustic signal O is the same for every candidate, (3) can be rewritten:

Ŵ = argmax_{W ∈ L} P(O | W) P(W)    (4)

- Acoustic Model: the likelihood P(O | W)
- Language Model: the prior probability P(W)
- Decoder: searches for the W that maximizes their product
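To make the search in (4) concrete, here is a minimal Python sketch. The candidate transcriptions and their acoustic- and language-model probabilities are invented for illustration; a real decoder searches a vastly larger space.

```python
# Toy version of equation (4): pick the candidate word sequence W that
# maximizes P(O|W) * P(W). All numbers below are made up.

candidates = {
    # W:                    (P(O|W),  P(W))
    "flights to seattle":   (1.2e-9, 3.0e-4),
    "fights to seattle":    (1.5e-9, 2.0e-5),
    "flights two seattle":  (1.1e-9, 1.0e-6),
}

def decode(candidates):
    # argmax over the product of acoustic and language model scores
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

print(decode(candidates))  # -> flights to seattle
```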

LM: A Closer Look

A language model is a collection of probabilities of word sequences:

P(W) = P(w_1 ... w_n)    (5)

This can be decomposed by the probability chain rule:

P(w_1 ... w_n) = ∏_{i=1}^{n} P(w_i | w_1 ... w_{i-1})    (6)

Markov Assumption

Approximate the full decomposition of (6) by looking only a specified number of words into the past:
- Bigram: 1 word into the past
- Trigram: 2 words into the past
- N-gram: n-1 words into the past

Bigram Language Model

Def. bigram probability:

P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})    (7)

Minicorpus:

paul wrote his thesis
james wrote a different thesis
paul wrote a thesis suggested by george
the thesis
jane wrote the poem

With <s> and </s> marking utterance boundaries, e.g.:

P(paul | <s>) = count(<s> paul) / count(<s>) = 2/5

P(paul wrote a thesis) = P(paul|<s>) * P(wrote|paul) * P(a|wrote) * P(thesis|a) * P(</s>|thesis) = .075

P(paul wrote the thesis) = P(paul|<s>) * P(wrote|paul) * P(the|wrote) * P(thesis|the) * P(</s>|thesis) = .0375
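A minimal Python sketch that rebuilds these numbers from the minicorpus, using the standard <s>/</s> boundary convention:

```python
# Bigram model over the minicorpus, per equation (7).
from collections import Counter

corpus = [
    "paul wrote his thesis",
    "james wrote a different thesis",
    "paul wrote a thesis suggested by george",
    "the thesis",
    "jane wrote the poem",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w, prev):
    # equation (7): count(prev w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(sent):
    words = ["<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p(w, prev)
    return prob

print(p_sentence("paul wrote a thesis"))    # 0.075
print(p_sentence("paul wrote the thesis"))  # 0.0375
```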

Experiment 1: Perplexity

- Perplexity: PP(X)
- Functionally related to entropy: H(X)
- Entropy is a measure of information

Hypothesis: PP(X) of the syllable LM < PP(X) of the word LM, i.e., the syllable LM contains more information.

Definitions

Let X be a random variable and p(x) be its probability function.

Defs:

H(X) = -∑_{x ∈ X} p(x) * lg p(x)    (1)

PP(X) = 2^{H(X)}    (2)

Given certain assumptions(1) and the definition of H(X), PP(X) can be transformed to:

P(w_1 ... w_n)^{-1/n}

Perplexity is the inverse nth root of the probability of a word sequence.

1. X is an ergodic and stationary process; n is arbitrarily large.

Entropy As Information

Suppose the letters of a Polynesian alphabet are distributed as follows:(1)

p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8

Calculate the per-letter entropy:

H(P) = -∑_{i ∈ {p,t,k,a,i,u}} p(i) * lg p(i) = 2.5 bits

2.5 bits on average are required to encode a letter (p: 100, t: 00, etc.).

1. Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
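A quick check of these figures in Python (the distribution is the simplified Polynesian example from Manning & Schutze, 1999):

```python
# Per-letter entropy and perplexity for the Polynesian alphabet.
from math import log2

p = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

H = -sum(prob * log2(prob) for prob in p.values())
print(H)       # 2.5 bits per letter
print(2 ** H)  # PP = 2^H(X), about 5.66
```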

Reducing the Entropy

Suppose:
- This language consists entirely of CV syllables
- We know their distribution

Then we can compute the per-syllable entropy of the language, H(C,V) = H(C) + H(V|C), where C ∈ {p,t,k} and V ∈ {a,i,u}:

H(C,V) = 2.44 bits

Entropy for two letters, letter model: 5 bits

Conclusion: The syllable model contains more information than the letter model

Perplexity As Weighted Average Branching Factor

Suppose the letters of the alphabet occur with equal frequency. Then at every fork we have 26 choices.


Reducing the Branching Factor

Suppose E occurs 75 times more frequently than any other letter. Let p(any other letter) = x. Since there are 25 such letters, 75x + 25x = 1, so x = .01 and p(E) = .75. Any letter w_i is therefore either E, with probability .75, or one of the other 25 letters, each with probability .01.

- Still 26 choices at each fork
- But E is 75 times more likely than any other choice
- Perplexity is reduced
- The model contains more information
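A quick check in Python of how much the weighting reduces the effective number of choices:

```python
# Entropy and perplexity for a 26-letter alphabet where E is 75 times
# more likely than any other letter: p(E) = .75, p(other) = .01 each.
from math import log2

probs = [0.75] + [0.01] * 25

H = -sum(p * log2(p) for p in probs)
print(H)       # about 1.97 bits
print(2 ** H)  # perplexity about 3.9, far below the unweighted 26
```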

Perplexity Experiment

Reduced perplexity in a language model is used as an indicator that an experiment with real data might be fruitful.

Technique (for both syllable and word corpora; see the sketch after this list):
1. Randomly choose 10% of the utterances from a corpus as a test set
2. Generate a language model from the remaining 90%
3. Compute the perplexity of the test set given the language model
4. Compute the mean over twenty runs of step 3
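A compact sketch of this protocol, under two assumptions of mine that the slides do not specify: an add-one-smoothed bigram model, and a hypothetical utterances.txt holding one utterance per line.

```python
# Perplexity experiment: 10% test / 90% training, mean over 20 runs.
import random
from collections import Counter
from math import log2

def train(utterances):
    uni, bi = Counter(), Counter()
    for u in utterances:
        words = ["<s>"] + u.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def perplexity(utterances, uni, bi):
    vocab = len(uni)
    log_prob, n = 0.0, 0
    for u in utterances:
        words = ["<s>"] + u.split() + ["</s>"]
        for prev, w in zip(words, words[1:]):
            # add-one smoothing keeps unseen bigrams from zeroing the product
            log_prob += log2((bi[(prev, w)] + 1) / (uni[prev] + vocab))
            n += 1
    return 2 ** (-log_prob / n)  # PP = P(w_1...w_n)^(-1/n)

def one_run(corpus):
    utts = corpus[:]
    random.shuffle(utts)
    cut = max(1, len(utts) // 10)  # 10% test set
    return perplexity(utts[:cut], *train(utts[cut:]))

corpus = open("utterances.txt").read().splitlines()  # hypothetical file
print(sum(one_run(corpus) for _ in range(20)) / 20)  # mean of 20 runs
```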

The Corpora

Air Travel Information System (Hemphill, et al., 2009)
- Word types: 1,604
- Word tokens: 219,009
- Syllable types: 1,314
- Syllable tokens: 317,578

Transcript of simulated human-computer speech (NextIt, 2008)
- Word types: 482
- Word tokens: 5,782
- Syllable types: 537 (this will have repercussions in Exp. 2)
- Syllable tokens: 8,587

Results

 | Bigrams | Trigrams
Mean Words, NextIt | 39.44 | 31.44
Mean Syllables, NextIt | 35.96 | 22.26
Mean Words, ATIS | 41.35 | 31.43
Mean Syllables, ATIS | 21.15 | 14.74

Notice the drop in perplexity from words to syllables. A perplexity of 14.74 for trigram syllable ATIS means less than half as many choices at every turn as for trigram word ATIS.

Experiment 2: Syllables in the Language Model

Hypothesis: a syllable language model will perform better than a word-based language model. By what measure?

Symbol Error Rate

SER = (100 * (I + S + D)) / T

where:
- I is the number of insertions
- S is the number of substitutions
- D is the number of deletions
- T is the total number of symbols

e.g., with 2 insertions, 1 substitution, and 1 deletion over 5 symbols: SER = 100 * (2 + 1 + 1) / 5 = 80.(1)

1. Alignment is performed by a dynamic programming algorithm (minimum edit distance); a sketch follows.
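A minimal sketch of SER scoring: align hypothesis to reference with a minimum-edit-distance dynamic program, then apply the formula above. The reference and hypothesis strings are invented examples in the deck's syllable notation.

```python
# SER = 100 * (I + S + D) / T, with I, S, D read off the cheapest
# alignment found by minimum edit distance.

def ser(ref, hyp):
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # all deletions
    for j in range(1, m + 1):
        d[0][j] = j  # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return 100 * d[n][m] / n                       # T = len(ref)

ref = "ay n_iy_dd ax t_ih k_ax_td".split()      # "I need a ticket"
hyp = "ay n_iy_dd t_ih k_ax_td f_l_ay".split()  # one deletion, one insertion
print(ser(ref, hyp))                            # 40.0
```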

Technique
1. Phonetically transcribe the corpus and reference files
2. Syllabify the corpus and reference files
3. Build language models
4. Run a recognizer on 18 short human-computer telephone monologues
5. Compute the mean, median, and standard deviation of SER for 1-gram, 2-gram, 3-gram, and 4-gram models over all monologues

Results

Syllables normed by words (means, N = 2, 3, 4):
Substitution: 73.14% | Insertion: 103.00% | Deletion: 81.74% | SER: 85.3%

Syllables compared to words:

 | Mean SER, Words | Mean SER, Syllables
N=2 | 46.4 | 41.0
N=3 | 46.8 | 39.4
N=4 | 46.7 | 39.0

Experiment 3: A Concept Component

Hypothesis: a recognizer equipped with a post-processor that transforms syllable output to syllable/concept output will perform better than one not equipped with such a processor.

Technique
1. Develop equivalence classes from the training transcript: BE, WANT, GO, RETURN
2. Map the equivalence classes onto the reference files used to score the output of the recognizer
3. Map the equivalence classes onto the output of the recognizer
4. Determine the SER of the modified output in step 3 with respect to the reference files in step 2

A sketch of the mapping step follows.
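A minimal sketch of the mapping step, assuming a hand-built dictionary from syllable strings to equivalence classes; the entries below are a small fragment assembled from the GO and WANT examples elsewhere in these slides.

```python
# Substitute concept labels for matching syllable strings.
concepts = {
    "ay w_uh_dd l_ay_kd": "WANT",  # "I would like"
    "ay n_iy_dd": "WANT",          # "I need"
    "ax f_l_ay_td": "GO",          # "a flight"
    "f_l_ay ix_ng": "GO",          # "flying"
}

def apply_concepts(syllables):
    # Longest strings first, so multi-syllable matches win over prefixes.
    text = " ".join(syllables)
    for phrase in sorted(concepts, key=len, reverse=True):
        text = text.replace(phrase, concepts[phrase])
    return text.split()

hyp = "ay w_uh_dd l_ay_kd ax f_l_ay_td".split()
print(apply_concepts(hyp))  # ['WANT', 'GO']
```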

Results

Concepts compared to syllables:

 | Mean SER, Syllables | Mean SER, Concepts
N=2 | 41.0 | 41.3
N=3 | 39.4 | 40.6
N=4 | 39.0 | 40.3

Concepts normed by syllables (means, N = 2, 3, 4, as ratios of concept to syllable scores):
Substitution: 1.06 | Insertion: 0.95 | Deletion: 1.09 | SER: 1.02

A 2% decline overall. Why?

Mapping Was Intended to Produce an Upper Bound on SER

- For each distinct syllable string that appears in the hypothesis or reference files, search each of the concepts for a match
- If there is a match, substitute the concept for the syllable string: ay w_uh_dd l_ay_kd -> WANT
- Misrecognition of even a single syllable means the concept is not inserted

Misalignment Between Training and Reference Files

Equivalence classes were constructed using only the LM training transcript.

More frequent in the reference files:
- 1st person singular ("I want")
- Imperatives ("List all flights")

Less frequent in the reference files:
- 1st person plural ("My husband and me want")
- Polite forms ("I would like")
- BE does not appear ("There should be", "There's going to be", etc.)

Summary

1. Perplexity: a syllable language model contains more information than a word language model (and probably will perform better)
2. A syllable language model results in a 14.7% mean improvement in SER
3. The very slight increase in mean SER for a concept language model justifies further research

Further Research

- Test the given system over a large production corpus
- Develop a probabilistic concept language model
- Develop the software needed to pass the output of the concept language model on to an expert system

The (Almost, Almost) Last Word

"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one under any known interpretation of the term."

Noam Chomsky, from a 1969 essay on Quine; cited in Jurafsky and Martin (2009).

The (Almost) Last Word

He just never thought to count.

The Last Word

Thanks to my generous committee:

Bill Croft, Department of Linguistics
George Luger, Department of Computer Science
Caroline Smith, Department of Linguistics
Chuck Wooters, U.S. Department of Defense

References

Cover, T., Thomas, J. (1991). Elements of Information Theory. Hoboken, NJ: John Wiley & Sons.

Greenberg, S. (1999). Speaking in shorthand: A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.

Hemphill, C., Godfrey, J., Doddington, G. (2009). The ATIS Spoken Language Systems Pilot Corpus. Retrieved 6/17/09 from: http://www.ldc.upenn.edu/Catalog/readme_files/atis/sspcrd/corpus.html

Hamalainen, A., Boves, L., de Veth, J., ten Bosch, L. (2007). On the utility of syllable-based acoustic models for pronunciation variation modeling. EURASIP Journal on Audio, Speech, and Music Processing, 46460, 1-11.

Jurafsky, D., Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Jurafsky, D., Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.

Kahn, D. (1976). Syllable-based Generalizations in English Phonology. Doctoral dissertation, MIT.

Manning, C., Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.

NextIt. (2008). Retrieved 4/5/08 from: http://www.nextit.com.

NIST. (2007) Syllabification software. National Institute of Standards: NIST Spoken Language Technology Evaluation and Utility. Retrieved 11/30/07 from: http://www.nist.gov/speech/tools/.

Additional Slides

Transcription of a recording

REF: (3.203,5.553) GIVE ME A FLIGHT BETWEEN SPOKANE AND SEATTLE
REF: (15.633,18.307) UM OCTOBER SEVENTEENTH
REF: (26.827,29.606) OH I NEED A PLANE FROM SPOKANE TO SEATTLE
REF: (43.337,46.682) I WANT A ROUNDTRIP FROM MINNEAPOLIS TO
REF: (58.050,61.762) I WANT TO BOOK A TRIP FROM MISSOULA TO PORTLAND
REF: (73.397,77.215) I NEED A TICKET FROM ALBUQUERQUE TO NEW YORK
REF: (87.370,94.098) YEAH RIGHT UM I NEED A TICKET FROM SPOKANE SEPTEMBER THIRTIETH TO SEATTLE RETURNING OCTOBER THIRD
REF: (107.381,113.593) I WANT TO GET FROM ALBUQUERQUE TO NEW ORLEANS ON OCTOBER THIRD TWO THOUSAND SEVEN

Transcribed and Segmented(1)

REF: (3.203,5.553) GIHV MIY AX FLAYTD BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW AY NIYDD AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) AY WAANTD AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) AY WAANTD TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) AY NIYDD AX TIH KAXTD FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY NIYDD AX TIH KAXTD FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RAX TER NIXNG AAKD TOW BAXR THERDD
REF: (107.381,113.593) AY WAANTD TUW GEHTD FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

1. Produced by University of Colorado transcription software (to a version of ARPAbet), the National Institute of Standards (NIST) syllabifier, and my own Python classes that coordinate the two.

With Inserted Equivalence Classes(1)

REF: (3.203,5.553) GIHV MIY GO BAX TWIYN SPOW KAEN AENDD SIY AE DXAXL
REF: (15.633,18.307) AHM AAKD TOW BAXR SEH VAXN TIYNTH
REF: (26.827,29.606) OW WANT AX PLEYN FRAHM SPOW KAEN TUW SIY AE DXAXL
REF: (43.337,46.682) WANT AX RAWNDD TRIHPD FRAHM MIH NIY AE PAX LAXS TUW
REF: (58.050,61.762) WANT TUW BUHKD AX TRIHPD FRAHM MIH ZUW LAX TUW PAORTD LAXNDD
REF: (73.397,77.215) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW YAORKD
REF: (87.370,94.098) YAE RAYTD AHM AY WANT GO FRAHM SPOW KAEN SEHPD TEHM BAXR THER DXIY AXTH TUW SIY AE DXAXL RETURN AAKD TOW BAXR THERDD
REF: (107.381,113.593) WANT GO FRAHM AEL BAX KAXR KIY TUW NUW AOR LIY AXNZ AAN AAKD TOW BAXR THERDD TUW THAW ZAXNDD SEH VAXN

1. A subset of a set in which all members share an equivalence relation. WANT is an equivalence class with members "I need", "I would like", and so on.

Including Flat Language Model: Word Perplexity

 | Flat LM | N=1 | N=2 | N=3 | N=4
Mean, Appendix C | 482 | 112.56 | 39.44 | 31.44 | 30.42
Median, Appendix C | 482 | 110.82 | 38.74 | 28.96 | 29.31
Std Dev, Appendix C | 0.0 | 13.61 | 7.21 | 5.82 | 4.05
Mean, ATIS | 1604 | 191.84 | 41.35 | 31.51 | 31.43
Median, ATIS | 1604 | 191.74 | 38.93 | 28.2 | 30.19
Std Dev, ATIS | 0.0 | .57 | 8.64 | 7.6 | 4.32

Including Flat Language Model: Syllable Perplexity

 | Flat LM | N=1 | N=2 | N=3 | N=4
Mean, Appendix C | 537 | 177.99 | 35.96 | 22.26 | 21.04
Median, Appendix C | 537 | 177.44 | 35.81 | 22.71 | 20.75
Std Dev, Appendix C | 0.0 | 1.77 | 8.49 | 3.73 | 3.53
Mean, ATIS | 1314 | 231.13 | 22.42 | 14.91 | 14.11
Median, ATIS | 1314 | 231.26 | 21.15 | 14.74 | 13.2
Std Dev, ATIS | 0.0 | .097 | 4.36 | 1.83 | 1.79

Words and Syllables Normed by Flat LM

Syllable data normed by flat LM:

 | N=1 | N=2 | N=3 | N=4
Mean, Appendix C | 33.14% | 6.69% | 4.15% | 3.92%
Mean, ATIS | 17.58% | 1.71% | 1.13% | 1.07%

Word data normed by flat LM:

 | N=1 | N=2 | N=3 | N=4
Mean, Appendix C | 23.35% | 8.18% | 6.52% | 6.31%
Mean, ATIS | 11.96% | 2.58% | 1.96% | 1.96%

Syllabifiers

- Syllabifier from the National Institute of Standards and Technology (NIST, 2007)
- Based on Daniel Kahn's 1976 dissertation from MIT (Kahn, 1976)
- Generative in nature and English-biased

Syllables

- Estimates of the number of English syllables range from 1,000 to 30,000
- This suggests some difficulty in pinning down what a syllable is
- The usual approach is hierarchical:

[Figure: hierarchical structure of the syllable: syllable -> onset (C) + rhyme; rhyme -> nucleus (V) + coda (C)]

Sonority

- Sonority rises to the nucleus and falls to the coda
- Speech sounds appear to form a sonority hierarchy (from highest to lowest): vowels, glides, liquids, nasals, obstruents
- Useful but not absolute: e.g., both "depth" and "spit" seem to violate the sonority hierarchy