
New Tools and Methods for Very-Large-Scale Phonetics Research

1. Introduction

The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are several orders of magnitude larger. Peterson & Barney’s influential 1952 study of American English vowels was based on measurements from a total of less than 30 minutes of speech. Many phonetic studies have been based on the TIMIT corpus, originally published in 1991, which contains just over 300 minutes of speech. Since then, much larger speech corpora have been published for use in technology development: LDC collections of transcribed conversational telephone speech in English now total more than 300,000 minutes, for example. And many even larger collections are now becoming accessible, from sources such as oral histories, audio books, political debates and speeches, podcasts, and so on.

These new bodies of data are badly needed, to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. Allied fields such as sociolinguistics and psycholinguistics ought to benefit even more. However, in contrast to speech technology research, speech science has so far taken very little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers.

Transcripts in ordinary orthography, typically inaccurate or incomplete in various ways, must be turned into detailed and accurate phonetic transcripts that are time-aligned with the digital recordings. And information about speakers, contexts, and content must be integrated with phonetic and acoustic information, within collections involving tens of thousands of speakers and billions of phonetic segments, and across collections with differing sorts of metadata that may be stored in complex and incompatible formats. Our research aims to solve these problems by integrating, adapting and improving techniques developed in speech technology research and database research.

The most important technique is forced alignment of digital audio with phonetic representations derived from orthographic transcripts, using HMM methods developed for speech recognition technology. Our preliminary results, described below, convince us that this approach will work. However, forced-alignment techniques must be improved and validated for robust application to phonetics research. There are three basic challenges to be met: orthographic ambiguity; pronunciation variation; and imperfect transcripts (especially the omission of disfluencies). Reliable confidence measures must be developed, so as to allow regions of bad alignment to be identified and eliminated or fixed. Researchers need an easy way to get a believable picture of the distribution of errors in their aligned data, so as to estimate confidence intervals, and also to determine the extent of any bias that may be introduced. And in addition to solving these problems for English, we need to show how to apply the same techniques to a range of other languages, with different phonetic problems, and with orthographies that (as in the case of Arabic and Chinese) may be more phonetically ambiguous than English.

In addition to more robust forced alignment, researchers also need improved techniques for dealing with the results. Existing LDC speech corpora involve tens of thousands of speakers, hundreds of millions of words, and billions of phonetic segments. Other sources of transcribed audio are collectively even larger. Different corpora, even from the same source, typically have differing sorts of metadata, and may be laid out in quite different ways. Manual or automatic annotation of syntactic, semantic or pragmatic categories may be added to some parts of some data sets.

Researchers need a coherent model of these varied, complex, and multidimensional databases, with methods to retrieve relevant subsets in a suitably combinatoric way. Approaches to these problems were developed at LDC under NSF awards 9983258, “Multidimensional Exploration of Linguistic Databases”, and 0317826, “Querying linguistic databases”, with key ideas documented in Bird and Liberman (2001); we propose to adapt and improve these results for the needs of phonetics research.

The proposed research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours. It will also enhance research in other language-related fields, not only within linguistics proper, but also in neighboring disciplines such as psycholinguistics, sociolinguistics and linguistic anthropology. And this effort to enable new kinds of research also brings up a number of research problems that are interesting in their own right, as we will explain.

2. Forced Alignment

Analysis of large speech corpora is crucial for understanding variation in speech (Keating et al., 1994; Johnson, 2004). Understanding variation in speech is not only a fundamental goal of phonetics, but it is also important for studies of language change (Labov, 1994), language acquisition (Pierrehumbert, 2003), psycholinguistics (Jurafsky, 2003), and speech technology (Benzeghiba et al., 2007). In addition, large speech corpora provide rich sources of data to study prosody (Grabe et al., 2005; Chu et al., 2006), disfluency (Shriberg, 1996; Stouten et al., 2006), and discourse (Hastie et al., 2002).

The ability to use speech corpora for phonetics research depends on the availability of phonetic segmentation and transcriptions. In the last twenty years, many large speech corpora have been collected; however, only a small portion of them come with phonetic segmentation and transcriptions, including: TIMIT (Garofolo et al., 1993), Switchboard (Godfrey & Holliman, 1997), the Buckeye natural speech corpus (Pitt et al., 2007), the Corpus of Spontaneous Japanese (http://www.kokken.go.jp/katsudo/seika/corpus/public/), and the Spoken Dutch Corpus (http://lands.let.kun.nl/cgn/ehome.htm). Manual phonetic segmentation is time-consuming and expensive (Van Bael et al. 2007); it takes about 400 times real time (Switchboard Transcription Project, 1999), or 30 seconds per phoneme, i.e., about 15 hours for 1,800 phonemes (Leung and Zue, 1984). Furthermore, manual segmentation is somewhat inconsistent, with much less than perfect inter-annotator agreement (Cucchiarini, 1993).

Forced alignment has been widely used for automatic phonetic segmentation in speech recognition and corpus-based concatenative speech synthesis. This task requires two inputs: recorded audio and (usually) word transcriptions. The transcribed words are mapped into a phone sequence in advance by using a pronouncing dictionary or grapheme-to-phoneme rules. Phone boundaries are determined based on the acoustic models via computer algorithms such as Viterbi search (Wightman and Talkin, 1997) and Dynamic Time Warping (Wagner, 1981).

The most frequently used approach for forced alignment is to build a Hidden Markov Model (HMM) based phonetic recognizer. The speech signal is analyzed as a succession of frames (e.g., one every 3-10 ms). The alignment of frames with phonemes is determined via the Viterbi algorithm, which finds the most likely sequence of hidden states (in practice each phone has 3-5 states) given the observed data and the acoustic model represented by the HMMs. The acoustic features used for training HMMs are normally cepstral coefficients such as MFCCs (Davis and Mermelstein, 1980) and PLPs (Hermansky, 1990). A common practice is to train single-Gaussian HMMs first and then extend these HMMs to more Gaussians (Gaussian Mixture Models, GMMs). The reported performance of state-of-the-art HMM-based forced alignment systems ranges from 80% to 90% agreement (of all boundaries) within 20 ms compared to manual segmentation on TIMIT (Hosom, 2000). Human labelers have an average agreement of 93% within 20 ms, with a maximum of 96% within 20 ms for highly-trained specialists (Hosom, 2000).
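For concreteness, here is a minimal sketch (in Python, with illustrative names, not the code of any particular toolkit) of the dynamic-programming search at the heart of forced alignment: given per-frame log-likelihoods for each phone model and the known phone sequence, it finds the segmentation that maximizes the total score. It simplifies each phone to a single state; real systems use the 3-5 state HMMs, GMM output densities, and toolkits such as HTK discussed in the surrounding text.

```python
import numpy as np

def forced_align(loglik, phone_seq):
    """Toy Viterbi forced alignment with one state per phone.

    loglik    : (T, P) array of per-frame log-likelihoods under P phone models
    phone_seq : the known phone sequence (indices into the P models)
    Returns a list of (start_frame, end_frame) spans, one per phone.
    """
    T = loglik.shape[0]
    N = len(phone_seq)
    score = np.full((T, N), -np.inf)          # best score ending at (frame, phone position)
    came_from_prev = np.zeros((T, N), dtype=bool)

    score[0, 0] = loglik[0, phone_seq[0]]
    for t in range(1, T):
        for i in range(N):
            stay = score[t - 1, i]
            move = score[t - 1, i - 1] if i > 0 else -np.inf
            came_from_prev[t, i] = move > stay
            score[t, i] = max(stay, move) + loglik[t, phone_seq[i]]

    # Backtrace from the last frame of the last phone to recover boundaries.
    spans, end, i = [], T - 1, N - 1
    for t in range(T - 1, 0, -1):
        if came_from_prev[t, i]:
            spans.append((t, end))            # phone i occupies frames t .. end
            end, i = t - 1, i - 1
    spans.append((0, end))                    # the first phone
    return list(reversed(spans))
```

In practice, loglik would come from evaluating the trained GMM output densities on MFCC or PLP feature frames rather than being given directly.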

In forced alignment, unlike in automatic speech recognition, monophone (context-independent) HMMs are more commonly used than triphone (context-dependent) HMMs. Ljolje et al. (1997) provide a theoretical explanation as to why triphone models tend to be less precise in automatic segmentation. In the triphone model, the HMMs do not need to discriminate between the target phone and the context; the spectral movement characteristics are better modeled, but phone boundary accuracy is sacrificed. Toledano et al. (2003) compare monophone and triphone models for forced alignment under different criteria and show in their experiments that monophone models outperform triphone models for medium tolerances (15-30 ms different from manual segmentation). However, monophone models underperform for small tolerances (5-10 ms) and large tolerances (>35 ms).


Many researchers have tried to improve forced alignment accuracy. Hosom (2000) uses acoustic-phonetic information (phonetic transitions, acoustic-level features, and distinctive phonetic features) in addition to PLPs. This study shows that the phonetic transition information provides the greatest relative improvement in performance. The acoustic-level features, such as impulse detection, intensity discrimination, and voicing features, provide the next-greatest improvement, and the use of distinctive features (manner, place, and height) may increase or decrease performance, depending on the corpus used for evaluation. Toledano et al. (2003) propose a statistical correction procedure to compensate for the systematic errors produced by context-dependent HMMs. The procedure comprises two steps: a training phase, where some statistical averages are estimated; and a boundary correction phase, where the phone boundaries are moved according to the estimated averages. The procedure has been shown to correct segmentations produced by context-dependent HMMs; therefore, the results are more accurate than those obtained by context-independent and context-dependent HMMs alone. There are also studies in the literature that attempt to improve forced alignment by using a different model than HMMs. Lee (2006) employs a multilayer perceptron (MLP) to refine the phone boundaries provided by HMM-based alignment; Keshet et al. (2005) describe a new paradigm for alignment based on Support Vector Machines (SVMs).

Although forced alignment works well on read speech and short sentences, the alignment of long and spontaneous speech remains a great challenge (Osuga et al., 2001; Toth, 2004). Spontaneous speech contains filled pauses, disfluencies, errors, repairs, and deletions that do not normally occur in read speech and are often omitted in transcripts. Moreover, pronunciations in spontaneous speech are much more variable than in read speech. Researchers have attempted to improve recognition of spontaneous speech (Furui, 2005) by: using better models of pronunciation variation (Strik & Cucchiarini, 1998; Saraclar et al., 2000); using prosodic information (Wang, 2001; Shriberg & Stolcke, 2004); and improving language models (Stolcke & Shriberg, 1996; Johnson et al., 2004).

With respect to pronunciation models, Riley et al. (1999) use statistical decision trees to generate alternate word pronunciations in spontaneous speech. Bates et al. (2007) present a phonetic-feature-based prediction model of pronunciation variation. Their study shows that feature-based models are more efficient than phone-based models; they require fewer parameters to predict variation and give smaller distance and perplexity values when comparing predictions to the hand-labeled reference. Saraclar et al. (2000) propose a new method of accommodating non-standard pronunciations: rather than allowing a phoneme to be realized as one of a few alternate phones, the HMM states of the phoneme’s model are allowed to share Gaussian mixture components with the HMM states of the model(s) of the alternate realization(s).

The use of prosody and language models to improve automatic recognition of spontaneous speech has been largely integrated. Liu et al. (2006) describe a metadata (sentence boundaries, pause fillers, and disfluencies) detection system; it combines information from different types of textual knowledge sources with information from a prosodic classifier. Huang and Renals (2007) incorporate syllable-based prosodic features into language models. Their experiment shows that exploiting prosody in language modeling significantly reduces perplexity and marginally reduces word error rate.

In contrast to automatic recognition, little effort has been made to reduce forced alignment errors for spontaneous speech. Automatic phonetic transcription procedures tend to focus on the accuracy of the phonetic labels generated rather than the accuracy of the boundaries of the labels. Van Bael et al. (2007) show that in order to approximate the quality of the manually verified phonetic transcriptions in the Spoken Dutch corpus, one only needs an orthographic transcription, a canonical lexicon, a small sample of manually verified phonetic transcriptions, software for the implementation of decision trees, and a standard continuous speech recognizer. Chang et al. (2000) developed an automatic transcription system that does not use word-level transcripts. Instead, special-purpose neural networks are built to classify each 10 ms frame of speech in terms of articulatory-acoustic phonetic features; the features are subsequently mapped to phonetic labels using multilayer perceptron (MLP) networks. The phonetic labels generated by this system are 80% concordant with the labels produced by human transcribers. Toth (2004) presents a model for segmenting long recordings into smaller utterances. This approach estimates prosodic phrase break locations and places words around breaks (based on length and break probabilities for each word).

Forced alignment assumes that the orthographic transcription is correct and accurate.

However, transcribing spontaneous speech is difficult. Disfluencies are often missed in the transcription process (Lickley & Bard, 1996). Instructions to attend carefully to disfluencies increase bias to report them but not accuracy in locating them (Martin & Strange, 1968). Forced alignment also assumes that our word-to-phoneme mapping generates a path that contains the correct pronunciation – but of course, natural speech is highly variable.

The obvious approach is to use language models to postulate additional disfluencies that may have been omitted in the transcript, and to use models of pronunciation variation to enrich the lattice of pronunciation alternatives for words in context; and then to use the usual HMM Viterbi decoding to choose the best path given the acoustic data. Most of the research on related topics is aimed at improving speech recognition rather than improving phonetic alignments, but the results suggest that these approaches, properly used, will not only give better alignments, but also provide valid information about the distribution of phonetic variants. For example, Fox (2006) demonstrated that a forced alignment technique worked well in studying the distribution of s-deletion in Spanish, using LDC corpora of conversational telephone speech and radio news broadcasts. She was also able to get reliable estimates of the distribution of the durations of non-deleted /s/ segments.

A critical component of any such research is estimation of the distribution of errors, whether in disambiguating alternative pronunciations, correcting the transcription of disfluencies, or determining the boundaries of segments. Since human annotators also disagree about these matters, it’s crucial to compare the distribution of human/human differences as well as the distribution of human/machine differences. And in both cases, the mean squared (or absolute-value) error often matters less than the bias. If we want to estimate (for example) the average duration of a certain vowel segment, or the average ratio of durations between vowels and following voiced vs. voiceless consonants, the amount of noise in the measurement of individual instances matters less than the bias of the noise, since as the volume of data increases, our confidence intervals will steadily shrink; and the whole point of this enterprise is to increase the available volume of data by several orders of magnitude.

Fox (2006) found this kind of noise reduction, just as we would hope, so that overall parameter estimates from forced alignment converged with the overall parameter estimates from human annotation. We will need to develop standard procedures for checking this in new applications. Since a sample of human annotations is a critical and expensive part of this process, a crucial step will be to define the minimal sample of such annotations required to achieve a given level of confidence in the result.
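As a back-of-the-envelope illustration of that last point, the sketch below (illustrative names; not the convergence procedure itself) estimates how many manually annotated boundaries are needed to pin down a mean boundary difference to a given margin, given the spread of human/machine differences. It deliberately ignores bias, which the discussion above argues is the more important quantity to check.

```python
import math
from scipy.stats import norm

def annotation_sample_size(sd_ms, margin_ms, confidence=0.95):
    """Rough number of annotated boundaries needed to estimate a mean
    boundary difference to within +/- margin_ms at the given confidence,
    assuming roughly independent, roughly normal differences."""
    z = norm.ppf(0.5 + confidence / 2.0)
    return math.ceil((z * sd_ms / margin_ms) ** 2)

# e.g. a 12 ms standard deviation of differences and a +/- 1 ms target
# margin at 95% confidence call for about 554 annotated boundaries:
print(annotation_sample_size(12.0, 1.0))
```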

3. Preliminary Results

3.1. The Penn Phonetics Lab Forced Aligner

The U.S. Supreme Court began recording its oral arguments in the early 1950s; some 9,000 hours of recording are stored in the National Archives. The transcripts do not identify the speaking turns of individual Justices but refer to them all as “The Court”. As part of a project to make this material available online in aligned digital form, we have developed techniques for identifying speakers and aligning entire (hour-long) transcripts with the digitized audio (Yuan & Liberman, 2008). The Penn Phonetics Lab Forced Aligner was developed from this project.

Seventy-nine arguments of the SCOTUS corpus were transcribed, speaker-identified, and manually word-aligned by the OYEZ project (http://www.oyez.org). Silence and noise segments in these arguments were also annotated. A total of 25.5 hours of speaker turns were extracted from the arguments and used for our training data; one argument was set aside for testing purposes. Silences were separately extracted and randomly added to the beginning and end of each turn. Our acoustic models are GMM-based, monophone HMMs. Each HMM state has 32 Gaussian mixture components on 39 PLP coefficients (12 cepstral coefficients plus energy, with delta and acceleration coefficients). The models were trained using the HTK toolkit (http://htk.eng.cam.ac.uk) and the CMU American English Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict).

We tested the forced aligner on both TIMIT (the training set data) and the Buckeye corpus (the data of speaker s14). TIMIT is read speech and the audio files are short (a few seconds each). The Buckeye corpus is spontaneous interview speech and the audio files are nine minutes long on average. Table 1 lists the average absolute difference between the automatic and manually labeled phone boundaries; it also lists the percentage of agreement within 25 ms (the length of the analysis window used by the aligner) between forced alignment and manual segmentation.

Table 1. Performance of the PPL Forced Aligner on TIMIT and Buckeye.

Corpus     Average absolute difference     Percentage of agreement within 25 ms
TIMIT      12.5 ms                         87.6%
Buckeye    21.2 ms                         79.2%
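A minimal sketch of how the statistics in Table 1 can be computed, assuming the automatic and manual boundaries have already been paired one-to-one (which forced alignment against a fixed transcription guarantees); the function name and the 25 ms default are illustrative.

```python
import numpy as np

def boundary_agreement(auto_ms, manual_ms, tol_ms=25.0):
    """Mean absolute difference and percentage of boundaries within tol_ms.

    auto_ms, manual_ms : parallel sequences of boundary times (ms) for the
                         same phone sequence, one entry per boundary.
    """
    diff = np.abs(np.asarray(auto_ms, float) - np.asarray(manual_ms, float))
    return diff.mean(), 100.0 * np.mean(diff <= tol_ms)

# e.g. mean_diff_ms, pct_within = boundary_agreement(auto, manual)
```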

We also tested the aligner on hour-long audio files, i.e., alignment of entire hour-long recordings without cutting them into smaller pieces, using the British National Corpus (BNC) and the SCOTUS corpus. The spoken part of the BNC corpus consists of informal conversations recorded by volunteers. The conversations contain a large amount of background noise, speech overlaps, etc. To help our forced aligner better handle the BNC data, we combined the CMU pronouncing dictionary with the Oxford Advanced Learner's Dictionary (http://ota.ahds.ac.uk/headers/0710.xml), which is a British English pronouncing dictionary. We also retrained the silence and noise model using data from the BNC corpus. We manually checked the word alignments on a 50-minute recording, and 78.6% of the words in the recording were aligned accurately.

The argument in the SCOTUS corpus that was set aside for testing in our study is 58 minutes long and manually word-aligned. The performance of the aligner on this argument is shown in Figure 1, which gives a boxplot of alignment errors (absolute differences from manual segmentation) for every minute from the beginning to the end of the recording. We can see that the alignment is consistently good throughout the entire recording.

Figure 1. Alignment errors in every minute in a 58-minute recording.

Possible reasons why our forced aligner can handle long and spontaneous speech well include: the high quality of the training data; the fact that the training data is large enough to train robust monophone GMM models; and the robustness of the silence and noise models.

3.2. Phonetics research using very large speech corpora and forced alignment

We have used large speech corpora to investigate speech and language phenomena such as speaking rate (Yuan et al., 2006), speech overlap (Yuan et al., 2007), stress (Yuan et al., 2008), duration (Yuan, 2008), and tone sandhi (Chen & Yuan, 2007). We will now summarize our study on English word stress, which shows how we can revisit classic phonetic and phonological problems from the perspective of utilizing very large speech corpora.

Studies on the acoustic correlates of word-level stress have demonstrated contradictory results regarding the importance of the acoustic correlates (see a review in Okobi, 2006). Most of these studies are based on small amounts of laboratory speech. By contrast, the acoustics of secondary stress, especially the three-way distinction among primary-stress, secondary-stress, and reduced vowels, has not been widely studied.

We investigated the pitch and duration of vowels from different lexical stress classes in the SCOTUS corpus. The vowels were automatically segmented using the Penn Phonetics Lab Forced Aligner, including 157,138 primary-stress vowels, 10,368 secondary-stress vowels, and 116,229 reduced vowels. The durations of the vowels were calculated from forced alignment. The F0 was extracted using Praat (http://www.praat.org) and converted to a semitone scale. The base frequency used for calculating semitones was Justice-dependent and was defined as the 10th percentile of all F0 values for that Justice. A simple linear regression was applied to the pitch contour of each turn, and the regression residuals were used for the pitch analysis. Using the regression residuals instead of the raw pitch values normalized the global downtrend of the pitch contours and captured the local pitch behavior of the vowels.

Figure 2 shows the F0 contours of the vowels. We segmented each vowel into four equal parts and averaged the pitch values within each quarter. The F0 contour of the primary-stress vowels stayed well above zero, which represents the pitch regression line. The contours of secondary-stress and reduced vowels were very similar; both were below the regression line.

Figure 2. F0 contours of primary-stress (‘1’), secondary-stress (‘2’) and reduced (‘0’) vowels.
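A minimal sketch of the pitch processing described above, assuming an F0 track has already been extracted (for example with Praat) and vowel intervals are known from forced alignment; function names are illustrative, and the base frequency passed in should be the 10th percentile over all of a speaker's F0 values, as in the study.

```python
import numpy as np

def semitones(f0_hz, base_hz):
    """Convert F0 in Hz to semitones relative to a speaker-specific base."""
    return 12.0 * np.log2(np.asarray(f0_hz, float) / base_hz)

def detrended_vowel_contours(times, f0_hz, vowel_spans, base_hz):
    """Quarter-by-quarter detrended pitch for each vowel in one turn.

    times, f0_hz : F0 track of one speaking turn (voiced frames only)
    vowel_spans  : list of (start_time, end_time) pairs for the vowels
    base_hz      : speaker base frequency (e.g. 10th percentile of the
                   speaker's F0 values)
    """
    times = np.asarray(times, float)
    st = semitones(f0_hz, base_hz)

    # Linear regression of pitch on time over the whole turn; the residuals
    # remove the global downtrend and keep local pitch behavior.
    slope, intercept = np.polyfit(times, st, 1)
    resid = st - (slope * times + intercept)

    contours = []
    for start, end in vowel_spans:
        edges = np.linspace(start, end, 5)          # four equal quarters
        quarters = []
        for q in range(4):
            sel = (times >= edges[q]) & (times < edges[q + 1])
            quarters.append(resid[sel].mean() if sel.any() else np.nan)
        contours.append(quarters)
    return contours
```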

The histograms in Figure 3 (below) show the frequency distributions of vowel duration for the three stress classes. Interestingly, the secondary-stress vowels were more similar to the primary-stress vowels in terms of duration. The reduced vowels were much shorter than these two types of vowels.


Figure 3. Duration densities of primary-stress (‘1’), secondary-stress (‘2’) and reduced (‘0’) vowels.

3.3. Improving automatic phonetic alignments

Forced alignment is a powerful tool for utilizing very large speech corpora in phonetics research, but as we noted, it has several obvious problems: orthographic ambiguity, pronunciation variation, and imperfect transcripts. The general approach in all cases is to add alternative paths to the “language model” (which in the simplest case is just a simple sequence of expected phonetic segments), with estimates of the a priori probability of the alternatives, and let the Viterbi decoding choose the best option. In some cases, it may also be helpful to add additional acoustically-based features (perhaps based on some decision-specific machine learning) designed to discriminate among the alternatives. We and others have gotten promising results with such techniques, and we’re confident that with some improvements, they will deal adequately with the problems, as well as adding information about the distribution of phonetic variants in speech.
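As an illustration of "letting Viterbi choose", the sketch below (illustrative names; not the PPL aligner's implementation) enumerates pronunciation variants per word, combines each candidate's prior with an acoustic alignment score supplied by a scoring function, and returns the best-scoring phone sequence. Real systems fold the variants into the decoding lattice rather than enumerating them, but the scoring logic is the same.

```python
import itertools
import math

def best_pronunciation_path(words, variants, score_fn):
    """Choose among pronunciation variants by prior plus acoustic score.

    words    : list of orthographic words in the transcript
    variants : dict word -> list of (phone_list, prior_prob) pairs,
               e.g. {"suppose": [(["S", "AH0", "P", "OW1", "Z"], 0.8),
                                 (["S", "P", "OW1", "Z"],        0.2)]}
    score_fn : callable taking a full phone sequence and returning the
               log-likelihood of force-aligning it to the utterance
    """
    best_seq, best_score = None, -math.inf
    # One variant per word; fine for short utterances, combinatorial for
    # long ones (hence the lattice representation in real systems).
    for choice in itertools.product(*(variants[w] for w in words)):
        phones = [p for pron, _ in choice for p in pron]
        prior = sum(math.log(prob) for _, prob in choice)
        total = prior + score_fn(phones)
        if total > best_score:
            best_seq, best_score = phones, total
    return best_seq, best_score
```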

Pretonic schwa deletion (e.g., suppose -> [sp]ose) presents a typical challenge of this type (Hooper, 1978; Patterson et al., 2003; Davidson, 2006). Editing the pronouncing dictionary may solve the problem, but it is time-consuming and error-prone. We propose a different approach: using a “tee-model” for schwa in forced alignment. A “tee-model” has a direct transition from the entry to the exit node in the HMM; therefore, a phone with a “tee-model” can have “zero” length. The “tee-model” has mainly been used for handling possible inter-word silence. In a pilot experiment, we trained a “tee-model” for schwa and used the model to identify schwa elision (“zero” length from alignment) in the SCOTUS corpus. We asked a phonetics student to examine all the tokens of the word suppose in the corpus (ninety-nine total) and manually identify whether there was a schwa in the word by listening to the sound and looking at the spectrogram. The agreement between the forced alignment procedure and the manual procedure was 88.9% (88/99). Of the 24 tokens identified as having no schwa (schwa elision) by the student, 22 (91.7%) were correctly identified by the aligner; of the 75 tokens identified as having a schwa by the student, 66 (88%) were correctly identified by the aligner. Figure 4 (below) illustrates two examples from the forced alignment results. We can see that the forced aligner correctly identifies a schwa in the first word and a schwa elision in the second word, even though the word suppose does not have a pronunciation variant with schwa elision in the pronouncing dictionary.


Figure 4. Identifying schwa elision through forced alignment and “tee-model”.
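With a tee-model, an elided phone simply comes out of the aligner with zero duration, so detecting elision is a matter of scanning the output intervals. A minimal sketch, assuming the alignment has been read into (label, start, end) triples and that schwa is labeled AH0 as in the CMU dictionary:

```python
def find_schwa_elisions(phone_intervals, schwa_labels=("AH0",), eps=1e-6):
    """Return (index, label) for schwa tokens aligned with ~zero length.

    phone_intervals : list of (label, start_sec, end_sec) triples from
                      the aligner's output for one word or utterance
    """
    return [(i, label)
            for i, (label, start, end) in enumerate(phone_intervals)
            if label in schwa_labels and (end - start) <= eps]

# e.g. the output for one token of "suppose",
#   [("S", 0.00, 0.05), ("AH0", 0.05, 0.05), ("P", 0.05, 0.12), ...]
# is flagged as schwa elision because AH0 has zero duration.
```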

4. Research Plans

We plan to improve the Penn Phonetics Lab (PPL) Forced Aligner in two respects: 1) its segmentation accuracy; and 2) its robustness to conversational speech and long recordings. We will further explore techniques for modeling phonetic variation and recognizing untranscribed disfluencies, and for marking regions of unreliable alignment. In addition, we will extend this system to other speech genres (e.g., child-directed speech) and more languages, including Mandarin Chinese, Arabic, Vietnamese, Hindi/Urdu, French, and Portuguese. We will apply these techniques to the LDC’s very large speech corpora, and explore how to integrate the resulting automated annotations into a database system that is convenient for phonetic search and retrieval, as per the techniques developed in previous NSF-funded projects at LDC. In addition to using these results in our own research, we will publish both the annotations and the search software for use by the research community at large, in order to learn as much as possible about the issues that arise in applying this new approach.

4.1. Analysis of segmentation errors

To evaluate the performance of forced alignment and analyze segmentation errors, we will develop phonetic segmentation data that can serve as a gold-standard benchmark. This dataset will be created in a uniform format for all our target languages. Manual phonetic segmentation is very time-consuming, so we will randomly select representative utterances from the spoken corpora. Half an hour of benchmark data will be created for English, and lesser amounts for other languages, consistent with estimating confidence intervals for a range of phonetic measurements. These datasets will be published through LDC by the end of the second year of the project.

Error analysis provides information about where and how the system should be improved; it allows us to estimate whether the deviations from human annotation introduce any bias; and in addition, it may yield insights of its own. Thus Greenberg and Chang (2000) conducted a diagnostic evaluation of eight Switchboard-corpus recognition systems, and found that syllabic structure, prosodic stress and speaking rate were important factors in accounting for recognition performance.

We performed a preliminary error analysis on alignment of the TIMIT corpus using the PPL aligner. As shown in Table 2, we found that the signed alignment errors between different phone classes have different patterns. There is no bias towards either phone class for the boundaries between Nasals and Glides (no matter which is first, -0.002 s vs. 0.006 s); however, there is a significant bias towards Stops for the boundaries between Stops and Glides (no matter which is first, -0.01 s vs. 0.015 s). There is no bias for the boundaries between Vowels and Glides (-0.002 s), but there is a significant bias towards Vowels for the boundaries between Glides and Vowels (0.013 s). We will undertake further analyses to reveal how the error patterns are related to phone characteristics, coarticulation, and syllable structure. We will then use the information to improve forced alignment.

Table 2. Average signed errors for boundaries between broad phone classes (seconds); rows give the first phone of the boundary and columns the second.

             Affricate  Fricative  Glide    /h/     Nasal   Stop    Vowel
Affricate        -        –.008    –.006     -        -       -      .019
Fricative      .026         -      –.009     -      –.013    .008    .007
Glide          .014       .003       -      .013     .006    .015    .013
/h/              -          -      –.008     -        -       -      .010
Nasal            -        –.005    –.002     -        -      .009    .013
Stop             -        –.001    –.010     -      –.008     -     –.003
Vowel          .006       –.012    –.002    .006    –.004    .006     -
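A sketch of the kind of per-class summary behind Table 2, assuming paired automatic and manual boundary times plus a mapping from phone labels to broad classes; the sign convention (automatic minus manual, so positive means the automatic boundary is late) is our assumption.

```python
from collections import defaultdict
import numpy as np

def signed_errors_by_class(boundaries, phone_class):
    """Mean signed alignment error for each (first class, second class) pair.

    boundaries  : iterable of (left_phone, right_phone, auto_sec, manual_sec),
                  one entry per phone boundary
    phone_class : dict mapping phone labels to broad classes,
                  e.g. {"P": "Stop", "IY": "Vowel", ...}
    """
    errs = defaultdict(list)
    for left, right, auto_t, manual_t in boundaries:
        errs[(phone_class[left], phone_class[right])].append(auto_t - manual_t)
    return {pair: float(np.mean(v)) for pair, v in errs.items()}
```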

4.2. Analysis of phonetic variation

A key issue in forced alignment is the inherent variation of human speech. Phonetic variation has been extensively studied (Keating et al., 1994; Bell et al., 2003; Johnson, 2004). Based upon our review of the literature, we will conduct studies to investigate the phonetic variation in the TIMIT, Buckeye and SCOTUS corpora from the perspective of forced alignment. Such studies are essential to improving forced alignment and improving our understanding of the mechanisms of speech production. Our study on pretonic schwa deletion is a good example of how to integrate phonetic variation analysis and forced alignment. In this research, we will investigate how to better handle phonetic variation (i.e., deletion, reduction, and insertion) for the purpose of forced alignment. One possible experiment on vowel reduction involves building a system in which all English reduced vowels are the same phoneme; this special phoneme would have triphone (context-dependent) models instead of a monophone model.


Figure 5. Forced alignment errors by speakers.

We will also conduct error analyses in terms of speaker variation using the TIMIT, Buckeye, and SCOTUS corpora. Toledano et al. (2003) demonstrate that the use of speaker adaptation techniques increases segmentation precision. Figure 5 (above) shows the absolute phone alignment errors using the PPL aligner on individual speakers in the TIMIT corpus. We can see that a small number of speakers deviate greatly from the others. The deviations may be due to factors such as speaking rate, disfluency, dialect, and speaker characteristics of pronunciation and prosody. In addition, as shown above, we found that speakers vary to different degrees on different phones (Yuan and Liberman, 2008). We aim to understand how speaker variation affects forced alignment and to improve forced alignment through better adaptation to speaker variation.

A remaining challenge is the robust identification of regions where (for whatever reason) the automatic process has gone badly wrong, so that the resulting data should be disregarded. The use of likelihood scores or other confidence measures is the obvious approach, but the history of such confidence measures in automatic speech recognition is mixed at best. We hypothesize that this is mainly because the language model is weighted more heavily than the acoustic model in automatic speech recognition. In forced alignment, however, the word sequence is (mostly or completely) given, so the language model plays much less of a role in computing the likelihood scores, which are therefore more reliable and useful.

We have used likelihood scores to study speaker variation on individual phones. In particular, we asked which phones are more acoustically variable among speakers. We again turned to the SCOTUS data. To identify the speaking turns of individual Justices, we trained Justice-dependent, GMM-based, monophone HMMs. We computed the acoustic variation among Justices on individual phones using two methods. First, we directly computed the distances between the GMM models of the Justices on each phone. A natural measure between two distributions is the Kullback-Leibler divergence (Kullback, 1968); however, it cannot be analytically computed in the case of a GMM. We adopted a dissimilarity measure proposed in Goldberger and Aronowitz (2005), which is an accurate and efficiently computed approximation of the KL divergence. Next, we computed the acoustic distance from likelihood scores. Let L(/p/i, Mj) denote the average likelihood score of the /p/ phones of speaker i when the phones are force-aligned using speaker j’s model Mj; L(/p/i, Mj) measures the distance of the phone /p/ from speaker i to speaker j. Figure 6 (below) shows that the correlation between the KL-divergence measure and the likelihood-score distance measure is very high (r = 0.88). This result suggests that likelihood scores can be used as reliable measurements of acoustic and phonetic variation. The likelihood scores are particularly useful when there are not enough data to train GMM models. For example, to evaluate the sentence pronunciation of a foreign language learner, we can use the likelihood score obtained from alignment of the sentence against acoustic models trained on standard-accent speech.


Figure 6. Correlation between KL divergence and likelihood score distance
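A sketch of the comparison behind Figure 6, with two simplifications: scikit-learn GMMs stand in for the HMM state distributions, and a Monte Carlo estimate stands in for the Goldberger and Aronowitz (2005) KL approximation used in the study; names, model sizes, and the sign convention on the likelihood distance are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_kl(gmm_p, gmm_q, n=5000):
    """Monte Carlo estimate of KL(p || q) between two fitted GMMs."""
    X, _ = gmm_p.sample(n)
    return float(np.mean(gmm_p.score_samples(X) - gmm_q.score_samples(X)))

def phone_distance_pairs(frames_by_speaker, n_components=8):
    """For one phone, divergence vs. likelihood-score distance per speaker pair.

    frames_by_speaker : dict speaker -> (N, D) array of acoustic frames
                        (e.g. PLP vectors) for that phone
    Returns parallel lists over ordered speaker pairs, ready for np.corrcoef.
    """
    models = {spk: GaussianMixture(n_components, covariance_type="diag",
                                   random_state=0).fit(X)
              for spk, X in frames_by_speaker.items()}
    div, lik = [], []
    for i, Xi in frames_by_speaker.items():
        for j in frames_by_speaker:
            if i == j:
                continue
            div.append(mc_kl(models[i], models[j]))
            # L(/p/_i, M_j): average log-likelihood of speaker i's frames
            # under speaker j's model, negated so that larger = more distant.
            lik.append(-float(np.mean(models[j].score_samples(Xi))))
    return div, lik

# r = np.corrcoef(*phone_distance_pairs(frames))[0, 1]
```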

4.3. Integration of phonetic models

To our understanding, part of the reason why the integration of phonetic knowledge has not significantly improved the accuracy of speech recognition is the strong impact of the language model in automatic speech recognition (ASR) procedures. Since the word sequence is provided in forced alignment, the application of phonetic knowledge is more likely to be successful here. The proposed research will attempt to improve forced alignment by incorporating well-established phonetic models. Specifically, we will explore the phone-transition model (Hertz, 1991), the π-gesture model (Byrd & Saltzman, 2003), and the landmark model (Stevens, 2002).

Hertz (1991) presents a phone-transition model that treats formant transitions as independent units between phones, rather than incorporating the transitions into the phones as in more conventional models. The phone-transition model can be easily implemented in the HMM framework. However, several questions require investigation. For example, can (some) transitions be clustered with (some) reduced vowels in their acoustic models? The π-gesture model of Byrd and Saltzman (2003) suggests that boundary-related durational patterning can result from prosodic gestures, or π-gestures, which stretch or shrink the local temporal fabric of an utterance. We propose to incorporate the π-gesture model into the forced alignment procedure through the rescoring of alignment lattices (Jennequin & Gauvain, 2007). Stevens (2002) proposes a model for lexical access based on acoustic landmarks and distinctive features. Landmark-based speech recognition has advanced in recent years (Hasegawa-Johnson et al., 2005). We will adopt a two-step procedure to apply the landmark model in forced alignment. In the first step, segment boundaries will be obtained by the HMM-based PPL forced aligner. In the second step, the boundaries will be refined through landmark detection, using the framework proposed in Juneja and Espy-Wilson (2008).
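The sketch below illustrates the shape of the proposed two-step procedure only: it snaps an HMM-derived boundary to the nearest peak of frame-to-frame spectral change within a small window, a crude stand-in for genuine landmark detection in the Juneja and Espy-Wilson (2008) sense; names and the window size are illustrative.

```python
import numpy as np

def refine_boundary(log_spectra, hmm_frame, window=3):
    """Snap an HMM boundary to the largest nearby spectral change.

    log_spectra : (T, F) array of log-magnitude spectral frames
    hmm_frame   : boundary frame index from the HMM-based aligner
    window      : search +/- this many frames around the HMM boundary
    """
    # flux[t] = total spectral change between frame t and frame t + 1
    flux = np.sum(np.abs(np.diff(log_spectra, axis=0)), axis=1)
    lo = max(0, hmm_frame - window)
    hi = min(len(flux), hmm_frame + window + 1)
    return lo + int(np.argmax(flux[lo:hi]))
```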

4.4. Incorporating prosody and language models

The transcriptions of long and spontaneous speech are usually imperfect. Spontaneous speech contains filled pauses, disfluencies, errors, repairs, and deletions that are often missed in the transcription process. Recordings of long and spontaneous speech usually contain background noise, speech overlaps, and very long non-speech segments. These factors make the alignment of long and spontaneous speech a great challenge. We aim to improve the robustness of the PPL aligner to long and spontaneous speech in two respects: 1) improve the acoustic models of silences, noises, and filled pauses; and 2) introduce constraints from prosody and language models into forced alignment. We will use the Buckeye, SCOTUS, and BNC corpora for this part of the research.

In our experiments on alignment of the BNC corpus, which consists of casual and long speech in a natural setting, we found that erroneous alignments can be reduced by adapting the silence and noise models of the PPL aligner to the BNC data. We will further explore the importance of the non-speech models in forced alignment of long and casual speech. We will also investigate ways to improve the acoustic models to better handle filled pauses. Schramm et al. (2003) created many pronunciation variants for a filled pause through a data-driven lexical modeling technique. The new model outperforms the single-pronunciation filled-pause model in recognition of highly spontaneous medical speech. Stouten et al. (2006) argue that a better way to cope with pause fillers in speech recognition is to introduce a specialized filled-pause detector (as a preprocessor) and supply the output of that detector to the general decoder. We will explore these two approaches for our purpose of improving forced alignment of long and casual speech.

A common practice in forced alignment is to insert a “tee-model” phone, called sp (short pause), after each word in the transcription to handle possible inter-word silence. Since a “tee-model” has a direct transition from the entry to the exit node, sp can be skipped during forced alignment. In this way, a forced aligner can “spot” and segment pauses in the speech, which are usually not transcribed. In casual and long speech, such pauses can be extremely long and filled with background noise. In this case, the sp-insertion approach can cause severe problems. In our study on the BNC corpus, we found that there are often many sp segments mistakenly posited by the aligner in regions where the word boundaries were not correctly aligned. We believe that these types of errors can be reduced by introducing constraints on the occurrences of sp from both a language model and a prosodic model. For example, it is unlikely that pauses occur between the words of very common phrases such as “How are you doing?”. On the other hand, if there is a single word between two pauses in speech, the word is likely to be lengthened; hence, it should have longer duration and particular F0 characteristics. Another type of error we have seen in the BNC corpus is that some words are extremely long in the alignment results. This usually occurs when there is long, speech-like background noise surrounding the words. This type of error can be reduced by introducing constraints on word or phone duration.
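A simple sketch of the kind of screening that such duration constraints make possible, assuming the alignment has been read into (label, start, end) triples with inter-word pauses labeled sp; the thresholds are illustrative and would need to be tuned per corpus rather than taken as given.

```python
def flag_suspect_alignments(word_intervals, max_word_sec=2.0, max_sp_sec=5.0):
    """Flag aligned regions that probably went wrong.

    word_intervals : list of (label, start_sec, end_sec) from the aligner,
                     with inter-word short pauses labeled "sp"
    Returns (label, start, end, reason) for implausibly long words and
    very long sp segments, typical symptoms of misalignment around noisy
    or untranscribed regions.
    """
    suspects = []
    for label, start, end in word_intervals:
        dur = end - start
        if label == "sp" and dur > max_sp_sec:
            suspects.append((label, start, end, "very long pause"))
        elif label != "sp" and dur > max_word_sec:
            suspects.append((label, start, end, "implausibly long word"))
    return suspects
```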

4.5. Extension to other speech genres and more languages

The CHILDES corpus (http://childes.psy.cmu.edu/) contains audio/video data and transcripts collected from conversations between young children and their playmates and caretakers. It has been a great resource for studying language acquisition and phonetic variation. We propose to conduct forced alignment on the child-directed speech data in this corpus in order to make the data more usable for phonetic research. Kirchhoff and Schimmel (2005) trained automatic speech recognizers on infant-directed (ID) and adult-directed (AD) speech, respectively, and tested the recognizers on both ID and AD speech. They found that matched conditions produced better results than mismatched conditions, and that the relative degradation of ID-trained recognizers on AD speech was significantly less severe than in the reverse case. We will conduct a similar study for forced alignment by comparing the aligner trained on the SCOTUS corpus and on the CHILDES corpus.

We will also extend the PPL aligner to more languages. Our first target languages are Mandarin Chinese and Vietnamese, both of which are tonal languages. Many researchers have tried to improve automatic speech recognition for tonal languages by incorporating tonal models or by utilizing different acoustic units such as syllables and initials/finals (Fu et al., 1996; Vu et al., 2005; Lei, 2006). This research will not investigate how to apply advanced tonal models or acoustic unit modeling in forced alignment. Instead, we will build simple tone-dependent and tone-independent models based on monophones, syllables, and initials/finals; we will then choose the one that performs best. The data we will use to build a Mandarin Chinese aligner are the Hub-4 Mandarin Broadcast News speech (LDC98S73) and the transcripts (LDC98T24). There is no easily accessible large speech corpus in Vietnamese. However, Giang Nguyen, a graduate student in our lab, has collected more than 30 hours of interview speech from Vietnam for her dissertation research; the recordings are currently being transcribed. We will use this dataset for training a Vietnamese forced aligner.


We will also attempt to extend the forced aligner to other languages - including Arabic, Hindi, Urdu, French and Portuguese - to provide a more general tool for conducting phonetics research using very large corpora.

4.6. Dissemination of the research

We will disseminate the research using methods that include journal publications; open-source toolkits and web-based applications; and tutorials, workshops, and courses.

The research findings and methodological innovations will be published in archival journals. We are currently preparing two journal articles, “Forced alignment as a tool and methodology for phonetics research” (to be submitted to the Journal of the International Phonetic Association) and “English stress and vowel reduction revisited: from the perspective of corpus phonetics” (to be submitted to the Journal of Phonetics). Other target journals include Speech Communication, the Journal of the Acoustical Society of America, IEEE Transactions on Audio, Speech and Language Processing, the International Journal of Speech Technology, etc.

We have already released the current version of the PPL aligner to many different sites, including NYU, Oxford, Stanford, UIUC, and the University of Chicago. We have built a freely accessible forced alignment online processing system, residing at http://www.ling.upenn.edu/phonetics/align.html. We will publish new versions of the aligner annually at no cost, through LDC and the phonetics lab website. To ensure long-term use of the aligner, we will produce a permanent free-standing tutorial, covering the training and use of the aligner, and the integration of its output. We will also provide a collection of related Python scripts. We will also develop web-based applications that integrate forced alignment, database query, and phonetics research. For example, we have built a web-based search engine for searching phones and words in the SCOTUS corpus (http://165.123.213.123:8180/PhoneticDatabaseSearchEngine/), where the search results are word-aligned speaking turns.

We propose to organize a workshop on the use of very large corpora in speech research during the first year of the project. The purposes of the workshop will be: 1) to introduce these techniques to those in the speech-research community who are not familiar with them; and 2) to promote phonetics research using very large corpora with forced alignment as both a tool and a methodology. The workshop will also provide an opportunity for us to test the aligner on different datasets from the workshop participants, and to seek research collaborations.

At the University of Pennsylvania we have been teaching a course on “corpus phonetics”, covering relevant Python and Praat scripting, database access, statistical analysis in R, etc. We plan to teach a similar course at the Linguistic Society of America's 2011 Summer Institute.

4.7. Time schedule

Year 1 (June 2009 – May 2010):
1. Improve segmentation accuracy through error analyses and the incorporation of phonetic knowledge.
2. Integrate the resulting annotations with database search and retrieval technology.
3. Organize a workshop on forced alignment for phonetics research.

Year 2 (June 2010 – May 2011):
1. Improve the aligner’s robustness to long and spontaneous speech.
2. Expand the aligner to child-directed speech and to other languages, including Mandarin Chinese, Arabic, Hindi, Urdu, Vietnamese, French, and Portuguese.

Year 3 (June 2011 – May 2012):
1. Publish the PPL forced aligner and the data sets used for training the aligner on the web and through LDC.
2. Investigate new ways to use forced alignment as a methodology for phonetics research.