
2006:113 CIV

MASTER'S THESIS

Feasibility Study on a Text-To-Speech Synthesizer for Embedded Systems

Linnea Hammarstedt

Luleå University of Technology
MSc Programmes in Engineering, Electrical Engineering
Department of Computer Science and Electrical Engineering
Division of Signal Processing

2006:113 CIV - ISSN: 1402-1617 - ISRN: LTU-EX--06/113--SE

Preface

This is a master's degree project commissioned by and performed at Teleca Systems GmbH in Nürnberg, at the department of Speech Technology. Teleca is an IT services company focused on developing and integrating advanced software and information technology solutions.

Today Teleca possesses a speech recognition system including a grapheme-to-phoneme module, i.e., an algorithm converting text into phonetic notation. Their future objective is to develop a Text-To-Speech system including this module. The purpose of this work, from Teleca's point of view, is to investigate a possible solution for converting phonetic notation into speech suitable for an embedded implementation platform.

I would like to thank Dr. Andreas Kiessling at Teleca for his support and patient discussions during this work, and Dr. Stefan Dobler, the head of the department, for giving me the possibility to experience this interesting field of speech technology. Finally, I wish to thank all the other personnel of the department for their consistently practical support.


Abstract

A system converting textual information into speech is usually denoted a TTS (Text-To-Speech) system. The design of such a system varies depending on its purpose and platform requirements. In this thesis a TTS synthesizer designed for an embedded system operating on an arbitrary vocabulary has been evaluated and partially implemented in Matlab, constituting a base for further development. The focus of this thesis is on the speech generation part, which involves the conversion from phonetic notation into synthetic speech.

The chosen TTS system is the so-called Time Domain-PSOLA, which suits the implementation and platform requirements well. It concatenates segments of recorded speech and changes their prosodic characteristics with the Pitch Synchronous Overlap and Add (PSOLA) technique. The segment size is from the mid point of one phone to the mid point of the next, referred to as a diphone.

The quality of the generated synthesized speech is rather satisfying for the test sentences applied. Some disturbances still occur as a consequence of mismatches, such as differing spectral properties of the segments and pitch detection errors, but with further development these can be reduced.


Contents

1 Introduction
1.1 Introduction to TTS Systems
1.2 Linguistic Analysis Module
1.3 Speech Generation Module
1.3.1 Rule-Based Synthesis
1.3.2 Concatenative-Based Synthesis
1.4 Project Focus

2 Theory
2.1 Segment Data Preparation
2.1.1 Segment Format and Speech Corpus Selection
2.1.2 Preparation Process
2.1.3 Segment Representation
2.2 Speech Synthesis
2.2.1 Synthesizing Process
2.2.2 Prosodic Information
2.3 PSOLA Method
2.3.1 PSOLA Operation Process
2.3.2 Modification of Prosody
2.3.3 TD-PSOLA as Speech Synthesizer
2.4 Extension of TD-PSOLA into MBR-PSOLA
2.4.1 Re-synthesis Process
2.4.2 Spectral Envelope Interpolation
2.4.3 Multi-Band Excitation Model
2.4.4 Benefits with the respective PSOLA Methods
2.5 Utilized Data from External TTS Projects
2.5.1 Festival and FestVox
2.5.2 MBROLA

3 Implementation
3.1 Segment Data Preparation
3.1.1 Segment Information Modification
3.1.2 Pitch Marks Modification
3.1.3 Speech Corpus Modification
3.1.4 Additive Modifications
3.2 Speech Synthesis
3.2.1 Input Format
3.2.2 Segment List Generator
3.2.3 Prosody Modification
3.2.4 Segment Concatenation

4 Evaluation
4.1 Analysis of the Segment Database and Input Data
4.1.1 Pitch Marks
4.1.2 Spectral Mismatch
4.1.3 Fundamental Frequencies
4.2 Solution Analysis
4.2.1 Window Choice for the ST-signal Extraction
4.2.2 Frequency Modification
4.2.3 Duration Modification
4.2.4 Word Border Information

5 Discussion
5.1 Conclusions
5.1.1 Comparison of TD- and MBR-PSOLA
5.2 Further Work
5.2.1 Proceedings for Teleca
5.2.2 Possible Quality Improvements

A SAMPA Notation for British English
B MRPA - SAMPA Lexicon for British English
C Licence for CSTR's British Diphone Database

List of abbreviations

IPA        International Phonetic Alphabet
MBE        Multi-Band Exciter
MBR-PSOLA  Multi-Band Re-synthesis PSOLA
MBROLA     short for MBR-PSOLA
MOS        Mean Opinion Score
MRPA       Machine Readable Phonetic Alphabet
OLA        Overlap and Add
PSOLA      Pitch-Synchronous Overlap and Add
SAMPA      Speech Assessment Methods Phonetic Alphabet
TD-PSOLA   Time Domain PSOLA
TTS        Text-To-Speech
V/UV       Voiced/Un-Voiced


Chapter 1

    Introduction

The possibility of producing synthesized speech from plain textual information, with so-called Text-To-Speech (TTS) systems, has today aroused extensive interest in many technical areas. Different methods with varying quality and properties exist, and the development is still continuing.

The purpose of this thesis is to define and evaluate a TTS synthesizer suitable for embedded systems. The work is performed at Teleca Systems GmbH in Nürnberg and its focus is established by the requirements of the company. Today, Teleca holds a module able to transform text into phonetic notation, originally developed for another speech purpose. This module is assumed to be usable also in a TTS system, and the starting point for this project is hence phonetic notation. The developed system is restricted to British English, but the theoretical descriptions are valid for an arbitrary language.

Since the starting level is phonetic notation, it is not strictly correct to consider the investigated system a Text-To-Speech system. However, for simplicity, and since the process of going from phonetic notation to speech is a major part of a TTS system, the term TTS is still used in this thesis to describe the evaluated overall process.

In this study a TD-PSOLA (Time Domain-Pitch Synchronous Overlap and Add) [Dut97] synthesizer is investigated and implemented in Matlab. The result is evaluated and suggestions for further work are given towards a completion of the system. A possible extension of this method with the Multi-Band Excitation (MBE) [GL88] model is presented theoretically, together with information about its benefits and disadvantages.

The following sections briefly describe the main principles of a general TTS system as well as some existing classifications and groupings. The ambition of the latter description is to show which choices have been made and to give some explanation why.

    1.1 Introduction to TTS Systems

A TTS synthesizer is a computer-based system that takes a text string as input and converts it into synthetic speech waveforms. The methods and equipment needed for this process vary depending on the physical restrictions of the implementation platform and on development costs. Two main hardware restrictions are the storage properties, such as capacity and memory type, and the clock rate of the processor.



For all methods, the synthesis process can be divided into the two main modules presented in Figure 1.1. The first step transcribes the input text into a linguistic format, usually expressed as phonetic notations (phones) together with additional information about their prosody. The term prosody refers to properties of each phone such as duration and pitch. The outputs are then used in the second block for construction of the final synthetic speech waves.

Figure 1.1: Division of a general TTS system into two main modules (Text → Linguistic Analysis → Phonemes & Prosody Info → Speech Generation → Speech).

    1.2 Linguistic Analysis Module

In almost all languages the textual representation of a word does not directly correspond to its pronunciation. The position of letters within a word and the word's appearance within the sentence affect the pronunciation considerably, as do additional characters such as punctuation marks and the content of the sentence. An alternative symbolic representation is therefore needed to capture the hidden information. Usually a language can be described by 20 to 60 different phonetic characters [Lem99], when excluding the information about its melody. To also be able to describe the pitch characteristics, additional prosodic information is needed.

Converting text into a linguistic representation requires a large set of different rules and exceptions depending on the language. This process can be described through three main parts [Dut97]:

    1. Text analysis

    2. Automatic phonetization

    3. Prosody generation

The first part functions as a pre-processing phase. It identifies special characters and notations, such as numbers and abbreviations, and converts them into full text when needed. Several words can have different pronunciations depending on their meaning, and hence a contextual analysis is performed for categorization of the words. The last step in the text analysis part is to find the structure of the text and to organize the utterance into clauses and phrases.

After the text analysis phase an automatic phonetization of the words is performed, focusing on single words. The letter representation is automatically transcribed into a phonetic format using a dictionary-based or rule-based strategy, or a combination of both.


The former strategy divides the words into morphemes¹ and then converts them using a morpheme-to-phoneme dictionary. A large database is required for the dictionary to function in a general environment, together with additional transcription rules for expressing un-matched morphemes. Also in the case of a general rule-based text conversion, a mixture of the two strategies is present: here, an exception dictionary is needed for the words that do not follow the defined pronunciation rules.
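As a minimal illustration of the combined strategy (not part of the thesis implementation; the word list, letter rules and SAMPA-like output below are made up), an exception dictionary is consulted first and simple letter-to-phone rules are used otherwise:

    % Toy phonetization: exception dictionary first, letter rules otherwise.
    exceptions = containers.Map({'one', 'two'}, {{'w','V','n'}, {'t','u:'}});
    rules      = containers.Map({'a','b','k','m','n','s','t'}, ...
                                {'{','b','k','m','n','s','t'});
    word = 'bat';
    if isKey(exceptions, word)
        phones = exceptions(word);
    else
        phones = cell(1, numel(word));
        for i = 1:numel(word)
            phones{i} = rules(word(i));    % would fail for letters without a rule
        end
    end
    % phones is now {'b', '{', 't'}  (illustrative SAMPA-like notation)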

The last part of the linguistic transcription process is to add the prosodic information. This is applied as additional information, and hence the phonemes themselves are not further changed. Prosodic features are created by grouping syllables and words into larger segments. A hierarchic classification of these groups then leads to the resulting prosody description, usually presented as pitch definitions and phonetic durations.

    1.3 Speech Generation Module

The best synthetic result for generating speech would be achieved by having recordings of all existing words stored in a huge database. The input generated by the linguistic analysis module could then simply be used to find and return the desired words. However, a TTS system able to work on arbitrary text input would in this case require an almost infinite number of recorded words. A more effective, and thereby also more complex, speech generation system is therefore needed.

There exist several different methods for generating arbitrary speech in an implementation-realistic manner, i.e., with a limited amount of storage and a limited number of operations. The synthesis can be done either explicitly, using models of the vocal tract, or implicitly, based on pre-recorded sounds [Dut97]. Implementation of these approaches results in two different classifications:

    1. Rule-based synthesis for explicit operations.

    2. Concatenative-based synthesis for implicit operations.

    1.3.1 Rule-Based Synthesis

Creating rule-based synthesizers requires a careful study of how the different sounds of the human voice are produced. The modelling is then usually represented either by articulatory parameters or by formants² [SO95]. In the former case, several parameters hold information about, for example, the shapes and movements of the lips and tongue, the glottal aperture, cord tension and lung pressure [Lem99]. In the latter, a set of rules is used to determine the formant parameters necessary to synthesize a desired utterance.

These rule-based synthesizers require a considerably higher computational load compared to other common methods. Secondly, the synthesized speech sounds very unnatural, due to the complicated modelling and the fact that speech cannot be modelled entirely accurately. On the other hand, they are space-efficient, since no speech segments need to be stored, and they can in principle easily be adjusted to a new speaker with a different voice and dialect.

¹ A morpheme is the smallest language unit that carries a semantic interpretation. For example, the word 'unbelievable' can be divided into the three morphemes un-believe-able.

² A formant is a peak in an acoustic frequency spectrum.



    1.3.2 Concatenative-Based Synthesis

In concatenative-based synthesizers, segments of pre-recorded speech are connected (concatenated) to produce the desired utterances. The longer the segments used for synthesizing, the better the quality achieved. On the other hand, using segments that each consist of several phonemes is, as mentioned previously regarding whole words, not realistic for an unrestricted application area because of the database size. Most often so-called diphones (see subsection 2.1.3) are therefore used, consisting of two phonemes and the transition in between.

The method principally used for concatenating speech segments is the PSOLA (Pitch-Synchronous Overlap and Add) technique, or just OLA if the pitch-synchronizing state is excluded, which is described more closely in section 2.3. Several methods use these concatenation operations, with varying pre-processing steps of the database and different methods of applying the desired prosody. Most common are the TD-PSOLA (Time Domain-PSOLA) and the MBR-PSOLA (Multi-Band Re-synthesis PSOLA), both described in the following chapter.

An evaluation and comparison of four classical concatenative-based syntheses is described in [Dut94], involving the TD- and MBR-PSOLA methods, an LPC (Linear Predictive Coding) synthesizer and a synthesizer based on the Hybrid H/S (Harmonic/Stochastic) method³. This study implies that the TD- and MBR-PSOLA methods are very run-time effective, since they are estimated to require an operational load of 7 operations per sample, while at least ten times more is needed for the other two. Additionally, the two PSOLA-based methods have better intelligibility and naturalness; only in the case of fluidity is the Hybrid H/S model ranked higher than the TD-PSOLA.

The main difference between the TD- and MBR-PSOLA methods appears in the pre-processing state. In the latter synthesizer the segments are more normalized and equalized, which is beneficial for data compression and speech fluidity but comes at the cost of naturalness. The implementation of this method is also more complex, which can be seen in the next chapter where a more theoretical description of the two synthesizers is given.

A drawback with concatenative-based synthesizers is the large memory space needed for the stored segments. Additionally, the synthesizer cannot change speaker characteristics as in the case of rule-based systems. Some important benefits, though, are the simplicity of implementation, the natural sound and the few real-time operations needed [SO95].

    1.4 Project Focus

The speech recognition system existing at Teleca today includes a grapheme-to-phoneme module. This module is assumed to be usable as the linguistic part of a TTS system, converting text into phonetic notation. However, this is only an assumption and a future implementation task, and it is therefore not put into practice in this project⁴.

³ See [Dut97] for a description of the LPC and Hybrid H/S models.


The starting level for the TTS system described in this thesis is hence phonetic notation combined with prosodic information, assumed to be presented in a defined format suiting the chosen synthesis model.

Since the future implementation platform for the desired TTS system is an embedded device working in real time, a method with a low operational load and small data storage is required. According to what is described in the previous section, this means that a concatenation-based synthesis method using the run-time-effective PSOLA technique is preferable. To minimize the data storage for a TTS system with an arbitrary vocabulary, the segment database should consist of pure phoneme recordings. However, since the transition between two phonemes is known to be more important for the understanding of speech than the stable state itself [Dut97], a segment database consisting of diphones (with one transition point present in each segment) is preferable. Naturally, the more transitions each segment includes, the better the speech is understood, but this would also require a larger database. The memory load increases approximately with a power of two for each additional transition included in each segment, and an unrealistic number of segments is soon reached. Therefore, to reduce the data storage size, this TTS system is based on diphone segments.

The most common PSOLA-based synthesizers are the TD- and the MBR-PSOLA. The former method is more widely used and requires a somewhat simpler implementation. For this project the TD-PSOLA method is chosen, basically because the MBR-PSOLA is more or less an extension of the TD-PSOLA, and hence the implemented system can still be developed further in that direction. It would, though, be interesting to implement an MBR-PSOLA synthesizer as well and compare it with the TD-PSOLA, but as a consequence of the time restrictions of the project this is not performed.

For the implementation of the TTS system on an external device it is preferable to have the algorithm expressed in C or in device-dependent assembler code. In this thesis the program is implemented entirely in Matlab because of its good analysis possibilities and tools. When the optimal solution is found, the code can relatively easily be translated into C code.

⁴ The reason for making an assumption instead of an implementation is discussed in subsection 5.2.1.

Chapter 2

    Theory

The main operations needed for generating speech from a phonetic description through a concatenative-based synthesizer can be divided into two main processes: a segment data preparation process and a speech synthesis process. The former creates the data underlying the synthesizing process and is performed once. It operates on collected speech data and restructures the data into a format suitable for the synthesizer. Information useful for the future synthesis process is calculated and applied either as additional data or as a recalculation of the collected data. The speech synthesis process consists of the functions that operate on the phonetic input together with the generated data and produce the resulting speech. It is independent of the language of the stored segments; only the defined segment format, i.e., the length of each speech unit, is required.

    2.1 Segment Data Preparation

    2.1.1 Segment Format and Speech Corpus Selection

The initial two steps in building a concatenative-based synthesizer are to determine the segment format and to collect a speech corpus. The term segment format refers to the number of phonetic notes present in each speech unit, together with information about where in a phonetic note the segment starts and ends. Usually the number of notes is fixed to a certain value, as in the cases described in subsection 2.1.3 below, but it could also have varying length, as in the case of words. The selection of the segment format is a trade-off between operational load and complexity, storage requirements and speech quality. Longer segments result in fewer concatenation points, and hence a simpler TTS system and better preserved naturalness. On the other hand, using a segment format of several phonetic notes requires a large number of recorded speech segments. For each new phonetic part included, an almost exponential increase¹ of the memory size is required as a consequence of the mounting number of possible combinations.

¹ According to combinatorial theory, the permutation

    P(n, k) = \frac{n!}{(n-k)!} = n(n-1)(n-2)\cdots(n-(k-1)) \approx n^k  (for small k and large n),

where n denotes the total number of phonetic notes and k the number of notes per segment.
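As a rough numerical illustration of this growth (assuming a set of 45 phonetic notes, a figure in line with the 40 to 50 quoted in subsection 2.1.3):

    P(45, 2) = 45 \cdot 44 = 1980, \qquad P(45, 3) = 45 \cdot 44 \cdot 43 = 85\,140,

so every extra phonetic note per segment multiplies the required database size by roughly the size of the phoneme set.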



The segment data preparation process is based on a recorded speech corpus, and the number of segments to be included in the corpus is derived from the chosen segment format. It is preferable to record several versions of each segment and then choose the most appropriate recording later in the preparation process. The resulting quality of the TTS system depends to a large extent on the quality of the speech corpus. The recorded data should be noiseless and read by one single person, and to facilitate the future segment concatenation, it should be spoken as monotonously and energy-stable as possible.

    2.1.2 Preparation Process

Figure 2.1 displays the operations involved in the segment data preparation process of a general concatenative-based TTS system. It starts at speech corpus level and a description of each block is presented below.

Figure 2.1: Block scheme for the segment data preparation process of a general concatenative-based speech generator (Speech Corpus → Selective Segmentation → Speech Analysis → Equalization → Synthesis Segments, with a Segment Information database alongside).

    Selective Segmentation

The recorded speech stored in the Speech Corpus database usually consists of complete words intended to be divided into the defined segment format. This segment extraction is performed either by marking the segment end points or by cutting out and storing the desired parts. Finding the optimal cutting points is a time-consuming process, since an automatic segmentation function is hard to develop, and therefore it needs to be done more or less by hand. Secondly, the most appropriate speech segment is chosen if several recordings per segment exist.

Information about the segments is calculated and stored in the Segment Information database, for example the length of the segments and, when using segments consisting of at least two phonemes², the mid position of the transition appearing between the phonemes. Finally, the extracted speech segments are passed on to the next function.

² Defined as the mental abstraction of a phonetic note, see the next subsection.


    Speech Analysis

The operations involved in the Speech Analysis part mainly depend on the chosen synthesis method of the TTS system. In some cases the speech segments are recalculated to better resemble each other, as in the case of normalization. Additional information about each segment, for instance pitch marks, is for some methods stored in the Synthesis Segment database. Later in this chapter the pre-processing calculations needed for a TTS synthesizer are presented for two different PSOLA-based systems.

    Equalization

One operation concerning all concatenative-based methods is the energy equalization. When speech data is recorded from a human, it is never spoken with constant volume. This energy variation can lead to clearly audible mismatches when concatenating the different segments, so before the final storing of the speech segments in the Synthesis Segment database, this equalization is applied.

It has been found [Dut97] that the amplitude energy of each phoneme differs according to the type of sound and where in the mouth it is produced. To preserve this natural energy variation, the equalization process is applied within each group of equal phones. Note that the term equalization is used here in contrast to the term normalization. An energy normalization process would set the energy of all segments to an averaged value, and the natural energy variation would then be lost.
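A minimal sketch of such a class-wise equalization, assuming segs is a cell array with the segment waveforms and phoneClass a cell array of the corresponding phone labels (both hypothetical variable names):

    % Equalize RMS energy within each phone class, so the natural energy
    % differences between different phones are preserved.
    classes = unique(phoneClass);
    for c = 1:numel(classes)
        idx    = find(strcmp(phoneClass, classes{c}));
        rmsVal = cellfun(@(x) sqrt(mean(x.^2)), segs(idx));
        target = mean(rmsVal);                       % class-wise reference level
        for k = 1:numel(idx)
            segs{idx(k)} = segs{idx(k)} * (target / rmsVal(k));
        end
    end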

    2.1.3 Segment Representation

A phoneme is the linguistic representation of a phonetic event and is thereby defined as the smallest unit an utterance can be divided into. The phoneme is often incorrectly identified with a phone, but the latter describes a pronounced phonetic note while the phoneme corresponds to the mental abstraction of it. In other words, a phoneme can be defined as the categorization of a group of related phones [Dut97]. To represent all phones in a language, a set of about 40 to 50 basic phonemes is needed, depending on the language and the desired transcription accuracy [Lem99].

A diphone describes the transition between two phonemes. It starts in the middle of the steady region of one phoneme and ends in the middle of the steady region of the next. In this case the point to be concatenated will always appear at the most steady state of a phone, and compared to using phoneme segments, the spectral mismatch at the concatenation point is decreased. As an example, the word 'mean', with the phonetic description [m, i:, n] (in SAMPA notation, see further in this section), corresponds to the diphone description [#-m, m-i:, i:-n, n-#], where # denotes a short silence. Figure 2.2 displays the signal representation of the two phones m and i: spoken consecutively, together with a classification of its different regions. The number of diphones needed to represent a language is basically the square of the number of phonemes, disregarding some non-existing phoneme combinations. This results in the English language comprising approximately 1500 to 2000 diphones [HAH+98].
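The phoneme-to-diphone expansion used in the example can be sketched as below (the helper name and the '#'-padding convention follow the example above; this is an illustration, not code from the thesis):

    function diphones = phones2diphones(phones)
    % Expand a phoneme list, e.g. {'m','i:','n'}, into the diphone names
    % {'#-m','m-i:','i:-n','n-#'}, padding with the silence symbol '#'.
        padded   = [{'#'}, phones(:)', {'#'}];
        diphones = cell(1, numel(padded) - 1);
        for i = 1:numel(padded) - 1
            diphones{i} = [padded{i} '-' padded{i+1}];
        end
    end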

The most general alphabet for representing phonemes is the International Phonetic Alphabet (IPA). It has the capability to express all spoken languages of the world with one standard notation. This notation is composed of a large set of different characters, most of which are not represented by ASCII codes.


Figure 2.2: Signal representation of the two phones m and i: spoken consecutively, with the defined conceptions marked: the steady regions of m and i:, the transition region between them, the cutting points and the transition point delimiting the diphone.

In computer systems it is, however, preferable to use an alphabet composed of a restricted number of combinations of ASCII characters. The SAMPA (Speech Assessment Methods Phonetic Alphabet) is one of the most popular machine-readable phonetic alphabets used today. It has a simple structure consisting of ASCII characters with up to two combined characters per phoneme. In Appendix A the SAMPA notation for British English is listed, together with descriptions of pronunciation and a classification according to how the phones are produced.

Another, more restrictedly used, phonetic alphabet is the MRPA (Machine-Readable Phonetic Alphabet). It was developed by the CSTR (the Centre for Speech Technology Research) at the University of Edinburgh in a project called Festival, which is further described in subsection 2.5.1 below. This alphabet considerably resembles the SAMPA, but uses only non-capital letters together with the character @. The MRPA can be mapped directly onto SAMPA notation, which can be seen in the appended MRPA-SAMPA lexicon in Appendix B.

Recorded speech phones can be classified according to their waveform as voiced or unvoiced signals, usually denoted V and UV, respectively. A voiced signal contains a clearly identifiable fundamental frequency with clear harmonics, while an unvoiced signal has a frequency spectrum resembling noise. Many speech phones, however, consist of a mixture of these two classes, with varying V and UV proportions in different frequency regions, and therefore a ratio V/UV is introduced, where 1 corresponds to a purely voiced signal and 0 to an unvoiced one.

    2.2 Speech Synthesis

    2.2.1 Synthesizing Process

A model of the speech synthesis process is shown in Figure 2.3, with a description of each block below. The figure describes a general concatenative-based TTS system starting at phonetic notation level.

Figure 2.3: Block scheme for the run-time operating functions of a general concatenative-based speech synthesizer (Phonemes & Prosody Info → Segment List Generator → Segment File Collector → Prosody Modification → Segment Concatenation → Waveform Generator → Speech, supported by the Segment Information and Synthesis Segments databases).

    Segment List Generator

In this block, the phonetic input notation is transformed into the pre-defined segment format of the synthesizer. The structure of the prosodic information is then changed so that it is expressed in a way corresponding to the defined format. This operation requires information about the stored segments, which is found in the Segment Information database together with data needed for further operations, such as the segment file addresses.

The functions following this block operate on one segment at a time, and therefore the segment list generator transmits the segment transcriptions with their corresponding information one by one.

    Segment File Collector

The Segment File Collector reads the current speech segment file from the Synthesis Segment database according to the file address received from the Segment List Generator, and transmits the file further.


    Prosody Modification

The desired prosodic properties are applied to the speech segment in this block. These properties usually refer to pitch and time duration (see subsection 2.2.2 below) and are applied with a method depending on the synthesizing algorithm. This process is described for PSOLA-based systems in section 2.3.

    Segment Concatenation

The method for concatenating two segments is independent of the chosen segment format. For a good concatenation result, the two segments should have as similar prosodic characteristics as possible in their concatenation parts, i.e., at the last and first ends, respectively. At these points the segments are assumed to have equal fundamental frequencies, and the cutting points are assumed to appear at the same position within their period times. A concatenation using the PSOLA technique is described in section 2.3.

Before the concatenation of the segments, a possible smoothing of discontinuities is performed. The concatenation process itself usually results in a smoothing of the end parts of the segments, but for instance in the method described in section 2.4 a spectral envelope smoothing is performed by linear interpolation (also described in that section). Since the concatenation (and/or smoothing) of one segment depends on the shape of the next one, a one-segment delay in the concatenation block is required. This delay forces concatenative-based synthesizers to be partly non-causal. However, the smoothing of one segment will never depend on any non-adjacent segment [Dut97], and the non-causality of the system is therefore clearly restricted.

    Waveform Generator

In some cases, for instance in rule-based TTS systems, the sound segment is parametrically stored or described by certain rules, and a Waveform Generator is then needed to decode the sound into a perceptible format, i.e., by generating sound waves.

    2.2.2 Prosodic Information

In linguistics, the term prosody refers to certain properties of a speech signal, usually audible changes in pitch, loudness and syllable length. Other properties related to prosody are for example speech rate and rhythm, though these are not as commonly used in TTS systems.

The pitch information is one of the most essential prosodic properties. It describes the 'melody' of the utterance and thereby prevents the output of a TTS system from sounding monotonous. Additionally, a stressed syllable can be symbolized by a fast and large pitch change, as well as by an increase in its duration. This syllabic length also varies for different positions within a word and is hence another important parameter for speech synthesis. The desired pitch can be expressed as a sequence of frequency labels consisting of information about time of appearance and value. The last common prosodic property mentioned is the loudness, which can also be defined as energy intensity. This parameter is, however, only of interest when producing emotional speech, since it is approximately constant within speech of normal temper [Dut97].


    2.3 PSOLA Method

The purpose of the PSOLA (Pitch-Synchronous Overlap and Add) technique is to change the pitch and duration of an audible signal without performing any operations in the frequency domain. This process can be divided into two main steps:

    1. decomposition of the signal into separate but overlapping parts, and

2. recombination of the parts by means of overlap-adding (OLA) with the desired pitch and duration modification considered.

The operations involved in these two steps are described in detail in the following subsection.

    2.3.1 PSOLA Operation Process

A signal³ s(t) is decomposed into several short-time signals (ST-signals) s_i(t) by windows generated as time-shifted versions of a window w(t). Each window is centralized around a pitch mark pm_i of the original signal. A pitch mark is defined as the time position of the signal maximum within one period T_{0_i} of the instantaneous fundamental frequency F_0, according to Figure 2.4. If the signal is cut so that pm_0 = 0, each pitch mark can be described as

    pm_i = \sum_{n=1}^{i} T_{0_n}, \qquad i \in \mathbb{N}.    (2.1)

As described, these pitch marks correspond to the time shifts of the windows, and the extraction of a general ST-signal can thus be expressed as

    s_i(t) = s(t)\, w(t - pm_i).    (2.2)

Note that the variable t is used as time index, although it usually denotes continuous time. Time indexing in discrete time, as in this case, is most often denoted by n, but for better understandability⁴ the variable t is used.

If the ST-signals are added together again but with a different time shift, i.e., with a new pitch-mark vector pm′, the reconstructed signal will have changed fundamental frequencies and is generated by

    s'(t) = \sum_i s_i(t - pm'_i).    (2.3)

If the original signal is strictly periodic, the periods T_{0_i} are equal for all i and the pitch-mark vector in (2.1) simplifies to pm_i = iT_0. The decomposition and recombination described in equations (2.2) and (2.3), respectively, can thus for periodic signals be redefined as

    s_{i,per}(t) = s_{per}(t)\, w(t - iT_0),    (2.4)

    s'_{per}(t) = \sum_i s_{i,per}(t - iT'_0).    (2.5)

³ For a TTS system, indicating one speech segment.
⁴ According to the author.


Figure 2.4: Signal representation of the phone i:. (a) Original signal s(t) with a window w(t) centralized around pm_i. (b) Extracted ST-signal s_i(t).

Theoretically, the reconstruction comprising a pitch change as described in (2.5) can be performed perfectly. This means that s'_{per}(t) must have the same spectral properties as s_{per}(t), with only a constant change of its fundamental frequency and harmonics. The statement can be proved with the Poisson formula [DL93]: if a signal

    f(t) \;\overset{\mathcal{F}}{\longleftrightarrow}\; F(\omega),

then

    \sum_{n=-\infty}^{+\infty} f(t - nT_0) \;\overset{\mathcal{F}}{\longleftrightarrow}\; \frac{1}{T_0} \sum_{n=-\infty}^{+\infty} F\!\left(\frac{n}{T_0}\right) \delta\!\left(\omega - \frac{n}{T_0}\right).    (2.6)

In words, the formula states that summing an infinite number of shifted versions of a given signal f(t) corresponds to sampling its Fourier transform with a sampling period equal to the inverse of the time shift T_0. The spectral envelope is hence preserved while the new harmonics are evenly spread, and the statement that a periodic signal can in theory be time-shifted perfectly is thereby confirmed.

As previously described, the windows used for creating the ST-signals s_i(t) are separated by the length of the local T_0. If a window size much larger than this is used, spectral lines⁵ will appear in the spectrum of the ST-signal [Dut97]. These spectral lines can prevent s(t) from being harmonized, since the sampling of the frequency domain (cf. (2.6)) can place frequency values at spectral dips. On the other hand, a too narrow window produces a very rough harmonization with approximated frequency values. Choosing a window size in between these two cases gives an optimal size of about twice the period time. With a window size of exactly 2T_0, the windows used for the ST-signal extraction in (2.2) overlap by one period time T_0.

⁵ A spectral line is a dominant absence or presence of a certain frequency (a spectral dip or peak).



The Poisson formula presumes an infinite number of equal ST-signals as input, see (2.6), which is only achieved for a stationary signal. Speech, however, is a non-periodic signal, but with a relatively slowly varying frequency spectrum. This property of quasi-stationarity allows equation (2.6) to be used for speech, but with restricted summation boundaries and hence a somewhat less perfect result. Furthermore, the described windowing requires a slow variation of the fundamental frequency, since the window size is defined as approximately 2T_0 and this condition only holds if T_{0_i} ≈ T_{0_{i+1}}. Consequently, the requirement of a quasi-stationary signal also arises from the windowing step.
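A simplified Matlab sketch of the decomposition (2.2) and recombination (2.3) is given below, assuming s is one speech segment, pm a vector of original pitch-mark sample indices and pmNew a new pitch-mark vector of the same length; the Hanning-shaped window of about 2T_0 follows the discussion above, while boundary handling and output length are deliberately crude (all names are hypothetical):

    function sNew = psola_ola(s, pm, pmNew)
    % Extract ST-signals around the original pitch marks pm (eq. 2.2) and
    % overlap-add them at the new pitch marks pmNew (eq. 2.3).
        s    = s(:);
        sNew = zeros(pmNew(end) + max(diff(pm)) + 1, 1);
        for i = 1:numel(pm)
            T0 = pm(min(i+1, numel(pm))) - pm(max(i-1, 1));
            if i > 1 && i < numel(pm), T0 = round(T0 / 2); end   % local period
            T0 = max(T0, 2);                                     % guard degenerate spacing
            w  = 0.5 - 0.5 * cos(2*pi*(0:2*T0)' / (2*T0));       % window of about 2*T0
            lo = pm(i) - T0;      hi = pm(i) + T0;
            oLo = pmNew(i) - T0;  oHi = pmNew(i) + T0;
            if lo < 1 || hi > numel(s) || oLo < 1 || oHi > numel(sNew)
                continue;                                        % skip edge cases
            end
            sNew(oLo:oHi) = sNew(oLo:oHi) + s(lo:hi) .* w;       % overlap and add
        end
    end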

    2.3.2 Modification of Prosody

The pitch of the signal is changed by applying the new pitch-mark vector pm′ as done in equation (2.3). Each pitch mark pm'_j corresponds to a point in the original pm-vector, as shown in Figure 2.5, but with changed distances in between. If a pitch change by a factor k is desired, the new instantaneous period is T'_0 = T_0 / k. The factor k can take different values for each pm'_j, but only with relatively small variations, to preserve the quasi-stationarity assumption. A consequence of this pitch-shift method is that the duration of the resulting signal changes inversely proportionally to k. To obtain the desired signal duration, the number of ST-signals must then be changed, which is done by either duplicating or removing some ST-signals before recombination. The resulting expression for pm′ when applying prosody is therefore

    pm'_j = \sum_{n=1}^{j} \frac{T_{0_{a(n)}}}{k_{a(n)}},    (2.7)

where a(j) contains indices of the original ST-signals (indexed by i), i.e., the vector a indicates which ST-signals are used. This makes a a transfer function from pm to pm′.

Figure 2.5: Schematic example of a transfer function for the pitch-mark vector regarding pitch and duration modification. Here a = {1, 2, 2, 3, 4, 4, 5, 6, 6}.

An expression for the recombination of the extracted ST-signals, including both pitch and duration modification, can now be obtained by combining (2.3) and (2.7) into

    s'(t) = \sum_j s_{a(j)}\!\left(t - \sum_{n=1}^{j} \frac{T_{0_{a(n)}}}{k_{a(n)}}\right).    (2.8)

    2.3.3 TD-PSOLA as Speech Synthesizer

The PSOLA operation presented above constitutes the prosody modification part of a TTS system. An expansion of this technique is the TD-PSOLA (Time Domain-PSOLA), which functions as a complete speech synthesizer keeping all its operations in the time domain. With this method it is possible to change both pitch and time duration by a factor in the range 0.5 to 2 without any notable change in the position and bandwidth of the formants [Dut97].

The PSOLA operation used in this TTS system requires information about the pitch mark locations of each segment. This is usually generated in the speech analysis step of the segment data preparation part (see Figure 2.1) by a pitch detection algorithm. However, detecting the fundamental frequency in a signal with a low V/UV value is difficult and sometimes even impossible (when V/UV ≈ 0), and the pitch marking must therefore often partly be done by hand. For purely unvoiced signals no fundamental frequency exists, and the pitch marks are then spaced at a fixed distance, corresponding approximately to the average pitch of the speech segments.

At the point of concatenation, three different types of mismatches can appear as a consequence of varying segment characteristics: pitch, harmonic phase and spectral envelope mismatches. All of these can lead to a degradation of quality. In case of differing pitches, the PSOLA process can eliminate the mismatch by placing the windows in the recombination phase equally for both segments. However, since PSOLA is an approximate method, the process changes the spectral properties of the segments; if the pitch then has to be changed rather much, as for a relatively large pitch difference, a spectral mismatch will appear. A second case of audible mismatch occurs when two voiced signals have harmonics with differing phases. The phases of the fundamental frequency, though, are implicitly equalized through the pitch marking process, since the mark is always placed at the highest peak of the period.

The spectral mismatches described require operations in the frequency domain. Since the current synthesizer operates in the time domain, a compensation of these mismatches, by for example smoothing, cannot be done. However, in the special case of ST-signals with equal length (which occurs when the original pitch is constant), a spectral envelope smoothing can be performed in the time domain. This is further described in the next section.

    2.4 Extension of TD-PSOLA into MBR-PSOLA

An extension of the TD-PSOLA synthesizer has been developed by Dutoit and Leich [DL93], which involves a re-synthesis of the segment database. This extended TTS method, called MBR-PSOLA (Multi-Band Re-synthesis PSOLA), has the purpose of performing a more specific normalization in the pre-processing state than in the case of TD-PSOLA, and of not requiring additional pitch mark files. The resulting segments are stored and the same synthesis method can be used as before, together with a quality-improving interpolation block. Figure 2.6 displays this extension from the original TD-PSOLA synthesizer (gray blocks) to an MBR-PSOLA TTS system.

Figure 2.6: Extension of TD-PSOLA into MBR-PSOLA. The white blocks refer to the added operations (segment re-synthesis and linear interpolation); the gray blocks are the original TD-PSOLA synthesizer with its Segment Information database.

    2.4.1 Re-synthesis Process

The segment re-synthesis process consists of two normalization steps. First, the speech segments are recalculated to achieve constant pitch throughout the entire database. This has the consequence that the future window positioning performed in the PSOLA process can be given one fixed value for all segments, relative to the constant pitch period start, and therefore no additional pitch mark information is needed. The second re-synthesis operation comprises harmonic phase normalization of voiced signals. These phases are set to fixed values, valid for all segments. The choice of these phase values affects the sound quality considerably: constant or linearly distributed harmonic phases lead to a rather metallic sound, while a better result is achieved by giving the phases randomly distributed values. Additionally, tests performed in [DL93] have shown that keeping the phases of the high-frequency harmonics at their original values actually improves the quality. An upper border of about 2000 Hz for which harmonics to normalize proved to be the best value; if a higher value was used, no enhancement was noticed, while a too low value resulted in worse quality.

The method for re-synthesizing the TD-PSOLA segment database using the MBE operations described in the next subsection is shown in Figure 2.7. First, each segment is windowed into ST-signals according to equation (2.2). This requires, however, known pitch marks, and in this case no such information is available. Instead, the window size and position are calculated using a constant F_0 for the whole database, estimated from a rough average of the overall pitch (1/T_{0_{av}}) of the complete corpus. The pitch mark pm_i can hence be replaced with iT_{0_{av}}. Each ST-signal is then parameterized according to the MBE model (described in the next subsection) into


    • harmonic amplitudes (sampled spectral envelope),

    • harmonic phases, and

    • narrowband noise variances.

In the calculation process of these parameters, a voiced/unvoiced classification of the signal is included. This information is used to control whether the signal will be modified (if voiced) or returned unchanged (if unvoiced). Before the final storing of the segments, the normalized ST-signals are concatenated with the OLA (Overlap and Add) method.

Figure 2.7: Segment re-synthesis process using the MBE model (TD-PSOLA segments → windowing with w(t) → MBE parametrization → parametric normalization controlled by the V/UV decision → segment synthesis → OLA → MBR-PSOLA segments).
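The phase normalization itself can be illustrated with a simplified FFT-based sketch; the actual re-synthesis works on the MBE parameters rather than on raw FFT bins, so this is only an approximation of the idea. Here st is one constant-pitch ST-signal, fs the sampling rate and phiFix a vector of fixed pseudo-random phases (assumed long enough) generated once and reused for every ST-signal in the database; all names are hypothetical:

    % Give every ST-signal the same pseudo-random phases below ~2000 Hz and
    % keep the original phases above, following the findings in [DL93].
    N      = numel(st);
    S      = fft(st(:));
    f      = (0:N-1)' * fs / N;                   % bin frequencies
    low    = find(f > 0 & f < 2000);              % bins to normalize
    phi    = reshape(phiFix(1:numel(low)), [], 1);
    S(low) = abs(S(low)) .* exp(1i * phi);
    S(N - low + 2) = conj(S(low));                % keep spectrum conjugate-symmetric
    stNorm = real(ifft(S));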

    2.4.2 Spectral Envelope Interpolation

Another benefit of having constant pitch and identical harmonic phases is that spectral matching at the concatenation point of the synthesizer can be performed by a linear interpolation in the time domain between the ST-signals. The normalization implies that this so-called direct temporal interpolation, described below, is equivalent to an interpolation of the spectral envelope [DL93], which is what is wanted. Furthermore, the constant length of the segments also simplifies the interpolation by a direct position mapping between the samples or parameters.

If the segments s^L and s^R (the left and right segments) are to be concatenated, the two overlapping ST-signals can be denoted s^L_0 and s^R_0, respectively. Each s^X_n is described by the speech-sample or parameter set p^X_n, where X refers to the segment (L or R) and n to its window or ST-signal. The vector p^X_n has constant length, which is required for the vector operations. Supposing the difference |p^L_0 - p^R_0| is to be distributed over N_L windows of the left segment and N_R windows of the right, the spectral smoothing can be expressed as

    p'^L_{-i} = p^L_{-i} + (p^R_0 - p^L_0)\,\frac{N_L - i}{2 N_L}, \qquad i = 0, 1, \ldots, N_L - 1,    (2.9)

    p'^R_{j} = p^R_{j} + (p^L_0 - p^R_0)\,\frac{N_R - j}{2 N_R}, \qquad j = 0, 1, \ldots, N_R - 1,    (2.10)

where p'^L_{-i} and p'^R_j denote the new interpolated values of the samples or parameters describing the ST-signals s^L_{-i} and s^R_j, respectively.

The optimum number of ST-signals to use for the interpolation, i.e., N_L and N_R, varies between the different segments. It is preferable to avoid ST-signals from the transition part in the spectral smoothing, and since the length of a segment and its transition position vary, a segment-dependent selection of the number of smoothed windows is optimal. Additionally, spectral smoothing is only applied to voiced signals, and the selection of which segments to use can also be achieved by the same segment classification. This selection of the number of windows to use, i.e., the segment classification, is based on the V/UV information calculated by the MBE analysis process described in the following subsection.
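A direct sketch of (2.9) and (2.10), assuming pL and pR are matrices whose columns hold the equal-length sample or parameter vectors: column i+1 of pL is p^L_{-i} (so column 1 is the overlapping ST-signal p^L_0) and column j+1 of pR is p^R_j. The matrix layout is chosen here purely for illustration:

    % Linear spectral-envelope smoothing across the concatenation point.
    NL = size(pL, 2);   NR = size(pR, 2);
    d  = pR(:, 1) - pL(:, 1);                                % p^R_0 - p^L_0
    for i = 0:NL-1
        pL(:, i+1) = pL(:, i+1) + d * (NL - i) / (2 * NL);   % eq. (2.9)
    end
    for j = 0:NR-1
        pR(:, j+1) = pR(:, j+1) - d * (NR - j) / (2 * NR);   % eq. (2.10)
    end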

    2.4.3 Multi-Band Excitation Model

The Multi-Band Excitation (MBE) model was originally designed for speech storage compression in voice codecs [GL88]. It is based on a parameterization of the frequency domain of a speech signal, and since it includes information about the harmonic frequencies it is ideal for pitch and phase normalization. Below follows a description of the MBE parameterization of an arbitrary short-time speech signal.

Suppose a voiced ST-signal s_w(t) has the Fourier transform S_w(ω) according to Figure 2.8(a). This frequency-domain signal can be modelled as the product of its spectral envelope H_w(ω) (with phase included) and an excitation spectrum |E_w(ω)| [GL88],

    Ŝ_w(ω) = H_w(ω) |E_w(ω)|.    (2.11)

If the fundamental frequency ω_0 of the signal is known, the excitation spectrum can be expressed as a combination of a periodic spectrum |P_w(ω)|, based on ω_0, and a random noise spectrum |U_w(ω)| with variance σ². The periodic spectrum consists of peaks of equal amplitude at the fundamental frequency and its harmonics, as shown in Figure 2.8(c). A frequency band with a width equal to the distance between two harmonic peaks, centred on a harmonic, is defined as a harmonic band. A V/UV analysis is performed on S_w(ω) for each harmonic band and expressed as a binary decision using a threshold value, see Figure 2.8(d). The two spectra are combined using the V/UV information to generate |E_w(ω)| by

    |E_w(ω)| = V/UV(ω) · |P_w(ω)| + (1 − V/UV(ω)) · |U_w(ω)|,    (2.12)

and these different spectrum parts can be seen in Figure 2.8(c-f). Figure 2.8(b) displays the spectral envelope |H_w(ω)|, which is usually represented by one sample value per harmonic in both voiced and unvoiced regions to reduce the number of parameters. Finally, the resulting synthetic signal spectrum Ŝ_w(ω) can be seen in Figure 2.8(g), calculated as described above.

Figure 2.8: Example of an MBE modelled signal. (a) Original spectrum, (b) Spectral envelope, (c) Periodic spectrum, (d) V/UV information, (e) Noise spectrum, (f) Excitation spectrum, (g) Synthetic spectrum.

The estimation of the parameters in this method is based on the least-squares error between the synthesized spectrum |Ŝ_w(ω)| and the original spectrum |S_w(ω)|. This approach is usually termed an analysis-by-synthesis method. First, the spectral envelope and the periodic spectrum are estimated in the least-squares sense. Then the V/UV decisions are made by comparing the resulting spectrum to the original for each harmonic band and using a threshold value for the error to determine whether the band is voiced or unvoiced.
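A sketch of how the excitation magnitude of (2.12) can be assembled on a discrete frequency grid f [Hz], given the fundamental frequency f0, one binary V/UV decision per harmonic band (vuv) and a noise variance sigma2; the band edges are placed halfway between harmonics as described above, the noise model is deliberately crude and all names are hypothetical:

    % Build |E_w| of eq. (2.12) from per-band V/UV decisions.
    nHarm = numel(vuv);
    P = zeros(size(f));                          % periodic spectrum |P_w|
    for h = 1:nHarm
        if h * f0 > f(end), break; end
        [~, bin] = min(abs(f - h * f0));         % unit peak at each harmonic
        P(bin) = 1;
    end
    U    = sqrt(sigma2) * abs(randn(size(f)));   % noise spectrum |U_w|
    band = min(max(round(f / f0), 1), nHarm);    % harmonic band index per bin
    V    = reshape(vuv(band), size(f));          % 0/1 per frequency bin
    E    = V .* P + (1 - V) .* U;                % eq. (2.12)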

    2.4.4 Benefits with the respective PSOLA Methods

    TD-PSOLA

• High naturalness of the synthesized speech because of 'untouched' segments.

• Less sensitive to analysis errors regarding V/UV classification [Dut94].

    • Simpler data preparation step.

    MBR-PSOLA

• No mismatch in harmonic phase and pitch.

• No external pitch marks needed; they are implicitly calculated.

    • Simple spectral smoothing possible.

• Good database compression potential.

    2.5 Utilized Data from External TTS Projects

There exist numerous companies and universities researching and offering products in the area of TTS systems. The availability of these results varies between the owners, but in most cases the solutions are not disclosed. The research projects mentioned in this section all to some extent underlie the TTS system investigated in this thesis.

    2.5.1 Festival and FestVox

The CSTR (Centre for Speech Technology Research) is an interdisciplinary research centre at the University of Edinburgh. One of their project products, the Festival Speech Synthesis System, contains a full concatenative-based TTS system with different synthesis methods implemented. Except for the PSOLA-based synthesizer, the software for the various TTS systems is distributed under a free license [Fesa]. The latest version is Festival 2.0, which is developed for a number of languages: British and American English, Spanish and Welsh.

A further improvement of the Festival TTS system has been developed at Carnegie Mellon University (CMU) through their project FestVox. The aim of this project is to make the building of new synthetic voices more systematic and better documented. FestVox 2.0 is the latest version, released in January 2003 [Fesb], with software free to use without restrictions for both commercial and non-commercial purposes. The databases involved are presented by FestVox on their homepage, containing, among other things, two voice databases consisting of all possible diphones for American and British English. They were developed by the CMU and the CSTR, respectively, and include waveforms, laryngograph (EGG) files, hand-corrected labels of start, stop and transition points, and extracted pitch marks. The pitch marks are not hand corrected and thus not completely reliable.


The data used in this thesis are extracted from the British database called CSTR UK RAB Diphone. (A detailed description of the licence is attached in Appendix C.) The relevant data are as follows:

1. Recorded speech corpus spoken by one British male speaker, covering all possible diphones for British English. The data is stored as wave files comprising a set of 2001 nonsense words, with a sampling rate of 16 kHz and a precision of 16 bits.

2. A list of all diphones described with MRPA notation, resulting in 2005 items. Each diphone is complemented with information about the corresponding sound segment, consisting of the name of the wave file, labels with the start and stop positions of the current diphone segment, and a label for its transition point. The position labels are hand corrected and expressed in seconds with three-decimal precision. (A sketch of how such an index can be read is given after this list.)

3. Data files with extracted pitch marks. Each pitch mark file corresponds to one wave file, and hence 2001 such files exist. The pitch positions are expressed in seconds with seven decimals.

The notation of the diphones used in the segment information list (point 2 in the list above) follows the MRPA scheme described in Appendix B. Additional information has been included in the form of the three symbols #, _ and $. The first corresponds to a short silence, while the second indicates a consonant cluster, i.e., two consonants appearing within a word instead of between two words. This can be exemplified by the notation t - _r, meaning the /tr/ as in 'true' and not as in 'fat rat'. The last character, $, is investigated in subsection 3.2.2 and found to symbolize a word border between a plosive and a vowel.

    2.5.2 MBROLA

Another partly freely available TTS system is presented by the MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons in Belgium [MBR]. A product with the same name as the project, MBROLA, has been developed, consisting of a speech synthesizer based on the MBR-PSOLA technique. It takes a list of phonemes as input, together with prosodic information consisting of phoneme durations and a pitch description. The design of the system requires a diphone database, but apart from this it can operate on any voice and language, assuming the defined input format. Since the starting level is phonetic notation, MBROLA is rather a phoneme-to-speech system than a complete TTS system.

The MBROLA synthesizer is provided free of charge only for non-commercial and non-military applications. Originally it consisted of one single segment database, a French speaking male voice. Today the system is available for several different languages and voices through cooperation with other research labs or companies contributing their diphone databases. The download package from MBROLA includes example sentences to use as input and an executable file. No source code is available since the algorithm is protected. The example sentences are stored in the format

    phoneme duration (pitch position pitch position ...)


with SAMPA notation for the phonemes and the duration expressed in milliseconds. The position of each optional pitch definition [Hz] is given in percent of the specified phoneme duration. Several pitch definitions can appear, and a linear pitch interpolation between the pitch positions is then intended. In total there are three different test files, which together include almost 30 seconds of speech.
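As a concrete illustration, an input fragment in this style could look as follows (the phoneme symbols and values are invented for the example and not taken from the actual test files); each line holds a phoneme, its duration in milliseconds, and optional pairs of position [%] and pitch [Hz]:

    _    150
    t    50    50 115
    u:   85    10 110   90 125
    l    40
    _    150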

Chapter 3

    Implementation

As implementation tool for the TTS system designed in this thesis, Matlab is used throughout, specifically a Linux based Matlab of version 7.0. The synthesizer follows the TD-PSOLA model presented in the previous chapter, with diphones as the segment format, and combines the Festival diphone databases with the input format of MBROLA using SAMPA notation. The original corpus is stored as wave files with a sampling frequency of 16 kHz. This frequency is kept during the synthesis process and the output is stored as a wave file.

In this chapter the implementation is described step by step through a division into a Segment Data Preparation part and a Speech Synthesis part. When a phoneme is given by its phonetic description, the MRPA alphabet is intended if no other phonetic alphabet is explicitly stated.

    3.1 Segment Data Preparation

The segment data preparation process operates on the three Festival databases listed in subsection 2.5.1. It creates the Segment Information database and the Synthesis Segment database, the latter consisting of diphone segments and pitch mark vectors. In principle the operation model follows the structure displayed in Figure 2.1. The difference in this case is that the pitch mark vectors already exist and are used to define the cutting points in the speech corpus. Secondly, information about where each diphone appears and about its transition point is known. The process in this case is better described by Figure 3.1, where the pitch mark operation block corresponds to the Speech Analysis in Figure 2.1, and the segment information operations and the diphone extraction block are part of the Selective Segmentation block. The databases denoted Pitch Marks and Diphone Segments both represent the Synthesis Segment database block in Figure 2.1. Below follows a description of the modifications performed on the three original databases.

    3.1.1 Segment Information Modification

The segment information file available from Festival contains data about each diphone in the format

    diphone file name start point transition point end point



Figure 3.1: Block scheme of the segment data preparation process performed on Festival's databases (the ORIGINAL databases Pitch Marks, Segment Information and Speech Corpus are processed by the blocks Diphone Info Recalculation, Exclusion of unneeded Diphones, Valid Pitch Mark Extraction, Diphone Extraction and Energy Equalization into the MODIFIED databases Segment Information, Pitch Marks and Diphone Segments).

where the diphones are denoted in MRPA with the character - as a separator between the phones. The given points are expressed in seconds and refer to where in the given file the diphone appears.

As will be described in subsection 3.2.2, the character _ is used for denoting different versions of the recorded segments. It is only added to phones one character in length and, in almost all cases, only on one side of the diphone. The exceptions are the notations K - R, where K ∈ [k,p,t] and R ∈ [r,w,y,l], which are marked on both sides. To simplify the TTS system these diphones are removed, so that each phone part is restricted to two characters. This simplification does not affect the result considerably, since the first phone part of these segments is hardly audible and was therefore probably intended only for special cases. Before the removal, though, their segment addresses overwrite the addresses of the corresponding diphones marked on one side only, i.e., their sound files are used for those K - R diphones.

When the diphones are extracted from the speech corpus and stored in new files (see subsection 3.1.3), the start point is no longer needed in the segment information file, and the file addresses are changed. Instead of the start points, the position of the first pitch mark is included. This information is used in the synthesis process when the time duration of the diphone is calculated. Since one period time T0 is overlapped in the OLA process, this time loss has to be taken into account to obtain the correct length of the output.


    3.1.2 Pitch Marks Modification

If the diphones are cut at the positions of pitch marks, the phase of the fundamental frequency is the same at the start point of all segments. The defined cutting points are therefore given the position of the closest pitch mark, which is not the case in the original information file.

The pitch mark database contains pitch marks for the whole speech corpus, i.e., also for the parts outside the diphone regions. The pitch marks valid for each diphone are extracted and stored in the modified pitch mark database.

    3.1.3 Speech Corpus Modification

The modified start and end points, which appear at pitch mark positions, are used for the extraction of the diphones. As can be seen in Figure 3.1, an energy equalization is performed before the final storing. This equalization focuses on each group of related phones by setting the pitch periods to be concatenated, i.e., the first or last period of the diphone, to the average value for each phoneme. For some phones, as in the case of plosives, there is a short silence before and after the pronounced part of the phone. An equalization of these phones is therefore not realistic, and after a subjective energy analysis the two fricatives th and dh, which have the same signal characteristics as plosives, are also excluded. The different versions of each phoneme, i.e., the phoneme x together with its variants carrying the additional _ or $ markers, are classified as the same phoneme in the equalization calculations. The equalization is linearly distributed at sample level over each diphone. As an example, if the diphone m-i: of sample length L is to be equalized with a factor a for m and b for i:, each sample value s(n) is multiplied by the factor a + ((b − a)/L) · n.
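A minimal Matlab sketch of this linear gain distribution (the variable names are illustrative and not taken from the thesis code; s is assumed to hold the diphone samples as a column vector, and a, b the equalization factors of its two phones):

    L = length(s);              % number of samples in the diphone
    n = (0:L-1)';               % sample index
    g = a + (b - a)/L .* n;     % gain ramps linearly from a towards b
    s_eq = s .* g;              % energy equalized diphone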

    3.1.4 Additive Modifications

When the pitch marks are displayed together with their corresponding diphone segments, several mismatches are found in the form of missing or considerably misplaced pitch marks. About ten percent of the segments contain two consecutive pitch periods whose lengths differ by a factor of 2 or more, i.e., one of the pitch periods is at least twice as long as the other. The pitch marks of these segments are considered unreliable and are therefore adjusted by hand. It is also found that the second phone part of the diphone m-p does not comprise a whole pitch period, and hence it is lengthened by one period. During the quality evaluation it was found that the diphone au-@@ was incorrectly pitch marked (see 4.1.1), and it was hence also corrected by hand.
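Such a consistency check can be sketched in Matlab as follows (illustrative only; pm is assumed to hold the pitch mark positions of one segment, in seconds):

    T = diff(pm);                                       % pitch periods between marks
    ratio = max(T(2:end)./T(1:end-1), T(1:end-1)./T(2:end));
    suspicious = any(ratio >= 2);                       % flag segment for hand correction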

    3.2 Speech Synthesis

The Speech Synthesis process is divided into the four main blocks presented in Figure 2.3. In this thesis the signals are not coded into parametric form, and hence the Waveform Generator block can be excluded. The Segment File Collector only reads the stored data and involves no further operations, and is hence not described in a subsection below.


    3.2.1 Input Format

The structure of the input for this TTS system is based on the MBROLA format described in subsection 2.5.2. The only data required here are the phonemes expressed in SAMPA; the others (duration and pitch definition) are optional, though the units of the values must be as described. If the pitch information is missing, the original frequencies are used without any interpolation between the varying pitches of the diphones. A condition for having an input without duration information is that the pitch definitions are missing as well. In this case the duration of each phoneme is given a certain default value, as described in the following subsection.

Another optional input is information about the borders between the words. If a small pause is intended, the SAMPA character _ (mapped to # in MRPA) can be used. In the case of a word border without a pause, the system identifies the border by the character , in the input sequence. This word border information can improve the quality of the result, since some diphones are pronounced differently depending on whether they describe a diphone within a word or at the border between two words. The diphones concerned are presented later in this chapter.

    3.2.2 Segment List Generator

The operation process of the Segment List Generator part of the synthesizer consists of the following six steps:

1. If the input does not include duration values for the phonemes, certain default values are given, see below.

2. Mapping of the SAMPA denoted phonetic input into MRPA notation according to the lexicon in Appendix B. The result is transcribed into a sequence of diphones (a small sketch of this step is given after this list).

3. Applying the additive notes _ and $ for different diphone positions as described below.

4. Reading the information corresponding to the current diphones from the Segment Information database.

5. Calculation of the distribution of the desired duration between the two current phone parts, as explained below.

6. Expressing the positions of the input frequencies relative to a diphone instead of a phone. For a description, see the last part of this section.
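A minimal Matlab sketch of step 2 above (the mapping entries are placeholders, not the full lexicon of Appendix B, and the variable names are not taken from the thesis code):

    % Map SAMPA symbols to MRPA and transcribe the phoneme sequence into diphones.
    sampa = {'_','t','u:','l'};   mrpa = {'#','t','uu','l'};   % placeholder lexicon
    input = {'_','l','u:','_'};                                % example phoneme input
    mapped = cell(size(input));
    for k = 1:numel(input)
        mapped{k} = mrpa{strcmp(input{k}, sampa)};             % SAMPA -> MRPA lookup
    end
    diphones = cell(1, numel(mapped)-1);
    for k = 1:numel(mapped)-1
        diphones{k} = [mapped{k} '-' mapped{k+1}];             % e.g. '#-l', 'l-uu', 'uu-#'
    end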

    Default Values for Phoneme Duration

If the input only consists of phonetic notes, the duration of the phonemes is given a default value. These values are based on the duration values used in the French TTS test described in [Dut94], where the French phonemes are given the following durations: [a,E,9,i,O,y,u] = 70 ms, [e,2,o,a∼,o∼,e∼] = 170 ms, fricatives = 100 ms, sonorant liquids = 80 ms, sonorant nasals = 80 ms, plosives = 100 ms. The phonemes in the square brackets are denoted with the French SAMPA notation and describe the French vowels. These are listed in [GMW97] and are freely interpreted for English such that the first set corresponds to the vowels classified as checked vowels together with the central vowel, while the second matches the free vowels; see Appendix A for this phoneme classification. The phone groups affricates and sonorant glides are not defined and are freely interpreted as having the same values as the fricatives and the two sonorant groups, respectively. Finally, the pause character is given the duration value 150 ms. Applying the described duration values results in rather slow sounding words and sentences. Halving these values gives a more natural speech rhythm. The resulting duration values are summarized in Table 3.1 below.

    Duration    Phoneme Group
    50 ms       fricatives, plosives, affricates
    40 ms       sonorants (liquids, nasals and glides)
    35 ms       checked and central vowels
    85 ms       free vowels
    150 ms      pauses

Table 3.1: Default value for the duration of each phoneme group.
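As a minimal Matlab sketch, the defaults of Table 3.1 could be assigned as follows (the group membership shown covers only a few example phonemes; the full classification follows Appendix A, and the names are illustrative, not the thesis code):

    % Default phoneme durations [ms] according to Table 3.1.
    groups    = {'fricative','plosive','affricate','sonorant','checked_vowel','free_vowel','pause'};
    durations = [ 50          50        50          40         35              85           150  ];
    phoneGroup = {'s','fricative'; 't','plosive'; 'm','sonorant'; 'i','checked_vowel'; '#','pause'};
    phone = 'm';                                            % example phoneme
    grp   = phoneGroup{strcmp(phone, phoneGroup(:,1)), 2};  % look up its group
    dur   = durations(strcmp(grp, groups));                 % -> 40 ms for /m/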

    Additional Diphone Notations

For some diphones there exist two different recorded speech segments in the database obtained from Festival. The different versions are separated in the diphone description by the MRPA note _ or $, depending on the content. As mentioned in subsection 2.5.1, the character _ indicates a diphone consisting of consonants that appear within a word instead of between two words. The second character, $, always appears in combination with a vowel, and its purpose was found, through listening tests of the diphone segments, to be indicating a word border between a vowel and a plosive. Notice the different events indicated by the two additive characters: a diphone within a word is indicated by _, while $ refers to a diphone between two words. The additional diphone segments included in this thesis are the following combinations:

    1. [plosive] - [sonorant liquid or glide],

    2. [plosive]$ - [vowel] with the vowel @ excluded, and

    3. [vowel] - $[plosive] with @ and t excluded.

In the case of having no information about the word borders, the addition of the special character according to case 1 above is performed on all diphone segments satisfying the condition. The audible difference between two diphone segments with and without this additional notation is explained in section 4.2. In the original database from Festival there exist more diphone combinations including the characters _ and $ than the three cases presented above. These combinations, however, do not comprise any important audible difference in the resulting speech output and are excluded from this TTS system.


    Duration Distribution

The desired phone duration is distributed proportionally over the diphone segments involved, i.e., over the phone parts of the segments to be concatenated. This can be described through the example of a phoneme a given as input with a desired duration d, where a is to be synthesized by the two diphones m-a and a-t. The length of each diphone segment is read from the Segment Information database together with the time positions of the first pitch mark and the transition point. The time lengths of the phone parts of the two diphones can be denoted m2|a1 and a2|t1, respectively. The first part of a diphone is calculated as the time between the first pitch mark and the transition point, and the last part corresponds to the remaining time from the transition point to the end of the segment. The duration d is then applied by changing a1 and a2 into the new phone part lengths a′1 and a′2. The proportional distribution can then be expressed as

    a′1 / a′2 = a1 / a2,   and   d = a′1 + a′2.
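In Matlab this proportional split reduces to two lines (a sketch with illustrative variable names; a1 and a2 denote the original phone part lengths and d the desired duration):

    a1_new = d * a1 / (a1 + a2);    % new length of the a-part of m-a
    a2_new = d * a2 / (a1 + a2);    % new length of the a-part of a-t
    % a1_new/a2_new equals a1/a2 and a1_new + a2_new equals d.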

    Frequency Distribution

The positions of the defined frequency marks are given at the input as percentages of the duration of a phoneme. Each phoneme can have no, one or several frequency marks, and between two marks an interpolation is intended. Two different interpolation methods have been implemented, requiring different data to be calculated in this operation block: a Local and a Comprehensive frequency interpolation. The latter method is selected for the final TTS synthesizer; in subsection 4.2.2 the two methods are compared.

The Comprehensive Frequency Interpolation method consists of a purely linear interpolation between the frequency marks. First the percentage expression is recalculated into the time scale of the phonemes using the time data from the Segment Information database. Two vectors are then created containing the positions in time of each frequency mark and of each boundary point, respectively, expressed from the first point of the very first diphone. By a simple linear interpolation, the desired frequencies at the beginning and end of each diphone segment are calculated and passed on. The last point of one segment is in other words given the same value as the first point of the next. For those diphones that include frequency marks, the positions of these are also appended, expressed in time from the beginning of the diphone segment.
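A minimal Matlab sketch of this interpolation (variable names are illustrative: markTime and markFreq hold the frequency mark positions and values on the common time axis, boundaryTime the diphone boundary positions; clamping boundaries outside the outermost marks to the nearest mark value is an assumption, not something stated in the thesis):

    f = interp1(markTime, markFreq, boundaryTime, 'linear');   % boundary frequencies
    f(boundaryTime <= markTime(1))   = markFreq(1);            % before the first mark
    f(boundaryTime >= markTime(end)) = markFreq(end);          % after the last mark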

In the case when no frequency information is available, the frequency marks are based on the frequencies of the stored segments. Since the position of the first pitch mark is included in the Segment Information database, the initial frequency of each segment is known. This value is set as the beginning frequency of a diphone and copied to the end of the previous one. The interpolation in the Prosody Modification step can then be applied without further changes in this case of no frequency information.

The Local Frequency Interpolation, which was chosen not to be used, is based on interpolation only within a diphone. If a diphone does not include a frequency mark, the original frequencies are used. An example of this mapping from a phone based to a diphone based positioning of a frequency mark is shown in Figure 3.2. The information that a mark exists in the previous diphone is applied by copying the value of this mark and placing it at the beginning of the current diphone and at the end of the old one, i.e., around the cutting point as shown in the example figure. The broken line shows the intended future interpolation.


Figure 3.2: Example of a mapping by local interpolation of the frequency marks f1, f2 and f3 from phone based into diphone based positioning. Each dot corresponds to a frequency mark and the broken line displays the intended interpolation.

    3.2.3 Prosody Modification

The prosody modification is performed with the PSOLA method described in section 2.3. Desired frequencies and durations are applied by generating a new pitch mark vector according to (2.7), as described below. Finally, the PSOLA decomposition and recombination is performed.

When frequency marks exist for the current segment, the new pitch mark vector is calculated keeping the length of the original vector. The period times corresponding to the desired frequency values are placed according to their mark positions, and a linear interpolation is performed between the values. In the frequency application method chosen for the synthesizer, all positions in the pitch mark vector are automatically given a value. However, in the second method special cases must be considered. If the frequency marks appear on only one side of the transition point, as for Diphone A in Figure 3.2, the original period time values are used for the non-interpolated part, starting from the boundary point. In the case of no frequency marks at all, the whole original pitch mark vector is used as the new vector.

Once the new pitch mark vector is calculated, the defined duration is applied. First, the relation r between the desired and the current duration is calculated for both phone parts, where the current duration is obtained from the last element in the new pitch mark vector. The relation r is then used for mapping the original pitch period indices into the transfer vector a in (2.7), by dividing the range from the first to the last index into numbers separated by r and then rounding them to integer values. For phone parts consisting of three or more pitch periods, the pitch period including the boundary point is excluded when the transfer vector a is created and then added after the distribution of the indices. See subsection 4.2.3 in the next chapter for a justification of this operation.
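One possible Matlab sketch of building the transfer vector a for one phone part (illustrative only; pm_orig and pm_new are assumed to be the original and new pitch mark vectors, and the special handling of the boundary-point period described above is omitted):

    nNew  = numel(pm_new);                 % pitch marks required at the output
    nOrig = numel(pm_orig);                % pitch marks available in the original
    a = round(linspace(1, nOrig, nNew));   % rounded, evenly spread original indices
    % a(k) points to the original pitch mark reused at new mark k.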


The decomposition of a segment into ST-signals, equation (2.2), is performed with a Hanning [Den98] window positioned at the original pitch marks. The size of the window is twice the instantaneous period T0, i.e., the pitch period calculated from the previous pitch mark to the current, window-centring mark. The extracted ST-signals are then recombined using the new pitch mark vector with the mapped indices a included.
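The decomposition and recombination can be sketched in Matlab as follows (not the thesis code; x is the segment, pm and pm_new the original and new pitch mark positions in samples, and a the index mapping from new pitch marks to original ones; boundary handling is omitted and a(k) ≥ 2 is assumed):

    y = zeros(pm_new(end) + max(diff(pm)), 1);
    for k = 2:numel(pm_new)
        i  = a(k);                                     % original mark mapped to this position
        T0 = pm(i) - pm(i-1);                          % instantaneous pitch period
        w  = 0.5*(1 - cos(2*pi*(0:2*T0-1)'/(2*T0-1))); % Hanning window of length 2*T0
        st = x(pm(i)-T0+1 : pm(i)+T0) .* w;            % ST-signal centred at mark i
        p  = pm_new(k) - T0 + 1;                       % place it at the new mark
        y(p:p+2*T0-1) = y(p:p+2*T0-1) + st;            % overlap and add
    end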

    3.2.4 Segment Concatenation

Since the segments are windowed with a Hanning window around each pitch mark and then added by the OLA method, both ends of the segments remain smoothed, with the value 0 at the very first and last points. The concatenation of two diphones is performed using the same technique as for the duration modification, i.e., OLA, where the ST-signals to be concatenated are the first and last pitch periods. This requires a delay of one diphone (see the Segment Concatenation subsection in 2.2.1), and the overlap is positioned so that the last point of the older segment is added onto the first pitch mark point of the newer one.
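A minimal Matlab sketch of this concatenation step (y1 and y2 are two already synthesized diphone waveforms and T0 the length in samples of the overlapping pitch period; the exact positioning described above is simplified to a T0-sample overlap):

    out = zeros(length(y1) + length(y2) - T0, 1);
    out(1:length(y1)) = y1;
    i0 = length(y1) - T0 + 1;               % overlap region of one pitch period
    out(i0:end) = out(i0:end) + y2;         % overlap-add of the two diphones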

Chapter 4

    Evaluation

An evaluation of a TTS system can be performed from several aspects. The most relevant evaluation issues for an embedded implementation area, as is the case in this thesis, are:

• Storage properties – The size of the database needed, which, besides the amount of information, also depends on the coding possibilities of the data and the choice of codec.

• Computational complexity – The number and type of operations per sample needed for synthesizing a text.

• Usability – Investigation of suitable application areas, such as reading e-mails or items in a menu, and the possibility to extend the TTS system to languages other than English.

• Speech quality – The perceived characteristics of the synthesized speech in terms of intelligibility, fluidity and naturalness.

• Implementation costs – The time needed to develop the system, including the collection of data, and the cost of the required hardware devices.

The TTS synthesizer developed in this thesis is designed to suit the defined implementation area. Since this is a feasibility study on a TTS system for embedded devices, it cannot be definitively evaluated with respect to all the issues above. The evaluation presented in this chapter refers to the quality of the synthesized speech and to the chosen solution.

Because of the restricted time frame of this project, an extensive quality evaluation cannot be performed. For an overall quality judgment of a speech synthesis system, a MOS (Mean Opinion Score) [RM95] analysis is usually performed, which requires many listeners for a reliable result. The listener grades the quality in the three categories intelligibility, fluidity and naturalness on a five-level MOS scale from 1 to 5, where 1 corresponds to bad and 5 to excellent.

The audible evaluations described in this chapter are performed by only one listener, i.e., the author. The quality conclusions are based on relatively clear differences and are assumed to hold for any arbitrary listener. Additionally, the judgment of the quality can only be relative and not absolute. This arises from the lack of an original or maximum



quality-valid signal, and therefore only a comparison between different synthetic speech signals can be performed. The intelligibility of the evaluated speech signals is very good in all cases and is hence not generally mentioned in the quality judgments.

    4.1 Analysis of the Segment Database and Input Data

    4.1.1 Pitch Marks

As mentioned in the previous chapter, the pitch marks obtained from Festival are not always correct. In the implementation part, an analysis of the relation between two consecutive pitch periods is performed, and in the cases where a diphone contains a ratio of 2 or more, the pitch mark files are hand corrected. Many errors still exist in the database, resulting in audible mismatches in the synthesized speech. One example of this can be seen in Figure 4.1, where the upper signal shows the diphone segment au-@@ and the lower one a part of a sentence created with the speech synthesizer containing this diphone. The dotted lines display the pitch marks, and the misplacement can clearly be seen as an irregular signal pattern in the lower figure, which audibly results in an annoying disturbance.


Figure 4.1: Pitch misplacement. (a) The diphone au-@@ with corresponding pitch marks. (b) The same diphone with prosody modification.


    4.1.2 Spectral Mismatch

In the generated synthetic speech files a major spectral mismatch has been found. It appears in the combination of the two diphones k$-au and au-@@ and results in an audibly annoying disturbance. The two diphone segments are presented in Figure 4.2(a) with the end parts to be concatenated visible. The result of the concatenation can be seen in Figure 4.2(b). When the two segments are listened to separately, the quality sounds good, and therefore it can be concluded that the spectral mismatch is the reason for the disturbance.


Figure 4.2: (a) Parts of the two diphones k$-au and au-@@. (b) Concatenation of the diphones with prosody modification included.

    4.1.3 Fundamental Frequencies

The speech corpus from which the diphone segments are extracted is spoken with a somewhat hoarse voice and at a relatively low frequency. The latter can be seen in the histogram in Figure 4.3, where the pitch periods of the whole segment database are included, resulting in an average fundamental frequency of 109 Hz. The values in the histogram are based on the pitch marks, and since these have been found to contain some misleading values, the result is not fully reliable. If the errors are assumed to be symmetrically distributed, the average value would be the same as for a perfect analysis. The variance of the distribution of the real frequency values is, however, presumably somewhat smaller.
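The histogram values follow directly from the pitch marks; a Matlab sketch for one segment (pm in seconds, variable names illustrative) would be:

    f0 = 1 ./ diff(pm);     % one fundamental frequency estimate per pitch period [Hz]
    f0_mean = mean(f0);     % pooled over all segments this gives the ~109 Hz average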

In the test data received from MBROLA the average fundamental frequency is much higher than that of the recorded diphone segments. Figure 4.4 displays the distribution of the desired frequency values, which have an average of 140 Hz. The length of time that each frequency is intended to last is not taken into account as in the previous histogram, but the spread of the values is still representative.


    0 50 100 150 200 2500

    500

    1000

    1500

    2000

    2500

    3000Mean value = 108.6 Hz

    Frequency [Hz]

    Figure 4.3: Histogram of the fundamental frequencies of the diphone segmentdatabase.

    0 50 100 150 200 2500

    5

    10

    15

    Mean value = 139.6 Hz

    Frequency [Hz]

    Figure 4.4: Histogram of the fundamental frequencies of the test files.

    4.2 Solution Analysis

    4.2.1 Window Choice for the ST-signal Extraction

If the PSOLA operations are performed on a segment without any prosody modifications involved, meaning pm′ = pm, the output signal should be a good approximation of the input¹. A rather common window to use in the decomposition process is the Hanning window, and the outcome of this window choice can be evaluated by using a step function as input. The resulting output signal can be seen in Figure 4.5, together with the results of using two other common window functions, the Hamming and the Blackman. The pitch periods are given a constant value corresponding to a fundamental frequency of about the average pitch of the stored segments. With a sampling frequency of 16 kHz this results in a pitch period of about 145 samples, i.e., a window length of 290 samples and a position shift between consecutive windows of 145 samples. In the case of the Hamming window a slight amplification can be seen, together with sharp ends of the windows. If an ST-signal were extracted with this window, a discontinuity would appear at its ends. The step response using a Blackman window, on the other hand, results in a substantial oscillation.

To fulfil the desire of a good approximation of the input, the chosen window must be symmetrical in amplitude when no bias level is included.

¹The terms input and output in this section refer to the signal before and after the PSOLA operations are applied.


Figure 4.5: Step response of the PSOLA method with three different windows (Hanning, Hamming and Blackman).

This is the case for the Hanning window, which results in a constant output amplitude of 1, as can be seen in Figure 4.5.
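This constant-amplitude behaviour can be reproduced with a small Matlab sketch of the step response (values as above; not the thesis code):

    T0 = 145;                                      % constant pitch period [samples]
    N  = 2*T0;                                     % window length, twice T0
    w  = 0.5*(1 - cos(2*pi*(0:N-1)'/(N-1)));       % Hanning window
    x  = ones(800, 1);                             % step (constant) input
    y  = zeros(size(x));
    for c = T0:T0:numel(x)-T0                      % window centres shifted by T0
        y(c-T0+1:c+T0) = y(c-T0+1:c+T0) + x(c-T0+1:c+T0) .* w;
    end
    % Away from the edges y stays very close to 1, as in the Hanning panel of Figure 4.5.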

    4.2.2 Frequency Modification

As described in the Implementation chapter, two different methods for the application of the frequency marks have been realized. The results of these interpolation methods can be seen in Figure 4.6, where the resulting fundamental frequencies are displayed for each pitch period of a certain sentence.

    Local Frequency Interpolation

The first method, where interpolation is performed only within a diphone, results in a rather unnatural sounding speech signal with low fluidity. It has even worse quality than the case of no frequency modification at all. The reason for this poor quality is most probably the major frequency difference between the recorded speech segments and the input data, see Figures 4.3 and 4.4, which results in relatively fast pitch changes. When the desired frequency marks are down-scaled by a factor of 0.85, the quality of

  • 38 CHAPTER 4. EVALUATION

(Figure 4.6: resulting fundamental frequency [Hz] for each pitch period index, for the two interpolation methods.)