Page 1: Speech Synthesis

Speech Synthesis

April 14, 2009

Page 2: Speech Synthesis

Some Reminders

• Final Exam is next Monday:

• In this room

• (I am looking into changing the start time to 9 am.)

• I have a review sheet for you (to hand out at the end of class).

Page 3: Speech Synthesis

Exemplar Categorization

1. Stored memories of speech experiences are known as traces.

• Each trace is linked to a category label.

2. Incoming speech tokens are known as probes.

3. A probe activates the traces it is similar to.

• Note: amount of activation is proportional to similarity between trace and probe.

• Traces that closely match a probe are activated a lot;

• Traces that have no similarity to a probe are not activated much at all.

Page 4: Speech Synthesis

Echoes from the Past

• The activated exemplars in memory are blended together, weighted by their activation, to create an echo that is returned to the perceptual system.

• This echo has more general features than either the individual traces or the probe. (A rough sketch of the computation follows this slide.)

• Inspiration: Francis Galton
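A minimal sketch of the echo idea, using hypothetical (F1, F2) traces in Hz and an assumed exponential similarity function: the echo is an activation-weighted blend of the stored traces, so it is more generic than any single trace.

```python
import numpy as np

traces = np.array([[300.0, 2300.0],
                   [320.0, 2250.0],
                   [700.0, 1200.0]])
probe = np.array([310.0, 2280.0])

# Activation falls off with the distance between trace and probe.
acts = np.exp(-0.005 * np.linalg.norm(traces - probe, axis=1))

# Echo: activation-weighted blend of the traces.
echo = (acts[:, None] * traces).sum(axis=0) / acts.sum()
print(echo)   # close to the two /i/-like traces, barely influenced by the distant one
```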

Page 5: Speech Synthesis

Exemplar Categorization II

• For each category label…

• The activations of the traces linked to it are summed up.

• The category with the most total activation wins.

• Note: we use all exemplars in memory to help us categorize new tokens.

• Also: any single trace can be linked to different kinds of category labels.

• Test: Peterson & Barney vowel data

• Exemplar model classified 81% of vowels correctly.

• Remember: humans got 94%; prototype model got 51%.
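A minimal sketch of the categorization scheme above, with hypothetical (F1, F2) traces and an assumed similarity function; this is an illustration, not the model actually run on the Peterson & Barney data.

```python
import numpy as np

# Each stored trace is an (F1, F2) point in Hz, linked to a category label.
traces = np.array([[300.0, 2300.0],
                   [320.0, 2250.0],
                   [700.0, 1200.0],
                   [680.0, 1150.0]])
labels = np.array(["i", "i", "a", "a"])

def activation(trace, probe, c=0.005):
    """Activation is proportional to similarity: it decays with distance."""
    return np.exp(-c * np.linalg.norm(trace - probe))

def categorize(probe):
    acts = np.array([activation(t, probe) for t in traces])
    # Sum the activations of the traces linked to each label; the biggest total wins.
    totals = {label: acts[labels == label].sum() for label in set(labels)}
    return max(totals, key=totals.get)

print(categorize(np.array([310.0, 2280.0])))   # -> "i"
```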

Page 6: Speech Synthesis

Exemplar Predictions

• Point: all properties of all exemplars play a role in categorization…

• Not just the “definitive” ones.

• Prediction: non-invariant properties of speech categories should have an effect on speech perception.

• E.g., the voice in which a [b] is spoken.

• Or even the room in which a [b] is spoken.

• Is this true?

• Palmeri et al. (1993) tested this hypothesis with a continuous word recognition experiment…

Page 7: Speech Synthesis

Continuous Word Recognition

• In a “continuous word recognition” task, listeners hear a long sequence of words…

• some of which are new words in the list, and some of which are repeats.

• Task: decide whether each word is new or a repeat.

• Twist: some repeats are presented in a new voice;

• others are presented in the old (same) voice.

• Finding: repetitions are identified more quickly and more accurately when they’re presented in the old voice.

• Implication: we store voice + word info together in memory.

Page 8: Speech Synthesis

Our Results

• Continuous word recognition scores:

1. New items correctly recognized: 96.6%

2. Repeated items correctly recognized: 84.5%

• Of the repeated items:

• Same voice: 95.0%

• Different voice: 74.1%

• After 10 intervening stimuli:

• Same voice: 93.6%

• Different voice: 70.0%

• General finding: same voice effect does not diminish over time.

Page 9: Speech Synthesis

More Interactions

• Another task (Nygaard et al., 1994):

• train listeners to identify talkers by their voices.

• Then test the listeners’ ability to recognize words spoken in noise by:

• the talkers they learned to recognize

• talkers they don’t know

• Result: word recognition scores are much better for familiar talkers.

• Implication: voice properties influence word recognition.

• The opposite is also true:

• Talker identification is easier in a language you know.

Page 10: Speech Synthesis

Variability in Learning

• Caveat: in learning, it’s best not to tie words too closely to a single voice.

• Ex: training Japanese listeners to discriminate between /r/ and /l/.

• Discrimination training on one voice: no improvement. (Strange and Dittmann, 1984)

• Bradlow et al. (1997) tried:

• training on five different voices

• multiple phonological contexts (onset, coda, clusters)

• 4 weeks of training (with monetary rewards!)

• Result: improvement!

Page 11: Speech Synthesis

Variability in Learning

• General pattern:

• Lots of variability in training leads to better classification of novel tokens…

• Even though it slows down improvement in training itself.

• Variability in training also helps perception of synthetic speech. (Greenspan et al., 1988)

• Another interesting effect: dialect identification (Clopper, 2004)

• Bradlow et al. (1997) also found that perception training (passively) improved production skills…

Page 12: Speech Synthesis

Perception → Production

• Japanese listeners performed an /r/ - /l/ discrimination task.

• Important: listeners were told nothing about how to produce the /r/ - /l/ contrast

• …but, through perception training, their productions got better anyway.

Page 13: Speech Synthesis

Exemplars in Production

• Goldinger (1998): “shadowing” task.

• Task 1--Production:

• A: Listeners read a word (out loud) from a script.

• B: Listeners hear a word (X), then repeat it.

• Finding: formant values and durations of (B) productions match the original (X) more closely than (A) productions.

• Task 2--Perception: AXB task

• A different group of listeners judges whether X (the original) sounds more like A or B.

• Result: B productions are perceptually more similar to the originals.

Page 14: Speech Synthesis

Shadowing: Interpretation

• Some interesting complications:

• The imitation effect is stronger for low-frequency words…

• And also after shorter delays.

• Interpretation:

• The “probe” activates similar traces, which get combined into an echo.

• Shadowing imitation is a reflection of the echo.

• Probe-based activation decays quickly.

• And also has more of an influence over smaller exemplar sets.

Page 15: Speech Synthesis

Moral of the Story

• Remember: categorical perception was initially used to justify the claim that listeners converted a continuous signal into a discrete linguistic representation.

• In reality, listeners don’t just discard all the continuous information.

• (especially for sounds like vowels)

• Perceptual categories have to be more richly detailed than the classical categories found in phonology textbooks.

• (We need the details in order to deal with variability.)

Page 16: Speech Synthesis

Speech Synthesis: A Basic Overview

• Speech synthesis is the generation of speech by machine.

• The reasons for studying synthetic speech have evolved over the years:

1. Novelty

2. To control acoustic cues in perceptual studies

3. To understand the human articulatory system

• “Analysis by Synthesis”

4. Practical applications

• Reading machines for the blind, navigation systems

Page 17: Speech Synthesis

Speech Synthesis: A Basic Overview

• There are four basic types of synthetic speech:

1. Mechanical synthesis

2. Formant synthesis

• Based on Source/Filter theory

3. Concatenative synthesis

• = stringing bits and pieces of natural speech together

4. Articulatory synthesis

• = generating speech from a model of the vocal tract.

Page 18: Speech Synthesis

1. Mechanical Synthesis

• The very first attempts to produce synthetic speech were made without electricity.

• = mechanical synthesis

• In the late 1700s, models were produced which used:

• reeds as a voicing source

• differently shaped tubes for different vowels

Page 19: Speech Synthesis

Mechanical Synthesis, part II

• Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…

• with independently manipulable source and filter mechanisms.

Page 20: Speech Synthesis

Mechanical Synthesis, part III

• An interesting historical footnote:

• Alexander Graham Bell and his “questionable” experiments with his dog.

• Mechanical synthesis has largely fallen out of style since then.

• …but check out Mike Brady’s talking robot.

Page 21: Speech Synthesis

The Voder

• The next big step in speech synthesis was to generate speech electronically.

• This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.

• The Voder was a manually controlled speech synthesizer.

• (operated by highly trained young women)

Page 22: Speech Synthesis

Voder Principles

• The Voder basically operated like a vocoder.

• Voicing and fricative source sounds were filtered by 10 different resonators…

• each controlled by an individual finger!

• Only about 1 in 10 had the ability to learn how to play the Voder.

Page 23: Speech Synthesis

The Pattern Playback

• Shortly after the invention of the spectrograph, the Pattern Playback was developed.

• = basically a reverse spectrograph.

• Idea at this point was still to use speech synthesis to determine what the best cues were for particular sounds.

Page 24: Speech Synthesis

2. Formant Synthesis

• The next synthesizer was PAT (Parametric Artificial Talker).

• PAT was a parallel formant synthesizer.

• Idea: three formants are good enough for intelligible speech.

• Subtitles: What did you say before that? Tea or coffee? What have you done with it?

Page 25: Speech Synthesis

PAT Spectrogram

Page 26: Speech Synthesis

2. Formant Synthesis, part II

• Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant.

• OVE was a cascade formant synthesizer.

• In the ‘50s and ‘60s, people debated whether parallel or cascade synthesis was better.

• With weeks and weeks of tuning, each system could produce much better results:

Page 27: Speech Synthesis

Synthesis by rule

• The ultimate goal was to get machines to generate speech automatically, without any manual intervention.

• synthesis by rule

• A first attempt, on the Pattern Playback:

(I painted this by rule without looking at a spectrogram. Can you understand it?)

• Later, from 1961, on a cascade synthesizer:

• Note: first use of a computer to calculate rules for synthetic speech.

• Compare with the HAL 9000:

Page 28: Speech Synthesis

Parallel vs. Cascade

• The rivalry between the parallel and cascade camps continued into the ‘70s.

• Cascade synthesizers were good at producing vowels and required fewer control parameters…

• but were bad with nasals, stops and fricatives.

• Parallel synthesizers were better with nasals and fricatives, but not as good with vowels.

• Dennis Klatt proposed a synthesis (sorry):

• and combined the two… (a rough cascade-resonator sketch follows this slide)
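A rough sketch of cascade formant synthesis: a simple impulse-train voicing source passed through three second-order resonators in series. The formant values are made-up /a/-like numbers, and this is an illustration of the idea, not the KlattTalk implementation.

```python
import numpy as np
from scipy.signal import lfilter

fs = 10000                                          # sample rate in Hz
f0 = 100                                            # fundamental frequency in Hz
n = fs // 2                                         # half a second of samples
formants = [(730, 90), (1090, 110), (2440, 140)]    # (frequency, bandwidth) pairs in Hz

# Voicing source: an impulse train (real synthesizers use a shaped glottal pulse).
source = np.zeros(n)
source[::fs // f0] = 1.0

# Cascade: each resonator filters the output of the previous one.
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    b = [1 - 2 * r * np.cos(theta) + r ** 2]        # gain term (unity at 0 Hz)
    a = [1, -2 * r * np.cos(theta), r ** 2]         # resonator poles
    signal = lfilter(b, a, signal)
```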

Page 29: Speech Synthesis

KlattTalk

• KlattTalk has since become the standard for formant synthesis. (DECTalk)

http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html

Page 30: Speech Synthesis

KlattVoice

• Dennis Klatt also made significant improvements to the artificial voice source waveform.

• Perfect Paul:

• Beautiful Betty:

• Female voices have remained problematic.

• Also note: the lack of jitter and shimmer (see the sketch after this slide).
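A small sketch of jitter (cycle-to-cycle pitch perturbation) and shimmer (cycle-to-cycle amplitude perturbation) applied to an impulse-train source; the 1% and 5% figures are illustrative, not values from Klatt's work.

```python
import numpy as np

fs, f0, n_pulses = 10000, 100, 50
rng = np.random.default_rng(0)

periods = (fs / f0) * (1 + 0.01 * rng.standard_normal(n_pulses))   # ~1% jitter
amplitudes = 1 + 0.05 * rng.standard_normal(n_pulses)              # ~5% shimmer

onsets = np.cumsum(periods).astype(int)
source = np.zeros(onsets[-1] + 1)
source[onsets] = amplitudes          # perturbed pulse train instead of a perfectly regular one
```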

Page 31: Speech Synthesis

LPC Synthesis

• Another method of formant synthesis, developed in the ‘70s, is known as Linear Predictive Coding (LPC).

• Here’s an example:

• To recapitulate childhood: http://www.speaknspell.co.uk/

• As a general rule, LPC synthesis is pretty lousy.

• But it’s cheap!

• LPC synthesis greatly reduces the amount of information in speech…

Page 32: Speech Synthesis

Filters + LPC

• One way to understand LPC analysis is to think about a moving average filter.

• A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it.

y_n = (x_{n-2} + x_{n-1} + x_n + x_{n+1} + x_{n+2}) / 5
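A minimal sketch of the 5-point moving average above, written with NumPy's convolution for convenience; the input signal here is just an illustrative noisy sine wave.

```python
import numpy as np

def moving_average(x, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """y_n is the weighted sum of the samples around x_n; the edges are trimmed."""
    return np.convolve(x, weights, mode="valid")

noisy = np.sin(np.linspace(0, 2 * np.pi, 50)) + 0.1 * np.random.randn(50)
smoothed = moving_average(noisy)     # same signal with the point-to-point noise reduced
```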

Page 33: Speech Synthesis

Filters + LPC

• Another way to write the smoothing equation is

• y_n = .2*x_{n-2} + .2*x_{n-1} + .2*x_n + .2*x_{n+1} + .2*x_{n+2}

• Note that we could weight the different parts of the equation differently.

• Ex: y_n = .1*x_{n-2} + .2*x_{n-1} + .4*x_n + .2*x_{n+1} + .1*x_{n+2}

• Another trick: try to predict future points in the waveform on the basis of only previous points.

• Objective: find the combination of weights that predicts future points as accurately as possible. (A least-squares sketch follows this slide.)
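A rough sketch of finding prediction weights by ordinary least squares; practical LPC uses the autocorrelation method and the Levinson-Durbin recursion, but the objective (minimize prediction error) is the same. The test waveform is illustrative.

```python
import numpy as np

def prediction_weights(x, order=4):
    """Weights that best predict x[n] from x[n-1], ..., x[n-order]."""
    rows = [x[n - order:n][::-1] for n in range(order, len(x))]
    past = np.array(rows)            # each row: x[n-1], x[n-2], ..., x[n-order]
    target = x[order:]               # the samples being predicted
    weights, *_ = np.linalg.lstsq(past, target, rcond=None)
    return weights

waveform = np.sin(0.3 * np.arange(200)) + 0.01 * np.random.randn(200)
print(prediction_weights(waveform))
```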

Page 34: Speech Synthesis

Deriving the Filter

• Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:

• y_n = .5*x_n - .3*x_{n-1} + .2*x_{n-2} - .1*x_{n-3}

• The weights in the equation define a filter.

• Example: how would the values of y change if the input to the equation was a transient where:

• at time n, x = 1

• at all other times, x = 0

• Graph y at times n to n+3. (Worked through in the sketch after this slide.)
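Working the example: pushing a unit impulse through the filter above simply reads the weights back out, so y at times n, n+1, n+2, n+3 is 0.5, -0.3, 0.2, -0.1.

```python
import numpy as np

weights = [0.5, -0.3, 0.2, -0.1]   # from y_n = .5*x_n - .3*x_{n-1} + .2*x_{n-2} - .1*x_{n-3}
x = np.zeros(8)
x[0] = 1.0                         # the transient: x = 1 at time n, 0 at all other times
y = np.convolve(x, weights)[:len(x)]
print(y[:4])                       # [ 0.5 -0.3  0.2 -0.1]
```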

Page 35: Speech Synthesis

Decomposing the Filter

• Putting a transient into the weighted filter equation yields a new waveform:

• The new waveform simply reflects the weights in the equation.

• We can apply Fourier Analysis to the new waveform to determine its spectral characteristics.

Page 36: Speech Synthesis

LPC Spectrum

• When we perform a Fourier Analysis on this waveform, we get a very smooth-looking spectrum function:

• This function is a good representation of what the vocal tract filter looks like.

[Figure: LPC spectrum (smooth envelope) overlaid on the original spectrum]

Page 37: Speech Synthesis

LPC Applications

• Remember: the LPC spectrum is derived from the weights of a linear predictive equation.

• One thing we can do with the LPC-derived spectrum is estimate the formant frequencies of the vocal tract filter.

• (This is how Praat does it)

• Note: the more weights in the original equation, the more formants are assumed to be in the signal.

• We can also use that LPC-derived filter, in conjunction with a voice source, to create synthetic speech.

• (Like in the Speak & Spell)
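A sketch of formant estimation from LPC, assuming the all-pole form used in practice: peaks of the envelope 1/|A(f)| serve as formant candidates. The weights and sample rate are illustrative, and Praat's formant tracker adds several refinements beyond this.

```python
import numpy as np
from scipy.signal import freqz, find_peaks

fs = 10000
weights = np.array([0.5, -0.3, 0.2, -0.1])          # illustrative prediction weights
A = np.concatenate(([1.0], -weights))               # A(z) = 1 - sum of w_k * z^-k
freqs, response = freqz(b=[1.0], a=A, worN=1024, fs=fs)   # all-pole envelope 1/A(f)
peaks, _ = find_peaks(np.abs(response))
print(freqs[peaks])                                  # candidate formant frequencies in Hz
```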

Page 38: Speech Synthesis

3. Concatenative Synthesis

• Formant synthesis dominated the synthetic speech world up until the ‘90s…

• Then concatenative synthesis started taking over.

• Basic idea: string together recorded samples of natural speech.

• Most common option: “diphone” synthesis

• Concatenated bits stretch from the middle of one phoneme to the middle of the next phoneme.

• Note: inventory has to include all possible phoneme sequences

• = only possible with lots of computer memory.
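A toy sketch of diphone concatenation with a hypothetical inventory; each unit runs from the middle of one phone to the middle of the next, and real systems record every legal phoneme pair and smooth pitch and amplitude at the joins.

```python
import numpy as np

diphones = {("sil", "k"): np.zeros(400),                 # stand-ins for recorded samples
            ("k", "ae"): 0.1 * np.random.randn(800),
            ("ae", "t"): 0.1 * np.random.randn(900),
            ("t", "sil"): np.zeros(400)}

def synthesize(phones):
    """String together the recorded chunk for each adjacent pair of phones."""
    pairs = zip(phones[:-1], phones[1:])
    return np.concatenate([diphones[pair] for pair in pairs])

waveform = synthesize(["sil", "k", "ae", "t", "sil"])    # a rough "cat"
```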

Page 39: Speech Synthesis

Concatenated Samples

• Concatenative synthesis tends to sound more natural than formant synthesis.

• (basically because of better voice quality)

• Early (1977) combination of LPC + diphone synthesis:

• LPC + demisyllable-sized chunks (1980):

• More recent efforts with the MBROLA synthesizer:

• Also check out the MacinTalk Pro synthesizer!

Page 40: Speech Synthesis

Recent Developments

• Contemporary concatenative speech synthesizers use variable unit selection.

• Idea: record a huge database of speech…

• And play back the largest unit of speech you can, whenever you can.

• Interesting development #2: synthetic voices tailored to particular speakers.

• Check it out:
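A toy sketch of the "largest unit" idea: greedily cover the target phone string with the longest recorded stretch available. The inventory and file names are hypothetical, and real unit selection also scores join and prosody costs.

```python
units = {("h", "e", "l", "o"): "hello.wav",
         ("l", "o"): "lo.wav",
         ("h",): "h.wav", ("e",): "e.wav",
         ("l",): "l.wav", ("o",): "o.wav"}

def select_units(target):
    chosen, i = [], 0
    while i < len(target):
        # Try the longest remaining span first, then back off to shorter ones.
        for j in range(len(target), i, -1):
            if tuple(target[i:j]) in units:
                chosen.append(units[tuple(target[i:j])])
                i = j
                break
        else:
            raise KeyError(f"no unit covers {target[i]!r}")
    return chosen

print(select_units(["h", "e", "l", "o"]))        # -> ['hello.wav']
print(select_units(["l", "o", "l"]))             # -> ['lo.wav', 'l.wav']
```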

Page 41: Speech Synthesis

4. Articulatory Synthesis

• Last but not least, there is articulatory synthesis.

• Generation of acoustic signals on the basis of models of the vocal tract.

• This is the most complicated of all synthesis paradigms.

• (we don’t understand articulations all that well)

• Some early attempts:

• Paul Boersma built his own articulatory synthesizer…

• and incorporated it into Praat.

Page 42: Speech Synthesis

Synthetic Speech Perception

• In the early days, speech scientists thought that synthetic speech would lead to a form of “super speech”

• = ideal speech, without any of the extraneous noise of natural productions.

• However, natural speech is always more intelligible than synthetic speech.

• And more natural sounding!

• But: perceptual learning is possible.

• Requires lots and lots of practice.

• And lots of variability. (words, phonemes, contexts)

• An extreme example: blind listeners.

Page 43: Speech Synthesis

More Perceptual Findings

1. Reducing the number of possible messages dramatically increases intelligibility.

Page 44: Speech Synthesis

More Perceptual Findings

2. Formant synthesis produces better vowels;

• Concatenative synthesis produces better consonants (and transitions)

3. Synthetic speech perception uses up more mental resources.

• memory and recall of number lists

4. Synthetic speech perception is a lot easier for native speakers of a language.

• And also adults.

5. Older listeners prefer slower rates of speech.

Page 45: Speech Synthesis

Audio-Visual Speech Synthesis

• The synthesis of audio-visual speech has primarily been spearheaded by Dominic Massaro, at UC-Santa Cruz.

• “Baldi”

• Basic findings:

• Synthetic visuals can induce the McGurk effect.

• Synthetic visuals improve perception of speech in noise

• …but not as well as natural visuals.

• Check out some samples.

Page 46: Speech Synthesis

Further Reading

• In case you’re curious:

• http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html

• http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/contents.html