Page 1: Speech Synthesis

Speech Synthesis

April 14, 2009

Page 2: Speech Synthesis

Some Reminders

• Final Exam is next Monday:

• In this room

• (I am looking into changing the start time to 9 am.)

• I have a review sheet for you (to hand out at the end of class).

Page 3: Speech Synthesis

Exemplar Categorization

1. Stored memories of speech experiences are known as traces.

• Each trace is linked to a category label.

2. Incoming speech tokens are known as probes.

3. A probe activates the traces it is similar to.

• Note: amount of activation is proportional to similarity between trace and probe.

• Traces that closely match a probe are activated a lot;

• Traces that have no similarity to a probe are not activated much at all.

Page 4: Speech Synthesis

Echoes from the Past

• The activated exemplars in memory are blended together, weighted by their activation, to create an echo that is returned to the perceptual system.

• This echo has more general features than either the individual traces or the probe. (A rough sketch of the computation follows this slide.)

• Inspiration: Francis Galton
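A minimal sketch of the echo idea, using hypothetical (F1, F2) traces in Hz and an assumed exponential similarity function: the echo is an activation-weighted blend of the stored traces, so it is more generic than any single trace.

```python
import numpy as np

traces = np.array([[300.0, 2300.0],
                   [320.0, 2250.0],
                   [700.0, 1200.0]])
probe = np.array([310.0, 2280.0])

# Activation falls off with the distance between trace and probe.
acts = np.exp(-0.005 * np.linalg.norm(traces - probe, axis=1))

# Echo: activation-weighted blend of the traces.
echo = (acts[:, None] * traces).sum(axis=0) / acts.sum()
print(echo)   # close to the two /i/-like traces, barely influenced by the distant one
```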

Page 5: Speech Synthesis

Exemplar Categorization II

• For each category label…

• The activations of the traces linked to it are summed up.

• The category with the most total activation wins.

• Note: we use all exemplars in memory to help us categorize new tokens.

• Also: any single trace can be linked to different kinds of category labels.

• Test: Peterson & Barney vowel data

• Exemplar model classified 81% of vowels correctly.

• Remember: humans got 94%; prototype model got 51%.
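A minimal sketch of the categorization scheme above, with hypothetical (F1, F2) traces and an assumed similarity function; this is an illustration, not the model actually run on the Peterson & Barney data.

```python
import numpy as np

# Each stored trace is an (F1, F2) point in Hz, linked to a category label.
traces = np.array([[300.0, 2300.0],
                   [320.0, 2250.0],
                   [700.0, 1200.0],
                   [680.0, 1150.0]])
labels = np.array(["i", "i", "a", "a"])

def activation(trace, probe, c=0.005):
    """Activation is proportional to similarity: it decays with distance."""
    return np.exp(-c * np.linalg.norm(trace - probe))

def categorize(probe):
    acts = np.array([activation(t, probe) for t in traces])
    # Sum the activations of the traces linked to each label; the biggest total wins.
    totals = {label: acts[labels == label].sum() for label in set(labels)}
    return max(totals, key=totals.get)

print(categorize(np.array([310.0, 2280.0])))   # -> "i"
```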

Page 6: Speech Synthesis

Exemplar Predictions

• Point: all properties of all exemplars play a role in categorization…

• Not just the “definitive” ones.

• Prediction: non-invariant properties of speech categories should have an effect on speech perception.

• E.g., the voice in which a [b] is spoken.

• Or even the room in which a [b] is spoken.

• Is this true?

• Palmeri et al. (1993) tested this hypothesis with a continuous word recognition experiment…

Page 7: Speech Synthesis

Continuous Word Recognition

• In a “continuous word recognition” task, listeners hear a long sequence of words…

• some of which are new words in the list, and some of which are repeats.

• Task: decide whether each word is new or a repeat.

• Twist: some repeats are presented in a new voice;

• others are presented in the old (same) voice.

• Finding: repetitions are identified more quickly and more accurately when they’re presented in the old voice.

• Implication: we store voice + word info together in memory.

Page 8: Speech Synthesis

Our Results

• Continuous word recognition scores:

1. New items correctly recognized: 96.6%

2. Repeated items correctly recognized: 84.5%

• Of the repeated items:

• Same voice: 95.0%

• Different voice: 74.1%

• After 10 intervening stimuli:

• Same voice: 93.6%

• Different voice: 70.0%

• General finding: same voice effect does not diminish over time.

Page 9: Speech Synthesis

More Interactions

• Another task (Nygaard et al., 1994):

• train listeners to identify talkers by their voices.

• Then test the listeners’ ability to recognize words spoken in noise by:

• the talkers they learned to recognize

• talkers they don’t know

• Result: word recognition scores are much better for familiar talkers.

• Implication: voice properties influence word recognition.

• The opposite is also true:

• Talker identification is easier in a language you know.

Page 10: Speech Synthesis

Variability in Learning

• Caveat: in learning, it’s best not to tie words too closely to a single voice.

• Ex: training Japanese listeners to discriminate between /r/ and /l/.

• Discrimination training on one voice: no improvement. (Strange and Dittmann, 1984)

• Bradlow et al. (1997) tried:

• training on five different voices

• multiple phonological contexts (onset, coda, clusters)

• 4 weeks of training (with monetary rewards!)

• Result: improvement!

Page 11: Speech Synthesis

Variability in Learning

• General pattern:

• Lots of variability in training leads to better classification of novel tokens…

• Even though it slows down improvement in training itself.

• Variability in training also helps perception of synthetic speech. (Greenspan et al., 1988)

• Another interesting effect: dialect identification (Clopper, 2004)

• Bradlow et al. (1997) also found that perception training (passively) improved production skills…

Page 12: Speech Synthesis

Perception → Production

• Japanese listeners performed an /r/ - /l/ discrimination task.

• Important: listeners were told nothing about how to produce the /r/ - /l/ contrast

• …but, through perception training, their productions got better anyway.

Page 13: Speech Synthesis

Exemplars in Production

• Goldinger (1998): “shadowing” task.

• Task 1--Production:

• A: Listeners read a word (out loud) from a script.

• B: Listeners hear a word (X), then repeat it.

• Finding: formant values and durations of (B) productions match the original (X) more closely than (A) productions.

• Task 2--Perception: AXB task

• A different group of listeners judges whether X (the original) sounds more like A or B.

• Result: B productions are perceptually more similar to the originals.

Page 14: Speech Synthesis

Shadowing: Interpretation

• Some interesting complications:

• The imitation effect is stronger for low-frequency words…

• And also after shorter delays.

• Interpretation:

• The “probe” activates similar traces, which get combined into an echo.

• Shadowing imitation is a reflection of the echo.

• Probe-based activation decays quickly.

• And also has more of an influence over smaller exemplar sets.

Page 15: Speech Synthesis

Moral of the Story

• Remember: categorical perception was initially used to justify the claim that listeners converted a continuous signal into a discrete linguistic representation.

• In reality, listeners don’t just discard all the continuous information.

• (especially for sounds like vowels)

• Perceptual categories have to be more richly detailed than the classical categories found in phonology textbooks.

• (We need the details in order to deal with variability.)

Page 16: Speech Synthesis

Speech Synthesis: A Basic Overview

• Speech synthesis is the generation of speech by machine.

• The reasons for studying synthetic speech have evolved over the years:

1. Novelty

2. To control acoustic cues in perceptual studies

3. To understand the human articulatory system

• “Analysis by Synthesis”

4. Practical applications

• Reading machines for the blind, navigation systems

Page 17: Speech Synthesis

Speech Synthesis: A Basic Overview

• There are four basic types of synthetic speech:

1. Mechanical synthesis

2. Formant synthesis

• Based on Source/Filter theory

3. Concatenative synthesis

• = stringing bits and pieces of natural speech together

4. Articulatory synthesis

• = generating speech from a model of the vocal tract.

Page 18: Speech Synthesis

1. Mechanical Synthesis

• The very first attempts to produce synthetic speech were made without electricity.

• = mechanical synthesis

• In the late 1700s, models were produced which used:

• reeds as a voicing source

• differently shaped tubes for different vowels

Page 19: Speech Synthesis

Mechanical Synthesis, part II

• Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…

• with independently manipulable source and filter mechanisms.

Page 20: Speech Synthesis

Mechanical Synthesis, part III

• An interesting historical footnote:

• Alexander Graham Bell and his “questionable” experiments with his dog.

• Mechanical synthesis has largely fallen out of style since then.

• …but check out Mike Brady’s talking robot.

Page 21: Speech Synthesis

The Voder

• The next big step in speech synthesis was to generate speech electronically.

• This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.

• The Voder was a manually controlled speech synthesizer.

• (operated by highly trained young women)

Page 22: Speech Synthesis

Voder Principles

• The Voder basically operated like a vocoder.

• Voicing and fricative source sounds were filtered by 10 different resonators…

• each controlled by an individual finger!

• Only about 1 in 10 had the ability to learn how to play the Voder.

Page 23: Speech Synthesis

The Pattern Playback

• Shortly after the invention of the spectrograph, the Pattern Playback was developed.

• = basically a reverse spectrograph.

• Idea at this point was still to use speech synthesis to determine what the best cues were for particular sounds.

Page 24: Speech Synthesis

2. Formant Synthesis

• The next synthesizer was PAT (Parametric Artificial Talker).

• PAT was a parallel formant synthesizer.

• Idea: three formants are good enough for intelligible speech.

• Subtitles: What did you say before that? Tea or coffee? What have you done with it?

Page 25: Speech Synthesis

PAT Spectrogram

Page 26: Speech Synthesis

2. Formant Synthesis, part II

• Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant.

• OVE was a cascade formant synthesizer.

• In the ‘50s and ‘60s, people debated whether parallel or cascade synthesis was better.

• With weeks and weeks of tuning, each system could produce much better results:

Page 27: Speech Synthesis

Synthesis by rule

• The ultimate goal was to get machines to generate speech automatically, without any manual intervention.

• synthesis by rule

• A first attempt, on the Pattern Playback:

(I painted this by rule without looking at a spectrogram. Can you understand it?)

• Later, from 1961, on a cascade synthesizer:

• Note: first use of a computer to calculate rules for synthetic speech.

• Compare with the HAL 9000:

Page 28: Speech Synthesis

Parallel vs. Cascade

• The rivalry between the parallel and cascade camps continued into the ‘70s.

• Cascade synthesizers were good at producing vowels and required fewer control parameters…

• but were bad with nasals, stops and fricatives.

• Parallel synthesizers were better with nasals and fricatives, but not as good with vowels.

• Dennis Klatt proposed a synthesis (sorry):

• and combined the two… (a rough cascade-resonator sketch follows this slide)
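A rough sketch of cascade formant synthesis: a simple impulse-train voicing source passed through three second-order resonators in series. The formant values are made-up /a/-like numbers, and this is an illustration of the idea, not the KlattTalk implementation.

```python
import numpy as np
from scipy.signal import lfilter

fs = 10000                                          # sample rate in Hz
f0 = 100                                            # fundamental frequency in Hz
n = fs // 2                                         # half a second of samples
formants = [(730, 90), (1090, 110), (2440, 140)]    # (frequency, bandwidth) pairs in Hz

# Voicing source: an impulse train (real synthesizers use a shaped glottal pulse).
source = np.zeros(n)
source[::fs // f0] = 1.0

# Cascade: each resonator filters the output of the previous one.
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    b = [1 - 2 * r * np.cos(theta) + r ** 2]        # gain term (unity at 0 Hz)
    a = [1, -2 * r * np.cos(theta), r ** 2]         # resonator poles
    signal = lfilter(b, a, signal)
```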

Page 29: Speech Synthesis

KlattTalk

• KlattTalk has since become the standard for formant synthesis. (DECTalk)

http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html

Page 30: Speech Synthesis

KlattVoice

• Dennis Klatt also made significant improvements to the artificial voice source waveform.

• Perfect Paul:

• Beautiful Betty:

• Female voices have remained problematic.

• Also note: the lack of jitter and shimmer (see the sketch after this slide).
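A small sketch of jitter (cycle-to-cycle pitch perturbation) and shimmer (cycle-to-cycle amplitude perturbation) applied to an impulse-train source; the 1% and 5% figures are illustrative, not values from Klatt's work.

```python
import numpy as np

fs, f0, n_pulses = 10000, 100, 50
rng = np.random.default_rng(0)

periods = (fs / f0) * (1 + 0.01 * rng.standard_normal(n_pulses))   # ~1% jitter
amplitudes = 1 + 0.05 * rng.standard_normal(n_pulses)              # ~5% shimmer

onsets = np.cumsum(periods).astype(int)
source = np.zeros(onsets[-1] + 1)
source[onsets] = amplitudes          # perturbed pulse train instead of a perfectly regular one
```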

Page 31: Speech Synthesis

LPC Synthesis

• Another method of formant synthesis, developed in the ‘70s, is known as Linear Predictive Coding (LPC).

• Here’s an example:

• To recapitulate childhood: http://www.speaknspell.co.uk/

• As a general rule, LPC synthesis is pretty lousy.

• But it’s cheap!

• LPC synthesis greatly reduces the amount of information in speech…

Page 32: Speech Synthesis

Filters + LPC

• One way to understand LPC analysis is to think about a moving average filter.

• A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it.

y_n = (x_{n-2} + x_{n-1} + x_n + x_{n+1} + x_{n+2}) / 5
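A minimal sketch of the 5-point moving average above, written with NumPy's convolution for convenience; the input signal here is just an illustrative noisy sine wave.

```python
import numpy as np

def moving_average(x, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """y_n is the weighted sum of the samples around x_n; the edges are trimmed."""
    return np.convolve(x, weights, mode="valid")

noisy = np.sin(np.linspace(0, 2 * np.pi, 50)) + 0.1 * np.random.randn(50)
smoothed = moving_average(noisy)     # same signal with the point-to-point noise reduced
```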

Page 33: Speech Synthesis

Filters + LPC

• Another way to write the smoothing equation is

• y_n = .2*x_{n-2} + .2*x_{n-1} + .2*x_n + .2*x_{n+1} + .2*x_{n+2}

• Note that we could weight the different parts of the equation differently.

• Ex: y_n = .1*x_{n-2} + .2*x_{n-1} + .4*x_n + .2*x_{n+1} + .1*x_{n+2}

• Another trick: try to predict future points in the waveform on the basis of only previous points.

• Objective: find the combination of weights that predicts future points as accurately as possible. (A least-squares sketch follows this slide.)
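A rough sketch of finding prediction weights by ordinary least squares; practical LPC uses the autocorrelation method and the Levinson-Durbin recursion, but the objective (minimize prediction error) is the same. The test waveform is illustrative.

```python
import numpy as np

def prediction_weights(x, order=4):
    """Weights that best predict x[n] from x[n-1], ..., x[n-order]."""
    rows = [x[n - order:n][::-1] for n in range(order, len(x))]
    past = np.array(rows)            # each row: x[n-1], x[n-2], ..., x[n-order]
    target = x[order:]               # the samples being predicted
    weights, *_ = np.linalg.lstsq(past, target, rcond=None)
    return weights

waveform = np.sin(0.3 * np.arange(200)) + 0.01 * np.random.randn(200)
print(prediction_weights(waveform))
```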

Page 34: Speech Synthesis

Deriving the Filter

• Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:

• y_n = .5*x_n - .3*x_{n-1} + .2*x_{n-2} - .1*x_{n-3}

• The weights in the equation define a filter.

• Example: how would the values of y change if the input to the equation was a transient where:

• at time n, x = 1

• at all other times, x = 0

• Graph y at times n to n+3. (Worked through in the sketch after this slide.)
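Working the example: pushing a unit impulse through the filter above simply reads the weights back out, so y at times n, n+1, n+2, n+3 is 0.5, -0.3, 0.2, -0.1.

```python
import numpy as np

weights = [0.5, -0.3, 0.2, -0.1]   # from y_n = .5*x_n - .3*x_{n-1} + .2*x_{n-2} - .1*x_{n-3}
x = np.zeros(8)
x[0] = 1.0                         # the transient: x = 1 at time n, 0 at all other times
y = np.convolve(x, weights)[:len(x)]
print(y[:4])                       # [ 0.5 -0.3  0.2 -0.1]
```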

Page 35: Speech Synthesis

Decomposing the Filter

• Putting a transient into the weighted filter equation yields a new waveform:

• The new waveform simply reflects the weights in the equation.

• We can apply Fourier Analysis to the new waveform to determine its spectral characteristics.

Page 36: Speech Synthesis

LPC Spectrum

• When we perform a Fourier Analysis on this waveform, we get a very smooth-looking spectrum function:

• This function is a good representation of what the vocal tract filter looks like.

[Figure: LPC spectrum (smooth envelope) overlaid on the original spectrum]

Page 37: Speech Synthesis

LPC Applications

• Remember: the LPC spectrum is derived from the weights of a linear predictive equation.

• One thing we can do with the LPC-derived spectrum is estimate the formant frequencies of the vocal tract filter.

• (This is how Praat does it)

• Note: the more weights in the original equation, the more formants are assumed to be in the signal.

• We can also use that LPC-derived filter, in conjunction with a voice source, to create synthetic speech.

• (Like in the Speak & Spell)
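A sketch of formant estimation from LPC, assuming the all-pole form used in practice: peaks of the envelope 1/|A(f)| serve as formant candidates. The weights and sample rate are illustrative, and Praat's formant tracker adds several refinements beyond this.

```python
import numpy as np
from scipy.signal import freqz, find_peaks

fs = 10000
weights = np.array([0.5, -0.3, 0.2, -0.1])          # illustrative prediction weights
A = np.concatenate(([1.0], -weights))               # A(z) = 1 - sum of w_k * z^-k
freqs, response = freqz(b=[1.0], a=A, worN=1024, fs=fs)   # all-pole envelope 1/A(f)
peaks, _ = find_peaks(np.abs(response))
print(freqs[peaks])                                  # candidate formant frequencies in Hz
```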

Page 38: Speech Synthesis

3. Concatenative Synthesis

• Formant synthesis dominated the synthetic speech world up until the ‘90s…

• Then concatenative synthesis started taking over.

• Basic idea: string together recorded samples of natural speech.

• Most common option: “diphone” synthesis

• Concatenated bits stretch from the middle of one phoneme to the middle of the next phoneme.

• Note: inventory has to include all possible phoneme sequences

• = only possible with lots of computer memory.
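A toy sketch of diphone concatenation with a hypothetical inventory; each unit runs from the middle of one phone to the middle of the next, and real systems record every legal phoneme pair and smooth pitch and amplitude at the joins.

```python
import numpy as np

diphones = {("sil", "k"): np.zeros(400),                 # stand-ins for recorded samples
            ("k", "ae"): 0.1 * np.random.randn(800),
            ("ae", "t"): 0.1 * np.random.randn(900),
            ("t", "sil"): np.zeros(400)}

def synthesize(phones):
    """String together the recorded chunk for each adjacent pair of phones."""
    pairs = zip(phones[:-1], phones[1:])
    return np.concatenate([diphones[pair] for pair in pairs])

waveform = synthesize(["sil", "k", "ae", "t", "sil"])    # a rough "cat"
```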

Page 39: Speech Synthesis

Concatenated Samples

• Concatenative synthesis tends to sound more natural than formant synthesis.

• (basically because of better voice quality)

• Early (1977) combination of LPC + diphone synthesis:

• LPC + demisyllable-sized chunks (1980):

• More recent efforts with the MBROLA synthesizer:

• Also check out the MacinTalk Pro synthesizer!

Page 40: Speech Synthesis

Recent Developments

• Contemporary concatenative speech synthesizers use variable unit selection.

• Idea: record a huge database of speech…

• And play back the largest unit of speech you can, whenever you can.

• Interesting development #2: synthetic voices tailored to particular speakers.

• Check it out:
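A toy sketch of the "largest unit" idea: greedily cover the target phone string with the longest recorded stretch available. The inventory and file names are hypothetical, and real unit selection also scores join and prosody costs.

```python
units = {("h", "e", "l", "o"): "hello.wav",
         ("l", "o"): "lo.wav",
         ("h",): "h.wav", ("e",): "e.wav",
         ("l",): "l.wav", ("o",): "o.wav"}

def select_units(target):
    chosen, i = [], 0
    while i < len(target):
        # Try the longest remaining span first, then back off to shorter ones.
        for j in range(len(target), i, -1):
            if tuple(target[i:j]) in units:
                chosen.append(units[tuple(target[i:j])])
                i = j
                break
        else:
            raise KeyError(f"no unit covers {target[i]!r}")
    return chosen

print(select_units(["h", "e", "l", "o"]))        # -> ['hello.wav']
print(select_units(["l", "o", "l"]))             # -> ['lo.wav', 'l.wav']
```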

Page 41: Speech Synthesis

4. Articulatory Synthesis

• Last but not least, there is articulatory synthesis.

• Generation of acoustic signals on the basis of models of the vocal tract.

• This is the most complicated of all synthesis paradigms.

• (we don’t understand articulations all that well)

• Some early attempts:

• Paul Boersma built his own articulatory synthesizer…

• and incorporated it into Praat.

Page 42: Speech Synthesis

Synthetic Speech Perception

• In the early days, speech scientists thought that synthetic speech would lead to a form of “super speech”

• = ideal speech, without any of the extraneous noise of natural productions.

• However, natural speech is always more intelligible than synthetic speech.

• And more natural sounding!

• But: perceptual learning is possible.

• Requires lots and lots of practice.

• And lots of variability. (words, phonemes, contexts)

• An extreme example: blind listeners.

Page 43: Speech Synthesis

More Perceptual Findings

1. Reducing the number of possible messages dramatically increases intelligibility.

Page 44: Speech Synthesis

More Perceptual Findings

2. Formant synthesis produces better vowels;

• Concatenative synthesis produces better consonants (and transitions)

3. Synthetic speech perception uses up more mental resources.

• memory and recall of number lists

4. Synthetic speech perception is a lot easier for native speakers of a language.

• And also adults.

5. Older listeners prefer slower rates of speech.

Page 45: Speech Synthesis

Audio-Visual Speech Synthesis

• The synthesis of audio-visual speech has primarily been spearheaded by Dominic Massaro, at UC-Santa Cruz.

• “Baldi”

• Basic findings:

• Synthetic visuals can induce the McGurk effect.

• Synthetic visuals improve perception of speech in noise

• …but not as well as natural visuals.

• Check out some samples.

Page 46: Speech Synthesis

Further Reading

• In case you’re curious:

• http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html

• http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/contents.html