Speech in Multimedia

Hao Jiang

Computer Science Department

Boston College

Oct. 9, 2007

Outline

Introduction

Topics in speech processing
– Speech coding
– Speech recognition
– Speech synthesis
– Speaker verification/recognition

Conclusion

Introduction

Speech is our basic communication tool.

We have been hoping to be able to communicate with machines using speech.

C3PO and R2D2

Speech Production Model

[Figure: anatomical structure and mechanical model of speech production]

Characteristics of Digital Speech

[Figure: speech waveform (amplitude over time) and its spectrogram (axes: time and frequency)]

Voiced and Unvoiced Speech

[Figure: speech waveform with silence, unvoiced, and voiced segments labeled]

Short-time Parameters

[Figure: four plots of a speech segment showing the waveform envelope, short-time power, zero-crossing rate, and pitch period]
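The short-time parameters named on this slide can be computed frame by frame. The sketch below is a minimal illustration, not the code used to produce the plots: the function names, the synthetic 100 Hz test tone, and the autocorrelation-based pitch estimate are all assumptions made for the example.

```python
import math
import random

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) that maximizes the autocorrelation of the frame.

    Assumption: a plain autocorrelation peak is used as the pitch estimate.
    """
    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

# Hypothetical test signals, sampled at 8 kHz.
fs = 8000
# "Voiced": a 100 Hz sine, so one period is 80 samples.
voiced = [0.3 * math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
# "Unvoiced": white noise.
rng = random.Random(0)
unvoiced = [rng.uniform(-0.1, 0.1) for _ in range(400)]

print("voiced power:", short_time_power(voiced))
print("voiced ZCR:", zero_crossing_rate(voiced))
print("unvoiced ZCR:", zero_crossing_rate(unvoiced))
print("voiced pitch period (samples):", pitch_period(voiced, 20, 160))
```

Voiced frames tend to show high power and a low zero-crossing rate, while unvoiced frames show the opposite, which is why these two parameters are commonly used together for the voiced/unvoiced decision.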

Speech Coding

Similar to images, we can also compress speech to make it smaller and easier to store and transmit.

General compression methods such as DPCM can also be used.

More compression can be achieved by taking advantage of the speech production model.

There are two classes of speech coders:
– Waveform coder
– Vocoder
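As a sketch of the DPCM idea mentioned above: only the quantized difference between each sample and a prediction is transmitted. The predictor here (the previous reconstructed sample), the uniform step size, and the sample values are assumptions chosen to keep the example short.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the running prediction."""
    codes, prediction = [], 0.0
    for s in samples:
        code = round((s - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # track what the decoder will rebuild
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

# Hypothetical slowly varying samples.
samples = [0.00, 0.10, 0.22, 0.31, 0.35, 0.33, 0.25]
step = 0.05
codes = dpcm_encode(samples, step)
decoded = dpcm_decode(codes, step)
print(codes)
print(decoded)
```

Because the encoder predicts from its own reconstruction, the quantization error does not accumulate: every decoded sample stays within half a step of the original.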

LPC Speech Coder

[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis → pitch, voiced/unvoiced flag, vocal tract parameters, energy parameter → quantizer → code generation → code stream]

LPC and Vocal Tract

Mathematically, speech can be modeled by the following generation model:

$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

{a_1, a_2, …, a_k} are called Linear Prediction Coefficients (LPC); they model the shape of the vocal tract.

e(n) is the excitation that generates the speech.
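One common way to estimate the coefficients a_p from a frame is the autocorrelation method with the Levinson-Durbin recursion. The slides do not say which method is used, so treat the sketch below as one possible approach; the function names and the synthetic second-order test signal are assumptions.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag)."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns [a_1, ..., a_k] such that x(n) is approximated by sum_p a_p x(n-p).
    Assumes the frame is not all zeros (r[0] > 0).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)  # a[p] holds a_p; a[0] is unused
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[p] * r[i - p] for p in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for p in range(1, i):
            new_a[p] = a[p] - k * a[i - p]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]

# Hypothetical test: generate x(n) = 1.3 x(n-1) - 0.6 x(n-2) + e(n),
# then check that the estimate is close to (1.3, -0.6).
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(4000):
    signal.append(1.3 * signal[-1] - 0.6 * signal[-2] + rng.gauss(0.0, 1.0))

estimated = lpc(signal, 2)
print(estimated)
```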

Decoding and Speech Synthesis

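[Block diagram: the pitch period drives an impulse train generator followed by a glottal pulse generator; a random noise generator provides the alternative excitation; a U/V switch selects between the two; the excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]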
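The decoder structure can be sketched in a few lines. This is a simplified illustration: the glottal pulse and radiation models are omitted, and the coefficient values, frame length, and pitch period are hypothetical.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(exc, lpc_coeffs, gain):
    """All-pole vocal tract filter: x(n) = sum_p a_p x(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(exc):
        past = sum(a * out[n - p]
                   for p, a in enumerate(lpc_coeffs, start=1) if n - p >= 0)
        out.append(past + gain * e)
    return out

rng = random.Random(0)
coeffs = [1.3, -0.6]  # hypothetical, stable 2nd-order vocal tract model
voiced_frame = synthesize(excitation(180, True, 60, rng), coeffs, gain=0.5)
unvoiced_frame = synthesize(excitation(180, False, 60, rng), coeffs, gain=0.1)
print(voiced_frame[:5])
```

Only the U/V flag, pitch period, gain, and coefficients need to be transmitted per frame, which is what makes the bit rate of such a coder so low.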

An Example for Synthesizing Speech

Blending region

Glottal Pulse

Pass through the vocal tract filter with gain control

Pass through the radiation filter

LPC10 (FS1015)

LPC10 was the U.S. Department of Defense (DoD) speech coding standard for voice communication at 2.4 kbps.

LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.
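The frame size and bit budget follow directly from these numbers. The short calculation below uses exact fractions; the 16-bit PCM input used for the compression ratio is an assumption, not something stated on the slide.

```python
from fractions import Fraction

sample_rate_hz = 8000
frame_duration_s = Fraction(225, 10000)  # 22.5 ms
bit_rate_bps = 2400

samples_per_frame = sample_rate_hz * frame_duration_s  # samples in one frame
bits_per_frame = bit_rate_bps * frame_duration_s       # bits available per frame

# Assumption: the uncompressed input is 16-bit PCM.
uncompressed_bps = sample_rate_hz * 16
compression_ratio = Fraction(uncompressed_bps, bit_rate_bps)

print(samples_per_frame, bits_per_frame, float(compression_ratio))
```

So each frame of 180 samples has to be described in 54 bits, covering the pitch, voicing, gain, and all 10 coefficients.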

[Audio samples: original speech; LPC-decoded speech]

Mixed Excitation LP

For real speech, the excitation is usually not pure pulse or noise but a mixture.

The new 2.4kbps standard (MELP) addresses this problem.

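[Block diagram: pulses and noise each pass through a band-pass filter, are weighted (w and 1-w), and are summed; the mixed excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]

[Audio samples: original speech; MELP-decoded speech]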
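The mixing step alone can be sketched as a weighted sum. This is not the MELP algorithm: the band-pass filters are left out, and applying w to the pulses and 1-w to the noise is an assumption about the diagram.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, rng):
    """Weighted sum of a pulse train and white noise.

    Assumption: w weights the pulses and (1 - w) weights the noise.
    The band-pass filters of the real coder are omitted.
    """
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.gauss(0.0, 1.0)
        out.append(w * pulse + (1.0 - w) * noise)
    return out

rng = random.Random(0)
pure_pulses = mixed_excitation(5, 2, 1.0, rng)   # w = 1: pulse train only
mostly_voiced = mixed_excitation(160, 40, 0.8, rng)
print(pure_pulses)
```

Setting w = 1 or w = 0 recovers the two pure excitations of the plain LPC decoder; values in between give the mixture.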

Hybrid Speech Codecs

For higher bit rates, hybrid speech codecs have advantages over vocoders.

FS1016: CELP (Code-Excited Linear Prediction).

G.723.1: A dual-rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.

G.729: CELP-based codec at 8 kbps.

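Analysis by Synthesis

[Block diagram: speech → model parameter generation → code; the parameters also drive a speech synthesis block whose output goes through a "perceptual" comparison with the input speech]

[Audio samples: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]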
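The analysis-by-synthesis loop can be sketched as a codebook search: try every candidate excitation, synthesize it, and keep the one closest to the input. The sketch uses plain squared error in place of the "perceptual" comparison, and the codebook, filter, and target are hypothetical.

```python
def synthesize(exc, lpc_coeffs):
    """All-pole filter: y(n) = sum_p a_p y(n-p) + e(n)."""
    out = []
    for n, e in enumerate(exc):
        past = sum(a * out[n - p]
                   for p, a in enumerate(lpc_coeffs, start=1) if n - p >= 0)
        out.append(past + e)
    return out

def search_codebook(target, codebook, lpc_coeffs):
    """Return (index, gain) of the entry whose synthesized output best matches target.

    Assumption: plain squared error stands in for the perceptual comparison.
    """
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, lpc_coeffs)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[0]:
            best = (error, index, gain)
    return best[1], best[2]

# Hypothetical codebook and first-order filter.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
coeffs = [0.5]
target = [2.0 * v for v in synthesize(codebook[1], coeffs)]

index, gain = search_codebook(target, codebook, coeffs)
print(index, gain)
```

The encoder then transmits only the winning index and gain, and the decoder runs the same synthesis step to rebuild the frame.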

Speech Recognition

Speech recognition is the foundation of human-computer interaction using speech.

Speech recognition in different contexts:
– Speaker-dependent or speaker-independent.
– Discrete words or continuous speech.
– Small vocabulary or large vocabulary.
– Quiet environment or noisy environment.

[Block diagram: speech → parameter analyzer → comparison and decision algorithm → words; the comparison and decision algorithm uses a language model and reference patterns]

How does Speech Recognition Work?

Words: grey whales

Phonemes: g r ey w ey l z

Each phoneme has different characteristics (for example, the power distribution).

Speech Recognition

g g r ey ey ey ey w ey ey l l z

How do we “match” the word when there are time and other variations?

Hidden Markov Model

[Figure: HMM with states S1, S2, and S3, transition probabilities such as P12, and an output symbol set {a,b,c,…} at each state]

Dynamic Programming in Decoding

[Figure: trellis of states against time]

We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature extracted per speech frame).
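This dynamic programming search is usually called the Viterbi algorithm. The sketch below runs it in the log domain on a toy left-to-right model; the three phoneme states, the quantized feature symbols a, b, c, and all probabilities are made up for illustration.

```python
import math

def log(p):
    return math.log(p) if p > 0 else -math.inf

def log_table(table):
    return {k: {j: log(v) for j, v in row.items()} for k, row in table.items()}

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state path for an observation sequence."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda q: score[q] + log_trans[q][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: phonemes g -> r -> ey, one symbol observed per frame.
states = ["g", "r", "ey"]
log_start = {"g": log(1.0), "r": log(0.0), "ey": log(0.0)}
log_trans = log_table({
    "g": {"g": 0.5, "r": 0.5, "ey": 0.0},
    "r": {"g": 0.0, "r": 0.5, "ey": 0.5},
    "ey": {"g": 0.0, "r": 0.0, "ey": 1.0},
})
log_emit = log_table({
    "g": {"a": 0.8, "b": 0.1, "c": 0.1},
    "r": {"a": 0.1, "b": 0.8, "c": 0.1},
    "ey": {"a": 0.1, "b": 0.1, "c": 0.8},
})

path = viterbi(["a", "a", "b", "c", "c", "c"], states, log_start, log_trans, log_emit)
print(path)
```

The self-loop on each state is what absorbs time variation: a phoneme that lasts several frames simply stays in its state for several steps.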

HMM for a Unigram Language Model

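[Figure: from a start state s0, transitions with probabilities p1, p2, and p3 lead into the word models HMM1 (word1), HMM2 (word2), and HMM3 (wordn)]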

Speech Synthesis

Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).

Speech synthesis has been widely used for text-to-speech systems and different telephone services.

The easiest and most often used speech synthesis method is waveform concatenation.
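A minimal sketch of waveform concatenation: two recorded units are joined, and a short blending region smooths the seam. The linear cross-fade and the toy sample values are assumptions; real systems choose units and blend points far more carefully.

```python
def concatenate(unit_a, unit_b, blend_len):
    """Join two waveform units, cross-fading linearly over blend_len samples."""
    if blend_len == 0:
        return unit_a + unit_b
    head_a, tail_a = unit_a[:-blend_len], unit_a[-blend_len:]
    head_b, rest_b = unit_b[:blend_len], unit_b[blend_len:]
    blended = []
    for i, (a, b) in enumerate(zip(tail_a, head_b)):
        w = (i + 1) / (blend_len + 1)  # weight ramps from unit_a toward unit_b
        blended.append((1.0 - w) * a + w * b)
    return head_a + blended + rest_b

# Hypothetical units: a constant 1.0 segment followed by a constant 0.0 segment.
joined = concatenate([1.0] * 4, [0.0] * 4, blend_len=2)
print(joined)
```

The output is shorter than the two inputs together by blend_len samples, because the overlapping region is shared.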

Increase the pitch without changing the speed

Speaker Recognition

Identifying or verifying the identity of a speaker is an application in which computers outperform humans.

Vocal tract parameters can be used as features for speaker recognition.

[Figure: LPC covariance feature for speaker one and speaker two, each shown on axes numbered 1 to 10]

Applications

Speech recognition
– Call routing
– Directory assistance
– Operator services
– Document input

Speaker recognition
– Personalized service
– Fraud control

Text-to-speech synthesis
– Speech interface
– Document correction
– Voice commands

Speech coding
– Wireless telephone
– Voice over Internet
– Documents
