Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007

Page 1: Speech in Multimedia

Speech in Multimedia

Hao Jiang

Computer Science Department

Boston College

Oct. 9, 2007

Page 2: Speech in Multimedia

Outline

Introduction

Topics in speech processing
– Speech coding
– Speech recognition
– Speech synthesis
– Speaker verification/recognition

Conclusion

Page 3: Speech in Multimedia

Introduction

Speech is our basic communication tool.

We have been hoping to be able to communicate with machines using speech.

C3PO and R2D2

Page 4: Speech in Multimedia

Speech Production Model

[Figure panels: anatomical structure; mechanical model]

Page 5: Speech in Multimedia

Characteristics of Digital Speech

[Figure: speech waveform and spectrogram (axes: time, frequency)]

Page 6: Speech in Multimedia

Voiced and Unvoiced Speech

[Figure: speech waveform with silence, unvoiced, and voiced segments labeled]

Page 7: Speech in Multimedia

Short-time Parameters

[Figure: waveform envelope and short-time power of a speech signal]

Page 8: Speech in Multimedia

[Figure: zero-crossing rate of a speech signal]

[Figure: pitch period of a voiced segment]
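The short-time parameters shown in these plots (short-time power, zero-crossing rate, pitch period) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the frame length, and the autocorrelation-based pitch estimate are choices made here, not taken from the slides.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames (one row per frame)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_power(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def pitch_period(frame, min_lag, max_lag):
    """Assumed method: lag (in samples) of the autocorrelation peak."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
```

Voiced speech tends to show high power and a low zero-crossing rate, while unvoiced speech tends to show the opposite, which is why these two parameters are often used together for a voiced/unvoiced decision.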

Page 9: Speech in Multimedia

Speech Coding

Similar to images, we can also compress speech to make it smaller and easier to store and transmit.

General compression methods such as DPCM can also be used.

More compression can be achieved by taking advantage of the speech production model.

There are two classes of speech coders:
– Waveform coder
– Vocoder
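As a rough illustration of a waveform coder, here is a minimal DPCM sketch. It assumes the simplest possible predictor (the previous reconstructed sample) and a uniform quantizer with an unbounded code range; `step` and the function names are hypothetical, and practical coders use better predictors and a fixed number of bits per code.

```python
import numpy as np

def dpcm_encode(x, step):
    """Quantize the difference between each sample and its prediction."""
    codes = np.zeros(len(x), dtype=int)
    prediction = 0.0  # previous reconstructed sample
    for n, sample in enumerate(x):
        codes[n] = int(np.round((sample - prediction) / step))
        # Track what the decoder will reconstruct so errors do not accumulate.
        prediction += codes[n] * step
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the quantized differences."""
    return np.cumsum(codes) * step
```

Because the encoder predicts from the reconstructed signal rather than from the original, the reconstruction error of every sample stays within half a quantization step.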

Page 10: Speech in Multimedia

LPC Speech Coder

[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis → pitch, voiced/unvoiced decision, vocal tract parameters, energy parameter → quantizer → code generation → code stream]

Page 11: Speech in Multimedia

LPC and Vocal Tract

Mathematically, speech can be modeled with the following generation model:

$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

{a1, a2, …, ak} are called Linear Prediction Coefficients (LPC), which can be used to model the shape of the vocal tract.

e(n) is the excitation to generate the speech.
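One way to estimate the coefficients a_p from a frame of speech is to minimize the energy of e(n), which leads to a set of linear equations in the autocorrelation of x(n). The sketch below solves those equations directly with NumPy; the function name is hypothetical, and practical coders typically use the Levinson-Durbin recursion instead of a general linear solver.

```python
import numpy as np

def lpc_coefficients(x, k):
    """Estimate a_1..a_k in x(n) = sum_p a_p x(n-p) + e(n) (autocorrelation method)."""
    x = np.asarray(x, dtype=float)
    # r[m] = sum_n x(n) x(n+m) for m = 0..k
    r = np.array([np.dot(x[:len(x) - m], x[m:]) for m in range(k + 1)])
    # Normal equations: sum_p a_p r[|i - p|] = r[i] for i = 1..k
    R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
    return np.linalg.solve(R, r[1:])
```

The solver assumes the frame is not silent; an all-zero frame makes the system singular.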

Page 12: Speech in Multimedia

Decoding and Speech Synthesis

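[Block diagram: the pitch period drives an impulse train generator followed by a glottal pulse generator; a U/V switch selects between this pulse source and a random noise generator; the selected excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]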

Page 13: Speech in Multimedia

An Example for Synthesizing Speech

Blending region

Glottal Pulse

Go through vocal tract filter with gain control

Go through radiation filter
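The decoding steps can be sketched as follows. This is a simplified illustration, not the decoder of any particular standard: the glottal pulse shaping is omitted (a plain impulse train is used for voiced frames), and the radiation filter is assumed to be a first-order difference, which is a common simplification. All function names are hypothetical.

```python
import numpy as np

def make_excitation(n_samples, voiced, pitch_period=80, rng=None):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples)

def vocal_tract_filter(a, excitation, gain=1.0):
    """All-pole filter: x(n) = sum_p a_p x(n-p) + gain * e(n)."""
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        x[n] = gain * excitation[n]
        for p in range(1, len(a) + 1):
            if n - p >= 0:
                x[n] += a[p - 1] * x[n - p]
    return x

def radiate(x):
    """Assumed radiation model: first-order difference y(n) = x(n) - x(n-1)."""
    return np.diff(x, prepend=0.0)
```

A frame would then be synthesized as `radiate(vocal_tract_filter(a, make_excitation(180, voiced=True), gain))`, with `a`, the gain, the pitch period, and the voiced/unvoiced flag decoded from the code stream.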

Page 14: Speech in Multimedia

LPC10 (FS1015)

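LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps.

LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients. Each frame therefore holds 180 samples, and 2.4 kbps leaves 54 bits to encode it.

[Audio clips: original speech; LPC decoded speech]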

Page 15: Speech in Multimedia

Mixed Excitation LP

For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.

The new 2.4kbps standard (MELP) addresses this problem.

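[Block diagram: pulses and noise each pass through a bandpass filter and are mixed with weights w and 1-w; the sum is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]

[Audio clips: original speech; MELP decoded speech]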
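The mixing step in the diagram can be illustrated with a short sketch. It only shows the weighted sum of a pulse train and noise with weights w and 1-w; the bandpass filters are omitted, and the function name and parameters are hypothetical rather than taken from the MELP standard.

```python
import numpy as np

def mixed_excitation(n_samples, pitch_period, w, rng=None):
    """Blend an impulse train (weight w) with white noise (weight 1 - w)."""
    if rng is None:
        rng = np.random.default_rng(0)
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0
    noise = rng.standard_normal(n_samples)
    return w * pulses + (1.0 - w) * noise
```

Setting w = 1 gives the purely voiced excitation of the basic LPC model and w = 0 gives the purely unvoiced one; values in between approximate the mixed case.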

Page 16: Speech in Multimedia

Hybrid Speech Codecs

At higher bit rates, hybrid speech codecs have an advantage over vocoders.

– FS1016: CELP (Code Excited Linear Prediction).
– G.723.1: a dual-rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
– G.729: a CELP-based codec at 8 kbps.

Analysis by Synthesis

[Block diagram: speech → "perceptual" comparison → model parameter generation → speech synthesis → back to the comparison; the selected parameters form the code]

[Audio clips: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]
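The analysis-by-synthesis loop can be sketched as a codebook search: each candidate excitation is synthesized through the vocal tract filter, compared with the target speech, and the best match is kept. This sketch makes two simplifying assumptions: the comparison is a plain squared error rather than a perceptually weighted one, and the codebook is just an array of candidate excitations. The function names are hypothetical.

```python
import numpy as np

def all_pole(a, excitation):
    """Synthesis filter: x(n) = sum_p a_p x(n-p) + e(n)."""
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        x[n] = excitation[n]
        for p in range(1, len(a) + 1):
            if n - p >= 0:
                x[n] += a[p - 1] * x[n - p]
    return x

def search_codebook(target, a, codebook):
    """Return (index, gain) of the codebook entry that best reproduces target."""
    best_index, best_gain, best_err = -1, 0.0, np.inf
    for i, code in enumerate(codebook):
        y = all_pole(a, code)              # synthesize the candidate
        energy = np.dot(y, y)
        if energy == 0.0:
            continue
        gain = np.dot(target, y) / energy  # least-squares optimal gain
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_index, best_gain, best_err = i, gain, err
    return best_index, best_gain
```

Only the index and the quantized gain need to be transmitted, since the decoder holds the same codebook.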

Page 17: Speech in Multimedia

Speech Recognition

Speech recognition is the foundation of human-computer interaction using speech.

Speech recognition in different contexts:
– Speaker dependent or speaker independent.
– Discrete words or continuous speech.
– Small vocabulary or large vocabulary.
– Quiet environment or noisy environment.

[Block diagram: speech → parameter analyzer → comparison and decision algorithm (using reference patterns and a language model) → words]

Page 18: Speech in Multimedia

How does Speech Recognition Work?

Words: grey whales

Phonemes: g r ey w ey l z

Each phoneme has different characteristics (for example, the power distribution).

Page 19: Speech in Multimedia

Speech Recognition

g g r ey ey ey ey w ey ey l l z

How do we “match” the word when there are time and other variations?
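To make the time-variation problem concrete, the sketch below collapses the frame-level labels from this slide back into the phoneme sequence of "grey whales". It assumes every frame has already been labeled correctly, which a real recognizer cannot rely on; the function name is hypothetical.

```python
from itertools import groupby

def collapse_frames(frame_labels):
    """Merge runs of identical consecutive frame labels into single phonemes."""
    return [label for label, _ in groupby(frame_labels)]

frames = "g g r ey ey ey ey w ey ey l l z".split()
print(collapse_frames(frames))  # ['g', 'r', 'ey', 'w', 'ey', 'l', 'z']
```

The same word spoken faster or slower only changes the run lengths, so a matching method has to tolerate arbitrary stretching of each phoneme.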

Page 20: Speech in Multimedia

Hidden Markov Model

[State diagram: states S1, S2, and S3 connected by transitions with probabilities such as P12; each state emits symbols from {a, b, c, …}]

Page 21: Speech in Multimedia

Dynamic Programming in Decoding

[Figure: trellis of states over time]

Dynamic programming finds the path through the states (phonemes) that is most likely to have generated the observed sequence of features, with one feature extracted from each speech frame.
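The dynamic programming search described here is commonly implemented as the Viterbi algorithm. The sketch below assumes discrete observation symbols and takes log-probabilities as input; the array names and layout are choices made for this example, not taken from the slides.

```python
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most probable state path for a sequence of discrete observations.

    log_init[i]     : log P(first state = i)
    log_trans[i, j] : log P(next state = j | current state = i)
    log_emit[i, o]  : log P(observe symbol o | state = i)
    """
    n_states, T = len(log_init), len(obs)
    score = np.full((T, n_states), -np.inf)   # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)  # best predecessor of each state
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[:, obs[t]]
    # Trace back from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Working in the log domain avoids numerical underflow when many small probabilities are multiplied over a long utterance.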

Page 22: Speech in Multimedia

HMM for a Unigram Language Model

[Diagram: start state s0 branches with probabilities p1, p2, and p3 to HMM1 (word 1), HMM2 (word 2), and HMM3 (word n)]

Page 23: Speech in Multimedia

Speech Synthesis

Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).

Speech synthesis has been widely used for text-to-speech systems and different telephone services.

The easiest and most often used speech synthesis method is waveform concatenation.

[Audio clip: increasing the pitch without changing the speed]
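Waveform concatenation can be sketched as joining recorded units with a short blended overlap so the joints do not click. The linear crossfade and the function name below are assumptions made for illustration; the sketch also assumes that `overlap` is positive and no longer than any unit.

```python
import numpy as np

def concatenate_units(units, overlap):
    """Join waveform units, crossfading `overlap` samples at each joint."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = np.asarray(units[0], dtype=float).copy()
    for unit in units[1:]:
        unit = np.asarray(unit, dtype=float)
        # Blend the tail of the output with the head of the next unit.
        blended = out[-overlap:] * fade_out + unit[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], blended, unit[overlap:]])
    return out
```

Each joint shortens the result by `overlap` samples, so two 100-sample units joined with a 20-sample overlap give 180 samples.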

Page 24: Speech in Multimedia

Speaker Recognition

Identifying or verifying the identity of a speaker is an application in which computers outperform humans.

Vocal tract parameters can be used as features for speaker recognition.

[Figure: LPC covariance feature (10 × 10) for speaker one and speaker two]
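The slides do not define how the LPC covariance feature is computed. One plausible reading, assumed here, is the covariance matrix of the per-frame LPC coefficient vectors, which gives a 10 × 10 matrix for 10 coefficients. The sketch below follows that assumption; the function names are hypothetical, and the default frame length of 180 samples corresponds to 22.5 ms at 8 kHz.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients of one frame (autocorrelation method)."""
    r = np.array([np.dot(frame[:len(frame) - m], frame[m:]) for m in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lpc_covariance_feature(x, frame_len=180, order=10):
    """Assumed feature: covariance of the LPC vectors over all frames."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    # Assumes no frame is silent; an all-zero frame makes the LPC system singular.
    coeffs = np.array([lpc(x[i * frame_len:(i + 1) * frame_len], order)
                       for i in range(n_frames)])
    return np.cov(coeffs, rowvar=False)  # shape (order, order)
```

Two utterances could then be compared by a distance between their feature matrices, with a small distance suggesting the same speaker.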

Page 25: Speech in Multimedia

Applications

Speech recognition
– Call routing
– Directory assistance
– Operator services
– Document input

Speaker recognition
– Personalized service
– Fraud control

Text-to-speech synthesis
– Speech interface
– Document correction
– Voice commands

Speech coding
– Wireless telephone
– Voice over Internet