Speech in Multimedia
Hao Jiang
Computer Science Department
Boston College
Oct. 9, 2007
Outline
Introduction
Topics in speech processing
– Speech coding
– Speech recognition
– Speech synthesis
– Speaker verification/recognition
Conclusion
Introduction
Speech is our basic communication tool.
We have long hoped to communicate with machines using speech.
C3PO and R2D2
Speech Production Model
[Figure: anatomical structure and mechanical model of speech production]
Characteristics of Digital Speech
[Figure: waveform and spectrogram (time versus frequency) of a speech signal]
Voiced and Unvoiced Speech
[Figure: speech waveform with silence, unvoiced, and voiced segments marked]
Short-time Parameters
[Figure: a speech waveform and its short-time parameters: short-time power, waveform envelope, zero-crossing rate, and pitch period]
Speech Coding
Similar to images, we can also compress speech to make it smaller and easier to store and transmit.
General compression methods such as DPCM can also be used.
More compression can be achieved by taking advantage of the speech production model.
There are two classes of speech coders:
– Waveform coder
– Vocoder
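As a rough illustration of the DPCM idea mentioned above, the sketch below codes the difference between each sample and the previous reconstructed sample. The first-order predictor, the uniform quantizer, and the step size are assumptions chosen for brevity; practical coders use better predictors and adaptive quantizers.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes, prediction = [], 0.0
    for x in samples:
        code = round((x - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # mirror what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0.0, 0.1, 0.25, 0.3, 0.2]
codes = dpcm_encode(samples, step=0.05)
decoded = dpcm_decode(codes, step=0.05)
print(codes)
print(decoded)
```

Because the encoder tracks the decoder's reconstruction, the quantization error does not accumulate: each decoded sample stays within half a step of the original.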
LPC Speech Coder
[Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis (pitch, voiced/unvoiced, vocal tract parameters, energy parameter) → quantizer → code generation → code stream]
LPC and Vocal Tract
Mathematically, speech can be modeled with the following generation model:
$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$
$\{a_1, a_2, \ldots, a_k\}$ are called the Linear Prediction Coefficients (LPC); they model the shape of the vocal tract.
e(n) is the excitation that generates the speech.
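One common way to estimate the coefficients a_p from a frame of speech is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method is used, so the sketch below is an assumption; the function names and the synthetic test signal are illustrative only.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Return [a_1, ..., a_k] for x(n) ~ sum_p a_p x(n-p), via Levinson-Durbin.

    Assumes the frame is not all zeros (otherwise r[0] is 0).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)      # a[p] holds a_p; a[0] is unused
    err = r[0]                   # prediction error energy
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

# Synthetic check: generate x(n) = 1.3 x(n-1) - 0.6 x(n-2) + e(n) with noise excitation.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(2000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
coeffs = lpc(x, order=2)
print(coeffs)   # expected to be close to [1.3, -0.6]
```

On real speech the same routine would be applied to each frame, typically after windowing, with the order set to the number of coefficients the coder transmits.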
Decoding and Speech Synthesis
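[Block diagram: an impulse train generator (controlled by the pitch period) drives a glottal pulse generator; a U/V switch selects between this and a random noise generator; the selected excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]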
An Example for Synthesizing Speech
Blending region
Glottal Pulse
Pass through the vocal tract filter with gain control
Pass through the radiation filter
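The synthesis chain for a voiced frame can be sketched as follows. This is a simplified illustration under stated assumptions: the glottal pulse shaping and the blending between frames are omitted, the radiation model is approximated by a first-order difference with a hypothetical coefficient alpha, and all function names are made up for the example.

```python
def impulse_train(num_samples, pitch_period):
    """Excitation for voiced speech: one unit pulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

def all_pole_filter(excitation, lpc_coeffs, gain):
    """Vocal tract filter: x(n) = sum_p a_p x(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a * out[n - p]
                   for p, a in enumerate(lpc_coeffs, start=1) if n - p >= 0)
        out.append(past + gain * e)
    return out

def radiation_filter(x, alpha=0.95):
    """Assumed radiation model: first-order difference y(n) = x(n) - alpha * x(n-1)."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

excitation = impulse_train(num_samples=240, pitch_period=80)
speech = radiation_filter(all_pole_filter(excitation, lpc_coeffs=[1.3, -0.6], gain=0.5))
print(speech[:5])
```

For an unvoiced frame, the impulse train would be replaced by random noise while the two filters stay the same.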
LPC10 (FS1015)
LPC10 was the U.S. Department of Defense (DoD) speech coding standard for voice communication at 2.4 kbps.
LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.
[Audio: original speech; LPC-decoded speech]
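The frame size and bit rate fix the bit budget per frame. The short calculation below works it out, assuming an 8 kHz sampling rate; the variable names are illustrative.

```python
sample_rate_hz = 8000    # assumed sampling rate
frame_ms = 22.5          # frame length from the slide
bit_rate_bps = 2400      # 2.4 kbps

samples_per_frame = sample_rate_hz * frame_ms / 1000   # 180.0 samples
frames_per_second = 1000 / frame_ms                    # about 44.4 frames
bits_per_frame = bit_rate_bps * frame_ms / 1000        # 54.0 bits

print(samples_per_frame, frames_per_second, bits_per_frame)
```

So pitch, the voiced/unvoiced decision, gain, and the 10 LPC coefficients for 180 samples must all fit in 54 bits, which is why the coder transmits model parameters rather than the waveform.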
Mixed Excitation LP
For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.
The new 2.4kbps standard (MELP) addresses this problem.
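[Block diagram: pulses and noise each pass through a bandpass filter, are weighted by w and 1-w respectively, and are summed; the sum is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech]
[Audio: original speech; MELP-decoded speech]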
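The mixing step can be sketched as a weighted sum of a pulse train and noise. This is only an illustration of the weighting: the bandpass filters are omitted, the assignment of w to the pulses and 1-w to the noise is an assumption, and the function name and parameters are hypothetical.

```python
import random

def mixed_excitation(num_samples, pitch_period, w, seed=0):
    """Blend a pulse train (weight w) with uniform noise (weight 1 - w)."""
    rng = random.Random(seed)
    out = []
    for n in range(num_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.uniform(-1.0, 1.0)
        out.append(w * pulse + (1.0 - w) * noise)
    return out

mostly_voiced = mixed_excitation(160, pitch_period=80, w=0.8)
print(mostly_voiced[:4])
```

Setting w to 1 gives a pure pulse train and w to 0 gives pure noise, the two cases of the simple LPC vocoder.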
Hybrid Speech Codecs
For higher bit rate speech coders, hybrid speech codecs have an advantage over vocoders.
FS1016: CELP (Code Excited Linear Prediction).
G.723.1: a dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
G.729: a CELP-based codec at 8 kbps.
Analysis by Synthesis
[Block diagram: speech → “perceptual” comparison → model parameter generation → speech synthesis → back to the comparison; output: code]
[Audio: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]
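The analysis-by-synthesis loop can be sketched as a codebook search: each candidate excitation is run through the synthesis filter, and the one whose output is closest to the target speech is kept. The sketch below makes several simplifying assumptions: it uses plain squared error instead of a perceptual weighting, a tiny hand-made codebook, and hypothetical function names. It is not the search procedure of any particular standard.

```python
def synthesize(excitation, lpc_coeffs):
    """All-pole synthesis: x(n) = sum_p a_p x(n-p) + e(n)."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a * out[n - p]
                   for p, a in enumerate(lpc_coeffs, start=1) if n - p >= 0)
        out.append(past + e)
    return out

def search_codebook(target, codebook, lpc_coeffs):
    """Return (index, gain) of the entry whose synthesized output best matches target."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, lpc_coeffs)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                                             # skip an all-zero entry
        gain = sum(t * v for t, v in zip(target, y)) / energy    # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[0]:
            best = (error, index, gain)
    return best[1], best[2]

codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
lpc_coeffs = [0.5]
target = [2.0 * v for v in synthesize(codebook[2], lpc_coeffs)]
index, gain = search_codebook(target, codebook, lpc_coeffs)
print(index, gain)
```

Only the chosen index and the quantized gain need to be transmitted, since the decoder holds the same codebook and runs the same synthesis filter.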
Speech Recognition
Speech recognition is the foundation of human-computer interaction using speech.
Speech recognition in different contexts
– Dependent on or independent of the speaker.
– Discrete words or continuous speech.
– Small vocabulary or large vocabulary.
– Quiet environment or noisy environment.
[Block diagram: speech → parameter analyzer → comparison and decision algorithm (using reference patterns and a language model) → words]
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, the power distribution).
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?
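To see what the time variation looks like, suppose each speech frame had already been given a clean phoneme label. Collapsing consecutive repeats would then recover the phoneme sequence, as in the toy sketch below. Real recognizers do not get clean per-frame labels, so this only illustrates the problem; the function name is hypothetical.

```python
from itertools import groupby

def collapse_repeats(frame_labels):
    """Merge runs of identical consecutive labels into a single label."""
    return [label for label, _ in groupby(frame_labels)]

frame_labels = "g g r ey ey ey ey w ey ey l l z".split()
print(collapse_repeats(frame_labels))   # ['g', 'r', 'ey', 'w', 'ey', 'l', 'z']
```

The number of frames spent in each phoneme varies from utterance to utterance, so a matching method has to tolerate arbitrary run lengths.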
Hidden Markov Model
[Diagram: states S1, S2, and S3 joined by transitions with probabilities such as P12; each state emits symbols from {a, b, c, …}]
Dynamic Programming in Decoding
[Figure: trellis of states versus time]
We can find the path through the trellis that corresponds to the most probable phoneme sequence for generating the observed sequence of features (one feature extracted from each speech frame).
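The dynamic programming search described here is usually the Viterbi algorithm. A minimal sketch for a discrete HMM follows; the three-state left-to-right model, its probabilities, and the symbol alphabet are made up for illustration and are not taken from the slides.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, log_probability) for a discrete HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s]: best log-probability of any path that ends in state s at the current time
    score = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    back = []                                   # back-pointers, one dict per time step
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda q: prev[q] + log(trans_p[q][s]))
            score[s] = prev[best_prev] + log(trans_p[best_prev][s]) + log(emit_p[s][obs])
            pointers[s] = best_prev
        back.append(pointers)

    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):             # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]

states = ["S1", "S2", "S3"]
start_p = {"S1": 1.0, "S2": 0.0, "S3": 0.0}
trans_p = {"S1": {"S1": 0.5, "S2": 0.5, "S3": 0.0},
           "S2": {"S1": 0.0, "S2": 0.5, "S3": 0.5},
           "S3": {"S1": 0.0, "S2": 0.0, "S3": 1.0}}
emit_p = {"S1": {"a": 0.8, "b": 0.1, "c": 0.1},
          "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
          "S3": {"a": 0.1, "b": 0.1, "c": 0.8}}

path, log_prob = viterbi(["a", "a", "b", "b", "c"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

Working in log-probabilities avoids numerical underflow on long observation sequences, and the cost grows linearly with the number of frames.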
HMM for a Unigram Language Model
[Diagram: a start state s0 branches with probabilities p1, p2, p3 to the word models HMM1 (word1), HMM2 (word2), HMM3 (wordn)]
Speech Synthesis
Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
Speech synthesis is widely used in text-to-speech systems and various telephone services.
The easiest and most often used speech synthesis method is waveform concatenation.
Increase the pitch without changing the speed
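Waveform concatenation can be sketched as joining recorded units end to end. To avoid an audible click at the joint, a short overlap is often cross-faded; the linear cross-fade, the function name, and the toy units below are assumptions for illustration, and the sketch does not attempt any pitch modification.

```python
def concatenate(unit_a, unit_b, blend):
    """Join two waveform units, linearly cross-fading over `blend` samples."""
    if blend == 0:
        return list(unit_a) + list(unit_b)
    head, tail = unit_a[:-blend], unit_a[-blend:]
    mixed = []
    for i in range(blend):
        w = (i + 1) / (blend + 1)              # weight of unit_b rises across the overlap
        mixed.append((1.0 - w) * tail[i] + w * unit_b[i])
    return list(head) + mixed + list(unit_b[blend:])

joined = concatenate([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], blend=2)
print(joined)
```

The result is shorter than the two units combined by exactly the length of the overlap.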
Speaker Recognition
Identifying or verifying the identity of a speaker is an application in which computers can outperform humans.
Vocal tract parameters can be used as features for speaker recognition.
[Figure: LPC covariance feature shown as 10 × 10 matrices for speaker one and speaker two]
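One plausible reading of the "LPC covariance feature" is the covariance matrix of the per-frame LPC vectors, which is 10 × 10 for 10 coefficients. The sketch below computes such a matrix and picks the enrolled speaker with the closest one. Both this interpretation and the Frobenius distance used for comparison are assumptions, and the function names and toy data are hypothetical.

```python
def covariance_matrix(vectors):
    """Sample covariance of equal-length feature vectors (one vector per frame)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def matrix_distance(a, b):
    """Frobenius distance between two matrices of the same size."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) ** 0.5

def identify(unknown_vectors, enrolled):
    """Return the enrolled speaker whose stored covariance is closest."""
    cov = covariance_matrix(unknown_vectors)
    return min(enrolled, key=lambda name: matrix_distance(cov, enrolled[name]))

# Toy example with 2-dimensional vectors standing in for 10 LPC coefficients.
frames_unknown = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
enrolled = {"speaker one": [[4.0 / 3.0, 0.0], [0.0, 4.0 / 3.0]],
            "speaker two": [[10.0, 5.0], [5.0, 10.0]]}
print(identify(frames_unknown, enrolled))
```

For verification rather than identification, the distance to the claimed speaker's matrix would instead be compared against a threshold.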
Applications
Speech recognition
– Call routing
– Directory assistance
– Operator services
– Document input
Speaker recognition
– Personalized service
– Fraud control
Text-to-speech synthesis
– Speech interface
– Document correction
– Voice commands
Speech coding
– Wireless telephone
– Voice over Internet