International Research Institute MICA: Multimedia, Information, Communication & Applications
UMI 2954
Hanoi University of Science and Technology
1 Dai Co Viet - Hanoi - Vietnam
Vocal Technologies: From sound… to multilingualism
Prof. Eric Castelli
February 2013
Part 2
Analysis of the Speech Signal
MICA
2013
Content
Fast introduction
Basics of signal processing
Signals, systems, Fourier transform, spectra…
Analysis of the speech signal
Phonemes, spectra, formants, spectrograms (sonagrams)...
Speech synthesis
Speech signal production
Vocal cords, vocal tract, modeling...
Speech signal coding
Automatic speech recognition
Basics of speech databases
Speech
Linguistic information:
- what is pronounced by the speaker
- contains, of course, semantic information
Extra-linguistic information:
- speaker identity
- language
- physiological and emotional state of the speaker: emotions
- stress
- illness
Speech = information source
Speech processing
Speech can be distinguished from other sounds by its own acoustic characteristics.
Speech sounds are produced by two different processes:
Vibration of the vocal cords
Voiced source (pseudo-periodic source)
Turbulence created by the airflow
Passing quickly through a vocal tract constriction
Or during the opening of a vocal tract occlusion
This is considered a noise source (pseudo-random source)
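The two source types can be sketched as excitation signals. This is a minimal illustration, assuming a sampling rate `fs` in Hz and a small random jitter on each glottal period so that the impulse train is only pseudo-periodic; the function names and the jitter model are hypothetical, not taken from the course.

```python
import numpy as np

def voiced_source(f0, fs, duration, jitter=0.01, seed=0):
    """Pseudo-periodic impulse train: one pulse per glottal cycle.

    Each period is 1/f0 perturbed by a small random relative jitter
    (a hypothetical model of cycle-to-cycle irregularity).
    """
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    x = np.zeros(n)
    t = 0.0  # time of the next pulse, in seconds
    while t * fs < n:
        x[int(t * fs)] = 1.0
        t += (1.0 / f0) * (1.0 + jitter * rng.standard_normal())
    return x

def noise_source(fs, duration, seed=0):
    """Pseudo-random source: white Gaussian noise (turbulence model)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(int(duration * fs))
```

Either array can then serve as the excitation of a source-filter model: the impulse train for voiced sounds, the noise for turbulent sounds.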
Phonemes
For a given language, the main function of sounds is to establish distinctions between meaningful units.
Phonemes are the smallest acoustic units that allow us to distinguish between words.
Examples: [p] / [b] (in French)
pas / bas (negation / low)
paie / baie (wages / berry or bay)
pot / beau (pot / beautiful)
French phonemes
Vietnamese language structure
Vietnamese:
a tonal language (like Mandarin, Cantonese, Thai)
every syllable carries one tone
Vietnamese:      ma      mà        mã      mả      má      mạ
English:         ghost   however   horse   tomb    cheek   rice seedling
Syllable sound:  ma1     ma2       ma3     ma4     ma5     ma6
Syllable structure (the tone spans the whole syllable):
initial part: initial sound (optional)
final part: pre-tonal sound (optional) + vowel (obligatory) + final sound (optional)
Initial part: 21 consonants
Final part: 155 possible final parts
Pre-tonal sound: 1 semi-vowel (optional)
Vowel: 11 vowels + 4 diphthongs + 1 triphthong (obligatory)
Final sound: 6 consonants and 2 semi-vowels (optional)
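The syllable structure above maps naturally onto a small data structure. The sketch below is hypothetical (the `Syllable` class and its field names are not from the course): segments are plain strings, the tone is a number from 1 to 6 attached to the whole syllable, and no check is made against the actual inventories of 21 initials or 155 final parts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Syllable:
    """Vietnamese syllable: (initial) + final part, with one tone over the whole.

    Final part = (pre-tonal semi-vowel) + vowel + (final sound).
    Only the vowel is obligatory.
    """
    vowel: str                        # obligatory: vowel, diphthong or triphthong
    tone: int                         # 1..6, carried by the whole syllable
    initial: Optional[str] = None     # optional initial consonant
    pretonal: Optional[str] = None    # optional pre-tonal semi-vowel
    final: Optional[str] = None       # optional final consonant or semi-vowel

    def __post_init__(self):
        if not self.vowel:
            raise ValueError("the vowel is obligatory")
        if not 1 <= self.tone <= 6:
            raise ValueError("tone must be between 1 and 6")

    def final_part(self) -> str:
        """Pre-tonal sound + vowel + final sound (missing parts are skipped)."""
        return (self.pretonal or "") + self.vowel + (self.final or "")

    def __str__(self) -> str:
        """Label in the ma1 ... ma6 style: segments followed by the tone number."""
        return (self.initial or "") + self.final_part() + str(self.tone)
```

For example, `Syllable(vowel="a", tone=3, initial="m")` prints as `ma3`, matching the labels in the table above.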
Speech signal
[Figure: waveform of the French word “Bonsoir” (“Good evening”), segmented as b | rõ | s | oa]
Speech signal
[Figure: waveform of the French sentence “Vous êtes Monsieur Gilbert Dupont, n’est-ce pas ?” (“You are Mr. Gilbert Dupont, aren’t you?”)]
Main difficulty
The main characteristic of the speech signal is
VARIABILITY
A speaker never pronounces the same sound twice in exactly the same way.
Two speakers never produce the same speech signal for the same word.
However, humans always recognize and understand these sounds well.
Same speaker variability
Same sound, same speaker, same recording conditions
Variability between speakers
Same sound, same recording conditions, two different speakers
Variability due to recording conditions
Same sound, same speaker, two different microphones
Why analyze speech?
To study and understand the physical phenomena involved
Be a little less ignorant...
Also understand dysfunctions (speech and language disorders)
Use this knowledge to help people learn foreign languages
To reproduce artificial speech
Speech synthesis (for instance formant synthesis, diphone synthesis, HMM synthesis)
Vocal tract modeling
To determine characteristic measurements (features) that automatic speech recognition engines can use
Spectral characteristics (LPC, MFCC, fundamental frequency, etc.)
Speech processing: analysis – why?
Acoustic and spectral analysis
Nature of speech sounds in terms of:
frequencies
durations and timing constraints
energy
coordination and co-articulation
etc.
Applications
Understand the physical phenomena involved in speech production
• Position of the articulators, role of the parts of the vocal system (vocal cords, vocal tract, etc.)
Speech perception
• How do humans classify speech sounds?
Define relevant parameters
For speech synthesis
• Elementary signal units, intonation, etc.
For speech recognition
• Find a relevant parameter vector as input to the recognition engine
For speech coding
• Adapted filters (ADPCM, CELP, etc.) for efficient coding with good quality
Direct measurements on a human subject
Vocal tract transfer function measurements (Castelli, 1989)
Direct measurements on a human subject
Aerodynamic and acoustic measurements
Measurements of:
- Radiated pressure
- Airflow at the lips
- High-frequency energy
- Low-frequency energy
- Fundamental frequency
Direct measurements...
Disadvantages:
Difficult to carry out
Require substantial equipment
Difficult to reproduce
No automatic procedures
The human subject must be trained
Must not move for a long time
Must move only ONE vocal tract articulator
Some measurements are intrusive
The human subject has difficulty “enduring” them
They alter “natural” speaking conditions
Direct measurement of some characteristics is simply impossible
For example: the glottal flow
ONLY FOR FUNDAMENTAL RESEARCH
Speech analysis tools
Production model: source-filter concept (or excitator-resonator)
The speech production system (vocal tract) can be considered a “speech instrument”
[Figure: analogy with a string instrument: the strings (excitator, source) and the body (resonator, filter)]
Production model: source-filter concept (or excitator-resonator)
[Diagram: Excitator (source): an impulse source (vocal cords, glottal signal) and a noise source (occlusion), with a gain. Resonator (filter): the vocal tract, with resonances F1, F2, F3 along the frequency axis f, driven by commands. Output: speech.]
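To make the source-filter model concrete, here is a minimal sketch that assumes the vocal tract filter is a cascade of second-order resonators, one per formant, each defined by a centre frequency and a bandwidth. The function names and the bandwidth values are illustrative assumptions, not parameters from the course. Any excitation (an impulse train for voiced sounds, noise for turbulence) can be passed through it.

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """Two-pole resonator y[n] = g*x[n] + a1*y[n-1] + a2*y[n-2].

    Poles at r*exp(+/- j*theta): theta = 2*pi*freq/fs sets the resonance
    frequency, r = exp(-pi*bw/fs) sets the bandwidth.
    g normalises the gain to 1 at 0 Hz.
    """
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a1, a2 = 2 * r * np.cos(theta), -r * r
    g = 1 - a1 - a2  # DC gain normalisation
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = g * x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

def source_filter(excitation, formants, bandwidths, fs, gain=1.0):
    """Excitation -> gain -> cascade of formant resonators (vocal tract)."""
    y = gain * np.asarray(excitation, dtype=float)
    for f, b in zip(formants, bandwidths):
        y = resonator(y, f, b, fs)
    return y
```

Feeding an impulse train with F0 = 100 Hz through resonators at 500, 1500 and 2500 Hz gives a vowel-like signal whose strongest spectral peak lies near F1. A cascade is only one possible design; parallel formant banks are another common choice.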
Formants
Formants are the resonance frequencies of the vocal tract
[Figure: spectrum (energy versus frequency f) showing the formants F1, F2, F3, and the corresponding spectrogram]
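Formants can also be estimated automatically from the signal. One common approach, sketched here and not necessarily the one used in this course, fits an all-pole (LPC) model to a windowed frame by solving the autocorrelation normal equations, then reads the formant frequencies from the angles of the complex poles. The model order, the Hamming window, the bandwidth limit `max_bw` and the function names are assumptions.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients a = [1, a1, ..., ap] (autocorrelation method).

    Solves the normal equations R a[1:] = -r[1:], where R is the
    Toeplitz matrix of autocorrelation values r[|i - k|].
    """
    x = np.asarray(x, dtype=float) * np.hamming(len(x))
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:])
    return np.concatenate(([1.0], a))

def formants(x, fs, order=8, max_bw=400.0):
    """Formant frequencies (Hz) from the angles of the LPC poles."""
    roots = np.roots(lpc(x, order))
    roots = roots[np.imag(roots) > 0]            # one pole per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle -> frequency
    bws = -np.log(np.abs(roots)) * fs / np.pi    # pole radius -> bandwidth
    return np.sort(freqs[bws < max_bw])
```

A rule of thumb often quoted for speech is an order of about 2 + fs/1000. Poles with very wide bandwidths are discarded because they tend to shape the overall spectral tilt rather than a formant.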
Example in the frequency domain: spectra & vowel formants [o]
Formants
[Figure: formants of the English vowels /i/, /a/, /u/ and of the French vowel /o/]
Formants: vocalic triangle and dispersion ellipses
Dispersions are due to the high variability of speech (including between-speaker variability)
For French:
[Figures: acoustic triangle; phonetic triangle]
Formants: some values
For French
Formants & vocalic triangle for Vietnamese
[Figure: one male speaker (LA). Vowels a, i, u, ee, e, o, oo, ow, uw plotted in the F1/F2 plane (axis ranges 100 to 1100 Hz and 400 to 2400 Hz) and in the F2/F3 plane (axis ranges 2000 to 3400 Hz and 400 to 2400 Hz)]
Formants & vocalic triangle for Vietnamese
11 vowels: a, e, ê, i, u, o, ô, ơ, ư, â, ă
9 Vietnamese speakers (male)
[Figure: vowels of the 9 speakers in the F1/F2 plane (axis ranges 100 to 1100 Hz and 400 to 2400 Hz) and in the F2/F3 plane (axis ranges 2000 to 3400 Hz and 400 to 2400 Hz); point labels a, i, u, ee, e, o, oo, ow, uw, ư, ơ, ă, â]
Spectrogram - 1
Spectrogram - 2
Spectrogram - 3
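A spectrogram is a sequence of short-time spectra. As a sketch, it can be computed with NumPy's FFT as the log-magnitude of overlapping Hamming-windowed frames. The frame length (256 samples) and hop (128 samples) are assumed defaults, not the settings used for these slides, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def spectrogram(x, fs, frame_len=256, hop=128):
    """Short-time log-magnitude spectra in dB.

    Returns (spec_db, freqs): spec_db has shape (n_frames, frame_len//2 + 1),
    one row per frame; freqs gives the frequency (Hz) of each column.
    """
    x = np.asarray(x, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    spec_db = 20 * np.log10(mag + 1e-12)   # small floor avoids log(0)
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    return spec_db, freqs
```

A longer frame gives finer frequency resolution (harmonics become visible, "narrow-band" spectrogram), while a shorter frame gives finer time resolution (formants and glottal pulses stand out, "wide-band" spectrogram).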
Production model: source-filter concept (or excitator-resonator)
[Diagram: Excitator (source): an impulse source (vocal cords, glottal signal) and a noise source (occlusion), with a gain. Resonator (filter): the vocal tract, with resonances F1, F2, F3 along the frequency axis f, driven by commands. Output: speech.]
1) Already done
2) Next slides
Fundamental frequency estimation
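F0 = fundamental frequency (pitch)
Corresponds to the vibration frequency of the vocal cords
Estimation methods:
Zero-crossing or threshold-crossing
Autocorrelation
AMDF (Average Magnitude Difference Function)
Cepstrum
FFT and Dirac “comb”
AMDF:
$$\mathrm{Amdf}(n) = \sum_{k=0}^{N-1} \left| x(k) - x(k+n) \right|$$
Cepstrum:
$$C(x(t)) = \mathrm{IFFT}\left(\log\left|\mathrm{FFT}\big(w(t)\,x(t)\big)\right|\right)$$
FFT and Dirac comb:
$$P[k] = \left|\mathrm{FFT}\big(x(t)\big)[k]\right|, \qquad FP[r] = \sum_{k} P[k] \sum_{i} \delta[k - i\,r]$$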
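The AMDF formula translates almost directly into code. In this sketch the 60 to 400 Hz search range, the threshold heuristic and the function names are assumptions: instead of taking the global minimum (which may fall on a multiple of the period), it takes the deepest dip within the first run of lags whose AMDF falls below a threshold.

```python
import numpy as np

def amdf(x, max_lag):
    """Amdf(n) = sum_{k=0}^{N-1} |x(k) - x(k+n)| for n = 0..max_lag,
    with N = len(x) - max_lag so every term stays inside the signal."""
    x = np.asarray(x, dtype=float)
    N = len(x) - max_lag
    return np.array([np.sum(np.abs(x[:N] - x[n:n + N]))
                     for n in range(max_lag + 1)])

def f0_amdf(x, fs, fmin=60.0, fmax=400.0, rel_thresh=0.1):
    """F0 = fs / period, the period being the lag of the first deep AMDF dip."""
    min_lag, max_lag = int(fs / fmax), int(fs / fmin)
    d = amdf(x, max_lag)[min_lag:]             # only plausible pitch periods
    thresh = d.min() + rel_thresh * (d.max() - d.min())
    start = np.flatnonzero(d <= thresh)[0]     # first lag that dips low enough
    end = start
    while end + 1 < len(d) and d[end + 1] <= thresh:
        end += 1                               # extend over that dip
    period = min_lag + start + np.argmin(d[start:end + 1])
    return fs / period
```

AMDF needs no multiplications, which made it attractive on early hardware; its minima play the role of the autocorrelation maxima.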
Fundamental frequency estimation
[Figure panels: signal; Hamming window; signal after windowing; signal spectrum; signal autocorrelation; signal cepstrum]
Fundamental frequency estimation: autocorrelation
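The autocorrelation function is given by the following formula:
$$\varphi_x(n) = \sum_{i=0}^{N} x(i)\,x(i+n)$$
For a pseudo-periodic signal, this function has maxima at multiples of the period; the first maximum after lag 0 lies at lag n = Fe/F0 samples, where Fe is the sampling frequency.
We can detect the maxima, deduce the period, and compute the fundamental frequency (= 1/period).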
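A direct sketch of this method, assuming a 60 to 400 Hz search range and simply choosing the highest autocorrelation value in that lag range (practical pitch trackers add voicing decisions and smoothing). The sum runs over the valid indices only, so longer lags use fewer terms; the function names are hypothetical.

```python
import numpy as np

def autocorr(x, max_lag):
    """phi_x(n) = sum_i x(i) * x(i+n), for n = 0..max_lag (valid i only)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - n], x[n:]) for n in range(max_lag + 1)])

def f0_autocorr(x, fs, fmin=60.0, fmax=400.0):
    """F0 = Fe / lag of the highest autocorrelation peak in the pitch range."""
    min_lag, max_lag = int(fs / fmax), int(fs / fmin)
    phi = autocorr(x, max_lag)
    lag = min_lag + np.argmax(phi[min_lag:])
    return fs / lag
```

Because the sum shrinks as the lag grows, the peak at one period usually beats the peaks at two or three periods, which limits octave errors.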
Fundamental frequency estimation: zero-crossing
During vowel production, the speech signal has a pseudo-periodic shape.
It is then sufficient to set a threshold (zero-crossing or threshold-crossing) and count the number of crossings per unit of time.
[Figure: waveform with the threshold level marked]
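Counting crossings can be sketched as follows, assuming only upward crossings are counted and F0 is derived from the time span between the first and last crossing (equivalently, the inverse of the mean interval between crossings). The function name is hypothetical.

```python
import numpy as np

def f0_crossings(x, fs, threshold=0.0):
    """Estimate F0 by counting upward crossings of `threshold`.

    F0 = (number of crossings - 1) / (time between first and last crossing).
    """
    x = np.asarray(x, dtype=float)
    # indices where the signal goes from below to at-or-above the threshold
    up = np.flatnonzero((x[:-1] < threshold) & (x[1:] >= threshold)) + 1
    if len(up) < 2:
        raise ValueError("need at least two crossings")
    return (len(up) - 1) * fs / (up[-1] - up[0])
```

On real speech, formant ringing can produce several crossings per period, so the choice of threshold (or a low-pass pre-filter) matters much more than on a clean sinusoid.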
Fundamental frequency estimation: cepstrum
The cepstrum is computed with the following formula:
$$C(x(t)) = \mathrm{IFFT}\left(\log\left|\mathrm{FFT}\big(w(t)\,x(t)\big)\right|\right)$$
In the resulting signal, the peak corresponding to the source is clearly visible.
[Figure: cepstrum with the source peak marked]
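The cepstrum formula can be sketched with NumPy, taking w(t) to be a Hamming window (an assumption) and looking for the source peak between the quefrencies that correspond to 400 Hz and 60 Hz (also an assumed range). The function names are hypothetical.

```python
import numpy as np

def real_cepstrum(x):
    """C = IFFT(log|FFT(w * x)|), with a Hamming window w."""
    x = np.asarray(x, dtype=float) * np.hamming(len(x))
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)  # floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

def f0_cepstrum(x, fs, fmin=60.0, fmax=400.0):
    """F0 from the source peak: the highest cepstral value in the pitch range."""
    c = real_cepstrum(x)
    min_q, max_q = int(fs / fmax), int(fs / fmin)
    q = min_q + np.argmax(c[min_q:max_q + 1])
    return fs / q
```

The log turns the product "source spectrum × vocal tract spectrum" into a sum, so the slowly varying vocal tract envelope stays at low quefrencies while the harmonic ripple of the source produces a peak at the pitch period.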
Indirect F0 measurements: glottal flow measurements
Remove the vocal tract contribution from the acoustic speech signal
[Figure: source-filter diagram: the vocal cords (glottal flow Ug) and a noise source drive the filter (vocal tract), which produces the radiated pressure]
Radiated pressure signal picked up at the lips
Isolate the voiced (purely harmonic) part of the speech signal
LPC analysis
Iterative procedure: Iterative Adaptive Inverse Filtering
• Source-filter representation of the speech production system
• Inverse filtering
• Sinusoid decomposition
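The inverse-filtering step can be sketched with a single LPC pass: estimate an all-pole vocal tract model A(z) with the Levinson-Durbin recursion, then filter the signal with A(z) to remove the vocal tract contribution. This is a simplification and not the Iterative Adaptive Inverse Filtering procedure itself, which repeats estimation and inverse-filtering steps; the order of 12 and the function names are assumptions.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation r[0..order].

    Returns a = [1, a1, ..., ap] such that e[n] = sum_k a[k] * x[n-k]
    is the prediction error, and the final prediction error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def inverse_filter(x, order=12):
    """Estimate A(z) on x; return (a, residual), residual = x filtered by A(z)."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a, _ = levinson_durbin(r, order)
    residual = np.convolve(x, a)[: len(x)]  # FIR filtering by A(z)
    return a, residual
```

For voiced speech the residual only roughly approximates the glottal flow derivative, since lip radiation acts approximately as a differentiator and a single LPC pass also absorbs part of the glottal spectrum into A(z).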
Glottal flow signal obtained
[Figure: two panels over the time interval 0.24 to 0.3 s: signal of radiated pressure at the lips; glottal flow derivative signal]
Glottal flow characterization
[Figure: glottal flow (cm³/s) and its derivative (cm³/s²) around 0.09 to 0.1 s, annotated with U0, Ee, Ei, T0 = 1/F0 and Tc]
Amplitude U0
Period T0 = 1/F0
Closure time Tc
Energies Ei and Ee
Application to Vietnamese tone analysis
Nguyen Quoc Cuong PhD work
Tone    Description            Vietnamese name              Mark
tone1   flat (or level) tone   không dấu (bằng or ngang)    ( )
tone2   falling tone           huyền                        (`)
tone3   broken tone            ngã                          (~)
tone4   interrogative tone     hỏi                          (?)
tone5   rising (high) tone     sắc                          (´)
tone6   low (heavy) tone       nặng                         (.)

Register   plain     melodic         glottal
high       flat      rising          broken
low        falling   interrogative   low (heavy)

Number of tones to be characterized: 8
tone1 (flat, không dấu), tone2 (falling, huyền), tone3 (broken, ngã), tone4 (interrogative, hỏi)
tone5a and tone5b: tone5 (rising, sắc) in open and closed syllables respectively
tone6a and tone6b: tone6 (low, nặng) in open and closed syllables respectively
Measurement points
Two registers
Vietnamese tone standard shapes: one female speaker
[Figure: standard shapes of Tone1, Tone2, Tone3, Tone4 (North), Tone4 (South & Centre), Tone5a, Tone5b, Tone6a, Tone6b, with maximal and minimal values]
Nguyen Quoc Cuong PhD work
Vietnamese tone standard shapes: one male speaker
[Figure: standard shapes of Tone1, Tone2, Tone3, Tone4 (North), Tone4 (South & Centre), Tone5a, Tone5b, Tone6a, Tone6b]
Nguyen Quoc Cuong PhD work
Vietnamese tone durations
Comparison of relative durations:
$$X = \frac{Y}{N}$$
X: relative duration of tone i
N: mean duration of the six tones for the same speaker
Y: mean duration of tone i for the speaker
[Figure: relative duration chart for the Northern speakers PNY, VTT, DPQ, DHH, DHL, BXH, TTA: relative duration (0 to 1.6) for tone1, tone2, tone3, tone4, tone5a, tone5b, tone6a, tone6b]
[Figure: relative duration chart for the Southern and Central speakers NTH, VTH, BKH, LPL, LVS, TVH, HBQ, TTT: same axes]
tone6a differs between the North and the South/Centre
Nguyen Quoc Cuong PhD work
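The relative-duration formula X = Y/N is simple arithmetic. This sketch assumes the raw durations of one speaker are grouped by tone label, and that N is the mean of the per-tone means (the slide does not say whether N averages per-tone means or all tokens). The function name and input layout are hypothetical.

```python
from statistics import mean

def relative_durations(durations):
    """X_i = Y_i / N for one speaker.

    durations: dict mapping tone label -> list of measured durations.
    Y_i is the mean duration of tone i; N is taken here as the mean of
    the per-tone means (an assumption about how N is computed).
    """
    y = {tone: mean(d) for tone, d in durations.items()}
    n = mean(list(y.values()))
    return {tone: yi / n for tone, yi in y.items()}
```

Normalising by each speaker's own mean removes differences in overall speaking rate, so that speakers from different regions can be compared on the same chart.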
Some special cases for tone 6a
An example of tone6a from speaker NTH with the syllable “cạnh”
An example of tone6a from speaker TTT with the syllable “bịa”
An example of tone6a from speaker BKH with the syllable “cộng”
Usually, all tone6a realizations are falling and short
However, some specific cases of tone6a are found in the South and Centre
An example of a “normal” tone6a
Nguyen Quoc Cuong PhD work
Multimedia demonstration
Spectral Analysis
FFT, spectra
Spectrogram
Sensimetrics software