41
Speech Technologies – Speech Production II. Speech Production: from anatomy to modeling 1. Speech production mechanism 2. Acoustic properties of the speech signal 3. Speech production digital model

1. Speech production mechanism 2. Acoustic properties of

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Tecnologías de la VozII. Speech Production: from anatomy to modeling
1. Speech production mechanism 2. Acoustic properties of the speech signal 3. Speech production digital model
Speech Technologies – Speech Production
Vocal Cords
A pair of elastic structures of tendon, muscle and mucous membrane 15 mm long in men 13 mm long in women
Can be varied in length and thickness and positioned
Vibration of the vocal cords occurs when a) they are sufficiently elastic and close together
b) there is a sufficient difference between subglottal pressure and supraglottal pressure
Successive vocal fold openings the fundamental period the fundamental frequency or pitch -> men: 50-250 Hz -> women: 120-500 Hz
Speech Technologies – Speech Production
Speech Technologies – Speech Production
Vocal Cords Glottal volume velocity and resulting sound pressure at the
start of a voiced sound
Speech Technologies – Speech Production
Vocal Cords: Spectral Properties
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 500 1000 1500 2000 2500 3000 3500 4000 -50
-40
-30
-20
-10
0
10
20
30
40
50
frequency (Hz)
20 lo
g| U
g( ω )|
0 2 4 6 8 10 12 14 16 18 20 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 500 1000 1500 2000 2500 3000 3500 4000 -40
-30
-20
-10
0
10
20
30
Vocal Tract
1. Filtering: acoustic filter which modifies the spectral distribution of energy in the glottal sound wave (formantsformants)
2. Generation of sounds A constriction at some point along the vocal tract generates a turbulence exciting a portion of the vocal tract (sound /s/ of six)
Composed by the Pharyngeal and Oral cavities
Basic functions:
Two elemental excitation: 1. Voiced ...... Vocal cords vibration 2. Unvoiced ... Constriction somewhere along the vocal tract
Combinations
3. Mixed ....... Simultaneously voiced and unvoiced
4. Plosive ...... Short region of silent followed by a region of voiced or unvoiced sound
/t/ in pat (silence + unvoiced) /b/ in boot (silence + voiced)
5. Whisper ..... Unvoiced excitation generated at the vocal cords
Speech Technologies – Speech Production
Speech Technologies – Speech Production
Formants
Bandwitdth
f1f1 f2f2 A 700 1150 E 500 1500 I 250 2300 O 400 700 U 300 900
Speech Technologies – Speech Production
Some Phonetic definitions
Phoneme: the basic theoretical unit for describing how speech conveys linguistic meaning
code that consists of a unique set of articulatory gestures represents a class of sounds that convey the same meaning American English 42 phonemes (vowels, semivowels, diphthongs, and consonants)
Allophone: slight acoustic variations of the basic unit permissible freedom in producing a phoneme
Articulatory phonetics manner of articulation: level of occlusion, nasalization place of articulation: localization of the narrowest point in the vocal tract
Coarticulation
Sound Classification Manners of Articulation
degree of occlusion: closure (i.e. at some point in the vocal tract, the airflow is completely stopped),
close approximation (involving a constriction somewhere in the vocal tract, with the air being forced
through the opening), and open approximation (sounds in which the airflow is smooth). Additional
distinctions include whether the air flows through the nose (nasal), or not (oral), whether it runs along
the centre or the sides of the tongue (central vs. lateral), as well as the way in which the closure is
made.
Trill /r/ /rr/ vowels /a/ /e/ /i/ /o/ /u/
Semivowels (approximants) /j/ (labio) /w/ (agua)
Nasals /m/ /n/ /ñ/
Speech Technologies – Speech Production
Place of articulation: the narrowing of the airstream at some point in the vocal tract
1. Bilabial /m/ /p/(sorda) /b/ (sonora) 2. Labiodental /f/ /v/ 3. Dental /t/(sorda) /d/(sonora) 4. Interdental /z/ 5. Alveolar /s/ /n/ /l/ 6. Palatal /ch/ (sorda) /ñ/ /ll/ (sonora) 7. Velar /k/ (sorda) /g/ (sonora) /j/
Sound Classification
Basic tools: Waveform
time-frequency representation
Wide Band frequency resolution ... 300 Hz temporal resolution ... a few ms (<20 ms)
Narrow Band frequency resolution ... tens of Hz temporal resolution ... > 50 ms
Speech Technologies – Speech Production
Speech Technologies – Speech Production
Speech Technologies – Speech Production
Speech Technologies – Speech Production
Speech in Time and Frequency
Vowels voiced excitation, high amplitude, duration between 50 a 400 ms Energy concentrated below 1 kHz and a decay of -6 dB/oct
0 100 200 300 400 500 -8000
-6000
-4000
-2000
0
2000
4000
6000
8000
30
40
50
60
70
80
90
100
Speech Technologies – Speech Production
Speech in Time and Frequency
Nasals Similar to the vowels but weaker in energy Open nasal cavity and closed oral cavity Vocal tract acts as an antiresonator trapping energy at certain
frequencies 750 a 1250 Hz for /m/ 1450 a 2200 Hz for /n/ > 3000 Hz for /G/
Nasal cavity
occlusions oral cavity
Nasalization: vowel precede or follow a nasal sound broader first formant bandwidth
Speech Technologies – Speech Production
Fricatives unvoiced excitation (unvoiced fricatives)
constriction causes a noise source, the location of the constriction determine the fricative sound
labiodental /f/ (fine); interdental /D/ (then) alveolar /s/ (seven); glottal /h/ (heat)
energy at the middle and high frequencies
Speech in Time and Frequency
mixed excitation (voiced fricatives) voice bar (very low-frequency formant, near 150 Hz) more energy in the low frequencies than at high frequencies
Speech Technologies – Speech Production
Speech in Time and Frequency
Stops or Plosives transients, noncontinuation sounds voiced /b,d,g/ or unvoiced sounds /p,t,k/ pressure behind a total constriction somewhere along the vocal tract, and suddenly releasing this pressure.
0 200 400 600 800 1000 1200
-2000
-1000
0
1000
2000
3000
/pe/
samples
30
40
50
60
70
frequency
l1 l2
l3 l4
S(ω) = U(ω) H(ω) R(ω)
s(t) Lips pressure u(t) Glottis volume velocity h(t) Vocal tract mpulsive response in terms of volume velocity R(ω) Acoustic radiation impedance
Speech Technologies – Speech Production
Vocal Tract Model: Sequence of tubes without lossesVocal Tract Model: Sequence of tubes without losses
Contour conditions
Uk (xk =lk ,t)= Uk+1 (xk+1 =0,t) Pk (xk =lk ,t)= Pk+1 (xk+1 =0,t)
Ak+1Ak
k+1 (t)
k+1 (t+τ)
ρ ρ τ ρ ρτ
+ + +
− − +
+ − = − −+
Z ZA A A A Z Z
ρ ++
Speech Technologies – Speech Production
Vocal Tract Model: Sequence of tubes without lossesVocal Tract Model: Sequence of tubes without losses


1kτ +
1kτ +
1 1( )k kU t τ− + ++

Contour conditions: Lips ...... Model by a tube of infinite length
∞ ∞NA 1NA +
− − + + +
+ + +
= =
= =
k k k k
+

+ − = + − −
Speech Technologies – Speech Production
Vocal Tract Model: Sequence of tubes without lossesVocal Tract Model: Sequence of tubes without losses
Contour conditions: glottis
Z = −
1(0, ) (0, ) ( ) (0, ) (0, )o G
G
cU t U t U t U t U t Z A
ρ+ − + − − = − +
ρ ρ+ −+
1 ( )U t−
2 ( )U t+
2 ( )U t−

21 ρ+
Vocal Tract Model: Sequence of tubes without lossesVocal Tract Model: Sequence of tubes without losses
( )( ) 1 2
1 1 1
τ τ
− Ω +
Speech Technologies – Speech Production
Relation with digital filters
Δ =N tubes of length L, each section of length
x c
( ) ( ) ( 2 )L k k
=
= − + − +∑
Transformed space
( 2 ) 2
L k k k k
U s e e eτ τ τα α ∞ ∞
− + − −
= =
The spectrum is periodic with period 2 2 π τ
Speech Technologies – Speech Production
Digital Digital Vocal Tract ModeVocal Tract Model: l: uniformuniform lengthlength withoutwithout losseslosses
If the input signal has a bandwidth limited to
then the vocal tract works as a digital filter with impulsive response
1 4
B τ
21 ρ+
1 2z−
1 2z−
1 2z−
z zU z U z U z
ρ ρ ρ
ρ ρ ρ
Speech Technologies – Speech Production
( ) ( )
= =
+ −= − + +
Speech Technologies – Speech Production
N G G
ρ ρ ρ =
1
ρρ ρ
Speech Technologies – Speech Production
There are only poles in the transfer function If D(z) can be derived in a recursive way
1 ( )G GZρ = = ∞
N
D z
D z D z z D z k N D z D z
ρ − − − −
2 2m m
NcF T Lτ
τ Δ = =
If c=350 m/s and L=0,175 m 1000mF N Hz=
Digital Model for Speech Production
Speech Technologies – Speech Production
J ka sinUp r t j e r ka sin
ωθ θ ρ ω
2 cf Hz
1
1
1
R z z z z zo o o( ) ,= − ≈ <−1 1 11
voclpc
synth
Speech production mechanism
Número de diapositiva 18
Número de diapositiva 19
Número de diapositiva 23
Número de diapositiva 25
Número de diapositiva 26
Speech production acoustic theory
Número de diapositiva 29
Número de diapositiva 30
Número de diapositiva 31
Número de diapositiva 32
Número de diapositiva 33
Número de diapositiva 34
Número de diapositiva 35
Número de diapositiva 36
Número de diapositiva 37
Número de diapositiva 38
Número de diapositiva 39
Número de diapositiva 40
Número de diapositiva 41
Speech Production Digital Model