Jorge Proença 1
Dirce Celorico 1
Arlindo Veiga 1,2
Sara Candeias 1
Fernando Perdigão 1,2
1 Instituto de Telecomunicações, Coimbra, Portugal
2 University of Coimbra, DEEC, Portugal
Acoustical Characterization of
Vocalic Fillers in European Portuguese
The 6th Workshop on Disfluency in Spontaneous Speech
Stockholm, Sweden August 21-23, 2013
DiSS 2013
Stockholm, Sweden - August 21-23, 2013
Scope
Objectives
Characterization of the corpus
Formant frequencies determination
Results: duration, F1 and F2, variation rates
Conclusions and future work
SUMMARY
Studying the events that characterize spontaneous speech has become
increasingly relevant as speech technologies develop.
Studies on hesitations (so-called disfluencies) as well as vowel reductions
have gained importance in recent years.
Hesitation phenomena:
repetitions, truncated words, word fillers;
vocalic extensions into words;
filled pauses.
Our previous studies indicate that filled pauses:
are the most frequent hesitation, mainly in spontaneous speech;
occur without any lexical support;
are mostly realized as relatively stable vocalic segments.
Hence the terms filled pause vocalizations or vocalic fillers (VFs).
SCOPE
Why vocalic fillers (VFs)? They:
represent an insertion at any moment during spontaneous speech;
carry multiple functions in communication:
announcing upcoming discursive topics,
planning and delaying speech,
…
Goal: to develop an automatic detector of fillers from continuous speech.
SCOPE
Studying the two most common VFs in European Portuguese:
the near-open central vowel [ɐ],
the mid-central vowel [ə].
How? By acoustically characterizing VFs.
Analyzing:
first and second formant frequencies,
duration and
variation rates.
Comparing with lexical vowels (LVs) of similar timbre.
OBJECTIVE
HESITA Database - manually annotated filled pauses
Source:
30 daily news programs (~27 hours)
Corpus preparation (diagram):
TV broadcast news MP4 podcasts -> daily download -> extract audio stream and downsample from 44.1 kHz to 16 kHz -> broadcast news audio corpus
[ɐ] and [ə] VFs:
the most common in the database and chosen for analysis, with 808 [ɐ] and 344 [ə] occurrences.
Curiosity: the next most common VF is nasal [ɐ], with 155 occurrences.
CORPUS
A control corpus was used to estimate the acoustic characteristics of the
vocalic sounds [ɐ] and [ə] occurring within complete words (the LVs),
such as:
[ɐ] in <para> [pɐrɐ] (‘for’ in English)
[ə] in <devolver> [dəvolver] (‘to give back’)
LVs were extracted from a read speech database:
recordings from 7 native adult speakers of European Portuguese,
sentences and command words,
segmentation and phone-level transcription were performed automatically
through forced alignment using in-house tools.
CORPUS
The total number of extracted LVs was 7426, of which:
4411 [ɐ],
3015 [ə].
Type   Gender   #[ɐ]   #[ə]
VFs    Male     605    301
VFs    Female   203    43
LVs    Male     2674   1771
LVs    Female   1737   1244
Number (#) of extracted VFs and LVs by gender and timbre.
CORPUS
Praat tool
The base recommended ceilings for estimating 5 formants are 5500 Hz for
female speakers and 5000 Hz for male speakers but, through observation,
these values cannot always successfully estimate F1 and F2.
Different vowels and speakers need different formant ceilings for automatic
calculation.
Iterative method:
ceilings chosen in the 4000-5500 Hz range (for males) or 4800-6500 Hz
(for females) in 50 Hz steps, with formant values taken every 10 ms;
selection of the optimal ceiling for a given VF: the one that provides the
smallest variance of the F1 and F2 pairs of values of that VF, calculated
as the sum of the variances of 20 log(F1) and 20 log(F2).
No speaker information was kept for most of the news broadcast VFs: each was
treated as if it belonged to a different speaker.
FORMANT FREQUENCY DETERMINATION
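The ceiling selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the formant tracks are assumed to have been estimated already (for example with Praat) once per candidate ceiling, the logarithm is assumed to be base 10, the population variance is assumed, and the names `candidate_ceilings`, `spread_db`, `best_ceiling` and the sample values are hypothetical.

```python
import math
from statistics import pvariance


def candidate_ceilings(low_hz, high_hz, step_hz=50):
    """Candidate formant ceilings, e.g. 4000-5500 Hz in 50 Hz steps."""
    return list(range(low_hz, high_hz + 1, step_hz))


def spread_db(track):
    """Sum of the variances of 20*log10(F1) and 20*log10(F2) over all frames.

    `track` is a list of (F1, F2) pairs in Hz, one pair per analysis frame.
    Base-10 logarithm and population variance are assumptions.
    """
    f1_db = [20 * math.log10(f1) for f1, _ in track]
    f2_db = [20 * math.log10(f2) for _, f2 in track]
    return pvariance(f1_db) + pvariance(f2_db)


def best_ceiling(tracks_by_ceiling):
    """Return the ceiling whose (F1, F2) track has the smallest spread."""
    return min(tracks_by_ceiling, key=lambda c: spread_db(tracks_by_ceiling[c]))


# Hypothetical tracks for one VF, estimated with two different ceilings.
tracks = {
    5000: [(510.0, 1480.0), (620.0, 1300.0), (480.0, 1610.0)],
    5050: [(505.0, 1490.0), (512.0, 1478.0), (508.0, 1485.0)],
}
chosen = best_ceiling(tracks)
```

In this made-up example the 5050 Hz ceiling is selected because its F1 and F2 values vary least across frames. The male range given on the slide corresponds to `candidate_ceilings(4000, 5500)`, and the female range to `candidate_ceilings(4800, 6500)`.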
Utterances with high clipping of the audio signal were discarded.
For each utterance, only the formant values where the energy level was
above 10% of the maximum energy were considered, specifically to
discard possible unvoiced boundary segments.
Utterances with highly variant formant values, probably indicating a failure
in detecting F1 and F2, were not considered.
The same analysis was conducted for the LVs [ɐ] and [ə], with the additional
restriction of only considering segments longer than 50 ms.
OTHER APPLIED RESTRICTIONS
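Two of the restrictions above lend themselves to a short sketch. It assumes that energy is a linear per-frame value and that each frame is a hypothetical `(energy, f1_hz, f2_hz)` tuple; the function names are illustrative and are not taken from the authors' tools.

```python
def keep_energetic_frames(frames, ratio=0.10):
    """Keep frames whose energy exceeds `ratio` times the utterance maximum.

    Each frame is a hypothetical (energy, f1_hz, f2_hz) tuple.
    """
    peak = max(energy for energy, _, _ in frames)
    return [frame for frame in frames if frame[0] > ratio * peak]


def long_enough(duration_s, min_duration_s=0.050):
    """Extra restriction applied to LVs: segment longer than 50 ms."""
    return duration_s > min_duration_s


# Made-up utterance: the first and last frames are weak boundary frames.
frames = [
    (0.05, 300.0, 1200.0),
    (1.00, 500.0, 1500.0),
    (0.50, 510.0, 1490.0),
    (0.09, 350.0, 1250.0),
]
voiced = keep_energetic_frames(frames)
```

Here only the two central frames survive, since the boundary frames fall below 10% of the peak energy.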
After applying the restrictions:
VFs: 520 [ɐ] and 244 [ə]
LVs: 1517 [ɐ] and 385 [ə]
A very large number of LVs were cut from the analysis:
mostly the short-duration or low-energy segments, barely recognized
during alignment; this occurred more drastically for [ə], as a
consequence of the nature of continuous speech.
RESULTS
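To make the size of the cut concrete, the share of segments kept can be computed from the counts reported before and after the restrictions. The counts come from the slides; the variable names are only illustrative.

```python
# Counts before and after the restrictions, as reported on the slides.
before = {("VF", "ɐ"): 808, ("VF", "ə"): 344, ("LV", "ɐ"): 4411, ("LV", "ə"): 3015}
after = {("VF", "ɐ"): 520, ("VF", "ə"): 244, ("LV", "ɐ"): 1517, ("LV", "ə"): 385}

# Fraction of segments kept for each type and vowel.
retention = {key: after[key] / before[key] for key in before}

for (kind, vowel), share in retention.items():
    print(f"{kind} [{vowel}]: {share:.1%} kept")
```

Roughly 64% of the [ɐ] VFs and 71% of the [ə] VFs are kept, against about 34% of the [ɐ] LVs and 13% of the [ə] LVs.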
[Figure: histogram with x-axis "duration (s)" from 0 to 1, showing the LV and VF series.]
Normalized histogram of the duration of VFs and LVs
DURATION
As expected, VFs are longer than LVs.
[Figure: F1 (Hz) versus F2 (Hz) plots for males (M, left) and females (F, right), showing VF-ə, VF-ɐ, LV-ə and LV-ɐ together with the reference vowels i, E, a, O and u.]
Means and 2-sigma concentration ellipses of [ə] (blue) and [ɐ] (green) for VFs and LVs of males (left) and females (right).
The vowel ‘triangle’ of [i], [ɛ], [a], [ɔ] and [u] from the read speech corpus (calculated in a similar fashion to
the method described) was included to show the centrality of [ɐ] and [ə].
RESULTS
F1 and F2 of [ɐ] and [ə]
On average, F1 is higher and F2 is lower for VFs than for LVs.
The distributions overlap.
LVs show the highest variances, owing to:
- a high dependence on phonetic context and
- the related coarticulation phenomena.
[ɐ] and [ə] VFs are hard to distinguish (both lie around the midpoint).
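The means and 2-sigma ellipses shown in the F1/F2 plots can be derived from the (F1, F2) values of a group of segments. The sketch below is one common way to do this, not necessarily the one used for the slides: it assumes the sample covariance (n - 1 denominator) and takes the ellipse semi-axes as twice the square roots of the covariance eigenvalues. The function name `two_sigma_ellipse` and the sample points are hypothetical.

```python
import math


def two_sigma_ellipse(points):
    """Mean, 2-sigma semi-axes and orientation for a list of (f1_hz, f2_hz)."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] (n - 1 in the denominator).
    a = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam_major = half_trace + radius
    lam_minor = max(half_trace - radius, 0.0)  # guard against rounding below 0
    # Orientation of the major axis, measured from the F1 axis (radians).
    angle = 0.5 * math.atan2(2 * b, a - c)
    semi_axes = (2 * math.sqrt(lam_major), 2 * math.sqrt(lam_minor))
    return (m1, m2), semi_axes, angle


# Made-up, perfectly correlated values: the ellipse collapses to a line.
mean, semi_axes, angle = two_sigma_ellipse(
    [(400.0, 1400.0), (500.0, 1500.0), (600.0, 1600.0)]
)
```

For these made-up points the mean is (500, 1500) Hz, the minor semi-axis is zero and the major axis lies at 45 degrees to the F1 axis.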
RESULTS
VARIATION RATES
A linear fit was applied to each utterance (although the change can be non-linear).
High variability.
On average, small negative changes, but no trend can be discerned.
LVs are less stable.
No correlation between the simultaneous variation of F1 and F2.
[Figure: two histograms with x-axes "F1 variation rate (Hz/s)" and "F2 variation rate (Hz/s)", both from -6000 to 6000, showing the VF and LV series.]
Normalized histogram of the variation rates of F1 (left) and F2 (right) from a linear fit to each
utterance, for VFs and LVs.
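A variation rate of this kind can be obtained as the slope of a least-squares line fitted to a formant track. The sketch below assumes one formant value every 10 ms; the function name `variation_rate` and the sample track are hypothetical.

```python
def variation_rate(track_hz, frame_step_s=0.010):
    """Least-squares slope (Hz/s) of a formant track sampled every frame_step_s.

    Needs at least two frames.
    """
    n = len(track_hz)
    times = [i * frame_step_s for i in range(n)]
    t_mean = sum(times) / n
    f_mean = sum(track_hz) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (f - f_mean) for t, f in zip(times, track_hz))
    return sxy / sxx


# Made-up F1 track falling by 2 Hz per 10 ms frame.
rate = variation_rate([500.0, 498.0, 496.0, 494.0])
```

The made-up track falls by 2 Hz per frame, which corresponds to a rate of about -200 Hz/s.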
Each speaker may have a personal preference for how to fill a
pause vocalically.
Choosing mainly sounds from the central vowel system, speakers appear
to adapt the filler to their own specific production, possibly even
at a midpoint between [ɐ] and [ə].
The main characteristic of VFs is long, stable segments.
Future work:
New data from sustained vowel productions (including [ɐ] and [ə]) to better
distinguish the fillers.
A perceptual study to confirm that some vocalic fillers can be understood
differently with and without context, or by different listeners.
Based on the knowledge gained from this study, development of an automatic
detector of fillers and extensions in continuous speech.
Thank You
CONCLUSIONS AND FUTURE WORK