Di ss2013 fillerspt_presentation_final

Jorge Proença 1

Dirce Celorico 1

Arlindo Veiga 1,2

Sara Candeias 1

Fernando Perdigão 1,2

1 Instituto de Telecomunicações, Coimbra, Portugal

2 University of Coimbra, DEEC, Portugal

Acoustical Characterization of

Vocalic Fillers in European Portuguese

The 6th Workshop on Disfluency in Spontaneous Speech

Stockholm, Sweden August 21-23, 2013


DiSS 2013

Stockholm, Sweden - August 21-23, 2013

Scope

Objectives

Characterization of the corpus

Formant frequencies determination

Results:

Duration

F1 and F2

Variation rates

Conclusions and future work

SUMMARY


Studying the events that characterize spontaneous speech has become increasingly relevant as speech technologies develop.

Studies on hesitations (so-called disfluencies), as well as on vowel reductions, have gained importance in recent years.

Hesitation phenomena:

repetitions, truncated words, word fillers;

vocalic extensions into words;

filled pauses.

SCOPE


According to our previous studies, filled pauses are:

the most frequent hesitation phenomenon, mainly in spontaneous speech;

produced without any lexical support;

mostly realized as relatively stable vocalic segments.

They are referred to as filled pause vocalizations or vocalic fillers (VFs).



Vocalic fillers (VFs):

represent an insertion at any moment during spontaneous speech;

carry multiple functions in communication:

announcing upcoming discursive topics,

planning and delaying speech,

…

Why study them? To develop an automatic detector of fillers from continuous speech.



Studying the two most common VFs in European Portuguese:

the near-open central vowel [ɐ],

the mid-central vowel [ə].

How? By acoustically characterizing the VFs.

Analyzing:

first and second formant frequencies,

duration and

variation rates.

Comparing with lexical vowels (LVs) of similar timbre.

OBJECTIVE



HESITA Database - manually annotated filled pauses

CORPUS



[Diagram: TV broadcast news MP4 podcasts -> daily download -> extract audio stream and downsample from 44.1 kHz to 16 kHz -> broadcast news audio corpus]

Source:

30 daily news programs (~27 hours)
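The downsampling step in the pipeline above can be sketched as follows. This is only an illustration, assuming SciPy's polyphase resampler as a stand-in: the slides do not say which tool performed the conversion, and the function name `downsample_44k1_to_16k` and the test tone are hypothetical.

```python
import numpy as np
from scipy.signal import resample_poly

SRC_RATE = 44_100  # Hz, rate of the podcast audio stream
DST_RATE = 16_000  # Hz, rate of the broadcast news audio corpus


def downsample_44k1_to_16k(samples: np.ndarray) -> np.ndarray:
    """Resample a mono signal from 44.1 kHz to 16 kHz.

    16000 / 44100 reduces to 160 / 441, so the signal is upsampled by 160
    and decimated by 441; resample_poly applies the anti-aliasing filter.
    """
    return resample_poly(samples, up=160, down=441)


# Hypothetical input: one second of a 440 Hz tone at the source rate.
t = np.arange(SRC_RATE) / SRC_RATE
tone = np.sin(2 * np.pi * 440 * t)
resampled = downsample_44k1_to_16k(tone)
```

One second of input at 44.1 kHz yields one second of output at 16 kHz, so the duration annotations of the filled pauses stay valid after conversion.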



[ɐ] and [ə] VFs:

the most common in the database and chosen for analysis,

with 808 [ɐ] and 344 [ə] occurrences.


Curiosity: the next most common VF is the nasal [ɐ], with 155 occurrences.

CORPUS



A control corpus was used to estimate the acoustic characteristics of the vocalic sounds [ɐ] and [ə] occurring within complete words (the LVs), such as:

[ɐ] in <para> [pɐrɐ] (‘for’ in English)

[ə] in <devolver> [dəvolver] (‘to give back’)




LVs were extracted from a read speech database:

recordings from 7 adult native speakers of European Portuguese,

sentences and command words,

segmentation and phone-level transcription performed automatically through forced alignment using in-house tools.


The total number of extracted LVs was 7426, of which:

4411 [ɐ],

3015 [ə].

Type  Gender  #[ɐ]  #[ə]
VFs   Male    605   301
VFs   Female  203   43
LVs   Male    2674  1771
LVs   Female  1737  1244

Number (#) of extracted VFs and LVs by gender and timbre.

CORPUS



Formants were estimated with the Praat tool.

The base recommended ceilings for estimating 5 formants are 5500 Hz for female speakers and 5000 Hz for male speakers but, through observation, these values do not always yield successful estimates of F1 and F2.


Different vowels and speakers need different formant ceilings for automatic calculation.

Iterative method:

ceilings tried in the 4000-5500 Hz range (for males) or the 4800-6500 Hz range (for females) in 50 Hz steps, with formant values taken every 10 ms;

selection of the optimal ceiling for a given VF: the one that provides the smallest variance of the F1 and F2 pairs of values of that VF, calculated as the sum of the variances of 20 log(F1) and 20 log(F2).

No speaker information was kept for most of the news broadcast VFs: each was treated as if it belonged to a different speaker.

FORMANT FREQUENCY DETERMINATION
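The ceiling selection described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the formant tracker itself (Praat in the slides) is not shown, the `tracks` dictionary and all function names are hypothetical, the toy values are invented, and the logarithm is assumed to be base 10 (the base only rescales the cost, so it does not change which ceiling wins).

```python
import numpy as np


def ceiling_grid(gender: str, step_hz: int = 50) -> np.ndarray:
    """Candidate ceilings: 4000-5500 Hz (males) or 4800-6500 Hz (females)."""
    low, high = (4000, 5500) if gender == "male" else (4800, 6500)
    return np.arange(low, high + step_hz, step_hz)


def track_cost(f1_hz, f2_hz) -> float:
    """Sum of the variances of 20*log10(F1) and 20*log10(F2)."""
    f1_db = 20 * np.log10(np.asarray(f1_hz, dtype=float))
    f2_db = 20 * np.log10(np.asarray(f2_hz, dtype=float))
    return float(np.var(f1_db) + np.var(f2_db))


def best_ceiling(tracks: dict) -> int:
    """Return the ceiling whose (F1, F2) track has the smallest cost.

    `tracks` maps a ceiling in Hz to a pair (f1_hz, f2_hz) of sequences,
    one value per 10 ms frame, as produced by an external formant tracker.
    """
    return min(tracks, key=lambda ceiling: track_cost(*tracks[ceiling]))


# Invented tracks for three candidate ceilings of one vocalic filler.
tracks = {
    5000: ([500.0, 650.0, 480.0], [1500.0, 1900.0, 1400.0]),
    5050: ([510.0, 515.0, 505.0], [1500.0, 1510.0, 1495.0]),
    5100: ([520.0, 400.0, 610.0], [1500.0, 1200.0, 1700.0]),
}
chosen = best_ceiling(tracks)
```

Working on a logarithmic scale makes the cost depend on relative rather than absolute formant changes, so F2 does not dominate the sum simply because its values are larger.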



Utterances with heavy clipping of the audio signal were discarded.

For each utterance, only the formant values where the energy level was above 10% of the maximum energy were considered, specifically to discard possible unvoiced boundary segments.

Utterances with highly variable formant values, probably indicating a failure in detecting F1 and F2, were not considered.

The same analysis was conducted for the LVs [ɐ] and [ə], with the additional restriction of only considering segments longer than 50 ms.

OTHER APPLIED RESTRICTIONS
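The energy and duration restrictions above can be sketched as follows. This is a hedged illustration: it assumes a linear (not dB) per-frame energy value aligned with the formant frames, and the function names and sample values are hypothetical.

```python
import numpy as np


def keep_high_energy_frames(f1_hz, f2_hz, energy, ratio=0.10):
    """Keep formant values only where energy exceeds ratio * max(energy)."""
    energy = np.asarray(energy, dtype=float)
    mask = energy > ratio * energy.max()
    return np.asarray(f1_hz)[mask], np.asarray(f2_hz)[mask]


def long_enough(duration_s, min_duration_s=0.050):
    """Extra restriction for LVs: the segment must be longer than 50 ms."""
    return duration_s > min_duration_s


# Invented frames: the first and last are weak boundary frames.
energy = [0.05, 0.50, 1.00, 0.80, 0.09]
f1 = [300.0, 520.0, 510.0, 515.0, 700.0]
f2 = [1900.0, 1480.0, 1500.0, 1490.0, 1100.0]
f1_kept, f2_kept = keep_high_energy_frames(f1, f2, energy)
```

In this toy case the two boundary frames, whose formant values are far from the stable middle section, are the ones dropped.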



After applying the restrictions:

VFs: 520 [ɐ] and 244 [ə]

LVs: 1517 [ɐ] and 385 [ə]

A very large number of LVs were cut from the analysis:

mostly short-duration or low-energy segments, barely recognized during alignment; the cut was most drastic for [ə], as a consequence of the nature of continuous speech.

RESULTS



DURATION

[Figure: normalized histogram of the duration of VFs and LVs; x-axis: duration (s), from 0 to 1; series: LV and VF.]

As expected, VFs are longer.
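A normalized histogram of this kind can be computed as sketched below. The sketch assumes that "normalized" means unit area (a density), which would explain vertical-axis values above 1; the bin count, the function name, and the duration values are made up and do not come from the corpus.

```python
import numpy as np


def normalized_histogram(durations_s, bins=20, value_range=(0.0, 1.0)):
    """Histogram scaled so that the bar areas sum to 1."""
    density, edges = np.histogram(
        durations_s, bins=bins, range=value_range, density=True
    )
    return density, edges


# Invented durations in seconds, for illustration only.
durations = np.array([0.08, 0.10, 0.12, 0.30, 0.45, 0.50])
density, edges = normalized_histogram(durations)
area = float(np.sum(density * np.diff(edges)))
```

Because the bins are narrower than 1 s, individual density values can exceed 1 while the total area remains 1, which makes VF and LV sets of different sizes directly comparable.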



F1 and F2 of [ɐ] and [ə]: means and 2-sigma concentration ellipses

[Figure: F1 (Hz) versus F2 (Hz) for males (M, left) and females (F, right), showing [ə] (blue) and [ɐ] (green) for VFs and LVs; legend: VF-ə, VF-ɐ, LV-ə, LV-ɐ; reference vowels i, E, a, O, u.]

The ‘triangle’ of [i], [ɛ], [a], [ɔ] and [u] from the read speech corpus (calculated in a similar fashion to the method described) was included to show the centrality of [ɐ] and [ə].

On average, F1 is higher and F2 is lower for VFs than for LVs.

The distributions overlap.
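A 2-sigma concentration ellipse for a set of (F1, F2) values can be derived from the sample mean and covariance, as sketched below. The slides do not state how the ellipses were computed, so this is one common construction under that assumption; the function name and the four sample points are hypothetical.

```python
import numpy as np


def two_sigma_ellipse(f1_hz, f2_hz):
    """Return (center, semi_axes, angle_deg) of the 2-sigma ellipse.

    Points are taken as (F2, F1) pairs. The semi-axes are 2 * sqrt(eigenvalue)
    of the sample covariance matrix, ordered major then minor, and the angle
    is that of the major axis relative to the F2 axis.
    """
    points = np.column_stack([f2_hz, f1_hz]).astype(float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    semi_axes = 2.0 * np.sqrt(eigvals)
    major = eigvecs[:, 0]
    angle_deg = float(np.degrees(np.arctan2(major[1], major[0])))
    return center, semi_axes, angle_deg


# Invented measurements in Hz, for illustration only.
f1 = [500.0, 500.0, 450.0, 550.0]
f2 = [1400.0, 1600.0, 1500.0, 1500.0]
center, semi_axes, angle_deg = two_sigma_ellipse(f1, f2)
```

Two ellipses that overlap heavily, as reported for the VF and LV distributions, indicate that F1 and F2 alone would not separate the two classes reliably.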


LVs show the highest variances, owing to:

- high dependence on phonetic context and

- related coarticulation phenomena.

[ɐ] and [ə] VFs are hard to distinguish (both lie at the middle point).


VARIATION RATES

A linear fit was applied to each utterance (although the change can be non-linear).

High variability.

On average, small negative changes, but no trend can be discerned.

LVs are less stable.

No correlation between simultaneous F1 and F2 variation.

[Figure: normalized histograms of the variation rates of F1 (left) and F2 (right), in Hz/s, from a linear fit to each utterance, for VFs and LVs; x-axis from -6000 to 6000 Hz/s.]
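The variation rate of one formant track can be obtained as the slope of a least-squares line, as sketched below. The 10 ms frame spacing is an assumption carried over from the formant estimation step, and the function name and track values are hypothetical.

```python
import numpy as np

FRAME_STEP_S = 0.010  # assumed 10 ms spacing between formant values


def variation_rate(formant_hz, frame_step_s=FRAME_STEP_S):
    """Slope in Hz/s of a least-squares line fitted to one formant track."""
    formant_hz = np.asarray(formant_hz, dtype=float)
    times_s = np.arange(formant_hz.size) * frame_step_s
    slope, _intercept = np.polyfit(times_s, formant_hz, deg=1)
    return float(slope)


# Invented F1 track falling by 2 Hz per frame, i.e. 200 Hz per second.
f1_track = [600.0, 598.0, 596.0, 594.0, 592.0]
f1_rate = variation_rate(f1_track)
```

A single slope per utterance is a coarse summary: a track that rises and then falls by the same amount would get a rate near zero, which is the limitation the slide notes when it says the change can be non-linear.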



Conclusions:

Each speaker could have a personal preference on how to fill a pause vocalically.

Choosing mainly sounds from the central vowel system, speakers appear to adapt the filler to their own specific production, possibly even at a middle point between [ɐ] and [ə].

The main characteristic of VFs is long, stable segments.

Future work:

New data of sustained vowel productions (including [ɐ] and [ə]) to better distinguish the fillers.

A perceptual study to confirm that some vocalic fillers can be understood differently with and without context, or by different listeners.

Based on the knowledge attained from this study, to develop an automatic detector of fillers and extensions from continuous speech.

CONCLUSIONS AND FUTURE WORK

Thank You
