On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices

Yasunori Ohishi 1, Masataka Goto 3, Katunobu Itou 2, Kazuya Takeda 1

1 Graduate School of Information Science, Nagoya University, Japan
2 Faculty of Computer and Information Sciences, Hosei University, Japan
3 National Institute of Advanced Industrial Science and Technology

Page 2

Let’s do the quiz: can you discriminate between singing and speaking voices? (Japanese voices)

Q.1. Can you do it? (2 s long)
Q.2. Can you do it? (500 ms long)
Q.3. Can you do it? (200 ms long)

Page 3

Investigation of signal length necessary for discrimination

[Figure: correct rate [%] (60 to 100) versus signal length [ms] (200, 500, 1000, 1500, 2000); curves: singing performance, speaking performance, total performance; annotated points: 1-s, 500-ms, and 200-ms voice signals]

Page 4

Investigation of signal length necessary for discrimination

[Figure: the correct rate versus signal length plot from the previous page, repeated]

Not only temporal characteristics but also short-term features carry discriminative cues.
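The three curves in the plot can be read as per-class and overall correct rates. The following is a minimal sketch of that arithmetic, assuming each listener response is stored as a (true label, answered label) pair and that "total performance" means the correct rate over all stimuli. The function name `correct_rates` and the sample responses are hypothetical and are not data from the study.

```python
def correct_rates(responses):
    """Return (singing, speaking, total) correct rates in percent.

    `responses` is a list of (true_label, answered_label) pairs,
    where each label is "singing" or "speaking" (assumed encoding).
    """
    def rate(pairs):
        if not pairs:
            return float("nan")
        hits = sum(1 for truth, answer in pairs if truth == answer)
        return 100.0 * hits / len(pairs)

    singing = [p for p in responses if p[0] == "singing"]
    speaking = [p for p in responses if p[0] == "speaking"]
    return rate(singing), rate(speaking), rate(responses)


# Made-up responses for illustration only.
demo = [
    ("singing", "singing"),
    ("singing", "speaking"),
    ("speaking", "speaking"),
    ("speaking", "speaking"),
]
print(correct_rates(demo))
```

Running the same function once per signal length would give one point on each of the three curves.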

Page 5

The goal of this study

Subjective experiments
  • Investigation of acoustic cues necessary for discrimination between singing and speaking voices

Automatic vocal style discriminator (based on knowledge obtained by subjective experiments)
  • Spectral feature measure
  • F0 derivative measure

Page 6

Introduction of the voice database

AIST humming database
  • 75 Japanese subjects (37 males, 38 females)
  • Each subject sang the chorus and verse A sections at an arbitrary tempo, without musical accompaniment (25 Japanese songs selected from “RWC Music Database: Popular Music”)
  • Each subject read the lyrics of the chorus and verse A sections
  • Most of the subjects had no special musical training

Page 7

The goal of this study

Subjective experiments
  • Investigation of acoustic cues necessary for discrimination between singing and speaking voices

Automatic vocal style discriminator (based on knowledge obtained by subjective experiments)
  • Short-term spectral feature measure
  • F0 derivative measure

Page 8

Investigation of acoustic cues necessary for discrimination

To compare the importance of temporal and spectral cues for discrimination, voice quality and prosody are modified by using signal processing techniques.

Random splicing technique
  • The signal is cut into pieces, and the pieces are concatenated in random order
  • The temporal structure of the signal is modified; short-time spectral features are maintained

[Figure: amplitude waveform; labels: 1 s, 250 ms]

Let’s do the quiz: Q.1 (250 ms), Q.2 (200 ms), Q.3 (125 ms)
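Random splicing can be sketched in a few lines. This is a minimal illustration that assumes the signal is a plain list of samples and that a trailing piece shorter than the piece length is kept. The function name `random_splice`, the 16 kHz sample rate, and the dummy signal are assumptions of this sketch. The slides do not say whether the original stimuli used cross-fades at the joins, so none are applied here.

```python
import random


def random_splice(samples, sample_rate, piece_ms, seed=None):
    """Cut `samples` into pieces of `piece_ms` milliseconds and
    concatenate them in random order.

    A trailing piece shorter than `piece_ms` is kept and shuffled
    with the rest (an assumption of this sketch).
    """
    piece_len = int(sample_rate * piece_ms / 1000)
    if piece_len <= 0:
        raise ValueError("piece_ms is too short for this sample rate")
    pieces = [samples[i:i + piece_len]
              for i in range(0, len(samples), piece_len)]
    rng = random.Random(seed)
    rng.shuffle(pieces)
    spliced = []
    for piece in pieces:
        spliced.extend(piece)
    return spliced


# A 1-s dummy signal at 16 kHz, spliced with 250-ms pieces.
signal = list(range(16000))
spliced = random_splice(signal, 16000, 250, seed=0)
```

Within each piece the samples are untouched, which is why short-time spectral features survive while rhythm and melody patterns longer than one piece do not.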

Page 9

Investigation of acoustic cues necessary for discrimination

To compare the importance of temporal and spectral cues for discrimination, voice quality and prosody are modified by using signal processing techniques.

Low-pass filtering technique
  • Frequency components higher than 800 Hz are eliminated
  • The temporal structure of the signal is maintained; short-time spectral features are modified

[Figure: two frequency-axis panels; labels: 1 s, 1 s]

Let’s do the quiz: Q.1, Q.2, Q.3
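The slides give only the 800 Hz cutoff, not the filter design. As one possible realization, the sketch below uses a Hamming-windowed sinc FIR filter written with the standard library only. The tap count, the 16 kHz sample rate, the helper names `lowpass_fir`, `apply_fir`, and `rms`, and the two test tones are all assumptions of this sketch, not the authors' settings.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Design a Hamming-windowed sinc low-pass filter (odd `num_taps`)."""
    fc = cutoff_hz / sample_rate          # normalized cutoff (cycles/sample)
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]       # unity gain at 0 Hz


def apply_fir(samples, taps):
    """Filter `samples` by direct convolution (same length, causal)."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:
                acc += t * samples[i - k]
        out.append(acc)
    return out


def rms(values):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))


fs = 16000
taps = lowpass_fir(800, fs)
low = [math.sin(2 * math.pi * 200 * n / fs) for n in range(2000)]
high = [math.sin(2 * math.pi * 3000 * n / fs) for n in range(2000)]
low_out = apply_fir(low, taps)[200:]     # skip the filter start-up transient
high_out = apply_fir(high, taps)[200:]
```

A 200 Hz tone passes almost unchanged while a 3000 Hz tone is strongly attenuated, so the slow amplitude and pitch movement is kept and most of the spectral envelope detail above the cutoff is removed.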

Page 10

Investigation of acoustic cues necessary for discrimination

[Figure: bar charts of singing voice correct rate [%] and speaking voice correct rate [%] for each stimulus type]

Bar labels:
  • Original voice: 100%
  • Original voice: 99.3%
  • Low-pass filtering: 98.9%
  • Low-pass filtering: 86.9%
  • Random splicing (250 ms): 94.9%
  • Random splicing (200 ms): 90.0%
  • Random splicing (250 ms): 84.3%
  • Random splicing (200 ms): 76.9%
  • Random splicing (125 ms): 95.0%
  • Random splicing (125 ms): 70.6%

Page 11

Discussion

Correct rate of singing voices declined

Random splicing technique
  • The temporal structure of the original voices (rhythm and melody pattern) has been modified
  • Prolonged vowels of singing voices have been divided into small pieces

Low-pass filtering technique
  • Frequency components higher than 800 Hz have been eliminated

[Figure: labeled voice segments before and after processing; surviving labels: "i r i ch", "b i a ch"]

Important acoustic cues for discrimination ??

Page 12

The goal of this study

Subjective experiments
  • Short-term spectral feature
  • Temporal structure

Automatic vocal style discriminator (based on knowledge obtained by subjective experiments)
  • Spectral feature measure
  • F0 derivative measure

Importance !

Page 13

Automatic discrimination measure

Spectral feature measure: difference in spectral envelopes and vowel durations
  • Mel-Frequency Cepstrum Coefficients (MFCC)
  • DMFCC (5-frame regression)

F0 derivative measure: difference in dynamics of prosody
  • F0 extraction (PreFEst, Goto 1999)
  • DF0 (5-frame regression)

[Figure: spectral envelopes (amplitude versus frequency) and F0 contours (F0 versus time) for a singing voice and a speaking voice]
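The "5-frame regression" used for DMFCC and DF0 can be illustrated with the common delta-coefficient formula, the least-squares slope over a window of two frames on each side. The sketch below assumes an F0 track with one value per 10-ms frame, already voiced everywhere, and converts it to cents before taking the slope. The reference frequency, the edge handling, and the names `hz_to_cent` and `delta` are assumptions of this sketch, and the F0 values are synthetic.

```python
import math


def hz_to_cent(f0_hz, ref_hz=440.0):
    """Convert F0 in Hz to cents relative to `ref_hz` (assumed reference)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def delta(track, half_width=2):
    """Regression slope over 2*half_width+1 frames (5 frames by default).

    Edge frames are replicated so that the output has the same length
    as the input (an assumption of this sketch).
    """
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        num = 0.0
        for k in range(1, half_width + 1):
            ahead = track[min(t + k, last)]
            behind = track[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


# F0 rising by 10 cents per 10-ms frame (synthetic, for illustration).
f0_hz = [220.0 * 2 ** (10 * t / 1200) for t in range(8)]
f0_cent = [hz_to_cent(f) for f in f0_hz]
df0 = delta(f0_cent)   # interior frames come out close to 10 cent/10ms
```

The same `delta` function applied to each MFCC dimension separately would give a DMFCC vector per frame. The choice of reference frequency does not affect the slope, only the absolute cent values.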

Page 14

Training the discriminative model

Gaussian mixture models (16-mixture GMM)

e.g. Discrimination using DF0
  • Input signal
  • F0 extraction and DF0 calculation
  • Likelihood comparison between the singing voice GMM and the speaking voice GMM for the DF0 of each frame
  • Decision: singing or speaking

[Figure: relative frequency of DF0 [cent/10ms] (-100 to 100) for the singing voice GMM and the speaking voice GMM]
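The likelihood comparison can be sketched for the one-dimensional DF0 case. The code below assumes that frame log-likelihoods are summed over the input signal and that the model with the larger total wins; the slides do not spell out the decision rule. The two small models are hypothetical stand-ins with made-up parameters, not the trained 16-mixture GMMs, and the names `gmm_log_likelihood` and `discriminate` are illustrative.

```python
import math


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of scalar `x` under a 1-D GMM.

    `gmm` is a list of (weight, mean, variance) tuples.
    """
    terms = []
    for weight, mean, var in gmm:
        log_gauss = -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        terms.append(math.log(weight) + log_gauss)
    peak = max(terms)                      # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def discriminate(df0_frames, singing_gmm, speaking_gmm):
    """Sum frame log-likelihoods under each model and pick the larger."""
    ll_sing = sum(gmm_log_likelihood(x, singing_gmm) for x in df0_frames)
    ll_speak = sum(gmm_log_likelihood(x, speaking_gmm) for x in df0_frames)
    return "singing" if ll_sing > ll_speak else "speaking"


# Hypothetical 2-component models with made-up parameters
# (the slides use 16 mixtures trained on the voice database).
SINGING_GMM = [(0.8, 0.0, 25.0), (0.2, 0.0, 900.0)]
SPEAKING_GMM = [(0.5, -20.0, 900.0), (0.5, 20.0, 900.0)]

print(discriminate([0.5, -1.0, 2.0, 0.0, -0.5], SINGING_GMM, SPEAKING_GMM))
print(discriminate([35.0, -40.0, 28.0, -33.0], SINGING_GMM, SPEAKING_GMM))
```

With these illustrative parameters the first call prints "singing" and the second prints "speaking". Longer inputs contribute more frames to each sum, which is one way to read the dependence of the correct rate on input signal length.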

Page 15

Automatic discrimination results

[Figure: correct rate [%] (50 to 100) versus input signal length [ms] (0 to 2000); curves: total performance of MFCC+DMFCC, total performance of DF0, total performance of MFCC+DMFCC+DF0; annotations: 87.6%, human performance]

Page 16

Summary and future work

Investigation of signal length necessary
  • Not only temporal characteristics but also short-time spectral features can be cues for the discrimination

Investigation of acoustic cues necessary
  • The relative importance of the temporal structure is found for singing and speaking voice discrimination

Automatic vocal style discriminator
  • Feature vector (MFCC+DMFCC+DF0)
  • For 2-s signals, the correct rate is 87.6%

Plan to propose new measures to improve the automatic discrimination performance