
Vocal Conversion from Speaking voice to Singing voice Using STRAIGHT



Page 1: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Takeshi SAITOU 1, Masataka GOTO 1, Masashi UNOKI 2 and Masato AKAGI 2

1 National Institute of Advanced Industrial Science and Technology (AIST) 2 Japan Advanced Institute of Science and Technology (JAIST)

Page 2: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Our research approach focuses on …

not text-to-singing (lyric-to-singing) synthesis,

but speech-to-singing synthesis (vocal conversion).

[Figure: speech → singing ♪]

⇒ Clarifying acoustic differences between singing and speaking.

⇒ Developing novel applications for computer music production.

Page 3: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

The vocal conversion system

- is based on the speech manipulation system STRAIGHT (Kawahara et al., 1998), and

- comprises three types of model:
    F0 control model
    Duration control model
    Spectral control model

Page 4: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Speaking voice: reading the lyrics of a song.

Musical score

Synchronization information

[Figure: sequence of consonant (c) and vowel (v) segment labels]

Page 5: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Musical score

Musical notes (melody contour)

F0 control model: adds four types of F0 fluctuation to the musical notes.

F0 contour of singing voice

Overshoot: deflection exceeding the target note after a note change.

Vibrato: quasi-periodic frequency modulation at 4-7 Hz.

Preparation: deflection in the direction opposite to the note change, observed just before the note change.

Fine fluctuation: irregular fluctuations higher than 10 Hz throughout the contour.
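A minimal sketch of how two of these fluctuations (overshoot and vibrato) could be added to a stepwise melody contour. It assumes overshoot is produced by passing the note sequence through an under-damped second-order system with unit DC gain; the function names, the natural frequency `omega`, the damping `zeta`, the vibrato depth, and the reference pitch are all illustrative assumptions, not values from the slides. Preparation and fine fluctuation are left out.

```python
import numpy as np

def step_melody(notes_semitones, durations_s, fs):
    """Piecewise-constant melody contour (in semitones) sampled at fs Hz."""
    return np.concatenate([
        np.full(int(round(d * fs)), float(n))
        for n, d in zip(notes_semitones, durations_s)
    ])

def add_overshoot(contour, fs, omega=35.0, zeta=0.5):
    """Filter with y'' + 2*zeta*omega*y' + omega^2*y = omega^2*u.

    With 0 < zeta < 1 the output overshoots the target after each note
    change. omega and zeta are hypothetical values chosen for illustration.
    """
    dt = 1.0 / fs
    out = np.empty(len(contour))
    pos, vel = float(contour[0]), 0.0
    for i, target in enumerate(contour):
        acc = omega ** 2 * (target - pos) - 2.0 * zeta * omega * vel
        vel += acc * dt          # semi-implicit Euler integration
        pos += vel * dt
        out[i] = pos
    return out

def add_vibrato(contour, fs, rate_hz=5.5, depth_semitones=0.3):
    """Add quasi-periodic modulation; 5.5 Hz lies in the 4-7 Hz range."""
    t = np.arange(len(contour)) / fs
    return contour + depth_semitones * np.sin(2.0 * np.pi * rate_hz * t)

def semitones_to_hz(contour, ref_hz=220.0):
    """Convert a semitone contour to Hz relative to an assumed reference."""
    return ref_hz * 2.0 ** (contour / 12.0)

fs = 1000                                     # contour sampling rate in Hz
melody = step_melody([0, 2], [1.0, 1.0], fs)  # two notes, one second each
smooth = add_overshoot(melody, fs)
f0_hz = semitones_to_hz(add_vibrato(smooth, fs))
```

Working in semitones keeps the fluctuations proportional to pitch, so the same depth settings behave similarly for low and high notes.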

Page 6: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Speaking voice

STRAIGHT (analysis part)

Page 7: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Spectral sequence and aperiodicity (AP) sequence

Duration control model:

- Consonant part: lengthened according to a fixed rate.

- Boundary part (consonant-to-vowel transition): not lengthened.

- Vowel part: lengthened so that the duration of the whole combination corresponds to the note duration.
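The lengthening rule can be sketched as simple arithmetic, assuming the three segments are a consonant, a consonant-to-vowel boundary, and a vowel. The function name and the example durations below are hypothetical; only the rule itself (fixed rate, unchanged, fill the remainder of the note) comes from the slide.

```python
def lengthen_segments(consonant_s, boundary_s, vowel_s, note_s, consonant_rate):
    """Return new (consonant, boundary, vowel) durations and the vowel stretch factor.

    consonant: scaled by a fixed rate
    boundary:  kept as spoken
    vowel:     fills whatever remains of the note duration
    """
    new_consonant = consonant_s * consonant_rate
    new_boundary = boundary_s
    new_vowel = note_s - new_consonant - new_boundary
    if new_vowel <= 0.0:
        raise ValueError("note is too short for the lengthened consonant and boundary")
    return new_consonant, new_boundary, new_vowel, new_vowel / vowel_s

# Hypothetical spoken durations (seconds) mapped onto a 0.5 s note.
c, b, v, vowel_stretch = lengthen_segments(
    consonant_s=0.05, boundary_s=0.04, vowel_s=0.10, note_s=0.5, consonant_rate=1.28
)
```

The vowel stretch factor is what a time-scaling step would apply to the vowel frames of the spectral and AP sequences.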

Page 8: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Lengthened Spectral and AP sequence

Spectral envelope and AP of vowel part.

Modified spectral envelope and AP

Spectral control model 1: adds a singing formant by emphasizing the peak of the spectral envelope and the dip of the AP.
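One way to picture the peak emphasis is a smooth gain applied around the strongest envelope peak in a band near 3 kHz. This is only a sketch under assumptions: the search band, the Gaussian gain shape, `gain_db`, and `width_hz` are illustrative choices, and the corresponding deepening of the AP dip is not shown.

```python
import numpy as np

def emphasize_singing_formant(envelope, freqs, band=(2000.0, 4000.0),
                              gain_db=12.0, width_hz=500.0):
    """Boost a linear-amplitude spectral envelope around its in-band peak.

    Returns the modified envelope and the frequency of the peak that was boosted.
    """
    band_idx = np.flatnonzero((freqs >= band[0]) & (freqs <= band[1]))
    peak_idx = band_idx[np.argmax(envelope[band_idx])]
    peak_freq = freqs[peak_idx]
    # Gaussian-shaped gain in dB, centred on the detected peak.
    gain_curve_db = gain_db * np.exp(-0.5 * ((freqs - peak_freq) / width_hz) ** 2)
    return envelope * 10.0 ** (gain_curve_db / 20.0), peak_freq

# Toy envelope for one frame: flat spectrum with a small bump at 3 kHz.
freqs = np.linspace(0.0, 8000.0, 513)
envelope = 1.0 + np.exp(-0.5 * ((freqs - 3000.0) / 300.0) ** 2)
modified, peak_freq = emphasize_singing_formant(envelope, freqs)
```

In a full system the same function would be applied frame by frame to the vowel part of the lengthened spectral sequence.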

Page 9: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Modified spectral envelope and AP; generated F0 contour

STRAIGHT (synthesis)

Synthesized singing voice

Spectral control model 2: adds amplitude modulation (AM) of formants synchronized with vibrato, by applying AM to the amplitude envelope of the synthesized singing voice during vibrato.

Synthesized singing voice (final version)
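The amplitude modulation step might look like the following sketch, which multiplies the waveform by a sinusoid at the vibrato rate only inside the vibrato interval. The function name, the modulation `depth`, and the way the vibrato interval is passed in are assumptions made for illustration.

```python
import numpy as np

def apply_vibrato_am(signal, fs, vibrato_rate_hz, start_s, end_s, depth=0.2):
    """Modulate the amplitude of `signal` at the vibrato rate during [start_s, end_s)."""
    t = np.arange(len(signal)) / fs
    in_vibrato = (t >= start_s) & (t < end_s)
    modulation = np.ones(len(signal))
    # Phase starts at zero when the vibrato begins, so the gain is continuous there.
    modulation[in_vibrato] += depth * np.sin(
        2.0 * np.pi * vibrato_rate_hz * (t[in_vibrato] - start_s)
    )
    return signal * modulation

fs = 16000
voice = np.ones(fs)  # stand-in for one second of synthesized singing voice
final = apply_vibrato_am(voice, fs, vibrato_rate_hz=5.0, start_s=0.5, end_s=1.0)
```

Using the same rate for the AM and for the F0 vibrato is what keeps the two synchronized.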

Page 10: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

♪ Speaking voice (input): (male → female)

♪ Synthesized singing voice: (male → female → chorus)

Page 11: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

Thank you!!

Page 12: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

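Second-order system: transfer function and impulse response, with gain $K$, damping coefficient $\zeta$, and natural frequency $\Omega$.

$$H(s) = \frac{K}{s^2 + 2\zeta\Omega s + \Omega^2}$$

$$h(t) = \begin{cases}
\dfrac{K}{2\Omega\sqrt{\zeta^2-1}}\left(\exp(\lambda_1 t) - \exp(\lambda_2 t)\right), & |\zeta| > 1 \\[2ex]
\dfrac{K}{\Omega\sqrt{1-\zeta^2}}\,\exp(-\zeta\Omega t)\,\sin\!\left(\sqrt{1-\zeta^2}\,\Omega t\right), & 0 < |\zeta| < 1 \\[2ex]
K\,t\,\exp(-\Omega t), & |\zeta| = 1 \\[2ex]
\dfrac{K}{\Omega}\,\sin(\Omega t), & |\zeta| = 0
\end{cases}$$

where $\lambda_{1,2} = \left(-\zeta \pm \sqrt{\zeta^2-1}\right)\Omega$.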

Page 13: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

♪ Ratio of the duration of each consonant in singing voice to its duration in read speech

| Manner    | Lips (voiced) | Teeth / alveolar arch (voiced) | Teeth / alveolar arch (unvoiced) | Palate (voiced) | Palate (unvoiced) | Glottis  |
| fricative |               | /z/ 1.37                       | /s/ 1.18                         |                 |                   | /h/ 1.28 |
| plosive   |               | /d/ 1.00                       | /t/ 1.09                         | /g/ 1.14        | /k/ 0.97          |          |
| semivowel | /w/ 2.61      | /r/ 2.12                       |                                  |                 |                   |          |
| nasal     | /m/ 1.35      | /n/ 1.50                       |                                  |                 |                   |          |

Phoneme duration can be controlled by manner of articulation rather than by place of articulation: fricative 1.28, plosive 1.00, semivowel 2.37, nasal 1.43, /y/ 1.22.
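The per-manner rates can be held in a small lookup, sketched below. The rate values and the phoneme groupings are the ones listed on this slide; the dictionary and function names are hypothetical, and phonemes not listed on the slide are deliberately rejected rather than guessed.

```python
# Lengthening rate per manner of articulation (values from the slide).
MANNER_RATE = {
    "fricative": 1.28,
    "plosive": 1.00,
    "semivowel": 2.37,
    "nasal": 1.43,
}

# Consonants that appear on the slide, grouped by manner.
PHONEME_MANNER = {
    "z": "fricative", "s": "fricative", "h": "fricative",
    "d": "plosive", "t": "plosive", "g": "plosive", "k": "plosive",
    "w": "semivowel", "r": "semivowel",
    "m": "nasal", "n": "nasal",
}

# /y/ is given its own rate on the slide rather than a manner-level one.
PHONEME_RATE = {"y": 1.22}

def lengthening_rate(phoneme):
    """Return the fixed lengthening rate for a consonant, or raise KeyError."""
    if phoneme in PHONEME_RATE:
        return PHONEME_RATE[phoneme]
    return MANNER_RATE[PHONEME_MANNER[phoneme]]

rates = {p: lengthening_rate(p) for p in ["s", "k", "w", "n", "y"]}
```

Keying the rate on manner means a consonant's place of articulation never has to be known at conversion time.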

Page 14: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT

♪ Singers' formant: a prominent peak in the spectral envelope at around 3 kHz (Sundberg, 1974).

♪ Amplitude modulation of formants synchronized with vibrato (Hirano, 1985).

Both features are prominent in professional singing voices.

Page 15: Vocal  Conversion  from  Speaking  voice  to  Singing voice  Using  STRAIGHT


Spectral control 1: Singing formant, a prominent spectral peak at around 3 kHz.

Spectral control 2: Amplitude modulation of formants synchronized with vibrato in F0.