Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and

Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Co

nversion

Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano

Nara Institute of Science and Technology (NAIST), Japan

August 23rd, 2007

2

– Amusement device– Speech enhancement device

• for a speaking aid system recovering a disabled person’s voice

• for a hearing aid system to make speech sounds more intelligible

Voice Quality Control

• Technique for converting user’s voice quality into another one

Applications

Development of voice quality control with high quality and high controllability is desired!

Controller

Hello.

3

Contents

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions




4. Conclusions

4

Arbitrary speakers

Multiple pre-stored target speakers

Conversion

Training

Source speaker

Hello.Thank you.

Hello.Thank you.

Hello.Thank you.

Hello.Thank you.

Let’s convert. Let’s convert.

Eigenvoice GMM (EV-GMM)

Manually setting

Parallel data

One-to-Many Eigenvoice Conversion (EVC)[Toda et al., 2006]

• A source speaker’s voice is statistically converted into an arbitrary speaker’s one.

5

• Converted voice quality is controlled by weights for eigenvectors.

Eigenvoice GMM (EV-GMM)

Weight

Mean vector

Covariance matrix Eigenvectors

(for eigenvoices)

Bias vector(for average voice)

Parameters of the i th mixture

Source mean vector

Target mean vector

=+

)0()()(

)()(

Yi

Yi

XiZ

ibwB

μμ

i

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Weights for eigenvoices(free parameters)

Problem: eigenvoices do NOT represent a specific physical meaning (such as a masculine voice or a clear voice). Intuitive control of the converted voice quality is difficult!

: Speaker independent parameters

: Free parameters

6

Contents




4. Conclusions

7

Proposed Framework

We would like to intuitively control the converted voice quality!

We propose multiple regression approaches to one-to-many EVC.

Converted voice quality is controlled with the voice quality control vector.* Similar approaches have been proposed in HMM-based speech synthesis [Tachibana et al., 2006].

8

Process of Proposed Framework

1. Preparing multiple parallel data sets2. Setting the voice quality control vector for

every pre-stored target speaker3. Modeling the target mean vectors with

voice quality control vector




9

Setting Voice Quality Control Vector

• We manually assign scores for expression word pairs to each pre-stored target speaker.

• Assigned scores are used as components of the voice quality control vector.

Tense

Hoarse

Masculine

Elderly

Thin

Feminine

Clear

Youthful

Deep

Lax

-3 -2 -1 0 1 2 3

Very VeryQuite QuiteSome-what

Some-what

No preference

2

1

1

-2

-1Voice quality control vectorfor the speaker A

Assigned scores for the speaker A

10

Process of Proposed Framework




We propose 3 regression methods.

11

Proposed Method A

.)()( rRwp se

s Regression parameters

Principal componentsfor the sth target speaker

Modeling principal components is modeled by

.minarg,2

1

)()(

S

s

se

s rRwprR

Minimizing the following error function:

Error of principal components for the sth pre-stored target speaker

)(sp

Total error over all pre-stored target speakers

Least-squares (LS) estimation of regression parameters converting the voice quality control vector into principal components

Voice quality control vectorfor the sth target speaker

12

Resulting EV-GMM in Method A

=

Weight

Mean vector

)0()()(

)()(

Yie

Yi

XiZ

ibrwRB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Eigenvectors

Bias vector


Target mean vector Regressionparameters

Voice quality control vector

++

Problem: the desired voice characteristics might not be represented as a linear combination of eigenvectors. Changing the eigenvectors themselves is necessary!

: Training parameters

: Speaker independent EV-GMM parameters

13

Proposed Method B

.)0()(minarg)0(,2

1

)()()()()()(

S

s

Yse

YYYY s bwBμbB

Minimizing the following error function:

.)0(

)()()()(

)(

Yse

Y

Y s

bwB

μ

Target mean vector is modeled by)()( sYμ

Error of target mean vectors for the sth pre-stored target speaker

Total error over all pre-stored target speakers

LS estimation of a regression parameters converting the voice quality control vector into the target mean vectors

= +

Regression parameters

Target mean vectorfor the sth target speaker


14

Resulting EV-GMM in Method B

=

Weight

Mean vector

+

)0(ˆˆ )()(

)()(

Yie

Yi

XiZ

ibwB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ



Target mean vector Voice quality control vector

Problem: the desired voice quality might not be obtained because the converted voice quality is affected by all EV-GMM parameters.


: Speaker independent EV-GMM parameters

15

Proposed Method C

.,logmaxarg )()()(

1 1

)(

)(

se

EVst

S

s

T

t

EV Ps

EVwλZλ

λ

Maximizing the following likelihood function:

* This process is considered as speaker adaptive training (SAT) of EV-GMM [Ohtani et al., Interspeech 2007].

Likelihood of the adapted EV-GMM for each pre-stored target speaker

Maximum Likelihood (ML) estimation of all EV-GMM parameters while fixing the voice quality control vector

Total likelihood over all pre-stored target speakers

.)0(

)()()()(

)(

Yse

Y

Y s

bwB

μ

Target mean vector is modeled by)()( sYμ

= +


Target mean vectorfor the sth target speaker


16

Resulting EV-GMM in Method C

=

Weight

Mean vector

+

)0(ˆˆ )()(

)()(

Yie

Yi

XiZ

ibwB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ


Target mean vectorVoice quality control vector



17

Comparison of Proposed Methods

Dependent variablesTied parameters of EV-GMM

Training criterion

Method A

Principal componentsSpeaker

independentLS

Method B

Target mean vectorsSpeaker

independentLS

Method C

Target mean vectors Optimized ML

18

Contents




4. Conclusions

19

Verification of Proposed Methods

• Objective verification

• Subjective verification

Source speaker One female

Pre-stored target speakers

15 males and 15 females

Sentences Phonetically balanced 50 sentences per a speaker

Expression word pairs masculine / feminine, hoarse / clear, elderly / youthful, thin / deep, lax / tense

Number of mixtures 128

Number of Eigenvectors 29 (no loss of information)

Experimental conditions

20

Objective VerificationIs a correspondence of the voice quality control vector into the converted voice quality appropriately modeled?

• For each pre-stored target speaker in the training data, the following two voice quality control vectors were compared.

1. Manually assigned one

2. Adjusted one on the trained EV-GMM so that the converted voice quality becomes similar to the target

* approximately determined by maximum likelihood eigen-decomposition for EV-GMM [Toda et al., 2006] using two sentences

• Euclidean distance and correlation coefficient between those two vectors were calculated as objective measures.

21

Results of Objective Verification

* Reassigned: assigned scores by the same listener a second time on a different day

Worse

Better

Better

Worse

Better!

1. The method A does not work at all.1. The method A does not work at all.2. The method B works but not so good.1. The method A does not work at all.2. The method B works but not so good.3. The method C works reasonably well.

Too consistent compared with

human judgment? Better!

22

Subjective Verification

• Preference test on the converted speech quality was conducted.– Comparison of average voices* by the trained EV-GMMs * converted voices when setting every component of the voice q

uality control vector to zero

Test sentences 50 sentences not included in training data

Number of subjects 5

Experimental conditions

Having very similar speaker individuality in both method B and C

Which is better, the method B or the method C?

23

Result of Subjective Verification

• The method B outperforms the method C.

Possibility to be thought– The EV-GMM parameters trained in EM algorithm

converged to local optima due to using inappropriate initial model (i.e., the target independent GMM).

24

Contents




4. Conclusions

25

Conclusions

• Proposal of regression approaches to the voice quality control based on one-to-many eigenvoice conversion (EVC)– Based on a statistical conversion framework– Allowing intuitive control of converted voice quality with voi

ce quality control vector

• Experimental verification– Showing the possibility that voice quality control with hig

h quality and high controllability is realized.

Documents

Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and