


Neurocomputing 71 (2007) 174–180

www.elsevier.com/locate/neucom

A neural-wavelet architecture for voice conversion

Rodrigo Capobianco Guido a, Lucimar Sasso Vieira a, Sylvio Barbon Júnior a, Fabrício Lopes Sanchez b, Carlos Dias Maciel b, Everthon Silva Fonseca b, José Carlos Pereira b

a SpeechLab/FFI/IFSC/USP—Department of Physics and Informatics, Institute of Physics at São Carlos, University of São Paulo, Avenida Trabalhador São Carlense 400, 13560-970 São Carlos, SP, Brazil

b SEL/EESC/USP—Department of Electrical Engineering, School of Engineering at São Carlos, University of São Paulo, Avenida Trabalhador São Carlense 400, 13560-970 São Carlos, SP, Brazil

Available online 29 August 2007

Abstract

In this letter we propose a new architecture for voice conversion that is based on a joint neural-wavelet approach. We also examine the characteristics of many wavelet families and determine the one that best matches the requirements of the proposed system. The conclusions presented in theory are confirmed in practice with utterances extracted from the TIMIT speech corpus.

© 2007 Elsevier B.V. All rights reserved.

Keywords: RBF neural networks; Wavelet transforms; Voice conversion

1. Introduction

Voice conversion, also known as voice morphing, enables a source speaker to transform his speech pattern to sound as if it were spoken by another person, the target speaker, while preserving the original content of the spoken message [18].

Much literature has recently appeared addressing the issue of voice conversion [12,5,17]. Most methods are single-scale methods based on the interpolation of speech parameters and on modeling the speech signals using formant frequencies (FF) [1], linear prediction coding (LPC) and cepstrum coefficients (CC) [6], line spectral frequencies (LSF) [11], segmental codebooks (SC) [16], among many others. We can also find techniques based on hidden Markov models (HMMs) and Gaussian mixture models (GMMs) [4], which are well-known methods in the speech community. Most of the techniques mentioned above suffer from the absence of detailed information during the extraction of the formant coefficients and the excitation signal. This limits how accurately the parameters can be estimated and introduces distortion during the synthesis of the target speech. Turk and Arslan [16] introduced the discrete wavelet transform (DWT) [2,14] for voice conversion and obtained encouraging results. Following their ideas, other interesting contributions can be found, such as that of Orphanidou et al. [13].

In particular, this paper proposes a new algorithm for voice conversion that is based on wavelet transforms and radial basis function (RBF) neural networks [8]. This is the main contribution of this work, which extends the considerations of Turk and Arslan, and of Orphanidou et al., on the use of wavelets for voice conversion.

This paper is organized as follows. Section 2 presents a brief overview of how wavelet-based algorithms for voice conversion work. The proposed approach is presented in Section 3. A study of wavelets and of their characteristics that are important for voice conversion, carried out in order to determine the best wavelet family to be used, is given in Section 4. Section 5 describes the tests and results, and, lastly, Section 6 presents the conclusions.

2. A brief review on wavelet-based voice conversion

The basic idea behind the use of the DWT for voice conversion is sub-band separation. With that, the pitch period of voiced sounds, the FF [4], plus other information, can be effectively treated and converted separately. In order to convert the source speaker's pattern into the target speaker's pattern, artificial neural networks can be used, as in [13]. Usually, these networks, which are important components of the system, are multilayer perceptron or RBF networks [8].

Fig. 1. Basic architecture of a wavelet-based system for voice conversion. (N.N.N.): neural network not yet trained; (T.N.N.): trained neural network.

For training the networks, the DWTs of the source and target speakers' voice signals are used as input. The sentences or phonemes uttered by both speakers must be the same. The weights of the hidden and output layers of the networks are adjusted in such a way that the source speaker's pattern is converted into the target speaker's pattern. This conversion is done for each sub-band of frequencies separately, and the training step is a supervised learning procedure [8].

After the training step, the conversion step receives as input a different sentence uttered by the source speaker, applies the DWT, and sends the signal to the networks that convert it. Lastly, the inverse DWT (IDWT) is applied to return the converted signal to the time domain. The converted sentence sounds as if it were spoken by the target speaker. Fig. 1 illustrates this.

Additional details about the architecture of the neural networks that can be used for voice conversion are not included in this section because Orphanidou et al. discuss this topic in detail in [13]; furthermore, complementary explanations are given in Section 5 of this letter.

3. The proposed approach

The proposed approach is divided into two parts: training (TR) and testing (TE). They are completely described in Tables 1 and 2, respectively, and further explanations follow.

All the RBFs use the Gaussian equation [13] as the activation function in the hidden layers. On the other hand, the output neurons use a simple linear weighted function. To train the RBFs, a two-step procedure was adopted. In the first step, a non-supervised training is used to determine the centers and variances of the Gaussian kernels, i.e., only the source speaker's speech data is used. Then, the weights of the output layer are adjusted by means of a supervised procedure, i.e., both the source and the target speakers' speech data are used. This is exactly the procedure adopted by Orphanidou et al. in [13].

In our approach, each one of the 21 RBFs, R_i (0 ≤ i ≤ 20), has a_i input and output neurons and b_i hidden neurons; a_i is the number of samples of the critical band i within a speech frame of 256 samples, according to Table 3, and b_i is the number of speech frames used during the training, which is the same for all RBFs. This allows the supervised part of the training to be easily solved by using the Moore–Penrose pseudo-inversion based on singular values [9].

Another important observation is that only the voiced speech frames are used during the training, since unvoiced data does not contain significant information about the vocal tract's resonances.
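As an illustration only, the two-step training just described could be sketched as follows. This is a minimal NumPy sketch, not the authors' original implementation: the class name RBFBand and its methods fit()/convert() are hypothetical, the centers are taken as a random subset of the source frames (a k-means clustering could be used instead), and a single shared variance is derived from the center spread.

```python
import numpy as np

class RBFBand:
    """One RBF network R_i, mapping source DWPT sub-frames of critical band i
    to the corresponding target sub-frames (hypothetical sketch)."""

    def __init__(self, n_hidden):
        self.n_hidden = n_hidden

    def _hidden(self, X):
        # Gaussian activations of the hidden layer, one column per center.
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def fit(self, X_src, Y_tgt):
        # Step 1 (non-supervised, source data only): centers and a shared variance.
        rng = np.random.default_rng(0)
        idx = rng.choice(len(X_src), size=self.n_hidden, replace=False)
        self.centers = X_src[idx]
        dmax = np.linalg.norm(self.centers[:, None] - self.centers[None, :], axis=2).max()
        self.sigma = dmax / np.sqrt(2.0 * self.n_hidden) + 1e-12
        # Step 2 (supervised): linear output weights by Moore-Penrose pseudo-inversion.
        self.W = np.linalg.pinv(self._hidden(X_src)) @ Y_tgt
        return self

    def convert(self, X):
        # Simple linear weighted combination of the Gaussian hidden activations.
        return self._hidden(X) @ self.W
```

With b_i hidden neurons equal to the number of training frames, as stated above, the hidden-layer matrix is square and the supervised step reduces to solving a linear system (an exact interpolation of the training pairs when that matrix is non-singular).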

4. Exploring the characteristics of wavelets

According to the DWT theory [2], the jth-level decomposition of a given discrete (speech) signal, f[n], can be written as [7]

$$f[n] = \sum_{k=0}^{n/2^{j}-1} R_{j,k}[n]\,\phi_{j,k}[n] \;+\; \sum_{t=1}^{j}\,\sum_{k=0}^{n/2^{j}-1} S_{t,k}[n]\,\psi_{t,k}[n], \qquad (1)$$

with $\phi[n] = \sum_{k} h[k]\,\phi[2n-k]$ and $\psi[n] = \sum_{k} g[k]\,\phi[2n-k]$ being the scaling and wavelet functions, respectively, that form a Riesz basis [2] in which the signal f is written; $R_{j,k}[n] = \langle f, \phi_{j,k}[n]\rangle$ and $S_{t,k}[n] = \langle f, \psi_{t,k}[n]\rangle$ being the corresponding coefficients; and $h[k]$ and $g[k] = (-1)^{k}\,h[N-k-1]$ being the quadrature mirror (QMF) low-pass and high-pass analysis filters [2], respectively.

During the training and conversion steps, when f[n] is being analysed, as illustrated in Fig. 2, the low-pass and high-pass filtering occur by discrete convolutions, followed by down-samplings by 2, and, therefore, the length N of h[k] and g[k] is responsible for both the frequency selectivity, Q, and the time resolution, R. As N increases, Q increases and R decreases.


Table 1
Proposed algorithm for training

• BEGINNING
• STEP TR-1: Define the source, s[ ], and target, t[ ], speech files, which are sampled at 16 000 samples per second, 16 bits, PCM;
• STEP TR-2: Divide s[ ] and t[ ] into n frames with 256 samples each, and then time-align them [13]. If the last frame contains less than 256 samples, it can be zero-padded or discarded;
• STEP TR-3: For each frame i of s[ ] and t[ ], (0 ≤ i ≤ n), do:
  • If i is a voiced speech frame, then:
    STEP TR-3.1: Obtain the 8th-level complete DWT, i.e., the discrete wavelet-packet transform (DWPT), of s_i[ ] and t_i[ ]. The transforms, named DWPT(s_i[ ]) and DWPT(t_i[ ]), respectively, have the same length as the original signals, i.e., 256 samples. The family of wavelets to be used, and the corresponding support size of the filters, will be discussed later;
    STEP TR-3.2: Divide DWPT(s_i[ ]) and DWPT(t_i[ ]) into 21 sub-frames. Each sub-frame corresponds to a critical band of the human auditory system [3], according to Table 3. Therefore, we now have the sub-signals DWPT_0(s_i[ ]), DWPT_1(s_i[ ]), ..., DWPT_20(s_i[ ]), and DWPT_0(t_i[ ]), DWPT_1(t_i[ ]), ..., DWPT_20(t_i[ ]);
    STEP TR-3.3: For each sub-signal j of DWPT(s_i[ ]) and DWPT(t_i[ ]), (0 ≤ j ≤ 20), do:
      STEP TR-3.3.1: Use the sub-signals DWPT_j(s_i[ ]) and DWPT_j(t_i[ ]) to train the RBF neural network j, DWPT_j(s_i[ ]) serving as input and DWPT_j(t_i[ ]) serving as output;
  • else
    discard the frame;
• END
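Purely as an illustration of STEPs TR-2 to TR-3.3.1, the frame analysis and the collection of per-band training pairs could be sketched as below. The sketch assumes NumPy, and PyWavelets is used only to fetch a 24-tap Symmlet low-pass filter (the choice discussed in Section 4); the voicing decision is left as a hypothetical is_voiced() placeholder, split()/dwpt() implement a periodized, frequency-ordered 8th-level wavelet-packet decomposition so that the IS/FS indices of Table 3 (stored in BANDS) can slice the 256 coefficients directly, and RBFBand is the class sketched in Section 3.

```python
import numpy as np
import pywt  # only used to obtain the Symmlet filter taps

h = np.array(pywt.Wavelet("sym12").dec_lo)        # 24-tap low-pass analysis filter
g = ((-1) ** np.arange(len(h))) * h[::-1]         # QMF high-pass: g[k] = (-1)^k h[N-k-1]

# IS/FS columns of Table 3: first/last DWPT coefficient of each of the 21 bands.
BANDS = [(0, 2), (3, 5), (6, 9), (10, 12), (13, 15), (16, 19), (20, 24), (25, 28),
         (29, 34), (35, 40), (41, 46), (47, 54), (55, 63), (64, 73), (74, 85),
         (86, 100), (101, 118), (119, 141), (142, 170), (171, 205), (206, 255)]

def split(x, h, g):
    """One analysis level: periodic correlation with h and g, kept at even shifts."""
    idx = (2 * np.arange(len(x) // 2)[:, None] + np.arange(len(h))[None, :]) % len(x)
    return x[idx] @ h, x[idx] @ g

def dwpt(x, h, g, levels=8, inv=False):
    """Full wavelet-packet tree; leaves returned in natural frequency order."""
    if levels == 0:
        return [x]
    a, d = split(x, h, g)
    lo, hi = dwpt(a, h, g, levels - 1, inv), dwpt(d, h, g, levels - 1, not inv)
    return lo + hi if not inv else hi + lo

def band_pairs(source, target, is_voiced, frame_len=256):
    """STEP TR-3: per critical band j, stack the DWPT sub-frames of the voiced,
    time-aligned source/target frames (training input/output of RBF j)."""
    pairs = [([], []) for _ in BANDS]
    for i in range(min(len(source), len(target)) // frame_len):
        s = source[i * frame_len:(i + 1) * frame_len]
        t = target[i * frame_len:(i + 1) * frame_len]
        if not is_voiced(s):
            continue                                    # discard unvoiced frames
        cs, ct = np.concatenate(dwpt(s, h, g)), np.concatenate(dwpt(t, h, g))
        for j, (first, last) in enumerate(BANDS):
            pairs[j][0].append(cs[first:last + 1])
            pairs[j][1].append(ct[first:last + 1])
    return [(np.asarray(x), np.asarray(y)) for x, y in pairs]

# rbfs = [RBFBand(n_hidden=len(x)).fit(x, y) for x, y in band_pairs(s, t, is_voiced)]
```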

Table 2
The test procedure

• BEGINNING
• STEP TE-1: Define the test speech file, e[ ], which is sampled at 16 000 samples per second, 16 bits, PCM;
• STEP TE-2: Divide e[ ] into n frames with 256 samples each. If the last frame contains less than 256 samples, it can be zero-padded or discarded;
• STEP TE-3: For each frame i of e[ ], (0 ≤ i ≤ n), do:
  • If i is a voiced speech frame, then:
    STEP TE-3.1: Obtain the 8th-level complete DWT, i.e., the discrete wavelet-packet transform (DWPT), of e_i[ ]. The transform, named DWPT(e_i[ ]), has the same length as the original signal, i.e., 256 samples. The family of wavelets to be used, and the corresponding support size of the filters, is the same as in the training procedure;
    STEP TE-3.2: Divide DWPT(e_i[ ]) into 21 sub-frames. Each sub-frame corresponds to a critical band of the human auditory system [3], according to Table 3. Therefore, we now have the sub-signals DWPT_0(e_i[ ]), DWPT_1(e_i[ ]), ..., DWPT_20(e_i[ ]);
    STEP TE-3.3: For each sub-signal j of DWPT(e_i[ ]), (0 ≤ j ≤ 20), do:
      STEP TE-3.3.1: Pass the sub-signal DWPT_j(e_i[ ]) through the RBF neural network j to produce the morphed sub-bands;
  • else
    do not modify the frame;
  • STEP TE-3.4: Invert DWPT(e_i[ ]) to produce the morphed speech frame.
• END
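Similarly, and again only as a hedged sketch, the conversion of STEPs TE-3.1 to TE-3.4 could look as follows; it reuses h, g, BANDS, split() and dwpt() from the previous sketch, adds the corresponding periodized synthesis (inverse DWPT), and assumes rbfs is the list of the 21 trained networks and is_voiced() the same hypothetical detector.

```python
import numpy as np

def merge(a, d, h, g):
    """One synthesis level: the transpose of split(); exact inverse for orthogonal h, g."""
    x = np.zeros(2 * len(a))
    for i in range(len(a)):
        idx = (2 * i + np.arange(len(h))) % len(x)
        np.add.at(x, idx, a[i] * h + d[i] * g)   # accumulate the up-sampled, filtered terms
    return x

def idwpt(leaves, h, g, levels=8, inv=False):
    """Rebuild a frame from frequency-ordered wavelet-packet leaves."""
    if levels == 0:
        return leaves[0]
    half = len(leaves) // 2
    first, second = (leaves[:half], leaves[half:]) if not inv else (leaves[half:], leaves[:half])
    return merge(idwpt(first, h, g, levels - 1, inv),
                 idwpt(second, h, g, levels - 1, not inv), h, g)

def convert_frame(frame, rbfs, is_voiced):
    """STEP TE-3: morph one 256-sample frame; unvoiced frames pass through unchanged."""
    if not is_voiced(frame):
        return frame
    coeffs = np.concatenate(dwpt(frame, h, g))           # STEPs TE-3.1 and TE-3.2
    for j, (first, last) in enumerate(BANDS):            # STEP TE-3.3: per-band RBF mapping
        coeffs[first:last + 1] = rbfs[j].convert(coeffs[None, first:last + 1])[0]
    return idwpt([np.array([c]) for c in coeffs], h, g)  # STEP TE-3.4: synthesis
```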

Table 3
The 25 critical bands of the human auditory system

Band  Frequency (Hz)   IS    FS
0     0–100            0     2
1     100–200          3     5
2     200–300          6     9
3     300–400          10    12
4     400–510          13    15
5     510–630          16    19
6     630–770          20    24
7     770–920          25    28
8     920–1080         29    34
9     1080–1270        35    40
10    1270–1480        41    46
11    1480–1720        47    54
12    1720–2000        55    63
13    2000–2320        64    73
14    2320–2700        74    85
15    2700–3150        86    100
16    3150–3700        101   118
17    3700–4400        119   141
18    4400–5300        142   170
19    5300–6400        171   205
20    6400–7700        206   255
21    7700–9500        –     –
22    9500–12 000      –     –
23    12 000–15 500    –     –
24    15 500–22 050    –     –

As the speech signals are sampled at 16 000 Hz, only the first 21 bands (0–20) are used. The last used band, i.e., band 20, ranges from 6400 to 7700 Hz; however, for the current application we approximate this range to 6400–8000 Hz. The columns IS and FS represent, respectively, the initial and final samples of the 8th-level DWPT of a 256-sample speech frame that approximately correspond to the critical band's frequency content.
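As a quick sanity check of the IS/FS columns (not part of the original table), note that with a 16 000 Hz sampling rate the 256 coefficients of the 8th-level DWPT cover 0–8000 Hz, so each coefficient spans 31.25 Hz; band 12 (1720–2000 Hz), for example, is assigned coefficients 55–63:

```python
bin_hz = 8000 / 256                      # frequency span of each level-8 DWPT coefficient
print(55 * bin_hz, (63 + 1) * bin_hz)    # 1718.75 2000.0, matching band 12 (1720-2000 Hz)
```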


Therefore, a balanced value for N is needed. Let us define this as requirement 1. Another important consideration, say requirement 2, is that (almost) linear-phase filters h[k] and g[k] are desirable to avoid distortion in the filtered signal, i.e., (almost) symmetrical or anti-symmetrical impulse responses are certainly preferable.

Fig. 2. Example of the third-level DWT decomposition of a discrete signal s[n] containing n samples and maximum frequency π. Note the frequency spectrum and number of samples at each sub-band.


Fig. 3. From left to right: impulse responses' shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet.

During the conversion step, particularly when the morphed signal is being synthesized, the IDWT is applied, i.e., the synthesis filters, $\tilde{h}[k] = h[N-k-1]$ and $\tilde{g}[k] = (-1)^{k+1}\,h[k]$, are used, and up-samplings by 2 are performed [2]. A closer look at Eq. (1) calls our attention to the fact that this reconstruction corresponds to writing f[n] as a linear combination of φ and ψ.


Although the set $\{h[k], g[k], \tilde{h}[k], \tilde{g}[k]\}$ forms a perfect-reconstruction filter bank [14], i.e., the filters satisfy the anti-aliasing (Eq. (2)) and perfect-reconstruction (Eq. (3)) conditions in the z-domain [14], the shapes of φ and ψ, and their corresponding numbers of vanishing moments, are of great importance.

$$\tilde{H}[z] = G[-z], \qquad \tilde{G}[z] = -H[-z], \qquad (2)$$

$$H[z]\,\tilde{H}[z] + G[z]\,\tilde{G}[z] = 2z^{-N+1}. \qquad (3)$$
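Both conditions are easy to verify numerically for orthogonal filters. The following sketch (an illustration, not taken from the paper) builds the QMF pair and the synthesis filters with the alternating-flip rules given above, using the classical 4-tap Daubechies low-pass filter as an example, and checks Eqs. (2) and (3) through polynomial (convolution) products:

```python
import numpy as np

# 4-tap Daubechies low-pass analysis filter (any orthogonal h[k] would do).
h = np.array([1 + np.sqrt(3), 3 + np.sqrt(3), 3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))
N = len(h)
k = np.arange(N)

g = (-1) ** k * h[::-1]            # analysis high-pass:  g[k] = (-1)^k h[N-k-1]
h_syn = h[::-1]                    # synthesis low-pass:  h~[k] = h[N-k-1]
g_syn = (-1) ** (k + 1) * h        # synthesis high-pass: g~[k] = (-1)^(k+1) h[k]

# Eq. (2): H~(z) = G(-z) and G~(z) = -H(-z); replacing z by -z flips the sign
# of the odd-indexed taps.
alt = (-1) ** k
print(np.allclose(h_syn, g * alt), np.allclose(g_syn, -h * alt))   # True True

# Eq. (3): H(z)H~(z) + G(z)G~(z) = 2 z^{-(N-1)}, i.e., a pure delay with gain 2.
p = np.convolve(h, h_syn) + np.convolve(g, g_syn)
target = np.zeros(2 * N - 1)
target[N - 1] = 2.0
print(np.allclose(p, target))                                      # True
```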

The neural networks modify the DWT of f[n] before the application of the IDWT that synthesizes the morphed speech; thus, the synthesized signal is, at most, similar to the original f[n]. This means that the shape of the synthesized waveform may become similar to the shapes of φ and ψ, and this similarity becomes more evident the more the DWT of f[n] is modified by the networks. Therefore, smooth shapes for φ and ψ are needed. This is requirement 3.

Fig. 4. From left to right: scaling functions' shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet.

Fig. 5. From left to right: wavelet functions' shapes of the wavelet filters Haar, Daubechies, Vaidyanathan, Beylkin, Coiflet, and Symmlet.


In the last two paragraphs, we defined three requirements. Requirements 1 and 2 are related to the analysis process that occurs during the training and conversion steps. Requirement 3 relates to the synthesis process that occurs during the conversion step.


In order to satisfy requirement 1, we have to find N in such a way that it balances the constraints Q and R. According to our previous results on audio and speech compression [7], we chose N around 24. Other values of N can produce non-optimal results in one respect, Q or R, according to the considerations above, decreasing the quality of the morphed speech.
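A rough numerical illustration of requirement 1 (not taken from the paper): the longer the low-pass filter, the sharper its transition band, i.e., the higher its frequency selectivity Q, at the price of a longer support and hence a poorer time resolution R. Assuming PyWavelets is available to supply the filter taps, the relative width of the 10–90% transition region can be compared for 2-, 8- and 24-tap filters:

```python
import numpy as np
import pywt

for name in ("haar", "sym4", "sym12"):                 # N = 2, 8 and 24 taps
    h = np.array(pywt.Wavelet(name).dec_lo)
    H = np.abs(np.fft.rfft(h, 1024))                   # magnitude response on [0, pi]
    transition = np.mean((H > 0.1 * H.max()) & (H < 0.9 * H.max()))
    print(f"{name:6s}  N = {len(h):2d}  relative transition width = {transition:.2f}")
```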

Requirement 2 can only be satisfied by using finite impulse response (FIR) wavelet filters, since infinite impulse response (IIR) ones, such as Shannon, Meyer, and so on, do not exhibit linear phase [14]. Therefore, among the well-known FIR wavelet filters, say Haar, Daubechies, Coiflets, Symmlets, Vaidyanathan, and others [14], the only one with exactly linear phase is the Haar wavelet. Unfortunately, we also have to discard this wavelet, since the support of its filters is N = 2, and therefore it fails to satisfy requirement 1. The only remaining alternative is to use a wavelet family whose filters exhibit almost linear phase, i.e., have an almost symmetrical or anti-symmetrical impulse response. Fig. 3, which gives an intuition of the impulse responses' shapes of such low-pass filters, shows that the Symmlet filters are the most appropriate; therefore, they were chosen.
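As a crude, purely illustrative proxy for this "almost linear phase" argument (again assuming PyWavelets for the filter taps), one can measure how far each 24-tap low-pass filter is from being exactly symmetric about its centre; an exactly symmetric FIR filter has linear phase, so smaller values indicate a more nearly linear phase:

```python
import numpy as np
import pywt

for name in ("db12", "coif4", "sym12"):     # three orthogonal families with 24 taps each
    h = np.array(pywt.Wavelet(name).dec_lo)
    asymmetry = np.linalg.norm(h - h[::-1]) / np.linalg.norm(h)
    print(f"{name:6s}  relative asymmetry = {asymmetry:.2f}")
```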

The above choice of Symmlets constrains the corresponding functions φ and ψ to be derived from the same family of wavelet filters, by which requirement 3 is automatically satisfied: the Symmlet scaling and wavelet functions present smooth shapes that adapt well to most speech signals. Figs. 4 and 5 give an intuition of the shapes of the scaling and wavelet functions associated with the above-mentioned filters.

Instead of using the DWT itself, the proposed approach uses the full decomposition tree, i.e., the discrete wavelet-packet transform (DWPT) [14], and then the sub-bands are rearranged according to the natural frequency ordering [10]. All the concepts discussed above are equally valid for the DWPT.

Table 4
Results of the tests on the trained system using sentences extracted from the TIMIT corpus

SSN        SSG  TSN        TSG  SUT  SUC  PTR: S / H / C / D / B / V
dr1/fcjf0  F    dr1/fdaw0  F    sa1  sa2  (8.5; 0.81) / (4.0; 0.91) / (8.0; 0.91) / (8.0; 0.87) / (8.0; 0.81) / (8.5; 0.91)
dr1/mcpm0  M    dr1/mdac0  M    sa1  sa2  (9.5; 0.83) / (4.5; 0.81) / (9.0; 0.91) / (9.0; 0.91) / (7.5; 0.81) / (8.0; 0.85)
dr1/fcjf0  F    dr1/mcpm0  M    sa1  sa2  (9.0; 0.91) / (4.5; 0.93) / (9.0; 0.93) / (8.5; 0.91) / (8.5; 0.85) / (8.0; 0.91)
dr1/mdac0  M    dr1/fdaw0  F    sa1  sa2  (9.5; 0.83) / (5.0; 0.83) / (9.0; 0.81) / (8.5; 0.83) / (8.5; 0.91) / (8.5; 0.87)
dr3/falk0  F    dr3/fcke0  F    sa2  sa1  (8.5; 0.91) / (4.5; 0.97) / (8.5; 0.81) / (8.5; 0.95) / (8.5; 0.81) / (8.5; 0.91)
dr3/madc0  M    dr3/makb0  M    sa2  sa1  (9.0; 0.91) / (5.0; 0.87) / (8.5; 0.87) / (8.5; 0.91) / (8.0; 0.85) / (8.0; 0.91)
dr3/falk0  F    dr3/madc0  M    sa2  sa1  (9.5; 0.97) / (5.0; 0.81) / (9.0; 0.81) / (9.0; 0.81) / (8.5; 0.81) / (8.5; 0.91)
dr3/makb0  M    dr3/fcke0  F    sa2  sa1  (9.0; 0.87) / (5.0; 0.81) / (9.0; 0.91) / (8.5; 0.91) / (8.0; 0.91) / (8.5; 0.85)

(SSN): directory and source speaker's name; (SSG): source speaker's gender—M: male, F: female; (TSN): target speaker's name; (TSG): target speaker's gender—M: male, F: female; (SUT): SU1 and TU1, i.e., the sentences used as input for training; (SUC): the sentence used as input for conversion, i.e., SU2; (PTR): perceptual test rate—(mean; standard deviation)—over the 10 volunteers, with S: Symmlet wavelet, H: Haar wavelet, C: Coiflet wavelet, D: Daubechies wavelet, B: Beylkin wavelet, V: Vaidyanathan wavelet.

5. Tests and results

We extracted speech data from the TIMIT corpus [15] and used them to convert some voice patterns. In particular, we report the results obtained for converting two sentences for each one of the following patterns: male to female speaker, female to male speaker, male to male speaker, and female to female speaker. More sentences were used during the tests, but since the results are quite similar and a considerable space would be required to list them, they are not reported.

The tests were as follows. A source speaker, S, and a target one, T, were chosen. For both speakers, the utterances available include U1 and U2. Therefore, SU1 is the sentence U1 uttered by speaker S, SU2 is the sentence U2 uttered by speaker S, TU1 is the sentence U1 uttered by speaker T, and, lastly, TU2 is the sentence U2 uttered by speaker T.


The input speech signals, which were originally sampled at 16 000 samples per second, 16-bit PCM, were divided into frames with 256 samples each, and then the 8th-level complete DWTs (i.e., DWPTs) were applied. Each frame of speech data that passed through the transform was sub-divided into 21 sub-frames of variable length, in such a way that each sub-frame s_i approximately matches one of the first 21 critical bands of the human ear [3]. This process allows the system to take advantage of some properties of the human ear, such as frequency masking [3].

To train the system to convert a particular speaker's voice pattern into another, the sentences SU1 and TU1 were extracted from the train directory of the dataset, where 50% of the data of each file, randomly chosen, was used. The system is considered trained at the moment when each sample of SU1 is converted to the respective sample of TU1 with an error lower than 0.1%; then, SU2 was used as input to produce the morphed output, i.e., an estimate of TU2. This output is compared to TU2, which is present in the database. The comparison consists of a perceptual test, similar to the ABX preference test [13]. We asked 10 volunteers to rate the converted speech from 1 to 10, the lowest score representing a non-recognizable morphed speech and the highest meaning that the morphed output and TU2 are indistinguishable. Among the volunteers, we included men, women, and young and elderly people.

For converting a speech pattern with a trained network, using N = 24, the perceptual tests confirmed that Symmlets are preferable, although Coiflets and Daubechies' wavelets presented results quite close to those of Symmlets. In practice, the perceptual results repeat for 16 ≤ N ≤ 32, but the quality of the morphed speech worsens outside this range. Haar's filters presented poor results, which reflects the fact that they have N = 2 and that their scaling and wavelet functions are not particularly smooth. Table 4 summarizes the results. The system was implemented using the C++ language.

6. Conclusions and future work

We presented a new architecture for voice conversion based on RBFs and wavelets, including a study on the best wavelet for the proposed algorithm. The study considered the perceptual quality of the morphed speech. Based on our own theoretical assumptions, confirmed in practice, we concluded that Symmlets with N around 24 are the best candidates for the proposed architecture. Our future work will include the use of matched wavelets to increase quality, reduce the computational complexity, and accelerate the convergence of the networks.

Acknowledgements

We wish to thank the State of São Paulo Research Foundation for the grants given to this work under process no. 2005/00015-1.

References

[1] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, Voice conversion through vector quantization, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 88), vol. 1, 1988, pp. 655–658.

[2] P.S. Addison, The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance, Institute of Physics Publishing, Edinburgh, 2002.

[3] M. Bosi, R. Goldberg, Introduction to Digital Audio Coding and Standards, second ed., Kluwer Academic Publishers, Massachusetts, 2003.

[4] L. Deng, D. O'Shaughnessy, Speech Processing: A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc., New York, USA, 2003.

[5] C. Drioli, Radial basis function networks for conversion of sound speech spectra, EURASIP J. Appl. Signal Process. (1) (2001) 36–40.

[6] S. Furui, Research on individuality features in speech waves and automatic speaker recognition techniques, Speech Commun. 5 (2) (1986) 183–197.

[7] R.C. Guido, et al., A study on the best wavelet for audio compression, in: 40th IEEE ASILOMAR International Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2006, pp. 2115–2118.

[8] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed., Prentice-Hall, Upper Saddle River, NJ, 1998.

[9] S. Haykin, Adaptive Filter Theory, third ed., Prentice-Hall, Upper Saddle River, NJ, 1999.

[10] A. Jensen, A. Cour-Harbo, Ripples in Mathematics: The Discrete Wavelet Transform, Springer, New York, USA, 2000.

[11] A. Kain, M. Macon, Spectral voice conversion for text-to-speech synthesis, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 98), vol. 1, 1998, pp. 285–288.

[12] A. Mouchtaris, J. Van der Spiegel, P. Mueller, Non-parallel training for voice conversion by maximum likelihood constrained adaptation, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 04), vol. 1, 2004, pp. I1–I4.

[13] C. Orphanidou, I.M. Moroz, S.J. Roberts, Wavelet-based voice morphing, WSEAS J. Syst. 10 (3) (2004) 3297–3302.

[14] G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, USA, 1997.

[15] TIMIT corpus: Linguistic Data Consortium, <http://www.ldc.upenn.edu/>.

[16] O. Turk, L.M. Arslan, Subband based voice conversion, in: Proceedings of the International Conference on Spoken Language Processing (ICSLP), vol. 1, 2002, pp. 289–293.

[17] H. Valbret, Voice transformation using PSOLA technique, Speech Commun. 11 (2–3) (1992) 175–187.

[18] C.H. Wu, C.C. Hsia, T.H. Liu, J.F. Wang, Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis, IEEE Trans. Audio Speech Language Process. 14 (4) (2006) 1109–1116.