Reconstruction of articulatory movements during neutral speech from those during whispered speech

Nisha Meenakshi G.a) and Prasanta Kumar Ghosh
Electrical Engineering, Indian Institute of Science, Bangalore-560012, India
a)Electronic mail: [email protected]

(Received 25 September 2017; revised 25 April 2018; accepted 9 May 2018; published online 6 June 2018)
A transformation function (TF) that reconstructs neutral speech articulatory trajectories (NATs) from whispered speech articulatory trajectories (WATs) is investigated, such that the dynamic time warped (DTW) distance between the transformed whispered and the original neutral articulatory movements is minimized. Three candidate TFs are considered: an affine function with a diagonal matrix ($A_d$), which reconstructs one NAT from the corresponding WAT, and an affine function with a full matrix ($A_f$) and a deep neural network (DNN) based nonlinear function, both of which reconstruct each NAT from all WATs. Experiments reveal that the transformation could be approximated well by $A_f$, since it generalizes better across subjects and achieves the least DTW distance of 5.20 (±1.27) mm (on average), with a relative improvement of 7.47%, 4.76%, and 7.64% compared to that with $A_d$, the DNN, and the best baseline scheme, respectively. Further analysis to understand the differences in neutral and whispered articulation reveals that the whispered articulators exhibit exaggerated movements in order to reconstruct the lip movements during neutral speech. It is also observed that, among the articulators considered in the study, the tongue exhibits higher precision and stability while whispering, implying that subjects control their tongue movements carefully in order to render intelligible whispered speech. © 2018 Acoustical Society of America.
https://doi.org/10.1121/1.5039750
[JFL] Pages: 3352–3364
I. INTRODUCTION
Whispered speech is typically produced in private con-
versations, in addition to pathological cases such as laryn-
gectomy (Sharifzadeh et al., 2010). Such pathological
conditions lead to several types of alaryngeal speech includ-
ing esophageal speech, tracheoesophageal speech, and
hoarse whispered speech (Wszołek et al., 2014; Gilchrist,
1973). Since whispered speech is produced in the absence of
vocal fold vibrations, it lacks pitch (Tartter, 1989). Several
algorithms exist to reconstruct and synthesize neutral speech
from the less intelligible whispered speech (Sharifzadeh
et al., 2010; Morris and Clements, 2002; Ahmadi et al.,2008; Janke et al., 2014; Mcloughlin et al., 2015; Toda and
Shikano, 2005). Silent speech interfaces (SSIs) also address
this problem of reconstructing neutral speech (Denby et al., 2010). One line of SSI research recognizes words or sentences from articulatory movements (Fagan et al., 2008), followed
by text-to-speech synthesis (Wang et al., 2014, 2012a,b,
2015). On the other hand, certain SSIs convert articulatory
movements into speech via direct synthesis. SSIs based on
the movements of speech articulators are used in the articula-
tory synthesis of neutral speech from the neutral articulation
data (Gonzalez et al., 2016; Toutios and Maeda, 2012;
Toutios and Narayanan, 2013; Fagel and Clemens, 2004;
Beskow, 2003; Aryal and Gutierrez-Osuna, 2016). By trans-
forming whispered articulatory movements into those of
neutral speech, we could employ an articulatory synthesis
framework to synthesize neutral speech. In order to do so, it
is critical to first have an understanding of the relationship
between the articulation in whispered speech and that in neu-
tral speech. For this, we study the whispered and neutral
articulatory movements captured using electromagnetic
articulography (EMA) (Schönle et al., 1987).
It is known that the articulation during whispered speech
differs from that during neutral speech, typically in two ways.
First, exaggerated articulatory movements are known to exist
in whispered speech (Yoshioka, 2008; Osfar, 2011; Schwartz,
1972; Parnell et al., 1977) unlike in neutral speech, in order to
compensate for the lack of pitch in whispers. Second, whis-
pered speech has a longer duration compared to the corre-
sponding neutral speech (Jovičić and Šarić, 2008). There are
several studies that examine the exaggeration in the whispered
articulatory movements. Yoshioka studied the differences in
the palato-lingual contact pattern during the production of
whispered unvoiced and voiced alveolar fricatives, namely, /s/
and /z/, using electro-palatography (Yoshioka, 2008). The
study revealed that the area of contact between the palate and
the tongue during the production of whispered /z/ is larger
compared to that during whispered /s/. The differences in the
movements of the lips during the production of whispered and
neutral bilabial consonants, /b/ and /p/, were studied using
both speech and facial video (Higashikawa et al., 2003). The
study revealed that the average peak opening and closing
velocities and the distance between the upper and the lower lip
for oral opening for /b/ were significantly higher than those for
/p/ while whispering. These studies show that exaggerated
articulation occurs during the production of “voiced” whis-
pered consonants [/z/ and /b/ from Yoshioka (2008) and
Higashikawa et al. (2003), respectively]. Electro-palatography
based experiments with neutral and whispered alveolar conso-
nants, namely, /d/, /t/, and /n/, were done by Osfar (2011).
These experiments found that articulation is more stable and
precise in whispered speech compared to that in neutral
speech, confirming that subjects hyperarticulate while whisper-
ing compared to when they speak normally. These exaggerated
articulatory movements cause the whispered articulatory tra-
jectory (WAT) to differ from the neutral articulatory trajectory
(NAT). To the best of our knowledge, not much investigation
has been done in the literature to understand an underlying
mapping that could relate a WAT to a NAT. This work aims
to better understand the differences in the whispered and neu-
tral articulation. In this regard, we first find a suitable mapping
function to reconstruct each NAT from multiple WATs.
Second, we quantify the amount of exaggeration exhibited by
the whispered articulatory movements and compare it with
that of neutral speech.
We propose an iterative function independent dynamic
time warping (IFI-DTW) optimization to compute the opti-
mal transformation function (TF) to transform WATs, in
order to reconstruct NATs. In the IFI-DTW method, we opti-
mize the TF and the DTW (Müller, 2007) warping path, by
an iterative alternate minimization procedure, till conver-
gence is achieved. Having obtained a transformation from
whispered to neutral articulatory movements, we investigate
the exaggeration in the whispered articulation. In particular,
we analyze the transformed whispered and neutral articula-
tory trajectories, to understand (1) those neutral articulators
whose reconstruction requires exaggerated articulatory
movements while whispering and (2) those articulators that
exhibit exaggerated movements in whispered speech.
II. MAPPING PROCEDURE BETWEEN WHISPERED AND NEUTRAL SPEECH ARTICULATION
A. IFI-DTW optimization
Let us consider articulatory movements of neutral and
whispered utterances available at a sampling frequency of
Fs. Consider the number of training utterances to be N. We
propose an IFI-DTW algorithm to estimate a TF so that the
NATs and transformed WATs have the least distance. Let us
denote the WATs and NATs of Ns articulators corresponding
to the utterance $i$ (after mean subtraction) by $W_i = [w_1, \ldots, w_{T_{W_i}}]$ and $N_i = [n_1, \ldots, n_{T_{N_i}}]$, of lengths $T_{W_i}$ and $T_{N_i}$ samples, respectively ($W_i \in \mathbb{R}^{N_s \times T_{W_i}}$ and $N_i \in \mathbb{R}^{N_s \times T_{N_i}}$), where $w_k$ and $n_k$ denote the $k$th column of $W_i$ and $N_i$. Therefore, each row of $W_i$ (or $N_i$) corresponds to one whispered (or neutral) articulatory trajectory, e.g., the tongue tip, upper lip, etc., and each column corresponds to the frame index along time. Since the lengths of the whispered and neutral utterances need not be equal ($T_{W_i} \neq T_{N_i}$), we use DTW with the Euclidean distance for alignment to compute the distance between them. Therefore, we require an optimal TF, $F^*$, and a set of optimal warping paths $\{m_i^*, i = 1, \ldots, N\}$, that transform WATs to NATs such that the total cost $D$, i.e., the sum of the DTW distances over all training utterances, is minimized, as follows:
$$\left(F^*, \{m_i^*\}\right) = \arg\min_{f, \{m_i\}} D(f, \{m_i\}), \quad (1)$$

where

$$D(f, \{m_i\}) = \sum_{i=1}^{N} D_{m_i}(f(W_i), N_i), \quad (2)$$

where $m_i$ is a DTW warping path between the NATs and the transformed WATs and $D_{m_i}$ is the total squared Euclidean distance computed along $m_i$ for utterance $i$. Let the reconstructed NATs (or the transformed WATs) be $\hat{N}_i = f(W_i) = [f(w_1), \ldots, f(w_{T_{W_i}})] = [\hat{n}_1, \ldots, \hat{n}_{T_{W_i}}]$ (where each row of $\hat{N}_i$ corresponds to one transformed WAT). For an utterance $i$, a warping path $m_i$ of length $L_i$ between $\hat{N}_i$ and $N_i$ consists of the ordered pairs $m_i = \langle m_i^w(l), m_i^n(l) \rangle$, $l = 1, \ldots, L_i$, such that $1 \le m_i^w(l) \le T_{W_i}$ and $1 \le m_i^n(l) \le T_{N_i}$. Therefore, given a warping path $m_i$ for an utterance $i$, we have

$$D_{m_i}(\hat{N}_i, N_i) = \frac{1}{L_i - 1} \sum_{l=1}^{L_i} \left\| \hat{n}_{m_i^w(l)} - n_{m_i^n(l)} \right\|_2^2, \quad (3)$$

where $\| \cdot \|_2$ indicates the L2 norm. Thus, the optimal warping path for each utterance is given by

$$m_i = \arg\min_{m_i'} D_{m_i'}(\hat{N}_i, N_i), \quad i = 1, \ldots, N. \quad (4)$$
This optimization first involves the construction of a distance matrix whose (p, q)th entry denotes the Euclidean distance between $\hat{n}_p$ and $n_q$. Dynamic programming (Müller, 2007) is then employed to compute the optimal warping path through the distance matrix that results in the least overall Euclidean distance [as in Eq. (4)]. From Eqs. (1) and (4), we see that the TF and the DTW warping path depend on each other, which makes the joint optimization in Eq. (1) a challenging task. Therefore, in the IFI-DTW algorithm, we optimize the TF and the DTW warping path using an iterative alternate minimization procedure.
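For concreteness, the following is a minimal sketch of this alignment step (assuming NumPy; the helper and variable names are our own). It builds the frame-wise squared Euclidean distance matrix, recovers the warping path of Eq. (4) by dynamic programming, and returns the path-normalized cost of Eq. (3).

import numpy as np

def dtw_align(N_hat, N):
    """Optimal warping path between transformed WATs N_hat (Ns x Tw)
    and NATs N (Ns x Tn), using squared Euclidean frame distances."""
    Tw, Tn = N_hat.shape[1], N.shape[1]
    # (p, q)th entry: squared distance between column p of N_hat and column q of N
    dist = ((N_hat[:, :, None] - N[:, None, :]) ** 2).sum(axis=0)
    # accumulate cost with the standard step pattern (match, insertion, deletion)
    acc = np.full((Tw, Tn), np.inf)
    acc[0, 0] = dist[0, 0]
    for p in range(Tw):
        for q in range(Tn):
            if p == 0 and q == 0:
                continue
            prev = min(acc[p - 1, q - 1] if p and q else np.inf,
                       acc[p - 1, q] if p else np.inf,
                       acc[p, q - 1] if q else np.inf)
            acc[p, q] = dist[p, q] + prev
    # backtrack to recover the ordered pairs <m^w(l), m^n(l)>
    path = [(Tw - 1, Tn - 1)]
    p, q = Tw - 1, Tn - 1
    while (p, q) != (0, 0):
        candidates = [(p - 1, q - 1), (p - 1, q), (p, q - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        p, q = min(candidates, key=lambda ab: acc[ab])
        path.append((p, q))
    path.reverse()
    # path-normalized cost, matching the 1/(L_i - 1) normalization of Eq. (3)
    cost = sum(dist[a, b] for a, b in path) / (len(path) - 1)
    return np.array(path), cost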
If $f$ is known, we could find the set of optimal paths $\{m_i\}$ using Eqs. (3) and (4) and the total cost $D$ using Eq. (2). Let us now assume that the set of optimal paths $\{m_i\}$ is known. Therefore, for a given utterance $i$, let us define $W_i^{m_i} = [w_{m_i^w(1)}, \ldots, w_{m_i^w(L_i)}]$ and $N_i^{m_i} = [n_{m_i^n(1)}, \ldots, n_{m_i^n(L_i)}]$, such that $w_k$ and $n_k$ represent the $k$th column of $W_i$ and $N_i$, respectively. Since we estimate one TF using all the training utterances, we concatenate $W_i^{m_i}$ (and $N_i^{m_i}$) $\forall i = 1, \ldots, N$ to obtain $W_{\{m_i\}}$ (and $N_{\{m_i\}}$). Specifically, we write $W_{\{m_i\}} = [W_1^{m_1}, \ldots, W_N^{m_N}] \in \mathbb{R}^{N_s \times L}$ and $N_{\{m_i\}} = [N_1^{m_1}, \ldots, N_N^{m_N}] \in \mathbb{R}^{N_s \times L}$, where $L = \sum_{i=1}^{N} L_i$. We then optimize for $F$ as follows:

$$F = \arg\min_f D(f, \{m_i\}) \quad (5)$$
$$\;\; = \arg\min_f \left\| f(W_{\{m_i\}}) - N_{\{m_i\}} \right\|_2^2. \quad (6)$$
In this manner, we optimize for the warping path and the TF using alternate minimization. The expressions to compute different TFs are provided in detail in Secs. II B and II C. In the IFI-DTW optimization, we initialize the TF in the first iteration (denoted by $F^{(1)}$) to be an identity transform. We then obtain the warping paths (denoted by $\{m_i^{(1)}\}$) for each training utterance using DTW [Eq. (4)] and compute the total cost (denoted by $D^{(1)}$) using Eq. (2). Given the set of warping paths, in the next iteration we compute a new TF (denoted by $F^{(2)}$) corresponding to the entire training set using Eq. (6). Given the new TF, we once again compute the warping paths (denoted by $\{m_i^{(2)}\}$) and the new total cost (denoted by $D^{(2)}$). If the total cost in the current iteration is less than that in the previous iteration, we repeat the same procedure of computing the TF and the warping paths, iteratively, till convergence is achieved. The IFI-DTW optimization is described in Algorithm 1.
ALGORITHM 1: IFI-DTW optimization.

1: $D^{(0)}(\cdot) = \infty$
2: Iteration $j = 1$
3: Initial TF: $F^{(1)}$
4: Optimize for $\{m_i^{(1)}\}_{i=1,\ldots,N}$ using $F^{(1)}$ in Eq. (4)
5: Compute $D^{(1)}(F^{(1)}, \{m_i^{(1)}\})$ using Eq. (2)
6: Until convergence:
7: while $D^{(j-1)}(\cdot) > D^{(j)}(\cdot)$ do
8:   $j \leftarrow j + 1$
9:   Optimize for $F^{(j)}$ using $\{m_i^{(j-1)}\}$ in Eq. (6)
10:  Optimize for $\{m_i^{(j)}\}_{i=1,\ldots,N}$ using $F^{(j)}$ in Eq. (4)
11:  Compute $D^{(j)}(F^{(j)}, \{m_i^{(j)}\})$ using Eq. (2)
12: end while
13: Optimal TF: $F^* = F^{(j)}$; optimal set of warping paths: $\{m_i^*\} = \{m_i^{(j)}\}$, $i = 1, \ldots, N$.

We now provide the proof of convergence of the IFI-DTW optimization.

Proof. Consider the IFI-DTW optimization given in Algorithm 1. We need to show

$$D^{(j-1)}(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D^{(j)}(F^{(j)}, \{m_i^{(j)}\}).$$

Since $F^{(j)}$ is the optimal TF corresponding to the set of warping paths $\{m_i^{(j-1)}\}$ [from operation 9 of Algorithm 1], we have

$$D(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j-1)}\}).$$

Since the set of warping paths $\{m_i^{(j)}\}$ is optimal for $F^{(j)}$ [from operation 10 of Algorithm 1], we have

$$D(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j)}\}). \quad (7)$$

Hence, proved.
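For illustration, below is a minimal sketch of the alternate minimization in Algorithm 1, reusing the dtw_align helper sketched above. Here fit_tf stands for whichever TF estimator (affine or DNN, Sec. II B) is plugged in; the function name, convergence tolerance, and iteration cap are our own choices, not the authors' exact implementation.

def ifi_dtw(W_list, N_list, fit_tf, n_iter_max=50, tol=1e-7):
    """Iterative function-independent DTW (a sketch of Algorithm 1).

    W_list, N_list : lists of (Ns x T) whispered / neutral trajectories.
    fit_tf         : callable fitting a TF from aligned frame matrices,
                     e.g., a least-squares affine fit (Sec. II B).
    """
    tf = lambda W: W                       # F^(1): identity transform
    prev_cost = np.inf                     # D^(0) = infinity
    paths = []
    for _ in range(n_iter_max):
        # align each utterance given the current TF (Eq. 4) and total the cost (Eq. 2)
        paths, cost = [], 0.0
        for W, N in zip(W_list, N_list):
            path, d = dtw_align(tf(W), N)
            paths.append(path)
            cost += d
        if prev_cost - cost <= tol:        # convergence check (operation 7)
            break
        prev_cost = cost
        # re-fit the TF on the concatenated aligned frames (Eq. 6)
        W_cat = np.hstack([W[:, p[:, 0]] for W, p in zip(W_list, paths)])
        N_cat = np.hstack([N[:, p[:, 1]] for N, p in zip(N_list, paths)])
        tf = fit_tf(W_cat, N_cat)
    return tf, paths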
B. Candidate transformation functions
In order to understand the underlying function which
transforms WATs to NATs, we consider three candidate
functional forms of the TF.
1. Full affine transformation—Af scheme
Since there exists a dependency among articulatory
movements (Jackson and Singampalli, 2008), we hypothe-
size that several WATs could contribute to reconstruct one
NAT. Therefore, we consider the first candidate to be an
affine transformation, as follows:
$$f(W_{\{m_i\}}^T) = \underbrace{\left[ (W_{\{m_i\}})^T \;\; \mathbf{1}_{L \times 1} \right]}_{W'} \begin{bmatrix} A_{N_s \times N_s} \\ b_{1 \times N_s} \end{bmatrix}, \quad (8)$$

where $W' \in \mathbb{R}^{L \times (N_s + 1)}$. Substituting Eq. (8) in Eq. (6), we get the affine transformation function to be $[(W')^T W']^{-1} (W')^T (N_{\{m_i\}})^T$. It is to be noted that we place no constraints on $A$ or $b$. This special case is similar to the affine independent DTW proposed by Qiao and Yasuhara (2006). Thus, the (p, k)th coefficient of the full matrix $A = A_f$ captures the strength of the relation between the pth WAT and the kth NAT. Let the pth WAT of utterance $j$ be denoted by $w^p \in \mathbb{R}^{T_{W_j} \times 1}$ of length $T_{W_j}$. Then the kth reconstructed NAT $\hat{n}^k$ can be written as

$$\hat{n}^k = \sum_{p=1}^{N_s} a_{p,k} w^p + b_k \mathbf{1}_{T_{W_j} \times 1}, \quad 1 \le k \le N_s. \quad (9)$$
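A sketch of this closed-form least-squares fit of Eq. (8) follows (our own helper name; the optional ridge term is our addition for numerical stability and is not part of the paper's formulation). The returned callable can be passed as fit_tf to the IFI-DTW sketch above.

def fit_affine_full(W_cat, N_cat, ridge=0.0):
    """Fit f(W) = A^T W + b from aligned frames (Ns x L each), as in Eq. (8).

    Solves [(W')^T W']^{-1} (W')^T N^T with W' = [W^T, 1]."""
    Ns, L = W_cat.shape
    W_aug = np.hstack([W_cat.T, np.ones((L, 1))])          # W' in R^{L x (Ns+1)}
    gram = W_aug.T @ W_aug + ridge * np.eye(Ns + 1)
    coeffs = np.linalg.solve(gram, W_aug.T @ N_cat.T)      # (Ns+1) x Ns
    A, b = coeffs[:Ns, :], coeffs[Ns, :]                   # A: Ns x Ns, b: 1 x Ns
    return lambda W: (W.T @ A + b).T                       # maps Ns x T -> Ns x T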
2. Diagonal affine transformation—Ad scheme
To understand how each WAT transforms into the cor-
responding NAT, we consider the matrix A in Eq. (8) to be a
diagonal matrix Ad . Therefore, in this case, we assume that
only the pth WAT contributes to reconstruct the pth NAT
with the (p, p)th coefficient of the diagonal matrix Ad cap-
turing the strength of this contribution. Similar to Eq. (9), we
can express the pth reconstructed NAT $\hat{n}^p$ as follows:

$$\hat{n}^p = a_{p,p} w^p + b_p \mathbf{1}_{T_{W_j} \times 1}, \quad 1 \le p \le N_s. \quad (10)$$
3. Nonlinear transformation—DNN scheme
In the third scheme, we model the dependency among
articulatory movements by a nonlinear transformation using
a deep neural network (DNN). At the jth iteration, operation 9 in Algorithm 1 is executed by providing $W_{\{m_i\}}$ and $N_{\{m_i\}}$ as the input and output, respectively, to a DNN. While
in the first iteration, the DNN is initialized with random
weights, for all iterations j> 1, the DNN is initialized with
the weight matrix from the DNN optimized in the (j – 1)th
iteration. The details of the implementation are provided in
Sec. III C.
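As an illustration of this scheme, a minimal Keras sketch of a frame-wise nonlinear TF is given below. We read "three layer network" as three hidden layers; the layer width, activation, and batch size are drawn from the candidate sets in Sec. III C, but this is our own sketch rather than the authors' exact configuration. The model object is kept across calls so that iteration j is warm-started from iteration j − 1, as described above.

from keras.models import Sequential
from keras.layers import Dense

def make_dnn_tf_fitter(hidden=64, activation="relu", batch_size=64, epochs=20):
    """Returns a fit_tf callable usable in the IFI-DTW sketch above."""
    state = {"model": None}

    def fit_tf(W_cat, N_cat):
        Ns = W_cat.shape[0]
        if state["model"] is None:
            model = Sequential([
                Dense(hidden, activation=activation, input_shape=(Ns,)),
                Dense(hidden, activation=activation),
                Dense(hidden, activation=activation),
                Dense(Ns, activation="linear"),   # linear output layer
            ])
            model.compile(optimizer="adam", loss="mse")
            state["model"] = model
        model = state["model"]
        # frames are columns, so train on the transposed matrices (L x Ns)
        model.fit(W_cat.T, N_cat.T, batch_size=batch_size, epochs=epochs, verbose=0)
        return lambda W: model.predict(W.T).T

    return fit_tf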
C. Baseline schemes
To compare the performance of the candidate TFs, we
use two baseline schemes. In both these schemes, we use a
fixed TF between the WATs and the NATs and do not opti-
mize for the TF. Hence, the IFI-DTW algorithm stops in a
single step.
1. Abs1 scheme
In the first baseline scheme, we define a TF $F^{(1)} = A_{bs1}$ with respect to Algorithm 1 such that the transformation, when applied to the WATs, retains the mean and covariance of the NATs. Let $W = [W_1, \ldots, W_N] \in \mathbb{R}^{N_s \times \sum_{i=1}^{N} T_{W_i}}$ and $N = [N_1, \ldots, N_N] \in \mathbb{R}^{N_s \times \sum_{i=1}^{N} T_{N_i}}$ (concatenated versions of $W_i$ and $N_i$, $\forall i = 1, \ldots, N$), with corresponding mean vectors (across time) $\mu^w = [\mu_1^w, \ldots, \mu_{N_s}^w]^T \in \mathbb{R}^{N_s \times 1}$ and $\mu^n = [\mu_1^n, \ldots, \mu_{N_s}^n]^T \in \mathbb{R}^{N_s \times 1}$, and covariance matrices $\Sigma_w \in \mathbb{R}^{N_s \times N_s}$ and $\Sigma_n \in \mathbb{R}^{N_s \times N_s}$, respectively. To ensure that the transformed WATs, $F^{(1)}(W)$, have their mean vector and covariance matrix equal to $\mu^n$ and $\Sigma_n$, we compute $A = (U_n \Lambda_n^{1/2} \Lambda_w^{-1/2} U_w^T)^T$ and $b = (\mu^n)^T - (\mu^w)^T A$ in Eq. (8). $U_w$, $U_n$ and $\Lambda_w$, $\Lambda_n$ are the matrices of orthonormal eigenvectors and the diagonal matrices containing the eigenvalues, obtained by the eigendecomposition of $\Sigma_w$ and $\Sigma_n$, respectively. In this case, the expression for the reconstruction of the kth NAT is the same as that provided in Eq. (9).
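A sketch of this covariance-matching transform follows (our own helper name; the small eigenvalue floor is our addition to guard against numerically singular covariances).

def fit_abs1(W, N):
    """Abs1 baseline: fixed affine map making transformed WATs match the
    mean vector and covariance matrix of the NATs."""
    mu_w, mu_n = W.mean(axis=1), N.mean(axis=1)
    lam_w, U_w = np.linalg.eigh(np.cov(W))
    lam_n, U_n = np.linalg.eigh(np.cov(N))
    lam_w = np.maximum(lam_w, 1e-12)       # guard against zero eigenvalues
    A = (U_n @ np.diag(lam_n ** 0.5) @ np.diag(lam_w ** -0.5) @ U_w.T).T
    b = mu_n - A.T @ mu_w                  # b^T = mu_n - A^T mu_w
    return lambda X: (X.T @ A + b).T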
2. Abs2 scheme
In the second baseline scheme, we define a TF $F^{(1)} = A_{bs2}$ such that the transformation preserves the mean and the variance of each of the $N_s$ NATs. As defined in Sec. II C 1, let $\mu_p^n$ and $\mu_p^w$ be the means of the pth NAT and WAT, respectively. The standard deviations (SD) of the pth NAT and WAT are denoted as $\sigma_p^n$ and $\sigma_p^w$, respectively. With respect to Eq. (8), we write $A_{p,p} = \sigma_p^n / \sigma_p^w$ and $b_p = \mu_p^n - (\sigma_p^n / \sigma_p^w)\mu_p^w$, $p = 1, \ldots, N_s$, where $A_{p,p}$ is the pth diagonal element of $A$ and $b_p$ is the pth element of $b$. Since $A$ is diagonal, the reconstruction of the pth NAT follows Eq. (10).
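Correspondingly, a short sketch of the Abs2 transform, which only matches the per-trajectory mean and variance:

def fit_abs2(W, N):
    """Abs2 baseline: per-trajectory mean/variance matching (diagonal A)."""
    a = N.std(axis=1) / W.std(axis=1)          # A_{p,p} = sigma^n_p / sigma^w_p
    b = N.mean(axis=1) - a * W.mean(axis=1)    # b_p = mu^n_p - A_{p,p} mu^w_p
    return lambda X: a[:, None] * X + b[:, None]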
III. EXPERIMENTS
A. Dataset
In this work, we recorded both the neutral and whis-
pered articulatory movements of four male (M1, M2, M3,
M4) and two female (F1, F2) subjects using electromagnetic
articulograph AG501 (3D Electromagnetic
Articulograph, 1979). The native language of F1 and F2 is
Tamil and Bengali while that of M1, M2, M3, M4 is
Kannada, American English, Bengali, and Telugu, respec-
tively. None of the subjects were reported to have any
speech disorders. An informed consent was obtained from
each subject, prior to data collection.
We used the 460 phonetically balanced English senten-
ces from the MOCHA-TIMIT database as stimuli for record-
ing (Wrench, 1999). Simultaneous recordings of both audio
and articulatory movements were done in a sound-proof
chamber. In this study, we recorded the articulatory move-
ments of nine articulators, namely, upper lip (UL), lower lip
(LL), left commissure of the lip (LC), right commissure of
the lip (RC), jaw (J), throat (TH), tongue tip (TT), tongue
body (TB), and tongue dorsum (TD). The position of these
sensors is indicated in Fig. 1. We connected the TH sensor
typically near the laryngeal prominence for the subjects, in
order to capture the laryngeal movement as the subjects pho-
nate in neutral and whispered manner. Apart from the nine
sensors, we also connected two sensors needed for head cor-
rection in EMA recording.
Recorded at a sampling frequency of 250 Hz, the move-
ments of each articulator along the two axes (X and Z) of the
midsagittal plane (measured in mm), give rise to a total of
$N_s = 18$ (9 articulators × 2 axes) articulatory trajectories.
Thus, each NAT and WAT correspond to the movement of
an articulator in neutral and whispered speech, respectively.
Since articulatory movements are known to be low-pass in
nature (Ghosh and Narayanan, 2010), we first low pass filter
the articulatory trajectories with a cut-off frequency of 25 Hz
and then downsample to $F_s = 100$ Hz. Figure 2 shows the low pass filtered and downsampled trajectories of the upper lip, jaw, throat, and tongue tip for utterance $i = 2$ of a male sub-
ject, for both, neutral and whispered speech (corresponding
to eight rows from N2 and W2). From the figure, we observe
that the duration of the whispered speech utterance is longer
than that of neutral speech. We also see that the movement
of the articulators along the two axes, follow a similar pat-
tern in, both, whispered and neutral speech. Across all six
subjects, the total duration of neutral and whispered record-
ings is 127.95 and 139.19 min, respectively.
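A sketch of this preprocessing step is given below (assuming SciPy; the zero-phase Butterworth filter and interpolation-based resampling are our choices, since the paper specifies only the 25 Hz cut-off and the 100 Hz target rate).

from scipy.signal import butter, filtfilt

def preprocess_trajectory(x, fs_in=250, fs_out=100, cutoff=25.0, order=4):
    """Low-pass filter an articulatory trajectory at 25 Hz and resample to 100 Hz."""
    b, a = butter(order, cutoff / (fs_in / 2.0), btype="low")
    x_lp = filtfilt(b, a, x)                 # zero-phase low-pass filtering
    # 250 Hz -> 100 Hz is a non-integer ratio (2.5), so resample by interpolation
    t_in = np.arange(len(x)) / fs_in
    t_out = np.arange(0, t_in[-1], 1.0 / fs_out)
    return np.interp(t_out, t_in, x_lp)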
B. Experimental setup
1. Subject-wise setup
We hypothesize that there exists a subject specific artic-
ulation strategy involved while whispering, to compensate
for its lack of intelligibility in the absence of voicing.
Therefore, we perform experiments in a fourfold setup by
dividing the data collected from each subject into four sets
where three sets (345 sentences) are used for training and the
remaining set (115 sentences) for testing. For each fold, we
use the corresponding training set of $N = 345$ utterances to obtain the optimal TF, $F^*$, using the IFI-DTW algorithm. Using $F^*$ in Eqs. (3) and (4), we compute the DTW distances $d_k = D_{m_k}(\hat{N}_k, N_k)$ for the kth test utterance. Let $d_{test} \in \mathbb{R}^{460 \times 1}$ be a vector that consists of $d_k$ from all four folds (115 × 4 = 460) and let $\bar{d}_{test}$ denote the average of these 460 DTW distances, given by $\bar{d}_{test} = \frac{1}{460}\sum_{k=1}^{460} d_k$. Therefore, the best scheme is the one which results in the least $\bar{d}_{test}$ for all subjects.
In order to understand if the dynamics of articulatory
movements could aid a better reconstruction of NATs, we
perform a second experiment. Here we learn a TF using not
only the position data of articulators, but also their dynamics
in a subject-specific manner. For this, we first compute the
velocity and the acceleration coefficients from the
articulatory trajectories. Let $\Delta W_i$ and $\Delta N_i$ be the velocity coefficients and $\Delta\Delta W_i$ and $\Delta\Delta N_i$ be the acceleration coefficients of the ith whispered and neutral utterance, respectively. We then concatenate the trajectories corresponding to position and its dynamics to obtain $W_i^d = [W_i^T, \Delta W_i^T, \Delta\Delta W_i^T]^T$ and $N_i^d = [N_i^T, \Delta N_i^T, \Delta\Delta N_i^T]^T$ for each utterance. The IFI-DTW algorithm is employed to obtain the optimal warping paths and the transformation function between the position and the dynamics of the neutral articulators ($N_i^d$) and those while whispering ($W_i^d$). Therefore, Eq. (2) can be rewritten as

$$D(f, \{m_i\}) = \sum_{i=1}^{N} D_{m_i}(f(W_i^d), N_i^d). \quad (11)$$

FIG. 1. Schematic diagram depicting the placement of the nine sensors along the midsagittal plane and lips of a subject.

In this case, the average DTW distance between the original and the reconstructed NAT would reveal the benefits of utilizing the information about the dynamics of whispered articulatory movements to reconstruct NATs.
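A sketch of the dynamic feature computation is given below; the paper does not specify the exact delta window, so the simple gradient-based form here is an assumption.

def add_dynamics(X):
    """Stack position, velocity (delta), and acceleration (delta-delta)
    trajectories: (Ns x T) -> (3*Ns x T)."""
    delta = np.gradient(X, axis=1)           # first-order dynamics
    delta2 = np.gradient(delta, axis=1)      # second-order dynamics
    return np.vstack([X, delta, delta2])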
2. Cross subject setup
In the cross-subject setup, we test the model trained
using one subject’s positional data on the test data from
another subject. Hence, we could analyze the degree to
which the optimal TF could be subject dependent. We
employ the optimal TF obtained using the training set corre-
sponding to the ith fold of subject $s_{tr}$ to predict the NATs from the test set of the ith fold of subject $s_t$. We use $d_{s_t,s_{tr}} \in \mathbb{R}^{460 \times 1}$ to denote the vector that comprises the DTW distances computed for all test utterances in all four folds, when $s_{tr}$ and $s_t$ denote the training and test subjects. Let $\bar{d}_{s_t,s_{tr}}$ denote the average of these DTW distances.
C. Parameters
For the sake of practical implementation of the $A_f$, $A_d$,
and the DNN schemes, convergence is said to be achieved in
the IFI-DTW optimization if step 7 of algorithm 1 is satisfied
considering seven digits after the decimal point (experimen-
tally observed). In both the subject-wise and cross subject
setups, we use a three layer network for the DNN scheme.
Using 15% of the training set as the validation set, we opti-
mize for different parameters such as the number of hidden
neurons in each layer (candidates: 64, 128, 256, and 512),
the activation functions (candidates: “tanh” and “relu”) and
the batch size (candidates: 16, 32, and 64). We use the
“linear” activation in the output layer. For each fold, we
choose the optimal parameters based on the best performing
DNN architecture (in terms of the minimum DTW distance)
on the validation set. Optimization is done using ADAM
(Kingma and Ba, 2014), with mean squared error as the loss
function. The implementation of the DNN is done using
KERAS (Chollet, 2015) and THEANO (Team et al., 2016)
libraries.
D. Broad class phoneme (BCP) specific analysis
We perform a BCP specific study in order to know the
accuracy of the NAT reconstruction in each class when the
TF and the warping paths optimized on the entire set are
used. In order to do so, we use the KALDI toolkit (Povey et al., 2011) to perform a forced alignment of the recorded speech
data (obtained during EMA recording), using a Gaussian
mixture model-hidden Markov model (GMM-HMM) setup,
with a reduced phoneme set consisting of thirty nine pho-
nemes (including silence) (Lee and Hon, 1989) considered in
the TIMIT database (Garofolo et al., 1993). Using the fine-to-broad phone class mapping described by Scanlon et al. (2007), we map the 39 phonemes to the five broad phoneme
categories, namely, vowels, stop consonants, fricatives,
nasals, and silence. Thus, from the forced aligned boundaries,
we obtain the BCP boundaries. These boundaries are manu-
ally checked and corrected in case of any errors.
For the kth test utterance, we obtain $\hat{N}_k$ using the different schemes described in Secs. II B and II C. We extract the segments corresponding to each of the five BCP categories from both $\hat{N}_k$ and $N_k$ for every utterance k and compute the
segment-wise DTW distances, for all schemes. This is done
subject wise, for all the test utterances, k¼ 1,…,115, in each
fold. We report the average of these segment-wise distances,
across six subjects, for each of the five BCP categories,
obtained using the different TFs considered in the study.
IV. PERFORMANCE OF THE MAPPING METHODS
A. Subject-wise experimental results
The number of iterations to achieve convergence in the
IFI-DTW optimization, averaged across all folds and all sub-
jects, turns out to be 6.63 (±1.71), 5.75 (±1.78), and 5.46 (±2.15) for the $A_f$, $A_d$, and DNN schemes, respectively
[the numbers in brackets represent standard deviation (SD)].
For the DNN scheme, based on the performance in the vali-
dation set, the optimal number of neurons in the hidden
layers is found to be 64 for all folds of all subjects, except
for the fourth fold of subject M4, in which the optimal
number turns out to be 128.

FIG. 2. Trajectories of upper lip (UL), jaw (J), throat (TH), and tongue tip (TT), along the X and Z directions, for the utterance “Is this see-saw safe?” of subject M1, in neutral speech (continuous lines) and in whispered speech (dashed lines).

We find the “relu” activation
function and a batch size of 64 to be the optimal parameters
across all subjects. The results for the subject-wise setup are
provided in Table I. The corresponding box plots of the dtest
from the five schemes for each of the six subjects are
included as supplementary material.1
From the table, it is clear that for each subject, the Af
scheme results in the least average DTW distance (indicated
by bold entry in each column) between the reconstructed and
original NATs. Averaged across all subjects and folds, the
DTW distance between the reconstructed and the original
NAT turns out to be 5.20 (±1.27) mm for the $A_f$ scheme, 5.62 (±1.38) mm for the $A_d$ scheme, 5.46 (±0.96) mm for the DNN scheme, and 9.09 (±2.96) mm and 5.63 (±1.39) mm for the $A_{bs1}$ and $A_{bs2}$ schemes, respectively. From the table, we
observe a decrease of 42.79% and 7.64% (relative) in the Af
scheme compared to the Abs1 and Abs2 schemes, respec-
tively, averaged across all six subjects. The poor perfor-
mance of Abs1 scheme reveals that a TF that preserves the
mean and the covariance of the NATs alone, does not pro-
vide an optimal transformation from WATs to NATs.
Interestingly, the performance of the Abs2 scheme is similar
to that of the Ad scheme. This indicates that the optimal TF
learnt iteratively in the Ad scheme, tries to preserve the vari-
ance of the NATs.
From the table, we find a relative decrease in the aver-
age DTW distance in the Af scheme, with respect to the Ad
scheme, by 7.44%, 7.85%, 6.49%, 8.03%, 8.84%, and 6.21%
for the six subjects. The improved performance of the Af
scheme compared to the Ad scheme, reveals that several
WATs contribute to reconstruct a single NAT. Comparing
with the DNN scheme, we observe a relative drop in the
average DTW distance in the Af scheme by 2.86%, 4.78%,
2.75%, 7.13%, 4.63%, and 5.63%, for six subjects. In order
to examine if the performance of the Af scheme is statisti-
cally significant compared to the other schemes, we perform
a t-test. For each of the schemes Ad, DNN, Abs1, and Abs2
we consider the null hypothesis to indicate that the differ-
ence of dtest from Af and dtest from the considered scheme
comes from a normal distribution with zero mean and
unknown variance. The alternate hypothesis is that this dif-
ference comes from a normal distribution whose mean is
less than zero. The statistical analysis reveals that the null
hypothesis is rejected at 5% significance level (all p-values
≤3.84e−22) for all schemes. We find similar results (all p-values ≤5.23e−202) when the described t-test is performed across all subjects. This indicates that the $d_{test}$ obtained from
the Af scheme is statistically significantly lower than those
obtained from the other schemes.
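A sketch of the significance test described above follows (assuming SciPy; the paired one-sided formulation mirrors the stated null and alternate hypotheses, and the helper name is our own).

from scipy import stats

def paired_one_sided_ttest(d_af, d_other, alpha=0.05):
    """Test whether per-utterance DTW distances from the A_f scheme are
    significantly lower than those from another scheme.

    H0: mean(d_af - d_other) = 0;  H1: mean(d_af - d_other) < 0."""
    diff = np.asarray(d_af) - np.asarray(d_other)
    t_stat, p_two_sided = stats.ttest_1samp(diff, popmean=0.0)
    # convert to a one-sided p-value for the 'mean < 0' alternative
    p_one_sided = p_two_sided / 2.0 if t_stat < 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided < alpha, p_one_sided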
For illustration, Fig. 3 shows the reconstructed TDx tra-
jectory using different TFs for one utterance from subject
F2. We see that the reconstructed NAT using Af scheme
closely approximates the original NAT, better than the other
schemes (rectangular box indicated for each scheme illus-
trates this in the figure). We also observe from Figs. 3(C),
3(F), and 3(A) that the reconstructed NAT using Ad
and Abs2 schemes are scaled versions of the original WAT.
The reconstructed NAT from the Af scheme is found to be
smoother than that from the DNN scheme [Fig. 3(D)].
Let us now consider the average DTW distance between
the original NATs and those reconstructed using, both, the
position and the dynamics of WATs. Table II provides these
distances for the two best performing schemes, namely, the
$A_f$ and DNN schemes. The corresponding box plots of the $d_{test}$ from these two schemes for each of the six subjects are included as supplementary material.1

TABLE I. $\bar{d}_{test}$ (SD), in mm, across all folds of the six subjects.

Schemes   F1            F2            M1            M2            M3            M4
Af        5.10 (0.85)   6.57 (1.17)   3.89 (0.68)   5.73 (1.08)   5.36 (0.97)   4.53 (0.86)
Ad        5.51 (0.83)   7.13 (1.24)   4.16 (0.70)   6.23 (1.13)   5.88 (1.12)   4.83 (0.87)
DNN       5.25 (0.20)   6.90 (0.23)   4.00 (0.17)   6.17 (0.17)   5.62 (0.18)   4.80 (0.12)
Abs1      10.51 (1.02)  10.59 (3.05)  6.55 (0.79)   7.52 (1.30)   13.08 (1.56)  6.31 (0.86)
Abs2      5.51 (0.84)   7.13 (1.25)   4.16 (0.70)   6.24 (1.14)   5.89 (1.11)   4.83 (0.88)

FIG. 3. (A) provides the DTW mapped original NAT and WAT TDx of subject F2 corresponding to the utterance “Bright sunshine shimmers on the ocean.” (B)–(F) provide the DTW mapped original and the reconstructed NAT using different schemes, mentioned in the respective figures.

Comparing Tables I
and II, we find that for each subject, the average DTW dis-
tances reduce when the $\Delta$ and $\Delta\Delta$ coefficients are considered to reconstruct the NATs. We observe a relative drop in the
dtest by 1.57%, 2.44%, 1.29%, 0.7%, 1.31%, and 1.77%,
when the velocity and acceleration coefficients are used in
the best scheme compared to when they are not, for each of
the six subjects. We perform a t-test, similar to the descrip-
tion provided in Sec. IV A, to find if the inclusion of the
dynamics decreases the dtest significantly compared to using
the position data alone. The statistical analysis reveals that
the inclusion of the dynamics significantly improves the per-
formance, for both the schemes. Therefore, we find that the
information about the dynamics of the articulatory move-
ments helps in reconstructing NATs better from WATs.
Similar to the observation from Table I, we see that the per-
formance of the Af scheme is comparable to that of the
DNN scheme. We perform a t-test to check for the signifi-
cance in the difference between the performance of the two
methods. We find that except for subject F1, the null hypoth-
esis is rejected at 5% significance level. This indicates that
the optimal TF could be approximated well using an affine
function compared to using a complex nonlinear function as
learnt by a DNN.
B. Cross subject experimental results
Since the Af and DNN schemes are found to exhibit the
least �dtest in the subject-wise setup, we report the results of
these two methods for the experiments described in Sec.
III B 2. Figure 4 provides the box plots of $d_{s_t,s_{tr}}$ for every test-train pair, for the $A_f$ and DNN schemes. In both cases, we see that the least average (also the median) DTW distance, $\bar{d}_{s_t,s_{tr}}$, is achieved when the training and test subjects are
identical (matched case). This shows that the optimal TFs
are subject dependent, which supports the hypothesis that
there could be subject specific differences in articulation to
make whispered speech more intelligible in absence of pitch.
The relative increase in $\bar{d}_{s_t,s_{tr}}$ from the matched case to
the worst mismatched case, using Af scheme and the DNN
scheme turns out to be 17.65% and 37.25% for F1, 6.55%
and 21.77% for F2, 28.54% and 54.24% for M1, 22.16% and
39.62% for M2, 30.60% and 49.25% for M3, and 10.38%
and 32.45% for M4. Hence, we find that the performance
using the worst model for the Af scheme is better than that
using the DNN scheme. This larger drop in the performance
of the DNN scheme compared to the Af scheme, in the cross
subject setup could be due to over-training in the subject
specific fine tuning of the DNN parameters. Performing a t-test as described in Sec. IV A, we find that the $d_{s_t,s_{tr}}$ from the $A_f$ scheme is statistically significantly lower than that of the DNN scheme (p-values ≤3.84e−22), for all test-train pairs.
Figure 4 also indicates that the optimal affine transformation
is more generalizable compared to the finely tuned nonlinear
TF learnt using a DNN.
C. Results of BCP specific analysis
Table III provides the details of the number of segments
for each BCP category along with the average duration of
each segment for, both, whispered and neutral speech, for
each subject. We see that the number of vowel segments is
the highest across all subjects. In a decreasing order of the
number of segments, on average, the “Vowels” category is
followed by “Fricatives,” “Stops,” “Silence,” and, finally,
“Nasals.” The differences in the average durations of differ-
ent BCP categories across neutral and whispered speech can
be observed from the table. We find that Vowels, Fricatives,
and Silence categories have a longer average duration while
whispering compared to that in neutral speech for at least
five among the six subjects.
Table IV provides the segment-wise DTW distances for
the five BCP categories, averaged across all subjects and
folds, for the different schemes considered in the study.
From the table, we see that for all BCP categories, the aver-
age distance is the least for the Af scheme. We observe that
the average distance is the highest for the “Silence” category
in all schemes. This could be due to the fact that the posi-
tions of the NATs and WATs during different silence seg-
ments may not exhibit similar patterns, and hence, are
difficult to reconstruct.

TABLE II. $\bar{d}_{test}$ (SD), in mm, across all folds of the six subjects using both the position and the dynamics of articulatory movements.

Schemes   F1            F2            M1            M2            M3            M4
Af        5.03 (0.84)   6.41 (1.14)   3.84 (0.68)   5.69 (1.07)   5.29 (0.95)   4.45 (0.88)
DNN       5.02 (0.78)   6.50 (1.05)   3.95 (0.63)   5.97 (1.00)   5.33 (0.86)   4.59 (0.81)

FIG. 4. (Color online) Box plots of $d_{s_t,s_{tr}}$ across folds for each test-train subject pair obtained from the $A_f$ (in red, left) and DNN (in blue, right) schemes.

Similar to the discussion in Sec. IV A, the $A_{bs1}$ scheme is found to perform poorly compared
to the rest, while the Abs2 scheme has a performance compa-
rable to that of the Ad scheme. The relative increase in the
average distance of the best performing category in the DNN
scheme, namely, Fricatives, compared to that in Af scheme
is found to be 5.55%. Similarly, the Nasals category with the
least average distance in the Ad scheme is seen to be 6.07%
(relative) higher than that of the Af scheme. A t-test as
described in Sec. IV A, reveals that the BCP specific DTW
distances obtained from the Af scheme is statistically signifi-
cantly lower than those from the other schemes, across all
folds, BCP categories, and subjects (p-values ≤1.2e−3).
This shows that the optimal affine TF is capable of recon-
structing the different BCP categories, better, than the other
schemes considered in the study.
V. ANALYSIS OF THE DIFFERENCES BETWEEN ARTICULATION IN NEUTRAL AND WHISPERED SPEECH
A. The Af transformation
Figure 5 shows the $N_s \times N_s$ matrices, $A = A_f$, obtained
from one fold of each of the six subjects. From the figure,
we make two major observations. First, we observe that the
matrix A is not a purely diagonal matrix, which explains the
deterioration in the performance of the Ad scheme, com-
pared to the Af scheme. Second, we observe a subject spe-
cific difference in the structure of the Af matrix. The fall in
the performance in the cross subject setting (Sec. IV B) could
be a result of this subject specific nature of the TF. It could
be that, each subject modifies the articulation during whis-
pering compared to neutral speech in his/her own specific
manner to compensate for the loss of pitch in whispered
speech.
We see that several WATs contribute in the reconstruc-
tion of a single NAT, indicating that the motion of one artic-
ulator in neutral speech is encoded in multiple articulatory
motion during whispering. In order to understand the signifi-
cance of the contribution of each WAT to a particular NAT,
we perform a t-test at 5% significance level, with a null
hypothesis that its contribution is, indeed, zero. Table V lists,
for each NAT, the WATs whose contribution is significant
in every fold of all subjects, From the table, it is clear that
the information about one NAT is captured by a few WATs.
We observe that every WAT contributes significantly
towards the reconstruction of the corresponding NAT,
except for LCz and TBz.

TABLE III. The number of segments for each BCP category, along with the average (SD) duration of each segment, in ms, for whispered and neutral speech, for each subject.

Subject  Category    Segments  Whispered, ms    Neutral, ms
F1       Vowels      5278      98.88 (46.03)    92.67 (40.73)
F1       Stops       1129      65.81 (21.45)    80.18 (24.79)
F1       Fricatives  1405      103.49 (40.98)   101.47 (40.35)
F1       Nasals      963       91.48 (31.13)    89.48 (25.86)
F1       Silence     1091      83.49 (50.66)    70.63 (42.78)
F2       Vowels      4950      91.80 (44.52)    99.43 (42.84)
F2       Stops       1154      72.10 (26.21)    88.08 (32.61)
F2       Fricatives  1226      110.53 (49.46)   129.61 (51.33)
F2       Nasals      863       95.28 (38.09)    102.29 (33.87)
F2       Silence     1306      105.55 (78.10)   89.49 (66.69)
M1       Vowels      5516      110.43 (47.98)   91.29 (38.17)
M1       Stops       1217      79.54 (28.48)    65.12 (28.97)
M1       Fricatives  1399      101.37 (38.87)   99.04 (38.67)
M1       Nasals      1030      96.14 (32.00)    98.44 (29.70)
M1       Silence     1850      101.83 (70.61)   101.42 (59.16)
M2       Vowels      5259      105.93 (57.68)   91.68 (46.34)
M2       Stops       1421      99.02 (38.61)    86.03 (34.72)
M2       Fricatives  1532      117.68 (53.34)   106.36 (42.55)
M2       Nasals      1028      96.48 (41.44)    89.14 (34.18)
M2       Silence     440       93.66 (64.05)    75.91 (40.28)
M3       Vowels      5515      112.13 (53.08)   98.50 (46.28)
M3       Stops       1327      77.62 (29.15)    87.45 (31.18)
M3       Fricatives  1547      115.92 (47.87)   110.47 (46.30)
M3       Nasals      1049      95.70 (32.94)    96.26 (30.65)
M3       Silence     1145      101.08 (59.22)   80.10 (47.08)
M4       Vowels      5518      103.43 (50.58)   80.51 (35.16)
M4       Stops       1273      77.01 (31.68)    85.70 (29.33)
M4       Fricatives  1474      115.89 (47.87)   105.07 (49.98)
M4       Nasals      1030      91.25 (35.61)    80.36 (29.53)
M4       Silence     1470      105.15 (56.56)   81.02 (50.18)

TABLE IV. The average (SD), in mm, of the segment-wise DTW distances for the five BCP categories, across all subjects and folds.

Schemes  Vowels        Stops         Fricatives    Nasals        Silence
Af       6.76 (3.96)   6.01 (3.16)   5.49 (2.89)   5.54 (2.70)   6.89 (4.05)
Ad       7.05 (3.90)   6.41 (3.14)   6.00 (2.94)   5.88 (2.67)   7.25 (4.05)
DNN      7.13 (4.06)   6.36 (3.23)   5.79 (2.96)   5.88 (2.82)   7.17 (4.11)
Abs1     10.99 (5.66)  11.02 (5.44)  10.42 (5.25)  10.40 (5.13)  11.62 (5.78)
Abs2     7.07 (3.93)   6.43 (3.16)   6.02 (2.96)   5.89 (2.69)   7.27 (4.07)

For these two NATs, the
corresponding WATs contribute significantly in every fold
of all subjects, except in one fold of subject F2. In most
cases, we observe that a WAT, apart from contributing sig-
nificantly towards the reconstruction of the corresponding
NAT, contributes significantly towards the reconstruction of
other NATs. From the table we find that for each NAT, at
least two different WATs contribute significantly for its
reconstruction. For instance, we see that the lip movements
while whispering contribute significantly to the reconstruc-
tion of five out of six NATs corresponding to the tongue.
Although the production of whispered speech does not
involve vibrations of the vocal folds, interestingly, we find
that the WATs THx and THz contribute significantly to
reconstruct the corresponding NATs. Laryngeal movement
is known to occur during human speech production for the
control of sound intensity and pitch (Curry, 1937; Ludlow,
2005). Specifically, an upward movement of the larynx is
observed during an increase in pitch (Curry, 1937). A study
based on magnetic resonance imaging, to understand the
phonation of whispered and neutral vowels, reveals that
the (upward and downward) position of the larynx, along the
mid sagittal plane, is similar across whispered and neutral
speech (Coleman et al., 2002). This is in agreement with our
finding, using EMA, that the movement of the TH sensor
along the Z direction while whispering, contributes signifi-
cantly to that in neutral speech (Table V). This indicates that
there exists a similarity in the laryngeal movements during
whispered and neutral speech.
In order to understand the similarities among the articu-
latory trajectories during neutral and whispered speech, we
compute the correlation coefficient between each of the 18
articulatory trajectories in $W_{\{m_i\}}$ with those in $N_{\{m_i\}}$. Figure
6 provides the correlation coefficient between the WATs and
the NATs, averaged across all folds and all subjects. From
the figure, we find that there exists a higher correlation
within the movements of certain sensors on the lips and
within those on the tongue during whispering and neutral
speech. In accordance with this observation, from Table V
we find that among the lip sensors, for each NAT at least one
other WAT belonging to the lips contributes significantly.
We observe a similar trend for each NAT corresponding to the tongue sensors, as well.

FIG. 5. The matrix $A_f$ obtained from one fold for each of the six subjects. A brighter pixel indicates a larger value, as indicated by the color bar.

TABLE V. Significant WATs to reconstruct each NAT.

NAT    Significantly contributing WATs
ULx    ULx, ULz, LLx, Jz
ULz    ULz, LLz, RCx, LCz, THx, TDz
LLx    ULz, LLx, RCx, LCx, Jx, THx
LLz    ULx, LLz, Jx
RCx    ULz, LLx, LLz, RCx, LCx, Jz, TBz
RCz    ULx, ULz, LLx, LLz, RCx, RCz, LCx, LCz, THz
LCx    LLz, RCx, LCx, THz, TBz
LCz    ULz, RCx, THx
Jx     ULz, Jx, THz, TDz
Jz     ULx, LLx, LLz, RCz, Jx, Jz, THx
THx    ULx, LLz, RCx, THx, THz, TBz, TDz
THz    ULx, Jx, THz
TTx    ULz, LLz, Jx, TTx, TBz, TDx, TDz
TTz    THz, TTz, TBx, TBz
TBx    ULz, LLz, Jx, Jz, THz, TTz, TBx, TDx, TDz
TBz    LCx, Jx, TBx
TDx    LLx, LLz, Jx, TBx, TDx, TDz
TDz    LCz, Jx, TBx, TBz, TDx, TDz

We observe
from the figure that the articulatory trajectories of the throat
exhibit a lower correlation with the other articulatory trajec-
tories irrespective of the degree of its proximity to the
sensors. Interestingly, from Table V we find that THz con-
tributes significantly to the reconstruction of TTz but not to
the vertical movements of the proximally close TB or TD sensors. This could be because all WATs that are highly cor-
related to a particular NAT need not contribute significantly
to reconstruct that NAT, since they could capture redundant
information (Table V).
From Fig. 6 we find that certain WATs are more corre-
lated with their neutral counterparts compared to others.
This could indicate that although the speech motor control
plans are similar between whispered and neutral speech
(Coleman et al., 2002), there could be some patterns in artic-
ulatory movements that are specific to whispered speech.
The reconstructed NATs could be used to synthesize neutral
speech by employing articulatory synthesis systems (Aryal
and Gutierrez-Osuna, 2016). We proceed to understand the
exaggeration of articulatory movements in whispered speech
in comparison to that in neutral speech using the optimal TF.
B. Quantifying the exaggeration of articulatory movements in whispered speech
We hypothesize that, corresponding to a small displace-
ment in the movement of certain neutral articulators, there
could be an exaggeration of the whispered articulators via
larger displacements in the WATs. In order to test this
hypothesis, we consider the following approach. From Eq.
(8), we see that the kth column of the affine transformation
matrix transforms the WATs to reconstruct the kth NAT. Consider $W_{\{m_i^*\}}$, constructed from the optimal set of warping paths obtained from the IFI-DTW algorithm. Let $w_p^* \in \mathbb{R}^{L \times 1}$ represent the pth WAT, corresponding to the pth column of $W_{\{m_i^*\}}^T$; let $a_{p,k}$ be the (p, k)th coefficient of the matrix $A_f$ in the optimal TF and $b_k$ be the coefficient corresponding to the DC shift. With regard to Eq. (8), the kth reconstructed NAT (the kth column of $\hat{N}_{\{m_i^*\}}^T$), $\hat{n}_k^* \in \mathbb{R}^{L \times 1}$, can be written as follows:

$$\hat{n}_k^* = \sum_{p=1}^{N_s} a_{p,k} w_p^* + b_k \mathbf{1}_{L \times 1} \;\;\Rightarrow\;\; \sum_{p=1}^{N_s} a_{p,k} w_p^* + b_k \mathbf{1}_{L \times 1} - \hat{n}_k^* = 0. \quad (12)$$
To study the amount of contribution by different WATs
to reconstruct a particular NAT, we compute the angle
between the corresponding transformation plane (TP) given
in Eq. (12) and a reference plane. Since a DC shift in the TP
is of no consequence in the computation of the angle, we
neglect the effect of the DC shift coefficient, $b_k$. Therefore, from Eq. (12), we see that the normal vector of the TP to reconstruct the kth NAT is given by $[a_{1,k}, \ldots, a_{N_s,k}, -1]^T$. The normal to the reference plane is considered as $[\mathbf{0}_{N_s \times 1}^T, 1]^T$. Since the angle between the two planes is given by the angle between their normal vectors, we compute $\theta_k$ as follows:

$$\theta_k = \cos^{-1}\left(\frac{1}{\sqrt{1 + \sum_{p=1}^{N_s} a_{p,k}^2}}\right). \quad (13)$$
Let us consider the case when $\theta_k < 45°$, which, equivalently, results in the condition $\sum_{p=1}^{N_s} a_{p,k}^2 < 1$. This implies that $0 \le a_{p,k} < 1$, $\forall p, k$. A value of $a_{p,k} < 1$ indicates that there are higher variations in the movements of the pth WAT in order to reconstruct the kth NAT. $\theta_k < 45°$ indicates that the WATs exhibit a larger variation in their movements in the 18-dimensional space in order to produce a small variation in the kth NAT. Hence, a low $\theta_k$ ($\theta_k < 45°$) would indicate that the whispered articulatory movements could be exaggerated in order to reconstruct a small displacement of the kth NAT.
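A short sketch of this per-NAT angle computation from the optimal full affine matrix, following Eq. (13), is given below (helper name is our own).

def transformation_plane_angles(A):
    """Angle (in degrees) between each NAT's transformation plane and the
    reference plane, computed from the Ns x Ns affine matrix A_f (Eq. 13)."""
    # theta_k = arccos(1 / sqrt(1 + sum_p a_{p,k}^2)), one angle per column k
    col_energy = (A ** 2).sum(axis=0)
    return np.degrees(np.arccos(1.0 / np.sqrt(1.0 + col_energy)))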
For each subject, we compute the angles $\theta_k$, $k = 1, \ldots, N_s$, corresponding to the $N_s$ NATs in every fold. We observe that across all subjects and folds, on average, 9.63 (±3.35) NATs have an angle less than 45°. Therefore, for each subject, we find those NATs whose TP has an angle among the lowest five (out of 18) in at least three out of the four folds. We observe that the number of NATs that require exaggerated movements of the whispered articulators varies across subjects. The lip, jaw, and throat articulators turn out to have an angle lower than 45° for at least one subject. Averaged across all subjects and folds, the TPs of the neutral articulators RCz, LLx, LCx, ULx, LCz, RCx, Jx, and THx have an angle of 38.70° (±2.66°), 37.32° (±4.24°), 32.54° (±10.22°), 30.89° (±3.99°), 29.84° (±4.15°), 28.93° (±1.03°), 22.54° (±1.23°), and 21.78° (±3.38°), respectively.
Specifically, we find that three sensors on the lips, namely,
ULx, LLx, and LCx have lower angles for at least three among
six subjects. This could indicate that (1) the reconstruction
of the neutral articulatory movements that require exaggera-
tion in the whispered articulation could be subject dependent
and (2) the reconstruction of the movements of the sensors
on the lips during neutral speech requires exaggerated movements of the WATs.

FIG. 6. The correlation coefficient matrix between the WATs and the NATs, averaged across all folds of the six subjects.
C. Stability and precision of whispered articulatory movements
Study of palato-lingual contact patterns using electro-
palatography has shown that the articulation in whispered
speech is more stable, hence less variable and more precise
leading to a lower velocity of whispered articulatory move-
ments, compared to those in neutral speech (Osfar, 2011).
Osfar claims that an increase in the stability and precision in
the movements of articulators while whispering is an indica-
tion of hyperarticulation during whispering. Unlike this
work, where the primary focus is to understand the hyperarti-
culation by the tongue, in our work, we study the effects of
whispering on articulation, by the lips, jaw and throat, in
addition to the tongue, using EMA.
First, we analyze the precision in the whispered articula-
tion with regard to the velocity of the articulatory move-
ments while whispering. We compute the velocity of the
articulatory movements, in terms of their delta coefficients
($\Delta$). Similar to $W_{\{m_i\}}$ and $N_{\{m_i\}}$ (Sec. II A), we define $\Delta W_{\{m_i\}} = [\Delta W_1^{m_1}, \ldots, \Delta W_N^{m_N}] \in \mathbb{R}^{N_s \times L}$ and $\Delta N_{\{m_i\}} = [\Delta N_1^{m_1}, \ldots, \Delta N_N^{m_N}] \in \mathbb{R}^{N_s \times L}$. In order to examine the relative changes in the velocity of articulatory movements during neutral and whispered speech, we learn an optimal diagonal affine transformation function (Sec. II B 2) between $\Delta W_{\{m_i\}}$ and $\Delta N_{\{m_i\}}$, following the optimization in Eq. (6), as follows:

$$F = \arg\min_f \left\| f(\Delta W_{\{m_i\}}) - \Delta N_{\{m_i\}} \right\|_2^2. \quad (14)$$
The warping paths are optimized using the position data, as
given in Eq. (4). We now examine the (p, p)th coefficient
(p¼ 1,…, Ns) of the optimal diagonal TF obtained using Eq.
(14). A coefficient greater than 1 indicates that the velocity
of the whispered articulator is lower than that of the neutral
articulator.
Table VI lists the set of articulators whose coefficient in
the diagonal TF, obtained from Eq. (14), is greater than 1, in
at least in one of the four folds for each subject. From the
table, we observe that the set of articulators that exhibit a
lower velocity in whispered speech is subject dependent.
This could indicate a subject-specific nature of hyperarticu-
lation in whispered speech. Interestingly, we find that for
every subject, at least one sensor on the tongue, shows
reduction in its velocity, and, hence, more precise move-
ments. This is in accordance with the findings by Osfar
(2011), in which the tongue movements were found to be more
precise in whispered speech.
Figure 7 shows, in the order of decreasing value, the
coefficients in the optimal TF, averaged across folds and
subjects. From the figure, we find that the sensors on the
tongue and the jaw exhibit a higher precision in their move-
ments compared to the sensors on the lips. Specifically, the
articulatory trajectories of Jx, TTx, TTz, TBx, and TDx are
observed to have a lower velocity while whispering for at
least three among six subjects. Interestingly, we observe that
among these WATs, Jx and TBx contribute significantly to
reconstruct ten among eighteen NATs (Table V).
Motivated by the work by Osfar (2011), we also exam-
ine which among the Ns whispered articulators, exhibit
reduced variability and, hence, more stability compared to
their neutral counterparts. For this, we compute the SD of
the velocities of WATs and NATs using samples in the kth
column of $\Delta W_{\{m_i\}}^T$ and $\Delta N_{\{m_i\}}^T$, respectively, as $\sigma_k^{\Delta w}$ and $\sigma_k^{\Delta n}$. We then compute the variance ratio, $VR_k = (\sigma_k^{\Delta w})^2 / (\sigma_k^{\Delta n})^2$. A value of $VR_k < 1$ indicates that the movement of the kth whispered articulator is more stable, since the variability of the velocity of the kth WAT is lower than that of the kth NAT. In agreement with the previous findings, we
observe that the average VR of the tongue sensors is lower
compared to those of other articulators. Specifically, the sen-
sors TBx, TDx, TTx, and TDz are observed to have a VR< 1,
consistently, in every fold for all subjects. Their average
(SD) VR turns out to be 0.55 (±0.11) for TBx, 0.56 (±0.12) for TDx, 0.63 (±0.18) for TTx, and 0.65 (±0.08) for TDz. This
indicates that there exists a greater stability in the movement
of the tongue, while whispering. Comparing with the find-
ings of the precision analysis of the articulatory movements,
we observe that most sensors placed on the tongue, show an
increase in, both, stability and precision in their movements,
while whispering. It could be that controlling the articulation of the tongue is key to improving the intelligibility of whispered speech, compared to the other articulators considered in this study.

TABLE VI. Subject-wise listing of articulators that exhibit movements of reduced velocity during whispering compared to those in neutral speech.

Subject  Articulators with lower velocity
F1       THz, TTz, TBx, TBz
F2       RCx, Jx, TTx, TBx, TDx
M1       TDx
M2       Jz, THx, THz, TTz, TBx, TDx, TDz
M3       LLx, RCx, LCx, LCz, Jx, Jz, THx, THz, TTx, TTz, TBx, TBz, TDx, TDz
M4       ULz, LLx, LCz, Jx, Jz, THx, TTx, TBx, TDx

FIG. 7. Coefficients of the optimal diagonal TF, averaged across all subjects and folds. The error bar indicates SD.
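A sketch of the velocity-based precision and stability measures of this subsection follows (the helper name and the gradient-based velocity estimate on the aligned frames are our own assumptions; the diagonal TF coefficient follows Eq. (14) and VR_k follows the definition above).

def velocity_precision_and_stability(W_cat, N_cat):
    """Per-articulator velocity analysis on DTW-aligned frames (Ns x L each).

    Returns the diagonal TF coefficients of Eq. (14) (values > 1 suggest a
    lower whispered velocity, i.e., a more precise movement) and the variance
    ratio VR_k (values < 1 suggest a more stable whispered movement)."""
    dW = np.gradient(W_cat, axis=1)       # velocities of the aligned WATs
    dN = np.gradient(N_cat, axis=1)       # velocities of the aligned NATs
    # diagonal least-squares fit dN_k ~ a_k * dW_k + b_k, per trajectory k
    dW_c = dW - dW.mean(axis=1, keepdims=True)
    dN_c = dN - dN.mean(axis=1, keepdims=True)
    a = (dW_c * dN_c).sum(axis=1) / (dW_c ** 2).sum(axis=1)
    vr = dW.var(axis=1) / dN.var(axis=1)
    return a, vr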
VI. CONCLUSION
In this work, we use the IFI-DTW optimization to find
an optimal TF that transforms whispered articulatory move-
ments into those of neutral speech. Among several candidate
TFs, we find that an affine transformation with a full matrix
turns out to be the best TF to achieve the minimum distance between the NATs and the transformed WATs, both at the utterance level and for different BCP categories. This indicates that information about
a particular articulator’s movements in neutral speech is cap-
tured by those of several articulators while whispering. We
also find that this TF generalizes better across different subjects compared to a DNN based nonlinear TF. It could be that
exaggerated articulatory movements need not result in a
highly nonlinear transformation between WAT and NAT,
but, in fact, could be well approximated by an affine trans-
formation. Analysis of the exaggerated articulatory move-
ments while whispering reveals that stable and precise
movements of the tongue are vital for the compensation of
the lack of intelligibility in whispered speech. Analyzing the
phoneme specific optimal TF, language specific effects in
the reconstruction and synthesizing neutral speech from the
reconstructed neutral articulatory trajectories are parts of our
future work.
ACKNOWLEDGMENTS
We thank the six subjects for their participation,
Aravind Illa for assisting with the data collection, and the
Pratiksha Trust for their support.
1See supplementary material at https://doi.org/10.1121/1.5039750 to view
box plots of the dtest from the five schemes for each of the six subjects,
considering only the position data, and box plots of the dtest from the Af
and DNN schemes for each of the six subjects, considering position and
the dynamics, in the subject-wise experiments.
3D Electromagnetic Articulograph (1979), http://www.articulograph.de/
(Last viewed September 14, 2017).
Ahmadi, F., McLoughlin, I. V., and Sharifzadeh, H. R. (2008). “Analysis-
by-synthesis method for whisper-speech reconstruction,” in IEEE Asia Pacific Conference on Circuits and Systems, APCCAS, pp. 1280–1283.
Aryal, S., and Gutierrez-Osuna, R. (2016). “Data driven articulatory synthe-
sis with deep neural networks,” Comput. Speech Lang. 36(C), 260–273.
Beskow, J. (2003). “Talking heads-models and applications for multimodal
speech synthesis,” Ph.D. thesis, Institutionen för Talöverföring och
Musikakustik, Stockholm, Sweden.
Chollet, F. (2015). “keras,” https://github.com/fchollet/keras (Last viewed
September 14, 2017).
Coleman, J., Grabe, E., and Braun, B. (2002). “Larynx movements and into-
nation in whispered speech,” Summary of research supported by British
Academy.
Curry, R. (1937). “The mechanism of pitch change in the voice,” J. Physiol.
91(3), 254–258.
Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and
Brumberg, J. S. (2010). “Silent speech interfaces,” Speech Commun.
52(4), 270–287.
Fagan, M., Ell, S., Gilbert, J., Sarrazin, E., and Chapman, P. (2008).
“Development of a (silent) speech recognition system for patients follow-
ing laryngectomy,” Med. Eng. Phys. 30(4), 419–425.
Fagel, S., and Clemens, C. (2004). “An articulation model for audiovisual
speech synthesis—determination, adjustment, evaluation,” Speech
Commun. 44(1), 141–154.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S.
(1993). “DARPA TIMIT acoustic-phonetic continuous speech corpus
CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report
No. 93.
Ghosh, P. K., and Narayanan, S. (2010). “A generalized smoothness crite-
rion for acoustic-to-articulatory inversion,” J. Acoust. Soc. Am. 128(4),
2162–2172.
Gilchrist, A. G. (1973). “Rehabilitation after laryngectomy,” Acta Oto-
Laryngologica 75(2-6), 511–518.
Gonzalez, J. A., Cheah, L. A., Gilbert, J. M., Bai, J., Ell, S. R., Green, P. D.,
and Moore, R. K. (2016). “A silent speech system based on permanent
magnet articulography and direct synthesis,” Comput. Speech Lang. 39,
67–87.
Higashikawa, M., Green, J., Moore, C., and Minifie, F. (2003). “Lip kine-
matics for /p/ and /b/ production during whispered and voiced speech,”
Folia Phoniatr. Logop. 55, 1–9.
Jackson, P. J., and Singampalli, V. D. (2008). “Statistical identification of
critical, dependent and redundant articulators,” J. Acoust. Soc. Am.
123(5), 3321–3321.
Janke, M., Wand, M., Heistermann, T., Schultz, T., and Prahallad, K.
(2014). “Fundamental frequency generation for whisper-to-audible speech
conversion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2579–2583.
Jovičić, S. T., and Šarić, Z. (2008). “Acoustic analysis of consonants in
whispered speech,” J. Voice 22(3), 263–274.
Kingma, D. P., and Ba, J. (2014). “Adam: A method for stochastic opti-
mization,” arXiv:1412.6980.
Lee, K. F., and Hon, H. W. (1989). “Speaker-independent phone recognition
using hidden Markov models,” IEEE Trans. Acoust. Speech. Sign.
Process. 37(11), 1641–1648.
Ludlow, C. L. (2005). “Central nervous system control of the laryngeal
muscles in humans,” Respirat. Physiol. Neurobiol. 147(2), 205–222.
Mcloughlin, I. V., Sharifzadeh, H. R., Tan, S. L., Li, J., and Song, Y.
(2015). “Reconstruction of phonated speech from whispers using formant-
derived plausible pitch modulation,” ACM Trans. Access. Comput.
(TACCESS) 6(4), 12.
Morris, R. W., and Clements, M. A. (2002). “Reconstruction of speech from
whispers,” Med. Eng. Phys. 24(7), 515–520.
Müller, M. (2007). “Dynamic time warping,” in Information Retrieval for Music and Motion, pp. 69–84.
Osfar, M. J. (2011). “Articulation of whispered alveolar consonants,”
Master’s thesis, University of Illinois at Urbana-Champaign, Champaign,
IL.
Parnell, M., Amerman, J. D., and Wells, G. B. (1977). “Closure and con-
striction duration for alveolar consonants during voiced and whispered
speaking conditions,” J. Acoust. Soc. Am. 61, 612–613.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.,
Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J.,
Stemmer, G., and Vesely, K. (2011). “The Kaldi Speech Recognition
Toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding.
Qiao, Y., and Yasuhara, M. (2006). “Affine invariant dynamic time warping
and its application to online rotated handwriting recognition,” in 18th International Conference on Pattern Recognition (ICPR’06), Vol. 2, pp.
905–908.
Scanlon, P., Ellis, D. P. W., and Reilly, R. B. (2007). “Using broad phonetic
group experts for improved speech recognition,” IEEE Trans. Audio
Speech Lang. Process. 15(3), 803–812.
Schönle, P. W., Gräbe, K., Wenig, P., Höhne, J., Schrader, J., and Conrad,
B. (1987). “Electromagnetic articulography: Use of alternating magnetic
fields for tracking movements of multiple points inside and outside the
vocal tract,” Brain Lang. 31(1), 26–35.
Schwartz, M. F. (1972). “Bilabial closure durations for /p/, /b/, and /m/ in
voiced and whispered vowel environments,” J. Acoust. Soc. Am. 51,
2025–2029.
Sharifzadeh, H. R., McLoughlin, I. V., and Ahmadi, F. (2010).
“Reconstruction of normal sounding speech for laryngectomy patients
through a modified CELP codec,” IEEE Trans. Biomed. Eng. 57(10),
2448–2458.
Tartter, V. C. (1989). “What’s in a whisper?,” J. Acoust. Soc. Am. 86,
1678–1683.
Team, T. T. D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C.,
Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. (2016).
“Theano: A python framework for fast computation of mathematical
expressions,” arXiv:1605.02688.
Toda, T., and Shikano, K. (2005). “NAM-to-speech conversion with
Gaussian mixture models,” in INTERSPEECH, pp. 1957–1960.
Toutios, A., and Maeda, S. (2012). “Articulatory VCV synthesis from EMA
data,” in INTERSPEECH, pp. 2566–2569.
Toutios, A., and Narayanan, S. (2013). “Articulatory synthesis of French
connected speech from EMA data,” in INTERSPEECH, pp. 2738–2742.
Wang, J., Hahm, S., and Mau, T. (2015). “Determining an optimal set of
flesh points on tongue, lips, and jaw for continuous silent speech recog-
nition,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, Association for
Computational Linguistics, Dresden, Germany, pp. 79–85.
Wang, J., Samal, A., and Green, J. R. (2014). “Preliminary test of a real-
time, interactive silent speech interface based on electromagnetic
articulograph,” in SLPAT@ACL, Association for Computational
Linguistics, pp. 38–45.
Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012a). “Sentence recog-
nition from articulatory movements for silent speech interfaces,” in
ICASSP, IEEE, pp. 4985–4988.
Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012b). “Whole-word
recognition from articulatory movements for silent speech interfaces,” in
INTERSPEECH, ISCA, pp. 1327–1330.
Wrench, A. (1999). “MOCHA-TIMIT,” speech database.
Wszołek, W., Modrzejewski, M., and Przysiezny, M. (2014). “Acoustic
analysis of esophageal speech in patients after total laryngectomy,” Arch.
Acoust. 32(4), 151–158.
Yoshioka, H. (2008). “The role of tongue articulation for /s/ and /z/
production in whispered speech,” in Proceedings of Acoustics, pp.
2335–2338.