Reconstruction of articulatory movements during neutral speech from those during whispered speech

Nisha Meenakshi G.a) and Prasanta Kumar Ghosh
Electrical Engineering, Indian Institute of Science, Bangalore-560012, India
a)Electronic mail: [email protected]

(Received 25 September 2017; revised 25 April 2018; accepted 9 May 2018; published online 6 June 2018)
A transformation function (TF) that reconstructs neutral speech articulatory trajectories (NATs) from whispered speech articulatory trajectories (WATs) is investigated, such that the dynamic time warped (DTW) distance between the transformed whispered and the original neutral articulatory movements is minimized. Three candidate TFs are considered: an affine function with a diagonal matrix ($A_d$), which reconstructs one NAT from the corresponding WAT, and an affine function with a full matrix ($A_f$) and a deep neural network (DNN) based nonlinear function, both of which reconstruct each NAT from all WATs. Experiments reveal that the transformation could be approximated well by $A_f$, since it generalizes better across subjects and achieves the least DTW distance of 5.20 (±1.27) mm (on average), with a relative improvement of 7.47%, 4.76%, and 7.64% compared to that with $A_d$, the DNN, and the best baseline scheme, respectively. Further analysis to understand the differences in neutral and whispered articulation reveals that the whispered articulators exhibit exaggerated movements in order to reconstruct the lip movements during neutral speech. It is also observed that, among the articulators considered in the study, the tongue exhibits higher precision and stability while whispering, implying that subjects control their tongue movements carefully in order to render intelligible whispered speech. © 2018 Acoustical Society of America.
https://doi.org/10.1121/1.5039750
[JFL] Pages: 3352–3364
I. INTRODUCTION
Whispered speech is typically produced in private con-
versations, in addition to pathological cases such as laryn-
gectomy (Sharifzadeh et al., 2010). Such pathological
conditions lead to several types of alaryngeal speech includ-
ing esophageal speech, tracheoesophageal speech, and
hoarse whispered speech (Wszołek et al., 2014; Gilchrist,
1973). Since whispered speech is produced in the absence of
vocal fold vibrations, it lacks pitch (Tartter, 1989). Several
algorithms exist to reconstruct and synthesize neutral speech
from the less intelligible whispered speech (Sharifzadeh
et al., 2010; Morris and Clements, 2002; Ahmadi et al.,2008; Janke et al., 2014; Mcloughlin et al., 2015; Toda and
Shikano, 2005). Silent speech interfaces (SSIs) also address
this problem of reconstructing neutral speech (Denby et al., 2010). One line of SSI research recognizes words or sentences from articulatory movements (Fagan et al., 2008), followed
by text-to-speech synthesis (Wang et al., 2014, 2012a,b,
2015). On the other hand, certain SSIs convert articulatory
movements into speech via direct synthesis. SSIs based on
the movements of speech articulators are used in the articula-
tory synthesis of neutral speech from the neutral articulation
data (Gonzalez et al., 2016; Toutios and Maeda, 2012;
Toutios and Narayanan, 2013; Fagel and Clemens, 2004;
Beskow, 2003; Aryal and Gutierrez-Osuna, 2016). By trans-
forming whispered articulatory movements into those of
neutral speech, we could employ an articulatory synthesis
framework to synthesize neutral speech. In order to do so, it
is critical to first have an understanding of the relationship
between the articulation in whispered speech and that in neu-
tral speech. For this, we study the whispered and neutral
articulatory movements captured using electromagnetic
articulography (EMA) (Schönle et al., 1987).
It is known that the articulation during whispered speech
differs from that during neutral speech, typically in two ways.
First, exaggerated articulatory movements are known to exist
in whispered speech (Yoshioka, 2008; Osfar, 2011; Schwartz,
1972; Parnell et al., 1977) unlike in neutral speech, in order to
compensate for the lack of pitch in whispers. Second, whis-
pered speech has a longer duration compared to the corre-
sponding neutral speech (Jovičić and Šarić, 2008). There are
several studies that examine the exaggeration in the whispered
articulatory movements. Yoshioka studied the differences in
the palato-lingual contact pattern during the production of
whispered unvoiced and voiced alveolar fricatives, namely, /s/
and /z/, using electro-palatography (Yoshioka, 2008). The
study revealed that the area of contact between the palate and
the tongue during the production of whispered /z/ is larger
compared to that during whispered /s/. The differences in the
movements of the lips during the production of whispered and
neutral bilabial consonants, /b/ and /p/, were studied using
both speech and facial video (Higashikawa et al., 2003). The
study revealed that the average peak opening and closing
velocities and the distance between the upper and the lower lip
for oral opening for /b/ were significantly higher than those for
/p/ while whispering. These studies show that exaggerated
articulation occurs during the production of “voiced” whis-
pered consonants [/z/ and /b/ from Yoshioka (2008) and
Higashikawa et al. (2003), respectively]. Electro-palatography
based experiments with neutral and whispered alveolar conso-
nants, namely, /d/, /t/, and /n/, were done by Osfar (2011).
These experiments found that articulation is more stable and
precise in whispered speech compared to that in neutral
speech, confirming that subjects hyperarticulate while whisper-
ing compared to when they speak normally. These exaggerated
articulatory movements cause the whispered articulatory tra-
jectory (WAT) to differ from the neutral articulatory trajectory
(NAT). To the best of our knowledge, not much investigation
has been done in the literature to understand an underlying
mapping that could relate a WAT to a NAT. This work aims
to better understand the differences in the whispered and neu-
tral articulation. In this regard, we first find a suitable mapping
function to reconstruct each NAT from multiple WATs.
Second, we quantify the amount of exaggeration exhibited by
the whispered articulatory movements and compare it with
that of neutral speech.
We propose an iterative function independent dynamic
time warping (IFI-DTW) optimization to compute the opti-
mal transformation function (TF) to transform WATs, in
order to reconstruct NATs. In the IFI-DTW method, we opti-
mize the TF and the DTW (Müller, 2007) warping path, by
an iterative alternate minimization procedure, till conver-
gence is achieved. Having obtained a transformation from
whispered to neutral articulatory movements, we investigate
the exaggeration in the whispered articulation. In particular,
we analyze the transformed whispered and neutral articula-
tory trajectories, to understand (1) those neutral articulators
whose reconstruction requires exaggerated articulatory
movements while whispering and (2) those articulators that
exhibit exaggerated movements in whispered speech.
II. MAPPING PROCEDURE BETWEEN WHISPERED AND NEUTRAL SPEECH ARTICULATION
A. IFI-DTW optimization
Let us consider articulatory movements of neutral and
whispered utterances available at a sampling frequency of
Fs. Consider the number of training utterances to be N. We
propose an IFI-DTW algorithm to estimate a TF so that the
NATs and transformed WATs have the least distance. Let us
denote the WATs and NATs of Ns articulators corresponding
to the utterance $i$ (after mean subtraction) by $W_i = [w_1, \ldots, w_{T_{W_i}}]$ and $N_i = [n_1, \ldots, n_{T_{N_i}}]$, of lengths $T_{W_i}$ and $T_{N_i}$ samples, respectively ($W_i \in \mathbb{R}^{N_s \times T_{W_i}}$ and $N_i \in \mathbb{R}^{N_s \times T_{N_i}}$), where $w_k$ and $n_k$ denote the $k$th column of $W_i$ and $N_i$. Therefore, each row of $W_i$ (or $N_i$) corresponds to one whispered (or neutral) articulatory trajectory, e.g., the tongue tip, upper lip, etc., and each column corresponds to the frame index along time. Since the lengths of the whispered and neutral utterances need not be equal ($T_{W_i} \neq T_{N_i}$), we use DTW with the Euclidean distance for alignment to compute the distance between them. Therefore, we require an optimal TF, $F^*$, and a set of optimal warping paths $\{m_i^*, i = 1, \ldots, N\}$, that transform WATs to NATs such that the total cost $D$, i.e., the sum of the DTW distances over all training utterances, is minimized, as follows:
$$\left(F^*, \{m_i^*\}\right) = \arg\min_{f, \{m_i\}} D(f, \{m_i\}), \quad (1)$$

where

$$D(f, \{m_i\}) = \sum_{i=1}^{N} D_{m_i}(f(W_i), N_i), \quad (2)$$

where $m_i$ is a DTW warping path between the NATs and the transformed WATs and $D_{m_i}$ is the total squared Euclidean distance computed along $m_i$ for utterance $i$. Let the reconstructed NATs (or the transformed WATs) be $\hat{N}_i = f(W_i) = [f(w_1), \ldots, f(w_{T_{W_i}})] = [\hat{n}_1, \ldots, \hat{n}_{T_{W_i}}]$ (where each row of $\hat{N}_i$ corresponds to one transformed WAT). For an utterance $i$, a warping path $m_i$ of length $L_i$ between $\hat{N}_i$ and $N_i$ consists of the ordered pairs $m_i = \langle m_i^w(l), m_i^n(l) \rangle$, $l = 1, \ldots, L_i$, such that $1 \le m_i^w(l) \le T_{W_i}$ and $1 \le m_i^n(l) \le T_{N_i}$. Therefore, given a warping path $m_i$ for an utterance $i$, we have

$$D_{m_i}(\hat{N}_i, N_i) = \frac{1}{L_i - 1} \sum_{l=1}^{L_i} \left\| \hat{n}_{m_i^w(l)} - n_{m_i^n(l)} \right\|_2^2, \quad (3)$$

where $\| \cdot \|_2$ indicates the L2 norm. Thus, the optimal warping path for each utterance is given by

$$m_i = \arg\min_{m_i'} D_{m_i'}(\hat{N}_i, N_i), \quad i = 1, \ldots, N. \quad (4)$$
This optimization first involves the construction of a distance matrix whose (p, q)th entry denotes the Euclidean distance between $\hat{n}_p$ and $n_q$. Dynamic programming (Müller, 2007) is then employed to compute the optimal warping path through the distance matrix that results in the least overall Euclidean distance [as in Eq. (4)]. From Eqs. (1) and (4), we see that the TF and the DTW warping path depend on each other, which makes the joint optimization in Eq. (1) a challenging task. Therefore, in the IFI-DTW algorithm, we optimize the TF and the DTW warping path using an iterative alternate minimization procedure.
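For concreteness, the following is a minimal sketch of this alignment step (assuming NumPy; the helper and variable names are our own). It builds the frame-wise squared Euclidean distance matrix, recovers the warping path of Eq. (4) by dynamic programming, and returns the path-normalized cost of Eq. (3).

import numpy as np

def dtw_align(N_hat, N):
    """Optimal warping path between transformed WATs N_hat (Ns x Tw)
    and NATs N (Ns x Tn), using squared Euclidean frame distances."""
    Tw, Tn = N_hat.shape[1], N.shape[1]
    # (p, q)th entry: squared distance between column p of N_hat and column q of N
    dist = ((N_hat[:, :, None] - N[:, None, :]) ** 2).sum(axis=0)
    # accumulate cost with the standard step pattern (match, insertion, deletion)
    acc = np.full((Tw, Tn), np.inf)
    acc[0, 0] = dist[0, 0]
    for p in range(Tw):
        for q in range(Tn):
            if p == 0 and q == 0:
                continue
            prev = min(acc[p - 1, q - 1] if p and q else np.inf,
                       acc[p - 1, q] if p else np.inf,
                       acc[p, q - 1] if q else np.inf)
            acc[p, q] = dist[p, q] + prev
    # backtrack to recover the ordered pairs <m^w(l), m^n(l)>
    path = [(Tw - 1, Tn - 1)]
    p, q = Tw - 1, Tn - 1
    while (p, q) != (0, 0):
        candidates = [(p - 1, q - 1), (p - 1, q), (p, q - 1)]
        candidates = [(a, b) for a, b in candidates if a >= 0 and b >= 0]
        p, q = min(candidates, key=lambda ab: acc[ab])
        path.append((p, q))
    path.reverse()
    # path-normalized cost, matching the 1/(L_i - 1) normalization of Eq. (3)
    cost = sum(dist[a, b] for a, b in path) / (len(path) - 1)
    return np.array(path), cost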
If $f$ is known, we could find the set of optimal paths $\{m_i\}$ using Eqs. (3) and (4) and the total cost $D$ using Eq. (2). Let us now assume that the set of optimal paths $\{m_i\}$ is known. Therefore, for a given utterance $i$, let us define $W_i^{m_i} = [w_{m_i^w(1)}, \ldots, w_{m_i^w(L_i)}]$ and $N_i^{m_i} = [n_{m_i^n(1)}, \ldots, n_{m_i^n(L_i)}]$, such that $w_k$ and $n_k$ represent the $k$th column of $W_i$ and $N_i$, respectively. Since we estimate one TF using all the training utterances, we concatenate $W_i^{m_i}$ (and $N_i^{m_i}$) $\forall i = 1, \ldots, N$ to obtain $W_{\{m_i\}}$ (and $N_{\{m_i\}}$). Specifically, we write $W_{\{m_i\}} = [W_1^{m_1}, \ldots, W_N^{m_N}] \in \mathbb{R}^{N_s \times L}$ and $N_{\{m_i\}} = [N_1^{m_1}, \ldots, N_N^{m_N}] \in \mathbb{R}^{N_s \times L}$, where $L = \sum_{i=1}^{N} L_i$. We then optimize for $F$ as follows:

$$F = \arg\min_f D(f, \{m_i\}) \quad (5)$$
$$\;\; = \arg\min_f \left\| f(W_{\{m_i\}}) - N_{\{m_i\}} \right\|_2^2. \quad (6)$$
In this manner, we optimize for the warping path and the TF using alternate minimization. The expressions to compute different TFs are provided in detail in Secs. II B and II C. In the IFI-DTW optimization, we initialize the TF in the first iteration (denoted by $F^{(1)}$) to be an identity transform. We then obtain the warping paths (denoted by $\{m_i^{(1)}\}$) for each training utterance using DTW [Eq. (4)] and compute the total cost (denoted by $D^{(1)}$) using Eq. (2). Given the set of warping paths, in the next iteration we compute a new TF (denoted by $F^{(2)}$) corresponding to the entire training set using Eq. (6). Given the new TF, we once again compute the warping paths (denoted by $\{m_i^{(2)}\}$) and the new total cost (denoted by $D^{(2)}$). If the total cost in the current iteration is less than that in the previous iteration, we repeat the same procedure of computing the TF and the warping paths, iteratively, till convergence is achieved. The IFI-DTW optimization is described in Algorithm 1.
ALGORITHM 1: IFI-DTW optimization.

1: $D^{(0)}(\cdot) = \infty$
2: Iteration $j = 1$
3: Initial TF: $F^{(1)}$
4: Optimize for $\{m_i^{(1)}\}_{i=1,\ldots,N}$ using $F^{(1)}$ in Eq. (4)
5: Compute $D^{(1)}(F^{(1)}, \{m_i^{(1)}\})$ using Eq. (2)
6: Until convergence:
7: while $D^{(j-1)}(\cdot) > D^{(j)}(\cdot)$ do
8:   $j \leftarrow j + 1$
9:   Optimize for $F^{(j)}$ using $\{m_i^{(j-1)}\}$ in Eq. (6)
10:  Optimize for $\{m_i^{(j)}\}_{i=1,\ldots,N}$ using $F^{(j)}$ in Eq. (4)
11:  Compute $D^{(j)}(F^{(j)}, \{m_i^{(j)}\})$ using Eq. (2)
12: end while
13: Optimal TF: $F^* = F^{(j)}$; optimal set of warping paths: $\{m_i^*\} = \{m_i^{(j)}\}$, $i = 1, \ldots, N$.

We now provide the proof of convergence of the IFI-DTW optimization.

Proof. Consider the IFI-DTW optimization given in Algorithm 1. We need to show

$$D^{(j-1)}(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D^{(j)}(F^{(j)}, \{m_i^{(j)}\}).$$

Since $F^{(j)}$ is the optimal TF corresponding to the set of warping paths $\{m_i^{(j-1)}\}$ [from operation 9 of Algorithm 1], we have

$$D(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j-1)}\}).$$

Since the set of warping paths $\{m_i^{(j)}\}$ is optimal for $F^{(j)}$ [from operation 10 of Algorithm 1], we have

$$D(F^{(j-1)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j-1)}\}) \geq D(F^{(j)}, \{m_i^{(j)}\}). \quad (7)$$

Hence, proved.
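For illustration, below is a minimal sketch of the alternate minimization in Algorithm 1, reusing the dtw_align helper sketched above. Here fit_tf stands for whichever TF estimator (affine or DNN, Sec. II B) is plugged in; the function name, convergence tolerance, and iteration cap are our own choices, not the authors' exact implementation.

def ifi_dtw(W_list, N_list, fit_tf, n_iter_max=50, tol=1e-7):
    """Iterative function-independent DTW (a sketch of Algorithm 1).

    W_list, N_list : lists of (Ns x T) whispered / neutral trajectories.
    fit_tf         : callable fitting a TF from aligned frame matrices,
                     e.g., a least-squares affine fit (Sec. II B).
    """
    tf = lambda W: W                       # F^(1): identity transform
    prev_cost = np.inf                     # D^(0) = infinity
    paths = []
    for _ in range(n_iter_max):
        # align each utterance given the current TF (Eq. 4) and total the cost (Eq. 2)
        paths, cost = [], 0.0
        for W, N in zip(W_list, N_list):
            path, d = dtw_align(tf(W), N)
            paths.append(path)
            cost += d
        if prev_cost - cost <= tol:        # convergence check (operation 7)
            break
        prev_cost = cost
        # re-fit the TF on the concatenated aligned frames (Eq. 6)
        W_cat = np.hstack([W[:, p[:, 0]] for W, p in zip(W_list, paths)])
        N_cat = np.hstack([N[:, p[:, 1]] for N, p in zip(N_list, paths)])
        tf = fit_tf(W_cat, N_cat)
    return tf, paths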
B. Candidate transformation functions
In order to understand the underlying function which
transforms WATs to NATs, we consider three candidate
functional forms of the TF.
1. Full affine transformation—Af scheme
Since there exists a dependency among articulatory
movements (Jackson and Singampalli, 2008), we hypothe-
size that several WATs could contribute to reconstruct one
NAT. Therefore, we consider the first candidate to be an
affine transformation, as follows:
$$f(W_{\{m_i\}}^T) = \underbrace{\left[ (W_{\{m_i\}})^T \;\; \mathbf{1}_{L \times 1} \right]}_{W'} \begin{bmatrix} A_{N_s \times N_s} \\ b_{1 \times N_s} \end{bmatrix}, \quad (8)$$

where $W' \in \mathbb{R}^{L \times (N_s + 1)}$. Substituting Eq. (8) in Eq. (6), we get the affine transformation function to be $[(W')^T W']^{-1} (W')^T (N_{\{m_i\}})^T$. It is to be noted that we place no constraints on $A$ or $b$. This special case is similar to the affine independent DTW proposed by Qiao and Yasuhara (2006). Thus, the (p, k)th coefficient of the full matrix $A = A_f$ captures the strength of the relation between the pth WAT and the kth NAT. Let the pth WAT of utterance $j$ be denoted by $w^p \in \mathbb{R}^{T_{W_j} \times 1}$ of length $T_{W_j}$. Then the kth reconstructed NAT $\hat{n}^k$ can be written as

$$\hat{n}^k = \sum_{p=1}^{N_s} a_{p,k} w^p + b_k \mathbf{1}_{T_{W_j} \times 1}, \quad 1 \le k \le N_s. \quad (9)$$
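A sketch of this closed-form least-squares fit of Eq. (8) follows (our own helper name; the optional ridge term is our addition for numerical stability and is not part of the paper's formulation). The returned callable can be passed as fit_tf to the IFI-DTW sketch above.

def fit_affine_full(W_cat, N_cat, ridge=0.0):
    """Fit f(W) = A^T W + b from aligned frames (Ns x L each), as in Eq. (8).

    Solves [(W')^T W']^{-1} (W')^T N^T with W' = [W^T, 1]."""
    Ns, L = W_cat.shape
    W_aug = np.hstack([W_cat.T, np.ones((L, 1))])          # W' in R^{L x (Ns+1)}
    gram = W_aug.T @ W_aug + ridge * np.eye(Ns + 1)
    coeffs = np.linalg.solve(gram, W_aug.T @ N_cat.T)      # (Ns+1) x Ns
    A, b = coeffs[:Ns, :], coeffs[Ns, :]                   # A: Ns x Ns, b: 1 x Ns
    return lambda W: (W.T @ A + b).T                       # maps Ns x T -> Ns x T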
2. Diagonal affine transformation—Ad scheme
To understand how each WAT transforms into the cor-
responding NAT, we consider the matrix A in Eq. (8) to be a
diagonal matrix Ad . Therefore, in this case, we assume that
only the pth WAT contributes to reconstruct the pth NAT
with the (p, p)th coefficient of the diagonal matrix Ad cap-
turing the strength of this contribution. Similar to Eq. (9), we
can express the pth reconstructed NAT $\hat{n}^p$ as follows:

$$\hat{n}^p = a_{p,p} w^p + b_p \mathbf{1}_{T_{W_j} \times 1}, \quad 1 \le p \le N_s. \quad (10)$$
3. Nonlinear transformation—DNN scheme
In the third scheme, we model the dependency among
articulatory movements by a nonlinear transformation using
a deep neural network (DNN). At the jth iteration, operation 9 in Algorithm 1 is executed by providing $W_{\{m_i\}}$ and $N_{\{m_i\}}$ as the input and output, respectively, to a DNN. While
in the first iteration, the DNN is initialized with random
weights, for all iterations j> 1, the DNN is initialized with
the weight matrix from the DNN optimized in the (j – 1)th
iteration. The details of the implementation are provided in
Sec. III C.
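As an illustration of this scheme, a minimal Keras sketch of a frame-wise nonlinear TF is given below. We read "three layer network" as three hidden layers; the layer width, activation, and batch size are drawn from the candidate sets in Sec. III C, but this is our own sketch rather than the authors' exact configuration. The model object is kept across calls so that iteration j is warm-started from iteration j − 1, as described above.

from keras.models import Sequential
from keras.layers import Dense

def make_dnn_tf_fitter(hidden=64, activation="relu", batch_size=64, epochs=20):
    """Returns a fit_tf callable usable in the IFI-DTW sketch above."""
    state = {"model": None}

    def fit_tf(W_cat, N_cat):
        Ns = W_cat.shape[0]
        if state["model"] is None:
            model = Sequential([
                Dense(hidden, activation=activation, input_shape=(Ns,)),
                Dense(hidden, activation=activation),
                Dense(hidden, activation=activation),
                Dense(Ns, activation="linear"),   # linear output layer
            ])
            model.compile(optimizer="adam", loss="mse")
            state["model"] = model
        model = state["model"]
        # frames are columns, so train on the transposed matrices (L x Ns)
        model.fit(W_cat.T, N_cat.T, batch_size=batch_size, epochs=epochs, verbose=0)
        return lambda W: model.predict(W.T).T

    return fit_tf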
C. Baseline schemes
To compare the performance of the candidate TFs, we
use two baseline schemes. In both these schemes, we use a
fixed TF between the WATs and the NATs and do not opti-
mize for the TF. Hence, the IFI-DTW algorithm stops in a
single step.
1. Abs1 scheme
In the first baseline scheme, we define a TF $F^{(1)} = A_{bs1}$ with respect to Algorithm 1 such that the transformation, when applied to the WATs, retains the mean and covariance of the NATs. Let $W = [W_1, \ldots, W_N] \in \mathbb{R}^{N_s \times \sum_{i=1}^{N} T_{W_i}}$ and $N = [N_1, \ldots, N_N] \in \mathbb{R}^{N_s \times \sum_{i=1}^{N} T_{N_i}}$ (concatenated versions of $W_i$ and $N_i$, $\forall i = 1, \ldots, N$), with corresponding mean vectors (across time) $\mu^w = [\mu_1^w, \ldots, \mu_{N_s}^w]^T \in \mathbb{R}^{N_s \times 1}$ and $\mu^n = [\mu_1^n, \ldots, \mu_{N_s}^n]^T \in \mathbb{R}^{N_s \times 1}$, and covariance matrices $\Sigma_w \in \mathbb{R}^{N_s \times N_s}$ and $\Sigma_n \in \mathbb{R}^{N_s \times N_s}$, respectively. To ensure that the transformed WATs, $F^{(1)}(W)$, have their mean vector and covariance matrix equal to $\mu^n$ and $\Sigma_n$, we compute $A = (U_n \Lambda_n^{1/2} \Lambda_w^{-1/2} U_w^T)^T$ and $b = (\mu^n)^T - (\mu^w)^T A$ in Eq. (8). $U_w$, $U_n$ and $\Lambda_w$, $\Lambda_n$ are the matrices of orthonormal eigenvectors and the diagonal matrices containing the eigenvalues, obtained by the eigendecomposition of $\Sigma_w$ and $\Sigma_n$, respectively. In this case, the expression for the reconstruction of the kth NAT is the same as that provided in Eq. (9).
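A sketch of this covariance-matching transform follows (our own helper name; the small eigenvalue floor is our addition to guard against numerically singular covariances).

def fit_abs1(W, N):
    """Abs1 baseline: fixed affine map making transformed WATs match the
    mean vector and covariance matrix of the NATs."""
    mu_w, mu_n = W.mean(axis=1), N.mean(axis=1)
    lam_w, U_w = np.linalg.eigh(np.cov(W))
    lam_n, U_n = np.linalg.eigh(np.cov(N))
    lam_w = np.maximum(lam_w, 1e-12)       # guard against zero eigenvalues
    A = (U_n @ np.diag(lam_n ** 0.5) @ np.diag(lam_w ** -0.5) @ U_w.T).T
    b = mu_n - A.T @ mu_w                  # b^T = mu_n - A^T mu_w
    return lambda X: (X.T @ A + b).T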
2. Abs2 scheme
In the second baseline scheme, we define a TF $F^{(1)} = A_{bs2}$ such that the transformation preserves the mean and the variance of each of the $N_s$ NATs. As defined in Sec. II C 1, let $\mu_p^n$ and $\mu_p^w$ be the means of the pth NAT and WAT, respectively. The standard deviations (SD) of the pth NAT and WAT are denoted as $\sigma_p^n$ and $\sigma_p^w$, respectively. With respect to Eq. (8), we write $A_{p,p} = \sigma_p^n / \sigma_p^w$ and $b_p = \mu_p^n - (\sigma_p^n / \sigma_p^w)\mu_p^w$, $p = 1, \ldots, N_s$, where $A_{p,p}$ is the pth diagonal element of $A$ and $b_p$ is the pth element of $b$. Since $A$ is diagonal, the reconstruction of the pth NAT follows Eq. (10).
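Correspondingly, a short sketch of the Abs2 transform, which only matches the per-trajectory mean and variance:

def fit_abs2(W, N):
    """Abs2 baseline: per-trajectory mean/variance matching (diagonal A)."""
    a = N.std(axis=1) / W.std(axis=1)          # A_{p,p} = sigma^n_p / sigma^w_p
    b = N.mean(axis=1) - a * W.mean(axis=1)    # b_p = mu^n_p - A_{p,p} mu^w_p
    return lambda X: a[:, None] * X + b[:, None]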
III. EXPERIMENTS
A. Dataset
In this work, we recorded both the neutral and whis-
pered articulatory movements of four male (M1, M2, M3,
M4) and two female (F1, F2) subjects using electromagnetic
articulograph AG501 (3D Electromagnetic
Articulograph, 1979). The native language of F1 and F2 is
Tamil and Bengali while that of M1, M2, M3, M4 is
Kannada, American English, Bengali, and Telugu, respec-
tively. None of the subjects were reported to have any
speech disorders. An informed consent was obtained from
each subject, prior to data collection.
We used the 460 phonetically balanced English senten-
ces from the MOCHA-TIMIT database as stimuli for record-
ing (Wrench, 1999). Simultaneous recordings of both audio
and articulatory movements were done in a sound-proof
chamber. In this study, we recorded the articulatory move-
ments of nine articulators, namely, upper lip (UL), lower lip
(LL), left commissure of the lip (LC), right commissure of
the lip (RC), jaw (J), throat (TH), tongue tip (TT), tongue
body (TB), and tongue dorsum (TD). The position of these
sensors is indicated in Fig. 1. We connected the TH sensor
typically near the laryngeal prominence for the subjects, in
order to capture the laryngeal movement as the subjects pho-
nate in neutral and whispered manner. Apart from the nine
sensors, we also connected two sensors needed for head cor-
rection in EMA recording.
Recorded at a sampling frequency of 250 Hz, the move-
ments of each articulator along the two axes (X and Z) of the
midsagittal plane (measured in mm), give rise to a total of
$N_s = 18$ (9 articulators × 2 axes) articulatory trajectories.
Thus, each NAT and WAT correspond to the movement of
an articulator in neutral and whispered speech, respectively.
Since articulatory movements are known to be low-pass in
nature (Ghosh and Narayanan, 2010), we first low pass filter
the articulatory trajectories with a cut-off frequency of 25 Hz
and then downsample to $F_s = 100$ Hz. Figure 2 shows the low pass filtered and downsampled trajectories of the upper lip, jaw, throat, and tongue tip for utterance $i = 2$ of a male sub-
ject, for both, neutral and whispered speech (corresponding
to eight rows from N2 and W2). From the figure, we observe
that the duration of the whispered speech utterance is longer
than that of neutral speech. We also see that the movement
of the articulators along the two axes, follow a similar pat-
tern in, both, whispered and neutral speech. Across all six
subjects, the total duration of neutral and whispered record-
ings is 127.95 and 139.19 min, respectively.
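A sketch of this preprocessing step is given below (assuming SciPy; the zero-phase Butterworth filter and interpolation-based resampling are our choices, since the paper specifies only the 25 Hz cut-off and the 100 Hz target rate).

from scipy.signal import butter, filtfilt

def preprocess_trajectory(x, fs_in=250, fs_out=100, cutoff=25.0, order=4):
    """Low-pass filter an articulatory trajectory at 25 Hz and resample to 100 Hz."""
    b, a = butter(order, cutoff / (fs_in / 2.0), btype="low")
    x_lp = filtfilt(b, a, x)                 # zero-phase low-pass filtering
    # 250 Hz -> 100 Hz is a non-integer ratio (2.5), so resample by interpolation
    t_in = np.arange(len(x)) / fs_in
    t_out = np.arange(0, t_in[-1], 1.0 / fs_out)
    return np.interp(t_out, t_in, x_lp)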
B. Experimental setup
1. Subject-wise setup
We hypothesize that there exists a subject specific artic-
ulation strategy involved while whispering, to compensate
for its lack of intelligibility in the absence of voicing.
Therefore, we perform experiments in a fourfold setup by
dividing the data collected from each subject into four sets
where three sets (345 sentences) are used for training and the
remaining set (115 sentences) for testing. For each fold, we
use the corresponding training set of $N = 345$ utterances to obtain the optimal TF, $F^*$, using the IFI-DTW algorithm. Using $F^*$ in Eqs. (3) and (4), we compute the DTW distances $d_k = D_{m_k}(\hat{N}_k, N_k)$ for the kth test utterance. Let $d_{test} \in \mathbb{R}^{460 \times 1}$ be a vector that consists of $d_k$ from all four folds (115 × 4 = 460) and let $\bar{d}_{test}$ denote the average of these 460 DTW distances, given by $\bar{d}_{test} = \frac{1}{460}\sum_{k=1}^{460} d_k$. Therefore, the best scheme is the one which results in the least $\bar{d}_{test}$ for all subjects.
In order to understand if the dynamics of articulatory
movements could aid a better reconstruction of NATs, we
perform a second experiment. Here we learn a TF using not
only the position data of articulators, but also their dynamics
in a subject-specific manner. For this, we first compute the
velocity and the acceleration coefficients from the
articulatory trajectories. Let $\Delta W_i$ and $\Delta N_i$ be the velocity coefficients and $\Delta\Delta W_i$ and $\Delta\Delta N_i$ be the acceleration coefficients of the ith whispered and neutral utterance, respectively. We then concatenate the trajectories corresponding to position and its dynamics to obtain $W_i^d = [W_i^T, \Delta W_i^T, \Delta\Delta W_i^T]^T$ and $N_i^d = [N_i^T, \Delta N_i^T, \Delta\Delta N_i^T]^T$ for each utterance. The IFI-DTW algorithm is employed to obtain the optimal warping paths and the transformation function between the position and the dynamics of the neutral articulators ($N_i^d$) and those while whispering ($W_i^d$). Therefore, Eq. (2) can be rewritten as

$$D(f, \{m_i\}) = \sum_{i=1}^{N} D_{m_i}(f(W_i^d), N_i^d). \quad (11)$$

FIG. 1. Schematic diagram depicting the placement of the nine sensors along the midsagittal plane and lips of a subject.

In this case, the average DTW distance between the original and the reconstructed NAT would reveal the benefits of utilizing the information about the dynamics of whispered articulatory movements to reconstruct NATs.
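A sketch of the dynamic feature computation is given below; the paper does not specify the exact delta window, so the simple gradient-based form here is an assumption.

def add_dynamics(X):
    """Stack position, velocity (delta), and acceleration (delta-delta)
    trajectories: (Ns x T) -> (3*Ns x T)."""
    delta = np.gradient(X, axis=1)           # first-order dynamics
    delta2 = np.gradient(delta, axis=1)      # second-order dynamics
    return np.vstack([X, delta, delta2])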
2. Cross subject setup
In the cross-subject setup, we test the model trained
using one subject’s positional data on the test data from
another subject. Hence, we could analyze the degree to
which the optimal TF could be subject dependent. We
employ the optimal TF obtained using the training set corre-
sponding to the ith fold of subject $s_{tr}$ to predict the NATs from the test set of the ith fold of subject $s_t$. We use $d_{s_t,s_{tr}} \in \mathbb{R}^{460 \times 1}$ to denote the vector that comprises the DTW distances computed for all test utterances in all four folds, when $s_{tr}$ and $s_t$ denote the training and test subjects. Let $\bar{d}_{s_t,s_{tr}}$ denote the average of these DTW distances.
C. Parameters
For the sake of practical implementation of the $A_f$, $A_d$,
and the DNN schemes, convergence is said to be achieved in
the IFI-DTW optimization if step 7 of algorithm 1 is satisfied
considering seven digits after the decimal point (experimen-
tally observed). In both the subject-wise and cross subject
setups, we use a three layer network for the DNN scheme.
Using 15% of the training set as the validation set, we opti-
mize for different parameters such as the number of hidden
neurons in each layer (candidates: 64, 128, 256, and 512),
the activation functions (candidates: “tanh” and “relu”) and
the batch size (candidates: 16, 32, and 64). We use the
“linear” activation in the output layer. For each fold, we
choose the optimal parameters based on the best performing
DNN architecture (in terms of the minimum DTW distance)
on the validation set. Optimization is done using ADAM
(Kingma and Ba, 2014), with mean squared error as the loss
function. The implementation of the DNN is done using
KERAS (Chollet, 2015) and THEANO (Team et al., 2016)
libraries.
D. Broad class phoneme (BCP) specific analysis
We perform a BCP specific study in order to know the
accuracy of the NAT reconstruction in each class when the
TF and the warping paths optimized on the entire set are
used. In order to do so, we use the KALDI toolkit (Povey et al., 2011) to perform a forced alignment of the recorded speech
data (obtained during EMA recording), using a Gaussian
mixture model-hidden Markov model (GMM-HMM) setup,
with a reduced phoneme set consisting of thirty nine pho-
nemes (including silence) (Lee and Hon, 1989) considered in
the TIMIT database (Garofolo et al., 1993). Using the fine-to-broad phone class mapping described by Scanlon et al. (2007), we map the 39 phonemes to the five broad phoneme
categories, namely, vowels, stop consonants, fricatives,
nasals, and silence. Thus, from the forced aligned boundaries,
we obtain the BCP boundaries. These boundaries are manu-
ally checked and corrected in case of any errors.
For the kth test utterance, we obtain $\hat{N}_k$ using the different schemes described in Secs. II B and II C. We extract the segments corresponding to each of the five BCP categories from both $\hat{N}_k$ and $N_k$ for every utterance k and compute the
segment-wise DTW distances, for all schemes. This is done
subject wise, for all the test utterances, k¼ 1,…,115, in each
fold. We report the average of these segment-wise distances,
across six subjects, for each of the five BCP categories,
obtained using the different TFs considered in the study.
IV. PERFORMANCE OF THE MAPPING METHODS
A. Subject-wise experimental results
The number of iterations to achieve convergence in the
IFI-DTW optimization, averaged across all folds and all sub-
jects, turns out to be 6.63 (±1.71), 5.75 (±1.78), and 5.46 (±2.15) for the $A_f$, $A_d$, and DNN schemes, respectively
[the numbers in brackets represent standard deviation (SD)].
For the DNN scheme, based on the performance in the vali-
dation set, the optimal number of neurons in the hidden
layers is found to be 64 for all folds of all subjects, except
for the fourth fold of subject M4, in which the optimal
number turns out to be 128.

FIG. 2. Trajectories of upper lip (UL), jaw (J), throat (TH), and tongue tip (TT), along the X and Z directions, for the utterance “Is this see-saw safe?” of subject M1, in neutral speech (continuous lines) and in whispered speech (dashed lines).

We find the “relu” activation
function and a batch size of 64 to be the optimal parameters
across all subjects. The results for the subject-wise setup are
provided in Table I. The corresponding box plots of the dtest
from the five schemes for each of the six subjects are
included as supplementary material.1
From the table, it is clear that for each subject, the Af
scheme results in the least average DTW distance (indicated
by bold entry in each column) between the reconstructed and
original NATs. Averaged across all subjects and folds, the
DTW distance between the reconstructed and the original
NAT turns out to be 5.20 (±1.27) mm for the $A_f$ scheme, 5.62 (±1.38) mm for the $A_d$ scheme, 5.46 (±0.96) mm for the DNN scheme, and 9.09 (±2.96) mm and 5.63 (±1.39) mm for the $A_{bs1}$ and $A_{bs2}$ schemes, respectively. From the table, we
observe a decrease of 42.79% and 7.64% (relative) in the Af
scheme compared to the Abs1 and Abs2 schemes, respec-
tively, averaged across all six subjects. The poor perfor-
mance of Abs1 scheme reveals that a TF that preserves the
mean and the covariance of the NATs alone, does not pro-
vide an optimal transformation from WATs to NATs.
Interestingly, the performance of the Abs2 scheme is similar
to that of the Ad scheme. This indicates that the optimal TF
learnt iteratively in the Ad scheme, tries to preserve the vari-
ance of the NATs.
From the table, we find a relative decrease in the aver-
age DTW distance in the Af scheme, with respect to the Ad
scheme, by 7.44%, 7.85%, 6.49%, 8.03%, 8.84%, and 6.21%
for the six subjects. The improved performance of the Af
scheme compared to the Ad scheme, reveals that several
WATs contribute to reconstruct a single NAT. Comparing
with the DNN scheme, we observe a relative drop in the
average DTW distance in the Af scheme by 2.86%, 4.78%,
2.75%, 7.13%, 4.63%, and 5.63%, for six subjects. In order
to examine if the performance of the Af scheme is statisti-
cally significant compared to the other schemes, we perform
a t-test. For each of the schemes Ad, DNN, Abs1, and Abs2
we consider the null hypothesis to indicate that the differ-
ence of dtest from Af and dtest from the considered scheme
comes from a normal distribution with zero mean and
unknown variance. The alternate hypothesis is that this dif-
ference comes from a normal distribution whose mean is
less than zero. The statistical analysis reveals that the null
hypothesis is rejected at 5% significance level (all p-values
≤3.84e−22) for all schemes. We find similar results (all p-values ≤5.23e−202) when the described t-test is performed across all subjects. This indicates that the $d_{test}$ obtained from
the Af scheme is statistically significantly lower than those
obtained from the other schemes.
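A sketch of the significance test described above follows (assuming SciPy; the paired one-sided formulation mirrors the stated null and alternate hypotheses, and the helper name is our own).

from scipy import stats

def paired_one_sided_ttest(d_af, d_other, alpha=0.05):
    """Test whether per-utterance DTW distances from the A_f scheme are
    significantly lower than those from another scheme.

    H0: mean(d_af - d_other) = 0;  H1: mean(d_af - d_other) < 0."""
    diff = np.asarray(d_af) - np.asarray(d_other)
    t_stat, p_two_sided = stats.ttest_1samp(diff, popmean=0.0)
    # convert to a one-sided p-value for the 'mean < 0' alternative
    p_one_sided = p_two_sided / 2.0 if t_stat < 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided < alpha, p_one_sided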
For illustration, Fig. 3 shows the reconstructed TDx tra-
jectory using different TFs for one utterance from subject
F2. We see that the reconstructed NAT using Af scheme
closely approximates the original NAT, better than the other
schemes (rectangular box indicated for each scheme illus-
trates this in the figure). We also observe from Figs. 3(C),
3(F), and 3(A) that the reconstructed NAT using Ad
and Abs2 schemes are scaled versions of the original WAT.
The reconstructed NAT from the Af scheme is found to be
smoother than that from the DNN scheme [Fig. 3(D)].
Let us now consider the average DTW distance between
the original NATs and those reconstructed using, both, the
position and the dynamics of WATs. Table II provides these
distances for the two best performing schemes, namely, the
$A_f$ and DNN schemes. The corresponding box plots of the $d_{test}$ from these two schemes for each of the six subjects are included as supplementary material.1

TABLE I. $\bar{d}_{test}$ (SD), in mm, across all folds of the six subjects.

Schemes   F1            F2            M1            M2            M3            M4
Af        5.10 (0.85)   6.57 (1.17)   3.89 (0.68)   5.73 (1.08)   5.36 (0.97)   4.53 (0.86)
Ad        5.51 (0.83)   7.13 (1.24)   4.16 (0.70)   6.23 (1.13)   5.88 (1.12)   4.83 (0.87)
DNN       5.25 (0.20)   6.90 (0.23)   4.00 (0.17)   6.17 (0.17)   5.62 (0.18)   4.80 (0.12)
Abs1      10.51 (1.02)  10.59 (3.05)  6.55 (0.79)   7.52 (1.30)   13.08 (1.56)  6.31 (0.86)
Abs2      5.51 (0.84)   7.13 (1.25)   4.16 (0.70)   6.24 (1.14)   5.89 (1.11)   4.83 (0.88)

FIG. 3. (A) provides the DTW mapped original NAT and WAT TDx of subject F2 corresponding to the utterance “Bright sunshine shimmers on the ocean.” (B)–(F) provide the DTW mapped original and the reconstructed NAT using different schemes, mentioned in the respective figures.

Comparing Tables I
and II, we find that for each subject, the average DTW dis-
tances reduce when the $\Delta$ and $\Delta\Delta$ coefficients are considered to reconstruct the NATs. We observe a relative drop in the
dtest by 1.57%, 2.44%, 1.29%, 0.7%, 1.31%, and 1.77%,
when the velocity and acceleration coefficients are used in
the best scheme compared to when they are not, for each of
the six subjects. We perform a t-test, similar to the descrip-
tion provided in Sec. IV A, to find if the inclusion of the
dynamics decreases the dtest significantly compared to using
the position data alone. The statistical analysis reveals that
the inclusion of the dynamics significantly improves the per-
formance, for both the schemes. Therefore, we find that the
information about the dynamics of the articulatory move-
ments helps in reconstructing NATs better from WATs.
Similar to the observation from Table I, we see that the per-
formance of the Af scheme is comparable to that of the
DNN scheme. We perform a t-test to check for the signifi-
cance in the difference between the performance of the two
methods. We find that except for subject F1, the null hypoth-
esis is rejected at 5% significance level. This indicates that
the optimal TF could be approximated well using an affine
function compared to using a complex nonlinear function as
learnt by a DNN.
B. Cross subject experimental results
Since the Af and DNN schemes are found to exhibit the
least �dtest in the subject-wise setup, we report the results of
these two methods for the experiments described in Sec.
III B 2. Figure 4 provides the box plots of $d_{s_t,s_{tr}}$ for every test-train pair, for the $A_f$ and DNN schemes. In both cases, we see that the least average (also the median) DTW distance, $\bar{d}_{s_t,s_{tr}}$, is achieved when the training and test subjects are
identical (matched case). This shows that the optimal TFs
are subject dependent, which supports the hypothesis that
there could be subject specific differences in articulation to
make whispered speech more intelligible in absence of pitch.
The relative increase in $\bar{d}_{s_t,s_{tr}}$ from the matched case to
the worst mismatched case, using Af scheme and the DNN
scheme turns out to be 17.65% and 37.25% for F1, 6.55%
and 21.77% for F2, 28.54% and 54.24% for M1, 22.16% and
39.62% for M2, 30.60% and 49.25% for M3, and 10.38%
and 32.45% for M4. Hence, we find that the performance
using the worst model for the Af scheme is better than that
using the DNN scheme. This larger drop in the performance
of the DNN scheme compared to the Af scheme, in the cross
subject setup could be due to over-training in the subject
specific fine tuning of the DNN parameters. Performing a t-test as described in Sec. IV A, we find that the $d_{s_t,s_{tr}}$ from the $A_f$ scheme is statistically significantly lower than that of the DNN scheme (p-values ≤3.84e−22), for all test-train pairs.
Figure 4 also indicates that the optimal affine transformation
is more generalizable compared to the finely tuned nonlinear
TF learnt using a DNN.
C. Results of BCP specific analysis
Table III provides the details of the number of segments
for each BCP category along with the average duration of
each segment for, both, whispered and neutral speech, for
each subject. We see that the number of vowel segments is
the highest across all subjects. In a decreasing order of the
number of segments, on average, the “Vowels” category is
followed by “Fricatives,” “Stops,” “Silence,” and, finally,
“Nasals.” The differences in the average durations of differ-
ent BCP categories across neutral and whispered speech can
be observed from the table. We find that Vowels, Fricatives,
and Silence categories have a longer average duration while
whispering compared to that in neutral speech for at least
five among the six subjects.
Table IV provides the segment-wise DTW distances for
the five BCP categories, averaged across all subjects and
folds, for the different schemes considered in the study.
From the table, we see that for all BCP categories, the aver-
age distance is the least for the Af scheme. We observe that
the average distance is the highest for the “Silence” category
in all schemes. This could be due to the fact that the posi-
tions of the NATs and WATs during different silence seg-
ments may not exhibit similar patterns, and hence, are
difficult to reconstruct.

TABLE II. $\bar{d}_{test}$ (SD), in mm, across all folds of the six subjects using both the position and the dynamics of articulatory movements.

Schemes   F1            F2            M1            M2            M3            M4
Af        5.03 (0.84)   6.41 (1.14)   3.84 (0.68)   5.69 (1.07)   5.29 (0.95)   4.45 (0.88)
DNN       5.02 (0.78)   6.50 (1.05)   3.95 (0.63)   5.97 (1.00)   5.33 (0.86)   4.59 (0.81)

FIG. 4. (Color online) Box plots of $d_{s_t,s_{tr}}$ across folds for each test-train subject pair obtained from the $A_f$ (in red, left) and DNN (in blue, right) schemes.

Similar to the discussion in Sec. IV A, the $A_{bs1}$ scheme is found to perform poorly compared
to the rest, while the Abs2 scheme has a performance compa-
rable to that of the Ad scheme. The relative increase in the
average distance of the best performing category in the DNN
scheme, namely, Fricatives, compared to that in Af scheme
is found to be 5.55%. Similarly, the Nasals category with the
least average distance in the Ad scheme is seen to be 6.07%
(relative) higher than that of the Af scheme. A t-test as
described in Sec. IV A, reveals that the BCP specific DTW
distances obtained from the Af scheme is statistically signifi-
cantly lower than those from the other schemes, across all
folds, BCP categories, and subjects (p-values ≤1.2e−3).
This shows that the optimal affine TF is capable of recon-
structing the different BCP categories, better, than the other
schemes considered in the study.
V. ANALYSIS OF THE DIFFERENCES BETWEEN ARTICULATION IN NEUTRAL AND WHISPERED SPEECH
A. The Af transformation
Figure 5 shows the $N_s \times N_s$ matrices, $A = A_f$, obtained
from one fold of each of the six subjects. From the figure,
we make two major observations. First, we observe that the
matrix A is not a purely diagonal matrix, which explains the
deterioration in the performance of the Ad scheme, com-
pared to the Af scheme. Second, we observe a subject spe-
cific difference in the structure of the Af matrix. The fall in
the performance in the cross subject setting (Sec. IV B) could
be a result of this subject specific nature of the TF. It could
be that, each subject modifies the articulation during whis-
pering compared to neutral speech in his/her own specific
manner to compensate for the loss of pitch in whispered
speech.
We see that several WATs contribute in the reconstruc-
tion of a single NAT, indicating that the motion of one artic-
ulator in neutral speech is encoded in multiple articulatory
motion during whispering. In order to understand the signifi-
cance of the contribution of each WAT to a particular NAT,
we perform a t-test at 5% significance level, with a null
hypothesis that its contribution is, indeed, zero. Table V lists,
for each NAT, the WATs whose contribution is significant
in every fold of all subjects, From the table, it is clear that
the information about one NAT is captured by a few WATs.
We observe that every WAT contributes significantly
towards the reconstruction of the corresponding NAT,
except for LCz and TBz.

TABLE III. The number of segments for each BCP category, along with the average (SD) duration of each segment, in ms, for whispered and neutral speech, for each subject.

Subject  Category    Segments  Whispered, ms    Neutral, ms
F1       Vowels      5278      98.88 (46.03)    92.67 (40.73)
F1       Stops       1129      65.81 (21.45)    80.18 (24.79)
F1       Fricatives  1405      103.49 (40.98)   101.47 (40.35)
F1       Nasals      963       91.48 (31.13)    89.48 (25.86)
F1       Silence     1091      83.49 (50.66)    70.63 (42.78)
F2       Vowels      4950      91.80 (44.52)    99.43 (42.84)
F2       Stops       1154      72.10 (26.21)    88.08 (32.61)
F2       Fricatives  1226      110.53 (49.46)   129.61 (51.33)
F2       Nasals      863       95.28 (38.09)    102.29 (33.87)
F2       Silence     1306      105.55 (78.10)   89.49 (66.69)
M1       Vowels      5516      110.43 (47.98)   91.29 (38.17)
M1       Stops       1217      79.54 (28.48)    65.12 (28.97)
M1       Fricatives  1399      101.37 (38.87)   99.04 (38.67)
M1       Nasals      1030      96.14 (32.00)    98.44 (29.70)
M1       Silence     1850      101.83 (70.61)   101.42 (59.16)
M2       Vowels      5259      105.93 (57.68)   91.68 (46.34)
M2       Stops       1421      99.02 (38.61)    86.03 (34.72)
M2       Fricatives  1532      117.68 (53.34)   106.36 (42.55)
M2       Nasals      1028      96.48 (41.44)    89.14 (34.18)
M2       Silence     440       93.66 (64.05)    75.91 (40.28)
M3       Vowels      5515      112.13 (53.08)   98.50 (46.28)
M3       Stops       1327      77.62 (29.15)    87.45 (31.18)
M3       Fricatives  1547      115.92 (47.87)   110.47 (46.30)
M3       Nasals      1049      95.70 (32.94)    96.26 (30.65)
M3       Silence     1145      101.08 (59.22)   80.10 (47.08)
M4       Vowels      5518      103.43 (50.58)   80.51 (35.16)
M4       Stops       1273      77.01 (31.68)    85.70 (29.33)
M4       Fricatives  1474      115.89 (47.87)   105.07 (49.98)
M4       Nasals      1030      91.25 (35.61)    80.36 (29.53)
M4       Silence     1470      105.15 (56.56)   81.02 (50.18)

TABLE IV. The average (SD), in mm, of the segment-wise DTW distances for the five BCP categories, across all subjects and folds.

Schemes  Vowels        Stops         Fricatives    Nasals        Silence
Af       6.76 (3.96)   6.01 (3.16)   5.49 (2.89)   5.54 (2.70)   6.89 (4.05)
Ad       7.05 (3.90)   6.41 (3.14)   6.00 (2.94)   5.88 (2.67)   7.25 (4.05)
DNN      7.13 (4.06)   6.36 (3.23)   5.79 (2.96)   5.88 (2.82)   7.17 (4.11)
Abs1     10.99 (5.66)  11.02 (5.44)  10.42 (5.25)  10.40 (5.13)  11.62 (5.78)
Abs2     7.07 (3.93)   6.43 (3.16)   6.02 (2.96)   5.89 (2.69)   7.27 (4.07)

For these two NATs, the
corresponding WATs contribute significantly in every fold
of all subjects, except in one fold of subject F2. In most
cases, we observe that a WAT, apart from contributing sig-
nificantly towards the reconstruction of the corresponding
NAT, contributes significantly towards the reconstruction of
other NATs. From the table we find that for each NAT, at
least two different WATs contribute significantly for its
reconstruction. For instance, we see that the lip movements
while whispering contribute significantly to the reconstruc-
tion of five out of six NATs corresponding to the tongue.
Although the production of whispered speech does not
involve vibrations of the vocal folds, interestingly, we find
that the WATs THx and THz contribute significantly to
reconstruct the corresponding NATs. Laryngeal movement
is known to occur during human speech production for the
control of sound intensity and pitch (Curry, 1937; Ludlow,
2005). Specifically, an upward movement of the larynx is
observed during an increase in pitch (Curry, 1937). A study
based on magnetic resonance imaging, to understand the
phonation of whispered and neutral vowels, reveals that
the (upward and downward) position of the larynx, along the
mid sagittal plane, is similar across whispered and neutral
speech (Coleman et al., 2002). This is in agreement with our
finding, using EMA, that the movement of the TH sensor
along the Z direction while whispering, contributes signifi-
cantly to that in neutral speech (Table V). This indicates that
there exists a similarity in the laryngeal movements during
whispered and neutral speech.
In order to understand the similarities among the articu-
latory trajectories during neutral and whispered speech, we
compute the correlation coefficient between each of the 18
articulatory trajectories in $W_{\{m_i\}}$ with those in $N_{\{m_i\}}$. Figure
6 provides the correlation coefficient between the WATs and
the NATs, averaged across all folds and all subjects. From
the figure, we find that there exists a higher correlation
within the movements of certain sensors on the lips and
within those on the tongue during whispering and neutral
speech. In accordance with this observation, from Table V
we find that among the lip sensors, for each NAT at least one
other WAT belonging to the lips contributes significantly.
We observe a similar trend for each NAT corresponding to the tongue sensors, as well.

FIG. 5. The matrix $A_f$ obtained from one fold for each of the six subjects. A brighter pixel indicates a larger value, as indicated by the color bar.

TABLE V. Significant WATs to reconstruct each NAT.

NAT    Significantly contributing WATs
ULx    ULx, ULz, LLx, Jz
ULz    ULz, LLz, RCx, LCz, THx, TDz
LLx    ULz, LLx, RCx, LCx, Jx, THx
LLz    ULx, LLz, Jx
RCx    ULz, LLx, LLz, RCx, LCx, Jz, TBz
RCz    ULx, ULz, LLx, LLz, RCx, RCz, LCx, LCz, THz
LCx    LLz, RCx, LCx, THz, TBz
LCz    ULz, RCx, THx
Jx     ULz, Jx, THz, TDz
Jz     ULx, LLx, LLz, RCz, Jx, Jz, THx
THx    ULx, LLz, RCx, THx, THz, TBz, TDz
THz    ULx, Jx, THz
TTx    ULz, LLz, Jx, TTx, TBz, TDx, TDz
TTz    THz, TTz, TBx, TBz
TBx    ULz, LLz, Jx, Jz, THz, TTz, TBx, TDx, TDz
TBz    LCx, Jx, TBx
TDx    LLx, LLz, Jx, TBx, TDx, TDz
TDz    LCz, Jx, TBx, TBz, TDx, TDz

We observe
from the figure that the articulatory trajectories of the throat
exhibit a lower correlation with the other articulatory trajec-
tories irrespective of the degree of its proximity to the
sensors. Interestingly, from Table V we find that THz con-
tributes significantly to the reconstruction of TTz but not to
the vertical movements of the proximally close TB or TD sensors. This could be because all WATs that are highly cor-
related to a particular NAT need not contribute significantly
to reconstruct that NAT, since they could capture redundant
information (Table V).
From Fig. 6 we find that certain WATs are more corre-
lated with their neutral counterparts compared to others.
This could indicate that although the speech motor control
plans are similar between whispered and neutral speech
(Coleman et al., 2002), there could be some patterns in artic-
ulatory movements that are specific to whispered speech.
The reconstructed NATs could be used to synthesize neutral
speech by employing articulatory synthesis systems (Aryal
and Gutierrez-Osuna, 2016). We proceed to understand the
exaggeration of articulatory movements in whispered speech
in comparison to that in neutral speech using the optimal TF.
B. Quantifying the exaggeration of articulatory movements in whispered speech
We hypothesize that, corresponding to a small displace-
ment in the movement of certain neutral articulators, there
could be an exaggeration of the whispered articulators via
larger displacements in the WATs. In order to test this
hypothesis, we consider the following approach. From Eq.
(8), we see that the kth column of the affine transformation
matrix transforms the WATs to reconstruct the kth NAT. Consider $W_{\{m_i^*\}}$, constructed from the optimal set of warping paths obtained from the IFI-DTW algorithm. Let $w_p^* \in \mathbb{R}^{L \times 1}$ represent the pth WAT, corresponding to the pth column of $W_{\{m_i^*\}}^T$; let $a_{p,k}$ be the (p, k)th coefficient of the matrix $A_f$ in the optimal TF and $b_k$ be the coefficient corresponding to the DC shift. With regard to Eq. (8), the kth reconstructed NAT (the kth column of $\hat{N}_{\{m_i^*\}}^T$), $\hat{n}_k^* \in \mathbb{R}^{L \times 1}$, can be written as follows:

$$\hat{n}_k^* = \sum_{p=1}^{N_s} a_{p,k} w_p^* + b_k \mathbf{1}_{L \times 1} \;\;\Rightarrow\;\; \sum_{p=1}^{N_s} a_{p,k} w_p^* + b_k \mathbf{1}_{L \times 1} - \hat{n}_k^* = 0. \quad (12)$$
To study the amount of contribution by different WATs
to reconstruct a particular NAT, we compute the angle
between the corresponding transformation plane (TP) given
in Eq. (12) and a reference plane. Since a DC shift in the TP
is of no consequence in the computation of the angle, we
neglect the effect of the DC shift coefficient, $b_k$. Therefore, from Eq. (12), we see that the normal vector of the TP to reconstruct the kth NAT is given by $[a_{1,k}, \ldots, a_{N_s,k}, -1]^T$. The normal to the reference plane is considered as $[\mathbf{0}_{N_s \times 1}^T, 1]^T$. Since the angle between the two planes is given by the angle between their normal vectors, we compute $\theta_k$ as follows:

$$\theta_k = \cos^{-1}\left(\frac{1}{\sqrt{1 + \sum_{p=1}^{N_s} a_{p,k}^2}}\right). \quad (13)$$
Let us consider the case when $\theta_k < 45°$, which, equivalently, results in the condition $\sum_{p=1}^{N_s} a_{p,k}^2 < 1$. This implies that $0 \le a_{p,k} < 1$, $\forall p, k$. A value of $a_{p,k} < 1$ indicates that there are higher variations in the movements of the pth WAT in order to reconstruct the kth NAT. $\theta_k < 45°$ indicates that the WATs exhibit a larger variation in their movements in the 18-dimensional space in order to produce a small variation in the kth NAT. Hence, a low $\theta_k$ ($\theta_k < 45°$) would indicate that the whispered articulatory movements could be exaggerated in order to reconstruct a small displacement of the kth NAT.
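A short sketch of this per-NAT angle computation from the optimal full affine matrix, following Eq. (13), is given below (helper name is our own).

def transformation_plane_angles(A):
    """Angle (in degrees) between each NAT's transformation plane and the
    reference plane, computed from the Ns x Ns affine matrix A_f (Eq. 13)."""
    # theta_k = arccos(1 / sqrt(1 + sum_p a_{p,k}^2)), one angle per column k
    col_energy = (A ** 2).sum(axis=0)
    return np.degrees(np.arccos(1.0 / np.sqrt(1.0 + col_energy)))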
For each subject, we compute the angles $\theta_k$, $k = 1, \ldots, N_s$, corresponding to the $N_s$ NATs in every fold. We observe that across all subjects and folds, on average, 9.63 (±3.35) NATs have an angle less than 45°. Therefore, for each subject, we find those NATs whose TP has an angle among the lowest five (out of 18) in at least three out of the four folds. We observe that the number of NATs that require exaggerated movements of the whispered articulators varies across subjects. The lip, jaw, and throat articulators turn out to have an angle lower than 45° for at least one subject. Averaged across all subjects and folds, the TPs of the neutral articulators RCz, LLx, LCx, ULx, LCz, RCx, Jx, and THx have an angle of 38.70° (±2.66°), 37.32° (±4.24°), 32.54° (±10.22°), 30.89° (±3.99°), 29.84° (±4.15°), 28.93° (±1.03°), 22.54° (±1.23°), and 21.78° (±3.38°), respectively.
Specifically, we find that three sensors on the lips, namely,
ULx, LLx, and LCx have lower angles for at least three among
six subjects. This could indicate that (1) the reconstruction
of the neutral articulatory movements that require exaggera-
tion in the whispered articulation could be subject dependent
and (2) the reconstruction of the movements of the sensors
on the lips during neutral speech requires exaggerated movements of the WATs.

FIG. 6. The correlation coefficient matrix between the WATs and the NATs, averaged across all folds of the six subjects.
C. Stability and precision of whispered articulatory movements
Study of palato-lingual contact patterns using electro-
palatography has shown that the articulation in whispered
speech is more stable, hence less variable and more precise
leading to a lower velocity of whispered articulatory move-
ments, compared to those in neutral speech (Osfar, 2011).
Osfar claims that an increase in the stability and precision in
the movements of articulators while whispering is an indica-
tion of hyperarticulation during whispering. Unlike this
work, where the primary focus is to understand the hyperarti-
culation by the tongue, in our work, we study the effects of
whispering on articulation, by the lips, jaw and throat, in
addition to the tongue, using EMA.
First, we analyze the precision in the whispered articula-
tion with regard to the velocity of the articulatory move-
ments while whispering. We compute the velocity of the
articulatory movements, in terms of their delta coefficients
($\Delta$). Similar to $W_{\{m_i\}}$ and $N_{\{m_i\}}$ (Sec. II A), we define $\Delta W_{\{m_i\}} = [\Delta W_1^{m_1}, \ldots, \Delta W_N^{m_N}] \in \mathbb{R}^{N_s \times L}$ and $\Delta N_{\{m_i\}} = [\Delta N_1^{m_1}, \ldots, \Delta N_N^{m_N}] \in \mathbb{R}^{N_s \times L}$. In order to examine the relative changes in the velocity of articulatory movements during neutral and whispered speech, we learn an optimal diagonal affine transformation function (Sec. II B 2) between $\Delta W_{\{m_i\}}$ and $\Delta N_{\{m_i\}}$, following the optimization in Eq. (6), as follows:

$$F = \arg\min_f \left\| f(\Delta W_{\{m_i\}}) - \Delta N_{\{m_i\}} \right\|_2^2. \quad (14)$$
The warping paths are optimized using the position data, as
given in Eq. (4). We now examine the (p, p)th coefficient
(p¼ 1,…, Ns) of the optimal diagonal TF obtained using Eq.
(14). A coefficient greater than 1 indicates that the velocity
of the whispered articulator is lower than that of the neutral
articulator.
Table VI lists the set of articulators whose coefficient in
the diagonal TF, obtained from Eq. (14), is greater than 1, in
at least in one of the four folds for each subject. From the
table, we observe that the set of articulators that exhibit a
lower velocity in whispered speech is subject dependent.
This could indicate a subject-specific nature of hyperarticu-
lation in whispered speech. Interestingly, we find that for
every subject, at least one sensor on the tongue, shows
reduction in its velocity, and, hence, more precise move-
ments. This is in accordance with the findings by Osfar
(2011), in which the tongue movements were found to be more
precise in whispered speech.
Figure 7 shows, in the order of decreasing value, the
coefficients in the optimal TF, averaged across folds and
subjects. From the figure, we find that the sensors on the
tongue and the jaw exhibit a higher precision in their move-
ments compared to the sensors on the lips. Specifically, the
articulatory trajectories of Jx, TTx, TTz, TBx, and TDx are
observed to have a lower velocity while whispering for at
least three among six subjects. Interestingly, we observe that
among these WATs, Jx and TBx contribute significantly to
reconstruct ten among eighteen NATs (Table V).
Motivated by the work by Osfar (2011), we also exam-
ine which among the Ns whispered articulators, exhibit
reduced variability and, hence, more stability compared to
their neutral counterparts. For this, we compute the SD of
the velocities of WATs and NATs using samples in the kth
column of $\Delta W_{\{m_i\}}^T$ and $\Delta N_{\{m_i\}}^T$, respectively, as $\sigma_k^{\Delta w}$ and $\sigma_k^{\Delta n}$. We then compute the variance ratio, $VR_k = (\sigma_k^{\Delta w})^2 / (\sigma_k^{\Delta n})^2$. A value of $VR_k < 1$ indicates that the movement of the kth whispered articulator is more stable, since the variability of the velocity of the kth WAT is lower than that of the kth NAT. In agreement with the previous findings, we
observe that the average VR of the tongue sensors is lower
compared to those of other articulators. Specifically, the sen-
sors TBx, TDx, TTx, and TDz are observed to have a VR< 1,
consistently, in every fold for all subjects. Their average
(SD) VR turns out to be 0.55 (±0.11) for TBx, 0.56 (±0.12) for TDx, 0.63 (±0.18) for TTx, and 0.65 (±0.08) for TDz. This
indicates that there exists a greater stability in the movement
of the tongue, while whispering. Comparing with the find-
ings of the precision analysis of the articulatory movements,
we observe that most sensors placed on the tongue, show an
increase in, both, stability and precision in their movements,
while whispering. It could be that controlling the articulation of the tongue is key to improving the intelligibility of whispered speech, compared to the other articulators considered in this study.

TABLE VI. Subject-wise listing of articulators that exhibit movements of reduced velocity during whispering compared to those in neutral speech.

Subject  Articulators with lower velocity
F1       THz, TTz, TBx, TBz
F2       RCx, Jx, TTx, TBx, TDx
M1       TDx
M2       Jz, THx, THz, TTz, TBx, TDx, TDz
M3       LLx, RCx, LCx, LCz, Jx, Jz, THx, THz, TTx, TTz, TBx, TBz, TDx, TDz
M4       ULz, LLx, LCz, Jx, Jz, THx, TTx, TBx, TDx

FIG. 7. Coefficients of the optimal diagonal TF, averaged across all subjects and folds. The error bar indicates SD.
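A sketch of the velocity-based precision and stability measures of this subsection follows (the helper name and the gradient-based velocity estimate on the aligned frames are our own assumptions; the diagonal TF coefficient follows Eq. (14) and VR_k follows the definition above).

def velocity_precision_and_stability(W_cat, N_cat):
    """Per-articulator velocity analysis on DTW-aligned frames (Ns x L each).

    Returns the diagonal TF coefficients of Eq. (14) (values > 1 suggest a
    lower whispered velocity, i.e., a more precise movement) and the variance
    ratio VR_k (values < 1 suggest a more stable whispered movement)."""
    dW = np.gradient(W_cat, axis=1)       # velocities of the aligned WATs
    dN = np.gradient(N_cat, axis=1)       # velocities of the aligned NATs
    # diagonal least-squares fit dN_k ~ a_k * dW_k + b_k, per trajectory k
    dW_c = dW - dW.mean(axis=1, keepdims=True)
    dN_c = dN - dN.mean(axis=1, keepdims=True)
    a = (dW_c * dN_c).sum(axis=1) / (dW_c ** 2).sum(axis=1)
    vr = dW.var(axis=1) / dN.var(axis=1)
    return a, vr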
VI. CONCLUSION
In this work, we use the IFI-DTW optimization to find
an optimal TF that transforms whispered articulatory move-
ments into those of neutral speech. Among several candidate
TFs, we find that an affine transformation with a full matrix
turns out to be the best TF to achieve the minimum distance between the NATs and the transformed WATs, both at the utterance level and for different BCP categories. This indicates that information about
a particular articulator’s movements in neutral speech is cap-
tured by those of several articulators while whispering. We
also find that this TF generalizes better across different subjects compared to a DNN based nonlinear TF. It could be that
exaggerated articulatory movements need not result in a
highly nonlinear transformation between WAT and NAT,
but, in fact, could be well approximated by an affine trans-
formation. Analysis of the exaggerated articulatory move-
ments while whispering reveals that stable and precise
movements of the tongue are vital for the compensation of
the lack of intelligibility in whispered speech. Analyzing the
phoneme specific optimal TF, language specific effects in
the reconstruction and synthesizing neutral speech from the
reconstructed neutral articulatory trajectories are parts of our
future work.
ACKNOWLEDGMENTS
We thank the six subjects for their participation,
Aravind Illa for assisting with the data collection, and the
Pratiksha Trust for their support.
1See supplementary material at https://doi.org/10.1121/1.5039750 to view
box plots of the dtest from the five schemes for each of the six subjects,
considering only the position data, and box plots of the dtest from the Af
and DNN schemes for each of the six subjects, considering position and
the dynamics, in the subject-wise experiments.
3D Electromagnetic Articulograph (1979), http://www.articulograph.de/
(Last viewed September 14, 2017).
Ahmadi, F., McLoughlin, I. V., and Sharifzadeh, H. R. (2008). “Analysis-
by-synthesis method for whisper-speech reconstruction,” in IEEE Asia Pacific Conference on Circuits and Systems, APCCAS, pp. 1280–1283.
Aryal, S., and Gutierrez-Osuna, R. (2016). “Data driven articulatory synthe-
sis with deep neural networks,” Comput. Speech Lang. 36(C), 260–273.
Beskow, J. (2003). “Talking heads-models and applications for multimodal
speech synthesis,” Ph.D. thesis, Institutionen för Talöverföring och
Musikakustik, Stockholm, Sweden.
Chollet, F. (2015). “keras,” https://github.com/fchollet/keras (Last viewed
September 14, 2017).
Coleman, J., Grabe, E., and Braun, B. (2002). “Larynx movements and into-
nation in whispered speech,” Summary of research supported by British
Academy.
Curry, R. (1937). “The mechanism of pitch change in the voice,” J. Physiol.
91(3), 254–258.
Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and
Brumberg, J. S. (2010). “Silent speech interfaces,” Speech Commun.
52(4), 270–287.
Fagan, M., Ell, S., Gilbert, J., Sarrazin, E., and Chapman, P. (2008).
“Development of a (silent) speech recognition system for patients follow-
ing laryngectomy,” Med. Eng. Phys. 30(4), 419–425.
Fagel, S., and Clemens, C. (2004). “An articulation model for audiovisual
speech synthesis—determination, adjustment, evaluation,” Speech
Commun. 44(1), 141–154.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S.
(1993). “DARPA TIMIT acoustic-phonetic continuous speech corpus
CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report
No. 93.
Ghosh, P. K., and Narayanan, S. (2010). “A generalized smoothness crite-
rion for acoustic-to-articulatory inversion,” J. Acoust. Soc. Am. 128(4),
2162–2172.
Gilchrist, A. G. (1973). “Rehabilitation after laryngectomy,” Acta Oto-
Laryngologica 75(2-6), 511–518.
Gonzalez, J. A., Cheah, L. A., Gilbert, J. M., Bai, J., Ell, S. R., Green, P. D.,
and Moore, R. K. (2016). “A silent speech system based on permanent
magnet articulography and direct synthesis,” Comput. Speech Lang. 39,
67–87.
Higashikawa, M., Green, J., Moore, C., and Minifie, F. (2003). “Lip kine-
matics for /p/ and /b/ production during whispered and voiced speech,”
Folia Phoniatr. Logop. 55, 1–9.
Jackson, P. J., and Singampalli, V. D. (2008). “Statistical identification of
critical, dependent and redundant articulators,” J. Acoust. Soc. Am.
123(5), 3321–3321.
Janke, M., Wand, M., Heistermann, T., Schultz, T., and Prahallad, K.
(2014). “Fundamental frequency generation for whisper-to-audible speech
conversion,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2579–2583.
Jovičić, S. T., and Šarić, Z. (2008). “Acoustic analysis of consonants in
whispered speech,” J. Voice 22(3), 263–274.
Kingma, D. P., and Ba, J. (2014). “Adam: A method for stochastic opti-
mization,” arXiv:1412.6980.
Lee, K. F., and Hon, H. W. (1989). “Speaker-independent phone recognition
using hidden Markov models,” IEEE Trans. Acoust. Speech. Sign.
Process. 37(11), 1641–1648.
Ludlow, C. L. (2005). “Central nervous system control of the laryngeal
muscles in humans,” Respirat. Physiol. Neurobiol. 147(2), 205–222.
Mcloughlin, I. V., Sharifzadeh, H. R., Tan, S. L., Li, J., and Song, Y.
(2015). “Reconstruction of phonated speech from whispers using formant-
derived plausible pitch modulation,” ACM Trans. Access. Comput.
(TACCESS) 6(4), 12.
Morris, R. W., and Clements, M. A. (2002). “Reconstruction of speech from
whispers,” Med. Eng. Phys. 24(7), 515–520.
Müller, M. (2007). “Dynamic time warping,” in Information Retrieval for Music and Motion, pp. 69–84.
Osfar, M. J. (2011). “Articulation of whispered alveolar consonants,”
Master’s thesis, University of Illinois at Urbana-Champaign, Champaign,
IL.
Parnell, M., Amerman, J. D., and Wells, G. B. (1977). “Closure and con-
striction duration for alveolar consonants during voiced and whispered
speaking conditions,” J. Acoust. Soc. Am. 61, 612–613.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N.,
Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J.,
Stemmer, G., and Vesely, K. (2011). “The Kaldi Speech Recognition
Toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding.
Qiao, Y., and Yasuhara, M. (2006). “Affine invariant dynamic time warping
and its application to online rotated handwriting recognition,” in 18th International Conference on Pattern Recognition (ICPR’06), Vol. 2, pp.
905–908.
Scanlon, P., Ellis, D. P. W., and Reilly, R. B. (2007). “Using broad phonetic
group experts for improved speech recognition,” IEEE Trans. Audio
Speech Lang. Process. 15(3), 803–812.
Schönle, P. W., Gräbe, K., Wenig, P., Höhne, J., Schrader, J., and Conrad,
B. (1987). “Electromagnetic articulography: Use of alternating magnetic
fields for tracking movements of multiple points inside and outside the
vocal tract,” Brain Lang. 31(1), 26–35.
Schwartz, M. F. (1972). “Bilabial closure durations for /p/, /b/, and /m/ in
voiced and whispered vowel environments,” J. Acoust. Soc. Am. 51,
2025–2029.
Sharifzadeh, H. R., McLoughlin, I. V., and Ahmadi, F. (2010).
“Reconstruction of normal sounding speech for laryngectomy patients
through a modified CELP codec,” IEEE Trans. Biomed. Eng. 57(10),
2448–2458.
Tartter, V. C. (1989). “What’s in a whisper?,” J. Acoust. Soc. Am. 86,
1678–1683.
Team, T. T. D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C.,
Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. (2016).
“Theano: A python framework for fast computation of mathematical
expressions,” arXiv:1605.02688.
Toda, T., and Shikano, K. (2005). “NAM-to-speech conversion with
Gaussian mixture models,” in INTERSPEECH, pp. 1957–1960.
Toutios, A., and Maeda, S. (2012). “Articulatory VCV synthesis from EMA
data,” in INTERSPEECH, pp. 2566–2569.
Toutios, A., and Narayanan, S. (2013). “Articulatory synthesis of French
connected speech from EMA data,” in INTERSPEECH, pp. 2738–2742.
Wang, J., Hahm, S., and Mau, T. (2015). “Determining an optimal set of
flesh points on tongue, lips, and jaw for continuous silent speech recog-
nition,” in Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies, Association for
Computational Linguistics, Dresden, Germany, pp. 79–85.
Wang, J., Samal, A., and Green, J. R. (2014). “Preliminary test of a real-
time, interactive silent speech interface based on electromagnetic
articulograph,” in SLPAT@ACL, Association for Computational
Linguistics, pp. 38–45.
Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012a). “Sentence recog-
nition from articulatory movements for silent speech interfaces,” in
ICASSP, IEEE, pp. 4985–4988.
Wang, J., Samal, A., Green, J. R., and Rudzicz, F. (2012b). “Whole-word
recognition from articulatory movements for silent speech interfaces,” in
INTERSPEECH, ISCA, pp. 1327–1330.
Wrench, A. (1999). “MOCHA-TIMIT,” speech database.
Wszołek, W., Modrzejewski, M., and Przysiezny, M. (2014). “Acoustic
analysis of esophageal speech in patients after total laryngectomy,” Arch.
Acoust. 32(4), 151–158.
Yoshioka, H. (2008). “The role of tongue articulation for /s/ and /z/
production in whispered speech,” in Proceedings of Acoustics, pp.
2335–2338.