Ultrasound tongue contour extraction using BowNet network: A deep learning approach
M. Hamed Mozaffari, David Sankoff, and Won-Sook Lee

Citation: Proc. Mtgs. Acoust. 39, 020001 (2019); doi: 10.1121/2.0001117
View online: https://doi.org/10.1121/2.0001117
Published by the Acoustical Society of America
Volume 39  http://acousticalsociety.org/

178th Meeting of the Acoustical Society of America
San Diego, California, 2-6 December 2019

Biomedical Acoustics: Paper 3pBA3
Ultrasound tongue contour extraction using BowNet network: A deep learning approach

M. Hamed Mozaffari
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, K1N 6N5, CANADA; [email protected]

David Sankoff
Department of Mathematics and Statistics, University of Ottawa, 150 Louis Pasteur Pvt., Rooms 635-636, Ottawa, Ontario K1N 6N5, CANADA; [email protected]

Won-Sook Lee
School of Electrical Engineering and Computer Science (EECS), University of Ottawa, Ottawa, Ontario, K1N 6N5, CANADA; [email protected]
Ultrasound imaging is safe, relatively affordable, and capable of real-time performance. It has been used to visualize and analyze the function of human organs, including tongue shape and motion during speech, in order to study healthy and impaired speech production. Because ultrasound images are noisy and low-contrast, non-expert users may find it difficult to recognize tongue shapes. To alleviate this difficulty, and to enable quantitative analysis of tongue shape and motion, the tongue surface can be extracted, tracked, and visualized instead of the whole tongue region. Progress in deep convolutional neural networks has been successfully exploited in various computer vision tasks. This paper presents two novel deep neural network models, named BowNet and wBowNet, that benefit from both the global prediction of encoding-decoding models, which integrate multi-scale contextual information, and the full-resolution (local) extraction capability of dilated convolutions. Experiments on ultrasound tongue image datasets from different ultrasound devices revealed that combining localization and globalization searching significantly improves prediction results. Qualitative and quantitative assessment of the BowNet models in real-time showed outstanding accuracy and robustness in comparison with similar techniques.
Published by the Acoustical Society of America
© 2019 Acoustical Society of America. https://doi.org/10.1121/2.0001117 Proceedings of Meetings on Acoustics, Vol. 39, 020001 (2019) Page 1
1. INTRODUCTION
Ultrasound technology has been a well-known method in speech research for studying tongue motion and
speech articulation [1]. The popularity of ultrasound imaging for tongue visualization stems from attractive
characteristics such as a reasonably rapid frame rate, which allows researchers to visualize subtle and
swift gestures of the tongue during speech in real-time. Moreover, ultrasound technology is portable, relatively
affordable, and clinically safe with a non-invasive nature [2]. Studying and exploiting the dynamic nature of
speech data from ultrasound tongue image sequences might provide valuable information for linguistics
researchers, and it is of great interest in many recent studies [3]. Ultrasound imaging has been utilized for tongue
motion analysis in the treatment of speech sound disorders [4], comparison of healthy and impaired speech production
[3], second language training and rehabilitation [5], silent speech interfaces [6], swallowing research [7],
and 3D tongue modelling [8], to name a few.
During speech data acquisition, the ultrasound probe under the user's jaw images the tongue surface in the
mid-sagittal or coronal view in real-time. The mid-sagittal view of the tongue is usually adopted, as it illustrates
the relative backness, height, and slope of various areas of the tongue. In this view, the tongue dorsum
appears as a thick, long, bright, and continuous region, due to the tissue-air reflection of the ultrasound
signal by the air around the tongue. Owing to the noisy, low-contrast nature of ultrasound images, it is
not an easy task for non-expert users to localize the tongue surface. To address this difficulty, researchers utilize
extracted tongue surface data in the form of a curve instead of the whole tongue region. Manual segmentation
of the tongue contour and analysis of each frame suffer from several drawbacks such as bias depending on the
skill of the user or quality of the data, fatigue due to a large number of image frames to be analyzed, and lack of
achieving reproducible results [9]. Classical semi-automatic methods of tongue contour extraction, such as image
processing techniques [10], work well only on the specific datasets they were designed for. Furthermore, the
initialization step is vital to the accuracy of each method. Therefore, a fully automatic system for
tongue contour extraction, capable of working on any dataset with acceptable results, is still an open topic in
the literature. The problem is even harder for tongue contour tracking in real-time applications [11].
Figure 1 illustrates the process of ultrasound tongue imaging, where the tongue surface was manually extracted
for the sake of illustration.
Figure 1. A sample frame of ultrasound tongue imaging data acquired from the mid-sagittal view, with the
ultrasound probe held beneath the chin. The reflection from the tongue dorsum appears as a bright-to-black
gradient region marked by the red curve. The yellow dotted curve is the tongue contour. On
the lower left side of the image, the shadow of the jawbone is clearly visible.
The remainder of this paper is structured as follows. A literature review of the field is covered in section 2.
Section 3 describes in detail our methodology and the architecture of the BowNet models. The experimental results and a discussion of our proposed segmentation techniques are presented in section 4. Section 5 concludes and
outlines future work directions.
M. H. Mozaffari et al. Dilated convolution for ultrasound tongue extraction
Proceedings of Meetings on Acoustics, Vol. 39, 020001 (2019) Page 2
2. LITERATURE REVIEW AND BACKGROUND
Automatic tracking and extraction of the tongue contour have been investigated in several recent studies,
using active contour models or snakes [12]–[16], graph-based techniques [17], machine learning-based
methods [1], [18]–[21], and many more. A complete list of recent tongue contour extraction techniques can
be found in a study by Laporte et al. [3]. Although the methods above have been applied successfully to
ultrasound tongue contour extraction, manual labelling and initialization are still frequently needed. Users must
manually label at least one frame, with the restriction of drawing near the tongue region [3]. For instance, to use
the Autotrace, EdgeTrak, or TongueTrack software, users must annotate several points on at least one frame.
Deep convolutional neural networks (DCNNs) have been the method of choice for many computer vision
applications in recent years. They have shown outstanding performance in many image classification tasks, as
well as in object detection, recognition, and tracking [22]. Through inspiration, modification, and adaptation of
several well-known deep classification networks [23]–[25], fully convolutional neural (FCN) networks were
successfully exploited for image segmentation tasks [26]. The performance of the FCN model has been improved
by employing several different operators such as deconvolution [27], skip-connection and concatenation from
previous layers [28], [29], indexed un-pooling [30], and post-processing stages such as CRFs [31].
Many of these innovations significantly improved the accuracy of segmentation results, usually at the
expense of more computational costs due to the significant number of network parameters. Furthermore,
consecutive pooling layers that are used to improve receptive field and localization invariance [32] cause a
considerable reduction of feature resolution in decoding section [33]. To alleviate this issue, atrous or dilated
convolutions have been proposed and employed recently [31]–[35]. Dilated convolutions help a network model
to predict instances without losing the receptive field, with fewer learnable parameters. DCNNs have been used
in a few studies for the problem of ultrasound tongue extraction. In [36], a Restricted Boltzmann Machine (RBM)
was trained first on samples of ultrasound tongue images and ground truth labels in the form of encoder-decoder
networks. Then, the trained decoder part of the RBM was tuned and utilized for prediction of new instances in a
translational fashion from the trained network to the test network. To automatically extract tongue contours
without any manipulation on a large number of image frames, modified versions of U-net [28] have been used
recently for ultrasound tongue extraction [11], [37], [38].
In this study, we introduce a new method for automatic tongue contour extraction in real-time that handles
the difficulties of previous techniques we experienced in our research. We propose two robust, accurate, fully
automatic deep learning architectures, named BowNet and wBowNet, specifically for tongue contour extraction.
The proposed models have the capability of real-time performance using GPU power. An alternative solution
for CPU only purposes has been reported in previous research [11]. For the first time, dilated convolution is
combined with the standard deep, dense classification method to extract both local and global context from each
frame at the same time in an end-to-end fashion. Unlike previous studies, which evaluate their models on a
private dataset, the BowNet models were assessed on two comprehensive datasets of tongue images and ground
truths, each of which contained two types of data from two different ultrasound machines to test the overfitting
problem and robustness of each technique in terms of generalization capability.
3. METHODOLOGY
In the first version of the DeepLab model (V1), CRF was used as a post-processing stage to increase the
output feature-map resolution of the network and acquire better local delineation [31]. In the next
generations of the DeepLab models (V2 and V3) [34], [39], more accurate segmentation results were achieved
by transferring information from different layers of the model, modified and concatenated with separate layers.
Similarly, we tested multiscale contextual reasoning from successive pooling and subsampling layers (global
exploration), combined with the full-resolution output of dilated convolutions [35].
We developed two new DCNNs (named BowNet and wBowNet) for the problem of ultrasound tongue
extraction, with few learnable parameters for the sake of real-time performance. BowNet is a parallel
combination of a dense classification architecture, inspired by VGG16 [23] and U-net [28], with a segmentation
model inspired by the DeepLabV3 network, where dilated convolution layers are used without pooling layers
[33]. wBowNet is the counterpart of BowNet, in which the classification and segmentation networks are
interconnected to support an even higher resolution in the prediction outcomes. As part of
this work, we also re-examine the performance of repurposed versions of U-net [28] and DeepLabV3 [33]. In
our designs, we considered the effect of the checkerboard [40] and gridding [41] artifacts of deconvolutional and
dilated layers. Our experimental results show that using both ideas, omitting pooling layers in one path and
varying dilation factors in the other path of the BowNet models, alleviates the effect of those artifacts.
A. DILATED CONVOLUTION
The consecutive combination of convolutional and pooling layers of a DCNN model results in a significantly
lower spatial resolution of output feature maps, typically by a factor of 32 for modern architectures that are not
designed from scratch for the segmentation task [33]. Several ideas have been proposed to reconstruct an input-
size segmented output from the coarse feature map of the encoding part. Interpolation (up-sampling) is the
simplest solution, where each pixel of the feature map is repeated to increase the image size [26]. This method
of down-sampling and up-sampling, with its inevitable loss of information, is not beneficial for segmentation
tasks where high-resolution boundary delineation is required. Transposed convolution (sometimes
called deconvolution) has been introduced to solve this issue and to recover the low-resolution prediction maps
[42]. A transposed convolutional layer operates in the opposite direction of a convolutional layer: each pixel of the feature map
is expanded to the kernel size and superimposed with its neighbours. Although deconvolution improved the
segmentation results [27], it still suffers from the checkerboard problem [40], as well as an increased number of
trainable parameters. Other techniques, such as indexed un-pooling [30], [43], require memory for saving the
positions at which max-pooling was applied in a previous feature map.
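To make the trade-off concrete, here is a minimal numpy sketch (ours, not code from the paper) of the simplest reconstruction strategy mentioned above, nearest-neighbour interpolation, where every coarse pixel is simply repeated:

```python
import numpy as np

def nearest_upsample(feature_map, factor):
    """Nearest-neighbour up-sampling: each pixel of the coarse
    feature map is repeated `factor` times along both axes."""
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

coarse = np.array([[1, 2],
                   [3, 4]])
fine = nearest_upsample(coarse, 2)
# fine is a 4x4 map; each original pixel now covers a 2x2 block,
# so detail lost during down-sampling is NOT recovered, only spread out.
```

This illustrates why pure interpolation cannot restore high-resolution boundaries: no new information is introduced, each coarse activation is merely duplicated.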
Dilated convolution (sometimes called atrous convolution) has been utilized in previous studies [31], [34],
[39] with successful results. In this method, instead of downsampling the feature map for the next layer, the kernel
is sub-sampled at a sampling rate (called the dilation rate or factor). The receptive field can be effectively
expanded, while keeping the spatial resolution, by increasing the dilation factor monotonically. Therefore, dilated
layers can be used instead of pooling layers, and the spatial resolution of feature maps remains almost constant along
the forward path of a DCNN. However, the sparsity of dilated kernels does not always improve performance,
especially for small objects and details [34]. One solution to this problem is to decrease the
dilation factor throughout the decoding path of the network model, similar to the number of kernels in the U-net
network [34]. Figure 2 illustrates samples of decoding strategies using interpolation, deconvolution, and dilated
convolution.
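The resolution-preserving behaviour of dilation can be demonstrated with a small, self-contained sketch (illustrative only, not the paper's implementation), shown here in one dimension for clarity:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D convolution with a dilated kernel: the kernel taps are
    spaced `dilation` samples apart, so the receptive field grows to
    dilation * (len(kernel) - 1) + 1 with the SAME number of weights."""
    k = len(kernel)
    span = dilation * (k - 1) + 1          # effective receptive field
    out = np.empty(len(signal) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * signal[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
y1 = dilated_conv1d(x, [1, 1, 1], dilation=1)  # receptive field 3
y2 = dilated_conv1d(x, [1, 1, 1], dilation=2)  # receptive field 5, same 3 weights
```

With dilation 2 the same three-tap kernel covers five input samples, so context grows without pooling and without extra learnable parameters, exactly the property the text exploits.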
B. NETWORK ARCHITECTURES
Figure 3 illustrates the network architecture of the BowNet model. BowNet has two separate sub-network
forward paths whose results are concatenated in the last layer, followed by a fully
convolutional layer. In contrast, wBowNet has two interconnected (woven) sub-networks. The
connection of layers in wBowNet is shown in Figure 4. Two popular DCNN models, U-net [28] and
DeepLabV3 [32], [34], were modified and simplified for a fair comparison; we name them sU-net and
sDeepLabV3. To provide a better accuracy baseline for the proposed models, the original U-net and
DeepLabV3 were implemented and tested as well. Detecting an object such as a tongue contour in relatively
noisy ultrasound images requires attention to both context and resolution.
In this application, context can be achieved using dense classification methods [11]. Nevertheless, up-
sampling methods such as deconvolution are not able to recover the low-level visual features which are lost in
the down-sampling stage [22], resulting in a low-resolution prediction map as the model output. To address this
issue, we propose two deep network models that consider both localization and globalization detection. In each
network architecture, convolutional layers use the ReLU activation as the non-linearity. Convolutional layers are
followed by drop-out and batch normalization layers to improve the regularization and accuracy of the
network models. For the dilation factor, we increase the value from one to eight and then decrease it back to
one along a forward path, to solve the problem of spatial inconsistency (gridding) [34], [41].
Multiclass cross-entropy is employed as the loss function for training the networks, with a SoftMax
function in the last layer. Note that the feature-map size shrinks after
applying each convolutional layer, due to the overlap of the convolution kernels with the border areas. For
this reason, to concatenate feature maps from previous layers, it is necessary to crop them to matching
sizes (see the faded blue boxes in Figures 3 and 4).
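The symmetric dilation-rate schedule described above can be written as a small helper (our illustrative sketch, assuming the doubling steps listed in the hyperparameter search of section 4.A, not the authors' exact code):

```python
def dilation_schedule(max_rate=8):
    """Symmetric dilation-rate schedule: rates double from 1 up to
    max_rate, then halve back down to 1. Keeping the decrease mirrors
    the increase avoids the gridding artifact of a monotone ramp."""
    up = []
    rate = 1
    while rate <= max_rate:
        up.append(rate)
        rate *= 2
    return up + up[-2::-1]  # mirror, without repeating the peak

# dilation_schedule() gives the 1 -> 8 -> 1 ramp used for the dilated path.
```

Each entry would be the `dilation_rate` of one convolutional layer along the dilated forward path.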
Figure 2. Several approaches to restoring a low-resolution feature map to the input image size. (a) and (b):
up-sampling using linear interpolation and transposed convolution, respectively. (c) Dilated convolution with
dilation factors of 1, 2, and 3.
4. EXPERIMENTAL RESULTS
A. SYSTEM SETUP
In order to train our proposed network models, we first conducted an extensive random-search
hyperparameter tuning [44] to find the optimum value of the parameters of each network: filter count
(doubled in each consecutive layer, starting from 16, 32, or 64), kernel size (3 × 3 and 5 × 5), dilation factor (doubled
in each consecutive layer, starting from 1 or 2), the number of global iterations (iteration and epoch size), batch
size (10, 20, or 50, depending on GPU memory), augmentation parameters (online and offline), drop-out factor
(0.5, 0.7, or variable), padding type (zero or valid), normalization layer (with or without), optimization
method (SGD or Adam [45]), and the type of activation layer (Tanh or ReLU). We also tested several
network configurations for the BowNet, varying the number of layers in the encoding-decoding
forward paths (3, 4, and 5), and for the wBowNet, varying the number of dilated convolution layers.
In this experiment, we set real-time performance as a restriction for all the architectures in this study.
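A random search over these grids can be sketched as follows (a toy illustration; the trial-evaluation step, which in the paper means training each candidate network and comparing validation losses, is omitted, and the grid names are our own):

```python
import random

# Search space transcribed from the grids listed above; names are ours.
SEARCH_SPACE = {
    "base_filters": [16, 32, 64],    # doubled in each consecutive layer
    "kernel_size": [3, 5],
    "base_dilation": [1, 2],
    "batch_size": [10, 20, 50],
    "dropout": [0.5, 0.7],
    "padding": ["zero", "valid"],
    "optimizer": ["SGD", "Adam"],
    "activation": ["tanh", "relu"],
}

def sample_config(rng=random):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

# Random search: draw N candidate configurations; each would then be
# trained and ranked by validation loss.
trials = [sample_config() for _ in range(20)]
```

Random search samples the joint space directly instead of enumerating the full grid, which is why it scales to this many hyperparameters [44].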
Figure 3. Overview of the proposed BowNet network architecture. In each layer, filter kernels are depicted using
boxes. The green and orange boxes are results of standard and dilated convolution layers, respectively.
Figure 4. Overview of the proposed wBowNet network architecture. In each layer, filter kernels are depicted using
boxes. The green and orange boxes are results of regular and dilated convolution layers, respectively.
Our results from hyperparameter tuning revealed that, besides network architecture size, the learning rate
has the most significant effect on the performance of each architecture in terms of accuracy. Testing fixed and
scheduled decaying learning rates showed that the variable learning rate might provide better results, but it
requires different initializations of the decay factor and decay steps. Therefore, for the sake of a fair comparison, we
only report results using fixed learning rates. Table 1 presents the selected learning rates for each proposed
network model.
Table 1. Fixed Learning-rate (LR) tuning using Best training and validation loss (BTL, BVL).
LR sU-net sDeepLabV3 BowNet wBowNet
BTL BVL BTL BVL BTL BVL BTL BVL
0.005 N/A N/A N/A N/A N/A N/A N/A N/A
0.0001 0.023 0.024 0.02 0.023 0.023 0.023 0.022 0.021
0.0003 0.021 0.023 0.018 0.021 0.021 0.022 0.021 0.023
0.0005 0.020 0.020 0.014 0.020 0.020 0.021 0.021 0.021
0.0007 0.019 0.022 0.017 0.020 0.020 0.021 0.020 0.022
0.0009 0.020 0.023 0.017 0.020 0.019 0.021 0.022 0.023
0.00001 0.026 0.028 0.026 0.029 0.027 0.027 0.026 0.031
The architectures were implemented using the Keras API with the publicly available TensorFlow
framework as the backend [46], [47]. Network parameters were initialized with randomly distributed values,
using the same seed value for every model. For all four deep convolutional networks, the input image size,
number of iterations, mini-batch size, and number of epochs were set to 128 × 128, 3000, 10, and 60,
respectively. Adam optimization with a fixed momentum value of 0.9 was used to minimize a binary
cross-entropy loss function. Each network model was trained ten times; the results were then averaged,
and the standard deviation was reported. Training a model takes less than one hour, depending on the network
size, using one NVIDIA 1080 GPU installed in a Windows PC with a Core i7 at 4.2 GHz and 32 GB of RAM.
We also used a Google Cloud virtual machine with a Tesla P100 GPU and 16 GB of memory to acquire
training results faster.
B. DATASETS AND DATA AUGMENTATION
We evaluated our proposed deep network models under different scenarios: training and testing on the
same dataset, and training on one dataset while testing on the other, and vice versa. The two datasets comprise
ultrasound tongue images acquired with different ultrasound machines: one from the University of Ottawa
(Dataset I) and another from the publicly available SeeingSpeech project (Dataset II) [48]. Ground truth labels
corresponding to each image were created manually by two experts using our customized annotation software.
Users only annotate several markers on the lower edge of the tongue contour region, and the contour curve is
then created automatically using the B-spline method. As Table 2 shows, we separated the datasets into training,
validation, and test sets. Each model was trained, validated, and tested via different scenarios utilizing online
augmentation. Because we used
the same batch size, number of iterations, number of epochs, and initial randomization, every model is
examined under the same system setup. Figure 7 shows the ranking value of each image in our datasets in terms
of its distance from the average image.
Table 2. Specification of Datasets I and II for training/validation/testing.

                                 Total images   Train/Validation (80%/10%)   Test (10%)
Dataset I (UOttawa)              2058           1646/205                     205
Dataset II (SeeingSpeech) [48]   4016           3212/401                     401
Figure 5. Two sample images from Dataset I (UOttawa) and Dataset II (SeeingSpeech) [48], along with their corresponding ground truth labels.
Online augmentation of the ultrasound data was deployed using the Keras augmentation library, with the same
random numbers for training each deep convolutional network. In total, 60 iterations × 10 mini-batches ×
50 epochs = 30,000 augmented images were used for training and validation of each network model. We used
image flipping (half of the dataset after augmentation), rotation (up to 50 degrees to each side), and zooming
(ratio of 0.5x to 1.5x) to mimic the transformations that can occur in ultrasound tongue data.
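A toy version of this on-line augmentation is sketched below (ours, not the paper's pipeline; only the flip is shown, while rotation and zoom, handled by the Keras augmentation utilities in the paper, are omitted). The key point it demonstrates is that the same transform must be applied to the image and its label so contours stay aligned:

```python
import numpy as np

def augment_batch(images, labels, rng):
    """Horizontally flip a random half of the batch, applying the SAME
    transform to image and label so the tongue contour stays aligned.
    images, labels: arrays of shape (N, H, W)."""
    flip = rng.random(len(images)) < 0.5
    out_imgs, out_lbls = images.copy(), labels.copy()
    out_imgs[flip] = out_imgs[flip][:, :, ::-1]   # reverse width axis
    out_lbls[flip] = out_lbls[flip][:, :, ::-1]
    return out_imgs, out_lbls

# Demo: a batch where image and label are identical, so any mismatch
# between the two transforms would be immediately visible.
imgs = np.zeros((4, 64, 64))
imgs[:, :, :32] = 1.0        # "tongue" on the left half
lbls = imgs.copy()
aug_imgs, aug_lbls = augment_batch(imgs, lbls, np.random.default_rng(42))
```

Using the same random numbers across models, as the paper does, guarantees every network sees identical augmented batches.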
Dataset II has similar images (homogeneous) in comparison to Dataset I (heterogeneous). Figure 5 shows
two samples of Datasets I and II with their corresponding ground truth labels. Due to the heterogeneity of
Dataset I and the homogeneity of Dataset II, we applied the informed undersampling method for imbalanced
datasets [49] to create two comparable datasets with the highest variation, for better regularization and a better
comparison. To do so, we first found the average image of each dataset and then calculated the Euclidean
distance of each image from the average image. Figure 6 shows the average image of each dataset, as well as a
sample image with its difference from the average. After sorting each dataset by distance score, we
selected the 2000 highest-ranked images and 50 images from the lowest positions.
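The ranking step just described can be sketched in a few lines of numpy (an illustrative reconstruction of the procedure, not the authors' code):

```python
import numpy as np

def rank_by_distance(images):
    """Informed under-sampling score: Euclidean distance of every image
    to the dataset's average image. Returns indices sorted from most
    distant (highest variation) to closest, plus the distances."""
    mean_img = images.mean(axis=0)
    dists = np.linalg.norm(images.reshape(len(images), -1)
                           - mean_img.ravel(), axis=1)
    return np.argsort(dists)[::-1], dists

# Tiny demo: three constant images with values 0, 1, and 10.
imgs = np.stack([np.full((2, 2), v, dtype=float) for v in (0.0, 1.0, 10.0)])
order, dists = rank_by_distance(imgs)
# order[0] is the image farthest from the mean (here the all-10 image).
# From such a ranking, the paper keeps the 2000 highest-ranked images
# plus 50 from the lowest positions, e.g.:
# keep = np.concatenate([order[:2000], order[-50:]])
```

Sorting by distance from the mean surfaces the most atypical frames first, which is what gives the sub-sampled dataset its high variation.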
Figure 6. Informed under-sampling procedure to enhance the datasets in terms of variation. Left column: the
average image of Dataset I and of Dataset II. Middle column: one sample image from each dataset. Right column:
the difference between the sample image and the average image.
Figure 7. Ranking of Datasets I and II in terms of the similarity distance score, calculated as the distance from
each image to the average image.
C. QUALITATIVE AND QUANTITATIVE ANALYSIS
To illustrate the efficacy of our proposed segmentation methods qualitatively, some randomly selected
segmentation results from testing our proposed models on datasets I and II are shown in this section. Figure 8
shows the results of applying each proposed model to both datasets. As can be seen, wBowNet achieves
better prediction results in terms of noise. When training and testing on the same dataset, all the models in this
study generate instances with acceptable noise. When training on one dataset and testing on the other, the
results show worse predictions for both sDeepLabV3 and sU-net. It is worth noting that the correct
position of the tongue contour curve acquired from skeletonization should be shifted toward the edge of the
intensity gradient between the black and white regions [2].
In these experiments, we ignored that shift for the sake of a fair comparison across all models. The
last row of figures shows the difference between the predicted curve and the ground truth curve; more faded
regions indicate a better correlation between the contour from the ground truth label and the predicted map. As
can be seen clearly in Figure 8, both BowNet models obtain outstanding prediction maps and extracted contours
in comparison with the other architectures.
In order to quantitatively validate our proposed models, we followed the standard evaluation criteria for the
field of ultrasound tongue segmentation [1], [3], [9], [10], [29], [50]. Contours extracted from prediction maps
and ground truth labels are compared in terms of the mean sum of distance (MSD) as defined in equation (1)
[36]. It provides an evaluation metric defined as the mean distance between the points of a contour U and a contour
V, even if the two curves do not share the same coordinates on one axis or do not have the same number of points.
Here U is a contour set of n 2D points (u1, ..., un) and V is a contour set of m 2D points (v1, ..., vm). We carried
out contour extraction using the skeletonization method on binarized ground truth labels and predicted images,
employing morphological operators from the Scikit-learn library [51]. This criterion is sensitive to translation,
as we saw in our experiments. For this reason, we did not shift the extracted contours to their correct positions,
for the sake of a fair comparison.
$$\mathrm{MSD}(U, V) = \frac{1}{m+n}\left( \sum_{i=1}^{m} \min_{j} \lvert v_i - u_j \rvert \;+\; \sum_{i=1}^{n} \min_{j} \lvert u_i - v_j \rvert \right) \tag{1}$$
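Equation (1) translates directly into a few lines of numpy (our sketch, not the paper's evaluation code):

```python
import numpy as np

def mean_sum_of_distance(U, V):
    """Mean sum of distance (MSD), equation (1): for every point of one
    contour, take the Euclidean distance to the NEAREST point of the
    other contour; sum both directions and normalise by m + n.
    U: (n, 2) array of contour points, V: (m, 2) array."""
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    # pairwise distance matrix, shape (m, n)
    d = np.linalg.norm(V[:, None, :] - U[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(U) + len(V))

U = np.array([[0.0, 0.0], [1.0, 0.0]])   # one contour
V = np.array([[0.0, 1.0], [1.0, 1.0]])   # the same contour shifted up by 1
msd = mean_sum_of_distance(U, V)          # -> 1.0
```

The shifted-contour example also shows why the metric is translation-sensitive, as noted above: a uniform 1-pixel shift yields an MSD of exactly 1.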
To evaluate the performance of the proposed methods, we also calculated and reported the dice
coefficient and the binary cross-entropy loss between non-enhanced prediction maps and the labelled data. Binary
cross-entropy loss, a particular case of multiclass cross-entropy, calculates the difference between the
predicted and the ground truth labelled data, as defined in equation (2). We also calculated the dice
coefficient, as described in equation (3), where $p_i \in [0, 1]$ is the ith predicted pixel in the flattened image
(the original pixel value in [0, 255] of a grayscale image is rescaled to [0, 1] to fit the ReLU/sigmoid activation
functions), and $y_i \in [0, 1]$ is the corresponding class label for the ith pixel of the sample data.
$$L_{BCE} = \sum_{i} \left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right] \tag{2}$$
$$L_{Dice} = -\,\frac{2\sum_{i} p_i y_i}{\sum_{i} p_i + \sum_{i} y_i} \tag{3}$$
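Both losses are short numpy functions (our sketch; note that we add the conventional leading minus sign and mean reduction to the cross-entropy, which the printed form of equation (2) omits):

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy between flattened label y and prediction p,
    both in [0, 1]. Negated and averaged, per the usual convention."""
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dice_loss(y, p):
    """Negative dice coefficient, equation (3); -1 is a perfect overlap."""
    return -2.0 * np.sum(p * y) / (np.sum(p) + np.sum(y))

y = np.array([1.0, 0.0, 1.0, 1.0])
perfect = dice_loss(y, y)            # -> -1.0
uncertain = bce_loss(y, np.full(4, 0.5))   # log(2) ~ 0.693 for p = 0.5
```

A perfect prediction drives the dice loss to its minimum of -1, while a maximally uncertain prediction (p = 0.5 everywhere) gives a cross-entropy of log 2.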
Figure 8. Sample test results of models trained and validated on the datasets using online augmentation. The first
rows are prediction maps, where yellow regions indicate a higher probability of true prediction than other areas.
The second rows are contours extracted from the predicted feature maps after the same binarization and
skeletonization. The last rows show the difference between the contour extracted from the true label and from the
predicted label. Note that the sample image, ground truth, and contour are illustrated at the top of the first row.
Quantitative results are reported from the same experiments as in the qualitative study, with the same data,
setups, and procedures. Table 3 shows the results of training and validating each deep network model on
Datasets I and II using online augmentation. Each architecture was tested on both datasets separately. For each
experiment, we ran each system ten times; the average and standard deviation of the cross-entropy loss
and dice coefficients are tabulated. The last-epoch results are presented only to assess the robustness of each model
in terms of overfitting and convergence: similarity between the last training and validation losses indicates
better convergence. We also present the results of the original U-net and DeepLabV3. Although both original
methods achieve better results in many respects, they are significantly larger in size and slower in
performance. The best results in the tables reflect the accuracy of each model in terms of the loss function and dice
coefficient. The results in Table 3 reveal that BowNet and wBowNet reach optimum solutions better
than the other models. Table 3 also shows the number of parameters of each model and the memory needed to
save those trainable parameters, as calculated by the TensorFlow library. As can be seen from the table, BowNet
has the lowest number of parameters; wBowNet and sDeepLab have almost double the number of parameters of
BowNet. We also calculated the number of parameters of the original U-net and DeepLabV3, where the former
is 71 times bigger than BowNet and 39 times bigger than wBowNet.
Table 3. Results of the best and the last binary cross-entropy loss function and dice coefficient.
Trained/Validated → Dataset I
For quantitative evaluation For comparison with original models
sU-net sDeepLab BowNet wBowNet U-net DeepLabV3
Parameters   948,833    785,889    434,785    786,657    31,042,369   26,043,235
Memory       3.7 GB     3.1 GB     1.7 GB     3.1 GB     8.3 GB       6.9 GB
Mean STD Mean STD Mean STD Mean STD Mean STD Mean STD
Train loss Last 0.0242 0.2329 0.0336 0.0393 0.0265 0.0427 0.0242 0.037 0.0197 0.0113 0.0190 0.0066
Validation loss Last 0.0321 0.0075 0.0367 0.0131 0.0254 0.0046 0.0306 0.0103 0.0188 0.0669 0.0198 0.0046
Train Dice Last 0.7367 0.2329 0.8064 0.0393 0.7871 0.0427 0.817 0.037 0.8301 0.0239 0.8380 0.0321
Validation Dice Last 0.8061 0.0277 0.796 0.0222 0.7886 0.0389 0.8068 0.016 0.8230 0.0312 0.8251 0.0584
Train loss Best 0.0109 0.0003 0.0111 0.0003 0.0107 0.0005 0.0104 0.0005 0.0091 0.0004 0.0089 0.0021
Validation loss Best 0.0123 0.0012 0.0115 0.001 0.0109 0.0007 0.0122 0.0019 0.0089 0.0316 0.0090 0.0026
Train Dice Best 0.8751 0.0055 0.8775 0.002 0.8748 0.0053 0.8846 0.0142 0.8900 0.0008 0.8938 0.0547
Validation Dice Best 0.8582 0.0036 0.8594 0.006 0.859 0.0058 0.8652 0.0051 0.8507 0. 8608 0.8865 0.0293
Trained/Validated → Dataset II
sUNet sDeepLab BowNet wBowNet U-net DeepLabV3
Mean STD Mean STD Mean STD Mean STD Mean STD Mean STD
Train loss Last 0.0512 0.053 0.0458 0.0669 0.0592 0.0269 0.045 0.0454 0.0165 0.0065 0.0198 0.0053
Validation loss Last 0.037 0.007 0.0432 0.0113 0.0464 0.0118 0.0448 0.0162 0.0154 0.0035 0.0183 0.0057
Train Dice Last 0.7688 0.053 0.7774 0.0669 0.7746 0.0269 0.8024 0.0454 0.8512 0.0334 0.9008 0.0332
Validation Dice Last 0.7932 0.0316 0.7918 0.0239 0.7864 0.0316 0.7912 0.0312 0.8350 0.0251 0.8758 0.0247
Train loss Best 0.0242 0.0004 0.0224 0.0009 0.0242 0.0008 0.0198 0.0004 0.0093 0.0008 0.0088 0.0023
Validation loss Best 0.0258 0.0008 0.0234 0.0019 0.0258 0.0008 0.0228 0.0013 0.0078 0.0015 0.0085 0.0021
Train Dice Best 0.863 0.0053 0.8608 0.0004 0.8658 0.0026 0.8698 0.0008 0.8810 0.0025 0.8402 0.0132
Validation Dice Best 0.852 0.0041 0.86 0.0043 0.85 0.0099 0.865 0.0058 0.8797 0.0026 0.8460 0.0124
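The cross-entropy loss and Dice coefficient reported in Table 3 follow their standard definitions. A minimal NumPy sketch (the function names are ours; the paper presumably uses the equivalent TensorFlow/Keras implementations):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary segmentation masks."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

# Toy example: a perfect prediction yields Dice = 1.0
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1
print(round(dice_coefficient(mask, mask), 4))  # -> 1.0
```
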
Table 4 presents the results of testing each model with the MSD criterion, applied to the contours extracted from the enhanced prediction maps and the ground-truth labels. As discussed before, because noise is removed from the models' instances during the enhancement procedure, wBowNet is not the best model in terms of accuracy, although it outperformed the other network models on the MSD criterion and in qualitative results in many cases. To convert MSD from pixels to millimetres, we used the approximate conversion 1 px = 0.638 mm for both datasets I and II. wBowNet and BowNet achieved MSDs of around 0.04 mm on average, while the Deep Belief Network (DBN) model in the study by [1] achieved 1.0 mm (at 1 px = 0.295 mm). For the active contour models mentioned in [13], the average MSD was 1.05 mm. It is worth noting that the two human experts who participated in the active contour experiment produced two different sets of annotations with an average MSD of 0.73 mm [13], which might therefore be reasonably considered the approximate minimum MSD achievable by humans.
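The pixel-to-millimetre conversion is a simple scaling by the stated factor. As the exact MSD formula is not reproduced in this excerpt, the sketch below uses one common definition, the symmetric mean of nearest-neighbour distances between two contours; the paper's precise variant may differ:

```python
import numpy as np

PX_TO_MM = 0.638  # approximate conversion reported for datasets I and II

def mean_sum_distance(contour_a, contour_b):
    """Symmetric mean of nearest-neighbour distances between two contours,
    each an (N, 2) array of pixel coordinates. This is one common way to
    define MSD; the paper's exact formula is not given in this excerpt."""
    d = np.linalg.norm(contour_a[:, None, :] - contour_b[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = a + np.array([0.0, 0.1])          # same contour, shifted 0.1 px
msd_px = mean_sum_distance(a, b)      # -> 0.1 px
print(msd_px, msd_px * PX_TO_MM)      # 0.1 px = 0.0638 mm
```
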
Our experimental results revealed that although the original DCNN models are sturdy enough for the problem of tongue contour segmentation and tracking, they are huge in the number of trainable parameters and in memory consumption. The BowNet models, by contrast, performed well against both smaller and larger networks: they are relatively small, yet their performance is close to that of the large networks. One consequence of network size can be seen directly in the real-time performance of these models, and one goal of this study was to propose a robust, fully automatic DCNN that also has real-time capability.
The real-time performance of a deep model depends on the network size, hyperparameter tuning, computational facilities, parameter initialization, pre- and post-processing of the data, the data augmentation method, the convergence rate of the optimization method, and the activation function. As we saw in our experiments, the first two items affect the speed of each deep network model more than the others. Furthermore, the accuracy of deep learning methods is also highly related to the size of the training dataset and the complexity of the network. Hence, there is always a trade-off between the number of training samples, which is a serious constraint in many applications such as medicine [28], [52], and the number of parameters in the network, which requires more computing and memory units [30].
Enlarging a dataset through additional data acquisition is not always the best or cheapest alternative. Data augmentation can help to alleviate this difficulty, but a bigger dataset requires a network model that generalizes better. In general, large state-of-the-art deep network models [27], [32] generalize better on relatively small datasets such as ultrasound tongue data, but at the expense of higher computational cost due to their number of parameters; training and testing times rise accordingly. As Table 5 shows, sU-net is the fastest method despite having roughly double the parameters of BowNet, and the speed of each proposed method is in the real-time range.
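Online augmentation of the kind used in these experiments generates transformed copies of each training image on the fly, applying the same geometric transform to the image and its label mask. The paper's exact augmentation pipeline is not listed in this excerpt; the sketch below shows a minimal NumPy version with a random flip and shift:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask):
    """Minimal online-augmentation sketch: random horizontal flip and a
    small vertical shift, applied identically to the image and its label
    mask. (The paper's actual augmentation pipeline may differ.)"""
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    shift = int(rng.integers(-2, 3))            # shift by -2..2 rows
    image = np.roll(image, shift, axis=0)
    mask = np.roll(mask, shift, axis=0)
    return image, mask

img = rng.random((64, 64))
lbl = (img > 0.5).astype(np.uint8)
aug_img, aug_lbl = augment(img, lbl)
print(aug_img.shape, aug_lbl.shape)  # shapes are preserved
```

Because the transform is applied to image and mask together, the ground-truth contour stays aligned with the augmented frame.
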
Table 4. Results of testing each trained model on the different datasets. MSD is calculated in pixels and converted into millimetres using an approximation that depends on the ultrasound device resolution.
sU-net, sDeepLab, BowNet, and wBowNet serve the quantitative evaluation; U-net and DeepLabV3 are included for comparison with the original models. Values are mean ± STD.

Trained/validated on Dataset I — tested on Dataset I:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.023 ± 0.063 | 0.027 ± 0.026 | 0.028 ± 0.078 | 0.018 ± 0.064 | 0.010 ± 0.089 | 0.012 ± 0.026
Test Dice | 0.759 ± 0.084 | 0.748 ± 0.011 | 0.744 ± 0.018 | 0.826 ± 0.011 | 0.850 ± 0.018 | 0.840 ± 0.017
MSD (pixels) | 0.249 ± 0.182 | 0.266 ± 0.029 | 0.281 ± 0.090 | 0.213 ± 0.080 | 0.196 ± 0.069 | 0.154 ± 0.069
MSD (mm) | 0.037 ± 0.025 | 0.039 ± 0.016 | 0.042 ± 0.036 | 0.032 ± 0.090 | 0.030 ± 0.036 | 0.028 ± 0.031

Trained/validated on Dataset I — tested on Dataset II:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.082 ± 0.022 | 0.064 ± 0.004 | 0.069 ± 0.069 | 0.065 ± 0.003 | 0.056 ± 0.023 | 0.062 ± 0.007
Test Dice | 0.498 ± 0.012 | 0.526 ± 0.032 | 0.547 ± 0.011 | 0.520 ± 0.002 | 0.611 ± 0.020 | 0.538 ± 0.032
MSD (pixels) | 0.424 ± 0.063 | 0.240 ± 0.058 | 0.374 ± 0.063 | 0.358 ± 0.005 | 0.210 ± 0.064 | 0.371 ± 0.017
MSD (mm) | 0.063 ± 0.078 | 0.036 ± 0.002 | 0.056 ± 0.023 | 0.053 ± 0.006 | 0.361 ± 0.019 | 0.054 ± 0.003

Trained/validated on Dataset II (SeeingSpeech) — tested on Dataset II:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.048 ± 0.086 | 0.045 ± 0.007 | 0.045 ± 0.027 | 0.043 ± 0.005 | 0.034 ± 0.073 | 0.033 ± 0.007
Test Dice | 0.717 ± 0.084 | 0.730 ± 0.053 | 0.714 ± 0.080 | 0.736 ± 0.005 | 0.810 ± 0.019 | 0.831 ± 0.032
MSD (pixels) | 0.382 ± 0.068 | 0.391 ± 0.031 | 0.401 ± 0.031 | 0.269 ± 0.034 | 0.201 ± 0.011 | 0.210 ± 0.026
MSD (mm) | 0.057 ± 0.003 | 0.058 ± 0.051 | 0.060 ± 0.038 | 0.040 ± 0.021 | 0.030 ± 0.092 | 0.035 ± 0.002

Trained/validated on Dataset II — tested on Dataset I:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.044 ± 0.082 | 0.042 ± 0.008 | 0.039 ± 0.030 | 0.064 ± 0.068 | 0.034 ± 0.041 | 0.041 ± 0.005
Test Dice | 0.579 ± 0.043 | 0.578 ± 0.043 | 0.620 ± 0.002 | 0.397 ± 0.014 | 0.707 ± 0.025 | 0.550 ± 0.002
MSD (pixels) | 0.263 ± 0.090 | 0.375 ± 0.048 | 0.287 ± 0.080 | 0.180 ± 0.064 | 0.016 ± 0.005 | 0.023 ± 0.001
MSD (mm) | 0.039 ± 0.072 | 0.056 ± 0.037 | 0.043 ± 0.003 | 0.027 ± 0.011 | 0.022 ± 0.006 | 0.031 ± 0.006
Table 5. Comparison of the proposed network models in terms of the number of parameters, the memory required to store the parameters of each network, and the frame rate while testing each model on one batch of data. Results are the average of ten runs of each model with different test datasets (using online or offline augmentation).
Model | Number of parameters | Memory for learnable parameters (bytes) | Frame rate using GPU (fps)
sU-net | 948,833 | 3,795,332 | 72
sDeepLab | 785,889 | 3,143,556 | 32
BowNet | 434,785 | 1,739,140 | 42
wBowNet | 786,657 | 3,146,628 | 30
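Frame-rate figures like those in Table 5 can be obtained by timing repeated forward passes on a fixed batch after a short warm-up. The paper presumably timed its TensorFlow models this way on GPU; the generic sketch below uses a dummy predictor as a stand-in, since the actual models are not included in this excerpt:

```python
import time

def measure_fps(predict, batch, n_runs=10, warmup=2):
    """Average frames per second of `predict` over n_runs calls on one
    batch, after a few warm-up calls to exclude one-time setup costs."""
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    return n_runs * len(batch) / elapsed  # frames processed per second

# Toy stand-in for a network's forward pass:
dummy_predict = lambda batch: [x for x in batch]
fps = measure_fps(dummy_predict, batch=list(range(8)))
print(fps > 0)
```

For a real GPU model, the prediction call should be synchronized with the device before reading the clock, or the measured time will understate the true latency.
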
5. CONCLUSION
In this paper, we have proposed two new deep convolutional neural networks, BowNet and wBowNet, for tongue contour extraction, which benefit from dilated convolution to produce globally and locally accurate segmentation results. Extensive experimental studies demonstrated the improved performance of the proposed deep learning techniques. Overall, wBowNet combines accuracy, robustness, and real-time capability better than the comparison models such as sU-net, sDeepLab, or even BowNet. The materials of this study may help researchers in other fields, such as linguistics, to study tongue gestures in real time more easily, more accessibly, and with higher accuracy than previous methods allowed. Our qualitative analysis revealed the strength of the wBowNet architecture in generalizing across datasets and handling overfitting, and our quantitative study supported these observations, showing that the BowNet models provide more accurate predictions than similar techniques in several respects. In future work, the BowNet models can be applied to other well-known databases, including non-medical large-scale benchmarks, to further test the robustness of the approach, and these still-young models will need to be extended to other applications with new assessment criteria.
REFERENCES
[1] I. Fasel and J. Berry, “Deep belief networks for real-time extraction of tongue contours from ultrasound during
speech,” in Pattern Recognition (ICPR), 2010 20th International Conference on, 2010, pp. 1493–1496.
[2] M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clin. Linguist. Phonetics, vol. 19,
no. 6–7, pp. 455–501, Jan. 2005.
[3] C. Laporte and L. Ménard, “Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of
normal and impaired speech,” Med. Image Anal., vol. 44, pp. 98–114, Feb. 2018.
[4] A. Eshky et al., "UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions," in Proc. Interspeech, 2018, pp. 1888–1892.
[5] B. Gick, B. M. Bernhardt, P. Bacsfalvi, and I. Wilson, “Ultrasound imaging applications in second language
acquisition,” Phonol. Second Lang. Acquis., vol. 36, no. June, pp. 315–328, 2008.
[6] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, “Silent speech interfaces,”
Speech Commun., vol. 52, no. 4, pp. 270–287, Apr. 2010.
[7] M. Ohkubo and J. M. Scobbie, “Tongue Shape Dynamics in Swallowing Using Sagittal Ultrasound,”
Dysphagia, pp. 1–7, Jun. 2018.
[8] S. Chen, Y. Zheng, C. Wu, G. Sheng, P. Roussel, and B. Denby, “Direct, Near Real Time Animation of a 3D
Tongue Model Using Non-Invasive Ultrasound Images,” in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4994–4998.
[9] Y. S. Akgul, C. Kambhamettu, and M. Stone, “Automatic extraction and tracking of the tongue contours,”
IEEE Trans. Med. Imaging, vol. 18, no. 10, pp. 1035–1045, 1999.
[10] K. Xu, T. Gábor Csapó, P. Roussel, and B. Denby, "A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization," J. Acoust. Soc. Am., vol. 139, no. 5, pp. EL154–EL160, 2016.
[11] M. H. Mozaffari, S. Guan, S. Wen, N. Wang, and W.-S. Lee, “Guided Learning of Pronunciation by
Visualizing Tongue Articulation in Ultrasound Image Sequences,” in 2018 IEEE International Conference on
Computational Intelligence and Virtual Environments for Measurement Systems and Applications
(CIVEMSA), 2018, pp. 1–5.
[12] K. Xu et al., “Development of a 3D tongue motion visualization platform based on ultrasound image
sequences,” arXiv Prepr. arXiv1605.06106, 2016.
[13] M. Li, C. Kambhamettu, and M. Stone, “Automatic contour tracking in ultrasound images,” Clin. Linguist.
Phonetics, vol. 19, no. 6–7, pp. 545–554, 2005.
[14] S. Ghrenassia, L. Ménard, and C. Laporte, “Interactive segmentation of tongue contours in ultrasound video
sequences using quality maps,” in Medical Imaging 2014: Image Processing, 2014, vol. 9034, p. 903440.
[15] C. Laporte and L. Ménard, “Robust tongue tracking in ultrasound images: a multi-hypothesis approach,” in
Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] K. Xu et al., “Robust contour tracking in ultrasound tongue image sequences,” Clin. Linguist. Phon., vol. 30,
no. 3–5, pp. 313–327, 2016.
[17] L. Tang and G. Hamarneh, “Graph-based tracking of the tongue contour in ultrasound sequences with adaptive
temporal regularization,” 2010 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. - Work. CVPRW
2010, pp. 154–161, 2010.
[18] J. Berry and I. Fasel, “Dynamics of tongue gestures extracted automatically from ultrasound,” in Acoustics,
Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 557–560.
[19] A. Jaumard-Hakoun, K. Xu, P. Roussel-ragot, and M. L. Stone, “Tongue Contour Extraction From Ultrasound
Images,” Proc. 18th Int. Congr. Phonetic Sci. (ICPhS 2015), 2015.
[20] L. Tang, T. Bressmann, and G. Hamarneh, "Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves," Med. Image Anal., vol. 16, no. 8, pp. 1503–1520, 2012.
[21] D. Fabre, T. Hueber, F. Bocquelet, and P. Badin, "Tongue tracking in ultrasound images using eigentongue decomposition and artificial neural networks," in Proc. Interspeech, 2015, pp. 2410–2414.
[22] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution
semantic segmentation,” 2017.
[23] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, Sep. 2014.
[24] C. Szegedy et al., "Going Deeper with Convolutions," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural
Networks,” Commun. ACM, vol. 60, no. 6, 2017.
[26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[27] H. Noh, S. Hong, and B. Han, "Learning Deconvolution Network for Semantic Segmentation," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1520–1528.
[28] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image
segmentation,” in International Conference on Medical image computing and computer-assisted intervention,
2015, vol. 9351, pp. 234–241.
[29] F. Milletari, N. Navab, and S. A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical
image segmentation,” in Proceedings - 2016 4th International Conference on 3D Vision, 3DV 2016, 2016, pp.
565–571.
[30] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 1–14,
2015.
[31] L.-C. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image
Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 1–14, Apr. 2014.
[32] L.-C. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image
Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[33] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking Atrous Convolution for Semantic Image
Segmentation,” Jun. 2017.
[34] R. Hamaguchi, A. Fujita, K. Nemoto, T. Imaizumi, and S. Hikosaka, "Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1442–1450.
[35] F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions,” Proc. ICLR, Nov. 2015.
[36] A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, and B. Denby, “Tongue contour extraction from
ultrasound images based on deep neural network,” Proc. 18th Int. Congr. Phonetic Sci. (ICPhS 2015), May
2016.
[37] J. Zhu, W. Styler, and I. C. Calloway, “Automatic tongue contour extraction in ultrasound images with
convolutional neural networks,” J. Acoust. Soc. Am., vol. 143, no. 3, pp. 1966–1966, Mar. 2018.
[38] J. Zhu, W. Styler, and I. Calloway, “A CNN-based tool for automatic tongue contour tracking in ultrasound
images,” pp. 1–6, Jul. 2019.
[39] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-Decoder with Atrous Separable
Convolution for Semantic Image Segmentation.” pp. 801–818, 2018.
[40] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and Checkerboard Artifacts,” Distill, vol. 1, no. 10, p.
e3, Oct. 2016.
[41] F. Yu, V. Koltun, and T. Funkhouser, "Dilated Residual Networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 472–480.
[42] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in 2010 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2528–2535.
[43] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2014, vol. 8689 LNCS, no. PART 1, pp. 818–833.
[44] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv Prepr. arXiv1412.6980, Dec.
2014.
[46] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[47] F. Chollet et al., "Keras: Deep learning library for Theano and TensorFlow," https://keras.io, 2015.
[48] E. Lawson et al., “Seeing Speech: an articulatory web resource for the study of phonetics [website],” Apr.
2015.
[49] X. Y. Liu, J. Wu, and Z. H. Zhou, “Exploratory under-sampling for class-imbalance learning,” in Proceedings
- IEEE International Conference on Data Mining, ICDM, 2006, vol. 39, no. 2, pp. 965–969.
[50] G. V. Hahn-Powell, D. Archangeli, J. Berry, and I. Fasel, “AutoTrace: An automatic system for tracing tongue
contours,” J. Acoust. Soc. Am., vol. 136, no. 4, pp. 2104–2104, Oct. 2014.
[51] N. Martínková, P. Nová, O. V. Sablina, A. S. Graphodatsky, and J. Zima, “Karyotypic relationships of the
Tatra vole (Microtus tatricus),” Folia Zool., vol. 53, no. 3, pp. 279–284, 2004.
[52] G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, no. 1995,
pp. 60–88, 2017.