Ultrasound tongue contour extraction using BowNet network: A deep learning approach
M. Hamed Mozaffari, David Sankoff, and Won-Sook Lee

Citation: Proc. Mtgs. Acoust. 39, 020001 (2019); doi: 10.1121/2.0001117
View online: https://doi.org/10.1121/2.0001117
Published by the Acoustical Society of America
Volume 39  http://acousticalsociety.org/

178th Meeting of the Acoustical Society of America
San Diego, California, 2-6 December 2019

Biomedical Acoustics: Paper 3pBA3
Ultrasound tongue contour extraction using BowNet network: A deep learning approach

M. Hamed Mozaffari
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, K1N 6N5, CANADA; [email protected]

David Sankoff
Department of Mathematics and Statistics, University of Ottawa, 150 Louis Pasteur Pvt., Rooms 635-636, Ottawa, Ontario K1N 6N5, CANADA; [email protected]

Won-Sook Lee
School of Electrical Engineering and Computer Science (EECS), University of Ottawa, Ottawa, Ontario, K1N 6N5, CANADA; [email protected]
Ultrasound imaging is safe, relatively affordable, and capable of real-time performance. It has been used to visualize and analyze the function of human organs, including tongue shape and motion during speech, in order to study healthy and impaired speech production. Because ultrasound images are noisy and low-contrast, non-expert users may find it difficult to recognize tongue shapes. To alleviate this difficulty, and to enable quantitative analysis of tongue shape and motion, the tongue surface can be extracted, tracked, and visualized instead of the whole tongue region. Progress in deep convolutional neural networks has been successfully exploited in various computer vision tasks. This paper presents two novel deep neural network models, named BowNet and wBowNet, that benefit from both the global prediction of encoding-decoding models, which integrate multi-scale contextual information, and the full-resolution (local) extraction capability of dilated convolutions. Experiments on ultrasound tongue image datasets from different ultrasound devices revealed that combining localization and globalization searching significantly improves prediction results. Qualitative and quantitative assessment of the BowNet models in real-time showed outstanding accuracy and robustness in comparison with similar techniques.
Published by the Acoustical Society of America
© 2019 Acoustical Society of America. https://doi.org/10.1121/2.0001117 Proceedings of Meetings on Acoustics, Vol. 39, 020001 (2019) Page 1
1. INTRODUCTION
Ultrasound technology has been a well-known method in speech research for studying tongue motion and
speech articulation [1]. The popularity of ultrasound imaging for tongue visualization stems from attractive
characteristics such as a reasonably rapid frame rate, which allows researchers to visualize subtle and
swift gestures of the tongue during speech in real-time. Moreover, ultrasound technology is portable, relatively
affordable, and clinically safe with a non-invasive nature [2]. Studying and exploiting the dynamic nature of
speech data from ultrasound tongue image sequences might provide valuable information for linguistics
researchers, and it is of great interest in many recent studies [3]. Ultrasound imaging has been utilized for tongue
motion analysis in the treatment of speech sound disorders [4], comparison of healthy and impaired speech production
[3], second language training and rehabilitation [5], silent speech interfaces [6], swallowing research [7],
and 3D tongue modelling [8], to name a few.
During speech data acquisition, the ultrasound probe under the user's jaw images the tongue surface in the
mid-sagittal or coronal view in real-time. The mid-sagittal view of the tongue is usually adopted, as it illustrates
the relative backness, height, and slope of various areas of the tongue. In this view, the tongue dorsum
appears as a thick, long, bright, and continuous region, due to the tissue-air reflection of the ultrasound
signal by the air around the tongue. Owing to the noisy, low-contrast nature of ultrasound images, it is
not an easy task for non-expert users to localize the tongue surface. To address this difficulty, researchers utilize
extracted tongue surface data in the form of a curve instead of the whole tongue region. Manual segmentation
of the tongue contour and analysis of each frame suffer from several drawbacks such as bias depending on the
skill of the user or quality of the data, fatigue due to a large number of image frames to be analyzed, and lack of
achieving reproducible results [9]. Classical semi-automatic methods of tongue contour extraction, such as image
processing techniques [10], work well only on the specific datasets they were designed for. Furthermore, the
initialization step is vital to the accuracy of each method. Therefore, a fully automatic system for
tongue contour extraction, capable of working on any dataset with acceptable results, is still an open topic in
the literature. The problem is even harder for tongue contour tracking in real-time applications [11].
Figure 1 illustrates the process of ultrasound tongue imaging, where the tongue surface was manually extracted
for the sake of illustration.
Figure 1. A sample frame of ultrasound tongue imaging data acquired from the mid-sagittal view, with the
ultrasound probe held beneath the chin. The reflection from the tongue dorsum appears as a bright-to-black
gradient region marked by the red curve. The yellow dotted curve is the tongue contour. On
the lower left side of the image, the shadow of the jawbone is clearly visible.
The remainder of this paper is structured as follows. A literature review of the field is covered in section 2.
Section 3 describes in detail our methodology and the architecture of the BowNet models. The experimental results and a discussion of our proposed segmentation techniques are presented in section 4. Section 5 concludes and
outlines future work directions.
M. H. Mozaffari et al. Dilated convolution for ultrasound tongue extraction
Proceedings of Meetings on Acoustics, Vol. 39, 020001 (2019) Page 2
2. LITERATURE REVIEW AND BACKGROUND
Automatic tracking and extraction of the tongue contour have been investigated in several recent studies,
using active contour models or snakes [12]–[16], graph-based techniques [17], machine learning-based
methods [1], [18]–[21], and many more. A complete list of recent tongue contour extraction techniques can
be found in a study by Laporte et al. [3]. Although the methods above have been applied successfully to
ultrasound tongue contour extraction, manual labelling and initialization are still frequently needed. Users must
manually label at least one frame, with the restriction of drawing near the tongue region [3]. For instance, to use
the Autotrace, EdgeTrak, or TongueTrack software, users must annotate several points on at least one frame.
Deep convolutional neural networks (DCNNs) have been the method of choice for many computer vision
applications in recent years. They have shown outstanding performance in many image classification tasks, as
well as in object detection, recognition, and tracking [22]. Through inspiration, modification, and adaptation of
several well-known deep classification networks [23]–[25], fully convolutional neural (FCN) networks were
successfully exploited for image segmentation tasks [26]. The performance of the FCN model has been improved
by employing several different operators such as deconvolution [27], skip-connection and concatenation from
previous layers [28], [29], indexed un-pooling [30], and post-processing stages such as CRFs [31].
Many of these innovations significantly improved the accuracy of segmentation results, usually at the
expense of more computational costs due to the significant number of network parameters. Furthermore,
consecutive pooling layers that are used to improve receptive field and localization invariance [32] cause a
considerable reduction of feature resolution in decoding section [33]. To alleviate this issue, atrous or dilated
convolutions have been proposed and employed recently [31]–[35]. Dilated convolutions help a network model
to predict instances without losing the receptive field, with fewer learnable parameters. DCNNs have been used
in a few studies for the problem of ultrasound tongue extraction. In [36], a Restricted Boltzmann Machine (RBM)
was trained first on samples of ultrasound tongue images and ground truth labels in the form of encoder-decoder
networks. Then, the trained decoder part of the RBM was tuned and utilized for prediction of new instances in a
translational fashion from the trained network to the test network. To automatically extract tongue contours
without any manipulation on a large number of image frames, modified versions of U-net [28] have been used
recently for ultrasound tongue extraction [11], [37], [38].
In this study, we introduce a new method for automatic tongue contour extraction in real-time that handles
the difficulties of previous techniques we experienced in our research. We propose two robust, accurate, fully
automatic deep learning architectures, named BowNet and wBowNet, specifically for tongue contour extraction.
The proposed models have the capability of real-time performance using GPU power. An alternative solution
for CPU only purposes has been reported in previous research [11]. For the first time, dilated convolution is
combined with the standard deep, dense classification method to extract both local and global context from each
frame at the same time in an end-to-end fashion. Unlike previous studies, which evaluate their models on a
private dataset, the BowNet models were assessed on two comprehensive datasets of tongue images and ground
truths, each of which contained two types of data from two different ultrasound machines to test the overfitting
problem and robustness of each technique in terms of generalization capability.
3. METHODOLOGY
In the first version of the DeepLab model (V1), CRF was used as a post-processing stage to increase the
output feature-map resolution of the network and acquire better local delineation [31]. In the next
generations of the DeepLab models (V2 and V3) [34], [39], more accurate segmentation results were achieved
by transferring information from different layers of the model, modified and concatenated with separate layers.
Similarly, we tested multiscale contextual reasoning from successive pooling and subsampling layers (global
exploration), combined with the full-resolution output of dilated convolutions [35].
We developed two new DCNNs (named BowNet and wBowNet) for the problem of ultrasound tongue
extraction, with few learnable parameters for the sake of real-time performance. BowNet is a parallel
combination of a dense classification architecture, inspired by VGG16 [23] and U-net [28], with a segmentation
model inspired by the DeepLabV3 network, where dilated convolution layers are used without pooling layers
[33]. wBowNet is the counterpart of BowNet, in which the classification and segmentation networks are
interconnected to support an even higher resolution in the prediction outcomes. As part of
this work, we also re-examine the performance of repurposed versions of U-net [28] and DeepLabV3 [33]. In
our designs, we considered the effect of the checkerboard [40] and gridding [41] artifacts of deconvolutional and
dilated layers. Our experimental results show that using both ideas, omitting pooling layers in one path and
varying dilation factors in the other path of the BowNet models, alleviates the effect of those artifacts.
A. DILATED CONVOLUTION
The consecutive combination of convolutional and pooling layers of a DCNN model results in a significantly
lower spatial resolution of output feature maps, typically by a factor of 32 for modern architectures that are not
designed from scratch for the segmentation task [33]. Several ideas have been proposed to reconstruct an input-
size segmented output from the coarse feature map of the encoding part. Interpolation (up-sampling) is the
simplest solution, where each pixel of the feature map is repeated to increase the image size [26]. This method
of down-sampling and up-sampling, with its inevitable loss of information, is not beneficial for segmentation
tasks where high-resolution boundary delineation is required. Transposed convolution (sometimes
called deconvolution) has been introduced to solve this issue and to recover the low-resolution prediction maps
[42]. A transposed convolutional layer operates in the opposite direction of a convolutional layer: each pixel of the feature map
is expanded to the kernel size and superimposed with its neighbours. Although deconvolution improved the
segmentation results [27], it still suffers from the checkerboard problem [40], as well as an increased number of
trainable parameters. Other techniques, such as indexed un-pooling [30], [43], require memory for saving the
positions at which max-pooling was applied in a previous feature map.
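To make the trade-off concrete, here is a minimal numpy sketch (ours, not code from the paper) of the simplest reconstruction strategy mentioned above, nearest-neighbour interpolation, where every coarse pixel is simply repeated:

```python
import numpy as np

def nearest_upsample(feature_map, factor):
    """Nearest-neighbour up-sampling: each pixel of the coarse
    feature map is repeated `factor` times along both axes."""
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

coarse = np.array([[1, 2],
                   [3, 4]])
fine = nearest_upsample(coarse, 2)
# fine is a 4x4 map; each original pixel now covers a 2x2 block,
# so detail lost during down-sampling is NOT recovered, only spread out.
```

This illustrates why pure interpolation cannot restore high-resolution boundaries: no new information is introduced, each coarse activation is merely duplicated.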
Dilated convolution (sometimes called atrous convolution) has been utilized in previous studies [31], [34],
[39] with successful results. In this method, instead of downsampling the feature map for the next layer, the kernel
is sub-sampled at a sampling rate (called the dilation rate or factor). The receptive field can be effectively
expanded, while keeping the spatial resolution, by increasing the dilation factor monotonically. Therefore, dilated
layers can be used instead of pooling layers, and the spatial resolution of feature maps remains almost constant along
the forward path of a DCNN. However, the sparsity of dilated kernels does not always improve performance,
especially for small objects and details [34]. One solution to this problem is to decrease the
dilation factor throughout the decoding path of the network model, similar to the number of kernels in the U-net
network [34]. Figure 2 illustrates samples of decoding strategies using interpolation, deconvolution, and dilated
convolution.
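The resolution-preserving behaviour of dilation can be demonstrated with a small, self-contained sketch (illustrative only, not the paper's implementation), shown here in one dimension for clarity:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D convolution with a dilated kernel: the kernel taps are
    spaced `dilation` samples apart, so the receptive field grows to
    dilation * (len(kernel) - 1) + 1 with the SAME number of weights."""
    k = len(kernel)
    span = dilation * (k - 1) + 1          # effective receptive field
    out = np.empty(len(signal) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * signal[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
y1 = dilated_conv1d(x, [1, 1, 1], dilation=1)  # receptive field 3
y2 = dilated_conv1d(x, [1, 1, 1], dilation=2)  # receptive field 5, same 3 weights
```

With dilation 2 the same three-tap kernel covers five input samples, so context grows without pooling and without extra learnable parameters, exactly the property the text exploits.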
B. NETWORK ARCHITECTURES
Figure 3 illustrates the network architecture of the BowNet model. BowNet has two separate sub-network
forward paths whose results are concatenated in the last layer, followed by a fully
convolutional layer. In contrast, wBowNet has two interconnected (woven) sub-networks. The
connection of layers in wBowNet is shown in Figure 4. Two popular DCNN models, U-net [28] and
DeepLabV3 [32], [34], were modified and simplified for a fair comparison; we name them sU-net and
sDeepLabV3. To provide a better accuracy baseline for the proposed models, the original U-net and
DeepLabV3 were implemented and tested as well. Detecting an object such as a tongue contour in relatively
noisy ultrasound images requires attention to both context and resolution.
In this application, context can be achieved using dense classification methods [11]. Nevertheless, up-
sampling methods such as deconvolution are not able to recover the low-level visual features which are lost in
the down-sampling stage [22], resulting in a low-resolution prediction map as the model output. To address this
issue, we propose two deep network models that consider both localization and globalization detection. In each
network architecture, convolutional layers use the ReLU activation as the non-linearity. Convolutional layers are
followed by drop-out and batch normalization layers to improve the regularization and accuracy of the
network models. For the dilation factor, we increase the value from one to eight and then decrease it back to
one along a forward path, to solve the problem of spatial inconsistency (gridding) [34], [41].
Multiclass cross-entropy is employed as the loss function for training the networks, with a SoftMax
function in the last layer. Note that the feature-map size shrinks after
applying each convolutional layer, due to the overlap of the convolution kernels with the border areas. For
this reason, to concatenate feature maps from previous layers, it is necessary to crop them to matching
sizes (see the faded blue boxes in Figures 3 and 4).
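The symmetric dilation-rate schedule described above can be written as a small helper (our illustrative sketch, assuming the doubling steps listed in the hyperparameter search of section 4.A, not the authors' exact code):

```python
def dilation_schedule(max_rate=8):
    """Symmetric dilation-rate schedule: rates double from 1 up to
    max_rate, then halve back down to 1. Keeping the decrease mirrors
    the increase avoids the gridding artifact of a monotone ramp."""
    up = []
    rate = 1
    while rate <= max_rate:
        up.append(rate)
        rate *= 2
    return up + up[-2::-1]  # mirror, without repeating the peak

# dilation_schedule() gives the 1 -> 8 -> 1 ramp used for the dilated path.
```

Each entry would be the `dilation_rate` of one convolutional layer along the dilated forward path.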
Figure 2. Several approaches to restoring a low-resolution feature map to the input image size. (a) and (b):
up-sampling using linear interpolation and transposed convolution, respectively. (c) Dilated convolution with
dilation factors of 1, 2, and 3.
4. EXPERIMENTAL RESULTS
A. SYSTEM SETUP
In order to train our proposed network models, we first conducted an extensive random-search
hyperparameter tuning [44] to find the optimum value of the parameters of each network: filter count
(doubled in each consecutive layer, starting from 16, 32, or 64), kernel size (3 × 3 and 5 × 5), dilation factor (doubled
in each consecutive layer, starting from 1 or 2), the number of global iterations (iteration and epoch size), batch
size (10, 20, or 50, depending on GPU memory), augmentation parameters (online and offline), drop-out factor
(0.5, 0.7, or variable), padding type (zero or valid), normalization layer (with or without), optimization
method (SGD or Adam [45]), and the type of activation layer (Tanh or ReLU). We also tested several
network configurations for the BowNet, varying the number of layers in the encoding-decoding
forward paths (3, 4, and 5), and for the wBowNet, varying the number of dilated convolution layers.
In this experiment, we set real-time performance as a restriction for all the architectures in this study.
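A random search over these grids can be sketched as follows (a toy illustration; the trial-evaluation step, which in the paper means training each candidate network and comparing validation losses, is omitted, and the grid names are our own):

```python
import random

# Search space transcribed from the grids listed above; names are ours.
SEARCH_SPACE = {
    "base_filters": [16, 32, 64],    # doubled in each consecutive layer
    "kernel_size": [3, 5],
    "base_dilation": [1, 2],
    "batch_size": [10, 20, 50],
    "dropout": [0.5, 0.7],
    "padding": ["zero", "valid"],
    "optimizer": ["SGD", "Adam"],
    "activation": ["tanh", "relu"],
}

def sample_config(rng=random):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

# Random search: draw N candidate configurations; each would then be
# trained and ranked by validation loss.
trials = [sample_config() for _ in range(20)]
```

Random search samples the joint space directly instead of enumerating the full grid, which is why it scales to this many hyperparameters [44].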
Figure 3. Overview of the proposed BowNet network architecture. In each layer, filter kernels are depicted using
boxes. The green and orange boxes are results of standard and dilated convolution layers, respectively.
Figure 4. Overview of the proposed wBowNet network architecture. In each layer, filter kernels are depicted using
boxes. The green and orange boxes are results of regular and dilated convolution layers, respectively.
Our results from hyperparameter tuning revealed that, besides network architecture size, the learning rate
has the most significant effect on the performance of each architecture in terms of accuracy. Testing fixed and
scheduled decaying learning rates showed that the variable learning rate might provide better results, but it
requires different initializations of the decay factor and decay steps. Therefore, for the sake of a fair comparison, we
only report results using fixed learning rates. Table 1 presents the selected learning rates for each proposed
network model.
Table 1. Fixed Learning-rate (LR) tuning using Best training and validation loss (BTL, BVL).
LR sU-net sDeepLabV3 BowNet wBowNet
BTL BVL BTL BVL BTL BVL BTL BVL
0.005 N/A N/A N/A N/A N/A N/A N/A N/A
0.0001 0.023 0.024 0.02 0.023 0.023 0.023 0.022 0.021
0.0003 0.021 0.023 0.018 0.021 0.021 0.022 0.021 0.023
0.0005 0.020 0.020 0.014 0.020 0.020 0.021 0.021 0.021
0.0007 0.019 0.022 0.017 0.020 0.020 0.021 0.020 0.022
0.0009 0.020 0.023 0.017 0.020 0.019 0.021 0.022 0.023
0.00001 0.026 0.028 0.026 0.029 0.027 0.027 0.026 0.031
The architectures were implemented using the Keras API with the publicly available TensorFlow
framework as the backend [46], [47]. Network parameters were initialized with randomly distributed values,
using the same seed value for every model. For all four deep convolutional networks, the input image size,
number of iterations, mini-batch size, and number of epochs were set to 128 × 128, 3000, 10, and 60,
respectively. Adam optimization with a fixed momentum value of 0.9 was used to minimize a binary
cross-entropy loss function. Each network model was trained ten times; the results were then averaged,
and the standard deviation was reported. Training a model takes less than one hour, depending on the network
size, using one NVIDIA 1080 GPU installed in a Windows PC with a Core i7 at 4.2 GHz and 32 GB of RAM.
We also used a Google Cloud virtual machine with a Tesla P100 GPU and 16 GB of memory to acquire
training results faster.
B. DATASETS AND DATA AUGMENTATION
We evaluated our proposed deep network models under different scenarios: training and testing on the
same dataset, and training on one dataset while testing on the other, and vice versa. The two datasets comprise
ultrasound tongue images acquired with different ultrasound machines: one from the University of Ottawa
(Dataset I) and another from the publicly available SeeingSpeech project (Dataset II) [48]. Ground truth labels
corresponding to each image were created manually by two experts using our customized annotation software.
Users only annotate several markers on the lower edge of the tongue contour region, and the contour curve is
then created automatically using the B-spline method. As Table 2 shows, we separated the datasets into training,
validation, and test sets. Each model was trained, validated, and tested via different scenarios utilizing online
augmentation. Because we used
the same batch size, number of iterations, number of epochs, and initial randomization, every model is
examined under the same system setup. Figure 7 shows the ranking value of each image in our datasets in terms
of its distance from the average image.
Table 2. Specification of Datasets I and II for training/validation/testing.

                                 Total images   Train/Validation (80%/10%)   Test (10%)
Dataset I (UOttawa)              2058           1646/205                     205
Dataset II (SeeingSpeech) [48]   4016           3212/401                     401
Figure 5. Two sample images from Dataset I (UOttawa) and Dataset II (SeeingSpeech) [48], along with their corresponding ground truth labels.
Online augmentation of the ultrasound data was deployed using the Keras augmentation library, with the same
random numbers for training each deep convolutional network. In total, 60 iterations × 10 mini-batches ×
50 epochs = 30,000 augmented images were used for training and validation of each network model. We used
image flipping (half of the dataset after augmentation), rotation (up to 50 degrees to each side), and zooming
(ratio of 0.5x to 1.5x) to mimic the transformations that can occur in ultrasound tongue data.
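A toy version of this on-line augmentation is sketched below (ours, not the paper's pipeline; only the flip is shown, while rotation and zoom, handled by the Keras augmentation utilities in the paper, are omitted). The key point it demonstrates is that the same transform must be applied to the image and its label so contours stay aligned:

```python
import numpy as np

def augment_batch(images, labels, rng):
    """Horizontally flip a random half of the batch, applying the SAME
    transform to image and label so the tongue contour stays aligned.
    images, labels: arrays of shape (N, H, W)."""
    flip = rng.random(len(images)) < 0.5
    out_imgs, out_lbls = images.copy(), labels.copy()
    out_imgs[flip] = out_imgs[flip][:, :, ::-1]   # reverse width axis
    out_lbls[flip] = out_lbls[flip][:, :, ::-1]
    return out_imgs, out_lbls

# Demo: a batch where image and label are identical, so any mismatch
# between the two transforms would be immediately visible.
imgs = np.zeros((4, 64, 64))
imgs[:, :, :32] = 1.0        # "tongue" on the left half
lbls = imgs.copy()
aug_imgs, aug_lbls = augment_batch(imgs, lbls, np.random.default_rng(42))
```

Using the same random numbers across models, as the paper does, guarantees every network sees identical augmented batches.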
Dataset II has similar images (homogeneous) in comparison to Dataset I (heterogeneous). Figure 5 shows
two samples of Datasets I and II with their corresponding ground truth labels. Due to the heterogeneity of
Dataset I and the homogeneity of Dataset II, we applied the informed undersampling method for imbalanced
datasets [49] to create two comparable datasets with the highest variation, for better regularization and a better
comparison. To do so, we first found the average image of each dataset and then calculated the Euclidean
distance of each image from the average image. Figure 6 shows the average image of each dataset, as well as a
sample image with its difference from the average. After sorting each dataset by distance score, we
selected the 2000 highest-ranked images and 50 images from the lowest positions.
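The ranking step just described can be sketched in a few lines of numpy (an illustrative reconstruction of the procedure, not the authors' code):

```python
import numpy as np

def rank_by_distance(images):
    """Informed under-sampling score: Euclidean distance of every image
    to the dataset's average image. Returns indices sorted from most
    distant (highest variation) to closest, plus the distances."""
    mean_img = images.mean(axis=0)
    dists = np.linalg.norm(images.reshape(len(images), -1)
                           - mean_img.ravel(), axis=1)
    return np.argsort(dists)[::-1], dists

# Tiny demo: three constant images with values 0, 1, and 10.
imgs = np.stack([np.full((2, 2), v, dtype=float) for v in (0.0, 1.0, 10.0)])
order, dists = rank_by_distance(imgs)
# order[0] is the image farthest from the mean (here the all-10 image).
# From such a ranking, the paper keeps the 2000 highest-ranked images
# plus 50 from the lowest positions, e.g.:
# keep = np.concatenate([order[:2000], order[-50:]])
```

Sorting by distance from the mean surfaces the most atypical frames first, which is what gives the sub-sampled dataset its high variation.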
Figure 6. Informed under-sampling procedure to enhance the datasets in terms of variation. Left column: the
average image of Dataset I and of Dataset II. Middle column: one sample image from each dataset. Right column:
the difference between the sample image and the average image.
Figure 7. Ranking of Datasets I and II in terms of the similarity distance score, calculated as the distance from
each image to the average image.
C. QUALITATIVE AND QUANTITATIVE ANALYSIS
To illustrate the efficacy of our proposed segmentation methods qualitatively, some randomly selected
segmentation results from testing our proposed models on datasets I and II are shown in this section. Figure 8
shows the results of applying each proposed model to both datasets. As can be seen, wBowNet achieves
better prediction results in terms of noise. When training and testing on the same dataset, all the models in this
study generate instances with acceptable noise. When training on one dataset and testing on the other, the
results show worse predictions for both sDeepLabV3 and sU-net. It is worth noting that the correct
position of the tongue contour curve acquired from skeletonization should be shifted toward the edge of the
intensity gradient between the black and white regions [2].
In these experiments, we ignored that shift for the sake of a fair comparison across all models. The
last row of figures shows the difference between the predicted curve and the ground truth curve; more faded
regions indicate a better correlation between the contour from the ground truth label and the predicted map. As
can be seen clearly in Figure 8, both BowNet models obtain outstanding prediction maps and extracted contours
in comparison with the other architectures.
In order to quantitatively validate our proposed models, we followed the standard evaluation criteria for the
field of ultrasound tongue segmentation [1], [3], [9], [10], [29], [50]. Contours extracted from prediction maps
and ground truth labels are compared in terms of the mean sum of distance (MSD) as defined in equation (1)
[36]. It provides an evaluation metric defined as the mean distance between the points of a contour U and a contour
V, even if the two curves do not share the same coordinates on one axis or do not have the same number of points.
Here U is a contour set of n 2D points (u1, ..., un) and V is a contour set of m 2D points (v1, ..., vm). We carried
out contour extraction using the skeletonization method on binarized ground truth labels and predicted images,
employing morphological operators from the Scikit-learn library [51]. This criterion is sensitive to translation,
as we saw in our experiments. For this reason, we did not shift the extracted contours to their correct positions,
for the sake of a fair comparison.
$$\mathrm{MSD}(U, V) = \frac{1}{m+n}\left( \sum_{i=1}^{m} \min_{j} \lvert v_i - u_j \rvert \;+\; \sum_{i=1}^{n} \min_{j} \lvert u_i - v_j \rvert \right) \tag{1}$$
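Equation (1) translates directly into a few lines of numpy (our sketch, not the paper's evaluation code):

```python
import numpy as np

def mean_sum_of_distance(U, V):
    """Mean sum of distance (MSD), equation (1): for every point of one
    contour, take the Euclidean distance to the NEAREST point of the
    other contour; sum both directions and normalise by m + n.
    U: (n, 2) array of contour points, V: (m, 2) array."""
    U, V = np.asarray(U, dtype=float), np.asarray(V, dtype=float)
    # pairwise distance matrix, shape (m, n)
    d = np.linalg.norm(V[:, None, :] - U[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(U) + len(V))

U = np.array([[0.0, 0.0], [1.0, 0.0]])   # one contour
V = np.array([[0.0, 1.0], [1.0, 1.0]])   # the same contour shifted up by 1
msd = mean_sum_of_distance(U, V)          # -> 1.0
```

The shifted-contour example also shows why the metric is translation-sensitive, as noted above: a uniform 1-pixel shift yields an MSD of exactly 1.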
To evaluate the performance of the proposed methods, we also calculated and reported the dice
coefficient and the binary cross-entropy loss between non-enhanced prediction maps and the labelled data. Binary
cross-entropy loss, a particular case of multiclass cross-entropy, calculates the difference between the
predicted and the ground truth labelled data, as defined in equation (2). We also calculated the dice
coefficient, as described in equation (3), where $p_i \in [0, 1]$ is the ith predicted pixel in the flattened image
(the original pixel value in [0, 255] of a grayscale image is rescaled to [0, 1] to fit the ReLU/sigmoid activation
functions), and $y_i \in [0, 1]$ is the corresponding class label for the ith pixel of the sample data.
$$L_{BCE} = \sum_{i} \left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right] \tag{2}$$
$$L_{Dice} = -\,\frac{2\sum_{i} p_i y_i}{\sum_{i} p_i + \sum_{i} y_i} \tag{3}$$
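Both losses are short numpy functions (our sketch; note that we add the conventional leading minus sign and mean reduction to the cross-entropy, which the printed form of equation (2) omits):

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross-entropy between flattened label y and prediction p,
    both in [0, 1]. Negated and averaged, per the usual convention."""
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def dice_loss(y, p):
    """Negative dice coefficient, equation (3); -1 is a perfect overlap."""
    return -2.0 * np.sum(p * y) / (np.sum(p) + np.sum(y))

y = np.array([1.0, 0.0, 1.0, 1.0])
perfect = dice_loss(y, y)            # -> -1.0
uncertain = bce_loss(y, np.full(4, 0.5))   # log(2) ~ 0.693 for p = 0.5
```

A perfect prediction drives the dice loss to its minimum of -1, while a maximally uncertain prediction (p = 0.5 everywhere) gives a cross-entropy of log 2.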
Figure 8. Sample test results of models trained and validated on the datasets using online augmentation. The first
rows are prediction maps, where yellow regions indicate a higher probability of true prediction than other areas.
The second rows are contours extracted from the predicted feature maps after the same binarization and
skeletonization. The last rows show the difference between the contour extracted from the true label and from the
predicted label. Note that the sample image, ground truth, and contour are illustrated at the top of the first row.
Quantitative results are reported from the same experiments as in the qualitative study, with the same data,
setups, and procedures. Table 3 shows the results of training and validating each deep network model on
Datasets I and II using online augmentation. Each architecture was tested on both datasets separately. For each
experiment, we ran each system ten times; the average and standard deviation of the cross-entropy loss
and dice coefficients are tabulated. The last-epoch results are presented only to assess the robustness of each model
in terms of overfitting and convergence: similarity between the last training and validation losses indicates
better convergence. We also present the results of the original U-net and DeepLabV3. Although both original
methods achieve better results in many respects, they are significantly larger in size and slower in
performance. The best results in the tables reflect the accuracy of each model in terms of the loss function and dice
coefficient. The results in Table 3 reveal that BowNet and wBowNet reach optimum solutions better
than the other models. Table 3 also shows the number of parameters of each model and the memory needed to
save those trainable parameters, as calculated by the TensorFlow library. As can be seen from the table, BowNet
has the lowest number of parameters; wBowNet and sDeepLab have almost double the number of parameters of
BowNet. We also calculated the number of parameters of the original U-net and DeepLabV3, where the former
is 71 times bigger than BowNet and 39 times bigger than wBowNet.
Table 3. Results of the best and the last binary cross-entropy loss function and dice coefficient.
Trained/Validated → Dataset I
For quantitative evaluation For comparison with original models
sU-net sDeepLab BowNet wBowNet U-net DeepLabV3
Parameters   948,833    785,889    434,785    786,657    31,042,369   26,043,235
Memory       3.7 GB     3.1 GB     1.7 GB     3.1 GB     8.3 GB       6.9 GB
Mean STD Mean STD Mean STD Mean STD Mean STD Mean STD
Train loss Last 0.0242 0.2329 0.0336 0.0393 0.0265 0.0427 0.0242 0.037 0.0197 0.0113 0.0190 0.0066
Validation loss Last 0.0321 0.0075 0.0367 0.0131 0.0254 0.0046 0.0306 0.0103 0.0188 0.0669 0.0198 0.0046
Train Dice Last 0.7367 0.2329 0.8064 0.0393 0.7871 0.0427 0.817 0.037 0.8301 0.0239 0.8380 0.0321
Validation Dice Last 0.8061 0.0277 0.796 0.0222 0.7886 0.0389 0.8068 0.016 0.8230 0.0312 0.8251 0.0584
Train loss Best 0.0109 0.0003 0.0111 0.0003 0.0107 0.0005 0.0104 0.0005 0.0091 0.0004 0.0089 0.0021
Validation loss Best 0.0123 0.0012 0.0115 0.001 0.0109 0.0007 0.0122 0.0019 0.0089 0.0316 0.0090 0.0026
Train Dice Best 0.8751 0.0055 0.8775 0.002 0.8748 0.0053 0.8846 0.0142 0.8900 0.0008 0.8938 0.0547
Validation Dice Best 0.8582 0.0036 0.8594 0.006 0.859 0.0058 0.8652 0.0051 0.8507 0. 8608 0.8865 0.0293
Trained/Validated → Dataset II
sUNet sDeepLab BowNet wBowNet U-net DeepLabV3
Mean STD Mean STD Mean STD Mean STD Mean STD Mean STD
Train loss Last 0.0512 0.053 0.0458 0.0669 0.0592 0.0269 0.045 0.0454 0.0165 0.0065 0.0198 0.0053
Validation loss Last 0.037 0.007 0.0432 0.0113 0.0464 0.0118 0.0448 0.0162 0.0154 0.0035 0.0183 0.0057
Train Dice Last 0.7688 0.053 0.7774 0.0669 0.7746 0.0269 0.8024 0.0454 0.8512 0.0334 0.9008 0.0332
Validation Dice Last 0.7932 0.0316 0.7918 0.0239 0.7864 0.0316 0.7912 0.0312 0.8350 0.0251 0.8758 0.0247
Train loss Best 0.0242 0.0004 0.0224 0.0009 0.0242 0.0008 0.0198 0.0004 0.0093 0.0008 0.0088 0.0023
Validation loss Best 0.0258 0.0008 0.0234 0.0019 0.0258 0.0008 0.0228 0.0013 0.0078 0.0015 0.0085 0.0021
Train Dice Best 0.863 0.0053 0.8608 0.0004 0.8658 0.0026 0.8698 0.0008 0.8810 0.0025 0.8402 0.0132
Validation Dice Best 0.852 0.0041 0.86 0.0043 0.85 0.0099 0.865 0.0058 0.8797 0.0026 0.8460 0.0124
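The cross-entropy loss and Dice coefficient reported in Table 3 follow their standard definitions. A minimal NumPy sketch (the function names are ours; the paper presumably uses the equivalent TensorFlow/Keras implementations):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary segmentation masks."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

# Toy example: a perfect prediction yields Dice = 1.0
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1
print(round(dice_coefficient(mask, mask), 4))  # -> 1.0
```
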
Table 4 presents the results of testing each model with the MSD criterion, applied to the contours extracted from the enhanced prediction maps and the ground-truth labels. As discussed before, because noise is removed from the models' instances during the enhancement procedure, wBowNet is not the best model in terms of accuracy, although it outperformed the other network models on the MSD criterion and in qualitative results in many cases. To convert MSD from pixels to millimetres, we used the approximate conversion 1 px = 0.638 mm for both datasets I and II. wBowNet and BowNet achieved MSDs of around 0.04 mm on average, while the Deep Belief Network (DBN) model in the study by [1] achieved 1.0 mm (at 1 px = 0.295 mm). For the active contour models mentioned in [13], the average MSD was 1.05 mm. It is worth noting that the two human experts who participated in the active contour experiment produced two different sets of annotations with an average MSD of 0.73 mm [13], which might therefore be reasonably considered the approximate minimum MSD achievable by humans.
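The pixel-to-millimetre conversion is a simple scaling by the stated factor. As the exact MSD formula is not reproduced in this excerpt, the sketch below uses one common definition, the symmetric mean of nearest-neighbour distances between two contours; the paper's precise variant may differ:

```python
import numpy as np

PX_TO_MM = 0.638  # approximate conversion reported for datasets I and II

def mean_sum_distance(contour_a, contour_b):
    """Symmetric mean of nearest-neighbour distances between two contours,
    each an (N, 2) array of pixel coordinates. This is one common way to
    define MSD; the paper's exact formula is not given in this excerpt."""
    d = np.linalg.norm(contour_a[:, None, :] - contour_b[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = a + np.array([0.0, 0.1])          # same contour, shifted 0.1 px
msd_px = mean_sum_distance(a, b)      # -> 0.1 px
print(msd_px, msd_px * PX_TO_MM)      # 0.1 px = 0.0638 mm
```
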
Our experimental results revealed that although the original DCNN models are sturdy enough for the problem of tongue contour segmentation and tracking, they are huge in the number of trainable parameters and in memory consumption. The BowNet models, by contrast, performed well against both smaller and larger networks: they are relatively small, yet their performance is close to that of the large networks. One consequence of network size can be seen directly in the real-time performance of these models, and one goal of this study was to propose a robust, fully automatic DCNN that also has real-time capability.
The real-time performance of a deep model depends on the network size, hyperparameter tuning, computational facilities, parameter initialization, pre- and post-processing of the data, the data augmentation method, the convergence rate of the optimization method, and the activation function. As we saw in our experiments, the first two items affect the speed of each deep network model more than the others. Furthermore, the accuracy of deep learning methods is also highly related to the size of the training dataset and the complexity of the network. Hence, there is always a trade-off between the number of training samples, which is a serious constraint in many applications such as medicine [28], [52], and the number of parameters in the network, which requires more computing and memory units [30].
Enlarging a dataset through additional data acquisition is not always the best or cheapest alternative. Data augmentation can help to alleviate this difficulty, but a bigger dataset requires a network model that generalizes better. In general, large state-of-the-art deep network models [27], [32] generalize better on relatively small datasets such as ultrasound tongue data, but at the expense of higher computational cost due to their number of parameters; training and testing times rise accordingly. As Table 5 shows, sU-net is the fastest method despite having roughly double the parameters of BowNet, and the speed of each proposed method is in the real-time range.
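Online augmentation of the kind used in these experiments generates transformed copies of each training image on the fly, applying the same geometric transform to the image and its label mask. The paper's exact augmentation pipeline is not listed in this excerpt; the sketch below shows a minimal NumPy version with a random flip and shift:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask):
    """Minimal online-augmentation sketch: random horizontal flip and a
    small vertical shift, applied identically to the image and its label
    mask. (The paper's actual augmentation pipeline may differ.)"""
    if rng.random() < 0.5:                      # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    shift = int(rng.integers(-2, 3))            # shift by -2..2 rows
    image = np.roll(image, shift, axis=0)
    mask = np.roll(mask, shift, axis=0)
    return image, mask

img = rng.random((64, 64))
lbl = (img > 0.5).astype(np.uint8)
aug_img, aug_lbl = augment(img, lbl)
print(aug_img.shape, aug_lbl.shape)  # shapes are preserved
```

Because the transform is applied to image and mask together, the ground-truth contour stays aligned with the augmented frame.
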
Table 4. Results of testing each trained model on the different datasets. MSD is calculated in pixels and converted into millimetres using an approximation that depends on the ultrasound device resolution.
sU-net, sDeepLab, BowNet, and wBowNet serve the quantitative evaluation; U-net and DeepLabV3 are included for comparison with the original models. Values are mean ± STD.

Trained/validated on Dataset I — tested on Dataset I:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.023 ± 0.063 | 0.027 ± 0.026 | 0.028 ± 0.078 | 0.018 ± 0.064 | 0.010 ± 0.089 | 0.012 ± 0.026
Test Dice | 0.759 ± 0.084 | 0.748 ± 0.011 | 0.744 ± 0.018 | 0.826 ± 0.011 | 0.850 ± 0.018 | 0.840 ± 0.017
MSD (pixels) | 0.249 ± 0.182 | 0.266 ± 0.029 | 0.281 ± 0.090 | 0.213 ± 0.080 | 0.196 ± 0.069 | 0.154 ± 0.069
MSD (mm) | 0.037 ± 0.025 | 0.039 ± 0.016 | 0.042 ± 0.036 | 0.032 ± 0.090 | 0.030 ± 0.036 | 0.028 ± 0.031

Trained/validated on Dataset I — tested on Dataset II:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.082 ± 0.022 | 0.064 ± 0.004 | 0.069 ± 0.069 | 0.065 ± 0.003 | 0.056 ± 0.023 | 0.062 ± 0.007
Test Dice | 0.498 ± 0.012 | 0.526 ± 0.032 | 0.547 ± 0.011 | 0.520 ± 0.002 | 0.611 ± 0.020 | 0.538 ± 0.032
MSD (pixels) | 0.424 ± 0.063 | 0.240 ± 0.058 | 0.374 ± 0.063 | 0.358 ± 0.005 | 0.210 ± 0.064 | 0.371 ± 0.017
MSD (mm) | 0.063 ± 0.078 | 0.036 ± 0.002 | 0.056 ± 0.023 | 0.053 ± 0.006 | 0.361 ± 0.019 | 0.054 ± 0.003

Trained/validated on Dataset II (SeeingSpeech) — tested on Dataset II:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.048 ± 0.086 | 0.045 ± 0.007 | 0.045 ± 0.027 | 0.043 ± 0.005 | 0.034 ± 0.073 | 0.033 ± 0.007
Test Dice | 0.717 ± 0.084 | 0.730 ± 0.053 | 0.714 ± 0.080 | 0.736 ± 0.005 | 0.810 ± 0.019 | 0.831 ± 0.032
MSD (pixels) | 0.382 ± 0.068 | 0.391 ± 0.031 | 0.401 ± 0.031 | 0.269 ± 0.034 | 0.201 ± 0.011 | 0.210 ± 0.026
MSD (mm) | 0.057 ± 0.003 | 0.058 ± 0.051 | 0.060 ± 0.038 | 0.040 ± 0.021 | 0.030 ± 0.092 | 0.035 ± 0.002

Trained/validated on Dataset II — tested on Dataset I:

Metric | sU-net | sDeepLab | BowNet | wBowNet | U-net | DeepLabV3
Test loss | 0.044 ± 0.082 | 0.042 ± 0.008 | 0.039 ± 0.030 | 0.064 ± 0.068 | 0.034 ± 0.041 | 0.041 ± 0.005
Test Dice | 0.579 ± 0.043 | 0.578 ± 0.043 | 0.620 ± 0.002 | 0.397 ± 0.014 | 0.707 ± 0.025 | 0.550 ± 0.002
MSD (pixels) | 0.263 ± 0.090 | 0.375 ± 0.048 | 0.287 ± 0.080 | 0.180 ± 0.064 | 0.016 ± 0.005 | 0.023 ± 0.001
MSD (mm) | 0.039 ± 0.072 | 0.056 ± 0.037 | 0.043 ± 0.003 | 0.027 ± 0.011 | 0.022 ± 0.006 | 0.031 ± 0.006
Table 5. Comparison of the proposed network models in terms of the number of parameters, the memory required to store the parameters of each network, and the frame rate while testing each model on one batch of data. Results are the average of ten runs of each model with different test datasets (using online or offline augmentation).
Model | Number of parameters | Memory for learnable parameters (bytes) | Frame rate using GPU (fps)
sU-net | 948,833 | 3,795,332 | 72
sDeepLab | 785,889 | 3,143,556 | 32
BowNet | 434,785 | 1,739,140 | 42
wBowNet | 786,657 | 3,146,628 | 30
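Frame-rate figures like those in Table 5 can be obtained by timing repeated forward passes on a fixed batch after a short warm-up. The paper presumably timed its TensorFlow models this way on GPU; the generic sketch below uses a dummy predictor as a stand-in, since the actual models are not included in this excerpt:

```python
import time

def measure_fps(predict, batch, n_runs=10, warmup=2):
    """Average frames per second of `predict` over n_runs calls on one
    batch, after a few warm-up calls to exclude one-time setup costs."""
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    return n_runs * len(batch) / elapsed  # frames processed per second

# Toy stand-in for a network's forward pass:
dummy_predict = lambda batch: [x for x in batch]
fps = measure_fps(dummy_predict, batch=list(range(8)))
print(fps > 0)
```

For a real GPU model, the prediction call should be synchronized with the device before reading the clock, or the measured time will understate the true latency.
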
5. CONCLUSION
In this paper, we have proposed two new deep convolutional neural networks, BowNet and wBowNet, for tongue contour extraction, which benefit from dilated convolution to produce globally and locally accurate segmentation results. Extensive experimental studies demonstrated the improved performance of the proposed deep learning techniques. Overall, wBowNet combines accuracy, robustness, and real-time capability better than the comparison models such as sU-net, sDeepLab, or even BowNet. The materials of this study may help researchers in other fields, such as linguistics, to study tongue gestures in real time more easily, more accessibly, and with higher accuracy than previous methods allowed. Our qualitative analysis revealed the strength of the wBowNet architecture in generalizing across datasets and handling overfitting, and our quantitative study supported these observations, showing that the BowNet models provide more accurate predictions than similar techniques in several respects. In future work, the BowNet models can be applied to other well-known databases, including non-medical large-scale benchmarks, to further test the robustness of the approach, and these still-young models will need to be extended to other applications with new assessment criteria.
REFERENCES
[1] I. Fasel and J. Berry, “Deep belief networks for real-time extraction of tongue contours from ultrasound during
speech,” in Pattern Recognition (ICPR), 2010 20th International Conference on, 2010, pp. 1493–1496.
[2] M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clin. Linguist. Phonetics, vol. 19,
no. 6–7, pp. 455–501, Jan. 2005.
[3] C. Laporte and L. Ménard, “Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of
normal and impaired speech,” Med. Image Anal., vol. 44, pp. 98–114, Feb. 2018.
[4] A. Eshky et al., "UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions," in Proc. Interspeech, 2018, pp. 1888–1892.
[5] B. Gick, B. M. Bernhardt, P. Bacsfalvi, and I. Wilson, “Ultrasound imaging applications in second language
acquisition,” Phonol. Second Lang. Acquis., vol. 36, no. June, pp. 315–328, 2008.
[6] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, “Silent speech interfaces,”
Speech Commun., vol. 52, no. 4, pp. 270–287, Apr. 2010.
[7] M. Ohkubo and J. M. Scobbie, “Tongue Shape Dynamics in Swallowing Using Sagittal Ultrasound,”
Dysphagia, pp. 1–7, Jun. 2018.
[8] S. Chen, Y. Zheng, C. Wu, G. Sheng, P. Roussel, and B. Denby, “Direct, Near Real Time Animation of a 3D
Tongue Model Using Non-Invasive Ultrasound Images,” in 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4994–4998.
[9] Y. S. Akgul, C. Kambhamettu, and M. Stone, “Automatic extraction and tracking of the tongue contours,”
IEEE Trans. Med. Imaging, vol. 18, no. 10, pp. 1035–1045, 1999.
[10] K. Xu, T. Gábor Csapó, P. Roussel, and B. Denby, "A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization," J. Acoust. Soc. Am., vol. 139, no. 5, pp. EL154–EL160, 2016.
[11] M. H. Mozaffari, S. Guan, S. Wen, N. Wang, and W.-S. Lee, “Guided Learning of Pronunciation by
Visualizing Tongue Articulation in Ultrasound Image Sequences,” in 2018 IEEE International Conference on
Computational Intelligence and Virtual Environments for Measurement Systems and Applications
(CIVEMSA), 2018, pp. 1–5.
[12] K. Xu et al., “Development of a 3D tongue motion visualization platform based on ultrasound image
sequences,” arXiv Prepr. arXiv1605.06106, 2016.
[13] M. Li, C. Kambhamettu, and M. Stone, “Automatic contour tracking in ultrasound images,” Clin. Linguist.
Phonetics, vol. 19, no. 6–7, pp. 545–554, 2005.
[14] S. Ghrenassia, L. Ménard, and C. Laporte, “Interactive segmentation of tongue contours in ultrasound video
sequences using quality maps,” in Medical Imaging 2014: Image Processing, 2014, vol. 9034, p. 903440.
[15] C. Laporte and L. Ménard, “Robust tongue tracking in ultrasound images: a multi-hypothesis approach,” in
Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] K. Xu et al., “Robust contour tracking in ultrasound tongue image sequences,” Clin. Linguist. Phon., vol. 30,
no. 3–5, pp. 313–327, 2016.
[17] L. Tang and G. Hamarneh, “Graph-based tracking of the tongue contour in ultrasound sequences with adaptive
temporal regularization,” 2010 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. - Work. CVPRW
2010, pp. 154–161, 2010.
[18] J. Berry and I. Fasel, “Dynamics of tongue gestures extracted automatically from ultrasound,” in Acoustics,
Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 557–560.
[19] A. Jaumard-Hakoun, K. Xu, P. Roussel-ragot, and M. L. Stone, “Tongue Contour Extraction From Ultrasound
Images,” Proc. 18th Int. Congr. Phonetic Sci. (ICPhS 2015), 2015.
[20] L. Tang, T. Bressmann, and G. Hamarneh, "Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves," Med. Image Anal., vol. 16, no. 8, pp. 1503–1520, 2012.
[21] D. Fabre, T. Hueber, F. Bocquelet, and P. Badin, "Tongue tracking in ultrasound images using eigentongue decomposition and artificial neural networks," in Proc. Interspeech, 2015, pp. 2410–2414.
[22] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution
semantic segmentation,” 2017.
[23] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," CoRR, vol. abs/1409.1556, Sep. 2014.
[24] C. Szegedy et al., "Going Deeper with Convolutions," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural
Networks,” Commun. ACM, vol. 60, no. 6, 2017.
[26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[27] H. Noh, S. Hong, and B. Han, "Learning Deconvolution Network for Semantic Segmentation," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1520–1528.
[28] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image
segmentation,” in International Conference on Medical image computing and computer-assisted intervention,
2015, vol. 9351, pp. 234–241.
[29] F. Milletari, N. Navab, and S. A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical
image segmentation,” in Proceedings - 2016 4th International Conference on 3D Vision, 3DV 2016, 2016, pp.
565–571.
[30] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder
Architecture for Image Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 1–14,
2015.
[31] L.-C. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image
Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 1–14, Apr. 2014.
[32] L.-C. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic Image
Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[33] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking Atrous Convolution for Semantic Image
Segmentation,” Jun. 2017.
[34] R. Hamaguchi, A. Fujita, K. Nemoto, T. Imaizumi, and S. Hikosaka, "Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1442–1450.
[35] F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions,” Proc. ICLR, Nov. 2015.
[36] A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, and B. Denby, “Tongue contour extraction from
ultrasound images based on deep neural network,” Proc. 18th Int. Congr. Phonetic Sci. (ICPhS 2015), May
2016.
[37] J. Zhu, W. Styler, and I. C. Calloway, “Automatic tongue contour extraction in ultrasound images with
convolutional neural networks,” J. Acoust. Soc. Am., vol. 143, no. 3, pp. 1966–1966, Mar. 2018.
[38] J. Zhu, W. Styler, and I. Calloway, “A CNN-based tool for automatic tongue contour tracking in ultrasound
images,” pp. 1–6, Jul. 2019.
[39] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-Decoder with Atrous Separable
Convolution for Semantic Image Segmentation.” pp. 801–818, 2018.
[40] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and Checkerboard Artifacts,” Distill, vol. 1, no. 10, p.
e3, Oct. 2016.
[41] F. Yu, V. Koltun, and T. Funkhouser, "Dilated Residual Networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 472–480.
[42] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in 2010 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2528–2535.
[43] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2014, vol. 8689 LNCS, no. PART 1, pp. 818–833.
[44] J. Bergstra and Y. Bengio, "Random Search for Hyper-Parameter Optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv Prepr. arXiv1412.6980, Dec.
2014.
[46] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[47] F. Chollet et al., "Keras: Deep learning library for Theano and TensorFlow," https://keras.io, 2015.
[48] E. Lawson et al., “Seeing Speech: an articulatory web resource for the study of phonetics [website],” Apr.
2015.
[49] X. Y. Liu, J. Wu, and Z. H. Zhou, “Exploratory under-sampling for class-imbalance learning,” in Proceedings
- IEEE International Conference on Data Mining, ICDM, 2006, vol. 39, no. 2, pp. 965–969.
[50] G. V. Hahn-Powell, D. Archangeli, J. Berry, and I. Fasel, “AutoTrace: An automatic system for tracing tongue
contours,” J. Acoust. Soc. Am., vol. 136, no. 4, pp. 2104–2104, Oct. 2014.
[51] N. Martínková, P. Nová, O. V. Sablina, A. S. Graphodatsky, and J. Zima, “Karyotypic relationships of the
Tatra vole (Microtus tatricus),” Folia Zool., vol. 53, no. 3, pp. 279–284, 2004.
[52] G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, no. 1995,
pp. 60–88, 2017.