
Multi-Task Feature Learning for Low-ResourceQuery-by-Example Spoken Term Detection

Hongjie Chen, Student Member, IEEE, Cheung-Chi Leung, Member, IEEE, Lei Xie, Senior Member, IEEE, Bin Ma, Senior Member, IEEE, and Haizhou Li, Fellow, IEEE

Abstract—We propose a novel technique that learns a low-dimensional feature representation from unlabeled data of a target language, and labeled data from a non-target language. The technique is studied as a solution to query-by-example spoken term detection for a low-resource language. We extract low-dimensional features from a bottle-neck layer of a multi-task deep neural network, which is jointly trained with speech data from the low-resource target language and resource-rich non-target language. The proposed feature learning technique aims to extract acoustic features that offer phonetic discriminability. It explores a new way of leveraging cross-lingual speech data to overcome the resource limitation in the target language. We conduct query-by-example spoken term detection (QbE-STD) experiments using the dynamic time warping distance of the multi-task bottle-neck features between the query and the search database. The QbE-STD process does not rely on an automatic speech recognition pipeline of the target language. We validate the effectiveness of multi-task feature learning through a series of comparative experiments.

Index Terms—Query-by-example, spoken term detection, multi-task learning, bottle-neck feature

I. INTRODUCTION

Query-by-example spoken term detection (QbE-STD) aims to retrieve spoken content with spoken queries [1]. Research on QbE-STD has attracted much interest, especially in low-resource situations. For example, low-resource QbE-STD was investigated in a series of benchmark evaluations called Spoken Web Search (SWS) [2], [3], [4] or Query by Example Search on Speech Task (QUESST) [5], [6] in the MediaEval benchmark evaluation campaign [7]. In such low-resource situations, the transcription or linguistic knowledge of the target speech data is not always available. To eliminate the dependence on the transcription or any other linguistic metadata, one can perform direct acoustic pattern matching on the speech features in QbE-STD, which facilitates many applications in low-resource situations, such as topic classification [8] and topic segmentation of spoken documents [9], [10].

Hongjie Chen and Lei Xie are with the School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China (email: [email protected]; [email protected]; [email protected]; [email protected]). Lei Xie is the corresponding author.

Cheung-Chi Leung, Bin Ma, and Haizhou Li are with the Institute for Infocomm Research, Agency for Science, Technology and Research (A⋆STAR), Singapore (email: [email protected]; [email protected]). Haizhou Li is also with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (email: [email protected]).

This work is supported by the Chinese Scholarship Council (No. 201606291069) and the National Natural Science Foundation of China (No. 61571363).

The choice of speech features plays a crucial role in an acoustic pattern matching approach to QbE-STD. In this paper, we assume that no transcription is available in the target data. In such a situation, studies have shown that we can perform unsupervised feature learning on the unlabeled target speech data to characterize the acoustic content of speech, e.g. component posteriorgrams from unsupervised Gaussian mixture models (GMMs) [11], [12] or sub-clustered GMMs [13], state posteriorgrams from unsupervised hidden Markov models (HMMs) [14], [15], [16], unsupervised deep Boltzmann machine posteriorgrams [17], and internal representations from deep neural networks (DNNs). The DNNs can act as auto-encoders that reconstruct the input in the continuous feature space [18], [19], [20], [21], siamese neural networks that compare paired instances in latent continuous spaces [22], [23], or classifiers that predict unsupervised labels in latent discrete spaces [24], [25]. Such features generally outperform conventional spectral features, e.g. mel-frequency cepstral coefficients (MFCC), in low-resource QbE-STD because they express highly dynamic acoustic content with phoneme-like representations that facilitate the comparison of acoustic content. Studies have also shown that we may borrow representations learned from resource-rich languages, e.g. cross-lingual phonetic posteriorgrams [26], [27] and internal representations in deep neural networks (DNNs) such as bottle-neck features (BNFs) [27], [28], where the models are trained in a supervised manner on labeled non-target speech data. Such cross-lingual features outperform basic spectral features since they benefit from the discriminative phonetic representations that are well learned from a large amount of cross-lingual speech data.

In this study, we propose a novel technique to incorporate the target low-resource speech with a large amount of cross-lingual speech for QbE-STD. We hope to learn a unified low-dimensional feature representation which benefits from the best of both worlds mentioned above for low-resource QbE-STD, i.e. the expressiveness of phoneme-like units of the target speech and the phonetic discriminability from the rich cross-lingual resource.

This study is also inspired by our previous work in which we extract features through a DNN trained on unsupervised clustering labels [25]. Such labels represent phoneme-like units. In [25], the training procedure of the unsupervised feature extractor follows that of the standard supervised training of DNN acoustic models. In particular, such unsupervised speech features are low-dimensional and as competitive as the features learned in a supervised manner from the cross-lingual resource.


We believe that the combination of the phoneme-like unit information expressed using such speech features with the cross-lingual phonetic discriminability could be beneficial to the QbE-STD task.

Moreover, we notice that multi-task learning (MTL) [29] is capable of learning shared and transferable features for multiple related tasks, and it effectively increases the training sample size over that of a single learning task. We propose a multi-task feature learning technique to leverage the rich cross-lingual resource. In the proposed technique, we employ a Dirichlet process Gaussian mixture model (DPGMM) in unsupervised phoneme-like unit modeling of the target speech and use the component identity of the DPGMM, i.e. the phoneme-like label, to tokenize the target speech. Then we borrow resource-rich cross-lingual data together with their readily available tied triphone state labels, i.e. senone labels, to train an MTL deep neural network (MTL-DNN). The low-dimensional features are extracted from an internal bottle-neck layer in the MTL-DNN. The main contributions of this paper are summarized as follows:

• We propose an MTL technique to extract efficient low-dimensional features which leverage the expressiveness of the phoneme-like units of the target speech and the phoneme discriminability learned from a rich cross-lingual resource;

• We perform visualization and analysis to demonstrate the properties of the proposed features and their phoneme discrimination ability;

• We show empirically that the QbE-STD performance of different features correlates with their phoneme discrimination ability.

Note that an MTL-DNN is usually used to improve one or more primary learning tasks. However, in this paper, we treat the tasks equally, and the final features for QbE-STD are derived from an internal shared hidden layer of the MTL-DNN.

To evaluate the QbE-STD performance of the proposed features, a set of experiments was conducted on the TIMIT [30] speech corpus, where the transcription was assumed unavailable to simulate a low-resource scenario. We regard Mandarin Chinese as a resource-rich language in our experiments. To demonstrate the ability of our proposed features for phoneme discrimination, we performed feature visualization using multi-dimensional scaling (MDS) and the ABX phoneme discriminability test [31], [32] defined in track 1 of the Zero Resource Speech Challenge 2015 (ZeroSpeech 2015) [33]. Moreover, we studied the effect of the configuration of the MTL-DNN, including the position of the bottle-neck layer and the temporal context length of the input features, on the QbE-STD performance.

The remainder of the paper is organized as follows. Section II reviews the prior work on MTL and QbE-STD. Section III details our proposed approach to QbE-STD. Section IV gives the experimental details. Section V reports the QbE-STD results, investigates the empirical configuration, and explains empirically why the proposed MTL works for QbE-STD. Section VI concludes this paper and suggests some future work.

II. PRIOR WORK

A. Multi-Task Learning

Multi-task learning (MTL) [29], or bias learning [34], is a well-known machine learning technique that jointly learns a set of related tasks to improve the generalization performance on a single task or multiple tasks. It has proven successful in practice that MTL can learn inductively transferable knowledge which is beneficial for each single task, because the training sample size for each single task is effectively increased. In an early study [34], a generalization bound on the average error of MTL was derived according to statistical learning theory. In [35], a tighter generalization bound for each single learning task was derived by defining a notion of relatedness of multiple tasks. It is postulated in [29] that related tasks must share input features for joint MTL to be effective, and an MTL-oriented error back-propagation method is provided in which the inductively transferable knowledge is captured by the hidden units of artificial neural networks shared across the multiple tasks.

MTL can be implemented with deep neural networks (DNNs), which have been proven effective in a wide spectrum of speech and language processing tasks. For example, in speech recognition, the MTL-DNN is used to learn inductive knowledge from extra tasks such as speaker recognition [36], speech enhancement [37], [38], prediction of phoneme/articulatory context [39], [40], classification of different acoustic units (e.g. monophones, senone-clusters or tri-graphemes) [41], [42], [43], and cross-lingual speech recognition [44]. In speech synthesis, the MTL-DNN is employed to improve the naturalness of the synthetic speech in [45], where the input parameters for a vocoder are made more perceptually salient by learning a perceptual representation as an auxiliary task. In speaker verification, a previous study [46] proposes to use text (i.e. phrase) as auxiliary labels in the MTL-DNN to improve the performance.

These prior studies usually improve the performance of one or more primary learning tasks involved in the MTL. Instead, we treat the tasks equally in this study. We use the features derived from the internal shared hidden layer of the MTL-DNN for acoustic pattern matching. One task in our MTL-DNN is to predict the phoneme-like labels from a Dirichlet process Gaussian mixture model (DPGMM), while the other is to predict the cross-lingual tied triphone states, i.e. senone labels. We would like to see how the shared internal features, which can be regarded as a fusion between the phoneme-like unit information and the borrowed cross-lingual phonetic discriminability, facilitate the low-resource QbE-STD task.

B. Query-by-Example Spoken Term Detection via Fusion

In low-resource QbE-STD, we note that it is helpful to fuse multiple sources of knowledge to improve the expressiveness of the acoustic patterns of the target data. This can be implemented at different stages of a typical DTW-based system. For example, it has been demonstrated [26], [47], [48] that feature concatenation, which concatenates multiple sources of cross-lingual features at the very early stage, outperforms the features learned from the target data alone.


In [49], distance matrix combination has been shown to be effective in improving the QbE-STD performance as well, where the distance matrix derived from unsupervised features is combined with several distance matrices from different cross-lingual speech features during the DTW-based search. In [26], [47], [27], score fusion at the ranking stage has also been shown to be effective in boosting the QbE-STD performance.

While fusion can take place at different stages of the classification pipeline, we prefer fusion at an early stage because late fusion involves multiple parallel computational paths. For instance, distance matrix combination needs to compute several different distance matrices, one for each type of features. This is time-consuming, especially when complicated distance metrics, e.g. the Kullback–Leibler (KL) divergence, are involved. Similarly, score fusion suffers from multiple runs of the DTW-based search, which dramatically increases the computational load of the whole detection system. Thus feature concatenation is a preferable way to utilize readily available cross-lingual knowledge. We note that by simply concatenating the features, we may introduce redundant information. We consider that the proposed multi-task feature learning via the MTL-DNN offers a solution to the above issues. That is, right at the very early stage of the QbE-STD system, the proposed technique incorporates the phoneme-like unit information and the cross-lingual phonetic discriminability through a low-dimensional speech feature derived from an MTL-DNN.

III. THE PROPOSED TECHNIQUE FOR QUERY-BY-EXAMPLE SPOKEN TERM DETECTION

The proposed multi-task feature learning for query-by-example spoken term detection (QbE-STD) is depicted in Fig. 1. The multi-task learning deep neural network (MTL-DNN) takes low-level acoustic features, i.e. filterbank and fundamental frequency (F0), as the input and outputs the phoneme-like labels and the cross-lingual senone labels. The phoneme-like labels are generated from an unsupervised model learned on the target speech data, where no transcription or any other linguistic metadata is available. The cross-lingual senone labels are derived from a supervised model learned on the transcribed cross-lingual speech. When training the MTL-DNN, we associate each input sample with an unsupervised phoneme-like label or a cross-lingual senone label. Being optimized for both the phoneme-like labels and the cross-lingual senone labels simultaneously, the MTL-DNN is designed to capture the knowledge, i.e. the phoneme-like unit information and the cross-lingual phonetic information, carried by these two sets of labels. Instead of taking the output from one (or each) single learning task, we take the output from the shared bottle-neck layer, illustrated in the dashed box in the left panel of Fig. 1, as the final speech features. The resultant features are then fed to a dynamic time warping (DTW) based QbE-STD system.
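For orientation, the sketch below summarizes the pipeline of Fig. 1 in Python-style pseudocode. Every helper name in it (fit_dpgmm, dpgmm_labels, forced_align_senones, train_mtl_dnn, truncate_to_bottleneck, low_level_features, subsequence_dtw) is a hypothetical placeholder for a component detailed in Sections III-A to III-D; concrete sketches of several of them are given later.

```python
# Hypothetical end-to-end sketch of the pipeline in Fig. 1; all helpers are
# placeholders for the components described in Sections III-A to III-D.
def offline_training(target_corpus, crosslingual_corpus):
    dpgmm = fit_dpgmm(target_corpus)                        # Section III-A
    phoneme_like = dpgmm_labels(dpgmm, target_corpus)
    senones = forced_align_senones(crosslingual_corpus)     # Section III-B
    mtl_dnn = train_mtl_dnn(target_corpus, crosslingual_corpus,
                            phoneme_like, senones)          # Section III-C
    return truncate_to_bottleneck(mtl_dnn)                  # BNF extractor

def online_retrieval(bnf_extractor, query_wav, database_wavs):
    q = bnf_extractor(low_level_features(query_wav))        # query BNFs
    scores = [(utt, subsequence_dtw(q, bnf_extractor(low_level_features(utt))))
              for utt in database_wavs]                     # Section III-D
    return sorted(scores, key=lambda s: s[1])               # ascending dissimilarity
```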

A. Unsupervised Phoneme-like Labels

Previous studies [50], [12], [25], [51], [52] have shown that a Dirichlet process Gaussian mixture model (DPGMM) is suitable for unsupervised acoustic modeling in a low-resource speech scenario.

Firstly, [12] illustrated the effectiveness of parallel inference of the DPGMM for frame-wise speech clustering, and showed that DPGMM posteriorgrams (PGs) were the best-performing speech features in the phoneme discriminability test of ZeroSpeech 2015. It was also shown [25] that DNNs trained on unsupervised labels from the DPGMM provide bottle-neck features (BNFs) that are as competitive as cross-lingual BNFs in QbE-STD. Other studies have shown that DPGMM labels improve the posteriorgrams for phoneme discrimination [51], [52], and segment-wise acoustic modeling [50]. Actually, the DPGMM learned from speech frames can be regarded as a phoneme-like unit model where each Gaussian component models a phoneme-like unit. Meanwhile, the DPGMM is practically suitable for low-resource scenarios, where no knowledge about the model complexity is available. Thus we use the DPGMM to tokenize the target low-resource speech into sequences of phoneme-like labels.

Basically, the DPGMM is a Gaussian mixture model (GMM) extended in a non-parametric Bayesian way, in which a Dirichlet process prior is placed over the vanilla GMM with a set of hyperparameters. We refer the reader to [53] for a more detailed explanation of the model. We adopt the Metropolis-Hastings based split/merge sampling method [54] to infer the model parameters.

In our scenario, given feature vectors of speech frames $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{N}$, after the parameter inference of the DPGMM, we obtain $K$ Gaussian components together with their mixing weights, $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$, mean vectors, $\boldsymbol{\mu} = \{\boldsymbol{\mu}_k\}_{k=1}^{K}$, and covariance matrices, $\boldsymbol{\Sigma} = \{\boldsymbol{\Sigma}_k\}_{k=1}^{K}$. The label of $\mathbf{x}_i$, denoted as $l_i$, is computed as follows:

$$l_i(\mathbf{x}_i) = \arg\max_{1 \le k \le K} \; p_{i,k}, \qquad (1)$$

where

$$p_{i,k} = p(c_k \,|\, \mathbf{x}_i) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \,|\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \,|\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \qquad (2)$$

is the $k$-th element of the $i$-th posteriorgram (PG), represented by $\mathbf{p}_i = (p_{i,1}, \ldots, p_{i,K})$. The resultant labels of the speech frames are used as the phoneme-like labels for the unsupervised task in the MTL-DNN.
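The label assignment of Eqs. (1)–(2) can be sketched as follows, assuming the DPGMM parameters have already been inferred by the split/merge sampler; this is an illustrative NumPy/SciPy computation, not the sampler itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

def dpgmm_posteriorgrams(X, weights, means, covs):
    """Eq. (2): X is (N, D) frames; weights (K,), means (K, D), covs (K, D, D)."""
    log_joint = np.stack(
        [np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
         for w, m, c in zip(weights, means, covs)], axis=1)          # (N, K)
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_post)

def phoneme_like_labels(X, weights, means, covs):
    """Eq. (1): the index of the most probable Gaussian component per frame."""
    return dpgmm_posteriorgrams(X, weights, means, covs).argmax(axis=1)
```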

Note that we are not necessarily limited to using the DPGMM as our phoneme-like acoustic model. We believe that other temporal or hierarchical unsupervised modeling (e.g. [14], [15], [55], [16]) on the target data can also provide useful phoneme-like labels. Actually, we conducted a preliminary test on the labels derived using state-level PGs from the acoustic segment model (ASM) [56], [15], which models long-term temporal information better than a single GMM. The PGs from the ASM surpassed the PGs of the DPGMM and provided BNFs of similar performance in QbE-STD. However, we would like to leave the detailed investigation of temporal models to future work.

B. Cross-lingual Senone Labels

The frame-based cross-lingual senone labels are obtained for the cross-lingual training data by performing standard forced alignment using a GMM-HMM system trained with the readily available transcription.


[Fig. 1 block diagram. Offline training: the target speech corpus is tokenized by unsupervised tokenization into phoneme-like labels, and the cross-lingual speech corpus is tokenized by supervised tokenization into cross-lingual senone labels; both label sets, together with low-level acoustic features, are used to train the MTL-DNN. Online retrieval: the spoken query and the speech database pass through feature extraction and acoustic pattern matching to produce the query result.]

Fig. 1. Illustration of the multi-task feature learning for QbE-STD. When training the MTL-DNN, we associate each input sample with an unsupervised phoneme-like label or a cross-lingual senone label. We trim the last two layers off the MTL-DNN and the resultant DNN (in dashed box) is used to extract the shared bottle-neck features. Posteriorgrams extracted from the last layer of the MTL-DNN are also evaluated in our comparative experiments.

The senone (i.e. tied triphone state) labels are used in a cross-lingual task of the MTL-DNN. The number of senones is set to roughly 500, as in our early studies [25], [28].

C. Shared Bottle-Neck Feature Extraction from Multi-Task Learning DNN

As shown in the left panel of Fig. 1, we train the multi-task learning DNN (MTL-DNN) by taking the speech from both the target language and non-target languages as input, and optimize the network to minimize the errors against the unsupervised phoneme-like labels and the cross-lingual senone labels respectively. For ease of reading, we first look at a single-task learning (STL) based DNN, which is a special case of the MTL-DNN where the input (i.e. speech data) and the output (i.e. labels) are from one language alone.

In an STL-based DNN (STL-DNN), each non-bottle-neck layer usually consists of a linear transformation layer and a sigmoid layer. Therefore, the output of each unit in the $j$-th layer of a $J$-layer STL-DNN can be expressed as follows:

$$\mathbf{z}_j = \mathbf{W}_j \mathbf{y}_{j-1} + \mathbf{b}_j, \quad 0 \le j \le J, \qquad (3)$$

$$\mathbf{y}_j = \begin{cases} \mathbf{x}_i, & j = 0 \\ \mathrm{sigmoid}(\mathbf{z}_j), & 0 < j < J \\ \mathrm{softmax}(\mathbf{z}_j), & j = J \end{cases} \qquad (4)$$

where $\mathbf{W}_j$ is a weight matrix, $\mathbf{b}_j$ is a bias vector, $\mathbf{x}_i$ is the $i$-th speech feature vector, and both functions, sigmoid and softmax, conduct element-wise operations on the input vector. Given $N$ training speech vectors, $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{N}$, and the labels, $\mathcal{T} = \{\mathbf{t}_i\}_{i=1}^{N}$, the loss function is usually defined as the sum of the cross-entropy between the predictions and the true labels:

$$L_{ce} = -\sum_{i}^{N} \sum_{k}^{K} t_{i,k} \log s_{i,k}, \qquad (5)$$

where $t_{i,k}$ is the $k$-th element of $\mathbf{t}_i$ denoting whether $\mathbf{x}_i$ corresponds to class label $c_k$, $K$ is the number of output dimensions, and

$$s_{i,k} = p(c_k \,|\, \mathbf{x}_i) = \frac{e^{z_J^k}}{\sum_{k'=1}^{K} e^{z_J^{k'}}}, \qquad (6)$$

where $z_J^k$ is the $k$-th element of $\mathbf{z}_J$.

By replacing the softmax output layer of the STL-DNN with $M$ separate softmax layers, one for each learning task, we can obtain an MTL-DNN with its loss function being expressed as:

$$L_{ce} = \sum_{m=1}^{M} w_m L_{ce}^{(m)}, \qquad (7)$$

where $w_m$ is the weight for the $m$-th task, and $L_{ce}^{(m)}$ is the loss function for the $m$-th learning task as defined in Eq. (5). As in [44], $L_{ce}^{(m)}$ is defined on the input data and the output labels belonging to the $m$-th task. In the training process, the shared parameters are updated using the gradient obtained with the weighted sum of the back-propagated errors of all tasks, while the $m$-th softmax layer is updated with the input data and the expected labels belonging to the $m$-th task.

In this study, we place a bottle-neck layer of linear transformation units between the last two hidden layers of the MTL-DNN, through which we hope to provide a shared low-dimensional representation across the multiple learning tasks. All of the softmax layers in the output layer share equal weights. Mini-batch stochastic gradient descent is employed to obtain all the neural network parameters.

Once the MTL-DNN is trained, the last two layers of the DNN can be discarded. By forward-passing input spectral features through the truncated DNN, one can get BNFs from the shared bottle-neck layer.
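As an illustration of Eqs. (3)–(7) and of the bottle-neck extraction just described, the following PyTorch sketch builds a network with the topology of TABLE I (four 1024-unit sigmoid layers, a 40-unit linear bottle-neck, one more 1024-unit layer, and two task-specific output heads). It is a minimal reimplementation for clarity, not the Kaldi setup actually used in the experiments.

```python
import torch
import torch.nn as nn

class MTLBottleneckDNN(nn.Module):
    def __init__(self, in_dim, n_dpgmm, n_senone, hid=1024, bn=40):
        super().__init__()
        layers = []
        for i in range(4):                                  # four 1024-unit sigmoid layers
            layers += [nn.Linear(in_dim if i == 0 else hid, hid), nn.Sigmoid()]
        layers.append(nn.Linear(hid, bn))                   # shared 40-unit linear bottle-neck
        self.shared = nn.Sequential(*layers)
        self.top = nn.Sequential(nn.Linear(bn, hid), nn.Sigmoid())
        self.head_dpgmm = nn.Linear(hid, n_dpgmm)           # unsupervised phoneme-like task
        self.head_senone = nn.Linear(hid, n_senone)         # cross-lingual senone task

    def forward(self, x):
        bnf = self.shared(x)                                # bottle-neck features (BNFs)
        h = self.top(bnf)
        return bnf, self.head_dpgmm(h), self.head_senone(h)

def mtl_loss(model, x, labels, task, weights=(1.0, 1.0)):
    """Eq. (7) for a mini-batch carrying the labels of a single task
    (task 0: DPGMM phoneme-like labels, task 1: CHN senone labels);
    the softmax of Eq. (6) is applied inside cross_entropy."""
    _, logits_dpgmm, logits_senone = model(x)
    logits = logits_dpgmm if task == 0 else logits_senone
    return weights[task] * nn.functional.cross_entropy(logits, labels)
```

After training, the truncated network corresponds to `model.shared`, so forward-passing spliced input features through it yields the 40-dimensional BNFs.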

D. Query-by-Example Spoken Term Detection

The query-by-example spoken term detection (QbE-STD) consists of two main processes, namely feature extraction and acoustic pattern matching. The online retrieval process of the QbE-STD system is shown in the right panel of Fig. 1.

In QbE-STD, various speech features, such as mel-frequency cepstral coefficients (MFCC), posteriorgrams and bottle-neck features (BNFs), have been studied. In this paper, we extract the BNFs proposed in Section III-C for the acoustic pattern matching that follows.

In acoustic pattern matching, we employ subsequence dynamic time warping (DTW) [57] to find the best match between a query and a test utterance in the speech database.


Given two sequences of speech feature vectors from a spoken query and a test utterance, $\mathbf{Q} = (\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M)$ and $\mathbf{U} = (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N)$, where $M$ and $N$ are the lengths of the query and the test utterance, the distance $d_{i,j}$ between two arbitrary vectors, $\mathbf{q}_i$ and $\mathbf{u}_j$, is commonly computed as

$$d_{i,j} = \begin{cases} 1 - \dfrac{\mathbf{q}_i^{T}\mathbf{u}_j}{|\mathbf{q}_i|\,|\mathbf{u}_j|} & \text{for MFCCs and BNFs,} \\[2mm] -\log\big(\mathbf{q}_i^{T}\mathbf{u}_j\big) & \text{for posteriorgrams.} \end{cases} \qquad (8)$$

With the subsequence DTW search, one can find an optimal path with the minimum distance cost, also called the dissimilarity, between the query, $\mathbf{Q}$, and the test utterance, $\mathbf{U}$. For the spoken query $\mathbf{Q}$, all the test utterances $\mathcal{U} = \{\mathbf{U}_k\}_{k=1}^{K_U}$ are ranked in ascending order according to the dissimilarity.
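A minimal sketch of the matching step for BNFs is given below: it builds the cosine distance matrix of Eq. (8) and runs a subsequence DTW with free start and end points in the utterance. The step pattern and the path-length normalization are reasonable choices for illustration rather than the exact configuration of [57] used in the experiments.

```python
import numpy as np

def cosine_distance_matrix(Q, U):
    """Eq. (8) for MFCCs/BNFs: Q is (M, D), U is (N, D)."""
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-10)
    Un = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-10)
    return 1.0 - Qn @ Un.T

def subsequence_dtw(Q, U):
    """Dissimilarity of the best match of query Q inside utterance U."""
    d = cosine_distance_matrix(Q, U)
    M, N = d.shape
    D = np.full((M, N), np.inf)   # accumulated cost
    L = np.ones((M, N))           # length of the best path ending at each cell
    D[0, :] = d[0, :]             # the match may start at any utterance frame
    for i in range(1, M):
        for j in range(N):
            preds = [(D[i - 1, j], L[i - 1, j])]
            if j > 0:
                preds += [(D[i, j - 1], L[i, j - 1]),
                          (D[i - 1, j - 1], L[i - 1, j - 1])]
            best_cost, best_len = min(preds)
            D[i, j] = d[i, j] + best_cost
            L[i, j] = best_len + 1
    scores = D[-1, :] / L[-1, :]  # the match may end at any utterance frame
    return float(scores.min())
```

Ranking the test utterances by this dissimilarity in ascending order gives the retrieval list used by the evaluation metrics in Section IV-A.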

IV. EXPERIMENTAL SETUP

A. Query-by-Example Spoken Term Detection

In the QbE-STD experiment, we treated English as the low-resource language. We used the TIMIT English read speech corpus (LDC93S1, ∼5 hours) as a low-resource corpus without using the phonetic transcription, and the HKUST Mandarin Chinese telephone speech corpus [58] (LDC2005S15, ∼150 hours) as a cross-lingual corpus. We used the Metropolis-Hastings based split/merge sampler [54] to train the DPGMM, and Kaldi [59] for the feature extraction and the implementation of the DNNs.

Unsupervised Phoneme-like Labels: We obtained the unsupervised phoneme-like labels from the DPGMM trained on the training set of the TIMIT corpus. The DPGMM took 13-dimensional mel-frequency cepstral coefficients (MFCC) +∆+∆∆, followed by cepstral mean and variance normalization, as the input features. The hyperparameters for training the DPGMM were set as $\{\alpha = 1, \kappa_0 = 1, \nu_0 = 41, \mu_0 = \mathrm{mean}(\mathcal{X}), S_0 = \mathrm{cov}(\mathcal{X})\}$ [12], where $\mu_0$ and $S_0$ are the mean and covariance of the speech data $\mathcal{X}$.

Cross-lingual Senone Labels: We obtained the cross-lingual senone labels by performing standard forced alignment using a GMM-HMM system on the HKUST Mandarin Chinese telephone speech corpus. This system consists of a triphone acoustic model trained using 39-dimensional MFCC+∆+∆∆ features transformed by linear discriminant analysis, maximum likelihood linear transformation, and speaker adaptive training.

Deep Neural Networks: We trained three DNNs with different labels, as configured in TABLE I, for feature extraction. The number of layers and the number of hidden units follow those in [25], since empirically these parameter sizes are appropriate to model both the TIMIT and HKUST speech corpora. In the STL-DNNs, we used either the unsupervised phoneme-like labels or the cross-lingual senone labels, while the MTL-DNN used both sets of labels. We used filterbank and fundamental frequency (F0) features (39-dimensional in total) as the input features of the DNNs, since they yielded better-performing bottle-neck features (BNFs) in our previous study [25].

TABLE I
CONFIGURATION OF DNNS WHERE NU DENOTES THE NUMBER OF DISTINCT LABELS. DPGMM-BASED PHONEME-LIKE LABELS OR/AND MANDARIN CHINESE (CHN) SENONE LABELS WERE USED IN TRAINING OF THE DNNS. HERE, {1024×4, 40×1, 1024×1, NU} DENOTES FOUR 1024-UNIT HIDDEN LAYERS, ONE 40-UNIT LINEAR BOTTLE-NECK LAYER, ONE 1024-UNIT HIDDEN LAYER AND FINALLY ONE NU-UNIT CLASSIFICATION LAYER.

Type     | Tokenizer (NU)   | Topology
MTL-DNN  | DPGMM+CHN (718)  | {1024×4, 40×1, 1024×1, NU}
STL-DNN  | DPGMM (306)      | {1024×4, 40×1, 1024×1, NU}
STL-DNN  | CHN (412)        | {1024×4, 40×1, 1024×1, NU}

Pre-training was performed only when training the STL-DNN on the low-resource corpus using the phoneme-like labels. No pre-training was carried out for the MTL-DNN and the STL-DNN which use the rich cross-lingual data, because we found that pre-training did not give obvious gains when a large amount of training data was available. Similar to the standard DNN training recipe ‘run_multilingual.sh’ in Kaldi, 90% of the training set was used for training and the remaining 10% was used for cross-validation. The DNNs were trained using shuffled samples with a maximum of 20 epochs and an initial learning rate of 0.008. The learning rate was halved when the relative reduction of the validation loss was less than 0.01. The training procedure was stopped when the relative reduction of the validation loss became lower than 0.001.
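The learning-rate schedule described above can be sketched as a small control loop; train_one_epoch and validate are hypothetical callables standing in for the actual training and cross-validation passes.

```python
def train_with_halving(train_one_epoch, validate, lr=0.008, max_epochs=20):
    """train_one_epoch(lr): one pass over the training data;
    validate(): returns the current cross-validation loss."""
    prev_loss = validate()
    for _ in range(max_epochs):
        train_one_epoch(lr)
        loss = validate()
        rel_impr = (prev_loss - loss) / max(abs(prev_loss), 1e-12)
        if rel_impr < 0.001:   # stopping criterion
            break
        if rel_impr < 0.01:    # halving criterion
            lr *= 0.5
        prev_loss = loss
    return lr
```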

Evaluation Metrics: We built a speech database of 944 utterances from the test set of TIMIT and chose 346 spoken examples of 69 queries as the query set, where each example consists of at least 6 characters and spans more than 0.35 seconds. The QbE-STD was evaluated in terms of three metrics: 1) MAP: the mean average precision for correct hits in the retrieval; 2) P@N: the average precision of the top N utterances, where N is the number of correct hit utterances in the speech database; 3) P@10: the average precision over the first 10 ranked utterances.
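For concreteness, the three metrics can be computed from the per-query ranked lists as sketched below; this is an illustrative implementation of the definitions above, not the official scoring tool.

```python
import numpy as np

def average_precision(ranked, relevant):
    """Precision averaged over the ranks at which correct hits are retrieved."""
    hits, precisions = 0, []
    for rank, utt in enumerate(ranked, start=1):
        if utt in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(len(relevant), 1)

def evaluate(queries):
    """queries: list of (ranked_utterance_ids, set_of_relevant_utterance_ids)."""
    map_score = np.mean([average_precision(r, rel) for r, rel in queries])
    p_at_n = np.mean([len(set(r[:len(rel)]) & rel) / max(len(rel), 1)
                      for r, rel in queries])
    p_at_10 = np.mean([len(set(r[:10]) & rel) / 10 for r, rel in queries])
    return map_score, p_at_n, p_at_10
```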

Comparative Configuration: Firstly, we performed a set of comparative QbE-STD experiments between the bottle-neck features learned by the proposed technique and those derived from the STL-DNNs. Secondly, we compared the proposed BNFs with different methods of fusing the BNFs derived from the two STL-DNNs, including feature concatenation, distance matrix combination, and score fusion. These fusion methods made use of both target and non-target speech during offline training. Thirdly, we compared the posteriorgrams (PGs) learned by the MTL-DNN with those from the STL-DNNs and the fusion methods above. Here the posteriorgrams are the output of the softmax layers.

We noted that PGs are widely used in QbE-STD [60], [61], [17], [14], [62] for their superior performance to conventional spectral features, while BNFs have recently become prevalent in QbE-STD [63], [64], [27] for being low-dimensional. We investigated whether the BNFs derived from the MTL-DNN are as efficient as their corresponding PGs.

Additionally, we employed an acoustic segment model (ASM) to perform fusion. The frame-level representation used in the ASM was the concatenation of DPGMM posteriorgrams (PGs) and Mandarin Chinese (CHN) senone PGs.


TABLE II
DATA CONFIGURATION OF SEVEN METHODS IN THE COMPARATIVE STUDY.

Methods                | Target Speech | Non-target Speech
STL-DNN (DPGMM)        | √             |
STL-DNN (CHN)          |               | √
Feature Concatenation  | √             | √
Matrix Combination     | √             | √
Score Fusion           | √             | √
ASM-based Fusion       | √             | √
MTL-DNN                | √             | √

The number of segment units was selected to be 384 from {128, 256, 384, 512, 640} in a preliminary test. Each segment unit was modeled by a 3-state HMM where each state was modeled by a 5-component GMM. The state PGs of the ASM were tested on the QbE-STD task. Meanwhile, we tested the BNFs from an STL-DNN trained using the labels of the ASM on the QbE-STD task. The methods mentioned above are summarized in TABLE II, where STL-DNN (DPGMM) and STL-DNN (CHN) denote the STL-DNNs trained using the phoneme-like labels from the DPGMM and the Mandarin Chinese (CHN) senone labels respectively. The QbE-STD results are reported in Section V-A.

B. Feature Analysis

To appreciate the strength of the multi-task features, we conducted multi-dimensional scaling (MDS) and the ABX phoneme discriminability test [31], [32] to study their phoneme representation ability.

In MDS, we used the phoneme transcription provided in the TIMIT corpus to compute the centroids of each English phoneme using the different BNFs. Here, for each phoneme, we computed 50 centroids by randomly sampling 50% of the observed frames of that phoneme. We then obtained a 2-dimensional representation which retains the similarity between data points in the multi-dimensional representation.
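The visualization can be reproduced along the following lines; the use of scikit-learn's metric MDS and matplotlib is an assumption for illustration, since the paper does not specify the implementation.

```python
import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

def plot_phoneme_centroids(bnf_by_phoneme, n_centroids=50, seed=0):
    """bnf_by_phoneme: dict mapping phoneme symbol -> (num_frames, dim) array."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for phone, frames in bnf_by_phoneme.items():
        for _ in range(n_centroids):
            idx = rng.choice(len(frames), size=len(frames) // 2, replace=False)
            points.append(frames[idx].mean(axis=0))   # one centroid per 50% sample
            labels.append(phone)
    emb = MDS(n_components=2, random_state=seed).fit_transform(np.array(points))
    for phone in bnf_by_phoneme:
        sel = [i for i, p in enumerate(labels) if p == phone]
        plt.scatter(emb[sel, 0], emb[sel, 1], label=phone, s=8)
    plt.legend(fontsize=6)
    plt.show()
```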

We carried out the ABX phoneme discriminability test on the English dataset from the Buckeye corpus [65] used in track 1 of ZeroSpeech 2015. We evaluated different BNFs from the DNNs trained in the QbE-STD experiment using the CHN senone labels or/and the DPGMM-based labels. Given two arbitrary phonemes, $a$ and $b$, together with two sets of their corresponding acoustic examples, $S(a)$ and $S(b)$, represented by a certain type of speech features, the correct rate $c(a, b)$ in the ABX phoneme discriminability test is calculated as follows:

$$c(a,b) = \frac{1}{m(m-1)n} \sum_{e_a \in S(a)} \sum_{e_b \in S(b)} \sum_{e_x \in S(a)\setminus\{e_a\}} \left( \delta_{d(e_a,e_x) < d(e_b,e_x)} + \frac{1}{2}\,\delta_{d(e_a,e_x) = d(e_b,e_x)} \right),$$

where $m$ and $n$ are the numbers of examples in $S(a)$ and $S(b)$, $d(e_x, e_y)$ denotes the DTW distance between the examples $e_x$ and $e_y$, and $\delta$ is an indicator function. In our implementation, when searching for the best alignment path, the accumulated DTW cost was normalized by the length of the currently aligned path, $L_{path}$, as:

$$d(e_x, e_y) = \frac{\mathrm{cost}(e_x, e_y)}{L_{path}}. \qquad (9)$$

TABLE III
QBE-STD PERFORMANCE COMPARISON AMONG BOTTLE-NECK FEATURES LEARNED BY THE PROPOSED MTL TECHNIQUE, FEATURES LEARNED BY SINGLE TASK LEARNING, AND DIFFERENT FUSION METHODS.

Methods               | Feat. | NDim | MAP   | P@N   | P@10
–                     | MFCC  | 39   | 0.285 | 0.289 | 0.247
STL-DNN (DPGMM)       | BNF   | 40   | 0.494 | 0.470 | 0.412
STL-DNN (CHN)         | BNF   | 40   | 0.482 | 0.461 | 0.405
Feature Concatenation | BNF   | 80   | 0.545 | 0.511 | 0.451
Matrix Combination    | BNF   | –    | 0.541 | 0.507 | 0.450
Score Fusion          | BNF   | –    | 0.539 | 0.505 | 0.448
ASM-based Fusion      | BNF   | 40   | 0.473 | 0.441 | 0.394
MTL-DNN               | BNF   | 40   | 0.548 | 0.513 | 0.456
STL-DNN (DPGMM)       | PG    | 306  | 0.405 | 0.390 | 0.340
STL-DNN (CHN)         | PG    | 412  | 0.334 | 0.312 | 0.290
Feature Concatenation | PG    | 718  | 0.471 | 0.440 | 0.390
Matrix Combination    | PG    | –    | 0.460 | 0.437 | 0.385
Score Fusion          | PG    | –    | 0.461 | 0.436 | 0.380
ASM-based Fusion      | PG    | 1152 | 0.451 | 0.428 | 0.374
MTL-DNN               | PG    | 718  | 0.469 | 0.440 | 0.389


Fig. 2. MAP of QbE-STD using BNFs and PGs from different methods.

We used the cosine distance for the BNFs when computing DTW distances. The correct rates were averaged over all found contexts for a given pair of central phonemes and then over all pairs of central phonemes [33]. The resultant correct rates are denoted as cr, and the error rates (i.e. 1 − cr) of the phoneme discriminability test are presented in Section V-B.
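The correct rate above can be computed as sketched below, assuming a path-length-normalized DTW distance function d (e.g. a full-alignment variant of the DTW sketch in Section III-D); this is an illustration of the formula, not the official ZeroSpeech evaluation code.

```python
def abx_correct_rate(S_a, S_b, d):
    """S_a, S_b: lists of feature matrices for phonemes a and b;
    d(x, y): path-length-normalized DTW distance between two examples."""
    m, n = len(S_a), len(S_b)
    total = 0.0
    for e_a in S_a:
        for e_b in S_b:
            for e_x in S_a:
                if e_x is e_a:              # e_x ranges over S(a) \ {e_a}
                    continue
                d_ax, d_bx = d(e_a, e_x), d(e_b, e_x)
                if d_ax < d_bx:
                    total += 1.0
                elif d_ax == d_bx:
                    total += 0.5
    return total / (m * (m - 1) * n)
```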

V. RESULTS AND ANALYSIS

A. Multi-Task Feature Learning vs. Different Fusion Methods

TABLE III details the QbE-STD performance of different methods. For ease of reading, we also plot the MAP performance of using BNFs and PGs in Fig. 2. The observations from the overall experimental results are summarized as follows:

• The MTL-based features, including BNFs and PGs, consistently outperform their STL-based counterparts. This is consistent with the finding in many earlier studies that the fusion of the knowledge learned in an unsupervised manner and the supervised cross-lingual knowledge is beneficial for QbE-STD, and it suggests that multi-task feature learning is an effective way to fuse these two sources of knowledge.

• The MTL-based BNFs are comparable to other fusion methods, such as feature concatenation, distance matrix combination and score-level fusion of the STL-based BNFs. Moreover, it is worth noting that the multi-task feature learning provides a compact feature representation, which is advantageous. In contrast, feature concatenation probably results in redundant dimensions, while distance matrix combination and score fusion lead to multiple runs of distance matrix computation or/and DTW search. To compare the computing time, we ran the QbE-STD experiment on a single-CPU machine (Intel® Xeon® CPU E5-2680 @ 2.7 GHz). The acoustic pattern matching between all the queries and test utterances using the MTL-based BNFs, feature concatenation, distance matrix combination, and score fusion took about 305 seconds, 353 seconds, 480 seconds, and 612 seconds, respectively.



Fig. 3. Distance matrices between a query and a test utterance in DTW using BNFs and PGs. The y- and x-axes show the words uttered in the query and the test utterance respectively.


• In ASM-based fusion, while the PGs outperform those of the DPGMM and the CHN tokenizer, the BNFs show no improvement over the STL-based counterparts.

• The MTL-based BNFs outperform the MTL-based PGs. Similar to the observations in [66], [25], the results suggest that the BNFs derived from the MTL-DNN are more effective feature representations than the PGs. In fact, as shown in Fig. 2, all types of BNFs outperform their PG counterparts even with a lower dimension. To visualize this, we plot the distance matrices between the queries and test utterances using the MTL-based BNFs and PGs in Fig. 3. Colours in the distance matrix depict the distance between speech frames, with red (blue) indicating large (small) distances. As illustrated in Fig. 3, the MTL-based BNFs give rise to a more salient blue diagonal band in the hit region, and larger areas of red and yellow in other regions.

The experiments suggest that it is beneficial to leverage cross-lingual speech data when the speech samples from the target language are insufficient, and multi-task learning offers a solution.

B. Feature Analysis

We now take a close look at the multi-task features through the experiment described in Section IV-B.

(1) Multi-dimensional Scaling Visualization

The transformed centroids of BNFs of plosive and fricative consonants are shown in Fig. 4. Several observations are made as follows:

• The clustering of the BNFs derived from the cross-lingual senone labels is consistent with phonetic knowledge.

TABLE IV
ERROR RATES (%) OF THE ABX PHONEME DISCRIMINABILITY TEST ON ENGLISH DATASET IN TRACK 1 OF ZEROSPEECH 2015.

Methods          | Feat. | Intra-speaker | Inter-speaker
–                | MFCC  | 12.6          | 24.4
STL-DNN (DPGMM)  | BNF   | 10.3          | 16.3
STL-DNN (CHN)    | BNF   | 11.2          | 15.9
MTL-DNN          | BNF   | 10.0          | 14.6

For example, we can classify the phonemes into different groups according to the place of articulation (e.g. {/d/, /t/, /z/, /s/} in the alveolar group, {/b/, /p/} in the bilabial group, {/g/, /k/} in the velar group, {/v/, /f/} in the labiodental group). As shown in Fig. 4(a) and Fig. 4(d), the distances between the phonemes in the same group are relatively smaller than those from different groups.

• The BNFs derived from the unsupervised phoneme-like labels (Fig. 4(b) and Fig. 4(e)) and the cross-lingual BNFs (Fig. 4(a) and Fig. 4(d)) show similar relative positions between phonemes.

• The BNFs derived from the MTL-DNN (Fig. 4(c) and Fig. 4(f)) maintain similar relative positions between the plosive and fricative consonants from different groups when compared to the BNFs derived from the STL-DNNs (Fig. 4(a)-(b) and Fig. 4(d)-(e)).

We also examined other phoneme categories, e.g. nasal consonants and diphthongs, which give similar observations. These results suggest that similar speech features can be obtained from the phoneme-like labels and the cross-lingual senone labels, and that the MTL-learned features are capable of retaining the information captured by each set of STL-learned features. These observations empirically explain why the BNFs from the MTL-DNN work well for QbE-STD.

(2) Phoneme Discriminability Test

We also report the intra-speaker and inter-speaker ABX phoneme discriminability test results in TABLE IV. The results suggest that the BNFs derived from the MTL-DNN outperform those from the STL-DNNs, while all the BNFs outperform MFCC features. This experiment confirms that the unsupervised phoneme-like labels and the cross-lingual senone labels provide complementary information characterizing the phonemes, which supports the results in TABLE III. It is logical to believe that better phoneme discrimination leads to better QbE-STD performance.

C. Bottle-Neck Layer Position and Temporal Context in Multi-Task Learning

In this section, we study the effect of the configurations of the MTL-DNN, including the position of the bottle-neck layer and the length of the input temporal context, on the QbE-STD performance.

We are interested in where to place the bottle-neck layer inside a 5-layer MTL-based DNN. As illustrated in Fig. 5(a), the performance of the BNFs varies according to the position of the bottle-neck layer. Generally speaking, when the bottle-neck layer is placed closer to the output layer, a better QbE-STD performance is observed. This reconfirms the finding that the speech representation from a deeper layer offers a better phoneme discrimination [67].


[Fig. 4 panels: (a) Plosive: STL-DNN (CHN); (b) Plosive: STL-DNN (DPGMM); (c) Plosive: MTL-DNN; (d) Fricative: STL-DNN (CHN); (e) Fricative: STL-DNN (DPGMM); (f) Fricative: MTL-DNN.]

Fig. 4. Multi-dimensional scaling visualization of the centroids of plosive consonants and fricative consonants in English from the TIMIT training set. The dots in the same color represent the centroids of the same phoneme. Each link between the centroids of two dot clusters indicates the relatively closer distance between the two phonemes classified into the same group (including /d/ and /t/ in the alveolar group, /b/ and /p/ in the bilabial group, /g/ and /k/ in the velar group, /v/ and /f/ in the labiodental group, /z/ and /s/ in the alveolar group).


Fig. 5. MAP of BNFs from the MTL-DNNs with the different configurations. (a) Bottle-neck layer at the different positions, (b) different context lengths.

Fig. 5(b) shows the effect of the temporal context size of the input spectral features in the MTL-DNN on the QbE-STD performance. In this study, we obtained the best result by splicing 5 frames from the left and right contexts. This is consistent with the setting in a fully unsupervised training scenario [12]. We emphasize that the choice of the context length is language-dependent. For a very low-resource language, one needs to make an effort to find the best context size for the input spectral features.
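For reference, input splicing with a context of 5 frames on each side (11 frames in total) can be sketched as below; the edge-repetition padding at utterance boundaries is an assumption for illustration.

```python
import numpy as np

def splice_frames(feats, context=5):
    """feats: (T, D) low-level features -> (T, (2 * context + 1) * D) spliced input."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + len(feats)] for i in range(2 * context + 1)], axis=1)
```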

VI. CONCLUSION AND FUTURE WORK

We have proposed a multi-task feature learning technique that incorporates both the phoneme-like unit information learned in an unsupervised way from a target language and the cross-lingual phonetic information learned in a supervised way from a non-target language. The dimension of our proposed features is the same as that of the corresponding STL-based features. The experiments have shown that the proposed features consistently outperform the STL-based features in QbE-STD. The proposed technique does not require multiple runs of early-stage processes, one for each individual feature source, during online retrieval.

In the future, we plan to investigate several aspects of the proposed technique. Firstly, we only studied the case of one target language in this paper. We would like to extend the study to multilingual cases, e.g. SWS [2], [3], [4] or QUESST [5], [6] in MediaEval [7]. Secondly, we would like to investigate whether more flexible graphical models, such as structural variational auto-encoders [68], are capable of better characterizing the target speech, thus providing better unsupervised knowledge. Thirdly, we would further study the use of unsupervised graphical models with temporal consideration [14], [69], [55], [70]. Moreover, we will consider implementing other network architectures, such as convolutional and recurrent neural networks, in the multi-task learning. Finally, we will consider using our proposed multi-task framework for learning acoustic word embeddings [71], [72], [21] so as to avoid the time-consuming DTW pattern matching between two arbitrary-length feature sequences.

REFERENCES

[1] L. Lee, J. R. Glass, H. Lee, and C. Chan, “Spoken content retrieval – beyond cascading speech recognition with text retrieval,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 9, pp. 1389–1420, 2015.

[2] N. Rajput and F. Metze, “Spoken Web Search,” in Work. Notes Proc. MediaEval Workshop, 2011.


[3] F. Metze, E. Barnard, M. H. Davel, C. J. van Heerden, X. Anguera, G. Gravier, and N. Rajput, “The Spoken Web Search task,” in Work. Notes Proc. MediaEval Workshop, 2012.

[4] X. Anguera, F. Metze, A. Buzo, I. Szoke, and L. J. Rodríguez-Fuentes, “The Spoken Web Search task,” in Work. Notes Proc. MediaEval Workshop, 2013.

[5] X. Anguera, L. J. Rodríguez-Fuentes, I. Szoke, A. Buzo, and F. Metze, “Query by example search on speech at MediaEval 2014,” in Work. Notes Proc. MediaEval Workshop, 2014.

[6] I. Szoke, L. J. Rodríguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, and X. Xiong, “Query by example search on speech at MediaEval 2015,” in Work. Notes Proc. MediaEval Workshop, 2015.

[7] http://www.multimediaeval.org.

[8] M. Dredze, A. Jansen, G. Coppersmith, and K. W. Church, “NLP on spoken documents without ASR,” in Proc. EMNLP. ACL, 2010, pp. 460–470.

[9] I. Malioutov, A. Park, R. Barzilay, and J. Glass, “Making sense of sound: Unsupervised topic segmentation over acoustic input,” in Proc. ACL. ACL, 2007, pp. 504–511.

[10] L. Zheng, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Acoustic texttiling for story segmentation of spoken documents,” in Proc. ICASSP. IEEE, 2012, pp. 5121–5124.

[11] A. S. Park and J. R. Glass, “Unsupervised pattern discovery in speech,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 1, pp. 186–197, 2008.

[12] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study,” in Proc. Interspeech. ISCA, 2015, pp. 3189–3193.

[13] A. Jansen, S. Thomas, and H. Hermansky, “Weak top-down constraints for unsupervised acoustic model training,” in Proc. ICASSP. IEEE, 2013, pp. 8091–8095.

[14] C.-y. Lee and J. Glass, “A nonparametric Bayesian approach to acoustic model discovery,” in Proc. ACL. ACL, 2012, pp. 40–49.

[15] H. Wang, C.-C. Leung, T. Lee, B. Ma, and H. Li, “An acoustic segment modeling approach to query-by-example spoken term detection,” in Proc. ICASSP. IEEE, 2012, pp. 5157–5160.

[16] A. H. H. N. Torbati and J. Picone, “A nonparametric Bayesian approach for spoken term detection by example query,” in Proc. Interspeech. ISCA, 2016, pp. 928–932.

[17] Y. Zhang, R. Salakhutdinov, H.-A. Chang, and J. Glass, “Resource configurable spoken query detection using deep Boltzmann machines,” in Proc. ICASSP. IEEE, 2012, pp. 5161–5164.

[18] L. Badino, C. Canevari, L. Fadiga, and G. Metta, “An auto-encoder based approach to unsupervised learning of subword units,” in Proc. ICASSP. IEEE, 2014, pp. 7634–7638.

[19] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neural network based feature extraction using weak top-down constraints,” in Proc. ICASSP. IEEE, 2015, pp. 5818–5822.

[20] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge,” in Proc. Interspeech. ISCA, 2015, pp. 3199–3203.

[21] Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder,” in Proc. Interspeech. ISCA, 2016, pp. 765–769.

[22] R. Thiolliere, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, “A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling,” in Proc. Interspeech. ISCA, 2015, pp. 3179–3183.

[23] Y. Yuan, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Learning neural network representations using cross-lingual bottleneck features with word-pair information,” in Proc. Interspeech. ISCA, 2016, pp. 788–792.

[24] C. Chung, C. Tsai, H. Lu, C. Liu, H. Lee, and L. Lee, “An iterative deep learning framework for unsupervised discovery of speech features and linguistic units with applications on spoken term detection,” in Proc. ASRU. IEEE, 2015, pp. 245–251.

[25] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised bottleneck features for low-resource query-by-example spoken term detection,” in Proc. Interspeech. ISCA, 2016, pp. 923–927.

[26] L. J. Rodríguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, and M. Díez, “High-performance query-by-example spoken term detection on the SWS 2013 evaluation,” in Proc. ICASSP. IEEE, 2014, pp. 7819–7823.

[27] H. Xu, P. Yang, X. Xiao, L. Xie, C.-C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow, B. Ma, E. Chng, and H. Li, “Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,” in Proc. ICASSP. IEEE, 2015, pp. 5191–5195.

[28] C.-C. Leung, L. Wang, H. Xu, J. Hou, V. T. Pham, H. Lv, L. Xie, X. Xiao, C. Ni, B. Ma, E. S. Chng, and H. Li, “Toward high-performance language-independent query-by-example spoken term detection for MediaEval 2015: Post-evaluation analysis,” in Proc. Interspeech, 2016, pp. 3703–3707.

[29] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.

[30] Linguistic Data Consortium, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” https://catalog.ldc.upenn.edu/ldc93s1, 1993.

[31] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline,” in Proc. Interspeech. ISCA, 2013, pp. 1–5.

[32] T. Schatz, V. Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and E. Dupoux, “Evaluating speech features with the Minimal-Pair ABX task (ii): Resistance to noise,” in Proc. Interspeech. ISCA, 2014, pp. 915–919.

[33] http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/.

[34] J. Baxter, “A model of inductive bias learning,” J. Artif. Intell. Res., vol. 12, pp. 149–198, 2000.

[35] S. Ben-David and R. Schuller, “Exploiting task relatedness for multiple task learning,” in Proc. COLT, 2003, pp. 567–580.

[36] Z. Tang, L. Li, D. Wang, and R. Vipperla, “Collaborative joint training with multitask recurrent model for speech and speaker recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 3, pp. 493–504, 2017.

[37] S. Parveen and P. D. Green, “Multitask learning in connectionist robust ASR using recurrent neural networks,” in Proc. Interspeech. ISCA, 2003.

[38] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Proc. Interspeech. ISCA, 2015, pp. 3274–3278.

[39] M. L. Seltzer and J. Droppo, “Multi-task learning in deep neural networks for improved phoneme recognition,” in Proc. ICASSP. IEEE, 2013, pp. 6965–6969.

[40] P. Bell and S. Renals, “Complementary tasks for context-dependent deep neural network acoustic models,” in Proc. Interspeech, 2015, pp. 3610–3614.

[41] Z. Huang, J. Li, S. M. Siniscalchi, I. Chen, J. Wu, and C. Lee, “Rapid adaptation for deep neural networks through multi-task learning,” in Proc. Interspeech. ISCA, 2015, pp. 3625–3629.

[42] D. Chen, B. Mak, C.-C. Leung, and S. Sivadas, “Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition,” in Proc. ICASSP. IEEE, 2014, pp. 5592–5596.

[43] P. Bell and S. Renals, “Regularization of context-dependent deep neural networks with context-independent multi-task training,” in Proc. ICASSP. IEEE, 2015, pp. 4290–4294.

[44] J. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in Proc. ICASSP. IEEE, 2013, pp. 7304–7308.

[45] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. ICASSP. IEEE, 2015, pp. 4460–4464.

[46] N. Chen, Y. Qian, and K. Yu, “Multi-task learning for text-dependent speaker verification,” in Proc. Interspeech. ISCA, 2015, pp. 185–189.

[47] I. Szoke, M. Skacel, and L. Burget, “BUT QUESST 2014 system description,” in Work. Notes Proc. MediaEval Workshop, 2014.

[48] M. Skacel and I. Szoke, “BUT QUESST 2015 system description,” in Work. Notes Proc. MediaEval Workshop, 2015.

[49] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection,” in Proc. ICASSP. IEEE, 2013, pp. 8545–8549.

[50] H. Kamper, A. Jansen, S. King, and S. Goldwater, “Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings,” in Proc. SLT. IEEE, 2014, pp. 100–105.

[51] M. Heck, S. Sakti, and S. Nakamura, “Supervised learning of acoustic models in a zero resource setting to improve DPGMM clustering,” in Proc. Interspeech. ISCA, 2016, pp. 1310–1314.

[52] ——, “Iterative training of a DPGMM-HMM acoustic unit recognizer in a zero resource scenario,” in Proc. SLT. IEEE, 2016, pp. 57–63.


[53] C. E. Rasmussen, “The infinite Gaussian mixture model,” in Proc. NIPS, 1999, pp. 554–560.

[54] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models using sub-cluster splits,” in Proc. NIPS, 2013, pp. 620–628.

[55] L. Ondel, L. Burget, and J. Cernocky, “Variational inference for acoustic unit discovery,” Procedia Comput. Sci., vol. 81, pp. 80–86, 2016.

[56] H. Li, B. Ma, and C. Lee, “A vector space modeling approach to spoken language identification,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 271–284, 2007.

[57] A. Muscariello, G. Gravier, and F. Bimbot, “Audio keyword extraction by unsupervised word discovery,” in Proc. Interspeech. ISCA, 2009, pp. 2843–2846.

[58] Linguistic Data Consortium, “HKUST Mandarin Telephone Speech, Part 1,” https://catalog.ldc.upenn.edu/LDC2005S15, 2005.

[59] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, no. EPFL-CONF-192584. IEEE, 2011.

[60] T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in Proc. ASRU. IEEE, 2009, pp. 421–426.

[61] Y. Zhang and J. R. Glass, “Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams,” in Proc. ASRU. IEEE, 2009, pp. 398–403.

[62] A. H. H. N. Torbati and J. Picone, “A nonparametric Bayesian approach for spoken term detection by example query,” in Proc. Interspeech. ISCA, 2016, pp. 928–932.

[63] I. Szoke, M. Fapso, and K. Vesely, “BUT2012 approaches for spoken web search - MediaEval 2012,” in Work. Notes Proc. MediaEval Workshop, 2012.

[64] J. Hou, V. T. Pham, C.-C. Leung, L. Wang, H. Xu, H. Lv, L. Xie, Z. Fu, C. Ni, X. Xiao, H. Chen, S. Zhang, S. Sun, Y. Yuan, P. Li, T. L. Nwe, S. Sivadas, B. Ma, E. S. Chng, and H. Li, “The NNI Query-by-Example System for MediaEval 2015,” in Work. Notes Proc. MediaEval Workshop, 2015.

[65] M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. D. Raymond, E. Hume, and E. Fosler-Lussier, “Buckeye corpus of conversational speech (2nd release),” Department of Psychology, Ohio State University, 2007. [Online]. Available: www.buckeyecorpus.osu.edu

[66] I. Szoke, M. Skacel, L. Burget, and J. Cernocky, “Coping with channel mismatch in query-by-example - BUT QUESST 2014,” in Proc. ICASSP. IEEE, 2015, pp. 5838–5842.

[67] T. Nagamine, M. L. Seltzer, and N. Mesgarani, “On the role of nonlinear transformations in deep neural network acoustic models,” in Proc. Interspeech. ISCA, 2016, pp. 803–807.

[68] M. J. Johnson, D. Duvenaud, A. B. Wiltschko, S. R. Datta, and R. P. Adams, “Composing graphical models with neural networks for structured representations and fast inference,” in Proc. NIPS, 2016, pp. 2946–2954.

[69] A. H. H. N. Torbati and J. Picone, “A doubly hierarchical Dirichlet process hidden Markov model with a non-ergodic structure,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 1, pp. 174–184, 2016.

[70] C. Liu, J. Yang, M. Sun, S. Kesiraju, A. Rott, L. Ondel, P. Ghahremani, N. Dehak, L. Burget, and S. Khudanpur, “An empirical evaluation of zero resource acoustic unit discovery,” in Proc. ICASSP. IEEE, 2017, pp. 5305–5309.

[71] K. Levin, A. Jansen, and B. V. Durme, “Segmental acoustic indexing for zero resource keyword search,” in Proc. ICASSP. IEEE, 2015, pp. 5828–5832.

[72] S. Settle, K. Levin, H. Kamper, and K. Livescu, “Query-by-example search with discriminative neural acoustic word embeddings,” arXiv:1706.03818, http://arxiv.org/abs/1706.03818, 2017.

Hongjie Chen (S’16) received the B.Eng. degree from Northwestern Polytechnical University (NPU), China, in 2013. He is currently a Ph.D. candidate in Computer Science and Technology in the Audio, Speech & Language Processing Group (ASLP), Shaanxi Provincial Key Laboratory of Speech & Image Information Processing, School of Computer Science, NPU. In 2014, he visited the Institute for Infocomm Research in Singapore as an intern. In 2017, he served as a research student at the Department of Electrical and Computer Engineering, National University of Singapore. His current research interests include automatic speech recognition, spoken document retrieval, and especially speech representation learning for under-resourced languages.

Cheung-Chi Leung (M’10) received the B.Eng. degree from the Hong Kong University of Science and Technology in 1999, and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong in 2001 and 2004, respectively. From 2004 to 2008, he was with the Spoken Language Processing Group, CNRS-LIMSI, France, as a Postdoctoral Researcher. In 2008, he joined the Institute for Infocomm Research (I2R) in Singapore, where he is currently a Scientist at the Human Language Technology Department. Dr. Leung is a key member in the development of I2R’s state-of-the-art automatic speech recognition systems. His current research interests include automatic speech recognition, spoken document retrieval, spoken language recognition and speaker recognition.

Lei Xie (M’07-SM’15) received the Ph.D. degree in Computer Science from Northwestern Polytechnical University (NPU), Xi’an, China, in 2004. He is currently a Professor with the School of Computer Science, NPU. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a Visiting Scientist. From 2004 to 2006, he was a Senior Research Associate in the Center for Media Technology (RCMT), School of Creative Media, City University of Hong Kong, Hong Kong. From 2006 to 2007, he was a Postdoctoral Fellow in the Human-Computer Communications Laboratory (HCCL), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. He has published more than 160 papers in major journals and proceedings, such as the IEEE Transactions on Audio, Speech, and Language Processing, IEEE Transactions on Multimedia, Information Sciences, Pattern Recognition, ACM Multimedia, ACL, INTERSPEECH, and ICASSP. His current research interests include speech and language processing, multimedia and human-computer interaction.

Bin Ma (M’00-SM’06) received the B.Sc. degree in computer science from Shandong University, China, in 1990, the M.Sc. degree in pattern recognition & artificial intelligence from the Institute of Automation, Chinese Academy of Sciences (IACAS), China, in 1993, and the Ph.D. degree in computer engineering from The University of Hong Kong in 2000. He was a Research Assistant from 1993 to 1996 at the National Laboratory of Pattern Recognition in IACAS. In 2000, he joined Lernout & Hauspie Asia Pacific as a Researcher working on speech recognition. From 2001 to 2004, he worked for InfoTalk Corp., Ltd., as a Senior Researcher and a Senior Technical Manager for speech recognition. He joined the Institute for Infocomm Research, Singapore, in 2004 and is now working as a Senior Scientist and the Lab Head of Automatic Speech Recognition. He has served as a Subject Editor for Speech Communication in 2009-2012, the Technical Program Co-Chair for INTERSPEECH 2014, and is now serving as an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. His current research interests include robust speech recognition, speaker & language recognition, spoken document retrieval, natural language processing and machine learning.


Haizhou Li (M’91-SM’01-F’14) received the B.Sc., M.Sc., and Ph.D. degrees in electrical and electronic engineering from South China University of Technology, Guangzhou, China, in 1984, 1987, and 1990, respectively. Dr. Li is currently a Professor at the Department of Electrical and Computer Engineering, National University of Singapore (NUS). He is also a Conjoint Professor at the University of New South Wales, Australia. His research interests include automatic speech recognition, speaker and language recognition, and natural language processing.

Prior to joining NUS, he taught in the University of Hong Kong (1988-1990) and South China University of Technology (1990-1994). He was a Visiting Professor at CRIN in France (1994-1995), Research Manager at the Apple-ISS Research Centre (1996-1998), Research Director in Lernout & Hauspie Asia Pacific (1999-2001), Vice President in InfoTalk Corp. Ltd. (2001-2003), and the Principal Scientist and Department Head of Human Language Technology in the Institute for Infocomm Research, Singapore (2003-2016).

Dr. Li is currently the Editor-in-Chief of IEEE/ACM Transactions on Audio, Speech and Language Processing (2015-2018) and a Member of the Editorial Board of Computer Speech and Language (2012-2018). He was an elected Member of the IEEE Speech and Language Processing Technical Committee (2013-2015), the President of the International Speech Communication Association (2015-2017), the President of the Asia Pacific Signal and Information Processing Association (2015-2016), and the President of the Asian Federation of Natural Language Processing (2017-2018). He was the General Chair of ACL 2012 and INTERSPEECH 2014.

Dr. Li is a Fellow of the IEEE. He was a recipient of the National Infocomm Award 2002 and the President’s Technology Award 2013 in Singapore. He was named one of the two Nokia Visiting Professors in 2009 by the Nokia Foundation.