
INCREASE APPARENT PUBLIC SPEAKING FLUENCY BY SPEECH AUGMENTATION

Sagnik Das Nisha Gandhi Tejas Naik Roy Shilkrot

Human Interaction Lab, Stony Brook University
Department of Computer Science

Stony Brook, NY, USA

ABSTRACT

Fluent and confident speech is desirable to every speaker, but delivering a professional-quality speech requires a great deal of experience and practice. In this paper, we propose a speech stream manipulation system that helps non-professional speakers produce fluent, professional-sounding speech content, in turn contributing to better listener engagement and comprehension. We achieve this by manipulating the disfluencies in human speech: filled pauses such as the sounds uh and um, filler words, and awkward long silences. Given any unrehearsed speech, we segment and silence the filled pauses and adjust the duration of the imposed silences, as well as of other long (disfluent) pauses, using a predictive model learned from a professional speech dataset. Finally, we output an audio stream in which the speaker sounds more fluent, confident, and practiced than in the original recording. According to our quantitative evaluation, we significantly increase the fluency of speech by reducing the rate of pauses and fillers.

Index Terms— Speech disfluency detection, Speech disfluency repair, Speech processing, Assistive technologies in speech

1. INTRODUCTION

Professional speakers, who make their living from their speech, speak clearly and fluently with very few repetitions and revisions. Such error-free delivery is the result of many hours of practice and experience. A regular speaker, on the other hand, generally speaks with no real practice of articulation and delivery. Naturally, unrehearsed speech contains unintentional disfluencies that interrupt its flow. Speech disfluency generally includes long pauses, discourse markers, repeated words, phrases, or sentences, and fillers or filled pauses such as uh and um. According to [1], approximately 6% of speech consists of non-pause disfluencies. Filled pauses or filler words are the most common disfluency in any unrehearsed, impromptu speech [2].

Examples: https://sagniklp.github.io/pub-speaker-aug/

Fig. 1. The proposed speaker augmentation pipeline: filler-word segmentation, disfluent silence detection, filler word removal, and silence synthesis.

Considerable linguistics research has examined the effect of speech disfluencies on listener comprehension and on the speaker's cognitive state, such as uncertainty, confidence, thoughtfulness, and cognitive load [3, 2]. Several studies claim that disfluencies often reflect uncertainty in the speaker's mind about upcoming statements; consequently, less confident speakers tend to be more disfluent [2]. Moreover, it has been observed that filled pauses specifically indicate the speaker's level of cognitive difficulty [4]. Generally, disfluencies occur before a longer utterance [5] or when the topic is unfamiliar to the speaker [6]. Fluency reflects the speaker's ability to focus the listener's attention on his or her message rather than inviting the listener to focus on the idea and try to self-interpret it [7].


Considering the diverse factors affecting speaker fluency, our idea is to doctor a speech to make it fluent by taking care of the temporal factors that contribute to it.

In this work, we propose a system to detect, segment, and remove the most common disfluencies, namely filler words and long, unnatural pauses, from a speech to aid the speaker's fluency. Our system takes a raw speech track as input and outputs a modified, fluent version of it by intelligently removing the filled pauses and adjusting the "long" silences. We evaluate the performance of our system quantitatively on speeches of non-native speakers of English. We also propose an assistive user interface that helps users visualize and comparatively analyze their speech.

We interpret the occurrence of disfluencies in a speech as an acoustic event and take a segmentation approach for detection. A combined CNN and RNN architecture, a convolutional recurrent neural network (CRNN), is used for this task, inspired by [8]. Further, a binary classification approach is taken to detect long pauses between words. After deleting the filler words and adjusting the silences, the fluent version of the speech is obtained. The performance of our system is evaluated using the fluency metrics proposed by [9]. The essential contributions of this paper are:

1. A disfluency detection mechanism that works directly on acoustic features, without using any language features.

2. A silence modeling scheme directly conditioned on the previous speech.

3. A disfluency repair technique to help users improve a pre-delivered speech.

2. RELATED WORKS

In recent years, there has been a great deal of work on speech disfluencies, spanning the domains of psychology, linguistics, and natural language processing (NLP). Whereas psychology and linguistics researchers have focused on defining disfluencies and on their causes and effects from a language and cognitive perspective, NLP researchers have focused more on detecting them in speech transcripts to help language understanding and recognition systems.

The prime motivation for disfluency detection in NLP is to better interpret speech-to-text transcripts for natural language understanding systems.

One of the first works [10] focuses on classifying edit words (restarts and repairs) in text using a boosted classifier. Another contemporary approach applies a noisy channel model to detect and correct speech disfluencies [11, 12, 13]. Later, methods based on Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Integer Linear Programming (ILP) were introduced [14, 15]. A classification approach using lexical features is taken by [16], who specifically focus on dialogs with schizophrenic patients. More recently, incremental [17, 18, 19, 20], multi-step [21], and joint-task (parsing and disfluency detection) [22, 17] methods were introduced. Though all of these provide convincing results, they are limited to pre-defined feature templates (lexical, acoustic, and prosodic).

With advances in deep learning, most recent methods rely on recurrent neural networks (RNNs) [23, 24, 25, 26, 27]. These methods use word embeddings and acoustic features instead of pre-defined feature templates.

All the techniques above make one fundamental assumption: disfluency detection must have an automatic speech recognizer (ASR) in the pipeline. Consequently, to the best of our knowledge, all disfluency detection schemes presented so far work at the transcript level. Moreover, these systems have never been paired with an acoustic-level repair scheme aimed at exploring use cases from the perspective of the listener and the speaker.

In our work, we address these points by devising a disfluency detection and repair method that relies solely on acoustic features to synthesize temporally fluent speech segments, from a human-interaction perspective.

3. PROPOSED METHOD

3.1. Disfluency Detection

Our work focuses on building a system that can be used not only for disfluency detection but also to help users understand their disfluencies better. The primary motivations of this work are the following:

• Work with disfluencies at the acoustic level, without using any transcript.

• A significant portion of a disfluent speech consists of long pauses. At the transcript level this is not an issue, but at the acoustic level it matters a great deal in determining the speaker's fluency.

• Repair disfluent segments to help users understand possible improvements to their speech, as well as to create fluent speech content without much hassle.

The types of disfluencies we consider in this work are the use of filler words and intermittent long pauses.

3.1.1. Dataset

The dataset used for filler word segmentation is obtained from the Switchboard transcriptions (https://www.isip.piconepress.com/projects/switchboard/). We also used the Automanner [28] transcriptions (https://www.cs.rochester.edu/hci/currentprojects.php?proj=automanner) for additional data; this adds generalization to our training samples, since they contain recordings made through standard interfaces.

Fig. 2. Block diagram of the filler-word segmentation: MFCC features, convolution and max-pooling, stacked GRU layers, fully connected layers (FC1 with ReLU, FC2), and a softmax output.

To label disfluent silences we use a combination of a silence probability model [29] and a disfluency detection model [25]. First, we locate the silences and segment each word pair from the dataset; the probability model then decides whether a silence is disfluent. For each word-pair utterance, the silence probability model gives the probability (Psil) of a silence occurring between the two words. A word pair with a low Psil but a significant amount of silence is labeled as disfluent. If a word pair does not exist in the model vocabulary, we resort to the following approach: since disfluencies generally accompany longer silences, any silence within a disfluent segment is labeled as an unnatural pause. Additionally, word pairs surrounded by silences longer than 0.7 seconds are labeled similarly. This choice is empirical and can be considered safe because it is considerably higher than the suggested quantitative bound of 0.2 seconds for fluent micro-pauses [30]. Additional fluent pairs are collected from TIMIT [31].
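The following is a minimal sketch of the labeling heuristic described above, under stated assumptions: `psil` (a mapping from word pairs to silence probabilities, derived from [29]) and the cutoff `PSIL_LOW` are hypothetical names introduced here for illustration, not the authors' code.

```python
# Minimal sketch of the disfluent-silence labeling heuristic (assumptions noted).
MICRO_PAUSE = 0.2   # fluent micro-pause upper bound in seconds, suggested by [30]
LONG_PAUSE = 0.7    # experimentally chosen disfluent-silence threshold (seconds)
PSIL_LOW = 0.1      # hypothetical "silence unlikely here" cutoff for Psil

def label_silence(word_pair, silence_dur, psil, in_disfluent_segment=False):
    """Return 'disfluent' or 'fluent' for the silence between a word pair."""
    if word_pair in psil:
        # A noticeable pause that is unlikely under the silence model is disfluent.
        if psil[word_pair] < PSIL_LOW and silence_dur > MICRO_PAUSE:
            return "disfluent"
        return "fluent"
    # Out-of-vocabulary fallback: use the disfluency-detector context [25]
    # and the 0.7 s rule described above.
    if in_disfluent_segment or silence_dur > LONG_PAUSE:
        return "disfluent"
    return "fluent"
```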

3.1.2. Features

In this step, frame-level acoustic features (log mel band energies or mel-frequency cepstral coefficients, MFCCs) are obtained at each timestep t, resulting in a feature vector $m_t \in \mathbb{R}^C$. Here, $C$ is the number of features (in the frequency dimension) at frame $t$. The task of segmenting the filler words is formulated as binary classification of each frame into its correct class $k$ (Eq. 1).

$$\arg\max_k \; P\big(y_t^{(k)} \mid m_t, \theta\big) \qquad (1)$$

where $k \in \{1, 2\}$ and $\theta$ denotes the parameters of the classifier. In the training data, the target class $y_t^{(k)} = 1$ if frame $t$ belongs to class $k$ (determined using the onset/offset timeline of $k$ associated with a sound segment), and zero otherwise.

Each soundtrack $S$ is divided into multiple fixed-length sequences of frames $M_{t:t+T-1}$, where $T$ is the length of the frame sequence. The corresponding class label matrix $Y_{t:t+T-1}$ contains all the $y_t$.
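A short sketch of this feature extraction step is given below. It assumes librosa (not mentioned in the paper) and uses the 30 ms / 15 ms frame setting reported in Sec. 4.1.2; the sequence length T is a free parameter here.

```python
# Sketch: frame-level MFCC features split into fixed-length sequences (assumes librosa).
import numpy as np
import librosa

def extract_sequences(wav_path, n_mfcc=40, T=128):
    y, sr = librosa.load(wav_path, sr=None)
    n_fft = int(0.030 * sr)           # 30 ms analysis frames
    hop = int(0.015 * sr)             # 15 ms hop, i.e. 15 ms overlap
    # M has shape (C, num_frames): one C-dimensional feature vector m_t per frame t
    M = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=n_fft, hop_length=hop)
    n_chunks = M.shape[1] // T
    if n_chunks == 0:                 # clip shorter than one sequence
        return np.empty((0, M.shape[0], T))
    # Split into fixed-length chunks M_{t:t+T-1}; the trailing remainder is dropped
    return np.stack([M[:, i * T:(i + 1) * T] for i in range(n_chunks)])
```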

3.1.3. CRNN for filler word segmentation

Here we propose a convolutional recurrent neural network (CRNN) for filler word segmentation. A similar architecture has previously been used for sound event detection (SED) [8] and speech recognition [32]. The architecture is a combination of convolutional and recurrent layers, followed by feed-forward layers.

The sequence of extracted features $M \in \mathbb{R}^{C \times T}$ is fed to the CNN layers with rectified linear unit (ReLU) activations. The filters used in the CNN layers span the feature and time dimensions. Max-pooling is applied only over the frequency dimension. The output of max-pooling is a tensor $P_c \in \mathbb{R}^{F \times M' \times T}$, where $F$ is the number of filters of the final convolution layer and $M'$ is the truncated frequency dimension after the max-pooling operation.

To learn features over the time axis, the $F$ feature maps are stacked along the frequency axis, which yields a tensor $P_s \in \mathbb{R}^{(F \times M') \times T}$. This is fed to the RNN as a sequence of frames $p_t$, which outputs a hidden vector per layer and timestep. The output of the $i$-th recurrent layer is given in Eq. 2, where $\mathcal{F}$ is the function learned by each RNN unit. In this work, we use GRUs as presented in [33].

$$p_t^i = \mathcal{F}\big(p_t^{i-1}, p_{t-1}^i\big) \qquad (2)$$

The final RNN layer outputs $p_t^f$ are then fed to a fully connected (FC) layer with ReLU activation, producing $G \in \mathbb{R}^{FC_1 \times T}$, where $FC_1$ is the number of neurons in that layer. Finally, another FC layer with softmax activation is applied to obtain the class probabilities. Letting $\hat{G} \in \mathbb{R}^{K \times T}$ be the output tensor of the final FC layer, with columns $\hat{g}_t$, the probabilities are given by

$$P(y_t \mid m_{0:t}, \theta) = \mathrm{Softmax}(\hat{g}_t) \qquad (3)$$
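For concreteness, below is an illustrative PyTorch sketch of such a CRNN, assuming the mfcc configuration later reported in Sec. 4.1.2 (Table 1) and frequency-only pooling as described above. It is a reconstruction under those assumptions, not the authors' implementation.

```python
# Illustrative CRNN sketch (PyTorch); parameters follow the Table 1 mfcc setting.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_feats=40, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, padding="same"), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(5, 1)),             # pool frequency only
            nn.Conv2d(32, 64, kernel_size=4, padding="same"), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Dropout(0.25),
        )
        freq_out = n_feats // 5 // 4                      # truncated frequency dim M'
        self.gru = nn.GRU(64 * freq_out, 128, num_layers=3, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(128, 100), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(100, n_classes),                    # softmax is applied in the loss
        )

    def forward(self, x):                  # x: (batch, C=n_feats, T)
        x = self.conv(x.unsqueeze(1))      # -> (batch, F, M', T)
        b, f, m, t = x.shape
        x = x.reshape(b, f * m, t).permute(0, 2, 1)   # stack maps: (batch, T, F*M')
        x, _ = self.gru(x)                 # per-frame hidden states
        return self.fc(x)                  # frame-wise class logits (batch, T, K)
```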


Fig. 3. The visualization interface; top: speech track with colored segmentation outputs (brown: disfluent silences, blue: fillers, green: fluent silences); bottom: modified speech.

The CRNN training objective is to minimize the cross-entropy loss with $\ell_2$ regularization (Eq. 4):

$$\mathcal{L}(\theta) = -\sum_{t} \sum_{k} \log P\big(y_t^{(k)}\big) + \lambda \lVert \theta \rVert \qquad (4)$$

3.1.4. Disfluent Silence Classification

The problem is formulated as a binary classification task: given a silent segment $Z$, decide whether it is a disfluent or a non-disfluent silence. Classifying a silence only makes sense when it is combined with the adjacent utterances, because the occurrence of a silence is driven by the utterance and is also heavily influenced by disfluencies. Thus, it is not always the case that every pause longer than a fixed threshold is disfluent; the illustration in Fig. 3 gives an idea of this.

We train a support vector machine (SVM) for this task. Given a silent segment $Z$, it is first padded with the one-word utterances on its left and right. The MFCC features are then extracted, and we take the mean over the frequency axis to create the feature vector $z_i \in \mathbb{R}^{T}$, where $T$ is the number of frames in $z_i$. Segments are of variable length, so $z_i$ is padded with trailing zeros prior to classification. During testing we do not use the previous and next word boundaries; instead, a fixed-length time window is used. In our experiments, we found that 0.8-1.0 seconds gives good results.
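A possible realization of this feature preparation is sketched below, assuming librosa and scikit-learn (neither is named in the paper); the context window and the SVM hyperparameters follow the values reported in Sec. 4.1.2 and Table 3.

```python
# Sketch: mean-over-frequency MFCC features for silence classification (assumptions noted).
import numpy as np
import librosa
from sklearn.svm import SVC

MAX_FRAMES = 128  # maximum sequence length reported in Sec. 4.1.2

def silence_feature(y, sr, start, end, context=0.9):
    """Feature vector z_i for a silence [start, end] plus ~0.9 s of context on each side."""
    s = max(0, int((start - context) * sr))
    e = min(len(y), int((end + context) * sr))
    mfcc = librosa.feature.mfcc(y=y[s:e], sr=sr, n_mfcc=40,
                                n_fft=int(0.030 * sr), hop_length=int(0.015 * sr))
    z = mfcc.mean(axis=0)[:MAX_FRAMES]        # collapse the frequency axis -> (T,)
    return np.pad(z, (0, MAX_FRAMES - len(z)))  # trailing zeros up to MAX_FRAMES

# svm = SVC(kernel="rbf", C=10, max_iter=1500).fit(X_train, y_train)  # Table 3 setting
```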

3.2. Disfluency Repair

First, the fillers are replaced with silence. We found it often helpful (for instance, when ambient noise is present) to apply a decomposition mechanism [34] to the speech to separate the background noise from the vocals; each filler is then replaced with its corresponding background segment. The modified track is then used to segment the silences ($Z$), and finally the classification is performed.
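The splice itself is simple; the sketch below assumes filler intervals from the segmenter and a background-only track obtained via a separation method such as [34]. The function and argument names are illustrative, not the authors' API.

```python
# Sketch: replace filler spans in the mixture with the separated background audio.
import numpy as np

def replace_fillers(mix, background, sr, filler_intervals):
    """Overwrite each filler span (start, end in seconds) with background-only audio."""
    out = mix.copy()
    for start, end in filler_intervals:
        s, e = int(start * sr), int(end * sr)
        out[s:e] = background[s:e]      # keeps the ambience, drops the 'uh'/'um'
    return out
```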

All the silence segment lengths are then modified to make the speech fluent (Fig. 4).

Fig. 4. Silence modification pipeline: silence segmentation, filler replacement using the separated vocals and background, silence classification into fluent and disfluent intervals, and silence modification. The dashed line on the histogram of fluent silence times marks their median.

The goal is to reduce the number of long, unnatural pauses that hurt the fluency of the speech while keeping its pace intact; reducing the silences too much makes the speech sound unnatural and broken. We take the fluent silence times (i.e., those labeled fluent by our silence classifier), build a histogram, and find that using the median of the histogram bins as the target amount of silence works well. In this way, the distribution of silences along the speech stays close to a constant value, and the speaker sounds more consistent and fluent in the modified speech.
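The sketch below illustrates this adjustment under stated assumptions: the classifier supplies fluent and disfluent silence intervals, the target duration is the median fluent silence, and the splicing is done directly on the waveform. It is a simplified illustration of the idea, not the authors' implementation.

```python
# Sketch: shrink disfluent pauses toward the median fluent pause duration.
import numpy as np

def adjust_silences(y, sr, fluent, disfluent):
    """fluent/disfluent: lists of (start, end) silence intervals in seconds."""
    target = np.median([e - s for s, e in fluent])       # target silence length (s)
    pieces, cursor = [], 0
    for start, end in sorted(disfluent):
        s, e = int(start * sr), int(end * sr)
        pieces.append(y[cursor:s])                       # speech before the pause
        keep = min(e - s, int(target * sr))
        pieces.append(y[s:s + keep])                     # trimmed pause
        cursor = e
    pieces.append(y[cursor:])                            # remainder of the track
    return np.concatenate(pieces)
```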

4. RESULTS & ANALYSIS

4.1. Experimental Settings

4.1.1. Datasets

The experiments are performed on Switchboard [35], Automanner [28], and our own dataset of public-speaking recordings. To train the CRNN we use segments from Switchboard, and the CRNN test results are reported on held-out data from Switchboard-I. Silence classification results are reported on held-out data from TIMIT [31], Switchboard, and Automanner. All the fluency metrics are evaluated on our dataset, which contains recordings of 20 non-native speakers of English; the speakers were asked to talk on a specific topic for 50-60 seconds.

4.1.2. Parameter settings

We experimented with different configurations of the CNN and RNN parameters and with different features.

Types of features: Initial experiments were performed with mel-frequency cepstral coefficients (mfcc), mel spectrograms, log mel spectrograms (log mel), spectral contrast, zero-crossing rate, and tonnetz features. Based on the experimental results, the mfcc (40 × t) and log mel (128 × t) features are used for filler segmentation. For the silence classification, mfcc features are used after taking the mean over the frequency


axis. The final feature configurations are summarized in Table 1. All features are extracted in 30 ms frames with 15 ms overlap.

RNN & CRNN parameters: In experiments with the CNN and CRNN, we explore {1, 2, 3} convolutional (conv) layers with combinations of max pooling and average pooling. ReLU activation is used at each layer. The number of conv filters is chosen from {16, 32, 64} and the kernel sizes from {2, 3, 4, 5, 8}. All conv layers use same padding. The pooling size is varied within {2, 3, 4, 5, 8}, and pooling is performed only on the frequency dimension. We tried dropout ratios of {0.3, 0.5, 0.75}.

For the RNN we use gated recurrent units (GRUs). Experiments are performed with {2, 3} layers (l) and {64, 128, 256} hidden units (d). No intermediate dropout (dr) is applied.

The final fully connected layer (FC1 in Fig. 2) is experimented with using {100, 200} hidden units (d) and dropout ratios of {0.3, 0.5, 0.75}.

Features | CNN | RNN | FC
mfcc | conv1 [32,(8,8)], conv2 [64,(4,4)]; maxpool1 [5,5], maxpool2 [4,4]; dr=0.25 | l=3, d=128 | d=100, dr=0.5
log mel | conv1 [32,(8,8)], conv2 [64,(4,4)]; maxpool1 [8,4], maxpool2 [4,2]; dr=0.25 | l=3, d=128 | d=100, dr=0.5

Table 1. Final parameters for the CNN, RNN, and fully connected layers

The networks are trained in an end-to-end fashion using the AdaGrad algorithm for 200 epochs. The learning rate was set to 0.01 and the regularization constant (λ) to 0.01. The final parameters are given in Table 1.
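A minimal sketch of this training configuration is shown below, reusing the CRNN class sketched in Sec. 3.1.3. The random tensors stand in for Switchboard feature sequences and frame labels, and weight_decay is used here as a stand-in for the λ‖θ‖ term of Eq. 4; all of these are assumptions for illustration.

```python
# Sketch: AdaGrad training with lr = 0.01 and weight decay 0.01 for 200 epochs.
import torch
import torch.nn as nn

model = CRNN()                                        # CRNN sketch from Sec. 3.1.3
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()                     # frame-wise cross-entropy

X = torch.randn(8, 40, 128)                           # dummy batch: (B, C=40 mfcc, T=128)
Y = torch.randint(0, 2, (8, 128))                     # dummy filler / non-filler labels

for epoch in range(200):
    logits = model(X)                                 # (B, T, K)
    loss = criterion(logits.reshape(-1, 2), Y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```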

Silence classification parameters: The maximum sequence length was set to 128. The final parameters are given in Table 3.

SVM | LogReg | XGBoost
itr=1500, kernel=rbf, C=10 | itr=100, C=10 | depth=3, lr=0.1, estimators=100

Table 3. Final parameters used in silence classification

4.1.3. Evaluation Metrics

To evaluate the filler word segmentation we use the following frame-level statistics:

• F1 score (F1): The F1 score is calculated at the frame level (30 ms) using TP, the frames where fillers are correctly detected; TN, the frames where non-fillers are correctly detected; FP, the frames where fillers are wrongly detected; and FN, the frames where non-fillers are wrongly detected.

The silence classification is evaluated using the F1 score with respect to the disfluent silence class.

To evaluate the quality of the speech augmented by our system, we use the following metrics defined in [9]:

• Speech rate:
$$SR = \frac{\#\,\text{syllables}}{\text{total time} - ufp[<3]} \times 60 \qquad (5)$$
where $ufp[<3]$ is the total time of unfilled pauses shorter than 3 seconds; pauses longer than 3 seconds are considered articulation pauses [30].

• Articulation rate:
$$AR = \frac{\#\,\text{syllables}}{\text{total time}} \times 60 \qquad (6)$$

• Phonation-time ratio:
$$PTR = \frac{\text{speaking time}}{\text{total time}} \qquad (7)$$

• Mean length of runs:
$$MLR = \frac{\#\,\text{syllables}}{\#\,\text{utterances between } p[>0.25]} \qquad (8)$$
where $p[>0.25]$ denotes pauses longer than 0.25 seconds.

• Mean length of pauses:
$$MLP = \frac{\text{total duration of } p[>0.2]}{\#\,\text{of } p[>0.2]} \qquad (9)$$

• Filled pauses per minute:
$$FPM = \frac{\#\,\text{filled pauses}}{\text{total time}} \qquad (10)$$
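The sketch below computes these metrics from a syllable count, a filled-pause count, a list of unfilled-pause durations, and timing totals, following Eqs. 5-10 as written above. Treating the number of runs as one more than the number of pauses longer than 0.25 s is a simplification introduced here, not something the paper specifies.

```python
# Sketch: fluency metrics of [9] as defined in Eqs. 5-10 (simplified run counting).
def fluency_metrics(n_syllables, n_filled_pauses, pauses, total_time, speaking_time):
    ufp_lt3 = sum(p for p in pauses if p < 3)        # unfilled pauses shorter than 3 s
    run_breaks = [p for p in pauses if p > 0.25]     # pauses that end a run
    sig_pauses = [p for p in pauses if p > 0.2]      # pauses counted for MLP
    return {
        "SR":  n_syllables / (total_time - ufp_lt3) * 60,               # Eq. 5
        "AR":  n_syllables / total_time * 60,                           # Eq. 6
        "PTR": speaking_time / total_time,                              # Eq. 7
        "MLR": n_syllables / (len(run_breaks) + 1),                     # Eq. 8
        "MLP": sum(sig_pauses) / len(sig_pauses) if sig_pauses else 0,  # Eq. 9
        "FPM": n_filled_pauses / total_time,                            # Eq. 10
    }
```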

4.2. Filler Word Segmentation

The filler word segmentation evaluation results are given in Tables 4 and 5. In Table 4 we report the comparative performance of the CRNN using different features. To better assess the credibility of the CRNN, in Table 5 we compare its results to an automatic speech recognizer available with Kaldi (the ASpIRE chain model, https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire). Considering the simplicity of our network, it performs quite close to the ASR in terms of F1 score. All results are evaluated on a subset of the Switchboard-I dataset.

Features | Precision | Recall | F1
mfcc | 0.9482 | 0.9610 | 0.9534
log mel | 0.9495 | 0.9629 | 0.9550

Table 4. Performance of the CRNN with different features


Metrics → | SR ↑ | AR ↑ | PTR ↑ | MLR ↑ | MLP ↓ | FPM ↓
Original | 165.3571 | 171.0986 | 58.865 | 0.400 | 0.654 | 3.659
Processed | 186.241 | 186.241 | 65.570 | 0.495 | 0.365 | 1.762

Table 2. Fluency metrics before and after processing the speeches. ↑ means higher is better and ↓ means lower is better.

Method | Precision | Recall | F1
ASR | 0.9774 | 0.9792 | 0.9775
CRNN | 0.9495 | 0.9629 | 0.9550

Table 5. Performance of filler word segmentation compared to an automatic speech recognizer

The only drawback we observed when comparing our method with the ASR is that our classifier sometimes detects segments that merely sound similar to 'uh' or 'um'.

4.3. Disfluent Silence Classification

For this task we experimented with an SVM, logistic regression (LogReg), and XGBoost. The results are summarized in Table 6; we report them using 10-fold cross-validation.

Method → | SVM | LogReg | XGBoost
F1 | 0.9055 | 0.9200 | 0.9207

Table 6. Silence classification performance on TIMIT, Switchboard, and Automanner

4.4. Disfluency Repair

After processing the speeches by removing the fillers and long silences, the fluent speech is obtained. To compare the fluency of the synthesized and the original speech, the metrics discussed in Section 4.1.3 are used. The results are reported in Table 2; the mean of each metric across all samples is given. The numbers show a clear improvement in fluency. Notably, in the processed speech the articulation rate and the speech rate increase to the same value, since we handle all the unfilled pauses in the speech and introduce a more uniform silence production. Beyond the numbers, for a qualitative impression, some processed samples are available at the examples page linked in Section 1.

5. FUTURE WORK

This work is motivated by the fact that disfluency detection is not only useful for intelligent agents but is also a practical way to help users produce better, more confident, and more fluent talks. Given the range of disfluency types produced in speech, this work is a small step toward a bigger goal: repairing the disfluencies in a speech from the speaker's perspective. Along with addressing the pitfalls of our method, the following could be future directions of this work:

• Improving the filler word segmentation performance, as well as devising techniques to segment other common kinds of disfluencies (repetitions, discourse markers, corrections) and speech impairments (stuttering).

• Devising a dynamic, online repair scheme that generates the necessary (disfluent) portions of speech instead of replacing them.

6. CONCLUSION

Disfluency detection is a well-explored problem in the speech processing community, typically performed on speech transcripts and mostly aimed at intelligent conversational agents. In this work, we interpret disfluency detection from the speaker's perspective and introduce an additional component that repairs the disfluencies. We work solely in the acoustic domain, removing the need for a complex system such as an ASR before disfluency detection. With the results of our detection and repair scheme, we show improved fluency in speakers' dialogues given a less fluent speech. To the best of our knowledge, this is the first work on disfluency repair for the user's sake, and it can be further extended to assist users with speech impairments and other general disfluencies.

7. ACKNOWLEDGEMENTS

We are thankful to Faizaan Charania and Mahima Parashar for curating the dataset and working on some essential observations. We would also like to thank the participating speakers for the speeches they provided. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp and P6000 GPUs used for this research.

8. REFERENCES

[1] Jean E. Fox Tree, "The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech," Journal of Memory and Language, vol. 34, no. 6, pp. 709–738, 1995.

[2] Kathryn Womack, Wilson McCoy, Cecilia Ovesdotter Alm, Cara Calvelli, Jeff B. Pelz, Pengcheng Shi, and Anne Haake, "Disfluencies as extra-propositional indicators of cognitive processing," in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Association for Computational Linguistics, 2012, pp. 1–9.

[3] Martin Corley and Oliver W. Stewart, "Hesitation disfluencies in spontaneous speech: The meaning of um," Language and Linguistics Compass, vol. 2, no. 4, pp. 589–602, 2008.

[4] Dale J. Barr and Mandana Seyfeddinipur, "The role of fillers in listener attributions for speaker disfluency," Language and Cognitive Processes, vol. 25, no. 4, pp. 441–455, 2010.

[5] Elizabeth Shriberg, "Disfluencies in Switchboard," in Proceedings of the International Conference on Spoken Language Processing, 1996, vol. 96, pp. 11–14.

[6] Sandra Merlo and Letícia Lessa Mansur, "Descriptive discourse: topic familiarity and disfluencies," Journal of Communication Disorders, vol. 37, no. 6, pp. 489–503, 2004.

[7] Paul Lennon, "Investigating fluency in EFL: A quantitative approach," Language Learning, vol. 40, no. 3, pp. 387–417, 1990.

[8] Emre Cakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," arXiv preprint arXiv:1702.06286, 2017.

[9] Judit Kormos and Mariann Denes, "Exploring measures and perceptions of fluency in the speech of second language learners," System, vol. 32, no. 2, pp. 145–164, 2004.

[10] Eugene Charniak and Mark Johnson, "Edit detection and parsing for transcribed speech," in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, 2001, pp. 1–9.

[11] Matthias Honal and Tanja Schultz, "Correction of disfluencies in spontaneous speech using a noisy-channel approach," in Eighth European Conference on Speech Communication and Technology, 2003.

[12] Mark Johnson and Eugene Charniak, "A TAG-based noisy-channel model of speech repairs," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004.

[13] Simon Zwarts, Mark Johnson, and Robert Dale, "Detecting speech repairs incrementally using a noisy channel approach," in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1371–1378.

[14] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1526–1540, 2006.

[15] Kallirroi Georgila, "Using integer linear programming for detecting speech disfluencies," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics, 2009, pp. 109–112.

[16] Christine Howes, Matt Purver, Rose McCabe, P. G. Healey, and Mary Lavelle, "Helping the medicine go down: Repair and adherence in patient-clinician dialogues," in Proceedings of the 16th SemDial Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), 2012, pp. 19–21.

[17] Matthew Honnibal and Mark Johnson, "Joint incremental disfluency detection and dependency parsing," Transactions of the Association for Computational Linguistics, vol. 2, no. 1, pp. 131–142, 2014.

[18] Julian Hough and Matthew Purver, "Strongly incremental repair detection," arXiv preprint arXiv:1408.6788, 2014.

[19] Christine Howes, Julian Hough, Matthew Purver, and Rose McCabe, "Helping, I mean assessing psychiatric communication: An application of incremental self-repair detection," 2014.

[20] James Ferguson, Greg Durrett, and Dan Klein, "Disfluency detection with a semi-Markov model and prosodic features," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 257–262.

[21] Xian Qian and Yang Liu, "Disfluency detection using multi-step stacked learning," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 820–825.

[22] Mohammad Sadegh Rasooli and Joel Tetreault, "Joint parsing and disfluency detection in linear time," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 124–129.


[23] Julian Hough and David Schlangen, "Recurrent neural networks for incremental disfluency detection," Interspeech 2015, 2015.

[24] Shaolei Wang, Wanxiang Che, and Ting Liu, "A neural attention model for disfluency detection," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 278–287.

[25] Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, "Disfluency detection using a bidirectional LSTM," arXiv preprint arXiv:1604.03209, 2016.

[26] Julian Hough and David Schlangen, "Joint, incremental disfluency detection and utterance segmentation from speech," in Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

[27] Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu, "Transition-based disfluency detection using LSTMs," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2785–2794.

[28] M. Iftekhar Tanveer, Ru Zhao, Kezhen Chen, Zoe Tiet, and Mohammed Ehsan Hoque, "Automanner: An automated interface for making public speakers aware of their mannerisms," in Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 2016, pp. 385–396.

[29] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, "Pronunciation and silence probability modeling for ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[30] Heidi Riggenbach, "Toward an understanding of fluency: A microanalysis of nonnative speaker conversations," Discourse Processes, vol. 14, no. 4, pp. 423–441, 1991.

[31] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.

[32] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.

[33] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[34] Zafar Rafii and Bryan Pardo, "Music/voice separation using the similarity matrix," in ISMIR, 2012, pp. 583–588.

[35] John J. Godfrey, Edward C. Holliman, and Jane McDaniel, "Switchboard: Telephone speech corpus for research and development," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.