SPEAKER DETECTION AND TRACKING FOR TELEPHONE TRANSACTIONS*

Jack McLaughlin
Applied Physics Laboratory, Univ. of Washington, Seattle, WA 98105
[email protected]

Douglas A. Reynolds
MIT Lincoln Laboratory, Lexington, MA 02420-9185
[email protected]

*This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government. J. McLaughlin was a member of MIT LL staff when this work was performed.
    ABSTRACT

As ever greater numbers of telephone transactions are being conducted solely between a caller and an automated answering system, the need increases for software which can automatically identify and authenticate these callers without the need for an onerous speaker enrollment process. In this paper we introduce and investigate a novel speaker detection and tracking (SDT) technique, which dynamically merges the traditional enrollment and recognition phases of the static speaker recognition task. In this speaker recognition application, no prior speaker models exist and the goal is to detect and model new speakers as they call into the system while also recognizing utterances from previously modeled callers. New speakers are added to the enrolled set of speakers, and speech from speakers in the currently enrolled set is used to update models. We describe a system based on a GMM speaker identification (SID) system and develop a new measure to evaluate the performance of the system on the SDT task. Results for both static, open-set detection and the SDT task are presented using a portion of the Switchboard corpus of telephone speech communications. Static open-set detection produces an equal error rate of about 5%. As expected, performance for SDT is quite varied, depending greatly on the speaker set and ordering of the test sequence. These initial results, however, are quite promising and point to potential areas in which to improve the system performance.

    1. INTRODUCTION

The general task of speaker recognition traditionally consists of two phases: enrollment and recognition. During enrollment, training speech collected from a speaker is used to train his/her model. The static collection of speaker models is then used during the recognition phase to either identify (closed-set) or verify (open-set) the speaker in an input speech utterance. In this paper we introduce and investigate a detection and tracking task in which utterances presented to our system must be either enrolled or recognized in a new and dynamic fashion.

In this task, no prior speaker models exist and the goal is to detect and model new speakers from a stream of single-speaker input utterances while also recognizing utterances from previously modeled speakers. The system must operate in an open-set identification mode where it is first determined whether a new utterance matches a previously modeled speaker or is from a new speaker. If the utterance matches a speaker in the current enrolled group, then the identified speaker's model is updated using the utterance. If the utterance is from a new speaker, then a new model is created and the enrolled group is updated. This task is more challenging than the static speaker recognition task in several respects. First, decisions at any point in time will have a direct effect on future decisions and performance, since speaker models are evolving and the enrolled group of alternative models is changing. Second, performance is difficult to characterize, since we now have several different types of errors to catalog and performance can be highly dependent on the ordering of test sequences.

Detection and tracking has been used heavily in the transcription of broadcast news corpora, where the goal is to automatically produce a clean text transcript of the audio portion of a broadcast [1]. The accuracy of word recognizers can be increased substantially if the acoustic characteristics of the incoming audio are detected and tracked so as to steer like audio segments to recognizers specially tuned for those acoustic conditions. Similar segments may have a common speaker as well as background noise [2][3], or may be alike in their speaker attribute alone [4] or possibly speaker and language attributes [5].

Another application of this detection and tracking technology is automatic message routing based on caller identification. For message routing, an incoming call could be identified as from a previous or new customer. Previous customers would be handled in a personalized manner, whereas new customers' information would be entered into the system and new speaker models automatically generated. The importance of updating customer models over time has been illustrated [6], and thus adapting such models as new data becomes available is an important part of any SDT system.

As a development corpus for this task, we have chosen to use a portion of the NIST 1999 Speaker Identification Evaluation corpus. This corpus is derived from the larger Switchboard-II phase 3 conversational telephone corpus collected by the Linguistic Data Consortium (LDC) [7].




The remainder of this paper is organized as follows. In Section 2 we describe in more detail the operation of our detection and tracking system. The system is built upon a Gaussian mixture model (GMM) based speaker verification system that is also described. The following section discusses the evaluation of our system, first describing the characteristics of the evaluation corpus and then detailing a novel evaluation metric for measuring the performance of the SDT system. Section 4 provides performance results for a static detection task and shows the result of our SDT system using the new metric. Section 5 wraps up with some conclusions and discussion of future directions.

    2. DETECTION AND TRACKING SYSTEM

In designing our SDT system, our goal is to process a series of single-speaker telephone transactions such that, at any point in time, we can associate each past message with a label indicating the speaker that originated it. As a new message enters the system, it is treated very much as a test message in open-set SID would be. An initial decision is made concerning whether the message came from a speaker within the enrolled set. If so, then we must decide which speaker the message came from, as in open-set SID. In SDT, we take the additional step of updating the model of the identified speaker by recalculating that model using all the messages previously associated with that speaker and the new message. In the event that the new message is determined to have come from an unenrolled speaker, we cannot merely reject the message as we would in the open-set task. Because we need to be able to identify that speaker in the future should he call again, it is necessary to create a model for that speaker using the new message. This speaker then becomes a member of the enrolled set.
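As a concrete illustration, the following Python sketch captures this decision-and-update loop. The helper names (train_model, the models' score method, ubm) are hypothetical stand-ins for the paper's GMM-UBM components, not its actual code; the normalized score it computes is the one defined by the equation below.

```python
def process_message(message, enrolled, history, ubm, theta, train_model):
    """One SDT step: identify an enrolled speaker and retrain his model
    using all messages assigned to him plus the new one, or enroll the
    message's speaker as new. `enrolled` maps speaker id -> GMM;
    `history` maps speaker id -> list of messages."""
    b = ubm.score(message)  # universal background model score B(x)
    # UBM-normalized score against each enrolled model M_i; each model
    # acts as an independent detector.
    norm = {spk: m.score(message) - b for spk, m in enrolled.items()}
    best = max(norm, key=norm.get, default=None)
    if best is not None and norm[best] >= theta:
        # Enrolled speaker detected: update his model with the new message.
        history[best].append(message)
        enrolled[best] = train_model(history[best])
        return best
    # Unenrolled speaker: create and enroll a new model from this message.
    new_id = "spk{}".format(len(enrolled))
    history[new_id] = [message]
    enrolled[new_id] = train_model([message])
    return new_id
```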

To determine if a new message comes from a speaker in the enrolled set, the message is scored against each existing model. These scores are then normalized using the score for that message against a universal background model [8] before comparison with a preset threshold \(\theta\):

\[
M_i(x) - B(x)
\begin{cases}
\geq \theta & \Rightarrow \text{enrolled speaker} \\
< \theta & \Rightarrow \text{unenrolled speaker}
\end{cases}
\]

Here, M_i(x) is the score for model i for the set of feature vectors x extracted from the message, and B(x) is the score for the universal background model. Note that if any one of the normalized scores exceeds the threshold, then this is sufficient to accept the message as coming from a speaker who is already enrolled. In this regard, the existing models can be viewed as a set of independently operating detectors, any one of which can alarm to indicate that the current speech file belongs to an enrollee. If an enrollee is detected, he is identified by choosing the model with the highest score.

At the heart of our detection and tracking system is an open-set SID system. Our speech data, originally sampled at 48 kHz, is downsampled to an 8 kHz sampling rate. The sampled speech is broken up into frames using a 20 msec window which slides by 10 msec, and from each frame a 20-element mel frequency cepstral coefficient vector is extracted after discarding cepstra outside the telephone band. Vectors composed of these cepstra and delta cepstra are used to build a Gaussian mixture model (GMM) for each enrolled speaker. To speed processing when running the SDT system, feature vectors were decimated by a factor of 10. This procedure is documented in detail in [9], and has surprisingly little effect on accuracy.

The 2048-mixture background model is trained using Switchboard data from the 1997 NIST evaluation, and the speaker models, also having 2048 mixtures, are adapted from this universal model as discussed in [8]. Several different values of the threshold were experimented with, all spaced around the point that yielded equal false alarm and miss probabilities in our open-set tests with the evaluation corpus (see below).
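A rough sketch of this front end, assuming librosa as a stand-in for the paper's own feature extraction code (and assuming 300-3400 Hz as "the telephone band"), might look like this:

```python
import numpy as np
import librosa  # assumption: the paper's actual front end is not librosa

def extract_features(path, decimate=10):
    """20 MFCCs per 20 ms frame (10 ms slide) from 8 kHz speech, stacked
    with delta cepstra, then frame-decimated by a factor of 10 as in [9]."""
    y, sr = librosa.load(path, sr=8000)  # downsample to 8 kHz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20,
        n_fft=256, win_length=160, hop_length=80,  # 20 ms window, 10 ms slide
        fmin=300, fmax=3400)                       # assumed telephone band
    delta = librosa.feature.delta(mfcc)            # delta cepstra
    feats = np.vstack([mfcc, delta]).T             # (frames, 40) vectors
    return feats[::decimate]                       # 10x decimation speed-up
```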

    3. SYSTEM EVALUATION

    1999 Evaluation Corpus

System development was done using the NIST 1999 Speaker Identification Evaluation corpus. Culled from the Switchboard-II phase 3 telephone speech collection, this corpus contains both single-speaker and multi-speaker utterances; we used only the single-speaker, male utterances. In addition, though these utterances were recorded from both electret and carbon button handsets, we chose to use only the electret utterances in order to eliminate the known problem of mismatch in training and test data. From what remained, we worked with the 403 utterances from the 60 speakers with the greatest number of utterances. These ranged in duration from nearly zero length to a minute, with an average duration of 30 seconds.
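The subset selection reduces to keeping the most prolific speakers; a minimal sketch, assuming the utterance pool is represented as (speaker_id, utterance) pairs (our representation, not the corpus format):

```python
from collections import Counter

def top_speaker_subset(utterances, n_speakers=60):
    """Keep only utterances from the n speakers with the most utterances.
    `utterances` is a list of (speaker_id, utterance) pairs."""
    counts = Counter(spk for spk, _ in utterances)
    keep = {spk for spk, _ in counts.most_common(n_speakers)}
    return [(spk, utt) for spk, utt in utterances if spk in keep]
```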

    Evaluation Metric

In order to assess performance as changes are made to the system, it is essential to have some metric which considers the gravity of each type of error. We propose a scheme which assesses penalty points for each of three different errors:

A) Missed model: This occurs when a new speaker arrives and we fail to detect him. Thus, we miss creating a model for this new speaker.

B) Duplicate model: Occurs when we create a model for a speaker who is already enrolled and has a primary model.

C) Misidentification: Occurs when a message is correctly assessed as belonging to an enrolled speaker, but the system fails to assign it to the correct speaker's primary or duplicate model(s).

Another possible result of classification is that an utterance is associated with a duplicate model. Since the system, in this case, is associating speech from a particular speaker with a model for that speaker, we count this as being correct and assess no penalty points. For each message entering our system, then, we must either make one of the above three errors or be correct. Points are assigned on a message-by-message basis as shown in Table 1. The missed model and duplicate model errors receive the largest number of penalty points because both of these errors result in the system having an incorrect perception of the number of speakers present in the run, and in our application we would like, at the very least, to maintain an accurate count of the number of callers the system has seen. Of the two, the missed model error seems more serious because it is more difficult to recover from. The duplicate model error could be largely recovered from by recognition, on the part of a human operator or the software, that a particular speaker has more than one model representing him.

    4. RESULTS

As a means of assessing the performance of our SDT system's underlying open-set SID system on the 60-speaker subset of our corpus, models were trained for each of the speakers using approximately one minute of training data per model, taken from the training data portion of the Eval 99 dataset. Testing was performed using close to 1500 utterances of varying durations, averaging 30 seconds in length.

The accuracy of an open-set SID system is dependent upon the number of speakers enrolled, but we would like to have some measure of the goodness of our system which is independent of this factor. A detection curve is one such measure. Figure 1 shows the detection curve resulting from the training and testing described above. The false alarm probability describes the percentage of test messages whose normalized score exceeds the threshold when scored against a model for a speaker who did not actually produce the message. This is plotted against the miss probability, which is the percentage of messages that fell below the threshold when they were scored against the model for the speaker that actually did produce the message. Sweeping over a range of thresholds produces the detection curve. In essence, this is a plot of the aggregate open-set result for an enrolled group of size one, with each speaker in the corpus in turn serving as the enrolled speaker.

Also given in Figure 1 are the detection curves for all the single-speaker, male utterances regardless of handset type and for all males using electret handsets. We see that, as expected, results are considerably worse when two handset types are involved and no compensation is performed. It was to avoid the complications of such compensation that we excluded the carbon button handset utterances from our development data. Note also that the 60-speaker subset we have chosen performs slightly better than the full 231 male talkers, but is more or less reflective of the entire set.
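The points of such a detection curve follow directly from the definitions above; a minimal sketch, assuming the score arrays hold the UBM-normalized scores already described:

```python
import numpy as np

def det_points(target_scores, impostor_scores, thresholds):
    """Return (P_fa, P_miss) pairs over a threshold sweep.
    target_scores: messages scored against their true speaker's model.
    impostor_scores: messages scored against other speakers' models."""
    target = np.asarray(target_scores)
    impostor = np.asarray(impostor_scores)
    p_miss = np.array([(target < t).mean() for t in thresholds])
    p_fa = np.array([(impostor >= t).mean() for t in thresholds])
    return p_fa, p_miss
```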

To evaluate our SDT system using our telephone speech corpus, the 403 utterances of our 60-speaker subset were used to create 52 sets of messages (or 52 "scenarios" in our terminology) by making 52 random draws of 300 messages. Because the ordering of messages has a profound effect on performance with an SDT system, within each scenario utterances were randomly sorted to assure a variety of evolutions. Each 300-message scenario contained an average of 50 speakers, with a minimum of 41 and a maximum of 60. Each set was then processed, in the random order, through our system, and penalty points were totaled with the introduction of each new utterance. This penalty point score was then normalized by dividing by the total number of utterances seen by the system up to that point. Figure 2 shows the score for the best, worst and average of these scenarios for each of the preset thresholds used, after processing of the first 100 utterances of each scenario.
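Generating such scenarios is straightforward; a sketch of the protocol under the same assumed utterance representation as above:

```python
import random

def make_scenarios(utterances, n_scenarios=52, n_messages=300, seed=0):
    """Draw 52 random 300-message scenarios from the utterance pool.
    random.sample both selects and orders the messages randomly, giving
    each scenario its own evolution."""
    rng = random.Random(seed)
    return [rng.sample(utterances, n_messages) for _ in range(n_scenarios)]
```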

Table 1: Penalty Point Assignments

    Missed model   Duplicate model   Mis-ID   Correct
         3                2             1        0
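The metric itself reduces to a weighted error count; a minimal sketch, assuming per-message outcome labels (our labels, not the paper's) recorded during an SDT run:

```python
# Penalty points from Table 1, keyed by hypothetical outcome labels.
PENALTY = {"missed_model": 3, "duplicate_model": 2, "mis_id": 1, "correct": 0}

def normalized_score(outcomes):
    """Total penalty points divided by the number of messages seen so far."""
    return sum(PENALTY[o] for o in outcomes) / len(outcomes)
```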

[Figure 1: Detection curves (miss probability vs. false alarm probability, both in %) for all males using both carbon and electret handsets, for all males using electret handsets, and for the 60-speaker subset.]

[Figure 2: Normalized score evaluated after 100 utterances for several different thresholds, for the best of the 52 scenarios, the worst, and the average. A threshold setting of 0 yields the minimum score, on average.]



Analysis of the results points out two important aspects of the problem of detection and tracking on a sequential set of utterances. First is the wide variability in performance between the best and worst scenarios. This may be due not only to the differences in the number of speakers in the scenarios, but also to the length and sound quality of the utterances. Short, noisy utterances will tend to hurt system performance, particularly if such messages occur early in the scenario, as this will lead to poor speaker models. On the other hand, long, clean messages, if correctly classified, will result in ever improving speaker models as time goes on for our dynamic system.

The second thing we observe is that a threshold seems to exist which yields the lowest score. This threshold (somewhere around 0) strongly discourages false alarms. This is important because, as noted earlier, each model is functioning independently as a detector, and if any one of them false alarms, then we will commit a missed model error. By biasing against this most expensive of errors, we reduce our score at the expense of committing less costly duplicate model errors. At threshold values much beyond 0, however, the large number of duplicate model errors being committed more than offsets the decrease in missed model errors, and the score rises.

    5. CONCLUSIONS, FUTURE DIRECTIONS

Using our GMM SID system, we have demonstrated reasonable performance on an open-set speaker recognition task using a portion of the Switchboard telephone speech corpus. Building upon this system, we have been able to construct a speaker detection and tracking system which is capable of creating models on the fly as data comes in. To evaluate the performance of this new, dynamic system, we have proposed a metric which takes into account considerations not relevant in static, open-set evaluations.

Our results indicate a number of directions for future work. A critical problem with a sequential classifier such as ours is the tendency for errors to compound as time goes on. Figure 3 illustrates this effect. The top half of the figure shows score over time, while the bottom half shows error (all misclassifications being counted equally). Both of these measures increase dramatically as additional utterances are processed.

[Figure 3: Normalized score (top) and percent error (bottom) vs. number of messages, averaged over 52 scenarios and evaluated every 20 utterances.]

This highlights the importance of high accuracy when adapting models with new data. We have observed that as models are adapted, the optimal detection threshold changes. Thus, a dynamic threshold and/or a model score normalization to stabilize the threshold should offer improvements over our current, fixed threshold.

At present, each new message is immediately incorporated into a model upon identification of the speaker (or used to create a new model in the event a new speaker is detected). Such a decision need not be made immediately, but can be deferred if a message is deemed to be unreliable. Alternatively, a message can be made to contribute to an existing model only in proportion to its goodness for SID. Methods for such utterance assessment have been proposed [10][11] and could be incorporated into our system.
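As a speculative sketch of that last idea (not the paper's method), a message's contribution to the statistics used for model adaptation could be scaled by a goodness measure in [0, 1]; the goodness score itself would come from an utterance-assessment method such as [10][11]:

```python
def weighted_update(model_stats, message_stats, goodness):
    """Blend per-message statistics into the model's running statistics
    in proportion to the message's assessed goodness for SID; goodness = 0
    defers the message entirely, goodness = 1 gives a full contribution."""
    return {k: (1.0 - goodness) * v + goodness * message_stats[k]
            for k, v in model_stats.items()}
```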

    REFERENCES

[1] C. L. Wayne, "Topic Detection & Tracking (TDT): Overview & Perspective," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.

[2] S. S. Chen and P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.

[3] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland and S. J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.

[4] M. Nishida and Y. Ariki, "Real Time Speaker Indexing Based on Subspace Method - Application to TV News Articles and Debate," Int. Conf. on Spoken Language Processing, vol. 4, pp. 1347, 1998.

[5] D. A. Reynolds, et al., "Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics," Int. Conf. on Spoken Language Processing, 1998.

[6] W. Mistretta and K. R. Farrell, "Model Adaptation Methods for Speaker Verification," Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 113-116, 1998.

[7] Linguistic Data Consortium, http://www.ldc.upenn.edu.

[8] D. Reynolds, "Comparison of Background Normalization Methods for Text-Independent Speaker Verification," Eurospeech 97, pp. 963-967, 1997.

[9] J. McLaughlin, D. A. Reynolds and T. Gleason, "A Study of Computation Speed-Ups of the GMM-UBM Speaker Recognition System," Eurospeech 99, pp. 1215-1218, 1999.

[10] J. Thompson and J. S. Mason, "The Pre-detection of Error-prone Class Members at the Enrollment Stage of Speaker Recognition Systems," Proc. ESCA-94 Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 127-130, 1994.

[11] S. Ong and M. P. Moody, "Confidence Analysis for Text-Independent Speaker Identification Using Statistical Feature Averaging," Applied Signal Processing, vol. 1, pp. 166-175, 1994.

