
Integration of a Large Text and Audio Corpus Using Speaker Identification

Deb Roy and Carl Malamud

MIT Media Laboratory, 20 Ames Street, E15-388, Cambridge, MA 02139

[email protected], [email protected]

Abstract

We report on an audio retrieval system which lets Internet users efficiently access a large text and audio corpus containing the transcripts and recordings of the proceedings of the United States House of Representatives. The audio has been temporally aligned to corresponding text transcripts (which are manually generated by the U.S. Government) using an automatic method based on speaker identification. This system is an example of using digital storage and structured media to make a large multimedia archive easily accessible.

Introduction

In the United States, the text of proceedings of the two houses of the Congress has long been published in the Congressional Record. No systematic effort has been made, however, to record audio from the floor of the House and Senate. In 1995, the non-profit Internet Multicasting Service (IMS) began sending out live streaming audio over the Internet and making complete digital audio recordings of the proceedings on computer disks. The challenge was to make this massive amount of recorded audio information available to Internet users in a meaningful way.

After investigating a variety of options, we decided to couple the Congressional Record (the text database) to the audio database. The resulting system allows users to efficiently search, browse, and retrieve audio over the Internet. The basic idea is to allow searches on a text transcript and then locate the audio which corresponds to the text search results. Correspondences between the text and audio are generated by automatically identifying the speakers in a recording and aligning speaker transitions in the audio with corresponding speaker transitions in the associated text transcript.

Recently there have been several efforts to build audio retrieval and indexing systems. The most popular approach has been to index audio based on content words using either large vocabulary speech recognition or keyword spotting (Wilpon et al. 1990, Rose, Chang & Lippman 1991, Wilcox & Bush 1991, Glavitsch & Schauble 1992, Jones et al. 1996, James 1996). Other cues including pitch contour, pause locations, and speaker changes have also been used (Chen & Withgott 1992, Arons 1994, Roy 1995). In one system the closed caption text of television news broadcasts was aligned to the audio track based on pause locations, enabling users to perform searches on text and then access corresponding audio (Horner 1991).

Media retrieval systems will continue to grow in importance as digital archives such as the following become common:

• Radio and television broadcast archives
• Internet sites with speech and music (already quite common)
• Recordings from various local sources such as lecture halls and courts
• Data collected on wearable computers which record first-person media (video, audio, and other information)

Since the majority of such archives will include speech, this is a natural application domain for speech processing. By extracting some structure from audio, an archive which is inaccessible due to its size and the difficulties of searching unstructured audio can become searchable by content. Applications include multimedia content re-use, audio note taking, and content-searchable multimedia archives.

In this paper we describe a novel method based on speaker identification which was used to align the text and audio recordings of the proceedings of the Congress. We also describe the WWW interface which readers are invited to try at http://town.hall.org/Congress/. The resulting system enables Internet users to quickly locate original congressional proceedings which were previously unavailable in audio form.

The Congressional Databases

The Congressional Record includes edited transcripts of the proceedings, manually generated time stamps, results of any votes, and scheduling information about upcoming sessions. The transcripts are originally created live during the proceedings by a human transcriber. Among other things, two types of information recorded by the transcriber are of particular interest for the automatic text to audio alignment task: each speaker transition is recorded, and time stamps spaced every 10 to 45 minutes are entered during long pauses in the proceedings. One of the significant challenges is that the Congressional Record is not a verbatim record of the proceedings. Members have the opportunity to add new material, abridge their remarks, and otherwise edit the transcripts.

The audio database used for the experiments described in this paper contains 132 hours of proceedings of the House of Representatives recorded from January 20 through February 22, 1995. We also collected the corresponding text transcripts in electronic form.

The Congressional Database Retrieval System

Figure 1 shows the main components of the audio retrieval system. The text and audio databases described in Section 2 are shown at the top of the figure. The World Wide Web (WWW) interface enables users to constrain searches using a variety of parameters (see Section 5 for more details). The search parameters are used to locate selections of text from the text database. The text search engine includes a parser which extracts information about the date, time, and speaker identity from the text databases and uses this information to enforce some of the user-specified search constraints. The search engine returns pointers to speaker transition points within the text which indicate search matches. The text to audio alignment system then provides pointers into the audio database which correspond to the selected text. The WWW interface also provides both audio playback and a text display so users can interactively skim both the text and the audio in real time over the Internet (real-time audio playback is supported over the Internet multicasting backbone using several popular audio transfer protocols including VAT, Real Audio, and Xing).
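The paper does not specify how the precomputed text-audio coupling is represented; the following sketch illustrates one plausible form, with all names and fields hypothetical. Each speaker transition found in a transcript carries the audio offsets supplied by the alignment system, and a text search hit is mapped to audio by locating its enclosing turn.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeakerTurn:
    """One speaker transition point. All field names are hypothetical;
    the paper does not describe the stored representation."""
    speaker: str                     # e.g. "Mr. SERRANO"
    text_offset: int                 # character offset of the turn in the transcript
    audio_file: str                  # recording covering this turn
    audio_start_s: Optional[float]   # None if the alignment failed for this turn
    audio_end_s: Optional[float]

def audio_for_hit(turns: List[SpeakerTurn], hit_offset: int) -> Optional[SpeakerTurn]:
    """Map a text search hit (a character offset in the transcript) to its
    enclosing speaker turn, whose precomputed audio offsets drive playback.
    Assumes `turns` is sorted by text_offset."""
    i = bisect_right([t.text_offset for t in turns], hit_offset) - 1
    return turns[i] if i >= 0 else None
```

Because the alignment is precomputed off-line (see below), a lookup of this kind is essentially all the web server has to do at query time.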

Text To Audio Alignment

A key component of the audio retrieval system is the text to audio alignment system, which performs an automatic time alignment of the audio and text databases. One method of performing the alignment might be to run a large vocabulary speech recognizer on the audio and align the text output of the recognizer to the text transcript. This approach is difficult because the transcriptions often stray significantly from the verbatim words of the audio. Additionally, the original audio recordings have variable signal-to-noise ratios, which makes speech recognition difficult (speakers talk into an open microphone mounted on a floor stand; the microphone occasionally picks up considerable background noise from other people present in the chamber).

Figure 1: The Congressional Audio Retrieval System (the Congressional Record text database and the Congressional audio database feed the text search engine and the text-to-audio alignment system, which are tied together by the WWW interface).

Rather than attempt to align text and audio using speech recognition, our approach is to use speaker identification. We extract the sequence of speakers from the text transcript. We then use acoustic models of the speakers to locate points in audio where speaker transitions occur. We can then find correspondences between the text and audio at these points of speaker change. In addition to the speaker sequence, we also use the time stamps to further constrain the speaker identification process.

We have implemented the alignment system shown in Figure 2. The text parser extracts the sequence of speakers and time stamps. Although the Congressional Record was not designed to be machine readable, its structured format allowed us to find the names of the speakers with fairly high accuracy. The time stamps are also well marked and can be extracted from the text easily, but were found to be accurate only within a range of about two minutes.
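The parser itself is not described in detail; as a rough illustration (not the actual implementation), a pattern-based extractor along the following lines could recover the speaker sequence and time stamps. It assumes that a speaker turn opens with a title and an all-caps surname, as in the excerpt shown in Figure 5 ("Mr. SERRANO. Mr. Chairman, ..."), and that time stamps appear on lines of their own; both patterns are assumptions about the transcript format.

```python
import re
from typing import List, Tuple

# Both patterns are assumptions for this sketch, not the actual Congressional
# Record markup: a speaker turn opens with a title plus an all-caps surname
# ("Mr. SERRANO."), and a time stamp line carries a four-digit clock time.
TURN_RE = re.compile(r"^\s*(?:Mr|Mrs|Ms)\.\s+([A-Z][A-Z'\- ]+[A-Z])\.", re.MULTILINE)
TIME_RE = re.compile(r"^\s*\{time\}\s*(\d{4})\s*$", re.MULTILINE)

def parse_transcript(text: str) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (speaker_turns, time_stamps), each as (character_offset, value);
    the offsets are what the search engine and the aligner key on."""
    turns = [(m.start(), m.group(1).title()) for m in TURN_RE.finditer(text)]
    stamps = [(m.start(), m.group(1)) for m in TIME_RE.finditer(text)]
    return turns, stamps
```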

We have built Gaussian models for the voice of each of the 435 members of the House of Representatives based on cepstral features. The models are used in conjunction with the speaker sequence and time stamps (extracted from the text transcript) to constrain a Viterbi alignment of the speaker transition points in the text and the audio. For technical details of the alignment process see (Roy & Malamud 1997). The Viterbi alignment results in a coupling of the audio and text at each speaker change boundary.
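Technical details are deferred to (Roy & Malamud 1997); purely as an illustration of the kind of computation involved, the sketch below scores cepstral frames under per-speaker diagonal-covariance Gaussians and uses a dynamic program to place the boundaries of the known speaker sequence so that total log likelihood is maximized. The real system additionally folds in the time-stamp constraints and other details omitted here, so this should be read as a schematic stand-in rather than the published algorithm.

```python
import numpy as np

def gaussian_loglik(frames: np.ndarray, mean: np.ndarray, var: np.ndarray) -> np.ndarray:
    """Per-frame log likelihood under one diagonal-covariance Gaussian.
    frames: (T, D) cepstral features; mean, var: (D,) model parameters."""
    diff = frames - mean
    return -0.5 * (np.sum(diff * diff / var, axis=1) + np.sum(np.log(2.0 * np.pi * var)))

def align_speaker_sequence(frames: np.ndarray, models: list) -> list:
    """Place segment boundaries for a known speaker sequence over the audio.
    `models` is the speaker order taken from the transcript, one (mean, var)
    pair per turn.  Returns the start frame of each speaker's segment.
    A simplified stand-in for the constrained Viterbi alignment in the paper."""
    T, N = frames.shape[0], len(models)
    ll = np.stack([gaussian_loglik(frames, m, v) for m, v in models])  # (N, T)
    cum = np.cumsum(ll, axis=1)          # prefix sums give any segment score in O(1)

    best = np.full((N, T + 1), -np.inf)  # best[i, t]: speakers 0..i cover frames [0, t)
    back = np.zeros((N, T + 1), dtype=int)
    best[0, 1:] = cum[0, :]              # speaker 0 alone covers the prefix
    for i in range(1, N):
        for t in range(i + 1, T + 1):
            for s in range(i, t):        # speaker i covers frames [s, t)
                score = best[i - 1, s] + cum[i, t - 1] - cum[i, s - 1]
                if score > best[i, t]:
                    best[i, t], back[i, t] = score, s

    starts, t = [0] * N, T               # trace back each segment's start frame
    for i in range(N - 1, 0, -1):
        t = back[i, t]
        starts[i] = t
    return starts
```

One natural use of the time stamps (accurate to about two minutes, as noted above) would be to restrict the search over each boundary position to a window around the stamped time, which keeps this kind of search tractable for multi-hour recordings.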

Figure 2: Estimating temporal alignment of text and audio (the Congressional text database feeds a text parser that yields the speaker sequence and time stamps, the Congressional audio database feeds cepstral analysis, and together with the speaker models these drive a Viterbi alignment that produces the text to audio alignment).


The Search And Browse Interface

We have built two WWW interfaces for accessing the audio database over the Internet. The primary interface is a search form with which the user can search for audio segments constrained by several criteria including keywords, name of speaker, political party of speaker, speaker's home state, date range, and time range (a range of times within a day). Figure 3 shows a search page in which the user has requested a search for all speeches by New York Democrats who spoke within the date range 95/01/20 to 95/02/15 and whose speech contained the keyword "budget".
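As a concrete (and purely illustrative) rendering of how such constraints combine, the query from Figure 3 might be encoded as follows; the field names and the attributes of the speech record are hypothetical, not the actual search engine's interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SearchQuery:
    """Hypothetical encoding of the search form fields described above."""
    keyword: Optional[str] = None
    speaker: Optional[str] = None
    party: Optional[str] = None       # e.g. "D" or "R"
    state: Optional[str] = None       # two-letter home state
    date_from: Optional[date] = None
    date_to: Optional[date] = None

def matches(q: SearchQuery, speech) -> bool:
    """True if a parsed speech record (assumed to expose .text, .speaker,
    .party, .state, .date) satisfies every constraint that was specified."""
    return all([
        q.keyword is None or q.keyword.lower() in speech.text.lower(),
        q.speaker is None or q.speaker == speech.speaker,
        q.party is None or q.party == speech.party,
        q.state is None or q.state == speech.state,
        q.date_from is None or speech.date >= q.date_from,
        q.date_to is None or speech.date <= q.date_to,
    ])

# The Figure 3 example: New York Democrats, 95/01/20 to 95/02/15, keyword "budget".
figure3 = SearchQuery(keyword="budget", party="D", state="NY",
                      date_from=date(1995, 1, 20), date_to=date(1995, 2, 15))
```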

Figure 3: WWW search interface to the Congressional proceedings corpus.

The search engine in the web server finds all speeches in the text corpus which meet the search criteria and presents them as shown in Figure 4. This page first lists all transcript documents which contain hits, and then lists each speaker (note that all speakers are New York Democrats, as specified in the search). The user may then follow any link from this page to read the full text and listen to the corresponding audio of each speech. Figure 5 shows a typical web page when a link from one of the speakers is followed. The scrollable window at the top contains the transcript of the speech, and the control buttons at the bottom enable interactive playback and navigation of the audio. Assuming the Viterbi alignment was successful for this speaker, simply hitting the play audio button will play the audio from the beginning of the speech.

Figure 4: Example search result lists all speeches which meet the specified search criteria (the page shows the matching transcript documents, each followed by links to the individual speakers with their speech durations).

Figure 5: Display page for a speech; the top window contains the transcript (a Congressional Record excerpt from 95/01/26, beginning "Mr. SERRANO. Mr. Chairman, this whole issue of the balanced budget amendment..."), and buttons at the bottom enable interactive playback of the corresponding audio.


All audio alignment is precomputed by the text to audio alignment system off-line so the system response time is quick.

In cases where the audio alignment has failed, a secondary browse interface can be brought up by the user which displays a list of temporally close segments before and after the initially selected audio segment. Since errors in the alignment are typically within a few minutes of the actual location (the worst case error is limited to the time stamp interval for that section of the Congressional Record), the user can use the browser to quickly locate the information of interest.
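A minimal sketch of such a fallback browse list, assuming the aligned segments for a session are kept sorted by time; the window size and names are illustrative:

```python
from typing import List, Sequence

def nearby_segments(segments: Sequence, index: int, window: int = 3) -> List:
    """Return up to `window` segments on either side of the initially selected
    segment (assumed sorted by start time), so the user can scan backward and
    forward when the precomputed alignment is off by a few minutes."""
    lo = max(0, index - window)
    hi = min(len(segments), index + window + 1)
    return list(segments[lo:hi])
```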

Conclusions And Future Work

We have presented a system for aligning the audio and text of the proceedings of the U.S. House of Representatives. The alignment effectively couples the text and audio databases in a manner which enables efficient search and browsing of hundreds of hours of audio. The resulting audio retrieval system has been deployed experimentally on the Internet since October 1995.

We plan to add other methods of audio analysis, including keyword spotting and prosody analysis, to extract further structure from audio recordings. This additional structure can be used to improve navigation within segments of audio located by the speaker identification system. We also plan to use related methods for extracting low-level structure from video. For example, jump cuts are easily found by automatic systems, and the presence of people can be detected relatively simply and robustly by looking for flesh tones.

The first author is currently working on two new applications using the structured media retrieval methods described in this paper and in (Roy 1995). The first is to make the proceedings of the European Union Summit available on the Internet. The second application is in wearable computing: we are building a wearable computer which continuously records the audio environment of the user. The resulting audio archive will be structured using various methods and made accessible to the user through the wearable, providing a form of augmented memory.

References

Wilpon, J.G., Rabiner, L.R., Lee, C., and Goldman, E.R. 1990. Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models. IEEE Trans. on Acoustics, Speech, and Signal Processing 38(11), pp. 1870-1878.

Rose, R.C., Chang, E.I., and Lippman, R.P. 1991. Techniques for Information Retrieval from Voice Messages. In Proc. ICASSP, pp. 317-320, IEEE, Toronto.

Wilcox, L. and Bush, M. 1991. HMM-based Wordspotting for Voice Editing and Indexing. In Eurospeech '91, pp. 25-28.

Glavitsch, U. and Schauble, P. 1992. A System for Retrieving Speech Documents. In 15th Annual International SIGIR '92, ACM, New York, pp. 168-176.

Jones, G.J.F., Foote, J.T., Sparck Jones, K., and Young, S.J. 1996. Robust Talker-Independent Audio Document Retrieval. In Proc. ICASSP, pp. 311-314, IEEE, Atlanta.

James, D.A. 1996. A System for Unrestricted Topic Retrieval from Radio News Broadcasts. In Proc. ICASSP, IEEE, Atlanta.

Chen, F. and Withgott, W. 1992. The Use of Emphasis to Automatically Summarize a Spoken Discourse. In Proc. ICASSP, pp. 229-232, IEEE, San Francisco.

Arons, B. 1994. Speech Skimmer: Interactively Skimming Recorded Speech. Ph.D. thesis, MIT Media Laboratory.

Roy, D. 1995. NewsComm: A Hand-Held Device for Interactive Access to Structured Audio. Master's thesis, MIT Media Laboratory.

Horner, C. 1991. NewsTime: A Graphical User Interface to Audio News. Master's thesis, MIT Media Laboratory.

Roy, D. and Malamud, C. 1997. Speaker Identification based Text to Audio Alignment for an Audio Retrieval System. To appear in Proc. ICASSP, IEEE, Munich.
