MAESTRO: CONDUCTOR OF MULTIMEDIA ANALYSIS TECHNOLOGIES

Ze’ev Rivlin, Douglas Appelt, Robert Bolles, Adam Cheyer, Dilek Hakkani-Tür, David Israel, Luc Julia, David Martin, Greg Myers, Ken Nitz, Bikash Sabata, Ananth Sankar,

Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gökhan Tür

SRI International
Menlo Park, California 94025

http://www.chic.sri.com/projects/Maestro.html
Contact: [email protected]

Although keyword-based queries are now a familiar part of any user’s experience with the World Wide Web, they are of limited direct applicability to the vast and growing quantity of multimedia information becoming available in materials such as broadcast news, video teleconferences, reconnaissance data, and audio-visual recordings of corporate meetings and classroom lectures. Content-based indexing, archiving, and retrieval would facilitate access to large databases of such materials.

For example, in the broadcast news domain, content-based archiving is particularly useful. Archiving can be done by exploiting the speech contained in the audio track, the images contained in the video track, and the text in video overlays. One application is to filter down huge volumes of raw news footage to create the nicely packaged news broadcasts that we watch on television. Another use is to create a news-on-demand system for viewing news more efficiently. We can create a database of news broadcasts annotated for later retrieval of news clips of interest. The query “Tell me about the recent elections in Bosnia” would bring up news clips related to the elections.

MAESTRO (Multimedia Annotation and Enhancement via a Synergy of Technologies and Reviewing Operators) is a research and demonstration system developed at SRI International for exploring the contribution of a variety of analysis technologies — for example, speech recognition, image understanding, and optical character recognition — to the indexing and retrieval of multimedia. Informedia [1] and Broadcast News Navigator [2] are similar projects that use these technologies for archiving and retrieval. The main goal of the MAESTRO project is to discover, implement, and evaluate various combinations of these technologies to achieve analysis performance that surpasses the sum of the parts. For example, British Prime Minister Tony Blair can be identified in the news by his voice, his appearance, captions, and other cues. A combination of these cues should provide more reliable identification of the Prime Minister than any of the cues on their own.

MAESTRO is a highly multidisciplinary effort, involving contributions from three laboratories across two divisions at SRI. Each of these SRI technologies is described in more detail below. The integrating architecture makes it easy to combine them in different ways, and to incorporate new analysis technologies developed by our team or by others.

MULTIMEDIA ANALYSIS TECHNOLOGIES ON THE MAESTRO SCORE

Each line of the MAESTRO Score (see Figure 1) shows the output of a particular multimedia analysis technology, much as each line of a conductor’s score carries the notation for a particular instrument. All of these outputs share a common timeline. For example, the speaker tracking line (second from top) shows the audio segmented into speakers, numbered 1, 2, 3, and so on. MAESTRO accepts an input format that consists of pairs of information, each describing an event and the time interval in which it occurred. For example, for speaker tracking, an information pair could be speaker 4 speaking from time 1:12 to time 1:35 in the given news story. This appears on the MAESTRO Score as the marked segment on the speaker tracking line with the label “4” below it. MAESTRO provides the visual interface (the MAESTRO Score) for discovering synergistic strategies of combining analysis technologies for the best possible archiving and retrieval given all available data and relevant technologies.
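The event/interval input format might be modeled as follows. This is a minimal sketch in Python; the record and field names (ScoreEvent, line, label, start, end) are illustrative, not MAESTRO’s actual format:

```python
from dataclasses import dataclass

@dataclass
class ScoreEvent:
    """One annotation on a MAESTRO Score line: an event label plus
    the time interval (in seconds) in which it occurred."""
    line: str    # analysis technology line, e.g. "speaker_tracking"
    label: str   # event label, e.g. speaker number "4"
    start: float # interval start, seconds into the story
    end: float   # interval end

def events_on_line(events, line):
    """Return the events for one Score line, ordered by start time."""
    return sorted((e for e in events if e.line == line),
                  key=lambda e: e.start)

# Speaker 4 speaking from 1:12 to 1:35, as in the example above.
score = [
    ScoreEvent("speaker_tracking", "4", 72.0, 95.0),
    ScoreEvent("speaker_tracking", "1", 0.0, 72.0),
]
```

Laying such records out left to right by start time, one row per line value, reproduces the Score display.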


Up to now, we have incorporated into MAESTRO the SRI multimedia analysis technologies denoted to the left of the MAESTRO Score in Figure 1. In the MAESTRO video display (the inset frame in Figure 1), we see British Prime Minister Tony Blair speaking, with his name in a video overlay caption. We will refer to the example of identifying the Prime Minister in the brief descriptions of each of the SRI analysis technologies below. For more detailed descriptions, we include references to related technical publications and relevant Web pages.

Speech Recognition

Speech recognition technology (for more information, see http://www.speech.sri.com) is used to automatically recognize the words that people speak in the news, including speech from news anchors, field reporters, world leaders, and passersby interviewed on the street. SRI's large-vocabulary speech recognition system, known as DECIPHER™, automatically transcribes the speech, creating corresponding text. DECIPHER™ can recognize natural, continuous speech without requiring the user to train the system in advance (a speaker-independent system). DECIPHER™ is distinguished by its efficient modeling approach (which results in very small models), its robustness to noise and channel distortion, and its multilingual capabilities. These features make the system fast and accurate in recognizing spontaneous speech across many styles, dialects, languages, and noise conditions. The DECIPHER™ recognizer was trained in the broadcast news domain [3], and its ability to deal with a wide variety of noise conditions (e.g., clean broadcast studio speech, noisy crowd background) and spontaneous speech (e.g., passersby interviewed on the street) was essential in this domain.

Speech recognition contributes to our example task of identifying Tony Blair by recognizing the

[Figure 1: The MAESTRO Score. At left, SRI's multimedia analysis technologies: DECIPHER™ speech recognition; speaker tracking; scene segmentation/classification; camera flash detection; OCR for overlay captions; OCR for text in video images; OCR for identifying persons; named entity recognition using FASTUS©/TextPro; sentence boundary detection; disfluency detection; topic tracking; voice query using the Nuance Speech Recognition System™. Also labeled: the MAESTRO Score, a sorted list of relevant clips, and a video overlay caption.]


news anchor announcing, “British Prime Minister Blair said...”.

Named Entity Detection

The named-entity detection system identifies names of people, companies, organizations, and locations in text. In broadcast news, this text can come from the output of the speech recognizer or from video overlay captions. SRI’s FASTUS© system [4] uses an English lexicon, extensive lists of persons, companies, and locations, as well as contextual information to identify and classify proper names. MAESTRO uses TextPro (a version of which is publicly available on the Web at http://www.ai.sri.com/~appelt/TextPro), which is derived from FASTUS© but implemented in C++, and is compliant with the architecture developed under the government-sponsored TIPSTER program. The performance of TextPro has been evaluated on several hours of broadcast news transcripts; it identifies approximately 91% of names in an uppercase-only input stream with no word recognition errors [5].

TextPro identifies “British Prime Minister Blair” (see Figure 1, center of the fourth line from the bottom of the MAESTRO Score) as the name of a person in the text output from the speech recognizer.
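A very small illustration of contextual name tagging, in the spirit of (but far simpler than) FASTUS©/TextPro: a title expression followed by a capitalized word is tagged as a person name. The title list and pattern here are invented for this sketch:

```python
import re

# Hypothetical title inventory; the real systems use a full lexicon,
# name lists, and cascaded finite-state rules [4].
TITLES = r"(?:President|Prime Minister|Senator|Dr\.)"
PERSON = re.compile(r"\b((?:British )?" + TITLES + r" [A-Z][a-z]+)")

def find_person_names(text):
    """Return title+surname spans tagged as person names."""
    return PERSON.findall(text)
```

A single contextual rule like this already captures the running example, though real text requires many such rules plus lexical lookup.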

Disfluency Detection

Spontaneous speech frequently contains disfluencies (hesitations, self-corrections, restarts) that we want to detect and eliminate before further processing. This is accomplished by SRI's hidden event modeling techniques, which rely on a combination of lexical and prosodic cues to find the events of interest (for more information on this topic, see http://www.speech.sri.com/projects/hidden-events.html). These events are hidden in the sense that the speech recognizer typically just outputs text hypothesizing spoken words, but does not mark events such as the disfluencies mentioned above.

In certain cases, named entities have embedded disfluencies. For example, if the news anchor announced, “British Prime Minister uh Blair said...”, then the interruption by a disfluency makes it difficult to recognize the named entity. The disfluency detector cleans up the text, resulting in “British Prime Minister Blair said...”.
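The cleanup effect can be illustrated with a toy filler-word filter. Note that SRI's hidden event models combine lexical and prosodic cues rather than matching a fixed word list; the filler inventory below is assumed purely for illustration:

```python
# Hypothetical filler inventory -- a stand-in for hidden event detection.
FILLERS = {"uh", "um", "er"}

def remove_fillers(text):
    """Drop isolated filler words, then rejoin with single spaces."""
    words = [w for w in text.split() if w.lower() not in FILLERS]
    return " ".join(words)
```

After such cleanup, the downstream named-entity detector sees the uninterrupted name span.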

Optical Character Recognition (OCR)

Text appearing in video imagery includes computer-generated captions overlaid on the imagery, and text that is part of the scene itself, such as signs and placards. Locating and recognizing text in video imagery is more difficult than in many other OCR applications (e.g., reading printed matter) because of small character sizes and nonuniform backgrounds. SRI has developed an approach involving substantial preprocessing (binarizing individual color video frames, locating lines of text in each binarized frame, and binarizing the detected text regions again from the original color image data) before applying the OCR engine. The accuracy of the recognition result can be improved substantially by additional postprocessing using a lexicon of named entities automatically found by the named entity detector in text output from the speech recognizer. For example, Figure 2 shows part of the OCR recognition results for a video overlay caption. Without the help of the named entity lexicon, because of the background of the Prime Minister's hand behind the overlay caption, OCR recognizes “BLAKJB”. However, with the help of the lexicon from the named entity detector, OCR correctly recognizes “BLAIR”. The detection of instances of named entities in the video imagery, which often correspond to a person, place, or organization depicted in the video scene, is noted as a semantic event on the MAESTRO Score.
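The lexicon-driven postprocessing idea can be sketched as a closest-match lookup against the named-entity lexicon. The similarity cutoff here is an illustrative assumption, not a value from SRI's system:

```python
import difflib

def correct_with_lexicon(ocr_word, lexicon, cutoff=0.5):
    """Replace an OCR hypothesis with the closest lexicon entry,
    if any entry is similar enough; otherwise keep the hypothesis.
    The cutoff (0..1) is a tunable assumption."""
    matches = difflib.get_close_matches(ocr_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word

# Hypothetical lexicon from the named entity detector.
names = ["BLAIR", "CLINTON", "YELTSIN"]
```

With this sketch, the garbled hypothesis “BLAKJB” snaps to the lexicon entry “BLAIR”, mirroring the example in Figure 2.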

Speaker Identification and Tracking

Speaker identification and tracking technology was developed at SRI for various domains, including broadcast news. Identifying and tracking speakers is

[Figure 2: Name Recognition Helps OCR. Without lexicon: TONY BLAKJB. With lexicon: TONY BLAIR.]


of fundamental importance in archiving broadcast news, since it allows the user to search the data stream by speaker identity. In SRI's approach to speaker tracking [6], portions of the audio stream marked as speech are processed to fit a speaker change model, generated iteratively by grouping speech into unique speaker bins and training models for each hypothesized speaker. Fitting the resulting speaker change model to the data determines the speaker turns as well as the acoustic models of the speakers. In addition, an existing set of models for speakers appearing frequently, such as news anchors and world leaders, can be trained with available data and subsequently used to verify these speakers' presence in the audio stream and assign labels accordingly.

For our identification example, we can build a speech model for Tony Blair using various news clips in which he speaks, and subsequently use this model to find new clips of the Prime Minister speaking.
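As a toy illustration of model-based speaker identification, one might score a segment's features under each pre-trained speaker model and pick the best. A single scalar feature per frame and univariate Gaussian models stand in here for the real acoustic models of [6]; all names and numbers are invented:

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of x under a univariate Gaussian."""
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))

def identify_speaker(features, models):
    """Pick the speaker whose model gives the segment's features the
    highest average log-likelihood.  `models` maps a speaker label to
    (mean, std) of one scalar acoustic feature -- a toy stand-in for
    the full acoustic models used in the real tracker."""
    def score(params):
        mean, std = params
        return sum(gaussian_logpdf(f, mean, std) for f in features) / len(features)
    return max(models, key=lambda spk: score(models[spk]))

# Hypothetical models trained from earlier clips.
models = {"blair": (120.0, 10.0), "anchor": (200.0, 10.0)}
```

A new segment whose features sit near one model's mean is labeled with that speaker, just as the tracker assigns Tony Blair's label to clips matching his model.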

Scene Segmentation and Classification

The first step in understanding the imagery content in broadcast news video is the segmentation of the video into contiguous camera shots. This is achieved by detecting scene changes in the video. We use the basic properties of the color distributions of individual video frames to detect the changes in the scene. Instead of past approaches, in which the color distributions of two adjacent frames are compared, we use a novel, robust multiscale approach to compute color discontinuity across scene change boundaries [7]. The procedure detects scene changes that occur because of cuts, fades, or dissolves. Our method is robust to the various noise phenomena in video and is not sensitive to the thresholds required by other methods.
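A single-scale sketch of color-distribution-based cut detection is shown below; the multiscale, threshold-robust method of [7] is considerably more sophisticated, and the bin count and distance measure here are illustrative choices:

```python
def histogram(frame, bins=8, lo=0, hi=256):
    """Coarse intensity histogram of a frame (a list of pixel values),
    normalized to sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for p in frame:
        counts[min(int((p - lo) / width), bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]

def cut_score(frame_a, frame_b):
    """L1 distance between frame histograms; large values suggest a
    scene change.  A single-scale stand-in for the multiscale color
    comparison of [7]."""
    ha, hb = histogram(frame_a), histogram(frame_b)
    return sum(abs(a - b) for a, b in zip(ha, hb))
```

Two frames drawn from the same shot score near zero; a hard cut between visually distinct shots scores near the maximum of 2.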

The classification of the scenes can potentially provide important inputs to higher-level interpretations, such as distinguishing the anchor-person camera shots from the other news item shots, or shots with text and graphics from other general scenes. Another observation is that similar shots are used in similar or related stories over different news broadcasts. For example, we encountered cases in which Tony Blair appears in a few places in a particular news story, such as when the news anchor refers several times to comments made by the Prime Minister. Each time the Prime Minister is shown, the scene is similar: he is speaking from a white podium wearing a black suit, and the background is gray. The scene classifier applies a hierarchical agglomerative clustering technique to the features of the color distribution in each frame to classify these segments of Tony Blair as being similar, and assigns them to the same class. Once we have located Tony Blair in one segment of the news story, the scene classifier provides a cue that will be useful for identifying the Prime Minister in other video segments assigned to the same class.

Camera Flash Detection

Camera flashes are an important cue for understanding the content of scenes in broadcast news video, as they typically indicate the presence of important persons as the press attempts to take good pictures. Our algorithm to detect camera flashes is a direct by-product of the above procedure for detecting scene changes. Camera flashes are a cue that a person worth identifying, such as the Prime Minister, may be in the scene.

Topic Segmentation and Tracking

SRI has developed technology that enables MAESTRO to automatically divide speech into topic units and display automatically determined keywords characterizing each segment. To find the topic boundaries, the system combines prosodic information extracted from the speech waveforms (such as pause and pitch patterns) with word usage statistics. Both sources of information are integrated in a hidden Markov model to find the best segmentation [8].

For example, if we wanted to find the Prime Minister specifically in a story related to British economics, we could search for occurrences of Tony Blair in stories tagged with the keywords “British” and “economics”.

SYNERGISTIC COMBINATIONS FOR IMPROVED ANALYSIS

The main objective of MAESTRO is to combine analysis technologies for the best possible performance in analysis, archiving, and retrieving. MAESTRO's multimedia interface allows the user to jump from one analysis technology line to another (Figure 1), to play video sequences and display output corresponding to the events on that line, and to evaluate the performance of each of these technologies. The display of analysis lines on the MAESTRO Score guides the viewer in the mental combination process between the lines, that is, between the technologies. It even inspires some synergistic combination strategies and rules that would not have been obvious without this representation. This particular interface is effectively a “look under the hood” and has research and pedagogical use. A commercial news-on-demand system may present to the user only the list of videos that match a query, without showing the underlying process that led to the retrieval result. For more information about intelligent human-computer interfaces developed at SRI, see http://www.chic.sri.com.

Through MAESTRO, we have learned how analysis technologies can be combined for extracting information of interest. For example, we can identify British Prime Minister Tony Blair by using the following cues:

• DECIPHER™ speech recognition recognizes the news anchor saying, “British Prime Minister Blair said...”.

• Named entity detection using TextPro identifies “British Prime Minister Blair” as the name of a person.

• Disfluency detection can clean up a disfluent named entity “British Prime Minister uh Blair said...”, yielding the recognizable name “British Prime Minister Blair.”

• Optical character recognition, driven by a lexicon of names (including “BLAIR”) from named entity detection, reads the video overlay caption, recognizing “TONY BLAIR.”

• Speaker identification and tracking identifies the Prime Minister's speech as ‘speaker 4’ (see the Speaker Tracking line in Figure 1).

Conclusion: The person in the video and speaker 4 are identified as “TONY BLAIR.”

• Scene segmentation and classification groups the video segment of the Prime Minister speaking with a later segment of the same news story, based on scene similarity. We may hypothesize that the later segment again shows Tony Blair.

• Camera flash detection identifies camera flashes later in the news story, a cue for the presence of a famous person, possibly the Prime Minister again.

• Topic segmentation/tracking can be used to find the Prime Minister in news stories about a particular subject — for example, stories related to British economics.

We are currently exploring various ways of automatically combining the above cues to locate the Prime Minister in the news:

• Heuristic (rule-based) combinations: If certain cues are evident at certain times, we simply conclude the presence of the Prime Minister.

• Statistical combinations: Weighting the cues based on confidence measures — e.g., we may be 70% sure it is Tony Blair's voice and 45% sure this scene matches another scene where he appeared. We can develop an optimal weighting scheme to make the final decision.

• Combinations of heuristic and statistical approaches.
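The heuristic and statistical options above might be sketched as follows. The cue names, weights, and thresholds here are illustrative assumptions, not values used in MAESTRO:

```python
# Hypothetical cue confidences for "Tony Blair is present", in [0, 1].
cues = {"voice": 0.70, "scene_match": 0.45, "caption_ocr": 0.90}

def rule_based(cues):
    """Heuristic: conclude presence if any single cue is near-certain,
    or if at least two cues each give moderate support."""
    strong = sum(1 for c in cues.values() if c >= 0.85)
    moderate = sum(1 for c in cues.values() if c >= 0.60)
    return strong >= 1 or moderate >= 2

def weighted(cues, weights, threshold=0.5):
    """Statistical-style combination: a weighted average of cue
    confidences compared against a decision threshold.  The weights
    here are illustrative, not learned."""
    total = sum(weights[k] * cues[k] for k in cues)
    return total / sum(weights[k] for k in cues) >= threshold
```

A combined scheme could, for instance, apply the rule first and fall back to the weighted score when no rule fires.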

ARCHIVING AND RETRIEVING IN MAESTRO

Each news story is archived via a statistical language model applied to the text corresponding to the speech recognized from its audio track. Archiving and retrieval are based on content words, that is, words that have high salience for the task. The content words are automatically determined by computing a salience measure for each word using an algorithm based on word occurrence statistics [9]. Salient keywords are those that are useful for discriminating between news stories. For example, the word ‘the’ may appear in all stories and is not useful for retrieval. However, the word ‘Blair’ is useful for discriminating between stories about the Prime Minister and those not about him. Inspired by magnetic poetry on a refrigerator door, MAESTRO's “Fridge” (Figure 3) shows the automatically determined keywords for use in archive voice querying, with larger words being more salient.
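The salience idea can be sketched with a tf-idf-style score, a simple stand-in for the measure of [9], not the actual algorithm:

```python
import math
from collections import Counter

def salience(stories):
    """Score each word by its frequency across stories, discounted by
    how many stories contain it (a tf-idf-style stand-in for the
    salience measure of [9]).  A word like 'the' that occurs in every
    story scores zero; a discriminative word like 'blair' scores high."""
    n = len(stories)
    doc_freq = Counter()
    term_freq = Counter()
    for story in stories:
        words = story.lower().split()
        term_freq.update(words)
        doc_freq.update(set(words))
    return {w: term_freq[w] * math.log(n / doc_freq[w]) for w in term_freq}
```

The highest-scoring words under such a measure are natural candidates for the Fridge display.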

Stories can be retrieved by voice queries using the Nuance Speech Recognition System™, developed by Nuance Communications based on SRI's DECIPHER™ technology. The query “Tell me about Blair and economics” brings up a list of news clips ranked by a numerical score of relevance to the query. The highest-scoring news clips will be the news stories most relevant to economics and the British Prime Minister.
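Retrieval ranking can be illustrated with a simple content-word overlap score; MAESTRO's actual scoring applies a statistical language model, so this is only a sketch with invented names:

```python
def rank_stories(query, stories):
    """Rank stories by how many query words they contain -- a much
    simpler stand-in for MAESTRO's language-model-based scoring."""
    qwords = set(query.lower().split())
    scored = [(sum(1 for w in qwords if w in s.lower().split()), s)
              for s in stories]
    return [s for score, s in sorted(scored, key=lambda t: -t[0])]
```

For the query above, stories mentioning both “Blair” and “economics” rank ahead of stories mentioning only one.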

MAESTRO is implemented using the Open Agent Architecture™ (OAA), a general-purpose framework for constructing systems composed of multiple software components written in different programming languages and distributed across multiple platforms. Similar in spirit to distributed object infrastructures such as OMG's CORBA or Microsoft's DCOM, OAA's delegation-based approach provides support for describing more flexible and adaptable interactions than the tightly bound method calls used by these architectures [10].

In the current MAESTRO implementation, OAA's role is to integrate all on-line components, which include the MAESTRO User Interface (Java), speech recognition and telephony agents (C), and retrieval scoring (Awk). For example, the user can query MAESTRO from Washington, D.C. over the telephone via the speech recognition agent, which may reside remotely on a computer at SRI in Menlo Park, California. At the same time, the MAESTRO User Interface (Figure 1) can reside in Washington, D.C. on a laptop computer for local display. In the future, it is expected that OAA's capabilities will help provide automated synergy of multimedia analysis technologies, inferencing across multiple on-line processing streams.

CONCLUSIONS AND FUTURE DIRECTIONS

MAESTRO is a research and demonstration testbed providing an integrating architecture and a visual interface (the MAESTRO Score). Through MAESTRO, the researcher or technical user can discover, develop, and test synergistic strategies of combining multimedia analysis technologies to achieve archiving and retrieving capabilities far exceeding those possible with any of the technologies individually. We have implemented a number of these strategies, for example, using a lexicon of names from the named entity detector to enable correct recognition of video overlay captions by OCR. We have learned a great deal about information cues from studying the MAESTRO Score, for example, cues for identifying newsmakers. And, most important, MAESTRO has motivated new ideas for combination strategies.

We are presently focusing on implementing more synergistic multimedia-analysis-technology combination strategies. We are exploring various methods of combining cues that help answer questions like “Is Tony Blair in the news?” We are also interested in porting to new domains, such as meeting archival. Finally, we continue to improve the individual core multimedia analysis technologies.

ACKNOWLEDGEMENTS

We gratefully acknowledge the efforts of the following people in developing MAESTRO: John Bear, Marty Fischler, Marsha Jo Hannah, Jerry Hobbs, Ray Perrault, Patti Price, Eric Rickard, and David Scott.

Support for this work from DARPA contract number N66001-97-C-8544, from DARPA through the Naval Command and Control Ocean Surveillance Center under contract number N66001-94-C-6048, through NSF grant IRI-9619921, and from other Government agencies is gratefully acknowledged. The Government has certain rights in this material. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Government funding agencies.

REFERENCES

[1] Hauptmann, A., and M. Witbrock, “Informedia: News-on-demand multimedia information acquisition and retrieval,” in Intelligent Multimedia Information Retrieval, MIT Press, 1997, pp. 215–240.

[2] Merlino, A., D. Morey, and M. Maybury, “Broadcast News Navigation Using Story Segmentation,” in Proceedings of the Fifth ACM International Multimedia Conference, 1997, pp. 381–392.

[3] Sankar, A., R. R. Gadde, and F. Weng, “SRI's Broadcast News System — Toward Faster, Smaller, and Better Speech Recognition,” in Proceedings of the DARPA Broadcast News Workshop, 1999, pp. 281–286. http://www.speech.sri.com/papers/darpa99-fbs.ps.gz

[4] Hobbs, J., D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson, “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text,” in Roche and Schabes (eds.), Finite State Devices for Natural Language Processing, MIT Press, 1996, pp. 383–406.

[5] Appelt, D., and D. Martin, “Named Entity Extraction from Speech: Approach and Results Using the TextPro System,” in Proceedings of the 1999 DARPA Broadcast News Workshop, 1999, pp. 51–54. Also available at http://www.ai.sri.com/~appelt/SRIIENE.HTM

[Figure 3: The MAESTRO Fridge]

[6] Sonmez, K., L. Heck, and M. Weintraub, “Speaker Tracking and Detection with Multiple Speakers,” in Proceedings of EUROSPEECH 99, 1999, pp. 2219–2222. Also available at http://www.speech.sri.com/papers/eurospeech99-tracking.ps.gz

[7] Sabata, B., and M. Goldszmidt, “Fusion of Multiple Cues for Video Segmentation,” in Proceedings of the Second International Conference on Information Fusion, 1999.

[8] Stolcke, A., E. Shriberg, D. Hakkani-Tür, G. Tür, Z. Rivlin, and K. Sonmez, “Combining Words and Speech Prosody for Automatic Topic Segmentation,” in Proceedings of the 1999 DARPA Broadcast News Workshop, 1999, pp. 61–64. http://www.speech.sri.com/papers/darpa99-topicseg.ps.gz

[9] Gorin, A., S. Levinson, and A. Sankar, “An Experiment in Spoken Language Acquisition,” IEEE Transactions on Speech and Audio Processing, Volume 2, Number 1, 1994, pp. 224–240.

[10] Martin, D., A. Cheyer, and D. Moran, “The Open Agent Architecture: A Framework for Building Distributed Software Systems,” Applied Artificial Intelligence: An International Journal, Volume 13, Number 1–2, January–March 1999. Also see http://www.ai.sri.com/~oaa/

For more information, see the MAESTRO web page at http://www.chic.sri.com/projects/Maestro.html