

Multimodal Processing and Interaction:

Audio, Video, Text

Edited by

Petros Maragos, Alexandros Potamianos and Patrick Gros


Your dedication goes here


Preface

In multimedia analysis, most tools are devoted to a single modality, the other modalities being treated as illustrations or complementary components. For example, web search engines and image retrieval systems barely mix textual and visual descriptions, and video processing is usually done separately on sound and images. One main reason for this is that the different media concern different, and sometimes very separate, scientific fields. However, even without learning, the performance of multimedia analysis and understanding systems (especially in terms of robustness) can be greatly enhanced by combining different modalities through interaction or integration. Thus one of the goals of this book is to present algorithms and systems that process several different media present in the same multimedia document and exploit their interaction. This requires a strong synergy between various scientific fields and many research methodologies. Examples of modalities to integrate include all possible combinations of: (i) vision and speech/audio; (ii) vision (or speech) and tactile; (iii) image/video (or speech/audio) and text; (iv) multiple-cue versions of vision and/or speech; (v) other semantic information or meta-data.

Multimedia information retrieval via an interactive human-computer interface is a complex task that requires feedback from the user and a complex negotiation between the user and the machine. The grand challenges are to research, design and build natural and efficient human-computer interfaces for performing multimedia information retrieval tasks that allow for negotiation (dialogue) between the user and the system. Three broad thematic areas in research on human-computer interfaces are multimodality, adaptivity and mobility.

This book grew out of a four-year collaboration among research groups participating in the European Network of Excellence on Multimedia Understanding through Semantics, Computation and Learning, abbreviated as "MUSCLE". It focuses on thematic areas that are scientifically and technologically important for multimodal processing and interaction. Specifically, it addresses state-of-the-art research and new directions on the theory and applications of multimedia analysis approaches that improve robustness and performance through cross-modal and multimodal integration. It also focuses on interaction with multimedia content, with special emphasis on multimodal interfaces for accessing multimedia information.


This book contains contributions by leading experts in the ubiquitous scientific and technological field of multimedia. Its purpose is to present high-quality, state-of-the-art research ideas and results from theoretical, algorithmic and application viewpoints. A broad spectrum of novel perspectives, analytic tools, algorithms, design practices and applications is presented. Emphasis is given to the multimodal information processing aspects of multimedia and the cross-integration of multiple modalities.

Petros Maragos
Alexandros Potamianos
Patrick Gros

Month XX 2007.


Contents

Multimodal Processing and Interaction: Audio, Video, Text

Edited by Petros Maragos, Alexandros Potamianos and Patrick Gros . . . V

Part I Review of the State-of-the-Art

Part II New Research Directions: INTEGRATED MULTIMEDIA ANALYSIS AND RECOGNITION

Part III New Research Directions: SEARCHING MULTIMEDIA CONTENT

1 Toward the integration of NLP and ASR techniques: POS tagging, topic adaptation and transcription
Stephane Huet, Gwenole Lecorve, Guillaume Gravier, Pascale Sebillot . . . 7

Part IV New Research Directions: INTERFACES TO MULTIMEDIA CONTENT

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


List of Contributors

Nozha Boujemaa
INRIA Rocquencourt, BP 105 Rocquencourt, 78153 Le Chesnay Cedex, France
[email protected]

A. Enis Cetin
Bilkent University, Department of Electrical and Electronics Engineering, 06800 Ankara, Turkey
[email protected]

Michel Crucianu
BP 432 Rocquencourt, 292 rue St Martin, 75141 Paris Cedex, France
[email protected]

Rozenn Dahyot
Trinity College Dublin
[email protected]

Yigithan Dedeoglu
Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey
[email protected]

Manolis Delakis
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]

Giorgos Evangelopoulos
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]

Marin Ferecatu
INRIA Rocquencourt, BP 105 Rocquencourt, 78153 Le Chesnay Cedex, France
[email protected]

Guillaume Gravier
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]

Patrick Gros
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]


Ugur Gudukbay
Bilkent University, Department of Computer Engineering, 06800 Ankara, Turkey
[email protected]

Naomi Harte
Trinity College Dublin

Petri Honkamaa
VTT Technical Research Center of Finland

Stephane Huet
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]

Athanassios Katsamanis
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]

Anil Kokaram
Trinity College Dublin

Constantine Kotropoulos
Aristotle Univ. of Thessaloniki, Department of Informatics, Box 451, 541 24 Thessaloniki, Greece
[email protected]

Gwenole Lecorve
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

Daire Lennon
Trinity College Dublin

Petros Maragos
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]

Robert Neumayer
Institute of Software Technology and Interactive Systems
[email protected]

Nikos Nikolaidis
Aristotle Univ. of Thessaloniki, Department of Informatics, Box 451, 541 24 Thessaloniki, Greece
[email protected]

O.K. Oyekoya
University College London, Adastral Park Campus, Martlesham Heath, Ipswich IP5 3RE, UK
[email protected]

George Papandreou
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]

Manolis Perakakis
Technical University of Crete, Dept. of ECE, Chania, Greece
[email protected]

Vassilis Pitsikalis
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]


Ioannis Pitas
Aristotle Univ. of Thessaloniki, Department of Informatics, Box 451, 541 24 Thessaloniki, Greece
[email protected]

Alexandros Potamianos
Technical University of Crete, Dept. of ECE, Chania, Greece
[email protected]

Kostas Rapantzikos
National Technical University of Athens, School of Electrical & Computer Engineering, Athens 15773, Greece
[email protected]

Andreas Rauber
Institute of Software Technology and Interactive Systems
[email protected]

Pascale Sebillot
IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]

Spyridon Siatras
Aristotle Univ. of Thessaloniki, Department of Informatics, Box 451, 541 24 Thessaloniki, Greece
[email protected]

Sanni Siltanen
VTT Technical Research Center of Finland
[email protected]

F.W.M. Stentiford
University College London, Adastral Park Campus, Martlesham Heath, Ipswich IP5 3RE, UK
[email protected]

B. Ugur Toreyin
Bilkent University, Department of Electrical and Electronics Engineering, 06800 Ankara, Turkey
[email protected]

Seppo Valli
VTT Technical Research Center of Finland


Part I

Review of the State-of-the-Art


Part II

New Research Directions: INTEGRATED MULTIMEDIA ANALYSIS AND RECOGNITION


Part III

New Research Directions: SEARCHING MULTIMEDIA CONTENT


1

Toward the integration of NLP and ASR techniques: POS tagging, topic adaptation and transcription

Stephane Huet1, Gwenole Lecorve3, Guillaume Gravier2, and Pascale Sebillot3

1 Irisa, Universite de Rennes 1, Campus de Beaulieu, Rennes, France, [email protected]
2 Irisa, CNRS, Campus de Beaulieu, Rennes, France, [email protected]
3 Irisa, INSA, Campus de Beaulieu, Rennes, France, [email protected], [email protected]

In the framework of multimedia analysis and interaction, speech and language processing plays a major role, as speech and text carry much of the semantic information. This chapter focuses on recent research work toward a better integration between automatic speech recognition (ASR) and natural language processing (NLP) for the analysis of spoken documents.

1.1 Introduction

The ASR and NLP communities have had a long history of misunderstanding, mostly due to two different approaches to natural language: a purely statistical one vs. a more symbolic, rule-based one. But the last 15 years have seen a renewed interest in the joint use of ASR and NLP, for several obvious reasons. First, introducing new linguistic knowledge sources into ASR (e.g. morphological, syntactic, etc.) is a promising way to produce better transcriptions [12]. Moreover, several typical tasks require NLP techniques to be applied to transcriptions [2]: spoken document indexing, topic tracking, named-entity detection, summarization, etc.

If the joint use of ASR and NLP is now a clear goal, the cooperation is not that simple. Oral output has characteristics, such as repetitions, revisions, or fillers, that make it far from straightforward. Additional difficulties come from the fact that automatic transcriptions are segmented into breath groups rather than into sentences, the usual object of study for NLP; they also lack punctuation and, in the case of some ASR systems such as ours, capitalization. Moreover, ASR produces transcriptions that contain errors, whereas NLP is usually applied to correct texts.


Different classical approaches have been proposed to combine ASR and NLP. Some works post-process the transcriptions to make them look like written text [14], using repunctuation techniques or detecting disfluencies, before applying NLP methods; others adapt NLP techniques to take into account additional ASR output such as per-word confidence measures and word graphs [2].

However, a deeper integration of NLP and ASR, rather than consecutive applications of techniques from these two domains, seems necessary to make the most of what both can offer. The idea is thus to take into account at each step a fusion of clues provided by NLP and ASR, so that the two domains better inform each other. Following this approach, we have chosen to focus here on two problems that may benefit from this collaboration: improving the quality of automatic transcriptions using morpho-syntactic linguistic knowledge, and adapting the ASR system to the various topics of the discourse in an unsupervised way.

The chapter is organized as follows: Section 1.2 explains the key points of ASR. Section 1.3 describes both our ASR system and the data used in the experiments of the two following sections. Section 1.4 presents our deep integration of standard information from our ASR system and morpho-syntactic knowledge to improve the quality of transcriptions. Finally, Section 1.5 is dedicated to the unsupervised topic-based adaptation of our ASR system.

1.2 The basic principles of automatic speech recognition

Most automatic speech recognition systems rely on statistical models of speech and language to find the best transcription, i.e. word sequence, given a (representation of the) signal y, according to

\hat{w} = \arg\max_{w} \, p(y \mid w) \, P[w] .   (1.1)

Language models (LM), briefly described below, are used to get the prior probability of a word sequence. Acoustic models, typically continuous density hidden Markov models (HMM) representing phones, are used to compute the probability of the acoustic material for a given word sequence, p(y|w). The relation between words and acoustic models of phone-like units is provided by a pronunciation dictionary which lists the words known to the ASR system along with the corresponding pronunciations. Hence, ASR systems operate on a closed vocabulary whose typical size is between 64,000 and 100,000 words or tokens. Because of the limited size of the vocabulary, word normalization, such as ignoring the case or breaking compound words, is often used to limit the number of out-of-vocabulary words. The consequence is that the vocabulary of an ASR system is not necessarily suited for natural language processing.

As mentioned previously, the role of the language model is to define a probability distribution over the set of possible sentences according to the


vocabulary of the system. As such, the language model is a key component for a better integration between ASR and NLP. ASR systems typically rely on N-gram language models because of their simplicity, which makes the maximization in (1.1) tractable. The N-gram model defines the probability of a sentence w_1^n as

P[w_1^n] = \prod_{i=1}^{n} P[w_i \mid w_{i-N+1}^{i-1}] ,   (1.2)

where the probabilities of the sequences of N words, P[w_i | w_{i-N+1}^{i-1}], are estimated from large text corpora. Because of the large size of the vocabulary, observing all the possible sequences of N words is impossible. A first approach to circumvent this problem is based on smoothing techniques, such as discounting and back-off, to avoid null probabilities for events unobserved in the training corpus. Another approach relies on class-based N-gram models [4], where the N-gram model operates on a limited set of classes and words belong to one or several classes. The probability of a word sequence is then given by

P(w_1^n) = \sum_{t_1 \in C(w_1), \ldots, t_n \in C(w_n)} \; \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-N+1}^{i-1}) ,   (1.3)

where C(w) denotes the set of possible classes for a word w.

In practice, (1.1) is evaluated in the log domain and the LM probabilities are scaled in order to be comparable to the acoustic likelihoods, thus resulting in the following maximization problem

\hat{w} = \arg\max_{w} \; \ln p(y \mid w) + \beta \ln P[w] + \gamma |w| ,   (1.4)

where the LM scale factor β and the word insertion penalty γ are empirically set.
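To make the class-based formulation in (1.3) concrete, here is a minimal Python sketch, not part of the original chapter, that sums over all class sequences licensed by a toy lexicon; the probability tables, the bigram order, and the word lists are illustrative placeholders rather than the models actually used in our system.

```python
from itertools import product

# Hypothetical toy tables (illustrative only): a class bigram model P(t_i | t_{i-1}),
# a class-conditional word model P(w_i | t_i), and a class lexicon C(w).
p_class = {("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.7, ("DET", "ADJ"): 0.2}
p_word_given_class = {("the", "DET"): 0.5, ("cat", "NOUN"): 0.01, ("cat", "ADJ"): 0.001}
lexicon = {"the": ["DET"], "cat": ["NOUN", "ADJ"]}


def class_ngram_probability(words):
    """P(w_1^n) = sum over class sequences of prod_i P(w_i|t_i) P(t_i|t_{i-1}), cf. (1.3)."""
    total = 0.0
    for tags in product(*(lexicon[w] for w in words)):
        prob, prev = 1.0, "<s>"
        for w, t in zip(words, tags):
            prob *= p_word_given_class.get((w, t), 0.0) * p_class.get((prev, t), 0.0)
            prev = t
        total += prob
    return total


print(class_ngram_probability(["the", "cat"]))  # sums over (DET, NOUN) and (DET, ADJ)
```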

The ultimate output of an ASR system is obviously the transcription. However, additional information, such as confidence measures or transcription alternatives, can be obtained from the ASR system. This information might prove useful for NLP as it avoids error-prone hard decisions by the ASR system. Rather than finding the single best word sequence maximizing (1.4), one can output a list of the N best word sequences, thus keeping track of the alternative transcriptions that were discarded by the system. For a very large number of transcription hypotheses, these N-best lists can be conveniently organized as word graphs where arcs correspond to words. From the set of alternative hypotheses, a confidence measure can be computed for each word, reflecting how confident the system is in its decision.

The quality of ASR transcriptions is usually measured in terms of the word error rate (WER), obtained by counting the number of erroneous words after an optimal alignment between the automatic transcription and the reference one. The error rate can also be computed at the sentence level (SER) by


counting the number of sentences for which all the words are correctly transcribed. The quality of language models can be measured apart from that of the entire ASR system by means of perplexity. Perplexity measures how well a LM can predict the next word given the word history; the lower the perplexity, the better. However, due to the complex interactions between all the components of an ASR system, decreasing the perplexity of a LM does not necessarily result in decreasing the word and sentence error rates.
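As an illustration of the WER definition above, the following short Python sketch, assuming whitespace-tokenized reference and hypothesis strings, computes the word error rate through a standard Levenshtein alignment.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the cat sat on the mat", "the cat sad on mat"))  # 2/6 ≈ 0.33
```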

1.3 The Irene broadcast news transcription system

The Irene broadcast news transcription system, jointly developed by Irisa and ENST for the ESTER broadcast news transcription evaluation campaign [8], implements the basic principles described in the previous section after a segmentation step which splits the input stream into pseudo-sentences. The system has a vocabulary of 64,000 words.

The input stream is first segmented automatically in order to detect regions containing speech. A further partitioning into speaker turns is then performed. Since (1.4) can only be solved for short utterances, the speech stream is finally segmented into breath groups based on the energy profile in order to detect breath intakes. To avoid problems due to segmentation errors, the segmentation steps were done manually in this study. However, let us stress the fact that sentence segmentation is based on breath groups rather than on syntactic and grammatical considerations, even though the two are related.

Transcription itself is carried out in three passes. A first pass with fairly simple context-independent acoustic models and a 3-gram word-based LM aims at generating large word graphs. These word graphs are then rescored with more complex context-dependent acoustic models and a 4-gram LM. Rescoring word graphs is based on (1.4), where the maximization is limited to the set of word sequences encoded in the word graph, thus making the use of more complex models tractable. Finally, based on the transcription from the second pass and the speaker partition obtained in the segmentation step, the acoustic models are adapted for each speaker and final word graphs are obtained by rescoring the initial word graphs with speaker-adapted acoustic models.

Experiments reported in this chapter were carried out on the ESTER French broadcast news transcription task. A corpus of about one hundred hours of manually transcribed data was used and divided into three parts: a large part was reserved for acoustic and language model training, while two sets of four hours each, from four different broadcasters, were used as development and test sets respectively. The development set was used to tune the many parameters of an ASR system, such as the language model scale factor. The language model was obtained by interpolating a LM estimated on 1M words from the manual transcriptions of the training set with a LM estimated from 350M words from the French newspaper “Le Monde”.


1.4 Morpho-syntactic knowledge integration

Morpho-syntactic information is brought by parts of speech (POS), which are grammatical classes (e.g. verb, noun, preposition, etc.), along with morphological knowledge about gender, number, tense, mood, or case. Such information is obtained by assigning each word or locution of a sentence its POS tag according to the context, a process called tagging. POS tags can be exploited to improve transcription. Indeed, by tagging the output produced by an ASR system for each breath group, the hypotheses with correct POS sequences, such as a singular noun following a singular adjective, can be selected. In particular, this method can correct agreement errors by taking into account information about gender and number. We show in this section that this information is in practice valuable to reduce the WER.

In what follows, we present our method to integrate POS information in the ASR process. We first describe previous work on the use of morpho-syntactic information to improve the WER. We then show experimentally how reliable POS tagging of transcriptions can be obtained by automatic methods and used in a post-processing stage of an ASR system. We finally investigate the use of morpho-syntactic knowledge for confidence measures.

1.4.1 Related works

The first ASR systems that used POS knowledge resorted to class-based LM where words are associated with morpho-syntactic classes, as explained in Section 1.2. However, even when class-based LM are interpolated with word-based LM, they exhibit a limited effectiveness [24]. In [10], a 3-gram LM is built over word/tag pairs rather than words and the recognition problem is redefined as finding the best joint word and POS tag sequence. This approach results in a significant WER reduction. Nevertheless, due to the drastic increase in the number of entries in the LM, the approach requires a very large amount of training data and relies heavily on smoothing techniques to make up for the lack of data.

We use a different approach where POS information is combined with the LM score in a post-processing stage rather than integrated in the LM as in previous approaches. An interesting point is that our method does not require a large amount of annotated training data, as opposed to [10]. The idea is to process the lists of the N best hypotheses found for each breath group. We first tag each hypothesis before computing a score function combining the POS, LM, and acoustic scores to rerank the N-best lists.

1.4.2 Morpho-syntactic tagging of automatic transcriptions

Morpho-syntactic tagging is a widely used technique in NLP and taggers are now considered reliable enough to automatically tag a text according to POS information. However, most experiments were carried out on written


text, and spoken corpora have in comparison seldom been studied. As oral output has specificities that are likely to disturb taggers, we first demonstrate that such noisy texts can be reliably tagged.

We built a morpho-syntactic tagger based on the popular HMM technique [17], where tagging is expressed as finding, for each sentence, the most probable POS tag sequence among all the sequences possible according to a lexicon. In order to adapt the model to the characteristics of oral language, we used a 200,000-word training set from the manual transcriptions of the training corpus. Moreover, we removed all capital letters and punctuation marks to obtain a format similar to a transcription, and segmented it into breath groups. We also restricted the vocabulary of the tagger to the one of our ASR system. We chose our POS tags so as to distinguish the gender and number of adjectives and nouns, and the tense and mood of verbs, which led to a set of 95 tags.
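The following sketch illustrates the HMM tagging principle described above with a tiny bigram Viterbi decoder; the tag set and the transition and emission tables are hypothetical toy values, not the 95-tag model trained for our experiments.

```python
import math


def viterbi_tag(words, tags, log_trans, log_emit):
    """Most probable tag sequence under a bigram HMM tagger (toy version).

    log_trans[(t_prev, t)] and log_emit[(t, w)] hold log-probabilities; missing
    entries get a large penalty, a crude stand-in for proper smoothing."""
    penalty = -1e9
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (log_trans.get(("<s>", t), penalty)
                + log_emit.get((t, words[0]), penalty), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((s + log_trans.get((tp, t), penalty) + log_emit.get((t, w), penalty), p)
                 for tp, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]


tags = ["DET", "NOUN", "VERB"]
log_trans = {("<s>", "DET"): math.log(0.6), ("DET", "NOUN"): math.log(0.7),
             ("NOUN", "VERB"): math.log(0.5)}
log_emit = {("DET", "the"): math.log(0.5), ("NOUN", "cat"): math.log(0.01),
            ("VERB", "sleeps"): math.log(0.05)}
print(viterbi_tag(["the", "cat", "sleeps"], tags, log_trans, log_emit))  # ['DET', 'NOUN', 'VERB']
```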

To quantitatively evaluate morpho-syntactic tagging, we manually tagged a 1-hour broadcast. We first investigated the behavior of the tagger on manually transcribed text by comparing the tag found for each word with the one of the reference. For automatic transcriptions, evaluating the tagger is more problematic than for manual transcriptions since ASR output contains misrecognized words; for ungrammatical hypotheses, it becomes impossible to know which sequence of POS would be right. We therefore compute the tagging accuracy only for the words that are correctly recognized.

Results obtained over the 1-hour test corpus, transcribed with a WER of 22.0 %, are given in the first line of Table 1.1. They show a tag accuracy over 95 %, which is comparable to the results usually reported on written corpora. Furthermore, we get the same results on both manual and automatic transcriptions, which establishes that morpho-syntactic tagging is reliable, even for text produced by an ASR system whose recognition errors might be expected to jeopardize the tagging of correctly recognized words. The robustness of tagging is explained by the fact that tags are locally assigned. We compared the performance of our tagger with that of Cordial4, probably the best tagger available for written French, which has already produced good results on a spoken corpus [22]. The last line of Table 1.1 shows results comparable with our HMM-based tagger when we ignore confusions between proper nouns and common nouns. Indeed, the lack of capital letters is particularly problematic for Cordial, which relies on this information to detect proper nouns.

1.4.3 Reranking of N-best lists

Morpho-syntactic information is here used in a post-processing stage to treat N-best sentence hypothesis lists. Although N-best lists are not as informative

4 Distributed by the Synapse Développement corporation.


transcription     manual        automatic

HMM tagger        95.7 (95.9)   95.7 (95.9)
Cordial           90.7 (95.0)   90.6 (95.2)

Table 1.1. Tag accuracy (in %); results in parentheses are computed when confusions between common nouns and proper nouns are ignored.

as word graphs, each entry can be seen as a standard text, thus permitting POS tagging.

To combine morpho-syntactic information with the LM and acoustic scores, we first determine the most likely POS tag sequence t_1^m corresponding to a sentence hypothesis w_1^n. Based on this information, we compute the morpho-syntactic probability of the sentence hypothesis

P(t_1^m) = \prod_{i=1}^{m} P(t_i \mid t_{i-N+1}^{i-1}) .   (1.5)

Note that the number m of tags may differ from the number n of words, as we associate a unique POS with locutions, consecutive proper nouns, or cardinals. To take into account longer dependencies than the 4-gram word-based LM, we chose a 7-gram POS-based LM.

We propose a new global score for a sentence [11] by adding the morpho-syntactic score to the score given in (1.4) with an appropriate weight. The combined score for a sentence w_1^n, corresponding to the acoustic input y_1^t, is therefore given by

s(w_1^n) = \log P(y_1^t \mid w_1^n) + \alpha \log P[w_1^n] + \beta \log P[t_1^m] + \gamma n .   (1.6)

Introducing POS information at the sentence level allows us to tokenize sequences of words and tags differently and to more explicitly penalize unlikely sequences of tags, such as a plural noun following a singular adjective.
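As an illustration of the combined score (1.6) and of the MAP selection described below, here is a small Python sketch; the weights and the hypothetical 3-best list are invented for the example, whereas in our experiments the weights are tuned on development data.

```python
# Illustrative weights (in practice tuned on development data).
ALPHA, BETA, GAMMA = 10.0, 8.0, -0.5


def combined_score(hyp):
    """Score of one N-best hypothesis, in the spirit of (1.6):
    acoustic log-likelihood + alpha * LM log-prob + beta * POS log-prob + gamma * n."""
    return (hyp["log_acoustic"]
            + ALPHA * hyp["log_lm"]
            + BETA * hyp["log_pos"]      # log P[t_1^m] from the POS N-gram model
            + GAMMA * hyp["n_words"])


def map_rerank(nbest):
    """MAP criterion: pick the hypothesis with the highest combined score."""
    return max(nbest, key=combined_score)


# Hypothetical 3-best list for one breath group.
nbest = [
    {"words": "le député a voté", "log_acoustic": -120.0, "log_lm": -9.1, "log_pos": -3.2, "n_words": 4},
    {"words": "le député à voter", "log_acoustic": -119.5, "log_lm": -9.0, "log_pos": -5.8, "n_words": 4},
    {"words": "le député a votez", "log_acoustic": -121.0, "log_lm": -9.6, "log_pos": -6.5, "n_words": 4},
]
print(map_rerank(nbest)["words"])  # the grammatically correct hypothesis wins
```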

Based on the score function defined in (1.6), which includes all the available sources of knowledge, we can reorder N-best lists using various criteria. We considered three criteria, namely maximum a posteriori (MAP), minimum expected word error rate [20], and consensus decoding on N-best lists [15]. The last two criteria, often used in current systems, aim at reducing the word error rate at the expense of an increased sentence error rate.

MAP criterion

The MAP criterion selects, among the N-best list generated for each breath group, the best hypothesis w(i) which maximizes s(w(i)) as given by (1.6). Results on the test corpus show that our approach achieves an absolute decrease of 0.8 % in the WER, as reported in Tab. 1.2, columns 1 and 2. By taking into account lexical probabilities P(w_i|t_i), which are usually included


baseline ASR system   contextual probabilities   lexical and contextual probabilities   class-based LM

19.9                  19.1                       19.0                                   19.5

Table 1.2. WER (%) on test data.

in class-based LM, we observed a minor additional decrease of the WER (Tab. 1.2, column 3). The score in this last case is given by

s'(w_1^n) = \log P(y_1^t \mid w_1^n) + \alpha \log P(w_1^n) + \beta \left[ \sum_{i=1}^{n} \log P(w_i \mid t_i) + \log P(t_1^m) \right] + \gamma n   (1.7)

and leads to penalizing words that are rarely associated with the proposed tag.

We compared our approach with class-based LM incorporated in the transcription process by linear interpolation with a word-based LM according to

P(w_1^n) = \prod_{i=1}^{n} \left[ \lambda \, P_{word}(w_i \mid w_1^{i-1}) + (1 - \lambda) \, P_{POS}(w_i \mid w_1^{i-1}) \right]

with

P_{POS}(w_i \mid w_1^{i-1}) = \sum_{t_{i-N+1} \ldots t_i} P(w_i \mid t_i) \, P(t_i \mid t_{i-N+1}^{i-1}) .   (1.8)

We reevaluated the N-best lists by interpolating the N-class based POS tagger and the word-level language model, the interpolation factor λ being optimized on the development data. We noticed an absolute decrease of 0.4 % w.r.t. the baseline system, i.e., half of the decrease previously observed (Tab. 1.2, last column). These results clearly establish that linear interpolation of log probabilities is more effective than linear interpolation of probabilities.

Word error minimization criteria

Combined scores incorporating morpho-syntactic information can be used to reorder N-best lists using decoding criteria that aim at minimizing the word error rate, rather than the sentence error rate as the MAP criterion does. Two popular criteria can be used to explicitly minimize the WER: the first one consists in approximating the posterior expectation of the word error rate by comparing each pair of hypotheses in the N-best list [20]; the second one, consensus decoding, is based on the multiple alignment of the N-best hypotheses into a confusion network [15].

Both criteria rely on the computation of the posterior probability of each sentence hypothesis w(i)

P(w^{(i)} \mid y_1^t) = \frac{e^{s(w^{(i)})/z}}{\sum_j e^{s(w^{(j)})/z}}   (1.9)


               WER                                    SER
               MAP dec.   min. WE   cons. dec.        MAP dec.   min. WE   cons. dec.

without POS    19.9       19.8      19.8              61.8       62.2      62.4
with POS       19.0       18.9      18.9              59.4       59.6      59.7

Table 1.3. WER (%) and sentence error rate (%) on test data for various decoding techniques.

where the posterior probability is obtained from a score including morpho-syntactic knowledge. The combined score is scaled by a factor z in order to avoid overly peaked posterior probabilities. In the experiments described below, the combined score including lexical probabilities (1.7) was used.
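A minimal sketch of the first word-error-minimization criterion follows, combining the scaled posteriors of (1.9) with a pairwise approximation of the expected number of word errors; the scale z and the toy inputs are illustrative, and consensus decoding over confusion networks is not shown.

```python
import math


def word_distance(a, b):
    """Word-level Levenshtein distance between two whitespace-tokenized strings."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def posteriors(scores, z):
    """Sentence posteriors from combined scores, cf. (1.9): a softmax of s(w)/z."""
    m = max(scores)
    exps = [math.exp((s - m) / z) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def min_expected_wer(hypotheses, scores, z=10.0):
    """Pick the hypothesis minimizing the expected number of word errors,
    approximated over the N-best list by pairwise comparisons."""
    post = posteriors(scores, z)

    def expected_errors(i):
        return sum(p * word_distance(hypotheses[i], hypotheses[j])
                   for j, p in enumerate(post))

    return min(range(len(hypotheses)), key=expected_errors)


hyps = ["le député a voté", "le député à voter", "le député a votez"]
print(hyps[min_expected_wer(hyps, scores=[-238.6, -257.9, -271.0])])
```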

Results are reported in Tab. 1.3 for the three decoding criteria, namely MAP, WER minimization, and consensus, with and without POS knowledge. In both cases, we observe a slight decrease of the WER when using word error minimization criteria, along with an increased sentence error rate. However, the gain observed is not meaningful because of the limited size of the N-best lists (N=100); a larger gain was observed with a 1,000-best list. Interestingly, results show that including morpho-syntactic information also benefits word error minimization criteria, thus indicating that the WER improvement due to POS tags is consistent whatever the decoding criterion used.

Discussion on the results

Statistical tests were carried out to measure the significance of the observed WER improvement, assuming independence of the errors across breath groups. For all the decoding criteria, both the paired t-test and the paired Wilcoxon test resulted in a confidence over 99.9 % that the difference of WER with and without POS knowledge is not observed by chance. Besides, for the MAP criterion, the same tests indicate that global scores computed as in (1.6) or (1.7) led to a significant improvement w.r.t. interpolated class-based LM, with a confidence over 99 %.

Our method also shows its robustness for spontaneous speech. Indeed, we measured performance on a short extract of 3,650 words containing interviews with numerous disfluencies and observed that the baseline WER of 46.3 % is reduced to 44.5 % with (1.6) and to 44.3 % with (1.7) using the MAP criterion. This 4 % relative improvement is consistent with the relative improvement obtained on the entire test set. Additional experiments with automatic segmentation also demonstrated the validity of our approach in that case.

Experiments reported here were carried out on French, whose nouns, adjectives, and verbs are very often inflected for number, gender, or tense into various homophone forms. However, an experiment we carried out to improve a handwriting recognition system for English shows that morpho-syntactic knowledge still brings a WER improvement, even though English is not as highly inflected a language.


                       WER    NCE without POS   NCE with POS

decoding without POS   19.7   0.307             0.326
decoding with POS      18.7   0.265             0.288

Table 1.4. WER (in %) and normalized cross entropy for MAP decoding with and without the POS score.

To conclude this section, we observed that the changes induced by POS knowledge generally produce more grammatical utterance transcriptions, as indicated by the sentence error rates reported in Tab. 1.3. Indeed, an analysis of the sentence error rate shows a significant reduction when taking into account morpho-syntactic information. In particular, we noticed several corrections of agreement or tense errors such as “une date qui A DONNER le vertige à une partie de la France” (“a date which TO GIVE a part of France fever”).

1.4.4 Confidence measures

We have shown how POS knowledge can reduce transcription errors. Another interest of NLP for ASR is that it can bring new information to compute confidence measures.

Plots of the conditional probabilities P(t_i | t_{i-N+1}^{i-1}) for POS sequences and P(w_i | w_{i-M+1}^{i-1}) for word sequences show that P(t_i | t_{i-N+1}^{i-1}) exhibits a significant decrease on erroneous words, whereas the word language model may show the same behavior on correct words due to smoothing or back-off; this property is interesting for computing confidence measures. As sentence posterior probabilities are commonly used to derive confidence measures from N-best lists or lattices, we compute them as in [18].

Confidence measures are obtained from 1,000-best lists5 using the combined score (1.7). We limit the study to the lists produced by MAP decoding, which gives the lowest SER. The scaling factors and insertion penalty used for the computation of the sentence posteriors differ from those used for reordering the N-best lists and were optimized on the development set to maximize the normalized cross entropy (NCE). NCE, which was introduced by NIST, is commonly used to evaluate confidence measures about the correctness of a word.
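For reference, the sketch below computes a normalized cross entropy from per-word confidence scores and correctness labels, following the commonly used NIST-style definition; the confidence values are invented for the example and the formula should be checked against the official NIST scoring tools before use.

```python
import math


def normalized_cross_entropy(confidences, correct):
    """NCE = (H_base - H_conf) / H_base, where H_base is the entropy of always
    predicting the average rate of correct words, and H_conf the cross entropy
    of the actual confidence scores (assumed strictly between 0 and 1)."""
    n = len(confidences)
    n_correct = sum(correct)
    p = n_correct / n
    h_base = -(n_correct * math.log2(p) + (n - n_correct) * math.log2(1 - p))
    h_conf = -sum(math.log2(c) if ok else math.log2(1 - c)
                  for c, ok in zip(confidences, correct))
    return (h_base - h_conf) / h_base


# Hypothetical confidences for five words, four of them correctly recognized.
print(normalized_cross_entropy([0.9, 0.8, 0.95, 0.7, 0.4], [True, True, True, True, False]))
```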

Table 1.4 summarizes the results, where the higher the NCE, the better the confidence measure. WER with and without POS information are given in the first column. The next two columns report the NCE obtained when computing confidence measures respectively without and with morpho-syntactic information. Results show that POS improves the confidence measure in both cases. Thus, our experiments combining POS and ASR show that morpho-syntactic information is valuable to improve the WER and confidence measures of ASR systems.

5 And not from 100-best lists as in the previous experiments, in order to get sufficient information to compute sentence posterior probabilities.


Fig. 1.1. Main scheme of topic-based adaptation.

1.5 Topic-based adaptation

Because their training procedure is based on a large and topically generic corpus, current ASR systems treat every discourse in the same way, without taking into account the several topics which are encountered. However, language differs according to the theme: for instance, documents dealing with tennis or broadcast news will have specific word successions.

Methods coming from NLP are precisely able to locate and characterize topics and can be applied to update the vocabulary of an ASR system or specialize its baseline LM [1, 3]. Since changing the lexicon would require modifying the whole ASR system, we focus on the unsupervised adaptation of an initial generic LM to the various topics found in a recording. The updated LM built for each section can be used to generate a new, hopefully better, transcription.

To implement topic-based adaptation, we have chosen to follow a scheme proposed by [6] and presented in Fig. 1.1. This approach gathers several subtasks which are usually studied separately. Before presenting our work, we outline the methods previously used for each subtask.

1.5.1 Related works

Text segmentation

The first step consists in segmenting a long automatic transcription into thematically coherent sections. To do this, different clues can be considered.


The most popular indicator focuses on the vocabulary used in a text block and studies the numbers of word occurrences. The frequent use of the same words in a given text section tends to indicate a thematic coherence of the text. Conversely, a change in the use of the vocabulary is a clue that the topic has changed. This principle, called lexical cohesion [7], can be enriched by the knowledge of more complex relations between words, such as synonymy, in order to gather words which refer to close senses. Other indications can be given by the presence of discourse markers [16], like “however” and “furthermore”, or by acoustic cues which also provide some information for segmentation [21]. For example, silences or jingles are often used to emphasize a topic change.

Besides, all these clues can be used in supervised or unsupervised methods. In the first case, coherent text sections can be found by measuring how well text zones are represented by some pre-calculated LM, each of them being specialized a priori for one precise topic of a theme collection. This can be done if we consider that only a fixed number of topics will be encountered in any speech. This assumption is less suitable if themes cannot be fixed beforehand, which leads us to favor unsupervised methods, as in the experiments described below.

Adaptation data collection

A second step aims at obtaining a dedicated update LM for each topically coherent text segment. Two main methods seek to achieve this. On the one hand, some works adopt a supervised approach: an update LM is chosen from a set of pre-calculated LM according to their thematic proximity to the topic of the considered text segment [19]. On the other hand, an update LM can be trained on a dynamically built thematic corpus. It is composed of texts corresponding to the topic encountered in the text segment and is either selected from a static domain-independent set of articles [13], or built from a collection of texts retrieved from the Internet [3]. This last source is particularly interesting because the style of the texts is closer to speech than that of classical written documents [23]. In the end, a selection among those texts can be done by considering several criteria like word distributions, lexical cohesion, or some rules coming from the IR domain [6].

LM adaptation

Finally, the mixture of the baseline LM and the update LM is computed (step 3). It can be made by interpolating these LM linearly or log-linearly [9]; in this case, the N-gram probabilities of each LM are directly mixed. Other, more complex, techniques are based on the search for a final N-gram distribution which minimizes an information quantity, like entropy or mutual information. Some works show that these methods outperform the previous ones [5].


In the following sections, we present our approaches for each of these subtasks. Our main idea is to combine knowledge from NLP and ASR. We show that the interaction between these two domains leads to perplexity and WER reductions. Furthermore, we want to avoid any user intervention so that an ASR system can constantly and automatically adapt itself to any topic it encounters. To do this, we explain how we segment a long transcript into thematically homogeneous sections and then present our way to train specialized LM from automatically built update corpora.

1.5.2 Transcript segmentation

To segment a word stream into thematic blocks, we use different kinds of clues. First, repetitions of lexical words enable us to detect boundaries between lexically coherent sections. However, broadcast news tends to avoid repetitions, which leads to a poor segmentation, even when considering lemmas6 instead of words. Therefore, we extend semantic links between words by studying co-occurrences of lexical words in the French corpus Le Monde. We also consider pause duration, which has already proved to be a valuable cue for a similar task [21].
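As an illustration of segmentation based on lexical cohesion, the following sketch flags positions where the cosine similarity between adjacent word windows drops below a threshold; it is a simplified, TextTiling-like stand-in and does not include the co-occurrence and pause cues mentioned above. The window size and threshold are illustrative.

```python
import math
from collections import Counter


def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def cohesion_boundaries(words, window=50, threshold=0.1):
    """Flag positions where lexical cohesion between adjacent windows drops,
    a rough indicator of a topic change."""
    boundaries = []
    for i in range(window, len(words) - window, window):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```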

Our segmentation task still needs to be finalized and has not been used in the following sections. To evaluate the interest of topic adaptation for WER improvement, we separately study 22 extracts of transcriptions that we manually selected and that are therefore thematically coherent.

1.5.3 Language model adaptation

To train an update LM for a thematically coherent section, we select keywords from each thematic segment and submit them to a Web search engine; the retrieved pages are used to form a corpus from which the update LM probabilities are estimated. Several constraints have to be kept in mind in order to implement this approach successfully. First, keywords must be significant enough to characterize well the content of their segment. However, they must not be too precise if we want to form queries able to retrieve a large enough number of pages for our update corpus. This means that we have to find a way of combining our keywords to constitute one or more efficient queries. We also have to determine how many pages, and which ones, we should keep to get a large enough and thematically good corpus. These questions are discussed in the following sections.

Keyword spotting

We use an IR-based technique to extract keywords so that these words discriminate their corresponding segment relative to the texts of our generic

6 Canonical forms of words that depend on their POS. For example, the lemma of a verb can be its infinitive, and that of a noun its singular masculine form.


training corpus. To measure the importance of each term w in the content of a segment, we use two criteria: the term frequency tf and the inverse document frequency idf, i.e., a value linked to the inverse of the number of documents which contain w. The product of these two parameters with a normalization factor provides a tf*idf score which is high for characteristic terms, i.e. keywords. However, we modify the tf*idf scores to take into account specificities of the transcriptions.

Firstly, we try to limit the number of proper nouns to avoid a too small number of returned pages. Penalties are thus applied to each proper noun by multiplying its tf*idf score by a coefficient, empirically set to 0.75. Because of the lack of case information in transcribed texts, proper nouns are detected based on the morpho-syntactic tagging and a dictionary, words with no definition in the dictionary being likely proper nouns.

Secondly, badly transcribed words must also be removed from our keyword list. Indeed, their inclusion in queries would result in irrelevant documents in the adaptation corpus. To solve this problem, we take into account the confidence measure c in the tf*idf score:

score(w) = tf*idf × λ + tf*idf × (1 − λ) × c ,   (1.10)

where λ limits the influence of c. Finally, the few terms with the highest scores are kept. We show below how to combine them in order to start the building of an update corpus.
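A small sketch of the confidence-weighted keyword scoring of (1.10), including the proper-noun penalty mentioned above; the document-frequency table, the λ value, the helper names and the example inputs are illustrative assumptions.

```python
import math
from collections import Counter


def keyword_scores(segment_words, confidences, doc_freq, n_docs,
                   proper_nouns=frozenset(), lam=0.5, pn_penalty=0.75):
    """Rank candidate keywords of a transcript segment with a confidence-weighted
    tf*idf, in the spirit of (1.10); proper nouns are penalized as in the text."""
    tf = Counter(segment_words)
    n = len(segment_words)
    scores = {}
    for w, c in zip(segment_words, confidences):
        idf = math.log(n_docs / (1 + doc_freq.get(w, 0)))
        tfidf = (tf[w] / n) * idf
        if w in proper_nouns:
            tfidf *= pn_penalty                              # empirical penalty (0.75)
        scores[w] = max(scores.get(w, 0.0),
                        tfidf * lam + tfidf * (1 - lam) * c)  # eq. (1.10)
    return sorted(scores, key=scores.get, reverse=True)


words = ["roland", "garros", "tennis", "le", "match"]
conf = [0.9, 0.85, 0.95, 0.99, 0.6]
df = {"le": 90, "match": 40, "tennis": 10, "roland": 3, "garros": 2}
print(keyword_scores(words, conf, df, n_docs=100, proper_nouns={"roland", "garros"})[:3])
```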

Update corpus building

To get Web pages, we seek to put our keywords together in order to form queries. Building a single query combining all the keywords turned out to be impossible, and we rather rely on a fixed number of queries combining subsets of the whole keyword set. For example, one query is composed of the two best keywords while another is composed of the first and third keywords; a sketch of this strategy is given below. This strategy maximizes the chance of having at least one relevant query, even when transcription errors are present. The question is then to select a subset of the documents returned by the search engine for our update corpus.
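The query-building strategy can be sketched as follows; the number of keywords per query, the query count and the example keyword list are purely illustrative.

```python
from itertools import combinations


def build_queries(keywords, n_top=4, max_queries=5):
    """Form several small queries from subsets of the best-ranked keywords, so that
    at least one query is likely to be relevant even if one keyword is a
    transcription error. The subset size and counts are illustrative choices."""
    pairs = combinations(keywords[:n_top], 2)
    return [" ".join(pair) for pair in pairs][:max_queries]


print(build_queries(["tennis", "tournoi", "finale", "raquette", "paris"]))
# ['tennis tournoi', 'tennis finale', 'tennis raquette', 'tournoi finale', 'tournoi raquette']
```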

As some queries can return many thousands of documents, a constant maximum number of returned pages is set. In our study, experiments show that at least fifty pages are needed to get a good update LM. However, by increasing the number of pages used, we notice that the runtime increases linearly while the gain produced by the topic-based adaptation tends to become stable. In our system, we keep only two hundred Web pages.

Among the returned pages, the quality is variable, mainly because they come from different queries. That is why these pages are filtered. To do this, we compute a thematic similarity measure between the transcript segment and each retrieved Web page. Pages whose similarity is considered too small are discarded from the update corpus. Thematic similarity is measured based on the tf*idf scores along with the cosine distance, a well-known and tested IR method.


Fig. 1.2. Measured perplexity before and after our topic-based adaptation.

After having built a large enough and qualitatively good update corpus, the specialized LM is trained before being linearly interpolated with our generic baseline LM in order to obtain an updated LM. Even if linear interpolation is not optimal, it is powerful enough to test the impact of our topic-based adaptation approach.
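A minimal sketch of this linear interpolation and of the perplexity computation used to evaluate it; the per-word probabilities are invented, and the 0.25/0.75 weights follow the fixed setting mentioned below.

```python
import math


def interpolated_perplexity(word_probs_baseline, word_probs_update, lam=0.25):
    """Perplexity of a linearly interpolated LM over a segment:
    P(w_i) = lam * P_update(w_i) + (1 - lam) * P_baseline(w_i)."""
    log_sum = 0.0
    for p_base, p_upd in zip(word_probs_baseline, word_probs_update):
        log_sum += math.log2(lam * p_upd + (1 - lam) * p_base)
    n = len(word_probs_baseline)
    return 2 ** (-log_sum / n)


# Hypothetical per-word probabilities assigned by each LM to the same segment.
base = [0.01, 0.002, 0.05, 0.008]
update = [0.03, 0.01, 0.04, 0.02]
print(interpolated_perplexity(base, update))
```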

Results

Figure 1.2 compares the perplexity of our updated LM with the perplexity of the baseline model for each of our 22 reference thematic segments. First, we can see that adaptation always reduces perplexity, even for texts that already had a low perplexity (texts 13, 19, and 20). This means that the language of the sections is better represented by the updated LM than by the general one.

In two out of three segments, the perplexity falls by over 10 %; this translates into an absolute decrease of 0.55 % in the WER when rescoring the word graphs generated by our ASR system. However, for a third of the sections, our topic-based adaptation leads to a WER increase. Moreover, these results were computed from manually segmented transcriptions. We still expect a WER improvement with automatic segmentation by using the update LM more cleverly to adapt the ASR system. For instance, we fixed beforehand, for all sections, the update LM weight to 0.25 and the general LM weight to 0.75, but these coefficients could be optimized separately for each section.


1.6 Discussion

In this chapter, we presented experiments toward a better integration between automatic speech recognition and natural language processing techniques in order to improve spoken document processing, in particular for multimedia document analysis. We have first shown that NLP techniques, in this case morpho-syntactic analysis, can be used to improve ASR results and confidence measures. We have then demonstrated that some information retrieval techniques can be applied to transcribed texts provided confidence measures are taken into account. In our case, information retrieval techniques are used for unsupervised LM adaptation using the Internet. Preliminary results on topic adaptation are encouraging, though the entire system described in this chapter has not been implemented.

In the case of morpho-syntactic rescoring, an N-best list of sentence hypotheses is used as the interface between NLP and ASR rather than a single transcription hypothesis. In the case of topic adaptation, the single transcription hypothesis is enhanced with confidence measures which are exploited for efficient keyword selection. However, we believe that N-best lists or word graphs are a better, richer interface between ASR and NLP. Indeed, with a single transcription hypothesis, transcription errors made by the ASR system are detrimental to NLP algorithms. Confidence measures can help to avoid relying on erroneous words, but they do not provide information on alternative hypotheses discarded by the ASR system during the search for the best word sequence. On the contrary, word graphs or N-best lists can be used by NLP algorithms to recover from errors made by the transcription system.

Finally, let us conclude with a word on repunctuation of automatic transcripts, a popular topic lately. Repunctuation, which aims at formatting an automatically transcribed text as a regular written text with punctuation, so that NLP techniques developed for written texts can be applied to transcriptions, is often proposed as the solution for a better interface between ASR and NLP. However, we believe that repunctuation cannot replace a better integration between ASR and NLP, as, for example, repunctuation cannot help NLP recover from transcription errors. Moreover, the repunctuation process itself relies on the erroneous transcription. On the contrary, the experiments described in this chapter show that a real feedback loop between ASR and NLP can prove beneficial for both domains.


Part IV

New Research Directions: INTERFACES TO MULTIMEDIA CONTENT


References

1. A. Allauzen and J.-L. Gauvain. Open vocabulary ASR for audiovisual document indexation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2005.
2. F. Bechet, A. L. Gorin, J. H. Wright, and D. Hakkani-Tur. Detecting and extracting named entities from spontaneous speech in a mixed initiative spoken dialogue context: How may I help you? Speech Communication, 42(2):207–225, 2004.
3. B. Bigi, Y. Huang, and R. De Mori. Vocabulary and language model adaptation using information retrieval. In Proc. Int. Conf. on Spoken Language Processing – Interspeech, 2004.
4. P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
5. L. Chen, J.-L. Gauvain, L. Lamel, and G. Adda. Unsupervised language model adaptation for broadcast news. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2003.
6. L. Chen, J.-L. Gauvain, L. Lamel, G. Adda, and M. Adda-Decker. Using information retrieval methods for language model adaptation. In Proc. Europ. Conf. on Speech Communication and Technology, 2001.
7. O. Ferret. How to thematically segment texts by using lexical cohesion? In Proc. COLING-ACL, 1998.
8. S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.-F. Bonastre, and G. Gravier. The ESTER phase II evaluation campaign for the rich transcription of French broadcast news. In Proc. Europ. Conf. on Speech Communication and Technology – Interspeech, 2005.
9. A. Gutkin. Log-linear interpolation of language models. PhD thesis, University of Cambridge, UK, 2000.
10. P. A. Heeman. POS tags and decision trees for language modeling. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
11. S. Huet, G. Gravier, and P. Sebillot. Morphosyntactic processing of N-best lists for improved recognition and confidence measure computation. In Proc. Europ. Conf. on Speech Communication and Technology – Interspeech, 2007.
12. S. Huet, P. Sebillot, and G. Gravier. Utilisation de la linguistique en reconnaissance de la parole : un etat de l'art. Research Report 1804, Irisa, 2006.
13. D. Klakow. Selecting articles from the language model training corpus. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2000.
14. Y. Liu, E. Shriberg, A. Stolcke, and M. P. Harper. Using machine learning to cope with imbalanced classes in natural speech: Evidence from sentence boundary and disfluency detection. In Proc. ICSLP, 2004.
15. L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: Word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4):373–400, 2000.
16. D. Marcu. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3):395–448, 2000.
17. B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994.
18. B. Rueber. Obtaining confidence measures from sentence probabilities. In Proc. Europ. Conf. on Speech Communication and Technology, 1997.
19. K. Seymore and R. Rosenfeld. Using story topics for language model adaptation. In Proc. Europ. Conf. on Speech Communication and Technology, 1997.
20. A. Stolcke, Y. Konig, and M. Weintraub. Explicit word error minimization in N-best list rescoring. In Proc. Europ. Conf. on Speech Communication and Technology, 1997.
21. G. Tur, D. Hakkani-Tur, A. Stolcke, and E. Shriberg. Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics, 21(1):31–57, 2001.
22. A. Valli and J. Veronis. Etiquetage grammatical de corpus oraux : problemes et perspectives. Revue francaise de linguistique appliquee, 4(2):113–133, 1999.
23. D. Vaufreydaz, M. Akbar, and J. Rouillard. Internet documents: A rich source for spoken language modeling. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, 1999.
24. M. Weintraub, Y. Aksu, S. Dharanipragada, S. Khudanpur, H. Ney, J. Prange, A. Stolcke, F. Jelinek, and E. Shriberg. LM95 project report: Fast training and portability. Technical report, Center for Language and Speech Processing, Johns Hopkins University, 1996.


Index

automatic speech recognition 7

confidence measures 8, 16
consensus decoding 14
cosine distance 20

discourse markers 17

ESTER evaluation campaign 10

information retrieval 17
Irene, Irisa/ENST ASR system 10

language model 8
language model adaptation 17–19
language model interpolation 18
lexical cohesion 17

maximum a posteriori 13
morpho-syntactic knowledge 11
morpho-syntactic rescoring 12

N-best list 8
n-class model 8
n-gram model 8
natural language processing 7

tagging, ASR transcripts 11
tagging, part-of-speech 11
text segmentation 17, 19
topic adaptation 17

word error minimization 14
word graph 8