Recovering Capitalization and Punctuation Marks on Speech

UNIVERSIDADE TÉCNICA DE LISBOA

INSTITUTO SUPERIOR TÉCNICO

<TranscriptSegment>

<TranscriptGUID>2</TranscriptGUID>

<AudioType start="970" end="1472">Clean</AudioType>

<Time start="970" end="1472" reasons=""/>

<Speaker id="1000" name="Homem" gender="M" known="F"/>

<SpeakerLanguage native="T">PT</SpeakerLanguage>

<TranscriptWList>

<W start="970" end="981" conf="0.765016" focus="F0" pos="S.">em</W>

<W start="982" end="997" conf="0.525857" focus="F0" pos="Nc">boa</W>

<W start="998" end="1049" conf="0.98280" focus="F0" punct=".” pos="Nc">noite</W>

<W start="1050" end="1064" conf="0.904695" focus="F0" pos="Td">os</W>

<W start="1065" end="1113" conf="0.974994" focus="F0" pos="Nc">centros</W>

<W start="1114" end="1121" conf="0.938673" focus="F0" pos="S.">de</W

<W start="1122" end="1173" conf="0.993847" focus="F0" pos="Nc">emprego</W>

<W start="1174" end="1182" conf="0.951339" focus="F0" pos="S.">em</W>

<W start="1183" end="1229" conf="0.999291" focus="F0" pos="Np">portugal</W>

<W start="1230" end="1283" conf="0.979457" focus="F0" pos="V.">continuou</W>

<W start="1284" end="1285" conf="0.967095" focus="F0" pos="Td">a</W>

<W start="1286" end="1345" conf="0.996321" focus="F0" pos="V.">registar</W>

<W start="1346" end="1399" conf="0.946317" focus="F0" pos="R.">menos</W>

<W start="1400" end="1503" conf=... focus="F0" punct=".” pos="V.">inscritos</W>

</TranscriptWList>

</TranscriptSegment>

Recovering Capitalization and Punctuation Markson Speech Transcriptions

Fernando Manuel Marques Batista(Mestre)

Dissertação para obtenção do Grau de Doutor emEngenharia Informática e de Computadores

Orientador: Doutor Nuno João Neves Mamede

JúriPresidente: Presidente do Conselho Científico do ISTVogais: Doutor Mário Jorge Costa Gaspar da Silva

Doutora Isabel Maria Martins TrancosoDoutor Nuno João Neves MamedeDoutora Dilek Hakkani-TürDoutora Helena Sofia Andrade Nunes Pereira Pinto

Maio de 2011

Resumo

Esta tese aborda duas tarefas de anotação de meta-informação, que fazem parte do en-riquecimento de transcrições de fala: maiusculização e recuperação de marcas de pontu-ação. Este estudo centra-se no processamento de notícias televisivas, envolvendo transcriçõesmanuais e automáticas. São comparados e analisados vários modelos de maiusculização,concluindo-se que os modelos generativos capturam melhor a estrutura da língua escrita, en-quanto que os modelos discriminativos são melhores para transcrições de fala e mais robustosaos erros de reconhecimento. O impacto da dinâmica da língua é analisado, concluindo-se queo desempenho da maiusculização é afectado pela distância temporal entre o material de treinoe teste. Em termos de pontuação, são analisadas as três marcas mais frequentes: ponto, vírgula,e interrogação. As experiências iniciais usam informação local, que combina informação lexicale acústica, para dar conta do ponto e da vírgula. As experiências mais recentes utilizam tambéminformação prosódica e estendem este estudo às interrogativas.

Grande parte do estudo é independente da língua, mas à língua Portuguesa foi dado umdestaque especial. A investigação realizada permitiu obter os primeiros resultados de avali-ação, relativos às duas tarefas, para notícias televisivas em Português Europeu. Algumas ex-periências foram também replicadas para Inglês e Espanhol.

Abstract

This thesis addresses two important metadata annotation tasks, involved in the productionof rich transcripts: capitalization and recovery of punctuation marks. The main focus of thisstudy concerns broadcast news, using both manual and automatic speech transcripts. Differ-ent capitalization models were analysed and compared, indicating that generative approachescapture the structure of written corpora better, while the discriminative approaches are suitablefor dealing with speech transcripts, and are also more robust to ASR errors. The so-called lan-guage dynamics have been addressed, and results indicate that the capitalization performanceis affected by the temporal distance between the training and testing data. In what concernsthe punctuation task, this study covers the three most frequent marks: full stop, comma, andquestion mark. Early experiments addressed full-stop and comma recovery, using local features,and combining lexical and acoustic information. Recent experiments also combine prosodicinformation and extend this study to question marks.

Much of the research conducted here is language independent, but a special focus is givento the Portuguese language. This thesis provides the first evaluation results of these two tasksover European Portuguese broadcast news data. Most experiments were also conducted overEnglish and Spanish.

Palavras Chave

Keywords

Palavras chave

Enriquecimento de transcrições de fala

Maiusculização automática

Pontuação automática

Segmentação automática de frases

Métodos generativos e discriminativos

Dinâmica da linguagem

Keywords

Rich transcription

Automatic capitalization

Automatic punctuation

Sentence boundary detection

Generative and discriminative methods

Language dynamics

Agradecimentos

Acknowledgements

Esta tese não teria sido possível sem o apoio e ajuda que recebi ao longo destes quatro anos.Agradeço a todos os que me apoiaram e ajudaram.

Em primeiro lugar quero agradecer ao Professor Nuno Mamede pela sua orientação eapoio. Muito agradeço a amizade e confiança prestadas desde os tempos do meu mestrado.O meu obrigado por ter alocado recursos que permitiram revêr e corrigir dados essenciais aomeu trabalho.

Queria também agradecer ao Diamantino Caseiro que, agora nos Estados Unidos, muitome ajudou nos primeiros tempos deste trabalho, com as suas oportunas sugestões. Não tenhopalavras para agradecer à Professora Isabel Trancoso a sua amizade, disponibilidade e ajudasempre pronta. A sua dedicação pessoal e as suas sábias e valiosas contribuições foram deter-minantes no desenvolvimento deste trabalho.

Agradeço a todos os meus colegas do laboratório de sistemas de língua falada (L2F) doINESC-ID por toda a colaboração, apoio, camaradagem e excelente ambiente de trabalho queme têm proporcionado. À Joana Paulo, Ricardo Ribeiro, David Matos, Luisa Coheur, HugoMeinedo e António Serralheiro pela sua longa amizade e apoio. Aos meus colegas HelenaMoniz, Hugo Meinedo, Thomas Pellegrini e Alberto Abad pela ajuda, estreita colaboração eimportantes contribuições para o meu trabalho. Um especial agradecimento ao Jorge Baptistapela sua amizade e tempo dedicado à revisão deste documento. Agradeço também à VeraCabarrão o profissionalismo e tempo dedicado à revisão dos dados de fala com que trabalhei.

Agradeço aos meus colegas do DCTI do ISCTE-IUL pela sua camaradagem, apoio e exce-lente ambiente de trabalho que me têm proporcionado. Um agradecimento especial aos meuscolegas Ricardo Ribeiro, Luís Nunes, Tomás Brandão, João Baptista, Paulo Trezentos, Luís Can-cela, Abílio Oliveira, Joaquim Esmerado, Alexandre Almeida, José Farinha, Marco Ribeiro, LuísBotelho, Manuel Sequeira, Maria Albuquerque, José André, mas também a todos os outros dosquais tenho sempre recebido o maior apoio.

Uma palavra de agradecimento para todos os restantes amigos que também, pela suaamizade, foram catalizadores deste trabalho.

Queria também agradecer aos meus pais e aos meus irmãos, com os quais sempre pudecontar. Aos meus avós já falecidos, que recordo com muito carinho. Aos meus restantes famil-iares, tios, sogros, cunhados e primos.

Finalmente, um agradecimento muito especial à Susana, minha mulher, que com o seuamor, abnegação e sacrifício, tornou possível a realização desta tese.

Muito obrigado a todos.

Lisboa, Março de 2011Fernando Manuel Marques Batista

Contents

1 Introduction 1

1.1 Emerging Interest in Rich Transcription . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Rich Transcription Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Motivation and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Proposed Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.5 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 State of the Art 11

2.1 Related Work on Capitalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2 Related Work on Punctuation and Sentence Boundary Detection . . . . . . . . . . 16

2.3 The Maximum Entropy Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3.1 Application to Rich Transcription . . . . . . . . . . . . . . . . . . . . . . . 22

2.3.2 Large Corpora Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Corpora 27

3.1 Broadcast News Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.1.1 Portuguese Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.1.2 Spanish Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.1.3 English Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2 Written Newspaper Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2.1 Portuguese Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2.2 Spanish Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2.3 English Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

i

3.3 Speech Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.3.1 Capitalization Alignment Issues . . . . . . . . . . . . . . . . . . . . . . . . 41

3.3.2 Punctuation Alignment Issues . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Additional Prosodic Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4.1 Extracting the Pitch and Energy . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4.2 Adding Phone Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.4.3 Marking the Syllable Boundaries and Stress . . . . . . . . . . . . . . . . . . 46

3.4.4 Producing the Final XML File . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.5 Speech Data Word Boundaries Refinement . . . . . . . . . . . . . . . . . . . . . . 47

3.5.1 Post-processing rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.5.3 Impact on acoustic models training . . . . . . . . . . . . . . . . . . . . . . 52

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Capitalization Recovery 55

4.1 Capitalization Analysis Based in Written Corpora . . . . . . . . . . . . . . . . . . 55

4.1.1 Capitalization Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2 Early Work Comparing Different Methods . . . . . . . . . . . . . . . . . . . . . . . 58

4.2.1 Description of the Generative Methods . . . . . . . . . . . . . . . . . . . . 59

4.2.2 Comparative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2.3 Results using Unlimited Vocabulary . . . . . . . . . . . . . . . . . . . . . . 65

4.3 Impact of Language Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.2 Capitalization of Written Corpora . . . . . . . . . . . . . . . . . . . . . . . 68

4.3.3 Capitalization of Speech Transcripts . . . . . . . . . . . . . . . . . . . . . . 70

4.3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.4 Capitalization Model Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.1 Baseline results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.2 Adaptation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

ii

4.4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Recent Work on Capitalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.5.1 Capitalization Results using a ME-based Approach . . . . . . . . . . . . . 76

4.5.2 Capitalization Results using an HMM-based Approach . . . . . . . . . . . 78

4.5.3 Capitalization Results using Conditional Random Fields . . . . . . . . . . 78

4.5.4 Analysis of Feature Contribution . . . . . . . . . . . . . . . . . . . . . . . . 81

4.5.5 Error Analysis and General Problems . . . . . . . . . . . . . . . . . . . . . 81

4.6 Extension to other Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.6.1 Analysis of the language variations over time . . . . . . . . . . . . . . . . 83

4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5 Punctuation Recovery 89

5.1 Punctuation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.2 Early Work using Lexical and Acoustic Features . . . . . . . . . . . . . . . . . . . 92

5.2.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.2.2 Sentence Boundary Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.2.3 Segmentation into Chunk Units, Delimited by Punctuation Marks . . . . 98

5.2.4 Recovering full-stop and comma Simultaneously . . . . . . . . . . . . . . . 102

5.3 Extended Punctuation Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3.1 Improving full stop and comma Detection . . . . . . . . . . . . . . . . . . . 107

5.3.2 Extension to Question Marks . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.4 Extension to other Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.4.1 Recovering Full-stop and Comma . . . . . . . . . . . . . . . . . . . . . . . . 115

5.4.2 Detection of Question Marks . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

iii

6 Conclusions and Future Directions 123

6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6.2 Main Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

6.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Bibliography 131

Nomenclature 143

A Portuguese Text Normalization 145

A.1 Date and time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

A.2 Ordinals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

A.3 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

A.4 Optional Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

A.5 Money . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

A.6 Abreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

A.7 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

iv

List of Figures

1.1 Integration of the RT modules in the recognition system. . . . . . . . . . . . . . . 4

1.2 Overall architecture of the L2F speech recognition system. . . . . . . . . . . . . . 5

1.3 Excerpt of a transcribed text, with different markup conditions. . . . . . . . . . . 7

2.1 Block diagram of the capitalization and punctuation tasks. . . . . . . . . . . . . . 22

2.2 Conversion of trigram counts into features. . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Example of correct and incorrect slots. . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1 Focus distribution in terms of speech duration for Portuguese BN. . . . . . . . . . 29

3.2 Focus distribution in terms of speech duration for Spanish BN. . . . . . . . . . . . 31

3.3 Excerpt of the LDC1998T28 manual transcripts. . . . . . . . . . . . . . . . . . . . . 33

3.4 Except of the LDC2000S86 corpus. . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.5 Excerpt of the LDC2005T24 corpus (XML format). . . . . . . . . . . . . . . . . . . 35

3.6 Excerpt of the LDC2005T24 corpus (RTTM format). . . . . . . . . . . . . . . . . . 36

3.7 Excerpt of the LDC2007S10 corpus. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.8 Creating an XML containing all required information for further experiments. . . 40

3.9 Example of a transcript segment extracted from the AUT data set. . . . . . . . . . 41

3.10 Capitalization alignment examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.11 Pitch adjustment for unvoiced regions. . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.12 Integrating prosody information in the corpora . . . . . . . . . . . . . . . . . . . . 47

3.13 PCTM file containing the phones/diphones produced by the ASR system. . . . . 48

3.14 PCTM file with monophones and marked with syllable boundary and stress. . . 48

3.15 Excerpt of one of the final XML files, containing prosodic information. . . . . . . 49

3.16 Improvement in terms of correct word boundaries, after post-processing. . . . . . 51

v

3.17 Phone segmentation before and after post-processing. . . . . . . . . . . . . . . . . 52

3.18 Improvement in terms of correct word boundaries, after retraining. . . . . . . . . 53

4.1 The different capitalization classes and their distribution in the PUBnews corpus. 56

4.2 Number of words by frequency interval in the PUBnews corpus. . . . . . . . . . 57

4.3 Distribution of the ambiguous words by word frequency interval. . . . . . . . . . 58

4.4 Proportion of ambiguous words by word frequency interval. . . . . . . . . . . . . 59

4.5 Using the HMM-based tagger. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.6 Using a WFST to perform capitalization. . . . . . . . . . . . . . . . . . . . . . . . . 61

4.7 Number of OOVs considering written corpora. . . . . . . . . . . . . . . . . . . . . 67

4.8 Proportion of OOVs considering speech transcripts. . . . . . . . . . . . . . . . . . 67

4.9 Performance for different training periods. . . . . . . . . . . . . . . . . . . . . . . 68

4.10 Capitalization of written corpora, using forward and backwards training. . . . . 69

4.11 Automatic capitalization of speech transcripts, using forward retraining. . . . . . 70

4.12 Comparing the capitalization results of manual and automatic transcripts. . . . . 72

4.13 Manual transcription results, using all approaches. . . . . . . . . . . . . . . . . . . 75

4.14 Analysis of each capitalization feature usefulness. . . . . . . . . . . . . . . . . . . 80

4.15 Vocabulary coverage on written newspaper corpora. . . . . . . . . . . . . . . . . . 83

4.16 Vocabulary coverage for Broadcast News speech transcripts. . . . . . . . . . . . . 84

4.17 Forward and Backwards training results over written corpora. . . . . . . . . . . . 85

4.18 Forward training results over spoken transcripts. . . . . . . . . . . . . . . . . . . . 86

5.1 Punctuation marks frequency in Europarl. . . . . . . . . . . . . . . . . . . . . . . 91

5.2 Punctuation marks frequency in the ALERT-SR corpus (old version). . . . . . . . 92

5.3 Punctuation marks frequency in the ALERT-SR corpus (revised version). . . . . 93

5.4 Converting time gap values into binary features using intervals. . . . . . . . . . . 94

5.5 Impact of each feature type in the SU detection performance. . . . . . . . . . . . 97

5.6 Impact of each individual feature in the SU detection performance. . . . . . . . . 98

5.7 Impact of each feature type in the chunk detection performance. . . . . . . . . . . 101

5.8 Impact of each individual feature in the chunk detection performance. . . . . . . . 101

vi

5.9 Impact of lexical and acoustic features in the punctuation detection. . . . . . . . 105

5.10 Impact of each individual feature in the punctuation detection performance. . . 105

5.11 Relation between the acoustic features and each punctuation mark. . . . . . . . . 106

vii

viii

List of Tables

3.1 Different parts of the Portuguese BN corpus. . . . . . . . . . . . . . . . . . . . . . 28

3.2 Confusion matrix between the old to the new manual transcripts. . . . . . . . . . 30

3.3 User annotation agreement for punctuation marks in the Portuguese BN corpus. 30

3.4 Spanish BN corpora properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5 English BN corpora properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Portuguese Newspaper corpora properties. . . . . . . . . . . . . . . . . . . . . . . 37

3.7 European Spanish written corpora properties. . . . . . . . . . . . . . . . . . . . . 38

3.8 English written corpora properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.9 Capitalization alignment report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.10 Punctuation alignment examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.11 Punctuation alignment report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.1 Different LM sizes when dealing with PUBnews corpus. . . . . . . . . . . . . . . 63

4.2 Capitalization results of the generative methods over the PUBnews corpus. . . . 63

4.3 Capitalization results of the generative methods over the ALERT-SR corpus. . . . 64

4.4 ME-based capitalization results using limited vocabulary. . . . . . . . . . . . . . . 65

4.5 ME-based capitalization results using unlimited vocabulary. . . . . . . . . . . . . 65

4.6 Using the first 8 subsets of each year for training. . . . . . . . . . . . . . . . . . . . 68

4.7 Retraining from Jan. 1999 to Sep. 2004. . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.8 Evaluating with manual transcripts. . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.9 Retraining with manual and evaluating with automatic transcripts. . . . . . . . . 72

4.10 Baseline capitalization results produced using BaseCM. . . . . . . . . . . . . . . . 74

4.11 Capitalization SER achieved for all different approaches. . . . . . . . . . . . . . . 74

4.12 Recent ME-based capitalization results for Portuguese. . . . . . . . . . . . . . . . 77

ix

4.13 Recent HMM-based capitalization results for Portuguese. . . . . . . . . . . . . . . 78

4.14 ME and CRF capitalization results for the PUBnews test set. . . . . . . . . . . . . 79

4.15 ME and CRF capitalization results for the force aligned transcripts test set. . . . . 79

4.16 ME and CRF capitalization results for the automatic speech transcripts test set. . 79

4.17 Comparing two approaches for capitalizing the English Language. . . . . . . . . 86

5.1 Frequency of each punctuation mark in written newspaper corpora. . . . . . . . 90

5.2 Frequency of each punctuation mark in broadcast news speech transcriptions. . . 91

5.3 Recovering sentence boundaries over the ASR output, using the APP segmentation. 95

5.4 Recovering sentence boundaries in the force aligned data. . . . . . . . . . . . . . . . 96

5.5 Recovering sentence boundaries directly in the ASR output. . . . . . . . . . . . . . . 96

5.6 Recovering chunks over the ASR output, using only the APP segmentation. . . . 99

5.7 Recovering chunk units in the force aligned data. . . . . . . . . . . . . . . . . . . . 99

5.8 Recovering chunk units directly in the ASR output. . . . . . . . . . . . . . . . . . 100

5.9 Punctuation mark replacements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.10 Recovering full-stop and comma in force aligned transcripts. . . . . . . . . . . . . . 103

5.11 Recovering full-stop and comma in automatic transcripts. . . . . . . . . . . . . . . . 103

5.12 Punctuation results over manual transcripts, combining prosodic features. . . . . 108

5.13 Punctuation performance over automatic transcripts, combining prosodic features.109

5.14 Results for manual transcripts, bootstrapping from a written corpora model. . . . 110

5.15 Results for automatic transcripts, bootstrapping from a written corpora model. . 110

5.16 Recovering question marks using a written corpora model. . . . . . . . . . . . . . 112

5.17 Performance results recovering the question mark in different corpora. . . . . . . 113

5.18 Punctuation results for English BN transcripts. . . . . . . . . . . . . . . . . . . . . 115

5.19 Punctuation results for English BN manual transcripts, adding prosody. . . . . . 117

5.20 Punctuation results for English BN automatic transcripts, adding prosody. . . . . 117

5.21 Punctuation for force aligned transcripts, bootstrapping from a written corporamodel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.22 Punctuation for automatic transcripts, bootstrapping from a written corporamodel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

x

5.23 Recovering question marks using a written corpora based model. . . . . . . . . . 119

5.24 Recovering the question mark, adding acoustic and prosodic features. . . . . . . 120

xi

xii

1Introduction

TV stations, radio broadcasters, and other media organizations are now producing largequantities of digital audio and video data on a daily basis. Automatic Speech Recognition(ASR) systems can now be applied to such sources of information in order to enrich them withadditional information for applications, such as indexing, cataloging, subtitling, translationand multimedia content production. The ASR output consists of raw text, often in lowercaseformat, without any punctuation information, numbers are represented with words instead ofsymbols, and possibly containing different types of disfluencies. Even if useful for many appli-cations, like indexing and cataloging, for other tasks, such as subtitling and multimedia contentproduction, the ASR output would benefit from other information. In general, enriching thespeech output aims at enhancing information for a better human and machine processing.

Speech units do not always correspond to sentences, as established in the written sense.They may, in fact, be quite flexible, elliptic, restructured, and even incomplete. Taking intoaccount this idiosyncratic behavior, the notion of utterance in Jurafsky and Martin (2009) orsentence-like unit (SU) in (Strassel, 2004; Liu et al., 2006) is often used instead of sentence. De-tecting positions where a punctuation mark is missing, roughly1 corresponds to the task ofdetecting a SU, or finding the SU boundaries. SU boundaries provide a basis for further nat-ural language processing, and their impact on subsequent tasks has been analyzed in manyspeech processing studies (Harper et al., 2005; Mrozinsk et al., 2006; Ostendorf et al., 2008).

The capitalization task, also known as truecasing (Lita et al., 2003), consists of assigningto each word of an input text its corresponding case information, which sometimes dependson its context. Proper capitalization can be found in many information sources, such as news-paper articles, books, and most of the web pages. Many computer applications, such as wordprocessing and e-mail clients, perform automatic capitalization along with spell correction andgrammar check. One important aspect related with capitalization concerns neologisms that arefrequently introduced, and also archaisms. This so-called language dynamics are relevant andmust be taken into consideration in what concerns capitalization.

This thesis addresses two important metadata extraction tasks (MDE) that take part in theproduction of rich transcripts: recovering punctuation marks and capitalization. Both tasks are

1Roughly because, for instance, units delimited by commas often do not correspond to sentences.

2 CHAPTER 1. INTRODUCTION

critical for the legibility of speech transcripts, and are now gaining increasing attention fromthe scientific community. Consider for example the following extract “the crisis was expected lastmay a man died...”. The extract does not provide any clue concerning the speaker pauses andintonation, making it very difficult to decide what has been said: “The crisis was expected. LastMay a man died” or “The crisis was expected last May. A man died”. These last two versions ofthe extract are easier to read and to process. Besides improving human readability, punctua-tion marks and capitalization provide important information for Parsing, Machine Translation(MT), Information Extraction, Summarization, Named Entity Recognition (NER), and otherdownstream tasks that are usually also applied to written corpora. Apart from the insertionof punctuation marks and capitalization, enriching speech recognition covers other importantactivities, such as Speaker Identification, and the Detection and Filtering of Disfluencies, whichare not covered by the scope of this thesis.

1.1 Emerging Interest in Rich Transcription

The production of rich transcripts involves both the speech-to-text (STT) technologies andmetadata extraction technologies. The rich transcription (RT) process usually involves the pro-cess of recovering structural information and the creation of metadata from that information.The final enriched transcript contains all the metadata, which can then be used to enhance thefinal recognition output. The following are the most relevant MDE tasks used in the productionof rich transcripts:

Speaker diarization: Covering sub-tasks such as “Who Speaks When” and "Who Said What", con-sists of assigning the different parts of the speech to the corresponding speakers. This taskis important for many speech sources, such as telephone conversations, and meetings.

Sentence segmentation: Often also referred as Sentence Boundary Detection, this task consistsof identifying the SU (Sentence-like Unit) boundaries. When dealing with speech, thenotion of SU (Strassel, 2004) or utterance (Jurafsky and Martin, 2009) is often used insteadof “sentence”. More details concerning this task can be found in Section 2.2.

Punctuation recovery or detection: This task consists of identifying and inserting punctuationmarks, which can be full-stops, commas, question marks, exclamation marks, and other lesscommon punctuation marks. This task shares most of the strategies also used in thesentence segmentation task.

Capitalization: Also known as truecasing, consists of assigning the correct case for each word,when such information is unavailable.

Disfluency detection and filtering: Disfluencies are well-known phenomena, occurring spe-cially in spontaneous speech. They are inevitable because humans make utterances while

1.1. EMERGING INTEREST IN RICH TRANSCRIPTION 3

thinking about what to say. Disfluencies are classified into two broad categories (Furuiand Kawahara, 2008): (1) fillers, with which speakers try to fill pauses while thinking orto attract the attention of listeners (e.g., “um”, “uh”, “well”, “you know”); and (2) repairs,including repetitions and restarts, with which speakers try to correct, modify, or abortearlier statements.

While the speech-to-text core technologies have been developed for more than 30 years (Furui,2005), metadata extraction/annotation technologies only have gained significant importanceduring the latest years. For example, (Jurafsky and Martin, 2009), published in 2009, containsan entire section dedicated to this subject (Chapter 10 - Speech Recognition: Advanced Top-ics), while this topic was only briefly mentioned in the first version of that book, published in2000 (Jurafsky and Martin, 2000). On top of that, the recent advances in Rich Transcription werethe focus of the recent (September 2010) special issue on “New Frontiers in Rich Transcription”,part of the IEEE Transactions on Audio, Speech, and Language Processing publication.

The Rich Transcription project, from the DARPA-sponsored EARS (Effective, AffordableReusable Speech-to-text) program, was a five-year project, with the aim of advancing the stateof the art in automatic rich transcription of speech. The Metadata Extraction and Modelingtask, described in the project, aims at introducing structural information into the ASR output,as a good human transcriber would do, and includes the following topics: “Punctuation andtopic segmentation”, “Disfluency detection and clean-up”, “Semantic annotation”, “Dialogueact modeling”, “Speaker recognition, segmentation, and tracking”, and “Annotation of speakerattributes”.

The NIST RT evaluation series2 is also another important initiative that supports someof the goals of the DARPA EARS program, providing means to evaluate STT and MDE tech-nologies. The main propose of this initiative was to involve the community, by studying theSTT/MDE integration and determining the MDE goals. Two other important goals consist ofproviding a state-of-the-art baseline and begining the creation of flexible/extensible evaluationparadigms (new formats and new software). These evaluation series cover some of the meta-data extraction tasks, such as: Speaker Diarization; Edit Word Detection; Filler Word Detection;and Sentence Boundary detection. The RT 2002 evaluation was the first of this RT evaluation se-ries, covering STT tasks for three different speech sources (broadcast news, conversational tele-phone speech, and meetings) and speaker diarization for broadcast news and conversationaltelephone speech. It is important to notice that by this time the “rich transcription” conceptwas not yet completely established. In 2003, two evaluations were conducted: RT-03S – focus-ing on the STT tasks, but also covering the speaker diarization task “Who Spoke When”; RT03F– focusing on the MDE tasks, which included "Who Said What" speaker diarization, sentenceboundary detection, and disfluencies detection for broadcast news speech and conversationaltelephone speech. Other languages besides English were covered in the 2004 evaluation series,

2http://www.nist.gov/speech/tests/rt/


Speech Transcription........................................

Speech Recognition

Rich Transcription

Punctuation

Capitalization

DisfluencyDetection and filtering

...

Close-captioning

Multimedia content

Further automatic processing

...

Indexing, Retrieval, ...

AudioSegmentation

Diarization

Speech signalPitch, Energy

...Enriched

Transcription............................................................

Machine translationSummarizationNamed Entity Recognition....

Figure 1.1: Integration of the RT modules in the recognition system.

but after this period these evaluations focused again on the English language. After 2004, allevaluation focused the English Meeting domain, the STT task and the MDE speaker diariza-tion task. In 2005 two specific speaker diarization sub-tasks, under the scope of the meetingsdomain, were introduced: Speech Activity Detection – consists of detecting when someone istalking; and Source Localization – determining the 3D position of the person who is talking.Nonetheless, it is important to notice that despite the emerging efforts in the RT scope, manymetadata extraction tasks are still not covered by these evaluation plans.

Most of the current research focus on the English language. However, some initiatives havealso been reported for other languages in the last few years. For example, the ESTER evaluationcampaign is an important initiative for evaluating automatic broadcast news transcription sys-tems for the French language (Gravier et al., 2004). ESTER is part of the EVALDA project andfocuses on the evaluation of rich transcription and indexing of broadcast news. The campaignstarted in 2003 with a pilot evaluation, using a subset of the final corpus, and implementedthree categories of tasks: transcription, segmentation and information extraction.

1.2 Rich Transcription Integration

Most of the information required by the rich transcription modules is usually provided bythe speech transcript itself, produced by the speech recognition modules. However, additionalinformation may be extracted directly from the speech signal, or provided by other sourcesof information, such as linguistic knowledge. Figure 1.1 summarizes the integration of richtranscription modules in a generic recognition system. The raw recognition output is still use-ful for some applications (e.g., indexing and information retrieval), but the RT tasks provideadditional information for improved results in a broader set of applications.

Depending on the application, the RT modules may be required to work online. For ex-ample, on-the-fly subtitling for oral presentations or TV shows demands a very small delaybetween the speech production and the corresponding transcript. In these systems, both the

1.2. RICH TRANSCRIPTION INTEGRATION 5

Offline System

Online SystemTV broadcasted signal

Control System & GUI

Audio

Subtitling Generation

Teletex Server

Web

Topic Segmentationand Indexing

Summarization

PunctuationCapitalization

JDJingle detection

ASRSpeech Recognition

APPAudio segmentation

XML file

Figure 1.2: Overall architecture of the L2F speech recognition system.

computational delay and the number of words to the right of the current word that are requiredto make a decision, are important aspects to be taken into consideration. On the other hand,an offline system can perform multiple passes on the same data, thus being able to produceimproved results. One of the goals behind this work consists of building a module for integra-tion both on the on-the-fly subtitling system and on the offline multimedia content productionsystem, which are currently being in use.

The blocks that constitute the current architecture of the current Spoken Language Labo-ratory (L2F) recognition system (Amaral et al., 2007) are illustrated in Figure 1.2. The systemfollows the connectionist paradigm. The JD (Jingle Detection) module detects the beginningand the end of the show, as well as the possible breaks. The selected blocks of speech are thenprocessed by the APP (Audio Pre-Processing or Audio Segmentation) module (Meinedo andNeto, 2003), which splits the input into homogeneous speech segments based on the acousticproperties of the speech signal, performs speaker clustering, identifies the speaker gender, andclassifies the speech according to the different focus conditions (noisy, clean, etc.). Audimus,a large vocabulary continuous speech recognition module (Meinedo et al., 2008), is the ASRmodule that processes each speech segment, previously identified by the APP module, andproduces the initial speech transcript. The punctuation and capitalization modules, developedin the scope of this thesis, update the speech transcript data, by adding the punctuation andcapitalization information. The whole system can be used either for TV Close-captioning (on-line) of for producing multimedia content (offline). The online version of the system uses theSubtitling Generator to send the final output to the Teletex Server. The speech recognition pro-cessing chain used for producing multimedia content corresponds to the offline version of the


system. An XML3 file is iteratively updated with information coming from each processingmodule, and its final content contains all the information required for web publishing. TheTopic Segmentation and Indexing (Amaral and Trancoso, 2008), as well as the Summarization(Ribeiro and Matos, 2007, 2008) modules, take part in the processing chain. Because this is anoffline chain, the BN show can processed as a whole by each processing module in multiplepasses, thus producing improved results.

The first modules of this system, including punctuation and capitalization, are optimizedfor on-line performance, given their deployment in the fully automatic subtitling system thatis running on the main news shows of the public TV channel in Portugal, since 2008 (Netoet al., 2008). All the information is being used by the Selective Dissemination of MultimediaInformation system – SSNT (Amaral et al., 2007), which has been deployed since 2003, and isnow processing Portuguese broadcast news on a daily basis.

1.3 Motivation and Goals

The recent advances in the ASR systems, together with the increase of computational re-sources, make it now possible to process broadcast speech signals, continuously being pro-duced by a number of broadcasters. The Portuguese recognition system, developed at the Spo-ken Language Laboratory (L2F), and described previously in Section 1.2, is a state-of-the-artASR system, now being applied to different domains of the Portuguese language. It has beenapplied since the beginning of 2003 to the main TV news show, produced by the National Por-tuguese Broadcaster RTP. Nowadays, it is being used for processing the most important newsshows, produced by all Portuguese TV broadcasters: RTP, SIC and TVI. The system performstwo different tasks: live close-captioning, and multimedia content production for offline usage.The original content produced by this system was still difficult to read and to process, mainlybecause of the incorrect segmentation, and also because a number of basic information wasmissing, such as capital letters. Enriching the speech recognition output is an important assetfor a speech recognition system that performs tasks like online captioning or produces multi-media content. Hence, the main motivation behind this thesis consists of producing enhancedtranscripts, that can be applied to real life speech recognition systems. More specifically, themain target corresponds to correctly address the tasks of recovering punctuation marks andcapitalization information, when dealing with speech transcripts, produced by an automaticspeech recognition system. Accordingly, one of the expected outcomes is a prototype module,incorporating Rich Transcription tasks, for integration in the L2F recognition system.

Figure 1.3 shows an excerpt of a transcribed text, where the upper rectangle shows the orig-inal text, corresponding to the output of a recognition system, with a paragraph segmentationbased purely on the acoustic elements of the signal. The second rectangle introduces the cap-

3http://www.w3.org/XML/

1.3. MOTIVATION AND GOALS 7

boa tarde a Ministra da Educação pronunciou sobre a polémica do professor suspenso Maria de Lurdes Rodrigues disse que vai aguardar pelos resultados do processo que está a decorrer e garantiu que não tem motivos para duvidar do funcionamento da Direcção Regional de Educação do Norte que suspendeu passou por ter feito um comentário pose do primeiro-ministro a ministra disse lamentar que este tipo de pesados marquem à agenda mediática até este momento do muito que o li e ouvi não tenho nenhum sinal não tenho nenhum motivo para duvidar do funcionamento das instituições ou para de a considerar que pode estar em causa o funcionamento da Direcção Regional de Educação dos seus de serviços ...

boa tarde a ministra da educação pronunciou sobre a polémica do professor suspenso maria de lurdes rodrigues disse que vai aguardar pelos resultados do processo que está a decorrer e garantiu que não tem motivos para duvidar do funcionamento da direcção regional de educação do norte que suspendeu passou por ter feito um comentário pose do primeiro-ministro a ministra disse lamentar que este tipo de pesados marquem à agenda mediática até este momento do muito que o li e ouvi não tenho nenhum sinal não tenho nenhum motivo para duvidar do funcionamento das instituições ou para de a considerar que pode estar em causa o funcionamento da direcção regional de educação dos seus de serviços aquilo que é a minha preocupação é que no âmbito deste processo estejam garantidos a existência dos mecanismos de defesa

Boa tarde.A Ministra da Educação pronunciou sobre a polémica do professor suspenso. Maria de Lurdes Rodrigues disse que vai aguardar pelos resultados do processo que está a decorrer e garantiu que não tem motivos para duvidar do funcionamento da Direcção Regional de Educação do Norte, que suspendeu passou por ter feito um comentário pose do primeiro-ministro. A ministra disse lamentar que este tipo de pesados marquem à agenda mediática. Até este momento do muito que o li e ouvi não tenho nenhum sinal. Não tenho nenhum motivo para duvidar do funcionamento das instituições ou para de a considerar que pode estar em causa o funcionamento da Direcção Regional de Educação, dos seus de serviços. Aquilo que é a minha preocupação é que no âmbito deste processo estejam garantidos a existência dos mecanismos de defesa.

Boa tarde.A Ministra da Educação pronunciou sobre a polémica do professor suspenso. Maria de Lurdes Rodrigues disse que vai aguardar pelos resultados do processo que está a decorrer e garantiu que não tem motivos para duvidar do funcionamento da Direcção Regional de Educação do Norte, que suspendeu passou por ter feito um comentário pose do primeiro-ministro. A ministra disse lamentar que este tipo de pesados marquem à agenda mediática. Até este momento do muito que o li e ouvi não tenho nenhum sinal. Não tenho nenhum motivo para duvidar do funcionamento das instituições ou para de a considerar que pode estar em causa o funcionamento da Direcção Regional de Educação, dos seus de serviços. Aquilo que é a minha preocupação é que no âmbito deste processo estejam garantidos a existência dos mecanismos de defesa.

Boa tarde.A Ministra da Educação pronunciou-se sobre a polémica do professor suspenso.Maria de Lurdes Rodrigues disse que vai aguardar pelos resultados do processo que está a decorrer e garantiu que não tem motivos para duvidar do funcionamento da Direcção Regional de Educação do Norte, que suspendeu o professor por ter feito um comentário a propósito do primeiro-ministro.A ministra disse lamentar que este tipo de episódios marquem a agenda mediática.Até este momento do muito que li e ouvi não tenho nenhum sinal. Não tenho nenhum motivo para duvidar do funcionamento das instituições ou para considerar que pode estar em causa o funcionamento da Direcção Regional de Educação, ou dos seus serviços.Aquilo que é a minha preocupação é que no âmbito deste processo estejam garantidos a existência dos mecanismos de de defesa.

2

1

3

4 5

Figure 1.3: Excerpt of a transcribed text, with different markup conditions.


italization information. The third rectangle shows the same text, enriched with capitalization,punctuation, and the corresponding segmentation. The last two rectangles also show that thisexcerpt cannot be seen exactly as written text, due to the presence recognition errors and otherphenomena that occurs in speech, such as disfluencies. The last rectangle shows the perfectresult, without recognition errors, and where the punctuation and capitalization was correctlyassigned. The third rectangle shows the desired output for the modules developed in the scopeof this thesis. Whereas the output still contains a number of recognition errors, the punctuationand punctuation information was correctly assigned.

The output of a speech recognition system includes a broad set of lexical, acoustic andprosodic information, such as: time gaps between words, speaker clusters and speaker gender,which can be combined to produce the best results. On the other hand, the speech signal isalso an important source of information when certain features, such as the pitch and energy,are not available on the recognition output. An important initial goal addressed by this thesisconsisted of investigating and evaluating different punctuation and capitalization methods.An important requirement for a given method is its ability in combining all the available andrelevant information. An additional outcome expected from this thesis is to better understandthe individual contribution of each feature to the final performance, in both tasks.

Current computational resources allow the manipulation of large-sized data, and the appli-cation of complex learning methods on such data. On the other hand, we are now witnessingthe mass production of online written content in the web. Different Portuguese newspapercompanies are now publishing their news and last minute news content freely available on theweb. This written corpus constitutes an important resource for processing the Portuguese lan-guage, and it also provides important basic information for speech processing. Given that onlya limited set of manually labelled speech training data is now available for Portuguese, oneof the main goals of this thesis consisted of using additional sources of information wheneverpossible, including large written corpora.

Finally, the Portuguese language is spoken by a large community in many countries aroundthe world, such as Brazil and many African countries. For that reason it would be importantthat this work could be easily extended to such language varieties and even to other languages.The research conducted in the scope of this thesis is, as much as possible, language indepen-dent, but a special focus is given to the specific problems of Portuguese. The extension toother languages is restricted to a number of experiments concerning the English and Spanishlanguages.

1.4 Proposed Strategy

This study considers both punctuation and capitalization tasks as two classification tasks,thus sharing the same classification approach. Both tasks will be performed using the same

1.5. DOCUMENT STRUCTURE 9

maximum entropy (ME) modeling approach, a discriminative approach, suitable for dealingwith speech transcripts, which includes both read and spontaneous speech, the latter beingcharacterized by more flexible linguistic structures and by adjustments to the communicativesituation (Blaauw, 1995). The use of a discriminative approach facilitates the combination ofdifferent data sources and different features for modeling the data. It also provides a frame-work for learning with new data, while slowly discarding unused data, making it interestingfor problems that comprise language variations in time, such as capitalization. With this ap-proach, the classification of an event is straightforward, making it interesting for on-the-flyintegration, with strict latency requirements.

The capitalization of a word depends mostly on the context where that word appears, andcan be regarded as a sequence labeling or a lexical ambiguity resolution problem. The HiddenMarkov Model (HMM) framework is a typical approach, used since the early studies, that canbe easily applied to such problems. That is because computational models for sequence label-ing or lexical ambiguity resolution usually involve language models (LM) built from n-grams,which can also be regarded as Markov models. For that reason, capitalization experimentsreported will include comparative results achieved using an HMM-based approach. Ratherthan comparing with other approaches, punctuation experiments will focus on the usage ofadditional information sources, and the wide range of features provided by the speech data.

1.5 Document Structure

This document presents the research developed under the scope of this thesis and pointspossible future directions for the ongoing research. The document is structured as follows:Chapter 2 makes an overview of the Rich Transcription, describing the current state of the arton specific tasks of this domain, the metrics currently in use, some of them adopted for this doc-ument, and the approach adopted here for automatic punctuation and capitalization. Chapter3 describes the different corpora used for training and testing, focusing on issues related withits treatment and on the feature extraction process. Chapter 4 presents the capitalization task,and reports different experiments comparing different methods and analysing the impact oflanguage variation over time. Chapter 5 deals with the punctuation task and the correspond-ing experiments. Finally, Chapter 6 presents the conclusions and proposes a number of futuretasks to further extend the work here described.


2State of the Art

Spoken language is similar to written text in many aspects, but is different in many oth-ers, mostly due to the way these communication methods are produced. Current ASR systemsfocus on minimizing the WER (Word Error Rate), making few attempts to detect structural in-formation which is available in written texts. Spoken language is also typically less organizedthan textual material, making it a challenge to bridge the gap between spoken and written ma-terial. The text produced by a standard speech recognition system consists of raw single-casewords, without punctuation marks, with numbers written as text, and with many differenttypes of disfluencies. The representation format of this text, equivalent to SNOR (StandardNormalized Orthographical Representation), is difficult to read and sometimes even hard tounderstand because of the missing information (Jones et al., 2005b). Moreover, the missinginformation, specifically punctuation, sentence boundary, and capitalization, is also importantfor many types of automatic downstream processing, such as parsing, information extraction,dialog act modeling, NER, summarization, and translation (Shriberg et al., 2000; Zechner, 2002;Huang and Zweig, 2002; Kim and Woodland, 2003; Kahn et al., 2004; Niu et al., 2004; Ostendorfet al., 2005; Jones et al., 2005a; Makhoul et al., 2005; Shriberg, 2005; Khare, 2006; Matusov et al.,2006; Cattoni et al., 2007; Ostendorf et al., 2008). For example, Kahn et al. (2004) and Harperet al. (2005) reveal that the parsing accuracy is strongly affected by sentence boundary detec-tion errors. Makhoul et al. (2005) show that punctuation marks can improve the accuracy ofinformation extraction algorithms over speech transcripts. Other studies have shown that thepunctuation marks, or at least sentence boundaries, are important for machine translation (Ma-tusov et al., 2006; Cattoni et al., 2007) and summarization (Zechner, 2002). The capitalizationinformation is also important for human readability and other tasks, such as parsing, machinetranslation, and NER (Lita et al., 2003; Niu et al., 2004; Khare, 2006; Wang et al., 2006).

The rich transcription process usually involves the process of recovering structural infor-mation, which has been the focus of many studies (Heeman and Allen, 1999; Kim et al., 2004),and the creation of metadata from that information. Liu et al. (2005) presents an overviewof the research on metadata extraction, in the scope of the Rich Transcription project, fromthe DARPA EARS program. The paper focuses on the detection of structural information inthe word stream, covering four main tasks: Sentence Unit detection, Edit word detection, Fillerword detection, and Interruption point detection. Speaker diarization, excluded from the scope

12 CHAPTER 2. STATE OF THE ART

of this thesis, is overviewed by Reynolds and Torres-Carrasquillo (2005). Chen et al. (2006);Soltau et al. (2005) describe the advances on the IBM speech recognition technology during theEARS program. Liu et al. (2004, 2006) describe the ICSI-SRI-UW system for extraction of meta-data, also previously developed as part of the EARS Rich Transcription program. The systemincludes sentence boundary detection, filler word detection, and detection/correction of disflu-encies. The paper reports results on the NIST Rich Transcription metadata tasks. Strassel et al.(2003) describes the efforts at LDC (Linguistic Data Consortium) to create shared resources forimproved speech-to-text technology, motivated by the EARS program demands. The DARPAGALE program1 is an on-going project, also involving LDC, whose goal is to develop and ap-ply computer software technologies to absorb, analyze and interpret huge volumes of speechand text in multiple languages. The LDC now provides quick transcription specifications fora number of languages, including: Arabic, Chinese, and English, and will collect transcriptsfor large volumes of speech, conform such specifications. The specification elements includeaccurate transcription of sentences (segmentation), sentence type identification, standardizedpunctuation, and orthography.

A fair doubt for punctuation and capitalization is whether the ASR system can be adaptedfor dealing with both tasks, instead of creating additional modules. The work reported by Kimand Woodland (2004) addresses this question by proposing and evaluating two methods: i)adapting the ASR system for dealing with both punctuation and capitalization, by duplicatingeach vocabulary entry with the possible capitalized forms, modeling the full stop with silence,and training with capitalized and punctuated text, and ii) using a rule-based named entitytagger and punctuation generation. The paper shows that the first method produces worseresults, due to the distorted and sparser language model, thus suggesting the separation of thepunctuation and capitalization tasks from the speech recognition system.

A number of studies consider punctuation and capitalization recovery as two connectedtasks and perform them concomitantly (Mikheev, 1999, 2002; Baldwin and Joseph, 2009; Gra-vano et al., 2009; Lu and Ng, 2010). For example, Stevenson and Gaizauskas (2000) perform anumber of experiments using human annotators, and conclude that case information is an im-portant feature for sentence boundary detection. That is also according to Baldwin and Joseph(2009), who conclude that these two tasks are highly dependent, and that if we can get one ofthe two tasks correct, the other becomes considerably easier. Nevertheless, each one of thesetasks presents its own challenges, and therefore they are treated individually by most of the re-ported work. The remainder of this section overviews the related work on each one of the tworich transcription tasks, describes the approach adopted for this study, considering a numberof requirements, and presents the performance evaluation metrics adopted here.

1http://projects.ldc.upenn.edu/gale/

2.1. RELATED WORK ON CAPITALIZATION 13

2.1 Related Work on Capitalization

The capitalization task, also known as truecasing (Lita et al., 2003; Jurafsky and Martin,2009), consists of rewriting each word of an input text with its proper case information givenits context. Many languages distinguish between uppercase and lowercase letters, even so,capitalization does not apply to a number of languages that do not use Latin, Greek or Cyrillicscripts, such as Chinese, Thai, Japanese, Arabic, Hindi, Hebrew, etc. Proper capitalization canbe found in many information sources, such as newspaper articles, books, and most of the webpages. Besides improving the readability of texts, capitalization provides important semanticclues for further text processing tasks. Different practical applications benefit from automaticcapitalization as a preprocessing step: many computer applications, such as word processingand e-mail clients, perform automatic capitalization along with spell corrections and grammarcheck; and while dealing with speech recognition output, automatic capitalization providesrelevant information for automatic content extraction, Named Entity Recognition, and MachineTranslation. For example, Kubala et al. (1998) perform NER over speech data and conclude thatthe performance is affected by the lack of punctuation and capitalization information, speciallywhen dealing with proper names, given that most entities involve capitalized words.

In applications where capitalization is expected, a typical approach consists of modifyingthe process that usually relies on case information in order to suppress the need of that infor-mation (Brown and Coden, 2002; Manning et al., 2008). The NE extraction on ASR output, acore task in the DARPA’s sponsored Broadcast News workshops, is a good example of suchapproach. Bikel et al. (1997) describes a HMM-based (Baum and Petrie, 1966) NE extractionsystem that performed well in these circumstances. When trained with data previously con-verted to lowercase, the system performance was almost as good as when tested with mixedcase data. An alternate approach to modify the applications for lowercase input is to previouslyrecover the capitalization information, which can also benefit a number of other applicationsthat use case information. Concerning this subject, Niu et al. (2004) propose an alternative two-step approach to Named Entity tagging. Instead of excluding the case-related features, whichis the traditional approach, the authors introduce a preprocessing module designed to restorecase information. Based on the observation that the size of a training corpus is often a moreimportant factor than the complexity of the model, the authors use a bigram HMM, trainedusing a large corpus (7M words) of case sensitive documents, concluding that this approach(i) outperforms the feature exclusion approach for Named Entity tagging, (ii) leads to limiteddegradation for semantic parsing and relationship extraction, (iii) reduces system complexity,and (iv) has wide applicability: the restored text can feed both statistical model and rule-basedsystems. Also concerning this subject, Kim and Woodland (2004) describe a CapitalizationGeneration system used in their NER system. The authors show that using a ruled-based NEtagger and punctuation generation is better than duplicating each vocabulary entry with thepossible capitalized forms, model the full-stop with silence, and to train the ASR with capital-


ized and punctuated text. The performance measures include Precision, Recall and SER (SlotError Rate), being that only three different capitalization classes are considered, and that thefirst word of each sentence is used in the evaluation.

All the capitalization results reported in this thesis do not consider the first word of asentence for evaluation, because its capitalized form depends mostly on its position in the sen-tence. However, the connection between case information and punctuation has been reportedby a number of studies. Mikheev (1999, 2002) presents an approach to the disambiguation ofcapitalized words, consisting of a cascade of different simple positional heuristics, but onlywhere capitalization is expected, such as the first word of the sentence or after a period. Thisstudy was performed over written corpora where the capitalization is provided. Another studyrecovering capitalization for punctuated texts involving heuristics is reported by Brown andCoden (2002), where a series of techniques and heuristics are evaluated. The co-dependence ofcase information and punctuation was also recently investigated by Baldwin and Joseph (2009)and Gravano et al. (2009). In both studies punctuation and case information is restored simul-taneously in English texts. Baldwin and Joseph (2009) explores multi-class SVMs, and Gravanoet al. (2009) uses purely text-based n-gram language models.

Capitalization can be viewed as a lexical ambiguity resolution problem, where each wordhas different graphical forms (Yarowsky, 1994; Gravano et al., 2009). A pilot example of suchapproach is reported by Yarowsky (1994), which presents a statistical procedure for lexical am-biguity resolution, based on decision lists, that achieved good results when applied to accentrestoration in Spanish and French. The capitalization and accent restoration problems can betreated using the same methods, given that a different accentuation can be regarded as a dif-ferent word form. Capitalization can also be viewed as a specialized spelling correction byconsidering different capitalization forms as spelling variations (Lita et al., 2003). Finally, thecapitalization problem may also be seen as a sequence tagging problem, where each lower-case word is associated to a tag that describes its capitalization form (Lita et al., 2003; Kim andWoodland, 2004; Chelba and Acero, 2004; Khare, 2006). This corresponds to a classificationproblem that can be dealt with a vast number of approaches, like POS taggers, HMM-based,ME-based, SVM-based, MEMM-based (McCallum et al., 2000), and CRF-based (Lafferty et al.,2001) classifiers.

Lita et al. (2003) build a trigram language model (LM) with pairs (word, tag), estimatedfrom a corpus with case information, and then use dynamic programming to disambiguateover all possible tag assignments on a sentence. This paper reports experiments that reveal apositive impact of the capitalization in named entity recognition, automatic content extraction,and machine translation.

Chelba and Acero (2004) study the impact of using increasing amounts of training data aswell as a small amount of adaptation. This work uses an approach based on Maximum EntropyMarkov Models (MEMM). A large written newspaper corpus (WSJ) is used for training and the

2.1. RELATED WORK ON CAPITALIZATION 15

test data consists of Broadcast News (BN) data (CNN and ABC prime time).

Khare (2006) finds the usefulness of Joint Learning to the tasks of NER and CapitalizationGeneration. The study goes further to look for feature sets that help or do not help the Jointtask. This is achieved by using Dynamic Conditional Random Fields (DCRFs) as models forexperiments with the two tasks. The Joint model is compared with both simple systems foreach task that do not use the other task, and with traditional pipeline systems that perform thetwo tasks sequentially. The paper concludes that either joint learning or NER tagging do nothelp Capitalization. Despite that, errors made in the Capitalization are crucial for NER, andthat improving Capitalization improves NER significantly.

Other related work includes a bilingual capitalization model for capitalizing MachineTranslation (MT) outputs using Conditional Random Fields (CRFs) and is reported by Wanget al. (2006). This work exploits case information both from source and target sentences of theMT system, producing better performance than a baseline capitalizer using a trigram languagemodel. Another truecasing module that works inside a machine translation system is presentedby Agbago et al. (2005). The module is used in a Portage system and combines a n-gram lan-guage model, a case mapping model, and a specialized language model for unknown words.The module presents 80% relative error rate reduction over the baseline using unigrams only.Also concerning machine translation, Stüker et al. (2006) describe a set of evaluation systemsthat participated in the TC-STAR 2006 (Technology and Corpora for Speech to Speech Transla-tion) evaluation. The capitalization of the recognition output is performed in a post-processingstage, after the actual decoding procedure, and before the punctuation. The process relies on a4-gram language model, built both from the transcriptions and the final text editions.

A recent work performing experiments on large corpora using different n-gram orders isreported by Gravano et al. (2009). This paper is of particular interest not only because of thehigh complexity of the applied models, where the n-gram order varies from n = 3 to n = 6, butalso because of the large amount of training data, which varies from 58 million to 55 billion to-kens. The paper concludes that using larger training data sets leads to increasing improvementsin performance, but increasing the n-gram order does not significantly improve capitalizationresults. However, it seems that capitalization results consider the first word of each sentence,whose capitalization is highly dependent from the assigned punctuation, implying that thesetwo tasks cannot be measured separately.

Most of the words and structures of a language is not subject to diachronic substantialchanges. However, the usage of new words (neologisms), frequently introduced, and the usageof others that may decay with time (archaisms) introduces dynamics in the lexicon. Being partof the Language Dynamics emerging field (Wichmann, 2008), this problem has been addressedfor Portuguese BN in the work of Martins et al. (2007b), which proposes a daily adaptation ofthe vocabulary and language model to the topic of current news, based on texts daily availableon the Web. Also concerning this subject, Mota and Grishman (2008, 2009) analyze the relation


between corpora variation over time and the NER performance, proving that, as the time gapbetween training and test data increases, the performance of a named entity tagger based onco-training Blum and Mitchell (1998); Collins and Singer (1999) also decreases. These studieshave shown that, as the time gap between corpora increases, the similarity between the cor-pora and the names shared between those corpora also decreases. The language adaptationproblem concerning capitalization has been addressed by Batista et al. (2008d,c), concludingthat the capitalization performance is influenced by the training data period. All these studiesemphasize the relation between named entities and capitalized words, showing that they areinfluenced by time variation effects.

2.2 Related Work on Punctuation and Sentence Boundary Detection

When dealing with conversational speech, the notion of utterance or sentence-like unit isoften used instead of “sentence” (Strassel, 2004; Liu et al., 2006; Jurafsky and Martin, 2009). ASU may correspond to a grammatical sentence, or can be semantically complete but smallerthan a sentence. Detecting a SU consists of finding its limits, and roughly corresponds tothe task of detecting positions where a punctuation mark is missing. The problem of sen-tence boundary detection is connected to the punctuation recovery problem, especially withrespect to predicting sentence boundary punctuation like full-stops, question marks, and excla-mation marks (Shieber and Tao, 2003). Nevertheless, this problem is distinct from the sentenceboundary disambiguation problem, where the punctuation is provided and the task consistsof deciding if whether or not it marks a sentence boundary (Palmer and Hearst, 1994, 1997;Reynar and Ratnaparkhi, 1997).

Despite being originally used mostly for marking breaths, punctuation is nowadays usedfor marking structural units, thereby used to disambiguate meaning and to provide cues tocoherence of the written text (Kowal and O’Connell, 2008). Inserting punctuation marks intospoken texts is a way of approximating such texts to written texts, keeping in mind that speechdata is linguistically structured. Despite that, a punctuation mark may assume different behav-ior in speech, for example, a sentence in spontaneous speech does not always correspond to asentence in written text. A large number of punctuation marks can be considered for spokentexts, including: comma; period or full stop; exclamation mark; question mark; colon; semicolon; andquotation marks. However, most of these marks rarely occur and are quite difficult to insert orevaluate. Hence, most of the available studies focus on full stop and comma, which have highercorpus frequencies. A number of studies also consider the question mark, but most of them havenot yet shown promising results (Christensen et al., 2001). Previous work on other punctuationmarks, such as exclamation marks, are rarely found on the literature.

Sentence boundary detection has gained increasing attention during recent years, and ithas been part of the NIST rich transcription evaluations. It provides a basis for further natural

2.2. RELATED WORK ON PUNCTUATION AND SENTENCE BOUNDARY DETECTION 17

language processing, and its impact on subsequent tasks has been recently analyzed in manyspeech processing studies (Harper et al., 2005; Mrozinsk et al., 2006; Ostendorf et al., 2008). Thedetection of sentence boundaries is one of the main structural events annotated in the DARPAEARS rich transcription program. This topic is addressed in Liu et al. (2005), where prosody isshown to be more helpful for broadcast news than for conversational speech, and recognitionerrors are shown to affect the performance significantly.

Recovering hidden punctuation or sentence boundaries is considered by Shriberg (2005) asthe first of the four main spoken language properties of spontaneous speech that impose chal-lenges for spoken language applications, on the basis that these properties violate assumptionsoften applied in automatic processing technology. Different approaches have been reported toaddress the punctuation recovery problem. Computational models for detecting punctuationmarks and sentence boundaries in speech typically involve a combination of N-gram languagemodels and prosodic classifiers. The HMM framework is a common approach, used since theearly studies for similar problems, that allows combining different knowledge sources. Morerecently, other model types have been used successfully, such as Maximum Entropy (ME) mod-els and Conditional Random Fields (CRFs).

The general HMM framework for detecting sentence boundaries, combining lexical andprosodic cues, has been reported by a number of studies (Gotoh and Renals, 2000; Shriberget al., 2000; Liu et al., 2006; Stolcke and Shriberg, 1996; Stolcke et al., 1998). A similar approachwas also used for punctuation recovery by Kim and Woodland (2001) and Christensen et al.(2001). One of the first studies on sentence segmentation of speech is reported by Stolcke andShriberg (1996). It uses an n-gram model, based on linguistic features and turn markers. Later,Stolcke et al. (1998) study a combined approach for detecting sentence boundaries and fourclasses of disfluencies on spontaneous, automatically transcribed telephone speech. The systemcombines prosody, modeled by decision trees, and n-gram language models. The study demon-strated that model combination yields significantly better results than just using individualmodels. Gotoh and Renals (2000) present an approach for identifying sentence boundaries inbroadcast speech transcripts, based on FSM (Finite State Machines). This work concludes thata model estimated from pause duration information outperforms an n-gram language modelbased on textual information, but the combination of the two models achieves even better re-sults. Shriberg et al. (2000) combine prosodic cues with word-based approaches, showing thatthe prosodic model alone performs on par with, or better than, word-based statistical languagemodels. The paper concludes that prosodic models capture language-independent boundaryindicators.

A multi-pass linear fold algorithm for sentence boundary detection in spontaneous speechthat uses prosodic features has been proposed by Wang and Narayanan (2004). This studyfocus on the relation between sentence boundaries and their correlates, pitch breaks and pitchdurations, covering their local and global structural properties. Detecting sentence boundarieswas also addressed by Liu et al. (2006), who report state-of-the-art results according to the NIST


RT-04F evaluation. Besides the common HMM approach, the usage of maximum entropy andconditional random fields is also investigated, conducting experiments in both broadcast newsdata and conversational telephone speech. The system described combines information fromdifferent types of textual knowledge sources, with information from a prosodic classifier. Thepaper reports that discriminative models usually outperform the generative models.

The ICSI+ sentence segmentation system (Zimmermann et al., 2006) works both on Englishand Mandarin, and is a result of a joint effort involving ICSI, SRI and UT Dallas. The systemuses an HMM approach for exploiting lexical information, and maximum entropy and boostingclassifiers to exploit lexical and prosodic features, speaker changes and syntactic information.The methodology uses prosodic features, including pitch-based and energy-based features, andis significantly better than a baseline system based on words and pause features. The paperconcludes that the pause duration in between words is a very important feature for this task,as well as features derived from speaker turns coming from the speaker diarization system.The work reported by Favre et al. (2008) moves towards the use of long-distance dependenciesin combination with local features. The authors construct an initial hypothesis lattice usinglocal features, and then assign syntactic language model scores to the candidate sentences.The resulting system, that combines global syntactic scores with local scores, outperforms thepopular HMM model for sentence segmentation.

In terms of punctuation marks, comma is the most frequent, but it is also the most problem-atic because of it serves many different purposes, and it is used in different syntactic contexts.It can be used, for example, to separate elements in a series (e.g., They can read, write, and ex-ecute); to separate two independent clauses joined by a coordinating conjunction (e.g., Theyhave completed the tasks, but some will have to be repeated); to separate long independent construc-tions; to set off or enclose certain adverbs (e.g., therefore, nevertheless); to enclose parentheticalwords and phrases within a sentence; to separate each group of three digits in representinglarge numbers (in English texts), or as the decimal separator (Portuguese texts); to separate ele-ments in dates; in geographical names; and also prevent misreading, by separating words thatmight otherwise be misread as closely related (e.g., After the bear had eaten the zookeeper cleanedits cage vs. After the bear had eaten, the zookeeper cleaned its cage). Punctuation marks are closelyrelated with syntactic, and semantic properties. Thus, the presence/absence of a comma in spe-cific locations may influence the grammatical judgments of the SUs. As synthesized by Duarte(2000), commas should not be placed between: i) the subject and the predicate; ii) the verb andthe arguments; iii) the antecedent and the restrictive relative clause; iv) before the copulativeconjunction e/and. Then again, commas should separate: i) adverbial subordinate clauses, suchas participial or gerundive ones; ii) appositive modifiers; iii) parenthetical constituents; iv) an-teposed constituents; v) asyndetically coordinated constituents; and vi) vocatives.

Concerning the question mark, European Portuguese (EP), as other languages, has differentinterrogative types (Mateus et al., 2003): yes/no questions (total/global interrogatives), alterna-

2.2. RELATED WORK ON PUNCTUATION AND SENTENCE BOUNDARY DETECTION 19

tive2 questions, wh- (partial questions) and tag questions. A yes/no question requests a yes orno answer (Estão a ver a diferença?/Can you see the difference?). In EP they generally present thesame syntactic order as a statement, contrarily to English that may encode the yes/no interroga-tive with an auxiliary verb and subject inversion. An alternative question presents two or morehypotheses (Acha que vai facilitar ou vai ainda tornar mais difícil?/Do you think that it will makeit easier or will it make it even harder?) expressed by the disjunctive conjunction ou/or. A wh-question has a wh interrogative pronoun or adverb, such as o que/what, quem/who, quando/when,onde/where, etc., corresponding to what is being asked about (Qual é a ideia?/What is the idea?). Ina tag question, an interrogative clause is added to the end of a statement (Isto é fácil, não é?/Thisis easy, isn’t it?).

The published literature on intra-sentence punctuation recovery is quite limited. Beefer-man et al. (1998) describe a lightweight method for automatically inserting intra-sentence punc-tuation marks into text. This method relies on an HMM with trigram probabilities, built usingsolely lexical information, and it uses the Viterbi algorithm for classification. The paper focuson the comma restoration problem, and presents a qualitative evaluation based on user satisfac-tion. It concludes that the system’s performance is qualitatively higher than sentence accuracyrate would otherwise indicate. The use of syntactic information for tackling the comma restora-tion problem is reported by (Shieber and Tao, 2003; Favre et al., 2009). Shieber and Tao (2003)show improved results over the use of lexical information alone, after replicating the Beefer-man et al. (1998) trigram-based model. Favre et al. (2009) analyse the impact of the syntacticfeatures on other subsets of features, and conclude that syntactic cues can help characteriz-ing large syntactic patterns such as appositions and lists, which are not necessarily marked byprosody. All these papers assume sentence boundaries as given, which is not the case in theexperiments reported in this thesis, where no punctuation whatsoever is assumed, includingsentence boundaries. For that reason, a direct comparison of these studies with the work herereported is quite limited.

Kim and Woodland (2001) generate the punctuation simultaneously with the speech recog-nition output, and the multiple hypothesis are re-scored using prosodic information. Theyconclude that prosodic information alone outperforms the use of lexical information, but thebest results are achieved by combining all information, since that information is complemen-tary. Their experiments report a small WER reduction (0.2%). Besides the HMM framework,Christensen et al. (2001) present an alternate approach to automatic punctuation of speech tran-scripts, based on MLP (multi-layered perceptions), that also models prosodic features. Despitethe different test sets, they consider their results similar to results reported by Kim and Wood-land (2001). Both studies consider full-stops, commas and question marks, but the later does notdiscriminate between the three punctuation marks.

2In the literature, alternative questions may not be considered as a type of interrogative, rather as a subtype. Forthe sake of distinguishing alternative questions from disjunctive declarative clauses, we included the alternativesas well.


Huang and Zweig (2002) describe an ME-based method for inserting punctuation marksinto spontaneous conversational speech. This work views the punctuation task as a taggingtask, where words are tagged with the appropriate punctuation. The ME tagger uses bothlexical and prosodic features, it uses the Switchboard corpus released by LDC for training andHub5-2000 evaluation data for testing (about 20% WER). This work covers three punctuationmarks: comma, period and question mark. The best results on the ASR output are achieved usingbigram-based features and by combining lexical and prosodic features. (full-stop: 73% Precisionand 65% Recall, comma: 77% Precision and 74% Recall, and question mark: 64% Precison and 14%Recall).

Stüker et al. (2006) describes the ISL machine translation system used in the TC-STAR 2006evaluation. In this system, the output is enriched with punctuation marks, or as the authorscall them, boundary marks, by means of a case-sensitive, 4-gram language model and hard-coded rules based on pause duration. In this system, the punctuation is performed after thecapitalization. Also concerning Machine Translation, Cattoni et al. (2007) report a system thatrecovers punctuation directly over confusion networks. This paper compares three differentways of inserting punctuation and concludes, by evaluation over manual transcriptions, thatthe best results are achieved when the training corpus include punctuation marks in both lan-guages, which means that the translation is performed from punctuated input to punctuatedoutput.

Lu and Ng (2010) propose an approach based on CRFs, which jointly perform both sentenceboundary and sentence type prediction, as well as punctuation prediction on speech utterances.Evaluations have been performed on English and Chinese transcribed conversational speechtexts. The authors use an empirical evaluation and conclude that their method outperforms anapproach based on linear-chain conditional random fields and other previous approaches.

Other recent studies have shown that the best performance for the punctuation task isachieved when prosodic, morphologic and syntactic information are combined (Liu et al., 2006;Ostendorf et al., 2008; Favre et al., 2009).

2.3 The Maximum Entropy Approach

Most of the experiments described in this document treat the punctuation and capitaliza-tion tasks as two classification tasks, thus being able to share the same approach. The approachis based on logistic regression classification models, which corresponds to the maximum en-tropy classification for independent events, firstly applied to natural language problems in(Berger et al., 1996). A maximum entropy model estimates the conditional probability of theevents given the corresponding features. Let us consider the random variable y ∈ C that cantake k different values, corresponding to the classes c1, c2, ... ,ck. The maximum entropy modelis given by the following equation:

2.3. THE MAXIMUM ENTROPY APPROACH 21

P(c|d) = 1Zλ(F) × exp

(∑

iλci fi(c, d)

)

determined by the requirement that

∑c∈C P(c|d)=1.

Zλ(F) is a normalizing term, used to make the exponential a true probability, and is given by:

Zλ(F) = ∑c′∈C

exp

(∑

iλc′i fi(c′, d)

)

fi are feature functions corresponding to features defined over events, and fi(c, d) is the fea-ture defined for a class c and a given observation d. The index i indicates different features,each of which has associated weights λci, one for each class. The ME model is estimated byfinding the parameters λci with the constraint that the expected values of the various featurefunctions match the averages in the training data. These parameters ensure the maximumentropy of the distribution and also maximize the conditional likelihood ∏i P(y(i)|d(i)) of thetraining samples. Decoding is conducted for each sample individually and the classification isstraightforward, making it interesting for on-the-fly usage.

Maximum entropy is a probabilistic classifier, a generalization of Boolean classification,that provides probability distributions over the classes. The single-best class corresponds tothe class with the highest probability, and is given by:

c = argmaxc∈C

P(c|d) = argmaxc∈C

exp

(∑

i

λci fi(c,d)

)

∑c′∈C

exp

(∑

i

λc′ i fi(c′,d)

)

This approach provides a clean way of expressing and combining different aspects of theinformation. This is especially useful for the punctuation task, given the broad set of lexical,acoustic and prosodic features that can be used.

Another interesting property of this method concerns feature selection, a central problem inMachine Leaning, that consists of finding the best subset of features for a given problem. Mostmethods for feature selection are based on information theory measures, like mutual informationand entropy. Maximum entropy solves this problem naturally by finding the entropy of eachfeature, which means that “more features never hurt!”.


Capitalization and Punctuation tasksTraining process

Punct & CapTranscript

Corpus

ME train

PunctuationFeatures

CapitalizationFeatures

Punctuation model

Capitalization model

ASR Transcript

PunctuationFeatures

CapitalizationFeatures

CapLexicon

ME classifier ME classifier

CapitalizedTranscript

PoS tagging

Figure 2.1: Block diagram of the capitalization and punctuation tasks.

2.3.1 Application to Rich Transcription

Figure 2.1 illustrates the classification approach for both tasks, where the left side of thepicture represents the training process using a set of predefined features, and the right sidecorresponds to classification using previously trained models. An updated lexicon containingthe capitalization of new and mixed-case words (e.g., “McGyver” is an example of a mixed-case word) can be used as a complement for producing the final capitalization form, sinceit contains the corresponding written form. Notice, however, that our evaluation results in-volve the classification only. As shown in the figure, capitalization comes first in the clas-sification pipeline, thus producing suitable information for feeding a part-of-speech tagger.Subsequently, part-of-speech information is used to aid detecting the punctuation marks, cor-responding to SU boundaries. The capitalization of the first word of each sentence is assignedin a post-processing step, based on the previously detected SU boundaries.

The maximum entropy models described in these experiments are trained using the MegaM

tool (Daumé III, 2004), which uses an efficient implementation of conjugate gradient (for binaryproblems) and limited memory BFGS (for multiclass problems) for training the ME models.The MegaM tool includes an option for predicting results from previously trained models. Nev-ertheless, by the time these experiments started, it was not prepared to deal with a stream ofdata, producing results only after completely reading the input. An on-the-fly predicting toolwas created, that uses the models in their original format and overcomes this problem.

2.3. THE MAXIMUM ENTROPY APPROACH 23

Trigram counts Class Weight Features

w1w2w3 ⇒ count1 class(w2) WEIGHT=count1 W:w2 PB:w1w2 NB:w2w3 T:w1w2w3w2w3w4 ⇒ count2 → class(w3) WEIGHT=count2 W:w3 PB:w2w3 NB:w3w4 T:w2w3w4w3w4w5 ⇒ count3 class(w4) WEIGHT=count3 W:w4 PB:w3w4 NB:w4w5 T:w3w4w5

... ...

Figure 2.2: Conversion of trigram counts into features.

2.3.2 Large Corpora Issues

This approach requires all information to be expressed in terms of features, causing theresultant data file to become several times larger than the original one. Capitalization mod-els, for example, are usually trained using large written corpora, which contain the requiredcapitalization information. On the other hand, the memory required for training with this ap-proach increases with the size of the corpus (number of observations). The MegaM tool, usedin our experiments, requires the training to be performed in a single machine and using allthe training data in a single step. This constitutes a training problem, making it difficult touse large corpora for training. Two different training strategies are proposed here to deal withthese memory limitations and minimize this problem:

N-gram counts based strategy is based on the fact that, in MegaM, scaling the event by the num-ber of occurrences is equivalent to use multiple occurrences of that event. Accordinglyto this, our strategy to use large training corpora consists of counting all n-gram occur-rences in the training data and then use such counts to produce the corresponding inputfeatures. Figure 2.2 illustrates this process considering trigram counts. The class of eachword corresponds to the type of capitalization observed for that word. Each trigram pro-vides a feature vector concerning the middle word, namely: W (current word), PB (previ-ous bigram), NB (next bigram), and T (trigram containing the three words). By pruningthe less frequent n-grams if necessary, this strategy allows the usage of large corpora sets.Higher order n-grams can be used, however, it is not possible to produce all the desirablerepresentations from n-gram counts, for example, sentences containing less than n wordsare discarded in n-gram counts, which may conduct to defective results.

Retraining strategy The memory problem can also be solved by splitting the corpus into sev-eral subsets, and then iteratively retraining with each one separately. The first subset isused for training the first ME model, which is then used to provide initial values for theweights of the next iteration over the next subset. This process goes on, comprising sev-eral epochs, until all subsets are used. Although the final ME model contains informationfrom all corpora subsets, events occurring in the latest training sets gain more impor-tance in the final model. As the training is performed with the new data, the old modelsare iteratively adjusted to the new data. This approach provides a clean framework for


Ref: w1 w2 w3 w4 w5 . w6 w7 , w8 w9 w10 .Hyp: w1 w2 . w3 w4 w5 w6 w7 . w8 w9 w10 .

ins del sub cor

Ref: here is an Example of a Big capitalization SERHyp: here Is an example of a BIG capitalization SER

ins del sub cor

Figure 2.3: Example of correct and incorrect slots.

language dynamics adaptation: i) new events are automatically considered in the newmodels; ii) the final discriminative model collects information from all corpora subsets;iii) with time, unused events slowly decrease in weight (Batista et al., 2008d,c). Sortingthe trained model by the relevance of each feature and limiting the number of featureskept in each model, it is possible to reduce computational resources for the next trainingstage, without much impact on the results.

2.4 Evaluation Metrics

Throughout this chapter, several evaluation metrics have been quoted, which will now bedescribed in detail. Precision, Recall, F-measure, and Slot Error Rate (SER) (Makhoul et al.,1999) are defined in equations (2.1) to (2.4). All these metrics are based on slots, which, for thepunctuation task, corresponds to the occurrence of a punctuation mark in the corpus, and forthe capitalization task, corresponds to all words not written as a lowercase form.

Precision =C

Hyp=

CC + S + I

(2.1)

Recall =C

Re f=

CC + S + D

(2.2)

F−measure =2 ∗ Precision ∗ Recall

Precision + Recall(2.3)

SER =total slot errors

Re f=

I + D + SC + D + S

(2.4)

In the equations, C is the number of correct slots; I is the number of insertions (spuriousslots / false acceptances); D is the number of deletions (missing slots / false rejections); S isthe number of substitutions (incorrect slots); Re f is the number of slots in reference; and Hypis the number of slots in hypothesis. The first three performance metrics are guaranteed toassume values between 0 and 1, but that is not the case of the SER, which can assume values

2.4. EVALUATION METRICS 25

greater than 100%. Both examples presented in Figure 2.3 achieve 33% Precision, 33% Recall,and 33% F-measure. However, the SER is 100%, which may be a more meaningful measure,given that the number of slot errors in the example is greater than the number of correct ones.F-measure is a way of having a single value for measuring all types of errors simultaneouslybut, as reported by Makhoul et al. (1999), “this measure implicitly discounts the overall errorrate, making the systems look like they are much better than they really are”.

These performance metrics are widely used by the scientific community, but a number ofvariations in their usage can be found. For example, Kim and Woodland (2001) assume thatcorrectly placed but wrongly identified punctuation marks count as half an error, thereforeimproving the performance of all the metrics here presented.

The challenge of obtaining more meaningful performance measures for MDE scoring hasbeen accepted by Ostendorf and Hillard (2004). The authors analyse the performance mea-sures for MDE, specifically for SU detection, and propose event-based statistical significancemeasures. A study conducted by Liu and Shriberg (2007) shows the advantages of curves overa single metric for sentence boundary detection. Other metrics, based on performance curves,have also been proposed, and can also be used for more adequate analysis. The ROC (ReceiverOperating Characteristic) curve has been used for this propose, and it basically consists of plot-ting the false alarm rate on the horizontal axis, while the correct detection rate is plotted onvertical. Martin et al. (1997) proposes the DET (Decision Error Tradeoff) curve, a variant of theROC curve, where error rates are plotted in both axes.

Most of the results presented in the scope of this thesis include the standard metrics: Pre-cision, Recall, F-measure, and Slot Error Rate. However, the SER is the preferred performancemetric for the performance evaluation. The SER for punctuation corresponds to the NIST errorrate for sentence boundary detection, which is defined as the sum of the insertion and deletionerrors per number of reference sentence boundaries (Liu and Shriberg, 2007).


3Corpora

This chapter describes the most relevant data involved in the work performed in the scopeof this thesis, most of them reported in this document. Preliminary experiments used onlythe data from a Portuguese broadcast news corpus, but the small size of the corpus soon de-manded new data sources. This applies essentially to the capitalization task, where the lexicalfeatures are of extreme importance. Despite speech transcripts and written corpora being muchdifferent in many aspects, both types of corpora share important information concerning punc-tuation marks and capitalization information. For that reason, even if the main goal is to dealwith BN speech transcripts, large written corpora containing punctuation and capitalization in-formation for each word and the corresponding context were also used as a way of improvingthe punctuation and capitalization models.

The work described in this document has been performed using tools and corpora underdevelopment. For example, the recognition system was upgraded several times during thiswork, either for correcting existing bugs or for providing improved results. This imposes se-rious problems, since experiments with different tools or data versions may not be directlycompared. Consequently, a great number of previous experiments were repeated several timesalong with the ongoing experiments, some of them taking several weeks to be completed. Un-less otherwise stated, the data properties described in this chapter were recently calculated inorder to reflect the most recent data.

Recently, some of the work performed for Portuguese has been also ported to other lan-guages. One important task concerning this goal was to setup the different corpora sets foreach one of the languages involved. That is a difficult task, mostly because of the differentformats and annotation criteria. So far, our experiments cover three different languages: Por-tuguese, Spanish and English. The remainder of this chapter describes the BN and the WrittenNewspaper data, for each one of the languages were considered, as well as, the pre-processingsteps applied to each one of the corpora sets.

Most of the experiments performed in the scope of this thesis use either written newspapercorpora or broadcast news corpora, described in this chapter. Nevertheless, other corpora wasalso used for specific experiments and, in such cases, these resources are described only in thecontext of such experiments.

28 CHAPTER 3. CORPORA

Usage Name Recording period Duration Words WERTrain Oct., Nov. 2000 46.5h 480k 14.1Development Dec. 2000 6.5h 67k 19.6

Eval Jan. 2001 4.5h 48k 19.4Test JEval Oct. 2001 13.5h 136k 17.9

RTP07 May, June, Sep., Oct. 2007 4.8h 49k 19.1RTP08 June, July 2008 3.7h 39k 25.4

Table 3.1: Different parts of the Portuguese BN corpus.

3.1 Broadcast News Data

Besides words, speech data transcripts typically contain additional information coming fromthe speech signal, including information concerning the background noise, speaker gender,and other metadata. Manual transcripts, which provide the reference data, may include infor-mation concerning sections to be excluded, punctuation, capitalization and other phenomena,such as indication of foreign languages and disfluencies. The reference is usually divided intosegments, with information about the start and end locations in the signal file, speaker id,speaker gender, and focus conditions. Automatic transcripts, produced by speech recognitionsystems, usually include the time period corresponding to each word and other information,such as confidence measures. All our automatic transcripts were produced by the APP (Au-dio Pre-Processing) and speech recognition modules, thus including the following information:speaker cluster, speaker gender, background speech conditions (clean/noise/music), and theconfidence score for each word. Due to a recent upgrade, the recognition system now alsoprovides confidence scores for other information, such as speaker cluster, and speaker gender.

3.1.1 Portuguese Corpus

The Portuguese speech corpus (henceforth ALERT-SR corpus) is an European PortugueseBroadcast News corpus, originally collected for training and testing the speech recognition andtopic detection systems, in the scope of the ALERT European project (Neto et al., 2003; Meinedoet al., 2003). The original corpus includes two different evaluation sets: Eval and JEval, thelatter having been collected with the purpose of a “joint evaluation” among all project partners.This corpus was recently complemented with two collections of 11 BN shows from the samepublic TV channel (RTP). Table 3.1 presents details for each part of the corpus, where durationvalues represent the duration of all speech sequences (silences not included). The reportedWER (Word Error Rate) values were calculated for the recognition system as it was on May2010, but notice that these values change from time to time. The RTP08 test set was collectedwith an 8 day time span between each BN show. The corpus includes two other subsets (Pilotand 11March), which were not used for the experiments here described.

The manual orthographic transcription process follows the LDC Hub4 (Broadcast Speech)

3.1. BROADCAST NEWS DATA 29

Figure 3.1: Focus distribution in terms of speech duration for Portuguese BN.

transcription conventions1, and includes information such as punctuation marks, capital lettersand special marks for proper nouns, and acronyms. Each segment in the corpus is marked as:planned speech with or without noise (F40/F0); spontaneous speech with or without noise(F41/F1); telephone speech (F2); speech mixed with music (F3); non-native speaker (F5); anyother speech (FX). Figure 3.1 shows the corpus distribution by focus condition, revealing thatmost of the corpus consists of planned speech (F0+F40), but it also contains a large percentage(35%) of spontaneous speech (F1+F41).

3.1.1.1 Manual revision of the corpus

The manual orthographic transcripts of this corpus were recently revised by an expert lin-guist, thereby removing many inconsistencies in terms of punctuation marks that affected ourprevious results. The previous version of this corpus was manually transcribed by differentannotators, who did not follow consistent criteria in terms of punctuation marks. The revi-sion process focused mostly on correcting punctuation marks and on adding disfluency an-notation (Moniz, 2006), which were not previously annotated. Table 3.2 shows the numberof differences between the old reference data and the new, in terms of the punctuation marksand disfluency annotation. For example, the table shows that, whereas about 40k full-stopswere kept from the older version to the newer, about 3.9k were replaced by commas, 233 werereplaced by question marks, and another 1574, were simply removed by the expert. Most ofthe differences concern comma, and they are often due to different criteria when marking dis-fluencies. Since our previous data had no disfluency identification and no objective criteria

1http://www.ldc.upenn.edu/Projects/Corpus_Cookbook/transcription/broadcast_speech/english/conven-tions.html


none . , ? ! ... : ; " <>

none

.

,

?

!

...

:

;

“

<>

1574 31429 52 67 249 11 6 202 0

1239 40064 1856 33 172 63 14 5 0 0

18828 3892 41435 83 74 36 62 27 16 0

118 233 66 2017 3 11 2 1 0 0

14 13 18 0 64 0 0 0 0 0

139 30 43 1 1 172 0 0 0 0

110 83 140 0 4 0 104 8 1 0

63 52 54 0 0 0 0 11 0 0

0 0 0 0 0 0 0 0 0 0

14576 6 36 0 0 7 0 0 3 0

>20k

10k-20k

5k-10k

1k-5k

100-1k

20-100

5-20

Punctuation before revisionP

un

ctu

atio

n a

fter

revi

sio

n

Table 3.2: Confusion matrix between the old to the new manual transcripts.

. , ? ! ... : ; All punctuationCohen’s kappa 0.890 0.557 0.870 0.259 0.372 0.323 0.092 0.705

Table 3.3: User annotation agreement for the punctuation marks in the Portuguese BN corpus,in terms of Cohen’s kappa values.

were applied to deal with this, the annotators often delimited the disfluency sequences withcommas, or other punctuation marks. Moreover, they also applied a naive criterion of corre-sponding a comma to a silent pause, even if that was an ungrammatical convention, meaningnot respecting the syntactic structure. For example, about 41k commas were kept from the oldthe the new version, but about 31k were removed, and another 19k were simply added to thenew corpora version. The question mark is mostly consistent, and results concerning otherpunctuation marks are less significant given the lower frequency in the corpus. The bottomline of the table reports statistics for the newly added disfluency boundaries, revealing thatabout 15k disfluencies are now marked in the corpus.

Using the previous differences, Cohen’s kappa values (Carletta, 1996) have been calculatedfor each punctuation mark, thus allowing to assess the user agreement between the original andthe revised version. Table 3.3 shows the corresponding results for all corpora, revealing thatthe most consistent punctuation marks are the full-stop and the question-mark, and confirmingthe strong disagreement concerning comma. Most of the differences concern comma, and theyare often due to different criteria when marking disfluencies. Since our previous data had nodisfluency identification, and no objective criteria were applied to deal with this, the annotatorsoften delimited the disfluency sequences with commas, or other punctuation marks. Moreover,they also applied a naive criterion of corresponding a comma to a silent pause, even if that didnot respect the syntactic structure (e.g., often introducing a comma between the subject and thepredicate). Results concerning other punctuation marks are less significant, given the lower


#shows duration #wordsTrain 19 13.6h 152k

Development 3 2.2h 25kTest 3 1.9h 22k

Table 3.4: Spanish BN corpora properties.

Figure 3.2: Focus distribution in terms of speech duration for Spanish BN.

frequency in the corpus.

3.1.2 Spanish Corpus

This Spanish corpus was recently created at L2F/VoiceInteraction2 (Martinez et al., 2008;Meinedo et al., 2010). Table 3.4 shows some properties of this corpus, which contains about 25Broadcast News from the national Spanish TV station (TVE). The training data contains about14h of usable speech, and the evaluation data contains about 2h. The focus distribution isillustrated in Figure 3.2, revealing a very large percentage of planned speech (73%) when com-paring with Portuguese broadcast news. Only about 11% of the corpus consists of spontaneousspeech.

The manual transcripts of this corpus are available in the TRS format. The conversion toother formats follows the same strategy already adopted for the Portuguese corpus.


Subset LDC corpora set Total1998T28 2000S86 2000S88 2005T24 2007S10 duration #words

Train 94% 80% 79.1h 829kDevelopment 6% 10% 6.3h 66kTest 100% 100% 10% 100% 9.3h 98kWER 15.1% 23.7% 25.5% 16.1% 20.9%

Table 3.5: English BN corpora properties.

3.1.3 English Corpora

The English BN corpus used in our experiments combine five different English BN corporasubsets, available from the Linguistic Data Consortium (LDC). Table 3.5 shows details of thiscorpus. From the corpus LDC1998T28 (HUB4 1997 BN training data), about 94% of was usedfor training and the rest for development. The first 80% of the LDC2005T24 corpus (RT-04MDE Training Data Text/Annotations) was used for training, 10% for development and thelast 10% for evaluation. The evaluation data also includes the LDC corpus LDC2000S86 (HUB41998 BN evaluation) , LDC2000S88 (HUB4 1999 BN evaluation), LDC2005T24 (MDE RT04, onlythe last 10% were used), and LDC2007S10 (NIST RT03 evaluation data). The training datacontains about 81h (transcribed speech only), which is almost twice the size of the PortugueseBN training data.

Dealing with such corpora demanded normalization strategies, specifically adapted foreach corpus. They have been produced in different time periods, encoded with different an-notation criteria, and are available in different formats as well. Besides, they were built fordifferent purposes, which makes them even more heterogeneous. One important step for deal-ing with these corpus consisted on converting the original format of each one into a commonformat. The chosen common format was the STM (Segment Time Mark) format, which is easyto process and understand, and can easily map all the information required for our experi-ments. This corpora contains portions of overlapped speech but, in order to correctly use ourrecognition system, only one speaker was kept for such segments.

3.1.3.1 1997 Hub-4 Broadcast News Speech Corpus

This corpus contains a total of 97 hours of recordings from radio and television news broad-casts, gathered between June 1997 and February 1998. It has been prepared to serve as a sup-plement to the 1996 Broadcast News Speech collection (consisting of over 100 hours of similarrecordings). However, the 1996 BN speech collection does not include neither punctuation norcapitalization information and was therefore excluded from our experiments.

The manual transcripts are available under two different distributions: LDC1998T28 and

2http://www.voiceinteraction.pt/


<time sec=50.017>{breath} We don’t know yet what it is the judge thinks the penalty should be, but itwill not be death. Only the jury could have made that call. {breath} So first, we go to^Denver, and here is _A_B_C’s ^Bob ^Jamieson. ^Bob?...<time sec=836.976>{breath} In an interview with _C_N_N, he praises Âmericans and he calls for a thoughtfuldialog between the people of the two countries. {breath} For the nearly twenty yearssince Âmericans were held hostage in Îran, going soft on Îran {breath} has been tabooin Âmerican politics. And so, it is a tough question for President ^Clinton.<time sec=853.232>How to respond {breath} to a friendly voice {breath} from ^Teheran? {breath} We check infirst at the White House. Here’s _A_B_C’s ^John ^Donvan.</turn>...<turn speaker=Joseph_Lieberman spkrtype=male startTime=974.151 endTime=986.845><time sec=974.151>Îran has a serious ballistic missile development program that probably within less thana year %uh will %uh threaten our troops in the ^Middle Êast and our allies there.</turn>

Figure 3.3: Excerpt of the LDC1998T28 manual transcripts.

LDC1998E11, and the speech data is available under the LDC1998S71 distribution. The primarymotivation for the LDC1998T28 collection is to provide additional training data for the DARPA"HUB-4" Project on continuous speech recognition in the broadcast domain. Transcripts havebeen made of all recordings in this publication, manually time aligned to the phrasal level, andannotated to identify boundaries between news stories, speaker turn boundaries, and genderinformation about the speakers. The released version of the transcripts is in SGML format,comparable to the format that was used in the 1996 Broadcast News Speech transcripts, there isaccompanying documentation, and an SGML DTD file, included with the transcription release.The LDC1998E11 also contains transcripts from the this corpus, annotated with Named Entitiesaccording to the new 1998 Hub-4 guidelines, and available in UTF format. Nonetheless, onlythe LDC1998T28 collection was used, which serves well our goals. The corresponding evalua-tion corpora is available under the LDC catalog LDC2002S11. However, the manual transcriptsuse no punctuation marks and no capitalization information, and therefore could not be usedin our experiments.

Figure 3.3 shows an excerpt of the original content of this corpus, which is available inthe SGML format. This content was converted into the standard STM format by means of thebn_filt_sgml97 tool3.

3This tool was adapted from the tool bn_filter, which was supposed to be included in the distribution, but itwas instead provided by Thomas Pellegrini, a colleague from L2F, that found it elsewhere.


<utf dtd_version="utf-1.0" audio_filename="h4e_98_1.sph" language="english" version="4"version_date="981118" scribe="Reconciled"><bn_episode_trans program="unk" air_date="unk"><background type="other" startTime="0.0" level="low"><Section startTime="0.004438" endTime="85.505438" type="report"><Turn startTime="0.004438" endTime="13.638313" spkrtype="male" dialect="native"speaker="David_Brancaccio" mode="planned" fidelity="high">{breath The guardians of the electronic stock market <b_enamexTYPE="ORGANIZATION">@NASDAQ<e_enamex> <contraction e_form="[who=>who][’ve=>have]">who’vebeen burned by past ethics questions, are moving tohead off<time sec="6.839750">market fraud by toughening the rules for companies that want to be listed on theexchange. {breath<time sec="11.182000">{breath Marketplace’s <b_enamex TYPE="PERSON">^Philip ^Boroff<e_enamex> reports.</Turn><Turn startTime="13.638313" endTime="49.432875" spkrtype="male" dialect="native"speaker="Philip_Boroff" mode="planned" fidelity="high">As part of the proposals, penny stocks will be eliminated from <b_enamexTYPE="ORGANIZATION">@NASDAQ<e_enamex>.<time sec="17.968750">{breath These trade for literally <b_numex TYPE="MONEY">pennies<e_numex>....<time sec="314.280751">about a seventh of the <b_enamex TYPE="LOCATION">_U_S<e_enamex> market.

Figure 3.4: Except of the LDC2000S86 corpus.

3.1.3.2 1998 and 1999 Hub-4 Evaluation Corpora

The LDC2000S86 distribution contains the English evaluation test material used in the1998 DARPA/NIST Continuous Speech Recognition Broadcast News HUB4 English Bench-mark Test, administered by the NIST (National Institute of Standards and Technology) Spo-ken Natural Language Processing Group. The LDC2000S88 publication contains the Englishevaluation test material used in the 1999 NIST Broadcast News Transcription Evaluation, alsoadministered by the NIST Spoken Natural Language Processing Group.

Each distribution contains about 1.5 hours of broadcast news speech. The transcriptiondata from these corpora is available in the UTF format, which is the same format as the Hub-4English Compendium. Figure 3.4 contains an excerpt of the LDC2000S86 transcripts. It wasconverted into the STM format by means of a adapted version of the utf_filt tool.

3.1.3.3 MDE RT-04 Training Data

This corpus was created by LDC to provide training data for the RT-04 Fall Metadata Ex-traction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The manual transcripts of this corpus are available under the LDC distribu-tion LDC2005T24. The corresponding speech data is available under the LDC2005S16 (MDE


...<Annotation id="ea980129-split001:1:E1W" type="token" start="ea980129-split001:1:2z"end="ea980129-split001:1:30"><Feature name="_SU*">ea980129-split001:1:EJH</Feature><Feature name="_next">ea980129-split001:1:E1X</Feature><Feature name="_segment*">ea980129-split001:1:E19</Feature><Feature name="_sn">86</Feature><Feature name="_speaker*">ea980129-split001:1:E1</Feature><Feature name="language"></Feature><Feature name="punctuation">period</Feature><Feature name="text">misleading</Feature></Annotation><Annotation id="ea980129-split001:1:EJH" type="SU" start="ea980129-split001:1:2z"end="ea980129-split001:1:30"><Feature name="_et">ea980129-split001:1:E1W</Feature><Feature name="_sn">86</Feature><Feature name="_speaker*">ea980129-split001:1:E1</Feature><Feature name="_st">ea980129-split001:1:E1W</Feature><Feature name="difficultToAnnotate">false</Feature><Feature name="type">statement</Feature></Annotation><Annotation id="ea980129-split001:1:E1X" type="token" start="ea980129-split001:1:31"end="ea980129-split001:1:32"><Feature name="_next">ea980129-split001:1:E1Y</Feature><Feature name="_segment*">ea980129-split001:1:E19</Feature><Feature name="_sn">87</Feature><Feature name="_speaker*">ea980129-split001:1:E1</Feature><Feature name="language"></Feature><Feature name="text">There</Feature>...

Figure 3.5: Excerpt of the LDC2005T24 corpus (XML format).

RT04 Training Data Speech). This data set has been previously released to the EARS MDEcommunity as LDC2004E31. The LDC2005S16 remaps some original annotations to new MDEelements, in order to support better annotation consistency.

The corpus includes 20h of CTS (Conversational Telephone Speech) from the Switchboardcorpus, and 20h of Broadcast News shows (23 shows) from the Hub-4 Broadcast News Corpus.Only the BN portion of the corpus was used for experiments described in this document. Theoriginal data is available in XML format, and can be further converted into the RTTM format, bymeans of the ag-to-rttm script, included in the distribution. Figures 3.5 and 3.6 show excerptsof this corpus in XML and RTTM formats, respectively. Each RTTM file was converted into theSTM format using a tool specially created for this purpose and developed in the scope of thisthesis.

3.1.3.4 2003 NIST Rich Transcription Evaluation Data

This corpus, distributed under the LDC2007S11 reference, contains the test material usedin the 2003 Rich Transcription Spring and Fall evaluations, administered by the NIST SpeechGroup. The Spring evaluation (RT-03S), implemented in March-April 2003, focused on Speech-


SPEAKER ea980129 1 87.987 3.054 <NA> <NA> Peter_Jennings <NA>SU ea980129 1 87.987 3.054 <NA> statement Peter_Jennings <NA>LEXEME ea980129 1 87.987 0.611 We’ll lex Peter_Jennings <NA>LEXEME ea980129 1 88.598 0.611 take lex Peter_Jennings <NA>LEXEME ea980129 1 89.209 0.610 A lex Peter_Jennings <NA>LEXEME ea980129 1 89.819 0.611 Closer lex Peter_Jennings <NA>LEXEME ea980129 1 90.430 0.611 Look lex Peter_Jennings <NA>SEGMENT ea980129 1 91.041 9.111 <NA> <NA> spkr_1 <NA>SPEAKER ea980129 1 91.041 10.427 <NA> <NA> spkr_1 <NA>SU ea980129 1 91.041 4.795 <NA> statement spkr_1 <NA>LEXEME ea980129 1 91.041 0.480 From lex spkr_1 <NA>LEXEME ea980129 1 91.521 0.479 A. alpha spkr_1 <NA>LEXEME ea980129 1 92.000 0.480 B. alpha spkr_1 <NA>LEXEME ea980129 1 92.480 0.479 C. alpha spkr_1 <NA>LEXEME ea980129 1 92.959 0.480 News lex spkr_1 <NA>...

Figure 3.6: Excerpt of the LDC2005T24 corpus (RTTM format).

To-Text (STT) tasks for broadcast news speech and conversational telephone speech in threelanguages: English, Mandarin Chinese and Arabic. That evaluation also included one Meta-data Extraction (MDE) task, speaker diarization for broadcast news speech and conversationaltelephone speech in English. The Fall evaluation (RT-03F), implemented in October 2003, fo-cused on MDE tasks including speaker diarization, speaker attributed STT, SU (sentence/se-mantic unit) detection and disfluency detection for broadcast news speech and conversationaltelephone speech in English. Surprisingly enough, the LDC2007S10 distribution does not con-tain the reference data for the MDE tasks, as it should. To complement this corpus, it would alsobe interesting to use the LDC2004T12 distribution, corresponding to the RT-03 MDE TrainingData Text and Annotations, but unfortunately this corpus was not available.

The original information is provided in the TYP format, which can be read by the Tran-scriber speech tool. This tool is used for segmenting, labeling and transcribing speech (Barraset al., 2001). Transcriber was also used to produce a TRS file, which was then used for producingthe adopted STM format. Figure 3.7 presents an excerpt of the data in its original format.

3.2 Written Newspaper Data

Written corpora contains important information concerning the two rich transcription tasksfocused by this thesis. In terms of capitalization, they provide information concerning thecontext where the capitalized words appear. In terms of punctuation, the importance is not soobvious, but they can also be used to improve speech data punctuation models.

In order to bring the written corpora closer to the speech transcripts, we have performedautomatic normalizations. This process transforms much of the original text into word se-quences that can also be found in a speech transcripts, but without recognition errors. Eachcorpus requires a specific normalization tool, which depends not only on the language, but

3.2. WRITTEN NEWSPAPER DATA 37

...<t 83.813> <‌<female, Gillian_Findlay>‌>It was an historic day today, ^Peter, {breath}and already Mr. ^Sharon says he’s taken a phone call of congratulations from President^Bush, {breath}who, he says, reminds him of a tour that Mr. ^Sharon gave him years ago in Îsrael.{breath}At the time Mr. ^Bush said to him, you know one day I will be prime minister and youwill be p- -- %uh, I will be president and you will be prime minister. {breath}...<t 271.373> <‌<female, Betsy_Stark>‌>But today the owner of that toy store told ~A ~B ~C Newsshe sees no signs of a slowdown in her business.<t 277.633> <‌<female, spkr_4>‌>January was a very good month. We still are growing. %um{breath} {lipsmack}Seems to be, so far, so good.

Figure 3.7: Excerpt of the LDC2007S10 corpus.

PUBnews LMnewsdata period #words data period #words

Train Jan 1999 - Sep 2004 151 M Jan 2005 - Jun 2009 41 MDevelopment Oct 2004 - Nov 2004 2.2 M Jul 2009 - Jul 2009 1.8 M

Test Nov 2004 - Dec 2004 2.2 M Aug 2009 - Nov 2009 1.7 M

Table 3.6: Portuguese Newspaper corpora properties.

also on the corpus itself. This is the most difficult part in terms of corpora setup. The follow-ing subsections provide more details about each one of the written corpus and describes thenormalization steps performed.

3.2.1 Portuguese Corpora

Two written corpora subsets were used in the context of this thesis, both collected from theWeb. Table 3.6 summarizes the properties of each one of these corpora.

The first newspaper corpus is a subset of a larger newspaper corpus, consisting of collectededitions of the Portuguese “Público” newspaper, that have been collected at INESC-ID Lisboabetween 1995 and 2004. During the year 1998, the scripts used for this task became obsolete,and only the first three months were collected in that year. For that reason, the subset used inour experiments – PUBnews – covers only the period from 1999 to 2004, containing about 150Million words. It was split into subsets of about 2 Million words each, resulting in 72 subsets(between 10 to 14 per year). The last two subsets are used for development and evaluation,


subsets data period #wordsTrain 37 Jan 2003 - Nov 2008 75M

Development 1 Jan 2009 - Feb 2009 2MTest 1 Feb 2009 - April 2009 2M

Table 3.7: European Spanish written corpora properties.

respectively. Each subset is referred by the day corresponding to the latest data in that subset.

The LMnews corpus consists of online text, daily collected from the web, corresponding tolast minute news published by the “Público” newspaper. It began being collected in October2008, during the execution of this thesis, but the content of this corpus covers from 2005 to2009, since older data was also available. In October 2009, the newspaper company changedthe publication format of the data, making it very difficult to continue collecting it. Togetherwith this corpus, we have also collected last minute news from the TSF agency. It is currentlystill being collected for language model adaptation of the speech recognition system, but it wasnot used in the scope of this study.

Early experiments used an old Portuguese normalization tool (NUVEM) created at INESCseveral years ago. Since then, unsuccessful efforts have been made to create an alternate nor-malization tool, to be used by other people within the group, and by the speech recognitionsystem itself. Recently, also in the scope of this thesis, the initial normalization tool was deeplyrevised, and has been used during the latest experiments. The normalization now includesseveral modules: units, numbers, money, dates, time, known expressions, and deals with othertype of expressions found in real corpora. Some modules in the tool can be optionally ac-tivated, such as expansion of abbreviations and separation of punctuation. Annex A showssome examples of the output of this normalization tool.

3.2.2 Spanish Corpora

The Spanish written corpus consists of online editions of the Spanish newspaper “El País”,collected since 2003. Table 3.7 summarizes the properties of this corpus. It was normalizedusing the normalization tool also in use in the speech recognition system. The normalizationrules have been adapted from the Portuguese normalization rules (Martinez et al., 2008).

3.2.3 English Corpora

The English written corpora correspond to the LDC1998T30 (North American News TextSupplement). The LDC1998T30 contains data from three different sources: APWS (AssociatedPress World Stream), NYT (New York Times), and LATWP (Los Angeles Times & WashingtonPost). Table 3.8 summarizes the size of the corpus, after cleaning the corpus and removingproblematic text (unknown chars, etc).

3.3. SPEECH DATA PREPARATION 39

Corpus Subset Data Period WordsTrain Development Test

APWS Nov-94 to Apr-98 228M 454K 863KNYT Jan-97 to Apr-98 211M 574K 1.2M

LATWP Sep-97 to Apr-98 16.4M 711K 769K

Table 3.8: English written corpora properties.

Concerning the normalization of the English corpora, the existing tools have been used, asmuch as possible. Most of the normalization task for English had already been performed byThomas Pellegrini. For this work, several existing normalization modules have been adaptedand combined, including those which have been previously worked on by Thomas Pellegrini.

3.3 Speech Data Preparation

Besides the manual transcripts, for each corpus we also have the automatic transcriptsproduced by the recognition system (Neto et al., 2008). More recently, the recognition sys-tem has also been used to produce automatic force alignments for all the corpora, except forthe Spanish. All corpora were automatically annotated with part-of-speech information. Themorphological information was added to the Portuguese data using the morphological ana-lyzer Palavroso (Medeiros, 1995), followed by the ambiguity resolver MARv (Ribeiro et al.,2003, 2004), while the Spanish and the English data were annotated using TreeTagger (Schmid,1994), a language independent part-of-speech tagger, using the dictionaries there included.

The manual orthographic transcripts include punctuation marks and capitalization infor-mation, providing our reference data. Whereas the manual transcripts already contain refer-ence punctuation marks and capitalization, this is not the case of the automatic transcripts. Inthe context of this thesis, the required reference was produced by means of word alignmentsbetween the manual and automatic transcripts, which is a non-trivial task mainly because ofrecognition errors. The alignment was performed using the NIST SCLite tool4, followed byan automatic post-processing stage, for correcting possible SCLite errors and aligning specialwords which can be written/recognized differently. The automatic post-processing stage al-lows to overcome problems such as words A.B.C. or C.N.N. appearing as single words in thereference data, but recognized as isolated letters.

We have adopted the standard XML representation format for keeping all the informationrequired for further experiments. During an early stage of this work, we have created twoXML data sets: MAN – built entirely from manual transcripts, where part-of-speech data wereadded to each word; AUT – built from both automatic transcripts, where part-of-speech datawas added to each word and reference information coming from the manual transcripts was

4available from http://www.nist.gov/speech.


ASR Output(XML)

Alignment

update punctuation

capitalizationmark excl.sections

Manually annotated speech transcriptions

(TRS)

POS tagger updatemorphologytext

CTM

STM

XML formatASR outputexcluded sectionsfocus conditionspunctuation & capitalizationmorphology

Figure 3.8: Creating an XML (Extensible Markup Language) file with all the required informa-tion for further experiments. The following file formats are used: CTM (time marked conver-sation scoring), STM (segment time mark), and TRS (XML-based standard Transcriber).

also added. The two data sets were then used for training and testing our punctuation andcapitalization models. Concerning the capitalization, both data sets provided all the requiredinformation. Nonetheless, the punctuation task makes use of important information, such aspause durations, which sometimes is not available in the manual transcripts. For that reason,the MAN data set was recently redefined to contain force aligned transcripts, treated in thesame way as the automatic transcripts, but without recognition errors. These two data setscontain exactly the same type of information, allowing to apply the same procedures and toolsto each one of them.

Figure 3.8 illustrates the process of creating a XML file with the information required for allfurther experiments. The resulting file includes APP/ASR output information: time intervalsto be ignored in scoring, focus conditions, speaker information, punctuation marks, part-of-speech of each word and word confidence scores. The input for the POS tagger correspondsto the text extracted from the original transcript, and segmented according to the acoustic seg-ments previously identified by the APP module. Hence, a generic part-of-speech tagger thatprocesses written texts, can be used to perform this task, taking into account the surroundingwords. Figure 3.9 shows a transcript segment, extracted from an AUT file, where the punc-tuation, part-of-speech, focus conditions and information concerning excluded sections wereupdated with information coming from the manual transcripts. Whereas, in the reference, ex-clusion and focus information are properties of a segment, in the speech recognition output,such information must be assigned to each word individually. That is because reference seg-ments are different from segments given by the APP/ASR.


<TranscriptSegment><TranscriptGUID>13</TranscriptGUID><AudioType start="5022" end="5783" conf="0.686300">Clean</AudioType><Time start="5022" end="5783" reasons="" sns_conf="0.964000"/><Speaker id="52" id_conf="0.94" gender="F" gender_conf="0.92" known="T"/><SpeakerLanguage native="T">PT</SpeakerLanguage><TranscriptWordList><W start="5042" end="5053" conf="0.76" focus="F3" cap="Boa" pos="A.">boa</W><W start="5054" end="5095" conf="0.98" focus="F3" punct="." pos="Nc">noite</W><W start="5106" end="5162" conf="0.99" ... cap="Benfica" pos="Np">benfica</W><W start="5163" end="5169" conf="0.94" focus="F3" pos="Cc">e</W><W start="5170" end="5219" conf="0.96" ... cap="Sporting" pos="Np">sporting</W><W start="5220" end="5253" conf="0.99" focus="F3" pos="V.">estão</W><W start="5254" end="5280" conf="0.96" focus="F3" pos="S.">sem</W><W start="5281" end="5336" conf="0.99" ... punct="." pos="Nc">treinador</W><W start="5344" end="5370" conf="0.99" focus="F0" cap="José" pos="Np">josé</W><W start="5371" end="5399" conf="0.99" focus="F0" cap="Mourinho" pos="Np">mourinho</W><W start="5400" end="5441" conf="0.91" focus="F0" pos="V.+Pf">demitiu-se</W><W start="5442" end="5443" conf="0.86" focus="F0" pos="Pd">o</W><W start="5444" end="5498" ... punct="." cap="Benfica" pos="Np">benfica</W><W start="5522" end="5568" conf="0.99" focus="F0" cap="Augusto" pos="Np">augusto</W><W start="5569" end="5604" conf="0.99" focus="F0" cap="Inácio" pos="Np">inácio</W><W start="5605" end="5631" conf="0.99" focus="F0" pos="V.">foi</W><W start="5632" end="5698" conf="0.99" focus="F0" pos="V.">demitido</W><W start="5699" end="5709" conf="0.98" focus="F0" pos="S.">do</W><W start="5710" end="5766" ... punct="." cap="Sporting" pos="Np">sporting</W></TranscriptWList>

</TranscriptSegment>

Figure 3.9: Example of a transcript segment extracted from the AUT data set.

3.3.1 Capitalization Alignment Issues

The reference capitalization in automatic recognition transcripts is automatically producedby aligning the manual and the automatic transcripts. The problem about creating a referencecapitalization is that, for each and every word in the recognition output, a capitalization formmust be assigned. That means that if a mistake is made, the evaluation will reflect it. When inthe presence of a correct word, the capitalization can be assigned directly, but problems arisefrom the recognition errors. Figure 3.10 shows examples of word aligmements, extracted fromthe SCLite output, where the misalignments were marked by colors. Some alignment problemshere presented are solved by the automatic post-processing stage, by looking at the words in theneighborhood. For example, the capitalized word “Portugal” from the first example becomescorrectly capitalized. In fact, all the underlined words become capitalized after applying thepost-processing step.

If no information exists concerning the capitalization of a word, it is considered lowercaseby default. Therefore, any word inserted by the recognition system that does not exist in thereference (insertion) will be kept lowercase. On the other hand, if a reference word was skippedby the recognition system (deletion), nothing can be done about it. Anyway, most of the inser-tions and deletions consist of short functional words which usually appear in lowercase and


1)REF: noutro processo também em Portugal que está junto que é um apenso dos autosHYP: noutro processo também ** ******** portugal está junto que é um apenso nos alpes

2)REF: O pavilhão desportivo do Colégio Dom Nuno Álvares PereiraHYP: o pavilhão desportivo do ******* colégio dono novas pereira

3)REF: A SAD a administração da SAD Luís Duque e Augusto InácioHYP: * lhe assada administração da *** sad luís duque augusto inácio

4)REF: Esta noite em Gondomar o líder dos Social DemocratasHYP: esta noite em gondomar o líder dos ****** social-democratas

Figure 3.10: Capitalization alignment examples.

Corrected alignements Unsolved alignmentsSclite probs

Compound words

subs First cap

All upper

Other

Train 87% 1.9% 4.5% 5.5% 0.5% 0.1% 0.3% 0.6% 0.0% 0.0% 13.6%

Development 81% 2.5% 5.5% 8.4% 0.7% 0.1% 0.5% 0.9% 0.1% 0.0% 19.0%

Eval 81% 2.6% 5.5% 8.5% 0.5% 0.0% 0.4% 1.1% 0.1% 0.0% 19.4%

Jeval 82% 3.4% 4.4% 7.8% 0.6% 0.1% 0.5% 1.1% 0.1% 0.0% 18.1%

Rtp07 81% 2.2% 6.3% 8.4% 0.6% 0.1% 0.4% 1.0% 0.1% 0.0% 19.9%

Rtp08 76% 2.3% 10.3% 9.7% 0.5% 0.0% 0.5% 1.1% 0.1% 0.0% 26.8%

Cor Del Ins WERlowercase subs

Table 3.9: Capitalization alignment report.

do not pose significant problems to the reference capitalization. Finally, if the words mismatchbut the reference word is lowercase, the word in the automatic transcript is kept in lowercaseand will not pose problems to the reference capitalization either. Most of the alignment prob-lems arise from word substitution errors, where the reference word appears capitalized (notlowercase). In this case, three different situations may occur: i) the two words have alternativegraphical forms, a not infrequent phenomena in proper nouns, for example: ”Menezes” and“Meneses” (proper nouns); ii) the two words are different but share the same capitalization, forexample: “Andreia” and “André”; and iii) the two words have different capitalization forms,for example “Silva” (proper noun) and “de” (of, from). The Levenshtein distance (Levenshtein,1966) has been used to measure the difference between the two words. As the process is fullyautomatic, we have decided to assign the same capitalization information whenever it was lessthan 2. By doing this, capitalization assignments like the following were performed: Trabalhos7→ Trabalho; Espanyol 7→ Espanhol; Millennium 7→ Millenium; Claudia 7→ Cláudia; Exigimos7→ Exigidos; Andámos 7→ Andamos; Carvalhas 7→ Carvalho; PSV 7→ PSD; Tina 7→ Athina. No-tice that if the capitalization assignments of the above words were not performed, those wordswould appear lowercase in the reference capitalization, which would not be correct.

Table 3.9 presents statistics concerning the capitalization assignment after the word align-ment. The proportion of correct alignments is show in column Cor; Del and Ins correspondto the number of deletions and insertions in the word alignment; lowercase subs correspond to


1)REF: ESTAMOS SEMPRE A DIZER À senhoria .HYP: ******* ****** CALMO SEM PESARÁ senhoria *

2)REF: no centro , O rio ceira ENCHEU de forma que A aldeia de CABOUCO ficou INUNDADA .HYP: no centro * * rio ceira INÍCIO de forma que * aldeia de TEMPO ficou ******** *

3)REF: é a primeira vez que isto LHE acontece ?HYP: é a primeira vez que isto *** acontece *

4)REF: sem PERCEBEREM , SEM LHES DIZEREM quais são as consequências desta políticaHYP: sem ********** * RECEBEREM SELHO DIZER quais são as consequências desta política

5)REF: ALIÁS , alguém DISSE , E EU ESTOU de acordo , que hoje não temos UM governo ,HYP: HÁLIA ÀS alguém ***** * * ** INDÍCIOS de acordo * que hoje não temos O governo *

6)REF: no segundo * **** , COLIN MONTGOMERY , JARMO SANDELIN , michael e lauraHYP: no segundo O QUAL NÃO COBRE E CRIAR UMA CÉLULA E michael e laura

Table 3.10: Punctuation alignment examples.

Corpus

subset Good Ok Bad Good Ok Bad Good Ok Bad Good Ok Bad

Train 71.4% 24.7% 3.9% 76.1% 18.4% 5.5% 41.1% 44.3% 14.7% 19.4% 36.1% 44.4%

Development 66.3% 28.8% 4.9% 66.2% 27.1% 6.7% 33.7% 40.6% 25.7%

Eval 63.9% 30.1% 6.0% 64.6% 27.6% 7.9% 27.5% 51.0% 21.6% 25.0% 75.0% 0.0%

Jeval 65.0% 30.7% 4.3% 65.8% 27.4% 6.8% 44.7% 37.5% 17.8% 33.3% 66.7% 0.0%

Rtp07 63.6% 29.2% 7.2% 65.0% 25.6% 9.5% 27.5% 44.1% 28.4%

Rtp08 56.9% 30.9% 12.3% 59.8% 28.1% 12.1% 20.4% 49.0% 30.6% 33.3% 0.0% 66.7%

full-stop (.) comma (,) question mark (?) exclamation mark (!)

Table 3.11: Punctuation alignment report.

substitution of words involving lowercase words, which do not pose problems to the referencecapitalization. Corrected alignments show the percentage of corrections performed during thepost-processing stage. The unsolved alignments correspond unsuccessful alignments, involvingfirst capitalized words (e.g., proper nouns), all uppercase letters (e.g., acronyms), and othertype of capitalization (e.g., McDonald’s). The recognition WER (Word Error Rate) is shown inthe last column, revealing the proportion of recognition errors in the corpus when the align-ment was performed.

3.3.2 Punctuation Alignment Issues

Like the reference capitalization, inserting the correct reference punctuation to the auto-matic transcripts, according with the manual transcripts, is not an easy task, but it also facesdifferent challenges. The effect of the speech recognition errors is only relevant when theyoccur in the neighbourhood of a punctuation mark. Table 3.10 shows punctuation alignmentexamples extracted from the SCLite output. The recognition errors in the first three examplesdo not pose problems to the reference punctuation, which means that they will provide a good


reference. However, the last three examples present more difficult challenges. The fourth andthe fifth examples can still be solved in an acceptable manner, and provide acceptable referencedata. The last example is very difficult to solve, even by a human annotator, and will providea bad reference data. The REF data suggests the use of three commas, but considering only thespeech recognition output (HYP data) and the speech signal, would one use none, some, orall three commas? And, in the case they are used, where to put them? Table 3.11 presents thealignment summary for each punctuation mark, where the alignments are classified as good,ok (acceptable), or bad. The worst alignments concerning the question mark are related with thefact that these sentences consist mostly of spontaneous speech. The sum of exclamation marks inall the corpora is only about forty, therefore, results concerning this punctuation mark are lesssignificant. The final alignment would benefit from manual correction, an issue to be addressedin the future. Nevertheless, even an expert human annotator would find difficulties doing thistask and sometimes would not perform it coherently.

3.4 Additional Prosodic Information

The reference XML, described in section 3.3, contains lexical and acoustic information,which served as data source for the initial experiments performed in the scope of this thesis.Nevertheless, linguistic evidences state that nuclear contour, boundary tones, energy slopes,and pauses are crucial to delimit sentence-like units across languages (Vassière, 1983). Thissection describes the prosodic feature extraction process and the creation of a new data source,containing additional prosodic information.

3.4.1 Extracting the Pitch and Energy

Pitch ( f0) and energy (E) are two important sources of prosodic information that can beextracted directly from the speech signal. By the time our experiments were conducted, thatinformation was not available in the speech recognition system output. For that reason, wehave extracted that information directly from the speech signal, using the Snack (Sjölanderet al., 1998) toolkit. Nevertheless, because this is an important starting point for the use ofprosody, in the future it will be directly available in the speech recognition output.

Both pitch and energy were extracted using the standard parameters, taken from thewavesurfer tool configuration (Sjölander and Beskow, 2000). Energy was extracted using apre-emphasis factor of 0.97 and a 200ms hamming window, while pitch was extracted using theESPS method (auto-correlation).

We have removed all the pitch values calculated for unvoiced regions in order to eliminateirregular values. This is performed in a phone-based analysis by detecting all the unvoiced

3.4. ADDITIONAL PROSODIC INFORMATION 45

Figure 3.11: Pitch adjustment for unvoiced regions.

phones. Figure 3.11 illustrates this process, where the original pitch values are represented bydots and the resultant pitch is represented by the gray line.

3.4.2 Adding Phone Information

Audimus (Meinedo et al., 2008) is a hybrid automatic speech recognizer that combinesthe temporal modeling capabilities of Hidden Markov Models with the pattern discriminativeclassification capabilities of Multi-layer Perceptrons (MLP). Modeling context dependency is aparticularly hard problem in hybrid systems. For that reason, this speech recognition systemuses, in addition to monophone units modeled by a single state, multiple-state monophoneunits, and a fixed set of phone transition units aimed at specifically modeling the most fre-quent intra-word phone transitions (Abad and Neto, 2008). That information was then con-verted into monophones by another tool, specially designed for that purpose. Still, the existinginformation is insufficient for correctly assigning phone boundaries. For example, the phonesequence “k k=u u”, presented in figure 3.11, must be converted into the monophone sequence“k u”, but the exact boundary between the first and the second phone can only be guessed. Wehave used the mid point of the phone transition. The speech recognition system could alterna-tively perform the speech recognition based purely on monophones, but the cost would be anincreased WER.


3.4.3 Marking the Syllable Boundaries and Stress

Another important step consisted of marking the syllable boundaries as well as the syllablestress. This was achieved by means of a lexicon containing all the pronunciations of each word,marked with syllable boundaries and stress. For the Portuguese BN, a set of syllabificationrules was designed and applied to the lexicon5. The rules account fairly well for the canonicalpronunciation of native words, but they still need improvement for words of foreign origin.

Regarding the English language, most of the lexicon content was created from the CMUdictionary (version 0.7a). The phone sequence for the unknown words was provided by thetext-to-phone CMU/NIST tool addttp46, and the stressed syllables were marked using the tooltsylb2 (Fisher, 1996), which uses an automatic phonological-based syllabication algorithm.

3.4.4 Producing the Final XML File

After extracting and calculating the above information, all data was merged into a sin-gle data source, which provides all the required information for later use. The existing DataType Definition (DTD) (Section 3.3) has been upgraded in order to accommodate the additionalprosodic information.

Figure 3.12 illustrates the involved processing steps. The pitch and energy values areextracted from the speech signal. A Gaussian mixture model (GMM) classifier is then useto automatically detect speech/non-speech regions, based on the Energy7. Both pitch andspeech/non-speech values are used to adjust the boundaries of the acoustic phone transitions,generally known as diphones (Section 3.5 contains further details). An excerpt of the PCTMinput file, produced by the speech recognition system and containing all the phones/diphonessequence, is shown in Figure 3.13. The referred PCTM file is modified with the new unit bound-aries, and then used to produce another file (in the same format) containing only monophones.The monophone units are used both for removing the pitch values from unvoiced regions andto produce a new PCTM file containing the syllable/phone information. Figure 3.14 presentsan except of the resultant information, where the syllable boundaries and stress are marked.

A final XML file combines all the previous information together with pitch and energystatistics for each unit. Figure 3.15 shows an excerpt from one of these files, containing onetranscript segment. Information concerning words, syllables and phones can be found in thefile, together with pitch, energy and duration information. For each unit of analysis we havecalculated the minimum, maximum, average, median and slope, both for pitch and energy.Pitch slopes were calculated after converting the pitch differences into semitone values.

5Work performed by Isabel Trancoso and Helena Moniz, in cooperation with Hugo Meinedo.6Work of Thomas Pellegrini7In cooperation with Alberto Abad.

3.5. SPEECH DATA WORD BOUNDARIES REFINEMENT 47

Input data

PCTM

XMLexcluded regionsfocus conditionspunctuationcapitalizationmorphology

Extract Pitch

Extract Energy

GMM classifier

Adjust diphone boundaries

SNSPitch

Energy

produce monophones

Pitch adjustment

Pitch

Lexicon Mark syllables

PCTM

Add syllables & phone Add statistics Final XML

Energy

Pitch

speech signal

Figure 3.12: Integrating prosody information in the corpora.

Recent efforts have been made to also include additional valuable information, availablein the manual transcripts into the XML files. The most recent XML files include informationconcerning filled pauses and other marks (inspiration, disfluency boundaries) available in themanual annotations, now aligned with the corresponding speech signal.

3.5 Speech Data Word Boundaries Refinement

Duration of silent pauses is one of the most important features for detecting punctuationmarks, or at least sentence boundaries (Gotoh and Renals, 2000). Even though they may notbe directly converted into punctuation, silent pauses are in fact a basic cue for punctuationand speaker diarization (Chen, 1999; Kim and Woodland, 2001). The durations of phones andsilent pauses are automatically provided by a large vocabulary continuous speech recognition


2000_12_05-17_00_00-Noticias-7.spkr000 1 14.00 0.27 interword-pause2000_12_05-17_00_00-Noticias-7.spkr000 1 14.27 0.01 L-m2000_12_05-17_00_00-Noticias-7.spkr000 1 14.28 0.01 m2000_12_05-17_00_00-Noticias-7.spkr000 1 14.29 0.04 m=u~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.33 0.01 u~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.34 0.02 u~=j~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.36 0.01 j~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.37 0.03 j~=t2000_12_05-17_00_00-Noticias-7.spkr000 1 14.40 0.01 t2000_12_05-17_00_00-Noticias-7.spkr000 1 14.41 0.02 t=u2000_12_05-17_00_00-Noticias-7.spkr000 1 14.43 0.01 u2000_12_05-17_00_00-Noticias-7.spkr000 1 14.44 0.01 u+R+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.45 0.01 L-b2000_12_05-17_00_00-Noticias-7.spkr000 1 14.46 0.02 b2000_12_05-17_00_00-Noticias-7.spkr000 1 14.48 0.01 b+R2000_12_05-17_00_00-Noticias-7.spkr000 1 14.49 0.02 L-o~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.51 0.05 o~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.56 0.05 o~+R+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.61 0.02 L-d2000_12_05-17_00_00-Noticias-7.spkr000 1 14.63 0.02 d2000_12_05-17_00_00-Noticias-7.spkr000 1 14.65 0.06 d=i2000_12_05-17_00_00-Noticias-7.spkr000 1 14.71 0.04 i2000_12_05-17_00_00-Noticias-7.spkr000 1 14.75 0.01 i=A2000_12_05-17_00_00-Noticias-7.spkr000 1 14.76 0.01 A2000_12_05-17_00_00-Noticias-7.spkr000 1 14.77 0.01 A+R+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.78 0.06 interword-pause

Figure 3.13: PCTM file containing the phones/diphones produced by the ASR system.

2000_12_05-17_00_00-Noticias-7.spkr000 1 14.000 0.270 interword-pause2000_12_05-17_00_00-Noticias-7.spkr000 1 14.270 0.040 "m2000_12_05-17_00_00-Noticias-7.spkr000 1 14.310 0.040 u~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.350 0.035 j~2000_12_05-17_00_00-Noticias-7.spkr000 1 14.385 0.035 #t2000_12_05-17_00_00-Noticias-7.spkr000 1 14.420 0.030 u+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.450 0.040 "b2000_12_05-17_00_00-Noticias-7.spkr000 1 14.490 0.120 o~+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.610 0.070 "d2000_12_05-17_00_00-Noticias-7.spkr000 1 14.680 0.075 i2000_12_05-17_00_00-Noticias-7.spkr000 1 14.755 0.025 #A+2000_12_05-17_00_00-Noticias-7.spkr000 1 14.780 0.060 interword-pause

Figure 3.14: PCTM file with monophones and marked with syllable boundary and stress.


<TranscriptSegment><TranscriptGUID>2</TranscriptGUID><AudioType start="1400" end="1484" conf="1.000000">Music</AudioType><Time start="1400" end="1484" reasons="" sns_conf="1.000000"/><Speaker id="2000" id_conf="1.000000" name="Mulher" gender="F" gender_conf="1.000000"known="F"/><SpeakerLanguage native="T">PT</SpeakerLanguage><TranscriptWordList phones="10" ph_duration="51" ph_avg="5.1"><Word start="1427" end="1444" conf="0.999766" focus="F3" cap="Muito" pos="R." name="muito"phseq="_mu~j~#tu+" pmax="271.9" pmin="220.8" pavg="258.3" pmed="263.7" emax="59.7"emin="35.0" eavg="47.8" emed="46.5" pslope="-1.8" eslope="0.4"><syl stress="y" start="1427" dur="11.5" pmax="271.9" pmin="251.2" pavg="263.7" pmed="266.0"emax="59.7" emin="35.0" eavg="49.0" emed="50.2" pslope="2.0" eslope="2.2"><ph name="m" start="1427" dur="4" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="39.2"emin="35.0" eavg="37.1" emed="37.0" pslope="0.0" eslope="1.6"/><ph name="u~" start="1431" dur="4" pmax="266.0" pmin="251.2" pavg="258.9" pmed="259.1"emax="58.8" emin="45.1" eavg="53.8" emed="55.5" pslope="2.5" eslope="4.7"/><ph name="j~" start="1435" dur="3.5" pmax="271.9" pmin="268.0" pavg="270.1" pmed="270.3"emax="59.7" emin="47.7" eavg="56.2" emed="58.7" pslope="-0.8" eslope="-3.6"/></syl><syl start="1438.5" dur="6.5" pmax="220.8" pmin="220.8" pavg="220.8" pmed="220.8" emax="47.9"emin="41.6" eavg="45.7" emed="45.8" pslope="0.0" eslope="-0.2"><ph name="t" start="1438.5" dur="3.5" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="47.7"emin="41.6" eavg="45.5" emed="46.5" pslope="0.0" eslope="-1.7"/><ph name="u" start="1442" dur="3" pmax="220.8" pmin="220.8" pavg="220.8" pmed="220.8"emax="47.9" emin="44.9" eavg="46.0" emed="45.2" pslope="0.0" eslope="0.2"/></syl></Word><Word start="1445" end="1460" conf="0.994520" focus="F3" pos="A." name="bom" phseq="_bo~+"pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1" emax="60.8" emin="41.2" eavg="51.8"emed="52.8" pslope="-1.9" eslope="0.2"><syl stress="y" start="1445" dur="16" pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1"emax="60.8" emin="41.2" eavg="51.8" emed="52.8" pslope="-1.9" eslope="0.2"><ph name="b" start="1445" dur="4" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="49.2"emin="41.2" eavg="44.0" emed="42.8" pslope="0.0" eslope="2.7"/><ph name="o~" start="1449" dur="12" pmax="253.3" pmin="217.4" pavg="234.8" pmed="232.1"emax="60.8" emin="46.1" eavg="54.4" emed="56.2" pslope="-1.9" eslope="-1.3"/></syl></Word><Word start="1461" end="1477" conf="0.996707" focus="F3" punct="." pos="Nc" name="dia"phseq="_di#6+" pmax="231.6" pmin="222.4" pavg="227.2" pmed="226.4" emax="61.6" emin="37.9"eavg="50.7" emed="52.9" pslope="0.8" eslope="1.3"><syl stress="y" start="1461" dur="14.5" pmax="229.3" pmin="222.4" pavg="225.5" pmed="225.0"emax="57.3" emin="37.9" eavg="48.8" emed="48.6" pslope="0.7" eslope="1.3"><ph name="d" start="1461" dur="7" pmax="0.0" pmin="0.0" pavg="0.0" pmed="0" emax="46.4"emin="37.9" eavg="43.2" emed="44.0" pslope="0.0" eslope="-0.5"/><ph name="i" start="1468" dur="7.5" pmax="229.3" pmin="222.4" pavg="225.5" pmed="225.0"emax="57.3" emin="50.8" eavg="54.4" emed="54.1" pslope="0.7" eslope="1.0"/></syl><syl start="1475.5" dur="2.5" pmax="231.6" pmin="231.1" pavg="231.3" pmed="231.3" emax="61.6"emin="59.3" eavg="60.5" emed="60.5" pslope="0.3" eslope="2.3"><ph name="6" start="1475.5" dur="2.5" pmax="231.6" pmin="231.1" pavg="231.3" pmed="231.3"emax="61.6" emin="59.3" eavg="60.5" emed="60.5" pslope="0.3" eslope="2.3"/></syl></Word></TranscriptWordList></TranscriptSegment>

Figure 3.15: Excerpt of one of the final XML files, containing prosodic information.


module, Audimus (Meinedo et al., 2008). By the time prosodic cues began being used in thisstudy, it was found that the automatic phone segmentation was not being completely wellperformed, so prosodic/acoustic cues were explored to improve a baseline phone segmentationmodule (Moniz et al., 2010). An analysis of the baseline results revealed problems in wordboundary detection. A solution to these was put in place using post-processing rules based onprosodic features (pitch, energy and duration).

A limited subset of the ALERT-SR corpus, containing about 1h of speech, was transcribedat the word boundary level8, in order to allow for the evaluation of the efficacy of the post-processing rules. With this sample one could evaluate the speech segmentation robustnesswith several speakers in prepared non-scripted and spontaneous speech settings, and withdifferent strategies regarding speech segmentation and speech rate.

The recognizer was used in a forced alignment mode on this reduced test set of 1h du-ration that was manually transcribed at the word boundary level. As explained above, thisrevealed several problems, namely in the boundaries of silent pauses, and in their frequentmiss-detection.

The post-processing rules achieved better results in terms of inter-word pause detection,durations of previously detected silent pauses, and also durations of phones at initial and finalsentence-like unit level. Moreover, these experiments showed that these improvements hadimpact both in terms of acoustic models and punctuation (Moniz et al., 2010). This work wasthe first step towards more challenging problems, namely to combine prosodic and lexical fea-tures for the identification of sentence-like units. It was also a determinant step towards thegoal of adding the identification of interrogatives to the punctuation module.

3.5.1 Post-processing rules

The post-processing rules were applied off-line, using both pitch and energy information.In terms of pitch values, the only information used was the presence or absence of pitch. Theenergy information was also extracted off-line for each audio file. Speech and non-speech por-tions of the audio data were automatically segmented at the frame-level with a bi-Gaussianmodel of the log energy distribution. That is, for each audio sample, a 1-dimensional energybased Gaussian model of two mixtures is trained. In this case, the Gaussian mixture with the“lowest” mean is expected to correspond to the silence or background noise, and the one withthe “highest” mean corresponds to speech. Then, frames of the audio file having a higher like-lihood with the speech mixture are labeled as speech and those that are more likely generatedby the non-speech mixture are labeled as silence.

The integration of extra information was implemented as a post-processing stage with four

8Work performed by Helena Moniz, an expert linguist.


!"#

$"#

%"#

&"#

'"#

("#

)"#

*"#

(# $!# $(# %!# %(# &!# &(# '!# '(# (!# ((# )!# )(# *!# *(# +!# +(# ,!# ,(# $!!#

!"#$%&'(

)*%+

,(-$.'/)

*%+,(-$0)12$'/2%3()4"/5)

-./01234/2#5/5167#89./4# -./01234/2#:/67#89./4###

Figure 3.16: Improvement in terms of correct word boundaries, after post-processing.

rules9:

1. if the word starts by a plosive sound, the duration of the preceding pause is unchanged(typically around 50 to 60 ms for European Portuguese);

2. if the word starts or ends by a fricative, the energy-based segmentation is used;

3. if the word starts with a liquid sound, energy and pitch are used;

4. otherwise, they are delimited by pitch.

With these rules, more adequate word boundaries than those with previous segmentationmethods were expected, without imposing thresholds for silent pause durations, which wererecognized by (Campione and Véronis, 2002) as misleading cues that do not account for differ-ences between speakers, speech rate or speech genres.

3.5.2 Results

By comparing the results in terms of word boundaries before and after the post-processingstage in the limited test set of 1h duration, it was found that 9.3% of the constituent initialphones and 10.1% of the constituent final phones were modified, in terms of boundaries. Re-garding the inter-word pauses, 62.5% of them were modified and 10.9% more were added.

9Proposed by Helena Moniz


Figure 3.17: Phone segmentation before (top) and after (bottom) post-processing.

Figure 3.16 illustrates the improvement in terms of correct boundaries, when different bound-ary thresholds are used. The graph shows that most of the improvements are concentrated inan interval corresponding to 5-60 ms. The manual reference has 443.82 seconds of inter-wordpauses, the modified version correctly identified more 67.71 seconds of silence than in the orig-inal one, but there are still 14.81 seconds of silence that were not detected.

Figure 3.17 shows an example of a silent pause detection corresponding to a comma. Theoriginal sentence is:

o Infarmed analisa cerca de quinhentos [medicamentos], os que levantam mais dúvidasquanto à sua eficácia/ Infarmed analyses about five hundred [drugs], those that raise moredoubts about their effectiveness.

Initial and final word phones are marked with “L-”, and “+R”, respectively, whereas frequentphone transition units are marked with “=”. The two automatic transcripts correspond to theresults obtained before (miss-detection) and after post-processing.

3.5.3 Impact on acoustic models training

A new acoustic model has been retrained using the modified phone boundaries after ap-plying the above mentioned rules. Using this second model, the WER decreased from 22.0% to21.5%.

The number of correct phone boundaries has also been compared for a given threshold inthe results produced by these two acoustic models, and Figure 3.18 shows the correspondingresults. The graph shows that the phone boundaries produced by the second acoustic modelare closer to the manual reference.

3.6. SUMMARY 53

!"#$

%#$

"#$

&#$

'#$

(#$

)#$

)$ "%$ ")$ &%$ &)$ '%$ ')$ (%$ ()$ )%$ ))$ *%$ *)$ +%$ +)$ ,%$ ,)$ -%$ -)$ "%%$

!"#$%&'(

)*%+

,(-$.'/)

*%+,(-$0)12$'/2%3()4"/5)

./01234503$606278$9:/05$ ./01234503$;078$9:/05$$$

Figure 3.18: Improvement in terms of correct word boundaries, after retraining.

3.6 Summary

This chapter described the most relevant data that has been used for the experiments per-formed in the scope of this thesis. Most of the experiments were conducted over broadcastnews data, but the relatively small size of the speech corpora soon demanded complementarydata, specially required for the capitalization task.

The current version of the Portuguese BN corpus has recently been completely revised byan expert linguist, thereby removing many inconsistencies, specially in terms of punctuationmarks, but also in terms of capitalization. That was particularly important given than theprevious version of this corpus was manually transcribed by different annotators, who did notfollow consistent criteria in terms of punctuation marks. The user annotation agreement, interms of punctuation marks, was calculated, using the two versions of the corpus.

Besides the Portuguese speech corpus, other languages were processed, namely Span-ish and English. The English BN corpus combine five different English BN corpora subsets.Each corpora subset has been produced in different time periods, built for different purposes,encoded with different annotation specification criteria, and is available in a different for-mat. Combining these heterogeneous corpora demanded a normalization strategy, specificallyadapted for each corpus.

Written corpora contains information that is specially important for capitalization, giventhat they provide information concerning the context where the capitalized words appear.


All written corpora were normalized in order to be closer to a speech transcript, this taskdemanded improving the existing normalization tools. Each corpus required a specially de-signed (or at least adapted) tool for dealing with specific phenomena. The Portuguese corporawas collected from the web, and the English corpus is available from the LDC.

The automatic transcripts for all speech corpora were produced by the L2F recognition sys-tem. The reference punctuation and capitalization for the automatic transcripts were providedby means of an alignment between the manual and the automatic transcripts. That is not atrivial task mainly because of the recognition errors.

The speech reference data has been upgraded recently in order to accommodate additionalprosodic information. This chapter described the prosodic feature extraction process and thecreation of a new data source, which accommodates additional prosodic information. The finalcontent is available as an XML file, containing not only pitch and energy, extracted directlyfrom the speech signal, but also phone information, syllable boundaries and syllable stress.The prosodic features (pitch, energy and duration) were then used to adjust word boundaries,automatically identified by the speech recognition system, using post-processing rules.

4Capitalization Recovery

Many information sources, like newspaper articles, books, and most web pages, containproper capitalization. The capitalization task consists of rewriting each word of an input textwith its proper case information given its context. Besides improving the readability of texts,capitalization provides important semantic clues for further text processing tasks. Differentpractical applications benefit from automatic capitalization as a preprocessing step: many com-puter applications, such as word processing and e-mail clients, perform automatic capitaliza-tion along with spell corrections and grammar check; and while dealing with speech recog-nition output, automatic capitalization provides relevant information for automatic contentextraction, NER, and MT.

This chapter focus on the automatic recovery of capitalization information both in writtennewspaper corpora and in spoken transcripts. This study considers that the capitalization ofthe first word of each sentence is performed in a separated processing stage (after punctuationfor instance), since its correct orthographical form depends on its position in the sentence.Results described here do not consider the first word of the sentence for evaluation. However,results may be influenced when taking such words into account (Kim and Woodland, 2004).

This chapter is structured as follows: Section 4.1 analyses the subject of the capitalizationtask in a written corpora basis. Section 4.2 reports on early work comparing different capital-ization approaches, and choosing the best approach for processing speech transcripts. Section4.3 goes deeper, studying the effects of the language dynamics on the capitalization task. Sec-tion 4.4 describes the efforts to understand the best way how capitalization models should beupdated. Section 4.5 presents the most recent work on capitalization. Section 4.6 reports on thework porting the capitalization task to other languages. Finally, Section 4.7 summarizes all thecontent of this chapter.

4.1 Capitalization Analysis Based in Written Corpora

The orthographical form of a given word can be classified as: lowercase (e.g., verbs, func-tional words, common nouns), first-letter-capitalized (e.g., proper nouns), uppercase (e.g.,acronyms), and mixed-case (special words, such as McGyver or LaTeX). Many words assume a

56 CHAPTER 4. CAPITALIZATION RECOVERY

!"#$%&

'$#'%&

'#(%&'#$%&

)#"%&

*+,-./01-&

2.134566-.&

0**4566-.&

789-:4/01-&

07;8<5+51&

Figure 4.1: The different capitalization classes and their distribution in the PUBnews corpus.

fixed capitalization form, unless they appear in the beginning of a sentence or in a title writtenin uppercase. Many other words may assume different capitalization forms, depending on thecontext they are used (e.g., the English words bank/Bank or miles/Miles).

This section describes the frequency of each capitalization class in a corpus, and calcu-lates the portion of words with ambiguous/unambiguous capitalization. The results reportedwere achieved using the PUBnews newspaper corpus, described in Section 3.2.1. The train-ing portion of the corpus contains about 700K different word forms, but most of them rarelyoccur, as predicted by Zipf’s Law. In fact, about 46% of these word forms occur only once(hapax legomena), and many of them consist of spelling errors (e.g., abandomos instead of aban-donos); verbs with enclitic pronouns (e.g., abater-se-á); unusual strings of words that were connectedby hyphens for some discursive reason (e.g., a-comida-que-veio-do-mar/food-that-came-from-the-sea); words used in uppercase for emphasis purposes, which would otherwise written in lowercase(e.g., ABANDONADO); foreign words (e.g., Abkykalykov); and other less common words (e.g.,abismaticamente, abarcavam). Some of these words are easy to capitalize, for example theverbs with enclitic pronouns are always written in lowercase. There are 27 different enclitic pro-nouns that can be used in the same verbal form, which originates a very large number of word-forms, but their identification is straightforward, at least in written corpora.

4.1.1 Capitalization Ambiguity

The capitalization ambiguity was analyzed by considering unique case-folded words oc-curring in the training portion of the corpus, and then finding all their possible graphical forms.For this study, a word is considered unambiguous in respect to its capitalization, if it occursin the corpus with the same orthographical form at least 99% of the times. This assumptionworks well for high frequency words, but gets less accurate for lower frequency words. Forthat reason, only case-folded words occuring at least 5 times were considered, also reducingthe influence of spelling errors and unusual words connected by hyphens. Figure 4.1 illus-

4.1. CAPITALIZATION ANALYSIS BASED IN WRITTEN CORPORA 57

!"

#!!!"

$!!!!"

$#!!!"

%!!!!"

%#!!!"

&!!!!"

&#!!!"

'!!!!"

'#!!!"

$" %" &" '" #" (" )" *" +" $!" $$" $%" $&" $'" $#" $(" $)" $*"

!"#

$%&'(

)'*(&+,'

-(&+')&%."%/01'2/3%&456'

,-./01234/4"5264." 7264."589:"-;<8=>2>."?-@89-38A-B2C"

Figure 4.2: Number of words by frequency interval in the PUBnews corpus.

trates the proportion of each capitalization class in the corpus. The graph shows that mostwords in this corpus are written in lowercase, but sill a large percentage is capitalized, andthe first-letter-capitalized (first-upper) is the most common capitalization form. The number ofmixed-case words is quite similar to the number of uppercase words, and together they consti-tute the smaller subsets. The capitalization problem concerns the portion of ambiguous wordsin respect to their capitalization. The figure shows that only a small portion of the words (about9.6%) fits into this category, still they are representative.

The relation between the word frequency and the capitalization ambiguity is an importantissue, specially for the less frequent words. Such relation provides important knowledge con-cerning the ability of predicting the capitalization ambiguity of rare and unseen words. Figure4.2 shows the number of different case-folded words and ambiguous words in terms of capi-talization, by frequency interval. In order to achieve more meaningful results, the analysis isperformed using frequency intervals of the form Fi =

[2i, 2i+1[, where the size of each inter-

val Fi grows exponentially, corresponding to intervals of size 2i. For example, interval [16, 31]contains about 30k different case-folded words, 2140 of them are ambiguous in terms of capi-talization. As expected, the figure shows (upper curve) that the number of case-folded wordswith a given frequency decreases exponentially when moving from lower to higher word fre-quency intervals, thereby confirming the well known Zipf’s law. The figure also shows (lowercurve) that the number of words with capitalization ambiguity does not follow the same ten-dency. On the contrary, their number grows from the intervals F3 to F7 (frequency between 8and 255), despite the first intervals containing far more different case-folded words.


!"

#"

$!"

$#"

%!"

%#"

$ % & ' # ( ) * + $! $$ $% $& $' $# $( $) $*

!"#$"%

&'(")*+),

*#-.

/*#-)+#"01"%$2)3%&"#4'5

,-./01234/456274. 8274.569:;5-<=9>?2?.5@-A9:-39B-C2D

Figure 4.3: Distribution of the ambiguous words by word frequency interval.

This corpus contains about 189K different words after case-folding, and considering onlyoccurrencies above 5. The total number of ambiguous words in terms of capitalization is about9.6%, which corresponds to about 18K. Figure 4.2 has shown that lower frequency intervalscontain much more words than higher frequency intervals, while the capitalization ambiguitydistribution does not have the same configuration. That is illustrated better in Figure 4.3, whichshows the word distribution per interval, considering the whole corpus. The figure reveals thatmost of the ambiguous words are neither uncommon nor highly frequent, while it confirmsthat most of the words have lower frequency. Finally, is interesting to calculate the proportionbetween ambiguous and unambiguous words in terms of capitalization, for a given frequencyinterval. Figure 4.4 illustrates such relation, revealing that words having frequency around 214

(16K) are likely to be more ambiguous in terms of capitalization.

One important conclusion coming from these results is that a suitable capitalization modelmust consider the context where each word appears, and a simple capitalization lexicon is notsufficient to deal with the capitalization ambiguity of many frequent words occurring in writtencorpora. Nevertheless, a capitalization lexicon containing the most frequent capitalization formof each word may be of use for dealing with out-of-vocabulary (OOV) words when retrainingis not possible.

4.2 Early Work Comparing Different Methods

An important preparatory step performed in the scope of this thesis consisted of assert-ing which methods would apply better to the capitalization problem, considering the different

4.2. EARLY WORK COMPARING DIFFERENT METHODS 59

0%

10%

20%

30%

40%

50%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Cap

italiz

atio

n am

bigu

ity

Word frequency interval

Figure 4.4: Proportion of ambiguous words by word frequency interval.

types of corpora. In addition to the maximum entropy modeling approach, described in Sec-tion 2.3, we have explored two additional approaches, namely: (1) an HMM-based tagger, asimplemented by the disambig tool from the SRILM toolkit (Stolcke, 2002); (2) a transducer,built from a previously created language model (LM). The two additional approaches are gen-erative (joint), while the maximum entropy approach is discriminative (conditional).

Discriminative approaches model the posterior probabilities directly, and the parametervalues are inferred from the set of labelled training data. On the other hand, generative ap-proaches model the joint distribution p(k, X) of events and labels, which can be done, for in-stance, by learning the class prior probabilities p(k) and the class-conditional densities p(X|k)separately, and, finally, by using the Bayes’ theorem to calculate the posterior probabilitiesp(k|X). Generative models are very good at modeling context, and they can handle missingdata or partially labelled data. Nevertheless, all other things being equal, it would be expectedthat discriminative methods would have better predictive performance, since they are trainedto predict the class label rather than the joint distribution of input vectors and targets (Ulusoyand Bishop, 2005). The following subsections provide details on the generative methods andpresent results comparing the three methods.

4.2.1 Description of the Generative Methods

The two generative approaches start with an n-gram language model (LM), created fromthe training corpus. The HMM-based approach uses the language model directly, while theWFST-based approach requires creating a transducer from it. Our experiments use unigram,bigram, and trigram language models, created using backoff estimates, as implemented by thengram-count tool of the SRILM toolkit, without n-gram discounts.


Training process Capitalization process

Corpus

Count ngrams

LanguageModel

Lower-casesentence

HMM tagger

Capitalizedsentence

ana ana Ana ANAcanto canto Canto CANTOluís luís Luís LUÍSfaria faria Faria FARIA...

Map

Figure 4.5: Using the HMM-based tagger.

Despite the fact that the capitalization experiments here described rely solely on word in-formation, a strong disadvantage of using generative methods is the difficulty of benefitingfrom a variety of features, such as part-of-speech tags or acoustic/prosodic features.

4.2.1.1 HMM-based Tagger

The HMM-based tagger, implemented by the disambig tool, uses a hidden-event n-gramLM (Stolcke and Shriberg, 1996), and can be used to perform capitalization directly from theLM. Figure 4.5 illustrates the process, where each cloud represents a process and ellipses repre-sents data. Map represents a file that contains all possible orthographical forms of words in thevocabulary. The idea consists of translating a stream of tokens from a vocabulary L (lowercasewords) to a corresponding stream of tokens from a vocabulary C (capitalized words), accordingto a 1-to-many mapping. Ambiguities in the mapping are resolved by finding the C sequencewith the highest probability given the L sequence. This probability is computed from the LM1.

This implementation of the HMM-based tagger can use different algorithms for decoding.However, results presented here use the Viterbi decoding algorithm, where the output is thesequence with the highest joint posterior probability. This is a straightforward method, pro-ducing fast results, and often used by the scientific community for this task. For example, ithas been the baseline suggested in the IWSLT workshop competitions2.

4.2.1.2 Finite state transducers

The capitalization based on Weighted Finite State Transducers (WFST) is illustrated in fig-ure 4.6. This approach makes use of the LM previously built for the HMM-based tagger, which

1See disambig manual for more information.2http://www.slt.atr.jp/ IWSLT2006/downloads/case+punc_tool_using_SRILM.instructions.txt


Capitalization process

Lower-casesentence

Capitalizedsentence

FSA (S)

Text to FSA

bestpath (SoT)

Wfst to text

Training process

Capitalizationwfst (T)

FSA

Change input to lowercase

LM to FSAConvertion

Language Model

Figure 4.6: Using a WFST to perform capitalization.

is converted into an automaton (FSA), corresponding to a WFST having the input equal to theoutput. The capitalization transducer T is created from this last WFST by converting everyword in the input to its lowercase representation. Notice that the input of the transducer Tuses a lowercase vocabulary while the output includes all orthgraphical forms. In order to cap-italize a given input sentence, it must be firstly converted into an FSA (S) and then composedwith the transducer T. The resultant transducer contains all possible sequences of capitalizedwords, given the input lowercase sequence. The bestpath() operation over this compositionreturns the most probable sequence of capitalized words.

From a more theoretical point of view, the capitalization process consists of calculating thebest sequence of capitalized tokens c ∈ C∗ for the lowercase sequence l ∈ L∗, as expressed inequation 4.1.

c = argmaxc∈C∗

P (c|l) (4.1)

using Bayes’ rule:

P(c|l) =P(l|c) P(c)

P(l)=

P(l, c)P(l)

(4.2)

Assuming that P(l) is a constant, the capitalization process consists of maximizing the result ofP(l|c) ∗ P(c) or P(l, c) as expressed by equation 4.3.


c = argmaxc∈C∗

P (l, c) (4.3)

In terms of transducers, the prior P(c) can be computed from the FSA built from the LM,and P(l|c) is computed from the FSA built from the sentence. The composition SoT containsall possible capitalization sequences c for the input sequence l, and the P(l, c) can be computedfrom all paths associated with sequence c. The Viterbi approximation is used, therefore thebestpath() operation over the composition returns the c sequence that maximizes P(l, c).

4.2.2 Comparative Results

The three methods provide different ways of performing automatic capitalization. How-ever, while the generative methods require a predefined vocabulary, this is not the case of thediscriminative method. In order to compare all three methods, the training data was restrictedto a vocabulary. The vocabulary contains almost all the words that can be found in the speechtranscripts. Notice that the latest versions of the ASR use a dynamic vocabulary. As a closedvocabulary is used, all words outside the vocabulary were marked “unknown”. The punctua-tion marks were also removed from the corpus, bringing the written newspaper corpus closerto speech transcripts. The out-of-vocabulary (OOV) words include proper nouns, domain-specific words, and unusual verbal forms, but their capitalized form is usually fixed. Hence,apart from verbs, most of them can be handled with domain-specific and periodically updatedlexica. These experiments used only the word identification information, sometimes combinedas bigrams and trigrams.

The capitalization of mixed-case words is simple to accomplish using the generative meth-ods, but increases considerably the discriminative model complexity. For that reason, theseexperiments considered only three ways of writing a word: lowercase, uppercase, and first-capitalized, not covering the mixed-case words. The experiments were conducted both on writ-ten newspaper corpora and on spoken transcripts, making it possible to analyze the impact ofthe different methodologies over these two different data sets. The first set of experiments wasperformed on written newspaper corpora, using the PUBnews corpus both for training andevaluation, allowing to establish an upper-bound for capitalization. Written newspaper corpusand spoken transcripts were combined in order to provide richer training sets and to reducethe problem of having small quantities of spoken data for training. The spoken data evalua-tion subset corresponds to merging Eval, JEval and RTP07, but only manual transcripts wereused because, by the time these experiments were performed, the automatic speech transcriptdid not include reference case information. Results achieved using only the most common or-thographical form were included in the experiments, which is a popular baseline for similarwork (Lita et al., 2003; Chelba and Acero, 2004; Kim and Woodland, 2004; Agbago et al., 2005).


LM options unigrams bigrams trigramsLM size (MB) 3.2 198 504

Table 4.1: Different LM sizes when dealing with PUBnews corpus.

LM options HMM-based tagger WFSTPrecision Recall F SER Precision Recall F SER

unigrams 89.6% 61.5% 72.9% 45.2% 85.0% 64.4% 73.3% 46.3%bigrams 92.0% 72.0% 80.8% 33.8% 89.5% 73.1% 80.5% 34.8%trigrams 93.3% 74.5% 82.8% 30.4% 91.7% 75.3% 82.7% 30.9%

Table 4.2: Capitalization results of the generative methods over the PUBnews corpus.

4.2.2.1 The generative approaches

An LM created from a large written newspaper corpus may include spelling errors and rarewords, which combined as bigrams and trigrams increases the size of the LM without muchgain. Thus, all bigrams and trigrams occurring less than 4 times were removed from LMs builtfrom the PUBnews training data. Doing so, a significant reduction in the LM size is achievedwithout much impact on the results. Table 4.1 shows the size of each LM, after this restriction,depending on the building options.

Table 4.2 shows results achieved by training and testing on written newspaper corpus,where F corresponds to F-measure. The left side of the table shows results produced by theHMM-based tagger, while the right side shows equivalent results produced using the WFSTapproach (transducers), for the same training and testing data. Similar results were expectedfrom both methods, since the transducers were built from exactly the same LM, neverthelessthe HMM-based tagger achieves a slightly better performance. As expected, results improveas the LM order increases: the best results were achieved using trigram models, however thelargest difference occurs when moving from unigrams to bigrams. We have performed otherexperiments using 4-grams, but results do not improve significantly. That is consistent with thework of Gravano et al. (2009), that concludes that increasing the n-gram order does not improvecapitalization results. Despite no spelling errors exist in the ASR output, recognition errorsand disfluencies are quite frequent, specially in spontaneous speech. For this reason, resultson a written newspaper corpus should be taken as an upper-bound for the capitalization overautomatic speech transcripts.

The spoken data is insufficient for training, so both PUBnews and the ALERT-SR BN train-ing data were combined in order to provide a richer LM. The final LM is a linear interpolationbetween: LM1 - built from PUBnews training data; and LM2 - built from the ALERT-SR train-ing data, where the interpolation parameter lambda was 0.737121 when considering trigrams(perplexity = 174.5) and 0.690977 when considering bigrams (perplexity = 267.7). Previouslambda values, calculated using the compute-best-mix tool (included in the SRILM toolkit),


LM options HMM-based tagger WFSTPrecision Recall F SER Precision Recall F SER

unigrams 85.9% 72.5% 78.6% 39.3% 85.9% 72.7% 78.8% 39.2%bigrams 82.5% 81.9% 82.2% 35.1% 82.6% 82.0% 82.3% 34.8%trigrams 81.8% 83.4% 82.6% 34.8% 81.9% 83.3% 82.6% 34.7%

Table 4.3: Capitalization results of the generative methods over the ALERT-SR corpus.

minimize the perplexity of the interpolated model, considering the BN corpus developmentsubset, not previously used for training. Perplexity is a common way of evaluating languagemodels, which are probability distributions over texts. Given a probability model q, one mayevaluate q by asking how well it predicts a separate test sample x1, x2, ..., xN . The perplexity of

the model q is defined as 2−∑Ni=1

1N log2 q(xi). Better models will tend to have a lower perplexity,

which means they are less surprised by the test sample (Jelinek et al., 1977; Brown et al., 1992).

Table 4.3 shows results for capitalization of speech transcripts, where the left side of thetable shows results achieved by the HMM-based tagger and the right side shows equivalent re-sults achieved using a transducer. Results reveal a decrease of performance when moving fromwritten newspaper corpora to speech transcripts. The precision values decrease significantly,while the recall increases because of the vocabulary restrictions of the spoken transcriptions.The best results are produced with trigrams, but they are very similar to the results obtainedfrom bigrams, given the flexible linguistic structure of spoken texts, in opposition to writtentexts. Since the written newspaper corpora have properties different from speech transcripts,the availability of more spoken training data would certainly improve these results.

In short, the two generative methods applied produce similar results. Nevertheless, thecurrent implementation of the WFST method implies loading, composing and searching a largenon-deterministic transducer for each sentence, thus being the method computationally mostexpensive among those here proposed. This process was accelerated in the latter experimentsby considering blocks of n sentences (e.g., 1K sentences) and applying the capitalization trans-ducer to the whole block. The automaton created from the block contains a single path thatgoes from the initial to the final state, covering the whole text sequence. The time required forthe composition operation is only a few seconds, and it is performed once for each group ofn sentences. Using this strategy, the time required for this method became similar to the timetaken by the HMM tagger.

4.2.2.2 The discriminative approach

Experiments concerning the discriminative approach, presented in this subsection, use thefollowing unigram and bigram features for a given word: wi, 〈wi−1, wi〉, and 〈wi, wi+1〉. Ad-ditional experiments using trigrams were also initially performed, but no improvements wereachieved. The memory limitations mentioned in Section 2.3 make it difficult to use all writ-


Training approach n-gram counts successive retrainingPrec. Recall F SER Prec. Recall F SER

PUBnews corpus 92.0% 73.8% 81.9% 32.4% 93.2% 72.5% 81.6% 32.6%ALERT-SR corpus 87.4% 80.5% 83.8% 31.0% 87.2% 80.5% 83.7% 31.2%

Table 4.4: ME-based capitalization results using limited vocabulary.

Training approach n-gram counts successive retrainingPrec. Recall F SER Prec. Recall F SER

PUBnews corpus 93.3% 82.2% 87.4% 23.3% 93.2% 83.2% 87.9% 22.5%ALERT-SR corpus 87.3% 85.4% 86.3% 26.8% 86.5% 85.9% 86.2% 27.2%

Table 4.5: ME-based capitalization results using unlimited vocabulary.

ten newspaper corpus for training. Therefore, the following experiments use the two differentstrategies, described in the Section 2.3.2: i) use all training data, by extracting n-gram countsand then producing features for each corresponding n-gram; ii) Perform successive retrainingover all training data, using blocks of fixed size (2.5 million words). All the words occurringat least four times in the PUBnews corpus were used for training. Table 4.4 shows the corre-sponding results, revealing similar performances in terms of SER for both strategies. In whatconcerns the BN speech transcripts, the ALERT-SR training data was used together with thePUBnews training data in order to create the ME models. For written corpora, the first strategyachieves a better recall, while the second one achieves a better precision, but results are quitesimilar. Results concerning the speech transcripts reveal a lower precision but a better recallwhen comparing to written corpora. The features used in these experiments are more expres-sive than simple bigrams and less expressive than trigrams. This statement is supported bythe achieved results, given that the performance is better than using bigrams with the gener-ative approaches, and worse than using trigrams with the generative approaches. While thegenerative approaches are more adequate for capitalizing written newspaper corpora, the dis-criminative approach produces better results for the BN transcripts, corresponding to the bestresults seen so far. The second strategy learns the most common capitalization combinationsappearing in the corpus, being suitable for the flexible linguistic structure found in the speechtranscripts, specially in spontaneous speech.

4.2.3 Results using Unlimited Vocabulary

Data from previous subsection, was restricted to a limited vocabulary, in order to allowcomparing all the three methods. However, such restriction is only imposed by the genera-tive approaches, not being required by the discriminative approach. Therefore, another set ofexperiments has been performed without any vocabulary restrictions, using uniquely the dis-criminative approach. All the words occurring less than four times in the PUBnews trainingcorpus have been removed, minimizing the effects of misspelled words and reducing memory


limitations. Table 4.5 shows the achieved performance, where the capitalization model appliedto the BN transcripts was also retrained with BN speech transcripts (ALERT-SR corpus) beforebeing applied to the BN data. Results reveal an expected increase of performance, specially interms of recall, when comparing with results from Table 4.4. Differences are significant, spe-cially concerning the written corpus (about 10% absolute). In fact, this corpus contains a verylarge number of different words, previously considered as OOV words. The speech data con-tains much less OOV words, but the differences (about 4% absolute) are still quite significant.

4.3 Impact of Language Dynamics

In order to better understand the way we should train our capitalization models, we havestarted by analyzing the newspaper corpus for establishing a relation between the vocabularyusage and the time-line. The major goal with these experiments is to assess the importance oftraining data collected in nearby testing periods for capitalizing written corpora, manual andautomatic transcripts. The first set of experiments concern the capitalization of written corpora.The capitalization over Broadcast News (BN) orthographic transcripts is addressed afterwards.The major source of capitalization information is always provided by written newspaper cor-pora, and the evaluation was conducted on three different subsets of speech transcripts, col-lected from different time periods.

Experiments described in this section were performed using the discriminative modelingapproach, described in Section 2.3, following the retraining strategy. Like the experiments fromthe previous section, experiments in this section used only features comprising word identifica-tion, sometimes combined as bigrams: wi (current word); 〈wi−1, wi〉, 〈wi, wi+1〉 (bigrams). Onlywords occurring more than once were included for training, and only three ways of writing aword were explored: lowercase, uppercase, and first-capitalized.

4.3.1 Data Analysis

In order to verify the relation between the vocabulary usage and the time line, we havestarted by analysing the PUBnews newspaper corpus. Each subset of about 2.5 million wordsfrom the newspaper corpus contains about 86K unique words, where only about 50K occurmore than once. In order to assess the relation between the word usage and the languageperiod, a set of vocabularies were created, with the 30K more frequent words appearing ineach training set (roughly corresponds to a frequency above 3). Then, the first and last corporasubsets were checked against each one of the vocabularies. The name of each vocabulary is thesame as the corresponding corpora subset, and corresponds to the month of the latest data inthat subset. Figure 4.7 shows the results, revealing that the number of OOV (Out of VocabularyWords) decreases as the time gap between the train and test periods gets smaller.

4.3. IMPACT OF LANGUAGE DYNAMICS 67

80k

90k

100k

110k

120k

130k

140k

1999

-01

1999

-04

1999

-07

1999

-10

1999

-12

2000

-03

2000

-05

2000

-07

2000

-10

2000

-12

2001

-02

2001

-05

2001

-07

2001

-09

2001

-11

2002

-01

2002

-03

2002

-05

2002

-08

2002

-10

2003

-01

2003

-04

2003

-07

2003

-09

2003

-12

2004

-02

2004

-04

2004

-06

2004

-08

2004

-11

OO

V

Vocabulary period

2004-12 1999-01

Figure 4.7: Number of OOVs considering written corpora.

2.5%

3.0%

3.5%

4.0%

4.5%

1999

-01

1999

-04

1999

-07

1999

-10

1999

-12

2000

-03

2000

-05

2000

-07

2000

-10

2000

-12

2001

-02

2001

-05

2001

-07

2001

-09

2001

-11

2002

-01

2002

-03

2002

-05

2002

-08

2002

-10

2003

-01

2003

-04

2003

-07

2003

-09

2003

-12

2004

-02

2004

-04

2004

-06

2004

-08

2004

-12

OO

V

Vocabulary Period

Eval JEval RTP07

Figure 4.8: Proportion of OOVs considering speech transcripts.

The same conclusion also applies to speech transcripts, but the visualization is not so ob-vious. Using the same 57 previously calculated vocabularies, we have counted the number ofwords appearing in each testing set of the speech corpus that were not covered by each one ofthe vocabularies. Figure 4.8 shows the corresponding results, where for each one of the testingperiods the closest training periods are marked with a circle. The graph shows that, for eachevaluation set, vocabularies built with nearby data have a better coverage of the data.

These results suggest that lexical information changes whenever the language periodchanges. But, how does this affect the capitalization task? The rest of this section addressesthis question, by presenting capitalization experiments with different time conditions.


Train 2000-12 test set 2004-12 test setPrecision Recall F SER Precision Recall F SER

1999 93.3% 79.0% 85.6% 26.1% 92.4% 76.3% 83.6% 29.6%2000 93.5% 80.1% 86.3% 25.0% 92.4% 76.8% 83.9% 29.1%2001 93.7% 80.0% 86.3% 25.0% 92.9% 76.4% 83.8% 29.1%2002 93.2% 79.2% 85.6% 26.1% 92.5% 78.2% 84.8% 27.7%2003 93.1% 77.7% 84.7% 27.5% 93.0% 78.2% 85.0% 27.3%2004 92.5% 78.0% 84.6% 27.8% 92.5% 79.7% 85.6% 26.4%

Table 4.6: Using the first 8 subsets of each year for training.

!"#$

!%#$

!&#$

'(#$

)***$ !((($ !(()$ !((!$ !(('$ !(("$

!"#$

%&'()()*$+,&(-.$

!(((+)!$,-./$.-/$012$ !(("+)!$,-./$.-/$012$

Figure 4.9: Performance for different training periods.

4.3.2 Capitalization of Written Corpora

The following experiments use the PUBnews newspaper corpus, where the first 57 subsetswere used for training and last subset is used for evaluation. The punctuation marks wereremoved from the original text, and only events occurring more than once were included fortraining.

4.3.2.1 Separated training using one year of training data

In order to assess how time affects the capitalization performance, the first experimentsconsist of producing six isolated capitalization models, one for each year of training data. Foreach year, the first 8 subsets were used for training and the last one was used for evaluation.Table 4.6 shows the corresponding capitalization results for the first and last testing subsets, re-vealing that performance is affected by the time lapse between the training and testing periods.The best results were always produced with nearby the testing data, even if it is not so obviousfor the first test set. The performance results for each training period are also illustrated in Fig-ure 4.9. A similar behavior was observed on the other four testing subsets, corresponding to thelast subset of each year, but results are not presented here for simplicity reasons. Results reveala degradation of performance when the training data is from a time period after the evaluation


Checkpoint LM #lines Precision Recall F SER1999-12 1.27 Million 92.4% 77.0% 84.0% 29.0%2000-12 1.86 Million 92.5% 79.3% 85.4% 26.6%2001-12 2.36 Million 93.0% 79.9% 86.0% 25.7%2002-12 2.78 Million 93.2% 80.8% 86.6% 24.7%2003-12 3.10 Million 92.9% 82.2% 87.2% 23.6%2004-08 3.36 Million 93.2% 83.2% 87.9% 22.5%

Table 4.7: Retraining from Jan. 1999 to Sep. 2004.

20%

25%

30%

35%

40%

45%

1999-01 2000-07 2001-11 2003-04 2004-08

SE

R

Checkpoint data

Forward Training Backward Training

Figure 4.10: Capitalization of written corpora, using forward and backwards training.

data. Results presented in the last row concerning the 2004-12 evaluation set are about 3.9%worse (in terms of SER) than those presented in Section 4.2.3 for the same test set and using thesame training conditions. This is mostly due to a low coverage of the training data, revealingthat 20 million words training sets do not provide sufficient data for the capitalization task.

4.3.2.2 Forward and Backward Retraining

Previous results used one year of training data, by iteratively retraining the previously cal-culated capitalization models with the new data. The following results use the same retrainingstrategy, but considering all the training corpus. Table 4.7 shows the results achieved with thisapproach for the PUBnews test set (Oct. to Dec. 2004), revealing higher performance as moretraining data is available.

The following experiment shows that, besides the amount of data, the training order is im-portant. In fact, previous results could suggest that the increase of performance comes uniquelyfrom the increasing number of training events. For that reason, another experiment has beenperformed, using the same training and testing data, but retraining backwards, similarly to


20%

25%

30%

35%

40%

45%

1999-01 2000-07 2001-11 2003-04 2004-08

SE

R

Training data period

PUBnews test Eval JEval RTP07

Figure 4.11: Automatic capitalization of speech transcripts, using forward retraining.

what has been performed by Mota and Grishman (2009) for NER. Corresponding results areillustrated in Figure 4.10, revealing that the backwards training results are worse than forwardtraining results, and that backward training results do not always increase, rather they stabilizeafter a certain amount of data. Despite the fact that both training experiments use all the train-ing data, in the case of forward training the time gap between the training and testing data getssmaller for each iteration, while in the backwards training it grows. Similar conclusions wereachieved by Mota and Grishman (2009) for their NER tagger based on co-training, where thegain was higher when older seeds and contemporary unlabeled data was used, instead of usingcontemporary seeds and older unlabeled data. Figure 4.10 also attest that a retraining-basedstrategy is suitable for using large amounts of data and dealing with language adaptation.

4.3.3 Capitalization of Speech Transcripts

The following experiments apply the previous capitalization models, learned from writtencorpora, directly to the ALERT-SR evaluation data. The impact of the speech recognition errorsin the capitalization of speech data is obtained by performing the same experiments both inmanual and automatic transcripts.

4.3.3.1 Manual transcripts

In order to assess the relation between the data period and the capitalization performanceof the speech data, the forward trained capitalization models, used previously for written cor-pora (see Figure 4.10), were also applied to each one of the SR corpus evaluation subsets. Figure4.11 shows the corresponding performance variations for each speech data evaluation set, in


Training Eval JEval RTP07Data Prec. Recall SER Prec. Recall SER Prec. Recall SER1999 83.2% 80.0% 35.8% 86.3% 83.8% 29.4% 92.5% 80.3% 26.1%2000 83.4% 79.6% 35.9% 86.3% 84.4% 28.9% 92.4% 80.9% 25.7%2001 84.3% 80.2% 34.5% 86.7% 86.7% 26.4% 93.1% 79.6% 26.2%2002 84.1% 79.5% 35.5% 86.2% 86.1% 27.5% 93.2% 80.6% 25.1%2003 82.9% 78.9% 37.3% 85.9% 85.4% 28.4% 91.5% 81.9% 25.6%2004 83.9% 78.9% 36.1% 86.5% 84.7% 28.3% 92.4% 82.2% 24.4%All 83.3% 80.7% 35.2% 85.7% 88.4% 26.2% 92.2% 84.5% 22.4%

Table 4.8: Evaluating with manual transcripts.

terms of SER. The figure suggests that, for each test set, the performance grows until the cor-respondent time period is reached, but it does not significantly improve after that period. Thecontinuously growing performance for the PUBnews and RTP07 test sets is related with theirtime period being later to the training data. The graph also shows that the performance is sim-ilar both to written corpora and to speech transcripts. However, the performance evolutionconcerning the written corpora is smoother and steeper.

Another relevant question concerns the amount of training data required for achieving thebest results. Is it necessary to use all the training data? Besides using all corpora for training,some experiments were also conducted using only the first 8 corpora subsets of every yearfor training (about 20 million words). Another set of experiments, either applying the writtencorpora LMs directly to the transcripts data or retraining it with transcripts before applying it,revealed a better performance for the later, due to closer properties between the training andthe testing sets. Hence, we have retrained the ME models calculated from written corpora withmanual transcripts training data, thus achieving 2% to 5% improved performance. Table 4.8shows the corresponding results, in terms of SER, Precision and Recall, where bold marks thebest results for each corpus subset. The first 6 rows corresponds to the initial training with thefirst 8 corpora subsets of each year, while the last row corresponds to using all training data.The table shows that when all data is used, a better recall is achieved whereas the precisionslightly decreases. In general, we may conclude that, for capitalizing manual transcripts, largeamounts of training data are not necessary, if recent data is available. Results also show thatthe RTP07 test subset consistently presents best performances in opposition to the Eval subset.Nonetheless, the worse performance for the Eval and JEval sets is also due to the unusual topicscovered in the news by that time (US presidentials and War on Terrorism). Notice that a numberof unusual foreign names were brought to focus by the media organizations by that time.

4.3.3.2 Automatic transcripts

Table 4.9 shows the results of capitalizing automatic transcripts with the LMs also used forthe results of Table 4.8. As shown in the table, overall results are about 20% worse in terms of


Training Eval JEval RTP07Data Prec. Recall SER Prec. Recall SER Prec. Recall SER1999 71.9% 72.9% 55.2% 74.6% 76.4% 49.4% 79.4% 72.7% 46.0%2000 72.1% 72.7% 55.1% 74.8% 77.2% 48.5% 79.5% 73.0% 45.5%2001 72.8% 73.4% 53.7% 74.9% 78.0% 47.8% 79.7% 72.0% 46.1%2002 72.9% 73.0% 54.0% 74.7% 77.9% 48.1% 80.4% 72.9% 44.7%2003 71.8% 72.4% 55.8% 74.2% 77.1% 49.3% 79.2% 73.7% 45.5%2004 72.5% 72.8% 54.6% 74.8% 76.5% 49.0% 80.3% 73.7% 44.3%All 71.6% 73.6% 55.2% 73.7% 79.5% 48.4% 79.2% 75.2% 44.3%

Table 4.9: Retraining with manual and evaluating with automatic transcripts.

23% 28% 33% 38% 43% 48% 53% 58%

1999

2000

2001

2002

2003

2004

SER

Eval

Manual Transcriptions ASR output 19

99

2000

2001

2002

2003

2004

JEval 19

99

2000

2001

2002

2003

2004

RTP07

Figure 4.12: Comparing the capitalization results of manual and automatic transcripts.

SER, which is mostly due to the recognition WER (Word Error Rate). Besides, some errors mayalso be due to unsolved alignment problems. In fact, more accurate results would be achievedif a manually corrected capitalization alignment could be used. These results also suggest astrong relation between the performance and the training period. The distribution of values issimilar to the previous results concerning manual transcripts. Figure 4.12 illustrates the relationbetween training and testing periods, for both manual and automatic transcripts. A number ofadditional experiments were also conducted, for example, by retraining each written corporacapitalization model with automatic transcripts before applying to automatic transcripts, butonly small differences were achieved, sharing the same tendency.

4.3.4 Conclusions

We have studied the impact of language variations over time and the way it affects the cap-italization task. The results reveal a strong relation between the capitalization performance andthe elapsed time between the training and testing data periods. Reported experiments suggestthat the capitalization performance decays over time, as an effect of language variation, sup-porting the idea that different capitalization models should be used for different time periods.

4.4. CAPITALIZATION MODEL ADAPTATION 73

That has also been previously addressed by related work on NER, with similar conclusions(Collins and Singer, 1999; Mota and Grishman, 2008, 2009). The adopted approach, based onmaximum entropy models, takes the language changes into consideration, providing a cleanframework for learning with new data, while slowly discarding unused data.

4.4 Capitalization Model Adaptation

The application of automatic speech recognition (ASR) to close-captioning of BroadcastNews (BN) has motivated the adaptation of the lexical and language models of the recognizeron a daily basis with text material retrieved from online newspapers (Martins et al., 2007b,a).The vocabulary and language model adaptation approaches use 3 corpora as training data: themanual transcriptions of the BN speech training data, a large newspaper text database with741M words and a relative small adaptation set consisting of the 7 last days of online text. Theselection of the 100k dynamic vocabulary is POS-based. Relative to the first ASR version, whichused a fixed vocabulary of 57k words, the dynamic version achieves a relative reduction of 65%in OOV (out-of-vocabulary) word rate and of 5.7% in WER (word error rate). Roughly half ofthis improvement is due to the increased size of the vocabulary, as shown by the WER resultsobtained with a baseline version using a static vocabulary of 100k words.

These improvements have an obvious impact on the quality of the automatically producedsubtitles, which include online punctuation and capitalization. An offline topic segmentationand indexation module (Amaral and Trancoso, 2008) splits the BN show into stories and as-signs one or more topics to each story from a closed set of topics. An extractive summarizationmodule (Ribeiro and Matos, 2007) also assigns a summary to each story. These post-ASR mod-ules were originally trained with the material available until a certain date, in no way takingadvantage of the online newspapers which are daily being collected. One of the goals of thisstudy is to try to use this data to train better models for the capitalization.

This section analyses the capitalization performance either using a static capitalizationmodel (CM) or using dynamic capitalization models retrained over time with training datafrom nearby time periods. The capitalization procedure uses the adopted ME-based approach,and the features are based on word identification: wi (current word); and 〈wi−1, wi〉, 〈wi, wi+1〉(bigrams including the previous and next words). Like before, only three ways of writing aword are considered: lowercase, first-capitalized, and uppercase.

4.4.1 Baseline results

The capitalization model currently in use for BN close-captioning, denoted as BaseCM,provides the baseline and it was trained with the content of the PUBnews corpus. The capital-ization model adaptation uses the LMnews corpus, which consists of online text daily collected


Evaluation set % Precision % Recall %F % SERManual transcript 86.0 87.6 86.8 26.6

S_ASR 70.5 78.5 74.3 54.0D_ASR 72.2 80.8 76.3 50.1

Table 4.10: Baseline capitalization results produced using BaseCM.

Approach Model period Man S_ASR D_ASRbaseline 2008-05-20 26.6 % 54.0 % 50.1 %LMnews only 2008-05-20 26.5 % 53.3 % 49.5 %adapt-base 2008-05-20 26.0 % 54.4 % 50.2 %adapt-iter 2008-05-20 25.0 % 53.8 % 49.6 %adapt-iter daily model 25.0 % 53.6 % 49.8 %

Table 4.11: Capitalization SER achieved for all different approaches.

from the web, and described in Section 3.2.1. By the time these experiments were conducted,the corpus contained about 30M words. The evaluation was performed over the RTP08 testset, which consists of 5 BN shows. The corpus contains about 40k words and was collectedduring June and July 2008, with an 8 day time span between each BN show. Besides the man-ual orthographic transcript, two automatic transcripts were also available, sharing the samepreprocessing segmentation: S_ASR – produced using a static LM and a static 100k word vo-cabulary; and D_ASR – produced using a dynamic LM and vocabulary, built specifically forthe corresponding day, and the existing recognition system by that time.

The baseline results, achieved using the capitalization model currently in use for dailysubtitling (BaseCM), are shown in Table 4.10. The capitalization performance decreases whendealing with automatic transcripts. Even so, a better performance is achieved for the D_ASRtranscript, where both the LM and the vocabulary are daily computed, and a lower WER isachieved.

4.4.2 Adaptation results

The adaptation and retraining experiments performed in the scope of this work use LM-news corpora subsets of 2M words each, and the previously described retraining method. Eachsubset is referred by the day corresponding to the latest data in that subset. Accordingly, thecapitalization model that results from retraining with a given corpora subset is also referred bythe day corresponding to the latest data in that subset.

Three adaptation approaches were tested: i) using only the LMnews corpus for training;ii) adapting the BaseCM to a target period, by retraining with the latest data from that period;and iii) iteratively retraining BaseCM with all the available corpora subsets. While the firstapproach assumes that using only the most recent data (LMnews) is sufficient for training, theother two approaches use this data to retrain the baseline CM, assuming that former data also

4.4. CAPITALIZATION MODEL ADAPTATION 75

24% 26% 28% 30% 32% 34% 36% 38% 40%

2005

-03-

16

2005

-06-

22

2005

-09-

30

2005

-11-

29

2006

-02-

10

2006

-04-

11

2006

-06-

18

2006

-08-

27

2006

-10-

26

2007

-01-

06

2007

-03-

08

2007

-06-

02

2007

-08-

29

2007

-11-

08

2008

-01-

11

2008

-03-

13

2008

-05-

20

daily

mod

els

SE

R

Capitalization model

baseline LMnews only adapt-base adapt-iter

Figure 4.13: Manual transcription results, using all approaches.

provides important capitalization information. The second approach assumes that BaseCM al-ready contains most of the capitalization information and a simple retrain with data from atarget period is sufficient. The last approach assumes that all corpora periods provide impor-tant capitalization information and contribute for a better final model. Table 4.11 shows thefinal capitalization results for each approach. Concerning the manual transcript, all the pro-posed approaches yield better results than the baseline, and the best result is produced usingthe third approach (lines 3 and 4), which combines the BaseCM with the LMnews information.Concerning the automatic transcripts, despite achieving only small improvements, the thirdapproach also prove to be the best, specially for the D_ASR transcripts, currently in use. Re-sults also show that the LMnews information alone is sufficient to beat the baseline, revealingthe importance of training data from periods closer to the testing data. The table shows thatresults are not further improved by using daily CMs, which corresponds to retraining the 2008-05-20 capitalization model with the latest 2M words former to the testing data (5 daily modelswere used), suggesting that a periodic retraining is suitable for this task.

Figure 4.13 illustrates the results achieved for the manual transcription, using differentcapitalization models and all different approaches. All the approaches depict clear trend lines.However, the capitalization models produced by the third approach are more stable, achievingthe best results after a certain period of time.

4.4.3 Conclusions

From the three different approaches for capitalization here proposed and evaluated, themost promising one consists of iteratively retraining a baseline model with new available data,


using small corpora subsets. When dealing with the manual transcription, the performanceimproves about 1.6%. Results reveal that producing capitalization models on a daily basisdoes not lead to a significant improvement. Therefore, the adaptation of capitalization modelson a periodic basis is the best choice. The small improvements gained in terms of capitalizationsuggest that dynamically updated models may play a small role, but the updating does notneed to be done daily, a fact that is also according to our intuition. One possible solution forthe updating interval could be assessed by comparing the frequency of emerging words.

4.5 Recent Work on Capitalization

The experiments in this thesis have been conducted during several years with data andsoftware under development. For that reason, comparing results obtained in different timeperiods is a difficult task. Nowadays, the speech recognition system performance has beenimproved, a number of third party tools have been corrected or improved, more data becameavailable, and some of the old data has been revised. For example, we now use a revised ver-sion of the ALERT-SR speech corpus, where a large number of inconsistencies were corrected,as described in Section 3.1.1.1. Finally, the significant increase of computational memory re-sources during these years, makes it possible to perform more complex experiments, usingmore data and larger number of retraining iterations.

These facts led to a new set of experiments, which are supposed to reflect more accu-rate conclusions, given the improved conditions described above. The following subsectionspresent the most recent results achieved using a ME-based approach, and compare this ap-proach with the most recent results achieved using HMMs and Conditional Random Fields(CRFs). Besides the automatic transcripts we now perform experiments with force alignedtranscriptions, also produced by our speech recognition system. The most recent experimentson capitalization consider all the four types of capitalization classes for a given word: lowercase,first-capitalized, uppercase and mixed-case.

4.5.1 Capitalization Results using a ME-based Approach

These experiments use the most recent version of the MegaM tool for training the models.A number of additional options were introduced and a number of bugs have been correctedsince the first version of the tool that had been used. The PUBnews corpus has again been usedfor training the capitalization models, where the original texts were re-normalized and all thepunctuation marks removed. The normalization tool has been recently revised and improved,which means that the content is not exactly the same as in the early experiments. The evaluationdata includes only the Eval and JEval portions of the ALERT-SR corpus, since the other twoevaluation subsets were still not completely revised by the time these experiments started.

4.5. RECENT WORK ON CAPITALIZATION 77

Written corpora model only After retraining with transcriptsEvaluation data Prec. Recall F SER Prec. Recall F SERWritten corpora 95.1% 85.3% 89.9% 18.8%Manual transcripts 94.8% 88.0% 91.3% 16.5% 95.4% 88.6% 91.9% 15.4%ASR transcripts 82.7% 81.7% 82.2% 34.9% 83.3% 82.2% 82.7% 33.9%

Table 4.12: Recent ME-based capitalization results for Portuguese.

The retraining approach described in Section 2.3 was followed, with subsets of two millionwords each. Sometimes the algorithm implemented in the MegaM tool assumes the optimiza-tion converges before it actually does, so now the optimization was forced to repeat severaltimes. Each epoch was retrained three times, using 200 iterations. For performance reasons,each capitalization model was limited to 5 million weights. The following features have beenused for a given word w in the position i of the corpus: wi, 2wi−1, 2wi, 3wi−2, 3wi−1, 3wi, wherewi is the current word, wi+1 is the word that follows and nwi±x is the n-gram of words that startsx positions after or before the position i. Table 4.12 shows the corresponding results, where theleft of the table refers to same capitalization model, built from the written corpora, while theright side refers to the model obtained after retraining with the ALERT-SR training data. Theseresults correspond to substantial improvements when considering results from Section 4.2.3.The different reasons for this advancement include software bug corrections and increasingthe number of training iterations. Furthermore, results concerning written corpora reflect theimproved normalization, and results concerning transcripts reflect the revised version of thereference data. The difference between the manual and automatic transcripts expresses thespeech recognition WER, which is about 18.5 for the evaluation speech data in use. Resultsachieved after retraining the written corpora model with the speech training data are about 1%better, and for that reason one can conclude that it is always important to include data similarin style to the testing data in the training, even if it is only a small portion.

One final experiment concerning the ME results consisted on combining the predition ofeach one of the 70 intermediate models, trained from each one of the PUBnews training subsets.Results were 19.6% SER for written corpora, 16.2% SER for manual transcripts, and 34.8 SERfor automatic transcripts. Written corpora results were no better than results achieved usingonly the last trained capitalization model, which suggests that the model includes most of theinformation from the previous capitalization models. Furthermore, as it was trained with datacloser to the written corpora test set, it is more suitable for capitalizing that data. The gainsover speech transcripts are only residual and can be explained by the fact that the transcriptstesting data period being in the middle of the training data period.


Evaluation data Precision Recall F-measure SERWritten corpora 94.4 90.6 92.5 14.4Manual transcripts 84.8 91.4 87.9 24.7ASR transcripts 69.3 85.9 76.7 51.5

Table 4.13: Recent HMM-based capitalization results for Portuguese.

4.5.2 Capitalization Results using an HMM-based Approach

An HMM-based approach requires a limited vocabulary, but the disambig tool from theSRILM toolkit (Stolcke, 2002) facilitates the use of unrestricted vocabulary, by automaticallyconsidering the new words in its internal vocabulary. The most recent experiments using theHMM-based tagger implemented by the disambig tool use the same data sets used in previ-ous subsection. Trigram language models have been used, which were created using backoffestimates, as implemented by the ngram-count tool from the same toolkit, without n-gram dis-counts.

Table 4.13 shows results achieved using the same training and evaluation data previouslyused in Table 4.12. As a first result, the ME approach achieves a better precision, while theHMM-based approach achieves a better recall. Results indicate that the HMM-based approachis better for written corpora, while the ME approach is significantly better for speech tran-scripts. Several reasons may explain this fact: i) the information expressivity is not the same inboth methods: while the HMM-based approach uses all the context of a word, the features usedin the ME-based approach may not express that complete context, e. g., ME experiments heredescribed do not use the information concerning the previous word (wi−1) as an isolated fea-ture, while that information is available in the 3-gram LM used by the HMM-based approach;ii) the ME-based approach is not much influenced by the context as the HMM-based approach,which is quite important when dealing with speech units that may be, as stated in Section 1,flexible, elliptic, and even incomplete; iii) the restricted training conditions used for limitingcomputational resources. Finally, the WER impact is bigger for the HMM-based approach, be-cause different words may cause completely different search paths. This same conclusion wasalso reached in Section 4.2.

4.5.3 Capitalization Results using Conditional Random Fields

An ME model classifies a single observation into one of a set of discrete classes. A maxi-mum entropy markov model (MEMM) (McCallum et al., 2000) is an augmentation of the basicME classifier so that it can be applied to an entire sequence, assigning a class to each elementin the sequence, just like it is done with HMM. However, while an HMM is a generative modelthat optimizes the likelihood P(W|T) and combines it with the prior P(T) to calculate the pos-terior probability according to Bayes’ rule, an MEMM computes the posterior P(T|W) directly


Training without the output bigram with the output bigramdata Precision Recall F SER Precision Recall F SER

Year 1999 86.9% 86.5% 86.7% 24.8% 92.3% 82.1% 86.9% 24.1%Year 2000 88.1% 87.1% 87.6% 23.1% 92.7% 82.8% 87.5% 23.1%Year 2001 88.7% 87.0% 87.9% 22.5% 93.1% 83.5% 88.0% 22.1%Year 2002 88.6% 87.6% 88.1% 22.3% 93.1% 83.8% 88.2% 21.9%Year 2003 88.6% 87.9% 88.2% 22.1% 93.4% 84.2% 88.5% 21.2%Year 2004 88.5% 88.7% 88.6% 21.6% 93.5% 85.0% 89.0% 20.4%

ALL 95.2% 86.1% 90.4% 17.8% 94.1% 83.7% 88.6% 21.0%

Table 4.14: ME and CRF capitalization results for the PUBnews test set.

Training without the output Bigram with the output bigramdata Precision Recall F SER Precision Recall F SER


ALL 94.6% 89.3% 91.9% 15.5% 94.5% 87.7% 91.0% 17.1%

Table 4.15: ME and CRF capitalization results for the force aligned transcripts test set.

(Jurafsky and Martin, 2009). The conditional random field (CRF) (Lafferty et al., 2001), beingalso a discriminative sequence model, augments the MEMM. One advantage of CRF models isthe fact that it supports rich features, like ME, while it also accounts for label dependency, likeHMMs. The other advantage is that it performs global normalization, eliminating the labelingbias problem.

A number of experiments have been performed in order to assess the interest of sequencemodeling for capitalization. The CRF++ tool3, an open-source implementation that performs

3http://crfpp.sourceforge.net/

Training without the output Bigram with the output bigramdata Precision Recall F SER Precision Recall F SER


ALL 78.3% 85.1% 81.6% 38.1% 78.3% 85.1% 81.6% 38.1%

Table 4.16: ME and CRF capitalization results for the automatic speech transcripts test set.


!"#

$!"#

%!"#

&!"#

'!"#

(!!"#

()# (!)# *!)# (!!)# *!!)# (+# $+# *+# ,+#!"#$%&'()'*+,%-'+,'./%'012+.1*+314(,'#(5%*''

-!#

$-.(#

$-#

*-.$#

*-.(#

*-!#

-!#

$-.(#

$-!#

*-.$#

*-.(#

*-!#

Figure 4.14: Analysis of each capitalization feature usefulness.

the training based on L-BFGS (Limited memory Broyden-Fletcher-Goldfarb-Shanno), has beenused for training the capitalization models. Considering the memory resources currently avail-able4, it was not possible to use the entire training data at the same time. Therefore, six differentexperiments have been conducted, each one using one year of training data. The data and theset of features are the same previously used in this section. All experiments were performedwith and without the output bigram (results without the output bigram correspond to usingME only). This way, the importance of using CRFs can be measured. Tables 4.14, 4.15 and 4.16show results for written corpora, force aligned transcripts, and automatic transcripts, respec-tively. The last row (ALL) shows the results of combining all the previous models’ prediction,similarly to what has been reported in section 4.5.1. The best results are consistently achievedwhen the output bigram is enabled, due to the significant increase in the precision. The bestrecall values are still achieved without the output bigram. The conclusion is that given the richfeature set, the label dependency helps in all scenarios, supporting the idea that the capitaliza-tion of a word tends to be connected with the capitalization of words around.

Once again, it is interesting to notice a tendency for achieving better results when the train-ing data period is closer to the testing data period, which supports the same conclusion alsoachieved in Section 4.3. This is particularly clear when dealing with written corpora and forcealigned transcripts.


4.5.4 Analysis of Feature Contribution

Consider that the number of feature weights in a capitalization model must be limited,for example, for performance reasons or limitations on computational resources; which, then,are the features that should be pruned? This issue may be of importance for a module thatperforms capitalization on-the-fly. One possible answer consists of sorting the capitalizationmodel by the most discriminant features and then selecting the first k feature weights. Wehave tested this process on the ME capitalization model built from the PUBnews training dataand used in Section 4.5.1, which contains 5M feature weights and has a size of 592 Mbytes.The capitalization model was sorted by the standard deviation of the weights of each feature,putting the most discriminant features at the top. The capitalization model was then prunedwith different sizes, and the proportion of each feature in each resultant capitalization modelwas calculated. Figure 4.14 shows the obtained results, where nwi, corresponds to the n-gramstarting at the position i from the current word, for example 2wi−1 corresponds to the bigram< wi−1, wi >. As expected, the figure shows that the most discriminant capitalization featurefor a given word is the word itself, followed by the bigram covering the previous word (2wi−1).The graph depicts that features involving previous words are more discriminant and thereforemore important that the features involving the following words.

4.5.5 Error Analysis and General Problems

The capitalization approaches described so far are language independent, but the capital-ization performance can be further improved by considering language specific problems andby performing choices based on natural language knowledge. This subsection analyses anddiscusses a number of problems posed by our pure machine learning-based strategy for thePortuguese language.

Machine learning methods depend on the training data, which is sometimes from a spe-cific domain, and may not contain specific phenomena found in the evaluation data. For exam-ple, large written newspaper corpora provide useful capitalization information for building ageneric capitalization model, even so, frequent words found in speech transcripts rarely appearin the newspaper data. For example, the verbal form “(tu) queres/ (you) want” is rarely foundin written newspaper data, while it is frequent in dialogs or in spontaneous speech. In thisspecific example, as verbs are always written lowercase they do not pose significant problems,since it was considered that when no information concerning a word exists, such word is keptlowercase. Verbs with enclitic pronouns are easy to detect and are always written lowercase,for that reason our future experiments will consider an additional feature for identifying suchwords.

Our evaluation considers an absolute true form for each word in the original reference data.

4By the time these experiments were conducted, 3 machines with 24GB of memory were available.


However, differences between the original reference and the automatic capitalization are notalways capitalization errors. For example, whereas the original reference contains “rádio Re-nascença” it could contain “Rádio Renascença” instead, which is most of the times prefered bythe capitalizer. A number of errors are still produced by the capitalizer. For example, movie ti-tles like “A Esperança Está Onde Menos Se Espera” are frequently badly capitalized, given thatthey rarely appear in the training corpus and most of the words are usually written lowercase.Unusual organization names are also frequently badly capitalized as well. These conclusionsare similar to the qualitative analysis reported by Lita et al. (2003).

Media organizations sometimes bring to focus names of rarely mentioned people andplaces, which are then frequently used for a given time period. Because such proper nounsmust be capitalized, this may constitute a problem if the capitalization model is not updatedon a periodic basis. Concerning this subject, we have applied a number of regular expressionsfor detecting letter sequences that cannot occur in Portuguese (Meinedo et al., 2010). Thenwe have conducted experiments where all the detected words, supposedly “foreign” words,were capitalized. The capitalization performance has increased with this process. Moreover,this post-processing strategy permits to capitalize words which never occurred before in thetraining corpus.

4.6 Extension to other Languages

This section describes a number of capitalization experiments that have also been per-formed for other languages. Many experiments were performed both with Spanish and En-glish data, but this section will focus on the English language, thus avoid reporting severaltimes similar conclusions. Nevertheless, whenever important, specific results on the Spanishdata will be also mentioned.

The English BN corpus combines different English BN corpora subsets, as described inSection 3.1.3. The written corpus corresponds to the LDC corpus LDC1998T30, described inSection 3.2.3. For these experiments, however, only the NYT (New York Times) portion of thecorpus was used. The data has been collected from January 1997 to April 1998 and containsabout 213 Million words, after cleaning the corpus and removing problematic text (unknowncharacters, etc.). About 211 million words were used for training, 574 thousand for develop-ment, and 1.2 million for evaluation. The original texts were normalized and all punctuationmarks were removed, making them close to speech transcriptions. For the experiments heredescribed, only data previous to the evaluation data period was used for training.

4.6. EXTENSION TO OTHER LANGUAGES 83

0.5%

1.0%

1.5%

2.0%

2.5%

1997

-01

1997

-01

1997

-02

1997

-03

1997

-03

1997

-04

1997

-04

1997

-05

1997

-06

1997

-06

1997

-07

1997

-08

1997

-08

1997

-09

1997

-09

1997

-10

1997

-11

1997

-11

1997

-12

1998

-01

1998

-01

1998

-02

1998

-03

OO

V

Vocabulary period

05-22 Aug1997.voc

Figure 4.15: Vocabulary coverage on written newspaper corpora.

4.6.1 Analysis of the language variations over time

We have started by analyzing the newspaper corpus for establishing a relation between thevocabulary usage and the time-line, like we previously did for Portuguese. The English corpuswas split into several subsets, each containing about 8 million words. Each subset, containingabout 88K unique words, was named with the month corresponding to the first data in thatsubset. In order to assess the relation between the word usage and the language period, againseveral vocabularies have been created with the 50K more frequent words appearing in eachtraining set (roughly corresponds to frequency greater than two). Then, the coverage of eachone of these vocabularies was checked against one of the subsets. The chosen subset containsdata from August 1997, and is located in the middle of the corpus time span. Figure 4.15 showsthe results, where each vocabulary was named with its corresponding corpora subset. The bestcoverage is, as expected, achieved with the vocabulary built from the testing subset. The moreimportant result, however, is that the number of OOVs (Out of Vocabulary Words) decreasesas the time gap between the vocabulary and the testing period gets smaller.

The previous experiment was also performed on manual and automatic speech transcripts,by selecting a piece of speech data from the BN corpus. Most of the English BN corpora fromTable 3.5 does not have a reference to its corresponding collect time period, especially for theevaluation subsets. Therefore, the coverage of each one of the previous 23 vocabularies wastested against a subset from the LDC1998T28 corpus, corresponding to January 1998. Again,the number of words appearing in the test set of the speech corpus that were not covered byeach one of the vocabularies were counted. Figure 4.16 shows the correspondent results. Thegraph shows that the vocabulary coverage is better for vocabularies built from data collectednearby the testing data period. This same relation between the vocabulary usage and the time-line has been previously established in Section 4.3. These results confirm that different lexicalinformation is used in different language periods. This subject is further addressed in the


1.0%

1.2%

1.4%

1.6%

1.8%

2.0%

1997

-01

1997

-01

1997

-02

1997

-03

1997

-03

1997

-04

1997

-04

1997

-05

1997

-06

1997

-06

1997

-07

1997

-08

1997

-08

1997

-09

1997

-09

1997

-10

1997

-11

1997

-11

1997

-12

1998

-01

1998

-01

1998

-02

1998

-03

OO

V

Vocabulary period

Jan98 (Manual transcripts) Jan98 (ASR)

Figure 4.16: Vocabulary coverage for Broadcast News speech transcripts.

next section, where several capitalization experiments, with data collected from different timeperiods, are presented to show how this affects the capitalization task.

It would be interesting to complete this work, comparing the same period of time in bothPortuguese and English to measure the impact of new named entities, e.g., in the beginningof the Iraq war, or during U.S.A. presidential elections. That would depict the timeline effectsof the same event on both languages. Unfortunately, we do not have data suitable to performsuch experiments.

4.6.2 Results

Like in the previous section, capitalization experiments here described assume all the fourways of writing a word: lowercase, first-capitalized, uppercase, and mixed-case. The capital-ization models were trained with the newspaper corpora. The original texts were normalizedand all the punctuation marks removed, making them closer to speech transcriptions. The re-training approach described in Section 2.3 was followed, with subsets of two million wordseach. Each epoch was retrained three times, using 200 iterations. For performance reasons,each capitalization model was limited to 5 million weights. The following features were usedfor a given word w in the position i of the corpus: wi, 2wi−1, 2wi, 3wi−2, 3wi−1, where wi isthe current word, wi+1 is the word that follows and nwi±x is the n-gram of words that starts xpositions after or before position i.

In order to assess the impact on the capitalization task of the language variations in time,two different strategies were used for training, based on the data period. The first capitalizationmodels were trained by starting with the oldest data available and by retraining each epochwith more recent data. The second capitalization models were trained backwards, using thenewest data first and retraining each epoch with data older than the one used in the previous


!"#$

!%#$

&"#$

&%#$

'"#$

'%#$

())*+"($ ())*+"%$ ())*+("$ ()),+"&$

-./$ !"#$%#&'(#%)*)*+'

,%(%'-.#)"&'

!"#$

!%#$

&"#$

&%#$

'"#$

'%#$

())*+"($())*+"%$())*+("$()),+"&$

-./$/%01$%#&2'(#%)*)*+'

Figure 4.17: Forward and Backwards training results over written corpora.

epoch. Each capitalization model was applied to the newspaper corpora evaluation subset,and results are shown in Figure 4.17. While the models trained with the forward strategyconsistently increase the performance on the evaluation set, the performance of the modelsproduced with the backwards strategy does not increase after a certain number of epochs andit even decreases. Although both experiments use the same training data, the best result isachieved with the last model created using the forward strategy, because the latest trainingdata time period was closest to the evaluation time period. The small performance differencebetween the forward and backwards strategy is related with the relatively small period of time,less than one and a half year of data, covered by the English written corpus. During such asmall period of time, the vocabulary changes are quite limited. Notice, however, that bothresults were achieved using exactly the same data, justifying the preference for the forwardstrategy.

Each one of the previous capitalization models, previously created using the forward strat-egy, have also been used for restoring the capitalization of BN speech transcripts. The evalua-tion was conducted over data collected during January 1998, extracted from the LDC1998T28corpus, and corresponding to about 100k words. Figure 4.18 shows the results for manual andautomatic transcripts, again revealing that the best models are precisely the ones closer to theevaluation data period. In fact, the best model is the one built from data of the same period,despite the training and evaluation data being from different sources. As expected, manualtranscripts achieve the best performance. The performance differences between manual andautomatic transcripts reflect the WER impact. Another important conclusion arising from thetwo previous charts is that the amount of data is an important performance factor. In fact, ourresults show that the performance increases consistently as more data is provided.

In order to compare this method with other methods, we also have performed the capi-


!"#$

!%#$

&"#$

&%#$

'"#$

'%#$

%"#$

%%#$

())*+"($ ())*+"&$ ())*+"%$ ())*+",$ ())*+("$ ())*+(!$ ()),+"&$

!"#$

%&'&$()*+,-$

.%/0112342$5$6&712$

-./0.1$23./4536724$ 89:$;3./4536724$

Figure 4.18: Forward training results over spoken transcripts.

Evaluation data Precision Recall F-measure SERWritten ME 96.2 81.6 88.3 20.8corpora HMM 94.9 88.5 91.6 15.3Manual ME 94.3 82.4 88.0 22.2

transcripts HMM 91.9 84.9 88.2 22.2ASR ME 83.9 73.1 78.1 40.4

transcripts HMM 77.8 75.3 76.5 45.5

Table 4.17: Comparing two approaches for capitalizing the English Language.

talization using an HMM-based tagger, as implemented by the disambig tool from the SRILMtoolkit (Stolcke, 2002). This generative approach makes use of trigram language models, cre-ated using backoff estimates, as implemented by the ngram-count tool from the same toolkit,without n-gram discounts.

Table 4.17 shows results achieved with both methods, using the same training and evalu-ation sets for English. The table shows that the HMM-based approach produces better resultsfor written corpora, while the ME approach works better with the speech transcripts, confirm-ing the observations also made, in Section 4.5, for the Portuguese language. Two main reasonsexplain the better values of the HMM-based approach for the written corpora: i) The restrictedtraining conditions used for limiting computational resources; ii) the information expressiv-ity not being the same in both methods: while the HMM-based approach uses all the contextof a word, the features used in the ME approach may not express that complete context. Forexample, ME experiments here described do not use the information concerning the previousword (wi−1) as an isolated feature, while that information is available in the 3-gram LM usedby the HMM-based approach. The impact of the recognition errors is bigger when the HMM-

4.7. SUMMARY 87

based approach is used, because different words may cause completely different search paths.Finally, is interesting to notice that the ME approach achieves best precision values, while theHMM-based approach achieves a better recall.

Comparing these results with those presented in Tables 4.12 and 4.13, one can verify aswell that the capitalization task performs better for the Portuguese language, given the corporasets in use. A possible explanation for the bigger difference between the methods in the Por-tuguese speech data may be related with the proportion of spontaneous/prepared speech inboth corpora. We know that Portuguese transcripts contain a high percentage of spontaneousspeech (35%), much higher than our data for the Spanish BN (11%), but, unfortunately, thisinformation is not available for the English data. Nevertheless, at this point we do not have aconclusive answer for this difference.

These results are difficult to compare to other related work, mainly because of the differentevaluation sets, but also because of the different evaluation metrics and applied criteria. For ex-ample, sometimes it is not clear whether the evaluation takes into consideration the first wordof each sentence. However, these results are consistent with the work reported by Gravanoet al. (2009), which achieves 88.5% F-measure (89% prec., 88% recall) on written corpora (WallStreet Journal), and 83% F-measure (83% prec., 83% recall) on manual transcripts.

4.7 Summary

This chapter described the set of experiments performed on the scope of automatic capi-talization task for both written corpora and speech transcripts. Section 4.2 compared the ME-based approach with two generative approaches, concluding that the generative methods pro-duce better results for written corpora, while the ME approach works better with the speechtranscripts. The impact of the recognition errors is stronger when generative approaches areused, because different words may cause completely different search paths.

Section 4.3 analysed the impact of the language variations in the capitalization task. Max-imum entropy models proved to be suitable to perform the capitalization task, specially whendealing with language dynamics. This approach provides a clean framework for learning withnew data, while slowly discarding unused data. It also enables the combination of differentdata sources and exploration of different features. The analysis has been performed with BNdata, automatically produced by a speech recognition system. In fact, subtitling of BN hasled into using a baseline vocabulary of 100K words combined with a daily modification of thevocabulary (Martins et al., 2007b) and a re-estimation of the language model. This dynamicvocabulary proved to be an interesting scenario for these experiments. In terms of languagevariation, results suggest that different capitalization models should be used for different timeperiods. Capitalization results for broadcast news transcriptions have been presented. Theperformance evolution was analyzed for three test subsets taken from different time periods.


Capitalization results of manual and automatic transcriptions have been presented, revealingthe impact of the recognition errors on this task. For both types of transcription, the capitaliza-tion results show evidence that the performance is affected by the temporal distance betweentraining and testing sets.

Section 4.4 described the work on updating the capitalization module. Three differentapproaches were proposed and evaluated, and the most promising approach consists of iter-atively retraining a baseline model with the new available data, using small corpora subsets.When dealing with manual transcripts the performance grows about 1.6%. Results reveal thatproducing capitalization models on a daily basis does not lead to a significant improvement.Therefore, the adaptation of capitalization models on a periodic basis is the best choice. Thesmall improvements gained in terms of capitalization lead us to believe that dynamically up-dated models may play a small role, but the updating does not need to be done daily, a factthat is also according to intuition. Moreover, the update interval may be chosen dynamically,according to the frequency of new words appearing in the current data.

Section 4.5 presented the most recent experiments on automatic capitalization, with moreaccurate results. Results achieved using the ME-based approach are compared with the mostrecent results achieved using HMMs and CRFs. The HMM-based approach turned out to bebetter for written corpora, while the ME approach was significantly better for speech tran-scripts. The WER impact was also bigger for the HMM-based approach, supporting the conclu-sions previously achieved in Section 4.2. Besides the automatic transcripts, experiments withforce aligned transcriptions were also included. Experiments have shown that CRF (using theoutput bigram) outperforms ME (without the output bigram), due to a significant increase inthe precision. The best results, achieved when the output bigram is enabled, support the ideathat capitalized words tend to appear connected with the words around.

Section 4.6 reported experiments on extending this work to other languages. The effect oflanguage variation over time is again studied for the English and Spanish data, confirming thatthe interval between the training and testing periods is relevant for the automatic capitalizationperformance. The capitalization task performs better for the Portuguese language, given thecorpora sets in use. The bigger difference between the methods in the Portuguese speech datamay be related with the proportion of spontaneous/prepared speech in both corpora.

5Punctuation Recovery

This chapter addresses the punctuation task, covering the three most frequent punctuationmarks: full stop, comma, and question mark. Detecting full-stops and commas depends mostlyon a local context, usually two or three words, and corresponds to detecting sentence bound-aries. The idea consists of jointly detecting the sentence boundaries and predicting the type ofpunctuation mark for each sentence boundary. On the other hand, in Portuguese as in otherlanguages, most interrogative sentences, specially the wh-questions, depend on words that areused in the beginning or/and at the end of the sentence (e.g., quem disse o quê?/who said what?),which means that the sentence boundaries must be previously known. Notice, however, thatthis is not true for all the languages: for example, Chinese interrogatives are marked with a spe-cial character at the end of the sentence, which means that, in this case, question marks can betreated just like full-stops and commas. Anyway, two separated sub-tasks are distinguished here:the first, using local contexts, for detecting full-stops and commas; and the second for detectingquestion marks, using the whole sentence properties as features.

As previously stated in Section 4.5, in the context of the capitalization task, the work heredescribed has been conducted during several years, depending from data and software un-der development. Therefore, comparing results obtained in different periods is most of thetimes a difficult task. That compelled a set of new experiments, most of them repeating oldexperiments, that should reflect more accurate conclusions, given the improved conditions interms of data, software and computational resources. As a result, all the important experi-ments concerning punctuation were again performed using the most recent conditions, beingnow possible to compare between all the results. For that reason, all results presented in thischapter reflect the most recent conditions, which consider not only the automatic transcripts,but also force aligned transcripts, produced by the speech recognition system.

Most of the experiments are performed using the maximum entropy-based approach, de-scribed in Section 2.3, and also used to perform the capitalization task. This approach is a goodframework for combining the large number of features that can be computed from the speechtranscript and/or from the speech signal itself.

Although the relationship of time effects and punctuation conventions may be consideredinteresting, the time effect analysis has been conducted exclusively for the capitalization task,since named entities are more prone to be influenced by short-time effects than punctuation

90 CHAPTER 5. PUNCTUATION RECOVERY

Language Corpus Tokens Full-stop Comma Q-mark Excl-markPortuguese PUBnews 150M 3.2% 6.3% 0.11% 0.03%

Europarl 30M 3.3% 6.8% 0.13% 0.04%English WSJ 42M 4.2% 4.7% 0.04% 0.01%

Europarl 29M 3.7% 4.7% 0.12% 0.03%

Table 5.1: Frequency of each punctuation mark in written newspaper corpora. Wall StreetJournal (WSJ) results extracted from Beeferman et al. (1998).

conventions. This has to do with several reasons. Firstly, time effects in punctuation usuallytake into account texts from several decades (or even centuries), instead of short periods oftime, like the ones reported in our data. For instance, in 1838, Alexandre Herculano, a famousPortuguese writer1, described punctuation conventions used in his time that are consideredungrammatical in contemporary Portuguese (e.g., a long subject is separated from the pred-icate by a comma; a restrictive relative clause is separated from the antecedent by a comma)Duarte (2000). Secondly, changes in the conventional usages of punctuation marks, reported inrecent years, are mainly associated with semicolon usage – a punctuation mark with residualfrequencies across corpora (BN 0.2%; newspapers 0.7%; and university lectures 0.1%). Thirdly,punctuation is diverse across corpora from the same period of time. However, that was not amajor issue here, since only BN are being analyzed.

This chapter is structured as follows: Section 5.1 starts by analysing the occurrence of thedifferent punctuation marks, considering written corpora and speech transcripts, and differentlanguages. Following an historical perspective, Section 5.2 describes initial sentence segmen-tation experiments, which accounted only for the two most frequent punctuation marks: fullstop and comma. Section 5.3 reports recent experiments on extending the initial punctuationmodel to accommodate prosodic features and to also detect question marks. Section 5.4 reportsthe most recent bilingual experiments performed with Portuguese and English, and comparesthe punctuation performance in these two languages. Section 5.5 presents some conclusionsconcerning the adopted approach and the obtained results.

5.1 Punctuation Analysis

In order to better understand the usage of each punctuation mark, their occurrence wascounted in written newspaper corpora, using PUBnews, the Europarl (Koehn, 2005) corpus –a multilingual parallel corpus covering 11 languages and extracted from the proceedings ofthe European Parliament, and published statistics from WSJ (Wall Street Journal). Results areshown on table 5.1, revealing that comma is the most common punctuation mark in writtencorpora of both languages, and is even more frequent for Portuguese. The full-stop frequency

1Alexandre Herculano, Opúsculo V, edição crítica de [critical edition by] J. Custódio and J. M. Garcia. Lisboa,Presença. 1986.

5.1. PUNCTUATION ANALYSIS 91

0%

1%

2%

3%

4%

5%

6%

7%

8%

nl sv en fr it es el de pt da fi

Freq

uenc

y

Language

Full-stop Comma Question mark Exclamation mark Comma avg

Figure 5.1: Punctuation marks frequency in Europarl.

Broadcast News Transcript Tokens Full-stop Comma Q-mark Excl-markLDC98T28 (Hub4 English) 854k 5.1% 3.5% 0.29% 0.00%LDC98T29 (Hub4 Spanish) 350k 4.0% 5.1% 0.14% 0.00%TVE (Spanish) 221K 4.0% 5.8% 0.15% 0.00%ALERT-SR (Portuguese) 920k 4.6% 6.8% 0.24% 0.01%

Table 5.2: Frequency of each punctuation mark in broadcast news speech transcriptions.

is lower for Portuguese, suggesting that the Portuguese written language contains longer sen-tences when compared to English. The question mark turned out to be the third most frequentpunctuation mark, but its frequency is highly dependent on the corpus domain.

This study has been extended to all the 11 languages covered by the Europarl corpus. Fig-ure 5.1 presents the corresponding results, revealing that comma is the most frequent punctua-tion mark for most languages, and achieves one of the highest frequency scores for Portuguese(6.75%). It also confirms that, from all languages, the Portuguese language contains the lowestpercentage of full stops (3.30% vs. 3.56% for English). All other punctuation marks have shownlower and similar frequencies for all languages.

The previous study has been extended also to BN transcriptions. Table 5.2 shows the cor-responding results, where the revised version of the ALERT-SR corpus, described in Section3.1.1.1, was used. The most frequent punctuation mark for Portuguese and Spanish is alsocomma, however, this is not the case for English where the full-stop punctuation mark is nowthe most frequent. The Portuguese BN transcripts present the highest frequency of comma, inconcordance with the written corpora. The full-stop frequency is approximately the same forEnglish and Portuguese BN transcriptions, and about 1% lower for the Spanish language. It isinteresting to observe that the comma is the most frequent punctuation mark in the Portuguese


!"#

$"#

%"#

&"#

'"#

(!"#

($"#

)*+,-# ./0/1# 20+1# 320+1# 4)5!6# 4)5!'#

!"#$

%#&'()

*+",+"-).%/.#0)

78119:;<=# ><??+# @8/:A<-#?+*B# 2CD1+?+A<-#?+*B#

Figure 5.2: Punctuation marks frequency in the ALERT-SR corpus (old version).

corpora, while the full-stop is the most frequent punctuation mark in English. This is consistentwith the widespread notion that sentences are longer in written Portuguese. The frequency ofother punctuation marks on BN corpora is very low.

Previous analyses confirm that spoken text sentences, corresponding to utterances or SUs,are much smaller than written text sentences, specially for the Portuguese language. Intra-sentence punctuation marks also occur more often in spoken texts, specially in Portuguese.

Despite the clear difference in the usage of punctuation marks in the different languages, itis also important to stress the importance of the annotation criteria. Concerning the PortugueseSR corpus, the first version of this corpus was annotated in different time periods by differentpeople, with possibly different criteria. It has recently been revised, as explained in Section3.1.1.1, and differences in terms of punctuation are significant. Figure 5.2 shows the punctua-tion statistics for the first version of the corpus. While the full-stop has a similar frequency ineach subset, the comma usage differs from the first to the last subsets. This observation sug-gests that a manual verification of this data would possibly provide more accurate evaluationresults. Figure 5.3 presents the punctuation statistics for the revised version of the same corpus.

5.2 Early Work using Lexical and Acoustic Features

The punctuation task benefits from lexical and acoustic information found in speech tran-scripts but unavailable in written corpora. Features, such as pause duration and pitch contour,may be used together with linguistic information in order to provide clues for punctuationinsertion. Experiments described in this section correspond to the initial experiments on thissubject, and use only spoken data for training.

5.2. EARLY WORK USING LEXICAL AND ACOUSTIC FEATURES 93

!"#

$"#

%"#

&"#

'"#

(!"#

($"#

)*+,-# ./0/1# 20+1# 320+1# 4)5!6# 4)5!'#

!"#$

%#&'()

*+",+"-).%/.#0)

78119:;<=# ><??+# @8/:A<-#?+*B# 2C>1+?+A<-#?+*B#

Figure 5.3: Punctuation marks frequency in the ALERT-SR corpus (revised version).

5.2.1 Features

These experiments use real valued features for expressing information, such as word iden-tification, morphological class, pauses, speaker gender and speaker clusters, sometimes com-bined as bigrams or trigrams. The following features are used for a given word w in position iof the corpus:

Word: Captures word identification.Used features: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, where wi is the currentword (the word preceding the event), wi+1 is the word that follows, and nwi±x is then-gram of words that starts x positions after or before the position i.

POS tag: Captures part-of-speech information.Used features: pi, pi+1, 2pi−2, 2pi−1, 2pi, 2pi+1, 3pi−2, 3pi−1, where pi is the part-of-speechof the word at position i, and npi±x is the n-gram of part-of-speech of words that starts xpositions after or before the position i.

Speaker change: Captures whenever a new speaker cluster begins.Used feature: SpeakerChgi+1, true if wi+1 starts a different speaker cluster.

Gender change: Captures speaker gender changes.Used feature: GenderChgi+1, true if speaker gender changes before wi+1.

Time: Captures time difference between words.Used feature: TimeGapi+1, the time lapse between the word wi and the word wi+1.


!" #!!" $!!" %!!" &!!" '!!" (!!" )!!" *!!" +!!" #!!!" ##!!" #$!!"

#),$!"$#,$&"$',$+"%!,%("%),&&"&','&"'',(("(),*#"*$,++"

#!!,#$#"#$$,#&*"#&+,#*#"#*$,$$#"$$$,$)!"$)#,%%!"%%#,&!%"&!&,&+$"&+%,(!#"(!$,)%'")%(,*+)"*+*,#!+("

-./"

!"#$%&'(%)$*+$$,%+-./0%1#02%

3,*$.4'5%

Figure 5.4: Converting time gap values into binary features using intervals.

The first two features, involving word information and part-of-speech information, are lexicalfeatures. The word information features were selected according to the performance achievedby a number of initial experiments for sentence boundary detection. These features are similarto features reported by Cuendet et al. (2007) and Liu et al. (2006), with the following differ-ences: Cuendet et al. (2007) also use wi−1, but do not use 2wi−2, 2wi+1, 3wi−2; Liu et al. (2006)use 3wi, but do not report the usage of 2wi−2, 2wi+1, 3wi−1. According to Stolcke and Shriberg(1996), part-of-speech information helps improving the sentence boundary detection perfor-mance. The remaining features are not exactly acoustic, but lacking a better description for thisheterogeneous set, these features will be henceforth designated as acoustic. All but TimeGap arebinary features by nature. The TimeGap is a continuous value that measures the amount of timebetween the end of a word and the start of the following word and must be binarized. To makeit binary, TimeGap values have been associated to logarithm intervals, according to the formula:

I(v) = int

eint(ln(v) ∗ s))

s + 1

(5.1)

where v is the time gap value (in milliseconds), int is a function that returns the integer part of anumber, and s is a smooth value. Experiments reported here use s = 5, which turned out to bethe best, given the performance achieved with experiments performed with different smoothvalues. Smaller time gaps, with less than 20ms, were not used, and time gaps with durationsgreater than one second were represented by the feature TimeGap:big. Figure 5.4 illustrates


Background focus Cor Ins Del Precision Recall F SERplanned, clean F0 2411 2463 447 49.5% 84.4% 62.4% 101.8%planned, noise F40 4441 5073 916 46.7% 82.9% 59.7% 111.8%all planned F0, F40 6852 7536 1363 47.6% 83.4% 60.6% 108.3%spontaneous, clean F1 855 2435 416 26.0% 67.3% 37.5% 224.3%spontaneous, noise F41 2053 5134 832 28.6% 71.2% 40.8% 206.8%all spontaneous F1, F41 2908 7569 1248 27.8% 70.0% 39.7% 212.2%All 10794 16386 2920 39.7% 78.7% 52.8% 140.8%

Table 5.3: Recovering sentence boundaries over the ASR output, using the APP segmentation.

defined intervals using 5 as the smoothing value.

The confidence score of each word, given by the ASR module, is used for both Word and POSunigram-based features. The confidence score for speaker and gender changes is provided bythe APP module and is also being used. For all other features a score of 1.0 is being used.

5.2.2 Sentence Boundary Detection

The output of a speech recognition system consists of a stream of text, sometimes groupedinto segments purely based on the acoustic properties of the signal. Detecting sentence bound-aries over such data is a way of enriching such transcripts with metadata, which serves asa starting point for creating data more adequate for human and machine further processing.The L2F broadcast news processing system, described in Section 1.2, benefits from the correctsentence segmentation for correctly performing some of its tasks, such as: subtitling, topic in-dexation and summarization. The initial experiments carried out in the scope of this thesisaimed at providing the correct sentence boundaries to the original transcript.

The initial system used the APP (Audio Pre-Processing) segmentation as the only clue formarking the sentence boundaries. Table 5.3 shows the system performance, considering thateach sentence boundary corresponds to one of the following reference punctuation marks: “.”,“:”, “;”, “!”, “?” and “...” (all punctuation marks except comma are being used). The mostrecent revision of the speech Portuguese corpus (Section 3.1.1.1) has been used. The numberof Correct (Cor), Inserted (Ins), and Deleted (Del) slots, are shown together with the standardevaluation measures, to better clarify the magnitude of the errors. The planned speech resultsare much better than spontaneous speech, but no significant difference occurs from clean tonoisy speech. Those results succeed in terms of recall, but the low precision causes an overallSER above 100%. Those results provide the baseline for the following experiments that aim atautomatically detecting the sentence boundaries.

The ME modeling approach, described in Section 2.3, and also used for the capitalizationtask, has been adopted for this task as well. However, this is a binary problem, much easierto perform. The optimization for the following experiments was performed using 10,000 iter-


Training data All data Planned speech onlyBackground focus Prec Rec F SER Prec Rec F SERF0 89.2% 73.4% 80.6% 35.4% 86.2% 77.4% 81.6% 35.0%F40 86.3% 71.2% 78.0% 40.1% 83.3% 77.0% 80.1% 38.4%F0, F40 87.3% 72.0% 78.9% 38.5% 84.3% 77.2% 80.6% 37.2%F1 74.0% 69.4% 71.6% 55.0% 67.6% 70.5% 69.0% 63.4%F41 79.6% 69.7% 74.3% 48.1% 72.6% 70.4% 71.5% 56.1%F1, F41 77.8% 69.6% 73.5% 50.2% 71.0% 70.4% 70.7% 58.3%All 84.2% 70.8% 76.9% 42.5% 79.8% 74.4% 77.0% 44.4%

Table 5.4: Recovering sentence boundaries in the force aligned data.


Table 5.5: Recovering sentence boundaries directly in the ASR output.

ations. Table 5.4 presents the results achieved for the force aligned transcripts, combining allfeatures previously described in this section. Results from the left side of the table were ob-tained from models built from all data, while results from the right side were produced modelstrained only with planned speech. Results confirm the expectation that sentence boundariesare easier to detect in planned speech. While the difference between planned speech and spon-taneous speech is significant, results for speech with and without noise are quite similar. Betterresults could be expected by reducing the frequency of some phenomena, such as disfluen-cies, from the training data, but results from the right side of the table do not support thisassumption, since the overall performance decreased. The decreased performance is due to thereduced training material, corresponding to about 56% of all available training material, andalso because removing the spontaneous part of the training corpus, caused some spontaneousspeech phenomena not to be captured.

Force aligned transcripts do not contain recognition errors, for that reason Table 5.4 pro-vides the upper-bound estimate for a real system. The second set of experiments is performeddirectly on the automatic speech recognition (ASR) output. Table 5.5 shows the correspondingresults, where the training data consists of automatic speech transcriptions. A number of ad-ditional experiments, not presented here, revealed that models trained with automatic speechtranscripts are better suitable for the ASR output than the corresponding models trained withforce aligned transcripts, despite containing recognition errors. That is due to the fact that train-


!"#

$!"#

%!"#

&!"#

'!"#

(!!"#

)**#+),)# -*)../+# 0-1.,)./120#

!"#$%&''#'%()$*%

+),-)"%$'),./'01$.%

)**#3/),24/0# */567)*#1.*8# )712097#1.*8#

)**#+),)# -*)../+# 0-1.,)./120#

2-$#3)4/%$'),./'01$.%

Figure 5.5: Impact of each feature type in the SU detection performance.

ing and testing data share the same conditions. The worst results produced by models trainedonly with planned speech confirms that is better to use all training data, even if contains anincreased number of recognition errors.

The impact of the recognition errors can be calculated by comparing Tables 5.4 and 5.5.When all the data is used for training the overall SER impact is 19.8% absolute (62.3%-42.5%),reflecting the WER (Word Error Rate) of about 19.5% in the evaluation data, for the currentspeech recognition system version (Sep. 2010). Results also show a bigger impact for sponta-neous speech, where the number of recognition errors is much higher.

5.2.2.1 Feature Contribution Analysis

Previous results were produced by combining all features described previously in this Sec-tion. The following experiments try to assess the contribution of each individual feature andalso of groups of features, allowing to distinguish the most interesting features and, possibly,to allow discarding features that may not contribute for better results. Figure 5.5 reveals thefirst results, concerning the influence of the lexical and acoustic features, by background focuscondition. The best results were consistently produced by the combination of lexical and acous-tic features. Nevertheless, results show that lexical features have lesser impact than acousticfeatures on the final performance. These conclusions are similar to those reported by Batistaet al. (2008b) concerning our previous work, despite a different corpora revision and differentversions of most of the tools, including the speech recognition, being used by that time.

The contribution of each one of the five features, presented in Section 5.2.1, was also stud-ied, and is illustrated in figure 5.6. The figure shows results when using all but a given feature,


!"#

$!"#

%!"#

&!"#

'!"#

(!!"#

)**#+),)# -*)../+# 0-1.,)./120#

!"#$%&''#'%()$*%

+),-)"%$'),./'01$.%

.1#314+# .1#567/8)-# .1#9:;# .1#;-/)</4=>?0# .1#8/.+/4=>?0# )**#@/),24/0#

)**#+),)# -*)../+# 0-1.,)./120#

2-$#3)4/%$'),./'01$.%

Figure 5.6: Impact of each individual feature in the SU detection performance.

where: no word means results achieved without word information related features; no TimeGapmeans results excluding time intervals between words; no POS means results achieved withoutpart-of-speech related features; no SpeakerChg means results excluding speaker change infor-mation; and no GenderChg means results excluding speaker gender change information. Again,results reveal that the combination of all features consistently produces the best results both formanual and automatic transcripts. Overall, the biggest contribution comes from the TimeGapinformation, except for the spontaneous speech, where the contribution of each feature is notso clear. TimeGap information becomes more important when dealing with planned speech,suggesting that pauses between words are relevant for sentence boundary detection of plannedspeech, specially when in the presence of recognition errors where the acoustic information isthe most important information. The most important conclusion arising from this study is thatall the proposed features contribute to a better sentence boundary detection performance.

5.2.3 Segmentation into Chunk Units, Delimited by Punctuation Marks

Previous subsection considers that a sentence boundary corresponds to a punctuationmark, where commas were excluded. Most of the literature considers commas as intra-sentencepunctuation marks, and do not use them for sentence boundary detection. However, after ob-serving the speech recognition output it was verified that most APP segments correspond tocommas. Moreover, when addressing the sentence boundary detection problem, sometimes itis not always clear whether a comma is or not used to delimit a SU boundary. For example, Liuet al. (2006) is not clear about this subject, however, the proportion of reported SUs (8%) resem-bles the proportion of full-stops and commas calculated for English (8.6%) in Table 5.2, whichsuggests the usage of the comma for delimiting the SUs.


Background focus Cor Ins Del Precision Recall F SERplanned, clean F0 3364 1507 2068 69.1% 61.9% 65.3% 65.8%planned, noise F40 6760 2735 4710 71.2% 58.9% 64.5% 64.9%all planned F0, F40 10124 4242 6778 70.5% 59.9% 64.8% 65.2%spontaneous, clean F1 1770 1505 2861 54.0% 38.2% 44.8% 94.3%spontaneous, noise F41 3848 3317 5576 53.7% 40.8% 46.4% 94.4%all spontaneous F1, F41 5618 4822 8437 53.8% 40.0% 45.9% 94.3%All 17276 9831 16847 63.7% 50.6% 56.4% 78.2%

Table 5.6: Recovering chunks over the ASR output, using only the APP segmentation.


Table 5.7: Recovering chunk units in the force aligned data.

This section aims at identifying units, also referred as chunks, that are delimited by anypunctuation mark, including commas. Those units may also be useful for tasks like summa-rization, machine translation, NER, etc.. Table 5.6 shows the system performance, consideringall the punctuation marks in the reference, namely: “.”, “:”, “;”, “!”, “?”, “...”, “,”, and “-”.Comparing these results with results from Table 5.3, one can see that the APP segmentationpinpoints the position of a punctuation mark, but most of the times does not correspond to asentence boundary. Precision has increased considerably while recall decreased. That meansthat most of the commas are still not identified, but most APP segments correspond to commas.

The challenge proposed in this subsection is very similar to what has been performed inthe previous subsection. The same ME modeling approach is used, as well as the same numberof optimization iterations (10K). The upper-bound estimate is again provided using the forcealigned transcripts. Table 5.7 presents the results combining all features. Similarly to previoussubsection, results from the left side of the table were obtained from models built from all data,while results from the right side were produced models trained only with planned speech.Again, planned speech achieves the best performance, the noise impact is insignificant, andusing more data achieves the best results. The corresponding results for the ASR output areshown in Table 5.8, where the training data consists also of automatic speech transcriptions.The impact of the recognition errors is about 11.7% SER (absolute) and can be calculated bycomparing the two tables. Results again show a bigger difference for spontaneous speech,where the number of recognition errors is higher.



Table 5.8: Recovering chunk units directly in the ASR output.

Despite the corpora being different, and results not being directly comparable, the achievedresults are similar to the state-of-the-art results reported by Liu et al. (2006), concerning sen-tence boundary detection for English broadcast news. The authors evaluate different modelingapproaches, alone and in combination, reporting about 47% to 52% WER (equivalent to NISTSU Error Rate) for manual transcriptions, and 57% to 61% for automatic transcriptions. Theexperiments here reported achieved about 44.3% WER (2.7% difference) for the manual tran-scripts and 56% (1% difference) for the automatic transcripts. The authors report about 9%difference between manual and automatic transcriptions. These experiments reveal a differ-ence of 11.7%, which can be explained because of the large percentage of spontaneous speechin the corpus (34%), and because of the WER differences. Whereas the WER of speech recog-nition system used by Liu et al. (2006) is 11.7%, the WER of our recognition system is above15.1%. The authors also study the effect of the recognition errors in the SU performance andthey conclude that, for example, a WER increase of 3.3% leads to a 3.2% worse performance forSU detection.

5.2.3.1 Feature Contribution Analysis

Figure 5.7 shows the results concerning the influence of the lexical and acoustic features,by background focus condition. Again, the best results were consistently produced by thecombination of lexical and acoustic features. But, in contrast with results achieved in the previ-ous subsection, lexical features have now more impact than acoustic features on the final per-formance, specially when considering spontaneous speech. The only exception concerns theplanned speech portion of the automatic transcripts, where the recognition errors may havelowered the performance of the lexical features. Notice that the previous results presented forsentence boundary detection have shown an opposite result, where the acoustic features turnedout to be the most important.

The contribution of each one of the five features was also studied, and is illustrated in fig-ure 5.8. The figure shows results when using all but a given feature. Again, results reveal thatthe combination of all features consistently produces the best results both for manual and auto-


0%

20%

40%

60%

80%

100%

all data planned spontaneous

Slo

t Err

or R

ate

Manual transcripts

all features lexical only acoustic only

all data planned spontaneous

Automatic transcripts

Figure 5.7: Impact of each feature type in the chunk detection performance.

!"#

$!"#

%!"#

&!"#

'!"#

(!!"#

)**#+),)# -*)../+# 0-1.,)./120#

!"#$%&''#'%()$*%

+),-)"%$'),./'01$.%

.1#314+# .1#567/8)-# .1#9:;# .1#;-/)</4=>?0# .1#8/.+/4=>?0# )**#@/),24/0#

)**#+),)# -*)../+# 0-1.,)./120#

2-$#3)4/%$'),./'01$.%

Figure 5.8: Impact of each individual feature in the chunk detection performance.


Symbol Replacement. : ; ! ? ... full-stop

, - comma

Table 5.9: Punctuation mark replacements.

matic transcripts. Overall, the biggest contribution comes from the word information, followedby the TimeGap information. Again, TimeGap information, corresponding to pauses betweenwords, is more important for punctuation detection of planned speech. In spontaneous speech,pauses are likely to to be associated with disfluency phenomena, because people tend to con-struct the sentences while thinking (Clark and Fox Tree, 2002). The importance of pauses is notsurprising, taking into account that punctuation was originally used for marking breaths, andthat such function is expected to remain part of its basic usage Kowal and O’Connell (2008).The POS information is the third most important feature, despite the part-of-speech taggernot having been specially trained for dealing with speech transcripts. Information concerningspeaker change and gender change have shown little impact on the results. One possible expla-nation is that they have a lower occurrence in the corpus, and therefore a lower impact on theresults. The second possible explanation is that they provide some redundant information, forexample, whenever the gender changes the speaker also changes. A third explanation is that,most often, speaker changes are accompanied by time gaps, because otherwise the current APPmodule does not detect them. Again, all the proposed features have shown to contribute to abetter punctuation performance.

The next section uses the same feature set to achieve results for punctuation mark recovery,distinguishing between two different sentence boundary markers: full-stop and comma.

5.2.4 Recovering full-stop and comma Simultaneously

The following experiments distinguish between two more frequent punctuation marks:full-stops and commas, which depend on local features. As only full-stops and commas are beingconsidered, all the other punctuation marks have been converted into one of these, in accor-dance with the replacements described in Table 5.9. This task uses the same approach previ-ously used for sentence segmentation, as well as the same feature set. The most significantdifference arises from the fact that instead of facing a binary classification problem, one nowfaces a multiclass problem. As mentioned in Section 2.2, comma is one of the most frequent andunpredictable punctuation marks. Its use is highly dependent of the corpus and there is weakhuman agreement on a given annotation. Therefore, a lower performance is expected for thispunctuation mark.

The following experiments use the ME modeling approach, described in Section 2.3. Formulticlass problems, however, sometimes the implemented algorithm considers the optimiza-


full-stop comma all punctuationFocus Prec Rec F SER Prec Rec F SER Prec Rec F SERplanned 82.5 77.6 80.0 38.9 64.2 39.0 48.5 82.7 75.1 57.9 65.4 51.6spontan. 69.9 71.8 70.8 59.2 70.9 39.9 51.1 76.5 70.4 49.5 58.1 62.5all 78.4 75.3 76.8 45.4 67.5 39.4 49.7 79.6 73.3 54.0 62.2 56.5

Table 5.10: Recovering full-stop and comma in force aligned transcripts.

full-stop comma all punctuationFocus Prec Rec F SER Prec Rec F SER Prec Rec F SERplanned 73.8 70.7 72.2 54.4 61.1 30.0 40.3 89.1 69.3 49.8 58.0 60.7spontan. 53.4 53.8 53.6 93.1 64.5 27.9 39.0 87.4 59.0 35.6 44.4 78.7all 67.2 65.3 66.2 66.7 62.8 28.9 39.6 88.2 65.4 43.5 52.2 68.8

Table 5.11: Recovering full-stop and comma in automatic transcripts.

tion converges before it actually does, so the optimization has been forced to be repeated sev-eral times. Each epoch was retrained 25 times, using 100 iterations.

The first set of experiments consisted of recovering the punctuation over force aligned tran-scripts, which provide the upper-bound performance for speech transcripts. Table 5.10 presentsthe corresponding results, where each value is a percentage. Results concerning each punctu-ation mark are presented individually, together with the overall results, which also considerthe number of substitutions between the two punctuation marks. From the results, it is clearthat the full-stop detection is easier than comma detection. However, it is surprising that, whilerecovering the full-stop is easier to perform in planned speech, comma is easier to recover inspontaneous speech. The performance of detecting full-stops in planned speech is about 20%better than the performance in spontaneous speech, which is significant. The same tendencyis not achieved for the comma detection, where differences are relatively much smaller and thebest results were in fact achieved for spontaneous speech.

The overall results presented in the last four columns of Table 5.10, as well as the valuesfrom the first four columns of Table 5.7 concern the performance of detecting all punctuationmarks. However, Table 5.10 also considers the number of substitutions, i.e., mistakes confus-ing full-stops and commas. In terms of SER performance, results from Table 5.10 are about 9.4%worse, which reflects not only the Substitutions impact, but also passing from a binary to amulticlass problem. By considering the number of Substitutions (which corresponds to about9% of the Correct slots) as correct slots, the performance increases about 9%, but the final SERperformance is still about 0.4% worse that performance from Table 5.7, suggesting that the op-timization method converges better when dealing with binary problems. This is also accordingto the work reported by Matusov et al. (2006), which states that it is much easier to predictsegment boundaries than to predict whether a specific punctuation mark has to be inserted ornot at a given word position in the transcript.


Table 5.11 shows the performance on recovering punctuation marks over automatic tran-scripts, which is the ultimate challenge of this study. These results concerning automatic tran-scripts support all the conclusions that have already been presented for the manual transcripts.The performance of detecting full-stops in planned speech is about 39% better than the per-formance in spontaneous speech, which is even more significant than the results achieved formanual transcripts (Table 5.10). The comma detection performance follows the same tendencyobserved for manual transcripts, and it does not vary significantly from planned to sponta-neous speech. The overall impact of the recognition errors, achieved by comparing manualto automatic transcripts, is about 12.3% absolute, but it is much greater for full-stop detection,where such a difference increases to about 21.3%. The difference between planned and sponta-neous speech increases from 20% in manual transcripts to 39% in automatic transcripts, whichcorroborates that recognition errors have a strong impact on full-stop detection. The commadetection performance is only about 8.6% worse for automatic transcripts, which is relativelymarginal.

A number of additional experiments have shown that training with all data is better thantraining with planned speech only, and that no significant performance differences exist be-tween noisy and clean speech. The same conclusions were achieved for sentence boundarydetection, in Section 5.2.3, and for that reason results are not presented here.

Our results are consistent with other related work. For example, (Christensen et al., 2001)use statistical models of prosodic features for recovering punctuation marks over the Hub-4broadcast news corpora. Results vary from 41% to 79% SER for full-stop detection, and above81% for comma detection.

The evaluation presented here, however, may not reflect the real achievements of this work,and would benefit from a human evaluation (Beeferman et al., 1998). This is also supported bythe analysis on the user annotation agreement, reported in Section 3.1.1.1.

5.2.4.1 Feature Contribution for Punctuation Recovery

Similarly to the analysis on the feature impact, previously performed for SU detection,the following experiments assess the impact of each feature on the performance of recover-ing punctuation marks. The first results, concerning the influence of the lexical and acousticfeatures are presented in Figure 5.9. Once again, results indicate that the combination of allfeatures lead to significant better results. Considering the overall results on both punctuationmarks, lexical features have more impact than acoustic features, both for manual and automatictranscripts. Nonetheless, separate results for full-stop detection reveal that acoustic featureshave a stronger impact on recovering this punctuation mark, which is specially notorious onautomatic transcripts.

The impact of each individual feature was also analysed and the corresponding results are


0%

20%

40%

60%

80%

100%

all full-stop comma

Slo

t Err

or R

ate

Manual transcripts

all features lexical only acoustic only

all full-stop comma


Figure 5.9: Impact of lexical and acoustic features in the punctuation detection.

0%

20%

40%

60%

80%

100%

all full-stop comma

Slo

t Err

or R

ate

Manual transcripts

no word no TimeGap no POS no SpeakerChgs no GenderChgs all features

all full-stop comma


Figure 5.10: Impact of each individual feature in the punctuation detection performance.


!"#

$!"#

%!"#

&!"#

'!"#

(!!"#()*$!#

$(*$%#

$+*$,#

-!*-&#

-)*%%#

%+*+%#

++*&&#

&)*'(#

'$*,,#

(!!*($(#

($$*(%'#

(%,*('(#

('$*$$(#

$$$*$)!#

$)(*--!#

--(*%!-#

%!%*%,$#

%,-*&!(#

&!$*)-+#

)-&*',)#

','*(!,&#

./0#

12345367809(#

:3;<

3678

09(#

!"#$%&'(&)*(+,$(-+,$.(&/-012/(3$&+0.$1(

=>??*9@A2#

BACC4#

;A2>;B@#

Figure 5.11: Relation between the acoustic features and each punctuation mark.

presented in Figure 5.10. Identical to the approach used in Figure 5.8 for SU detection, the per-formance was evaluated when using all but a given feature. Unlike the conclusions previouslyachieved for SU detection, the best results are not achieved by combining all features. In fact,only marginal improvements can be achieved by removing the GenderChg feature.

The classification probabilities, given by the punctuation model created from the automatictranscripts, were analysed for each acoustic feature. The corresponding results are illustratedin Figure 5.11, where the first columns correspond to the binarized TimeGap intervals and thelast two columns correspond to the features SpeakerChg and GenderChg. This figure revealsthe probability of choosing each punctuation mark, given the feature weights provided by thepunctuation model. Results show that each one of these features is likely to be associated withfull-stops. However, for small time gaps, there are increased chances of having a comma or nothaving a punctuation mark at all. This also happens for time gaps with about 1 second andwhenever the APP module indicates a speaker or gender change. That helps explaining thereason for the better performance without the GenderChg feature, even if it is only marginal. Animportant overall conclusion arising from this study is that the largest contribution to detect-ing full-stops and commas comes from the word information, followed by the TimeGap infor-mation. The latter information becomes more important when dealing with full-stop detection,suggesting that pauses between words are indeed relevant as clues for punctuation recovery.Information concerning speaker change and gender change have shown little impact on theresults, perhaps because some of that information is also encoded by pause durations.

To conclude this study we have analysed the proportion of pauses that correspond to apunctuation mark, and the proportion of each punctuation mark that corresponds to pauses.

5.3. EXTENDED PUNCTUATION MODULE 107

Considering pauses of at least 20ms and the training and development sets from the speechdata, we have verified that about 40% of all pauses correspond to full-stops, and another 24%correspond to commas. About 88% of the full-stops correspond to a pause, but only about 38%of the commas correspond to pauses. From all punctuated locations where a pause did notoccur, about 87% correspond to commas. Kowal and O’Connell (2008) report similar results forGerman (95%) and for American English (82%), and use them to support the idea that pausesare not “the oral equivalent of commas” and that commas do not “signal” pauses. Our resultsfor Portuguese also substantiate that statement.

5.3 Extended Punctuation Module

The performance results presented and analysed in previous section, concerning full-stopand comma detection, correspond to the first implemented version of the European Portuguesepunctuation module, which explored a limited set of features, mostly lexical and acoustic, si-multaneously targeting at low latency, and language independence. The aim of the followingexperiments is to improve the performance of the punctuation module, first by exploring addi-tional features, namely prosodic; then, by exploring the use of punctuation information that canbe found in large newspaper corpora; finally, by weighting the impact of lexical and prosodicfeatures on the baseline system when encompassing interrogatives. This study constituted thefirst step towards the goal of adding the identification of interrogatives to the punctuation mod-ule. It was a joint effort, only possible because it involved a number of researchers from the L2Flaboratory with different background.

5.3.1 Improving full stop and comma Detection

This section addresses two ways of improving the initial results for full-stop and comma: i)adding prosodic features, besides the existing lexical, time-based and speaker-based features;ii) making use of punctuation information that can be found in large written corpora.

5.3.1.1 Introducing Prosodic Information

The first strategy for improving the initial results consisted of adding prosodic features,besides the existing lexical, time-based and speaker-based features. We do know that thereis no one-to-one mapping between prosody and punctuation (Viana et al., 2003; Kowal andO’Connell, 2008). Silent pauses, for instance, can not be directly transformed into punctuationmarks for different reasons, e.g., prosodic constraints regarding the weight of a constituent,speech rate, style, different pragmatic functions, such as emphasis, emotion, on-line planning.However, prosodic information can be used to improve the punctuation detection. For exam-ple, Kim and Woodland (2001) concludes that F-measure can be improved by 19% relative.


Type of added full-stop comma AllInfo features Prec Rec F SER Prec Rec F SER Prec Rec F SER

Baseline 78.4 75.3 76.8 45.4 67.5 39.4 49.7 79.6 73.3 54.0 62.2 56.5Word f0 81.5 77.2 79.3 40.3 68.1 42.0 51.9 77.7 75.0 56.3 64.3 54.2Based E 78.4 77.5 77.9 43.9 67.9 40.1 50.4 78.8 73.5 55.3 63.1 55.6

f0 + E 81.2 78.5 79.8 39.7 68.0 44.1 53.5 76.7 74.7 58.0 65.3 53.3Syllables f0 79.5 77.2 78.3 42.7 68.5 41.7 51.9 77.4 74.2 56.1 63.9 54.4

& E 78.3 76.8 77.5 44.5 68.3 40.3 50.7 78.4 73.6 55.1 63.0 55.5Phones D 78.2 76.8 77.5 44.6 68.4 41.7 51.9 77.5 73.6 56.0 63.6 54.9

f0+E+D 78.4 78.2 78.3 43.3 69.1 40.8 51.3 77.4 74.1 56.0 63.8 54.5All Combined 79.8 79.9 79.8 40.4 69.6 43.6 53.6 75.5 74.9 58.4 65.6 52.7

Table 5.12: Punctuation results over manual transcripts, combining prosodic features.

The feature extraction stage involved several steps, as described in Section 3.4. The firststep consisted of extracting the pitch and the energy from the speech signal. Durations ofphones, words, and interword-pauses have been extracted from the recognizer output. Bycombining the pitch values with the phone boundaries, micro-intonation and octave jump ef-fects have been removed from the pitch track. Another important step consisted of marking thesyllable boundaries as well as the syllable stress, as described in Section 3.4.3. Finally, the max-imum, minimum, median and slope values for pitch and energy have been calculated in eachword, syllable, and phone. Duration was also calculated for each one of the previous units.

As previously mentioned, these experiments aim at analyzing the weight and contributionof each prosodic feature per se and the impact of the combination of prosodic features. Un-derlying the feature extraction process are linguistic evidences that pitch contour, boundarytones, energy slopes, and pauses are crucial to delimit sentence-like units across languages.The first experiment aimed at testing whether the features would perform better on differentunits of analysis: phones, syllables and/or words. The linguistic findings for EP (Viana, 1987;Mata, 1999; Frota, 2000), prospect that the stressed and post-stressed syllables would be rel-evant units of analysis to automatically identify punctuation marks. When considering theword as a window of analysis, we are also accounting for the information in the pre-stressedsyllables as well.

Features are calculated for each word transition, with or without a pause, using the sameanalysis window as Shriberg et al. (2009). The following features have been used: f0 and energyslopes in the previous and following words, with or without a silent pause; f0 and energydifferences between these units; and also duration of the last syllable and the last phone. Withthis set of features, the aim is to capture nuclear and boundary tones, amplitude, pitch reset,and final lengthening.

Tables 5.12 and 5.13 show the performance results for full stop and comma recovery, whereeach prosodic parameter was analyzed separately. The baseline corresponds to a punctuationmodel created using only lexical and acoustic information, and is represented in the first rowof each table. These results represent significant gains relatively to the previous results for


type of added full-stop comma AllInfo features Prec Rec F SER Prec Rec F SER Prec Rec F SER

Baseline 67.2 65.3 66.2 66.7 62.8 28.9 39.6 88.2 65.4 43.5 52.2 68.8Word f0 71.3 67.3 69.2 59.8 63.8 31.6 42.2 86.4 68.0 45.9 54.8 66.0Based E 66.8 65.7 66.2 67.0 62.6 27.8 38.5 88.8 65.1 43.1 51.8 69.2

f0 + E 71.2 67.5 69.3 59.8 62.0 34.8 44.6 86.5 66.9 47.9 55.8 65.8Syllables f0 67.6 67.3 67.5 64.9 64.1 27.5 38.5 87.9 66.2 43.5 52.5 68.3

& E 67.2 65.5 66.4 66.5 62.1 30.0 40.5 88.3 65.0 44.3 52.7 68.6Phones D 68.2 64.3 66.2 65.7 61.3 31.0 41.2 88.5 65.1 44.4 52.8 68.5

f0+E+D 67.0 68.4 67.7 65.3 63.6 29.3 40.1 87.5 65.6 45.0 53.4 67.8All Combined 70.9 68.3 69.6 59.8 62.6 33.0 43.2 86.7 67.2 47.2 55.4 65.8

Table 5.13: Punctuation performance over automatic transcripts, combining prosodic features.

both types of transcripts, and both punctuation marks, ranging from 3% to 7% SER (absolute).The best results are mainly related with the full stop, where the pitch value of words turnedout to be the most relevant prosodic feature. This model was further improved by addingthe energy value of words. The syllables and phone-based features did not constitute a sub-stantial improvement. Moreover, combining words and syllables achieved results similar tousing only word-based features. The duration parameter is of interest in EP, since three par-ticular strategies are used at the end of an intonational phrase: epenthetic vowel, elongatedsegmental material or elision of post-stressed segmental material. To the best of our knowl-edge, no quantifiable measures were reported for the Portuguese language and little has beensaid about these strategies so far. Then, not surprisingly, the durational parameter did not adda substantial improvement to our model, although it did contribute to a slightly better result inthe spontaneous speech data. In this specific set of data, there is a tendency to elongate the lastphone or the last syllable of the word in a potential location for a punctuation mark, makingduration an informative cue for this specific context.

Results in this study partially agree with the ones reported in Shriberg et al. (2009), re-garding the contribution of each prosodic parameter and also the set of discriminative featuresused, where the most important feature turned out to be f0 slope in the words and betweenword transitions. These features are language independent; still, language specific propertiesin the data are related with different durational patterns at the end of an intonational unit andalso with different pitch slopes that may be associated with discourse functionalities beyondsentence-form types.

5.3.1.2 Retraining from a Written Corpora Model

Another idea for improving the baseline punctuation results consisted of making use ofthe punctuation information that can be found in large written corpora. For that purpose,first a punctuation model has been trained with the written corpus, and then a new model wastrained with the training transcriptions, bootstrapping from the initial training with newspaper



Lexical + Acoustic 78.3 77.4 77.9 44.0 68.1 51.8 58.8 72.4 72.9 62.2 67.1 50.6Word f0 79.6 80.6 80.1 40.0 68.4 53.9 60.3 71.0 73.6 64.7 68.9 49.0Based E 76.0 81.3 78.5 44.4 70.3 49.4 58.0 71.5 73.2 62.3 67.3 50.3

f0 + E 81.6 79.4 80.5 38.5 68.2 54.6 60.6 70.9 74.3 64.7 69.1 48.7Syllables f0 80.1 78.6 79.3 40.9 69.2 52.0 59.4 71.1 74.4 62.8 68.1 49.4

& E 78.9 78.9 78.9 42.3 69.5 51.0 58.8 71.3 74.0 62.3 67.7 49.7Phones D 77.2 79.1 78.1 44.3 67.8 53.0 59.5 72.2 72.2 63.6 67.6 50.5

f0+E+D 78.8 79.8 79.3 41.6 69.4 52.1 59.5 70.9 73.9 63.4 68.2 49.3All Combined 78.6 81.6 80.0 40.7 68.2 54.6 60.6 70.8 73.1 65.5 69.1 48.9

Table 5.14: Results for manual transcripts, bootstrapping from a written corpora model.



f0 + E 72.1 68.3 70.1 58.2 63.5 39.7 48.8 83.2 67.8 51.2 58.3 62.8Syllables f0 69.1 68.3 68.7 62.3 62.2 39.1 48.0 84.6 65.7 50.9 57.4 64.3

& E 68.4 66.4 67.4 64.3 61.1 39.4 47.9 85.7 64.8 50.3 56.6 65.4Phones D 68.5 66.2 67.3 64.3 62.2 38.9 47.8 84.8 65.4 49.8 56.6 64.9

f0+E+D 68.4 69.3 68.8 62.8 63.5 37.6 47.3 84.0 66.1 50.4 57.2 64.1All Combined 71.9 68.6 70.2 58.2 61.3 41.9 49.8 84.5 66.4 52.7 58.7 63.2

Table 5.15: Results for automatic transcripts, bootstrapping from a written corpora model.

text. The first results were encouraging, which motivated testing this strategy to the combina-tion of each prosodic feature. The results obtained in this way are presented in Tables 5.14 and5.15. They correspond to the best results achieved so far. Combining all features still achievesbetter results than the baseline, but the best results are obtained by combining lexical, acousticand word-based prosodic features, putting aside the syllable-based and phone-based features,whose impact is only marginal.

Additionally to the above bootstrapping method for improving the transcripts model, analternative was also tested. The idea consisted of using the prediction of the written corporamodel as a complement to the transcripts data. Three different features (COMMA, FULLSTOP,SPACE) were appended to the feature vector of each event in the transcripts data, with thecorresponding probabilities, provided by the written corpora model. Models trained with theimproved data achieve better performances than using solely information coming from thetranscripts. Nevertheless, in general, this method is still worse than the first method tested,based on bootstrapping.


5.3.2 Extension to Question Marks

This section concerns the automatic detection of question marks, which corresponds to de-tecting which sentences are interrogatives. This is an extension to the punctuation module,which was initially designed to deal only with full stop and comma. Detecting full-stops andcommas depends mostly on a local context, usually two or three words, and corresponds todetecting sentence boundaries. However, most interrogatives, specially the wh-questions, hingeon words that are used in the beginning and/or at the end of the sentence, implying that sen-tence boundaries must be previously known. Experiments here reported are based on a manualsentence segmentation and identify which sentences are interrogative. The same acoustic andprosodic features used for full stop and comma were also applied to question mark. However,lexical features were extracted from the whole sentence, and each event corresponds to a sen-tence instead of a word. The same ME-based approach will be followed, but now reducing theclassification to a binary problem.

Results concerning the ALERT-SR corpus consider only the Eval and JEval evaluation sub-sets, because the remaining evaluation sets were not completely revised by the time these ex-periments started. In order to evaluate if the features would be dependent on the nature of thecorpus, besides the ALERT-SR corpus, a corpus collected within the Lectra project (Trancosoet al., 2006, 2008) has also been analysed. The corpus, henceforth denoted of LECTRA corpus, isaimed at transcribing university lectures for e-learning/e-inclusion applications, namely, mak-ing the transcribed lectures available for hearing impaired students, and it offers a differentperspective on the question marks recovery. The corpus has a total of 75h, corresponding to 7different courses, of which only 27h were orthographically transcribed (totaling 155k words)so far.

In a previous study (Moniz et al., 2011) different corpora have been analysed in order tosee if the weight of the features was dependent on the nature of the corpus and on the mostcharacteristic types of interrogatives in each. The study concluded that the percentage of inter-rogatives was, in fact, highly dependent on the nature of the corpus. For the university lecturescorpus, interrogatives represent 20.4% of all the punctuation marks, and similar values (22.0%)have been found also in a map-task corpus (Trancoso et al., 1998) - in both corpora, the pro-portion is ten times more than in broadcast news (2.1%). This difference is related not onlywith the percentage of interrogatives across different corpora, but also with their subtypes. Inbroadcast news, interrogatives are almost exclusively found in interviews, and in transitionsfrom anchormen to reporters. In broadcast news yes/no questions account for 47.0% of all inter-rogatives, wh-questions for 40.4%, while tags and alternative questions account only 10.0% and2.6%, respectively. These percentages compare well with the ones for newspapers, but not withthe ones of the other corpora analysed. The highest percentage of tag questions is found in theuniversity lecture corpus (40.4%), interpretable by the teacher’s need to confirm if the studentsare understanding what is being said; and the highest percentage of yes/no questions occur in


Evaluation corpus correct wrong missed Prec Rec F SERPUBnews 1100 236 1740 82.3 38.7 52.7 69.6ALERT-SR - Manual transcripts 128 25 287 83.7 30.8 45.1 75.2ALERT-SR - Automatic transcripts 84 27 305 75.7 21.6 33.6 85.3LECTRA - Manual transcripts 157 31 221 83.5 41.5 55.5 66.7

Table 5.16: Recovering question marks using a written corpora model.

the map-task corpus (73.6%), related mostly with the description of a map made by a giver andthe need to ask if the follower is understanding the instructions.

The baseline experiments were performed using only lexical information. The followingfeatures were used for a given sentence: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, start_x,x_end, len, where: wi is a word in the sentence, wi+1 is the word that follows and nwi±x is then-gram of words that starts x positions after or before the position i. start_x and x_end featureswere used for identifying word n-grams occurring either at the beginning or at the end of thesentence. len corresponds to the number of words in the sentence. A discriminative modelhas been created using the PUBnews newspaper corpora, described in section 3.2.1. Table 5.16shows the results of applying the model directly to different evaluation sets, where correct isthe number of correctly identified interrogatives, wrong corresponds to false acceptances orinsertions, and missed corresponds to the missing slots or deletions. The table values reveal aprecision around 80%, but a small recall. The main conclusion is that the recall values using thislimited set of features are correlated with the identification of a specific type of interrogative,wh- questions. Recall values are comparable to the ones of the wh- question distribution acrosscorpora. As for yes/no and tag questions, they are only residually identified.

The subsequent experiments aimed at analyzing the weight and contribution of differentfeature classes and the impact of their combination. Features were calculated for each sentencetransition, with or without a pause, using the same analysis scope as Shriberg et al. (2009)(last word, last stressed syllable and last voiced phone from the current boundary, and the firstword, and first voiced phone from the following boundary). The following set of features hasbeen used: f0 and energy slopes in the words before and after a silent pause, f0 and energydifferences between these units and also duration of the last syllable and the last phone. Withthis set of features, we aimed at capturing nuclear and boundary tones, amplitude, pitch reset,and final lengthening. This set of prosodic features already proved useful for the detection ofthe full stop and comma, in the ALERT-SR corpus, outperforming results achieved using lexicaland acoustic features only.

Combining the previously created model from written corpora, with transcripts data wasalso a major issue. Different approaches have been tested, and the best approach consisted ofusing the large written corpora model to perform automatic classification on the transcriptstraining data and then use the assigned probabilities as features for training a new model fromthe transcripts training data. Only two features were added, with the corresponding probabil-


Evaluation Transcripts info correct wrong missed Prec Rec F SERlexical 143 24 272 85.6 34.5 49.1 71.3

ALERT-SR lexical + acoustic 144 27 271 84.2 34.7 49.1 71.8manual lexical + acoustic + WB 147 28 268 84.0 35.4 49.8 71.3

transcripts lexical + acoustic + SYB 148 29 267 83.6 35.7 50.0 71.3all features 146 31 269 82.5 35.2 49.3 72.3lexical 74 27 315 73.3 19.0 30.2 87.9

ALERT-SR lexical + acoustic 75 25 314 75.0 19.3 30.7 87.1automatic lexical + acoustic + WB 76 22 313 77.6 19.5 31.2 86.1transcripts lexical + acoustic + SYB 71 26 318 73.2 18.3 29.2 88.4

all features 75 23 314 76.5 19.3 30.8 86.6lexical 267 51 111 84.0 70.6 76.7 42.9

LECTRA lexical + acoustic 268 54 110 83.2 70.9 76.6 43.4manual lexical + acoustic + WB 276 52 102 84.1 73.0 78.2 40.7

transcripts lexical + acoustic + SYB 266 52 112 83.6 70.4 76.4 43.4all features 274 53 104 83.8 72.5 77.7 41.5

Table 5.17: Performance results recovering the question mark in different corpora.

ities, provided by the initial model. The performance of the resultant models is better than: i)using only the information coming from the transcripts; ii) using the bootstrapping method,applied in previous sections, because it is an easier problem (binary), and the reduced numberof question marks found in the BN corpora cause the method to converge too fast, loosing mostof the information given by the initial model.

The results of recovering question marks over the LECTRA and the ALERT-SR corpus arepresented in Table 5.17, where different combination of features were added to a standardmodel, which uses lexical features only. For each corpus, the first row was achieved using onlylexical features, the second also uses acoustic features and the last three lines combine lexical,acoustic an prosodic information, either using word-based (WB) prosodic features, syllable andphone-based (SYB) prosodic features, or all the prosodic features combined. Combining thewritten corpora model with lexical information coming from the speech transcripts seems tobe significantly important for manual transcripts, where the performance increases about 3.9%for ALERT-SR and 21.2% for the LECTRA corpus, when comparing with results from Table5.16. However, the impact is negative for automatic transcripts, where recognition errors causechanges on the lexical features. Besides, acoustic information seems to be useful for automatictranscripts, but its impact is negative for the manual transcripts. The combination of word-based (WB) prosodic features seems to lead to the best results, but syllable and phone-based(SLB) prosodic features have not shown a positive contribution.

Based on language dependency effects (fewer lexical cues in EP than in other languages,such as English) and also on the statistics presented, one can say that, ideally, around 40.0% ofall interrogatives in broadcast news would be mainly identified by lexical cues – correspond-ing to wh-questions – while the remaining ones would imply the use of prosodic features to be


correctly identified. Results pointed out in this direction. A recent study focusing on the de-tection of question marks in meeting transcriptions (Boakye et al., 2009) analysed the relevanceof various features in this task, showing that the lexico-syntactic features are the most useful.Like stated by Moniz et al. (2011), when training only with lexical features, wh- questions aresignificantly identified, whereas tag questions and yes/no questions are quite residual, excep-tion made in the latter case for the bigram acha que/do you think. There are still wh- questionsnot accounted for, mainly due to very complex structures hard to disambiguate automatically.When training with all the features, yes/no and tag questions are better identified. It was alsoverified that prosodic features increase the identification of interrogatives in BN spontaneousspeech.

These results are encouraging, but still far from the ones obtained for full stop and comma.Nevertheless, other related papers also show a lower performance in the detection of questionmarks. For example, Gravano et al. (2009) report about 47% precision and 24% recall for EnglishBN, using lexical features only, but training with a very large written corpus.

The results in this study partially agree with the ones reported in Shriberg et al. (2009), re-garding the contribution of each prosodic parameter, and also the set of discriminative featuresused, where the most important feature turned out to be the f0 slope in the last word of thecurrent boundary and between word transitions (last word of the current boundary and thestarting word of the following boundary).

It was also verified that prosodic features increase the identification of interrogatives inPortuguese BN spontaneous speech, e.g., yes/no questions with a request to complete a sentence(e.g., recta das?/lines of?), tag questions (such as não é?/isn’t it?), and alternative questions aswell (contava com esta decisão ou não?/Did you expect this decision or not?). Even when all the in-formation is combined, we still have questions that are not well identified, due to the followingaspects:

i) a considerable amount of questions is made in the transition between newsreader andreporter with noisy background (such as war scenarios);

ii) frequent elliptic questions with reduced contexts, e.g., eu?/me? or José?;

iii) sequences with disfluencies, e.g., <é é é> como é que se consegue?, contrasted with a similarquestion without disfluencies that was identified: Como é que conseguem isso?/how do youmanage that?;

iv) sequences starting with the copulative conjunction e/and or the adversative conjunctionmas/but, which usually do not occur at the start of sentences;

v) false insertions of question marks in sequences with subordinated questions, which arenot marked with a question mark;


Type of full-stop comma allTranscripts Prec Rec F SER Prec Rec F SER Prec Rec F SERManual 79.2 70.8 74.7 47.8 66.2 16.1 25.9 92.1 76.7 45.1 56.8 60.5Automatic 71.1 64.6 67.7 61.7 65.1 16.1 25.8 92.6 69.9 41.7 52.3 68.1

Table 5.18: Punctuation results for English BN transcripts.

vi) sequences with more than one consecutive question, randomly chosen, e.g., ... nascemduas perguntas: quem? e porquê?/ ...two questions arise: who? and why?;

vii) sequences integrating parenthetical comments or vocatives, e.g., Foi acidente mesmo ouatentado, Noé?/Was it an accident or an attack, Noé?.

5.4 Extension to other Languages

This section extends some of the previously described experiments to other languages,particularly to English. The three most frequent and important punctuation marks are con-sidered: full-stop, comma, and question marks. However, similarly to what has been done forPortuguese, the detection of question marks is separated from the other two, since detecting full-stops and commas depend mostly on a local context, but most interrogative sentences, speciallywh-questions, depend on the information used in the beginning and at the end of the sentence(global context). Detecting full-stops and commas corresponds to detecting sentence boundaries,which in our case corresponds to distinguish between two types of sentence boundaries. On theother hand, detecting interrogative sentences uses properties of the whole sentence as features,given the sentence boundaries.

An analysis of Tables 3.1 and 3.5 reveal that the English training data has almost twice thesize of the Portuguese data. Nonetheless, English corpora are heterogeneous, comprising fivedifferent corpora. Therefore, better performances may not necessarily be achieved.

5.4.1 Recovering Full-stop and Comma

The baseline performance for the English data was achieved using the feature set describedin Section 5.2.1: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, pi, pi+1, 2pi−2, 2pi−1, 2pi,2pi+1, 3pi−2, 3pi−1, GenderChg1, SpeakerChg1, and TimeGap1. Table 5.18 shows the punctuationresults achieved for the English data, using these baseline features. The manual transcriptresults were achieved using force aligned data, produced using the L2F speech recognitionsystem. These results are quite similar to the results achieved for Portuguese using the sameset of features, and presented in Tables 5.10 and 5.11. Like it was concluded for the Portuguesedata, the overall performance is also affected by the comma detection performance, which issignificantly lower in terms of SER. Precision is consistently better than recall, confirming that


the system usually prefers to avoid mistakes than to add incorrect slots. The WER impact, interms of SER, is about 12% for Portuguese and about 8% for English (absolute values).

The recent study reported by Gravano et al. (2009) considers different text-based n-gramlanguage models, ranging from 58 million to 55 billion of training words, trained from Internetnews articles. For the task of recovering punctuation over broadcast news data, the smaller LMachieved an F-score of 37% for comma and 46% for full stop, while the bigger LM achieved 52%(14% absolute increase) for the comma and 63% for the full stop (17% absolute increase). Thesignificant performance increase suggests that our results, that use less than one million wordsof speech transcripts, can be much improved by using larger training sets. Their best F-scoreconcerning the full stop (62%) is lower than results presented here for English, but authors donot make use of any acoustic information, as it as been used here.

The following subsections analyse two ways for improving the baseline punctuation re-sults: the first making use of the punctuation information that can be found in large writtencorpora, and the second adding prosodic features, besides the existing lexical and acoustic fea-tures.

5.4.1.1 Introducing Prosodic Information

The first strategy for improving the baseline results, consisted of adding prosodic features,besides the existing lexical and speaker dependent features. An important issue for providingprosodic information consisted of marking the syllable boundaries as well as the syllable stress.The tsylb2 (Fisher, 1996), an automatic phonological-based syllabication algorithm, has beenused for this purpose. Similarly to what has been done for Portuguese, the maximum, minimum,median, and slope values for pitch and energy were calculated in each word, syllable, and phone.Duration was also calculated for each one of the previous units. Features were calculated foreach word transition, with or without a pause, using: the last word, last stressed syllable andlast voiced phone from the current word, and the first word, and first voiced phone from thefollowing word. The following set of features has been used: f0 and energy slopes in the wordsbefore and after a silent pause, f0 and energy differences between these units and also durationof the last syllable and the last phone.

Tables 5.19 and 5.20 show the achieved results for manual and automatic transcripts, re-spectively, outlining the contribution of each prosodic feature per se. Despite being less signifi-cant than the corresponding results for Portuguese (Section 5.3.1.1), these results exhibit gainsrelatively to the previous results, for both types of transcripts, and for both punctuation marks,and they are again mainly related with the full stop. The word-based features turned out to bethe most reliable ones, whereas syllable-based features achieved only small gains relatively toprevious results. The best results were always achieved either combining all the features orusing the word-based features alone. These results, together with results from Section 5.3.1.1,partially agree with the ones reported in Shriberg et al. (2009), regarding the contribution of



Word f0 78.1 75.2 76.6 45.9 66.6 14.9 24.4 92.6 76.1 46.9 58.1 59.1Based E 78.3 73.5 75.8 46.8 69.4 16.6 26.7 90.7 76.7 46.8 58.1 59.0

f0 + E 79.4 74.4 76.9 44.8 67.5 15.3 25.0 92.0 77.3 46.7 58.3 58.9Syllables f0 78.1 74.0 76.0 46.8 69.2 13.3 22.3 92.6 76.7 45.5 57.1 60.1

& E 78.3 72.6 75.3 47.5 68.4 15.6 25.4 91.6 76.6 45.9 57.4 60.0Phones D 78.1 72.4 75.1 47.9 69.0 14.6 24.2 91.9 76.6 45.3 56.9 60.5

f0+E+D 78.3 74.0 76.1 46.5 68.8 13.7 22.9 92.5 76.8 45.8 57.3 59.9All Combined 78.4 74.0 76.1 46.4 67.2 13.7 22.7 93.0 76.6 45.7 57.3 60.1

Table 5.19: Punctuation results for English BN manual transcripts, adding prosody.


Word f0 72.9 66.8 69.7 58.0 66.0 17.6 27.7 91.5 71.5 43.6 54.2 65.7Based E 70.9 60.6 65.3 64.3 65.1 7.8 14.0 96.4 70.3 35.7 47.4 71.6

f0 + E 72.7 65.8 69.1 58.9 62.6 15.3 24.6 93.9 70.8 42.0 52.7 67.1Syllables f0 72.5 61.0 66.3 62.1 64.4 9.7 16.9 95.7 71.4 36.9 48.6 70.3

& E 71.0 55.7 62.4 67.1 63.7 4.5 8.4 98.1 70.5 31.6 43.6 74.9Phones D 70.9 58.5 64.1 65.5 59.5 7.8 13.8 97.5 69.5 34.6 46.2 73.0

f0+E+D 71.2 62.0 66.3 63.1 63.4 7.5 13.5 96.8 70.4 36.3 47.9 71.1All Combined 74.5 66.1 70.0 56.6 65.0 17.9 28.1 91.7 72.4 43.4 54.2 65.3

Table 5.20: Punctuation results for English BN automatic transcripts, adding prosody.

each prosodic parameter, and also the set of used features, where the most important featureturned out to be the f0 slope in the words and between word transitions.

Individual results concerning the 1998 Hub-4 evaluation data (LDC2006S86) corpus werecalculated, in order to establish a parallel with other related work. Christensen et al. (2001) andKim and Woodland (2001) report results on the LDC2006S86 corpus, making use of lexical andprosodic features for recovering full stop, comma and question mark. The first paper describesa set of experiments using statistical finite-state models and reports a SER of 89%, when allpunctuation marks are combined, the best performance reported is 89% SER. The paper alsoreports results on individual punctuation marks, achieving from 41% to 79% SER for the fullstop, and from 81% to 110% SER for the comma, but these results are insufficient for drawingfurther conclusions. The results for automatic transcripts presented in this study are similar tothe ones reported by Kim and Woodland (2001). However, results cannot be directly compared,because i) the paper uses only a portion of the LDC2006S86 data for evaluation; ii) the paperresults take into account the question mark detection; and iii) the WER is different.

5.4.1.2 Retrain from a Written Corpora Model

Similarly to experiments performed for Portuguese, another way of improving the currentpunctuation results consisted of using punctuation information extracted from written cor-




f0 + E 78.8 76.5 77.6 44.1 64.2 24.2 35.1 89.3 75.1 52.0 61.4 55.8Syllables f0 77.2 75.3 76.2 46.9 63.3 20.8 31.3 91.2 74.0 49.7 59.5 57.8

& E 79.2 71.4 75.1 47.4 61.3 26.9 37.4 90.1 73.8 50.5 60.0 57.6Phones D 78.3 73.1 75.6 47.2 64.9 21.8 32.7 90.0 75.1 49.1 59.3 58.1

f0+E+D 77.7 75.4 76.5 46.2 64.5 23.4 34.3 89.5 74.4 51.0 60.5 56.7All Combined 79.3 75.7 77.5 44.0 63.0 24.8 35.6 89.8 74.9 51.9 61.3 55.8

Table 5.21: Punctuation for force aligned transcripts, bootstrapping from a written corporamodel.



f0 + E 71.3 63.5 67.1 62.1 59.2 15.8 25.0 95.1 68.7 41.0 51.4 68.4Syllables f0 71.4 62.0 66.4 62.8 58.7 15.5 24.5 95.4 68.7 40.1 50.6 69.3

& E 71.0 61.1 65.7 63.8 59.1 15.4 24.5 95.2 68.5 39.6 50.2 69.7Phones D 69.6 60.1 64.5 66.1 57.4 16.0 25.0 95.9 66.9 39.4 49.6 70.8

f0+E+D 71.9 62.5 66.9 62.0 57.7 15.7 24.6 95.8 68.8 40.5 50.9 69.0All Combined 71.3 64.0 67.4 61.8 56.4 16.6 25.7 96.2 67.9 41.7 51.6 68.6

Table 5.22: Punctuation for automatic transcripts, bootstrapping from a written corpora model.

pora. For that purpose, we have firstly trained a punctuation model using written corpora,and then trained a new punctuation model with transcripts, bootstrapping from the writtencorpora model. These experiments use the NYT (New York Times) portion of the LDC corpusLDC1998T30 (North American News Text Supplement), described in Section 3.2.3. The originaltexts were normalized and all the punctuation marks removed, making them close to speechtranscripts. Models suitable for speech transcripts were then trained using transcripts trainingdata, bootstrapping from the initial written corpora model.

Tables 5.21 and 5.22 present the obtained results that can be directly compared with re-sults from Table 5.18. From the comparison, regular trends are found: i) bootstrapping doesnot promote better results for automatic transcripts; ii) the performance for the force aligneddata is improved; iii) comma detection improves in both conditions. These findings, togetherwith the conclusions derived in Section 5.3.1.2, support two basic ideas: results are better forPortuguese, probably because the English data is quite heterogeneous and has a higher WER;the most significant gains concerning comma derive from the fact that this specific punctuationmark depends more on lexical features (e.g., ..., por exemplo/for instance, ...), similar to observa-tions from Favre et al. (2009).


Evaluation corpus correct wrong missed Prec Rec F SERNYT evaluation 993 81 668 92.5 59.8 72.6 45.1Manual transcripts 145 30 119 82.9 54.9 66.1 56.4Automatic transcripts 101 51 151 66.4 40.1 50.0 80.2

Table 5.23: Recovering question marks using a written corpora based model.

5.4.2 Detection of Question Marks

The following experiments concern the automatic detection of question marks. This corre-sponds to an extension of the punctuation module which was initially designed to deal onlywith full stops and commas. The following experiments follow the same ME-based approach,but this time each event corresponds to an entire sentence, previously marked using the ref-erence data. In other words, this task tries to assess which sentences are interrogative. Theremaining of this section will firstly assess the performance of the module, where only lexicalinformation was used, learned from a large corpus of written data; and then it will study theimpact of introducing prosodic features, analysing the individual contribution of each prosodicfeature.

Initial experiments used only lexical information. A discriminative model has been cre-ated for English written corpora, using the NYT (New York Times) portion of the LDC corpusLDC1998T30, described in Section 3.2.3. The features previously used in Section 5.3.2 were alsoapplied to each sentence: wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, start_x, x_end, len.Table 5.23 shows the results of applying the written corpora model to different evaluation setsdirectly, where correct, wrong and missed correspond to the number of correct sentences, inser-tions and deletions, respectively. Results concerning the written corpora evaluation set (NYT)are about 24% better than the corresponding results achieved for Portuguese newspaper data.As expected, question marks are easier to detect for written English, since this language hasmore lexical cues, mainly quite frequent n-grams related with “auxiliary verb plus subject in-version” (e. g. do you?, can you?, have you?). The difference of about 24% SER is mostly relatedwith the high number of deletions (non-identified sentences) for Portuguese. This is due tothe fact that yes/no questions, corresponding to almost 50% of all the questions in the corpus,are mainly disambiguated from a declarative sentence using prosodic information. Concern-ing the force aligned transcripts, results are again better for English. The difference betweenforce aligned and automatic transcripts is bigger in English (21.4%) than in Portuguese (16.6%),reflecting the impact of the recognition errors in this task. Although n-grams related with“auxiliary verb plus subject inversion” are relevant features for correctly identifying questionmarks in English, the auxiliary verbs (e.g., do, can, have) are often misrecognized, particularlyin spontaneous speech, causing that bigger impact.

When using only this limited set of features, the recall percentages are correlated with spe-cific types of questions, namely, wh-questions for both languages; and yes/no questions almost


Evaluation Transcripts info correct wrong missed Prec Rec F SERlexical 155 22 109 87.6 58.7 70.3 49.6

manual lexical + acoustic 151 21 113 87.8 57.2 69.3 50.8transcripts lexical + acoustic + WB 152 21 112 87.9 57.6 69.6 50.4

lexical + acoustic + SYB 151 19 113 88.8 57.2 69.6 50.0all features 149 19 115 88.7 56.4 69.0 50.8lexical 100 27 152 78.7 39.7 52.8 71.0

automatic lexical + acoustic 103 26 149 79.8 40.9 54.1 69.4transcripts lexical + acoustic + WB 100 31 152 76.3 39.7 52.2 72.6

lexical + acoustic + SYB 100 27 152 78.7 39.7 52.8 71.0all features 102 33 150 75.6 40.5 52.7 72.6

Table 5.24: Recovering the question mark, adding acoustic and prosodic features.

exclusively for English. Due to language-specific properties, namely, “auxiliary verb plus sub-ject inversion”, the recall percentages for English are always higher than for Portuguese. Notsurprisingly then, the bigram “do you”, for instance, is fairly well associated with a yes/noquestion. For Portuguese, the recall percentage of the aligned data is comparable to the onesof the wh-questions for BN and newspapers, although there is still a small percentage of thistype of interrogative not accounted for, mainly due to very complex structures hard to disam-biguate automatically. As for tag and alternative questions in both languages they are not easilyidentified with lexical features only.2

The subsequent experiments aimed at analyzing the weight and contribution of differentfeature classes and the impact of their combination. The model previously created from writtencorpora, was combined with transcripts data, using the approach also applied for Portuguese.This consisted in using the written corpora model to perform automatic classification on thetraining data and then use the assigned class as a feature for training a new model from thetranscripts training data. Results recovering question marks over the English data are presentedin Table 5.24, where different combination of features were added to a standard model that useslexical features only. For each corpus, the first row was achieved using only lexical features,the second uses also acoustic features and the last three lines combine lexical, acoustic andprosodic information, either using word-based (WB) prosodic features, syllable and phone-based(SYB) prosodic features, or all the prosodic features combined. There is an effective gain for thealigned English data, but results are not very significant, due to the relatively small numberof question marks found in the corpora. Combining the written corpora model with lexicalinformation coming from the speech transcripts seems to be significantly important both for themanual and automatic transcripts, contrasting with the Portuguese results where the impactwas negative for the automatic transcripts. On the other hand, acoustic information seemsto be useful for the automatic transcripts, but its impact is negative for the manual transcripts,similarly to conclusions achieved for Portuguese. The combination of prosodic features has not

2Exception made for tag questions in the university lectures corpus used in the previous section, which has a highpercentage of this type of interrogatives in both train and test sets.

5.5. SUMMARY 121

shown a positive contribution for the English language, contrasting with the results achievedfor Portuguese. The impact of the recognition errors is about 16% (absolute) for PortugueseBroadcast News and 21% for English, where the overall WER is higher.

As stated by Vassière (1983), these prosodic features are language-independent. Languagespecific properties in the data are related with the fact that word-based features are more usefulfor the Portuguese corpus, while syllable-based ones give the best results for the English data.This result may be interpretable by language-specific syllabic properties, i.e., English allowsfor more segmental material in all the syllabic skeleton. Thus, for Portuguese, the word-basedfeatures provide more context. Moreover, different durational patterns were found at the endof an intonational unit (e.g., in European Portuguese post-tonic syllables are quite often trun-cated); and also with different pitch slopes that may be associated with discourse functionsbeyond sentence-form types.

5.5 Summary

This chapter reported experiments concerning the utomatic recovery of punctuation marks.Section 5.1 reported an exploratory work analysing the occurrence of the different punctuationmarks for different languages. Such analysis, considering both written corpora and speechtranscripts, contributes to better understand the usage of each punctuation mark across lan-guages.

Section 5.2 described initial experiments, using lexical and acoustic features, for basic sen-tence boundary detection, and for discriminating the two most frequent punctuation marks:full stop and comma. Independent results were achieved for manual and automatic transcripts,which allowed to assess the impact of the speech recognition errors on this task. Independentresults were also achieved for spontaneous and planned speech. The contribution of each oneof the features was analysed separately, making it possible to measure the influence of eachfeature in the automatic punctuation. Results achieved provided the baseline for further punc-tuation experiments.

Section 5.3 described the efforts to improve the punctuation module to a better detection ofthe basic punctuation marks, full stop and comma, and to deal with the question mark. Two waysof improving the initial results were addressed: i) adding prosodic features, besides the existinglexical, time-based and speaker-based features; ii) making use of punctuation information thatcan be found in large written corpora. Reported experiments were performed both on manualtranscripts and directly over the automatic speech recognition output, using lexical, acousticand prosodic features. Results pointed out that combining all the features lead to the bestperformance. Results also made possible to discriminate the most relevant prosodic featuresfor this task, those related to pitch being the most significant per se; however, the best resultswere obtained when combining pitch and energy. The full stop detection consistently achieved


the best performance, followed by the comma, and finally by the question mark. The study ofthe latter, however, is still in an early stage and results can be further improved either by usinglarger training data or by extending the analysis of pitch slopes with discourse functionalitiesbeyond sentence-form types.

Section 5.4 reported experiments using English data, and compared the performance ofthe punctuation recovery module when dealing with Portuguese and English. Results sug-gested that question marks are easier to detect for the English language. When using only lexicaland acoustic features, recall values are correlated with specific types of questions, namely, wh-questions for both languages, and yes/no questions almost exclusively for English. Due tolanguage-specific properties, namely, auxiliary verb plus subject inversion, the recall for En-glish is always higher than for Portuguese broadcast news. Not surprisingly then, the bigram“do you”, for instance, is fairly well associated with yes/no questions. For Portuguese, the re-call of the aligned data is comparable to the one of the wh-questions for broadcast news andnewspapers, although there is still a small percentage of this type of interrogatives that is notaccounted for, mainly due to very complex structures hard to disambiguate automatically. Asfor tag and alternative questions in both languages, they are not easily identified with lexicalfeatures only.

6Conclusions and Future Directions

This chapter overviews the work reported in this thesis, presents the main conclusions,enumerates the main contributions, and describes a number of possible directions for furtherextending this research.

6.1 Overview

The quality of a speech recognition system should not be measured using only a singleperformance metric, like the WER, as other important factors that can improve the human leg-ibility and contribute for further automatic processing may be considered as well. For thatreason, rich transcription-related topics have been gaining increasing attention from the sci-entific community in the recent years. This study addressed two metadata annotation tasksthat take part in the production of rich transcripts: recovering punctuation marks and capi-talization information on speech transcripts. Information concerning punctuation marks andcapitalization is critical for the legibility of speech transcripts and it is also important to otherdownstream tasks that are usually also applied to written corpora, like named entity recogni-tion, information extraction, extractive summarization, parsing, and machine translation.

The most relevant data used in the scope of this work was described in Chapter 3. Most ofthe data has changed during this thesis time span, sometimes due to data revisions, but also dueto corrections in the corpora related tools. For example, the current version of the PortugueseBN corpus has been completely revised recently by an expert linguist. This was particularlyimportant given that the previous version of this corpus was manually transcribed by differentannotators, who did not follow consistent criteria, specially in terms of punctuation marks. Thenormalization-related tools have also been subject to several different improvements along thisthesis. For that reason, in order to compare different experimental configurations, a number ofexperiments were repeated several times in different time periods.

Besides the Portuguese speech corpus, the study has been ported to other languages,namely Spanish and English. The English BN corpus combines five different corpora subsets.Each corpora subset has been produced in a different time period, built for a different purpose,encoded with different annotation criteria, and was available in a different format. Combin-

124 CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS

ing these heterogeneous corpora demanded a normalization strategy specifically adapted foreach corpus, which combined a number of different existing tools with new tools developedspecifically to deal with this problem. The automatic transcripts for all the speech corporawere produced by the L2F recognition system. The reference punctuation and capitalizationfor the automatic transcripts were provided by means of alignments between the manual andthe automatic transcripts. This is is not a trivial task because of the recognition errors. All theinformation is encoded in XML format. It contains information coming from the speech recog-nition system, as well as other reference information coming from the manual transcripts. Theword boundaries that have been previously identified automatically by the speech recognitionsystem were adjusted by means of post-processing rules and prosodic features (pitch, energyand duration). The final content is also available as an XML file and contains not only pitchand energy, extracted directly from the speech signal, but also information concerning phones,syllable boundaries and syllable stress.

Most of the experiments described in this study aim at processing broadcast news data,but other information sources, like written newspaper corpora, have been used to complementthe relatively small size of the speech corpora. Written corpora contains information that isspecially important for capitalization. In fact, these corpora provide information concerningthe context where the capitalized words appear. The Portuguese corpora used in these exper-iments consist of online editions of Portuguese newspapers, collected from the web (at L2F).Some of the data has been collected during this thesis time spam, allowing to perform exper-iments with the most recent data. The English written corpus is the North American NewsText Supplement, available from the LDC. All the written corpora was normalized in order tobe closer to speech transcripts, and therefore to be used for training speech-like models. Thatdemanded substantial efforts in creating new tools, or adjusting and improving the existingnormalization tools, given that each corpus requires a specially designed (or at least adapted)tool for dealing with specific phenomena.

The work on capitalization recovery, on both written corpora and speech transcripts, was pre-sented in Chapter 4. As part of the early work, two generative methods have been comparedwith the ME approach. Results suggest that generative methods produce better results for writ-ten corpora, while the maximum entropy approach works better with speech transcripts, alsosuggesting that the impact of the recognition errors is stronger for these generative approaches.The following step in this study consisted of analysing the impact of language variation in thecapitalization task. This was partly motivated by the daily BN Subtitling, which led the L2Fspeech group to use a baseline vocabulary combined with a daily modification of the vocabu-lary (Martins et al., 2007b) and a re-estimation of the language model. This dynamic languagemodeling provided an interesting scenario for our capitalization experiments. Maximum en-tropy models proved to be suitable to perform the capitalization task, specially when dealingwith language dynamics. This approach provided a clean framework for learning with newdata, while slowly discarding unused data. It also enabled the combination of different data

6.1. OVERVIEW 125

sources and exploration of different features. In terms of language variation, results suggestedthat different capitalization models should be used for different time periods.

Most of the experiments compared the capitalization performance when performed both inwritten corpora and in speech transcripts. Individual results concerning manual and automatictranscriptions were also presented, revealing the impact of the recognition errors on this task.For both types of transcription, results show evidence that the performance is affected by thetemporal distance between training and testing sets. Such conclusions led to the proposal andevaluation, in this study, of three different approaches for updating the capitalization module.The most promising approach consisted of iteratively retraining a baseline model with the newavailable data, using small corpora subsets, causing the performance to increase about 1.6%when dealing with manual transcripts. Results reveal that producing capitalization models ona daily basis did not lead to a significant improvement. Therefore, the adaptation of capitaliza-tion models on a periodic basis was the best choice. The small improvements gained in termsof capitalization suggested that dynamically updated models may play a small role, but theupdating does not need to be done daily, a fact that is also according to our intuition.

A number of recent experiments on automatic capitalization, reflecting the most recenttraining and testing conditions, with more accurate results, were also presented in Chapter 4.The ME-based approach was again compared with HMMs and also with CRFs, using the mostrecent data and increasing the number of training iterations. Besides the automatic transcripts,experiments with force aligned transcriptions were also included. Later experiments confirmedthat the HMM-based method was suitable for dealing with written corpora, but ME and CRFsachieved a better performance when applied to speech data. The most recent experimentsextending this work to other languages were also reported. The effect of the language variationover time was again studied for the English and Spanish data, confirming that the intervalbetween the training and testing periods is relevant for the capitalization performance.

Chapter 5 reported experiments concerning the automatic recovery of punctuation marks.As part of the early work, an exploratory work analysing the occurrence of the different punc-tuation marks for different languages has been performed. Such an analysis, considering bothwritten corpora and speech transcripts, contributed to better understand the usage of eachpunctuation mark across languages. Results show that Portuguese broadcast news transcriptshave a higher number of commas when compared with English and Spanish. The BN data con-tains a greater number of sentences and more intra-sentence punctuation marks when compar-ing to newspaper written corpora, confirming that speech sentences are shorter.

Initial experiments concerning punctuation recovery were performed using lexical andacoustic features, firstly for basic sentence boundary detection, and then for discriminatingthe two most frequent punctuation marks: full stop and comma. The initial results were im-proved by adding prosodic features, besides the existing lexical, time-based and speaker-basedfeatures; and by making use of punctuation information that can be found in large written cor-


pora. Independent results were achieved for manual and automatic transcripts, allowing toassess the impact of the speech recognition errors on this task. Independent results were alsoachieved for spontaneous and planned speech. The contribution of each feature was analysedseparately, making it possible to measure its influence on the automatic punctuation recovery.The punctuation module was then extended to obtain a better detection of the basic punctua-tion marks, full stop and comma, and also to deal with the question mark. Reported experimentswere performed both on manual transcripts and directly over the automatic speech recognitionoutput, using lexical, acoustic and prosodic features. Results pointed out that combining all thefeatures usually conducts to the best performance.

6.2 Main Conclusions

This study addresses the tasks of recovering capitalization and punctuation marks fromspoken transcripts, produced by ASR systems. These two practical RT tasks were performedusing the same discriminative approach, based on maximum entropy, adequate for combiningdifferent data sources and features for characterizing the data and for on-the-fly integration,which is of great importance for tasks such as online subtitling, characterized by strict latencyrequirements. Reported experiments were conducted both over Portuguese and English BNdata, allowing to compare the performance on the two languages. Both force aligned and au-tomatic transcripts experiments were used, allowing to measure the impact of the recognitionerrors.

Capitalized words and named entities are intrinsically related, and are influenced by timevariation effects. For that reason, the so-called language dynamics have been analyzed for thecapitalization task. The ME modeling approach provides a clean framework for learning withnew data, while slowly discarding unused data, making it interesting for addressing prob-lems that comprise language variations in time. Language adaptation results clearly indicate,for both languages, that the capitalization performance is affected by the temporal distancebetween the training and testing data. Hence, our proposal states that different capitaliza-tion models should be used for different time periods. Capitalization experiments were alsoperformed with an HMM-based tagger, a common approach in this type of problem. Whilethe HMM-based approach captured the structure of written corpora better, the ME-based ap-proach proved to be better suited for speech transcripts, which includes portions of sponta-neous speech, characterized by a more flexible linguistic structure when compared to writtencorpora, and also more robust to ASR errors.

In what regards the punctuation task, this paper covers the three most frequent punctua-tion marks: full stop, comma, and question mark. Detecting full stops and commas is performedfirstly, and corresponds to segmenting the speech recognizer output stream. Question marks aredetected afterwards, making use of the previously identified segmentation boundaries. Rather

6.3. CONTRIBUTIONS 127

than comparing with other approaches, reported punctuation experiments focused on the us-age of additional information sources and diverse linguistic structures that can be found on thespeech data. Two different scenarios were explored for improving the baseline results for fullstop and comma. The first made use of the punctuation information that can be found in largewritten corpora. The second consisted of introducing prosodic features, besides the initial lex-ical, time-based and speaker-based features. The first scenario yielded improved results forall force aligned data, and for all the Portuguese data. The comma detection improved signif-icantly, specially for Portuguese aligned data (7.8%). These findings support two basic ideas:results are better for Portuguese, because our English data is quite heterogeneous and has ahigher WER; the most significant gains concerning comma derive from the fact that it dependsmore on lexical features. The second scenario provided even better results, for both languagesand both punctuation marks, with improvements ranging from 3% to 8% (absolute). The bestresults were again achieved for Portuguese, but this time they are mainly related with full stop.We have concluded that, in both languages, the linguistic structure related with punctuationmarks is being captured in different ways regarding the distinct punctuation marks: commas arebeing identified mostly by lexical features, while full stops depend more on prosodic ones. Themost significant gains come from combining all the available features. As for question marks,there is a gain for the recognized Portuguese and for the aligned English data, but differencesare not significant, due to the relatively small number of question marks in the corpora.

6.3 Contributions

This study proposes a methodology for enriching speech transcriptions that has been suc-cessfully applied for recovering punctuation marks and capitalization on broadcast news cor-pora. A prototype module, based on the proposed methodology, and incorporating the tworich transcription tasks, has been integrated in the L2F broadcast news processing chain. Thebroad set of lexical, acoustic, and prosodic information has been successfully combined to en-hance the results of this punctuation module. Additional sources of information, in particularwritten corpora, were used to provide additional information, specially for capitalization, thusminimizing the effects of having small sized speech corpora. Finally, the most relevant experi-ments have been ported for other languages. Hence, the goals initially proposed for this studywere met. The following items outline the most important contributions achieved in the scopeof this work:

• A shared approach for punctuation and capitalization recovery: the proposed approachproved to be suitable for dealing with speech data, allows to combine different levels ofinformation, and can be used for on-the-fly processing.

• To the best of our knowledge, the described experiments provide the first punctuationand capitalization results for Portuguese. Most results concern broadcast news corpora,


which was this study application major domain, but a number of experiments were alsoconducted on other corpora, including written newspaper corpora.

• Concerning the capitalization task, different methods have been compared, and reportedexperiments suggest that generative methods seem more appropriate for written corporawhile discriminative methods are more suitable for speech data. To the best of our knowl-edge, this is the first study achieving such conclusion.

• The impact of the language variation in time has been analysed for the capitalization task,in line with other related work. Reported experiments suggest that the capitalization per-formance decays over time, as an effect of language variation. That is also in agreementwith other reported work on NER. A number of different adaptation strategies have beenanalysed, suggesting that the capitalization models must be updated on a periodic basis.

• Concerning the punctuation task, independent results were achieved for manual and au-tomatic transcripts, allowing to assess the impact of the speech recognition errors. Inde-pendent results were also achieved for spontaneous and planned speech. Results pointedout that combining all the features lead to the best performance, but the contribution ofeach one of the features was analysed separately, making it possible to measure its influ-ence on the automatic punctuation task. The linguistic structure related with punctuationmarks is being captured in different ways: commas are being identified mostly by lexicalfeatures, while full stops depend more on prosodic features. The relatively small numberof question marks in the corpora prevented from drawing significant conclusions for thispunctuation mark. The full stop detection achieved the best performance, followed by thecomma, and finally by the question mark.

• Most of these experiments were ported to other languages, in particular to English,thereby allowing to compare properties of the different languages, and also to confirm anumber of conclusions firstly achieved for Portuguese. Despite not being able to directlycompare the achieved results with the related work, the performance here achieved con-cerning both punctuation and capitalization may be considered similar to state-of-the-artresults reported in the literature.

• An on-the-fly module for punctuation and capitalization recovery has been developed.This module is an important asset in the in-house automatic subtitling system, and it hasbeen included in the fully automatic subtitling module for broadcast news, deployed atthe national television broadcaster since March 2008, in the scope of the national TEC-NOVOZ1 project. The two modules also provide important information for an automaticmultimedia content dissimination system. Results of the offline processing of each BNshow are also shown daily on the web2.

1http://www.tecnovoz.com.pt/2https://tecnovoz.l2f.inesc-id.pt/demos/asr/legendagem/

6.4. FUTURE WORK 129

• Improved version of the existing normalization tool for Portuguese. The existing normal-ization tool was deeply revised to correctly address phenomena, such as: date and timeexpressions, ordinals, numbers, abreviations, money quantities, and a number of otherexpressions that were found in real text.

• Creation and integration of different tools for English corpora normalization. A num-ber of tools have been created and integrated into pipeline processing chains, speciallyadapted to deal with each one of the five broadcast news corpora subsets, as well as withthe English newspaper data.

The work performed in the scope of this thesis has been disseminated by means of a journalpublication, a book chapter3, and by a number of other publications on international confer-ences and workshops. The following publications focus the capitalization task as an isolatedtask: Batista et al. (2007b) compares different capitalization approaches for the capitalizationof Portuguese broadcast news; Batista et al. (2008e) performs a study of the impact of lan-guage dynamics on written corpora; such study was then extended to the broadcast news cor-pora and described in Batista et al. (2008d) and Batista et al. (2008c); the impact of dynamicmodel adaptation beyond speech recognition is reported in Batista et al. (2008a). The followingpublications focus on the punctuation recovery task as an isolated task: Batista et al. (2007a)describes the initial approach to a punctuation module together with the first performance re-sults achieved for sentence boundary detection; recent experiments, porting this work to otherlanguages have been reported together with capitalization results for the same corpora (Batistaet al., 2009b,a); the most recent papers Batista et al. (2010); Moniz et al. (2010, 2011) focus on theprosody usage. Papers involving both punctuation and capitalization include the following:(Batista et al., 2008b) presents the experiments concerning both punctuation and capitalizationof broadcast news data; Batista et al. (2009b,a) report the most recent experiments comparingthe performance in different languages. (Batista et al., 2011) describes the most recent work,reporting bilingual experiments comparing the Portuguese and English languages.

6.4 Future Work

The contributions of this thesis correspond to the first steps in enriching the speech recog-nition output, and much work remains to be done. The following items pinpoint a number ofpossible directions for the future:

• Adapt the current part-of-speech tagger to deal with speech transcripts. Despite the part-of-speech tagger not having been specially trained for dealing with speech transcripts,POS information is still an important feature according to results presented in Section5.2.3.1, where it was shown to be the third most important feature;

3Springer book, containing extended versions of the best selected papers from a workshop.


• Perform these tasks over lattices and/or confusion networks, thus enriching informationsuitable for tasks such as speech translation;

• Analyse the impact of the punctuation and capitalization tasks in machine translation,machine summarization, and question answering; and contribute to a better quality ofeach one of these tasks, supported by a number of studies found in the literature (Ma-tusov et al., 2006; Ostendorf et al., 2008).

• Use unlabeled data for improving the existing models, by means of a semi-supervisedtraining method, such as co-training (Blum and Mitchell, 1998);

• In terms of capitalization, an interesting future direction would be the fusion of the gen-erative and the discriminative approaches, since they perform better for written corporaand speech transcripts, respectively;

• Additional features, widely used in named entity extraction, can be also used for capital-ization restoration. Features such as word prefix and suffix can be easily extracted andmay contribute for the capitalization performance.

• In terms of punctuation, there are many interesting research directions, particularly inwhat concerns prosodic features (for instance, by using pseudo-syllable information di-rectly derived from the audio data). Extending this study on interrogatives to other do-mains, besides BN, will allow to better model different types of interrogatives, not wellrepresented in this corpus;

• Further experiments must be performed in order to assess to what extend our prosodicfeatures are language based or language independent features. Extending this study toother languages, besides Portuguese and English, will certainly provide challenging sce-narios in the future;

• Interrogatives are still in an early stage and can be further improved either by using largertraining data and by extending the analysis of pitch slopes with discourse functionalitiesbeyond sentence-form types;

• Consider alternate evaluation strategies for both tasks, keeping in mind the human per-formance. According to Kowal and O’Connell (2008), each transcriber can, deliberatelyor involuntary, delete, add, substitute, and/or relocate utterances or parts of utterancesin a transcript. However, these decisions are not always a matter of error, leading Kowaland O’Connell (2008) to consider them as changes rather than errors. Similarly, one canthink that the classification proposed by an automatic system may sometimes be an alter-nate way of producing the transcript. Therefore, a binary decision comprising only thereference and the automatic classification, may not reflect the real achievement;

• Port these modules to other varieties of Portuguese (spoken in South America and Africa)

Bibliography

Abad, A. and Neto, J. (2008). Incorporating acoustical modelling of phone transitions in anhybrid ANN/HMM speech recognizer. In Proc. of the 9th Annual Conference of the InternationalSpeech Communication Association (Interspeech 2008), Brisbane, Australia.

Agbago, A., Kuhn, R., and Foster, G. (2005). Truecasing for the portage system. In Proc. of the In-ternational Conference on Recent Advances in Natural Language Processing (RANLP’05), Borovets,Bulgaria.

Amaral, R., Meinedo, H., Caseiro, D., Trancoso, I., and Neto, J. P. (2007). A prototype systemfor selective dissemination of broadcast news in European Portuguese. EURASIP Journal onAdvances in Signal Processing, 2007(37507).

Amaral, R. and Trancoso, I. (2008). Topic segmentation and indexation in a media watch sys-tem. In Proc. of the 9th Annual Conference of the International Speech Communication Association(Interspeech 2008), Brisbane, Australia. ISCA.

Baldwin, T. and Joseph, M. P. (2009). Restoring punctuation and casing in english text. InProceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence, AI ’09,pages 547–556, Berlin, Heidelberg. Springer-Verlag.

Barras, C., Geoffrois, E., Wu, Z., and Liberman, M. (2001). Transcriber: Development and use ofa tool for assisting speech corpora production. Speech Communication, 33(1-2):5 – 22. SpeechAnnotation and Corpus Tools.

Batista, F., Amaral, R., Trancoso, I., and Mamede, N. (2008a). Impact of dynamic model adapta-tion beyond speech recognition. In Proc. of the IEEE Workshop on Spoken Language Technology(SLT 2008), Goa, India.

Batista, F., Caseiro, D., Mamede, N., and Trancoso, I. (2008b). Recovering capitalization andpunctuation marks for automatic speech recognition: Case study for Portuguese broadcastnews. Speech Communication, 50(10):847–862.

Batista, F., Caseiro, D., Mamede, N. J., and Trancoso, I. (2007a). Recovering punctuation marksfor automatic speech recognition. In Proc. of the 8th Annual Conference of the International SpeechCommunication Association (Interspeech 2007), pages 2153 – 2156, Antwerp, Belgium.

132 BIBLIOGRAPHY

Batista, F., Mamede, N., Caseiro, D., and Trancoso, I. (2007b). A lightweight on-the-fly capi-talization system for automatic speech recognition. In Proc. of the International Conference onRecent Advances in Natural Language Processing (RANLP’07).

Batista, F., Mamede, N., and Trancoso, I. (2008c). The impact of language dynamics on thecapitalization of broadcast news. In Proc. of the 9th Annual Conference of the International SpeechCommunication Association (Interspeech 2008).

Batista, F., Mamede, N., and Trancoso, I. (2008d). Language dynamics and capitalization usingmaximum entropy. In Proc. of 46th Annual Meeting of the Association of Computational Linguis-tics: Human Language Technologies (ACL-08): HLT, Short Papers, pages 1–4. ACL.

Batista, F., Mamede, N. J., and Trancoso, I. (2008e). Temporal issues and recognition errors onthe capitalization of speech transcriptions. Lecture Notes in Artificial Intelligence, 5246:45–52.

Batista, F., Moniz, H., Trancoso, I., and Mamede, N. J. (2011). Bilingual experiments on au-tomatic recovery of capitalization and punctuation of automatic speech transcripts. IEEETransaction on Audio, Speech and Language Processing, Special Issue on New Frontiers in RichTranscription. (to be accepted).

Batista, F., Moniz, H., Trancoso, I., Meinedo, H., Mata, A. I., and Mamede, N. J. (2010). Extend-ing the punctuation module for European Portuguese. In Proc. of the 11th Annual Conferenceof the International Speech Communication Association (Interspeech 2010), Mukari, Japan.

Batista, F., Trancoso, I., and Mamede, N. J. (2009a). Automatic recovery of punctuation marksand capitalization information for iberian languages. In I Joint SIG-IL/Microsoft Workshop onSpeech An Language Technologies for Iberian Languages, pages 99 –102, Porto Salvo, Portugal.

Batista, F., Trancoso, I., and Mamede, N. J. (2009b). Comparing automatic rich transcription forPortuguese, Spanish and English broadcast news. In Proc. of the Automatic Speech Recognitionand Understanding Workshop (ASRU 2009), Merano, Italy.

Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite statemarkov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.

Beeferman, D., Berger, A., and Lafferty, J. (1998). Cyberpunc: a lightweight punctuation anno-tation system for speech. In Proc. of the IEEE International Conference on Acoustics, Speech, andSignal Processing (ICASSP ’98), pages 689–692.

Berger, A. L., Pietra, S. A. D., and Pietra, V. J. D. (1996). A maximum entropy approach tonatural language processing. Computational Linguistics, 22(1):39–71.

Bikel, D. M., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: a high-performancelearning name-finder. In Proceedings of the fifth conference on Applied natural language processing,pages 194–201, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

BIBLIOGRAPHY 133

Blaauw, E. (1995). On the Perceptual Classification of Spontaneous and Read Speech. ResearchInstitute for Language and Speech.

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. InProceedings of the eleventh annual conference on Computational learning theory, COLT’ 98, pages92–100, New York, NY, USA. ACM.

Boakye, K., Favre, B., and Hakkani-Tür, D. (2009). Any Questions? Automatic Question De-tection in Meetings. In Proc. of the Automatic Speech Recognition and Understanding Workshop(ASRU 2009), Merano, Italy.

Brown, E. and Coden, A. (2002). Capitalization recovery for text. Information Retrieval Techniquesfor Speech Applications, pages 11–22.

Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., and Lai, J. C. (1992). An estimate ofan upper bound for the entropy of english. Computational Linguistics, 18(1):31–40.

Campione, E. and Véronis, J. (2002). A large-scale multilingual study of silent pause duration.In Speech prosody.

Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computa-tional Linguistics, 22:249–254.

Cattoni, R., Bertoldi, N., and Federico, M. (2007). Punctuating confusion networks for speechtranslation. In Proc. of the 8th Annual Conference of the International Speech Communication As-sociation (Interspeech 2007), pages 2453–2456.

Chelba, C. and Acero, A. (2004). Adaptation of maximum entropy capitalizer: Little datacan help a lot. In Proc. of the Conference on Empirical Methods in Natural Language Processing(EMNLP ’04).

Chen, C. J. (1999). Speech recognition with automatic punctuation. In Proc. of EU-ROSPEECH’99, pages 447–450.

Chen, S. F., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., and Zweig, G. (2006).Advances in speech transcription at IBM under the DARPA EARS program. IEEE Transactionson Audio, Speech & Language Processing, 14(5):1596–1608.

Christensen, H., Gotoh, Y., and Renals, S. (2001). Punctuation annotation using statisticalprosody models. In Proc. of the ISCA Workshop on Prosody in Speech Recognition and Under-standing, pages 35–40.

Clark, H. and Fox Tree, J. (2002). Using uh and um in spontaneous speaking. Cognition, (84).

Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In Proc.of the Joint SIGDAT Conference on EMNLP.

134 BIBLIOGRAPHY

Cuendet, S., Hakkani-Tur, D., Shriberg, E., Fung, J., and Favre, B. (2007). Cross-genre featurecomparisons for spoken sentence segmentation. International Journal of Semantic Computing,1(3):335–346.

Daumé III, H. (2004). Notes on CG and LM-BFGS optimization of logistic regression.http://hal3.name/megam/.

Duarte, I. (2000). Língua Portuguesa, Instrumentos de Análise. Universidade Aberta, Lisboa.

Favre, B., Hakkani-Tür, D., Petrov, S., and Klein, D. (2008). Efficient Sentence SegmentationUsing Syntactic Features. In Spoken Languge Technologies (SLT), Goa, India.

Favre, B., Hakkani-Tur, D., and Shriberg, E. (2009). Syntactically-informed Models for CommaPrediction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Process-ing (ICASSP ’09), Taipei, Taiwan.

Fisher, B. (1996). The tsylb2 program. National Institute of Standards and Technology Speech.

Frota, S. (2000). Prosody and Focus in European Portuguese. Phonological Phrasing and Intonation.Garland Publishing, New York.

Furui, S. (2005). 50 years of progress in speech and speaker recognition. In Proc. SPECOM 2005,pages 1 – 9, Patras, Greece.

Furui, S. and Kawahara, T. (2008). Springer Handbook of Speech Processing, chapter 32 - Transcrip-tion and Distillation of Spontaneous Speech. Springer Berlin Heidelberg.

Gotoh, Y. and Renals, S. (2000). Sentence boundary detection in broadcast speech transcripts.In Proc. of the ISCA Workshop: Automatic Speech Recognition: Challenges for the new MillenniumASR-2000, pages 228–235.

Gravano, A., Jansche, M., and Bacchiani, M. (2009). Restoring punctuation and capitalizationin transcribed speech. In Proc. of the IEEE International Conference on Acoustics, Speech, andSignal Processing (ICASSP ’09), Taipei, Taiwan.

Gravier, G., Bonastre, J.-F., Geoffrois, E., Galliano, S., Tait, K. M., and Choukri, K. (2004). TheESTER evaluation campaign for the rich transcription of french broadcast news. In Proc.LREC 2004.

Harper, M., Dorr, B., Hale, J., Roark, B., Shafran, I., Lease, M., Liu, Y., Snover, M., Yung, L.,Krasnyanskaya, A., and Stewart, R. (2005). Parsing and spoken structural event detection. In2005 Johns Hopkins Summer Workshop Final Report.

Heeman, P. and Allen, J. (1999). Speech repairs, intonational phrases and discourse markers:Modeling speakers’ utterances in spoken dialogue. Computational Linguistics, 25:527–571.

BIBLIOGRAPHY 135

Huang, J. and Zweig, G. (2002). Maximum entropy model for punctuation annotation fromspeech. In Proc. of the 7th International Conference on Spoken Language Processing (INTER-SPEECH 2002), pages 917 – 920.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. (1977). Perplexity – a measure of thedifficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62:S63. Sup-plement 1.

Jones, D., Gibson, E., Shen, W., Granoien, N., Herzog, M., Reynolds, D., and Weinstein, C.(2005a). Measuring human readability of machine generated text: three case studies inspeech recognition and machine translation. In Proc. of the IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP ’05), volume 5, pages v/1009–v/1012.

Jones, D., Shen, W., Shriberg, E., Stolcke, A., Kamm, T., and Reynolds, D. (2005b). Two exper-iments comparing reading with listening for human processing of conversational telephonespeech. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology(Interspeech 2005), Lisbon, Portugal.

Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: An Introduction to NaturalLanguage Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR.

Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natu-ral Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR,second edition.

Kahn, J. G., Ostendorf, M., and Chelba, C. (2004). Parsing conversational speech using en-hanced segmentation. In Proc. of HLT/NAACL.

Khare, A. (2006). Joint learning for named entity recognition and capitalization generation.Master’s thesis, University of Edinburgh.

Kim, J., Schwarm, S. E., and Ostendorf, M. (2004). Detecting structural metadata with decisiontrees and transformation-based learning. In Proc. HLT-NAACL, pages 137–144.

Kim, J. and Woodland, P. C. (2001). The use of prosody in a combined system for punctuationgeneration and speech recognition. In Proc. of Eurospeech, pages 2757–2760.

Kim, J.-H. and Woodland, P. C. (2003). A combined punctuation generation and speechrecognition system and its performance enhancement using prosody. Speech Communication,41(4):563 – 577.

Kim, J.-H. and Woodland, P. C. (2004). Automatic capitalisation generation for speech input.Computer Speech & Language, 18(1):67–90.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit2005.

136 BIBLIOGRAPHY

Kowal, S. and O’Connell, D. C. (2008). Communicating with One Another: Toward a Psychologyof Spontaneous Spoken Discourse. Cognition and Language: A Series in Psycholinguistics.Springer New York.

Kubala, F., Schwartz, R., Stone, R., and Weischedel, R. (1998). Named entity extraction fromspeech. In in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop,pages 287–292.

Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic mod-els for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals.Soviet Physics Doklady, 6:707–710. (English translation).

Lita, L. V., Ittycheriah, A., Roukos, S., and Kambhatla, N. (2003). tRuEcasIng. In Proc. of the 41st

annual meeting on ACL, pages 152–159, USA. ACL.

Liu, Y. and Shriberg, E. (2007). Comparing evaluation metrics for sentence boundary detection.In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’07), Honolulu, Hawaii.

Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., and Harper, M. (2006). Enrichingspeech recognition with automatic detection of sentence boundaries and disfluencies. IEEETransaction on Audio, Speech and Language Processing, 14(5):1526–1540.

Liu, Y., Shriberg, E., Stolcke, A., Hillard, D., Ostendorf, M., Peskin, B., and Harper, M. (2004).The ICSI-SRI-UW metadata extraction system. In Proc. of INTERSPEECH 2004 - ICSLP - 8thInternational Conference on Spoken Language Processing, pages 577–580, Jeju, Korea.

Liu, Y., Shriberg, E., Stolcke, A., Peskin, B., Ang, J., Hillard, D., Ostendorf, M., Tomalin, M.,Woodland, P., and Harper, M. (2005). Sructural metadata research in the EARS program. InProc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05), Philadelphia, USA.

Lu, W. and Ng, H. T. (2010). Better punctuation prediction with dynamic conditional randomfields. In Proceedings of EMNLP10 (The 2010 Conference on Empirical Methods in Natural Lan-guage Processing), MIT, Massachusetts.

Makhoul, J., Baron, A., Bulyko, I., Nguyen, L., Ramshaw, L., Stallard, D., Schwartz, R., andXiang, B. (2005). The effects of speech recognition and punctuation on information extrac-tion. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology(Interspeech 2005), pages 57–60, Lisbon, Portugal.

Makhoul, J., Kubala, F., Schwartz, R., and Weischedel, R. (1999). Performance measures forinformation extraction. In Proc. of the DARPA Broadcast News Workshop, Herndon, VA.

BIBLIOGRAPHY 137

Manning, C., Prabhakar, R., and Hinrich, S. (2008). Introduction to Information Retrieval. Cam-bridge University Press, 1 edition.

Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M. (1997). The DETCurve in Assessment of Detection Task Performance. In Proc. Eurospeech ’97, pages 1895–1898, Rhodes, Greece.

Martinez, R., da Silva Neto, J. P., and Caseiro, D. A. (2008). Statistical machine translation ofbroadcast news from Spanish to Portuguese. In PROPOR 2008 - 8th International Conferenceon Computational Processing of the Portuguese Language. Springer.

Martins, C., Teixeira, A., and Neto, J. (2007a). Vocabulary selection for a broadcast news tran-scription system using a morpho-syntatic approach. In Proc. of the 8th Annual Conference of theInternational Speech Communication Association (Interspeech 2007).

Martins, C., Teixeira, A., and Neto, J. P. (2007b). Dynamic language modeling for a daily broad-cast news transcription system. In Proc. of the Automatic Speech Recognition and UnderstandingWorkshop (ASRU 2007).

Mata, A. I. (1999). Para o Estudo da Entoação em Fala Espontânea e Preparada no Português Europeu.PhD thesis, University of Lisbon.

Mateus, M. H., Brito, A., Duarte, I., Faria, I. H., Frota, S., Matos, G., Oliveira, F., Vigário, M.,and Villalva, A. (2003). Gramática da Língua Portuguesa. Caminho, Lisbon, Portugal.

Matusov, E., Mauser, A., and Ney, H. (2006). Automatic sentence segmentation and punctua-tion prediction for spoken language translation. In International Workshop on Spoken LanguageTranslation, pages 158–165, Kyoto, Japan.

McCallum, A., Freitag, D., and Pereira, F. C. N. (2000). Maximum entropy markov modelsfor information extraction and segmentation. In ICML ’00: Proceedings of the Seventeenth In-ternational Conference on Machine Learning, pages 591–598, San Francisco, CA, USA. MorganKaufmann Publishers Inc.

Medeiros, J. C. (1995). Processamento morfológico e correcção ortográfica do português. Mas-ter’s thesis, IST/ UTL, Portugal.

Meinedo, H., Abad, A., Pellegrini, T., Neto, J., and Trancoso, I. (2010). The L2F broadcast newsspeech recognition system. In Proc. of the VI Jornadas en Tecnología del Habla and II IberianSLTech Workshop (FALA 2010).

Meinedo, H., Caseiro, D., Neto, J. P., and Trancoso, I. (2003). AUDIMUS.media: A broadcastnews speech recognition system for the European Portuguese language. In PROPOR’2003,volume 2721 of LNCS, pages 9–17. Springer.

138 BIBLIOGRAPHY

Meinedo, H. and Neto, J. P. (2003). Audio segmentation, classification and clustering in a broad-cast news task. In Proc. of the IEEE International Conference on Acoustics, Speech, and SignalProcessing (ICASSP ’03), Hong Kong, China.

Meinedo, H., Viveiros, M., and Neto, J. (2008). Evaluation of a live broadcast news subtitlingsystem for Portuguese. In Proc. of the 9th Annual Conference of the International Speech Commu-nication Association (Interspeech 2008), Brisbane, Australia.

Mikheev, A. (1999). A knowledge-free method for capitalized word disambiguation. In Proc. ofthe 37th annual meeting of the ACL, pages 159–166, Morristown, NJ, USA. ACL.

Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3):289–318.

Moniz, H. (2006). Contributo para a caracterização dos mecanismos de (dis)fluência no Por-tuguês Europeu. Master’s thesis, University of Lisbon.

Moniz, H., Batista, F., Meinedo, H., Abad, A., Trancoso, I., Mata, A. I., and Mamede, N. (2010).Prosodically-based automatic segmentation and punctuation. In Proc. of 5th International Con-ference on Speech Prosody, Chicago, Illinois.

Moniz, H., Batista, F., Trancoso, I., and Mata, A. I. (2011). Toward Autonomous, Adaptive, andContext-Aware Multimodal Interfaces: Theoretical and Practical Issues, volume 6456 of LectureNotes in Computer Science, chapter Analysis of interrogatives in different domains, pages 136–148. Springer Berlin / Heidelberg, Caserta, Italy, 1st edition edition.

Mota, C. and Grishman, R. (2008). Is this NE tagger getting old? In ELRA, editor, Proc. of theLREC’08.

Mota, C. and Grishman, R. (2009). Updating a name tagger using contemporary unlabeleddata. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 353–356, Suntec,Singapore. Association for Computational Linguistics.

Mrozinsk, J., Whittaker, E. W., Chatain, P., and Furui, S. (2006). Automatic sentence segmen-tation of speech for automatic summarization. In Proc. of the IEEE International Conference onAcoustics, Speech, and Signal Processing (ICASSP ’06).

Neto, J., Meinedo, H., Viveiros, M., Cassaca, R., Martins, C., and Caseiro, D. (2008). Broadcastnews subtitling system in Portuguese. In Proc. of the IEEE International Conference on Acoustics,Speech, and Signal Processing (ICASSP ’08), pages 1561–1564.

Neto, J. P., Meinedo, H., Amaral, R., and Trancoso, I. (2003). A system for selective dissemina-tion of multimedia information. In Proc. of the ISCA MSDR 2003.

Niu, C., Li, W., Ding, J., and Srihari, R. K. (2004). Orthographic case restoration using super-vised learning without manual annotation. INTERNATIONAL JOURNAL ON ARTIFICIALINTELLIGENCE TOOLS, 13, part 1:141–156.

BIBLIOGRAPHY 139

Ostendorf, M., Favre, B., Grishman, R., Hakkani-Tür, D., Harper, M., Hillard, D., Hirschberg,J., Ji, H., Kahn, J. G., Liu, Y., Maskey, S., Matusov, E., Ney, H., Rosenberg, A., Shriberg, E.,Wang, W., and Wooters, C. (2008). Speech segmentation and spoken document processing.IEEE Signal Processing Magazine, 25(3):59–69.

Ostendorf, M. and Hillard, D. (2004). Scoring structural MDE: Towards more meaningful errorrates. In Proc. of EARS RT-04 Workshop.

Ostendorf, M., Shriberg, E., and Stolcke, A. (2005). Human language technology: Opportunitiesand challenges. In Proc. of the IEEE International Conference on Acoustics, Speech, and SignalProcessing (ICASSP ’05), Philadelphia.

Palmer, D. D. and Hearst, M. A. (1994). Adaptive sentence boundary disambiguation. In Proc.of the fourth conference on Applied natural language processing, pages 78–83, San Francisco, CA,USA. Morgan Kaufmann Publishers Inc.

Palmer, D. D. and Hearst, M. A. (1997). Adaptive multilingual sentence boundary disambigua-tion. Computational Linguistics, 23(2):241–267.

Reynar, J. C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentenceboundaries. In Proc. of the fifth conference on Applied natural language processing, pages 16–19,San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Reynolds, D. and Torres-Carrasquillo, P. (2005). Approaches and applications of audio diariza-tion. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing(ICASSP ’05), volume 5, pages 953–956.

Ribeiro, R., Mamede, N. J., and Trancoso, I. (2004). Language Technology for Portuguese: shallowprocessing tools and resources, chapter Morpho-syntactic Tagging: a Case Study of LinguisticResources Reuse, pages 31–32. Edições Colibri, Lisbon.

Ribeiro, R. and Matos, D. (2007). Extractive summarization of broadcast news: Comparingstrategies for European Portuguese. In Text, Speech and Dialogue, 10th International Conference,TSD 2007, volume 4629 of Lecture Notes in Computer Science, ISBN 978-3-540-74627-0, pages115–122. Springer.

Ribeiro, R. and Matos, D. (2008). Mixed-source multi-document speech-to-text summarization.In MMIES-2: Multi-source, Multilingual Information Extraction and Summarization (COLING2008), The 22nd International Conference on Computational Linguisti, pages 33–40. Coling2008 Organizing Committee.

Ribeiro, R., Oliveira, L., and Trancoso, I. (2003). Using Morphossyntactic Information in TTSSystems: comparing strategies for European Portuguese. In In Computational Processing of thePortuguese Language: 6th International Workshop, PROPOR 2003, pages 26–27. Springer.

140 BIBLIOGRAPHY

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of theInternational Conference on New Methods in Language Processing, Manchester, United Kingdom.

Shieber, S. M. and Tao, X. (2003). Comma restoration using constituency information. InNAACL ’03: Proc. of the 2003 Conference of the North American Chapter of the Association for Com-putational Linguistics on Human Language Technology, pages 142–148, Morristown, NJ, USA.Association for Computational Linguistics.

Shriberg, E. (2005). Spontaneous speech: How people really talk, and why engineers shouldcare. In Proc. of Eurospeech - 9th European Conference on Speech Communication and Technology(Interspeech 2005), pages 1781 – 1784, Lisbon, Portugal.

Shriberg, E., Favre, B., Fung, J., Hakkani-Tur, D., and Cuendet, S. (2009). Prosodic similaritiesof dialog act boundaries across speaking styles. Linguistic Patterns in Spontaneous Speech -Language and Linguistics Monograph Series, 25:213–239.

Shriberg, E., Stolcke, A., Hakkani-Tür, D., and Tür, G. (2000). Prosody-based automatic seg-mentation of speech into sentences and topics. Speech Communications, 32(1-2):127–154.

Sjölander, K. and Beskow, J. (2000). Wavesurfer-an open source speech tool. In Sixth Interna-tional Conference on Spoken Language Processing, pages 464–467.

Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., and Granström, B. (1998). Web-based educational tools for speech technology. In Proc. of ICSLP98, 5th Intl Conference onSpoken Language Processing, pages 3217–3220, Sydney, Australia.

Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G., and Zweig, G. (2005). The ibm2004 conversational telephony system for rich transcription. In Proc. of the IEEE InternationalConference on Acoustics, Speech, and Signal Processing (ICASSP ’05), pages 205 – 208.

Stevenson, M. and Gaizauskas, R. (2000). Experiments on sentence boundary detection. InProc. of the sixth conference on Applied natural language processing, pages 84–89, San Francisco,CA, USA. Morgan Kaufmann Publishers Inc.

Stolcke, A. (2002). SRILM - An extensible language modeling toolkit. In Proc. of the fourthInternational Conference on Spoken Language Processing (ICSLP ’02), volume 2, pages 901–904,Denver, CO.

Stolcke, A. and Shriberg, E. (1996). Automatic linguistic segmentation of conversational speech.In Proc. of the fourth International Conference on Spoken Language Processing (ICSLP ’96), vol-ume 2, pages 1005–1008, Philadelphia, PA.

Stolcke, A., Shriberg, E., Bates, R., Ostendorf, M., Hakkani, D., Plauche, M., Tur, G., and Lu,Y. (1998). Automatic detection of sentence boundaries and disfluencies based on recognizedwords. In Proc. of the fourth International Conference on Spoken Language Processing (ICSLP ’98),volume 5, pages 2247–2250, Sydney.

BIBLIOGRAPHY 141

Strassel, S. (2004). Simple Metadata Annotation Specification V6.2. Linguistic Data Consortium.

Strassel, S., Miller, D., Walker, K., and Cieri, C. (2003). Shared resources for robust speech-to-text technology. In Eurospeech 2003.

Stüker, S., Fügen, C., Hsiao, R., Ikbal, S., Kraft, F., Paulik, M., Raab, M., Tam, Y.-C., and Wölfel,M. (2006). The ISL TC-STAR spring 2006 ASR evaluation systems. In Proc. of the TC-STARWorkshop on Speech-to-Speech Translation, Barcelona, Spain.

Trancoso, I., do Céu Viana, M., Duarte, I., and Matos, G. (1998). Corpus de diálogo CORAL. InPROPOR’98, Porto Alegre, Brasil.

Trancoso, I., Martins, R., Moniz, H., Mata, A. I., and Viana, M. C. (2008). The Lectra corpus -classroom lecture transcriptions in European Portuguese. In LREC 2008 - Language Resourcesand Evaluation Conference, Marrakesh, Morocco.

Trancoso, I., Nunes, R., Neves, L., do Céu Viana Ribeiro, M., Moniz, H., Caseiro, D., andda Silva, A. I. M. (2006). Recognition of classroom lectures in European Portuguese. In Proc.of the 9th International Conference on Spoken Language Processing (Interspeech 2006 – ICSLP).

Ulusoy, I. and Bishop, C. M. (2005). Generative versus discriminative methods for object recog-nition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR’05) - Volume 2, pages 258–265, Washington, DC, USA.IEEE Computer Society.

Vassière, J. (1983). Language-independent prosodic features. In Cutler, A. and Ladd, R., editors,Prosody: models and measurements, pages 55–66. Berlin: Springer.

Viana, M. C. (1987). Para a Síntese da Entoação do Português. PhD thesis, University of Lisbon.

Viana, M. C., Oliveira, L. C., and Mata, A. I. (2003). Prosodic phrasing: Machine and humanevaluation. International Journal of Speech Technology, 6(1):83–94.

Wang, D. and Narayanan, S. S. (2004). A multi-pass linear fold algorithm for sentence boundarydetection using prosodic cues. In Proc. of the IEEE International Conference on Acoustics, Speech,and Signal Processing (ICASSP ’04), volume 1, pages 525–528.

Wang, W., Knight, K., and Marcu, D. (2006). Capitalizing machine translation. In HLT-NAACL,pages 1–8. ACL.

Wichmann, S. (2008). The emerging field of language dynamics. Language and Linguistics Com-pass, 2(3):442–455.

Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accentrestoration in Spanish and French. In Proc. of the 2nd Annual Meeting of the Association forComputational Linguistics (ACL ’94), pages 88–95.

142 BIBLIOGRAPHY

Zechner, K. (2002). Automatic summarization of open-domain multiparty dialogues in diversegenres. Computational Linguistics, 28(4):447–485.

Zimmermann, M., Hakkani-Tur, D., Fung, J., Mirghafori, N., Gottlieb, L., Shriberg, E., andLiu, Y. (2006). The ICSI+ multi-lingual sentence segmentation system. In Proc. of the 9th

International Conference on Spoken Language Processing (Interspeech 2006 – ICSLP), pages 117–120, Pittsburgh.

Nomenclature

This chapter presents some of the terminology used in this dissertation. This includesthe reference to the information sources and other parts of the document where the subject istreated in more detail.

APP Audio Pre-Processing or Audio Segmen-tation.

ASR Automatic Speech Recognition

BFGS Broyden-Fletcher-Goldfarb-Shanno isa quasi-Newton method for solving non-linear optimization problems.

BN Broadcast News

capitalization consists of rewriting each wordof an input text with its proper case infor-mation given its context

CART Classification and Regression Tree

CRF Conditional Random Field

CTS Conversational Telephone Speech

DTD Data Type Definition

EARS Effective, Affordable, ReusableSpeech-to-Text

Edit distance see Levenshtein distance

EP European Portuguese

GMM Gaussian mixture model

HMM Hidden Markov ModelA hidden Markov model is a statisticalMarkov model in which the system be-ing modeled is assumed to be a Markovprocess with unobserved (hidden) states.

An HMM can be considered as the sim-plest dynamic Bayesian network.

L-BFGS (Limited memory BFGS). The L-BFGS algorithm is a member of thebroad family of quasi-Newton optimiza-tion methods. L-BFGS uses a lim-ited memory variation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS).

Language Dynamics Everyday, new wordsare introduced and the usage of othersdecays with time, despite the fact thatmost of the words and constructions of ahuman language are kept in use for manyyears or never change. Language dynam-ics correspond to these language varia-tions in time

LDC Linguistic Data Consortium

Levenshtein distance The Levenshtein dis-tance between two strings is defined asthe minimum number of edits needed totransform one string into the other, withthe allowable edit operations being inser-tion, deletion, or substitution of a singlecharacter. It is named after Vladimir Lev-enshtein, who considered this distance in1965. The term edit distance is often used

144 NOMENCLATURE

to refer specifically to Levenshtein dis-tance

LM Language Model

MDE Metadata Extraction

MEMM Maximum Entropy Markov Model

NER Named Entity Recognition

NIST National Institute of Standards andTechnology

NLP Natural Language Processing

RT Rich Transcription

SER Slot Error Rate

SNOR Speech Normalized OrthographicalRepresentation, is a standard speechrecognition system output format. Doesnot contain any punctuation marks, allnumbers are spelled as words, and all in-formation is represented in single case,which means that no capitalization infor-mation is provided.In other words, a SNOR-normalized tran-scription consists of text strings made upof ASCII characters and has the follow-ing contraints: (1) Whitespace separateswords for languages that use words; (2)The text is case-insensitive (usually in allupper case); (3) No punctuation is in-cluded except apostrophes for contrac-tions; (4) Previously hyphenated wordsare divided into their constituent partsseparated by whitespace.

STT Speech To Text

SU Sentence-like Unit

SVM Support Vector Machines

truecasing (see capitalization) consists ofrewriting each word of an input textwith its proper case information given itscontext

TTS Text To Speech

WER Word Error Rate

WSJ Wall Street Journal

XML Extensible Markup Language - is ageneral-purpose specification for creat-ing custom markup languages, derivedfrom SGML. XML is playing an impor-tant role in the exchange of a wide varietyof data on the Web and elsewhere.

Zipf’s law is an empirical law that states thatgiven some corpus of natural languageutterances, the frequency of any word isinversely proportional to its rank in thefrequency table. Thus the most frequentword will occur approximately twice asoften as the second most frequent word,three times as often as the third most fre-quent word, etc.A simple corollary is that while only afew words are used very often, many ormost are used rarely.

APortuguese Text Normalization

Normalization is an important stage in the process of using written corpora for buildingspeech models. The normalization tool used in this process was deeply revised and has beenapplied to all Portuguese written corpora. This anex shows a number of expressions, most ofthem extracted from real text, that have been considered when revising the original normaliza-tion tool. Most of them did not produce the desired output in the original version.

A.1 Date and time

Text Normalization Result

1h23m13s uma hora vinte e três minutos e treze segundos2h23m23,45s duas horas vinte e três minutos vinte e três vírgula quarenta e

cinco segundos23:59 | 23h59 vinte e três horas e cinquenta e nove minutos

23:59:00 vinte e três horas cinquenta e nove minutos e zero segundos10/07/201010.07.201010-07-2010

dez de Julho de dois mil e dez

23:45:23,2342 vinte e três horas quarenta e cinco minutos vinte e três vírguladois três quatro dois segundos

10s | 10seg dez segundos10h20m1010h20m10s

dez horas vinte minutos e dez segundos

10h20m10,12s10h20m10,12

dez horas vinte minutos dez vírgula doze segundos

10.21:31,32s dez horas vinte e um minutos trinta e um vírgula trinta e doissegundos

10m5,1s dez minutos cinco vírgula um segundos10m1s | 10m1seg dez minutos e um segundo

10h | 10H dez horas10h-12h dez horas - doze horas

146 APPENDIX A. PORTUGUESE TEXT NORMALIZATION

A.2 Ordinals


a 3ª vez a terceira vez23ª operação vigésima terceira operaçãoo 22º peso o vigésimo segundo peso1º lugar primeiro lugarfoi a 1ª foi a primeira14551ª quarta milésima quingentésima quinquagésima primeira123º centésimo vigésimo terceiro

A.3 Numbers


11900 onze mil e novecentos11,12 onze vírgula doze0,234 zero vírgula duzentos e trinta e quatro965042221 nove seis cinco zero quatro dois dois dois um213100203 dois um três um zero zero dois zero três17km/h dezassete quilómetros por hora17 km/h dezassete quilómetros por hora17km/h. dezassete quilómetros por hora .123.342.122 cento e vinte e três milhões trezentos e quarenta e dois mil cento e

vinte e doissão 100$00 são cem escudossão 1000$00 são mil escudossão 100,00$ ou200$53

são cem vírgula zero escudos ou duzentos vírgula cinquenta etrês escudos

100,0€ cem euros1000.000€ um milhão e zero eurosnos 123Hz depotência

nos cento e vinte e três hertz

(+351) 213133030 ( mais trezentos e cinquenta e um ) dois um três um três três zerotrês zero

123,3433 cento e vinte e três vírgula três mil quatrocentos e trinta e três

A.4. OPTIONAL EXPRESSIONS 147

A.4 Optional Expressions


angl.-sax. anglo-saxónicomal.-jav. malaio-javanêsdir. can. direito canónicodir. civ. direito civildir. com. direito comercialdir. ecles. direito eclesiásticodir. rom. direito romanohist. nat. história naturalhist. rel. história religiosav. s. f. f. volte se faz favorderiv. regr. derivação regressivam. q. mesmo queartº artigoart. artigonrº númeron. númerodrª doutoradra doutoradrº doutorsrª senhorasrº senhorarqt. arquitectoarqtª arquitectaengº engenheirov. ex.ª v. excelênciaexmsª excelentíssimasexª excelênciaexº exemplodtº direitoesqº esquerdoantº antóniostº santopp. páginaprof. professorop. cit. trabalho citado

148 APPENDIX A. PORTUGUESE TEXT NORMALIZATION

A.5 Money


1.100.000$00 um milhão e cem mil escudos123.123.123$00 cento e vinte e três milhões cento e vinte e três mil cento e vinte e

três escudos100$00 cem escudos100,00 cem100,23€ cem vírgula vinte e três euros£100,23 cem vírgula vinte e três libras100,23£ cem vírgula vinte e três libras1002,22£ mil e dois vírgula vinte e dois libras£1002,22 mil e duas vírgula vinte e duas libras

A.6 Abreviations


em Jan. fiz isto em Janeiro fiz istoem jan. fiz isto em Janeiro fiz istoo log. de um nº é o logaritmo de um número éo prof. o professoro dr. o doutoro Dr. Pedro o Doutor Pedroa Dra. Mª da silva a Doutora Maria da silvaem km/h foramfeitos 100

em quilómetros por hora foram feitos cem

o sp. club o Sporting cluba 100 km/h. nempensar

a cem quilómetros por hora . nem pensar

foram 10 l/h foram dez litros por horaa profª disse aoprof.

a professora disse ao professor

V.Ex.ª disse quesim?

Vossa Excelência disse que sim ?

A.7. OTHER 149

A.7 Other


13 kms / 13kms treze quilómetros1km um quilómetros1234.212.232km/h um dois três quatro dois um dois dois três dois quilómetros por

horaangl.-sax. anglo-saxónicoo Mañolo o Mañoloterceiro lugar(29.02 minutos),por uma

terceiro lugar ( vinte e nove ponto zero dois minutos ) , por uma

entram em vigorem 01 de janeirode 2010

entram em vigor em um de janeiro de dois mil e dez

passada, para os432.000,assinalando

passada , para os quatrocentos e trinta e dois mil , assinalando

caíram 22.000 nasemana passadafixando-se nos432.000.

caíram vinte e dois mil na semana passada fixando-se nosquatrocentos e trinta e dois mil .

10 a 15 Km/h. dez a quinze quilómetros por hora .pág.20 / pág. 20 página vinte121.123.123$00km cento e vinte e um milhões cento e vinte e três mil cento e vinte e

três escudos quilómetros1231222$00/km um milhão duzentos e trinta e um mil duzentos e vinte e dois

escudos por quilómetropág.150 / pag.150 página cento e cinquentaEngºs Engenheiroso V.de e a Cª L.da o Visconde e a Companhia Limitada12234212-2111212121

doze milhões duzentos e trinta e quatro mil duzentos e doze doisum um um dois um dois um dois um

12-16 doze dezasseis5-7 5.7 5/7 cinco a sete cinco ponto sete cinco barra sete

Documents

Recovering Capitalization and Punctuation Marks on Speech