Progress in Arabic Broadcast News Transcription at BBN


Mohamed Afify, Long Nguyen, John Makhoul

STT Workshop Philadelphia, PA, March 24, 2005


Overview

Problems in Arabic speech recognition
Arabic Treebank and Buckwalter morphological analyzer
Building phonetic systems in Arabic
Comparison of phonetic and grapheme models
Experimental results
Summary and future work


Problems in Arabic Speech Recognition

Lack of short vowels in existing corpora
  – Creates ambiguity for acoustic and language models
  – Most systems rely on grapheme-based acoustic models
      • No explicit models for short vowels
      • Therefore, no detailed phonetic acoustic models
  – Language models also ignore short vowels
Affixes create a large number of “words”
  – e.g., “and he will write it” is one word in Arabic
  – OOV rate is around 5% for a 64K lexicon, compared to around 0.5% for English
  – Morphological richness also adds to the large number of words
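The OOV figures above are the share of running words in a text that are missing from the recognition lexicon. A minimal sketch of that calculation, assuming whitespace-tokenized text; the function name and the toy data (written in Buckwalter-style transliteration) are illustrative and not taken from the talk:

```python
def oov_rate(tokens, lexicon):
    """Return the fraction of running tokens that are not in the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in lexicon)
    return missing / len(tokens)


# Hypothetical toy data: "wsyktbh" stands in for a heavily affixed form
# ("and he will write it") that a small lexicon is unlikely to contain.
lexicon = {"ktb", "wld", "byt"}
tokens = "ktb wld wsyktbh byt".split()

print(f"OOV rate: {oov_rate(tokens, lexicon):.1%}")
```

Because each affixed form counts as a separate word, a fixed-size lexicon covers less running text in Arabic than in English, which is what the 5% versus 0.5% comparison reflects.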


Possible Solutions

Short vowels
  – Obtain vowelization of words in dictionary from Arabic Treebank and morphological analysis
  – Bootstrap acoustic-phonetic models for all phonemes, including short vowels
  – Expand vowelization process to language model
Affixes and morphological richness
  – Reduce OOV rate by increasing lexicon size
  – Use morphological analysis to decompose words into components
Current focus
  – Bootstrap acoustic models for short vowels
  – Build phonetic system
  – Available resources
      • No vowelized speech corpus
      • Arabic Treebank
      • Buckwalter morphological analyzer


LDC Arabic Treebank

Text only; no speech
Consists of three parts
  – The words in the articles in Parts 1 and 2 are vowelized in context
  – The unique words in Part 3 have multiple pronunciations based on the Buckwalter morphological analyzer

Corpus            Unique Words   Unique Pronunciations
ATB1+ATB2         25,862         39,570
ATB1+ATB2+ATB3    62,279         378,883
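The counts in the table imply very different ambiguity levels for the two corpus configurations. A small sketch of the arithmetic, using only the numbers from the table; the function name is illustrative:

```python
# Counts copied from the table: (unique words, unique pronunciations).
corpora = {
    "ATB1+ATB2": (25_862, 39_570),
    "ATB1+ATB2+ATB3": (62_279, 378_883),
}


def pronunciations_per_word(unique_words, unique_pronunciations):
    """Average number of distinct pronunciations per unique word."""
    return unique_pronunciations / unique_words


for name, (words, prons) in corpora.items():
    ratio = pronunciations_per_word(words, prons)
    print(f"{name}: {ratio:.2f} pronunciations per word")
```

The jump once Part 3 is included is consistent with its words receiving multiple analyzer-generated pronunciations rather than a single in-context vowelization.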


Buckwalter Morphological Analyzer

Available from LDC
Uses a lexicon and a set of rules for affixes to
  – Assign parts of speech to a word
  – Produce different vowelizations for each word
Version 2.0 was recently released
  – Several additional new features
  – Produces all possible ending vowelizations for input word
Can only analyze words whose stems are in its lexicon
  – Lexicon has about 40K stems
  – Does not include many foreign words
  – Does not deal with misspelled words


Building an Arabic Phonetic System

Use Arabic Treebank and Buckwalter morphological analyzer to bootstrap short vowels for acoustic training data and recognition lexicon
Method 1
  – Search word in Treebank dictionary
  – If not found, pass to morphological analyzer
  – If both fail, discard word or manually vowelize
Method 2
  – Pass word to morphological analyzer
  – If that fails, look up in Treebank dictionary
  – If both fail, discard word or manually vowelize
As a result, some acoustic training data and words in recognition lexicon were discarded
We found Method 2 to give more consistent vowelizations than Method 1
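The two methods differ only in the order in which the sources are consulted. A minimal sketch of that fallback cascade, assuming each source can be treated as a function from an unvowelized word to a list of candidate vowelizations; the lookup tables below are hypothetical stand-ins, not the real Treebank dictionary or the Buckwalter analyzer interface:

```python
from typing import Callable, Dict, List, Optional

Vowelizer = Callable[[str], List[str]]


def make_dictionary_lookup(dictionary: Dict[str, List[str]]) -> Vowelizer:
    """Wrap a word -> vowelizations table as a lookup function."""
    return lambda word: dictionary.get(word, [])


def vowelize(word: str, sources: List[Vowelizer]) -> Optional[List[str]]:
    """Try each source in order; return the first non-empty result, else None."""
    for source in sources:
        candidates = source(word)
        if candidates:
            return candidates
    return None  # caller discards the word or vowelizes it manually


# Hypothetical stand-ins for the Treebank dictionary and the analyzer.
treebank = make_dictionary_lookup({"ktb": ["kataba"]})
analyzer = make_dictionary_lookup(
    {"ktb": ["kataba", "kutiba", "kutub"], "byt": ["bayt"]}
)

method_1 = [treebank, analyzer]  # Treebank first, analyzer as fallback
method_2 = [analyzer, treebank]  # analyzer first, Treebank as fallback

print(vowelize("ktb", method_1))
print(vowelize("ktb", method_2))
print(vowelize("xyz", method_2))
```

In this toy setup the word found in both sources gets a different candidate list depending on the order, which is the kind of difference that would show up as more or less consistent vowelizations across a lexicon.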


Arabic Phonetic System (cont’d)

Starting with 100 hrs of possible acoustic training data and a 64K recognition lexicon, we were able to keep:
  – 80 hrs (63K utterances) of data with short vowels
  – 62K recognition lexicon with short vowels
A 35-phoneme set (28 consonants + 6 vowels + “taa marbuuTa”)
Phonetic transcription rules are relatively straightforward starting from vowelized transcriptions
Built a conventional phonetic system and compared to grapheme system
  – No vowelization for language model
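To illustrate why transcription is straightforward once the vowels are written out, here is a rough sketch that maps a vowelized word in Buckwalter-style transliteration to a phone sequence. The rules (drop sukun "o", double the consonant before shadda "~", merge a short vowel with its matching long-vowel letter) and the phone names are assumptions chosen for illustration; they are not the talk's actual rule set or its 35-phoneme inventory:

```python
from typing import List

SHORT_VOWELS = {"a", "i", "u"}
# Short vowel followed by its matching letter is read as one long vowel.
LONG_VOWEL_OF = {("a", "A"): "aa", ("i", "y"): "ii", ("u", "w"): "uu"}


def to_phones(vowelized: str) -> List[str]:
    """Map a vowelized transliteration to phones (illustrative rules only)."""
    phones: List[str] = []
    i = 0
    while i < len(vowelized):
        ch = vowelized[i]
        nxt = vowelized[i + 1] if i + 1 < len(vowelized) else ""
        after = vowelized[i + 2] if i + 2 < len(vowelized) else ""
        if ch == "o":
            i += 1  # sukun marks "no vowel": nothing to emit
        elif ch == "~":
            if phones:
                phones.append(phones[-1])  # shadda: geminate previous consonant
            i += 1
        elif (ch, nxt) in LONG_VOWEL_OF and after not in SHORT_VOWELS | {"~"}:
            phones.append(LONG_VOWEL_OF[(ch, nxt)])  # e.g. "aA" -> "aa"
            i += 2
        else:
            phones.append(ch)  # consonant or short vowel: one phone per symbol
            i += 1
    return phones


print(to_phones("kitaAb"))
print(to_phones("kat~aba"))
```

A grapheme system would instead model the unvowelized letters directly, leaving the short vowels implicit in the acoustic models.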


Initial Results

Scoring            Grapheme System   Phonetic System (vowelized)
Baseline           18.7              22.3
Normalization I    18.2              18.8
Normalization II   17.9              17.0

Dev03, unadapted results, Method 1 vowelization
Normalization I: Normalize “hamza” at beginning of the word
Normalization II: Normalize “hamza” at beginning of the word, after popular prefixes, and also frequent “Y” and “y” confusions at end of word
Text normalization is much more important for phonetic system
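The two normalization levels can be sketched as string rewrites over Buckwalter-transliterated words. Several details here are assumptions: that hamza normalization means mapping the hamza-carrying alif forms (">", "<", "|") to bare alif "A", that word-final "Y" is mapped to "y", and the prefix list itself, since the talk does not enumerate its "popular prefixes":

```python
# Buckwalter symbols for alif with hamza above, hamza below, and madda.
HAMZA_ALIF = "><|"
# Hypothetical prefix list, longest first so "wAl" is tried before "w".
PREFIXES = ("wAl", "bAl", "Al", "w", "f", "b", "l")


def normalization_1(word: str) -> str:
    """Normalize a hamza-carrying alif at the beginning of the word."""
    if word and word[0] in HAMZA_ALIF:
        return "A" + word[1:]
    return word


def normalization_2(word: str) -> str:
    """Normalization I, plus hamza after a prefix and word-final Y -> y."""
    word = normalization_1(word)
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest and rest[0] in HAMZA_ALIF:
            word = prefix + "A" + rest[1:]
            break
    if word.endswith("Y"):
        word = word[:-1] + "y"  # assumed direction for the Y/y confusion
    return word


for w in (">Hmd", "w>n", "Al>wl", "ElY"):
    print(w, normalization_1(w), normalization_2(w))
```

Applying the same function to acoustic transcripts, language model text, and scoring references keeps the three consistent, which matters more when the lexicon carries explicit pronunciations.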


Updated Development Results

Use Normalization II on acoustic and language training data, and for scoring
Use Method 2 to bootstrap short vowels
Expanded phonetic transcription rules to include assimilation of word-initial hamza and definite article
Dev03 test set, unadapted decoding
About 13% improvement for phonetic system

System     Morphological Analyzer   WER
Grapheme   NA                       18.1
Phonetic   Version 1.0              16.9
Phonetic   Version 2.0              15.8


Experimental Results

About 80 hrs of net acoustic training data
  – ML models for un-adapted decoding
  – ML SAT models for adapted decoding
About 300M words of language training data
  – 3-gram language models
  – 60K recognition lexicon
Adapted decoding on different test sets

Test Set   Grapheme System   Phonetic System
Dev03      15.4              14.2
Dev04      19.0              16.9
Eval03     19.2              16.6
Eval04     26.2              24.3
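The gains in the table can be expressed as absolute and relative WER reductions. A short sketch of that arithmetic, using only the numbers in the table and assuming that the improvement percentages quoted in the talk are relative reductions with the grapheme system as baseline:

```python
# Adapted WER per test set, copied from the table: (grapheme, phonetic).
results = {
    "Dev03": (15.4, 14.2),
    "Dev04": (19.0, 16.9),
    "Eval03": (19.2, 16.6),
    "Eval04": (26.2, 24.3),
}


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


for test_set, (grapheme, phonetic) in results.items():
    absolute = grapheme - phonetic
    relative = relative_reduction(grapheme, phonetic)
    print(f"{test_set}: {absolute:.1f} absolute, {relative:.1f}% relative")
```

The same function applied to the unadapted Dev03 figures (18.1 versus 15.8) gives a relative reduction of just under 13%.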


Next Immediate Steps

Use all 100 hrs for acoustic training
  – Phonetic models can automatically vowelize discarded sentences
  – Possibly manually vowelize missing words
Use 64K recognition lexicon
  – Manually vowelize missing words
Gain is about 1% absolute on Dev03 for grapheme system
Switch to MMI models for un-adapted and adapted decoding


Summary and Future Work

Quickly bootstrap phonetic system for Arabic
  – Text normalization and Buckwalter morphological analyzer Version 2.0 are key to success
  – 8% to 13.5% improvement over grapheme system for different test sets
  – Further improvement can be obtained by straightforward upgrades
Future work
  – Using vowelization in language model
  – Increase lexicon size to reduce OOV rate
  – Statistical vowelization for missed words, mainly foreign names
