Progress in Arabic Broadcast News Transcription at BBN
Mohamed Afify, Long Nguyen, John Makhoul
STT Workshop Philadelphia, PA, March 24, 2005

Overview
Problems in Arabic speech recognition
Arabic Treebank and Buckwalter morphological analyzer
Building phonetic systems in Arabic
Comparison of phonetic and grapheme models
Experimental results
Summary and future work

Problems in Arabic Speech Recognition
Lack of short vowels from existing corpora
  – Creates ambiguity for acoustic and language models
  – Most systems rely on grapheme-based acoustic models
      • No explicit models for short vowels
      • Therefore, no detailed phonetic acoustic models
  – Language models also ignore short vowels
Affixes create a large number of “words”
  – e.g., “and he will write it” is one word in Arabic
  – OOV rate is around 5% for 64K lexicon compared to around 0.5% for English
  – Morphological richness also adds to the large number of words
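
The OOV rates quoted above are commonly computed as the fraction of running words in a test text that are missing from the recognition lexicon. A minimal Python sketch of that computation follows; `oov_rate` and the toy word lists are hypothetical illustrations (Buckwalter-style transliteration, not BBN data).

```python
def oov_rate(tokens, lexicon):
    """Return the fraction of running tokens not covered by the lexicon."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    vocab = set(lexicon)
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


# Toy data (made up). "wsyktbh" stands for "and he will write it",
# which Arabic writes as a single word, so it is easily missing from a lexicon.
lexicon = ["ktb", "wktb", "qAl"]
tokens = ["ktb", "wsyktbh", "qAl", "ktb"]

print(f"OOV rate: {oov_rate(tokens, lexicon):.1%}")
```

Because every prefix and suffix combination counts as a separate lexicon entry, a fixed-size lexicon covers fewer running words in Arabic than in English, which is the effect the 5% versus 0.5% comparison describes.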

Possible Solutions
Short Vowels
  – Obtain vowelization of words in dictionary from Arabic Treebank and morphological analysis
  – Bootstrap acoustic-phonetic models for all phonemes, including short vowels
  – Expand vowelization process to language model
Affixes and morphological richness
  – Reduce OOV rate by increasing lexicon size
  – Use morphological analysis to decompose words into components
Current focus
  – Bootstrap acoustic models for short vowels
  – Build phonetic system
  – Available resources
      • No vowelized speech corpus
      • Arabic Treebank
      • Buckwalter morphological analyzer

LDC Arabic Treebank
Text only; no speech
Consists of three parts
  – The words in the articles in Parts 1 and 2 are vowelized in context
  – The unique words in Part 3 have multiple pronunciations based on the Buckwalter morphological analyzer

Corpus            Unique Words   Unique Pronunciations
ATB1+ATB2         25,862         39,570
ATB1+ATB2+ATB3    62,279         378,883

Buckwalter Morphological Analyzer
Available from LDC
Uses a lexicon and a set of rules for affixes to
  – Assign parts of speech to a word
  – Produce different vowelizations for each word
Version 2.0 was recently released
  – Several additional new features
  – Produces all possible ending vowelizations for input word
Can only analyze words whose stems are in its lexicon
  – Lexicon has about 40K stems
  – Does not include many foreign words
  – Does not deal with mis-spelled words

Building an Arabic Phonetic System
Use Arabic Treebank and Buckwalter morphological analyzer to bootstrap short vowels for acoustic training data and recognition lexicon
Method 1
  – Search word in Treebank dictionary
  – If not found, pass to morphological analyzer
  – If both fail, discard word or manually vowelize
Method 2
  – Pass word to morphological analyzer
  – If that fails, look up in Treebank dictionary
  – If both fail, discard word or manually vowelize
As a result, some acoustic training data and words in recognition lexicon were discarded
We found Method 2 to give more consistent vowelizations than Method 1
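
The two methods differ only in the order in which the resources are consulted, so both can be sketched as one fallback cascade. The Python below is a sketch under stated assumptions: `vowelize`, `treebank_lookup`, `analyzer_lookup`, and the toy dictionaries are hypothetical stand-ins, not the actual Treebank dictionary or the Buckwalter analyzer interface.

```python
from typing import Callable, Dict, List, Optional

# A source maps an unvowelized word to candidate vowelizations ([] if unknown).
Source = Callable[[str], List[str]]


def vowelize(word: str, sources: List[Source]) -> Optional[List[str]]:
    """Try each source in order; return the first non-empty result, else None."""
    for source in sources:
        candidates = source(word)
        if candidates:
            return candidates
    return None  # both failed: caller discards the word or vowelizes it manually


# Hypothetical toy data standing in for the two resources.
TREEBANK: Dict[str, List[str]] = {"ktb": ["kataba"]}
ANALYZER: Dict[str, List[str]] = {
    "ktb": ["kataba", "kutiba", "kutub"],
    "drs": ["darasa"],
}


def treebank_lookup(word: str) -> List[str]:
    return TREEBANK.get(word, [])


def analyzer_lookup(word: str) -> List[str]:
    return ANALYZER.get(word, [])


method1 = [treebank_lookup, analyzer_lookup]  # Treebank dictionary first
method2 = [analyzer_lookup, treebank_lookup]  # morphological analyzer first

for name, method in (("Method 1", method1), ("Method 2", method2)):
    print(name, vowelize("ktb", method), vowelize("xyz", method))
```

In this toy setup the two orderings return different candidate sets for the same word, which illustrates how the choice of first resource can affect the consistency of the resulting vowelizations.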

Arabic Phonetic System (cont’d)
Starting with 100 hrs of possible acoustic training data and a 64K recognition lexicon, we were able to keep:
  – 80 hrs (63K utterances) of data with short vowels
  – 62K recognition lexicon with short vowels
A 35-phoneme set (28 consonants + 6 vowels + “taa marbuuTa”)
Phonetic transcription rules are relatively straightforward starting from vowelized transcriptions
Built a conventional phonetic system and compared to grapheme system
  – No vowelization for language model

Initial Results

Scoring            Grapheme System   Phonetic System (vowelized)
Baseline           18.7              22.3
Normalization I    18.2              18.8
Normalization II   17.9              17.0

Dev03, unadapted results, Method 1 vowelization
Normalization I: Normalize “hamza” at beginning of the word
Normalization II: Normalize “hamza” at beginning of the word, after popular prefixes, and also frequent “Y” and “y” confusions at end of word
Text normalization is much more important for phonetic system
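
The slide does not spell out the normalization rules, so the following Python is only a sketch of what the two levels might look like on Buckwalter-transliterated text. Everything specific in it is an assumption: mapping the hamza-carrying alef forms (`>`, `<`, `|`) to bare alef `A`, the `PREFIXES` list, and folding word-final `Y` into `y`. The function names `normalize_1` and `normalize_2` are hypothetical.

```python
# Assumed hamza-carrying alef forms in Buckwalter transliteration:
# ">" (hamza above), "<" (hamza below), "|" (alef madda).
HAMZA_FORMS = {">", "<", "|"}

# Assumed list of "popular prefixes", longest first so "wAl" wins over "w".
PREFIXES = ("wAl", "bAl", "Al", "w", "f", "b", "l")


def normalize_1(word: str) -> str:
    """Normalization I (assumed form): word-initial hamza form -> bare alef."""
    if word and word[0] in HAMZA_FORMS:
        return "A" + word[1:]
    return word


def normalize_2(word: str) -> str:
    """Normalization II (assumed form): also after a prefix, plus final Y -> y."""
    word = normalize_1(word)
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest and rest[0] in HAMZA_FORMS:
            word = prefix + "A" + rest[1:]
            break
    if word.endswith("Y"):
        word = word[:-1] + "y"  # direction of the Y/y merge is an assumption
    return word


for w in (">Hmd", "w>Hmd", "ElY"):
    print(w, normalize_1(w), normalize_2(w))
```

Applying the same function to the training transcripts, the language model text, and the scoring references keeps spelling variants of one word from being counted as errors.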

Updated Development Results
Use Normalization II on acoustic and language training data, and for scoring
Use Method 2 to bootstrap short vowels
Expanded phonetic transcription rules to include assimilation of word-initial hamza and definite article
Dev03 test set, unadapted decoding
About 13% improvement for phonetic system

System      Morphological Analyzer   WER
Grapheme    NA                       18.1
Phonetic    Version 1.0              16.9
Phonetic    Version 2.0              15.8
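
The “about 13%” figure is consistent with reading the improvement as the relative WER reduction of the Version 2.0 phonetic system over the grapheme system. That reading is an inference from the table rather than something the slide states:

```latex
\[
\frac{\mathrm{WER}_{\text{grapheme}} - \mathrm{WER}_{\text{phonetic}}}
     {\mathrm{WER}_{\text{grapheme}}}
= \frac{18.1 - 15.8}{18.1}
= \frac{2.3}{18.1}
\approx 12.7\%
\]
```

By the same formula, the phonetic system built with Version 1.0 of the analyzer gives (18.1 − 16.9) / 18.1 ≈ 6.6%.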

Experimental Results
About 80 hrs of net acoustic training data
  – ML models for un-adapted decoding
  – ML SAT models for adapted decoding
About 300M words of language training data
  – 3-gram language models
  – 60K recognition lexicon
Adapted decoding on different test sets

Test Set   Grapheme System   Phonetic System
Dev03      15.4              14.2
Dev04      19.0              16.9
Eval03     19.2              16.6
Eval04     26.2              24.3
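
To compare the two systems across test sets, the table can be turned into absolute and relative reductions. The short Python sketch below assumes the figures are word error rates in percent (the table itself does not label them); `relative_reduction` is a hypothetical helper name.

```python
# (grapheme, phonetic) figures copied from the adapted-decoding table.
RESULTS = {
    "Dev03": (15.4, 14.2),
    "Dev04": (19.0, 16.9),
    "Eval03": (19.2, 16.6),
    "Eval04": (26.2, 24.3),
}


def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


for test_set, (grapheme, phonetic) in RESULTS.items():
    gain = relative_reduction(grapheme, phonetic)
    print(f"{test_set}: {grapheme - phonetic:.1f} absolute, {gain:.1f}% relative")
```

Under that assumption the relative reductions range from about 7.3% (Eval04) to about 13.5% (Eval03).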

Next Immediate Steps
Use all 100 hrs for acoustic training
  – Phonetic models can automatically vowelize discarded sentences
  – Possibly manually vowelize missing words
Use 64K recognition lexicon
  – Manually vowelize missing words
Gain is about 1% absolute on Dev03 for grapheme system
Switch to MMI models for un-adapted and adapted decoding

Summary and Future Work
Quickly bootstrap phonetic system for Arabic
  – Text normalization and Buckwalter morphological analyzer Version 2.0 are key to success
  – From 8% to 13.5% improvement over grapheme system for different test sets
  – Further improvement can be obtained by straightforward upgrades
Future work
  – Using vowelization in language model
  – Increase lexicon size to reduce OOV rate
  – Statistical vowelization for missed words, mainly foreign names