Initial Experiments with Tamil LVCSR
Melvin Jose J
Department of Computer Technology, MIT Campus, Anna University
Chennai, India
melvin.jose [email protected]
Ngoc Thang Vu, Tanja Schultz
Cognitive Systems Lab, Institute for Anthropomatics
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany
{thang.vu,tanja.schultz}@kit.edu
Abstract—In this paper we present our recent efforts towards building a large vocabulary continuous speech recognizer for Tamil. We describe the text and speech corpus collected to realize this task. The data was complemented by a large amount of text data crawled from various Tamil news websites. The Tamil speech recognition system was bootstrapped using the Rapid Language Adaptation scheme which employs a multilingual phone inventory. After initialization, we built a word-based and a syllable-based system with a Syllable Error Rate (SyllER) of 29.30% and 34.16%, respectively. We propose a data-driven approach to obtain better dictionary units to overcome the challenge of the agglutinative nature of Tamil. The approach produced a significant improvement of 27.20% and 15.12% relative SyllER on the test set over the syllable- and word-based systems, respectively. Our current best system has a SyllER of 17.44% on read newspaper speech.
Keywords-Agglutinative language; multilingual bootstrap; morphological complexity; LVCSR system; dictionary units
I. INTRODUCTION
With the globalization of the world, one of the most
important needs in speech technology is the support of
multiple input and output languages, since applications are
required to address the needs of linguistically diverse users
and those of international markets. Hence, new algorithms
and techniques which support rapid adaptation of speech
processing systems to new languages are of great interest.
Our Rapid Language Adaptation Toolkit (RLAT) [1] aims to
considerably reduce both the time and the effort consumed
in building speech technologies for new languages. This is
achieved by providing innovative methods and tools that
enable users to develop speech processing models, collect
appropriate speech and text data to build these models and
finally evaluate the results leading to iterative improvements.
In this paper we describe the application of these tools to the
Tamil language, which has seen growing interest in terms of
speech and language technology recently.
Early work started with [2], in which large vocabulary
speech recognition systems for three languages, namely
Marathi, Telugu and Tamil, were built. In [3] and [4],
a Tamil spoken dialog system for farmers in Tamil Nadu and
a syllable-based Tamil recognizer were built. [5] investigated
language modeling for Tamil using various units and [6]
presented word and triphone-based Tamil recognizers with
a vocabulary of 1,700 and 341 words, respectively. It can be
seen that although there has been significant effort to address
specific parts of an LVCSR system for Tamil, a compre-
hensive LVCSR for Tamil has still not been developed. In
this paper, we present a complete LVCSR system for Tamil
built from scratch, using our new speech database and text
gathered with RLAT, which addresses the peculiarities of the
Tamil Language.
The paper is organized as follows: In Section II we present
the characteristics of the Tamil language and explain its
morphological richness. Section III describes our speech
and text corpora. In Section IV, the grapheme-to-phoneme
(G2P) mapping and the syllable level segmentation algo-
rithm which were used to build the syllable- and word-
based recognizers are explained. In Section V, our merging
algorithm that mitigates the challenge of the agglutinating
nature of Tamil is presented. In Section VI we conclude with
our results.
II. CHARACTERISTICS OF TAMIL LANGUAGE
Tamil is one of the four major languages of the Dravidian
Language family. It is predominantly spoken in the state of
Tamil Nadu of the Indian subcontinent and also in other parts
of the world such as Singapore, Sri Lanka and Malaysia by
close to 70 million people. Dravidian languages are among
the most complex languages in the world at the level of
morphology. This can be attributed to agglutination and
complex sandhi rules. Additionally, a significant part of
grammar that is handled by syntax in some languages,
e.g. English, is handled within morphology in Tamil. External
sandhi (that is, conflation between two or more full word
forms) and compounding further increase the complexity.
External sandhi is not the result of simple concatenation
or juxtaposition of complete words written without inter-
vening spaces as is the convention in some languages but
results from words that are made up of several morphemes
conjoined through complex morphophonological processes.
Below are two examples of English sentences that reduce
to a single word in Tamil after applying a series of sandhi rules:
2012 International Conference on Asian Language Processing
978-0-7695-4886-9/12 $26.00 © 2012 IEEE
DOI 10.1109/IALP.2012.46
We collected a 38 million word Tamil corpus with our
RLAT crawler [1]. We obtained an Out Of Vocabulary
(OOV) rate of 16% for the most frequent 100K types in
Tamil. Turkish and Korean have OOV rates of 13.5% and
34%, respectively, on a vocabulary of 64K [7]. Thus Tamil
can be classified as morphologically rich, comparable
to languages like Korean and Turkish.
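An OOV rate like the 16% figure above can be estimated by taking the most frequent types of the training text as the vocabulary and counting uncovered test tokens. A minimal sketch (the tokenization and corpus handling are assumptions):

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size=100_000):
    """Fraction of test tokens not covered by the top-N training types."""
    counts = Counter(train_tokens)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    oov = sum(1 for t in test_tokens if t not in vocab)
    return oov / len(test_tokens)

# Toy illustration with a 2-type vocabulary: "c" and "d" are OOV.
train = ["a", "a", "b", "b", "b", "c"]
test = ["a", "b", "c", "d"]
print(oov_rate(train, test, vocab_size=2))  # -> 0.5
```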
The Tamil alphabet consists of 12 graphemes which represent
vowels, 18 graphemes which represent consonants,
and a special character, the aytam, which is classified as
neither consonant nor vowel but can be used as a diacritic
to represent foreign sounds. Each of the consonants can be
combined with any of the 12 vowels to produce 216
combinations (combinants). Together with the 31 graphemes
in their independent form, this leads to a total of 247
graphemes. The combinant graphemes are formed by adding
a vowel marker to the consonant or altering the basic shape
of the consonant specific to the vowel.
Consequently the G2P mapping is a non-trivial task
for Tamil when compared to other Dravidian languages.
Also unlike many other Indian languages, Tamil has very
restricted consonant clusters and has neither aspirated nor
voiced stops. The stops are present in the spoken language
as allophones. In addition, the voicing of stop consonants
is governed by strict rules. They are voiceless if they occur
word-initially or in gemination.
III. SPEECH AND TEXT CORPORA
A. Text Corpus
The text corpus of Tamil words was built using RLAT [1]
with which text from popularly known Tamil news websites
was collected. The websites were crawled with a link depth
of 10, i.e. the crawler followed links recursively up to 10
levels deep, completing each level before descending to the
next. The list of websites that were crawled is given in
Table I below.
Table I
LIST OF CRAWLED TAMIL WEBSITES

    Website URL        Link depth
1   www.dinamalar.com  10
2   www.dinakaran.com  10
3   www.dinamani.com   10

The collected text was cleaned and normalized using the
following four steps: (1) remove all HTML tags and codes,
(2) remove special characters, non-Tamil characters and
empty lines, (3) convert numbers, dates, times and common
abbreviations to their equivalent text form, and (4) remove
leading and trailing white spaces and write each sentence
on a separate line.
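Steps (1), (2) and (4) of this pipeline can be sketched with regular expressions; step (3), number/date/abbreviation expansion, is language-specific and is stubbed out here. The sentence-splitting rule is a simplifying assumption:

```python
import re

TAMIL = r"\u0B80-\u0BFF"  # the Tamil Unicode block

def clean_line(line):
    """Steps (1), (2) and (4): strip tags, drop non-Tamil text, trim."""
    line = re.sub(r"<[^>]+>", " ", line)         # (1) strip HTML tags
    line = re.sub(rf"[^{TAMIL}\s.]", " ", line)  # (2) drop non-Tamil characters
    line = re.sub(r"\s+", " ", line)             # collapse runs of whitespace
    return line.strip()                          # (4) trim

def normalize(text):
    """Write each sentence on a separate line; drop empty lines."""
    sentences = re.split(r"[.!?]", text)
    return [s for s in (clean_line(s) for s in sentences) if s]
```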
B. Speech Corpus
The speech data for our Tamil recognizer was collected
in Tamil Nadu, India in two stages. In the first stage, 68
speakers were each asked to read several lines (ranging
between 30 and 300) from the Thirukkural, a Tamil
literary classic of 1,330 verses and one of
the most important works in Tamil. This amounted to almost
17 hours of speech data. From this data, speech from
5 speakers, which added up to 1 hour, was separated and
designated as the development set. In the second stage,
additional speakers were asked to read newspaper prompts
collected from the newspapers mentioned in Table I. A new
set of 29 speakers participated in this exercise. The data from
this stage amounts to 1 hour and constitutes the test set. All
speech data was recorded with a close-speaking microphone
and in quiet environmental conditions. A sampling rate of
16 kHz with a resolution of 16 bits was used for the data
which was stored in PCM encoding.
Table II
SPEECH CORPUS DESCRIPTION

Set          #Speakers       #Utterances  Duration
             Male  Female
Training     30    33        1012         15h 50min
Development  2     3         51           1h 4min
Test         14    15        370          1h 0min
Total        46    51        1433         17h 54min
IV. BASELINE TAMIL RECOGNITION SYSTEMS
A. Grapheme-to-Phoneme Model
Unfortunately, the G2P conversion task is not very
straightforward in the case of Tamil, for the following two
reasons: (1) confusion between the allophones p (b), t (d), th
(dh), k (g) and c (j) (s), which is very difficult to resolve
with linguistic rules, and (2) the transcription of borrowed
words which do not have a standard pronunciation. While
most Indian languages are phonetic in nature, i.e. they possess
a one-to-one correspondence between orthography and
pronunciation, Tamil script, although phonetic in nature, has
many exceptions. For building the G2P model for Tamil,
the Sequitur G2P Toolkit [8] was used to iteratively train
models on a manually phoneticized lexicon with vocabulary
sizes of 10k, 18k and 35k words. The N-gram size of all
the models is N=6 and the graphone size is L=1. For testing
the three G2P models, we used a test lexicon with 10k
words which was handcrafted by native speakers. Our best
G2P model achieves a letter accuracy of 99.56% on the test
lexicon. This model is used to generate the pronunciation
lexicons for our experiments.
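A letter accuracy such as the 99.56% reported above can be computed as one minus the ratio of Levenshtein edit operations to the total number of reference symbols. A small sketch (the phone representation as plain character strings is an assumption):

```python
def edit_distance(ref, hyp):
    """Standard Levenshtein distance, computed with a rolling row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def letter_accuracy(pairs):
    """pairs: (reference_pronunciation, predicted_pronunciation) tuples."""
    errors = sum(edit_distance(r, h) for r, h in pairs)
    total = sum(len(r) for r, _ in pairs)
    return 1.0 - errors / total
```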
B. Text Segmentation into Syllables
A syllable is usually defined as a vowel nucleus supported
by consonants on either side. It is a consonant-vowel cluster
that can be generalized as C*VC* where C stands for
consonant and V for vowel. We used the rules formulated
in [4] for Romanized or transliterated Tamil and enhanced
them for direct application to Tamil script. This was done in
three steps: (1) Preprocessing – every combinant (consonant
+ vowel) in the word is replaced by an equivalent form that
places both the root consonant and the vowel in adjacent
positions. This way we have the word written in a form
where each character is either a vowel or a consonant and
not a combinant. (2) Syllabification – we apply the linguistic
rules on the preprocessed word recursively i.e. after each
syllable is identified from a word, the remainder of the word
is again provided as input to the algorithm to be syllabified.
(3) Postprocessing – the combinant for every syllable is
reconstructed thereby resulting in correct Tamil script. An
example showing the results of each stage of the above
algorithm is shown in Figure 1. The word used below means
"from the dictionaries".
Figure 1. Illustration of the syllabification algorithm
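A simplified version of the C*VC* segmentation can be sketched on romanized text rather than Tamil script; the vowel set and the rule that splits intervocalic consonant clusters before their last consonant are simplifying assumptions, not the full rule set of [4]:

```python
VOWELS = set("aeiou")  # illustrative romanized vowel list

def syllabify(word):
    """Greedy C*VC* segmentation: each syllable takes its onset consonants
    and a vowel nucleus; of the consonants before the next vowel, all but
    the last become the coda, and the last starts the next syllable."""
    syllables, i = [], 0
    while i < len(word):
        current = ""
        # onset consonants plus the vowel nucleus
        while i < len(word) and word[i] not in VOWELS:
            current += word[i]; i += 1
        if i < len(word):
            current += word[i]; i += 1
        # coda: consonants up to (but excluding) the next syllable's onset
        j = i
        while j < len(word) and word[j] not in VOWELS:
            j += 1
        coda_end = j if j >= len(word) else max(i, j - 1)
        current += word[i:coda_end]
        i = coda_end
        syllables.append(current)
    return syllables

print(syllabify("tamil"))  # -> ['ta', 'mil']
print(syllabify("karka"))  # -> ['kar', 'ka']
```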
C. Language Modeling
We selected 3,832,821 lines out of the total text crawled
for our experiments. First, we built a word-level Language
Model (LM) using the SRI Language Modeling Toolkit [9].
Afterwards, we used the segmentation algorithm described in
the previous section to syllabify the same text, and a syllable-
level LM was then built. Table III gives the characteristics
of both the word- and the syllable-based language models
when evaluated on the test set.
Table III
CHARACTERISTICS OF LM-WORD AND LM-SYLL

Criteria      LM-Word  LM-Syll
#Tokens       903M     963M
#Types        602K     10K
OOV-rate (%)  4.99     0.00
Perplexity    5867     21
D. Word-based vs Syllable-based Recognizers
For building our baseline speech recognizer for Tamil
we used RLAT for bootstrapping the Tamil system with
the help of a multilingual phone inventory (MM7). MM7
was trained using seven arbitrarily selected GlobalPhone
languages (Chinese, Croatian, English, Japanese, German,
Spanish, and Turkish) with about 20 hours of audio data
each [1]. To bootstrap the system, Tamil phonemes from
the closest matches of the MM7 inventory were derived
by an IPA-based phone mapping. An initial state alignment
of the Tamil training data was produced by the selected
MM7 models as seed models. We utilized a GlobalPhone
style preprocessing which consisted of extracting features
by applying a Hamming window of 16ms length with a
frame shift of 10ms. Each feature vector has 143
dimensions resulting from stacking of 11 adjacent frames
of 13 Melscale Frequency Cepstral Coefficients (MFCC)
each. A Linear Discriminant Analysis (LDA) transformation
reduces the feature vector size to 42 dimensions. The model
uses a 3-state left-to-right HMM. The emission probabilities
are modeled by Gaussian Mixtures with diagonal covari-
ances. For our context-dependent AMs with different context
sizes, we terminated the decision tree splitting process at
2000 quinphones. After context clustering, a merge-and-
split training was applied, which selects the number of
Gaussians according to the amount of data (23 Gaussians on
average, with at least 50 frames to train each Gaussian). For
all models, we use one global semi-tied covariance (STC)
matrix after LDA. First, we built a word-based system. A
word-level pronunciation lexicon was generated for training.
The SyllER of the resulting system is 29.30% on the test
set with LM-Word.
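The frame-stacking step described above (11 frames of 13 MFCCs giving 143 dimensions, reduced to 42 by LDA) can be sketched as follows; the LDA matrix here is a random placeholder standing in for a trained transform:

```python
import numpy as np

def stack_frames(mfcc, context=5):
    """Stack each 13-dim MFCC frame with +/-5 neighbours (11 frames total),
    giving 143-dim vectors; edge frames are padded by repetition."""
    T, _ = mfcc.shape
    padded = np.vstack([mfcc[:1]] * context + [mfcc] + [mfcc[-1:]] * context)
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# 100 frames of 13 MFCCs -> 100 x 143 stacked features,
# then a placeholder LDA matrix projects them to 42 dimensions.
mfcc = np.random.randn(100, 13)
stacked = stack_frames(mfcc)        # shape (100, 143)
lda = np.random.randn(143, 42)      # stand-in for a trained LDA transform
features = stacked @ lda            # shape (100, 42)
```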
Due to the high perplexity and OOV rate of the word-
based system on the test set, we decided to build a syllable-
based recognizer. We used our syllable segmentation al-
gorithm from Section IV B to syllabify the training and
the test transcripts. A syllable-level pronunciation lexicon
was generated. A similar training framework was used for
training the syllable system. The SyllER of our syllable-
based system is 34.16% which is worse than the word-based
system. This may be due to the loss of context information
and the decreased range of the language model caused by
the short syllables. Additionally, the need for higher-order
n-grams in LM-Syll and co-articulation effects between
syllables may have contributed to the decrease in performance.
V. DICTIONARY UNIT MERGING ALGORITHM
A. Motivation
From the above section, it can be seen that the syllable
system performs worse than the word-based system. We
need a trade-off between the syllable and the word units
in order to retrieve some lost context and to tackle the
high OOV rate of the word-based system. This is the main
motivation behind applying a unit merging algorithm, which
(1) reduces the acoustic confusability due to short syllables,
(2) increases the range of the language model, and (3) increases
the context of the acoustic model. On the other hand, these
units are shorter than words, thereby keeping the OOV rate
at a manageable level.
B. Proposed Algorithm
In [10], the authors proposed a data-driven algorithm
to determine appropriate dictionary units for Korean to
overcome the high OOV rate that results from the rich
morphology of Korean. We extended that algorithm and
adapted it for application to Tamil. Our extended algorithm
is a data-driven, statistical approach that requires no a priori
linguistic knowledge.
The inputs to the algorithm are a pronunciation dictionary,
a large text file and a vowel list. We propose a slight variation
to the algorithm used in [10] to keep the OOV rate always
at zero. Initially, we segment the entire text into syllables
using the syllabification algorithm. We also include word
boundary information in the syllabified text, i.e. we prepend
a ’-’ to every syllable that does not occur at the start of
a word. Then we obtain all possible syllable pairs from
the syllabified text. Each possible pair is then looked up
in the dictionary and the pronunciation of the vowel-vowel
transition is retrieved.
The merging algorithm is governed by the following
iterative steps:
1. First, we compute a hash table whose keys are the
vowel-vowel transitions together with the syllable pairs
that produce them, and whose values are the frequencies
of those pairs.
2. For each vowel-vowel transition in the hash table, we
place the most frequent syllable pair into a merge-list.
3. We merge all the pairs in merge-list in the segmented
corpus.
We only merge pairs that occur within a word, and chose
not to merge pairs across word boundaries since Tamil has
a fixed word boundary. We use the merge-list obtained after
step 2 of the unit merging algorithm to merge both the
training and test transcripts, which keeps the OOV rate at
zero. Figure 2 shows the various stages of the unit merging
algorithm.
Figure 2. Various stages of the unit merging algorithm
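The iterative steps above can be sketched as follows. Romanized syllables and a toy vowel list are assumptions, words are represented as syllable lists rather than with the '-' boundary marker, and the vowel-vowel transition is approximated from the syllable strings instead of being retrieved from the pronunciation dictionary:

```python
from collections import Counter, defaultdict

VOWELS = set("aeiou")  # illustrative vowel list for romanized syllables

def vv_transition(left, right):
    """Last vowel of `left` plus first vowel of `right` (approximating the
    dictionary lookup of the vowel-vowel transition pronunciation)."""
    lv = [c for c in left if c in VOWELS]
    rv = [c for c in right if c in VOWELS]
    return (lv[-1], rv[0]) if lv and rv else None

def build_merge_list(syllabified_words):
    """Steps 1-2: for each vowel-vowel transition, keep the most frequent
    word-internal syllable pair."""
    by_transition = defaultdict(Counter)
    for word in syllabified_words:        # word = list of its syllables
        for a, b in zip(word, word[1:]):  # pairs never cross word boundaries
            t = vv_transition(a, b)
            if t:
                by_transition[t][(a, b)] += 1
    return {pairs.most_common(1)[0][0] for pairs in by_transition.values()}

def merge(word, merge_list):
    """Step 3: greedily merge adjacent syllables found in the merge list."""
    if not word:
        return []
    out = [word[0]]
    for syl in word[1:]:
        if (out[-1], syl) in merge_list:
            out[-1] = out[-1] + syl
        else:
            out.append(syl)
    return out
```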
C. Experimental Results
We applied our merging algorithm on the same 3,832,821
lines of text as above and built an LM (LM-Merge) with the
new units which reports a perplexity of 82.33 on the test
set. By using LM-Merge, we obtained a SyllER of 24.87%.
Additionally, we added another 4,932,821 lines of text
crawled using RLAT to improve the language model.
Since the vocabulary of LM-Merge is more than 200k and
the OOV-rate is zero, we did not apply the merging algorithm
on the additional text, but chose to use the merge-list of
the first set to merge the pairs of the additional text. The
final LM has a perplexity of 37.49 on the test set and the
resulting system gives a performance of 17.44% SyllER.
Figure 3 gives the performance of all the systems in terms
of SyllER.

Figure 3. Evolution of Tamil LVCSR System

VI. CONCLUSION

In this paper, we built a state-of-the-art LVCSR system
for Tamil, a morphologically rich Indian language, using
RLAT. Initially, we built word- and syllable-based baseline
systems with SyllERs of 29.30% and 34.16% on the test
set, respectively. Next, we addressed the agglutinative nature
of Tamil by applying a data-driven approach to merge the
syllables into new dictionary units and obtained a rela-
tive improvement of 27.20% and 15.12% SyllER over the
syllable- and word-based systems, respectively. Afterwards,
we crawled additional text data using RLAT and improved
the system from 24.87% to 17.44% SyllER on the test set,
which is our best result.
ACKNOWLEDGMENT
The authors would like to thank all friends and family
in India, for contributing their time, effort, knowledge and
speech to the Tamil corpus. We would also like to thank
Dr. Bharadwaja Kumar, under whose guidance the G2P
experiments were done.
REFERENCES
[1] T. Schultz, A. W. Black, S. Badaskar, M. Hornyak, and J. Kominek. SPICE: Web-based tools for rapid language adaptation in speech processing systems. In Proceedings of Interspeech, August 2007.
[2] R. Kumar, S. Kishore, A. Gopalakrishna, R. Chitturi, S. Joshi, S. Singh, and R. Sitaram. Development of Indian language speech databases for large vocabulary speech recognition systems. In SPECOM Proceedings, 2005.
[3] M. Plauche, N. Udhyakummar, C. Wooters, J. Pal, and D. Ramachadran. Speech recognition for illiterate access to information and technology. In Proceedings of the First International Conference on ICT and Development, 2006.
[4] A. Lakshmi and H. A. Murthy. A syllable based continuous speech recognizer for Tamil. In Interspeech, 2006.
[5] S. Saraswathi and T. V. Geetha. Design of language models at various phases of Tamil speech recognition system. International Journal of Engineering, Science and Technology, 2(5):244-257, 2010.
[6] R. Thangarajan, A. M. Natarajan, and M. Selvam. Word and triphone based approaches in continuous speech recognition for Tamil language. WSEAS Trans. Sig. Proc., 4(3):76-85, 2008.
[7] T. Schultz and K. Kirchhoff. Multilingual Speech Processing. Elsevier, Academic Press, 2006.
[8] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434-451, 2008.
[9] A. Stolcke. SRILM - an extensible language modeling toolkit. In Intl. Conf. on Spoken Language Processing, 901-904, September 2002.
[10] D. Kiecza, T. Schultz, and A. Waibel. Data-driven determination of appropriate dictionary units for Korean LVCSR. In Proceedings of ICASSP, 323-327, 1999.