
Initial Experiments with Tamil LVCSR

Melvin Jose J
Department of Computer Technology, MIT Campus, Anna University
Chennai, India
melvin.jose [email protected]

Ngoc Thang Vu, Tanja Schultz
Cognitive Systems Lab, Institute for Anthropomatics
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany
{thang.vu,tanja.schultz}@kit.edu

Abstract—In this paper we present our recent efforts towards building a large vocabulary continuous speech recognizer for Tamil. We describe the text and speech corpus collected to realize this task. The data was complemented by a large amount of text data crawled from various Tamil news websites. The Tamil speech recognition system was bootstrapped using the Rapid Language Adaptation scheme which employs a multilingual phone inventory. After initialization, we built a word-based and a syllable-based system with a Syllable Error Rate (SyllER) of 29.30% and 34.16%, respectively. We propose a data-driven approach to obtain better dictionary units to overcome the challenge of the agglutinative nature of Tamil. The approach produced a significant improvement of 27.20% and 15.12% relative SyllER on the test set over the syllable- and word-based systems, respectively. Our current best system has a SyllER of 17.44% on read newspaper speech.

Keywords-Agglutinative language; multilingual bootstrap; morphological complexity; LVCSR system; dictionary units

I. INTRODUCTION

With the globalization of the world, one of the most important needs in speech technology is the support of multiple input and output languages, since applications are required to address the needs of linguistically diverse users and those of international markets. Hence, new algorithms and techniques which support a rapid adaptation of speech processing systems to new languages are of great concern. Our Rapid Language Adaptation Toolkit (RLAT) [1] aims to considerably reduce both the time and the effort consumed in building speech technologies for new languages. This is achieved by providing innovative methods and tools that enable users to develop speech processing models, collect appropriate speech and text data to build these models, and finally evaluate the results, leading to iterative improvements. In this paper we describe the application of these tools to the Tamil language, which has recently seen growing interest in terms of speech and language technology.

Early work started with [2], in which large vocabulary speech recognition experiments were carried out for three different languages, namely Marathi, Telugu and Tamil. In [3] and [4], a Tamil spoken dialog system for farmers in Tamil Nadu and a syllable-based Tamil recognizer were built. [5] investigated language modeling for Tamil using various units, and [6] presented word- and triphone-based Tamil recognizers with vocabularies of 1,700 and 341 words, respectively. It can be seen that although there has been significant effort to address specific parts of an LVCSR system for Tamil, a comprehensive LVCSR system for Tamil has still not been developed. In this paper, we present a complete LVCSR system for Tamil built from scratch, using our new speech database and text gathered with RLAT, which addresses the peculiarities of the Tamil language.

The paper is organized as follows: In Section II we present the characteristics of the Tamil language and explain its morphological richness. Section III describes our speech and text corpora. In Section IV, the grapheme-to-phoneme (G2P) mapping and the syllable-level segmentation algorithm which were used to build the syllable- and word-based recognizers are explained. In Section V, our merging algorithm that mitigates the challenge of the agglutinative nature of Tamil is presented. In Section VI we conclude with our results.

II. CHARACTERISTICS OF TAMIL LANGUAGE

Tamil is one of the four major languages of the Dravidian language family. It is predominantly spoken in the state of Tamil Nadu on the Indian subcontinent and also in other parts of the world such as Singapore, Sri Lanka and Malaysia, by close to 70 million people. Dravidian languages are among the most complex languages in the world at the level of morphology. This can be attributed to agglutination and complex sandhi rules. Additionally, a significant part of the grammar that is handled by syntax in some languages, e.g. English, is handled within morphology in Tamil. External sandhi (that is, conflation between two or more full word forms) and compounding further increase the complexity. External sandhi is not the result of simple concatenation or juxtaposition of complete words written without intervening spaces, as is the convention in some languages, but results from words that are made up of several morphemes conjoined through complex morphophonological processes. Below are two examples of English sentences that reduce to a single word in Tamil after applying a series of sandhi rules:

2012 International Conference on Asian Language Processing
978-0-7695-4886-9/12 © 2012 IEEE
DOI 10.1109/IALP.2012.46


We collected a 38 million word Tamil corpus with our RLAT crawler [1]. We obtained an Out-Of-Vocabulary (OOV) rate of 16% for the most frequent 100K types in Tamil. Turkish and Korean have OOV rates of 13.5% and 34%, respectively, on a vocabulary of 64K [7]. Thus Tamil can be classified as morphologically rich, comparable to languages like Korean and Turkish.
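An OOV figure of this kind can be reproduced with a simple type-frequency count: build the vocabulary from the most frequent types of a training portion and measure the token-level OOV rate on held-out text. A minimal sketch (the function name and the toy cutoff below are illustrative, not from the paper):

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size=100_000):
    """Token-level OOV rate (%) of test_tokens against the
    vocab_size most frequent types in train_tokens."""
    counts = Counter(train_tokens)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    oov = sum(1 for t in test_tokens if t not in vocab)
    return 100.0 * oov / len(test_tokens)

# Toy example with a tiny vocabulary cutoff:
train = "a b a c a b d".split()
test = "a b e f".split()
print(oov_rate(train, test, vocab_size=2))  # 'a' and 'b' in vocab -> 50.0
```

With the paper's setup, `train_tokens` would be the 38M-word crawled corpus and `vocab_size` 100K.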

The Tamil alphabet consists of 12 graphemes which represent vowels, 18 graphemes which represent consonants, and a special character, the aytam, which is classified as neither consonant nor vowel but can be used as a diacritic to represent foreign sounds. Each of the consonants can be combined with any of the twelve vowels to produce 216 combinations (combinants). Together with the 31 graphemes in their independent form, this leads to a total of 247 graphemes. The combinant graphemes are formed by adding a vowel marker to the consonant or by altering the basic shape of the consonant specific to the vowel.

Consequently, the G2P mapping is a non-trivial task for Tamil when compared to other Dravidian languages. Also, unlike many other Indian languages, Tamil has very restricted consonant clusters and has neither aspirated nor voiced stops; the stops are present in the spoken language as allophones. In addition, the voicing of stop consonants is governed by strict rules: they are voiceless if they occur word-initially or in gemination.

III. SPEECH AND TEXT CORPORA

A. Text Corpus

The text corpus of Tamil words was built using RLAT [1], with which text from popular Tamil news websites was collected. The websites were crawled with a link depth of 10, i.e. the crawler followed links recursively up to 10 levels deep, moving on to the next level only after the current level was completely crawled. The list of websites that were crawled is given in Table I below.

The collected text was cleaned and normalized using the following four steps: (1) remove all HTML tags and codes; (2) remove special characters, non-Tamil characters and empty lines; (3) convert numbers, dates, times and common abbreviations to their equivalent text form; and (4) remove leading and trailing white spaces and write each sentence on a separate line.

Table I
LIST OF CRAWLED TAMIL WEBSITES

    Website URL          Link depth
  1 www.dinamalar.com    10
  2 www.dinakaran.com    10
  3 www.dinamani.com     10
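The four cleaning steps can be sketched as a small pipeline. The regular expressions and the Tamil Unicode block boundary below are illustrative assumptions, and the number/date expansion is stubbed out since the paper does not specify its verbalization rules:

```python
import re

# Assumption: keep Tamil script (U+0B80-U+0BFF), digits, whitespace and periods.
NON_TAMIL = re.compile(r"[^\u0B80-\u0BFF0-9\s.]")

def expand_numbers(line):
    # Hypothetical placeholder: a real system would verbalize digits,
    # dates and abbreviations as Tamil words here.
    return line

def clean_line(line):
    line = re.sub(r"<[^>]+>", " ", line)      # (1) strip HTML tags
    line = NON_TAMIL.sub(" ", line)           # (2) drop special/non-Tamil chars
    line = expand_numbers(line)               # (3) numbers/dates -> text (stub)
    return re.sub(r"\s+", " ", line).strip()  # (4) normalize whitespace

raw = ["<p>தமிழ்</p>", "", "   "]
corpus = [c for l in raw if (c := clean_line(l))]  # empty lines are dropped
print(corpus)  # -> ['தமிழ்']
```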

B. Speech Corpus

The speech data for our Tamil recognizer was collected in Tamil Nadu, India, in two stages. In the first stage, 68 speakers were each asked to read several lines (ranging between 30 and 300) from the Thirukkural, a Tamil literary classic which contains 1,330 verses and is one of the most important works in Tamil. This amounted to almost 17 hours of speech data. From this data, speech from 5 speakers, adding up to 1 hour, was separated and designated as the development set. In the second stage, additional speakers were asked to read newspaper prompts collected from the newspapers mentioned in Table I. A new set of 29 speakers participated in this exercise. The data from this stage amounts to 1 hour and constitutes the test set. All speech data was recorded with a close-talking microphone in quiet environmental conditions. A sampling rate of 16 kHz with a resolution of 16 bits was used, and the data was stored in PCM encoding.

Table II
SPEECH CORPUS DESCRIPTION

  Set           #Speakers       #Utterances  Duration
                Male   Female
  Training       30     33         1012      15h 50min
  Development     2      3           51       1h  4min
  Test           14     15          370       1h  0min
  Total          46     51         1433      17h 54min

IV. BASELINE TAMIL RECOGNITION SYSTEMS

A. Grapheme-to-Phoneme Model

Unfortunately, the G2P conversion task is not very straightforward in the case of Tamil, for the following two reasons: (1) confusion between the allophones p (b), t (d), th (dh), k (g) and c (j) (s), which is very difficult to resolve with linguistic rules, and (2) the transcription of borrowed words, which do not have a standard pronunciation. While most Indian languages are phonetic in nature, i.e. they possess a one-to-one correspondence between orthography and pronunciation, the Tamil script, although phonetic in nature, has many exceptions. For building the G2P model for Tamil, the Sequitur G2P toolkit [8] was used to iteratively train on a manually phoneticized lexicon with varying vocabulary sizes of 10k, 18k and 35k words. The N-gram size of all the models is N=6 and the graphone size is L=1. For testing the three G2P models, we used a test lexicon of 10k words which was handcrafted by native speakers. Our best G2P model achieves a letter accuracy of 99.56% on the test lexicon. This model is used to generate the pronunciation lexicons for our experiments.

B. Text Segmentation into Syllables

A syllable is usually defined as a vowel nucleus supported by consonants on either side. It is a consonant-vowel cluster that can be generalized as C*VC*, where C stands for consonant and V for vowel. We used the rules formulated in [4] for Romanized or transliterated Tamil and enhanced them for direct application to Tamil script. This was done in three steps: (1) Preprocessing – every combinant (consonant + vowel) in the word is replaced by an equivalent form that places both the root consonant and the vowel in adjacent positions. This way we have the word written in a form where each character is either a vowel or a consonant and not a combinant. (2) Syllabification – we apply the linguistic rules to the preprocessed word recursively, i.e. after each syllable is identified from a word, the remainder of the word is again provided as input to the algorithm to be syllabified. (3) Postprocessing – the combinant for every syllable is reconstructed, thereby resulting in correct Tamil script. An example showing the results of each stage of the above algorithm is shown in Figure 1. The word used there means "from the dictionaries".
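The recursive step (2) can be sketched with a greedy C*VC* matcher that peels one syllable off the front of a decomposed word. The rules in [4] are more elaborate, so the regular expression below is only an illustrative assumption, shown here on Romanized input:

```python
import re

# Hypothetical C*VC* pattern: onset consonants, one vowel nucleus, then
# trailing consonants up to (but not including) the next syllable's onset.
SYLL = re.compile(r"[^aeiou]*[aeiou][^aeiou]*?(?=[^aeiou]*[aeiou]|$)")

def syllabify(word):
    """Recursively split a Romanized word into C*VC* syllables."""
    if not word:
        return []
    m = SYLL.match(word)
    if m is None:                      # no vowel nucleus left:
        return [word]                  # keep the remainder as one unit
    head = m.group(0)
    return [head] + syllabify(word[len(head):])

print(syllabify("tamil"))  # -> ['ta', 'mil']
```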

Figure 1. Illustration of the syllabification algorithm

C. Language Modeling

We selected 3,832,821 lines out of the total text crawled for our experiments. First, we built a word-level Language Model (LM) using the SRI Language Modeling Toolkit [9]. Afterwards, we used the segmentation algorithm described in the previous section to syllabify the same text, and a syllable-level LM was then built. Table III gives the characteristics of both the word- and the syllable-based language models when evaluated on the test set.

Table III
CHARACTERISTICS OF LM-WORD AND LM-SYLL

  Criteria     LM-Word  LM-Syll
  #Tokens      903M     963M
  #Types       602K     10K
  OOV rate %   4.99     0.00
  Perplexity   5867     21

D. Word-based vs Syllable-based Recognizers

For building our baseline speech recognizer for Tamil, we used RLAT to bootstrap the Tamil system with the help of a multilingual phone inventory (MM7). MM7 was trained using seven arbitrarily selected GlobalPhone languages (Chinese, Croatian, English, Japanese, German, Spanish, and Turkish) with about 20 hours of audio data each [1]. To bootstrap the system, Tamil phonemes were derived from the closest matches of the MM7 inventory by an IPA-based phone mapping. An initial state alignment of the Tamil training data was produced using the selected MM7 models as seed models. We utilized GlobalPhone-style preprocessing, which consisted of extracting features by applying a Hamming window of 16 ms length with a window shift of 10 ms. Each feature vector has 143 dimensions, resulting from stacking 11 adjacent frames of 13 Mel-scale Frequency Cepstral Coefficients (MFCCs) each. A Linear Discriminant Analysis (LDA) transformation reduces the feature vector size to 42 dimensions. The model uses a 3-state left-to-right HMM. The emission probabilities are modeled by Gaussian mixtures with diagonal covariances. For our context-dependent AMs with different context sizes, we terminated the decision tree splitting process at 2,000 quinphones. After context clustering, a merge-and-split training was applied, which selects the number of Gaussians according to the amount of data (23 Gaussians on average, with at least 50 frames to train each Gaussian). For all models, we use one global semi-tied covariance (STC) matrix after LDA. First, we built a word-based system, for which a word-level pronunciation lexicon was generated. The SyllER of the resulting system is 29.30% on the test set with LM-Word.
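The 143-dimensional input to the LDA (13 MFCCs stacked over 11 adjacent frames, i.e. ±5 frames of context) can be sketched in a few lines. The clamping at utterance edges is an assumption, since the paper does not state how boundaries are handled:

```python
def stack_frames(frames, context=5):
    """Stack each 13-dim MFCC frame with +/-`context` neighbours,
    clamping indices at the utterance boundaries (assumed behaviour)."""
    n = len(frames)
    stacked = []
    for t in range(n):
        vec = []
        for dt in range(-context, context + 1):
            vec.extend(frames[min(max(t + dt, 0), n - 1)])
        stacked.append(vec)
    return stacked

utt = [[float(i)] * 13 for i in range(20)]   # 20 dummy MFCC frames
out = stack_frames(utt)
print(len(out[0]))  # 11 frames x 13 coefficients = 143
```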

Due to the high perplexity and OOV rate of the word-based system on the test set, we decided to build a syllable-based recognizer. We used our syllable segmentation algorithm from Section IV-B to syllabify the training and the test transcripts, and a syllable-level pronunciation lexicon was generated. A similar training framework was used for training the syllable system. The SyllER of our syllable-based system is 34.16%, which is worse than the word-based system. This may be due to the loss of context information and the decreased range of the language model caused by the short syllables. Additionally, the need for higher-order n-grams in LM-Syll and the co-articulation events between syllables may have contributed to the decrease in performance.

V. DICTIONARY UNIT MERGING ALGORITHM

A. Motivation

From the above section, it can be seen that the syllable system performs worse than the word-based system. We need a trade-off between the syllable and the word units in order to retrieve some lost context and to tackle the high OOV rate of the word-based system. This is the main motivation behind applying a unit merging algorithm. It (1) reduces the acoustic confusability due to short syllables, (2) increases the range of the language model, and (3) increases the context of the acoustic model. On the other hand, these units are shorter than words, thereby keeping the OOV rate at a manageable level.

B. Proposed Algorithm

In [10], the authors proposed a data-driven algorithm to determine appropriate dictionary units for Korean, to overcome the high OOV rate that results from the rich morphology of Korean. We extended the above algorithm and adapted it for application to Tamil. Our extended algorithm is a data-driven, statistical approach that requires no a-priori linguistic knowledge.

The inputs to the algorithm are a pronunciation dictionary, a large text file and a vowel list. We propose a slight variation of the algorithm used in [10] to keep the OOV rate always at zero. Initially, we segment the entire text into syllables using the syllabification algorithm. We also include word boundary information in the syllabified text, i.e. we prepend a '-' to every syllable that does not occur at the start of a word. Then we obtain all possible syllable pairs from the syllabified text. Each possible pair is then looked up in the dictionary and the pronunciation of the vowel-vowel transition is retrieved.

The merging algorithm is governed by the following iterative steps:

1. First, we compute a hash table that contains the vowel-vowel transition and the syllable pair that produces it as keys, and the frequency of the pair as the value.
2. For each vowel-vowel transition in the hash table, we place the most frequent syllable pair into a merge-list.
3. We merge all the pairs in the merge-list in the segmented corpus.

We only merge pairs that occur within a word, and chose not to merge pairs across word boundaries, since Tamil has fixed word boundaries. We use the merge-list obtained after step 2 of the unit merging algorithm to merge both the training and test transcripts, which keeps the OOV rate at zero. Figure 2 shows the various stages of the unit merging algorithm.
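One iteration of the three steps above can be sketched as follows. In the paper the vowel-vowel transition comes from the pronunciation dictionary; here it is approximated from the orthography (last vowel of the left unit, first vowel of the right unit), so treat this as an illustrative assumption rather than the exact procedure:

```python
from collections import Counter

VOWELS = set("aeiou")  # placeholder vowel list

def transition(left, right):
    """Approximate vowel-vowel transition of a syllable pair."""
    lv = [c for c in left if c in VOWELS][-1]
    rv = [c for c in right.lstrip("-") if c in VOWELS][0]
    return lv + rv

def merge_iteration(corpus):
    """corpus: list of syllable lists; '-' marks word-internal syllables."""
    # Step 1: hash table keyed by (transition, pair) with pair frequency.
    pairs = Counter()
    for sylls in corpus:
        for a, b in zip(sylls, sylls[1:]):
            if b.startswith("-"):              # only within-word pairs
                pairs[(transition(a, b), (a, b))] += 1
    # Step 2: most frequent pair per vowel-vowel transition -> merge-list.
    best = {}
    for (trans, pair), freq in pairs.items():
        if trans not in best or freq > best[trans][1]:
            best[trans] = (pair, freq)
    merge_list = {pair for pair, _ in best.values()}
    # Step 3: apply the merges to the segmented corpus.
    merged = []
    for sylls in corpus:
        out, i = [], 0
        while i < len(sylls):
            if i + 1 < len(sylls) and (sylls[i], sylls[i + 1]) in merge_list:
                out.append(sylls[i] + sylls[i + 1].lstrip("-"))
                i += 2
            else:
                out.append(sylls[i])
                i += 1
        merged.append(out)
    return merged, merge_list
```

Re-running `merge_iteration` on its own output yields progressively longer units, which is how the algorithm trades off between syllables and words.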

Figure 2. Various stages of the unit merging algorithm

C. Experimental Results

We applied our merging algorithm on the same 3,832,821 lines of text as above and built an LM (LM-Merge) with the new units, which reports a perplexity of 82.33 on the test set. By using LM-Merge, we obtained a SyllER of 24.87%. Additionally, we added another 4,932,821 lines of text crawled using RLAT to improve the language model. Since the vocabulary of LM-Merge is more than 200k and the OOV rate is zero, we did not apply the merging algorithm to the additional text, but chose to use the merge-list of the first set to merge the pairs of the additional text. The final LM has a perplexity of 37.49 on the test set and the resulting system achieves a SyllER of 17.44%. Figure 3 gives the performance of all the systems in terms of SyllER.

VI. CONCLUSION

In this paper, we built a state-of-the-art LVCSR system for Tamil, a morphologically rich Indian language, using RLAT. Initially, we built word- and syllable-based baseline systems with SyllERs of 29.30% and 34.16% on the test set, respectively. Next, we addressed the agglutinative nature of Tamil by applying a data-driven approach to merge the syllables into new dictionary units and obtained relative SyllER improvements of 27.20% and 15.12% over the syllable- and word-based systems, respectively. Afterwards, we crawled additional text data using RLAT and improved the system from 24.87% to 17.44% SyllER on the test set.

ACKNOWLEDGMENT

The authors would like to thank all friends and family in India for contributing their time, effort, knowledge and speech to the Tamil corpus. We would also like to thank Dr. Bharadwaja Kumar, under whose guidance the G2P experiments were done.

REFERENCES

[1] T. Schultz, A. W. Black, S. Badaskar, M. Hornyak, and J. Kominek. SPICE: Web-based tools for rapid language adaptation in speech processing systems. In Proceedings of Interspeech, August 2007.

[2] R. Kumar, S. Kishore, A. Gopalakrishna, R. Chitturi, S. Joshi, S. Singh, and R. Sitaram. Development of Indian language speech databases for large vocabulary speech recognition systems. In Proceedings of SPECOM, 2005.

[3] M. Plauche, N. Udhyakummar, C. Wooters, J. Pal, and D. Ramachadran. Speech recognition for illiterate access to information and technology. In Proceedings of the First International Conference on ICT and Development, 2006.

[4] A. Lakshmi and H. A. Murthy. A syllable based continuous speech recognizer for Tamil. In Proceedings of Interspeech, 2006.

[5] S. Saraswathi and T. V. Geetha. Design of language models at various phases of Tamil speech recognition system. International Journal of Engineering, Science and Technology, 2(5):244-257, 2010.

[6] R. Thangarajan, A. M. Natarajan, and M. Selvam. Word and triphone based approaches in continuous speech recognition for Tamil language. WSEAS Transactions on Signal Processing, 4(3):76-85, 2008.

[7] T. Schultz and K. Kirchhoff. Multilingual Speech Processing. Elsevier, Academic Press, 2006.

[8] M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434-451, 2008.

[9] A. Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901-904, September 2002.

[10] D. Kiecza, T. Schultz, and A. Waibel. Data-driven determination of appropriate dictionary units for Korean LVCSR. In Proceedings of ICASSP, pages 323-327, 1999.
