
Digit Recognition in the Náhuatl Language: An Evaluation Using Various Recognition Models

Suárez-Guerra Sergio, Oropeza-Rodríguez José Luis, Flores-Paulin Juan Carlos, Sánchez-Fernández Luis Pastor

Center for Computing Research, National Polytechnic Institute, Av. Juan de Dios Batiz, esq. Miguel Othón de Mendizabal, s/n, 07038, Mexico City, Mexico

[email protected], [email protected], [email protected], [email protected]

Abstract

The aim of Automatic Speech Recognition (ASR) is to develop techniques and systems that enable a computer to accept speech input. The digit recognition task has often been employed as a contribution to ASR. In this work, we used Linear Prediction Coefficient (LPC) and Mel Frequency Cepstral Coefficient (MFCC) parameters. For selection of the best analysis interval we used a Vector Quantization model. For recognition, we applied Continuous Density Hidden Markov Models (CDHMM), employing a dictionary composed of eighteen command words that are specific digits of the Náhuatl language. The obtained results were compared against Discrete Hidden Markov Models and Vector Quantization models. In these experiments, we obtained a performance of 99% accuracy for digit recognition. Our experiments used three native speakers.

Key Words: Automatic Speech Recognition, Discrete Hidden Markov Models, Viterbi Training, Náhuatl language.

1. Introduction

Hidden Markov Models (HMMs) have dominated automatic speech recognition for at least a decade. The success of these models is related to their mathematical simplicity: simple, efficient and robust algorithms have been developed to facilitate their practical implementation.

In Mexico, many universities have invested considerable effort in creating speech recognition systems (UNAM, IPN, ITESM, UAM, UDLA, and others), primarily for the Spanish language. There are very few works for Náhuatl that do not employ native speakers [1].

Náhuatl grammatical characteristics are described as follows [2, 3, 8]:

• It is an agglutinative language. Words are constructed by joining prefixes, root words and suffixes.

• It is an incorporative language. Nouns can be incorporated into a verbal phrase.

• It often uses reverential forms of speech that give a respectful frame to human relations.

Náhuatl is a Uto-Aztecan language spoken by about 1.5 million people in Mexico. The majority of speakers live in central Mexico, particularly in Puebla, Veracruz, Hidalgo, San Luis Potosí, Guerrero, Mexico DF, Tlaxcala, Morelos and Oaxaca, and also in El Salvador.

There are numerous dialects of Náhuatl; some of them are mutually unintelligible. Most Náhuatl speakers also speak Spanish, with the exception of some of the older speakers.

Classical Náhuatl was the language of the Aztec empire, and it was used as a lingua franca in most of Mesoamerica from the 7th century AD until the Spanish conquest in the 16th century. Modern dialects of Náhuatl that are close to Classical Náhuatl are spoken in the Valley of Mexico. Náhuatl originally used a pictographic script, which was not a full writing system but rather served as a mnemonic aid to help readers remember texts they had learned orally. The script appeared in inscriptions carved in stone and in picture books, many of which were destroyed by the Spanish conquerors [8].


Table 1. Phonetics of the Náhuatl language used in this work

Group                            Phonemes
Vowels                           a, e, i, o
Semivowels or semiconsonants     hu, uh, y
Consonants                       c, h, l, m, n, p, q, t, x, z, ch, ll, tl, tz

2. Numbers in the Náhuatl Language

The numerical system in Náhuatl is based on 20 (a vigesimal system). Counting proceeds in groups of twenty: after 20 comes 20 x 20 = 400, then 20 x 400 = 8000, then 20 x 8000, and so on; the groups always contain 20 numbers [1, 2, 3]. The construction rules are listed below (a short sketch after the list illustrates them).

1  ce
2  ome
3  yei (or ei)
4  nahui
5  macuhilli
6  chicuha-ce (chic + uha + ce)
7  chic-ome (chic + ome)
8  chic-uheyi
9  chic-nahui
10 matlahtli (or mahtlahtli)
11 matlah-ce (matlah + ce)
12 matlah-ome
13 matlah-yei
14 matlah-nahui
15 caxtolli
16 caxtol-ce
17 caxtol-ome
18 caxtol-yei
19 caxtol-nahui
20 cempohualli

The count from 20 to 40 is 20 + huan ("plus, and") + (1, 2, 3, ..., 19):

21 cempohualli (20) + huan (plus) + ce (1) => cempohualli-huan-ce
...
29 cempohualli-huan-chicnahui
30 cempohualli-huan-matlahtli
...
39 cempohualli-huan-caxtol-nahui
40 om-cempohualli (literally: two times twenty), ome + cempohualli (2 x 20)
...
60 yei-cempohualli (3 x 20)
...
80 nahui-cempohualli (4 x 20)
...
100 macuhil-cempohualli (macuhilli + cempohualli) (5 x 20)
...
200 matlah-cempohualli => matlahtli + cempohualli (10 x 20)
...
400 centzontli => cempohualli + cempohualli (20 x 20); another reading: a hair, or the bird of 400 voices
...
8000 xiquhipilli (a bag) = 20 x 400
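To make the construction rules above concrete, the following Python sketch (ours, not part of the original paper) builds number names from 1 to 39 using the base-20 pattern; the spellings simply follow the list above and gloss over the finer points of Náhuatl orthography.

# Illustrative sketch of the vigesimal construction rules described above.
UNITS = {
    1: "ce", 2: "ome", 3: "yei", 4: "nahui", 5: "macuhilli",
    6: "chicuha-ce", 7: "chic-ome", 8: "chic-uheyi", 9: "chic-nahui",
    10: "matlahtli", 11: "matlah-ce", 12: "matlah-ome", 13: "matlah-yei",
    14: "matlah-nahui", 15: "caxtolli", 16: "caxtol-ce", 17: "caxtol-ome",
    18: "caxtol-yei", 19: "caxtol-nahui",
}
TWENTY = "cempohualli"

def nahuatl_name(n: int) -> str:
    """Name a number from 1 to 39 using the base-20 rules of Section 2."""
    if 1 <= n <= 19:
        return UNITS[n]
    if n == 20:
        return TWENTY
    if 21 <= n <= 39:
        # 20 + huan ("plus") + remainder, e.g. 21 -> cempohualli-huan-ce
        return TWENTY + "-huan-" + UNITS[n - 20]
    raise ValueError("this sketch only covers 1..39")

if __name__ == "__main__":
    for n in (7, 15, 20, 21, 30, 39):
        print(n, nahuatl_name(n))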

3. Considerations for Recognition

Speech recognition systems nowadays use different techniques to achieve better results according to the needs of the user: whether a small number of words or a large vocabulary must be recognized, and whether the system must identify a particular speaker or be more robust and speaker-independent. In this sense, it was decided to develop this work as a short analysis using some commonly used techniques applied to the Náhuatl language, limiting the investigation to a set of isolated words (digits) in order to find a methodology for speech recognition in this language.

It was necessary to consider three important aspects:

• Speech corpus for the Náhuatl language.
• Characteristic parameters used for speech recognition [4, 6].
• Speech recognition models to be tested [5, 6, 7].

3.1. Speech Corpus

The speech corpus used for this system was generated as audio in WAV format. Each speaker performed twenty repetitions of each word to be recognized, plus two or more extra recordings to ensure that the corpus had no errors. The recordings were made in an isolated place without noise, using a standard microphone and a cassette recorder.

It should be mentioned that in order to obtain proper recordings it is necessary to consider appropriate recording conditions, i.e., a minimum of environmental noise (with or without an external filter), the quality of the speech capture interface (monaural or stereo microphone) and an appropriate sampling frequency.

For this work, we used a corpus of recordings obtained from three adult speakers, in WAV format at 22,050 Hz, 16 bits, mono, captured with an Olympus VN-3100 PC digital recorder: 20 recordings per digit, for a total of 400 speech files per speaker. A total of 1,200 recordings were used as follows: for each digit, 10 recordings were used for training the recognizer and 10 for testing it.
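As an illustration of how this split can be organized, the short sketch below (ours; the folder layout and file naming convention are hypothetical) groups the twenty takes of each digit per speaker and assigns ten to training and ten to testing.

# Minimal corpus-splitting sketch; the file naming is an assumption, not the
# paper's convention: <speaker>_<word>_<take>.wav, e.g. s1_ce_07.wav
from pathlib import Path

def split_corpus(corpus_dir):
    """Return (train, test) lists of WAV paths, 10 takes each per word."""
    by_word = {}
    for wav in sorted(Path(corpus_dir).glob("*.wav")):
        speaker, word, _take = wav.stem.split("_")
        by_word.setdefault((speaker, word), []).append(wav)
    train, test = [], []
    for files in by_word.values():
        train.extend(files[:10])    # first ten takes -> training set
        test.extend(files[10:20])   # remaining ten takes -> test set
    return train, test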

3.2. Characteristic Parameters used for Speech Recognition

The parameters used in this study were the Linear Prediction Coefficients (LPC) and the Mel Frequency Cepstral Coefficients (MFCC). The first was used because it is easy to compute in the time domain and because of its relation to the human voice production model, simulating the vocal tract. The latter approximates natural human auditory perception and gives a better spectral representation.
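As a rough sketch of how such parameters can be extracted (the paper does not name a feature-extraction toolkit; librosa, the 40 ms window with 50% overlap and the file name below are our assumptions), both feature types can be computed as follows.

# Hedged feature-extraction sketch: 12 MFCCs over a 26-filter mel bank and
# order-26 LPC, both on 40 ms frames of a 22,050 Hz recording.
import numpy as np
import librosa

SR = 22050                      # sampling rate used for the corpus
FRAME = int(0.040 * SR)         # 40 ms analysis window (best case in Sec. 4)
HOP = FRAME // 2                # 50% overlap (our assumption)

y, _ = librosa.load("digit.wav", sr=SR)     # hypothetical recording

# MFCC: 12 coefficients over a 26-filter mel bank (cf. Section 4.4)
mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=12, n_mels=26,
                            n_fft=FRAME, hop_length=HOP)

# LPC of order 26 computed frame by frame (cf. Section 4.1)
frames = librosa.util.frame(y, frame_length=FRAME, hop_length=HOP)
lpc = np.stack([librosa.lpc(np.ascontiguousarray(f), order=26)
                for f in frames.T])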

3.3. Speech Recognition Models to be Tested

There are simple models used to classify or recognize characteristic parameters that give acceptable results. More sophisticated models can achieve better results but have higher computational costs.

The models tested are presented in Figure 1:
• Vector Quantization.
• Discrete Hidden Markov Models.
• Continuous Density Hidden Markov Models.

Figure 1. Schematic Diagram for Vector Quantization and Discrete Hidden Markov Models.

4. Results

The first experiment was designed to determine the segment length and the number of parameter coefficients that allow better recognition. We selected segment lengths of 11 and 40 ms, with p = 12 and p = 26 parameters respectively; we also used 32 codebook regions. We applied a Vector Quantization model for the recognition. The recognition results were:

• p = 12 and 11 ms: 88%
• p = 12 and 40 ms: 91%
• p = 26 and 11 ms: 90%
• p = 26 and 40 ms: 94%

We concluded that p = 26 and 40 ms is the best option for the Vector Quantization model.
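A hedged reconstruction of this recognizer is sketched below (the paper does not publish code; k-means via scikit-learn is our substitute for the codebook training): one 32-region codebook is trained per word, and an utterance is assigned to the word whose codebook quantizes it with the lowest average distortion.

# Vector Quantization recognizer sketch: per-word codebooks and
# minimum-distortion classification.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_by_word, regions=32, seed=0):
    """features_by_word: {word: array of shape (n_frames, n_coeffs)}."""
    return {w: KMeans(n_clusters=regions, random_state=seed, n_init=10).fit(X)
            for w, X in features_by_word.items()}

def distortion(codebook, X):
    """Mean squared distance from each frame to its nearest codeword."""
    d = ((X[:, None, :] - codebook.cluster_centers_[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean()

def recognize(codebooks, X):
    """Return the word whose codebook quantizes the utterance best."""
    return min(codebooks, key=lambda w: distortion(codebooks[w], X))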

The second experiment was designed to determine which type of parameters and which model gives better results for digit speech recognition in Náhuatl.

We used the k-fold cross-validation technique in some experiments (for example, LPC and VQ). In other experiments we used the HTK toolkit, which restricts the choice of segment length. At the end of the paper we compare the best results obtained with each technique.

Note that the training values we report are not always 100%, but we are primarily interested in the validation results.
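A minimal sketch of this evaluation protocol, assuming scikit-learn's KFold and generic train/recognize callables (the paper does not describe its implementation), is the following.

# k-fold cross-validation sketch: average validation accuracy over k folds.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(utterances, labels, train_fn, recognize_fn, k=10, seed=0):
    """utterances: list of feature matrices; labels: list of word labels."""
    utterances = np.asarray(utterances, dtype=object)
    labels = np.asarray(labels)
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                    random_state=seed).split(labels):
        model = train_fn(utterances[train_idx], labels[train_idx])
        correct = sum(recognize_fn(model, x) == y
                      for x, y in zip(utterances[val_idx], labels[val_idx]))
        scores.append(100.0 * correct / len(val_idx))
    return np.mean(scores)      # mean validation accuracy (%)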

[Figure 1 block labels: Speech signal, DSP, Codebook generation, Codebooks, Nearest neighbor, Observation sequence, Word Recognizer, HMM Models, Generation, Recognition.]


4.1. LPC and VQ

The results are shown in Tables 2 and 3.

First test conditions: p = 26, 40 ms, 32 codebook regions and k = 10. Validation result: 95.5%.

Second test conditions: p = 26, 40 ms, 32 codebook regions and k = 50, 10 tests. Validation result: 92.6%.

Table 2. Results with LPC parameters and selected conditions (test 1)

Corpus        1     2     3     4     5     6     7     8     9     10    Result
Validation    100   80    95    100   100   95    90    100   100   95    95.5
Training      100   100   100   100   100   100   100   100   100   100   100
Full result   100   98    99.5  100   100   99.5  99    100   100   99.5  99.55

Table 3. Results with LPC parameters and selected conditions (test 2)

Corpus        1     2     3     4     5     6     7     8     9     10    Result
Validation    94    92    91    87    89    92    94    96    96    95    92.6
Training      100   100   100   100   100   100   100   100   100   100   100
Full result   97    96    95.5  93.5  94.5  96    97    98    98    97    96.25

4.2. LPC and DHMM

The results are shown in Tables 4 and 5.

First test conditions: p = 26, 40 ms, 128 codebook regions, 6 states, 3 Viterbi re-estimations and k = 10. Validation result: 83%.

Second test conditions: p = 26, 40 ms, 128 codebook regions, 6 states, 3 Viterbi re-estimations and k = 50, 10 tests. Validation result: 77.2%.

Table 4. Results with LPC-DHMM and selected conditions (test 1)

Corpus        1     2     3      4     5      6     7     8      9     10    Result
Validation    85    60    85     95    90     80    85    100    80    70    83
Training      100   100   99.44  100   99.44  100   100   99.44  100   100   99.83
Full result   98.5  96    98     99.5  98.5   98    98.5  99.5   98    97    98.15

Table 5. Results with LPC-DHMM and selected conditions (test 2)

Corpus        1     2     3     4     5     6     7     8     9     10    Result
Validation    83    79    71    67    74    78    80    86    78    76    77.2
Training      100   100   100   100   100   100   100   100   100   100   100
Full result   91.5  89.5  85.5  83.5  87    89    90    93    89    88    88.6

4.3. LPC-CDHMM

These tests were carried out using the HTK system with LPC parameters, without applying the k-fold cross-validation method.

First test conditions: p = 12, 6 states and 3 Viterbi re-estimations. Validation result: 85%.

Second test conditions: p = 12, D = 12, A = 12 and T = 12, 6 states and 3 Viterbi re-estimations, where D, A and T are the delta, acceleration and third-difference parameters, respectively. Validation result: 91%.
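The following sketch (our reconstruction, not HTK code) shows how such dynamic features can be obtained: a regression-style delta over neighbouring frames, applied once for D, twice for A and three times for T, and appended to the 12 static coefficients to give the 48-dimensional vector.

# Delta (D), acceleration (A) and third-difference (T) feature sketch.
import numpy as np

def delta(feat, width=2):
    """Regression-style delta over +/- width frames (HTK-like formula)."""
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (padded[width + k:len(feat) + width + k]
                   - padded[width - k:len(feat) + width - k])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

def add_dynamic(static):
    """static: (n_frames, 12) -> (n_frames, 48) with D, A and T appended."""
    d = delta(static)       # first-order differences (delta)
    a = delta(d)            # second-order (acceleration)
    t = delta(a)            # third-order (third difference)
    return np.hstack([static, d, a, t])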

4.4. MFCC-CDHMM

These tests were carried out using the HTK system with MFCC parameters, without applying the k-fold cross-validation method.

First test conditions: L = 26 (number of filters), p = 12, 6 states and 3 Viterbi re-estimations. Validation result: 96%.

Second test conditions: null cepstral coefficient (E) included, L = 26 (number of filters), p = 13, D = 13, A = 13, T = 13, 6 states and 3 Viterbi re-estimations. Validation result: 99%.
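To illustrate the continuous-density recognizer itself, the sketch below uses hmmlearn as a stand-in for HTK (an assumption: hmmlearn trains with Baum-Welch rather than HTK's Viterbi-style re-estimation): one 6-state Gaussian HMM is trained per digit, and an utterance is assigned to the model with the highest log-likelihood.

# Per-word continuous-density HMM recognizer sketch (hmmlearn, not HTK).
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(features_by_word, n_states=6):
    """features_by_word: {word: list of (n_frames, n_coeffs) arrays}."""
    models = {}
    for word, utterances in features_by_word.items():
        X = np.vstack(utterances)
        lengths = [len(u) for u in utterances]
        m = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=10)
        m.fit(X, lengths)               # EM (Baum-Welch) re-estimation
        models[word] = m
    return models

def recognize(models, utterance):
    """Return the digit whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(utterance))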

Table 6. Summary of the results (for 40 ms segments)

Parameters and Method                Performance
LPC & VQ (1), 32 codebooks           95.5%
LPC & VQ (2), 32 codebooks           92.6%
LPC & DHMM (1), 128 codebooks        83%
LPC & DHMM (2), 128 codebooks        77.2%
LPC & CDHMM (1), 12 parameters       85%
LPC & CDHMM (2), 48 parameters       91%
MFCC & CDHMM (1), 12 parameters      96%
MFCC & CDHMM (2), 52 parameters      99%

5. Conclusions

The combination of MFCC parameters and Continuous Density Hidden Markov Models is the best "parameter-model" set for the English and Spanish languages. We conclude that it is also the best set for the Náhuatl language. We think that although the digit corpus is restricted as far as its phonetics and tonality are concerned, these results are applicable to other speech data of the Náhuatl language.

The results of this work show a high recognition rate for the presented numbers of MFCC parameters and for a segment length of 40 ms.

On the basis of the obtained results, we expect that further studies on restricted sets of phrases in Náhuatl will also produce good results that can be applied to services for these communities (e.g., reservation systems, tourist systems, medical systems, etc.).

The authors would like to thank the following institutions for their financial support while developing this work: CONACyT, SNI, and National Polytechnic Institute, COFAA (Grant 20090807, "Tools for speech recognition, translation and emotion classification").

References

[1] Nolazco-Flores Juan A., Salgado-Garza Luis R., and Peña-Díaz Marco, "Speaker Dependent ASRs for Huastec and Western-Huastec Náhuatl Languages", Departamento de Ciencias Computacionales, ITESM, Campus Monterrey, 2007.

[2] Andrews Richard, "Introduction to Classical Náhuatl", 1975; Joe Campbell and Frances Karttunen, "Foundation Course in Náhuatl Grammar", 1989; John Bierhorst, "Diccionario y concordancia del manuscrito Cantares mexicanos", 1985.

[3] INALI, SEP México, "Lectura del Náhuatl. Fundamentos para la traducción de los textos en Náhuatl del periodo Novohispano Temprano", 2007.

[4] Suárez-Guerra Sergio, "Una metodología para realizar trabajos de reconocimiento de voz", Centro de Investigación en Computación (CIC)-IPN, México, D.F., Informe técnico 196 Azul, 2004.

[5] Real-Time Laboratory, CIC-IPN, México, D.F., "CVRecVoz: Sistema para el Reconocimiento de Palabras Aisladas utilizando Cuantización Vectorial y Técnica Multisección", Manual de Usuario, June 1999.

[6] Rabiner L. and Juang B. H., "Fundamentals of Speech Recognition", Prentice-Hall, New Jersey, 1993.

[7] Oropeza-Rodríguez José Luis, Suárez-Guerra Sergio, Barrón-Fernández Ricardo, "Reconocimiento de voz usando segmentación de energía usando modelos ocultos de Markov de densidad continua", CIC-IPN, 2004.

[8] http://wikipedia.org/wiki/Classical_Nahuatl_Language
