Hierarchical Speech Recognition System Using MFCC Feature Extraction and Dynamic Spiking RSOM

Behi Tarek, Arous Najet, Ellouze Noureddine
Laboratory of Signal, Image and Information Technologies
National Engineering School of Tunis (ENIT), Université Tunis El Manar, Tunisia
[email protected]

Abstract—In this paper, we propose new variants of unsupervised and competitive learning algorithms designed to deal with temporal sequences. These algorithms combine features of Spiking Neural Networks (SNNs) with the advantages of the hierarchical self-organizing map (HSOM). The first variant, named Hierarchical Dynamic recurrent spiking self-organizing map (HD-RSSOM), is characterized by the integration of a temporal controller component that regulates the firing activity of the spiking neurons. The second variant is a hierarchical model which represents a multi-layer extension of the HD-RSSOM model. The case study of the proposed HSOM variants is phoneme and word recognition in continuous speech. The applied HSOM variants serve as tools for developing intelligent systems and pursuing artificial intelligence applications.

Keywords—Kohonen map; temporal self-organizing map; hierarchical self-organizing model; spiking neural network; speech recognition.

I. INTRODUCTION

The self-organizing map (SOM) is one of the most important unsupervised artificial neural network models; it has proven successful in cluster analysis, mapping high-dimensional input data onto a usually two-dimensional output space while preserving the topological relationships between the input data items as faithfully as possible within the representation space of the SOM.

Despite the large number of research reports on successful applications of the SOM [1], this model presents some weaknesses for patterns that involve a temporal dimension. For instance, speech recognition includes the time parameter in an intrinsic manner: speech discrimination is more accurate with contextual information that occurs successively in time. It is therefore very important to introduce contextual or temporal concepts into neural networks.

On the other hand, an essential parameter for the resolution of the SOM is the size of the map. As the side length of the map increases linearly, the number of units in the SOM increases quadratically. Therefore, the training of large maps can be computationally quite expensive, especially due to the cost of the best-matching unit (BMU) search.

In order to overcome these limitations of self-organizing maps, we propose in this paper novel temporal neural network models with a hierarchical architecture.

In our models the temporal information is taken into account by using spiking neurons [2]. Spiking neural networks (SNNs) are models for the computational units in biological neural systems, where information is considered to be encoded mainly in the temporal patterns of activity. SNNs are able to encode temporal information in both spike timing and spiking rates. The model which realizes the spiking neurons as coincidence detectors encodes the training input information in the connection delays [3].

The spiking neural network is a promising structure for temporal sequence processing. To improve its recognition performance and enrich its value as a model, we propose to design a hierarchical model that is able to learn and recognize dynamic patterns represented by sequences, to extract the hierarchical structure of the data, and to reduce the cost of the training process.

Therefore, the purpose of this work is to provide a new approach to hierarchical clustering and structuring of data with temporal self-organizing maps. For that, we have implemented two models, the Hierarchical Dynamic recurrent spiking self-organizing map (HD-RSSOM) and the multi-layered HD-RSSOM.

In this paper, we are interested in speaker-independent phoneme and word recognition in continuous speech, using Mel-Frequency Cepstral Coefficients (MFCC), by means of the hierarchical self-organizing map variants. We have used the DARPA TIMIT speech corpus for the evaluation of the HSOM variants in this application domain. The proposed HSOM variants provide more accurate phoneme and word recognition rates in comparison with the basic SOM, HSOM and HRSOM models.


The remainder of the paper is organized as follows. In Section 2 we present the basic models of the self-organizing map and the hierarchical SOM. In Section 3, we present the proposed new variants. Section 4 illustrates experimental results of the application of the HSOM variants to phoneme and word recognition on the TIMIT speech corpus. The paper is concluded in Section 5.

II. SELF ORGANIZING MAP AND HIERARCHICAL SOM

Among various neural network architectures and learning algorithms, Kohonen's self-organizing map (SOM) [4] is one of the most popular neural network models. The SOM is an unsupervised neural network based on competitive learning. It is characterized by the representation of high-dimensional input data in a low-dimensional output space, usually a two-dimensional array of output units. The SOM is a topology-preserving technique.

The basic SOM learning algorithm may be described as follows [5]. For each input pattern:

1) Calculate the distance between the pattern and all units of the SOM:

$d_{ij} = \lVert x_k - w_{ij} \rVert$   (1)

where $w_{ij}$ is the weight vector associated with the unit at column $i$, row $j$, and $x_k$ is the vector associated with input $k$.

2) Select the nearest unit as winner:

$w_{winner} = \arg\min_{i,j}\, d_{ij}$   (2)

3) Update each unit of the SOM according to the update function:

$w_{ij} = w_{ij} + \alpha\, h(w_{winner}, w_{ij})\,(x_k - w_{ij})$   (3)

where $\alpha$ is the learning rate and $h$ is a neighborhood function.

4) Repeat steps 1) to 3), updating the learning parameters, until the model converges to a stationary state.
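For concreteness, the following is a minimal NumPy sketch of this training loop. The map size, number of epochs, and Gaussian neighborhood are illustrative assumptions, not the paper's settings; the linear decay ranges follow Section IV-B.

```python
import numpy as np

def train_som(data, rows=10, cols=10, n_epochs=20, lr0=0.9, lr1=0.05):
    """Minimal SOM trainer following steps 1)-4); all parameters are illustrative."""
    dim = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.random((rows * cols, dim))                # w_ij, one row per unit
    # 2-D grid coordinate of each unit, used by the neighborhood function h
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    radius0, radius1 = max(rows, cols) / 2.0, 1.0           # cf. Section IV-B
    n_steps = n_epochs * len(data)
    t = 0
    for _ in range(n_epochs):
        for x in data:
            frac = t / n_steps
            lr = lr0 + (lr1 - lr0) * frac                   # linear learning-rate decay
            radius = radius0 + (radius1 - radius0) * frac   # linear radius decay
            d = np.linalg.norm(x - weights, axis=1)         # step 1: d_ij = ||x_k - w_ij||
            winner = np.argmin(d)                           # step 2: BMU
            g = np.linalg.norm(grid - grid[winner], axis=1)
            h = np.exp(-g**2 / (2.0 * radius**2))           # Gaussian neighborhood h
            weights += lr * h[:, None] * (x - weights)      # step 3: update rule (3)
            t += 1
    return weights
```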

The hierarchical self-organizing map (HSOM) is an extension of Kohonen's SOM that refers to a tree of maps. Hierarchical self-organizing networks were first proposed by Luttrell [6]. A multilayer hierarchical SOM (HSOM) for clustering was introduced by Lampinen and Oja [7].

A further advantage of the hierarchy is that different kinds of representations are available at its different levels. The key idea is to use a hierarchical setup of multiple layers where each layer consists of a number of independent SOMs. One SOM is used at the first layer of the hierarchy. For every unit in this map a SOM is created in the next layer of the hierarchy.

The training process of hierarchical feature maps starts with the root SOM on the first layer. This map undergoes standard training. When this first SOM becomes stable, training proceeds with the maps in the second layer. Here, each map is trained with only that portion of the input data that is mapped on the respective unit in the higher layer map. In the HSOM, the BMU of an input vector x is sought from the first-layer map and its index is given as input to the second-layer map [8].

During the training process, the input vectors that are passed down in the hierarchy are compressed: if certain vector entries of all input signals that are mapped onto the same node show no or little variance, they are deemed not to contain any additional information for the subordinate map and thus are not required for training the corresponding sub-tree of the hierarchy. This leads to the definition of different weight vectors for each map, created dynamically as the training proceeds [8].
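As a rough sketch of this top-down procedure, assuming the train_som helper sketched above and a hypothetical variance threshold, a two-layer HSOM could be trained as follows:

```python
import numpy as np

def train_hsom(data, variance_threshold=1e-3):
    """Two-layer HSOM sketch: train a root SOM, then one child SOM per root unit
    on the subset of data mapped to that unit, dropping near-constant input
    dimensions before each child is trained (the threshold is hypothetical)."""
    root = train_som(data)                                   # first-layer map
    bmus = np.array([np.argmin(np.linalg.norm(x - root, axis=1)) for x in data])
    children = {}
    for unit in np.unique(bmus):
        subset = data[bmus == unit]                          # data routed to this unit
        keep = np.var(subset, axis=0) > variance_threshold   # informative dimensions only
        if len(subset) < 2 or not keep.any():
            continue                                         # nothing useful to refine
        children[unit] = (keep, train_som(subset[:, keep]))  # second-layer map
    return root, children
```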

III. THE PROPOSED HIERARCHICAL SOM VARIANTS

Self-organizing map models are often static systems. They are powerful for patterns with no evolution in time, as in character recognition, but present some weaknesses if patterns involve a temporal component, as in speech recognition.

During the last years, several temporal SOM models have been suggested. Two main classes of systems can be mentioned: systems with an external representation of time and systems with an internal representation of time.

In the first class, time is considered as an extra dimension of the input layer; the spatio-temporal Kohonen map (ST-Kohonen) is an example [9]. The ST-Kohonen, proposed by Mozayyani, encodes the time-dependent data explicitly by extending the field of the SOM inputs from the real domain R to the complex plane C. The ST-Kohonen map algorithm works in the same manner as the classical Kohonen one; however, the winner is chosen according to the Hermitian distance.

In the second class, time is encoded in the activation of the network. This class is divided into two subclasses: one devoted to the implicit representation of time, the other to its explicit representation. In the implicit subclass, time is given by the succession of steady states of the network, as in the recurrent SOM, which allows storing information from past input vectors [10]. Explicit representation refers to the existence of sequences and to propagation activity [11]; here, time is clearly implemented, and sequences of events can be directly detected in the neural system.
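As an illustration of the implicit subclass, a minimal sketch of one recurrent SOM step in the spirit of [10] is given below; the leak factor and learning rate are assumed values, and the neighborhood update is omitted for brevity.

```python
import numpy as np

def rsom_step(x, weights, y, alpha=0.3, lr=0.1):
    """One step of a recurrent SOM: each unit keeps a leaky difference vector
    y_i that integrates past inputs, so the BMU depends on recent history."""
    y = (1.0 - alpha) * y + alpha * (x - weights)        # leaky temporal integration
    winner = np.argmin(np.linalg.norm(y, axis=1))        # BMU on the integrated signal
    weights[winner] += lr * y[winner]                    # winner-only update, for brevity
    return weights, y, winner
```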

Our work takes place in an internal and explicit representation of time. This representation seems to be more neuro-biologically plausible.

A. Hierarchical Dynamic Recurrent spiking self-organizing map: HD-RSSOM

Recurrent neural networks (RNNs) were originally developed as a way of extending neural networks to sequential data; time is represented in this model by feeding the preceding state of the network back into its input. This implicit representation of time only takes into account the order of events. The use of spiking neurons makes it possible to also take into account other temporal parameters, such as duration and continuity. One major advantage of this model is its ability to find temporal structure in a signal using synchronization.

In this context, we propose a hierarchical spiking self-organizing neural network. In this model the current activation level is considered to be the neuron's state. The size and shape of a spike are independent of the input, but the time when a neuron fires depends on the input of that neuron. Thereby, information in the proposed model is propagated by the timing of individual spikes.

The dynamic neuron model used is the Leaky Integrator Neuron (LIN) [12], whose threshold changes depending on its firing activity. The firing activity of the neurons, in combination with the interactivity between them, creates a highly dynamical self-organized process. The spiking activity, the neuron's threshold that changes depending on the firing activity, and the time of stabilization of the proposed network can be used to represent its behavior.

In this approach, the state of each neuron is represented by a membrane potential, which is a function of the input and measures the degree of matching between the neuron's weight vector and the current input vector. The last activation of each neuron is stored in a variable called the leaky integrator neuron potential.

The best matching unit (BMU) is selected according to the maximal value of the neuron's activity. After choosing a BMU, learning is applied as follows: the afferent weights of a competitive neuron are adapted in such a way as to maximize their similarity with the current input pattern.

A measure of the similarity is the difference between the postsynaptic potential that encodes the input stimulus and the connection weight [13].
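The following sketch illustrates one step of such a layer of leaky-integrator spiking units. The exponential matching measure, the leak and threshold constants, and the winner-only weight update are assumptions made for illustration; the exact potential and threshold dynamics follow [12], [13].

```python
import numpy as np

def lin_step(x, weights, potential, threshold,
             leak=0.8, theta_rise=0.5, theta_decay=0.95, lr=0.1):
    """One step of a layer of leaky-integrator spiking units (constants assumed).
    The membrane potential integrates how well each weight vector matches the
    input; a unit fires when its potential crosses its adaptive threshold."""
    match = np.exp(-np.linalg.norm(x - weights, axis=1))      # one possible matching measure
    potential = leak * potential + match                      # leaky integration
    bmu = int(np.argmax(potential))                           # BMU = maximal activity
    fired = potential >= threshold                            # spike emission
    threshold = theta_decay * threshold + theta_rise * fired  # firing raises the threshold
    potential = np.where(fired, 0.0, potential)               # reset units that fired
    weights[bmu] += lr * (x - weights[bmu])                   # pull the BMU toward the input
    return weights, potential, threshold, bmu
```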

B. MULTI-LAYERED HD-RSSOM

The purpose of the proposed system is to create autonomous systems that can learn independently and cooperate to provide a better decision in classifying the input samples. The multi-layered HD-RSSOM reduces the complexity of the classification task, and each layer provides its specific corresponding information for an input sample.

At the first level of the hierarchy, we retain the specific information of the macro-class label of the input sample. At the second level of the hierarchy, we retain the specific information of the phoneme label within its macro-class [14].

The first layer of the hierarchy is trained with all phonemes, labeled by macro-class identifiers. The database is divided into seven macro-classes, and the samples of each macro-class are labeled by their phoneme identifier.

The number of elements in the second layer of the hierarchy is the same as the number of macro-classes. Each element of this layer can be regarded as an isolated subsystem; thus, each element is associated with the training database of one macro-class.

The proposed multi-layered HD-RSSOM model is composed of two layers. The first layer is composed of a single HD-RSSOM variant ensuring the classification of the seven macro-classes. The second layer is composed of seven HD-RSSOM models ensuring the classification of the phonemes of a given macro-class (Fig. 1). The same competitive learning system is applied to the sentences and words.

[Figure: a phoneme sample first feeds an HD-RSSOM for macro-class classification; its decision system routes the sample to one of seven second-layer HD-RSSOMs (e.g., affricates, fricatives, vowels), whose outputs feed the decision system of phoneme classification.]

Fig. 1. Multi-layered HD-RSSOM model
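A minimal sketch of this two-layer decision is given below. For simplicity, BMU selection is shown as a nearest-weight lookup, whereas the HD-RSSOM actually selects the BMU by maximal spiking activity; all names here are hypothetical.

```python
import numpy as np

def predict(weights, labels, x):
    """Label of the unit closest to x on a trained, labeled map."""
    return labels[int(np.argmin(np.linalg.norm(x - weights, axis=1)))]

def classify_phoneme(x, macro_map, macro_labels, sub_maps, sub_labels):
    """Two-layer decision of Fig. 1: the first-layer HD-RSSOM assigns one of
    the seven macro-classes, then that macro-class's own second-layer
    HD-RSSOM assigns the phoneme label."""
    macro = predict(macro_map, macro_labels, x)               # e.g. 'fricatives'
    phoneme = predict(sub_maps[macro], sub_labels[macro], x)  # phoneme within the class
    return macro, phoneme
```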

IV. EXPERIMENTAL RESULTS

In this section, we evaluate the proposed HSOM variants for continuous speech recognition on the TIMIT database.

A. SPEECH FEATURE EXTRACTION

The TIMIT Acoustic-Phonetic Continuous Speech Corpus contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The prompts for the 6300 utterances consist of 2 dialect sentences (SA), 450 phonetically compact sentences (SX) and 1890 phonetically diverse sentences (SI). The data were recorded at a sample rate of 16 kHz with a resolution of 16 bits. The total training data used to conduct the experiments are 3696 sentences, while the total testing data are 1344 sentences. TIMIT transcriptions are included together with the speech data and consist of 61 phonemes [15]. Table I shows the list of phonemes of each macro-class of the TIMIT database.

TABLE I. LIST OF PHONEMES OF EACH MACRO-CLASS

Macro-class   Phonemes
Affricates    /jh/, /ch/
Stops         /b/, /d/, /g/, /p/, /t/, /k/, /dx/, /q/, /bcl/, /dcl/, /gcl/, /pcl/, /tcl/, /kcl/
Others        /pau/, /epi/, /h#/
Nasals        /m/, /n/, /ng/, /em/, /en/, /eng/, /nx/
Semi-vowels   /l/, /r/, /w/, /y/, /hh/, /hv/, /el/
Fricatives    /s/, /sh/, /z/, /zh/, /f/, /th/, /v/, /dh/
Vowels        /iy/, /ih/, /eh/, /ey/, /ae/, /aa/, /aw/, /ay/, /ah/, /ao/, /oy/, /ow/, /uh/, /uw/, /ux/, /er/, /ax/, /ix/, /axr/, /axh/

Since the TIMIT speech corpus is provided as raw waveforms, it is necessary to convert it into a parametric representation. For that, we choose to use Mel-Frequency Cepstral Coefficients (MFCC), which is perhaps the best known and most popular preprocessing technique in the field of speech recognition. MFCC gives good discrimination between speech variations [16] and takes human perception sensitivity with respect to frequencies into consideration [17], making MFCC less sensitive to pitch. MFCC also offers better suppression of insignificant spectral variation in the higher frequency bands and is able to preserve sufficient information for speech and phoneme recognition with a small number of required coefficients.

To calculate the Mel-frequency cepstral coefficients (MFCC), the speech waveform, sampled at 16 kHz, is first differentiated (pre-emphasis) and cut into a number of overlapping segments (windowing), each 25 ms long and shifted by 10 ms. Each frame is multiplied by a Hamming window and its Fourier transform (FFT) is computed.

The power spectrum is warped according to the Mel-scale in order to adapt the frequency resolution to the properties of the human ear. Then the spectrum is segmented into a number of critical bands by means of a filter bank. The filter bank typically consists of overlapping triangular filters.

A discrete cosine transformation (DCT) applied to the logarithm of the filterbank outputs results in the raw MFCC vector.
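This pipeline can be reproduced, for instance, with the librosa library; the file path below is illustrative and the FFT size of 512 is an assumption (the 25 ms window and 10 ms shift correspond to 400 and 160 samples at 16 kHz):

```python
import librosa

path = "sa1.wav"                         # hypothetical path to one TIMIT utterance
y, sr = librosa.load(path, sr=16000)     # TIMIT native sampling rate
y = librosa.effects.preemphasis(y)       # pre-emphasis (the differentiation step)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=12,                           # 12 coefficients, as in Section IV-B
    n_fft=512,                           # assumed FFT size
    win_length=400, hop_length=160,      # 25 ms windows shifted by 10 ms at 16 kHz
    window="hamming",                    # Hamming window before the FFT
)                                        # mel filterbank + log + DCT happen inside
print(mfcc.shape)                        # (12, number_of_frames)
```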

B. RECOGNITION EXPERIMENTS

In our experiments, we have used the New England dialect region (DR1), composed of 31 male and 18 female speakers. The corpus contains 25382 phonetic units and 12894 word units.

We have implemented the basic Kohonen model based on sequential learning, the hierarchical SOM (HSOM), the hierarchical recurrent SOM (HRSOM) and the proposed HSOM variants. We built several phoneme bases (one for each of the seven macro-classes) and word bases, each extracted from DARPA TIMIT.

We have compared each variant with: a static model which is the basic SOM, a hierarchical static model which is the HSOM and the temporal hierarchical model which is HRSOM to prove the performance of our models.

The proposed system for continuous speech recognition is based on three main components. The first is a pre-processor that takes sounds and produces mel-cepstrum vectors: the sound input space is composed of 12 mel-cepstrum coefficients per 16 ms frame, and the data vectors are generated by selecting 11 frames at the middle of each phoneme and all frames of each word. The second component is a competitive learning module. The third component is a phoneme and word recognition module.

For all maps, the learning rate decreases linearly from 0.9 to 0.05. The neighborhood radius also decreases linearly, from half the diameter of the lattice to one.
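Written out, these schedules amount to a simple linear interpolation; the 1000 training steps and the 10x10 map below are assumed for illustration:

```python
def linear_schedule(t, n_steps, start, end):
    """Linearly interpolate a training parameter from start to end over n_steps."""
    return start + (end - start) * t / max(1, n_steps - 1)

n_steps, side = 1000, 10                                  # assumed steps and map side
lrs = [linear_schedule(t, n_steps, 0.9, 0.05) for t in range(n_steps)]
radii = [linear_schedule(t, n_steps, side / 2.0, 1.0) for t in range(n_steps)]
```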

C. RESULTS AND DISCUSSIONS

In this section we present a comparison of the recognition rates of the 7 macro-classes and of the words of the TIMIT database obtained by using, respectively, the SOM based on sequential learning, the HSOM, the basic recurrent HSOM, the HSOM with dynamic recurrent spiking neurons (HD-RSSOM) and the multi-layered HD-RSSOM.

We should note that the recognition rates depend on many parameters (map size, learning rate, neighborhood function, neighborhood radius and number of iterations) as well as on the number of frames.

From Table II, the proposed HSOM variants provide the best recognition accuracy in comparison with the basic SOM and HSOM.

With the multi-layered HD-RSSOM we obtained an improvement of the recognition rate of between 20 and 30% on the training and test sets in comparison with the SOM and HSOM.

According to Table II, the proposed recurrent HSOM variants, HD-RSSOM and multi-layered HD-RSSOM, obtained an improvement of the recognition rate of between 10 and 20% in comparison with the basic recurrent HSOM. These results prove the stability and the performance of our recurrent HSOM variants.


The multi-layered HD-RSSOM reaches good recognition rates, between 80 and 95% on the training and test sets, for /Affricates/, /Others/ and /Semivowels/.

We should note also that with the SOM and HSOM we obtained weak recognition rates, between 10 and 20% on the test set, for /Stops/ and /Affricates/.

TABLE II. GENERAL RECOGNITION RATES OF THE 7 MACRO-CLASSES OF TIMIT DATA BASE (TRAINING AND TEST SET)

Macro-classes   SOM             HSOM            HRSOM           HD-RSSOM        Multi-layered HD-RSSOM
                training  test  training  test  training  test  training  test  training  test
Vowels          56.80    38.59  60.08    48.37  65.73    59.18  82.96    74.82  82.86    75.98
Fricatives      66.21    58.01  68.36    61.33  70.89    67.26  73.53    69.33  80.00    72.97
Nasals          57.18    49.80  62.81    60.80  76.87    71.66  83.75    77.36  88.75    77.94
Affricates      23.37    10.31  23.37    12.69  11.86    22.22  36.36    26.19  94.80    82.86
Semivowels      59.11    37.04  63.86    44.09  70.57    60.46  82.23    70.97  92.40    80.35
Stops           37.46    13.69  37.84    23.46  41.53    29.03  49.28    36.61  57.50    47.29
Others          65.35    69.16  69.28    70.44  72.78    71.42  75.62    73.16  89.06    84.27
Average         55.21    40.71  58.07    48.31  63.21    57.07  74.10    66.52  83.62    75.38

According to Table III, the proposed variants provide the best recognition accuracy in comparison with the other models.

It is also noticed that the multi-layered HD-RSSOM provides the best recognition rate, of 87.32% on the training set and 83% on the test set.

With the multi-layered HD-RSSOM we obtained an improvement of the recognition rate of between 30 and 40% on the training set and between 30 and 60% on the test set in comparison with the SOM and HSOM.

We should note also that with the HSOM and SOM models some words, such as /your/, /to/ and /like/, cannot be reliably recognized, with recognition rates between 0 and 30%.

On the other hand, HD-RSSOM and multi-layered HD-RSSOM provide the best recognition rates for the words /dark/, /all/ and /that/, between 80% and 95% on the training and test sets.

The higher recognition rates for these words prove the better performance of the proposed HSOM variants.

TABLE III. GENERAL RECOGNITION RATES OF THE WORDS (TRAINING AND TEST SET)

Words     SOM             HSOM            HRSOM           HD-RSSOM        Multi-layered HD-RSSOM
          training  test  training  test  training  test  training  test  training  test
she       55.41    52.45  62.22    69.94  70.58    51.91  85.13    81.96  84.21    74.86
had       50.90    20.96  60.90    45.16  72.27    54.83  79.31    68.27  83.86    86.55
your      29.06     0.00  35.96    37.50  57.63    44.79  72.90    55.20  74.87    82.29
dark      53.73    31.69  67.82    58.45  70.26    55.98  84.00    69.71  87.47    80.98
suit      64.70    32.19  70.58    64.04  73.68    68.49  81.42    74.31  85.29    77.73
in        57.30     0.00  66.08    67.85  73.68    71.42  79.53    88.39  87.71    87.50
greasy    37.12    30.31  52.54    39.37  55.38    54.10  77.39    61.75  81.43    78.18
wash      53.34    36.87  62.77    56.63  70.49    59.29  76.15    73.74  83.87    80.82
water     46.63    26.20  57.04    37.90  57.04    60.88  74.40    65.32  84.16    82.25
all       51.96    51.66  64.37    62.22  77.12    64.44  91.50    85.55  92.15    93.33
year      66.73    74.18  70.24    67.27  87.60    84.00  91.52    85.81  93.59    85.09
ask       42.75     2.79  49.14    39.53  62.40    56.74  79.11    75.81  93.12    87.90
me        31.42     1.33  50.71    29.33  67.85    28.00  74.28    56.00  89.28    82.66
to        20.83    21.42  25.00    21.42  39.16    25.71  54.16    40.00  90.00    81.42
carry     45.16    24.10  61.33    53.57  75.65    68.45  80.66    77.38  88.47    86.60
an        24.59    11.39  35.24    37.15  53.27    37.97  64.75    62.02  82.78    58.22
oily      49.06    31.43  56.31    47.15  78.05    55.85  86.54    79.26  91.71    88.96
rag       43.61    29.87  55.59    53.77  68.36    66.66  79.96    76.10  90.37    82.70
like      28.66     2.15  38.31    24.19  53.58    43.01  67.91    63.44  84.11    73.65
that      69.87    73.87  77.66    79.57  85.65    80.78  94.67    92.19  95.08    96.39
Average   48.54    31.64  58.76    52.25  69.14    60.77  80.12    73.53  87.32    83.00

V. CONCLUSION

In this paper, novel hierarchical methods based on hierarchical self-organizing networks were proposed for continuous speech recognition. The proposed hierarchical neural network models combine, on the one hand, the dynamical characteristics of SNNs and the advantages of RNNs and, on the other hand, the capacity of the hierarchical self-organizing model to reduce the cost of the training process and to extract the hierarchical structure of the data.

The first proposed variant is the Hierarchical Dynamic recurrent spiking self-organizing map (HD-RSSOM). In this model the firing activity of all neurons can be considered as a temporal pattern with an explicit temporal structure.


In this approach, the state of each neuron is given by a membrane potential which is a function of the input; this potential measures the degree of matching between the neuron's weight vector and the current input vector.

The second proposed variant is the multi-layered HD-RSSOM model, which is a multi-layer extension of the HD-RSSOM model. The multi-layered HD-RSSOM is composed of two layers. The first layer is composed of a single HD-RSSOM ensuring the classification of the seven macro-classes and sentences. The second layer is composed of seven HD-RSSOM models ensuring the classification of the phonemes and words of a given macro-class and sentence.

The advantage of such a hierarchical model is the construction of several elementary HD-RSSOMs as isolated modules, which can learn independently and cooperate. Moreover, it offers a rapid search for the winner (BMU), faster convergence, better generalization ability, a greater capacity for abstraction and a reduced computational cost.

We also compared the proposed hierarchical SOM models with other recognition methods, including the basic SOM, the hierarchical SOM and the hierarchical RSOM. The proposed HSOM variants provide the best recognition rates in comparison with these models.

For phoneme recognition, the multi-layered HD-RSSOM provides the best general recognition rates, of 83.62% on the training set and 75.38% on the test set. The multi-layered HD-RSSOM also provides the best general recognition rates for words, of 87.32% on the training set and 83% on the test set. With our proposed HSOM variants we obtained an improvement of the recognition rate of between 20% and 50% on the training and test sets in comparison with the SOM, HSOM and HRSOM.

As future work, we will continue to study the hierarchical recognition strategy, look for new supervised and unsupervised learning methods, and use other feature extraction techniques that are able to improve recognition performance.

We also suggest hybridizing the HSOM with a genetic algorithm, on the one hand to fine-tune the HSOM parameters and on the other hand to select the training data set input, with the objective of improving the recognition rates.

REFERENCES

[1] S. Kaski, J. Kangas, and T. Kohonen, "Bibliography of self-organizing map," Neural Computing Surveys, vol. 1, no. 3-4, pp. 1-176, 1998.
[2] W. Gerstner, "What's different with spiking neurons," in H. Mastebroek and H. Vos, Eds., Plausible Neural Networks for Biological Modelling, Kluwer Academic Publishers, pp. 23-48, 2001.
[3] W. Maass and C. M. Bishop, Pulsed Neural Networks, MIT Press, 1999.
[4] T. Kohonen, Self-Organizing Maps, 3rd ed., Springer, Berlin, 2001.
[5] N. Arous and N. Ellouze, "Phoneme classification accuracy improvements by means of new variants of unsupervised learning neural networks," Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics, Florida, USA, pp. 14-18, 2002.
[6] S. P. Luttrell, "Hierarchical self-organizing networks," Proceedings of the International Conference on Neural Networks, London, pp. 2-6, 1989.
[7] J. Lampinen and E. Oja, "Clustering properties of hierarchical self-organizing maps," Journal of Mathematical Imaging and Vision, vol. 2, pp. 261-272, 1992.
[8] L. Vicente and A. Vellido, "Review of hierarchical models for data clustering and visualization," in R. Giraldez et al., Eds., Tendencias de la Minería de Datos en España, Red Española de Minería de Datos, 2004.
[9] N. Zouhour, B. Laurent, and A. Frédéric, "Spatio-temporal biologically inspired models for clean and noisy speech recognition," Neurocomputing, vol. 71, pp. 131-136, 2007.
[10] M. Varsta, J. Heikkonen, and R. Milan, "A recurrent self-organizing map for temporal sequence processing," Proceedings of the International Conference on Artificial Neural Networks, pp. 421-426, 1997.
[11] T. Behi, N. Arous, and N. Ellouze, "Comparative study of SOM variants in recurrent pulsed neural networks. Case study: phoneme and word recognition on the TIMIT speech corpus," International Review on Computers and Software, vol. 7, no. 6, pp. 3184-3194, 2012.
[12] J. G. Taylor, "Temporal patterns and leaky integrator neurons," Proceedings of the International Conference on Neural Networks, Paris, pp. 952-955, 1990.
[13] T. Behi, N. Arous, and N. Ellouze, "Spike timing dependent competitive learning in recurrent self-organizing pulsed neural networks. Case study: phoneme and word recognition," International Journal of Computer Science Issues, vol. 9, issue 4, no. 2, pp. 328-337, 2012.
[14] N. Arous and N. Ellouze, "Cooperative supervised and unsupervised learning algorithm for phoneme recognition in continuous speech and speaker-independent context," Neurocomputing, Special Issue on Neural Pattern Recognition, vol. 51, pp. 225-235, 2003.
[15] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT Acoustic-Phonetic Continuous Speech Corpus," Linguistic Data Consortium, 2005.
[16] S. Young, The HTK Book, 8th ed., Cambridge University Engineering Department, 2006.
[17] J. Wong Jing Lung, M. S. H. Salam, M. S. M. Rahim, and A. M. Ahmad, "Implementation of vocal tract length normalization for phoneme recognition on TIMIT speech corpus," International Conference on Information Communication and Management, IPCSIT vol. 16, 2011.