
APPLIED PROBLEMS

Optimization of Dictionary and Model Library for Recognition of Speech Commands¹

V. R. Krasheninnikovᵃ, N. A. Krasheninnikovaᵇ, V. V. Kuznetsovᵃ, and E. U. Lebedevaᵃ

ᵃ Ulyanovsk State Technical University, Severny Venets St. 32, Ulyanovsk, 432027 Russia
ᵇ Ulyanovsk State University, Lev Tolstoy St. 42, Ulyanovsk, 432970 Russia

e-mail: [email protected], [email protected], [email protected], [email protected]

Abstract: The article considers the problem of optimizing the dictionary and the model library for recognition of speech commands against a background of noise. Algorithms that yield good solutions of the problem in an acceptable time are suggested. As an example, the optimization of command recognition by cross-correlation portraits is considered.

DOI: 10.1134/S1054661811020611

Received January 23, 2011

¹ The article was translated by the authors.

ISSN 1054-6618, Pattern Recognition and Image Analysis, 2011, Vol. 21, No. 3, pp. 505–507. © Pleiades Publishing, Ltd., 2011.

1. INTRODUCTION

Speech information control systems (SICS) are now in wide use. They are especially easy to use when the operator can communicate with them in his native language by means of speech commands (SC). Such SICS are particularly useful in transport and at work, when it is necessary to operate actuating mechanisms, to obtain information about their condition and the state of the environment, and to reduce the operator's workload. Such SICS must operate under intense noise (wind, engine, etc.). Existing SC recognition systems function under low noise, but they are not reliable in a noisy environment.

Under intense noise, speaker-dependent SC recognition is usually used, together with a library of model speech commands. The quality of recognition then depends considerably on how well this library has been formed. As an example, let us consider SC recognition by means of cross-correlation portraits. There are 20 pronunciations, recorded in a noisy room, of each of 18 SC. From these, a library of model commands (MC) consisting of three pronunciations of each SC is formed at random. For this library, the percentage P of correctly recognized SC is determined using the cross-correlation portraits. In total, 5000 such random library samplings were made. The table presents the statistical series for the results of this experiment.

Table. Statistical series for the results of the experiment

P, %        56   57   58   59   60   61   62   63   64   65
Frequency   146  376  589  775  876  798  580  390  280  190

The table shows that the recognition quality strongly depends on the composition of the library. The range of values of P here is from 56 to 65 percent (the true range is wider, since not all possible variants of the library were examined in the experiment).

Under intense noise the range of values of P was 20% and more. A similar range of P is possible in practice if the MC are chosen by chance or if the required number of pronunciations is simply taken.

Let us emphasize that small values of P are unlikely if the library is formed randomly (average values are more probable). That is why recognition usually has an acceptable quality level even without special selection of models. However, there exist variants of the library that provide values of P considerably higher than the average ones. Such variants are also unlikely if the library is formed randomly, so they may not appear even in a very long random search. The development of methods for directed library optimization is therefore a topical problem.

2. THE PROBLEM SETTING AND THE QUALITY CRITERIA OF ITS SOLUTION

Let us formulate the task under consideration. The dictionary consists of $m$ SC $\{C_1, C_2, \dots, C_m\}$. For each SC $C_i$ there exists a set of its pronunciations $P_i = \{p_{i1}, p_{i2}, \dots, p_{in_i}\}$. This set can consist of real pronunciations or can be formed artificially. It can also include pronunciations against the background of different noises. In general, this set must describe, as completely as possible, the variants of a given SC that may occur during its recognition.

For any elements $p$ and $q$ from $P = P_1 \cup \dots \cup P_m$, a function (quasimetric) $d(p, q)$ is defined. Among the metric axioms, it may fail to satisfy the triangle inequality. The value $d(p, q)$ measures some difference between the elements $p$ and $q$. It may be the difference between the spectra of the sound signals, their wavelet transforms, etc.

Let us note that only the possibility of determining these distances matters to us. They are determined using a particular recognition algorithm. In this article, the distance (difference) between the cross-correlation portraits of SC $p$ and $q$ is taken as $d(p, q)$.
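Since $d$ is only required to be a quasimetric, it can be useful to see how far a given set of distances departs from the triangle inequality. The following is a minimal sketch, assuming the distances are stored as a square list of lists `d`; the function name `triangle_violations` and the toy values are illustrative and do not come from the article.

```python
from itertools import permutations


def triangle_violations(d, tol=1e-12):
    """Return all index triples (i, j, k) with d[i][k] > d[i][j] + d[j][k]."""
    n = len(d)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if d[i][k] > d[i][j] + d[j][k] + tol
    ]


# Toy symmetric dissimilarity matrix (made-up values, not measured data).
d = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]

violations = triangle_violations(d)
print(violations)  # the direct distance 0-2 exceeds the detour through 1
```

An empty result would mean that these particular distances behave like a metric; none of the algorithms discussed in the article relies on that.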

From every set $P_i$ it is necessary to choose a subset $E_i = \{e_{i1}, e_{i2}, \dots, e_{ik}\} \subset P_i$ of $k$ elements, which we will call MC. The totality of all models composes the MC library. This library must be optimal in terms of some quality criterion $U$.

The primary criterion is the probability (relative frequency) of correct recognition. Thus, the optimal library is the one that maximizes the criterion

$$U_1 = K/N, \tag{1}$$

where $K$ is the number of correctly recognized commands and $N$ is their overall number.
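As a rough illustration of criterion (1), the sketch below counts correct recognitions under two assumptions that the article does not state explicitly: a pronunciation is assigned to the command of its nearest model, and only non-model pronunciations are tested. The function name `u1` and the one-dimensional toy data are hypothetical.

```python
def u1(d, labels, models):
    """Fraction of non-model pronunciations whose nearest model
    belongs to the same command (assumed nearest-model rule)."""
    model_set = set(models)
    tested = [p for p in range(len(labels)) if p not in model_set]
    correct = 0
    for p in tested:
        nearest = min(models, key=lambda e: d[p][e])
        correct += labels[nearest] == labels[p]
    return correct / len(tested)


# Toy data: six pronunciations placed on a line, two commands (0 and 1).
positions = [0.0, 0.2, 0.9, 1.1, 1.3, 2.0]
labels = [0, 0, 0, 1, 1, 1]
d = [[abs(a - b) for b in positions] for a in positions]

print(u1(d, labels, models=[0, 5]))  # one model per command
print(u1(d, labels, models=[0, 3]))  # a different choice of models
```

With these toy values the first library recognizes every tested pronunciation and the second one misclassifies one of four, which mirrors on a small scale the dependence of P on the composition of the library.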

Other criteria, which characterize the geometric relations between the distances among SC, are also possible. The average distance

$$\bar{d} = \frac{1}{M} \sum_{i=1}^{m} \sum_{p \in P_i} \min\{ d(p, e),\ e \in E_i \} \tag{2}$$

from an SC to its nearest model characterizes the completeness of the description of this SC, so it should be as small as possible. Here $M$ is the number of non-model commands. The average distance between models of different commands,

$$D_1 = \frac{1}{mk} \sum_{i} \sum_{e \in E_i} \min\{ d(e, f),\ f \notin E_i \}, \tag{3}$$

where $mk$ is the number of models and $f$ ranges over the models of the other commands, must on the contrary be as large as possible. The same is true of the average distance from an SC to the nearest model of the other commands,

$$D_2 = \frac{1}{N} \sum_{i} \sum_{j} \min\{ d(p_{ij}, f),\ f \notin E_i \}. \tag{4}$$

Hence two more quality criteria are obtained:

$$U_2 = \bar{d}/D_1, \qquad U_3 = \bar{d}/D_2. \tag{5}$$

They should be minimized.

The maximum of function (1) or the minimum of functions (5) corresponds to the optimal library. It is impossible to solve this problem by exhaustive search even if the number of pronunciations is small: in the example from the introduction (18 SC, 20 pronunciations of each, three models per SC) there are $\binom{20}{3}^{18} = 1140^{18} \approx 10^{55}$ possible libraries.

3. THE ALGORITHMS OF MODEL LIBRARY OPTIMIZATION

In [1, 2] the following ways of MC library optimization are suggested.
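Each of these procedures repeatedly evaluates a quality criterion on candidate libraries. The following is a minimal sketch of the geometric criteria (2) to (5) of Section 2: the average distance to the nearest own model, the average distance between models of different commands, the average distance to the nearest foreign model, and the two ratios. It assumes a precomputed distance matrix `d` and takes $N$ to be the total number of pronunciations; the function name `criteria`, the index-list representation of $P_i$ and $E_i$, and the toy data are illustrative assumptions.

```python
def criteria(d, classes, library):
    """classes[i]: indices of the pronunciations P_i;
    library[i]: indices of the models E_i chosen from P_i."""
    all_models = [e for E in library for e in E]
    n_total = sum(len(P) for P in classes)      # N (assumed: all pronunciations)
    n_nonmodel = n_total - len(all_models)      # M
    d_bar = D1 = D2 = 0.0
    for P, E in zip(classes, library):
        others = [f for f in all_models if f not in E]  # models of other commands
        d_bar += sum(min(d[p][e] for e in E) for p in P)
        D1 += sum(min(d[e][f] for f in others) for e in E)
        D2 += sum(min(d[p][f] for f in others) for p in P)
    d_bar /= n_nonmodel
    D1 /= len(all_models)
    D2 /= n_total
    return {"d_bar": d_bar, "D1": D1, "D2": D2,
            "U2": d_bar / D1, "U3": d_bar / D2}


# Toy data: six pronunciations on a line, two commands, one model each.
positions = [0.0, 0.2, 0.9, 1.1, 1.3, 2.0]
d = [[abs(a - b) for b in positions] for a in positions]
classes = [[0, 1, 2], [3, 4, 5]]
library = [[0], [5]]

c = criteria(d, classes, library)
print(c)
```

Smaller values of `U2` and `U3` indicate models that lie close to their own command and far from the others.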

The algorithm of improvement of an available solution. At first an initial MC library $E_1$ is chosen at random, and the corresponding value $U(E_1)$ is computed. Then all variants of replacing the first MC $e_{11}$ of the first SC by an element of $P_1 \setminus E_1^1$ are tried (the superscript marks the models of the first SC). The best of $E_1$ and these variants (in the sense of the minimum of $U$) is stored and taken as $E_2$, with $E_2^1 = \{e'_{11}, e_{12}, \dots, e_{1k}\}$, where $e'_{11}$ is the optimal replacement of the model $e_{11}$. Then replacements of the second MC $e_{12}$ of the first SC in $E_2$ by elements of the set $P_1 \setminus E_2^1$ are tried, and so on until the set of models $E_{k+1}^1$ is obtained. Then the models of the other SC are replaced in the same way. The described procedure of MC set improvement is carried out twice.

Experiments with this algorithm have shown that the resulting set of MC usually reaches a dead end (it cannot be improved further by the procedure) and, even if it is not optimal, it is close to optimal. Of course, the final result depends substantially on the initial choice of MC, so it is better to start from several random initial variants.

Applying this algorithm to the example considered in the introduction gave a library with which 81 percent of SC were correctly recognized.

Gravitation algorithm. Let the pronunciations be represented as points of an $m$-dimensional Euclidean space with the usual metric (the space of features by which SC recognition is performed). Let them be material points of unit mass in a viscous medium. Then these points experience mutual gravitational attraction together with the resistance of the medium. Points that are closer to each other attract more strongly; they approach each other more quickly and merge into clusters. In each cluster, the point nearest to the center is taken as an MC. To make (3) and (4) grow, repulsion between points belonging to different SC is introduced.

Algorithms of fuzzy clustering. The problem considered in this article is close to the problem of fuzzy clustering. The latter also requires dividing some elements into classes (clusters), but the membership of an element in each cluster is fuzzy, i.e., an element belongs to some degree to all clusters, and the total measure of its membership in all clusters is equal to one. Besides, the representatives (centers) of the clusters are not necessarily elements of the initial set; they are typical representatives of their clusters in the sense of this fuzzy membership. There exist a number of iterative algorithms for the quasi-optimal solution of the fuzzy clustering problem.

Similar algorithms have been used for the problem under consideration, i.e., the problem of model choice. The degree of membership has been chosen as a fixed decreasing function of distance.
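The improvement algorithm described at the start of this section can be sketched as a coordinate-wise replacement search. This is a minimal sketch under stated assumptions, not the authors' implementation: the criterion `U` is passed in as a function to be minimized, the initial library is given explicitly instead of being drawn at random so that the run is reproducible, and the names `improve_library` and `total_distance` as well as the toy data are hypothetical.

```python
def improve_library(classes, library, U, sweeps=2):
    """For each command and each model slot in turn, try every non-model
    pronunciation of that command and keep a replacement only if it
    lowers U. The whole pass is repeated `sweeps` times."""
    library = [list(E) for E in library]
    best = U(library)
    for _ in range(sweeps):
        for i, P in enumerate(classes):
            for slot in range(len(library[i])):
                current = library[i][slot]
                for cand in P:
                    if cand in library[i]:
                        continue
                    library[i][slot] = cand
                    value = U(library)
                    if value < best:
                        best, current = value, cand
                    library[i][slot] = current  # best choice found so far
    return library, best


# Toy data: six pronunciations on a line, two commands.
positions = [0.0, 0.2, 0.9, 1.1, 1.3, 2.0]
classes = [[0, 1, 2], [3, 4, 5]]


def total_distance(library):
    """Stand-in criterion: summed distance from each pronunciation
    to the nearest model of its own command."""
    return sum(
        min(abs(positions[p] - positions[e]) for e in E)
        for P, E in zip(classes, library)
        for p in P
    )


improved, value = improve_library(classes, [[0], [5]], total_distance)
print(improved, value)
```

In practice `U` would be one of the criteria of Section 2, and the search would be restarted from several random initial libraries, as the text recommends.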


4. LIBRARIES WITH DIFFERENT NUMBERS OF MODEL COMMANDS

In the formulated problem of forming the MC library it was assumed that the number of MC is the same for all SC. This requirement is not necessary: different SC may have different variability, so it is desirable to increase the number of MC for SC with larger variability and to decrease it for less probable SC. Then, with the overall number of models fixed, it is possible to obtain libraries with a higher probability of correct recognition. The software implementation of recognition with such libraries is only slightly more complicated.

The algorithms described above are easily modified to build such libraries. During optimization, instead of the condition that the numbers of MC are equal, a new condition is introduced: the overall number of models is fixed and each command has at least one model.
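One possible way to respect the new condition inside a replacement search is to add a move that transfers a model slot from one command to another. The sketch below enumerates such moves; it is an illustrative assumption about how the modification could look, not a procedure given in the article, and the name `transfer_moves` is hypothetical.

```python
def transfer_moves(classes, library):
    """Yield libraries obtained by removing one model from a command that
    has at least two and adding one non-model pronunciation to another
    command. The total number of models stays fixed and every command
    keeps at least one model."""
    for i, E_i in enumerate(library):
        if len(E_i) < 2:
            continue  # this command cannot give up its only model
        for drop in E_i:
            for j, P_j in enumerate(classes):
                if j == i:
                    continue
                for add in P_j:
                    if add in library[j]:
                        continue
                    new = [list(E) for E in library]
                    new[i].remove(drop)
                    new[j].append(add)
                    yield new


classes = [[0, 1, 2], [3, 4, 5]]
library = [[0, 1], [5]]  # three models in total, split unevenly

moves = list(transfer_moves(classes, library))
for lib in moves:
    print(lib)
```

Each candidate produced this way can be scored with the chosen criterion and accepted only if it improves on the current library.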

5. FORMATION OF DICTIONARY OF SPEECH COMMANDS FROM A NUMBER OF SYNONYMS

At the initial stage of SICS development it is sometimes possible to change the set of SC itself. Namely, if some set of SC already exists, some SC can be replaced by their synonyms in order to improve recognition. Hence there arises the task of dictionary formation, i.e., of choosing a set of SC from among their synonyms. The distinction between the chosen commands should be as high as possible; in that case it is natural to expect a better quality of SC recognition.
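As a rough sketch of this selection task, the code below picks one synonym per command so that the chosen commands are as distinct as possible. It scores a candidate dictionary by the average distance from each chosen command to its nearest chosen neighbour and by the smallest such distance, and it searches exhaustively, which is feasible only for toy sizes. The function names, the distance matrix, and the synonym groups are illustrative assumptions.

```python
from itertools import product


def dictionary_scores(d, chosen):
    """Return (average, minimum) of the nearest-neighbour distances
    among the chosen commands."""
    nearest = [min(d[a][b] for b in chosen if b != a) for a in chosen]
    return sum(nearest) / len(nearest), min(nearest)


def best_dictionary(d, synonyms, score_index=1):
    """Exhaustive search over all one-synonym-per-command choices.
    score_index=0 maximizes the average, 1 the minimum (maxmin)."""
    return max(
        product(*synonyms),
        key=lambda chosen: dictionary_scores(d, chosen)[score_index],
    )


# Toy data: five candidate words on a line; command 1 has no synonyms.
positions = [0.0, 1.0, 1.2, 1.5, 3.0]
d = [[abs(a - b) for b in positions] for a in positions]
synonyms = [[0, 1], [2], [3, 4]]

chosen = best_dictionary(d, synonyms)
print(chosen, dictionary_scores(d, chosen))
```

For realistic dictionary sizes the exhaustive `product` would be replaced by one of the directed search procedures of Section 3.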

Let us formulate the problem under consideration. The dictionary consists of $m$ SC: $\{C_1, C_2, \dots, C_m\}$. For every SC $C_i$ there exists a set of its synonyms $P_i = \{p_{i1}, p_{i2}, \dots, p_{in_i}\}$. If some command $C_i$ has no other synonyms, then the corresponding set $P_i$ contains only one element, the command $C_i$ itself. As before, for any elements $p$ and $q$ from $P = P_1 \cup \dots \cup P_m$ a distance $d(p, q)$ is defined, which is the degree of distinction between $p$ and $q$. From every set $P_i$ it is necessary to choose one element $E_i \in P_i$. Their set $E = \{E_1, E_2, \dots, E_m\}$ will be the formed set of SC, i.e., the dictionary. The dictionary must be optimal in terms of a criterion that reflects the distinction between commands.

Thus the task of SC dictionary formation differs from the main task of MC library formation only in that the classes $P_i$ consist of synonyms and only one element must be chosen from each class.

Here the distances between elements within classes play no role; only the distances between the SC of the dictionary $E = \{E_1, E_2, \dots, E_m\}$ are important. So the average distance from each dictionary command to the command nearest to it can be taken as a quality criterion:

$$D = \frac{1}{m} \sum_{i=1}^{m} \min\{ d(E_i, E_j),\ E_j \neq E_i \}, \tag{6}$$

which must be maximized.

Instead of this criterion, one can require that the minimum of the distances between SC be maximal. In that case the value

$$h = \max \min\{ d(E_i, E_j),\ E_j \neq E_i \} \tag{7}$$

will be the guaranteed distinction of any pair of commands from the formed dictionary (maxmin criterion).

The experiments conducted showed that if the dictionary is formed randomly, the formulated criteria have distributions with a wide range of values, similar to the distributions for the MC library. That is why optimization of the SC dictionary is advisable.

The problem under consideration can be solved by all the methods of MC library formation from known distances considered in this article, since the two tasks are fundamentally similar. Here only the distances between synonyms of different SC are needed. These distances can be found in the usual way, i.e., by applying a recognition algorithm to synonyms pronounced by a speaker.

However, the dictionary is formed not for a specific speaker but for a specific SICS, or even for wider use. So it is advisable to use averaged data on the distances (distinctions) between synonyms pronounced by different speakers.

6. CONCLUSIONS

The suggested algorithms make it possible to obtain good solutions to the problems of optimizing the SC dictionary and the MC library in an acceptable time. The experiments have shown that with these algorithms the quality of SC recognition was higher than with random choice.

This study was supported by RFBR, grant no. 09-08-00725-a.

REFERENCES

1. V. R. Krasheninnikov, N. A. Krasheninnikov, and V. V. Kuznetsov, “The Algorithms for Choosing the Patterns for Voice Commands under Speech Recognition,” in Proc. 62th Sci. Session Devoted to Radio Day (NTO RES im. A. S. Popova, Moscow, 2007), pp. 158–159.