
VoiceXML – Speech Recognition

Yousef Rabah

February 25, 2004

CS480: Survey Paper

Speech is the primary means of communication between humans. The process includes the selection of suitable words or sets of words to express the meaning in the speaker's thoughts, the appropriate grammatical form, and the production of the correct articulatory gestures that give rise to acoustic vibrations in the physical world. Speech recognition is one of the fastest growing technologies, driven by advances in computing. During the early stages of building speech recognition systems, compactness, noise, cost per channel, and the naturalness of the human-machine interface were factors that were often overlooked.

Speech recognition is seen as one of the most promising market technologies of the future. For example, in 1997 the speech technology industry's output was $500 million, and it expanded to $38 billion by 2003. Voice command applications are expected to cover many aspects of our future daily life; telephones, car computers, Internet surfing, and other general appliances are candidates for this. On the technical level, speech recognition algorithms have benefited from the increasing processing power provided by hardware. Standards have also emerged, such as VoiceXML, a high-level programming interface to speech technology influenced by the W3C's (World Wide Web Consortium's) XML. VoiceXML will be an access point that helps connect voice with the Internet.

VoiceXML is a programming language for scripting voice interactions between a person and a computer. VoiceXML uses speaker-independent dialogs in which the computer is always aware that the user is drawing from a finite vocabulary of words. Unlike HTML, VoiceXML comes from a programming language background: it has variables, control constructs, event handlers, nested scoping, and so on. A spoken dialog is the basic element of interaction, in which the computer produces spoken prompts to bring forth spoken responses from the user. The prompts may be recorded or produced with text-to-speech synthesis. The user's spoken responses to the computer's prompts are processed using speech recognition and grammars defined in the VoiceXML program. The grammar specifies the set of responses that can be recognized by the computer. For example, a simple grammar may specify some fixed phrases to be used as commands, while a more complex grammar may specify a set of basic words (the vocabulary) from which the words form an expression or sentence. The grammar is the primary input to the voice recognition technology that underlies the VoiceXML interpreter. A grammar is specified within the VoiceXML program, inline or in an external file. VoiceXML relies on an Automatic Speech Recognition system.
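
To make the grammar idea concrete, here is a minimal Python sketch (not actual VoiceXML) of matching a recognizer's output against a fixed-phrase grammar; the phrases and the function name are hypothetical illustrations, not part of the VoiceXML standard.

# Minimal sketch (not VoiceXML itself): a fixed-phrase "grammar" constraining
# which spoken responses a dialog turn will accept. Phrases are hypothetical.
FLIGHT_GRAMMAR = {"book a flight", "check my reservation", "cancel"}

def match_against_grammar(recognized_text, grammar):
    """Return the grammar phrase that the recognizer output matches,
    or None to signal that the utterance is out of grammar."""
    utterance = recognized_text.strip().lower()
    return utterance if utterance in grammar else None

# One prompt/response turn: only in-grammar responses drive the dialog forward.
print(match_against_grammar("Book a flight", FLIGHT_GRAMMAR))  # 'book a flight'
print(match_against_grammar("order a pizza", FLIGHT_GRAMMAR))  # None -> reprompt

In actual VoiceXML the grammar would be declared inline or in an external grammar file, and out-of-grammar input raises a nomatch event that the dialog can handle, for example with a reprompt.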

A good way to start looking at speech recognition systems is to look at the similarities between automatic and human speech recognition. Human beings rely on a parametric model in which knowledge is gained by parameter tuning: the parameters represent the modeled phenomena to be recognized as accurately as possible. Humans process information with a very large biological neural network, while an Automatic Speech Recognition (ASR) system implements a much simpler model, for example using Hidden Markov Models (HMMs) and/or other models. The first step is initialization: a human baby is born with almost no knowledge (the neural network is in an almost clear state), and the HMM models of an ASR system are initialized with zero knowledge. At startup, neither the baby nor the ASR system can recognize words, but with training both have the potential to learn any language. The next step is training, which is performed by listening to speakers and associating the acoustic speech with its meaning; this applies to both ASR systems and human beings. The effect of training is great. For humans, training sets and reinforces connections in the neural network, and recognition consists of associating speech with the related concepts; for ASR, training improves the statistical estimates of the HMM parameters by which strings of words are associated with speech. In both cases one can conclude that the more training, the better the recognition accuracy.

Looking at ASR, different systems are developed for different speakers. For example, an ASR system can be speaker dependent, adaptive, or speaker independent. A speaker-dependent system is designed for a single speaker only; it is easier to develop but not flexible to use. A speaker-independent system is designed for any speaker of a particular language variety (e.g., US English); these systems are the hardest and most expensive to develop, and their accuracy is lower than that of speaker-dependent systems, although they are much more flexible. An adaptive system adapts its operation to the characteristics of new speakers, and its difficulty lies in between the speaker-dependent and speaker-independent systems.

The vocabulary size of a speech recognition system affects its complexity, processing requirements, and accuracy. Some systems use large dictionaries and others use only a few words, depending on what you are trying to build. The range goes from a small vocabulary consisting of tens of words to a very large vocabulary consisting of tens of thousands of words.

There are two speech system types. The first is an isolated-word system that handles single words at a time. This is the simplest type of recognition system because the end points are easier to find and the pronunciation of one word tends not to affect the others; the reason is that the system requires a pause between each spoken word. The other is a continuous speech system in which words are connected together, that is, not separated by pauses. This type is more difficult to handle: it is hard to find the start and end points of words, the production of each phoneme is affected by the production of the surrounding phonemes, and the beginning and end of each word are affected by the preceding and following words. The rate of speech also affects the system, since faster speech is harder to follow.
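
To illustrate why isolated-word endpoints are comparatively easy to find, here is a hedged sketch of a crude energy-based endpoint detector; the frame length and threshold are invented for the example, and real systems are far more elaborate.

import numpy as np

def find_endpoints(samples, frame_len=160, energy_threshold=0.01):
    """Very rough endpoint detector: mark frames whose average energy exceeds
    a threshold, then return (start, end) sample indices of the speech region."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    voiced = np.where(energy > energy_threshold)[0]
    if voiced.size == 0:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len

# Silence, then a short 'word' (a burst of larger-amplitude noise), then silence.
rng = np.random.default_rng(0)
signal = np.concatenate([0.001 * rng.standard_normal(800),
                         0.5 * rng.standard_normal(1600),
                         0.001 * rng.standard_normal(800)])
print(find_endpoints(signal))  # roughly (800, 2400)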

There are multiple ways of performing speech recognition. The first stage is the digital sampling of speech. The next step is acoustic signal processing, and the stage after that is the recognition of phonemes, or of groups of phonemes or words, depending on the system. There are many ways to achieve this, such as DTW (Dynamic Time Warping). DTW finds the lowest-distance path through a matrix while minimizing the amount of computation. It operates on a time-time matrix whose cells are considered in succession, which is equivalent in a sense to processing the input frame by frame. For an input of length N, the maximum number of paths considered at any time is also N. Other techniques for recognizing phonemes include HMMs (Hidden Markov Models), NNs (Neural Networks), and combinations of techniques. Currently, HMM-based systems are the most widely used, and therefore I will focus more on HMMs.
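
As a sketch of the DTW idea described above, the following assumes two sequences of feature vectors and uses Euclidean frame distance as the local cost; it simply fills the time-time cost matrix and reads off the lowest cumulative distance.

import numpy as np

def dtw_distance(template, test):
    """Basic DTW: fill a cost matrix where each cell holds the lowest
    cumulative distance of any warping path reaching it, then read off
    the total at the final cell. Local cost is Euclidean frame distance."""
    n, m = len(template), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - test[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Toy example: two feature sequences "spoken" at different rates.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
print(dtw_distance(a, b))  # 0.0: the stretched sequence still aligns perfectly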

Any application that needs voice processing requires a specific representation of the speech information. For speech recognition, for example, the extraction of voice features is required; these features should distinguish the different phonemes of a language. This procedure is similar to finding a sufficient statistic for estimating phonemes. The speaker's current mood, age, sex, dialect, and inflections, as well as background noise, all in fact affect the acoustic signal, partly through the dimensions and use of the phonatory apparatus. The first stage of speech analysis is speech filtering: speech is filtered before it arrives at the automatic recognizer in order to decrease the ambiguity of the vocal message. The first procedure consists of an analog-to-digital conversion. Voice is a pressure wave, which is converted into numerical values in order to be digitally processed; the continuously varying voltage is sampled at regular intervals through time.

[Figure from (1)]

The hardware devices needed here are a microphone, which allows the pressure sound wave p(t) to be converted into an electrical signal Xc(t), and a sampler that takes values at time intervals Tc, i.e. at a sampling frequency fc = 1/Tc, yielding the voltage values

Xc(nTc) = X(n),

and finally an analog-to-digital (A/D) converter, which quantizes each X(n), n = 0, 1, …, N-1, into a number with a given precision, usually 16 bits. A speech waveform needs at most 100,000 bits/sec to retain all the conveyed information, which is much higher than the average phoneme information rate. A speaker is able to produce at most about 50 different phonemes, and each phoneme can be represented in binary form by 6 bits, since 50 < 2^6; at typical speaking rates this gives a phoneme information rate of only about 60 bits/sec, far lower than 100,000 bits/sec. This means that speech signal representations that discard information redundant for recognition are required. Speech reaches a maximum frequency of about 16 kHz.
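
The bit-count arithmetic can be checked directly; the rate of ten phonemes per second used below is a typical textbook figure assumed for the example, not a value given in this paper.

import math

n_phonemes = 50
bits_per_phoneme = math.ceil(math.log2(n_phonemes))   # 6, since 50 < 2**6
phonemes_per_second = 10                               # assumed typical speaking rate
print(bits_per_phoneme)                                # 6 bits per phoneme
print(bits_per_phoneme * phonemes_per_second)          # about 60 bits/sec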

The analog amplitude values Xc(nTc) are not suitable to be handled by software algorithms: their values range over a continuum, which is incompatible with a digital numerical system that can only manage quantized amplitudes represented by a small number of bits.

If a value X is represented by 16 bits, then X can assume only 2^16 = 65,536 values on the amplitude axis.

All values that do not correspond to one of the 65,536 possible values are associated with the nearest one. A new representation error is introduced this way, called quantization noise εq, and therefore

Xc(nTc) = Xc*(nTc) + εq(nTc),

where Xc*(nTc) is the quantized amplitude.
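
To make the sampling and quantization steps concrete, here is a minimal NumPy sketch; the 16 kHz sampling rate and the 440 Hz sine input are assumptions made only for the example, while the 16-bit precision matches the text.

import numpy as np

fs = 16000          # sampling frequency fc = 1/Tc (assumed for the example)
bits = 16           # bits per sample, so 2**16 = 65,536 amplitude levels
duration = 0.01     # 10 ms of signal

# Xc(nTc): sample the continuously varying voltage at regular intervals Tc.
t = np.arange(0, duration, 1.0 / fs)
x_analog = 0.8 * np.sin(2 * np.pi * 440 * t)   # an illustrative 440 Hz tone

# Quantize each sample to the nearest of 2**bits levels in [-1, 1).
levels = 2 ** bits
x_quantized = np.round(x_analog * (levels / 2)) / (levels / 2)

# Quantization noise eps_q(nTc) = Xc(nTc) - Xc*(nTc)
eps_q = x_analog - x_quantized
print("max |quantization noise|:", np.abs(eps_q).max())  # about 1 / 2**16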

Once a speech signal is digitized both in time and in amplitude, it can be stored and processed by a computer. Here, N samples can be seen as the coordinates of a unique point in an N-dimensional vector space. By tracing the points generated by repeated replications of the same phoneme, one finds that these points tend to fill the space almost at random, which is why the raw samples must be reduced to a more compact, sufficient representation.

The characteristics of the vocal tract define the currently uttered phoneme, and they are found in the frequency domain by locating the formants (the peaks given by the resonances of the vocal tract). A pre-emphasis of the high frequencies is required so that all formants have similar amplitude, and it is obtained by filtering the speech signal with a first-order FIR filter.

Feature extraction is the procedure that transforms the vectors of voice samples into more suitable vectors whose components are called the "features" of the signal.

Here is a diagram that shows the feature extraction module of an ASR system:

[Figure from (1): the feature extraction module of an ASR system]

The procedure aims at mapping the data onto a more appropriate space where useless information is discarded. In theory, the feature vectors for a given word should be the same regardless of the way in which the word has been uttered. The procedure may also reduce the amount of data managed by the ASR system, which means that the size of the feature vector can be smaller than that of the sample vector. The easiest way to understand feature extraction is to think of it as a cascade of discrete-time systems, or "blocks": the vectors of voice samples are the input of the first block, each block processes the output of the previous one and feeds the following block, and the output of the last block is the feature vector.

An example of feature extraction is MFCC (Mel Frequency Cepstral Coefficients). It is one of many examples out there: a processing chain that returns the signal features. The processes include pre-emphasis, achieved by filtering the speech signal; windowing, a short-time interval analysis used for the estimation; and filter bank processing, in which filters are spaced along the frequency axis (roughly linearly up to about 1 kHz and logarithmically above it), resulting in the detection of the harmonics of the signal. Further algorithms, such as speech enhancement, are also used to help make clearer what the speaker is trying to say.
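
Here is a hedged sketch of such a chain for a single frame, assuming NumPy and SciPy; the frame length, number of filters, and the mel-scale formula are standard textbook choices rather than values taken from this paper.

import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, fs=16000, n_filters=26, n_ceps=13):
    """Compute MFCCs for one pre-emphasized, windowed frame of speech.
    Chain: FFT power spectrum -> mel filter bank -> log -> DCT."""
    # Power spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2

    # Triangular mel filter bank spaced evenly on the mel scale.
    mel_max = 2595 * np.log10(1 + (fs / 2) / 700)
    mel_points = np.linspace(0, mel_max, n_filters + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((len(frame) + 1) * hz_points / fs).astype(int)

    fbank = np.zeros((n_filters, len(spectrum)))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log filter-bank energies, then DCT to decorrelate -> cepstral coefficients.
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_ceps]

# Pre-emphasis and Hamming window applied to a 25 ms frame of a toy signal.
fs = 16000
frame = np.sin(2 * np.pi * 300 * np.arange(400) / fs)
frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
frame *= np.hamming(len(frame))                              # windowing
print(mfcc_frame(frame, fs).shape)  # (13,)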

The acoustic speech signal is the most easily observed and measured stage in speech communication. A very valuable representation of the speech signal is the frequency-time spectrogram, since it gives significant information about the resonances and how they vary over time. The spectrum can be divided into formants, and each formant can be characterized by a resonant frequency, an amplitude, and a bandwidth. In general, three to five formants are used to model speech: the lower three represent the phonetic quality of the sound itself, while the extra resonances add to the accuracy of the description and the naturalness of the speech.
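
As an illustration of the frequency-time representation, the following sketch computes a spectrogram of a toy signal with SciPy; the harmonics in the signal and the window sizes are invented for the example, and for real speech the peaks along the frequency axis would correspond to the formants.

import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
# Toy 'vowel-like' signal: a few harmonics standing in for formant peaks.
x = (np.sin(2 * np.pi * 500 * t)
     + 0.6 * np.sin(2 * np.pi * 1500 * t)
     + 0.3 * np.sin(2 * np.pi * 2500 * t))

# Short-time analysis: 25 ms windows with 10 ms hops.
f, time_bins, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)

# The strongest frequency in each time slice; for real speech, the peaks of
# Sxx along the frequency axis would trace the formants over time.
print(f[Sxx.argmax(axis=0)][:5])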

Speech sentences are represented by sequences of words W = (w1, w2, w3, …, wt), where wt is the last word spoken at discrete time t. A word sequence is communicated through voice as a sequence of acoustic sounds X. What we need is to find the sequence of words W associated with a given acoustic sequence X. From a mathematical perspective, a function F maps a given X, belonging to the set 𝒳 of all acoustic sequences, to a W included in the set 𝒲 of all word sequences:

F: 𝒳 → 𝒲, with X ∈ 𝒳 and W ∈ 𝒲

Recognition is complex: not all X correspond to words, but the number of possible X still remains high. The mapping function associates many different X with the same W, since different acoustic sequences can correspond to the same sentence; speakers can generate different X by simply changing location.

With ASR, a good strategy is to build a model λ that returns all the possible "emissions" X for all the admissible events W. For speech, h(W, λ) would return all the possible acoustic sequences X that can be associated with a given W. Recognition is then performed by finding the sequence of words that, according to λ, would return an acoustic sequence X that best matches the one given. This means that recognition relies on prior knowledge of the mapping between acoustic and word sequences, and that knowledge is embedded in λ.
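
A toy sketch of this "best match" recognition step over a tiny candidate set follows; the scoring function standing in for λ is entirely made up, and real systems search enormous hypothesis spaces with dynamic programming rather than enumerating candidates.

# Hypothetical scoring function standing in for the model lambda: how well a
# candidate word sequence W explains the observed acoustic sequence X.
def acoustic_match_score(X, W):
    # Toy stand-in: candidates whose expected length fits X score higher.
    return -abs(len(X) - 3 * len(W.split()))

candidates = ["yes", "yes please", "check my reservation"]
X = ["y", "eh", "s"]   # an observed acoustic sequence, in a toy representation

# Recognition: pick the W whose modeled emission best matches the given X.
best_W = max(candidates, key=lambda W: acoustic_match_score(X, W))
print(best_W)   # 'yes'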

There are two basic steps in the procedure: training and recognition. During training, the model is built from a large number of different correspondences, i.e. couples (X', W'). The key idea is that the greater the number of couples (X', W'), the greater the accuracy. Here is a scheme of the functions of an ASR system that helps demonstrate what I am trying to convey:

[Figure from (1): functional scheme of an ASR system]

The Hidden Markov Model is a Markov chain in which the output symbols, or the probabilistic functions that describe the output symbols, are connected either to the states or to the transitions between states. The HMM is a model of a stochastic process that is used to reduce a non-stationary process to a piecewise stationary one. That is, the signals we meet can be considered instances of random processes that are non-stationary, because the systems that produce them vary their state continuously. The model consists of a set of nodes (states) chosen to be appropriate for a particular vocabulary. The nodes are ordered and connected from left to right, and recursive loops are allowed. Recognition is based on a transition matrix giving the probability of changing from one node to another, and on another matrix representing the probability that a particular set of codes will be observed in each node. HMMs are well suited for speaker-independent recognition because the speech used during training can come from multiple speakers.

In order to have a computationally feasible model, a finite-state model is assumed. The hidden Markov process acts as a switch that associates the observations with one of the operative conditions of the process. Sequences are divided into subsequences that are realizations of different piecewise stationary processes; for speech, the observations associated with a single phoneme typically belong to three different stationary processes (a three-state phoneme model). As mentioned before, in order to perform speech recognition, an ASR system processes the sequence of speech samples extracted from a speech recording.

Say that S is a sentence of the spoken language (as a sequence of units), and Y is the given sequence of observed units. The ASR problem is then to find the correct sequence S from the given observation Y.

[Equation from (1)]

The HMM λ is often referred to as a parametric model, because the state of the system at each time t is completely described by a finite set of parameters. The algorithm that estimates the HMM parameters (the training algorithm) needs a good first guess, which is produced by the initialization module. The initialization computes the first guess of the HMM parameters using the preprocessed speech data (the features) and their associated phoneme labels. The HMM parameters are stored as files and then retrieved by the training procedure. Here is a figure that shows initialization in an ASR system:

[Figure from (1): initialization in an ASR system]

Before estimating the HMM parameters, the basic structure of the HMM must be defined. To be specific, the graph structure, which is the number of states and their connections, and the number of mixtures per state, M, must be specified.

A good way to understand HMMs is through an example. Suppose we build a model λ that recognizes only the word "yes". The word "yes" is composed of two phonemes, '\ye' and '\s', which correspond to the six states of the two three-state phoneme models (to be more accurate, "yes" is composed of '\y', '\eh', and '\s'). The ASR system does not know the acoustic state in the mind of the speaker; therefore, it tries to find W by reconstructing the most likely sequence of states and words W that could have generated X.
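
To illustrate the reconstruction of the most likely state sequence, here is a hedged sketch of Viterbi decoding over a toy six-state, left-to-right model for "yes"; all transition and emission probabilities are invented for the example, not taken from a trained system.

import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Return the most likely state sequence for an observation sequence,
    given log transition (A), emission (B), and initial (pi) probabilities."""
    n_states = log_A.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)   # best log-prob ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max() + log_B[j, obs[t]]
    # Trace back the best path.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: 6 states (3 per phoneme for the '\ye' + '\s' view)
# and 2 observation symbols (0 for '\ye'-like sounds, 1 for '\s'-like sounds).
A = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)   # first phoneme emits 0, second emits 1
pi = np.array([1.0, 0, 0, 0, 0, 0])

obs = [0, 0, 0, 1, 1, 1, 1]
with np.errstate(divide='ignore'):                   # log(0) -> -inf is fine here
    print(viterbi(obs, np.log(A), np.log(B), np.log(pi)))
# Expected: [0, 1, 2, 3, 4, 5, 5], a left-to-right path through both phoneme models.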

Model training is performed by estimating the HMM parameters. Since estimation accuracy is roughly proportional to the amount of training data, large speech databases are needed in order to achieve acceptable performance.

One could say that HMMs belong to a wider class of models, which includes Bayesian networks and the types of neural networks that have evolved as probabilistic models. All these networks have in common the use of prior information in the learning process; usually this prior information is obtained from data.

An HMM is a dynamic process model, while a Bayesian network is a static representation of the joint or conditional probabilities between variables; the result of unrolling such a model is a network made of a set of repeated units extended in time. HMMs have a set of inference, learning, and recognition algorithms that are well optimized to handle large sets of data. Neural networks are systems made of a large number of simple interconnected units that simulate, in fact, brain activity. Each unit, representing a neuron in the biological analogy, is part of a layered structure and produces an output that is a non-linear function of its inputs. It is difficult to design and train neural networks with dynamic growth. Current and past experiments in speech recognition favor HMMs.

An important aspect of a speech recognition system is its vocabulary and what kind of recognizer it uses. Many applications can use a fixed-vocabulary recognizer; the advantage of such systems is their high accuracy. These kinds of systems include those that recognize only the ten digits, specialized vocabularies, as well as some with large vocabularies for dictation. The choice of vocabulary determines the accuracy of the word recognition system. Systems that have multiple vocabularies benefit because it is easy to switch between vocabularies, and better performance is achieved with acoustically distinct words. A classic example of the limitations of speech recognition is the failure to identify letters of the alphabet, because many of the letters are acoustically very similar. For example, the "e-set" consists of {b, c, d, e, g, p, t, v, z}, and when you actually say these letters you hear {bee, cee, dee, gee, etc.}. Hence, the e-set is appropriate for testing your system with.

As a final note, I think that speech recognition is becoming a very important aspect of our everyday life. The integration of VoiceXML with an Automatic Speech Recognition system would provide a great means of quick and effective mobile communication.

Bibliography

Articles:

· White, George M. "Natural Language Understanding and Speech Recognition." Communications of the ACM 33 (1990): 74 - 82.

· Osada, Hiroyasu. "Evaluation Method for a Voice Recognition System Modeled with Discrete Markov Chain." IEEE 1997: 1 - 3.

· Bradford, James H. "The Human Factors of Speech-Based Interfaces: A Research Agenda." SIGCHI Bulletin 27 (1995): 61 - 67.

· Shneiderman, Ben. "The Limits of Speech Recognition." Communications of the ACM 43 (2000): 63 - 65.

· Danis, Catalina, and John Karat. "Technology-Driven Design of Speech Recognition Systems." ACM 1995: 17 - 24.

· Suhm, Bernhard, et al. "Multimodal Error Correction for Speech User Interfaces." ACM Transactions on Computer-Human Interaction 8 (2001): 60 - 98.

· Brown, M.G., et al. "Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval." ACM Multimedia 96 (1996): 307 - 316.

· Christian, Kevin, et al. "A Comparison of Voice Controlled and Mouse Controlled Web Browsing." ACM 2000: 72 - 79.

· Falavigna, D., et al. "Analysis of Different Acoustic Front-Ends for Automatic Voice over IP Recognition." Italy, 2001.

· Simons, Sheryl P. "Voice Recognition Market Trends." Faulkner Information Services, 2002.

Books:

· (1) Becchetti, Claudio, and Lucio Prina Ricotti. Speech Recognition: Theory and C++ Implementation. New York: 1999.

· Abbott, Kenneth R. Voice Enabling Web Applications: VoiceXML and Beyond. New York: 2002.

· Miller, Mark. VoiceXML: 10 Projects to Voice Enable Your Web Site. New York: 2002.

· Syrdal, A., et al. Applied Speech Technology. Ann Arbor: CRC Press, 1995.

· Larson, James A. VoiceXML: Introduction to Developing Speech Applications. New Jersey: 2003.