
Comput. & Indus. Engng Vol. 7, No. 2, pp. 101-114, 1983 0360-8352/83/020101-14$03.00/0 Printed in Great Britain. Pergamon Press Ltd.

VOICE COMMUNICATION WITH COMPUTERS: A PRIMER

M. G. JOOST Department of Industrial Engineering, North Carolina State University, Raleigh, NC 27650, U.S.A.

Y. A. HOSNI Department of Industrial Engineering and Management Systems, University of Central Florida, Orlando, FL 32816, U.S.A.

and

F. E. PETRY Department of Computer Science, Tulane University, New Orleans, LA 70118, U.S.A.

(Received in revised form July 1982)

Abstract--Due to the rapid development of speech communication interfaces, speech will be a major mode of man-machine communication. Essential to the development of speech input/output applications is an understanding of the techniques and jargon of speech by the application designer who, in most cases, has no formal speech training and, perhaps, little understanding of computers. This paper includes introductory tutorials on speech synthesis and recognition in an attempt to overcome the jargon barrier. A limited number of application examples are presented to stimulate generation of new applications.

INTRODUCTION

Speech is the most common mode of human communication. In most instances, other modes are used only if speech is not possible, if the information transmission must be delayed in time, or if a permanent record must be generated. The process of effective speech communication requires at least two components: a speaker whose efforts are directed at presenting audible information in a manner optimizing its reception, and a listener who similarly attempts to interpret the message being presented by the speaker.

With practice, both the speaker and the listener adapt to the other's style, enhancing the flow of information while accommodating transmission aberrations caused by illness, environmental changes, ambient noise, etc. Carried to the extreme, this adaptation accounts for regional dialects and language divisions.

WHY USE SPEECH?

Given this apparent diversity and dynamism in speech communication, the first question of importance is why use speech for computer input/output (I/O) at all--it appears to foster imprecision, the antithesis of the common perception of computers. The response to this question requires the consideration of several points.

First, and probably of prime importance, is the ease with which the skills needed for speech communication with computers can be acquired. For example, contrast speech with the use of an ordinary terminal for I/O purposes. Most individuals will compose a command, segment it into its components (words), spell these components and translate the character string into the motor activity necessary to enter the correct character codes at the terminal. The response then generated is visually scanned and the information is assembled into meaningful concepts. As the individual gains experience, the translation process becomes more automatic, but most intermittent users never reach this level. Thus, entry is slow and subject to error. Using speech, on the other hand, the most highly developed translation skills are employed, which should reduce the additional workload imposed on the user by computer communication. The parallel with keyboard data entry is quite strong (Fig. 1), with the major differences due to coding and muscle control. Speech then becomes merely another computer interface.

While this natural argument has much appeal, there are also several negative aspects to speech I/O.



Fig. 1. Functional components--distributed workstation application. (The diagram links the operator's speech, signal conditioning, automatic speech recognition (ASR), electronic speech synthesis (ESS), speech patterns and the application processing software.)

The computer cannot adapt as humans do; thus, in most cases, there is a period of learning/training involved which, if handled improperly, can negatively influence the user's perception of speech I/O. Additionally, communication may be slower than the user would normally prefer, which may limit its effectiveness for some tasks.

In spite of any disadvantages, speech is likely to become a major communication path over the next decade, for several reasons: it is easy to learn, it is easy to use and it is becoming less expensive to implement.

THEORY OF SPEECH IN COMPUTER COMMUNICATION

Speech, as it has been used here, refers to two-way communication. In terms of computer interfaces, there are two separate functions. For output from the computer, electronic speech synthesis is used, while input to the computer requires automatic speech recognition. These two elements may be used separately as in warning or monitoring applications or they may be used in conjunction to develop a speech transaction system.

SPEECH SYNTHESIS

Mechanical analogs of the human vocal system were constructed and produced speech sounds 200 years ago; however, it was the recent invention of the digital computer which made practical speech synthesis feasible. Speech synthesizers are devices which artificially produce speech sounds. The associated design theory includes the process of analyzing and storing human speech by extracting information from an input speech waveform and storing it in a computer memory. Speech reproduction is accomplished by reconstructing speech from the information stored. Many different approaches to speech synthesis exist. The techniques are distinguished mainly by memory storage requirements for the vocabulary and by the complexity of the control rules for generating the speech. Proper hardware/software interaction makes potential applications of speech synthesis in many widely different areas apparent, especially in the field of industrial engineering and re-engineering jobs for the blind. This section will present some of the theory of speech reconstruction, recent developments in currently available electronic speech synthesizers, representative engineering applications and some potential applications in industrial engineering.


HUMAN SPEECH PRODUCTION

Human speech is a physiological process. Figure 2 shows the human vocal tract and the components which produce speech. The vocal tract is a non-uniform acoustic tube, 16-18 cm in length, which extends from the glottis to the lips and varies in shape as a function of time. The major anatomical components causing this time-varying change are the lips, tongue, jaw and velum. The velum is a flap which couples the nasal tract to the vocal tract through a trap-door type of action. The nasal cavity is about 12 cm long and has an approximate volume of 60 cm³.

The vocal system can produce three basic types of sound: voiced, fricative and plosive. Voiced sounds, such as the vowels, are produced by the vibration of the vocal cords due to air released from the lungs. These vibrations interrupt the air flow and generate a series of sharp pulses that excite the vocal tract. Fricatives (s, sh, f, th) are generated by forcing air through a constricted vocal tract at a high velocity, which causes turbulence. Plosives (p, t, k) are produced by closing the vocal tract completely with the lips or tongue, allowing a pressure buildup and then abruptly opening the closure. These sounds have a fairly broad spectrum of frequencies ranging from 100 Hz to more than 3000 Hz. The vocal system acts as a time-varying filter to produce resonance characteristics. The vocal tract acts as a filter with poles corresponding to the acoustic resonances known as formants. The formants are located approximately at 1000 Hz intervals. When the nasal tract is made part of the transmission system, another resonance pole and an antiresonance zero are introduced at around 1400 Hz. The various voiced sounds of speech are produced by changing the shape of the vocal tract and hence its resonances, while unvoiced sounds are produced by exciting the tract with a noise source that has a fairly broad, uniform spectrum and by effectively altering the length of the vocal tract. This can be accomplished by, for example, allowing the tongue to constrict the flow in the back of the mouth, causing the resonances and antiresonances to fall at different frequencies [2].

Fig. 2. The vocal system. (The diagram labels the velum, soft and hard palate, nostrils, lips, teeth, tongue, glottis and vocal tract.)

HISTORY OF SPEECH SYNTHESIS

The first known practical realization of machine-generated speech was accomplished in 1791 by Wolfgang von Kempelen, of the Hungarian government. Von Kempelen's machine was based on a surprisingly detailed understanding of the mechanisms of human speech production, but he was not taken seriously by his peers due to a previous well-publicized deception in which he built a nearly unbeatable chess-playing automaton. The "automaton" was, unfortunately, later discovered to actually conceal a legless Polish army ex-commander who was a master chess player. In the voice machine a bellows was used to supply air to a reed. This in turn excited a hand-controlled resonator that produced voiced sounds. Consonants and nasals were simulated by four constricted passages [12].

In 1820 a machine was constructed which could carry on a normal conversation when manipulated by a skilled operator. The machine, built by Joseph Faber, a Viennese professor, was demonstrated in London where it sang "God Save the Queen". Like the Von Kempelen synthesizer, Faber's device also incorporated bellows, reeds and resonant cavities to simulate the human vocal tract[21].

The first electrical synthesizer was demonstrated at the 1939 World's Fair. Built by Bell Telephone Laboratories, the VODER (Voice Operation Demonstrator) consisted of a periodic buzz oscillator to simulate the vocal cords and a wideband noise source to simulate constricted air flow for fricative production. These sounds were modified by passing them through a resonance control box containing ten contiguous bandpass filters that spanned the frequency range of normal speech. Ten finger keys activated gain control potentiometers which modulated the outputs of the filters. Three additional keys provided a transient excitation of selected filters to simulate three types of plosive sound (t-d, p-b, k-g). A wrist bar selected the noise or buzz source and a foot pedal controlled the pitch of the buzz oscillator (Fig. 3).

The development of the digital computer provided a boost to the production of practical speech synthesizers. The computer took over the control functions, providing greater freedom in modeling the vocal system. All the information necessary to repeatedly and reliably generate any one speech sound or "phoneme" could now be programmed into the machine. Through the proper connection of phonemes, the computer could produce words and sentences.

PHONETICS

In order to design or program an electronic speech synthesizer, it is essential to have a knowledge of phonetics. Phonetics is the scientific study of speech sounds from the standpoint of their production, reception, and symbolization. A word is phonetically described by breaking it into distinctive sound units called phonemes. There are 38 phonemes in the English language[21] as well as allophones produced by subtle differences in phoneme production.

Fig. 3. The VODER block diagram[12]. (A random noise source and a relaxation oscillator, selected by a wrist bar, feed the resonance control filters; the console keyboard, quiet/stops keys and a pitch control pedal shape the signal, which is amplified and sent to a loudspeaker.)


These are the slightly varying sounds of a single phoneme, usually determined by its position in a word and the phonemes preceding and following it. Table 1 gives a list of International Phonetic Alphabet (IPA) symbols and key words demonstrating their use.

In a sense, phonetics is a language in itself.

Table 1. IPA phonetic symbols

Symbol   Key word   Additional spellings of the sound
[p]      part       rope, stopped, hiccough
[b]      be         rubbed, cupboard
[t]      to         butt, debt, raked, indict, yacht, receipt
[d]      do         add, rubbed, could
[k]      keep       cue, sick, account, ache, walk, khaki
[g]      give       egg, ghost, guest
[f]      fine       off, phrase, laugh
[v]      vest       of, have, Stephen
[θ]      thin
[ð]      the
[s]      see        miss, scent, schism, cinder, psalm, sword, waltz
[z]      zoo        fuzz, raise, scissors, xylophone
[ʃ]      ship       issue, sugar, pension, ration, champagne, conscious
[ʒ]      lesion     leisure, azure, negligee
[h]      he         whole
[m]      milk       summer, comb
[n]      no         inn, pneumonia, Wednesday, mnemonic, knife, gnash
[l]      lake       tell
[w]      wig        language
[r]      red        merry, rhetoric, wrist
[j]      yes        onion
[tʃ]     chew       cello, witch, feature
[dʒ]     just       rage, gem, dodge, soldier
[ŋ]      sing       anchor, handkerchief, tongue
[i]      see        eat, people, chief, perceive, key, phoenix, ravine, Caesar
[ɪ]      sit        sieve, hymn, business, women, guild
[e]      ache       aim, beige, great, play, grey, gauge
[ɛ]      end        said, pear, says, heir, leopard, friend, any
[æ]      can't      laugh, half
[ɜ]      earn       worst, fir, fur, purr, germ, myrtle, journey
[ɚ]      ladder     surprise, sailor, liar
[ʌ]      up         son, tough, gull, does, blood
[ə]      sofa       succeed, famous, bargain, specimen
[u]      food       rude, whose, through, threw, shoes, group, blue
[ʊ]      book       could, fully, wolf
[o]      rope       oak, though, sown, sew, goes, yeomen, shoulder, beau
[ɔ]      aught      raw, cough, abroad, gone, thought, all
[a]      farm       hot, honest
[aɪ]     sky        write, height, aisle, buy, lye, eye, pie, sigh
[aʊ]     out        bough, crowd, hour, sauerkraut
[ɔɪ]     boy        broil


For example, the following are some common English words with their phonetic transcriptions:

the /ði/, shall /ʃæl/, and /ænd/, should /ʃʊd/, but /bʌt/, was /wɔz/, to /tu/, can /kæn/, by /baɪ/, as /æz/

In general, phonemes can be classified as:

Pure vowels. Produced by a constant excitation of the larynx with the mouth held in a steady position, e.g. [a].

Diphthongs. A transition from one pure vowel to another; thus they are not always considered separate phonemes, e.g. [aɪ], [ɔɪ].

Fricatives. Consonants produced by a rush of aspirated air through the vocal passages: "f", "s".

Plosives. Explosive bursts of air: "p", "k", "t".

Semi-vowels. "w", "y".

Laterals. "l", "r".

Nasals. "n", "m".

A speech synthesizer must be able to produce these sounds and connect them in such a way that the transition from one phoneme to another is accomplished as smoothly and naturally as possible.

ELECTRONIC SPEECH SYNTHESIS

The main components of a speech synthesis system are shown in Fig. 4. The machine is required to synthesize a message, typically English text. A synthesis program may obtain, from a predefined vocabulary or a set of vocabulary rules, a description of the required sequence of sounds. This sequence is then transferred to a synthesis device which transforms the information into speech. Four different techniques illustrate the range of complexity: adaptive differential pulse-code modulation (ADPCM), linear predictive coding (LPC), formant synthesis and text-to-speech synthesis.

Fig. 4. Block diagram of a computer voice response system. (English text enters a synthesis program which, drawing on a vocabulary store or speech formation rules, drives a digital speech synthesizer to produce the speech output.)

ADAPTIVE DIFFERENTIAL PULSE CODE MODULATION (ADPCM)

The simplest technique, ADPCM, utilizes a vocabulary composed of human-spoken words whose waveforms are digitally coded using a high sample rate (20 kbits/sec). Figure 5 shows a block diagram of an ADPCM system. For message assembly, the synthesis program merely retrieves the digitally coded words from storage and supplies them to an ADPCM decoder to produce an analog output. The disadvantage of this system is the large amount of data storage required; due to the high bit rate, roughly 10^6 bits of storage are required for one minute of speech.
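The idea can be illustrated with a small sketch. This is not the codec of any particular ADPCM chip, only a minimal differential coder in the same spirit: each sample is predicted from the previously decoded sample, only a few bits of quantized prediction error are stored, and the quantizer step size adapts to the recent error magnitude. All constants are illustrative assumptions.

import math

def adaptive_step(step, code):
    # Grow the step after large errors, shrink it after small ones.
    return max(16.0, min(2048.0, step * (1.5 if abs(code) > 4 else 0.8)))

def adpcm_encode(samples, step=64.0):
    codes, prediction = [], 0.0
    for x in samples:
        code = max(-8, min(7, int(round((x - prediction) / step))))  # 4-bit code
        codes.append(code)
        prediction += code * step          # track what the decoder will rebuild
        step = adaptive_step(step, code)
    return codes

def adpcm_decode(codes, step=64.0):
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
        step = adaptive_step(step, code)
    return out

if __name__ == "__main__":
    # A 440 Hz tone sampled at 8 kHz; storing even 4 bits per sample shows why
    # waveform coding of whole spoken words is storage-hungry.
    tone = [1000.0 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(200)]
    rebuilt = adpcm_decode(adpcm_encode(tone))
    print("first reconstructed samples:", [round(v) for v in rebuilt[:5]])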


Fig. 5. Block diagram of ADPCM speech synthesis[10]. (The sampled input is coded and sent over a digital channel to the decoder. Q: quantizer; C: coder; D: decoder; P: predictor; L: logic; LPF: low-pass filter.)

LINEAR PREDICTIVE CODING (LPC)

In LPC, the acoustic wave pattern associated with the sound of speech is represented by a set of parameters obtained from the analysis of an original speech signal. In LPC synthesis, the digital filter requires two sound sources: a white noise generator is used for unvoiced sound production and a periodic source is used for voiced sounds. The output of the digital filter is converted from a digital signal to speech by a digital-to-analog converter. The advantage of such a technique is that a rather modest amount of memory can store an impressive amount of speech--7 minutes in 10^6 bits of storage. Figure 6 shows a diagram of how LPC is used in the Texas Instruments TMC0280 synthesizer chip.
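A minimal sketch of the synthesis side of LPC follows, under the assumptions just stated: an excitation signal (periodic pulses for voiced frames, white noise for unvoiced frames) drives an all-pole filter whose coefficients would, in a real system, come from analysis of recorded speech and be updated every frame. The coefficients and frame length below are made-up illustrative values, not parameters of the TMC0280.

import random

def lpc_synthesize(coeffs, gain, n_samples, voiced, pitch_period=80):
    # All-pole filter: y[n] = gain*e[n] + sum_k coeffs[k] * y[n-1-k]
    history = [0.0] * len(coeffs)
    out = []
    for n in range(n_samples):
        if voiced:
            excitation = 1.0 if n % pitch_period == 0 else 0.0   # pulse train
        else:
            excitation = random.gauss(0.0, 1.0)                  # white noise
        y = gain * excitation + sum(a * h for a, h in zip(coeffs, history))
        history = [y] + history[:-1]
        out.append(y)
    return out

# Hypothetical second-order coefficients forming one stable resonance; a real
# synthesizer uses on the order of ten coefficients refreshed every ~20 ms.
voiced_frame = lpc_synthesize([1.75, -0.9], gain=0.5, n_samples=160, voiced=True)
print(len(voiced_frame), "samples synthesized")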

Fig. 6. Linear predictive coding as used in the TMC0280 single-chip synthesizer[18]. (A white-noise source for unvoiced sounds and a periodic source for voiced sounds drive a lattice filter controlled by pitch, voiced/unvoiced, amplitude and reflection-coefficient parameters; a D/A converter produces the speech output.)

FORMANT SYNTHESIS

Unlike the ADPCM and LPC types, the formant synthesizer does not require a voice input to be digitized and stored. Instead, a circuit is designed to reproduce sounds (usually phonemes and allophones) by manipulating filters and applying rules to produce the proper sound when used in the context of a word or sentence. Figure 7 is a block diagram of serial and parallel formant synthesizers. Both have their relative advantages and disadvantages, and thus many different designs have been produced.


Fig. 7. Formant synthesizer block diagrams: (a) series; (b) parallel[21]. (Each combines a pitch-controlled impulse source and a noise generator feeding variable filter networks, with spectral compensation or a summing amplifier producing the speech output.)

To model the vocal tract, a formant synthesizer typically has two excitation sources: (1) an impulse generator to simulate the voicing source and (2) a white noise generator as a fricative source. For example, for vowel production the output of the impulse generator is fed into formant filters which can produce most of the vowels of American English using steady-state formant frequencies.
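The following sketch shows the cascade (series) idea in miniature: an impulse train is passed through a chain of second-order digital resonators, one per formant. The formant frequencies and bandwidths are rough textbook-style values for an /a/-like vowel, assumed here purely for illustration.

import math

def resonator(signal, freq_hz, bandwidth_hz, fs=8000):
    # Second-order digital resonator: one formant filter in the cascade.
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2 * r * math.cos(2 * math.pi * freq_hz / fs)
    a2 = -r * r
    gain = 1 - a1 - a2
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out

fs, pitch_hz = 8000, 100
impulses = [1.0 if n % (fs // pitch_hz) == 0 else 0.0 for n in range(fs // 4)]

signal = impulses
for formant_hz, bandwidth_hz in [(730, 90), (1090, 110), (2440, 170)]:
    signal = resonator(signal, formant_hz, bandwidth_hz, fs)
print("synthesized", len(signal), "samples of a steady vowel")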

TEXT TO SPEECH SYNTHESIS

The final class of synthesizer to be surveyed is the text synthesizer, shown in block diagram form in Fig. 8. In its simplest form this type of synthesizer consists of a vocabulary stored in a pronunciation dictionary, each entry of which has the English word and a phonetic transcription of the word. When a word is to be output or "spoken", the controlling program sends the sequence of codes to a synthesizer (such as a formant synthesizer) to be converted to an analog signal[8]. This type of text synthesizer has a limited (i.e. predefined) vocabulary; however, it has the desired feature of converting English text directly into synthesized speech. In spite of the limited vocabulary, over 700 words can be stored in 6.6 kilobytes of memory. In more sophisticated forms, a text synthesizer can have an unlimited vocabulary. It is able to take any English text and, through an appropriate set of rules, convert the text to speech. The approach usually taken is to isolate a word from the input text and attempt to find it in an exception dictionary. If the word is not found, the word is converted to a simpler version, and the dictionary is searched for the converted word. If it is still not found, letter-to-sound rules are applied, producing sounds as would a child learning to read using phonetic rules.
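A toy version of that lookup strategy is sketched below. The word lists and the single suffix rule are invented stand-ins; real systems use large exception dictionaries and hundreds of context-sensitive letter-to-sound rules such as those of ref. [7].

# Assumed, illustrative entries only -- not a real pronunciation dictionary.
EXCEPTIONS = {"the": "DH AH", "of": "AH V", "one": "W AH N"}
LETTER_SOUNDS = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH",
                 "k": "K", "s": "S", "t": "T"}          # tiny subset

def simplify(word):
    # Strip a plural "s" so "cats" can be found through "cat".
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

def to_phonemes(word, dictionary):
    word = word.lower()
    if word in EXCEPTIONS:                    # 1. exception dictionary
        return EXCEPTIONS[word]
    if word in dictionary:                    # 2. stored vocabulary
        return dictionary[word]
    base = simplify(word)                     # 3. simplified form of the word
    if base in dictionary:
        return dictionary[base] + " S"
    return " ".join(LETTER_SOUNDS.get(ch, "?")   # 4. letter-to-sound fallback
                    for ch in word)

vocab = {"cat": "K AE T"}                     # hypothetical stored entry
for w in ["the", "cat", "cats", "desk"]:
    print(w, "->", to_phonemes(w, vocab))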


Fig. 8. Block diagram of a text synthesizer. (English text from a typewriter entry passes through a dictionary and parser to phonetic symbols; a model of articulation calculates pitch, intensity, voiced/unvoiced and formant parameters which drive a digital speech synthesizer to produce synthetic speech.)

IMPACT OF SYNTHETIC SPEECH EQUIPMENT IN INDUSTRIAL ENGINEERING PRACTICE

There are numerous applications of synthetic speech systems in industrial engineering practice, particularly with regard to man-machine interfacing and information transfer. Perhaps the most appropriate use of these systems is in those applications where conventional terminal input/output equipment is not feasible or is too expensive. The speech synthesizer has an impact on work design and simplification--a major area of industrial engineering. A spoken message, if feasible, is less likely to be misunderstood than other forms of communication (e.g. written) when the message is short. Messages can be received while the person is moving or performing another job, rather than being constrained to the front of visually-oriented equipment. Most jobs where a message is not required to be retained could be redesigned to make use of speech synthesizers.

Consider the following examples. In a large computerized inventory system, where sales persons and customers are requesting information, the computer can be interfaced with a "talking" terminal equipped with a touch-tone decoder. The inquirer (customer or sales person) can call in from any touch-tone telephone and query the inventory system. The same idea can be applied to credit checking or in any application where a quick check is needed by telephone and simple transactions will suffice.

In the area of industrial plant supervision and maintenance, where computers monitor industrial equipment (e.g. furnaces, automatic machines, etc.), a speech synthesizer might page the appropriate service engineer over the plant public address system when abnormal conditions are detected. Meanwhile the engineer is free to go anywhere in the factory performing other tasks. Paging could be accompanied by a spoken message designed with the appropriate information or instructions. Warnings, instructions, etc. may also be stored as spoken messages in the computer, to be accessed and activated whenever needed in an industrial safety situation. Messages can be designed to inform workers of how to cope with a dangerous situation.

In the area of work design and simplification, jobs which require computer access via standard telephone equipment could be simplified. Examples of such jobs are service on information desks and credit card verification. An operator receives the inquiry, checks its legitimacy, enters the information into the computer and then lets a synthesizer handle the response while the operator is processing another inquiry.

In the area of job training, job descriptions and detailed work instructions can be stored as messages in a computer. The training program can be designed such that the computer monitors the operator's execution of, for example, an assembly operation requiring a certain sequence and a particular pace. As the computer monitors the situation, it would instruct the operator and inform him about any improper operation. Another routine could be programmed to record and rate the efficiency of the training program.

Work design for the handicapped in general, and for the visually handicapped in particular, represents a challenge for the industrial engineer. Jobs must be re-engineered in a way that suits the blind. Speech synthesizers open the door to a wide class of jobs which were previously unavailable to the visually handicapped. Tasks such as airline reservations (ticket reservations), telephone dispatching for mobile units (police cars, taxis and trucks), word processing and locating library materials are all done by accessing information from a central computer. Speech synthesizers will open such jobs to the visually handicapped, provided they are properly trained for the re-engineered jobs. Speech synthesizers are already employed in word processing and form filling, where a computer program "reads" what has been typed and accepts modifications. Advances in the area of character recognition, in conjunction with speech synthesizers, make feasible a reading machine for the blind. The Kurzweil reading machine[22] converts print to speech and is designed to recognize English characters in printed material (books, letters, reports). A program composes the words, which are pronounced by a speech synthesizer.

The development of the digital computer and technical advances in integrated circuits have provided a boost to the production of practical speech synthesizers. Proper hardware/software interaction makes potential applications of speech synthesis apparent in many different areas, especially industrial engineering, information systems, work design and re-engineering jobs for the handicapped.

AUTOMATIC SPEECH RECOGNITION

The general problem of speech recognition is very difficult. At its worst it involves understanding natural language, which in a typical conversation is often idiomatic and incomplete. In addition, of course, the speech signal is noisy and ambiguous. Indeed, the area of speech recognition has seen some very sophisticated systems and has had advances in artificial intelligence applied to it. However, for specific subcases of general speech recognition, there exists a large variety of available equipment and systems that can perform quite well. In particular, in a cooperative, speaker-dependent, isolated word or phrase context, excellent solutions are available [23-25]. Some typical applications of such systems are inventory control, data input, process control, material handling and aids to the handicapped [26-30].

Although we shall not discuss it here, another topic closely related to speech recognition is the topic of speaker recognition or speaker verification. This can be of considerable use in speaker identification and entry security problems [31,32].

In this section, we will discuss in some detail the operation of an isolated word recognizer and some of its capabilities and limitations. Then, an overview of the more general efforts and future directions in speech recognition will be provided.

There are currently a large number of isolated word recognition units available. Their operation involves, for each person using the system, an initial training or entry of the individual's vocabulary. Words from this vocabulary can then be spoken and recognized at later points in time. Additional words can be added to the vocabulary set at any time, or old words "retrained" if recognition performance is unsatisfactory. Figure 9 shows this process.

Once again, the characteristics of the typical system are:

(1) Each speaker must use their own vocabulary set due to individual vocal characteristics.

(2) Each word must be "isolated", spoken with a distinct pause or break between each word. Some systems monitor for words and others provide a prompt to indicate when a new word can be entered for training or recognition.

(3) It is assumed the user is cooperative, trying to speak clearly and consistently in training and use. However, variations in speech patterns in time, due to colds, psychological state, etc., can cause deterioration of the recognition rate. This is why it may be necessary to retrain the system on some already existing vocabulary items.

A common system of this sort forms templates characterizing each training word and then attempts to match a word spoken during usage to one of the existing templates. Let us now discuss the formation of these templates and their matching.

The formation of templates involves basically three steps: (1) sampling of certain selected features of the speech signal (pre-processing); (2) determining the actual word boundaries in the sample; and (3) formation of a template of processed features that characterize the word.


Fig. 9. The automatic speech recognition process. (Initial word entry is followed by speaking words for commands, data entry, etc., with retraining of "old" words or entry of new words as needed.)

The feature extraction will vary depending on the specific system.

One common approach is a bank of bandpass filters (often 16) providing a measurement of spectrum amplitudes[34]. Another, smaller system[35] extracts 4 features of the speech input. It uses 3 filters tuned to the midpoints of the first three speech formants for the average male speaker [36, 37], plus the zero-crossing rate, which provides an overall frequency measurement. When these features have been sampled at some rate and converted to digital form, the pre-processing phase is complete.
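A minimal sketch of such a four-feature front end is given below. For each short frame it computes three band energies (here crudely, from DFT magnitudes rather than analog filters) plus the zero-crossing count; the band edges, frame length and sampling rate are assumed values for illustration, not those of any particular recognizer.

import math

def frame_features(frame, fs=8000,
                   bands=((200, 900), (900, 2400), (2400, 3800))):
    n = len(frame)
    # Zero-crossing count: a rough overall-frequency indicator.
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    # Crude band amplitudes from DFT magnitudes (stand-ins for filter outputs).
    spectrum = [abs(sum(frame[t] * complex(math.cos(2 * math.pi * k * t / n),
                                           -math.sin(2 * math.pi * k * t / n))
                        for t in range(n)))
                for k in range(n // 2)]
    feats = []
    for lo, hi in bands:
        k_lo, k_hi = int(lo * n / fs), int(hi * n / fs)
        feats.append(sum(spectrum[k_lo:k_hi]))
    return feats + [zcr]          # [F1-band, F2-band, F3-band, ZCR]

fs = 8000
frame = [math.sin(2 * math.pi * 700 * t / fs) for t in range(160)]  # 20 ms tone
print(frame_features(frame, fs))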

The actual starting and ending points of the word must next be determined. A threshold value, based on a combination of features, can be used to detect the beginning and ending of words. Thus, in a four-feature system (three amplitude values and a zero-crossing value) the sum of the 3 amplitude values should be greater than a threshold chosen to screen out noise. However, for words beginning or ending in a low-energy fricative such as [s], the actual word length is considerably longer than that found from the above threshold. This problem can be resolved by selecting endpoints where the amplitude sum exceeds the threshold and then extending these endpoints until the zero-crossing measure drops below its own threshold[38]. This is illustrated in Fig. 10.

Another difficulty with word boundaries is the appearance of stop consonants such as the unvoiced [t] in eight. For some individuals this is a break of considerable duration, and the above algorithm must be modified to declare an endpoint only when the signal stays below the threshold for a certain duration, typically 100 msec.
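The endpoint logic just described can be sketched as follows. Given per-frame amplitude sums and zero-crossing counts, it finds the energetic span and then extends both ends while the zero-crossing measure stays high (to catch weak fricatives); the thresholds and the implied frame rate are assumed. A fuller version would also ignore internal energy dips shorter than about 100 ms, as noted above for stop consonants.

def find_endpoints(amp_sum, zcr, amp_thresh=100.0, zcr_thresh=25):
    # Frames whose summed band amplitude clears the energy threshold.
    above = [i for i, a in enumerate(amp_sum) if a > amp_thresh]
    if not above:
        return None
    start, end = above[0], above[-1]
    while start > 0 and zcr[start - 1] > zcr_thresh:             # extend backward
        start -= 1
    while end < len(amp_sum) - 1 and zcr[end + 1] > zcr_thresh:  # extend forward
        end += 1
    return start, end

# Toy frame sequence: silence, a voiced segment, then a trailing [s]-like region
# that is weak in energy but rich in zero crossings.
amp = [5] * 5 + [300] * 20 + [20] * 8 + [5] * 5
zcr = [3] * 5 + [15] * 20 + [40] * 8 + [3] * 5
print(find_endpoints(amp, zcr))    # -> (5, 32): the fricative tail is kept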

Once word boundaries have been determined, the template is formed by selecting a fixed number of time slices between the boundaries, with each feature represented in every slice. For example, if 32 slices are chosen and 4 features are used, the template for a given word is a 32 x 4 array: one row per time slice (1, 2, 3, ..., 31, 32) and one column per feature (F1, F2, F3, ZCR).
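A minimal sketch of this step: however many feature frames fall between the detected endpoints, they are mapped onto a fixed number of slices (32 here) so that every template has the same shape and can be compared element by element. The frame data are invented for illustration.

def make_template(frames, n_slices=32):
    # frames: list of [F1, F2, F3, ZCR] feature vectors between the endpoints.
    template = []
    for s in range(n_slices):
        # Pick the source frame nearest to this slice's relative position.
        idx = min(len(frames) - 1, (s * len(frames)) // n_slices)
        template.append(frames[idx])
    return template

# A hypothetical 47-frame word squeezed into a 32-slice template.
word_frames = [[i, 2 * i, 3 * i, i % 7] for i in range(47)]
template = make_template(word_frames)
print(len(template), "slices of", len(template[0]), "features each")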


Fig. 10. Beginning and ending points of words. (The plot compares the sum of the band amplitudes against its threshold and shows the zero-crossing measure used to fix the final start and end points.)

Each word in the vocabulary is represented by one (or, for greater accuracy, more) templates. When a word is now spoken to be recognized, its template is formed as described above and compared to each of the templates in the vocabulary to find the best match. If no match is sufficiently close, a possible action is to prompt the speaker to repeat the word. If there is still no successful match, but the word is indeed in the vocabulary, it may be necessary for the user to retrain the system for that word.

The matching is usually based on some simple distance measure. For example the Chebyshev distance, which is simply the sum of the differences between the two templates, has often been advocated because of its computational simplicity[39]. The Euclidean distance is also often used, but in a small real-time response system the multiplications required can be costly over a large vocabulary.
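The sketch below shows nearest-template matching with the sum-of-absolute-differences measure (Euclidean distance included for comparison) and a simple rejection threshold that triggers the "please repeat" action; the threshold value and the tiny 2 x 2 templates are assumed for illustration.

import math

def abs_distance(t1, t2):
    return sum(abs(a - b) for r1, r2 in zip(t1, t2) for a, b in zip(r1, r2))

def euclidean_distance(t1, t2):
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(t1, t2) for a, b in zip(r1, r2)))

def recognize(unknown, vocabulary, distance=abs_distance, reject_above=500.0):
    # Return the closest vocabulary word, or None if nothing is close enough.
    best_word, best_d = None, float("inf")
    for word, template in vocabulary.items():
        d = distance(unknown, template)
        if d < best_d:
            best_word, best_d = word, d
    return best_word if best_d <= reject_above else None

vocab = {"yes": [[10, 40], [12, 38]], "no": [[30, 5], [28, 6]]}
spoken = [[11, 39], [13, 37]]
print(recognize(spoken, vocab))    # -> yes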

Another approach to matching, using discriminant analysis, has been adopted in a computer-aided speech training system. In this system, a patient is prompted to say a word, and rather than having to recognize the spoken word, the system simply must decide if the utterance is a satisfactory match to a previously determined model of the word. Instead of using templates, discriminant functions are formulated. This technique has proven to be more efficient in both memory and speed than storing and matching patterns[40].

A typical problem with any speech recognition system is a limited vocabulary size. Even if storage is sufficient for a larger vocabulary, the time required to compute the matching between an input word template and the vocabulary templates can be prohibitive. One possible solution is partitioning the vocabulary set by hierarchical structuring, if the particular application permits. Another approach that has been taken is the use of context, specifically grammatical syntax, for the spoken input of BASIC programs[41].

An additional difficulty in isolated word systems is that due to slight variations in timing, both in overall length of words and length of phonemes within the word, a straightforward distance measure can be very poor in producing correct matches. A solution to this problem that is commonly employed is the use of dynamic programming[42]. In the matching, a dynamic programming algorithm is used to find the best possible time alignment between the vocabulary templates and the unknown word template [43,44]. This, of course, does incur a considerable penalty in computation speed.
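Dynamic-programming time alignment can be sketched briefly: instead of comparing slice i of one template only with slice i of the other, the algorithm searches for the monotonic alignment between the two feature sequences that minimizes total distance, absorbing differences in speaking rate. This is a generic textbook formulation, not the algorithm of any specific commercial unit.

def dtw_distance(seq_a, seq_b):
    # seq_a, seq_b: lists of feature vectors (one per time slice).
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum(abs(a - b) for a, b in zip(seq_a[i - 1], seq_b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],       # stretch seq_b
                                 cost[i][j - 1],       # stretch seq_a
                                 cost[i - 1][j - 1])   # advance both
    return cost[n][m]

# The same word spoken quickly and slowly still aligns with zero cost.
fast = [[1, 1], [5, 5], [9, 9]]
slow = [[1, 1], [1, 1], [5, 5], [5, 5], [9, 9]]
print(dtw_distance(fast, slow))    # -> 0.0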

Some researchers believe a large percentage of the practical speech input problems can be dealt with by using isolated word recognizers as discussed above. However, it is clear that under many circumstances the use of connected, relatively speaker-independent speech is desirable. There have been a number of efforts that have produced state-of-the-art systems for this domain, but generally not commercially available ones. Two particularly interesting approaches are seen in the systems Harpy and Hearsay-II.

Harpy [45,46] permitted continuous speech of about 1000 words in a specific task domain, document retrieval. A network structure was used involving all words in the vocabulary by outlining all possible combinations. Because of the number of combinations and variations in sounds, additional knowledge of speech characteristics and syntax was used in a unified framework to delimit these alternatives.

The Hearsay-II[47,48] system uses an idea known as a "blackboard" to organize its multiple sources of knowledge about speech and language. This concept has become very useful in the development of what are called "expert systems" in artificial intelligence [49].


As for the future of speech recognition, there is likely to be a spectrum of vocal input capabilities categorized by cost, computational capability and response time. However, it seems unlikely that an open-ended conversational capability with arbitrary individuals will be achieved without major breakthroughs in artificial intelligence and computation.

WHEN TO CONSIDER SPEECH

As with any emerging technology, speech I/O can be beneficially applied to many situations, but applications to inappropriate situations can result in lowered performance. The guidelines presented in Table 2 have been proposed by Fink[50]:

In most instances, applications are limited only by imagination, creativity and understanding.

Table 2. Speech use guidelines

YOU CAN USE SPEECH IF:
- Human interaction with a computer-based system or process is required to accomplish the task (e.g. CAD system control).
- Operator usage is repetitive and frequent (e.g. receiving record creation).
- A controlled set of operators will use the system (e.g. quality assurance inspection recording).
- Generic terminology of less than 200 unique terms is required (e.g. word processing function control).
- An operator benefit can be identified (a reason to change).

YOU SHOULD NOT USE SPEECH IF:
- Operator usage is infrequent or consists of a large uncontrolled population (e.g. airline reservation).
- One utterance and one keystroke are equivalent (e.g. typewriter).
- No operator-perceptible benefit can be identified (change for change's sake).

CONCLUSIONS

The field of speech communication with computers is growing at an ever-increasing rate. The technology is rapidly moving out of the laboratory and into the field. Applications, well into the hundreds if not thousands, span inspection, sorting, data entry, data or speaker verification and aids to the handicapped, to name but a few. If the appropriate design/selection/integration guidelines are developed and consistently used, speech communications may well catalyze major advances in industrial data systems, their accessibility, usefulness and acceptance.

REFERENCES

1. A. House, On vocal duration in English. J. Acoustical Society of America 33, 1174-1178 (Sept. 1961).
2. C. M. Wise, Introduction to Phonetics. Prentice-Hall, Englewood Cliffs, New Jersey (1958).
3. D. Klatt, Acoustic theory of terminal analog speech synthesis. Record of IEEE 1972 Conference on Speech Communication and Processing, 131-135 (Apr. 1972).
4. D. Klatt, Structure of a phonological rule component for a synthesis-by-rule program. IEEE Trans. on Acoustics, Speech, and Signal Processing 24, 391-398 (Oct. 1976).
5. D. R. Morgan & S. C. Craig, Real-Time Linear Prediction Using the LMS Gradient Algorithm. General Electric, Syracuse, New York (1976).
6. D. O'Shaughnessy, Consonant durations in clusters. IEEE Trans. on Acoustics, Speech, and Signal Processing 22, 282-295 (Aug. 1974).
7. H. S. Elovitz, Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Trans. on Acoustics, Speech, and Signal Processing 24, 446-459 (Dec. 1976).
8. J. Allen, Synthesis of speech from unrestricted text. Proc. IEEE 64, 433-442 (1976).
9. J. D. Markel & A. H. Gray, Jr., Linear Prediction of Speech. Springer-Verlag, New York (1976).
10. J. L. Flanagan, Computers that talk and listen: man-machine communication by voice. Proc. IEEE 64, 405-415 (Apr. 1976).
11. J. L. Flanagan, The synthesis of speech. Scientific American 226, 48-57 (Feb. 1972).
12. J. L. Flanagan, Voices of men and machines. J. Acoustical Society of America 51, 1375-1387 (May 1972).
13. J. Friedman, Computer exploration of fast-speech rules. IEEE Trans. on Acoustics, Speech, and Signal Processing 23, 100-103 (Feb. 1975).
14. J. N. Holmes, The influence of glottal waveform on the naturalness of speech from a parallel-formant synthesizer. Record of IEEE 1972 Conference on Speech Communication and Processing, 148-151 (Apr. 1972).
15. L. D. Rice, Friends, humans, and countryrobots: lend me your ears. BYTE, 16-24 (Aug. 1976).
16. L. R. Rabiner & R. W. Schafer, Digital techniques for computer voice response: implementations and applications. Proc. IEEE 64, 416-433 (Apr. 1976).
17. R. R. Leutenegger, The Sounds of American English. Scott, Foresman, Chicago (1963).
18. R. Wiggins & L. Brantingham, Three-chip system synthesizes human speech. Electronics, 109-116 (Aug. 1978).
19. S. V. Rao & R. B. Thosar, A programming system for studies in speech synthesis. IEEE Trans. on Acoustics, Speech, and Signal Processing 22, 217-225 (June 1977).
20. The Digital Group/Votrax Voice Synthesizer. The Digital Group, Denver, Colorado (1978).
21. W. Atmar, The time has come to talk. BYTE, 26-33 (Aug. 1976).
22. Y. Hosni, Reengineering for the blind. Technical Report, Division of Blind Services/University of Central Florida, Orlando, Florida, 25-28 (Oct. 1979).
23. W. Lea, Speech recognition: past, present and future. In Trends in Speech Recognition (Edited by W. Lea), pp. 39-98. Prentice-Hall, Englewood Cliffs, New Jersey (1980).
24. D. R. Reddy, Speech recognition by machine: a review. Proc. IEEE 64, 501-531 (1976).
25. G. Kaplan, Words into action: I. IEEE Spectrum 17, 22-26 (June 1980).
26. T. B. Martin, Practical applications of voice input to machine. Proc. IEEE 64, 487-501 (1976).
27. B. Beek, E. Neuberg & P. Hodge, An assessment of the technology of automatic speech recognition for military applications. IEEE Trans. on Acoustics, Speech, and Signal Processing 25, 301-327 (1977).
28. M. Joost & F. Petry, A vocal input system for use in speech therapy. Proc. IEEE Int. Conf. on Cybernetics and Society, Tokyo, Japan, 961-964 (Oct. 1978).
29. J. R. Welch, Automatic speech recognition: putting it to work in industry. IEEE Computer 13, 65-73 (May 1980).
30. G. Doddington & T. Schalk, Speech recognition: turning theory into practice. IEEE Spectrum 18, 26-32 (Sept. 1981).
31. A. B. Rosenberg, Automatic speaker verification: a review. Proc. IEEE 64, 475-487 (1976).
32. B. Atal, Automatic recognition of speakers from their voices. Proc. IEEE 64, 466-475 (1976).
33. G. R. Doddington, Voice identification for entry control. Symp. Voice Interactive Systems: Applications and Payoffs, Dallas, Texas, pp. 73-84 (May 1980).
34. S. Viglione, Voice recognition module: VRM. Symp. Voice Interactive Systems: Applications and Payoffs, Dallas, Texas, pp. 451-460 (May 1980).
35. Heuristics, Inc., SPEECHLAB Laboratory Manual and Hardware Manual. Heuristics, Inc., Los Altos, California (1977).
36. G. Fant, The acoustics of speech. Proc. 3rd Int. Conf. on Acoustics, 1969; reprinted in Speech Analysis (Edited by R. Schafer and J. Markel), pp. 188-201. IEEE Press (1978).
37. P. Denes & E. Pinson, The Speech Chain. Bell Telephone Laboratories (1963).
38. L. Rabiner & M. Sambur, An algorithm for determining the endpoints of isolated utterances. Bell System Tech. J. 54 (Feb. 1975).
39. G. M. White & R. B. Neely, Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming. IEEE Trans. on Acoustics, Speech, and Signal Processing 24 (Apr. 1976).
40. M. Joost, F. Petry & M. Olroyd, Computer-aided speech training in articulation therapy. J. Acoustical Society of America, Supplement 1, 67 (Spring 1980).
41. F. Petry & B. Pierce, Proc. IEEE Southeastcon '81, Huntsville, Alabama (March 1981).
42. R. Bellman, Dynamic Programming. Princeton University Press, Princeton, New Jersey (1957).
43. R. A. Smith & M. Sambur, Hypothesizing and verifying words for speech recognition. In Trends in Speech Recognition (Edited by W. Lea), pp. 139-165. Prentice-Hall, Englewood Cliffs, New Jersey (1980).
44. Y. Kato, Words into action III: a commercial system. IEEE Spectrum, 29 (June 1980).
45. B. Lowerre & R. Reddy, The HARPY speech understanding system. In Trends in Speech Recognition (Edited by W. Lea), pp. 340-360. Prentice-Hall, Englewood Cliffs, New Jersey (1980).
46. R. Reddy, Words into action II: a task-oriented system. IEEE Spectrum, 26-28 (June 1980).
47. L. D. Erman, F. Hayes-Roth, V. R. Lesser & D. R. Reddy, The HEARSAY-II speech understanding system: integrating knowledge to resolve uncertainty. Computing Surveys 12(2), 213-253 (1980).
48. L. D. Erman & V. R. Lesser, The HEARSAY-II speech understanding system: a tutorial. In Trends in Speech Recognition (Edited by W. Lea), pp. 361-381. Prentice-Hall, Englewood Cliffs, New Jersey (1980).
49. A. Barr & E. Feigenbaum (Eds.), The Handbook of Artificial Intelligence, Vol. I. W. Kaufman, Los Altos, California (1981).
50. D. F. Fink, Corporate Strategic Staff, Intel Corporation. Personal communication (1982).