
U.D.C. 612.78: 681.84

Speech and Vocoders

By L. C. KELLY, C.Eng., M.I.E.R.E.†

Presented at a meeting of the Institution's Communications Group in London on 8th January 1969 and at a meeting of the East Anglian Section in Hornchurch on 29th October 1969.

Speech signals are produced by relatively slow articulatory movements. This suggests that the information rate of the speech signal is much less than would be expected by considering the bandwidth of the acoustic signal. Vocoders attempt to exploit the redundancy in the speech waveform by extracting and transmitting the information-bearing parameters of the speech signal. At the receiver, these parameters are used to control a speech synthesizer that reproduces the original signal without any serious loss of intelligibility but with some degradation of quality.

The paper describes speech production principles, and their application to speech synthesis; the operation of various types of vocoder; and the problems of pitch extraction.

1. Introduction

Speech can be considered from many viewpoints: phoneticians, physiologists, and linguists would all consider the speech process in different ways. From the point of view of the communications engineer speech is an analogue signal whose bandwidth extends from the very low frequencies in the audio range (about 50 Hz) up to frequencies of the order of 8-10 kHz. Direct transmission of high quality speech therefore requires a bandwidth of this order. Fairly good quality speech can still however be transmitted with a bandwidth of approximately half that quoted, because there are sufficient clues to perception in the lower half of the frequency band.

If the analogue speech signal is converted into digital form, for example by pulse code modulation, then an information rate of between 20,000 and 100,000 bits/second is needed for transmission, depending on the subjective quality required for the received signal.
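As a rough check on the quoted range, the PCM information rate is simply the Nyquist sampling rate multiplied by the sample precision. The particular bandwidth and bit-depth figures below are illustrative assumptions, not values taken from the paper:

```python
# Rough check of the quoted PCM information rates.
# Assumed figures (not from the paper): a 5 kHz speech band sampled at the
# Nyquist rate, with quantizer precision from coarse to high quality.
def pcm_rate(bandwidth_hz, bits_per_sample):
    """Information rate of PCM: Nyquist sampling rate times sample precision."""
    return 2 * bandwidth_hz * bits_per_sample

low = pcm_rate(5_000, 2)    # coarse quantization -> 20 000 bit/s
high = pcm_rate(5_000, 10)  # fine quantization  -> 100 000 bit/s
print(low, high)
```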

A speech signal is produced by relatively slow articulatory movements, and these movements can be described with sufficient accuracy by signals of much lower information rate than might be expected by considering the bandwidth of the acoustic signal. This implies that much of the waveform detail of speech is in fact redundant, and that a speech transmission system using some sort of model of the human speech production process at the receiver might require less transmission channel capacity than is needed for the speech waveform. Such systems are known as analysis-synthesis telephony systems or vocoders. Vocoders (the word comes from voice and coders) exploit the redundancy in the speech waveform. This is done by analysing the speech waveform and attempting to extract information-bearing parameters. These parameters can then be transmitted with a much lower bandwidth, or can be digitally encoded at a much lower bit rate, than the original signal. At the receiver, speech is then synthesized from the transmitted parameters, usually with some degradation of quality, but without any serious loss of intelligibility.

† Joint Speech Research Unit, Post Office, Ruislip, Middlesex.

No practical vocoder has yet been developed which is sufficiently tolerant of a wide range of speech material, and which gives a sufficiently high quality speech output signal, to be acceptable for commercial telephone use. The vocoder can still however provide a useful service in certain situations, for example on h.f. radio to provide narrow-band digital transmission with its immunity to noise, for space communications where the signalling rate must be kept low, and for military communications to provide speech privacy.1

Another recent use for a suitably modified vocoder is to correct the change of quality of divers' speech caused by breathing a helium-oxygen mixture (the 'Donald Duck' effect).

The purpose of this paper is to explain the principles of speech production and vocoder operation, to discuss various types of vocoder, and to illustrate the types of speech quality that can be obtained from vocoders.

2. Speech Production

Figure 1 shows in diagrammatic form the essential parts of the human vocal system. The vocal organs are the lungs, the trachea or windpipe, the larynx, the pharynx or throat, the nose, and the mouth. The part of this 'tube' extending from the larynx to the lips is known as the vocal tract, and in an adult male is about 17 cm long. The shape of this tract is varied extensively during speech production by moving the lips, the tongue, and the jaw, i.e. the articulatory organs.

A typical speech waveform is shown in Fig. 2. This was produced from a recording made by a phonetician speaking in an anechoic chamber. The waveform was produced when the vowel 'A' in the word comfort was being said. Generally, we can say three things

The Radio and Electronic Engineer, Vol. 40. No. 2, August 1970 73



about the waveform that are fairly obvious:

(a) It looks a fairly complicated function of time.

(b) Its form is consistent with its having been produced by some sort of resonant system.

(c) It exhibits a periodicity.

Having examined a typical vowel waveform it is of interest to consider how some of the sounds of speech are produced.

Speech is produced by two basic types of sound source; the class of sounds known as voiced sounds are produced by puffs of air released during vocal cord vibration. Voiced sounds include all the vowels and such consonants as m, n, l and r. The other class of sounds, unvoiced sounds, are produced when turbulence is caused by air being forced through a narrow constriction somewhere in the vocal tract; examples of unvoiced sounds are f, s, p and t.

Some sounds, such as z and v, require the use of both types of sound source simultaneously.

Fig. 1. Schematic diagram of the human vocal system. (Labelled parts include the nasal cavity, hard palate, soft palate, uvula, pharynx, tongue, epiglottis, larynx and vocal cords, trachea, and oesophagus.)

Fig. 2. The speech waveform corresponding to an occurrence of the vowel 'A' in the word comfort. (Time axis in milliseconds.)

2.1. Voiced Sounds

During ordinary breathing the vocal cords are in a relaxed condition and are held fairly wide apart, but during voiced sounds they are drawn together. These cords are in fact folds of ligament at the top of the trachea, and the slit-like orifice between them is called the glottis. When we produce a voiced sound air travels from the lungs up to the trachea, and builds up

a pressure behind the vocal cords; these are pushed apart and air rushes through the narrow glottal opening, slowing down again when it reaches the wider pharynx above. By a combination of muscular tension in the cords and the lowering of pressure in the glottis due to the Bernoulli effect, the vocal cords are drawn back to their starting position and the air flow ceases. The sub-glottal pressure then forces the cords apart again and the whole cycle is repeated. It can be seen therefore that the vocal cords act as an intermittent barrier to the flow of air from the lungs, and in fact chop the air stream so that a discrete set of puffs is produced.

The vocal cord vibration period is a function of vocal cord mass, tension, and sub-glottal pressure. For normal male talkers these puffs of air are produced with a frequency typically in the range 50-250 Hz, extending to 500 Hz and higher for women and children. Although the frequency of vocal cord vibration is fairly high, its rate can be changed only slowly, by varying the sub-glottal pressure and tension of the vocal cords, both of which are under muscular control. These puffs of air constitute the basic generator for the voiced sounds of speech; typically the shape of a glottal puff (i.e. the volume velocity of air plotted against time) is approximately triangular, and since these puffs are quasi-periodic they can be considered to have an approximation to a line spectrum.

Figure 3 shows the shape of some typical glottal puffs, and Fig. 4 the spectral envelope of a single puff. The repetition rate of the larynx puffs is closely related to the perceived pitch of the voice, and for this reason is known colloquially to vocoder engineers as the pitch frequency, or simply the pitch. In most electronic speech synthesizers electrical pulses are used to approximate the larynx puffs.
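A quasi-periodic train of roughly triangular pulses of this kind is easy to generate numerically. In the sketch below the sampling rate, pitch, pulse width and the asymmetric triangle shape are all illustrative assumptions, not measurements from the paper:

```python
# Sketch of an idealized larynx excitation: a train of triangular "glottal
# puffs" of volume velocity. All numerical values are assumed for illustration.
def glottal_pulse_train(n_samples, fs=8000, pitch_hz=100, open_time=0.003):
    period = int(fs / pitch_hz)          # samples per vocal period
    width = int(fs * open_time)          # samples for which the glottis is open
    train = []
    for n in range(n_samples):
        phase = n % period
        if phase < width:
            # asymmetric triangle: slow rise over two-thirds of the pulse,
            # faster fall back to zero
            peak = 2 * width // 3
            train.append(phase / peak if phase < peak
                         else (width - phase) / (width - peak))
        else:
            train.append(0.0)            # glottis closed: no air flow
    return train

x = glottal_pulse_train(800)             # 100 ms of excitation at 8 kHz
print(max(x), x.count(0.0) > 0)
```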

If we were able to listen directly to the larynx output it would sound rather buzzy, as might be expected from the waveform shape. Normally however the larynx output signal reaches our ears via the vocal tract, and it is the shape of the vocal tract that determines the speech sound quality we hear. The vocal tract can be shown to be quite closely analogous (up to fairly high audio frequencies) to a mismatched non-uniform transmission line, and for radio engineers is probably most easily considered from this viewpoint; it is therefore a resonant system that intensifies the energy of certain bands of frequencies. These resonances, whose frequencies can be changed by movement of the articulators, are given the name 'formants'. The formants superimpose their response on the vocal cord signal to produce the voiced sounds of speech. Voiced sounds are usually characterized by three or four formants in the frequency range up to 4000 Hz. For convenience the formants are numbered, the lowest one (F1) corresponding to the lowest frequency, etc.

Fig. 3. The larynx waveform of two vocal periods corresponding to the speech waveform in Fig. 2. (Volume velocity against time, 0-12 ms.)

Figure 5 shows the spectrum of an occurrence of the vowel 'A'. This clearly shows the line structure and the formants.

2.2. Unvoiced Sounds

Many of the sounds of speech come into the class of sounds known as unvoiced. During unvoiced sounds the vocal cords are held wide apart and the air stream from the lungs is forced through a constriction, between the tongue and the teeth as in 's' or between the teeth and the lips as in 'f', causing turbulence and producing the characteristic 'fricative' sounds of speech. The basic generator for this type of sound is the air stream, whose source can be considered to be at the point of constriction. The unvoiced sounds do not exhibit the harmonic structure of the voiced sounds, and the sound generator is probably best thought of in electrical terms as a random noise source. The energy of the fricative sounds is generally much lower than that of voiced sounds, and the resonances in the system have greater bandwidth than for the voiced sounds.

Fig. 4. Calculated amplitude spectrum for one of the vocal periods in the previous figure. (Frequency in kHz; 12 dB/octave high-frequency emphasis indicated.)

Fig. 5. Spectrum of an occurrence of the vowel 'A' in the word comfort. (Amplitude in dB against frequency, 0-4 kHz.)

2.3. Plosive Consonants

Another class of sounds that are produced fairly often in speech are the plosives or stop consonants; these sounds are produced by stopping the flow of air from the lungs by blocking the vocal tract at some point and then very quickly releasing the air pressure. The plosives therefore are always characterized by a silence preceding the burst of energy; they may be voiced, for example as in b and d, or unvoiced such as p and t.

The preceding description of speech production is not intended in any way to be exhaustive, but it is hoped that a reasonable idea of the mechanism has been conveyed.

3. Speech Synthesis

Having considered how the various speech sounds are produced, it is now possible to see how this knowledge can be applied to the synthesis of speech.

Most of the earliest speech synthesizers were mechanical, and in fact constructed as long ago as the 18th century. Alexander Bell in the late 1800s made a speech synthesizer by making a cast of a human skull and moulding the vocal tract and cords from rubber and similar materials. Mechanical models however are very difficult to control, and progress has only been made with the advent of electronic speech synthesizers.

Speech production can be thought of as a combination of two functions; firstly the generation of a sound source, and secondly the modification of the sound from this source by the vocal tract. These two functions are given names taken from general circuit theory, and are known as the excitation and system functions respectively. Most speech synthesizers depend on this idea of the separation of the speech signal into an excitation function and a system function. The excitation function can be represented in

electrical terms as a generator producing a periodic waveform rich in harmonics during voiced sounds, and a random noise source during unvoiced sounds. Many synthesizers do not provide the mixed excitation signal required for the voiced fricative sounds such as z and v. In practice it is found that there are sufficient other perceptual clues for these sounds to be correctly identified in spite of this restriction. The system function (corresponding to the vocal tract) can be represented by a linear, time-varying four-terminal network terminated in a resistance representing the radiation resistance of the mouth. Control parameters are used to vary the frequency response or spectral envelope of the four-terminal network to produce synthetic speech. Speech synthesizers of this type have been known for some time, and in some instances the network corresponding to the vocal tract has been a lumped approximation to an electrical transmission line with parameters, corresponding to the dimensions of the vocal tract, that can be varied. Other types of synthesizer have been simpler in that they approximate the response of the vocal tract with simple resonant circuits. These are usually given the name formant synthesizers. A block diagram of a typical formant synthesizer is shown in Fig. 6.

Fig. 6. Block diagram of a formant synthesizer. This synthesizer is normally controlled from a punched paper tape generated by a digital computer. (The diagram shows a photo-electric tape reader and decoder supplying pitch frequency and a voiced/unvoiced switch selecting between pulse and noise generators; amplitude control circuits A1-A4; a fixed fourth-formant resonant circuit at 3500 Hz; an unvoiced-energy band-pass path; and variable resonant circuits for the first three formants with ranges of approximately 100-1000 Hz, 700-2600 Hz and 1500-3300 Hz.)

4. Formant Synthesizer

Formant synthesizers are of two basic types. Those in which the resonant circuits are cascaded are called serial synthesizers, and those whose outputs are combined in parallel are known as parallel synthesizers. The particular synthesizer shown in Fig. 6 was designed to operate from a punched paper tape generated by a digital computer. It is a parallel-type formant synthesizer and single tuned circuits are used to generate the formant peaks. The centre frequencies of three of the tuned circuits, corresponding to F1, F2 and F3, are dynamically controlled by voltage analogues generated by the computer. The bandwidths of the tuned circuits are fixed, and are set to between 60 and 120 Hz. Analogue signals representing the amplitudes of the formants are fed to the amplitude control circuits, and at the same time a common excitation signal (pulses or random noise) is fed to all controllers. The output signal from each amplitude control circuit is therefore a wide-band (4000 Hz) excitation signal whose amplitude has been determined by the appropriate analogue voltage A1, A2 or A3. These signals are fed to the required voltage-controlled tuned circuit. These variable-frequency resonators select the band of the excitation signal in the region of their centre frequency, and their outputs are simply added to provide the synthetic speech. Two other circuits are added as a refinement to this basic synthesizer; one is a fixed resonant circuit corresponding to a fourth formant, and the second a further wide-band filter that is switched on only during unvoiced sounds to provide a further improvement in quality. The excitation signal is provided either from a 'voltage-controlled' pulse generator during voiced sounds or from a random noise source during unvoiced sounds.

The speech quality obtainable from a synthesizer of this type can be very good indeed if great care is taken to provide the correct control parameters.
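The parallel structure described above can be sketched in software: each formant is realized as a two-pole digital resonator fed with a common excitation, and the amplitude-weighted outputs are summed. This is a minimal sketch, not the J.S.R.U. hardware; the sample rate, formant frequencies, bandwidths and amplitudes are illustrative assumptions:

```python
import math

# Minimal sketch of a parallel formant synthesizer.
# Each formant is a digital two-pole resonator; all numerical values assumed.
def resonator(x, fc, bw, fs=8000):
    """Filter sequence x through a two-pole resonance at fc Hz, bandwidth bw Hz."""
    r = math.exp(-math.pi * bw / fs)              # pole radius from bandwidth
    c1 = 2 * r * math.cos(2 * math.pi * fc / fs)  # feedback coefficients
    c2 = -r * r
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = xn + c1 * y1 + c2 * y2
        out.append(yn)
        y1, y2 = yn, y1
    return out

def synthesize(excitation, formants):
    """Sum amplitude-weighted resonator outputs (parallel connection)."""
    mixed = [0.0] * len(excitation)
    for fc, bw, amp in formants:
        for n, yn in enumerate(resonator(excitation, fc, bw)):
            mixed[n] += amp * yn
    return mixed

# Impulse train at a 100 Hz pitch; three formants with assumed values.
pulses = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
speech = synthesize(pulses, [(700, 90, 1.0), (1200, 100, 0.6), (2500, 120, 0.3)])
print(len(speech))
```

Switching the pulse train for a random noise source, as the V/UV switch in Fig. 6 does, gives the unvoiced case with the same filter structure.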

5. Vocoders

A vocoder is a device that performs measurements on speech that are in some way related to the short-term spectral envelope of the speech signal. Parameters are extracted from these measurements, and transmitted with considerably less bandwidth than required by the speech signal. At the receiver speech is synthesized from the transmitted parameters. It is the choice of these parameters that distinguishes the different types of vocoders.

5.1. Formant Vocoders

The formant vocoder analyser attempts to measure in real time the frequency and amplitude of spectral peaks of the speech signal (the formants) and transmits these measurements as parameters to control a synthesizer similar to the one shown in Fig. 6. At the same time the larynx vibration rate, and the decision as to whether the speech is voiced or unvoiced, must also be measured and transmitted as additional control parameters. It is generally recognized that good speech quality can be obtained from a formant synthesizer, but when connected to a real-time analyser the results obtained to date have always been inferior. This is mainly because of the difficulty in measuring formant frequencies accurately. Figure 6 shows that the formant frequency ranges overlap, and this tends even more to aggravate the problem. The advantage of the formant vocoder over most other types of vocoder is the bandwidth compression that can be obtained. Typically a formant vocoder needs eight parameters and each of these can be band-limited to about 25 Hz, so that a bandwidth reduction of the order of 20:1 can be obtained relative to the original speech band. It should however be emphasized that all the difficulties of real-time formant analysis are not yet solved, and that until they are it is unlikely that formant vocoders will prove to be very useful.
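The 20:1 figure quoted above follows from simple arithmetic, using the parameter count and per-parameter bandwidth stated in the text and taking roughly 4 kHz as the original speech band:

```python
# Bandwidth compression of a formant vocoder, using the figures in the text.
n_parameters = 8          # formant frequencies and amplitudes, pitch, voicing
hz_per_parameter = 25     # each control signal band-limited to about 25 Hz
speech_band = 4000        # original speech band in Hz (roughly 4 kHz)

vocoder_band = n_parameters * hz_per_parameter   # 200 Hz in total
print(speech_band // vocoder_band)               # compression of the order 20:1
```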

5.2. Spectrum Channel Vocoder

This is the original channel vocoder invented in the 1930s by Homer Dudley of Bell Telephone Laboratories. In the channel vocoder the short-term spectral envelope of the speech signal is represented by samples spread across the frequency axis. These samples are usually obtained from a bank of band-pass filters whose centre frequencies are spaced across the speech band. Normally between 10 and 20 filters are used to give a corresponding number of samples of the spectral envelope; the outputs of the filters are then rectified and smoothed by low-pass filters to give a set of time-varying average signals representing the short-term spectral envelope. As in the formant vocoder, pitch analysis has also to be performed. The resulting signals are then transmitted to the vocoder synthesizer and are used to control the frequency response of what is in effect a time-varying band-pass filter that is fed with a spectrally flat excitation signal. The block diagram of a typical


channel vocoder is shown in Fig. 7. The analyser includes the filter bank, rectifiers and low-pass filters mentioned previously. The purpose of the equalizer shown at the analyser input is to ensure that the band-pass filters in the analyser have comparable signal levels over the whole frequency range. Another feature shown here is the presence of the logarithmic amplifiers (one per channel) connected to the outputs of the band-pass filters, whose purpose is to increase the dynamic range of the vocoder.

Fig. 7. Block diagram of a typical channel vocoder. (Each of N channels covering the range 200-4000 Hz comprises, on the send side, an analysis filter, logarithmic amplifier, and rectifier and low-pass filter feeding the transmission path, and, on the receive side, an anti-log amplifier and modulator feeding a synthesis filter supplied from an excitation generator.)

The low-pass filtered signals obtained at the output of the analyser are connected to the transmission path, and at the receiver are used to dynamically control the gain of each synthesis filter. The synthesis filters are supplied with a wide-band (pulse or random noise) excitation signal controlled from the analyser, and each synthesis filter selects a band of excitation about its centre frequency. Summation of the output signals from the synthesis filters and further equalization yields the synthetic speech.

The vocoder, in common with other speech processing systems, must have provision for handling the wide dynamic range of the input speech signal. An experimental vocoder at the Joint Speech Research Unit has been designed to operate over a range of 50 dB; this is achieved by employing logarithmic amplifiers immediately following each analysis filter and a corresponding anti-logarithmic device preceding the synthesis filters. It is thought that for most purposes this dynamic range is sufficient, but in cases where a larger spread of input signal level is expected a Voice Operated Gain Adjusting Device (VOGAD) can be used to provide an extra dynamic range of about 20 dB.

The bandwidth reduction or compression obtained from a channel vocoder is not as large as that achieved by a formant vocoder. Typically it needs twice the number of control parameters required by a formant vocoder at about the same bandwidth per parameter, and gives therefore a bandwidth reduction of approximately 10 to 1. The channel vocoder has the advantage however over the formant vocoder that at the present time it will tolerate a larger proportion of speakers' voices and gives a more satisfactory synthetic speech output signal than the formant vocoder.
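The analyse-rectify-smooth-resynthesize loop of the channel vocoder can be sketched numerically. The filter realization (a two-pole resonator standing in for a proper band-pass filter), the channel spacing and all numerical values below are simplified assumptions, not the J.S.R.U. design:

```python
import math

# Skeleton of a channel vocoder loop: analyse with a bank of band-pass
# filters, rectify and smooth each output, then reimpose those envelopes on
# a spectrally flat excitation through a matching synthesis filter bank.
def bandpass(x, fc, bw, fs=8000):
    """Two-pole resonator used here as a crude band-pass filter (assumption)."""
    r = math.exp(-math.pi * bw / fs)
    c1, c2 = 2 * r * math.cos(2 * math.pi * fc / fs), -r * r
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = xn + c1 * y1 + c2 * y2
        out.append(yn)
        y1, y2 = yn, y1
    return out

def envelope(x, alpha=0.01):
    """Rectify and smooth: a full-wave rectifier into a one-pole low-pass."""
    level, out = 0.0, []
    for xn in x:
        level += alpha * (abs(xn) - level)
        out.append(level)
    return out

def channel_vocoder(speech, excitation, centres, bw=200):
    out = [0.0] * len(speech)
    for fc in centres:
        env = envelope(bandpass(speech, fc, bw))   # analysis side
        band = bandpass(excitation, fc, bw)        # synthesis side
        for n in range(len(out)):
            out[n] += env[n] * band[n]             # gain-controlled channel
    return out

centres = range(300, 3900, 400)   # channels spread over roughly 200-4000 Hz
speech = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
buzz = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]  # 100 Hz excitation
y = channel_vocoder(speech, buzz, centres)
print(len(y))
```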

5.3. Vocoders for Digitized Speech Transmission

Most modern vocoders have been designed to use a digital transmission path. One reason for this is that it is much easier to transmit a serial digit stream than a large number of low-frequency analogue signals, and another is that it is much easier to introduce privacy into a 'digital' speech link.1 To digitize a channel vocoder both spectrum and pitch channels have first to be multiplexed and then coded by some digital coding scheme such as delta modulation or pulse code modulation. At the receiver a digital-to-analogue convertor reconstitutes the signals into an analogue form; these analogue signals are then filtered by low-pass filters with cut-off frequencies of the order of 25 Hz and are then used as control signals to the synthesizer. Digit rates used are of the order of 2000-3000 bits/s, which can be transmitted by modern data modems over normal 3 kHz lines.

5.4. Pitch Extraction

It has been tacitly assumed so far that measurement of larynx vibration rate, or 'pitch extraction', is a fairly insignificant part of the vocoding process. In fact accurate measurement of pitch is probably one of the most difficult vocoder operations. In principle this measurement of the vocal cord vibration rate is simple, since by passing the speech waveform through a low-pass filter the fundamental frequency component can be extracted. In practice things are not so easy; the range that the pitch frequency can occupy is large (3 or 4 octaves), pitch inflexions can be rapid, and on some circuits where vocoders might be useful the fundamental component in the available input signal is weak or missing altogether. Errors that occur during pitch extraction can cause very objectionable effects in the resultant synthetic speech; for example, if the pitch extractor occasionally measures the second harmonic instead of the fundamental frequency (a common fault), the sudden 'squeak' that occurs sounds extremely unnatural.


Many schemes have been devised for the measurement of fundamental frequency. Early vocoders used a simple low-pass filter to extract the fundamental component from the speech waveform; a frequency meter was then used to derive an analogue of the pitch frequency. With a high-quality input signal and carefully spoken speech fairly good results could be obtained. However more elaborate schemes are necessary for greater tolerance to various input signals. One such scheme that gives better results uses a 'tracking' low-pass or band-pass filter to follow changes in the fundamental frequency of the speech waveform. Pitch extractors of this type can work well over a restricted frequency range if they have as their input a high quality speech signal of good signal/noise ratio. In the last decade autocorrelation techniques have been used to measure voice pitch. Simple autocorrelation of the speech waveform, however, does not give particularly good results during fast pitch inflexions, or during rapid formant transitions. Some pre-processing of the signal before autocorrelation may be expected to improve measurements, and Gill2 has reported a successful pitch extractor that autocorrelates 'envelopes' derived from the speech waveform. More recently Sondhi3 has suggested the autocorrelation of centre-clipped speech. Centre clipping the speech removes most of the zero crossings from the waveform and this has the effect of compressing or flattening the spectral envelope. This centre-clipped or spectrally-flattened speech is then fed to an autocorrelator, whose task is made much easier than that of a correlator that has to operate directly on the speech waveform.
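The centre-clipping idea can be sketched as follows. The clipping threshold, search range and test signal are illustrative assumptions; the test signal deliberately contains a strong second harmonic, the 'common fault' mentioned earlier:

```python
import math

# Sketch of pitch estimation by autocorrelation of centre-clipped speech.
# Threshold and search range are illustrative assumptions.
def centre_clip(x, fraction=0.3):
    """Zero everything within +/- fraction of the peak; keep the remainder."""
    c = fraction * max(abs(v) for v in x)
    return [v - c if v > c else v + c if v < -c else 0.0 for v in x]

def pitch_period(x, fs=8000, fmin=50, fmax=500):
    """Lag of the largest autocorrelation peak inside the pitch range."""
    y = centre_clip(x)
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_lag, best = lo, -float("inf")
    for lag in range(lo, min(hi, len(y) - 1)):
        r = sum(y[n] * y[n - lag] for n in range(lag, len(y)))
        if r > best:
            best, best_lag = r, lag
    return best_lag

# A 125 Hz fundamental plus a strong second harmonic at 250 Hz.
fs = 8000
x = [math.sin(2 * math.pi * 125 * n / fs)
     + 0.8 * math.sin(2 * math.pi * 250 * n / fs) for n in range(640)]
print(fs / pitch_period(x))   # close to 125 Hz, not the 250 Hz harmonic
```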

A different approach to fundamental frequency extraction has been described by Noll4 and is known as the 'cepstrum' method. The cepstrum is the Fourier transform of the logarithm of the power spectrum of a signal. Because a speech waveform is nearly periodic during a voiced sound it has an approximation to a line spectrum; this spectrum has periodic ripples in it at the 'line' spacing, which corresponds to the fundamental frequency. Taking the logarithm of this spectrum compresses the peaks due to the formants and gives an equal weight to the ripples in the high-energy and low-energy regions. The Fourier transform of this log-spectrum exhibits a large peak corresponding to the pitch frequency.

In a practical cepstrum pitch extractor, log-spectra are continuously generated and the Fourier transforms calculated; the positions of the resultant cepstral peaks are then used to provide an estimate of the pitch period. Since the logarithmic spectrum will have a strong periodic component even when the fundamental is missing, this method will give good results for telephone-quality speech. Cepstrum pitch extractors, although apparently successful, are extremely complicated to implement, and something simpler is desirable.
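A minimal numerical sketch of the cepstrum method follows; the frame length and test signal are assumed, and a plain DFT stands in for whatever transform hardware a practical extractor would use:

```python
import cmath
import math

# Sketch of cepstrum pitch extraction: Fourier transform of the log power
# spectrum, then look for a peak at lags inside the expected pitch range.
def dft(x):
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def cepstrum_pitch_period(frame, fs=8000, fmin=50, fmax=500):
    spectrum = dft(frame)
    # small floor avoids log(0) on empty spectral bins
    log_power = [math.log(abs(s) ** 2 + 1e-12) for s in spectrum]
    cep = dft(log_power)                 # transform of the log spectrum
    lo, hi = int(fs / fmax), int(fs / fmin)
    # the lag ('quefrency') of the largest cepstral peak estimates the period
    return max(range(lo, hi), key=lambda q: abs(cep[q]))

# Harmonic-rich periodic frame: an impulse train with an 80-sample period.
fs = 8000
frame = [1.0 if n % 80 == 0 else 0.0 for n in range(320)]
print(cepstrum_pitch_period(frame, fs))  # 80 samples, i.e. 100 Hz at 8 kHz
```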

Yet another approach to pitch extraction uses the philosophy that one simple measurement of pitch is unlikely to be satisfactory, but that by combining the answers of several fairly simple measurements and taking a majority vote on the results a useful pitch extractor can be obtained. Systems using this principle have been described by Gold5 of the MIT Lincoln Laboratory and also by Gill6 of J.S.R.U.

In Gill's system, measurements are made of the interval between major peaks in the speech waveform in four frequency bands in the range 0-600 Hz. These period measurements are converted into voltage analogues, each analogue representing the logarithm of the larynx vibration frequency over a range from 37.5 to 600 Hz. The analogue voltages of the four channels are compared with each other, and also with the last transmitted measurement, three times every 20 ms. If during this time a good measure of agreement is obtained, the majority measurement is used for the next transmitted value of pitch. If good agreement is not obtained but the sound is judged to be voiced, then a 'no-confidence' signal is transmitted to the receiver, causing it to continue using the previous measurement.
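The voting logic can be sketched as follows. This is a simplified model, not the circuit Gill describes: the agreement tolerance, and the handling of the comparison schedule, are illustrative assumptions; only the log-frequency comparison, majority vote, and 'no-confidence' fallback come from the description above.

```python
import math

def majority_pitch(channel_estimates, last_value, tol_octaves=1.0 / 12):
    """Combine per-channel pitch estimates (Hz) by majority vote on a
    logarithmic frequency scale.  Returns (pitch, confident); when no
    strict majority agrees, the previous value is kept ('no confidence')."""
    logs = [math.log2(f) for f in channel_estimates]
    best = []
    for centre in logs:
        group = [v for v in logs if abs(v - centre) <= tol_octaves]
        if len(group) > len(best):
            best = group
    if 2 * len(best) > len(logs):          # a strict majority agrees
        return 2.0 ** (sum(best) / len(best)), True
    return last_value, False               # 'no-confidence' condition

# Three channels agree near 121 Hz; one has doubled to the octave.
agreed, ok = majority_pitch([120.0, 122.0, 240.0, 121.0], last_value=100.0)
# No two channels agree, so the previous value is held.
held, held_ok = majority_pitch([100.0, 150.0, 220.0, 300.0], last_value=90.0)
```

Working in log frequency makes the tolerance a musical interval, so an octave-doubling error in one channel is cleanly outvoted rather than averaged in.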

The voiced-unvoiced decision is made by comparing the energy in the band 200-600 Hz with that in the range 5000-7000 Hz. A high ratio of low-frequency to high-frequency energy indicates a voiced sound, and a high ratio of high-frequency to low-frequency energy an unvoiced sound. This decision is influenced by the pitch measurement: when a confident pitch measurement is obtained, the low-frequency/high-frequency ratio can be much lower and still be interpreted as indicating a voiced sound.
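A minimal sketch of this decision is given below. The band edges are those quoted above; the numerical thresholds, and the use of an FFT to measure band energy, are illustrative assumptions.

```python
import numpy as np

def band_energy(x, fs, lo, hi):
    """Energy of x in the band lo..hi Hz, measured via the FFT."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return float(spec[(freqs >= lo) & (freqs <= hi)].sum())

def is_voiced(frame, fs, confident_pitch=False):
    """Voiced/unvoiced decision from the 200-600 Hz versus 5000-7000 Hz
    energy ratio; a confident pitch measurement relaxes the threshold.
    The threshold values here are illustrative, not from the paper."""
    ratio = band_energy(frame, fs, 200, 600) / \
        (band_energy(frame, fs, 5000, 7000) + 1e-12)
    return ratio > (1.0 if confident_pitch else 4.0)

fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
voiced_like = np.sin(2 * np.pi * 300 * t)     # energy in the low band
unvoiced_like = np.sin(2 * np.pi * 6000 * t)  # energy in the high band
```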

At the receiver the pitch analogue signal is passed through a 10 Hz low-pass filter; the cut-off frequency of this filter is switched to a higher value at the beginning of each voiced sound in order to bring the analogue signal to its new value quickly.

This pitch extractor will operate on rapid conversational speech with very few gross pitch errors or errors of voicing detection.

The formant vocoder and the channel vocoder are the best known of the analysis/synthesis systems, but several other types of vocoder exist and a few of these will now be described.

5.5. Voice Excited Vocoders

Many workers have attempted to improve the speech quality of the channel vocoder. One useful solution, which considerably eases the problem of pitch extraction and at the same time improves the received speech quality, is the voice excited vocoder.

August 1970 79



The voice excited vocoder differs from the channel vocoder in two main ways. Firstly, a baseband of natural speech is transmitted (usually in the range 200-800 Hz) and is used at the receiver, instead of synthetic speech, for the lower part of the spectrum; secondly, at the receiver the baseband is used, after some processing, to provide the vocoder excitation signal for the frequency region above the baseband. Channel signals for the band above the baseband are transmitted as in a normal channel vocoder.

Generation of a suitable flat-spectrum excitation signal from the baseband requires some form of non-linear processing, often called spectral flattening, and there are several methods of doing this. A simple example of spectral flattening is half-wave rectification followed by a suitable weighting network to produce a spectrum which is approximately flat in the required band.
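The harmonic-regenerating action of the rectifier can be demonstrated numerically. In the sketch below the weighting network is reduced to removing the d.c. term, and the test tone and baseband figures are assumptions; the point is simply that the nonlinearity creates energy at a harmonic that was absent from the input.

```python
import numpy as np

def spectrally_flatten(baseband):
    """Crude spectral flattener: half-wave rectify the baseband and
    strip the resulting d.c. term.  A practical vocoder would follow
    this with a weighting network to level the spectrum."""
    rectified = np.maximum(baseband, 0.0)
    return rectified - rectified.mean()

def amplitude_at(x, fs, f):
    """Amplitude of the spectral component of x at frequency f."""
    spec = np.abs(np.fft.rfft(x)) * 2.0 / len(x)
    return float(spec[int(round(f * len(x) / fs))])

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
tone = np.sin(2 * np.pi * 200 * t)   # a tone inside a 200-800 Hz baseband
excitation = spectrally_flatten(tone)
```

A half-wave rectified sine contains its fundamental plus even harmonics, so a 200 Hz input tone yields excitation energy at 400 Hz and above, which is exactly what the synthesizer needs for the region above the baseband.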

Generally speaking, voice-excited vocoders provide much better quality speech than simple channel vocoders. Some of this improvement is obviously due to the fact that part of the natural speech is being directly transmitted, but the rest is probably due to the inherently more accurate excitation signal obtained from the spectral flattening process. The voice excited vocoder will give a bandwidth compression of about 3:1, compared with the 10:1 obtained from a typical channel vocoder.

5.6. Autocorrelation Vocoder

The autocorrelation function of a signal is the Fourier transform of its power spectrum. It therefore follows that the short-term spectrum of a speech signal can be represented by a set of short-term autocorrelation functions. This is the basis of the autocorrelation vocoder, originated by Schroeder.7

In this type of vocoder, analysis is performed by multiplying the signal by itself at various delays (the delay increments must be less than half the reciprocal of the required speech bandwidth). The multiplier output signals are then low-pass filtered to about 25 Hz to produce the short-term autocorrelation channel signals. After transmission, these signals are used to control a speech synthesizer. The synthesizer is a time-varying filter whose impulse response is a replica of the short-term autocorrelation function over the range of delays used in the analyser. Synthesis is achieved by 'sampling' the channel signals with the excitation signal in a set of multipliers, and then assembling the samples in the correct time relationship by means of a delay line. Since the correlation function is even, it will be symmetrical about the centre sample. This means in practice that a delay line giving a complete reflexion at one end can be used, with a consequent halving of the delay-line length.
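The analyser side of this scheme can be sketched in a few lines. Here a frame average stands in for the multipliers followed by 25 Hz low-pass filters, and the sampling rate, frame size, and number of channels are assumptions for the example.

```python
import numpy as np

def autocorrelation_channels(frame, num_delays, delay_step=1):
    """Short-term autocorrelation channel signals for one frame: the
    signal multiplied by progressively delayed copies of itself and
    averaged (the average stands in for the 25 Hz low-pass filter)."""
    n = len(frame)
    return np.array([
        np.mean(frame[:n - k * delay_step] * frame[k * delay_step:])
        for k in range(num_delays)
    ])

# For a periodic frame, the channel one pitch period away repeats the
# zero-delay value, and the half-period channel goes negative.
fs = 8000
frame = np.sin(2 * np.pi * 100 * np.arange(800) / fs)  # 80-sample period
channels = autocorrelation_channels(frame, num_delays=100)
```

With fs = 8000 Hz the delay step of one sample (125 microseconds) satisfies the condition quoted above for a speech bandwidth of up to 4 kHz.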

Synthetic speech produced by the autocorrelation vocoder will have a spectral envelope corresponding to the square of the spectrum of the original speech signal, since in the analyser the speech signal was multiplied by itself. This makes the spectral peaks more pronounced than they would normally be, and gives the synthetic speech a rather unnatural 'bouncy' quality. The defect may be overcome by using a complex equalizer which effectively takes the square root of the input signal spectrum. Figure 8 shows a block diagram of an autocorrelation vocoder.

5.7. Harmonic Vocoder

The harmonic vocoder, or harmoniphone, first reported by Pirogov8, is yet another variation of the vocoder principle. In this vocoder the short-term spectral envelope of the speech signal is expanded into a Fourier series; there are several methods of achieving this, but a fairly straightforward way is by means of a resistor matrix connected to the low-pass filtered channel signals of a normal channel-vocoder analyser. The time-varying Fourier coefficients obtained from this matrix are then used, after transmission, to control a speech synthesizer.

The synthesizer takes the form of a set of wide-band, non-dispersive, interconnected delay lines. This network has frequency responses of sin nDω and cos nDω available at a series of tapping points; n takes the values 0, 1, 2, etc., and D is the delay of one delay section. The excitation signal is connected to this network, and the signals at the tapping points are multiplied by the appropriate Fourier coefficients and then summed to produce the synthetic speech.
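The expansion and reconstruction of the envelope can be sketched numerically. The cosine-series form below is an illustrative choice (the resistor matrix in the analyser forms an equivalent weighted sum of channel signals, and the delay-line network realizes the resynthesis in hardware); the 18-channel test envelope is an assumption.

```python
import numpy as np

def envelope_coefficients(envelope, num_terms):
    """Cosine-series coefficients of a sampled spectral envelope (an
    illustrative realization of the expansion; a resistor matrix would
    form the same weighted sums of the channel signals)."""
    n = len(envelope)
    theta = (np.arange(n) + 0.5) * np.pi / n
    return np.array([2.0 / n * np.sum(envelope * np.cos(k * theta))
                     for k in range(num_terms)])

def envelope_from_coefficients(coeffs, n):
    """Resynthesize the sampled envelope from received coefficients."""
    theta = (np.arange(n) + 0.5) * np.pi / n
    env = np.full(n, coeffs[0] / 2.0)
    for k in range(1, len(coeffs)):
        env = env + coeffs[k] * np.cos(k * theta)
    return env

# An 18-channel 'envelope' with two formant-like bumps.
channels = np.exp(-((np.arange(18) - 5.0) / 3.0) ** 2) \
    + 0.5 * np.exp(-((np.arange(18) - 12.0) / 2.0) ** 2)
full = envelope_from_coefficients(envelope_coefficients(channels, 18), 18)
few = envelope_from_coefficients(envelope_coefficients(channels, 8), 18)
```

Keeping all 18 coefficients reconstructs the envelope exactly; truncating to 8 still gives a close approximation of a smooth envelope, illustrating why the lower coefficients carry most of the spectral-shape information.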

An interesting feature of this type of vocoder is that each single coefficient affects the whole spectrum of the synthetic speech, and not just a portion of it as in the channel vocoder. It is found in practice that the higher coefficients have a progressively smaller effect on the spectrum than the lower ones. This means that errors in transmitting the coefficients near the fundamental of the spectrum shape have a more serious effect on the speech quality, and one would therefore expect these coefficients to need more accurate transmission than those corresponding to the higher harmonics of the spectrum shape.

5.8. Chebyshev Vocoder

A further refinement of the harmonic vocoder principle is the Chebyshev vocoder, first reported by Kulya.9 This vocoder expands the short-term spectral envelope of the speech into an orthogonal series in the same way as the harmoniphone; in this case, however, the functions used in the expansion are not

80 The Radio and Electronic Engineer, Vol. 40, No. 2



Fig. 8. Block diagram of an autocorrelation vocoder (after Schroeder).

sine and cosine waves but transformed Chebyshev functions. This may at first sight appear an unnecessary complication, but it has two main advantages. Firstly, the Chebyshev functions scan the short-term spectral envelope in such a way that more accuracy is obtained at the low-frequency end of the spectrum. Since the frequency resolution of the ear is believed to be distributed in a similar manner, this greater low-frequency accuracy would seem a desirable feature. Secondly, we have seen that synthesis in terms of a Fourier series requires the use of wide-band non-dispersive delay lines, which are bulky and difficult to design; it can be shown, however, that the use of Chebyshev functions requires a set of dispersive


delay networks that can be simply constructed from RC sections, known as Laguerre networks. Speech quality from a Chebyshev vocoder can be at least as good as that from a channel vocoder with the same number of control parameters, and there is some evidence that if we are restricted to a small number of coefficients (e.g. five) the speech quality can be better than for the corresponding channel vocoder.
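The flavour of the expansion can be reproduced with NumPy's Chebyshev routines. The square-root warp of the frequency axis below is an assumption chosen to mimic the greater low-frequency resolution, not Kulya's actual transformation, and the two-bump test envelope is synthetic.

```python
import numpy as np
from numpy.polynomial import chebyshev

# A smooth 'spectral envelope' with two formant-like bumps on 0-4 kHz.
f = np.linspace(0.0, 4000.0, 200)
envelope = np.exp(-((f - 500.0) / 400.0) ** 2) \
    + 0.6 * np.exp(-((f - 1500.0) / 700.0) ** 2)

# Warp frequency onto [-1, 1] so that the oscillations of T_n(x)
# crowd towards the low-frequency end of the band.
x = 2.0 * np.sqrt(f / 4000.0) - 1.0

def rms_fit_error(degree):
    """r.m.s. error of a least-squares Chebyshev fit of given degree."""
    coeffs = chebyshev.chebfit(x, envelope, degree)
    fitted = chebyshev.chebval(x, coeffs)
    return float(np.sqrt(np.mean((fitted - envelope) ** 2)))

errors = [rms_fit_error(d) for d in (3, 8, 15)]
```

The fit error falls steadily as coefficients are added, and a modest number of coefficients already describes a smooth envelope well, which is consistent with the claim above that a small Chebyshev coefficient set can compete with a channel vocoder's full set of channel signals.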

Figure 9 shows a conventional 18-channel vocoder analyser, modified by the use of a matrix of 180 resistors, which converts it into a Chebyshev vocoder analyser. Figure 10 shows the block diagram of the corresponding 10-coefficient synthesizer.

Fig. 9. Block diagram of a 10-coefficient Chebyshev vocoder analyser, obtained by modifying an 18-channel vocoder analyser with the resistor matrix shown.




Fig. 10. Block diagram of a 10-coefficient Chebyshev vocoder synthesizer (using 19 identical Laguerre networks).

6. Conclusions

Since the channel vocoder was invented in the 1930s, progress in vocoder research has been relatively slow. This is mainly due to the complicated nature of speech, the complexity of the electronic circuitry required for a vocoder terminal, and the large amount of prototype construction and testing needed to optimize the design of the analysis and synthesis filters. With the improvements available in circuit techniques it is, however, now quite practicable to construct a vocoder terminal in a volume of about a cubic foot, and further reductions of size are likely.

Present and future research on vocoders is essentially a task for digital computers. It is now possible to simulate complete vocoders on a digital computer. Recordings of natural speech may be processed by a simulated vocoder whose design parameters can be quickly optimized; for example, the parameters corresponding to complete analysis or synthesis filter banks can be changed simply by altering a few numbers on a data tape. This flexibility far outstrips anything that can be done in the laboratory with hardware, but the simulation has one main disadvantage: a typical vocoder simulation program takes something like 300 times real time on a moderately fast computer. This obviously limits the amount of speech that can be processed, so that hardware is still needed for the final checking of an optimized design on a large volume of speech material. Vocoders are now reaching the stage where they are becoming practical communication devices.

The next generation of vocoders may well be of the formant type, once the existing problems of formant tracking have been solved. These vocoders should provide improved speech quality at a lower transmission rate than present-day channel vocoders.

7. References

1. Gold, B. and Rader, C. M., 'The channel vocoder', I.E.E.E. Trans. on Audio and Electroacoustics, AU-15, No. 4, December 1967.

2. Gill, J. S., 'Automatic extraction of the excitation function of speech with particular reference to the use of correlation methods', Proc. of 3rd International Congress on Acoustics, Stuttgart 1959, pp. 217-220. (Elsevier, 1961.)

3. Sondhi, M. M., 'New methods of pitch extraction', I.E.E.E. Trans. on Audio and Electroacoustics, AU-16, No. 2, June 1968.

4. Noll, A. M., 'Short term spectrum and "cepstrum" techniques for vocal-pitch detection', J. Acoust. Soc. Amer., 36, No. 2, February 1964; 'Cepstrum pitch determination', J. Acoust. Soc. Amer., 41, No. 2, February 1967.

5. Gold, B., 'Computer program for pitch extraction', J. Acoust. Soc. Amer., 34, pp. 916-921, 1962.

6. Gill, J. S., 'Improvements in or Relating to Larynx Excitation Period Detectors', U.K. Patent Application No. 10525/65, May 1965.

7. Schroeder, M. R., 'Recent progress in speech coding at Bell Telephone Laboratories', Proc. of 3rd International Congress on Acoustics, Stuttgart 1959, pp. 201-210. (Elsevier, 1961.)

8. Pirogov, A. A., 'A harmonic system for compressing speech-spectra', Electrosviaz, No. 3, pp. 8-17, 1959.

9. Kulya, V. I., 'Analysis of a Chebyshev type vocoder', Telecommunications and Radio Engineering, Pt. 1, No. 3, pp. 23-32, March 1963.

Manuscript first received by the Institution on 3rd January 1970 and in final form on 29th April 1970. (Paper No. 1335/Com. 31.)

© The Institution of Electronic and Radio Engineers, 1970
