
A SELF-STEERING DIGITAL MICROPHONE ARRAY

Walter Kellermann*
Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ, USA

ABSTRACT

A self-steering microphone array for teleconferencing is presented in which the digitally implemented steering algorithm consists of two parts. The first part, the beamforming, is based on already known concepts. The second part, a novel voting algorithm, integrates elements of pattern classification and exploits temporal characteristics of speech signals. It also accounts for perceptual criteria and the acoustic environment. A real-time implementation is outlined and results are discussed.

1 INTRODUCTION

Steerable microphone arrays for audio applications aim at picking up sound emitted by a distant source, while suppressing signals arriving from other directions. This is achieved by directing a beam of increased sensitivity towards the source so that, ideally, the output signal of the steered array should sound similar to a microphone placed next to the source ('presence effect').

Applications for such devices include hands-free telephony systems (e.g., mobile telephony, teleconferencing) and the more general situation where speech signals must be picked up out of a noisy environment and where the talker should not be required to carry a personal microphone (as, e.g., in many speech recognition applications). Here, we focus on the application to teleconferencing where, at the same time, the 'presence effect', noise reduction, and suppression of interfering sources are desirable. Moreover, for conferences of several teleconferencing rooms, steerable arrays can relieve the echo cancellation problem by attenuating the local feedback path.

The main difficulty for the beamforming is given by the bandwidth of the audio signal, which even for telephony extends over more than three octaves. Furthermore, for the intended application, the microphone array must accommodate several simultaneously active sources while maintaining good spatial selectivity, and it must be able to track a moving source, e.g., a talker walking around in the room.

We approached this problem by using a two-stage strategy: First, fixed beams are formed, whose superposition covers the entire space of interest, and second, a voting algorithm selects the beam(s) that should contribute to the output signal. This idea was implemented earlier using mostly analog hardware [1]. The motivation for the work presented here was to explore the possibilities offered by digital signal processing algorithms and hardware and, thereby, to improve functionality, reduce hardware cost, and increase flexibility.

In the following sections we briefly outline the system architecture, then discuss the beamforming and the voting algorithm.

*Now with Philips Kommunikations Industrie, Nürnberg, Germany.

Thereby, the emphasis is on the novel voting algorithm. Moreover, an implementation and some results are briefly reviewed.

2 GENERAL STRUCTURE AND DESIGN CONSIDERATIONS

Figure 1. Structure of the digitally steered microphone array.

The basic structure of the system is shown in Fig. 1. The outputs of a linear microphone array are conveyed to the digital signal processing hardware after preamplification and A/D conversion. Here, beamforming and voting are performed to produce an output signal which may be used for transmission, speech recognition, or other purposes.

The key parameters of such a system, determining both performance and processing load, are associated with the number of sensors and the number of beams. Considering a prototype for teleconferencing rooms or office-like environments, the number of microphones and their spacings is a compromise between a large aperture for good spatial resolution and the demand for a small aperture to ensure the validity of the far-field assumption on which the beamforming is based. Moreover, the spacings must be smaller than a half wavelength to preclude spatial aliasing. Aiming at telephone bandwidth (sampling rate 8 kHz), which covers approximately three octaves, arrays for a low-frequency (LF), a mid-frequency (MF), and a high-frequency (HF) section were realized, each consisting of 11 (first-order differential) microphones with spacings of 16 cm, 8 cm, and 4 cm, respectively. (As some of the microphones can be used for several frequency sections, the entire array consists of 23 microphones.) Assuming that the array will be mounted to a wall and, therefore, should cover an angular range of somewhat less than 180°, the number of beams was chosen to be 7, with 'look directions' 0°, ±20°, ±40°, ±60° off broadside¹. Unlike the earlier analog implementation [1], where a 'track while scan' method used only two beams simultaneously, the digital implementation allows us to form all seven beams at each sampling instant.

¹The broadside axis is defined as being perpendicular to the array axis, originating at the center of the array (cf. [2]).
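As a quick check of the half-wavelength constraint discussed above, the alias-free upper frequency of each sub-array follows from f_max = c/(2d). A minimal sketch in Python, assuming a speed of sound of 343 m/s (the spacings are those given in the text):

```python
# Half-wavelength (spatial aliasing) check for the three sub-arrays.
c = 343.0  # assumed speed of sound in m/s

for name, spacing_m in [("LF", 0.16), ("MF", 0.08), ("HF", 0.04)]:
    f_max = c / (2.0 * spacing_m)  # highest alias-free frequency
    print(f"{name}: spacing {spacing_m * 100:.0f} cm -> alias-free up to {f_max:.0f} Hz")
```

The resulting limits (roughly 1072 Hz, 2144 Hz, and 4288 Hz) are consistent with the 4 kHz telephone bandwidth and with the crossover frequencies given in Section 3.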



3 BEAMFORMING

The beamforming comprises a sequence of linear signal processing operations, as shown in Fig. 2.

Figure 2. Signal processing for beamforming.

First, the microphone signals are assigned to the frequency sections they contribute to (grouping). The subsequent steps are then performed independently for each section, with the processing being essentially the same. The aperture shading stage multiplies each of the sensor signals by a weighting factor and thereby determines the balance between beamwidth and sidelobe attenuation for each beam. The weights are chosen according to the Dolph-Chebyshev design method [3], which, for the broadside beam, yields the minimum beamwidth for a prescribed sidelobe attenuation and a given number of sensors. With regard to the sensitivity of higher sidelobe attenuation to calibration errors of the microphone array, the weights were chosen to yield a sidelobe attenuation of 25 dB.
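For illustration, shading weights of this kind can be generated with SciPy's Dolph-Chebyshev window. A minimal sketch for the 11-sensor sub-arrays and the 25 dB sidelobe attenuation stated above (the sum-to-one normalization is a choice made here, not taken from the paper):

```python
import numpy as np
from scipy.signal.windows import chebwin

num_sensors = 11          # sensors per frequency section, as in the text
sidelobe_atten_db = 25.0  # prescribed sidelobe attenuation

# chebwin returns the Dolph-Chebyshev window normalized to a peak of 1;
# renormalizing the sum to 1 keeps the broadside gain at unity.
weights = chebwin(num_sensors, at=sidelobe_atten_db)
weights /= weights.sum()
print(np.round(weights, 4))
```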

In the wavefront reconstruction stage, for each beam a wavefront is reconstructed as it would be received if the array were rotated by the respective 'steering angle'. This can be achieved by delaying the sensor signals appropriately. Here, we assume far-field conditions and, therefore, reconstruct planar wavefronts for each of the 7 beam directions within each frequency section. Delays that are non-integer multiples of the sampling interval are realized by interpolation. Using 8 neighboring samples for the interpolation of a delayed sample and designing the interpolation filter according to [4], the maximum interpolation error, in our case, is below −34 dB for all beam directions. An efficient implementation circumvents upsampling before the interpolation filtering by using only subsets of the filter coefficients [5].

In the final stage of the beamforming, the three sections are summed after frequency-selective filtering. For the LF, MF, and HF sections, a lowpass, a bandpass, and a highpass are employed, respectively. Each filter is designed as an elliptic IIR filter of order 6, 12, and 6, respectively. The sum of these filters approximates unity over the entire frequency range. The crossover points were chosen to be at 760 Hz and 1680 Hz. In combination with the given shading coefficients and steering angles, this choice was found to be a good compromise, as it provides sufficient spatial coverage at the high frequencies of each frequency section while at the low end the beams do not become unnecessarily wide (see also [2]).
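To illustrate the delay-and-sum principle with interpolated fractional delays, here is a minimal sketch. The 8-tap Hann-windowed sinc interpolator is an assumed stand-in for the interpolator design of [4], and the sampling rate and speed of sound are assumed values; the efficient coefficient-subset implementation of [5] is not reproduced:

```python
import numpy as np

def fractional_delay_filter(frac: float, num_taps: int = 8) -> np.ndarray:
    # 8-tap Hann-windowed sinc; a simple stand-in for the design of [4].
    n = np.arange(num_taps)
    center = (num_taps - 1) / 2.0
    h = np.sinc(n - center - frac) * np.hanning(num_taps)
    return h / h.sum()  # unity DC gain

def delay_and_sum(signals, spacing_m, steer_deg, fs=8000.0, c=343.0, w=None):
    # Far-field delay-and-sum beam for one frequency section of a linear
    # array. signals: (num_sensors, num_samples); w: shading weights.
    num_sensors, num_samples = signals.shape
    pos = (np.arange(num_sensors) - (num_sensors - 1) / 2.0) * spacing_m
    if w is None:
        w = np.ones(num_sensors) / num_sensors
    # Per-sensor delays (in samples) that reconstruct a planar wavefront
    # arriving from steer_deg off broadside; offset keeps all delays causal.
    delays = pos * np.sin(np.radians(steer_deg)) / c * fs
    delays -= delays.min()
    out = np.zeros(num_samples)
    for m in range(num_sensors):
        d_int = int(np.floor(delays[m]))
        h = fractional_delay_filter(delays[m] - d_int)
        x = np.pad(signals[m], (d_int, 0))[:num_samples]  # integer part
        out += w[m] * np.convolve(x, h, mode="same")      # fractional part
    return out
```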

4 VOTING

The purpose of the voting stage is to select those beam signals which provide the best coverage of the currently active signal sources (talkers) and to form the corresponding output signal. The strategy developed here is outlined in Fig. 3 and operates on a frame-by-frame basis (no overlap, frame length 16 ms). The task of the voting algorithm is to derive the suitable weights for the incoming beam signals, which are weighted and summed to give the output signal. This is achieved by a procedure consisting of four stages (see Fig. 3). First, the analysis extracts a feature vector from each beam signal frame. In a second step, the feature vector is used to decide whether the current frame is part of a noise signal or part of a speech signal.

Figure 3. Structure of the voting algorithm.

This decision requires estimation of the noise characteristics. Analysis, speech/noise discrimination, and background noise estimation are performed independently and identically for each beam signal. The assignment of the beam weights, however, must consider all beams as an ensemble. In the following sections we describe the various steps in more detail. The values given for the various parameters are derived from experiments, so that no optimality can be claimed.

4.1 Analysis

The choice of the features that are extracted from a signal frame is based on the results documented in [6], where speech detection schemes for satellite communication systems were investigated. With regard to computational complexity, we limit ourselves to 3 features. The most important feature is the logarithm of the signal energy, while the first two PARCOR coefficients [7] were found to be a reasonable choice for the two other features [6].
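As a sketch of such a feature extractor, the following computes the log energy and the first two PARCOR (reflection) coefficients via two steps of the Levinson-Durbin recursion; this is an illustration consistent with [7], not the paper's implementation:

```python
import numpy as np

def frame_features(frame: np.ndarray) -> np.ndarray:
    # Feature vector per frame: [log energy, k1, k2], where k1 and k2 are
    # the first two PARCOR (reflection) coefficients from Levinson-Durbin.
    eps = 1e-12                           # guards against log(0), div-by-0
    energy = float(np.dot(frame, frame)) + eps
    # Autocorrelation lags 0..2.
    r = [float(np.dot(frame[:len(frame) - k], frame[k:])) for k in range(3)]
    k1 = r[1] / (r[0] + eps)              # order-1 reflection coefficient
    e1 = (r[0] + eps) * (1.0 - k1 * k1)   # residual energy after order 1
    k2 = (r[2] - k1 * r[1]) / (e1 + eps)  # order-2 reflection coefficient
    return np.array([np.log(energy), k1, k2])
```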

4.2 Speech/Noise Discrimination

The decision whether a given frame at time k should be considered speech or noise is made on the basis of the following discriminant function:

$$D(k) = \log_e\big(\det C_{vv}(k)\big) + \big(v(k) - m_v(k)\big)^{T}\, C_{vv}(k)^{-1}\, \big(v(k) - m_v(k)\big),$$

where C_vv(k) and m_v(k) represent the covariance matrix and the mean vector of the estimated feature vectors v for background noise, respectively. The second term of the sum corresponds to the Mahalanobis distance [8] between the current feature vector v(k) and the estimated background noise features. The first term is constant as long as the background noise features do not vary. (This term makes D(k), for our discrimination task, equivalent to an estimate of the conditional probability density function for the current feature vector, assuming it is background noise with normally distributed, wide-sense stationary features [9].)

The discriminant function is used for two decisions: First, it is decided whether the current signal frame is to be regarded as speech or noise. Using two thresholds D1, D2 (D1 > D2), a hysteresis over time is realized: To enter the 'speech' state, D(k) must exceed D1, and the frame is regarded as noise again only once D(k) falls below D2. Second, frames whose D(k) lies below a third threshold D3 become candidates for updating the background noise estimates (see Section 4.3).
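A minimal sketch of the discriminant and the two-threshold hysteresis; the condition for leaving the 'speech' state is inferred from the hysteresis description, since the original text is truncated at this point:

```python
import numpy as np

def discriminant(v, m_v, C_vv):
    # D(k) = ln det C_vv(k) + Mahalanobis distance of v(k) from the
    # background-noise feature model.
    diff = v - m_v
    _, logdet = np.linalg.slogdet(C_vv)  # det of a covariance is positive
    return logdet + diff @ np.linalg.solve(C_vv, diff)

def speech_noise_state(d_k, prev_state, D1, D2):
    # Two-threshold hysteresis (D1 > D2): enter 'speech' when D(k) > D1;
    # fall back to 'noise' only when D(k) < D2 (inferred fall-back rule).
    if prev_state == "noise":
        return "speech" if d_k > D1 else "noise"
    return "noise" if d_k < D2 else "speech"
```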


4.3 Background Noise Estimation

As a basis for the speech/noise discrimination, the mean vector m_v(k) and the covariance matrix C_vv(k) of the feature vectors for background noise must be ascertained. This requires an estimation procedure which produces useful initial noise estimates as soon as possible after starting the system, so that discrimination between speech and noise can be performed. This estimate must then be consolidated before entering a steady-state phase where the estimation procedure remains invariant. At all times, the estimation should be able to follow both abrupt and smooth changes of the background noise. In order to meet these requirements, a scheme was developed consisting of two adaptation procedures that run in parallel: One adaptation procedure estimates m_v(k) and C_vv(k) on the basis of the discriminant function values and a given threshold D3, while the other procedure adapts the threshold D3.

Estimation of Mean Vector and Covariance Matrix - For the adaptation of m_v(k) and C_vv(k), a three-stage strategy is adopted, modeled after [6]. During the first stage, the start-up phase (which extends over typically 50 frames), no previous knowledge about the background noise is available. Thus, each frame is considered to be background noise and is therefore used for the estimates that are formed at the end of this phase.

For the second and the third phase, the discriminant function can now be computed, and for each frame the speech/noise discrimination is performed. For updating the noise estimates, only those frames are considered which not only are candidates according to the discriminant function value D(k) but also are embedded in a contiguous block of candidates. When choosing the number of candidates that must precede the first one which is accepted for the noise update, we take into account that reverberation in the acoustic environment might cause noise-like frames which actually are reverberated speech. At the end of such a contiguous block of noise candidates, one may find frames which are detected as noise although they contain unvoiced sounds of emerging speech. Consequently, we decided to discard 50 frames (corresponding to 800 ms) at the beginning of the block and 12 frames (192 ms) at the end of the block. During the second phase, the consolidation phase, the averaging is extended to typically 500 frames (8 s), and during the third phase, the steady-state phase, the estimates are recursively updated using a fixed time constant of 8 s. For computational efficiency, during the consolidation and the steady-state phases the estimates are not updated for each newly accepted noise candidate, but only when a block of typically 10 such frames is complete.
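The steady-state update can be read as a recursive, exponentially weighted estimate of the mean and covariance, applied once per completed block of accepted noise frames. A minimal sketch; the per-block smoothing factor derived from the 8 s time constant is an interpretation made here, not a formula given in the paper:

```python
import numpy as np

FRAME_S = 0.016  # 16 ms frames, as in Section 4
TAU_S = 8.0      # steady-state time constant from the text

def recursive_noise_update(m_v, C_vv, block):
    # block: (num_frames, dim) array of feature vectors accepted as noise.
    n = block.shape[0]
    alpha = 1.0 - np.exp(-n * FRAME_S / TAU_S)  # per-block forgetting factor
    block_mean = block.mean(axis=0)
    centered = block - block_mean
    block_cov = centered.T @ centered / max(n - 1, 1)
    m_v = (1.0 - alpha) * m_v + alpha * block_mean
    C_vv = (1.0 - alpha) * C_vv + alpha * block_cov
    return m_v, C_vv
```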
Threshold Adaptation for the Discriminant Function - The motivation for the adaptation of the thresholds for the discriminant function arises from the time-variance of the background noise: After a change of the noise environment, it may occur that feature vectors which describe the new background noise are not accepted as noise, because D(k) exceeds the current threshold D3. Thus, the background noise estimation may never adapt to the new noise situation. Obviously, in this case, the threshold D3 must be raised in order to incorporate the feature vectors of the new noise background into the estimation. On the other hand, D3 should be kept as low as possible to avoid acceptance of speech segments as background noise, which would result in a degraded speech/noise discrimination. As a first step, the concept for the adaptation of the threshold requires a noise detection which is independent of the discriminant function and the associated thresholds. We use here as a criterion the dynamic range of the energy of the signal frames within a time interval of typically 100 frames. If the dynamic range of the frame energies is below a given threshold, this signal segment is considered to be background noise. The underlying assumption is that background noise exhibits a distinctly smaller dynamic range of frame energy than speech.
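A sketch of this discriminant-independent detector; the paper specifies the window length (about 100 frames) but not the dynamic-range threshold, so the 10 dB value below is an assumed placeholder:

```python
import numpy as np

def looks_like_noise(frame_energies_db, window=100, max_range_db=10.0):
    # A segment is treated as background noise if the dynamic range of its
    # frame energies over the last 'window' frames stays below a threshold.
    seg = np.asarray(frame_energies_db[-window:])
    return (seg.max() - seg.min()) < max_range_db
```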

Once a signal segment is decided to be background noise, a series of tests is performed which use quantities derived from the frame energy and from the discriminant function. These tests determine whether or not the threshold D3 should be changed, and if so, they also determine the amount of change. Accordingly, D3 is increased if the noise features changed abruptly. A corresponding mechanism decreases D3 again if the averaged discriminant function value is much smaller than the threshold D3 over a certain time interval. As for the thresholds D1 and D2, it was found that changing these along with D3 by the same amount yields good results.

4.4 Beam Weight Assignment

The assignment of the weights to the individual beams must satisfy two requirements. First, it must select those beams which provide the best coverage of the currently active sources, and second, it must also account for perceptual criteria, such as the unpleasantness of switching effects. The algorithm consists of two stages, which are explained below. At the first stage, we assign a potential to each beam, thereby determining which beams are activated. Here, activation means that this beam is considered as pointing to an active talker and should obtain the corresponding weight. Based on the potentials, the second step assigns the actual weight values.

Potential Assignment - The potential assignment method was motivated mainly by the inability of the speech/noise discrimination to always find the correct beam for an active talker: Speech will usually be detected for several beams as long as the environment is reverberant. Choosing the beam for which the discriminant function D(k) is maximum is prone to errors, too, since D(k) is measured relative to the noise background estimates, which are not equal for all beams. Therefore, a strategy was developed that essentially exploits the energy bursts that are characteristic of voiced sounds in speech signals. The idea is that an energy burst in the beam signal generates a positive potential which corresponds to the number of future frames that this beam should be activated following the current frame. This potential should only be assigned to bursts arriving via the shortest path, not to reflections, and should not be eroded completely by time before the next burst in continuous speech can be expected.

For each time frame we select those beams as candidates for new potential that are maximum with respect to the instantaneous (i.e., the current frame) energy or a lowpass-filtered energy. The potential for the maximum of the instantaneous energy lasts for typically 5 frames, while for the maximum of the averaged energy 20 frames proved to be reasonable. Using two different criteria allows fast detection of emerging speech as well as bridging unvoiced sounds between bursts, and rapidly discards impulsive noises.

Provided that the candidates were recognized as speech by the speech/noise discrimination, some more tests are performed before they obtain new nonzero potential. First, an estimate of the signal-to-noise ratio (SNR), formed from the instantaneous or the lowpass-filtered energy and the mean background noise energy, must exceed a given threshold (e.g., 3 dB). This prevents beams from being activated due to background noise that is not well represented by the estimates. Two more criteria are introduced to account for the interactions between already active beams (having nonzero potential from previous assignment) and potentially newly activated beams.
Both are applied only to those candidates which had zero potential in the previous time frame. The first criterion realizes a burst echo suppression and should prevent a candidate from being activated if the corresponding beam signal is an echoed version of a burst which already caused another beam to be activated. Thus, if there are already activated beams, the candidate's energy must exceed the attenuated maximum of all beams over a certain number of preceding frames.
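The per-frame candidate selection and gating described so far might be sketched as follows. The burst-echo and neighbor-inhibition tests are passed in as callables (permissive stubs by default) because their internals are described only qualitatively in the surrounding text; the potential durations and the 3 dB SNR gate are the values given above:

```python
import numpy as np

POT_INST, POT_AVG = 5, 20  # potential (frames) per criterion, from the text
SNR_GATE_DB = 3.0          # example SNR threshold from the text

def assign_potentials(E_inst, E_avg, noise_E, is_speech, pot,
                      echo_ok=lambda b: True, neighbor_ok=lambda b: True):
    # E_inst/E_avg: per-beam instantaneous and lowpass-filtered frame
    # energies; noise_E: per-beam mean background-noise energy;
    # pot: integer potentials carried over from the previous frame.
    new_pot = np.zeros_like(pot)
    for energies, duration in ((E_inst, POT_INST), (E_avg, POT_AVG)):
        b = int(np.argmax(energies))  # candidate beam for this criterion
        snr_db = 10.0 * np.log10(energies[b] / noise_E[b])
        if is_speech[b] and snr_db > SNR_GATE_DB and echo_ok(b) and neighbor_ok(b):
            new_pot[b] = max(new_pot[b], duration)
    # Keep whichever is larger: new potential or the decremented old one.
    return np.maximum(new_pot, np.maximum(pot - 1, 0))
```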


The attenuation factor and the number of frames correspond to the reflectiveness and the reverberation time of the acoustic environment, respectively. The second criterion is a neighbor inhibition mechanism and prevents two neighboring beams from being activated at the same time. This avoids cancellation effects in the directivity pattern which are caused by the interference of two neighboring beams. As the directivity patterns of neighboring beams overlap for most frequencies, this procedure does not impair the coverage of possibly active sources significantly, but it retains the amount of spatial selectivity which is achieved by a single beam. The algorithm for the neighbor inhibition proceeds as follows: If a beam is a candidate for new potential and its neighboring beam is already active, the candidate is discarded if it is not the maximum of both the instantaneous and the lowpass-filtered energy. To obtain the new potential, the candidate must also exceed its neighbor's lowpass-filtered energy by a prescribed amount (typically 3 dB). If the candidate meets these conditions, the potential of the previously activated adjacent beam is set to zero, so that only one of the two beams remains active. The required excess energy causes a hysteresis and is very useful, as it prevents the algorithm from alternately assigning the potential to two beams while the source actually is located in between their 'look directions'. Thus, the undesirable switching of the background noise with the same talker being active does not occur. On the other hand, the ability of the algorithm to track a moving source is not affected as long as the amount of required excess energy is not too large.

Finally, each beam gets assigned a potential that is the newly acquired potential or the decremented previously assigned potential, whichever is larger.

The potential assignment method proved to be very efficient in keeping the number of activated beams minimal while still covering all active sources. Although at most two beams can have new potentials assigned within a time frame (16 ms), experiments showed that three simultaneously active talkers are covered without a beam being 'lost'.

Computation of Weights - The weights for the individual beam signals range between 0 and 1 and are mainly determined by the corresponding potential and the previous weight. For a newly activated beam, it was found to be perceptually very important that the transition of the weight from 0 to 1 has sigmoid character, while for the transition from 1 to 0, initiated when a previously activated beam runs out of potential, a simply exponentially decaying weight yields satisfactory behavior. However, the case that no beam has nonzero potential has to be treated separately: To avoid the 'dead channel' phenomenon, at least one weight should not decay exponentially. Thus, for the beam which most recently had nonzero potential, the weight is kept constant until a beam is reactivated.
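The weight trajectories might be sketched as follows: a sigmoid-shaped rise on activation and an exponential decay on deactivation, as described above. The logistic parameterization and step sizes are choices made here for illustration; the paper specifies only the qualitative shape. The 'dead channel' exception (holding the weight of the most recently active beam) would be handled one level up, where all beams' potentials are visible:

```python
import numpy as np

def update_weight(w_prev, active, attack_step=1.0, decay=0.9):
    # Per-frame update of one beam weight in [0, 1].
    if active:
        # Sigmoid rise: map the current weight back to the abscissa of a
        # logistic curve and step forward along it (one possible choice).
        w = float(np.clip(w_prev, 1e-4, 1.0 - 1e-4))
        x = np.log(w / (1.0 - w)) + attack_step
        return 1.0 / (1.0 + np.exp(-x))
    return decay * w_prev  # exponential decay toward 0
```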

5 IMPLEMENTATION AND RESULTS

The proposed system was implemented in real-time hardware and tested in different environments. For the sensor array, first-order differential microphones were employed. The A/D interface provides a linear 16-bit converter for each sensor signal and forms a serial bitstream. The digital signal processing is performed by a cascade of 4 AT&T DSP32C processors, with 3 of them being used for the beamforming and the fourth realizing the voting algorithm. A total of 27.2 MIPS (million instructions per second) is executed, with a required storage of 57.4 kBytes for both program and data. The DSPs are monitored and controlled via a personal computer. Menu-driven user interface software allows control of all parameters of the voting algorithm using a 'mouse'.

Measurements in an anechoic chamber confirmed that the beamforming performance of the implemented system is in good agreement with theory. The functionality of the voting algorithm was examined by careful listening tests in a teleconferencing room and in an office environment.

It was found that the reaction time of the array to emerging speech is short enough to avoid noticeable chopping of speech and that no switching noise is heard when beams are activated or deactivated. As intended, the number of activated beams is always kept minimal. Interestingly, violation of the far-field condition by the source does not degrade the system performance as long as the background noise sources (including reverberation) meet this condition. Moreover, the implemented system showed good 'self-healing' capability when recovering from the most problematic situation, i.e., when the noise estimation could not follow a changing background noise because of an intensive conversation. In both environments it could be verified that the good functionality is quite robust to parameter variations.

6 CONCLUSION

The results obtained by real-time experiments confirm that the proposed concept deals successfully with teleconferencing environments and that it yields substantially better performance than earlier concepts based on analog hardware. Future work could aim at larger bandwidth and larger rooms, e.g., auditoria. Conceptually, these extensions are straightforward, as the proposed voting algorithm can be applied without major alterations, and for the beamforming the processing remains essentially the same, although the numbers of sensors and beams will increase.

ACKNOWLEDGEMENT

The author wishes to thank Gary Elko, Jim Snyder, and Bob Kubli for their guidance and support, as well as many other individuals of the Information Principles Research Lab for providing an inspiring and creative environment for this work.

REFERENCES

[1] J.L. Flanagan, J.D. Johnston, R. Zahn, and G.W. Elko. Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am., 78(5):1508-1518, November 1985.

[2] J.L. Flanagan. Beamwidth and useable bandwidth of delay-steered microphone arrays. AT&T Technical Journal, 64(4):983-995, April 1985.

[3] C.L. Dolph. A current distribution for broadside arrays which optimizes the relationship between beamwidth and sidelobe level. Proceedings of the IRE, 34:335-348, June 1946.

[4] G. Oetken, T.W. Parks, and H.W. Schüßler. A computer program for digital interpolator design. In Digital Signal Processing Committee of the IEEE ASSP Society, editor, Programs for Digital Signal Processing, chapter 8.1. IEEE Press, 1979.

[5] R.G. Pridham and R.A. Mucci. Digital interpolation beamforming for low-pass and bandpass signals. Proceedings of the IEEE, 67(6):904-919, June 1979.

[6] H. Schramm. Untersuchungen an Sprachdetektoren für digitale Sprachinterpolationsverfahren. PhD thesis, Universität Erlangen-Nürnberg, Erlangen, FRG, 1987. (In German.)

[7] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, NJ, 1978.

[8] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, NY, 1973.

[9] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, NY, 2nd edition, 1984.
