Master's Thesis MEE04:20
Acoustic speech localization with microphone array in real time
Mikael Swartling
Master's thesis (Teknologie Magisterexamen) in Electrical Engineering
Blekinge Tekniska Högskola (Blekinge Institute of Technology), January 2005
School of Engineering, Department of Signal Processing
Examiner: Nedelko Grbić. Supervisor: Nedelko Grbić.



• Acoustic speech localization with microphone array in real time

    Mikael Swartling
    Blekinge Institute of Technology

    Abstract

The purpose of this thesis is to evaluate and implement algorithms for robust localization and tracking of moving acoustic sources in real time using a microphone array. To identify inter-sensor delays, the generalized cross correlation is used together with a filter bank. From the inter-sensor delays, the position is estimated using a linear intersection algorithm. Position estimates are associated with tracks, which are filtered by a Kalman filter. Results from two real-room experiments are presented to demonstrate the localization and tracking performance, along with a discussion on real time implementation issues.

• Contents

    1 Introduction
    2 Delay estimation
    2.1 Signal model
    2.2 The generalized cross correlation method
    2.3 Angle of arrival
    2.4 Multiple sensors
    2.5 Optimizing the cross correlation function
    3 Filter banks
    4 Position estimation
    4.1 Source localization problem
    4.2 Linear intersection
    5 Track association and filtering
    5.1 Track association
    5.2 Filtering
    6 Experiments
    6.1 Testing the angle of arrival
    6.1.1 Bias and variance
    6.2 Testing the localization and tracking
    6.2.1 Two fixed talkers
    6.2.2 Single moving talker
    7 Real time implementation
    8 Conclusion and further development


• List of Figures

    1 Delay due to extra propagation distance.
    2 Path of possible source locations.
    3 Sensor arrangement and delays when using multiple sensors.
    4 Uniform DFT analysis filter bank.
    5 Linear intersection.
    6 Room with moderate echo.
    7 Room with low echo.
    8 Bias of estimated angles.
    9 Standard deviation of estimated angles.
    10 Two speakers having a conversation.
    11 Two speakers having a conversation.
    12 Single speaker moving in a circle.
    13 Single talker moving in a circle.


  • 1 Introduction

An array of microphones can be steered electronically to change its directivity pattern so that it only receives sound from certain directions. This ability can be used to replace directed microphones, with the advantage that the directivity pattern can be changed rapidly, allowing the array to pick up new sources and follow source movements. Instead of steering the array's directivity pattern to a specific location, the array can also be used to search for acoustic sources by dynamically forming the directivity pattern to sweep over the surrounding environment.

The problem of locating a source is often split into three parts: inter-sensor delay estimation, position estimation, and track association and filtering. The most important of these is a precise and robust algorithm for inter-sensor delay estimation, since the delay estimates form the base for further calculations and location estimates. To work in real time, it must also be computationally inexpensive, so that it can process the signals as they are sampled and provide a continuous flow of inter-sensor delay estimates to the location estimator.

All three parts are discussed in this report. Experiments are performed to demonstrate the performance, along with a discussion on real time implementation issues; finally, conclusions and possible further developments are given.

    2 Delay estimation

    2.1 Signal model

Given two spatially separated sensors (in this thesis, the sensors are microphones), the signal received from an acoustic source at one sensor will be shifted in time relative to the other sensor, due to the extra propagation distance from source to sensor. Figure 1 illustrates this delay with the source located in the near and far field, respectively. In the near field case, the direction of arrival is different for the two sensors. In the far field case, the directions of arrival can be considered parallel and will therefore be the same for both sensors.

    Assuming the relative attenuation between the two sensors is negligible,


Figure 1: Delay due to extra propagation distance. (a) Near field source. (b) Far field source.

the received signals x0(t) and x1(t) can be modelled as

    x0(t) = s(t − τ0) + n0(t)
    x1(t) = s(t − τ1) + n1(t)                                   (1)

where s(t) is the acoustic source signal, τ0 and τ1 are the propagation delays from the source to the sensors, and n0(t) and n1(t) are noise signals. The noise received at the sensors is considered mutually uncorrelated and also uncorrelated with the source signal. The relative delay between the sensors, τ = τ1 − τ0, is the delay caused by the extra propagation distance.

The task is to estimate the delay τ from finite-size blocks of data from x0(t) and x1(t). To track a talker, locate new sources and alternate between several sources, a method that estimates the delay quickly is required.

    2.2 The generalized cross correlation method

The method used to estimate inter-sensor delays in this thesis is based on the generalized cross correlation method, described in [KC76]. The delay is estimated by maximizing the cross correlation between the two signals x0(t) and x1(t), and can be expressed as

    τ̂ = argmax_τ Rx0x1(τ)                                       (2)


• The cross correlation Rx0x1(τ) is related to the cross power spectrum Gx0x1(ω) by the Fourier transform as

    Rx0x1(τ) = ∫ Gx0x1(ω) e^{jωτ} dω                            (3)

The cross power spectrum of x0 and x1, Gx0x1(ω), is calculated as

    Gx0x1(ω) = X0(ω) X1*(ω)                                     (4)

where X0(ω) and X1(ω) are the Fourier transforms of x0 and x1, respectively, and * denotes complex conjugate.

The generalized cross correlation is defined in [KC76] as

    Rx0x1(τ) = ∫ Ψ(ω) Gx0x1(ω) e^{jωτ} dω                       (5)

where Ψ(ω) is a general weighting function. The generalized correlation method known as the phase transform, or PHAT, is obtained by setting the weighting function to

    ΨPHAT(ω) = 1 / |Gx0x1(ω)|                                   (6)

This weighting function normalizes the absolute value of all coefficients in the cross spectrum to unity, and uses only the phase information to calculate the cross correlation.
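As a concrete illustration, the chain (2)-(6) can be sketched in a few lines of Python. This is a minimal sketch, not the thesis's filter-bank-based implementation: the FFT length, the small regularization constant, and the conjugation convention (which makes a positive result mean that x1 lags x0) are assumptions.

```python
import numpy as np

def gcc_phat(x0, x1):
    """GCC-PHAT estimate (in whole samples) of the delay of x1 relative to x0."""
    n = len(x0) + len(x1)                 # zero-pad to avoid circular wrap-around
    X0 = np.fft.rfft(x0, n)
    X1 = np.fft.rfft(x1, n)
    G = np.conj(X0) * X1                  # cross power spectrum, eq. (4) up to conjugation
    G /= np.abs(G) + 1e-12                # PHAT weighting, eq. (6)
    r = np.fft.irfft(G, n)                # generalized cross correlation, eq. (5)
    r = np.concatenate((r[-(n // 2):], r[:n // 2]))   # put zero lag in the middle
    return int(np.argmax(r)) - n // 2     # eq. (2): the delay is at the peak

# synthetic check: x1 is x0 delayed by 5 samples
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
x0, x1 = s[5:], s[:-5]
print(gcc_phat(x0, x1))                   # 5
```

Because the PHAT weighting discards all magnitude information, the peak stays sharp even when the source spectrum is far from white, which is the motivation for using it on speech.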

    2.3 Angle of arrival

When the time delay of arrival is estimated and the array geometry is known, a direction of arrival can also be estimated. From a given delay, a path can be calculated along which the source is located. Using only two sensors, it is not possible to determine where along the path the source is located. The path is a parabolic curve in two dimensions, as illustrated by the dashed line in figure 2(a). The curve is actually mirrored along the line connecting the two sensors; however, only one half-space is considered here, as the source is assumed to be located in front of the sensor array.

In the far field, the parabolic curve approaches a straight line. Assuming the source is always located in the far field, it is possible to approximate the


Figure 2: Path of possible source locations. A source located in the near field results in a parabolic curve of possible source locations (a), and in the far field the parabolic path can be approximated by a straight line (b).

parabolic curve with a straight line, as shown in figure 2(b). The angle θ is the angle of arrival for a distant source.

The angle of arrival can be calculated as

    θ = sin⁻¹( c τ / (d fs) )                                   (7)

where c is the speed of sound, d is the distance between the two sensors, fs is the sample rate and τ is the estimated delay between the two sensors, measured in samples. An estimate of the variance of the estimated angle is [ADBS95]

    V[θ̂] ≈ V[τ̂] / cos²(θ̂)                                      (8)
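Eq. (7) translates directly into code; a sketch, where the speed of sound c = 343 m/s is an assumed value:

```python
import math

def angle_of_arrival(tau, d, fs, c=343.0):
    """Eq. (7): angle of arrival (degrees) from an inter-sensor delay in samples."""
    arg = c * tau / (d * fs)
    if abs(arg) > 1.0:                    # outside the domain of arcsin, cf. eq. (10)
        raise ValueError("delay outside the physically possible range")
    return math.degrees(math.asin(arg))

# geometry from the experiments: d = 4 cm, fs = 8 kHz
print(angle_of_arrival(0.0, d=0.04, fs=8000))    # 0.0 (broadside)
print(angle_of_arrival(0.5, d=0.04, fs=8000))    # about 32 degrees
```

Note that with this geometry the largest physically possible delay is d·fs/c ≈ 0.93 samples, less than one sample, which is why the sub-sample optimization of section 2.5 matters.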

    2.4 Multiple sensors

To increase the accuracy of the delay estimate, multiple sensors can be used. Here, the sensors are placed on a line, evenly spaced. Assuming a far-field source, the sensor arrangement and the delays relative to the other sensors are as in figure 3.

The SRP, steered response power, algorithms are based on steering a beamformer and searching for the maximum power output. The type


Figure 3: Sensor arrangement and delays when using multiple sensors.

of beamformer used is a delay-and-sum beamformer, which delays the output signals from the individual sensors and then sums them together to form the output of the beamformer.

A generalization of GCC-PHAT is the SRP-PHAT algorithm, defined as

    τ̂ = argmax_τ  Σ_{n=0}^{N−2} Σ_{m=n+1}^{N−1} ∫ ΨPHAT(ω) Gxnxm(ω) e^{jω(m−n)τ} dω      (9)

where ΨPHAT(ω) is the weighting function defined in (6).

The SRP-PHAT algorithm maximizes the sum of the cross correlations between all combinations of sensor pairs in the array. As the number of sensors increases, the variance of the estimate decreases. For N sensors, there is a total of (N choose 2) = N(N − 1)/2 pairs of sensors over which the sum of the cross correlations is maximized.
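A brute-force sketch of eq. (9) follows. It evaluates the objective on a fixed grid of candidate delays instead of the Golden section search of section 2.5, and the sensor spacing, sample rate and grid size below are assumptions chosen so that the demo delay of 3 samples is feasible (the thesis's own array has a sub-sample maximum delay).

```python
import numpy as np

def srp_phat(x, d, fs, c=343.0, grid=201):
    """Evaluate the SRP-PHAT objective of eq. (9) on a grid of candidate
    inter-sensor delays tau (in samples); x is (N, L), one row per sensor."""
    N, L = x.shape
    X = np.fft.rfft(x, axis=1)
    w = 2 * np.pi * np.arange(X.shape[1]) / L            # bin frequencies, rad/sample
    taus = np.linspace(-d * fs / c, d * fs / c, grid)    # search interval, eq. (10)
    power = np.zeros(grid)
    for n in range(N - 1):
        for m in range(n + 1, N):                        # all C(N, 2) sensor pairs
            G = np.conj(X[n]) * X[m]
            G /= np.abs(G) + 1e-12                       # PHAT weighting, eq. (6)
            E = np.exp(1j * (m - n) * np.outer(taus, w)) # steering term e^{jw(m-n)tau}
            power += (E @ G).real
    return taus[np.argmax(power)]

# synthetic far-field source: each sensor delayed 3 samples more than the last
rng = np.random.default_rng(1)
s = rng.standard_normal(1200)
x = np.stack([s[100 - 3 * n : 1000 - 3 * n] for n in range(4)])
print(srp_phat(x, d=0.2, fs=8000))                       # close to 3
```

Evaluating the phase term in the frequency domain is what gives the sub-sample resolution that a time-domain correlation peak search lacks.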

    2.5 Optimizing the cross correlation function

The optimization problem presented in (9) generally lacks a closed-form solution, so a numerical search method is used. The method used is the Golden section search, described in [LRV01]. The Golden section search is a one-dimensional search method that searches for a maximum (or minimum, when minimizing a function) between two end-points.

The first thing to do before optimizing is to determine the interval over which to optimize. The relative delay between two sensors in the array can never be larger than the delay caused by the distance between the two sensors. The largest relative delay occurs when the source is located on the


• line connecting the two sensors. Therefore, in (9), it is known that

    τ ∈ [ −d fs / c , d fs / c ]                                (10)

This is also the interval of τ for which (7) is defined, since the domain of sin⁻¹ is [−1, 1].

Assume the search interval for iteration i is [αi, βi], where αi < βi. Two new points, τl,i and τr,i, are chosen such that αi < τl,i < τr,i < βi. The search interval is then updated depending on the function values at the points τl,i and τr,i. If f(τl,i) > f(τr,i), the new search interval is [αi+1, βi+1] = [αi, τr,i]; otherwise [αi+1, βi+1] = [τl,i, βi].

By keeping the ratio between all points constant for each iteration, the inner points τl,i and τr,i can be reused in the next iteration, not only as an endpoint of the new search interval, but also as one of the new inner points. Therefore, only a single new point and its corresponding function value must be calculated per iteration. The points divide the interval in a constant ratio

    (τl − α) / (β − α) = (β − τr) / (β − α) = γ                 (11)

The ratio γ is the Golden ratio, hence the name of the algorithm. The Golden ratio is calculated as

    γ = (3 − √5) / 2 ≈ 0.3820                                   (12)

The algorithm for the Golden section search is shown in algorithm 1. Adjustable parameters are the search interval [α, β] and the tolerance ε. The algorithm returns the value τ that maximizes the function f(τ) over the search interval, to within a tolerance of ε units.

For the Golden section search to work, the function being optimized must be unimodal; it must have one, and only one, maximum in the interval being optimized. In general, the cross correlation is not unimodal. However, investigating the cross correlation for real recordings has shown that it can, in practice, under the circumstances given in this thesis, be considered unimodal often enough for the Golden section search to be an option. Sometimes the optimization returns a local maximum instead of the global maximum (in the range specified), but not often enough to notably affect the general performance.


• Algorithm 1 The Golden section search algorithm.

    Require: α < β and ε > 0
    Ensure: τ = argmax_{α ≤ τ ≤ β} f(τ)

    1:  τl = α + γ(β − α)
    2:  τr = τl + γ(β − τl)
    3:  fl = f(τl)
    4:  fr = f(τr)
    5:  while β − α > ε do
    6:    if fl < fr then
    7:      α = τl
    8:      τl = τr
    9:      τr = τl + γ(β − τl)
    10:     fl = fr
    11:     fr = f(τr)
    12:   else
    13:     β = τr
    14:     τr = τl
    15:     τl = α + γ(β − α)
    16:     fr = fl
    17:     fl = f(τl)
    18:   end if
    19: end while

    20: if fl > fr then
    21:   τ = τl
    22: else
    23:   τ = τr
    24: end if
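Algorithm 1 can be sketched directly in Python; the test function and the tolerance below are arbitrary choices for illustration:

```python
import math

GAMMA = (3 - math.sqrt(5)) / 2            # the Golden ratio of eq. (12)

def golden_section_max(f, a, b, eps=1e-6):
    """Maximize a unimodal f on [a, b] to within eps (Algorithm 1)."""
    l = a + GAMMA * (b - a)
    r = l + GAMMA * (b - l)
    fl, fr = f(l), f(r)
    while b - a > eps:
        if fl < fr:                       # the maximum lies in [l, b]
            a, l, fl = l, r, fr
            r = l + GAMMA * (b - l)
            fr = f(r)
        else:                             # the maximum lies in [a, r]
            b, r, fr = r, l, fl
            l = a + GAMMA * (b - a)
            fl = f(l)
    return l if fl > fr else r

# unimodal test function with its maximum at 2
print(round(golden_section_max(lambda x: -(x - 2) ** 2, 0.0, 5.0), 4))   # 2.0
```

Only one new function evaluation is needed per iteration, which is the property that makes the method cheap enough for a per-block real time search.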


Figure 4: Uniform DFT analysis filter bank.

    3 Filter banks

The generalized cross correlation, described in section 2.2, estimates the inter-sensor delays using the cross power spectrum, which is calculated as shown in (4). Instead of calculating the discrete Fourier transform of the signals x0 and x1 directly, a uniform DFT analysis filter bank is used.

The signal x(n) is decomposed into a set of N subbands by the filter bank. The filter bank consists of a set of bandpass filters derived from a prototype filter. The prototype filter is a lowpass filter whose frequency response is shifted in the frequency domain, making it a bandpass filter. The prototype filter is used to create one bandpass filter for each of the N subbands, with center frequency 2πn/N, n = 0 … N − 1, for the n:th subband. After filtering, the subband signals are decimated. If the sample rate of the subband signals is decimated by a multiple of the number of subbands, N, an efficient polyphase implementation is possible, as shown in figure 4.
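The analysis bank of figure 4 can be sketched as follows. The windowed-sinc prototype, its length, and the frame-based processing are hypothetical, minimal design choices, not the thesis's actual prototype filter:

```python
import numpy as np

def analysis_filter_bank(x, N, taps_per_band=8):
    """Uniform DFT analysis filter bank, decimated by the number of subbands N."""
    K = taps_per_band * N
    n = np.arange(K) - (K - 1) / 2
    proto = np.sinc(n / N) * np.hamming(K)            # prototype lowpass, cutoff pi/N
    proto /= proto.sum()                              # unit DC gain
    x = np.concatenate((np.zeros(K - N), x))          # prime the filter state
    out = []
    for start in range(0, len(x) - K + 1, N):         # one frame per N input samples
        seg = x[start:start + K][::-1] * proto        # h(k) x(t - k), k = 0..K-1
        poly = seg.reshape(taps_per_band, N).sum(0)   # fold into N polyphase branches
        out.append(np.fft.fft(poly))                  # the DFT stage of figure 4
    return np.array(out)                              # (frames, N) subband signals

# a tone at the center of subband 1 should land in bins 1 and N-1
N = 8
x = np.cos(2 * np.pi * np.arange(512) / N)
Y = analysis_filter_bank(x, N)
energy = np.mean(np.abs(Y[16:]) ** 2, axis=0)         # skip the transient frames
print(int(np.argmax(energy)))
```

The polyphase fold works because, after decimation by N, the subband modulation e^{j2πn·mN/N} is always 1, so each output frame is just the DFT of the N folded branch sums.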

    4 Position estimation

    4.1 Source localization problem

From a set of N pairs of sensors {mi0, mi1}, i = 0 … N − 1, the time delay between the two sensors in a pair, given the positions of the two sensors, mi0 and mi1, and the position of the source, s, is

    T({mi0, mi1}, s) = ( |s − mi0| − |s − mi1| ) / c            (13)

where c is the speed of sound. For each pair, there is an estimated time delay τi between the two sensors, and an estimated variance σi². If the delay estimates τi are corrupted by uncorrelated, zero-mean Gaussian noise, the maximum likelihood estimate of the source location sML is found by minimizing a least-squares error function JML(s) [BAS97]:

    sML = argmin_s JML(s)                                       (14)

where

    JML(s) = Σ_{i=0}^{N−1} (1/σi²) [ τi − T({mi0, mi1}, s) ]²   (15)

4.2 Linear intersection

Minimizing the error function in (14) involves searching for a position s from which the theoretical delays match the measured delays as closely as possible. Instead of using a numerical search method to find the location of the source, a numerically less expensive closed-form solution is used. The algorithm is based on the Linear intersection algorithm described in [BAS97], modified from three- to two-dimensional intersections.

Once the direction of arrival is calculated for each sensor pair, the intersection of all estimated directions of arrival, together with the sensor positions, can be calculated. Given the position of sensor pair i, mi, and its direction of arrival, vi, any point pi on the line originating from the array location in the direction vi can be described as

    pi = mi + ti vi                                             (16)

where ti > 0, as shown in figure 5. pi also describes all possible locations of the source as seen from the sensor pair. By using two pairs, {mi0, mi1} and {mj0, mj1}, the source location can be found by calculating the intersection of the lines pi and pj:

    pi = pj
    mi + ti vi = mj + tj vj
    ti vi − tj vj = mj − mi                                     (17)

In matrix form, the equation becomes

    V t = Δm                                                    (18)


• where

    V = [ vi  −vj ],   t = [ ti  tj ]ᵀ                          (19)

and

    Δm = mj − mi                                                (20)

Seeking t, the solution is

    t = V⁻¹ Δm                                                  (21)

and the intersection point can then be calculated as

    sij,LI = mi + ti vi = mj + tj vj                            (22)

When using N > 2 sensor pairs, or more generally sensor subarrays when multiple sensors are used per pair for increased accuracy, (N choose 2) possible intersections can be calculated; one for each combination of 2 subarrays. Assuming there are at least 2 subarrays, the final position can be estimated as the average

    sLI = ( Σ_{i=0}^{N−2} Σ_{j=i+1}^{N−1} sij,LI ) / (N choose 2)      (23)

Since no information regarding the propagation delay from the source to a sensor subarray, or between subarrays, is available, problems arise when the source is located near the line connecting two subarrays, or far away from the subarrays compared to the distance between them. In those cases the direction of arrival vectors are almost parallel, and the matrix V in (21) is badly conditioned, or even non-invertible.
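Equations (16)-(22) amount to one 2-by-2 linear solve per subarray pair. A sketch, with subarray positions assumed for illustration (the 1.5 m spacing of section 6.2), including a guard for the badly conditioned case:

```python
import numpy as np

def intersect(mi, vi, mj, vj):
    """Eqs. (16)-(22): intersection of two direction-of-arrival rays in 2-D.
    Returns None when the rays are nearly parallel (V badly conditioned)."""
    V = np.column_stack((vi, -np.asarray(vj)))                # eq. (19)
    if abs(np.linalg.det(V)) < 1e-9:
        return None
    t = np.linalg.solve(V, np.asarray(mj) - np.asarray(mi))   # eq. (21)
    return np.asarray(mi) + t[0] * np.asarray(vi)             # eq. (22)

# two subarrays 1.5 m apart, both seeing a source at (0.75, 1.7)
m0, m1 = np.array([0.0, 0.0]), np.array([1.5, 0.0])
src = np.array([0.75, 1.7])
v0 = (src - m0) / np.linalg.norm(src - m0)
v1 = (src - m1) / np.linalg.norm(src - m1)
print(intersect(m0, v0, m1, v1))          # recovers the source position
```

Returning None for near-parallel rays is one simple way to skip the ill-conditioned pairs before forming the average of eq. (23).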

    5 Track association and filtering

This section describes the algorithm used for tracking sources from individual position estimates. Section 4 describes an algorithm that estimates a position for the source given the time delays between sensors in a sensor array, using several sensor subarrays. The algorithm gives a set of points sampled at a certain time interval. The position estimates are distorted by noise and need to be filtered spatially.


Figure 5: Linear intersection.

    5.1 Track association

When multiple sources are being located (for example, two or more talkers having a conversation), simply filtering the samples as they are calculated is not an option. An algorithm to determine which source a sample belongs to must be implemented; only then can samples be filtered properly. The track association algorithm is based on a method described in [SBS97].

A track is a state vector following a source. When a new sample is calculated, one of the currently stored tracks is first associated with it. The track associated with the sample is the nearest track, but the track must also be within a certain distance from the sample.

If no track is close enough to be associated, a new track is created. An association can fail for two main reasons: the sample belongs to a completely new source, or the sample was distorted by so much noise that it fell outside the acceptance region of the correct source. When a new track is created, it is not yet known whether the sample represents a newly active source or is just a noise-corrupted sample from a current track. Therefore, all new tracks are marked as potential tracks; if no new samples fall within the acceptance region within a certain time, it can be assumed the track was created from a noise-corrupted sample, and it is dropped. However, if more samples start to fall within the acceptance region, it is assumed that the track is indeed tracking an active source, and the track is promoted to an active track.

A track associated with a sample is updated: the sample is added to the list of samples for that track, and eventually filtered to smooth the path formed by the samples.

When a track is not updated with new samples within a certain time, it is considered abandoned, and it is dropped from the list of potential or active tracks. A completed track is an active track that was dropped. Potential tracks not yet promoted to active tracks are not considered completed tracks when they are dropped, because a potential track is not yet classified as a real source.
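The nearest-track gating described above can be sketched in a few lines; the gate radius is a hypothetical threshold, as the thesis does not state the size of its acceptance region:

```python
import numpy as np

def associate(tracks, sample, gate=0.5):
    """Return the index of the nearest track within the acceptance region
    of radius `gate` (an assumed threshold, in metres), or None when the
    sample should start a new potential track."""
    if not tracks:
        return None
    d = [np.linalg.norm(np.asarray(t) - sample) for t in tracks]
    i = int(np.argmin(d))
    return i if d[i] <= gate else None

tracks = [np.array([0.5, 1.7]), np.array([2.0, 1.6])]
print(associate(tracks, np.array([0.55, 1.65])))   # nearest track, inside the gate
print(associate(tracks, np.array([5.0, 5.0])))     # outside every gate -> None
```

A None result corresponds to creating a new potential track, which is either promoted or dropped depending on whether later samples fall inside its gate.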

    5.2 Filtering

Filtering is performed using a Kalman filter. The sources being tracked are assumed to be humans talking, and since a source can move around, a simple Newtonian motion model is used to model the motion of the talker. Therefore, the state vector for the Kalman filter is

    xn = [ xn  yn  ẋn  ẏn ]ᵀ                                    (24)

where xn and yn represent the two-dimensional position of the source, and ẋn and ẏn the velocity, at iteration n.

The filter used is a one-step predictor as described in [Hay02]. The transition matrix F is

    F = [ I2   T·I2 ]
        [ 02    I2  ]                                           (25)

and the measurement matrix C is

    C = [ I2  02 ]                                              (26)

where In is an n×n identity matrix, 0n is an n×n zero matrix and T is the time since the last update of the state vector. The filter is updated at constant time intervals T, so the transition matrix F is also constant, and the inverse of the transition matrix is

    F⁻¹ = [ I2  −T·I2 ]
          [ 02    I2  ]                                         (27)

The correlation matrices for the process and measurement noise, Q1 and Q2 respectively, are

    Q1 = q1·I4,   Q2 = q2·I2                                    (28)

where q1 and q2 are the variances of the process and measurement noise.


The algorithm for estimating the source's state vector at iteration n, x̂n, given the estimated position samples, yn, is shown in algorithm 2. The initial state vector x̂0 is the estimated position and velocity of the source at the time the Kalman filter starts tracking the source. The position is estimated from the samples collected before the track was promoted to an active track (see section 5.1), and the velocity is assumed to be zero. The initial predicted state-error correlation matrix is K0 = 04.

Algorithm 2 Kalman filter based on one-step prediction.

    1: for n = 1, 2, 3, … do
    2:   Gn = F Kn Cᴴ [ C Kn Cᴴ + Q2 ]⁻¹
    3:   αn = yn − C x̂n
    4:   x̂n+1 = F x̂n + Gn αn
    5:   Kn+1 = F [ Kn − F⁻¹ Gn C Kn ] Fᴴ + Q1
    6: end for

Instead of iterating through all the samples at once with the for-loop in algorithm 2, each new sample calculated triggers a single pass through the loop. This is necessary for real time filtering, where the filtered result is needed as new samples are calculated.
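Algorithm 2 with the model of (24)-(28) can be sketched as follows; the noise variances q1 and q2 and the update interval T are assumed values, not the thesis's tuning:

```python
import numpy as np

def make_kalman(T, q1=1e-3, q2=1e-2):
    """One-step-prediction Kalman filter for the Newtonian model of eqs. (24)-(28)."""
    F = np.block([[np.eye(2), T * np.eye(2)],
                  [np.zeros((2, 2)), np.eye(2)]])               # eq. (25)
    C = np.hstack((np.eye(2), np.zeros((2, 2))))                # eq. (26)
    Q1, Q2 = q1 * np.eye(4), q2 * np.eye(2)                     # eq. (28)
    x = np.zeros(4)                                             # initial state estimate
    K = np.zeros((4, 4))                                        # K0 = 0, as in section 5.2
    def step(y):
        nonlocal x, K
        G = F @ K @ C.T @ np.linalg.inv(C @ K @ C.T + Q2)       # Kalman gain, line 2
        x = F @ x + G @ (y - C @ x)                             # predicted state, lines 3-4
        K = F @ (K - np.linalg.inv(F) @ G @ C @ K) @ F.T + Q1   # Riccati update, line 5
        return x[:2]                                            # predicted position
    return step

# feed noiseless samples of a source moving at constant velocity (0.1, 0.05) per step
step = make_kalman(T=1.0)
for n in range(200):
    pos = step(np.array([0.1 * n, 0.05 * n]))
print(np.round(pos, 2))
```

Calling `step` once per new position sample is exactly the single-pass-per-sample usage described above; after the transient, the predictor tracks the constant-velocity source with essentially no lag.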

    6 Experiments

    6.1 Testing the angle of arrival

The algorithm for estimating the angle of arrival is evaluated using measurements with different types of sound and room environments, and from different angles relative to the sensor array. The three scenarios are:

- Speech in a room with low echo.
- Speech in a room with moderate echo.
- White Gaussian noise in a room with low echo.

The speech used is pre-recorded speech of random phrases. The room is of size 4 × 5 m. One wall has an acoustic damper covering it, and the other walls are unblocked, giving a moderate echo. Along the walls are some tables with computer equipment, home entertainment systems, speakers and some chairs. Figure 6 shows a general overview of the room, the placement of the sensors and the placement of the source at the different angles. The source is placed at four angles: 0°, 22.5°, 45° and 67.5°. Figure 7 shows the same room, but with acoustic dampers placed along the walls around the sensor array to reduce the echo.

The sound is played through a speaker placed at the angles shown in figures 6 and 7, at a distance of 2 m from the array, at normal speech level. Noise is present in the form of computer fans and ventilation, and the signal-to-noise ratio at the sensors is about 15 dB. The sample rate is 8 kHz. The array consists of 6 microphones with an inter-sensor distance of 4 cm.

    6.1.1 Bias and variance

Bias is an offset in the estimated parameter compared to the real parameter. Figure 8 shows the estimated angles for the different scenarios. The performance is evaluated as a function of the number of subbands in the DFT filter bank.

White noise is located fairly accurately. As the angle of arrival approaches the edges and as the reverberation level increases, the bias also increases. By using a high number of subbands and with a source not located at the edge of the sensor array, the bias can be kept below 5°. That is roughly equivalent to an offset of about 2.5 dm at a distance of 3 m from the array.

The variance, or the standard deviation, of the estimate is a measure of how much a specific sample generally deviates from the average value. Figure 9 shows the deviation measured at different angles for the different scenarios.

As with the bias, the variance for white noise is very low. For speech, the variance is about the same for low and moderate echo, as long as the source is not located near the edge of the sensor array.

    6.2 Testing the localization and tracking

The localization and tracking algorithms are tested in the same room as before. Two scenarios are tested:

- Two fixed talkers having a conversation.
- A single talker moving in a circle.


In both scenarios, the sample rate is 8 kHz and a 512-subband filter bank is used.

    6.2.1 Two fixed talkers

The scenario setup is given in figure 10. The distance between the two sensor subarrays is 1.5 m, and the two talkers are located 1.7 m out from the arrays.

The scenario simulates two talkers having a conversation. The test consists of three phases. They begin by speaking one at a time for about 20 s each. Then they start talking for 5 s each, to simulate more rapid changes in the location estimates, and in the last phase they talk at the same time, to see how the algorithms handle two simultaneous sources.

Figure 11 shows the result from the evaluation after track association and filtering. Figure 11(a) shows the x and y position components over time. The first two phases pass without problems; the sources are clearly separated and located. In the third phase, the algorithm finds two separate sources and tracks them independently, although tracks are sometimes lost and recreated. Figure 11(b) shows the positions of the sources as a view from above.

    6.2.2 Single moving talker

The setup in this scenario is shown in figure 12. The distance between the sensor subarrays is, as in the previous scenario, 1.5 m. The talker is now moving in a circle, about 1.8 m out from the arrays. The result from this evaluation is shown in figure 13, where figure 13(a) shows the x and y position components over time and figure 13(b) shows the position from above.

    7 Real time implementation

The algorithms were first implemented and evaluated in Matlab. When the algorithms were working properly, the Matlab M-code was translated, by hand, to C++. Around the translated code, an interface was implemented for interaction with the user. The program is written for the Windows platform, using the ASIO standard for communication with sound recording equipment. Because everything was thoroughly tested in Matlab, the translation went smoothly. The general structure of the code in both Matlab and C++ is similar, so the translation was basically a line-by-line translation.

The main concern in the beginning was the available CPU time. It was later found that this was not really the biggest problem in implementing the algorithms in real time. A standard-equipped Pentium 4 at 1.5 GHz could easily handle 2-3 arrays with 4-6 sensors per array, at sample rates up to 16 kHz (enough to sample speech at good quality), and filter banks with 1024 subbands. As new computers have significantly more computing power, CPU time is not a problem unless the arrays become too large or too many.

    8 Conclusion and further development

Different algorithms were first evaluated for estimating the angle of arrival. Other than the steered response power algorithm described in this thesis, the algorithms tried initially were the following:

- Calculating the cross correlation in the time domain and searching for a peak in the cross correlation.

- Using an LMS filter, where the adaptive filter is used to estimate the delay between a signal from a reference sensor and the other sensors. The slope of the phase response of the filter determines the delay. Ideally, the impulse response of the filter is a delayed δ-impulse, and the phase response is a straight line.

- Estimating the slope of the phase of the cross power spectrum, as described in [ADBS95]. Ideally, only a delay is present, and the cross power spectrum is of the form e^{−jωτ}.

Except for the first, the cross correlation calculated in the time domain, they all work well on synthetic data. The cross correlation calculated in the time domain did not have enough resolution, as the delay could only be estimated in multiples of the sampling period. When real recorded data was used, the LMS filter and the cross power spectrum method were too inaccurate when estimating the slope of the phase.

For speech in reverberant rooms, only the SRP algorithm used in this thesis worked well enough to be used in practice. Together with the PHAT weighting function in the generalized cross correlation, the SRP-PHAT algorithm forms a robust method for estimating the angle of arrival with a sensor array. It is also a good choice for real time applications, as it doesn't require much computing power compared to what's available in a standard desktop computer.

The filter bank was also a huge improvement compared to only using the DFT. The filter bank forms a time-averaged spectrum, making the important phase information less variant for the inter-sensor delay estimator. The computational complexity of the filter bank is higher, but well within the limits for real time applications, and the improved precision was well worth it.

The linear intersection, a closed-form algorithm, is computationally very efficient. By associating samples with tracks, and spatially filtering the tracks, the localization algorithms are able to quickly locate and track multiple sources; not just alternating sources, but also, to some extent, simultaneous sources.

Further, the algorithms can be improved with smart acoustic detectors and classifiers to classify sounds and locate only certain types of events (or ignore them), such as tracking speech only or locating noise sources. The method for detecting multiple sources can also be improved; the current implementation relies on the two sources having about the same signal power level at the subarrays.


  • References

[ADBS95] John E. Adcock, Joseph H. DiBiase, Michael S. Brandstein, and Harvey F. Silverman. Practical issues in the use of a frequency-domain delay estimator for microphone-array applications. January 1995.

[BAS97] Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman. A closed-form location estimator for use with room environment microphone arrays. IEEE Transactions on Speech and Audio Processing, 5(1):45-50, January 1997.

[Hay02] Simon Haykin. Adaptive Filter Theory. Prentice Hall, fourth edition, 2002.

[KC76] Charles H. Knapp and G. Clifford Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(4):320-327, August 1976.

[LRV01] Jan Lundgren, Mikael Rönnqvist, and Peter Värbrand. Linjär och icke-linjär optimering. Studentlitteratur, 2001.

[SBS97] Douglas E. Sturim, Michael S. Brandstein, and Harvey F. Silverman. Tracking multiple talkers using microphone-array measurements. 1:371-374, 1997.


  • 0=0

    =22,51

    =452

    =67,53

    200 c

    m

    Figure 6: Room with moderate echo.

    0=0

    =22,51

    =452

    =67,53

    200 c

    m

    Figure 7: Room with low echo.

    22

  • 64 128 256 512 1024 2048

    20

    15

    10

    5

    0

    5

    10

    15

    20

    Subbands

    Angl

    e of

    arri

    val [d

    egree

    s]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (a) Real angle is 0.

    64 128 256 512 1024 20480

    5

    10

    15

    20

    25

    30

    35

    40

    45

    Subbands

    Angl

    e of

    arri

    val [d

    egree

    s]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (b) Real angle is 22,5.

    64 128 256 512 1024 2048

    25

    30

    35

    40

    45

    50

    55

    60

    65

    Subbands

    Angl

    e of

    arri

    val [d

    egree

    s]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (c) Real angle is 45.

    64 128 256 512 1024 204845

    50

    55

    60

    65

    70

    75

    80

    85

    90

    Subbands

    Angl

    e of

    arri

    val [d

    egree

    s]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (d) Real angle is 67,5.

    Figure 8: Bias of estimated angles.

    23

  • 64 128 256 512 1024 2048101

    100

    101

    102

    Subbands

    Stan

    dard

    dev

    iatio

    n [de

    grees

    ]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (a) Standard deviation at 0.

    64 128 256 512 1024 2048101

    100

    101

    102

    Subbands

    Stan

    dard

    dev

    iatio

    n [de

    grees

    ]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (b) Standard deviation at 22,5.

    64 128 256 512 1024 2048101

    100

    101

    102

    Subbands

    Stan

    dard

    dev

    iatio

    n [de

    grees

    ]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (c) Standard deviation at 45.

    64 128 256 512 1024 2048101

    100

    101

    102

    Subbands

    Stan

    dard

    dev

    iatio

    n [de

    grees

    ]

    Speech, moderate echoSpeech, low echoNoise, low echo

    (d) Standard deviation at 67,5.

    Figure 9: Standard deviation of estimated angles.

    24

[Figure 10: Two speakers having a conversation. Setup with speakers A and B, offset 75 cm to each side and 150 cm from the array along the x- and y-axes.]

[Figure 11: Two speakers having a conversation. (a) x and y positions as functions of time; (b) x plotted against y.]

[Figure 12: Single speaker moving in a circle. Same geometry as Figure 10: 75 cm offsets and 150 cm from the array along the x- and y-axes.]

[Figure 13: Single talker moving in a circle. (a) x and y positions as functions of time; (b) x plotted against y.]
