A Non-Parametric Bayesian Approach to Spike Sorting

Frank Wood† Sharon Goldwater‡ Michael J. Black†

†Department of Computer Science, ‡Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI, USA

Abstract— In this work we present and apply infinite Gaussian mixture modeling, a non-parametric Bayesian method, to the problem of spike sorting. As this approach is Bayesian, it allows us to integrate prior knowledge about the problem in a principled way. Because it is non-parametric we are able to avoid model selection, a difficult problem that most current spike sorting methods do not address. We compare this approach to using penalized log likelihood to select the best from multiple finite mixture models trained by expectation maximization. We show favorable offline sorting results on real data and discuss ways to extend our model to online applications.

Index Terms— Spike sorting, mixture modeling, infinite mixture model, non-parametric Bayesian modeling, Chinese restaurant process, Bayesian inference, Markov chain Monte Carlo, expectation maximization, Gibbs sampling.

I. INTRODUCTION

Spike sorting is a collection of processes whose purpose is to both distinguish between multiple cells recorded together and to assign detected spiking activity to the neuron responsible for generating it [1]–[6]. This definition presumes that spikes have been detected a priori (disambiguated from background noise) by some other process. This is a narrow definition of spike sorting, and one that admittedly may not accurately describe more sophisticated approaches that combine the two steps intelligently [2], [7]. We further restrict our focus to a single channel of data, disregarding the valuable correlation and spatial information provided by closely spaced electrodes. We adopt this definition and viewpoint to highlight the model identification problem.

A central problem in spike sorting is determining how many neurons are present in a recording. This problem can be viewed more generally as a problem of statistical model identification if we adopt a mixture modeling approach, as in [3], [5], [6]. In mixture modeling the distribution of a population (all the spikes from a channel) is formulated as the additive combination of a number of hidden subpopulations, or classes (the spikes from each individual cell). Learning a parametric mixture model consists of determining the parameters of the classes as well as the proportion of each class in the full population. Traditional approaches such as expectation maximization (EM) [8] provide no explicit solution to the problem of how many classes there are (how many cells). The typical approach for determining that number is to learn many different models, one or more for each choice of the number of classes. Then a penalty such as the Bayesian information criterion (BIC) or the Akaike information criterion is used to select the model with the best penalized log-likelihood of held-out data.

This work was supported by NIH-NINDS R01 NS 50967-01 as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.

Correspondence F. Wood ([email protected])

Many solutions have been proposed to the model selection problem [9], [10]. Here we apply the infinite mixture modeling (IMM) technique of [11] to the spike sorting problem and demonstrate that it may in some ways outperform standard maximum likelihood techniques on real data while avoiding the model identification problem. We present an abbreviated introduction to IMM; for more details the reader is referred to [11]–[16] among others.

II. METHODS

Suppose we have a single channel neurophysiological recording R = [~t_1, . . . , ~t_N] consisting of N spike waveforms, where each waveform is a vector of n voltage samples ~t_i = [t_i^1, . . . , t_i^n]^T ∈ R^n (n = 40 throughout). Our goal is to ‘sort’ this channel by figuring out how many units (neurons) are present and which waveforms came from which units.

To both improve the numerical robustness of our approach and to aid in human verification of our results, we used a reduced dimensionality representation of the waveforms, where each ~t_i is represented by bases obtained via principal component analysis (PCA) such that ~t_i ≈ ∑_{d=1}^{D} y_i^d ~u_d. Here ~u_d is the dth PCA basis vector, and the y_i^d are the linear coefficients. Our spike sorting algorithm clusters the low dimensionality representation of the waveforms Y = [~y_1, . . . , ~y_N] rather than the full waveforms, so, for the remainder of this paper, when we write ‘spike’ or ‘waveform’ it should be read as shorthand for ‘low dimensionality waveform representation’. For learning and sampling we used D = 3 while in Figures 1 and 2 we show D = 2, accounting for approximately 80% and 73% of the waveform variance respectively.
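As a concrete illustration of this preprocessing step, the following is a minimal sketch of projecting spike waveforms onto their first D principal components. The variable names, the use of an SVD, and the mean-centering step are our own assumptions for illustration; the paper does not specify these details.

    import numpy as np

    def pca_coefficients(waveforms: np.ndarray, D: int = 3) -> np.ndarray:
        """waveforms: (N, n) array of spike snippets; returns the (N, D) coefficients y_i^d."""
        centered = waveforms - waveforms.mean(axis=0)   # assumed: waveforms are mean-centered first
        # Rows of Vt are the PCA basis vectors ~u_d (right singular vectors).
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ Vt[:D].T                      # y_i^d = projection of spike i onto ~u_d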

Like others we assume that the distribution of PCA coefficients of spike waveforms is well modeled by a multivariate Gaussian [5]. Under this assumption it makes sense to model a channel as a Gaussian mixture model, with one Gaussian density accounting for each hidden neuron. The corresponding generative model is

c_i | ~π ∼ Multinomial(·|~π)    (1)
~y_i | c_i = k, Θ ∼ N(·|θ_k)

where C = {c_i}_{i=1}^N indicate which class each spike belongs to, Θ = {θ_k}_{k=1}^K, θ_k = {~µ_k, Σ_k} are the class distribution parameters, and ~π = {π_k}_{k=1}^K, π_k = P(c_i = k) are the


mixture weights. To generate a spike in this model, first choose a neuron according to the multinomial ~π, then sample a spike from the normal distribution for that neuron.
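The short sketch below shows this generative process for a finite K. All parameter values are invented for illustration and are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    K, D, N = 3, 2, 500
    pi = np.array([0.6, 0.3, 0.1])                           # mixture weights ~π
    mus = rng.normal(0.0, 3.0, size=(K, D))                  # class means ~µ_k
    Sigmas = np.stack([0.1 * np.eye(D) for _ in range(K)])   # class covariances Σ_k

    c = rng.choice(K, size=N, p=pi)                          # c_i ∼ Multinomial(~π)
    Y = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in c])  # ~y_i ∼ N(θ_{c_i})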

If we know the number of neurons, K, and the neuron responsible for each spike, we can compute the complete data likelihood

P(Y, C | ~π, Θ) = ∏_{i=1}^{N} π_{c_i} P(~y_i | c_i, Θ).    (2)

Unfortunately we know neither. To address this problem we may treat the class indicators as hidden random variables and use Expectation Maximization (EM) to find a maximum likelihood (ML) estimate of the model parameters

~π, Θ = arg max_{~π,Θ} log P(Y | ~π, Θ) = arg max_{~π,Θ} log ∑_C P(Y, C | ~π, Θ).    (3)

EM is a powerful algorithm and is guaranteed to converge to a local maximum of the likelihood. However, it has several problems. If the likelihood has multiple maxima (as is usually the case), the global maximum will not necessarily be found, and the choice of parameters for initialization can have a marked effect on the results. Also, since the likelihood of the data can be increased by adding more mixture components, a method for choosing the correct K must be adopted. One way is to penalize the likelihood by an information theoretic measure of model complexity such as the Bayesian information criterion BIC = −2 log P(Y | ~π, Θ) + ν_K log(N), where ν_K is the number of free parameters in a model with K hidden densities. Another way is to assume, as in [9] and [10], that K is finite but unknown.
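To make this baseline procedure concrete, here is a hedged sketch of model selection by BIC over candidate values of K. scikit-learn's GaussianMixture is used purely for illustration (the paper's own EM implementation is not described); its bic() method returns −2 log L plus the parameter penalty, so lower is better.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_k_by_bic(Y: np.ndarray, k_max: int = 15, restarts: int = 10):
        """Fit a finite GMM with EM for each K and keep the model with the best (lowest) BIC."""
        best = None
        for k in range(1, k_max + 1):
            gmm = GaussianMixture(n_components=k, n_init=restarts, random_state=0).fit(Y)
            bic = gmm.bic(Y)
            if best is None or bic < best[0]:
                best = (bic, k, gmm)
        return best  # (bic_score, chosen_K, fitted_model)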

As we will show, adopting a fully Bayesian approach allows us to avoid these problems. Bayes' rule tells us that the posterior probability of a model h given the observed data Y is proportional to the prior probability of h times the likelihood:

P(h|Y) ∝ P(h)P(Y|h)    (4)

EM assumes that the prior probability of all hypotheses is equal, and seeks the single model with highest posterior probability (the maximum a posteriori (MAP) solution). Here, we use sensible priors to encode our knowledge about the problem, and seek to learn the full posterior distribution. This ultimately will allow us to avoid the model selection problem by marginalizing out the unknown model parameters (including the number of classes) when making inferences about the data.

Assume for the moment that we know the value of K. We choose conjugate priors for the model parameters:¹ Dirichlet for ~π and Normal times Inverse-Wishart for θ_k [17], [18].

~π | α ∼ Dirichlet(·| α/K, . . . , α/K)    (5)
Θ ∼ G_0

where Θ ∼ G_0 is shorthand for

Σ_k ∼ Inverse-Wishart_{υ_0}(Λ_0^{-1})
~µ_k ∼ N(~µ_0, Σ_k / κ_0).

¹A prior is conjugate if it yields a posterior that is in the same family as the prior (a mathematical convenience).

The Dirichlet hyperparameters, here symmetric α/K, can encode our beliefs about how uniform or skewed the class mixture weights will be. The parameters of the Normal times Inverse-Wishart prior, Λ_0^{-1}, υ_0, ~µ_0, κ_0, can encode our prior experience regarding the shape and position of the mixture densities. For instance ~µ_0 specifies where we believe the means of the mixture densities should be, and κ_0 is the number of pseudo-observations we are willing to ascribe to that belief [17]. The hyperparameters Λ_0^{-1} and υ_0 behave similarly for the mixture density covariances.
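To illustrate, here is a minimal sketch of drawing class parameters θ_k = (~µ_k, Σ_k) from this Normal times Inverse-Wishart prior. The hyperparameter values are invented, and mapping Λ_0 onto scipy's scale parameter is an assumption (Inverse-Wishart parametrization conventions differ across texts and libraries).

    import numpy as np
    from scipy.stats import invwishart

    D = 3
    Lambda0 = 0.1 * np.eye(D)    # prior scale: roughly spherical, tight clusters (illustrative)
    nu0 = D + 2                  # prior degrees of freedom (illustrative)
    mu0 = np.zeros(D)            # prior location of the class means
    kappa0 = 0.01                # pseudo-observations ascribed to the prior mean

    def sample_theta(rng=np.random.default_rng()):
        Sigma_k = invwishart(df=nu0, scale=Lambda0).rvs(random_state=rng)
        mu_k = rng.multivariate_normal(mu0, Sigma_k / kappa0)
        return mu_k, Sigma_k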

Under this model, the posterior distribution is

P(C, Θ, ~π, α | Y) ∝ P(Y | C, Θ) P(Θ | G_0) [ ∏_{i=1}^{N} P(c_i | ~π) ] P(~π | α) P(α).    (6)

This distribution cannot be computed analytically, but one can obtain samples from it using Markov chain Monte Carlo (MCMC) methods [12]. These methods simulate a Markov chain whose equilibrium distribution is the posterior distribution in Eqn. 6. We describe our particular sampler below.

Sampling from this posterior distribution circumvents the problems with initialization and local optima that exist with ML estimation. We can avoid the problem of model selection by assuming that there are an infinite number of causal classes, but that only a finite number are ever observed in a finite amount of data. This is possible because the posterior (Eqn. 6) is well behaved as K → ∞. This approach, first developed by [11], is an application of Dirichlet process mixture modeling [13], [14], [19].

The only terms in Eqn. 6 that involve K are P(c_i|~π) and P(~π|α). Further, because we chose conjugate priors, we can marginalize out ~π [15]:

P(C | α) = ∫ ∏_{i=1}^{N} P(c_i | ~π) P(~π | α) d~π = ( ∏_{k=1}^{K} Γ(m_k + α/K) ) / Γ(α/K)^K · Γ(α) / Γ(N + α).    (7)

Taking the limit as K →∞ (for details see [15]), we have

P(C | α) = α^{K_+} ( ∏_{k=1}^{K_+} (m_k − 1)! ) · Γ(α) / Γ(N + α)    (8)

where K_+ is the number of classes containing at least one item, m_k = ∑_{i=1}^{N} I(c_i = k) is the number of items in class k, and Γ(α) = (α − 1)Γ(α − 1) is the Gamma function.

Ultimately, we wish to draw samples from the posterior distribution over models in Eqn. 6 when K → ∞. We do so using an MCMC method known as Gibbs sampling [13], in which new values for each model parameter are repeatedly sampled conditioned on the current values of all other variables. Upon convergence, these samples will approximate the posterior distribution.

The state of our sampler consists of {C, Θ}. We sample new values for Θ according to

P(θ_k | C, Y, Θ_{−k}, ~π, α) ∝ ∏_{i s.t. c_i = k} P(~y_i | c_i, θ_k) P_{G_0}(θ_k)    (9)

Page 3: A Non-Parametric Bayesian Approach to Spike Sortinghomepages.inf.ed.ac.uk/sgwater/papers/embs06-spikes.pdf · A Non-Parametric Bayesian Approach to Spike Sorting Frank Wood †Sharon

where Θ_{−k} = {θ_1, . . . , θ_{k−1}, θ_{k+1}, . . . , θ_{K_+}} and P_{G_0}(θ_k) is the probability of θ_k under G_0. C is sampled according to

P(c_i = k | C_{−i}, Y, Θ, ~π, α) ∝ P(~y_i | c_i, Θ) P(c_i | C_{−i})    (10)

where C_{−i} = {c_1, . . . , c_{i−1}, c_{i+1}, . . . , c_N}. The conditional prior P(c_i = k | C_{−i}) follows from Eqn. 8 and is given by

P(c_i = k | C_{−i}) = m_k / (i − 1 + α)   if k ≤ K_+
P(c_i = k | C_{−i}) = α / (i − 1 + α)     if k > K_+    (11)

The generative process in Eqn. 11 is known as the Chinese restaurant process [15].
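As an illustration of this sequential process, the sketch below draws a class assignment for each spike in turn: an existing class k is chosen with probability proportional to its current count m_k, and a new class with probability proportional to α. The function name and interface are our own.

    import numpy as np

    def sample_crp(N: int, alpha: float, rng=np.random.default_rng()):
        """Draw class assignments c_1, ..., c_N from the Chinese restaurant process prior."""
        assignments, counts = [], []
        for i in range(N):                       # i items are already assigned when item i+1 arrives
            probs = np.array(counts + [alpha], dtype=float)
            probs /= i + alpha                   # normalizer (i - 1 + α) for the i-th item, 1-indexed
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):
                counts.append(1)                 # open a new class
            else:
                counts[k] += 1
            assignments.append(k)
        return assignments, counts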

We sampled the remaining free parameter, α, using a Metropolis update, although it too can be updated using a Gibbs step [16].

III. EXPERIMENTS

Four multi-unit hand sorted channels were selected from a multi-channel extra-cellular recording of an awake and behaving monkey performing a task. For a detailed description of the recording technique and data format see [6]. A regular subsample of the waveforms recorded on each channel was taken, corresponding to retaining every sixteenth spike (leaving 814, 299, 1360, and 838 on channels 1-4 as viewed left to right in Fig. 1). By starting at a different offset a separate held-out dataset was constructed similarly.

Fig. 1 shows modeling results for each channel using ML and infinite mixture modeling (IMM) techniques. For ML we used EM to train a model for each value of K between 1 and 15. Each model was trained 10 times per channel starting from different random seeds, and the model with the overall best BIC score was selected. For IMM we used the sampler described above, again run 10 times per channel with different random seeds. Each run generated 5000 samples and the first 500 were discarded as burn-in. The sample with the highest posterior probability (an estimate of the MAP solution) is shown in Fig. 1. To generate these results, we used a Gamma(1, 1) prior for α to encode our belief that each channel has relatively few neurons. We specified Λ_0 to be isotropic with variance equal to 0.1, encoding our belief that the clusters should be roughly spherical and tightly clustered. We set ~µ_0 = 0. The other hyperparameters were assigned to give low weight to our prior. We believe that the clusters found by IMM more closely resemble the human-sorted ‘ground truth’ than do the EM clusters. In Fig. 1 one outlier was removed from channels 2 and 4. The channel 4 outlier is retained in Fig. 2.

A quantitative evaluation is shown in Table I, where we compare the log likelihood of held out data under the MAP estimates obtained from IMM and EM². Note that computing the true log likelihood of held out data under the IMM requires integrating over all possible assignments of held-out spikes to neurons. Instead, we computed a lower bound on this quantity by constructing a (finite) Gaussian mixture model approximation to the infinite model, estimating ~π from the classification induced by the sample with K = K_+.

²Using only the MAP estimate from IMM is somewhat unfair to the Bayesian approach, since it discards the rest of the information contained in the posterior distribution. However, it makes evaluation simpler.
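A hedged sketch of that finite approximation follows: the mixture weights are estimated from the sampled classification and combined with the sampled class parameters to score held-out spikes. The function name and interfaces are illustrative assumptions, not the authors' code.

    import numpy as np
    from scipy.stats import multivariate_normal

    def heldout_loglik(Y_heldout, assignments, mus, Sigmas):
        """Lower-bound the held-out log likelihood of an IMM sample via a finite GMM."""
        counts = np.bincount(assignments)
        pi = counts / counts.sum()                 # ~π estimated from the sampled classification
        # log-sum-exp of log( π_k N(y | µ_k, Σ_k) ) over the K_+ occupied classes
        comp = np.stack([np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(Y_heldout)
                         for k in range(len(pi))])
        m = comp.max(axis=0)
        return np.sum(m + np.log(np.exp(comp - m).sum(axis=0)))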


Fig. 2. Weights and 95% confidence intervals for the IMM MAP estimate (left) and the ML estimate with best BIC score (right) for channel 4.

TABLE I
PER-CHANNEL AVERAGE LOG LIKELIHOOD OF HELD-OUT DATA (×10³)

Channel      1               2               3               4
IMM       -5.91 ± 0.03    -2.55 ± 0.05    -9.37 ± 0.04    -5.83 ± 0.02
EM        -5.81 ± 0.02    -2.15 ± 0.17    -9.39 ± 0.07    -5.84 ± 0.59

Given this approximation, we expected EM to outperform IMM, but their performance is virtually indistinguishable. We believe this is because the finite mixture model, lacking the influence of priors, is overfitting and doesn't generalize well. We observed that EM achieved higher log likelihood on the training data than IMM, which supports this assertion.

A benefit of using IMM is its implicit prior on class weights, which prefers clusterings with a few prominent clusters. ML, on the other hand, implicitly assumes a uniform prior over class weights and thus expresses no preference among clusterings. Figure 2 illustrates the effects of these assumptions on clustering.

IV. DISCUSSION AND FUTURE WORK

In this work we have applied non-parametric Bayesian mixture modeling techniques to the problem of spike sorting and compared this method to maximum likelihood finite Gaussian mixture modeling. It should be noted that non-parametric Bayesian techniques in the form we have discussed suffer from a problem similar to the EM halting problem; i.e. it is difficult to determine when the Markov chain simulated by the Gibbs sampler has converged to its equilibrium distribution. In practice, convergence may not be necessary to find a good solution for a particular data set. Additionally, more sophisticated methods for initializing EM have been tried [6]. These are uncommon in general practice and, in limited experimentation, did not affect our reported results.

The primary contribution of this work is in presenting a way to avoid the problem of model selection in spike sorting. By changing the modeling assumption from ‘There are K neurons on this channel.’ to ‘There are an infinite number of neurons that could be on this channel; how many are actually recorded?’ it is possible to select MAP IMM model parameterizations that are competitive with traditional ML density estimators such as EM. This approach also allows us to address a wider range of questions than those allowed by traditional ML models. Due to space limitations, we provide only a single example: if we are interested in whether or not two spikes originate from the same neuron, the IMM yields



Fig. 1. Results from sorting four channels of real neural data using both maximum likelihood (ML) finite Gaussian mixture modeling (GMM) and infinite Gaussian mixture modeling (IMM). Channels 1-4 are shown left to right in columns. The first three rows show the sorted waveforms (0-1000 µsec) for human, IMM, and ML sorting. The last three rows show the sorted PCA waveform coefficients in the same order. The ML results are the model with the best Bayesian information criterion score. The IMM results are a single sample estimate of the MAP model drawn from the sampled posterior.

an answer that implicitly averages over an infinite number of models, avoiding the model selection problem. We believe that this is an important feature given the large amount of noise and ambiguity currently intrinsic to neural signals.

As developed in this work, the non-parametric Bayesian spike sorting approach is practical for offline analysis, yet it is also possible to extend it to online spike sorting. By estimating the posterior sequentially, an online spike sorter could be built that dynamically accommodates non-stationarity in the signal, adding neurons when sufficient evidence for their existence is recorded. This not only will solve the problem of determining when the Markov chain reaches equilibrium, but it also could be valuable for online analyses, long term neural decoding, and future clinical neuroprosthetic applications.

V. ACKNOWLEDGEMENT

We wish to thank M. Fellows for data, S. Roth for careful editing, and T. L. Griffiths for introducing us to food-types as statistical metaphors.

REFERENCES

[1] K. H. Kim and S. J. Kim, “Neural spike sorting under nearly 0-dB signal-to-noise ratio using nonlinear energy operator and artificial neural-network classifier,” IEEE Trans. on Biomedical Engineering, vol. 47, pp. 1406–1411, 2000.

[2] S. Takahashi, Y. Anzai, and Y. Sakurai, “Automatic sorting for multi-neuronal activity recorded with tetrodes in the presence of overlapping spikes,” Journal of Neurophysiology, vol. 89, pp. 2245–2258, 2003.

[3] S. Shoham, M. R. Fellows, and R. A. Normann, “Robust, automatic spike sorting using mixtures of multivariate t-distributions,” Journal of Neuroscience Methods, vol. 127(2), pp. 111–122, 2003.

[4] E. Hulata, R. Segev, and E. Ben-Jacob, “A method for spike sorting and detection based on wavelet packets and Shannon’s mutual information,” Journal of Neuroscience Methods, vol. 117, pp. 1–12, 2002.

[5] M. S. Lewicki, “A review of methods for spike sorting: the detection and classification of neural action potentials,” Network, vol. 9, no. 4, pp. R53–78, 1998.

[6] F. Wood, M. Fellows, J. P. Donoghue, and M. J. Black, “Automatic spike sorting for neural decoding,” in IEEE Engineering Medicine Biological Systems, 2004, pp. 4126–4129.

[7] M. Sahani, J. S. Pezaris, and R. A. Andersen, “On the separation of signals from neighboring cells in tetrode recordings,” in Advances in Neural Information Processing Systems 10. MIT Press, 1998.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, B, vol. 39, 1977.

[9] N. Friedman, “The Bayesian structural EM algorithm,” in Proceedings of the 14th Annual Conference on Uncertainty in AI, 1998, pp. 129–138.

[10] P. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika, pp. 711–732, 1995.

[11] C. Rasmussen, “The infinite Gaussian mixture model,” in Advances in Neural Information Processing Systems 12. Cambridge, MA: MIT Press, 2000.

[12] R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo methods,” University of Toronto, Tech. Rep. CRG-TR-93-1, 1993.

[13] ——, “Markov chain sampling methods for Dirichlet process mixture models,” Department of Statistics, University of Toronto, Tech. Rep. 9815, 1998.

[14] S. MacEachern and P. Muller, “Estimating mixture of Dirichlet process models,” Journal of Computational and Graphical Statistics, vol. 7, pp. 223–238, 1998.

[15] T. L. Griffiths and Z. Ghahramani, “Infinite latent feature models and the Indian buffet process,” Gatsby Computational Neuroscience Unit, Tech. Rep. 2005-001, 2005.

[16] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” Journal of the American Statistical Association, vol. 90, pp. 577–588, 1995.

[17] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian data analysis. New York: Chapman & Hall, 1995.

[18] C. Fraley and A. E. Raftery, “Bayesian regularization for normal mixture estimation and model-based clustering,” Department of Statistics, University of Washington, Seattle, Washington, Tech. Rep. 05/486, 2005.

[19] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” Journal of Computational and Graphical Statistics, vol. 9, pp. 249–265, 2000.