2048 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 28, NO. 9, SEPTEMBER 2017

Online Learning of Hierarchical Pitman–Yor Process Mixture of Generalized Dirichlet Distributions With Feature Selection

Wentao Fan, Hassen Sallay, and Nizar Bouguila, Senior Member, IEEE

Manuscript received July 11, 2015; revised February 2, 2016; accepted May 20, 2016. Date of publication June 9, 2016; date of current version August 15, 2017. The work of W. Fan and N. Bouguila was supported by the Natural Sciences and Engineering Research Council of Canada. The work of H. Sallay was supported by the King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia, under Grant 11-INF1787-08. (Corresponding author: Nizar Bouguila.)

W. Fan is with the Department of Computer Science and Technology, Huaqiao University, Xiamen 362021, China (e-mail: [email protected]).

H. Sallay is with the College of Computer and Information Systems, Umm Al-Qura University, Mecca 24382, Saudi Arabia (e-mail: [email protected]).

N. Bouguila is with the Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC H3G 1T7, Canada (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2016.2574500

2162-237X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—In this paper, a novel statistical generative model based on the hierarchical Pitman–Yor process and generalized Dirichlet (GD) distributions is presented. The proposed model allows us to perform joint clustering and feature selection thanks to the interesting properties of the GD distribution. We develop an online variational inference algorithm for the resulting model, formulated in terms of the minimization of a Kullback–Leibler divergence, that tackles the problem of learning from high-dimensional examples. This variational Bayes formulation allows simultaneously estimating the parameters, determining the model's complexity, and selecting the appropriate relevant features for the clustering structure. Moreover, the proposed online learning algorithm allows data instances to be processed in a sequential manner, which is critical for large-scale and real-time applications. Experiments conducted using challenging applications, namely, scene recognition and video segmentation, where our approach is viewed as an unsupervised technique for visual learning in high-dimensional spaces, showed that the proposed approach is suitable and promising.

Index Terms—Generalized Dirichlet (GD), hierarchical Pitman–Yor process, mixture models, nonparametric Bayesian, scene recognition, variational inference, video segmentation.

I. INTRODUCTION

CONSIDERABLE advances in broadband networks and information technology have made it possible to process, share, and store huge amounts of images and videos. As a matter of fact, image and video databases are now widely used and analyzed in many applications from different areas, using mainly data mining and machine learning techniques [1]. In particular, generative models have been widely adopted [2]–[6]. They are very appealing for many reasons in the context of real-life scenarios. For instance, these models can be learned incrementally and can efficiently deal with missing data in a principled way. The Gaussian mixture has been perhaps the most widely adopted generative model in the past [7]–[11]. Recently, Andrews and McNicholas [12], Zio et al. [13], and Bdiri et al. [14] have proposed the consideration of other mixture models, or even mixtures of mixtures, to overcome the limitations related to the Gaussian. Indeed, real data are generally very complex, typically involving elements of non-Gaussianity and high dimensionality [15]–[17]. These are problems of fundamental importance in several areas and domains. For instance, Bouguila and Ziou [18] proposed the consideration of the generalized Dirichlet (GD) mixture as an efficient alternative to the Gaussian. The GD mixture has been shown to offer flexibility and ease of use in several challenging real-life applications, especially those involving proportional data [18]–[20]. Examples include text analysis and visual object categorization using the bag-of-words and the bag-of-visual-words representations, respectively. Model assessment and comparison is a crucial problem when considering missing data models in general and mixtures of distributions in particular [21]. This problem has been tackled in the case of the GD mixture via the development of a minimum message length criterion in [19].

In order to exploit the GD mixture further, Boutemedjet et al. [22] have proposed a feature selection method that has been improved further in [23] via a variational inference technique. This has been motivated by the fact that dealing with high-dimensional data is of utmost importance, as deeply discussed in [24]–[26]. In this case, generally not all the features are relevant: some are not salient, do not reveal the clustering structure, and may compromise the modeling and generalization capability. Thus, adopting a feature selection approach becomes crucial [27], [28] in order to improve the model's interpretability while decreasing its complexity [29]. The work in this paper can be viewed as an extension of these previous efforts and focuses on the consideration of a hierarchical Pitman–Yor process [30] of GD distributions that allows simultaneous data fitting (i.e., estimation of the model's parameters), model selection (i.e., determination of the model's complexity), and feature selection. In particular, we describe in detail the variational implementation, which we have developed, of the resulting statistical model. This implementation takes into account the dynamic nature of real data [31], [32] and is therefore based on online learning. The consideration of variational learning is motivated by the excellent results it has provided in the past in the case of generative models [33], [34] in general and mixture



models [35] in particular. We investigate the performance of the proposed framework using two challenging applications. The first one involves scene recognition, which is very challenging due to the difficulty of capturing the variability of appearance of diverse scenes belonging to the same category, while avoiding confusing scenes from different categories. Our second application is video segmentation, which has been the topic of extensive research in the past [36]–[41]. In fact, it is a very important step in several applications such as human–computer interaction, object tracking, and content-based video browsing, annotation, and retrieval [9], [42].

The rest of this paper is organized as follows. Related works are introduced in Section II. Section III presents a detailed overview of the proposed hierarchical Pitman–Yor process mixture of GD distributions with feature selection model. In Section IV, we develop our online variational learning approach. Section V shows the results obtained when applying our statistical framework to two challenging applications, namely, video segmentation and scene recognition. This paper ends with conclusions and possible future works in Section VI.

II. RELATED WORKS

Recently, hierarchical Bayesian nonparametric models have drawn significant attention and have been successfully applied in various fields [43]. One of the best-known hierarchical nonparametric Bayesian models is the hierarchical Dirichlet process, which has shown promising results for the problem of model-based clustering of grouped data with shared clusters [44]. In our work, we focus on the hierarchical Pitman–Yor process, which is based on the Pitman–Yor process,1 a two-parameter generalization of the Dirichlet process [30], [45]. It is noteworthy that the Pitman–Yor process exhibits a power-law behavior [30], [45], [46] when its discount parameter lies in (0, 1), which makes it more suitable for applications involving natural phenomena than the Dirichlet process.

Another extension of the hierarchical Dirichlet process model, namely, the Indian buffet process (IBP) compound Dirichlet process (ICD) model, is proposed in [47]. The ICD is a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. It combines properties from the hierarchical Dirichlet process model and the IBP, a Bayesian nonparametric prior on binary matrices. It also considers feature selection by setting Bernoulli priors on the selection variables. In [48], a class of dependent normalized random measures, namely, the thinned normalized random measure (TNRM), is proposed for the nonparametric modeling of dependent probability measures. It uses a feature selection setup similar to that of [47] and shows superior performance on complex data compared with related dependent nonparametric models, such as the hierarchical Dirichlet process model. Nevertheless, both [47] and [48] adopted Markov chain Monte Carlo (MCMC) techniques [49] for model learning.

1The Pitman–Yor process is also sometimes referred to as the two-parameter Poisson–Dirichlet process.

However, in practice, the use of MCMC methods is often limited to small-scale problems due to their high computational demands and the difficulty of diagnosing convergence. Another method, for group feature selection in linear regression problems, has been proposed in [50]. This method is based on a generalized version of the standard spike-and-slab prior distribution, which is often used for individual feature selection, and uses an expectation propagation (EP) learning approach [51] to learn the corresponding model.

An alternative approach to both MCMC and EP learning techniques is known as variational inference [52], [53]. The variational approach, based on analytical approximations to the posterior distribution, remains an active research field due to its good generalization performance as well as its computational tractability in various applications. Several extensions of the conventional variational inference framework have been proposed lately. In [54], instead of adopting the mean-field assumption as in conventional variational inference [52], [53] for learning models with conjugate priors, two approaches, namely, Laplace variational inference and delta-method variational inference, were introduced for nonconjugate models. Hoffman et al. [55] have introduced a stochastic variational inference framework that is based on stochastic optimization, a technique that follows noisy estimates of the gradient of the objective. Stochastic variational inference has demonstrated appealing results in modeling large-scale data sets.

In this paper, we develop an online variational approach for learning a hierarchical Pitman–Yor process mixture model with feature selection. Our work is closely related to [56], in which an online variational inference method for learning the hierarchical Dirichlet process model was introduced. The major differences between our approach and the one proposed in [56] are listed as follows.

1) Rather than using the hierarchical Dirichlet process, the framework of the hierarchical Pitman–Yor process is adopted in our case, which exhibits a power-law behavior that makes it more suitable for applications involving natural phenomena than the Dirichlet process.

2) Motivated by the promising results recently provided by the GD mixture model, we have tailored a hierarchical Pitman–Yor process mixture model that contains the GD as its basic distribution.

3) Different from [56], in which all features are used in model learning, we incorporate an unsupervised feature selection scheme into our hierarchical framework to effectively handle high-dimensional data and improve clustering performance by detecting and eliminating irrelevant features.

4) We apply the resulting model to two challenging applications, namely, scene recognition and video segmentation.
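As a rough illustration of the power-law behavior mentioned in point 1), the following toy sketch (our own illustrative code, not part of the paper's algorithm) simulates the generalized Chinese-restaurant seating scheme induced by a Pitman–Yor process and compares the number of distinct clusters it creates with that of a Dirichlet process, which corresponds to discount 0:

```python
import random

def crp_num_clusters(n, concentration, discount, seed=0):
    """Number of distinct clusters after seating n customers in the
    generalized Chinese restaurant process induced by a Pitman-Yor
    process; discount = 0 recovers the Dirichlet process."""
    rng = random.Random(seed)
    counts = []  # counts[k] = customers already assigned to cluster k
    for i in range(n):  # i = number of customers already seated
        k = len(counts)
        # probability of opening a new cluster
        if rng.random() < (concentration + k * discount) / (concentration + i):
            counts.append(1)
        else:
            # existing cluster j chosen with probability prop. to counts[j] - discount
            u = rng.random() * (i - k * discount)
            idx = k - 1
            for j, c in enumerate(counts):
                u -= c - discount
                if u <= 0.0:
                    idx = j
                    break
            counts[idx] += 1
    return len(counts)

# With discount in (0, 1) the number of clusters grows roughly like
# n**discount, whereas for the Dirichlet process it grows like log(n).
print(crp_num_clusters(5000, 1.0, 0.5), crp_num_clusters(5000, 1.0, 0.0))
```

Running this with a positive discount produces far more clusters than the Dirichlet process for the same concentration, which is the heavy-tailed behavior the paper exploits.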

III. HIERARCHICAL PITMAN–YOR PROCESS MIXTURE OF GD DISTRIBUTIONS WITH FEATURE SELECTION

A. Hierarchical Pitman–Yor Process Mixture Model

Here, we introduce the hierarchical Pitman–Yor process, which contains a recursive construction in which the base measure for a Pitman–Yor process is itself a draw from a Pitman–Yor process. Specifically, assume that we have obtained grouped data with M groups, where each observation within a group j is distributed according to a Pitman–Yor process G_j. Within the framework of the hierarchical Pitman–Yor process, the indexed set {G_j} of all groups shares a common base distribution G_0, which itself follows a Pitman–Yor process:

$$G_0 \sim \mathrm{PY}(\eta, \gamma, H), \qquad G_j \sim \mathrm{PY}(\omega, \lambda, G_0) \quad \text{for each } j,\ j \in \{1, \ldots, M\} \tag{1}$$

where j is an index for each group of the data. In general, the hierarchical Pitman–Yor process can readily be extended to more than two levels, since the involved hierarchy can be extended to as many levels as required.

We may also represent the hierarchical Pitman–Yor process using the stick-breaking construction [57]. In order to facilitate the online learning process and obtain closed-form solutions, we construct the hierarchical Pitman–Yor process through two stick-breaking constructions, one for G_0 and one for the G_j. Since the global measure G_0 is distributed according to a Pitman–Yor process PY(η, γ, H), we have the following stick-breaking representation:

$$G_0 = \sum_{k=1}^{\infty} \psi_k \delta_{\Omega_k}, \qquad \Omega_k \sim H$$
$$\psi_k = \psi'_k \prod_{s=1}^{k-1} (1 - \psi'_s), \qquad \psi'_k \sim \operatorname{Beta}(1 - \eta, \gamma + k\eta) \tag{2}$$

where Σ_{k=1}^∞ ψ_k = 1, {Ω_k} is a set of independent random variables distributed according to H, and δ_{Ω_k} is an atom at Ω_k. The random variables ψ_k are known as the stick-breaking weights and are obtained by recursively breaking a unit-length stick into an infinite number of pieces, such that the size of each successive piece is proportional to the rest of the stick. Since G_0 is the base distribution of the Pitman–Yor processes G_j, the atoms Ω_k are shared among all G_j, which differ only in their weights.
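The stick-breaking recursion in (2) can be made concrete with a short sketch. The code below (an illustrative, truncated approximation, since the infinite stick cannot be materialized; it relies on the standard library's `random.betavariate`) draws the first K weights of G_0:

```python
import random

def py_stick_weights(K, eta, gamma, seed=1):
    """Truncated draw of the stick-breaking weights in (2):
    psi_k = psi'_k * prod_{s<k} (1 - psi'_s),
    with psi'_k ~ Beta(1 - eta, gamma + k * eta)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for k in range(1, K + 1):
        psi_prime = rng.betavariate(1.0 - eta, gamma + k * eta)
        weights.append(psi_prime * remaining)
        remaining *= 1.0 - psi_prime  # length of the stick still unbroken
    return weights

w = py_stick_weights(K=50, eta=0.3, gamma=1.0)
assert all(x >= 0.0 for x in w) and sum(w) < 1.0  # leftover mass lies on k > K
```

The leftover mass `1 - sum(w)` corresponds to the components beyond the truncation level, which motivates the truncation technique used later in Section IV.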

The stick-breaking representation for the group-level Pitman–Yor process G_j can be written in the following form:

$$G_j = \sum_{t=1}^{\infty} \pi_{jt} \delta_{\theta^{*}_{jt}}, \qquad \theta^{*}_{jt} \sim G_0$$
$$\pi_{jt} = \pi'_{jt} \prod_{s=1}^{t-1} (1 - \pi'_{js}), \qquad \pi'_{jt} \sim \operatorname{Beta}(1 - \omega, \lambda + t\omega) \tag{3}$$

where {π_jt} is a set of stick-breaking weights satisfying Σ_{t=1}^∞ π_jt = 1. Since θ*_jt is distributed according to the base distribution G_0, it takes on the value Ω_k with probability ψ_k. Then, we can introduce a binary latent variable W_jtk ∈ {0, 1} for each θ*_jt as an indicator variable, such that W_jtk = 1 if θ*_jt maps onto the base-level atom Ω_k, which is indexed by k; otherwise, W_jtk = 0. Thus, θ*_jt maps onto ∏_k Ω_k^{W_jtk}.

The indicator variable W⃗ = (W_jt1, W_jt2, ...) is conditionally distributed given ψ⃗ as

$$p(\vec{W} \mid \vec{\psi}) = \prod_{j=1}^{M} \prod_{t=1}^{\infty} \prod_{k=1}^{\infty} \psi_k^{W_{jtk}}. \tag{4}$$

Since ψ⃗ is a function of ψ⃗′ according to the stick-breaking construction shown in (2), the probability of W⃗ can also be written as

$$p(\vec{W} \mid \vec{\psi}') = \prod_{j=1}^{M} \prod_{t=1}^{\infty} \prod_{k=1}^{\infty} \left[ \psi'_k \prod_{s=1}^{k-1} (1 - \psi'_s) \right]^{W_{jtk}}. \tag{5}$$

The prior distribution of ψ⃗′ is a specific Beta distribution according to (2)

$$p(\vec{\psi}') = \prod_{k=1}^{\infty} \frac{\Gamma(1 - \eta_k + \gamma_k + k\eta_k)}{\Gamma(1 - \eta_k)\,\Gamma(\gamma_k + k\eta_k)} \, (1 - \psi'_k)^{\gamma_k + k\eta_k - 1} \, {\psi'_k}^{-\eta_k}. \tag{6}$$

The setting of the hierarchical Pitman–Yor process mixture model can be described as follows: let i index the observations within each group j of a grouped data set. Each variable θ_ji is a factor corresponding to an observation X_ji, and the factors θ⃗_j = (θ_j1, θ_j2, ...) are distributed according to the Pitman–Yor process G_j, one for each j. Then, the likelihood function can be defined as

$$\theta_{ji} \mid G_j \sim G_j, \qquad X_{ji} \mid \theta_{ji} \sim F(\theta_{ji}) \tag{7}$$

where F(θ_ji) represents the distribution of the observation X_ji given θ_ji. It is worth mentioning that, since the Pitman–Yor process is a generalization of the Dirichlet process, the Pitman–Yor process mixture model may also be considered an infinite mixture model, similar to the Dirichlet process mixture model. The prior distribution for the factors θ_ji is given by the base distribution H of G_0. Within this framework, each group j is associated with an infinite mixture model, and the mixture components are shared among these mixture models due to the sharing of the atoms Ω_k among all G_j. Since each factor θ_ji is distributed according to G_j, it takes the value θ*_jt with probability π_jt. Then, for each θ_ji, we introduce a binary latent variable Z_jit ∈ {0, 1} as the indicator variable. Specifically, Z_jit = 1 if θ_ji is associated with component t and maps onto the group-level atom θ*_jt; otherwise, Z_jit = 0. Therefore, we have θ_ji = ∏_t (θ*_jt)^{Z_jit}. Since θ*_jt also maps onto the global-level atom Ω_k, θ_ji therefore maps onto Ω_k^{W_jtk Z_jit}.

The latent variable Z⃗ = (Z_ji1, Z_ji2, ...) is conditionally distributed given π⃗ as

$$p(\vec{Z} \mid \vec{\pi}) = \prod_{j=1}^{M} \prod_{i=1}^{N} \prod_{t=1}^{\infty} \pi_{jt}^{Z_{jit}}. \tag{8}$$

According to the stick-breaking construction in (3), π⃗ is a function of π⃗′. We then have

$$p(\vec{Z} \mid \vec{\pi}') = \prod_{j=1}^{M} \prod_{i=1}^{N} \prod_{t=1}^{\infty} \left[ \pi'_{jt} \prod_{s=1}^{t-1} (1 - \pi'_{js}) \right]^{Z_{jit}}. \tag{9}$$

The prior distribution of π⃗′ is a specific Beta distribution, as shown in (3)

$$p(\vec{\pi}') = \prod_{j=1}^{M} \prod_{t=1}^{\infty} \frac{\Gamma(1 - \omega_{jt} + \lambda_{jt} + t\omega_{jt})}{\Gamma(1 - \omega_{jt})\,\Gamma(\lambda_{jt} + t\omega_{jt})} \, (1 - \pi'_{jt})^{\lambda_{jt} + t\omega_{jt} - 1} \, {\pi'_{jt}}^{-\omega_{jt}}. \tag{10}$$
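As a quick numerical sanity check on the Beta prior appearing in (3) and (10), the sketch below (our own illustrative check, not part of the paper's algorithm) draws π′_jt ~ Beta(1 − ω, λ + tω) with the standard library and compares the empirical mean with the analytic mean a/(a + b) of a Beta(a, b) variable:

```python
import random

def beta_prior_mean_check(omega, lam, t, n=200_000, seed=2):
    """Empirical vs. analytic mean of pi'_jt ~ Beta(1 - omega, lam + t*omega),
    the stick-breaking prior of (3) whose density is written out in (10)."""
    a, b = 1.0 - omega, lam + t * omega
    rng = random.Random(seed)
    empirical = sum(rng.betavariate(a, b) for _ in range(n)) / n
    analytic = a / (a + b)
    return empirical, analytic

emp, ana = beta_prior_mean_check(omega=0.5, lam=1.0, t=3)
assert abs(emp - ana) < 5e-3
```

Note how a larger discount ω pushes the mean of each π′_jt down, so more mass is left for later sticks, which is another view of the power-law behavior.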


B. Hierarchical Pitman–Yor Process Mixture Model of GD Distributions With Feature Selection

In this section, we propose a specific form of the hierarchical Pitman–Yor process mixture model in which each observation within a group is drawn from a mixture of GD distributions. The GD distribution is a generalization of the Dirichlet distribution with a more general covariance structure (which can be positive or negative). The adoption of the GD mixture model is mainly motivated by its promising results in modeling high-dimensional proportional data (i.e., normalized histograms), which are naturally generated by many applications from different domains [18], [19], [23], [58]. Compared with the most widely used Gaussian distribution, the GD distribution has the following merits.

1) The GD distribution may have multiple symmetric and asymmetric modes, whereas the Gaussian distribution contains only symmetric modes.

2) The GD distribution is defined on the compact support [0, 1] and can easily be generalized to a compact support of the form [A, B], where (A, B) ∈ R². Thus, it is a better choice than the Gaussian for modeling compactly supported data, such as images, text, and videos.

3) The GD distribution has a smaller number of parameters than the Gaussian, which makes the estimation and the selection more accurate.

The GD distribution is defined as follows: a D-dimensional random vector Y⃗ = (Y_1, ..., Y_D) is drawn from a GD distribution with positive parameters α⃗ = (α_1, ..., α_D) and β⃗ = (β_1, ..., β_D) if

$$\mathrm{GD}(\vec{Y} \mid \vec{\alpha}, \vec{\beta}) = \prod_{l=1}^{D} \frac{\Gamma(\alpha_l + \beta_l)}{\Gamma(\alpha_l)\,\Gamma(\beta_l)} \, Y_l^{\alpha_l - 1} \left( 1 - \sum_{f=1}^{l} Y_f \right)^{\gamma_l} \tag{11}$$

where Σ_{l=1}^D Y_l < 1 and 0 < Y_l < 1 for l = 1, ..., D, α_l > 0, β_l > 0, γ_l = β_l − α_{l+1} − β_{l+1} for l = 1, ..., D − 1, and γ_D = β_D − 1. Γ(·) is the gamma function.

It is possible to transform the original vector Y⃗ into another D-dimensional vector X⃗ with independent features according to a mathematical property of the GD distribution, as shown in [22]

$$\mathrm{GD}(\vec{X} \mid \vec{\alpha}, \vec{\beta}) = \prod_{l=1}^{D} \operatorname{Beta}(X_l \mid \alpha_l, \beta_l) \tag{12}$$

where X⃗ = (X_1, ..., X_D), X_1 = Y_1 and X_l = Y_l / (1 − Σ_{f=1}^{l−1} Y_f) for l > 1, and Beta(X_l | α_l, β_l) is a Beta distribution with parameters {α_l, β_l}. As a result, the independence between the features now becomes a fact rather than an assumption, as considered in previous unsupervised feature selection Gaussian-mixture-based approaches [59].
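The relationship between (11) and (12) can be checked numerically: the GD density of Y⃗ evaluated directly via (11) must equal the product of Beta densities of the transformed X⃗ multiplied by the Jacobian of the transformation. The sketch below (our own illustrative check; function names are not from the paper) does exactly this comparison in log-space:

```python
import math

def log_beta_pdf(x, a, b):
    """log Beta(x | a, b) via log-gamma for numerical stability."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def log_gd_pdf(y, alpha, beta):
    """log GD(y | alpha, beta) evaluated directly from (11)."""
    D = len(y)
    logp, cum = 0.0, 0.0
    for l in range(D):
        cum += y[l]
        gamma_l = (beta[l] - alpha[l + 1] - beta[l + 1]) if l < D - 1 else beta[l] - 1.0
        logp += (math.lgamma(alpha[l] + beta[l]) - math.lgamma(alpha[l])
                 - math.lgamma(beta[l])
                 + (alpha[l] - 1.0) * math.log(y[l])
                 + gamma_l * math.log(1.0 - cum))
    return logp

def log_gd_pdf_via_betas(y, alpha, beta):
    """Same density through the transformation of (12):
    x_1 = y_1, x_l = y_l / (1 - sum_{f<l} y_f),
    including the log-Jacobian of the change of variables."""
    logp, remaining = 0.0, 1.0
    for l in range(len(y)):
        x_l = y[l] / remaining
        logp += log_beta_pdf(x_l, alpha[l], beta[l]) - math.log(remaining)
        remaining -= y[l]
    return logp

y = [0.2, 0.3, 0.1]
alpha, beta = [2.0, 3.0, 1.5], [4.0, 2.0, 2.5]
assert abs(log_gd_pdf(y, alpha, beta) - log_gd_pdf_via_betas(y, alpha, beta)) < 1e-9
```

The agreement of the two evaluations is what licenses replacing the coupled GD likelihood with independent per-feature Beta terms in the model that follows.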

Now assume that we have a grouped data set X that contains N random vectors and is separated into M groups, where each D-dimensional vector X⃗_ji = (X_ji1, ..., X_jiD) is drawn from a hierarchical Pitman–Yor process mixture model with GD distributions. Then, the likelihood function of the proposed model, given the latent and unknown variables V = {α⃗, β⃗, Z⃗, W⃗}, can be written as

$$p(\mathcal{X} \mid V) = \prod_{j=1}^{M} \prod_{i=1}^{N} \prod_{t=1}^{\infty} \prod_{k=1}^{\infty} \left[ \prod_{l=1}^{D} \operatorname{Beta}(X_{jil} \mid \alpha_{kl}, \beta_{kl}) \right]^{Z_{jit} W_{jtk}} \tag{13}$$

where all features are involved. However, in practice, some of the features may be irrelevant and therefore do not contribute to, or may even degrade, the learning performance. Thus, a feature selection technique is adopted in our work as a tool to choose the best feature subset. We adopt one of the most common unsupervised feature selection techniques, by defining an irrelevant feature as one having a distribution independent of the class labels [59], with the assumption that feature relevance is common across all clusters. This is motivated by the fact that a feature independent of the class labels does not provide any knowledge that can be used to classify an object into one of the existing clusters. Thus, each feature X_jil can be modeled as follows:

$$p(X_{jil}) = \operatorname{Beta}(X_{jil} \mid \alpha_{kl}, \beta_{kl})^{\phi_{jil}} \operatorname{Beta}(X_{jil} \mid \alpha'_l, \beta'_l)^{1 - \phi_{jil}} \tag{14}$$

where φ_jil is a binary latent variable representing the feature relevance indicator, such that φ_jil = 0 denotes that feature l of group j is irrelevant and follows the Beta distribution Beta(X_jil | α′_l, β′_l); otherwise, the feature X_jil is relevant and follows Beta(X_jil | α_kl, β_kl). The prior distribution of φ⃗ is defined as

$$p(\vec{\phi} \mid \vec{\epsilon}) = \prod_{j=1}^{M} \prod_{i=1}^{N} \prod_{l=1}^{D} \epsilon_{l_1}^{\phi_{jil}} \, \epsilon_{l_2}^{1 - \phi_{jil}} \tag{15}$$

where each φ_jil is a Bernoulli variable such that p(φ_jil = 1) = ε_{l_1} and p(φ_jil = 0) = ε_{l_2}. The vector ε⃗ = (ε⃗_1, ..., ε⃗_D) represents the feature saliencies (i.e., the probabilities that the features are relevant), where ε⃗_l = (ε_{l_1}, ε_{l_2}) and ε_{l_1} + ε_{l_2} = 1. Furthermore, a Beta distribution is chosen as the prior distribution for ε⃗

$$p(\vec{\epsilon}) = \prod_{l=1}^{D} \operatorname{Beta}(\vec{\epsilon}_l \mid \vec{\xi}) = \prod_{l=1}^{D} \frac{\Gamma(\xi_1 + \xi_2)}{\Gamma(\xi_1)\,\Gamma(\xi_2)} \, \epsilon_{l_1}^{\xi_1 - 1} \, \epsilon_{l_2}^{\xi_2 - 1}. \tag{16}$$

By incorporating the unsupervised feature selection scheme into our model, we then have the following likelihood function:

$$p(\mathcal{X} \mid \vec{Z}, \vec{W}, \vec{\theta}, \vec{\phi}) = \prod_{j=1}^{M} \prod_{i=1}^{N} \prod_{t=1}^{\infty} \prod_{k=1}^{\infty} \left[ \prod_{l=1}^{D} \operatorname{Beta}(X_{jil} \mid \alpha_{kl}, \beta_{kl})^{\phi_{jil}} \operatorname{Beta}(X_{jil} \mid \alpha'_l, \beta'_l)^{1 - \phi_{jil}} \right]^{Z_{jit} W_{jtk}} \tag{17}$$

where θ⃗ = {α⃗, β⃗, α⃗′, β⃗′}.

We also need to place prior distributions over the parameters α⃗, β⃗, α⃗′, and β⃗′ of the Beta distributions. Although the Beta distribution belongs to the exponential family and has a formal conjugate prior, that prior is analytically intractable. Thus, the Gamma distribution G(·) is adopted here to approximate the conjugate prior, with the assumption that these Beta parameters are statistically independent

$$p(\vec{\alpha}) = \mathcal{G}(\vec{\alpha} \mid \vec{u}, \vec{v}), \qquad p(\vec{\beta}) = \mathcal{G}(\vec{\beta} \mid \vec{g}, \vec{h}) \tag{18}$$
$$p(\vec{\alpha}') = \mathcal{G}(\vec{\alpha}' \mid \vec{u}', \vec{v}'), \qquad p(\vec{\beta}') = \mathcal{G}(\vec{\beta}' \mid \vec{g}', \vec{h}'). \tag{19}$$

TABLE I: SYMBOLS INVOLVED IN THE PROPOSED MODEL

The symbols involved in the proposed model are summarized in Table I.

IV. ONLINE VARIATIONAL MODEL LEARNING

In this section, an online variational inference approach is developed for learning the proposed hierarchical Pitman–Yor process mixture model with feature selection, using the online learning framework proposed in [60]. In contrast with batch learning algorithms, online algorithms are more efficient when dealing with large-scale or streaming data. In order to simplify notation, we define Ψ = (Θ, Λ) as the set of latent and unknown random variables, where Θ = {Z⃗, φ⃗} and Λ = {W⃗, ε⃗, ψ⃗′, π⃗′, α⃗, β⃗, α⃗′, β⃗′}.

A. Online Variational Learning Framework

In variational inference, the goal is to acquire an approximation q(Ψ) to the true posterior distribution p(Ψ | X) by maximizing the lower bound of ln p(X), which is

$$\mathcal{L}(q) = \int q(\Psi) \ln \left[ \frac{p(\mathcal{X}, \Psi)}{q(\Psi)} \right] d\Psi. \tag{20}$$

We can extend the conventional variational inference algorithm to an online version by successively maximizing the current variational lower bound for the observed data, which is given by

$$\mathcal{L}^{(r)}(q) = \frac{N}{r} \sum_{i=1}^{r} \int q(\Lambda) \sum_{\Theta_i} q(\Theta_i) \ln \left[ \frac{p(\vec{X}_i, \Theta_i \mid \Lambda)}{q(\Theta_i)} \right] d\Lambda + \int q(\Lambda) \ln \left[ \frac{p(\Lambda)}{q(\Lambda)} \right] d\Lambda \tag{21}$$
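The N/r factor in (21) rescales the sufficient statistics gathered from the r observations seen so far to the size of a full data set of N observations. A minimal sketch of this idea on a toy conjugate model (a single Beta posterior over a Bernoulli parameter standing in for one of the model's global factors; the function name and the generator design are our own assumptions, not the paper's algorithm):

```python
def online_scaled_posterior(stream, N, prior_a=1.0, prior_b=1.0):
    """After each of the r observations seen so far, form the Beta
    pseudo-posterior whose sufficient statistics are scaled by N / r,
    mimicking the N/r factor in (21)."""
    ones = total = 0
    for r, x in enumerate(stream, start=1):
        ones += x
        total += 1
        scale = N / r  # pretend the r points seen stand for N points
        a = prior_a + scale * ones
        b = prior_b + scale * (total - ones)
        yield a, b

posteriors = list(online_scaled_posterior([1, 0, 1, 1], N=100))
a, b = posteriors[-1]
# Scaled counts always amount to a data set of size N plus the prior mass.
assert abs((a + b) - (2.0 + 100.0)) < 1e-9
```

Early in the stream the scaled statistics are noisy estimates of the full-data statistics, and they sharpen as r grows, which is the mechanism that lets the global variational factors be updated after every single observation.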

where r denotes the amount of observed data that we currently have. Moreover, we apply the truncation technique as in [61] to truncate the variational approximations of G_0 and G_j at levels K and T, such that

$$\psi'_K = 1, \qquad \sum_{k=1}^{K} \psi_k = 1, \qquad \psi_k = 0 \ \text{when } k > K \tag{22}$$
$$\pi'_{jT} = 1, \qquad \sum_{t=1}^{T} \pi_{jt} = 1, \qquad \pi_{jt} = 0 \ \text{when } t > T \tag{23}$$

where the truncation levels K and T will be inferred automatically during the learning process. We also use a factorial approximation [53] to factorize the approximated posterior distribution q(\Omega) into disjoint factors as

q(\Omega) = q(\vec{Z})\,q(\vec{W})\,q(\vec{\phi})\,q(\vec{\pi}')\,q(\vec{\psi}')\,q(\vec{\alpha})\,q(\vec{\beta})\,q(\vec{\alpha}')\,q(\vec{\beta}')\,q(\vec{\varepsilon}).    (24)
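As an illustration, the truncated stick-breaking construction of (22) can be sketched in a few lines of Python. This is a minimal sketch: the Beta(a, b) stick proportions and the seed are illustrative placeholders, not quantities estimated by the model.

```python
import random

def truncated_stick_breaking(K, a=1.0, b=1.0, seed=0):
    """Sample truncated stick-breaking weights: draw psi'_1..psi'_{K-1}
    from a Beta distribution, clamp psi'_K = 1 (Eq. 22), and convert the
    stick fractions into mixing weights psi_k that sum exactly to one."""
    rng = random.Random(seed)
    psi_prime = [rng.betavariate(a, b) for _ in range(K - 1)] + [1.0]
    weights, remaining = [], 1.0
    for v in psi_prime:
        weights.append(v * remaining)  # psi_k = psi'_k * prod_{s<k}(1 - psi'_s)
        remaining *= 1.0 - v
    return weights

w = truncated_stick_breaking(100)
```

Clamping the last stick fraction to 1 is what makes the truncation exact: the remaining stick mass is absorbed by component K, so the K weights form a proper distribution.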

Assume that we have already observed a data set \{X_1, \ldots, X_{r-1}\}. After obtaining a new observation X_r, we can update the hyperparameters of each variational factor by successively maximizing the current lower bound \mathcal{L}^{(r)}(q) as

q(\vec{\phi}_r) = \prod_{j=1}^{M} \prod_{l=1}^{D} \varphi_{jrl}^{\phi_{jrl}} (1 - \varphi_{jrl})^{1-\phi_{jrl}}    (25)

q(\vec{Z}_r) = \prod_{j=1}^{M} \prod_{t=1}^{T} \rho_{jrt}^{Z_{jrt}}    (26)

q^{(r)}(\vec{W}) = \prod_{j=1}^{M} \prod_{t=1}^{T} \prod_{k=1}^{K} \big(\vartheta_{jtk}^{(r)}\big)^{W_{jtk}^{(r)}}    (27)

q^{(r)}(\vec{\varepsilon}) = \prod_{l=1}^{D} \mathrm{Beta}\big(\varepsilon_l^{(r)} \mid \vec{\xi}^{*(r)}\big)    (28)

q^{(r)}(\vec{\pi}') = \prod_{j=1}^{M} \prod_{t=1}^{T} \mathrm{Beta}\big(\pi_{jt}'^{(r)} \mid a_{jt}^{(r)}, b_{jt}^{(r)}\big)    (29)

q^{(r)}(\vec{\psi}') = \prod_{k=1}^{K} \mathrm{Beta}\big(\psi_k'^{(r)} \mid c_k^{(r)}, d_k^{(r)}\big)    (30)

q^{(r)}(\vec{\alpha}) = \prod_{k=1}^{K} \prod_{l=1}^{D} \mathcal{G}\big(\alpha_{kl}^{(r)} \mid u_{kl}^{(r)}, v_{kl}^{(r)}\big)    (31)

q^{(r)}(\vec{\beta}) = \prod_{k=1}^{K} \prod_{l=1}^{D} \mathcal{G}\big(\beta_{kl}^{(r)} \mid g_{kl}^{(r)}, h_{kl}^{(r)}\big)    (32)

q^{(r)}(\vec{\alpha}') = \prod_{l=1}^{D} \mathcal{G}\big(\alpha_l'^{(r)} \mid u_l'^{(r)}, v_l'^{(r)}\big)    (33)

q^{(r)}(\vec{\beta}') = \prod_{l=1}^{D} \mathcal{G}\big(\beta_l'^{(r)} \mid g_l'^{(r)}, h_l'^{(r)}\big).    (34)

The detailed learning process is presented in the Appendix. The online variational inference procedure is repeated until all the variational factors are updated with respect to the newly arrived observation. The complete algorithm is summarized in Algorithm 1.


FAN et al.: ONLINE LEARNING OF HIERARCHICAL PITMAN–YOR PROCESS MIXTURE OF GD DISTRIBUTIONS WITH FEATURE SELECTION 2053

Algorithm 1 Learning Algorithm
1: Choose the initial truncation levels K and T.
2: Initialize the values for the hyperparameters \omega_{jt}, \lambda_{jt}, \eta_k, \gamma_k, u_{kl}, v_{kl}, g_{kl}, h_{kl}, u_l', v_l', g_l', h_l', \xi_1 and \xi_2.
3: for r = 1 \to N do
4:   The variational E-step:
5:   Sequentially update the variational solutions to q(\vec{\phi}_r) and q(\vec{Z}_r).
6:   The variational M-step:
7:   Update the variational factors q^{(r)}(\vec{W}) and q^{(r)}(\vec{\varepsilon}).
8:   Update the variational factors q^{(r)}(\vec{\pi}'), q^{(r)}(\vec{\psi}'), q^{(r)}(\vec{\alpha}), q^{(r)}(\vec{\beta}), q^{(r)}(\vec{\alpha}') and q^{(r)}(\vec{\beta}').
9:   Repeat the variational E-step and M-step until new data is observed.
10: end for
11: Compute the expected values \langle\psi_k'\rangle = c_k/(c_k + d_k) and \langle\pi_{jt}'\rangle = a_{jt}/(a_{jt} + b_{jt}), and substitute the results into Eq. (2) and Eq. (3), respectively, to obtain the estimated values of the mixing coefficients \psi_k and \pi_{jt}.
12: Detect the optimal values of K and T by eliminating the components with mixing coefficients \psi_k and \pi_{jt} that are close to 0.
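Steps 11 and 12 of Algorithm 1 can be sketched as follows. The hyperparameter values below are made-up illustrations, and the pruning threshold `eps` is an assumed value, not one prescribed by the paper.

```python
def expected_mixing_weights(c, d):
    """Step 11: <psi'_k> = c_k / (c_k + d_k); the stick-breaking
    construction then turns the expected fractions into mixing
    weights psi_k."""
    weights, remaining = [], 1.0
    for ck, dk in zip(c, d):
        v = ck / (ck + dk)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def prune_components(weights, eps=1e-3):
    """Step 12: keep only components whose mixing weight is not close
    to zero; the number kept is the detected truncation level."""
    return [k for k, w in enumerate(weights) if w > eps]

# Two dominant components followed by near-empty ones.
c = [90.0, 45.0, 0.01, 0.01, 0.01]
d = [10.0, 5.0, 10.0, 10.0, 10.0]
kept = prune_components(expected_mixing_weights(c, d))
```

With these illustrative hyperparameters, only the first two components survive the pruning, i.e., the detected global truncation level would be 2.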

TABLE II

FOUR REAL-WORLD DATA SETS. N, D, AND M DENOTE THE NUMBERS OF INSTANCES, FEATURES, AND CLASSES, RESPECTIVELY

V. EXPERIMENTAL RESULTS

We evaluate the effectiveness of the proposed online hierarchical Pitman–Yor process mixture of GD distributions with feature selection (referred to as OnHGD-Fs) using real-world data sets and two challenging applications concerning scene recognition and video segmentation. In our experiments, the global truncation level K and the group truncation level T are initialized to 100 and 50, respectively. The hyperparameters \xi_1 and \xi_2 of the feature saliency part are both initialized to 0.5. The parameters \varsigma and \tau_0 of the learning rate are set to 0.80 and 64, respectively. The hyperparameters of the stick-breaking weights are initialized as (\omega_{jt}, \lambda_{jt}, \eta_k, \gamma_k) = (0.05, 0.25, 0.05, 0.25). The other involved hyperparameters are initialized as follows: (u_{kl}, v_{kl}, g_{kl}, h_{kl}, u_l', v_l', g_l', h_l') = (0.5, 0.01, 0.5, 0.01, 0.5, 0.01, 0.5, 0.01).

A. Real-World Data Sets

In this section, we validate the proposed OnHGD-Fs using four real-world data sets with different properties, as shown in Table II. The Spambase (SP) data set contains a collection of spam and nonspam e-mails. The aim is to determine if an e-mail is spam or legitimate. It contains 4601 57-D vectors divided into two classes. The Statlog (ST) data set consists of the multispectral values of pixels in 3 × 3 neighborhoods in

TABLE III

AVERAGE ERROR RATES (%) USING DIFFERENT ALGORITHMS OVER 30 RANDOM RUNS

a satellite image, and the classification associated with the central pixel in each neighborhood. It contains 6435 36-D vectors from six classes: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil. The Handwritten Digits (HD) data set contains 5620 vectors with 64 features from ten classes: zero to nine. The Letter Recognition (LR) data set has 20 000 instances with 16 primitive numerical features (e.g., statistical moments and edge counts). The objective is to identify each character image as one of the 26 capital letters of the English alphabet. The aforementioned real-world data sets are taken from the UCI machine learning repository.2 It is noteworthy that these data sets were originally collected for supervised classification; however, the class labels are not involved in our experiment, except for evaluation of the clustering results. Since the features of these four data sets lie within some specific range, we performed normalization as a preprocessing step to transform all features into the range [0, 1]. Each data set was then randomly divided into two halves: one for training our model and the other for testing. In order to test our online learning algorithm, the testing data are observed sequentially in an online manner. We evaluated the performance of the proposed algorithm by running it 30 times.
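The preprocessing described above (min-max normalization into [0, 1] followed by a random half split) can be sketched as below; this is a generic illustration, not the authors' exact code.

```python
import random

def minmax_normalize(data):
    """Scale every feature column into [0, 1]; constant columns map to 0."""
    dims = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dims)]
    hi = [max(row[d] for row in data) for d in range(dims)]
    return [[(row[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in range(dims)] for row in data]

def half_split(data, seed=0):
    """Randomly divide the data set into training and testing halves."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    mid = len(idx) // 2
    return [data[i] for i in idx[:mid]], [data[i] for i in idx[mid:]]
```

The split is done on shuffled indices so that applying it to features and labels with the same seed keeps them aligned for the later evaluation step.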

For comparison, we also applied the hierarchical Pitman–Yor process mixture of GD distributions without feature selection (i.e., using all features) as well as the online hierarchical Dirichlet process mixture model with Gaussian distributions as introduced in [56]. The results are summarized in Table III. The numbers in parentheses are the standard deviations of the corresponding quantities. According to the results shown in Table III, the improvement achieved by the proposed algorithm is obvious: it decreases the error rate compared with the algorithm in [56] and with using all the features, for all the data sets. These results demonstrate the merits of using a feature selection technique as well as the advantages of using the hierarchical Pitman–Yor process mixture model with GD distributions.

B. Scene Recognition

Scene recognition is a crucial problem in computer vision and has been the topic of extensive research in the past (see [3], [62], [63]). Based on [64], a scene represents a place where a human can act within or navigate. The goal of scene recognition is to classify a set of natural images into a number of semantic categories. According to [46] and [65], the distribution over the size of natural segments and the

2http://www.ics.uci.edu/∼mlearn/MLRepository.html



frequencies at which objects appear in a scene image follow a power-law distribution. Thus, the hierarchical Pitman–Yor process is an appropriate choice in this case. The majority of recent scene recognition approaches have been based on the description of scene categories using local features via the bag-of-visual-words approach. Normally, this approach is based on three steps. The first one is devoted to the extraction of local descriptors. The second one applies a vector quantization technique to obtain a histogram of local features. Finally, a model is used as a classifier for recognition purposes.

1) Methodology: Our approach for recognizing scene categories is based on the proposed OnHGD-Fs model and the bag-of-visual-words representation. Our methodology can be summarized as follows. First, 128-D scale-invariant feature transform (SIFT) descriptors [66] are extracted from each scene image using the difference-of-Gaussians interest point detector and then normalized. Then, the resulting vectors, after applying the geometric transformation presented in (12) in Section III-B, are modeled using the proposed OnHGD-Fs. In our case, each image I_j is treated as a group and is therefore associated with a Pitman–Yor process mixture (infinite mixture) model G_j. Thus, each extracted SIFT feature vector X_ji of the image I_j is supposed to be drawn from G_j, where the mixture components of G_j can be considered as visual words. Next, a global vocabulary is constructed and is shared among all groups (images) through the common global Pitman–Yor process mixture G_0 of our hierarchical model. It is worth mentioning that most previously proposed bag-of-visual-words approaches have to apply a separate vector quantization technique (such as K-means) to build the visual vocabulary, where the size of the vocabulary is normally selected manually. By contrast, the construction of the visual vocabulary in our approach is part of the framework of our hierarchical Pitman–Yor process mixture model, and therefore, the size of the vocabulary (i.e., the number of mixture components in the global-level mixture model) can be automatically inferred from the data thanks to the properties of nonparametric Bayesian models. Then, we employ the paradigm of bag-of-visual-words and compute a histogram of visual words from each image. Because our target is to determine which scene category a testing image I_j belongs to, an indicator variable B_jm is then associated with each image (or group) in our hierarchical Pitman–Yor process mixture framework.
B_jm represents that image I_j is from scene category m and is drawn from another Pitman–Yor process mixture model, which is truncated at level J. As a result, it is required to add a new level of hierarchy to our hierarchical Pitman–Yor process mixture model with a shared vocabulary among all scene classes. In this experiment, we truncate J to 50 and initialize the hyperparameter of the mixing probability of B_jm to 0.05. Finally, we assign a testing image to the scene class that has the highest posterior probability.
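The histogramming step above can be sketched as follows. The responsibilities below are stand-ins for the quantities the model would actually infer, and assigning each descriptor to its highest-responsibility component is one simple hard-assignment reading of the visual-word mapping, not necessarily the authors' exact rule.

```python
def visual_word_histogram(responsibilities, K):
    """Build a normalized histogram of visual words for one image:
    each descriptor is assigned to the global component (visual word)
    with the highest responsibility."""
    counts = [0] * K
    for resp in responsibilities:  # one row of K responsibilities per SIFT descriptor
        word = max(range(K), key=lambda k: resp[k])
        counts[word] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Four hypothetical descriptors over a 3-word global vocabulary.
resp = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
hist = visual_word_histogram(resp, 3)
```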

2) Data Sets and Results: The effectiveness of the proposed approach for recognizing scene categories was tested on a challenging large-scale scene categorization benchmark, namely, the SUN (Scene UNderstanding) database.3

3The database is available at http://vision.princeton.edu/projects/2010/SUN/.

Fig. 1. Sample scene images from the SUN database.

Fig. 2. Confusion matrix obtained by OnHGD-Fs for the SUN-16 database.

This database contains 899 scenes with 130,519 images in total. Images within each category were obtained using WordNet terminology from various search engines on the Internet [64]. Sample images from the SUN database are shown in Fig. 1.

In order to evaluate the performance of our approach, we first conducted our experiments on a subset of the SUN database (referred to as the SUN-16 database), which contains 16 categories (e.g., Abbey, Aqueduct, Lighthouse, and Beach) with 8000 images in total. We randomly divided this database into two partitions: one for training (to learn the model and build the visual vocabulary) and the other for testing. The entire methodology was repeated 30 times to evaluate the performance of our approach. The performance of our approach for scene recognition was measured in terms of both the confusion matrix and the overall recognition accuracy. Fig. 2 presents the confusion matrix calculated by the proposed OnHGD-Fs using the SUN-16 database. Each entry (i, j) of the confusion matrix denotes the ratio of images in category i that are assigned to category j. According to Fig. 2, the overall average recognition accuracy was 73.13% (error rate of 26.87%) for the tested database.
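The two reported measures (the confusion matrix of per-category assignment ratios and the overall accuracy) can be computed as in this generic snippet:

```python
def confusion_matrix(true_labels, pred_labels, n_classes):
    """Entry (i, j) is the ratio of images of category i assigned to j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row]
            for row in counts]

def overall_accuracy(true_labels, pred_labels):
    """Fraction of images assigned to their true category."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)
```

Note that each row of the confusion matrix is normalized by its own category count, so the diagonal entries are per-category recognition rates rather than joint frequencies.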

The corresponding feature saliencies of the 128 features obtained by the proposed OnHGD-Fs are shown in Fig. 3. Based on these results, it is obvious that different features were assigned different relevance degrees. Specifically, there were 23 features with high relevance degrees, whose feature saliencies were greater than 0.9. By contrast, there were 16 features with saliencies lower than 0.5, which therefore contributed less to the recognition process.

We compared our approach (OnHGD-Fs) with several other related mixture modeling approaches for recognizing scene categories. The goals of the comparison are to compare the online learning approach with the batch one, to compare the approach that includes a feature selection scheme with the



Fig. 3. Feature saliencies of different features calculated by OnHGD-Fs for the SUN-16 database.

TABLE IV

AVERAGE SCENE RECOGNITION RATE (%) USING DIFFERENT APPROACHES FOR THE SUN-16 AND SUN-397 DATABASES

one without using it, and to compare GD mixture models with Gaussian mixtures on proportional data. Thus, the proposed OnHGD-Fs model was compared with the batch hierarchical Pitman–Yor process mixture of GD distributions with feature selection (HGD-Fs), the online hierarchical Pitman–Yor process mixture of GD distributions without feature selection (OnHGD), and the online hierarchical Pitman–Yor process mixture of Gaussian distributions with feature selection (OnHGau-Fs). All these models were learned using variational inference. The testing data in our experiments are supposed to arrive sequentially in an online manner for the tested online learning approaches. Furthermore, we also compared our approach with two state-of-the-art approaches: the ICD model proposed in [47] and the TNRM presented in [48]. Note that for ICD and TNRM, MCMC techniques were used for model learning. We adopted the same hyperparameter settings as in [47] and [48] for ICD and TNRM, respectively. Table IV illustrates the average recognition results of our OnHGD-Fs model and the other tested models for scene recognition using the SUN-16 database. Based on Table IV, our approach OnHGD-Fs outperformed the two state-of-the-art approaches, which demonstrates the merits of using variational inference as well as the hierarchical Pitman–Yor process framework together with the GD distribution and feature selection. Moreover, the proposed online approach (OnHGD-Fs) and its batch counterpart (HGD-Fs) obtained the highest recognition rates among all tested approaches. It is noteworthy that a Student's t-test with 95% confidence shows that the difference in performance between OnHGD-Fs and HGD-Fs is not statistically significant (i.e., for different runs, we obtained p-values between 0.163 and 0.285). Thus, OnHGD-Fs is preferred in this case since it is significantly faster than HGD-Fs thanks to its online learning property.

This is due to the fact that the batch algorithm updates the variational factors using the whole data set in each iteration, and therefore, the quality of its estimation improves more slowly than in the case of the online one. We may also observe from Table IV that OnHGD-Fs outperformed both OnHGD and OnHGau-Fs. This demonstrates the advantage of incorporating a feature selection scheme into our framework and also verifies that the GD mixture model has a better modeling capability for proportional data than the Gaussian mixture.

Next, we tested our scene recognition approach using 397 well-sampled categories from the SUN database (referred to as the SUN-397 database) with 108,754 images in total. We compared its performance with the ones obtained using comparable approaches and report the corresponding results in Table IV. It is clear that the two highest recognition rates were obtained by the proposed OnHGD-Fs and its batch version, which once again demonstrates the merits of using the framework of the hierarchical Pitman–Yor process mixture model with feature selection. Furthermore, as we increase the size of our data set, the computational gain achieved using online learning approaches (e.g., OnHGD-Fs, OnHGD, and OnHGau-Fs) becomes more and more significant. This confirms the advantage of using online learning methods to deal with streaming data sets.

Indeed, the recognition accuracy obtained by our approach could be further improved if different state-of-the-art image feature representations (rather than SIFT, which is used in this paper) were adopted, such as OverFeat [67], a state-of-the-art convolutional neural network (CNN) for feature learning. However, the evaluation of different features is not our major concern and is beyond the scope of this paper. It is also noteworthy that more recent state-of-the-art approaches based on deep learning techniques may achieve better recognition accuracy than the one obtained using our approach. For instance, [68] reported a recognition performance of 40.94% on the SUN-397 database with a deep convolutional network algorithm, DeCAF, which is trained on the ImageNet database. In addition, [69] obtained a recognition accuracy of 42.61% on the SUN-397 database based on a deep learning approach that exploits the CNN and that is trained on ImageNet using the Caffe package.4 In our case, however, the proposed approach is based on the bag-of-visual-words model, which is quite different from deep features (such as CNNs), and is thus not directly comparable with deep learning techniques. In addition, deep learning approaches often require a huge amount of computational resources to train multiple layers of networks and are therefore normally implemented on GPUs. By contrast, our proposed framework is an online learning approach, which has the advantage of handling sequentially arriving data, and is normally run on CPUs.

C. Video Segmentation

1) Problem Statement: Videos form a major portion of the information disseminated today [70]. Scene analysis on these videos has been the topic of extensive research in the

4Caffe: an open source convolutional architecture for fast feature embedding. (http://caffe.berkeleyvision.org/.)



past [71]–[74]. Video segmentation, the process of partitioning video sequences into spatiotemporal segments, is an important topic in video analysis [75]. In our work, we adopt the idea of frame saliency [76] in which only a small set of frames with high relevancy are used to train our model. In contrast to the conventional video segmentation approach where all video frames are used for model training, using only the most relevant frames may significantly reduce redundancy and improve the quality of video modeling. In [76], video segmentation is performed based on simultaneous Gaussian mixture model training and frame saliency estimation. This approach has demonstrated its efficiency for spatiotemporal video modeling and segmentation; however, it has several drawbacks. First, the Gaussian assumption (i.e., assuming that each per-class density is Gaussian) may not be the ideal choice since most of the data in real applications may possess a non-Gaussian structure. Second, the number of mixture components for the Gaussian mixture model in [76] is estimated using the minimum description length (MDL) criterion. Unfortunately, this estimation requires a high computational cost since it has to evaluate the MDL criterion for several numbers of mixture components. Third, in [76], although each frame is associated with a finite Gaussian mixture model, these mixture models are not statistically linked. In practice, it is desirable to share common mixture components among frames since, most of the time, the contents of consecutive frames in a video shot do not differ dramatically. The aforementioned shortcomings can be addressed elegantly using the proposed OnHGD-Fs. First, GD mixture models may provide better modeling capabilities than Gaussian mixtures for normalized count data, as we mentioned earlier.
Furthermore, the use of a hierarchical Bayesian nonparametric model in our approach avoids the problem of model selection by assuming that there exists an infinite number of mixture components in a video frame, and it allows the mixture models that are associated with video frames to remain linked by sharing mixture components among all frames.

2) Video Segmentation Methodology: The first step in our approach for video segmentation is to extract feature vectors for each pixel. In our case, we represented each pixel with a 3-D color descriptor in the L*a*b* color space, with dimension L for lightness and a and b for the color-opponent dimensions. We also included spatial information [i.e., the (x, y) position of the pixel] and the time feature in the feature vector. The time descriptor is taken as an incremental counter (1, 2, …) over the frames in a video shot. We may also include other features (such as motion) at an increased computational cost. Each of the features is normalized to have a value between 0 and 1. Next, these features were modeled using the proposed OnHGD-Fs. Specifically, each frame F_j is considered as a group and is therefore associated with a Pitman–Yor process mixture model G_j. Then, for the ith pixel of the jth frame \vec{X}_{ji}, we have the density function as

p(\vec{X}_{ji}) = \sum_{k=1}^{\infty} \psi_k \big[\varepsilon_j \, \mathrm{Beta}(\vec{X}_{ji} \mid \vec{\alpha}_k, \vec{\beta}_k) + (1 - \varepsilon_j) \, \mathrm{Beta}(\vec{X}_{ji} \mid \vec{\alpha}', \vec{\beta}')\big]    (35)
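The per-pixel feature extraction described above can be sketched as follows. The L* ∈ [0, 100] and a*, b* ∈ [−128, 127] scaling ranges are common conventions assumed here (the paper does not state them), and the frame counter is called `t` for illustration.

```python
def pixel_feature(L, a, b, x, y, t, width, height, n_frames):
    """Build the 6-D descriptor of one pixel: color (L*, a*, b*),
    position (x, y), and frame index t, each normalized into [0, 1].
    Assumed ranges: L* in [0, 100], a* and b* in [-128, 127]."""
    return [
        L / 100.0,
        (a + 128.0) / 255.0,
        (b + 128.0) / 255.0,
        x / (width - 1),
        y / (height - 1),
        (t - 1) / max(n_frames - 1, 1),  # t counts frames starting at 1
    ]

# One hypothetical mid-gray pixel in the first frame of video 1 (450 x 350, 75 frames).
f = pixel_feature(L=50.0, a=0.0, b=0.0, x=224, y=174,
                  t=1, width=450, height=350, n_frames=75)
```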

Fig. 4. Sample frames from each video sequence. First row: video 1. Second row: video 2. Third row: video 3.

TABLE V

AVERAGE NUMERICAL EVALUATION OF SEGMENTATION PERFORMANCE FOR VIDEO 1

where the frame saliency \varepsilon_j = p(\phi_j = 1) of frame j represents the probability that this frame is highly relevant. By including frame saliency in the proposed hierarchical Bayesian nonparametric model, we obtain a video segmentation approach where each mixture component in a frame represents a segment and the mixture components are shared among all frames. The estimation of the parameters of the model is performed by the online variational learning framework proposed in Section IV.

3) Results: We evaluated the effectiveness of the proposed video segmentation approach using three video sequences provided in [77]. In our case, video 1 contains 75 frames, video 2 has 528 frames, and video 3 includes 123 frames in total. Each video has a resolution of 450 × 350. Sample frames from each video can be viewed in Fig. 4. In order to demonstrate the merits of the proposed OnHGD-Fs approach, we compared its performance with two other video segmentation approaches: the one based on OnHGau-Fs and the one proposed in [76], which is based on the finite Gaussian mixture model and the MDL criterion.

We adopted the same objective criterion as used in [76] to evaluate the segmentation performance. This criterion is assessed according to the following three aspects.

1) Spatial Uniformity: This metric measures the color homogeneity of video segments. It includes the texture (color) variance of segments (text_var) [78] and the spatial color contrast along segment boundaries (color_con) [79].



TABLE VI

AVERAGE NUMERICAL EVALUATION OF SEGMENTATION PERFORMANCE FOR VIDEO 2

TABLE VII

AVERAGE NUMERICAL EVALUATION OF SEGMENTATION PERFORMANCE FOR VIDEO 3

Fig. 5. Examples of video segmentation results obtained by different algorithms.

2) Temporal Stability: This metric measures the color and spatial homogeneity of the segments at consecutive time instants. In this experiment, it was measured by the interframe difference of segment size and elongation (size_diff and elong_diff) [78], and the X² metric [79].

3) Motion Uniformity: This metric evaluates the segments' motion smoothness. It includes the summation of the motion vector variance in the x and y directions (motion_var) [78].

The average segmentation results are shown in Tables V–VII for the tested video sequences. According to Tables V–VII, it is clear that the proposed OnHGD-Fs outperformed the other two approaches in terms of smaller text_var, X², and motion_var, and larger color_con, for all tested videos. This demonstrates the advantages of using the GD mixture model

over the Gaussian mixture model as well as the hierarchical Pitman–Yor process mixture model over conventional mixture modeling. Fig. 5 shows examples of video segmentation results for each video sequence. From Fig. 5, we can observe that the proposed approach obtained better object segmentation quality than the other two approaches.

VI. CONCLUSION

We developed a new statistical framework that can perform simultaneous clustering and feature selection based on the hierarchical Pitman–Yor process and the GD distribution. Instead of fixing the number of clusters, we use a hierarchical Pitman–Yor process to infer the number of groups. As a nonparametric Bayesian approach, our model adequately describes complex realistic data sets by avoiding restricted functional forms and allowing the complexity, and hence the accuracy, of the learned model to increase as new data are observed. Despite its appealing simplicity, learning this model becomes computationally expensive in high-dimensional input spaces. In order to overcome this problem, we used an important mathematical property of the GD distribution that allows accurate learning. The parameters of the framework are estimated using an efficient online variational method, which does not need to explicitly store training data and which necessitates the consideration of appropriate prior distributions to efficiently control the model's complexity. To illustrate the wider applicability of the approach, we validated it using two different challenging tasks, namely, video segmentation, where the main goal is to statistically capture the spatiotemporal semantic contents of a given scene, and scene recognition. According to the results, our model offers the advantage of having strong statistical foundations that are well suited to different image processing and computer vision applications.

APPENDIX

ONLINE VARIATIONAL MODEL LEARNING

A. Variational Inference of q(\vec{\phi}_r)

Assume that we have already observed a data set \{X_1, \ldots, X_{r-1}\}. After obtaining a new observation X_r, the current lower bound \mathcal{L}^{(r)}(q) can be maximized with respect to q(\vec{\phi}_r), while the other variational factors remain fixed at their (r-1)th values. Therefore, the variational solution to q(\vec{\phi}_r) can be calculated as

q(\vec{\phi}_r) = \prod_{j=1}^{M} \prod_{l=1}^{D} \varphi_{jrl}^{\phi_{jrl}} (1 - \varphi_{jrl})^{1-\phi_{jrl}}    (36)

where we have

\varphi_{jrl} = \frac{\exp(\tilde{\varphi}_{jrl})}{\exp(\tilde{\varphi}_{jrl}) + \exp(\hat{\varphi}_{jrl})}    (37)

\tilde{\varphi}_{jrl} = \sum_{t=1}^{T} \sum_{k=1}^{K} \langle Z_{j(r-1)t}\rangle \langle W_{jtk}^{(r-1)}\rangle \big[R_{kl}^{(r-1)} + (\bar{\alpha}_{kl}^{(r-1)} - 1) \ln X_{jrl} + (\bar{\beta}_{kl}^{(r-1)} - 1) \ln(1 - X_{jrl})\big] + \langle\ln \varepsilon_{l1}^{(r-1)}\rangle    (38)

\hat{\varphi}_{jrl} = R_l'^{(r-1)} + \langle\ln \varepsilon_{l2}^{(r-1)}\rangle + (\bar{\alpha}_l'^{(r-1)} - 1) \ln X_{jrl} + (\bar{\beta}_l'^{(r-1)} - 1) \ln(1 - X_{jrl})    (39)



where R and R' are the lower bounds of \langle\ln[\Gamma(\alpha+\beta)/(\Gamma(\alpha)\Gamma(\beta))]\rangle and \langle\ln[\Gamma(\alpha'+\beta')/(\Gamma(\alpha')\Gamma(\beta'))]\rangle, respectively. Since these expectations are intractable, we use the second-order Taylor series expansion to calculate their lower bounds. The expected values in (38) and (39) are defined as

\bar{\alpha}_{kl} = \frac{u_{kl}}{v_{kl}}, \quad \bar{\beta}_{kl} = \frac{g_{kl}}{h_{kl}}, \quad \bar{\alpha}_l' = \frac{u_l'}{v_l'}, \quad \bar{\beta}_l' = \frac{g_l'}{h_l'}

\langle Z_{jit}\rangle = \rho_{jit}, \quad \langle W_{jtk}\rangle = \vartheta_{jtk}

\langle\ln \varepsilon_{l1}\rangle = \Psi(\xi_1^*) - \Psi(\xi_1^* + \xi_2^*)

\langle\ln \varepsilon_{l2}\rangle = \Psi(\xi_2^*) - \Psi(\xi_1^* + \xi_2^*)

where \Psi(\cdot) is the digamma function.
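The expectations above can be checked numerically: the mean of ln X over Beta(\xi_1^*, \xi_2^*) samples should match \Psi(\xi_1^*) - \Psi(\xi_1^* + \xi_2^*). In the sketch below, the digamma function is approximated by a central difference of `math.lgamma`, a shortcut that is adequate for illustration but not how a production implementation would compute it.

```python
import math
import random

def digamma(x, h=1e-5):
    """Crude digamma: central difference of the log-gamma function."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def expected_log_eps(xi1, xi2):
    """<ln eps_1> = Psi(xi1) - Psi(xi1 + xi2) for eps ~ Beta(xi1, xi2)."""
    return digamma(xi1) - digamma(xi1 + xi2)

# Monte Carlo check of the closed-form expectation (illustrative values).
rng = random.Random(0)
xi1, xi2 = 2.0, 3.0
mc = sum(math.log(rng.betavariate(xi1, xi2)) for _ in range(20000)) / 20000
```

For these values the closed form gives roughly -1.08, and the Monte Carlo average agrees to within sampling error.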

B. Variational Inference of q(\vec{Z}_r)

Next, we update the variational solution to q(\vec{Z}_r) by maximizing the current lower bound \mathcal{L}^{(r)}(q) with respect to q(\vec{Z}_r), while the other variational factors remain at their current values

q(\vec{Z}_r) = \prod_{j=1}^{M} \prod_{t=1}^{T} \rho_{jrt}^{Z_{jrt}}    (40)

where

\rho_{jrt} = \frac{\exp(\tilde{\rho}_{jrt})}{\sum_{f=1}^{T} \exp(\tilde{\rho}_{jrf})}    (41)

\tilde{\rho}_{jrt} = \sum_{k=1}^{K} \langle W_{jtk}^{(r-1)}\rangle \sum_{l=1}^{D} \varphi_{jrl} \big[R_{kl}^{(r-1)} + (\bar{\alpha}_{kl}^{(r-1)} - 1) \ln X_{jrl} + (\bar{\beta}_{kl}^{(r-1)} - 1) \ln(1 - X_{jrl})\big] + \langle\ln \pi_{jt}'^{(r-1)}\rangle + \sum_{s=1}^{t-1} \langle\ln(1 - \pi_{js}'^{(r-1)})\rangle.    (42)

C. Variational Inference of q^{(r)}(\vec{W}) and q^{(r)}(\vec{\varepsilon})

In the following step, we maximize the current lower bound \mathcal{L}^{(r)}(q) with respect to q^{(r)}(\vec{W}) and q^{(r)}(\vec{\varepsilon}), as shown in (27) and (28), where the corresponding hyperparameters are defined by

\vartheta_{jtk}^{(r)} = \vartheta_{jtk}^{(r-1)} + \zeta_r \, \Delta\vartheta_{jtk}^{(r)}    (43)

\xi_1^{*(r)} = \xi_1^{*(r-1)} + \zeta_r \, \Delta\xi_1^{*(r)}, \quad \xi_2^{*(r)} = \xi_2^{*(r-1)} + \zeta_r \, \Delta\xi_2^{*(r)}    (44)

where \zeta_r is the learning rate described in [56], defined by \zeta_r = (\tau_0 + r)^{-\varsigma}, subject to the constraints \varsigma \in (0.5, 1] and \tau_0 \geq 0. \Delta\vartheta_{jtk}^{(r)}, \Delta\xi_1^{*(r)}, and \Delta\xi_2^{*(r)} are the natural gradients of the hyperparameters \vartheta_{jtk}^{(r)}, \xi_1^{*(r)}, and \xi_2^{*(r)}, respectively. These natural gradients are obtained by multiplying the gradients by the inverse of the Riemannian metric

\Delta\xi_1^{*(r)} = \xi_1^{*(r)} - \xi_1^{*(r-1)} = \xi_1 + \sum_{j=1}^{M} N \varphi_{jrl} - \xi_1^{*(r-1)}    (45)

\Delta\xi_2^{*(r)} = \xi_2^{*(r)} - \xi_2^{*(r-1)} = \xi_2 + \sum_{j=1}^{M} N (1 - \varphi_{jrl}) - \xi_2^{*(r-1)}    (46)

\Delta\vartheta_{jtk}^{(r)} = \vartheta_{jtk}^{(r)} - \vartheta_{jtk}^{(r-1)} = \frac{\exp(\tilde{\vartheta}_{jtk}^{(r)})}{\sum_{f=1}^{K} \exp(\tilde{\vartheta}_{jtf}^{(r)})} - \vartheta_{jtk}^{(r-1)}    (47)

where

\tilde{\vartheta}_{jtk}^{(r)} = N \rho_{jrt} \sum_{l=1}^{D} \varphi_{jrl} \big[R_{kl}^{(r-1)} + (\bar{\alpha}_{kl}^{(r-1)} - 1) \ln X_{jrl} + (\bar{\beta}_{kl}^{(r-1)} - 1) \ln(1 - X_{jrl})\big] + \langle\ln \psi_k'^{(r-1)}\rangle + \sum_{s=1}^{k-1} \langle\ln(1 - \psi_s'^{(r-1)})\rangle.    (48)

D. Variational Inference of Remaining Factors

Finally, the current lower bound \mathcal{L}^{(r)}(q) is maximized with respect to q^{(r)}(\vec{\pi}'), q^{(r)}(\vec{\psi}'), q^{(r)}(\vec{\alpha}), q^{(r)}(\vec{\beta}), q^{(r)}(\vec{\alpha}'), and q^{(r)}(\vec{\beta}') as in (29)–(34). We then obtain the update equations with the following hyperparameters:

a_{jt}^{(r)} = a_{jt}^{(r-1)} + \zeta_r \, \Delta a_{jt}^{(r)}, \quad b_{jt}^{(r)} = b_{jt}^{(r-1)} + \zeta_r \, \Delta b_{jt}^{(r)}    (49)

c_k^{(r)} = c_k^{(r-1)} + \zeta_r \, \Delta c_k^{(r)}, \quad d_k^{(r)} = d_k^{(r-1)} + \zeta_r \, \Delta d_k^{(r)}    (50)

u_{kl}^{(r)} = u_{kl}^{(r-1)} + \zeta_r \, \Delta u_{kl}^{(r)}, \quad v_{kl}^{(r)} = v_{kl}^{(r-1)} + \zeta_r \, \Delta v_{kl}^{(r)}    (51)

g_{kl}^{(r)} = g_{kl}^{(r-1)} + \zeta_r \, \Delta g_{kl}^{(r)}, \quad h_{kl}^{(r)} = h_{kl}^{(r-1)} + \zeta_r \, \Delta h_{kl}^{(r)}    (52)

u_l'^{(r)} = u_l'^{(r-1)} + \zeta_r \, \Delta u_l'^{(r)}, \quad v_l'^{(r)} = v_l'^{(r-1)} + \zeta_r \, \Delta v_l'^{(r)}    (53)

g_l'^{(r)} = g_l'^{(r-1)} + \zeta_r \, \Delta g_l'^{(r)}, \quad h_l'^{(r)} = h_l'^{(r-1)} + \zeta_r \, \Delta h_l'^{(r)}.    (54)

The corresponding natural gradients can be calculated as

\Delta a_{jt}^{(r)} = 1 + N \rho_{jrt} - \omega_{jt} - a_{jt}^{(r-1)}    (55)

\Delta b_{jt}^{(r)} = \lambda_{jt} + t \, \omega_{jt} + N \sum_{s=t+1}^{T} \rho_{jrs} - b_{jt}^{(r-1)}    (56)

\Delta c_k^{(r)} = 1 + \sum_{j=1}^{M} \sum_{t=1}^{T} \vartheta_{jtk}^{(r)} - \eta_k - c_k^{(r-1)}    (57)

\Delta d_k^{(r)} = \gamma_k + k \, \eta_k + \sum_{j=1}^{M} \sum_{t=1}^{T} \sum_{s=k+1}^{K} \vartheta_{jts}^{(r)} - d_k^{(r-1)}    (58)

\Delta u_{kl}^{(r)} = u_{kl} + N \sum_{j=1}^{M} \sum_{t=1}^{T} \vartheta_{jtk}^{(r)} \rho_{jrt} \varphi_{jrl} \, \bar{\alpha}_{kl}^{(r-1)} \big[\Psi(\bar{\alpha}_{kl}^{(r-1)} + \bar{\beta}_{kl}^{(r-1)}) - \Psi(\bar{\alpha}_{kl}^{(r-1)}) + \bar{\beta}_{kl}^{(r-1)} \, \Psi'(\bar{\alpha}_{kl}^{(r-1)} + \bar{\beta}_{kl}^{(r-1)}) \big(\langle\ln \beta_{kl}^{(r-1)}\rangle - \ln \bar{\beta}_{kl}^{(r-1)}\big)\big] - u_{kl}^{(r-1)}    (59)

\Delta v_{kl}^{(r)} = v_{kl} - N \sum_{j=1}^{M} \sum_{t=1}^{T} \vartheta_{jtk}^{(r)} \rho_{jrt} \varphi_{jrl} \ln X_{jrl} - v_{kl}^{(r-1)}    (60)



$$\Delta g_{kl}^{(r)} = g_{kl} + N\sum_{j=1}^{M}\sum_{t=1}^{T}\vartheta_{jtk}^{(r)}\rho_{jrt}\phi_{jrl}\,\beta_{kl}^{(r-1)}\Big[\Psi\big(\alpha_{kl}^{(r-1)}+\beta_{kl}^{(r-1)}\big)-\Psi\big(\beta_{kl}^{(r-1)}\big)+\alpha_{kl}^{(r-1)}\Psi'\big(\alpha_{kl}^{(r-1)}+\beta_{kl}^{(r-1)}\big)\big(\langle\ln\alpha_{kl}^{(r-1)}\rangle-\ln\alpha_{kl}^{(r-1)}\big)\Big]-g_{kl}^{(r-1)} \tag{61}$$

$$\Delta h_{kl}^{(r)} = h_{kl} - N\sum_{j=1}^{M}\sum_{t=1}^{T}\vartheta_{jtk}^{(r)}\rho_{jrt}\phi_{jrl}\ln(1-X_{jrl}) - h_{kl}^{(r-1)} \tag{62}$$

$$\Delta u_{l}^{\prime(r)} = u'_l + N\sum_{j=1}^{M}(1-\phi_{jrl})\,\alpha_{l}^{\prime(r-1)}\Big[\Psi\big(\alpha_{l}^{\prime(r-1)}+\beta_{l}^{\prime(r-1)}\big)-\Psi\big(\alpha_{l}^{\prime(r-1)}\big)+\beta_{l}^{\prime(r-1)}\Psi'\big(\alpha_{l}^{\prime(r-1)}+\beta_{l}^{\prime(r-1)}\big)\big(\langle\ln\beta_{l}^{\prime(r-1)}\rangle-\ln\beta_{l}^{\prime(r-1)}\big)\Big]-u_{l}^{\prime(r-1)} \tag{63}$$

$$\Delta v_{l}^{\prime(r)} = v'_l - N\sum_{j=1}^{M}(1-\phi_{jrl})\ln X_{jrl} - v_{l}^{\prime(r-1)} \tag{64}$$

$$\Delta g_{l}^{\prime(r)} = g'_l + N\sum_{j=1}^{M}(1-\phi_{jrl})\,\beta_{l}^{\prime(r-1)}\Big[\Psi\big(\alpha_{l}^{\prime(r-1)}+\beta_{l}^{\prime(r-1)}\big)-\Psi\big(\beta_{l}^{\prime(r-1)}\big)+\alpha_{l}^{\prime(r-1)}\Psi'\big(\alpha_{l}^{\prime(r-1)}+\beta_{l}^{\prime(r-1)}\big)\big(\langle\ln\alpha_{l}^{\prime(r-1)}\rangle-\ln\alpha_{l}^{\prime(r-1)}\big)\Big]-g_{l}^{\prime(r-1)} \tag{65}$$

$$\Delta h_{l}^{\prime(r)} = h'_l - N\sum_{j=1}^{M}(1-\phi_{jrl})\ln(1-X_{jrl}) - h_{l}^{\prime(r-1)}. \tag{66}$$
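The bracketed factors in (59), (61), (63), and (65) all share the pattern $\Psi(a+b)-\Psi(a)+b\,\Psi'(a+b)\,(\langle\ln b\rangle-\ln b)$, built from the digamma $\Psi$ and trigamma $\Psi'$ functions. A self-contained sketch of this factor, approximating $\Psi$ and $\Psi'$ by finite differences of `math.lgamma` (the helper names are ours; in practice a library such as SciPy would supply these special functions directly):

```python
import math

def digamma(x, h=1e-5):
    # Central-difference approximation of Psi(x) = d/dx ln Gamma(x).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    # Second-difference approximation of Psi'(x).
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / h**2

def bracket_term(alpha, beta, exp_ln_beta):
    """Bracketed correction shared by the natural gradients above:
    Psi(a+b) - Psi(a) + b * Psi'(a+b) * (<ln b> - ln b)."""
    return (digamma(alpha + beta) - digamma(alpha)
            + beta * trigamma(alpha + beta) * (exp_ln_beta - math.log(beta)))
```

When the expected log equals the point value, the trigamma correction vanishes and only $\Psi(a+b)-\Psi(a)$ remains.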

According to [60], the proposed online variational algorithm can be considered a stochastic approximation method [80] for estimating the expected lower bound, and convergence is guaranteed if the learning rate satisfies the following conditions:

$$\sum_{r=1}^{\infty} \zeta_r = \infty, \qquad \sum_{r=1}^{\infty} \zeta_r^2 < \infty. \tag{67}$$
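A common way to satisfy (67) is a decaying schedule of the form $\zeta_r = (\tau_0 + r)^{-\kappa}$ with $\kappa \in (0.5, 1]$, as in stochastic variational inference [55]. A small sketch (the default values $\tau_0 = 64$ and $\kappa = 0.75$ are illustrative choices, not taken from the paper):

```python
def learning_rate(r, tau0=64.0, kappa=0.75):
    """Robbins-Monro-type schedule zeta_r = (tau0 + r)^(-kappa).
    For kappa in (0.5, 1], the sum of zeta_r diverges while the sum of
    zeta_r^2 converges, satisfying the convergence conditions above."""
    return (tau0 + r) ** (-kappa)

# The rates decay monotonically with the batch index r:
rates = [learning_rate(r) for r in range(1, 5)]
```

Larger $\tau_0$ down-weights early, noisy batches; $\kappa$ controls how quickly later batches are forgotten.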

ACKNOWLEDGMENT

The first and third authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank the anonymous referees and the associate editor for their comments.

REFERENCES

[1] B. J. Frey and N. Jojic, “A comparison of algorithms for inference and learning in probabilistic graphical models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 9, pp. 1392–1416, Sep. 2005.

[2] M. C. Burl, M. Weber, and P. Perona, “A probabilistic approach to object recognition using local photometry and global geometry,” in Computer Vision (Lecture Notes in Computer Science), vol. 1407, H. Burkhardt and B. Neumann, Eds. Heidelberg, Germany: Springer, 1998, pp. 628–641.

[3] W. T. Freeman and E. C. Pasztor, “Learning to estimate scenes from images,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1998, pp. 775–781.

[4] M. Schuster, “Better generative models for sequential data problems: Bidirectional recurrent mixture density networks,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1999, pp. 589–595.

[5] A. Srivastava, X. Liu, and U. Grenander, “Universal analytical forms for modeling image probabilities,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1200–1214, Sep. 2002.

[6] M. Haft, R. Hofman, and V. Tresp, “Generative binary codes,” Pattern Anal. Appl., vol. 6, no. 4, pp. 269–284, 2004.

[7] C. K. I. Williams and M. K. Titsias, “Learning about multiple objects in images: Factorial learning without factorial search,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2002, pp. 1391–1398.

[8] H. Greenspan, J. Goldberger, and A. Mayer, “A probabilistic framework for spatio-temporal video representation & indexing,” in Computer Vision (Lecture Notes in Computer Science), vol. 2353, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds. Berlin, Germany: Springer, 2002, pp. 461–475.

[9] J. Fan, H. Luo, and A. K. Elmagarmid, “Concept-oriented indexing of video databases: Toward semantic sensitive retrieval and browsing,” IEEE Trans. Image Process., vol. 13, no. 7, pp. 974–992, Jul. 2004.

[10] D. Gu, “Distributed EM algorithm for Gaussian mixtures in sensor networks,” IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1154–1166, Jul. 2008.

[11] A. D. E. D. Isaia, “A quick procedure for model selection in the case of mixture of normal densities,” Comput. Statist. Data Anal., vol. 51, no. 12, pp. 5635–5643, 2007.

[12] J. L. Andrews and P. D. McNicholas, “Extending mixtures of multivariate t-factor analyzers,” Statist. Comput., vol. 21, no. 3, pp. 361–373, 2011.

[13] M. Di Zio, U. Guarnera, and R. Rocci, “A mixture of mixture models for a classification problem: The unity measure error,” Comput. Statist. Data Anal., vol. 51, no. 5, pp. 2573–2585, 2007.

[14] T. Bdiri, N. Bouguila, and D. Ziou, “Object clustering and recognition using multi-finite mixtures for semantic classes and hierarchy modeling,” Expert Syst. Appl., vol. 41, no. 4, pp. 1218–1235, 2014.

[15] M. Bressan, D. Guillamet, and J. Vitria, “Using an ICA representation of high dimensional data for object recognition and classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2001, pp. I-1004–I-1009.

[16] T. Bozkaya and M. Ozsoyoglu, “Distance-based indexing for high-dimensional metric spaces,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1997, pp. 357–368.

[17] J. F. Murray and K. Kreutz-Delgado, “Visual recognition and inference using dynamic overcomplete sparse learning,” Neural Comput., vol. 19, no. 9, pp. 2301–2352, Sep. 2007.

[18] N. Bouguila and D. Ziou, “A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture,” IEEE Trans. Image Process., vol. 15, no. 9, pp. 2657–2668, Sep. 2006.

[19] N. Bouguila and D. Ziou, “High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1716–1731, Oct. 2007.

[20] M. M. B. Ismail and H. Frigui, “Unsupervised clustering and feature weighting based on generalized Dirichlet mixture modeling,” Inf. Sci., vol. 274, pp. 35–54, Aug. 2014.

[21] G. Celeux, F. Forbes, C. P. Robert, and D. M. Titterington, “Deviance information criteria for missing data models,” Bayesian Anal., vol. 1, no. 4, pp. 651–674, 2006.

[22] S. Boutemedjet, N. Bouguila, and D. Ziou, “A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1429–1443, Aug. 2009.

[23] W. Fan, N. Bouguila, and D. Ziou, “Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 7, pp. 1670–1685, Jul. 2013.

[24] J. H. Friedman and N. I. Fisher, “Bump hunting in high-dimensional data,” Statist. Comput., vol. 9, no. 2, pp. 123–143, 1999.

[25] C. B. Hurley, “Clustering visualizations of multidimensional data,” J. Comput. Graph. Statist., vol. 13, no. 4, pp. 788–806, 2004.

[26] M. F. Tappen, B. C. Russell, and W. T. Freeman, “Efficient graphical models for processing images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun./Jul. 2004, pp. II-673–II-680.


[27] J. Martí, J. Freixenet, J. Batlle, and A. Casals, “A new approach to outdoor scene description based on learning and top-down segmentation,” Image Vis. Comput., vol. 19, no. 14, pp. 1041–1055, 2001.

[28] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, Oct. 2005.

[29] Y. Guan, J. G. Dy, and M. I. Jordan, “A unified probabilistic model for global and local unsupervised feature selection,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 1073–1080.

[30] J. Pitman and M. Yor, “The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator,” Ann. Probab., vol. 25, pp. 855–900, Apr. 1997.

[31] S. K. Ng and G. J. McLachlan, “On the choice of the number of blocks with the incremental EM algorithm for the fitting of normal mixtures,” Statist. Comput., vol. 13, no. 1, pp. 45–55, 2003.

[32] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning in high dimensions,” Neural Comput., vol. 17, no. 12, pp. 2602–2634, 2005.

[33] Z. Ghahramani and G. Hinton, “Variational learning for switching state-space models,” Neural Comput., vol. 12, no. 4, pp. 831–864, Apr. 2000.

[34] M. I. Jordan, “Graphical models,” Statist. Sci., vol. 19, no. 1, pp. 140–155, 2004.

[35] W. Fan, N. Bouguila, and D. Ziou, “Variational learning for finite Dirichlet mixture models and applications,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 762–774, May 2012.

[36] J. Kato, T. Watanabe, S. Joga, J. Rittscher, and A. Blake, “An HMM-based segmentation method for traffic monitoring movies,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1291–1296, Sep. 2002.

[37] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: Principles and practice of background maintenance,” in Proc. 7th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, Sep. 1999, pp. 255–261.

[38] O. Sanchez and F. Dibos, “Displacement following of hidden objects in a video sequence,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 91–105, 2004.

[39] M. Allan, M. K. Titsias, and C. K. I. Williams, “Fast learning of sprites using invariant features,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2005, pp. 40–49.

[40] P. Yin, A. Criminisi, J. Winn, and I. Essa, “Tree-based classifiers for bilayer video segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8.

[41] A. Hampapur, R. Jain, and T. E. Weymouth, “Production model based digital video segmentation,” Multimedia Tools Appl., vol. 1, no. 1, pp. 9–46, 1995.

[42] Y. Wang, K.-F. Loe, T. Tan, and J.-K. Wu, “Spatiotemporal video segmentation based on graphical models,” IEEE Trans. Image Process., vol. 14, no. 7, pp. 937–947, Jul. 2005.

[43] Y. W. Teh and M. I. Jordan, “Hierarchical Bayesian nonparametric models with applications,” in Bayesian Nonparametrics: Principles and Practice, N. Hjort, C. Holmes, P. Müller, and S. Walker, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2010.

[44] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, Dec. 2006.

[45] Y. W. Teh, “A hierarchical Bayesian language model based on Pitman–Yor processes,” in Proc. 21st Int. Conf. Comput. Linguistics, 44th Annu. Meeting Assoc. Comput. Linguistics (ACL-44), 2006, pp. 985–992.

[46] E. B. Sudderth and M. I. Jordan, “Shared segmentation of natural scenes using dependent Pitman–Yor processes,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2008, pp. 1585–1592.

[47] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei, “The IBP compound Dirichlet process and its application to focused topic modeling,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 1–8.

[48] C. Chen, V. Rao, W. Buntine, and Y. W. Teh, “Dependent normalized random measures,” in Proc. Int. Conf. Mach. Learn. (ICML), 2013, pp. 969–977.

[49] C. Robert and G. Casella, Monte Carlo Statistical Methods. New York, NY, USA: Springer-Verlag, 1999.

[50] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont, “Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation,” J. Mach. Learn. Res., vol. 14, pp. 1891–1945, Jul. 2013.

[51] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proc. Conf. Uncertainty Artif. Intell. (UAI), 2001, pp. 362–369.

[52] H. Attias, “A variational Bayesian framework for graphical models,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 1999, pp. 209–215.

[53] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.

[54] C. Wang and D. M. Blei, “Variational inference in nonconjugate models,” J. Mach. Learn. Res., vol. 14, pp. 1005–1031, Jan. 2013.

[55] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” J. Mach. Learn. Res., vol. 14, pp. 1303–1347, 2013.

[56] C. Wang, J. W. Paisley, and D. M. Blei, “Online variational inference for the hierarchical Dirichlet process,” J. Mach. Learn. Res., vol. 15, pp. 752–760, 2011.

[57] J. Sethuraman, “A constructive definition of Dirichlet priors,” Statist. Sinica, vol. 4, pp. 639–650, Mar. 1994.

[58] W. Fan and N. Bouguila, “Variational learning of a Dirichlet process of generalized Dirichlet distributions for simultaneous clustering and feature selection,” Pattern Recognit., vol. 46, no. 10, pp. 2754–2769, 2013.

[59] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain, “Simultaneous feature selection and clustering using mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, Sep. 2004.

[60] M. Sato, “Online model selection based on the variational Bayes,” Neural Comput., vol. 13, no. 7, pp. 1649–1681, Jul. 2001.

[61] D. M. Blei and M. I. Jordan, “Variational inference for Dirichlet process mixtures,” Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2005.

[62] X. Zhou, N. Cui, Z. Li, F. Liang, and T. S. Huang, “Hierarchical Gaussianization for image classification,” in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Sep./Oct. 2009, pp. 1971–1977.

[63] D. Song and D. Tao, “Biologically inspired feature manifold for scene classification,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 174–184, Jan. 2010.

[64] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3485–3492.

[65] A. Shyr, T. Darrell, M. Jordan, and R. Urtasun, “Supervised hierarchical Pitman–Yor process for natural scene segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 2281–2288.

[66] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[67] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), Apr. 2014, pp. 1–4.

[68] J. Donahue et al., “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. JMLR Workshop Conf., 31st Int. Conf. Mach. Learn. (ICML), 2014, pp. 647–655.

[69] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 487–495.

[70] X. Wu, C.-W. Ngo, and Q. Li, “Threading and autodocumenting news videos: A promising solution to rapidly browse news topics,” IEEE Signal Process. Mag., vol. 23, no. 2, pp. 59–68, Mar. 2006.

[71] H. S. Sawhney and S. Ayer, “Compact representations of videos through dominant and multiple motion estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp. 814–830, Aug. 1996.

[72] B.-L. Yeo and B. Liu, “Rapid scene analysis on compressed video,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 6, pp. 533–544, Dec. 1995.

[73] A. Ekin, A. M. Tekalp, and R. Mehrotra, “Automatic soccer video analysis and summarization,” IEEE Trans. Image Process., vol. 12, no. 7, pp. 796–807, Jul. 2003.

[74] L.-Y. Duan, M. Xu, Q. Tian, C.-S. Xu, and J. S. Jin, “A unified framework for semantic shot classification in sports video,” IEEE Trans. Multimedia, vol. 7, no. 6, pp. 1066–1083, Dec. 2005.

[75] D.-Y. Chen, K. Cannons, H.-R. Tyan, S.-W. Shih, and H.-Y. M. Liao, “Spatiotemporal motion analysis for the detection and classification of moving targets,” IEEE Trans. Multimedia, vol. 10, no. 8, pp. 1578–1591, Dec. 2008.

[76] X. Song and G. Fan, “Selecting salient frames for spatiotemporal video modeling and segmentation,” IEEE Trans. Image Process., vol. 16, no. 12, pp. 3035–3046, Dec. 2007.

[77] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in Computer Vision (Lecture Notes in Computer Science), vol. 6315, K. Daniilidis, P. Maragos, and N. Paragios, Eds. Heidelberg, Germany: Springer, 2010, pp. 282–295.

[78] P. L. Correia and F. Pereira, “Objective evaluation of video segmentation quality,” IEEE Trans. Image Process., vol. 12, no. 2, pp. 186–200, Feb. 2003.


[79] C. E. Erdem, B. Sankur, and A. M. Tekalp, “Performance measures for video object segmentation and tracking,” IEEE Trans. Image Process., vol. 13, no. 7, pp. 937–951, Jul. 2004.

[80] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications (Applications of Mathematics). New York, NY, USA: Springer, 1997.

Wentao Fan received the M.Sc. degree in information systems security and the Ph.D. degree in electrical and computer engineering from Concordia University, Montreal, QC, Canada, in 2009 and 2014, respectively.

He is currently an Associate Professor with the Department of Computer Science and Technology, Huaqiao University, Xiamen, China. His current research interests include data clustering, mixture models, machine learning, computer vision, and pattern recognition.

Hassen Sallay received the Ph.D. degree from l’Université Henri Poincaré Nancy 1, Nancy, France, in 2004.

He is currently an Assistant Professor with Umm al-Qura University, Mecca, Saudi Arabia. He has expertise in computer security and reliability, computer communications (networks), and data mining.

Nizar Bouguila (S’05–M’06–SM’11) received the Engineering degree from the University of Tunis, Tunis, Tunisia, in 2000, and the M.Sc. and Ph.D. degrees in computer science from Sherbrooke University, Sherbrooke, QC, Canada, in 2002 and 2006, respectively.

He is currently a Professor with the Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada. His current research interests include image processing, machine learning, data mining, 3-D graphics, computer vision, and pattern recognition.

Prof. Bouguila is the holder of a Concordia Research Chair.