A Tutorial on the Dirichlet Process for Engineers
Technical Report
John Paisley
Department of Electrical & Computer Engineering
Duke University, Durham, NC
Abstract
This document provides a review of the Dirichlet process originally given in the author’s preliminary
exam paper and is presented here as a tutorial. No motivation is given (in what I’ve excerpted here), and
this document is intended to be a mathematical tutorial that is still accessible to the engineer.
I. THE DIRICHLET DISTRIBUTION
Consider the finite, D-dimensional vector, π, having the properties 0 ≤ πi ≤ 1, i = 1, …, D and ∑_{i=1}^D πi = 1 (i.e., residing on the (D − 1)-dimensional simplex in R^D, or π ∈ ∆D). We view this
vector as the parameter for the multinomial distribution, where samples, X ∼ Mult(π), take values
X ∈ {1, . . . , D} with probability P (X = i|π) = πi. When the vector π is unknown, it can be inferred
in the Bayesian setting using its conjugate prior, the Dirichlet distribution.
The Dirichlet distribution of dimension D is a continuous probability measure on ∆D having the
density function
p(π|β1, …, βD) = [Γ(∑_i βi) / ∏_i Γ(βi)] ∏_{i=1}^D πi^{βi − 1}    (1)

where the parameters βi ≥ 0, ∀i. It will be useful to reparameterize this distribution as follows:

p(π|αg01, …, αg0D) = [Γ(α) / ∏_i Γ(αg0i)] ∏_{i=1}^D πi^{αg0i − 1}    (2)
where we have defined α ≡ ∑_i βi and g0i ≡ βi / ∑_i βi. We refer to this distribution as Dir(αg0). With this parameterization, the mean and variance of an element in π ∼ Dir(αg0) are

E[πi] = g0i,    V[πi] = g0i(1 − g0i) / (α + 1)    (3)
DRAFT
Fig. 1. 10,000 samples from a 3-dimensional Dirichlet distribution with g0 uniform and (a) α = 1, (b) α = 3, (c) α = 10. As can be seen, when (a) α < 3, the samples concentrate near the vertices and edges of ∆3; when (b) α = 3, the density is uniform; and when (c) α > 3, the density begins to center on g0.
so g0 functions as a prior guess at π and α as a strength parameter, controlling how tight the distribution
is around g0. In the case where g0i = 0, it follows that P(πi = 0) = 1, and when g0j = 1 and g0i = 0, ∀i ≠ j, it follows that P(πj = 1) = 1.
Figure 1 contains plots of 10,000 samples each from a 3-dimensional Dirichlet distribution with g0
uniform and α = 1, 3, 10, respectively. When α = D, or the dimensionality of the Dirichlet distribution,
we see that the density is uniform on the simplex; when α > D, the density begins to cluster around g0.
Perhaps more interesting, and more relevant to the Dirichlet process, is when α < D. We see that as α
becomes less than the dimensionality of the distribution, most of the density lies on the corners and faces
of the simplex. In general, as the ratio of α to D shrinks, draws of π will be sparse, where most of the
probability mass will be contained in a subset of the elements of π. This phenomenon will be discussed
in greater detail later and is a crucial element of the Dirichlet process. Values of π can be drawn from
Dir(αg0) in a finite number of steps using the following two methods (two infinite-step methods will
be discussed shortly).
1) A Function of Gamma Random Variables [31]: Gamma random variables can be used to sample
values from Dir(αg0) as follows: Let Zi ∼ Ga(αg0i, λ), i = 1, . . . , D, where αg0i is the shape parameter
and λ the scale parameter of the gamma distribution. Then the vector π = (Z1/∑_i Zi, …, ZD/∑_i Zi) has a Dir(αg0) distribution. The parameter λ can be set to any positive, real value, but must remain constant. That is to say, the vector π is free of λ.
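This gamma-normalization recipe takes only a few lines; the sketch below assumes NumPy and fixes λ = 1, which is harmless since the draw is free of λ:

```python
import numpy as np

def dirichlet_via_gammas(alpha, g0, rng):
    """Draw pi ~ Dir(alpha * g0) by normalizing independent gamma variables.

    Z_i ~ Ga(alpha * g0_i, lam); pi = Z / sum(Z).  The scale lam cancels in
    the normalization, so we simply take lam = 1.
    """
    z = rng.gamma(shape=alpha * np.asarray(g0), scale=1.0)
    return z / z.sum()

rng = np.random.default_rng(0)
g0 = np.array([0.5, 0.3, 0.2])
pi = dirichlet_via_gammas(3.0, g0, rng)
# pi lies on the simplex: nonnegative entries summing to 1
```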
DRAFT
3
2) A Function of Beta Random Variables [9]: A stick-breaking approach can also be employed to
draw from Dir(αg0). For i = 1, …, D − 1 and V0 ≡ 0, let

Vi ∼ Beta(αg0i, α ∑_{j=i+1}^D g0j)

πi = Vi ∏_{j<i} (1 − Vj)

πD = 1 − ∑_{j=1}^{D−1} πj    (4)

then the resulting vector, π, has a Dir(αg0) distribution. This approach derives its name from an analogy to breaking a proportion, Vi, from the remainder of a unit-length stick, ∏_{j<i}(1 − Vj). Though called a
“stick-breaking” approach, this method is distinct from that of Sethuraman [27], which will be discussed
later and referred to as the “Sethuraman construction” to avoid confusion.
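A minimal NumPy sketch of this finite stick-breaking recipe, following (4):

```python
import numpy as np

def dirichlet_via_betas(alpha, g0, rng):
    """Draw pi ~ Dir(alpha * g0) via the finite stick-breaking method (4).

    V_i ~ Beta(alpha*g0_i, alpha * sum_{j>i} g0_j) breaks a fraction from
    the remaining stick; the last entry takes whatever stick is left.
    """
    g0 = np.asarray(g0)
    D = len(g0)
    pi = np.zeros(D)
    remaining = 1.0
    for i in range(D - 1):
        v = rng.beta(alpha * g0[i], alpha * g0[i + 1:].sum())
        pi[i] = v * remaining
        remaining *= 1.0 - v
    pi[-1] = remaining  # pi_D = 1 - sum of the earlier entries
    return pi

rng = np.random.default_rng(1)
pi = dirichlet_via_betas(2.0, [0.4, 0.4, 0.2], rng)
```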
A. Calculating the Posterior of π
As indicated above, the Dirichlet distribution is conjugate to the multinomial, meaning that the posterior
can be solved analytically and is also a Dirichlet distribution. Using Bayes' theorem,

p(π|X = i) = p(X = i|π)p(π) / p(X = i) = p(X = i|π)p(π) / ∫_{π∈∆D} p(X = i|π)p(π)dπ ∝ p(X = i|π)p(π)

we can calculate the posterior distribution of π given data, X. For a single datum, the likelihood p(X = i|π) = πi, so the posterior can be seen to be Dirichlet as well

p(π|X = i) ∝ πi^{(αg0i+1)−1} ∏_{j≠i} πj^{αg0j−1}    (5)
which, when normalized, equals Dir(αg0 + ei), where we define ei to be a D-dimensional vector of
zeros with a one in the ith position. We see that the ith parameter of the Dirichlet distribution has simply
been incremented by one. For N observations, this extends to
p(π|X1 = x1, …, XN = xN) ∝ ∏_{i=1}^D πi^{αg0i + ni − 1}    (6)

where ni = ∑_{j=1}^N eXj(i), or the number of observations taking value i, and N = ∑_{i=1}^D ni. When
normalized, the posterior equals Dir(αg01 + n1, . . . , αg0D + nD). Therefore, when used as a conjugate
prior to the multinomial, the posterior distribution of π is Dirichlet where the parameters have been
updated with the “counts” from the observed data. The interplay between the prior and the data can be
seen in the posterior expectation of an element, πi,
E[πi|X1 = x1, …, XN = xN] = (ni + αg0i)/(α + N) = ni/(α + N) + αg0i/(α + N)    (7)
where the second expression clearly shows the tradeoff between the prior and the data. We see that,
as N → ∞, the posterior expectation converges to a point mass located at (n1/N, …, nD/N), the empirical
distribution of the observations. We can also see more clearly the effect of α, which acts as a “prior
count,” dictating the transition from prior to data. As a Bayes estimator, this parameter can be viewed
as controlling how quickly we are willing to trust the empirical distribution of the data over our prior
guess, while making this transition smoothly. Moreover, as N → ∞, V[πi] → 0, ∀i, meaning samples
of π ∼ Dir(αg01 + n1, . . . , αg0D + nD) will equal this expectation with probability 1, or the Dirichlet
distribution converges to a delta function at the expectation.
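The count-based update and the posterior mean in (7) are simple to compute; a small NumPy sketch with hypothetical numbers:

```python
import numpy as np

def dirichlet_posterior(alpha, g0, data, D):
    """Posterior parameters of Dir(alpha*g0) after multinomial observations.

    data holds category labels in {0, ..., D-1}; the posterior is
    Dir(alpha*g0 + counts), i.e., each parameter is bumped by its count.
    """
    counts = np.bincount(data, minlength=D)
    return alpha * np.asarray(g0) + counts

alpha, g0 = 2.0, np.array([0.5, 0.25, 0.25])
data = np.array([0, 0, 1, 0, 2, 0])          # counts: 4, 1, 1
post = dirichlet_posterior(alpha, g0, data, 3)   # [5.0, 1.5, 1.5]
post_mean = post / post.sum()    # (n_i + alpha*g0_i)/(alpha + N), per (7)
```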
B. Two Infinite Sampling Methods for the Dirichlet Distribution
Two other methods requiring an infinite number of random variables, the Polya urn scheme and
Sethuraman’s constructive definition, also exist for sampling π ∼ Dir(αg0). These methods, though
of little practical value for the finite Dirichlet distribution, become very useful for sampling from the
infinite Dirichlet process. Because they are arguably easier to understand in the finite setting, they are
given here, which will allow for a more intuitive extension to the infinite-dimensional case.
1) The Polya Urn Scheme [17]: The Polya urn scheme is a sequential sampling process accompanied
by the following story: Consider an urn initially containing α balls that can take one of D colors, with
αg01 balls of the first color, αg02 of the second, etc. A person randomly selects a ball, X , from this urn,
replaces the ball, and then adds a ball of the same color. After the first draw, the second ball is therefore
drawn from the distribution
p(X2|X1 = x1) = [1/(α + 1)] δx1 + [α/(α + 1)] g0

and, inductively,

p(XN+1|X1 = x1, …, XN = xN) = ∑_{i=1}^D [ni/(α + N)] δi + [α/(α + N)] g0    (8)
where δi is a delta function at the color having index i. If this seems reminiscent of the posterior
expectation of π given in (7), that's because it is. Using the fact that ∫_{π∈∆D} p(X|π)p(π)dπ = p(X), we can write that

p(XN+1|X1 = x1, …, XN = xN) = ∫_{π∈∆D} p(XN+1|π) p(π|X1 = x1, …, XN = xN) dπ    (9)
where, conditioned on π, XN+1 is independent of X1, . . . , XN . We observe that p(XN+1 = i|π) = πi,
and so
p(XN+1|X1 = x1, . . . , XN = xN ) = E[π|X1 = x1, . . . , XN = xN ] (10)
which is given for one element in (7). In this process, we are integrating out, or marginalizing, the random
pmf, π, and are therefore said to be drawing from a marginalized Dirichlet distribution. By the law of
large numbers [31], as the number of samples N → ∞, the empirical distribution of the urn converges
to a discrete distribution, π. That the distribution of π is Dir(αg0) can be seen using the theory of
exchangeability [1].
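The urn scheme above is easy to simulate; the NumPy sketch below returns the empirical distribution of the urn after N draws:

```python
import numpy as np

def polya_urn(alpha, g0, N, rng):
    """Simulate N draws from the Polya urn of (8).

    After n draws, color i is selected with probability
    (n_i + alpha*g0_i) / (alpha + n), and one ball of that color is added.
    """
    g0 = np.asarray(g0)
    counts = np.zeros(len(g0))
    for n in range(N):
        probs = (counts + alpha * g0) / (alpha + n)
        counts[rng.choice(len(g0), p=probs)] += 1
    return counts / N  # empirical distribution of the urn

rng = np.random.default_rng(2)
empirical = polya_urn(3.0, [0.5, 0.3, 0.2], 500, rng)
```

As N grows, each run's empirical distribution converges to one draw π ∼ Dir(αg0), so averaging over many urns recovers g0.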
A sequence of random variables is said to be exchangeable if, for any permutation, ρ, of the integers
{1, …, N}, p(X1, …, XN) = p(Xρ1, …, XρN). By selecting the appropriate values from (8) and using the chain rule, p(X1, …, XN) = ∏_{i=1}^N p(Xi|Xj<i), we see that

p(X1 = x1, …, XN = xN) = [∏_{i=1}^D ∏_{j=1}^{ni} (αg0i + j − 1)] / [∏_{j=1}^N (α + j − 1)]    (11)
Because this likelihood is unchanged for all permutations, ρ, the sequence (X1, . . . , XN ) is exchangeable.
As a result of this exchangeability, de Finetti’s theorem [12] states that there exists a discrete pmf, π,
having the yet-to-be-determined distribution, p(π), conditioned on which the observations, (X1, . . . , XN ),
are independent as N → ∞. The following sequence of equalities then holds:

p(X1 = x1, …, XN = xN) = ∫_{π∈∆D} p(X1 = x1, …, XN = xN|π) p(π) dπ
    = ∫_{π∈∆D} [∏_{j=1}^N p(Xj = xj|π)] p(π) dπ
    = ∫_{π∈∆D} [∏_{i=1}^D πi^{ni}] p(π) dπ
    = E_p[∏_{i=1}^D πi^{ni}]    (12)
Because a distribution on the simplex is uniquely defined by its moments, the only distribution, p(π), having moments

E_p[∏_{i=1}^D πi^{ni}] = [∏_{i=1}^D ∏_{j=1}^{ni} (αg0i + j − 1)] / [∏_{j=1}^N (α + j − 1)]    (13)

is the Dir(αg0) distribution. Therefore, as N → ∞, the sequence (X1, X2, …) can be viewed as iid samples from π, where π ∼ Dir(αg0) and equals the limiting empirical distribution (n1/N, …, nD/N).
2) The Sethuraman Construction of a Dirichlet Prior [27]: A second method for drawing the vector
π ∼ Dir(αg0) using an infinite sequence of random variables was proven to exist by Sethuraman:

π = ∑_{j=1}^∞ Vj ∏_{k<j} (1 − Vk) eYj

Vj ∼ Beta(1, α)

Yj ∼ Mult(g0)    (14)
where Yj ∈ {1, …, D} and V0 ≡ 0. Defining pj ≡ Vj ∏_{k<j} (1 − Vk), we see that πi = ∑_{j=1}^∞ pj eYj(i).
The weight vector, p, is often said to be constructed via a stick-breaking process due to the analogy
previously discussed. A visualization of two breaks of this process is shown in Figure 2. Note that this analogy and the accompanying visualization do not take into consideration the location vector, Y. For this draw to be meaningful, the values (Vi, Yi) must always be considered as a pair, with neither being of any value without the other.
Fig. 2. A visualization of an abstract stick-breaking process after two breaks have been made. With respect to the Sethuraman
construction, the random variable, Vi ∼ Beta(1, α), must always be considered with its corresponding Yi as a pair.
This distribution is said to be "constructed" because the sequence πi,n, the value of πi after break n, increases stochastically toward πi as n → ∞, with the increments pn for which Yn = i becoming stochastically smaller. That the resulting vector, π, is a proper probability mass function follows from the facts that pj ∈ [0, 1], ∀j (because Vj ∈ [0, 1], ∀j) and that ∑_{j=1}^∞ pj = 1, since 1 − ∑_{j=1}^n pj = ∏_{j=1}^n (1 − Vj) → 0 as n → ∞. The convergence of each πi can be argued using similar reasoning.
The proof of Sethuraman’s constructive definition relies on two Lemmas regarding the Dirichlet
distribution. These Lemmas are given below, followed by their use in proving (14).
Lemma 1: Consider a random vector, π ∈ ∆D, drawn according to the following process

π ∼ Dir(αg0 + eY)

Y ∼ Mult(g0)    (15)

where Y ∈ {1, …, D}. Then π has a Dir(αg0) distribution. Another way to express this is that

Dir(αg0) = ∑_{i=1}^D g0i Dir(αg0 + ei)    (16)
So a Dirichlet distribution is also a specially parameterized mixture of Dirichlet distributions with the
component Dir(αg0 + ei) having mixing weight g0i.
Proof: Basic probability theory allows us to write that
p(π) = ∑_{i=1}^D p(X = i) p(π|X = i)    (17)

p(X = i) = ∫_{π∈∆D} p(X = i|π) p(π) dπ
We observe that p(X = i) = E[πi] = g0i and that p(π|X = i) = Dir(αg0 + ei) from the posterior
calculation of a Dirichlet distribution. Replacing these two values in (17) yields the desired result.
Lemma 2: Let π be constructed according to the following linear combination,
π = V W1 + (1 − V) W2

W1 ∼ Dir(ω1, …, ωD)

W2 ∼ Dir(υ1, …, υD)

V ∼ Beta(∑_{i=1}^D ωi, ∑_{i=1}^D υi)    (18)
Then the vector π has the Dir(ω1 + υ1, . . . , ωD + υD) distribution.
Proof: The proof of this Lemma relies on the representation of π as a function of gamma random
variables. First, let W1 = (Z1/∑_i Zi, …, ZD/∑_i Zi), W2 = (Z′1/∑_i Z′i, …, Z′D/∑_i Z′i) and V = ∑_i Zi / (∑_i Zi + ∑_i Z′i), where Zi ∼ Ga(ωi, λ) and Z′i ∼ Ga(υi, λ). Then W1 ∼ Dir(ω1, …, ωD), W2 ∼ Dir(υ1, …, υD) and V ∼ Beta(∑_i ωi, ∑_i υi). However, it remains to show that these are drawn independently, or that p(W1, W2, V) = p(W1)p(W2)p(V). For this, we appeal to Basu's theorem [3], which states that since W1 and W2 are free of λ, W1 and W2 are also independent of any complete sufficient statistic of λ. In this case, ∑_i Zi and ∑_i Z′i are complete sufficient statistics for this parameter. Therefore, W1 is independent of ∑_i Zi and W2 is independent of ∑_i Z′i and, by extension, W1 and W2 are both independent of V.
Given that W1, W2 and V are constructed as above, we can see in their linear combination that π = V W1 + (1 − V) W2 = ((Z1 + Z′1)/(∑_i Zi + ∑_i Z′i), …, (ZD + Z′D)/(∑_i Zi + ∑_i Z′i)). The distribution of the sum of two gamma random variables allows us to say that Zi + Z′i ∼ Ga(ωi + υi, λ) and therefore π ∼ Dir(ω1 + υ1, …, ωD + υD).
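Lemma 2 is easy to check numerically; the sketch below draws W1, W2 and V exactly as in (18), using NumPy's dirichlet and beta samplers, and compares empirical moments against the claimed Dir(ω + υ) distribution:

```python
import numpy as np

def lemma2_draw(omega, upsilon, rng):
    """One draw of pi = V*W1 + (1-V)*W2, with W1, W2, V as in (18).

    By Lemma 2 the result should be distributed Dir(omega + upsilon).
    """
    w1 = rng.dirichlet(omega)
    w2 = rng.dirichlet(upsilon)
    v = rng.beta(sum(omega), sum(upsilon))
    return v * w1 + (1.0 - v) * w2

rng = np.random.default_rng(4)
omega, upsilon = [1.0, 2.0], [2.0, 1.0]
draws = np.array([lemma2_draw(omega, upsilon, rng) for _ in range(20000)])
# target is Dir(3, 3): mean 1/2 per entry, variance (1/2)(1/2)/(6 + 1)
```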
An immediate result of this Lemma is that the vector, π, constructed as,

π = V W1 + (1 − V) W2

W1 ∼ Dir(ei)

W2 ∼ Dir(αg0)

V ∼ Beta(1, α)    (19)

has the distribution Dir(αg0 + ei). Furthermore, we observe that, using a property of the Dirichlet distribution mentioned above, p(W1 = ei) = 1. We now use these two Lemmas in considering the distribution of π in the following,

π = V eY + (1 − V) π′

Y ∼ Mult(g0)

V ∼ Beta(1, α)

π′ ∼ Dir(αg0)    (20)
First, we add a layer to (20) that seemingly contains no new information: eY ∼ Dir(eY) (a probability 1 event, which is why we use the same variable notation). Using Lemma 2, however, this allows us to collapse (20) and write

π ∼ Dir(αg0 + eY)

Y ∼ Mult(g0)    (21)
which, using Lemma 1, tells us that π ∼ Dir(αg0). This fact allows us to decompose π into a linear
combination of an infinite set of pairs of random variables. For example, since π′ ∼ Dir(αg0), this
random variable can be expanded in the same manner as (20), producing
π = V1 eY1 + V2(1 − V1) eY2 + (1 − V2)(1 − V1) π′′

Yi ∼iid Mult(g0)

Vi ∼iid Beta(1, α)

π′′ ∼ Dir(αg0)    (22)

for i = 1, 2. Repeating this expansion indefinitely produces (14). Since this process cannot be realized in practice, it must be truncated, where (Vi, Yi) are drawn for i = 1, …, K with the remainder, ε = ∏_{i=1}^K (1 − Vi), arbitrarily assigned across π. This truncation will be discussed in greater detail later.
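The truncated expansion is simple to simulate; in the NumPy sketch below, the remainder ε is assigned to the last sampled location, one arbitrary choice among several:

```python
import numpy as np

def sethuraman_dirichlet(alpha, g0, K, rng):
    """Approximate pi ~ Dir(alpha*g0) by truncating (14) at K breaks.

    The leftover stick mass eps = prod_j (1 - V_j) is folded into the
    final sampled location (one arbitrary assignment of the remainder).
    """
    g0 = np.asarray(g0)
    pi = np.zeros(len(g0))
    remaining = 1.0
    y = 0
    for _ in range(K):
        v = rng.beta(1.0, alpha)
        y = rng.choice(len(g0), p=g0)   # location Y_j ~ Mult(g0)
        pi[y] += v * remaining          # weight p_j = V_j * prod(1 - V_k)
        remaining *= 1.0 - v
    pi[y] += remaining                  # truncation remainder eps
    return pi

rng = np.random.default_rng(3)
pi = sethuraman_dirichlet(2.0, [0.5, 0.3, 0.2], 50, rng)
```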
From this definition of the Dirichlet distribution, the effect of α is perhaps clearer vis-à-vis Figure 1. The distribution on the values of p = φ(V), where we will use φ(·) to represent the stick-breaking function, depends only upon α and indicates how many breaks of the stick are necessary (probabilistically speaking) to utilize a given fraction of the probability weight. This number is independent of D. However, the sparsity of the final vector, π, is dependent upon D (via g0), as it controls the placement of elements from p into π (via Y).
The Dirichlet distribution therefore has several means by which one can draw π ∼ Dir(αg0). The two finite methods allow one to obtain the vector π exactly. The two infinite approaches, due to a necessary truncation (dictated by the data in the Polya urn case and by choice in the Sethuraman construction), are only approximate. For this reason, in the finite setting (D < ∞), the finite sampling methods are perhaps more reasonable and the two infinite methods discussed above merely interesting facts. However, in the infinite, nonparametric setting, the two infinite approaches become the only viable means for working with the Dirichlet process.
II. THE DIRICHLET PROCESS
To motivate the following discussion of the Dirichlet process, consider a Dirichlet distribution with g0
uniform, π ∈ ∆k and
π ∼ Dir(α/k, . . . , α/k) (23)
What happens when k → ∞? Looking at the mean and variance of πi in (3), we see that both
E[πi], V[πi] → 0, leading to questions as to whether this distribution remains well-defined and can be
sampled from. Additionally, the definition of the Dirichlet distribution in the previous section was a general
analysis of the functional form. Each probability, πi, was never defined to correspond to anything, and
the prior vector, g0, was arbitrarily defined. However, what if we define π over a continuous, measurable
space, (Θ,B) equipped with a measure, αG0, with which we replace αg0?
In the following, we will attempt to give a sense of how these problems are approached and can be
solved via the use of measure theory. However, our focus is mainly on developing the intuition; the reader
is referred to the original papers for rigorous proofs [11][5][2][27]. We begin by giving the definition of
the Dirichlet process.
Definition of the Dirichlet Process: Let Θ be a measurable space and B the σ-field of subsets of Θ. Let
α be a positive scalar and G0 a probability measure on (Θ, B). Then for all pairwise disjoint partitions {B1, …, Bk} of Θ, where ⋃_{i=1}^k Bi = Θ, the random probability measure, G, on (Θ, B) is said to be a Dirichlet process if (G(B1), …, G(Bk)) ∼ Dir(αG0(B1), …, αG0(Bk)).
This process is denoted G ∼ DP (αG0). Because the nonparametric Dirichlet process is an infinite-
dimensional prior on a continuous space, we focus on the infinite-dimensional analogues to the two
infinite sampling methods detailed in Section I. Below, we review these two representations as they
appear in the literature, noting that the infinite-dimensional analogue to the Polya urn scheme is called
the “Chinese restaurant process.” Also in keeping with the literature, we replace the symbol X , as used
above, with θ in what follows.
The Chinese Restaurant Process [1]: Given the observations, {θ∗1, …, θ∗N}, sampled from a Chinese restaurant process (CRP), a new observation, θ∗N+1, is sampled according to the distribution

θ∗N+1 ∼ ∑_{i=1}^K [ni/(α + N)] δθi + [α/(α + N)] G0    (24)

where K is the number of unique values among {θ∗1, …, θ∗N}, θi those specific values and ni the number of observations taking the value θi. As N → ∞, the samples {θ∗i}_{i=1}^∞ are iid from G, where G ∼ DP(αG0) and is equal to the expression in (24). As is evident, θ∗N+1 takes a previously observed value with probability N/(α + N) or a new value drawn from G0 with probability α/(α + N). This value is new almost surely because we assume G0 to be a continuous probability measure.
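The CRP is straightforward to simulate; in the NumPy sketch below, base_draw stands in for a draw from G0 (a standard normal here, an illustrative choice only):

```python
import numpy as np

def crp_draws(alpha, base_draw, N, rng):
    """Draw theta*_1, ..., theta*_N from the CRP of (24).

    With probability n_i/(alpha + n) the new draw repeats atom i; with
    probability alpha/(alpha + n) it is a fresh draw from G0 (base_draw).
    """
    atoms, counts, out = [], [], []
    for n in range(N):
        probs = np.array(counts + [alpha]) / (alpha + n)
        k = rng.choice(len(probs), p=probs)
        if k == len(atoms):            # new value, drawn from G0
            atoms.append(base_draw(rng))
            counts.append(1)
        else:                          # repeat of an existing value
            counts[k] += 1
        out.append(atoms[k])
    return np.array(out), np.array(counts)

rng = np.random.default_rng(5)
thetas, counts = crp_draws(1.0, lambda r: r.normal(), 100, rng)
```

The number of unique values grows slowly; its expectation is ∑_{n=0}^{N−1} α/(α + n), roughly α log N.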
The Stick-Breaking (Sethuraman) Construction [27]: Let θ ∼ G and G be constructed as follows:

G = ∑_{i=1}^∞ Vi ∏_{j<i} (1 − Vj) δθi

Vi ∼ Beta(1, α)

θi ∼ G0    (25)

Then G ∼ DP(αG0). Notice that G is constructed explicitly in (25), while only implicitly in (24). We now provide a general overview of how measure theory [4][16] relates to the Dirichlet process.
A. Measure Theory and the Dirichlet Process
In the following discussion, we will focus on the real line, θ ∈ R, with B the Borel σ-field generated by sets A ∈ A of the form A = {θ : θ ∈ (−∞, a]}, where a ∈ Q, the set of rational numbers. This σ-field is generated by A, or B = σ(A), by including in B all complements, unions and intersections of the sets in A. For our purposes, this means that all disjoint sets of the form Bi = {θ : θ ∈ (ai−1, ai]} are included in B, since Bi = Ai ∩ A^c_{i−1}, where A^c = {θ : θ ∉ A} is the complement of A and Ai is the set for which a = ai, with −∞ < ai−1 ≤ ai < ∞.
This is necessary because the Dirichlet process is parameterized by a measure, G0, on the Borel sets
of Θ. To accomplish this, we let the probability density function p(θ) be associated with the probability
measure, G0, such that
G0(Bi) = ∫_{ai−1}^{ai} p(θ) dθ = P(ai) − P(ai−1)    (26)
For example, we could let p(θ) be the N(µ, σ²) density. This is illustrated in Figure 3 for three partitions. We observe that, in this illustration, the measure is simply the area under the curve over the region (ai−1, ai].
In essence, measure theory is introduced because it allows us to partition the real line in any way we
choose, provided a is rational, and lets us refine these partitions to infinitesimally small regions, all in a
rigorous, well-defined manner.
Fig. 3. (a) A partition of R into three sections, ((−∞, a1], (a1, a2], (a2, ∞)). (b) The corresponding probabilities of the three sections, (P(a1), P(a2) − P(a1), 1 − P(a2)). In measure-theoretic notation, this corresponds to the disjoint sets (B1, B2, B3) and their respective measures (G0(B1), G0(B2), G0(B3)).
In Ferguson’s definition of the Dirichlet process [11], a prior measure, G0, on the Borel sets in B is
used along with a strength parameter, α, to parameterize the Dirichlet distribution. This applies to any set of disjoint partitions, of any number, which motivates the generality of the definition. In machine learning applications, we are concerned with the infinite limit, where Θ is partitioned in a specific way that we will discuss later. First, we observe that the function of α in the Dirichlet process is identical to that in Section I and, because G0(Bi) is simply a number (an area under a curve), G0 performs the same
function as g0. The crucial difference is that we have defined G over a measurable space and defined
the prior for G using a probability measure, G0, on that same space. Functionally speaking, G(B1) and
π1 are equivalent in that they are both probabilities corresponding to the first dimension of the Dirichlet
distribution, but G(B1) contains the additional information that the measure is associated with the set
B1. Indeed, to emphasize this similarity, this could be written in the following way:
G(B) = ∑_{i=1}^k πi δBi(B)

π ∼ Dir(αG0(B1), …, αG0(Bk))    (27)

where G(B) is a real-valued set function that takes as its argument B ∈ {B1, …, Bk} and outputs the πi for which δBi(B) = 1, which occurs when B = Bi and is 0 otherwise. The summation, though uninteresting in this case, is to be taken literally. For example, G(B1) = π1 · 1 + ∑_{i=2}^k πi · 0 = π1.
It will be noticed that discussion of the Dirichlet process in the machine learning and signal processing
literature never mentions sets of the form of B, but rather specific values of θ. This is due to taking the
limit as k →∞ and motivates the following discussion of the inverse-cdf method for obtaining random
variables having a density p(θ). Figure 4 contains a plot of ten partitions of the space Θ according to a
function, p(θ), such that the measure of any partition is G0(Bi) = 1/10. This is clearly visible in Figure
4(b). According to this picture, one can see that a partition, Bi, could be selected according to this
measure by first drawing a uniformly distributed random variable, η ∼ U(0, 1), finding the appropriate
cell on the y-axis and inverting to obtain the randomly selected partition along the θ-axis. The partition,
Bi, will have been selected uniformly, but the values, θ ∈ Bi, will span various widths of Θ according to
P(θ). If we define new sets, {C1, …, Ck}, of the form Ci = {η : η ∈ ((i − 1)/k, i/k]}, then Bi is the inverse image of Ci, or Bi = {θ : θ = P−1(η), ∀η ∈ Ci}. Allowing k → ∞, uniformly sampling among sets, Ci, and finding the inverse image converges to sampling θ ∼ G0. In other words, if we sample η ∼ U(0, 1) and calculate θ = P−1(η), then θ has distribution function P(θ), i.e., θ ∼ G0.
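As a concrete illustration of the inverse-cdf method, the sketch below uses the unit exponential distribution, whose cdf inverts analytically; the choice of distribution is ours, not the text's:

```python
import numpy as np

def inverse_cdf_sample(n, rng):
    """Sample theta = P^{-1}(eta), eta ~ U(0,1), for the unit exponential.

    Its cdf P(theta) = 1 - exp(-theta) inverts analytically to
    P^{-1}(eta) = -log(1 - eta).
    """
    eta = rng.uniform(size=n)
    return -np.log1p(-eta)

rng = np.random.default_rng(6)
theta = inverse_cdf_sample(100000, rng)
# theta should have mean 1 and median ln 2, the exponential's values
```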
We mention this because this is implicitly taking place in the Dirichlet process. When we sample
G ∼ DP (αG0), we are assigning to each dimension of Dir(αG0(B1), . . . , αG0(Bk)) a partition of
measure G0(Bi) = 1/k derived according to the distribution function P (θ). In a manner identical to the
finite dimensional case, (8) and (14), where we sample from g0 (i.e. Mult(g0)), we sample uniformly
among partitions Bi according to the measure G0. Though each measure, G0(Bi), is equal, their locations
along Θ are distributed according to the distribution P (θ). Therefore, by letting G0(Bi) = 1/k for
i = 1, . . . , k and allowing k → ∞, we can use the mathematical derivations of Section I explicitly, as
we will discuss in the next section.
Fig. 4. A partition of R into k = 10 disjoint sets, (B1, …, B10) = ((−∞, a1], (a1, a2], …, (a8, a9], (a9, ∞)), of equal measure, G0(Bi) = 1/10, ∀i (or P(ai) − P(ai−1) = 1/10). The uniform partition of P(θ) in (b) can be seen to lead to partitions of θ approximating the density p(θ) in (a). As the partitions are refined, or k → ∞, uniformly sampling among partitions converges to drawing θ ∼ G0.
Though significantly more discussion is needed to make this rigorous (e.g., that it satisfies Kolmogorov’s
consistency condition [11]), the above discussion is essentially what is required for an intuitive under-
standing of the differences between the Dirichlet distribution and the Dirichlet process. Furthermore, this
intuition can be extended to spaces of higher dimension, allowing θ ∼ G0 to indicate the drawing of
multiple variables in a higher dimensional space.
B. Drawing from the Infinite Dirichlet Process
We examine this uniform sampling of Bi more closely and how it directly relates to the Chinese
Restaurant Process (CRP) and Sethuraman construction. First, consider a finite number of partitions, in
which case the CRP can be written as a finite-dimensional Polya urn model (8):

B∗N+1 ∼ ∑_{i=1}^k [ni/(α + N)] δBi + [α/(α + N)] G0(B) = ∑_{i=1}^k [(ni + αG0(Bi))/(α + N)] δBi    (28)
Letting G0(Bi) = 1/k and k →∞ as discussed, we see that αG0(Bi) → 0, thus disappearing from all
components for which ni > 0. However, there are an infinite number of remaining, unused components
with locations distributed according to G0. Notice that we do not need to change G0 to reflect the removal
of θ values, as they are all of measure zero. This means that the weight on G0 is also unchanged and
remains α. As a result, though we cannot explicitly write the remaining, unused components, we can
“lump” them together under G0 with a probability of selecting from G0 proportional to α. Whereas for
the finite-dimensional case there was a nonzero probability that a value selected from G0 would equal
one previously seen, for the continuous case this is a probability zero event. Therefore, by allowing the
number of components to tend to infinity, we obtain a continuum of colors as proven in [5], also known
as the Chinese restaurant process [1].
Identical reasoning is used for the Sethuraman construction. For k → ∞, consider the step

Yi ∼ Mult(G0(B1), …, G0(Bk))    (29)
in light of Figure 4(b), in which case Y ∈ {B1, . . . , Bk}. Sampling partitions from this infinite-dimensional
uniform multinomial distribution is precisely equivalent to drawing θ ∼ G0 for reasons previously
discussed regarding the inverse-cdf method. In this case and that of the CRP, a change of notation
has concealed the fact that the same fundamental underlying process is taking place as that of the finite-
dimensional case – a notable difference being that repeated values are here probability zero events. Using
the notation of Section I-B2, this difference implies for the Sethuraman construction that a one-to-one
correspondence exists between values in p and the desired vector, π. For clarity, we will continue to use
p when discussing the Dirichlet process, leaving π to represent values drawn from the finite-dimensional
Dirichlet distribution. Also, we emphasize that G is a symbol (a set function) that incorporates both p and
θ as a complete measure. Just as Vi was meaningless without the Yi in the finite case, pi is meaningless
without its corresponding location, θi.
Finally, it is hopefully clear why it is sometimes rhetorically asked in the literature how one can go
about drawing from the Dirichlet process. We are attempting to draw an infinite-dimensional pmf that
covers every point of a continuous space. Using the two “finite” methods (Section I), so labeled because
they produced samples π ∼ Dir(αg0) in a finite number of steps, we need an infinite length of time
to obtain π. The two infinite methods are now the only feasible means and each solves this problem
elegantly in its own way. With the Chinese restaurant process, we never need to draw G. More remarkably,
the Sethuraman construction allows one to explicitly draw G by drawing the important components first,
stopping when a desired portion of the unit mass is assigned and implicitly setting the probability of
the infinitely remaining components to zero. We will see later that this prior can be applied to drawing
directly from the Beta process as well, a result we believe to be new.
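A truncated Sethuraman draw of G, per (25), can be sketched as follows; the standard normal base measure is an illustrative choice:

```python
import numpy as np

def dp_stick_breaking(alpha, base_draw, K, rng):
    """Truncated Sethuraman draw of G ~ DP(alpha*G0), per (25).

    Returns K atom locations (drawn from G0 via base_draw) and their
    stick-breaking weights; the unassigned mass prod(1 - V_i) is
    implicitly set to zero, as described in the text.
    """
    v = rng.beta(1.0, alpha, size=K)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * stick
    atoms = np.array([base_draw(rng) for _ in range(K)])
    return atoms, weights

rng = np.random.default_rng(7)
atoms, weights = dp_stick_breaking(1.0, lambda r: r.normal(), 50, rng)
# one draw theta ~ G (up to truncation):
theta = rng.choice(atoms, p=weights / weights.sum())
```

Because the weights decay geometrically in expectation, a truncation level K well above α captures nearly all of the unit mass.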
C. Dirichlet Process Mixture Models
Having established the means to draw G ∼ DP (αG0), we next discuss the use of the values θ ∼ G.
The main reason we replaced X with θ in our discussion of the Dirichlet process is that the symbol
X is generally reserved for the observed data. In Section I, this was appropriate because we were only
interested in observing samples from the discrete π. However, the primary use of the Dirichlet process
in the machine learning community is in nonparametric mixture modeling [2][10], where θ ∼ G is not
the observed data, but rather a parameter (or set of parameters) for some distribution, F (θ), from which
X is drawn. This can be written as follows:

Xj ∼ F(θ∗j)

θ∗j ∼iid G

G ∼ DP(αG0)    (30)
where θ∗j is a specific value selected from G and associated with observation Xj . An example of this
process is the Gaussian mixture model, where θ = {µ,Λ}, the mean and precision matrix for a multivariate
normal distribution, and G0 the normal-Wishart conjugate prior. Whereas samples from the discrete G
will repeat with high probability, samples from F (θ) are again from a continuous distribution. Therefore,
values that share parameters are clustered together in that, though not exactly the same, they exhibit
similar statistical characteristics according to some distribution function, F (θ).
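The generative process (30) can be sketched for a 1-d Gaussian F. As a simplification of the text's normal-Wishart example, the sketch below places G0 only on the component means and fixes the likelihood variance; the parameter values are illustrative:

```python
import numpy as np

def dp_gmm_data(alpha, N, K, rng, tau=3.0, sigma=0.3):
    """Generate data from a truncated DP mixture of 1-d Gaussians, per (30).

    Simplifications: G0 = N(0, tau^2) is a prior on component means only,
    the likelihood variance sigma^2 is fixed, and the DP is truncated at
    K atoms with renormalized weights.
    """
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    w /= w.sum()                          # renormalize truncated weights
    means = rng.normal(0.0, tau, size=K)  # atoms of G, drawn from G0
    z = rng.choice(K, size=N, p=w)        # theta*_j selects an atom of G
    x = rng.normal(means[z], sigma)       # X_j ~ F(theta*_j)
    return x, z, means

rng = np.random.default_rng(8)
x, z, means = dp_gmm_data(alpha=2.0, N=300, K=30, rng=rng)
```

Because samples from the discrete G repeat with high probability, only a handful of the K atoms typically generate all N observations.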
III. INFERENCE FOR THE DIRICHLET PROCESS MIXTURE MODEL
To show the Dirichlet process in action, we outline a general method for performing Markov chain
Monte Carlo (MCMC) [13] inference for Dirichlet process mixture models. We let f(x|θ) be the likelihood
function of an observation, x, given parameters, θ, and p(θ) the prior density of θ, corresponding to the
probability measure G0. We also use an auxiliary variable, c ∈ {1, …, K + 1}, which acts as an indicator of the parameter value, θcj, associated with observation xj. We will discuss the value of K and why we write K + 1 shortly. To clarify further the function of c, we rewrite (30) utilizing this auxiliary variable:
Xj ∼ F(θcj)

cj ∼iid Mult(φK(V))

Vk ∼iid Beta(1, α)

θk ∼iid G0    (31)

where p = φK(V) represents the stick-breaking portion of the Sethuraman construction that has been truncated to length K + 1, with pK+1 = ∏_{i=1}^K (1 − Vi). We use subscript K because the (K + 1)-dimensional vector, p, is a function of K parameters. Therefore, only the first K values of Vk and the
first K + 1 values of θk are used. For clarity, following each iteration only utilized components are retained (as indicated by {cj}_{j=1}^N), and a new proposal component is drawn from the base distribution. See [23] for further discussion. The number of utilized components for any given iteration is represented by the variable K, and we here only propose a (K + 1)st component. Below is this sampling process.
Initialization: Select a truncation level, K + 1, and initialize the model, G, by sampling θk ∼ G0 for
k = 1, . . . ,K + 1 and Vk ∼ Beta(1, α) for k = 1, . . . ,K and construct p = φK(V ). For now, α is set
a priori, but we will discuss inference for this parameter as well.
Step 1: Sample the indicators, c_1, . . . , c_N, independently from their respective conditional posteriors, p(c_j|x_j) ∝ f(x_j|θ_{c_j}) p(θ_{c_j}|G),

c_j ∼ ∑_{k=1}^{K+1} [ p_k f(x_j|θ_k) / ∑_l p_l f(x_j|θ_l) ] δ_k    (32)
Set K to be the number of unique values among c1, . . . , cN and relabel from 1 to K.
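As a concrete illustration, Step 1 can be sketched in NumPy for a generic likelihood. This is a sketch only, not the author's implementation; the helper `loglik(x, theta)` is a hypothetical user-supplied function returning log f(x|θ):

```python
import numpy as np

def sample_indicators(X, p, thetas, loglik, rng):
    """Sample c_1..c_N from their conditional posteriors, Eq. (32).

    X      : array of N observations
    p      : (K+1,) stick-breaking weights
    thetas : list of K+1 component parameters
    loglik : loglik(x, theta) -> log f(x | theta)  (hypothetical, user supplied)
    """
    N, K1 = len(X), len(p)
    c = np.empty(N, dtype=int)
    for j in range(N):
        # unnormalized log posterior over components: log p_k + log f(x_j | theta_k)
        logw = np.array([np.log(p[k]) + loglik(X[j], thetas[k]) for k in range(K1)])
        w = np.exp(logw - logw.max())          # subtract max for numerical stability
        c[j] = rng.choice(K1, p=w / w.sum())   # draw from the normalized posterior
    return c
```

Working in log space before normalizing avoids underflow when the likelihoods are small, which is the usual case for long observation vectors.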
Step 2: Sample θ_1, . . . , θ_K from their respective posteriors conditioned on c_1, . . . , c_N and x_1, . . . , x_N,

θ_k ∼ p(θ_k|{c_j}_{j=1}^{N}, {x_j}_{j=1}^{N})

p(θ_k|{c_j}_{j=1}^{N}, {x_j}_{j=1}^{N}) ∝ [ ∏_{j=1}^{N} f(x_j|θ)^{δ_{c_j}(k)} ] p(θ)    (33)

where δ_{c_j}(k) is a delta function equal to one if c_j = k and zero otherwise, simply picking out which x_j belong to component k. Sample θ_{K+1} ∼ G_0.
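When p(θ) is conjugate to f(x|θ), the posterior in (33) is available in closed form. The sketch below assumes, purely for illustration, a univariate Gaussian likelihood with unit variance and a N(0, τ²) prior on each component mean; these distributional choices are not prescribed by the text:

```python
import numpy as np

def sample_means(X, c, K, tau2, rng):
    """Conjugate update for Eq. (33): Gaussian likelihood (unit variance),
    N(0, tau2) prior on each component mean (illustrative assumption)."""
    thetas = np.empty(K)
    for k in range(K):
        xs = X[c == k]                       # data assigned to component k
        n = len(xs)
        var = 1.0 / (n + 1.0 / tau2)         # posterior variance
        mean = var * xs.sum()                # posterior mean
        thetas[k] = rng.normal(mean, np.sqrt(var))
    return thetas
```

The same pattern applies to any conjugate pair: collect the data picked out by δ_{c_j}(k), update the sufficient statistics, and draw from the resulting posterior.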
Step 3: For the Sethuraman construction, construct the (K + 1)-dimensional weight vector, p = φ_K(V), using V_1, . . . , V_K sampled from their Beta-distributed posteriors conditioned on c_1, . . . , c_N,

V_k ∼ Beta( 1 + ∑_{j=1}^{N} δ_{c_j}(k),  α + ∑_{l=k+1}^{K} ∑_{j=1}^{N} δ_{c_j}(l) )    (34)

Set p_{K+1} = ∏_{k=1}^{K} (1 − V_k). For the Chinese restaurant process, this step instead consists of setting p_k = n_k/(α + N) and p_{K+1} = α/(α + N), as discussed in Section II.
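The Beta posterior draws of (34) and the construction of p = φ_K(V) can be sketched together, with the counts n_k computed from the indicators (a sketch under the truncation described above, assuming zero-indexed components):

```python
import numpy as np

def sample_weights(c, K, alpha, rng):
    """Sample V_1..V_K from Eq. (34) and build the weight vector p = phi_K(V)."""
    counts = np.bincount(c, minlength=K + 1)   # n_k = sum_j delta_{c_j}(k)
    p = np.empty(K + 1)
    remaining = 1.0                            # length of stick not yet broken off
    for k in range(K):
        n_k = counts[k]
        n_gt = counts[k + 1:].sum()            # observations in later components
        V_k = rng.beta(1.0 + n_k, alpha + n_gt)
        p[k] = V_k * remaining                 # stick-breaking weight
        remaining *= 1.0 - V_k
    p[K] = remaining                           # p_{K+1} = prod_k (1 - V_k)
    return p
```

By construction the K + 1 weights sum to one, since the final entry absorbs whatever length of the stick remains.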
Repeat Steps 1 – 3 for a desired number of iterations. The convergence of this Markov chain can be assessed [13], after which point uncorrelated samples (properly spaced out in the chain) of the values in Steps 1 – 3 are considered iid samples from the posterior, p({θ_k}_{k=1}^{K}, {V_k}_{k=1}^{K}, {c_j}_{j=1}^{N}, K | {x_j}_{j=1}^{N}). Additional inference for α can be done for the CRP using a method detailed in [10] and for the Sethuraman construction using a conjugate gamma prior.
Step 4: Sample α from its posterior gamma distribution conditioned on V_1, . . . , V_K,

α ∼ Ga( a_0 + K,  b_0 − ∑_{k=1}^{K} ln(1 − V_k) )    (35)

where a_0, b_0 are the prior hyperparameters for the gamma distribution.
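With a rate parameterization of Ga(a_0, b_0), the update in (35) is a one-liner; note that NumPy's `gamma` takes a scale parameter, so we pass the reciprocal of the rate (a sketch; the hyperparameter values are the user's choice):

```python
import numpy as np

def sample_alpha(V, a0, b0, rng):
    """Sample alpha ~ Ga(a0 + K, b0 - sum_k ln(1 - V_k)), Eq. (35)."""
    K = len(V)
    rate = b0 - np.log1p(-np.asarray(V)).sum()   # log1p(-V) computes ln(1 - V) stably
    return rng.gamma(shape=a0 + K, scale=1.0 / rate)
```

Since each V_k lies in (0, 1), the summed −ln(1 − V_k) terms are positive, so the posterior rate always exceeds b_0.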
As can be seen, inference for the Dirichlet process is fairly straightforward and, when p(θ) is conjugate
to f(x|θ), fully analytical. Other MCMC methods exist for performing this inference [20], as does a fast variational inference method [7][6] that, following initialization, deterministically converges to a locally optimal approximation to the full posterior distribution.
IV. EXTENSIONS OF THE DIRICHLET PROCESS
With advances in computational resources and MCMC (and later, VB) inference methods, the Dirichlet process has had a significant role in Bayesian hierarchical modeling. This can be seen in the
many developments of the original idea. To illustrate, we briefly review two of these developments
common to the machine learning literature: the hierarchical Dirichlet process (HDP) [28] and the nested
Dirichlet process (nDP) [26]. We then present our own extension that, to our knowledge, is new to the
machine learning and signal processing communities. This extension, called the Dirichlet process with
product base measure (DP-PBM), extends the Dirichlet process to the modeling of multiple modalities,
m, that each call for unique distribution functions, Fm(θm).
A. The Hierarchical Dirichlet Process [28]
In our discussion of the Dirichlet distribution in Section I, we defined α to be a positive scalar and g0
to be a D-dimensional pmf, with the new pmf π ∼ Dir(αg0). Though our discussion assumed g0 was
uniform, in general there are no restrictions on g0, so long as it is in ∆D. This leads to the following
possibility
π′ ∼ Dir(βπ),  π ∼ Dir(αg_0)    (36)

where β is a value having the same restrictions as α. In essence, this is the hierarchical Dirichlet process (HDP). A measure, G′_i, is drawn from an HDP, denoted G′_i ∼ HDP(α, β, G_0), as follows:

G′_i ∼ DP(βG),  G ∼ DP(αG_0)    (37)
Using the measure theory outlined in Section II-A, this process can be understood as the infinite-
dimensional analogue to (36) extended to a measurable space (Θ,B). We note that, in the case where G
is constructed according to a stick-breaking process and truncated, G′_i is a finite-dimensional Dirichlet process for which the prior expectation, p, and locations, θ, are determined for G′_i by G. A key benefit of the HDP is that draws of G′_1, . . . , G′_n ∼ DP(βG) will utilize the same subset of atoms, since G is a discrete measure with probability one. Therefore, this method is useful for multitask learning [8][21], where atoms might be expected to be shared across tasks, but with a different probability of usage for each task.
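The finite-dimensional hierarchy of (36) is easy to simulate, which builds intuition for how β controls how closely each task pmf π′ tracks the shared π. The sketch below uses a uniform g_0 and arbitrary dimension and concentration values, all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, alpha, beta, n_tasks = 10, 1.0, 50.0, 3

g0 = np.full(D, 1.0 / D)              # uniform base pmf on the (D-1)-simplex
pi = rng.dirichlet(alpha * g0)        # shared top-level pmf, pi ~ Dir(alpha g0)
pis = [rng.dirichlet(beta * pi)       # task-level pmfs, pi' ~ Dir(beta pi)
       for _ in range(n_tasks)]
```

Every π′ places its mass on the same D atoms, the finite analogue of the HDP's shared-atom property; as β grows, each task's weights concentrate around the shared π.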
B. The Nested Dirichlet Process [26]
In our discussion of the Dirichlet process, we assumed G0 to be a continuous distribution, though
this isn’t necessary, as we’ve seen with the HDP. Consider the case where G0 is a Dirichlet distribution.
Extending this base Dirichlet to its own measurable space, the base distribution becomes a Dirichlet
process as well. This is called the nested Dirichlet process (nDP). A way to think of a draw G ∼
nDP (α, β, G0) is as a mixture model of mixture models. This can be seen in the Sethuraman construction,
G = ∑_{j=1}^{∞} V_j ∏_{k<j} (1 − V_k) δ_{G_j}

V_j iid∼ Beta(1, α)

G_j iid∼ DP(βG_0)    (38)
Now, rather than drawing θ∗i ∼ G, an entire mixture, G∗i ∼ G, is drawn, from which is then drawn an
atom, θ∗i ∼ G∗i . As with the HDP, this prior can be useful in multitask learning, where a task, t, selects
an indicator from the “top-level” DP, G, and each observation within that task utilizes atoms sampled
from a “second-level” DP, G∗t . In the case of multitask learning with the HDP, all tasks share the same
atoms, but with different mixing weights. The nDP for multitask learning is “all-or-nothing;” two tasks
either share a mixture model exactly, or nothing at all.
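A truncated draw from the nDP can be sketched by stick-breaking twice: once over mixture models, and once within each. The truncation levels and the standard-normal base measure G_0 below are illustrative assumptions, not part of the original construction:

```python
import numpy as np

def stick_weights(V):
    """Convert stick-breaking fractions V into a normalized weight vector."""
    w = np.empty(len(V) + 1)
    w[:-1] = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    w[-1] = np.prod(1.0 - V)           # leftover stick mass
    return w

rng = np.random.default_rng(0)
alpha, beta, T, S = 1.0, 1.0, 5, 5     # truncation: T top-level sticks, S per mixture

top_w = stick_weights(rng.beta(1.0, alpha, size=T))      # weights over mixtures G_j
mixtures = [(stick_weights(rng.beta(1.0, beta, size=S)), # each G_j ~ DP(beta G0):
             rng.standard_normal(S + 1))                 # weights plus atoms from G0
            for _ in range(T + 1)]

# For observation i: pick a mixture G*_i ~ G, then an atom theta*_i ~ G*_i.
j = rng.choice(T + 1, p=top_w)
w_j, atoms_j = mixtures[j]
theta_star = atoms_j[rng.choice(S + 1, p=w_j)]
```

The two-stage draw makes the "mixture model of mixture models" reading concrete: the top-level indicator selects an entire second-level mixture, and only then is an atom drawn.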
C. The Dirichlet Process with a Product Base Measure
We here propose another extension of the Dirichlet process that incorporates a product base measure, de-
noted DP-PBM, where rather than drawing parameters for one parametric distribution, θ ∼ G0, parameters
are drawn for multiple distributions, θm ∼ G0,m for m = 1, . . . ,M . In other words, rather than having a
connection between data, {X_i}_{i=1}^{N}, and their respective parameters, {θ*_i}_{i=1}^{N}, through a parametric distribution, {F(θ*_i)}_{i=1}^{N}, sets of data, {X_{1,i}, . . . , X_{M,i}}_{i=1}^{N}, have respective sets of parameters, {θ*_{1,i}, . . . , θ*_{M,i}}_{i=1}^{N},
used in inherently different and incompatible distribution functions, {F_1(θ*_{1,i}), . . . , F_M(θ*_{M,i})}. The DP-PBM is so called because it utilizes a product base measure to achieve this end, G_0 = G_{0,1} × G_{0,2} × · · · × G_{0,M}, where, in this case, M modalities are considered. The space over which this process is defined is now ( ∏_{m=1}^{M} Θ_m, ⊗_{m=1}^{M} B_m, ∏_{m=1}^{M} G_{0,m} ). Though this construction implicitly takes place in all mixture models that attempt to estimate multiple parameters, for example the multivariate Gaussian mixture model, we believe our use of these parameters in multiple, incompatible distributions (or modalities) is novel.
The fully generative process can be written as follows:
X_{m,i} ∼ F_m(θ*_{m,i}),  m = 1, . . . , M

{θ*_{m,i}}_{m=1}^{M} ∼ G

G = ∑_{j=1}^{∞} V_j ∏_{k<j} (1 − V_k) ∏_{m=1}^{M} δ_{θ_{m,j}}

V_j iid∼ Beta(1, α)

θ_{m,j} ∼ G_{0,m},  m = 1, . . . , M    (39)
where θm,j are drawn iid from G0,m for a fixed m and independently under varying m. To make a
clarifying observation, we note that if each G0,m is a univariate normal-gamma prior, this model reduces
to a multivariate GMM with a forced diagonal covariance matrix. As previously stated, we are more
interested in cases where each Xm is inherently incompatible, but is still linked by the structure of the
data set.
For example, consider a set of observations, {Oi}Ni=1, where each Oi = {X1,i, X2,i} with X1 ∈ Rd
and X2 a sequence of time-series data. In this case, a single density function, f(X1, X2|θ1, θ2) cannot
analytically accommodate Oi, making inference difficult. However, if these densities can be considered
as independent, that is f(X1, X2|θ1, θ2) = f(X1|θ1) · f(X2|θ2), then this problem becomes analytically
tractable and, furthermore, no more difficult to solve than for the standard Dirichlet process. One might
choose to model X_1 with a Gaussian distribution, with G_{0,1} the appropriate prior, and X_2 by an HMM [25], with G_{0,2} its respective prior [22]. In this case, the model becomes a hybrid Gaussian-HMM mixture, where each component is both a Gaussian and a hidden Markov model. We develop a general
MCMC inference method below to allow us to look more deeply at the structure of our framework. We
also observe that this idea can be easily extended to the nDP and HDP frameworks.
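The generative process in (39), truncated for simulation, can be sketched as follows. Here modality 1 is a Gaussian mean and modality 2 a Poisson rate, standing in for whatever incompatible F_1, F_2 a problem calls for; all distributional specifics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K, N = 1.0, 20, 100                     # truncation level K and sample size

# Stick-breaking weights p = phi_K(V); final entry is the leftover stick.
V = rng.beta(1.0, alpha, size=K)
p = np.append(V, 1.0) * np.concatenate(([1.0], np.cumprod(1.0 - V)))

# Product base measure: one parameter per modality, drawn independently.
theta1 = rng.normal(0.0, 3.0, size=K + 1)      # theta_{1,j} ~ G_{0,1} (Gaussian mean)
theta2 = rng.gamma(2.0, 2.0, size=K + 1)       # theta_{2,j} ~ G_{0,2} (Poisson rate)

c = rng.choice(K + 1, size=N, p=p)             # one shared indicator per observation
X1 = rng.normal(theta1[c], 1.0)                # X_{1,i} ~ F_1(theta_{1,c_i})
X2 = rng.poisson(theta2[c])                    # X_{2,i} ~ F_2(theta_{2,c_i})
```

Because the indicator c_i is shared across modalities, the two incompatible data types are clustered jointly, which is the point of the product base measure.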
D. Inference for the DP-PBM Mixture Model
MCMC inference for the DP-PBM mixture model entails a few simple alterations to the sampling
scheme outlined in Section III. These changes alone are given below:
Initialization: For each j = 1, . . . ,K + 1, sample θm,j ∼ G0,m for m = 1, . . . ,M , where M is the
number of “modalities,” or distinct subsets of data for each observation.
Step 1: Sample the indicators, c_1, . . . , c_N, independently from their conditional posteriors,

c_j ∼ ∑_{k=1}^{K+1} [ p_k ∏_{m=1}^{M} f_m(x_{m,j}|θ_{m,k}) / ∑_l p_l ∏_{m=1}^{M} f_m(x_{m,j}|θ_{m,l}) ] δ_k    (40)
We note that the likelihood term has been factorized into a product of the likelihood for each modality.
Step 2: Sample {θ_{m,1}}_{m=1}^{M}, . . . , {θ_{m,K}}_{m=1}^{M} from their conditional posteriors given c_1, . . . , c_N and {x_{m,1}}_{m=1}^{M}, . . . , {x_{m,N}}_{m=1}^{M},

θ_{m,k} ∼ p(θ_{m,k}|{c_j}_{j=1}^{N}, x_{m,1}, . . . , x_{m,N})

p(θ_{m,k}|{c_j}_{j=1}^{N}, x_{m,1}, . . . , x_{m,N}) ∝ [ ∏_{j=1}^{N} f_m(x_{m,j}|θ_m)^{δ_{c_j}(k)} ] p(θ_m)    (41)

Sample θ_{m,K+1} ∼ G_{0,m} for m = 1, . . . , M. The essential difference here is that each component,
k = 1, . . . ,K now has M posteriors to calculate. These M posteriors are calculated independently
of one another given the relevant data for that modality extracted from the observations assigned to that
component. We stress that when an “observation” is assigned to a component (via the indicator, c) it is
actually all of the data that comprise that observation that is being assigned to the component.
E. Experimental Result with Synthesized Data
To further illustrate our model, we implement it on a toy problem of two modalities. For this problem,
each observation is of the form, O = {X1, X2}, where X1 ∼ N (µ,Σ) and X2 ∼ HMM(A,B, π). We
We define three two-dimensional Gaussian distributions with respective means µ_1 = (−3, 0), µ_2 = (3, 0) and µ_3 = (0, 5), each having the identity as its covariance matrix. Two hidden Markov models are defined as below,

A_1 = [0.05 0.90 0.05; 0.05 0.05 0.90; 0.90 0.05 0.05],  A_2 = [0.90 0.05 0.05; 0.05 0.90 0.05; 0.05 0.05 0.90],  B_1 = B_2 = [0.90 0.05 0.05; 0.05 0.90 0.05; 0.05 0.05 0.90]
with the initial state vectors π_1 = π_2 = [1/3, 1/3, 1/3]. Data was generated as follows: we sampled 300 observations, 100 from each Gaussian, constituting X_{1,i} for i = 1, . . . , 300. For each sample, if the observation was on the right half of its respective Gaussian, a sequence of length 50 was drawn from HMM 1; if on the left, from HMM 2. We performed fast variational inference [6], truncating to 50 components, to obtain results for illustration. This precisely defined data set allows the model to clearly display the purpose of its design. If one were to build a Gaussian mixture model on the X_1 data alone, three components would be uncovered, as the posterior expectation shows in Figure 5(a). If an HMM mixture were built on the X_2 data alone, only two components would be uncovered. Using all of the data, that is, mixing on {O_i}_{i=1}^{300} rather than just {X_{1,i}}_{i=1}^{300} or {X_{2,i}}_{i=1}^{300} alone, the correct number of six components was uncovered, as shown in Figure 5(b).
Fig. 5. An example of a mixed Gaussian-HMM data set, where observations on the right side of each Gaussian have an associated sequence from HMM 1 and those on the left side from HMM 2. (a) Gaussian mixture model results without using sequential information. (b) Gaussian-HMM mixture results using both spatial and sequential information. Each ellipse corresponds to a cluster. Because there are six Gaussian-HMM combinations, there are six clusters.
The results show that, as was required by the data, the Dirichlet process prior uncovered six distinct
groups of data. In other words, though no two observations were exactly the same, it was inferred that
six distinct sets of parameters were required to properly characterize the statistical properties of the data.
We recall that we initialized to 50 components; upon convergence, 44 of them contained no data. The DP-PBM framework therefore allows all of the information in a data set to be incorporated, providing more precise clustering results.
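Data like the toy set above can be generated with a short script. The Gaussian means and HMM matrices are those given in the text; the minimal HMM sampler and uniform initial state are straightforward but are our own sketch, not the original experiment code:

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
A1 = np.array([[.05, .90, .05], [.05, .05, .90], [.90, .05, .05]])
A2 = np.array([[.90, .05, .05], [.05, .90, .05], [.05, .05, .90]])
B  = np.array([[.90, .05, .05], [.05, .90, .05], [.05, .05, .90]])

def hmm_sequence(A, length, rng):
    """Draw an observation sequence from the HMM (uniform initial state)."""
    s = rng.integers(3)
    obs = np.empty(length, dtype=int)
    for t in range(length):
        obs[t] = rng.choice(3, p=B[s])     # emit a symbol from row s of B
        s = rng.choice(3, p=A[s])          # transition to the next hidden state
    return obs

O = []
for mu in mus:
    for _ in range(100):
        x1 = rng.normal(mu, 1.0)           # X1 ~ N(mu, I)
        A = A1 if x1[0] > mu[0] else A2    # right half -> HMM 1, left half -> HMM 2
        O.append((x1, hmm_sequence(A, 50, rng)))
```

The resulting 300 observations carry six distinct Gaussian-HMM pairings, which is exactly the structure the DP-PBM mixture is designed to recover.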
REFERENCES
[1] D. Aldous (1985). Exchangeability and related topics. École d'été de probabilités de Saint-Flour XIII-1983, pp. 1-198. Springer, Berlin.
[2] C.E. Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of
Statistics, 2:1152-1174.
[3] D. Basu (1955). On statistics independent of a complete sufficient statistic. Sankhya: The Indian Journal of Statistics,
15:377-380.
[4] P. Billingsley (1995). Probability and Measure, 3rd edition. Wiley Press, New York.
[5] D. Blackwell and J.B. MacQueen (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1:353-355.
[6] M.J. Beal (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.
[7] D. Blei and M.I. Jordan (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-144.
[8] R. Caruana (1997). Multitask learning. Machine Learning, 28:41-75.
[9] R.J. Connor and J.E. Mosimann (1969). Concepts of independence for proportions with a generalization of the Dirichlet
distribution. Journal of the American Statistical Association, 64:194-206.
[10] M.D. Escobar and M. West (1995). Bayesian density estimation and inference using mixtures. Journal of the American
Statistical Association. 90(430):577-588.
[11] T. Ferguson (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230.
[12] B. de Finetti (1937). La prevision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincare, 7:1-68.
[13] D. Gamerman and H.F. Lopes (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second
Edition, Chapman & Hall.
[14] P. Halmos (1944). Random Alms. The Annals of Mathematical Statistics, 15:182-189.
[15] E. Hewitt and L.J. Savage (1955). Symmetric measures on Cartesian products. Transactions of the American Mathematical Society, 80:470-501.
[16] J. Jacod and P. Protter (2004). Probability Essentials, 2nd ed. Springer.
[17] N. Johnson and S. Kotz (1977). Urn Models and Their Applications. Wiley Series in Probability and Mathematical Statistics.
[18] J.F.C. Kingman (1978). Uses of exchangeability. Annals of Probability, 6(2):183-197.
[19] E. Kreyszig (1999). Advanced Engineering Mathematics, 8th edition. John Wiley & Sons, Inc., New York.
[20] R.M. Neal (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and
Graphical Statistics, 9(2):249-265.
[21] K. Ni, J. Paisley, L. Carin and D. Dunson (2008). Multi-task learning for analyzing and sorting large databases of sequential data. IEEE Trans. on Signal Processing, to appear.
[22] J. Paisley and L. Carin (2008). Hidden Markov Models with Stick Breaking Priors. Submitted to IEEE Trans. on Signal Processing.
[23] O. Papaspiliopoulos and G.O. Roberts (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process
hierarchical models. Biometrika, 95(1):169-186.
[24] J. Pitman (1996). Some developments of the Blackwell-MacQueen urn scheme. Statistics, Probability and Game Theory.
Papers in honor of David Blackwell, T. S. Ferguson, L. S. Shapley and J. B. MacQueen, eds. 245-267.
[25] L.R. Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.
[26] A. Rodriguez, D.B. Dunson and A.E. Gelfand (2008). The nested Dirichlet process. Journal of the American Statistical Association, to appear.
[27] J. Sethuraman (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.
[28] Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei (2006). Hierarchical Dirichlet processes. Journal of the American Statistical
Association, 101(476):1566-1581.
[29] Y. Xue, X. Liao, L. Carin and B. Krishnapuram (2007). Multi-task learning for classification with Dirichlet process priors.
Journal of Machine Learning Research, 8:35-63.
[30] M. West, P. Muller and M.D. Escobar (1994). Hierarchical priors and mixture models, with applications in regression and
density estimation. Aspects of Uncertainty: A Tribute to D.V. Lindley, pp 363-386.
[31] S.S. Wilks (1962). Mathematical Statistics. John Wiley, New York.