
A Tutorial on the Dirichlet Process for Engineers

Technical Report

John Paisley

Department of Electrical & Computer Engineering

Duke University, Durham, NC

[email protected]

Abstract

This document provides a review of the Dirichlet process originally given in the author’s preliminary

exam paper and is presented here as a tutorial. No motivation is given (in what I’ve excerpted here), and

this document is intended to be a mathematical tutorial that is still accessible to the engineer.

I. THE DIRICHLET DISTRIBUTION

Consider the finite, D-dimensional vector, π, having the properties 0 ≤ π_i ≤ 1 for i = 1, ..., D and ∑_{i=1}^{D} π_i = 1 (i.e., residing on the (D − 1)-dimensional simplex in R^D, or π ∈ ∆_D). We view this

vector as the parameter for the multinomial distribution, where samples, X ∼ Mult(π), take values

X ∈ {1, . . . , D} with probability P (X = i|π) = πi. When the vector π is unknown, it can be inferred

in the Bayesian setting using its conjugate prior, the Dirichlet distribution.

The Dirichlet distribution of dimension D is a continuous probability measure on ∆D having the

density function

p(\pi|\beta_1,\ldots,\beta_D) = \frac{\Gamma\left(\sum_i \beta_i\right)}{\prod_i \Gamma(\beta_i)} \prod_{i=1}^{D} \pi_i^{\beta_i - 1}    (1)

where the parameters βi ≥ 0, ∀i. It will be useful to reparameterize this distribution as follows:

p(\pi|\alpha g_{01},\ldots,\alpha g_{0D}) = \frac{\Gamma(\alpha)}{\prod_i \Gamma(\alpha g_{0i})} \prod_{i=1}^{D} \pi_i^{\alpha g_{0i} - 1}    (2)

where we have defined α ≡ ∑_i β_i and g_{0i} ≡ β_i / ∑_i β_i. We refer to this distribution as Dir(αg_0). With this parameterization, the mean and variance of an element of π ∼ Dir(αg_0) are

E[\pi_i] = g_{0i}, \qquad V[\pi_i] = \frac{g_{0i}(1-g_{0i})}{\alpha+1}    (3)


Fig. 1. 10,000 samples from a 3-dimensional Dirichlet distribution with g_0 uniform and (a) α = 1, (b) α = 3, (c) α = 10. As can be seen, when (a) α < 3, the samples concentrate near the vertices and edges of ∆_3. When (b) α = 3, the density is uniform, and when (c) α > 3, the density begins to center on g_0.

so g0 functions as a prior guess at π and α as a strength parameter, controlling how tight the distribution

is around g0. In the case where g0i = 0, it follows that P (πi = 0) = 1 and when g0j = 1 and g0i = 0,

∀i ≠ j, it follows that P(π_j = 1) = 1.

Figure 1 contains plots of 10,000 samples each from a 3-dimensional Dirichlet distribution with g0

uniform and α = 1, 3, 10, respectively. When α = D, or the dimensionality of the Dirichlet distribution,

we see that the density is uniform on the simplex; when α > D, the density begins to cluster around g0.

Perhaps more interesting, and more relevant to the Dirichlet process, is when α < D. We see that as α

becomes less than the dimensionality of the distribution, most of the density lies on the corners and faces

of the simplex. In general, as the ratio of α to D shrinks, draws of π will be sparse, where most of the

probability mass will be contained in a subset of the elements of π. This phenomenon will be discussed

in greater detail later and is a crucial element of the Dirichlet process. Values of π can be drawn from

Dir(αg0) in a finite number of steps using the following two methods (two infinite-step methods will

be discussed shortly).

1) A Function of Gamma Random Variables [31]: Gamma random variables can be used to sample

values from Dir(αg0) as follows: Let Zi ∼ Ga(αg0i, λ), i = 1, . . . , D, where αg0i is the shape parameter

and λ the scale parameter of the gamma distribution. Then the vector π = (Z_1/∑_i Z_i, ..., Z_D/∑_i Z_i) has a

Dir(αg0) distribution. The parameter λ can be set to any positive, real value, but must remain constant.

That is to say, the vector π is free of λ.
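This construction is easy to try numerically. Below is a minimal Python/NumPy sketch (the report itself contains no code; the function name is my own) that draws π ∼ Dir(αg_0) by normalizing gamma random variables:

```python
import numpy as np

def dirichlet_via_gammas(alpha, g0, rng=np.random.default_rng()):
    """Draw pi ~ Dir(alpha * g0) by normalizing independent gamma random variables."""
    # Z_i ~ Ga(alpha * g0_i, lambda); lambda is any fixed positive scale (here 1.0),
    # since the normalized vector pi is free of lambda.
    Z = rng.gamma(shape=alpha * np.asarray(g0), scale=1.0)
    return Z / Z.sum()

# Example: a 3-dimensional draw with uniform prior guess g0 and alpha = 3.
pi = dirichlet_via_gammas(alpha=3.0, g0=[1/3, 1/3, 1/3])
```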


2) A Function of Beta Random Variables [9]: A stick-breaking approach can also be employed to

draw from Dir(αg0). For i = 1, . . . , D − 1 and V0 ≡ 0, let

V_i \sim \mathrm{Beta}\left(\alpha g_{0i},\ \alpha \sum_{j=i+1}^{D} g_{0j}\right)

\pi_i = V_i \prod_{j<i} (1 - V_j)

\pi_D = 1 - \sum_{j=1}^{D-1} \pi_j    (4)

then the resulting vector, π, has a Dir(αg0) distribution. This approach derives its name from an analogy

to breaking a proportion, V_i, from the remainder of a unit-length stick, ∏_{j<i}(1 − V_j). Though called a

“stick-breaking” approach, this method is distinct from that of Sethuraman [27], which will be discussed

later and referred to as the “Sethuraman construction” to avoid confusion.
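For comparison with the gamma construction above, the following Python/NumPy sketch (again my own illustration, not the report's code) implements the finite stick-breaking draw in (4):

```python
import numpy as np

def dirichlet_via_stick_breaking(alpha, g0, rng=np.random.default_rng()):
    """Draw pi ~ Dir(alpha * g0) via the finite stick-breaking scheme of (4)."""
    g0 = np.asarray(g0, dtype=float)
    D = len(g0)
    pi = np.zeros(D)
    remainder = 1.0                      # the unbroken stick, prod_{j<i} (1 - V_j)
    for i in range(D - 1):
        V = rng.beta(alpha * g0[i], alpha * g0[i + 1:].sum())
        pi[i] = V * remainder
        remainder *= (1.0 - V)
    pi[D - 1] = 1.0 - pi[:D - 1].sum()   # the last element takes whatever is left
    return pi
```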

A. Calculating the Posterior of π

As indicated above, the Dirichlet distribution is conjugate to the multinomial, meaning that the posterior

can be solved analytically and is also a Dirichlet distribution. Using Bayes theorem,

p(\pi|X=i) = \frac{p(X=i|\pi)\,p(\pi)}{p(X=i)} = \frac{p(X=i|\pi)\,p(\pi)}{\int_{\pi\in\Delta_D} p(X=i|\pi)\,p(\pi)\,d\pi} \propto p(X=i|\pi)\,p(\pi)

we can calculate the posterior distribution of π given data, X . For a single datum, the likelihood p(X =

i|π) = πi, so the posterior can be seen to be Dirichlet as well

p(\pi|X=i) \propto \pi_i^{(\alpha g_{0i}+1)-1} \prod_{j\neq i} \pi_j^{\alpha g_{0j}-1}    (5)

which, when normalized, equals Dir(αg0 + ei), where we define ei to be a D-dimensional vector of

zeros with a one in the ith position. We see that the ith parameter of the Dirichlet distribution has simply

been incremented by one. For N observations, this extends to

p(\pi|X_1=x_1,\ldots,X_N=x_N) \propto \prod_{i=1}^{D} \pi_i^{\alpha g_{0i}+n_i-1}    (6)

where n_i = ∑_{j=1}^{N} e_{X_j}(i), or the number of observations taking value i, and N = ∑_{i=1}^{D} n_i. When normalized, the posterior equals Dir(αg_{01} + n_1, ..., αg_{0D} + n_D). Therefore, when used as a conjugate prior to the multinomial, the posterior distribution of π is Dirichlet where the parameters have been updated with the "counts" from the observed data. The interplay between the prior and the data can be seen in the posterior expectation of an element, π_i,

E[\pi_i|X_1=x_1,\ldots,X_N=x_N] = \frac{n_i+\alpha g_{0i}}{\alpha+N} = \frac{n_i}{\alpha+N} + \frac{\alpha g_{0i}}{\alpha+N}    (7)


where the second expression clearly shows the tradeoff between the prior and the data. We see that,

as N → ∞, the posterior expectation converges to a point mass located at (n_1/N, ..., n_D/N), the empirical

distribution of the observations. We can also see more clearly the effect of α, which acts as a “prior

count,” dictating the transition from prior to data. As a Bayes estimator, this parameter can be viewed

as controlling how quickly we are willing to trust the empirical distribution of the data over our prior

guess, while making this transition smoothly. Moreover, as N → ∞, V[πi] → 0, ∀i, meaning samples

of π ∼ Dir(αg01 + n1, . . . , αg0D + nD) will equal this expectation with probability 1, or the Dirichlet

distribution converges to a delta function at the expectation.
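The conjugate update is one line of arithmetic; here is a small Python/NumPy sketch of the posterior parameters and the posterior mean in (7) (an illustration of my own, using 0-based category labels):

```python
import numpy as np

def dirichlet_posterior(alpha, g0, observations, D):
    """Posterior Dir(alpha*g0 + n) for multinomial observations taking values in {0,...,D-1}."""
    counts = np.bincount(observations, minlength=D)        # n_i
    post_params = alpha * np.asarray(g0) + counts          # alpha*g0_i + n_i
    post_mean = post_params / (alpha + len(observations))  # equation (7)
    return post_params, post_mean

# Example: D = 3, uniform prior guess, alpha = 3, ten observations.
params, mean = dirichlet_posterior(3.0, [1/3, 1/3, 1/3],
                                   observations=[0, 0, 1, 2, 2, 2, 1, 0, 2, 2], D=3)
```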

B. Two Infinite Sampling Methods for the Dirichlet Distribution

Two other methods requiring an infinite number of random variables, the Polya urn scheme and

Sethuraman’s constructive definition, also exist for sampling π ∼ Dir(αg0). These methods, though

of little practical value for the finite Dirichlet distribution, become very useful for sampling from the

infinite Dirichlet process. Because they are arguably easier to understand in the finite setting, they are

given here, which will allow for a more intuitive extension to the infinite-dimensional case.

1) The Polya Urn Scheme [17]: The Polya urn scheme is a sequential sampling process accompanied

by the following story: Consider an urn initially containing α balls that can take one of D colors, with

αg01 balls of the first color, αg02 of the second, etc. A person randomly selects a ball, X , from this urn,

replaces the ball, and then adds a ball of the same color. After the first draw, the second ball is therefore

drawn from the distribution

p(X_2|X_1=x_1) = \frac{1}{\alpha+1}\,\delta_{x_1} + \frac{\alpha}{\alpha+1}\, g_0

and, inductively,

p(X_{N+1}|X_1=x_1,\ldots,X_N=x_N) = \sum_{i=1}^{D} \frac{n_i}{\alpha+N}\,\delta_i + \frac{\alpha}{\alpha+N}\, g_0    (8)

where δi is a delta function at the color having index i. If this seems reminiscent of the posterior

expectation of π given in (7), that's because it is. Using the fact that \int_{\pi\in\Delta_D} p(X|\pi)\,p(\pi)\,d\pi = p(X), we can write that

p(X_{N+1}|X_1=x_1,\ldots,X_N=x_N) = \int_{\pi\in\Delta_D} p(X_{N+1}|\pi)\, p(\pi|X_1=x_1,\ldots,X_N=x_N)\,d\pi    (9)

where, conditioned on π, XN+1 is independent of X1, . . . , XN . We observe that p(XN+1 = i|π) = πi,

and so

p(XN+1|X1 = x1, . . . , XN = xN ) = E[π|X1 = x1, . . . , XN = xN ] (10)


which is given for one element in (7). In this process, we are integrating out, or marginalizing, the random

pmf, π, and are therefore said to be drawing from a marginalized Dirichlet distribution. By the law of

large numbers [31], as the number of samples N → ∞, the empirical distribution of the urn converges

to a discrete distribution, π. That the distribution of π is Dir(αg0) can be seen using the theory of

exchangeability [1].
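A short simulation of the urn makes the convergence of the empirical distribution easy to see; here is a Python/NumPy sketch (my own illustration) of the sequential draw in (8):

```python
import numpy as np

def polya_urn(alpha, g0, N, rng=np.random.default_rng()):
    """Sequentially draw X_1,...,X_N from the marginalized Dirichlet distribution (8)."""
    D = len(g0)
    counts = np.zeros(D)
    draws = []
    for n in range(N):
        # Mixture of the empirical counts and the prior guess g0.
        probs = (counts + alpha * np.asarray(g0)) / (alpha + n)
        x = rng.choice(D, p=probs)
        counts[x] += 1
        draws.append(x)
    return draws, counts / N   # the empirical distribution approximates pi ~ Dir(alpha*g0)
```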

A sequence of random variables is said to be exchangeable if, for any permutation, ρ, of the integers

{1, . . . , N}, p(X1, . . . , XN ) = p(Xρ1 , . . . , XρN). By selecting the appropriate values from (8) and using

the chain rule, p(X_1, ..., X_N) = ∏_{i=1}^{N} p(X_i|X_{j<i}), we see that

p(X_1=x_1,\ldots,X_N=x_N) = \frac{\prod_{i=1}^{D}\prod_{j=1}^{n_i}(\alpha g_{0i}+j-1)}{\prod_{j=1}^{N}(\alpha+j-1)}    (11)

Because this likelihood is unchanged for all permutations, ρ, the sequence (X1, . . . , XN ) is exchangeable.

As a result of this exchangeability, de Finetti’s theorem [12] states that there exists a discrete pmf, π,

having the yet-to-be-determined distribution, p(π), conditioned on which the observations, (X1, . . . , XN ),

are independent as N →∞. The sequence of equalities follows:

p(X_1=x_1,\ldots,X_N=x_N) = \int_{\pi\in\Delta_D} p(X_1=x_1,\ldots,X_N=x_N|\pi)\,p(\pi)\,d\pi

= \int_{\pi\in\Delta_D} \prod_{j=1}^{N} p(X_j=x_j|\pi)\,p(\pi)\,d\pi

= \int_{\pi\in\Delta_D} \prod_{i=1}^{D} \pi_i^{n_i}\,p(\pi)\,d\pi

= E_p\!\left[\prod_{i=1}^{D}\pi_i^{n_i}\right]    (12)

Because a distribution is uniquely defined by its moments, the only distribution, p(π), having moments

E_p\!\left[\prod_{i=1}^{D}\pi_i^{n_i}\right] = \frac{\prod_{i=1}^{D}\prod_{j=1}^{n_i}(\alpha g_{0i}+j-1)}{\prod_{j=1}^{N}(\alpha+j-1)}    (13)

is the Dir(αg_0) distribution. Therefore, as N → ∞, the sequence (X_1, X_2, ...) can be viewed as iid samples from π, where π ∼ Dir(αg_0) and is equal to (n_1/N, ..., n_D/N).

2) The Sethuraman Construction of a Dirichlet Prior [27]: A second method for drawing the vector

π ∼ Dir(αg0) using an infinite sequence of random variables was proven to exist by Sethuraman:

\pi = \sum_{j=1}^{\infty} V_j \prod_{k<j}(1-V_k)\, e_{Y_j}

V_j \sim \mathrm{Beta}(1, \alpha)

Y_j \sim \mathrm{Mult}(g_0)    (14)


where Y_j ∈ {1, ..., D} and V_0 ≡ 0. Defining p_j ≡ V_j ∏_{k<j}(1 − V_k), we see that π_i = ∑_{j=1}^{∞} p_j e_{Y_j}(i).

The weight vector, p, is often said to be constructed via a stick-breaking process due to the analogy

previously discussed. A visualization for two breaks of this process is shown in Figure 2. Note that this

analogy and accompanying visualization does not take into consideration the location vector, Y . For this

draw to be meaningful, the values (Vi, Yi) must always be considered as a pair, with neither being of

any value without the other.

Fig. 2. A visualization of an abstract stick-breaking process after two breaks have been made. With respect to the Sethuraman

construction, the random variable, Vi ∼ Beta(1, α), must always be considered with its corresponding Yi as a pair.

This distribution is said to be "constructed" because the sequence π_{i,n}, the value of π_i after break n, increases by the stochastically decreasing increments p_n for which Y_n = i, converging to π_i as n → ∞. That the resulting vector, π, is a proper probability mass function

can be seen from the fact that p_j ∈ [0, 1], ∀j because V_j ∈ [0, 1], ∀j, as well as the fact that ∑_{j=1}^{∞} p_j = 1 because 1 − ∑_{j=1}^{∞} p_j = ∏_{j=1}^{∞}(1 − V_j) → 0 as j → ∞. The convergence of each π_i can be argued using

similar reasoning.
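Here is a truncated Python/NumPy sketch of the construction in (14) (the truncation handling, which assigns the leftover mass to one additional location drawn from g_0, is my own choice; the report discusses truncation below):

```python
import numpy as np

def sethuraman_dirichlet(alpha, g0, K=1000, rng=np.random.default_rng()):
    """Approximate pi ~ Dir(alpha * g0) with a K-term truncation of the construction (14)."""
    g0 = np.asarray(g0)
    D = len(g0)
    pi = np.zeros(D)
    remainder = 1.0                          # prod_{k<j} (1 - V_k)
    for _ in range(K):
        V = rng.beta(1.0, alpha)             # stick-breaking proportion V_j ~ Beta(1, alpha)
        Y = rng.choice(D, p=g0)              # location Y_j ~ Mult(g0)
        pi[Y] += V * remainder               # add the broken piece p_j to element Y_j
        remainder *= (1.0 - V)
    pi[rng.choice(D, p=g0)] += remainder     # assign the remaining mass (truncation)
    return pi
```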

The proof of Sethuraman’s constructive definition relies on two Lemmas regarding the Dirichlet

distribution. These Lemmas are given below, followed by their use in proving (14).

Lemma 1: Consider a random vector, π ∈ ∆D, drawn according to the following process

π ∼ Dir(αg0 + eY )

Y ∼ Mult(g0) (15)

where Y ∈ {1, . . . , D}. Then π has a Dir(αg0) distribution. Another way to express this is that

\mathrm{Dir}(\alpha g_0) = \sum_{i=1}^{D} g_{0i}\, \mathrm{Dir}(\alpha g_0 + e_i)    (16)


So a Dirichlet distribution is also a specially parameterized mixture of Dirichlet distributions with the

component Dir(αg0 + ei) having mixing weight g0i.

Proof: Basic probability theory allows us to write that

p(\pi) = \sum_{i=1}^{D} p(X=i)\, p(\pi|X=i)    (17)

p(X=i) = \int_{\pi\in\Delta_D} p(X=i|\pi)\,p(\pi)\,d\pi

We observe that p(X = i) = E[πi] = g0i and that p(π|X = i) = Dir(αg0 + ei) from the posterior

calculation of a Dirichlet distribution. Replacing these two values in (17) yields the desired result.

Lemma 2: Let π be constructed according to the following linear combination,

π = V W1 + (1− V )W2

W1 ∼ Dir(ω1, . . . , ωD)

W2 ∼ Dir(υ1, . . . , υD)

V ∼ Beta(∑_{i=1}^{D} ω_i, ∑_{i=1}^{D} υ_i)    (18)

Then the vector π has the Dir(ω1 + υ1, . . . , ωD + υD) distribution.

Proof: The proof of this Lemma relies on the representation of π as a function of gamma random

variables. First, let W_1 = (Z_1/∑_i Z_i, ..., Z_D/∑_i Z_i), W_2 = (Z′_1/∑_i Z′_i, ..., Z′_D/∑_i Z′_i) and V = ∑_i Z_i / (∑_i Z_i + ∑_i Z′_i), where Z_i ∼ Ga(ω_i, λ) and Z′_i ∼ Ga(υ_i, λ). Then W_1 ∼ Dir(ω_1, ..., ω_D), W_2 ∼ Dir(υ_1, ..., υ_D) and V ∼ Beta(∑_i ω_i, ∑_i υ_i). However, it remains to show that these are drawn independently, or that p(W_1, W_2, V) = p(W_1)p(W_2)p(V). For this, we appeal to Basu's theorem [3], which states that since W_1 and W_2 are free of λ, W_1 and W_2 are also independent of any complete sufficient statistic of λ. In this case, ∑_i Z_i and ∑_i Z′_i are complete sufficient statistics for this parameter. Therefore, W_1 is independent of ∑_i Z_i and W_2 is independent of ∑_i Z′_i and, by extension, W_1 and W_2 are both independent of V.

Given that W_1, W_2 and V are constructed as above, we can see in their linear combination that π = V W_1 + (1 − V) W_2 = ((Z_1 + Z′_1)/(∑_i Z_i + ∑_i Z′_i), ..., (Z_D + Z′_D)/(∑_i Z_i + ∑_i Z′_i)). The distribution of the sum of two gamma random variables allows us to say that Z_i + Z′_i ∼ Ga(ω_i + υ_i, λ) and therefore π ∼ Dir(ω_1 + υ_1, ..., ω_D + υ_D).


An immediate result of this Lemma is that the vector, π, constructed as,

π = V W1 + (1− V )W2

W1 ∼ Dir(ei)

W2 ∼ Dir(αg0)

V ∼ Beta(1, α) (19)

has the distribution Dir(αg0 + ei). Furthermore, we observe that, using a property of the Dirichlet

distribution mentioned above, p(W1 = ei) = 1. We now use these two Lemmas in considering the

distribution of π in the following,

π = V eY + (1− V )π′

Y ∼ Mult(g0)

V ∼ Beta(1, α)

π′ ∼ Dir(αg0) (20)

First, we add a layer to (20) that seemingly contains no new information: eY ∼ Dir(eY ), (a probability

1 event, which is why we use the same variable notation). Using Lemma 2, however, this allows us to

collapse (20) and write

π ∼ Dir(αg0 + eY )

Y ∼ Mult(g0) (21)

which, using Lemma 1, tells us that π ∼ Dir(αg0). This fact allows us to decompose π into a linear

combination of an infinite set of pairs of random variables. For example, since π′ ∼ Dir(αg0), this

random variable can be expanded in the same manner as (20), producing

\pi = V_1 e_{Y_1} + V_2(1-V_1) e_{Y_2} + (1-V_2)(1-V_1)\,\pi''

Y_i \overset{iid}{\sim} \mathrm{Mult}(g_0)

V_i \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)

\pi'' \sim \mathrm{Dir}(\alpha g_0)    (22)

for i = 1, 2. Letting i →∞ produces (14). Since this process cannot be realized in practice, it must be

truncated, where (V_i, Y_i) are drawn for i = 1, ..., K with the remainder, ε = ∏_{i=1}^{K}(1 − V_i), arbitrarily

assigned across π. This truncation will be discussed in greater detail later.


From this definition of the Dirichlet distribution, the effect of α is perhaps more clear vis-a-vis Figure

1. The distribution on the values of p = φ(V ), where we will use φ(·) to represent the stick-breaking

function, is only dependent upon α and indicates how many breaks of the stick are necessary (probabilis-

tically speaking) to utilize a given fraction of the probability weight. This number is independent of D.

However, the sparsity of the final vector, π, is dependent upon D (via g0), as it controls the placement

of elements from p into π (via Y ).

The Dirichlet distribution therefore has several means by which one can draw π ∼ Dir(αg0). The two

finite methods allow one to obtain the vector π exactly. The two infinite approaches, due to a necessary

truncation (dictated by the data in the Polya urn case and by choice in the Sethuraman construction), are

only approximate. For this reason, in the finite setting (D < ∞), finite sampling methods are perhaps

more reasonable and the two infinite methods discussed above merely interesting facts. However, in the

infinite, nonparametric setting, the two infinite approaches become the only viable means for working with

the Dirichlet process.

II. THE DIRICHLET PROCESS

To motivate the following discussion of the Dirichlet process, consider a Dirichlet distribution with g0

uniform, π ∈ ∆k and

π ∼ Dir(α/k, . . . , α/k) (23)

What happens when k → ∞? Looking at the mean and variance of πi in (3), we see that both

E[πi], V[πi] → 0, leading to questions as to whether this distribution remains well-defined and can be

sampled from. Additionally, the definition of the Dirichlet distribution in the previous section was a general

analysis of the functional form. Each probability, πi, was never defined to correspond to anything, and

the prior vector, g0, was arbitrarily defined. However, what if we define π over a continuous, measurable

space, (Θ,B) equipped with a measure, αG0, with which we replace αg0?

In the following, we will attempt to give a sense of how these problems are approached and can be

solved via the use of measure theory. However, our focus is mainly on developing the intuition; the reader

is referred to the original papers for rigorous proofs [11][5][2][27]. We begin by giving the definition of

the Dirichlet process.

Definition of the Dirichlet Process: Let Θ be a measurable space and B the σ-field of subsets of Θ. Let

α be a positive scalar and G0 a probability measure on (Θ,B). Then for all pairwise disjoint partitions

{B_1, ..., B_k} of Θ, where ⋃_{i=1}^{k} B_i = Θ, the random probability measure, G, on (Θ, B) is said to be a


Dirichlet process if (G(B1), . . . , G(Bk)) ∼ Dir(αG0(B1), . . . , αG0(Bk)).

This process is denoted G ∼ DP (αG0). Because the nonparametric Dirichlet process is an infinite-

dimensional prior on a continuous space, we focus on the infinite-dimensional analogues to the two

infinite sampling methods detailed in Section I. Below, we review these two representations as they

appear in the literature, noting that the infinite-dimensional analogue to the Polya urn scheme is called

the “Chinese restaurant process.” Also in keeping with the literature, we replace the symbol X , as used

above, with θ in what follows.

The Chinese Restaurant Process [1]: Given the observations, {θ*_1, ..., θ*_N}, sampled from a Chinese restaurant process (CRP), a new observation, θ*_{N+1}, is sampled according to the distribution

\theta^*_{N+1} \sim \sum_{i=1}^{K} \frac{n_i}{\alpha+N}\,\delta_{\theta_i} + \frac{\alpha}{\alpha+N}\, G_0    (24)

where K is the number of unique values among {θ*_1, ..., θ*_N}, θ_i those specific values and n_i the number of observations taking the value θ_i. As N → ∞, the samples {θ*_i}_{i=1}^{∞} are iid from G, where G ∼ DP(αG_0) and is equal to the expression in (24). As is evident, θ*_{N+1} takes either a previously observed value with probability N/(α+N) or a new value drawn from G_0 with probability α/(α+N). This value is new almost surely because we assume G_0 to be a continuous probability measure.
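Here is a Python/NumPy sketch of sampling from the CRP (the choice of a standard normal base measure G_0 is my own illustration):

```python
import numpy as np

def chinese_restaurant_process(alpha, N, base_sampler, rng=np.random.default_rng()):
    """Draw theta*_1,...,theta*_N sequentially according to (24)."""
    atoms, counts, samples = [], [], []
    for n in range(N):
        # Join existing atom i with probability n_i/(alpha+n); draw a new atom with alpha/(alpha+n).
        probs = np.array(counts + [alpha]) / (alpha + n)
        idx = rng.choice(len(probs), p=probs)
        if idx == len(atoms):                 # a new value, drawn from G0
            atoms.append(base_sampler(rng))
            counts.append(0)
        counts[idx] += 1
        samples.append(atoms[idx])
    return samples, atoms, counts

# Example with G0 = N(0, 1):
samples, atoms, counts = chinese_restaurant_process(
    alpha=2.0, N=100, base_sampler=lambda rng: rng.normal(0.0, 1.0))
```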

The Stick-Breaking (Sethuraman) Construction [27]: Let θ ∼ G and G be constructed as follows:

G = \sum_{i=1}^{\infty} V_i \prod_{j<i}(1-V_j)\, \delta_{\theta_i}

V_i \sim \mathrm{Beta}(1, \alpha)

\theta_i \sim G_0    (25)

Then G ∼ DP (αG0). Notice that G is constructed explicitly in (25), while only implicitly in (24). We

now provide a general overview of how measure theory [4][16] relates to the Dirichlet process.
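A minimal Python/NumPy sketch of an explicit (approximate) draw of G via (25), truncated at K atoms (the truncation and base measure are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng()

def truncated_dp_draw(alpha, base_sampler, K=100):
    """Approximate G ~ DP(alpha * G0) with K stick-breaking atoms, as in (25)."""
    V = rng.beta(1.0, alpha, size=K)       # V_i ~ Beta(1, alpha)
    V[-1] = 1.0                            # force the truncated weights to sum to one
    weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = np.array([base_sampler(rng) for _ in range(K)])   # theta_i ~ G0
    return weights, atoms

# A draw theta ~ G: select an atom with its stick-breaking weight.
weights, atoms = truncated_dp_draw(alpha=2.0, base_sampler=lambda r: r.normal(0.0, 1.0))
theta = atoms[rng.choice(len(atoms), p=weights)]
```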

A. Measure Theory and the Dirichlet Process

In the following discussion, we will focus on the real line, θ ∈ R, with B the Borel σ-field generated

by sets A ∈ A of the form A = {θ : θ ∈ (−∞, a]}, where a ∈ Q, the set of rational numbers. This

σ-field is generated by A, or B = σ(A), by including in B all complements, unions and intersections of


the sets in A. For our purposes, this means that all disjoint sets of the form Bi = {θ : θ ∈ (ai−1, ai]},

are included in B since B_i = A_i ∩ A^c_{i−1}, where A^c = {θ : θ ∉ A} is the complement of A, and letting A_i

be the set for which a = ai, with −∞ < ai−1 ≤ ai < ∞.

This is necessary because the Dirichlet process is parameterized by a measure, G0, on the Borel sets

of Θ. To accomplish this, we let the probability density function p(θ) be associated with the probability

measure, G0, such that

G_0(B_i) = \int_{a_{i-1}}^{a_i} p(\theta)\, d\theta = P(a_i) - P(a_{i-1})    (26)

For example, we could define P(θ) to be the cdf of a normal distribution, N(µ, σ²). This is illustrated in Figure 3 for three partitions. We

observe that, in this illustration, the measure is simply the area under the curve of the region (ai−1, ai].

In essence, measure theory is introduced because it allows us to partition the real line in any way we

choose, provided a is rational, and lets us refine these partitions to infinitesimally small regions, all in a

rigorous, well-defined manner.

Fig. 3. (a) A partition of R into three sections, ((−∞, a_1], (a_1, a_2], (a_2, ∞)). (b) The corresponding probabilities of the three sections, (P(a_1), P(a_2) − P(a_1), 1 − P(a_2)). In measure-theoretic notation, this corresponds to the disjoint sets (B_1, B_2, B_3) and their respective measures (G_0(B_1), G_0(B_2), G_0(B_3)).

In Ferguson’s definition of the Dirichlet process [11], a prior measure, G0, on the Borel sets in B is

used along with a strength parameter, α, to parameterize the Dirichlet distribution. This applies to any

set, and any number, of disjoint partitions, which motivates the generality of the definition. In machine

learning applications, we are concerned with the infinite limit, where Θ is partitioned in a specific way

that we will discuss later. First, we observe that the function of α in the Dirichlet process is identical to

that in Section I and, because G0(Bi) is simply a number (an area under a curve), G0 performs the same


function as g_0. The crucial difference is that we have defined G over a measurable space and defined

the prior for G using a probability measure, G0, on that same space. Functionally speaking, G(B1) and

π1 are equivalent in that they are both probabilities corresponding to the first dimension of the Dirichlet

distribution, but G(B1) contains the additional information that the measure is associated with the set

B1. Indeed, to emphasize this similarity, this could be written in the following way:

G(B) = \sum_{i=1}^{k} \pi_i\, \delta_{B_i}(B)

\pi \sim \mathrm{Dir}(\alpha G_0(B_1), \ldots, \alpha G_0(B_k))    (27)

where G(B) is a real-valued set function that takes as its argument B ∈ {B1, . . . , Bk} and outputs the

πi for which δBi(B) = 1, which occurs when B = Bi and is 0 otherwise. The summation, though

uninteresting in this case, is to be taken literally. For example, G(B_1) = π_1 · 1 + ∑_{i=2}^{k} π_i · 0 = π_1.

It will be noticed that discussion of the Dirichlet process in the machine learning and signal processing

literature never mentions sets of the form of B, but rather specific values of θ. This is due to taking the

limit as k →∞ and motivates the following discussion of the inverse-cdf method for obtaining random

variables having a density p(θ). Figure 4 contains a plot of ten partitions of the space Θ according to a

function, p(θ), such that the measure of any partition is G0(Bi) = 1/10. This is clearly visible in Figure

4(b). According to this picture, one can see that a partition, Bi, could be selected according to this

measure by first drawing a uniformly distributed random variable, η ∼ U(0, 1), finding the appropriate

cell on the y-axis and inverting to obtain the randomly selected partition along the θ-axis. The partition,

Bi, will have been selected uniformly, but the values, θ ∈ Bi, will span various widths of Θ according to

P(θ). If we define new sets, {C_1, ..., C_k}, of the form C_i = {η : η ∈ ((i−1)/k, i/k]}, then B_i is the inverse

image of Ci, or Bi = {θ : θ = P−1(η),∀η ∈ Ci}. Allowing k →∞, uniformly sampling among sets, Ci,

and finding the inverse image converges to sampling θ ∼ G0. In other words, if we sample η ∼ U(0, 1)

and calculate θ = P−1(η), then θ ∼ P (θ).
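The inverse-cdf step itself is standard; a brief Python/SciPy sketch, assuming (as in Figure 3) that P is a normal cdf:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng()
mu, sigma = 0.0, 1.0

# Sample eta ~ U(0,1) and invert the cdf: theta = P^{-1}(eta) has distribution function P.
eta = rng.uniform(0.0, 1.0, size=10000)
theta = norm.ppf(eta, loc=mu, scale=sigma)   # ppf is the inverse cdf P^{-1}

# Partitioning (0,1] into k cells of width 1/k, drawing a cell uniformly and refining
# k -> infinity converges to drawing theta ~ G0 in exactly this way.
```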

We mention this because this is implicitly taking place in the Dirichlet process. When we sample

G ∼ DP (αG0), we are assigning to each dimension of Dir(αG0(B1), . . . , αG0(Bk)) a partition of

measure G0(Bi) = 1/k derived according to the distribution function P (θ). In a manner identical to the

finite dimensional case, (8) and (14), where we sample from g0 (i.e. Mult(g0)), we sample uniformly

among partitions Bi according to the measure G0. Though each measure, G0(Bi), is equal, their locations

along Θ are distributed according to the distribution P (θ). Therefore, by letting G0(Bi) = 1/k for

i = 1, . . . , k and allowing k → ∞, we can use the mathematical derivations of Section I explicitly, as

we will discuss in the next section.


Fig. 4. A partition of R into k = 10 disjoint sets, (B_1, ..., B_{10}) = ((−∞, a_1], (a_1, a_2], ..., (a_8, a_9], (a_9, ∞)), of equal measure, G_0(B_i) = 1/10, ∀i (or P(a_i) − P(a_{i−1}) = 1/10). The uniform partition of P(θ) in (b) can be seen to lead to partitions of θ approximating the density p(θ) in (a). As the partitions are refined, or k → ∞, uniformly sampling among partitions converges to drawing θ ∼ G_0.

Though significantly more discussion is needed to make this rigorous (e.g., that it satisfies Kolmogorov’s

consistency condition [11]), the above discussion is essentially what is required for an intuitive under-

standing of the differences between the Dirichlet distribution and the Dirichlet process. Furthermore, this

intuition can be extended to spaces of higher dimension, allowing θ ∼ G0 to indicate the drawing of

multiple variables in a higher dimensional space.

B. Drawing from the Infinite Dirichlet Process

We examine this uniform sampling of Bi more closely and how it directly relates to the Chinese

Restaurant Process (CRP) and Sethuraman construction. First, consider a finite number of partitions, in

which case the CRP can be written as a finite-dimensional Polya urn model (8):

B^*_{N+1} \sim \sum_{i=1}^{k} \frac{n_i}{\alpha+N}\,\delta_{B_i} + \frac{\alpha}{\alpha+N}\, G_0(B) = \sum_{i=1}^{k} \frac{n_i + \alpha G_0(B_i)}{\alpha+N}\,\delta_{B_i}    (28)

Letting G0(Bi) = 1/k and k →∞ as discussed, we see that αG0(Bi) → 0, thus disappearing from all

components for which ni > 0. However, there are an infinite number of remaining, unused components

with locations distributed according to G0. Notice that we do not need to change G0 to reflect the removal

of θ values, as they are all of measure zero. This means that the weight on G0 is also unchanged and

remains α. As a result, though we cannot explicitly write the remaining, unused components, we can

“lump” them together under G0 with a probability of selecting from G0 proportional to α. Whereas for

the finite-dimensional case there was a nonzero probability that a value selected from G0 would equal


one previously seen, for the continuous case this is a probability zero event. Therefore, by allowing the

number of components to tend to infinity, we obtain a continuum of colors as proven in [5], also known

as the Chinese restaurant process [1].

Identical reasoning is used for the Sethuraman construction. For k →∞, consider the step

Yi ∼ Mult(G0(B1), . . . , G0(Bk)) (29)

in light of Figure 4(b), in which case Y ∈ {B1, . . . , Bk}. Sampling partitions from this infinite-dimensional

uniform multinomial distribution is precisely equivalent to drawing θ ∼ G0 for reasons previously

discussed regarding the inverse-cdf method. In this case and that of the CRP, a change of notation

has concealed the fact that the same fundamental underlying process is taking place as that of the finite-

dimensional case – a notable difference being that repeated values are here probability zero events. Using

the notation of Section I-B2, this difference implies for the Sethuraman construction that a one-to-one

correspondence exists between values in p and the desired vector, π. For clarity, we will continue to use

p when discussing the Dirichlet process, leaving π to represent values drawn from the finite-dimensional

Dirichlet distribution. Also, we emphasize that G is a symbol (a set function) that incorporates both p and

θ as a complete measure. Just as Vi was meaningless without the Yi in the finite case, pi is meaningless

without its corresponding location, θi.

Finally, it is hopefully clear why it is sometimes rhetorically asked in the literature how one can go

about drawing from the Dirichlet process. We are attempting to draw an infinite-dimensional pmf that

covers every point of a continuous space. Using the two “finite” methods (Section I), so labeled because

they produced samples π ∼ Dir(αg0) in a finite number of steps, we need an infinite length of time

to obtain π. The two infinite methods are now the only feasible means and each solves this problem

elegantly in its own way. With the Chinese restaurant process, we never need to draw G. More remarkably,

the Sethuraman construction allows one to explicitly draw G by drawing the important components first,

stopping when a desired portion of the unit mass is assigned and implicitly setting the probability of

the infinitely remaining components to zero. We will see later that this prior can be applied to drawing

directly from the Beta process as well, a result we believe to be new.

C. Dirichlet Process Mixture Models

Having established the means to draw G ∼ DP (αG0), we next discuss the use of the values θ ∼ G.

The main reason we replaced X with θ in our discussion of the Dirichlet process is that the symbol

X is generally reserved for the observed data. In Section I, this was appropriate because we were only


interested in observing samples from the discrete π. However, the primary use of the Dirichlet process

in the machine learning community is in nonparametric mixture modeling [2][10], where θ ∼ G is not

the observed data, but rather a parameter (or set of parameters) for some distribution, F (θ), from which

X is drawn. This can be written as follows:

X_j \sim F(\theta^*_j)

\theta^*_j \overset{iid}{\sim} G

G \sim DP(\alpha G_0)    (30)

where θ∗j is a specific value selected from G and associated with observation Xj . An example of this

process is the Gaussian mixture model, where θ = {µ,Λ}, the mean and precision matrix for a multivariate

normal distribution, and G0 the normal-Wishart conjugate prior. Whereas samples from the discrete G

will repeat with high probability, samples from F (θ) are again from a continuous distribution. Therefore,

values that share parameters are clustered together in that, though not exactly the same, they exhibit

similar statistical characteristics according to some distribution function, F (θ).
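As a concrete instance of (30), here is a Python/NumPy sketch that generates data from a truncated DP mixture of one-dimensional Gaussians (the base measure, unit observation variance, and truncation level are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng()
alpha, K, N = 2.0, 50, 500

# G ~ DP(alpha * G0), truncated to K atoms; G0 places a N(0, 10^2) prior on each mean.
V = rng.beta(1.0, alpha, size=K)
V[-1] = 1.0
weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
means = rng.normal(0.0, 10.0, size=K)            # theta_k ~ G0

# theta*_j ~ G, then X_j ~ F(theta*_j) = N(theta*_j, 1).
assignments = rng.choice(K, size=N, p=weights)
X = rng.normal(means[assignments], 1.0)
```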

III. INFERENCE FOR THE DIRICHLET PROCESS MIXTURE MODEL

To show the Dirichlet process in action, we outline a general method for performing Markov chain

Monte Carlo (MCMC) [13] inference for Dirichlet process mixture models. We let f(x|θ) be the likelihood

function of an observation, x, given parameters, θ, and p(θ) the prior density of θ, corresponding to the

probability measure G0. We also use an auxiliary variable, c ∈ {1, . . . ,K +1}, which acts as an indicator

of the parameter value, θci, associated with observation xi. We will discuss the value of K and why we

write K + 1 shortly. To clarify further the function of c, we rewrite (30) utilizing this auxiliary variable:

X_j \sim F(\theta_{c_j})

c_j \overset{iid}{\sim} \mathrm{Mult}(\phi_K(V))

V_k \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)

\theta_k \overset{iid}{\sim} G_0    (31)

where p = φK(V ) represents the stick-breaking portion of the Sethuraman construction that has been

truncated to length K + 1, with p_{K+1} = ∏_{i=1}^{K}(1 − V_i). We use subscript K because the (K + 1)-

dimensional vector, p, is a function of K parameters. Therefore, only the first K values of Vk and the

first K + 1 values of θk are used. For clarity, following each iteration only utilized components are


retained (as indicated by {c_j}_{j=1}^{N}), and a new proposal component is drawn from the base distribution.

See [23] for further discussion. The number of utilized components for any given iteration is represented

by the variable K, and we here only propose a K + 1st component. Below is this sampling process.

Initialization: Select a truncation level, K + 1, and initialize the model, G, by sampling θk ∼ G0 for

k = 1, . . . ,K + 1 and Vk ∼ Beta(1, α) for k = 1, . . . ,K and construct p = φK(V ). For now, α is set

a priori, but we will discuss inference for this parameter as well.

Step 1: Sample the indicators, c1, . . . , cN , independently from their respective conditional posteriors,

p(c_j|x_j) \propto f(x_j|\theta_{c_j})\, p(\theta_{c_j}|G),

c_j \sim \sum_{k=1}^{K+1} \frac{p_k f(x_j|\theta_k)}{\sum_l p_l f(x_j|\theta_l)}\, \delta_k    (32)

Set K to be the number of unique values among c1, . . . , cN and relabel from 1 to K.

Step 2: Sample θ1, . . . , θK from their respective posteriors conditioned on c1, . . . , cN and x1, . . . , xN ,

\theta_k \sim p\left(\theta_k \,\big|\, \{c_j\}_{j=1}^{N}, \{x_j\}_{j=1}^{N}\right), \qquad p\left(\theta_k \,\big|\, \{c_j\}_{j=1}^{N}, \{x_j\}_{j=1}^{N}\right) \propto \left[\prod_{j=1}^{N} f(x_j|\theta)^{\delta_{c_j}(k)}\right] p(\theta)    (33)

where δcj(k) is a delta function equal to one if cj = k and zero otherwise, simply picking out which xj

belong to component k. Sample θK+1 ∼ G0.

Step 3: For the Sethuraman construction, construct the (K + 1)-dimensional weight vector, p = φK(V ),

using V1, . . . , VK sampled from their Beta-distributed posteriors conditioned on c1, . . . , cN ,

V_k \sim \mathrm{Beta}\left(1 + \sum_{j=1}^{N}\delta_{c_j}(k),\ \alpha + \sum_{l=k+1}^{K}\sum_{j=1}^{N}\delta_{c_j}(l)\right)    (34)

Set p_{K+1} = ∏_{k=1}^{K}(1 − V_k). For the Chinese restaurant process, this step instead consists of setting p_k = n_k/(α+N) and p_{K+1} = α/(α+N), as discussed in Section II.

Repeat Steps 1 – 3 for a desired number of iterations. The convergence of this Markov chain can be

assessed [13], after which point uncorrelated samples (properly spaced out in the chain) of the values in

Steps 1 – 3 are considered iid samples from the posterior, p({θ_k}_{k=1}^{K}, {V_k}_{k=1}^{K}, {c_j}_{j=1}^{N}, K | {x_j}_{j=1}^{N}).

Additional inference for α can be done for the CRP using a method detailed in [10] and for the Sethuraman

construction using a conjugate, gamma prior.


Step 4: Sample α from its posterior gamma distribution conditioned on V1, . . . , VK

\alpha \sim \mathrm{Ga}\left(a_0 + K,\ b_0 - \sum_{k=1}^{K}\ln(1-V_k)\right)    (35)

where a0, b0 are the prior hyperparameters for the gamma distribution.

As can be seen, inference for the Dirichlet process is fairly straightforward and, when p(θ) is conjugate

to f(x|θ), fully analytical. Other MCMC methods exist for performing this inference [20] as does a fast,

variational inference method [7][6] that, following initialization, deterministically converges to a locally optimal approximation to the full posterior distribution.
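Putting Steps 1–4 together, the sketch below shows one Gibbs sweep in Python/NumPy for a one-dimensional Gaussian mixture with unit observation variance and G_0 = N(0, 10^2); the likelihood, prior, and variable names are my own simplifying assumptions and not part of the report.

```python
import numpy as np
from scipy.stats import norm

def stick_to_weights(V):
    """The stick-breaking map phi_K: (V_1,...,V_K) -> (p_1,...,p_{K+1})."""
    remainders = np.concatenate(([1.0], np.cumprod(1.0 - V)))
    return np.concatenate((V * remainders[:-1], [remainders[-1]]))

def gibbs_sweep(x, theta, V, alpha, a0=1.0, b0=1.0, rng=np.random.default_rng()):
    """One pass of Steps 1-4 for a unit-variance Gaussian DP mixture, G0 = N(0, 10^2)."""
    p = stick_to_weights(V)                                   # p = phi_K(V), length K+1
    # Step 1: sample the indicators c_j from (32).
    like = norm.pdf(x[:, None], loc=theta[None, :], scale=1.0)
    probs = p[None, :] * like
    probs /= probs.sum(axis=1, keepdims=True)
    c = np.array([rng.choice(len(p), p=row) for row in probs])
    used = np.unique(c)
    c = np.searchsorted(used, c)                              # relabel used components 0,...,K-1
    K = len(used)
    # Step 2: sample theta_k from its conjugate normal posterior (33); propose theta_{K+1} ~ G0.
    theta_new = np.empty(K + 1)
    for k in range(K):
        xk = x[c == k]
        post_prec = 1.0 / 10.0**2 + len(xk)                   # prior precision + n_k (sigma = 1)
        theta_new[k] = rng.normal(xk.sum() / post_prec, 1.0 / np.sqrt(post_prec))
    theta_new[K] = rng.normal(0.0, 10.0)
    # Step 3: sample V_k from (34).
    counts = np.bincount(c, minlength=K)
    tail = np.cumsum(counts[::-1])[::-1] - counts             # sum_{l>k} n_l
    V_new = rng.beta(1.0 + counts, alpha + tail)
    # Step 4: sample alpha from its gamma posterior (35); the second parameter of Ga is a rate.
    alpha_new = rng.gamma(a0 + K, 1.0 / (b0 - np.log(1.0 - V_new).sum()))
    return c, theta_new, V_new, alpha_new
```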

IV. EXTENSIONS OF THE DIRICHLET PROCESS

With the advances of computational resources and MCMC (and later, VB) inference methods, the

Dirichlet process has had a significant role in Bayesian hierarchical modeling. This can be seen in the

many developments of the original idea. To illustrate, we briefly review two of these developments

common to the machine learning literature: the hierarchical Dirichlet process (HDP) [28] and the nested

Dirichlet process (nDP) [26]. We then present our own extension that, to our knowledge, is new to the

machine learning and signal processing communities. This extension, called the Dirichlet process with

product base measure (DP-PBM), extends the Dirichlet process to the modeling of multiple modalities,

m, that each call for unique distribution functions, Fm(θm).

A. The Hierarchical Dirichlet Process [28]

In our discussion of the Dirichlet distribution in Section I, we defined α to be a positive scalar and g0

to be a D-dimensional pmf, with the new pmf π ∼ Dir(αg0). Though our discussion assumed g0 was

uniform, in general there are no restrictions on g0, so long as it is in ∆D. This leads to the following

possibility

π′ ∼ Dir(βπ), π ∼ Dir(αg0) (36)

where β is a value having the same restrictions as α. In essence, this is the hierarchical Dirichlet process

(HDP). A measure, G′_i, is drawn from an HDP, denoted G′_i ∼ HDP(α, β, G_0), as follows:

G′i ∼ DP (βG), G ∼ DP (αG0) (37)

Using the measure theory outlined in Section II-A, this process can be understood as the infinite-

dimensional analogue to (36) extended to a measurable space (Θ,B). We note that, in the case where G


is constructed according to a stick-breaking process and truncated, G′i is a finite-dimensional Dirichlet

process for which the prior expectation, p, and locations, θ, are determined for G′i by G. A key benefit

of the HDP is that draws of G′_1, ..., G′_n ∼ DP(βG) will utilize the same subset of atoms, since G is

a discrete measure with probability one. Therefore, this method is useful for multitask learning [8][21],

where atoms might be expected to be shared across tasks, but with a different probability of usage for each

task.
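A rough Python/NumPy sketch of this two-level construction under truncation (my own simplification: G is truncated to K atoms, so each G′_i reduces to a finite Dirichlet over those shared atoms, as noted above):

```python
import numpy as np

rng = np.random.default_rng()
alpha, beta, K, n_tasks = 2.0, 5.0, 50, 3

# Top level: G ~ DP(alpha * G0), truncated to K atoms with weights p and locations theta.
V = rng.beta(1.0, alpha, size=K)
V[-1] = 1.0
p = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
theta = rng.normal(0.0, 10.0, size=K)        # G0 = N(0, 10^2), an illustrative choice

# Second level: each G'_i ~ DP(beta * G) becomes Dir(beta * p) over the shared atoms theta.
task_weights = rng.dirichlet(beta * p, size=n_tasks)   # one weight vector per task
```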

B. The Nested Dirichlet Process [26]

In our discussion of the Dirichlet process, we assumed G0 to be a continuous distribution, though

this isn’t necessary, as we’ve seen with the HDP. Consider the case where G0 is a Dirichlet distribution.

Extending this base Dirichlet to its own measurable space, the base distribution becomes a Dirichlet

process as well. This is called the nested Dirichlet process (nDP). A way to think of a draw G ∼

nDP (α, β, G0) is as a mixture model of mixture models. This can be seen in the Sethuraman construction,

G = \sum_{j=1}^{\infty} V_j \prod_{k<j}(1-V_k)\, \delta_{G_j}

V_j \sim \mathrm{Beta}(1, \alpha)

G_j \sim DP(\beta G_0)    (38)

Now, rather than drawing θ∗i ∼ G, an entire mixture, G∗i ∼ G, is drawn, from which is then drawn an

atom, θ∗i ∼ G∗i . As with the HDP, this prior can be useful in multitask learning, where a task, t, selects

an indicator from the “top-level” DP, G, and each observation within that task utilizes atoms sampled

from a “second-level” DP, G∗t . In the case of multitask learning with the HDP, all tasks share the same

atoms, but with different mixing weights. The nDP for multitask learning is “all-or-nothing;” two tasks

either share a mixture model exactly, or nothing at all.

C. The Dirichlet Process with a Product Base Measure

We here propose another extension of the Dirichlet process that incorporates a product base measure, de-

noted DP-PBM, where rather than drawing parameters for one parametric distribution, θ ∼ G0, parameters

are drawn for multiple distributions, θm ∼ G0,m for m = 1, . . . ,M . In other words, rather than having a

connection between data, {X_i}_{i=1}^{N}, and their respective parameters, {θ*_i}_{i=1}^{N}, through a parametric distribution, {F(θ*_i)}_{i=1}^{N}, sets of data, {X_{1,i}, ..., X_{M,i}}_{i=1}^{N} have respective sets of parameters, {θ*_{1,i}, ..., θ*_{M,i}}_{i=1}^{N},


used in inherently different and incompatible distribution functions, {F1(θ∗1,i), . . . , FM (θ∗M,i)}. The DP-

PBM is so called because it utilizes a product base measure to achieve this end, G0 = G0,1×G0,2×· · ·×

G0,M , where, in this case, M modalities are considered. The space over which this process is defined is

now (∏_{m=1}^{M} Θ_m, ⊗_{m=1}^{M} B_m, ∏_{m=1}^{M} G_{0,m}). Though this construction implicitly takes place in all mixture

models that attempt to estimate multiple parameters, for example the multivariate Gaussian mixture model,

we believe our use of these parameters in multiple, incompatible distributions (or modalities) is novel.

The fully generative process can be written as follows:

X_{m,i} \sim F_m(\theta^*_{m,i}), \quad m = 1,\ldots,M

\{\theta^*_{m,i}\}_{m=1}^{M} \sim G

G = \sum_{j=1}^{\infty} V_j \prod_{k<j}(1-V_k) \prod_{m=1}^{M} \delta_{\theta_{m,j}}

V_j \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)

\theta_{m,j} \sim G_{0,m}, \quad m = 1,\ldots,M    (39)

where θm,j are drawn iid from G0,m for a fixed m and independently under varying m. To make a

clarifying observation, we note that if each G0,m is a univariate normal-gamma prior, this model reduces

to a multivariate GMM with a forced diagonal covariance matrix. As previously stated, we are more

interested in cases where each Xm is inherently incompatible, but is still linked by the structure of the

data set.

For example, consider a set of observations, {O_i}_{i=1}^{N}, where each O_i = {X_{1,i}, X_{2,i}} with X_1 ∈ R^d

and X2 a sequence of time-series data. In this case, a single density function, f(X1, X2|θ1, θ2) cannot

analytically accommodate Oi, making inference difficult. However, if these densities can be considered

as independent, that is f(X1, X2|θ1, θ2) = f(X1|θ1) · f(X2|θ2), then this problem becomes analytically

tractable and, furthermore, no more difficult to solve than for the standard Dirichlet process. One might

choose to model X1 with a Gaussian distribution, with G0,1 the appropriate prior and X2 by an HMM

[25], with G0,2 its respective prior [22] . In this case, this model becomes a hybrid Gaussian-HMM

mixture, where each component is both a Gaussian and a hidden Markov model. We develop a general

MCMC inference method below to allow us to look more deeply at the structure of our framework. We

also observe that this idea can be easily extended to the nDP and HDP frameworks.


D. Inference for the DP-PBM Mixture Model

MCMC inference for the DP-PBM mixture model entails a few simple alterations to the sampling

scheme outlined in Section III. These changes alone are given below:

Initialization: For each j = 1, . . . ,K + 1, sample θm,j ∼ G0,m for m = 1, . . . ,M , where M is the

number of “modalities,” or distinct subsets of data for each observation.

Step 1: Sample the indicators, c1, . . . , cN , independently from their conditional posteriors.

c_j \sim \sum_{k=1}^{K+1} \frac{p_k \prod_{m=1}^{M} f_m(x_{m,j}|\theta_{m,k})}{\sum_l p_l \prod_{m=1}^{M} f_m(x_{m,j}|\theta_{m,l})}\, \delta_k    (40)

We note that the likelihood term has been factorized into a product of the likelihood for each modality.

Step 2: Sample {θ_{m,1}}_{m=1}^{M}, ..., {θ_{m,K}}_{m=1}^{M} from their conditional posteriors given c_1, ..., c_N and {x_{m,1}}_{m=1}^{M}, ..., {x_{m,N}}_{m=1}^{M},

\theta_{m,k} \sim p\left(\theta_{m,k} \,\big|\, \{c_j\}_{j=1}^{N}, x_{m,1},\ldots,x_{m,N}\right), \qquad p\left(\theta_{m,k} \,\big|\, \{c_j\}_{j=1}^{N}, x_{m,1},\ldots,x_{m,N}\right) \propto \left[\prod_{j=1}^{N} f(x_{m,j}|\theta_m)^{\delta_{c_j}(k)}\right] p(\theta_m)    (41)

Sample θ_{m,K+1} ∼ G_{0,m} for m = 1, ..., M. The essential difference here is that each component,

k = 1, . . . ,K now has M posteriors to calculate. These M posteriors are calculated independently

of one another given the relevant data for that modality extracted from the observations assigned to that

component. We stress that when an “observation” is assigned to a component (via the indicator, c) it is

actually all of the data that comprise that observation that is being assigned to the component.
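To show how little changes relative to Section III, here is a Python/NumPy sketch of the factorized indicator step (40) for two modalities, assuming a unit-variance Gaussian first modality and a second modality whose per-component log-likelihoods (e.g., HMM forward-algorithm values) are supplied by the caller; the names and interface are my own:

```python
import numpy as np
from scipy.stats import norm

def sample_indicators_pbm(x1, x2_loglik, means, p, rng=np.random.default_rng()):
    """Sample c_j from (40) with a Gaussian modality x1 and a generic second modality.

    x2_loglik[j, k] is assumed to hold log f_2(x_{2,j} | theta_{2,k}), computed elsewhere.
    """
    log_like1 = norm.logpdf(x1[:, None], loc=means[None, :], scale=1.0)   # N x (K+1)
    log_post = np.log(p)[None, :] + log_like1 + x2_loglik                 # factorized likelihood
    log_post -= log_post.max(axis=1, keepdims=True)                       # numerical stability
    probs = np.exp(log_post)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=row) for row in probs])
```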

E. Experimental Result with Synthesized Data

To further illustrate our model, we implement it on a toy problem of two modalities. For this problem,

each observation is of the form, O = {X1, X2}, where X1 ∼ N (µ,Σ) and X2 ∼ HMM(A,B, π). We

define three, two-dimensional Gaussian distributions with respective means µ1 = (−3, 0), µ2 = (3, 0)

and µ3 = (0, 5) and each having the identity as the covariance matrix. Two hidden Markov models are

defined as below,

A_1 = \begin{bmatrix} 0.05 & 0.9 & 0.05 \\ 0.05 & 0.05 & 0.9 \\ 0.9 & 0.05 & 0.05 \end{bmatrix}, \quad A_2 = \begin{bmatrix} 0.9 & 0.05 & 0.05 \\ 0.05 & 0.9 & 0.05 \\ 0.05 & 0.05 & 0.9 \end{bmatrix}, \quad B_1, B_2 = \begin{bmatrix} 0.9 & 0.05 & 0.05 \\ 0.05 & 0.9 & 0.05 \\ 0.05 & 0.05 & 0.9 \end{bmatrix}


with the initial state vector π1,π2 = [1/3, 1/3, 1/3]. Data was generated as follows: We sampled 300

observations, 100 from each Gaussian, constituting X1,i for i = 1, . . . , 300. For each sample, if the

observation was on the right half of its respective Gaussian, a sequence of length 50 was drawn from

HMM 1, if on the left, from HMM 2. We performed fast variational inference [6], truncating to 50

components, to obtain results for illustration. This precisely defined data set allows the model to clearly

display the purpose of its design. If one were to build a Gaussian mixture model on the X1 data alone,

three components would be uncovered, as the posterior expectation shows in Figure 5(a). If an HMM

mixture were built alone on the X2 data, only two components would be uncovered. Using all of the

data, that is, mixing on {O_i}_{i=1}^{300} rather than just {X_{1,i}}_{i=1}^{300} or {X_{2,i}}_{i=1}^{300} alone, the correct number of six

components was uncovered, as shown in Figure 5(b).

Fig. 5. An example of a mixed Gaussian-HMM data set where observations on the right side of each Gaussian have an associated sequence from HMM 1 and those on the left side from HMM 2. (a) Gaussian mixture model results without using sequential information. (b) Gaussian-HMM mixture results using both spatial and sequential information. Each ellipse corresponds to a cluster. Because there are six Gaussian-HMM combinations, there are six clusters.

The results show that, as was required by the data, the Dirichlet process prior uncovered six distinct

groups of data. In other words, though no two observations were exactly the same, it was inferred that

six distinct sets of parameters were required to properly characterize the statistical properties of the data.

We recall that we initialized to 50 components. Upon convergence, 44 of them contained no data. The

DP-PBM framework therefore allows for the incorporation of all information of a data set to be included,

thus providing more precise clustering results.


REFERENCES

[1] D. Aldous (1985). Exchangeability and related topics. École d'été de probabilités de Saint-Flour XIII-1983, 1-198. Springer, Berlin.

[2] C.E. Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174.

[3] D. Basu (1955). On statistics independent of a complete sufficient statistic. Sankhya: The Indian Journal of Statistics, 15:377-380.

[4] P. Billingsley (1995). Probability and Measure, 3rd edition. Wiley Press, New York.

[5] D. Blackwell and J.B. MacQueen (1973). Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1:353-355.

[6] M.J. Beal (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London.

[7] D. Blei and M.I. Jordan (2006). Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121-144.

[8] R. Caruana (1997). Multitask learning. Machine Learning, 28:41-75.

[9] R.J. Connor and J.E. Mosimann (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association, 64:194-206.

[10] M.D. Escobar and M. West (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577-588.

[11] T. Ferguson (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230.

[12] B. de Finetti (1937). La prévision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincaré, 7:1-68.

[13] D. Gamerman and H.F. Lopes (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2nd edition. Chapman & Hall.

[14] P. Halmos (1944). Random alms. The Annals of Mathematical Statistics, 15:182-189.

[15] E. Hewitt and L.J. Savage (1955). Symmetric measures on Cartesian products. Trans. American Math Association, 80:470-501.

[16] J. Jacod and P. Protter (2004). Probability Essentials, 2nd edition. Springer.

[17] N. Johnson and S. Kotz (1977). Urn Models and Their Applications. Wiley Series in Probability and Mathematical Statistics.

[18] J.F.C. Kingman (1978). Uses of exchangeability. Annals of Probability, 6(2):183-197.

[19] E. Kreyszig (1999). Advanced Engineering Mathematics, 8th edition. John Wiley & Sons, Inc., New York.

[20] R.M. Neal (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265.

[21] K. Ni, J. Paisley, L. Carin and D. Dunson (2008). Multi-task learning for analyzing and sorting large databases of sequential data. IEEE Trans. on Signal Processing, to appear.

[22] J. Paisley and L. Carin (2008). Hidden Markov models with stick-breaking priors. Submitted to IEEE Trans. on Signal Processing.

[23] O. Papaspiliopoulos and G.O. Roberts (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169-186.

[24] J. Pitman (1996). Some developments of the Blackwell-MacQueen urn scheme. Statistics, Probability and Game Theory: Papers in honor of David Blackwell, T.S. Ferguson, L.S. Shapley and J.B. MacQueen, eds., 245-267.

[25] L.R. Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.

[26] A. Rodriguez, D.B. Dunson, and A.E. Gelfand (2008). The nested Dirichlet process. Journal of the American Statistical Association, to appear.

[27] J. Sethuraman (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.

[28] Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581.

[29] Y. Xue, X. Liao, L. Carin and B. Krishnapuram (2007). Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63.

[30] M. West, P. Muller and M.D. Escobar (1994). Hierarchical priors and mixture models, with applications in regression and density estimation. Aspects of Uncertainty: A Tribute to D.V. Lindley, pp. 363-386.

[31] S.S. Wilks (1962). Mathematical Statistics. John Wiley, New York.