Uncovering, Understanding, and
Predicting Links
Jonathan Chang
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Electrical Engineering
Adviser: David M. Blei
November 2011
© Copyright by Jonathan Chang, 2011.
All Rights Reserved
Abstract
Network data, such as citation networks of documents, hyperlinked networks of web
pages, and social networks of friends, are pervasive in applied statistics and machine
learning. The statistical analysis of network data can provide both useful predictive
models and descriptive statistics. Predictive models can point social network mem-
bers towards new friends, scientific papers towards relevant citations, and web pages
towards other related pages. Descriptive statistics can uncover the hidden community
structure underlying a network data set.
In this work we develop new models of network data that account for both links
and attributes. We also develop the inferential and predictive tools around these
models to make them widely applicable to large, real-world data sets. One such model,
the Relational Topic Model, can predict links using only a new node’s attributes. Thus,
we can suggest citations of newly written papers, predict the likely hyperlinks of a
web page in development, or suggest friendships in a social network based only on a
new user’s profile of interests. Moreover, given a new node and its links, the model
provides a predictive distribution of node attributes. This mechanism can be used to
predict keywords from citations or a user’s interests from his or her social connections.
While explicit network data — network data in which the connections between
people, places, genes, corporations, etc. are explicitly encoded — are already ubiq-
uitous, most of these can only annotate connections in a limited fashion. Although
relationships between entities are rich, it is impractical to manually devise complete
characterizations of these relationships for every pair of entities in large, real-world
corpora. To resolve this we present a probabilistic topic model to analyze text corpora
and infer descriptions of their entities and of the relationships between those entities.
We show qualitatively and quantitatively that our model can construct and annotate
graphs of relationships and make useful predictions.
Acknowledgements
A graduate career is an endeavor which requires support from all those around you.
Friends, family, you know who you are and what I owe you (at least $2016). To all
the people in the EE and CS departments, especially the Liberty and SL@P labs, it’s
been a ball.
I want to call some special attention (in temporal order) to the faculty who have
helped me on my peripatetic journey through grad school. First off, thanks to David
August who took a chance on a clown with green hair. I made some good friends and
research contributions during my sojourn at the Liberty lab. Next I’d like to thank
Moses Charikar, Christiane D. Fellbaum, and Dan Osherson for giving me my second
chance by including me on the WordNet project when I had all but given up on
graduate school. Special thanks also go out to the members of my FPO committee:
Rob Schapire, Paul Cuff, Sanjeev Kulkarni, and Matt Salganik. Thanks for helping
me make sure my thesis is well-written and relevant.
Finally, the bulk of my thanks must be given to David Blei — a consummate advi-
sor, teacher, and all-around stand-up guy. Thanks for teaching me about variational
inference, schooling me on strange and wonderful music, and never giving up on me
and making sure I finished.
To Rory Gilmore, for being a hell of a lot smarter than me.
Contents

Abstract
Acknowledgements

1 Introduction

2 Modeling, Inference and Prediction
  2.1 Probabilistic Models
  2.2 Inference
    2.2.1 Exponential family distributions
  2.3 Example
  2.4 Prediction

3 Exponential Family Models of Links
  3.1 Background
  3.2 Pairwise Ising model
    3.2.1 Approximate inference of marginals
    3.2.2 Parameter estimation
  3.3 Evaluation
    3.3.1 Estimating marginal probabilities
    3.3.2 Making predictions
  3.4 Discussion

4 Relational Topic Models
  4.1 Relational Topic Models
    4.1.1 Modeling assumptions
    4.1.2 Latent Dirichlet allocation
    4.1.3 Relational topic model
    4.1.4 Link probability function
  4.2 Inference, Estimation and Prediction
    4.2.1 Inference
    4.2.2 Parameter estimation
    4.2.3 Prediction
  4.3 Empirical Results
    4.3.1 Evaluating the predictive distribution
    4.3.2 Automatic link suggestion
    4.3.3 Modeling spatial data
    4.3.4 Modeling social networks
  4.4 Discussion

5 Discovering Link Information
  5.1 Background
  5.2 Model
  5.3 Computation with NUBBI
    5.3.1 Inference
    5.3.2 Parameter estimation
    5.3.3 Prediction
  5.4 Experiments
    5.4.1 Learning networks
    5.4.2 Evaluating the predictive distribution
    5.4.3 Application to New York Times
  5.5 Discussion and related work

6 Conclusion

A Derivation of RTM Coordinate Ascent Updates
B Derivation of RTM Parameter Estimates
C Derivation of NUBBI coordinate-ascent updates
D Derivation of Gibbs sampling equations
  D.1 Latent Dirichlet allocation (LDA)
  D.2 Mixed-membership stochastic blockmodel (MMSB)
  D.3 Relational topic model (RTM)
  D.4 Supervised latent Dirichlet allocation (sLDA)
  D.5 Networks uncovered by Bayesian inference (NUBBI) model
Chapter 1
Introduction
In this work our aim is to apply the tools of probabilistic modeling to network data,
that is, a collection of nodes with identifiable properties, each one (possibly) connected
to other nodes. In the parlance of graph theory, we are concerned with graphs,
collections of vertices and edges, whose vertices may contain additional information.
In modeling these networks our aim is to gain insight into the structure underpinning
these networks and be able to make predictions about them.
Much of the pioneering work on the study of networks was done under the auspices
of sociological studies, i.e., the networks under consideration were social networks.
Zachary’s data on the members of a university karate club (Zachary 1977) and
Sampson’s study of social interactions among monks at a monastery (Sampson 1969)
are some early iconic works. The number and variety of data sets have grown
considerably since, from networks of dolphins (Lusseau et al. 2003) to co-authorship
networks (Newman 2006a); however, the underlying structure of the data remains the
same — a collection of nodes (people / animals / organisms / etc.) connected to one
another through some relationship (friendship / hatred / co-authorship / etc.).
In recent years, with the increasing digital representation of entities and the
relationships between them, the amount of data available to researchers has increased
and the impact of network understanding and prediction has magnified enormously.

Figure 1.1: A depiction of a subset of an online social network. Nodes represent
individuals and edges represent friendships between them.
Online social networks such as Facebook [1], LinkedIn [2], and Twitter [3] have made creating
and leveraging these networks their primary product. Consequently these online social
networks operate on a scale unimaginable to the early researchers of social networks;
the aforementioned early works have social networks on the order of tens of nodes
whereas Facebook alone has over 500 million [4].

[1] http://www.facebook.com
[2] http://www.linkedin.com
[3] http://www.twitter.com
[4] https://www.facebook.com/press/info.php?statistics, retrieved June 2011
Figure 1.1 shows what a subset of an online social network might look like. The
nodes in the graph represent people and the edges represent self-reported friendship
between members. Even in this simple example, a rich structure emerges with some
individuals belonging to tightly connected clusters while others exist on the periphery.
Characterizing this structure has been one major thrust of network research (Newman
et al. 2006b).
Figure 1.2 shows a screen capture from the online social network Facebook. In
this view, the screenshot shows some of the other nodes connected to the node that
the profile represents. The screenshot shows the variety of nodes and large number
of edges associated with a single user. For example, in this small portion of the
profile alone there are connections to nodes representing friends and family, nodes
representing workplaces, nodes representing schools, and nodes representing interests.
Again, there is a rich structure to be explored.
So far we have discussed social networks as simple graphs; however, these networks
are often richer than traditional graphs can express. In
particular, both the nodes and edges may have some content associated with them.
Even in Figure 1.2 it is clear that a single node / edge type cannot capture the
structure associated with friends vs. family or musical interest vs. sport interest.
The nodes may also have other attributes such as age or gender that can make for a
more expressive probabilistic model. Additionally, users on online social networks may
produce textual data associated with status updates or biographical prose. Figure 1.3
shows an example of a status update. The user generates some snippet of text which
is then posted online; other users may respond with comments. A collection of status
updates (and comments, etc.) comprises a corpus. Outside of social networks, one may
also consider citation networks (Figure 4.1), gene regulatory networks and many other
instances as networks whose nodes and their attendant attributes comprise a corpus.
Thus instead of referring to nodes / vertices we may refer to documents, and instead of
referring to node attributes we refer to words. Throughout this work we will be using
the language of textual analysis and the language of graph theory interchangeably.

Figure 1.2: A screenshot from a typical Facebook profile with sections annotated
(friends, interests, family, work, education). In this subset of the profile alone there
are network connections to workplaces, schools, interests, family and friends.
Understanding the nature of these connections is of immense practical and theoretical
interest.

Figure 1.3: A screenshot of a typical Facebook status update, a small user-generated
snippet of text. Other users can react to status updates by posting comments in
response.
The study of natural language has a long and rich history (see Jurafsky and Martin
(2008) for a description of many modern techniques for analyzing language). One
modern technique to analyze language that we shall leverage throughout this work is
topic modeling (Blei et al. 2003b). Topic modeling, to be described in more detail
in Section 4.1, is a latent mixed-membership model. It presupposes the existence
of latent themes or topics which characterize how words tend to occur with one
another. Documents are then merely specific realizations of ensembles of these themes.
Figure 1.4 depicts how the approach assumes topics on the left (denoted by β) and
an ensemble of these themes for each document (right). It is this ensemble that
determines the words in the document that we observe.
We have thus far described two incomplete perspectives of data; one which is
centered around documents and another centered around graphs. What we propose in
this thesis is a set of techniques for modeling these data with a complete perspective
that takes both of these aspects of the data into account. We also develop methods
for determining the unknowns in these models. We show that once so determined,
these models provide useful insights into the structure underpinning the data and are
able to make predictions about unseen nodes, edges, and attributes.

Figure 1.4: A depiction of the assumptions underlying topic models. Topic models
presuppose latent themes (left; e.g., topics β_1, β_2, β_3 collecting legal terms such
as “lawyer, justice, judge, investigate, prosecutor,” sports terms such as “game,
coach, player, play, match,” and political terms such as “republican, democrat,
senate, campaign, mayor”) and documents (right; here a news snippet with topic
proportions θ_d, per-word topic assignments z_{d,1}, z_{d,2}, and observed words
w_{d,1:N}). Documents are a composition of latent themes; this composition
determines the words in the document that we observe.
In Chapter 2 we lay the groundwork for our technique by first describing a general
framework in which to define and speak about probabilistic models. We follow up
by describing a set of tools for using data to uncover the unknown aspects of these
models. Then we describe how this can be used to analyze and make predictions about
data. In Chapter 3 we dive into a specific model which has wide applicability not
only to networks but also to a variety of other data. The challenge with this model
has always been the computational complexity of uncovering likely values for the
latent parameters. We introduce a technique which is able to vastly improve on the
state-of-the-art in terms of computation speed, while sacrificing very little accuracy,
thus making these models much more applicable to the large networks in which we
are interested.
In Chapter 4, we introduce the Relational Topic Model, a model specifically
designed to analyze collections of documents with connections between them (or
alternatively graphs with both edges and attributes). It leverages the aforementioned
topic modeling infrastructure but extends it so that the model can offer a unified view
of both links and content. We show that the model can make statements about new
nodes, for example predicting the content of a document based solely on its citations
or predicting additional citations based on its content. Further, it can be used to find
hidden community structure, and we analyze these features of the model on several
data sets.
The work in Chapter 4 presupposes a network in which most links have already
been observed. However, it is often the case that we have only textual content and we
would like to build out this network. Chapter 5 explores the construction of networks
based purely on text. By looking at the content associated with each node, as well as
content appearing around pairs of nodes we are able to infer descriptions of individual
entities and of the relationship between those entities. With the inference machinery
we develop we can apply the model to large corpora such as Wikipedia and show that
the model can construct and annotate graphs and make useful predictions.
Chapter 2
Modeling, Inference and Prediction
Throughout this work our approach will be to
1. define a probabilistic model with certain unknown parameters for data of a
particular character;
2. perform inference, that is, find values of the unknown parameters of the model
that best explain observations;
3. make predictions using a model whose parameters have been determined.
In this chapter we describe the framework in which we execute these steps. A more
detailed treatment can be found in Wainwright and Jordan (2008).
2.1 Probabilistic Models
Our approach uses the language of directed graphical models to describe probabilistic
models. Directed graphical models have been described as a synthesis of graph theory
and probability. In this framework, distributions are represented as directed, acyclic
graphs. Nodes in this graph represent variables and arrows indicate, informally, a
possible dependence between variables. (The dependence between variables can be
formally described by D-separation, which is outside the scope of this text.)

Figure 2.1: The language of graphical models. (a) Unobserved variable named Z.
(b) Observed (indicated by shading) variable named X. (c) Variable V possibly
dependent on variable U (indicated by arrow). (d) Variable Y replicated N times
(indicated by box).
The constituents of directed graphical models are
1. unshaded nodes indicating unobserved variables whose names are enclosed in the
circle;
2. shaded nodes indicating observed variables;
3. arrows between nodes indicating a possible dependence between variables;
4. boxes depicting replication.
These are shown in Figure 2.1.
Associated with each node is a conditional probability distribution over the variable
represented by that node. That probability distribution is conditioned on the variables
represented by that node’s parents. That is, letting x_i represent the variable associated
with the i-th node,

p_i(x_i \mid x_{j \in parents(i)})   (2.1)
describes the distribution of xi. The full joint distribution of the entire graphical
model can thus be written as

p(x) = \prod_i p_i(x_i \mid x_{j \in parents(i)}).   (2.2)
Note that it is straightforward to evaluate the probability of a state in this formalism;
one need only take the product of the evaluation of each pi. This formalism also makes
it convenient to simulate draws from this distribution by drawing each constituent
variable in topological order. Because each of the variables that x_i is conditioned on is a
parent, and all parent variables are guaranteed to have fixed values by dint of the
topological sort, x_i can be simulated by doing a single draw from p_i.
This also means it is straightforward to describe each probability distribution as a
generative process, that is, a sequence of probabilistic steps by which the data were
hypothetically generated. The intermediate steps of the generative process create
unobserved variables while the final step generates the observed data, i.e., the leaves
of the graph. This construction will be of particular interest in the sequel.
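To make this concrete, the following minimal sketch (ours, not part of the original text; the two-node graph and its samplers are illustrative, hypothetical choices) simulates a draw from a directed graphical model by visiting nodes in topological order:

import random

# Hypothetical DAG: each node maps to (parents, sampler); the sampler draws the
# node's value given its parents' values, i.e., a single draw from p_i.
graph = {
    "z": ([], lambda: random.random() < 0.5),                       # root: p(z)
    "x": (["z"], lambda z: random.gauss(3.0 if z else -3.0, 1.0)),  # leaf: p(x|z)
}

def ancestral_sample(graph, topological_order):
    """Simulate one joint draw by sampling each node after all of its parents."""
    values = {}
    for node in topological_order:
        parents, sampler = graph[node]
        values[node] = sampler(*(values[p] for p in parents))
    return values

print(ancestral_sample(graph, ["z", "x"]))  # e.g. {'z': True, 'x': 2.71}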
2.2 Inference
With a probability distribution thus defined our goal is to find values of unobserved
variables which explain observed variables. More formally, we are interested in finding
the posterior distribution of hidden variables (z) conditioned on observed variables
(x).
p(z|x) (2.3)
For all but a few special cases, it is computationally prohibitive to compute this
exactly. To see why, let us recall the definition of marginalization,
p(z|x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\sum_{z'} p(x, z')}.
As mentioned in the previous section, evaluating the joint distribution p(x, z) is
straightforward. However, to compute the posterior probability we must evaluate the
joint probability across all possible values of z′. Since the number of possible values
of z′ increases exponentially with the number of variables comprising z′, this quickly
becomes prohibitive.
Thus we turn to approximate methods. There are many approaches to approx-
imating the posterior such as Markov Chain Monte Carlo (MCMC) (Neal 1993).
However, we will use variational approximations in this work because they do not
rely on stochasticity, are amenable to various optimization approaches, and have been
empirically shown to achieve good approximations.
Variational methods approximate the true posterior, p(z|x), with an approximate
posterior, q(z). The approximation chosen is that distribution which is in some sense
“closest” to the true distribution. The definition of closeness used is Kullback-Leibler
(KL) divergence,
KL(q(z) \| p(z|x)) = \sum_z q(z) \log \frac{q(z)}{p(z|x)}   (2.4)
= -\sum_z q(z) \log \frac{p(z|x)}{q(z)}
\ge -\log \sum_z q(z) \frac{p(z|x)}{q(z)}
= -\log \sum_z p(z|x)
= -\log 1
= 0,   (2.5)
where the inequality follows from Jensen’s inequality. This choice of distance can be
intuitively justified in several ways. One is to rewrite the KL-divergence as
KL(q(z) \| p(z|x)) = -E_q[\log p(z|x)] - H(q),   (2.6)
where H () denotes entropy. Thus KL-divergence promotes distributions q which
look “similar” to p while adding an entropy regularization. Another justification of
KL-divergence arises from its relationship to the likelihood of observed data,
KL(q(z) \| p(z|x)) = -E_q[\log p(z|x)] - H(q)
= -E_q\left[\log \frac{p(z, x)}{p(x)}\right] - H(q)
= -E_q[\log p(z, x)] + E_q[\log p(x)] - H(q)
= \log p(x) - E_q[\log p(z, x)] - H(q).   (2.7)
This representation first implies that the problem can be expressed as finding the
distance between the variational distribution and the joint distribution rather than
the posterior distribution. Second, it shows that this distance can be used to form an
evidence lower bound (ELBO); as the distance decreases, the lower bound on the
likelihood of our data increases.
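As a numerical sanity check on Equation (2.7), the sketch below (ours; the three-state toy distribution is invented purely for illustration) verifies that log p(x) decomposes into the ELBO plus the KL divergence for an arbitrary q:

import numpy as np

# Toy joint p(z, x) over three values of z, for one fixed observation x.
p_joint = np.array([0.3, 0.1, 0.2])
log_px = np.log(p_joint.sum())          # evidence log p(x)
posterior = p_joint / p_joint.sum()     # true posterior p(z|x)

q = np.array([0.5, 0.25, 0.25])         # an arbitrary variational distribution

elbo = np.dot(q, np.log(p_joint)) - np.dot(q, np.log(q))  # E_q[log p(z,x)] + H(q)
kl = np.dot(q, np.log(q / posterior))                     # KL(q(z) || p(z|x))

assert np.isclose(log_px, elbo + kl)  # Equation (2.7), rearranged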
Our objective function now is to find q∗ such that
q^*(z) = \argmin_{q \in Q} KL(q(z) \| p(z|x)).   (2.8)
Note that this is trivially minimized when q∗(z) = p(z|x), the true posterior.
Therefore, the optimization problem as formulated is equivalent to posterior inference.
But since this is intractable, a tractable approximation is made by restricting the
search space Q. A common choice is the family of factorized distributions,
q(z) = \prod_i q_i(z_i).   (2.9)
This choice of Q is often termed a naïve variational approximation. This expression
is convenient since

H(q) = -E_q\left[\log \prod_i q_i(z_i)\right]
= -E_q\left[\sum_i \log q_i(z_i)\right]
= -\sum_i E_q[\log q_i(z_i)]
= \sum_i H(q_i).   (2.10)
Further, recall from the discussion above that in a generative process all of the
observations (x) appear as leaves of the graph. Therefore the expected log joint
probability can be expressed as
E_q[\log p(z, x)] = E_q\left[\log \prod_i p_i(z_i \mid z_{j \in parents(i)}) \prod_{i'} p_{i'}(x_{i'} \mid z_{j \in parents(i')})\right]
= \sum_i E_q\left[\log p_i(z_i \mid z_{j \in parents(i)})\right] + \sum_{i'} E_q\left[\log p_{i'}(x_{i'} \mid z_{j \in parents(i')})\right].
Note that because of marginalization the expectation of term pi depends only on
{qj(zj)|j ∈ parents(i)} if i is a leaf node, and {qj(zj)|j ∈ parents(i) ∪ {i}} otherwise.
Optimizing this with respect to a common choice for pi warrants further elucidation
below.
2.2.1 Exponential family distributions
Exponential family distributions are a class of distributions which take a particular
form. This form encompasses many common distributions and is convenient to optimize
with respect to the objective described in the previous section. Exponential family
distributions take the following form:
p(x|\eta) = \frac{\exp(\eta^T \phi(x))}{Z(\eta)}.   (2.11)
The normalization constant Z(η) is chosen so that the distribution sums to one.
The vector η is termed the natural parameters while φ(x) are the sufficient statistics.
Figure 2.2 helps illustrate how common distributions such as the Gaussian and
the Beta can be expressed in this representation.
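To make the representation concrete, the following sketch (ours, mirroring the Gaussian formulas summarized in Figure 2.2; the function names are our own) maps the common Gaussian parameterization to natural parameters and evaluates the density through the exponential family form:

import numpy as np
from scipy.stats import norm

def gaussian_natural_params(mu, sigma):
    """Natural parameters eta for the Gaussian with phi(x) = <x^2, x>."""
    return np.array([-1.0 / (2.0 * sigma**2), mu / sigma**2])

def exp_family_gaussian_pdf(x, eta):
    """p(x|eta) = exp(eta . phi(x)) / Z(eta), with Z as in Figure 2.2(a)."""
    Z = np.sqrt(-np.pi / eta[0]) * np.exp(-eta[1]**2 / (4.0 * eta[0]))
    return np.exp(eta[0] * x**2 + eta[1] * x) / Z

eta = gaussian_natural_params(mu=1.0, sigma=2.0)
assert np.isclose(exp_family_gaussian_pdf(0.5, eta),
                  norm.pdf(0.5, loc=1.0, scale=2.0))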
Figure 2.2: Two exponential family distributions; the title of each panel shows the
value of the natural parameters for the depicted distribution. (a) The Gaussian
distribution has sufficient statistics φ(x) = ⟨x^2, x⟩. The natural parameters are
related to the common parameterization by η = ⟨−1/(2σ^2), µ/σ^2⟩. The normalization
constant is Z = \sqrt{-\pi/\eta_1} \exp(-\eta_2^2/(4\eta_1)). (b) The Beta distribution
has sufficient statistics φ(x) = ⟨log(x), log(1 − x)⟩. The natural parameters are
related to the common parameterization by η = ⟨α − 1, β − 1⟩. The normalization
constant is Z = Γ(η_1 + 1)Γ(η_2 + 1)/Γ(η_1 + η_2 + 2).

Figure 2.3: A directed graphical model representation of a Gaussian mixture model.

The structure of the exponential family representation allows for these distributions
to be easily manipulated in the variational optimization above. In particular,

E_q[\log p(x, z)] = E_q\left[\log \frac{\exp(\eta^T \phi(x, z))}{Z(\eta)}\right]
= E_q[\eta^T \phi(x, z)] - E_q[\log Z(\eta)]
= E_q[\eta^T] \, E_q[\phi(x, z)] - E_q[\log Z(\eta)],   (2.12)
where the last line follows by independence under a fully-factorized variational distri-
bution. (Note that q is a distribution over both sets of latent variables in the model,
z and η.)
2.3 Example
To illustrate the procedure described in the previous sections, we perform it on a
simple Gaussian mixture model. Figure 2.3 shows a directed graphical model for this
example. We describe the generative process as
1. For i ∈ {0, 1},
   (a) Draw µ_i ∼ Uniform(−∞, ∞).*
2. For n ∈ [N],
   (a) Draw mixture indicator z_n ∼ Bernoulli(0.5);
   (b) Draw observation x_n ∼ N(µ_{z_n}, 1).

*We set aside here the issue of drawing from an improper probability distribution.
Our goal now is to approximate the posterior distribution of the hidden variables,
p(z,µ|x), conditioned on observations x. To do so we use the factorized distribution,
q(µ, z) = r(µ_0|m_0) \, r(µ_1|m_1) \prod_n q_n(z_n|\pi_n),   (2.13)

where q_n(z_n|\pi_n) is a Bernoulli distribution with parameter \pi_n, and r(µ_i|m_i) is a
Gaussian distribution with mean mi and unit variance. With the variational family
thus parameterized, the optimization problem becomes
\argmin_{\pi, m} \; -E_q[\log p(µ, z)] - H(q).   (2.14)
To do so we first appeal to Equation 2.12 for the expected log probability of an
exponential family with our choice of parameter,
E_q[\log p(x_n|µ_i)] = -\frac{1}{2} x_n^2 + E_q[µ_i] x_n - \frac{1}{2} E_q[µ_i^2] - \frac{1}{2} \log 2\pi
= -\frac{1}{2} x_n^2 + m_i x_n - \frac{1}{2}(1 + m_i^2) - \frac{1}{2} \log 2\pi.
Since we have chosen uniform distributions for z and µ, we can express the
expected log probability of the joint as
E_q[\log p(µ, z)] = E_q\left[\log \prod_n p(x_n|µ_0)^{1-z_n} \, p(x_n|µ_1)^{z_n}\right]
= \sum_n E_q[(1 - z_n) \log p(x_n|µ_0)] + E_q[z_n \log p(x_n|µ_1)]
= \sum_n (1 - \pi_n) E_q[\log p(x_n|µ_0)] + \pi_n E_q[\log p(x_n|µ_1)]
= \sum_n (1 - \pi_n)\left(m_0 x_n - \frac{1}{2} m_0^2\right) + \pi_n\left(m_1 x_n - \frac{1}{2} m_1^2\right) + C,
where C contains terms which do not depend on either πn or mi. We also compute
the entropy terms,
H(q_n(z_n|\pi_n)) = -(1 - \pi_n) \log(1 - \pi_n) - \pi_n \log \pi_n
H(r_i(µ_i|m_i)) = \frac{1}{2} \log(2\pi e).
To optimize these expressions we take the derivative with respect to each variable,

\frac{\partial L}{\partial \pi_n} = -\frac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0) + \log \frac{\pi_n}{1 - \pi_n}
\frac{\partial L}{\partial m_0} = -\sum_n (1 - \pi_n)(x_n - m_0)
\frac{\partial L}{\partial m_1} = -\sum_n \pi_n (x_n - m_1).
Figure 2.4: 100 points drawn from the mixture model depicted in Figure 2.3 with
µ_0 = −3 and µ_1 = 3. The horizontal axis denotes observed values while the vertical
axis and coloring denote the latent mixture indicator values.
Setting these equal to zero yields the following optimality conditions,

\pi_n = \sigma\left(\frac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0)\right)
m_0 = \frac{\sum_n (1 - \pi_n) x_n}{\sum_n (1 - \pi_n)}
m_1 = \frac{\sum_n \pi_n x_n}{\sum_n \pi_n},

where \sigma(x) denotes the sigmoid function 1/(1 + \exp(-x)). This is a system of
transcendental equations which cannot be solved analytically. However, we may apply
coordinate ascent; we initialize each variable to some guess and repeatedly cycle
through variables, optimizing them one at a time while holding the others fixed.
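A minimal sketch of this coordinate ascent (ours; the initialization and iteration count are arbitrary choices, not prescribed by the derivation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coordinate_ascent(x, iters=20, m0=-1.0, m1=1.0):
    """Cycle through the optimality conditions for pi_n, m_0, and m_1."""
    for _ in range(iters):
        pi = sigmoid(0.5 * (m1 - m0) * (2.0 * x - m1 - m0))  # update each q_n(z_n)
        m0 = np.sum((1.0 - pi) * x) / np.sum(1.0 - pi)       # update r(mu_0)
        m1 = np.sum(pi * x) / np.sum(pi)                     # update r(mu_1)
    return m0, m1, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
m0, m1, _ = coordinate_ascent(x)
print(m0, m1)  # approaches the true means -3 and 3, as in Figure 2.5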
Figure 2.5: Estimated values of m_0 and m_1 as a function of iteration using coordinate
ascent. The variational method is able to quickly recover the true values of these
parameters (shown as dashed lines).
Figure 2.6: The mixture model of Figure 2.3 augmented with an additional unobserved
datum to be predicted.
Figure 2.4 shows the result of simulating 100 draws from the distribution to be
estimated. The distribution has µ_0 = −3 and µ_1 = 3. The horizontal axis denotes
observed values while the vertical axis and coloring denote the latent mixture indicator
values. Figure 2.5 shows the result of applying the variational method with coordinate
ascent estimation. The series show the estimated values of m_i as a function of iteration.
The approach is able to quickly find the parameters of the true generating distributions
(dashed lines).
2.4 Prediction
With an approximate posterior in hand, our goal is often to make predictions about
data we have not yet seen. That is, given some observed data x1:N we wish to evaluate
the probability of an additional datum xN+1,
p(x_{N+1} \mid x_{1:N}).   (2.15)
This desideratum is illustrated in Figure 2.6 for the case of the Gaussian mixture
of the previous section. On the right hand side another unobserved instance of a
draw from the mixture model has been added as the datum to be predicted. One way
of approaching the problem is to marginalize the predictive distribution over the
latent variables,

p(x_{N+1}|x_{1:N}) = \sum_{z_{N+1}} \sum_{z_{1:N}} p(x_{N+1}, z_{N+1}|z_{1:N}) \, p(z_{1:N}|x_{1:N})
= \sum_{z_{N+1}} E_p[p(x_{N+1}, z_{N+1}|z_{1:N})]
\approx \sum_{z_{N+1}} E_q[p(x_{N+1}, z_{N+1}|z_{1:N})],   (2.16)
where the expectation on the second line is taken with respect to the true posterior
of the observed data, p(z1:N |x1:N ) and the expectation on the third line is taken with
respect to the variational approximation to the posterior, q(z1:N).
In the case of the Gaussian mixture, this expression is
p(x_{N+1}|x_{1:N}) \approx \frac{1}{2} E_q[p(x_{N+1}|µ_1)] + \frac{1}{2} E_q[p(x_{N+1}|µ_0)]
= \frac{1}{2} p(x_{N+1}|m_1) + \frac{1}{2} p(x_{N+1}|m_0).   (2.17)
The efficacy of this approach is demonstrated in Figure 2.7 wherein we empirically
estimate the expected value of p(xN+1|x1:N) by drawing an additional M values and
taking their average. The dashed line shows the expectation estimated using the
variational approximation.
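The two estimators being compared can be sketched as follows (our illustration, not part of the original text; it assumes the m_0, m_1 produced by the coordinate-ascent sketch above, and the Monte Carlo routine draws fresh data from the true mixture):

import numpy as np

def predictive_mean_variational(m0, m1):
    """Mean of the approximate predictive distribution in Equation (2.17)."""
    return 0.5 * m0 + 0.5 * m1

def predictive_mean_monte_carlo(rng, mu0, mu1, M):
    """Empirical estimate: average M additional draws from the true mixture."""
    z = rng.random(M) < 0.5
    return np.mean(np.where(z, rng.normal(mu1, 1, M), rng.normal(mu0, 1, M)))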
We have now described a framework for defining probabilistic models, inferring
the values of their unknowns using data, and taking the model and inferred values
to provide predictions about unseen data. In the following chapters we leverage this
framework to model, understand, and make predictions about networked data.
Figure 2.7: Estimated expected value of p(x_{N+1}|x_{1:N}) taken by averaging M random
draws from this function. The dashed line shows the value of this expectation estimated
by the variational approximation.
Chapter 3
Exponential Family Models of
Links
The first models of networks we explore are binary Markov random fields. These
models are widely used to model correlations between binary random variables. While
generally useful for a wide variety of applications, in this chapter we focus on applying
these models to collections of documents which contain words and/or links. In a binary
Markov random field, each document is treated as a collection of binary variables;
these binary variables may correspond to the presence of words in a document or the
presence of a citation to another document. Modeling the correlations between these
variables allows us to predict new words or new connections for documents.
However, their application to large-scale data sets has been hindered by their
intractability; both parameter estimation and inference are prohibitively expensive
on many large real-world data sets. In this chapter we present a new method to
perform both of these tasks. Leveraging a novel variational approximation to compute
approximate gradients, our technique is accurate yet computationally simple. We
evaluate our technique on both synthetic and real-world data and demonstrate that
we are able to learn models comparable to the state-of-the-art in a fraction of the
time.
3.1 Background
Large-scale models of co-occurrence are increasingly in demand. They can be used
to model the words in documents, connections between members of social networks,
or the structure of the human brain; these models can then lead to new insights into
brain function, suggest new friendships, or discover latent patterns of language usage.
The Ising model (Ising 1925) is a model of co-occurrence for binary vectors which
has been successfully applied to a variety of domains such as signal processing (Besag
1986), natural language processing (Takamura et al. 2005), genetics (Majewski et al.
2001), biological sensing (Shi and Duke 1998; Besag 1975), and computer vision (Blake
et al. 2004). Practitioners of the Ising model are limited, however, in the size of the
data sets to which the model can be applied. Many modern data sets and applications
require models with millions of parameters. Unfortunately, estimating the model’s
probabilities and optimizing its parameters are both #P-complete problems (Welsh
1990).
In response to its intractability, there has been a rich body of work on approximate
inference and optimization for the Ising model. The most common approaches have
been sampling-based (Geman and Geman 1984) of which contrastive divergence is
the most recent incarnation (Carreira-Perpinan and Hinton 2005; Welling and Hinton
2002). Other approaches include max-margin (Taskar et al. 2004a) and exponentiated
gradient (Globerson et al. 2007), expectation propagation (Minka and Qi 2003),
various relaxations (Fisher 1966; Globerson and Jaakkola 2007; Wainwright and
Jordan 2006; Kolar and Xing 2008; Sontag and Jaakkola 2007), as well as loopy belief
propagation (Pearl 1988; Murphy et al. 1999; Yedidia et al. 2003; Szeliski et al. 2008)
and its extensions (Wainwright et al. 2003; Welling and Teh 2001; Kolmogorov 2006).
In this chapter we present a new approach which is substantially faster and has
accuracy comparable to state-of-the-art methods. Our approach employs iterative
scaling (Dudík et al. 2007) and a new technique for approximating the gradients of the
log partition function of the Ising model. This approximation technique is inspired
by variational mean field methods (Jordan et al. 1999; Wainwright and Jordan 2003).
While these methods have been applied to a variety of models (Jaakkola and Jordan
1999; Saul and Jordan 1999; Bishop et al. 2002) including the Ising model, we will
show that our technique produces more accurate estimates of marginals and that this
in turn produces models with higher predictive accuracy. Further, our approximation
has a simple mathematical form which can be computed much more quickly. This
allows us to apply the Ising model to large models with millions of parameters.
Because of the large parameter space, our model also employs ℓ_1 + ℓ_2^2 feature
selection penalties to achieve sparse parameter estimates. This penalty is used in
linear models under the name elastic nets (Zou and Hastie 2005). Feature selection
penalties have an extensive history (Lafferty and Wasserman 2008; Malouf 2002). The
ℓ_1 penalty, in particular, has been a popular approach to obtaining sparse parameter
vectors (Friedman et al. 2007; Meinshausen and Bühlmann 2006; Wainwright et al.
2006). However, theory of regularized maximum likelihood estimation also indicates
that it is often beneficial to use ℓ_2^2 regularization (Dudík et al. 2007). Regularizations
of this form have been extensively applied (Chen and Rosenfeld 2000; Goodman 2004;
Riezler and Vasserman 2004; Haffner et al. 2006; Andrew and Gao 2007; Kazama and
Tsujii 2003; Gao et al. 2006).
This chapter is organized as follows. In Section 3.2, we describe the Ising model
and our procedure for approximating the marginals of the model and fitting its
parameters by approximate maximum a posteriori point estimation. In Section 3.3,
we compare the accuracy/speed trade-off of our model with several others on synthetic
and large real-world corpora. We show that our method provides parameter estimates
comparable with those of state-of-the-art techniques, but in much less time. This
enables the application of the Ising model to new data sets and application areas
which were previously out of reach. We summarize these findings in Section 3.4.
3.2 Pairwise Ising model
We study the exponential family known as the pairwise Ising model or binary Markov
random field which has long been used in physics to model ensembles of particles with
pairwise interactions. Our motivation is to characterize the co-occurrence of items
within “unordered bags” such as the co-occurrence of citations or keywords in research
papers. Such bags are represented by a binary vector x ∈ {0, 1}^n with components
x_i indicating presence of each item. The pairwise Ising model is parameterized by
κ ∈ R^n and λ ∈ R^{n(n−1)} controlling frequencies of individual items and frequencies of
their co-occurrence as
p_{κ,λ}(x) = \frac{1}{Z_{κ,λ}} \exp\left[\sum_{i=1}^{n} κ_i x_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j \ne i} λ_{ij} x_i x_j\right].
We assume throughout that λij = λji. Here, Zκ,λ denotes the normalization constant
ensuring that probabilities sum to one. For general settings of κ and λ, the exact
calculation of the normalization constant Z_{κ,λ} requires summation over 2^n possible
values of x, which becomes intractable for even moderate sizes of n. Since the normal-
ization constant Zκ,λ is required to calculate expectations and evaluate likelihoods,
basic tasks such as inference of marginals and parameter estimation cannot be carried
out exactly and require approximation. We propose a novel technique to approximate
marginals of the Ising model and a new procedure to learn its parameters. Since
learning of parameters relies on inference of marginals as a subroutine, we first present
the marginal approximation.
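For intuition, and for the small-n comparisons of Section 3.3, the exact quantities can be computed by brute force when n is small. A sketch of ours (not the method proposed in this chapter), assuming λ is symmetric with zero diagonal:

import itertools
import numpy as np

def ising_exact(kappa, lam):
    """Exact normalizer and marginals of a pairwise Ising model by enumerating
    all 2^n states; tractable only for small n."""
    n = len(kappa)
    Z, single, pair = 0.0, np.zeros(n), np.zeros((n, n))
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        w = np.exp(kappa @ x + 0.5 * x @ lam @ x)  # unnormalized probability
        Z += w
        single += w * x             # accumulates p(x_i = 1)
        pair += w * np.outer(x, x)  # accumulates p(x_i = x_j = 1)
    return single / Z, pair / Z, Z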
3.2.1 Approximate inference of marginals
Our approach begins with the naïve mean field approximation (Wainwright and Jordan
2005b; Jordan et al. 1999). While naïve mean field approximations may provide good
estimates of singleton marginals pκ,λ(xi), they often provide poor estimates of pairwise
marginals pκ,λ(xi, xj). Our technique corrects these estimates using an augmented
variational family. By combining the richness of the augmented variational family with
the computational simplicity of the naıve mean field, our technique yields accurate
estimates that can be computed efficiently.
In the sequel we first present the naïve mean field and then our improved approximation.
Naïve mean field

Naïve mean field approximates the Ising model p_{κ,λ} by a distribution q^{MF} with a
factored representation across components x_i,

q^{MF}(x) = \prod_i q^{MF}_i(x_i).
Among all distributions of the form above, naïve mean field algorithms seek the
distribution q^{MF} which minimizes the KL divergence from the true distribution p_{κ,λ},

q^{MF} = \argmin_{q^{MF}} D(q^{MF} \| p_{κ,λ}).   (3.1)
Here D(q‖p) = Eq[ln(q/p)] denotes the KL divergence, which measures information-
theoretic discrepancy between densities q and p. Since Equation (3.1) is not convex,
it is usually solved by alternating minimization in each coordinate—a procedure
which only yields a local minimum. In each individual coordinate, the objective of
Equation (3.1) can be minimized exactly by setting the derivatives to zero, yielding
the update

q^{MF}_i(x_i) \propto \exp\left(κ_i x_i + \sum_{j \ne i} λ_{ij} x_i \, q^{MF}_j(x_j = 1)\right).   (3.2)
For the derivation see for example Wainwright and Jordan (2005b).
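Writing q_i for q^{MF}_i(x_i = 1), Equation (3.2) reduces to the fixed point q_i = σ(κ_i + Σ_{j≠i} λ_{ij} q_j). A minimal sketch of the resulting iteration (ours; damping, convergence checks, and update order are omitted):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def naive_mean_field(kappa, lam, iters=50):
    """Coordinate updates of Equation (3.2); q[i] approximates p(x_i = 1).
    Assumes lam is symmetric with a zero diagonal."""
    q = sigmoid(np.array(kappa, dtype=float))  # start from the independent model
    for _ in range(iters):
        for i in range(len(q)):
            q[i] = sigmoid(kappa[i] + lam[i] @ q)  # j = i excluded by lam[i,i] = 0
    return q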
A chief advantage of naïve mean field is its simplicity and the speed of conver-
gence. However, compared with other approximation techniques such as loopy belief
propagation, the naïve mean field solution q^{MF} may yield poor approximations to the
pairwise marginals pκ,λ(xi, xj) (in Section 3.3 we demonstrate this empirically). Since
pairwise marginals are needed for parameter estimation, this is a major drawback.
Our approach
Our approach, Fast Learning of Ising Models (FLIM), takes advantage of the rapid
convergence of naïve mean field while correcting its estimates of pairwise marginals.
When estimating the marginal p_{κ,λ}(x_i, x_j) for a fixed pair i, j, we propose replacing
the product density q^{MF} in Equation (3.1) by a richer family

q^{(ij)}(x) = q^{(ij)}_{ij}(x_i, x_j) \prod_{k \ne i,j} q^{(ij)}_k(x_k).
This is similar to the approach known as structured mean field (Saul and Jordan
1996). However, we take advantage of the approximate singleton marginals q^{MF}_k(x_k)
provided by naïve mean field which, unlike pairwise marginals, provide sufficiently
good approximations of the true singleton marginals p_{κ,λ}(x_k). We minimize the KL
divergence from p_{κ,λ} under the constraint that q^{(ij)}_k(x_k) equal q^{MF}_k(x_k):

q^{(ij)} = \argmin_{q^{(ij)}} D(q^{(ij)} \| p_{κ,λ})
\text{s.t. } q^{(ij)}_k(x_k) = q^{MF}_k(x_k) \text{ for all } k \ne i, j.   (3.3)
Note that the only undetermined portion of q^{(ij)} is q^{(ij)}_{ij}. This can be solved explicitly
by setting derivatives equal to zero, yielding

q^{(ij)}_{ij}(x_i, x_j) \propto \exp\Big(κ_i x_i + κ_j x_j + λ_{ij} x_i x_j + \sum_{k \ne i,j} (λ_{ik} x_i + λ_{jk} x_j) \, q^{MF}_k(x_k = 1)\Big).   (3.4)

Given the naïve mean field solution q^{MF}, it is possible to calculate all corrected pairwise
marginals q^{(ij)}_{ij} in time O(n^2) by using auxiliary values

rowsum_i = \sum_{k \ne i} λ_{ik} \, q^{MF}_k(x_k = 1).

Thus, each q^{(ij)}_{ij} is calculated in constant amortized time.
Note that if we have access to estimates of marginals p_{κ,λ}(x_k) other than those
given by naïve mean field, we can use them instead of q^{MF}_k in Equations (3.3) and
(3.4).
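A sketch of this correction (ours, under the same symmetric, zero-diagonal assumption on λ): after precomputing the row sums, each corrected pairwise marginal is a constant-time normalization over the four joint states of (x_i, x_j).

import numpy as np

def flim_pairwise_marginals(kappa, lam, q):
    """Corrected q_ij(x_i = 1, x_j = 1) from Equation (3.4), given singleton
    marginals q (e.g., from naive mean field)."""
    n = len(kappa)
    rowsum = lam @ q  # rowsum[i] = sum_k lam[i,k] * q[k]
    pair = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a_i = kappa[i] + rowsum[i] - lam[i, j] * q[j]  # drop the k = j term
            a_j = kappa[j] + rowsum[j] - lam[j, i] * q[i]  # drop the k = i term
            # unnormalized q_ij over states (0,0), (1,0), (0,1), (1,1)
            w = np.array([1.0, np.exp(a_i), np.exp(a_j),
                          np.exp(a_i + a_j + lam[i, j])])
            pair[i, j] = pair[j, i] = w[3] / w.sum()
    return pair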
3.2.2 Parameter estimation
The main task we study is the problem of estimating parameters κ and λ from data.
As we will see, this necessitates calculation of pairwise marginals which we derived in
the previous section.
The data consists of a set of observations x^1, x^2, \ldots, x^D generated by an Ising
model p(x \mid κ, λ) = p_{κ,λ}(x). We posit a prior p(κ, λ), and estimate κ and λ as
maximizers of the posterior

p(λ, κ \mid \{x^d\}) \propto p(κ, λ) \prod_{d=1}^{D} p(x^d \mid κ, λ).   (3.5)
We consider the factored prior

p(κ, λ) = \left(\prod_i p(κ_i)\right)\left(\prod_{i,j} p(λ_{ij})\right),

with

p(κ_i) \propto \exp(κ_i)
p(λ_{ij}) \propto \exp(-β_1 |λ_{ij}| - β_2 λ_{ij}^2),   (3.6)

where β_1 and β_2 are hyperparameters. The prior over κ_i corresponds to Laplace
smoothing of empirical counts (however, note that it is improper). The prior over λ_{ij}
corresponds to regularization with an ℓ_1-norm term and an ℓ_2^2-norm, used in linear
models under the name elastic nets (Zou and Hastie 2005). This prior encourages
parameter vectors which exhibit both sparsity and grouping.
Combining Equation (3.5) and Equation (3.6), we obtain the following expression
for the log posterior:

\ln p(κ, λ \mid \{x^d\}) = \left(\sum_i κ_i\right) - β_1 \|λ\|_1 - β_2 \|λ\|_2^2
+ \sum_{d=1}^{D} \left[\left(\sum_{i=1}^{n} κ_i x^d_i\right) + \frac{1}{2}\left(\sum_{i=1}^{n} \sum_{j \ne i} λ_{ij} x^d_i x^d_j\right) - \ln Z_{κ,λ}\right] + const.   (3.7)
We optimize Equation (3.7) by a version of the algorithm PLUMMET (Dudík et al.
2007). This algorithm in each iteration updates κ and λ to new values κ′ and λ′ that
optimize a lower bound on Equation (3.7). More precisely, λ′_{ij} = λ_{ij} + δ_{ij} where

δ_{ij} = \argmax_δ \left[-\tilde{µ} e^δ + δ \hat{µ} - β_1 |λ_{ij} + δ| - β_2 (λ_{ij} + δ)^2\right],   (3.8)

where \hat{µ} denotes the empirical co-occurrence count

\hat{µ} = \sum_d x^d_i x^d_j,

while \tilde{µ} is the estimate of this count, \tilde{µ} = D \, E_{κ,λ}[x_i x_j]. We approximate the expectation
E_{κ,λ}[x_i x_j] using the technique of the previous section.
The objective of Equation (3.8) is concave in δ and therefore we can find its
maximizer by setting its derivative to zero,

-\tilde{µ} e^δ + \hat{µ} - β_1 \, \mathrm{sign}(λ_{ij} + δ) - 2 β_2 (λ_{ij} + δ) = 0.   (3.9)

This can be solved explicitly using the Lambert W function, denoted W(z), which for
a given z ≥ −e^{−1} represents the unique value W(z) ≥ −1 such that W(z) e^{W(z)} = z.
Using this definition it is straightforward to prove the following lemma, which can then
be used to solve Equation (3.9).
Lemma 3.2.1. For b > 0, the identity x = a − b e^x holds if and only if x = a − W(b e^a).
Rearranging Equation (3.9) to match the lemma, we now just need to carry out
the case analysis according to the sign of λ_{ij} + δ and consider possibilities

δ_+ = \frac{\hat{µ} - β_1}{2β_2} - W\left(\frac{\tilde{µ} e^{-λ_{ij}}}{2β_2} \exp\left(\frac{\hat{µ} - β_1}{2β_2}\right)\right) - λ_{ij}
δ_- = \frac{\hat{µ} + β_1}{2β_2} - W\left(\frac{\tilde{µ} e^{-λ_{ij}}}{2β_2} \exp\left(\frac{\hat{µ} + β_1}{2β_2}\right)\right) - λ_{ij}
δ_0 = -λ_{ij}.

We choose δ_+ if λ_{ij} + δ_+ > 0, δ_- if λ_{ij} + δ_- < 0, and δ_0 otherwise.
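A sketch of one coordinate update (ours, using scipy's Lambert W routine; mu_hat and mu_tilde denote the empirical and estimated counts \hat{µ} and \tilde{µ} defined above):

import numpy as np
from scipy.special import lambertw

def coordinate_delta(lam_ij, mu_hat, mu_tilde, beta1, beta2):
    """Solve Equation (3.9) for delta via Lemma 3.2.1 and the sign case analysis."""
    def delta(sign):
        a = (mu_hat - sign * beta1) / (2.0 * beta2) - lam_ij
        w = float(np.real(lambertw(mu_tilde / (2.0 * beta2) * np.exp(a))))
        return a - w  # Lemma 3.2.1: x = a - W(b e^a)
    d_plus, d_minus = delta(+1.0), delta(-1.0)
    if lam_ij + d_plus > 0:
        return d_plus
    if lam_ij + d_minus < 0:
        return d_minus
    return -lam_ij  # the sparse case: lambda_ij + delta = 0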
3.3 Evaluation
In this section we first apply our technique for performing marginal inference to a
synthetic test case. We compare our technique with several competing techniques
on both accuracy and speed. We then evaluate our entire parameter estimation
procedure on two large-scale, real-world data sets and show that models trained using
our procedure perform comparably with the state-of-the-art at making predictions
about unseen data.
Throughout this section we will compare the following five approaches:
Baseline No training is done for the parameters which govern pairwise correlations
λ, i.e., λ is set to 0.
NMF This method uses a naïve mean field to approximate pairwise expectations. As
described in Section 3.2, this method approximates the true model with one in
which all variables are decoupled. Because the implied Markov random field has
no edges, it cannot capture pairwise behavior.
BP Loopy belief propagation (Yedidia et al. 2003) is a message passing algorithm
that optimizes an approximation to the log partition function based on Bethe
energies. Because it must compute O(n2) messages each iteration, it can be
comparatively slow.
FLIM-NMF FLIM-NMF (Fast Learning of Ising Models) is our proposal for esti-
mating pairwise and singleton marginals described in Section 3.2. The estimates
are the solutions to a variational approximation where singleton marginals are
constrained to be equal to the marginals adduced by naïve mean field.
FLIM-Z FLIM-Z is similar to FLIM-NMF except that the singleton marginals are
constrained to be equal to the marginals when the pairwise correlations λ = 0,
i.e., σ(κ). This is an effective approximation to FLIM-NMF when λ is close to
zero. FLIM-Z is faster than FLIM-NMF since it does not require first solving
34
the naıve mean field variational problem.
3.3.1 Estimating marginal probabilities
To evaluate how well each of the approaches approximates the singleton marginals
p(xi) and pairwise marginals p(xi, xj) we generated a model with 24 nodes. Because
the number of nodes in this model is small, it is possible to compute the singleton
marginals and the pairwise marginals exactly through enumeration. By comparing
these true marginals with those estimated by each of the approximation techniques,
we can evaluate their accuracy/speed trade-off.
The following procedure was used to generate the parameters of the model. The
parameters which control the frequency of components, κ, form a vector of length 24
generated from a Beta distribution, σ(κ_i) ∼ Beta(1, 100). The parameters which
control correlations of components, λ, form a vector of length 276. 10% of the
elements of λ are randomly chosen to be non-zero; those elements are generated
from a zero-mean Gaussian, λ_{ij} ∼ N(0, 1).
found in the real-world corpora described in the next section.
The metric we use to compare the estimated marginals to the true marginals is
the mean relative error,

ε_{singleton} = \frac{1}{n} \sum_i \frac{|q(x_i = 1) - p(x_i = 1)|}{p(x_i = 1)}
ε_{pairwise} = \frac{1}{n^2 - n} \sum_i \sum_{j \ne i} \frac{|q(x_i x_j = 1) - p(x_i x_j = 1)|}{p(x_i x_j = 1)},
where q describes the approximate marginals computed by the approach under test.
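The metric itself is straightforward; a sketch of ours, applicable to flattened vectors of either singleton or pairwise marginals:

import numpy as np

def mean_relative_error(q, p):
    """Mean relative error of estimated marginals q against true marginals p."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean(np.abs(q - p) / p))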
To measure the approximation error as a function of computation time, we compute
the mean relative error after every full round of message passing for BP, every full
iteration of coordinate ascent updates in Equation (3.2) for FLIM-NMF and NMF,
and once at the end for FLIM-Z, since FLIM-Z is not iterative. We also compute
the time elapsed since the start of the program every time the mean relative error is
computed.
The approximation error versus time for BP, FLIM-Z, FLIM-NMF, and NMF
is shown in Figure 3.1. Loopy belief propagation is the most accurate of all the
techniques at estimating both the singleton marginals and the pairwise marginals.
Further, it converges to its final estimate after very few iterations. Unfortunately, it is
also the slowest. In contrast, naïve mean field and our proposals, FLIM-NMF and
FLIM-Z are much faster. They too converge in very few iterations. However, their
errors are higher than those of BP.
On singleton marginals, all of the approximations are quite accurate — mean
relative errors are always less than 1% on singleton marginals. NMF and FLIM-
NMF have the same relative errors since the singleton marginals for FLIM-NMF are
constrained to be equal to the solutions of NMF. FLIM-Z has a larger error than
either of these since its marginals assume that there are no pairwise correlations, an
assumption that is violated.
On pairwise marginals, BP once again achieves the lowest error, with FLIM-NMF
and FLIM-Z following closely behind. However, here NMF deviates from the other
three, having a much larger error (note that the y-axis is logarithmic). Because the
naïve mean field removes all dependencies between variables, it poorly characterizes
the rich correlation structure implied by λ. As the next section shows, this large error
leads to poorer MAP estimates of λ. FLIM-NMF, FLIM-Z, and BP however have
errors circa 1%; consequently they all have better MAP estimates of λ than NMF. But
our proposals, FLIM-NMF and FLIM-Z are able to run in a fraction of the execution
time of BP.
3.3.2 Making predictions
With the parameters of the model optimized using the procedure described in Sec-
tion 3.2, the model can then be used to make predictions on unseen data. The
predictive problem we evaluate here is that of predicting one of the binary random
variables xi given all other variables x−i. This question can be answered by computing
the conditional likelihood

p(x_i \mid x_{-i}, λ, κ) \propto \exp\left(κ_i x_i + \sum_{j \ne i} λ_{ij} x_i x_j\right).
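Because the normalization over the single variable x_i involves only two terms, this conditional is a logistic function of the remaining variables. A sketch of ours, again assuming λ is symmetric with a zero diagonal:

import numpy as np

def conditional_prob(i, x, kappa, lam):
    """p(x_i = 1 | x_{-i}) = sigmoid(kappa_i + sum_{j != i} lam_ij x_j)."""
    a = kappa[i] + lam[i] @ x  # the j = i term vanishes since lam[i, i] = 0
    return 1.0 / (1.0 + np.exp(-a))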
We apply this predictive procedure to two data sets:
Cora Cora (McCallum et al. 2000) is a set of 2708 abstracts from the Cora research
paper search engine, with links between documents that cite each other. For
the evaluation in this section, we ignore the textual content of the corpus and
concern ourselves with the links alone. The set of observed tokens associated
with each document is the set of cited and citing documents, yielding 2708
unique tokens. The model has a total of 3,667,986 parameters.
Metafilter Metafilter [1] is an internet community weblog where users share links.
Users can then annotate links with tags which describe them. We consider each
link to be a document and each link’s attendant tags to be its observed token
set. We culled a subset of these links to create a corpus of 18609 documents
with 3096 unique tokens. The model has a total of 4794156 parameters.
For Cora, this predictive problem amounts to estimating the probability of a document
in Cora citing a particular paper given our knowledge of the document’s other citations.
For Metafilter, we are estimating the probability that a link has a certain tag given
its other tags.
[1] http://www.metafilter.com
We used five-fold cross-validation to compute the predictive perplexity of unseen
data. All experiments were run with Dirichlet prior parameter α = 2 (equivalent
to Laplace smoothing); Gaussian and Laplacian priors were set to β_1 = β_2 = D e^{-8},
where D is the size of the corpus (cross-validation can be used to find good values of
β_1 and β_2). The results of these experiments are shown in Figure 3.2.
On both data sets, learning the covariance structure improves the predictive
perplexity over the baseline. Thus the correlation structure captured by the Ising
model provides increased predictive power when applied to these data sets.
The predictive perplexity of the model when trained using our proposals, FLIM-Z
and FLIM-NMF, is nearly identical to that of loopy belief propagation (BP) on both
data sets. Naïve mean field (NMF), on the other hand, does substantially worse, but
still better than Baseline. While FLIM-Z and FLIM-NMF are close to BP with respect
to predictive power, the previous section showed that their speed was closer to that
of NMF. Thus, our procedure provides a way to train models as accurately as loopy
belief propagation, but in a fraction of the time.
3.4 Discussion
We introduced a procedure to estimate the parameters of large-scale Ising models. This
procedure makes use of a novel constrained variational approximation for estimating
the pairwise marginals of the Ising model. This approximation has a simple mathemat-
ical form and can be computed more efficiently than other techniques. We also showed
empirically that this approximation is accurate for real-world data sets. Our approxi-
mation yields a procedure which can tractably be applied to models with millions of
parameters that can make predictions comparable with the state-of-the-art.
(a) Relative error of singleton marginals. (b) Relative error of pairwise marginals.

Figure 3.1: Mean relative error of singleton marginals (left) and pairwise marginals
(right) on a synthetic model. Execution times are on a logarithmic scale; the errors
in (b) are also on a logarithmic scale. Loopy belief propagation (BP) is accurate but
slow. Naïve mean field (NMF) is grossly inaccurate at estimating pairwise marginals.
FLIM-NMF offers a compromise: accuracy not much worse than BP at speed not much
worse than NMF.
(a) Cora. (b) Metafilter.

Figure 3.2: A comparison of the predictive perplexity of the Ising model using different
procedures for parameter optimization. Lower is better. All approaches perform better
than the baseline. Our proposals (FLIM-Z and FLIM-NMF) achieve better predictive
perplexity than naïve mean field (NMF), as does loopy belief propagation (BP). But
our proposals are able to run in a fraction of the time of BP (Figure 3.1).
Chapter 4
Relational Topic Models

Portions of this chapter appear in Chang and Blei (2010, 2009).
In the previous chapter, we described a model of documents and links and inferential
tools for the model. While these models are able to successfully make predictions
about documents, they often miss salient patterns of the corpus better captured by
latent variable models of link structure.
Recent research in this field has focused on latent variable models of link structure
because of their ability to decompose a network according to hidden patterns of
connections between its nodes (Kemp et al. 2004; Hofman and Wiggins 2007; Airoldi
et al. 2008). These models represent a significant departure from statistical models of
networks, which explain network data in terms of observed sufficient statistics (Wasser-
man and Pattison 1996; Newman 2002; Fienberg et al. 1985; Getoor et al. 2001; Taskar
et al. 2004b).
While powerful, current latent variable models account only for the structure
of the network, ignoring additional attributes of the nodes that might be available.
For example, a citation network of articles also contains text and abstracts of the
documents, a linked set of web-pages also contains the text for those pages, and an
on-line social network also contains profile descriptions and other information about
its members. This type of information about the nodes, along with the links between
them, should be used for uncovering, understanding and exploiting the latent structure
in the data.
To this end, we develop a new model of network data that accounts for both links
and attributes. While a traditional network model requires some observed links to
provide a predictive distribution of links for a node, our model can predict links using
only a new node’s attributes. Thus, we can suggest citations of newly written papers,
predict the likely hyperlinks of a web page in development, or suggest friendships in a
social network based only on a new user’s profile of interests. Moreover, given a new
node and its links, our model provides a predictive distribution of node attributes.
This mechanism can be used to predict keywords from citations or a user’s interests
from his or her social connections. Such prediction problems are out of reach for
traditional network models.
Here we focus on document networks. The attributes of each document are its
text, i.e., discrete observations taken from a fixed vocabulary, and the links between
documents are connections such as friendships, hyperlinks, citations, or adjacency.
To model the text, we build on previous research in mixed-membership document
models, where each document exhibits a latent mixture of multinomial distributions
or “topics” (Blei et al. 2003b; Erosheva et al. 2004; Steyvers and Griffiths 2007). The
links are then modeled dependent on this latent representation. We call our model,
which explicitly ties the content of the documents with the connections between them,
the relational topic model (RTM).
The RTM affords a significant improvement over previously developed models
of document networks. Because the RTM jointly models node attributes and link
structure, it can be used to make predictions about one given the other. Previous work
tends to explore one or the other of these two prediction problems. Some previous work
uses link structure to make attribute predictions (Chakrabarti et al. 1998; Kleinberg
1999), including several topic models (Dietz et al. 2007; McCallum et al. 2005; Wang
et al. 2005). However, none of these methods can make predictions about links given
words.
Other models use node attributes to predict links (Hoff et al. 2002). However,
these models condition on the attributes but do not model them. While this may be
effective for small numbers of attributes of low dimension, these models cannot make
meaningful predictions about or using high-dimensional attributes such as text data.
As our empirical study in Section 4.3 illustrates, the mixed-membership component
provides dimensionality reduction that is essential for effective prediction.
In addition to being able to make predictions about links given words and words
given links, the RTM is able to do so for new documents—documents outside of
training data. Approaches which generate document links through topic models treat
links as discrete “terms” from a separate vocabulary that essentially indexes the
observed documents (Nallapati and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen
et al. 2008; Gruber et al. 2008; Erosheva et al. 2004; Xu et al. 2006, 2008). Through
this index, such approaches encode the observed training data into the model and
thus cannot generalize to observations outside of them. Link and word predictions for
new documents, of the kind we evaluate in Section 4.3.1, are ill-defined.
Recent work from Nallapati et al. (2008) has jointly modeled links and document
content so as to avoid these problems. We elucidate the subtle but important differ-
ences between their model and the RTM in Section 4.1.4. We then demonstrate in
Section 4.3.1 that the RTM makes modeling assumptions that lead to significantly
better predictive performance.
The remainder of this chapter is organized as follows. First, we describe the
statistical assumptions behind the relational topic model. Then, we derive efficient
algorithms based on variational methods for approximate posterior inference, parameter
estimation, and prediction. Finally, we study the performance of the RTM on scientific
citation networks, hyperlinked web pages, geographically tagged news articles, and
social networks. The RTM provides better word prediction and link prediction than natural alternatives and the current state of the art.

Figure 4.1: Example data appropriate for the relational topic model. Each document is represented as a bag of words and linked to other documents via citation. The RTM defines a joint distribution over the words in each document and the citation links between them.
4.1 Relational Topic Models
The relational topic model (RTM) is a hierarchical probabilistic model of networks,
where each node is endowed with attribute information. We will focus on text data,
where the attributes are the words of the documents (see Figure 4.1). The RTM
embeds this data in a latent space that explains both the words of the documents and
how they are connected.
4.1.1 Modeling assumptions
The RTM builds on previous work in mixed-membership document models. Mixed-
membership models are latent variable models of heterogeneous data, where each data
point can exhibit multiple latent components. Mixed-membership models have been
successfully applied in many domains, including survey data (Erosheva et al. 2007),
image data (Fei-Fei and Perona 2005; Barnard et al. 2003), network data (Airoldi
et al. 2008), and document modeling (Steyvers and Griffiths 2007; Blei et al. 2003b).
Mixed-membership models were independently developed in the field of population
genetics (Pritchard et al. 2000).
To model node attributes, the RTM reuses the statistical assumptions behind
latent Dirichlet allocation (LDA) (Blei et al. 2003b), a mixed-membership model of
documents.1 Specifically, LDA is a hierarchical probabilistic model that uses a set
of “topics,” distributions over a fixed vocabulary, to describe a corpus of documents.
In its generative process, each document is endowed with a Dirichlet-distributed
vector of topic proportions, and each word of the document is assumed drawn by first
drawing a topic assignment from those proportions and then drawing the word from
the corresponding topic distribution. While a traditional mixture model of documents
assumes that every word of a document arises from a single mixture component, LDA
allows each document to exhibit multiple components via the latent topic proportions
vector. Below we describe this model in more detail before introducing our contribution,
the RTM.
4.1.2 Latent Dirichlet allocation
Latent Dirichlet allocation takes as input a collection of documents which are rep-
resented as bags-of-words, that is, unordered collections of terms from a fixed
vocabulary. A collection of documents is imbued with a fixed number of topics, multi-
nomial distributions over those terms. Intuitively, a topic captures themes by putting
high weights on words which are connected to that theme, and small weights otherwise.
This representation is captured in Figure 1.4 (reproduced here as Figure 4.2 for convenience). On the left are three topics, β1, β2, β3; we have depicted each by selecting words with high probability mass in that topic. For example, the blue topic, β2, puts high mass on terms related to jurisprudence, while the red topic, β3, puts high mass on terms related to sports.

1 A general mixed-membership model can accommodate any kind of grouped data paired with an appropriate observation model (Erosheva et al. 2004).
Figure 4.2: A depiction of the assumptions underlying topic models. Topic models presuppose latent themes (left) and documents (right). Documents are a composition of latent themes; this composition determines the words in the document that we observe.
Additionally, LDA associates with each document a multinomial distribution over
topics. Intuitively, this captures what the document “is about” in broad thematic
terms. This is captured by θd in Figure 1.4, also depicted graphically as a bar graph
over topics (colors). In the example text, the document is mostly about “politics” with
a smattering of “sports” and “law”. Finally, LDA associates a single topic assignment
with each word in the document. The topic proportions θd govern the frequency with
which each topic appears in an assignment; the topic vectors βk govern which words
are likely to appear for a given assignment. This is graphically depicted in Figure 1.4 by coloring words according to their topic assignment.

Figure 4.3: A graphical model representation of latent Dirichlet allocation. The words are observed (shaded) while the topic assignments (z), topic proportions (θ), and topics (β) are latent. Plates indicate replication.
This intuitive description of LDA can be formalized by the following generative
process:
1. For each document d:
(a) Draw topic proportions θd|α ∼ Dir(α).
(b) For each word wd,n:
i. Draw assignment zd,n|θd ∼ Mult(θd).
ii. Draw word wd,n|zd,n,β1:K ∼ Mult(βzd,n).
The notation x|z ∼ F (z) means that x is drawn conditional on z from the
distribution F (z). We use Dir and Mult as shorthand for the Dirichlet and Multinomial
distributions.
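To make this generative process concrete, the following is a minimal Python sketch of it (using numpy; the function and variable names here are illustrative, not part of any released implementation):

    import numpy as np

    def generate_lda_corpus(n_docs, doc_length, alpha, beta):
        # alpha: length-K Dirichlet parameter for topic proportions.
        # beta:  K x V matrix; row k is topic k's distribution over terms.
        K, V = beta.shape
        corpus = []
        for _ in range(n_docs):
            theta = np.random.dirichlet(alpha)                 # step (a)
            z = np.random.choice(K, size=doc_length, p=theta)  # step (b)i
            w = np.array([np.random.choice(V, p=beta[k]) for k in z])  # step (b)ii
            corpus.append((w, z, theta))
        return corpus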
This generative process is depicted in Figure 4.3. The words w are the only
observed variables. The parameters for the model are K, the number of topics in the
model, α, a K-dimensional Dirichlet parameter controlling the topic proportions θ,
and β1:K, the K multinomial parameters representing the topic distributions over terms.
It is worth emphasizing that the words are the only observed data in this model.
The topics, the rate at which topics appear in each document, and the topic associated
with each word are all inferred solely based on the way words co-occur in the data.
4.1.3 Relational topic model
In the RTM, each document is first generated from topics as in LDA. The links between
documents are then modeled as binary variables, one for each pair of documents.
These binary variables are distributed according to a distribution that depends on the
topics used to generate each of the constituent documents. Because of this dependence,
the content of the documents are statistically connected to the link structure between
them. Thus each document’s mixed-membership depends on both the content of the document and the pattern of its links. In turn, documents whose memberships
are similar will be more likely to be connected under the model.
The parameters of the RTM are β1:K , K topic distributions over terms, a K-
dimensional Dirichlet parameter α, and a function ψ that provides binary probabilities.
(This function is explained in detail below.) We denote a set of observed documents
by w1:D,1:N , where wi,1:N are the words of the ith document. (Words are assumed to
be discrete observations from a fixed vocabulary.) We denote the links between the
documents as binary variables y1:D,1:D, where yi,j is 1 if there is a link between the ith
and jth document. The RTM assumes that a set of observed documents w1:D,1:N and
binary links between them y1:D,1:D are generated by the following process.
1. For each document d:
(a) Draw topic proportions θd|α ∼ Dir(α).
(b) For each word wd,n:
i. Draw assignment zd,n|θd ∼ Mult(θd).
ii. Draw word wd,n|zd,n,β1:K ∼ Mult(βzd,n).
2. For each pair of documents d, d′:
(a) Draw binary link indicator

yd,d′ | zd, zd′ ∼ ψ(·|zd, zd′, η),

where zd = 〈zd,1, zd,2, . . . , zd,n〉.

Figure 4.4: A two-document segment of the RTM. The variable yd,d′ indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. This binary variable is generated contingent on the topic assignments for the participating documents, zd and zd′, and global regression parameters η. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 4.1.

Figure 4.4 illustrates the graphical model for this process for a single pair of documents. The full model, which is difficult to illustrate in a small graphical model, contains the observed words from all D documents, and D² link variables for each possible connection between them.
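As a sketch of the full generative process, the LDA sampler above can be extended with a pairwise link draw. Because the specific forms of ψ are not introduced until Section 4.1.4, the sketch below takes psi as an argument; all names are again illustrative:

    import itertools
    import numpy as np

    def generate_rtm(n_docs, doc_length, alpha, beta, psi):
        # psi: link probability function taking the two empirical topic
        # frequency vectors (the z-bars) and returning P(y = 1).
        K, V = beta.shape
        docs, zbars = [], []
        for _ in range(n_docs):
            theta = np.random.dirichlet(alpha)
            z = np.random.choice(K, size=doc_length, p=theta)
            docs.append(np.array([np.random.choice(V, p=beta[k]) for k in z]))
            zbars.append(np.bincount(z, minlength=K) / doc_length)
        links = {}
        for d1, d2 in itertools.combinations(range(n_docs), 2):
            links[(d1, d2)] = np.random.rand() < psi(zbars[d1], zbars[d2])
        return docs, links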
4.1.4 Link probability function
The function ψ is the link probability function that defines a distribution over the
link between two documents. This function is dependent on the two vectors of topic
assignments that generated their words, zd and zd′ .
This modeling decision is important. A natural alternative is to model links as a
function of the topic proportions vectors θd and θd′ . One such model is that of Nallapati
et al. (2008), which extends the mixed-membership stochastic blockmodel (Airoldi
et al. 2008) to generate node attributes. Similar in spirit is the non-generative model
of Mei et al. (2008) which “regularizes” topic models with graph information. The
issue with these formulations is that the links and words of a single document are
possibly explained by disparate sets of topics, thereby hindering their ability to make
predictions about words from links and vice versa.
For example, such a model with ten topics may use the first five topics to describe
the language of the corpus and the latter five to describe its connectivity. Each
document would participate in topics from the first set which account for its language
and the second set which account for its links. However, given a new document without
link information it is impossible in such a model to make predictions about links since
the document does not participate in the latter five topics. Similarly, a new document
without word information does not participate in the first five topics and hence no
predictions can be made.
By enforcing that the link probability function depends on the latent topic assignments zd and zd′, we ensure that the specific topics used to generate the links are those used to generate the words. A similar mechanism is employed in Blei and McAuliffe (2007) for non-pairwise response variables. In estimating parameters, this
means that the same topic indices describe both patterns of recurring words and
patterns in the links. The results in Section 4.3.1 show that this provides a superior
prediction mechanism.
We explore four specific possibilities for the link probability function. First, we
consider
ψσ(y = 1) = σ(ηT(zd ◦ zd′) + ν), (4.1)
where zd = (1/Nd) ∑n zd,n, the ◦ notation denotes the Hadamard (element-wise) product,
and the function σ is the sigmoid. This link function models each per-pair binary
variable as a logistic regression with hidden covariates. It is parameterized by coeffi-
cients η and intercept ν. The covariates are constructed by the Hadamard product of
zd and zd′ , which captures similarity between the hidden topic representations of the
two documents.
Second, we consider
ψe(y = 1) = exp(ηT(zd ◦ zd′) + ν). (4.2)
Here, ψe uses the same covariates as ψσ, but has an exponential mean function instead.
Rather than tapering off when zd ◦ zd′ are close, the probabilities returned by this function continue to increase exponentially. With some algebraic manipulation, the
function ψe can be viewed as an approximate variant of the modeling methodology
presented in Blei and Jordan (2003).
Third, we consider
ψΦ(y = 1) = Φ(ηT(zd ◦ zd′) + ν), (4.3)
where Φ represents the cumulative distribution function of the Normal distribution.
Like ψσ, this link function models the link response as a regression parameterized by
coefficients η and intercept ν. The covariates are also constructed by the Hadamard
product of zd and zd′ , but instead of the logit model hypothesized by ψσ, ψΦ models
the link probability with a probit model.
Finally, we consider
ψN(y = 1) = exp(−ηT((zd − zd′) ◦ (zd − zd′)) − ν). (4.4)
Note that ψN is the only one of the link probability functions which is not a function
of zd ◦ zd′ . Instead, it depends on a weighted squared Euclidean difference between the
two latent topic assignment distributions. Specifically, it is the multivariate Gaussian density function, with mean 0 and diagonal covariance characterized by η, applied to zd − zd′. Because the range of zd − zd′ is finite, the probability of a link, ψN(y = 1), is also finite. We constrain the parameters η and ν to ensure that it is between zero and one.

Figure 4.5: A comparison of different link probability functions (ψσ, ψe, ψΦ, ψN). The plot shows the probability of two documents being linked as a function of their similarity (as measured by the inner product of the two documents’ latent topic assignments). All link probability functions were parameterized so as to have the same endpoints.
All four of the ψ functions we consider are plotted in Figure 4.5. The link likelihoods
suggested by the link probability functions are plotted against the inner product of zd
and zd′ . The parameters of the link probability functions were chosen to ensure that
all curves have the same endpoints. Both ψσ and ψΦ have similar sigmoidal shapes.
In contrast, ψe is exponential in shape and its slope remains large at the right
limit. The one-sided Gaussian form of ψN is also apparent.
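For concreteness, the following is a minimal numpy sketch of Equations 4.1 through 4.4, operating on the averaged assignment vectors zd (here zb1 and zb2); it assumes η (eta) and ν (nu) have already been constrained so that each function returns a value in [0, 1]. The function names are ours, not part of any released implementation:

    import numpy as np
    from scipy.stats import norm

    def psi_sigma(zb1, zb2, eta, nu):
        # Logistic link (Equation 4.1).
        return 1.0 / (1.0 + np.exp(-(eta @ (zb1 * zb2) + nu)))

    def psi_e(zb1, zb2, eta, nu):
        # Exponential link (Equation 4.2); valid only when eta, nu keep it <= 1.
        return np.exp(eta @ (zb1 * zb2) + nu)

    def psi_probit(zb1, zb2, eta, nu):
        # Probit link (Equation 4.3).
        return norm.cdf(eta @ (zb1 * zb2) + nu)

    def psi_n(zb1, zb2, eta, nu):
        # Weighted squared-difference link (Equation 4.4).
        diff = zb1 - zb2
        return np.exp(-(eta @ (diff * diff)) - nu)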
4.2 Inference, Estimation and Prediction
With the model defined, we turn to approximate posterior inference, parameter estima-
tion, and prediction. We develop a variational inference procedure for approximating
the posterior. We use this procedure in a variational expectation-maximization (EM)
algorithm for parameter estimation. Finally, we show how a model whose parameters
have been estimated can be used as a predictive model of words and links.
4.2.1 Inference
The goal of posterior inference is to compute the posterior distribution of the latent
variables conditioned on the observations. As with many hierarchical Bayesian models
of interest, exact posterior inference is intractable and we appeal to approximate
inference methods. Most previous work on latent variable network modeling has
employed Markov Chain Monte Carlo (MCMC) sampling methods to approximate the
posterior of interest (Hoff et al. 2002; Kemp et al. 2004). Here, we employ variational
inference (Jordan et al. 1999; Wainwright and Jordan 2005a), a deterministic alternative to MCMC sampling that has been shown to give accuracy comparable to MCMC with improved computational efficiency (Braun and McAuliffe 2007; Blei and Jordan 2006).
Wainwright and Jordan (2008) investigate the properties of variational approximations
in detail. Recently, variational methods have been employed in other latent variable
network models (Airoldi et al. 2008; Hofman and Wiggins 2007).
In variational methods, we posit a family of distributions over the latent variables,
indexed by free variational parameters. Those parameters are then fit to be close to
the true posterior, where closeness is measured by relative entropy. For the RTM, we
use the fully-factorized family, where the topic proportions and all topic assignments
are considered independent,

q(Θ,Z|γ,Φ) = ∏d [qθ(θd|γd) ∏n qz(zd,n|φd,n)]. (4.5)
The parameters γ are variational Dirichlet parameters, one for each document, and Φ
are variational multinomial parameters, one for each word in each document. Note
that Eq [zd,n] = φd,n.
Minimizing the relative entropy is equivalent to maximizing Jensen’s lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = ∑(d1,d2) Eq[log p(yd1,d2 | zd1, zd2, η, ν)] + ∑d ∑n Eq[log p(zd,n | θd)] + ∑d ∑n Eq[log p(wd,n | β1:K, zd,n)] + ∑d Eq[log p(θd | α)] + H(q), (4.6)
where (d1, d2) denotes all document pairs and H (q) denotes the entropy of the dis-
tribution q. The first term of the ELBO differentiates the RTM from LDA (Blei
et al. 2003b). The connections between documents affect the objective in approximate
posterior inference (and, below, in parameter estimation).
We develop the inference procedure below under the assumption that only observed
links will be modeled (i.e., yd1,d2 is either 1 or unobserved).2 We do this for both
methodological and computational reasons.
First, while one can fix yd1,d2 = 1 whenever a link is observed between d1 and
d2 and set yd1,d2 = 0 otherwise, this approach is inappropriate in corpora where the
absence of a link cannot be construed as evidence for yd1,d2 = 0. In these cases, treating
these links as unobserved variables is more faithful to the underlying semantics of the
data. For example, in large social networks such as Facebook the absence of a link
between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other’s existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

2 Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
Second, treating non-links as hidden decreases the computational cost of inference;
since the link variables are leaves in the graphical model they can be removed whenever
they are unobserved. Thus the complexity of computation scales with the number
of observed links rather than the number of document pairs. When the number of
true observations is sparse relative to the number of document pairs, as is typical,
this provides a significant computational advantage. For example, on the Cora data
set described in Section 4.3, there are 3,665,278 unique document pairs but only
5,278 observed links. Treating non-links as hidden in this case leads to an inference
procedure which is nearly 700 times faster.
Our aim now is to compute each term of the objective function given in Equation 4.6.
The first term,
∑(d1,d2) Ld1,d2 ≡ ∑(d1,d2) Eq[log p(yd1,d2 | zd1, zd2, η, ν)], (4.7)
depends on our choice of link probability function. For many link probability func-
tions, this term cannot be expanded analytically. However, if the link probability
function depends only on zd1 ◦ zd2 we can expand the expectation using the following
approximation arising from a first-order Taylor expansion of the term (Braun and
McAuliffe 2007),3

L(d1,d2) = Eq[log ψ(zd1 ◦ zd2)] ≈ log ψ(Eq[zd1 ◦ zd2]) = log ψ(πd1,d2),

where πd1,d2 = φd1 ◦ φd2 and φd = Eq[zd] = (1/Nd) ∑n φd,n. In this work, we explore three functions which can be written in this form,

Eq[log ψσ(zd1 ◦ zd2)] ≈ log σ(ηTπd1,d2 + ν)
Eq[log ψΦ(zd1 ◦ zd2)] ≈ log Φ(ηTπd1,d2 + ν)
Eq[log ψe(zd1 ◦ zd2)] = ηTπd1,d2 + ν. (4.8)

3 While we do not give a detailed proof here, the error of a first-order approximation is closely related to the probability mass in the tails of the distributions on zd1 and zd2. Because the number of words in a document is typically large, the variances of zd1 and zd2 tend to be small, making the first-order approximation a good one.
Note that for ψe the expression is exact. The likelihood when ψN is chosen as the link
probability function can also be computed exactly,
Eq[log ψN(zd1, zd2)] = −ν − ∑i ηi((φd1,i − φd2,i)² + Var(zd1,i) + Var(zd2,i)),

where Var(zd,i) = (1/Nd²) ∑n φd,n,i(1 − φd,n,i). (See Appendix A.)
Leveraging these expanded expectations, we then use coordinate ascent to op-
timize the ELBO with respect to the variational parameters γ,Φ. This yields an
approximation to the true posterior. The update for the variational multinomial φd,j
is
φd,j ∝ exp{ ∑d′≠d ∇φd,nLd,d′ + Eq[log θd|γd] + log β·,wd,j }. (4.9)
The contribution to the update from link information, ∇φd,nLd,d′ , depends on the
choice of link probability function. For the link probability functions expanded in
Equation 4.8, this term can be written as
∇φd,nLd,d′ = (∇πd,d′Ld,d′) ◦ φd′/Nd. (4.10)
Intuitively, Equation 4.10 will cause a document’s latent topic assignments to be
nudged in the direction of neighboring documents’ latent topic assignments. The
magnitude of this pull depends only on πd,d′ , i.e., some measure of how close they are
already. The corresponding gradients for the functions in Equation 4.8 are
∇πd,d′Lσd,d′ ≈ (1 − σ(ηTπd,d′ + ν)) η
∇πd,d′LΦd,d′ ≈ [Φ′(ηTπd,d′ + ν)/Φ(ηTπd,d′ + ν)] η
∇πd,d′Led,d′ = η.
The gradient when ψN is the link probability function is
∇φd,nLNd,d′ = (2/Nd) η ◦ (φd′ − φd,−n − 1/Nd), (4.11)

where φd,−n = φd − (1/Nd)φd,n. Similar in spirit to Equation 4.10, Equation 4.11 will cause
a document’s latent topic assignments to be drawn towards those of its neighbors.
This draw is tempered by φd,−n, a measure of how similar the current document is to
its neighbors.
The contribution to the update in Equation 4.9 from the word evidence logβ·,wd,j
can be computed by taking the element-wise logarithm of the wd,jth column of the
topic matrix β. The contribution to the update from the document’s latent topic
proportions is given by
Eq[log θd|γd] = Ψ(γd) − Ψ(∑i γd,i),

where Ψ is the digamma function.4 (A digamma of a vector is the vector of digammas.)
The update for γ is identical to that in variational inference for LDA (Blei et al. 2003b),

γd ← α + ∑n φd,n.

4 The digamma function is defined as the logarithmic derivative of the gamma function.
These updates are fully derived in Appendix A.
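To illustrate how these updates fit together, here is a minimal Python sketch of one coordinate-ascent sweep for a single document under the exponential link ψe, for which the link gradient ∇πL is exactly η; the data layout and names are our own illustrative choices, not a definitive implementation:

    import numpy as np
    from scipy.special import digamma

    def update_document(phi, gamma, d, words, beta, alpha, eta, neighbors):
        # phi:   list of N_d x K arrays (variational multinomials per document)
        # gamma: D x K array of variational Dirichlet parameters
        # neighbors: indices of documents linked to d in the training data
        N_d, K = phi[d].shape
        elog_theta = digamma(gamma[d]) - digamma(gamma[d].sum())
        # Under psi_e, grad_pi L = eta, so each neighbor d' contributes
        # eta * phibar_{d'} / N_d to every word's update (Equation 4.10).
        link_term = sum(eta * phi[dp].mean(axis=0) for dp in neighbors) / N_d
        for n, w in enumerate(words[d]):
            log_phi = link_term + elog_theta + np.log(beta[:, w])
            log_phi = log_phi - log_phi.max()          # numerical stability
            phi[d][n] = np.exp(log_phi) / np.exp(log_phi).sum()
        gamma[d] = alpha + phi[d].sum(axis=0)          # gamma update, as in LDA
        return phi, gamma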
4.2.2 Parameter estimation
We fit the model by finding maximum likelihood estimates for each of the parameters:
multinomial topic vectors β1:K and link function parameters η, ν. Once again, this
is intractable so we turn to an approximation. We employ variational expectation-
maximization, where we iterate between optimizing the ELBO of Equation 4.6 with
respect to the variational distribution and with respect to the model parameters. This
is equivalent to the usual expectation-maximization algorithm (Dempster et al. 1977),
except that the computation of the posterior is replaced by variational inference.
Optimizing with respect to the variational distribution is described in Section 4.2.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood
estimation with expected sufficient statistics, where the expectation is taken with
respect to the variational distribution.
The update for the topics matrix β is
βk,w ∝ ∑d ∑n 1(wd,n = w) φd,n,k. (4.12)
This is the same as the variational EM update for LDA (Blei et al. 2003b). In practice,
we smooth our estimates of βk,w using pseudocount smoothing (Jurafsky and Martin
2008) which helps to prevent overfitting by positing a Dirichlet prior on βk.
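A minimal sketch of this M-step, with the pseudocount smoothing folded in (the pseudocount value and the data layout are illustrative assumptions):

    import numpy as np

    def update_beta(phi, words, K, V, smooth=0.01):
        # Accumulate expected counts under q (Equation 4.12), starting
        # from pseudocounts that act as a Dirichlet prior on each beta_k.
        beta = np.full((K, V), smooth)
        for d, w_d in enumerate(words):
            for n, w in enumerate(w_d):
                beta[:, w] += phi[d][n]   # phi[d] is N_d x K for document d
        return beta / beta.sum(axis=1, keepdims=True)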
In order to fit the parameters η, ν of the logistic function of Equation 4.1, we employ
gradient-based optimization. Using the approximation described in Equation 4.8, we
compute the gradient of the objective given in Equation 4.6 with respect to these
parameters,
∇ηL ≈ ∑(d1,d2) [yd1,d2 − σ(ηTπd1,d2 + ν)] πd1,d2,

∂L/∂ν ≈ ∑(d1,d2) [yd1,d2 − σ(ηTπd1,d2 + ν)].
Note that these gradients cannot be used to directly optimize the parameters
of the link probability function without negative observations (i.e., yd1,d2 = 0). We
address this by applying a regularization penalty. This regularization penalty along
with parameter update procedures for the other link probability functions are given in
Appendix B.
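A minimal gradient-ascent sketch for the logistic link follows. Since every observed pair is a link (y = 1), a penalty is needed; the simple L2 penalty below is a stand-in assumption for illustration, not the regularizer actually derived in Appendix B:

    import numpy as np

    def fit_eta_nu(pi_pairs, eta, nu, lam=0.1, lr=1e-3, n_iters=500):
        # pi_pairs: list of pi vectors, one per observed link (y = 1),
        # where pi = phibar_d1 * phibar_d2 (elementwise).
        for _ in range(n_iters):
            g_eta = -2.0 * lam * eta   # gradient of the stand-in L2 penalty
            g_nu = 0.0
            for pi in pi_pairs:
                resid = 1.0 - 1.0 / (1.0 + np.exp(-(eta @ pi + nu)))
                g_eta = g_eta + resid * pi
                g_nu += resid
            eta = eta + lr * g_eta
            nu = nu + lr * g_nu
        return eta, nu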
4.2.3 Prediction
With a fitted model, our ultimate goal is to make predictions about new data. We
describe two kinds of prediction: link prediction from words and word prediction from
links.
In link prediction, we are given a new document (i.e., a document which is not
in the training set) and its words. We are asked to predict its links to the other
documents. This requires computing
p(yd,d′|wd,wd′) = ∑zd,zd′ p(yd,d′|zd, zd′) p(zd, zd′|wd,wd′),
an expectation with respect to a posterior that we cannot compute. Using the inference
algorithm from Section 4.2.1, we find variational parameters which optimize the ELBO
for the given evidence, i.e., the words and links for the training documents and the
words in the test document. Replacing the posterior with this approximation q(Θ,Z),
the predictive probability is approximated with
p(yd,d′|wd,wd′) ≈ Eq [p(yd,d′|zd, zd′)] . (4.13)
In a variant of link prediction, we are given a new set of documents (documents not
in the training set) along with their words and asked to select the links most likely to
exist. The predictive probability for this task is proportional to Equation 4.13.
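Operationally, ranking candidate links for a new document reduces to scoring each training document with the fitted ψ applied to expected topic assignments, as in the first-order approximation used during inference. A minimal sketch (psi as in the functions sketched in Section 4.1.4; the phibar vectors come from variational inference on the documents' words):

    def rank_candidate_links(phibar_new, phibar_train, eta, nu, psi, top_m=20):
        # Approximate p(y = 1 | words) for each training document (Eq. 4.13)
        # and return the top_m highest-scoring candidates.
        scores = [(d, psi(phibar_new, pb, eta, nu))
                  for d, pb in enumerate(phibar_train)]
        return sorted(scores, key=lambda s: -s[1])[:top_m]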
The second predictive task is word prediction, where we predict the words of a
new document based only on its links. As with link prediction, p(wd,i|yd) cannot be
computed. Using the same technique, a variational distribution can approximate this
posterior. This yields the predictive probability
p(wd,i|yd) ≈ Eq [p(wd,i|zd,i)] .
Note that models which treat the endpoints of links as discrete observations of
data indices cannot participate in the two tasks presented here. They cannot make
meaningful predictions for documents that do not appear in the training set (Nallapati
and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen et al. 2008; Erosheva et al.
2004). By modeling both documents and links generatively, our model is able to
give predictive distributions for words given links, links given words, or any mixture
thereof.
4.3 Empirical Results
We examined the RTM on four data sets.5 Words were stemmed; stop words (i.e., words like “and,” “of,” or “but”) and infrequently occurring words were removed.
5 An R package implementing these models and more is available online at http://cran.r-project.org/web/packages/lda/. Detailed derivations for some of the models included in the package are given in Appendix D.
Table 4.1: Summary statistics for the four data sets after processing.

Data Set     # of Documents   # of Words   # of Links   Lexicon Size
Cora         2708             49216        5278         1433
WebKB        877              79365        1388         1703
PNAS         2218             119162       1577         2239
LocalNews    51               93765        107          1242
Directed links were converted to undirected links,6 and documents with no links were
removed. The Cora data (McCallum et al. 2000) contains abstracts from the Cora
computer science research paper search engine, with links between documents that
cite each other. The WebKB data (Craven et al. 1998) contains web pages from the
computer science departments of different universities, with links determined from
the hyperlinks on each page. The PNAS data contains recent abstracts from the
Proceedings of the National Academy of Sciences. The links between documents are
intra-PNAS citations. The LocalNews data set is a corpus of local news culled from
various media markets throughout the United States. We create one bag-of-words
document associated with each state (including the District of Columbia); each state’s
“document” consists of headlines and summaries from local news in that state’s media
markets. Links between states were determined by geographical adjacency. Summary
statistics for these data sets are given in Table 4.1.
4.3.1 Evaluating the predictive distribution
As with any probabilistic model, the RTM defines a probability distribution over unseen
data. After inferring the latent variables from data (as described in Section 4.2.1), we
ask how well the model predicts the links and words of unseen nodes. Models that
give higher probability to the unseen documents better capture the joint structure of
words and links.
We study the RTM with three link probability functions discussed above: the
logistic link probability function, ψσ, of Equation 4.1; the exponential link probability function, ψe, of Equation 4.2; and the probit link probability function, ψΦ, of Equation 4.3. We compare these models against two alternative approaches.

6 The RTM can be extended to accommodate directed connections. Here we modeled undirected links.

Figure 4.6: Average held-out predictive link rank (left) and word rank (right) as a function of the number of topics, for the Cora, WebKB, and PNAS corpora. Lower is better. The models compared are the RTM with ψσ, ψe, and ψΦ, LDA + Regression, and Pairwise Link-LDA. For all three corpora, RTMs outperform baseline unigram, LDA, and Pairwise Link-LDA (Nallapati et al. 2008).
The first (“Pairwise Link-LDA”) is the model proposed by Nallapati et al. (2008),
which is an extension of the mixed membership stochastic block model (Airoldi et al.
2008) to model network structure and node attributes. This model posits that each link
is generated as a function of two individual topics, drawn from the topic proportions
vectors associated with the endpoints of the link. Because latent topics for words and
links are drawn independently in this model, it cannot ensure that the discovered topics
are representative of both words and links simultaneously. Additionally, this model
introduces additional variational parameters for every link which adds computational
complexity.
The second (“LDA + Regression”) first fits an LDA model to the documents and
then fits a logistic regression model to the observed links, with input given by the
Hadamard product of the latent class distributions of each pair of documents. Rather
than performing dimensionality reduction and regression simultaneously, this method
performs unsupervised dimensionality reduction first, and then regresses to understand
the relationship between the latent space and underlying link structure. All models
were fit such that the total mass of the Dirichlet hyperparameter α was 1.0. (While
we omit a full sensitivity study here, we observed that the performance of the models
was similar for α within a factor of 2 above and below the value we chose.)
We measured the performance of these models on link prediction and word pre-
diction (see Section 4.2.3). We divided the Cora, WebKB and PNAS data sets each
into five folds. For each fold and for each model, we ask two predictive queries: given
the words of a new document, how probable are its links; and given the links of a
new document, how probable are its words? Again, the predictive queries are for
completely new test documents that are not observed in training. During training the
test documents are removed along with their attendant links. We show the results
for both tasks in terms of predictive rank as a function of the number of topics in
Figure 4.6. (See Section 4.4 for a discussion on potential approaches for selecting the
number of topics and the Dirichlet hyperparameter α.) Here we follow the convention
that lower predictive rank is better.
In predicting links, the three variants of the RTM perform better than all of the
alternative models for all of the data sets (see Figure 4.6, left column). Cora is
paradigmatic, showing a nearly 40% improvement in predictive rank over baseline and
25% improvement over LDA + Regression. The performance for the RTM on this
task is similar for all three link probability functions. We emphasize that the links are
predicted to documents seen in the training set from documents which were held out.
By incorporating link and node information in a joint fashion, the model is able to
generalize to new documents for which no link information was previously known.
Note that the performance of the RTM on link prediction generally increases as the
number of topics is increased (there is a slight decrease on WebKB). In contrast, the
performance of the Pairwise Link-LDA worsens as the number of topics is increased.
This is most evident on Cora, where Pairwise Link-LDA is competitive with RTM
at five topics, but the predictive link rank monotonically increases after that despite
its increased dimensionality (and commensurate increase in computational difficulty).
We hypothesize that Pairwise Link-LDA exhibits this behavior because it uses some
topics to explain the words observed in the training set, and other topics to explain
the links observed in the training set. This problem is exacerbated as the number of
topics is increased, making it less effective at predicting links from word observations.
In predicting words, the three variants of the RTM again outperform all of the
alternative models (see Figure 4.6, right column). This is because the RTM uses
link information to influence the predictive distribution of words. In contrast, the
predictions of LDA + Regression and Pairwise Link-LDA barely use link information;
thus they give predictions independent of the number of topics similar to those made
by a simple unigram model.
4.3.2 Automatic link suggestion
Table 4.2: Top link predictions made by RTM (ψe) and LDA + Regression for two documents (italicized) from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual documents cited by or citing each document. Over the whole corpus, RTM improves precision over LDA + Regression by 80% when evaluated on the first 20 documents retrieved.

Markov chain Monte Carlo convergence diagnostics: A comparative review

RTM (ψe):
Minorization conditions and convergence rates for Markov chain Monte Carlo
Rates of convergence of the Hastings and Metropolis algorithms
Possible biases induced by MCMC convergence diagnostics
Bounding convergence time of the Gibbs sampler in Bayesian image restoration
Self regenerative Markov chain Monte Carlo
Auxiliary variable methods for Markov chain Monte Carlo with applications
Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
Diagnosing convergence of Markov chain Monte Carlo algorithms

LDA + Regression:
Exact Bound for the Convergence of Metropolis Chains
Self regenerative Markov chain Monte Carlo
Minorization conditions and convergence rates for Markov chain Monte Carlo
Gibbs-markov models
Auxiliary variable methods for Markov chain Monte Carlo with applications
Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
Mediating instrumental variables
A qualitative framework for probabilistic inference
Adaptation for Self Regenerative MCMC

Competitive environments evolve better solutions for complex tasks

RTM (ψe):
Coevolving High Level Representations
A Survey of Evolutionary Strategies
Genetic Algorithms in Search, Optimization and Machine Learning
Strongly typed genetic programming in evolving cooperation strategies
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
Evolutionary Module Acquisition
An Empirical Investigation of Multi-Parent Recombination Operators. . .

LDA + Regression:
A New Algorithm for DNA Sequence Assembly
Identification of protein coding regions in genomic DNA
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
A genetic algorithm for passive management
The Performance of a Genetic Algorithm on a Chaotic Objective Function
Adaptive global optimization with local search
Mutation rates as adaptations
A natural real-world application of link prediction is to suggest links to a user
based on the text of a document. One might suggest citations for an abstract or
friends for a user in a social network.
As a complement to the quantitative evaluation of link prediction given in the
previous section, Table 4.2 illustrates suggested citations using RTM (ψe) and LDA +
Regression as predictive models. These suggestions were computed from a model fit on
one of the folds of the Cora data using 10 topics. (Results are qualitatively similar for
models fit using different numbers of topics; see Section 4.4 for strategies for choosing
the number of topics.) The top results illustrate suggested links for “Markov chain
Monte Carlo convergence diagnostics: A comparative review,” which occurs in this
fold’s training set. The bottom results illustrate suggested links for “Competitive
environments evolve better solutions for complex tasks,” which is in the test set.
RTM outperforms LDA + Regression in being able to identify more true connections.
For the first document, RTM finds 3 of the connected documents versus 1 for LDA +
Regression. For the second document, RTM finds 3 while LDA + Regression does not
find any. This qualitative behavior is borne out quantitatively over the entire corpus.
Considering the precision of the first 20 documents retrieved by the models, RTM
improves precision over LDA + Regression by 80%. (Twenty is a reasonable number
of documents for a user to examine.)
While both models found several connections which were not observed in the data,
those found by the RTM are qualitatively different. In the first document, both sets
of suggested links are about Markov chain Monte Carlo. However, the RTM finds
more documents relating specifically to convergence and stationary behavior of Monte
Carlo methods. LDA + Regression finds connections to documents in the milieu
of MCMC, but many are only indirectly related to the input document. The RTM
is able to capture that the notion of “convergence” is an important predictor for
citations, and has adjusted the topic distribution and predictors correspondingly. For
the second document, the documents found by the RTM are also of a different nature
than those found by LDA + Regression. All of the documents suggested by RTM
relate to genetic algorithms. LDA + Regression, however, suggests some documents
which are about genomics. By relying only on words, LDA + Regression conflates
two “genetic” topics which are similar in vocabulary but different in citation structure.
In contrast, the RTM partitions the latent space differently, recognizing that papers
about DNA sequencing are unlikely to cite papers about genetic algorithms, and vice
versa. By better modeling the properties of the network jointly with the content of the
documents, the model is able to better tease apart the community structure.
4.3.3 Modeling spatial data
While explicitly linked structures like citation networks offer one sort of connectivity,
data with spatial or temporal information offer another sort of connectivity. In this
section, we show how RTMs can be used to model spatially connected data by applying
them to the LocalNews data set, a corpus of news headlines and summaries from each
state, with document linkage determined by spatial adjacency.
Figure 4.7 shows the per-state topic distributions inferred by RTM (left) and LDA
(right). Both models were fit with five topics using the same initialization. (We restrict
the discussion here to five topics for expositional convenience. See Section 4.4 for a
discussion on potential approaches for selecting the number of topics.) While topics
are strictly speaking exchangeable and therefore not comparable between models,
using the same initialization typically yields topics which are amenable to comparison.
Each row of Figure 4.7 shows a single component of each state’s topic proportion for
RTM and LDA. That is, if θs is the latent topic proportions vector for state s, then θs1
governs the intensity of that state’s color in the first row, θs2 the second, and so on.
While both RTM and LDA model the words in each state’s local news corpus,
LDA ignores geographical information. Hence, it finds topics which are distributed
over a wide swath of states which are often not contiguous. For example, LDA’s topic
1 is strongly expressed by Maine and Illinois, along with Texas and other states in
the South and West. In contrast, RTM only assigns non-trivial mass to topic 1 in a few Southern states. Similarly, LDA finds that topic 5 is expressed by several states in the Northeast and the West. The RTM, however, concentrates topic 4’s mass on the Northeastern states.

Figure 4.7: A comparison between RTM (left) and LDA (right) of topic distributions on local news data. Each color/row depicts a single topic. Each state’s color intensity indicates the magnitude of that topic’s component. The corresponding words associated with each topic are given in Table 4.3. Whereas LDA finds geographically diffuse topics, RTM, by modeling spatial connectivity, finds coherent regions.
Table 4.3: The top eight words in each RTM (left) and LDA (right) topic shown in Figure 4.7, ranked by score (defined below). RTM finds words which are predictive of both a state’s geography and its local news.

RTM topics:
Topic 1: comments, dead, scores, landfill, plane, metro, courthouse, evidence
Topic 2: crash, yesterday, registration, county, police, children, quarter, campaign
Topic 3: measure, marriage, suspect, officer, guards, protesters, appeals, finger
Topic 4: bridge, area, veterans, winter, city, snow, deer, concert
Topic 5: manslaughter, route, girls, state, knife, grounds, committee, developer

LDA topics:
Topic 1: election, plane, landfill, dead, police, union, interests, veterans
Topic 2: crash, police, yesterday, judge, fire, leave, charges, investors
Topic 3: comments, marriage, register, scores, schools, comment, registration, rights
Topic 4: snow, city, veterans, votes, winter, bridge, recount, lion
Topic 5: garage, girls, video, dealers, underage, housing, mall, union
The RTM does so by finding different topic assignments for each state, and
commensurately, different distributions over words for each topic. Table 4.3 shows the
top words in each RTM topic and each LDA topic. Words are ranked by the following
score,
scorek,w ≡ βk,w (log βk,w − (1/K) ∑k′ log βk′,w).
The score finds words which are likely to appear in a topic, but also corrects for
frequent words. The score therefore puts greater weight on words which more easily
characterize a topic. Table 4.3 shows that RTM finds words more geographically
indicative. While LDA provides one way of analyzing this collection of documents,
the RTM enables a different approach which is geographically cognizant. For example,
LDA’s topic 3 is an assortment of themes associated with California (e.g., ‘marriage’)
as well as others (‘scores’, ‘registration’, ‘schools’). The RTM on the other hand,
discovers words thematically related to a single news item (‘measure’, ‘protesters’,
‘appeals’) local to California. The RTM typically finds groups of words associated
with specific news stories, since they are easily localized, while LDA finds words which
cut broadly across news stories in many states. Thus on topic 5, the RTM discovers
key words associated with news stories local to the Northeast such as ‘manslaughter’
and ‘developer.’ On topic 5, the RTM also discovers a peculiarity of the Northeastern
dialect: that roads are given the appellation ‘route’ more frequently than elsewhere in
the country.
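The score is straightforward to compute from the fitted topic matrix; a minimal numpy sketch (assuming β has been smoothed so all entries are strictly positive, and vocab is an illustrative list of terms):

    import numpy as np

    def topic_word_scores(beta):
        # score[k, w] = beta[k, w] * (log beta[k, w] - mean over k' of log beta[k', w])
        log_beta = np.log(beta)
        return beta * (log_beta - log_beta.mean(axis=0, keepdims=True))

    def top_words(beta, vocab, k, n=8):
        # Highest-scoring words for topic k.
        scores = topic_word_scores(beta)
        return [vocab[w] for w in np.argsort(-scores[k])[:n]]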
By combining textual information along with geographical information, the RTM
provides a novel exploratory tool for identifying clusters of words that are driven by
both word co-occurrence and geographic proximity. Note that the RTM finds regions
in the United States which correspond to typical clusterings of states: the South, the
Northeast, the Midwest, etc. Further, the soft clusterings found by RTM confirm
many of our cultural intuitions—while New York is definitively a Northeastern state,
Virginia occupies a liminal space between the Midatlantic and the South.
4.3.4 Modeling social networks
We now show how the RTM can be used to qualitatively understand the structure of
social networks. In this section we apply the RTM to four data sets — people from
the Bible, people from New York Times articles, and two data sets crawled from the
online social networking site Twitter.7
Bible The data set contains 523 entities which appear in the Bible. For each entity
we extract all of the verses in which those entities appear; we take this collection
of verses to be the “document” for that entity. For links we take all entities which
co-occur in the same verse yielding 475 links. Figure 4.8 shows a visualization of
the results. Each node represents an individual; nodes are colored according to the
topic most associated with that individual. The node near the center (which although
colored brown is on the border of several other clusters) is associated with Jesus.
Another notable figure in that cluster is David, who is connected to many others of
that line. A node with high connectivity in a different cluster is Israel. Because Israel
may refer to both the place and as an alternate name for Jacob, it is possible that
some of these edges are spurious and the result of improper disambiguation. However
the results are suggestive, with the RTM clustering Israel along with figures such as
Joseph and Benjamin. As an avenue of future work, the RTM might be used to help
disambiguate these entities.
New York Times The data set contains 944 entities tagged in New York Times
articles. We use the collection of articles (out of a set of approximately one million
articles) in which those entities appear as that entity’s “document”. We consider two
entities connected if they are co-tagged in an article. Figure 4.9 shows the result of
fitting the RTM to these data. The RTM finds distinct clusters corresponding to
distinct areas in which people are notable; these clusters also often have strong internal
ties. For example, in the top center the green cluster contains sports personalities. Michael Jordan and Derek Jeter are a few prominent, highly-connected figures in this cluster. The yellow cluster, which also has strong internal connections, represents international leaders such as George W. Bush (lower left), Ronald Reagan (lower right), and George H. W. Bush (upper right). Note that many of these are conservatives. Beside this cluster is another, orange cluster of politicians. This cluster leans more liberal, with figures such as Bill Clinton and Michael Dukakis. Notably, several Republicans are also in this cluster, such as Michael Bloomberg. The remaining clusters found by RTM capture other groups of related individuals, such as artists and businesspeople.

7 http://www.twitter.com

Figure 4.8: The result of fitting the RTM to a collection of entities from the Bible. Nodes represent people and edges indicate that the people co-occur in the same verse. Nodes are colored according to the topic most associated with that individual. Edges between nodes with the same primary topic are colored black while edges between nodes with different primary topics are colored grey.
Twitter Twitter is an online social network where users can regularly post statements
(known as “tweets”). Users can also choose to “follow” other users, that is, receive
their tweets. We take each user’s documents to be the accumulation of their tweets
and we use follower connections as edges between users. Here we present two data
sets.
The first is a series of tweets collected over the period of approximately one week.
The users included in this data set were found by starting a breadth-first crawl from a
distinguished node, leading to 180 users being included. Figure 4.10 shows a force-
directed layout of this data set after RTM has been fit to it. The nodes represent
users; the colors of the nodes indicate the topic most associated with that user. Some
regions of the graph with similar topics have also been highlighted and annotated
with the most frequently occurring words in that topic. For example, one sector of the
graph has people talking about music topics. However these reside on the periphery.
Another sector uses words associated with blogs and social media; this area has a
hub-spoke structure. Finally, another region of the graph is distinguished by frequent
occurrences of the phrase “happy easter” (the crawl period included Easter). This
Figure 4.9: The result of fitting the RTM to a collection of entities from the New York Times. Nodes represent people and edges indicate that the people co-occur in the same article. Nodes are colored according to the topic most associated with that individual. Edges between nodes with the same primary topic are colored black while edges between nodes with different primary topics are colored grey.
Figure 4.10: The result of fitting the RTM to a small collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions dominated by a single topic have been highlighted and annotated with frequently appearing words for that topic.
region is more of a clique, with many users sending individual greetings to one another.
The second Twitter data set we analyze comes from a larger-scale crawl over a
longer period of time. There were 1425 users in this data set. Figure 4.11 shows
a visualization of the RTM applied to this data set. Once again, nodes have been
colored according to primary topic and several of the topical areas have been labeled
with frequently occurring words. This subset of the graph is dominated by a large
connected component in the center focused on online affairs (“blog”, “post”, “online”).
At the periphery are several smaller communities. For example, there is a food-centric
Figure 4.11: The result of fitting the RTM to a larger collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions are annotated with frequently appearing words for that topic.
community in the lower left, and a politics community just above it.8 Because this
is a larger data set, the RTM is able to discover broader, more thematically related
communities than with the smaller data set.
4.4 Discussion
There are many avenues for future work on relational topic models. Applying the
RTM to diverse types of “documents” such as protein-interaction networks, whose
node attributes are governed by rich internal structure, is one direction. Even the
8The frequently appearing term “tcot” is an acronym for Top Conservatives On Twitter.
text documents which we have focused on in this chapter have internal structure such
as syntax (Boyd-Graber and Blei 2008) which we are discarding in the bag-of-words
model. Augmenting and specializing the RTM to these cases may yield better models
for many application domains.
As with any parametric mixed-membership model, the number of latent components
in the RTM must be chosen using either prior knowledge or model-selection techniques
such as cross-validation. Incorporating non-parametric Bayesian priors such as the
Dirichlet process into the model would allow it to flexibly adapt the number of topics
to the data (Ferguson 1973; Antoniak 1974; Kemp et al. 2004; Teh et al. 2007). This,
in turn, may give researchers new insights into the latent membership structure of
networks.
In sum, the RTM is a hierarchical model of networks and per-node attribute data.
The RTM is used to analyze linked corpora such as citation networks, linked web
pages, social networks with user profiles, and geographically tagged news. We have
demonstrated qualitatively and quantitatively that the RTM provides an effective
and useful mechanism for analyzing and using such data. It significantly improves on
previous models, integrating both node-specific information and link structure to give
better predictions.
Chapter 5
Discovering Link Information
In the previous chapters we have focused on modeling existing network data,
encoding collections of relationships between entities such as people, places, genes, or
corporations. However, the network data thus far have been unannotated, that is, edges
express connectivity but not the nature of the connection. And while many resources
for networks of interesting entities are emerging, most of these can only annotate
connections in a limited fashion. Although relationships between entities are rich, it is
impractical to manually devise complete characterizations of these relationships for
every pair of entities on large, real-world corpora.
Below we present a novel probabilistic topic model to analyze text corpora and
infer descriptions of its entities and of relationships between those entities. We
develop variational methods for performing approximate inference on our model and
demonstrate that our model can be practically deployed on large corpora such as
Wikipedia. We show qualitatively and quantitatively that our model can construct
and annotate graphs of relationships and make useful predictions.
Portions of this chapter appear in Chang et al. (2009).
5.1 Background
Network data—data which express relationships between ensembles of entities—are
becoming increasingly pervasive. People are connected to each other through a variety
of kinship, social, and professional relationships; proteins bind to and interact with
other proteins; corporations conduct business with other corporations. Understanding
the nature of these relationships can provide useful mechanisms for suggesting new
relationships between entities, characterizing new relationships, and quantifying global
properties of naturally occurring network structures (Anagnostopoulos et al. 2008;
Cai et al. 2005; Taskar et al. 2003; Wasserman and Pattison 1996; Zhou et al. 2008).
Many corpora of network data have emerged in recent years. Examples of such
data include social networks, such as LinkedIn or Facebook, and citation networks,
such as CiteSeer, Rexa, or JSTOR. Other networks can be constructed manually or
automatically from texts: networks of the people mentioned in the Bible, of the genes in
scientific abstracts, or of the decisions in legal journals. Characterizing the networks of connections between
these entities is of historical, scientific, and practical interest. However, describing
every relationship for large, real-world corpora is infeasible. Thus most data sets label
edges as merely on or off, or with a small set of fixed, predefined connection types.
These labellings cannot capture the complexities underlying the relationships and
limit the applicability of these data sets.
An example of this is shown in Figure 5.1. The figure depicts a social network
where nodes represent entities and edges represent some relationship between the
entities. Some social networks such as Facebook1 have self-reported information about
each edge; for example, two users may be connected by the fact that they attended
the same school (top panel). However, this self-reported information is limited and
sparsely populated. By analyzing unstructured resources, we hope to increase the
number of annotated edges, the number of nodes covered, and the kinds of annotations
1http://www.facebook.com
(bottom panel).
In this chapter we develop a method for augmenting such data sets by analyzing
document collections to uncover the relationships encoded in their texts. Text corpora
are replete with information about relationships, but this information is out of reach
for traditional network analysis techniques. We develop Networks Uncovered By
Bayesian Inference (Nubbi), a probabilistic topic model of text (Blei et al. 2003a;
Hofmann 1999; Steyvers and Griffiths 2007) with hidden variables that represent the
patterns of word use which describe the relationships in the text. Given a collection
of documents, Nubbi reveals the hidden network of relationships that is encoded in
the texts by associating rich descriptions with each entity and its connections. For
example, Figure 5.2 illustrates a subset of the network uncovered from the texts
of Wikipedia. Connections between people are depicted by edges, each of which is
associated with words that describe the relationship.
First, we describe the intuitions and statistical assumptions behind Nubbi. Second,
we derive efficient algorithms for using Nubbi to analyze large document collections.
Finally, we apply Nubbi to the Bible, Wikipedia, and scientific abstracts. We demon-
strate that Nubbi can discover sensible descriptions of the network and can make
predictions competitive with those made by state of the art models.
5.2 Model
The goal of Nubbi is to analyze a corpus to describe the relationships between pairs of
entities. Nubbi takes as input very lightly annotated data, requiring only that entities
within the input text be identified. Nubbi also takes as input the network of entities
to be annotated. For some corpora this network is already explicitly encoded as a
graph. For other text corpora this graph must be constructed. One simple way of
constructing this graph is to use a fully-connected network of entities and then prune
(a) A social network with some extant data about how two entities are related: Jonathan Chang and Jordan Boyd-Graber, linked by "You and Jordan both went to Princeton."
(b) The desiderata: a social network where relationships have been automatically annotated by analyzing free text: Ronald Reagan and Jane Wyman, linked by "You and Jane used to be married."
Figure 5.1: An example motivating this work. The figures depict a social network; nodes represent individuals and edges represent relationships between the individuals. Many social networks have some detailed information about the relationships. It is this data we seek to automatically build and augment.
[Figure 5.2 content: a network of political figures including Joseph Stalin, Winston Churchill, Lyndon B. Johnson, Mao Zedong, Jimmy Carter, Margaret Thatcher, Ronald Reagan, Richard Nixon, Nikita Khrushchev, John F. Kennedy, Hubert Humphrey, George H. W. Bush, Ross Perot, Leon Trotsky, Lev Kamenev, Zhou Enlai, and Mikhail Gorbachev, with edges labeled by relationship-topic words such as "labour govern leader british world," "soviet communist central union full," "soviet russian govern union nuclear," and "republican state federalist vote vice."]
Figure 5.2: A small subgraph of the social network Nubbi learned taking only the raw text of Wikipedia with tagged entities as input. The full model uses 25 relationship and entity topics. An edge exists between two entities if their co-occurrence count is high. For some of the edges, we show the top words from the most probable relationship topic associated with that pair of entities. These are the words that best explain the contexts where these two entities appear together. A complete browser for this data is available at http://topics.cs.princeton.edu/nubbi.
the edges in this graph using statistics such as entity co-occurrence counts.
From the entities in this network, the text is divided into two different classes of
bags of words. First, each entity is associated with an entity context, a bag of words
co-located2 with the entity. Second, each pair of entities is associated with a pair
context, a bag of words co-located with the pair. Figure 5.3 shows an example of the
input to the algorithm turned into entity contexts and pair contexts.
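To make the construction concrete, here is a minimal Python sketch, with placeholder data and hypothetical names, of one way verse-level annotations might be turned into entity contexts, pair contexts, and a co-occurrence-pruned entity graph:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Each verse is a (tokens, entities) pair, assumed to come from upstream
# tokenization and entity annotation; the verses here are placeholders.
verses = [
    (["spoken", "words", "disciples", "brook", "cedron"], ["jesus"]),
    (["betrayed", "knew", "place", "disciples"], ["jesus", "judas"]),
]

entity_contexts = defaultdict(list)  # entity  -> bag of co-located words
pair_contexts = defaultdict(list)    # (e, e') -> bag of co-located words
cooccur = Counter()                  # (e, e') -> co-occurrence count

for tokens, entities in verses:
    if len(entities) == 1:
        entity_contexts[entities[0]].extend(tokens)
    for pair in combinations(sorted(set(entities)), 2):
        pair_contexts[pair].extend(tokens)
        cooccur[pair] += 1

# Prune the fully connected graph, keeping only frequently co-occurring pairs.
MIN_COOCCUR = 1  # threshold is a free choice, not prescribed by the model
edges = [pair for pair, n in cooccur.items() if n >= MIN_COOCCUR]
```

For corpora without verse boundaries, the same bookkeeping applies with fixed-width token windows around each mention, as described in Section 5.4.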
Nubbi learns two descriptions of how entities appear in the corpus: entity topics
and relationship topics. Following Blei et al. (2003a), a topic is defined to be a
distribution over words. To aid intuitions, we will for the moment assume that these
topics are given and have descriptive names. We will describe how the topics and
contexts interplay to reveal the network of relationships hidden in the texts. We
emphasize, however, that the goal of Nubbi is to analyze the texts to learn both the
topics and relationships between entities.
An entity topic is a distribution over words, and each entity is associated with a
distribution over entity topics. For example, suppose there are three entity topics:
politics, movies, and sports. Ronald Reagan would have a distribution that favors
2We use the term "co-located" to refer to words and entities which appear near one another in a text. The definition of "near" depends on the corpus; some practical choices are given in Section 5.4.
[Figure 5.3 content: the verses of John 18:1-7 are shown twice with entity mentions marked. On the left, verses mentioning a single entity contribute their words to that entity's context (e.g., for Jesus: "spoken words disciples brook Cedron garden enter disciples knowing things seek asked seek Nazareth"; for Judas: "received band officers chief priests Pharisees lanterns torches weapons"). On the right, verses mentioning both entities contribute their words to the pair context for Jesus and Judas ("betrayed knew place disciples answered Nazareth saith betrayed").]
Figure 5.3: A high-level overview of Nubbi's view of text data. A corpus with identified entities is turned into a collection of bags-of-words (in rectangles), each associated with individual entities (left) or pairs of entities (right). The procedure in the left panel is repeated for every entity in the text while the procedure in the right panel is repeated for every pair of entities.
politics and movies; athlete-actors like Johnny Weissmuller and Geena Davis would
have distributions that favor movies and sports, and specialized athletes, like Pele,
would have distributions that favor sports more than other entity topics. Nubbi uses
entity topics to model entity contexts. Because the sports entity topic would contain
words like “cup,” “win,” and “goal,” associating Pele exclusively with the sports
entity topic would be consistent with the words observed in his context.
Relationship topics are distributions over words associated with pairs of entities,
rather than individual entities, and each pair of entities is associated with a distribution
over relationship topics. Just as the entity topics cluster similar people together (e.g.,
Ronald Reagan, George Bush, and Bill Clinton all express the politics topic), the
relationship topics can cluster similar pairs of people. Thus, Romeo and Juliet,
Abelard and Heloise, Ruslan and Ludmilla, and Izanami and Izanagi might all share a
lovers relationship topic.
Relationship topics are used to explain pair contexts. Each word in a pair context
is assumed to express something about either one of the participating entities or
something particular to their relationship. For example, consider Jane Wyman and
Ronald Reagan. (Jane Wyman, an actress, was actor/president Ronald Reagan’s first
wife.) Individually, Wyman is associated with the movies entity topic and Reagan
is associated with the movies and politics entity topics. In addition, this pair of
entities is associated with relationship topics for divorce and costars.
Nubbi hypothesizes that each word describes either one of the entities or their
relationship. Consider the pair context for Reagan and Wyman:
In 1938, Wyman co-starred with Ronald Reagan. Reagan and actress Jane
Wyman were engaged at the Chicago Theater and married in Glendale, Califor-
nia. Following arguments about Reagan’s political ambitions, Wyman filed for
divorce in 1948. Since Reagan is the only U.S. president to have been divorced,
Wyman is the only ex-wife of an American President.
We have marked the words that are not associated with the relationship topic. Func-
tional words are gray; words that come from a politics topic (associated with Ronald
Reagan) are underlined; and words that come from a movies topic (associated with
Jane Wyman) are italicized.
The remaining words, “1938,” “co-starred,” “engaged,” “Glendale,” “filed,” “di-
vorce,” “1948,” “divorced,” and “ex-wife,” describe the relationship between Reagan
and Wyman. Indeed, it is by deducing which case each word falls into that Nubbi is
able to capture the relationships between entities. Examining the relationship topics
associated with each pair of entities provides a description of that relationship.
The above discussion gives an intuitive picture of how Nubbi explains the observed
entity and pair contexts using entity and relationship topics. In data analysis, however,
we do not observe the entity topics, pair topics, or the assignments of words to topics.
Our goal is to discover them.
To do this, we formalize these notions in a generative probabilistic model of the
texts that uses hidden random variables to encode the hidden structure described
above. In posterior inference, we “reverse” the process to discover the latent structure
that best explains the documents. (Posterior inference is described in the next section.)
More formally, Nubbi assumes the following statistical model.

1. For each entity topic $j$ and relationship topic $k$,
   (a) Draw topic multinomials $\beta^\theta_j \sim \mathrm{Dir}(\eta^\theta + 1)$, $\beta^\psi_k \sim \mathrm{Dir}(\eta^\psi + 1)$.
2. For each entity $e$,
   (a) Draw entity topic proportions $\theta_e \sim \mathrm{Dir}(\alpha^\theta)$;
   (b) For each word associated with this entity's context,
       i. Draw topic assignment $z_{e,n} \sim \mathrm{Mult}(\theta_e)$;
       ii. Draw word $w_{e,n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,n}})$.
3. For each pair of entities $e, e'$,
   (a) Draw relationship topic proportions $\psi_{e,e'} \sim \mathrm{Dir}(\alpha^\psi)$;
   (b) Draw selector proportions $\pi_{e,e'} \sim \mathrm{Dir}(\alpha^\pi)$;
   (c) For each word associated with this entity pair's context,
       i. Draw selector $c_{e,e',n} \sim \mathrm{Mult}(\pi_{e,e'})$;
       ii. If $c_{e,e',n} = 1$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_e)$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,e',n}})$.
       iii. If $c_{e,e',n} = 2$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_{e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,e',n}})$.
       iv. If $c_{e,e',n} = 3$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\psi_{e,e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\psi_{z_{e,e',n}})$.
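The following numpy sketch mirrors this generative process; the vocabulary size, topic counts, hyperparameter values, and function names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K_theta, K_psi = 1000, 5, 5                   # vocabulary and topic counts
a_theta, a_psi, a_pi, eta = 1.0, 1.0, 1.0, 0.01  # symmetric hyperparameters

# Step 1: per-topic term distributions for entity and relationship topics.
beta_theta = rng.dirichlet(np.full(V, eta + 1.0), size=K_theta)
beta_psi = rng.dirichlet(np.full(V, eta + 1.0), size=K_psi)

def draw_entity_context(n_words):
    # Step 2: entity topic proportions, then a topic and a word per position.
    theta = rng.dirichlet(np.full(K_theta, a_theta))
    z = rng.choice(K_theta, size=n_words, p=theta)
    words = np.array([rng.choice(V, p=beta_theta[k]) for k in z])
    return theta, words

def draw_pair_context(theta_e, theta_e2, n_words):
    # Step 3: relationship and selector proportions, then one word at a time.
    psi = rng.dirichlet(np.full(K_psi, a_psi))
    pi = rng.dirichlet(np.full(3, a_pi))
    words = []
    for _ in range(n_words):
        c = rng.choice(3, p=pi)              # which source explains the word
        if c == 2:                           # the relationship itself
            k = rng.choice(K_psi, p=psi)
            words.append(rng.choice(V, p=beta_psi[k]))
        else:                                # one of the two entities
            theta = theta_e if c == 0 else theta_e2
            k = rng.choice(K_theta, p=theta)
            words.append(rng.choice(V, p=beta_theta[k]))
    return np.array(words)

theta_a, _ = draw_entity_context(50)
theta_b, _ = draw_entity_context(50)
pair_words = draw_pair_context(theta_a, theta_b, 30)
```

Note how the selector routes each pair-context word to one of three distributions; it is exactly this routing that posterior inference must reverse.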
This is depicted in a graphical model in Figure 5.4.
Figure 5.4: A depiction of the Nubbi model using the graphical model formalism. Nodes are random variables; edges denote dependence; plates (i.e., rectangles) denote replication; shaded nodes are observed and unshaded nodes are hidden. The left half of the figure shows entity contexts, while the right half shows pair contexts. In its entirety, the model generates both the entity contexts and the pair contexts shown in Figure 5.3.
The hyperparameters of the Nubbi model are Dirichlet parameters $\alpha^\theta$, $\alpha^\psi$, and $\alpha^\pi$, which govern the entity topic distributions, the relationship distributions, and the entity/pair mixing proportions. The Dirichlet parameters $\eta^\theta$ and $\eta^\psi$ are priors for each topic's multinomial distribution over terms. There are $K_\theta$ per-topic term distributions for entity topics, $\beta^\theta_{1:K_\theta}$, and $K_\psi$ per-topic term distributions, $\beta^\psi_{1:K_\psi}$, for relationship topics.
The words of each entity context are essentially drawn from an LDA model using the entity topics. The words of each pair context are drawn in a more sophisticated way. The topic assignments for the words in the pair context for entity $e$ and entity $e'$ are hypothesized to come from the entity topic proportions $\theta_e$, the entity topic proportions $\theta_{e'}$, or the relationship topic proportions $\psi_{e,e'}$. The switching variable $c_{e,e',n}$ selects which of these three assignments is used for each word. This selector $c_{e,e',n}$ is drawn from $\pi_{e,e'}$, which describes the tendency of words associated with this pair of entities to be ascribed to either of the entities or the pair.
It is $\psi_{e,e'}$ that describes what the relationship between entities $e$ and $e'$ is. By allowing some of each pair's context words to come from a relationship topic distribution, the model is able to characterize each pair's interaction in terms of the latent
relationship topics.
5.3 Computation with NUBBI
With the model formally defined in terms of hidden and observed random variables,
we now turn to deriving the algorithms needed to analyze data. Data analysis involves
inferring the hidden structure from observed data and making predictions on future
data. In this section, we develop a variational inference procedure for approximating
the posterior. We then use this procedure to develop a variational expectation-
maximization (EM) algorithm for parameter estimation and for approximating the
various predictive distributions of interest.
5.3.1 Inference
In posterior inference, we approximate the posterior distribution of the latent variables
conditioned on the observations. As for LDA, exact posterior inference for Nubbi is
intractable (Blei et al. 2003a). We appeal to variational methods.
Variational methods posit a family of distributions over the latent variables indexed
by free variational parameters. Those parameters are then fit to be close to the true
posterior, where closeness is measured by relative entropy. See Jordan et al. (1999)
for a review. We use the factorized family
$$
\begin{aligned}
q(\Theta, \mathbf{Z}, \mathbf{C}, \Pi, \Psi \mid \boldsymbol{\gamma}^\theta, \boldsymbol{\gamma}^\psi, \Phi^\theta, \Phi^\psi, \boldsymbol{\gamma}^\pi, \Xi) ={}&
\prod_e \Big[ q(\theta_e \mid \gamma^\theta_e) \prod_n q(z_{e,n} \mid \phi^\theta_{e,n}) \Big] \\
&\cdot \prod_{e,e'} q(\psi_{e,e'} \mid \gamma^\psi_{e,e'})\, q(\pi_{e,e'} \mid \gamma^\pi_{e,e'}) \\
&\cdot \prod_{e,e'} \Big[ \prod_n q(z_{e,e',n}, c_{e,e',n} \mid \phi^\psi_{e,e',n}, \xi_{e,e',n}) \Big],
\end{aligned}
$$
where $\boldsymbol{\gamma}^\theta$ is a set of Dirichlet parameters, one for each entity; $\boldsymbol{\gamma}^\pi$ and $\boldsymbol{\gamma}^\psi$ are sets of Dirichlet parameters, one for each pair of entities; $\Phi^\theta$ is a set of multinomial parameters, one for each word in each entity; $\Xi$ is a set of multinomial parameters, one for each pair of entities; and $\Phi^\psi$ is a set of matrices, one for each word in each entity pair. Each $\phi^\psi_{e,e',n}$ contains three rows — one which defines a multinomial over topics given that the word comes from $\theta_e$, one which defines a multinomial given that the word comes from $\theta_{e'}$, and one which defines a multinomial given that the word comes from $\psi_{e,e'}$. Note that the variational family we use is not the fully-factorized family; this family fully captures the joint distribution of $z_{e,e',n}$ and $c_{e,e',n}$. We parameterize this pair by $\phi^\psi_{e,e',n}$ and $\xi_{e,e',n}$, which define a multinomial distribution over all $3K$ possible values of this pair of variables.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),
$$
\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e + \mathrm{H}(q), \tag{5.1}
$$
where sums over $e, e'$ iterate over all pairs of entities,
$$
\begin{aligned}
\mathcal{L}_{e,e'} ={}& \sum_n \mathbb{E}_q\left[\log p(w_{e,e',n} \mid \beta^\psi_{1:K}, \beta^\theta_{1:K}, z_{e,e',n}, c_{e,e',n})\right] \\
&+ \sum_n \mathbb{E}_q\left[\log p(z_{e,e',n} \mid c_{e,e',n}, \theta_e, \theta_{e'}, \psi_{e,e'})\right] \\
&+ \sum_n \mathbb{E}_q\left[\log p(c_{e,e',n} \mid \pi_{e,e'})\right] \\
&+ \mathbb{E}_q\left[\log p(\psi_{e,e'} \mid \alpha^\psi)\right] + \mathbb{E}_q\left[\log p(\pi_{e,e'} \mid \alpha^\pi)\right],
\end{aligned}
$$
and
$$
\mathcal{L}_e = \sum_n \mathbb{E}_q\left[\log p(w_{e,n} \mid \beta^\theta_{1:K}, z_{e,n})\right] + \mathbb{E}_q\left[\log p(\theta_e \mid \alpha^\theta)\right] + \sum_n \mathbb{E}_q\left[\log p(z_{e,n} \mid \theta_e)\right].
$$
The $\mathcal{L}_{e,e'}$ term of the ELBO differentiates this model from previous models (Blei et al. 2003a). The connections between entities affect the objective in posterior inference (and, below, in parameter estimation).
Our aim now is to compute each term of the objective function given in Equation 5.1. After expanding this expression in terms of the variational parameters, we can derive a set of coordinate ascent updates to optimize the ELBO with respect to the variational parameters, $\boldsymbol{\gamma}^\theta, \boldsymbol{\gamma}^\psi, \Phi^\theta, \Phi^\psi, \boldsymbol{\gamma}^\pi, \Xi$. Refer to Appendix C for a full derivation of the following updates.
The updates for $\phi^\theta_{e,n}$ assign topic proportions to each word associated with an individual entity,
$$
\phi^\theta_{e,n} \propto \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_e) \right), \tag{5.2}
$$
where $\log \beta^\theta_{w_n}$ represents the logarithm of column $w_n$ of $\beta^\theta$ and $\Psi(\cdot)$ is the digamma function. (A digamma of a vector is the vector of digammas.) The topic assignments for each word associated with a pair of entities are similar,
$$
\begin{aligned}
\phi^\psi_{e,e',n,1} &= \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T \gamma^\theta_e) - \lambda_{e,e',n,1} \right) & (5.3) \\
\phi^\psi_{e,e',n,2} &= \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_{e'}) - \Psi(\mathbf{1}^T \gamma^\theta_{e'}) - \lambda_{e,e',n,2} \right) & (5.4) \\
\phi^\psi_{e,e',n,3} &= \exp\left( \log \beta^\psi_{w_n} + \Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T \gamma^\psi_{e,e'}) - \lambda_{e,e',n,3} \right), & (5.5)
\end{aligned}
$$
where $\lambda_{e,e',n}$ is a vector of normalizing constants. These normalizing constants are then used to estimate the probability that each word associated with a pair of entities is assigned to either an individual or relationship,
$$
\xi_{e,e',n} \propto \exp\left( \lambda_{e,e',n} + \Psi(\gamma^\pi_{e,e'}) \right). \tag{5.6}
$$
The topic and entity assignments are then used to estimate the variational Dirichlet parameters which parameterize the latent topic and entity proportions,
$$
\begin{aligned}
\gamma^\pi_{e,e'} &= \alpha^\pi + \sum_n \xi_{e,e',n} & (5.7) \\
\gamma^\psi_{e,e'} &= \alpha^\psi + \sum_n \xi_{e,e',n,3}\, \phi^\psi_{e,e',n,3}. & (5.8)
\end{aligned}
$$
Finally, the topic and entity assignments for each pair of entities, along with the topic assignments for each individual entity, are used to update the variational Dirichlet parameters which govern the latent topic assignments for each individual entity. These updates allow us to combine evidence associated with individual entities and evidence associated with entity pairs:
$$
\gamma^\theta_e = \alpha^\theta + \sum_n \phi^\theta_{e,n} + \sum_{e'} \sum_n \left( \xi_{e,e',n,1}\, \phi^\psi_{e,e',n,1} + \xi_{e',e,n,2}\, \phi^\psi_{e',e,n,2} \right). \tag{5.9, 5.10}
$$
5.3.2 Parameter estimation
We fit the model by finding maximum likelihood estimates for each of the parameters:
$\pi_{e,e'}$, $\beta^\theta_{1:K}$, and $\beta^\psi_{1:K}$. Once again, this is intractable so we turn to an approximation.
We employ variational expectation-maximization, where we iterate between optimizing
the ELBO of Equation 5.1 with respect to the variational distribution and with respect
to the model parameters.
Optimizing with respect to the variational distribution is described in Section 5.3.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood estimation with expected sufficient statistics, where the expectation is taken with respect to the variational distribution. The sufficient statistics for the topic vectors $\beta^\theta$ and $\beta^\psi$ consist of all topic-word pairs in the corpus, along with their entity or relationship assignments. Collecting these statistics leads to the following updates,
$$
\begin{aligned}
\beta^\theta_w \propto{}& \eta^\theta + \sum_e \sum_n \mathbb{1}(w_{e,n} = w)\, \phi^\theta_{e,n} \\
&+ \sum_{e,e'} \sum_n \mathbb{1}(w_{e,e',n} = w)\, \xi_{e,e',n,1}\, \phi^\psi_{e,e',n,1} \\
&+ \sum_{e,e'} \sum_n \mathbb{1}(w_{e',e,n} = w)\, \xi_{e',e,n,2}\, \phi^\psi_{e',e,n,2} & (5.11\text{–}5.13) \\
\beta^\psi_w \propto{}& \eta^\psi + \sum_{e,e'} \sum_n \mathbb{1}(w_{e,e',n} = w)\, \xi_{e,e',n,3}\, \phi^\psi_{e,e',n,3}. & (5.14)
\end{aligned}
$$
The sufficient statistics for $\pi_{e,e'}$ are the number of words ascribed to the first entity, the second entity, and the relationship topic. This results in the update
$$
\pi_{e,e'} \propto \exp\left( \Psi\left( \alpha^\pi + \textstyle\sum_n \xi_{e,e',n} \right) \right).
$$
5.3.3 Prediction
With a fitted model, we can make judgments about how well the model describes the
joint distribution of words associated with previously unseen data. In this section we
describe two prediction tasks that we use to compare Nubbi to other models: word
prediction and entity prediction.
In word prediction, the model predicts an unseen word associated with an entity pair given the other words associated with that pair, $p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i})$. This quantity cannot be computed tractably. We instead turn to a variational approximation of this posterior,
$$
p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i}) \approx \mathbb{E}_q\left[ p(w_{e,e',i} \mid z_{e,e',i}) \right].
$$
Here we have replaced the expectation over the true posterior probability $p(z_{e,e',i} \mid \mathbf{w}_{e,e',-i})$ with the variational distribution $q(z_{e,e',i})$, whose parameters are trained by maximizing the evidence bound given $\mathbf{w}_{e,e',-i}$.
In entity prediction, the model must predict which entity pair a set of words is most likely to appear in. By Bayes' rule, the posterior probability of an entity pair given a set of words is proportional to the probability of the set of words belonging to that entity pair,
$$
p((e, e') \mid \mathbf{w}) \propto p(\mathbf{w} \mid e, e'),
$$
where the proportionality constant is chosen such that the sum of this probability over all entity pairs is equal to one.
After a qualitative examination of the topics learned from corpora, we use these
two prediction methods to compare Nubbi against other models that offer probabilistic
frameworks for associating entities with text in Section 5.4.2.
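As a concrete illustration of the entity-prediction rule, the hypothetical helper below normalizes per-pair held-out log likelihoods (assumed to have been computed from the fitted variational posterior as above) into a posterior over candidate pairs, implicitly using a uniform prior:

```python
import numpy as np

def predict_entity_pair(pair_logliks):
    """Entity prediction: normalize per-pair held-out log likelihoods
    log p(w | e, e') into a posterior over candidate pairs (flat prior)."""
    pairs = list(pair_logliks)
    scores = np.array([pair_logliks[p] for p in pairs])
    post = np.exp(scores - scores.max())   # stable exponentiation
    post /= post.sum()
    return pairs[int(np.argmax(post))], dict(zip(pairs, post))
```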
5.4 Experiments
In this section, we describe a qualitative and quantitative study of Nubbi on three
data sets: the bible (characters in the bible), biological (genes, diseases, and proteins
in scientific abstracts), and wikipedia. For these three corpora, the entities of interest
are already annotated. Experts have marked all mentions of people in the Bible (Nave
2003) and biological entities in corpora of scientific abstracts (Ohta et al. 2002; Tanabe
et al. 2005), and Wikipedia’s link structure offers disambiguated mentions. Note that
             Topic 1           Topic 2
Entities     Jesus, Mary       Abraham, Chedorlaomer
             Terah, Abraham    Ahaz, Rezin
Top Terms    father            king
             begat             city
             james             smote
             daughter          lord
             mother            thousand

Table 5.1: Examples of relationship topics learned by a five topic Nubbi model trained on the Bible. The upper part of the table shows some of the entity pairs highly associated with that topic. The lower part of the table shows the top terms in that topic's multinomial.
it is also possible to use named entity recognizers to preprocess data for which entities
are not previously identified.
The first step in our analysis is to determine the entity and pair contexts. For
bible, verses offer an atomic context; any term in a verse with an entity (pair) is
associated with that entity (pair). For biological, we use tokens within a fixed distance
from mentions of an entity (pair) to build the data used by our model. For wikipedia,
we used the same approach as biological for associating words with entity pairs. For
individual entities, however, we associated all the terms in that person's Wikipedia entry.
For all corpora we removed tokens based on a stop list and stemmed all tokens using
the Porter stemmer. Infrequent tokens, entities, and pairs were pruned from the
corpora.3
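A minimal sketch of this preprocessing, assuming NLTK's English stop list and its Porter stemmer implementation (the exact stop list and pruning threshold used are not specified here and are illustrative):

```python
from collections import Counter
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer

def preprocess(contexts, min_count=5):
    """Stop-word removal, Porter stemming, and pruning of infrequent tokens.

    contexts maps an entity (or entity pair) to its bag of tokens."""
    stop = set(stopwords.words("english"))
    stem = PorterStemmer().stem
    stemmed = {key: [stem(t) for t in toks if t.lower() not in stop]
               for key, toks in contexts.items()}
    counts = Counter(t for toks in stemmed.values() for t in toks)
    return {key: [t for t in toks if counts[t] >= min_count]
            for key, toks in stemmed.items()}
```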
5.4.1 Learning networks
We first demonstrate that the Nubbi model produces interpretable entity topics that
describe entity contexts and relationship topics that describe pair contexts. We also
show that by combining Nubbi’s model of language with a network automatically
estimated through co-occurrence counts, we can construct rich social networks with
labeled relationships.
3After preprocessing, the bible dataset contains a lexicon of size 2411, 523 entities, and 475 entity pairs. The biological dataset contains a lexicon of size 2425, 1566 entities, and 577 entity pairs. The wikipedia dataset contains a lexicon of size 9144, 1918 entities, and 429 entity pairs.
Table 5.1 shows some of the relationship topics learned from the Bible data. (This
model has five entity topics and five relationship topics; see the following section for
more details on how the choice of number of topics affects performance.) Each column
shows the words with the highest weight in that topic’s multinomial parameter vector,
and above each column are examples of entity pairs associated with that topic. In this
example, relationship Topic 1 corresponds to blood relations, and relationship Topic 2
refers to antagonists. We emphasize that this structure is uncovered by analyzing the
original texts. No prior knowledge of the relationships between characters is used in
the analysis.
In a more diverse corpus, Nubbi learns broader topics. In a twenty-five topic model
trained on the Wikipedia data, the entity topics broadly apply to entities across many
time periods and cultures. Artists, monarchs, world politicians, people from American
history, and scientists each have a representative topic (see Table 5.2).
The relationship topics further restrict entities that are specific to an individual
country or period (Table 5.3). In some cases, relationship topics narrow the focus of
broader entity topics. For instance, relationship Topics 1, 5, 6, 9, and 10 in Table 5.3
help explain the specific historical context of pairs better than the very broad world
leader entity Topic 7.
In some cases, these distinctions are very specific. For example, relationship Topic 6
contains pairs of post-Hanoverian monarchs of Great Britain and Northern Ireland,
while relationship Topic 5 contains relationships with pre-Hanoverian monarchs of
England even though both share words like “queen” and “throne.” Note also that
these topics favor words like “father” and “daughter,” which describe the relationships
present in these pairs.
The model sometimes groups together pairs of people from radically different
contexts. For example, relationship Topic 8 groups composers with religious scholars
(both share terms like “mass” and “patron”), revealing a drawback of using a unigram-based method.
             Topic 1                  Topic 2            Topic 3                Topic 4
Entities     George Westinghouse      Charles Peirce     Lindsay Davenport      Lee Harvey Oswald
             George Stephenson        Francis Crick      Martina Hingis         Timothy McVeigh
             Guglielmo Marconi        Edmund Husserl     Michael Schumacher     Yuri Gagarin
             James Watt               Ibn al-Haytham     Andre Agassi           Bobby Seale
             Robert Fulton            Linus Pauling      Alain Prost            Patty Hearse
Top Terms    electricity              work               align                  state
             engine                   universe           bgcolor                american
             patent                   theory             race                   year
             company                  science            win                    time
             invent                   time               grand                  president

             Topic 5                  Topic 6            Topic 7                Topic 8
Entities     Pierre-Joseph Proudhon   Betty Davis        Franklin D. Roosevelt  Jack Kirby
             Benjamin Tucker          Humphrey Bogart    Jimmy Carter           Terry Pratchett
             Murray Rothbard          Kate Winslet       Brian Mulroney         Carl Barks
             Karl Marx                Martin Scorsese    Neville Chamberlain    Gregory Benford
             Amartya Sen              Audrey Hepburn     Margaret Thatcher      Steve Ditko
Top Terms    social                   film               state                  story
             work                     award              party                  book
             politics                 star               election               work
             society                  role               president              fiction
             economics                play               government             publish

             Topic 9                  Topic 10
Entities     Babe Ruth                Xenophon
             Barry Bonds              Caligula
             Satchel Page             Horus
             Pedro Martinez           Nebuchadrezzar II
             Roger Clemens            Nero
Top Terms    game                     greek
             baseball                 rome
             season                   history
             league                   senate
             run                      death

Table 5.2: Ten topics from a model trained on Wikipedia carve out fairly broad categories like monarchs, athletes, entertainers, and figures from myth and religion. An exception is the more focused Topic 9, which is mostly about baseball. Note that not all of the information is linguistic; Topic 3 shows we were unsuccessful in filtering out all Wikipedia's markup, and the algorithm learned to associate score tables with a sports category.
             Topic 1                    Topic 2                    Topic 3             Topic 4
Pairs        Reagan-Gorbachev           Muhammad-Moses             Grant-Lee           Paul VI-John Paul II
             Kennedy-Khrushchev         Rabin-Arafat               Muhammad-Abu Bakr   Pius XII-Paul II
             Alexandra-Alexander III    E. Bronte-C. Bronte        Sherman-Grant       John XXIII-John Paul II
             Najibullah-Kamal           Solomon-Moses              Jackson-Lee         Pius IX-John Paul II
             Nicholas I-Alexander III   Arafat-Sharon              Sherman-Lee         Leo XIII-John Paul II
Terms        soviet                     israel                     union               vatican
             russian                    god                        corp                cathol
             government                 palestinian                gen                 papal
             union                      chile                      campaign            council
             nuclear                    book                       richmond            time

             Topic 5                    Topic 6                    Topic 7             Topic 8
Pairs        Philip V-Louis XIV         Henry VIII-C. of Aragon    Jefferson-Burr      Mozart-Salieri
             Louis XVI-Francis I        Mary I (Eng)-Elizabeth I   Jefferson-Madison   Malory-Arthur
             Maria Theresa-Charlemagne  Henry VIII-Anne Boleyn     Perot-Bush          Mozart-Beethoven
             Philip V-Louis XVI         Mary I (Scot)-Elizabeth I  Jefferson-Jay       Bede-Augustine
             Philip V-Maria Theresa     Henry VIII-Elizabeth I     J.Q. Adams-Clay     Leo X-Julius II
Terms        french                     queen                      republican          music
             dauphin                    english                    state               play
             spanish                    daughter                   federalist          film
             death                      death                      vote                piano
             throne                     throne                     vice                work

             Topic 9                    Topic 10
Pairs        George VI-Edward VII       Trotsky-Stalin
             George VI-Edward VIII      Kamenev-Stalin
             Victoria-Edward VII        Khrushchev-Stalin
             George V-Edward VII        Kamenev-Trotsky
             Victoria-George VI         Zhou Enlai-Mao Zedong
Terms        royal                      soviet
             queen                      communist
             british                    central
             throne                     union
             father                     full

Table 5.3: In contrast to Table 5.2, the relationship topics shown here are more specific to time and place. For example, English monarch pairs (Topic 6) are distinct from British monarch pairs (Topic 9). While there is some noise (the Bronte sisters being lumped in with mideast leaders or Abu Bakr and Muhammad with civil war generals), these relationship topics group similar pairs of entities well. A social network labeled with these relationships is shown in Figure 5.2.
As another example, relationship Topic 3 links civil war generals and
early Muslim leaders.
5.4.2 Evaluating the predictive distribution
The qualitative results of the previous section illustrate that Nubbi is an effective
model for exploring and understanding latent structure in data. In this section, we
provide a quantitative evaluation of the predictive mechanisms that Nubbi provides.
As with any probabilistic model, Nubbi defines a probability distribution over
unseen data. After fitting the latent variables of our model to data (as described
[Figure 5.5 content: six panels plotting Word Prediction Log Likelihood (top row) and Entity Prediction Log Likelihood (bottom row) against the number of topics (10–20) for the biological, bible, and wikipedia corpora; curves compare Nubbi, Author-Topic, LDA, Unigram, and Mutual.]
Figure 5.5: Predictive log likelihood as a function of the number of Nubbi topics on two tasks: entity prediction (given the context, predict what entities are being discussed) and relation prediction (given the entities, predict what words occur). Higher is better.
in Section 5.3.1), we take unseen pair contexts and ask how well the model predicts
those held-out words. Models that give higher probability to the held-out words better
capture how the two entities participating in that context interact. In a complementary
problem, we can ask the fitted model to predict entities given the words in the pair
context. (The details of these metrics are defined more precisely in Section 5.3.3.)
We compare Nubbi to three alternative approaches: a unigram model, LDA (Blei et al. 2003a), and the Author-Topic model (Rosen-Zvi et al. 2004). All of these approaches are models of language which treat individual entities and pairs of entities alike as bags of words. In the Author-Topic model, entities are associated with individual contexts and pair contexts, but there are no distinguished pair topics; all words are explained by the topics associated with individuals. The unigram and mutual information models serve as baselines: the unigram model is equivalent to using no relationship topics and one entity topic, while the mutual information model is equivalent to using one relationship topic and one entity topic.
We use the bootstrap method to create held-out data sets and compute predictive probability (Efron 1983). Figure 5.5 shows the average predictive log likelihood for the three approaches. The results for Nubbi are plotted as a function of the total number of topics $K = K_\theta + K_\psi$. The results for LDA and Author-Topic were also computed with $K$ topics. All models were trained with the same hyperparameters.
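A minimal sketch of such a bootstrap estimate, assuming the resampling unit is the individual held-out word (other units, such as whole contexts, work the same way; the function name and inputs are hypothetical):

```python
import numpy as np

def bootstrap_loglik(word_logliks, n_boot=100, seed=0):
    """Bootstrap estimate of average held-out predictive log likelihood:
    resample the held-out words with replacement and average each replicate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(word_logliks)
    reps = [x[rng.integers(0, len(x), len(x))].mean() for _ in range(n_boot)]
    return float(np.mean(reps)), float(np.std(reps))
```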
Nubbi outperforms both LDA and unigram on all corpora for all numbers of topics
K. For word prediction Nubbi performs comparably to Author-Topic on bible, worse
on biological, and better on wikipedia. We posit that because the wikipedia corpus
contains more tokens per entity and pair of entities, the Nubbi model is able to leverage
more data to make better word predictions. Conversely, for biological, individual
entities explain pair contexts better than relationship topics, giving the advantage
to Author-Topic. For wikipedia, this yields a 19% improvement in average word log
likelihood over the unigram model at K = 24.
In contrast, the LDA model is unable to make improved predictions over the
unigram model. There are two reasons for this. First, LDA cannot use information
about the participating entities to make predictions about the pair, because it treats
entity contexts and pair contexts as independent bags of words. Second, LDA does not
allocate topics to describe relationships alone, whereas Nubbi does learn topics which
express relationships. This allows Nubbi to make more accurate predictions about the
words used to describe relationships. When relationship words do find their way into
LDA topics, LDA’s performance improves, such as on the bible dataset. Here, LDA
obtains a 6% improvement over unigram while Nubbi achieves a 10% improvement.
With the exception of Author-Topic on biological, Nubbi outperforms the other
approaches on the entity prediction task. For example, on wikipedia, the Nubbi model
shows a 32% improvement over the unigram baseline, LDA shows a 7% improvement,
and Author-Topic actually performs worse than the unigram baseline. While LDA,
Author-Topic, and Nubbi improve monotonically with the number of topics on the
word task, they can peak and decrease for the entity prediction task. Recall that an
improved word likelihood need not imply an improved entity likelihood; if a model
assigns a higher word likelihood to other entity pairs in addition to the correct entity
pair, the predictive entity likelihood may still decrease. Thus, while each held-out
context is associated with a particular pair of entities, it does not follow that that
same context could not also be aptly associated with some other entity pair.
5.4.3 Application to New York Times
We can gain qualitative insight into the performance of the Nubbi model (and demon-
strate its scalability) by investigating its performance on a larger data set from the
New York Times. We treat each of the approximately 1 million articles in this corpus
as a document. We filter the corpus down to 2500 vocabulary terms and 944 entities.
We fit a Nubbi model using five entity topics and five relationship topics.4
Figure 5.6 shows a visualization of the results as a radial plot. Each entity appears
at an angle along the edge of the circle and lines are drawn between related entities.
The thickness of a line represents the strength of the relationship inferred by the model, while its color represents the relationship topic which appears most frequently in the description of the relationship between the two entities.
Because the data set is large, a high-level overview such as Figure 5.6 is difficult to
fully take in. Consequently we zoom in to a small portion of the graph in Figure 5.7
which also annotates some of the relationship topics with the word with highest
probability mass in that topic. This view reveals some of the structure of relationships
that Nubbi is able to uncover on this data set. One topic we have labeled “trial”
appears infrequently in this sector of the graph; the only entity connected by this
relationship is Nicole Brown-Simpson. Although not depicted in this zoomed-in graph,
4For the qualitative evaluation here we fix the number of topics. Refer to the previous section for more details on how performance varies with the number of topics.
Figure 5.6: A visualization of the results of applying the Nubbi model to the New York Times. Entities appear along the edge of the circle and lines connect related entities. The thickness of the lines represents the strength of the relationship while the colors represent the relationship topic which appears most frequently in the description of the relationship.
Figure 5.7: A zoomed view into a small portion of Figure 5.6. The colors (i.e., relationship topics) have been annotated with the most frequently occurring term in that topic. Nubbi is able to discover a way of partitioning relationships into topics and assigning these relationship topics to individual pairs of entities.
Figure 5.8: A screen shot of an Amazon Mechanical Turk task asking users to label the relationships between entities with textual descriptions. In this way we can get a large-scale ground truth for the relationships in the New York Times data set.
the other end of this relationship is O. J. Simpson.
Another two topics seem closely related; we have labeled them here as “match”
and “fight”. The latter is focused on (sporting) contests, such as those involving
George Foreman and Gary Kasparov. The former, however, seems to capture a more
general notion of contention, with Donald Trump strongly related to several people
according to this topic (and Rick Lazio to a lesser extent). The boxer George Foreman, interestingly, occupies both topics almost equally.
This sort of qualitative analysis suggests that Nubbi is able to capture aspects
of relationships. However, this kind of analysis is difficult to scale up to large data
sets such as this one. To aid in this, we perform a large-scale study using Amazon
Mechanical Turk (AMT)5. AMT is an online marketplace for tasks (known as HITs).
5https://www.mturk.com/mturk/welcome
A large pool of users selects and completes HITs for a small fee. In this way it is
possible to obtain a large number of human labelings of data sets.
We offered a series of tasks asking users to label relationships that appear in the
New York Times data set. We collected 600 labelings from 13 users. A screenshot of
our task is shown in Figure 5.8. In it we present each user with ten pairs of entities.
For each pair of entities we ask them to write a textual description of the relationship
between those entities (users may optionally check boxes indicating that they do not
know how they are related or that they are not related). To reduce noise, each pair of entities was presented to multiple users. After removing stop words and tokenizing, we are left with a bag of crowd-sourced labels for each of 200 relationships.
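As a concrete illustration of this post-processing step, the following Python sketch collapses raw (entity pair, description) responses into per-pair bags of labels; the stop-word list and input format here are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "is", "are", "they", "in", "to"}

def build_label_bags(responses):
    """Collapse crowd-sourced descriptions into a bag of labels per pair.

    `responses` is an iterable of (entity_pair, description) tuples;
    several users may describe the same pair.
    """
    bags = defaultdict(Counter)
    for pair, description in responses:
        for token in description.lower().split():
            token = token.strip(".,;:!?\"'")
            if token and token not in STOP_WORDS:
                bags[pair][token] += 1
    return bags

# For example, two labelings of the same pair:
bags = build_label_bags([
    (("ali", "foreman"), "olympic opponents"),
    (("ali", "foreman"), "they fought a famous boxing match"),
])
```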
We now measure how well our models can predict these bags of labels. We first train
both the Nubbi model and the Author-Topic model using the parameters mentioned
earlier in this section. As mentioned above, each of these trained models can then
predict words describing the relationship between two entities. For each word in our
test set, that is, the labels we obtained from users on AMT, we compute the rank of
that word in the list of predicted words. We emphasize that for this predictive task
the relationship ground truth was completely hidden from both models. The result of
this experiment is shown in Figure 5.9.
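The rank computation itself is straightforward; a sketch under the assumption that each model exposes a score for every vocabulary word given an entity pair:

```python
import numpy as np

def label_ranks(word_scores, labels, vocabulary):
    """Rank each ground-truth label within a model's predicted word list.

    `word_scores` gives a score for each word in `vocabulary` (higher means
    more probable); returns the 1-based rank of each label that appears in
    the vocabulary.
    """
    order = np.argsort(-np.asarray(word_scores))
    rank_of = {vocabulary[i]: r + 1 for r, i in enumerate(order)}
    return [rank_of[w] for w in labels if w in rank_of]
```

Each point in Figure 5.9 is then the pair of log ranks for one label instance, one coordinate per model.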
Each word in the figure represents an instance of a word in the test set. The
position of the word is determined by the predicted rank according to the Author-Topic
Model (x-axis) and the Nubbi model (y-axis). Lower is better along both axes. The
words below the diagonal are instances where Nubbi’s prediction was better than the
Author-Topic model’s, and vice versa for those above the diagonal. As with other
visualizations of this data set, because of the large scale it is difficult to tease out
individual differences. Therefore we create another version of this visualization by
removing those terms close to the diagonal, that is, the labels for which Nubbi and the
Author-Topic Model make similar predictions. This allows us to better understand
the differences between these two models.
Figure 5.9: Predicted rank of ground truth labels using the Author-Topic Model (x-axis) versus predicted rank using Nubbi (y-axis). Lower is better along both axes. Words below the diagonal are instances where Nubbi's prediction was better than the Author-Topic model's.
Figure 5.10: The visualization in Figure 5.9 with the terms closest to the diagonal removed. This emphasizes the differences between the Author-Topic Model and Nubbi, revealing that the predictions Nubbi makes are qualitatively different from those made by the Author-Topic Model.
This second visualization is given in Figure 5.10. This visualization reveals a
qualitative difference between the predictions Nubbi is able to make well (below
the dashed line) versus the predictions the Author-Topic Model is able to make
well (above the dashed line). In particular, the words below the dashed line are
generally “relationship words” such as “brother”, “father”, “husband”, “married”,
and “opponents”. In contrast, the words above the dashed line provide context, such
as “africa”, “baseball”, “russia”, or “olympic”.
The descriptions of relationships provided by users often contain both contextual
and relationship words, for example, “olympic opponents.” The Nubbi model better
predicts the relationship-specific words such as "opponent," opting instead to explain
words like “olympic” by the entity itself. This, in fact, reveals some structure about
gold-standard relationship descriptions. In contrast, the Author-Topic Model does not
make this distinction between relationship and context words. One avenue of future work would be to use this insight about how people characterize relationships to build models specifically designed to generate these sorts of descriptions.
5.5 Discussion and related work
We presented Nubbi, a novel machine learning approach for analyzing free text to
extract descriptions of relationships between entities. We applied Nubbi to several
corpora—the Bible, Wikipedia, scientific abstracts, and New York Times articles. We
showed that Nubbi provides a state-of-the-art predictive model of entities and relation-
ships and, moreover, is a useful exploratory tool for discovering and understanding
network data hidden in plain text.
Analyzing networks of entities has a substantial history (Wasserman and Pattison
1996); recent work has focused in particular on clustering and community struc-
ture (Anagnostopoulos et al. 2008; Cai et al. 2005; Gibson et al. 1998; McGovern et al.
2003; Newman 2006b), deriving models for social networks (Leskovec et al. 2008a,b;
Meeds et al. 2007; Taskar et al. 2003), and applying these analyses to predictive applications (Zhou et al. 2008). Latent variable approaches to modeling social networks
with associated text have also been explored (McCallum et al. 2005; Mei et al. 2008;
Nallapati et al. 2008; Wang et al. 2005). While the space of potential applications for
these models is rich, it is tempered by the need for observed network data as input.
Nubbi allows these techniques to augment their network data by leveraging the large
body of relationship information encoded in collections of free text.
Previous work in this vein has used either pattern-based approaches or co-occurrence
methods. The pattern-based approaches (Agichtein and Gravano 2003; Diehl et al. 2007; Mei et al. 2007; Sahay et al. 2008) and syntax-based approaches (Banko et al. 2007; Katrenko and Adriaans 2007) require patterns or parsers which are meticulously hand-crafted, often fragile, and typically need several examples of desired relationships, limiting the types of relationships that can be discovered. In contrast,
Nubbi makes minimal assumptions about the input text, and is thus practical for
languages and non-linguistic data where parsing is not available or applicable. Co-
occurrence methods (Culotta et al. 2005; Davidov et al. 2007) also make minimal
assumptions. However, because Nubbi draws on topic modeling (Blei et al. 2003a),
it is able to uncover hidden and semantically meaningful groupings of relationships.
Through the distinction between relationship topics and entity topics, it can better
model the language used to describe relationships.
Finally, while other models have also leveraged the machinery of LDA to understand
ensembles of entities and the words associated with them (Bhattacharya et al. 2008;
Newman et al. 2006a; Rosen-Zvi et al. 2004), these models only learn hidden topics for
individual entities. Nubbi models individual entities and pairs of entities distinctly. By
controlling for features of individual entities and explicitly modeling relationships, Nubbi yields
more powerful predictive models and can discover richer descriptions of relationships.
Chapter 6
Conclusion
In this thesis we have studied network data. These data may take the form of an online
social network, a social network of characters in a book, a network of public figures in news articles,
networks of webpages, networks of genes, etc. These data are already pervasive and
will only increase in ubiquity as more people join online services that connect them with other users, as biologists find ever more complicated interconnections between proteins and genes of interest, and as more literature and news becomes digitized and scrutinized. Thus, being able to learn from these data to gain insights and make
predictions is becoming ever more important. Predictive models can suggest new
friends for members of a social network or new citations for a paper, while descriptive
statistics can discover communities of friends or authors.
In this work we have introduced and explored several models of network data. The
first and simplest of these models correlations between links most directly. Here, the central
challenge is the speed at which observed data can be synthesized into a learned model.
We develop techniques that drastically speed up this process, making these models
more applicable to the large, real-world data that are becoming ubiquitous.
We then developed a model of network data that accounts for both links and
attributes. Given a corpus of documents with connections between them, the Relational
Topic Model can map those documents into a latent space leveraging the mechanisms
of topic modeling. With a trained model, we showed how one can predict links for a
node given only its attributes or attributes given only its links. Thus we can suggest
new citations for a document given only its content, or new interests for a user given
only their friends’. We apply this model to several data sets including local news,
twitter, and scientific abstracts and demonstrate the model’s ability to make state of
the art predictions and find interesting perspectives on the data.
Finally, we turned our attention to cases where our understanding of the links is
incomplete or missing altogether. In particular, we focused on the problem of inferring
whether or not a link exists between two nodes, and if so, giving a latent-space
characterization of that relationship. It is important to know, for example, how two
people know each other in a social network or how two genes interact in a biological
network; linkage is not simply binary. While some resources for annotating edges exist,
they are limited and not scalable to the large and varied networks we have today. We
developed the Nubbi model to infer edges and their characterizations using only free
text. We showed qualitatively and quantitatively that our model can construct and
annotate graphs of relationships and make useful predictions.
In sum, this thesis has contributed a set of probabilistic models, along with
attendant inferential and predictive tools that make it possible to better uncover,
understand, and predict links.
Appendix A
Derivation of RTM Coordinate Ascent Updates
Inference under the variational method amounts to finding values of the variational
parameters γ, Φ which optimize the evidence lower bound, L, given in Equation 4.6. To do so, we first expand the expectations in these terms,
\[
\begin{aligned}
\mathcal{L} ={}& \sum_{(d_1,d_2)} \mathcal{L}_{d_1,d_2}
+ \sum_d \sum_n \phi_{d,n}^T \log \beta_{\cdot,w_{d,n}}
+ \sum_d \sum_n \phi_{d,n}^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) \\
&+ \sum_d (\alpha - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
- \sum_d \sum_n \phi_{d,n}^T \log \phi_{d,n} \\
&- \sum_d (\gamma_d - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
+ \sum_d \left(\mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T\gamma_d)\right),
\end{aligned}
\tag{A.1}
\]
where $\mathcal{L}_{d_1,d_2}$ is defined as in Equation 4.7. Since $\mathcal{L}_{d_1,d_2}$ is independent of $\gamma$, we can collect all of the terms associated with $\gamma_d$ into
\[
\mathcal{L}_{\gamma_d} = \Big(\alpha + \sum_n \phi_{d,n} - \gamma_d\Big)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) + \mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T\gamma_d).
\]
Taking the derivatives and setting them equal to zero leads to the following optimality condition,
\[
\Big(\alpha + \sum_n \phi_{d,n} - \gamma_d\Big)^T \left(\Psi'(\gamma_d) - \mathbf{1}\Psi'(\mathbf{1}^T\gamma_d)\right) = 0,
\]
which is satisfied by the update
\[
\gamma_d \leftarrow \alpha + \sum_n \phi_{d,n}.
\tag{A.2}
\]
In order to derive the update for $\phi_{d,n}$ we also collect its associated terms,
\[
\mathcal{L}_{\phi_{d,n}} = \phi_{d,n}^T\left(-\log \phi_{d,n} + \log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) + \sum_{d' \neq d} \mathcal{L}_{d,d'}.
\]
Adding a Lagrange multiplier to ensure that $\phi_{d,n}$ normalizes and setting the derivative equal to zero leads to the following condition,
\[
\phi_{d,n} \propto \exp\left\{\log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d) + \nabla_{\phi_{d,n}}\mathcal{L}_{d,d'}\right\}.
\tag{A.3}
\]
The exact form of $\nabla_{\phi_{d,n}}\mathcal{L}_{d,d'}$ will depend on the link probability function chosen. If the expected log link probability depends only on $\pi_{d_1,d_2} = \phi_{d_1} \circ \phi_{d_2}$, the gradients are given by Equation 4.10. When $\psi_N$ is chosen as the link probability function, we expand the expectation,
\[
\begin{aligned}
\mathbb{E}_q\left[\log \psi_N(z_d, z_{d'})\right] &= -\eta^T\,\mathbb{E}_q\left[(z_d - z_{d'}) \circ (z_d - z_{d'})\right] - \nu \\
&= -\nu - \sum_i \eta_i\left(\mathbb{E}_q\!\left[z_{d,i}^2\right] + \mathbb{E}_q\!\left[z_{d',i}^2\right] - 2\phi_{d,i}\phi_{d',i}\right).
\end{aligned}
\tag{A.4}
\]
Because each word is independent under the variational distribution, $\mathbb{E}_q[z_{d,i}^2] = \mathrm{Var}(z_{d,i}) + \phi_{d,i}^2$, where $\mathrm{Var}(z_{d,i}) = \frac{1}{N_d^2}\sum_n \phi_{d,n,i}(1 - \phi_{d,n,i})$. The gradient of this expression is given by Equation 4.11.
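In code, the two coordinate updates are simple. The sketch below is an illustration, not the thesis implementation; the link-gradient term is passed in as a precomputed value since its exact form depends on the link function chosen in Equation 4.8.

```python
import numpy as np
from scipy.special import digamma

def update_gamma(alpha, phi_d):
    """Equation A.2: gamma_d <- alpha + sum_n phi_{d,n}."""
    return alpha + phi_d.sum(axis=0)          # phi_d has shape (N_d, K)

def update_phi_word(log_beta_w, gamma_d, link_grad):
    """Equation A.3: phi_{d,n} proportional to
    exp{log beta_{.,w} + Psi(gamma_d) - Psi(1^T gamma_d) + link gradient}."""
    log_phi = log_beta_w + digamma(gamma_d) - digamma(gamma_d.sum()) + link_grad
    log_phi -= log_phi.max()                   # stabilize before exponentiating
    phi = np.exp(log_phi)
    return phi / phi.sum()
```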
Appendix B
Derivation of RTM Parameter Estimates
In order to estimate the parameters of our model, we find values of the topic multinomial
parameters β and link probability parameters η, ν which maximize the variational
objective, $\mathcal{L}$, given in Equation 4.6.

To optimize $\beta$, it suffices to take the derivative of the expanded objective given in Equation A.1 along with a Lagrange multiplier to enforce normalization:
\[
\partial_{\beta_{k,w}} \mathcal{L} = \sum_d \sum_n \phi_{d,n,k}\, \mathbf{1}(w = w_{d,n}) \frac{1}{\beta_{k,w_{d,n}}} + \lambda_k.
\]
Setting this quantity equal to zero and solving yields the update given in Equation 4.12.
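Concretely, the resulting update just normalizes expected topic-word counts. A minimal sketch, assuming per-document variational matrices `phi[d]` of shape (N_d, K) and integer word ids:

```python
import numpy as np

def update_beta(phi, words, K, V):
    """beta_{k,w} proportional to sum_d sum_n phi_{d,n,k} 1(w = w_{d,n}),
    the solution obtained from the Lagrange-multiplier condition
    (cf. Equation 4.12)."""
    beta = np.zeros((K, V))
    for phi_d, w_d in zip(phi, words):
        for phi_dn, w in zip(phi_d, w_d):
            beta[:, w] += phi_dn               # accumulate expected counts
    return beta / beta.sum(axis=1, keepdims=True)
```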
By taking the gradient of Equation A.1 with respect to $\eta$ and $\nu$, we can also derive updates for the link probability parameters. When the expectation of the logarithm of the link probability function depends only on $\eta^T\pi_{d,d'} + \nu$, as with all the link functions given in Equation 4.8, these derivatives take a convenient form. For notational expedience, denote $\eta^+ = \langle \eta, \nu \rangle$ and $\pi^+_{d,d'} = \langle \pi_{d,d'}, 1 \rangle$. Then the derivatives can be written as
\[
\begin{aligned}
\nabla_{\eta^+} \mathcal{L}^\sigma_{d,d'} &\approx \left(1 - \sigma\!\left(\eta^{+T}\pi^+_{d,d'}\right)\right)\pi^+_{d,d'} \\
\nabla_{\eta^+} \mathcal{L}^\Phi_{d,d'} &\approx \frac{\Phi'\!\left(\eta^{+T}\pi^+_{d,d'}\right)}{\Phi\!\left(\eta^{+T}\pi^+_{d,d'}\right)}\,\pi^+_{d,d'} \\
\nabla_{\eta^+} \mathcal{L}^e_{d,d'} &= \pi^+_{d,d'}.
\end{aligned}
\tag{B.1}
\]
Note that all of these gradients are positive because we are faced with a one-class estimation problem. Unchecked, the parameter estimates will diverge. While a variety of techniques exist to address this problem, one set of strategies is to add regularization. A common regularization for regression problems is the $\ell_2$ regularizer. This penalizes the objective $\mathcal{L}$ with the term $\lambda\|\eta\|^2$, where $\lambda$ is a free parameter. This penalization has a Bayesian interpretation as a Gaussian prior on $\eta$.
In lieu of or in conjunction with $\ell_2$ regularization, one can also employ regularization which in effect injects some number of observations, $\rho$, for which the link variable $y = 0$. We associate with these observations a document similarity of $\pi_\alpha = \frac{\alpha}{\mathbf{1}^T\alpha} \circ \frac{\alpha}{\mathbf{1}^T\alpha}$, the expected Hadamard product of any two documents given the Dirichlet prior of the model. Because both $\psi_\sigma$ and $\psi_\Phi$ are symmetric, the gradients of these regularization terms can be written as
\[
\begin{aligned}
\nabla_{\eta^+} R_\sigma &= -\rho\,\sigma\!\left(\eta^{+T}\pi^+_\alpha\right)\pi^+_\alpha \\
\nabla_{\eta^+} R_\Phi &= -\rho\,\frac{\Phi'\!\left(-\eta^{+T}\pi^+_\alpha\right)}{\Phi\!\left(-\eta^{+T}\pi^+_\alpha\right)}\,\pi^+_\alpha.
\end{aligned}
\]
While this approach could also be applied to $\psi_e$, here we use a different approximation. We do this for two reasons. First, we cannot optimize the parameters of $\psi_e$ in an unconstrained fashion since this may lead to link functions which are not probabilities. Second, the approximation we propose will lead to explicit updates.

Because $\mathbb{E}_q[\log \psi_e(z_d \circ z_{d'})]$ is linear in $\pi_{d,d'}$ by Equation 4.8, this suggests a linear approximation of $\mathbb{E}_q[\log(1 - \psi_e(z_d \circ z_{d'}))]$. Namely, we let
\[
\mathbb{E}_q\left[\log(1 - \psi_e(z_d \circ z_{d'}))\right] \approx \eta'^T \pi_{d,d'} + \nu'.
\]
This leads to a penalty term of the form
\[
R_e = \rho\left(\eta'^T \pi_\alpha + \nu'\right).
\]
We fit the parameters of the approximation, $\eta'$ and $\nu'$, by making the approximation exact whenever $\pi_{d,d'} = \mathbf{0}$ or $\max \pi_{d,d'} = 1$. This yields the following $K + 1$ equations for the $K + 1$ parameters of the approximation:
\[
\begin{aligned}
\nu' &= \log(1 - \exp(\nu)) \\
\eta'_i &= \log(1 - \exp(\eta_i + \nu)) - \nu'.
\end{aligned}
\]
Combining the gradient of the likelihood of the observations given in Equation B.1 with the gradient of the penalty $R_e$ and solving leads to the following updates:
\[
\begin{aligned}
\nu &\leftarrow \log\left(M - \mathbf{1}^T\Pi\right) - \log\left(\rho(1 - \mathbf{1}^T\pi_\alpha) + M - \mathbf{1}^T\Pi\right) \\
\eta &\leftarrow \log\left(\Pi\right) - \log\left(\Pi + \rho\,\pi_\alpha\right) - \mathbf{1}\nu,
\end{aligned}
\]
where $M = \sum_{(d_1,d_2)} 1$ and $\Pi = \sum_{(d_1,d_2)} \pi_{d_1,d_2}$. Note that because of the constraints on our approximation, these updates are guaranteed to yield parameters for which $0 \le \psi_e \le 1$.
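These updates translate directly into code. A sketch, where `Pi` holds the summed similarities $\Pi$ and `pi_alpha` the prior similarity $\pi_\alpha$ (both length-$K$ vectors); the names are illustrative:

```python
import numpy as np

def update_psi_e(Pi, M, pi_alpha, rho):
    """Regularized updates for the parameters of the link function psi_e.

    Pi:       sum of pi_{d,d'} over the M observed links (length K)
    pi_alpha: expected Hadamard product under the Dirichlet prior
    rho:      number of injected y = 0 regularization observations
    """
    residual = M - Pi.sum()
    nu = np.log(residual) - np.log(rho * (1.0 - pi_alpha.sum()) + residual)
    eta = np.log(Pi) - np.log(Pi + rho * pi_alpha) - nu
    return eta, nu
```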
Finally, in order to fit parameters for $\psi_N$, we begin by assuming the variance terms of Equation A.4 are small. Equation A.4 can then be written as
\[
\mathbb{E}_q\left[\log \psi_N(z_d, z_{d'})\right] = -\nu - \eta^T(\phi_d - \phi_{d'}) \circ (\phi_d - \phi_{d'}),
\]
which is the log likelihood of a Gaussian distribution in which $\phi_d - \phi_{d'}$ is random with mean $0$ and diagonal variance $\frac{1}{2\eta}$. This suggests fitting $\eta$ using the empirically observed variance:
\[
\eta \leftarrow \frac{M}{2\sum_{d,d'}(\phi_d - \phi_{d'}) \circ (\phi_d - \phi_{d'})}.
\]
$\nu$ acts as a scaling factor for the Gaussian distribution; here we want only to ensure that the total probability mass respects the frequency of observed links relative to the regularization "observations." Equating the normalization constant of the distribution with the desired probability mass yields the update
\[
\nu \leftarrow \log\tfrac{1}{2}\pi^{K/2} + \log(\rho + M) - \log M - \tfrac{1}{2}\mathbf{1}^T\log\eta,
\]
guarding against values of $\nu$ which would make $\psi_N$ inadmissible as a probability.
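The moment-matching updates for $\psi_N$ are similarly direct. A sketch transcribing the two updates above; the array layout (`phi_bar` as a matrix of per-document mean vectors indexed by document id) is an assumption of the example:

```python
import numpy as np

def update_psi_N(phi_bar, pairs, rho):
    """Fit eta from the empirical variance of phi_d - phi_d' over the
    observed pairs, then set nu to respect the observed link frequency."""
    M = len(pairs)
    diffs = np.stack([phi_bar[d] - phi_bar[e] for d, e in pairs])
    eta = M / (2.0 * (diffs * diffs).sum(axis=0))   # elementwise Hadamard square
    K = eta.shape[0]
    nu = (np.log(0.5 * np.pi ** (K / 2.0))
          + np.log(rho + M) - np.log(M)
          - 0.5 * np.log(eta).sum())
    return eta, nu
```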
Appendix C
Derivation of NUBBI Coordinate-Ascent Updates
For convenience, we break up the terms of the objective function into two classes — those that concern each pair of entities, $\mathcal{L}_{e,e'}$, and those that concern individual entities, $\mathcal{L}_e$. Equation 5.1 can then be rewritten as
\[
\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e.
\]
We first expand $\mathcal{L}_e$ as
\[
\begin{aligned}
\mathcal{L}_e ={}& \sum_n \phi^{\theta\,T}_{e,n} \log \beta^\theta_{w_n}
+ \sum_n \phi^{\theta\,T}_{e,n}\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right)
+ (\alpha^\theta - 1)^T\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right) \\
&- \sum_n \phi^{\theta\,T}_{e,n} \log \phi^\theta_{e,n}
+ \mathbf{1}^T \log \Gamma(\gamma^\theta_e) - \log \Gamma(\mathbf{1}^T\gamma^\theta_e)
- (\gamma^\theta_e - 1)^T\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right).
\end{aligned}
\]
Next we expand $\mathcal{L}_{e,e'}$. In order to do so, we first define
\[
\xi_{e,e',n} \circ \phi^\psi_{e,e',n} = \left\langle \xi_{e,e',n,1}\phi^\psi_{e,e',n,1},\ \xi_{e,e',n,2}\phi^\psi_{e,e',n,2},\ \xi_{e,e',n,3}\phi^\psi_{e,e',n,3} \right\rangle.
\]
Note that $\xi_{e,e',n} \circ \phi^\psi_{e,e',n}$ defines a multinomial parameter vector of length $3 \times K$ representing the multinomial probabilities for each $z_{e,e',n}, c_{e,e',n}$ assignment. In particular, $q(z_{e,e',n} = z^*, c_{e,e',n} = c^*) = \xi_{e,e',n,c^*}\,\phi^\psi_{e,e',n,c^*,z^*}$. Thus,
\[
\begin{aligned}
\mathcal{L}_{e,e'} ={}& \sum_n \left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right)^T \log\left\langle \beta^\theta_{w_n}, \beta^\theta_{w_n}, \beta^\psi_{w_n} \right\rangle
+ \sum_n \xi_{e,e',n,1}\,\phi^{\psi\,T}_{e,e',n,1}\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right) \\
&+ \sum_n \xi_{e,e',n,2}\,\phi^{\psi\,T}_{e,e',n,2}\left(\Psi(\gamma^\theta_{e'}) - \Psi(\mathbf{1}^T\gamma^\theta_{e'})\right)
+ \sum_n \xi_{e,e',n,3}\,\phi^{\psi\,T}_{e,e',n,3}\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right) \\
&+ \sum_n \xi^T_{e,e',n}\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right)
- \sum_n \left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right)^T \log\left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right) \\
&+ \mathbf{1}^T \log \Gamma(\gamma^\psi_{e,e'}) - \log \Gamma(\mathbf{1}^T\gamma^\psi_{e,e'})
+ (\alpha^\psi - 1)^T\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right)
- (\gamma^\psi_{e,e'} - 1)^T\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right) \\
&+ \mathbf{1}^T \log \Gamma(\gamma^\pi_{e,e'}) - \log \Gamma(\mathbf{1}^T\gamma^\pi_{e,e'})
+ (\alpha^\pi - 1)^T\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right)
- (\gamma^\pi_{e,e'} - 1)^T\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right).
\end{aligned}
\]
Since $\phi^\theta_{e,n}$ only appears in $\mathcal{L}_e$, we can optimize this parameter by taking the
gradient,
$$\nabla_{\phi^\theta_{e,n}}\mathcal{L}_e = \log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\theta_{e,n} - 1.$$
Setting this equal to zero yields the update equation for $\phi^\theta_{e,n}$ in Equation 5.2. To
optimize $\phi^\psi_{e,e',n,1}$, it suffices to take the gradient of $\mathcal{L}_{e,e'}$,
$$\nabla_{\phi^\psi_{e,e',n,1}}\mathcal{L}_{e,e'} = \xi_{e,e',n,1}\left(\log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\psi_{e,e',n,1} - 1\right).$$
Setting this equal to zero yields the update in Equation 5.5. The updates for $\phi^\psi_{e,e',n,2}$
and $\phi^\psi_{e,e',n,3}$ are derived in exactly the same fashion.
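Concretely, the zero-gradient condition is solved by exponentiating and normalizing. The sketch below (with hypothetical names, and `scipy` supplying the digamma function $\Psi$) computes the update of Equation 5.2.

```python
import numpy as np
from scipy.special import digamma

def update_phi_theta(log_beta_wn, gamma_e):
    """Solve the zero-gradient condition above for phi^theta_{e,n}
    (Equation 5.2): exponentiate log beta_{w_n} plus the expected log
    theta_e under q, then normalize.  A sketch with hypothetical names;
    log_beta_wn and gamma_e are length-K vectors."""
    log_phi = log_beta_wn + digamma(gamma_e) - digamma(gamma_e.sum())
    log_phi -= log_phi.max()        # subtract the max for numerical stability
    phi = np.exp(log_phi)
    return phi / phi.sum()
```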
Similarly, to derive the update for $\xi_{e,e',n,1}$, we take the partial derivative of $\mathcal{L}_{e,e'}$,
$$\frac{\partial\mathcal{L}_{e,e'}}{\partial\xi_{e,e',n,1}} = \Psi(\gamma^\pi_{e,e',1}) - \Psi(1^T\gamma^\pi_{e,e'}) - \log\xi_{e,e',n,1} - 1 + \phi^{\psi\,T}_{e,e',n,1}\left(\log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\psi_{e,e',n,1}\right).$$
Replacing $\log\phi^\psi_{e,e',n,1}$ with the update equation given above, this expression reduces to
$$\frac{\partial\mathcal{L}_{e,e'}}{\partial\xi_{e,e',n,1}} = \Psi(\gamma^\pi_{e,e',1}) - \Psi(1^T\gamma^\pi_{e,e'}) - \log\xi_{e,e',n,1} - 1 + \lambda_{e,e',n,1}.$$
Consequently the update for $\xi_{e,e',n}$ is Equation 5.6. In order to update $\gamma^\pi_{e,e'}$ we collect
the terms which contain this parameter,
$$\left(\alpha_\pi + \sum_n \xi_{e,e',n} - \gamma^\pi_{e,e'}\right)\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(1^T\gamma^\pi_{e,e'})\right) - \left(\log\Gamma(\gamma^\pi_{e,e'}) - \log\Gamma(1^T\gamma^\pi_{e,e'})\right).$$
The optimum for these terms is obtained when the condition in Equation 5.7 is
satisfied. See Blei et al. (2003a) for details on this solution. Collecting terms associated
with $\gamma^\psi_{e,e'}$ similarly leads to the update given in Equation 5.8.
We also collect terms to yield updates for $\gamma^\theta_e$. The terms associated with this
variational parameter (and this variational parameter alone) span both $\mathcal{L}_{e,e'}$ and $\mathcal{L}_e$;
it is via this parameter's updates in Equation 5.10 that evidence associated with
individual entities is combined with evidence associated with entity pairs, as sketched below.
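A minimal sketch of this combination, with hypothetical names (cf. Equations 5.7 and 5.10); the analogous update for $\gamma^\pi_{e,e'}$ is simply the prior plus $\sum_n \xi_{e,e',n}$.

```python
import numpy as np

def update_gamma_theta(alpha_theta, phis_entity, pair_terms):
    """Variational Dirichlet update for gamma^theta_e: the prior plus
    expected topic counts from the entity's own document and from every
    pair document that mentions e.  A sketch with hypothetical names.

    phis_entity : (N_e, K) array of phi^theta_{e,n} rows.
    pair_terms  : list of (N, K) arrays; each row holds
                  xi_{e,e',n,c} * phi^psi_{e,e',n,c} for the source c
                  (1 or 2) that points at entity e.
    """
    gamma = alpha_theta + phis_entity.sum(axis=0)
    for term in pair_terms:
        gamma += term.sum(axis=0)   # evidence contributed by pair documents
    return gamma
```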
To find MAP estimates for $\beta^\psi$ and $\beta^\theta$, note that both variables are multinomial
and hence in the exponential family with topic-word assignment counts as sufficient
statistics. Because the conjugate prior on these parameters is Dirichlet, the posterior
is also a Dirichlet distribution whose sufficient statistics are defined by the observations
plus the prior hyperparameter. These posterior sufficient statistics are precisely the
right-hand sides of Equation 5.14. The MAP value of the parameters is achieved
when the expected sufficient statistics equal the observed sufficient statistics, giving
the updates in Equation 5.14.
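In code, this update is a normalization of smoothed expected counts; a sketch under the convention of Equation 5.14 (names are ours):

```python
import numpy as np

def map_beta(expected_counts, prior):
    """MAP-style update for a topic-word multinomial with a Dirichlet
    prior, following the update described above: normalize the expected
    topic-word counts plus the prior pseudo-counts.  A sketch;
    expected_counts is (K, W) and prior is a scalar or length-W
    hyperparameter."""
    unnorm = expected_counts + prior
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```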
Appendix D
Derivation of Gibbs sampling
equations
In this section we derive collapsed Gibbs sampling equations for the models presented
in this thesis. Collapsed Gibbs sampling is an alternative to the variational approach
— instead of approximating the posterior distribution by optimizing a variational
lower bound, collapsed Gibbs sampling directly collects samples from the posterior
distribution. In order to sample from the posterior, it suffices to compute the posterior
(up to a constant) for a single assignment conditioned on all other assignments,
$$p(z_{d,n} \mid \mathbf{z}_{-(d,n)}, \alpha, \eta, \mathbf{w}), \tag{D.1}$$
where $\mathbf{z}_{-(d,n)}$ denotes the set of topic assignments to all words in all documents
excluding $z_{d,n}$. For a review of Gibbs sampling and why this is the case, see Neal (1993).
In contrast to variational inference, the equations we derive here are collapsed, that
is, they integrate out variables such as the per-topic distribution over words, βk, and
the per-document distribution over topics, θd. What remain are the topic assignments
for each word, zd,n.
D.1 Latent Dirichlet allocation (LDA)
First we compute the prior distribution over topic assignments:
$$\int p(z_d \mid \theta_d)\,dp(\theta_d \mid \alpha)
= \int \prod_i \theta_{d,z_i}\,\frac{1}{B(\alpha)} \prod_k \theta_{d,k}^{\alpha_k - 1}\,d\theta_d
= \frac{1}{B(\alpha)} \int \prod_k \theta_{d,k}^{n_{d,k} + \alpha_k - 1}\,d\theta_d
= \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}, \tag{D.2}$$
where $B(\alpha) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\sum_k \alpha_k)}$ is a normalizing constant, $n_{d,k} = \sum_i \mathbb{1}(z_{d,i} = k)$ counts the
number of words in document $d$ assigned to topic $k$, and $n_{d,\cdot} = \langle n_{d,1}, n_{d,2}, \ldots, n_{d,K}\rangle$ is
the vector of these counts.
We then compute the likelihood of the word observations given a set of topic
assignments,
$$\int \prod_d p(w_d \mid z_d, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta)
= \int \left(\prod_d \prod_i \beta_{w_{d,i},\,z_{d,i}}\right)\left(\prod_k \frac{1}{B(\eta)} \prod_w \beta_{w,k}^{\eta_w - 1}\right) d\boldsymbol\beta
= \prod_k \frac{1}{B(\eta)} \int \prod_w \beta_{w,k}^{\eta_w + n_{w,k} - 1}\,d\beta_k
= \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}, \tag{D.3}$$
where $n_{w,k} = \sum_d \sum_i \mathbb{1}(z_{d,i} = k \wedge w_{d,i} = w)$ counts the number of assignments of word
$w$ to topic $k$ across all documents and $n_{\cdot,k} = \langle n_{1,k}, n_{2,k}, \ldots, n_{W,k}\rangle$ is the vector of these
counts.
Combining Equation D.2 and Equation D.3, the posterior probability of a set of
topic assignments can be written as
$$\begin{aligned}
p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}) &\propto p(\mathbf{w} \mid \mathbf{z}, \eta)\,p(\mathbf{z} \mid \alpha)\\
&= \int p(\mathbf{w} \mid \mathbf{z}, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta) \int p(\mathbf{z} \mid \boldsymbol\theta)\,dp(\boldsymbol\theta \mid \alpha)\\
&= \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}.
\end{aligned}\tag{D.4}$$
Conditioning on all other assignments, the posterior probability of a single assignment
is then
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{z}_{-(d,n)})
&\propto \prod_{d'} \frac{B(n_{d',\cdot} + \alpha)}{B(\alpha)} \prod_{k'} \frac{B(\eta + n_{\cdot,k'})}{B(\eta)}\\
&\propto B(n_{d,\cdot} + \alpha) \prod_{k'} B(\eta + n_{\cdot,k'})\\
&= \frac{\prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})}{\Gamma(\sum_{k'} n_{d,k'} + \alpha_{k'})} \prod_{k'} \frac{\prod_w \Gamma(\eta_w + n_{w,k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\\
&\propto \frac{1}{\Gamma(\sum_{k'} n_{d,k'} + \alpha_{k'})} \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right]\\
&= \frac{1}{\Gamma(N_d + \sum_{k'} \alpha_{k'})} \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right]\\
&\propto \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right],
\end{aligned}\tag{D.5}$$
where $N_d = \sum_{k'} n_{d,k'}$ denotes the number of words in document $d$. The second line
follows because terms which are independent of the topic assignment $z_{d,n}$ are constants,
the third line follows by the definition of $B$, and the fourth line follows because the
posterior cannot depend on counts over words other than $w_{d,n}$.
Finally, we make use of the identity
$$\frac{\Gamma(x + b)}{\Gamma(x)} = \begin{cases} x & \text{if } b = 1\\ 1 & \text{if } b = 0 \end{cases} \tag{D.6}$$
$$= x^b, \qquad b \in \{0, 1\}, \tag{D.7}$$
which implies that
$$\begin{aligned}
\Gamma(n_{d,k'} + \alpha_{k'}) &= \Gamma(n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'} + \mathbb{1}(k = k'))\\
&= \Gamma(n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'}) \cdot (n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'})^{\mathbb{1}(k = k')}\\
&= \Gamma(n^{\neg d,n}_{d,k'} + \alpha_{k'}) \cdot (n^{\neg d,n}_{d,k'} + \alpha_{k'})^{\mathbb{1}(k = k')},
\end{aligned}\tag{D.8}$$
where $n^{\neg d,n}_{d,k'} = \sum_{i \ne n} \mathbb{1}(z_{d,i} = k')$ denotes the number of words assigned to topic $k'$ in
document $d$ excluding the current assignment, $z_{d,n}$. Because $n^{\neg d,n}_{d,k'}$ does not depend on
the current assignment, $z_{d,n}$, it is a constant in the posterior computation; Equation D.8
then becomes
$$\Gamma(n_{d,k'} + \alpha_{k'}) \propto (n^{\neg d,n}_{d,k'} + \alpha_{k'})^{\mathbb{1}(k = k')}. \tag{D.9}$$
Applying the same identity to the other instances of the gamma function in
Equation D.5 gives
$$\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'}) \propto (n^{\neg d,n}_{w_{d,n},k'} + \eta_{w_{d,n}})^{\mathbb{1}(k = k')} \tag{D.10}$$
$$\Gamma\Big(\sum_w \eta_w + n_{w,k'}\Big) \propto \Big(\sum_w \eta_w + n^{\neg d,n}_{w,k'}\Big)^{\mathbb{1}(k = k')}, \tag{D.11}$$
where the exclusionary sum is similarly defined as $n^{\neg d,n}_{w,k'} = \sum_{d'} \sum_i \mathbb{1}(z_{d',i} = k' \wedge w_{d',i} = w \wedge (d, n) \ne (d', i))$. Combining these identities with Equation D.5 yields
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{z}_{-(d,n)})
&\propto \prod_{k'} \left((n^{\neg d,n}_{d,k'} + \alpha_{k'})\,\frac{n^{\neg d,n}_{w_{d,n},k'} + \eta_{w_{d,n}}}{\sum_w \eta_w + n^{\neg d,n}_{w,k'}}\right)^{\mathbb{1}(k = k')}\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{\sum_w \eta_w + n^{\neg d,n}_{w,k}}\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta},
\end{aligned}\tag{D.12}$$
where for convenience we denote by $N^{\neg d,n}_k$ the total number of words assigned to topic $k$
excluding the current assignment $z_{d,n}$, and the last line assumes a symmetric hyperparameter $\eta$
(so that $\sum_w \eta_w = W\eta$).
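To make the bookkeeping concrete, the following sketch implements one sweep of this collapsed Gibbs sampler, assuming symmetric scalar hyperparameters; the function and variable names (`gibbs_sweep_lda`, `n_dk`, `n_wk`, `n_k`) are ours, not part of the derivation. Each word's topic is resampled from Equation D.12 after its own count is removed, which is exactly the $\neg(d, n)$ bookkeeping above.

```python
import numpy as np

def gibbs_sweep_lda(docs, z, n_dk, n_wk, n_k, alpha, eta):
    """One sweep of collapsed Gibbs sampling for LDA via Equation D.12.
    A sketch with hypothetical names: docs[d] is a list of word ids,
    z[d] the current topic assignments, and n_dk (D x K), n_wk (W x K),
    n_k (K,) the count tables defined above; alpha and eta are symmetric
    scalar hyperparameters."""
    W = n_wk.shape[0]
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k_old = z[d][n]
            # remove the current assignment to form the "not (d, n)" counts
            n_dk[d, k_old] -= 1
            n_wk[w, k_old] -= 1
            n_k[k_old] -= 1
            # Equation D.12: document factor times word-likelihood factor
            p = (n_dk[d] + alpha) * (n_wk[w] + eta) / (n_k + W * eta)
            k_new = np.random.choice(p.size, p=p / p.sum())
            # add the freshly sampled assignment back into the counts
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_wk[w, k_new] += 1
            n_k[k_new] += 1
    return z
```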
D.2 Mixed-membership stochastic blockmodel (MMSB)
Because the observations in the mixed-membership stochastic blockmodel (MMSB)
depend on pairs of topic assignments, the collapsed Gibbs sampling equations also
depend on the pairwise posterior,
$$p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, \mathbf{y}, \mathbf{z}_{-(d,d')}), \tag{D.13}$$
where $\mathbf{z}_{-(d,d')}$ denotes the set of topic assignments without the two assignments
associated with the link between $d$ and $d'$, $z_{d,d',1}$ and $z_{d,d',2}$.
To compute this, as before we first compute the likelihood of the observations,
$$\begin{aligned}
\int \prod_{d,d'} p(y_{d,d'} \mid z_{d,d',1}, z_{d,d',2}, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta)
&= \int \prod_{d,d'} \beta_{z_{d,d',1},\,z_{d,d',2}}^{y_{d,d'}} (1 - \beta_{z_{d,d',1},\,z_{d,d',2}})^{1 - y_{d,d'}} \prod_{k,k'} \frac{1}{B(\eta)}\,\beta_{k,k'}^{\eta_1 - 1} (1 - \beta_{k,k'})^{\eta_0 - 1}\,d\boldsymbol\beta\\
&= \prod_{k,k'} \int \frac{1}{B(\eta)}\,\beta_{k,k'}^{n_{k,k',1} + \eta_1 - 1} (1 - \beta_{k,k'})^{n_{k,k',0} + \eta_0 - 1}\,d\beta_{k,k'}\\
&= \prod_{k,k'} \frac{B(\eta + n_{k,k'})}{B(\eta)},
\end{aligned}\tag{D.14}$$
where $n_{k,k',i} = \sum_{d,d'} \mathbb{1}(z_{d,d',1} = k \wedge z_{d,d',2} = k' \wedge y_{d,d'} = i)$ counts the number of links
of value $i$ for which the first node is assigned a topic of $k$ for that link and the second
node is assigned a topic of $k'$ for that link, and $n_{k,k'} = \langle n_{k,k',1}, n_{k,k',0}\rangle$ denotes the vector
of these counts. Because the prior of the MMSB is the same as that of LDA, we can
express the posterior (the analogue of Equation D.5) as
$$\begin{aligned}
p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, \mathbf{y}, \mathbf{z}_{-(d,d')})
&\propto \prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})\,\Gamma(n_{d',k'} + \alpha_{k'}) \prod_{k,k'} \frac{B(\eta + n_{k,k'})}{B(\eta)}\\
&\propto \prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})\,\Gamma(n_{d',k'} + \alpha_{k'}) \prod_{k,k'} \frac{\Gamma(\eta_{y_{d,d'}} + n_{k,k',y_{d,d'}})}{\Gamma(\sum_i \eta_i + n_{k,k',i})}\\
&\propto (n^{\neg d,d'}_{d,k_1} + \alpha_{k_1})(n^{\neg d,d'}_{d',k_2} + \alpha_{k_2})\,\frac{n^{\neg d,d'}_{k_1,k_2,y_{d,d'}} + \eta_{y_{d,d'}}}{H + N^{\neg d,d'}_{k_1,k_2}},
\end{aligned}$$
where $H = \eta_0 + \eta_1$ and where for convenience we denote by $N^{\neg d,d'}_{k,k'}$ the total number of
links with $(k, k')$ as the participating topics, excluding the current link $(z_{d,d',1}, z_{d,d',2})$.
The first line follows by expanding the prior terms as in the derivation of Equation D.5.
The second line follows by expanding $B$ and eliminating terms which are constant, and
the last line follows using the identities used to derive Equation D.12.
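Because the pair's two assignments are sampled jointly, the posterior is a distribution over $K^2$ outcomes. A sketch of the corresponding sampling step, with hypothetical names, follows.

```python
import numpy as np

def sample_mmsb_pair(d, dp, y, n_dk, n_link, alpha, eta):
    """Jointly sample (z_{d,d',1}, z_{d,d',2}) from the pairwise posterior
    above.  A sketch with hypothetical names: n_dk holds per-node topic
    counts and n_link is a (K, K, 2) table of per-topic-pair link counts,
    both with the current pair's assignments already removed; eta is the
    pair (eta_0, eta_1) and alpha a symmetric hyperparameter."""
    K = n_dk.shape[1]
    H = eta[0] + eta[1]
    N_pair = n_link.sum(axis=2)                 # total links per topic pair
    p = (np.outer(n_dk[d] + alpha, n_dk[dp] + alpha)
         * (n_link[:, :, y] + eta[y]) / (H + N_pair))
    p = p.ravel() / p.sum()
    idx = np.random.choice(K * K, p=p)
    return idx // K, idx % K                    # (k1, k2)
```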
D.3 Relational topic model (RTM)
The sampling equations for the relational topic model (RTM) are similar in spirit to
the LDA sampling equations. For brevity, we restrict the derivation to the exponential
response,¹
$$p(y_{d,d'} = 1 \mid \bar z_d, \bar z_{d'}, b) \propto \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right). \tag{D.15}$$

¹Here we depart from the notation used in previous chapters: we use $b$ for the regression coefficients instead of $\eta$, and we omit the regression intercept $\nu$, absorbing it into the normalization constant.
As with the MMSB, the prior distribution on $\mathbf{z}$ is identical to that of LDA, so we
omit its re-derivation. Thus the joint posterior can be written as
$$p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b) \propto \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)} \prod_{d,d'} \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right), \tag{D.16}$$
where the latter product is understood to range over $d, d'$ such that $y_{d,d'} = 1$. The
posterior, following the derivation of Equation D.12, is
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, \mathbf{z}_{-(d,n)})
&\propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta} \prod_{d',d''} \exp\left(b^T(\bar z_{d'} \circ \bar z_{d''})\right)\\
&\propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta} \prod_{d'} \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right)\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left((b \circ \bar z_d)^T \sum_{d'} \bar z_{d'}\right).
\end{aligned}\tag{D.17}$$
Notice that
$$\bar z_{d,k'} = \frac{1}{N_d}\sum_{n'} \mathbb{1}(z_{d,n'} = k') = \frac{1}{N_d}\sum_{n' \ne n} \mathbb{1}(z_{d,n'} = k') + \frac{1}{N_d}\mathbb{1}(z_{d,n} = k') = \bar z^{\neg n}_{d,k'} + \frac{1}{N_d}\mathbb{1}(z_{d,n} = k'),$$
where $\bar z^{\neg n}_{d,k'}$ is the mean topic assignment to topic $k'$ in document $d$ excluding that of
the $n$th word, $z_{d,n}$. Because $\bar z^{\neg n}_{d,k'}$ does not depend on the topic assignment $z_{d,n}$, the
last term of Equation D.17 can be efficiently computed as
$$\begin{aligned}
\exp\left((b \circ \bar z_d)^T \sum_{d'} \bar z_{d'}\right)
&= \exp\left(\sum_{k'} \left[b_{k'}\,\bar z_{d,k'} \sum_{d'} \bar z_{d',k'}\right]\right)\\
&= \exp\left(\sum_{k'} \left[b_{k'}\left(\bar z^{\neg n}_{d,k'} + \frac{1}{N_d}\mathbb{1}(k = k')\right) \sum_{d'} \bar z_{d',k'}\right]\right)\\
&= \exp\left(\sum_{k'} \left[b_{k'}\,\bar z^{\neg n}_{d,k'} \sum_{d'} \bar z_{d',k'}\right] + \frac{b_k}{N_d} \sum_{d'} \bar z_{d',k}\right)\\
&\propto \exp\left(\frac{b_k}{N_d} \sum_{d'} \bar z_{d',k}\right) = \exp\left(\frac{b_k}{N_d} \sum_{d'} \frac{n_{d',k}}{N_{d'}}\right),
\end{aligned}\tag{D.18}$$
where the second line follows using our identity on $\bar z_{d,k'}$ and the last proportionality
follows from the fact that the terms in the left sum do not depend on the current
topic assignment, $z_{d,n}$. Finally, the last equality stems from the definitions of $\bar z_{d',k}$ and
$n_{d',k}$. This expression is efficient because it is constant for all words in a document
and thus need only be computed once per document. Combining Equation D.18 with
Equation D.17 yields
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left(\frac{b_k}{N_d} \sum_{d'} \frac{n_{d',k}}{N_{d'}}\right). \tag{D.19}$$
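A minimal sketch of the extra RTM factor, with hypothetical names; the link sum is exactly the per-document quantity that can be cached once and reused for every word in the document.

```python
import numpy as np

def rtm_topic_weights(lda_part, b, link_sum, N_d):
    """Unnormalized weights over k for z_{d,n} under Equation D.19 (a
    sketch with hypothetical names).  lda_part is the usual LDA factor
    from Equation D.12; link_sum = sum_{d'} n_{d',k} / N_{d'} over the
    documents d' linked to d, which is constant for every word in d and
    so need only be computed once per document."""
    return lda_part * np.exp(b * link_sum / N_d)
```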
D.4 Supervised latent Dirichlet allocation (sLDA)
We derive the sampling equations for supervised latent Dirichlet allocation. Here, we
consider Gaussian errors, but the derivation can be easily extended to other models
as well:
$$\begin{aligned}
p(y_d \mid \bar z_d, b, a) &\propto \exp\left(-(y_d - b^T\bar z_d - a)^2\right)\\
&\propto \exp\left(-(y_d - a)^2 + 2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right)\\
&\propto \exp\left(2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right),
\end{aligned}\tag{D.20}$$
where the proportionality is with respect to $\bar z_d$.
The prior distribution on $\mathbf{z}$ is identical to that of LDA. Thus the joint posterior
can be written as
$$p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a) \propto \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)} \prod_d \exp\left(-(y_d - b^T\bar z_d - a)^2\right). \tag{D.21}$$
The sampling equation, following the derivation of Equation D.12, is
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left(-(y_d - b^T\bar z_d - a)^2\right). \tag{D.22}$$
The right-most term can be expanded as
$$\begin{aligned}
\exp\left(2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right)
&= \exp\Big(2\sum_{k'} b_{k'}\,\bar z^{\neg n}_{d,k'}(y_d - a) + 2\,\frac{y_d - a}{N_d}\,b_k - (b^T\bar z_d)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - (b^T\bar z_d)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - \Big(\sum_{k'} b_{k'}\,\bar z^{\neg n}_{d,k'} + \frac{b_k}{N_d}\Big)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - 2\,\frac{b_k}{N_d}\,b^T\bar z^{\neg n}_d - \Big(\frac{b_k}{N_d}\Big)^2\Big)\\
&= \exp\Big(2\,\frac{b_k}{N_d}\big(y_d - a - b^T\bar z^{\neg n}_d\big) - \Big(\frac{b_k}{N_d}\Big)^2\Big),
\end{aligned}\tag{D.23}$$
yielding
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\Big(2\,\frac{b_k}{N_d}\big(y_d - a - b^T\bar z^{\neg n}_d\big) - \Big(\frac{b_k}{N_d}\Big)^2\Big). \tag{D.24}$$
D.5 Networks uncovered by Bayesian inference (NUBBI)
model
The networks uncovered by Bayesian inference (NUBBI) model is a switching model,
wherein each word can be explained by one of three distributions — the distribution of
the first entity θe, the distribution of the second entity θe′ , or the distribution over their
relationships ψe,e′ . Each of these generates topic assignments with the same structure
as LDA, so their contributions to the posterior are the same, conditioned on the
assignments from words to distributions, which also follows a Dirichlet-Multinomial
distribution.
Hence, the joint posterior over topic assignments and source assignments is
$$p(\mathbf{z}, \mathbf{c} \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}) \propto \prod_e \frac{B(n^e_{e,\cdot} + \alpha_\theta)}{B(\alpha_\theta)} \prod_k \frac{B(\eta_\theta + n^e_{\cdot,k})}{B(\eta_\theta)} \cdot
\prod_\varepsilon \frac{B(n^\varepsilon_{\varepsilon,\cdot} + \alpha_\psi)}{B(\alpha_\psi)} \prod_k \frac{B(\eta_\psi + n^\varepsilon_{\cdot,k})}{B(\eta_\psi)}
\prod_\varepsilon \frac{B(n^c_{\varepsilon,\cdot} + \alpha_\pi)}{B(\alpha_\pi)}. \tag{D.25}$$
Here we have used the shorthand ε = (e, e′) to denote iteration over pairs of entities.
We have also introduced new count variables for documents associated with individual
entities, documents associated with pairs of entities, and source assignments:
$$n^e_{w,k} = \sum_e \sum_i \mathbb{1}(z_{e,i} = k \wedge w_{e,i} = w)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 1)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 2) \tag{D.26}$$
$$n^\varepsilon_{w,k} = \sum_\varepsilon \sum_i \mathbb{1}(c_{\varepsilon,i} = 3 \wedge w_{\varepsilon,i} = w) \tag{D.27}$$
$$n^e_{e,k} = \sum_i \mathbb{1}(z_{e,i} = k)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 1 \wedge \varepsilon_1 = e)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 2 \wedge \varepsilon_2 = e) \tag{D.28}$$
$$n^\varepsilon_{\varepsilon,k} = \sum_i \mathbb{1}(c_{\varepsilon,i} = 3 \wedge z_{\varepsilon,i} = k) \tag{D.29}$$
$$n^c_{\varepsilon,k} = \sum_i \mathbb{1}(c_{\varepsilon,i} = k) \tag{D.30}$$
with marginals being defined as before. There are two sampling equations to be
considered. First, when sampling the topic assignment for a word in an entity's
document,
$$p(z_{e,n} = k \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(e,n)}, \mathbf{c}) \propto (n^{e,(\neg e,n)}_{e,k} + \alpha_{\theta,k})\,\frac{n^{e,(\neg e,n)}_{w_{e,n},k} + \eta_{\theta,w_{e,n}}}{N^{e,(\neg e,n)}_k + W\eta_\theta}. \tag{D.31}$$
Second, when sampling the topic and source assignment for a word in an entity
pair's document,
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 1 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{e,(\neg\varepsilon,n)}_{e,k} + \alpha_{\theta,k}}{n^{e,(\neg\varepsilon,n)}_{e,\cdot} + K_\theta\alpha_\theta}\cdot\frac{n^{e,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\theta,w_{\varepsilon,n}}}{N^{e,(\neg\varepsilon,n)}_k + W\eta_\theta}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,1} + \alpha_\pi\big) \tag{D.32}$$
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 2 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{e,(\neg\varepsilon,n)}_{e',k} + \alpha_{\theta,k}}{n^{e,(\neg\varepsilon,n)}_{e',\cdot} + K_\theta\alpha_\theta}\cdot\frac{n^{e,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\theta,w_{\varepsilon,n}}}{N^{e,(\neg\varepsilon,n)}_k + W\eta_\theta}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,2} + \alpha_\pi\big) \tag{D.33}$$
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 3 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{\varepsilon,(\neg\varepsilon,n)}_{\varepsilon,k} + \alpha_{\psi,k}}{n^{\varepsilon,(\neg\varepsilon,n)}_{\varepsilon,\cdot} + K_\psi\alpha_\psi}\cdot\frac{n^{\varepsilon,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\psi,w_{\varepsilon,n}}}{N^{\varepsilon,(\neg\varepsilon,n)}_k + W\eta_\psi}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,3} + \alpha_\pi\big). \tag{D.34}$$
Bibliography
E. Agichtein and L. Gravano. Querying text databases for efficient information
extraction. In Proceedings of the International Conference on Data Engineering (ICDE),
2003. doi: http://doi.ieeecomputersociety.org/10.1109/ICDE.2003.1260786.
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic block-
models. Journal of Machine Learning Research, 9:1981–2014, September 2008.
URL http://arxiv.org/pdf/0705.4485.
A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social
networks. KDD 2008, 2008.
G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. Proceed-
ings of the 24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273496.1273501.
C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonpara-
metric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni.
Open information extraction from the web. In IJCAI 2007, 2007. URL
http://www.ijcai.org/papers07/Papers/IJCAI07-429.pdf.
K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching
words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–
195, 1975. ISSN 00390526. doi: http://dx.doi.org/10.2307/2987782. URL
http://dx.doi.org/10.2307/2987782.
J. Besag. On the statistical analysis of dirty pictures. Jour-
nal of the Royal Statistical Society, 48(3):259–302, 1986. URL
http://www.informaworld.com/index/739172868.pdf.
I. Bhattacharya, S. Godbole, and S. Joshi. Structured entity identification and
document categorization: Two tasks with one joint model. KDD 2008, 2008.
C. M. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference
engine for Bayesian networks. In NIPS 2002, 2002.
A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation
using an adaptive GMMRF model. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 428–441, 2004.
D. Blei and M. Jordan. Modeling annotated data. Proceedings of the 26th annual
international ACM SIGIR Conference on Research and Development in Information
Retrieval, 2003. URL http://portal.acm.org/citation.cfm?id=860460.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, 2003a.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet alloca-
tion. Journal of Machine Learning Research, 2003b. URL
http://www.mitpressjournals.org/doi/abs/10.1162/jmlr.2003.3.4-5.993.
D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121–144, Oct 2006.
D. M. Blei and J. D. McAuliffe. Supervised topic models. Neural Information
Processsing Systems, Aug 2007.
J. Boyd-Graber and D. M. Blei. Syntactic topic models. In Neural Information
Processing Systems, Dec 2008.
M. Braun and J. McAuliffe. Variational inference for large-scale models
of discrete choice. Arxiv preprint arXiv:0712.2526, Jan 2007. URL
http://arxiv.org/pdf/0712.2526.
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Mining hidden commu-
nity in heterogeneous social networks. LinkKDD 2005, Aug 2005. URL
http://portal.acm.org/citation.cfm?id=1134271.1134280.
M. Carreira-Perpinan and G. Hinton. On contrastive divergence
learning. Artificial Intelligence and Statistics, Jan 2005. URL
http://www.csri.utoronto.ca/ hinton/absps/cdmiguel.pdf.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext clas-
sification using hyperlinks. Proc. ACM SIGMOD, 1998. URL
http://citeseer.ist.psu.edu/article/chakrabarti98enhanced.html.
J. Chang and D. M. Blei. Relational topic models for document networks. In Artificial
Intelligence and Statistics (AISTATS), 2009.
J. Chang and D. M. Blei. Hierarchical relational models for document networks.
Annals of Applied Statistics, 4(1), 2010.
J. Chang, J. Boyd-Graber, and D. M. Blei. Connections between the lines: Augmenting
social networks with text. In KDD 2009, 2009.
S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for me models. IEEE
Transactions on Speech and Audio Processing, 8(1), Jun 2000.
D. Cohn and T. Hofmann. The missing link: a probabilistic model of document content
and hypertext connectivity. Advances in Neural Information Processing Systems 13,
2001.
M. Craven, D. DiPasquo, D. Freitag, and A. McCallum. Learning to ex-
tract symbolic knowledge from the world wide web. Proc. AAAI, 1998. URL
http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/1998/CMU-CS-98-122.pdf.
A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and
contact information from email and the web. AAAI 2005, 2005. URL
http://www.cs.umass.edu/ ronb/papers/dex.pdf.
D. Davidov, A. Rappoport, and M. Koppel. Fully unsupervised discovery of concept-
specific relationships by web mining. In ACL, 2007.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
C. Diehl, G. M. Namata, and L. Getoor. Relationship identification for social network
discovery. In AAAI 2007, July 2007.
L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences.
Proc. ICML, 2007. URL http://portal.acm.org/citation.cfm?id=1273526.
M. Dudík, S. Phillips, and R. Schapire. Maximum entropy density estimation
with generalized regularization and an application to species distribution mod-
eling. The Journal of Machine Learning Research, 8:1217–1260, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1314540.
B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78(382), 1983.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific
publications. Proceedings of the National Academy of Sciences, 2004.
E. Erosheva, S. Fienberg, and C. Joutard. Describing disability through individual-
level mixture models for multivariate binary data. Annals of Applied Statistics,
2007.
L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. Computer Vision and Pattern Recognition, 2005.
T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of
Statistics, 1:209–230, 1973.
S. E. Fienberg, M. M. Meyer, and S. Wasserman. Statistical analysis of multiple
sociometric relations. Journal of the American Statistical Association, 80:51—67,
1985.
M. E. Fisher. On the dimer solution of planar Ising models. Journal of
Mathematical Physics, 7(10):1776–1781, 1966. doi: 10.1063/1.1704825. URL
http://link.aip.org/link/?JMP/7/1776/1.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with
the graphical lasso. Biostatistics, 2007.
J. Gao, H. Suzuki, and B. Yu. Approximation lasso methods for language modeling.
Proceedings of the 21st International Conference on Computational Linguistics, Jan
2006. URL http://acl.ldc.upenn.edu/P/P06/P06-1029.pdf.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, (6):721–741, 1984.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning prob-
abilistic models of relational structure. Proc. ICML, 2001. URL
http://ai.stanford.edu/users/nir/Papers/GFTK1.pdf.
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web commu-
nities from link topology. HYPERTEXT 1998, May 1998. URL
http://portal.acm.org/citation.cfm?id=276627.276652.
A. Globerson and T. S. Jaakkola. Approximate inference using planar graph decom-
position. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19, pages 473–480. MIT Press, Cambridge, MA,
2007.
A. Globerson, T. Koo, X. Carreras, and M. Collins. Exponentiated gra-
dient algorithms for log-linear structured prediction. Proceedings of the
24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273535.
J. Goodman. Exponential priors for maximum entropy models. Mar 2004.
A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Uncertainty
in Artificial Intelligence, May 2008.
P. Haffner, S. Phillips, and R. Schapire. Efficient multiclass implementations of
l1-regularized maximum entropy. May 2006.
P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network
analysis. Journal of the American Statistical Association, 2002.
J. Hofman and C. Wiggins. A Bayesian approach to network modularity. eprint arXiv:
0709.3512, 2007. URL http://arxiv.org/pdf/0709.3512.
T. Hofmann. Probabilistic latent semantic indexing. SIGIR, 1999. URL
http://portal.acm.org/citation.cfm?id=312649.
E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258,
1925.
T. S. Jaakkola and M. I. Jordan. Variational methods and the QMR-DT database. MIT
Computational Cognitive Science Technical Report 9701, page 23, Jan 1999.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to
variational methods for graphical models. Oct 1999.
D. Jurafsky and J. Martin. Speech and language processing. Prentice Hall, 2008.
S. Katrenko and P. Adriaans. Learning relations from biomedical corpora us-
ing dependency trees. Lecture Notes in Computer Science, 2007. URL
http://www.springerlink.com/index/n145566q7t1u4365.pdf.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with
inequality constraints. Jun 2003.
C. Kemp, T. Griffiths, and J. Tenenbaum. Discovering latent
classes in relational data. MIT AI Memo 2004-019, 2004. URL
http://www-psych.stanford.edu/ gruffydd/papers/blockTR.pdf.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the
ACM (JACM), 1999. URL http://portal.acm.org/citation.cfm?id=324140.
M. Kolar and E. P. Xing. Improved estimation of high-dimensional ising models, 2008.
URL http://www.citebase.org/abstract?id=oai:arXiv.org:0811.1239.
V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization.
IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568–1583, October 2006. ISSN
0162-8828. doi: http://dx.doi.org/10.1109/TPAMI.2006.200. URL
http://dx.doi.org/10.1109/TPAMI.2006.200.
J. Lafferty and L. Wasserman. Rodeo: Sparse, greedy nonparametric regression. The
Annals of Statistics, Jan 2008.
J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of
social networks. KDD 2008, 2008a.
J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Statistical properties of
community structure in large social and information networks. WWW 2008, 2008b.
URL http://portal.acm.org/citation.cfm?id=1367591.
D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson.
The bottlenose dolphin community of doubtful sound features a large proportion
of long-lasting associations. can geographic isolation explain this unique trait?
Behavioral Ecology and Sociobiology, 54:396–405, 2003.
J. Majewski, H. Li, and J. Ott. The Ising model in physics and statistical genetics.
American Journal of Human Genetics, 69:853–862, 2001.
R. Malouf. A comparison of algorithms for maximum entropy parameter estimation.
Jun 2002.
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction
of internet portals with machine learning. Information Retrieval, 2000. URL
http://www.springerlink.com/index/R1723134248214T0.pdf.
A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in
social networks. Proceedings of the Nineteenth International Joint Conference on
Artificial Intelligence, 2005. URL http://www.ijcai.org/papers/1623.pdf.
A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and
D. Jensen. Exploiting relational structure to understand publication patterns
in high-energy physics. ACM SIGKDD Explorations Newsletter, 5(2), Dec 2003.
URL http://portal.acm.org/citation.cfm?id=980972.980999.
E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary
latent factors. NIPS 2007, 2007.
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent
patterns. KDD 2007, 1(3), 2007.
Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization.
WWW ’08: Proceeding of the 17th international conference on World Wide Web,
Apr 2008. URL http://portal.acm.org/citation.cfm?id=1367497.1367512.
N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection
with the lasso. Annals of Statistics, Jan 2006.
T. Minka and Y. Qi. Tree-structured approximations by expectation propagation. In
Proc. Neural Information Processing Systems Conf. (NIPS), 2003.
K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate
inference: An empirical study. Proceedings of Uncertainty in AI, Jan 1999. URL
http://www.vision.ethz.ch/ks/slides/murphy99loopy.pdf.
R. Nallapati and W. Cohen. Link-pLSA-LDA: A new unsupervised model for topics
and influence of blogs. ICWSM, 2008.
R. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for
text and citations. Proceedings of the 14th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2008.
O. J. Nave. Nave’s Topical Bible. Thomas Nelson, 2003. ISBN 0785250581.
R. M. Neal. Probabilistic inference using markov chain monte carlo methods. CRG-
TR-93-1, May 1993.
D. Newman, C. Chemudugunta, and P. Smyth. Statistical entity-topic models. In
KDD 2006, pages 680–686, New York, NY, USA, 2006a. ACM. ISBN 1-59593-339-5.
doi: http://doi.acm.org/10.1145/1150402.1150487.
M. Newman. The structure and function of net-
works. Computer Physics Communications, 2002. URL
http://linkinghub.elsevier.com/retrieve/pii/S0010465502002011.
M. E. J. Newman. Finding community structure in networks using the eigenvectors of
matrices. Phys. Rev. E, 74(036104), 2006a.
M. E. J. Newman. Modularity and community structure in networks. Proceedings of
the National Academy of Sciences, 103(23), 2006b. doi: 10.1073/pnas.0601602103.
URL http://arxiv.org/abs/physics/0602124v1.
M. E. J. Newman, A.-L. Barabási, and D. J. Watts. The Structure and Dynamics of
Networks. Princeton University Press, 2006b.
T. Ohta, Y. Tateisi, and J.-D. Kim. GENIA corpus: an annotated research abstract
corpus in molecular biology domain. In HLT 2002, San Diego, USA, 2002. URL
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/paper/hlt2002GENIA.pdf.
J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference.
1988.
J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using
multilocus genotype data. Genetics, 155:945–959, June 2000.
S. Riezler and A. Vasserman. Incremental feature selection and l1 regularization
for relaxed maximum-entropy modeling. In Proceedings of the 2004 Conference on
Empirical Methods in NLP, Jan 2004. URL
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Riezler.pdf.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for
authors and documents. In AUAI 2004, pages 487–494, Arlington, Virginia, United
States, 2004. AUAI Press. ISBN 0-9749039-0-6.
S. Sahay, S. Mukherjea, E. Agichtein, E. Garcia, S. Navathe, and A. Ram. Discovering
semantic biomedical relations utilizing the web. KDD 2008, 2(1), Mar 2008. URL
http://portal.acm.org/citation.cfm?id=1342320.1342323.
S. Sampson. Crisis in a cloister. PhD thesis, Cornell University, 1969.
L. Saul and M. Jordan. Exploiting tractable substructures in intractable networks.
Advances in Neural Information Processing Systems, pages 486–492, 1996.
L. Saul and M. Jordan. A mean field learning algorithm for unsupervised neural
networks. Learning in Graphical Models, Jan 1999.
Y. Shi and T. Duke. Cooperative model of bacterial sensing. Phys. Rev. E, 58(5):
6399–6406, Nov 1998. doi: 10.1103/PhysRevE.58.6399.
J. Sinkkonen, J. Aukia, and S. Kaski. Component models for large networks. arXiv,
stat.ML, Mar 2008. URL http://arxiv.org/abs/0803.1628v1.
D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. Advances
in Neural Information Processing Systems, 21, 2007.
M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic
Analysis, 2007.
R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agar-
wala, M. Tappen, and C. Rother. A comparative study of energy min-
imization methods for markov random fields with smoothness-based priors.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(6):
1068–1080, 2008. doi: http://dx.doi.org/10.1109/TPAMI.2007.70844. URL
http://dx.doi.org/10.1109/TPAMI.2007.70844.
H. Takamura, T. Inui, and M. Okumura. Extracting semantic orientations
of words using spin model. In ACL ’05: Proceedings of the 43rd Annual
Meeting on Association for Computational Linguistics, pages 133–140, Mor-
ristown, NJ, USA, 2005. Association for Computational Linguistics. doi:
http://dx.doi.org/10.3115/1219840.1219857.
L. Tanabe, N. Xie, L. H. Thom, W. Matten, and W. J. Wilbur. GeneTag: a tagged
corpus for gene/protein named entity recognition. BMC Bioinformatics, 6 Suppl
1, 2005. ISSN 1471-2105. doi: http://dx.doi.org/10.1186/1471-2105-6-S1-S3. URL
http://dx.doi.org/10.1186/1471-2105-6-S1-S3.
B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS 2003, 2003.
B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. Ad-
vances in Neural Information Processing Systems, Jan 2004a. URL
http://web.engr.oregonstate.edu/ tgd/classes/539/slides/max-margin-markov-networks.pdf.
B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS, 2004b.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of
the American Statistical Association, 101(476):1566–1581, 2007.
M. Wainwright and M. Jordan. A variational principle for graphical models. In New
Directions in Statistical Signal Processing, chapter 11. MIT Press, 2005a.
M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference
in discrete markov random fields. Signal Processing, IEEE Transactions on, 54(6):
2099–2109, June 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.874409.
M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation
algorithms and approximate ml estimation by pseudomoment matching. Artificial
Intelligence and Statistics, Jan 2003.
M. J. Wainwright and M. I. Jordan. Variational inference in graphical models: The
view from the marginal polytope. Allerton Conference on Control, Communication
and Computing, Apr 2003.
M. J. Wainwright and M. I. Jordan. A variational principle for graphical models. New
Directions in Statistical Signal Processing, 2005b.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and
variational inference. Foundations and Trends in Machine Learnings, 1(1 – 2):1–305,
Dec 2008.
M. J. Wainwright, P. Ravikumar, and J. D. Lafferty. High-dimensional graphical model
selection using l1-regularized logistic regression. Neural Information Processing
Systems, Jan 2006.
X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations
and text. Proceedings of the 3rd international workshop on Link discovery, 2005.
URL http://portal.acm.org/citation.cfm?id=1134276.
S. Wasserman and P. Pattison. Logit models and logistic regressions for social
networks: I. An introduction to Markov graphs and p*. Psychometrika, 1996. URL
http://www.springerlink.com/index/T2W46715636R2H11.pdf.
M. Welling and G. Hinton. A new learning algorithm for mean field boltzmann
machines. Artificial Neural Networks-Icann 2002, Jan 2002.
M. Welling and Y. W. Teh. Belief optimization for binary networks: a stable alternative
to loopy belief propagation. In In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, pages 554–561, 2001.
D. J. A. Welsh. The computational complexity of some classical problems from
statistical physics. In In Disorder in Physical Systems, pages 307–321. Clarendon
Press, 1990.
Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In UAI,
2006.
Z. Xu, V. Tresp, S. Yu, and K. Yu. Nonparametric relational learning for social
network analysis. In 2nd ACM Workshop on Social Network Mining and Analysis
(SNA-KDD 2008), 2008.
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and
its generalizations. Exploring artificial intelligence in the new millennium, pages
239–269, 2003.
W. Zachary. An information flow model for conflict and fission in small groups. Journal
of Anthropological Research, 33:452–473, 1977.
D. Zhou, S. Zhu, K. Yu, X. Song, B. Tseng, H. Zha, and C. Giles. Learning
multiple graphs for document recommendations. WWW 2008, Apr 2008. URL
http://portal.acm.org/citation.cfm?id=1367497.1367517.
H. Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society Series B, Jan 2005. URL
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-9868.2005.00503.x.