Uncovering, Understanding, and
Predicting Links
Jonathan Chang
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Electrical Engineering
Adviser: David M. Blei
November 2011
© Copyright by Jonathan Chang, 2011.
All Rights Reserved
Abstract
Network data, such as citation networks of documents, hyperlinked networks of web
pages, and social networks of friends, are pervasive in applied statistics and machine
learning. The statistical analysis of network data can provide both useful predictive
models and descriptive statistics. Predictive models can point social network mem-
bers towards new friends, scientific papers towards relevant citations, and web pages
towards other related pages. Descriptive statistics can uncover the hidden community
structure underlying a network data set.
In this work we develop new models of network data that account for both links
and attributes. We also develop the inferential and predictive tools around these
models to make them widely applicable to large, real-world data sets. One such model,
the Relational Topic Model, can predict links using only a new node’s attributes. Thus,
we can suggest citations of newly written papers, predict the likely hyperlinks of a
web page in development, or suggest friendships in a social network based only on a
new user’s profile of interests. Moreover, given a new node and its links, the model
provides a predictive distribution of node attributes. This mechanism can be used to
predict keywords from citations or a user’s interests from his or her social connections.
While explicit network data — network data in which the connections between
people, places, genes, corporations, etc. are explicitly encoded — are already ubiq-
uitous, most of these can only annotate connections in a limited fashion. Although
relationships between entities are rich, it is impractical to manually devise complete
characterizations of these relationships for every pair of entities in large, real-world
corpora. To resolve this we present a probabilistic topic model to analyze text corpora
and infer descriptions of their entities and of the relationships between those entities.
We show qualitatively and quantitatively that our model can construct and annotate
graphs of relationships and make useful predictions.
Acknowledgements
A graduate career is an endeavor which requires support from all those around you.
Friends, family, you know who you are and what I owe you (at least $2016). To all
the people in the EE and CS departments, especially the Liberty and SL@P labs, it’s
been a ball.
I want to call some special attention (in temporal order) to the faculty who have
helped me on my peripatetic journey through grad school. First off, thanks to David
August who took a chance on a clown with green hair. I made some good friends and
research contributions during my sojourn at the Liberty lab. Next I’d like to thank
Moses Charikar, Christiane D. Fellbaum, and Dan Osherson for giving me my second
chance by including me on the WordNet project when I had all but given up on
graduate school. Special thanks also go out to the members of my FPO committee:
Rob Schapire, Paul Cuff, Sanjeev Kulkarni, and Matt Salganik. Thanks for helping
me make sure my thesis is well-written and relevant.
Finally, the bulk of my thanks must be given to David Blei — a consummate advi-
sor, teacher, and all-around stand-up guy. Thanks for teaching me about variational
inference, schooling me on strange and wonderful music, and never giving up on me
and making sure I finished.
To Rory Gilmore, for being a hell of a lot smarter than me.
Contents

Abstract
Acknowledgements

1 Introduction

2 Modeling, Inference and Prediction
  2.1 Probabilistic Models
  2.2 Inference
    2.2.1 Exponential family distributions
  2.3 Example
  2.4 Prediction

3 Exponential Family Models of Links
  3.1 Background
  3.2 Pairwise Ising model
    3.2.1 Approximate inference of marginals
    3.2.2 Parameter estimation
  3.3 Evaluation
    3.3.1 Estimating marginal probabilities
    3.3.2 Making predictions
  3.4 Discussion

4 Relational Topic Models
  4.1 Relational Topic Models
    4.1.1 Modeling assumptions
    4.1.2 Latent Dirichlet allocation
    4.1.3 Relational topic model
    4.1.4 Link probability function
  4.2 Inference, Estimation and Prediction
    4.2.1 Inference
    4.2.2 Parameter estimation
    4.2.3 Prediction
  4.3 Empirical Results
    4.3.1 Evaluating the predictive distribution
    4.3.2 Automatic link suggestion
    4.3.3 Modeling spatial data
    4.3.4 Modeling social networks
  4.4 Discussion

5 Discovering Link Information
  5.1 Background
  5.2 Model
  5.3 Computation with NUBBI
    5.3.1 Inference
    5.3.2 Parameter estimation
    5.3.3 Prediction
  5.4 Experiments
    5.4.1 Learning networks
    5.4.2 Evaluating the predictive distribution
    5.4.3 Application to New York Times
  5.5 Discussion and related work

6 Conclusion

A Derivation of RTM Coordinate Ascent Updates
B Derivation of RTM Parameter Estimates
C Derivation of NUBBI coordinate-ascent updates
D Derivation of Gibbs sampling equations
  D.1 Latent Dirichlet allocation (LDA)
  D.2 Mixed-membership stochastic blockmodel (MMSB)
  D.3 Relational topic model (RTM)
  D.4 Supervised latent Dirichlet allocation (sLDA)
  D.5 Networks uncovered by Bayesian inference (NUBBI) model
Chapter 1
Introduction
In this work our aim is to apply the tools of probabilistic modeling to network data,
that is, a collection of nodes with identifiable properties, each one (possibly) connected
to other nodes. In the parlance of graph theory, we are concerned with graphs,
collections of vertices and edges, whose vertices may contain additional information.
In modeling these networks our aim is to gain insight into the structure underpinning
these networks and be able to make predictions about them.
Much of the pioneering work on the study of networks was done under the auspices
of sociological studies, i.e., the networks under consideration were social networks.
Zachary’s data on the members of a university karate club (Zachary 1977) and
Sampson’s study of social interactions among monks at a monastery (Sampson 1969)
are some early iconic works. The number and variety of data sets have grown
considerably since, from networks of dolphins (Lusseau et al. 2003) to co-authorship
networks (Newman 2006a); however, the underlying structure of the data remains the
same — a collection of nodes (people / animals / organisms / etc.) connected to one
another through some relationship (friendship / hatred / co-authorship / etc.).
In recent years, with the increasing digital representation of entities and the
relationships between them, the amount of data available to researchers has increased
and the impact of network understanding and prediction has magnified enormously.

Figure 1.1: A depiction of a subset of an online social network. Nodes represent
individuals and edges represent friendships between them.
Online social networks such as Facebook [1], LinkedIn [2], and Twitter [3] have made creating
and leveraging these networks their primary product. Consequently these online social
networks operate on a scale unimaginable to the early researchers of social networks;
the aforementioned early works have social networks on the order of tens of nodes
whereas Facebook alone has over 500 million [4].

[1] http://www.facebook.com
[2] http://www.linkedin.com
[3] http://www.twitter.com
[4] https://www.facebook.com/press/info.php?statistics, retrieved June 2011
Figure 1.1 shows what a subset of an online social network might look like. The
nodes in the graph represent people and the edges represent self-reported friendship
between members. Even in this simple example, a rich structure emerges with some
individuals belonging to tightly connected clusters while others exist on the periphery.
Characterizing this structure has been one major thrust of network research (Newman
et al. 2006b).
Figure 1.2 shows a screen capture from the online social network Facebook. In
this view, the screenshot shows some of the other nodes connected to the node that
the profile represents. The screenshot shows the variety of nodes and large number
of edges associated with a single user. For example, in this small portion of the
profile alone there are connections to nodes representing friends and family, nodes
representing workplaces, nodes representing schools, and nodes representing interests.
Again, there is a rich structure to be explored.
So far we have discussed social networks as simple graphs; however, these networks
are often richer than traditional graphs can express. In
particular, both the nodes and edges may have some content associated with them.
Even in Figure 1.2 it is clear that a single node / edge type cannot capture the
structure associated with friends vs. family or musical interest vs. sport interest.
The nodes may also have other attributes such as age or gender that can make for a
more expressive probabilistic model. Additionally, users on online social networks may
produce textual data associated with status updates or biographical prose. Figure 1.3
shows an example of a status update. The user generates some snippet of text which
is then posted online; other users may respond with comments. A collection of status
updates (and comments, etc.) comprises a corpus. Outside of social networks, one may
also consider citation networks (Figure 4.1), gene regulatory networks and many other
instances as networks whose nodes and their attendant attributes comprise a corpus.
Thus instead of referring to nodes / vertices we may refer to documents, and instead of
referring to node attributes we refer to words. Throughout this work we will be using
the language of textual analysis and the language of graph theory interchangeably.

Figure 1.2: A screenshot from a typical Facebook profile with sections annotated
(friends, interests, family, work, education). In this subset of the profile alone there
are network connections to workplaces, schools, interests, family and friends.
Understanding the nature of these connections is of immense practical and theoretical
interest.

Figure 1.3: A screenshot of a typical Facebook status update, a small user-generated
snippet of text. Other users can react to status updates by posting comments in
response.
The study of natural language has a long and rich history (see Jurafsky and Martin
(2008) for a description of many modern techniques for analyzing language). One
modern technique to analyze language that we shall leverage throughout this work is
topic modeling (Blei et al. 2003b). Topic modeling, to be described in more detail
in Section 4.1, is a latent mixed-membership model. It presupposes the existence
of latent themes or topics which characterize how words tend to occur with one
another. Documents are then merely specific realizations of ensembles of these themes.
Figure 1.4 depicts how the approach assumes topics on the left (denoted by β) and
an ensemble of these themes for each document (right). It is this ensemble that
determines the words in the document that we observe.
We have thus far described two incomplete perspectives of data; one which is
centered around documents and another centered around graphs. What we propose in
this thesis is a set of techniques for modeling these data with a complete perspective
that takes both of these aspects of the data into account. We also develop methods
for determining the unknowns in these models. We show that once so determined,
these models provide useful insights into the structure underpinning the data and are
able to make predictions about unseen nodes, edges, and attributes.

Figure 1.4: A depiction of the assumptions underlying topic models. Topic models
presuppose latent themes (left; e.g., topics β_1, β_2, β_3 collecting legal terms such
as “lawyer, justice, judge, investigate, prosecutor,” sports terms such as “game,
coach, player, play, match,” and political terms such as “republican, democrat,
senate, campaign, mayor”) and documents (right; here a news snippet with topic
proportions θ_d, per-word topic assignments z_{d,1}, z_{d,2}, and observed words
w_{d,1:N}). Documents are a composition of latent themes; this composition
determines the words in the document that we observe.
In Chapter 2 we lay the groundwork for our technique by first describing a general
framework in which to define and speak about probabilistic models. We follow up
by describing a set of tools for using data to uncover the unknown aspects of these
models. Then we describe how this can be used to analyze and make predictions about
data. In Chapter 3 we dive into a specific model which has wide applicability not
only to networks but also to a variety of other data. The challenge with this model
has always been the computational complexity of uncovering likely values for the
latent parameters. We introduce a technique which is able to vastly improve on the
state-of-the-art in terms of computation speed, while sacrificing very little accuracy,
thus making these models much more applicable to the large networks in which we
are interested.
In Chapter 4, we introduce the Relational Topic Model, a model specifically
designed to analyze collections of documents with connections between them (or
alternatively graphs with both edges and attributes). It leverages the aforementioned
topic modeling infrastructure but extends it so that the model can offer a unified view
of both links and content. We show that the model can make statements about new
nodes, for example predicting the content of a document based solely on its citations
or predicting additional citations based on its content. Further, it can be used to find
hidden community structure, and we analyze these features of the model on several
data sets.
The work in Chapter 4 presupposes a network in which most links have already
been observed. However, it is often the case that we have only textual content and we
would like to build out this network. Chapter 5 explores the construction of networks
based purely on text. By looking at the content associated with each node, as well as
content appearing around pairs of nodes we are able to infer descriptions of individual
entities and of the relationship between those entities. With the inference machinery
we develop we can apply the model to large corpora such as Wikipedia and show that
the model can construct and annotate graphs and make useful predictions.
Chapter 2
Modeling, Inference and Prediction
Throughout this work our approach will be to
1. define a probabilistic model with certain unknown parameters for data of a
particular character;
2. perform inference, that is, find values of the unknown parameters of the model
that best explain observations;
3. make predictions using a model whose parameters have been determined.
In this chapter we describe the framework in which we execute these steps. A more
detailed treatment can be found in Wainwright and Jordan (2008).
2.1 Probabilistic Models
Our approach uses the language of directed graphical models to describe probabilistic
models. Directed graphical models have been described as a synthesis of graph theory
and probability. In this framework, distributions are represented as directed, acyclic
graphs. Nodes in this graph represent variables and arrows indicate, informally, a
possible dependence between variables. (The dependence between variables can be
formally described by D-separation, which is outside the scope of this text.)

Figure 2.1: The language of graphical models. (a) Unobserved variable named Z.
(b) Observed (indicated by shading) variable named X. (c) Variable V possibly
dependent on variable U (indicated by arrow). (d) Variable Y replicated N times
(indicated by box).
The constituents of directed graphical models are
1. unshaded nodes indicating unobserved variables whose names are enclosed in the
circle;
2. shaded nodes indicating observed variables;
3. arrows between nodes indicating a possible dependence between variables;
4. boxes depicting replication.
These are shown in Figure 2.1.
Associated with each node is a conditional probability distribution over the variable
represented by that node. That probability distribution is conditioned on the variables
represented by that node’s parents. That is, letting x_i represent the variable associated
with the i-th node,

p_i(x_i \mid x_{j \in parents(i)})   (2.1)
describes the distribution of xi. The full joint distribution of the entire graphical
model can thus be written as

p(x) = \prod_i p_i(x_i \mid x_{j \in parents(i)}).   (2.2)
Note that it is straightforward to evaluate the probability of a state in this formalism;
one need only take the product of the evaluation of each pi. This formalism also makes
it convenient to simulate draws from this distribution by drawing each constituent
variable in topological order. Because each of the variables that x_i is conditioned on is a
parent, and all parent variables are guaranteed to have fixed values by dint of the
topological sort, x_i can be simulated by doing a single draw from p_i.
This also means it is straightforward to describe each probability distribution as a
generative process, that is, a sequence of probabilistic steps by which the data were
hypothetically generated. The intermediate steps of the generative process create
unobserved variables while the final step generates the observed data, i.e., the leaves
of the graph. This construction will be of particular interest in the sequel.
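To make this concrete, the following minimal sketch (ours, not part of the original text; the two-node graph and its samplers are illustrative, hypothetical choices) simulates a draw from a directed graphical model by visiting nodes in topological order:

import random

# Hypothetical DAG: each node maps to (parents, sampler); the sampler draws the
# node's value given its parents' values, i.e., a single draw from p_i.
graph = {
    "z": ([], lambda: random.random() < 0.5),                       # root: p(z)
    "x": (["z"], lambda z: random.gauss(3.0 if z else -3.0, 1.0)),  # leaf: p(x|z)
}

def ancestral_sample(graph, topological_order):
    """Simulate one joint draw by sampling each node after all of its parents."""
    values = {}
    for node in topological_order:
        parents, sampler = graph[node]
        values[node] = sampler(*(values[p] for p in parents))
    return values

print(ancestral_sample(graph, ["z", "x"]))  # e.g. {'z': True, 'x': 2.71}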
2.2 Inference
With a probability distribution thus defined our goal is to find values of unobserved
variables which explain observed variables. More formally, we are interested in finding
the posterior distribution of hidden variables (z) conditioned on observed variables
(x).
p(z|x) (2.3)
For all but a few special cases, it is computationally prohibitive to compute this
exactly. To see why, let us recall the definition of marginalization,
p(z|x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\sum_{z'} p(x, z')}.
As mentioned in the previous section, evaluating the joint distribution p(x, z) is
straightforward. However, to compute the posterior probability we must evaluate the
joint probability across all possible values of z′. Since the number of possible values
of z′ increases exponentially with the number of variables comprising z′, this quickly
becomes prohibitive.
Thus we turn to approximate methods. There are many approaches to approx-
imating the posterior such as Markov Chain Monte Carlo (MCMC) (Neal 1993).
However, we will use variational approximations in this work because they do not
rely on stochasticity, are amenable to various optimization approaches, and have been
empirically shown to achieve good approximations.
Variational methods approximate the true posterior, p(z|x), with an approximate
posterior, q(z). The approximation chosen is that distribution which is in some sense
“closest” to the true distribution. The definition of closeness used is Kullback-Leibler
(KL) divergence,
KL(q(z) \| p(z|x)) = \sum_z q(z) \log \frac{q(z)}{p(z|x)}   (2.4)
= -\sum_z q(z) \log \frac{p(z|x)}{q(z)}
\ge -\log \sum_z q(z) \frac{p(z|x)}{q(z)}
= -\log \sum_z p(z|x)
= -\log 1
= 0,   (2.5)
where the inequality follows from Jensen’s inequality. This choice of distance can be
intuitively justified in several ways. One is to rewrite the KL-divergence as
KL(q(z) \| p(z|x)) = -E_q[\log p(z|x)] - H(q),   (2.6)
where H () denotes entropy. Thus KL-divergence promotes distributions q which
look “similar” to p while adding an entropy regularization. Another justification of
KL-divergence arises from its relationship to the likelihood of observed data,
KL(q(z) \| p(z|x)) = -E_q[\log p(z|x)] - H(q)
= -E_q\left[\log \frac{p(z, x)}{p(x)}\right] - H(q)
= -E_q[\log p(z, x)] + E_q[\log p(x)] - H(q)
= \log p(x) - E_q[\log p(z, x)] - H(q).   (2.7)
This representation first implies that the problem can be expressed as finding the
distance between the variational distribution and the joint distribution rather than
the posterior distribution. Second, it shows that this distance can be used to form an
evidence lower bound (ELBO); as the distance decreases, the lower bound on the
likelihood of our data increases.
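As a numerical sanity check on Equation (2.7), the sketch below (ours; the three-state toy distribution is invented purely for illustration) verifies that log p(x) decomposes into the ELBO plus the KL divergence for an arbitrary q:

import numpy as np

# Toy joint p(z, x) over three values of z, for one fixed observation x.
p_joint = np.array([0.3, 0.1, 0.2])
log_px = np.log(p_joint.sum())          # evidence log p(x)
posterior = p_joint / p_joint.sum()     # true posterior p(z|x)

q = np.array([0.5, 0.25, 0.25])         # an arbitrary variational distribution

elbo = np.dot(q, np.log(p_joint)) - np.dot(q, np.log(q))  # E_q[log p(z,x)] + H(q)
kl = np.dot(q, np.log(q / posterior))                     # KL(q(z) || p(z|x))

assert np.isclose(log_px, elbo + kl)  # Equation (2.7), rearranged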
Our objective function now is to find q∗ such that
q^*(z) = \argmin_{q \in Q} KL(q(z) \| p(z|x)).   (2.8)
Note that this is trivially minimized when q∗(z) = p(z|x), the true posterior.
Therefore, the optimization problem as formulated is equivalent to posterior inference.
But since this is intractable, a tractable approximation is made by restricting the
search space Q. A common choice is the family of factorized distributions,
q(z) = \prod_i q_i(z_i).   (2.9)
This choice of Q is often termed a naïve variational approximation. This expression
is convenient since

H(q) = -E_q\left[\log \prod_i q_i(z_i)\right]
= -E_q\left[\sum_i \log q_i(z_i)\right]
= -\sum_i E_q[\log q_i(z_i)]
= \sum_i H(q_i).   (2.10)
Further, recall from the discussion above that in a generative process all of the
observations (x) appear as leaves of the graph. Therefore the expected log joint
probability can be expressed as
E_q[\log p(z, x)] = E_q\left[\log \prod_i p_i(z_i \mid z_{j \in parents(i)}) \prod_{i'} p_{i'}(x_{i'} \mid z_{j \in parents(i')})\right]
= \sum_i E_q\left[\log p_i(z_i \mid z_{j \in parents(i)})\right] + \sum_{i'} E_q\left[\log p_{i'}(x_{i'} \mid z_{j \in parents(i')})\right].
Note that because of marginalization the expectation of term pi depends only on
{qj(zj)|j ∈ parents(i)} if i is a leaf node, and {qj(zj)|j ∈ parents(i) ∪ {i}} otherwise.
Optimizing this with respect to a common choice for pi warrants further elucidation
below.
2.2.1 Exponential family distributions
Exponential family distributions are a class of distributions which take a particular
form. This form encompasses many common distributions and is convenient to optimize
with respect to the objective described in the previous section. Exponential family
distributions take the following form:
p(x|\eta) = \frac{\exp(\eta^T \phi(x))}{Z(\eta)}.   (2.11)
The normalization constant Z(η) is chosen so that the distribution sums to one.
The vector η is termed the natural parameters while φ(x) are the sufficient statistics.
Figure 2.2 helps illustrate how common distributions such as the Gaussian and
the Beta can be expressed in this representation.
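To make the representation concrete, the following sketch (ours, mirroring the Gaussian formulas summarized in Figure 2.2; the function names are our own) maps the common Gaussian parameterization to natural parameters and evaluates the density through the exponential family form:

import numpy as np
from scipy.stats import norm

def gaussian_natural_params(mu, sigma):
    """Natural parameters eta for the Gaussian with phi(x) = <x^2, x>."""
    return np.array([-1.0 / (2.0 * sigma**2), mu / sigma**2])

def exp_family_gaussian_pdf(x, eta):
    """p(x|eta) = exp(eta . phi(x)) / Z(eta), with Z as in Figure 2.2(a)."""
    Z = np.sqrt(-np.pi / eta[0]) * np.exp(-eta[1]**2 / (4.0 * eta[0]))
    return np.exp(eta[0] * x**2 + eta[1] * x) / Z

eta = gaussian_natural_params(mu=1.0, sigma=2.0)
assert np.isclose(exp_family_gaussian_pdf(0.5, eta),
                  norm.pdf(0.5, loc=1.0, scale=2.0))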
Figure 2.2: Two exponential family distributions; the title of each panel shows the
value of the natural parameters for the depicted distribution. (a) The Gaussian
distribution has sufficient statistics φ(x) = ⟨x^2, x⟩. The natural parameters are
related to the common parameterization by η = ⟨−1/(2σ^2), µ/σ^2⟩. The normalization
constant is Z = \sqrt{-\pi/\eta_1} \exp(-\eta_2^2/(4\eta_1)). (b) The Beta distribution
has sufficient statistics φ(x) = ⟨log(x), log(1 − x)⟩. The natural parameters are
related to the common parameterization by η = ⟨α − 1, β − 1⟩. The normalization
constant is Z = Γ(η_1 + 1)Γ(η_2 + 1)/Γ(η_1 + η_2 + 2).

Figure 2.3: A directed graphical model representation of a Gaussian mixture model.

The structure of the exponential family representation allows for these distributions
to be easily manipulated in the variational optimization above. In particular,

E_q[\log p(x, z)] = E_q\left[\log \frac{\exp(\eta^T \phi(x, z))}{Z(\eta)}\right]
= E_q[\eta^T \phi(x, z)] - E_q[\log Z(\eta)]
= E_q[\eta^T] \, E_q[\phi(x, z)] - E_q[\log Z(\eta)],   (2.12)
where the last line follows by independence under a fully-factorized variational distri-
bution. (Note that q is a distribution over both sets of latent variables in the model,
z and η.)
2.3 Example
To illustrate the procedure described in the previous sections, we perform it on a
simple Gaussian mixture model. Figure 2.3 shows a directed graphical model for this
example. We describe the generative process as
1. For i ∈ {0, 1},
   (a) Draw µ_i ∼ Uniform(−∞, ∞).*
2. For n ∈ [N],
   (a) Draw mixture indicator z_n ∼ Bernoulli(0.5);
   (b) Draw observation x_n ∼ N(µ_{z_n}, 1).

*We set aside here the issue of drawing from an improper probability distribution.
Our goal now is to approximate the posterior distribution of the hidden variables,
p(z,µ|x), conditioned on observations x. To do so we use the factorized distribution,
q(µ, z) = r(µ_0|m_0) \, r(µ_1|m_1) \prod_n q_n(z_n|\pi_n),   (2.13)

where q_n(z_n|\pi_n) is a Bernoulli distribution with parameter \pi_n, and r(µ_i|m_i) is a
Gaussian distribution with mean mi and unit variance. With the variational family
thus parameterized, the optimization problem becomes
\argmin_{\pi, m} \; -E_q[\log p(µ, z)] - H(q).   (2.14)
To do so we first appeal to Equation 2.12 for the expected log probability of an
exponential family with our choice of parameter,
E_q[\log p(x_n|µ_i)] = -\frac{1}{2} x_n^2 + E_q[µ_i] x_n - \frac{1}{2} E_q[µ_i^2] - \frac{1}{2} \log 2\pi
= -\frac{1}{2} x_n^2 + m_i x_n - \frac{1}{2}(1 + m_i^2) - \frac{1}{2} \log 2\pi.
Since we have chosen uniform distributions for z and µ, we can express the
expected log probability of the joint as
E_q[\log p(µ, z)] = E_q\left[\log \prod_n p(x_n|µ_0)^{1-z_n} \, p(x_n|µ_1)^{z_n}\right]
= \sum_n E_q[(1 - z_n) \log p(x_n|µ_0)] + E_q[z_n \log p(x_n|µ_1)]
= \sum_n (1 - \pi_n) E_q[\log p(x_n|µ_0)] + \pi_n E_q[\log p(x_n|µ_1)]
= \sum_n (1 - \pi_n)\left(m_0 x_n - \frac{1}{2} m_0^2\right) + \pi_n\left(m_1 x_n - \frac{1}{2} m_1^2\right) + C,
where C contains terms which do not depend on either πn or mi. We also compute
the entropy terms,
H(q_n(z_n|\pi_n)) = -(1 - \pi_n) \log(1 - \pi_n) - \pi_n \log \pi_n
H(r_i(µ_i|m_i)) = \frac{1}{2} \log(2\pi e).
To optimize these expressions we take the derivative with respect to each variable,

\frac{\partial L}{\partial \pi_n} = -\frac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0) + \log \frac{\pi_n}{1 - \pi_n}
\frac{\partial L}{\partial m_0} = -\sum_n (1 - \pi_n)(x_n - m_0)
\frac{\partial L}{\partial m_1} = -\sum_n \pi_n (x_n - m_1).
Figure 2.4: 100 points drawn from the mixture model depicted in Figure 2.3 with
µ_0 = −3 and µ_1 = 3. The horizontal axis denotes observed values while the vertical
axis and coloring denote the latent mixture indicator values.
Setting these equal to zero yields the following optimality conditions,

\pi_n = \sigma\left(\frac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0)\right)
m_0 = \frac{\sum_n (1 - \pi_n) x_n}{\sum_n (1 - \pi_n)}
m_1 = \frac{\sum_n \pi_n x_n}{\sum_n \pi_n},

where \sigma(x) denotes the sigmoid function 1/(1 + \exp(-x)). This is a system of
transcendental equations which cannot be solved analytically. However, we may apply
coordinate ascent; we initialize each variable to some guess and repeatedly cycle
through variables, optimizing them one at a time while holding the others fixed.
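A minimal sketch of this coordinate ascent (ours; the initialization and iteration count are arbitrary choices, not prescribed by the derivation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coordinate_ascent(x, iters=20, m0=-1.0, m1=1.0):
    """Cycle through the optimality conditions for pi_n, m_0, and m_1."""
    for _ in range(iters):
        pi = sigmoid(0.5 * (m1 - m0) * (2.0 * x - m1 - m0))  # update each q_n(z_n)
        m0 = np.sum((1.0 - pi) * x) / np.sum(1.0 - pi)       # update r(mu_0)
        m1 = np.sum(pi * x) / np.sum(pi)                     # update r(mu_1)
    return m0, m1, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 50), rng.normal(3, 1, 50)])
m0, m1, _ = coordinate_ascent(x)
print(m0, m1)  # approaches the true means -3 and 3, as in Figure 2.5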
Figure 2.5: Estimated values of m_0 and m_1 as a function of iteration using coordinate
ascent. The variational method is able to quickly recover the true values of these
parameters (shown as dashed lines).
Figure 2.6: The mixture model of Figure 2.3 augmented with an additional unobserved
datum to be predicted.
Figure 2.4 shows the result of simulating 100 draws from the distribution to be
estimated. The distribution has µ_0 = −3 and µ_1 = 3. The horizontal axis denotes
observed values while the vertical axis and coloring denote the latent mixture indicator
values. Figure 2.5 shows the result of applying the variational method with coordinate
ascent estimation. The series show the estimated values of m_i as a function of iteration.
The approach is able to quickly find the parameters of the true generating distributions
(dashed lines).
2.4 Prediction
With an approximate posterior in hand, our goal is often to make predictions about
data we have not yet seen. That is, given some observed data x1:N we wish to evaluate
the probability of an additional datum xN+1,
p(x_{N+1} \mid x_{1:N}).   (2.15)
This desideratum is illustrated in Figure 2.6 for the case of the Gaussian mixture
of the previous section. On the right hand side another unobserved instance of a
draw from the mixture model has been added as the datum to be predicted. One way
of approaching the problem is to marginalize the predictive distribution over the
latent variables,

p(x_{N+1}|x_{1:N}) = \sum_{z_{N+1}} \sum_{z_{1:N}} p(x_{N+1}, z_{N+1}|z_{1:N}) \, p(z_{1:N}|x_{1:N})
= \sum_{z_{N+1}} E_p[p(x_{N+1}, z_{N+1}|z_{1:N})]
\approx \sum_{z_{N+1}} E_q[p(x_{N+1}, z_{N+1}|z_{1:N})],   (2.16)
where the expectation on the second line is taken with respect to the true posterior
of the observed data, p(z1:N |x1:N ) and the expectation on the third line is taken with
respect to the variational approximation to the posterior, q(z1:N).
In the case of the Gaussian mixture, this expression is
p(x_{N+1}|x_{1:N}) \approx \frac{1}{2} E_q[p(x_{N+1}|µ_1)] + \frac{1}{2} E_q[p(x_{N+1}|µ_0)]
= \frac{1}{2} p(x_{N+1}|m_1) + \frac{1}{2} p(x_{N+1}|m_0).   (2.17)
The efficacy of this approach is demonstrated in Figure 2.7 wherein we empirically
estimate the expected value of p(xN+1|x1:N) by drawing an additional M values and
taking their average. The dashed line shows the expectation estimated using the
variational approximation.
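The two estimators being compared can be sketched as follows (our illustration, not part of the original text; it assumes the m_0, m_1 produced by the coordinate-ascent sketch above, and the Monte Carlo routine draws fresh data from the true mixture):

import numpy as np

def predictive_mean_variational(m0, m1):
    """Mean of the approximate predictive distribution in Equation (2.17)."""
    return 0.5 * m0 + 0.5 * m1

def predictive_mean_monte_carlo(rng, mu0, mu1, M):
    """Empirical estimate: average M additional draws from the true mixture."""
    z = rng.random(M) < 0.5
    return np.mean(np.where(z, rng.normal(mu1, 1, M), rng.normal(mu0, 1, M)))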
We have now described a framework for defining probabilistic models, inferring
the values of their unknowns using data, and taking the model and inferred values
to provide predictions about unseen data. In the following chapters we leverage this
framework to model, understand, and make predictions about networked data.
Figure 2.7: Estimated expected value of p(x_{N+1}|x_{1:N}) taken by averaging M random
draws from this function. The dashed line shows the value of this expectation estimated
by the variational approximation.
Chapter 3
Exponential Family Models of
Links
The first models of networks we explore are binary Markov random fields. These
models are widely used to model correlations between binary random variables. While
generally useful for a wide variety of applications, in this chapter we focus on applying
these models to collections of documents which contain words and/or links. In a binary
Markov random field, each document is treated as a collection of binary variables;
these binary variables may correspond to the presence of words in a document or the
presence of a citation to another document. Modeling the correlations between these
variables allows us to predict new words or new connections for documents.
However, their application to large-scale data sets has been hindered by their
intractability; both parameter estimation and inference are prohibitively expensive
on many large real-world data sets. In this chapter we present a new method to
perform both of these tasks. Leveraging a novel variational approximation to compute
approximate gradients, our technique is accurate yet computationally simple. We
evaluate our technique on both synthetic and real-world data and demonstrate that
we are able to learn models comparable to the state-of-the-art in a fraction of the
time.
3.1 Background
Large-scale models of co-occurrence are increasingly in demand. They can be used
to model the words in documents, connections between members of social networks,
or the structure of the human brain; these models can then lead to new insights into
brain function, suggest new friendships, or discover latent patterns of language usage.
The Ising model (Ising 1925) is a model of co-occurrence for binary vectors which
has been successfully applied to a variety of domains such as signal processing (Besag
1986), natural language processing (Takamura et al. 2005), genetics (Majewski et al.
2001), biological sensing (Shi and Duke 1998; Besag 1975), and computer vision (Blake
et al. 2004). Practitioners of the Ising model are limited, however, in the size of the
data sets to which the model can be applied. Many modern data sets and applications
require models with millions of parameters. Unfortunately, estimating the model’s
probabilities and optimizing its parameters are both #P-complete problems (Welsh
1990).
In response to its intractability, there has been a rich body of work on approximate
inference and optimization for the Ising model. The most common approaches have
been sampling-based (Geman and Geman 1984) of which contrastive divergence is
the most recent incarnation (Carreira-Perpinan and Hinton 2005; Welling and Hinton
2002). Other approaches include max-margin (Taskar et al. 2004a) and exponentiated
gradient (Globerson et al. 2007), expectation propagation (Minka and Qi 2003),
various relaxations (Fisher 1966; Globerson and Jaakkola 2007; Wainwright and
Jordan 2006; Kolar and Xing 2008; Sontag and Jaakkola 2007), as well as loopy belief
propagation (Pearl 1988; Murphy et al. 1999; Yedidia et al. 2003; Szeliski et al. 2008)
and its extensions (Wainwright et al. 2003; Welling and Teh 2001; Kolmogorov 2006).
In this chapter we present a new approach which is substantially faster and has
accuracy comparable to state-of-the-art methods. Our approach employs iterative
scaling (Dudík et al. 2007) and a new technique for approximating the gradients of the
log partition function of the Ising model. This approximation technique is inspired
by variational mean field methods (Jordan et al. 1999; Wainwright and Jordan 2003).
While these methods have been applied to a variety of models (Jaakkola and Jordan
1999; Saul and Jordan 1999; Bishop et al. 2002) including the Ising model, we will
show that our technique produces more accurate estimates of marginals and that this
in turn produces models with higher predictive accuracy. Further, our approximation
has a simple mathematical form which can be computed much more quickly. This
allows us to apply the Ising model to large models with millions of parameters.
Because of the large parameter space, our model also employs ℓ_1 + ℓ_2^2 feature
selection penalties to achieve sparse parameter estimates. This penalty is used in
linear models under the name elastic nets (Zou and Hastie 2005). Feature selection
penalties have an extensive history (Lafferty and Wasserman 2008; Malouf 2002). The
ℓ_1 penalty, in particular, has been a popular approach to obtaining sparse parameter
vectors (Friedman et al. 2007; Meinshausen and Bühlmann 2006; Wainwright et al.
2006). However, theory of regularized maximum likelihood estimation also indicates
that it is often beneficial to use ℓ_2^2 regularization (Dudík et al. 2007). Regularizations
of this form have been extensively applied (Chen and Rosenfeld 2000; Goodman 2004;
Riezler and Vasserman 2004; Haffner et al. 2006; Andrew and Gao 2007; Kazama and
Tsujii 2003; Gao et al. 2006).
This chapter is organized as follows. In Section 3.2, we describe the Ising model
and our procedure for approximating the marginals of the model and fitting its
parameters by approximate maximum a posteriori point estimation. In Section 3.3,
we compare the accuracy/speed trade-off of our model with several others on synthetic
and large real-world corpora. We show that our method provides parameter estimates
comparable with those of state-of-the-art techniques, but in much less time. This
enables the application of the Ising model to new data sets and application areas
which were previously out of reach. We summarize these findings in Section 3.4.
3.2 Pairwise Ising model
We study the exponential family known as the pairwise Ising model or binary Markov
random field which has long been used in physics to model ensembles of particles with
pairwise interactions. Our motivation is to characterize the co-occurrence of items
within “unordered bags” such as the co-occurrence of citations or keywords in research
papers. Such bags are represented by a binary vector x ∈ {0, 1}^n with components
x_i indicating presence of each item. The pairwise Ising model is parameterized by
κ ∈ R^n and λ ∈ R^{n(n−1)} controlling frequencies of individual items and frequencies of
their co-occurrence as
p_{κ,λ}(x) = \frac{1}{Z_{κ,λ}} \exp\left[\sum_{i=1}^{n} κ_i x_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j \ne i} λ_{ij} x_i x_j\right].
We assume throughout that λij = λji. Here, Zκ,λ denotes the normalization constant
ensuring that probabilities sum to one. For general settings of κ and λ, the exact
calculation of the normalization constant Z_{κ,λ} requires summation over 2^n possible
values of x, which becomes intractable for even moderate sizes of n. Since the normal-
ization constant Zκ,λ is required to calculate expectations and evaluate likelihoods,
basic tasks such as inference of marginals and parameter estimation cannot be carried
out exactly and require approximation. We propose a novel technique to approximate
marginals of the Ising model and a new procedure to learn its parameters. Since
learning of parameters relies on inference of marginals as a subroutine, we first present
the marginal approximation.
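For intuition, and for the small-n comparisons of Section 3.3, the exact quantities can be computed by brute force when n is small. A sketch of ours (not the method proposed in this chapter), assuming λ is symmetric with zero diagonal:

import itertools
import numpy as np

def ising_exact(kappa, lam):
    """Exact normalizer and marginals of a pairwise Ising model by enumerating
    all 2^n states; tractable only for small n."""
    n = len(kappa)
    Z, single, pair = 0.0, np.zeros(n), np.zeros((n, n))
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        w = np.exp(kappa @ x + 0.5 * x @ lam @ x)  # unnormalized probability
        Z += w
        single += w * x             # accumulates p(x_i = 1)
        pair += w * np.outer(x, x)  # accumulates p(x_i = x_j = 1)
    return single / Z, pair / Z, Z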
3.2.1 Approximate inference of marginals
Our approach begins with the naïve mean field approximation (Wainwright and Jordan
2005b; Jordan et al. 1999). While naïve mean field approximations may provide good
estimates of singleton marginals pκ,λ(xi), they often provide poor estimates of pairwise
marginals pκ,λ(xi, xj). Our technique corrects these estimates using an augmented
variational family. By combining the richness of the augmented variational family with
the computational simplicity of the naıve mean field, our technique yields accurate
estimates that can be computed efficiently.
In the sequel we first present the naïve mean field and then our improved approximation.
Naïve mean field

Naïve mean field approximates the Ising model p_{κ,λ} by a distribution q^{MF} with a
factored representation across components x_i,

q^{MF}(x) = \prod_i q^{MF}_i(x_i).
Among all distributions of the form above, naïve mean field algorithms seek the
distribution q^{MF} which minimizes the KL divergence from the true distribution p_{κ,λ},

q^{MF} = \argmin_{q^{MF}} D(q^{MF} \| p_{κ,λ}).   (3.1)
Here D(q‖p) = Eq[ln(q/p)] denotes the KL divergence, which measures information-
theoretic discrepancy between densities q and p. Since Equation (3.1) is not convex,
it is usually solved by alternating minimization in each coordinate—a procedure
which only yields a local minimum. In each individual coordinate, the objective of
Equation (3.1) can be minimized exactly by setting the derivatives to zero, yielding
the update

q^{MF}_i(x_i) \propto \exp\left(κ_i x_i + \sum_{j \ne i} λ_{ij} x_i \, q^{MF}_j(x_j = 1)\right).   (3.2)
For the derivation see for example Wainwright and Jordan (2005b).
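Writing q_i for q^{MF}_i(x_i = 1), Equation (3.2) reduces to the fixed point q_i = σ(κ_i + Σ_{j≠i} λ_{ij} q_j). A minimal sketch of the resulting iteration (ours; damping, convergence checks, and update order are omitted):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def naive_mean_field(kappa, lam, iters=50):
    """Coordinate updates of Equation (3.2); q[i] approximates p(x_i = 1).
    Assumes lam is symmetric with a zero diagonal."""
    q = sigmoid(np.array(kappa, dtype=float))  # start from the independent model
    for _ in range(iters):
        for i in range(len(q)):
            q[i] = sigmoid(kappa[i] + lam[i] @ q)  # j = i excluded by lam[i,i] = 0
    return q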
A chief advantage of naïve mean field is its simplicity and the speed of conver-
gence. However, compared with other approximation techniques such as loopy belief
propagation, the naïve mean field solution q^{MF} may yield poor approximations to the
pairwise marginals pκ,λ(xi, xj) (in Section 3.3 we demonstrate this empirically). Since
pairwise marginals are needed for parameter estimation, this is a major drawback.
Our approach
Our approach, Fast Learning of Ising Models (FLIM), takes advantage of the rapid
convergence of naïve mean field while correcting its estimates of pairwise marginals.
When estimating the marginal p_{κ,λ}(x_i, x_j) for a fixed pair i, j, we propose replacing
the product density q^{MF} in Equation (3.1) by a richer family

q^{(ij)}(x) = q^{(ij)}_{ij}(x_i, x_j) \prod_{k \ne i,j} q^{(ij)}_k(x_k).
This is similar to the approach known as structured mean field (Saul and Jordan
1996). However, we take advantage of the approximate singleton marginals q^{MF}_k(x_k)
provided by naïve mean field which, unlike pairwise marginals, provide sufficiently
good approximations of the true singleton marginals p_{κ,λ}(x_k). We minimize the KL
divergence from p_{κ,λ} under the constraint that q^{(ij)}_k(x_k) equal q^{MF}_k(x_k):

q^{(ij)} = \argmin_{q^{(ij)}} D(q^{(ij)} \| p_{κ,λ})
\text{s.t. } q^{(ij)}_k(x_k) = q^{MF}_k(x_k) \text{ for all } k \ne i, j.   (3.3)
Note that the only undetermined portion of q^{(ij)} is q^{(ij)}_{ij}. This can be solved explicitly
by setting derivatives equal to zero, yielding

q^{(ij)}_{ij}(x_i, x_j) \propto \exp\Big(κ_i x_i + κ_j x_j + λ_{ij} x_i x_j + \sum_{k \ne i,j} (λ_{ik} x_i + λ_{jk} x_j) \, q^{MF}_k(x_k = 1)\Big).   (3.4)

Given the naïve mean field solution q^{MF}, it is possible to calculate all corrected pairwise
marginals q^{(ij)}_{ij} in time O(n^2) by using auxiliary values

rowsum_i = \sum_{k \ne i} λ_{ik} \, q^{MF}_k(x_k = 1).

Thus, each q^{(ij)}_{ij} is calculated in constant amortized time.
Note that if we have access to estimates of marginals p_{κ,λ}(x_k) other than those
given by naïve mean field, we can use them instead of q^{MF}_k in Equations (3.3) and
(3.4).
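A sketch of this correction (ours, under the same symmetric, zero-diagonal assumption on λ): after precomputing the row sums, each corrected pairwise marginal is a constant-time normalization over the four joint states of (x_i, x_j).

import numpy as np

def flim_pairwise_marginals(kappa, lam, q):
    """Corrected q_ij(x_i = 1, x_j = 1) from Equation (3.4), given singleton
    marginals q (e.g., from naive mean field)."""
    n = len(kappa)
    rowsum = lam @ q  # rowsum[i] = sum_k lam[i,k] * q[k]
    pair = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a_i = kappa[i] + rowsum[i] - lam[i, j] * q[j]  # drop the k = j term
            a_j = kappa[j] + rowsum[j] - lam[j, i] * q[i]  # drop the k = i term
            # unnormalized q_ij over states (0,0), (1,0), (0,1), (1,1)
            w = np.array([1.0, np.exp(a_i), np.exp(a_j),
                          np.exp(a_i + a_j + lam[i, j])])
            pair[i, j] = pair[j, i] = w[3] / w.sum()
    return pair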
3.2.2 Parameter estimation
The main task we study is the problem of estimating parameters κ and λ from data.
As we will see, this necessitates calculation of pairwise marginals which we derived in
the previous section.
The data consists of a set of observations x^1, x^2, \ldots, x^D generated by an Ising
model p(x \mid κ, λ) = p_{κ,λ}(x). We posit a prior p(κ, λ), and estimate κ and λ as
maximizers of the posterior

p(λ, κ \mid \{x^d\}) \propto p(κ, λ) \prod_{d=1}^{D} p(x^d \mid κ, λ).   (3.5)
We consider the factored prior

p(κ, λ) = \left(\prod_i p(κ_i)\right)\left(\prod_{i,j} p(λ_{ij})\right),

with

p(κ_i) \propto \exp(κ_i)
p(λ_{ij}) \propto \exp(-β_1 |λ_{ij}| - β_2 λ_{ij}^2),   (3.6)

where β_1 and β_2 are hyperparameters. The prior over κ_i corresponds to Laplace
smoothing of empirical counts (however, note that it is improper). The prior over λ_{ij}
corresponds to regularization with an ℓ_1-norm term and an ℓ_2^2-norm, used in linear
models under the name elastic nets (Zou and Hastie 2005). This prior encourages
parameter vectors which exhibit both sparsity and grouping.
Combining Equation (3.5) and Equation (3.6), we obtain the following expression
for the log posterior:

\ln p(κ, λ \mid \{x^d\}) = \left(\sum_i κ_i\right) - β_1 \|λ\|_1 - β_2 \|λ\|_2^2
+ \sum_{d=1}^{D} \left[\left(\sum_{i=1}^{n} κ_i x^d_i\right) + \frac{1}{2}\left(\sum_{i=1}^{n} \sum_{j \ne i} λ_{ij} x^d_i x^d_j\right) - \ln Z_{κ,λ}\right] + const.   (3.7)
We optimize Equation (3.7) by a version of the algorithm PLUMMET (Dudík et al.
2007). This algorithm in each iteration updates κ and λ to new values κ′ and λ′ that
optimize a lower bound on Equation (3.7). More precisely, λ′_{ij} = λ_{ij} + δ_{ij} where

δ_{ij} = \argmax_δ \left[-\tilde{µ} e^δ + δ \hat{µ} - β_1 |λ_{ij} + δ| - β_2 (λ_{ij} + δ)^2\right],   (3.8)

where \hat{µ} denotes the empirical co-occurrence count

\hat{µ} = \sum_d x^d_i x^d_j,

while \tilde{µ} is the estimate of this count, \tilde{µ} = D \, E_{κ,λ}[x_i x_j]. We approximate the expectation
E_{κ,λ}[x_i x_j] using the technique of the previous section.
The objective of Equation (3.8) is concave in δ and therefore we can find its
maximizer by setting its derivative to zero,

-\tilde{µ} e^δ + \hat{µ} - β_1 \, \mathrm{sign}(λ_{ij} + δ) - 2 β_2 (λ_{ij} + δ) = 0.   (3.9)

This can be solved explicitly using the Lambert W function, denoted W(z), which for
a given z ≥ −e^{−1} represents the unique value W(z) ≥ −1 such that W(z) e^{W(z)} = z.
Using this definition it is straightforward to prove the following lemma, which can then
be used to solve Equation (3.9).
Lemma 3.2.1. For b > 0, the identity x = a − b e^x holds if and only if x = a − W(b e^a).
Rearranging Equation (3.9) to match the lemma, we now just need to carry out
the case analysis according to the sign of λ_{ij} + δ and consider possibilities

δ_+ = \frac{\hat{µ} - β_1}{2β_2} - W\left(\frac{\tilde{µ} e^{-λ_{ij}}}{2β_2} \exp\left(\frac{\hat{µ} - β_1}{2β_2}\right)\right) - λ_{ij}
δ_- = \frac{\hat{µ} + β_1}{2β_2} - W\left(\frac{\tilde{µ} e^{-λ_{ij}}}{2β_2} \exp\left(\frac{\hat{µ} + β_1}{2β_2}\right)\right) - λ_{ij}
δ_0 = -λ_{ij}.

We choose δ_+ if λ_{ij} + δ_+ > 0, δ_- if λ_{ij} + δ_- < 0, and δ_0 otherwise.
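A sketch of one coordinate update (ours, using scipy's Lambert W routine; mu_hat and mu_tilde denote the empirical and estimated counts \hat{µ} and \tilde{µ} defined above):

import numpy as np
from scipy.special import lambertw

def coordinate_delta(lam_ij, mu_hat, mu_tilde, beta1, beta2):
    """Solve Equation (3.9) for delta via Lemma 3.2.1 and the sign case analysis."""
    def delta(sign):
        a = (mu_hat - sign * beta1) / (2.0 * beta2) - lam_ij
        w = float(np.real(lambertw(mu_tilde / (2.0 * beta2) * np.exp(a))))
        return a - w  # Lemma 3.2.1: x = a - W(b e^a)
    d_plus, d_minus = delta(+1.0), delta(-1.0)
    if lam_ij + d_plus > 0:
        return d_plus
    if lam_ij + d_minus < 0:
        return d_minus
    return -lam_ij  # the sparse case: lambda_ij + delta = 0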
3.3 Evaluation
In this section we first apply our technique for performing marginal inference to a
synthetic test case. We compare our technique with several competing techniques
on both accuracy and speed. We then evaluate our entire parameter estimation
procedure on two large-scale, real-world data sets and show that models trained using
our procedure perform comparably with the state-of-the-art at making predictions
about unseen data.
Throughout this section we will compare the following five approaches:
Baseline No training is done for the parameters which govern pairwise correlations
λ, i.e., λ is set to 0.
NMF This method uses a naïve mean field to approximate pairwise expectations. As
described in Section 3.2, this method approximates the true model with one in
which all variables are decoupled. Because the implied Markov random field has
no edges, it cannot capture pairwise behavior.
BP Loopy belief propagation (Yedidia et al. 2003) is a message passing algorithm
that optimizes an approximation to the log partition function based on Bethe
energies. Because it must compute O(n2) messages each iteration, it can be
comparatively slow.
FLIM-NMF FLIM-NMF (Fast Learning of Ising Models) is our proposal for esti-
mating pairwise and singleton marginals described in Section 3.2. The estimates
are the solutions to a variational approximation where singleton marginals are
constrained to be equal to the marginals adduced by naïve mean field.
FLIM-Z FLIM-Z is similar to FLIM-NMF except that the singleton marginals are
constrained to be equal to the marginals when the pairwise correlations λ = 0,
i.e., σ(κ). This is an effective approximation to FLIM-NMF when λ is close to
zero. FLIM-Z is faster than FLIM-NMF since it does not require first solving
34
the naıve mean field variational problem.
3.3.1 Estimating marginal probabilities
To evaluate how well each of the approaches approximates the singleton marginals
p(xi) and pairwise marginals p(xi, xj) we generated a model with 24 nodes. Because
the number of nodes in this model is small, it is possible to compute the singleton
marginals and the pairwise marginals exactly through enumeration. By comparing
these true marginals with those estimated by each of the approximation techniques,
we can evaluate their accuracy/speed trade-off.
The following procedure was used to generate the parameters of the model. The
parameters which control the frequency of components, κ, form a vector of length 24
generated from a Beta distribution, σ(κ_i) ∼ Beta(1, 100). The parameters which
control correlations of components, λ, form a vector of length 276. 10% of the
elements of λ are randomly chosen to be non-zero; those elements are generated
from a zero-mean Gaussian, λ_{ij} ∼ N(0, 1).
found in the real-world corpora described in the next section.
The metric we use to compare the estimated marginals to the true marginals is
the mean relative error,

ε_{singleton} = \frac{1}{n} \sum_i \frac{|q(x_i = 1) - p(x_i = 1)|}{p(x_i = 1)}
ε_{pairwise} = \frac{1}{n^2 - n} \sum_i \sum_{j \ne i} \frac{|q(x_i x_j = 1) - p(x_i x_j = 1)|}{p(x_i x_j = 1)},
where q describes the approximate marginals computed by the approach under test.
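The metric itself is straightforward; a sketch of ours, applicable to flattened vectors of either singleton or pairwise marginals:

import numpy as np

def mean_relative_error(q, p):
    """Mean relative error of estimated marginals q against true marginals p."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean(np.abs(q - p) / p))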
To measure the approximation error as a function of computation time, we compute
the mean relative error after every full round of message passing for BP, every full
iteration of coordinate ascent updates in Equation (3.2) for FLIM-NMF and NMF,
and once at the end for FLIM-Z, since FLIM-Z is not iterative. We also compute
the time elapsed since the start of the program every time the mean relative error is
computed.
The approximation error versus time for BP, FLIM-Z, FLIM-NMF, and NMF
is shown in Figure 3.1. Loopy belief propagation is the most accurate of all the
techniques at estimating both the singleton marginals and the pairwise marginals.
Further, it converges to its final estimate after very few iterations. Unfortunately, it is
also the slowest. In contrast, naïve mean field and our proposals, FLIM-NMF and
FLIM-Z are much faster. They too converge in very few iterations. However, their
errors are higher than those of BP.
On singleton marginals, all of the approximations are quite accurate — mean
relative errors are always less than 1% on singleton marginals. NMF and FLIM-
NMF have the same relative errors since the singleton marginals for FLIM-NMF are
constrained to be equal to the solutions of NMF. FLIM-Z has a larger error than
either of these since its marginals assume that there are no pairwise correlations, an
assumption that is violated.
On pairwise marginals, BP once again achieves the lowest error, with FLIM-NMF
and FLIM-Z following closely behind. However, here NMF deviates from the other
three, having a much larger error (note that the y-axis is logarithmic). Because the
naïve mean field removes all dependencies between variables, it poorly characterizes
the rich correlation structure implied by λ. As the next section shows, this large error
leads to poorer MAP estimates of λ. FLIM-NMF, FLIM-Z, and BP however have
errors circa 1%; consequently they all have better MAP estimates of λ than NMF. But
our proposals, FLIM-NMF and FLIM-Z are able to run in a fraction of the execution
time of BP.
3.3.2 Making predictions
With the parameters of the model optimized using the procedure described in Sec-
tion 3.2, the model can then be used to make predictions on unseen data. The
predictive problem we evaluate here is that of predicting one of the binary random
variables xi given all other variables x−i. This question can be answered by computing
the conditional likelihood

p(x_i \mid x_{-i}, λ, κ) \propto \exp\left(κ_i x_i + \sum_{j \ne i} λ_{ij} x_i x_j\right).
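Because the normalization over the single variable x_i involves only two terms, this conditional is a logistic function of the remaining variables. A sketch of ours, again assuming λ is symmetric with a zero diagonal:

import numpy as np

def conditional_prob(i, x, kappa, lam):
    """p(x_i = 1 | x_{-i}) = sigmoid(kappa_i + sum_{j != i} lam_ij x_j)."""
    a = kappa[i] + lam[i] @ x  # the j = i term vanishes since lam[i, i] = 0
    return 1.0 / (1.0 + np.exp(-a))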
We apply this predictive procedure to two data sets:
Cora Cora (McCallum et al. 2000) is a set of 2708 abstracts from the Cora research
paper search engine, with links between documents that cite each other. For
the evaluation in this section, we ignore the textual content of the corpus and
concern ourselves with the links alone. The set of observed tokens associated
with each document is the set of cited and citing documents, yielding 2708
unique tokens. The model has a total of 3,667,986 parameters.
Metafilter Metafilter [1] is an internet community weblog where users share links.
Users can then annotate links with tags which describe them. We consider each
link to be a document and each link’s attendant tags to be its observed token
set. We culled a subset of these links to create a corpus of 18609 documents
with 3096 unique tokens. The model has a total of 4794156 parameters.
For Cora, this predictive problem amounts to estimating the probability of a document
in Cora citing a particular paper given our knowledge of the document’s other citations.
For Metafilter, we are estimating the probability that a link has a certain tag given
its other tags.
[1] http://www.metafilter.com
We used five-fold cross-validation to compute the predictive perplexity of unseen
data. All experiments were run with Dirichlet prior parameter α = 2 (equivalent
to Laplace smoothing); Gaussian and Laplacian priors were set to β_1 = β_2 = D e^{-8},
where D is the size of the corpus (cross-validation can be used to find good values of
β_1 and β_2). The results of these experiments are shown in Figure 3.2.
On both data sets, learning the covariance structure improves the predictive
perplexity over the baseline. Thus the correlation structure captured by the Ising
model provides increased predictive power when applied to these data sets.
The predictive perplexity of the model when trained using our proposals, FLIM-Z
and FLIM-NMF, is nearly identical to that of loopy belief propagation (BP) on both
data sets. Naïve mean field (NMF), on the other hand, does substantially worse, but
still better than Baseline. While FLIM-Z and FLIM-NMF are close to BP with respect
to predictive power, the previous section showed that their speed was closer to that
of NMF. Thus, our procedure provides a way to train models as accurately as loopy
belief propagation, but in a fraction of the time.
3.4 Discussion
We introduced a procedure to estimate the parameters of large-scale Ising models. This
procedure makes use of a novel constrained variational approximation for estimating
the pairwise marginals of the Ising model. This approximation has a simple mathemat-
ical form and can be computed more efficiently than other techniques. We also showed
empirically that this approximation is accurate for real-world data sets. Our approxi-
mation yields a procedure which can tractably be applied to models with millions of
parameters that can make predictions comparable with the state-of-the-art.
(a) Relative error of singleton marginals. (b) Relative error of pairwise marginals.

Figure 3.1: Mean relative error of singleton marginals (left) and pairwise marginals
(right) on a synthetic model. Execution times are on a logarithmic scale; the errors
in (b) are also on a logarithmic scale. Loopy belief propagation (BP) is accurate but
slow. Naïve mean field (NMF) is grossly inaccurate at estimating pairwise marginals.
FLIM-NMF offers a compromise: accuracy not much worse than BP at speed not much
worse than NMF.
(a) Cora. (b) Metafilter.

Figure 3.2: A comparison of the predictive perplexity of the Ising model using different
procedures for parameter optimization. Lower is better. All approaches perform better
than the baseline. Our proposals (FLIM-Z and FLIM-NMF) achieve better predictive
perplexity than naïve mean field (NMF), as does loopy belief propagation (BP). But
our proposals are able to run in a fraction of the time of BP (Figure 3.1).
Chapter 4
Relational Topic Models

Portions of this chapter appear in Chang and Blei (2010, 2009).
In the previous chapter, we described a model of documents and links and inferential
tools for the model. While these models are able to successfully make predictions
about documents, they often miss salient patterns of the corpus better captured by
latent variable models of link structure.
Recent research in this field has focused on latent variable models of link structure
because of their ability to decompose a network according to hidden patterns of
connections between its nodes (Kemp et al. 2004; Hofman and Wiggins 2007; Airoldi
et al. 2008). These models represent a significant departure from statistical models of
networks, which explain network data in terms of observed sufficient statistics (Wasser-
man and Pattison 1996; Newman 2002; Fienberg et al. 1985; Getoor et al. 2001; Taskar
et al. 2004b).
While powerful, current latent variable models account only for the structure
of the network, ignoring additional attributes of the nodes that might be available.
For example, a citation network of articles also contains text and abstracts of the
documents, a linked set of web-pages also contains the text for those pages, and an
on-line social network also contains profile descriptions and other information about
its members. This type of information about the nodes, along with the links between
them, should be used for uncovering, understanding and exploiting the latent structure
in the data.
To this end, we develop a new model of network data that accounts for both links
and attributes. While a traditional network model requires some observed links to
provide a predictive distribution of links for a node, our model can predict links using
only a new node’s attributes. Thus, we can suggest citations of newly written papers,
predict the likely hyperlinks of a web page in development, or suggest friendships in a
social network based only on a new user’s profile of interests. Moreover, given a new
node and its links, our model provides a predictive distribution of node attributes.
This mechanism can be used to predict keywords from citations or a user’s interests
from his or her social connections. Such prediction problems are out of reach for
traditional network models.
Here we focus on document networks. The attributes of each document are its
text, i.e., discrete observations taken from a fixed vocabulary, and the links between
documents are connections such as friendships, hyperlinks, citations, or adjacency.
To model the text, we build on previous research in mixed-membership document
models, where each document exhibits a latent mixture of multinomial distributions
or “topics” (Blei et al. 2003b; Erosheva et al. 2004; Steyvers and Griffiths 2007). The
links are then modeled dependent on this latent representation. We call our model,
which explicitly ties the content of the documents with the connections between them,
the relational topic model (RTM).
The RTM affords a significant improvement over previously developed models
of document networks. Because the RTM jointly models node attributes and link
structure, it can be used to make predictions about one given the other. Previous work
tends to explore one or the other of these two prediction problems. Some previous work
uses link structure to make attribute predictions (Chakrabarti et al. 1998; Kleinberg
1999), including several topic models (Dietz et al. 2007; McCallum et al. 2005; Wang
et al. 2005). However, none of these methods can make predictions about links given
words.
Other models use node attributes to predict links (Hoff et al. 2002). However,
these models condition on the attributes but do not model them. While this may be
effective for small numbers of attributes of low dimension, these models cannot make
meaningful predictions about or using high-dimensional attributes such as text data.
As our empirical study in Section 4.3 illustrates, the mixed-membership component
provides dimensionality reduction that is essential for effective prediction.
In addition to being able to make predictions about links given words and words
given links, the RTM is able to do so for new documents—documents outside of
training data. Approaches which generate document links through topic models treat
links as discrete “terms” from a separate vocabulary that essentially indexes the
observed documents (Nallapati and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen
et al. 2008; Gruber et al. 2008; Erosheva et al. 2004; Xu et al. 2006, 2008). Through
this index, such approaches encode the observed training data into the model and
thus cannot generalize to observations outside of them. Link and word predictions for
new documents, of the kind we evaluate in Section 4.3.1, are ill-defined.
Recent work from Nallapati et al. (2008) has jointly modeled links and document
content so as to avoid these problems. We elucidate the subtle but important differ-
ences between their model and the RTM in Section 4.1.4. We then demonstrate in
Section 4.3.1 that the RTM makes modeling assumptions that lead to significantly
better predictive performance.
The remainder of this chapter is organized as follows. First, we describe the
statistical assumptions behind the relational topic model. Then, we derive efficient
algorithms based on variational methods for approximate posterior inference, parameter
estimation, and prediction. Finally, we study the performance of the RTM on scientific
citation networks, hyperlinked web pages, geographically tagged news articles, and
social networks. The RTM provides better word prediction and link prediction than natural alternatives and the current state of the art.

Figure 4.1: Example data appropriate for the relational topic model. Each document is represented as a bag of words and linked to other documents via citation. The RTM defines a joint distribution over the words in each document and the citation links between them.
4.1 Relational Topic Models
The relational topic model (RTM) is a hierarchical probabilistic model of networks,
where each node is endowed with attribute information. We will focus on text data,
where the attributes are the words of the documents (see Figure 4.1). The RTM
embeds this data in a latent space that explains both the words of the documents and
how they are connected.
4.1.1 Modeling assumptions
The RTM builds on previous work in mixed-membership document models. Mixed-
membership models are latent variable models of heterogeneous data, where each data
point can exhibit multiple latent components. Mixed-membership models have been
successfully applied in many domains, including survey data (Erosheva et al. 2007),
image data (Fei-Fei and Perona 2005; Barnard et al. 2003), network data (Airoldi
et al. 2008), and document modeling (Steyvers and Griffiths 2007; Blei et al. 2003b).
Mixed-membership models were independently developed in the field of population
genetics (Pritchard et al. 2000).
To model node attributes, the RTM reuses the statistical assumptions behind
latent Dirichlet allocation (LDA) (Blei et al. 2003b), a mixed-membership model of
documents.1 Specifically, LDA is a hierarchical probabilistic model that uses a set
of “topics,” distributions over a fixed vocabulary, to describe a corpus of documents.
In its generative process, each document is endowed with a Dirichlet-distributed
vector of topic proportions, and each word of the document is assumed drawn by first
drawing a topic assignment from those proportions and then drawing the word from
the corresponding topic distribution. While a traditional mixture model of documents
assumes that every word of a document arises from a single mixture component, LDA
allows each document to exhibit multiple components via the latent topic proportions
vector. Below we describe this model in more detail before introducing our contribution,
the RTM.
4.1.2 Latent Dirichlet allocation
Latent Dirichlet allocation takes as input a collection of documents which are rep-
resented as bags-of-words, that is, unordered collections of terms from a fixed
vocabulary. A collection of documents is imbued with a fixed number of topics, multi-
nomial distributions over those terms. Intuitively, a topic captures themes by putting
high weights on words which are connected to that theme, and small weights otherwise.
This representation is captured in Figure 1.4 (reproduced here as Figure 4.2 for convenience). On the left are three topics, β1, β2, β3; we have depicted each by selecting words with high probability mass in that topic. For example, the blue topic, β2, puts high mass on terms related to jurisprudence, while the red topic, β3, puts high mass on terms related to sports.

1 A general mixed-membership model can accommodate any kind of grouped data paired with an appropriate observation model (Erosheva et al. 2004).
Figure 4.2: A depiction of the assumptions underlying topic models. Topic models presuppose latent themes (left) and documents (right). Documents are a composition of latent themes; this composition determines the words in the document that we observe.
Additionally, LDA associates with each document a multinomial distribution over
topics. Intuitively, this captures what the document “is about” in broad thematic
terms. This is captured by θd in Figure 1.4, also depicted graphically as a bar graph
over topics (colors). In the example text, the document is mostly about “politics” with
a smattering of “sports” and “law”. Finally, LDA associates a single topic assignment
with each word in the document. The topic proportions θd govern the frequency with
which each topic appears in an assignment; the topic vectors βk govern which words
are likely to appear for a given assignment. This is graphically depicted in Figure 1.4 by coloring words according to their topic assignment.

Figure 4.3: A graphical model representation of latent Dirichlet allocation. The words are observed (shaded) while the topic assignments (z), topic proportions (θ), and topics (β) are latent. Plates indicate replication.
This intuitive description of LDA can be formalized by the following generative
process:
1. For each document d:
(a) Draw topic proportions θd|α ∼ Dir(α).
(b) For each word wd,n:
i. Draw assignment zd,n|θd ∼ Mult(θd).
ii. Draw word wd,n|zd,n,β1:K ∼ Mult(βzd,n).
The notation x|z ∼ F (z) means that x is drawn conditional on z from the
distribution F (z). We use Dir and Mult as shorthand for the Dirichlet and Multinomial
distributions.
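To make this generative process concrete, the following is a minimal Python sketch of it (using numpy; the function and variable names here are illustrative, not part of any released implementation):

    import numpy as np

    def generate_lda_corpus(n_docs, doc_length, alpha, beta):
        # alpha: length-K Dirichlet parameter for topic proportions.
        # beta:  K x V matrix; row k is topic k's distribution over terms.
        K, V = beta.shape
        corpus = []
        for _ in range(n_docs):
            theta = np.random.dirichlet(alpha)                 # step (a)
            z = np.random.choice(K, size=doc_length, p=theta)  # step (b)i
            w = np.array([np.random.choice(V, p=beta[k]) for k in z])  # step (b)ii
            corpus.append((w, z, theta))
        return corpus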
This generative process is depicted in Figure 4.3. The words w are the only
observed variables. The parameters for the model are K, the number of topics in the
model, α, a K-dimensional Dirichlet parameter controlling the topic proportions θ,
and β1:K, the K multinomial parameters representing the topic distributions over terms.
It is worth emphasizing that the words are the only observed data in this model.
The topics, the rate at which topics appear in each document, and the topic associated
with each word are all inferred solely based on the way words co-occur in the data.
4.1.3 Relational topic model
In the RTM, each document is first generated from topics as in LDA. The links between
documents are then modeled as binary variables, one for each pair of documents.
These binary variables are distributed according to a distribution that depends on the
topics used to generate each of the constituent documents. Because of this dependence,
the content of the documents are statistically connected to the link structure between
them. Thus each document’s mixed-membership depends on both the content of the document and the pattern of its links. In turn, documents whose memberships
are similar will be more likely to be connected under the model.
The parameters of the RTM are β1:K , K topic distributions over terms, a K-
dimensional Dirichlet parameter α, and a function ψ that provides binary probabilities.
(This function is explained in detail below.) We denote a set of observed documents
by w1:D,1:N , where wi,1:N are the words of the ith document. (Words are assumed to
be discrete observations from a fixed vocabulary.) We denote the links between the
documents as binary variables y1:D,1:D, where yi,j is 1 if there is a link between the ith
and jth document. The RTM assumes that a set of observed documents w1:D,1:N and
binary links between them y1:D,1:D are generated by the following process.
1. For each document d:
(a) Draw topic proportions θd|α ∼ Dir(α).
(b) For each word wd,n:
i. Draw assignment zd,n|θd ∼ Mult(θd).
ii. Draw word wd,n|zd,n,β1:K ∼ Mult(βzd,n).
2. For each pair of documents d, d′:
(a) Draw binary link indicator

yd,d′ | zd, zd′ ∼ ψ(·|zd, zd′, η),

where zd = 〈zd,1, zd,2, . . . , zd,n〉.

Figure 4.4: A two-document segment of the RTM. The variable yd,d′ indicates whether the two documents are linked. The complete model contains this variable for each pair of documents. This binary variable is generated contingent on the topic assignments for the participating documents, zd and zd′, and global regression parameters η. The plates indicate replication. This model captures both the words and the link structure of the data shown in Figure 4.1.

Figure 4.4 illustrates the graphical model for this process for a single pair of documents. The full model, which is difficult to illustrate in a small graphical model, contains the observed words from all D documents, and D² link variables for each possible connection between them.
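As a sketch of the full generative process, the LDA sampler above can be extended with a pairwise link draw. Because the specific forms of ψ are not introduced until Section 4.1.4, the sketch below takes psi as an argument; all names are again illustrative:

    import itertools
    import numpy as np

    def generate_rtm(n_docs, doc_length, alpha, beta, psi):
        # psi: link probability function taking the two empirical topic
        # frequency vectors (the z-bars) and returning P(y = 1).
        K, V = beta.shape
        docs, zbars = [], []
        for _ in range(n_docs):
            theta = np.random.dirichlet(alpha)
            z = np.random.choice(K, size=doc_length, p=theta)
            docs.append(np.array([np.random.choice(V, p=beta[k]) for k in z]))
            zbars.append(np.bincount(z, minlength=K) / doc_length)
        links = {}
        for d1, d2 in itertools.combinations(range(n_docs), 2):
            links[(d1, d2)] = np.random.rand() < psi(zbars[d1], zbars[d2])
        return docs, links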
4.1.4 Link probability function
The function ψ is the link probability function that defines a distribution over the
link between two documents. This function is dependent on the two vectors of topic
assignments that generated their words, zd and zd′ .
This modeling decision is important. A natural alternative is to model links as a
function of the topic proportions vectors θd and θd′ . One such model is that of Nallapati
et al. (2008), which extends the mixed-membership stochastic blockmodel (Airoldi
et al. 2008) to generate node attributes. Similar in spirit is the non-generative model
of Mei et al. (2008) which “regularizes” topic models with graph information. The
issue with these formulations is that the links and words of a single document are
possibly explained by disparate sets of topics, thereby hindering their ability to make
predictions about words from links and vice versa.
For example, such a model with ten topics may use the first five topics to describe
the language of the corpus and the latter five to describe its connectivity. Each
document would participate in topics from the first set which account for its language
and the second set which account for its links. However, given a new document without
link information it is impossible in such a model to make predictions about links since
the document does not participate in the latter five topics. Similarly, a new document
without word information does not participate in the first five topics and hence no
predictions can be made.
By enforcing that the link probability function depends on the latent topic assignments zd and zd′, we ensure that the specific topics used to generate the links are those used to generate the words. A similar mechanism is employed in Blei and McAuliffe (2007) for non-pairwise response variables. In estimating parameters, this
means that the same topic indices describe both patterns of recurring words and
patterns in the links. The results in Section 4.3.1 show that this provides a superior
prediction mechanism.
We explore four specific possibilities for the link probability function. First, we
consider
ψσ(y = 1) = σ(ηT(zd ◦ zd′) + ν), (4.1)
where zd = (1/Nd) ∑n zd,n, the ◦ notation denotes the Hadamard (element-wise) product,
and the function σ is the sigmoid. This link function models each per-pair binary
variable as a logistic regression with hidden covariates. It is parameterized by coeffi-
cients η and intercept ν. The covariates are constructed by the Hadamard product of
zd and zd′ , which captures similarity between the hidden topic representations of the
two documents.
Second, we consider
ψe(y = 1) = exp(ηT(zd ◦ zd′) + ν). (4.2)
Here, ψe uses the same covariates as ψσ, but has an exponential mean function instead.
Rather than tapering off when zd ◦ zd′ are close, the probabilities returned by this function continue to increase exponentially. With some algebraic manipulation, the
function ψe can be viewed as an approximate variant of the modeling methodology
presented in Blei and Jordan (2003).
Third, we consider
ψΦ(y = 1) = Φ(ηT(zd ◦ zd′) + ν), (4.3)
where Φ represents the cumulative distribution function of the Normal distribution.
Like ψσ, this link function models the link response as a regression parameterized by
coefficients η and intercept ν. The covariates are also constructed by the Hadamard
product of zd and zd′ , but instead of the logit model hypothesized by ψσ, ψΦ models
the link probability with a probit model.
Finally, we consider
ψN(y = 1) = exp(−ηT((zd − zd′) ◦ (zd − zd′)) − ν). (4.4)
Note that ψN is the only one of the link probability functions which is not a function
of zd ◦ zd′ . Instead, it depends on a weighted squared Euclidean difference between the
two latent topic assignment distributions. Specifically, it is the multivariate Gaussian density function, with mean 0 and diagonal covariance characterized by η, applied to zd − zd′. Because the range of zd − zd′ is finite, the probability of a link, ψN(y = 1), is also finite. We constrain the parameters η and ν to ensure that it is between zero and one.

Figure 4.5: A comparison of different link probability functions (ψσ, ψe, ψΦ, ψN). The plot shows the probability of two documents being linked as a function of their similarity (as measured by the inner product of the two documents’ latent topic assignments). All link probability functions were parameterized so as to have the same endpoints.
All four of the ψ functions we consider are plotted in Figure 4.5. The link likelihoods
suggested by the link probability functions are plotted against the inner product of zd
and zd′ . The parameters of the link probability functions were chosen to ensure that
all curves have the same endpoints. Both ψσ and ψΦ have similar sigmoidal shapes.
In contrast, ψe is exponential in shape and its slope remains large at the right
limit. The one-sided Gaussian form of ψN is also apparent.
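For concreteness, the following is a minimal numpy sketch of Equations 4.1 through 4.4, operating on the averaged assignment vectors zd (here zb1 and zb2); it assumes η (eta) and ν (nu) have already been constrained so that each function returns a value in [0, 1]. The function names are ours, not part of any released implementation:

    import numpy as np
    from scipy.stats import norm

    def psi_sigma(zb1, zb2, eta, nu):
        # Logistic link (Equation 4.1).
        return 1.0 / (1.0 + np.exp(-(eta @ (zb1 * zb2) + nu)))

    def psi_e(zb1, zb2, eta, nu):
        # Exponential link (Equation 4.2); valid only when eta, nu keep it <= 1.
        return np.exp(eta @ (zb1 * zb2) + nu)

    def psi_probit(zb1, zb2, eta, nu):
        # Probit link (Equation 4.3).
        return norm.cdf(eta @ (zb1 * zb2) + nu)

    def psi_n(zb1, zb2, eta, nu):
        # Weighted squared-difference link (Equation 4.4).
        diff = zb1 - zb2
        return np.exp(-(eta @ (diff * diff)) - nu)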
4.2 Inference, Estimation and Prediction
With the model defined, we turn to approximate posterior inference, parameter estima-
tion, and prediction. We develop a variational inference procedure for approximating
the posterior. We use this procedure in a variational expectation-maximization (EM)
algorithm for parameter estimation. Finally, we show how a model whose parameters
have been estimated can be used as a predictive model of words and links.
4.2.1 Inference
The goal of posterior inference is to compute the posterior distribution of the latent
variables conditioned on the observations. As with many hierarchical Bayesian models
of interest, exact posterior inference is intractable and we appeal to approximate
inference methods. Most previous work on latent variable network modeling has
employed Markov Chain Monte Carlo (MCMC) sampling methods to approximate the
posterior of interest (Hoff et al. 2002; Kemp et al. 2004). Here, we employ variational
inference (Jordan et al. 1999; Wainwright and Jordan 2005a), a deterministic alternative to MCMC sampling that has been shown to give accuracy comparable to MCMC with improved computational efficiency (Braun and McAuliffe 2007; Blei and Jordan 2006).
Wainwright and Jordan (2008) investigate the properties of variational approximations
in detail. Recently, variational methods have been employed in other latent variable
network models (Airoldi et al. 2008; Hofman and Wiggins 2007).
In variational methods, we posit a family of distributions over the latent variables,
indexed by free variational parameters. Those parameters are then fit to be close to
the true posterior, where closeness is measured by relative entropy. For the RTM, we
use the fully-factorized family, where the topic proportions and all topic assignments
are considered independent,

q(Θ,Z|γ,Φ) = ∏d [qθ(θd|γd) ∏n qz(zd,n|φd,n)]. (4.5)
The parameters γ are variational Dirichlet parameters, one for each document, and Φ
are variational multinomial parameters, one for each word in each document. Note
that Eq [zd,n] = φd,n.
Minimizing the relative entropy is equivalent to maximizing Jensen’s lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

L = ∑(d1,d2) Eq[log p(yd1,d2 | zd1, zd2, η, ν)] + ∑d ∑n Eq[log p(zd,n | θd)] + ∑d ∑n Eq[log p(wd,n | β1:K, zd,n)] + ∑d Eq[log p(θd | α)] + H(q), (4.6)
where (d1, d2) denotes all document pairs and H (q) denotes the entropy of the dis-
tribution q. The first term of the ELBO differentiates the RTM from LDA (Blei
et al. 2003b). The connections between documents affect the objective in approximate
posterior inference (and, below, in parameter estimation).
We develop the inference procedure below under the assumption that only observed
links will be modeled (i.e., yd1,d2 is either 1 or unobserved).2 We do this for both
methodological and computational reasons.
First, while one can fix yd1,d2 = 1 whenever a link is observed between d1 and
d2 and set yd1,d2 = 0 otherwise, this approach is inappropriate in corpora where the
absence of a link cannot be construed as evidence for yd1,d2 = 0. In these cases, treating
these links as unobserved variables is more faithful to the underlying semantics of the
data. For example, in large social networks such as Facebook the absence of a link
between two people does not necessarily mean that they are not friends; they may be real friends who are unaware of each other’s existence in the network. Treating this link as unobserved better respects our lack of knowledge about the status of their relationship.

2 Sums over document pairs (d1, d2) are understood to range over pairs for which a link has been observed.
Second, treating non-links as hidden decreases the computational cost of inference;
since the link variables are leaves in the graphical model they can be removed whenever
they are unobserved. Thus the complexity of computation scales with the number
of observed links rather than the number of document pairs. When the number of
true observations is sparse relative to the number of document pairs, as is typical,
this provides a significant computational advantage. For example, on the Cora data
set described in Section 4.3, there are 3,665,278 unique document pairs but only
5,278 observed links. Treating non-links as hidden in this case leads to an inference
procedure which is nearly 700 times faster.
Our aim now is to compute each term of the objective function given in Equation 4.6.
The first term,
∑(d1,d2) Ld1,d2 ≡ ∑(d1,d2) Eq[log p(yd1,d2 | zd1, zd2, η, ν)], (4.7)
depends on our choice of link probability function. For many link probability func-
tions, this term cannot be expanded analytically. However, if the link probability
function depends only on zd1 ◦ zd2 we can expand the expectation using the following
approximation arising from a first-order Taylor expansion of the term (Braun and
McAuliffe 2007),3

L(d1,d2) = Eq[log ψ(zd1 ◦ zd2)] ≈ log ψ(Eq[zd1 ◦ zd2]) = log ψ(πd1,d2),

where πd1,d2 = φd1 ◦ φd2 and φd = Eq[zd] = (1/Nd) ∑n φd,n. In this work, we explore three functions which can be written in this form,

Eq[log ψσ(zd1 ◦ zd2)] ≈ log σ(ηTπd1,d2 + ν)
Eq[log ψΦ(zd1 ◦ zd2)] ≈ log Φ(ηTπd1,d2 + ν)
Eq[log ψe(zd1 ◦ zd2)] = ηTπd1,d2 + ν. (4.8)

3 While we do not give a detailed proof here, the error of a first-order approximation is closely related to the probability mass in the tails of the distributions on zd1 and zd2. Because the number of words in a document is typically large, the variances of zd1 and zd2 tend to be small, making the first-order approximation a good one.
Note that for ψe the expression is exact. The likelihood when ψN is chosen as the link
probability function can also be computed exactly,
Eq[log ψN(zd1, zd2)] = −ν − ∑i ηi((φd1,i − φd2,i)² + Var(zd1,i) + Var(zd2,i)),

where Var(zd,i) = (1/Nd²) ∑n φd,n,i(1 − φd,n,i). (See Appendix A.)
Leveraging these expanded expectations, we then use coordinate ascent to op-
timize the ELBO with respect to the variational parameters γ,Φ. This yields an
approximation to the true posterior. The update for the variational multinomial φd,j
is
φd,j ∝ exp{ ∑d′≠d ∇φd,nLd,d′ + Eq[log θd|γd] + log β·,wd,j }. (4.9)
The contribution to the update from link information, ∇φd,nLd,d′ , depends on the
choice of link probability function. For the link probability functions expanded in
Equation 4.8, this term can be written as
∇φd,nLd,d′ = (∇πd,d′Ld,d′) ◦ φd′/Nd. (4.10)
Intuitively, Equation 4.10 will cause a document’s latent topic assignments to be
nudged in the direction of neighboring documents’ latent topic assignments. The
magnitude of this pull depends only on πd,d′ , i.e., some measure of how close they are
already. The corresponding gradients for the functions in Equation 4.8 are
∇πd,d′Lσd,d′ ≈ (1 − σ(ηTπd,d′ + ν)) η
∇πd,d′LΦd,d′ ≈ [Φ′(ηTπd,d′ + ν)/Φ(ηTπd,d′ + ν)] η
∇πd,d′Led,d′ = η.
The gradient when ψN is the link probability function is
∇φd,nLNd,d′ = (2/Nd) η ◦ (φd′ − φd,−n − 1/Nd), (4.11)

where φd,−n = φd − (1/Nd)φd,n. Similar in spirit to Equation 4.10, Equation 4.11 will cause
a document’s latent topic assignments to be drawn towards those of its neighbors.
This draw is tempered by φd,−n, a measure of how similar the current document is to
its neighbors.
The contribution to the update in Equation 4.9 from the word evidence logβ·,wd,j
can be computed by taking the element-wise logarithm of the wd,jth column of the
topic matrix β. The contribution to the update from the document’s latent topic
proportions is given by
Eq[log θd|γd] = Ψ(γd) − Ψ(∑i γd,i),

where Ψ is the digamma function.4 (A digamma of a vector is the vector of digammas.)
The update for γ is identical to that in variational inference for LDA (Blei et al. 2003b),

γd ← α + ∑n φd,n.

4 The digamma function is defined as the logarithmic derivative of the gamma function.
These updates are fully derived in Appendix A.
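To illustrate how these updates fit together, here is a minimal Python sketch of one coordinate-ascent sweep for a single document under the exponential link ψe, for which the link gradient ∇πL is exactly η; the data layout and names are our own illustrative choices, not a definitive implementation:

    import numpy as np
    from scipy.special import digamma

    def update_document(phi, gamma, d, words, beta, alpha, eta, neighbors):
        # phi:   list of N_d x K arrays (variational multinomials per document)
        # gamma: D x K array of variational Dirichlet parameters
        # neighbors: indices of documents linked to d in the training data
        N_d, K = phi[d].shape
        elog_theta = digamma(gamma[d]) - digamma(gamma[d].sum())
        # Under psi_e, grad_pi L = eta, so each neighbor d' contributes
        # eta * phibar_{d'} / N_d to every word's update (Equation 4.10).
        link_term = sum(eta * phi[dp].mean(axis=0) for dp in neighbors) / N_d
        for n, w in enumerate(words[d]):
            log_phi = link_term + elog_theta + np.log(beta[:, w])
            log_phi = log_phi - log_phi.max()          # numerical stability
            phi[d][n] = np.exp(log_phi) / np.exp(log_phi).sum()
        gamma[d] = alpha + phi[d].sum(axis=0)          # gamma update, as in LDA
        return phi, gamma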
4.2.2 Parameter estimation
We fit the model by finding maximum likelihood estimates for each of the parameters:
multinomial topic vectors β1:K and link function parameters η, ν. Once again, this
is intractable so we turn to an approximation. We employ variational expectation-
maximization, where we iterate between optimizing the ELBO of Equation 4.6 with
respect to the variational distribution and with respect to the model parameters. This
is equivalent to the usual expectation-maximization algorithm (Dempster et al. 1977),
except that the computation of the posterior is replaced by variational inference.
Optimizing with respect to the variational distribution is described in Section 4.2.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood
estimation with expected sufficient statistics, where the expectation is taken with
respect to the variational distribution.
The update for the topics matrix β is
βk,w ∝ ∑d ∑n 1(wd,n = w) φd,n,k. (4.12)
This is the same as the variational EM update for LDA (Blei et al. 2003b). In practice,
we smooth our estimates of βk,w using pseudocount smoothing (Jurafsky and Martin
2008) which helps to prevent overfitting by positing a Dirichlet prior on βk.
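A minimal sketch of this M-step, with the pseudocount smoothing folded in (the pseudocount value and the data layout are illustrative assumptions):

    import numpy as np

    def update_beta(phi, words, K, V, smooth=0.01):
        # Accumulate expected counts under q (Equation 4.12), starting
        # from pseudocounts that act as a Dirichlet prior on each beta_k.
        beta = np.full((K, V), smooth)
        for d, w_d in enumerate(words):
            for n, w in enumerate(w_d):
                beta[:, w] += phi[d][n]   # phi[d] is N_d x K for document d
        return beta / beta.sum(axis=1, keepdims=True)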
In order to fit the parameters η, ν of the logistic function of Equation 4.1, we employ
gradient-based optimization. Using the approximation described in Equation 4.8, we
compute the gradient of the objective given in Equation 4.6 with respect to these
parameters,
∇ηL ≈ ∑(d1,d2) [yd1,d2 − σ(ηTπd1,d2 + ν)] πd1,d2,

∂L/∂ν ≈ ∑(d1,d2) [yd1,d2 − σ(ηTπd1,d2 + ν)].
Note that these gradients cannot be used to directly optimize the parameters
of the link probability function without negative observations (i.e., yd1,d2 = 0). We
address this by applying a regularization penalty. This regularization penalty along
with parameter update procedures for the other link probability functions are given in
Appendix B.
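A minimal gradient-ascent sketch for the logistic link follows. Since every observed pair is a link (y = 1), a penalty is needed; the simple L2 penalty below is a stand-in assumption for illustration, not the regularizer actually derived in Appendix B:

    import numpy as np

    def fit_eta_nu(pi_pairs, eta, nu, lam=0.1, lr=1e-3, n_iters=500):
        # pi_pairs: list of pi vectors, one per observed link (y = 1),
        # where pi = phibar_d1 * phibar_d2 (elementwise).
        for _ in range(n_iters):
            g_eta = -2.0 * lam * eta   # gradient of the stand-in L2 penalty
            g_nu = 0.0
            for pi in pi_pairs:
                resid = 1.0 - 1.0 / (1.0 + np.exp(-(eta @ pi + nu)))
                g_eta = g_eta + resid * pi
                g_nu += resid
            eta = eta + lr * g_eta
            nu = nu + lr * g_nu
        return eta, nu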
4.2.3 Prediction
With a fitted model, our ultimate goal is to make predictions about new data. We
describe two kinds of prediction: link prediction from words and word prediction from
links.
In link prediction, we are given a new document (i.e., a document which is not
in the training set) and its words. We are asked to predict its links to the other
documents. This requires computing
p(yd,d′|wd,wd′) = ∑zd,zd′ p(yd,d′|zd, zd′) p(zd, zd′|wd,wd′),
an expectation with respect to a posterior that we cannot compute. Using the inference
algorithm from Section 4.2.1, we find variational parameters which optimize the ELBO
for the given evidence, i.e., the words and links for the training documents and the
words in the test document. Replacing the posterior with this approximation q(Θ,Z),
the predictive probability is approximated with
p(yd,d′|wd,wd′) ≈ Eq [p(yd,d′|zd, zd′)] . (4.13)
In a variant of link prediction, we are given a new set of documents (documents not
in the training set) along with their words and asked to select the links most likely to
exist. The predictive probability for this task is proportional to Equation 4.13.
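Operationally, ranking candidate links for a new document reduces to scoring each training document with the fitted ψ applied to expected topic assignments, as in the first-order approximation used during inference. A minimal sketch (psi as in the functions sketched in Section 4.1.4; the phibar vectors come from variational inference on the documents' words):

    def rank_candidate_links(phibar_new, phibar_train, eta, nu, psi, top_m=20):
        # Approximate p(y = 1 | words) for each training document (Eq. 4.13)
        # and return the top_m highest-scoring candidates.
        scores = [(d, psi(phibar_new, pb, eta, nu))
                  for d, pb in enumerate(phibar_train)]
        return sorted(scores, key=lambda s: -s[1])[:top_m]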
The second predictive task is word prediction, where we predict the words of a
new document based only on its links. As with link prediction, p(wd,i|yd) cannot be
computed. Using the same technique, a variational distribution can approximate this
posterior. This yields the predictive probability
p(wd,i|yd) ≈ Eq [p(wd,i|zd,i)] .
Note that models which treat the endpoints of links as discrete observations of
data indices cannot participate in the two tasks presented here. They cannot make
meaningful predictions for documents that do not appear in the training set (Nallapati
and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen et al. 2008; Erosheva et al.
2004). By modeling both documents and links generatively, our model is able to
give predictive distributions for words given links, links given words, or any mixture
thereof.
4.3 Empirical Results
We examined the RTM on four data sets.5 Words were stemmed; stop words (i.e., words like “and,” “of,” or “but”) and infrequently occurring words were removed.
5 An R package implementing these models and more is available online at http://cran.r-project.org/web/packages/lda/. Detailed derivations for some of the models included in the package are given in Appendix D.
Table 4.1: Summary statistics for the four data sets after processing.

Data Set     # of Documents   # of Words   # of Links   Lexicon Size
Cora         2708             49216        5278         1433
WebKB        877              79365        1388         1703
PNAS         2218             119162       1577         2239
LocalNews    51               93765        107          1242
Directed links were converted to undirected links,6 and documents with no links were
removed. The Cora data (McCallum et al. 2000) contains abstracts from the Cora
computer science research paper search engine, with links between documents that
cite each other. The WebKB data (Craven et al. 1998) contains web pages from the
computer science departments of different universities, with links determined from
the hyperlinks on each page. The PNAS data contains recent abstracts from the
Proceedings of the National Academy of Sciences. The links between documents are
intra-PNAS citations. The LocalNews data set is a corpus of local news culled from
various media markets throughout the United States. We create one bag-of-words
document associated with each state (including the District of Columbia); each state’s
“document” consists of headlines and summaries from local news in that state’s media
markets. Links between states were determined by geographical adjacency. Summary
statistics for these data sets are given in Table 4.1.
4.3.1 Evaluating the predictive distribution
As with any probabilistic model, the RTM defines a probability distribution over unseen
data. After inferring the latent variables from data (as described in Section 4.2.1), we
ask how well the model predicts the links and words of unseen nodes. Models that
give higher probability to the unseen documents better capture the joint structure of
words and links.
We study the RTM with three link probability functions discussed above: the
logistic link probability function, ψσ, of Equation 4.1; the exponential link probability function, ψe, of Equation 4.2; and the probit link probability function, ψΦ, of Equation 4.3. We compare these models against two alternative approaches.

6 The RTM can be extended to accommodate directed connections. Here we modeled undirected links.

Figure 4.6: Average held-out predictive link rank (left) and word rank (right) as a function of the number of topics, for the Cora, WebKB, and PNAS corpora. Lower is better. The models compared are the RTM with ψσ, ψe, and ψΦ, LDA + Regression, and Pairwise Link-LDA. For all three corpora, RTMs outperform baseline unigram, LDA, and Pairwise Link-LDA (Nallapati et al. 2008).
The first (“Pairwise Link-LDA”) is the model proposed by Nallapati et al. (2008),
which is an extension of the mixed membership stochastic block model (Airoldi et al.
2008) to model network structure and node attributes. This model posits that each link
is generated as a function of two individual topics, drawn from the topic proportions
vectors associated with the endpoints of the link. Because latent topics for words and
links are drawn independently in this model, it cannot ensure that the discovered topics
are representative of both words and links simultaneously. Additionally, this model
introduces additional variational parameters for every link which adds computational
complexity.
The second (“LDA + Regression”) first fits an LDA model to the documents and
then fits a logistic regression model to the observed links, with input given by the
Hadamard product of the latent class distributions of each pair of documents. Rather
than performing dimensionality reduction and regression simultaneously, this method
performs unsupervised dimensionality reduction first, and then regresses to understand
the relationship between the latent space and underlying link structure. All models
were fit such that the total mass of the Dirichlet hyperparameter α was 1.0. (While
we omit a full sensitivity study here, we observed that the performance of the models
was similar for α within a factor of 2 above and below the value we chose.)
We measured the performance of these models on link prediction and word pre-
diction (see Section 4.2.3). We divided the Cora, WebKB and PNAS data sets each
into five folds. For each fold and for each model, we ask two predictive queries: given
the words of a new document, how probable are its links; and given the links of a
new document, how probable are its words? Again, the predictive queries are for
completely new test documents that are not observed in training. During training the
test documents are removed along with their attendant links. We show the results
for both tasks in terms of predictive rank as a function of the number of topics in
Figure 4.6. (See Section 4.4 for a discussion on potential approaches for selecting the
number of topics and the Dirichlet hyperparameter α.) Here we follow the convention
that lower predictive rank is better.
In predicting links, the three variants of the RTM perform better than all of the
alternative models for all of the data sets (see Figure 4.6, left column). Cora is
paradigmatic, showing a nearly 40% improvement in predictive rank over baseline and
25% improvement over LDA + Regression. The performance for the RTM on this
task is similar for all three link probability functions. We emphasize that the links are
predicted to documents seen in the training set from documents which were held out.
By incorporating link and node information in a joint fashion, the model is able to
generalize to new documents for which no link information was previously known.
Note that the performance of the RTM on link prediction generally increases as the
number of topics is increased (there is a slight decrease on WebKB). In contrast, the
performance of the Pairwise Link-LDA worsens as the number of topics is increased.
This is most evident on Cora, where Pairwise Link-LDA is competitive with RTM
at five topics, but the predictive link rank monotonically increases after that despite
its increased dimensionality (and commensurate increase in computational difficulty).
We hypothesize that Pairwise Link-LDA exhibits this behavior because it uses some
topics to explain the words observed in the training set, and other topics to explain
the links observed in the training set. This problem is exacerbated as the number of
topics is increased, making it less effective at predicting links from word observations.
In predicting words, the three variants of the RTM again outperform all of the
alternative models (see Figure 4.6, right column). This is because the RTM uses
link information to influence the predictive distribution of words. In contrast, the
predictions of LDA + Regression and Pairwise Link-LDA barely use link information;
thus they give predictions independent of the number of topics similar to those made
by a simple unigram model.
4.3.2 Automatic link suggestion
Table 4.2: Top link predictions made by RTM (ψe) and LDA + Regression for two documents (italicized) from Cora. The models were fit with 10 topics. Boldfaced titles indicate actual documents cited by or citing each document. Over the whole corpus, RTM improves precision over LDA + Regression by 80% when evaluated on the first 20 documents retrieved.

Markov chain Monte Carlo convergence diagnostics: A comparative review

RTM (ψe):
Minorization conditions and convergence rates for Markov chain Monte Carlo
Rates of convergence of the Hastings and Metropolis algorithms
Possible biases induced by MCMC convergence diagnostics
Bounding convergence time of the Gibbs sampler in Bayesian image restoration
Self regenerative Markov chain Monte Carlo
Auxiliary variable methods for Markov chain Monte Carlo with applications
Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
Diagnosing convergence of Markov chain Monte Carlo algorithms

LDA + Regression:
Exact Bound for the Convergence of Metropolis Chains
Self regenerative Markov chain Monte Carlo
Minorization conditions and convergence rates for Markov chain Monte Carlo
Gibbs-markov models
Auxiliary variable methods for Markov chain Monte Carlo with applications
Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
Mediating instrumental variables
A qualitative framework for probabilistic inference
Adaptation for Self Regenerative MCMC

Competitive environments evolve better solutions for complex tasks

RTM (ψe):
Coevolving High Level Representations
A Survey of Evolutionary Strategies
Genetic Algorithms in Search, Optimization and Machine Learning
Strongly typed genetic programming in evolving cooperation strategies
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
Evolutionary Module Acquisition
An Empirical Investigation of Multi-Parent Recombination Operators. . .

LDA + Regression:
A New Algorithm for DNA Sequence Assembly
Identification of protein coding regions in genomic DNA
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
A genetic algorithm for passive management
The Performance of a Genetic Algorithm on a Chaotic Objective Function
Adaptive global optimization with local search
Mutation rates as adaptations
A natural real-world application of link prediction is to suggest links to a user
based on the text of a document. One might suggest citations for an abstract or
friends for a user in a social network.
As a complement to the quantitative evaluation of link prediction given in the
previous section, Table 4.2 illustrates suggested citations using RTM (ψe) and LDA +
Regression as predictive models. These suggestions were computed from a model fit on
one of the folds of the Cora data using 10 topics. (Results are qualitatively similar for
models fit using different numbers of topics; see Section 4.4 for strategies for choosing
the number of topics.) The top results illustrate suggested links for “Markov chain
Monte Carlo convergence diagnostics: A comparative review,” which occurs in this
fold’s training set. The bottom results illustrate suggested links for “Competitive
environments evolve better solutions for complex tasks,” which is in the test set.
RTM outperforms LDA + Regression in being able to identify more true connections.
For the first document, RTM finds 3 of the connected documents versus 1 for LDA +
Regression. For the second document, RTM finds 3 while LDA + Regression does not
find any. This qualitative behavior is borne out quantitatively over the entire corpus.
Considering the precision of the first 20 documents retrieved by the models, RTM
improves precision over LDA + Regression by 80%. (Twenty is a reasonable number
of documents for a user to examine.)
While both models found several connections which were not observed in the data,
those found by the RTM are qualitatively different. In the first document, both sets
of suggested links are about Markov chain Monte Carlo. However, the RTM finds
more documents relating specifically to convergence and stationary behavior of Monte
Carlo methods. LDA + Regression finds connections to documents in the milieu
of MCMC, but many are only indirectly related to the input document. The RTM
is able to capture that the notion of “convergence” is an important predictor for
citations, and has adjusted the topic distribution and predictors correspondingly. For
the second document, the documents found by the RTM are also of a different nature
than those found by LDA + Regression. All of the documents suggested by RTM
relate to genetic algorithms. LDA + Regression, however, suggests some documents
which are about genomics. By relying only on words, LDA + Regression conflates
two “genetic” topics which are similar in vocabulary but different in citation structure.
In contrast, the RTM partitions the latent space differently, recognizing that papers
about DNA sequencing are unlikely to cite papers about genetic algorithms, and vice
versa. By better modeling the properties of the network jointly with the content of the
documents, the model is able to better tease apart the community structure.
4.3.3 Modeling spatial data
While explicitly linked structures like citation networks offer one sort of connectivity,
data with spatial or temporal information offer another sort of connectivity. In this
section, we show how RTMs can be used to model spatially connected data by applying
them to the LocalNews data set, a corpus of news headlines and summaries from each
state, with document linkage determined by spatial adjacency.
Figure 4.7 shows the per-state topic distributions inferred by RTM (left) and LDA
(right). Both models were fit with five topics using the same initialization. (We restrict
the discussion here to five topics for expositional convenience. See Section 4.4 for a
discussion on potential approaches for selecting the number of topics.) While topics
are strictly speaking exchangeable and therefore not comparable between models,
using the same initialization typically yields topics which are amenable to comparison.
Each row of Figure 4.7 shows a single component of each state’s topic proportion for
RTM and LDA. That is, if θs is the latent topic proportions vector for state s, then θs1
governs the intensity of that state’s color in the first row, θs2 the second, and so on.
While both RTM and LDA model the words in each state’s local news corpus,
LDA ignores geographical information. Hence, it finds topics which are distributed
over a wide swath of states which are often not contiguous. For example, LDA’s topic
1 is strongly expressed by Maine and Illinois, along with Texas and other states in
the South and West. In contrast, RTM only assigns non-trivial mass to topic 1 in a few Southern states. Similarly, LDA finds that topic 5 is expressed by several states in the Northeast and the West. The RTM, however, concentrates topic 4’s mass on the Northeastern states.

Figure 4.7: A comparison between RTM (left) and LDA (right) of topic distributions on local news data. Each color/row depicts a single topic. Each state’s color intensity indicates the magnitude of that topic’s component. The corresponding words associated with each topic are given in Table 4.3. Whereas LDA finds geographically diffuse topics, RTM, by modeling spatial connectivity, finds coherent regions.
Table 4.3: The top eight words in each RTM (left) and LDA (right) topic shown in Figure 4.7, ranked by score (defined below). RTM finds words which are predictive of both a state’s geography and its local news.

RTM topics:
Topic 1: comments, dead, scores, landfill, plane, metro, courthouse, evidence
Topic 2: crash, yesterday, registration, county, police, children, quarter, campaign
Topic 3: measure, marriage, suspect, officer, guards, protesters, appeals, finger
Topic 4: bridge, area, veterans, winter, city, snow, deer, concert
Topic 5: manslaughter, route, girls, state, knife, grounds, committee, developer

LDA topics:
Topic 1: election, plane, landfill, dead, police, union, interests, veterans
Topic 2: crash, police, yesterday, judge, fire, leave, charges, investors
Topic 3: comments, marriage, register, scores, schools, comment, registration, rights
Topic 4: snow, city, veterans, votes, winter, bridge, recount, lion
Topic 5: garage, girls, video, dealers, underage, housing, mall, union
The RTM does so by finding different topic assignments for each state, and
commensurately, different distributions over words for each topic. Table 4.3 shows the
top words in each RTM topic and each LDA topic. Words are ranked by the following
score,
scorek,w ≡ βk,w (log βk,w − (1/K) ∑k′ log βk′,w).
The score finds words which are likely to appear in a topic, but also corrects for
frequent words. The score therefore puts greater weight on words which more easily
characterize a topic. Table 4.3 shows that RTM finds words more geographically
indicative. While LDA provides one way of analyzing this collection of documents,
the RTM enables a different approach which is geographically cognizant. For example,
LDA’s topic 3 is an assortment of themes associated with California (e.g., ‘marriage’)
as well as others (‘scores’, ‘registration’, ‘schools’). The RTM on the other hand,
discovers words thematically related to a single news item (‘measure’, ‘protesters’,
‘appeals’) local to California. The RTM typically finds groups of words associated
with specific news stories, since they are easily localized, while LDA finds words which
cut broadly across news stories in many states. Thus on topic 5, the RTM discovers
key words associated with news stories local to the Northeast such as ‘manslaughter’
and ‘developer.’ On topic 5, the RTM also discovers a peculiarity of the Northeastern
dialect: that roads are given the appellation ‘route’ more frequently than elsewhere in
the country.
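The score is straightforward to compute from the fitted topic matrix; a minimal numpy sketch (assuming β has been smoothed so all entries are strictly positive, and vocab is an illustrative list of terms):

    import numpy as np

    def topic_word_scores(beta):
        # score[k, w] = beta[k, w] * (log beta[k, w] - mean over k' of log beta[k', w])
        log_beta = np.log(beta)
        return beta * (log_beta - log_beta.mean(axis=0, keepdims=True))

    def top_words(beta, vocab, k, n=8):
        # Highest-scoring words for topic k.
        scores = topic_word_scores(beta)
        return [vocab[w] for w in np.argsort(-scores[k])[:n]]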
By combining textual information along with geographical information, the RTM
provides a novel exploratory tool for identifying clusters of words that are driven by
both word co-occurrence and geographic proximity. Note that the RTM finds regions
in the United States which correspond to typical clusterings of states: the South, the
Northeast, the Midwest, etc. Further, the soft clusterings found by RTM confirm
many of our cultural intuitions—while New York is definitively a Northeastern state,
Virginia occupies a liminal space between the Midatlantic and the South.
4.3.4 Modeling social networks
We now show how the RTM can be used to qualitatively understand the structure of
social networks. In this section we apply the RTM to four data sets — people from
the Bible, people from New York Times articles, and two data sets crawled from the
online social networking site Twitter.7
Bible The data set contains 523 entities which appear in the Bible. For each entity
we extract all of the verses in which those entities appear; we take this collection
of verses to be the “document” for that entity. For links we take all entities which
co-occur in the same verse yielding 475 links. Figure 4.8 shows a visualization of
the results. Each node represents an individual; nodes are colored according to the
topic most associated with that individual. The node near the center (which although
colored brown is on the border of several other clusters) is associated with Jesus.
Another notable figure in that cluster is David, who is connected to many others of
that line. A node with high connectivity in a different cluster is Israel. Because Israel
may refer to both the place and as an alternate name for Jacob, it is possible that
some of these edges are spurious and the result of improper disambiguation. However
the results are suggestive, with the RTM clustering Israel along with figures such as
Joseph and Benjamin. As an avenue of future work, the RTM might be used to help
disambiguate these entities.
New York Times The data set contains 944 entities tagged in New York Times
articles. We use the collection of articles (out of a set of approximately one million
articles) in which those entities appear as that entity’s “document”. We consider two
entities connected if they are co-tagged in an article. Figure 4.9 shows the result of
fitting the RTM to these data. The RTM finds distinct clusters corresponding to
distinct areas in which people are notable; these clusters also often have strong internal
ties. For example, in the top center the green cluster contains sports personalities. Michael Jordan and Derek Jeter are a few prominent, highly-connected figures in this cluster. The yellow cluster, which also has strong internal connections, represents international leaders such as George W. Bush (lower left), Ronald Reagan (lower right), and George H. W. Bush (upper right). Note that many of these are conservatives. Beside this cluster is another, orange cluster of politicians. This cluster leans more liberal, with figures such as Bill Clinton and Michael Dukakis. Notably, several Republicans are also in this cluster, such as Michael Bloomberg. The remaining clusters found by RTM capture other groups of related individuals, such as artists and businesspeople.

7 http://www.twitter.com

Figure 4.8: The result of fitting the RTM to a collection of entities from the Bible. Nodes represent people and edges indicate that the people co-occur in the same verse. Nodes are colored according to the topic most associated with that individual. Edges between nodes with the same primary topic are colored black while edges between nodes with different primary topics are colored grey.
Twitter Twitter is an online social network where users can regularly post statements
(known as “tweets”). Users can also choose to “follow” other users, that is, receive
their tweets. We take each user’s documents to be the accumulation of their tweets
and we use follower connections as edges between users. Here we present two data
sets.
The first is a series of tweets collected over the period of approximately one week.
The users included in this data set were found by starting a breadth-first crawl from a
distinguished node, leading to 180 users being included. Figure 4.10 shows a force-
directed layout of this data set after RTM has been fit to it. The nodes represent
users; the colors of the nodes indicate the topic most associated with that user. Some
regions of the graph with similar topics have also been highlighted and annotated
with the most frequently occurring words in that topic. For example, one sector of the
graph has people talking about music topics. However these reside on the periphery.
Another sector uses words associated with blogs and social media; this area has a
hub-spoke structure. Finally, another region of the graph is distinguished by frequent
occurrences of the phrase “happy easter” (the crawl period included Easter). This
Figure 4.9: The result of fitting the RTM to a collection of entities from the New York Times. Nodes represent people and edges indicate that the people co-occur in the same article. Nodes are colored according to the topic most associated with that individual. Edges between nodes with the same primary topic are colored black while edges between nodes with different primary topics are colored grey.
Figure 4.10: The result of fitting the RTM to a small collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions dominated by a single topic have been highlighted and annotated with frequently appearing words for that topic.
region is more of a clique, with many users sending individual greetings to one another.
The second Twitter data set we analyze comes from a larger-scale crawl over a
longer period of time. There were 1425 users in this data set. Figure 4.11 shows
a visualization of the RTM applied to this data set. Once again, nodes have been
colored according to primary topic and several of the topical areas have been labeled
with frequently occurring words. This subset of the graph is dominated by a large
connected component in the center focused on online affairs (“blog”, “post”, “online”).
At the periphery are several smaller communities. For example, there is a food-centric
Figure 4.11: The result of fitting the RTM to a larger collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions are annotated with frequently appearing words for that topic.
community in the lower left, and a politics community just above it.8 Because this
is a larger data set, the RTM is able to discover broader, more thematically related
communities than with the smaller data set.
4.4 Discussion
There are many avenues for future work on relational topic models. Applying the
RTM to diverse types of “documents” such as protein-interaction networks, whose
node attributes are governed by rich internal structure, is one direction. Even the
8The frequently appearing term “tcot” is an acronym for Top Conservatives On Twitter.
text documents which we have focused on in this chapter have internal structure such
as syntax (Boyd-Graber and Blei 2008) which we are discarding in the bag-of-words
model. Augmenting and specializing the RTM to these cases may yield better models
for many application domains.
As with any parametric mixed-membership model, the number of latent components
in the RTM must be chosen using either prior knowledge or model-selection techniques
such as cross-validation. Incorporating non-parametric Bayesian priors such as the
Dirichlet process into the model would allow it to flexibly adapt the number of topics
to the data (Ferguson 1973; Antoniak 1974; Kemp et al. 2004; Teh et al. 2007). This,
in turn, may give researchers new insights into the latent membership structure of
networks.
In sum, the RTM is a hierarchical model of networks and per-node attribute data.
The RTM is used to analyze linked corpora such as citation networks, linked web
pages, social networks with user profiles, and geographically tagged news. We have
demonstrated qualitatively and quantitatively that the RTM provides an effective
and useful mechanism for analyzing and using such data. It significantly improves on
previous models, integrating both node-specific information and link structure to give
better predictions.
Chapter 5
Discovering Link Information
In the previous chapters we have focused on modeling existing network data,
encoding collections of relationships between entities such as people, places, genes, or
corporations. However, the network data thus far have been unannotated, that is, edges
express connectivity but not the nature of the connection. And while many resources
for networks of interesting entities are emerging, most of these can only annotate
connections in a limited fashion. Although relationships between entities are rich, it is
impractical to manually devise complete characterizations of these relationships for
every pair of entities on large, real-world corpora.
Below we present a novel probabilistic topic model to analyze text corpora and
infer descriptions of its entities and of relationships between those entities. We
develop variational methods for performing approximate inference on our model and
demonstrate that our model can be practically deployed on large corpora such as
Wikipedia. We show qualitatively and quantitatively that our model can construct
and annotate graphs of relationships and make useful predictions.
Portions of this chapter appear in Chang et al. (2009).
5.1 Background
Network data—data which express relationships between ensembles of entities—are
becoming increasingly pervasive. People are connected to each other through a variety
of kinship, social, and professional relationships; proteins bind to and interact with
other proteins; corporations conduct business with other corporations. Understanding
the nature of these relationships can provide useful mechanisms for suggesting new
relationships between entities, characterizing new relationships, and quantifying global
properties of naturally occurring network structures (Anagnostopoulos et al. 2008;
Cai et al. 2005; Taskar et al. 2003; Wasserman and Pattison 1996; Zhou et al. 2008).
Many corpora of network data have emerged in recent years. Examples of such
data include social networks, such as LinkedIn or Facebook, and citation networks,
such as CiteSeer, Rexa, or JSTOR. Other networks can be constructed manually or
automatically from texts: networks of the people mentioned in the Bible, of the genes in
scientific abstracts, or of the decisions in legal journals. Characterizing the networks of connections between
these entities is of historical, scientific, and practical interest. However, describing
every relationship for large, real-world corpora is infeasible. Thus most data sets label
edges as merely on or off, or with a small set of fixed, predefined connection types.
These labellings cannot capture the complexities underlying the relationships and
limit the applicability of these data sets.
An example of this is shown in Figure 5.1. The figure depicts a social network
where nodes represent entities and edges represent some relationship between the
entities. Some social networks such as Facebook1 have self-reported information about
each edge; for example, two users may be connected by the fact that they attended
the same school (top panel). However, this self-reported information is limited and
sparsely populated. By analyzing unstructured resources, we hope to increase the
number of annotated edges, the number of nodes covered, and the kinds of annotations
1http://www.facebook.com
(bottom panel).
In this chapter we develop a method for augmenting such data sets by analyzing
document collections to uncover the relationships encoded in their texts. Text corpora
are replete with information about relationships, but this information is out of reach
for traditional network analysis techniques. We develop Networks Uncovered By
Bayesian Inference (Nubbi), a probabilistic topic model of text (Blei et al. 2003a;
Hofmann 1999; Steyvers and Griffiths 2007) with hidden variables that represent the
patterns of word use which describe the relationships in the text. Given a collection
of documents, Nubbi reveals the hidden network of relationships that is encoded in
the texts by associating rich descriptions with each entity and its connections. For
example, Figure 5.2 illustrates a subset of the network uncovered from the texts
of Wikipedia. Connections between people are depicted by edges, each of which is
associated with words that describe the relationship.
First, we describe the intuitions and statistical assumptions behind Nubbi. Second,
we derive efficient algorithms for using Nubbi to analyze large document collections.
Finally, we apply Nubbi to the Bible, Wikipedia, and scientific abstracts. We demon-
strate that Nubbi can discover sensible descriptions of the network and can make
predictions competitive with those made by state of the art models.
5.2 Model
The goal of Nubbi is to analyze a corpus to describe the relationships between pairs of
entities. Nubbi takes as input very lightly annotated data, requiring only that entities
within the input text be identified. Nubbi also takes as input the network of entities
to be annotated. For some corpora this network is already explicitly encoded as a
graph. For other text corpora this graph must be constructed. One simple way of
constructing this graph is to use a fully-connected network of entities and then prune
(a) A social network with some extant data about how two entities are related: Jonathan Chang and Jordan Boyd-Graber, linked by "You and Jordan both went to Princeton."
(b) The desiderata: a social network where relationships have been automatically annotated by analyzing free text: Ronald Reagan and Jane Wyman, linked by "You and Jane used to be married."
Figure 5.1: An example motivating this work. The figures depict a social network; nodes represent individuals and edges represent relationships between the individuals. Many social networks have some detailed information about the relationships. It is this data we seek to automatically build and augment.
[Figure 5.2 content: a network of political figures including Joseph Stalin, Winston Churchill, Lyndon B. Johnson, Mao Zedong, Jimmy Carter, Margaret Thatcher, Ronald Reagan, Richard Nixon, Nikita Khrushchev, John F. Kennedy, Hubert Humphrey, George H. W. Bush, Ross Perot, Leon Trotsky, Lev Kamenev, Zhou Enlai, and Mikhail Gorbachev, with edges labeled by relationship-topic words such as "labour govern leader british world," "soviet communist central union full," "soviet russian govern union nuclear," and "republican state federalist vote vice."]
Figure 5.2: A small subgraph of the social network Nubbi learned taking only the raw text of Wikipedia with tagged entities as input. The full model uses 25 relationship and entity topics. An edge exists between two entities if their co-occurrence count is high. For some of the edges, we show the top words from the most probable relationship topic associated with that pair of entities. These are the words that best explain the contexts where these two entities appear together. A complete browser for this data is available at http://topics.cs.princeton.edu/nubbi.
the edges in this graph using statistics such as entity co-occurrence counts.
From the entities in this network, the text is divided into two different classes of
bags of words. First, each entity is associated with an entity context, a bag of words
co-located2 with the entity. Second, each pair of entities is associated with a pair
context, a bag of words co-located with the pair. Figure 5.3 shows an example of the
input to the algorithm turned into entity contexts and pair contexts.
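To make the construction concrete, here is a minimal Python sketch, with placeholder data and hypothetical names, of one way verse-level annotations might be turned into entity contexts, pair contexts, and a co-occurrence-pruned entity graph:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Each verse is a (tokens, entities) pair, assumed to come from upstream
# tokenization and entity annotation; the verses here are placeholders.
verses = [
    (["spoken", "words", "disciples", "brook", "cedron"], ["jesus"]),
    (["betrayed", "knew", "place", "disciples"], ["jesus", "judas"]),
]

entity_contexts = defaultdict(list)  # entity  -> bag of co-located words
pair_contexts = defaultdict(list)    # (e, e') -> bag of co-located words
cooccur = Counter()                  # (e, e') -> co-occurrence count

for tokens, entities in verses:
    if len(entities) == 1:
        entity_contexts[entities[0]].extend(tokens)
    for pair in combinations(sorted(set(entities)), 2):
        pair_contexts[pair].extend(tokens)
        cooccur[pair] += 1

# Prune the fully connected graph, keeping only frequently co-occurring pairs.
MIN_COOCCUR = 1  # threshold is a free choice, not prescribed by the model
edges = [pair for pair, n in cooccur.items() if n >= MIN_COOCCUR]
```

For corpora without verse boundaries, the same bookkeeping applies with fixed-width token windows around each mention, as described in Section 5.4.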
Nubbi learns two descriptions of how entities appear in the corpus: entity topics
and relationship topics. Following Blei et al. (2003a), a topic is defined to be a
distribution over words. To aid intuitions, we will for the moment assume that these
topics are given and have descriptive names. We will describe how the topics and
contexts interplay to reveal the network of relationships hidden in the texts. We
emphasize, however, that the goal of Nubbi is to analyze the texts to learn both the
topics and relationships between entities.
An entity topic is a distribution over words, and each entity is associated with a
distribution over entity topics. For example, suppose there are three entity topics:
politics, movies, and sports. Ronald Reagan would have a distribution that favors
2We use the term "co-located" to refer to words and entities which appear near one another in a text. The definition of "near" depends on the corpus; some practical choices are given in Section 5.4.
[Figure 5.3 content: the verses of John 18:1-7 are shown twice with entity mentions marked. On the left, verses mentioning a single entity contribute their words to that entity's context (e.g., for Jesus: "spoken words disciples brook Cedron garden enter disciples knowing things seek asked seek Nazareth"; for Judas: "received band officers chief priests Pharisees lanterns torches weapons"). On the right, verses mentioning both entities contribute their words to the pair context for Jesus and Judas ("betrayed knew place disciples answered Nazareth saith betrayed").]
Figure 5.3: A high-level overview of Nubbi's view of text data. A corpus with identified entities is turned into a collection of bags-of-words (in rectangles), each associated with individual entities (left) or pairs of entities (right). The procedure in the left panel is repeated for every entity in the text while the procedure in the right panel is repeated for every pair of entities.
politics and movies; athlete-actors like Johnny Weissmuller and Geena Davis would
have distributions that favor movies and sports, and specialized athletes, like Pele,
would have distributions that favor sports more than other entity topics. Nubbi uses
entity topics to model entity contexts. Because the sports entity topic would contain
words like “cup,” “win,” and “goal,” associating Pele exclusively with the sports
entity topic would be consistent with the words observed in his context.
Relationship topics are distributions over words associated with pairs of entities,
rather than individual entities, and each pair of entities is associated with a distribution
over relationship topics. Just as the entity topics cluster similar people together (e.g.,
Ronald Reagan, George Bush, and Bill Clinton all express the politics topic), the
relationship topics can cluster similar pairs of people. Thus, Romeo and Juliet,
Abelard and Heloise, Ruslan and Ludmilla, and Izanami and Izanagi might all share a
lovers relationship topic.
Relationship topics are used to explain pair contexts. Each word in a pair context
is assumed to express something about either one of the participating entities or
something particular to their relationship. For example, consider Jane Wyman and
Ronald Reagan. (Jane Wyman, an actress, was actor/president Ronald Reagan’s first
wife.) Individually, Wyman is associated with the movies entity topic and Reagan
is associated with the movies and politics entity topics. In addition, this pair of
entities is associated with relationship topics for divorce and costars.
Nubbi hypothesizes that each word describes either one of the entities or their
relationship. Consider the pair context for Reagan and Wyman:
In 1938, Wyman co-starred with Ronald Reagan. Reagan and actress Jane
Wyman were engaged at the Chicago Theater and married in Glendale, Califor-
nia. Following arguments about Reagan’s political ambitions, Wyman filed for
divorce in 1948. Since Reagan is the only U.S. president to have been divorced,
Wyman is the only ex-wife of an American President.
We have marked the words that are not associated with the relationship topic. Func-
tional words are gray; words that come from a politics topic (associated with Ronald
Reagan) are underlined; and words that come from a movies topic (associated with
Jane Wyman) are italicized.
The remaining words, “1938,” “co-starred,” “engaged,” “Glendale,” “filed,” “di-
vorce,” “1948,” “divorced,” and “ex-wife,” describe the relationship between Reagan
and Wyman. Indeed, it is by deducing which case each word falls into that Nubbi is
able to capture the relationships between entities. Examining the relationship topics
associated with each pair of entities provides a description of that relationship.
The above discussion gives an intuitive picture of how Nubbi explains the observed
entity and pair contexts using entity and relationship topics. In data analysis, however,
we do not observe the entity topics, pair topics, or the assignments of words to topics.
Our goal is to discover them.
To do this, we formalize these notions in a generative probabilistic model of the
texts that uses hidden random variables to encode the hidden structure described
above. In posterior inference, we “reverse” the process to discover the latent structure
that best explains the documents. (Posterior inference is described in the next section.)
More formally, Nubbi assumes the following statistical model.

1. For each entity topic $j$ and relationship topic $k$,
   (a) Draw topic multinomials $\beta^\theta_j \sim \mathrm{Dir}(\eta^\theta + 1)$, $\beta^\psi_k \sim \mathrm{Dir}(\eta^\psi + 1)$.
2. For each entity $e$,
   (a) Draw entity topic proportions $\theta_e \sim \mathrm{Dir}(\alpha^\theta)$;
   (b) For each word associated with this entity's context,
       i. Draw topic assignment $z_{e,n} \sim \mathrm{Mult}(\theta_e)$;
       ii. Draw word $w_{e,n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,n}})$.
3. For each pair of entities $e, e'$,
   (a) Draw relationship topic proportions $\psi_{e,e'} \sim \mathrm{Dir}(\alpha^\psi)$;
   (b) Draw selector proportions $\pi_{e,e'} \sim \mathrm{Dir}(\alpha^\pi)$;
   (c) For each word associated with this entity pair's context,
       i. Draw selector $c_{e,e',n} \sim \mathrm{Mult}(\pi_{e,e'})$;
       ii. If $c_{e,e',n} = 1$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_e)$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,e',n}})$.
       iii. If $c_{e,e',n} = 2$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_{e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\theta_{z_{e,e',n}})$.
       iv. If $c_{e,e',n} = 3$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\psi_{e,e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta^\psi_{z_{e,e',n}})$.
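The following numpy sketch mirrors this generative process; the vocabulary size, topic counts, hyperparameter values, and function names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K_theta, K_psi = 1000, 5, 5                   # vocabulary and topic counts
a_theta, a_psi, a_pi, eta = 1.0, 1.0, 1.0, 0.01  # symmetric hyperparameters

# Step 1: per-topic term distributions for entity and relationship topics.
beta_theta = rng.dirichlet(np.full(V, eta + 1.0), size=K_theta)
beta_psi = rng.dirichlet(np.full(V, eta + 1.0), size=K_psi)

def draw_entity_context(n_words):
    # Step 2: entity topic proportions, then a topic and a word per position.
    theta = rng.dirichlet(np.full(K_theta, a_theta))
    z = rng.choice(K_theta, size=n_words, p=theta)
    words = np.array([rng.choice(V, p=beta_theta[k]) for k in z])
    return theta, words

def draw_pair_context(theta_e, theta_e2, n_words):
    # Step 3: relationship and selector proportions, then one word at a time.
    psi = rng.dirichlet(np.full(K_psi, a_psi))
    pi = rng.dirichlet(np.full(3, a_pi))
    words = []
    for _ in range(n_words):
        c = rng.choice(3, p=pi)              # which source explains the word
        if c == 2:                           # the relationship itself
            k = rng.choice(K_psi, p=psi)
            words.append(rng.choice(V, p=beta_psi[k]))
        else:                                # one of the two entities
            theta = theta_e if c == 0 else theta_e2
            k = rng.choice(K_theta, p=theta)
            words.append(rng.choice(V, p=beta_theta[k]))
    return np.array(words)

theta_a, _ = draw_entity_context(50)
theta_b, _ = draw_entity_context(50)
pair_words = draw_pair_context(theta_a, theta_b, 30)
```

Note how the selector routes each pair-context word to one of three distributions; it is exactly this routing that posterior inference must reverse.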
This is depicted in a graphical model in Figure 5.4.
Figure 5.4: A depiction of the Nubbi model using the graphical model formalism. Nodes are random variables; edges denote dependence; plates (i.e., rectangles) denote replication; shaded nodes are observed and unshaded nodes are hidden. The left half of the figure shows entity contexts, while the right half shows pair contexts. In its entirety, the model generates both the entity contexts and the pair contexts shown in Figure 5.3.
The hyperparameters of the Nubbi model are Dirichlet parameters $\alpha^\theta$, $\alpha^\psi$, and $\alpha^\pi$, which govern the entity topic distributions, the relationship distributions, and the entity/pair mixing proportions. The Dirichlet parameters $\eta^\theta$ and $\eta^\psi$ are priors for each topic's multinomial distribution over terms. There are $K_\theta$ per-topic term distributions for entity topics, $\beta^\theta_{1:K_\theta}$, and $K_\psi$ per-topic term distributions, $\beta^\psi_{1:K_\psi}$, for relationship topics.
The words of each entity context are essentially drawn from an LDA model using the entity topics. The words of each pair context are drawn in a more sophisticated way. The topic assignments for the words in the pair context for entity $e$ and entity $e'$ are hypothesized to come from the entity topic proportions $\theta_e$, the entity topic proportions $\theta_{e'}$, or the relationship topic proportions $\psi_{e,e'}$. The switching variable $c_{e,e',n}$ selects which of these three assignments is used for each word. This selector $c_{e,e',n}$ is drawn from $\pi_{e,e'}$, which describes the tendency of words associated with this pair of entities to be ascribed to either of the entities or the pair.
It is $\psi_{e,e'}$ that describes what the relationship between entities $e$ and $e'$ is. By allowing some of each pair's context words to come from a relationship topic distribution, the model is able to characterize each pair's interaction in terms of the latent
relationship topics.
5.3 Computation with NUBBI
With the model formally defined in terms of hidden and observed random variables,
we now turn to deriving the algorithms needed to analyze data. Data analysis involves
inferring the hidden structure from observed data and making predictions on future
data. In this section, we develop a variational inference procedure for approximating
the posterior. We then use this procedure to develop a variational expectation-
maximization (EM) algorithm for parameter estimation and for approximating the
various predictive distributions of interest.
5.3.1 Inference
In posterior inference, we approximate the posterior distribution of the latent variables
conditioned on the observations. As for LDA, exact posterior inference for Nubbi is
intractable (Blei et al. 2003a). We appeal to variational methods.
Variational methods posit a family of distributions over the latent variables indexed
by free variational parameters. Those parameters are then fit to be close to the true
posterior, where closeness is measured by relative entropy. See Jordan et al. (1999)
for a review. We use the factorized family
$$
\begin{aligned}
q(\Theta, \mathbf{Z}, \mathbf{C}, \Pi, \Psi \mid \boldsymbol{\gamma}^\theta, \boldsymbol{\gamma}^\psi, \Phi^\theta, \Phi^\psi, \boldsymbol{\gamma}^\pi, \Xi) ={}&
\prod_e \Big[ q(\theta_e \mid \gamma^\theta_e) \prod_n q(z_{e,n} \mid \phi^\theta_{e,n}) \Big] \\
&\cdot \prod_{e,e'} q(\psi_{e,e'} \mid \gamma^\psi_{e,e'})\, q(\pi_{e,e'} \mid \gamma^\pi_{e,e'}) \\
&\cdot \prod_{e,e'} \Big[ \prod_n q(z_{e,e',n}, c_{e,e',n} \mid \phi^\psi_{e,e',n}, \xi_{e,e',n}) \Big],
\end{aligned}
$$
where $\boldsymbol{\gamma}^\theta$ is a set of Dirichlet parameters, one for each entity; $\boldsymbol{\gamma}^\pi$ and $\boldsymbol{\gamma}^\psi$ are sets of Dirichlet parameters, one for each pair of entities; $\Phi^\theta$ is a set of multinomial parameters, one for each word in each entity; $\Xi$ is a set of multinomial parameters, one for each pair of entities; and $\Phi^\psi$ is a set of matrices, one for each word in each entity pair. Each $\phi^\psi_{e,e',n}$ contains three rows — one which defines a multinomial over topics given that the word comes from $\theta_e$, one which defines a multinomial given that the word comes from $\theta_{e'}$, and one which defines a multinomial given that the word comes from $\psi_{e,e'}$. Note that the variational family we use is not the fully-factorized family; this family fully captures the joint distribution of $z_{e,e',n}$ and $c_{e,e',n}$. We parameterize this pair by $\phi^\psi_{e,e',n}$ and $\xi_{e,e',n}$, which define a multinomial distribution over all $3K$ possible values of this pair of variables.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),
$$
\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e + \mathrm{H}(q), \tag{5.1}
$$
where sums over $e, e'$ iterate over all pairs of entities,
$$
\begin{aligned}
\mathcal{L}_{e,e'} ={}& \sum_n \mathbb{E}_q\left[\log p(w_{e,e',n} \mid \beta^\psi_{1:K}, \beta^\theta_{1:K}, z_{e,e',n}, c_{e,e',n})\right] \\
&+ \sum_n \mathbb{E}_q\left[\log p(z_{e,e',n} \mid c_{e,e',n}, \theta_e, \theta_{e'}, \psi_{e,e'})\right] \\
&+ \sum_n \mathbb{E}_q\left[\log p(c_{e,e',n} \mid \pi_{e,e'})\right] \\
&+ \mathbb{E}_q\left[\log p(\psi_{e,e'} \mid \alpha^\psi)\right] + \mathbb{E}_q\left[\log p(\pi_{e,e'} \mid \alpha^\pi)\right],
\end{aligned}
$$
and
$$
\mathcal{L}_e = \sum_n \mathbb{E}_q\left[\log p(w_{e,n} \mid \beta^\theta_{1:K}, z_{e,n})\right] + \mathbb{E}_q\left[\log p(\theta_e \mid \alpha^\theta)\right] + \sum_n \mathbb{E}_q\left[\log p(z_{e,n} \mid \theta_e)\right].
$$
The $\mathcal{L}_{e,e'}$ term of the ELBO differentiates this model from previous models (Blei et al. 2003a). The connections between entities affect the objective in posterior inference (and, below, in parameter estimation).
Our aim now is to compute each term of the objective function given in Equation 5.1. After expanding this expression in terms of the variational parameters, we can derive a set of coordinate ascent updates to optimize the ELBO with respect to the variational parameters, $\boldsymbol{\gamma}^\theta, \boldsymbol{\gamma}^\psi, \Phi^\theta, \Phi^\psi, \boldsymbol{\gamma}^\pi, \Xi$. Refer to Appendix C for a full derivation of the following updates.
The updates for $\phi^\theta_{e,n}$ assign topic proportions to each word associated with an individual entity,
$$
\phi^\theta_{e,n} \propto \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_e) \right), \tag{5.2}
$$
where $\log \beta^\theta_{w_n}$ represents the logarithm of column $w_n$ of $\beta^\theta$ and $\Psi(\cdot)$ is the digamma function. (A digamma of a vector is the vector of digammas.) The topic assignments for each word associated with a pair of entities are similar,
$$
\begin{aligned}
\phi^\psi_{e,e',n,1} &= \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T \gamma^\theta_e) - \lambda_{e,e',n,1} \right) & (5.3) \\
\phi^\psi_{e,e',n,2} &= \exp\left( \log \beta^\theta_{w_n} + \Psi(\gamma^\theta_{e'}) - \Psi(\mathbf{1}^T \gamma^\theta_{e'}) - \lambda_{e,e',n,2} \right) & (5.4) \\
\phi^\psi_{e,e',n,3} &= \exp\left( \log \beta^\psi_{w_n} + \Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T \gamma^\psi_{e,e'}) - \lambda_{e,e',n,3} \right), & (5.5)
\end{aligned}
$$
where $\lambda_{e,e',n}$ is a vector of normalizing constants. These normalizing constants are then used to estimate the probability that each word associated with a pair of entities is assigned to either an individual or relationship,
$$
\xi_{e,e',n} \propto \exp\left( \lambda_{e,e',n} + \Psi(\gamma^\pi_{e,e'}) \right). \tag{5.6}
$$
The topic and entity assignments are then used to estimate the variational Dirichlet parameters which parameterize the latent topic and entity proportions,
$$
\begin{aligned}
\gamma^\pi_{e,e'} &= \alpha^\pi + \sum_n \xi_{e,e',n} & (5.7) \\
\gamma^\psi_{e,e'} &= \alpha^\psi + \sum_n \xi_{e,e',n,3}\, \phi^\psi_{e,e',n,3}. & (5.8)
\end{aligned}
$$
Finally, the topic and entity assignments for each pair of entities, along with the topic assignments for each individual entity, are used to update the variational Dirichlet parameters which govern the latent topic assignments for each individual entity. These updates allow us to combine evidence associated with individual entities and evidence associated with entity pairs:
$$
\gamma^\theta_e = \alpha^\theta + \sum_n \phi^\theta_{e,n} + \sum_{e'} \sum_n \left( \xi_{e,e',n,1}\, \phi^\psi_{e,e',n,1} + \xi_{e',e,n,2}\, \phi^\psi_{e',e,n,2} \right). \tag{5.9, 5.10}
$$
5.3.2 Parameter estimation
We fit the model by finding maximum likelihood estimates for each of the parameters:
$\pi_{e,e'}$, $\beta^\theta_{1:K}$, and $\beta^\psi_{1:K}$. Once again, this is intractable so we turn to an approximation.
We employ variational expectation-maximization, where we iterate between optimizing
the ELBO of Equation 5.1 with respect to the variational distribution and with respect
to the model parameters.
Optimizing with respect to the variational distribution is described in Section 5.3.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood estimation with expected sufficient statistics, where the expectation is taken with respect to the variational distribution. The sufficient statistics for the topic vectors $\beta^\theta$ and $\beta^\psi$ consist of all topic-word pairs in the corpus, along with their entity or relationship assignments. Collecting these statistics leads to the following updates,
$$
\begin{aligned}
\beta^\theta_w \propto{}& \eta^\theta + \sum_e \sum_n \mathbb{1}(w_{e,n} = w)\, \phi^\theta_{e,n} \\
&+ \sum_{e,e'} \sum_n \mathbb{1}(w_{e,e',n} = w)\, \xi_{e,e',n,1}\, \phi^\psi_{e,e',n,1} \\
&+ \sum_{e,e'} \sum_n \mathbb{1}(w_{e',e,n} = w)\, \xi_{e',e,n,2}\, \phi^\psi_{e',e,n,2} & (5.11\text{–}5.13) \\
\beta^\psi_w \propto{}& \eta^\psi + \sum_{e,e'} \sum_n \mathbb{1}(w_{e,e',n} = w)\, \xi_{e,e',n,3}\, \phi^\psi_{e,e',n,3}. & (5.14)
\end{aligned}
$$
The sufficient statistics for $\pi_{e,e'}$ are the number of words ascribed to the first entity, the second entity, and the relationship topic. This results in the update
$$
\pi_{e,e'} \propto \exp\left( \Psi\left( \alpha^\pi + \textstyle\sum_n \xi_{e,e',n} \right) \right).
$$
5.3.3 Prediction
With a fitted model, we can make judgments about how well the model describes the
joint distribution of words associated with previously unseen data. In this section we
describe two prediction tasks that we use to compare Nubbi to other models: word
prediction and entity prediction.
In word prediction, the model predicts an unseen word associated with an entity pair given the other words associated with that pair, $p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i})$. This quantity cannot be computed tractably. We instead turn to a variational approximation of this posterior,
$$
p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i}) \approx \mathbb{E}_q\left[ p(w_{e,e',i} \mid z_{e,e',i}) \right].
$$
Here we have replaced the expectation over the true posterior probability $p(z_{e,e',i} \mid \mathbf{w}_{e,e',-i})$ with the variational distribution $q(z_{e,e',i})$, whose parameters are trained by maximizing the evidence bound given $\mathbf{w}_{e,e',-i}$.
In entity prediction, the model must predict which entity pair a set of words is most likely to appear in. By Bayes' rule, the posterior probability of an entity pair given a set of words is proportional to the probability of the set of words belonging to that entity pair,
$$
p((e, e') \mid \mathbf{w}) \propto p(\mathbf{w} \mid e, e'),
$$
where the proportionality constant is chosen such that the sum of this probability over all entity pairs is equal to one.
After a qualitative examination of the topics learned from corpora, we use these
two prediction methods to compare Nubbi against other models that offer probabilistic
frameworks for associating entities with text in Section 5.4.2.
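As a concrete illustration of the entity-prediction rule, the hypothetical helper below normalizes per-pair held-out log likelihoods (assumed to have been computed from the fitted variational posterior as above) into a posterior over candidate pairs, implicitly using a uniform prior:

```python
import numpy as np

def predict_entity_pair(pair_logliks):
    """Entity prediction: normalize per-pair held-out log likelihoods
    log p(w | e, e') into a posterior over candidate pairs (flat prior)."""
    pairs = list(pair_logliks)
    scores = np.array([pair_logliks[p] for p in pairs])
    post = np.exp(scores - scores.max())   # stable exponentiation
    post /= post.sum()
    return pairs[int(np.argmax(post))], dict(zip(pairs, post))
```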
5.4 Experiments
In this section, we describe a qualitative and quantitative study of Nubbi on three
data sets: the bible (characters in the bible), biological (genes, diseases, and proteins
in scientific abstracts), and wikipedia. For these three corpora, the entities of interest
are already annotated. Experts have marked all mentions of people in the Bible (Nave
2003) and biological entities in corpora of scientific abstracts (Ohta et al. 2002; Tanabe
et al. 2005), and Wikipedia’s link structure offers disambiguated mentions. Note that
             Topic 1           Topic 2
Entities     Jesus, Mary       Abraham, Chedorlaomer
             Terah, Abraham    Ahaz, Rezin
Top Terms    father            king
             begat             city
             james             smote
             daughter          lord
             mother            thousand

Table 5.1: Examples of relationship topics learned by a five topic Nubbi model trained on the Bible. The upper part of the table shows some of the entity pairs highly associated with that topic. The lower part of the table shows the top terms in that topic's multinomial.
it is also possible to use named entity recognizers to preprocess data for which entities
are not previously identified.
The first step in our analysis is to determine the entity and pair contexts. For
bible, verses offer an atomic context; any term in a verse with an entity (pair) is
associated with that entity (pair). For biological, we use tokens within a fixed distance
from mentions of an entity (pair) to build the data used by our model. For wikipedia,
we used the same approach as biological for associating words with entity pairs. For
individual entities, however, we associated all the terms in that person's Wikipedia entry.
For all corpora we removed tokens based on a stop list and stemmed all tokens using
the Porter stemmer. Infrequent tokens, entities, and pairs were pruned from the
corpora.3
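A minimal sketch of this preprocessing, assuming NLTK's English stop list and its Porter stemmer implementation (the exact stop list and pruning threshold used are not specified here and are illustrative):

```python
from collections import Counter
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem.porter import PorterStemmer

def preprocess(contexts, min_count=5):
    """Stop-word removal, Porter stemming, and pruning of infrequent tokens.

    contexts maps an entity (or entity pair) to its bag of tokens."""
    stop = set(stopwords.words("english"))
    stem = PorterStemmer().stem
    stemmed = {key: [stem(t) for t in toks if t.lower() not in stop]
               for key, toks in contexts.items()}
    counts = Counter(t for toks in stemmed.values() for t in toks)
    return {key: [t for t in toks if counts[t] >= min_count]
            for key, toks in stemmed.items()}
```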
5.4.1 Learning networks
We first demonstrate that the Nubbi model produces interpretable entity topics that
describe entity contexts and relationship topics that describe pair contexts. We also
show that by combining Nubbi’s model of language with a network automatically
estimated through co-occurrence counts, we can construct rich social networks with
labeled relationships.
3After preprocessing, the bible dataset contains a lexicon of size 2411, 523 entities, and 475 entity pairs. The biological dataset contains a lexicon of size 2425, 1566 entities, and 577 entity pairs. The wikipedia dataset contains a lexicon of size 9144, 1918 entities, and 429 entity pairs.
Table 5.1 shows some of the relationship topics learned from the Bible data. (This
model has five entity topics and five relationship topics; see the following section for
more details on how the choice of number of topics affects performance.) Each column
shows the words with the highest weight in that topic’s multinomial parameter vector,
and above each column are examples of entity pairs associated with that topic. In this
example, relationship Topic 1 corresponds to blood relations, and relationship Topic 2
refers to antagonists. We emphasize that this structure is uncovered by analyzing the
original texts. No prior knowledge of the relationships between characters is used in
the analysis.
In a more diverse corpus, Nubbi learns broader topics. In a twenty-five topic model
trained on the Wikipedia data, the entity topics broadly apply to entities across many
time periods and cultures. Artists, monarchs, world politicians, people from American
history, and scientists each have a representative topic (see Table 5.2).
The relationship topics further restrict entities that are specific to an individual
country or period (Table 5.3). In some cases, relationship topics narrow the focus of
broader entity topics. For instance, relationship Topics 1, 5, 6, 9, and 10 in Table 5.3
help explain the specific historical context of pairs better than the very broad world
leader entity Topic 7.
In some cases, these distinctions are very specific. For example, relationship Topic 6
contains pairs of post-Hanoverian monarchs of Great Britain and Northern Ireland,
while relationship Topic 5 contains relationships with pre-Hanoverian monarchs of
England even though both share words like “queen” and “throne.” Note also that
these topics favor words like “father” and “daughter,” which describe the relationships
present in these pairs.
The model sometimes groups together pairs of people from radically different
contexts. For example, relationship Topic 8 groups composers with religious scholars
(both share terms like “mass” and “patron”), revealing a drawback of using a unigram-based method.
             Topic 1                  Topic 2            Topic 3                Topic 4
Entities     George Westinghouse      Charles Peirce     Lindsay Davenport      Lee Harvey Oswald
             George Stephenson        Francis Crick      Martina Hingis         Timothy McVeigh
             Guglielmo Marconi        Edmund Husserl     Michael Schumacher     Yuri Gagarin
             James Watt               Ibn al-Haytham     Andre Agassi           Bobby Seale
             Robert Fulton            Linus Pauling      Alain Prost            Patty Hearse
Top Terms    electricity              work               align                  state
             engine                   universe           bgcolor                american
             patent                   theory             race                   year
             company                  science            win                    time
             invent                   time               grand                  president

             Topic 5                  Topic 6            Topic 7                Topic 8
Entities     Pierre-Joseph Proudhon   Betty Davis        Franklin D. Roosevelt  Jack Kirby
             Benjamin Tucker          Humphrey Bogart    Jimmy Carter           Terry Pratchett
             Murray Rothbard          Kate Winslet       Brian Mulroney         Carl Barks
             Karl Marx                Martin Scorsese    Neville Chamberlain    Gregory Benford
             Amartya Sen              Audrey Hepburn     Margaret Thatcher      Steve Ditko
Top Terms    social                   film               state                  story
             work                     award              party                  book
             politics                 star               election               work
             society                  role               president              fiction
             economics                play               government             publish

             Topic 9                  Topic 10
Entities     Babe Ruth                Xenophon
             Barry Bonds              Caligula
             Satchel Page             Horus
             Pedro Martinez           Nebuchadrezzar II
             Roger Clemens            Nero
Top Terms    game                     greek
             baseball                 rome
             season                   history
             league                   senate
             run                      death

Table 5.2: Ten topics from a model trained on Wikipedia carve out fairly broad categories like monarchs, athletes, entertainers, and figures from myth and religion. An exception is the more focused Topic 9, which is mostly about baseball. Note that not all of the information is linguistic; Topic 3 shows we were unsuccessful in filtering out all Wikipedia's markup, and the algorithm learned to associate score tables with a sports category.
             Topic 1                    Topic 2                    Topic 3             Topic 4
Pairs        Reagan-Gorbachev           Muhammad-Moses             Grant-Lee           Paul VI-John Paul II
             Kennedy-Khrushchev         Rabin-Arafat               Muhammad-Abu Bakr   Pius XII-Paul II
             Alexandra-Alexander III    E. Bronte-C. Bronte        Sherman-Grant       John XXIII-John Paul II
             Najibullah-Kamal           Solomon-Moses              Jackson-Lee         Pius IX-John Paul II
             Nicholas I-Alexander III   Arafat-Sharon              Sherman-Lee         Leo XIII-John Paul II
Terms        soviet                     israel                     union               vatican
             russian                    god                        corp                cathol
             government                 palestinian                gen                 papal
             union                      chile                      campaign            council
             nuclear                    book                       richmond            time

             Topic 5                    Topic 6                    Topic 7             Topic 8
Pairs        Philip V-Louis XIV         Henry VIII-C. of Aragon    Jefferson-Burr      Mozart-Salieri
             Louis XVI-Francis I        Mary I (Eng)-Elizabeth I   Jefferson-Madison   Malory-Arthur
             Maria Theresa-Charlemagne  Henry VIII-Anne Boleyn     Perot-Bush          Mozart-Beethoven
             Philip V-Louis XVI         Mary I (Scot)-Elizabeth I  Jefferson-Jay       Bede-Augustine
             Philip V-Maria Theresa     Henry VIII-Elizabeth I     J.Q. Adams-Clay     Leo X-Julius II
Terms        french                     queen                      republican          music
             dauphin                    english                    state               play
             spanish                    daughter                   federalist          film
             death                      death                      vote                piano
             throne                     throne                     vice                work

             Topic 9                    Topic 10
Pairs        George VI-Edward VII       Trotsky-Stalin
             George VI-Edward VIII      Kamenev-Stalin
             Victoria-Edward VII        Khrushchev-Stalin
             George V-Edward VII        Kamenev-Trotsky
             Victoria-George VI         Zhou Enlai-Mao Zedong
Terms        royal                      soviet
             queen                      communist
             british                    central
             throne                     union
             father                     full

Table 5.3: In contrast to Table 5.2, the relationship topics shown here are more specific to time and place. For example, English monarch pairs (Topic 6) are distinct from British monarch pairs (Topic 9). While there is some noise (the Bronte sisters being lumped in with mideast leaders or Abu Bakr and Muhammad with civil war generals), these relationship topics group similar pairs of entities well. A social network labeled with these relationships is shown in Figure 5.2.
As another example, relationship Topic 3 links civil war generals and
early Muslim leaders.
5.4.2 Evaluating the predictive distribution
The qualitative results of the previous section illustrate that Nubbi is an effective
model for exploring and understanding latent structure in data. In this section, we
provide a quantitative evaluation of the predictive mechanisms that Nubbi provides.
As with any probabilistic model, Nubbi defines a probability distribution over
unseen data. After fitting the latent variables of our model to data (as described
[Figure 5.5 content: six panels plotting Word Prediction Log Likelihood (top row) and Entity Prediction Log Likelihood (bottom row) against the number of topics (10–20) for the biological, bible, and wikipedia corpora; curves compare Nubbi, Author-Topic, LDA, Unigram, and Mutual.]
Figure 5.5: Predictive log likelihood as a function of the number of Nubbi topics on two tasks: entity prediction (given the context, predict what entities are being discussed) and relation prediction (given the entities, predict what words occur). Higher is better.
in Section 5.3.1), we take unseen pair contexts and ask how well the model predicts
those held-out words. Models that give higher probability to the held-out words better
capture how the two entities participating in that context interact. In a complementary
problem, we can ask the fitted model to predict entities given the words in the pair
context. (The details of these metrics are defined more precisely in Section 5.3.3.)
We compare Nubbi to three alternative approaches: a unigram model, LDA (Blei et al. 2003a), and the Author-Topic model (Rosen-Zvi et al. 2004). All of these approaches are models of language which treat individual entities and pairs of entities alike as bags of words. In the Author-Topic model, entities are associated with individual contexts and pair contexts, but there are no distinguished pair topics; all words are explained by the topics associated with individuals. The unigram and mutual information models serve as baselines: the unigram model is equivalent to using no relationship topics and one entity topic, while the mutual information model is equivalent to using one relationship topic and one entity topic.
We use the bootstrap method to create held-out data sets and compute predictive probability (Efron 1983). Figure 5.5 shows the average predictive log likelihood for the three approaches. The results for Nubbi are plotted as a function of the total number of topics $K = K_\theta + K_\psi$. The results for LDA and Author-Topic were also computed with $K$ topics. All models were trained with the same hyperparameters.
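A minimal sketch of such a bootstrap estimate, assuming the resampling unit is the individual held-out word (other units, such as whole contexts, work the same way; the function name and inputs are hypothetical):

```python
import numpy as np

def bootstrap_loglik(word_logliks, n_boot=100, seed=0):
    """Bootstrap estimate of average held-out predictive log likelihood:
    resample the held-out words with replacement and average each replicate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(word_logliks)
    reps = [x[rng.integers(0, len(x), len(x))].mean() for _ in range(n_boot)]
    return float(np.mean(reps)), float(np.std(reps))
```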
Nubbi outperforms both LDA and unigram on all corpora for all numbers of topics
K. For word prediction Nubbi performs comparably to Author-Topic on bible, worse
on biological, and better on wikipedia. We posit that because the wikipedia corpus
contains more tokens per entity and pair of entities, the Nubbi model is able to leverage
more data to make better word predictions. Conversely, for biological, individual
entities explain pair contexts better than relationship topics, giving the advantage
to Author-Topic. For wikipedia, this yields a 19% improvement in average word log
likelihood over the unigram model at K = 24.
In contrast, the LDA model is unable to make improved predictions over the
unigram model. There are two reasons for this. First, LDA cannot use information
about the participating entities to make predictions about the pair, because it treats
entity contexts and pair contexts as independent bags of words. Second, LDA does not
allocate topics to describe relationships alone, whereas Nubbi does learn topics which
express relationships. This allows Nubbi to make more accurate predictions about the
words used to describe relationships. When relationship words do find their way into
LDA topics, LDA’s performance improves, such as on the bible dataset. Here, LDA
obtains a 6% improvement over unigram while Nubbi achieves a 10% improvement.
With the exception of Author-Topic on biological, Nubbi outperforms the other
approaches on the entity prediction task. For example, on wikipedia, the Nubbi model
shows a 32% improvement over the unigram baseline, LDA shows a 7% improvement,
and Author-Topic actually performs worse than the unigram baseline. While LDA,
Author-Topic, and Nubbi improve monotonically with the number of topics on the
word task, they can peak and decrease for the entity prediction task. Recall that an
improved word likelihood need not imply an improved entity likelihood; if a model
assigns a higher word likelihood to other entity pairs in addition to the correct entity
pair, the predictive entity likelihood may still decrease. Thus, while each held-out
context is associated with a particular pair of entities, it does not follow that that
same context could not also be aptly associated with some other entity pair.
5.4.3 Application to New York Times
We can gain qualitative insight into the performance of the Nubbi model (and demon-
strate its scalability) by investigating its performance on a larger data set from the
New York Times. We treat each of the approximately 1 million articles in this corpus
as a document. We filter the corpus down to 2500 vocabulary terms and 944 entities.
We fit a Nubbi model using five entity topics and five relationship topics.4
Figure 5.6 shows a visualization of the results as a radial plot. Each entity appears
at an angle along the edge of the circle and lines are drawn between related entities.
The thickness of a line represents the strength of the relationship inferred by the model, while its color represents the relationship topic which appears most frequently in the description of the relationship between the two entities.
Because the data set is large, a high-level overview such as Figure 5.6 is difficult to
fully take in. Consequently we zoom in to a small portion of the graph in Figure 5.7
which also annotates some of the relationship topics with the word with highest
probability mass in that topic. This view reveals some of the structure of relationships
that Nubbi is able to uncover on this data set. One topic we have labeled “trial”
appears infrequently in this sector of the graph; the only entity connected by this
relationship is Nicole Brown-Simpson. Although not depicted in this zoomed-in graph,
4For the qualitative evaluation here we fix the number of topics. Refer to the previous section for more details on how performance varies with the number of topics.
Figure 5.6: A visualization of the results of applying the Nubbi model to the New York Times. Entities appear along the edge of the circle and lines connect related entities. The thickness of the lines represents the strength of the relationship while the colors represent the relationship topic which appears most frequently in the description of the relationship.
Figure 5.7: A zoomed view into a small portion of Figure 5.6. The colors (i.e., relationship topics) have been annotated with the most frequently occurring term in that topic. Nubbi is able to discover a way of partitioning relationships into topics and assigning these relationship topics to individual pairs of entities.
Figure 5.8: A screen shot of an Amazon Mechanical Turk task asking users to label the relationships between entities with textual descriptions. In this way we can get a large-scale ground truth for the relationships in the New York Times data set.
the other end of this relationship is O. J. Simpson.
Another two topics seem closely related; we have labeled them here as “match”
and “fight”. The latter is focused on (sporting) contests, such as those involving
George Foreman and Gary Kasparov. The former, however, seems to capture a more
general notion of contention, with Donald Trump strongly related to several people
according to this topic (and Rick Lazio to a lesser extent). The boxer George Foreman, interestingly, occupies both topics almost equally.
This sort of qualitative analysis suggests that Nubbi is able to capture aspects
of relationships. However, this kind of analysis is difficult to scale up to large data
sets such as this one. To aid in this, we perform a large-scale study using Amazon
Mechanical Turk (AMT)5. AMT is an online marketplace for tasks (known as HITs).
5https://www.mturk.com/mturk/welcome
A large pool of users selects and completes HITs for a small fee. In this way it is
possible to obtain a large number of human labelings of data sets.
We offered a series of tasks asking users to label relationships that appear in the
New York Times data set. We collected 600 labelings from 13 users. A screenshot of
our task is shown in Figure 5.8. In it we present each user with ten pairs of entities.
For each pair of entities we ask them to write a textual description of the relationship
between those entities (users may optionally check boxes indicating that they do not
know how they are related or that they are not related). To reduce noise, each pair of entities was presented to multiple users. After removing stop words and tokenizing, we are left with a bag of crowd-sourced labels for each of 200 relationships.
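As a concrete illustration of this post-processing step, the following Python sketch collapses raw (entity pair, description) responses into per-pair bags of labels; the stop-word list and input format here are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "is", "are", "they", "in", "to"}

def build_label_bags(responses):
    """Collapse crowd-sourced descriptions into a bag of labels per pair.

    `responses` is an iterable of (entity_pair, description) tuples;
    several users may describe the same pair.
    """
    bags = defaultdict(Counter)
    for pair, description in responses:
        for token in description.lower().split():
            token = token.strip(".,;:!?\"'")
            if token and token not in STOP_WORDS:
                bags[pair][token] += 1
    return bags

# For example, two labelings of the same pair:
bags = build_label_bags([
    (("ali", "foreman"), "olympic opponents"),
    (("ali", "foreman"), "they fought a famous boxing match"),
])
```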
We now measure how well our models can predict these bags of labels. We first train
both the Nubbi model and the Author-Topic model using the parameters mentioned
earlier in this section. As mentioned above, each of these trained models can then
predict words describing the relationship between two entities. For each word in our
test set, that is, the labels we obtained from users on AMT, we compute the rank of
that word in the list of predicted words. We emphasize that for this predictive task
the relationship ground truth was completely hidden from both models. The result of
this experiment is shown in Figure 5.9.
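The rank computation itself is straightforward; a sketch under the assumption that each model exposes a score for every vocabulary word given an entity pair:

```python
import numpy as np

def label_ranks(word_scores, labels, vocabulary):
    """Rank each ground-truth label within a model's predicted word list.

    `word_scores` gives a score for each word in `vocabulary` (higher means
    more probable); returns the 1-based rank of each label that appears in
    the vocabulary.
    """
    order = np.argsort(-np.asarray(word_scores))
    rank_of = {vocabulary[i]: r + 1 for r, i in enumerate(order)}
    return [rank_of[w] for w in labels if w in rank_of]
```

Each point in Figure 5.9 is then the pair of log ranks for one label instance, one coordinate per model.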
Each word in the figure represents an instance of a word in the test set. The
position of the word is determined by the predicted rank according to the Author-Topic
Model (x-axis) and the Nubbi model (y-axis). Lower is better along both axes. The
words below the diagonal are instances where Nubbi’s prediction was better than the
Author-Topic model’s, and vice versa for those above the diagonal. As with other
visualizations of this data set, because of the large scale it is difficult to tease out
individual differences. Therefore we create another version of this visualization by
removing those terms close to the diagonal, that is, the labels for which Nubbi and the
Author-Topic Model make similar predictions. This allows us to better understand
the differences between these two models.
Figure 5.9: Predicted rank of ground truth labels using the Author-Topic Model (x-axis) versus predicted rank using Nubbi (y-axis). Lower is better along both axes. Words below the diagonal are instances where Nubbi's prediction was better than the Author-Topic model's.
Figure 5.10: The visualization in Figure 5.9 with the terms closest to the diagonal removed. This emphasizes the differences between the Author-Topic Model and Nubbi, revealing that the predictions Nubbi makes are qualitatively different from those made by the Author-Topic Model.
This second visualization is given in Figure 5.10. This visualization reveals a
qualitative difference between the predictions Nubbi is able to make well (below
the dashed line) versus the predictions the Author-Topic Model is able to make
well (above the dashed line). In particular, the words below the dashed line are
generally “relationship words” such as “brother”, “father”, “husband”, “married”,
and “opponents”. In contrast, the words above the dashed line provide context, such
as “africa”, “baseball”, “russia”, or “olympic”.
The descriptions of relationships provided by users often contain both contextual
and relationship words, for example, “olympic opponents.” The Nubbi model better
predicts the relationship-specific words such as "opponent," opting instead to explain
words like “olympic” by the entity itself. This, in fact, reveals some structure about
gold-standard relationship descriptions. In contrast, the Author-Topic Model does not
make this distinction between relationship and context words. One avenue of future work would be to use this insight about how people characterize relationships to build models specifically designed to generate these sorts of descriptions.
5.5 Discussion and related work
We presented Nubbi, a novel machine learning approach for analyzing free text to
extract descriptions of relationships between entities. We applied Nubbi to several
corpora—the Bible, Wikipedia, scientific abstracts, and New York Times articles. We
showed that Nubbi provides a state-of-the-art predictive model of entities and relation-
ships and, moreover, is a useful exploratory tool for discovering and understanding
network data hidden in plain text.
Analyzing networks of entities has a substantial history (Wasserman and Pattison
1996); recent work has focused in particular on clustering and community struc-
ture (Anagnostopoulos et al. 2008; Cai et al. 2005; Gibson et al. 1998; McGovern et al.
2003; Newman 2006b), deriving models for social networks (Leskovec et al. 2008a,b;
Meeds et al. 2007; Taskar et al. 2003), and applying these analyses to predictive applications (Zhou et al. 2008). Latent variable approaches to modeling social networks
with associated text have also been explored (McCallum et al. 2005; Mei et al. 2008;
Nallapati et al. 2008; Wang et al. 2005). While the space of potential applications for
these models is rich, it is tempered by the need for observed network data as input.
Nubbi allows these techniques to augment their network data by leveraging the large
body of relationship information encoded in collections of free text.
Previous work in this vein has used either pattern-based approaches or co-occurrence
methods. The pattern-based approaches (Agichtein and Gravano 2003; Diehl et al. 2007; Mei et al. 2007; Sahay et al. 2008) and syntax-based approaches (Banko et al. 2007; Katrenko and Adriaans 2007) require patterns or parsers which are meticulously hand-crafted, often fragile, and typically need several examples of desired relationships, limiting the types of relationships that can be discovered. In contrast,
Nubbi makes minimal assumptions about the input text, and is thus practical for
languages and non-linguistic data where parsing is not available or applicable. Co-
occurrence methods (Culotta et al. 2005; Davidov et al. 2007) also make minimal
assumptions. However, because Nubbi draws on topic modeling (Blei et al. 2003a),
it is able to uncover hidden and semantically meaningful groupings of relationships.
Through the distinction between relationship topics and entity topics, it can better
model the language used to describe relationships.
Finally, while other models have also leveraged the machinery of LDA to understand
ensembles of entities and the words associated with them (Bhattacharya et al. 2008;
Newman et al. 2006a; Rosen-Zvi et al. 2004), these models only learn hidden topics for
individual entities. Nubbi models individual entities and pairs of entities distinctly. By
controlling for features of individual entities and explicitly modeling relationships, Nubbi yields
more powerful predictive models and can discover richer descriptions of relationships.
Chapter 6
Conclusion
In this thesis we have studied network data. These data may take the form of an online
social network, a social network of characters in a book, a network of public figures in news articles,
networks of webpages, networks of genes, etc. These data are already pervasive and
will only increase in ubiquity as more people join online services that connect them with other users, as biologists find ever more complicated interconnections between proteins and genes of interest, and as more literature and news becomes digitized and scrutinized. Thus, being able to learn from these data to gain insights and make
predictions is becoming ever more important. Predictive models can suggest new
friends for members of a social network or new citations for a paper, while descriptive
statistics can discover communities of friends or authors.
In this work we have introduced and explored several models of network data. The
first and simplest of these models correlations between links most directly. Here, the central
challenge is the speed at which observed data can be synthesized into a learned model.
We develop techniques that drastically speed up this process, making these models
more applicable to the large, real-world data that are becoming ubiquitous.
We then developed a model of network data that accounts for both links and
attributes. Given a corpus of documents with connections between them, the Relational
Topic Model can map those documents into a latent space leveraging the mechanisms
of topic modeling. With a trained model, we showed how one can predict links for a
node given only its attributes or attributes given only its links. Thus we can suggest
new citations for a document given only its content, or new interests for a user given
only their friends’. We apply this model to several data sets including local news,
twitter, and scientific abstracts and demonstrate the model’s ability to make state of
the art predictions and find interesting perspectives on the data.
Finally, we turned our attention to cases where our understanding of the links is
incomplete or missing altogether. In particular, we focused on the problem of inferring
whether or not a link exists between two nodes, and if so, giving a latent-space
characterization of that relationship. It is important to know, for example, how two
people know each other in a social network or how two genes interact in a biological
network; linkage is not simply binary. While some resources for annotating edges exist,
they are limited and not scalable to the large and varied networks we have today. We
developed the Nubbi model to infer edges and their characterizations using only free
text. We showed qualitatively and quantitatively that our model can construct and
annotate graphs of relationships and make useful predictions.
In sum, this thesis has contributed a set of probabilistic models, along with
attendant inferential and predictive tools that make it possible to better uncover,
understand, and predict links.
Appendix A
Derivation of RTM Coordinate Ascent Updates
Inference under the variational method amounts to finding values of the variational
parameters γ, Φ which optimize the evidence lower bound, L, given in Equation 4.6. To do so, we first expand the expectations in these terms,
\[
\begin{aligned}
\mathcal{L} ={}& \sum_{(d_1,d_2)} \mathcal{L}_{d_1,d_2}
+ \sum_d \sum_n \phi_{d,n}^T \log \beta_{\cdot,w_{d,n}}
+ \sum_d \sum_n \phi_{d,n}^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) \\
&+ \sum_d (\alpha - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
- \sum_d \sum_n \phi_{d,n}^T \log \phi_{d,n} \\
&- \sum_d (\gamma_d - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
+ \sum_d \left(\mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T\gamma_d)\right),
\end{aligned}
\tag{A.1}
\]
where $\mathcal{L}_{d_1,d_2}$ is defined as in Equation 4.7. Since $\mathcal{L}_{d_1,d_2}$ is independent of $\gamma$, we can collect all of the terms associated with $\gamma_d$ into
\[
\mathcal{L}_{\gamma_d} = \Big(\alpha + \sum_n \phi_{d,n} - \gamma_d\Big)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) + \mathbf{1}^T \log \Gamma(\gamma_d) - \log \Gamma(\mathbf{1}^T\gamma_d).
\]
Taking the derivatives and setting them equal to zero leads to the following optimality condition,
\[
\Big(\alpha + \sum_n \phi_{d,n} - \gamma_d\Big)^T \left(\Psi'(\gamma_d) - \mathbf{1}\Psi'(\mathbf{1}^T\gamma_d)\right) = 0,
\]
which is satisfied by the update
\[
\gamma_d \leftarrow \alpha + \sum_n \phi_{d,n}.
\tag{A.2}
\]
In order to derive the update for $\phi_{d,n}$ we also collect its associated terms,
\[
\mathcal{L}_{\phi_{d,n}} = \phi_{d,n}^T\left(-\log \phi_{d,n} + \log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) + \sum_{d' \neq d} \mathcal{L}_{d,d'}.
\]
Adding a Lagrange multiplier to ensure that $\phi_{d,n}$ normalizes and setting the derivative equal to zero leads to the following condition,
\[
\phi_{d,n} \propto \exp\left\{\log \beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d) + \nabla_{\phi_{d,n}}\mathcal{L}_{d,d'}\right\}.
\tag{A.3}
\]
The exact form of $\nabla_{\phi_{d,n}}\mathcal{L}_{d,d'}$ will depend on the link probability function chosen. If the expected log link probability depends only on $\pi_{d_1,d_2} = \phi_{d_1} \circ \phi_{d_2}$, the gradients are given by Equation 4.10. When $\psi_N$ is chosen as the link probability function, we expand the expectation,
\[
\begin{aligned}
\mathbb{E}_q\left[\log \psi_N(z_d, z_{d'})\right] &= -\eta^T\,\mathbb{E}_q\left[(z_d - z_{d'}) \circ (z_d - z_{d'})\right] - \nu \\
&= -\nu - \sum_i \eta_i\left(\mathbb{E}_q\!\left[z_{d,i}^2\right] + \mathbb{E}_q\!\left[z_{d',i}^2\right] - 2\phi_{d,i}\phi_{d',i}\right).
\end{aligned}
\tag{A.4}
\]
Because each word is independent under the variational distribution, $\mathbb{E}_q[z_{d,i}^2] = \mathrm{Var}(z_{d,i}) + \phi_{d,i}^2$, where $\mathrm{Var}(z_{d,i}) = \frac{1}{N_d^2}\sum_n \phi_{d,n,i}(1 - \phi_{d,n,i})$. The gradient of this expression is given by Equation 4.11.
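In code, the two coordinate updates are simple. The sketch below is an illustration, not the thesis implementation; the link-gradient term is passed in as a precomputed value since its exact form depends on the link function chosen in Equation 4.8.

```python
import numpy as np
from scipy.special import digamma

def update_gamma(alpha, phi_d):
    """Equation A.2: gamma_d <- alpha + sum_n phi_{d,n}."""
    return alpha + phi_d.sum(axis=0)          # phi_d has shape (N_d, K)

def update_phi_word(log_beta_w, gamma_d, link_grad):
    """Equation A.3: phi_{d,n} proportional to
    exp{log beta_{.,w} + Psi(gamma_d) - Psi(1^T gamma_d) + link gradient}."""
    log_phi = log_beta_w + digamma(gamma_d) - digamma(gamma_d.sum()) + link_grad
    log_phi -= log_phi.max()                   # stabilize before exponentiating
    phi = np.exp(log_phi)
    return phi / phi.sum()
```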
Appendix B
Derivation of RTM Parameter Estimates
In order to estimate the parameters of our model, we find values of the topic multinomial
parameters β and link probability parameters η, ν which maximize the variational
objective, $\mathcal{L}$, given in Equation 4.6.

To optimize $\beta$, it suffices to take the derivative of the expanded objective given in Equation A.1 along with a Lagrange multiplier to enforce normalization:
\[
\partial_{\beta_{k,w}} \mathcal{L} = \sum_d \sum_n \phi_{d,n,k}\, \mathbf{1}(w = w_{d,n}) \frac{1}{\beta_{k,w_{d,n}}} + \lambda_k.
\]
Setting this quantity equal to zero and solving yields the update given in Equation 4.12.
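Concretely, the resulting update just normalizes expected topic-word counts. A minimal sketch, assuming per-document variational matrices `phi[d]` of shape (N_d, K) and integer word ids:

```python
import numpy as np

def update_beta(phi, words, K, V):
    """beta_{k,w} proportional to sum_d sum_n phi_{d,n,k} 1(w = w_{d,n}),
    the solution obtained from the Lagrange-multiplier condition
    (cf. Equation 4.12)."""
    beta = np.zeros((K, V))
    for phi_d, w_d in zip(phi, words):
        for phi_dn, w in zip(phi_d, w_d):
            beta[:, w] += phi_dn               # accumulate expected counts
    return beta / beta.sum(axis=1, keepdims=True)
```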
By taking the gradient of Equation A.1 with respect to $\eta$ and $\nu$, we can also derive updates for the link probability parameters. When the expectation of the logarithm of the link probability function depends only on $\eta^T\pi_{d,d'} + \nu$, as with all the link functions given in Equation 4.8, these derivatives take a convenient form. For notational expedience, denote $\eta^+ = \langle \eta, \nu \rangle$ and $\pi^+_{d,d'} = \langle \pi_{d,d'}, 1 \rangle$. Then the derivatives can be written as
\[
\begin{aligned}
\nabla_{\eta^+} \mathcal{L}^\sigma_{d,d'} &\approx \left(1 - \sigma\!\left(\eta^{+T}\pi^+_{d,d'}\right)\right)\pi^+_{d,d'} \\
\nabla_{\eta^+} \mathcal{L}^\Phi_{d,d'} &\approx \frac{\Phi'\!\left(\eta^{+T}\pi^+_{d,d'}\right)}{\Phi\!\left(\eta^{+T}\pi^+_{d,d'}\right)}\,\pi^+_{d,d'} \\
\nabla_{\eta^+} \mathcal{L}^e_{d,d'} &= \pi^+_{d,d'}.
\end{aligned}
\tag{B.1}
\]
Note that all of these gradients are positive because we are faced with a one-class estimation problem. Unchecked, the parameter estimates will diverge. While a variety of techniques exist to address this problem, one set of strategies is to add regularization. A common regularization for regression problems is the $\ell_2$ regularizer. This penalizes the objective $\mathcal{L}$ with the term $\lambda\|\eta\|^2$, where $\lambda$ is a free parameter. This penalization has a Bayesian interpretation as a Gaussian prior on $\eta$.
In lieu of or in conjunction with $\ell_2$ regularization, one can also employ regularization which in effect injects some number of observations, $\rho$, for which the link variable $y = 0$. We associate with these observations a document similarity of $\pi_\alpha = \frac{\alpha}{\mathbf{1}^T\alpha} \circ \frac{\alpha}{\mathbf{1}^T\alpha}$, the expected Hadamard product of any two documents given the Dirichlet prior of the model. Because both $\psi_\sigma$ and $\psi_\Phi$ are symmetric, the gradients of these regularization terms can be written as
\[
\begin{aligned}
\nabla_{\eta^+} R_\sigma &= -\rho\,\sigma\!\left(\eta^{+T}\pi^+_\alpha\right)\pi^+_\alpha \\
\nabla_{\eta^+} R_\Phi &= -\rho\,\frac{\Phi'\!\left(-\eta^{+T}\pi^+_\alpha\right)}{\Phi\!\left(-\eta^{+T}\pi^+_\alpha\right)}\,\pi^+_\alpha.
\end{aligned}
\]
While this approach could also be applied to $\psi_e$, here we use a different approximation. We do this for two reasons. First, we cannot optimize the parameters of $\psi_e$ in an unconstrained fashion since this may lead to link functions which are not probabilities. Second, the approximation we propose will lead to explicit updates.

Because $\mathbb{E}_q[\log \psi_e(z_d \circ z_{d'})]$ is linear in $\pi_{d,d'}$ by Equation 4.8, this suggests a linear approximation of $\mathbb{E}_q[\log(1 - \psi_e(z_d \circ z_{d'}))]$. Namely, we let
\[
\mathbb{E}_q\left[\log(1 - \psi_e(z_d \circ z_{d'}))\right] \approx \eta'^T \pi_{d,d'} + \nu'.
\]
This leads to a penalty term of the form
\[
R_e = \rho\left(\eta'^T \pi_\alpha + \nu'\right).
\]
We fit the parameters of the approximation, $\eta'$ and $\nu'$, by making the approximation exact whenever $\pi_{d,d'} = \mathbf{0}$ or $\max \pi_{d,d'} = 1$. This yields the following $K + 1$ equations for the $K + 1$ parameters of the approximation:
\[
\begin{aligned}
\nu' &= \log(1 - \exp(\nu)) \\
\eta'_i &= \log(1 - \exp(\eta_i + \nu)) - \nu'.
\end{aligned}
\]
Combining the gradient of the likelihood of the observations given in Equation B.1 with the gradient of the penalty $R_e$ and solving leads to the following updates:
\[
\begin{aligned}
\nu &\leftarrow \log\left(M - \mathbf{1}^T\Pi\right) - \log\left(\rho(1 - \mathbf{1}^T\pi_\alpha) + M - \mathbf{1}^T\Pi\right) \\
\eta &\leftarrow \log\left(\Pi\right) - \log\left(\Pi + \rho\,\pi_\alpha\right) - \mathbf{1}\nu,
\end{aligned}
\]
where $M = \sum_{(d_1,d_2)} 1$ and $\Pi = \sum_{(d_1,d_2)} \pi_{d_1,d_2}$. Note that because of the constraints on our approximation, these updates are guaranteed to yield parameters for which $0 \le \psi_e \le 1$.
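These updates translate directly into code. A sketch, where `Pi` holds the summed similarities $\Pi$ and `pi_alpha` the prior similarity $\pi_\alpha$ (both length-$K$ vectors); the names are illustrative:

```python
import numpy as np

def update_psi_e(Pi, M, pi_alpha, rho):
    """Regularized updates for the parameters of the link function psi_e.

    Pi:       sum of pi_{d,d'} over the M observed links (length K)
    pi_alpha: expected Hadamard product under the Dirichlet prior
    rho:      number of injected y = 0 regularization observations
    """
    residual = M - Pi.sum()
    nu = np.log(residual) - np.log(rho * (1.0 - pi_alpha.sum()) + residual)
    eta = np.log(Pi) - np.log(Pi + rho * pi_alpha) - nu
    return eta, nu
```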
Finally, in order to fit parameters for $\psi_N$, we begin by assuming the variance terms of Equation A.4 are small. Equation A.4 can then be written as
\[
\mathbb{E}_q\left[\log \psi_N(z_d, z_{d'})\right] = -\nu - \eta^T(\phi_d - \phi_{d'}) \circ (\phi_d - \phi_{d'}),
\]
which is the log likelihood of a Gaussian distribution in which $\phi_d - \phi_{d'}$ is random with mean $0$ and diagonal variance $\frac{1}{2\eta}$. This suggests fitting $\eta$ using the empirically observed variance:
\[
\eta \leftarrow \frac{M}{2\sum_{d,d'}(\phi_d - \phi_{d'}) \circ (\phi_d - \phi_{d'})}.
\]
$\nu$ acts as a scaling factor for the Gaussian distribution; here we want only to ensure that the total probability mass respects the frequency of observed links relative to the regularization "observations." Equating the normalization constant of the distribution with the desired probability mass yields the update
\[
\nu \leftarrow \log\tfrac{1}{2}\pi^{K/2} + \log(\rho + M) - \log M - \tfrac{1}{2}\mathbf{1}^T\log\eta,
\]
guarding against values of $\nu$ which would make $\psi_N$ inadmissible as a probability.
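The moment-matching updates for $\psi_N$ are similarly direct. A sketch transcribing the two updates above; the array layout (`phi_bar` as a matrix of per-document mean vectors indexed by document id) is an assumption of the example:

```python
import numpy as np

def update_psi_N(phi_bar, pairs, rho):
    """Fit eta from the empirical variance of phi_d - phi_d' over the
    observed pairs, then set nu to respect the observed link frequency."""
    M = len(pairs)
    diffs = np.stack([phi_bar[d] - phi_bar[e] for d, e in pairs])
    eta = M / (2.0 * (diffs * diffs).sum(axis=0))   # elementwise Hadamard square
    K = eta.shape[0]
    nu = (np.log(0.5 * np.pi ** (K / 2.0))
          + np.log(rho + M) - np.log(M)
          - 0.5 * np.log(eta).sum())
    return eta, nu
```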
Appendix C
Derivation of NUBBI Coordinate-Ascent Updates
For convenience, we break up the terms of the objective function into two classes — those that concern each pair of entities, $\mathcal{L}_{e,e'}$, and those that concern individual entities, $\mathcal{L}_e$. Equation 5.1 can then be rewritten as
\[
\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e.
\]
We first expand $\mathcal{L}_e$ as
\[
\begin{aligned}
\mathcal{L}_e ={}& \sum_n \phi^{\theta\,T}_{e,n} \log \beta^\theta_{w_n}
+ \sum_n \phi^{\theta\,T}_{e,n}\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right)
+ (\alpha^\theta - 1)^T\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right) \\
&- \sum_n \phi^{\theta\,T}_{e,n} \log \phi^\theta_{e,n}
+ \mathbf{1}^T \log \Gamma(\gamma^\theta_e) - \log \Gamma(\mathbf{1}^T\gamma^\theta_e)
- (\gamma^\theta_e - 1)^T\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right).
\end{aligned}
\]
Next we expand $\mathcal{L}_{e,e'}$. In order to do so, we first define
\[
\xi_{e,e',n} \circ \phi^\psi_{e,e',n} = \left\langle \xi_{e,e',n,1}\phi^\psi_{e,e',n,1},\ \xi_{e,e',n,2}\phi^\psi_{e,e',n,2},\ \xi_{e,e',n,3}\phi^\psi_{e,e',n,3} \right\rangle.
\]
Note that $\xi_{e,e',n} \circ \phi^\psi_{e,e',n}$ defines a multinomial parameter vector of length $3 \times K$ representing the multinomial probabilities for each $z_{e,e',n}, c_{e,e',n}$ assignment. In particular, $q(z_{e,e',n} = z^*, c_{e,e',n} = c^*) = \xi_{e,e',n,c^*}\,\phi^\psi_{e,e',n,c^*,z^*}$. Thus,
\[
\begin{aligned}
\mathcal{L}_{e,e'} ={}& \sum_n \left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right)^T \log\left\langle \beta^\theta_{w_n}, \beta^\theta_{w_n}, \beta^\psi_{w_n} \right\rangle
+ \sum_n \xi_{e,e',n,1}\,\phi^{\psi\,T}_{e,e',n,1}\left(\Psi(\gamma^\theta_e) - \Psi(\mathbf{1}^T\gamma^\theta_e)\right) \\
&+ \sum_n \xi_{e,e',n,2}\,\phi^{\psi\,T}_{e,e',n,2}\left(\Psi(\gamma^\theta_{e'}) - \Psi(\mathbf{1}^T\gamma^\theta_{e'})\right)
+ \sum_n \xi_{e,e',n,3}\,\phi^{\psi\,T}_{e,e',n,3}\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right) \\
&+ \sum_n \xi^T_{e,e',n}\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right)
- \sum_n \left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right)^T \log\left(\xi_{e,e',n} \circ \phi^\psi_{e,e',n}\right) \\
&+ \mathbf{1}^T \log \Gamma(\gamma^\psi_{e,e'}) - \log \Gamma(\mathbf{1}^T\gamma^\psi_{e,e'})
+ (\alpha^\psi - 1)^T\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right)
- (\gamma^\psi_{e,e'} - 1)^T\left(\Psi(\gamma^\psi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\psi_{e,e'})\right) \\
&+ \mathbf{1}^T \log \Gamma(\gamma^\pi_{e,e'}) - \log \Gamma(\mathbf{1}^T\gamma^\pi_{e,e'})
+ (\alpha^\pi - 1)^T\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right)
- (\gamma^\pi_{e,e'} - 1)^T\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(\mathbf{1}^T\gamma^\pi_{e,e'})\right).
\end{aligned}
\]
Since $\phi^\theta_{e,n}$ only appears in $\mathcal{L}_e$, we can optimize this parameter by taking the
gradient,
$$\nabla_{\phi^\theta_{e,n}}\mathcal{L}_e = \log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\theta_{e,n} - 1.$$
Setting this equal to zero yields the update equation for $\phi^\theta_{e,n}$ in Equation 5.2. To
optimize $\phi^\psi_{e,e',n,1}$, it suffices to take the gradient of $\mathcal{L}_{e,e'}$,
$$\nabla_{\phi^\psi_{e,e',n,1}}\mathcal{L}_{e,e'} = \xi_{e,e',n,1}\left(\log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\psi_{e,e',n,1} - 1\right).$$
Setting this equal to zero yields the update in Equation 5.5. The updates for $\phi^\psi_{e,e',n,2}$
and $\phi^\psi_{e,e',n,3}$ are derived in exactly the same fashion.
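Concretely, the zero-gradient condition is solved by exponentiating and normalizing. The sketch below (with hypothetical names, and `scipy` supplying the digamma function $\Psi$) computes the update of Equation 5.2.

```python
import numpy as np
from scipy.special import digamma

def update_phi_theta(log_beta_wn, gamma_e):
    """Solve the zero-gradient condition above for phi^theta_{e,n}
    (Equation 5.2): exponentiate log beta_{w_n} plus the expected log
    theta_e under q, then normalize.  A sketch with hypothetical names;
    log_beta_wn and gamma_e are length-K vectors."""
    log_phi = log_beta_wn + digamma(gamma_e) - digamma(gamma_e.sum())
    log_phi -= log_phi.max()        # subtract the max for numerical stability
    phi = np.exp(log_phi)
    return phi / phi.sum()
```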
Similarly, to derive the update for $\xi_{e,e',n,1}$, we take the partial derivative of $\mathcal{L}_{e,e'}$,
$$\frac{\partial\mathcal{L}_{e,e'}}{\partial\xi_{e,e',n,1}} = \Psi(\gamma^\pi_{e,e',1}) - \Psi(1^T\gamma^\pi_{e,e'}) - \log\xi_{e,e',n,1} - 1 + \phi^{\psi\,T}_{e,e',n,1}\left(\log\beta^\theta_{w_n} + \Psi(\gamma^\theta_e) - \Psi(1^T\gamma^\theta_e) - \log\phi^\psi_{e,e',n,1}\right).$$
Replacing $\log\phi^\psi_{e,e',n,1}$ with the update equation given above, this expression reduces to
$$\frac{\partial\mathcal{L}_{e,e'}}{\partial\xi_{e,e',n,1}} = \Psi(\gamma^\pi_{e,e',1}) - \Psi(1^T\gamma^\pi_{e,e'}) - \log\xi_{e,e',n,1} - 1 + \lambda_{e,e',n,1}.$$
Consequently the update for $\xi_{e,e',n}$ is Equation 5.6. In order to update $\gamma^\pi_{e,e'}$ we collect
the terms which contain this parameter,
$$\left(\alpha_\pi + \sum_n \xi_{e,e',n} - \gamma^\pi_{e,e'}\right)\left(\Psi(\gamma^\pi_{e,e'}) - \Psi(1^T\gamma^\pi_{e,e'})\right) - \left(\log\Gamma(\gamma^\pi_{e,e'}) - \log\Gamma(1^T\gamma^\pi_{e,e'})\right).$$
The optimum for these terms is obtained when the condition in Equation 5.7 is
satisfied. See Blei et al. (2003a) for details on this solution. Collecting terms associated
with $\gamma^\psi_{e,e'}$ similarly leads to the update given in Equation 5.8.
We also collect terms to yield updates for $\gamma^\theta_e$. The terms associated with this
variational parameter (and this variational parameter alone) span both $\mathcal{L}_{e,e'}$ and $\mathcal{L}_e$;
it is via this parameter's updates in Equation 5.10 that evidence associated with
individual entities is combined with evidence associated with entity pairs, as sketched below.
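A minimal sketch of this combination, with hypothetical names (cf. Equations 5.7 and 5.10); the analogous update for $\gamma^\pi_{e,e'}$ is simply the prior plus $\sum_n \xi_{e,e',n}$.

```python
import numpy as np

def update_gamma_theta(alpha_theta, phis_entity, pair_terms):
    """Variational Dirichlet update for gamma^theta_e: the prior plus
    expected topic counts from the entity's own document and from every
    pair document that mentions e.  A sketch with hypothetical names.

    phis_entity : (N_e, K) array of phi^theta_{e,n} rows.
    pair_terms  : list of (N, K) arrays; each row holds
                  xi_{e,e',n,c} * phi^psi_{e,e',n,c} for the source c
                  (1 or 2) that points at entity e.
    """
    gamma = alpha_theta + phis_entity.sum(axis=0)
    for term in pair_terms:
        gamma += term.sum(axis=0)   # evidence contributed by pair documents
    return gamma
```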
To find MAP estimates for $\beta^\psi$ and $\beta^\theta$, note that both variables are multinomial
and hence in the exponential family with topic-word assignment counts as sufficient
statistics. Because the conjugate prior on these parameters is Dirichlet, the posterior
is also a Dirichlet distribution whose sufficient statistics are defined by the observations
plus the prior hyperparameter. These posterior sufficient statistics are precisely the
right-hand sides of Equation 5.14. The MAP value of the parameters is achieved
when the expected sufficient statistics equal the observed sufficient statistics, giving
the updates in Equation 5.14.
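In code, this update is a normalization of smoothed expected counts; a sketch under the convention of Equation 5.14 (names are ours):

```python
import numpy as np

def map_beta(expected_counts, prior):
    """MAP-style update for a topic-word multinomial with a Dirichlet
    prior, following the update described above: normalize the expected
    topic-word counts plus the prior pseudo-counts.  A sketch;
    expected_counts is (K, W) and prior is a scalar or length-W
    hyperparameter."""
    unnorm = expected_counts + prior
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```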
Appendix D
Derivation of Gibbs sampling
equations
In this section we derive collapsed Gibbs sampling equations for the models presented
in this thesis. Collapsed Gibbs sampling is an alternative to the variational approach
— instead of approximating the posterior distribution by optimizing a variational
lower bound, collapsed Gibbs sampling directly collects samples from the posterior
distribution. In order to sample from the posterior, it suffices to compute the posterior
(up to a constant) for a single assignment conditioned on all other assignments,
$$p(z_{d,n} \mid \mathbf{z}_{-(d,n)}, \alpha, \eta, \mathbf{w}), \tag{D.1}$$
where $\mathbf{z}_{-(d,n)}$ denotes the set of topic assignments to all words in all documents
excluding $z_{d,n}$. For a review of Gibbs sampling and why this is the case, see Neal (1993).
In contrast to variational inference, the equations we derive here are collapsed, that
is, they integrate out variables such as the per-topic distribution over words, βk, and
the per-document distribution over topics, θd. What remain are the topic assignments
for each word, zd,n.
D.1 Latent Dirichlet allocation (LDA)
First we compute the prior distribution over topic assignments:
$$\int p(z_d \mid \theta_d)\,dp(\theta_d \mid \alpha)
= \int \prod_i \theta_{d,z_i}\,\frac{1}{B(\alpha)} \prod_k \theta_{d,k}^{\alpha_k - 1}\,d\theta_d
= \frac{1}{B(\alpha)} \int \prod_k \theta_{d,k}^{n_{d,k} + \alpha_k - 1}\,d\theta_d
= \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}, \tag{D.2}$$
where $B(\alpha) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\sum_k \alpha_k)}$ is a normalizing constant, $n_{d,k} = \sum_i \mathbb{1}(z_{d,i} = k)$ counts the
number of words in document $d$ assigned to topic $k$, and $n_{d,\cdot} = \langle n_{d,1}, n_{d,2}, \ldots, n_{d,K}\rangle$ is
the vector of these counts.
We then compute the likelihood of the word observations given a set of topic
assignments,
$$\int \prod_d p(w_d \mid z_d, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta)
= \int \left(\prod_d \prod_i \beta_{w_{d,i},\,z_{d,i}}\right)\left(\prod_k \frac{1}{B(\eta)} \prod_w \beta_{w,k}^{\eta_w - 1}\right) d\boldsymbol\beta
= \prod_k \frac{1}{B(\eta)} \int \prod_w \beta_{w,k}^{\eta_w + n_{w,k} - 1}\,d\beta_k
= \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}, \tag{D.3}$$
where $n_{w,k} = \sum_d \sum_i \mathbb{1}(z_{d,i} = k \wedge w_{d,i} = w)$ counts the number of assignments of word
$w$ to topic $k$ across all documents and $n_{\cdot,k} = \langle n_{1,k}, n_{2,k}, \ldots, n_{W,k}\rangle$ is the vector of these
counts.
Combining Equation D.2 and Equation D.3, the posterior probability of a set of
topic assignments can be written as
$$\begin{aligned}
p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}) &\propto p(\mathbf{w} \mid \mathbf{z}, \eta)\,p(\mathbf{z} \mid \alpha)\\
&= \int p(\mathbf{w} \mid \mathbf{z}, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta) \int p(\mathbf{z} \mid \boldsymbol\theta)\,dp(\boldsymbol\theta \mid \alpha)\\
&= \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}.
\end{aligned}\tag{D.4}$$
Conditioning on all other assignments, the posterior probability of a single assignment
is then
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{z}_{-(d,n)})
&\propto \prod_{d'} \frac{B(n_{d',\cdot} + \alpha)}{B(\alpha)} \prod_{k'} \frac{B(\eta + n_{\cdot,k'})}{B(\eta)}\\
&\propto B(n_{d,\cdot} + \alpha) \prod_{k'} B(\eta + n_{\cdot,k'})\\
&= \frac{\prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})}{\Gamma(\sum_{k'} n_{d,k'} + \alpha_{k'})} \prod_{k'} \frac{\prod_w \Gamma(\eta_w + n_{w,k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\\
&\propto \frac{1}{\Gamma(\sum_{k'} n_{d,k'} + \alpha_{k'})} \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right]\\
&= \frac{1}{\Gamma(N_d + \sum_{k'} \alpha_{k'})} \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right]\\
&\propto \prod_{k'} \left[\Gamma(n_{d,k'} + \alpha_{k'})\,\frac{\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'})}{\Gamma(\sum_w n_{w,k'} + \eta_w)}\right],
\end{aligned}\tag{D.5}$$
where $N_d = \sum_{k'} n_{d,k'}$ denotes the number of words in document $d$. The second line
follows because terms which are independent of the topic assignment $z_{d,n}$ are constants,
the third line follows by the definition of $B$, and the fourth line follows because the
posterior cannot depend on counts over words other than $w_{d,n}$.
Finally, we make use of the identity
$$\frac{\Gamma(x + b)}{\Gamma(x)} = \begin{cases} x & \text{if } b = 1\\ 1 & \text{if } b = 0 \end{cases} \tag{D.6}$$
$$= x^b, \qquad b \in \{0, 1\}, \tag{D.7}$$
which implies that
$$\begin{aligned}
\Gamma(n_{d,k'} + \alpha_{k'}) &= \Gamma(n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'} + \mathbb{1}(k = k'))\\
&= \Gamma(n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'}) \cdot (n_{d,k'} - \mathbb{1}(k = k') + \alpha_{k'})^{\mathbb{1}(k = k')}\\
&= \Gamma(n^{\neg d,n}_{d,k'} + \alpha_{k'}) \cdot (n^{\neg d,n}_{d,k'} + \alpha_{k'})^{\mathbb{1}(k = k')},
\end{aligned}\tag{D.8}$$
where $n^{\neg d,n}_{d,k'} = \sum_{i \ne n} \mathbb{1}(z_{d,i} = k')$ denotes the number of words assigned to topic $k'$ in
document $d$ excluding the current assignment, $z_{d,n}$. Because $n^{\neg d,n}_{d,k'}$ does not depend on
the current assignment, $z_{d,n}$, it is a constant in the posterior computation; Equation D.8
then becomes
$$\Gamma(n_{d,k'} + \alpha_{k'}) \propto (n^{\neg d,n}_{d,k'} + \alpha_{k'})^{\mathbb{1}(k = k')}. \tag{D.9}$$
Applying the same identity to the other instances of the gamma function in
Equation D.5 gives
$$\Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k'}) \propto (n^{\neg d,n}_{w_{d,n},k'} + \eta_{w_{d,n}})^{\mathbb{1}(k = k')} \tag{D.10}$$
$$\Gamma\Big(\sum_w \eta_w + n_{w,k'}\Big) \propto \Big(\sum_w \eta_w + n^{\neg d,n}_{w,k'}\Big)^{\mathbb{1}(k = k')}, \tag{D.11}$$
where the exclusionary sum is similarly defined as $n^{\neg d,n}_{w,k'} = \sum_{d'} \sum_i \mathbb{1}(z_{d',i} = k' \wedge w_{d',i} = w \wedge (d, n) \ne (d', i))$. Combining these identities with Equation D.5 yields
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{z}_{-(d,n)})
&\propto \prod_{k'} \left((n^{\neg d,n}_{d,k'} + \alpha_{k'})\,\frac{n^{\neg d,n}_{w_{d,n},k'} + \eta_{w_{d,n}}}{\sum_w \eta_w + n^{\neg d,n}_{w,k'}}\right)^{\mathbb{1}(k = k')}\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{\sum_w \eta_w + n^{\neg d,n}_{w,k}}\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta},
\end{aligned}\tag{D.12}$$
where for convenience we denote by $N^{\neg d,n}_k$ the total number of words assigned to topic $k$
excluding the current assignment $z_{d,n}$, and the last line assumes a symmetric hyperparameter $\eta$
(so that $\sum_w \eta_w = W\eta$).
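To make the bookkeeping concrete, the following sketch implements one sweep of this collapsed Gibbs sampler, assuming symmetric scalar hyperparameters; the function and variable names (`gibbs_sweep_lda`, `n_dk`, `n_wk`, `n_k`) are ours, not part of the derivation. Each word's topic is resampled from Equation D.12 after its own count is removed, which is exactly the $\neg(d, n)$ bookkeeping above.

```python
import numpy as np

def gibbs_sweep_lda(docs, z, n_dk, n_wk, n_k, alpha, eta):
    """One sweep of collapsed Gibbs sampling for LDA via Equation D.12.
    A sketch with hypothetical names: docs[d] is a list of word ids,
    z[d] the current topic assignments, and n_dk (D x K), n_wk (W x K),
    n_k (K,) the count tables defined above; alpha and eta are symmetric
    scalar hyperparameters."""
    W = n_wk.shape[0]
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k_old = z[d][n]
            # remove the current assignment to form the "not (d, n)" counts
            n_dk[d, k_old] -= 1
            n_wk[w, k_old] -= 1
            n_k[k_old] -= 1
            # Equation D.12: document factor times word-likelihood factor
            p = (n_dk[d] + alpha) * (n_wk[w] + eta) / (n_k + W * eta)
            k_new = np.random.choice(p.size, p=p / p.sum())
            # add the freshly sampled assignment back into the counts
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_wk[w, k_new] += 1
            n_k[k_new] += 1
    return z
```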
D.2 Mixed-membership stochastic blockmodel (MMSB)
Because the observations in the mixed-membership stochastic blockmodel (MMSB)
depend on pairs of topic assignments, the collapsed Gibbs sampling equations also
depend on the pairwise posterior,
$$p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, \mathbf{y}, \mathbf{z}_{-(d,d')}), \tag{D.13}$$
where $\mathbf{z}_{-(d,d')}$ denotes the set of topic assignments without the two assignments
associated with the link between $d$ and $d'$, $z_{d,d',1}$ and $z_{d,d',2}$.
To compute this, as before we first compute the likelihood of the observations,
$$\begin{aligned}
\int \prod_{d,d'} p(y_{d,d'} \mid z_{d,d',1}, z_{d,d',2}, \boldsymbol\beta)\,dp(\boldsymbol\beta \mid \eta)
&= \int \prod_{d,d'} \beta_{z_{d,d',1},\,z_{d,d',2}}^{y_{d,d'}} (1 - \beta_{z_{d,d',1},\,z_{d,d',2}})^{1 - y_{d,d'}} \prod_{k,k'} \frac{1}{B(\eta)}\,\beta_{k,k'}^{\eta_1 - 1} (1 - \beta_{k,k'})^{\eta_0 - 1}\,d\boldsymbol\beta\\
&= \prod_{k,k'} \int \frac{1}{B(\eta)}\,\beta_{k,k'}^{n_{k,k',1} + \eta_1 - 1} (1 - \beta_{k,k'})^{n_{k,k',0} + \eta_0 - 1}\,d\beta_{k,k'}\\
&= \prod_{k,k'} \frac{B(\eta + n_{k,k'})}{B(\eta)},
\end{aligned}\tag{D.14}$$
where $n_{k,k',i} = \sum_{d,d'} \mathbb{1}(z_{d,d',1} = k \wedge z_{d,d',2} = k' \wedge y_{d,d'} = i)$ counts the number of links
of value $i$ for which the first node is assigned a topic of $k$ for that link and the second
node is assigned a topic of $k'$ for that link, and $n_{k,k'} = \langle n_{k,k',1}, n_{k,k',0}\rangle$ denotes the vector
of these counts. Because the prior of the MMSB is the same as that of LDA, we can
express the posterior (the analogue of Equation D.5) as
$$\begin{aligned}
p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, \mathbf{y}, \mathbf{z}_{-(d,d')})
&\propto \prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})\,\Gamma(n_{d',k'} + \alpha_{k'}) \prod_{k,k'} \frac{B(\eta + n_{k,k'})}{B(\eta)}\\
&\propto \prod_{k'} \Gamma(n_{d,k'} + \alpha_{k'})\,\Gamma(n_{d',k'} + \alpha_{k'}) \prod_{k,k'} \frac{\Gamma(\eta_{y_{d,d'}} + n_{k,k',y_{d,d'}})}{\Gamma(\sum_i \eta_i + n_{k,k',i})}\\
&\propto (n^{\neg d,d'}_{d,k_1} + \alpha_{k_1})(n^{\neg d,d'}_{d',k_2} + \alpha_{k_2})\,\frac{n^{\neg d,d'}_{k_1,k_2,y_{d,d'}} + \eta_{y_{d,d'}}}{H + N^{\neg d,d'}_{k_1,k_2}},
\end{aligned}$$
where $H = \eta_0 + \eta_1$ and where for convenience we denote by $N^{\neg d,d'}_{k,k'}$ the total number of
links with $(k, k')$ as the participating topics, excluding the current link $(z_{d,d',1}, z_{d,d',2})$.
The first line follows by expanding the prior terms as in the derivation of Equation D.5.
The second line follows by expanding $B$ and eliminating terms which are constant, and
the last line follows using the identities used to derive Equation D.12.
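Because the pair's two assignments are sampled jointly, the posterior is a distribution over $K^2$ outcomes. A sketch of the corresponding sampling step, with hypothetical names, follows.

```python
import numpy as np

def sample_mmsb_pair(d, dp, y, n_dk, n_link, alpha, eta):
    """Jointly sample (z_{d,d',1}, z_{d,d',2}) from the pairwise posterior
    above.  A sketch with hypothetical names: n_dk holds per-node topic
    counts and n_link is a (K, K, 2) table of per-topic-pair link counts,
    both with the current pair's assignments already removed; eta is the
    pair (eta_0, eta_1) and alpha a symmetric hyperparameter."""
    K = n_dk.shape[1]
    H = eta[0] + eta[1]
    N_pair = n_link.sum(axis=2)                 # total links per topic pair
    p = (np.outer(n_dk[d] + alpha, n_dk[dp] + alpha)
         * (n_link[:, :, y] + eta[y]) / (H + N_pair))
    p = p.ravel() / p.sum()
    idx = np.random.choice(K * K, p=p)
    return idx // K, idx % K                    # (k1, k2)
```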
D.3 Relational topic model (RTM)
The sampling equations for the relational topic model (RTM) are similar in spirit to
the LDA sampling equations. For brevity, we restrict the derivation to the exponential
response,¹
$$p(y_{d,d'} = 1 \mid \bar z_d, \bar z_{d'}, b) \propto \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right). \tag{D.15}$$

¹Here we depart from the notation used in previous chapters: we use $b$ for the regression coefficients instead of $\eta$, and we omit the regression intercept $\nu$, absorbing it into the normalization constant.
As with the MMSB, the prior distribution on $\mathbf{z}$ is identical to that of LDA, so we
omit its re-derivation. Thus the joint posterior can be written as
$$p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b) \propto \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)} \prod_{d,d'} \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right), \tag{D.16}$$
where the latter product is understood to range over $d, d'$ such that $y_{d,d'} = 1$. The
posterior, following the derivation of Equation D.12, is
$$\begin{aligned}
p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, \mathbf{z}_{-(d,n)})
&\propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta} \prod_{d',d''} \exp\left(b^T(\bar z_{d'} \circ \bar z_{d''})\right)\\
&\propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta} \prod_{d'} \exp\left(b^T(\bar z_d \circ \bar z_{d'})\right)\\
&= (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left((b \circ \bar z_d)^T \sum_{d'} \bar z_{d'}\right).
\end{aligned}\tag{D.17}$$
Notice that
$$\bar z_{d,k'} = \frac{1}{N_d}\sum_{n'} \mathbb{1}(z_{d,n'} = k') = \frac{1}{N_d}\sum_{n' \ne n} \mathbb{1}(z_{d,n'} = k') + \frac{1}{N_d}\mathbb{1}(z_{d,n} = k') = \bar z^{\neg n}_{d,k'} + \frac{1}{N_d}\mathbb{1}(z_{d,n} = k'),$$
where $\bar z^{\neg n}_{d,k'}$ is the mean topic assignment to topic $k'$ in document $d$ excluding that of
the $n$th word, $z_{d,n}$. Because $\bar z^{\neg n}_{d,k'}$ does not depend on the topic assignment $z_{d,n}$, the
last term of Equation D.17 can be efficiently computed as
$$\begin{aligned}
\exp\left((b \circ \bar z_d)^T \sum_{d'} \bar z_{d'}\right)
&= \exp\left(\sum_{k'} \left[b_{k'}\,\bar z_{d,k'} \sum_{d'} \bar z_{d',k'}\right]\right)\\
&= \exp\left(\sum_{k'} \left[b_{k'}\left(\bar z^{\neg n}_{d,k'} + \frac{1}{N_d}\mathbb{1}(k = k')\right) \sum_{d'} \bar z_{d',k'}\right]\right)\\
&= \exp\left(\sum_{k'} \left[b_{k'}\,\bar z^{\neg n}_{d,k'} \sum_{d'} \bar z_{d',k'}\right] + \frac{b_k}{N_d} \sum_{d'} \bar z_{d',k}\right)\\
&\propto \exp\left(\frac{b_k}{N_d} \sum_{d'} \bar z_{d',k}\right) = \exp\left(\frac{b_k}{N_d} \sum_{d'} \frac{n_{d',k}}{N_{d'}}\right),
\end{aligned}\tag{D.18}$$
where the second line follows using our identity on $\bar z_{d,k'}$ and the last proportionality
follows from the fact that the terms in the left sum do not depend on the current
topic assignment, $z_{d,n}$. Finally, the last equality stems from the definitions of $\bar z_{d',k}$ and
$n_{d',k}$. This expression is efficient because it is constant for all words in a document
and thus need only be computed once per document. Combining Equation D.18 with
Equation D.17 yields
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left(\frac{b_k}{N_d} \sum_{d'} \frac{n_{d',k}}{N_{d'}}\right). \tag{D.19}$$
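A minimal sketch of the extra RTM factor, with hypothetical names; the link sum is exactly the per-document quantity that can be cached once and reused for every word in the document.

```python
import numpy as np

def rtm_topic_weights(lda_part, b, link_sum, N_d):
    """Unnormalized weights over k for z_{d,n} under Equation D.19 (a
    sketch with hypothetical names).  lda_part is the usual LDA factor
    from Equation D.12; link_sum = sum_{d'} n_{d',k} / N_{d'} over the
    documents d' linked to d, which is constant for every word in d and
    so need only be computed once per document."""
    return lda_part * np.exp(b * link_sum / N_d)
```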
D.4 Supervised latent Dirichlet allocation (sLDA)
We derive the sampling equations for supervised latent Dirichlet allocation. Here, we
consider Gaussian errors, but the derivation can be easily extended to other models
as well:
$$\begin{aligned}
p(y_d \mid \bar z_d, b, a) &\propto \exp\left(-(y_d - b^T\bar z_d - a)^2\right)\\
&\propto \exp\left(-(y_d - a)^2 + 2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right)\\
&\propto \exp\left(2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right),
\end{aligned}\tag{D.20}$$
where the proportionality is with respect to $\bar z_d$.
The prior distribution on $\mathbf{z}$ is identical to that of LDA. Thus the joint posterior
can be written as
$$p(\mathbf{z} \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a) \propto \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)} \prod_d \exp\left(-(y_d - b^T\bar z_d - a)^2\right). \tag{D.21}$$
The sampling equation, following the derivation of Equation D.12, is
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\left(-(y_d - b^T\bar z_d - a)^2\right). \tag{D.22}$$
The right-most term can be expanded as
$$\begin{aligned}
\exp\left(2b^T\bar z_d(y_d - a) - (b^T\bar z_d)^2\right)
&= \exp\Big(2\sum_{k'} b_{k'}\,\bar z^{\neg n}_{d,k'}(y_d - a) + 2\,\frac{y_d - a}{N_d}\,b_k - (b^T\bar z_d)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - (b^T\bar z_d)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - \Big(\sum_{k'} b_{k'}\,\bar z^{\neg n}_{d,k'} + \frac{b_k}{N_d}\Big)^2\Big)\\
&\propto \exp\Big(2\,\frac{y_d - a}{N_d}\,b_k - 2\,\frac{b_k}{N_d}\,b^T\bar z^{\neg n}_d - \Big(\frac{b_k}{N_d}\Big)^2\Big)\\
&= \exp\Big(2\,\frac{b_k}{N_d}\big(y_d - a - b^T\bar z^{\neg n}_d\big) - \Big(\frac{b_k}{N_d}\Big)^2\Big),
\end{aligned}\tag{D.23}$$
yielding
$$p(z_{d,n} = k \mid \alpha, \eta, \mathbf{w}, \mathbf{y}, b, a, \mathbf{z}_{-(d,n)}) \propto (n^{\neg d,n}_{d,k} + \alpha_k)\,\frac{n^{\neg d,n}_{w_{d,n},k} + \eta_{w_{d,n}}}{N^{\neg d,n}_k + W\eta}\,\exp\Big(2\,\frac{b_k}{N_d}\big(y_d - a - b^T\bar z^{\neg n}_d\big) - \Big(\frac{b_k}{N_d}\Big)^2\Big). \tag{D.24}$$
D.5 Networks uncovered by Bayesian inference (NUBBI)
model
The networks uncovered by Bayesian inference (NUBBI) model is a switching model,
wherein each word can be explained by one of three distributions — the distribution of
the first entity θe, the distribution of the second entity θe′ , or the distribution over their
relationships ψe,e′ . Each of these generates topic assignments with the same structure
as LDA, so their contributions to the posterior are the same, conditioned on the
assignments from words to distributions, which also follows a Dirichlet-Multinomial
distribution.
Hence, the joint posterior over topic assignments and source assignments is
$$p(\mathbf{z}, \mathbf{c} \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}) \propto \prod_e \frac{B(n^e_{e,\cdot} + \alpha_\theta)}{B(\alpha_\theta)} \prod_k \frac{B(\eta_\theta + n^e_{\cdot,k})}{B(\eta_\theta)} \cdot
\prod_\varepsilon \frac{B(n^\varepsilon_{\varepsilon,\cdot} + \alpha_\psi)}{B(\alpha_\psi)} \prod_k \frac{B(\eta_\psi + n^\varepsilon_{\cdot,k})}{B(\eta_\psi)}
\prod_\varepsilon \frac{B(n^c_{\varepsilon,\cdot} + \alpha_\pi)}{B(\alpha_\pi)}. \tag{D.25}$$
Here we have used the shorthand ε = (e, e′) to denote iteration over pairs of entities.
We have also introduced new count variables for documents associated with individual
entities, documents associated with pairs of entities, and source assignments:
$$n^e_{w,k} = \sum_e \sum_i \mathbb{1}(z_{e,i} = k \wedge w_{e,i} = w)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 1)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 2) \tag{D.26}$$
$$n^\varepsilon_{w,k} = \sum_\varepsilon \sum_i \mathbb{1}(c_{\varepsilon,i} = 3 \wedge w_{\varepsilon,i} = w) \tag{D.27}$$
$$n^e_{e,k} = \sum_i \mathbb{1}(z_{e,i} = k)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 1 \wedge \varepsilon_1 = e)
+ \sum_\varepsilon \sum_i \mathbb{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 2 \wedge \varepsilon_2 = e) \tag{D.28}$$
$$n^\varepsilon_{\varepsilon,k} = \sum_i \mathbb{1}(c_{\varepsilon,i} = 3 \wedge z_{\varepsilon,i} = k) \tag{D.29}$$
$$n^c_{\varepsilon,k} = \sum_i \mathbb{1}(c_{\varepsilon,i} = k) \tag{D.30}$$
with marginals being defined as before. There are two sampling equations to be
considered. First, when sampling the topic assignment for a word in an entity's
document,
$$p(z_{e,n} = k \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(e,n)}, \mathbf{c}) \propto (n^{e,(\neg e,n)}_{e,k} + \alpha_{\theta,k})\,\frac{n^{e,(\neg e,n)}_{w_{e,n},k} + \eta_{\theta,w_{e,n}}}{N^{e,(\neg e,n)}_k + W\eta_\theta}. \tag{D.31}$$
Second, when sampling the topic and source assignment for a word in an entity
pair's document,
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 1 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{e,(\neg\varepsilon,n)}_{e,k} + \alpha_{\theta,k}}{n^{e,(\neg\varepsilon,n)}_{e,\cdot} + K_\theta\alpha_\theta}\cdot\frac{n^{e,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\theta,w_{\varepsilon,n}}}{N^{e,(\neg\varepsilon,n)}_k + W\eta_\theta}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,1} + \alpha_\pi\big) \tag{D.32}$$
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 2 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{e,(\neg\varepsilon,n)}_{e',k} + \alpha_{\theta,k}}{n^{e,(\neg\varepsilon,n)}_{e',\cdot} + K_\theta\alpha_\theta}\cdot\frac{n^{e,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\theta,w_{\varepsilon,n}}}{N^{e,(\neg\varepsilon,n)}_k + W\eta_\theta}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,2} + \alpha_\pi\big) \tag{D.33}$$
$$p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 3 \mid \boldsymbol\alpha, \boldsymbol\eta, \mathbf{w}, \mathbf{z}_{-(\varepsilon,n)}, \mathbf{c}_{-(\varepsilon,n)}) \propto \frac{n^{\varepsilon,(\neg\varepsilon,n)}_{\varepsilon,k} + \alpha_{\psi,k}}{n^{\varepsilon,(\neg\varepsilon,n)}_{\varepsilon,\cdot} + K_\psi\alpha_\psi}\cdot\frac{n^{\varepsilon,(\neg\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\psi,w_{\varepsilon,n}}}{N^{\varepsilon,(\neg\varepsilon,n)}_k + W\eta_\psi}\,\big(n^{c,(\neg\varepsilon,n)}_{\varepsilon,3} + \alpha_\pi\big). \tag{D.34}$$
Bibliography
E. Agichtein and L. Gravano. Querying text databases for efficient information
extraction. In Proceedings of the International Conference on Data Engineering (ICDE),
2003. doi: http://doi.ieeecomputersociety.org/10.1109/ICDE.2003.1260786.
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic block-
models. Journal of Machine Learning Research, 9:1981–2014, September 2008.
URL http://arxiv.org/pdf/0705.4485.
A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social
networks. KDD 2008, 2008.
G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. Proceed-
ings of the 24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273496.1273501.
C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonpara-
metric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni.
Open information extraction from the web. In IJCAI 2007, 2007. URL
http://www.ijcai.org/papers07/Papers/IJCAI07-429.pdf.
K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching
words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.
J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–
195, 1975. ISSN 00390526. doi: http://dx.doi.org/10.2307/2987782. URL
http://dx.doi.org/10.2307/2987782.
J. Besag. On the statistical analysis of dirty pictures. Jour-
nal of the Royal Statistical Society, 48(3):259–302, 1986. URL
http://www.informaworld.com/index/739172868.pdf.
I. Bhattacharya, S. Godbole, and S. Joshi. Structured entity identification and
document categorization: Two tasks with one joint model. KDD 2008, 2008.
C. M. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference
engine for Bayesian networks. In NIPS 2002, 2002.
A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation
using an adaptive GMMRF model. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 428–441, 2004.
D. Blei and M. Jordan. Modeling annotated data. Proceedings of the 26th annual
international ACM SIGIR Conference on Research and Development in Information
Retrieval, 2003. URL http://portal.acm.org/citation.cfm?id=860460.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, 2003a.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet alloca-
tion. Journal of Machine Learning Research, 2003b. URL
http://www.mitpressjournals.org/doi/abs/10.1162/jmlr.2003.3.4-5.993.
D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121–144, Oct 2006.
D. M. Blei and J. D. McAuliffe. Supervised topic models. Neural Information
Processsing Systems, Aug 2007.
J. Boyd-Graber and D. M. Blei. Syntactic topic models. In Neural Information
Processing Systems, Dec 2008.
M. Braun and J. McAuliffe. Variational inference for large-scale models
of discrete choice. Arxiv preprint arXiv:0712.2526, Jan 2007. URL
http://arxiv.org/pdf/0712.2526.
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Mining hidden commu-
nity in heterogeneous social networks. LinkKDD 2005, Aug 2005. URL
http://portal.acm.org/citation.cfm?id=1134271.1134280.
M. Carreira-Perpinan and G. Hinton. On contrastive divergence
learning. Artificial Intelligence and Statistics, Jan 2005. URL
http://www.csri.utoronto.ca/ hinton/absps/cdmiguel.pdf.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext clas-
sification using hyperlinks. Proc. ACM SIGMOD, 1998. URL
http://citeseer.ist.psu.edu/article/chakrabarti98enhanced.html.
J. Chang and D. M. Blei. Relational topic models for document networks. In Artificial
Intelligence and Statistics (AISTATS), 2009.
J. Chang and D. M. Blei. Hierarchical relational models for document networks.
Annals of Applied Statistics, 4(1), 2010.
J. Chang, J. Boyd-Graber, and D. M. Blei. Connections between the lines: Augmenting
social networks with text. In KDD 2009, 2009.
S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for me models. IEEE
Transactions on Speech and Audio Processing, 8(1), Jun 2000.
D. Cohn and T. Hofmann. The missing link: a probabilistic model of document content
and hypertext connectivity. Advances in Neural Information Processing Systems 13,
2001.
M. Craven, D. DiPasquo, D. Freitag, and A. McCallum. Learning to ex-
tract symbolic knowledge from the world wide web. Proc. AAAI, 1998. URL
http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/1998/CMU-CS-98-122.pdf.
A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and
contact information from email and the web. AAAI 2005, 2005. URL
http://www.cs.umass.edu/ ronb/papers/dex.pdf.
D. Davidov, A. Rappoport, and M. Koppel. Fully unsupervised discovery of concept-
specific relationships by web mining. In ACL, 2007.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
C. Diehl, G. M. Namata, and L. Getoor. Relationship identification for social network
discovery. In AAAI 2007, July 2007.
L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences.
Proc. ICML, 2007. URL http://portal.acm.org/citation.cfm?id=1273526.
M. Dudík, S. Phillips, and R. Schapire. Maximum entropy density estimation
with generalized regularization and an application to species distribution mod-
eling. The Journal of Machine Learning Research, 8:1217–1260, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1314540.
B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78(382), 1983.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific
publications. Proceedings of the National Academy of Sciences, 2004.
E. Erosheva, S. Fienberg, and C. Joutard. Describing disability through individual-
level mixture models for multivariate binary data. Annals of Applied Statistics,
2007.
L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. Computer Vision and Pattern Recognition, 2005.
T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of
Statistics, 1:209–230, 1973.
S. E. Fienberg, M. M. Meyer, and S. Wasserman. Statistical analysis of multiple
sociometric relations. Journal of the American Statistical Association, 80:51—67,
1985.
M. E. Fisher. On the dimer solution of planar Ising models. Journal of
Mathematical Physics, 7(10):1776–1781, 1966. doi: 10.1063/1.1704825. URL
http://link.aip.org/link/?JMP/7/1776/1.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with
the graphical lasso. Biostatistics, 2007.
J. Gao, H. Suzuki, and B. Yu. Approximation lasso methods for language modeling.
Proceedings of the 21st International Conference on Computational Linguistics, Jan
2006. URL http://acl.ldc.upenn.edu/P/P06/P06-1029.pdf.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, (6):721–741, 1984.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning prob-
abilistic models of relational structure. Proc. ICML, 2001. URL
http://ai.stanford.edu/users/nir/Papers/GFTK1.pdf.
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web commu-
nities from link topology. HYPERTEXT 1998, May 1998. URL
http://portal.acm.org/citation.cfm?id=276627.276652.
A. Globerson and T. S. Jaakkola. Approximate inference using planar graph decom-
position. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19, pages 473–480. MIT Press, Cambridge, MA,
2007.
A. Globerson, T. Koo, X. Carreras, and M. Collins. Exponentiated gra-
dient algorithms for log-linear structured prediction. Proceedings of the
24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273535.
J. Goodman. Exponential priors for maximum entropy models. Mar 2004.
A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Uncertainty
in Artificial Intelligence, May 2008.
P. Haffner, S. Phillips, and R. Schapire. Efficient multiclass implementations of
l1-regularized maximum entropy. May 2006.
P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network
analysis. Journal of the American Statistical Association, 2002.
J. Hofman and C. Wiggins. A Bayesian approach to network modularity. eprint arXiv:
0709.3512, 2007. URL http://arxiv.org/pdf/0709.3512.
T. Hofmann. Probabilistic latent semantic indexing. SIGIR, 1999. URL
http://portal.acm.org/citation.cfm?id=312649.
E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253–258,
1925.
T. S. Jaakkola and M. I. Jordan. Variational methods and the QMR-DT database. MIT
Computational Cognitive Science Technical Report 9701, page 23, Jan 1999.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to
variational methods for graphical models. Oct 1999.
D. Jurafsky and J. Martin. Speech and language processing. Prentice Hall, 2008.
S. Katrenko and P. Adriaans. Learning relations from biomedical corpora us-
ing dependency trees. Lecture Notes in Computer Science, 2007. URL
http://www.springerlink.com/index/n145566q7t1u4365.pdf.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with
inequality constraints. Jun 2003.
C. Kemp, T. Griffiths, and J. Tenenbaum. Discovering latent
classes in relational data. MIT AI Memo 2004-019, 2004. URL
http://www-psych.stanford.edu/ gruffydd/papers/blockTR.pdf.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the
ACM (JACM), 1999. URL http://portal.acm.org/citation.cfm?id=324140.
M. Kolar and E. P. Xing. Improved estimation of high-dimensional ising models, 2008.
URL http://www.citebase.org/abstract?id=oai:arXiv.org:0811.1239.
V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization.
IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568–1583, October 2006. ISSN
0162-8828. doi: http://dx.doi.org/10.1109/TPAMI.2006.200. URL
http://dx.doi.org/10.1109/TPAMI.2006.200.
J. Lafferty and L. Wasserman. Rodeo: Sparse, greedy nonparametric regression. The
Annals of Statistics, Jan 2008.
J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of
social networks. KDD 2008, 2008a.
J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Statistical properties of
community structure in large social and information networks. WWW 2008, 2008b.
URL http://portal.acm.org/citation.cfm?id=1367591.
D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson.
The bottlenose dolphin community of doubtful sound features a large proportion
of long-lasting associations. can geographic isolation explain this unique trait?
Behavioral Ecology and Sociobiology, 54:396–405, 2003.
J. Majewski, H. Li, and J. Ott. The Ising model in physics and statistical genetics.
American Journal of Human Genetics, 69:853–862, 2001.
R. Malouf. A comparison of algorithms for maximum entropy parameter estimation.
Jun 2002.
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction
of internet portals with machine learning. Information Retrieval, 2000. URL
http://www.springerlink.com/index/R1723134248214T0.pdf.
A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in
social networks. Proceedings of the Nineteenth International Joint Conference on
Artificial Intelligence, 2005. URL http://www.ijcai.org/papers/1623.pdf.
A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and
D. Jensen. Exploiting relational structure to understand publication patterns
in high-energy physics. ACM SIGKDD Explorations Newsletter, 5(2), Dec 2003.
URL http://portal.acm.org/citation.cfm?id=980972.980999.
E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary
latent factors. NIPS 2007, 2007.
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent
patterns. KDD 2007, 1(3), 2007.
Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization.
WWW ’08: Proceeding of the 17th international conference on World Wide Web,
Apr 2008. URL http://portal.acm.org/citation.cfm?id=1367497.1367512.
N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection
with the lasso. Annals of Statistics, Jan 2006.
T. Minka and Y. Qi. Tree-structured approximations by expectation propagation. In
Proc. Neural Information Processing Systems Conf. (NIPS), 2003.
K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate
inference: An empirical study. Proceedings of Uncertainty in AI, Jan 1999. URL
http://www.vision.ethz.ch/ks/slides/murphy99loopy.pdf.
R. Nallapati and W. Cohen. Link-pLSA-LDA: A new unsupervised model for topics
and influence of blogs. ICWSM, 2008.
R. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for
text and citations. Proceedings of the 14th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2008.
O. J. Nave. Nave’s Topical Bible. Thomas Nelson, 2003. ISBN 0785250581.
R. M. Neal. Probabilistic inference using markov chain monte carlo methods. CRG-
TR-93-1, May 1993.
D. Newman, C. Chemudugunta, and P. Smyth. Statistical entity-topic models. In
KDD 2006, pages 680–686, New York, NY, USA, 2006a. ACM. ISBN 1-59593-339-5.
doi: http://doi.acm.org/10.1145/1150402.1150487.
M. Newman. The structure and function of net-
works. Computer Physics Communications, 2002. URL
http://linkinghub.elsevier.com/retrieve/pii/S0010465502002011.
M. E. J. Newman. Finding community structure in networks using the eigenvectors of
matrices. Phys. Rev. E, 74(036104), 2006a.
M. E. J. Newman. Modularity and community structure in networks. Proceedings of
the National Academy of Sciences, 103(23), 2006b. doi: 10.1073/pnas.0601602103.
URL http://arxiv.org/abs/physics/0602124v1.
M. E. J. Newman, A.-L. Barabási, and D. J. Watts. The Structure and Dynamics of
Networks. Princeton University Press, 2006b.
T. Ohta, Y. Tateisi, and J.-D. Kim. GENIA corpus: an annotated research abstract
corpus in molecular biology domain. In HLT 2002, San Diego, USA, 2002. URL
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/paper/hlt2002GENIA.pdf.
J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference.
1988.
J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using
multilocus genotype data. Genetics, 155:945–959, June 2000.
S. Riezler and A. Vasserman. Incremental feature selection and l1 regularization
for relaxed maximum-entropy modeling. In Proceedings of the 2004 Conference on
Empirical Methods in NLP, Jan 2004. URL
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Riezler.pdf.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for
authors and documents. In AUAI 2004, pages 487–494, Arlington, Virginia, United
States, 2004. AUAI Press. ISBN 0-9749039-0-6.
S. Sahay, S. Mukherjea, E. Agichtein, E. Garcia, S. Navathe, and A. Ram. Discovering
semantic biomedical relations utilizing the web. KDD 2008, 2(1), Mar 2008. URL
http://portal.acm.org/citation.cfm?id=1342320.1342323.
S. Sampson. Crisis in a cloister. PhD thesis, Cornell University, 1969.
L. Saul and M. Jordan. Exploiting tractable substructures in intractable networks.
Advances in Neural Information Processing Systems, pages 486–492, 1996.
L. Saul and M. Jordan. A mean field learning algorithm for unsupervised neural
networks. Learning in Graphical Models, Jan 1999.
Y. Shi and T. Duke. Cooperative model of bacterial sensing. Phys. Rev. E, 58(5):
6399–6406, Nov 1998. doi: 10.1103/PhysRevE.58.6399.
J. Sinkkonen, J. Aukia, and S. Kaski. Component models for large networks. arXiv,
stat.ML, Mar 2008. URL http://arxiv.org/abs/0803.1628v1.
D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. Advances
in Neural Information Processing Systems, 21, 2007.
M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic
Analysis, 2007.
R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agar-
wala, M. Tappen, and C. Rother. A comparative study of energy min-
imization methods for markov random fields with smoothness-based priors.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(6):
1068–1080, 2008. doi: http://dx.doi.org/10.1109/TPAMI.2007.70844. URL
http://dx.doi.org/10.1109/TPAMI.2007.70844.
H. Takamura, T. Inui, and M. Okumura. Extracting semantic orientations
of words using spin model. In ACL ’05: Proceedings of the 43rd Annual
Meeting on Association for Computational Linguistics, pages 133–140, Mor-
ristown, NJ, USA, 2005. Association for Computational Linguistics. doi:
http://dx.doi.org/10.3115/1219840.1219857.
L. Tanabe, N. Xie, L. H. Thom, W. Matten, and W. J. Wilbur. GeneTag: a tagged
corpus for gene/protein named entity recognition. BMC Bioinformatics, 6 Suppl
1, 2005. ISSN 1471-2105. doi: http://dx.doi.org/10.1186/1471-2105-6-S1-S3. URL
http://dx.doi.org/10.1186/1471-2105-6-S1-S3.
B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS 2003, 2003.
B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. Ad-
vances in Neural Information Processing Systems, Jan 2004a. URL
http://web.engr.oregonstate.edu/ tgd/classes/539/slides/max-margin-markov-networks.pdf.
B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS, 2004b.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of
the American Statistical Association, 101(476):1566–1581, 2007.
M. Wainwright and M. Jordan. A variational principle for graphical models. In New
Directions in Statistical Signal Processing, chapter 11. MIT Press, 2005a.
M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference
in discrete markov random fields. Signal Processing, IEEE Transactions on, 54(6):
2099–2109, June 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.874409.
M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation
algorithms and approximate ml estimation by pseudomoment matching. Artificial
Intelligence and Statistics, Jan 2003.
M. J. Wainwright and M. I. Jordan. Variational inference in graphical models: The
view from the marginal polytope. Allerton Conference on Control, Communication
and Computing, Apr 2003.
M. J. Wainwright and M. I. Jordan. A variational principle for graphical models. New
Directions in Statistical Signal Processing, 2005b.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and
variational inference. Foundations and Trends in Machine Learnings, 1(1 – 2):1–305,
Dec 2008.
M. J. Wainwright, P. Ravikumar, and J. D. Lafferty. High-dimensional graphical model
selection using l1-regularized logistic regression. Neural Information Processing
Systems, Jan 2006.
X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations
and text. Proceedings of the 3rd international workshop on Link discovery, 2005.
URL http://portal.acm.org/citation.cfm?id=1134276.
S. Wasserman and P. Pattison. Logit models and logistic regressions for social
networks: I. An introduction to Markov graphs and p*. Psychometrika, 1996. URL
http://www.springerlink.com/index/T2W46715636R2H11.pdf.
M. Welling and G. Hinton. A new learning algorithm for mean field boltzmann
machines. Artificial Neural Networks-Icann 2002, Jan 2002.
M. Welling and Y. W. Teh. Belief optimization for binary networks: a stable alternative
to loopy belief propagation. In In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, pages 554–561, 2001.
D. J. A. Welsh. The computational complexity of some classical problems from
statistical physics. In In Disorder in Physical Systems, pages 307–321. Clarendon
Press, 1990.
Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In UAI,
2006.
Z. Xu, V. Tresp, S. Yu, and K. Yu. Nonparametric relational learning for social
network analysis. In 2nd ACM Workshop on Social Network Mining and Analysis
(SNA-KDD 2008), 2008.
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and
its generalizations. Exploring artificial intelligence in the new millennium, pages
239–269, 2003.
W. Zachary. An information flow model for conflict and fission in small groups. Journal
of Anthropological Research, 33:452–473, 1977.
D. Zhou, S. Zhu, K. Yu, X. Song, B. Tseng, H. Zha, and C. Giles. Learning
multiple graphs for document recommendations. WWW 2008, Apr 2008. URL
http://portal.acm.org/citation.cfm?id=1367497.1367517.
H. Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society Series B, Jan 2005. URL
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-9868.2005.00503.x.