School of Computer Science
Stochastic Block Models of Mixed Membership
Edo Airoldi 1,2, Dave Blei 2, Steve Fienberg 1, Eric Xing 1
1 Carnegie Mellon University & 2 Princeton University
SAMSI, High Dimensional Inference and Random Matrices, September 17th, 2006
The Scientific Problem
• Protein-protein interactions in Yeast
• Different studies test protein interactions with different technologies (precision)
[Figure panels: Expression graphs; Interaction graphs]
The Data: Interaction Graphs
• M proteins in a graph (nodes)
• $M^2$ observations on pairs of proteins
– Edges are random quantities, $Y[n,m]$
• Interactions are not independent
– Interacting proteins form a protein complex
• T graphs on the same set of proteins
• Partial annotations for each protein, $X[n]$
Example: M = 871 nodes, $M^2 \approx 750\text{K}$ entries
The Scientific Problems
• What are stable protein complexes?
– They perform many cellular processes
– A protein may be a member of several complexes
• How many are there?
• How do stable protein complexes interact?
– Test hypotheses (inform new analyses)
– Learn complex-to-complex interaction patterns
More Network Data
[Figure panels: Disease Spread; Social Network; Food Web; Electronic Circuit; Internet]
An Abstraction of the Data
• A collection of unipartite graphs: $G_{1:T} = (Y_{1:T}, \mathcal{N})$
• Integer, real, multivariate edge weights: $Y_t = \{ Y_t[n,m] : n, m \in \mathcal{N} \}$
• Node-specific (multivariate) attributes: $X_{1:T} = \{ X_t[n] : n \in \mathcal{N} \}$
• Partially observable $Y_{1:T}$ and $X_{1:T}$
The Challenge
• Given the data abstraction and the goals of the analysis
• Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face, and that is also amenable to theoretical analysis?
Modeling Ideas
• Hierarchical Bayes
– Latent variables encode semantic elements
– Assume structure on observable-latent elements
• Combination of two classes of models
1. Models of mixed membership
2. Network models (block models)
= Stochastic block models of mixed membership
Graphical Model Representation
[Figure: graphical model; parts labeled Mixed Membership and Stochastic Blocks]
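A Hierarchical Likelihood
• Interactions (observed*): $y_{ij} = 1$ marks an edge between nodes $i$ and $j$
• Mixed membership vectors (latent*): $\theta_i$, $\theta_j$ over groups 1, 2, 3
• Group-to-group patterns (latent*): matrix $\eta$ indexed by groups $g$, $h$, e.g. $\eta_{23} = 0.9$
• $\Pr(y_{ij} = 1 \mid \theta_i, \theta_j, \eta) = \theta_i^{T} \, \eta \, \theta_j$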
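The bilinear form on this slide can be checked numerically. The sketch below is a minimal illustration assuming three groups, as in the figure. The names `edge_probability`, `theta_i`, `theta_j`, and `eta` are illustrative labels for the membership vectors and the group-to-group matrix, and every entry of `eta` other than the 0.9 in position (2, 3) is made up for the example.

```python
def edge_probability(theta_i, theta_j, eta):
    """Return theta_i^T * eta * theta_j for two membership vectors and a block matrix."""
    return sum(
        theta_i[g] * eta[g][h] * theta_j[h]
        for g in range(len(theta_i))
        for h in range(len(theta_j))
    )


# Hypothetical 3x3 group-to-group matrix; only the (2, 3) entry, 0.9, comes from the slide.
eta = [
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.9],
    [0.0, 0.2, 0.6],
]

# Node i belongs fully to group 2 and node j fully to group 3,
# so the probability reduces to the single entry eta[1][2].
pure = edge_probability([0.0, 1.0, 0.0], [0.0, 0.0, 1.0], eta)

# A sender split evenly between groups 1 and 2 averages the two corresponding entries.
mixed = edge_probability([0.5, 0.5, 0.0], [0.0, 0.0, 1.0], eta)
```

With pure memberships the model behaves like an ordinary stochastic block model; mixed memberships interpolate between block entries.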
More Modeling Issues
• Technical :: Sparsity
– Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering
• Biological :: Ribosomes & Distress
– Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (Y2H technology is invasive)
Large Scale Computation
• Masses of data
– 750K observations in a small problem (M = 871)
– 2.5M observations with M = 1578
– 3M expressions for 6K genes/proteins in Yeast
• Variational inference [Jordan et al., 2001]
– Naïve implementation does not work
– We develop a novel “nested” variational algorithm
Example: A Scientific Question
• Do PPI contain information about functions?
[Diagram labels: Raw data; Model; Approximate posterior on membership vectors; "?"; Functional annotations. Example protein: YLD014W]
Interactions in Yeast (MIPS)
• Do PPI contain information about functions?
[Figure: plot for protein YLD014W; vertical axis from 0 to 1, horizontal axis 1, 2, 3, ..., 15]
Results: Identifiability
• In this example we map latent groups to known functional categories
[Figure panels: Known Annotations; Unknown Annotations]
Results: Functional Annotations
Results: Mixed Membership
[Figure annotation: Mixed membership]
• The estimated membership vectors support the mixed membership assumption
Results: Stochastic Block Model
General Bayesian Formulation
• Assumptions for unipartite graphs
– Population: existence of K sub-populations
– Latent variable: mixed membership vectors $\theta[n] \sim D$
– Subject: exchangeable edges given blocks and memberships, $Y[n,m] \sim f(\,\cdot \mid \theta[n]^{T} \, \eta \, \theta[m])$
– Sampling scheme: the graphs are IID
• Additional data, e.g., attributes, annotations
– Integrated model formulation (descriptive/predictive)
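One concrete instance of this formulation can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes the membership distribution D to be a Dirichlet, takes f to be Bernoulli, and draws a sender group and a receiver group for each ordered pair before looking up the block probability. The names `sample_dirichlet`, `sample_graph`, `theta`, and `eta`, and all numeric values, are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_graph(n_nodes, alpha, eta, rng):
    """Sample membership vectors and a binary directed graph (assumed Bernoulli edges)."""
    n_groups = len(alpha)
    theta = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    Y = [[0] * n_nodes for _ in range(n_nodes)]
    for n in range(n_nodes):
        for m in range(n_nodes):
            if n == m:
                continue  # no self-interactions in this sketch
            # Each node picks the group it acts in for this particular pair.
            g = rng.choices(range(n_groups), weights=theta[n])[0]
            h = rng.choices(range(n_groups), weights=theta[m])[0]
            Y[n][m] = 1 if rng.random() < eta[g][h] else 0
    return theta, Y


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]          # hypothetical Dirichlet parameters
eta = [[0.9, 0.05, 0.05],        # hypothetical group-to-group probabilities
       [0.05, 0.9, 0.05],
       [0.05, 0.05, 0.9]]
theta, Y = sample_graph(8, alpha, eta, rng)
```

Averaging over the per-pair group draws gives an edge probability equal to the bilinear form of the two membership vectors and the block matrix, which is why this sampling scheme is consistent with the likelihood stated earlier in the deck.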
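Variational Algorithms
• Naïve algorithm:
– init ($\gamma_i \ \forall i$, $\phi_{ij} \ \forall ij$)
– while (approximate log-likelihood has not converged)
    update ($\phi_{ij} \ \forall ij$)
    update ($\gamma_i \ \forall i$)
• Nested algorithm:
– init ($\gamma_i \ \forall i$)
– while (approximate log-likelihood has not converged)
    loop over pairs $ij$
      • init $\phi_{ij}$
      • while (approximate log-likelihood has not converged)
          update $\phi_{ij}$
      • partially update ($\gamma_i$, $\gamma_j$)
We trade space for time but …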
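The nested schedule can be sketched in plain Python. This is a simplified illustration under several assumptions, not the authors' algorithm: the block matrix `eta` is held fixed, fixed iteration counts stand in for the convergence checks, the partial updates are accumulated into a fresh `gamma` on each sweep, and the update rules are generic mean-field updates for Dirichlet memberships with Bernoulli edges, written from general knowledge rather than taken from the slides. All function and variable names are hypothetical, and entries of `eta` must lie strictly between 0 and 1.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series


def softmax(scores):
    """Normalize log-scores into probabilities, subtracting the max for stability."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def fit_pair(y, elog_i, elog_j, eta, inner_iters=10):
    """Inner loop: alternate sender/receiver indicator updates for one pair."""
    K = len(eta)
    # Log-probability of the observed edge value under each (sender, receiver) block.
    loglik = [[math.log(eta[g][h] if y else 1.0 - eta[g][h]) for h in range(K)]
              for g in range(K)]
    phi_send = [1.0 / K] * K
    phi_recv = [1.0 / K] * K
    for _ in range(inner_iters):
        phi_send = softmax([elog_i[g] + sum(phi_recv[h] * loglik[g][h] for h in range(K))
                            for g in range(K)])
        phi_recv = softmax([elog_j[h] + sum(phi_send[g] * loglik[g][h] for g in range(K))
                            for h in range(K)])
    return phi_send, phi_recv


def nested_vi(Y, eta, alpha, outer_iters=20, seed=0):
    """Outer loop over sweeps; pair-level phi vectors are discarded after use."""
    rng = random.Random(seed)
    N, K = len(Y), len(eta)
    # Random start so that the groups are not perfectly symmetric.
    gamma = [[alpha + rng.random() for _ in range(K)] for _ in range(N)]
    for _ in range(outer_iters):
        # Expected log-memberships under the current Dirichlet parameters.
        elog = []
        for row in gamma:
            base = digamma(sum(row))
            elog.append([digamma(v) - base for v in row])
        new_gamma = [[alpha] * K for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                phi_send, phi_recv = fit_pair(Y[i][j], elog[i], elog[j], eta)
                # Partial update: fold this pair's contribution into both nodes.
                for k in range(K):
                    new_gamma[i][k] += phi_send[k]
                    new_gamma[j][k] += phi_recv[k]
        gamma = new_gamma
    return gamma


# Toy graph with two dense blocks of three nodes each (made-up data).
Y = [[1 if (i != j and (i < 3) == (j < 3)) else 0 for j in range(6)] for i in range(6)]
eta = [[0.9, 0.1], [0.1, 0.9]]
gamma = nested_vi(Y, eta, alpha=0.1)
```

Only one pair's indicator vectors exist at any moment, so memory scales with the number of nodes rather than the number of pairs, at the price of re-fitting every pair on each sweep. A naïve variant would instead keep all pair-level vectors in memory between sweeps.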
Variational Algorithms for MMSB
On a single machine* we empirically observed: faster convergence (offsets extra computation), and more stable paths to convergence.
[Figure: panels labeled Naïve, Naïve, Nested, Nested]
Take Home Points
• Bayesian formulation is integral to the biology
• A novel class of models that combines MM for soft-clustering & network models for dependent data
• Latent aspects capture patterns that correlate with, and help predict, functional processes in the cell
• Current implementation allows for fast inference on large matrices through variational approximation; there is considerable opportunity to improve both the computation and the efficiency of the approximation
• Data & Problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature.
• Mixed Membership Models
– Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003ab); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006)
• Stochastic network models
– Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006)
• More material on the Web at: http://www.cs.cmu.edu/~eairoldi/
• ICML Workshop on “Statistical Network Analysis: Models, Issues and New Directions” on June 29 at Carnegie Mellon, Pittsburgh PA: http://nlg.cs.cmu.edu/