School of Computer Science

Stochastic Block Models of Mixed Membership

Edo Airoldi 1,2, Dave Blei 2, Steve Fienberg 1, Eric Xing 1

1 Carnegie Mellon University & 2 Princeton University

SAMSI, High Dimensional Inference and Random Matrices, September 17th, 2006

2


The Scientific Problem

• Protein-protein interactions in Yeast

• Different studies test protein interactions with different technologies (precision)

[Figure panels: Expression graphs; Interaction graphs]

3


The Data: Interaction Graphs

• M proteins in a graph (nodes)

• M² observations on pairs of proteins

– Edges are random quantities, Y[n,m]

• Interactions are not independent

– Interacting proteins form a protein complex

• T graphs on the same set of proteins

• Partial annotations for each protein, X[n]

[Figure: M = 871 nodes, M² ≈ 750K entries]

4


The Scientific Problems

• What are stable protein complexes?

– They perform many cellular processes

– A protein may be a member of several complexes

• How many are there?

• How do stable protein complexes interact?

– Test hypotheses (inform new analyses)

– Learn complex-to-complex interaction patterns

5


More Network Data

[Figure panels: Disease Spread; Social Network; Food Web; Electronic Circuit; Internet]

6


An Abstraction of the Data

• A collection of unipartite graphs: $G_{1:T} = (Y_{1:T}, N)$

• Integer, real, multivariate edge weights: $Y_t = \{ Y_t[n,m] : n, m \in N \}$

• Node-specific (multivariate) attributes: $X_{1:T} = \{ X_t[n] : n \in N \}$

• Partially observable $Y_{1:T}$ and $X_{1:T}$
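The abstraction above can be sketched as a small container type. This is only an illustrative sketch, not the data structure used in the talk: the class name `GraphCollection`, its fields, and the node labels are hypothetical, and unobserved entries are simply represented by missing dictionary keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GraphCollection:
    """T graphs on one shared node set N, with partially observed Y and X."""
    nodes: List[str]
    # edges[t][(n, m)] holds the edge weight Y_t[n, m]; a missing key means "unobserved".
    edges: List[Dict[Tuple[str, str], float]] = field(default_factory=list)
    # attributes[t][n] holds the attribute vector X_t[n]; a missing key means "unobserved".
    attributes: List[Dict[str, List[float]]] = field(default_factory=list)

    def edge(self, t: int, n: str, m: str) -> Optional[float]:
        """Return Y_t[n, m], or None when the pair was not observed."""
        return self.edges[t].get((n, m))

    def observed_fraction(self, t: int) -> float:
        """Fraction of the |N|^2 ordered pairs observed in graph t."""
        return len(self.edges[t]) / (len(self.nodes) ** 2)


# Hypothetical toy data: three nodes, one graph, two observed pairs.
data = GraphCollection(
    nodes=["a", "b", "c"],
    edges=[{("a", "b"): 1.0, ("b", "c"): 0.0}],
    attributes=[{"a": [1.0, 0.0]}],
)
observed = data.edge(0, "a", "b")   # an observed interaction
missing = data.edge(0, "a", "c")    # an unobserved pair
fraction = data.observed_fraction(0)
```

Keeping unobserved pairs out of the dictionary, rather than storing them as zeros, keeps "no interaction" distinct from "not measured".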

7


The Challenge

• Given the data abstraction and the goals of the analysis

• Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face, and amenable to theoretical analyses?

8


Modeling Ideas

• Hierarchical Bayes

– Latent variables encode semantic elements

– Assume structure on observable-latent elements

• Combination of two classes of models

1. Models of mixed membership

2. Network models (block models)

= Stochastic block models of mixed membership

9


Graphical Model Representation

[Figure: graphical model with two labeled components, Mixed Membership and Stochastic Blocks]

10


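A Hierarchical Likelihood

• Interactions (observed*): $y_{ij} = 1$ for an interacting pair of nodes $i$, $j$

• Mixed membership vectors (latent*): $\pi_i$, $\pi_j$ over groups 1, 2, 3

• Group-to-group patterns (latent*): $B_{gh}$ for groups $g$, $h$, e.g. $B_{23} = 0.9$

$$\Pr(y_{ij} = 1 \mid \pi_i, \pi_j, B) = \pi_i^{\top} B \, \pi_j$$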
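A minimal numerical sketch of this likelihood: the probability of an edge is a bilinear form in the two membership vectors and the group-to-group matrix. The names `pi_i`, `pi_j` and `B`, and every number except the 0.9 entry between groups 2 and 3, are assumptions made for illustration.

```python
def edge_probability(pi_i, pi_j, B):
    """Return pi_i^T B pi_j: the group-to-group rates averaged over both memberships."""
    K = len(B)
    return sum(pi_i[g] * B[g][h] * pi_j[h] for g in range(K) for h in range(K))


# Hypothetical 3-group pattern matrix; entry (2, 3) in 1-based indexing is 0.9.
B = [
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.9],
    [0.0, 0.9, 0.6],
]

pi_i = [0.0, 1.0, 0.0]  # node i belongs entirely to group 2
pi_j = [0.0, 0.0, 1.0]  # node j belongs entirely to group 3
p_pure = edge_probability(pi_i, pi_j, B)   # reduces to the single entry B[2,3]

pi_k = [0.5, 0.5, 0.0]  # node k splits its membership between groups 1 and 2
p_mixed = edge_probability(pi_k, pi_j, B)  # 0.5 * 0.0 + 0.5 * 0.9
```

With pure memberships the form collapses to an ordinary block model; mixed memberships interpolate between entries of the matrix.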

11


More Modeling Issues

• Technical :: Sparsity

– Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering

• Biological :: Ribosomes & Distress

– Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (Y2H technology is invasive)

12


Large Scale Computation

• Masses of data

– 750K observations in a small problem (M = 871)

– 2.5M observations with M = 1578

– 3M expressions for 6K genes/proteins in Yeast

• Variational inference [ Jordan et al., 2001 ]

– Naïve implementation does not work

– We develop a novel “nested” variational algorithm

13


Example: A Scientific Question

• Do PPI contain information about functions?

[Figure: Raw data; Model; Approximate posterior on membership vectors; Functional annotations; "?" between the two; example protein YLD014W]

14


Interactions in Yeast (MIPS)

• Do PPI contain information about functions?

[Figure: profile for protein YLD014W; vertical axis from 0 to 1, horizontal axis 1, 2, 3, ..., 15]

15


Results: Identifiability

• In this example we map latent groups to known functional categories

[Figure panels: Known Annotations; Unknown Annotations]

16


Results: Functional Annotations

17


Results: Mixed Membership

[Figure: Mixed membership]

• The estimated membership vectors support the mixed membership assumption

18


Results: Stochastic Block Model

19


General Bayesian Formulation

• Assumptions for unipartite graphs

– Population: existence of K sub-populations

– Latent variable: mixed membership vectors $\pi[n] \sim D$

– Subject: exchangeable edges given blocks and memberships, $Y[n,m] \sim f(\,\cdot \mid \pi[n]^{\top} B \, \pi[m])$

– Sampling scheme: the graphs are IID

• Additional data, e.g., attributes, annotations

– Integrated model formulation (descriptive/predictive)
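The assumptions above can be read as a generative recipe, sketched below for a single binary graph. The slide leaves the distributions D and f generic; this sketch assumes a Dirichlet for D and a Bernoulli for f, writes the membership vectors as `pi` and the block matrix as `B`, and skips self-loops. All of these are illustrative choices, as are the function names and parameter values.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_graph(num_nodes, alpha, B, rng):
    """Sample membership vectors pi[n] and a binary adjacency matrix Y."""
    K = len(alpha)
    pi = [sample_dirichlet(alpha, rng) for _ in range(num_nodes)]
    Y = [[0] * num_nodes for _ in range(num_nodes)]
    for n in range(num_nodes):
        for m in range(num_nodes):
            if n == m:
                continue  # assumption: no self-interactions
            # Edge probability is the bilinear form pi[n]^T B pi[m].
            p = sum(pi[n][g] * B[g][h] * pi[m][h] for g in range(K) for h in range(K))
            Y[n][m] = 1 if rng.random() < p else 0
    return pi, Y


rng = random.Random(0)             # fixed seed so the sketch is repeatable
B = [[0.9, 0.05], [0.05, 0.8]]     # hypothetical 2-group pattern matrix
pi, Y = sample_graph(6, [0.5, 0.5], B, rng)
```

Under the IID sampling scheme, T graphs would be obtained by repeating the edge-sampling loop T times with the same `pi` and `B`.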

20


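Variational Algorithms

• Naïve algorithm:

– init ($\gamma_i \ \forall i$, $\phi_{ij} \ \forall ij$)

– while (≈ log-lik not converged)

   update ($\phi_{ij} \ \forall ij$)

   update ($\gamma_i \ \forall i$)

• Nested algorithm:

– init ($\gamma_i \ \forall i$)

– while (≈ log-lik not converged)

   loop over pairs $ij$

   • init $\phi_{ij}$

   • while (≈ log-lik not converged)

      update $\phi_{ij}$

   partially update ($\gamma_i$, $\gamma_j$)

We trade space for time but …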
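A rough sketch of the space side of this trade-off counts how many variational parameters each scheme keeps in memory at once. It assumes one K-vector per node and `vectors_per_pair` K-vectors per ordered pair; the default of 2 and the value K = 15 are assumptions for illustration, not figures from the talk.

```python
def naive_param_count(num_nodes, K, vectors_per_pair=2):
    """Naive scheme: node-level vectors plus pair-level vectors for ALL ordered pairs."""
    node_params = num_nodes * K
    pair_params = num_nodes * num_nodes * vectors_per_pair * K
    return node_params + pair_params


def nested_param_count(num_nodes, K, vectors_per_pair=2):
    """Nested scheme: node-level vectors plus pair-level vectors for ONE pair at a time."""
    node_params = num_nodes * K
    pair_params = vectors_per_pair * K
    return node_params + pair_params


M, K = 871, 15  # M from the small problem; K is a hypothetical number of groups
naive = naive_param_count(M, K)
nested = nested_param_count(M, K)
```

Because the nested scheme re-initializes the pair-level parameters inside the loop, it has to recompute them on every pass, which is the extra time paid for the smaller footprint.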

21


Variational Algorithms for MMSB

On a single machine* we empirically observed: faster convergence (offsets extra computation), and more stable paths to convergence.

[Figure: two panels each for the Naïve and the Nested algorithm]

22


Take Home Points

• Bayesian formulation is integral to the biology

• A novel class of models that combines MM for soft-clustering & network models for dependent data

• Latent aspects capture patterns that correlate with, and help predict, functional processes in the cell

• Current implementation allows for fast inference on large matrices through variational approximation; there is considerable opportunity to improve upon both the computation and the efficiency of the approximation

23


• Data & Problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature.

• Mixed Membership Models

– Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003ab); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006)

• Stochastic network models

– Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006)

• More material on the Web at: http://www.cs.cmu.edu/~eairoldi/

• ICML Workshop on “Statistical Network Analysis: Models, Issues and New Directions” on June 29 at Carnegie Mellon, Pittsburgh PA: http://nlg.cs.cmu.edu/