
  • 1

    School of Computer Science

    Probabilistic Graphical Models

    Gaussian graphical models and Ising models: modeling networks

    Eric Xing

    Lecture 10, February 18, 2013

    Reading: See class website

    1 © Eric Xing @ CMU, 2005-2013

    Where do networks come from? The Jesus network

    2 © Eric Xing @ CMU, 2005-2013

  • 2

    Evolving networks

    Can I get his vote?

    Cooperativity, Antagonism, Cliques, … over time?

    March 2005   January 2006   August 2006

    3 © Eric Xing @ CMU, 2005-2013

    Evolving networks

    t = 1, 2, 3, …, T

    4© Eric Xing @ CMU, 2005-2013

  • 3

    Recall: Multivariate Gaussian

    Multivariate Gaussian density:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

    WOLG: let \mu = 0 and Q = \Sigma^{-1}, then

    p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}

    We can view this as a continuous Markov Random Field with potentials defined on every node and edge.

    5 © Eric Xing @ CMU, 2005-2013
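    To make the node/edge potential view concrete, here is a minimal numpy sketch (the precision matrix Q below is an assumed toy example) checking that the Gaussian exponent splits into per-node terms q_ii x_i^2 and per-edge terms q_ij x_i x_j:

    ```python
    import numpy as np

    # Toy 3-node example: an assumed sparse precision matrix (positive definite).
    Q = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
    x = np.array([0.3, -1.2, 0.7])

    # Full quadratic form in the Gaussian exponent: -1/2 x^T Q x
    quad = -0.5 * x @ Q @ x

    # Same quantity decomposed into node potentials and edge potentials
    node_terms = -0.5 * sum(Q[i, i] * x[i] ** 2 for i in range(3))
    edge_terms = -sum(Q[i, j] * x[i] * x[j] for i in range(3) for j in range(i + 1, 3))

    assert np.isclose(quad, node_terms + edge_terms)
    # q_13 = 0, so there is no edge potential between nodes 1 and 3:
    # in the pairwise MRF view, x1 and x3 are conditionally independent given x2.
    ```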

    Gaussian Graphical Model

    [Figure: microarray samples across cell types; the Gaussian graphical model encodes dependencies among genes]

    6 © Eric Xing @ CMU, 2005-2013

  • 4

    Precision Matrix Encodes Non-Zero Edges in Gaussian Graphical Model

    An edge corresponds to a non-zero precision matrix element.

    7 © Eric Xing @ CMU, 2005-2013

    Markov versus Correlation Network

    A correlation network is based on the covariance matrix. A GGM is a Markov network based on the precision matrix; conditional independence / partial correlation coefficients are a more sophisticated dependence measure.

    With small sample size, the empirical covariance matrix cannot be inverted.

    8 © Eric Xing @ CMU, 2005-2013
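    A small numpy sketch of this contrast, under an assumed chain-structured toy precision matrix: the precision matrix is sparse (only chain edges), while the implied covariance, and hence the correlation network, is dense.

    ```python
    import numpy as np

    # Assumed toy example: a 5-node chain graph encoded in the precision matrix.
    p = 5
    Q = np.eye(p) * 2.0
    for i in range(p - 1):
        Q[i, i + 1] = Q[i + 1, i] = -0.8   # edges only between neighbors on the chain

    Sigma = np.linalg.inv(Q)               # implied covariance matrix

    print("non-zeros in precision: ", np.sum(~np.isclose(Q, 0)))      # sparse
    print("non-zeros in covariance:", np.sum(~np.isclose(Sigma, 0)))  # dense

    # Correlation network: thresholding correlations links non-adjacent nodes too,
    # whereas the Markov network (non-zero q_ij) recovers only the chain edges.
    corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
    print("corr(x1, x5) =", corr[0, 4])   # non-zero even though q_15 = 0
    ```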

  • 5

    Sparsity

    One common assumption to make is sparsity.

    Makes biological sense: genes are assumed to interact with only small groups of other genes.

    Makes statistical sense: learning is now feasible in high dimensions with small sample size.

    [Figure: a sparse network / sparse precision matrix]

    9 © Eric Xing @ CMU, 2005-2013

    Network Learning with the LASSO

    Assume network is a Gaussian Graphical Model

    Perform LASSO regression of a target node on all other nodes

    10© Eric Xing @ CMU, 2005-2013

  • 6

    Network Learning with the LASSO

    LASSO can select the neighborhood of each node

    11© Eric Xing @ CMU, 2005-2013

    L1 Regularization (LASSO)

    A convex relaxation.

    Constrained form: \min_\beta \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t

    Lagrangian form: \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

    Enforces sparsity!

    [Figure: regressing node y on nodes x_1, \ldots, x_n with coefficients \beta_1, \ldots, \beta_n]

    12 © Eric Xing @ CMU, 2005-2013
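    As a concrete illustration of the Lagrangian form, here is a minimal sketch using scikit-learn's Lasso (assuming scikit-learn is available; the data below is synthetic), showing that the L1 penalty drives most coefficients exactly to zero:

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 20
    X = rng.standard_normal((n, p))
    # Only 3 covariates are truly relevant.
    beta_true = np.zeros(p)
    beta_true[[0, 3, 7]] = [1.5, -2.0, 1.0]
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # scikit-learn's Lagrangian form: minimize 1/(2n) ||y - X beta||^2 + alpha ||beta||_1
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("estimated support:", np.flatnonzero(lasso.coef_))  # typically {0, 3, 7}
    ```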

  • 7

    Theoretical Guarantees

    Assumptions:

    Dependency Condition: Relevant covariates are not overly dependent.

    Incoherence Condition: A large number of irrelevant covariates cannot be too correlated with relevant covariates.

    Strong concentration bounds: Sample quantities converge to their expected values quickly.

    If these assumptions are met, LASSO will asymptotically recover the correct subset of relevant covariates.

    13 © Eric Xing @ CMU, 2005-2013

    Network Learning with the LASSO

    Repeat this for every node; form the total edge set.

    14 © Eric Xing @ CMU, 2005-2013

  • 8

    Consistent Structure Recovery [Meinshausen and Bühlmann 2006, Wainwright 2009]

    If the assumptions above hold and the sample size is sufficiently large, then with high probability the estimated edge set recovers the true edge set.

    15 © Eric Xing @ CMU, 2005-2013

    Why does this algorithm work? What is the intuition behind graphical regression?

    Continuous nodal attributes; discrete nodal attributes

    Are there other algorithms?

    More general scenarios: non-iid samples and evolving networks

    Case study

    16© Eric Xing @ CMU, 2005-2013

  • 9

    Multivariate Gaussian

    Multivariate Gaussian density:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

    A joint Gaussian:

    p(x_1, x_2 \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

    How to write down p(x_2), p(x_1 \mid x_2) or p(x_2 \mid x_1) using the block elements in \mu and \Sigma? Formulas to remember:

    Marginal: p(x_2) = \mathcal{N}(x_2 \mid m_2^m, V_2^m), with m_2^m = \mu_2 and V_2^m = \Sigma_{22}

    Conditional: p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), with
    m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)
    V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}

    17 © Eric Xing @ CMU, 2005-2013
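    A quick numpy sketch (with an assumed 4-dimensional toy Gaussian) that evaluates the conditional-mean and conditional-covariance block formulas, and cross-checks the conditional covariance against the corresponding precision-matrix block:

    ```python
    import numpy as np

    # Assumed toy covariance for x = (x1, x2), with x1 = first 2 coords, x2 = last 2 coords.
    Sigma = np.array([[2.0, 0.5, 0.3, 0.1],
                      [0.5, 1.5, 0.2, 0.4],
                      [0.3, 0.2, 1.0, 0.3],
                      [0.1, 0.4, 0.3, 2.0]])
    mu = np.array([1.0, -1.0, 0.5, 0.0])
    i1, i2 = slice(0, 2), slice(2, 4)
    S11, S12, S21, S22 = Sigma[i1, i1], Sigma[i1, i2], Sigma[i2, i1], Sigma[i2, i2]

    x2 = np.array([0.2, -0.3])

    # Conditional formulas from the slide.
    m_1g2 = mu[i1] + S12 @ np.linalg.solve(S22, x2 - mu[i2])
    V_1g2 = S11 - S12 @ np.linalg.solve(S22, S21)

    # Cross-check: the conditional covariance is the inverse of the (1,1) block
    # of the precision matrix Q = Sigma^{-1} (the matrix inverse lemma, next slide).
    Q = np.linalg.inv(Sigma)
    assert np.allclose(V_1g2, np.linalg.inv(Q[i1, i1]))
    print("m_{1|2} =", m_1g2)
    ```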

    The matrix inverse lemma

    Consider a block-partitioned matrix: M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}

    First we diagonalize M:

    \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}

    Schur complement: M/H = E - FH^{-1}G

    Then we invert, using the formula: W = XYZ \;\Rightarrow\; W^{-1} = Z^{-1} Y^{-1} X^{-1}

    M^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}
    = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}
    = \begin{bmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & -E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{bmatrix}

    Matrix inverse lemma: \left( E - F H^{-1} G \right)^{-1} = E^{-1} + E^{-1} F \left( H - G E^{-1} F \right)^{-1} G E^{-1}

    18 © Eric Xing @ CMU, 2005-2013
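    A numpy sketch (with assumed random blocks) that numerically verifies the Schur-complement block-inverse formula and the matrix inverse lemma:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d1, d2 = 3, 4
    M = rng.standard_normal((d1 + d2, d1 + d2))
    M = M @ M.T + (d1 + d2) * np.eye(d1 + d2)   # make M symmetric positive definite
    E, F = M[:d1, :d1], M[:d1, d1:]
    G, H = M[d1:, :d1], M[d1:, d1:]

    # Schur complement of H and the block form of M^{-1}
    Hinv = np.linalg.inv(H)
    MH = E - F @ Hinv @ G                        # M/H
    MHinv = np.linalg.inv(MH)
    Minv_blocks = np.block([
        [MHinv,              -MHinv @ F @ Hinv],
        [-Hinv @ G @ MHinv,  Hinv + Hinv @ G @ MHinv @ F @ Hinv],
    ])
    assert np.allclose(Minv_blocks, np.linalg.inv(M))

    # Matrix inverse lemma: (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
    Einv = np.linalg.inv(E)
    lemma_rhs = Einv + Einv @ F @ np.linalg.inv(H - G @ Einv @ F) @ G @ Einv
    assert np.allclose(MHinv, lemma_rhs)
    ```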

  • 10

    The covariance and the precision matrices

    Partition \Sigma with respect to a single node (say node 1):

    \Sigma = \begin{bmatrix} \sigma_{11} & \vec\sigma_1^T \\ \vec\sigma_1 & \Sigma_{-1} \end{bmatrix}

    Applying the block-inverse formula

    M^{-1} = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}

    to this partition gives the corresponding blocks of the precision matrix:

    Q = \Sigma^{-1} = \begin{bmatrix} q_{11} & \vec q_1^T \\ \vec q_1 & Q_{-1} \end{bmatrix}, \quad
    q_{11} = \left(\sigma_{11} - \vec\sigma_1^T \Sigma_{-1}^{-1} \vec\sigma_1\right)^{-1}, \quad
    \vec q_1 = -q_{11}\, \Sigma_{-1}^{-1} \vec\sigma_1, \quad
    Q_{-1} = \Sigma_{-1}^{-1} + q_{11}\, \Sigma_{-1}^{-1} \vec\sigma_1 \vec\sigma_1^T \Sigma_{-1}^{-1}

    19 © Eric Xing @ CMU, 2005-2013

    Single-node Conditional

    The conditional distribution of a single node i given the rest of the nodes can be written using the block formulas above:

    p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\left(x_i \,\middle|\, \mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(\mathbf{x}_{-i} - \mu_{-i}),\; \sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\right)

    WOLG: let \mu = 0. In terms of the precision-matrix blocks from the previous slide, this becomes

    p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\left(x_i \,\middle|\, -\tfrac{1}{q_{ii}}\,\vec q_i^{\,T} \mathbf{x}_{-i},\; \tfrac{1}{q_{ii}}\right)

    20 © Eric Xing @ CMU, 2005-2013

  • 11

    Conditional auto-regression

    From the conditional distribution above, we can write the following conditional auto-regression function for each node:

    x_i = \sum_{j \ne i} \beta_{ij}\, x_j + \epsilon_i, \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}}, \qquad \epsilon_i \sim \mathcal{N}\left(0, \tfrac{1}{q_{ii}}\right)

    Neighborhood estimation is based on the auto-regression coefficients: \beta_{ij} \ne 0 \iff q_{ij} \ne 0.

    © Eric Xing @ CMU, 2005-2013 21
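    A minimal numpy sketch (assumed toy precision matrix, synthetic samples) checking that regressing one node on the rest recovers coefficients close to -q_{ij}/q_{ii}:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Assumed toy GGM: a 4-node chain encoded in the precision matrix Q.
    Q = np.array([[2.0, -0.7, 0.0, 0.0],
                  [-0.7, 2.0, -0.7, 0.0],
                  [0.0, -0.7, 2.0, -0.7],
                  [0.0, 0.0, -0.7, 2.0]])
    Sigma = np.linalg.inv(Q)

    # Draw iid samples x ~ N(0, Sigma).
    n = 200_000
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

    # Regress node 0 on the other nodes by ordinary least squares.
    i, others = 0, [1, 2, 3]
    beta_hat, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)

    # Theory: beta_ij = -q_ij / q_ii, so node 0's only neighbor is node 1.
    beta_theory = -Q[i, others] / Q[i, i]
    print("estimated:", np.round(beta_hat, 3))
    print("theory:   ", np.round(beta_theory, 3))
    ```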

    Conditional independence

    From the auto-regression above, given an estimate of the neighborhood s_i, we have: x_i is conditionally independent of all nodes outside s_i, given x_{s_i}.

    Thus the neighborhood s_i defines the Markov blanket of node i.

    22© Eric Xing @ CMU, 2005-2013

  • 12

    Recent trends in GGM:

    Covariance selection (classical method):
    Dempster [1972]: Sequentially pruning smallest elements in precision matrix
    Drton and Perlman [2008]: Improved statistical tests for pruning
    Serious limitations in practice: breaks down when covariance matrix is not invertible

    L1-regularization based methods (hot!):
    Meinshausen and Bühlmann [Ann. Stat. 06]: Used LASSO regression for neighborhood selection
    Banerjee [JMLR 08]: Block sub-gradient algorithm for finding precision matrix
    Friedman et al. [Biostatistics 08]: Efficient fixed-point equations based on a sub-gradient algorithm
    Structure learning is possible even when # variables > # samples

    23 © Eric Xing @ CMU, 2005-2013

    The Meinshausen-Bühlmann (MB) algorithm:

    Solve a separate Lasso regression for every single variable:

    Step 1: Pick up one variable

    Step 2: Think of it as “y”, and the rest as “z”

    Step 3: Solve Lasso regression problem between y and z

    Step 4: Connect the k-th node to those having nonzero weight in w

    The resulting coefficients do not correspond to the entries of Q value-wise

    24© Eric Xing @ CMU, 2005-2013
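    A sketch of these four steps (a minimal illustration using scikit-learn's Lasso on synthetic data; the OR rule used below for combining neighborhoods is one common choice, the AND rule is the other):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    def mb_neighborhood_selection(X, alpha=0.05):
        """Meinshausen-Buhlmann style neighborhood selection (sketch).

        X: (n_samples, p) data matrix. Returns a symmetric boolean adjacency matrix.
        """
        n, p = X.shape
        adj = np.zeros((p, p), dtype=bool)
        for k in range(p):                              # Step 1: pick one variable
            y = X[:, k]                                 # Step 2: treat it as "y"
            Z = np.delete(X, k, axis=1)                 # ...and the rest as "z"
            w = Lasso(alpha=alpha).fit(Z, y).coef_      # Step 3: Lasso regression of y on z
            nbrs = np.delete(np.arange(p), k)[w != 0]   # Step 4: nonzero weights -> edges
            adj[k, nbrs] = True
        return adj | adj.T                              # OR rule to symmetrize the edge set

    # Example on synthetic data from a chain-structured GGM (assumed toy model).
    rng = np.random.default_rng(3)
    Q = np.eye(5) * 2.0
    for i in range(4):
        Q[i, i + 1] = Q[i + 1, i] = -0.8
    X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q), size=2000)
    print(mb_neighborhood_selection(X).astype(int))
    ```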

  • 13

    L1-regularized maximum likelihood learning

    Input: Sample covariance matrix S

    Assumes standardized data (mean = 0, variance = 1). S is generally rank-deficient,

    thus the inverse does not exist.

    Output: Sparse precision matrix Q. Originally, Q is defined as the inverse of S, but S is not directly invertible; we need to find a sparse matrix that can be thought of as an inverse of S.

    Approach: Solve an L1-regularized maximum likelihood problem:

    \max_{Q \succ 0}\; \underbrace{\log\det Q - \mathrm{tr}(SQ)}_{\text{log likelihood}} \;-\; \underbrace{\lambda \|Q\|_1}_{\text{regularizer}}

    25© Eric Xing @ CMU, 2005-2013
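    A minimal sketch of this estimator using scikit-learn's GraphicalLasso (assuming a recent scikit-learn; it fits the L1-penalized log-likelihood above from data and exposes the estimated precision matrix):

    ```python
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(4)

    # Synthetic data from an assumed sparse 6-node GGM (chain structure).
    p = 6
    Q_true = np.eye(p) * 2.0
    for i in range(p - 1):
        Q_true[i, i + 1] = Q_true[i + 1, i] = -0.8
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Q_true), size=500)

    # L1-penalized maximum likelihood estimate of the precision matrix.
    model = GraphicalLasso(alpha=0.05).fit(X)
    Q_hat = model.precision_

    # Edges = non-zero off-diagonal entries of the estimated precision matrix.
    edges = np.argwhere(np.triu(np.abs(Q_hat) > 1e-4, k=1))
    print(edges)
    ```

    Unlike the per-node MB regressions, this single convex program keeps the row/column subproblems coupled, which is the stability point made on the next slide.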

    From matrix opt. to vector opt.: coupled Lasso for every single variable

    Focus only on one row (column), keeping the others constant

    Optimization problem for blue vector is shown to be Lasso (L1-regularized quadratic programming)

    Difference from MB's: the resulting Lasso problems are coupled. The gray part is actually not constant; it changes after solving one Lasso problem. This coupling is essential for stability under noise.

    26© Eric Xing @ CMU, 2005-2013

  • 14

    Learning Ising Model (i.e. pairwise MRF)

    Assuming the nodes are discrete (e.g., the voting outcome of a person), and edges are weighted, then for a sample x we have a pairwise MRF of the form

    p(\mathbf{x}) \propto \exp\left\{ \sum_{i} \theta_{i} x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j \right\}

    It can be shown that the pseudo-conditional likelihood for node k is a logistic function of the weighted sum of its neighbors, \sum_{j \ne k} \theta_{kj} x_j.

    27© Eric Xing @ CMU, 2005-2013

    Learning Ising Model (i.e. pairwise MRF)

    Assuming the nodes are discrete and edges are weighted, then for a sample x we have the same pairwise MRF as above.

    It can be shown, following the same logic, that we can use L1-regularized logistic regression to obtain a sparse estimate of the neighborhood of each variable in the discrete case.

    28© Eric Xing @ CMU, 2005-2013
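    A sketch of this discrete-case neighborhood selection with scikit-learn's L1-penalized logistic regression (synthetic binary chain data rather than a true Ising sampler; the liblinear solver supports the L1 penalty):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)
    n = 5000

    def flip(x):
        # copy x and flip each entry independently with probability 0.2
        return x * rng.choice([1, -1], size=n, p=[0.8, 0.2])

    # Assumed toy binary chain x1 - x2 - x3 - x4 with spins in {-1, +1}.
    x1 = rng.choice([-1, 1], size=n)
    x2 = flip(x1)
    x3 = flip(x2)
    x4 = flip(x3)
    X = np.column_stack([x1, x2, x3, x4])

    # Neighborhood of node k via L1-regularized logistic regression on the other nodes.
    k = 1
    others = [j for j in range(4) if j != k]
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[:, others], X[:, k])
    print("candidate neighbors of node", k, ":",
          [others[j] for j in np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)])
    ```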

  • 15

    New Problem: Evolving Social Networks

    Can I get his vote?

    Cooperativity, Antagonism, Cliques, … over time?

    March 2005 January 2006 August 2006

    29© Eric Xing @ CMU, 2005-2013

    Reverse engineering time-specific "rewiring" networks

    [Figure: networks along a timeline from T0 to TN with a query time t*; only n = 1 or some small number of samples per time point]

    Drosophila development

    30© Eric Xing @ CMU, 2005-2013

  • 16

    Inference I

    KELLER: Kernel Weighted L1-regularized Logistic Regression

    [Song, Kolar and Xing, Bioinformatics 09]

    Lasso:

    Constrained convex optimization. Estimate time-specific networks one by one, based on "virtual iid" samples. Could scale to ~10^4 genes, but under stronger smoothness assumptions.

    31© Eric Xing @ CMU, 2005-2013
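    A sketch of the kernel-reweighting idea (not the authors' exact estimator): at a query time t*, weight each observation by a Gaussian kernel in time and fit an L1-penalized logistic regression with those sample weights, assuming scikit-learn's sample_weight support in LogisticRegression.fit:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def keller_style_neighborhood(X, times, t_star, k, bandwidth=0.1, C=1.0):
        """Kernel-weighted L1 logistic regression for node k at query time t_star (sketch).

        X: (n_samples, p) binary data in {-1, +1}; times: (n_samples,) observation times in [0, 1].
        """
        # Gaussian kernel weights centered at the query time t*.
        w = np.exp(-0.5 * ((times - t_star) / bandwidth) ** 2)

        others = [j for j in range(X.shape[1]) if j != k]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, k], sample_weight=w)
        return dict(zip(others, clf.coef_[0]))     # non-zero entries = time-t* neighbors

    # Usage with assumed placeholder data: 200 time points, one sample each, 5 binary nodes.
    rng = np.random.default_rng(6)
    X = rng.choice([-1, 1], size=(200, 5))
    times = np.linspace(0.0, 1.0, 200)
    print(keller_style_neighborhood(X, times, t_star=0.5, k=0))
    ```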

    Conditional likelihood

    Algorithm – nonparametric neighborhood selection

    Neighborhood Selection:

    Time-specific graph regression: Estimate at time t*

    where … and …

    32 © Eric Xing @ CMU, 2005-2013

  • 17

    Structural consistency of KELLER

    Assumptions. Define:

    A1: Dependency Condition

    A2: Incoherence Condition

    A3: Smoothness Condition

    A4: Bounded Kernel

    33© Eric Xing @ CMU, 2005-2013

    Theorem [Kolar and Xing, 09]

    34© Eric Xing @ CMU, 2005-2013

  • 18

    Senate network – 109th congress

    Voting records from the 109th congress (2005 - 2006). There are 100 senators whose votes were recorded on 542 bills; each vote is a binary outcome.

    35© Eric Xing @ CMU, 2005-2013

    Senate network – 109th congress

    March 2005 January 2006 August 2006

    36© Eric Xing @ CMU, 2005-2013

  • 19

    Senator Chafee

    37© Eric Xing @ CMU, 2005-2013

    Senator Ben Nelson

    T=0.2 T=0.8

    38© Eric Xing @ CMU, 2005-2013

  • 20

    Progression and Reversion of Breast Cancer cells

    S1 (normal)

    T4 (malignant)

    T4 revert 1

    T4 revert 2

    T4 revert 3

    39© Eric Xing @ CMU, 2005-2013

    Estimate Neighborhoods Jointly Across All Cell Types

    T4

    S1

    How to share information across the cell types?

    T4R1

    T4R2

    T4R3

    40© Eric Xing @ CMU, 2005-2013

  • 21

    Penalize differences between networks of adjacent cell types

    Sparsity of Difference

    T4

    S1

    T4R1

    T4R2

    T4R3

    41© Eric Xing @ CMU, 2005-2013

    Tree-Guided Graphical Lasso (Treegl)

    Objective: RSS for all cell types + sparsity + sparsity of difference

    42 © Eric Xing @ CMU, 2005-2013
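    A numpy sketch of an objective with this shape (a simplified stand-in, not the exact Treegl formulation): per-cell-type neighborhood regressions with an L1 penalty on each coefficient vector plus an L1 penalty on differences between tree-adjacent cell types:

    ```python
    import numpy as np

    def treegl_style_objective(betas, Xs, ys, tree_edges, lam1=0.1, lam2=0.1):
        """Simplified Treegl-style objective for one target node (sketch).

        betas: dict cell_type -> coefficient vector regressing the target node on the rest.
        Xs, ys: dict cell_type -> design matrix / target-node values for that cell type.
        tree_edges: list of (parent, child) cell-type pairs, e.g. [("S1", "T4"), ("T4", "T4R1")].
        """
        rss = sum(np.sum((ys[c] - Xs[c] @ betas[c]) ** 2) for c in betas)              # RSS for all cell types
        sparsity = lam1 * sum(np.sum(np.abs(betas[c])) for c in betas)                 # sparsity
        diff = lam2 * sum(np.sum(np.abs(betas[a] - betas[b])) for a, b in tree_edges)  # sparsity of difference
        return rss + sparsity + diff
    ```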

  • 22

    Network Overview

    [Figure: network overview; labels include T4, S1, EGFR-ITGB1, PI3K-MAPKK, MMPs]

    43 © Eric Xing @ CMU, 2005-2013

    Summary

    Graphical Gaussian Model: the precision matrix encodes structure; not estimable when p >> n.

    Neighborhood selection: conditional distribution under GGM/MRF; graphical lasso; sparsistency.

    Time-varying Markov networks: kernel reweighting estimation; total variation estimation.

    44© Eric Xing @ CMU, 2005-2013