
  • 1

    School of Computer Science

    Probabilistic Graphical Models

    Gaussian graphical models and Ising models: modeling networks

    Eric Xing

    Lecture 10, February 18, 2013

    Reading: See class website

    1 © Eric Xing @ CMU, 2005-2013

    Where do networks come from? The Jesus network

    2 © Eric Xing @ CMU, 2005-2013

  • 2

    Evolving networks

    Can I get his vote?

    Cooperativity, Antagonism, Cliques, … over time?

    March 2005   January 2006   August 2006

    3 © Eric Xing @ CMU, 2005-2013

    Evolving networks

    t = 1, 2, 3, …, T

    4© Eric Xing @ CMU, 2005-2013

  • 3

    Recall: Multivariate Gaussian

    Multivariate Gaussian density:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

    WOLG: let \mu = 0 and Q = \Sigma^{-1}, then

    p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}

    We can view this as a continuous Markov Random Field with potentials defined on every node and edge.

    5 © Eric Xing @ CMU, 2005-2013
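    To make the node/edge potential view concrete, here is a minimal numpy sketch (the precision matrix Q below is an assumed toy example) checking that the Gaussian exponent splits into per-node terms q_ii x_i^2 and per-edge terms q_ij x_i x_j:

    ```python
    import numpy as np

    # Toy 3-node example: an assumed sparse precision matrix (positive definite).
    Q = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
    x = np.array([0.3, -1.2, 0.7])

    # Full quadratic form in the Gaussian exponent: -1/2 x^T Q x
    quad = -0.5 * x @ Q @ x

    # Same quantity decomposed into node potentials and edge potentials
    node_terms = -0.5 * sum(Q[i, i] * x[i] ** 2 for i in range(3))
    edge_terms = -sum(Q[i, j] * x[i] * x[j] for i in range(3) for j in range(i + 1, 3))

    assert np.isclose(quad, node_terms + edge_terms)
    # q_13 = 0, so there is no edge potential between nodes 1 and 3:
    # in the pairwise MRF view, x1 and x3 are conditionally independent given x2.
    ```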

    Gaussian Graphical Model

    [Figure: microarray samples across cell types; the Gaussian graphical model encodes dependencies among genes]

    6 © Eric Xing @ CMU, 2005-2013

  • 4

    Precision Matrix Encodes Non-Zero Edges in Gaussian Graphical Model

    An edge corresponds to a non-zero precision matrix element.

    7 © Eric Xing @ CMU, 2005-2013

    Markov versus Correlation Network

    A correlation network is based on the covariance matrix. A GGM is a Markov network based on the precision matrix; conditional independence / partial correlation coefficients are a more sophisticated dependence measure.

    With small sample size, the empirical covariance matrix cannot be inverted.

    8 © Eric Xing @ CMU, 2005-2013
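    A small numpy sketch of this contrast, under an assumed chain-structured toy precision matrix: the precision matrix is sparse (only chain edges), while the implied covariance, and hence the correlation network, is dense.

    ```python
    import numpy as np

    # Assumed toy example: a 5-node chain graph encoded in the precision matrix.
    p = 5
    Q = np.eye(p) * 2.0
    for i in range(p - 1):
        Q[i, i + 1] = Q[i + 1, i] = -0.8   # edges only between neighbors on the chain

    Sigma = np.linalg.inv(Q)               # implied covariance matrix

    print("non-zeros in precision: ", np.sum(~np.isclose(Q, 0)))      # sparse
    print("non-zeros in covariance:", np.sum(~np.isclose(Sigma, 0)))  # dense

    # Correlation network: thresholding correlations links non-adjacent nodes too,
    # whereas the Markov network (non-zero q_ij) recovers only the chain edges.
    corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
    print("corr(x1, x5) =", corr[0, 4])   # non-zero even though q_15 = 0
    ```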

  • 5

    Sparsity

    One common assumption to make is sparsity.

    Makes biological sense: genes are assumed to interact with only small groups of other genes.

    Makes statistical sense: learning is now feasible in high dimensions with small sample size.

    [Figure: a sparse network / sparse precision matrix]

    9 © Eric Xing @ CMU, 2005-2013

    Network Learning with the LASSO

    Assume network is a Gaussian Graphical Model

    Perform LASSO regression of a target node on all other nodes

    10© Eric Xing @ CMU, 2005-2013

  • 6

    Network Learning with the LASSO

    LASSO can select the neighborhood of each node

    11© Eric Xing @ CMU, 2005-2013

    L1 Regularization (LASSO)

    A convex relaxation.

    Constrained form: \min_\beta \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t

    Lagrangian form: \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

    Enforces sparsity!

    [Figure: regressing node y on nodes x_1, \ldots, x_n with coefficients \beta_1, \ldots, \beta_n]

    12 © Eric Xing @ CMU, 2005-2013
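    As a concrete illustration of the Lagrangian form, here is a minimal sketch using scikit-learn's Lasso (assuming scikit-learn is available; the data below is synthetic), showing that the L1 penalty drives most coefficients exactly to zero:

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 20
    X = rng.standard_normal((n, p))
    # Only 3 covariates are truly relevant.
    beta_true = np.zeros(p)
    beta_true[[0, 3, 7]] = [1.5, -2.0, 1.0]
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # scikit-learn's Lagrangian form: minimize 1/(2n) ||y - X beta||^2 + alpha ||beta||_1
    lasso = Lasso(alpha=0.1).fit(X, y)
    print("estimated support:", np.flatnonzero(lasso.coef_))  # typically {0, 3, 7}
    ```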

  • 7

    Theoretical Guarantees

    Assumptions:

    Dependency Condition: Relevant covariates are not overly dependent.

    Incoherence Condition: A large number of irrelevant covariates cannot be too correlated with relevant covariates.

    Strong concentration bounds: Sample quantities converge to their expected values quickly.

    If these assumptions are met, LASSO will asymptotically recover the correct subset of relevant covariates.

    13 © Eric Xing @ CMU, 2005-2013

    Network Learning with the LASSO

    Repeat this for every node; form the total edge set.

    14 © Eric Xing @ CMU, 2005-2013

  • 8

    Consistent Structure Recovery [Meinshausen and Bühlmann 2006, Wainwright 2009]

    If the assumptions above hold and the sample size is sufficiently large, then with high probability the estimated edge set recovers the true edge set.

    15 © Eric Xing @ CMU, 2005-2013

    Why does this algorithm work? What is the intuition behind graphical regression?

    Continuous nodal attributes; discrete nodal attributes

    Are there other algorithms?

    More general scenarios: non-iid samples and evolving networks

    Case study

    16© Eric Xing @ CMU, 2005-2013

  • 9

    Multivariate Gaussian

    Multivariate Gaussian density:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

    A joint Gaussian:

    p(x_1, x_2 \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)

    How to write down p(x_2), p(x_1 \mid x_2) or p(x_2 \mid x_1) using the block elements in \mu and \Sigma? Formulas to remember:

    Marginal: p(x_2) = \mathcal{N}(x_2 \mid m_2^m, V_2^m), with m_2^m = \mu_2 and V_2^m = \Sigma_{22}

    Conditional: p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2}), with
    m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)
    V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}

    17 © Eric Xing @ CMU, 2005-2013
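    A quick numpy sketch (with an assumed 4-dimensional toy Gaussian) that evaluates the conditional-mean and conditional-covariance block formulas, and cross-checks the conditional covariance against the corresponding precision-matrix block:

    ```python
    import numpy as np

    # Assumed toy covariance for x = (x1, x2), with x1 = first 2 coords, x2 = last 2 coords.
    Sigma = np.array([[2.0, 0.5, 0.3, 0.1],
                      [0.5, 1.5, 0.2, 0.4],
                      [0.3, 0.2, 1.0, 0.3],
                      [0.1, 0.4, 0.3, 2.0]])
    mu = np.array([1.0, -1.0, 0.5, 0.0])
    i1, i2 = slice(0, 2), slice(2, 4)
    S11, S12, S21, S22 = Sigma[i1, i1], Sigma[i1, i2], Sigma[i2, i1], Sigma[i2, i2]

    x2 = np.array([0.2, -0.3])

    # Conditional formulas from the slide.
    m_1g2 = mu[i1] + S12 @ np.linalg.solve(S22, x2 - mu[i2])
    V_1g2 = S11 - S12 @ np.linalg.solve(S22, S21)

    # Cross-check: the conditional covariance is the inverse of the (1,1) block
    # of the precision matrix Q = Sigma^{-1} (the matrix inverse lemma, next slide).
    Q = np.linalg.inv(Sigma)
    assert np.allclose(V_1g2, np.linalg.inv(Q[i1, i1]))
    print("m_{1|2} =", m_1g2)
    ```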

    The matrix inverse lemma

    Consider a block-partitioned matrix: M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}

    First we diagonalize M:

    \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix} \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}

    Schur complement: M/H = E - FH^{-1}G

    Then we invert, using the formula: W = XYZ \;\Rightarrow\; W^{-1} = Z^{-1} Y^{-1} X^{-1}

    M^{-1} = \begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} \begin{bmatrix} (M/H)^{-1} & 0 \\ 0 & H^{-1} \end{bmatrix} \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}
    = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}
    = \begin{bmatrix} E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1} & -E^{-1} F (M/E)^{-1} \\ -(M/E)^{-1} G E^{-1} & (M/E)^{-1} \end{bmatrix}

    Matrix inverse lemma: \left( E - F H^{-1} G \right)^{-1} = E^{-1} + E^{-1} F \left( H - G E^{-1} F \right)^{-1} G E^{-1}

    18 © Eric Xing @ CMU, 2005-2013
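    A numpy sketch (with assumed random blocks) that numerically verifies the Schur-complement block-inverse formula and the matrix inverse lemma:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    d1, d2 = 3, 4
    M = rng.standard_normal((d1 + d2, d1 + d2))
    M = M @ M.T + (d1 + d2) * np.eye(d1 + d2)   # make M symmetric positive definite
    E, F = M[:d1, :d1], M[:d1, d1:]
    G, H = M[d1:, :d1], M[d1:, d1:]

    # Schur complement of H and the block form of M^{-1}
    Hinv = np.linalg.inv(H)
    MH = E - F @ Hinv @ G                        # M/H
    MHinv = np.linalg.inv(MH)
    Minv_blocks = np.block([
        [MHinv,              -MHinv @ F @ Hinv],
        [-Hinv @ G @ MHinv,  Hinv + Hinv @ G @ MHinv @ F @ Hinv],
    ])
    assert np.allclose(Minv_blocks, np.linalg.inv(M))

    # Matrix inverse lemma: (E - F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}
    Einv = np.linalg.inv(E)
    lemma_rhs = Einv + Einv @ F @ np.linalg.inv(H - G @ Einv @ F) @ G @ Einv
    assert np.allclose(MHinv, lemma_rhs)
    ```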

  • 10

    The covariance and the precision matrices

    Partition \Sigma with respect to a single node (say node 1):

    \Sigma = \begin{bmatrix} \sigma_{11} & \vec\sigma_1^T \\ \vec\sigma_1 & \Sigma_{-1} \end{bmatrix}

    Applying the block-inverse formula

    M^{-1} = \begin{bmatrix} (M/H)^{-1} & -(M/H)^{-1} F H^{-1} \\ -H^{-1} G (M/H)^{-1} & H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1} \end{bmatrix}

    to this partition gives the corresponding blocks of the precision matrix:

    Q = \Sigma^{-1} = \begin{bmatrix} q_{11} & \vec q_1^T \\ \vec q_1 & Q_{-1} \end{bmatrix}, \quad
    q_{11} = \left(\sigma_{11} - \vec\sigma_1^T \Sigma_{-1}^{-1} \vec\sigma_1\right)^{-1}, \quad
    \vec q_1 = -q_{11}\, \Sigma_{-1}^{-1} \vec\sigma_1, \quad
    Q_{-1} = \Sigma_{-1}^{-1} + q_{11}\, \Sigma_{-1}^{-1} \vec\sigma_1 \vec\sigma_1^T \Sigma_{-1}^{-1}

    19 © Eric Xing @ CMU, 2005-2013

    Single-node Conditional

    The conditional distribution of a single node i given the rest of the nodes can be written using the block formulas above:

    p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\left(x_i \,\middle|\, \mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(\mathbf{x}_{-i} - \mu_{-i}),\; \sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\right)

    WOLG: let \mu = 0. In terms of the precision-matrix blocks from the previous slide, this becomes

    p(x_i \mid \mathbf{x}_{-i}) = \mathcal{N}\left(x_i \,\middle|\, -\tfrac{1}{q_{ii}}\,\vec q_i^{\,T} \mathbf{x}_{-i},\; \tfrac{1}{q_{ii}}\right)

    20 © Eric Xing @ CMU, 2005-2013

  • 11

    Conditional auto-regression

    From the conditional distribution above, we can write the following conditional auto-regression function for each node:

    x_i = \sum_{j \ne i} \beta_{ij}\, x_j + \epsilon_i, \qquad \beta_{ij} = -\frac{q_{ij}}{q_{ii}}, \qquad \epsilon_i \sim \mathcal{N}\left(0, \tfrac{1}{q_{ii}}\right)

    Neighborhood estimation is based on the auto-regression coefficients: \beta_{ij} \ne 0 \iff q_{ij} \ne 0.

    © Eric Xing @ CMU, 2005-2013 21
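    A minimal numpy sketch (assumed toy precision matrix, synthetic samples) checking that regressing one node on the rest recovers coefficients close to -q_{ij}/q_{ii}:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Assumed toy GGM: a 4-node chain encoded in the precision matrix Q.
    Q = np.array([[2.0, -0.7, 0.0, 0.0],
                  [-0.7, 2.0, -0.7, 0.0],
                  [0.0, -0.7, 2.0, -0.7],
                  [0.0, 0.0, -0.7, 2.0]])
    Sigma = np.linalg.inv(Q)

    # Draw iid samples x ~ N(0, Sigma).
    n = 200_000
    X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)

    # Regress node 0 on the other nodes by ordinary least squares.
    i, others = 0, [1, 2, 3]
    beta_hat, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)

    # Theory: beta_ij = -q_ij / q_ii, so node 0's only neighbor is node 1.
    beta_theory = -Q[i, others] / Q[i, i]
    print("estimated:", np.round(beta_hat, 3))
    print("theory:   ", np.round(beta_theory, 3))
    ```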

    Conditional independence

    From the auto-regression above, given an estimate of the neighborhood s_i, we have: x_i is conditionally independent of all nodes outside s_i, given x_{s_i}.

    Thus the neighborhood s_i defines the Markov blanket of node i.

    22© Eric Xing @ CMU, 2005-2013

  • 12

    Recent trends in GGM:

    Covariance selection (classical method):
    Dempster [1972]: Sequentially pruning smallest elements in precision matrix
    Drton and Perlman [2008]: Improved statistical tests for pruning
    Serious limitations in practice: breaks down when covariance matrix is not invertible

    L1-regularization based methods (hot!):
    Meinshausen and Bühlmann [Ann. Stat. 06]: Used LASSO regression for neighborhood selection
    Banerjee [JMLR 08]: Block sub-gradient algorithm for finding precision matrix
    Friedman et al. [Biostatistics 08]: Efficient fixed-point equations based on a sub-gradient algorithm
    Structure learning is possible even when # variables > # samples

    23 © Eric Xing @ CMU, 2005-2013

    The Meinshausen-Bühlmann (MB) algorithm:

    Solve a separate Lasso regression for every single variable:

    Step 1: Pick up one variable

    Step 2: Think of it as “y”, and the rest as “z”

    Step 3: Solve Lasso regression problem between y and z

    Step 4: Connect the k-th node to those having nonzero weight in w

    The resulting coefficients do not correspond to the entries of Q value-wise

    24© Eric Xing @ CMU, 2005-2013
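    A sketch of these four steps (a minimal illustration using scikit-learn's Lasso on synthetic data; the OR rule used below for combining neighborhoods is one common choice, the AND rule is the other):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    def mb_neighborhood_selection(X, alpha=0.05):
        """Meinshausen-Buhlmann style neighborhood selection (sketch).

        X: (n_samples, p) data matrix. Returns a symmetric boolean adjacency matrix.
        """
        n, p = X.shape
        adj = np.zeros((p, p), dtype=bool)
        for k in range(p):                              # Step 1: pick one variable
            y = X[:, k]                                 # Step 2: treat it as "y"
            Z = np.delete(X, k, axis=1)                 # ...and the rest as "z"
            w = Lasso(alpha=alpha).fit(Z, y).coef_      # Step 3: Lasso regression of y on z
            nbrs = np.delete(np.arange(p), k)[w != 0]   # Step 4: nonzero weights -> edges
            adj[k, nbrs] = True
        return adj | adj.T                              # OR rule to symmetrize the edge set

    # Example on synthetic data from a chain-structured GGM (assumed toy model).
    rng = np.random.default_rng(3)
    Q = np.eye(5) * 2.0
    for i in range(4):
        Q[i, i + 1] = Q[i + 1, i] = -0.8
    X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Q), size=2000)
    print(mb_neighborhood_selection(X).astype(int))
    ```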

  • 13

    L1-regularized maximum likelihood learning

    Input: Sample covariance matrix S

    Assumes standardized data (mean = 0, variance = 1). S is generally rank-deficient,

    thus the inverse does not exist.

    Output: Sparse precision matrix Q. Originally, Q is defined as the inverse of S, but S is not directly invertible; we need to find a sparse matrix that can be thought of as an inverse of S.

    Approach: Solve an L1-regularized maximum likelihood problem:

    \max_{Q \succ 0}\; \underbrace{\log\det Q - \mathrm{tr}(SQ)}_{\text{log likelihood}} \;-\; \underbrace{\lambda \|Q\|_1}_{\text{regularizer}}

    25© Eric Xing @ CMU, 2005-2013
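    A minimal sketch of this estimator using scikit-learn's GraphicalLasso (assuming a recent scikit-learn; it fits the L1-penalized log-likelihood above from data and exposes the estimated precision matrix):

    ```python
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(4)

    # Synthetic data from an assumed sparse 6-node GGM (chain structure).
    p = 6
    Q_true = np.eye(p) * 2.0
    for i in range(p - 1):
        Q_true[i, i + 1] = Q_true[i + 1, i] = -0.8
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Q_true), size=500)

    # L1-penalized maximum likelihood estimate of the precision matrix.
    model = GraphicalLasso(alpha=0.05).fit(X)
    Q_hat = model.precision_

    # Edges = non-zero off-diagonal entries of the estimated precision matrix.
    edges = np.argwhere(np.triu(np.abs(Q_hat) > 1e-4, k=1))
    print(edges)
    ```

    Unlike the per-node MB regressions, this single convex program keeps the row/column subproblems coupled, which is the stability point made on the next slide.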

    From matrix opt. to vector opt.: coupled Lasso for every single variable

    Focus only on one row (column), keeping the others constant

    Optimization problem for blue vector is shown to be Lasso (L1-regularized quadratic programming)

    Difference from MB's: the resulting Lasso problems are coupled. The gray part is actually not constant; it changes after solving one Lasso problem. This coupling is essential for stability under noise.

    26© Eric Xing @ CMU, 2005-2013

  • 14

    Learning Ising Model (i.e. pairwise MRF)

    Assuming the nodes are discrete (e.g., the voting outcome of a person), and edges are weighted, then for a sample x we have a pairwise MRF of the form

    p(\mathbf{x}) \propto \exp\left\{ \sum_{i} \theta_{i} x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j \right\}

    It can be shown that the pseudo-conditional likelihood for node k is a logistic function of the weighted sum of its neighbors, \sum_{j \ne k} \theta_{kj} x_j.

    27© Eric Xing @ CMU, 2005-2013

    Learning Ising Model (i.e. pairwise MRF)

    Assuming the nodes are discrete and edges are weighted, then for a sample x we have the same pairwise MRF as above.

    It can be shown, following the same logic, that we can use L1-regularized logistic regression to obtain a sparse estimate of the neighborhood of each variable in the discrete case.

    28© Eric Xing @ CMU, 2005-2013
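    A sketch of this discrete-case neighborhood selection with scikit-learn's L1-penalized logistic regression (synthetic binary chain data rather than a true Ising sampler; the liblinear solver supports the L1 penalty):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)
    n = 5000

    def flip(x):
        # copy x and flip each entry independently with probability 0.2
        return x * rng.choice([1, -1], size=n, p=[0.8, 0.2])

    # Assumed toy binary chain x1 - x2 - x3 - x4 with spins in {-1, +1}.
    x1 = rng.choice([-1, 1], size=n)
    x2 = flip(x1)
    x3 = flip(x2)
    x4 = flip(x3)
    X = np.column_stack([x1, x2, x3, x4])

    # Neighborhood of node k via L1-regularized logistic regression on the other nodes.
    k = 1
    others = [j for j in range(4) if j != k]
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[:, others], X[:, k])
    print("candidate neighbors of node", k, ":",
          [others[j] for j in np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)])
    ```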

  • 15

    New Problem: Evolving Social Networks

    Can I get his vote?

    Cooperativity, Antagonism, Cliques, … over time?

    March 2005 January 2006 August 2006

    29© Eric Xing @ CMU, 2005-2013

    Reverse engineering time-specific "rewiring" networks

    [Figure: networks along a timeline from T0 to TN with a query time t*; only n = 1 or some small number of samples per time point]

    Drosophila development

    30© Eric Xing @ CMU, 2005-2013

  • 16

    Inference I

    KELLER: Kernel Weighted L1-regularized Logistic Regression

    [Song, Kolar and Xing, Bioinformatics 09]

    Lasso:

    Constrained convex optimization. Estimate time-specific networks one by one, based on "virtual iid" samples. Could scale to ~10^4 genes, but under stronger smoothness assumptions.

    31© Eric Xing @ CMU, 2005-2013
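    A sketch of the kernel-reweighting idea (not the authors' exact estimator): at a query time t*, weight each observation by a Gaussian kernel in time and fit an L1-penalized logistic regression with those sample weights, assuming scikit-learn's sample_weight support in LogisticRegression.fit:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def keller_style_neighborhood(X, times, t_star, k, bandwidth=0.1, C=1.0):
        """Kernel-weighted L1 logistic regression for node k at query time t_star (sketch).

        X: (n_samples, p) binary data in {-1, +1}; times: (n_samples,) observation times in [0, 1].
        """
        # Gaussian kernel weights centered at the query time t*.
        w = np.exp(-0.5 * ((times - t_star) / bandwidth) ** 2)

        others = [j for j in range(X.shape[1]) if j != k]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, k], sample_weight=w)
        return dict(zip(others, clf.coef_[0]))     # non-zero entries = time-t* neighbors

    # Usage with assumed placeholder data: 200 time points, one sample each, 5 binary nodes.
    rng = np.random.default_rng(6)
    X = rng.choice([-1, 1], size=(200, 5))
    times = np.linspace(0.0, 1.0, 200)
    print(keller_style_neighborhood(X, times, t_star=0.5, k=0))
    ```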

    Conditional likelihood

    Algorithm – nonparametric neighborhood selection

    Neighborhood Selection:

    Time-specific graph regression: Estimate at time t*

    where … and …

    32 © Eric Xing @ CMU, 2005-2013

  • 17

    Structural consistency of KELLER

    Assumptions. Define:

    A1: Dependency Condition

    A2: Incoherence Condition

    A3: Smoothness Condition

    A4: Bounded Kernel

    33© Eric Xing @ CMU, 2005-2013

    Theorem [Kolar and Xing, 09]

    34© Eric Xing @ CMU, 2005-2013

  • 18

    Senate network – 109th congress

    Voting records from the 109th congress (2005 - 2006). There are 100 senators whose votes were recorded on 542 bills; each vote is a binary outcome.

    35© Eric Xing @ CMU, 2005-2013

    Senate network – 109th congress

    March 2005 January 2006 August 2006

    36© Eric Xing @ CMU, 2005-2013

  • 19

    Senator Chafee

    37© Eric Xing @ CMU, 2005-2013

    Senator Ben Nelson

    T=0.2 T=0.8

    38© Eric Xing @ CMU, 2005-2013

  • 20

    Progression and Reversion of Breast Cancer cells

    S1 (normal)

    T4 (malignant)

    T4 revert 1

    T4 revert 2

    T4 revert 3

    39© Eric Xing @ CMU, 2005-2013

    Estimate Neighborhoods Jointly Across All Cell Types

    T4

    S1

    How to share information across the cell types?

    T4R1

    T4R2

    T4R3

    40© Eric Xing @ CMU, 2005-2013

  • 21

    Penalize differences between networks of adjacent cell types

    Sparsity of Difference

    T4

    S1

    T4R1

    T4R2

    T4R3

    41© Eric Xing @ CMU, 2005-2013

    Tree-Guided Graphical Lasso (Treegl)

    Objective: RSS for all cell types + sparsity + sparsity of difference

    42 © Eric Xing @ CMU, 2005-2013
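    A numpy sketch of an objective with this shape (a simplified stand-in, not the exact Treegl formulation): per-cell-type neighborhood regressions with an L1 penalty on each coefficient vector plus an L1 penalty on differences between tree-adjacent cell types:

    ```python
    import numpy as np

    def treegl_style_objective(betas, Xs, ys, tree_edges, lam1=0.1, lam2=0.1):
        """Simplified Treegl-style objective for one target node (sketch).

        betas: dict cell_type -> coefficient vector regressing the target node on the rest.
        Xs, ys: dict cell_type -> design matrix / target-node values for that cell type.
        tree_edges: list of (parent, child) cell-type pairs, e.g. [("S1", "T4"), ("T4", "T4R1")].
        """
        rss = sum(np.sum((ys[c] - Xs[c] @ betas[c]) ** 2) for c in betas)              # RSS for all cell types
        sparsity = lam1 * sum(np.sum(np.abs(betas[c])) for c in betas)                 # sparsity
        diff = lam2 * sum(np.sum(np.abs(betas[a] - betas[b])) for a, b in tree_edges)  # sparsity of difference
        return rss + sparsity + diff
    ```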

  • 22

    Network Overview

    [Figure: network overview; labels include T4, S1, EGFR-ITGB1, PI3K-MAPKK, MMPs]

    43 © Eric Xing @ CMU, 2005-2013

    Summary

    Graphical Gaussian Model: the precision matrix encodes structure; not estimable when p >> n.

    Neighborhood selection: conditional distribution under GGM/MRF; graphical lasso; sparsistency.

    Time-varying Markov networks: kernel reweighting estimation; total variation estimation.

    44© Eric Xing @ CMU, 2005-2013