BIGDATA Workshop


Slide 1/94

    Big Challenges with Big Data

    in Life Sciences

    Shankar Subramaniam

    UC San Diego

Slide 2/94

    The Digital Human

Slide 3/94

A Super-Moore's Law

    Adapted from Lincoln Stein 2012

Source: http://ivory.idyll.org/blog/cloud-not-the-solution.html
Slide 4/94

    The Phenotypic Readout

Slide 5/94

    Data to Networks to Biology

Slide 6/94

NETWORK RECONSTRUCTION

Data-driven network reconstruction of biological systems:

- Derive relationships between input/output data
- Represent the relationships as a network

Inverse Problem: Data-driven Network Reconstruction

Experiments/Measurements

Slide 7/94

Network Reconstructions: Reverse Engineering of Biological Networks

Reverse engineering of biological networks:

- Structural identification: to ascertain network structure or topology.
- Identification of dynamics: to determine interaction details.

Main approaches:

- Statistical methods
- Simulation methods
- Optimization methods
- Regression techniques
- Clustering

Slide 8/94

Network Reconstruction of Dynamic Biological Systems: Doubly Penalized LASSO

Behrang Asadi*, Mano R. Maurya*, Daniel Tartakovsky, Shankar Subramaniam

Department of Bioengineering, University of California, San Diego

NSF grants (STC-0939370, DBI-0641037 and DBI-0835541); NIH grant 5 R33 HL087375-02

* Equal effort

Slide 9/94

APPLICATION: Phosphoprotein signaling and cytokine measurements in RAW 264.7 macrophage cells.

Slide 10/94

MOTIVATION FOR THE NOVEL METHOD

Various methods:

- Regression-based approaches (least-squares) with statistical significance testing of coefficients
- Dimensionality reduction to handle correlation: PCR and PLS
- Optimization/shrinkage (penalty)-based approach: LASSO
- Partial-correlation and probabilistic model/Bayesian-based approaches

Different methods have distinct advantages/disadvantages. Can we benefit by combining the methods and compensate for the disadvantages?

A novel method: Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), which incorporates both statistical significance testing and shrinkage.

Slide 11/94

    LINEAR REGRESSION

    Goal: Building a linear-relationship based model

    X: input data (m samples by n inputs), zero mean, unit standard deviation

    y: output data (m samples by 1 output column), zero-mean

    b: model coefficients: translates into the edges in the network

    e: normal random noise with zero mean

Ordinary Least Squares solution:

$\hat{b} = \arg\min_b \{(y - Xb)^T (y - Xb)\} = (X^T X)^{-1} X^T y$

where $y = Xb + e$, $e \sim N(0, \sigma^2)$.

Formulation for dynamic systems:

$\frac{dy(t)}{dt} = Xb + e(t), \quad e(t) \sim N(0, \sigma^2)$
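As a quick illustration of the formulas above, here is a minimal numpy sketch (made-up dimensions and coefficients, not data from the talk) that generates y = Xb + e and recovers b by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.standard_normal((m, n))                 # input data: m samples by n inputs
b_true = np.array([1.5, 0.0, -2.0, 0.0, 0.7])   # illustrative coefficients
y = X @ b_true + 0.1 * rng.standard_normal(m)   # y = Xb + e, e ~ N(0, sigma^2)

# b_hat = (X^T X)^{-1} X^T y, computed via a least-squares solver for stability
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)
```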

Slide 12/94

STATISTICAL SIGNIFICANCE TESTING

Most coefficients come out non-zero, a mathematical artifact. Perform statistical significance testing: compute the standard deviation of the coefficients.

Ratio: $r_{ij,k} = b_{ij,k} / \sigma_{b_{ij,k}}$

A coefficient is significant (different from zero) if: $|r_{ij}| > \mathrm{tinv}(1 - \alpha/2, \nu)$, where $\nu$ = degrees of freedom and $1 - \alpha$ = confidence level.

For least squares: $\sigma_{b,\mathrm{LS}} = \mathrm{diag}\{(X^T X)^{-1}\}^{1/2}\,\mathrm{RMSE}_{\mathrm{LS}}$ with $\nu = m - n - 1$, and $\mathrm{RMSE}_{\mathrm{LS}} = \left(\sum_{i=1}^{m} (y_i - y_{p,i})^2 / (m - p - 1)\right)^{1/2}$.

Edges in the network graph represent the coefficients.

* Krämer, Nicole, and Masashi Sugiyama. "The degrees of freedom of partial least squares regression." Journal of the American Statistical Association 106.494 (2011): 697-705.
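A hedged sketch of this t-test, assuming the OLS setup from the previous snippet; `significant_coefficients` is an illustrative helper, not code from the talk:

```python
import numpy as np
from scipy import stats

def significant_coefficients(X, y, b_hat, alpha=0.05):
    m, n = X.shape
    resid = y - X @ b_hat
    dof = m - n - 1                                   # nu = m - n - 1
    rmse = np.sqrt(resid @ resid / dof)
    sigma_b = np.sqrt(np.diag(np.linalg.inv(X.T @ X))) * rmse
    r = b_hat / sigma_b                               # ratio r = b / sigma_b
    t_crit = stats.t.ppf(1 - alpha / 2, dof)          # Matlab's tinv(1 - alpha/2, nu)
    return np.abs(r) > t_crit                         # True where a coefficient is significant
```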

Slide 13/94

CORRELATED INPUTS: PLS

Partial least squares finds the direction in the X space that explains the maximum variance direction in the Y space.

PLS regression is used when the number of observations per variable is low and/or collinearity exists among X values.

Requires an iterative algorithm: NIPALS, SIMPLS, etc. Statistical significance testing is iterative.

$X = TP^T + E$
$Y = UQ^T + F$
$Y = XB + B_0$

* H. Wold (1975), Soft modelling by latent variables; the non-linear iterative partial least squares approach, in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett, J. Gani, ed., Academic Press, London.
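For readers who want to try this, scikit-learn's PLSRegression (a NIPALS-style iteration) can serve as a stand-in; the data below are synthetic and only meant to show the collinearity use case:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(50)   # nearly collinear columns
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(50)

pls = PLSRegression(n_components=3)   # number of latent components to extract
pls.fit(X, y)
print(pls.coef_.ravel())              # regression coefficients B in Y = XB + B0
```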

Slide 14/94

LASSO

Shrinkage version of the Ordinary Least Squares, subject to an L1 penalty constraint (the sum of the absolute values of the coefficients should be less than a threshold).

The LASSO estimator is then defined as:

$(\hat{b}_0, \hat{b}) = \arg\min \sum_{i=1}^{N} \Big( y_i - b_0 - \sum_j b_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j |b_j| \le t \sum_j |\hat{b}_j^0|$

where $\hat{b}^0$ represents the full least squares estimates and $0 < t < 1$ causes the shrinkage.

* Tibshirani, R.: Regression shrinkage and selection via the Lasso, J. Roy. Stat. Soc. B Met., 1996, 58, (1), pp. 267-288.
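A small sketch of LASSO in practice. Note that scikit-learn's Lasso solves the equivalent penalized (Lagrangian) form rather than the constrained form shown above; the mapping between its alpha and the threshold t is data-dependent:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
b_true = np.array([3.0, 0.0, 0.0, -1.5, 0.0, 0.0, 2.0, 0.0])
y = X @ b_true + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha controls the L1 penalty strength
print(lasso.coef_)                   # several entries are shrunk exactly to zero
```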

Slide 15/94

    Noise and Missing Data

A more systematic comparison is needed with respect to:

1. Noise: level, type
2. Size (dimension)
3. Level of missing data
4. Collinearity or dependency among input channels
5. Missing data
6. Nonlinearity between inputs/outputs and nonlinear dependency
7. Time-series inputs(/outputs) and dynamic structure

Slide 16/94

    METHODS

Linear Matrix Inequalities (LMI)*

Converts a nonlinear optimization problem into a linear optimization problem:

$\min_B \varepsilon \quad \text{s.t.} \quad (Y - XB)^T (Y - XB) \preceq \varepsilon I$

Congruence transformation (Schur-complement form):

$\begin{bmatrix} \varepsilon I_m & Y - XB \\ (Y - XB)^T & I_p \end{bmatrix} \succeq 0$

Pre-existing knowledge of the system (e.g. sign constraints such as $a_{13} \ge 0$, $a_{21} \le 0$) can be added in the form of LMI constraints $v_i^T (B - B_0) u_j \ge 0$, where the indicator vectors $v_r$ and $u_r$ (entries 0 or 1) select the constrained coefficients.

Threshold the coefficients: $\bar{b}_{ij} = b_{ij} / \sqrt{\overline{b_i^2}\,\overline{b_j^2}}$

* [Cosentino, C., et al., IET Systems Biology, 2007. 1(3): p. 164-173]

Slide 17/94

METRICS

Metrics for comparing the methods:

- Reconstruction from 80% of the datasets, with 20% held out for validation.
- RMSE on the test set, and the number and identity of the significant predictors, as the basic metrics to evaluate the performance of each method.

1. Fractional error in estimating the parameters:

$b_{\mathrm{frac},j} = \mathrm{mean}\left( \left| 1 - \frac{b_{\mathrm{method},j}}{b_{\mathrm{true},j}} \right| \right)$

(Parameters smaller than 10% of the standard deviation of all parameter values were set to 0 when generating the synthetic data.)

2. Sensitivity, specificity, G, accuracy:

$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}, \quad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}$

TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative.
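These counts translate directly into code; a minimal sketch, assuming true and recovered edge sets are given as boolean arrays (the function name is illustrative):

```python
import numpy as np

def edge_metrics(true_edges, found_edges):
    tp = np.sum(true_edges & found_edges)     # true positives
    tn = np.sum(~true_edges & ~found_edges)   # true negatives
    fp = np.sum(~true_edges & found_edges)    # false positives
    fn = np.sum(true_edges & ~found_edges)    # false negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    g = np.sqrt(sensitivity * specificity)    # geometric mean of sensitivity and specificity
    return sensitivity, specificity, g, accuracy
```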

Slide 18/94

    RESULTS: DATA SETS

Data sets for benchmarking: two data sets.

1. First set: experimental data measured on macrophage cells (Phosphoprotein (PP) vs Cytokine)*.

2. Second set: synthetic data generated in Matlab. We build the model using 80% of the data set (the training set) and use the rest of the data set to validate the model (the test set).

* [Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11].

Slide 19/94

    RESULTS: PP-Cytokine Data Set

Schematic representation of Phosphoprotein (PP) vs Cytokine.

- Signals were transmitted through 22 recorded signaling proteins and other (unmeasured) pathways.
- Only measured pathways contributed to the analysis.

Schematic graphs from: [Pradervand, S., M.R. Maurya, and S. Subramaniam, Genome Biology, 2006. 7(2): p. R11].

Slide 20/94

    PP-CYTOKINE DATASET

    Measurements of phosphoproteins in response to LPS

    Courtesy: AfCS

Slide 21/94

Measurements of cytokines in response to LPS

    ~ 250 such datasets

Slide 22/94

    RESULTS: COMPARISON

Comparison on synthetic noisy data: the methods are applied to synthetic data with 22 inputs and 1 output.

About one third of the true input coefficients are made zero to test whether the methods identify them as insignificant.

Effect of noise level: four outputs with 5, 10, 20 and 40% noise levels, respectively, are generated from the noise-free (true) output.

Effect of noise type: three outputs with white, t-distributed, and uniform noise types, respectively, are generated from the noise-free (true) output.

Slide 23/94

RESULTS: COMPARISON

Variability between realizations of data with white noise: PCR, LASSO, and LMI are used to identify significant predictors for 1000 input-output pairs.

Histograms of the coefficients in the three significant predictors common to the three methods:

Predictor #                     1       10      11
True value                   -3.40    5.82   -6.95
PCR     Mean                 -3.81    4.73   -6.06
        Std.                  0.33    0.32    0.32
        Frac. err. in mean    0.12    0.19    0.13
LASSO   Mean                 -2.82    4.48   -5.62
        Std.                  0.34    0.32    0.33
        Frac. err. in mean    0.17    0.23    0.19
LMI     Mean                 -3.70    4.74   -6.34
        Std.                  0.34    0.32    0.34
        Frac. err. in mean    0.09    0.18    0.09

Mean and standard deviation in the histograms of the coefficients computed with PCR, LASSO, and LMI.

Slide 24/94

RESULTS: COMPARISON

Comparison of the outcomes of the different methods on the real data: different methods identified unique sets of common and distinct predictors for each output.

Graphical illustration of the PCR, LASSO, and LMI methods in detection of significant predictors for output IL-6 in the PP/cytokine experimental dataset:

- Only the PCR method detects the true input cAMP.
- Zone I provides validation; it highlights the common output of all the methods.

Slide 25/94

    RESULTS: SUMMARY

Comparison with respect to different noise types: LASSO is the most robust method across noise types.

Missing data RMSE: LASSO shows less deviation and is more robust.

Collinearity: PCR shows less deviation against noise level, and better accuracy and G with increasing noise level.

Slide 26/94

A COMPARISON (Asadi, et al., 2012)

Criterion (score definition)                                              PCR    LASSO   LMI

Increasing noise: RMSE
  Score = (average RMSE across noise levels for LS) / (average RMSE
  across noise levels for the chosen method)                              0.68*   0.56   0.94

Standard deviation and error in mean of coefficients
  Score = 1 - average(fractional error in mean(10,12,20) +
  std(10,12,20) / |true associated coefficients|)                         0.53    0.47   0.55

Acc./G
  Score = average accuracy across noise levels for the chosen method
  (white noise)                                                           0.70    0.87   0.91**

Fractional error in estimating the parameters
  Score = 1 - average fractional error in estimating the coefficients
  across noise levels (white noise)                                       0.81    0.55   0.78

Types of noise: fractional error in estimating the parameters
  Score = 1 - average fractional error in estimating the coefficients
  across noise levels and noise types (20% noise level)                   0.80    0.56   0.79

Types of noise: accuracy and G
  Score = average accuracy across noise levels and noise types            0.71    0.87   0.91

Dimension ratio / size: fractional error in estimating the parameters
  Score = 1 - average fractional error in estimating the coefficients
  across noise levels and ratios (m/n = 100/25, 100/50, 400/100)          0.77    0.53   0.75

Dimension ratio / size: accuracy and G
  Score = average accuracy across white noise levels and ratios
  (m/n = 100/25, 100/50, 400/100)                                         0.66    0.83   0.90

* PCR degrades gradually with the level of noise.
** At high noise all methods are similar.

Slide 27/94

    DPLASSO

Doubly Penalized Least Absolute Shrinkage and Selection Operator

Slide 28/94

OUR APPROACH: DPLASSO

Model: $y = Xb + e$

PLS with statistical significance testing produces a weight vector $W = \{0, 1, 0, 1, 0, 1, 0, 1, ...\}$ over the coefficients $B = \{b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8, ...\}$, marking which coefficients are PLS-significant.

LASSO is then run with these weights, and the reconstructed network retains the surviving coefficients, e.g. $B = \{b_1, b_3, b_5, b_6, b_7, ...\}$.

Slide 29/94

DPLASSO WORKFLOW

Our approach: DPLASSO includes two parameter-selection layers.

Layer 1 (supervisory layer):
- Partial Least Squares (PLS)
- Statistical significance testing

Layer 2 (lower layer):
- LASSO with extra weights on the less informative model parameters derived in layer 1
- Retain significant predictors and set the remaining small coefficients to zero

$\hat{b} = \arg\min \{(y - Xb)^T (y - Xb)\} \quad \text{s.t.} \quad \sum_{i=1,...,p} w_{ij} |b_{ij}| \le t \sum_{i=1,...,p} w_{ij} |b_{ij}^{LS}|$

$w_{ij} = \begin{cases} 0 & \text{if } b_{ij} \text{ is PLS-significant} \\ 1 & \text{otherwise} \end{cases}$
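A minimal sketch of this two-layer idea (my reading of the scheme above, not the authors' implementation): the PLS-derived weights are folded into a standard LASSO by rescaling columns, which turns the weighted L1 penalty into a plain one:

```python
import numpy as np
from sklearn.linear_model import Lasso

def dplasso_sketch(X, y, weights, alpha=0.1, eps=1e-6):
    """weights[j] ~ 0 for PLS-significant coefficients, ~ 1 otherwise."""
    w = np.maximum(weights, eps)   # floor so unpenalized columns stay finite
    Xs = X / w                     # column rescaling: sum_j w_j|b_j| becomes a plain L1 penalty
    fit = Lasso(alpha=alpha).fit(Xs, y)
    return fit.coef_ / w           # map the solution back to the original coefficients
```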

Slide 30/94

DPLASSO: EXTENDED VERSION

Smooth weights:

Layer 1: continuous significance score (versus binary):

$\lambda_i = |r_i^{\mathrm{PLS}}| - \mathrm{tinv}(1 - \alpha/2, \nu)$, with $\nu$ = DOF and $1 - \alpha$ = confidence level.

Mapping function (logistic significance score):

$s(\lambda_i) = \frac{1}{1 + e^{-\gamma \lambda_i}}$, where $\gamma$ is a tuning parameter.

Layer 2: continuous weight vector (versus fuzzy weight vector):

$w_i(\lambda_i) = 1 - s(\lambda_i)$, so significant coefficients ($0.5 < s(\lambda_i) \le 1$) receive weights $0 \le w_i < 0.5$, and insignificant coefficients ($0 \le s(\lambda_i) \le 0.5$) receive weights $0.5 \le w_i \le 1$.

$\hat{b} = \arg\min \{(y - Xb)^T (y - Xb)\} \quad \text{s.t.} \quad \sum_{i=1,...,p} w_i |b_{ij}| \le t \sum_{i=1,...,p} w_i |b_{ij}^{LS}|$

[Figure: logistic significance score s(λ) and weight function w(λ) plotted against the significance score λ.]

Slide 31/94

    APPLICATIONS

1. Synthetic (random) networks: datasets generated in Matlab

2. Biological dataset: Saccharomyces cerevisiae cell-cycle model

Slide 32/94

    SYNTHETIC (RANDOM) NETWORKS

Datasets generated in Matlab using:

- A linear dynamic system
- Dominant poles/eigenvalues (λ) in the range [-2, 0]
- Lyapunov stable. Informal definition from Wikipedia: if all solutions of the dynamical system that start out near an equilibrium point x_e stay near x_e forever, then the system is Lyapunov stable.
- Zero-input/excited-state release condition
- 5% measurement (white) noise

$\frac{dy(t)}{dt} = Xb + e(t), \quad e(t) \sim N(0, \sigma^2)$

Slide 33/94

METRICS

Two metrics to evaluate the performance of DPLASSO:

1. Sensitivity, specificity, G (geometric mean of sensitivity and specificity), accuracy:

$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}, \quad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}$

TP: True Positive; FP: False Positive; TN: True Negative; FN: False Negative.

2. The root-mean-squared error (RMSE) of prediction:

$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - y_{i,p})^2}$

Slide 34/94

TUNING

Tuning the shrinkage parameter for DPLASSO: the shrinkage parameter at the LASSO level (threshold t) is selected via k-fold cross-validation (k = 10) on the associated dataset.

[Figure: validation error versus selection threshold t for DPLASSO on the synthetic data set.]

Rule of thumb after cross-validation. Example: the optimal value of the tuning parameter for a network with 65% connectivity is roughly equal to 0.65.

Slide 35/94

PERFORMANCE COMPARISON: ACCURACY

[Figure: accuracy surfaces for LASSO, DPLASSO, and PLS at network densities 5%, 10%, 20%, and 50%. Network size 20, MC 10, noise 5%.]

PLS shows better performance here. DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network densities.

Slide 36/94

PERFORMANCE COMPARISON: SENSITIVITY

[Figure: sensitivity surfaces for LASSO, DPLASSO, and PLS at network densities 5%, 10%, 20%, and 50%. Network size 20, MC 10, noise 5%.]

LASSO has better performance here. DPLASSO provides a good compromise between LASSO and PLS in terms of sensitivity for different network densities.

Slide 37/94

PERFORMANCE COMPARISON: SPECIFICITY

[Figure: specificity surfaces for LASSO, DPLASSO, and PLS at network densities 5%, 10%, 20%, and 50%. Network size 20, MC 10, noise 5%.]

DPLASSO provides a good compromise between LASSO and PLS in terms of specificity for different network densities.

Slide 38/94

PERFORMANCE COMPARISON: NETWORK SIZE

[Figure: accuracy surfaces for LASSO, DPLASSO, and PLS at network sizes 10 (100 potential connections), 20 (400 potential connections), and 50 (2500 potential connections).]

DPLASSO provides a good compromise between LASSO and PLS in terms of accuracy for different network sizes, and likewise in terms of sensitivity (not shown).

Slide 39/94

ROC CURVE vs. DYNAMICS AND WEIGHTINGS

[Figure: ROC curves (sensitivity versus specificity) for LASSO, DPLASSO, and PLS as the tuning parameter varies. Density 20%, MC 10, size 50.]

DPLASSO exhibits better performance for networks with slow dynamics. The tuning parameter in DPLASSO can be adjusted to improve performance for fast dynamic networks.

Slide 40/94

    YEAST CELL DIVISION

Experimental dataset generated via a well-known nonlinear model of the cell division cycle of fission yeast. The model is dynamic, with 9 state variables.

* Novak, Bela, et al. "Mathematical model of the cell division cycle of fission yeast." Chaos: An Interdisciplinary Journal of Nonlinear Science 11.1 (2001): 277-286.

Slide 41/94

    CELL DIVISION CYCLE

[Figure: the true network (cell division cycle) alongside the networks reconstructed by PLS, DPLASSO, and LASSO; one true edge is missing in DPLASSO.]

Slide 42/94

    RECONSTRUCTION PERFORMANCE

Case Study II: Cell Division Cycle (average over tuning-parameter values)

Method    Accuracy  Sensitivity  Specificity  SD RMSE/Mean
LASSO       0.31       0.92         0.16          0.14
DPLASSO     0.56       0.73         0.52          0.08
PLS         0.60       0.67         0.63          0.09

Case Study I: 10 Monte Carlo simulations, size 20 (average over different tuning parameters, network density, and Monte Carlo sample datasets)

Method    Accuracy  Sensitivity  Specificity  SD RMSE/Mean
LASSO       0.39       0.90         0.05          0.06
DPLASSO     0.52       0.90         0.34          0.07
PLS         0.59       0.80         0.20          0.07

Slide 43/94

    CONCLUSION

A novel method, Doubly Penalized Least Absolute Shrinkage and Selection Operator (DPLASSO), to reconstruct dynamic biological networks:

- Based on integration of significance testing of coefficients and optimization
- Smoothing function to trade off between PLS and LASSO

Simulation results on synthetic datasets: DPLASSO provides a good compromise between PLS and LASSO in terms of accuracy and sensitivity for

- Different network densities
- Different network sizes

For the biological dataset:

- DPLASSO is best in terms of sensitivity
- DPLASSO is a good compromise between LASSO and PLS in terms of accuracy, specificity and lift

Slide 44/94

Information Theory Methods

Farzaneh Farangmehr

Slide 45/94

Mutual Information

Mutual information gives us a metric indicative of how much information about one variable can be obtained to predict the behavior of the other variable. The higher the mutual information, the more similar the two profiles.

For two discrete random variables X = {x1,...,xn} and Y = {y1,...,ym}:

$I(X;Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$

where $p(x_i, y_j)$ is the joint probability of $x_i$ and $y_j$, and $p(x_i)$ and $p(y_j)$ are the marginal probabilities of $x_i$ and $y_j$.
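A direct transcription of this double sum, assuming the joint distribution is given as a probability table (the function name is illustrative):

```python
import numpy as np

def mutual_information(p_xy):
    """p_xy[i, j] = p(x_i, y_j): a joint probability table that sums to 1."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginals p(x_i)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginals p(y_j)
    mask = p_xy > 0                         # convention: 0 * log 0 = 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))
```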

Slide 46/94

Information-theoretical approach: Shannon theory

Hartley's conceptual framework of information relates the information of a random variable to its probability.

Shannon defined the entropy H of a random variable X, given a random sample {x1,...,xn}, in terms of its probability distribution:

$H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$

Entropy is a good measure of randomness or uncertainty.

Shannon defines mutual information as the amount of information about a random variable X that can be obtained by observing another random variable Y:

$I(X,Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y) = I(Y,X)$

Slide 47/94

Mutual information networks

X = {x1,...,xi}, Y = {y1,...,yj}

The ultimate goal is to find the best model that maps X → Y. The general definition is Y = f(X) + U; in linear cases, Y = [A]X + U, where [A] is a matrix that defines the linear dependency of inputs and outputs.

Information theory maps inputs to outputs (both linear and non-linear models) by using the mutual information:

$I(X;Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$

Slide 48/94

Mutual information networks

The entire framework of network reconstruction using information theory has two stages:

1. Mutual information measurements
2. The selection of a proper threshold

Mutual information networks rely on the measurement of the mutual information matrix (MIM). The MIM is a square matrix whose elements (MIM_ij = I(X_i; Y_j)) are the mutual information between X_i and Y_j.

Choosing a proper threshold is a non-trivial problem. The usual way is to permute the expression measurements many times and recalculate a distribution of the mutual information for each permutation. The distributions are then averaged, and a good choice for the threshold is the largest mutual information value in the averaged permuted distribution.

Slide 49/94

Mutual information networks: Data Processing Inequality (DPI)

The DPI for biological networks states that if genes g1 and g3 interact only through a third gene, g2, then:

$I(g_1, g_3) \le \min[I(g_1, g_2), I(g_2, g_3)]$

Checking against the DPI may identify those gene pairs which are not directly dependent, even if $p(g_i, g_j) \ne p(g_i)\,p(g_j)$.

Slide 50/94

    ARACNe algorithm

ARACNE flowchart [Califano and coworkers]

ARACNE stands for Algorithm for the Reconstruction of Accurate Cellular NEtworks [25].

ARACNE identifies candidate interactions by estimating pairwise gene expression profile mutual information, I(gi, gj), and then filters the MIs using an appropriate threshold, I0, computed for a specific p-value, p0. In the second step, ARACNE removes the vast majority of indirect connections using the Data Processing Inequality (DPI).
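A compact sketch of these two steps (thresholding, then DPI pruning) on a symmetric mutual-information matrix; parameter names are illustrative and this is not the reference implementation:

```python
import numpy as np
from itertools import combinations

def aracne_sketch(mim, i0, tol=0.0):
    """mim: symmetric MI matrix; i0: MI threshold; tol: DPI tolerance."""
    adj = np.where(mim > i0, mim, 0.0)             # step 1: hard-threshold the MI matrix
    n = adj.shape[0]
    for i, j, k in combinations(range(n), 3):      # step 2: Data Processing Inequality
        for a, b, via in ((i, j, k), (i, k, j), (j, k, i)):
            if adj[a, b] and adj[a, b] < min(adj[a, via], adj[via, b]) - tol:
                adj[a, b] = adj[b, a] = 0.0        # weakest edge of the triangle is indirect
    return adj
```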

Slide 51/94

Protein-Cytokine Network in Macrophage Activation

Slide 52/94

Application to Protein-Cytokine Network Reconstruction

Release of immune-regulatory cytokines during the inflammatory response is mediated by a complex signaling network [45]. Current knowledge does not provide a complete picture of these signaling components.

22 signaling proteins responsible for cytokine release: cAMP, AKT, ERK1, ERK2, Ezr/Rdx, GSK3A, GSK3B, JNK lg, JNK sh, MSN, p38, p40Phox, NFkB p65, PKCd, PKCmu2, RSK, Rps6, SMAD2, STAT1a, STAT1b, STAT3, STAT5

7 released cytokines (as signal receivers): G-CSF, IL-1a, IL-6, IL-10, MIP-1a, RANTES, TNFa

We developed an information-theoretic model that derives the responses of seven cytokines from the activation of twenty-two signaling phosphoproteins in RAW 264.7 macrophages. This model captured most known signaling components involved in cytokine release and was able to reasonably predict potentially important novel signaling components.

Slide 53/94

Protein-Cytokine Network Reconstruction: MI Estimation using KDE

Given a random sample $\{x_1, ..., x_n\}$ for a univariate random variable X with an unknown density f, a kernel density estimator (KDE) estimates the shape of this function as:

$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$

Assuming Gaussian kernels:

$\hat{f}_h(x) = \frac{1}{\sqrt{2\pi}\,nh} \sum_{i=1}^{n} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$

The bivariate kernel density function of two random variables X and Y, given two random samples $\{x_1, ..., x_n\}$ and $\{y_1, ..., y_n\}$:

$\hat{f}_h(x, y) = \frac{1}{2\pi n h^2} \sum_{i=1}^{n} \exp\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2h^2}\right)$

Mutual information of X and Y using kernel density estimation:

$\hat{I}(X, Y) = \frac{1}{n} \sum_{j=1}^{n} \ln \frac{\hat{f}(x_j, y_j)}{\hat{f}(x_j)\,\hat{f}(y_j)}$

n = sample size; h = kernel width.
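A sketch of this estimator using scipy's gaussian_kde; note that scipy applies its own bandwidth rule, so this approximates rather than reproduces the slide's estimator with a shared width h:

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """x, y: 1-d sample arrays of equal length."""
    f_xy = gaussian_kde(np.vstack([x, y]))        # bivariate density estimate f(x, y)
    f_x, f_y = gaussian_kde(x), gaussian_kde(y)   # univariate marginal estimates
    pts = np.vstack([x, y])
    return np.mean(np.log(f_xy(pts) / (f_x(x) * f_y(y))))
```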

Slide 54/94

Protein-Cytokine Network Reconstruction: Kernel bandwidth selection

There is no universal way of choosing h; however, the ranking of the MIs depends only weakly on it.

The most common criterion used to select the optimal kernel width is to minimize the expected risk function, also known as the mean integrated squared error (MISE):

$\mathrm{MISE}(h) = E\left[ \int (\hat{f}_h(x) - f(x))^2 \, dx \right]$

Loss function (integrated squared error):

$L(h) = \int (\hat{f}_h(x) - f(x))^2 dx = \int \hat{f}_h^2(x)\,dx - 2\int \hat{f}_h(x) f(x)\,dx + \int f^2(x)\,dx, \quad \text{where } \int f^2(x)\,dx = \text{const.}$

The unbiased cross-validation approach selects the kernel width that minimizes the loss function by minimizing:

$\mathrm{UCV}(h) = \int \hat{f}_h^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i),h}(x_i)$

where $\hat{f}_{(-i),h}(x_i)$ is the kernel density estimate with bandwidth h at $x_i$, obtained after removing the i-th observation.

Slide 55/94

Protein-Cytokine Network Reconstruction: Threshold Selection

Based on large deviation theory (extended to biological networks by ARACNE), the probability that an empirical value of the mutual information I is greater than $I_0$, provided that its true value $\bar{I} = 0$, is:

$P(I > I_0 \mid \bar{I} = 0) \sim e^{-cNI_0}$

where the bar denotes the true MI, N is the sample size, and c is a constant. After taking the logarithm of both sides:

$\ln P = a - b I_0$

Therefore, $\ln P$ can be fitted as a linear function of $I_0$ with slope $-b$, where b is proportional to the sample size N. Using these results, for any given dataset with sample size N and a desired p-value, the corresponding threshold can be obtained.
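Once a and b have been fitted, inverting the linear relation gives the threshold; a one-line sketch (the fitted intercept a and slope b are assumed inputs):

```python
import numpy as np

def mi_threshold(a, b, p_value):
    return (a - np.log(p_value)) / b   # solve ln p = a - b * I0 for I0
```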

Slide 56/94

Kernel density estimation of cytokines

Figure 3: The probability distributions of the seven released cytokines in RAW 264.7 macrophages, obtained using kernel density estimation (KDE).

Mutual information for all 22×7 phosphoprotein-cytokine pairs from Toll data (the upper bar) and non-Toll data (the lower bar).

Slide 57/94

Protein-Cytokine Network Reconstruction: Protein-cytokine signaling networks

[Figure: the topology of signaling protein-released cytokine networks obtained from the non-Toll (A) and Toll (B) data.]

Slide 58/94

Protein-Cytokine Network Reconstruction: Summary

This model successfully captures all known signaling components involved in cytokine release.

It predicts two potentially new signaling components involved in cytokine release: Ribosomal S6 kinase on Tumor Necrosis Factor, and Ribosomal Protein S6 on Interleukin-10.

For MIP-1 and IL-10, which have low coefficients of determination (data that lead to less precise linear models), the information-theoretical model shows an advantage over linear methods such as the PCR minimal model [Pradervand et al.] in capturing all known regulatory components involved in cytokine release.

Slide 59/94

Network reconstruction from time-course data: Background: time-delayed gene networks

This comes from the consideration that the expression of a gene at a certain time could depend on the expression level of another gene at the previous time point, or at very few time points before.

The time-delayed gene regulation pattern in organisms is a common phenomenon, since:

- If the effect of gene g1 on gene g2 depends on an inducer, g3, that has to be bound first in order to be able to bind to the inhibition site on g2, there can be a significant delay between the expression of gene g1 and its observed effect, i.e., the inhibition of gene g2.

- Not all the genes that influence the expression level of a gene are necessarily observable in one microarray experiment. It is quite possible that they are not among the genes being monitored in the experiment, or that their function is currently unknown.

Slide 60/94

    The Algorithm

$\mathrm{ICNA} = \arg\min \left\{ \left( e_{s_i t} / e_{s_i t_0} \right)_{\mathrm{up}} \ \text{or} \ \left( e_{s_i t} / e_{s_i t_0} \right)_{\mathrm{down}} \right\}$

Slide 61/94

Network reconstruction from time-course data: Algorithm

Slide 62/94

Network reconstruction from time-course data: The flow diagram

Gene lists → cluster into n sub-networks → measure sub-network activities → flag potentially dependent sub-networks by measuring ICNA → measure the influence between flagged sub-networks → build the influence matrix → find the threshold → remove connections below the threshold → apply DPI for connections above the threshold → build the network based on non-zero elements of the mutual information matrix.

The flow diagram of the information-theoretic approach for biological network reconstruction from time-course microarray data by identifying the topology of functional sub-networks.

Slide 63/94

Network reconstruction from time-course data: Case study: the yeast cell-cycle

The cell cycle consists of four distinct phases:

G0 (Gap 0): A resting phase where the cell has left the cycle and has stopped dividing.

G1 (Gap 1): Cells increase in size in Gap 1. The G1 checkpoint control mechanism ensures that everything is ready for DNA synthesis.

S1 (Synthesis): DNA replication occurs during this phase.

G2 (Gap 2): During the gap between DNA synthesis and mitosis, the cell will continue to grow. The G2 checkpoint control mechanism ensures that everything is ready to enter the M (mitosis) phase and divide.

M (Mitosis): Cell growth stops at this stage and cellular energy is focused on the orderly division into two daughter cells. A checkpoint in the middle of mitosis (the Metaphase Checkpoint) ensures that the cell is ready to complete cell division.
Slide 64/94

Network reconstruction from time-course data: Case study: the yeast cell-cycle

Data from Gene Expression Omnibus (GEO).

Culture synchronized by alpha-factor arrest; samples taken every 7 minutes as cells went through the cell cycle.

Value type: log ratio.

5,981 genes, 7,728 probes and 14 time points.

94 pathways from KEGG Pathways.

Slide 65/94

Network reconstruction from time-course data: Case study: the yeast cell-cycle

[Figure: the reconstructed functional network of the yeast cell cycle obtained from time-course microarray data.]

Slide 66/94

Mutual information networks: Advantages and Limits

A major advantage of information theory is its nonparametric nature. Entropy does not require any assumptions about the distribution of variables [43].

It does not make any assumption about the linearity of the model for the ease of computation.

It is applicable to time series data.

A high mutual information does not tell us anything about the direction of the relationship.

Slide 67/94

Time-Varying Networks and Causality

Maryam Masnardi-Shirazi

Slide 68/94

Causal Inference of Time-Varying Biological Networks

Slide 69/94

    Definition of Causality

    Beyond Correlation: Causation

Slide 70/94

Idea: map a set of K time series to a directed graph with K nodes, where an edge is placed from a to b if the past of a has an impact on the future of b.

How do we quantitatively do this in a general-purpose manner?

Slide 71/94

Granger's Notion of Causality

A process X is said to Granger-cause a process Y if future values of Y can be better predicted using the past values of X and Y than using only the past values of Y.

Slide 72/94

Granger Causality Formulation

There are many ways to formulate the notion of Granger causality, some of which are:

- Information theory and the concept of directed information
- Learning theory
- Dynamic Bayesian networks
- Vector autoregressive models (VAR), as in the sketch below
- Hypothesis tests, e.g. t-tests and F-tests
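A hedged sketch of the VAR route, assuming a lag-1 model and a crude magnitude cutoff standing in for the confidence-interval test used in the talk:

```python
import numpy as np

def var1_granger(Y, cutoff=0.1):
    """Y: (T, K) array holding K time series sampled at T time points.

    Fits y_t = A y_{t-1} + u_t by least squares and keeps an edge a -> b
    when |A[b, a]| exceeds the cutoff (illustrative, not a proper test).
    """
    X_past, X_next = Y[:-1], Y[1:]
    # lstsq solves X_past @ B = X_next; transpose so A[b, a] is the
    # coefficient of series a in the prediction of series b.
    B, *_ = np.linalg.lstsq(X_past, X_next, rcond=None)
    A = B.T
    return np.abs(A) > cutoff   # boolean adjacency: past of a affects future of b
```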

Slide 73/94

    Vector Autoregressive Model (VAR)

Slide 74/94

    Least Squares Estimation

Slide 75/94

    Least Squares Estimation (Cont.)

Slide 76/94

Processing the data

Phosphoprotein two-ligand screen assay: RAW 264.7.

There are 327 experiments from western blots processed with mixtures of phosphospecific antibodies. In all experiments, the effects of single-ligand and simultaneous ligand addition are measured.

Each experiment includes the fold change of phosphoprotein at time points t = 0, 1, 3, 10, 30 minutes.

Data at t = 30 minutes is omitted, and data from t = 0 to 10 is interpolated in steps of 1 minute.

Slide 77/94

Least Squares Estimation and Rank Deficiency of the Transformation Matrix

[Diagram: the Y data and X data from all 327 experiments (Exp. 1, Exp. 2, ..., Exp. 327) are stacked into block matrices.]

Slide 78/94

    Normalizing the data

Slide 79/94

    Statistical Significance Test (Confidence Interval)

Slide 80/94

The Reconstructed Phosphoproteins Signaling Network

- The network is reconstructed by estimating causal relationships between all nodes.
- All 21 phosphoproteins are present and interacting with one another.
- There are 122 edges in this network.

Slide 81/94

Correlation and Causation

The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables.

This does not mean that correlations cannot indicate the potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown.

Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).

Slide 82/94

Correlation and Causality comparison

[Figure: heat-map of the correlation matrix between the input (X) and output (Y); the reconstructed network considering significant coefficients and their intersection with connections having correlations higher than 0.5.]

Slide 83/94

Correlation and Causality comparison (cont.)

[Figure: heat-map of the correlation matrix between the input (X) and output (Y); the reconstructed network considering significant coefficients and their intersection with connections having correlations higher than 0.4.]

Slide 84/94

Validating our network

Identification of Crosstalk between Phosphoprotein Signaling Pathways in RAW 264.7 Macrophage Cells (Gupta et al., 2010)

Slide 85/94

The Reconstructed Phosphoproteins Signaling Network for t=0 to t=4 minutes

[Figure: heat-map of the correlation matrix between the input (X) and output (Y) for t=0 to t=4 minutes; intersection of causal coefficients with connections with correlations higher than 0.4 for t=0 to t=4 minutes.]

9 nodes, 15 edges.

Slide 86/94

The Reconstructed Phosphoproteins Signaling Network for t=3 to t=7 minutes

[Figure: heat-map of the correlation matrix between the input (X) and output (Y) for t=3 to t=7 minutes; intersection of causal coefficients with connections with correlations higher than 0.4 for t=3 to t=7 minutes.]

19 nodes, 51 edges.

Slide 87/94

The Reconstructed Phosphoproteins Signaling Network for t=6 to t=10 minutes

[Figure: heat-map of the correlation matrix between the input (X) and output (Y) for t=6 to t=10 minutes; intersection of causal coefficients with connections with correlations higher than 0.4 for t=6 to t=10 minutes.]

19 nodes, 56 edges.

Slide 88/94

Time-Varying Reconstructed Network

t=0 to 4 min, t=3 to 7 min, t=6 to 10 min

Slide 89/94

The Reconstructed Network for t=0 to t=4 minutes without the presence of LPS as a Ligand

With LPS: 15 edges. Without LPS: 16 edges.

Slide 90/94

The Reconstructed Network for t=3 to t=7 minutes without the presence of LPS as a Ligand vs. the presence of all ligands

With all ligands including LPS: 51 edges. Without LPS: 55 edges.

Slide 91/94

The Reconstructed Network for t=6 to t=10 minutes without the presence of LPS as a Ligand vs. the presence of all ligands

With all ligands including LPS: 56 edges. Without LPS: 66 edges.

Slide 92/94

Time-Varying Network with LPS not present as a ligand

t=0 to 4 min, t=3 to 7 min, t=6 to 10 min

Slide 93/94

    Summary

Information theory methods can help in determining causal and time-dependent networks from time-series data.

The granularity of the time course will be a factor in determining the causal connections.

Such dynamical networks can be used to construct both linear and nonlinear models from data.

Slide 94/94