Emerging causal inference problems in molecular systems biology Yi Liu, Ph.D. Beijing Jiaotong University The presented work was mainly collaborated with: Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei Zhang @ CAS -Max Planck partner Institute for Computational Biology Prof. Min Liu, Dr. Jin’e Li @ Institute of Genetics & Developmental Biology, CAS


Page 1: Emerging causal inference problems in molecular systems biology

Emerging causal inference problems in molecular systems biology

Yi Liu, Ph.D.

Beijing Jiaotong University

The presented work was done mainly in collaboration with:

Prof. Jing-Dong Jackie Han, Dr. Nan Qiao, Dr. Wei Zhang @ CAS-Max Planck Partner Institute for Computational Biology

Prof. Min Liu, Dr. Jin’e Li @ Institute of Genetics & Developmental Biology, CAS

Page 2: Emerging causal inference problems in molecular systems biology

Outline

• Background: Mining biological knowledge from the big data generated by Next Generation Sequencing (NGS) technology

• Examples of causal inference problems in biology

1) Inferring causal relationships between transcription factors, epigenetic modifications and gene expression level from heterogeneous deep sequencing data sets

2) Reverse-engineering the yeast genetic regulatory network from deletion-mutant gene expression data

3) Discovering subtypes of ovarian cancer and uncovering key molecular signatures that distinguish these subtypes

Page 3: Emerging causal inference problems in molecular systems biology

The need for integrating heterogeneous functional genomic data sets

Yi Liu* and Jing-Dong J. Han*. Application of Bayesian networks on large-scale biological data. Frontiers in Biology, 2010, 5(2):98-104.

Page 4: Emerging causal inference problems in molecular systems biology

SeqSpider: A new Bayesian network inference algorithm enabling integrative analysis of deep sequencing data

Y Liu, N Qiao et al., Cell Research (2013)

Thanks to Prof. Jing-Dong Han for contributing to the slides on this topic.

Page 5: Emerging causal inference problems in molecular systems biology

Limitation of traditional BN learning approaches

In traditional BN structure learning approaches, each node must take a discrete value.

The only exception is the linear-Gaussian case. However, this parameterization is still very restrictive.

Page 6: Emerging causal inference problems in molecular systems biology

Profiled signature of deep sequencing data

[Figure: H3K4me3 profile and mRNA profile. Liu et al., Nucleic Acids Res, 2010]

Deep sequencing data have distinctive profiled signatures along the chromosomes, especially at the gene promoter regions.

However, existing BN learning algorithms have no way to utilize such information.

Page 7: Emerging causal inference problems in molecular systems biology

Profiles of hESC regulators around TSSs

In this work, we infer causal relationships between transcription factors, epigenetic modifications and gene expression level in human and mouse embryonic stem cells.

Page 8: Emerging causal inference problems in molecular systems biology

Heterogeneous data types in systems biology

Dataset type | Details | Data type | Cell line | Labs/Organizations
DNA methylation | DNA methylation | vector real value | hES, H1 | University of California, San Diego
Histone modifications | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3 | vector real value | hES, H1 | University of California, San Diego
Gene expression | RNA-seq data | real value | hES, H1 | University of California, San Diego
Transcription factor | OCT4, KLF, MYC, TAFII, P300, SOX2, NANOG | vector real value | hES, H1 | Ludwig Institute for Cancer Research
PRC complex | EZH2 and RING1B | vector real value | hES, H9 | Broad Institute of MIT and Harvard

More seriously, a single systems biology investigation can involve heterogeneous data types. Handling multiple data types simultaneously in BN structure learning is not a trivial task.

Page 9: Emerging causal inference problems in molecular systems biology

Kernel-based surrogate dependency measures

In this work, we use the Kernel Generalized Variance (F. Bach, JMLR 2002) to quantify the joint dependence between heterogeneous variables; it replaces the mutual information-like measures in BN learning.

Page 10: Emerging causal inference problems in molecular systems biology

Kernels for heterogeneous types of data

Using the Kernel Generalized Variance (F. Bach, JMLR 2002) to quantify the joint dependence between heterogeneous variables, we only need to define a kernel for each type of data.

Discrete Data:

Real-valued Data:

For vectored (profiled) Data, we define:

Page 11: Emerging causal inference problems in molecular systems biology

The L1-RPS kernel

Page 12: Emerging causal inference problems in molecular systems biology

The L1-RPS kernel

Page 13: Emerging causal inference problems in molecular systems biology

Motivation of the L1-RPS kernel

Bin-to-bin distances (such as Euclidean) are not ideal for measuring the discrepancy between two sequence tag profiles.

The Earth Mover’s distance (EMD) computes the minimum mass transportation effort needed to ‘deform’ one profile into another.

The L1-RPS distance is equivalent to EMD when the two profiles have equal mass. In other cases, it also quantifies the total mass difference between the two profiles, which EMD does not.
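The kernel formula itself is not reproduced in this transcript. As a rough illustration only, the sketch below assumes that "L1-RPS" denotes the L1 distance between the running prefix sums of two equal-length profiles, a reading consistent with the equal-mass EMD equivalence stated above. The function names, the exponential form of the kernel and the bandwidth `sigma` are hypothetical choices, not taken from the slides.

```python
from itertools import accumulate
import math


def l1_rps_distance(p, q):
    """L1 distance between the running prefix sums of two profiles.

    Assumption: this is what "L1-RPS" refers to. For equal-mass 1-D
    profiles this value equals the Earth Mover's distance.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def l1_rps_kernel(p, q, sigma=1.0):
    """Hypothetical kernel built on the distance (sigma is an assumed bandwidth)."""
    return math.exp(-l1_rps_distance(p, q) / sigma)


# Equal mass: moving one unit of mass across two bins costs 2.
print(l1_rps_distance([1, 0, 0], [0, 0, 1]))
# Unequal mass: the last prefix-sum term adds the total mass difference.
print(l1_rps_distance([2, 0, 0], [0, 0, 1]))
```

Unlike a bin-to-bin distance, this measure grows with how far a peak is shifted: the Euclidean distance from `[1, 0, 0]` to `[0, 1, 0]` and to `[0, 0, 1]` is identical, while the prefix-sum distance is 1 and 2 respectively.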

Page 14: Emerging causal inference problems in molecular systems biology

Data Preprocessing: Profile clustering

To reduce noise, we use the cluster centers of the input data, rather than individual genes, as the training data for the BN learning algorithm.
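A minimal sketch of this preprocessing idea, using plain Lloyd-style k-means as a stand-in (it is not the Super k-means algorithm from the talk, whose details are not given here). The function name and the toy data are hypothetical; each returned center would serve as one training sample in place of the genes assigned to it.

```python
def kmeans_centers(points, centers, n_iter=20):
    """Plain k-means: return cluster centers to use as denoised training rows.

    points  -- list of per-gene feature vectors
    centers -- initial centers (passed in explicitly to keep the sketch deterministic)
    """
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each gene to its nearest center (squared Euclidean distance).
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


genes = [[0, 0], [0, 2], [10, 10], [10, 12]]
training_rows = kmeans_centers(genes, centers=[[0, 0], [10, 10]])
print(training_rows)
```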

Page 15: Emerging causal inference problems in molecular systems biology

Super k-means vs. k-means++ / Cluster 3.0

We propose the Super k-means algorithm to perform clustering, which yields tighter clusters than the k-means algorithm (in Cluster 3.0) and the k-means++ algorithm.

Better clustering quality is necessary for a good final BN learning result.

Page 16: Emerging causal inference problems in molecular systems biology

The consensus PDAG network with feedbacks

We relax the acyclic constraint and perform an additional structure search after BN learning to find potential feedback edges (as in learning a dependency network), since feedbacks are important and ubiquitous in biology.

Human Embryonic Stem Cells

Page 17: Emerging causal inference problems in molecular systems biology

Perfect ROC in Cross Validation

Page 18: Emerging causal inference problems in molecular systems biology

ROC of alternative approaches

Page 19: Emerging causal inference problems in molecular systems biology

Alternative clustering approaches for preprocessing

Cluster 3.0

Affinity Propagation

Page 20: Emerging causal inference problems in molecular systems biology

Alternative Kernels for BN learning

Page 21: Emerging causal inference problems in molecular systems biology

CD4+ T Cell network

Page 22: Emerging causal inference problems in molecular systems biology

Mouse ESC network

Page 23: Emerging causal inference problems in molecular systems biology

The proposed hub role of H3K4me3 in ESCs

Page 24: Emerging causal inference problems in molecular systems biology

Functional Dissection of Regulatory Models Using Gene Expression Data of Deletion Mutants

J Li, Y Liu et al., PLoS Genetics (2013)

Page 25: Emerging causal inference problems in molecular systems biology

Gene Expression Data of Deletion Mutants

In this table, each column represents a deletion mutant strain, and each row records the expression changes of a specific gene: ‘1’ means up-regulation, ‘-1’ means down-regulation and ‘0’ means no specific change.

Page 26: Emerging causal inference problems in molecular systems biology

Inferring Genetic Regulatory Networks

Our goal is to infer a genetic regulatory network among the deletion-mutant genes.

However, traditional Bayesian network learning approaches failed.

Why?

Because the dominant value in the deletion-mutant gene expression data set is ‘0’, which occurs orders of magnitude more often than ‘1’ and ‘-1’.

With traditional BN learning metrics such as K2, BDeu and BIC/MDL, the huge intra-similarities between ‘0’s overwhelm the true regulatory signals.

Page 27: Emerging causal inference problems in molecular systems biology

The DM_BN Kernel

To overcome this problem, we resort to kernel-based BN learning.

To this end, we propose the DM_BN kernel.

The key insight is to block the intra-similarities between ‘0’s:
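The DM_BN kernel formula is not reproduced in this transcript. The sketch below is only a hypothetical illustration of the stated idea, not the authors' kernel: it contrasts a plain match-counting similarity, where shared ‘0’s dominate, with a variant in which a pair of ‘0’s contributes nothing. Both function names and the toy profiles are assumptions.

```python
def delta_kernel(x, y):
    """Count positions where two ternary profiles agree (0-0 matches included)."""
    return sum(1 for a, b in zip(x, y) if a == b)


def zero_blocked_kernel(x, y):
    """Count agreeing positions, but ignore matches between two '0' entries.

    Hypothetical stand-in for the idea of blocking 0-0 similarity.
    """
    return sum(1 for a, b in zip(x, y) if a == b and a != 0)


# Two profiles that are mostly '0' but respond in opposite directions.
g1 = [0] * 97 + [1, -1, 1]
g2 = [0] * 97 + [-1, 1, -1]

print(delta_kernel(g1, g2))         # dominated by the shared zeros
print(zero_blocked_kernel(g1, g2))  # the opposite responses are no longer masked
```

With the plain count, the two profiles look almost identical even though every non-zero response disagrees; once 0-0 matches are blocked, only agreement on actual up- or down-regulation contributes.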

Page 28: Emerging causal inference problems in molecular systems biology

Incorporating a priori causal information

We also use a template matrix to incorporate the a priori knowledge from deletion-mutant experiments into BN learning.

If Gene B is in the (influence) target list of Gene A, but not the reverse, we set (i, j) = 1 and (j, i) = 0 in the template matrix (with i indexing Gene A and j indexing Gene B) to prohibit the appearance of B->A in the BN.

In this way, the template matrix constrains the set of plausible edges in a DAG.

Finally, to convert a DAG to a PDAG after BN learning, we must resort to Meek’s rules [Meek, 1995] to judge the reversibility of each edge, rather than Chickering’s algorithm [Chickering, 1995].
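The template rule described above can be sketched as follows. This is an illustration under stated assumptions: `build_template` and the toy target lists are hypothetical, row and column indices follow the order of `genes`, and every off-diagonal entry defaults to 1 (edge allowed) unless the one-directional rule applies. How the original work treats mutual targets or unrelated pairs is not stated in the slides.

```python
def build_template(genes, targets):
    """Build a 0/1 template matrix; template[i][j] == 1 means edge i -> j is allowed.

    genes   -- ordered list of gene names
    targets -- dict mapping a deleted gene to the set of genes it influences
    """
    idx = {g: k for k, g in enumerate(genes)}
    n = len(genes)
    # Assumption: all edges allowed by default, self-loops excluded.
    template = [[0 if r == c else 1 for c in range(n)] for r in range(n)]
    for a in genes:
        for b in targets.get(a, ()):
            if b == a or b not in idx:
                continue
            # B is a target of A but A is not a target of B: forbid B -> A.
            if a not in targets.get(b, ()):
                i, j = idx[a], idx[b]
                template[i][j] = 1
                template[j][i] = 0
    return template


genes = ["A", "B", "C"]
targets = {"A": {"B"}, "B": {"C"}, "C": {"B"}}
for row in build_template(genes, targets):
    print(row)
```

In this toy input, A influences B but not the reverse, so B->A is prohibited; B and C influence each other, so both directions stay available to the structure search.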

Page 29: Emerging causal inference problems in molecular systems biology

High quality of the networks inferred by DM_BN

Page 30: Emerging causal inference problems in molecular systems biology

Correctness of edge directions with/without using templates

Without the template matrix, the DM_BN kernel achieves ~80% accuracy in the de novo inference of edge directions, which is significantly better than random guessing.

Page 31: Emerging causal inference problems in molecular systems biology

The inferred Yeast regulatory network

Online acyclicity checking is implemented to enable learning large networks.
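The slides do not say how the online check is implemented. One common approach, sketched here as an assumption, is to test before each edge insertion whether the new edge's head can already reach its tail; if so, the edge would close a cycle and is rejected. The names `creates_cycle` and `try_add_edge` are hypothetical.

```python
def creates_cycle(adj, u, v):
    """Return True if adding edge u -> v would create a directed cycle.

    adj maps each node to the set of its children. The edge closes a cycle
    exactly when u is already reachable from v (iterative depth-first search).
    """
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False


def try_add_edge(adj, u, v):
    """Add u -> v only if the graph stays acyclic; report whether it was added."""
    if creates_cycle(adj, u, v):
        return False
    adj.setdefault(u, set()).add(v)
    return True


graph = {}
print(try_add_edge(graph, "A", "B"))
print(try_add_edge(graph, "B", "C"))
print(try_add_edge(graph, "C", "A"))  # rejected: would close A -> B -> C -> A
```

Checking at insertion time avoids re-validating the whole graph after every search move, which matters as the number of candidate edges grows.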

Page 32: Emerging causal inference problems in molecular systems biology

Integrating Genomic, Epigenomic, and Transcriptomic Features Reveals Modular Signatures Underlying Poor Prognosis in Ovarian Cancer

Thanks to Dr. Wei Zhang for contributing to the slides on this topic.

W Zhang, Y Liu et al., Cell Reports (2013)

Page 33: Emerging causal inference problems in molecular systems biology

The Cancer Genome Atlas (TCGA)

http://cancergenome.nih.gov/

Page 34: Emerging causal inference problems in molecular systems biology

Summary of the Ovarian cancer data in TCGA

Page 35: Emerging causal inference problems in molecular systems biology

Summary of the Ovarian cancer data in TCGA

The copy number segmentation data were mapped to the positions of genes and miRNAs.

Normalization: $\mathrm{Value}_{\mathrm{norm}} = (\mathrm{Value}_{\mathrm{raw}} - \mathrm{Median}_{\mathrm{controls}}) / \mathrm{STD}_{\mathrm{patients}}$
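A small sketch of this normalization for a single feature. The function name and the toy numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the slide does not say which estimator was used.

```python
from statistics import median, stdev


def normalize_feature(patient_values, control_values):
    """Center patient values on the control median and scale by the patient STD."""
    center = median(control_values)
    scale = stdev(patient_values)  # assumption: sample standard deviation
    return [(v - center) / scale for v in patient_values]


print(normalize_feature([2, 4, 6], [1, 2, 3]))
```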

Page 36: Emerging causal inference problems in molecular systems biology

Scientific Questions

By combining the clinical and heterogeneous high-throughput data, can we discover Ovarian cancer subtypes whose outcomes are different?

Can we find active regulatory pathways in the subtypes that could explain their different prognoses?

Page 37: Emerging causal inference problems in molecular systems biology

Selecting the Ovarian Cancer Hazard Factors

To investigate which features are related to the prognosis of ovarian cancer, we first used the Cox proportional hazards model to regress the patients’ survival time on each feature.

In total we selected 4,526 features as hazard factors (P < 0.05), including 1,651 genes’ expression changes, 455 genes’ promoter DNA methylation changes, 140 miRNAs’ expression changes, and the CNAs of 2,191 genes and 89 miRNAs.
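For reference, the per-feature selection step can be written in the standard form of the Cox proportional hazards model. The notation below ($x_j$ for the value of feature $j$, $\beta_j$ for its coefficient, $h_0$ for the baseline hazard) is introduced here for illustration, and the slides do not state which test statistic produced the P values.

```latex
% Univariate Cox model for feature j: hazard at time t given feature value x_j
h(t \mid x_j) = h_0(t)\,\exp(\beta_j x_j)

% Hazard ratio per unit increase of x_j
\mathrm{HR}_j = \exp(\beta_j)

% Feature j is kept as a hazard factor when the null hypothesis
% H_0 : \beta_j = 0 is rejected at P < 0.05
```

A positive $\beta_j$ means higher values of the feature are associated with a higher hazard, that is, with shorter survival.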

Page 38: Emerging causal inference problems in molecular systems biology

De novo discovery of ovarian cancer subtypes by adaptive clustering

Page 39: Emerging causal inference problems in molecular systems biology

Signatures of the 7 subtypes of Ovarian Cancer

These signatures were identified using the Wilcoxon rank-sum test.
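As a rough illustration of the test named above, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation and no tie correction of the variance. The function name and the toy values are hypothetical, and the slides do not say how the test was configured or how multiple testing was handled.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, average ranks for ties).

    Returns (z, p). x could hold a feature's values inside one subtype and
    y the values in all other patients.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(z, p)
```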

Page 40: Emerging causal inference problems in molecular systems biology

Enriched terms of subtype 2-specific up-regulated genes

These terms, such as cell adhesion, TGF-beta binding, angiogenesis and positive regulation of cell proliferation, are related to tumorigenesis and metastasis.

Page 41: Emerging causal inference problems in molecular systems biology

Comparing the survival curves between subtype 2 and stage IV patients

The 5-year survival rate of subtype 2 was even worse than that of tumor stage IV patients.

Page 42: Emerging causal inference problems in molecular systems biology

The cancer knowledge base

• Pathways in cancer
• Telomere maintenance
• Inflammatory response
• MAPK signaling pathway
• VEGF signaling pathway
• Glycolysis / Gluconeogenesis
• mTOR signaling pathway
• Wnt signaling pathway
• T cell receptor signaling pathway
• ErbB signaling pathway
• ECM-receptor interaction
• B cell receptor signaling pathway
• Jak-STAT signaling pathway
• Adherens junction
• Natural killer cell mediated cytotoxicity
• Cytokine-cytokine receptor interaction
• Focal adhesion
• Cell cycle
• p53 signaling pathway
• PPAR signaling pathway
• Base excision repair
• TGF-beta signaling pathway
• Mismatch repair
• Apoptosis
• Nucleotide excision repair

[Figure: The hallmarks of cancer (Hanahan & Weinberg 2011)]

The knowledge base is used to filter out signature genes that are not drivers of cancer.

Page 43: Emerging causal inference problems in molecular systems biology

The interaction network of signature genes

Page 44: Emerging causal inference problems in molecular systems biology

THANKS

• Q & A?