Network Biology- part III
Jun Zhu, Ph. D.
Professor of Genomics and Genetic Sciences
Icahn Institute of Genomics and Multi-scale Biology
The Tisch Cancer Institute
Icahn Medical School at Mount Sinai
New York, NY
@IcahnInstitute
http://research.mssm.edu/integrative-network-biology/
Email: jun.zhu@mssm.edu
Mount Sinai Hospital and Mount Sinai
School of Medicine
in New York City
• Hospital
• Founded in 1852 as a Jewish hospital; one of the oldest and largest teaching hospitals in the US
• Located right next to Central Park in Manhattan
• Is the largest hospital in New York City
• Ranked 16th best hospital in 2015
• Medical School
• Founded in 1963
• Merged with New York University in 1998 as Mount Sinai-NYU medical center
• Independent in 2010
• Ranked 18th best medical school
Mount Sinai Health System
▶ 6,200 physicians, >2,000 residents and
fellows
▶ 36,000 employees
▶ 169,532 inpatient admissions
▶ More than 2,600,000 outpatient visits to
offices and clinics (non-Emergency
Department)
▶ 489,508 Emergency Department visits
▶ 18,000 babies delivered a year
Goals of the workshop
▶ NOT to teach you how to use one method or one
program
▶ Learn from history
▶ Learn about critical thinking
– What do you want to achieve?
– What do you need to achieve the goal?
– How do you abstract a biological problem into a mathematical problem?
– What are the underlying assumptions and problems?
Why is it so hard to model biological systems?
▶ The more we learn, the more complicated it becomes!
Post transcriptional regulation
• Splicing (1981)
• RNA editing (1986)
• miRNA mediated regulation (1993)
Post translational regulation
• Phosphorylation
• Glycosylation
• Acetylation
It is not one gene to one protein anymore!
Epigenetic regulation: heritable changes in gene function that cannot be explained by changes in DNA sequence
• DNA methylation
• Chromatin structure
Junk DNA?
Biological networks/pathways
[Figure: spectrum of models: gene sets, association networks, probabilistic causal networks, mechanism-based models; axes: biological details revealed, data required to train models]
Observation -> description -> explanation -> prediction
Association network: Connection matrix
▶ Is one gene statistically
associated with another
gene?
▶ A binary symmetric matrix
▶ 1 (red) – two genes are
associated
▶ 0 (black) – two genes are
not associated
▶ Diagonal = 1: genes are
always associated with
themselves!
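A minimal sketch of how such a connection matrix could be built, assuming association is declared by thresholding the absolute Pearson correlation. The function name `connection_matrix` and the cutoff value are illustrative, not taken from the slides.

```python
import numpy as np

def connection_matrix(expr, cutoff):
    """Binary symmetric connection matrix from a samples x genes array.

    `cutoff` is a hypothetical threshold on |Pearson correlation|.
    """
    corr = np.corrcoef(expr, rowvar=False)        # genes x genes correlation
    conn = (np.abs(corr) >= cutoff).astype(int)   # 1 = associated, 0 = not
    np.fill_diagonal(conn, 1)                     # a gene is always associated with itself
    return conn

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
# genes 0 and 1 share a common signal; gene 2 is independent noise
expr = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
                  base + 0.1 * rng.normal(size=(50, 1)),
                  rng.normal(size=(50, 1))])
conn = connection_matrix(expr, cutoff=0.8)
print(conn)
```

The result is symmetric with ones on the diagonal, matching the properties listed above.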
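Topological overlap matrix and corresponding dissimilarity (Ravasz et al 2002)
\[ TOM_{ij} = \frac{k_i \cap k_j}{\min(k_i, k_j) + 1}, \qquad DistTOM_{ij} = 1 - TOM_{ij} \]
(k_i ∩ k_j: neighbors shared by node i and node j)
[Figure: Node i and Node j]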
Association by gene expression correlation
▶ How strong should the correlation of mRNA expression levels be?
– The p-value cutoff for correlation: Bonferroni correction?
• Assuming two expression levels are independent
– FDR (False Discovery Rate) by permutation
• No explicit assumption
• Data set specific
\[ FDR = \frac{\text{false positives}}{\text{total positives detected}} \]
Selecting the threshold for Gene-Gene Correlation (GGC) of 25,000 genes on a microarray chip:
p-value <   total positives   false positives        FDR
            (from data)       (from permuted data)
1e-10       40245988          1079                   2.68e-5
1e-15       22475531          192                    8.54e-6
1e-20       13755681          38                     2.76e-6
At p-value < 1e-20 there are only 38 false positives, so no module was detected in the permuted data.
p-value < 1e-20 was therefore chosen as the threshold.
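The permutation estimate can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the table: it thresholds |r| instead of the p-value (for a fixed sample size the two are monotonically related), and the names `count_pairs_above` and `permutation_fdr` are hypothetical.

```python
import numpy as np

def count_pairs_above(expr, r_cutoff):
    """Number of gene pairs (i < j) with |Pearson r| >= r_cutoff."""
    corr = np.corrcoef(expr, rowvar=False)
    iu = np.triu_indices_from(corr, k=1)
    return int(np.sum(np.abs(corr[iu]) >= r_cutoff))

def permutation_fdr(expr, r_cutoff, n_perm=10, seed=0):
    """FDR = mean false positives in permuted data / total positives in real data."""
    rng = np.random.default_rng(seed)
    total = count_pairs_above(expr, r_cutoff)
    false_counts = []
    for _ in range(n_perm):
        # shuffle each gene across samples independently: this breaks
        # gene-gene correlation but keeps each gene's own distribution
        permuted = np.column_stack([rng.permutation(col) for col in expr.T])
        false_counts.append(count_pairs_above(permuted, r_cutoff))
    false_pos = float(np.mean(false_counts))
    fdr = false_pos / total if total > 0 else float("nan")
    return total, false_pos, fdr

rng = np.random.default_rng(1)
signal = rng.normal(size=(40, 1))
# 5 co-expressed genes plus 15 independent noise genes, 40 samples
expr = np.hstack([signal + 0.3 * rng.normal(size=(40, 5)),
                  rng.normal(size=(40, 15))])
total, false_pos, fdr = permutation_fdr(expr, r_cutoff=0.7)
print(total, false_pos, fdr)
```

No distributional assumption enters the estimate, but the result is specific to the data set that was permuted.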
Weighted coexpression networks
[Figure: unsigned network vs. signed network]
Zhang & Horvath SAGMB, 2005
Two types of weighted correlation networks
Unsigned network, absolute value:
\[ a_{ij} = |cor(x_i, x_j)|^{\beta} \]
Signed network, preserves sign info:
\[ a_{ij} = |0.5 + 0.5\, cor(x_i, x_j)|^{\beta} \]
Default values: β=6 for unsigned and β =12 for signed
networks.
Zhang & Horvath SAGMB, 2005
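A small sketch of the two soft-thresholding adjacency functions, using the default β values quoted above. The function names and the toy correlation matrix are illustrative.

```python
import numpy as np

def unsigned_adjacency(corr, beta=6):
    """a_ij = |cor(x_i, x_j)|^beta: positive and negative correlations score alike."""
    return np.abs(corr) ** beta

def signed_adjacency(corr, beta=12):
    """a_ij = |0.5 + 0.5 * cor(x_i, x_j)|^beta: negative correlations map near 0."""
    return np.abs(0.5 + 0.5 * corr) ** beta

# toy correlation matrix: gene 0 is positively correlated with gene 1
# and negatively correlated (same magnitude) with gene 2
corr = np.array([[1.0, 0.6, -0.6],
                 [0.6, 1.0, 0.0],
                 [-0.6, 0.0, 1.0]])
a_unsigned = unsigned_adjacency(corr)
a_signed = signed_adjacency(corr)
print(a_unsigned[0, 1], a_unsigned[0, 2])  # equal
print(a_signed[0, 1], a_signed[0, 2])      # the negative pair is much weaker
```

In the unsigned network the two pairs get the same connection strength; in the signed network the negatively correlated pair is effectively disconnected.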
Generalized Connectivity
▶ Gene connectivity = row sum of the adjacency
matrix
– For unweighted networks=number of direct neighbors
– For weighted networks= sum of connection strengths to
other nodes
\[ k_i = \sum_j a_{ij} \]
Generalized Topological overlap matrix and corresponding dissimilarity
\[ TOM_{ij} = \frac{\sum_u a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}} \]
\[ DistTOM_{ij} = 1 - TOM_{ij} \]
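The generalized connectivity and topological overlap can be computed in a few lines. This sketch assumes self-connections are excluded from the sums (the diagonal is zeroed before computing k and the shared-neighbor term) and sets the diagonal of TOM to 1; both are conventions chosen here, and `tom_dissimilarity` is a hypothetical name.

```python
import numpy as np

def tom_dissimilarity(adj):
    """Return (TOM, DistTOM) for a symmetric adjacency matrix with entries in [0, 1]."""
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)                  # assumption: no self-connections in the sums
    k = a.sum(axis=1)                         # connectivity k_i = sum_j a_ij
    shared = a @ a                            # sum_u a_iu * a_uj
    denom = np.minimum.outer(k, k) + 1.0 - a  # min(k_i, k_j) + 1 - a_ij
    tom = (shared + a) / denom
    np.fill_diagonal(tom, 1.0)
    return tom, 1.0 - tom

# unweighted toy network: nodes 0, 1, 2 form a triangle; node 3 hangs off node 2
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
tom, dist_tom = tom_dissimilarity(adj)
print(tom)
```

Nodes 0 and 3 are not directly connected, yet they receive a non-zero overlap because they share neighbor 2.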
Soft thresholding vs. hard thresholding?
Pros of soft thresholding:
1. Preserves the continuous nature of the co-expression information
2. Results tend to be more robust with regard to different threshold choices
Cons:
Hard thresholding has its own advantages. In particular, graph-theoretic algorithms from the computer science community can be applied to the resulting networks
Making Sense of These Associations
▶ Do a set of genes
connect to each other
similarly?
▶ Hierarchical clustering
reorders the matrix so
that patterns emerge
How to identify modules in an ordered
connection matrix
▶ Identify the largest set of genes
▶ that is most coherent (genes connect to each other)
4267 top genes in BxH liver female rescan qtl overlap (num(p(GGC)<1e-15)>100 ~abs(cor)>0.5886)
\[ Coherence = \frac{GP_{obs}}{GP_{tot}} \]
Lum et al, 2005
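One way to read the coherence score is as the fraction of gene pairs inside a candidate module that are actually connected. The sketch below assumes that reading (GP_obs = connected gene pairs observed in the module, GP_tot = all possible gene pairs in the module); the function name is hypothetical.

```python
import numpy as np

def module_coherence(conn, members):
    """Fraction of gene pairs inside a module that are connected."""
    idx = np.asarray(members)
    sub = np.asarray(conn)[np.ix_(idx, idx)]
    iu = np.triu_indices(len(idx), k=1)   # each unordered pair once, diagonal excluded
    gp_tot = len(iu[0])
    gp_obs = int(np.sum(sub[iu] != 0))
    return gp_obs / gp_tot if gp_tot else float("nan")

# binary connection matrix: genes 0, 1, 2 are fully connected; gene 3 connects only to gene 2
conn = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [1, 1, 1, 1],
                 [0, 0, 1, 1]])
tight = module_coherence(conn, [0, 1, 2])
loose = module_coherence(conn, [0, 1, 3])
print(tight, loose)
```

A module search can then trade off size against coherence, as in the two bullets above.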
Biological networks/pathways
[Figure: model spectrum, highlighting gene sets and association networks; axes: biological details revealed, data required to train models]
1. Do they enrich for a
biological function?
2. Do they overlap with any
signatures?
3. Do they correlate with
clinical traits?
4. Do they link to a QTL?
5. Do they enrich for any
transcription factor binding
sites?
[Figure: ordered connection matrix of the 4267 genes (axis 1000 to 4000) with modules numbered 1 to 8, annotated with enriched GO terms:]
• 'acyl-CoA binding'
• 'chromatin remodeling complex'
• 'respiratory chain complex I'
• 'ribosome'
• 'fibroblast growth factor receptor binding'
• 'hormone activity'
• 'positive regulation of phosphorylation'
• 'glucosyltransferase activity'
• 'bile acid metabolism'
• 'carboxy-lyase activity'
• 'cell-matrix junction'
• 'B-cell mediated immunity'
• 'regulation of immune response'
Using the singular value decomposition to define (module) eigengenes
Scale the gene expression profiles (columns):
\[ datX = scale(datX) \]
\[ datX = U D V^{T} \]
\[ U = (u_1 \; u_2 \; \cdots \; u_m) \]
\[ V = (v_1 \; v_2 \; \cdots \; v_m) \]
\[ D = \mathrm{diag}(|d_1|, |d_2|, \ldots, |d_m|) \]
Message: u_1 is the (first) eigengene E
If datX corresponds to the q-th module, then E^{(q)} is the q-th module eigengene.
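The eigengene computation can be sketched with NumPy's SVD, assuming datX is a samples x genes array. The sign alignment step and the returned variance fraction are additions for illustration, and `module_eigengene` is a hypothetical name.

```python
import numpy as np

def module_eigengene(dat_x):
    """First left singular vector of the column-scaled samples x genes matrix."""
    x = np.asarray(dat_x, dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # scale each gene (column)
    u, d, vt = np.linalg.svd(x, full_matrices=False)   # x = U diag(d) V^T
    eigengene = u[:, 0]
    # singular vectors have arbitrary sign; orient the eigengene so that it
    # correlates positively with the average scaled expression
    if np.corrcoef(eigengene, x.mean(axis=1))[0, 1] < 0:
        eigengene = -eigengene
    var_explained = d[0] ** 2 / np.sum(d ** 2)
    return eigengene, var_explained

rng = np.random.default_rng(2)
signal = rng.normal(size=30)
# toy module: 6 genes that all follow the same signal plus noise, 30 samples
dat_x = signal[:, None] + 0.3 * rng.normal(size=(30, 6))
eigengene, var_explained = module_eigengene(dat_x)
print(var_explained)
```

For a tightly co-expressed module the first eigengene captures most of the variance, which is why it can stand in for the whole module.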
Module eigengenes are very useful
▶ 1) They allow one to relate modules to each other
– Allows one to determine whether modules should be
merged
– Or to define eigengene networks
▶ 2) They allow one to relate modules to clinical traits
and SNPs
– This avoids the multiple comparison problem
▶ 3) They allow one to define a measure of module
membership: kME=cor(x,ME)
Bin Zhang
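Module membership, kME = cor(x, ME), is a plain Pearson correlation between each gene's profile and the module eigengene. A minimal sketch, assuming a samples x genes array and an eigengene vector of matching length; the toy eigengene here is a stand-in, not one computed from real data.

```python
import numpy as np

def module_membership(dat_x, eigengene):
    """kME for every gene: Pearson correlation of its profile with the eigengene."""
    x = np.asarray(dat_x, dtype=float)
    e = np.asarray(eigengene, dtype=float)
    xc = x - x.mean(axis=0)      # center each gene
    ec = e - e.mean()            # center the eigengene
    return (xc.T @ ec) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(ec))

rng = np.random.default_rng(3)
me = rng.normal(size=40)                               # stand-in module eigengene
dat_x = np.column_stack([me + 0.2 * rng.normal(size=40),    # follows the module
                         -me + 0.2 * rng.normal(size=40),   # anti-correlated
                         rng.normal(size=40)])              # unrelated gene
kme = module_membership(dat_x, me)
print(kme)
```

Genes with |kME| close to 1 are the most representative members of the module.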
ARACNE
▶ Reverse engineering of regulatory networks in human B cells.
Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A.
Nat Genet. 2005 Apr;37(4):382-90. Epub 2005 Mar 20.
▶ ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context.
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A.
BMC Bioinformatics. 2006 Mar 20;7 Suppl 1:S7.
ARACNE: ranking mutual information
Spectral clustering
▶ Useful for sparse network
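Adjacency matrix: \( A_{i,j} \)
Diagonal matrix: \( D_{i,i} = \sum_j A_{i,j} \)
Laplacian matrix: \( P = D^{-1} A \)
Cut along the eigenvector corresponding to the largest non-trivial eigenvalue of P (the leading eigenvalue, 1, has a constant eigenvector)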
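A sketch of a two-way spectral cut with P = D^{-1} A. It assumes every node has at least one edge, and it cuts on the eigenvector of the second-largest eigenvalue, because the largest eigenvalue of P is always 1 with a constant eigenvector that carries no cluster information. The symmetric form D^{-1/2} A D^{-1/2} is used only for numerical convenience; it has the same eigenvalues as P.

```python
import numpy as np

def spectral_bipartition(adj):
    """Split nodes into two groups by the sign of the second eigenvector of D^-1 A."""
    a = np.asarray(adj, dtype=float)
    deg = a.sum(axis=1)                    # D_ii = sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)        # assumes every node has degree > 0
    # D^{-1/2} A D^{-1/2} is symmetric and shares its eigenvalues with P = D^{-1} A
    sym = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(sym)       # eigenvalues in ascending order
    v2 = d_inv_sqrt * vecs[:, -2]          # eigenvector of P for the 2nd-largest eigenvalue
    return (v2 > 0).astype(int), vals[-1]

# two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
labels, top_eigenvalue = spectral_bipartition(adj)
print(labels, top_eigenvalue)
```

On this sparse toy network the cut separates the two triangles; which side gets label 0 or 1 is arbitrary.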
Meta-analysis and comparison of association
networks
▶ Integrating multiple data sets into one network
▶ Integrating multiple networks into one network
What Is Common Among Them?
How Do They Differ?
Kai Wang
Mouse and Rat Are Commonly Used Animal
Models in Studying Human Diseases
▶ Understanding their conserved mechanisms is important in predicting whether drug targets identified in mouse and rat will achieve efficacy in humans
▶ Identifying mechanisms that differ among them can help improve the design and interpretation of toxicity studies that involve rodent models
▶ Liver is an important organ for glucose and lipid metabolism, as well as for metabolizing toxic compounds
▶ Gene expression data can be organized into co-expression networks that can shed light on the functional relationship between genes
Wang et al, PLoS Comp Bio., 2009
Existing Methods in the Literature
▶ Meta-analysis approaches
– Parametric meta-analysis
• Combining p-values (Fisher's inverse χ² test)
• Fisher-Z statistics (Hedges & Olkin,1985; Rosenthal & Rubin,
1978; DerSimonian & Laird, 1986)
– Non-parametric: order statistics (Stuart et al, 2003)
▶ Network alignment approaches (Kelley et al, 2003, Berg 2006, etc)
– Sub-graph based vs. gene-pair based
– Search for specific network structure of interest
Wang et al, PLoS Comp Bio., 2009
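As a reference point for the parametric approaches listed above, Fisher's inverse χ² test combines K independent p-values through the statistic -2 Σ ln p_k, which follows a χ² distribution with 2K degrees of freedom under the null. The sketch below uses only the standard library and the closed-form χ² tail for even degrees of freedom; it assumes all p-values are strictly positive, and the function name is hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method: returns (chi-square statistic, combined p-value)."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)   # ~ chi-square with 2k df under H0
    half = stat / 2.0
    # closed-form chi-square survival function for even degrees of freedom 2k
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, tail

# the same gene pair tested in three data sets (illustrative p-values)
stat, combined = fisher_combined_pvalue([0.04, 0.10, 0.03])
print(stat, combined)
```

Moderate evidence that repeats across data sets combines into a smaller p-value than any single data set gives.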
Proposed Semi-nonparametric
Approach: d-statistics
▶ Define the effect size, d, as the correlation coefficient normalized against the gene-centric mean correlation
– Gene context specific
– Fewer assumptions needed
– Mean effect size can be defined
– Heterogeneity statistics can be used
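(1) For gene pair i, j in dataset k:
\[ d_{ijk} = \frac{r_{ijk} - \mu_{r_{i \cdot k}}}{\sigma_{r_{i \cdot k}}} \sim N(0, 1) \]
where \( \mu_{r_{i \cdot k}} \) and \( \sigma_{r_{i \cdot k}} \) are the mean and standard deviation of the correlations of gene i in dataset k
(2) Mean effect
\[ \bar{d}_{ij} = \frac{1}{K} \sum_{k=1}^{K} d_{ijk}, \qquad \sigma^2_{\bar{d}_{ij}} = \frac{1}{K} \]
(3) Statistical significance
\[ g_{ij} = \frac{\bar{d}_{ij}}{\sigma_{\bar{d}_{ij}}} = \bar{d}_{ij} \sqrt{K} \sim N(0, 1) \]
(4) Homogeneity
\[ Q_{ij} = \sum_{k=1}^{K} (d_{ijk} - \bar{d}_{ij})^2 \sim \chi^2_{K-1} \]
[Figure: distribution of the context of gene A, marking gene pair AB and its effect size d_AB]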
Wang et al, PLoS Comp Bio., 2009
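The four steps of the d-statistics approach can be sketched as below. This is an illustration, not the published implementation: it assumes every dataset is a samples x genes array with the same gene order, takes each gene's "context" to be its correlations with all other genes in that dataset, and leaves out the conversion of g and Q to p-values.

```python
import numpy as np

def d_statistics(datasets):
    """Return (mean effect size, g statistic, Q statistic) for every gene pair."""
    d_all = []
    for expr in datasets:
        r = np.corrcoef(expr, rowvar=False)
        np.fill_diagonal(r, np.nan)                       # ignore self-correlation
        mu = np.nanmean(r, axis=1, keepdims=True)         # gene-centric mean correlation
        sd = np.nanstd(r, axis=1, ddof=1, keepdims=True)  # gene-centric spread
        d_all.append((r - mu) / sd)                       # d_ijk in the context of gene i
    d = np.stack(d_all)                                   # K x genes x genes
    k = d.shape[0]
    d_mean = d.mean(axis=0)                               # (2) mean effect size
    g = d_mean * np.sqrt(k)                               # (3) compare with N(0, 1)
    q = ((d - d_mean) ** 2).sum(axis=0)                   # (4) compare with chi-square, K-1 df
    return d_mean, g, q

rng = np.random.default_rng(4)
datasets = []
for _ in range(3):                                        # K = 3 toy datasets
    expr = rng.normal(size=(60, 8))
    expr[:, 1] = expr[:, 0] + 0.3 * rng.normal(size=60)   # genes 0 and 1 co-expressed everywhere
    datasets.append(expr)
d_mean, g, q = d_statistics(datasets)
print(g[0, 1], q[0, 1])
```

Because the normalization is gene-centric, d_ijk is not symmetric in i and j: the same correlation can be remarkable in the context of gene i and ordinary in the context of gene j.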
Meta-analysis Procedures
▶ Compute GGC in each single dataset
▶ Gene-specific d-transformation
▶ Compute mean effect size
▶ Homogeneity?
– No: differential interactions -> dataset specificity
– Yes: Stat. significance? -> conserved interactions
Wang et al, PLoS Comp Bio., 2009
Methods Comparison
▶ Similar results were also obtained using KEGG pathways
[Figure: % predicted pairs sharing GO annotation (y-axis, 0 to 0.7) vs. # predicted pairs (x-axis, 10^1 to 10^8); curves: d-statistics, Order Statistics, Combine P-value, FEM Fisher-Z, REM Fisher-Z, Human, Mouse, Rat]
Wang et al, PLoS Comp Bio., 2009
Conserved Modules Show Better Association
with Human Lipid Traits
▶ Kathiresan et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Medical Genetics 2007, 8(Suppl 1):S17
▶ Association is defined as p-value < 0.001
▶ Genes were selected if marker is within 50kb of the gene
[Figure: % lipid-associating genes (y-axis, 0 to 0.35) for modules 1 to 20, with module size (0 to 2000) shown alongside; two modules marked p<0.001; annotated modules: carboxylic acid metabolic, translation, immune response, cell proliferation]
Wang et al, PLoS Comp Bio., 2009
Meta-analysis of Differential Interactions between
BxHwt vs. BxH ApoE-/- Mice
▶ A Proof-of-concept case for identifying network changes using the proposed method
– Identical genetic background
– Similarly raised and fed
– ApoE is the only major difference between two mouse strains
▶ 500 differentially connected genes; 1023 differential interactions
▶ Over-represented biological processes were specifically those in which ApoE is known to participate
Keyword                            P-value    E-value   Bkg set size   Bkg set count   Input set size   Input set count
cholesterol metabolic process      9.78E-10   5.6E-06   13846          98 (0.7%)       375              17 (4.5%)
cholesterol biosynthetic process   1.71E-08   9.8E-05   13846          44 (0.3%)       375              11 (2.9%)
sterol metabolic process           2.08E-08   1.2E-04   13846          119 (0.9%)      375              17 (4.5%)
sterol biosynthetic process        5.75E-08   3.3E-04   13846          49 (0.4%)       375              11 (2.9%)
lipid metabolic process            5.77E-08   3.3E-04   13846          999 (7.2%)      375              57 (15.2%)
cellular lipid metabolic process   1.39E-07   8.0E-04   13846          845 (6.1%)      375              50 (13.3%)
alcohol metabolic process          2.74E-07   1.6E-03   13846          390 (2.8%)      375              30 (8.0%)
Wang et al, PLoS Comp Bio., 2009
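Over-representation of this kind is commonly scored with a one-sided hypergeometric test on the four counts in each row. The slides do not state which test produced the table, so the sketch below is an assumption and its output should not be expected to reproduce the listed p-values exactly; `enrichment_pvalue` is a hypothetical name.

```python
from math import comb

def enrichment_pvalue(bkg_size, bkg_count, input_size, input_count):
    """P(X >= input_count) when input_size genes are drawn without replacement
    from bkg_size genes, of which bkg_count carry the annotation."""
    upper = min(bkg_count, input_size)
    num = sum(comb(bkg_count, x) * comb(bkg_size - bkg_count, input_size - x)
              for x in range(input_count, upper + 1))
    return num / comb(bkg_size, input_size)   # exact integers, one final division

# counts from the first row: cholesterol metabolic process
p_chol = enrichment_pvalue(13846, 98, 375, 17)
print(p_chol)
```

With about 0.7% of background genes annotated, roughly 2.7 of the 375 input genes would be expected by chance, against the 17 observed.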
Differentially Connected Genes are Enriched
in Known ApoE Subnetwork
▶ Protein-protein and protein-DNA interactions curated from databases and literature
▶ Differentially connected genes are significantly enriched in the immediate physical network around ApoE (4/21, p < 1E-4), and are still marginally enriched when second neighbors are included (12/356, p < 0.06)
Wang et al, PLoS Comp Bio., 2009
Identification of Differential Interactions Between
Human and Rodent Species
▶ Assume the systems being compared are thoroughly perturbed
– Lack of correlation in one system is not due to lack of expression dynamics
▶ 1,171 differential interactions (among 918 orthologous genes)
▶ FDR is estimated to be < 1E-3 using permutation method
▶ 163 of the 1,171 differential interactions are human specific
Wang et al, PLoS Comp Bio., 2009
RXRG Is Identified as A Key Regulator that Differs
Between Human and Rodent Species
▶ The largest sub-network consists of 11 genes, three of them, PIP5K1B, RXRG and ACSBG1, are known to be involved in lipid metabolism
▶ RXRG (retinoid X receptor gamma) is:
– The #4 most differentially connected gene
– Involved in 8 human specific interactions, 7 of which are with other top differentially connected genes
– RXRG has previously been associated with hyperlipidemia
– RXRG is a direct upstream regulator of CETP (cholesteryl ester transfer protein), which is a human specific gene that is involved in regulating HDL cholesterol
Wang et al, PLoS Comp Bio., 2009
Evolutionary Difference between Conserved vs. Differentially Connected Genes
▶ Do differentially connected genes evolve faster than those involved only in conserved interactions?
▶ Ka/Ks: the ratio of the non-synonymous substitution rate to the synonymous substitution rate, which can be used as an indication of positive selection on a protein-coding gene
[Figure: Venn diagram of genes in conserved (3205) and different (547) interactions: 2726 conserved only, 479 both, 68 different only]
[Figure: human-mouse Ka/Ks (y-axis, 0 to 0.5) for the Conserved Only, Both and Different Only groups; p < 0.131]
Wang et al, PLoS Comp Bio., 2009
JointClustering multiple networks
Narayanan et al, PLoS Comp Bio., 2010
Performs better when two networks are different
Identify differences between association networks
▶ Define a module first
Zhang et al, Cell, 2013
Identify differences between association
networks
▶ Define a differential
connection first
▶ Make no assumption of
module structure a priori
▶ Can identify differential
connections between two
modules
Narayanan et al, Mol. Syst. Biol. 2014
Biological networks/pathways
[Figure: model spectrum, highlighting association networks and probabilistic causal networks; axes: biological details revealed, data required to train models]
1. How do genes in the same
module interact?
2. How do genes in different
modules interact?
3. Can we make causal
inferences to elucidate
signaling pathway for
disease targets?
Acknowledgements
Mount Sinai Genomics Institute: Eric Schadt, Bin Zhang, Zhidong Tu, Charles Powell, Patrizia Casaccia
Zhu lab: Seungyeul Yoo, Eunjee Lee, Li Wang, Luan Lin, Quan Long
Boston University: Avrum Spira, Joshua Campbell
U Washington: Roger Bumgarner
Berkeley: Rachel Brem
Princeton: Leonid Kruglyak
Supported by:
• Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai
• Janssen
• Canary Foundation
• Prostate Cancer Foundation
• NIH
• NCI