Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Proteomics and systems biology
Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected]
No man is an island, entire of itself; every man is a piece of the continent, a part of the main. If a clod be washed away by the sea, Europe is the less, as well as if a promontory were, as well as if a manor of thy friend's or of thine own were. Any man's death diminishes me because I am involved in mankind; and therefore never send to know for whom the bell tolls; it tolls for thee.
John Donne, Devotions upon Emergent Occasions, 1624
No man is an island
The impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects.
Barabasi et al., Nature Review Genetics, 2011 Disease
Pathways/ Networks
No gene is an island
2
Omics opportunities and challenges
Genome CNV
mutation methylation
Transcriptome expression
splicing
Proteome expression modification
Elephant Data for
individual genes
Understanding of biological systems
3
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
4
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
5
! Genomics " Genome Wide Association Study (GWAS) " Whole genome or exome sequencing
! Transcriptomics " mRNA profiling
! Microarray ! RNA-Seq
" Protein-DNA interaction ! Chromatin immunoprecipitation (ChIP)-Seq
! Proteomics " Protein profiling
! LC-MS/MS " Protein-protein interaction
! Yeast two hybrid ! Affinity pull-down/LC-MS/MS
Omics studies generate gene/protein lists
……….. ……….. ……….. ……….. ………. ………. ………. ………. ……… ………………
6
Omics studies generate gene/protein lists
Differential expression
Clustering Lists of genes or proteins with potential biological interest
log2(ratio)
-log1
0(p
valu
e)
Microarray RNA-Seq Proteomics
IPI00375843 IPI00171798 IPI00299485 IPI00009542 IPI00019568 IPI00060627 IPI00168262 IPI00082931 IPI00025084 IPI00412546 IPI00165528 IPI00043992 IPI00384992 IPI00006991 IPI00021885 …...
7
Different ways of data organization
! Gene 1 " Pathway 1 " Pathway 2
! Gene 2 " Pathway 1 " Pathway 3
! Gene 3 " Pathway 2 " Pathway 3
! Gene 4 " Pathway 3
! Pathway 1 " Gene 1 " Gene 2
! Pathway 2 " Gene 1 " Gene 3
! Pathway 3 " Gene 2 " Gene 3 " Gene 4
8
Advantages of pathway analysis
! Better interpretation " From interesting genes to interesting biological themes
! Improved robustness " Robust against noise in the data
! Improved sensitivity " Detecting minor but concordant changes in a pathway
9
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
10
! Databases " BioCarta (http://www.biocarta.com/genes/index.asp) " KEGG (http://www.genome.jp/kegg/pathway.html) " MetaCyc (http://metacyc.org) " Pathway commons (http://www.pathwaycommons.org) " Reactome (http://www.reactome.org) " STKE (http://stke.sciencemag.org/cm) " Signaling Gateway (http://www.signaling-gateway.org) " Wikipathways (http://www.wikipathways.org)
! Limitation " Limited coverage " Inconsistency among different databases " Relationship between pathways is not defined
Pathway databases
11
! Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
! Three organizing principles: molecular function, biological process, and cellular component " Dopamine receptor D2, the product of human gene DRD2 " molecular function: dopamine receptor activity " biological process: synaptic transmission " cellular component: plasma membrane
! Terms in GO are linked by several types of relationships " Is_a (e.g. plasma membrane is_a membrane) " Part_of (e.g. membrane is part_of cell) " Has part " Regulates " Occurs in
Gene Ontology
12
Gene Ontology
13
! Two types of GO annotations " Electronic annotation " Manual annotation
! All annotations must: " be attributed to a source " indicate what evidence was found to support the GO term-gene/protein
association
! Types of evidence codes " Experimental codes - IDA, IMP, IGI, IPI, IEP " Computational codes - ISS, IEA, RCA, IGC " Author statement - TAS, NAS " Other codes - IC, ND
Annotating genes using GO terms
14
Annotating genes using GO terms
…… ……
DLGAP1 discs, large (Drosophila) homolog-associated protein 1
DLGAP2 discs, large (Drosophila) homolog-associated protein 2
DNM1 dynamin 1
DOC2A double C2-like domains, alpha
DRD1 dopamine receptor D1
DRD1IP dopamine receptor D1 interacting protein
DRD2 dopamine receptor D2
DRD3 dopamine receptor D3
DRD4 dopamine receptor D4
DRD5 dopamine receptor D5
…… ……
Synaptic transmission
226 human genes
Cell-cell signaling
Child
Parent
15
! Downloads (http://www.geneontology.org) " Ontologies
! http://www.geneontology.org/page/download-ontology " Annotations
! http://www.geneontology.org/page/download-annotations
! Web-based access " AmiGO: http://www.godatabase.org " QuickGO: http://www.ebi.ac.uk/QuickGO
Access GO
16
Homo sapiens Mus musculus #term #gene #term #gene
GO/BP 6502 15228 6227 15709 GO/MF 3144 16389 2961 17287 GO/CC 947 16765 882 16801
Coverage of GO annotations
17
Other “gene sets”
! Transcription factor targets ! miRNA targets ! Protein complexes ! Cytogenetic bands ! Disease-related genes ! Drug targets ! Experimentally-derived gene sets ! ……
18
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
19
Is the observed overlap significantly larger than the expected value?
Over-representation analysis: concept
MMP9 SERPINF1 A2ML1 F2 FN1 LYZ TNXB FGG MPO FBLN1 THBS1 HDLBP GSN FBN1 CA2 P11 CCL21 FGB ……
IPI00375843 IPI00171798 IPI00299485 IPI00009542 IPI00019568 IPI00060627 IPI00168262 IPI00082931 IPI00025084 IPI00412546 IPI00165528 IPI00043992 IPI00384992 IPI00006991 IPI00021885 …...
180
Observed 22
83
1305 1305
Random
All identified proteins (1733)
Filter for significant proteins
Differentially expressed protein list
(260 proteins)
Extracellular space
(83 genes)
Compare
180 83
9.2
annotated
20
Over-representation analysis: hypergeometric test
Significant proteins
Non-significant proteins
Total
Proteins in the group k j-k j
Other proteins n-k m-n-j+k m-j Total n m-n m
Hypergeometric test: given a total of m proteins where j proteins are in the functional group, if we pick n proteins randomly, what is the probability of having k or more proteins from the group?
Zhang et.al. Nucleic Acids Res. 33:W741, 2005
€
p =
m − jn − i
#
$ %
&
' ( ji#
$ % &
' (
mn#
$ % &
' ( i= k
min(n, j )
∑ n
Observed k
j
m
21
92546_r_at 92545_f_at 96055_at 102105_f_at 102700_at ……
Over-representation analysis: WebGestalt
WebGestalt
8 organisms Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast
Microarray Probe IDs • Affymetrix • Agilent • Codelink • Illumina
Gene IDs • Gene Symbol • GenBank • Ensembl Gene • RefSeq Gene • UniGene • Entrez Gene • SGD • MGI • Flybase ID • Wormbase ID • ZFIN
Protein IDs • UniProt • IPI • RefSeq Peptide • Ensembl Peptide
196 ID types with mapping to Entrez Gene ID
59,278 functional categories with genes identified by Entrez Gene IDs
Gene Ontology • Biological Process • Molecular Function • Cellular Component
Pathway • KEGG • Pathway Commons • WikiPathways
Network module • Transcription factor targets • microRNA targets • Protein interaction modules
Disease and Drug • Disease association genes • Drug association genes
Chromosomal location • Cytogenetic bands
Genetic Variation IDs • dbSNP
http://www.webgestalt.org
Jul. 1, 2013 – Jun. 30, 2014 49,136 visits from 18,213 visitors
~200 ID types
~60K gene sets
Statistical analysis
Zhang et.al. Nucleic Acids Res. 33:W741, 2005 Wang et al. Nucleic Acids Res. 41:W77, 2013
22
WebGestalt output: an enriched pathways
TGF Beta Signaling Pathway
Input genes
23
WebGestalt output: Enriched GO terms
Response to unfolded proteins
12 genes adjp=1.32e-08
24
! Does not account for the order of genes in the significant gene list
! Arbitrary thresholding may lead to lose of information
Limitation of the over-representation analysis
25
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
26
Gene Set Enrichment Analysis: concept
! Do genes in a gene set tend to locate at the top or bottom of the ranked gene list?
27
Gene Set Enrichment Analysis: method
Subramanian et.al. PNAS 102:15545, 2005
http://www.broad.mit.edu/gsea/
+1/k
-1/(n-k)
k: Number of genes in the gene set S n: Number of all genes in the ranked gene list
28
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
29
! http://bioinfo.vanderbilt.edu/zhanglab/?q=node/410 ! Using WebGestalt (http://www.webgestalt.org) to identify
GO categories and pathways enriched in proteins up-regulated in the poor-prognosis colorectal cancer (CRC) subtype (Zhang et al. Nature, 513:382-387, 2014).
Exercise
30
Proteomics and systems biology (II)
Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected]
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
32
! Experimental " Yeast two-hybrid " Tandem affinity purification
! Computational (based on evidence of association) " Gene fusion (form part of a single protein in other organisms) " Ortholog interaction (orthologs interact in other organisms) " Phylogenetic profiling (co-existence in different organisms) " Microarray gene co-expression (co-expression in various
conditions)
! PPI data integration " Voting " Probabilistic integration (Jansen et al. Science 302:449, 2003)
Mapping protein-protein interaction networks
Valencia et al. Curr. Opin. Struct. Biol, 12:368, 2002
Phylogenetic profiling
33
! Database of Interacting Proteins (DIP) http://dip.doe-mbi.ucla.edu/
! The Molecular INTeraction database (MINT) http://mint.bio.uniroma2.it/mint/
! The Biomolecular Object Network Databank (BOND) http://bond.unleashedinformatics.com/
! The General Repository for Interaction Datasets (BioGRID) http://www.thebiogrid.org/
! Human Protein Reference Database (HPRD) http://www.hprd.org
! Online Predicted Human Interaction Database (OPHID) http://ophid.utoronto.ca
! iRef http://wodaklab.org/iRefWeb
! HI-II-14 http://interactome.dfci.harvard.edu/H_sapiens/
! The International Molecular Exchange Consortium (IMEX) http://www.imexconsortium.org
PPI databases
34
! Experimental " Chromatin immunoprecipitation (ChIP)
! ChIP-chip ! ChIP-seq
! Computational " Promoter sequence analysis " Reverse engineering from gene expression data
! Public databases " Transfac (http://www.gene-regulation.com) " MSigDB (http://www.broadinstitute.org/gsea/msigdb) " hPDI (http://bioinfo.wilmer.jhu.edu/PDI/ )
Mapping transcriptional regulatory networks
35
Mapping metabolic networks
! Public databases " KEGG (http://www.genome.jp/kegg/ ) " MetaCyc (http://metacyc.org/ )
36
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
37
Graph representation of networks
Cramer et al. Science 292:1863, 2001
edge
node
! Graph: a graph is a set of objects called nodes or vertices connected by links called edges. In mathematics and computer science, a graph is the basic object of study in graph theory.
RNA polymerase II
38
Undirected graph vs directed graph
Protein interaction network
Nodes: protein
Edges: physical interaction
Undirected
Transcriptional regulatory network
Nodes: genes
Edges: transcriptional regulation
Directed
TF->target gene
Metabolic network
Nodes: metabolites
Edges: enzymes
Directed
Substrate->Product
Krogan et al. Nature 440:637, 2006
Ravasz et al. Science 297:1551, 2002
Shen-orr et al. Nat Genet, 31:64, 2002
39
Unweighted graph vs weighted graph
Protein interaction network
Nodes: protein
Edges: physical interaction
Unweighted
Krogan et al. Nature 440:637, 2006
Gene co-expression network
Nodes: gene
Edges: co-expression level
Unweighted Weighted Correlation filtering (>0.8)
40
Degree, path, shortest path
! Degree: the number of edges adjacent to a node. A simple measure of the node centrality.
! Path: a sequence of nodes such that from each of its nodes there is an edge to the next node in the sequence.
! Shortest path: a path between two nodes such that the sum of the distance of its constituent edges is minimized.
YDL176W
Degree: 3
MetR
Out degree: 3
In degree: 1
41
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
42
Properties of complex networks
Scale-free Modular Hierarchical Small world
43
Obama vs Lady Gaga: who is more influential?
Obama 7,035,548 701,301
Gaga 8,873,525 144,263
Eminem 3,509,469 0
Twitter followers (in degree)
Twitter following (out degree)
28,490,739 664,606
35,158,014 136,511
13,946,813 0
44
Role of hubs in biological networks
! Based on data from model organisms S. cerevisiae and C. elegans " Correspond to essential genes " Have a tendency to be older and have evolved more slowly " Have a tendency to be more abundant " Have a larger diversity of phenotypic outcomes resulting from their
deletion
Vidal et al. Cell, 144:986, 2011
45
Scale-free topology and network robustness
! Robust against random damage
! Yet fragile against selective damage
46
Modularity
! Modularity refers to a group of physically or functionally linked molecules (nodes) that work together to achieve a relatively distinct function.
! Examples " Transcriptional module: a set of co-
regulated genes " Protein complex: assembly of
proteins that build up some cellular machinery, commonly spans a dense sub-network of proteins in a protein interaction network
" Signaling pathway: a chain of interacting proteins propagating a signal in the cell
Protein interaction modules Palla et al, Nature, 435:841, 2005
Gene co-expression modules Shi et al, BMC Syst Biol, 4:74, 2010
47
Making sense of the hairballs using network hierarchies
Nodes ordered in one dimension on the basis of network structure
Data visualized as tracks
www.netgestalt.org Shi et al., Nature Methods, 2013
48
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
49
Network visualization
Cytoscape (http://www.cytoscape.org)
Gehlenborg et al. Nature Methods, 7:S56, 2010
Network visualization tools
Red Blue
Black
Node label color: protein type HNE targets Transcription factors Other proteins
Node size: prediction cutoff level
Node color: protein function
Loose cutoff level
Stringent cutoff level
Cell cycle RNA splicing
Both Neither
50
! Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.
Network distance vs functional similarity
Sharan et al. Mol Syst Biol, 3:88, 2007
51
Gene function prediction: neighborhood majority voting
Direct neighbor (local)
Iterative majority voting (global)
52
Module-based Direct neighbor voting
Gene function prediction: module-based approach
53
Candidate disease gene prioritization
Kohler et al. Am J Hum Genet. 82:949, 2008
54
! Pathway analysis " Motivation " Pathway databases " Over-representation analysis " Gene set enrichment analysis
! Network analysis " Network mapping " Network representation " Network properties " Network-based applications
Outline
Data for individual genes
Understanding of biological systems
55