Tim Conrad, VL Network Analysis, SS16 1
Based on slides by J. Ruan (U Texas) and U. von Luxburg (U Tübingen)
VL Network Analysis (19401701)
SS 2016, Week 5
Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin
Community structure
Source: M. E. J. Newman and M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004).
Modularity function (Q)
Consider edges that fall within a community or between a community and the rest of the network. Define modularity Q in terms of:
• A: adjacency matrix
• L: total number of links
• k_i: degree of the i-th node
• c_i: label of the module to which the i-th node belongs
• D: indicator function, equal to 1 if both nodes are in the same cluster and 0 otherwise
Null model: the probability of an edge between two vertices is proportional to their degrees.
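A minimal sketch of how Q can be computed, assuming the standard Newman-Girvan form $Q = \frac{1}{2L}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2L}\right) D(c_i, c_j)$ for an undirected, unweighted graph without self-loops. The code uses the equivalent per-community form $Q = \sum_c \left[\frac{l_c}{L} - \left(\frac{d_c}{2L}\right)^2\right]$, where $l_c$ is the number of links inside community $c$ and $d_c$ the sum of degrees in $c$. The function name `modularity` and the toy graph are illustrative and not taken from the lecture.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, each link listed once, no self-loops
    community: dict mapping every node to its community label c_i
    """
    L = len(edges)                  # total number of links
    inside = defaultdict(int)       # l_c: links with both ends in community c
    degree_sum = defaultdict(int)   # d_c: sum of degrees k_i over nodes in c

    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1

    return sum(inside[c] / L - (degree_sum[c] / (2 * L)) ** 2 for c in degree_sum)


# Toy graph (illustrative): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

print(round(modularity(edges, two_groups), 4))  # 0.3571
print(modularity(edges, one_group))             # 0.0
```

Putting every node into one community gives Q = 0, which is why Q is read as the excess of within-community links over what the degree-based null model expects.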
HQcut
• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get communities with largest Q
• Recursively search for sub-communities within each community
• When to stop?
  – Q value of sub-network is small, or
  – Q is not statistically significant
    • Estimated by Monte-Carlo method
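The recursive scheme can be sketched as a skeleton. This is not the HQcut implementation: `split` is a hypothetical stand-in for Qcut (any function that returns the communities of a sub-network together with their Q), the stopping rule only checks a fixed threshold `q_min` whose default is arbitrary, and the Monte-Carlo significance test is left out.

```python
def recursive_communities(nodes, edges, split, q_min=0.3, min_size=3):
    """Recursively split a network until Q of the sub-network is small.

    split(nodes, edges) -> (list of node sets, Q) is supplied by the caller;
    it is a placeholder for a real community detection method.
    """
    nodes = set(nodes)
    # Keep only links of the sub-network induced by `nodes`.
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    if len(nodes) < min_size or not sub_edges:
        return [nodes]

    parts, q = split(nodes, sub_edges)
    # Stop if Q is small or the split is degenerate (nothing was separated).
    degenerate = len(parts) < 2 or any(len(p) in (0, len(nodes)) for p in parts)
    if q < q_min or degenerate:
        return [nodes]

    result = []
    for part in parts:
        result.extend(recursive_communities(part, sub_edges, split, q_min, min_size))
    return result


def toy_split(nodes, edges):
    """Demo-only splitter: halves the sorted node list and reports a made-up Q."""
    ordered = sorted(nodes)
    half = len(ordered) // 2
    q = 0.5 if len(ordered) >= 4 else 0.0
    return [set(ordered[:half]), set(ordered[half:])], q


communities = recursive_communities(
    {1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4)], toy_split, min_size=2
)
print(communities)
```

Passing the splitter as an argument keeps the recursion and the stopping rule separate from the choice of community detection method.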
Applications to a PPI network
• Protein-protein interaction (PPI) network
  – Vertices: proteins
  – Edges: interactions detected by experiments
• Motivation:
  – Community = protein complex?
• Protein complex
  – Group of proteins associated via interactions
  – Elementary functional unit in the cell
  – Prediction from PPI network is important
Experiments
• Data set
  – A yeast protein-protein interaction network
    • Krogan et al., Nature, 2006
  – 2708 proteins, 7123 interactions
• Algorithms:
  – Qcut, HQcut, Newman
• Evaluation
  – ~300 known protein complexes in MIPS
  – How well does a community match a known protein complex?
Results
                                      Newman    Qcut      HQcut
# of communities                      56        93        316
Max community size                    312       264       60
# of matched communities              53        52        216
Communities with matching score = 1   5 (9%)    7 (13%)   43 (20%)
Average matching score                0.56      0.55      0.70
# of novel predictions                3         41        100
Communities found by HQcut
• Small ribosomal subunit (90%)
• RNA poly II mediator (83%)
• Proteasome core (90%)
• Exosome (94%)
• gamma-tubulin (77%)
• respiratory chain complex IV (82%)
Lecture outline
• Gene expression analysis
• Converting data to networks
• Applications of network clustering methods
Gene Expression Analysis
The early steps of a microarray study
• Scientific Question (biological)
• Study design (biological/statistical)
• Conducting Experiment (biological)
• Preprocessing/Normalising Data (statistical)
• Finding differentially expressed genes (statistical)
Generation  Approach                                               Example methods                    Since
1st         Classical statistics                                   T-tests, ANOVA                     1950s
2nd         High-dimensional feature selection; Machine learning   SAM, Limma; SVM, Neural networks   1990s
3rd         Group-based enrichment analysis                        GSEA, GSA, Globaltest              2003
4th         Pathway Analysis                                       SPIA, TopoGSA                      2007
Specific Filtering
• t-statistic (one-way ANOVA F-statistic if there are more than 2 groups)
  – problem is that there often isn’t enough data to estimate variances reliably
• Fold change: simplest method; ratio of expression levels (but as microarray data is typically log-transformed, it is calculated as a difference of means)
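Both filters can be sketched for a single gene. The slide does not say which t-test variant is meant; the unequal-variance (Welch) form is assumed here. Function names and expression values are made up for illustration.

```python
import math
from statistics import mean, variance


def log_fold_change(group_a, group_b):
    """Log ratio: difference of mean log-transformed expression values."""
    return mean(group_a) - mean(group_b)


def welch_t_statistic(group_a, group_b):
    """Two-sample t-statistic with separate variance estimates per group.

    Each group needs at least two values; with so few samples per gene the
    variance estimates are unstable, which is the problem noted above.
    """
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical log2 expression values of one gene in two conditions.
obese = [8.0, 8.5, 9.0]
lean = [6.0, 6.5, 7.0]

print(log_fold_change(obese, lean))    # 2.0, i.e. a 4-fold change on the raw scale
print(welch_t_statistic(obese, lean))
```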
A data example
• Lee et al. (2005) compared adipose tissue (abdominal subcutaneous adipocytes) between obese and lean Pima Indians
• Samples were hybridised on Affymetrix HG-U95E arrays (12639 genes/probe sets)
• Available as GDS1498 on the GEO database
The “Result”
Probe Set ID  log.ratio  pvalue  adj.p
73554_at       1.4971    0.0000  0.0004
91279_at       0.8667    0.0000  0.0017
74099_at       1.0787    0.0000  0.0104
83118_at      -1.2142    0.0000  0.0139
81647_at       1.0362    0.0000  0.0139
84412_at       1.3124    0.0000  0.0222
90585_at       1.9859    0.0000  0.0258
84618_at      -1.6713    0.0000  0.0258
91790_at       1.7293    0.0000  0.0350
80755_at       1.5238    0.0000  0.0351
85539_at       0.9303    0.0000  0.0351
90749_at       1.7093    0.0000  0.0351
74038_at      -1.6451    0.0000  0.0351
79299_at       1.7156    0.0000  0.0351
72962_at       2.1059    0.0000  0.0351
88719_at      -3.1829    0.0000  0.0351
72943_at      -2.0520    0.0000  0.0351
91797_at       1.4676    0.0000  0.0351
78356_at       2.1140    0.0001  0.0359
90268_at       1.6552    0.0001  0.0421
What happened to the biology???
Naive functional analyses
• Manually annotate list of differentially expressed (DE) genes
  – Extremely time-consuming, not systematic, user-dependent
• Group together genes with similar function
• Conclude that the functional categories with the most DE genes are important in the disease/condition under study
• BUT this may not be the right conclusion
The Gene Ontology Consortium
GO Consortium
• Developed three structured and controlled vocabularies (ontologies)
• Describe gene products in a species-independent manner in terms of their
  – associated biological processes,
  – cellular components and
  – molecular functions
• Has become a major resource for microarray data interpretation
The Gene Ontology
• Molecular Function: basic activity or task
  – e.g. catalytic activity, calcium ion binding
• Biological Process: broad objective or goal
  – e.g. signal transduction, immune response
• Cellular Component: location or complex
  – e.g. nucleus, mitochondrion
Slightly more informative results
Probe Set ID  Gene Symbol  Gene Title                   GO biological process term     GO molecular function term     log.ratio  pvalue  adj.p
73554_at      CCDC80       coiled-coil domain contain   ---                            ---                             1.4971    0.0000  0.0004
91279_at      C1QTNF5 ///  C1q and tumor necrosis fa    visual perception /// embry    ---                             0.8667    0.0000  0.0017
74099_at      ---          ---                          ---                            ---                             1.0787    0.0000  0.0104
83118_at      RNF125       ring finger protein 125      immune response /// mod        protein binding /// zinc ion   -1.2142    0.0000  0.0139
81647_at      ---          ---                          ---                            ---                             1.0362    0.0000  0.0139
84412_at      SYNPO2       synaptopodin 2               ---                            actin binding /// protein bin   1.3124    0.0000  0.0222
90585_at      C15orf59     chromosome 15 open rea       ---                            ---                             1.9859    0.0000  0.0258
84618_at      C12orf39     chromosome 12 open rea       ---                            ---                            -1.6713    0.0000  0.0258
91790_at      MYEOV        myeloma overexpressed        ---                            ---                             1.7293    0.0000  0.0350
80755_at      MYOF         myoferlin                    muscle contraction /// bloo    protein binding                 1.5238    0.0000  0.0351
85539_at      PLEKHH1      pleckstrin homology doma     ---                            binding                         0.9303    0.0000  0.0351
90749_at      SERPINB9     serpin peptidase inhibitor,  anti-apoptosis /// signal tra  endopeptidase inhibitor ac      1.7093    0.0000  0.0351
74038_at      ---          ---                          ---                            ---                            -1.6451    0.0000  0.0351
79299_at      ---          ---                          ---                            ---                             1.7156    0.0000  0.0351
72962_at      BCAT1        branched chain aminotran     G1/S transition of mitotic c   catalytic activity /// branch   2.1059    0.0000  0.0351
88719_at      C12orf39     chromosome 12 open rea       ---                            ---                            -3.1829    0.0000  0.0351
72943_at      ---          ---                          ---                            ---                            -2.0520    0.0000  0.0351
91797_at      LRRC16A      leucine rich repeat contain  ---                            ---                             1.4676    0.0000  0.0351
78356_at      TRDN         triadin                      muscle contraction             receptor binding                2.1140    0.0001  0.0359
90268_at      C5orf23      chromosome 5 open read       ---                            ---                             1.6552    0.0001  0.0421
• If we are lucky, some of the top genes mean something to us
• But what if they don’t?
• And what are the results for other genes with similar biological functions?
Major bioinformatic developments
• Requires annotating entire set of genes
• The Gene Ontology Consortium (www.geneontology.org)
• Automated, statistical approaches for annotating gene lists and performing functional profiling
Functional profiling tools
• Identify GO categories with significantly more DE genes than expected by chance (i.e. over-represented among DE genes relative to representation on the array as a whole)
• Correct for testing multiple GO categories
• Hypergeometric Distribution or Fisher’s Exact Test
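A sketch of the over-representation test for one GO category, assuming the one-sided hypergeometric tail $P(X \ge x) = \sum_{i=x}^{\min(K,n)} \binom{K}{i}\binom{N-K}{n-i} / \binom{N}{n}$, with N genes on the array, K of them in the category, n DE genes, and x DE genes in the category. The slide does not name a multiple-testing correction; Bonferroni is used here only as one simple possibility. Function names and counts are illustrative.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_array, n_category, n_de, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(N=n_array, K=n_category, n=n_de).

    This equals the one-sided Fisher's exact test on the 2x2 table
    (in category / not in category) x (DE / not DE).
    """
    upper = min(n_category, n_de)
    tail = sum(
        comb(n_category, i) * comb(n_array - n_category, n_de - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_array, n_de)


def bonferroni(p_values):
    """Multiply each p-value by the number of tested categories, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up example: 20 genes on the array, 5 in the category,
# 4 DE genes, all 4 of which fall into the category.
p = hypergeom_enrichment_pvalue(n_array=20, n_category=5, n_de=4, n_overlap=4)
print(p)
print(bonferroni([p, 0.5]))
```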
Biological Interpretation
• Interpretation still requires substantial work
  – search literature and public databases
  – likely functional consequences of the changes
  – are the genes identified as significant within each GO category up- or down-regulated?
  – genes within a category can have opposite effects, e.g. apoptosis would include genes that induce or repress apoptosis
More than GO…
• Methods of how to incorporate biological knowledge into microarray analysis
• The type of knowledge we deal with is rather simple: we know groups/sets of genes that, for example,
  – have a similar function (e.g. GO)
  – belong to the same pathway
  – are located on the same chromosome, etc.
• We will assume these groupings to be given
  – i.e. we will not discuss methods for detecting pathways, networks, or gene clusters
What is a pathway?
• No clear definition
• Wikipedia: “In biochemistry, metabolic pathways are series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by chemical reactions.”
• These pathways describe enzymes and metabolites
• But often the word “pathway” is also used to describe gene regulatory networks or protein interaction networks
• In all cases a pathway describes a biological function very specifically
What is a Gene Set?
• Just what it says: a set of genes!
  – All genes involved in a pathway are an example of a Gene Set
  – All genes corresponding to a Gene Ontology term are a Gene Set
  – All genes mentioned in a paper of Smith et al. might form a Gene Set
• A Gene Set is a much more general and less specific concept than a pathway
• Still: we will sometimes use the two terms interchangeably, as the analysis methods are mainly the same
What is Gene Set/Pathway analysis?
• The aim is to give one number (score, p-value) to a Gene Set/Pathway
• Are many genes in the pathway differentially expressed (up-regulated/down-regulated)?
• Can we give a number (p-value) to the probability of observing these changes just by chance?
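One way to attach such a p-value to a gene set is a gene-permutation test, sketched here under simplifying assumptions: the set score is the mean absolute per-gene statistic, and "by chance" means a random gene set of the same size. This is not the procedure of any particular tool; the function name, the score, and the data are illustrative.

```python
import random


def gene_set_pvalue(scores, gene_set, n_perm=2000, seed=0):
    """Empirical p-value of a gene set's mean absolute score.

    scores:   dict gene -> per-gene statistic (e.g. t-statistic or log ratio)
    gene_set: genes of the set; genes without a score are ignored
    """
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in genes if g in gene_set]
    if not members:
        raise ValueError("gene set contains no scored genes")

    def mean_abs(selection):
        return sum(abs(scores[g]) for g in selection) / len(selection)

    observed = mean_abs(members)
    # Count random gene sets of equal size that score at least as high.
    hits = sum(
        mean_abs(rng.sample(genes, len(members))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero


# Made-up scores: three strongly changed genes among twenty.
scores = {f"g{i}": 0.1 for i in range(20)}
scores.update({"g0": 5.0, "g1": -4.5, "g2": 4.8})

p_changed = gene_set_pvalue(scores, {"g0", "g1", "g2"})
p_unchanged = gene_set_pvalue(scores, {"g5", "g6", "g7"})
print(p_changed, p_unchanged)
```

Using absolute values lets up- and down-regulated genes both count towards the score; a signed score would instead ask whether the set moves in one direction.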
Classes of Gene Set Analysis
Khatri et al. PLOS Comp Bio. 8:1 2012
[Figure: classes of gene set analysis, with example tools DAVID, GSEA, Reactome FI network, and PARADIGM]
Limitations of Gene Set Enrichment Analysis
• Many possible gene sets – diseases, molecular function, biological process, cellular compartment, pathways...
• Gene sets are heavily overlapping; need to sort through lists of enriched gene sets!
• “Bags of genes” obscure regulatory relationships among them.
Pathway Analysis
Pathway Databases
• Advantages:
  – Usually curated.
  – Biochemical view of biological processes.
  – Cause and effect captured.
  – Human-interpretable visualizations.
• Disadvantages:
  – Sparse coverage of genome.
  – Different databases disagree on boundaries of pathways.
KEGG
Reactome
• Hand-curated pathways in human.
• Rigorous curation standards – every reaction traceable to primary literature.
• Automatically-projected pathways to non-human species.
• 22 species; 1112 human pathways; 5078 proteins.
• Features:
  – Google-map style reaction diagrams with overlays;
  – Find pathways containing your gene list;
  – Calculate gene overrepresentation in pathways;
  – Find corresponding pathways in other species.
• Open access.
Pathway Commons
Pathway Colorization
• Main feature offered by all pathway databases.
• Upload a gene list.
• Database calculates an enrichment score on each pathway and displays ranked list.
• Browse into pathways of interest; download colorized pictures.
Example from Reactome
Example: Reactome FI Network
Curated Human Data – Version 35
• 5078 proteins
• 4166 reactions
• 3870 complexes
• 1112 pathways
Only ~25% of genome!
Goal: add a “corona” of uncurated interaction data around scaffold of curated pathway data.
More than pathways
Networks
• Pathways capture only the “well understood” portion of biology.
• Networks cover less well understood relationships:
  – Genetic interactions
  – Physical interaction
  – Coexpression
  – GO term sharing
  – Adjacency in pathways
Gene Expression Networks
Microarray data
• Data organized into a matrix
  – Rows are genes
  – Columns are samples representing different time points, conditions, tissues, etc.
• Analysis techniques
  – Differential expression analysis
  – Classification and clustering
  – Regulatory network construction
  – Enrichment analysis
• Characteristics of microarray data
  – High dimensionality and noise
  – Underlying topology unknown, often irregular shape
[Figure: expression heat map with genes as rows and samples as columns. Red: high activity; green: low activity.]
Microarray data clustering
• Many clustering algorithms available
  – K-means
  – Hierarchical
  – Self organizing maps
• Drawbacks
  – Parameters are hard to tune
  – Network topology is not considered
• Analyze genes in each cluster
  – Common functions?
  – Common regulation?
  – Predict functions for unknown genes?
[Figure: clustered expression heat map with genes as rows and samples as columns. Red: high activity; green: low activity.]
From Data to Networks
Network-based data analysis
Distances & Similarity
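The slide content is not preserved in this text, so the measures below are an assumption about what is commonly used for expression profiles: Euclidean distance, Pearson correlation (with 1 - r as a distance), and a Gaussian similarity derived from a distance. All function names and the `sigma` parameter are illustrative.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Similarity of profile shapes, between -1 and 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


def correlation_distance(x, y):
    """Turns correlation into a distance: 0 for identical shape, 2 for opposite."""
    return 1.0 - pearson_correlation(x, y)


def gaussian_similarity(x, y, sigma=1.0):
    """Similarity in (0, 1] that decays with Euclidean distance; sigma sets the scale."""
    return math.exp(-euclidean_distance(x, y) ** 2 / (2 * sigma ** 2))


# Two made-up profiles with the same shape but different magnitude.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
print(euclidean_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_b))
```

The example shows why the choice matters: the two profiles are far apart in Euclidean terms but have correlation distance close to 0.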
Directed k-nearest neighbor graph
Undirected k-nearest neighbor graph
epsilon-neighborhood Graph
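The graph constructions named on these slides can be sketched from a distance matrix. Two common ways of making the k-nearest neighbor graph undirected are shown (link if either point lists the other, or only if both do); which of them the slides illustrate is an assumption. Function names, tie-breaking by index, and the use of `<=` for epsilon are illustrative choices.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between points given as tuples."""
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def knn_directed(dist, k):
    """Directed edges (i, j): j is among the k nearest neighbors of i."""
    n = len(dist)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: (dist[i][j], j))
        edges.update((i, j) for j in others[:k])
    return edges


def knn_symmetric(dist, k):
    """Undirected edge if i lists j OR j lists i."""
    return {tuple(sorted(edge)) for edge in knn_directed(dist, k)}


def knn_mutual(dist, k):
    """Undirected edge only if i lists j AND j lists i."""
    directed = knn_directed(dist, k)
    return {(i, j) for (i, j) in directed if i < j and (j, i) in directed}


def epsilon_graph(dist, eps):
    """Undirected edge between all pairs at distance at most eps."""
    n = len(dist)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] <= eps}


# Four made-up points on a line; the last one is an outlier.
dist = distance_matrix([(0.0,), (1.0,), (2.5,), (10.0,)])
print(sorted(knn_directed(dist, 1)))
print(sorted(knn_symmetric(dist, 1)))
print(sorted(knn_mutual(dist, 1)))
print(sorted(epsilon_graph(dist, 2.0)))
```

In this example the outlier stays connected in the symmetric k-nearest neighbor graph but is isolated in the mutual and epsilon-neighborhood graphs.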