Integration of Full-Coverage Probabilistic Functional Networks
with Relevance to Specific Biological Processes
James, K., Wipat, A. & Hallinan, J.
School of Computing Science, Newcastle University
Data Integration in the Life Sciences 2009
Integrated functional networks
• Bring together data from a wide range of sources
• High-throughput data is
– Large (one node per gene; multiple interactions per node)
– Noisy (false positives 20–90%)
– Incomplete (to different extents)
• Assess quality of each dataset against a Gold Standard
• Edge weights reflect the summed probability that an edge actually exists
• Network can be thresholded to draw attention to most probable edges
• Suitable for manual (interactive) or computational analysis
Dataset bias
• Different experiment types provide different types of information
• Overlap between datasets is usually low
– 1% of synthetic lethal pairs physically interact
• Genes involved in the same process may be transcribed together
– Ribosomal biogenesis in yeast
• Some types of interaction may provide more information about a particular biological process
– Complex formation: Y2H
– Signal transduction: phosphorylation
Bias in HTP datasets
From Myers and Troyanskaya, Bioinformatics 2007.
Bias & Relevance
• Most network analyses are related to a Process of Interest (PoI)
• Probabilistic functional integrated networks (PFINs) tend to be very large
• Interactions with equal probability will have different utility
• Several attempts to eliminate bias
– Loss of data
• We aim to use bias
– Relevance
Hypothesis
Functional annotations can be applied to probabilistic integrated functional networks to identify interactions
relevant to a biological process of interest
Network integration
$$\mathrm{LLS} = \ln\left(\frac{P(L \mid E)\,/\,{\sim}P(L \mid E)}{P(L)\,/\,{\sim}P(L)}\right)$$
$$WS = \sum_{i=1}^{n} \frac{L_i}{D^{(i-1)}}$$
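A minimal sketch of the two integration scores, the log-likelihood score (LLS) of a dataset and the weighted sum (WS) of an edge. It assumes that the LLS probabilities are estimated from counts of dataset pairs that do or do not share Gold Standard annotation, and that L_i is the LLS of the i-th dataset supporting the edge. The function and parameter names are illustrative and do not come from the slides.

```python
import math

def log_likelihood_score(pos_in_data, neg_in_data, pos_in_gold, neg_in_gold):
    """LLS = ln( (P(L|E) / ~P(L|E)) / (P(L) / ~P(L)) ), estimated from counts.

    pos_in_data / neg_in_data: dataset pairs that do / do not share annotation.
    pos_in_gold / neg_in_gold: the same counts over all Gold Standard pairs (the prior).
    """
    data_odds = pos_in_data / neg_in_data
    prior_odds = pos_in_gold / neg_in_gold
    return math.log(data_odds / prior_odds)

def weighted_sum(ordered_scores, d):
    """WS = sum over i of L_i / D**(i-1).

    ordered_scores must already be in integration order (i = 1 first);
    enumerate() starts at 0, which supplies the exponent (i - 1).
    """
    return sum(score / d ** i for i, score in enumerate(ordered_scores))

if __name__ == "__main__":
    # D = 1 is a plain sum; a larger D down-weights later datasets.
    for d in (1, 2, 4):
        print(d, weighted_sum([4.0, 2.0, 1.0], d))
```

Because the function does not sort its input, the same scores give a different WS when the datasets are integrated in a different order, provided D is greater than 1.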
9
Effect of D valueEffect of D value
Relevance scoring
• GO annotations
• One-tailed Fisher’s exact test to score over-representation of genes related to the PoI
• PoI: term of interest plus any of its descendants, excluding annotations inferred from electronic annotation (IEA)
• Control network integrated in order of confidence
• Relevance network integrated in order of relevance
• We use the integration method of Lee et al. (2004), but the approach can be applied to any network and any data integration algorithm
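The relevance score can be sketched as the upper tail of a hypergeometric distribution, which is what a one-tailed Fisher's exact test for over-representation computes. The layout of the contingency table (genes in a dataset against all background genes) and every name below are assumptions made for illustration, not details taken from the slides.

```python
from math import comb

def fisher_one_tailed(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: genes in the dataset annotated to the process of interest
    n: genes in the dataset
    K: genes in the background annotated to the process of interest
    N: genes in the background
    """
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible table cells contribute nothing to the tail.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / total

def rank_by_relevance(p_values):
    """Order dataset names from most to least relevant (smallest p-value first)."""
    return sorted(p_values, key=p_values.get)

if __name__ == "__main__":
    p = {"dataset_x": fisher_one_tailed(5, 5, 5, 10),
         "dataset_y": fisher_one_tailed(2, 5, 5, 10)}
    print(rank_by_relevance(p))
```

Ranking by ascending p-value gives one possible order in which to integrate the relevance network. The control network would instead use the confidence order.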
Data sets
• Saccharomyces cerevisiae data from BioGRID v.38
• Split by PMID, duplicates removed
• Datasets > 100 interactions treated individually
– 50 data sets, max 14,421 interactions
• Datasets < 100 interactions grouped by BioGRID experimental categories
– 22 data sets, min 33 interactions
• Gene Ontology terms
– Telomere Maintenance (GO:0000723)
– Ageing (GO:0007568)
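The dataset preparation can be sketched as follows. The record layout (gene pair, PMID, experimental category), the dataset labels, and the treatment of a study with exactly 100 interactions (grouped here) are assumptions. The slides specify only "> 100" and "< 100".

```python
from collections import defaultdict

def build_datasets(records, min_size=100):
    """Split interaction records into datasets.

    records: iterable of (gene_a, gene_b, pmid, category) tuples (hypothetical layout).
    Studies with more than min_size unique interactions become their own dataset;
    interactions from smaller studies are pooled by experimental category.
    """
    by_pmid = defaultdict(dict)  # pmid -> {sorted gene pair: category}
    for gene_a, gene_b, pmid, category in records:
        pair = tuple(sorted((gene_a, gene_b)))  # A-B and B-A count as one interaction
        by_pmid[pmid].setdefault(pair, category)

    datasets = {}
    pooled = defaultdict(set)
    for pmid, pairs in by_pmid.items():
        if len(pairs) > min_size:
            datasets[f"PMID:{pmid}"] = set(pairs)
        else:
            for pair, category in pairs.items():
                pooled[category].add(pair)
    for category, pairs in pooled.items():
        datasets[f"category:{category}"] = pairs
    return datasets

if __name__ == "__main__":
    demo = [("A", "B", "1", "Two-hybrid"), ("B", "A", "1", "Two-hybrid"),
            ("A", "C", "1", "Two-hybrid"), ("A", "D", "1", "Two-hybrid"),
            ("E", "F", "2", "Synthetic Lethality"),
            ("G", "H", "3", "Synthetic Lethality")]
    print(build_datasets(demo, min_size=2))
```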
Choice of D value
• GO annotations
• Assign function to nodes based on annotation of neighbour with highest weighted edge
• Leave-one-out on full network
• Construct Receiver Operating Characteristic (ROC) curve
– Area Under Curve (AUC)
– SE(W) using Wilcoxon statistic
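A minimal sketch of neighbour-based function assignment with leave-one-out evaluation. Each gene's own annotation is held out and used only as the true label. The prediction comes from the neighbour joined by the highest-weighted edge. Using that edge weight as the classifier score is an assumption, as are the data layout and the function name.

```python
def loo_neighbour_scores(edges, annotated):
    """Score every gene from its strongest neighbour, ignoring its own annotation.

    edges: {(gene_a, gene_b): weight} for an undirected weighted network.
    annotated: set of genes annotated to the term being evaluated.
    Returns {gene: (score, truth)}; score is the edge weight if the strongest
    neighbour carries the term, otherwise 0.0 (an illustrative scoring choice).
    """
    best = {}  # gene -> (weight, neighbour) for the highest-weighted edge seen so far
    for (a, b), weight in edges.items():
        for gene, other in ((a, b), (b, a)):
            if gene not in best or weight > best[gene][0]:
                best[gene] = (weight, other)

    results = {}
    for gene, (weight, neighbour) in best.items():
        score = weight if neighbour in annotated else 0.0
        results[gene] = (score, gene in annotated)
    return results

if __name__ == "__main__":
    net = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 2.0}
    print(loo_neighbour_scores(net, annotated={"a", "b"}))
```

The resulting (score, truth) pairs can be split into positives and negatives to draw a ROC curve for a given D value.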
Classifier output
[Figure: classifier output for positives and negatives, divided by a classification threshold into TP, FP, TN and FN; arrows mark the directions of increasing specificity and increasing sensitivity.]
ROC Curves
[Figure: ROC curves, True Positives (y axis, 0 to 1) against False Positives (x axis, 0 to 1), for three cases: no power, intermediate, and perfect classification.]
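The area under a ROC curve equals the Wilcoxon (Mann-Whitney) statistic: the probability that a randomly chosen positive scores higher than a randomly chosen negative. The sketch below computes it directly from scores and adds the Hanley and McNeil (1982) standard error. That this is the SE(W) estimator used in the talk is an assumption, and the function names are illustrative.

```python
import math

def auc_wilcoxon(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def se_hanley_mcneil(auc, n_pos, n_neg):
    """Standard error of the AUC following Hanley and McNeil (1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

if __name__ == "__main__":
    positives, negatives = [0.9, 0.3], [0.5, 0.1]
    area = auc_wilcoxon(positives, negatives)
    print(area, se_hanley_mcneil(area, len(positives), len(negatives)))
```

An AUC of 0.5 corresponds to the "no power" diagonal and 1.0 to perfect classification.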
D value
Ranking
Dataset  Conf. Score  Conf. Rank  Ageing Rank  Telomere Rank  A&T Rank
1        6.6937       1           4            6              6
2        5.7054       2           8            7              8
3        5.7040       3           6            2              2
4        5.0842       4           3            4              4
5        4.9335       5           7            5              5
6        4.5212       6           5            8              7
7        4.4641       7           1            1              1
8        4.2253       8           2            3              3
Results
Evaluation - Clustering
• MCL, a Markov-based clustering algorithm
• Considers network topology and edge weights
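To illustrate how a Markov-based clustering method uses both topology and edge weights, here is a compact pure-Python sketch of the expansion and inflation loop. It is a teaching re-implementation under simple assumptions (dense matrix, positive weights, small networks), not the MCL program itself. The parameter names and defaults (`inflation`, `iterations`, `eps`) are illustrative.

```python
def mcl(weights, nodes, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster an undirected weighted graph by alternating expansion and inflation.

    weights: {(node_a, node_b): positive weight}; nodes: list of all node names.
    Returns a list of clusters (sets of node names).
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in weights.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = w
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0  # self-loop; 1.0 for isolated nodes

    def normalise(mat):
        # Make every column sum to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        squared = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                   for i in range(n)]
        # Inflation: raise entries to a power and renormalise, boosting strong flows.
        inflated = normalise([[v ** inflation for v in row] for row in squared])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < eps:
            break

    clusters = []
    for i in range(n):
        # A row with non-zero entries belongs to an attractor; its columns are the members.
        members = {nodes[j] for j in range(n) if m[i][j] > eps}
        if members and members not in clusters:
            clusters.append(members)
    return clusters

if __name__ == "__main__":
    demo = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0, ("d", "e"): 1.0}
    print(mcl(demo, ["a", "b", "c", "d", "e"]))
```

Higher inflation values generally yield more, smaller clusters, so this parameter would need tuning on real data.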
Results
Net  Bias  Clusters  % COI  >2 nodes  >3 nodes  >4 nodes
A    C     573       21.29  26.14     28.86     35.19
A    R     523       22.37  27.73     31.75     36.92
T    C     573       5.06   6.14      7.02      6.53
T    R     508       6.50   7.73      8.90      8.59
C    C     573       24.26  29.55     33.80     37.98
C    R     523       24.67  29.83     33.33     38.35
Cluster annotation
Conclusions
• Function assignment is statistically significantly better, but probably not practically useful
– Simplistic algorithm
– Dependent upon existing annotation
• Clustering
– Fewer, larger clusters
– Clusters draw together genes of interest
– Different GO terms perform differently
• Relevance networks are better for interactive exploration
– Related PoIs
Future work
• Which GO terms work best with relevance?
• Why?
• Further exploration of experimental types and relevance
• Implement algorithms in Ondex
• Optimize function assignment / clustering algorithms
• Extend technique to edges
Acknowledgements
• Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN)
• Newcastle Systems Biology Resource Centre
• Research Councils of the UK
• BBSRC SABR Ondex Project
• Integrative Bioinformatics Research Group