Automated Function Prediction
Outline
• Goal
• How function is defined
• Why Gene Ontology
• Methods for protein function prediction
• End points
GOAL
• A) You find a new protein
• B) You sequence the whole genome of your favorite organism
• Obtained gene(s) should be annotated
• A can be solved manually. B needs automatic tools
How function is defined
• Functional description as text
• Linking gene to keywords (UniProt)
• Linking gene to Gene Ontology
• Linking gene to signalling pathways or biochemical pathways (KEGG)
Why Gene Ontology (GO)
• GO is currently a popular standard in gene annotation
• GO categories represent gene function
• Creates a union for genes in the same process
• Easy summary for genes with similar function
Why Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular Function, Cellular Component (cellular localization)
– Molecular Function => chemical activity
– Biological Process => biology, cellular process
– Cellular Component => location of the gene product
• Hierarchical structure
– Categories with very precise function
– Categories with less precise function
– Categories with very broad function
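The hierarchical structure means that a gene annotated to a very precise category also belongs, implicitly, to every broader category above it. A minimal sketch of that propagation, assuming a toy parent map (the term names are illustrative placeholders, not real GO identifiers, and the helper names are hypothetical):

```python
# Toy hierarchy: child term -> list of broader parent terms.
# Term names are illustrative placeholders, not real GO identifiers.
TOY_PARENTS = {
    "very_precise_term": ["less_precise_term"],
    "other_precise_term": ["less_precise_term", "other_broad_term"],
    "less_precise_term": ["broad_term"],
    "broad_term": [],
    "other_broad_term": [],
}

def ancestors(term, parents):
    """Return every broader term reachable from `term` (excluding itself)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(annotations, parents):
    """Expand a set of annotated terms with all of their broader terms."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded
```

A term may have more than one parent, which is why the sketch walks a graph instead of a simple chain.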
How GO helps
• End user: Summary categories for genes with various functions
• Computer programs: Classifier algorithms can be taught to predict the categories for genes
Understanding GO
• AmiGO server (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi)
Function Prediction: What can we use to predict function
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein domains (PFAM domains)
• Short sequence patterns – motifs
• Sequence features (secondary structure, low-complexity regions)
Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect GO classes for the genes in the BLAST result hits
• Give a weight to each BLAST hit – often derived from log(E-value)
• Combine the scores from the genes that belong to the same GO class
• Report the top-scoring / significant GO classes
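The steps above can be sketched roughly as follows. This is only an illustration: the weight -log10(E-value), the E-value floor, the function name and the GO labels in the example are all assumptions, not the scoring used by any particular program.

```python
import math
from collections import defaultdict

def score_go_classes(hits, min_evalue=1e-180):
    """Sum per-hit weights for each GO class.

    hits: list of (evalue, go_classes) tuples, one per BLAST hit.
    The weight -log10(E-value) is an assumed choice. E-values are floored
    at min_evalue so that a reported E-value of 0 stays finite, and
    weights are clamped at 0 so that hits with E-value > 1 add nothing.
    """
    scores = defaultdict(float)
    for evalue, go_classes in hits:
        weight = max(0.0, -math.log10(max(evalue, min_evalue)))
        for go in set(go_classes):
            scores[go] += weight
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical hits; the GO labels are placeholders.
example_hits = [
    (1e-50, ["GO:A", "GO:B"]),
    (1e-20, ["GO:A"]),
    (1e-5, ["GO:C"]),
]
ranked = score_go_classes(example_hits)
```

Summing weights favours GO classes supported by many strong hits; other combination rules (maximum, mean) are equally possible.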
Sequence Homology Methods
• Simple methods
• Programs
– BLAST2GO (http://www.blast2go.com/b2ghome)
– GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
– ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
– PFP (http://kiharalab.org/web/pfp.php)
Phylogenetic tree methods
• Create the pair-wise distances for the set of genes
• Do a hierarchical clustering of genes
• Map the known GO functions to the cluster tree
• Look for unknown genes in a cluster with many genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full
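A simplified sketch of the clustering idea: it uses single-linkage agglomerative merging stopped at a distance cutoff instead of building a full phylogenetic tree, and it assumes one GO class per annotated gene. The function names, distances and GO labels are hypothetical.

```python
from collections import Counter

def single_linkage_clusters(genes, dist, cutoff):
    """Agglomerative single-linkage clustering, stopped at `cutoff`.

    dist: dict mapping frozenset({gene_a, gene_b}) -> distance.
    """
    clusters = [{g} for g in genes]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

def predict_from_clusters(clusters, known_go):
    """Give each unannotated gene the most common GO class in its cluster."""
    predictions = {}
    for cluster in clusters:
        counts = Counter(known_go[g] for g in cluster if g in known_go)
        if not counts:
            continue
        top_go, _ = counts.most_common(1)[0]
        for g in cluster:
            if g not in known_go:
                predictions[g] = top_go
    return predictions

# Hypothetical data: two tight groups of genes, one unannotated query.
genes = ["g1", "g2", "g3", "query", "g4", "g5"]
group = {"g1": 0, "g2": 0, "g3": 0, "query": 0, "g4": 1, "g5": 1}
dist = {frozenset((a, b)): (0.1 if group[a] == group[b] else 0.9)
        for a in genes for b in genes if a != b}
known_go = {"g1": "GO:A", "g2": "GO:A", "g3": "GO:B", "g4": "GO:C", "g5": "GO:C"}

clusters = single_linkage_clusters(genes, dist, cutoff=0.5)
predictions = predict_from_clusters(clusters, known_go)
```

Real phylogenetic methods work on an actual tree and model how function changes along its branches; the majority vote here only illustrates the "unknown gene in a cluster of annotated genes" step.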
Phylogenetic tree methods
• These should outperform sequence homology methods (CAFA 2011?)
• Require a set of related genes
• Often much heavier calculations
• Programs:
– SIFTER (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)
Prediction with Protein domains
• Look at which protein domains the query protein contains (PFAM)
• Map the functions that are linked to the domains to your query sequence
– PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
– This mapping is the same in plants, mammals and bacteria
– Many domains to specific function
Prediction with Protein domains
• Benefits:
– Can create annotation from separate domains
– Similar sequences do not have to be in the database
• Programs (?): InterProScan (http://www.ebi.ac.uk/InterProScan/)
• Drawbacks:
– The mapping is the same in plants, mammals and bacteria
– Many domains to specific function
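The domain-based mapping can be illustrated with a small lookup. This is a sketch under assumptions: the mapping table is a hypothetical stand-in in the spirit of PFAM2GO, and the domain and GO identifiers are placeholders, not real entries.

```python
def go_from_domains(query_domains, domain2go):
    """Collect the GO classes linked to each domain found in the query.

    Returns a dict: GO class -> sorted list of the domains supporting it.
    Domains missing from the mapping simply contribute nothing.
    """
    support = {}
    for domain in query_domains:
        for go in domain2go.get(domain, []):
            support.setdefault(go, set()).add(domain)
    return {go: sorted(domains) for go, domains in support.items()}

# Hypothetical mapping table; identifiers are placeholders.
toy_domain2go = {
    "DOMAIN_1": ["GO:A", "GO:B"],
    "DOMAIN_2": ["GO:B"],
    "DOMAIN_3": [],
}
annotation = go_from_domains(["DOMAIN_1", "DOMAIN_2", "DOMAIN_X"], toy_domain2go)
```

The lookup ignores the source organism entirely, which is exactly the drawback listed above: the same table is applied to plants, mammals and bacteria.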
Prediction with patterns and motifs
• Same principle as before, but we look at sequence patterns and motifs
• Map the functions that are linked to patterns to your query sequence
• Programs:
– InterProScan
– IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html)
• Drawbacks and benefits approximately the same as before
Prediction with sequence features
• Again the same principle as before
• We look at sequence features (see picture)
• These are given as input to a classifier algorithm (Support Vector Machine)
Prediction with sequence features
• Benefits:
– No actual sequence similarity needed
– Info collected from vague similarities
– Use of classifier => feature weighting
• Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/)
• Drawbacks:
– Calculations probably quite heavy
– No use of nearby sequence similarities (domains etc.)
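To make "sequence features as classifier input" concrete, the sketch below turns a protein sequence into a fixed-length numeric vector. The feature choice (length, hydrophobic fraction, amino acid composition) is an assumption for illustration only and is not FFPred's feature set; the classifier step itself is left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")  # one common, simplified grouping (assumption)

def sequence_features(seq):
    """Return a fixed-length numeric feature vector for a protein sequence.

    Layout: [length, hydrophobic fraction, 20 amino acid frequencies].
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    composition = [seq.count(aa) / n for aa in AMINO_ACIDS]
    hydrophobic_fraction = sum(1 for aa in seq if aa in HYDROPHOBIC) / n
    return [float(n), hydrophobic_fraction] + composition

# Hypothetical short peptide, used only to show the vector layout.
features = sequence_features("MKVLAAGIVG")
```

Because every sequence maps to a vector of the same length, sequences with no detectable similarity to each other can still be compared by a classifier such as a Support Vector Machine.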
Our contribution: PANNZER
• Use BLAST result list
• Add taxonomic information
• Score GO classes using a score that takes the frequency of the GO class in the sequence database into account
• Method is used to predict:
– GO classes
– Description line
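One generic way to take the database frequency of a GO class into account is a hypergeometric enrichment test. The sketch below shows that idea only; the slides do not specify PANNZER's actual score, so this should not be read as that method, and the function name and example numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: sequences in the database, K: of those, how many carry the GO class,
    n: size of the BLAST result list, k: hits in the list with the GO class.
    A small value means the class is over-represented among the hits.
    """
    total = comb(N, n)
    upper = min(n, K)
    # comb() returns 0 when asked for more items than are available,
    # so impossible combinations drop out of the sum automatically.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

# Same number of supporting hits, but a rare versus a common GO class.
p_rare = hypergeom_upper_tail(4, 10, 20, 10000)
p_common = hypergeom_upper_tail(4, 10, 2000, 10000)
```

With the same four supporting hits, the rare class receives a much smaller probability than the common one, which is the effect a frequency-aware score is after.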
Our contribution: PANNZER
• Benefits:
– Taking the species taxonomy into account
– Improved use of statistics
• Not public yet
Our contribution: No Name Yet
• Take PFAM domain predictions, BLAST similarities and taxonomic information
• Feed these to feature selection and to a classifier algorithm
• …Wait…
• Method is used to predict GO classes
• Not public + testing is ongoing
Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately no clear evaluation (my opinion)
• Remember: These are predictions. No certain info until they are tested in the wet lab…