Virus Host co-evolution in sight of their proteomes and
codon preferences
Bioinformatics project 2007
Yaar Reuveni
Instructor: Michal Linial
Outline:
My project is composed of two phases:
1. Phase I: The virus host web tool – VirOsNet. You are welcome to visit at: www.virosnet.cs.huji.ac.il
2. Phase II: Virus Host co-evolution research using codon usage analysis.
Viruses: Basically a capsid (a protein shell) that contains genetic information.
Viruses cannot replicate by themselves and depend on the host for reproduction.
A virus's main purpose in life: enter a host and use its facilities to reproduce.
Viruses fight back:
Phase I: VirOsNet
VirOsNet provides a database and tools for exploring virus evolution and virus-host co-evolution
Background and Motivation:
Ample examples suggest that viruses often steal information from their hosts.
Viruses must optimize their amount of genetic material and physical size.
Viruses have very fast evolution:
o Hard to trace.
o Might change by switching hosts.
o Shuffle their genetic material.
Phase (I) main objective:
Compare all viral proteins to all known proteins and detect resemblance.
Meaning: in what way do viral proteins "resemble" any of all other known proteins in our world?
Objectives and possible outcomes (i)
Clever search: Provide crossbreeding factors when searching
Offer comparisons of viruses relative to the proteome of their known hosts
Stolen elements: where were they stolen from? Was it from the host?
Mimicking phenomenon: detect host - protein mimicry
When did it happen: Evolutionary tracking
Objectives and possible outcomes (ii)
Recent events: indicated by similarity search results that are exceptional.
Insights on viruses and their proteomes.
Long term: pharmaceutical applications. Proposal of drug targets.
Methods:
Data is from the ProtoNet DB (currently ~1.8 million proteins).
All proteins are from UniProt.
New tables were added to the DB, specialized for host-virus relations.
Pre-computed BLAST (BLOSUM62) and dynamic BLAST options.
The entry is a viral protein; BLAST search results are sorted by descending E-value.
Several display schemes.
Each result is associated with domain information (InterPro).
Download options for next-phase analysis.
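As a rough illustration of the sorted hit list, here is a minimal sketch that parses hits and orders them by E-value. It assumes the hits are available in BLAST's 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The sample rows, the `Hit` type and the function names are invented for illustration and are not VirOsNet's actual code. Lower E-values indicate more significant hits, so the sort direction is exposed as a flag rather than fixed.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse 12-column tabular BLAST output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Columns 0, 1, 2, 10, 11: query, subject, % identity, E-value, bit score.
        hits.append(Hit(cols[0], cols[1], float(cols[2]),
                        float(cols[10]), float(cols[11])))
    return hits


def sort_hits(hits: List[Hit], descending: bool = False) -> List[Hit]:
    """Sort hits by E-value; ascending puts the most significant hit first."""
    return sorted(hits, key=lambda h: h.evalue, reverse=descending)


# Hypothetical sample rows (made-up identifiers and numbers).
SAMPLE = "\n".join([
    "\t".join(["vprot1", "hostA", "98.5", "400", "6", "0",
               "1", "400", "10", "409", "1e-150", "520"]),
    "\t".join(["vprot1", "hostB", "31.0", "120", "80", "3",
               "5", "124", "40", "159", "0.002", "38.1"]),
    "\t".join(["vprot1", "hostC", "55.2", "300", "130", "4",
               "1", "300", "22", "321", "3e-20", "95.0"]),
])

if __name__ == "__main__":
    for hit in sort_hits(parse_tabular(SAMPLE)):
        print(hit.subject, hit.evalue)
```

Keeping the parsing and the sorting separate makes it easy to add other orderings later, for example by bit score or percent identity.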
Tool overview:
The tool works in a 4-step scheme:
1. Step 1: search for a virus to query, using one of the search methods
2. Step 2: choose a specific virus
3. Step 3: choose one of its proteins, and the BLAST properties
4. Step 4: choose one of the BLAST results to get its pairwise alignment
7,763 viruses and 199,563 proteins
Some Statistics
Entry point to viruses according to their genetic material complexity
Example: check all dsRNA viruses
Affecting Eukaryotes
Case study: Abelson murine leukemia virus:
A VERY close homolog of a human and a mouse protein tyrosine kinase that:
(i) Regulates the cytoskeleton during cell differentiation, cell division and cell adhesion
(ii) Regulates DNA repair, potentially in severe damage.
The viral protein causes cancer (active site mutation)
Let's look at it...
Active site
Summary Phase I:
Pros:
Platform for studying viruses relative to hosts
A discovery tool
Rich BLAST options for an evolutionarily wider view
Crossbreeding with host data (i.e. InterPro domains).
Dynamic view of BLAST results as a group (ProtoMesh)
Cons:
Usability for the average biologist still needs improvement
VirOsNet can get very slow on overload or with some of the filtering options.
Phase II: Codon usage
Virus-host classification using codon usage analysis
with SVM
Figure adapted from L. Merkel, N. Budisa, BIOspektrum 2006, 12, 41. Veränderung des genetischen Codes.
RNA codons (rows: 1st base; columns: 2nd base):
1st base | 2nd base U | 2nd base C | 2nd base A | 2nd base G
U | UUU (Phe/F) Phenylalanine | UCU (Ser/S) Serine | UAU (Tyr/Y) Tyrosine | UGU (Cys/C) Cysteine
U | UUC (Phe/F) Phenylalanine | UCC (Ser/S) Serine | UAC (Tyr/Y) Tyrosine | UGC (Cys/C) Cysteine
U | UUA (Leu/L) Leucine | UCA (Ser/S) Serine | UAA Ochre (Stop) | UGA Opal (Stop)
U | UUG (Leu/L) Leucine | UCG (Ser/S) Serine | UAG Amber (Stop) | UGG (Trp/W) Tryptophan
C | CUU (Leu/L) Leucine | CCU (Pro/P) Proline | CAU (His/H) Histidine | CGU (Arg/R) Arginine
C | CUC (Leu/L) Leucine | CCC (Pro/P) Proline | CAC (His/H) Histidine | CGC (Arg/R) Arginine
C | CUA (Leu/L) Leucine | CCA (Pro/P) Proline | CAA (Gln/Q) Glutamine | CGA (Arg/R) Arginine
C | CUG (Leu/L) Leucine | CCG (Pro/P) Proline | CAG (Gln/Q) Glutamine | CGG (Arg/R) Arginine
A | AUU (Ile/I) Isoleucine | ACU (Thr/T) Threonine | AAU (Asn/N) Asparagine | AGU (Ser/S) Serine
A | AUC (Ile/I) Isoleucine | ACC (Thr/T) Threonine | AAC (Asn/N) Asparagine | AGC (Ser/S) Serine
A | AUA (Ile/I) Isoleucine | ACA (Thr/T) Threonine | AAA (Lys/K) Lysine | AGA (Arg/R) Arginine
A | AUG (Met/M) Methionine, Start | ACG (Thr/T) Threonine | AAG (Lys/K) Lysine | AGG (Arg/R) Arginine
G | GUU (Val/V) Valine | GCU (Ala/A) Alanine | GAU (Asp/D) Aspartic acid | GGU (Gly/G) Glycine
G | GUC (Val/V) Valine | GCC (Ala/A) Alanine | GAC (Asp/D) Aspartic acid | GGC (Gly/G) Glycine
G | GUA (Val/V) Valine | GCA (Ala/A) Alanine | GAA (Glu/E) Glutamic acid | GGA (Gly/G) Glycine
G | GUG (Val/V) Valine | GCG (Ala/A) Alanine | GAG (Glu/E) Glutamic acid | GGG (Gly/G) Glycine
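The codon table can also be written down compactly in code. The sketch below builds the standard genetic code as a dictionary, translates a short RNA string, and counts how many codons encode each amino acid. The names `CODON_TABLE`, `translate` and `DEGENERACY` are illustrative choices, and the example sequence is made up.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons in UCAG x UCAG x UCAG order
# (standard genetic code); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(rna: str) -> str:
    """Translate an RNA (or DNA) coding sequence up to the first stop codon."""
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Number of synonymous codons per amino acid (and for stop).
DEGENERACY = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(translate("AUGUUUUAA"))  # Met, Phe, then a stop codon
    print(DEGENERACY["R"], DEGENERACY["M"], DEGENERACY["*"])
```

The degeneracy counts are what make codon usage informative: an amino acid with six synonymous codons leaves far more room for species-specific preference than one with a single codon.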
Main question:
Given a viral protein, determine who might be a potential host of the virus.
The basis for the hypothesis: An optimization of the viruses toward their hosts
Objectives:
Create a classification tool that receives a viral protein and predicts its potential hosts.
Classify all the proteins into different classes, using a maximum-margin hyperplane.
Provide different levels of classification.
Create a “host rank” for a given viral protein for each of its potential hosts.
Results: May suggest a “virus cross-species potential index”
Methods:
Collect and arrange all the codon usage data (or other data relevant for this classification).
Analyze the data: normalization and processing.
Unsupervised learning and clustering for a better understanding of the data.
Given the codon usage for all species, use the SVM algorithm to create a predictor for new specimens.
Provide various levels of classes for the codon data.
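To make the SVM step concrete, here is a minimal sketch of a linear maximum-margin classifier trained by batch sub-gradient descent on the regularized hinge loss. It is a toy stand-in, not the project's implementation: it handles only two classes, the two-dimensional vectors are invented placeholders for the 64-position codon usage vectors, the class labels "Bacteria" and "Primates" are picked for illustration, and the hyperparameters (`lam`, `lr`, `epochs`) are arbitrary assumptions. The signed decision value is reused as a crude "host rank".

```python
def decision_value(w, b, x):
    """Signed score of x relative to the separating hyperplane w.x + b = 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch sub-gradient descent.

    X is a list of feature vectors, y a list of labels in {+1, -1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:  # inside margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_hosts(w, b, x, hosts=("Bacteria", "Primates")):
    """Rank the two candidate hosts by decision value (hosts[0] is the +1 class)."""
    score = decision_value(w, b, x)
    scored = [(hosts[0], score), (hosts[1], -score)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented toy data: +1 = "Bacteria"-like usage, -1 = "Primates"-like usage.
TOY_X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
TOY_Y = [1, 1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(TOY_X, TOY_Y)
    print(rank_hosts(w, b, [0.85, 0.15]))
```

A real predictor over many host groups would need a multi-class scheme (for example one classifier per host group) and held-out data to check that the ranking generalizes.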
About the data:
Codon usage is calculated for each species.
Each species is represented by a 64-position vector.
The question of normalization:
o standard: normalize to 1.
o functional: per amino acid, or by entropy.
o percentage: per column
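The normalization options can be sketched as follows. The interpretations are assumptions: "standard" is read as dividing a species vector by its total so it sums to 1, "per amino acid" as each codon's share among its synonymous codons, and "percentage per column" as each species' share of a codon's total across all species. The entropy variant is left out because the slides do not define it. The UCAG ordering of the 64 positions is also an assumption.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG x UCAG x UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
AA_OF = dict(zip(CODONS, AMINO_ACIDS))


def normalize_standard(counts):
    """Scale a 64-position count vector so that it sums to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def normalize_per_amino_acid(counts):
    """Express each codon as a fraction of its amino acid's total count."""
    totals = {}
    for codon, c in zip(CODONS, counts):
        totals[AA_OF[codon]] = totals.get(AA_OF[codon], 0) + c
    return [c / totals[AA_OF[codon]] if totals[AA_OF[codon]] else 0.0
            for codon, c in zip(CODONS, counts)]


def normalize_per_column(matrix):
    """Percentage per column: each row's share of every column total."""
    col_totals = [sum(col) for col in zip(*matrix)]
    return [[100.0 * v / t if t else 0.0 for v, t in zip(row, col_totals)]
            for row in matrix]


if __name__ == "__main__":
    uniform = [1] * 64
    print(normalize_standard(uniform)[0])        # every codon has the same share
    print(normalize_per_amino_acid(uniform)[0])  # UUU is one of two Phe codons
```

The per-amino-acid variant removes the effect of amino acid composition, so two species with different protein content but the same synonymous codon preferences end up with similar vectors.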
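Codon usage (figure): one row per species (panels shown for Bacteria and Primates), one column per codon position, 1 . . . 64, grouped by amino acid.
Amino acid:       R L S T P A G V K N Q H E D Y C F I M W STOP
Number of codons: 6 6 6 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 1 1 3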
Data from Nakamura:
Codon usage tabulated from the international DNA sequence databases. Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) Nucl. Acids Res. 28, 292.
Downloading the codon usage table
The data covers all species (including viruses).
Usage distribution (figure), positions 1-13. Panels: Bacteria, Invertebrates, Primates, Viral, Plants, Rodents.
Our data:
We expected to find diverse codon usage between different taxonomy groups.
There are 703 distinct known hosts in our DB and 2,152 distinct viruses with known hosts.
I created an interface for extracting the CDS data from the coding data we have in ProtoNet.
I used the same convention for the vector.
In ProtoNet (version 5.1): 16,567 viruses and 409,726 proteins
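Turning coding sequences (CDS) into a 64-position vector can be sketched like this. The sequences are invented, the function names are hypothetical, and the UCAG ordering of the positions is an assumption, since the slides do not spell out the vector convention. Codons containing ambiguous bases and trailing partial codons are skipped.

```python
from itertools import product

# Assumed position order for the 64-element vector: UUU, UUC, UUA, UUG, UCU, ...
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_counts(cds: str) -> list:
    """Count codons in one in-frame coding sequence (DNA or RNA alphabet)."""
    seq = cds.upper().replace("T", "U")
    counts = [0] * 64
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in INDEX:  # skip codons with ambiguous bases such as N
            counts[INDEX[codon]] += 1
    return counts


def species_vector(cds_list) -> list:
    """Sum codon counts over all coding sequences of one species or virus."""
    total = [0] * 64
    for cds in cds_list:
        for i, c in enumerate(codon_counts(cds)):
            total[i] += c
    return total


if __name__ == "__main__":
    vector = species_vector(["ATGTTTTTTTAA", "ATGTGA"])
    print(vector[INDEX["AUG"]], vector[INDEX["UUU"]])
```

Summing raw counts before any normalization keeps long and short genes weighted by their actual codon content.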
Dividing our data into groups:
Group name (Taxid):
Fungi (4751)
Bacteria (2)
Viridiplantae (green plants) (33090)
Rodents (9989)
Primates (9443)
Fish (32443)
Aves (birds) (8782)
Tetrapoda (32523)
Arthropoda (6656)
distinct Hosts
4463393831313418788
Number viruses not distinct
916914142511015474262761263
Distinct viruses
9161329162868163741549175
Distinct viruses with CDS
9151304150816163631462169
Who infects what? (Figure: number of viruses per host group. Groups shown: Primates, Rodents, Aves, Tetrapoda, Fish, Bacteria, Fungi, Plants, Arthropoda, Others.)
These are all different virus groups:
Comparison: Positions 1-12
Looks Promising!
Clustering: preliminary results
Using the COMPACT tool set (COMPACT: A Comparative Package for Clustering Assessment), Varshavsky et al., 2005, ISPA: 159-167.
Visualization of results
Scoring
Hierarchical - Percentage Normalization
Hierarchical - Standard Normalization
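For readers unfamiliar with hierarchical clustering, a generic sketch follows. It is not COMPACT and makes its own assumptions: Euclidean distance, average linkage, and a stopping rule based on a target number of clusters. The two-dimensional vectors are invented stand-ins for normalized codon usage vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns clusters as sorted lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    print(agglomerative(toy, 2))
```

Running the same procedure on differently normalized vectors is one way to compare how strongly each normalization separates the taxonomy groups.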
Summary phase II:
All data is organized, accessible and will update along with the ProtoNet DB.
Comprehensive analysis created a good understanding of the data.
Future plans:
Decide on a good division into classes.
Use the SVM algorithm to create a classifier that, given a virus's codon preferences, guesses potential hosts.
Create an interface that offers this service.
Acknowledgements:
Thank you to all the people that helped:
Michal Linial, Iris Bahir, Menachem Fromer, Alexander Savenok, Michael Dvorkin, Roy Varshavsky
Thank You!