7/15/01 1
WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF
PROTEIN SEQUENCES
Gary W. StuartDepartment of Life SciencesIndiana State UniversityTerre Haute, IN 47809
[email protected] [email protected]
Jeffery J. LeaderDepartment of MathematicsRose-Hulman Institute of TechnologyTerre Haute, IN 47803
7/15/01 2
INTRODUCTION Whole Genome Sequences
Whole genome sequences are now available for 51 prokaryotic bacterial genomes, 5 eukaryotic genomes (yeast, roundworm, fruit fly, human, and A. thaliana), and 160 animal mitochondrial genomes.
Molecular Phylogeny Gene or protein sequence similarity is an accurate
measure of species similarity. Methods that involve a comparison of whole genomes
should improve estimates of phylogenetic relatedness. Comprehensive methods for calculating, archiving, and
accessing gene and species similarity estimates using whole genomes are needed.
7/15/01 3
METHODS: Whole Genome Target Sets
Mitochondrial: 13 gene families in 64 species = 832 proteins Microbial: 84 COG’s in 44 species = 5,871 proteins Mixed: 19 whole genomes & organelles = 28,653 proteins
Peptide Frequency Matrix 160,000 overlapping 4-grams x 832 = 133 million elements 8,000 overlapping 3-grams x 5,871 = 46 million elements 160,000 overlapping 4-grams x 28,653 = 4.5 billion elements
SVD and Dimension Reduction (DR) SVD-DR using bls-1:SVDPACK (Berry)
Gene Trees and Species Trees Gene to species data by vector addition (below) Cosine-based distances obtained from SVD-DR vectors Gene/species trees via NJ: PHYLIP (Felsenstein)
7/15/01 4
METHODS: 13 n-dimensional Vectors Define each Mitochondrial Genome 84+ n-dimensional Vectors Define each COG Subset Genome 102-3 n-dimensional Vectors Define each Complete Microbial Genome
Separate vectors: gene trees
Vector sums: species trees
7/15/01 5
RESULTS: Optimal Dimension Reduction for Mitochondrial Dataset
832 proteins in 13 families placed into 16 groups @ k=73-74
7/15/01 6
RESULTS: Gene Tree Example for Mitochodrial Dataset
832 proteins from 64 species 13 families formed 16 groups @ k=74 dimensions Only ND3, ND5, and CYTB failed to group exclusively
CYTB Dsem
ND5 Amis
COX3 (64)COX2 (64)ATP6 (64)ATP8 (64)
ND3 (31)
ND1 (64)ND4L (64)ND6 (64)ND2 (64)COX1 (64)CYTB (63)ND5 (29)ND5 (18)ND5 (16)ND4 (64)ND3 (33)
7/15/01 7
RESULTS: Consensus Species Tree for Mitochondrial Dataset
Optimal trees @ k=64-80 dimensions
7/15/01 8
RESULTS: Optimal Dimension Reduction for Microbial COG Dataset
5, 871 proteins from 44 species (including 2 fungi) 84 gene families placed into less than 1040 groups @ k>256
7/15/01 9
RESULTS: Gene Tree Example for Microbial COG Dataset
84 COG families formed 997 groups @ k=259 Too big to show!
Consensus Species Tree for COG Dataset Optimal trees @ 257-280 dimensions ALL 24 TREES - IDENTICAL TOPOLOGY
Challenging Relationships Hyperthermophiles: Aaeo & Tmar
eubacteria with archaebacteria Archaebacterium: Hnrc
archaebacteria with eubacteria Spirochetes: Tpal & Bbur
widely separated Bbur (& Baps) like Rpro mitochondria like?
circles: archaebacteriatriangles: eubacteriasquares: eukaryotesblack: hyperthermophilesgreen: Halobacteriumblue: spirochetes
LlacSpyrBhalBsub
CpneCtra
Sssp
HinfPmul
Vcho
EcooEcol
NmenNmez
TpalXfas
HnrcMtubMlep
CcreMlotPaerDrad
Cjej
HpyjHpyl
Uure
MpneMgen
BapsBbur
Rpro
CalbScer
TaciTvolAperMtheTmarAaeoAfulMjanPabyPhor
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
48
24
24
24
2424
2424
24
24
24
24
24
2424
24
24
24
24
24
24
24
24
2424
2424
2424
24
24
24
24
24
24 24
2424
24
24
2424
7/15/01 10
RESULTS: Consensus Species Tree for Total Genome Dataset
Optimal gene trees too big to analyze (for now?) For species trees, k = 210-220 large enough?
Challenging Relationships Chloroplast (Noli) not with Cyanobacteria (Sspp) Eubacterial thermophiles with archaea Spirochetes Tpal & Bbur widely separated Mitochondria (Recl) with Rpro, Bbur
Bbur
Rpro
Recl
Mgen
Mpne
Hpyl
Sspp
HinF
Cpne
Ctra
Noli
Mtub
Tpal
Mthe
Mjan
Phor
Aful
Tmar
Aaeo
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
11
4
7
1111
11
11
circles: archaebacteriatriangles: eubacteriablack: hyperthermophilesred: mitochondriagreen: chloroplastblue: spirochetes
7/15/01 11
DISCUSSION: How Does It Work?
Singular Value Decomposition (SVD): Ar = Ur r VrT
Matrix “Ar” is the original matrix (peptides x proteins) of rank “r”
Matrix “Ur” is the “peptide” matrix (peptides x singular vectors)
Matrix “Vr” is the “protein matrix (proteins x singular vectors)
Matrix “r” is a diagonal matrix of singular values in descending order
Graphically, SVD represents a rotation of description space In the “old” spaces, either...
• peptides are defined by protein vectors (protein vectors = columns of “A”)• proteins are defined by peptide vectors (peptide vectors = rows of “A”)
In the “new” space,• peptides are defined by singular vectors (singular vectors = columns of “U”)• proteins are defined by singular vectors (singular vectors = columns of “V”)• singular values () and their vectors are ordered by “importance” (variance)
Dimension Reduction removes “noise” : Ak = Uk k VkT
• if smaller singular values set to zero, then k<r
• Ak is always the most accurate reconstruction of Ak for a given k
• How is the “correct” value of “k” determined? (where does “signal” become “noise”?)
7/15/01 12
DISCUSSION: What are the Singular Vectors? What do they represent?
Singular Vectors are Orthonormal Basis Vectors for the “New” Space the sum of the squares of the elements of each vector sum to exactly “1”.
Singular Vectors of “U” Define Peptides by Presence in Conserved Motifs the individual elements of these vectors indicate contributions of each peptide a positive value for an element indicates a positive correlation with a given motif a negative value indicates a negative correlation with that motif singular vectors are therefore precisely defined by relative peptide contributions
Singular Vectors of “V” Define Proteins by Presence of Conserved Motifs the individual elements of these vectors indicate contributions of each protein a positive value indicates protein contains the motif a negative value indicates protein has nothing like this motif
Singular Vectors Represent and Define Conserved Motifs motifs are described as correlated peptides (copep motifs) in the “U” matrix copep motifs are NOT REQUIRED to have peptides in extended alignments
7/15/01 13
DISCUSSION: SVD produces a new descriptive space ...
Matrix A: “old” space with peptide bases
'
copep 1
copep 2
copep k
peptide 1
peptide 2
peptide n
Matrices U & V: “new” space with copep bases
7/15/01 14
ND3_Sscr:MNIMLTLLTNVTLASLLVLIAFWLPQLNAYSEKTSPYECGFDPMGSARLPFSMKFFLVAITFLLFDLEIALLLPLPWASQTNNLKTMLTMALFLLILLAASLAYEWTQKGLEWAE 2:LLVLIAFWLPQLN 8:FDPMGSA 5:VAITFLL 4:SLAYEW QGGLEWAE:1 LLVL 0.03683679 FDPM 0.04752085 VAIT 0.06136436 SLAY 0.04349018 LVLI 0.04179308 DPMG 0.04346968 AITF 0.06688452 LAYE 0.07069019 VLIA 0.03430444 PMGS 0.04377919 ITFL 0.06649574 AYEW 0.07069019 LIAF 0.04204521 MGSA 0.04145973 TFLL 0.06773809 IAFW 0.06443857 AFWL 0.07452322 FWLP 0.10143487 WLPQ 0.09049735 LPQL 0.05343578 PQLN 0.04795852
ND3_Pdol:MNLLIVMIISTALPIILMLLGFWLPNLNPDNEKVSPYECGFDPLGSARLPFSLKFFLVAILFLLFDLEIAILLPLPWALQYDTPTTAFLIALLILILLTLGLIYEWLQGGLEWAE 0:EKLSPYECGFDPLGSARLPFSMKFFLVAILFLLFDLEIALLLPLPWASQ 3:TLLTLGLIYEWTQ EKLS 0.05915245 TLLT 0.03798160 KLSP 0.06205812 LLTL 0.07589711 LSPY 0.06205812 LTLG 0.05564693 SPYE 0.12096941 TLGL 0.05957197 PYEC 0.13026607 LGLI 0.04020463 YECG 0.13224916 GLIY 0.04413720 ECGF 0.13224916 LIYE 0.04413720 CGFD 0.13080502 IYEW 0.04558135 GFDP 0.13080502 YEWT 0.06092579 FDPL 0.07362187 EWTQ 0.05599396 DPLG 0.06971443 PLGS 0.06871255 LGSA 0.06846775 GSAR 0.11228549 SARL 0.12589498 ARLP 0.12589498 RLPF 0.12456818 LPFS 0.12617765 PFSM 0.06863905 FSMK 0.06474277 SMKF 0.06460109 MKFF 0.06688452 KFFL 0.06847234 FFLV 0.10783106 FLVA 0.10640073 LVAI 0.10613140 VAIL 0.04388013 AILF 0.06119889 ILFL 0.06509992 LFLL 0.07745415 FLLF 0.12525181 LLFD 0.12614454 LFDL 0.12960829 FDLE 0.12960829 DLEI 0.12746527 LEIA 0.12925820 EIAL 0.12133975 IALL 0.13983325 ALLL 0.18434782 LLLP 0.13276736 LLPL 0.10135227 LPLP 0.12080467 PLPW 0.11729774 LPWA 0.08805169 PWAS 0.03825444 WASQ 0.03348659
DISCUSSION: An Example Copep Motif ...
Singular Vector 10 defines copep 10 as square-root
probabilities of tetrapeptides all possible peptides are included
in an exhaustive description only positive values above 0.03
shown here describes conserved elements
within the NADH Dehydrogenase 3 proteins
many peptides appear in aligned sub-motifs (but some do not)
alternative versions of the motif are represented by overlapping sub-motifs
copep motifs are IDEALIZED motifs that may not exist in nature.
7/15/01 15
Future Goals
Use a transformation on the input data. normalization of input protein vectors word entropy transformation or “df” transformation improves performance in text-retrieval applications
Are n-grams larger than 4 feasible? 3.2 million possible pentapeptides
Are multiple eukaryotic genomes feasible? millions of genes? networked computers parallel code for SVD (with M. Berry, Tennessee-Knoxville)