38
Extending Gene Families Extending Gene Families via Predicted Ancestral via Predicted Ancestral Sequences Sequences Loretta Goldberg Loretta Goldberg Computational Biosciences Computational Biosciences Arizona State University Arizona State University April 28, 2006 April 28, 2006

Extending Gene Families via Predicted Ancestral Sequencescbs/projects/2006_presentation...[Atls 90] Altschul S F, Gish W, Miller W, Myers E W, Lipman D J; Basic LocMyers E W, Lipman

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

  • Extending Gene Families Extending Gene Families via Predicted Ancestral via Predicted Ancestral

    SequencesSequences

    Loretta GoldbergLoretta GoldbergComputational BiosciencesComputational Biosciences

    Arizona State UniversityArizona State UniversityApril 28, 2006April 28, 2006

  • This internship project is presented in This internship project is presented in partial fulfillment of the requirement of partial fulfillment of the requirement of

    the Professional Science Master's in the Professional Science Master's in Computational Biosciences, Computational Biosciences,

    Arizona State UniversityArizona State University

    Research conducted fromResearch conducted fromSeptember 2005 September 2005 –– April 2006April 2006

  • 3

    AcknowledgementsAcknowledgements

    Thanks to my Advisor, Dr. Michael Rosenberg, Thanks to my Advisor, Dr. Michael Rosenberg, who proposed this internship project, and has who proposed this internship project, and has provided guidance throughout.provided guidance throughout.

    Thanks to my committee: Dr. Rosenberg, Dr. Thanks to my committee: Dr. Rosenberg, Dr. Renaut and Dr. Touchman, for the knowledge they Renaut and Dr. Touchman, for the knowledge they have shared throughout my coursework and for have shared throughout my coursework and for review of this final project.review of this final project.

  • 4

    Personal BackgroundPersonal Background

    B.A. ChemistryB.A. ChemistryRussell Sage College, Troy NYRussell Sage College, Troy NY

    4 years as a Process Development Chemist in the 4 years as a Process Development Chemist in the Pharmaceutical IndustryPharmaceutical Industry

    M.S. Computer SciencesM.S. Computer SciencesArizona State UniversityArizona State University

    14 years as a Software Engineer in the 14 years as a Software Engineer in the Telecommunications IndustryTelecommunications Industry

    P.S.M. Computational BiosciencesP.S.M. Computational BiosciencesTBDTBD

  • 5

    Extending Gene Families via Extending Gene Families via Predicted Ancestral SequencesPredicted Ancestral Sequences

    Gene familiesGene familiesWhat are they?What are they?How do we determine them?How do we determine them?

    Ancestral SequencesAncestral SequencesWhat are they?What are they?How do we determine them?How do we determine them?

    How does this project propose to utilize How does this project propose to utilize ancestral sequences in the context of gene ancestral sequences in the context of gene families?families?

  • 6

    Gene FamiliesGene Families

    A A gene familygene family is the grouping together of is the grouping together of genes based on the similarity of the products genes based on the similarity of the products or proteins that they produce.or proteins that they produce.

    A A gene familygene family is a set of related genes is a set of related genes occupying various loci in the DNA, almost occupying various loci in the DNA, almost certainly formed by duplication of ancestral certainly formed by duplication of ancestral genes, and having a genes, and having a recognizablyrecognizably similar similar sequence.sequence.

  • 7

    Ancestral sequencesAncestral sequences

    A A phylogenetic treephylogenetic tree is a is a tool used to show the tool used to show the evolutionary relationship evolutionary relationship between biological objectsbetween biological objects

    The The tipstips represent the represent the observed genes observed genes (sequences) in living (sequences) in living organismsorganisms

    The The nodesnodes or branching or branching points represent the points represent the extinct ancestorsextinct ancestors

    Ancestralsequence

    ObservedSequence

  • 8

    Project GoalProject Goal

    To determine if the inclusion of ancestral To determine if the inclusion of ancestral sequences in an allsequences in an all--vsvs--all comparison of all comparison of protein sequences will extend the number of protein sequences will extend the number of homologous sequences that can be extracted homologous sequences that can be extracted from this comparison.from this comparison.

  • 9

    Proposed Work FlowProposed Work Flow

    Identifying protein sequences of interestIdentifying protein sequences of interest

    AllAll--vsvs--all sequence comparisonall sequence comparison

    Alignment of sequencesAlignment of sequences

    Generation of Phylogenetic treesGeneration of Phylogenetic trees

    Prediction of ancestral sequencesPrediction of ancestral sequences

    AllAll--vsvs--all sequence comparison (including all sequence comparison (including ancestral sequences)ancestral sequences)

  • 10

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Proposed Work FlowProposed Work Flow

  • 11

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Obtaining Protein SequencesObtaining Protein Sequences

    Sequences for the desired species are extracted from NCBI’s FASTA formatted nonredundant protein database (nr).

    formatdb

  • 12

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Finding Gene FamiliesFinding Gene Families

    blastclust performs the all-vs-all sequence comparison, returning a cluster list (each cluster represents a gene family or group of homologous sequences).

    formatdb

    blastclust

  • 13

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Sequence AlignmentSequence Alignment

    clustalw performs a progressive multiple sequence alignment for each cluster of sequences. clustalw is the command line version of web-based CLUSTAL W.

    formatdb

    blastclust

    clustalw

  • 14

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Phylogenetic Tree ConstructionPhylogenetic Tree Construction

    clustalw utilizes the sequence alignment to build a phylogenetic tree via neighbor joining.

    formatdb

    blastclust

    clustalw

    clustalw

  • 15

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Ancestral Sequences PredictionAncestral Sequences Prediction

    codeml utilizes both the sequence alignment and phylogenetic tree as input. The maximum likelihood method is used in the reconstruction of the ancestral sequences.

    formatdb

    blastclust

    clustalw

    clustalw

    codeml

  • 16

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Extending Sequence DatabaseExtending Sequence Database

    The database of protein sequence now includes the predicted ancestral sequences.

    formatdb

    blastclust

    clustalw

    clustalw

    codeml

  • 17

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Finding Gene Families (again)Finding Gene Families (again)

    The database of protein sequence now includes the predicted ancestral sequences.

    formatdb

    blastclust

    clustalw

    clustalw

    codeml

  • 18

    Protein Sequences

    All-vs-AllComparison

    SequenceAlignment

    Phylogenetic Tree Construction

    Ancestral SequencePrediction

    DetermineEffect ofAncestral

    Sequences

    Gene Family AnalysisGene Family Analysis

    The cluster lists from each all-vs-all comparison are the basis of this analysis.

    formatdb

    blastclust

    clustalw

    clustalw

    codeml

  • 19

    Programming EnvironmentProgramming Environment

    All of the software developed for this project included:All of the software developed for this project included:PerlPerlCCBatch files executed from the DOS promptBatch files executed from the DOS prompt

    All of the development was done under Windows XP. The All of the development was done under Windows XP. The development environment included:development environment included:

    Active State PerlActive State PerlMinGW CMinGW CSource EditSource Edit

    All processing was performed locallyAll processing was performed locallyThe external programs utilized (formatdb, blastclust, The external programs utilized (formatdb, blastclust, clustalw, and codeml) were command line versions, clustalw, and codeml) were command line versions, compiled for Windows, and executed from the DOS compiled for Windows, and executed from the DOS prompt.prompt.

  • 20

    Sequences/Species Sequences/Species

    FASTA sequences from the nr protein database Sequences Rel Size

    All species as of 01/11/06 3203752

    Mus musculus and Rattus norvegicus – rodents 136403 110.35%

    Homo sapiens – human 123609 100.00%Mus musculus – mouse 106205 85.92%Arabidopsis thaliana - thale cress 52159 42.20%Bos taurus – cattle 36855 29.82%Rattus norvegicus – rat 30881 24.98%Pan troglodytes – chimp 22873 18.50%Escherichia coli 12225 9.89%Saccharomyces cerevisiae and Schizosaccharomyces pombe – yeast 15484 12.53%

    Saccharomyces cerevisiae – baker's yeast 9158 7.41%Zea mays – corn 3410 2.76%

  • 21

    Processing Time RequiredProcessing Time Required

    SpeciesDatabase

    (nr 1/11/06)

    FASTArecs formatdb blastclust clustalw codeml

    AncestralSeq blastclust

    rodents 136403 4 d 7375 4d 15h 1m

    human 123609 46 s 3d 1h 22m 3h 45m >3 d 5822 3d 10h 36m

    mouse 106205 41 s 2d 7h 55m 1h 27m >3d 4569 2d 14h 19m

    thale cress 52159 22 s 10h 6m

  • 22

    Initial ClusteringInitial Clustering

    SpeciesDatabase

    FASTArecs

    Clusters size ≥ 6

    Largestcluster

    LargestCluster

    processed

    Sequencesprocessed

    Avg. # Seq. /cluster

    rodents 136403 1815 4051 404 23314 13

    human 123609 1050 3210 406 28103 27

    mouse 106205 951 4049 402 15958 17

    thale cress 52159 183 38 38 1513 8

    rat 30881 112 195 195 1346 12chimp 22873 17 92 NA NA NAyeast 15484 34 34 34 319 9

    byeast 9158 33 34 34 313 9corn 3410 53 48 48 672 13

  • 23

    Second ClusteringSecond Clustering

    SpeciesDatabase

    AncestralSequences

    added

    Clusterssize ≥ 6

    Largestcluster

    Sequencesprocessed

    Avg. #Seq. /

    cluster

    rodents 7375 1815 4051 30805 17

    human 5822 1051 3210 33974 32

    mouse 4569 956 4049 20592 22

    thale cress 520 183 75 2040 11

    rat 784 110 403 2147 20

    yeast 101 34 60 420 12

    corn 157 53 52 829 16

  • 24

    Comparative AnalysisComparative Analysis

    DEBUG: Collecting clusters from "corn_db\clust.out"DEBUG: Collecting clusters from "corn_db2\aClust.out"------------------------------------------------------------Initial clustering (size 6+) included: 672 Sequences# maintained in reclustering: 672 # lost during reclustering: 0 ------------------------------------------------------------Reclustering (size 6+) included: 829 sequences# maintained during reclustering: 672 # ancestral added for reclustering: 157 # ancestral maintained during reclustering: 157 # other seq added during reclustering: 0

    672+157

    829

    For this relatively small sequence database (corn), the presence of the ancestral sequences did not change the sequences that were clustered together

  • 25

    Comparative AnalysisComparative Analysis

    DEBUG: Collecting clusters from "thale_db\clust.out"DEBUG: Collecting clusters from "thale_db2\aclust.out"------------------------------------------------------------Initial clustering (size 6+) included: 1513 Sequences# maintained in reclustering: 1513 # lost during reclustering: 0 ------------------------------------------------------------Reclustering (size 6+) included: 2040 sequences# maintained during reclustering: 1513 # ancestral added for reclustering: 520 # ancestral maintained during reclustering: 517 # other seq added during reclustering: 10

    Sequence gained during reclustering: cluster 00019_0001 id 15231449 Sequence gained during reclustering: cluster 00027_0001 id 15809938 Sequence gained during reclustering: cluster 00027_0001 id 16648811 Sequence gained during reclustering: cluster 00046_0001 id 2832540 Sequence gained during reclustering: cluster 00046_0001 id 2832572 Sequence gained during reclustering: cluster 00027_0001 id 30690085 Sequence gained during reclustering: cluster 00075_0001 id 32364494 Sequence gained during reclustering: cluster 00075_0001 id 32364496 Sequence gained during reclustering: cluster 00075_0001 id 32364523 Sequence gained during reclustering: cluster 00010_0018 id 7671458

    For this sequence database, the presence of the ancestral sequences caused 10 additional sequences to be pulled into gene families. The number of clusters size 6+ was unchanged.

  • 26

    Comparative AnalysisComparative Analysis

    DEBUG: Collecting clusters from "rat_db\clust.out"DEBUG: Collecting clusters from "rat_db2\aclust.out"------------------------------------------------------------Found composite cluster 00403_0001 in "rat_db2\aclust.out"

    Sequences from 00195_0001and 00037_0001 in "rat_db\clust.out"

    ------------------------------------------------------------Found composite cluster 00031_0001 in "rat_db2\aclust.out"

    Sequences from 00010_0003and 00006_0018 in "rat_db\clust.out"

    ------------------------------------------------------------Initial clustering (size 6+) included: 1346 Sequences# maintained in reclustering: 1346 # lost during reclustering: 0 ------------------------------------------------------------Reclustering (size 6+) included: 2147 sequences# maintained during reclustering: 1346 # ancestral added for reclustering: 784 # ancestral maintained during reclustering: 784 # other seq added during reclustering: 17

    Sequence gained during reclustering: cluster 00032_0001 id 27692470 Sequence gained during reclustering: cluster 00022_0001 id 27695749 . . .

    For this sequence database, the presence of the ancestral sequences caused 17 additional sequences to be pulled into gene families. In addition, 2 pairs of clusters were combined to form larger clusters.

  • 27

    2

    13

    4

    6

    8

    57

    10

    9

    2

    1

    3

    4

    5

    6

    Olfactory Receptor ProteinsInitial Clustering

    1

    1

    1

    313 – 320amino acids

    312 – 313amino acids

    309 – 313amino acids

  • 28

    Olfactory Receptor Proteins +Predicted Ancestral Sequences

    1

    6

    2

    5

    78

    4

    3 1

    2 4

    3

    2

    1

    3

    4

    5

    6

    2

    13

    4

    6

    8

    57

    10

    9

    (n – 2) unique ancestral sequences

    generated for each cluster

  • 291

    1

    1

    Olfactory Receptor ProteinsClustering with Ancestral Sequences

    1

    6

    2

    5

    7 8

    4

    31

    2

    4

    3

    2

    1

    34

    5

    6

    2

    13

    4

    6

    8

    57

    10

    9

    Total of31 sequences233 linkages

    (not all shown!!)

  • 30

    Comparative AnalysisComparative Analysis

    SpeciesDatabase

    Clusterssize ≥ 6

    ObservedSequences

    AncestralSequences

    added

    Clusterssize ≥ 6

    Observed+Ancestral

    Sequences

    NumberOf

    CompositeClusters

    AdditionalSequencesClustered

    1815 133

    59

    71

    10

    17

    0

    0

    1050

    951

    183

    112

    34

    53

    Sequencesprocessed

    rodents 7375 1815 11 30805

    human 5822 1051 7 33974

    mouse 4569 956 7 20592

    thale cress 520 183 0 2040

    rat 784 110 2 2147

    yeast 101 34 0 420

    corn 157 53 0 829

  • 31

    ConclusionsConclusions

    It appears that using ancestral sequences to It appears that using ancestral sequences to extend gene families is a viable approach.extend gene families is a viable approach.

    The work presented here is just a starting The work presented here is just a starting point, a demonstration of point, a demonstration of ““proof of proof of conceptconcept””..

    There is much more work that can be done There is much more work that can be done to determine how effective this approach is.to determine how effective this approach is.

  • 32

    Future WorkFuture Work

    Default parameters were used for Default parameters were used for blastclustblastclustresulting in fairly stringent similarity criteria, thus resulting in fairly stringent similarity criteria, thus all of the sequences clustered together are all of the sequences clustered together are annotated similarly in annotated similarly in GenbankGenbank ((egeg. Olfactory . Olfactory receptors, immunoglobulin chains, etc.)receptors, immunoglobulin chains, etc.)

    A true test of this approach would be to determine A true test of this approach would be to determine the criteria that clusters all of sequences perceived the criteria that clusters all of sequences perceived as homologous in the first pass, then cluster again as homologous in the first pass, then cluster again with ancestral sequences.with ancestral sequences.

  • 33

    Future WorkFuture Work

    For each of the steps described: alignment, For each of the steps described: alignment, tree building, and prediction of ancestral tree building, and prediction of ancestral sequences we have tried one algorithm.sequences we have tried one algorithm.

    How does the accuracy of each of these step How does the accuracy of each of these step effect the result?effect the result?Would another method/algorithm be better?Would another method/algorithm be better?egeg. Bayesian methods for ancestral sequences. Bayesian methods for ancestral sequences

    Parsimony or Likelihood methods for treeParsimony or Likelihood methods for treeconstructionconstruction

  • 34

    ReferencesReferences

    [Atls 90][Atls 90] Altschul S F, Gish W, Miller W, Myers E W, Lipman D J; Basic LocAltschul S F, Gish W, Miller W, Myers E W, Lipman D J; Basic Local al Alignment Search Tool; Alignment Search Tool; Journal of Molecular BiologyJournal of Molecular Biology 1990; 215:4031990; 215:403--410.410.

    [Atls 97][Atls 97] Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, MillerAltschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J; W, Lipman D J; Gapped BLAST and PSIGapped BLAST and PSI--BLAST: a new generation of protein database search BLAST: a new generation of protein database search programs; programs; Nucleic Acids ResNucleic Acids Res 1997; 25:33891997; 25:3389--3402. 3402.

    [Brid 06][Brid 06] BridghamBridgham J T, Carroll S M, Thornton J W; Evolution of HormoneJ T, Carroll S M, Thornton J W; Evolution of Hormone--Receptor Receptor Complexity by Molecular Evolution; Science 2006; 312:97Complexity by Molecular Evolution; Science 2006; 312:97--101.101.

    [Geer 02][Geer 02] Geer L Y, Geer L Y, DomrachevDomrachev M, Lipman D J, Bryant S H; CDART: Protein Homology M, Lipman D J, Bryant S H; CDART: Protein Homology by by Domain Architecture; Domain Architecture; Genome ResearchGenome Research 2002; 12(10):16192002; 12(10):1619--1623.1623.

    [Heni 92][Heni 92] HeinkoffHeinkoff S, S, HeinkoffHeinkoff J G; Amino Acid Substitution Matrices for Block Proteins; J G; Amino Acid Substitution Matrices for Block Proteins; Proc. Nat. Proc. Nat. Acad. Acad. SciSci. USA. USA 1992; 89(22):109151992; 89(22):10915--9.9.

    [Jean 98][Jean 98] JeanmouginJeanmougin F, Thompson J D, F, Thompson J D, GouyGouy M, Higgins D G, and Gibson, T J; Multiple M, Higgins D G, and Gibson, T J; Multiple Sequence Alignments with Clustal X; Trends Sequence Alignments with Clustal X; Trends BiochemBiochem SciSci 1998; 23:4031998; 23:403--5.5.

    [Jone 92][Jone 92] Jones D T, Taylor W R, Thornton J M; The Rapid Generation of MutJones D T, Taylor W R, Thornton J M; The Rapid Generation of Mutation Data Matrices ation Data Matrices from Protein Sequences; from Protein Sequences; CABIOSCABIOS[1][1] 1992; 8:2751992; 8:275--282.282.

    [Kosi 05][Kosi 05] KosiolKosiol C, Goldman N; Different Versions of the C, Goldman N; Different Versions of the DayhoffDayhoff Rate Matrix; Rate Matrix; Molecular Biology and Molecular Biology and EvolutionEvolution 2005; 22:1932005; 22:193--199.199.

    [Lewi 04][Lewi 04] Lewin B; Lewin B; Genes VIII; Genes VIII; Pearson Prentice Hall, NJ 2004.Pearson Prentice Hall, NJ 2004.

    [1][1] CABIOS is the former name of the Bioinformatic Journal, Oxford CABIOS is the former name of the Bioinformatic Journal, Oxford University Press.University Press.

  • 35

    ReferencesReferences

    [Li 01][Li 01] Li W, Jaroszewski L, Godzik A; Clustering of Highly Homologous SLi W, Jaroszewski L, Godzik A; Clustering of Highly Homologous Sequences to Reduce equences to Reduce the Size of Large Protein Databases; the Size of Large Protein Databases; BioinformaticsBioinformatics 2001; 17(3):2822001; 17(3):282--283.283.

    [Pevs 03][Pevs 03] Pevsner J; BioinformaticsPevsner J; Bioinformatics and Functional Genomicsand Functional Genomics; John Wiley & Sons, Inc., NJ 2003.; John Wiley & Sons, Inc., NJ 2003.[Remm 00][Remm 00] Remm M, Remm M, SonnhammerSonnhammer E; Classification of Transmembrane Protein Families in the E; Classification of Transmembrane Protein Families in the

    CaenorhabditisCaenorhabditis eleganselegans Genome and Identification of Human Orthologs; Genome and Identification of Human Orthologs; Genome ResearchGenome Research2000; 10(11):16792000; 10(11):1679--1689.1689.

    [Tatu 97][Tatu 97] TatusovTatusov R L, R L, KooninKoonin E V, Lipman D J; A Genomic Perspective on Protein Families; E V, Lipman D J; A Genomic Perspective on Protein Families; ScienceScience 1997; 278(5338):6311997; 278(5338):631--637.637.

    [Tigr 05][Tigr 05] TIGR; Domain Based Paralogous Protein Families; TIGR; Domain Based Paralogous Protein Families; www.tigr.orgwww.tigr.org ; Annotation Workshop ; Annotation Workshop July 13, 2005.July 13, 2005.

    [Thom 94][Thom 94] Thompson J D, Higgins D G, Gibson T J; Clustal W: Improving the Thompson J D, Higgins D G, Gibson T J; Clustal W: Improving the Sensitivity of Sensitivity of Progressive Multiple Sequence Alignment through Sequence WeightiProgressive Multiple Sequence Alignment through Sequence Weighting, Positionng, Position--Specific Specific Gap Penalties and Weight Matrix Choice; Gap Penalties and Weight Matrix Choice; Nucleic Acids ResearchNucleic Acids Research 1994; 22(22):46731994; 22(22):4673--4680.4680.

    [Yang 94][Yang 94] Yang Z, Goldman N, Friday A; Comparison of Models for NucleotideYang Z, Goldman N, Friday A; Comparison of Models for Nucleotide substitution used substitution used in Maximumin Maximum--Likelihood Phylogenetic Estimation; Likelihood Phylogenetic Estimation; Molecular Biology and EvolutionMolecular Biology and Evolution 1994; 1994; 11:31611:316--324.324.

    [Yang 95][Yang 95] Yang Z, Kumar S, Yang Z, Kumar S, NeiNei M; A New Method of Inference of Ancestral Nucleotide and M; A New Method of Inference of Ancestral Nucleotide and Amino Acid Sequences; Amino Acid Sequences; GeneticsGenetics 1995; 141:16411995; 141:1641--1650.1650.

    [Yang 97] [Yang 97] Yang Z; PAML: a Program Package for Phylogenetic Analysis by MaxYang Z; PAML: a Program Package for Phylogenetic Analysis by Maximum Likelihood; imum Likelihood; CABIOSCABIOS 1997; 12(7):5551997; 12(7):555--556.556.

  • 36

    Questions?Questions?

    Thank you for attending!Thank you for attending!

  • 37

    Input Software OutputNCBI’s nr database spFilter.pl speciesDB(FASTA)

    speciesDB(FASTA) dbFormatter.pl(formatdb) speciesDB(BLAST)

    speciesDB(BLAST) bClust.pl(blastclust) cluster listneighbor/hit-list

    cluster list countClust.pl summary of cluster sizes

    cluster list collectClust.pl cluster files (FASTA)

    cluster files (FASTA) buildClustalBat.pl bat file(clustalw)

    cluster files (FASTA) bat file (clustalw) alignment files (PHYLIP-C)tree files (PHYLIP-C)

    alignment files (PHYLIP-C) modifyAlignFiles.pl alignment files (PHYLIP-P)

    tree files (PHYLIP-C) modyifyTreeFiles.pl tree files (PHYLIP-P)

    alignment files (PHYLIP-P)tree files (PHYLIP-P) buildCtlFiles.pl

    control filesbat file(codeml)

    alignment files (PHYLIP-P)tree files (PHYLIP-P)control files

    bat file(codeml) ancestral sequences files

    ancestral sequence files extractAnSeq.pl ancestral sequences (FASTA)

    speciesDB(FASTA)ancestral sequences (FASTA) concatenate files together speciesDB + AnSeq(FASTA)

    speciesDB + AnSeq(FASTA) dbFormatter.pl(formatdb) speciesDB + AnSeq (BLAST)

    speciesDB + AnSeq (BLAST) bClust.pl(blastclust) + AnSeq cluster list+ AnSeq neighbor/hit-list

    cluster list+ AnSeq cluster list compareClust.pl

    summary of clusters combined and sequences maintained and added

    neighbor/hit-list xnbr.c neighbor/hit-list(text)

  • 38

    or 1 a

    or 6 3

    or 1 b

    or 1 c

    or 6 6or 6 1or 6 5

    or 6 4or 6 2

    or 10 1

    or 10 2or 10 3

    or 10 4

    or 10 5 or 10 6 or 10 7 or 10 8 or 10 9

    or 10 10

    Unrooted tree, constructed by Clustal W and displayed via Tree ViewColors show positioning of sequences from each initial cluster.

    Olfactory Receptor Proteins

    Extending Gene Families via Predicted Ancestral SequencesAcknowledgementsPersonal BackgroundExtending Gene Families via Predicted Ancestral SequencesGene FamiliesAncestral sequencesProject GoalProposed Work FlowProgramming EnvironmentSequences/Species Processing Time RequiredInitial ClusteringSecond ClusteringComparative AnalysisComparative AnalysisComparative AnalysisOlfactory Receptor Proteins�Initial ClusteringOlfactory Receptor Proteins +�Predicted Ancestral SequencesOlfactory Receptor Proteins�Clustering with Ancestral SequencesComparative AnalysisConclusionsFuture WorkFuture WorkReferencesReferencesQuestions?Olfactory Receptor Proteins�