
Extending Gene Families via Predicted Ancestral Sequences

Loretta Goldberg
Computational Biosciences
Arizona State University
April 28, 2006

This internship project is presented in partial fulfillment of the requirements of the Professional Science Master's in Computational Biosciences, Arizona State University.

Research conducted from September 2005 to April 2006

Acknowledgements

Thanks to my advisor, Dr. Michael Rosenberg, who proposed this internship project and has provided guidance throughout.

Thanks to my committee, Dr. Rosenberg, Dr. Renaut and Dr. Touchman, for the knowledge they have shared throughout my coursework and for their review of this final project.

Personal Background

B.A. Chemistry
  Russell Sage College, Troy NY

4 years as a Process Development Chemist in the Pharmaceutical Industry

M.S. Computer Sciences
  Arizona State University

14 years as a Software Engineer in the Telecommunications Industry

P.S.M. Computational Biosciences
  TBD

Extending Gene Families via Predicted Ancestral Sequences

Gene families
  What are they?
  How do we determine them?

Ancestral sequences
  What are they?
  How do we determine them?

How does this project propose to utilize ancestral sequences in the context of gene families?

Gene Families

A gene family is a grouping of genes based on the similarity of the products, or proteins, that they produce.

A gene family is a set of related genes occupying various loci in the DNA, almost certainly formed by duplication of ancestral genes, and having a recognizably similar sequence.

Ancestral sequences

A phylogenetic tree is a tool used to show the evolutionary relationship between biological objects.

The tips represent the observed genes (sequences) in living organisms.

The nodes, or branching points, represent the extinct ancestors.

[Figure: phylogenetic tree with a tip labeled "Observed sequence" and an internal node labeled "Ancestral sequence"]

Project Goal

To determine whether including ancestral sequences in an all-vs-all comparison of protein sequences extends the number of homologous sequences that can be extracted from the comparison.

Proposed Work Flow

Identifying protein sequences of interest

All-vs-all sequence comparison

Alignment of sequences

Generation of phylogenetic trees

Prediction of ancestral sequences

All-vs-all sequence comparison (including ancestral sequences)

Proposed Work Flow

[Figure: work flow diagram with the stages Protein Sequences, All-vs-All Comparison, Sequence Alignment, Phylogenetic Tree Construction, Ancestral Sequence Prediction, Determine Effect of Ancestral Sequences]

Obtaining Protein Sequences

Sequences for the desired species are extracted from NCBI's FASTA-formatted non-redundant protein database (nr).

Work flow tools so far: formatdb
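A minimal sketch of the extraction step, assuming that each nr definition line carries the species name in square brackets (for example "[Zea mays]"). The function names and the sample records below are hypothetical and made up for illustration; the project's own filter was a Perl script whose logic is not shown on these slides.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (definition line, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta_by_species(lines: Iterable[str], species: str) -> List[Tuple[str, str]]:
    """Keep records whose definition line contains '[species]' (assumed convention)."""
    tag = "[" + species + "]"
    return [(h, s) for h, s in read_fasta(lines) if tag in h]


# Made-up records; the identifiers are not real database entries.
SAMPLE = """>gi|1|ref|X| hypothetical protein A [Zea mays]
MKT
LLV
>gi|2|ref|Y| hypothetical protein B [Homo sapiens]
MAA
""".splitlines()

corn_records = filter_fasta_by_species(SAMPLE, "Zea mays")
```

Matching on the bracketed name keeps the filter independent of record identifiers, at the cost of missing records whose species is written differently.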

Finding Gene Families

blastclust performs the all-vs-all sequence comparison, returning a cluster list (each cluster represents a gene family or group of homologous sequences).

Work flow tools so far: formatdb, blastclust
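blastclust groups sequences by single-linkage clustering: two sequences land in the same cluster whenever a chain of pairwise hits connects them. The sketch below illustrates only that grouping rule with a union-find structure; the sequence names and hit pairs are made up, and the scoring and thresholds that decide what counts as a hit are left out.

```python
from collections import defaultdict


def single_linkage(ids, hit_pairs):
    """Group ids into clusters connected by any chain of pairwise hits."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hit_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical hits: s1-s2 and s2-s3 chain together, s4-s5 form a second cluster.
clusters = single_linkage(
    ["s1", "s2", "s3", "s4", "s5"],
    [("s1", "s2"), ("s2", "s3"), ("s4", "s5")],
)
```

Note that s1 and s3 share a cluster without a direct hit between them, which is the property that lets an added intermediate sequence join two groups.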

Sequence Alignment

clustalw performs a progressive multiple sequence alignment for each cluster of sequences. clustalw is the command line version of the web-based CLUSTAL W.

Work flow tools so far: formatdb, blastclust, clustalw
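Because one alignment is run per cluster, the clustalw calls lend themselves to a generated batch file. The sketch below writes one command line per cluster file. -INFILE, -ALIGN, -OUTPUT and -OUTFILE are ClustalW command line options, but the options actually used in this project are not stated on the slides, and the file names are hypothetical.

```python
def clustalw_align_command(fasta_path):
    """Build one clustalw command that aligns a cluster file into PHYLIP format."""
    stem = fasta_path.rsplit(".", 1)[0]
    return "clustalw -INFILE={} -ALIGN -OUTPUT=PHYLIP -OUTFILE={}.phy".format(
        fasta_path, stem
    )


def build_batch(fasta_paths):
    """Return the text of a DOS batch file with one alignment command per cluster."""
    return "\r\n".join(clustalw_align_command(p) for p in fasta_paths) + "\r\n"


# Hypothetical cluster file names.
batch_text = build_batch(["clust_0001.fasta", "clust_0002.fasta"])
```

Writing the commands to a file rather than launching them directly mirrors a batch-driven setup and makes a long run easy to inspect or restart.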

Phylogenetic Tree Construction

clustalw uses the sequence alignment to build a phylogenetic tree via neighbor joining.

Work flow tools so far: formatdb, blastclust, clustalw (alignment), clustalw (tree)
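Neighbor joining repeatedly joins the pair of taxa (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), where d is the pairwise distance matrix and n is the number of taxa still unjoined. The sketch below shows only this pair selection for a single step, on a small made-up distance matrix. It is an illustration of the criterion, not of how clustalw implements it.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return (Q, name_i, name_j) for the pair that neighbor joining would join next."""
    n = len(names)
    row_sum = [sum(row) for row in dist]
    best = None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
        if best is None or q < best[0]:
            best = (q, names[i], names[j])
    return best


# Made-up symmetric distances between five taxa.
NAMES = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

first_join = nj_pick_pair(NAMES, DIST)
```

Here d and e are the closest pair by raw distance (3), yet a and b are joined first because Q also accounts for how far each taxon is from all the others.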

Ancestral Sequences Prediction

codeml takes both the sequence alignment and the phylogenetic tree as input. The maximum likelihood method is used to reconstruct the ancestral sequences.

Work flow tools so far: formatdb, blastclust, clustalw (alignment), clustalw (tree), codeml
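codeml reads its settings from a control file, so one control file is needed per cluster. The sketch below generates such a file for amino acid data with ancestral reconstruction switched on. The variable names (seqfile, treefile, outfile, seqtype, model, aaRatefile, RateAncestor) are codeml control file variables; the values chosen here, in particular the rate matrix file jones.dat, and all file names are assumptions, since the settings used in this project are not given on the slides. A real run may need further variables.

```python
def codeml_control(seqfile, treefile, outfile, aa_rate_file="jones.dat"):
    """Return the text of a codeml control file (values are illustrative assumptions)."""
    settings = [
        ("seqfile", seqfile),          # alignment in PHYLIP format
        ("treefile", treefile),        # tree in Newick format
        ("outfile", outfile),          # main result file
        ("seqtype", 2),                # 2 = amino acid sequences
        ("model", 2),                  # 2 = empirical model read from aaRatefile
        ("aaRatefile", aa_rate_file),  # assumed substitution rate matrix
        ("RateAncestor", 1),           # 1 = also reconstruct ancestral sequences
    ]
    return "".join("{:>12} = {}\n".format(key, value) for key, value in settings)


# Hypothetical file names for one cluster.
ctl_text = codeml_control("clust_0001.phy", "clust_0001.ph", "clust_0001.out")
```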

Extending Sequence Database

The database of protein sequences now includes the predicted ancestral sequences.

Work flow tools so far: formatdb, blastclust, clustalw (alignment), clustalw (tree), codeml

Finding Gene Families (again)

The database of protein sequences now includes the predicted ancestral sequences.

Work flow tools so far: formatdb, blastclust, clustalw (alignment), clustalw (tree), codeml

Gene Family Analysis

The cluster lists from each all-vs-all comparison are the basis of this analysis.

Work flow tools so far: formatdb, blastclust, clustalw (alignment), clustalw (tree), codeml
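Comparing two cluster lists comes down to set arithmetic on the sequence identifiers they contain. The sketch below counts the sequences maintained, lost, and gained between an initial clustering and a reclustering. It assumes that ancestral sequences can be recognized by their identifier; the "anc_" prefix, the function name and the sample data are hypothetical.

```python
def compare_clusterings(initial, reclustered, is_ancestral, min_size=6):
    """Count sequences maintained, lost and gained between two cluster lists."""
    before = {s for c in initial if len(c) >= min_size for s in c}
    after = {s for c in reclustered if len(c) >= min_size for s in c}
    ancestral = {s for s in after if is_ancestral(s)}
    return {
        "maintained": len(before & after),
        "lost": len(before - after),
        "ancestral": len(ancestral),
        "other_added": len(after - before - ancestral),
    }


# Made-up clusters; min_size is lowered to 2 so the toy data passes the size filter.
initial = [["a", "b", "c"], ["d", "e"]]
reclustered = [["a", "b", "c", "anc_1", "f"], ["d", "e", "anc_2"]]

report = compare_clusterings(
    initial, reclustered, lambda s: s.startswith("anc_"), min_size=2
)
```

In this toy case "f" is the interesting outcome: an observed sequence that was outside every cluster in the first pass and is pulled in once ancestral sequences are present.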

Programming Environment

The software developed for this project was written in:
  Perl
  C
  Batch files executed from the DOS prompt

All of the development was done under Windows XP. The development environment included:
  ActiveState Perl
  MinGW C
  Source Edit

All processing was performed locally.

The external programs utilized (formatdb, blastclust, clustalw, and codeml) were command line versions, compiled for Windows, and executed from the DOS prompt.

Sequences/Species

FASTA sequences from the nr protein database                     Sequences   Rel Size (human = 100%)
All species as of 01/11/06                                         3203752
Mus musculus and Rattus norvegicus (rodents)                        136403   110.35%
Homo sapiens (human)                                                123609   100.00%
Mus musculus (mouse)                                                106205    85.92%
Arabidopsis thaliana (thale cress)                                   52159    42.20%
Bos taurus (cattle)                                                  36855    29.82%
Rattus norvegicus (rat)                                              30881    24.98%
Pan troglodytes (chimp)                                              22873    18.50%
Escherichia coli                                                     12225     9.89%
Saccharomyces cerevisiae and Schizosaccharomyces pombe (yeast)       15484    12.53%
Saccharomyces cerevisiae (baker's yeast)                              9158     7.41%
Zea mays (corn)                                                       3410     2.76%
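The Rel Size column appears to express each species database as a percentage of the human sequence count. A short sketch that reproduces three of the listed values under that reading (the variable names are illustrative):

```python
HUMAN = 123609  # Homo sapiens sequence count from the table

counts = {"rodents": 136403, "mouse": 106205, "corn": 3410}

# Size of each database relative to human, in percent, rounded to two decimals.
rel_size = {name: round(100.0 * n / HUMAN, 2) for name, n in counts.items()}
```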

Processing Time Required

Species database   FASTA     formatdb   blastclust   clustalw   codeml   Ancestral   blastclust
(nr 1/11/06)       recs                                                  seq.        (second pass)
rodents            136403                                                7375        4d 15h 1m
human              123609    46 s       3d 1h 22m    3h 45m     >3 d     5822        3d 10h 36m
mouse              106205    41 s       2d 7h 55m    1h 27m     >3 d     4569        2d 14h 19m
thale cress        52159     22 s       10h 6m

The rodents row also lists a time of 4 d between the FASTA record count and the ancestral sequence count; its column is not recoverable from the slide text.

Initial Clustering

Species       FASTA     Clusters   Largest   Largest cluster   Sequences   Avg. # seq.
database      recs      size ≥ 6   cluster   processed         processed   / cluster
rodents       136403    1815       4051      404               23314       13
human         123609    1050       3210      406               28103       27
mouse         106205    951        4049      402               15958       17
thale cress   52159     183        38        38                1513        8
rat           30881     112        195       195               1346        12
chimp         22873     17         92        NA                NA          NA
yeast         15484     34         34        34                319         9
byeast        9158      33         34        34                313         9
corn          3410      53         48        48                672         13
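The columns of this table can be derived directly from a cluster list. blastclust writes one cluster per line, with the sequence identifiers separated by spaces. The sketch below computes the number of clusters of size 6 or more, the largest cluster, the sequences they contain, and the rounded average. The sample lines are made up, and the function name is hypothetical. As a check against the table, the rodents row gives 23314 / 1815, about 12.8, which is shown as 13.

```python
def summarize_clusters(cluster_lines, min_size=6):
    """Summarize a cluster list in which each line holds one cluster's identifiers."""
    clusters = [line.split() for line in cluster_lines if line.strip()]
    kept = [c for c in clusters if len(c) >= min_size]
    n_seq = sum(len(c) for c in kept)
    return {
        "clusters": len(kept),
        "largest": max((len(c) for c in kept), default=0),
        "sequences": n_seq,
        "average": round(n_seq / len(kept)) if kept else 0,
    }


# Made-up cluster list: clusters of 8, 6 and 2 sequences.
SAMPLE_LINES = [
    "1 2 3 4 5 6 7 8",
    "9 10 11 12 13 14",
    "15 16",
]

summary = summarize_clusters(SAMPLE_LINES)
```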

Second Clustering

Species       Ancestral         Clusters   Largest   Sequences   Avg. # seq.
database      sequences added   size ≥ 6   cluster   processed   / cluster
rodents       7375              1815       4051      30805       17
human         5822              1051       3210      33974       32
mouse         4569              956        4049      20592       22
thale cress   520               183        75        2040        11
rat           784               110        403       2147        20
yeast         101               34         60        420         12
corn          157               53         52        829         16

Comparative Analysis

DEBUG: Collecting clusters from "corn_db\clust.out"
DEBUG: Collecting clusters from "corn_db2\aClust.out"
------------------------------------------------------------
Initial clustering (size 6+) included: 672 Sequences
# maintained in reclustering: 672
# lost during reclustering: 0
------------------------------------------------------------
Reclustering (size 6+) included: 829 sequences
# maintained during reclustering: 672
# ancestral added for reclustering: 157
# ancestral maintained during reclustering: 157
# other seq added during reclustering: 0

672 + 157 = 829

For this relatively small sequence database (corn), the presence of the ancestral sequences did not change the sequences that were clustered together.

Comparative Analysis

DEBUG: Collecting clusters from "thale_db\clust.out"
DEBUG: Collecting clusters from "thale_db2\aclust.out"
------------------------------------------------------------
Initial clustering (size 6+) included: 1513 Sequences
# maintained in reclustering: 1513
# lost during reclustering: 0
------------------------------------------------------------
Reclustering (size 6+) included: 2040 sequences
# maintained during reclustering: 1513
# ancestral added for reclustering: 520
# ancestral maintained during reclustering: 517
# other seq added during reclustering: 10

Sequence gained during reclustering: cluster 00019_0001 id 15231449
Sequence gained during reclustering: cluster 00027_0001 id 15809938
Sequence gained during reclustering: cluster 00027_0001 id 16648811
Sequence gained during reclustering: cluster 00046_0001 id 2832540
Sequence gained during reclustering: cluster 00046_0001 id 2832572
Sequence gained during reclustering: cluster 00027_0001 id 30690085
Sequence gained during reclustering: cluster 00075_0001 id 32364494
Sequence gained during reclustering: cluster 00075_0001 id 32364496
Sequence gained during reclustering: cluster 00075_0001 id 32364523
Sequence gained during reclustering: cluster 00010_0018 id 7671458

For this sequence database (thale cress), the presence of the ancestral sequences caused 10 additional sequences to be pulled into gene families. The number of clusters of size 6+ was unchanged.

Comparative Analysis

DEBUG: Collecting clusters from "rat_db\clust.out"
DEBUG: Collecting clusters from "rat_db2\aclust.out"
------------------------------------------------------------
Found composite cluster 00403_0001 in "rat_db2\aclust.out"
Sequences from 00195_0001 and 00037_0001 in "rat_db\clust.out"
------------------------------------------------------------
Found composite cluster 00031_0001 in "rat_db2\aclust.out"
Sequences from 00010_0003 and 00006_0018 in "rat_db\clust.out"
------------------------------------------------------------
Initial clustering (size 6+) included: 1346 Sequences
# maintained in reclustering: 1346
# lost during reclustering: 0
------------------------------------------------------------
Reclustering (size 6+) included: 2147 sequences
# maintained during reclustering: 1346
# ancestral added for reclustering: 784
# ancestral maintained during reclustering: 784
# other seq added during reclustering: 17

Sequence gained during reclustering: cluster 00032_0001 id 27692470
Sequence gained during reclustering: cluster 00022_0001 id 27695749
. . .

For this sequence database (rat), the presence of the ancestral sequences caused 17 additional sequences to be pulled into gene families. In addition, 2 pairs of clusters were combined to form larger clusters.
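A composite cluster can be detected by tracing every observed sequence in a reclustered family back to the initial cluster it came from: if more than one initial cluster contributes, the new cluster is a composite. The sketch below shows this idea only; the cluster labels and sequence names are made up, and the function name is hypothetical.

```python
def find_composites(initial, reclustered):
    """Map each composite cluster label to the initial clusters it merges."""
    # Look up the initial cluster of every observed sequence.
    origin = {seq: label for label, members in initial.items() for seq in members}
    composites = {}
    for label, members in reclustered.items():
        # Ancestral or newly gained sequences have no origin and are skipped.
        sources = sorted({origin[s] for s in members if s in origin})
        if len(sources) > 1:
            composites[label] = sources
    return composites


# Made-up example: cluster X in the second pass merges initial clusters A and B.
initial = {"A": ["s1", "s2"], "B": ["s3", "s4"], "C": ["s5"]}
reclustered = {"X": ["s1", "s2", "s3", "s4", "anc1"], "Y": ["s5", "anc2"]}

composites = find_composites(initial, reclustered)
```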

Olfactory Receptor Proteins: Initial Clustering

[Figure: numbered olfactory receptor protein sequences grouped into their initial clusters; sequences are 313 to 320 amino acids long]

Olfactory Receptor Proteins + Predicted Ancestral Sequences

[Figure: the initial clusters with their predicted ancestral sequences added; (n - 2) unique ancestral sequences are generated for each cluster]

Olfactory Receptor Proteins: Clustering with Ancestral Sequences

[Figure: observed and ancestral sequences joined into a single cluster; total of 31 sequences and 233 linkages (not all shown)]

Comparative Analysis

              Observed     Ancestral   Observed + Ancestral Sequences       Additional
              Sequences    Sequences   ----------------------------------   Sequences
Species       Clusters     added       Clusters   Composite   Sequences     Clustered
database      size ≥ 6                 size ≥ 6   clusters    processed
rodents       1815         7375        1815       11          30805         133
human         1050         5822        1051       7           33974         59
mouse         951          4569        956        7           20592         71
thale cress   183          520         183        0           2040          10
rat           112          784         110        2           2147          17
yeast         34           101         34         0           420           0
corn          53           157         53         0           829           0

Conclusions

It appears that using ancestral sequences to extend gene families is a viable approach.

The work presented here is just a starting point, a demonstration of "proof of concept".

There is much more work that can be done to determine how effective this approach is.

Future Work

Default parameters were used for blastclust, resulting in fairly stringent similarity criteria; thus all of the sequences clustered together are annotated similarly in GenBank (e.g. olfactory receptors, immunoglobulin chains, etc.).

A true test of this approach would be to determine the criteria that cluster all of the sequences perceived as homologous in the first pass, then cluster again with ancestral sequences.

Future Work

For each of the steps described (alignment, tree building, and prediction of ancestral sequences) we have tried one algorithm.

How does the accuracy of each of these steps affect the result?
Would another method/algorithm be better?
  e.g. Bayesian methods for ancestral sequences
  Parsimony or likelihood methods for tree construction


Questions?

Thank you for attending!

Input                                                              Software                     Output
NCBI's nr database                                                 spFilter.pl                  speciesDB (FASTA)
speciesDB (FASTA)                                                  dbFormatter.pl (formatdb)    speciesDB (BLAST)
speciesDB (BLAST)                                                  bClust.pl (blastclust)       cluster list; neighbor/hit-list
cluster list                                                       countClust.pl                summary of cluster sizes
cluster list                                                       collectClust.pl              cluster files (FASTA)
cluster files (FASTA)                                              buildClustalBat.pl           bat file (clustalw)
cluster files (FASTA)                                              bat file (clustalw)          alignment files (PHYLIP-C); tree files (PHYLIP-C)
alignment files (PHYLIP-C)                                         modifyAlignFiles.pl          alignment files (PHYLIP-P)
tree files (PHYLIP-C)                                              modifyTreeFiles.pl           tree files (PHYLIP-P)
alignment files (PHYLIP-P); tree files (PHYLIP-P)                  buildCtlFiles.pl             control files; bat file (codeml)
alignment files (PHYLIP-P); tree files (PHYLIP-P); control files   bat file (codeml)            ancestral sequence files
ancestral sequence files                                           extractAnSeq.pl              ancestral sequences (FASTA)
speciesDB (FASTA); ancestral sequences (FASTA)                     concatenate files together   speciesDB + AnSeq (FASTA)
speciesDB + AnSeq (FASTA)                                          dbFormatter.pl (formatdb)    speciesDB + AnSeq (BLAST)
speciesDB + AnSeq (BLAST)                                          bClust.pl (blastclust)       + AnSeq cluster list; + AnSeq neighbor/hit-list
cluster list; + AnSeq cluster list                                 compareClust.pl              summary of clusters combined and sequences maintained and added
neighbor/hit-list                                                  xnbr.c                       neighbor/hit-list (text)

Olfactory Receptor Proteins

[Figure: unrooted tree, constructed by Clustal W and displayed via Tree View. Colors show the positioning of sequences from each initial cluster. Tip labels: or 1 a to c, or 6 1 to 6, or 10 1 to 10]
