1
Exploring Genome Rearrangements using Virtual Hybridization Mahdi Belcaid 1 , Anne Bergeron 1 , Annie Chateau 1 , Cedric Chauve 1 , Yannick Gingras 1 , Guylaine Poisson 2 (1) CGL, Universit´ e du Qu´ ebec ` a Montr´ eal, Canada (2) Information and Computer Sciences, University of Hawaii at M¯ anoa I NTRODUCTION Visual hybridization: Dobzhansky and Sturtevant [2]. “Genes” are sections of chromosomes identified by a combination of numbers and letters. Relative arrangements of chromosome 3 in two strains, Standard and Santa Cruz, of D. pseudoob- scura 68 C 68 D 69 70 71 72 73 74 75 76 A 76 B 77 78 79 A 79 B 80 68 C 79 B 76 A 75 74 73 72 71 70 69 68 D 79 A 78 77 76 B 80 Top: Loop-like configurations that occur when two homologuous rearranged chromosomes are paired. Bottom: fragment of the dataset constructed in 1938 by Dobzhansky and Sturtevant to compare whole genomes. Other techniques: / Chromosome painting, hybridization with probes: long and onerous; / Gene detection on well-annotated genomes: problems of orthologous gene identification; / BLAST on raw sequences cut into “conserved synteny” blocks: lot of computational ressources. Technical and financial difficulties of reproducing inde- pendently the datasets, and of including new sequenced genomes in existing dataset, or even “revised” genomes. Our technique: Virtual hybridization based on sets of small probes (150 bp), whose presence(s), absence, or- der and orientation can be quickly and accurately deter- mined in a given genome. A set of probes is useful for a group of genomes if it captures all significant rearrange- ments within this group. The chloroplast genome of Huperzia Lucidula ([8]) First data set: Chloro- plast genomes are ranging from 110 kbp to 150 kbp and code typically for about 120 proteins and RNAs. The gene content is rather well conserved across species, making chloroplast genomes a very good model to study rearrangements, duplications and losses. We found a set of 160 probes that captures most rear- rangements discussed in [1, 8]. Computing the order of the probes for all the 50 choloroplast genomes available in February 2006 took 10 minutes on a P4 3GHz. Fur- thermore, this set of probes was used to confirm the re- ported extensive transfer of chloroplast DNA to nuclear DNA [4]. V IRTUAL H YBRIDIZATION Approximate string matching is defined as identifying, in a text, substrings that are similar to a given string p. In biological applications, the text is typically a genomic se- quence, and similarity is defined by scoring possible alig- ments between s and p. Numerous algorithms and scor- ing schemes are available to identify approximate occur- rences of short sequences in genomic sequences, the best known being the BLAST heuristic and variations of the Smith-Waterman algorithm. In the following, probes refer to sequences between 60 and 250 bp long, and hybridization refers to the detection of occurrences of these probes in a genomic sequence. In the current study, we detect an occurrence of a probe p in a genomic sequence if there exist an aligment between p and a substring s of the sequence that has 80 % identity over 80 % of the length of p. P ROBE SELECTION Given a set P of probes, virtual hybridization can be used to compare the presence, order, and orienta- tion of the probes of P in several genomes. A good set of probes should capture significant rearrangements between genomes, such as inversions, transpositions, translocations, duplications and losses. Our first goal was to obtain a good set of probes for chloroplast genomes. Given the relatively small size of these genomes, most of the initial work was done by hand. First step: we identified a set of candidate probes us- ing global alignments of the non-duplicated regions of the following 8 references genomes: Adiantum capillus Arabidopsis thaliana Calycanthus floridus Chaetosphaeridium globosum Huperzia lucidula Pinus thunbergii Psilotum nudum Triticum aestivum The global alignments were obtained with MultiPip- Maker [7] whose output was analyzed with a graphic tool that allows the selection of a subsequence s of the first genome in the list, and the computation of the virtual hy- brydization score of s on the remaining genomes. Candidate probes were selected with the following crite- ria: 1. Good scores on at least two of the reference genomes. 2. A well defined separation between high and low scores. This initial set of candidate probes was then checked for adequate coverage of annotated genes Second step: we eliminated candidates. We used the con- tainment clustering algorithm icaAss [5] to detect total or partial containment. Members of each cluster were hy- bridized on the eight reference genomes, and the most specific probe was selected. The resulting set of probes has currently 160 members, ranging from 65 bp to 288 bp, with average length 144 bp. The following table gives the number of occurrences of probes in each of the 8 ref- erence genomes. Note that a probe can have more than one occurrence, thus the total number of occurrences can be greater than 160. Genome Single hits Double hits Triple hits Total Arabidopsis thaliana 110 34 0 178 Calycanthus floridus 115 31 0 176 Pinus thunbergii 102 6 0 114 Triticum aestivum 96 29 2 160 Adiantum capillus 40 14 0 68 Psilotum nudum 74 19 0 112 Huperzia lucidula 87 16 0 119 Chaetosphaeridium globosum 52 13 0 78 C ONSERVED MAX - GAP CLUSTERS Chloroplast genome evolution is punctuated with gene losses and duplications. Understanding these events re- quires computational tools that can reveal both the con- served and the variable parts of the different genomes. We used the DomainTeam algorithm [6] to identify max- gap clusters of probes. With 11 genomes, we found 588 clusters present in at least two genomes. A striking exam- ple is depicted below. It shows the SSC region flanked by parts of the inverted repeat regions, with rearrangements involving the border between regions. P HYLOGENY OF 18 CHLOROPLASTS We studied the phylogeny of 18 chloroplasts using our probes. First, chloroplasts genomes were represented as sequences of integers, based on the result of the virtual hybridization process. Next, these sequences were used to define binary characters matrices based on three mod- els of gene order analysis (gene content, conserved ad- jacencies, common intervals), where we used successfull hybridizations instead of genes. Finally, these matrices were analysed with PAUP, using the Neighbor-Joining algorithm, the Hamming distance and 100 replicates of bootstrap. The reference tree is described in [3]. (a) Reference tree (b) Gene content (c) Conserved adjacencies (d) Common intervals References [1] L. Cui, J. Tang, B.M.E. Moret, and C. dePamphilis. Inferring ancestral chloroplast genomes with duplications. In preparation. [2]Th. Dobzhansky and A, T. Sturtevant. Inversions in the Chro- mosomes of Drosophila pseudoobscura. Genetics, 23:18, 1938. [3]W. Martin, O. Deusch, N. Stawski, N. Grunheit, V. Goremykin. Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends Plant Sci. 9(10):477–83, 2004. [4]M. Matsuoa, Y. Itob, R. Yamauchib and J. Obokataa. The Rice Nuclear Genome Continuously Integrates, Shuffles, and Elim- inates the Chloroplast Genome to Cause ChloroplastNuclear DNA Flux. The Plant Cell, 17:665–675, 2005. [5] J.D. Parsons Improved Tools for DNA Comparison and Cluster- ing. Comput. Applic. Biosci., 11:603–613, 1995. [6] S. Pasek, A. Bergeron, J.-L. Risler, A. Louis, E. Ollivier and M. Raffinot. Identification of genomic features using microsynte- nies of domains: domain teams. Genome Research, 15(6):867- 74, 2005. [7] S. Schwartz, et al.. MultiPipMaker and supporting tools: align- ments and analysis of multiple genomic DNA sequences, Nucl. Acids Res., 31(13):3518–3524, 2003. [8]P.G. Wolf, et al.. The first complete chloroplast genome sequence of a lycophyte, Huperzia lucidula (Lycopodiaceae). Gene., 350(2):117-28, 2005. Address for correspondence: D´ epartment d’Informatique, UQ ` AM, CP 8888, succ. centre-ville, Montr´ eal (QC), H3C 3P8 Photographs are under GFDL, plants drawing are from the public domain and chromosome drawings are (C) 1938 Genetics

Exploring Genome Rearrangements using Virtual Hybridizationcchauve/Publications/JOBIM06.pdf · Exploring Genome Rearrangements using Virtual Hybridization Mahdi Belcaid1, Anne Bergeron1,

Embed Size (px)

Citation preview

Exploring Genome Rearrangements using Virtual HybridizationMahdi Belcaid1, Anne Bergeron1, Annie Chateau1, Cedric Chauve1, Yannick Gingras1, Guylaine Poisson2

(1) CGL, Universite du Quebeca Montreal, Canada (2) Information and Computer Sciences, University of Hawaii at Manoa

INTRODUCTION

Visual hybridization: Dobzhansky and Sturtevant [2].“Genes” are sections of chromosomes identified by acombination of numbers and letters.

Relative arrangements of chromosome 3 in twostrains, Standard and Santa Cruz, ofD. pseudoob-scura

68C 68D 69 70 71 72 73 74 75 76A 76B 77 78 79A 79B 8068C 79B 76A 75 74 73 72 71 70 69 68D 79A 78 77 76B 80

Top: Loop-like configurations that occur when two homologuous rearrangedchromosomes are paired.Bottom:fragment of the dataset constructed in 1938by Dobzhansky and Sturtevant to compare whole genomes.

Other techniques:¤ Chromosome painting, hybridization with probes: longand onerous;¤ Gene detection on well-annotated genomes: problemsof orthologous gene identification;¤ BLAST on raw sequences cut into “conserved synteny”blocks: lot of computational ressources.Technical and financial difficulties of reproducing inde-pendently the datasets, and of including new sequencedgenomes in existing dataset, or even “revised” genomes.

Our technique: Virtual hybridizationbased on sets ofsmall probes (≈ 150 bp), whose presence(s), absence, or-der and orientation can be quickly and accurately deter-mined in a given genome. A set of probes isusefulfor agroup of genomes if it captures all significant rearrange-ments within this group.

The chloroplast genomeof Huperzia Lucidula([8])

First data set: Chloro-plast genomes are rangingfrom 110 kbp to 150 kbpand code typically forabout 120 proteins andRNAs. The gene contentis rather well conservedacross species, makingchloroplast genomesa very good model tostudy rearrangements,duplications and losses.

We found a set of 160 probes that captures most rear-rangements discussed in [1, 8]. Computing the order ofthe probes for all the 50 choloroplast genomes availablein February 2006 took≈10 minutes on a P4 3GHz. Fur-thermore, this set of probes was used to confirm the re-ported extensive transfer of chloroplast DNA to nuclearDNA [4].

V IRTUAL HYBRIDIZATION

Approximate string matching is defined as identifying, ina text, substrings that aresimilar to a given stringp. Inbiological applications, the text is typically a genomic se-quence, and similarity is defined by scoring possible alig-ments betweens andp. Numerous algorithms and scor-ing schemes are available to identify approximate occur-rences of short sequences in genomic sequences, the bestknown being the BLAST heuristic and variations of theSmith-Waterman algorithm.In the following, probesrefer to sequences between 60and 250 bp long, andhybridizationrefers to the detectionof occurrences of these probes in a genomic sequence. Inthe current study, we detect an occurrence of a probep ina genomic sequence if there exist an aligment betweenpand a substrings of the sequence that has 80 % identityover 80 % of the length ofp.

PROBE SELECTION

Given a setP of probes, virtual hybridization canbe used to compare the presence, order, and orienta-tion of the probes ofP in several genomes. Agoodset of probes should capture significant rearrangementsbetween genomes, such as inversions, transpositions,translocations, duplications and losses.Our first goal was to obtain a good set of probes forchloroplast genomes. Given the relatively small size ofthese genomes, most of the initial work was done by hand.

First step: we identified a set ofcandidate probesus-ing global alignments of the non-duplicated regions of thefollowing 8 references genomes:

Adiantum capillus Arabidopsis thaliana Calycanthus floridus Chaetosphaeridium globosum

Huperzia lucidula Pinus thunbergii Psilotum nudum Triticum aestivum

The global alignments were obtained with MultiPip-Maker [7] whose output was analyzed with a graphic toolthat allows the selection of a subsequences of the firstgenome in the list, and the computation of the virtual hy-brydization score ofs on the remaining genomes.

Candidate probes were selected with the following crite-ria:

1.Good scores on at least two of the reference genomes.

2.A well defined separation between high and low scores.

This initial set of candidate probes was then checked foradequate coverage of annotated genes

Second step: we eliminated candidates. We used the con-tainment clustering algorithm icaAss [5] to detect total orpartial containment. Members of each cluster were hy-bridized on the eight reference genomes, and the mostspecific probe was selected. The resulting set of probeshas currently 160 members, ranging from 65 bp to 288bp, with average length 144 bp. The following table givesthe number of occurrences of probes in each of the 8 ref-erence genomes. Note that a probe can have more thanone occurrence, thus the total number of occurrences canbe greater than 160.

Genome Single hits Double hits Triple hits TotalArabidopsis thaliana 110 34 0 178Calycanthus floridus 115 31 0 176Pinus thunbergii 102 6 0 114Triticum aestivum 96 29 2 160Adiantum capillus 40 14 0 68Psilotum nudum 74 19 0 112Huperzia lucidula 87 16 0 119Chaetosphaeridium globosum 52 13 0 78

CONSERVED MAX-GAP CLUSTERS

Chloroplast genome evolution is punctuated with genelosses and duplications. Understanding these events re-quires computational tools that can reveal both the con-served and the variable parts of the different genomes.We used the DomainTeam algorithm [6] to identify max-gap clusters of probes. With 11 genomes, we found 588clusters present in at least two genomes. A striking exam-ple is depicted below. It shows the SSC region flanked byparts of the inverted repeat regions, with rearrangementsinvolving the border between regions.

PHYLOGENY OF 18 CHLOROPLASTS

We studied the phylogeny of 18 chloroplasts using ourprobes. First, chloroplasts genomes were represented assequences of integers, based on the result of the virtualhybridization process. Next, these sequences were usedto define binary characters matrices based on three mod-els of gene order analysis (gene content, conserved ad-jacencies, common intervals), where we used successfullhybridizations instead of genes. Finally, these matriceswere analysed with PAUP, using the Neighbor-Joiningalgorithm, the Hamming distance and 100 replicates ofbootstrap. The reference tree is described in [3].

(a) Reference tree (b) Gene content

(c) Conserved adjacencies (d) Common intervals

References

[1] L. Cui, J. Tang, B.M.E. Moret, and C. dePamphilis. Inferringancestral chloroplast genomes with duplications. In preparation.

[2] Th. Dobzhansky and A, T. Sturtevant. Inversions in the Chro-mosomes ofDrosophila pseudoobscura. Genetics, 23:18, 1938.

[3] W. Martin, O. Deusch, N. Stawski, N. Grunheit, V. Goremykin.Chloroplast genome phylogenetics: why we need independentapproaches to plant molecular evolution. Trends Plant Sci.9(10):477–83, 2004.

[4] M. Matsuoa, Y. Itob, R. Yamauchib and J. Obokataa. The RiceNuclear Genome Continuously Integrates, Shuffles, and Elim-inates the Chloroplast Genome to Cause ChloroplastNuclearDNA Flux. The Plant Cell, 17:665–675, 2005.

[5] J.D. Parsons Improved Tools for DNA Comparison and Cluster-ing. Comput. Applic. Biosci., 11:603–613, 1995.

[6] S. Pasek, A. Bergeron, J.-L. Risler, A. Louis, E. Ollivier and M.Raffinot. Identification of genomic features using microsynte-nies of domains: domain teams.Genome Research, 15(6):867-74, 2005.

[7] S. Schwartz, et al.. MultiPipMaker and supporting tools: align-ments and analysis of multiple genomic DNA sequences, Nucl.Acids Res., 31(13):3518–3524, 2003.

[8] P.G. Wolf, et al.. The first complete chloroplast genomesequence of a lycophyte, Huperzia lucidula (Lycopodiaceae).Gene., 350(2):117-28, 2005.

Address for correspondence: Department d’Informatique, UQ AM, CP 8888, succ. centre-ville, Montreal (QC), H3C 3P8Photographs are under GFDL, plants drawing are from the public domain and chromosome drawings are (C) 1938 Genetics