1
Identifying losses and expansions of selected gene families in incomplete genomic datasets Arnaud D I F RANCO 1,2 , Marc H ANIKENNE 2,3 , Denis B AURAIN 1,2 1 Eukaryotic Phylogenomics, Dept of Life Sciences, University of Liège, B-4000 Liège, Belgium 2 PhytoSYSTEMS, University of Liège, B-4000 Liège, Belgium 3 Functional Genomics and Plant Molecular Imaging, Center for Protein Engineering (CIP), Dept of Life Sciences, University of Liège, B-4000 Liège, Belgium Introduction D URING evolution, some eukaryotic lineages have acquired the ability to photosynthesize through endosymbiosis and conversion of the endosymbiont into a plastid. The first en- dosymbiotic event has involved a phagotrophic eukaryote and a cyanobacterium and has given rise to three monophyletic groups: glaucophytes, green plants and red algae. Red algae have generally smaller genomes with less coding genes com- pared to the two others (especially green plants). Because the earliest-branching group among red algae (Cyanidiales) is composed of thermoacidophilic organisms, we hypothesize that these missing genes might have been lost by the common an- cestor of extant red algal during its adaptation to life in ex- tremophilic conditions. Figure 1: Hypothesis of the bottleneck in red algal evolution. Due to a past environmental perturbation (e.g., Snowball Earth), red algal diversity could have been reduced to extremophilic organisms. The adaptation of red algae to the peculiar thermoacidophilic environment would have led to massive losses of unnecessary genes. We speculate that such an hypothesis might be demonstrated by observing specific gene losses shared by current red algae. Results U SING the pipeline described to the right, we examined in red algae the conservation of thousands of gene families defined in green plants, among which about 200 families correspond- ing to manually selected genes related to differential metabolism in geothermal springs and present oceans. We found that sev- eral ancestral gene families were absent in all available red algal genomes (family losses), whereas, for other gene families, the number of genes composing the family had been increased in red algal genomes (family expansion). These observations car- ried out on our inventory were compared to those obtained from all gene families. Family losses I N figure 2, we compared the losses in our inventory and in all gene families. The Venn diagrams show that the losses in each group are proportionally equivalent. This observation can also be made in the bar chart, even if the latter points out that losses are less abundant among the genes of our inventory. This suggests that we selected genes (and gene families) that are generally more conserved in red algae than the average gene (or gene family). Figure 2: Comparison of gene family losses between the set of genes in our inventory and in all gene families. Shared losses were studied in two groups of red algal species: Cyanidiales and other reds. Cyanidiales are composed of two species, Cyanidioschyzon merolae and Galdieria sulphuraria, whereas other reds include three species, Porphyridium purpureum, Pyropia yezoensis and Chondrus crispus. Therefore, genes lost in both groups are ab- sent from all red algal genomes currently available. A. Venn diagram of gene family losses in our inventory. B. Venn diagram of gene family losses in all gene families. C. Bar chart comparing the fraction of missing genes in our in- ventory and in all gene families. Regarding our inventory, 16 gene families are absent from all red algal genomes (among which APG7, COPT1, CTP2, CUTC, CYP737A1, FPN1, MDHM, MOT1, PKL1, SOUL1, THI1), whereas 14 more gene families are missing only in Cyanidiales (among which ARD1, CDJ3, CYC4, FSD1, NAR1.1, LciA, NCR2, PAO1, TIC55, pfl-AE, Sac3) for only 4 gene fami- lies in other reds (COX17, CYP746A1, SULTR2, SULTR3). Family expansions W E determined gene family expansions by comparing the size of gene families in green plants and red algae. In figure 3, the comparison is based on the following ratio: e p,cy = log 10 c p,cy c p,gp where c p,g = # orthologs for profile p # genomes for group g . e p,cy > 0 indi- cates a more abundant family in red algae (here, Cyanidiales) relative to green plants, whereas e p,cy < 0 suggests the opposite. The different plots show that our inventory is enriched in ex- panded families and also in really small red algal gene families compared to green plants. In figure 4, we observe that gene families related to copper and iron transport seem to be often expanded in non-Cyanidiales (other) red algae. Figure 3: Comparison of gene family sizes between green plants and red algae. The distribution of the e p,group ratio for all gene families is depicted by the blue bar chart, while the distribution for the genes of our inventory is overlaid in red. A. e p,all (all five red algae). B. e p,cy (two Cyanidiales). C. e p,oth (three other red algae). Figure 4: Gene family expansions for copper and iron transport- related gene families. For each of the two categories, the corresponding set of pHMMs defined on green plants was selected and the total count of orthologs plotted for each red algal species (blue dot). The box plot summarizes the distri- bution of control counts obtained from 1000 random sets of pHMMs mimicking the statistical properties of the real sets (see figure 6.F). A. Copper-transport gene families. B. Iron-transport gene families. Example Figure 5: Main expansions and losses in Pyropia yezoensis. We ob- served that the genes of three expanded families interact for a better assimila- tion of iron. CTP1 and CTP3 transport copper into Golgi vesicles. FOX1 and FOX2 are copper-dependent ferroxidases located in the plasma membrane. FRE1 is an iron permease also located in the plasma membrane. In contrast, the specific molybdenum transporter MOT1 (regulated by nitrate) is absent, as well as CTP2, another copper transporter related to CTP1/CTP3 and located in the plastid membrane. Methods W E computed orthology relationships between green plant genes and red algal genes using a combination of Hidden Markov Model profiles (pHMMs) and (BLAST) Best Reciprocal Hit criterion (BRH). These relationships were uploaded into a SQL database to facilitate downstream analyses. Querying of the database was carried out directly from R to take advantage of the statistic and graphical functions of the latter. A. B. C. D. E. F. Figure 6: Pipeline. A. An inventory of genes related to copper, iron, molyb- denum, nitrogen and sulphur is established using Chlamydomonas reinhardtii as a genomic source. B. Red algal homologs are identified by similarity to pHMMs built from green plant orthologs. C. To this end, orthologous groups (OGs) are computed for ten genomes of green plants using OrthoMCL. D. OGs are annotated by the genes of our inventory using BLAST. E. OGs are aligned, converted to pHMMs and used to identify red algal homologs with HMMER3. Orthologs are separated from paralogs using reciprocal BLAST searches and the BRH criterion. F. Unknown pHMMs and annotated pHMMs are merged into a pool of control pHMMs used to generate a background distribution for the observed counts. BBC2013, the eighth Benelux Bioinformatics Conference, Brussels, December 9-–10, 2013

Identifying losses and expansions of selected gene

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Identifying losses and expansions of selectedgene families in incomplete genomic datasets

Arnaud DI FRANCO1,2, Marc HANIKENNE2,3, Denis BAURAIN1,2

1Eukaryotic Phylogenomics, Dept of Life Sciences, University of Liège, B-4000 Liège, Belgium2PhytoSYSTEMS, University of Liège, B-4000 Liège, Belgium

3Functional Genomics and Plant Molecular Imaging, Center for Protein Engineering (CIP), Dept of LifeSciences, University of Liège, B-4000 Liège, Belgium

Introduction

DURING evolution, some eukaryotic lineages have acquiredthe ability to photosynthesize through endosymbiosis and

conversion of the endosymbiont into a plastid. The first en-dosymbiotic event has involved a phagotrophic eukaryote anda cyanobacterium and has given rise to three monophyleticgroups: glaucophytes, green plants and red algae. Red algaehave generally smaller genomes with less coding genes com-pared to the two others (especially green plants). Becausethe earliest-branching group among red algae (Cyanidiales) iscomposed of thermoacidophilic organisms, we hypothesize thatthese missing genes might have been lost by the common an-cestor of extant red algal during its adaptation to life in ex-tremophilic conditions.

Figure 1: Hypothesis of the bottleneck in red algal evolution. Due toa past environmental perturbation (e.g., Snowball Earth), red algal diversitycould have been reduced to extremophilic organisms. The adaptation of redalgae to the peculiar thermoacidophilic environment would have led to massivelosses of unnecessary genes. We speculate that such an hypothesis might bedemonstrated by observing specific gene losses shared by current red algae.

Results

USING the pipeline described to the right, we examined in redalgae the conservation of thousands of gene families defined

in green plants, among which about 200 families correspond-ing to manually selected genes related to differential metabolismin geothermal springs and present oceans. We found that sev-eral ancestral gene families were absent in all available red algalgenomes (family losses), whereas, for other gene families, thenumber of genes composing the family had been increased inred algal genomes (family expansion). These observations car-ried out on our inventory were compared to those obtained fromall gene families.

Family losses

IN figure 2, we compared the losses in our inventory and inall gene families. The Venn diagrams show that the losses

in each group are proportionally equivalent. This observationcan also be made in the bar chart, even if the latter points outthat losses are less abundant among the genes of our inventory.This suggests that we selected genes (and gene families) that aregenerally more conserved in red algae than the average gene (orgene family).

Figure 2: Comparison of gene family losses between the set of genesin our inventory and in all gene families. Shared losses were studied intwo groups of red algal species: Cyanidiales and other reds. Cyanidiales arecomposed of two species, Cyanidioschyzon merolae and Galdieria sulphuraria,whereas other reds include three species, Porphyridium purpureum, Pyropiayezoensis and Chondrus crispus. Therefore, genes lost in both groups are ab-sent from all red algal genomes currently available. A. Venn diagram of genefamily losses in our inventory. B. Venn diagram of gene family losses in allgene families. C. Bar chart comparing the fraction of missing genes in our in-ventory and in all gene families. Regarding our inventory, 16 gene families areabsent from all red algal genomes (among which APG7, COPT1, CTP2, CUTC,CYP737A1, FPN1, MDHM, MOT1, PKL1, SOUL1, THI1), whereas 14 more genefamilies are missing only in Cyanidiales (among which ARD1, CDJ3, CYC4,FSD1, NAR1.1, LciA, NCR2, PAO1, TIC55, pfl-AE, Sac3) for only 4 gene fami-lies in other reds (COX17, CYP746A1, SULTR2, SULTR3).

Family expansions

WE determined gene family expansions by comparing thesize of gene families in green plants and red algae. In

figure 3, the comparison is based on the following ratio:ep,cy = log10

(cp,cycp,gp

)where cp,g = # orthologs for profile p

# genomes for group g . ep,cy > 0 indi-cates a more abundant family in red algae (here, Cyanidiales)relative to green plants, whereas ep,cy < 0 suggests the opposite.The different plots show that our inventory is enriched in ex-panded families and also in really small red algal gene familiescompared to green plants. In figure 4, we observe that genefamilies related to copper and iron transport seem to be oftenexpanded in non-Cyanidiales (other) red algae.

Figure 3: Comparison of gene family sizes between green plants andred algae. The distribution of the ep,group ratio for all gene families is depictedby the blue bar chart, while the distribution for the genes of our inventory isoverlaid in red. A. ep,all (all five red algae). B. ep,cy (two Cyanidiales). C. ep,oth(three other red algae).

Figure 4: Gene family expansions for copper and iron transport-related gene families. For each of the two categories, the corresponding setof pHMMs defined on green plants was selected and the total count of orthologsplotted for each red algal species (blue dot). The box plot summarizes the distri-bution of control counts obtained from 1000 random sets of pHMMs mimickingthe statistical properties of the real sets (see figure 6.F). A. Copper-transportgene families. B. Iron-transport gene families.

Example

Figure 5: Main expansions and losses in Pyropia yezoensis. We ob-served that the genes of three expanded families interact for a better assimila-tion of iron. CTP1 and CTP3 transport copper into Golgi vesicles. FOX1 andFOX2 are copper-dependent ferroxidases located in the plasma membrane.FRE1 is an iron permease also located in the plasma membrane. In contrast,the specific molybdenum transporter MOT1 (regulated by nitrate) is absent, aswell as CTP2, another copper transporter related to CTP1/CTP3 and located inthe plastid membrane.

Methods

WE computed orthology relationships between green plantgenes and red algal genes using a combination of Hidden

Markov Model profiles (pHMMs) and (BLAST) Best Reciprocal Hitcriterion (BRH). These relationships were uploaded into a SQLdatabase to facilitate downstream analyses. Querying of thedatabase was carried out directly from R to take advantage ofthe statistic and graphical functions of the latter.

A.

B.

C.

D.

E.

F.

Figure 6: Pipeline. A. An inventory of genes related to copper, iron, molyb-denum, nitrogen and sulphur is established using Chlamydomonas reinhardtiias a genomic source. B. Red algal homologs are identified by similarity topHMMs built from green plant orthologs. C. To this end, orthologous groups(OGs) are computed for ten genomes of green plants using OrthoMCL. D. OGsare annotated by the genes of our inventory using BLAST. E. OGs are aligned,converted to pHMMs and used to identify red algal homologs with HMMER3.Orthologs are separated from paralogs using reciprocal BLAST searches andthe BRH criterion. F. Unknown pHMMs and annotated pHMMs are merged intoa pool of control pHMMs used to generate a background distribution for theobserved counts.

BBC2013, the eighth Benelux Bioinformatics Conference, Brussels, December 9-–10, 2013