Many to 1 Gene Associations


1 Many to 1 Gene Associations
The following slides show a few examples of gene predictions by one annotation group that overlap one or more genes from another group. Some of the examples that follow also illustrate issues related to:
- differences in annotation type (e.g., pseudogene versus gene), and
- confusing nomenclature (e.g., different genes assigned the same official gene name).
2 2: One gene or two? Orientation issue for OTT15152?
3 8: One gene or two?
4 3: One gene or two?
5 11: One gene or two or three?
6 5: One gene or two or three?
7 6: One gene or two? The VEGA gene model seems to unite two separate gene models in NCBI; see mRNA
8 9: One gene or two?
9 7: One gene or two?
10 7: One gene or two? EST CX Has 257 aa upstream CDS. Another joining variant (rat mRNA U25653, mouse EST CF172660), displaying upstream CDS (not actually annotated like that). None of the evidence (mouse or rat) shows a distinct upstream gene at the moment.
11 4: One gene or two? It's a heavily duplicated region: is more or less duplicate of 10670
12 2: n:m ENSMUSG and ENSMUSG overlap OTTMUSG and OTTMUSG. Zbtb26 and Zbtb6 share a 5' UTR exon but have non-overlapping CDSs.
13 2: n:m ENSMUSG overlaps ENSMUSG OTTMUSG, OTTMUSG, and OTTMUSG. Moved to Cpne1 (19746); already part of Cpne1 (19746); coding regions don't overlap.
14 6: n:m OTTMUSG overlaps OTTMUSG and EG68089 and EG; overlaps limited to UTR/non-coding.
15 8: Are EG and EG14081 different genes? Can't see any evidence for that structure.
16 7: Are EG and EG different genes?
17 6: Are EG71950 and EG different genes? Can't see any evidence for that micro-intron.
18 16: Are EG11957 and EG different genes?
19 7: Are EG and EG different genes?
20 9: Are EG and EG21838 different genes?
21 13: Are OTT00466 and OTT13227 different genes? A new mRNA BC has appeared which extends the 5' end to include the ATG of (similar to human POM121). So now they're variants of the same locus even though they don't share a splice.
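The many-to-1 and n:m relationships shown in these slides can be found mechanically by comparing coordinates. The following is a minimal Python sketch, assuming each gene model is reduced to a single genomic interval. The `Gene` record, the source labels, the identifiers, and `classify_overlaps` are hypothetical names chosen for illustration; they are not part of any annotation group's pipeline. Requiring the same strand is also an assumption, and it would miss an orientation problem like the one queried on slide 2.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    source: str   # annotation group label (illustrative)
    gene_id: str  # hypothetical identifier
    chrom: str
    strand: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two gene spans share at least one base on the same strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def classify_overlaps(group_a, group_b):
    """Cluster overlapping genes from two groups and label each cluster.

    Returns a list of (ids_from_a, ids_from_b, label) tuples, where label
    is "1:1", "many:1" (one side has a single gene) or "n:m".
    """
    # Build a bipartite graph: an edge joins a gene in A to a gene in B
    # whenever their spans overlap.
    adj = defaultdict(set)
    for i, a in enumerate(group_a):
        for j, b in enumerate(group_b):
            if overlaps(a, b):
                adj[("a", i)].add(("b", j))
                adj[("b", j)].add(("a", i))

    seen = set()
    results = []
    for node in sorted(adj):
        if node in seen:
            continue
        # Depth-first walk collects one connected component.
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adj[current] - component)
        seen |= component

        ids_a = sorted(group_a[i].gene_id for side, i in component if side == "a")
        ids_b = sorted(group_b[i].gene_id for side, i in component if side == "b")
        if len(ids_a) == 1 and len(ids_b) == 1:
            label = "1:1"
        elif len(ids_a) == 1 or len(ids_b) == 1:
            label = "many:1"
        else:
            label = "n:m"
        results.append((ids_a, ids_b, label))
    return results


if __name__ == "__main__":
    group_a = [Gene("groupA", "a1", "6", "+", 100, 500)]
    group_b = [Gene("groupB", "b1", "6", "+", 90, 200),
               Gene("groupB", "b2", "6", "+", 300, 600)]
    for ids_a, ids_b, label in classify_overlaps(group_a, group_b):
        print(label, ids_a, ids_b)
```

A span-level overlap says nothing about whether coding regions overlap. Cases like slides 13 and 14, where the overlap is limited to UTR or non-coding sequence, would need exon-level and CDS-level coordinates to separate from true gene merges.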
22 4: Are OTT08975 and OTT08978 different genes? Yes. They share a splice, so is now a variant of
23 3: Are OTT22306 and OTT19657 different genes? Yes. I've put them all under (Scnm1). But there's more to this picture: in BL/6 this gene is possibly a pseudogene because of a strain-specific premature stop codon about 30 bp from the end of the penultimate exon, supported by mRNA AK
24 3: Are OTT25890 and OTT07101 the same gene? Yes. Made part of
25 1: Are OTT21542 and OTT21543 different genes? Yes. But made an artefact. 7 bp of the mRNA is repeated on genomic sequence around the ggc and cag splices.
26 1: Are OTT21571 and OTT21573 different genes? No. Already fixed.
27 2: Are OTT14319 and OTT14315 different genes? I've made OTT14315 part of. Normally, when the transcripts don't share a splice, they're kept separate. is based on EST AV. I've found its companion BY. Aligning it against the BAC, it matches the first exon of transcript and very vaguely the second exon as well. Oddly the homology is very weak, while AV is 100% match ????
28 4: Are ENS78738 and ENS78736 different genes? Are the genes predicted new members of the chemokine (C-C motif) ligand family? In Ensembl, multiple gene predictions are assigned to the same gene symbol/MGI ID.
29 15: One gene or two or three? Are Nptxr and Cbx6 overlapping? Artefact (has two non-splices). Case to be made for all three options! Currently annotated merged transcripts as part of Nptxr, as the proportion of that CDS is bigger. Option to make it three genes is attractive.
30 2: One gene or two? Are Cdan1 and Ttbk2 overlapping? cDNA AK220258: retained intron (in Cdan1 portion), and apart from that the CDSs do not join up anyway. Both loci have their own CpG and pA features.
31 X: Srpx and Rpgr overlapping? One gene or two? cDNA BC and AK046821; last exon is in frame with 2nd coding exon of Srpx, but continues beyond exon to end in pA features.
32 2: Zgpat and Lime1 overlapping? One gene or two?
mRNA AK173276; retained intron. EST BQ552943; CDSs in-frame. Not shown: cDNA BC034599; contains all exons of both genes, but because the joining splice is beyond the Zgpat 3' UTR (in the Lime1 5' UTR), it is NMD. Both loci have pA features.
33 5: One gene or two? Mpv17 and Gtf3c2 overlapping? CDSs are in-frame but additional variation in a downstream exon would cause NMD; based on EST AA. CDSs are in-frame; based on cDNA AK. pA features and CpG island. Mpv17 is very conserved in human, rat, cow, frog, zebrafish (same length +/- 1 aa; >70% id). But no own CpG.
34 16: One gene or two? Are Pcp4 and Igsf5 two different genes? Next slide. cDNA AK % also in rat, human
35 In Ensembl currently it looks as though Pcp4 and Igsf5 are considered synonyms for the same gene?
36 6: One gene or two? NCBI gene is a pseudogene, Ensembl gene is a protein coding gene. Pseudogene; protein coding gene.
37 13: Pseudogene; protein coding gene.
38 14: Pseudogene; protein coding gene.
39 6: Retrotransposed vs pseudogene. Pseudogene; retrotransposed.
40 Gene Family Challenges
Gene families present many challenges to determining equivalency among gene predictions and for nomenclature. Examples from two gene families are shown in the following slides.
- killer cell lectin-like receptor (Klra) family
- UDP glucuronosyltransferase 1 family
- cysteine-rich perinuclear theca
- C-type lectin domain family 2
41 6: killer cell lectin-like receptor (Klra) family. Next slide.
42 6: killer cell lectin-like receptor (Klra) family. Gene identity crisis! Pseudogene; protein coding gene. Next slide. Transcript; pseudogene; stop codon; stop codon supported by 100% cDNA.
43 6: Overlapping NCBI annotation 2. Overlapping features of different types. Pseudogene; protein coding gene; currently a pseudogene in otter ?!
44 1: UDP glucuronosyltransferase 1 family. Ensembl maintains a single gene ID for all of the members of the family. Next slide.
45 9: cysteine-rich perinuclear theca. Gene identity crisis!
46 6: C-type lectin domain family 2. Ensembl and VEGA predict only a single gene with multiple transcripts rather than two genes, Clec2g and Clec2f.
47 Vega hasn't annotated Clec2f, period. In actual fact that gene doesn't exist as such. The Clec2f locus is a partial duplication of the Clec2g locus (last four exons). Though the duplicate exons have diverged from the parent, they are still open. However, there is no trace of the first exon and no locus-specific transcriptional evidence. We would annotate this as an unprocessed pseudogene. The three-exon gene between Clec2g and Clec2f actually overlaps another Clec2 pseudogene (in this case a duplication of the last three exons). And just 200 bp further there's another Clec2 pseudogene consisting of a duplication of the penultimate exon broken into two fragments plus part of the last exon. This pseudogene overlaps the big terminal exon. Clec2g; Clec2f; pseudo.
48 Clec2g Clec2f
49 Unique to MGI
MGI does not have a high-throughput computational genome annotation pipeline. However, we integrated the results of high-throughput cDNA sequencing projects into the database prior to the availability of the mouse genome. Many of these genes have remained unique to MGI. The following slides illustrate several cases where MGI has a gene that has not been predicted by one of the three major annotation groups. Many (most) of these MGI-unique genes are from the RIKEN cDNA sequencing initiative. Many of them likely represent non-protein coding genes.
50 11:
51 move c splice site; no corresponding splice site; 205 splice site; missing base c would destroy either splice. When aligned against the forward strand (same as Suz12), no splice sites; when aligned in reverse against the reverse strand, some (questionable) splice sites. Conclusion: garbage!
52 9: Unique to MGI
53 11:
54
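The strand test described on slide 51 (align against the forward strand, then the reverse strand, and see which orientation yields believable splice sites) can be sketched as a count of canonical GT...AG intron boundaries in each orientation. This is an illustrative Python sketch only: the function names and the 0-based, end-exclusive intron coordinates are assumptions, and only the major GT-AG intron class is checked, so the rarer GC-AG and AT-AC introns would be reported as non-canonical.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def canonical_introns(genome: str, introns, strand: str) -> int:
    """Count introns whose boundaries read GT...AG on the given strand.

    genome  -- forward-strand sequence of the region
    introns -- (start, end) pairs, 0-based, end-exclusive, forward coordinates
    strand  -- "+" or "-": the strand the transcript is assumed to lie on
    """
    count = 0
    for start, end in introns:
        intron = genome[start:end]
        if strand == "-":
            # A reverse-strand intron appears as CT...AC on the forward strand.
            intron = revcomp(intron)
        intron = intron.upper()
        if len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG"):
            count += 1
    return count


def best_strand(genome: str, introns):
    """Return (forward_count, reverse_count) so both orientations can be compared."""
    return (canonical_introns(genome, introns, "+"),
            canonical_introns(genome, introns, "-"))


if __name__ == "__main__":
    # Toy region: one intron at positions 3..10 with GT...AG on the forward strand.
    region = "AAAGTCCCCAGTTT"
    print(best_strand(region, [(3, 11)]))
```

A transcript with no canonical boundaries in either orientation, or only a few questionable ones in the orientation opposite its neighbouring gene, is the pattern that slide 51 dismisses as an artefact.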