UCSC Known Genes Version 3 Take 9

Known Gene History

• Initially based on Genie predictions constrained by BLAT mRNA alignments.
  – David Kulp got busy at Affy.
• Switched to RefSeq.
  – Jim got paranoid Riken RNAs would take over.
• Fan built KG 1.
  – Mark got annoyed at low quality predictions.
• Fan & Mark built KG 2.
  – Jim got annoyed at missing genes.
• KG 3
  – The perfect set … until KG 4.

Overall Pipeline

• Get alignments etc. from database.
• Remove antibody fragments.
• Clean alignments, project to genome.
• Cluster into splicing graph.
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique transcripts out of graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Remove weak transcripts.
• Assign accessions.
• Build gene-centric database tables.

Genbank & Alignment Issues

• Using global instead of local near-best alignment, also higher stringency.
• Including all Genbank RNA, not just mRNA.
• These changes are not yet reflected in the Genbank mRNA/RefSeq tracks.
• Collect data such as selenocysteine substitutions and alternative start codons from Genbank. These data are in the .ra files but not the SQL database.

Removing Antibody Var Regions

• Chromosomes 2, 14, and 22 contain antibody regions.
• Thousands of transcripts for these in Genbank.
• Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
• Identify regions by:
  – Searching for words like ‘immunoglobulin’ and ‘variable’ to make an initial set of Ab fragments.
  – Treat anything that overlaps these as an Ab fragment too.
  – Cluster together putative Ab fragments.
  – Take the 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
• Remove all alignments in Ab clusters.
• Replace with a single noncoding gene for each cluster near the end of the gene build.
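The identification steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Aln` record, the function name `find_ab_clusters`, the single-pass overlap rule, and measuring cluster size by member count are all hypothetical choices, not details taken from the actual build.

```python
from dataclasses import dataclass

AB_WORDS = ("immunoglobulin", "variable")  # keywords named on the slide


@dataclass(frozen=True)
class Aln:
    """Hypothetical alignment record (0-based, half-open coordinates)."""
    name: str
    chrom: str
    start: int
    end: int
    description: str


def overlaps(a, b):
    """True if two alignments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_ab_clusters(alns, n_regions=4):
    """Return the n_regions largest clusters of putative antibody fragments."""
    # Step 1: initial set chosen by keyword in the description.
    seeds = [a for a in alns
             if any(w in a.description.lower() for w in AB_WORDS)]
    # Step 2: anything overlapping a seed is treated as a fragment too
    # (assumption: one pass only, not a transitive closure).
    putative = [a for a in alns if any(overlaps(a, s) for s in seeds)]
    # Step 3: cluster putative fragments by merging overlapping intervals.
    putative.sort(key=lambda a: (a.chrom, a.start))
    clusters, cur_end = [], None
    for a in putative:
        if clusters and clusters[-1][0].chrom == a.chrom and a.start < cur_end:
            clusters[-1].append(a)
            cur_end = max(cur_end, a.end)
        else:
            clusters.append([a])
            cur_end = a.end
    # Step 4: keep the largest clusters (assumption: size = member count).
    clusters.sort(key=len, reverse=True)
    return clusters[:n_regions]
```

All alignments in the returned clusters would then be removed and each cluster replaced by a single noncoding gene later in the build.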

Chr22 Ab Region (lambda light chain)

Cleaning, projecting alignments

• BLAT sometimes leaves messy gappy ends.
• New heuristic:
  – For gaps of 6 bases or fewer on both mRNA and genome, just ignore the gap, filling in with genome if necessary.
  – Try to turn other gaps into introns, if they are not already, by wiggling one base on either side of the gap.
  – Break up alignments at remaining gaps that are not intronic. Intronic gaps are at least 16 bases, and have gt/ag or gc/ag ends.
  – After breaking up, throw away any pieces less than 18 bases long.
• For RefSeq mRNA only, join pieces back together after breaking up. Other mRNA can be joined by other transcripts (which may not suffer the same problems from polymorphism/error).
• Consider applying a similar heuristic in the mRNA track.
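A minimal sketch of the gap rules above, using the thresholds from the slide (6, 16, and 18 bases). The function names, the requirement that an intronic gap have no unaligned mRNA bases, and the use of summed block sizes as piece length are assumptions made for illustration; the one-base wiggle step is left out.

```python
SMALL_GAP = 6      # gaps this size or smaller on both sides are ignored
MIN_INTRON = 16    # minimum genomic gap that can count as an intron
MIN_PIECE = 18     # pieces shorter than this are discarded after breaking
INTRON_ENDS = {("gt", "ag"), ("gc", "ag")}


def classify_gap(q_gap, t_gap, t_gap_seq):
    """Classify one alignment gap as 'ignore', 'intron', or 'break'.

    q_gap: unaligned bases on the mRNA side of the gap.
    t_gap: unaligned bases on the genome side of the gap.
    t_gap_seq: genomic sequence spanning the gap, read on the
               transcript's strand.
    """
    if q_gap <= SMALL_GAP and t_gap <= SMALL_GAP:
        return "ignore"
    seq = t_gap_seq.lower()
    # Assumption: a true intron has no unaligned mRNA bases.
    if q_gap == 0 and t_gap >= MIN_INTRON and (seq[:2], seq[-2:]) in INTRON_ENDS:
        return "intron"
    return "break"


def split_at_breaks(block_sizes, gap_kinds):
    """Split aligned blocks into pieces at 'break' gaps, dropping short pieces.

    block_sizes: aligned bases in each block, in order.
    gap_kinds: classification of the gap between consecutive blocks.
    """
    pieces, current = [], [block_sizes[0]]
    for size, kind in zip(block_sizes[1:], gap_kinds):
        if kind == "break":
            pieces.append(current)
            current = [size]
        else:
            current.append(size)
    pieces.append(current)
    # Assumption: piece length is the sum of its aligned block sizes.
    return [p for p in pieces if sum(p) >= MIN_PIECE]
```

For example, `split_at_breaks([50, 10, 30], ["break", "break"])` keeps the 50-base and 30-base pieces and drops the 10-base one.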

Cleaning and projecting

Cluster into splicing graph

• Make graph where vertices are begin/ends of exons, edges are exons and introns.
• Multiple input transcripts can share vertices and edges.
• Went over this in some detail a few weeks back…
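As a rough illustration of the graph described above, the sketch below builds vertices from exon boundaries and edges from exons and introns, with each edge recording which input transcripts use it. The data layout and the name `build_splicing_graph` are assumptions for this example, and all transcripts are assumed to lie on one chromosome and strand.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a simple splicing graph.

    transcripts: dict mapping transcript name -> sorted list of
                 (start, end) exons.
    Returns (vertices, edges): vertices is a set of exon boundary
    positions; edges maps (from_pos, to_pos, kind) to the set of
    transcript names supporting that edge, where kind is 'exon' or
    'intron'.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            if i + 1 < len(exons):
                # The intron runs from this exon's end to the next exon's start.
                next_start = exons[i + 1][0]
                edges[(end, next_start, "intron")].add(name)
    return vertices, dict(edges)
```

Two transcripts that share a first exon and intron but differ in the end of the second exon share three vertices and two edges, which is how alternative forms end up in the same cluster.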

Splicing graph and txWalk

Adding Evidence to Graph

• Initial evidence for each edge comes from mRNAs.
• Add evidence if an edge is supported by at least 2 ESTs. (A single EST is likely the same clone as a single RNA…) Use spliced ESTs only.
• Make graph in mouse and map via chains. Reinforce orthologous human edges.
• Reinforce exon edges that overlap Exoniphy predictions.
• Evidence weight: RefSeq 100, each mRNA 2, EST pair 1, mouse ortho 1, Exoniphy 1.
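The weights listed above could be combined per edge along these lines. The slide does not say whether RefSeq and EST evidence count once or once per item, so this sketch assumes RefSeq counts once, each mRNA counts separately, and ESTs add a single point once at least two spliced ESTs agree. The function and parameter names are hypothetical.

```python
REFSEQ_WEIGHT = 100
MRNA_WEIGHT = 2
EST_PAIR_WEIGHT = 1
MOUSE_ORTHO_WEIGHT = 1
EXONIPHY_WEIGHT = 1


def edge_weight(has_refseq=False, n_mrna=0, n_spliced_ests=0,
                mouse_ortho=False, exoniphy=False):
    """Sum the evidence weight for one splicing-graph edge."""
    weight = 0
    if has_refseq:
        weight += REFSEQ_WEIGHT          # assumption: counted once
    weight += MRNA_WEIGHT * n_mrna       # each mRNA counts
    if n_spliced_ests >= 2:
        weight += EST_PAIR_WEIGHT        # assumption: one point, not per pair
    if mouse_ortho:
        weight += MOUSE_ORTHO_WEIGHT
    if exoniphy:
        weight += EXONIPHY_WEIGHT
    return weight
```

Under these assumptions a single mRNA alone gives an edge a weight of 2, and any one additional line of evidence (an EST pair, a mouse ortholog, or an Exoniphy overlap) brings it to 3.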

Walking graph

• Weight of 3 on an edge is good enough.
• Rank input RNA by whether RefSeq, and by number of good edges they use.
• If any good edges, output a transcript consisting of the edges used by the first RNA.
• Output transcript based on next RNA if the good edges it uses have not been output in the same order before.
• Continue until the last RNA is reached.
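A simplified sketch of the walk described above. It assumes each input RNA is given as an ordered list of edge identifiers, that an output transcript consists of that RNA's good edges, and that only an exactly repeated sequence of good edges counts as "output before". The name `walk_transcripts` and the dict layout are hypothetical.

```python
GOOD_WEIGHT = 3  # an edge with at least this much evidence is "good"


def walk_transcripts(rnas, edge_weights):
    """Walk unique transcripts out of a weighted splicing graph.

    rnas: list of dicts with keys 'name', 'is_refseq', and 'edges'
          (edge identifiers in transcript order).
    edge_weights: dict mapping edge identifier -> evidence weight.
    Returns a list of (source RNA name, tuple of good edges).
    """
    def good_edges(rna):
        return tuple(e for e in rna["edges"]
                     if edge_weights.get(e, 0) >= GOOD_WEIGHT)

    # RefSeq first, then by number of good edges used (both descending).
    ranked = sorted(rnas,
                    key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)
    seen, output = set(), []
    for rna in ranked:
        path = good_edges(rna)
        # Skip RNAs with no good edges and paths already output.
        if path and path not in seen:
            seen.add(path)
            output.append((rna["name"], path))
    return output
```

Because RefSeq RNAs are ranked first, a non-RefSeq mRNA that uses the same good edges in the same order adds no new transcript.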

Evidence, Walk, AltSplice

Assigning Coding Regions

• Align UniProt and RefSeq proteins to txWalk transcripts. Mark regions they hit as possible CDS.
• Align Genbank/RefSeq RNAs to txWalk transcripts, map CDS from RNA records as possible CDS.
• Use bestorf program for another possible CDS.
• Assign an ad-hoc score to each possible CDS, choose highest scoring.
• More comparative genomics could really help here someday…

CDS Mapping, Filtering

Classifying and Weeding

• The transcripts are classified into:
  – Coding: CDS survives trimming stage
  – Near-coding: overlap coding by at least 20 bases on same strand
  – Antisense: overlap coding by at least 20 bases on opposite strand
  – Noncoding: other transcripts
• Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
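The four classes above can be expressed as a small decision function. This sketch assumes overlap is measured as shared exonic bases with any coding transcript on the same chromosome, and that near-coding takes precedence over antisense when both apply; the dict layout and names are hypothetical. The weeding of incompletely spliced near-coding transcripts is not shown.

```python
MIN_OVERLAP = 20  # bases of overlap with a coding transcript


def overlap_bases(exons_a, exons_b):
    """Count bases shared between two exon lists (half-open intervals)."""
    total = 0
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def classify(tx, coding_txs):
    """Return 'coding', 'nearCoding', 'antisense', or 'noncoding'.

    tx and each entry of coding_txs are dicts with keys 'strand',
    'exons' (list of (start, end)), and 'has_cds'.
    """
    if tx["has_cds"]:
        return "coding"
    hits = [o for o in coding_txs
            if overlap_bases(tx["exons"], o["exons"]) >= MIN_OVERLAP]
    # Assumption: same-strand overlap wins if both kinds are present.
    if any(o["strand"] == tx["strand"] for o in hits):
        return "nearCoding"
    if hits:
        return "antisense"
    return "noncoding"
```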

Assigning accessions

• Initial temporary identifiers of form <chrom>.<cluster>.<tx>.<accession>, e.g. chr22.210.5.AB209301
• Make permanent identifiers of form TX12345678.
  – Find exact match in previous gene set, and reuse previous accession.
  – Find compatible match (all introns alike) in old gene set, reuse accession, bump version.
  – Make up new accession otherwise.
  – Record genes in old set not in new.
• Version 7 -> version 9 mapping actually a good test of this: 53025 exact, 4732 lost, 3736 new, 464 compatible.
• Move to UC1234567 format in v. 10?
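The accession rules above amount to a three-way match against the previous gene set. The sketch below is one way to express that, assuming transcripts are compared by their exon coordinate lists, that single-exon transcripts are never "compatible" matches, and that an old accession is reused at most once. The function name, argument layout, and starting version of 1 are assumptions.

```python
def introns(exons):
    """Return the introns implied by a sorted list of (start, end) exons."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_accessions(new_txs, old_txs, next_number):
    """Assign permanent accessions to a new gene set.

    new_txs: dict temp_id -> list of (start, end) exons.
    old_txs: dict (accession, version) -> list of (start, end) exons.
    next_number: first unused number for newly minted accessions.
    Returns (assigned, lost): assigned maps temp_id ->
    (accession, version, status); lost lists old accessions not reused.
    """
    old_exact = {tuple(ex): key for key, ex in old_txs.items()}
    old_compat = {introns(ex): key for key, ex in old_txs.items() if len(ex) > 1}
    assigned, used = {}, set()

    # Pass 1: exact matches reuse accession and version unchanged.
    for temp_id, exons in new_txs.items():
        hit = old_exact.get(tuple(exons))
        if hit and hit[0] not in used:
            assigned[temp_id] = (hit[0], hit[1], "exact")
            used.add(hit[0])

    # Pass 2: compatible matches bump the version; the rest are new.
    for temp_id, exons in new_txs.items():
        if temp_id in assigned:
            continue
        hit = old_compat.get(introns(exons)) if len(exons) > 1 else None
        if hit and hit[0] not in used:
            assigned[temp_id] = (hit[0], hit[1] + 1, "compatible")
            used.add(hit[0])
        else:
            assigned[temp_id] = (f"TX{next_number:08d}", 1, "new")
            next_number += 1

    lost = sorted(acc for acc, _version in old_txs if acc not in used)
    return assigned, lost
```

Running exact matches in a first pass keeps a compatible match from claiming an accession that an identical transcript should inherit.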

Building gene-centric tables

• mmBlastTab, rnBlastTab etc. homolog tables. Blastp best plus syntenic weeding.
• kgXref and knownToXxx tables to relate gene to other databases and tables.
• kgAlias table to help search on gene names.
• gnfAtlas2Distance to measure expression similarity between genes for Gene Sorter. 3 other expression distance tables.
• humanVidalP2P and humanWankerP2P protein network distance tables.
• knownCanonical/knownIsoform tables to help people selectively view alt-splicing.
• pbXXX tables for proteome browser.
• In all about 10 hours of compute and indexing.

The Plan

• Next week
  – Test preliminary integration on hg18a.
  – Resolve issues with proteome browser.
  – Tinker on take 10, maybe take 11.
• Week after
  – Integration of final gene build into hg18a.
  – Move hg18.knownGenes to hg18.knownGenesOld.
  – Swap hg18a tables into hg18.
• Coming months
  – Continue to improve gene build.
  – Add new information from build into details pages.
  – Allow user filtering of which genes are shown.
  – Allow selection by names as well as IDs in table browser.
  – Present at Cold Spring Harbor. Write up paper.
