Manual Annotation of Human Genome at Broad Institute. Chinnappa Kodira, April 2004, GMOD 2004, Cambridge, MA. Goals: accurate and comprehensive catalog of genes and gene products; robust annotation system for annotation of all sequenced genomes.
Chinnappa Kodira
April 2004 GMOD 2004, Cambridge, MA
Manual Annotation of Human Genome at Broad Institute
Goals
Accurate and comprehensive catalog of genes and gene products
Robust annotation system for annotation of all sequenced genomes
Annotation Strategy: Evidence-based Annotation
CSMD1 gene:
Gene Size: 2,065,608 bases
Transcript Length: 11,297 bases
Protein Length: 3,565 aa
No. of Exons: 68
Average length of Exons: 166 bases
Fgenesh 20
Genscan 25
Blat_EST 179
mRNA 3
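The CSMD1 figures above are internally consistent, which a few lines of arithmetic can confirm. This is only a sanity check on the slide's numbers. The assumption that the stop codon adds three bases to the coding length is not stated on the slide.

```python
gene_size = 2_065_608       # bases, genomic span of the locus
transcript_length = 11_297  # bases, sum of exon lengths
protein_length = 3_565      # amino acids
exon_count = 68

avg_exon = transcript_length / exon_count
# Assumption: coding length = 3 bases per residue plus one stop codon.
cds_length = protein_length * 3 + 3
utr_bases = transcript_length - cds_length
exonic_fraction = transcript_length / gene_size

print(f"average exon length: {avg_exon:.0f} bases")
print(f"coding bases: {cds_length}, remaining (UTR) bases: {utr_bases}")
print(f"exonic fraction of the locus: {exonic_fraction:.2%}")
```

The very small exonic fraction illustrates why ab initio predictors and individual ESTs each recover only part of a gene of this size.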
Rule-based Annotation
FL-mRNA
Species-specific ESTs
Cross-species ESTs
Protein homology
Ecores + Gene Predictions
Decreasing order of confidence level
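The confidence ordering above can be expressed as a simple lookup. A minimal sketch, assuming each piece of evidence is tagged with one of the five listed types; the string labels, accessions and the `best_evidence` function are illustrative names, not identifiers from the Broad pipeline.

```python
# Evidence types in decreasing order of confidence, as listed on the slide.
# The string labels are illustrative, not identifiers from the Broad pipeline.
EVIDENCE_RANK = [
    "FL-mRNA",
    "species-specific EST",
    "cross-species EST",
    "protein homology",
    "ecore + gene prediction",
]
RANK = {kind: i for i, kind in enumerate(EVIDENCE_RANK)}


def best_evidence(evidence):
    """Return the (kind, accession) pair whose kind ranks highest."""
    return min(evidence, key=lambda item: RANK[item[0]])


# Hypothetical supporting evidence for one gene model.
supporting = [
    ("protein homology", "prot_1"),
    ("species-specific EST", "est_7"),
    ("cross-species EST", "est_9"),
]
print(best_evidence(supporting))
```

With no full-length mRNA available, the species-specific EST is selected as the primary evidence.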
Annotation System
Automated GeneCaller
Publication
database
Loader
Genome Evidence
TranscriptHunter
Manual Annotation
Argo Genome Browser
Alignment
QA
Critical Steps in our Annotation Process
Running Computes
Selection and Filtering Evidence
Intelligent Automated Gene Caller
Genome Browser and Editor
Annotation Rules
Trained Manual Annotators
Annotation QA Process
Computes
Finished Sequence
Repeat Mask
Homology Search
Sequence Alignment
Gene Prediction
Computed Features
Filtering of High Quality Evidence
•Identity >95% and >50% QS coverage
•Splice Junctions
•Rank Order
•Repeat filtering
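The identity and coverage thresholds above translate directly into a filter. A minimal sketch, assuming "QS coverage" means the percentage of the query sequence that is aligned; the `Alignment` record, its field names and the `max_repeat` cutoff are hypothetical, since the slide gives no number for repeat filtering. Splice-junction and rank-order checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    identity: float         # percent identity over the aligned region
    query_coverage: float   # percent of the query sequence aligned
    repeat_fraction: float  # fraction of aligned bases inside masked repeats


def passes_filter(aln, min_identity=95.0, min_coverage=50.0, max_repeat=0.5):
    """Apply the identity and coverage thresholds from the slide.

    max_repeat is an assumed cutoff; the slide lists repeat filtering
    without giving a value.
    """
    return (
        aln.identity > min_identity
        and aln.query_coverage > min_coverage
        and aln.repeat_fraction <= max_repeat
    )


alignments = [
    Alignment("est_1", 98.5, 80.0, 0.05),
    Alignment("est_2", 92.0, 90.0, 0.00),  # fails identity
    Alignment("est_3", 99.0, 40.0, 0.00),  # fails coverage
    Alignment("est_4", 97.0, 75.0, 0.90),  # mostly repeat
]
kept = [a.name for a in alignments if passes_filter(a)]
print(kept)
```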
Annotation
Raw Features
TranscriptHunter
Computed Features
Exon-based Clustering
•Define Gene Locus
Intron Edge Clustering
•Identify Variants
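The two clustering steps above can be sketched as follows. This is an illustration under stated assumptions, not TranscriptHunter's actual algorithm: transcripts are given as lists of (start, end) exon intervals on a single strand, a locus is any set of transcripts linked by exon overlap, and two transcripts count as the same variant when their intron edges are identical. All function names and coordinates are hypothetical.

```python
def exons_overlap(a, b):
    """True if any exon interval in a overlaps any exon interval in b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster_loci(transcripts):
    """Single-linkage clustering of transcripts that share exonic bases."""
    loci = []
    for name, exons in transcripts.items():
        hits = [locus for locus in loci
                if any(exons_overlap(exons, transcripts[m]) for m in locus)]
        merged = {name}.union(*hits)
        loci = [locus for locus in loci if locus not in hits] + [merged]
    return loci


def intron_chain(exons):
    """Intron edges (donor, acceptor) implied by sorted exon intervals."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def variants(locus, transcripts):
    """Group the transcripts of one locus by identical intron chains."""
    groups = {}
    for name in sorted(locus):
        groups.setdefault(intron_chain(transcripts[name]), []).append(name)
    return list(groups.values())


transcripts = {
    "t1": [(100, 200), (300, 400), (500, 600)],
    "t2": [(150, 200), (300, 400), (500, 650)],  # same introns as t1
    "t3": [(100, 200), (500, 600)],              # skips the middle exon
    "t4": [(1000, 1100), (1200, 1300)],          # separate locus
}
loci = cluster_loci(transcripts)
print(loci)
print(variants({"t1", "t2", "t3"}, transcripts))
```

Here t1 and t2 differ only at their outer ends, so they collapse into one variant, while t3 is reported as a second variant of the same locus.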
TranscriptHunter
Creation of Gene Models
•ORF and UTRs
•Gene Name
•Transcript Classification
•Curation Flags
Screening of spliced ESTs contained within repeat elements
AluYb8 Repeat
Spliced ESTs
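A screen of this kind can be sketched as an interval containment test. The criterion used here (flag a spliced EST when at least one of its exons lies entirely inside a repeat element) is an assumption, as the slide does not state the exact rule; the coordinates and names are invented for illustration.

```python
def contained(exon, repeats):
    """True if the exon interval lies entirely inside one repeat interval."""
    start, end = exon
    return any(r_start <= start and end <= r_end for r_start, r_end in repeats)


def flag_repeat_ests(ests, repeats):
    """Names of spliced ESTs with at least one exon contained in a repeat."""
    return [name for name, exons in ests.items()
            if len(exons) > 1 and any(contained(x, repeats) for x in exons)]


repeats = [(1000, 1300)]  # e.g. an AluYb8 element (made-up coordinates)
ests = {
    "est_a": [(500, 700), (1050, 1200)],   # second exon inside the repeat
    "est_b": [(500, 700), (1500, 1700)],   # no exon inside the repeat
    "est_c": [(1100, 1250)],               # unspliced, so not screened here
}
print(flag_repeat_ests(ests, repeats))
```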
Manual annotation
TranscriptHunter Gene Models
•Refine Gene Boundaries
•Exon/Intron
•3’ and 5’ UTR
•Create New Genes
•Classify Transcripts
•Edit Automated Gene Calls
•Identify Pseudogenes
•Add Curation Flags
•Call/Adjust ORF
•Select PolyA Signals
AnnotDB
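For the "Call/Adjust ORF" step listed above, a common starting point is the longest ATG-to-stop open reading frame on the transcript strand, which an annotator can then adjust. The sketch below shows that generic technique; it is not a description of how Argo or TranscriptHunter choose ORFs, and `longest_orf` is a hypothetical name.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the given strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    Returns None if no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if best is None or i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best


transcript = "CCATGAAATGGTAACC"
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```

The sequence before the ORF start and after its end would correspond to the 5' and 3' UTRs.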
Features of Argo
Attaching primary and supplemental evidence
Cluster feature display
Filtering and customizing evidence list
Display poly A signals and splice junctions
Alerting discrepancies before updating
Highlighting parent and child features
Real-time interactive analysis
ORF selection options
Tabular dump of selected features
Roll back and save work
Customization of feature display
Annotation View
Confidence levels of our gene models
Classification of transcripts (Hawk standards): Known, Novel_CDS, Novel, Putative, Pseudogene
Association of primary and supplemental evidence with annotated feature
Rank order in selection of supporting evidence
Curation flags
Free text comments
Gene counts for Broad and Ensembl
chrom   genome (%)   Ensembl known   Ensembl novel   Broad known   Broad novel+putative   Spl count   pseudogene
8       4.7          710             132             724           587                    2.6         298
15      2.7          581             165             589           556                    2.8         213
17      2.6          1120            167             1134          578                    3.3         264
18      2.5          265             73              289           275                    2.1         167
TOTAL   12.5         2676            537             2736          1996                               942
Manually Annotated Gene Models vs. public Gene Models
Broad
MGC
Refseq
ENSEMBL
Gene-wise
mRNA
Types of splice variation
Type % of variants
extra 31
skip 18
alt site 33
run on 18
CDS altered 84 %
new stop 48 %
Our data extend most RefSeq/MGC transcripts
Distribution of extensions relative to RefSeq or MGC evidence (human chromosomes 8, 15, 17, 18)
[Figure: % of extensions (y-axis, 0 to 80) against length of extension in bp (x-axis, 100 to 1000), with separate series for 5' and 3' extensions]
38% positive for 5' extension
71% positive for 3' extension
30% positive for both
79% positive for either
median 5' extension = 46 bases
median 3' extension = 143 bases
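The four percentages above are tied together by inclusion-exclusion: transcripts extended at either end are those extended at the 5' end plus those extended at the 3' end, minus those counted twice. A short check, with variable names chosen here for illustration:

```python
five_prime = 38   # % of transcripts extended at the 5' end
three_prime = 71  # % of transcripts extended at the 3' end
both = 30         # % extended at both ends

either = five_prime + three_prime - both
only_five = five_prime - both
only_three = three_prime - both
neither = 100 - either

print(either, only_five, only_three, neither)
```

So 8% of transcripts are extended only at the 5' end, 41% only at the 3' end, and 21% are not extended at all.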
Complete 3' end as compared to RefSeq mRNA and ENSEMBL gene
How valid are these 3’ and 5’ extensions?
[Figure: chart comparing Broad and ENSEMBL, y-axis 0 to 1]
            PolyA signals   5 ^ATG…STOP$
Broad       86%             1.16%
ENSEMBL     68%             10.89%
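One way to read the two measures above is as per-transcript checks: does the 3' end carry a polyadenylation signal, and does the transcript run exactly from start codon to stop codon with no UTR on either side? The sketch below is an interpretation, not the method used for the slide. The reading of "^ATG…STOP$" as a regular-expression-style pattern, the 50-base search window, and the restriction to the canonical AATAAA hexamer are all assumptions.

```python
import re

# Assumed reading of "^ATG…STOP$": the sequence starts at the start codon,
# continues in frame without an internal stop, and ends at the stop codon.
NO_UTR = re.compile(r"^ATG(?:(?!TAA|TAG|TGA)[ACGT]{3})*(?:TAA|TAG|TGA)$")


def lacks_utrs(transcript):
    """True if the transcript consists of a single ORF and nothing else."""
    return bool(NO_UTR.match(transcript.upper()))


def has_polya_signal(transcript, window=50):
    """True if AATAAA occurs within the last `window` bases (assumed window)."""
    return "AATAAA" in transcript.upper()[-window:]


print(lacks_utrs("ATGAAATAA"))
print(lacks_utrs("CCATGAAATAAGG"))
print(has_polya_signal("GGGG" * 20 + "AATAAA" + "C" * 20))
```

Under this reading, a low value in the second column means that most gene models include annotated UTRs.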
Using Start and Stop Codon Context to Refine Annotation
Location of stop codons on exons
[Figure: % stop codons (y-axis, 0 to 100) by exon order (x-axis: n, n-1, n-2, n-3, n-4, n-5)]
Location of start codons on exons
[Figure: % start codons (y-axis, 0 to 70) by exon order (x-axis: 1, 2, 3, 4, 5, 6)]
Stop codons:
•Pseudogenes
•Real Stop codons
•NMD candidates
•Sequence Errors
•Non-coding genes
•SECIS genes
Start codons:
•Pseudogenes
•Real Start codons
•NMD candidates
•Sequence Errors
•Non-coding genes
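The exon-order labels used for stop codons (n for the last exon, n-1 for the one before it, and so on) can be computed from exon lengths and the position of the stop codon in transcript coordinates. A minimal sketch with hypothetical names and made-up lengths; a gene model whose stop codon falls outside exon n would then be reviewed against the categories listed above.

```python
def stop_exon_order(exon_lengths, cds_end):
    """Label the exon holding the stop codon as 'n', 'n-1', ...

    exon_lengths: exon lengths in transcript (5' to 3') order.
    cds_end: 1-based transcript coordinate of the last base of the stop codon.
    """
    total = 0
    for index, length in enumerate(exon_lengths):
        total += length
        if cds_end <= total:
            offset = len(exon_lengths) - 1 - index
            return "n" if offset == 0 else f"n-{offset}"
    raise ValueError("cds_end lies beyond the transcript")


exon_lengths = [120, 200, 150, 300]
print(stop_exon_order(exon_lengths, 600))  # stop codon in the last exon
print(stop_exon_order(exon_lengths, 400))  # stop codon one exon earlier
```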
Issues with Novel and putative transcripts
Concerns
•High number
•Low depth EST coverage
•Small transcript size
•Low no. of variants
•Poor coding potential
•Poor cross-species conservation
•Low poly A frequency
•Weak CpG context
Probable reasons
•Spurious transcription
•Mostly partial
•Temporal genes
•Non-coding
•Poorly expressed
•Lineage specific
[Figure: comparison of Putative, Novel and Known transcripts]
Annotating non-coding mRNAs is still a challenge!
snoRNAs
Challenges Ahead….
Establishing Common Standards
Validating Novel Transcripts
Single Exon Expressed Sequences
Determination of Accurate ORFs
Annotation of Functionally Relevant Alternative Splice Forms
Finding Sparsely Expressed Genes
Annotation of New Types of Non-coding Functional mRNAs
Incremental Update of Annotation
Capturing Biological Exceptions
Acknowledgements
•Reinhard Engels
•Shunguang Wang
•Seth Purcell
•Tim Elkins
•Yuhong Wu
•Serge Smirnov
•Sarah Calvo
•David Dicaprio
Annotation and Analysis
•Charlie Whittaker
•Mark Borowsky
•Sinead O’Leary
•James Galagan
•Jill Mesirov
•Eric Lander
•Sequencing, Finishing and Closure Teams
Annotation Pipeline
Comparison of alternative splice forms between ENSEMBL and Broad annotation
Broad
ENSEMBL
Refseq
dbEST
nrnt-mRNA
Manually Annotated Gene Models vs. public Gene Models
ENSEMBL
GENEWISE
REFSEQ
TranscriptHunter
MANUAL ANNOTATION
ESTs
PolyA signal
Novel Transcript Variants of Known Genes