Chinnappa Kodira April 2004 GMOD 2004, Cambridge, MA

Manual Annotation of Human Genome at Broad Institute

Goals

Accurate and comprehensive catalog of genes and gene products

Robust annotation system for annotation of all sequenced genomes

Annotation Strategy: Evidence-based Annotation

CSMD1 gene

Gene Size: 2,065,608 bases

Transcript Length: 11,297 bases

Protein Length: 3,565 aa

No. of Exons: 68

Average Length of Exons: 166 bases
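As a quick consistency check on the CSMD1 figures, the average exon length follows from the transcript length and the exon count. The variable names below are illustrative, and the gene size is read as 2,065,608 bases.

```python
# Figures as quoted on the slide for CSMD1.
gene_size_bp = 2_065_608
transcript_length_bp = 11_297
exon_count = 68

# Average exon length: spliced transcript length divided by exon count.
average_exon_bp = transcript_length_bp / exon_count

# Fraction of the genomic locus that ends up in the spliced transcript.
exonic_fraction = transcript_length_bp / gene_size_bp

print(round(average_exon_bp))     # 166
print(f"{exonic_fraction:.2%}")   # 0.55%
```

Less than 1% of the locus is exonic, which is why short or low-coverage evidence is hard to place on a gene of this size.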

Fgenesh 20

Genscan 25

Blat_EST 179

mRNA 3

Rule-based Annotation

FL-mRNA

Species-specific ESTs

Cross-species ESTs

Protein homology

Ecores + Gene Predictions

(Decreasing order of confidence level)
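The evidence hierarchy can be sketched as a simple rank lookup. This is a minimal illustration, assuming each piece of evidence carries a type label; the label strings and the function name `best_evidence` are hypothetical and not taken from the Broad pipeline.

```python
# Evidence tiers in decreasing order of confidence, as listed on the slide.
# Lower rank number means higher confidence. Label strings are assumptions.
EVIDENCE_RANK = {
    "FL-mRNA": 0,
    "species-specific EST": 1,
    "cross-species EST": 2,
    "protein homology": 3,
    "ecore + gene prediction": 4,
}


def best_evidence(evidence_types):
    """Return the highest-confidence evidence type present, or None."""
    known = [e for e in evidence_types if e in EVIDENCE_RANK]
    return min(known, key=EVIDENCE_RANK.get, default=None)


print(best_evidence(["protein homology", "cross-species EST"]))
```

A gene model supported by several evidence types would then be built from the top-ranked one, with the rest attached as supplemental evidence.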

Annotation System

Automated GeneCaller

Publication

database

Loader

Genome Evidence

Transcript Hunter

Manual Annotation

Argo Genome Browser

Alignment

QA

Critical Steps in our Annotation Process

Running Computes

Selection and Filtering Evidence

Intelligent Automated Gene Caller

Genome Browser and Editor

Annotation Rules

Trained Manual Annotators

Annotation QA Process

Computes

Finished Sequence

Repeat Mask

Homology Search

Sequence Alignment

Gene Prediction

Computed Features

Filtering of High Quality Evidence

•Identity >95% and >50% QS coverage

•Splice Junctions

•Rank Order

•Repeat filtering
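The identity and coverage thresholds can be sketched as a predicate over alignments. This is a minimal sketch, assuming "QS coverage" means the fraction of the query sequence that is aligned; the `Alignment` fields and the function name are hypothetical, and the splice-junction and rank-order checks are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    percent_identity: float   # 0 to 100
    query_coverage: float     # assumed: fraction of the query aligned, 0 to 1
    in_repeat: bool = False   # assumed flag set by an upstream repeat screen


def is_high_quality(aln, min_identity=95.0, min_coverage=0.5):
    """Keep alignments with identity >95%, coverage >50%, outside repeats."""
    return (
        aln.percent_identity > min_identity
        and aln.query_coverage > min_coverage
        and not aln.in_repeat
    )


candidates = [
    Alignment("est_a", 98.2, 0.91),
    Alignment("est_b", 93.0, 0.95),
    Alignment("est_c", 99.0, 0.40),
]
print([a.name for a in candidates if is_high_quality(a)])
```

Both comparisons are strict, matching the ">" signs on the slide.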

Annotation

Raw Features

TranscriptHunter

Computed Features

Exon-based Clustering

•Define Gene Locus

Intron Edge Clustering

•Identify Variants
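The two clustering steps can be sketched as follows. This is an illustration under stated assumptions, not TranscriptHunter's actual algorithm: alignments are given as lists of half-open (start, end) exon intervals on one strand, a locus is any set of alignments linked by overlapping exons, and variants are distinguished by their chain of intron edges. All function names are hypothetical.

```python
from itertools import combinations


def exons_overlap(a, b):
    """True if any exon interval in a overlaps any exon interval in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def cluster_by_exons(alignments):
    """Single-linkage clustering: alignments sharing an exon form one locus."""
    parent = {name: name for name in alignments}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(alignments, 2):
        if exons_overlap(alignments[a], alignments[b]):
            parent[find(a)] = find(b)

    loci = {}
    for name in alignments:
        loci.setdefault(find(name), set()).add(name)
    return list(loci.values())


def intron_chain(exons):
    """Intron edges (donor, acceptor) implied by consecutive exons."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def group_variants(alignments):
    """Group alignments that share an identical intron chain."""
    groups = {}
    for name, exons in alignments.items():
        groups.setdefault(intron_chain(exons), set()).add(name)
    return groups


alns = {
    "est1": [(100, 200), (300, 400)],
    "est2": [(150, 200), (300, 450)],
    "est3": [(100, 200), (350, 400)],   # alternative acceptor site
    "est4": [(1000, 1100)],             # separate single-exon locus
}
print(cluster_by_exons(alns))
print(group_variants(alns))
```

Clustering on exons rather than on whole alignment spans keeps a gene nested inside another gene's intron as its own locus.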

TranscriptHunter

Creation of Gene Models

•ORF and UTRs

•Gene Name

•Transcript Classification

•Curation Flags

Screening of spliced ESTs contained within repeat elements

AluYb8 Repeat

Spliced ESTs
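One way to screen spliced ESTs against repeat annotation is a containment test. The slide does not state the exact criterion, so the rule below (every exon lies entirely inside some repeat interval) is an assumption, as are the function name and coordinates.

```python
def contained_in_repeats(exons, repeats):
    """True if every exon (start, end) lies entirely inside a repeat interval."""
    return all(
        any(r_start <= start and end <= r_end for r_start, r_end in repeats)
        for start, end in exons
    )


# Hypothetical coordinates: two annotated repeat elements on one contig.
repeats = [(1000, 1300), (2000, 2310)]

est_in_repeats = [(1020, 1100), (2050, 2200)]   # both exons inside repeats
est_partial = [(900, 1100), (2050, 2200)]       # first exon starts outside

print(contained_in_repeats(est_in_repeats, repeats))
print(contained_in_repeats(est_partial, repeats))
```

ESTs that pass this test would be flagged for exclusion or manual review rather than used as primary evidence.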

Manual annotation

TranscriptHunter Gene Models

•Refine Gene Boundaries

•Exon/Intron

•3’ and 5’ UTR

•Create New Genes

•Classify Transcripts

•Edit Automated Gene Calls

•Identify Pseudogenes

•Add Curation Flags

•Call/Adjust ORF

•Select PolyA Signals
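For the poly A signal step, a minimal sketch of a scan for the two canonical signal hexamers (AATAAA and ATTAAA) near the 3' end of a transcript is shown below. The window size and function name are assumptions; how signals are actually scored or selected in Argo is not described in the slides.

```python
import re

# Lookahead so that overlapping hexamers are all reported.
POLYA_PATTERN = re.compile(r"(?=(AATAAA|ATTAAA))")


def find_polya_signals(seq, window=50):
    """Return (position, hexamer) pairs found in the last `window` bases."""
    tail_start = max(0, len(seq) - window)
    tail = seq[tail_start:].upper()
    return [(tail_start + m.start(), m.group(1))
            for m in POLYA_PATTERN.finditer(tail)]


# Toy transcript: signal starts at position 60, 20 bases before the 3' end.
transcript = "G" * 60 + "AATAAA" + "C" * 20
print(find_polya_signals(transcript))   # [(60, 'AATAAA')]
```

Positions are 0-based offsets into the full transcript, so a hit can be mapped back to the annotated 3' UTR.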

AnnotDB

Features of Argo

Attaching primary and supplemental evidence

Cluster feature display

Filtering and customizing evidence list

Display poly A signals and splice junctions

Alerting discrepancies before updating

Highlighting parent and child features

Real-time interactive analysis

ORF selection options

Tabular dump of selected features

Roll back and save work

Customization of feature display

Annotation View

Confidence levels of our gene models

Classification of transcripts (Hawk standards): Known, Novel_CDS, Novel, Putative, Pseudogene

Association of primary and supplemental evidence with annotated feature

Rank order in selection of supporting evidence

Curation flags

Free text comments

Gene counts for Broad and Ensembl

chrom   genome (%)   Ensembl known   Ensembl novel   Broad known   Broad novel+putative   Spl count   pseudogene
8       4.7          710             132             724           587                    2.6         298
15      2.7          581             165             589           556                    2.8         213
17      2.6          1120            167             1134          578                    3.3         264
18      2.5          265             73              289           275                    2.1         167
TOTAL   12.5         2676            537             2736          1996                               942
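The TOTAL row can be checked against the per-chromosome rows. The column order assumed here is genome (%), Ensembl known, Ensembl novel, Broad known, Broad novel+putative, pseudogene; the Spl count column has no total on the slide and is left out.

```python
# Per-chromosome counts transcribed from the slide (assumed column order:
# genome %, Ensembl known, Ensembl novel, Broad known,
# Broad novel+putative, pseudogene).
rows = {
    "8":  (4.7, 710, 132, 724, 587, 298),
    "15": (2.7, 581, 165, 589, 556, 213),
    "17": (2.6, 1120, 167, 1134, 578, 264),
    "18": (2.5, 265, 73, 289, 275, 167),
}

# Sum each column across the four chromosomes.
totals = [sum(column) for column in zip(*rows.values())]

print(round(totals[0], 1))   # 12.5
print(totals[1:])            # [2676, 537, 2736, 1996, 942]
```

All six column sums agree with the TOTAL row, which supports this reading of the flattened table.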

Manually Annotated Gene Models vs. public Gene Models

Broad

MGC

Refseq

ENSEMBL

Gene-wise

mRNA

Types of splice variation

Type % of variants

extra 31

skip 18

alt site 33

run on 18

CDS altered 84 %

new stop 48 %

Our data extend most RefSeq/MGC transcripts

Distribution of extensions relative to RefSeq or MGC evidence (human chromosomes 8, 15, 17, 18)

[Figure: histogram. x-axis: length of extension (bp), 100 to 1000; y-axis: % of extensions, 0.0 to 80.0; series: 5' and 3']

38% positive for 5' extension

71% positive for 3' extension

30% positive for both

79% positive for either

median 5' extension = 46 bases

median 3' extension = 143 bases
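The extension figures can be illustrated with a short sketch: how an extension length might be measured against a reference transcript, and why the "either" figure follows from the other three. The function name and coordinates are hypothetical; the slides do not describe how extensions were actually measured.

```python
def extensions(annotated, reference, strand="+"):
    """Return (five_prime, three_prime) extension lengths in bases.

    annotated and reference are (start, end) genomic spans of the same
    transcript; 0 means the annotation does not extend past the reference.
    """
    left = max(0, reference[0] - annotated[0])
    right = max(0, annotated[1] - reference[1])
    # On the minus strand the 5' end is at the higher coordinate.
    return (left, right) if strand == "+" else (right, left)


# Hypothetical coordinates chosen to reproduce the median values.
print(extensions((954, 5143), (1000, 5000)))        # (46, 143)
print(extensions((954, 5143), (1000, 5000), "-"))   # (143, 46)

# Inclusion-exclusion on the slide's percentages:
# either = 5' + 3' - both.
pct_5p, pct_3p, pct_both = 38, 71, 30
print(pct_5p + pct_3p - pct_both)                   # 79
```

The last line confirms that the four percentages on the slide are mutually consistent.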

Complete 3' end as compared to RefSeq mRNA and ENSEMBL gene

How valid are these 3’ and 5’ extensions?

[Figure: bar chart comparing Broad and ENSEMBL; y-axis: 0 to 1]

          PolyA signals   ^ATG…STOP$
Broad     86%             1.16%
ENSEMBL   68%             10.89%

Using Start and Stop Codon Context to Refine Annotation

Location of stop codons on exons

[Figure: bar chart. x-axis: exon order (n, n-1, n-2, n-3, n-4, n-5); y-axis: % stop codons, 0 to 100]

Location of start codons on exons

[Figure: bar chart. x-axis: exon order (1, 2, 3, 4, 5, 6); y-axis: % start codons, 0 to 70]

•Pseudogenes
•Real Stop codons
•NMD candidates
•Sequence Errors
•Non-coding genes
•SECIS genes

•Pseudogenes
•Real Start codons
•NMD candidates
•Sequence Errors
•Non-coding genes
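Locating a stop codon by exon, and flagging NMD candidates, can be sketched as below. The NMD test uses the commonly cited heuristic that a stop codon ending more than about 50 nt upstream of the last exon-exon junction may trigger nonsense-mediated decay; whether the Broad annotators applied this exact threshold is not stated in the slides. Function names and the example transcript are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def exon_index(exon_lengths, pos):
    """0-based index of the exon containing transcript position pos."""
    ends = list(accumulate(exon_lengths))   # cumulative exon end offsets
    if not 0 <= pos < ends[-1]:
        raise ValueError("position outside transcript")
    return bisect_right(ends, pos)


def is_nmd_candidate(exon_lengths, stop_pos, min_distance=50):
    """Heuristic: stop codon ends > min_distance nt before the last junction.

    stop_pos is the 0-based transcript position of the stop codon's
    first base. Single-exon transcripts have no junction and never qualify.
    """
    if len(exon_lengths) < 2:
        return False
    last_junction = sum(exon_lengths[:-1])
    return last_junction - (stop_pos + 3) > min_distance


exons = [100, 200, 150]   # exon lengths in transcript order

print(exon_index(exons, 400), is_nmd_candidate(exons, 400))   # 2 False
print(exon_index(exons, 120), is_nmd_candidate(exons, 120))   # 1 True
```

A stop codon in the last exon (index n) is the expected case; one in an earlier exon is a prompt to check for the alternatives listed above.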

Issues with Novel and putative transcripts

Concerns

•High number
•Low depth EST coverage
•Small transcript size
•Low no. of variants
•Poor coding potential
•Poor cross-species conservation
•Low poly A frequency
•Weak CpG context

Probable reasons

•Spurious transcription
•Mostly partial
•Temporal genes
•Non-coding
•Poorly expressed
•Lineage specific

[Figure legend: Putative, Novel, and Known transcripts]

Annotating Non-coding mRNAs is still a challenge!

Sno RNAs

Challenges Ahead….

Establishing Common Standards

Validating Novel Transcripts

Single Exon Expressed Sequences

Determination of Accurate ORFs

Annotation of Functionally Relevant Alternative Splice Forms

Finding Sparsely Expressed Genes

Annotation of New Types of Non-coding Functional mRNAs

Incremental Update of Annotation

Capturing Biological Exceptions

Acknowledgements

•Reinhard Engels

•Shunguang Wang

•Seth Purcell

•Tim Elkins

•Yuhong Wu

•Serge Smirnov

•Sarah Calvo

•David Dicaprio

Annotation and Analysis

•Charlie Whittaker

•Mark Borowsky

•Sinead O’Leary

•James Galagan

•Jill Mesirov

•Eric Lander

•Sequencing, Finishing and Closure Teams

Annotation Pipeline

Comparison of alternative splice forms between ENSEMBL and Broad annotation

Broad

ENSEMBL

Refseq

dbEST

nrnt-mRNA

Manually Annotated Gene Models vs. public Gene Models

ENSEMBL

GENEWISE

REFSEQ

Transcript Hunter

MANUAL ANNOTATION

ESTs

PolyA signal

Novel Transcript Variants of Known Genes
