Ensembl annotation

EBI is an Outstation of the European Molecular Biology Laboratory.


Bronwen Aken

21 September 2014

How Ensembl started

• Ewan Birney

• Michele Clamp

• Tim Hubbard

Ensembl’s goals

Annotate (vertebrate) genome

Integrate with other biological data

Make publicly available

• Stable, automatic annotation

• High quality

• Regular release cycles

• Open source

“Provide a bioinformatics framework to organise biology around the sequences of large genomes”

Challenges

1. Find functional elements in a genome

• Data have lots of noise

2. Software / hardware

• Storing and manipulating data

3. Intuitive and comprehensive access to data

• Visualization

GRCh38 annotation in Ensembl

What is Genebuilding?

• Automatic, evidence-based annotation of genes

• Not ab initio

• Based on sequence alignment

• “Best-in-genome”

• Aim for high specificity

• Prefer to miss a few features rather than heavily over-predict

The automated gene annotation pipeline is designed around decisions made during manual annotation

Advantages of re-annotating

• Add new genes to new / fixed genomic regions

• Updated supporting evidence: remove models built on data that have been deleted from archives

• Move alignments to regions with better mapping

Gene annotation pipeline – the basics

Identify interesting regions

• Rough alignment of sequences to genome

Exhaustive alignment to produce transcript models

Filter models

• Prioritize data sources

Produce ‘best guess’ gene set
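The "filter models, prioritize data sources" step can be illustrated with a minimal sketch. It assumes a simple rule: evidence sources are ranked, and a model from a lower-priority source is kept only if it does not overlap a model already kept from a higher-priority source. The source names, the priority order and the overlap rule are illustrative assumptions, not the actual Ensembl pipeline logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    source: str  # evidence type the model was built from (hypothetical labels)
    start: int   # 1-based inclusive genomic start
    end: int     # 1-based inclusive genomic end


# Hypothetical priority order, highest priority first.
PRIORITY = ["same_species_protein", "cdna", "other_species_protein"]


def overlaps(a: Model, b: Model) -> bool:
    """True if the two models share at least one base."""
    return a.start <= b.end and b.start <= a.end


def layer_filter(models: List[Model]) -> List[Model]:
    """Keep models layer by layer; lower layers only fill unoccupied regions.

    Models whose source is not listed in PRIORITY are ignored.
    """
    kept: List[Model] = []
    for source in PRIORITY:
        higher = list(kept)  # models accepted from higher-priority layers
        for m in (m for m in models if m.source == source):
            if not any(overlaps(m, h) for h in higher):
                kept.append(m)
    return sorted(kept, key=lambda m: m.start)


candidates = [
    Model("other_species_protein", 100, 500),
    Model("same_species_protein", 400, 900),
    Model("cdna", 2000, 2500),
    Model("other_species_protein", 3000, 3500),
]
best_guess = layer_filter(candidates)
```

Here the other-species model at 100-500 is dropped because a same-species model already covers part of that region, while the one at 3000-3500 survives because nothing of higher priority overlaps it.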

[Figure: protein-coding genebuild. Labels: Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; Filtering; UTR addition; Filtering; TranscriptConsensus; LayerAnnotation; Final gene set. Also: small ncRNAs, lincRNAs, pseudogenes.]

[Figure: protein-coding genebuild with RNA-Seq models. Labels: Repeatmasking; Same-species proteins; Other-species proteins; cDNAs/ESTs; RNA-Seq models; Filtering; UTR addition; Filtering; Final gene set. Also: small ncRNAs, lincRNAs, pseudogenes. Final step: merge with HAVANA.]

Release cycle

26 September 2014

[Figure: Regulation, Gene, Allele, Conserved sequence. Adapted from the ENCODE project, www.nature.com/nature/focus/encode/]

Genes

• Coding & noncoding

• Protein & mRNA alignments

• GTF & BAM files

Compara

• Conserved DNA sequence

• Multiple genome alignments

• Homologues

• Protein families

Regulatory regions

• DNA methylation

• TFBS

• Open chromatin

Variation

• SNPs, indels, structural variation

• Phenotypes

• QTLs

Integrate with other species

[Figure: Chimpanzee and Human, gene SLC12A1]

‘Patch’ annotation in Ensembl

Genome assembly representation

• Coord_system table

• Lists the allowed coordinate systems

• chromosome, scaffold, contig

• With ‘versions’

• GRCh37, GRCh38

• Contigs are shared between assemblies so have no version

• ‘Toplevel’ coordinate system

• Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences

• Most popular means to access the whole genome

• API options for including/excluding alternate sequences and PAR
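As a rough illustration of the ‘toplevel’ idea, the sketch below models sequence regions as plain records and selects the toplevel ones, with a switch for alternate sequences. The class, field and function names are hypothetical and do not mirror the Ensembl API; the scaffold name is made up, and placing HG183_PATCH in the chromosome coordinate system is an assumption for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SeqRegion:
    name: str
    coord_system: str        # e.g. "chromosome", "scaffold", "contig"
    toplevel: bool           # True if nothing is assembled on top of this region
    alternate: bool = False  # True for patches, haplotypes and similar


def fetch_toplevel(regions: List[SeqRegion],
                   include_alternate: bool = False) -> List[SeqRegion]:
    """Return toplevel regions, optionally including alternate sequences."""
    return [
        r for r in regions
        if r.toplevel and (include_alternate or not r.alternate)
    ]


# Illustrative records only.
REGIONS = [
    SeqRegion("17", "chromosome", toplevel=True),
    SeqRegion("unplaced_scaffold_1", "scaffold", toplevel=True),  # made-up name
    SeqRegion("HG183_PATCH", "chromosome", toplevel=True, alternate=True),
    SeqRegion("AC025362.12", "contig", toplevel=False),
]

primary_only = [r.name for r in fetch_toplevel(REGIONS)]
with_alternates = [r.name for r in fetch_toplevel(REGIONS, include_alternate=True)]
```

Defaulting to the primary assembly keeps a whole-genome scan from visiting the same locus twice; callers that want patches have to ask for them explicitly.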


[Figure: GRCh38 assembly layers: Chromosome, Scaffolds, Contigs. DNA only loaded for contigs.]




[Figure: GRCh38 and GRCh37 assembly layers: Chromosome, Scaffolds, Contigs.]



Seq_region names

• Regions of the genome are given a slice name; it’s like an address

• e.g. chromosome:GRCh37:6:133090509:133119701:1

• Users like to say, ‘chromosome 6’

• INSDC coordinates are versioned, but less human-readable

• chromosome:GRCh37:CM000668.1:133090509:133119701:1

Slice name fields, in order: coord_system : assembly : seq_region.name : start : end : strand
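A slice name of this form can be split into its six colon-separated fields. The sketch below is a small standalone parser written for illustration; the type and function names are hypothetical and are not part of the Ensembl API.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str
    assembly: str
    seq_region_name: str
    start: int
    end: int
    strand: int

    @property
    def length(self) -> int:
        """Number of bases covered (coordinates are 1-based, inclusive)."""
        return self.end - self.start + 1


def parse_slice_name(name: str) -> SliceName:
    """Split 'coord_system:assembly:seq_region:start:end:strand' into fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}: {name!r}")
    coord_system, assembly, seq_region_name, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region_name,
                     int(start), int(end), int(strand))


by_name = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
by_accession = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")
```

Both names on the slide parse to the same coordinates; only the seq_region name differs ("6" versus the versioned INSDC accession).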

Alternate sequences

• Assembly_exception table defines ‘bubbles’

• Initially set up to handle Y chromosome PAR

• Adapted to work for MHC haplotypes

• Now also used for GRC patches

• Assumes ‘equivalent’ region will be present in primary assembly
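One way to picture such a ‘bubble’ is as a record tying an alternate sequence to the equivalent region of the primary assembly. The sketch below uses that idea to answer "which alternate sequences exist for this primary region?". The field names are simplified assumptions, not the real assembly_exception columns, and all names and coordinates are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AssemblyException:
    alt_name: str       # name of the alternate sequence
    exc_type: str       # kind of exception, e.g. "PAR", "HAP", "PATCH"
    primary_name: str   # primary-assembly region the bubble replaces
    primary_start: int  # 1-based inclusive start of the equivalent region
    primary_end: int    # 1-based inclusive end of the equivalent region


def alternates_for(exceptions: List[AssemblyException],
                   region_name: str, start: int, end: int) -> List[AssemblyException]:
    """Return exceptions whose equivalent primary region overlaps the query."""
    return [
        e for e in exceptions
        if e.primary_name == region_name
        and e.primary_start <= end
        and start <= e.primary_end
    ]


# Made-up records; coordinates do not describe real patches or haplotypes.
EXCEPTIONS = [
    AssemblyException("ALT_PATCH_A", "PATCH", "17", 1_000_000, 1_300_000),
    AssemblyException("ALT_HAP_B", "HAP", "6", 5_000_000, 9_000_000),
]

hits = alternates_for(EXCEPTIONS, "17", 1_250_000, 1_400_000)
misses = alternates_for(EXCEPTIONS, "17", 2_000_000, 2_100_000)
```

The lookup only works because every alternate sequence is anchored to a primary region, which is the assumption noted in the last bullet above.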

Gene annotation on a ‘patched’ genome

[Figure: Ensembl browser views of HG183_PATCH (62.3Mb to 62.5Mb) above the corresponding region of primary chromosome 17 (62.225Mb to 62.475Mb), labelled "Patched chromosome" and "Primary chromosome". Region sizes shown: 331.04 kb and 276.06 kb. Tracks include contigs and GENCODE genes.]

• TEX2 gene lies across the patch boundary

• PECAM1 is annotated only on patch HG183

• Gap in primary assembly


Gene annotation on patches

[Figure: patch and primary assembly, built up over three steps]

1. Manual annotation

2. Project models to patch

3. Gap-fill with mini genebuild
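Step 2 can be sketched as a coordinate transfer through alignment blocks between the primary assembly and the patch. The version below assumes ungapped, same-orientation blocks and requires each exon to fall entirely within one block; a transcript that cannot be fully projected returns None, leaving that locus for the gap-filling step. The data layout and function names are illustrative assumptions, not the Ensembl implementation.

```python
from typing import List, Optional, Tuple

# (primary_start, primary_end, patch_start): ungapped, forward-orientation block,
# 1-based inclusive coordinates.
Block = Tuple[int, int, int]
Exon = Tuple[int, int]  # (start, end) on the primary assembly


def project_exon(exon: Exon, blocks: List[Block]) -> Optional[Exon]:
    """Map one exon onto the patch, or None if no single block contains it."""
    start, end = exon
    for primary_start, primary_end, patch_start in blocks:
        if primary_start <= start and end <= primary_end:
            offset = patch_start - primary_start
            return (start + offset, end + offset)
    return None


def project_transcript(exons: List[Exon], blocks: List[Block]) -> Optional[List[Exon]]:
    """Project every exon; fail as a whole if any exon cannot be mapped."""
    projected = [project_exon(e, blocks) for e in exons]
    if any(p is None for p in projected):
        return None
    return projected


# Made-up alignment: the patch carries 500 extra bases between the two blocks.
BLOCKS = [(1000, 1999, 1), (2000, 2999, 1501)]

ok = project_transcript([(1100, 1200), (2100, 2200)], BLOCKS)
failed = project_transcript([(1100, 1200), (1950, 2050)], BLOCKS)
```

Failing the whole transcript rather than projecting a partial model is a conservative choice, in keeping with the preference for specificity stated earlier in the talk.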

Ongoing challenges

• How strict should we be when aligning proteins and cDNAs to the genome?

1. Genome assembly

• Sequencing error (inversion, artificial duplication)

• Assembly incomplete

• Alignments must allow for truncated matches

2. Population variation

• The linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals

• Paralogues, gene families, pseudogenes

3. Public databases, e.g. UniProt

• Include suspect data and are incomplete for many species

• When there’s a match, or no match, is it biologically real?

• Aligning proteins from other species must allow for mismatches

Specificity vs. sensitivity
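The strictness question can be made concrete with a small filter on percent identity and coverage, using looser identity cut-offs for other-species evidence so that expected mismatches are tolerated. All threshold values here are placeholders chosen for illustration, not values used by Ensembl, and the record layout is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alignment:
    evidence_id: str
    same_species: bool
    identity: float  # percent of aligned positions that match (0-100)
    coverage: float  # percent of the evidence sequence that aligned (0-100)


# (min_identity, min_coverage) keyed by same_species. Placeholder values only.
THRESHOLDS: Dict[bool, Tuple[float, float]] = {
    True: (97.0, 90.0),   # same species: expect near-identical matches
    False: (60.0, 80.0),  # other species: allow for mismatches
}


def passes(aln: Alignment,
           thresholds: Dict[bool, Tuple[float, float]] = THRESHOLDS) -> bool:
    """True if the alignment meets both cut-offs for its evidence class."""
    min_identity, min_coverage = thresholds[aln.same_species]
    return aln.identity >= min_identity and aln.coverage >= min_coverage


def filter_alignments(alns: List[Alignment]) -> List[str]:
    """Return the evidence IDs of alignments that pass the filter."""
    return [a.evidence_id for a in alns if passes(a)]


ALIGNMENTS = [
    Alignment("same_good", True, 99.5, 98.0),
    Alignment("same_divergent", True, 88.0, 98.0),  # possible paralogue or pseudogene
    Alignment("other_ok", False, 72.0, 85.0),
    Alignment("other_truncated", False, 72.0, 40.0),
]
kept_ids = filter_alignments(ALIGNMENTS)
```

Raising the cut-offs trades sensitivity for specificity. Lowering the coverage cut-off would let truncated matches through, which an incomplete assembly may require.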

Funding

European Commission Framework Programme 7

Ensembl Acknowledgements

Questions?

Reporting data to users

Visualisation and data querying:

• When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?

• How do we show when the alternate genomic sequences are identical or differ from one another?

• How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?

• How do we make a qualitative call about which allele is “better” to use? e.g. ABO

• Data download options

• Concept of a ‘canonical’ transcript per gene (per tissue)

Data analysis:

• Linking between alternate alleles (and paralogues?)

• How do we show when data have been mapped from an old to a new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align?

• In a non-linear genome model, how will SNPs (rsIDs) work?

• In a non-linear genome model, what coordinate system should be used?

Science
