GRC Workshop held at Churchill College on Sep 21, 2014. Talk by Bronwen Aken discussing the Ensembl approach to annotating the complete human reference assembly.
EBI is an Outstation of the European Molecular Biology Laboratory.
Ensembl annotation
Bronwen Aken
21 September 2014
How Ensembl started
• Ewan Birney
• Michele Clamp
• Tim Hubbard
Ensembl’s goals
Annotate (vertebrate) genome
Integrate with other biological data
Make publicly available
• Stable, automatic annotation
• High quality
• Regular release cycles
• Open source
“Provide a bioinformatics framework to organise biology around the sequences of large genomes”
Challenges
1. Find functional elements in a genome
• Data have lots of noise
2. Software / hardware
• Storing and manipulating data
3. Intuitive and comprehensive access to data
• Visualization
GRCh38 annotation in Ensembl
What is Genebuilding?
• Automatic, evidence-based annotation of genes
• Not ab initio
• Based on sequence alignment
• “Best-in-genome”
• Aim for high specificity
• Prefer to miss a few features rather than heavily over-predict
• The automated gene annotation pipeline is designed around decisions made during manual annotation
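The “best-in-genome” idea above can be sketched as follows. This is a minimal illustration, not Ensembl's code: the `Alignment` record, the score values and the query names `protA` and `protB` are all made up for the example. Each query sequence keeps only its highest-scoring placement on the genome, which favours specificity over sensitivity.

```python
from collections import namedtuple

# Hypothetical alignment record: one query sequence placed at one genomic location.
Alignment = namedtuple("Alignment", "query region start end score")


def best_in_genome(alignments):
    """Keep only the highest-scoring genomic placement for each query."""
    best = {}
    for aln in alignments:
        current = best.get(aln.query)
        if current is None or aln.score > current.score:
            best[aln.query] = aln
    return best


# Made-up hits: protA aligns twice, once well and once weakly (e.g. to a paralogue).
hits = [
    Alignment("protA", "17", 62_300_000, 62_310_000, 98.5),
    Alignment("protA", "3", 1_000_000, 1_009_000, 71.0),
    Alignment("protB", "6", 133_090_509, 133_119_701, 99.1),
]
kept = best_in_genome(hits)
```

Here `kept` holds one placement per query, so the weaker `protA` hit on region 3 is discarded.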
Advantages of re-annotating
• Add new genes to new / fixed genomic regions
• Updated supporting evidence: remove models built on data that have been deleted from archives
• Move alignments to regions with better mapping
Gene annotation pipeline – the basics
Identify interesting regions
• Rough alignment of sequences to genome
Exhaustive alignment to produce transcript models
Filter models
• Prioritize data sources
Produce ‘best guess’ gene set
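One way to picture “filter models, prioritize data sources” is a layered filter. The sketch below is an assumption-laden illustration: the source names, their priority order and the simple overlap rule are hypothetical, and it is not a description of how the Ensembl filtering modules actually work. Models from a lower-priority source are kept only where no higher-priority model already covers the locus.

```python
def overlaps(a, b):
    """True if two (region, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def layer_models(models_by_source, priority):
    """Accept models layer by layer, highest-priority source first.

    A model is dropped if it overlaps a model accepted from a higher layer.
    """
    accepted = []
    for source in priority:
        higher = list(accepted)  # snapshot: models from higher layers only
        for model in models_by_source.get(source, []):
            if not any(overlaps(model, kept) for _, kept in higher):
                accepted.append((source, model))
    return accepted


# Hypothetical models as (region, start, end); the priority order is an assumption.
models = {
    "same_species_protein": [("17", 100, 500)],
    "other_species_protein": [("17", 400, 900), ("17", 2000, 2500)],
    "cdna_est": [("17", 2400, 2600), ("17", 5000, 5200)],
}
priority = ["same_species_protein", "other_species_protein", "cdna_est"]
result = layer_models(models, priority)
```

In this toy input, the other-species model at 400 to 900 and the cDNA model at 2400 to 2600 are dropped because a higher layer already covers those loci.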
[Figure: protein-coding genebuild. Stages shown: Repeatmasking; Same-species proteins, Other-species proteins, cDNAs/ESTs; Filtering (TranscriptConsensus, LayerAnnotation); RNA-Seq models; UTR addition; Final gene set. Also: small ncRNAs, lincRNAs, pseudogenes. Final step: MERGE WITH HAVANA.]
Release cycle
26 September 2014
[Figure: Regulation, Gene, Allele, Conserved sequence. Adapted from the ENCODE project, www.nature.com/nature/focus/encode/]
Genes
• Coding & noncoding
• Protein & mRNA alignments
• GTF & BAM files
Compara
• Conserved DNA sequence
• Multiple genome alignments
• Homologues
• Protein families
Regulatory regions
• DNA methylation
• TFBS
• Open chromatin
Variation
• SNPs, indels, structural variation
• Phenotypes
• QTLs
Integrate with other species
[Figure: gene SLC12A1 in chimpanzee and human]
‘Patch’ annotation in Ensembl
Genome assembly representation
• Coord_system table
• Lists the allowed coordinate systems
• chromosome, scaffold, contig
• With ‘versions’
• GRCh37, GRCh38
• Contigs are shared between assemblies so have no version
• ‘Toplevel’ coordinate system
• Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences
• Most popular means to access the whole genome
• API options for including/excluding alternate sequences and PAR
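The ‘toplevel’ idea can be modelled with a small in-memory sketch. Everything here is hypothetical: the `SeqRegion` class, its fields, the region names and the `include_alternate` flag are illustrative stand-ins, not the Ensembl schema or API. It shows versioned coordinate systems, unversioned contigs, and an option for including or excluding alternate sequences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeqRegion:
    """Hypothetical stand-in for a sequence region and its coordinate system."""
    name: str
    coord_system: str          # "chromosome", "scaffold" or "contig"
    version: Optional[str]     # e.g. "GRCh38"; None for contigs (shared between assemblies)
    toplevel: bool = False
    alternate: bool = False    # alternate sequence such as a patch or haplotype


def toplevel_regions(regions, version, include_alternate=False):
    """Return toplevel regions for one assembly version, optionally with alternates."""
    return [
        r for r in regions
        if r.toplevel
        and r.version == version
        and (include_alternate or not r.alternate)
    ]


# Made-up regions for illustration only.
regions = [
    SeqRegion("17", "chromosome", "GRCh38", toplevel=True),
    SeqRegion("scaffold_1", "scaffold", "GRCh38", toplevel=True),   # unplaced scaffold
    SeqRegion("ALT_1", "chromosome", "GRCh38", toplevel=True, alternate=True),
    SeqRegion("contig_1", "contig", None),                          # no version
]
primary_only = [r.name for r in toplevel_regions(regions, "GRCh38")]
with_alternates = [r.name for r in toplevel_regions(regions, "GRCh38", include_alternate=True)]
```

Excluding alternates by default is a design choice made for this sketch; the talk only says that the API offers both options.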
Genome assembly representation
[Figure: assembly hierarchy for GRCh38 and GRCh37: chromosome, scaffolds, contigs]
• DNA only loaded for contigs
Seq_region names
• Regions of the genome are given a slice name; it’s like an address
• e.g. chromosome:GRCh37:6:133090509:133119701:1
• Users like to say, ‘chromosome 6’
• INSDC coordinates are versioned, but less human-readable
• e.g. chromosome:GRCh37:CM000668.1:133090509:133119701:1
• Fields, in order: coord_system:assembly:seq_region.name:start:end:strand
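A slice name of this form can be split into its six fields with a few lines of code. This is a minimal sketch: the `SliceName` type and `parse_slice_name` function are names invented here, not part of the Ensembl API, and the field order is taken from the example on the slide.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str
    assembly: str          # empty string when the coordinate system has no version
    seq_region_name: str
    start: int
    end: int
    strand: int


def parse_slice_name(name):
    """Split 'coord_system:assembly:seq_region_name:start:end:strand' into fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}: {name!r}")
    coord_system, assembly, seq_region_name, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region_name,
                     int(start), int(end), int(strand))


s = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
length = s.end - s.start + 1  # coordinates are 1-based and inclusive
```

The same parser handles the INSDC form, where the third field is an accession such as CM000668.1 rather than a chromosome name.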
Alternate sequences
• Assembly_exception table defines ‘bubbles’
• Initially set up to handle Y chromosome PAR
• Adapted to work for MHC haplotypes
• Now also used for GRC patches
• Assumes ‘equivalent’ region will be present in primary assembly
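The ‘bubble’ idea can be sketched as a single record that pairs a span on the alternate sequence with its equivalent span on the primary assembly. The class and field names below are illustrative assumptions, not the actual assembly_exception schema, and the coordinates are made up. Positions outside the bubble resolve to the primary sequence; positions inside it resolve to the alternate sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyException:
    """Hypothetical bubble record (1-based, inclusive coordinates)."""
    alt_name: str    # alternate sequence, e.g. a patch or haplotype
    alt_start: int   # span of the bubble on the alternate chromosome
    alt_end: int
    ref_name: str    # primary-assembly sequence
    ref_start: int   # equivalent span on the primary assembly
    ref_end: int


def resolve(exc, pos):
    """Map a position on the alternate chromosome to (sequence name, position)."""
    if pos < exc.alt_start:
        # Upstream of the bubble: primary sequence, counted back from the bubble start.
        return (exc.ref_name, exc.ref_start - (exc.alt_start - pos))
    if pos <= exc.alt_end:
        # Inside the bubble: sequence comes from the alternate region itself.
        return (exc.alt_name, pos)
    # Downstream of the bubble: primary sequence, counted on from the bubble end.
    return (exc.ref_name, exc.ref_end + (pos - exc.alt_end))


# Made-up bubble: the alternate span is 200 bp longer than the primary span.
exc = AssemblyException("ALT_1", 1000, 1500, "17", 1000, 1300)
upstream = resolve(exc, 500)
inside = resolve(exc, 1200)
downstream = resolve(exc, 1600)
```

Because the alternate span is longer than its primary equivalent in this example, positions downstream of the bubble shift by 200 bp relative to the primary assembly.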
Gene annotation on a ‘patched’ genome
[Figure: Ensembl browser views of patch HG183_PATCH (62.3 to 62.5 Mb) and the equivalent region of primary chromosome 17 (62.225 to 62.475 Mb), labelled ‘Patched chromosome’ and ‘Primary chromosome’. Tracks: assembly exceptions, genes (GENCODE), contigs, H.sap-H.sap lastz alignments, GRC alignment. Gene legend: protein coding, merged Ensembl/Havana, RNA gene, pseudogene.]
• TEX2 gene lies across the patch boundary
• PECAM1 is annotated only on patch HG183
• Gap in primary assembly
Gene annotation on patches
[Figure: patch and primary sequence shown at each step]
1. Manual annotation
2. Project models to patch
3. Gap-fill with mini genebuild
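Steps 2 and 3 can be sketched with a toy coordinate projection. This is an illustration under stated assumptions, not the Ensembl projection code: the ungapped block representation, the function names and the gene names are all hypothetical, and step 1 (manual annotation) is not modelled. Genes that fall wholly inside an aligned block are shifted into patch coordinates, and patch regions with no primary equivalent are reported as candidates for a mini genebuild.

```python
def project_gene(gene, aligned_blocks):
    """Project a (name, start, end) gene from primary to patch coordinates.

    aligned_blocks: list of ungapped (primary_start, primary_end, patch_start).
    Returns None if the gene is not fully inside a single block.
    """
    name, start, end = gene
    for p_start, p_end, patch_start in aligned_blocks:
        if p_start <= start and end <= p_end:
            offset = patch_start - p_start
            return (name, start + offset, end + offset)
    return None


def annotate_patch(primary_genes, aligned_blocks, patch_length):
    """Return (projected genes, uncovered patch intervals needing gap-fill)."""
    projected = []
    for gene in primary_genes:
        result = project_gene(gene, aligned_blocks)
        if result is not None:
            projected.append(result)

    # Patch intervals covered by an aligned block, in patch coordinates.
    covered = sorted((ps, ps + (pe - p0)) for p0, pe, ps in aligned_blocks)
    gaps, cursor = [], 1
    for start, end in covered:
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= patch_length:
        gaps.append((cursor, patch_length))
    return projected, gaps


# Made-up patch: 500 bp of novel sequence between two aligned blocks.
blocks = [(1000, 1999, 1), (2000, 2999, 1501)]
genes = [("geneA", 1100, 1400), ("geneB", 1900, 2100), ("geneC", 2200, 2500)]
projected, gaps = annotate_patch(genes, blocks, patch_length=2500)
```

In this toy input `geneB` spans the block boundary and is not projected, and the novel 500 bp interval is returned as a gap; a real pipeline would need to decide how to treat such boundary-spanning genes.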
Ongoing challenges
• How strict should we be when aligning proteins and cDNAs to the genome?
1. Genome assembly
• Sequencing error (inversion, artificial duplication)
• Assembly incomplete
• Alignments must allow for truncated matches
2. Population variation
• The linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals
• Paralogues, gene families, pseudogenes
3. Public databases, e.g. UniProt
• Include suspect data and are incomplete for many species
• When there’s a match, or no match, is it biologically real?
• Aligning proteins from other species must allow for mismatches
Specificity vs sensitivity
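The strictness question can be made concrete with a toy threshold filter. The identity and coverage cut-offs and the three hits below are invented for illustration and are not Ensembl's settings. A strict filter keeps only near-perfect alignments and misses a real but divergent match; a lenient filter recovers it but also lets through a spurious one.

```python
def filter_hits(hits, min_identity, min_coverage):
    """Keep alignments that meet both thresholds (values are fractions, 0 to 1)."""
    return [
        h for h in hits
        if h["identity"] >= min_identity and h["coverage"] >= min_coverage
    ]


# Made-up alignments; "real" marks whether the match is biologically genuine.
hits = [
    {"id": "a", "identity": 0.99, "coverage": 0.98, "real": True},
    {"id": "b", "identity": 0.85, "coverage": 0.95, "real": True},   # divergent but real
    {"id": "c", "identity": 0.80, "coverage": 0.60, "real": False},  # e.g. pseudogene-like
]

# Placeholder thresholds, chosen only to show the trade-off.
strict = [h["id"] for h in filter_hits(hits, 0.95, 0.90)]
lenient = [h["id"] for h in filter_hits(hits, 0.75, 0.50)]
```

With these numbers the strict setting favours specificity (only `a` survives) and the lenient setting favours sensitivity (`a`, `b` and `c` all survive).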
Funding
European Commission Framework Programme 7
Ensembl Acknowledgements
Questions?
Reporting data to users
Visualisation and data querying:
• When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?
• How do we show when the alternate genomic sequences are identical to or differ from one another?
• How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?
• How do we make a qualitative call about which allele is “better” to use? e.g. ABO
• Data download options
• Concept of a ‘canonical’ transcript per gene (per tissue)
Data analysis:
• Linking between alternate alleles (and paralogues?)
• How do we show when data have been mapped from an old to a new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align?
• In a non-linear genome model, how will SNPs (rsIDs) work?
• In a non-linear genome model, what coordinate system should be used?