Ensembl: A Genomic Toolset for pigs, poultry, …...Ensembl: A Genomic Toolset for pigs, poultry,...

Preview:

Citation preview

Ensembl: A Genomic Toolset for pigs, poultry, plants, pests, pathogens and pollinators

Ensembl: A Genomic Toolset for pigs, poultry, plants, pests, pathogens and pollinators

Paul Kersey

pkersey@ebi.ac.uk 18.03.20152

A brief history of genome sequencing

• 1995 Haemophilus influenzae 1.8 Mb

• 1996 Saccharomyces cerevisiae 12 Mb

• 1999 Drosophila melanogaster 140 Mb

• 2001 Homo sapiens 3.1 Gb

• Sequencing technology is continuously improving, but (massively parallel) “next generation” techniques really were game-changers

Cost of Sequencing a Human Genome 2001-2013

pkersey@ebi.ac.uk 18.03.20154

A brief history of genome sequencing

• 2008-2015 1000 genomes project (2500 human genomes)

• 2008-2015 1001 genomes project (1,0001 Arabidopsis genomes)

• 2015-2019 Genomics England (100,000 human genomes)

pkersey@ebi.ac.uk 18.03.20155

What can we do with thousands of genome sequences?• Statistical association of traits with markers

• Increased marker resolution to find causative variants

• Understand population structure and evolutionary processes

• Track epidemics

• Assay for known variation

• Environmental distribution

• Tool for managing crosses

• More genomes…

• More statistical power, find rarer causative alleles

pkersey@ebi.ac.uk 18.03.20156

Thousands of genomes – a tool for breeding

• Characterize germplasm of land races and wild relatives

• Understand what’s actually present in an existing line

• Find alleles associated with traits

• Combine genotyping with various (laboratory, greenhouse, field) phenotyping mechanisms, themselves increasingly automated and high-throughput

• Manage crosses

pkersey@ebi.ac.uk 18.03.20157

Everyone can do their own experiments, but…

• EMBL-EBI would like to maintain a cataologue of reference genomes and variants for all majorly studied species

• Selected lines can be re-phenotyped and analysed against the same reference data

• One major challenge: organising the pan-genome

• No single genome is enough to serve as a reference for many species

• Variants, functional elements present in some strains but not in the reference

• Reference is still a useful concept: but needs to be extended –“choose your own reference” according to need

pkersey@ebi.ac.uk 18.03.20158

Phenotyping data

• Immensely varied

• Dependent on an environment (GxPxE)

• Anything you can measure – from molecular assays to in-field imaging

• Increasing use of structured controlled vocabularies for human readable, inter-operable data summaries

• Meta data is critical

• What has been assayed?

• Where was it assayed?

• How has it been assayed?

pkersey@ebi.ac.uk 18.03.20159

The EBI mission• EMBL-EBI provides freely available data from life

science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.

• We also coordinated the ELIXIR pilot phase and are hosting the ELIXIR hub

pkersey@ebi.ac.uk 18.03.201510

EBI provides…

• Structured archives (and associated submission services) for most major types of molecular biological data

• e.g. European Nucleotide Archive (part of the ENA-GenBank-DDBJ International Nucleotide Sequence Database Consortium)

• European Variation Archive – now accepting submissions in VCF format

• ArrayExpress, PRIDE, Metabolights

• Integrative, interpreted services providing access to that data in a biologically meaningful context

• e.g. Ensembl

pkersey@ebi.ac.uk ELIXIR Innovation and SME Forum Wageningen 18th-19th March 2015

18.03.201511

Ensembl

• A modular suite of software for genome analysis and visualisation developed jointly by the Wellcome Trust Sanger Institute and the European Bioinformatics Institute

• Now used for genomes from across the taxonomic space

• Offers a standard set of interfaces to a wide range of genome-scale data, including:

• Web-based GUI

• Public mySQL server

• Perl and REST-ful APIs

• FTP

• Data mining tool (constructed using BioMart) framework with its own set of interfaces: web GUI, web services, command line and local client

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201526th July 2013 pkersey@ebi.ac.uk

vertebrates

metazoaplants

protistsfungibacteria

• Farm animals

• Crop plants

• Pests

• Vectors

• Pollinators

• Pathogens

• Symbionts and commensuals

Agriculturally relevant species in Ensembl

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201514

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201515

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201516

18.03.201517

Gene tree pipeline

O r t h o l o g s & P a r a l o g s

Take canonical protein for each gene belonging to one Ensembl Genomes clade

Cluster: WU-BLASTP + Smith-Waterman all-versus-all, hcluster_sg

Align: multiple aligners consensified by M-Coffee

Build trees: PhyML-WAG + PhyML-HKY + NJ-p + NJ-dN + NJ-dS + species tree → TreeBeST-merge

Infer orthologues and paralogues

Paralogues:

Any gene pairwise relationship where the ancestor node is a duplication event

Orthologues:

Any gene pairwise relationship where the ancestor node is a speciation event

Orthologues and paralogues

• ortholog_one2one

• ortholog_one2many

• ortholog_many2many

• apparent_ortholog_one2one

• possible_ortholog (weakly supported duplication node)

• within_species_paralog

• other_paralog (too distant to be in the same tree)

• contiguous_gene_split (artefact)

• putative_gene_split (artefact)

Orthology / paralogy types

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201502.02.2015 pkersey@ebi.ac.uk21

• Only for certain combinations of species

• Generated using (B)LASTz-net

Synteny

• Organisms of relatively recent divergence show similar blocks of genes in the same relative positions in the genome

• Shows how the genome is “cut and pasted” in the course of evolution

• Calculated using pairwise whole genome alignments

• Only for certain combinations of species

Pairwise whole genome alignments & synteny

Ensembl

• Ensembl supports many livestock species

• Ensembl provides automatic gene annotation for these species

• Ensembl works with Havana to support manual annotation in Pigs

• Ensembl provides Variation databases and functional annotation where the data exists

• Ensembl is playing an active role in FAANG and will integrate the functional data generated as it becomes available

FAANG

• Functional Annotation of Animal Genomes

• High quality transcriptomic and regulatory annotation of Animal Genomes

• Open Data released pre-publication

• Common data and analysis standards

• EBI leading establishment of infrastructure for data sharing and standard

• http://www.faang.org

pkersey@ebi.ac.uk18.03.201525

pkersey@ebi.ac.uk18.03.201526

pkersey@ebi.ac.uk 18.03.201527

The bread wheat genome

• Large – haploid genome size is > 5 Gb

• But in fact, the genome is an alloxhexaploid (triploid genome size ~ 16 Gb)

• Each diploid genome is quite homozygous

pkersey@ebi.ac.uk18.03.201528

Evolution of hexaploid bread wheat

pkersey@ebi.ac.uk 18.03.201529

The bread wheat genome

• Genome has been sequenced by Illumina after chromosome sorting

• Assembly is fragmented, but gene models are broadly comparable to other grasses

• Chromosome 3B has been sequenced BAC-by-BAC

pkersey@ebi.ac.uk 18.03.201530

Wheat in Ensembl Plants

• We represent the IWGSC chromosome survey sequence with the addition of the “finished” 3B sequence.

• We also use PopSeq data (from IPK, Gatersleben) to group scaffolds into bins based on genetic locations

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201531

1:1 orthology calls over 19 cereals including the three sub-genomes of bread wheat

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201532

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201533

pkersey@ebi.ac.uk 18.03.201534

Polymorphism data for bread wheat

• ~900,000 SNPs provided by CerealsDB, as follows:

• The Axiom 820K SNP Array contains 820,000 SNPs of which ~684,000 have been mapped.

• The iSelect 80K Array contains over 80,000 SNP loci of which ~58,000 have been mapped.

• The KASP probeset contains ~3,900 SNP loci of which ~3,100 have been loaded in Ensembl Plants

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201535

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201536

pkersey@ebi.ac.uk18.03.201537

pkersey@ebi.ac.uk 18.03.201538

Inter-homoeologous variants

Genome combination

Mismatchlength (in reference genome), bp

Alignment length, bp

% mismatch

B on A 2,881,969 41,739,915 6.90

D on A 2,665,562 43,228,044 6.17

A on B 2,892,005 41,749,951 6.93

D on B 2,739,967 44,238,039 6.34

A on D 2,689,840 43,252,322 6.22

B on D 2,745,993 43,244,065 6.35

Mismatch defined as length on reference not matched in non reference

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201518.03.201539

Bread wheat whole genome alignment

• DNA-DNA pairwise alignments with lastZ

• Brachypodium distachyon: 617,996,145 Mb (14% of bread wheat) in 1,310,922 blocks

• Hordeum vulgare: 423,284,874 Mb (9% of bread wheat) in 2,902,234 blocks

• Oryza sativa Japonica: 312,857,683 Mb out of 4,460,951,632 (7% of bread wheat) in 718,036 blocks

pkersey@ebi.ac.uk 18.03.201540

pkersey@ebi.ac.uk 18.03.201541

Additional alignment data for bread wheat

• Repbase repeats

• Triticeae repeats from TREP

• Wheat RNA-Seq, ESTs, and UniGene datasets have been aligned to the Triticum aestivum genome:

• 454 RNA-seq data for the following INSDC studies: SRP02455 (Akhuvnova et al.), ERP001415 (Brenchley et al.), SRP004502

• Sequences from TriFLDB

• Transcriptome assembly from diploid einkorn wheat Triticum monococcum (Fox et al.)

Diploid progenitors of bread wheat

• Aegilops tauschii (DD) and Triticum urartu (AA) are also included in Ensembl Plants

• In addition, we have RNA-seq data from Triticum monococcum (AA)

• These genomes have been aligned to rice, and barley

• Relevant RNA-seq reads have been also aligned

ELIXIR Innovation and SME Forum Wageningen 18th-19th March 201519.02.2013 pkersey@ebi.ac.uk42

Bread wheat whole genome alignment

• DNA-DNA pairwise alignments with lastZ

• Brachypodium distachyon: 617,996,145 Mb (14% of bread wheat) in 1,310,922 blocks

• Hordeum vulgare: 423,284,874 Mb (9% of bread wheat) in 2,902,234 blocks

• Oryza sativa Japonica: 312,857,683 Mb out of 4,460,951,632 (7% of bread wheat) in 718,036 blocks

pkersey@ebi.ac.uk 18.03.201543

Accessing Ensembl Data ProgramaticallyAccessing Ensembl Data Programatically

5 easy methods

• ftp://ftp.ensemblgenomes.org/pub/

• http://plants.ensembl.org/info/data/ftp/index.html

• Genomic, cDNA and protein sequence (FASTA)

• Annotated sequence (EMBL / GenBank)

• Gene sets (GTF)

• Resequencing alignments individuals / strains (EMF)

• Whole-genome multiple alignments (EMF)

• Gene-based multiple alignments (EMF)

• Constrained elements (BED)

• Database dumps (MySQL)

Access method 1:FTP downloads

pkersey@ebi.ac.uk Gramene Workshop, Plant and Animal Genomes XIII18.03.201546

Access method 2: mySQL

• MySQL: an open-source relational database management system (RDBMS)

• Used as the back end to support most Ensembl pipelines and applications

• You get the database from http:///mysql.com and install locally

• On the Ensembl Genomes FTP site, you can download the Ensembl schema as a .sql file.

• You can also download the data files

/data/mysql/bin/mysql -u mysqldba

create database zea_mays_core_24_77_6;

exit;

/data/mysql/bin/mysql -u mysqldba zea_mays_core_24_77_6 < zea_mays_core_24_77_6

/data/mysql/bin/mysqlimport -u mysqldba --fields_escaped_by=\\zea_mays_core_24_77_6 -L *.txt

Access method 3: Ensembl Perl API

• Mature, fully featured Perl API (Applications Programming Interface) for Ensembl resources

• Perl: a commonly used programming language in bioinformatics, designed to make “easy thing easy and hard things possible”

• Provides access to:

• Genomic sequence

• Genome features e.g. genes, translations

• Annotation e.g. cross-references

• http://http://www.ensembl.org/info/docs/api/index.html

• REpresentational State Transfer

• is an abstraction of the architecture of the World Wide Web; more precisely, REST is an architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within a distributed hypermedia system. REST ignores the details of component implementation and protocol syntax in order to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements (Wikipedia)

• A style for structuring URLs (i.e. web addresses) according to the content they contain

• RESTful web service or RESTful web API• Allows users to access data simply by invoking the URL

• Often returns a data structure defined in a simple grammar (e.g. JSON) which can be imported into an object in any programming language

Access method 4: REST API

• A generic tool to facilitate the design and query of data warehouses

• Data warehouses are databases designed to optimise the performance of certain commonly performed queries

• May be less flexible than normalised schema

• Less suitable for maintaining primary data (harder to automatically define constraints due to form of data model)

• Nonetheless, can still be implemented within RDBMS

• BioMart uses mySQL

• We have gene-centric and variant centric BioMarts for all Ensembl divisions

• BioMart comes with its own web interface

Access method 5: BioMart

BioMart Web UI

pkersey@ebi.ac.uk Gramene Workshop, Plant and Animal Genomes XIII18.03.201551

Access Method 6: Virtual Machines

• Download an environment containing all of Ensembl to run on your machine

• In effect, you are downloading/running a model of a computer

• As long as your computer can support running the VM, there should be no problem with library incompatibilities etc. - all the resources Ensembl needs are within the VM

• Increasingly, a model of choice for running web-based services (e.g. in cloud environments) – you don’t deploy into a platform, you deploy a whole platform

• We use OpenBox, an open source virtualisation platform

• http://ensemblgenomes.org/info/access/virtual_machine

pkersey@ebi.ac.uk 18.03.201552

Funding• Ensembl Genomes Funded by

• EMBL

• EU (INFRAVEC, Microme, transPLANT, AllBio)

• BBSRC (PhytoPath, wheat/barley/midge sequencing, UK-US collaboration, RNAcentral)

• Wellcome Trust (PomBase)

• NIH/NIAID (VectorBase)

• NSF (Gramene collaboration)

• Bill and Melinda Gates Foundation (wheat rust)

pkersey@ebi.ac.uk 18.03.201553

People• James Allen, Irina Armean, Dan Bolser, Bruce Bolt, Mikkel

Christensen, Paul Davis, Thomas Down, Christoph Grabmueller, Kevin Howe, Arnaud Kerhornou, Julia Khobdova, Eugene Kulesha, Nick Langridge, Dan Lawson, Mark McDowall, Uma Maheswari, Gareth Maslen, Michael Nuhn, Chuang Kee Ong, Michael Paulini, Helder Pedro, Anton Petrov, Dan Staines, Brandon Walts, Gary Williams

• The vertebrate genomics team @ EBI (Paul Flicek)

Recommended