Introduction to Genomes with Ensembl - Tufts...

Dr. Giulietta M. Spudich

Ensembl Outreach Team

Introduction to Genomes with Ensembl

Objectives

What information about a gene can I find?

What about a region of the genome?

How do I navigate the data?

Introduction

1977: first genome to be sequenced (5 kb)

2004: finished human sequence (3 Gb)

Large amounts of raw DNA sequence data

Fragment

BAC clones

Sequence

Contigs

Assemble

Scaffolds

Assemble

Genome Sequencing
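
As a toy illustration of how overlapping sequence reads can be joined into a contig, the sketch below merges reads with a greedy suffix-prefix rule. It is only a minimal sketch: the function names, the min_len parameter and the three example reads (cut from the start of the sequence shown on the next slide) are assumptions made for illustration, and real assemblers use far more sophisticated methods.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining pair overlaps: what is left are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads, given out of order on purpose.
reads = ["GTGAGGGGACAG", "CGCCGGGAGAAG", "GAGAAGCGTGAGG"]
contigs = greedy_assemble(reads)
print(contigs)
```

Reads that share no overlap of at least min_len bases stay as separate contigs, which is roughly where scaffolding with additional information would take over.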

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG

CTTACTCCGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCA

TTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATT

GCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGC

TTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATAAGTCTT

AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG

ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG

AAGAATCTGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAA

AGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAG

GGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAATTCAAATT

AGACTTAGGAAGGAATGTTCCCAATAGTAGACTAAAAGTCTTCGCACAGTGAAAT

TGAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Genome sequence

21 May 2012

The Ensembl genome browser: making it interesting

Regulation

Gene

Allele

Conserved sequence

Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/

• Splice variants, proteins, non-coding RNA

• Small and large scale sequence variation, phenotype associations

• Whole genome alignments, protein trees

• Potential promoters and enhancers, DNA methylation

• User upload, custom data

Genome Browsers

• Ensembl Genome Browsers

http://www.ensemblgenomes.org

• NCBI Map Viewer

http://www.ncbi.nlm.nih.gov/mapview/

• UCSC Genome Browser

http://genome.ucsc.edu

Ensembl is Used Worldwide

Top users:

UK

US

Canada

China

France

Germany

Italy

Japan

Spain

Data Volume Challenge

• UniProtKB/Swiss-Prot (reviewed)

536,029 (25,871 human) protein sequences

• UniProtKB/TrEMBL

22,128,511 (217,918 human) protein sequences

www.uniprot.org

• NCBI RefSeq (reviewed)

15,744,232 (24,539 human) NP_006570

NM_006579

Q8IU82

A consensus set of protein-coding sequences

• Reaching a consensus coding sequence set for human and mouse.

• 26,473 (human), 22,187 (mouse) (as of Sept 2011)

• If you see a “CCDS ID”, the coding sequence is agreed upon.

Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4

What are the gold transcripts?

UTR Coding Intron

Ensembl

• Automatic annotation pipeline: gene building all at once (whole genome)

VEGA/Havana (human, mouse, zebrafish)

• Manual curation: reviewed by experts

VEGA: Vertebrate Genome Annotation

Genes and Transcripts in Ensembl

High quality:

• CCDS transcripts

• Ensembl/Havana merged (gold) transcripts

Ensembl/Havana

• Transcripts are from:

Ensembl

Havana

Ensembl/Havana

Ensembl (20_)

Havana (00_)

Both (“gold”)

Havana (00_)

Gene Names in Ensembl

• ENSG### Ensembl Gene ID

• ENST### Ensembl Transcript ID

• ENSP### Ensembl Peptide ID

• ENSE### Ensembl Exon ID

• For non-human species, a species code is inserted after ENS:

MUS (Mus musculus) for mouse: ENSMUSG###

DAR (Danio rerio) for zebrafish: ENSDARG###
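
The naming scheme above can be unpacked mechanically. The following is a minimal sketch, assuming an identifier consists of ENS, an optional species code, one feature letter (G, T, P or E) and a run of digits. The function name parse_stable_id and the two lookup tables are hypothetical and cover only the species and feature types listed on this slide.

```python
import re
from typing import NamedTuple

# Feature letters and species codes taken from the slide; anything else is out of scope.
FEATURES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}
SPECIES = {"": "human", "MUS": "mouse", "DAR": "zebrafish"}

# ENS, then an optional species code, one feature letter, then digits.
_PATTERN = re.compile(r"ENS([A-Z]*?)([GTPE])(\d+)")


class StableId(NamedTuple):
    species: str
    feature: str
    number: str


def parse_stable_id(identifier: str) -> StableId:
    """Split an Ensembl-style identifier into species, feature type and number."""
    match = _PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not an Ensembl-style ID: {identifier!r}")
    code, letter, digits = match.groups()
    species = SPECIES.get(code, f"unknown species code {code}")
    return StableId(species, FEATURES[letter], digits)


print(parse_stable_id("ENSMUSG00000017167"))
print(parse_stable_id("ENSG00000139618"))
```

The species code is matched lazily so that the feature letter is always the last letter before the digits; an identifier with no species code is treated as human.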

Ensembl Features

• The gene set.

• Comparative analysis

• Variation and regulation

• BioMart (data export)

• Display of external data (DAS)

• Programmatic access via the Perl API

• Open Source

Objectives

What information about a gene can I find?

What about a region of the genome?

How do I navigate the data?

See our coursebook for walk-throughs and exercises using our browser:

http://www.ensembl.org/info/website/tutorials/coursebook.pdf

Variation

• Nucleotide level

  • Single nucleotide polymorphism (SNP)

  • Small insertions and deletions (InDels)

  • Microsatellites (short tandem repeats)

• Structural

  • Copy number variations (CNV)

  • Large insertions and deletions
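
To make the distinction between these variation classes concrete, here is a minimal sketch that labels a variant from its reference and alternative alleles. The function name classify_variant and the 50-base cut-off between "small" and "large" are assumptions for illustration, not values taken from these slides; microsatellites and copy number variations need more context than two allele strings and are not handled.

```python
def classify_variant(ref: str, alt: str, large_threshold: int = 50) -> str:
    """Label a variant by comparing reference and alternative allele strings.

    large_threshold is an assumed cut-off (in bases) between small and
    large insertions or deletions.
    """
    if not ref or not alt:
        raise ValueError("both alleles must be non-empty")
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length but more than one base: not covered by the list above.
        return "multi-nucleotide substitution"
    size = "large" if abs(diff) >= large_threshold else "small"
    kind = "insertion" if diff > 0 else "deletion"
    return f"{size} {kind}"


for ref, alt in [("A", "G"), ("A", "ATT"), ("ACGT", "A")]:
    print(ref, alt, classify_variant(ref, alt))
```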

Sequence displays

Gene: Sequence

Transcript: Exons

Transcript: cDNA

Comparative Genomics

69 species in Ensembl release 67 (e!67)

Ensembl tools

Phenotype for a gene

How is all this information organised?

• Ensembl Views (Website)

• Ensembl Database (open source)

• BioMart “data mining tool”

Help and documentation

• Comments and questions?

helpdesk@ensembl.org

• Mailing lists announce@ensembl.org, dev@ensembl.org

• Course online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Publications

• Flicek, P. et al. Ensembl 2012. Nucleic Acids Res 40:D84-90 (2012)

http://nar.oxfordjournals.org/content/40/D1/D84.long

• Xosé M. Fernández-Suárez and Michael K. Schuster Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244

• Giulietta M Spudich and Xosé M Fernández-Suárez Touring Ensembl: A practical guide to genome browsing BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295

http://www.ensembl.org/info/about/publications.html

Ensembl Paul Flicek (EBI), Steve Searle (Wellcome Trust Sanger Institute)

Software Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Thomas Maurel, Kieron Taylor

Comparative Genomics

Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli

Regulation Ian Dunham, Ikhlak Ahmed, Nathan Johnson, Thomas Juettemann, Steven Wilder

Variation Fiona Cunningham, Laurent Gil, Sarah Hunt, Will McLaren, Graham Ritchie, Anja Thormann

Analysis and Annotation

Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White

Web Team Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Dan Sheppard, Steve Trevanion

Outreach Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster

Ensembl Genomes

Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Claudia Rato da Silva, Dan Staines, Iliana Toneva

Ensembl Strategy

Ewan Birney, Richard Durbin, Paul Flicek, Jen Harrow, Tim Hubbard, Glenn Proctor, Steve Searle

Ensembl Team
