An Introduction to ENSEMBL Cédric Notredame


An Introduction to ENSEMBL

Cédric Notredame


The Top 5 Surprises in the Human Genome Map

1. The blue gene exists in 3 genotypes: Straight Leg, Loose Fit and Button-Fly.
2. Tiny villages of Hobbits actually live in our DNA and produce minute quantities of wool -- which we've been ignorantly referring to as "navel lint" and throwing away for centuries.
3. It's nearly impossible to re-fold it along the original creases.
4. Beer-drinking gene conveniently located next to bathroom-locating gene.

and the Number 1 Surprise In The Human Genome Map...

5. Now that there's a map, male scientists will attempt to cure diseases by randomly throwing stuff into beakers, stubbornly refusing to use the map or ask for directions -- all the while insisting the cure is right around the next corner.


ENSEMBL: Our Scope

-What is ENSEMBL ?
-Searching Genes in ENSEMBL
-Viewing Genes in ENSEMBL
-Doing Research With ENSEMBL
-Where Do ENSEMBL Genes Come From ?


Accessing Genomes

• Genome sequences are becoming available very rapidly
– Large and difficult to handle computationally
– Everyone expects to be able to access them immediately

• Bench Biologists
– Has my gene been sequenced?
– What are the genes in this region?
– Where are all the GPCRs?
– Connect the genome to other resources

• Research Bioinformatics
– Give me a dataset of human genomic DNA
– Give me a protein dataset


What is It ?

• Set of high quality gene predictions
– From known human mRNAs aligned against genome
– From similar proteins and mRNAs aligned against genome
– From Genscan predictions confirmed via BLAST of Protein, cDNA, ESTs databases

• Initial functional annotation from Interpro
• Integration with external resources (SNPs, SAGE, OMIM)

• Comparative analysis
– DNA sequence alignment
– Protein orthologs


Mr ENSEMBL ?

Richard Durbin (ACEDB)

Ewan Birney (EBI)


Challenges ?

• Scale and data flow
– mainly engineering problems

• Presentation, ease of use
– mainly engineering problems

• Algorithmic
– Partly engineering
– Partly research


ENSEMBL Home


Help!

• context sensitive help pages - click

• access other documentation via generic home page

• email the helpdesk

HelpDesk / Suggestions


Finding What You Need


Human homepage


Text search


BLAST/SSAHA


BLAST/SSAHA ????


Changing Angle…


Anchor View

Map View


Contig View

Detailed View: Genes, ESTs, CpG etc. (100kb)
Overview: Genes and Markers (1Mb)
Chromosome
Configuration


Contig View

close-up

Evidence

Transcripts: red & black (Ensembl predictions)

Customising & short cuts

Pop-up menu


Cyto View


Marker View


SNP View


Synteny View


Dotter View


GeneView


Gene-View


Gene-View


Gene-View


Trans View


Exon-View


Protein-View


Protein-View


Protein-View


CDK-like

Family-View


CDK-like

Family-View


The Right View On My Gene

-Where Is My Gene ?
Map View, Cyto View, Contig View

-How Many Transcripts for My Gene ?
Gene View, Exon View

-What is the Function of my Gene ?
Protein View, SNP View, Family View

-How does My Gene compare with other Species ?
Synteny View, Dotter View


Getting The Stuff Back Home


Export-View


Data Mining with EnsMart

• The aim of EnsMart is to integrate Ensembl data into a single, multi-species, query-optimised database
– Requirement for cross-database joins removed.
– Query-optimised schema improves speed of data retrieval.

• Examples
– Coding SNPs for all novel GPCRs
– The sequence in the 5kb upstream region of known proteases between D1S2806 and D1S2907
– Mouse homologues of human disease genes containing transmembrane domain located between 1p23 and 1q23


EnsMart I


EnsMart II


Asking Questions With ENSEMBL


Asking Questions

1-Selecting AND Downloading Genes using
-Functional
-And Evolutionary Criteria

2-Comparing Two Pieces of Genome


All The Human Genes

-Involved in Cell Death
-Associated with a Disease
-With a Homologue in Mouse and Chicken

Asking A Question with ENSMART

What Do You Want ???


Which Species ?


Select the region

Where?

What kind of Gene ?


Select the kind of data

Choose An Evolutionary Trace

What Kind of Function ?


Select the kind of data

Control of Genetic Variation

Control of Regulatory Region

Control of Biochemical Function


Human Gene, Cell Death: 1133 genes
Human Gene, Cell Death, Mouse: 1106 genes
Human Gene, Cell Death, Chicken: 880 genes
Human Gene, Cell Death, C. Elegans: 338 genes


I would like
-Chromosome Information
-The ID of my sequences
-The corresponding OMIM Id
-The corresponding Chicken id

Asking A Question with ENSMART

How Do You Want it Packed ???
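
The two halves of an EnsMart question (which genes, then which attributes to export) can be illustrated with a small in-memory sketch. This is not the EnsMart interface or schema: the `Gene` record, its field names and every identifier in the toy data are hypothetical, chosen only to mirror the filters (cell death, disease association, mouse and chicken homologues) and the requested outputs (chromosome, gene ID, OMIM ID, chicken ID).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Gene:
    """Toy gene record; all field names are hypothetical."""
    stable_id: str
    chromosome: str
    functions: Set[str] = field(default_factory=set)
    omim_id: Optional[str] = None                               # None = no disease link
    homologues: Dict[str, str] = field(default_factory=dict)    # species -> homologue ID


def select_genes(genes, function, required_species):
    """Filter step: function annotation, disease link, homologue in every species."""
    return [
        g for g in genes
        if function in g.functions
        and g.omim_id is not None
        and all(species in g.homologues for species in required_species)
    ]


def export_rows(genes, homologue_species):
    """Attribute step: (chromosome, gene ID, OMIM ID, homologue ID) per gene."""
    return [
        (g.chromosome, g.stable_id, g.omim_id, g.homologues[homologue_species])
        for g in genes
    ]


# Made-up records, for illustration only.
genes = [
    Gene("GENE_A", "1", {"cell death"}, "OMIM_1", {"mouse": "M_A", "chicken": "C_A"}),
    Gene("GENE_B", "2", {"cell death"}, None, {"mouse": "M_B", "chicken": "C_B"}),
    Gene("GENE_C", "3", {"cell death"}, "OMIM_3", {"mouse": "M_C"}),
    Gene("GENE_D", "4", {"transport"}, "OMIM_4", {"mouse": "M_D", "chicken": "C_D"}),
]

selected = select_genes(genes, "cell death", ["mouse", "chicken"])
rows = export_rows(selected, "chicken")
```

Each added filter can only shrink the result set, which is why the gene counts on the slides fall as species are added.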


Come to think of it…

-I’d like to take a look at the 5’ upstream regions

Asking A Question with ENSMART

How Do You Want it Packed ???


I Want To know if the Mouse and the Human Genome are conserved around the Human Gene SNX5

Asking A Question with ENSMART

What Do You Want ???


Where Do ENSEMBL Genes Come From ?

Genebuild


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating genes and transcripts


The Aim…


Ensembl transcript predictions

evidence

other groups’ models

manual curation

Overview…


Automatic Gene Annotation

[Flow diagram. Labels: human proteins, other proteins, cDNAs, ESTs; Pmatch, Exonerate, Genewise, Est2Genome; Genscan exons, other evidence; Add UTRs, Merge; EST genes, Ensembl Genes]


ENSEMBL Geneset

• Place all available species-specific proteins to make transcripts
• Place similar proteins to make transcripts
• Use mRNA data to add UTRs
• Build transcripts using cDNA evidence
• Build additional transcripts using Genscan + homology evidence
• Combine annotations to make genes with alternative transcripts


blast and Miniseq

Human protein sequences (SwissProt/TrEMBL/RefSeq)

pmatch* v. assembly

Genewise

*R. Durbin, unpublished

Getting Genes from Known Proteins


Translatable gene with UTRs

cDNAs - Est2Genome – UTRs, no phases

proteins - Genewise – phases, no UTRs

Adding the UTRs


Gene Build is Protein-Based

• DNA-DNA alignments don’t give translatable genes

• Protein-level alignments give:
– frameshifts and splice sites

• Genewise (Ewan Birney)
– Protein – genomic alignment
– Has splice site model
– Penalises stop codons
– Allows for frameshifts


Making Genes

• Combine results of all Genewises and Genscans:
– Group transcripts which share exons
– Reject non-translating transcripts
– Remove duplicate exons
– Attach supporting evidence
– Write genes to database
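
The grouping step (transcripts that share an exon end up in the same gene) can be sketched as a small union-find clustering. This is only an illustration, not the Ensembl GeneBuilder code: the function name and data layout are hypothetical, and exons are treated as "shared" only when their coordinates are identical, which is a simplifying assumption.

```python
from collections import defaultdict


def group_transcripts(transcripts):
    """Cluster transcripts into genes by shared exons.

    `transcripts` maps a transcript name to a list of (start, end) exons.
    Assumption: two exons are the same only if their coordinates are identical.
    Returns a sorted list of sorted transcript-name lists, one per gene.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner = {}  # exon -> first transcript seen with that exon
    for name, exons in transcripts.items():
        for exon in set(exons):  # set() ignores duplicate exons within a transcript
            if exon in first_owner:
                parent[find(name)] = find(first_owner[exon])
            else:
                first_owner[exon] = name

    genes = defaultdict(list)
    for name in transcripts:
        genes[find(name)].append(name)
    return sorted(sorted(members) for members in genes.values())


# Made-up coordinates, for illustration only.
transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(300, 400), (500, 600)],
    "t3": [(900, 1000)],
    "t4": [(500, 600)],
}
genes = group_transcripts(transcripts)
```

Here `t1` and `t4` share no exon directly but land in the same gene through `t2`, which is how alternative transcripts of one gene get collected. The other steps on the slide (rejecting non-translating transcripts, attaching evidence, writing to the database) are not modelled.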


A Typical Human Release: NCBI 34 (Dec 2003)

• NCBI 34 assembly, released Dec 2003

• Ensembl genes: 21,787 (23,762 in release 35)
• Ensembl coding transcripts: 31,609 (plus 1,744 pseudogenes)
• Ensembl exons: 225,897

• Input human seqs: 48,176 proteins; 86,918 cDNAs

• Transcripts made from:
– Human proteins with (without) UTRs: 68% (19%)
– Non-human proteins with (without) UTRs: 2% (9%)
– cDNA alignment only: 0.8%


Manual Vs Automatic Annotation

Genes
– Sensitivity: ~90% of manual genes are in Ensembl
– Specificity: ~75% of genes are in the manual sets

Exon bps
– Sensitivity: ~70% of manual bps are in exons (90% of coding bps)
– Specificity: ~80% of bps are in manual exons

Alternative transcripts per gene
– manual 3, automatic 1.3

Figures are for the gene build on NCBI 33 (human) and manual annotation for chromosomes 6, 13 & 14
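
As a worked illustration of the two measures, here is a minimal sketch that computes sensitivity and specificity from two sets of gene identifiers. It assumes the manual and automatic genes have already been matched to shared identifiers (in practice the comparison is made by overlap on the genome), and it uses "specificity" in the gene-prediction sense of the slide: the fraction of predictions confirmed by the manual set. The function name and toy identifiers are hypothetical.

```python
def sensitivity_specificity(manual, predicted):
    """Return (Sn, Sp) for two collections of gene identifiers.

    Sn = fraction of manual genes that are also predicted.
    Sp = fraction of predicted genes that are also in the manual set.
    """
    manual, predicted = set(manual), set(predicted)
    shared = manual & predicted
    sn = len(shared) / len(manual) if manual else 0.0
    sp = len(shared) / len(predicted) if predicted else 0.0
    return sn, sp


# Made-up example: 10 manual genes, 12 predictions, 9 in common.
manual_genes = {f"g{i}" for i in range(1, 11)}
predicted_genes = {f"g{i}" for i in range(1, 10)} | {"x1", "x2", "x3"}

sn, sp = sensitivity_specificity(manual_genes, predicted_genes)
```

With these toy sets, Sn is 9/10 and Sp is 9/12, the same order of magnitude as the ~90% and ~75% quoted for the gene-level comparison.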


Each Genebuild is a Story…

Data availability
– Hard evidence in mouse, rat, human
– Similarity build more important for other species

Structural Issues
– Zebrafish: many similar genes near each other; genome from different haplotypes
– C. briggsae: very dense genome; short introns
– Mosquito: many single-exon genes; genes within genes

Configuration files provide flexibility


Life in Release 2003

Species                              Gene number   Exons/gene
Homo sapiens                         21787         8.7
Mus musculus                         24948         8.7
Rattus norvegicus                    23751         7.9
Danio rerio (zebra fish)             20062         7.9
Caenorhabditis briggsae (nematode)   11884         7.2
Anopheles gambiae (mosquito)         14707         4.0


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating genes and transcripts


Using ESTs

[Flow diagram. Labels: human proteins, other proteins, cDNAs, ESTs; Pmatch, Exonerate, Genewise, Est2Genome; Genscan exons, other evidence; Add UTRs, Merge; EST genes, Ensembl Genes]


EST analysis

Map to genome using Est2Genome (determine strand, splicing)

Map ESTs using Exonerate (determine coverage, % identity and location in genome)

Filter on % identity and depth (5.5 million ESTs from dbEST, of which about 1/3 are mapped)

Using ESTs
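
The filtering step can be pictured with a short sketch that keeps only EST alignments above an identity and a coverage cut-off. The slides give no threshold values, so the numbers below are placeholders, and the `EstHit` record is hypothetical. Coverage is used here because it is one of the quantities the alignment step reports; the depth filter mentioned on the slide is not modelled.

```python
from typing import NamedTuple


class EstHit(NamedTuple):
    """One EST-to-genome alignment (hypothetical record layout)."""
    est_id: str
    percent_identity: float
    coverage: float  # percent of the EST length that aligned


# Placeholder thresholds: NOT taken from the Ensembl pipeline.
MIN_IDENTITY = 97.0
MIN_COVERAGE = 90.0


def filter_hits(hits, min_identity=MIN_IDENTITY, min_coverage=MIN_COVERAGE):
    """Keep hits meeting both the identity and the coverage threshold."""
    return [
        hit for hit in hits
        if hit.percent_identity >= min_identity and hit.coverage >= min_coverage
    ]


# Made-up alignments, for illustration only.
hits = [
    EstHit("est1", 99.1, 95.0),
    EstHit("est2", 92.0, 99.0),   # fails identity
    EstHit("est3", 98.0, 60.0),   # fails coverage
    EstHit("est4", 97.0, 90.0),   # exactly on both thresholds, kept
]
kept = filter_hits(hits)
```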


Exonerate

Golden path contigs

cDNA hits

• Exonerate positions cDNA sequences to assembly contigs

• Store hits as Ensembl FeaturePairs in database

Exonerate


EST2Genome

[Diagram. Labels: Blast and Est2Genome; Virtual contig; cDNA hits; Filter; Blast & Miniseq; Est_genome]


Merge ESTs according to consecutive exon overlap and set splice ends

Genomewise

Alternative transcripts with translation and UTRs

ESTs

Reconstructing Alternative Splicing


Human ESTs

EST transcripts

Display limited to 7 at any one point – full data accessible in the databases

Display of EST Evidence


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating genes and transcripts


Ab initio Genscan predictions

Genscan prediction

Evidence supporting Genscan exons


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating genes and transcripts


Manual Curation: VErtebrate Genome Annotation


Sanger / Vega manual curation

Manual Curation: VEGA


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating Genes and Transcripts


Other models as ‘DAS sources’

Turn on DAS sources

FASTAView display

Other Gene-Models


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating Genes and Transcripts


Known Vs novel transcripts

• Naming takes place after the gene build is completed

• Transcripts/proteins mapped to SwissProt, RefSeq and SPTrEMBL entries

• If mapped = ‘known’ : if not = ‘novel’

• Require high sequence similarity, but allow incomplete coverage

• Note:
– Difficult for families of closely-related genes
– Wrongly annotated pseudogenes may also cause problems
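
The known/novel rule can be written down as a tiny classifier. This is a sketch of the rule as stated on the slide, not the actual Ensembl mapping code: the tuple layout, the function name and the 95% identity cut-off are assumptions, since the slide says only that similarity must be high while coverage may be incomplete.

```python
def classify_transcript(matches, min_identity=95.0):
    """Label a transcript 'known' or 'novel'.

    `matches` is a list of (database, accession, percent_identity, percent_coverage)
    tuples for SwissProt, RefSeq or SPTrEMBL entries the transcript aligned to.
    The identity threshold is a placeholder; coverage is deliberately not
    checked, because incomplete coverage is allowed.
    """
    for _database, _accession, identity, _coverage in matches:
        if identity >= min_identity:
            return "known"
    return "novel"


# Made-up matches, for illustration only.
partial_but_similar = [("SwissProt", "ACC_1", 99.0, 60.0)]
complete_but_divergent = [("RefSeq", "ACC_2", 80.0, 100.0)]

label_a = classify_transcript(partial_but_similar)
label_b = classify_transcript(complete_but_divergent)
label_c = classify_transcript([])
```

A rule of this shape also shows why closely related gene families are difficult: a paralogue's database entry can pass the identity test just as easily as the true match.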


• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega / Sanger)

• Gene models from other groups

• Known v. novel genes

• Gene names & descriptions

Evaluating Genes and Transcripts


Gene Names and Descriptors

Names and descriptions

• Names taken from mapped database entries

• Official HGNC (HUGO) name used if available (or equivalent for other species)

• Otherwise SwissProt > RefSeq > SPTrEMBL

• Novel transcripts have only Ensembl stable ids

• Genes named after ‘best-named’ transcript

• Gene description taken from mapped database entries (source given)

• Hints:
– Orthology can provide useful confirmation
– If no description, check for any Family description
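
The naming precedence can be sketched as a lookup over an ordered list of sources. The function names, the dictionary layout and the identifiers are hypothetical, and reading ‘best-named’ as "the transcript whose name comes from the highest-priority source" is an interpretation of the slide, not a documented rule.

```python
# Source precedence as listed on the slide: HGNC first, then the others.
PRIORITY = ["HGNC", "SwissProt", "RefSeq", "SPTrEMBL"]


def name_transcript(stable_id, mapped_names):
    """Return (name, source) for a transcript.

    `mapped_names` maps a source database to the name found there.
    A novel transcript (no mapping) keeps its stable id, with source None.
    """
    for source in PRIORITY:
        if source in mapped_names:
            return mapped_names[source], source
    return stable_id, None


def name_gene(gene_stable_id, transcripts):
    """Name a gene after its best-named transcript (assumed: best source rank)."""
    best_rank, best_name = len(PRIORITY), gene_stable_id
    for transcript_id, mapped_names in transcripts.items():
        name, source = name_transcript(transcript_id, mapped_names)
        if source is not None and PRIORITY.index(source) < best_rank:
            best_rank, best_name = PRIORITY.index(source), name
    return best_name


# Made-up identifiers, for illustration only.
transcripts = {
    "T1": {"RefSeq": "REFSEQ_NAME"},
    "T2": {"SwissProt": "SWISSPROT_NAME", "SPTrEMBL": "TREMBL_NAME"},
    "T3": {},
}
gene_name = name_gene("G1", transcripts)
```

A gene whose transcripts are all novel falls through to its own stable id, matching the "novel transcripts have only Ensembl stable ids" bullet.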


Stability…

www.ensembl.org/Docs/wiki/html/EnsemblDocs/Answer006.html


Geneview and Exonview

Evidence used to build the transcript
Links to ExonView
Mapping to external databases
Links to putative orthologues
Transcript name
Gene name & description
Alternative transcripts


Compressed tracks

Expanded tracks

Evidence Tracks in ContigView


Future Directions

• Improved pseudogene annotation, for all species

• Upstream regulatory elements - using CpG islands, Eponine predictions, motifs to aid in prediction of transcription start sites

• Improve use of cDNAs - can already use to add alternatively spliced transcripts

• Improve UTR extension

• Make use of comparative data

• Non coding RNAs - currently filtered out of build sets


ENSEMBL

-Finding the right DATA: ENSMART and BLAST

-The central View of ENSEMBL: ContigView

-Genome Comparison: Synteny View

-ENSEMBL incorporates all the evidence into its gene models


Genebuild overview

[Flow diagram. Inputs: Human Proteins, Other Proteins, Human cDNAs, Human ESTs. Steps: Pmatch, Genewise, Exonerate, Est2Genome, Genebuilder, ClusterMerge, GeneCombiner. Gene sets: Genewise genes, Genewise genes with UTRs, Aligned cDNAs, Aligned ESTs, Supported genscans (optional), Preliminary gene set, cDNA genes, Ensembl EST genes, Core Ensembl genes, Pseudogenes, Final set + pseudogenes]


Place all known genes

Map all AVAILABLE species specific proteins in the genome and find gene structure using Genewise

Annotate novel genes

Use protein from other species to build new transcripts based on homology

Use AVAILABLE mRNAs to add UTRs to the built transcripts

Use further homology to proteins, mRNAs and ESTs to build transcripts using Genscan exons

Combine annotations

Annotation Stages


Manual Vs Automatic Annotation

Numbers are for NCBI33 genebuild

Gene locus level

        Sn     Sp
chr13   0.90   0.74
chr14   0.92   0.77
chr6    0.94   0.72

ENSEMBL predictions cover 90% or more of manually annotated gene structures, with around 75% of the predictions covered by a manual annotation

Exon level (based on transcript pairs)

        Coding exons only   All exons
        Sn     Sp           Sn     Sp
chr13   0.83   0.90         0.73   0.78
chr14   0.78   0.88         0.69   0.77
chr6    0.85   0.89         0.73   0.76

UTR exon predictions are less accurate than coding exons. 92% of coding exons and 80% of all exons are exact matches