Comparative Genomics

2/30

Overview of the Talk

• Comparing Genomes

• Homologies & Families

• Sequence Alignments

3/30

Evolution at the DNA Level

…ACTGACATGTACCA…

…AC----CATGCACCA…

Mutation

Sequence edits

Rearrangements

Deletion

Inversion

Translocation

Duplication
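To make the aligned pair above concrete, here is a minimal Python sketch that classifies each alignment column as a match, a substitution, or a gap (an insertion or deletion), and computes percent identity over the ungapped columns. The two sequences are a toy pair adapted from the slide so that both rows have equal length; the function names are illustrative and not part of any Ensembl tool.

```python
from collections import Counter


def column_classes(a: str, b: str, gap: str = "-") -> Counter:
    """Count match / substitution / gap columns in a pairwise alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # a column with gaps in both rows carries no information
        if x == gap or y == gap:
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Identity over columns where both sequences have a base."""
    c = column_classes(a, b, gap)
    aligned = c["match"] + c["substitution"]
    return 100.0 * c["match"] / aligned if aligned else 0.0


# Toy pair (assumption: adapted from the slide, padded to equal length).
ref = "ACTGACATGTACCA"
alt = "AC----ATGCACCA"

print(dict(column_classes(ref, alt)))
print(percent_identity(ref, alt))
```

Here the four-column gap corresponds to a deletion and the single mismatching column to a point mutation.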

4/30

• We can better understand evolution/speciation

• We can find important, functional regions of the sequence (codons, promoters, regulatory regions)

• It can help us locate genes in other species that are missing or not well-defined (also through comparison and alignments).

Why Compare Genomes?

5/30

Mammals have roughly 3 billion base pairs in their genomes

Over 98% of human genes are shared with other primates, with 95-98% similarity between the shared genes.

Even the fruit fly shares 60% of its genes with humans! (March 2000)

Differences: gene structure, sequence

Remember… a single nucleotide change can cause diseases such as sickle cell anemia and cancer.

Comparing Genomes

6/30

• Uses all the species

• Uses a representative protein (the longest) for every gene

• Builds a gene tree

• EnsemblCompara GeneTrees: Analysis of complete, duplication-aware phylogenetic trees in vertebrates. Vilella AJ, Severin J, Ureta-Vidal A, Durbin R, Heng L, Birney E. Genome Res. 2008 Nov 24.

How Does Ensembl Predict Homology?

7/30

Load longest protein for every gene from all species

WU Blastp + SmithWaterman: longest translation of every gene against every other (Blast Reciprocal Hit/Blast Score Ratio)

Protein clustering, build multiple alignments (MCoffee)

From each alignment, build a gene tree (TreeBest)

Reconcile each gene tree with the species tree to determine internal nodes (TreeBest)

Orthologues, paralogues…

Steps in Homology Prediction

..MEDPATA…
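The "Blast Reciprocal Hit" idea in the all-against-all step can be sketched as follows. This is a minimal illustration, not Ensembl's pipeline code: the protein identifiers and scores are invented toy values, and the real pipeline also uses the Blast Score Ratio and later clustering steps that are not modelled here.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict mapping (query, target) -> similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between proteins of two species (toy data).
human_vs_mouse = {
    ("hA", "mA"): 250, ("hA", "mB"): 90,
    ("hB", "mB"): 180, ("hC", "mB"): 200,
}
mouse_vs_human = {
    ("mA", "hA"): 245,
    ("mB", "hC"): 198, ("mB", "hB"): 175,
}

rbh = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(rbh)
```

In this toy data hB's best hit is mB, but mB's best hit is hC, so hB has no reciprocal partner.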

8/30

Viewing Trees in Ensembl

9/30

Types of Homologues

• Orthologues: any pairwise gene relation where the ancestor node is a speciation event

• Paralogues: any pairwise gene relation where the ancestor node is a duplication event
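These two definitions translate directly into a lookup on a gene tree whose internal nodes are already labelled. The sketch below assumes a small hand-built tree; the `Node` class and the gene names are hypothetical and are not the Ensembl API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    event: Optional[str] = None  # "speciation" or "duplication" (internal nodes)
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> set:
    """Labels of all leaf genes below a node."""
    if not node.children:
        return {node.label}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def last_common_ancestor(node: Node, a: str, b: str) -> Node:
    """Deepest node that has both genes among its descendants."""
    for child in node.children:
        below = leaves(child)
        if a in below and b in below:
            return last_common_ancestor(child, a, b)
    return node


def relation(root: Node, a: str, b: str) -> str:
    if not {a, b} <= leaves(root):
        raise KeyError("both genes must be leaves of the tree")
    ancestor = last_common_ancestor(root, a, b)
    return "orthologue" if ancestor.event == "speciation" else "paralogue"


# Toy tree: one ancient duplication, then a speciation in each copy.
tree = Node("root", "duplication", [
    Node("copyA", "speciation", [Node("human_A"), Node("mouse_A")]),
    Node("copyB", "speciation", [Node("human_B"), Node("mouse_B")]),
])

print(relation(tree, "human_A", "mouse_A"))
print(relation(tree, "human_A", "human_B"))
```

Note that human_A and mouse_B come out as paralogues even though they are in different species, because their ancestor node is the duplication.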

10/30

The Gene Tree for INS (insulin precursor)

A red square is a duplication event (Paralogues)

A blue square is a speciation event (Orthologues)

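11/30

Reconciliation

Figure: a species tree (M, R, H) and an unrooted gene tree (M, R, H and M’, R’, H’) are reconciled. Internal nodes of the gene tree are labelled as duplication nodes or speciation nodes, and inferred gene losses are marked.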
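One common way to label gene-tree nodes is last-common-ancestor (LCA) reconciliation. The sketch below is a simplified illustration under stated assumptions: the gene tree is already rooted, the species tree is ((M,R),H), gene losses are not inferred, and the data structures are invented for this example. It is not TreeBest's implementation.

```python
# Species tree ((M,R),H) stored as child -> parent (assumed topology).
SPECIES_PARENT = {"M": "MR", "R": "MR", "MR": "MRH", "H": "MRH", "MRH": None}


def ancestors(species, parent):
    """Path from a species-tree node up to the root, inclusive."""
    path = []
    while species is not None:
        path.append(species)
        species = parent[species]
    return path


def species_lca(a, b, parent):
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def reconcile(node, parent, events):
    """Label internal gene-tree nodes.

    A leaf is (gene_name, species); an internal node is [left, right].
    Appends (genes_below, mapped_species_node, event) in post-order.
    """
    if isinstance(node, tuple):
        gene, species = node
        return species, [gene]
    left, right = node
    sp_left, genes_left = reconcile(left, parent, events)
    sp_right, genes_right = reconcile(right, parent, events)
    here = species_lca(sp_left, sp_right, parent)
    # A node mapping to the same species node as one of its children
    # is a duplication; otherwise it is a speciation.
    event = "duplication" if here in (sp_left, sp_right) else "speciation"
    genes = genes_left + genes_right
    events.append((tuple(genes), here, event))
    return here, genes


# Toy rooted gene tree: ((M,R),H) and a second copy (M',H').
gene_tree = [
    [[("M", "M"), ("R", "R")], ("H", "H")],
    [("M'", "M"), ("H'", "H")],
]

events = []
reconcile(gene_tree, SPECIES_PARENT, events)
for genes, species_node, event in events:
    print(genes, species_node, event)
```

In the toy tree the second copy has no rat gene, which a full reconciliation would report as a gene loss; this sketch only labels the nodes.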

12/30

Orthologue Types

What is ‘1 to 1’?

What is ‘1 to many’?
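As a rough guide to these labels: a "1 to 1" pair means each gene has exactly one orthologue in the other species, while "1 to many" means one gene corresponds to several genes in the other species (for example after a lineage-specific duplication). The sketch below approximates this by simply counting partners per gene; the gene names are hypothetical, and the tree-based assignment used in practice is not reproduced here.

```python
from collections import Counter


def orthology_types(pairs):
    """Classify orthologue pairs between species A and species B.

    pairs: list of (gene_in_a, gene_in_b) orthologue pairs.
    """
    partners_of_a = Counter(a for a, _ in pairs)
    partners_of_b = Counter(b for _, b in pairs)
    types = {}
    for a, b in pairs:
        a_has_many = partners_of_a[a] > 1
        b_has_many = partners_of_b[b] > 1
        if a_has_many and b_has_many:
            types[(a, b)] = "many to many"
        elif a_has_many or b_has_many:
            types[(a, b)] = "1 to many"
        else:
            types[(a, b)] = "1 to 1"
    return types


# Hypothetical pairs: hsaA has two orthologues in the other species.
pairs = [("hsaA", "mmuA1"), ("hsaA", "mmuA2"), ("hsaB", "mmuB")]
types = orthology_types(pairs)
for pair, kind in types.items():
    print(pair, kind)
```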

13/30

Protein Families

• How: Cluster proteins for every isoform in every species + UniProt proteins.

• BLASTP comparison of:

– all Ensembl ENSP…

– all metazoan (animal) proteins in UniProt
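The clustering step can be pictured as grouping proteins that are linked by sufficiently strong BLASTP hits. The slides do not say which clustering algorithm is used, so the sketch below substitutes plain connected-components (single-linkage) clustering with a union-find structure as a stand-in; the identifiers, scores, and threshold are all invented.

```python
def cluster_families(proteins, hits, min_score):
    """Group proteins connected by hits scoring at least min_score.

    hits: iterable of (protein_a, protein_b, score).
    Returns a sorted list of sorted families.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in hits:
        if score >= min_score:
            parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Hypothetical Ensembl and UniProt identifiers with toy scores.
proteins = ["ENSP_a", "ENSP_b", "UNIPROT_c", "ENSP_d"]
hits = [
    ("ENSP_a", "ENSP_b", 300),
    ("ENSP_b", "UNIPROT_c", 150),
    ("ENSP_d", "ENSP_a", 20),  # below threshold, ignored
]

families = cluster_families(proteins, hits, min_score=100)
print(families)
```

Single linkage tends to merge families through multi-domain proteins, which is one reason dedicated clustering methods are preferred in practice.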

14/30

1. Find the human MYL6 gene: go to its gene summary.

2. How many paralogues does it have? Find them in the gene tree.

3. Which paralogue is closest to the human MYL6 gene? In what taxon is the common ancestor?

Homologues Exercise

15/30

Pan-taxonomic compara

Anopheles gambiae, Caenorhabditis elegans, Drosophila melanogaster

Aspergillus nidulans, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe

B_aphidicola_Tokyo_1998, B_burgdorferi_DSM_4680, B_subtilis, E_coli_K12, M_tuberculosis_H37Rv, N_meningitidis_A, P_horikoshii, S_aureus_N315, S_pneumoniae_TIGR4, S_pyogenes_SF370, W_pipientis_wMel

Anolis carolinensis, Ciona savignyi, Danio rerio, Equus caballus, Gallus gallus, Homo sapiens, Macaca mulatta, Monodelphis domestica, Mus musculus, Ornithorhynchus anatinus, Pan troglodytes, Pongo pygmaeus, Xenopus tropicalis

Dictyostelium discoideum, Plasmodium falciparum, Plasmodium vivax

16/30

www.ensemblgenomes.org

17/30

Families

18/30

Ensembl Proteins in the Family

19/30

Overview of the Talk

• Comparing Genomes

• Homologies and Families

• Sequence Alignments

20/30

• To identify homologous regions

• To spot problematic gene predictions

• Conserved regions could be functional

• To define syntenic regions (long regions of DNA sequence where order and orientation are highly conserved)

Aligning Whole Genomes: Why?
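The definition of a syntenic region (conserved order and orientation) can be illustrated with a small scan over matched anchors. This is only a sketch under simplifying assumptions: anchors are consecutive genes in genome A, positions are gene ranks rather than base-pair coordinates, and the data are invented.

```python
def syntenic_blocks(anchors, min_genes=3):
    """Find runs of anchors with conserved order and orientation.

    anchors: list of (rank_in_a, rank_in_b, orientation), sorted by
    rank_in_a, with orientation +1 (same strand) or -1 (inverted).
    """
    blocks, current = [], []
    for anchor in anchors:
        if current:
            _, prev_b, prev_o = current[-1]
            _, b, o = anchor
            # Collinear: same orientation, and B advances by one gene
            # in the direction given by that orientation.
            collinear = o == prev_o and b - prev_b == o
            if not collinear:
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(anchor)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Toy anchors: a forward block, an inverted block, then a lone gene.
anchors = [
    (0, 10, 1), (1, 11, 1), (2, 12, 1),
    (3, 40, -1), (4, 39, -1), (5, 38, -1),
    (6, 5, 1),
]

blocks = syntenic_blocks(anchors)
for block in blocks:
    print(block)
```

The second block is an inversion: gene order in genome B runs backwards, but order and orientation are still conserved within the block.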

21/30

Aligning large genomic sequences

Difficulties:

• Requires significant computing resources

• Scalability, as more and more genomes are sequenced

• Time constraints

• As the «true» alignment is not known, it is difficult to measure alignment accuracy and to choose the right method

22/30

Whole Genome Alignments

• BLASTZ-net (nucleotide level) closer species e.g. human – mouse

• Translated BLAT (amino acid level) more distant species, e.g. human – zebrafish

• EPO/PECAN multispecies alignments

• ORTHEUS used to determine ancestral alleles

23/30

Which Multispecies Alignments?

Mercator-Pecan

• 16 amniota vertebrates + constrained elements

Enredo-Pecan-Ortheus (EPO)

• For 6 primates

• For 5 teleost fish + constrained elements

• For 12 eutherian mammals

• For 34 eutherian mammals + constrained elements

24/30

• “Phylogenetic Footprinting” – conserved noncoding regions can be functional

• Regulatory regions discovered in this way for genes:

Hoxb-1, Hoxb4, PAX6, SOX9

Non-Coding Regions
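The idea behind phylogenetic footprinting can be sketched as a sliding-window scan for fully conserved columns in a multiple alignment. This is a deliberately naive illustration with invented sequences: it counts identical columns only, and does not reproduce the statistical scoring used to call constrained elements.

```python
def conserved_windows(alignment, window, min_fraction):
    """Report windows where enough columns are identical in all rows.

    alignment: list of equal-length aligned sequences.
    Returns a list of (start, fraction_conserved) tuples.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    # A column is conserved if every row has the same non-gap base.
    conserved = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*alignment)
    ]
    hits = []
    for start in range(length - window + 1):
        fraction = sum(conserved[start:start + window]) / window
        if fraction >= min_fraction:
            hits.append((start, fraction))
    return hits


# Toy three-species alignment (invented sequences).
alignment = [
    "ACGTACGTTT",
    "ACGTACCTAT",
    "ACGTACGTCT",
]

hits = conserved_windows(alignment, window=5, min_fraction=1.0)
print(hits)
```

Windows that stay conserved across distantly related species are candidates for functional non-coding elements, which still need experimental confirmation.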

25/30

More Examples

• Highly conserved transcription factor binding sites discovered

e.g. a 401 bp non-coding sequence involved in transcriptional regulation of interleukins.

• New genes (human-mouse comparison)

e.g. APOA5, identified as a paralogue of APOA4 in human and mouse.

26/30

Going Beyond Mammals

Where human-mouse is too conserved, go to other species:

Chicken (mammals and birds: 300 MYA)

e.g. A cardiac-specific enhancer of Nkx2-5

Human and fish (400-450 MYA)

In 2002, comparison of human to Fugu rubripes led to identification of 1000 genes.

27/30

Regulatory Features of the PDX1 gene

Region in Detail shows conservation of sequence in regions involved in PDX1 transcriptional regulation (1.6-2.8 kb upstream of the gene).

28/30

1. Have a look at Region in Detail for the ACN9 gene.

2. Turn on the BLASTZ alignment against macaque. What parts of the macaque genome align to this region in human?

3. Turn on the constrained elements for the 33 eutherian mammals. How does this track differ from the BLASTZ alignment?

Alignments Exercise

29/30

1. Zoom out one box in the zoom slider.

Are there constrained elements upstream of the ACN9 transcript that overlap a regulatory feature?

2. View the ‘6 primates alignment’ using the Alignments links at the left.

Alignments Continued

30/30

Compara Team at EBI

• Javier Herrero

• Kathryn Beal

• Stephen Fitzgerald

• Leo Gordon
