Genome Paleontology: Discoveries from complete genomes Steven L. Salzberg The Institute for Genomic Research (TIGR) and Johns Hopkins University



© 2003 Steven L. Salzberg

What is genome paleontology?

Compare genomes to uncover:
• history of species
• genome transformations
• recent mutations such as SNPs
• evolution

Outline (time permitting)

• An algorithm for rapid large-scale alignment
  • A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Res 27:11 (1999), 2369-76.
  • MUMmer 2: Delcher et al., NAR, 2002.
• Alignments and analyses of bacterial genomes
  • J.A. Eisen, J.F. Heidelberg, O. White, and S.L. Salzberg. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biology 1:6 (2000), 1-9.
• Large-scale genome duplications: plant and human
  • The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 (2000), 796-815.
  • J.C. Venter et al. The sequence of the human genome. Science 291 (2001), 1304-1351.
• Lateral gene transfer between humans and bacteria
  • S.L. Salzberg, O. White, J. Peterson, and J.A. Eisen. Microbial genes in the human genome: lateral transfer or gene loss? Science 292 (2001), 1903-1906.

Genomes completed and published by TIGR and our collaborators, 1995-present

Organism | Reference
Arabidopsis thaliana | Lin et al., Nature 402:761-8 (1999)
Archaeoglobus fulgidus | Klenk et al., Nature 390:364-370 (1997)
Bacillus anthracis Ames | Read et al., Nature 423:81-86 (2003)
Bacillus anthracis Florida | Read et al., Science 296:2028-33 (2002)
Borrelia burgdorferi | Fraser et al., Nature 390:580-586 (1997)
Brucella suis | Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus | Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae | Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum | Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae | Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum | Eisen et al., PNAS 99:9509-9514 (2002)
Coxiella burnetii RSA 493 | Seshadri et al., PNAS 100:5455-60 (2003)
Deinococcus radiodurans | White et al., Science 286 (1999)
Enterococcus faecalis | Paulsen et al., Science 299:2071-2074 (2003)
Haemophilus influenzae | Fleischmann et al., Science 269 (1995)
Helicobacter pylori | Tomb et al., Nature 388:539-547 (1997)
Methanococcus jannaschii | Bult et al., Science 273:1058-1073 (1996)
Mycobacterium tuberculosis | Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium | Fraser et al., Science 270:397-403 (1995)
Neisseria meningitidis | Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10 | Wing et al., Science 300:1566-1569 (2003)
Plasmodium falciparum | Gardner et al., Nature 419:531-534 (2002)
Plasmodium yoelii | Carlton et al., Nature 419:512-519 (2002)
Porphyromonas gingivalis | Nelson et al., J. Bact., in revision
Pseudomonas putida | Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis | Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae | Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae | Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus | Arnold et al., Virology 15:252-66 (2000)
Thermotoga maritima | Nelson et al., Nature 399:323-329 (1999)
Treponema pallidum | Fraser et al., Science 281:375-388 (1998)
Vibrio cholerae | Heidelberg et al., Nature 406 (2000)

Genomes in progress or recently completed

Fibrobacter succinogenes, Prevotella intermedia, Pseudomonas fluorescens, Silicibacter pomeroyi DSS-3, Streptococcus agalactiae A909, Streptococcus gordonii, Streptococcus mitis, Streptococcus pneumoniae 670, Acidobacterium capsulatum, Bacillus anthracis A01055, Bacillus anthracis A0402, Bacillus anthracis Ames 0581, Burkholderia thailandensis, Campylobacter coli RM2228, Campylobacter upsaliensis RM3195, Clostridium perfringens SM101, Epulopiscium fishelonii, Hyphomonas neptunium, Listeria monocytogenes F6854, Listeria monocytogenes H7858, Mycoplasma arthritidis, Mycoplasma capricolum, Myxococcus xanthus, Prevotella ruminicola, Pyrococcus furiosus, Verrucomicrobium spinosum, Actinomyces naeslundii

Bacillus anthracis A0071, Bacillus anthracis Kruger B, Erwinia chrysanthemi, Gemmata obscuriglobus, Mycobacterium tuberculosis, Ruminococcus albus, Streptococcus sobrinus, Aspergillus fumigatus, Brugia malayi, Coccidioides immitis, Cryptococcus neoformans, Entamoeba histolytica, Oryza sativa Chromosome 3 & 10, Plasmodium vivax, Schistosoma mansoni, Solanum spp., Tetrahymena thermophila, Toxoplasma gondii, Theileria parva, Trichomonas vaginalis, Trypanosoma brucei, Trypanosoma cruzi

Acidithiobacillus ferrooxidans, Bacillus anthracis Kruger B, Burkholderia mallei, Clostridium perfringens ATCC13124, Dehalococcoides ethenogenes, Desulfovibrio vulgaris, Ehrlichia chaffeensis, Ehrlichia sennetsu, Geobacter sulfurreducens, Listeria monocytogenes, Methylococcus capsulatus, Mycobacterium avium 104, Mycobacterium smegmatis, Pseudomonas syringae, Staphylococcus aureus, Staphylococcus epidermidis, Treponema denticola, Wolbachia sp., Anaplasma phagocytophila, Bacillus cereus 10987, Bacteroides forsythes, Brucella ovis, Baumannia cicadellinicola, Campylobacter jejuni, Carboxydothermus hydrogenoformans, Colwellia sp. 34H, Dichelobacter nodosus

Genome-Scale Sequence Alignment

• Efficiently compute alignments between entire genomes and chromosomes, for example:
  • Two strains of B. anthracis, each 5.1 Mb (< 30 CPU seconds)
  • Two chromosomes of A. thaliana, each 20-30 Mb (< 5 minutes)
  • Two human chromosomes, 100+ Mb each (< 30 minutes)

MUMmer alignments

• MUMs: Maximal Unique Matches
  • Algorithm finds ALL matches
  • String them together and align gaps
• Suffix trees
  • Very fast alignment of long DNA sequences
  • Linear time and space requirements
• Software at: http://www.tigr.org/software/mummer/
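The "string them together" step can be pictured as picking a set of MUMs that appear in the same order in both genomes. The sketch below is a simplified illustration, not MUMmer's own code: it assumes each MUM is a hypothetical `(pos_a, pos_b, length)` tuple and uses a quadratic dynamic program to choose the chain with the largest total matched length. Overlaps between neighbouring matches are ignored for simplicity.

```python
def chain_mums(mums):
    """Return the heaviest chain of MUMs that is collinear in A and B.

    Each MUM is an illustrative (pos_a, pos_b, length) tuple. A chain must
    have strictly increasing pos_a and pos_b; its weight is the sum of the
    match lengths. O(n^2) time, which is enough for a small sketch.
    """
    ordered = sorted(mums)                  # sort by position in genome A
    n = len(ordered)
    if n == 0:
        return []
    best = [m[2] for m in ordered]          # best chain weight ending at i
    prev = [-1] * n                         # back-pointers for the traceback

    for i in range(n):
        for j in range(i):
            collinear = (ordered[j][0] < ordered[i][0]
                         and ordered[j][1] < ordered[i][1])
            if collinear and best[j] + ordered[i][2] > best[i]:
                best[i] = best[j] + ordered[i][2]
                prev[i] = j

    # Walk back from the end of the heaviest chain.
    i = max(range(n), key=lambda k: best[k])
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


example = [(0, 10, 5), (8, 2, 4), (15, 20, 6), (25, 30, 3), (40, 5, 7)]
print(chain_mums(example))  # [(0, 10, 5), (15, 20, 6), (25, 30, 3)]
```

The MUMs left out of the chain, here `(8, 2, 4)` and `(40, 5, 7)`, are the ones that break collinearity. The regions between consecutive chained MUMs are the "gaps" that remain to be aligned.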

Suffix Trees

• A trie: a tree with edges labelled by strings
• Each leaf represents a sequence: the labels on the path to it from the root
• Holds all suffixes of a set of sequences
• The suffix tree for sequences A and B:
  • Contains |A| + |B| leaf nodes
  • Can be constructed in O(|A| + |B|) time!
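To make the "one leaf per suffix" idea concrete, here is a minimal sketch of an uncompressed suffix trie built from nested dictionaries. It is deliberately naive: it takes quadratic time and space, whereas a real suffix tree compresses single-child paths and can be built in linear time as the slide states. The function names are illustrative only.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node with no children is a leaf; each leaf ends one suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(trie, pattern):
    """Any substring of the text is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


# The unique terminator "$" guarantees no suffix is a prefix of another.
trie = build_suffix_trie("atgtgtgtc$")
print(count_leaves(trie))        # 10, one leaf per suffix
print(contains(trie, "gtgtc"))   # True
print(contains(trie, "gtcc"))    # False
```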

Maximal Unique Matches (MUMs)

Sequences in genomes A and B that:
• Occur exactly once in A and in B
• Are not contained in any larger matching sequence

[Figure: matching segments of A and B; the match occurs only here in each genome and has a mismatch at both ends.]
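The definition above can be checked directly on toy sequences. The following brute-force sketch is for illustration only: MUMmer finds MUMs with suffix trees, while this version simply tries every pair of start positions. It assumes 0-based positions, considers the forward strand only, and the `min_len` parameter is a hypothetical cutoff for ignoring very short matches.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return sorted (pos_a, pos_b, length) tuples for every MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # A match that can be extended to the left is not maximal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return sorted(mums)


print(find_mums("ggacgtacc", "ttacgtaaa", min_len=4))  # [(2, 2, 5)]
```

Here the MUM is `acgta`, which starts at position 2 in both strings and is flanked by mismatches on both sides.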

MUMmer 2 streaming algorithm

[Figure: Suffix Tree for String atgtgtgtc$, shown together with the Streaming String ...atgtcc...]

MUMmer results: M. tuberculosis CDC1551 vs. H37Rv

      A    C    G    T
A     -   66  164    9
C    48    -   81  169
G   164   89    -   44
T    11  159   61    -

[Figure annotation: a MUM]

Helicobacter pylori strain 26695 vs. J99

V. cholerae vs. E. coli (forward)

V. cholerae vs. E. coli (reverse)

V. cholerae vs. E. coli (both strands)

Duplication and Gene Loss?

[Figure: a gene order A-F is duplicated (A'-F'), after which the E. coli and V. cholerae lineages each retain a different subset of the original and duplicated genes.]

V. cholerae vs. itself

S. pyogenes vs. itself

Symmetric Inversions Model

[Figure: circular genomes with genes numbered 1-32. A common ancestor of A and B diverges into lineage A (A1, A2, A3) and lineage B (B1, B2, B3) through successive inversions around the origin and inversions around the terminus.]
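A toy version of the symmetric inversions model shows why such inversions produce an X-shaped dot plot. This is a sketch under simplifying assumptions: the genome is a Python list read from the origin, the number of genes is even so the terminus sits exactly opposite the origin, and gene orientation is ignored. The function names are illustrative.

```python
def invert_around_origin(genes, k):
    """Reverse the 2k genes centred on the origin of a circular genome.

    The origin lies between genes[-1] and genes[0].
    """
    n = len(genes)
    if not 0 <= 2 * k <= n:
        raise ValueError("k must satisfy 0 <= 2k <= number of genes")
    if k == 0:
        return list(genes)
    segment = genes[n - k:] + genes[:k]     # the 2k genes spanning the origin
    segment.reverse()
    result = list(genes)
    result[n - k:] = segment[:k]
    result[:k] = segment[k:]
    return result


def invert_around_terminus(genes, k):
    """Reverse the 2k genes centred on the terminus (opposite the origin)."""
    half = len(genes) // 2
    if not 0 <= k <= half:
        raise ValueError("k must satisfy 0 <= k <= half the number of genes")
    result = list(genes)
    result[half - k:half + k] = reversed(genes[half - k:half + k])
    return result


ancestor = list(range(1, 33))               # genes numbered 1-32
derived = invert_around_origin(ancestor, 3)
derived = invert_around_terminus(derived, 5)
derived = invert_around_origin(derived, 8)

# Every gene ends up either at its ancestral index or at the mirror index,
# that is, on one of the two diagonals of the dot plot.
on_x = all(derived.index(g) in (g - 1, 32 - g) for g in ancestor)
print(on_x)  # True
```

Each symmetric inversion swaps position p with position n-1-p, so no matter how many such inversions accumulate, a gene can only sit on the main diagonal or the anti-diagonal.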

M. leprae vs. M. tuberculosis

[Dot plot: M. leprae (horizontal axis) vs. M. tuberculosis (vertical axis).]

The “X-files” paper


Arabidopsis genome paleontology

Compare all chromosomes to each other....

Diorama by B.E. Dahlgren, © The Field Museum, Chicago

The hunt for genome-scale duplications

• S. cerevisiae? 16% duplicated (Seoighe & Wolfe, 1999)
• Maize? 10 chromosomes vs. 5 in some related grasses; segmental allotetraploid? (Gaut & Doebley, 1997)
• Drosophila melanogaster: no duplications
• Vertebrates: much speculation but little evidence (Skrabanek & Wolfe, 1998)
• Arabidopsis thaliana: yes!

First discovery: large-scale duplication between chromosomes 2 and 4 (Lin et al., 1999)

[Dot plot: chr. 2 vs. chr. 4]

Tandem duplications

[Dot plot: chr. 1 vs. chr. 1]

• Over 60% of the genome is covered by duplicated regions
• Centromeres cover much of the rest
• Strikingly, only about 1/3 of the genes in each block remain as duplicates

No triplications!

• 19-24 large-scale duplications
• >60% of the genome duplicated
• If duplications occurred over time, triplications highly likely
• Duplications likely happened as one event (on evolutionary time scale)
• Conclusion: whole genome duplication
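The argument that independent duplications would almost certainly produce triplications can be illustrated with a toy simulation. The model is an assumption made for this sketch, not the analysis behind the slide: 20 duplication events each copy a randomly placed segment covering 3% of a unit-length genome (60% in total), and any overlap between two copied segments counts as a region that would exist in three copies.

```python
import random


def has_overlap(starts, seg_len):
    """True if any two segments of length seg_len overlap."""
    ordered = sorted(starts)
    return any(b - a < seg_len for a, b in zip(ordered, ordered[1:]))


def overlap_fraction(n_events=20, seg_len=0.03, trials=2000, seed=0):
    """Fraction of simulated histories that contain at least one overlap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        starts = [rng.uniform(0.0, 1.0 - seg_len) for _ in range(n_events)]
        hits += has_overlap(starts, seg_len)
    return hits / trials


print(overlap_fraction())
```

Under these assumptions essentially every simulated history contains an overlap, so the absence of triplications is hard to explain if the duplications arose one at a time.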

Warning: Salzberg’s speculation follows

Start with 4 ancestral chromosomes

[Animation frames: the chromosome labels progress from I III IV V I III IV V (the four ancestral chromosomes, duplicated), through I III IV V I III IV V II and I III IV V I IV V II, to the present-day I II III IV V.]

Warning: data quality control

• Until December 2000, Arabidopsis data in GenBank was all BAC-based
• Errors included:
  • BACs on the wrong chromosome
  • BACs entered twice with different IDs, different annotation (sequenced twice), slightly different sequence
• For duplications analysis, these errors would prove disastrous
• Many of these errors are still in GenBank
  • Old BACs are not automatically deleted

Human Genome

• Analysis used Celera’s assembly and annotation
  • 26,588 genes, ordered along each of 24 chromosomes
• MUMmer 2.0 used to align whole chromosomes
  • Nothing found in DNA-level alignments
  • Proteome alignments used instead
• Recently re-computed using latest human genome annotation (Ensembl)

Human whole-genome alignment

• Create 24 “mini-proteomes” by concatenating all proteins on each chromosome
• Use MUMmer to align each mini-proteome to the complete proteome (9,675,713 amino acids)
• Search for conserved clusters of proteins
• Confirmed analysis by looking at Blast hits of all vs. all
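A mini-proteome is only useful if a match position in the concatenated string can be traced back to the gene it came from. The sketch below shows one way to do that bookkeeping. The separator character, the function names and the input format are assumptions for illustration; the slides do not say how the concatenation was implemented.

```python
from bisect import bisect_right


def build_mini_proteome(proteins, separator="X"):
    """Concatenate proteins in chromosomal order.

    proteins is a list of (gene_id, amino_acid_sequence) pairs. Returns the
    concatenated string, the start offset of each protein, and the gene ids.
    """
    pieces, starts, ids = [], [], []
    offset = 0
    for gene_id, seq in proteins:
        starts.append(offset)
        ids.append(gene_id)
        pieces.append(seq)
        offset += len(seq) + len(separator)
    return separator.join(pieces), starts, ids


def gene_at(position, starts, ids):
    """Map a position in the mini-proteome back to a gene id.

    A position that falls on a separator maps to the preceding gene.
    """
    return ids[bisect_right(starts, position) - 1]


chromosome = [("g1", "MKV"), ("g2", "MAAL"), ("g3", "MT")]  # made-up proteins
mini, starts, ids = build_mini_proteome(chromosome)
print(mini)                       # MKVXMAALXMT
print(gene_at(5, starts, ids))    # g2
```

With this mapping, a run of matches that hits consecutive genes on two chromosomes can be reported as a conserved cluster of gene pairs.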

What we’re looking for

Not looking for:
• tandem duplications
• domain hits (very common, often give highly significant Blast hits)

Summary results

• 1077 duplicated blocks
  • 10,310 “gene pairs”; “pair” = 2 genes that match between two blocks
  • 296 blocks with 3-4 gene pairs
  • 781 blocks with 5 or more gene pairs
• 3522 distinct genes, many duplicated more than once
• Large block: 33 genes on chr 2 and chr 14
  • spans 63 Mbp on chr 14, over 70% of chr 14’s length
  • spread over 97 genes on chr 2 and 332 genes on chr 14
  • includes two of four known Hox clusters, an ancient duplication
• Large block: 64 genes on chr 18 and chr 20
  • previously undiscovered
• Shuffled data: 370 gene pairs (3.6% false positive rate)
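The summary figures are internally consistent, as a quick check shows. This snippet only re-derives numbers already given above; it assumes the false positive rate is the ratio of gene pairs found in shuffled data to gene pairs found in the real data.

```python
blocks_with_3_to_4_pairs = 296
blocks_with_5_or_more_pairs = 781
total_blocks = blocks_with_3_to_4_pairs + blocks_with_5_or_more_pairs

gene_pairs = 10_310        # pairs found in the real data
shuffled_pairs = 370       # pairs found after shuffling

false_positive_rate = shuffled_pairs / gene_pairs

print(total_blocks)                    # 1077
print(f"{false_positive_rate:.1%}")    # 3.6%
```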

Duplications in Human Chromosome 2


Human-mouse genome mapping

• Close evolutionary distance permits DNA-level alignments
• Protein similarity even greater than DNA
• MUMmer quickly aligns each mouse mini-proteome to its human counterparts
• Blast finds most (not all) of the same matches (and is far slower)
• 77% (566/731) of Mouse16 genes are found in syntenic regions of human
• 2.5% (18/731) of Mouse16 genes are unique to mouse, not found in human

Mouse chr 16 maps to human chromosomes 3, 8, 12, 16, 21, and 22


Have bacteria transferred their genes directly into the human genome?

“Startling” discovery, Feb. 2001: 223 bacterial genes were laterally transferred into a vertebrate ancestor of humans (from the Nature human genome paper)


Horizontal (Lateral) Gene Transfer


Vertical Inheritance


Horizontal Gene Transfer ???


Horizontal gene transfer in Arabidopsis thaliana chr 2 (Lin et al., Nature, 1999)

• 135 genes are most closely related to cyanobacterial genes and thus were likely transferred from the chloroplast to the nucleus
• Very recent transfer of > 250 kb section of mitochondrial genome
• Many additional older mitochondrial → nuclear gene transfers

Examples of Horizontal Transfers

• Antibiotic resistance genes on plasmids
• Pathogenicity islands
• Toxin resistance genes on plasmids
• Agrobacterium Ti plasmid
• Viruses and viroids
• Organelle to nucleus transfers

Mechanisms of Horizontal Transfer

• Plasmid exchange (prokaryotes)
• Mating/conjugation (prokaryotes)
• Viruses and viroids
• Organelle to nucleus exchange (eukaryotes)
• Scavenging from environment
• Passive absorption
• Fusion of cells

Nature human genome paper (2001): Evidence for transfer?

• Evidence: Genes match bacteria, but do not match non-vertebrate eukaryotes
• Or, genes really are in non-vertebrates, but have stronger match to bacteria
  • Measured by BLAST E-value
• 113 of the 223 genes found in a broad spectrum of prokaryotic species

Alternative explanations

• Gene loss from a small sample of non-vertebrate eukaryotes
  • Only 4 non-vertebrates used for analysis: fruit fly, nematode, yeast, and mustard weed (Arabidopsis)
  • Large and diverse set of prokaryotes (over 30 organisms, including extremophiles) used as well
• Rapid divergence in non-vertebrate eukaryotes (evolutionary rate variation)
• Still-incomplete genomes (e.g., D. melanogaster)
• Erroneous annotation/gene finding
• Contamination

Re-analysis: number of “transfers” decreases with # of genomes analyzed

[Bar chart: number of genes in the lateral transfer candidate set (vertical axis, 0 to 2000) against the number of protein sets removed (horizontal axis: 1, 2, 3, 4, 5, Other). Legend: Fruit fly, C. elegans, Arabidopsis, Yeast, Parasites.]
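The logic of the re-analysis can be sketched with sets. A human gene stays in the candidate set only while no non-vertebrate proteome in the comparison contains a match, so every proteome added can only shrink the set (equivalently, every proteome removed can only enlarge it, which is how the chart is drawn). All gene identifiers and hit lists below are made up for illustration.

```python
def transfer_candidates(bacterial_hits, eukaryote_hits, proteomes_used):
    """Human genes with a bacterial hit but no hit in any proteome used."""
    seen = set()
    for name in proteomes_used:
        seen |= eukaryote_hits[name]
    return bacterial_hits - seen


# Hypothetical data: human genes h1-h6 all have a bacterial match.
bacterial_hits = {"h1", "h2", "h3", "h4", "h5", "h6"}
eukaryote_hits = {
    "fruit fly": {"h1", "h2"},
    "C. elegans": {"h2", "h3"},
    "Arabidopsis": {"h4"},
    "yeast": {"h1", "h4"},
}

order = ["fruit fly", "C. elegans", "Arabidopsis", "yeast"]
sizes = []
for n in range(len(order) + 1):
    remaining = transfer_candidates(bacterial_hits, eukaryote_hits, order[:n])
    sizes.append(len(remaining))

print(sizes)  # [6, 4, 3, 2, 2]
```

The genes left at the end (`h5` and `h6` in this toy data) are not proof of transfer; they are simply genes for which no match has been sampled yet.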

Evolutionary Rate Variation

[Figure: diagram with elements labelled 1-6.]

Trees Don’t Support Transfer

[Figure: phylogenetic tree with scale bar 0.2, group labels Virus, Vertebrates and Bacteria, and clade labels I, II, III. Leaf labels, in the order extracted:
• Virus: Paramecium bursaria Chlorella virus 1
• Vertebrates: Homo sapiens HAS1, Mus musculus HAS1, Xenopus laevis, Xenopus laevis, Danio rerio, Homo sapiens, Mus musculus, Danio rerio, Xenopus laevis, Gallus gallus, Bos taurus, Homo sapiens, Mus musculus, Rattus norvegicus
• Bacteria: Bradyrhizobium sp SNU001, Rhizobium leguminosarum, Rhizobium sp, Rhizobium loti, Rhizobium tropici, Rhizobium sp. NodC, Mesorhizobium sp 7653R, Sinorhizobium meliloti, Rhizobium meliloti, Rhizobium leguminosarum, Rhizobium galegae, Azorhizobium caulinodans, Stigmatella aurantiaca, Streptomyces coelicolor, Streptococcus uberis, Streptococcus equisimilis, Streptococcus pyogenes HASA, Streptococcus pneumoniae]

Birney et al., Nature special issue on human genome

“The unfinished human genomic DNA may contain contamination, particularly from bacteria but also from other sources. .... If the predicted gene matches a bacterial gene more closely than any vertebrate gene then it will almost always be a contaminant.”

Were genes really transferred? NO

• Our re-analysis finds just 41 genes (Ensembl) or 46 (Celera) with best hits to bacteria, not 223
  • All of these could be explained by alternative mechanisms
• More genomes will likely eliminate these remaining candidates
  • At least 3 have already been found in Drosophila, 10 more in other species
• Great care is needed in order to make assertions of transfer from bacteria to humans
  • Implications would be significant; e.g., GMOs
• Even more care is needed when working with unfinished data
• Nature erratum to human genome paper, August 2001: “We agree.”

Acknowledgements

• MUMmer: Arthur Delcher, Jeremy Peterson, Rob Fleischmann, Owen White, Simon Kasif, Jonathan Allen, Sam Angiuoli, Adam Phillippy
• X alignments: Jonathan Eisen, Owen White, John Heidelberg
• Arabidopsis duplications:
  • TIGR: Maria Ermolaeva, Owen White, Jonathan Eisen, Xiaoying Lin, Samir Kaul
  • AGI collaborators: Klaus Mayer and all his MIPS colleagues, Mike Bevan
• Human duplications: Mark Yandell, Mark Adams, Mani Subramanian, Craig Venter (all formerly Celera), Ron Wides (Bar-Ilan University), Art Delcher
• Lateral transfer: Jonathan Eisen, Owen White, Jeremy Peterson
• Funding support:
  • National Institutes of Health (NHGRI, NLM)
  • National Science Foundation (CISE, BIO)