Tetrahymena genome project update 2004 by Jonathan Eisen

DESCRIPTION

Talk by Jonathan Eisen on progress on the Tetrahymena genome sequencing project. Presented at NSF Microbial Genomics Workshop in 2004.

Page 1: Tetrahymena genome project update 2004 by Jonathan Eisen

TIGR

Tetrahymena thermophila macronuclear genome project

Page 2: Tetrahymena genome project update 2004 by Jonathan Eisen

Acknowledgements

• Ed Orias

• Members of Tetrahymena steering committee

• Members of Tetrahymena Genome Advisory Board

• NSF/Pat Dennis

• NIGMS/Tony Carter

• Tetrahymena research community

Page 3: Tetrahymena genome project update 2004 by Jonathan Eisen

Genome Project Planning - coordinated by Ed Orias at UCSB

• 8/99 Workshop in Ciliate Genomics

• 10/99 First Meeting of Tetrahymena Genome Project Steering Committee

• 10/00 Second Meeting of Tetrahymena Genome Project Steering Committee

• 8/01 Third Meeting of Tetrahymena Genome Project Steering Committee

Page 4: Tetrahymena genome project update 2004 by Jonathan Eisen

Page 5: Tetrahymena genome project update 2004 by Jonathan Eisen

Details of Project

• Collaboration between

– TIGR (Jonathan Eisen, Malcolm Gardner, Steven Salzberg, others)

– Stanford (Mike Cherry)

– UCSB (Ed Orias)

• Funding

– NSF Microbial Genome Program

– NIH-NIGMS

Page 6: Tetrahymena genome project update 2004 by Jonathan Eisen

Major Goals of Project

• ~8x coverage of macronuclear genome of strain SB210

• Generation of genome assemblies

• Creation and maintenance of two genome databases

– Sequence and automated-annotation - TIGR

– Tetrahymena Genome Database - Stanford

Page 7: Tetrahymena genome project update 2004 by Jonathan Eisen

Eukaryotic Phylogeny

Page 8: Tetrahymena genome project update 2004 by Jonathan Eisen

Baldauf et al. 2001

Page 9: Tetrahymena genome project update 2004 by Jonathan Eisen

Why Tetrahymena?

• Model alveolate and ciliate

• Free living, pure culture, non pathogenic

• Genetic unicellular eukaryotic model:

– Processes and cellular components not found in yeasts

– Organelle function: cilia, phagosome, nucleoli, centrosomes

• Robust and novel molecular genetic tools

• Large research community

• Heterologous expression of alveolate genes

Page 10: Tetrahymena genome project update 2004 by Jonathan Eisen

Major Discoveries Using Tetrahymena

• Dynein and its unidirectional motor activity

• Ribozymes, self-splicing RNA

• Telomere structure, telomerase & telomerase RNA

• Role of histone acetylation in control of gene expression

• Role of RNAi in developmental DNA rearrangements

Page 11: Tetrahymena genome project update 2004 by Jonathan Eisen

Tools in Tetrahymena

• Genetic tools

– Conjugation, genetic-crossing, inducible self-fertilization

– Transformation, gene disruption, gene replacement

– Gene overexpression, ribosome antisense repression

• Many genomic resources

– Genetic maps (for mic and mac)

– Physical maps

– EST projects

• Ease of use

– Grows fast (1.5 h doubling) in pure culture

– Large cell size

– Large T° range for growth

– Storage in liquid N2

– Large scale sub-cellular compartment fractionation

Page 12: Tetrahymena genome project update 2004 by Jonathan Eisen

Tetrahymena’s two nuclear genomes

Micronucleus (MIC): germline genome (silent); 5 pairs of chromosomes

Macronucleus (MAC): somatic genome (expressed); 250-300 chromosomes @ ~45 copies each

Page 13: Tetrahymena genome project update 2004 by Jonathan Eisen

Macronuclear Differentiation

Page 14: Tetrahymena genome project update 2004 by Jonathan Eisen

Macronuclear Genome

• Little repetitive DNA

• 180 Mbp genome

• Little evidence for large duplications

• No centromeres

• Few and small introns

• No alternative splicing reported

• Genes are lower AT (63%) than rest of the genome (83%)

Page 15: Tetrahymena genome project update 2004 by Jonathan Eisen

Major Achievements

• 8x coverage achieved September 20, 2003

• Shotgun assembly finished September 25, 2003

• Sequence and assembly data released to TIGR web site October 1, 2003

• Traces released to NCBI trace archive October 15, 2003

Page 16: Tetrahymena genome project update 2004 by Jonathan Eisen

Why sequence the Mac?

• Advantages:

– It contains all the genes and control elements required for life

– IES loss removes the vast majority of the germline’s repeated sequences

• Special challenges

– Assembling a highly fragmented genome.

– Relating the MAC genome sequence to the MIC genome.

Page 17: Tetrahymena genome project update 2004 by Jonathan Eisen

Macronuclear DNA Libraries

Library  Size of DNA used (kb)  % Good Sequences  % No insert
TTAAA    1.5-2.0                95                0
TUAAA    2.0-3.0                90                0
TXAAA    3.0-4.0                88                1
TYAAA    4.0-6.0                85                1
TQAAA    6.0-10.0               45                27

Made by Bill Nierman at TIGR

Page 18: Tetrahymena genome project update 2004 by Jonathan Eisen

Sequencing

• Sequencing done at the J. Craig Venter Science Foundation’s Joint Technology Center

• 1,197,106 reads (about 1.2 million), primarily from the 4-6 kb library

• Average edited length 815 bp

Page 19: Tetrahymena genome project update 2004 by Jonathan Eisen

Assembly

Scaffolds            2988
Contigs              4223
Bases in Scaffolds   106,196,540
Largest contig       715,652
Largest scaffold     2,217,035
Coverage             9.01
N50 Scaffolds        464,449

• Celera Assembler with modifications by Mihai Pop, Art Delcher, Steven Salzberg, et al.
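
As a rough cross-check of the coverage figure, the read count and average edited read length from the Sequencing slide can be divided by the bases in scaffolds. The sketch below is only that back-of-envelope calculation; the function name `raw_coverage` is hypothetical, and the slides do not say how the assembler's reported 9.01 was computed, so a small difference is expected (for example if some reads were left out of the assembly).

```python
def raw_coverage(num_reads: int, mean_read_len_bp: float, assembled_bases: int) -> float:
    """Total sequenced bases divided by assembled bases (a crude estimate)."""
    if assembled_bases <= 0:
        raise ValueError("assembled_bases must be positive")
    return num_reads * mean_read_len_bp / assembled_bases


# Figures taken from the Sequencing and Assembly slides.
reads = 1_197_106          # number of reads
mean_len = 815             # average edited read length, bp
bases = 106_196_540        # bases in scaffolds

estimate = raw_coverage(reads, mean_len, bases)
print(f"raw coverage estimate: {estimate:.2f}x (slide reports 9.01)")
```

The estimate comes out slightly above 9x, in line with the project goal of roughly 8x coverage having been exceeded.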

Page 20: Tetrahymena genome project update 2004 by Jonathan Eisen

Data Release

• All raw data is in the NCBI Trace Archive

• Sequences and assemblies are available at http://www.tigr.org/tdb/e2k1/ttg/ and will be available in GenBank

• Assemblies will be released monthly if there are any improvements

Page 21: Tetrahymena genome project update 2004 by Jonathan Eisen

Assorted statistics

Feature: Stat
Number of “capped” scaffolds: 114
Fraction of the genome residing in capped scaffolds: 40%
Fraction of the genome residing in scaffolds capped on at least one end: 75%
Post-genomic estimate of the number of MAC chromosomes: 292
Number of sequenced RAPS found in single scaffolds: 93/94 tested
Longest single-contig scaffold: 716 kb
Longest scaffold: 2.2 Mb
Longest capped scaffold (on both ends): 1.1 Mb
Shortest capped scaffold (on both ends): 37.5 kb
Estimated fold-redundancy of MIC sequence in the TIGR sequence database: 0.1 fold

Page 22: Tetrahymena genome project update 2004 by Jonathan Eisen

Accuracy?

• No scaffolds are larger than the corresponding MAC chromosomes

• All independently assorting loci match different scaffolds, and all co-assorting loci match either the same scaffold or scaffolds whose summed size is less than the size of the cognate MAC chromosome

• Previously obtained Cbs-adjacent sequences that match to untelomerized scaffolds invariably do so at scaffold ends.
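
The size checks in the first two bullets can be pictured as a simple ratio test. The sketch below is an illustration only: the function names are hypothetical, the 10% tolerance is an assumption loosely based on the 0.9 and 1.1 lines drawn on the plot that follows, and the sizes are made up rather than taken from the project.

```python
def size_ratio(scaffold_sizes_bp, chromosome_size_bp):
    """Summed scaffold length divided by the chromosome size."""
    if chromosome_size_bp <= 0:
        raise ValueError("chromosome_size_bp must be positive")
    return sum(scaffold_sizes_bp) / chromosome_size_bp


def is_consistent(scaffold_sizes_bp, chromosome_size_bp, tolerance=0.10):
    """True when the scaffolds fit within the chromosome, allowing for
    error in the chromosome size estimate (the tolerance is an assumption)."""
    return size_ratio(scaffold_sizes_bp, chromosome_size_bp) <= 1.0 + tolerance


# Hypothetical sizes, for illustration only (not project data).
print(is_consistent([400_000, 350_000], 800_000))  # two co-assorting scaffolds
print(is_consistent([900_000], 800_000))           # scaffold larger than chromosome
```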

Page 23: Tetrahymena genome project update 2004 by Jonathan Eisen

Scaffold to MAC Chromosome Size Ratio

[Plot: scaffold to chromosome ratio (y axis, 0.00 to 1.80) against MAC chromosome size in Mb (x axis, 0 to 3.5); series: observed values and the 0.9 and 1.1 lines]

Page 24: Tetrahymena genome project update 2004 by Jonathan Eisen

Estimating the number of MAC chromosomes

• 114 “closed” scaffolds (= MAC chromosomes) encompass 40% of the genome sequence in scaffolds.

• If the size distribution of these scaffolds is representative, then, by proportionality,

• The entire genome is estimated to contain ~290 MAC chromosomes.

• This number falls within the range of earlier estimates, suggesting that few, if any, MAC chromosomes are missing from the TIGR Tetrahymena sequence
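
The proportionality argument above amounts to one division. A minimal sketch, assuming the rounded 40% figure from the slide (the function name `estimate_chromosomes` is hypothetical):

```python
def estimate_chromosomes(closed_scaffolds: int, genome_fraction: float) -> float:
    """Scale the count of closed scaffolds up to the whole genome,
    assuming their size distribution is representative."""
    if not 0 < genome_fraction <= 1:
        raise ValueError("genome_fraction must be in (0, 1]")
    return closed_scaffolds / genome_fraction


# 114 closed scaffolds covering about 40% of the sequence in scaffolds.
print(round(estimate_chromosomes(114, 0.40)))
```

With the rounded inputs this gives about 285, close to the ~290 quoted on the slide; the slide's figure presumably comes from the unrounded fraction.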

Page 25: Tetrahymena genome project update 2004 by Jonathan Eisen

Assembly Issues

• rRNA and mitochondrial contigs are considered “repetitive” due to the higher depth of coverage

• Reran assembly in three subsets

– rRNA

– mitochondrial

– other sequences

Page 26: Tetrahymena genome project update 2004 by Jonathan Eisen

Assembly 2

                     rRNA          Mitochondria   Major chromosomes
Scaffolds            2             1              1971
Contigs              2             1              2955
Bases in Scaffolds   12,166        45,538         103,927,049
Largest contig       (not given)   45,538         715,652
Largest scaffold     12,166        45,538         2,214,258
Coverage             635x          17.85x         9.08x

Page 27: Tetrahymena genome project update 2004 by Jonathan Eisen

Page 28: Tetrahymena genome project update 2004 by Jonathan Eisen

Tetrahymena Genome Database

• Phenotypes associated with gene knockouts, replacements and other types of mutations.

• Gene regulation information from the literature.

• Post-translational modifications.

• Linkage & physical maps

• DNA polymorphisms

• Experimental protocols

• Links to other sites

Page 29: Tetrahymena genome project update 2004 by Jonathan Eisen

Page 30: Tetrahymena genome project update 2004 by Jonathan Eisen

Paul Doerder, Cleveland State

Immobilization antigens (i-ag)

Major GPI-linked cell surface protein

o related to surface proteins of disease-causing protists

o encoded by at least 8 families of paralogs expressed under different conditions of temperature and salinity

o members of H, L, J and S families already sequenced

Tetrahymena Genome Project:

o additional H, L, J and S paralogs and pseudogenes have been identified

o candidate I, T, M and P i-ag genes currently being tested by RT-PCR and real-time PCR

Page 31: Tetrahymena genome project update 2004 by Jonathan Eisen

Todd Hennessey, Buffalo

• Identified ecto-ATPase that he’s been trying to clone for the past 7 years

• Making a knockout

• Identified "lysozyme receptor" that he’s been trying to clone for the past 5 years

• We screened some antisense ribosome mutants, got an interesting phenotype (extended backward swimming in Ba++), BLASTed the short antisense sequence against the database and now have 1.7 kb of sequence to use to make a knockout

Page 32: Tetrahymena genome project update 2004 by Jonathan Eisen

Kathleen Karrer, Marquette

• We have just today had a paper accepted by Eukaryotic Cell, pending revisions, which was significantly enhanced by analysis of the database. There are two undergraduate co-authors on the paper.

Page 33: Tetrahymena genome project update 2004 by Jonathan Eisen

Cliff Brunk, UCLA

T. thermophila genes detected by CUI

[Plot: CUI versus Gene Position; y axis 1000/CUI (0 to 70), x axis Nucleotide Position (23500 to 43500)]

Page 34: Tetrahymena genome project update 2004 by Jonathan Eisen

Davis Asai, Harvey Mudd College

• Dynein heavy chains are very large ORFs (ca. 16 kb) and traditional cloning etc. has been a slow go.

• We were able to use the database to complete the determination of the sequence of the major cytoplasmic dynein heavy chain gene, DYH1, and we are extending our information on the second cytoplasmic dynein heavy chain, DYH2.

• Further, we have been able to walk "in silico" upstream of the DYH1 gene in order to make constructs for the N-terminal tagging of the heavy chain.

Page 35: Tetrahymena genome project update 2004 by Jonathan Eisen

TIGR – sequences

TIGR – scaffolds

Translate in 6 reading frames using ciliate code

Use these files as databases of all known proteins in Tetrahymena thermophila in these two mass spectrometry related searching programs (in-house):

J. Smith, K. Belay, S. Beeser, A. Keuroghlian, R.E. Pearlman, K.W.M. Siu
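
The six-frame translation step on this slide can be sketched as follows. This is a minimal illustration, not the in-house code: it assumes the ciliate nuclear code is NCBI translation table 6, in which TAA and TAG encode glutamine (Q) and only TGA is a stop, and all function names are hypothetical. Codons containing anything other than A, C, G or T are written as X.

```python
from itertools import product

# NCBI translation table 6 (ciliate nuclear code), codons ordered T, C, A, G.
# It differs from the standard code only in TAA and TAG, which encode Q.
BASES = "TCAG"
CILIATE_AA = "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), CILIATE_AA)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate from the first base, ignoring any trailing partial codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# TAA and TAG read through as Q; TGA remains a stop (*).
print(translate("ATGTAATAGTGA"))
```

Translating with the standard code instead would truncate many Tetrahymena proteins at TAA and TAG, which is why the ciliate table matters when building a protein search database.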

Page 36: Tetrahymena genome project update 2004 by Jonathan Eisen

Ciliary axonemal proteins from Tetrahymena thermophila

Gel approach…

• Excise

• Digest with trypsin

• Identify based on tryptic fingerprint using translated T. thermophila database (MS-FIT).

• Sequence individual peptides and identify using MASCOT and translated T. thermophila database.

2D LC/MS/MS approach…

• Digest with trypsin

• Divide into 30 fractions using SCX

• Run each fraction on a 1.5 hour reverse phase gradient (C18 column) into a mass spectrometer, acquiring a CID spectrum of each peptide in the solution.

• Identify using MASCOT and translated T. thermophila database.

Page 37: Tetrahymena genome project update 2004 by Jonathan Eisen

(These are different gels, not a magnification of the same gel)

Page 38: Tetrahymena genome project update 2004 by Jonathan Eisen

Preliminary Summary (using Gel approach):

Axonemal proteins found:

• Alpha Tubulin

• Beta Tubulin

• Unnamed protein product

• Axoneme central apparatus protein

• Chain A, Tryparedoxin Ii / Thioredoxin Peroxidase / Peroxiredoxin 2 / Natural Killer Cell Enhancing Factor

• Hypothetical Protein

• Dynein, 70 kDa intermediate chain

• Calmodulin like protein / Outer dynein arm-docking complex

• Axonemal leucine-rich repeat protein

• Testes specific A2 / Meichroacidin / phosphatidylinositol-4-phosphate

• invl / putative ankyrin repeat protein / Ankyrin 3

• Calmodulin

• Radial spokehead-like protein

• Flagellar Radial Spoke protein

• ABC transporter

Membrane proteins found (tubulins found in previous experiments):

• Hypothetical Protein

• Xenobiotic reductase

• SerH3 immobilization antigen

• NADH:flavin oxidoreductase

Page 39: Tetrahymena genome project update 2004 by Jonathan Eisen

Preliminary Analysis of the Tetrahymena Phagosome Proteome

L. Klobutcher (Univ. Connecticut Health Ctr.) & R. Pearlman (York Univ.)

[Figure label: Oral Apparatus]

PROTEINS IDENTIFIED:

*1. Vacuolar-type H+-ATPase

*2. Cathepsin B

*3. HSP 70

*4. 14-3-3 protein

5. Cytochrome b5-related protein

6. Two novel proteins

*Components of the mouse phagosome proteome (Garin et al. J. Cell Biol. 152:165, 2001)

Page 40: Tetrahymena genome project update 2004 by Jonathan Eisen

Doug Chalker, Wash. U.

Using the genome sequence to predict genes that we are going to use this semester as the focus of an undergraduate lab class.

We are going to knock out these genes and study the phenotypes. This will bring up-to-date research techniques into the undergraduate classroom.

Page 41: Tetrahymena genome project update 2004 by Jonathan Eisen

Marty Gorovsky, Rochester

• Expansion of a family of cysteine proteases

• Two new histone H3 genes

• One new histone H2A gene

Page 42: Tetrahymena genome project update 2004 by Jonathan Eisen

Kapler: Gene Amplification and DNA Replication Control

rDNA minichromosome (21 kb)

Macronuclear development: amplified 5,000-fold

Vegetative replication: once per cell cycle

Biochemically purified trans-acting factors: TIF1, TIF4

TIGR genome sequencing project: Bioinformatics

Immediate impact on two funded research projects

• Kapler: NIH (GMS) (Cis- and trans-acting determinants for replication and amplification of the rDNA minichromosome)

– Strong candidates identified for orthologs of Orc1,2,4,5,6, Cdc6, Mcm2-6, Cdt1

• Kapler and Orias (co-PIs): NSF (Eukaryotic Genetics) (Genetic dissection of replicons in non-rDNA chromosomes)

– Complete sequence of 16 non-rDNA minichromosomes (size range 37.4-99.5 kb)

Page 43: Tetrahymena genome project update 2004 by Jonathan Eisen

ID new genes by blasting

• 3 new histones, including a cen-P homolog (Gorovsky)

• 16 new ciliogenesis-induced genes with known homologs (Gorovsky)

• 51 novel ciliogenesis-induced genes with no known homologs (Gorovsky)

• 55 new cysteine protease genes, only one in GenBank (Gorovsky)

• 8 strong candidates for proteins involved in replication and amplification of the rDNA minichromosome (Kapler)

• Completing the very long (~16 kb) dynein heavy chain ORFs (Asai)

• Orthologues of light chains and light intermediate chains characterized in other systems (Asai)

• 2-3 families of homing endonucleases (Karrer)

• 20 nuclear transport proteins; interest, MIC vs. MAC (Jahn)

• New heat shock proteins (Miceli)

• New stress response proteins (oxidative and UV), including some never reported in protozoa (Miceli)

• Subunits of heterotrimeric G-proteins (Miceli)

• Tetrahymenol (cholesterol surrogate) cyclase; bacterial-related, possible LGT (Matsuda)

• Many snoRNA candidate genes (Nielsen)

• New alternative family of U1-3 spliceosomal RNAs (Nielsen)

• Glutamic-dehydrogenase; regulation-wise, “missing link” between bacterial and animal GDH; lacks “off” switch, just like mutant GDH that in children causes insulin hypersecretion (Smith)

• 16 complete minichromosomes (37.5 to 99 kb) for a study of origins of replication (Kapler)

Page 44: Tetrahymena genome project update 2004 by Jonathan Eisen

Page 45: Tetrahymena genome project update 2004 by Jonathan Eisen

Other Ciliate Projects

• Paramecium genomic survey (Dr. Linda Sperling, Centre de Genetique Moleculaire, CNRS, France)

• European rumen ciliate cDNA project (C. Jamie Newbold, Rowett Research Institute, Aberdeen, UK)

• Oxytricha (Spirotrich ciliate) micronuclear BAC project (Laura Landweber, Princeton University)

• Ichthyophthirius EST sequencing proposal (Theodore G. Clark, Cornell University)

Page 46: Tetrahymena genome project update 2004 by Jonathan Eisen

Relating MIC and MAC genomes

• Paired sequence tags from MAC chromosome ends adjacent to Cbs junctions

• MIC:MAC relational genetic and physical maps of sequenced DNA polymorphisms (not shown)

Page 47: Tetrahymena genome project update 2004 by Jonathan Eisen

Physically Relating the MIC and MAC Genomes

[Diagram labels: MIC, Cbs, Cbs Library, MAC]

Page 48: Tetrahymena genome project update 2004 by Jonathan Eisen

Ordering and Orienting Tetrahymena MAC Chromosome DNA in the Micronuclear Genome: Genominoes

[Diagram labels: Chromosome Breakage Junction Sequence, Scaffold Sequence]

Page 49: Tetrahymena genome project update 2004 by Jonathan Eisen

Current state of MIC Genominoes

I’m sending you a Word document with the status before I tel-linked the 273 additional scaffold ends.

Their tel-adjacent sequence was blasted against our paired Cbs tags on Friday.

I should be able to send you a slide with longer “contigs” of scaffolds within the next couple of days (please let me know what the hard deadline is).

Page 50: Tetrahymena genome project update 2004 by Jonathan Eisen

Fraction of the genome in Tel-linked Scaffolds

Scaffold    Number   % genome
-----------------------------
Both tels   114      40
One tel     120      35
No tel      289      25
-----------------------------
Total tel-linked scaffold ends: 348
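
The totals in this table follow from simple counting: a scaffold with telomeres at both ends contributes two tel-linked ends, and one with a single telomere contributes one. A short sketch of that arithmetic, using only the numbers on the slide (the variable names are hypothetical):

```python
# (telomere-capped ends per scaffold, number of scaffolds, % of genome)
scaffold_classes = {
    "both tels": (2, 114, 40),
    "one tel": (1, 120, 35),
    "no tel": (0, 289, 25),
}

total_ends = sum(ends * count for ends, count, _ in scaffold_classes.values())
total_scaffolds = sum(count for _, count, _ in scaffold_classes.values())
capped_at_least_once = sum(
    pct for ends, _, pct in scaffold_classes.values() if ends >= 1
)

print(total_ends)            # tel-linked scaffold ends
print(total_scaffolds)       # scaffolds in the table
print(capped_at_least_once)  # % of genome in scaffolds with at least one telomere
```

This reproduces the 348 tel-linked ends on this slide and the 75% "capped on at least one end" figure from the assorted statistics slide.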