

www.sciencemag.org/cgi/content/full/317/5846/1921/DC1

Supporting Online Material for

Genomic Minimalism in the Early Diverging Intestinal Parasite Giardia lamblia

Hilary G. Morrison,* Andrew G. McArthur, Frances D. Gillin, Stephen B. Aley, Rodney D. Adam, Gary J. Olsen, Aaron A. Best, W. Zacheus Cande, Feng Chen, Michael J. Cipriano, Barbara J. Davids, Scott C. Dawson, Heidi G. Elmendorf, Adrian B. Hehl, Michael E. Holder, Susan M. Huse, Ulandt U. Kim, Erica Lasek-Nesselquist, Gerard Manning, Anuranjini Nigam, Julie E. J. Nixon, Daniel Palm, Nora E. Passamaneck, Anjali Prabhu, Claudia I. Reich, David S. Reiner, John Samuelson, Staffan G. Svard, Mitchell L. Sogin

*To whom correspondence should be addressed. E-mail: [email protected]

Published 28 September 2007, Science 317, 1921 (2007)

DOI: 10.1126/science.1143837

This PDF file includes:

Materials and Methods
Figs. S1 to S5
Tables S1 to S8
References


Materials and Methods

DNA extraction and library construction

We used a whole genome shotgun sequencing strategy to determine the sequence of Giardia lamblia WB-C6 (ATCC50803; (1)). The estimated size of the genome is 11,700,000 nucleotides distributed over five chromosomes, including two size variants of chromosome 1 (2, 3). Trophozoites were grown in modified TYI-S-33 medium (1). DNA was extracted from trophozoites lysed in 100 mM NaCl, 10 mM Tris, 25 mM ethylenediamine tetraacetic acid (EDTA), 0.5% sodium dodecyl sulphate (SDS), pH 7.5, followed by treatment with RNase (0.1 mg/ml) and proteinase K (0.2 mg/ml) for 40 minutes at 37°C, and then phenol/chloroform extracted and ethanol precipitated. DNA was further purified by centrifugation in cesium chloride gradients and sent to Stratagene (La Jolla, CA) for commercial library construction or used to prepare libraries at the MBL (HGM) and at University of Illinois, Urbana-Champaign (CIR) (Table S1).

We partially digested 30 micrograms of genomic DNA using Tsp509I (New England Biolabs, Beverly, MA) at a concentration of 0.1 unit/microgram of DNA, incubated for between 5 and 30 minutes at 37°C. The incubation times were adjusted to obtain a smear in the 2-5 Kbp range on agarose gels. The AATT ends generated by Tsp509I are compatible with EcoRI restriction sites in the pUC18 vector. After size selection in 6% SeaPlaque agarose, regions of the gel containing 3-3.5 Kbp fragments were excised. We recovered DNA from the gel by melting for 5 minutes at 65°C and extracting with phenol, phenol-chloroform, and chloroform, followed by precipitation with ethanol. The purified 3-3.5 Kbp genomic fragments were ligated into the EcoRI site of the cloning vector pBluescript. Prior to DNA ligation, the vector was digested with EcoRI and dephosphorylated following the manufacturer's protocols (Stratagene, La Jolla, CA). DNA ligations were carried out overnight at 16°C and recombinant plasmids were electrophoresed on 0.6% SeaPlaque agarose gels. Recombinants with a single insert are represented by a single tight band, while a ladder of bands corresponding to vector plus one insert, vector plus two inserts, etc. is observed if multiple DNA inserts are incorporated into the vector. The region of the gel containing the vector plus one insert was excised and DNA recovered as described above. The recombinant plasmids containing a single DNA insert were electrotransformed using a BioRad Gene Pulser into XL1Blue-MRF cells, which were plated immediately without incubation or IPTG induction. Recombinant plasmids were picked and used for preparing DNA sequencing templates.

We also prepared random fragment libraries using physical shearing and blunt end ligation techniques. For blunt end ligations, DNA was isolated from Giardia as described above and 100 micrograms was suspended in 1 ml of 25% glycerol/1M NaOAc. The sample was nebulized with a stream of nitrogen at approximately 10 psi for 15-20 sec, with the pressure adjusted as needed to yield a smear in the 2-5 Kbp range on agarose gels. The DNA was concentrated by ethanol precipitation, suspended in 100 microliters of Bal31 buffer and incubated with Bal31 nuclease for five minutes at 30°C. After extraction with phenol, phenol/chloroform, and chloroform, followed by ethanol precipitation, the DNA fragments were sized and extracted from 0.6% agarose gels as described above. DNA was suspended in buffer containing 2 mM dNTPs plus 5 units of T4 DNA polymerase and incubated for 30 min at 16°C. After extraction and ethanol precipitation, the blunt-end DNA was mixed with 200 nanograms of pUC18 that had been cut with SmaI and dephosphorylated. After overnight incubation at 16°C, the ligation reaction was resolved on a 0.6% agarose gel to isolate vectors with a single insert as described above.


Sequencing and assembly

Approximately 200,000 reads were generated by sequencing both ends of the small-insert plasmid libraries. Additionally, 2,400 end sequences were generated from a 200 kbp library in the bacterial artificial chromosome vector pBACE3.6, to assist with ordering and orienting contigs. Sequencing was initially done using LI-COR automated sequencers (labeled primers; ~90,000 reads) and completed by capillary sequencing (BigDye terminator chemistry) on an Applied Biosystems 3700.

Reads were assembled using the ARACHNE 2.0 genome assembly software (4, 5) with basecalling by either LI-COR software (for LI-COR generated reads) or PHRED (6, 7) (capillary reads). ARACHNE includes routines for trimming low quality sequence, vector, and contaminating E. coli DNA. Eighty-three percent of the total reads were retained in the assembly, of which 91% were paired. Total genomic coverage was 11X. The ARACHNE assembly was edited using CONSED (8). Some repeated elements were incorrectly assembled (overtiled) and were manually curated based on previously published data. Additional directed sequence reads were generated, including those based on multiplex PCR (9, 10).

The assembly of scaffolds had a mean GC content of 49%. The average contig is 36.5 kbp, for a combined total of 11.2 Mb or 96% coverage of the genome. The assembly excludes Giardia’s frequently rearranged telomeric and subtelomeric regions, which comprise approximately 4% of the genome and are enriched with repetitive VSPs, GC-rich ribosomal RNA repeats and transposons. There is no evidence of variability at other chromosomal locations. Approximately 77% of the genome is in ORFs; nearly 1800 were overlapping, ~1500 more were within 100 nt of an adjacent ORF, and ~200 were separated by fewer than 12 nt. We detected only four introns. We identified ~700 putative Giardia-specific proteins, i.e. those with evidence of expression (SAGE or cDNA) but no similarity to any known protein or protein motif. Giardia encodes all twenty tRNA synthetases, and 63 tRNA genes (five with introns) are distributed on 34 different contigs (Table S2). Several tRNA genes are close to the start of transcribed ORFs (4-12 nt).
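The coverage figure in this paragraph can be checked against the estimated genome size given earlier (11,700,000 nucleotides). The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the assembled total is used as rounded in the text.

```python
# Both figures are taken from the text; the variable names are illustrative.
ESTIMATED_GENOME_NT = 11_700_000  # estimated genome size
ASSEMBLED_NT = 11_200_000         # combined contig total, as rounded in the text

fraction_assembled = ASSEMBLED_NT / ESTIMATED_GENOME_NT
print(f"{fraction_assembled:.1%} of the estimated genome is in the assembly")
```

The result rounds to the reported 96%, which is consistent with roughly 4% of the genome lying in the excluded telomeric and subtelomeric regions.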

Gene finding and preliminary annotation

Genes were predicted using the computer programs CRITICA (11) and GLIMMER (12, 13), with a training set of published giardial genes from GenBank. The two programs collectively predicted 9,663 protein-coding genes after removal of ORFs lacking proper start and stop codons. Where alternate start codons were possible for gene calls, the longest ORF was retained. We retained overlapping ORFs on alternate reading frames. Each ORF was evaluated using independent statistical measures: TestCode (14), GeneScan (15) and the Codon Adaptation Index (16). Threshold scores were a TestCode value of >4, a GeneScan score of >9.7, or an in-frame codon usage score greater than the two out-of-frame scores (17). We retained ORFs with (1) similarity to known proteins at an E-value of 1e-10 or better (2,370 ORFs), (2) weak or no similarity, but transcriptional or proteomic evidence (2,753 ORFs), or (3) no similarity or expression evidence, but high scores on two or more metrics (1,347 ORFs).
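The three retention rules can be read as an ordered decision procedure. The sketch below is one possible encoding and not the authors' pipeline: the class `OrfEvidence`, its field names, and the order in which the rules are tested are assumptions, while the thresholds are those quoted in the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TESTCODE_MIN = 4.0   # TestCode value must exceed this
GENESCAN_MIN = 9.7   # GeneScan score must exceed this
EVALUE_MAX = 1e-10   # similarity cutoff ("e-10 or better")

@dataclass
class OrfEvidence:
    """Hypothetical container for the evidence collected for one ORF."""
    best_evalue: Optional[float]          # best BLAST E-value, None if no hit
    expressed: bool                       # transcriptional or proteomic evidence
    testcode: float
    genescan: float
    inframe_codon_score: float
    outframe_codon_scores: Tuple[float, float]

def metrics_passed(orf: OrfEvidence) -> int:
    """Count how many of the three statistical metrics exceed their threshold."""
    return sum([
        orf.testcode > TESTCODE_MIN,
        orf.genescan > GENESCAN_MIN,
        orf.inframe_codon_score > max(orf.outframe_codon_scores),
    ])

def retention_class(orf: OrfEvidence) -> Optional[int]:
    """Return 1, 2, or 3 for the first matching retention rule, else None."""
    if orf.best_evalue is not None and orf.best_evalue <= EVALUE_MAX:
        return 1
    if orf.expressed:
        return 2
    if metrics_passed(orf) >= 2:
        return 3
    return None

# Invented example: no similarity, no expression, two of three metrics pass.
example = OrfEvidence(None, False, 5.2, 8.0, 0.61, (0.40, 0.45))
print(retention_class(example))
```

Summing the three reported class sizes (2,370 + 2,753 + 1,347) gives 6,470 retained ORFs out of the 9,663 initial predictions.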

Similarity to known proteins was assessed by querying GenBank and SwissProt protein databases using BLAST (18, 19) and the Pfam protein motif database with HMMER (20, 21). We searched for additional protein structural features with the programs COILS (22, 23), SEG (24), and TMM (25). We predicted cellular localization for proteins using PSORT (26), SignalP (27), and TargetP (28). Structural and non-coding RNAs were predicted using tRNAscan and the RFAM database (29, 30). We added SAGE (31, 32) and cDNA expression data to the ORF annotation database when these were available. Functional annotations were obtained using the SEED annotation process (33) and the KEGG Automatic Annotation Server, which compares predicted genes to the manually curated KEGG Genes database (34, 35). The predicted protein dataset was manually curated, primarily by querying the called ORFs, unassembled reads, and intergenic regions with known proteins via TBLASTN. We queried the nucleotide and amino acid datasets with known eukaryotic and giardial motifs, e.g. the polyadenylation signal and the VSP C-terminal sequence, as regular expressions to aid assessment of predicted genes.
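Motif queries of this kind can be sketched with ordinary regular expressions. The two patterns below are assumptions chosen for illustration (a C-terminal CRGKA pentapeptide for VSPs and an AGTRAAY consensus for the polyadenylation signal); neither pattern is specified in this supplement, and the function names are hypothetical.

```python
import re

# Illustrative patterns only; neither is given in the text above.
VSP_CTERM = re.compile(r"CRGKA$")             # assumed VSP C-terminal pentapeptide
POLYA_SIGNAL = re.compile(r"AGT[AG]AA[CT]")   # assumed AGTRAAY consensus

def has_vsp_cterm(protein: str) -> bool:
    """True if the protein ends with the assumed VSP C-terminal motif."""
    # A trailing '*' (stop codon symbol in translated ORFs) is ignored.
    return VSP_CTERM.search(protein.rstrip("*")) is not None

def polya_positions(dna: str) -> list:
    """Return 0-based start positions of the assumed polyadenylation signal."""
    return [m.start() for m in POLYA_SIGNAL.finditer(dna.upper())]

# Invented toy sequences, not taken from the genome.
print(has_vsp_cterm("MKLLCRGKA*"))
print(polya_positions("ccATGAGTAAACgg"))
```

Anchoring the protein pattern with `$` restricts matches to the C terminus, whereas the nucleotide pattern is allowed to match anywhere, which suits a scan of intergenic regions.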

Determination of the giardial kinome

We constructed profile HMMs for the PIKK, RIO, ABC1, PDK, and alpha-kinase families with HMMer and used these to search the ORF candidates using the Decypher hardware-accelerated HMMer implementation from Time Logic (http://www.timelogic.com). Sequences that matched the HMMs were extracted from contigs. Three or more homologues were used to determine full-length proteins, which were compared to the predicted ORFs. Final predicted kinase sequences were searched against the Pfam 7.4 HMM profiles, using both local and glocal models. All matches with P scores <0.01 were accepted, and all matches with scores of 0.01 to 1.0 were evaluated by comparison with known, homologous sequences, inspection of the domain alignment, and reference to the literature. Some calmodulin binding motifs were identified from the literature or from sequence similarity to known CaM motifs. Signal peptides were detected using SignalP and transmembrane regions using TM-HMM.
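The two score bands described above amount to a simple triage step. The following sketch assumes that matches scoring above 1.0 were not considered further, which the text does not state; the function name and labels are hypothetical.

```python
def triage_match(p_score: float) -> str:
    """Classify a Pfam match by its P score, following the bands in the text."""
    if p_score < 0.01:
        return "accept"
    if p_score <= 1.0:
        return "manual review"
    # Assumption: scores above 1.0 are dropped; the text is silent on this case.
    return "discard"

for score in (0.001, 0.2, 3.0):
    print(score, triage_match(score))
```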

Mutagenesis and analysis of aurora kinase

The Giardia AU1-epitope tagging vector (36) containing the promoter and coding sequences for Giardia aurora kinase (ORF5358) was the template plasmid for the deletion of the 28-amino acid insert (aa 153-180) by overlapping PCR. Two separate PCRs were performed using the primer sets: gAK-for (5’-ATC-TGA-ATT-CGG-ATA-AGG-ATA-AAG-AAA-GAG-3’) and delAKins-rev (5’-ATC-TTG-TAT-GTT-TTC-CCC-TTG-CAA-ATC-AAG-3’); gAK-rev (5’-TAT-TGG-GCC-CCT-TGG-GGA-CCT-TAC-TCC-TGT-3’) and delAKins-for (5’-TGA-CCG-GGG-AAA-ACA-TAC-AAG-ATT-GCA-G-3’). Conditions for PCR were an initial denaturation at 95°C for 3 min followed by 3 cycles of 1 min at 95°C, 1 min at 42°C, 2 min at 72°C; 3 cycles of 1 min at 95°C, 1 min at 47°C, 2 min at 72°C; 30 cycles of 1 min at 95°C, 1 min at 55°C, 2 min at 72°C; and a final extension for 10 min at 72°C. In a third PCR, 10 µg of an equal mixture of the two PCR products was used as a template in combination with primer set gAK-for and gAK-rev. Conditions for PCR were an initial denaturation at 95°C for 5 min, 4 cycles of 50 sec at 95°C, 1 min at 50°C, 6 min at 72°C and a final extension for 10 min at 72°C. The PCR product was gel purified, digested with ApaI and EcoRI, and ligated into the Giardia AU1-epitope tagging vector. Deletion of the insert was confirmed by sequencing. Stable transformation of gAK lacking the 28-aa insert (“gAK-ins-AU1”) was as described in (37).
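Overlapping PCR relies on the two internal primers sharing a complementary stretch, so that the two first-round products can anneal to each other in the third reaction. As a rough check, the sketch below finds the longest stretch shared by delAKins-for and the reverse complement of delAKins-rev. The primer sequences are copied from the text; the helper functions are illustrative and not part of the published method.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous stretch shared by a and b (dynamic programming)."""
    best = ""
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]:i]
        prev = cur
    return best

# Primer sequences as printed in the text, with the hyphens removed.
DEL_AK_INS_REV = "ATC-TTG-TAT-GTT-TTC-CCC-TTG-CAA-ATC-AAG".replace("-", "")
DEL_AK_INS_FOR = "TGA-CCG-GGG-AAA-ACA-TAC-AAG-ATT-GCA-G".replace("-", "")

overlap = longest_common_substring(
    DEL_AK_INS_FOR, reverse_complement(DEL_AK_INS_REV)
)
print(overlap, len(overlap))
```

The shared stretch is where the two intermediate products pair; the non-shared 5’ ends of each primer correspond to sequence on either side of the deleted insert.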

Quantitative PCR was used to compare abundance of gAK-AU1 vs. gAK-ins-AU1 mRNA transcripts in stably transfected Giardia. cDNA from each of the 2 cell lines was used as the template for PCR in combination with the primer set gAKAU1for 5’-GAT-ATC-CTC-CGG-CAT-CCA-TTC-3’ and gAKAU1rev 5’-GAG-GAG-TCT-AGA-TGT-AGC-GGT-ACG-T-3’ and Power SYBR Green PCR Master Mix (Applied Biosystems, Foster City, CA) in a 7300 Real Time PCR System (Applied Biosystems). Alpha-tubulin (GenBank XM76294) was used as a control. The CT (cycle threshold) value was set to 0.1 relative fluorescence units to detect transcript levels and data were analyzed using the 2^(-ΔΔCT) method (38).
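For readers unfamiliar with the ΔΔCT calculation, the sketch below shows the usual form of the computation: each target CT is first normalized to the reference gene (here alpha-tubulin), and the difference between the two cell lines is then converted to a fold change. The CT values in the example are invented for illustration and are not data from this study.

```python
def relative_expression(ct_target_test: float, ct_ref_test: float,
                        ct_target_ctrl: float, ct_ref_ctrl: float) -> float:
    """Fold change of the test sample relative to the control sample."""
    delta_test = ct_target_test - ct_ref_test   # normalize to reference gene
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    delta_delta = delta_test - delta_ctrl
    return 2.0 ** (-delta_delta)

# Hypothetical CT values: the target crosses threshold one cycle later
# in the test line, relative to the reference gene.
fold = relative_expression(24.0, 18.0, 23.0, 18.0)
print(fold)
```

A one-cycle delay corresponds to half the transcript abundance, under the method's assumption that amplification efficiency is close to 100% for both amplicons.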

Western analyses were done to determine whether deleting the gAK insert affected the levels of AU1-tagged protein from gAK-ins-AU1 compared with normal gAK-AU1. Protein extracts were prepared and Westerns performed essentially as described (39), except that proteins were transferred to Hybond PVDF (Amersham, Piscataway, NJ), the anti-AU1 dilution was 1/750, and filters were developed in ECL-plus (Amersham) according to standard methods. The loading control was PDI2 (40). Western signals were compared by density using Quantity One quantitation software (version 4; Bio-Rad, Hercules, CA).

Giardia database

GiardiaDB (www.mbl.edu/Giardia; (41)) was developed in support of the genome consortium and to make the data publicly available in advance of publication. The server uses the Generic Model Organism Database (GMOD) relational database schema and software (42), with custom modifications. Data are sorted as “tracks” which include contigs, gene predictions, Pfam protein domains, serial analysis of gene expression (SAGE) profiles, anti-sense transcripts, snRNAs, tRNAs, rRNAs, and other small RNAs, retrotransposons, restriction sites, chromosome assignments, and transcription signals. Various metrics, such as G-C content, shotgun read overlap coverage, quality of assembly consensus sequence, regions of probable poor assembly, regions of probable unpredicted genes, and regions of unique genome-wide nucleotide sequence (for probe or primer design) are also tracked to the predicted genome sequence. In addition to general query and browsing tools, advanced searches of precompiled BLAST, Pfam, and SAGE data can be performed to find protein coding genes with specific phylogenetic relationships, similarity to known proteins, or expression profiles. Similarity searches and multiple sequence alignments can be performed for rapid assessment of homology and annotations. Data are currently tracked to the genome using the General Feature Format (GFF). The Giardia genome project database will be maintained at www.giardiadb.org.

Phylogenetic analyses

Approximately 30% of the predicted giardial proteins have one or more homologues in the GenBank nr database. For purposes of high-throughput phylogenetic tree generation for Giardia, each ORF was individually searched against a custom reference protein database using NCBI’s BLASTALL software, the BLASTP algorithm with the BLOSUM62 substitution matrix, and default parameters (18, 19). We scored the resulting trees for the presence of specific clades as an indicator of the robustness of the inferred phylogeny. The source genomes for the reference protein database and clades scored were: ARCHAEA, Archaeoglobus fulgidus (NCBI), Sulfolobus solfataricus (NCBI); BACTERIA, Aquifex aeolicus (NCBI), Escherichia coli (NCBI), Rickettsia prowazekii (NCBI); ANIMALS, Caenorhabditis elegans (NCBI), Mus musculus (NCBI); PLANTS, Arabidopsis thaliana (NCBI), Oryza sativa (NCBI); APICOMPLEXANS, Cryptosporidium parvum (NCBI), Plasmodium falciparum (NCBI), Toxoplasma gondii (TIGR, NCBI); KINETOPLASTIDS, Leishmania major (NCBI), Trypanosoma brucei (NCBI), Trypanosoma brucei; FUNGI, Saccharomyces cerevisiae (NCBI), Cryptococcus neoformans (NCBI); STRAMENOPILES, Phytophthora ramorum (JGI), Phytophthora sojae (JGI), Thalassiosira pseudonana (JGI); FUNGI + MICROSPORIDIA, Saccharomyces cerevisiae (NCBI), Cryptococcus neoformans (NCBI), Encephalitozoon cuniculi (NCBI); PLANTS + ALGAE, Arabidopsis thaliana (NCBI), Oryza sativa (NCBI), Chlamydomonas reinhardtii (JGI); FUNGI + ANIMALS, Saccharomyces cerevisiae (NCBI), Cryptococcus neoformans (NCBI), Caenorhabditis elegans (NCBI), Mus musculus (NCBI); UNRELATED PROTISTS, Dictyostelium discoideum (NCBI), Entamoeba histolytica (NCBI), Trichomonas vaginalis (TIGR, NCBI). A single putative ortholog was retained from each reference genome if it had a BLAST expectation value of 1e-10 or better. Multiple protein sequence alignments were generated using the computer program MUSCLE (43, 44) under default parameters if homologues to the giardial protein were available for at least five other taxa. The resulting multiple sequence alignments were inspected using custom scripts specifically developed for our high-throughput phylogenetics pipeline (45). These scripts use MUSCLE alignment scores to identify regions of uncertain homology that should be excluded from phylogenetic analyses. MUSCLE scores each position in the alignment using a BLOSUM62 score over pairs of amino acids in the alignment, and our software uses a sliding window to detect stretches of 7+ amino acids with a BLOSUM62 score of less than 15. This cut-off was determined by empirical examination of BLOSUM62 scores in a large number of multiple sequence alignments. These regions were excluded from phylogenetic analysis.
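The masking step can be sketched as a scan over per-column alignment scores. The version below is one reading of the description and not the authors' script (45): it assumes a list of per-column scores is already available and flags any run of at least seven consecutive columns that each score below 15. The function names and the example scores are hypothetical.

```python
def low_score_runs(scores, threshold=15, min_run=7):
    """Return (start, end) half-open column ranges of long low-scoring runs."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i                      # a low-scoring run begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))        # run was long enough to mask
            start = None
    if start is not None and len(scores) - start >= min_run:
        runs.append((start, len(scores)))      # run reaching the last column
    return runs

def columns_to_keep(scores, threshold=15, min_run=7):
    """Indices of alignment columns that survive the masking step."""
    excluded = set()
    for begin, end in low_score_runs(scores, threshold, min_run):
        excluded.update(range(begin, end))
    return [i for i in range(len(scores)) if i not in excluded]

# Invented scores: a 7-column low run is masked, a 6-column low run is kept.
scores = [20] * 3 + [5] * 7 + [30] * 2 + [5] * 6 + [40]
print(low_score_runs(scores))
print(len(columns_to_keep(scores)))
```

If the original scripts instead summed or averaged scores within each window, the thresholding line would change, but the bookkeeping of excluded column ranges would stay the same.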

Any Giardia ORF generating a multiple sequence alignment greater than 75 amino acids in length after the above exclusion of poorly aligned regions was submitted for phylogenetic analysis using a Bayesian statistical procedure, as implemented by the computer program MrBayes (46, 47). MrBayes performs a Metropolis-coupled Markov chain Monte Carlo (MC3) estimation of posterior probabilities (48-50). We performed MC3 estimation of posterior probabilities using noninformative prior probabilities, the JTT+I+Γ (51) substitution model with inclusion of unequal amino acid frequencies, and four incrementally heated Markov chains with different random starting trees. The Metropolis-coupled Markov chains were run to 500,000 generations with sampling every 100 generations. Posterior probabilities of topologies, clades, and parameters were estimated from the sampled topologies after removal of the first 5,000 generations to allow for MC3 burn-in. All of the resulting multiple sequence alignments and phylogenetic trees were deposited into GiardiaDB and used for further curation of the annotation and to select datasets for further phylogenetic analysis.
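The sampling scheme in this paragraph determines how many sampled states contribute to the posterior estimates. A minimal sketch of the arithmetic follows, assuming samples are counted at generations 100, 200, and so on; whether the starting state is also recorded depends on the program and is not stated here.

```python
def retained_samples(generations: int, sample_every: int,
                     burnin_generations: int) -> int:
    """Count sampled states left after discarding the burn-in generations."""
    total = generations // sample_every
    discarded = burnin_generations // sample_every
    return total - discarded

# Settings quoted in the text for the per-ORF analyses.
kept = retained_samples(500_000, 100, 5_000)
print(kept)
```

With these settings the burn-in removes the first 1% of the run, leaving the great majority of sampled states for estimating clade posterior probabilities.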

Phylogenetic analysis of a concatenated alignment of 61 ribosomal proteins was performed using both Bayesian and maximum likelihood approaches. Our dataset included diverse eukaryotic lineages, with eubacterial and archaeal sequences as outgroups. Independent alignments were created for each of 61 subunits. We concatenated these alignments and excluded gaps and highly divergent or ambiguous regions, for a final alignment of 9501 amino acid positions. The Metropolis-coupled Markov chains were run to 1,000,000 generations and the first 1,000 generations were removed for burn-in. Bayesian analyses were performed in quadruplicate to confirm stationarity. Bootstrap analysis under maximum likelihood was performed for the same dataset using the RAxML software (52). Bootstrapping included 100 replicates, the JTT+I+Γ model of substitution with four discrete rate categories, and 10 distinct randomized maximum parsimony starting trees per bootstrap.


Table S1. Genomic libraries used in sequencing and assembly.

| Library ID | Source | Vector | Cloning Site | Selected Insert Size | Insert Ends | Mean Insert Size |
|---|---|---|---|---|---|---|
| B | F. Gillin / Stratagene | LambdaZapII | EcoRI | 3-5 Kbp | EcoRI | 4,126 bp |
| C | F. Gillin / Stratagene | LambdaZapII | EcoRI | 6-10 Kbp | EcoRI | no data |
| D | H. Morrison | pBluescriptIIKS- | EcoRI | 2-6 Kbp | Tsp509I | 3,558 bp |
| F | C. Reich | pUC18 | SmaI | 2.5-3 Kbp | sheared | 2,365 bp |
| G | C. Reich | pUC18 | SmaI | 3-5 Kbp | sheared | 2,896 bp |
| I | C. Reich | pUC18 | SmaI | 3-5 Kbp | sheared | 2,171 bp |
| J | C. Reich | pUC18 | SmaI | 2.5-3 Kbp | sheared | 2,389 bp |
| L | C. Reich | pUC18 | SmaI | 2-3 Kbp | sheared | 2,288 bp |
| M | Lucigen | pSmart-LC-KAN | blunt-end ligation | 12-15 Kbp | sheared | ~9 Kbp |
| BAC | S. Aley | pBACE3.6 | EcoRI | 200 Kbp | EcoRI | no data |


Table S2. Location and sequence of tRNA genes.

| tRNA | Anticodon | Sequence¹ | Contig | Start | Stop | Orientation |
|---|---|---|---|---|---|---|
| Ala | AGC | GGGCGGTTAGCTCAGTTGGTAGAGCGCTTGCCTAGCATGCAAGAGGCCGGGGGTTCAAATCCCTCATCGTCCA | 9 | 259581 | 259653 | + |
| Ala | AGC | TGGACGATGAGGGATTTGAACCCCCGGCCTCTTGCATGCTAGGCAAGCGCTCTACCAACTGAGCTAACCGCCC | 33 | 2145 | 2073 | - |
| Ala | CGC | TGGACGATGGGGGAGTCGAACCCCCGGCCTCGTGCTTGCGAAGCACGCGCTCTACCAACTGAGCTAACCGCCC | 6 | 176209 | 176137 | - |
| Ala | TGC | GGGCGGTTAGCTCAGTTGGTAGAGCGCGACGTTTGCATCGTCGAGGCCGGGGGTTCAAATCCCCCATCGTCCA | 16 | 192167 | 192239 | + |
| Arg | ACG | CGTACCCGGCAGGACTTGAACCTGCAACCTCCTGATCCGTAGTCAGGCGCGCTATCCAATTACGCCACGGGCAC | 7 | 169664 | 169591 | - |
| Arg | ACG | CGTACCCGGCAGGACTTGAACCTGCAACCTCCTGATCCGTAGTCAGGCGCGCTATCCAATTACGCCACGGGCAC | 7 | 178657 | 178584 | - |
| Arg | CCG | GCTCGCATAGTGCAATGGAAAGCATGCTAGCCTCCGGAGCTAGTGATCTGGGTTCGAGTCCCGGTGTGGGCT | 46 | 41665 | 41736 | + |
| Arg | CCT | CACCCCCGGCGAGATTCGAACTCGCAACCTCTTGATTAGGAGTCAAGCGCGCTATCCATTGCGCCACGGGGGC | 15 | 218048 | 217976 | - |
| Arg | TCG | CACACCCGACGGGACTTGAACCCGCAACCCCCAGATCCGAAGTCTGGTGCGCTATCCATTGCGCCACGGGTGC | 44 | 71143 | 71071 | - |
| Arg | TCT | GCCCGCGTAGCTCAGTGGATAGAGTGGTAGCCTTCTAAGCTATAGGTCGCGGGTTCGAGTCCCGTCGTGGGCC | 5 | 63569 | 63641 | + |
| Asn | GTT | CGCTCCCTGGCGGGCTTGAACCGCCGACCTTGCGGTTAACAGCCGCACGCTCTAACCAGCTGAGCTAAGGAAGC | 21 | 2790 | 2717 | - |
| Asn | GTT | GCTTCCTTAGCTCAGCTGGTTAGAGCGTGCGGCTGTTAACCGCAAGGTCGGCGGTTCAAGCCCGCCAGGGAGCG | 21 | 60976 | 61049 | + |
| Asp | GTC | GCTGGGGTGGCGTAACGGTCTAGCGCGCTCGGTTGTCGTCCGATCGGTCCGGGTTCGATTCCCGGCCCCGGCA | 5 | 158035 | 158107 | + |
| Asp | GTC | CGCGGTCACCGGGAATCGAACCCGGGTCACGTGAGTGACAGTCACGCATACTAGCCACTGTACTATGACCGC | 16 | 331 | 260 | - |
| Asp | GTC | GCGGTCATAGTACAGTGGCTAGTATGCGTGACTGTCACTCACGTGACCCGGGTTCGATTCCCGGTGACCGCG | 25 | 145999 | 146070 | + |
| Cys | GCA | AGGGCCTGACCGGATTTGAACCGATGACCACTCGGACTGCAGCCGAGCGCTCTACCCCTGAGCTACAGACCC | 26 | 41408 | 41337 | - |
| Gln | CTG | CGGTTTCACCCGGATTCGAACCGGGGTTATGGGATTCAGAGTCCCATGTGCTAACCAACTACACTATGAAACC | 9 | 174401 | 174329 | - |
| Gln | TTG | TTGGAGTGCCGGGAGTCGAACCCGGGTCGTAACCgcatacgtagttgatagccatgccaTCAAAGGCCACTGTGCTCCCGCTGCACCACACCCCA | 7 | 109054 | 108960 | - |
| Gln | TTG | TGGGGTGTGGTGCAGCGGGAGCACAGTGGCCTTTGAtggcatggctatcaactacgtatgcGGTTACGACCCGGGTTCGACTCCCGGCACTCCAA | 7 | 112357 | 112451 | + |
| Gln | TTG | GCTCTTGTAGTGTAATGGTTAGCACCTCGGACTTTGAATCCGGTAGTCTGGGTTCGATCCCCAGCGAGAGCG | 12 | 181540 | 181611 | + |
| Glu | CTC | TCCGGCGTAGTATAACGGCTAGAATGGTTGGTTCTCACCCAACAGATCCGGGTTCGATTCCCGGCGCCGGAG | 3 | 118956 | 119027 | + |
| Glu | CTC | CTCCGGCGCCGGGAATCGAACCCGGATCTGTTGGGTGAGAACCAACCATTCTAGCCGTTATACTACGCCGGA | 3 | 129895 | 129824 | - |
| Glu | TTC | CTCCGATGCCGGGAATCGAACCCGGATCTACTGGGTGAAAACCAGTCATGCTAGCCGTTATACCACACCGGA | 20 | 15706 | 15635 | - |
| Gly | CCC | GCGCATGTGGTGAAAGGGTATCATGAGAGTTTCCCAAGCTTTCGTTCCGGGTTCGAGCCCCGGCATGCGCA | 36 | 42135 | 42205 | + |
| Gly | GCC | GCATCAATGGTTTAGGGGTAGAATGCTTGCTTGCCAAGCAAGAGAGCCGGGTCCGAGTCCCGGTTGATGCA | 62 | 41592 | 41662 | + |
| Gly | GCC | TGCATCAACCGGGACTCGGACCCGGCTCTCTTGCTTGGCAAGCAAGCATTCTACCCCTAAACCATTGATGC | 76 | 2002 | 1932 | - |
| Gly | TCC | GCACCATTGGTGTATCGGCTAGCATGACAGCCTTCCAAGCTGTTGGGGCGGGTCCGACTCCCGCATGGTGCA | 17 | 188231 | 188302 | + |
| His | GTG | TGCCGGGACCGGGAATCGAACCCGGATTGTTCGGACCACAACCGAACGTACTAGCCTTTATACGATCCCAGC | 41 | 57223 | 57152 | - |
| Ile | GAT | GGTCGGTTAGCTCAGTCGGTAGAGCGTCAGTCTGATAAGCTGAAGGTCGGGGGTTCGAGCCCCCCACCGACCA | 32 | 69925 | 69997 | + |
| Ile | GAT | TGGTCGGTGGGGGGCTCGAACCCCCGACCTTCAGCTTATCAGACTGACGCTCTACCGACTGAGCTAACCGACC | 32 | 106478 | 106406 | - |
| Ile | GAT | GGCGCTATGGCCGAGTGGTTAAGGCGATGACCTGATAAATCATTGTGCGCAGCACGCGTGGGTTCGAATCCCGCTGGCGCCG | 59 | 45812 | 45893 | + |
| Ile | TAT | GCTCGTGTGGCGCAGCTGGTTAGCGCGTGTGACTTATGATCACGAGGTCGAGGGTTCGAGCCCCTCCTCGAGCA | 12 | 107920 | 107993 | + |
| Leu | AAG | TGCAACCTGTGGGGCTCGAACCCACGCGGGATGACTCCCATTACGGCCTTAACGTAACGCCTTAACCACTCGGCCAAAGTTGC | 4 | 205431 | 205349 | - |
| Leu | AAG | GCAACTTTGGCCGAGTGGTTAAGGCGTTACGTTAAGGCCGTAATGGGAGTCATCCCGCGTGGGTTCGAGCCCCACAGGTTGCA | 4 | 210375 | 210457 | + |
| Leu | CAA | TGCCAGCTGTGGGGTTCGAACCCACGCGGTCTTGCAACCAATGGGACTTGAATCCATCGCCTTAACCACTCGGCCAAACTGGC | 1 | 518633 | 518551 | - |
| Leu | CAG | TGCAACCTGTGGGGCTCGAACCCACGCGGGATGACTCCCACTGCGACCTGAACGCAGCGCCTTAACCACTCGGCCAAAGTTGC | 20 | 92223 | 92141 | - |
| Leu | TAA | TGCCAGCTGTGGGGTTCGAACCCACGCGATCTTGCGATCAGAGGATCTTAAGTCCCCCGCCTTAACCACTCGGCCAAACTGGC | 36 | 60013 | 59931 | - |
| Leu | TAG | TGCACGCTGTGGGGTTCGAACCCACGCGGGGTCGCCCCCATGAGAGCCTAAATCTCACGCCTTAACCACTCGGCCAAACGTGC | 38 | 71524 | 71442 | - |
| Lys | CTT | TACCGACGACGGGGCTCGAACCCGCGACCACAGGATTAAGAGTCCTGCGCTCTACCGACTGAGCTACATCGGC | 51 | 51658 | 51586 | - |
| Lys | CTT | GCCGATGTAGCTCAGTCGGTAGAGCGCAGGACTCTTAATCCTGTGGTCGCGGGTTCGAGCCCCGTCGTCGGTA | 67 | 1652 | 1724 | + |
| Lys | TTT | CACCGCCTACGGGGCTCGAACCCGCGACCGTTGGATTAAAAGTCCAACGCTCTACCAACTGAGCTAAAGCGGC | 13 | 109239 | 109167 | - |
| Met | CAT | GGCGTTGTAGCGCAGTGGTCCAGCGCATCGGGCTCATAACTCGAGGGTCGGAGGTTCGATCCCTCCCGACGCCA | 1 | 364127 | 364200 | + |
| Met | CAT | GTCTGTGTAGCTCAGTCGGCAGAGCGATAGTCTCATAAGCTATAGGTCGTGAGTTCAAGCCTCACCACAGGCA | 20 | 91313 | 91385 | + |
| Met | CAT | TGCCTGTGGTGAGGCTTGAACTCACGACCTATAGCTTATGAGACTATCGCTCTGCCGACTGAGCTACACAGAC | 20 | 179227 | 179155 | - |
| Phe | GAA | TGCCGAGTTTGGGACTCGAACCCAAGACCGATAGATCTTCAGTCTACCGCTCTACCATCTGAGCTAACTCGGC | 12 | 92626 | 92554 | - |
| Pro | AGG | GGGGACCACCGGGATTTGAACCCGGGACCTCTCGCACCCTAAGCGAGAATCATACCCCTAGACCATGGTCCC | 23 | 68145 | 68074 | - |
| Pro | CGG | GGGGAGCACCGGGAGTCGAACCCGGAACCTCTCCGACCCGAACGGAGAATCATACCGCTAGACCATGCTCCC | 25 | 47848 | 47777 | - |
| Pro | TGG | GGGGACCACCGGGGATCGAACCCGGGACCTCTCGCACCCAAAGCGAGAATCATACCACTAGACCATGGTCCC | 1 | 379770 | 379699 | - |
| Pseudo | TTG | TGGGGCGTGGTGCAGCGGGAGCATACTGACACTTGAcatatttggccggtttccggttgttgAGTCACGACCCGGGTTCGACTCCCGGCGCCTCAG | 41 | 1148 | 1243 | + |
| Pseudo | TTG | CTGAGGCGCCGGGAGTCGAACCCGGGTCGTGACTcaacaaccggaaaccggccaaatatgTCAAGTGTCAGTATGCTCCCGCTGCACCACGCCCCA | 41 | 6441 | 6346 | - |
| Ser | AGA | GACAGTTTGGCCGAGTGGTTAAGGCGATTGACTAGAAATCAATTGTGCTCCGCACGCATGGGTTCGAATCCCATAGCTGTCG | 40 | 43213 | 43294 | + |
| Ser | CGA | GGCGCTATGGCCGAGTGGTTAAGGCGGTTGACTCGAAATCAACTGTGCTCCGCACGCGTGGGTTCGAATCCCACTGGCGCCG | 12 | 107783 | 107864 | + |
| Ser | GCT | GACAGTTTGGCCGAGTGGTTAAGGCGCTTGCCTGCTAAGCAAGTGTGCTCCGCACGCGTGGGTTCGAATCCCACAGCTGTCG | 54 | 30532 | 30613 | + |
| Ser | TGA | CGGCGCCAGTGGGGTTCGAACCCACGCGTGCGACGCACAACAGATTTCAAGTCTGTCGCCTTAACCACTCGGCCATAGCGCC | 4 | 296861 | 296780 | - |
| Thr | AGT | AGCCGAATACGGTGCTCGAAACCGTGACCTCGACATTACTAGTGTCGCGCTCTAGCCAACTGAGCTAATTCGGC | 1 | 666335 | 666262 | - |
| Thr | CGT | GCCGGGATAGCTCAGTGGTAGAGCGTGGCACTCGTAATGCTAAGGTCGTGGGTTCAACTCCCGCTCTCGGCT | 1 | 621110 | 621181 | + |
| Thr | TGT | AGCCGAAAGCGGGAATTGAACCCGCGACCACTGCATTACAAGTGCAGCGCTCTACCGCTGAGCTATTCCGGC | 36 | 20684 | 20613 | - |
| Trp | CCA | TGAGGGCTACCGGGATTGAACCGATGACCGCTCGATCTGGAGTCGAGTGCTCTACCACTGAGCTAAGCCCCC | 61 | 26140 | 26069 | - |
| Tyr | GTA | CCTCGCTTAGCTCAGTTGGTAGAGCGTTCGGCTGTAGagtccgcagtcACCGAACGGTCGCTGGTTCGAATCCGGCAGCGAGGA | 11 | 96078 | 96161 | + |
| Val | CAC | TGCTCGCGTAGGGATTCGAACCCTAGACCCTTCGTACGTGAAACGAATGTGATAACCAACTACACCACGCGAGC | 16 | 131706 | 131633 | - |
| Val | GAC | GGTCCGATAGTGTAGCTGGTTAGCACGTTCGCTTGACGTGCGAGAGGTCCGGAGTTCGAGTCTCCGTCGGATCA | 62 | 19568 | 19641 | + |
| Val | GAC | TGATCCGACGGAGACTCGAACTCCGGACCTCTCGCACGTCAAGCGAACGTGCTAACCAGCTACACTATCGGACC | 76 | 10765 | 10692 | - |
| Val | TAC | GCTTTCGTGGCGCAATGGTTAGCGCGTCGCATTTACGTTGCGAAGGCTGTGGGTTCGATCCCCACCGGAAGCA | 31 | 36293 | 36365 | + |

¹ Introns are indicated in lower case.
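Because introns are marked by case, they can be separated from the exonic sequence mechanically. The sketch below does this for the Tyr (GTA) entry on contig 11 and compares the sequence length with the span implied by its Start and Stop coordinates. The function name is illustrative, and the approach assumes at most one contiguous lower-case block per entry.

```python
def split_intron(seq: str):
    """Split a Table S2 sequence into (exonic, intronic) parts by letter case."""
    exons = "".join(base for base in seq if base.isupper())
    intron = "".join(base for base in seq if base.islower())
    return exons, intron

# Tyr (GTA) entry from Table S2: contig 11, start 96078, stop 96161, + strand.
TYR_GTA = (
    "CCTCGCTTAGCTCAGTTGGTAGAGCGTTCGGCTGTAG"
    "agtccgcagtc"
    "ACCGAACGGTCGCTGGTTCGAATCCGGCAGCGAGGA"
)

exons, intron = split_intron(TYR_GTA)
span = 96161 - 96078 + 1   # inclusive coordinates

print(intron, len(intron))
print(len(TYR_GTA), span)
```

For minus-strand entries the Start value is larger than the Stop value, so the span would be computed as Start minus Stop plus one.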


Table S3. Lateral gene transfer candidates

ORF A1 B P E Eukaryotic taxa2 Top match definition E-value Comment3

17587 1 18 0 0 Giardia Thermoanaerobacter tengcongensis CTP synthase (UTP-ammonia lyase)

1.00E-139

33769 0 18 0 0 Giardia Exiguobacterium sp. uncharacterized NAD(FAD)-dependent dehydrogenase

1.00E-111

NADH oxidase; known LGT

14759 0 19 2 0 Giardia Enterococcus faecalis 6-phosphogluconate dehydrogenase

5.00E-89

15009 0 19 19 0 Giardia Yersinia pestis biovar Medievalis str. flavohemoprotein

2.00E-82 Flavohemoglobin; known LGT(53)

103891 0 19 19 0 Giardia Xylella fastidiosus chaperonin GroEL (HSP60) 2.00E-82 cpn60; mitosomal

17143 0 19 17 0 Giardia Silicibacter sp. pyruvate kinase 2.00E-81

8407 2 17 12 0 Giardia Haemophilus ducreyi aminoacyl-histidine dipeptidase

5.00E-77

14547 0 18 18 0 Giardia Vibrio cholerae L-asparaginase I 8.00E-76

15832 0 19 15 0 Giardia Haemophilus influenzae di- and tripeptidases 2.00E-74

15090 0 16 11 0 Giardia Bacteroides thetaiotaomicron L-serine dehydratase 2.00E-74

8682 0 19 6 0 Giardia Crocosphaera watsonii glucose-6-phosphate 1-dehydrogenase

2.00E-72

21750 0 19 15 0 Giardia Clostridium perfringens ribose-phosphate pyrophosphokinase

5.00E-70

6563 0 19 0 0 Giardia Lactococcus lactis uracil phosphoribosyltransferase 1.00E-66

24662 0 16 10 0 Giardia Bacteroides thetaiotaomicron L-serine dehydratase 2.00E-66

3313 4 15 6 0 Giardia Lactococcus lactis YeiG 8.00E-66

10311 16 3 0 0 Giardia Methanocaldococcus jannaschii ornithine carbamoyltransferase (argF)

9.00E-54

3206 0 17 13 0 Giardia Silicibacter sp. pyruvate kinase 1.00E-53

16667 2 15 13 0 Giardia Desulfovibrio vulgaris hypothetical protein DVU0585

1.00E-51

86511 4 13 11 0 Giardia Desulfovibrio vulgaris CoA-binding domain protein 8.00E-51

8074 1 17 2 0 Giardia Geobacter metallireducens Inorganic pyrophosphatase/exopolyphosphatase 5.00E-48

7368 0 19 18 0 Giardia Geobacter metallireducens predicted ATPase of the PP-loop superfamily 7.00E-48

15256 0 19 8 0 Giardia Streptococcus pyogenes undecaprenyl pyrophosphate synthetase 2.00E-47

17389 4 15 4 0 Giardia Bdellovibrio bacteriovorus putative translation factor related to Sua5 3.00E-46

2452 5 14 3 0 Giardia Bacillus cereus ornithine cyclodeaminase 4.00E-46

4507 3 16 5 0 Giardia Thermoanaerobacter tengcongensis CTP synthase 1.00E-45

7982 6 11 9 0 Giardia Pseudomonas syringae nucleoside-diphosphate-sugar epimerase 7.00E-43

7573 0 18 0 0 Giardia Streptococcus pyogenes putative 3-hydroxy-3-methylglutaryl-coenzyme A 8.00E-42

4946 0 19 8 0 Giardia Mycoplasma pneumoniae peptide methionine sulfoxide reductase 1.00E-37

91348 0 17 0 0 Giardia Clostridium tetani purine nucleoside phosphorylase 2.00E-37

23602 1 17 0 0 Giardia Desulfitobacterium hafniense uncharacterized conserved protein 6.00E-37

27614 0 19 3 0 Giardia Clostridium tetani ribose 5-phosphate isomerase RpiB 4.00E-36

17315 0 17 14 0 Giardia Synechococcus elongatus ATP binding protein 1.00E-35

8163 1 16 2 0 Giardia Geobacter metallireducens inorganic pyrophosphatase/exopolyphosphatase 1.00E-35

9239 0 19 19 0 Giardia Idiomarina loihiensis L2TR NIF3 3.00E-35

6757 0 19 19 0 Giardia Mesorhizobium loti pyrazinamidase/nicotinamidase 1.00E-34

16722 10 9 1 0 Giardia Rubrobacter xylanophilus Xaa-Pro aminopeptidase 3.00E-34

4059 0 18 13 0 Giardia Ralstonia solanacearum probable bifunctional protein (MTA/SAH nucleosidase) 3.00E-34

16722 10 9 1 0 Giardia Rubrobacter xylanophilus Xaa-Pro aminopeptidase 3.00E-34

5180 0 18 13 0 Giardia Geobacter sulfurreducens PilB-related protein 7.00E-34

6427 5 12 10 0 Giardia Methanococcus voltae transporter 2.00E-32

8826 0 18 5 0 Giardia Crocosphaera watsonii glucokinase 2.00E-31 Reported LGT (54, 55)

17451 0 18 0 0 Giardia Bacillus subtilis deoxyadenosine/deoxycytidine kinase 2.00E-31

480 7 12 4 0 Giardia Pyrococcus furiosus hypothetical protein PF0668 3.00E-31

11436 0 19 2 0 Giardia Clostridium tetani dinucleotide-utilizing enzyme 1.00E-30 E. histolytica LGT

14581 0 18 18 1 Caenorhabditis (mitochondrion) Rickettsia prowazekii DnaK 1.00E-124 DnaK, mitosomal

16125 0 13 0 1 Entamoeba E. histolytica NAD(FAD)-dependent dehydrogenase, putative 1.00E-130 *

9368 12 6 4 1 Entamoeba Thermus thermophilus pyruvate formate-lyase activating enzyme 1.00E-80 *

9266 0 16 16 1 Entamoeba E. histolytica recQ family DNA helicase 7.00E-74 *

31530 18 0 0 1 Entamoeba E. histolytica NAD synthetase 3.00E-73 *

16519 8 10 5 1 Entamoeba E. histolytica radical SAM domain protein 2.00E-61 *

16069 9 8 0 1 Entamoeba Porphyromonas gingivalis phosphomannomutase 1.00E-53 *

13350 1 17 1 1 Entamoeba Desulfitobacterium hafniense alcohol dehydrogenase, class IV 3.00E-39 *

113021 0 17 1 1 Entamoeba E. histolytica acetyl-coA carboxylase, putative 1.00E-106 *

15041 0 18 18 1 Schizosaccharomyces Burkholderia fungorum predicted enzyme with a TIM-barrel fold 4.00E-43

7195 6 12 2 1 Spironucleus Rhodopseudomonas palustris possible pyridine nucleotide-linked oxidoreductase, 1.00E-117 E. histolytica LGT

3861 3 14 5 1 Spironucleus Thermococcus hydrothermalis alcohol dehydrogenase 8.00E-85

3593 4 13 5 1 Spironucleus Thermococcus hydrothermalis alcohol dehydrogenase 1.00E-84

6184 0 18 4 1 Spironucleus Bacteroides thetaiotaomicron branched-chain amino acid aminotransferase 2.00E-80

10829 0 14 0 1 Spironucleus Rubrobacter xylanophilus 6-phosphogluconolactonase/Glucosamine-6-phosphate isomerase/deaminase 3.00E-55

13608 6 9 3 1 Spironucleus, Entamoeba Chloroflexus aurantiacus acyl-CoA synthetase (NDP forming) 1.00E-42

15983 17 0 0 1 Trichomonas T. vaginalis prolyl-tRNA synthetase 1.00E-122 T. vaginalis LGT

9779 4 14 5 1 Ustilago Ustilago maydis hypothetical protein UM03504.1 2.00E-35

15297 0 16 16 2 Danio, Caenorhabditis Salmonella enterica ribokinase 6.00E-33

3042 0 16 16 2 Entamoeba E. histolytica hydroxylamine reductase, (hybrid cluster protein) 1.00E-95 Reported LGT (53)

16549 0 16 0 2 Entamoeba Clostridium thermocellum uridine kinase 5.00E-65 *

8217 0 15 0 2 Entamoeba Clostridium thermocellum uridine kinase 6.00E-63 *

12942 0 17 17 2 Entamoeba, Anopheles Vibrio vulnificus uncharacterized conserved protein 3.00E-35

16453 7 5 1 2 Hexamita, Trichomonas Pyrococcus furiosus carbamate kinase-like carbamoylphosphate synthetase 1.00E-76 *

22204 0 16 7 2 Leishmania, Neurospora Rhodospirillum rubrum methionyl-tRNA synthetase 1.00E-101

2969 0 17 6 2 Neurospora Anabaena variabilis fructosamine-3-kinase 5.00E-43

93358 0 14 6 2 Spironucleus, Entamoeba Thermosynechococcus elongatus CoA-linked acetaldehyde dehydrogenase and iron-dependent alcohol dehydrogenase / pyruvate-formate-lyase deactivase 0.00E+00 ADHE; known LGT

10358 8 9 5 2 Spironucleus, Trichomonas Methanococcus maripaludis Flavodoxin:beta-lactamase-like 5.00E-69 E. histolytica LGT

15127 0 17 0 2 Trypanosoma, Leishmania Clostridium perfringens deoxyribose-phosphate aldolase 2.00E-45 known LGT (53)

9834 0 15 3 3 Anopheles, Caenorhabditis, Tetraodon Fusobacterium nucleatum exodeoxyribonuclease III 9.00E-44

9508 0 15 1 3 Arabidopsis Clostridium acetobutylicum Zn-dependent peptidase, insulinase family 3.00E-51

9145 2 12 8 3 Cryptosporidium Bacteroides thetaiotaomicron ATP-dependent DNA helicase recQ 1.00E-34

7865 0 15 15 3 Entamoeba, Dictyostelium E. histolytica L-asparaginase, putative 4.00E-64 *

22138 0 16 3 3 Fungi Geobacter metallireducens ATPase related to the helicase subunit of the Holliday junction resolvase 3.00E-57

114609 0 17 2 3 Mastigamoeba, Entamoeba Desulfovibrio desulfuricans pyruvate:ferredoxin oxidoreductase 0.00E+00

112885 0 14 4 3 Metazoa Bacteroides thetaiotaomicron adenine phosphoribosyltransferase 2.00E-32

8245 0 12 0 3 Spironucleus, Caenorhabditis, Fungi Thermoanaerobacter tengcongensis 6-phosphogluconolactonase / glucosamine-6-phosphate isomerase 2.00E-57

17063 0 16 4 4 Entamoeba, Cryptosporidium Moorella thermoacetica pyruvate:ferredoxin 0.00E+00

15196 0 14 14 4 Metazoa, Fungi Salmonella enterica NifU-like protein 3.00E-43 IscU, mitosomal

96460 10 0 0 5 Entamoeba E. histolytica alanyl-tRNA synthetase 0.00E+00 E. histolytica LGT

14195 4 9 0 5 Metazoa, Viridiplantae Methanosarcina barkeri DNA mismatch repair enzyme (predicted ATPase) 2.00E-33

113876 0 8 5 6 Dictyostelium, Metazoa Treponema denticola ABC transporter, ATP-binding protein 9.00E-31

11043 0 12 10 6 Spironucleus, Entamoeba, Trichomonas Clostridium thermocellum fructose/tagatose bisphosphate aldolase 3.00E-86 FBA, origin unresolved (56)

9909 1 12 5 6 Viridiplantae chloroplast Methanococcoides burtonii phosphoenolpyruvate synthase/pyruvate phosphate 0.00E+00

15574 0 11 11 7 Entamoeba, Aspergillus Xylella fastidiosa alanyl dipeptidyl peptidase 4.00E-96

24712 0 10 5 8 Apicomplexa Bacteroides thetaiotaomicron putative phosphatidylinositol-4-phosphate 5-kinase 1.00E-45

14993 0 10 0 8 Hexamita, Viridiplantae Spirochaeta thermophila pyrophosphate-dependent phosphofructokinase 1.00E-40 E. histolytica LGT

6148 0 9 9 9 Entamoeba, Aspergillus E. histolytica dipeptidyl-peptidase, putative 1.00E-81 *

93551 0 9 4 9 Viridiplantae, Fungi, Entamoeba Clostridium perfringens probable zinc metalloprotease 5.00E-62

14628 0 9 3 10 Apicomplexa Rhodobacter sphaeroides uncharacterized protein conserved in bacteria 2.00E-76

14519 0 9 9 10 Metazoa, Fungi Rickettsia typhi cysteine desulfurase protein IscS/NifS 2.00E-92 IscS, mitosomal

9827 0 9 4 10 Spironucleus, Fungi Bdellovibrio bacteriovorus hypothetical protein Bd0373 3.00E-63

9115 0 7 1 11 Spironucleus, Trichomonas, Viridiplantae T. vaginalis glucose-6-phosphate isomerase 3.00E-92 Known LGT (54)

8822 0 7 12 Trypanosoma, Leishmania, Chlamydomonas, Viridiplantae Methylococcus capsulatus phosphoglycerate mutase, 2,3-bisphosphoglycerate-independent 1.00E-133

¹ Column A: number of top hits to Archaea; B: number of hits to Bacteria; P: number of hits to Proteobacteria; E: number of hits to Eukaryotes other than Giardia.

² Eukaryotic groups that contained a gene with similarity to the Giardia ORF, within the top 20 hits at e-30 or better.

³ Candidates marked with an asterisk had a top hit to E. histolytica, but other top hits were archaeal or bacterial.
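The column definitions in the footnotes can be illustrated with a minimal sketch of how such tallies might be computed from a ranked hit list. This is not the authors' pipeline: the `Hit` record and the `tally_hits` function are hypothetical names, each hit is assumed to be pre-annotated with a domain, a group and a genus, the top-20 and e-30 cutoffs from footnote 2 are assumed to apply to all four counts, and P is assumed to be a subset of B (the table values are consistent with this, but the footnote does not say so).

```python
from typing import List, NamedTuple, Tuple, Dict

class Hit(NamedTuple):
    """One similarity-search hit (hypothetical record layout)."""
    domain: str    # "Archaea", "Bacteria" or "Eukaryota"
    group: str     # e.g. "Proteobacteria", "Firmicutes"
    genus: str     # e.g. "Entamoeba"
    e_value: float

TOP_N = 20            # footnote 2: only the top 20 hits are considered
E_CUTOFF = 1e-30      # footnote 2: e-30 or better

def tally_hits(hits: List[Hit], query_genus: str = "Giardia") -> Tuple[Dict[str, int], List[str]]:
    """Return the A/B/P/E counts and the list of eukaryotic genera hit.

    `hits` must already be sorted best-first (lowest E-value first).
    """
    counts = {"A": 0, "B": 0, "P": 0, "E": 0}
    euk_taxa: List[str] = []
    for hit in hits[:TOP_N]:
        if hit.e_value > E_CUTOFF:
            continue  # weaker than the cutoff, ignore
        if hit.domain == "Archaea":
            counts["A"] += 1
        elif hit.domain == "Bacteria":
            counts["B"] += 1
            if hit.group == "Proteobacteria":
                counts["P"] += 1  # assumption: P is counted within B
        elif hit.domain == "Eukaryota" and hit.genus != query_genus:
            counts["E"] += 1
            if hit.genus not in euk_taxa:
                euk_taxa.append(hit.genus)
    return counts, euk_taxa
```

Under these assumptions, a row such as `15009 0 19 19 0` would correspond to a hit list in which all 19 qualifying non-Giardia hits are proteobacterial.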

Table S4. Cytoskeletal proteins

Annotation ORF Contig Start Stop Class

Actin 15113 39 22538 23941 Actin & actin associated proteins

Actin related protein 11039 1 391007 393619 Actin & actin associated proteins

Actin related protein 16172 11 249456 250490 Actin & actin associated proteins

Actin related protein 8726 47 46180 49086 Actin & actin associated proteins

Actin related protein 40817 50 42826 43953 Actin & actin associated proteins

Protein 21.1 111967 5 85127 88651 Ankyrin repeat

Protein 21.1 137703 5 162868 164409 Ankyrin repeat

Protein 21.1 114247 10 41826 43514 Ankyrin repeat

Protein 21.1 113622 23 56195 60709 Ankyrin repeat

Protein 21.1 115054 25 116801 119644 Ankyrin repeat

Protein 21.1 115786 73 26014 26754 Ankyrin repeat

Protein 21.1 115787 73 22922 23662 Ankyrin repeat

Protein 21.1 114671 75 2844 4739 Ankyrin repeat

Ankyrin 1 17015 37 60528 61808 Ankyrins

Basal body protein 8146 3 205706 207037 Basal body

Basal body protein 8508 9 216716 218533 Basal body

Centrin 6744 13 135568 136053 Basal body

Centrin 104685 34 59094 59624 Basal body

Dynein heavy chain 16804 1 193841 200734 Dyneins

Dynein heavy chain 37985 4 329859 333077 Dyneins

Dynein heavy chain 111950 5 19805 35158 Dyneins

Dynein heavy chain 10538 7 174379 176688 Dyneins

Dynein heavy chain 94440 8 120554 137287 Dyneins

Dynein heavy chain 40496 14 52385 67018 Dyneins

Dynein heavy chain 17478 15 43213 59451 Dyneins

Dynein heavy chain 101138 15 185806 201297 Dyneins

Dynein heavy chain 17265 23 147472 155499 Dyneins

Dynein heavy chain 17243 24 125582 129910 Dyneins

Dynein heavy chain 93736 28 13545 27866 Dyneins

Dynein heavy chain 8172 34 30850 31995 Dyneins

Dynein heavy chain 100906 48 5771 22423 Dyneins

Dynein heavy chain 103059 53 46134 53354 Dyneins

Dynein heavy chain like 29256 29 140090 140389 Dyneins

Dynein intermediate chain 10254 4 133449 135863 Dyneins

Dynein intermediate chain 33218 25 141436 143709 Dyneins

Dynein intermediate chain 6939 42 70472 72355 Dyneins

Dynein light chain 14270 5 317375 317809 Dyneins

Dynein light chain 15124 12 149574 149807 Dyneins

Dynein light chain 9848 23 145783 146052 Dyneins

Dynein light chain 27308 27 132646 133146 Dyneins

Dynein light chain 15606 31 116157 116444 Dyneins

Dynein light chain 17371 36 81860 82234 Dyneins

Dynein light chain 7578 40 70721 71233 Dyneins

Dynein light chain 4236 52 28801 29124 Dyneins

Dynein light chain 13575 52 29235 29558 Dyneins

Dynein light chain 4463 64 37827 38390 Dyneins

Dynein light intermediate chain 13273 37 15556 16263 Dyneins

Dynein regulatory complex 16540 60 25867 27273 Dyneins

Midasin 39312 53 14403 28910 Dyneins

Axonemal p66.0 114462 51 28041 29699 Flagella & associated proteins

Axoneme central apparatus protein 16202 33 23857 25365 Flagella & associated proteins

Axoneme-associated protein GASP-180 13475 3 265534 271908 Flagella & associated proteins

Axoneme-associated protein GASP-180 137716 3 12259 17016 Flagella & associated proteins

Axoneme-associated protein GASP-180 16745 10 158198 161317 Flagella & associated proteins

Flagella associated protein 41512 42 22769 24712 Flagella & associated proteins

IFT complex A 17251 8 171316 177105 Flagella & associated proteins

IFT complex A 16547 20 111386 116515 Flagella & associated proteins

IFT complex B 14713 1 47602 48765 Flagella & associated proteins

IFT complex B 15428 1 180463 182481 Flagella & associated proteins

IFT complex B 17223 7 33498 36737 Flagella & associated proteins

IFT complex B 17105 20 165749 171631 Flagella & associated proteins

IFT complex B 40995 36 82250 83218 Flagella & associated proteins

Intraflagellar transport particle protein IFT88 16660 32 1525 4089 Flagella & associated proteins

Intraflagellar transport protein component IFT74/72 9750 13 166903 168612 Flagella & associated proteins

Kinesin-associated protein 114885 18 6429 8735 Flagella & associated proteins

Long-flagella protein 14004 59 35442 37079 Flagella & associated proteins

Radial-spoke protein 16450 38 16048 16968 Flagella & associated proteins

Alpha-1 giardin 11654 60 34098 34985 Giardins

Alpha-10 giardin 5649 40 47426 48412 Giardins

Alpha-11 giardin 17153 40 45460 46383 Giardins

Alpha-12 giardin 10073 7 24503 25474 Giardins

Alpha-13 giardin 1076 60 7325 8362 Giardins

Alpha-13 giardin, partial 18058 77 8626 8976 Giardins

Alpha-14 giardin 15097 6 161508 162521 Giardins

Alpha-15 giardin 13996 40 46428 47339 Giardins

Alpha-16 giardin 10036 40 28352 29449 Giardins

Alpha-17 giardin 15101 40 27392 28324 Giardins

Alpha-18 giardin 10038 40 26477 27337 Giardins

Alpha-19 giardin 4026 45 26576 27892 Giardins

Alpha-2 giardin 7796 18 164373 165263 Giardins

Alpha-3 giardin 11683 5 66918 67808 Giardins

Alpha-4 giardin 7799 18 167441 168331 Giardins

Alpha-5 giardin 7797 18 165333 166241 Giardins

Alpha-6 giardin 14551 18 166440 167333 Giardins

Alpha-7.1 giardin 103373 68 31732 32898 Giardins

Alpha-7.2 giardin 114119 64 8217 9383 Giardins

Alpha-7.3 giardin 114787 7 38556 39443 Giardins

Alpha-8 giardin 11649 60 28911 29846 Giardins

Alpha-9 giardin 103437 60 8414 9319 Giardins

Alpha-9 giardin, pseudogene 5047 77 7670 8574 Giardins

Beta-giardin 4812 35 55484 56302 Giardins

Delta giardin 86676 14 190310 191191 Giardins

Gamma giardin 17230 24 35292 36227 Giardins

Kinesin like protein 112729 10 247195 250521 Kinesins

Kinesin like protein 17264 23 137518 139908 Kinesins

Kinesin-1 13825 48 37032 39971 Kinesins

Kinesin-13 16945 1 231024 233168 Kinesins

Kinesin-14 8886 21 156342 158219 Kinesins

Kinesin-14 13797 27 133290 136808 Kinesins

Kinesin-16 7874 26 143757 146090 Kinesins

Kinesin-16 16161 27 92712 95225 Kinesins

Kinesin-2 16456 9 209774 211702 Kinesins

Kinesin-2 17333 22 87224 89380 Kinesins

Kinesin-3 6262 2 466085 469306 Kinesins

Kinesin-3 102101 2 471447 474527 Kinesins

Kinesin-3 112846 2 501266 504553 Kinesins

Kinesin-4 16650 13 125722 128892 Kinesins

Kinesin-5 16425 17 76040 79240 Kinesins

Kinesin-6 102455 2 238600 242184 Kinesins

Kinesin-6 like 15134 1 792925 796044 Kinesins

Kinesin-7 15962 27 128214 130985 Kinesins

Kinesin-8 4371 1 618658 620991 Kinesins

Kinesin-9 10137 12 151665 153935 Kinesins

Kinesin-9 6404 32 93694 95997 Kinesins

Kinesin-like protein 14070 22 146387 148768 Kinesins

Kinesin-related protein 16224 2 180011 182728 Kinesins

Kinesin-related protein 11442 10 250897 252816 Kinesins

Median body protein 16343 14 101531 104104 Median body

Spindle protein 15248 22 40398 41378 Median body

Centromere/microtubule binding protein CBF5 16311 70 10980 12245 Microtubule & associated proteins

Dynamin 14373 43 61241 63439 Microtubule & associated proteins

Katanin 15368 19 85298 86827 Microtubule & associated proteins

Alpha-tubulin 112079 30 23195 24559 Tubulins & associated proteins

Alpha-tubulin (Fragment) 103676 19 126972 128336 Tubulins & associated proteins

Beta tubulin 101291 14 50994 52337 Tubulins & associated proteins

Beta tubulin 136020 14 29697 31040 Tubulins & associated proteins

Beta tubulin 136021 14 25060 26403 Tubulins & associated proteins

Caltractin (Centrin) 104685 34 59094 59624 Tubulins & associated proteins

Delta tubulin 5462 11 94099 95439 Tubulins & associated proteins

Epsilon tubulin 6336 25 55270 56706 Tubulins & associated proteins

Gamma tubulin 114218 33 3929 5404 Tubulins & associated proteins

Gamma tubulin ring complex 12057 15 114406 117084 Tubulins & associated proteins

Tubulin specific chaperone B 5374 21 122695 123414 Tubulins & associated proteins

Tubulin specific chaperone D 10145 12 137890 141765 Tubulins & associated proteins

Tubulin specific chaperone D 15906 14 191262 192155 Tubulins & associated proteins

Tubulin specific chaperone E 16535 3 138817 140673 Tubulins & associated proteins

Tubulin tyrosine ligase 8456 1 48769 50754 Tubulins & associated proteins

Tubulin tyrosine ligase 10382 2 583250 584560 Tubulins & associated proteins

Tubulin tyrosine ligase 14498 2 621625 622836 Tubulins & associated proteins

Tubulin tyrosine ligase 9272 8 235095 237434 Tubulins & associated proteins

Tubulin tyrosine ligase 8592 9 129424 131637 Tubulins & associated proteins

Tubulin tyrosine ligase 10801 9 85653 88880 Tubulins & associated proteins

Tubulin tyrosine ligase 95661 61 30280 32598 Tubulins & associated proteins

Tubulin, small gamma tubulin complex gcp2 17429 12 102671 105628 Tubulins & associated proteins

Actin capping None No Homolog Found

Actin cross-linking, anchoring None No Homolog Found

Actin-severing None No Homolog Found

Alpha-actinin None No Homolog Found

Beta-thymosin None No Homolog Found

Calponin-spectrin family None No Homolog Found

Cofilin None No Homolog Found

Cytoplasmic dynein None No Homolog Found

Dynactin None No Homolog Found

Espin None No Homolog Found

Formin None No Homolog Found

G-actin monomer binding None No Homolog Found

Gelsolin None No Homolog Found

Myosin None No Homolog Found

Nebulin None No Homolog Found

Profilin None No Homolog Found

Tropomyosin None No Homolog Found

Troponin C None No Homolog Found

Vinculin None No Homolog Found

Table S5. Distribution of variant specific surface proteins (VSPs)

ORF Id Chromosome Contig Start Stop Orientation Contig Length

115797 1 11 468 2516 - 271033

Not Called 1 11 269637 271031 +

112647 1 17 195315 196601 + 203025

Not Called 1 17 201204 199294 -

10562 1 22 26585 27142 + 174696

112208 1 22 171842 173632 +

112113 1 45 4364 6445 - 78771

137612 1 45 50998 53595 -

134711 2 6 1323 1736 + 329365

Not Called 2 6 47868 47380 -

137729 2 6 107391 108602 +

117473 2 6 176521 177123 -

117472 2 6 180225 180827 +

113450 2 9 80249 82447 + 275601

113439 2 9 108341 110524 +

8595 2 9 128351 128734 -

117203 2 9 270587 271354 +

117204 2 9 274075 274842 -

101010 2 16 33488 34747 + 203113

33279 2 25 10994 13240 + 161865

115047 2 25 79756 81519 -

11521 2 33 90891 92777 + 96597

118786 2 33 95525 95695 -

112048 2 50 2041 2730 + 68869

134710 2 50 55598 57508 -

115830 3 1 686 2587 + 870956

115831 3 1 5401 7302 -

113512 3 1 65409 67190 -

137740 3 1 71484 72839 +

137744 3 1 227477 228220 +

113797 3 1 440465 442606 -

14324 3 1 551645 552364 +

137752 3 1 768855 771962 +

136003 3 1 864729 866384 +

136004 3 1 869071 870726 -

101074 3 18 198026 200245 - 202859

Not Called 3 24 116639 115419 - 163570

102178 3 43 1849 2235 - 80740

119707 3 51 11342 13363 - 64407

119706 3 51 16665 18686 +

98058 3 54 3119 3382 + 61783

101380 3 65 26231 26818 - 43823

137618 4 2 103891 106452 + 647811

14043 4 2 162594 163313 +

32890 4 2 245442 246047 -

32933 4 2 247339 247506 -

13520 4 2 306261 306647 +

Not Called 4 2 329039 329263 +

114930 4 2 356927 359344 +

112867 4 2 439658 441883 -

28626 4 2 444468 444974 -

112331 4 2 588241 589236 +

103992 4 2 642929 644812 +

16501 4 8 3698 5776 - 281786

40591 4 8 196700 197653 +

137610 4 12 551 2476 + 268442

137611 4 12 3379 4401 +

40571 4 12 4843 7506 +

114277 4 12 7582 9534 +

Not Called 4 12 9619 11370 +

26590 4 12 12624 13169 +

Not Called 4 12 13233 15008 +

Not Called 4 12 15062 16810 +

115085 4 12 16823 18811 +

14297 4 12 111678 113447 -

Not Called 4 12 120445 113531 -

15123 4 12 155733 157514 +

33783 4 12 200737 202518 +

40630 4 12 258684 259196 +

12993 4 15 2048 2338 - 231239

29744 4 15 109122 109337 -

40592 4 15 227730 228410 +

41349 4 15 229959 230351 +

137606 4 21 125028 127586 + 178254

114065 4 21 127599 129419 +

112009 4 21 130059 131252 +

Not Called 4 21 175716 177875 +

Not Called 4 26 3 1409 + 155861

15206 4 26 2671 3312 +

Not Called 4 29 1 1056 + 143621

101410 4 29 1370 2869 +

137607 4 29 3511 4794 +

34357 4 29 5257 7920 +

26894 4 29 84594 85022 +

113304 4 29 113060 114970 +

97820 4 31 8904 9212 - 117353

113024 4 31 15048 16250 -

96055 4 32 847 1233 + 114288

5812 4 32 54073 55389 +

113357 4 32 64914 66776 -

137620 4 32 109651 111693 +

113093 4 32 111706 113661 +

114162 4 35 1037 3061 + 91425

137608 4 56 57531 59192 + 60917

137717 5 3 94019 95959 + 483138

32607 5 3 96190 98772 +

101496 5 3 184497 185936 +

137721 5 3 418453 419718 +

37093 5 3 465730 467646 +

137723 5 3 476424 478571 -

137722 5 3 480761 482908 +

115474 5 5 654 1076 - 343211

115742 5 5 925 2781 -

4313 5 5 149946 150539 +

41626 5 7 71672 72409 + 283910

13194 5 7 109646 111493 -

15237 5 7 178922 180484 -

137697 5 7 180696 182108 -

137707 5 10 1482 3263 + 274062

137708 5 10 6041 7822 -

10659 5 10 80578 82752 -

41472 5 10 88746 90509 +

137710 5 10 90719 92266 +

112693 5 10 139876 140652 +

16472 5 10 218445 219722 -

11470 5 10 267766 269787 +

137714 5 10 272457 274031 -

137681 5 13 2375 3487 + 266103

111732 5 13 210716 212842 -

103001 5 13 219898 221745 +

Not Called 5 13 261314 262162 +

25892 5 14 31686 31982 + 237218

115158 5 14 126807 128171 -

137617 5 14 130165 132369 +

Not Called 5 20 67897 66113 - 187573

Not Called 5 20 123963 122908 -

14783 5 20 125652 126578 -

Not Called 5 27 2 884 + 148505

14331 5 27 3632 4891 -

102662 5 27 36639 38231 -

13390 5 34 2771 4936 - 91580

112207 5 34 90885 91502 +

35454 5 37 339 638 + 87849

41401 5 37 81706 82356 -

114813 5 39 41130 42875 + 83218

Not Called 5 39 83217 81496 -

116477 5 40 2350 4596 - 82796

137604 5 55 22741 24858 - 61242

Not Called 5 55 60006 61241 +

137614 5 57 52647 54290 - 56156

Not Called 5 57 55512 56156 +

38901 Not mapped 4 7945 8163 + 343362

87628 Not mapped 4 98908 100758 +

113163 Not mapped 4 202079 203977 +

Not Called Not mapped 30 2498 4705 + 142143

101765 Not mapped 30 133705 135804 -

Not Called Not mapped 36 85392 89315 + 89315

111873 Not mapped 38 741 1511 - 86414

Not Called Not mapped 38 6276 4489 -

Not Called Not mapped 38 8203 6338 -

135882 Not mapped 38 8293 8583 -

Not Called Not mapped 38 12376 10118 -

103916 Not mapped 44 79593 80138 - 80364

103142 Not mapped 46 2040 2543 + 78122

111936 Not mapped 49 612 2837 - 71699

111933 Not mapped 49 6684 8909 +

Not Called Not mapped 49 68331 70718 +

Not Called Not mapped 49 70762 71697 +

112801 Not mapped 52 6945 9188 + 62709

36493 Not mapped 52 62047 62631 -

13402 Not mapped 53 35884 36174 + 62321

101589 Not mapped 53 56367 57548 -

13727 Not mapped 63 43222 44844 - 45121

114121 Not mapped 64 999 2588 - 44973

137605 Not mapped 64 2809 4146 -

Not Called Not mapped 64 41689 43629 +

Not Called Not mapped 64 43692 44972 +

41227 Not mapped 66 15499 17127 + 41384

Not Called Not mapped 66 23915 21831 -

14586 Not mapped 68 35307 37508 + 38092

89315 Not mapped 69 1753 3084 - 35240

137613 Not mapped 73 3680 4864 + 28166

Not Called Not mapped 79 2662 311 - 5687

32916 Not mapped 80 737 1027 + 5673

Not Called Not mapped 80 1091 2926 +

114653 Not mapped 80 2939 4735 +

Not Called Not mapped 80 4799 5671 +

92835 Not mapped 81 701 1207 + 5266

113491 Not mapped 81 2487 3038 +

Not Called Not mapped 81 3138 4901 +

Not Called Not mapped 81 4963 5265 +

Not Called Not mapped 84 3 803 + 4632

34196 Not mapped 84 1258 2598 +

Not Called Not mapped 84 2660 4630 +

135831 Not mapped 85 1883 2434 + 4626

105983 Not mapped 85 3678 4229 +

135832 Not mapped 86 442 993 + 4600

Not Called Not mapped 87 1 148 + 4079

Not Called Not mapped 87 210 2048 +

Not Called Not mapped 87 2941 3865 +

40621 Not mapped 89 2466 2852 - 3944

113211 Not mapped 90 513 1160 + 3922

102540 Not mapped 90 2738 3034 +

113954 Not mapped 91 479 2602 - 3749

112178 Not mapped 91 2436 2711 -

115475 Not mapped 94 634 1056 - 3356

114672 Not mapped 94 905 2749 -

136002 Not mapped 95 2825 3211 - 3341

34442 Not mapped 97 773 1048 - 3288

8338 Not mapped 98 682 1371 - 3266

114122 Not mapped 99 211 2148 - 3260

112314 Not mapped 100 2667 3176 - 3185

135881 Not mapped 104 1747 2037 - 2495

41539 Not mapped 106 570 1412 - 2311

115796 Not mapped 115 896 1462 + 1944

97233 Not mapped 121 1102 1392 - 1822

122564 Not mapped 123 365 637 + 1777

118180 Not mapped 126 1008 1175 - 1710

41476 Not mapped 132 716 1261 + 1660

121069 Not mapped 145 80 472 - 1529

137761 Not mapped 149 180 674 - 1500

118132 Not mapped 150 553 720 + 1494

90215 Not mapped 159 37 747 + 1450

135918 Not mapped 171 1050 1340 - 1383

111903 Not mapped 174 13 396 + 1349

118133 Not mapped 186 522 689 + 1273

15400 Not mapped 189 464 1150 - 1231

121070 Not mapped 198 369 761 - 1193

122565 Not mapped 221 758 1030 + 1104

135919 Not mapped 224 760 1050 - 1093

14307 Not mapped 226 50 826 + 1086

124980 Not mapped 229 186 482 + 1065

99743 Not mapped 232 98 481 + 1058

114286 Not mapped 258 701 868 - 998

122566 Not mapped 273 585 857 + 940

136001 Not mapped 277 511 897 + 922

105759 Not mapped 287 332 883 - 887

118181 Not mapped 305 112 279 + 678

Table S6. Candidate drug targets

ORF Query GI Definition Annotation Score E-value

16975 19913408 DNA topoisomerase II, beta isozyme DNA topoisomerase II 748 0

112079 17986283 tubulin, alpha 1a alpha-tubulin 830 0

136021 29788768 tubulin, beta 2B beta tubulin 809 0

32658 48255957 plasma membrane calcium ATPase 4 Plasma membrane calcium-transporting ATPase 623 1.00E-179

6226 21361370 brain glycogen phosphorylase glycogen phosphorylase 613 1.00E-176

96670 51944966 ATPase, H+/K+ exchanging, alpha Potassium-transporting ATPase alpha chain 1 541 1.00E-154

35180 4826730 FK506 binding protein 12-rapamycin TOR1 529 1.00E-150

115052 5031915 ATP-binding cassette, sub-family C Multidrug resistance-associated protein 1 507 1.00E-144

35428 5454158 valyl-tRNA synthetase valine-tRNA ligase 473 1.00E-134

114218 31543831 tubulin, gamma 1 Tubulin gamma chain 472 1.00E-134

10521 15149476 arginyl-tRNA synthetase Arginyl-tRNA synthetase 442 1.00E-125

9348 4826960 glutaminyl-tRNA synthetase Glutaminyl-tRNA synthetase 435 1.00E-122

6687 7669492 glyceraldehyde-3-phosphate Glyceraldehyde 3-phosphate dehydrogenase 410 1.00E-115

7537 4503093 casein kinase 1 epsilon Casein kinase I, alpha isoform 410 1.00E-115

98054 20149594 heat shock 90kDa protein 1, beta Heat shock protein HSP 90-alpha 404 1.00E-113

8037 4557439 cyclin-dependent kinase 3 Cell division protein kinase 2 358 1.00E-100

3032 47419914 tryptophanyl-tRNA synthetase Tryptophanyl-tRNA synthetase 349 4.10E-96

22204 24308436 methionine-tRNA synthetase 2 Methionyl-tRNA synthetase 345 1.10E-94

16802 4826675 cyclin-dependent kinase 5 Cell division protein kinase 2 325 4.10E-89

17417 51702240 dual-specificity Dual-specificity tyrosine- 325 8.10E-89

9116 21361340 glycogen synthase kinase 3 beta Kinase 321 1.10E-87

9270 15451929 CDC14 homolog A isoform 1 Probable protein-tyrosine phosphatase CDC14 305 1.10E-82

14004 7657498 MAPK/MAK/MRK overlapping kinase Serine/threonine-protein kinase MAK 304 1.10E-82

17563 20986497 mitogen-activated protein kinase 7 Mitogen-activated protein kinase 303 8.10E-82

14364 46877068 AMP-activated protein kinase alpha SNF1-related protein kinase KIN10 302 9.10E-82

35094 4505933 polymerase (DNA directed), delta 1 DNA pol delta 297 5.10E-80

15247 4507677 tumor rejection antigen (gp96) 1 Heat shock protein 90 295 2.10E-79

27520 4503095 casein kinase II alpha 1 subunit Casein kinase II, alpha chain 287 2.10E-77

11214 4826948 protein kinase, X-linked cAMP-dependent protein kinase, alpha-catalytic 280 2.10E-75

5867 10835051 cysteinyl-tRNA synthetase isoform Cysteinyl-tRNA synthetase 281 3.10E-75

17327 4557835 Xaa-Pro dipeptidase Xaa-Pro dipeptidase 276 6.10E-74

5375 4505373 NIMA (never in mitosis gene Serine/threonine-protein kinase NEK2 273 3.10E-73

17566 38569460 SNF1-like kinase 2 Probable serine/threonine protein kinase SNF 267 4.10E-71

86444 20127541 serum/glucocorticoid regulated Serine/threonine-protein kinase Sgk1 266 4.10E-71

104150 41872374 polo-like kinase 3 CDC5, YEAST 267 4.10E-71

16034 24308326 BR serine/threonine kinase 1 Probable serine/threonine protein kinase SNF 264 4.10E-70

15514 5454096 serine/threonine kinase 4 Serine/threonine protein kinase 3 256 5.10E-68

6700 7662388 intestinal cell kinase Serine/threonine-protein kinase MAK 256 9.10E-68

86600 5803092 methionyl aminopeptidase 2 Methionine aminopeptidase 254 2.10E-67

8587 18959200 LATS, large tumor suppressor protein kinase, putative 254 4.10E-67

92498 23510391 NIMA-related kinase 3 Serine/threonine-protein kinase NEK3 253 4.10E-67

21512 23199995 Williams Beuren syndrome S-adenosylmethionine-dependent methyltransfe 245 4.10E-65

22850 4506081 mitogen-activated protein kinase 10 Mitogen-activated protein kinase 243 4.10E-64

90343 50845418 WNK lysine deficient protein protein kinase family 232 4.10E-60

14842 7657198 dimethyladenosine transferase Dimethyladenosine transferase 228 8.10E-60

104173 46852147 mitochondrial isoleucine tRNA Isoleucyl-tRNA synthetase 229 1.10E-59

7260 5174391 aldo-keto reductase family 1 Aldose reductase 226 3.10E-59

11364 48255885 protein kinase C, iota Ribosomal protein S6 kinase alpha 1

227 3.10E-59

94582 4505489 ornithine decarboxylase 1 Ornithine decarboxylase 223 3.10E-58

17368 24308123 serine/threonine kinase 36 (fused

Fused1 protein 216 2.10E-55

10609 4826878 oxidative-stress responsive 1 Serine/threonine protein kinase 25

207 3.10E-53

21750 28557709 phosphoribosyl pyrophosphate Ribose-phosphate pyrophosphokinase

205 8.10E-53

5772 10835073 N-myristoyltransferase 1 Glycylpeptide N-tetradecanoyltransferase

196 5.10E-50

102647 6806921 solute carrier family 9 Sodium/hydrogen exchanger 3 196 1.10E-49

92741 10190706 CDC-like kinase 4 Protein kinase CLK4 192 1.10E-48

13962 5031751 3-hydroxy-3-methylglutaryl-Coenzyme

Hydroxymethylglutaryl-CoA synthase

184 3.10E-46

8496 16903164 TC10-like Rho GTPase Rac/Rho-like protein 181 7.10E-46

6509 4504549 tenascin C (hexabrachion) Neurogenic locus notch homolog protein 1 prec

184 2.10E-45

13875 28872761 myotubularin-related protein 1 Myotubularin 182 2.10E-45

16834 42794765 mitogen-activated protein kinase

Mitogen-activated protein kinase kinase kina

176 7.10E-44

17132 4505771 ATP-binding cassette, subfamily B

MRP-like ABC transporter 176 2.10E-43

14019 4503139 cathepsin B preproprotein Cathepsin B precursor 171 1.10E-42

14566 4503141 cathepsin C isoform a preproprotein

Dipeptidyl-peptidase I precursor 169 6.10E-42

22165 4506889 mitogen-activated protein kinase Serine/threonine protein kinase 3 162 6.10E-40

15566 47174861 peter pan homolog Peter pan protein 160 4.10E-39
17082 21359854 Rab geranylgeranyltransferase Protein farnesyltransferase beta subunit 158 9.10E-39
221689 36413607 ATP-binding cassette, sub-family partial ORF of putative ABC transporter 159 1.10E-38
12235 8923219 TRM1 tRNA methyltransferase 1 N(2),N(2)-dimethylguanosine tRNA methyltransferase 158 3.10E-38
16779 11545918 P3ECSL Cathepsin B precursor 157 4.10E-38
10055 4759046 RNA polymerase I subunit isoform 2 DNA-directed RNA polymerase subunit D 153 4.10E-37
14983 4503155 cathepsin L preproprotein Cathepsin L precursor 150 2.10E-36
12223 29570780 AP2 associated kinase 1 probable serine/threonine-protein kinase 150 6.10E-36
16558 4505807 phosphatidylinositol 4-kinase Phosphatidylinositol 4-kinase 150 7.10E-36
10311 38788445 ornithine carbamoyltransferase Ornithine carbamoyltransferase 148 1.10E-35
16149 4557757 MutL protein homolog 1 DNA mismatch repair protein mutL 145 2.10E-34
14855 21237725 phosphoinositide-3-kinase phosphoinositide-3-kinase, catalytic, alpha 145 3.10E-34
114670 4506051 DNA primase small subunit, 49kDa DNA pol/primase, small sub 144 3.10E-34
16380 4503151 cathepsin K preproprotein Cathepsin L precursor 143 4.10E-34
4405 23308722 TTK protein kinase Dual specificity protein kinase TTK 141 4.10E-33
16796 41327715 p53-related protein kinase TP53 regulating kinase 135 4.10E-32
17315 9955963 ATP-binding cassette, sub-family B Multidrug resistance ABC transporter ATP-bin 134 7.10E-31
103944 30425444 ankyrin repeat and kinase domain Protein kinase 130 5.10E-30
16468 47271446 tubulointerstitial nephritis Cathepsin B precursor 129 7.10E-30
14058 5453862 phosphodiesterase 4A, cAMP-specific cAMP-specific 3',5'-cyclic phosphodiesterase 129 1.10E-29
11311 44917615 NIMA (never in mitosis gene a) Serine/threonine-protein kinase NEK2 127 3.10E-29
15297 11545855 ribokinase Ribokinase 125 8.10E-29
10450 4503725 FK506-binding protein 1A FKBP-type peptidyl-prolyl cis-trans isomerase 122 1.10E-28
8364 4507519 thymidine kinase 1, soluble Thymidine kinase 122 4.10E-28
14626 7705855 steroid dehydrogenase homolog Oxidoreductase, short chain dehydrogenase/re 121 1.10E-27
16728 40549429 TPTE and PTEN homologous inositol Phosphatase and tensin homologue 121 2.10E-27
91348 4557801 purine nucleoside phosphorylase Purine nucleoside phosphorylase 119 5.10E-27
12215 56676399 hypothetical protein LOC28989 Putative S-adenosylmethionine-dependent methylase 116 2.10E-26
14434 41327754 ankyrin repeat domain 3 Protein 21.1 119 2.10E-26
17406 34761064 phosphoinositide-3-kinase, class 3 Phosphoinositide-3-kinase, class 3 119 2.10E-26
16160 6042196 cathepsin F Cathepsin B precursor 115 1.10E-25

95593 33383241 protein kinase Myt1 isoform 1 Serine/threonine-protein kinase NEK2 114 2.10E-25
112885 4502171 adenine phosphoribosyltransferase Adenine phosphoribosyltransferase 112 4.10E-25
12853 7657339 molybdenum cofactor synthesis 3 Molybdopterin biosynthesis MoeB protein 113 6.10E-25
9413 20070125 prolyl 4-hydroxylase, beta subunit Protein disulfide isomerase precursor 110 5.10E-24
4653 41152086 serine (or cysteine) proteinase Serpin 1 108 9.10E-24
27326 4506483 REV3-like, catalytic subunit of DNA DNA pol alpha, subA 110 3.10E-23
103713 21361657 protein disulfide Protein disulfide isomerase precursor 107 5.10E-23
8152 5174647 PTK6 protein tyrosine kinase 6 Serine/threonine-protein kinase NEK2 106 6.10E-23
4349 42542403 CGI-01 protein isoform 1 Endothelin-converting enzyme 2 107 7.10E-23
16948 32698918 NOL1/NOP2/Sun domain family Nucleolar protein NOP2 102 1.10E-21
137719 21361306 neurotrophic tyrosine kinase Serine/threonine-protein kinase Nek1 103 1.10E-21
16612 4507947 tyrosyl-tRNA synthetase Tyrosyl-tRNA synthetase 100 5.10E-21
93103 51173878 HpaII tiny fragments locus 9C HpaII tiny fragments locus 9c 95 3.10E-19
17430 6005757 chromatin-specific transcription DRE4 protein 94 8.10E-19
113456 23943912 phosphoinositide-3-kinase VPS15 protein 94 1.10E-18
4246 19923661 PRIP-interacting protein PIPMT PRIP-interacting protein PIPMT 91 5.10E-18
7246 4503727 FK506-binding protein 3 FKBP-type peptidyl-prolyl cis-trans isomerase 87 1.10E-17
113094 10863929 ribonuclease L protein kinase 90 1.10E-17
16477 5730098 tenascin R (restrictin, janusin) tenascin-X 90 2.10E-17
14195 4505913 PMS2 postmeiotic segregation DNA mismatch repair protein mutL 89 3.10E-17
15820 4503771 farnesyltransferase, CAAX box Protein farnesyltransferase alpha subunit 87 3.10E-17
3677 50346001 endoplasmic reticulum to nucleus Hypothetical protein 89 4.10E-17
15112 18254478 dual specificity phosphatase 19 Dual specificity phosphatase, catalytic component 85 7.10E-17
17069 4557665 insulin-like growth factor 1 Serine/threonine-protein kinase NEK4 86 3.10E-16
10783 17402865 thiosulfate sulfurtransferase Thiosulfate sulfurtransferase 83 4.10E-16
3643 52630440 FK506-binding protein 8 70 kDa peptidylprolyl isomerase, putative 84 5.10E-16
12807 18104959 ATP binding protein associated Similarity to ATP binding protein 80 2.10E-15
9421 29029632 anaplastic lymphoma kinase Ki-1 Serine/threonine-protein kinase NEK2 83 3.10E-15
14787 13325072 phosphatidylinositol polyphosphate Type II inositol-1,4,5-trisphosphate 5-phosphate 82 3.10E-15
112076 4502715 CDC7 cell division cycle 7 Hypothetical protein 81 3.10E-15
13215 18490991 T-LAK cell-originated protein G2-specific protein kinase NIMA 80 4.10E-15

16792 11125768 heme-regulated initiation factor Serine/threonine-protein kinase NEK3 80 1.10E-14
6148 23510451 N-acylaminoacyl-peptide hydrolase Alanyl dipeptidyl peptidase 79 2.10E-14
15897 40254422 nitric oxide synthase 3 Nitric oxide synthase, inducible 80 2.10E-14
15500 5729802 thioredoxin-like 4A NONE 75 3.10E-14
14135 30023828 thioredoxin-like 2 Nucleoside diphosphate kinase 76 4.10E-14
14934 40254426 natriuretic peptide receptor G2-specific protein kinase nimA 79 4.10E-14
8350 4759226 transforming growth factor, beta Serine/threonine-protein kinase NEK2 75 2.10E-13
36315 6005956 dual specificity phosphatase 12 Dual specificity protein phosphatase 12 74 2.10E-13
16322 20127446 integrin, beta 5 Neurogenic locus Notch protein precursor 75 3.10E-13
8805 4507353 TBP-associated factor 15 isoform 2 NONE 74 6.10E-13
5359 11225260 DNA topoisomerase I Nucleolar protein NOP5 74 1.10E-12
14670 5803121 protein disulfide Protein disulfide isomerase precursor 72 2.10E-12
14661 41349437 bone morphogenetic protein Serine/threonine kinase 72 3.10E-12
16675 53828918 Rab geranylgeranyltransferase geranylgeranyl transferase alpha subunit 72 3.10E-12
16936 40807489 egf-like module containing Neurogenic locus notch homolog protein 1 pre 72 3.10E-12
114815 19743813 integrin beta 1 isoform 1A Tenascin precursor 71 7.10E-12
8687 29244926 corin Tenascin precursor 70 2.10E-11
9827 21389617 apoptosis-inducing factor like Thioredoxin reductase 69 2.10E-11
101534 27262659 colony stimulating factor 1 Protein kinase 70 2.10E-11
9528 52856442 methyltransferase like 2A methyltransferase like 2 67 3.10E-11
95162 47078292 integrin beta chain, beta 3 tenascin-X 68 4.10E-11
8627 42741682 zinc finger protein 265 isoform 2 Hypothetical Protein 66 6.10E-11
8382 4503817 follicular lymphoma variant Probable short-chain dehydrogenase 66 8.10E-11

Table S7. Components of protein complexes found in selected eukaryotic genomes

Subunit Giardia Trichomonas Entamoeba Encephalitozoon Saccharomyces

Replication
orc1p +1 + + + +
mcm2 + + + + +
mcm3 + + + + +
mcm4 + + + + +
mcm5 + + + + +
mcm6 + + + + +
mcm7 + + + + +
orc4p + - - - +
cdc6 - + - - +
cdc45 - + + - +
rpa1 - + + + +
psf2 - + + - +
orc2p - - - - +
orc3p - - - - +
orc5p - - - - +
orc6p - - - - +
mcm1 - - - - +
cdt1 - - - - +
mcm10 - - - - +
rpa2 - - - + +
sid2 - - - - +
sid3 - - - - +
dpb11 - - - - +
sld5 - - - - +
psf1 - - - - +
psf3 - - - - +

Transcription
RNAPII B3 + + + + +
RNAPII B5 + + + + +
RNAPII B6 + + + + +
RNAPII B10 + + - + +
RNAPII B11 + + + + +
RNAPII B1 - + + + +
RNAPII B2 - + - + +
RNAPII B7 - + + + +
RNAPII B8 - + + + +

RNAPII B4 - - - + +
RNAPII B9 - - + + +
RNAPII B12 - - - - +

Basal Transcription Factors
TBP + + + + +
TFIIH2 + + + + +
TFIID1 - + + + +
TFIID2 - + + - +
TFIID4 - + + + +
TFIID5 - + - + +
TFIID6 - + - + +
TFIID7 - + + + +
TFIID8 - + - - +
TFIIE1 - + - + +
TFIIH3 - + + - +
TFIIH4 - + + + +
TFIID9 - - - + +
TFIID10 - - - - +
TFIID11 - - - + +
TFIIB - - - + +
TFIIA1 - - - - +
TFIIA2 - - - - +
TFIIF1 - - - - +
TFIIF2 - - - + +
TFIIF3 - - - - +
TFIIE2 - - - + +
TFIIH1 - - - - +

Polyadenylation
Glc7 (Serine/threonine-protein phosphatase PP1-2) + + + + +
Pab1 + + + + +
Pap1 + + + + +
RnaP2 + + + + +
Ysh1 + + + + +
Yth1 + + + + +
Clp1 - + + - +
Psf2 - + + + +
Rna14 - + - + +
Ssu72 - + + + +
Cft1 - - - + +
Cft2 - - - + +

Fip1 - - - + +
Hrp1/Nab4 - - - - +
Mpe1/YKL059c - - - + +
Nab2 - - - - +
Pcf11 - - - - +
Pta1 - - - - +
Pti1 - - - - +
Ref2 - - - - +
Rna15 - - - - +
Swd2/YKL018w - - - - +
Syc1/YOR179C - - - - +

Ubiquitin-mediated proteolysis
UBE1 + + + + +
UBE2D/E + + + + +
RBX1 + + + + +
APC1 - + + - +
APC2 - + - - +
APC3 - + - + +
APC6 - + - + +
ACP8 - + - + +
APC10 - + + + +
APC11 - + + - +
CDH1 - + + - +
CUL1 - + + - +
HERC1 - + + + +
TCEB1 - + - - -
UBE2C - - - - +
CDC34 - - - + +
ACP4 - - - - +
APC5 - - - - +
APC9 - - - - +
APC12 - - - - +
CDC20 - - - + +
SKP1 - - + - +
GRR1 - - - - +
CDC4 - - - + +
MET30 - - - - +
CUL3 - - + - +

Regulation of actin cytoskeleton
PIP5K + + + + +
ERK1_2 + + - - +

ACTB_G + + + + +
PPP1C + + - + +
PIK3C1 + + - - -
RAC1 + + + - -
SSH + + + - -
PAK1 - + - - +
CYFIP - + - - +
ARPC4 - + + - +
ARPC3 - + + - +
ARPC1A_B - + + - +
ARPC2 - + + - +
CFN - + + - +
GSN - + + - -
ARHGEF4 - + - - -
RRAS - + - - -
RRAS2 - + - - -
SOS - + + -
MAP2K1 - + - -
CDC42 - + - -
ACTN - + + +
FGD2 - + - -
FGD3 - + - -
FGD5_6 - + + -
CSK - + - -
NCKAP1 - + - -
FAB1 - - - - +
ARPC5 - - - - +
PFN - - + - +
IQGAP - - + + +
EGFR1 - - + - -
RAF1 - - + - -
PAK2 - - + - -
PTK2 - - + - -
PXN - - + - -
KRAS - - + - -

1 Plus sign indicates that the protein was found in the annotation of the respective genome, by KEGG pathway analysis, or by BLAST search using yeast homologues as queries and a cutoff of 1E-4 or better. Matches found by BLAST were examined by reciprocal BLAST against the nr database and considered possible homologues if their best significant, annotated match was to the expected component or if they contained the expected conserved domain. Minus sign indicates that no such evidence was found.
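The decision rule described in the Table S7 footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BlastCandidate` record, its field names, and the case-insensitive name comparison are hypothetical and are not taken from the authors' pipeline. Only the 1E-4 cutoff and the two acceptance criteria (expected reciprocal match, or expected conserved domain) come from the footnote.

```python
from dataclasses import dataclass

# Cutoff quoted in the Table S7 footnote ("1E-4 or better").
E_VALUE_CUTOFF = 1e-4


@dataclass
class BlastCandidate:
    """Hypothetical record for one BLAST match to a yeast query protein."""
    evalue: float                 # E-value of the forward BLAST match
    best_annotated_nr_match: str  # best significant, annotated hit from reciprocal BLAST against nr
    has_expected_domain: bool     # True if the expected conserved domain is present


def is_possible_homologue(hit: BlastCandidate, expected_component: str) -> bool:
    """Apply the E-value cutoff, then require reciprocal-match or domain evidence."""
    if hit.evalue > E_VALUE_CUTOFF:
        return False
    # Assumption: names are compared case-insensitively; real annotations need curation.
    matches_expected = hit.best_annotated_nr_match.lower() == expected_component.lower()
    return matches_expected or hit.has_expected_domain


# Made-up example values, for illustration only.
example = BlastCandidate(evalue=3e-20, best_annotated_nr_match="MCM2", has_expected_domain=False)
print(is_possible_homologue(example, "mcm2"))  # True
```

A candidate that fails the cutoff is rejected regardless of the other evidence, which mirrors the order of the steps in the footnote.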

Table S8. Heterozygosity observed in 1.5 Mb of genome

ORF ID  High quality change1   Percent change (# reads at site)
33985   CCG (Pro) insertion    6.7 (15)
11040   AGA to AAA             7.7 (26)
24880   CTT to TTT             20 (5)
42357   CGT to TGT             33 (9)
6471    AAT to CAT             7.1 (14)
7616    CCC to GCC             5.5 (18)
95192   CCT to CTC             44 (9)
17073   TGC to TAC             6.7 (15)

1 The two largest contigs were examined for high quality sequence mismatches in individual reads, using the program CONSED. Mismatches were detected at only 25 locations, most within non-coding regions.
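As a reading aid for Table S8, the percentages can be converted back into approximate read counts. The sketch below assumes that "percent change" is the share of reads at the site that carry the alternative codon. The table does not state this explicitly, so treat the interpretation and the helper name `implied_variant_reads` as assumptions.

```python
# (ORF ID, percent change, number of reads at site), copied from Table S8.
TABLE_S8 = [
    ("33985", 6.7, 15),
    ("11040", 7.7, 26),
    ("24880", 20.0, 5),
    ("42357", 33.0, 9),
    ("6471", 7.1, 14),
    ("7616", 5.5, 18),
    ("95192", 44.0, 9),
    ("17073", 6.7, 15),
]


def implied_variant_reads(percent: float, reads: int) -> int:
    """Round percent * reads / 100 to the nearest whole read (assumed interpretation)."""
    return round(percent * reads / 100)


for orf_id, percent, reads in TABLE_S8:
    print(orf_id, implied_variant_reads(percent, reads), "of", reads, "reads")
```

Under this reading, most sites are supported by a single variant read, and none by more than four.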

Figure S1. Consensus intron motifs in Giardia lamblia and Trichomonas vaginalis

T. vaginalis 5' motif: G TAT/C GT
G. lamblia 5' motif: G/C TAT GTT
T. vaginalis 3' motif: A CT A AC A CACAG
G. lamblia 3' motif: A/C CT A/G AC A/C CACAG

Intron locations:

ORF27266 2Fe-2S ferredoxin 35 nt
CTATGTTGAGAACCACCCAAACAACTAACACACAG

ORF15124 dynein-like 32 nt
GTATGTTATCTCCCGCATAACCTAACACACAG

ORF17244 ribosomal protein RPL7A 109 nt
GTATGTTCTTATGCGCGAGGAGCCGTCCGCTGACCGCACACACCTCTGATTGCGGGTTGTGTGTTGTCAGCGGGTGGACTTCGCTGTTCACCTGACAACTGACCCACAG

Upstream of ORF35332, no annotation 220 nt; would extend amino terminus of protein
GTATGTTTGTAGCTCGGCGGCACTATACTTCAAGATTACTGGAAACTAGCCCAGCGGATCGAAGGTAGAACAATTTCCTCTCCTATCACGCTCTACGAAACTGCCAAAAGGGTACGCATTCCTGCCAACTATTCAACTTCTTACCTCTTTTGGCTTTCTATTAACGGGCTTTTAGACGAGGGATTGACCGCCGAGCATTTACCATCCAACTGACACACAG
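The four intron sequences in Figure S1 can be checked against the stated lengths and the G. lamblia consensus motifs with a short script. This is a sketch, not the authors' method. It assumes that "X/Y" in the motif notation means "X or Y" at that position, that the 5' motif sits at the very start of the intron, and that the 3' motif sits at the very end. The dictionary keys are labels chosen for illustration.

```python
import re

# One reading of the G. lamblia motifs, with "X/Y" taken as "X or Y" (assumption).
FIVE_PRIME = re.compile(r"^[GC]TATGTT")
THREE_PRIME = re.compile(r"[AC]CT[AG]AC[AC]CACAG$")

# label -> (stated length in nt, sequence); long sequences are split for readability.
INTRONS = {
    "ORF27266": (35, "CTATGTTGAGAACCACCCAAACAACTAACACACAG"),
    "ORF15124": (32, "GTATGTTATCTCCCGCATAACCTAACACACAG"),
    "ORF17244": (
        109,
        "GTATGTTCTTATGCGCGAGGAGCCGTCCGCTGACCGCACACACCTCTGATTG"
        "CGGGTTGTGTGTTGTCAGCGGGTGGACTTCGCTGTTCACCTGACAACTGACCC"
        "ACAG",
    ),
    "upstream_ORF35332": (
        220,
        "GTATGTTTGTAGCTCGGCGGCACTATACTTCAAGATTACTGGAAACTAGCCC"
        "AGCGGATCGAAGGTAGAACAATTTCCTCTCCTATCACGCTCTACGAAACTGC"
        "CAAAAGGGTACGCATTCCTGCCAACTATTCAACTTCTTACCTCTTTTGGCTTTC"
        "TATTAACGGGCTTTTAGACGAGGGATTGACCGCCGAGCATTTACCATCCAACT"
        "GACACACAG",
    ),
}


def check_intron(expected_len: int, seq: str) -> tuple:
    """Return (length matches, 5' motif found at start, 3' motif found at end)."""
    return (
        len(seq) == expected_len,
        bool(FIVE_PRIME.search(seq)),
        bool(THREE_PRIME.search(seq)),
    )


results = {name: check_intron(n, seq) for name, (n, seq) in INTRONS.items()}
for name, flags in results.items():
    print(name, flags)
```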

Figure S2. Giardial kinome. AGC: family containing the PKA, PKG, and PKC kinases; CAMK: calmodulin-dependent protein kinase family; CK1: casein kinase 1; CMGC: group containing the CDK (cyclin-dependent kinase), MAPK (mitogen-activated protein kinase), GSK3 (glycogen synthase kinase), and CLK (CDC2-like kinase) families; NEK: NIMA (never in mitosis associated)-related kinase; STE: sterile kinase; TK/TKL: tyrosine kinase, tyrosine kinase-like.

Figure S3. Amino acid insertion in giardial histone acetyltransferase.

NGSEHTNYCRNICLVARLFLQHKTLVADVDVFLFYVMFAKTNNKKANNQQLSAAQKASDEQEQRAHCEPEIKRIPDWQQTGPGDQD------DDSYHFVGFFSKEKQQEN--SLSCIVALP GIARDIA
DGAFATRWCTNLCLITKLFLFHKTEYYNPTLFHFYVV---------------------------------------------CFHD------SHGAHPCGFFSKEKDFNCPNNLACILAFP TRICHOMONAS
DGAEAKMFCQSLCLLSKMFLDHKTLYYDVEPFYFYVL---------------------------------------------CEFNYNEQWKSDDYHIVGYFSKEKASPDGYNLSCLMVLP ENTAMOEBA
DGNHCRVYCENLCFLSKLFLDHKTLRHPVSLFLFYVM---------------------------------------------TEID------DKGYHITGYFSKEKYSKN--NVSCILTLP TOXOPLASMA
DGSYFRIYCENLCFLSKLFLDHKTLKHRVNLFLFYVI---------------------------------------------TEYD------EYGYHITGYFSKEKYSKN--NVSCILTLP PLASMODIUM
DGALTRGYAENLCYLAKLFLDHKTLQYDVEPFLFYIV---------------------------------------------TEVD------EEGCHIVGYFSKEKVSLLHYNLACILTLP CRYPTOSPORIDIUM
DGAISKIYCQNLCYLAKLFLDHKTLYYDVDPFLFYIV---------------------------------------------CEVD------SRGFHPVGYFSKEKYSELGYNLACILTFP PHYTOPHTHORA
DGFEERIYCQNLCYIAKLFLDHKTLYFDVDPFLFYVL---------------------------------------------CEVD------ERGYHPVGYYSKEKYSDVGYNLACILTFP THALASSIOSIRA
DGHIQKNYCRNLSLLSKLFLDHKSLYYDIDVFMFYVL---------------------------------------------CRLE------DNGYQIVGYFSKEKMSEQGYNLACILTLP ENCEPHALITOZOON
DGKKEKAFCQNLCYLAKLFLDHKTLYYDVDLFLFYIL---------------------------------------------CEID------ERGAHIVGYFSKEKCSEEGYNLACILTLP CHLAMYDOMONAS
DGKKNKVYGQNLCYLAKLFLDHKTLYYDVDLFLFYVL---------------------------------------------CECD------DRGCHMVGYFSKEKHSEESYNLACILTLP ORYZA
DGKKNKVYAQNLCYLAKLFLDHKTLYYDVDLFLFYVL---------------------------------------------CECD------DRGCHMVGYFSKEKHSEEAYNLACILTLP ARABIDOPSIS
DGKKNKIYCQNLCLLAKLFLDHKTLYYDVEPFLFYVM---------------------------------------------TEAD------NTGCHLIGYFSKEKNSFLNYNVSCILTMP MUS
DGRKNKSYAQNLCLLAKLFLDHKTLYYDTDPFLFYVL---------------------------------------------TEED------EKGHHIVGYFSKEKESAEEYNVACILVLP CAENORHABDITIS
DGKRNRIYCQNLGLLAKLFLDHKTLYYDVEPFLFYIM---------------------------------------------TEYD------ERGCHMVGYFSKEKESPDGNNLACILTLP DICTYOSTELIUM
DGRKQRTWCRNLCLLSKLFLDHKTLYYDVDPFLFYCM---------------------------------------------TRRD------ELGHHLVGYFSKEKESADGYNVACILTLP SACCHAROMYCES
DGRKQRTWCRNLCLISKCFLDHKTLYYDVDPFLYYCM---------------------------------------------TVKD------DYGCHLIGYFSKEKESAEGYNVACILTLP CRYPTOCOCCUS
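The giardial insertion in Figure S3 can be located programmatically by finding the longest gap run in another row of the alignment and reading the Giardia residues at the same columns. The sketch below uses only the GIARDIA and TRICHOMONAS rows. The long gap run is written as `"-" * 45`, which assumes the gap in the figure is exactly as long as the Giardia insertion; the helper name `longest_gap_run` is hypothetical.

```python
import re

# GIARDIA row from Figure S3, split into flank / insertion / flank for readability.
GIARDIA = (
    "NGSEHTNYCRNICLVARLFLQHKTLVADVDVFLFYVM"
    "FAKTNNKKANNQQLSAAQKASDEQEQRAHCEPEIKRIPDWQQTGP"
    "GDQD" + "-" * 6 + "DDSYHFVGFFSKEKQQEN--SLSCIVALP"
)

# TRICHOMONAS row; the 45-column gap length is an assumption (see lead-in).
TRICHOMONAS = (
    "DGAFATRWCTNLCLITKLFLFHKTEYYNPTLFHFYVV"
    + "-" * 45
    + "CFHD" + "-" * 6 + "SHGAHPCGFFSKEKDFNCPNNLACILAFP"
)


def longest_gap_run(row: str) -> tuple:
    """Return (start, end) column indices of the longest run of '-' in an aligned row."""
    runs = list(re.finditer(r"-+", row))
    best = max(runs, key=lambda m: m.end() - m.start())
    return best.start(), best.end()


start, end = longest_gap_run(TRICHOMONAS)
insertion = GIARDIA[start:end]
print(start, end, len(insertion))
print(insertion)
```

The same column range applies to every other row in the alignment, since all of them carry the gap at that position.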

Figure S4. Deep divergence of Giardia in phylogenetic analysis of 61 ribosomal proteins. The 50% majority rule consensus of the trees sampled by the MC3 procedure is shown (based on 9,990 sampled trees). All nodes had a posterior probability of 1.00 (not shown), with four replicate MC3 analyses providing identical results. For comparison, maximum likelihood bootstrap values are superimposed upon the tree. The tree was rooted using Sulfolobus and Archaeoglobus sequences. Horizontal branch lengths are representative of evolutionary change. The scale bar indicates the number of amino acid substitutions per site.
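The idea behind a 50% majority rule consensus and its posterior probabilities can be illustrated with a toy example. In this sketch each sampled tree is reduced to the set of clades it contains, and a clade's support is the fraction of sampled trees in which it appears. The taxa, the three toy trees, and the function name are all hypothetical; this is not the phylogenetics software used for Figure S4.

```python
from collections import Counter


def majority_rule_clades(sampled_trees, threshold=0.5):
    """Return {clade: support} for clades present in more than `threshold` of the trees.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    n_trees = len(sampled_trees)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees > threshold
    }


# Three toy "sampled trees" over hypothetical taxa A, B, C.
toy_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
]

support = majority_rule_clades(toy_trees)
for clade, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), round(value, 2))
```

A support of 1.00 for every node, as reported in the caption, would correspond to every retained clade appearing in all sampled trees.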

Figure S5. Number of cellular machinery components found in eukaryotic genomes. Components were detected by KEGG pathway analysis. KEGG pathways/complexes examined were DNA polymerase, RNA polymerase II, ribosome, proteasome, basal transcription factors, aminoacyl-tRNA synthetases, protein export, SNARE interactions, regulation of autophagy, and ubiquitin-mediated proteolysis (Genetic Information Processing); MAPK, Notch, calcium, phosphatidylinositol, and mTOR signaling (Signal Transduction); and regulation of actin cytoskeleton, cell cycle, and apoptosis (Cellular Processes).

[Bar chart. Y-axis ticks: 0 to 450 in steps of 50. X-axis: Giardia, Trichomonas, Entamoeba, Encephalitozoon, Saccharomyces. Legend: Total, Genetic Information Processing, Signal Transduction, Cellular Processes.]

1. F. D. Gillin, L. S. Diamond, Arch Invest Med (Mex) 9 Suppl 1, 237 (1978). 2. R. D. Adam, Int J Parasitol 30, 475 (Apr 10, 2000). 3. R. D. Adam, T. E. Nash, T. E. Wellems, Nucleic Acids Research 16, 4555 (1988). 4. D. B. Jaffe et al., Genome Res 13, 91 (Jan, 2003). 5. S. Batzoglou et al., Genome Res 12, 177 (Jan, 2002). 6. B. Ewing, P. Green, Genome Res 8, 186 (Mar, 1998). 7. B. Ewing, L. Hillier, M. C. Wendl, P. Green, Genome Res 8, 175 (Mar, 1998). 8. D. Gordon, C. Abajian, P. Green, Genome Res 8, 195 (Mar, 1998). 9. D. Radune, H. Tettelin, Methods Mol Biol 255, 309 (2004). 10. H. Tettelin, D. Radune, S. Kasif, H. Khouri, S. L. Salzberg, Genomics 62, 500

(Dec 15, 1999). 11. J. H. Badger, G. J. Olsen, Mol Biol Evol 16, 512 (Apr, 1999). 12. A. L. Delcher, D. Harmon, S. Kasif, O. White, S. L. Salzberg, Nucleic Acids Res

27, 4636 (Dec 1, 1999). 13. S. L. Salzberg, M. Pertea, A. L. Delcher, M. J. Gardner, H. Tettelin, Genomics 59,

24 (Jul 1, 1999). 14. J. W. Fickett, Nucleic Acids Res 10, 5303 (Sep 11, 1982). 15. S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, R. Ramaswamy,

Comput Appl Biosci 13, 263 (Jun, 1997). 16. P. M. Sharp, W. H. Li, Nucleic Acids Res 15, 1281 (Feb 11, 1987). 17. G. Aggarwal, E. A. Worthey, P. D. McDonagh, P. J. Myler, BMC Bioinformatics

4, 23 (Jun 7, 2003). 18. S. F. Altschul et al., Nucleic Acids Res 25, 3389 (Sep 1, 1997). 19. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J Mol Biol 215,

403 (Oct 5, 1990). 20. A. Bateman et al., Nucleic Acids Res 27, 260 (Jan 1, 1999). 21. S. R. Eddy, Bioinformatics 14, 755 (1998). 22. A. Lupas, M. Van Dyke, J. Stock, Science 252, 1162 (May 24, 1991). 23. A. Lupas, Trends Biochem Sci 21, 375 (Oct, 1996). 24. J. C. Wootton, S. Federhen, Computers and Chemistry 17, 149 (1993). 25. A. Krogh, B. Larsson, G. von Heijne, E. L. Sonnhammer, J Mol Biol 305, 567

(Jan 19, 2001). 26. K. Nakai, P. Horton, Trends Biochem Sci 24, 34 (Jan, 1999). 27. J. D. Bendtsen, H. Nielsen, G. von Heijne, S. Brunak, J Mol Biol 340, 783 (Jul 16,

2004). 28. O. Emanuelsson, H. Nielsen, S. Brunak, G. von Heijne, J Mol Biol 300, 1005 (Jul

21, 2000). 29. T. M. Lowe, S. R. Eddy, Nucleic Acids Res 25, 955 (Mar 1, 1997). 30. S. Griffiths-Jones et al., Nucleic Acids Res 33, D121 (Jan 1, 2005). 31. V. E. Velculescu, B. Vogelstein, K. W. Kinzler, Trends Genet 16, 423 (Oct,

2000). 32. V. E. Velculescu, L. Zhang, B. Vogelstein, K. W. Kinzler, Science 270, 484 (Oct

20, 1995). 33. R. Overbeek et al., Nucleic Acids Res 33, 5691 (2005). 34. F. Chen, A. J. Mackey, J. K. Vermunt, D. S. Roos, PLoS ONE 2, e383 (2007). 35. M. Kanehisa et al., Nucleic Acids Res 34, D354 (Jan 1, 2006).

Page 44: Supporting Online Material forscience.sciencemag.org/content/suppl/2007/09/27/317.5846.1921.DC1/... · 27.09.2007  · (La Jolla, CA) for commercial library construction or used to

36. M. E. Weiland, J. E. Palm, W. J. Griffiths, J. M. McCaffery, S. G. Svard, Int J Parasitol 33, 1341 (2003).

37. B. J. Davids et al., PLoS ONE 1, e44 (2006). 38. K. J. Livak, T. D. Schmittgen, Methods 25, 402 (2001). 39. B. J. Davids, K. Mehta, L. Fesus, J. M. McCaffery, F. D. Gillin, Mol Biochem

Parasitol 136, 173 (2004). 40. L. A. Knodler et al., J Biol Chem 274, 29805 (Oct 15, 1999). 41. A. G. McArthur et al., FEMS Microbiol Lett 189, 271 (Aug 15, 2000). 42. L. D. Stein et al., Genome Res 12, 1599 (Oct, 2002). 43. R. C. Edgar, Nucleic Acids Res 32, 1792 (2004). 44. R. C. Edgar, BMC Bioinformatics 5, 113 (Aug 19, 2004). 45. B. L. Cantarel, H. G. Morrison, W. Pearson, Mol Biol Evol 23, 2090 (Nov, 2006). 46. F. Ronquist, J. P. Huelsenbeck, Bioinformatics 19, 1572 (Aug 12, 2003). 47. J. P. Huelsenbeck, F. Ronquist, Bioinformatics 17, 754 (Aug, 2001). 48. J. P. Huelsenbeck, B. Larget, R. E. Miller, F. Ronquist, Syst Biol 51, 673 (Oct,

2002). 49. D. L. Swofford et al., Syst Biol 50, 525 (Aug, 2001). 50. J. S. Shoemaker, I. S. Painter, B. S. Weir, Trends Genet 15, 354 (Sep, 1999). 51. D. T. Jones, W. R. Taylor, J. M. Thornton, Comput Appl Biosci 8, 275 (Jun,

1992). 52. A. Stamatakis, Bioinformatics 22, 2688 (Nov 1, 2006). 53. J. O. Andersson, A. M. Sjogren, L. A. Davis, T. M. Embley, A. J. Roger, Curr

Biol 13, 94 (Jan 21, 2003). 54. K. Henze et al., Gene 281, 123 (Dec 27, 2001). 55. G. Wu, K. Henze, M. Muller, Gene 264, 265 (Feb 21, 2001). 56. K. Henze, H. G. Morrison, M. L. Sogin, MullerM, Gene 222, 163 (Nov 19, 1998).