Forum: ParaSite – Genome Analysis
TRENDS in Parasitology Vol.17 No.10 October 2001
http://parasites.trends.com 1471-4922/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S1471-4922(01)02019-0

Mining the schistosome DNA sequence database

The basic understanding of organism biology that can be gained through genome analysis has prompted numerous research communities to initiate projects to sequence either whole genomes or specific components such as expressed genes (the transcriptome). In many networks, the initial priority is the generation of expressed sequence tags (ESTs); these are randomly selected, short, single-pass cDNA sequences that reflect the transcriptional activity of organisms or tissues [1]. For parasitologists, such approaches promise the discovery of new drug targets and new candidate vaccine antigens, as well as revealing the molecular mechanisms underlying, for example, parasite biochemistry, development, pathogenicity and diversity [2]. Schistosoma possesses a large, highly repetitious genome; therefore, EST analysis has provided the Schistosome Genome Network (SGN) with a rapid and cost-effective way to produce gene catalogues for both Schistosoma mansoni (the main focus) and Schistosoma japonicum [3]. With the support of various agencies, the SGN has generated, annotated and deposited more than 16 000 sequences in dbEST (the EST division of GenBank) [4,5] (Box 1a). EST analysis continues for both species; major new initiatives totaling 200 000 S. mansoni ESTs (potentially a 1.6–2.5× transcriptome coverage) have recently been announced in Brazil (Box 1, b–e) and large-scale projects for S. japonicum are under development in China. An overview of the SGN’s activities can be obtained from its website (Box 1d).

To date, exploitation of public EST data has chiefly been by keyword search of sequence annotations or by homology search. The size of the Schistosoma EST dataset now permits its exploration in additional creative and informative ways. Some analyses require access to advanced computing and programming support, but there are increasing possibilities for individual research analysis. This article provides an overview of methods being used to mine Schistosoma EST data; an accompanying website provides more comprehensive information (Box 1e).

Data mining through database searching

Text searches and sequence comparisons are the most common ways to query a database (Box 1) and allow the putative assignment of sequence–function relationships. Many database queries can be carried out via the World Wide Web and several schistosome-specific search resources are available. These include: BLAST analysis against non-redundant DNA sequence sets and their six-frame amino acid translations, a parasite protein motif search (both are available through the Parasite Genome Web server) and the S. mansoni gene index [at The Institute for Genomic Research (TIGR)]. These tools and links to more general bioinformatics services are listed in Box 1.

Computational technology also allows more imaginative and complicated database searches that analyze gene expression in a wider context and generate testable hypotheses.

(1) Microsatellite polymorphisms provide essential markers for genome sequencing, positional cloning, physical mapping and population analysis [6,7]. RepeatMasker Web servers (Box 1) allow large numbers of sequences to be scanned rapidly for microsatellite-like simple repeats. For example, examination of 310 S. japonicum cluster consensus sequences revealed 31 tri-, eight tetra- and two pentanucleotide repeat regions. Primers designed to regions flanking these sequences can then be used to test for polymorphisms.

(2) Parasite biochemistry and genomics can be integrated through in silico metabolomics [8] – the mapping of identified genes onto metabolic pathways. With a view to identifying new drug targets, it should be possible to identify standard pathways that are present exclusively in the parasite. Moreover, where the parasite’s gene catalog does not appear to contain expected enzymes, this could suggest that alternate or novel pathways are in operation. Metabolism-related ESTs can also be separated by life-cycle stage and statistical comparisons made. A comprehensive list of enzyme descriptions has been used to search Schistosoma sequence annotations and hits classified by life-cycle stage. Fructose-bisphosphate aldolase appears to be expressed at a higher-than-expected level in cercariae, whereas expression of ubiquinol, cytochrome-c-oxidase and glycogen synthase appears to be biased towards adult worms. Such observations can be correlated with current knowledge of parasite metabolism and used to create testable hypotheses.
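The simple-repeat scan described in item (1) can be approximated locally for a quick first pass. The Python sketch below is not RepeatMasker and does not try to reproduce its algorithm: it only finds perfect tandem repeats of tri-, tetra- and pentanucleotide units with a regular expression. The function name `find_microsatellites`, the `min_repeats` threshold and the example sequence are assumptions introduced here for illustration.

```python
import re


def find_microsatellites(seq, min_unit=3, max_unit=5, min_repeats=4):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    min_repeats is a hypothetical cut-off, not a value from the article.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        # ([ACGT]{k}) captures one repeat unit; \1{n,} demands n more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif,
            # e.g. AAA (homopolymer) or ATAT (dinucleotide repeat).
            if unit in (unit + unit)[1:-1]:
                continue
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Made-up consensus fragment with one tri- and one tetranucleotide repeat.
example = "TTGACC" + "CAG" * 5 + "TTGC" + "GATA" * 4 + "CC"
for start, unit, copies in find_microsatellites(example):
    print(f"position {start}: ({unit}) x {copies}")
```

Candidate loci found this way would still need flanking primers and laboratory testing to establish whether they are actually polymorphic.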

Data mining through cluster analysis

Cluster analysis groups together homologous sequences, identifying the non-redundant sequence set [9]. Several such analyses are available for Schistosoma EST data (Box 1). Once cluster data is available, a wide variety of secondary analyses can be performed.

(1) In general, the utility of the primary databases (GenBank and European Molecular Biology Laboratory; EMBL) is determined by the accuracy of sequence annotation. This can be assessed by comparing the annotation of sequences within individual clusters [10]. Our analysis of 2763 Schistosoma ESTs reveals only 0.65% discrepancy in annotation. Thus, with a quality database available, data-mining processes involving computer analysis and interpretation of data can be reliably undertaken [11].

(2) Cluster consensus sequences are frequently longer than the individual sequences within them, facilitating identification through homology searching, and BLAST analysis of consensus sequences is often performed as part of the clustering process (Box 1). More than 40% of previously unidentified Schistosoma ESTs could be classified in this way.

(3) Transcript families and alternate splicing events can be revealed by comparing clusters that return similar database homology results. For example, two S. mansoni clusters grouped with different actin cDNA sequences present in GenBank, and S. mansoni glucose-3-phosphate dehydrogenase (G3PDH) might be alternatively spliced because one of two G3PDH clusters lacks a stretch of codons that is present in the other.

(4) Single nucleotide polymorphisms (SNPs) [12] can be identified by inspection of sequence alignments for individual clusters. For example, rRNA and cytochrome oxidase subunit II clusters reveal two major polymorphic groups. Schistosome SNPs could be relevant to pharmacogenomics (how individual parasites or isolates are affected by chemotherapy, or recognized by natural or induced immune responses [13]) as well as to population studies.

(5) Some Schistosoma EST clusters are derived exclusively from a single developmental stage (e.g. calcium binding protein, glutathione S-transferase), whereas others reveal transcription across two or more stages (e.g. actin, cytochrome oxidase subunit I). Statistical analysis of expression with respect to life cycle can, with some caveats, reflect transcriptional activity and might be useful for those interested in studying a specific developmental stage or the regulation of gene expression throughout development. In addition, abundantly expressed sequences that cannot be identified by simple homology analysis are worthy of more detailed study as they could represent parasite-specific genes [14].
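One way to put a number on the stage-bias comparisons mentioned in item (5) is a one-sided Fisher exact test on EST counts. The sketch below is an illustration only, not the analysis used by the SGN: it assumes that ESTs were sampled at random from non-normalized libraries, which is one of the caveats that would need checking. The function name `fisher_enrichment_p` and all counts are hypothetical. It relies on `math.comb`, available from Python 3.8.

```python
from math import comb


def fisher_enrichment_p(hits_in_stage, stage_total, hits_elsewhere, elsewhere_total):
    """One-sided Fisher exact p-value for over-representation in one stage.

    hits_in_stage / stage_total: ESTs in the cluster and all ESTs from the
    stage of interest; hits_elsewhere / elsewhere_total: the same counts
    pooled over the remaining stages.
    """
    cluster_size = hits_in_stage + hits_elsewhere
    all_ests = stage_total + elsewhere_total
    denominator = comb(all_ests, stage_total)
    p_value = 0.0
    # Sum the hypergeometric probabilities of seeing at least the observed
    # number of cluster members among the ESTs from the stage of interest.
    for x in range(hits_in_stage, min(cluster_size, stage_total) + 1):
        p_value += (comb(cluster_size, x)
                    * comb(all_ests - cluster_size, stage_total - x)
                    / denominator)
    return p_value


# Hypothetical counts: 12 of 900 cercarial ESTs versus 8 of 6000 other ESTs.
print(fisher_enrichment_p(12, 900, 8, 6000))
```

A small p-value only says that the EST counts are unlikely under random sampling; differences in library construction or sequencing depth between stages can produce the same signal, so any bias found this way is a hypothesis to test experimentally.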

Data mining to support post genomics

Data mining also contributes to post-genomic activities such as microarray and proteomic analysis. Non-redundant clone sets identified by clustering are being used to prepare Schistosoma microarrays. These will facilitate analysis of global transcription profiles, contributing to our understanding of parasite development, sexual differentiation and responses to environmental or experimental perturbations (e.g. pharmacological attack) [15]. For proteomics, ESTs and cluster consensus sequences provide a database for analysis of peptide mass fingerprints and peptide sequence tags, linking gene expression to gene products [16]. In particular, cluster consensus sequences provide more accurate open reading frame predictions than individual ESTs.
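To make the idea of open reading frame prediction from a consensus sequence concrete, the following Python sketch translates a sequence in all six frames and reports the longest stop-free stretch. It is a minimal illustration under stated assumptions, not the method used by any of the servers in Box 1: the standard genetic code is assumed (mitochondrial transcripts would need a different table), no start codon is required because ESTs are often partial, and the function names and example sequence are invented here.

```python
BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    # Codons containing ambiguous bases (e.g. N) are translated as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Map frame names (+1..+3, -1..-3) to their translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def longest_open_frame(seq):
    """Return (frame, peptide) for the longest stretch without a stop codon."""
    best = ("", "")
    for name, protein in six_frame_translations(seq).items():
        for segment in protein.split("*"):
            if len(segment) > len(best[1]):
                best = (name, segment)
    return best


# Invented 18-base example; real consensus sequences are far longer.
print(longest_open_frame("ATGGCCAAATTTGGGTAA"))
```

Because a longer consensus gives each frame more chances to hit a stop codon by accident, the correct frame tends to stand out more clearly than it does in a short single-pass read, which is the advantage the paragraph above describes.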

Perspective

Although schistosome genome analysis is far from complete, and almost all aspects of functional genomics still remain unexplored, mining of the accumulated data already enables workers to undertake important and exciting research into basic schistosome biology.

Guilherme Oliveira
Centro de Pesquisas René Rachou, Fundação Oswaldo Cruz – FIOCRUZ, Belo Horizonte, Minas Gerais 30190-002, Brazil. e-mail: [email protected]

David A. Johnston
Dept of Zoology, The Natural History Museum, London, UK SW7 5BD.

References

1 Adams, M.D. et al. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651–1656

2 Johnston, D.A. et al. (1999) Genomics and the biology of parasites. BioEssays 21, 131–147

3 Williams, S.A. et al. (1999) Helminth genome analysis: the current status of the filarial and schistosome genome projects. Parasitology 118 (Suppl.), S19–S38

Box 1. Websites of interest to the schistosome research community

(a) dbEST [expressed sequence tag database at the National Center for Biotechnology Information (NCBI)]: http://www.ncbi.nlm.nih.gov/dbEST/index.html
EST summary by organism: http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
(b) http://verjo18.iq.usp.br/schisto/
(c) http://www.mct.gov.br/sobre/noticias/2001/25_04canexo.htm
(d) http://www.nhm.ac.uk/hosted_sites/schisto/index.html (Contains network administration, current project descriptions, resources, cluster analysis, protocols and links of interest)
(e) www.nhm.ac.uk/hosted_sites/schisto/TIP2001/

Database annotation searches
Entrez server at NCBI: http://www.ncbi.nlm.nih.gov/Entrez/
Sequence Retrieval System (SRS) at European Bioinformatics Institute (EBI): http://srs.ebi.ac.uk/
Other SRS servers: http://www.lionbio.co.uk/publicsrs.html

Database homology searches
Blast at NCBI: http://www.ncbi.nlm.nih.gov/BLAST/
Blast at EBI: http://www.ebi.ac.uk/blast2/
Blast tutorials: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Parasite genome blast at EBI: http://www.ebi.ac.uk/blast2/parasites.html

World Health Organization parasite genome and proteome website
http://www.ebi.ac.uk/parasites/parasite-genome.html
Includes servers to search a range of parasite-specific sequence databases with your own sequence. Available searches include: parasite blast server; proteome keyword search; protein motif search; six-frame translation of DNA sequences motif search and parasite codon usage tables. (Standard databases are updated monthly.)

Schistosoma cluster analyses
S. mansoni and S. japonicum clusters at the Schistosome Genome Network website: http://www.nhm.ac.uk/hosted_sites/schisto/clusters/intro.html
S. mansoni gene index at The Institute for Genomic Research: http://www.TIGR.org/tdb/smgi/
S. mansoni clusters at University of Pennsylvania: http://www.cbil.upenn.edu/ParaDBs/Schistosoma_2/index.html

Newsgroups or mailing lists
Parasite genome newsgroup: http://www.jiscmail.ac.uk/lists/parasite-genome.html
Schistosoma newsgroup: http://www.bio.net/hypermail/schisto/

Repeat Masker Web server
http://ftp.genome.washington.edu:80/cgi-bin/RepeatMasker
Screens DNA sequences against libraries of simple and characterized repetitive elements (use the ‘Only mask simple repeats and low complexity DNA’ option to search for microsatellite sequences)

General web tools
See http://www.ebi.ac.uk/parasites/webtools.html for listing.

4 Oliveira, G.C. (2001) Schistosoma gene discovery project, an update. Trends Parasitol. 17, 108–109

5 Franco, G.R. et al. (2000) The Schistosoma gene discovery program: state of the art. Int. J. Parasitol. 30, 453–463

6 Curtis, J. and Minchella, D.J. (2000) Schistosome population genetics structure: when clumping worms is not just splitting hairs. Parasitol. Today 16, 68–71

7 Durand, P. et al. (2000) Isolation of microsatellite markers in the digenetic trematode Schistosoma mansoni from Guadeloupe island. Mol. Ecol. 9, 997–998

8 Goto, S. et al. (1997) Organizing and computing metabolic pathway data in terms of binary relations. Pac. Symp. Biocomput. 175–186

9 Yee, D.P. and Conklin, D. (1998) Automated clustering and assembly of large EST collections. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 203–211

10 Pennisi, E. (1999) Keeping genome databases clean and up to date. Science 286, 447–450

11 Boguski, M.S. (1998) Bioinformatics – a new era. Bioinformatics: A Trends Guide 5, 1–3

12 Picoult-Newberg, L. et al. (1999) Mining SNPs from EST databases. Genome Res. 9, 167–174

13 Evans, W.E. and Relling, M.V. (1999) Pharmacogenomics: translating functional genomics into rational therapeutics. Science 286, 487–491

14 Meira, W.S. et al. (1998) Protein, nucleotide characterization of an abundant Schistosoma mansoni transcript with no homologs in the databases. Mem. Inst. Oswaldo Cruz 93 (Suppl.) 1, 211–213

15 Marshall, E. (1999) Do-it-yourself gene watching. Science 286, 444–447

16 Ashton, P.D. et al. (2001) Linking proteome and genome: how to identify parasite proteins. Trends Parasitol. 17, 198–202

ParaSite

Size matters on the Web

ProMed Discussion List (http://www.fas.org/promed/)

Leishmaniasis and foxhounds in the USA: how is it transmitted and why only to foxhounds?

These questions were raised on the ProMed List and speculated about following a report in a local newspaper in Virginia, USA. After the death of ‘an avid fox hunter’, his pack of 15 foxhounds was distributed among his friends. One by one, the hounds succumbed to Leishmania infantum MON1, a strain endemic in the Mediterranean area. Later, their offspring also began to show symptoms.

Peter Schantz (Centers for Disease Control and Prevention, Atlanta, GA, USA) reported that although native sandflies could have picked up the parasite, it was now believed to be transmitted from dog to dog. The moderator also thought this was likely because other breeds of dog living nearby were not infected and were liable to be exposed to the same vector. His theory was that some foxhounds could have been sent to the Middle East, perhaps to a dog show, where they became infected. Foxhounds live in packs and are often exchanged between hunts so: ‘There are perfect opportunities for the spread of the disease…(if it is transmitted) sexually and congenitally as well as by sandfly bite.’ [see ParaSite (2000) Parasitol. Today 16, 371–372 for more information]. Bruce Akey, a vet from the Virginia Department of Agriculture, added that 14 000 registered foxhounds nationwide have been tested by the Centers for Disease Control and Prevention (CDC) immunofluorescence assay. At least one dog in 69 different kennels across 21 states and two Canadian provinces was sero-positive and all culture isolates were L. infantum MON1. The dogs often sustain superficial lacerations during hunting and from fighting, which commonly occurs within a pack to establish a pecking order. Perhaps foxhounds, similar to some humans and mice, are also genetically immunodeficient?

Mosquito Discussion Group (mosquito-l@iastate.edu)

Funeral urns and the toxic properties of copper

Scott Campbell (Arthropod-Borne Disease Laboratory, Suffolk County, NY, USA) knew of a local cemetery pushing the use of bronze flower vases because mosquitoes will not breed in them. Was this claim true? Rick Duhrkopf (Baylor University, TX, USA) said he does not bother looking for larvae in bronze containers because he has never found any, and several others explained that a small amount of copper is toxic enough to prevent larvae surviving beyond the first few instars [Romi, R. et al. (2000) J. Med. Entomol. 37, 281–285].

Tom Iwanejko (Arthropod-Borne Disease Laboratory, Suffolk County, NY, USA) said that in the aquarium trade, copper salts are used as an anti-parasite treatment, with a warning to remove all invertebrates. Dominick Ninivaggi (Arthropod-Borne Disease Laboratory, Suffolk County, NY, USA) warned that in some states putting copper in a vase might be considered as an unregistered pesticide. However, Cam Lay (Clemson University, SC, USA), speaking for the sovereign state of South Carolina, hoped his fellow pesticide regulators had more significant things to worry about: ‘You can apply trimethyl mole cricket death [an insecticide]…to your lawn whether or not mole crickets are present…and if it kills the chinch bugs that’s OK, even if they’re not on the label.’

Mosquito abundance

Martina Schäfer (Uppsala University, Sweden) wanted to compare numbers. Last summer, the record in central Sweden was ~56 500 females (mostly Aedes sticticus) caught in one night in a CDC light trap with dry ice. Larry Hribar’s collections in Florida Keys ranged from 0 to ~55 000 mosquitoes in a single night. In 1999, the top of the list was Paul Reiter (CDC, San Juan, PR, USA), who trapped 267 600 Aedes taeniorhynchus in Grand Cayman. In an answer to another question, Rick Duhrkopf (Baylor University, TX, USA) replied that Texas has the highest number of mosquito species (85) (a current list can be found at http://www.texasmosquito.org/Checklist.html), although Gary McCallister (Mesa State, CO, USA) suggested that the figure should be corrected for square miles…and size!

Is the mosquito the official state bird of Minnesota? ‘No,’ said Carlos Andrade (Unicamp, SP, Brazil), ‘it’s the official state bird of Michigan – the big Aedes vexans’. Dennis Wallette (East Baton Rouge Mosquito Abatement, LA, USA) was scornful: ‘Aedes vexans big? … our Psorophora ciliata or Psorophora howardii … are often referred to as “gallinippers” because they take a gallon of blood.’ Dominick Ninivaggi was told that in Texas, mosquitoes cannot enter through windows smaller than 18 inches square, but Duhrkopf informed him that this is only true in summer when adults are fully grown and that, ‘the wintering adults are small enough to get in which is why so many Texans are armed’. Rob Anderson (Simon Fraser University, Canada) agreed that Ps. ciliata is the largest haematophagous mosquito anywhere. When doing media interviews he found it