
Automated analysis of ARISA data using ADAPT system

Robert Schmieder1,2, Matthew Haynes3, Elizabeth Dinsdale3, Forest Rohwer3, and Robert Edwards1,3,4§

1Department of Computer Science, San Diego State University, San Diego, CA, USA

2Computational Science Research Center, San Diego State University, San Diego, CA, USA

3Department of Biology, San Diego State University, San Diego, CA, USA

4Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA

§Corresponding author

Email addresses:

RS: [email protected]

MH: [email protected]

ED: [email protected]

FR: [email protected]

RE: [email protected]


Abstract

Background

A goal of microbial community profiling projects is to understand the influence of

environmental changes on microbial communities. Here, we present a computational

system consisting of the database ADAPTdb and the program ADAPT for the analysis of

Automated Ribosomal Intergenic Spacer Analysis (ARISA) data sets. ARISA is a method

for analyzing the composition of microbial communities that is both faster and cheaper

than other community profiling techniques. ARISA relies on the analysis of intergenic

regions called Internal Transcribed Spacers (ITS), which are located between the 16S and

23S rRNA genes. The analysis of the data produced by ARISA requires a reliable

resource of known ITS regions and the bioinformatics tools to enable researchers to

automatically analyze their data. ARISA data analysis steps include profile filtering, data

transformation, database searching, value calculations, and sample comparison.

Results

16S-ITS-23S regions along with information about the organisms that the sequences

came from are stored and maintained in the database ADAPTdb. The data in ADAPTdb

are retrieved from different data resources, including the SEED and NCBI sequence

databases. The program ADAPT automates the manual data analysis process and

characterizes ARISA data sets using the data in ADAPTdb. Additional

information associated with each 16S-ITS-23S region in the ADAPTdb database is used

by ADAPT for pathogenic and autotrophic/heterotrophic comparisons of organisms

among different ARISA samples.


Conclusions

The program is publicly available through a user-friendly web interface, which allows

onsite analysis of ARISA data sets and automatic computation of the graphical and

tabular output. The interactive web interface facilitates navigation through the outputs

and export functionality for subsequent analysis, and is available at

http://edwards.sdsu.edu/adapt/.

Background

There are growing numbers of researchers interested in exploring the composition and

function of complete microbial community samples. Cloning and sequencing individual

genes such as the 16S rRNA gene is relatively time-consuming and low throughput.

Random community genomics and metagenomics are high throughput but are more

expensive. Fingerprinting methods such as Automated Ribosomal Intergenic Spacer

Analysis (ARISA, Figure 1A) [1] provide a cheap and convenient method to visualize the

temporal and spatial changes occurring in microbial communities. ARISA analyzes the

length of the intergenic spacer region between 16S and 23S rRNA genes (called ITS

region) present in almost all Bacteria and Archaea (Figure 1B). ARISA has been used

successfully to analyze several bacterial communities, including samples from freshwater

[1, 2], marine [3-9], and terrestrial environments [10-12]. It has proven valuable in

clinical studies to detect and identify microbial species [13-15], and to differentiate

between closely related bacterial species [16, 17].

Here, we present ADAPT, a system designed for ARISA data analysis and the only

online resource capable of highlighting differences in community structure based on the

metadata associated with the organisms. The system provides a regularly updated

database of annotated 16S-ITS-23S regions (ADAPTdb) and a web-based program


(ADAPT) to automatically characterize the taxonomic composition of environmental

samples based on ARISA data sets mapped to the database. The taxonomic profiles

computed by ADAPT allow the user to compare samples to reveal the differences in

taxonomic composition of the underlying communities. Additionally, the system enables

the classification of the source organisms predicted by the 16S-ITS-23S regions into

autotrophic or heterotrophic types, and the prediction of whether the source organisms

are pathogenic.

Software design and computational platform

The software system developed in this project consists of the ARISA data analysis

program (ADAPT) and the ADAPTdb database of annotated 16S-ITS-23S regions. The

system was designed in a two-tier architecture, consisting of a bottom-tier (database

backend) and a top-tier (analytical frontend). The two-tier approach was chosen to achieve

a less complex architecture and to allow rapid development. The bottom-tier provides

mechanisms for data storage and retrieval and consists of a relational database and server

(MySQL; Sun Microsystems, Santa Clara, CA) that provides fast access to the data. The

top-tier is responsible for the interaction with users. It receives input data, interacts with

the bottom-tier and presents the results via a user-friendly interface. In the

ADAPTdb/ADAPT system, the frontend layer is split into a client-server

architecture, with the ADAPTdb scripts on the user client and the ADAPT web interface on

the web server.

The frontend was implemented in Perl 5.8 using the following modules: Perl Database

Interface (DBI) module and its MySQL-specific DBI driver module DBD::mysql,

Common Gateway Interface (CGI) module, Graphics Draw (GD) module, Applied


Biosystems, Inc. Format (ABIF) module, and Algorithm::CurveFit module. The DBI and

DBD modules are used to communicate queries to the MySQL database and fetch

responses. The CGI module is used to generate the HTML content, and to input and

output data from and to the web interface. The GD module is used to generate the images

of the result charts. The ABIF module is used to read and parse the ABIF files, and lastly,

the Algorithm::CurveFit module is used to predict the fragment length of fragments

larger than the size standards.

The ADAPT web interface is currently running on a PC server with Fedora Linux using

an Apache HTTP server to support the web services. The web interface provides a high

level of compatibility with heterogeneous computing environments.

Cron jobs running on the server allow automatic updates by calling the update functions

of the ADAPTdb frontend layer to retrieve new data from the external data resources as

described below.

ADAPTdb data collection

The generation of the ADAPTdb database was divided into two main steps. In the first

step, different external data resources were mined to identify valid 16S-ITS-23S regions

along with information about their source organisms. For this step, we used the

NCBI Genome database [18] and the SEED database [19] as data resources. In the

second step, the 16S-ITS-23S regions and organism metadata were saved persistently in

the MySQL database.

Data from NCBI Genome database

The NCBI Genome database contained sequence and organism data from 1,211 bacterial

and 76 archaeal species (as of May 30, 2009) [18, 20]. Data was retrieved from NCBI as


GenBank format (GBK) files using ESearch and EFetch from the Entrez Programming

Utilities (eUtils) [21, 22]. After retrieving the GBK files, each file was examined for 16S

and 23S rRNA annotations using regular expressions for "16S ribosomal RNA", "23S

ribosomal RNA", "16S rRNA", "23S rRNA", and their variants. Files that contained 16S

and 23S rRNA annotations were then parsed to find the feature type "rRNA" and checked

for descriptions containing "rRNA" or "ribosomal RNA" (usually in the "/product="

definition). The descriptions that contained “16S” represented 16S regions and those that

contained “23S” represented 23S regions. These regions were validated as described

below.
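
The annotation check described above can be sketched in Python as follows. This is a minimal illustration and not the ADAPTdb code, which is written in Perl; the function name classify_rrna and the exact patterns are assumptions, since the regular expressions actually used are not listed here.

```python
import re

# Assumed patterns based on the description in the text; the exact
# expressions used by ADAPTdb are not given.
RRNA_RE = re.compile(r"rRNA|ribosomal RNA", re.IGNORECASE)
S16_RE = re.compile(r"\b16S\b", re.IGNORECASE)
S23_RE = re.compile(r"\b23S\b", re.IGNORECASE)


def classify_rrna(feature_type, product):
    """Return '16S', '23S', or None for one annotated GenBank feature.

    feature_type: the feature key (e.g. "rRNA", "CDS").
    product: the free-text description, usually the /product= qualifier.
    """
    # Only features of type rRNA with an rRNA-like description qualify.
    if feature_type != "rRNA" or not RRNA_RE.search(product):
        return None
    if S16_RE.search(product):
        return "16S"
    if S23_RE.search(product):
        return "23S"
    return None
```

Checking the feature type first keeps protein-coding genes whose description merely mentions an rRNA (for example an rRNA methyltransferase) out of the result.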

Data from the SEED database

The SEED database contained 990 bacterial and 64 archaeal completely sequenced and

annotated genomes (as of May 30, 2009) [23]. Simple Object Access Protocol (SOAP)

web services were used to retrieve the annotation and sequence data from the SEED

database (Edwards et al., in preparation). Retrieved data were parsed to find valid 16S-ITS-23S

regions using a different algorithm from the one used for the data retrieved from NCBI.

First, all RNAs in the genome data were recovered using the rnas_of() SOAP remote

procedure call (RPC). Annotations containing "SSU rRNA" and "LSU rRNA" were

extracted by using the function_of() RPC. Here, “SSU rRNA” regions were 16S regions

and “LSU rRNA” regions were 23S regions.

Find valid 16S-ITS-23S regions

Finding valid 16S-ITS-23S regions among the data retrieved from remote sites required

additional steps. First, all the 16S and 23S rRNA gene regions that lay on the same contig

were identified. Overlapping regions on the same strand (16S with 16S, 23S with 23S,


16S with 23S) were discarded since these likely resulted from incorrect annotations.

Valid regions on the same contig met the following criteria: (a) the 16S region lies before the 23S region on

the forward strand, or (b) the 16S region lies after the 23S region on the reverse strand; (c) the 16S and 23S regions are

each at least 300 bp long; and (d) the distance between the 16S and 23S regions is less than 3,000 bp.

This process was repeated for every 16S gene. Only ITS regions shorter than

3,000 bp were considered because bacterial and archaeal

genomes are unlikely to contain long non-coding sequences, and such long regions

are probably due to incorrect or missing annotations.
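
As an illustration of these criteria, the pair check can be sketched in Python as below. This is a sketch under stated assumptions, not the ADAPTdb implementation: the Gene class, the its_length function, and the 1-based inclusive coordinates are hypothetical, and the sketch assumes that both genes of a pair lie on the same strand.

```python
from dataclasses import dataclass

MIN_GENE_LEN = 300   # bp, criterion (c)
MAX_ITS_LEN = 3000   # bp, criterion (d): spacer must be shorter than this


@dataclass(frozen=True)
class Gene:
    kind: str     # "16S" or "23S"
    contig: str
    strand: str   # "+" (forward) or "-" (reverse)
    start: int    # 1-based, inclusive, with start <= stop
    stop: int

    @property
    def length(self):
        return self.stop - self.start + 1


def its_length(g16, g23):
    """Return the spacer length in bp if the pair is valid, else None."""
    # Assumption: a valid pair shares contig and strand.
    if g16.contig != g23.contig or g16.strand != g23.strand:
        return None
    if g16.length < MIN_GENE_LEN or g23.length < MIN_GENE_LEN:
        return None
    if g16.strand == "+":
        gap = g23.start - g16.stop - 1   # (a) 16S before 23S
    else:
        gap = g16.start - g23.stop - 1   # (b) 16S after 23S
    # A negative gap means overlapping genes or the wrong gene order.
    if gap < 0 or gap >= MAX_ITS_LEN:
        return None
    return gap
```

Running its_length for every 16S gene against the 23S genes on the same contig mirrors the repetition described in the text.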

Metadata retrieval

If the entry of the external data resource contained at least one valid 16S-ITS-23S region,

additional information was extracted for this entry, such as the organism name,

taxonomy, NCBI taxonomy identifier (taxid), unique data resource accession number,

and 16S and 23S rRNA gene annotation information (contig, strand, start and stop

position). To retrieve the pathogenesis information for each organism entry, we used the

Lproks.txt file from the NCBI FTP server [24]. This file contained information about the

pathogenesis of Bacteria whose genomes were completely sequenced and in the NCBI

Genome database. If available, additional information about pathogenic hosts or targets

was extracted from this file.

The information about potential pathogenesis of the organisms in the ADAPTdb database

was available from constantly updated data resources, such as the Lproks file. However,

the autotrophic/heterotrophic classification was not available online in such a file. A

system for trophic (autotrophic and heterotrophic) level classification was developed

based on the phyla of the organisms in the database. All organisms stemming from the


same phylum were defined as having the same trophic type (Table 1). Proteobacteria, for

example, can either grow as heterotrophs or facultative autotrophs depending on the

environmental conditions, but they were classified as heterotrophs in this work to allow a

unique classification for each phylum.

Data analysis using ADAPT

ADAPT is a web-based program for the analysis of ARISA data sets. The program uses a

fingerprint database (ADAPTdb) to predict the taxonomic composition of the underlying

microbial communities and to perform trophic and pathogenic comparisons of the

samples from which the ARISA data were obtained. The results of the analysis are

displayed in the form of tables and charts.

Input data

Currently, ADAPT is able to parse different file formats for data input, including raw

ABI data provided in .fsa or .ab1 files, or text files containing the extracted ARISA

profile data (usually .txt or .tsv files). It is possible to select more than one input file to

analyze multiple data sets simultaneously. The data parsing of raw ABI files requires

more steps than the parsing of text files. The additional steps include baselining,

smoothing, peak identification, and size calibration. The baselining step is used to ensure

the data has a linear baseline parallel to the horizontal axis. This is required for the

fluorescence intensity threshold filtering of the data, applied later. The smoothing step

removes local noise from the trace file peaks, enabling easier peak identification. The

previous steps and the peak identification are performed for both the sample and the size

standards extracted from the ABI file. After the extraction of the peak positions, the size


standards are used to calibrate the sample data and calculate the fragment length for each

peak.
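
The size calibration step can be illustrated with a short Python sketch. The calibration function used by ADAPT is not specified here, so the sketch swaps in plain piecewise-linear interpolation between neighbouring size standards, with linear extrapolation beyond the ladder standing in for the curve fit that ADAPT applies to fragments larger than the size standards. The function name calibrate and its arguments are hypothetical.

```python
from bisect import bisect_left


def calibrate(scan, standards):
    """Convert a peak's scan position to a fragment length in bp.

    standards: list of (scan_position, known_length_bp) pairs for the
    size standard peaks, sorted by scan position (at least two pairs).
    """
    positions = [pos for pos, _ in standards]
    # Pick the pair of neighbouring standards that brackets the peak.
    # Peaks outside the ladder reuse the first or last pair, which
    # amounts to linear extrapolation (an assumption of this sketch).
    i = bisect_left(positions, scan)
    i = min(max(i, 1), len(standards) - 1)
    (x0, y0), (x1, y1) = standards[i - 1], standards[i]
    return y0 + (scan - x0) * (y1 - y0) / (x1 - x0)
```

Because the size standards are not evenly spaced, interpolating between the two nearest standards avoids forcing a single straight line through the whole ladder.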

These steps result in an ARISA profile for the sample. Alternatively, the user may supply

their own profile, or one that was saved from an earlier analysis. The strategy for

analyzing these ARISA profiles with ADAPT can be divided into three steps: (1) profile

filtering of input trace files, (2) database search, and (3) calculating the outputs.

Additionally, ADAPT is able to perform multiple sample analysis in the third step if

more than one sample is provided in the ARISA data sets.

(1) Profile filtering

The profile filtering removes noise and uninformative peaks using user-specified

threshold parameters. These parameters include the fragment length range (to filter out

peaks that would give fragments that are too short or too long) and the fluorescence

intensity or peak-height threshold (to exclude background fluorescence with an intensity

too close to the baseline). The fluorescence intensity threshold is automatically calculated

or set to a fixed value. If the user combines both options, the higher value will be used as

threshold.
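
A minimal Python sketch of this filtering step is given below. The function and parameter names are hypothetical, and because the way ADAPT derives the automatic threshold is not described here, the sketch simply takes that value as an input.

```python
def filter_profile(peaks, min_len, max_len,
                   fixed_threshold=None, auto_threshold=None):
    """Keep peaks inside the length range and at or above the threshold.

    peaks: list of (fragment_length_bp, peak_height) tuples.
    fixed_threshold / auto_threshold: optional intensity thresholds;
    when both are given, the higher one is applied.
    """
    given = [t for t in (fixed_threshold, auto_threshold) if t is not None]
    threshold = max(given) if given else 0
    return [(length, height) for length, height in peaks
            if min_len <= length <= max_len and height >= threshold]
```

For example, with a length range of 300 to 2,000 bp, a fixed threshold of 50, and an automatic threshold of 80, a 400 bp peak of height 60 is discarded because the higher threshold applies.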

(2) Database search

After filtering the profiles, ADAPT performs a database search using the fragment

lengths from the ARISA profile and the user-provided primer set. The primer set is used

to calculate the expected fragment length from the database entries. The fragment lengths

of the ARISA data are then mapped to the fragment lengths of the database entries to find

similar regions.
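
One way to derive the expected fragment length of a database entry from a primer set is sketched below in Python. This is an assumption-laden illustration rather than ADAPT's code: it requires exact matches (allowing only the IUPAC degeneracy written into the primers), takes the first forward match and the first downstream reverse match, and expects the region as an uppercase DNA string. All function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regular expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def expected_fragment_length(region, forward, reverse):
    """Length of the PCR product on a 16S-ITS-23S region.

    Returns None if either primer has no exact match. The reverse primer
    binds the opposite strand, so its reverse complement is searched on
    the given strand, downstream of the forward primer site.
    """
    fwd = primer_regex(forward).search(region)
    if fwd is None:
        return None
    rev = primer_regex(reverse_complement(reverse)).search(region, fwd.end())
    if rev is None:
        return None
    return rev.end() - fwd.start()
```

The returned length spans both primer sites, which is the length of the amplified fragment rather than of the spacer alone.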


Differences in the fragment analysis machines and methods used to generate the ARISA

profiles can cause variability. Since the DNA in the ARISA samples is separated by gel

electrophoresis, each band is a distribution across the gel. Moreover, the size standards

are not evenly distributed through the length of the electropherogram. Therefore, the

length of the fragment associated with each peak is a distribution of likely lengths and the

length of every fragment cannot be determined with 100% accuracy. Furthermore, two

fragments of identical length could erroneously result in peaks representing two different

lengths.

ADAPT uses a binning strategy to address this type of variability, and to compensate for

inexact matches during the mapping to database entries. The user is required to set a bin

value b. For an ARISA fragment of length l, all database entries within a length window

of l ± b are grouped into the same bin. Therefore, a bin value of 0 would only allow exact

matches. Since the inaccuracy of gel electrophoresis increases as the fragment length

increases (largely from diffusive band spreading) [6], different bin values should be used

for different fragment lengths. To reduce the complexity, users are only able to set bin

values for certain fragment length intervals of < 600, from 600 to 899, and > 899 bp.

After setting the bin values for the length intervals, the database is queried for each input

length in its bin. A length of 400 bp and a bin value of three, for example, would query

the database for fragments within the length window from 397 bp to 403 bp and use the

values found for further calculations.
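
The binning lookup can be sketched in Python as follows, using the window l ± b and the three length intervals given above. The function names and the dictionary standing in for the database are hypothetical; ADAPT itself queries ADAPTdb through MySQL.

```python
def bin_value_for(length, bins):
    """Pick the bin value for a fragment length.

    bins: (short, medium, long) bin values for the intervals
    < 600 bp, 600 to 899 bp, and > 899 bp.
    """
    short, medium, long_ = bins
    if length < 600:
        return short
    if length <= 899:
        return medium
    return long_


def length_window(length, bins):
    """Return the inclusive window (l - b, l + b) for one input length."""
    b = bin_value_for(length, bins)
    return length - b, length + b


def matching_entries(length, bins, db_lengths):
    """Return the ids of database entries whose length falls in the window.

    db_lengths: dict mapping entry id -> expected fragment length in bp.
    """
    low, high = length_window(length, bins)
    return sorted(entry for entry, db_len in db_lengths.items()
                  if low <= db_len <= high)
```

With a bin value of three for the shortest interval, length_window(400, (3, 5, 7)) gives the window from 397 bp to 403 bp used in the example above, and a bin value of 0 reduces the window to exact matches.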

Additionally, ADAPT allows the use of filters before mapping the input fragment length

against the database entries. The user specifies which database entries should be used


based on taxonomy and origin of the data (external data resource such as SEED and

NCBI).

(3) Output calculation

The ADAPTdb database contains taxonomic information for the organisms associated

with each 16S-ITS-23S region, which is used to characterize the taxonomic composition

of the original microbial sample and to calculate autotrophic/heterotrophic and

pathogenic/non-pathogenic fractions. The fraction calculations are based on the

taxonomic profiles and fluorescence intensity (peak height) information of the input

ARISA data sets. The fluorescence intensity indicates the number of fragments with the

same length, which is interpreted as the number of organisms with a fragment of this

particular length. The intensity is used to weight each fragment length in the taxonomic

profile for a quantitative interpretation of the ARISA profiles.

Many genomes contain multiple 16S-ITS-23S regions (Table 2). For example,

Escherichia coli K12 contains seven regions. A filter has been added to restrict the data

analysis output to those cases where every 16S-ITS-23S region in the genome is found in

the sample. Identifying species from all of their 16S-ITS-23S regions is more accurate than

relying on a single region.
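
This filter amounts to a subset test, sketched here in Python. The function name and the data structures are hypothetical and only illustrate the idea.

```python
def organisms_fully_matched(regions_by_organism, matched_regions):
    """Return organisms whose every 16S-ITS-23S region was matched.

    regions_by_organism: dict mapping organism -> list of its region ids.
    matched_regions: iterable of region ids matched by the sample's peaks.
    """
    matched = set(matched_regions)
    return sorted(organism
                  for organism, regions in regions_by_organism.items()
                  if regions and set(regions) <= matched)
```

An organism with seven regions is reported only if all seven region ids appear among the matches.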

ADAPT implements an average fraction calculation to perform autotrophic/heterotrophic

and pathogenic/non-pathogenic comparisons. The fractions are calculated in the same

manner for both comparisons. For each fragment length in an ARISA profile, the

fluorescence intensity is represented as a percentage of the total fluorescence intensities

from the whole sample. The number of database entries that match each fragment length

(in its bin) from each trophic type is normalized to the total number of database entries


that match that fragment length. The product of this fraction and the percentage

fluorescence intensity defines the proportion of each trophic type for each fragment

length in the ARISA profile, and the sum of all these fractions describes the overall

community.
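
The average fraction calculation can be written out as a short Python sketch. The function name and input layout are hypothetical, and the sketch assumes that peaks without any database match contribute nothing, so the fractions need not sum to one.

```python
from collections import Counter, defaultdict


def type_fractions(peaks, matches):
    """Compute the intensity-weighted fraction of each type in one sample.

    peaks: list of (fragment_length, peak_height) tuples.
    matches: dict mapping fragment_length -> list of type labels, one
        label per database entry matching that length (in its bin).
    """
    total = sum(height for _, height in peaks)
    fractions = defaultdict(float)
    for length, height in peaks:
        labels = matches.get(length, [])
        if not labels:
            continue  # assumption: unmatched peaks are skipped
        share = height / total  # peak intensity as a share of the sample
        for label, count in Counter(labels).items():
            # Share of the matching entries with this label, weighted
            # by the peak's share of the total intensity.
            fractions[label] += share * count / len(labels)
    return dict(fractions)
```

For two peaks with heights 300 and 100, where the first matches three heterotrophic and one autotrophic entry and the second matches one autotrophic entry, the heterotrophic fraction is 0.75 × 3/4 = 0.5625 and the autotrophic fraction is 0.75 × 1/4 + 0.25 × 1 = 0.4375. The same function serves for the pathogenic/non-pathogenic comparison by supplying pathogenicity labels instead.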

Results

ADAPT is publicly available through a user-friendly web interface (Figure 2). The

interactive web interface facilitates navigation through the results and allows export of

the results for subsequent offline analysis. The web interface also displays information

about the database content. The input page of ADAPT provides a mechanism to import

ARISA data sets and to set the parameters for the data analysis. A detailed user help

guide is provided for each section of the input page. After the user has imported the

ARISA data sets, ADAPT displays the results on the output page that were calculated

based on the user settings. The output page provides access to the results of the analysis

in the form of tables and charts, each of which can be exported into different

text (CSV and TSV) and image formats (GIF, JPG, and PNG). CSV and TSV files can be

easily parsed by other applications (such as Microsoft Excel ®; Microsoft, Redmond,

WA) and therefore are ideal for exchanging the results between ADAPT and other

applications for further studies. The charts provide a fast overview of the calculated

values. The tables show detailed information about the results in an abbreviated (grouped

results for each sample in the ARISA data set) or detailed version (results for each single

peak), or for different filters that were applied. The entries in most tables are linked to

additional information, such as the sequence of the 16S-ITS-23S regions or the

organism's taxonomy in the NCBI Taxonomy browser. Additional information from the


results that is not shown on the charts (but given in the tables) is displayed by ADAPT

in so-called “tool tips” that appear when the user's cursor is over the chart. Furthermore,

the ADAPT web frontend shows information about the ADAPTdb database, which is

used for the ARISA data analysis.

The current version of ADAPTdb (as of May 30, 2009) contains 4,069 16S-ITS-23S

entries and 953 organism entries retrieved from either the NCBI Genome database or the

SEED database. The organism entries are grouped into two domains, 21 phyla, 345

genera, and 639 species. Of the 953 organism entries, 902 (94.65%) are classified as

heterotrophic, 45 (4.72%) as autotrophic, and six (0.63%) are unknown (Figure 3A). For

the pathogenicity, 374 (39.24%) of the organism entries are pathogenic, 380 (39.87%) are

non-pathogenic, and the pathogenesis of 199 (20.88%) is unknown (Figure 3B).

The average number of ITS regions per organism varies among the phyla (Table 2). Nine

bacterial and two archaeal phyla have on average less than two ITS regions per organism

whereas the Proteobacteria have on average 4.21 ITS regions and the Firmicutes have on

average 7.21 ITS regions per organism. The overall average number of ITS regions per

organism in ADAPTdb is 4.27. Of course, the number of database entries in ADAPTdb

will increase with the automatic updates of new entries in the external data resources.

Application example

In the following example, the ADAPT system was applied to the analysis of ARISA data

sets to explore the relative number of autotrophic and heterotrophic, as well as pathogenic

and non-pathogenic bacteria on water overlying coral reefs. Water samples were

collected in 2005 at coral reefs near two of the Northern Line Islands: Kiritimati

(Christmas Island) and Tabuaeran (Fanning Island). The islands lie within the same


biogeographic region, but show a gradient of human disturbance. The population gradient

(estimated population from the 2005 census: Tabuaeran 2,539, Kiritimati 5,115) was used

to investigate the influence of human activities on the microbes living in the coral reefs.

Bacterial DNA was extracted from the filtrate captured from approximately 1 liter of

seawater as described in Dinsdale et al. [25]. One sample per island was processed using

ARISA as described in Fisher et al. [1] and analyzed using the ADAPT program.

The PCR primers used in this experiment and selected in the ADAPT input interface

were 1406f as forward primer (5'-TGYACACACCGCCCGT-3') and 23Sr as reverse

primer (5'-GGGTTBCCCCATTCRG-3'). The input format was a text file (available as

SOM) and therefore, a size standard was not selected in ADAPT. The fragment length

range was set from 300 bp to 2,000 bp. The bin values for the analysis were set to three (a

window of ±1 bp) for fragment lengths < 600 bp, and to five (a window of ±2 bp) for

fragment lengths ≥ 600 bp. The fluorescence intensity (peak-height) threshold was set to

a minimum threshold value of 50 fluorescence units greater than the baseline. The data

used for the analysis were filtered to include only organisms of the kingdom Bacteria that

were retrieved from either the NCBI Genome or the SEED database.

The analysis identified 24 peaks for the sample from Kiritimati matching 39 potential

organisms in the database and 22 peaks for the sample from Tabuaeran matching 25

potential organisms. There were five common fragment lengths in both samples (309,

415, 493, 528, and 732 bp) and four common potentially matching organisms

(Cyanothece sp. ATCC 51142, Lactobacillus gasseri, Streptococcus pneumoniae SP14-

BS69, and Syntrophus aciditrophicus SB). Additionally, the analysis identified a range of

pathogenic bacterial species present on the two islands (Table 3). The microbial


communities of the coral reefs near the two Northern Line Islands showed a high

percentage of heterotrophic organisms: 72.95% for Kiritimati and 70.32% for Tabuaeran

(Figure 4A). In addition, the Kiritimati coral reef water sample showed a higher

abundance of pathogens (53.02%) relative to non-pathogens (32.65%) than the

Tabuaeran water sample, with 20.13% pathogenic and 66.51% non-pathogenic species

(Figure 4B).

Discussion

ARISA is fast, cheap, reproducible, and accurate. Its sensitivity, in this case the ability

to detect a single-nucleotide difference in sequence length, is high compared to

other fingerprinting techniques. The ARISA technique allows the detection of different

fragment lengths in organisms displaying 99% 16S rRNA gene similarity, suggesting a

higher level of resolution than other techniques [5, 26]. ARISA has previously been used

to analyze the genetic structure of several bacterial communities including samples from

freshwater, bacterioplankton, and different soils. In addition to the community or

environmental studies, the technique was used for clinical applications. The ease of use

and the rapid detection and identification of pathogens present in a sample may be

promising for molecular diagnostics of infectious diseases. Rapid detection is particularly

useful because it would allow quick initiation of the appropriate antimicrobial therapy.

Although ARISA has many advantages over similar techniques, it also has drawbacks

that may limit its widespread application. Some of the drawbacks are unique to ARISA,

but most are common to all environmental sampling techniques. First, the most abundant

bacteria within a sample result in the strongest signal (highest peaks). Therefore, the

abundance of bacteria present in lesser numbers or near the detection limit may not be


measured accurately. Consequently, peaks representing fragments of organisms with low

abundance may be classified as noise or not even detected. Second, environmental

samples are random samples of the entire microbial community. Hence, taxonomic

profiles obtained for different samples from the same environment may show some

degree of variation. Third, PCR amplification is a stochastic process that is susceptible to

biases. PCR-based techniques such as ARISA may be biased by uneven amplification

efficiencies of the PCR fragments due to sequence heterogeneity [27, 28], differential

extractability of cells in communities [29], or differences in the amount of DNA per cell.

These errors are common to all methods that use PCR amplification, such as T-RFLP and

16S rRNA sequence analysis. As shown in Table 2, some Archaea and Bacteria contain

multiple ITS regions, which can either be exact copies or vary slightly within one

organism. Multiple exact copies of the same ITS region may increase the intensity of a

PCR fragment of a certain length. In the analysis of the ARISA data sets, this may distort

the results (e.g. higher peaks).

An alternative to ARISA for analyzing the taxonomic composition of environmental

samples is the 16S rRNA sequence analysis. Recently, DNA pyrosequencing techniques

have been used to sequence hypervariable regions of the 16S rRNA gene to investigate

the microbial diversity in different samples [30]. Pyrosequencing overcomes the low

throughput problem of conventional capillary sequencing, but is expensive both for

sequencing and data analysis.

ARISA is more reproducible than conventional gel-based techniques like T-RFLP. For

example, the instrumental automation of ARISA guarantees its reproducibility among

different laboratories or operators. Additionally, ADAPT requires PCR primers and the


size standard in order to perform the analysis. This allows a higher level of comparability

between laboratories using different primer sets and size standards.

Previously established ITS region resources are no longer maintained (e.g. RISSC

collection [31] and IWoCS database [32]), which causes the information to be outdated

and/or inaccurate. The development of the ADAPTdb database was driven by the need

for a maintained ITS region data resource that can be used for the analysis of ARISA data

sets and provides metadata for additional analysis. The ADAPTdb database contains 16S-

ITS-23S regions and metadata of their source organisms, including taxonomy, trophic

classification, and pathogenicity. ADAPT uses the ADAPTdb database to automatically

analyze ARISA data sets.

The usability of ADAPT has been demonstrated in several practical applications.

Although all the samples showed more heterotrophic than autotrophic organisms, the

composition of the reference ITS region database used by ADAPT skews these results:

approximately 92% of the database entries are classified as heterotrophic while only 8%

are classified as autotrophic. Therefore, the results may show a higher proportion of

heterotrophs than the samples actually contain. However, in the case of the Line Islands

analysis, the ARISA analysis supported the previous metagenomic analyses that showed

proportionally higher numbers of heterotrophs on degraded reefs near Kiritimati and

Tabuaeran than on the healthy reef near Kingman [25]. The results also agree with the

previous results from Dinsdale et al. (2008) that showed there are more potential

pathogens at Kiritimati than on the other islands and that Kingman has the lowest

proportion of potential pathogens. In that study, the authors suggested that the increase of

human disturbance coincides with an increase of pathogenic species, some of which are


pathogenic for corals and fish. A higher number of pathogenic species may therefore

damage the coral health and reduce the number of small fish.

Fragment length binning, used to consider inexact matches during the mapping of input

data to the database, is necessary to account for variability in fragment lengths caused by

different sequencing machines and methods used to generate the ARISA profiles.

However, it is not readily apparent which bin values are appropriate. For example, prior studies

used different bin values in different fragment or ITS length intervals. Fuhrman and

Hewson (2004) used a bin value of three for lengths up to 500 bp and seven for lengths

longer than 500 bp [33]. Two years later, they used a bin value of three for lengths up to

700 bp, five for 700 bp to 1,000 bp and ten for lengths longer than 1,000 bp [6]. Hewson

et al. (2006) used bin values of three for lengths from 400 bp to 700 bp, five from 700 bp

to 1,000 bp and eleven for lengths longer than 1,000 bp in their studies [34]. Ruan et al.

(2006) suggested a binning strategy with variable bin values less than a maximal bin

value. The maximal bin values they used are three for lengths up to 600 bp, five for

lengths up to 698 bp and seven for lengths longer than 698 bp [35]. The ADAPT web

interface allows the user to set different bin values for length intervals of < 600 bp, from

600 bp to 899 bp, and > 899 bp. The standard bin values set as defaults in the web

interface are five (equivalent to window of ±2) for lengths smaller than 600 bp and seven

(equivalent to window of ±3) for lengths greater than or equal to 600 bp. For analysis

without length binning, the length window values can be set to 0 for each interval,

providing only exact length matches against the database entries of ADAPTdb.

Accurate identification of ITS regions depends on the correct annotation of the flanking 16S and 23S rRNA gene regions. Commonly, ITS regions are mapped to a database using the ITS length calculated from the stop of the 16S to the start of the 23S rRNA gene. This approach requires a conserved length of the flanking regions where the primers bind (inside the 16S and 23S rRNA genes) and accurate annotation of the ends of the genes. Using the fragment length of the PCR product (determined by the primers) instead avoids the errors that wrongly annotated gene positions introduce into the calculated length. Because ADAPT takes into account the PCR primer sets used to generate the ARISA data set, misannotations of the flanking rRNA regions do not change the outcome of the analysis.

The commonly used primer sets for ARISA [36, 37] might not allow amplification of all 16S-ITS-23S regions in ADAPTdb. The primer sets are designed to recognize the highly conserved sequences in the flanking regions of the 16S and 23S rRNA genes. The numbers of exact matches of each primer set to the 16S and 23S regions in the database were compared (Table 4). The primer set 1406f/23Sr showed the best result, with exact matches against 87% of the regions, followed by the ITSF/ITSReub primer set with 36% exact matches. We found no exact matches of the S-D-Bact-1522-b-S-20/L-D-Bact-132-a-A-18 primer set against the 16S-ITS-23S regions in the database. As expected, allowing mismatches in the primer sequences, as may occur during PCR amplification, increased the number of matching regions for all three primer sets (data not shown).
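The effect of allowing mismatches on the share of matching regions can be sketched as follows. This is a simplified Python illustration, not the procedure used to produce Table 4. The function names and toy sequences are hypothetical, mismatches are modeled as substitutions only, and the reverse primer is assumed to be supplied as its binding site on the same strand as the region.

```python
def primer_binds(sequence, primer, max_mismatches=0):
    """True if the primer aligns somewhere in the sequence with at most max_mismatches substitutions."""
    n = len(primer)
    for i in range(len(sequence) - n + 1):
        mismatches = sum(1 for a, b in zip(sequence[i:i + n], primer) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def percent_matching(regions, forward, reverse_site, max_mismatches=0):
    """Percentage of regions in which both primer sites are found."""
    if not regions:
        return 0.0
    hits = sum(
        1
        for region in regions
        if primer_binds(region, forward, max_mismatches)
        and primer_binds(region, reverse_site, max_mismatches)
    )
    return 100.0 * hits / len(regions)


if __name__ == "__main__":
    # Made-up regions: the first matches exactly, the second has one substitution
    # in the forward site, the third matches neither primer.
    regions = ["AACGTTGGCCAA", "AACGATGGCCAA", "TTTTTTTTTTTT"]
    print(percent_matching(regions, "ACGT", "GGCC", max_mismatches=0))
    print(percent_matching(regions, "ACGT", "GGCC", max_mismatches=1))
```

In this toy data set, one of three regions matches exactly and two of three match when a single substitution is allowed, which mirrors the qualitative trend described above.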

Jensen et al. (1993) reported that around 85% of the species they analyzed produced more than one peak in the ARISA profile [38]. Most of the genomes contain one or two sets of 16S and 23S rRNA genes. We found that five of the bacterial genomes in the NCBI Genome database contain 12 copies of the 16S and 23S rRNA genes, and another five contain 13 copies. Only a few genomes were annotated with different numbers of 16S and 23S rRNA genes. Since the 16S and 23S rRNA gene products are subunits of the same enzyme, it is unlikely that the number of copies varies between the two genes; such differences are more likely due to incorrect or incomplete annotations. Because of interoperonic length variation, a single organism can contribute more than one peak to a sample. This can cause problems if the analysis does not account for multiple different ITS regions in one organism and identifies the different ITS regions as individual organisms. Therefore, using only completely sequenced microbial genomes for the analysis can significantly reduce this limitation of ARISA. All annotated 16S and 23S rRNA gene regions from complete genomes (including the ITS regions) in ADAPTdb were retrieved from the external databases and are used to compensate for multiple peaks from a single organism. The taxonomic composition of the ARISA sample is calculated in ADAPT based on the fragment lengths that are present in the complete genome of a specific organism.
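One way to account for multiple peaks per organism is to report an organism only when every fragment length expected from its complete genome is found among the sample peaks. The Python sketch below illustrates this idea under that assumption. It is not the ADAPT algorithm, and the function name, the organism labels and the lengths are hypothetical.

```python
def organisms_fully_matched(sample_lengths, expected, window=0):
    """Return organisms whose expected fragment lengths are all present in the sample.

    sample_lengths: fragment lengths (bp) observed as peaks in one ARISA sample.
    expected: dict mapping an organism name to the fragment lengths derived
              from all 16S-ITS-23S regions of its complete genome.
    window: +/- tolerance in bp used when comparing lengths (0 means exact).
    """
    def present(length):
        return any(abs(length - peak) <= window for peak in sample_lengths)

    return sorted(
        name
        for name, lengths in expected.items()
        if lengths and all(present(length) for length in lengths)
    )


if __name__ == "__main__":
    # Made-up reference data: organism_A has two ITS regions of different length.
    expected = {
        "organism_A": [512, 530],
        "organism_B": [512],
        "organism_C": [700, 745],
    }
    sample = [512, 531, 700]
    print(organisms_fully_matched(sample, expected))            # exact lengths only
    print(organisms_fully_matched(sample, expected, window=2))  # with a 2 bp window
```

With exact matching only organism_B is reported, because organism_A lacks a peak at 530 bp. With a 2 bp window the 531 bp peak accounts for the second region of organism_A, so both organisms are reported, while organism_C is never reported because no peak lies near 745 bp.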

Only organisms that are represented in the reference database can be detected in ARISA data sets. Therefore, it is important to provide as many regions as possible in the database used for the analysis. As the number of sequenced organisms is continuously increasing, it becomes possible to fill in the gaps of missing regions in ADAPTdb. Keeping ADAPTdb up to date with information on newly sequenced organisms requires regular updates. ADAPTdb is therefore equipped with an automated update function that performs weekly updates from data resources such as the NCBI Genome database and the SEED database. In the future, including data resources with ITS regions from organisms whose genomes are not fully sequenced may prove advantageous.


Conclusions

ADAPT provides the research community with a free, platform-independent tool that enables users to automatically analyze ARISA data sets via an easy-to-use web interface. Environmental samples from the Line Islands were analyzed using these tools to recapitulate the trophic differences seen in other studies and attributed to human impact. ADAPT and ADAPTdb enable rapid assessment of environmental communities by ARISA.

Availability and requirements

• Project name: ADAPT

• Project home page: http://edwards.sdsu.edu/adapt

• Operating system(s): Web service, platform independent

• Programming language: Perl

• Restrictions to use by non-academics: None

Authors' contributions

RS designed and implemented the database and web application, tested the program, and drafted the initial manuscript. ED collected the water samples. MH extracted the DNA, processed the samples to generate the ARISA data sets, and participated in the GUI component design. FR and RE conceived of the study and participated in its design and coordination. All authors read and approved the final version of the manuscript.

Acknowledgements

We thank Lutz Krause, Ramy K. Aziz, and Peter Salamon for helpful discussions. This work was supported by grants from the National Science Foundation (DBI 0850356, Advances in Bioinformatics), the Gordon and Betty Moore Foundation, and the Canadian Institute for Advanced Research.


References

1. Fisher MM, Triplett EW: Automated approach for ribosomal intergenic spacer analysis of microbial diversity and its application to freshwater bacterial communities. Appl Environ Microbiol 1999, 65(10):4630-4636.

2. Yannarell AC, Triplett EW: Geographic and environmental sources of variation in lake bacterial community composition. Appl Environ Microbiol 2005, 71(1):227-239.

3. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA: Coupling 16S-ITS rDNA clone libraries and automated ribosomal intergenic spacer analysis to show marine microbial diversity: development and application to a time series. Environ Microbiol 2005, 7(9):1466-1479.

4. Brown MV, Fuhrman JA: Marine bacterial microdiversity as revealed by internal transcribed spacer analysis. Aquat Microb Ecol 2005, 41(1):15-23.

5. Danovaro R, Luna GM, Dell'anno A, Pietrangeli B: Comparison of two fingerprinting techniques, terminal restriction fragment length polymorphism and automated ribosomal intergenic spacer analysis, for determination of bacterial diversity in aquatic environments. Appl Environ Microbiol 2006, 72(9):5982-5989.

6. Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S: Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci U S A 2006, 103(35):13104-13109.

7. Hewson I, Capone DG, Steele JA, Fuhrman JA: Influence of Amazon and Orinoco offshore surface water plumes on oligotrophic bacterioplankton diversity in the west tropical Atlantic. Aquat Microb Ecol 2006, 43(1):11-22.

8. Leuko S, Goh F, Allen MA, Burns BP, Walter MR, Neilan BA: Analysis of intergenic spacer region length polymorphisms to investigate the halophilic archaeal diversity of stromatolites and microbial mats. Extremophiles 2007, 11(1):203-210.

9. Fuhrman JA, Steele JA, Hewson I, Schwalbach MS, Brown MV, Green JL, Brown JH: A latitudinal diversity gradient in planktonic marine bacteria. Proc Natl Acad Sci U S A 2008, 105(22):7774-7778.

10. Ranjard L, Poly F, Lata JC, Mougel C, Thioulouse J, Nazaret S: Characterization of bacterial and fungal soil communities by automated ribosomal intergenic spacer analysis fingerprints: biological and methodological variability. Appl Environ Microbiol 2001, 67(10):4479-4487.

11. Kirk JL, Beaudette LA, Hart M, Moutoglis P, Klironomos JN, Lee H, Trevors JT: Methods of studying soil microbial diversity. J Microbiol Methods 2004, 58(2):169-188.


12. Jones CM, Thies JE: Soil microbial community analysis using two-dimensional polyacrylamide gel electrophoresis of the bacterial ribosomal internal transcribed spacer regions. J Microbiol Methods 2007, 69(2):256-267.

13. Baudart J, Lemarchand K, Brisabois A, Lebaron P: Diversity of Salmonella strains isolated from the aquatic environment as determined by serotyping and amplification of the ribosomal DNA spacer regions. Appl Environ Microbiol 2000, 66(4):1544-1552.

14. Clementino MM, de Filippis I, Nascimento CR, Branquinho R, Rocha CL, Martins OB: PCR analyses of tRNA intergenic spacer, 16S-23S internal transcribed spacer, and randomly amplified polymorphic DNA reveal inter- and intraspecific relationships of Enterobacter cloacae strains. J Clin Microbiol 2001, 39(11):3865-3870.

15. Gonzalez-Escalona N, Jaykus LA, DePaola A: Typing of Vibrio vulnificus strains by variability in their 16S-23S rRNA intergenic spacer regions. Foodborne Pathog Dis 2007, 4(3):327-337.

16. Boyer SL, Flechtner VR, Johansen JR: Is the 16S-23S rRNA internal transcribed spacer region a good tool for use in molecular systematics and population genetics? A case study in cyanobacteria. Mol Biol Evol 2001, 18(6):1057-1069.

17. Xu D, Cote JC: Phylogenetic relationships between Bacillus species and related genera inferred from comparison of 3' end 16S rDNA and 5' end 16S-23S ITS nucleotide sequences. Int J Syst Evol Microbiol 2003, 53(Pt 3):695-704.

18. NCBI Genome database [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome]

19. Overbeek R, Disz T, Stevens R: The SEED: a peer-to-peer environment for genome annotation. Commun ACM 2004, 47(11):46-51.

20. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009, 37(Database issue):D5-15.

21. Schuler GD, Epstein JA, Ohkawa H, Kans JA: Entrez: molecular biology database and retrieval system. Methods Enzymol 1996, 266:141-162.

22. Entrez Programming Utilities (eUtils) Homepage [http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]

23. The SEED Viewer [http://seed-viewer.theseed.org/]

24. Lproks.txt file with pathogenesis information [ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_0.txt]

25. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, Wegley L, Hatay M, Hall D, Brown E, Haynes M et al: Microbial ecology of four coral atolls in the Northern Line Islands. PLoS ONE 2008, 3(2):e1584.


26. Jaspers E, Overmann J: Ecological significance of microdiversity: identical 16S rRNA gene sequences can be found in bacteria with highly divergent genomes and ecophysiologies. Appl Environ Microbiol 2004, 70(8):4831-4839.

27. Suzuki M, Rappe MS, Giovannoni SJ: Kinetic bias in estimates of coastal picoplankton community structure obtained by measurements of small-subunit rRNA gene PCR amplicon length heterogeneity. Appl Environ Microbiol 1998, 64(11):4522-4529.

28. Polz MF, Cavanaugh CM: Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol 1998, 64(10):3724-3730.

29. Polz MF, Harbison C, Cavanaugh CM: Diversity and heterogeneity of epibiotic bacterial communities on the marine nematode Eubostrichus dianae. Appl Environ Microbiol 1999, 65(9):4271-4275.

30. Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A 2006, 103(32):12115-12120.

31. Garcia-Martinez J, Bescos I, Rodriguez-Sala JJ, Rodriguez-Valera F: RISSC: a novel database for ribosomal 16S-23S RNA genes spacer regions. Nucleic Acids Res 2001, 29(1):178-180.

32. D'Auria G, Pushker R, Rodriguez-Valera F: IWoCS: analyzing ribosomal intergenic transcribed spacers configuration and taxonomic relationships. Bioinformatics 2006, 22(5):527-531.

33. Hewson I, Fuhrman JA: Richness and diversity of bacterioplankton species along an estuarine gradient in Moreton Bay, Australia. Appl Environ Microbiol 2004, 70(6):3425-3433.

34. Hewson I, Fuhrman JA: Improved strategy for comparing microbial assemblage fingerprints. Microb Ecol 2006, 51(2):147-153.

35. Ruan Q, Steele JA, Schwalbach MS, Fuhrman JA, Sun F: A dynamic programming algorithm for binning microbial community profiles. Bioinformatics 2006, 22(12):1508-1514.

36. Cardinale M, Brusetti L, Quatrini P, Borin S, Puglia AM, Rizzi A, Zanardini E, Sorlini C, Corselli C, Daffonchio D: Comparison of different primer sets for use in automated ribosomal intergenic spacer analysis of complex bacterial communities. Appl Environ Microbiol 2004, 70(10):6147-6156.

37. Jones SE, Shade AL, McMahon KD, Kent AD: Comparison of primer sets for use in automated ribosomal intergenic spacer analysis of aquatic bacterial communities: an ecological perspective. Appl Environ Microbiol 2007, 73(2):659-662.

38. Jensen MA, Webster JA, Straus N: Rapid identification of bacteria on the basis of polymerase chain reaction-amplified ribosomal DNA spacer polymorphisms. Appl Environ Microbiol 1993, 59(4):945-952.


Figures

Figure 1 - Basic approach of ARISA

After extracting the DNA from the sample (A1), the ITS region between the 16S and 23S rRNA genes is amplified using PCR with fluorescently labeled primers (A2). The resulting PCR products vary in length, since the length of the ITS region can differ between species. The amplified DNA is then run on a sequencing machine, and the length of the ITS region is measured from its trace output (A3). The prokaryotic 16S, 23S and 5S rDNA are arranged in a tandem repetitive cluster (B). NTS stands for non-transcribed spacer, ETS for external transcribed spacer and ITS for internal transcribed spacer.

Figure 2 - ADAPT web interface

The interactive web interface facilitates data input and parameter setup as shown in the screenshot, as well as navigation through the output of the analysis results.

Figure 3 - Trophic (A) and pathogenic (B) composition of organism entries in the ADAPTdb database

Figure 4 - ADAPT chart output of the trophic (A) and pathogenic (B) fractions for each ARISA sample

B05_KIRB3_003 represents the samples taken near Kiritimati and A05_TABB1_001 the samples taken near Tabuaeran.


Tables

Table 1 - Trophic type classification for phyla in ADAPTdb

| Trophic type | Phyla |
|---|---|
| autotrophic | Aquificae, Cyanobacteria, Nitrospirae |
| heterotrophic | Acidobacteria, Actinobacteria, Bacteroidetes, Bacteroidetes/Chlorobi group, Chlamydiae, Chlamydiae/Verrucomicrobia group, Chlorobi, Chloroflexi, Crenarchaeota, Deinococcus-Thermus, Euryarchaeota, Firmicutes, Fusobacteria, Korarchaeota, Proteobacteria, Spirochaetes, Synergistetes, Tenericutes, Thermotogae, Verrucomicrobia |
| unknown | Deferribacteres, Dictyoglomi, Gemmatimonadetes, unknown |


Table 2 - Overview of the taxonomic composition of ADAPTdb organism entries

| Domain | Phylum | Organisms (%) | 16S-ITS-23S regions (%) | Regions per organism |
|---|---|---|---|---|
| Archaea | Crenarchaeota | 22 (2.31) | 24 (0.59) | 1.09 |
| Archaea | Euryarchaeota | 35 (3.67) | 93 (2.29) | 2.66 |
| Archaea | Korarchaeota | 1 (0.10) | 1 (0.02) | 1.00 |
| Bacteria | Actinobacteria | 85 (8.92) | 224 (5.51) | 2.64 |
| Bacteria | Bacteroidetes/Chlorobi group | 34 (3.56) | 106 (2.60) | 3.12 |
| Bacteria | Chlamydiae/Verrucomicrobia group | 17 (1.77) | 35 (0.85) | 2.06 |
| Bacteria | Cyanobacteria | 41 (4.30) | 97 (2.38) | 2.37 |
| Bacteria | Firmicutes | 199 (20.88) | 1435 (35.27) | 7.21 |
| Bacteria | Proteobacteria | 458 (48.06) | 1927 (47.36) | 4.21 |
| Bacteria | Tenericutes | 21 (2.20) | 51 (1.25) | 2.43 |
| Bacteria | Rare bacterial spp. | 40 (4.17) | 76 (1.84) | 1.70 |
| Total | | 953 | 4069 | 4.27 |


Table 3 - Potentially pathogenic species that match all the input fragment lengths of an organism

| Island name | Potentially pathogenic organisms identified in ARISA sample |
|---|---|
| Kiritimati | Bdellovibrio bacteriovorus HD100 |
| Kiritimati | Coxiella burnetii spp. |
| Kiritimati | Laribacter hongkongensis |
| Kiritimati | Microcystis aeruginosa NIES-843 |
| Kiritimati | Pseudomonas syringae pv. tomato str. DC3000 |
| Kiritimati | Renibacterium salmoninarum ATCC 33209 |
| Kiritimati | Shigella flexneri 5 str. 8401 |
| Kiritimati | Streptococcus pneumoniae SP14-BS69 |
| Kiritimati | Xanthomonas campestris pv. vesicatoria str. 85-10 |
| Kiritimati | Xylella fastidiosa spp. |
| Tabuaeran | Brucella abortus S19 |
| Tabuaeran | Brucella abortus bv. 1 str. 9-941 |
| Tabuaeran | Brucella ovis ATCC 25840 |
| Tabuaeran | Brucella suis 1330 |
| Tabuaeran | Granulibacter bethesdensis CGDNIH1 |
| Tabuaeran | Lawsonia intracellularis PHE/MN1-00 |
| Tabuaeran | Streptococcus pneumoniae SP14-BS69 |


Table 4 - Number of regions and organisms matching different primer sets that were used for the ARISA analysis

| Primer set | Regions matching primer set (%) | Regions not matching primer set (%) | Organisms matching primer set with all regions (%) | Organisms matching primer set with any regions (%) |
|---|---|---|---|---|
| 1406f/23Sr | 3441 (87) | 510 (13) | 745 (78) | 764 (80) |
| ITSF/ITSReub | 1433 (36) | 2518 (64) | 390 (41) | 401 (42) |
| S-D-Bact-1522-b-S-20/L-D-Bact-132-a-A-18 | 0 (0) | 3951 (100) | 0 (0) | 0 (0) |

The numbers are based on exact matches for the kingdom Bacteria. Allowing mismatches for S-D-Bact-1522-b-S-20/L-D-Bact-132-a-A-18 yielded regions matching the primer set.