8/17/2019 Chapter 14 Book Biomedical Technology and Devices Handbook
14 Theoretical Considerations for the Efficient Design of DNA Arrays
CONTENTS
14.1 Introduction
14.2 Surface Design
     Role of Interface Electrostatic Interactions • Sensitivity Enhancement • Multiplexed SNPs Detection
14.3 DNA Biosensors
     Presence of Short Subsequences in the Genomes • Correlation of Presence of Short Subsequences between Genomes
14.4 Conclusions
Acknowledgments
References
14.1 Introduction
The use of combinatorial or array-based detection (and synthesis) technologies has qualitatively changed
many areas of bioscience in the last several years. These technologies include DNA, protein, and combi-
natorial chemistry arrays. Of these, DNA arrays, designed to determine gene content and expression
levels in living cells, have shown the most potential. DNA arrays allow simultaneous, parallel measurement
of thousands of interactions between target strands and genome-derived probes. Two areas of concern
are the design and analysis of such experiments. Microarrays are rapidly producing enormous amounts
of raw data. The bioinformatics solutions to problems associated with the analysis of data on this scale
are a major current challenge. In addition, designing such experiments requires consideration of not only
the genomic information required to answer a given problem but also consideration of the chemistry
and physics of highly charged species near prepared surfaces.
On the medical side, DNA arrays may someday help us to better understand complex issues concerning
human health and disease. Among other things, they should help us separate out the effects of one's genes
vs. environment and life-style to help usher in the individualized molecular medicine of the future.
A current important practical application of DNA arrays is biosensors used to determine to which
organism a given DNA/RNA sample belongs. This use of microarrays is based on specific properties of
viral and microbial genomes and the ability of arrays to provide information regarding the presence/
absence of thousands of short subsequences in a given genome simultaneously.
Arnold Vainrub, University of Houston
Tong-Bin Li, University of Houston
Yuriy Fofanov, University of Houston
B. Montgomery Pettitt, University of Houston
© 2004 by CRC Press LLC
While DNA arrays will be an important tool for some time, it should be emphasized that DNA array
technology is still at an early stage of development. It is cluttered with heterogeneous technologies and
data formats as well as basic issues of signal to noise, fidelity, calibrations, and statistical significance that
are still being sorted out. Until these issues are resolved and standardized, it will not be possible to define
accurately the complete genetic regulatory network of even a well-studied prokaryotic cell system.
DNA arrays were introduced as a high throughput technology for performing hybridization assays
based on formation of a double helix to a surface-immobilized single-strand DNA probe according to
Watson-Crick pairing rules. In its current high-density format, in a single microarray experiment the
hybridization is performed with up to hundreds of thousands of different probes, producing a tremendous
volume of information on the assayed DNA sequences and their abundance in the tested target. Typically,
a DNA microarray hybridization experiment contains 10^7 to 10^10 DNA probe molecules of a sequence
to be tested, immobilized in a ~50-µm-diameter spot on a prepared glass surface, and thus may include
about 10^4 to 10^5 different probe spots per square centimeter. Usually, the probes are oligonucleotides
8 to 80 bases long, tethered by one end through a linker molecule to the surface. DNA microbeads are
similar to microarrays, but the probes are tethered to the surface of a micron-size glass bead (Brenner et al.,
2000), similar to peptide technology.
The use of DNA microarrays in clinical practice is a rapidly growing area. In the present work we will
concentrate our attention on two tasks: electrostatic effects in solution DNA hybridization at surfaces
and the ability of microarrays to serve as biosensors based on information of the presence or absence of
certain subsequences.
14.2 Surface Design
14.2.1 Role of Interface Electrostatic Interactions
Electrostatic effects in solution DNA hybridization (formation of the double helix by two complementary
DNA single strands) are well known (Saenger, 1984; Bloomfield, 1999). They appear simply because
DNA is a negatively charged polymeric ion, and thus an electrostatic repulsion occurs
between single strands (ssDNA) as well as between double helices (dsDNA). Each phosphate group (PO2–) of the outer
dsDNA backbone bears a single negative charge, and therefore a typical dsDNA B-helix is often modeled
as a 2-nm-diameter cylinder with a high negative surface charge of about six electron charges per
nanometer of length. The repulsion diminishes when added salt cations are present in solution and
partly shield the electrostatic interactions. Thus, the stability of dsDNA against dehybridization into
ssDNAs increases with the solution ionic strength; typically, the melting temperature of dsDNA increases
by 10 to 20°C as the added salt concentration grows tenfold. For instance, in human blood plasma,
concentrations of approximately 150 mM for Na+ and 2.5 mM for Ca2+ cations produce an electrostatic
screening length of about 1 nm and help make the chromosomal dsDNA stable.
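The screening length quoted for blood plasma can be checked against the standard Debye-length expression for an electrolyte. The sketch below is illustrative only: it assumes water at 25°C with a relative permittivity of 78.4 and chloride counterions for both cations, none of which is specified in the text, and the function names are hypothetical.

```python
import math

# Physical constants (SI units)
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19   # elementary charge, C
N_A = 6.02214076e23          # Avogadro constant, 1/mol


def ionic_strength(ions):
    """Ionic strength in mol/L from (concentration in mol/L, charge) pairs."""
    return 0.5 * sum(c * z ** 2 for c, z in ions)


def debye_length_nm(ionic_strength_molar, temperature=298.15, eps_r=78.4):
    """Debye screening length in nm for a given ionic strength in mol/L."""
    i_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    lam = math.sqrt(EPS0 * eps_r * K_B * temperature
                    / (2.0 * N_A * E_CHARGE ** 2 * i_si))
    return lam * 1e9


# Assumed composition: 150 mM Na+, 2.5 mM Ca2+, balanced by Cl- (an assumption).
plasma_like = [(0.150, +1), (0.0025, +2), (0.155, -1)]
screening_nm = debye_length_nm(ionic_strength(plasma_like))
print(f"Debye length: {screening_nm:.2f} nm")
```

Under these assumptions the result is somewhat below 1 nm, consistent with the "about 1 nm" estimate.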
In addition to the above-mentioned DNA-DNA repulsion, two new electrostatic interactions appear for
a DNA microarray, namely, the DNA-surface interaction and repulsion between the assayed nucleic acid
and the on-surface layer of DNA probes. The nucleic acid–surface interactions operate on surface-tethered
DNA probes, the assayed nucleic acid (which is a biological RNA or a cDNA prepared from it by reverse
transcription), and also the dsDNA formed as their hybrid. Our theoretical analysis of this DNA-surface
electrostatic interaction shows its possibly important role in on-surface hybridization thermodynamics (Vainrub
and Pettitt, 2000, 2003). Recent experiments (Heaton et al., 2001; Su et al., 2002) confirm the theory and
demonstrate complete control of hybridization and melting by an electric potential applied to the
oligonucleotide array on a gold film surface. Indeed, a negative surface potential of –300 mV induces the melting of
prehybridized dsDNA, whereas a positive potential promotes the hybridization of complementary DNA
targets (Heaton et al., 2001). The experiments can be simply understood (Vainrub and Pettitt, 2000, 2003):
electrostatic repulsion between the negatively charged surface and the ssDNA target
tends to melt the dsDNA and remove the target from the surface, whereas attraction to a positively charged
surface stabilizes the dsDNA. It is important that the probe DNA is tethered to the surface (typically through
a covalent bond and linker molecule), and thus the dsDNA cannot be displaced or diffuse from the negatively
charged surface; instead, it must release by melting a mobile target ssDNA that can drift from the surface to
decrease the electrostatic energy. Interestingly, the DNA-surface repulsion (attraction) occurs even for
noncharged dielectric (metallic) surfaces due to the known electrostatic induction phenomena (Vainrub
and Pettitt, 2000). Evidently, the DNA-surface electrostatics can be regulated by the surface charge and
material (dielectric or metallic) as well as almost canceled by using a long linker molecule and/or a high ionic
strength hybridization solution. In previous work (Vainrub and Pettitt, 2000, 2003) we considered in detail
the DNA-surface electrostatics and its optimization in oligonucleotide microarrays.
Here we focus on another type of on-array electrostatic interaction, the repulsion between the assayed
nucleic acid and the array of surface-tethered DNA probes, both of which bear negative charge (Vainrub and
Pettitt, 2002). First, we describe the origin of this interaction, which is specific to on-array hybridization
and does not occur in homogeneous solution hybridization assays. To obtain sufficient numbers of
dsDNA hybrids for reliable detection, the oligonucleotide probe molecules are quite crowded on the array
with the surface density typically from 10^12 to 10^14 probes per square centimeter, corresponding to a
mean distance between neighbors on the surface from 10 to 1 nm, respectively. Therefore, a target
cDNA closely approaches not only a hybridization partner, but also the surrounding probe oligonucleotides,
which contribute to the electrostatic repulsion of the target. Recently we considered this effect
(Vainrub and Pettitt, 2002) and derived an equation for the on-array hybridization binding isotherm:

$$\frac{\theta}{1-\theta} = C_0 \exp\left(-\frac{\Delta G_0}{RT}\right) \exp\left[-\frac{V_S N_P \left(Z_P + \theta Z_T\right)}{RT}\right] \qquad (14.1)$$

Here θ (0 < θ < 1) is the hybridization efficiency, i.e., the fraction of hybridized probes, C0 is the assayed
DNA concentration, ΔG0 = ΔH0 – TΔS0 is the Gibbs free energy (ΔH0 the enthalpy, ΔS0 the entropy) of
dsDNA formation in homogeneous solution, T is the temperature, and R is the universal gas constant. ZP
and ZT denote the probe and target lengths (the number of nucleotides), and NP is the probe surface density.
VS is the interaction strength, which is estimated (Vainrub and Pettitt, 2002) both theoretically and from
the experiments as about 10^-14 J m^2/mol for 25-mer probe oligonucleotides in 1 M NaCl solution.

FIGURE 14.1 Hybridization binding isotherm at different surface densities of 25-mer probe oligonucleotides. The curve
number denotes the surface density in units of 10^12 probes/cm^2. The number 0 corresponds to the Langmuir isotherm.
[Plot: hybridization efficiency (0.0 to 1.0) vs. target concentration (0 to 10000, *exp[ΔG0/RT0] moles); curves labeled 0, 1, 2, 4, 6, 8, 10, 12.]

The array hybridization isotherm, Equation 14.1, as demonstrated in Figure 14.1, successfully explains
the well-known experiments (Forman et al., 1998; Guo et al., 1994; Shchepinov et al., 1995), showing
the temperature decrease and strong broadening of the dsDNA melting on array compared to solution.
Also, the theory quantitatively accounts for the recent 25-mer (Peterson et al., 2001) and 20-mer
(Watterson et al., 2000) oligonucleotide array experimental data. Below we review how this theory can be used
in microarray optimization. It should be noted that, in addition to the discussed electrostatic forces, the
other interface interactions, e.g., hydration effects, van der Waals forces, and steric hindrance (packing),
could contribute under specific array conditions. However, the interface electrostatic effects often dom-
inate the interactions. In combination with understanding the probabilistic basis of the sequences for
analysis (below), we now have the ability to bring some interesting concepts to bear on the design of
biochips.
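As a rough numerical illustration of the binding isotherm, the sketch below solves θ/(1 − θ) = c·exp[−VS·NP·(ZP + θ·ZT)/RT] for θ by bisection, where c = C0·exp(−ΔG0/RT) is the normalized target concentration. This form of Equation 14.1 is taken as an assumption here, as are the function name and the default parameters (VS = 10^-14 J m^2/mol, 25-mer probe and target, 25°C); probe densities are given in probes per square meter, so 10^12 cm^-2 corresponds to 10^16 m^-2.

```python
import math

R = 8.314  # universal gas constant, J/(mol K)


def hybridization_efficiency(c_norm, probe_density, v_s=1e-14,
                             z_p=25, z_t=25, temperature=298.15):
    """Fraction of hybridized probes for a normalized target concentration.

    c_norm        -- C0 * exp(-dG0 / RT), dimensionless
    probe_density -- probes per m^2 (0 recovers the Langmuir isotherm)
    """
    if c_norm <= 0:
        return 0.0

    def residual(theta):
        # Logarithm of the isotherm; increases monotonically with theta,
        # so the root is unique and bisection is safe.
        repulsion = v_s * probe_density * (z_p + theta * z_t) / (R * temperature)
        return math.log(theta / (1.0 - theta)) - math.log(c_norm) + repulsion

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Efficiency at a fixed normalized concentration for several probe densities.
for density_cm2 in (0, 1e12, 4e12, 12e12):
    theta = hybridization_efficiency(2000.0, density_cm2 * 1e4)
    print(f"N_P = {density_cm2:.0e} cm^-2 -> theta = {theta:.3f}")
```

With zero probe density the solver returns the Langmuir value c/(1 + c); raising the density lowers θ at fixed concentration, which is the Coulomb blockage described above.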
14.2.2 Sensitivity Enhancement
Considering the interaction-free energies involved in surface-bound DNA devices, several factors affecting
binding and performance were apparent. We found that the concentration dependence of the electrostatic
repulsion between the assayed target and probe array affects the sensitivity and dynamic range of DNA
microarrays. Calculated from our theory, Figure 14.2 shows the number of hybrids θNP as a function of
the target concentration at different probe surface densities NP, assuming the same array parameters Z =
25, VS = 10^-14 J m^2/mol, and room temperature T = 25°C as in Figure 14.1. For microarray assays in the low
target concentration regime, the strongest signals (curve 1 in Figure 14.2) correspond to a probe density of
about 10^12 cm^-2. As seen in the inset to Figure 14.2, the theoretical sensitivity peak is rather narrow,
suggesting that the probe density on the surface in microarrays should be thoroughly optimized for each
surface preparation and solution condition. This result is in accord with experimental observations of a
clear signal peak in a similar probe density range (Steel et al., 1998) and a weaker signal at higher probe
densities (Peterson et al., 2001). This means the dynamic range near higher target concentrations can be
expanded by an increase of the probe density at the expense of a substantial decrease in sensitivity. The
width of the peak may have other factors that could be important under different conditions.

FIGURE 14.2 Number of hybridized probes as a function of the normalized target concentration at different surface
densities of 25-mer probe oligonucleotides. The curve number denotes the surface density in units of 10^12 probes/cm^2.
Inset: Number of hybrids vs. probe surface density at the normalized target concentration 0.1.
[Plot: hybrid density (1/cm^2) vs. target concentration (*exp[ΔG0/RT0] moles), logarithmic axes. Inset: hybrid density vs. probe density (1/cm^2).]

Explicit control of the electrostatic interactions by microscopic or macroscopic field generation is
therefore of obvious importance for optimization of microarrays. Suppression of the Coulomb repulsion
could, in favorable circumstances, increase the sensitivity. We predict this could be achieved using external
fields, charged molecular surface preparations, and in three-dimensional arrays using probe immobilization
in gels, which indeed show solution-like hybridization thermodynamics (Vasiliskov et al., 2001)
but suffer from slow hybridization and washing kinetics. For two-dimensional arrays, use of
multivalent counterions for enhancement of the Coulomb screening and repulsion reduction (Nguyen
et al., 2000) may be important as well as the use of a positive electrostatic potential at the surface (Vainrub
and Pettitt, 2000). In addition, replacement of DNA probes by noncharged peptide nucleic acids (PNA)
(Nielsen, 2001) provides an interesting chemical way to lessen the unfavorable electrostatic interaction.
Other complications arise to make a detailed analysis of such a hetero duplex beyond the scope of our
present discussion.
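The sensitivity peak near 10^12 probes/cm^2 discussed in this section can be estimated with a short calculation. The sketch below assumes the dilute limit of the isotherm (θ much smaller than 1), in which θ ≈ c·exp(−VS·NP·ZP/RT), so that the hybrid density θ·NP is maximal at NP = RT/(VS·ZP). The dilute-limit form, the function names, and the parameter defaults (VS = 10^-14 J m^2/mol, 25-mers, 25°C) are assumptions made for illustration.

```python
import math

R = 8.314  # universal gas constant, J/(mol K)


def dilute_hybrid_density(c_norm, probe_density, v_s=1e-14, z_p=25,
                          temperature=298.15):
    """Hybrids per m^2 in the dilute limit (theta << 1)."""
    theta = c_norm * math.exp(-v_s * probe_density * z_p / (R * temperature))
    return theta * probe_density


def optimal_probe_density(v_s=1e-14, z_p=25, temperature=298.15):
    """Probe density (per m^2) that maximizes N_P * exp(-V_S N_P Z_P / RT)."""
    return R * temperature / (v_s * z_p)


best = optimal_probe_density()
print(f"optimal probe density: {best / 1e4:.2e} probes/cm^2")
```

Because the optimum scales as 1/(VS·ZP), a weaker interaction strength (for example at higher ionic strength) or shorter probes would shift the peak to higher densities under this approximation.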
14.2.3 Multiplexed SNPs Detection
In contrast to gene expression profiling, the Coulomb hybridization blockage plays a positive role in on-
array single nucleotide polymorphism (SNP) genotyping and provides an interesting possibility for
multiplexed SNPs detection. Given a reasonable estimate of the mean SNP frequency in a human
chromosome DNA of about 1 per 1000 nucleotide sites (Cutler et al., 2001), the individual genomes may
differ by several millions of SNPs. Therefore, highly multiplexed detection using microarrays is very
important for high throughput large-scale SNP genotyping.

FIGURE 14.3 Principle of multiplexed SNPs detection on DNA biochip. The melting curves of matched 20-mers
L and H and their single mismatches L1 and H1. The (L, L1) bar shows the temperature range where both detection
and discrimination between L and L1 duplexes are possible (see text); the (H, H1) bar indicates the similar range for
H and H1. In solution (left figure) the (L, L1) and (H, H1) ranges do not overlap, but on DNA biochip (right figure)
the overlap occurs and allows detection of both (L, L1) and (H, H1) SNPs in a single fixed temperature experiment.
[Two plots of hybridization efficiency (0.0 to 1.0) vs. temperature (K), each with curves L1, L, H1, H. Left panel: "In solution: No melting curves overlap. Impossible to detect both SNPs." Right panel: "On biochip: Both SNPs detection range. Probe density 1.2 * 10^13 oligos/cm^2."]

The principle of on-array multiplexed SNPs genotyping is demonstrated in Figure 14.3. For 20-mer
perfectly matched duplexes with oligomer L 5′-CTGAA CGGTA GCATC TTGAC-3′ and oligomer H
5′-CTGAG CGGTA GCACC GCGAC-3′, the melting curves in solution at 5 nM oligonucleotide concentration
with 1 M added NaCl salt are shown in Figure 14.3 (left side). The H duplex (Tm = 72.5°C, 70% of
GC-bases) is more stable than the L duplex (Tm = 62.9°C, 50% of GC-bases) because of the higher
GC-base content (SantaLucia et al., 1996). In addition to the matched L and H, Figure 14.3 shows also the
melting curves for the L1 and H1 SNPs corresponding to the A for T single nucleotide replacement at
the tenth position from the 5′-end. As a practical example, the conditions for an SNP detection are
defined as at least 1% hybridization efficiency signal strength and 1.5 times discrimination ratio for the
match/mismatch signals. The resulting detection temperature ranges of L and H SNPs are shown in
Figure 14.3 by the bars. Since in solution L and H ranges do not overlap, the two SNPs cannot be detected
in a single temperature assay. However, on an array with a probe surface density NP = 1.2*10^13 cm^-2 the
above-mentioned broadening of melting curve increases the detection ranges for L and H SNPs and
makes them both detectable in the overlap temperature region as shown in Figure 14.3 (right side). This
example illustrates the principle behind our suggested multiplexed detection of SNPs that differ by up
to 10°C in the melting temperature. Further extensions of diversity of on-array genotyped SNPs can be
achieved using higher probe surface density to make the melting transition even more broad.
14.3 DNA Biosensors
14.3.1 Presence of Short Subsequences in the Genomes
Statistical analysis of the appearance of short subsequences in different DNA sequences, from individual
genes to full genomes is important for a variety of reasons. Applications include PCR primer (Fislage,
1998; Fislage et al., 1997) as well as microarray probe design (Southern, 2001). Several attempts (Descha-
vanne, et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al., 1997; Nakash-
ima et al., 1998; Nussinov, 1984; Sandberg et al., 2001) have been made to employ the frequency
distribution of short subsequences (n-mers) to identify species with relatively short genome sizes (micro-
bial). In such an approach, the shape of the frequency distribution for certain short subsequences, 2–4-
mers (Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al.,
1997; Nakashima et al., 1998; Nussinov, 1984) and 8–9-mers (Deschavanne et al., 1999; Sandberg et al.,
2001) has been used to decide what microbial genome one is dealing with, based on a given piece of
genome or a whole genome.
Many sequencing projects are in progress and more full genomes have recently become available. The
several hundred projects completed so far provide sufficient material to consider them from a statistical
viewpoint. Yet, we are still far from having a complete or even reasonable statistical picture. There are
simply too many species yet to be sequenced to obtain globally relevant statistical answers.
Recently (Fofanov et al., 2002a, 2002b) the comparative statistical analysis of the presence/absence of
all possible short n-mers (7 to 20 nucleotides long) for more than 250 complete genomes was performed
in this group. The set under consideration included 76 complete microbial genome sequences with sizes
ranging from 0.58 to 7.04 Mb and 176 viral genomes (128 RNA-containing viruses with genome sizes
from 0.32 to 130.76 kb and 48 DNA-containing viruses with genome sizes from 2.0 to 671.19 kb) as well
as complete genomes of five multicellular organisms: Caenorhabditis elegans (99.99 Mb), Drosophila
melanogaster (119.98 Mb), Oryza sativa (Rice, 255.87 Mb), Schizosaccharomyces pombe (12.49 Mb), and
Homo sapiens (human, 2.875 Gb) genomes. A complete list of genomes and all supplementary materials
mentioned below can be found on the University of Houston Bioinformatics lab website
http://www.bioinfo.uh.edu/publications/how_random_are_genomes.
Tables 14.1 and 14.2 show representative results for some of the analyzed genomes (microbial and
viral), for n = 8 and 12 using our techniques. It is worth mentioning that as n increases, the total number
of possible n-mers, 4^n, strongly exceeds the total sequence length M and most of the possible n-mers do
not appear at all because the maximum number of n-mers contained in this sequence is M – n + 1 ≈ M.
Moreover, for a reasonably high ratio, 4^n/M, most of the n-mers that appear tend to appear only once,
in accordance with the fact that the number of present n-mers becomes very close to M (see Tables 14.1,
14.2, and supplementary data on the above-mentioned Web site). That is why it was decided to use the
statistics for “presence/absence” in our method of analysis, instead of the usual “frequency of appearance,”
which is reasonable for short n-mers (total sequence length M
“genome” is also shown for comparison in all figures as a solid line. One can observe the extraordinary similarity between these plots. All of the different genomes form a well-defined pattern, when plotted
against the ratio 4^n/M and not against the size of the genome or the length of the n-mer separately.
For much longer genomes of multicellular organisms, practically all n-mers for n < 12 are present.
Therefore, we chose to calculate the number of distinct 13–20-mers present in each genome (see Figure
14.7 and Table 14.3). These results point to the conclusion that the presence of n-mers in all genomes
TABLE 14.1 Frequency of Presence of 8-mers and Self-Similarity* for Several Viral Genomes
Accession | Genome | Total Sequence Length (bp) | Number of Present 8-mers | Frequency of Present 8-mers | Random Boundary | Self-Similarity
NC_001436 | Human T-cell lymphotropic virus type 1 | 17,014 | 13,739 | 20.96% | 22.86% | 8.31%
NC_001707 | Hepatitis B virus | 6,430 | 5,963 | 9.10% | 9.35% | 2.64%
NC_001503 | Mouse mammary tumor virus | 17,610 | 14,307 | 21.83% | 23.56% | 7.35%
NC_001547 | Sindbis virus | 11,703 | 10,431 | 15.92% | 16.35% | 2.67%
NC_001434 | Hepatitis E virus | 7,176 | 6,517 | 9.94% | 10.37% | 4.12%
NC_003312 | Swine hepatitis E virus | 7,257 | 6,608 | 10.08% | 10.48% | 3.81%
NC_001489 | Hepatitis A virus | 7,478 | 6,543 | 9.98% | 10.78% | 7.42%
NC_001433 | Hepatitis C virus | 9,413 | 8,480 | 12.94% | 13.38% | 3.29%
NC_001653 | Hepatitis D virus | 1,682 | 1,608 | 2.45% | 2.53% | 3.17%
NC_001802 | Human immunodeficiency virus type 1 | 9,181 | 7,725 | 11.79% | 13.07% | 9.83%
NC_003461 | Human parainfluenza virus 1 | 15,600 | 12,242 | 18.68% | 21.18% | 11.82%
NC_001796 | Human parainfluenza virus 3 | 15,462 | 11,506 | 17.56% | 21.02% | 16.46%
NC_003443 | Human parainfluenza virus 2 | 15,646 | 12,702 | 19.38% | 21.24% | 8.74%
* See the definition in the text.
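The presence/absence statistics tabulated here can be sketched in a few lines. The code below is a minimal illustration, not the authors' software: it counts the distinct n-mers on both strands (an assumption suggested by Table 14.4, where the total sequence length is twice the genome length) and divides by 4^n. The "random boundary" is taken here to be 1 − exp(−M/4^n), the expected fraction of n-mers present in a random sequence of total length M; this expression is an assumption, although it reproduces the tabulated values (for example, 9.35% for M = 6,430 and n = 8). All function names are hypothetical.

```python
import math

ALPHABET = frozenset("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def present_nmers(seq, n):
    """Set of distinct n-mers found on either strand of seq."""
    seq = seq.upper()
    found = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            word = strand[i:i + n]
            if ALPHABET.issuperset(word):  # skip ambiguous bases such as N
                found.add(word)
    return found


def frequency_of_presence(seq, n):
    """Fraction of all 4^n possible n-mers that occur in seq."""
    return len(present_nmers(seq, n)) / 4 ** n


def random_boundary(total_length, n):
    """Expected presence fraction for a random sequence (assumed form)."""
    return 1.0 - math.exp(-total_length / 4 ** n)


print(f"{random_boundary(6430, 8):.2%}")   # total length of Hepatitis B virus in Table 14.1
```

Only presence is recorded, so an n-mer that occurs many times contributes exactly once, in line with the "presence/absence" statistics used in this section.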
TABLE 14.2 Frequency of Presence of 12-mers and Self-Similarity for Several Microbial Genomes
Accession | Genome | Total Sequence Length (bp) | Number of Present 12-mers | Frequency of Present 12-mers | Random Boundary | Self-Similarity
NC_000964 | Bacillus subtilis | 8,429,628 | 5,346,103 | 31.87% | 39.50% | 19.32%
NC_002696 | Caulobacter crescentus | 8,033,894 | 3,399,234 | 20.26% | 38.05% | 46.75%
NC_000913 | Escherichia coli K12 | 9,278,442 | 5,695,881 | 33.95% | 42.48% | 20.08%
NC_000916 | Methanobacterium thermoautotrophicum | 3,502,754 | 2,658,450 | 15.85% | 18.84% | 15.91%
NC_003197 | Salmonella typhimurium LT2 | 9,714,864 | 5,821,910 | 34.70% | 43.96% | 21.06%
NC_002758 | Staphylococcus aureus Mu50 | 5,756,080 | 3,398,622 | 20.26% | 29.04% | 30.25%
NC_003098 | Streptococcus pneumoniae R6 | 4,077,230 | 2,992,091 | 17.83% | 21.57% | 17.34%
NC_002737 | Streptococcus pyogenes | 3,704,882 | 2,778,223 | 16.56% | 19.81% | 16.43%
NC_002578 | Thermoplasma acidophilum | 3,129,812 | 2,602,761 | 15.51% | 17.02% | 8.84%
NC_002689 | Thermoplasma volcanium | 3,169,608 | 2,590,718 | 15.44% | 17.22% | 10.30%
NC_000919 | Treponema pallidum | 2,275,888 | 1,978,453 | 11.79% | 12.69% | 7.04%
NC_000853 | Thermotoga maritima | 3,721,450 | 2,755,886 | 16.43% | 19.89% | 17.43%
NC_002162 | Ureaplasma urealyticum | 1,503,438 | 948,274 | 5.65% | 8.57% | 34.06%
NC_002505 | Vibrio cholerae chromosome I, chromosome II | 8,066,854 | 5,383,520 | 32.09% | 38.17% | 15.94%
NC_002488 | Xylella fastidiosa 9a5c | 5,358,610 | 3,996,398 | 23.82% | 27.34% | 12.88%
FIGURE 14.4 Frequency of presence of 9–14-mers in 76 microbial genomes.
[Plot "Microbial Genomes": frequency of presence of n-mers (0.00 to 1.00) vs. 4^n/M (0 to 25); series: 9-mers, 10-mers, 11-mers, 12-mers, 13-mers, 14-mers, random boundary.]
FIGURE 14.5 Frequency of presence of 7–10-mers in 129 RNA viral genomes.
[Plot "RNA Virus Genomes": frequency of presence of n-mers (0 to 1) vs. 4^n/M (0 to 25); series: 7-mers, 8-mers, 9-mers, 10-mers, random boundary.]
FIGURE 14.6 Frequency of presence of 7–10-mers in 48 DNA viral genomes.
[Plot "DNA Virus Genomes": frequency of presence of n-mers (0 to 1) vs. 4^n/M (0 to 25); series: 7-mers, 8-mers, 9-mers, 10-mers, random boundary.]
considered (in the range of n, when the condition M
Let us provide a simple example based on three different genomes: (1) Salmonella typhi (NC_003198),
(2) Mycobacterium tuberculosis H37Rv (NC_000962), and (3) Bacillus subtilis (NC_000964). A complete
set of n-mers would contain 4^n n-mers, which, for n = 12, is 4^12 = 16,777,216. Based on our analysis,
Table 14.4 shows how many different 12-mers are contained in each of these three genomes. The number
N(n, G1, G2) of n-mers (n = 12) that appear in each pair of species genomes (G1, G2) is shown in Table
14.5. We can compare the probabilities of finding randomly picked 12-mers in each pair of genomes
with probabilities calculated using the multiplication rule. As seen from Table 14.5, the actual and
calculated (expected) probabilities do not differ greatly from each other, which allows us to treat the
presence/absence of randomly picked 12-mers in these three genomes as statistically independent events.
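The comparison with the multiplication rule can be reproduced directly from the counts reported in Tables 14.4 and 14.5. The sketch below is only a worked check of that arithmetic; the dictionary layout and function name are illustrative choices, not part of the original analysis.

```python
TOTAL_12MERS = 4 ** 12  # 16,777,216 possible 12-mers

# Numbers of distinct 12-mers present in each genome (Table 14.4).
present = {
    "S. typhi": 5_813_330,
    "M. tuberculosis H37Rv": 4_361_508,
    "B. subtilis": 5_346_103,
}

# Numbers of 12-mers present in both genomes of a pair (Table 14.5).
shared = {
    ("S. typhi", "M. tuberculosis H37Rv"): 1_943_814,
    ("S. typhi", "B. subtilis"): 2_335_710,
    ("M. tuberculosis H37Rv", "B. subtilis"): 1_334_288,
}


def compare_pair(g1, g2):
    """Return (actual, expected) probabilities that a random 12-mer is in both genomes."""
    actual = shared[(g1, g2)] / TOTAL_12MERS
    # Multiplication rule: valid if presence in g1 and in g2 are independent events.
    expected = (present[g1] / TOTAL_12MERS) * (present[g2] / TOTAL_12MERS)
    return actual, expected


for g1, g2 in shared:
    actual, expected = compare_pair(g1, g2)
    print(f"{g1} / {g2}: actual {actual:.1%}, expected {expected:.1%}, "
          f"ratio {actual / expected:.2f}")
```

A ratio close to 1 indicates that the presence of a 12-mer in one genome says little about its presence in the other.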
Actual and expected pair-wise probabilities were calculated (Fofanov et al., 2002a, 2002b) in each of
the above-mentioned groups of genomes (170,000+ pairs in total). We were especially interested in the
range of n where p* = 5 to 50% of the total possible number of n-mers occurred. This range is different
for different genome sizes and can be determined from Figure 14.4. The analytic formula for the random
boundary also can be used to estimate this range:

$$n^* = \frac{\log\left[M\,(1-p^*)/p^*\right]}{\log(4)} \qquad (14.2)$$

Upper and lower bounds for sizes from 0.8 to 10 Mb, which are typical for microbial genomes, are
shown in Table 14.6. In accordance with this, the value n = 12 seems to be the most reasonable one for
all microbial genomes. For viral genomes, the value was found to be n = 7.

TABLE 14.4 The Frequency of Presence of 12-mers within the Three Microbial Genomes
Genome (G) | Genome Length | Total Sequence Length (bp) | Number of Different 12-mers Present in Genome: N(12, G) | p = N(12, G)/4^n
Salmonella typhi | 4,809,037 | 9,618,074 | 5,813,330 | 34.65%
Mycobacterium tuberculosis H37Rv | 4,411,529 | 8,823,058 | 4,361,508 | 26.00%
Bacillus subtilis | 4,214,814 | 8,429,628 | 5,346,103 | 31.87%

TABLE 14.5 Actual and Predicted Simultaneous Presence of 12-mers within the Three Microbial Genomes: (1) Salmonella typhi, (2) Mycobacterium tuberculosis H37Rv, and (3) Bacillus subtilis
Case | Number of 12-mers | N(n, G1, G2)/4^n | Calculated Probability Assuming Independence
Present in genomes (1) and (2) | 1,943,814 | 11.6% | 9.0%
Present in genomes (1) and (3) | 2,335,710 | 13.9% | 11.0%
Present in genomes (2) and (3) | 1,334,288 | 8.0% | 8.3%

TABLE 14.6 The Optimal Length of n-mers (n*) for Different Genome Sizes and Frequencies of Presence (p*)
Total Sequence Length (bp) | n* Determined for Frequency of Presence 50% (p* = 0.5) | n* Determined for Frequency of Presence 5% (p* = 0.05)
0.8 Mb | 9.80 | 11.93
2.0 Mb | 10.47 | 12.59
10.0 Mb | 11.63 | 13.75

For all 2800+ pairs of microbial genomes and the value of n = 12, the average ratio of actual and
expected probabilities was found to be 1.35 ± 0.61. For viral genomes and the corresponding value of
n = 7, the average ratio of actual and expected probabilities was found to be 1.06 ± 0.10 for 1100+ genome
pairs of DNA-based viruses and 1.04 ± 0.05 for 8100+ genome pairs of RNA-based viruses. This led us to the
conclusion that for this range of n, the presence of n-mers in different genomes, to a good approximation,
can be treated as independent events.
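The n-mer length estimate of Equation 14.2 is easy to evaluate numerically. The sketch below assumes the form n* = log[M(1 − p*)/p*]/log 4, which reproduces the entries of Table 14.6 to two decimals; the function name is hypothetical.

```python
import math


def optimal_nmer_length(total_length, target_frequency):
    """n* for which roughly a fraction target_frequency of all n-mers is present.

    total_length     -- total sequence length M in bases
    target_frequency -- desired frequency of presence p*, between 0 and 1
    """
    odds = (1.0 - target_frequency) / target_frequency
    return math.log(total_length * odds) / math.log(4.0)


for size_mb in (0.8, 2.0, 10.0):
    m = size_mb * 1e6
    upper = optimal_nmer_length(m, 0.05)   # about 5% of n-mers present
    lower = optimal_nmer_length(m, 0.50)   # about 50% of n-mers present
    print(f"{size_mb:5.1f} Mb: n* from {lower:.2f} to {upper:.2f}")
```

Since n must be an integer, a practical choice is any integer inside the printed interval; for the microbial size range this is consistent with the value n = 12 used in the text.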
The highest deviations between expected and actual probabilities were found for closely related
genomes. For 48 DNA-based viruses under consideration, using 7-mers, the highest ratio (185%) was
found for Duck hepatitis B virus (NC_001344) vs. Stork hepatitis B virus (NC_003325) with 8.1% expected
and 15.0% actual.
An example of closely related microbial genomes would be S. aureus N315 (NC_002745) vs. S. aureus
Mu50 (NC_002758) with 4.0% expected and 19.7% actual or 491% higher than expected. Another
extreme case was found for three microbial genomes: Chlamydophila pneumoniae CWL029 (NC_000922),
C. pneumoniae AR39 (NC_002179), and C. pneumoniae J138 (NC_002491), which have the highest
(eightfold) ratio of actual and expected probabilities for 12-mers (1.5% expected and 12.3% actual). For
the group containing 24 human chromosomes, pair-wise ratios of actual and expected probabilities of
14-mers were found to be 1.91 ± 0.16, maximum ratio being found for n = 20 and Y-chromosomes
(expectation 2.9% vs. actual 6.9%).
Assuming that results for 250+ genomes are statistically significant, we expect similar behavior
from many different (as yet unsequenced) genomes. Thus our analysis indicates that, in this case, one
may use relatively small sets of randomly picked n-mers for differentiating between different viruses
and organisms.
Let us further illustrate the idea by continuing our example for three microbial genomes. Let n* be
the size of n-mer that fits the interval where from 5 to 50% of all possible n-mers show up for a
desirable range of genome lengths. In accordance with Table 14.6, we may choose the value n* = 12. Let
us randomly pick L 12-mers (say, L = 1000). Given a genome G1 with the frequency of presence of
n-mers p1, we expect that K = p1L n-mers present in G1 will appear also in our random set, forming a
“fingerprint” of G1 (in our example, we expect 50 < K < 500). The probability, ε, that the fingerprint of
G1 will exactly coincide with the fingerprint of some other genome G2 (with the frequency of presence
of n-mers p2) is (Fofanov et al., 2002a, 2002b):

$$\varepsilon = \left(1 - p_1 - p_2 + 2p_{12}\right)^{L} \qquad (14.3)$$

Here p12 is the probability for the n-mer to be present in both genomes simultaneously.
Let us consider the numeric example mentioned in Tables 14.4 and 14.5 of two species that are far
from each other, Salmonella typhi vs. Mycobacterium tuberculosis H37Rv; p1 = 0.3465, p2 = 0.2600, p12 =
0.1160; with L = 1000, a remarkable accuracy of ε = 1.7*10^-204 can theoretically be achieved.
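The numeric example for the fingerprint coincidence probability can be checked with a few lines of code. In the sketch below, the quantity 1 − p1 − p2 + 2·p12 is read as the probability that a single random n-mer gives the same yes/no answer for both genomes (present in both, p12, or absent from both, 1 − p1 − p2 + p12). The calculation is done in base-10 logarithms because ε is far smaller than typical floating-point tolerances; the function name is an illustrative choice.

```python
import math


def log10_coincidence_probability(p1, p2, p12, set_size):
    """log10 of the probability that two genomes share an identical fingerprint.

    p1, p2   -- frequencies of presence of n-mers in genomes G1 and G2
    p12      -- probability that an n-mer is present in both genomes
    set_size -- number L of randomly picked n-mers
    """
    same_answer = 1.0 - p1 - p2 + 2.0 * p12   # per-n-mer agreement probability
    return set_size * math.log10(same_answer)


# Salmonella typhi vs. Mycobacterium tuberculosis H37Rv, L = 1000 (values from the text).
log_eps = log10_coincidence_probability(0.3465, 0.2600, 0.1160, 1000)
exponent = math.floor(log_eps)
mantissa = 10.0 ** (log_eps - exponent)
print(f"epsilon ~ {mantissa:.1f}e{exponent}")
```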
Given a desirable probability of error, e, one can determine the appropriate size, L, of a random set
of n-mers that can be used for reliable identification of genomes as

$$L = \frac{\log e}{\log(1 - p_1 - p_2 + 2p_{12})}. \qquad (14.4)$$

For related organisms, the genomes may contain large common parts. This means that p12 may be
close to p1 and p2. To give a numeric example of close relatives, let us consider S. aureus N315 vs. S. aureus
Mu50. Now p1 = 0.198, p2 = 0.203, p12 = 0.197, and an accuracy of $e = 10^{-10}$ can be achieved with L =
4451. We would like to stress that our analysis predicts a logarithmic dependence of the sampling or
microarray size, L, on the error probability, e. This feature is of principal importance for the estimation
procedure under discussion.
Therefore, we can use practically any sufficiently random subset of n-mers of appropriate size to design
a microarray to diagnose the organism to which a given DNA/RNA sample belongs. Different sizes of n-mers
must be employed for recognition of different organisms based on their genome length. Values of n that
correspond to given intervals of genome lengths can be easily calculated using the above formulas. In fact, only
11 different n values, 7 ≤ n ≤ 17, would be enough to cover a large variety of genome sizes from 1 kb to 9 Gb.
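
A minimal Python sketch of Eq. (14.4) is given below, using the distant-pair frequencies quoted earlier (p1 = 0.3465, p2 = 0.2600, p12 = 0.1160). The function name `probes_needed` and the rounding up to a whole number of probes are assumptions of this sketch, not part of the original formula.

```python
import math

def probes_needed(p1, p2, p12, error):
    """Eq. (14.4), rounded up: probes required so that e <= `error`."""
    per_probe_match = 1.0 - p1 - p2 + 2.0 * p12
    if not 0.0 < per_probe_match < 1.0:
        raise ValueError("per-probe match probability must lie strictly between 0 and 1")
    if not 0.0 < error < 1.0:
        raise ValueError("error probability must lie strictly between 0 and 1")
    # The ratio of logarithms is independent of the logarithm base.
    return math.ceil(math.log(error) / math.log(per_probe_match))

if __name__ == "__main__":
    # Squaring the target error (1e-10 -> 1e-20) only about doubles L,
    # which is the logarithmic dependence stressed in the text.
    for target in (1e-10, 1e-20):
        print(target, probes_needed(0.3465, 0.2600, 0.1160, target))
```

For close relatives the per-probe match probability is very near 1, so the computed L is sensitive to the precision of p1, p2, and p12. Frequencies rounded to three digits need not reproduce a quoted L exactly.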
The important advantage of such an approach is that it can be used without a priori knowledge of the
sequence itself. This implies there is no need to perform the expensive and time-consuming process of
sequencing before array construction. It is enough to obtain the purified DNA, hybridize it on a
sufficiently random microarray chip, and check which n-mers show up. Taking into account how accessible
the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and
the fact that we do not need to determine quantitative values of expression (we need only a yes/no
answer), it should be possible to produce an essentially universal microbial/viral DNA chip.
14.4 Conclusions
In this article we have attempted to demonstrate how to use both physical and mathematical techniques
to explore design criteria of relevance to modern genetic analysis. In particular, DNA arrays allow
simultaneous, parallel detection and concentration measurement of thousands of target strands. In this
article we have concentrated on two areas of concern in the design and analysis of such experiments: the
information content of the polymer and the physical/chemical properties of the detection. Microarrays
are now producing amounts of raw biological (and biophysical) data on a scale not seen before in the
biological arena. The bioinformatics solutions to problems associated with the analysis of data on this
scale will remain a challenge for some time. The physical design of efficient devices to conduct such
experiments requires consideration of the chemistry and physics of often highly charged species near
prepared surfaces as well as the sequence. This article was aimed at demonstrating the current state of
theory in the hopes that many will find application of these principles.
Acknowledgments
This work was partially supported by grants from NIH, Texas Coordinating Board, and the Robert A.
Welch Foundation to BMP. TBL was a fellow at the Keck Center for Computational Biology. BMP and
YF acknowledge the Texas Center for Learning and Computation for seed funding, and also NPACI for
computing time and support at the San Diego Supercomputer Center. We also thank the Molecular
Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences
Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy’s Office of Biological
and Environmental Research and located at the Pacific Northwest National Laboratory. Pacific Northwest
is operated for the Department of Energy by Battelle. BMP and AV also thank Accelrys for providing
visualization software through the Institute for Molecular Design.
References
Bloomfield, V.A., Crothers, D.M., and Tinoco, I., Nucleic Acids: Structures, Properties and Functions,
University Science Books, Sausalito, CA, 1999.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy,
M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S.R., Moon, K.,
Burcham, T., Pallas, M., DuBridge, R.B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K., Gene
expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nat.
Biotechnol., 18, 630–634, 2000.
Cutler, D.J., Zwick, M.E., Carrasquillo, M.M., Yohn, C.T., Tobin, K.P., Kashuk, C., Mathews, D.J., Shah,
N.A., Eichler, E.E., Warrington, J.A., and Chakravarti, A., High-throughput variation detection
and genotyping using microarrays, Genome Res., 11, 1913–1925, 2001.
Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., and Fertil, B., Genomic signature: characterization and
classification of species assessed by chaos game representation of sequences, Mol. Biol. Evol., 16,
1391–1399, 1999.
Fislage, R., Differential display approach to quantitation of environmental stimuli on bacterial gene
expression, Electrophoresis, 19, 613–616, 1998.
Fislage, R., Berceanu, M., Humboldt, Y., Wendt, M., and Oberender, H., Primer design for a prokaryotic
differential display RT-PCR, Nucleic Acids Res., 25, 1830–1835, 1997.
Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,
B.M., How independent are the appearances of n-mers in different genomes? Submitted, 2002a.
Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,
B.M., Short subsequences in genomes: how random are they? Submitted, 2002b.
Forman, E.J., Walton, I.D., Stern, D., Rava, R.P., and Trulson, M.O., Thermodynamics of duplex formation
and mismatch discrimination of photolithographically synthesized oligonucleotide arrays, ACS
Symp. Ser., 682, 206–228, 1998.
Guo, Z., Guilfoyle, R.A., Thiel, A.J., Wang, R., and Smith, L.M., Direct fluorescence analysis of genetic
polymorphisms by hybridization with oligonucleotide arrays on glass supports, Nucleic Acids Res.,
22, 5456–5465, 1994.
Heaton, R.J., Peterson, A.W., and Georgiadis, R.M., Electrostatic surface plasmon resonance: direct
electric field-induced hybridization and denaturation in monolayer nucleic acid films and label-
free discrimination of base mismatches, Proc. Natl. Acad. Sci. U.S.A., 98, 3701–3704, 2001.
Karlin, S. and Ladunga, I., Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. U.S.A.,
91, 12832–12836, 1994.
Karlin, S. and Mrazek, J., Compositional differences within and between eukaryotic genomes, Proc. Natl.
Acad. Sci. U.S.A., 94, 10227–10232, 1997.
Nakashima, H., Nishikawa, K., and Ooi, T., Differences in dinucleotide frequencies of human, yeast, and
Escherichia coli genes, DNA Res., 4, 185–192, 1997.
Nakashima, H., Ota, M., Nishikawa, K., and Ooi, T., Genes from nine genomes are separated into their
organisms in the dinucleotide composition space, DNA Res., 5, 251–259, 1998.
Nguyen, T.T., Grosberg, A.Y., and Shklovskii, B.I., Screening of a charged particle by multivalent coun-
terions in salty water: strong charge inversion, J. Chem. Phys., 113, 1110–1125, 2000.
Nielsen, P.E., Peptide nucleic acid: a versatile tool in genetic diagnostics and molecular biology, Curr.
Opinion Biotech., 12, 16–20, 2001.
Nussinov, R., Doublet frequencies in evolutionary distinct groups, Nucleic Acids Res., 12, 1749–1763, 1984.
Peterson, A.W., Heaton, R.J., and Georgiadis, R.M., The effect of surface probe density on DNA hybrid-
ization, Nucleic Acids Res., 29, 5163–5168, 2001.
Saenger, W., Principles of Nucleic Acid Structure, Springer-Verlag, New York, 1984.
Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., and Coster, J., Capturing whole-genome
characteristics in short sequences using a naive Bayesian classifier, Genome Res., 11, 1404–1409,
2001.
SantaLucia, J., Allawi, H.T., and Seneviratne, P.A., Improved nearest-neighbor parameters for predicting
DNA duplex stability, Biochemistry, 35, 3555–3562, 1996.
Shchepinov, M.S., Case-Green, S.C., and Southern, E.M., Steric factors influencing hybridization of
nucleic acids to oligonucleotide arrays, Nucleic Acids Res., 25, 1155–1161, 1997.
Southern, E.M., DNA microarrays — history and overview, Methods Mol. Biol., 170, 1–15, 2001.
Steel, A.B., Herne, T.M., and Tarlov, M.J., Electrochemical quantitation of DNA immobilized on gold,
Anal. Chem., 70, 4670–4677, 1998.
Su, H.J., Surrey, S., McKenzie, S.E., Fortina, P., and Graves, D.J., Kinetics of heterogeneous hybridization
on indium tin oxide surfaces with and without an applied potential, Electrophoresis, 23, 1551–1557,
2002.
Vainrub, A. and Pettitt, B.M., Surface electrostatic effects in oligonucleotide microarrays: control and
optimization of binding thermodynamics, Biopolymers, 68, 265–270, 2003.
Vainrub, A. and Pettitt, B.M., Thermodynamics of association to a molecule immobilized in an electric
double layer, Chem. Phys. Lett., 323, 160–166, 2000.
Vainrub, A. and Pettitt, B.M., Coulomb blockage of hybridization in two-dimensional DNA arrays, Phys.
Rev. E, 66, art. no. 041905, 2002.
Vasiliskov, V.A., Prokopenko, D.V., and Mirzabekov, A.D., Parallel multiplex thermodynamic analysis of
coaxial base stacking in DNA duplexes by oligonucleotide microchips, Nucleic Acids Res., 29,
2303–2313, 2001.
Watterson, J.H., Piunno, P.A., Wust, C.C., and Krull, U.J., Effects of oligonucleotide immobilization
density on selectivity of quantitative transduction of hybridization of immobilized DNA, Langmuir,
16, 4984–4992, 2000.