Chapter 14 Book Biomedical Technology and Devices Handbook


14 Theoretical Considerations for the Efficient Design of DNA Arrays

CONTENTS

14.1 Introduction

14.2 Surface Design
Role of Interface Electrostatic Interactions • Sensitivity Enhancement • Multiplexed SNPs Detection

14.3 DNA Biosensors
Presence of Short Subsequences in the Genomes • Correlation of Presence of Short Subsequences between Genomes

14.4 Conclusions

Acknowledgments

References

14.1 Introduction

The use of combinatorial or array-based detection (and synthesis) technologies has qualitatively changed

many areas of bioscience in the last several years. These technologies include DNA, protein, and combi-

natorial chemistry arrays. Of these, DNA arrays, designed to determine gene content and expression

levels in living cells, have shown the most potential. DNA arrays allow simultaneous, parallel measurement

of thousands of interactions between target strands and genome-derived probes. Two areas of concern are the design and analysis of such experiments. Microarrays are rapidly producing enormous amounts

of raw data. The bioinformatics solutions to problems associated with the analysis of data on this scale

are a major current challenge. In addition, designing such experiments requires consideration of not only

the genomic information required to answer a given problem but also consideration of the chemistry

and physics of highly charged species near prepared surfaces.

On the medical side, DNA arrays may someday help us to better understand complex issues concerning

human health and disease. Among other things, they should help us separate out the effects of one's genes

vs. environment and life-style to help usher in the individualized molecular medicine of the future.

An important current practical application of DNA arrays is in biosensors used to determine the

organism to which a given DNA/RNA sample belongs. This use of microarrays is based on specific properties of viral and microbial genomes and the ability of arrays to provide information regarding the presence/

absence of thousands of short subsequences in a given genome simultaneously.

Arnold Vainrub, University of Houston

Tong-Bin Li, University of Houston

Yuriy Fofanov, University of Houston

B. Montgomery Pettitt, University of Houston

© 2004 by CRC Press LLC



While DNA arrays will be an important tool for some time, it should be emphasized that DNA array

technology is still at an early stage of development. It is cluttered with heterogeneous technologies and

data formats as well as basic issues of signal to noise, fidelity, calibrations, and statistical significance that

are still being sorted out. Until these issues are resolved and standardized, it will not be possible to define

accurately the complete genetic regulatory network of even a well-studied prokaryotic cell system.

DNA arrays were introduced as a high throughput technology for performing hybridization assays

based on formation of a double helix to a surface-immobilized single-strand DNA probe according to

Watson-Crick pairing rules. In its current high-density format, in a single microarray experiment the

hybridization is performed with up to hundreds of thousands of different probes, producing a tremendous

volume of information on the assayed DNA sequences and their abundance in the tested target. Typically,

a DNA microarray hybridization experiment contains 10^7 to 10^10 DNA probe molecules of a sequence

to be tested, immobilized in a ~50-μm-diameter spot on a prepared glass surface, and thus may include about 10^4 to 10^5 different probe spots per square centimeter. Usually, the probes are oligonucleotides

8 to 80 bases long, tethered by one end through a linker molecule to the surface. DNA microbeads are

similar to microarrays, but the probes are tethered to a micron size glass bead’s surface (Brenner et al.,

2000) similar to peptide technology.

The use of DNA microarrays in clinical practice is a rapidly growing area. In the present work we will

concentrate our attention on two tasks: electrostatic effects in solution DNA hybridization at surfaces

and the ability of microarrays to serve as biosensors based on information of the presence or absence of

certain subsequences.

14.2 Surface Design

14.2.1 Role of Interface Electrostatic Interactions

Electrostatic effects in solution DNA hybridization (formation of the double helix by two complementary DNA single strands) are well known (Saenger, 1984; Bloomfield, 1999). They appear simply because the

DNA is a negatively charged polymeric ion, and thus an electrostatic repulsion occurs that can be either

between single strands (ssDNA) or double helices (dsDNA). Each phosphate group PO2– of the outer

dsDNA backbone bears a single negative charge, and therefore typical dsDNA B-helix is often modeled

by a 2-nm-diameter cylinder with a high negative surface charge of about six electron charges per

nanometer of length. The repulsion diminishes when the added salt cations are present in solution and

partly shield the electrostatic interactions. Thus, the dsDNA stability against a dehybridization into

ssDNAs increases with the solution ionic strength; typically, the melting temperature of dsDNA increases

by 10 to 20°C as the added salt concentration grows tenfold. For instance, in human blood plasma

concentrations of approximately 150 mM for Na+ and 2.5 mM for Ca2+ cations produce an electrostatic screening length of about 1 nm and help make the chromosomal dsDNA stable.

In addition to the above-mentioned DNA-DNA repulsion, two new electrostatic interactions appear for

a DNA microarray, namely, the DNA-surface interaction and repulsion between the assayed nucleic acid

and the on-surface layer of DNA probes. The nucleic acid–surface interactions operate on surface-tethered

DNA probes, the assayed nucleic acid (which is a biological RNA or a cDNA prepared from it by reverse

transcription), and also the dsDNA formed as their hybrid. Our theoretical analysis of this DNA-surface

electrostatic interaction shows its possible important role in on-surface hybridization thermodynamics (Vainrub

and Pettitt, 2000, 2003). Recent experiments (Heaton et al., 2001; Su et al., 2002) confirm the theory and

demonstrate complete control of hybridization and melting by an electric potential applied to the

oligonucleotide array on a gold film surface. Indeed, a negative surface potential of –300 mV induces the melting of prehybridized dsDNA whereas a positive potential promotes the hybridization of complementary DNA

targets (Heaton et al., 2001). The experiments can be simply understood (Vainrub and Pettitt, 2000, 2003)

as a result of the electrostatic repulsion between the negatively charged surface and the ssDNA target,

tending to melt the dsDNA and remove the target from the surface; attraction to the positively charged

surface stabilizes the dsDNA. It is important that the probe DNA is tethered to the surface (typically through




a covalent bond and linker molecule), and thus the dsDNA cannot be displaced or diffuse away from the negatively

charged surface; instead, melting releases the mobile target ssDNA, which can drift from the surface to

decrease the electrostatic energy. Interestingly, the DNA-surface repulsion (attraction) occurs even for

noncharged dielectric (metallic) surfaces due to the known electrostatic induction phenomena (Vainrub

and Pettitt, 2000). Evidently, the DNA-surface electrostatics can be regulated by the surface charge and

material (dielectric or metallic), and can be almost canceled by using a long linker molecule and/or a high ionic

strength hybridization solution. In previous work (Vainrub and Pettitt, 2000, 2003) we considered in detail

the DNA-surface electrostatics and its optimization in oligonucleotide microarrays.

Here we focus on another type of on-array electrostatic interaction, the repulsion between the assayed

nucleic acid and the array of surface-tethered DNA probes, both of which bear negative charge (Vainrub and

Pettitt, 2002). First, we describe the origin of this interaction that is specific for on-array hybridization

and does not occur in homogeneous solution hybridization assays. To obtain sufficient numbers of

dsDNA hybrids for reliable detection, the oligonucleotide probe molecules are quite crowded on the array

with the surface density typically from 10^12 to 10^14 probes per square centimeter, corresponding to a

mean distance between neighbors on the surface from 10 to 1 nm, respectively. Therefore, a target

cDNA closely approaches not only a hybridization partner, but also the surrounding probe oligonucleotides, which contribute to the electrostatic repulsion of the target. Recently we considered this effect

(Vainrub and Pettitt, 2002) and derived an equation for the on-array hybridization binding isotherm:

$$ C_0 \exp\left(-\frac{\Delta G_0}{RT}\right) = \frac{\theta}{1-\theta}\,\exp\left[\frac{V_S N_P \left(Z_P + \theta Z_T\right)}{RT}\right]. \tag{14.1} $$

Here θ (0 < θ < 1) is the hybridization efficiency, i.e., the fraction of hybridized probes, C0 is the assayed DNA concentration, ΔG0 = ΔH0 – TΔS0 is the Gibbs free energy (ΔH0 the enthalpy, ΔS0 the entropy) of dsDNA formation in homogeneous solution, T is the temperature, and R is the universal gas constant. ZP and ZT denote the probe and target lengths (the number of nucleotides), and NP is the probe surface density. VS is the interaction strength, which is estimated (Vainrub and Pettitt, 2002) both theoretically and from the experiments as about 10^–14 J m^2/mol for 25-mer probe oligonucleotides in 1 M NaCl solution.

FIGURE 14.1 Hybridization binding isotherm at different surface densities of 25-mer probe oligonucleotides. The curve number denotes the surface density in units of 10^12 probes/cm^2. The number 0 corresponds to the Langmuir isotherm. [Plot: hybridization efficiency (0.0 to 1.0) vs. target concentration, axis labeled "Target concentration (*exp[∆G0/RT0] moles)" (0 to 10000); curves labeled 0, 1, 2, 4, 6, 8, 10, 12.]

The array hybridization isotherm, Equation 14.1, as demonstrated in Figure 14.1, successfully explains

the well-known experiments (Forman et al., 1998; Guo et al., 1994; Shchepinov et al., 1995), showing


the temperature decrease and strong broadening of the dsDNA melting on array compared to solution.

Also, the theory quantitatively accounts for the recent 25-mer (Peterson et al., 2001) and 20-mer (Watter-

son et al., 2000) oligonucleotide array experimental data. Below we review how this theory can be used

in microarray optimization. It should be noted that, in addition to the discussed electrostatic forces, the

other interface interactions, e.g., hydration effects, van der Waals forces, and steric hindrance (packing),

could contribute under specific array conditions. However, the interface electrostatic effects often dom-

inate the interactions. In combination with understanding the probabilistic basis of the sequences for

analysis (below), we now have the ability to bring some interesting concepts to bear on the design of

biochips.
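
As a rough illustration of how the isotherm of Equation 14.1 can be evaluated numerically, the following sketch solves it for the hybridization efficiency θ by bisection. The function name, the choice ZP = ZT = 25, the temperature of 25°C, and the use of the normalized concentration C0 exp(–ΔG0/RT) as input are assumptions made for this example; only VS = 10^–14 J m^2/mol comes from the text.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def hybridization_efficiency(c_norm, n_p, z_p=25, z_t=25, v_s=1e-14, temp=298.15):
    """Solve Equation 14.1 for theta by bisection (illustrative sketch).

    c_norm : normalized target concentration C0*exp(-dG0/(R*T)), dimensionless
    n_p    : probe surface density in probes per m^2 (1e12 cm^-2 = 1e16 m^-2)
    v_s    : interaction strength in J m^2/mol
    """
    def residual(theta):
        # log of right-hand side minus log of left-hand side of Eq. 14.1;
        # this is strictly increasing in theta, so bisection is safe.
        repulsion = v_s * n_p * (z_p + theta * z_t) / (R_GAS * temp)
        return math.log(theta / (1.0 - theta)) + repulsion - math.log(c_norm)

    lo, hi = 1e-300, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Densities in units of probes/cm^2, converted to probes/m^2 for the solver.
    for density_cm2 in (0.0, 1e12, 4e12, 12e12):
        theta = hybridization_efficiency(100.0, density_cm2 * 1e4)
        print(f"N_P = {density_cm2:.0e} cm^-2 -> theta = {theta:.3f}")
```

With NP = 0 the repulsion term vanishes and the solver returns the Langmuir value θ = c/(1 + c), which is a convenient sanity check.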

14.2.2 Sensitivity Enhancement

Considering the interaction free energies involved in surface-bound DNA devices, several factors affecting

binding and performance were apparent. We found that the concentration dependence of the electrostatic

repulsion between the assayed target and probe array affects the sensitivity and dynamic range of DNA

microarrays. Calculated from our theory, Figure 14.2 shows the number of hybrids θNP as a function of the target concentration at different probe surface densities NP assuming the same array parameters Z =

25, VS = 10^–14 J m^2/mol and room temperature T = 25°C as in Figure 14.1. For microarray assays in the low

target concentration regime, the strongest signals (curve 1 in Figure 14.2) correspond to a probe density of

about 10^12 cm^–2. As seen in the insert to Figure 14.2, the theoretical sensitivity peak is rather narrow

suggesting that the probe density on the surface in microarrays should be thoroughly optimized for each

surface preparation and solution condition. This result is in accord with experimental observations of a

clear signal peak in a similar probe density range (Steel et al., 1998) and a weaker signal at higher probe

densities (Peterson et al., 2001). This means the dynamic range near higher target concentrations can be

expanded by an increase of the probe density at the expense of a substantial decrease in sensitivity. The

width of the peak may also depend on other factors that could be important under different conditions.

Explicit control of the electrostatic interactions by microscopic or macroscopic field generation is

therefore of obvious importance for optimization of microarrays. Suppression of the Coulomb repulsion

could, in favorable circumstances, increase the sensitivity. We predict this could be achieved using external

fields, charged molecular surface preparations, and in three-dimensional arrays using probe immobili-

zation in gels, which indeed show solution-like hybridization thermodynamics (Vasiliskov et al., 2001),

FIGURE 14.2 Number of hybridized probes as a function of the normalized target concentration at different surface densities of 25-mer probe oligonucleotides. The curve number denotes the surface density in units of 10^12 probes/cm^2. Insert: Number of hybrids vs. probe surface density at the normalized target concentration 0.1. [Plot: hybrid density (1/cm^2) vs. target concentration, axis labeled "Target concentration (*exp[∆G0/RT0] Moles)", logarithmic axes; insert: hybrid density vs. probe density (1/cm^2).]


but suffer from the slow hybridization and washing kinetics. For two-dimensional arrays use of

multivalent counterions for enhancement of the Coulomb screening and repulsion reduction (Nguyen

et al., 2000) may be important as well as the use of a positive electrostatic potential at the surface (Vainrub

and Pettitt, 2000). In addition, replacement of DNA probes by noncharged peptide nucleic acids (PNA)

(Nielsen, 2001) provides an interesting chemical way to lessen the unfavorable electrostatic interaction.

Other complications arise to make a detailed analysis of such a heteroduplex beyond the scope of our

present discussion.
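
The sensitivity peak discussed above can be reproduced approximately with a short numerical scan. The sketch below assumes the form of Equation 14.1 with ZP = ZT = 25, VS = 10^–14 J m^2/mol, T = 25°C, and a normalized target concentration of 0.1 (the value used in the insert to Figure 14.2); the function names and the scanned density range are choices made for this example.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def theta_on_array(c_norm, n_p, z=25, v_s=1e-14, temp=298.15):
    """Hybridization efficiency from Eq. 14.1 with Z_P = Z_T = z (bisection)."""
    def residual(theta):
        repulsion = v_s * n_p * z * (1.0 + theta) / (R_GAS * temp)
        return math.log(theta / (1.0 - theta)) + repulsion - math.log(c_norm)

    lo, hi = 1e-300, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def hybrid_density_peak(c_norm=0.1, points=400):
    """Scan probe densities from 1e10 to 1e14 cm^-2 on a log grid and return
    (probe density, hybrid density) in cm^-2 at the maximum of theta * N_P."""
    best_density, best_hybrids = 0.0, 0.0
    for i in range(points + 1):
        density_cm2 = 10.0 ** (10.0 + 4.0 * i / points)
        hybrids = density_cm2 * theta_on_array(c_norm, density_cm2 * 1e4)
        if hybrids > best_hybrids:
            best_density, best_hybrids = density_cm2, hybrids
    return best_density, best_hybrids


if __name__ == "__main__":
    density, hybrids = hybrid_density_peak()
    print(f"peak at N_P = {density:.2e} cm^-2, hybrids = {hybrids:.2e} cm^-2")
```

In the low-coverage limit the maximum of θNP lies near NP = RT/(VS ZP), which for these parameters is close to 10^12 cm^–2, in line with the probe density quoted in the text.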

14.2.3 Multiplexed SNPs Detection

In contrast to gene expression profiling, the Coulomb hybridization blockage plays a positive role in on-

array single nucleotide polymorphism (SNP) genotyping and provides an interesting possibility for

multiplexed SNPs detection. Given a reasonable estimate of the mean SNP frequency in a human

chromosome DNA of about 1 per 1000 nucleotide sites (Cutler et al., 2001), the individual genomes may

differ by several millions of SNPs. Therefore, highly multiplexed detection using microarrays is very

important for high throughput large-scale SNP genotyping.

The principle of on-array multiplexed SNPs genotyping is demonstrated in Figure 14.3. For 20-mer

perfectly matched duplexes with oligomer L 5′-CTGAA CGGTA GCATC TTGAC-3′ and oligomer H 5′-CTGAG CGGTA GCACC GCGAC-3′ the melting curves in solution at 5 nM oligonucleotide concentration with 1 M added NaCl salt are shown in Figure 14.3 (left side). The H duplex (Tm = 72.5°C, 70% of

GC-bases) is more stable than the L duplex (Tm = 62.9°C, 50% of GC-bases) because of the higher

GC-base content (SantaLucia et al., 1996). In addition to the matched L and H, Figure 14.3 also shows the

melting curves for the L1 and H1 SNPs corresponding to the A for T single nucleotide replacement at

the tenth position from the 5′-end. As a practical example, the conditions for SNP detection are defined as at least 1% hybridization efficiency signal strength and 1.5 times discrimination ratio for the

match/mismatch signals. The resulting detection temperature ranges of L and H SNPs are shown in Figure 14.3 by the bars. Since in solution L and H ranges do not overlap, the two SNPs cannot be detected

FIGURE 14.3 Principle of multiplexed SNPs detection on DNA biochip. The melting curves of matched 20-mers L and H and their single mismatches L1 and H1. The (L, L1) bar shows the temperature range where both detection and discrimination between L and L1 duplexes is possible (see text); the (H, H1) bar indicates the similar range for H and H1. In solution (left figure) the (L, L1) and (H, H1) ranges do not overlap, but on DNA biochip (right figure) the overlap occurs and allows detection of both (L, L1) and (H, H1) SNPs in a single fixed temperature experiment. [Plots: melting curves L1, L, H1, H on a vertical scale of 0.0 to 1.0 vs. temperature (K). Left panel, "In solution" (320 to 360 K): no melting curves overlap, impossible to detect both SNPs. Right panel, "On biochip" (300 to 340 K; probe density 1.2 × 10^13 oligos/cm^2): detection range for both SNPs.]


in a single temperature assay. However, on an array with a probe surface density NP = 1.2 × 10^13 cm^–2 the

above-mentioned broadening of melting curve increases the detection ranges for L and H SNPs and

makes them both detectable in the overlap temperature region as shown in Figure 14.3 (right side). This

example illustrates the principle behind our suggested multiplexed detection of SNPs that differ by up

to 10°C in the melting temperature. Further extensions of diversity of on-array genotyped SNPs can be

achieved using higher probe surface density to make the melting transition even more broad.
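
The GC fractions quoted above for oligomers L and H, and the melting temperature gap that the on-array broadening has to bridge, can be checked with a few lines. This is only a bookkeeping sketch: the helper name is hypothetical, and the Tm values are copied from the text rather than computed.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in an oligonucleotide (spaces are ignored)."""
    bases = sequence.replace(" ", "").upper()
    return sum(base in "GC" for base in bases) / len(bases)


OLIGO_L = "CTGAA CGGTA GCATC TTGAC"  # oligomer L from the text
OLIGO_H = "CTGAG CGGTA GCACC GCGAC"  # oligomer H from the text

# Solution melting temperatures quoted in the text (degrees C), not computed here.
TM_L, TM_H = 62.9, 72.5

if __name__ == "__main__":
    print(f"GC fraction of L: {gc_fraction(OLIGO_L):.0%}")
    print(f"GC fraction of H: {gc_fraction(OLIGO_H):.0%}")
    print(f"Tm gap between H and L: {TM_H - TM_L:.1f} C")
```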

14.3 DNA Biosensors

14.3.1 Presence of Short Subsequences in the Genomes

Statistical analysis of the appearance of short subsequences in different DNA sequences, from individual

genes to full genomes is important for a variety of reasons. Applications include PCR primer (Fislage,

1998; Fislage et al., 1997) as well as microarray probe design (Southern, 2001). Several attempts (Descha-

vanne, et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al., 1997; Nakash-

ima et al., 1998; Nussinov, 1984; Sandberg et al., 2001) have been made to employ the frequency

distribution of short subsequences (n-mers) to identify species with relatively short genome sizes (micro-

bial). In such an approach, the shape of the frequency distribution for certain short subsequences, 2–4-

mers (Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al.,

1997; Nakashima et al., 1998; Nussinov, 1984) and 8–9-mers (Deschavanne et al., 1999; Sandberg et al.,

2001) has been used to decide what microbial genome one is dealing with, based on a given piece of

genome or a whole genome.
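
To make the notion of n-mer statistics concrete, the following sketch collects the distinct n-mers that occur in a sequence and reports the fraction of all 4^n possible n-mers that are present. The function names are hypothetical, and scanning both strands is an assumption of this example (suggested by Table 14.4, where the total sequence length is twice the genome length); windows containing ambiguous bases are simply skipped.

```python
VALID_BASES = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return sequence.translate(COMPLEMENT)[::-1]


def present_nmers(sequence, n, both_strands=True):
    """Set of distinct n-mers occurring in the sequence (optionally on both strands)."""
    sequence = sequence.upper()
    strands = [sequence]
    if both_strands:
        strands.append(reverse_complement(sequence))
    found = set()
    for strand in strands:
        for start in range(len(strand) - n + 1):
            word = strand[start:start + n]
            if set(word) <= VALID_BASES:  # skip windows with N or other symbols
                found.add(word)
    return found


def presence_frequency(sequence, n, both_strands=True):
    """Fraction of the 4**n possible n-mers that appear at least once."""
    return len(present_nmers(sequence, n, both_strands)) / 4 ** n


if __name__ == "__main__":
    toy = "AAAAC"
    print(sorted(present_nmers(toy, 2)))
    print(presence_frequency(toy, 2))
```

For real genomes a set of strings becomes memory-hungry as n grows; a bit array indexed by a 2-bit encoding of each n-mer would be a more economical design, at the cost of a little more code.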

Many sequencing projects are in progress and more full genomes have recently become available. The

several hundred projects completed so far provide sufficient material to consider them from a statistical

viewpoint. Yet, we are still far from having a complete or even reasonable statistical picture. There are

simply too many species yet to be sequenced to obtain globally relevant statistical answers.

Recently (Fofanov et al., 2002a, 2002b) the comparative statistical analysis of the presence/absence of

all possible short n-mers (7 to 20 nucleotides long) for more than 250 complete genomes was performed

in this group. The set under consideration included 76 complete microbial genome sequences with sizes

ranging from 0.58 to 7.04 Mb and 176 viral genomes (128 RNA containing viruses with genome sizes

from 0.32 to 130.76 kb and 48 DNA containing viruses with genome sizes from 2.0 to 671.19 kb) as well

as complete genomes of five multicellular organisms: Caenorhabditis elegans (99.99 Mb), Drosophila

melanogaster (119.98 Mb), Oryza sativa (Rice, 255.87 Mb), Schizosaccharomyces pombe (12.49 Mb), and

Homo sapiens (human, 2.875 Gb) genomes. A complete list of genomes and all supplementary materials

mentioned below can be found on the University of Houston Bioinformatics lab website http://www.bioinfo.uh.edu/publications/how_random_are_genomes.

Tables 14.1 and 14.2 show representative results for some of the analyzed genomes (microbial and

viral), for n = 8 and 12 using our techniques. It is worth mentioning that as n increases, the total number

of possible n-mers, 4^n, strongly exceeds the total sequence length M and most of the possible n-mers do

not appear at all because the maximum number of n-mers contained in this sequence is M – n + 1 ≈ M. Moreover, for a reasonably high ratio, 4^n/M, most of the n-mers that appear tend to appear only once,

in accordance with the fact that the number of present n-mers becomes very close to M (see Tables 14.1,

14.2, and supplementary data on the above-mentioned Web site). That is why it was decided to use the

statistics for “presence/absence” in our method of analysis, instead of the usual “frequency of appearance,”

which is reasonable for short n-mers (total sequence length M



“genome” is also shown for comparison in all figures as a solid line. One can observe the extraordinary similarity between these plots. All of the different genomes form a well-defined pattern, when plotted

against the ratio 4^n/M and not against the size of the genome or the length of the n-mer separately.

For much longer genomes of multicellular organisms, practically all n-mers for n < 12 are present.

Therefore, we chose to calculate the number of distinct 13–20-mers present in each genome (see Figure

14.7 and Table 14.3). These results point to the conclusion that the presence of n-mers in all genomes

TABLE 14.1 Frequency of Presence of 8-mers and Self-Similarity* for Several Viral Genomes

Accession  Genome                                  Total Sequence  Number of       Frequency of    Random    Self-
                                                   Length (bp)     Present 8-mers  Present 8-mers  Boundary  Similarity
NC_001436  Human T-cell lymphotropic virus type 1  17,014          13,739          20.96%          22.86%    8.31%
NC_001707  Hepatitis B virus                       6,430           5,963           9.10%           9.35%     2.64%
NC_001503  Mouse mammary tumor virus               17,610          14,307          21.83%          23.56%    7.35%
NC_001547  Sindbis virus                           11,703          10,431          15.92%          16.35%    2.67%
NC_001434  Hepatitis E virus                       7,176           6,517           9.94%           10.37%    4.12%
NC_003312  Swine hepatitis E virus                 7,257           6,608           10.08%          10.48%    3.81%
NC_001489  Hepatitis A virus                       7,478           6,543           9.98%           10.78%    7.42%
NC_001433  Hepatitis C virus                       9,413           8,480           12.94%          13.38%    3.29%
NC_001653  Hepatitis D virus                       1,682           1,608           2.45%           2.53%     3.17%
NC_001802  Human immunodeficiency virus type 1     9,181           7,725           11.79%          13.07%    9.83%
NC_003461  Human parainfluenza virus 1             15,600          12,242          18.68%          21.18%    11.82%

* See the definition in the text.

TABLE 14.2 Frequency of Presence of 12-mers and Self-Similarity for Several Microbial Genomes

Accession  Genome                                       Total Sequence  Number of        Frequency of     Random    Self-
                                                        Length (bp)     Present 12-mers  Present 12-mers  Boundary  Similarity
NC_000964  Bacillus subtilis                            8,429,628       5,346,103        31.87%           39.50%    19.32%
NC_002696  Caulobacter crescentus                       8,033,894       3,399,234        20.26%           38.05%    46.75%
NC_000913  Escherichia coli K12                         9,278,442       5,695,881        33.95%           42.48%    20.08%
NC_000916  Methanobacterium thermoautotrophicum         3,502,754       2,658,450        15.85%           18.84%    15.91%
NC_003197  Salmonella typhimurium LT2                   9,714,864       5,821,910        34.70%           43.96%    21.06%
NC_002758  Staphylococcus aureus Mu50                   5,756,080       3,398,622        20.26%           29.04%    30.25%
NC_003098  Streptococcus pneumoniae R6                  4,077,230       2,992,091        17.83%           21.57%    17.34%
NC_002737  Streptococcus pyogenes                       3,704,882       2,778,223        16.56%           19.81%    16.43%
NC_002578  Thermoplasma acidophilum                     3,129,812       2,602,761        15.51%           17.02%    8.84%
NC_002689  Thermoplasma volcanium                       3,169,608       2,590,718        15.44%           17.22%    10.30%
NC_000919  Treponema pallidum                           2,275,888       1,978,453        11.79%           12.69%    7.04%
NC_000853  Thermotoga maritima                          3,721,450       2,755,886        16.43%           19.89%    17.43%
NC_002162  Ureaplasma urealyticum                       1,503,438       948,274          5.65%            8.57%     34.06%
NC_002505  Vibrio cholerae chromosome I, chromosome II  8,066,854       5,383,520        32.09%           38.17%    15.94%
NC_002488  Xylella fastidiosa 9a5c                      5,358,610       3,996,398        23.82%           27.34%    12.88%


FIGURE 14.4 Frequency of presence of 9–14-mers in 76 microbial genomes.

FIGURE 14.5 Frequency of presence of 7–10-mers in 129 RNA viral genomes.

FIGURE 14.6 Frequency of presence of 7–10-mers in 48 DNA viral genomes.

[Plots for Figures 14.4 to 14.6, panels "Microbial Genomes", "RNA Virus Genomes", and "DNA Virus Genomes": frequency of presence of n-mers (0 to 1) vs. 4^n/M (0 to 25), with one series per n-mer length and the random boundary.]


considered (in the range of n, when the condition M



Let us provide a simple example based on three different genomes: (1) Salmonella typhi (NC_003198),

(2) Mycobacterium tuberculosis H37Rv (NC_000962), and (3) Bacillus subtilis (NC_000964). A complete

set of n-mers would contain 4^n n-mers, which, for n = 12, is 4^12 = 16,777,216. Based on our analysis,

Table 14.4 shows how many different 12-mers are contained in each of these three genomes. The number

N(n, G1, G2) of n-mers (n = 12) that appear in both genomes of each pair of species (G1, G2) is shown in Table

14.5. We can compare the probabilities of finding randomly picked 12-mers in each pair of genomes

with probabilities calculated using the multiplication rule. As seen from Table 14.5, the actual and calculated (expected) probabilities do not differ greatly from each other, which allows us to treat the

presence/absence of randomly picked 12-mers in these three genomes as statistically independent events.
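
The comparison between actual and expected pair-wise probabilities can be retraced directly from the counts reported in Tables 14.4 and 14.5. The sketch below only redoes that arithmetic; the dictionary layout and variable names are choices made for this example.

```python
TOTAL_12MERS = 4 ** 12  # 16,777,216 possible 12-mers

# Number of distinct 12-mers present in each genome (Table 14.4).
present = {
    "S. typhi": 5_813_330,
    "M. tuberculosis H37Rv": 4_361_508,
    "B. subtilis": 5_346_103,
}

# Number of 12-mers present in both genomes of a pair (Table 14.5).
shared = {
    ("S. typhi", "M. tuberculosis H37Rv"): 1_943_814,
    ("S. typhi", "B. subtilis"): 2_335_710,
    ("M. tuberculosis H37Rv", "B. subtilis"): 1_334_288,
}


def pair_statistics(g1, g2):
    """Return (actual, expected, ratio) for the joint presence of a random 12-mer."""
    actual = shared[(g1, g2)] / TOTAL_12MERS
    # Multiplication rule: expected joint probability if presence is independent.
    expected = (present[g1] / TOTAL_12MERS) * (present[g2] / TOTAL_12MERS)
    return actual, expected, actual / expected


if __name__ == "__main__":
    for g1, g2 in shared:
        actual, expected, ratio = pair_statistics(g1, g2)
        print(f"{g1} / {g2}: actual {actual:.1%}, expected {expected:.1%}, ratio {ratio:.2f}")
```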

Actual and expected pair-wise probabilities were calculated (Fofanov et al., 2002a, 2002b) in each of

the above-mentioned groups of genomes (170,000+ pairs in total). We were especially interested in the

range of n where p* = 5 to 50% of the total possible number of n-mers occurred. This range is different

for different genome sizes and can be determined from Figure 14.4. The analytic formula for the random

boundary also can be used to estimate this range:

$$ n^* = \frac{\log\left[ M \left(1 - p^*\right) / p^* \right]}{\log 4}. \tag{14.2} $$

Upper and lower bounds for sizes from 0.8 to 10 Mb, which are typical for microbial genomes, are shown in Table 14.6. In accordance with this, the value n = 12 seems to be the most reasonable one for all microbial genomes. For viral genomes, the value was found to be n = 7.

TABLE 14.4 The Frequency of Presence of 12-mers within the Three Microbial Genomes

Genome (G)                        Genome Length  Total Sequence  Number of Different 12-mers     p = N(12, G)/4^n
                                                 Length (bp)     Present in Genome: N(12, G)
Salmonella typhi                  4,809,037      9,618,074       5,813,330                       34.65%
Mycobacterium tuberculosis H37Rv  4,411,529      8,823,058       4,361,508                       26.00%
Bacillus subtilis                 4,214,814      8,429,628       5,346,103                       31.87%

TABLE 14.5 Actual and Predicted Simultaneous Presence of 12-mers within the Three Microbial Genomes: (1) Salmonella typhi, (2) Mycobacterium tuberculosis H37Rv, and (3) Bacillus subtilis

Case                            Number 12-mers  N(n, G1, G2)/4^n  Calculated Probability
                                                                  Assuming Independence
Present in genomes (1) and (2)  1,943,814       11.6%             9.0%
Present in genomes (1) and (3)  2,335,710       13.9%             11.0%
Present in genomes (2) and (3)  1,334,288       8.0%              8.3%

TABLE 14.6 The Optimal Length of n-mers (n*) for Different Genome Sizes and Frequencies of Presence (p*)

Total Sequence  n* Determined for Frequency  n* Determined for Frequency
Length (bp)     of Presence 50% (p* = 0.5)   of Presence 5% (p* = 0.05)
0.8 Mb          9.80                         11.93
2.0 Mb          10.47                        12.59
10.0 Mb         11.63                        13.75

For all 2800+ pairs of microbial genomes and the value of n = 12, the average ratio of actual and

expected probabilities was found to be 1.35 ± 0.61. For viral genomes and the corresponding value of

n = 7, the average ratio of actual and expected probabilities was found to be 1.06 ± 0.10 for 1100+ genome


pairs of DNA-based viruses and 1.04 ± 0.05 for 8100+ genome pairs of RNA-based viruses. This led us to the

conclusion that for this range of n, the presence of n-mers in different genomes, to a good approximation,

can be treated as independent events.
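
The choice of n-mer length from Equation 14.2 can be illustrated with a small calculation. The sketch below assumes the form n* = log[M(1 – p*)/p*]/log 4, which reproduces the entries of Table 14.6; the function name is hypothetical.

```python
import math


def optimal_nmer_length(total_length_bp, p_star):
    """n* from Equation 14.2 for total sequence length M and target frequency p*."""
    return math.log(total_length_bp * (1.0 - p_star) / p_star) / math.log(4.0)


if __name__ == "__main__":
    # Reproduce the rows of Table 14.6 (total sequence length in bp).
    for length in (0.8e6, 2.0e6, 10.0e6):
        upper = optimal_nmer_length(length, 0.5)   # 50% of n-mers present
        lower = optimal_nmer_length(length, 0.05)  # 5% of n-mers present
        print(f"M = {length / 1e6:4.1f} Mb: n* = {upper:.2f} (p* = 0.5), {lower:.2f} (p* = 0.05)")
```

Since n must be an integer, a practical design would pick a whole number inside the interval between the two bounds, which is how n = 12 arises for the microbial genome sizes considered here.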

The highest deviations between expected and actual probabilities were found for closely related

genomes. For 48 DNA-based viruses under consideration, using 7-mers, the highest ratio (185%) was

found for Duck hepatitis B virus (NC_001344) vs. Stork hepatitis B virus (NC_003325) with 8.1% expected

and 15.0% actual.

An example of closely related microbial genomes would be S. aureus N315 (NC_002745) vs. S. aureus

Mu50 (NC_002758) with 4.0% expected and 19.7% actual, a ratio of 491%. Another

extreme case was found for three microbial genomes: Chlamydophila pneumoniae CWL029 (NC_000922),

C. pneumoniae AR39 (NC_002179), and C. pneumoniae J138 (NC_002491), which have the highest

(eightfold) ratio of actual and expected probabilities for 12-mers (1.5% expected and 12.3% actual). For

the group containing 24 human chromosomes, pair-wise ratios of actual and expected probabilities of

14-mers were found to be 1.91 ± 0.16, maximum ratio being found for n = 20 and Y-chromosomes (expectation 2.9% vs. actual 6.9%).

Assuming that results for 250+ genomes are statistically significant, we expect similar behavior from many different (as yet unsequenced) genomes. Thus our analysis indicates that, in this case, one

may use relatively small sets of randomly picked n-mers for differentiating between different viruses

and organisms.

Let us further illustrate the idea by continuing our example for three microbial genomes. Let n* be

the size of n-mer, which fits the interval where from 5 to 50% of all possible n-mers show up for a

desirable range of genome lengths. In accordance with Table 14.6, we may choose the value n* = 12. Let

us randomly pick L 12-mers (say, L = 1000). Given a genome G1 with the frequency of presence of n-

mers p1, we expect that K = p1L n-mers present in G1 will appear also in our random set, forming a

“fingerprint” of G1 (in our example, we expect 50 < K < 500). The probability, ε, that the fingerprint of

G1 will exactly coincide with the fingerprint of some other genome G2 (with the frequency of presence of n-mers p2) is (Fofanov et al., 2002a; 2002b):

$$ \varepsilon = \left(1 - p_1 - p_2 + 2 p_{12}\right)^{L}. \tag{14.3} $$

Here p12 is the probability for the n-mer to be present in both genomes simultaneously.

Let us consider the numeric example mentioned in Tables 14.4 and 14.5 of two species that are far

from each other, Salmonella typhi vs. Mycobacterium tuberculosis H37Rv; p1 = 0.3465, p2 = 0.2600, p12 =

0.1160; with L = 1000, a remarkable accuracy of ε = 1.7 × 10^–204 can theoretically be achieved.

Given a desirable probability of error, ε, one can determine the appropriate size, L, of a random set

of n-mers that can be used for reliable identification of genomes as

. (14.4)

For related organisms, the genomes may contain large common parts. This means that p12 may be

close to p1 and p 2. To give a numeric example of close relatives, let us consider S. aureus N 315 vs. S. aureus

Mu50. Now p1 = 0.198, p 2 = 0.203, p12 = 0.197, and an accuracy of e = 10–10 can be achieved with L =

4451. We would like to stress that our analysis predicts a logarithmic dependence of the sampling or

microarray size, L, on the error probability, e . This feature is of principal importance for the estimation

procedure under discussion.Therefore, we can use practically any sufficiently random subset of n-mers of appropriate size for design

a microarray to diagnose an organism to which a given DNA/RNA sample belongs. Different sizes of n-mers

must be employed for recognition of different organisms based on their genome length. Values of n that

correspond to given intervals of genome lengths can be easily calculated using above formulas. In fact, only

11 different n values, 7 £ n £ 17, would be enough to cover a large variety of genome sizes from 1 kb to 9 Gb.

L p p p

=- - +( )

log

log

e1 2

1 2 12
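The two relations above, Equations 14.3 and 14.4, are simple enough to check numerically. The following is a minimal Python sketch, not code from the cited work; the function names are hypothetical and chosen only for illustration. It assumes, as the formulas do, that the L n-mers are sampled independently.

```python
import math


def agreement_probability(p1, p2, p12):
    """Probability that one random n-mer gives the same yes/no answer for
    both genomes: present in both (p12) or absent from both (1 - p1 - p2 + p12)."""
    return 1.0 - p1 - p2 + 2.0 * p12


def coincidence_probability(p1, p2, p12, L):
    """Eq. 14.3: probability that two fingerprints of L n-mers coincide exactly."""
    return agreement_probability(p1, p2, p12) ** L


def required_array_size(p1, p2, p12, eps):
    """Eq. 14.4, rounded up: smallest integer L whose coincidence
    probability does not exceed eps."""
    q = agreement_probability(p1, p2, p12)
    if not 0.0 < q < 1.0:
        raise ValueError("agreement probability must lie strictly between 0 and 1")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(q))


# Distant pair quoted in the text (S. typhi vs. M. tuberculosis H37Rv).
eps_far = coincidence_probability(0.3465, 0.2600, 0.1160, 1000)
print(f"{eps_far:.1e}")  # about 1.7e-204, as quoted in the text

# Array size needed for the same pair at an error probability of 1e-10.
print(required_array_size(0.3465, 0.2600, 0.1160, 1e-10))
```

Because L depends on log ε, tightening the error probability by many orders of magnitude increases the array size only linearly in the number of decades. For closely related genomes the agreement probability is close to 1, so L becomes very sensitive to the last digit of p12: with the three-decimal values quoted for the S. aureus pair this sketch returns 3278 rather than 4451, which suggests the quoted figure was obtained from unrounded probabilities.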




The important advantage of such an approach is that it can be used without a priori knowledge of the

sequence itself. This implies there is no need to perform the expensive and time-consuming process of

sequencing before array construction. It is enough to obtain the purified DNA, hybridize it on a suffi-

ciently random microarray chip, and check which n-mers show up. Taking into account how accessible

the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and

the fact that we do not need to determine quantitative values of expression (we need only a yes/no

answer), it should be possible to produce an essentially universal microbial/viral DNA chip.
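The yes/no identification procedure described above can be illustrated with a toy simulation. The Python sketch below is an idealized illustration under stated assumptions, not the authors' implementation: hybridization is treated as error-free exact substring matching on a single strand, the "genomes" are short random strings, and all names (`random_probe_set`, `fingerprint`, `identify`, `toy_A`, and so on) are hypothetical.

```python
import random

ALPHABET = "ACGT"


def random_probe_set(n, size, rng):
    """Draw `size` distinct random n-mers to act as the array probes."""
    probes = set()
    while len(probes) < size:
        probes.add("".join(rng.choice(ALPHABET) for _ in range(n)))
    return sorted(probes)


def fingerprint(sequence, probes):
    """Yes/no pattern recording which probes occur somewhere in the sequence."""
    n = len(probes[0])
    present = {sequence[i:i + n] for i in range(len(sequence) - n + 1)}
    return tuple(probe in present for probe in probes)


def identify(observed, reference_fingerprints):
    """Names of all reference genomes whose fingerprint matches exactly."""
    return [name for name, fp in reference_fingerprints.items() if fp == observed]


# Toy data: three random "genomes" of 2000 bases (an assumption for illustration).
rng = random.Random(0)
genomes = {
    name: "".join(rng.choice(ALPHABET) for _ in range(2000))
    for name in ("toy_A", "toy_B", "toy_C")
}

# A random array of 200 distinct 6-mers, built without looking at any genome.
probes = random_probe_set(n=6, size=200, rng=rng)
references = {name: fingerprint(seq, probes) for name, seq in genomes.items()}

# "Hybridize" an unknown sample (here, toy_B) and look it up by its fingerprint.
observed = fingerprint(genomes["toy_B"], probes)
print(identify(observed, references))
```

Note that the probe set is generated without reference to any genome sequence, which is the point made in the text: only the presence or absence pattern is needed. A practical design would also have to account for both strands and for hybridization errors, which this sketch ignores.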

14.4 Conclusions

In this article we have attempted to demonstrate how to use both physical and mathematical techniques

to explore design criteria of relevance to modern genetic analysis. In particular, DNA arrays allow

simultaneous, parallel detection and concentration measurement of thousands of target strands. In this

article we have concentrated on two areas of concern in the design and analysis of such experiments: the

information content of the polymer and the physical/chemical properties of the detection. Microarrays

are now producing amounts of raw biological (and biophysical) data on a scale not seen before in the

biological arena. The bioinformatics solutions to problems associated with the analysis of data on this

scale will remain a challenge for some time. The physical design of efficient devices to conduct such

experiments requires consideration of the chemistry and physics of often highly charged species near

prepared surfaces as well as the sequence. This article was aimed at demonstrating the current state of the

theory, in the hope that many will find applications for these principles.

ACKNOWLEDGMENTS

This work was partially supported by grants from NIH, Texas Coordinating Board, and the Robert A.

Welch Foundation to BMP. TBL was a fellow at the Keck Center for Computational Biology. BMP and

YF acknowledge the Texas Center for Learning and Computation for seed funding, and also NPACI for

computing time and support at the San Diego Supercomputing Center. We also thank the Molecular

Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Labo-

ratory, a national scientific user facility sponsored by the U.S. Department of Energy’s Office of Biological

and Environmental Research and located at the Pacific Northwest National Laboratory. Pacific Northwest

is operated for the Department of Energy by Battelle. BMP and AV also thank Accelrys for providing

visualization software through the Institute for Molecular Design.

References

Bloomfield, V.A., Crothers, D.M., and Tinoco, I., Nucleic Acids: Structures, Properties and Functions,

University Science Books, Sausalito, CA, 1999.

Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy,

M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S.R., Moon, K.,

Burcham, T., Pallas, M., DuBridge, R.B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K., Gene

expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nat.

Biotechnol., 18, 630–634, 2000.

Cutler, D.J., Zwick, M.E., Carrasquillo, M.M., Yohn, C.T., Tobin, K.P., Kashuk, C., Mathews, D.J., Shah,

N.A., Eichler, E.E., Warrington, J.A., and Chakravarti, A., High-throughput variation detection

and genotyping using microarrays, Genome Res., 11, 1913–1925, 2001.

Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., and Fertil, B., Genomic signature: characterization and

classification of species assessed by chaos game representation of sequences, Mol. Biol. Evol., 16,

1391–1399, 1999.

Fislage, R., Differential display approach to quantitation of environmental stimuli on bacterial gene

expression, Electrophoresis, 19, 613–616, 1998.




Fislage, R., Berceanu, M., Humboldt, Y., Wendt, M., and Oberender, H., Primer design for a prokaryotic

differential display RT-PCR, Nucleic Acids Res., 25, 1830–1835, 1997.

Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,

B.M., How independent are the appearances of n-mers in different genomes? Submitted, 2002a.

Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,

B.M., Short subsequences in genomes: how random are they? Submitted, 2002b.

Forman, E.J., Walton, I.D., Stern, D., Rava, R.P., and Trulson, M.O., Thermodynamics of duplex formation

and mismatch discrimination of photolithographically synthesized oligonucleotide arrays, ACS

Symp. Ser., 682, 206–228, 1998.

Guo, Z., Guilfoyle, R.A., Thiel, A.J., Wang, R., and Smith, L.M., Direct fluorescence analysis of genetic

polymorphisms by hybridization with oligonucleotide arrays on glass supports, Nucleic Acids Res.,

22, 5456–5465, 1994.

Heaton, R.J., Peterson, A.W., and Georgiadis, R.M., Electrostatic surface plasmon resonance: direct

electric field-induced hybridization and denaturation in monolayer nucleic acid films and label-

free discrimination of base mismatches, Proc. Natl. Acad. Sci. U.S.A., 98, 3701–3704, 2001.

Karlin, S. and Ladunga, I., Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. U.S.A.,

91, 12832–12836, 1994.

Karlin, S. and Mrazek, J., Compositional differences within and between eukaryotic genomes, Proc. Natl.

Acad. Sci. U.S.A., 94, 10227–10232, 1997.

Nakashima, H., Nishikawa, K., and Ooi, T., Differences in dinucleotide frequencies of human, yeast, and

Escherichia coli genes, DNA Res., 4, 185–192, 1997.

Nakashima, H., Ota, M., Nishikawa, K., and Ooi, T., Genes from nine genomes are separated into their

organisms in the dinucleotide composition space, DNA Res., 5, 251–259, 1998.

Nguyen, T.T., Grosberg, A.Y., and Shklovskii, B.I., Screening of a charged particle by multivalent coun-

terions in salty water: strong charge inversion, J. Chem. Phys., 113, 1110–1125, 2000.

Nielsen, P.E., Peptide nucleic acid: a versatile tool in genetic diagnostics and molecular biology, Curr.

Opinion Biotech., 12, 16–20, 2001.

Nussinov, R., Doublet frequencies in evolutionary distinct groups, Nucleic Acids Res., 12, 1749–1763, 1984.

Peterson, A.W., Heaton, R.J., and Georgiadis, R.M., The effect of surface probe density on DNA hybrid-

ization, Nucleic Acids Res., 29, 5163–5168, 2001.

Saenger, W., Principles of Nucleic Acid Structure, Springer-Verlag, New York, 1984.

Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., and Coster, J., Capturing whole-genome

characteristics in short sequences using a naive Bayesian classifier, Genome Res., 11, 1404–1409,

2001.

SantaLucia, J., Allawi, H.T., and Seneviratne, P.A., Improved nearest-neighbor parameters for predicting

DNA duplex stability, Biochemistry, 35, 3555–3562, 1996.

Shchepinov, M.S., Case-Green, S.C., and Southern, E.M., Steric factors influencing hybridization of

nucleic acids to oligonucleotide arrays, Nucleic Acids Res., 25, 1155–1161, 1997.

Southern, E.M., DNA microarrays — history and overview, Methods Mol. Biol., 170, 1–15, 2001.

Steel, A.B., Herne, T.M., and Tarlov, M.J., Electrochemical quantitation of DNA immobilized on gold,

Anal. Chem., 70, 4670–4677, 1998.

Su, H.J., Surrey, S., McKenzie, S.E., Fortina, P., and Graves, D.J., Kinetics of heterogeneous hybridization

on indium tin oxide surfaces with and without an applied potential, Electrophoresis, 23, 1551–1557,

2002.

Vainrub, A. and Pettitt, B.M., Surface electrostatic effects in oligonucleotide microarrays: control and

optimization of binding thermodynamics, Biopolymers, 68, 265–270, 2003.

Vainrub, A. and Pettitt, B.M., Thermodynamics of association to a molecule immobilized in an electric

double layer, Chem. Phys. Lett., 323, 160–166, 2000.

Vainrub, A. and Pettitt, B.M., Coulomb blockage of hybridization in two-dimensional DNA arrays, Phys.

Rev., E66, art. no. 041905, 2002.




Vasiliskov, V.A., Prokopenko, D.V., and Mirzabekov, A.D., Parallel multiplex thermodynamic analysis of

coaxial base stacking in DNA duplexes by oligonucleotide microchips, Nucleic Acids Res., 29,

2303–2313, 2001.

Watterson, J.H., Piunno, P.A., Wust, C.C., and Krull, U.J., Effects of oligonucleotide immobilization

density on selectivity of quantitative transduction of hybridization of immobilized DNA, Langmuir,

16, 4984–4992, 2000.
