Chapter 14 Book Biomedical Technology and Devices Handbook


  • 8/17/2019 Chapter 14 Book Biomedical Technology and Devices Handbook


    14 Theoretical Considerations for the Efficient Design of DNA Arrays

    CONTENTS

    14.1 Introduction

    14.2 Surface Design
    Role of Interface Electrostatic Interactions • Sensitivity Enhancement • Multiplexed SNPs Detection

    14.3 DNA Biosensors
    Presence of Short Subsequences in the Genomes • Correlation of Presence of Short Subsequences between Genomes

    14.4 Conclusions

    Acknowledgments

    References

    14.1 Introduction

    The use of combinatorial or array-based detection (and synthesis) technologies has qualitatively changed

    many areas of bioscience in the last several years. These technologies include DNA, protein, and combi-

    natorial chemistry arrays. Of these, DNA arrays, designed to determine gene content and expression

    levels in living cells, have shown the most potential. DNA arrays allow simultaneous, parallel measurement

    of thousands of interactions between target strands and genome-derived probes. Two areas of concern are the design and analysis of such experiments. Microarrays are rapidly producing enormous amounts

    of raw data. The bioinformatics solutions to problems associated with the analysis of data on this scale

    are a major current challenge. In addition, designing such experiments requires consideration of not only 

    the genomic information required to answer a given problem but also consideration of the chemistry 

    and physics of highly charged species near prepared surfaces.

    On the medical side, DNA arrays may someday help us to better understand complex issues concerning

    human health and disease. Among other things, they should help us separate out the effects of one's genes

    vs. environment and life-style to help usher in the individualized molecular medicine of the future.

    A current important practical application of DNA arrays is in biosensors used to determine to which organism a given DNA/RNA sample belongs. This use of microarrays is based on specific properties of viral and microbial genomes and on the ability of arrays to provide information on the presence or absence of thousands of short subsequences in a given genome simultaneously.

    Arnold Vainrub, University of Houston

    Tong-Bin Li, University of Houston

    Yuriy Fofanov, University of Houston

    B. Montgomery Pettitt, University of Houston

    © 2004 by CRC Press LLC


    While DNA arrays will be an important tool for some time, it should be emphasized that DNA array 

    technology is still at an early stage of development. It is cluttered with heterogeneous technologies and

    data formats as well as basic issues of signal to noise, fidelity, calibrations, and statistical significance that

    are still being sorted out. Until these issues are resolved and standardized, it will not be possible to define

    accurately the complete genetic regulatory network of even a well-studied prokaryotic cell system.

    DNA arrays were introduced as a high throughput technology for performing hybridization assays

    based on formation of a double helix to a surface-immobilized single-strand DNA probe according to

    Watson-Crick pairing rules. In its current high-density format, in a single microarray experiment the

    hybridization is performed with up to hundreds of thousands of different probes, producing a tremendous

    volume of information on the assayed DNA sequences and their abundance in the tested target. Typically, a DNA microarray hybridization experiment contains 10^7 to 10^10 DNA probe molecules of a sequence to be tested, immobilized in a ~50-µm-diameter spot on a prepared glass surface, and thus may include about 10^4 to 10^5 different probe spots per square centimeter. Usually, the probes are oligonucleotides 8 to 80 bases long, tethered by one end through a linker molecule to the surface. DNA microbeads are similar to microarrays, but the probes are tethered to the surface of a micron-size glass bead (Brenner et al., 2000), similar to peptide technology.

    The use of DNA microarrays in clinical practice is a rapidly growing area. In the present work we will

    concentrate our attention on two tasks: electrostatic effects in solution DNA hybridization at surfaces

    and the ability of microarrays to serve as biosensors based on information of the presence or absence of 

    certain subsequences.

    14.2 Surface Design

    14.2.1 Role of Interface Electrostatic Interactions

    Electrostatic effects in solution DNA hybridization (formation of the double helix by two complementary DNA single strands) are well known (Saenger, 1984; Bloomfield, 1999). They appear simply because DNA is a negatively charged polymeric ion, so electrostatic repulsion occurs between single strands (ssDNA) as well as between double helices (dsDNA). Each phosphate group PO2– of the outer dsDNA backbone bears a single negative charge, and therefore the typical dsDNA B-helix is often modeled by a 2-nm-diameter cylinder with a high negative surface charge of about six electron charges per nanometer of length. The repulsion diminishes when added salt cations are present in solution and partly shield the electrostatic interactions. Thus, the stability of dsDNA against dehybridization into ssDNAs increases with the solution ionic strength; typically, the melting temperature of dsDNA increases by 10 to 20°C as the added salt concentration grows tenfold. For instance, in human blood plasma, concentrations of approximately 150 mM for Na+ and 2.5 mM for Ca2+ cations produce an electrostatic screening length of about 1 nm and help make the chromosomal dsDNA stable.
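The screening-length estimate for blood plasma can be checked with the standard Debye-length formula. The sketch below is illustrative only: it assumes water at 25°C with a relative permittivity of 78.5, and it assumes that the cations are balanced by a monovalent anion such as Cl–. Neither assumption comes from the chapter, and `debye_length_nm` is a hypothetical helper name.

```python
import math

# Physical constants (SI units)
E_CHARGE = 1.602176634e-19    # elementary charge, C
K_B = 1.380649e-23            # Boltzmann constant, J/K
N_A = 6.02214076e23           # Avogadro constant, 1/mol
EPS_0 = 8.8541878128e-12      # vacuum permittivity, F/m


def debye_length_nm(ions, temp_k=298.15, eps_r=78.5):
    """Debye screening length in nm.

    ions: list of (concentration in mol/L, charge number) pairs.
    temp_k and eps_r are assumed values for water at 25 C.
    """
    # Ionic strength I = 1/2 * sum(c_i * z_i^2), converted from mol/L to mol/m^3
    ionic_strength = 0.5 * sum(c * z * z for c, z in ions) * 1000.0
    kappa_sq = (2.0 * N_A * E_CHARGE ** 2 * ionic_strength
                / (eps_r * EPS_0 * K_B * temp_k))
    return 1e9 / math.sqrt(kappa_sq)


# Plasma-like electrolyte: 150 mM Na+, 2.5 mM Ca2+, plus an assumed
# 155 mM of monovalent anion to keep the solution electrically neutral.
plasma = [(0.150, +1), (0.0025, +2), (0.155, -1)]
screening_nm = debye_length_nm(plasma)
print(f"Debye length: {screening_nm:.2f} nm")
```

Under these assumptions the result comes out a little below 1 nm, which is consistent with the order-of-magnitude figure quoted in the text.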

    In addition to the above-mentioned DNA-DNA repulsion, two new electrostatic interactions appear for

    a DNA microarray, namely, the DNA-surface interaction and repulsion between the assayed nucleic acid

    and the on-surface layer of DNA probes. The nucleic acid–surface interactions operate on surface-tethered

    DNA probes, the assayed nucleic acid (which is a biological RNA or a cDNA prepared from it by reverse transcription), and also the dsDNA formed as their hybrid. Our theoretical analysis of this DNA-surface electrostatic interaction shows its possible important role in on-surface hybridization thermodynamics (Vainrub

    and Pettitt, 2000, 2003). Recent experiments (Heaton et al., 2001; Su et al., 2002) confirm the theory and

    demonstrate complete control of hybridization and melting by an electric potential applied to the oligonucleotide array on a gold film surface. Indeed, a negative surface potential of –300 mV induces the melting of prehybridized dsDNA, whereas a positive potential promotes the hybridization of complementary DNA targets (Heaton et al., 2001). The experiments can be simply understood (Vainrub and Pettitt, 2000, 2003) as a result of the electrostatic repulsion between the negatively charged surface and the ssDNA target, which tends to melt the dsDNA and remove the target from the surface; attraction to the positively charged

    surface stabilizes the dsDNA. It is important that the probe DNA is tethered to the surface (typically through


    a covalent bond and linker molecule), and thus the dsDNA cannot be displaced or diffuse away from the negatively charged surface; instead, it must melt to release a mobile target ssDNA that can drift away from the surface and so decrease the electrostatic energy. Interestingly, DNA-surface repulsion (attraction) occurs even for noncharged dielectric (metallic) surfaces because of the well-known electrostatic induction phenomenon (Vainrub and Pettitt, 2000). Evidently, the DNA-surface electrostatics can be regulated by the surface charge and material (dielectric or metallic), and can be almost canceled by using a long linker molecule and/or a high ionic strength hybridization solution. In previous work (Vainrub and Pettitt, 2000, 2003) we considered in detail

    the DNA-surface electrostatics and its optimization in oligonucleotide microarrays.

    Here we focus on another type of on-array electrostatic interaction: the repulsion between the assayed nucleic acid and the array of surface-tethered DNA probes, both of which bear negative charge (Vainrub and Pettitt, 2002). First, we describe the origin of this interaction, which is specific to on-array hybridization and does not occur in homogeneous solution hybridization assays. To obtain sufficient numbers of dsDNA hybrids for reliable detection, the oligonucleotide probe molecules are quite crowded on the array, with a surface density typically from 10^12 to 10^14 probes per square centimeter, corresponding to a mean distance between neighbors on the surface from 10 to 1 nm, respectively. Therefore, a target cDNA closely approaches not only its hybridization partner but also the surrounding probe oligonucleotides, which contribute to the electrostatic repulsion of the target. Recently we considered this effect (Vainrub and Pettitt, 2002) and derived an equation for the on-array hybridization binding isotherm:

    \[ C_0 = \frac{\theta}{1-\theta}\,\exp\!\left(\frac{\Delta G_0}{RT}\right)\exp\!\left[\frac{V_S N_P \left(Z_P + \theta Z_T\right)}{RT}\right] \qquad (14.1) \]

    Here θ (0 < θ < 1) is the hybridization efficiency, i.e., the fraction of hybridized probes, C_0 is the assayed DNA concentration, ΔG_0 = ΔH_0 – TΔS_0 is the Gibbs free energy (ΔH_0 the enthalpy, ΔS_0 the entropy) of dsDNA formation in homogeneous solution, T is the temperature, and R is the universal gas constant. Z_P and Z_T denote the probe and target lengths (the number of nucleotides), and N_P is the probe surface density. V_S is the interaction strength, which is estimated (Vainrub and Pettitt, 2002) both theoretically and from the experiments as about 10^-14 J m^2/mol for 25-mer long probe oligonucleotides in 1 M NaCl solution.

    FIGURE 14.1 Hybridization binding isotherm at different surface density of 25-mer probe oligonucleotides. The curve number notes the surface density in 10^12 probes/cm^2 units. The number 0 corresponds to the Langmuir isotherm. [Plot: hybridization efficiency (0.0 to 1.0) vs. target concentration (*exp[ΔG0/RT0] moles), 0 to 10,000; curves 0, 1, 2, 4, 6, 8, 10, 12.]

    The array hybridization isotherm, Equation 14.1, as demonstrated in Figure 14.1, successfully explains the well-known experiments (Forman et al., 1998; Guo et al., 1994; Shchepinov et al., 1995), showing


    the temperature decrease and strong broadening of the dsDNA melting on array compared to solution.

    Also, the theory quantitatively accounts for the recent 25-mer (Peterson et al., 2001) and 20-mer (Watter-

    son et al., 2000) oligonucleotide array experimental data. Below we review how this theory can be used

    in microarray optimization. It should be noted that, in addition to the discussed electrostatic forces, the

    other interface interactions, e.g., hydration effects, van der Waals forces, and steric hindrance (packing),

    could contribute under specific array conditions. However, the interface electrostatic effects often dom-

    inate the interactions. In combination with understanding the probabilistic basis of the sequences for

    analysis (below), we now have the ability to bring some interesting concepts to bear on the design of 

    biochips.
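The isotherm of Equation 14.1 is implicit in the hybridization efficiency, so a numerical solve is needed to draw curves like those in Figure 14.1. The sketch below is one possible way to do this, not the authors' code. It assumes the isotherm has the form x = θ/(1 – θ) · exp[V_S N_P (Z_P + θ Z_T)/RT], where x = C_0 exp(–ΔG_0/RT) is the normalized target concentration, and it uses the parameter values quoted in the text (25-mers, V_S = 10^-14 J m^2/mol, 25°C). The function name and the bisection approach are illustrative choices.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def hybridization_efficiency(x, probe_density_cm2, z_p=25, z_t=25,
                             v_s=1e-14, temp_k=298.15):
    """Solve the assumed isotherm for theta by bisection.

    x is the normalized target concentration C_0 * exp(-dG_0 / (R T)).
    Assumed isotherm: x = theta/(1-theta) * exp(a * (z_p + theta * z_t)),
    with a = V_S * N_P / (R T) and N_P converted from cm^-2 to m^-2.
    """
    a = v_s * probe_density_cm2 * 1e4 / (R_GAS * temp_k)

    def log_rhs(theta):
        # Logarithm of the right-hand side; strictly increasing in theta
        return math.log(theta) - math.log1p(-theta) + a * (z_p + theta * z_t)

    target = math.log(x)
    lo, hi = 1e-15, 1.0 - 1e-15
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_rhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Efficiency at one normalized concentration for several probe densities;
# density 0 recovers the Langmuir isotherm theta = x / (1 + x).
for density in (0.0, 1e12, 4e12, 12e12):
    theta = hybridization_efficiency(100.0, density)
    print(f"N_P = {density:.1e} cm^-2 -> theta = {theta:.4f}")
```

Because the right-hand side increases monotonically with θ, bisection is sufficient and avoids any dependence on external solvers.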

    14.2.2 Sensitivity Enhancement

    Considering the interaction-free energies involved in surface-bound DNA devices, several factors affecting

    binding and performance were apparent. We found that the concentration dependence of the electrostatic

    repulsion between the assayed target and probe array affects the sensitivity and dynamic range of DNA

    microarrays. Calculated from our theory, Figure 14.2 shows the number of hybrids θN_P as a function of the target concentration at different probe surface densities N_P, assuming the same array parameters Z = 25, V_S = 10^-14 J m^2/mol, and room temperature T = 25°C as in Figure 14.1. For microarray assays in the low target concentration regime, the strongest signals (curve 1 in Figure 14.2) correspond to a probe density of about 10^12 cm^-2. As seen in the insert to Figure 14.2, the theoretical sensitivity peak is rather narrow, suggesting that the probe density on the surface in microarrays should be thoroughly optimized for each surface preparation and solution condition. This result is in accord with experimental observations of a clear signal peak in a similar probe density range (Steel et al., 1998) and a weaker signal at higher probe densities (Peterson et al., 2001). This means the dynamic range near higher target concentrations can be expanded by an increase of the probe density at the expense of a substantial decrease in sensitivity. The width of the peak may also depend on other factors that could be important under different conditions.

    Explicit control of the electrostatic interactions by microscopic or macroscopic field generation is

    therefore of obvious importance for optimization of microarrays. Suppression of the Coulomb repulsion

    could, in favorable circumstances, increase the sensitivity. We predict this could be achieved using external

    fields, charged molecular surface preparations, and in three-dimensional arrays using probe immobili-

    zation in gels, which indeed show solution-like hybridization thermodynamics (Vasiliskov et al., 2001),

    FIGURE 14.2 Number of hybridized probes as a function of the normalized target concentration at different surface density of 25-mer probe oligonucleotides. The curve number notes the surface density in 10^12 probes/cm^2 units. Insert: Number of hybrids vs. probe surface density at the normalized target concentration 0.1. [Plot: hybrid density (1/cm^2), 10^9 to 10^12, vs. target concentration (*exp[ΔG0/RT0] moles), 10^-2 to 10^4; curves 0.1, 1, 2, 4, 6, 8. Insert horizontal axis: probe density (1/cm^2), 0 to 1 × 10^13.]


    but suffer from slow hybridization and washing kinetics. For two-dimensional arrays, the use of multivalent counterions to enhance the Coulomb screening and reduce the repulsion (Nguyen et al., 2000) may be important, as may the use of a positive electrostatic potential at the surface (Vainrub and Pettitt, 2000). In addition, replacement of DNA probes by noncharged peptide nucleic acids (PNA) (Nielsen, 2001) provides an interesting chemical way to lessen the unfavorable electrostatic interaction. Other complications arise that make a detailed analysis of such a heteroduplex beyond the scope of our

    present discussion.
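The sensitivity peak described for the insert to Figure 14.2 can be explored numerically by sweeping the probe density at a fixed normalized target concentration of 0.1. The sketch below is a rough illustration, not the authors' calculation. It assumes the isotherm form x = θ/(1 – θ) · exp[V_S N_P (Z_P + θ Z_T)/RT] with the parameters quoted in the text; the helper names and the density grid are hypothetical choices.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def theta_from_isotherm(x, n_p_cm2, z_p=25, z_t=25, v_s=1e-14, temp_k=298.15):
    """Bisection solve of the assumed isotherm for the efficiency theta."""
    a = v_s * n_p_cm2 * 1e4 / (R_GAS * temp_k)  # N_P converted to m^-2
    target = math.log(x)
    lo, hi = 1e-15, 1.0 - 1e-15
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        value = math.log(mid) - math.log1p(-mid) + a * (z_p + mid * z_t)
        if value < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def hybrid_density(x, n_p_cm2):
    """Number of hybrids per cm^2: theta * N_P."""
    return theta_from_isotherm(x, n_p_cm2) * n_p_cm2


# Sweep probe densities from 1e11 to 1e13 cm^-2 in steps of 1e11
densities = [i * 1e11 for i in range(1, 101)]
best_density = max(densities, key=lambda n: hybrid_density(0.1, n))
print(f"Peak signal at about {best_density:.1e} probes/cm^2")
print(f"Hybrids at peak: {hybrid_density(0.1, best_density):.2e} per cm^2")
```

With these assumed parameters the maximum falls close to 10^12 probes/cm^2, in line with the density quoted in the text for the strongest low-concentration signal.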

    14.2.3 Multiplexed SNPs Detection

    In contrast to gene expression profiling, the Coulomb hybridization blockage plays a positive role in on-

    array single nucleotide polymorphism (SNP) genotyping and provides an interesting possibility for

    multiplexed SNPs detection. Given a reasonable estimate of the mean SNP frequency in a human

    chromosome DNA of about 1 per 1000 nucleotide sites (Cutler et al., 2001), the individual genomes may 

    differ by several millions of SNPs. Therefore, highly multiplexed detection using microarrays is very 

    important for high throughput large-scale SNP genotyping.

    The principle of on-array multiplexed SNPs genotyping is demonstrated in Figure 14.3. For 20-mer perfectly matched duplexes with oligomer L 5′-CTGAA CGGTA GCATC TTGAC-3′ and oligomer H 5′-CTGAG CGGTA GCACC GCGAC-3′, the melting curves in solution at 5 nM oligonucleotide concentration with 1 M added NaCl salt are shown in Figure 14.3 (left side). The H duplex (Tm = 72.5°C, 70% of GC-bases) is more stable than the L duplex (Tm = 62.9°C, 50% of GC-bases) because of the higher GC-base content (SantaLucia et al., 1996). In addition to the matched L and H, Figure 14.3 also shows the melting curves for the L1 and H1 SNPs corresponding to the A for T single nucleotide replacement at the tenth position from the 5′-end. As a practical example, the conditions for an SNP detection are defined as at least 1% hybridization efficiency signal strength and a 1.5 times discrimination ratio for the match/mismatch signals. The resulting detection temperature ranges of L and H SNPs are shown in Figure 14.3 by the bars. Since in solution the L and H ranges do not overlap, the two SNPs cannot be detected

    FIGURE 14.3 Principle of multiplexed SNPs detection on DNA biochip. The melting curves of matched 20-mers

    L and H and their single mismatches L1 and H1. The (L, L1) bar shows the temperature range where both detection

    and discrimination between L and L1 duplexes is possible (see text); the (H, H1) bar indicates the similar range for

    H and H1. In solution (left figure) the (L, L1) and (H, H1) ranges do not overlap, but on DNA biochip (right figure)

    the overlap occurs and allows detection of both (L, L1) and (H, H1) SNPs in a single fixed temperature experiment.

    [Figure 14.3 panels: hybridization efficiency (0.0 to 1.0) vs. temperature (K) for duplexes L1, L, H1, and H. Left, in solution (320 to 360 K): the detection ranges do not overlap; impossible to detect both SNPs. Right, on biochip at probe density 1.2 × 10^13 oligos/cm^2 (300 to 340 K): detection range for both SNPs.]


    in a single temperature assay. However, on an array with a probe surface density N_P = 1.2 × 10^13 cm^-2, the above-mentioned broadening of the melting curve increases the detection ranges for the L and H SNPs and makes them both detectable in the overlap temperature region, as shown in Figure 14.3 (right side). This example illustrates the principle behind our suggested multiplexed detection of SNPs that differ by up to 10°C in the melting temperature. Further extensions of the diversity of on-array genotyped SNPs can be

    achieved using higher probe surface density to make the melting transition even more broad.
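The GC contents quoted for the two example oligomers can be verified directly from their sequences. The short sketch below does only that; `gc_fraction` is a hypothetical helper name, and the sequences are copied from the text.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (spaces are ignored)."""
    bases = seq.replace(" ", "").upper()
    return (bases.count("G") + bases.count("C")) / len(bases)


OLIGO_L = "CTGAA CGGTA GCATC TTGAC"  # oligomer L from the text
OLIGO_H = "CTGAG CGGTA GCACC GCGAC"  # oligomer H from the text

gc_l = gc_fraction(OLIGO_L)
gc_h = gc_fraction(OLIGO_H)
print(f"L: {gc_l:.0%} GC, H: {gc_h:.0%} GC")  # prints "L: 50% GC, H: 70% GC"
```

The 20-percentage-point difference in GC content is what separates the two solution melting temperatures by roughly 10°C in this example.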

    14.3 DNA Biosensors

    14.3.1 Presence of Short Subsequences in the Genomes

    Statistical analysis of the appearance of short subsequences in different DNA sequences, from individual genes to full genomes, is important for a variety of reasons. Applications include PCR primer design (Fislage, 1998; Fislage et al., 1997) as well as microarray probe design (Southern, 2001). Several attempts (Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al., 1997; Nakashima et al., 1998; Nussinov, 1984; Sandberg et al., 2001) have been made to employ the frequency distribution of short subsequences (n-mers) to identify species with relatively short genome sizes (microbial). In such an approach, the shape of the frequency distribution for certain short subsequences, 2–4-mers (Deschavanne et al., 1999; Karlin and Ladunga, 1994; Karlin and Mrazek, 1997; Nakashima et al., 1997; Nakashima et al., 1998; Nussinov, 1984) and 8–9-mers (Deschavanne et al., 1999; Sandberg et al., 2001), has been used to decide which microbial genome one is dealing with, based on a given piece of a genome or a whole genome.

    Many sequencing projects are in progress and more full genomes have recently become available. The

    several hundred projects completed so far provide sufficient material to consider them from a statistical

    viewpoint. Yet, we are still far from having a complete or even reasonable statistical picture. There are

    simply too many species yet to be sequenced to obtain globally relevant statistical answers.

    Recently (Fofanov et al., 2002a, 2002b), a comparative statistical analysis of the presence/absence of all possible short n-mers (7 to 20 nucleotides long) for more than 250 complete genomes was performed in this group. The set under consideration included 76 complete microbial genome sequences with sizes ranging from 0.58 to 7.04 Mb and 176 viral genomes (128 RNA-containing viruses with genome sizes from 0.32 to 130.76 kb and 48 DNA-containing viruses with genome sizes from 2.0 to 671.19 kb) as well

    as complete genomes of five multicellular organisms: Caenorhabditis elegans  (99.99 Mb), Drosophila

    melanogaster  (119.98 Mb), Oryza sativa (Rice, 255.87 Mb), Schizosaccharomyces pombe (12.49 Mb), and

    Homo sapiens (human, 2.875 Gb) genomes. A complete list of genomes and all supplementary materials

    mentioned below can be found on the University of Houston Bioinformatics lab website http://www.bioinfo.uh.edu/publications/how_random_are_genomes.

    Tables 14.1 and 14.2 show representative results for some of the analyzed genomes (microbial and viral), for n = 8 and 12 using our techniques. It is worth mentioning that as n increases, the total number of possible n-mers, 4^n, strongly exceeds the total sequence length M, and most of the possible n-mers do not appear at all because the maximum number of n-mers contained in this sequence is M – n + 1 ≈ M. Moreover, for a reasonably high ratio, 4^n/M, most of the n-mers that appear tend to appear only once, in accordance with the fact that the number of present n-mers becomes very close to M (see Tables 14.1,

    14.2, and supplementary data on the above-mentioned Web site). That is why it was decided to use the

    statistics for “presence/absence” in our method of analysis, instead of the usual “frequency of appearance,”

    which is reasonable for short n-mers (total sequence length M


    “genome” is also shown for comparison in all figures as a solid line. One can observe the extraordinary similarity between these plots. All of the different genomes form a well-defined pattern, when plotted

    against the ratio 4^n/M and not against the size of the genome or the length of the n-mer separately.
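The presence/absence statistic used throughout this section can be sketched in a few lines. The example below is a minimal illustration rather than the authors' software: it scans a single string only, whereas the total sequence lengths in the tables are twice the genome lengths, which suggests that both strands were counted. The function names and the toy sequence are hypothetical.

```python
def present_nmers(sequence, n):
    """Return the set of distinct n-mers that occur in the sequence."""
    seq = sequence.upper()
    # A sequence of length M contains at most M - n + 1 n-mers
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def presence_frequency(sequence, n):
    """Fraction of all 4^n possible n-mers that are present."""
    return len(present_nmers(sequence, n)) / 4 ** n


toy = "ACGTACGTGGCATTACG"  # hypothetical toy sequence, not a real genome
for n in (2, 3, 4):
    count = len(present_nmers(toy, n))
    print(f"n = {n}: {count} distinct n-mers, "
          f"frequency of presence = {presence_frequency(toy, n):.3f}")
```

Storing n-mers in a set records only presence or absence, which matches the statistic plotted against 4^n/M; a dictionary of counts would be needed for the frequency-of-appearance statistic instead.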

    For much longer genomes of multicellular organisms, practically all n-mers for n < 12 are present.

    Therefore, we chose to calculate the number of distinct 13–20-mers present in each genome (see Figure

    14.7 and Table 14.3). These results point to the conclusion that the presence of n-mers in all genomes

    TABLE 14.1 Frequency of Presence of 8-mers and Self-Similarity* for Several Viral Genomes

    Accession | Genome | Total Sequence Length (bp) | Number of Present 8-mers | Frequency of Present 8-mers | Random Boundary | Self-Similarity
    NC_001436 | Human T-cell lymphotropic virus type 1 | 17,014 | 13,739 | 20.96% | 22.86% | 8.31%
    NC_001707 | Hepatitis B virus | 6,430 | 5,963 | 9.10% | 9.35% | 2.64%
    NC_001503 | Mouse mammary tumor virus | 17,610 | 14,307 | 21.83% | 23.56% | 7.35%
    NC_001547 | Sindbis virus | 11,703 | 10,431 | 15.92% | 16.35% | 2.67%
    NC_001434 | Hepatitis E virus | 7,176 | 6,517 | 9.94% | 10.37% | 4.12%
    NC_003312 | Swine hepatitis E virus | 7,257 | 6,608 | 10.08% | 10.48% | 3.81%
    NC_001489 | Hepatitis A virus | 7,478 | 6,543 | 9.98% | 10.78% | 7.42%
    NC_001433 | Hepatitis C virus | 9,413 | 8,480 | 12.94% | 13.38% | 3.29%
    NC_001653 | Hepatitis D virus | 1,682 | 1,608 | 2.45% | 2.53% | 3.17%
    NC_001802 | Human immunodeficiency virus type 1 | 9,181 | 7,725 | 11.79% | 13.07% | 9.83%
    NC_003461 | Human parainfluenza virus 1 | 15,600 | 12,242 | 18.68% | 21.18% | 11.82%
    NC_001796 | Human parainfluenza virus 3 | 15,462 | 11,506 | 17.56% | 21.02% | 16.46%
    NC_003443 | Human parainfluenza virus 2 | 15,646 | 12,702 | 19.38% | 21.24% | 8.74%

    * See the definition in the text.

    TABLE 14.2 Frequency of Presence of 12-mers and Self-Similarity for Several Microbial Genomes

    Accession | Genome | Total Sequence Length (bp) | Number of Present 12-mers | Frequency of Present 12-mers | Random Boundary | Self-Similarity
    NC_000964 | Bacillus subtilis | 8,429,628 | 5,346,103 | 31.87% | 39.50% | 19.32%
    NC_002696 | Caulobacter crescentus | 8,033,894 | 3,399,234 | 20.26% | 38.05% | 46.75%
    NC_000913 | Escherichia coli K12 | 9,278,442 | 5,695,881 | 33.95% | 42.48% | 20.08%
    NC_000916 | Methanobacterium thermoautotrophicum | 3,502,754 | 2,658,450 | 15.85% | 18.84% | 15.91%
    NC_003197 | Salmonella typhimurium LT2 | 9,714,864 | 5,821,910 | 34.70% | 43.96% | 21.06%
    NC_002758 | Staphylococcus aureus Mu50 | 5,756,080 | 3,398,622 | 20.26% | 29.04% | 30.25%
    NC_003098 | Streptococcus pneumoniae R6 | 4,077,230 | 2,992,091 | 17.83% | 21.57% | 17.34%
    NC_002737 | Streptococcus pyogenes | 3,704,882 | 2,778,223 | 16.56% | 19.81% | 16.43%
    NC_002578 | Thermoplasma acidophilum | 3,129,812 | 2,602,761 | 15.51% | 17.02% | 8.84%
    NC_002689 | Thermoplasma volcanium | 3,169,608 | 2,590,718 | 15.44% | 17.22% | 10.30%
    NC_000919 | Treponema pallidum | 2,275,888 | 1,978,453 | 11.79% | 12.69% | 7.04%
    NC_000853 | Thermotoga maritima | 3,721,450 | 2,755,886 | 16.43% | 19.89% | 17.43%
    NC_002162 | Ureaplasma urealyticum | 1,503,438 | 948,274 | 5.65% | 8.57% | 34.06%
    NC_002505 | Vibrio cholerae chromosome I, chromosome II | 8,066,854 | 5,383,520 | 32.09% | 38.17% | 15.94%
    NC_002488 | Xylella fastidiosa 9a5c | 5,358,610 | 3,996,398 | 23.82% | 27.34% | 12.88%


    FIGURE 14.4 Frequency of presence of 9–14-mers in 76 microbial genomes.

    FIGURE 14.5 Frequency of presence of 7–10-mers in 129 RNA viral genomes.

    FIGURE 14.6 Frequency of presence of 7–10-mers in 48 DNA viral genomes.

    [Figures 14.4 to 14.6 panels: Microbial Genomes (9- to 14-mers), RNA Virus Genomes (7- to 10-mers), and DNA Virus Genomes (7- to 10-mers). Each panel plots the frequency of presence of n-mers (0 to 1) against 4^n/M (0 to 25), together with the random boundary.]


    considered (in the range of n, when the condition M 


    Let us provide a simple example based on three different genomes: (1) Salmonella typhi (NC_003198),

    (2) Mycobacterium tuberculosis H37Rv (NC_000962), and (3) Bacillus subtilis (NC_000964). A complete

    set of n-mers would contain 4^n n-mers, which, for n = 12, is 4^12 = 16,777,216. Based on our analysis, Table 14.4 shows how many different 12-mers are contained in each of these three genomes. The number N(n, G1, G2) of n-mers (n = 12) that appear in both genomes of each pair of species (G1, G2) is shown in Table 14.5. We can compare the probabilities of finding randomly picked 12-mers in each pair of genomes with probabilities calculated using the multiplication rule. As seen from Table 14.5, the actual and calculated (expected) probabilities do not differ greatly from each other, which allows us to treat the

    presence/absence of randomly picked 12-mers in these three genomes as statistically independent events.
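The multiplication-rule comparison can be reproduced from the counts reported in Tables 14.4 and 14.5. The sketch below simply redoes that arithmetic; the dictionary layout and the function name `compare` are illustrative choices, while the counts are taken from the tables.

```python
TOTAL_12MERS = 4 ** 12  # 16,777,216 possible 12-mers

# Number of distinct 12-mers present in each genome (Table 14.4)
present = {
    "S. typhi": 5_813_330,
    "M. tuberculosis H37Rv": 4_361_508,
    "B. subtilis": 5_346_103,
}

# Number of 12-mers present in both genomes of a pair (Table 14.5)
shared = {
    ("S. typhi", "M. tuberculosis H37Rv"): 1_943_814,
    ("S. typhi", "B. subtilis"): 2_335_710,
    ("M. tuberculosis H37Rv", "B. subtilis"): 1_334_288,
}


def compare(pair):
    """Return (actual, expected, ratio) joint presence probabilities."""
    g1, g2 = pair
    actual = shared[pair] / TOTAL_12MERS
    # Multiplication rule: expected joint probability under independence
    expected = (present[g1] / TOTAL_12MERS) * (present[g2] / TOTAL_12MERS)
    return actual, expected, actual / expected


for pair in shared:
    actual, expected, ratio = compare(pair)
    print(f"{pair[0]} & {pair[1]}: actual {actual:.1%}, "
          f"expected {expected:.1%}, ratio {ratio:.2f}")
```

The ratio of actual to expected probability stays within roughly 30% of 1 for all three pairs, which is the sense in which the text treats the presence of a 12-mer in these genomes as approximately independent.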

    Actual and expected pair-wise probabilities were calculated (Fofanov et al., 2002a, 2002b) in each of 

    the above-mentioned groups of genomes (170,000+ pairs in total). We were especially interested in the

    range of n where p* = 5 to 50% of the total possible number of n-mers occurred. This range is different

    for different genome sizes and can be determined from Figure 14.4. The analytic formula for the random

    boundary also can be used to estimate this range:

    \[ n^* = \frac{\log\left[M\left(1 - p^*\right)/p^*\right]}{\log(4)} \qquad (14.2) \]

    Upper and lower bounds for sizes from 0.8 to 10 Mb, which are typical for microbial genomes, are shown in Table 14.6. In accordance with this, the value n = 12 seems to be the most reasonable one for all microbial genomes. For viral genomes, the value was found to be n = 7.

    TABLE 14.4 The Frequency of Presence of 12-mers within the Three Microbial Genomes

    Genome (G) | Genome Length | Total Sequence Length (bp) | Number of Different 12-mers Present in Genome: N(12, G) | p = N(12, G)/4^n
    Salmonella typhi | 4,809,037 | 9,618,074 | 5,813,330 | 34.65%
    Mycobacterium tuberculosis H37Rv | 4,411,529 | 8,823,058 | 4,361,508 | 26.00%
    Bacillus subtilis | 4,214,814 | 8,429,628 | 5,346,103 | 31.87%

    TABLE 14.5 Actual and Predicted Simultaneous Presence of 12-mers within the Three Microbial Genomes: (1) Salmonella typhi, (2) Mycobacterium tuberculosis H37Rv, and (3) Bacillus subtilis

    Case | Number of 12-mers | N(n, G1, G2)/4^n | Calculated Probability Assuming Independence
    Present in genomes (1) and (2) | 1,943,814 | 11.6% | 9.0%
    Present in genomes (1) and (3) | 2,335,710 | 13.9% | 11.0%
    Present in genomes (2) and (3) | 1,334,288 | 8.0% | 8.3%

    TABLE 14.6 The Optimal Length of n-mers (n*) for Different Genome Sizes and Frequencies of Presence (p*)

    Total Sequence Length (bp) | n* Determined for Frequency of Presence 50% (p* = 0.5) | n* Determined for Frequency of Presence 5% (p* = 0.05)
    0.8 Mb | 9.80 | 11.93
    2.0 Mb | 10.47 | 12.59
    10.0 Mb | 11.63 | 13.75

    For all 2800+ pairs of microbial genomes and the value of n = 12, the average ratio of actual and expected probabilities was found to be 1.35 ± 0.61. For viral genomes and the corresponding value of n = 7, the average ratio of actual and expected probabilities was found to be 1.06 ± 0.10 for 1100+ genome


    pairs of DNA-based viruses and 1.04 ± 0.05 for 8100+ genome pairs of RNA-based viruses. This led us to the conclusion that for this range of n, the presence of n-mers in different genomes, to a good approximation,

    can be treated as independent events.
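The n-mer lengths listed in Table 14.6 follow from Equation 14.2 and can be recomputed as a check. The sketch below assumes the equation reads n* = log[M(1 – p*)/p*]/log 4, which is the form that reproduces the tabulated values; `optimal_nmer_length` is a hypothetical name.

```python
import math


def optimal_nmer_length(total_length_bp, p_star):
    """n* from the assumed form of Equation 14.2.

    total_length_bp is the total sequence length M; p_star is the
    desired frequency of presence (between 0 and 1).
    """
    return math.log(total_length_bp * (1.0 - p_star) / p_star) / math.log(4.0)


for size_mb in (0.8, 2.0, 10.0):
    m = size_mb * 1e6
    upper = optimal_nmer_length(m, 0.5)    # 50% of n-mers present
    lower = optimal_nmer_length(m, 0.05)   # 5% of n-mers present
    print(f"{size_mb:5.1f} Mb: n* = {upper:.2f} (p* = 0.5), "
          f"{lower:.2f} (p* = 0.05)")
```

Since n must be an integer, the interval between the two values brackets the usable choices; for sequence lengths of a few megabases that interval contains n = 12, as stated in the text.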

    The highest deviations between expected and actual probabilities were found for closely related

    genomes. For 48 DNA-based viruses under consideration, using 7-mers, the highest ratio (185%) was

    found for Duck hepatitis B virus (NC_001344) vs. Stork hepatitis B virus (NC_003325) with 8.1% expected

    and 15.0% actual.

    An example of closely related microbial genomes would be S. aureus N315 (NC_002745) vs. S. aureus

    Mu50 (NC_002758) with 4.0% expected and 19.7% actual (a ratio of 491%). Another

    extreme case was found for three microbial genomes: Chlamydophila pneumoniae CWL029 (NC_000922),

    C. pneumoniae  AR39 (NC_002179), and C. pneumoniae  J138 (NC_002491), which have the highest

    (eightfold) ratio of actual and expected probabilities for 12-mers (1.5% expected and 12.3% actual). For

    the group containing 24 human chromosomes, pair-wise ratios of actual and expected probabilities of 

    14-mers were found to be 1.91 ± 0.16, maximum ratio being found for n = 20 and Y-chromosomes

    (expectation 2.9% vs. actual 6.9%).

    Assuming that results for 250+ genomes are statistically significant, we expect similar behavior

    from many different (as yet unsequenced) genomes. Thus our analysis indicates that, in this case, one

    may use relatively small sets of randomly picked n-mers for differentiating between different viruses

    and organisms.

    Let us further illustrate the idea by continuing our example for three microbial genomes. Let n* be
    an n-mer length for which between 5 and 50% of all possible n-mers are present in genomes across the
    desired range of lengths. In accordance with Table 14.6, we may choose the value n* = 12. Let us
    randomly pick L 12-mers (say, L = 1000). Given a genome G1 in which n-mers are present with
    frequency p1, we expect that K = p1 L of the n-mers in our random set will be present in G1, forming a
    "fingerprint" of G1 (in our example, we expect 50 < K < 500). The probability, ε, that the fingerprint of
    G1 will exactly coincide with the fingerprint of some other genome G2 (with frequency of presence
    of n-mers p2) is (Fofanov et al., 2002a; 2002b):

    $$\varepsilon = (1 - p_1 - p_2 + 2p_{12})^{L}. \tag{14.3}$$

    Here p12 is the probability for the n-mer to be present in both genomes simultaneously. The base of the
    power is the probability that a single n-mer gives the same answer for both genomes: it is present in
    both with probability p12 and absent from both with probability 1 − p1 − p2 + p12.

    Let us consider the numeric example from Tables 14.4 and 14.5 of two species that are far
    from each other, Salmonella typhi vs. Mycobacterium tuberculosis H37Rv: p1 = 0.3465, p2 = 0.2600,
    p12 = 0.1160. With L = 1000, a remarkable accuracy of ε = 1.7 × 10^-204 can theoretically be achieved.

    Given a desired probability of error, ε, one can determine the appropriate size, L, of a random set
    of n-mers that can be used for reliable identification of genomes as

    $$L = \frac{\log \varepsilon}{\log(1 - p_1 - p_2 + 2p_{12})}. \tag{14.4}$$

    For related organisms, the genomes may contain large common parts. This means that p12 may be
    close to p1 and p2. To give a numeric example of close relatives, let us consider S. aureus N315 vs.
    S. aureus Mu50. Now p1 = 0.198, p2 = 0.203, p12 = 0.197, and an accuracy of ε = 10^-10 can be achieved
    with L = 4451. We would like to stress that our analysis predicts a logarithmic dependence of the
    sampling or microarray size, L, on the error probability, ε. This feature is of principal importance
    for the estimation procedure under discussion.

    Therefore, we can use practically any sufficiently random subset of n-mers of appropriate size to design
    a microarray to diagnose the organism to which a given DNA/RNA sample belongs. Different sizes of n-mers
    must be employed for recognition of different organisms based on their genome length. Values of n that
    correspond to given intervals of genome lengths can be easily calculated using the above formulas. In
    fact, only 11 different n values, 7 ≤ n ≤ 17, would be enough to cover a large variety of genome sizes
    from 1 kb to 9 Gb.
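Equations 14.3 and 14.4 are simple enough to evaluate directly. The sketch below does so for the frequencies quoted in the text, assuming that the L n-mers are chosen independently; the function names are hypothetical. The coincidence probability is computed as a base-10 logarithm so that larger values of L do not underflow floating-point arithmetic.

```python
import math


def log10_coincidence_probability(p1, p2, p12, num_probes):
    """log10 of Eq. 14.3, the probability that two fingerprints coincide."""
    return num_probes * math.log10(1.0 - p1 - p2 + 2.0 * p12)


def required_probe_count(p1, p2, p12, epsilon):
    """Eq. 14.4, rounded up to a whole number of n-mers."""
    return math.ceil(math.log(epsilon) / math.log(1.0 - p1 - p2 + 2.0 * p12))


# Distant pair: S. typhi vs. M. tuberculosis H37Rv (values from the text).
log_eps = log10_coincidence_probability(0.3465, 0.2600, 0.1160, 1000)
exponent = math.floor(log_eps)
print(f"epsilon ~ {10 ** (log_eps - exponent):.1f}e{exponent}")

# L is linear in log(epsilon): squaring epsilon roughly doubles the probe count.
for eps in (1e-5, 1e-10, 1e-20):
    print(eps, required_probe_count(0.3465, 0.2600, 0.1160, eps))

# Close relatives: S. aureus N315 vs. Mu50, using the rounded frequencies quoted in the text.
n_close = required_probe_count(0.198, 0.203, 0.197, 1e-10)
print("S. aureus pair, epsilon = 1e-10:", n_close)
```

With the three-digit frequencies quoted for the S. aureus pair, this sketch gives L of about 3278 rather than 4451. The result is very sensitive to p1 + p2 − 2 p12 (0.007 with the rounded values), so the difference may simply reflect rounding of the published frequencies.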

    The important advantage of such an approach is that it can be used without a priori knowledge of the

    sequence itself. This implies there is no need to perform the expensive and time-consuming process of 

    sequencing before array construction. It is enough to obtain the purified DNA, hybridize it on a sufficiently

    random microarray chip, and check which n-mers show up. Taking into account how accessible

    the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and

    the fact that we do not need to determine quantitative values of expression (we need only a yes/no

    answer), it should be possible to produce an essentially universal microbial/viral DNA chip.
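The yes/no identification scheme described above can be sketched as a small simulation. Everything here is an illustrative assumption rather than the authors' procedure: random synthetic sequences stand in for genomes, ideal noise-free hybridization is assumed, n = 6 and L = 200 are chosen only so that the toy runs quickly, and all function names are hypothetical.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def nmers_present(seq, n):
    """Set of n-mers present on either strand of seq."""
    rc = seq.translate(COMPLEMENT)[::-1]
    return {s[i:i + n] for s in (seq, rc) for i in range(len(s) - n + 1)}


def fingerprint(seq, probes):
    """Yes/no pattern: one boolean per probe n-mer, True if it occurs in seq."""
    present = nmers_present(seq, len(probes[0]))
    return tuple(probe in present for probe in probes)


def identify(sample_fp, reference_fps):
    """Name of the reference fingerprint that differs from the sample in the fewest positions."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(reference_fps, key=lambda name: hamming(sample_fp, reference_fps[name]))


rng = random.Random(0)
n, num_probes = 6, 200

# Synthetic random "genomes" standing in for real sequences.
genomes = {
    f"genome_{k}": "".join(rng.choice(BASES) for _ in range(1000))
    for k in range(5)
}

# Random probe set, chosen without looking at any genome.
probes = ["".join(rng.choice(BASES) for _ in range(n)) for _ in range(num_probes)]
references = {name: fingerprint(seq, probes) for name, seq in genomes.items()}

# "Hybridize" an unknown sample and look it up among the reference fingerprints.
sample = genomes["genome_3"]
best_match = identify(fingerprint(sample, probes), references)
print("best match:", best_match)
```

Matching by the smallest number of differing positions, rather than by exact equality, is a design choice made here so that the same lookup would tolerate a few erroneous spots on a real chip.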

    14.4 Conclusions

    In this article we have attempted to demonstrate how to use both physical and mathematical techniques

    to explore design criteria of relevance to modern genetic analysis. In particular, DNA arrays allow 

    simultaneous, parallel detection and concentration measurement of thousands of target strands. In this

    article we have concentrated on two areas of concern in the design and analysis of such experiments: the

    information content of the polymer and the physical/chemical properties of the detection. Microarrays

    are now producing amounts of raw biological (and biophysical) data on a scale not seen before in the

    biological arena. The bioinformatics solutions to problems associated with the analysis of data on this

    scale will remain a challenge for some time. The physical design of efficient devices to conduct such

    experiments requires consideration of the chemistry and physics of often highly charged species near

    prepared surfaces as well as the sequence. This article was aimed at demonstrating the current state of 

    theory in the hopes that many will find application of these principles.

    ACKNOWLEDGMENTS

    This work was partially supported by grants from NIH, Texas Coordinating Board, and the Robert A.

    Welch Foundation to BMP. TBL was a fellow at the Keck Center for Computational Biology. BMP and

    YF acknowledge the Texas Center for Learning and Computation for seed funding, and also NPACI for

    computing time and support at the San Diego Supercomputing Center. We also thank the Molecular

    Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Labo-

    ratory, a national scientific user facility sponsored by the U.S. Department of Energy’s Office of Biological

    and Environmental Research and located at the Pacific Northwest National Laboratory. Pacific Northwest

    is operated for the Department of Energy by Battelle. BMP and AV also thank Accelrys for providing

    visualization software through the Institute for Molecular Design.

    References

    Bloomfield, V.A., Crothers, D.M., and Tinoco, I., Nucleic Acids: Structures, Properties and Functions,

    University Science Books, Sausalito, CA, 1999.

    Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy,

    M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S.R., Moon, K.,

    Burcham, T., Pallas, M., DuBridge, R.B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K., Gene

    expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays, Nat.

    Biotechnol., 18, 630–634, 2000.

    Cutler, D.J., Zwick, M.E., Carrasquillo, M.M., Yohn, C.T., Tobin, K.P., Kashuk, C., Mathews, D.J., Shah,

    N.A., Eichler, E.E., Warrington, J.A., and Chakravarti, A., High-throughput variation detection

    and genotyping using microarrays, Genome Res., 11, 1913–1925, 2001.

    Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., and Fertil, B., Genomic signature: characterization and

    classification of species assessed by chaos game representation of sequences,  Mol. Biol. Evol., 16,

    1391–1399, 1999.

    Fislage, R., Differential display approach to quantitation of environmental stimuli on bacterial gene

    expression, Electrophoresis, 19, 613–616, 1998.

    Fislage, R., Berceanu, M., Humboldt, Y., Wendt, M., and Oberender, H., Primer design for a prokaryotic

    differential display RT-PCR, Nucleic Acids Res., 25, 1830–1835, 1997.

    Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,

    B.M., How independent are the appearances of n-mers in different genomes? Submitted, 2002a.

    Fofanov, Y., Luo, Y., Katili, C., Wang, J., Powdrill, B.Y.T., Fofanov, V., Li, T.-B., Chumakov, S., and Pettitt,

    B.M., Short subsequences in genomes: how random are they? Submitted, 2002b.

    Forman, E.J., Walton, I.D., Stern, D., Rava, R.P., and Trulson, M.O., Thermodynamics of duplex formation

    and mismatch discrimination of photolithographically synthesized oligonucleotide arrays,  ACS

    Symp. Ser., 682, 206–228, 1998.

    Guo, Z., Guilfoyle, R.A., Thiel, A.J., Wang, R., and Smith, L.M., Direct fluorescence analysis of genetic

    polymorphisms by hybridization with oligonucleotide arrays on glass supports,  Nucleic Acids Res.,

    22, 5456–5465, 1994.

    Heaton, R.J., Peterson, A.W., and Georgiadis, R.M., Electrostatic surface plasmon resonance: direct

    electric field-induced hybridization and denaturation in monolayer nucleic acid films and label-

    free discrimination of base mismatches, Proc. Natl. Acad. Sci. U.S.A., 98, 3701–3704, 2001.

    Karlin, S. and Ladunga, I., Comparisons of eukaryotic genomic sequences, Proc. Natl. Acad. Sci. U.S.A.,

    91, 12832–12836, 1994.

    Karlin, S. and Mrazek, J., Compositional differences within and between eukaryotic genomes, Proc. Natl.

     Acad. Sci. U.S.A., 94, 10227–10232, 1997.

    Nakashima, H., Nishikawa, K., and Ooi, T., Differences in dinucleotide frequencies of human, yeast, and

    Escherichia coli genes, DNA Res., 4, 185–192, 1997.

    Nakashima, H., Ota, M., Nishikawa, K., and Ooi, T., Genes from nine genomes are separated into their

    organisms in the dinucleotide composition space, DNA Res., 5, 251–259, 1998.

    Nguyen, T.T., Grosberg, A.Y., and Shklovskii, B.I., Screening of a charged particle by multivalent coun-

    terions in salty water: strong charge inversion, J. Chem. Phys., 113, 1110–1125, 2000.

    Nielsen, P.E., Peptide nucleic acid: a versatile tool in genetic diagnostics and molecular biology, Curr.

    Opinion Biotech., 12, 16–20, 2001.

    Nussinov, R., Doublet frequencies in evolutionary distinct groups, Nucleic Acids Res., 12, 1749–1763, 1984.

    Peterson, A.W., Heaton, R.J., and Georgiadis, R.M., The effect of surface probe density on DNA hybrid-

    ization, Nucleic Acids Res., 29, 5163–5168, 2001.

    Saenger, W., Principles of Nucleic Acid Structure, Springer-Verlag, New York, 1984.

    Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., and Coster, J., Capturing whole-genome

    characteristics in short sequences using a naive Bayesian classifier, Genome Res., 11, 1404–1409,

    2001.

    SantaLucia, J., Allawi, H.T., and Seneviratne, P.A., Improved nearest-neighbor parameters for predicting

    DNA duplex stability, Biochemistry, 35, 3555–3562, 1996.

    Shchepinov, M.S., Case-Green, S.C., and Southern, E.M., Steric factors influencing hybridization of

    nucleic acids to oligonucleotide arrays, Nucleic Acids Res., 25, 1155–1161, 1997.

    Southern, E.M., DNA microarrays — history and overview,  Methods Mol. Biol., 170, 1–15, 2001.

    Steel, A.B., Herne, T.M., and Tarlov, M.J., Electrochemical quantitation of DNA immobilized on gold,

     Anal. Chem., 70, 4670–4677, 1998.

    Su, H.J., Surrey, S., McKenzie, S.E., Fortina, P., and Graves, D.J., Kinetics of heterogeneous hybridization

    on indium tin oxide surfaces with and without an applied potential, Electrophoresis, 23, 1551–1557,

    2002.

    Vainrub, A. and Pettitt, B.M., Surface electrostatic effects in oligonucleotide microarrays: control and

    optimization of binding thermodynamics, Biopolymers, 68, 265–270, 2003.

    Vainrub, A. and Pettitt, B.M., Thermodynamics of association to a molecule immobilized in an electric

    double layer, Chem. Phys. Lett., 323, 160–166, 2000.

    Vainrub, A. and Pettitt, B.M., Coulomb blockage of hybridization in two-dimensional DNA arrays, Phys.

    Rev. E, 66, art. no. 041905, 2002.

    Vasiliskov, V.A., Prokopenko, D.V., and Mirzabekov, A.D., Parallel multiplex thermodynamic analysis of 

    coaxial base stacking in DNA duplexes by oligonucleotide microchips,  Nucleic Acids Res., 29,

    2303–2313, 2001.

    Watterson, J.H., Piunno, P.A., Wust, C.C., and Krull, U.J., Effects of oligonucleotide immobilization

    density on selectivity of quantitative transduction of hybridization of immobilized DNA, Langmuir,

    16, 4984–4992, 2000.