Probability and Statistics, Dec 23, 2013
Ivan Ivanov
Vanier College
George Kolampas
DNA Scanning and Statistical Analysis of the Human
Cytomegalovirus
With the amount of DNA sequence data available today, the r-scan statistic is extremely useful for
determining regions of importance in a strand of DNA. In this paper the r-scan is used to find
the replication site of Human Cytomegalovirus at positions 91490 to 92643. The r-scan's
relation to the scan statistic is touched on, and the compound Poisson approximation is also
examined.
Introduction and Background
DNA has been known to be the genomic system for storing and passing on data
since Watson and Crick studied it in 1953. DNA stores the information for
controlling everything from its own replication to the transcription of RNA, which
is translated into the proteins used in all organisms.
DNA, also known as deoxyribonucleic acid, is a nucleic acid polymer made up of
the pentose sugar deoxyribose, phosphate groups, and one of four possible bases:
guanine, adenine, thymine, and cytosine. DNA has a double-helix structure that can split
into two parts, and each part can be seen as a long sequence of the four bases
previously listed.
Figure 1: DNA double helix
DNA is read in codons, which are 3 bases long; with 4 bases to choose
from there are $4^3 = 64$ different possible codons. Codons code for
amino acids, of which there are 20. Each specific codon codes for
an unambiguous amino acid, but many codons can code for the same amino acid. The
sequences of bases are called genes, and they control all processes that go on in a
cell and subsequently in an organism. DNA strands can be from 20 bases to a few
million bases long, and because a huge percentage of DNA is made up of introns
(non-coding regions) that code for nothing and are considered useless, distinguishing
the useful parts from the useless parts is a big job.
Figure 2: Introns and their removal during mRNA transcription
The introduction of PCR, electrophoresis, and many other new techniques
for analysing DNA has led to an
exponential increase in the amount of sequence data. Over a million
sequences containing more than a billion nucleotide bases have been recorded. As
the genome databases expand, mathematical methods play an increasingly important
role in obtaining, organizing, archiving, analyzing, and interpreting the rapidly
accumulating DNA data, especially when DNA is composed mostly of non-coding parts.
While searching for insights into the organization of a genome, one of the problems that
arises is how to identify anomalies in the spacings of patterns in a long
sequence of nucleotides. Here the patterns refer to any short sequence segments with an
obvious recurrence. Spacing anomalies include clumping (too many
neighboring short spacings), overdispersion (too many long gaps between patterns), and
excessive regularity (too few short spacings and/or too few long gaps). This is where
statistical analysis is needed for the scanning of DNA.
Statistics is the study of the collection, organization, analysis, interpretation and
presentation of data. In this study we are using statistics in order to identify anomalies
and patterns in sequences of DNA that may be of importance with a certain degree of
certainty.
The next section describes the r-scan and its close relationship to the traditional
scan statistic. The application of r-scans will be illustrated by an example that
identifies unusual palindrome clusters in a family of herpesvirus genomes. The genes of
viruses are normally structured in a circular hoop called a plasmid. In order to replicate,
viruses need a host cell into which they insert their genetic information; once
replication has occurred, the virus bursts out of the host cell and the cycle begins
again.
Figure 3: Virus life cycle
Viruses are not wasteful, in the sense that they do not code for proteins or
processes that are not immediately useful. For this reason we know that the anomalies
we are searching for with the r-scan method serve to locate the operon
that triggers replication in viruses. To see how operons work, consider the
lac operon, an inducible system whose proteins allow bacteria to
metabolize lactose. When lactose is absent, a repressor protein binds tightly to the
operator. The repressor prevents RNA polymerase from binding to the promoter,
turning transcription off. When lactose is present, the repressor detaches from
the operator and transcription of the gene begins. This is precisely what happens in
the virus: replication begins when an outside factor removes
the repressor.
We have to rely on an approximation when assessing the statistical
significance of the clusters, because the exact probability distribution of
r-scans is not available. In Section 12.3, we shall compare the accuracy of three
Poisson-type approximate distributions by contrasting the calculated approximate
probabilities with simulation results.
r-Scans and DNA Sequence Analysis
The r-scan is defined as the cumulative length of r consecutive distances
between the order statistics $X_{(1)}, \ldots, X_{(N)}$. Equivalently, it is the running
sum of the lengths of r consecutive distances between the ordered sample points.
Consider a set of points $X_1, \ldots, X_N$ distributed independently and uniformly over
the unit interval (0,1).
We let $D_i$ denote the distance between the ordered ith and (i + 1)th points, so that
$$D_i = X_{(i+1)} - X_{(i)}, \quad i = 1, \ldots, N - 1.$$
For any fixed integer r between 1 and N - 1, the r-scan at the point $X_{(i)}$ is
$$A_{r,i} = \sum_{j=i}^{i+r-1} D_j, \quad i = 1, \ldots, N - r.$$
The order statistics of these r-scans are denoted by $A_{r,(i)}$.
The r-scan most frequently used in DNA analysis is the minimal r-scan,
$$A_{r,(1)} = \min\{A_{r,i},\ i = 1, \ldots, N - r\}.$$
We will use $A_{r,(1)}$ throughout this paper and denote it simply by $A_r$.
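The definitions above can be sketched in a few lines of Python. This is only an illustration, not the computation carried out for this paper (which used Excel); the function names `r_scans` and `minimal_r_scan` and the sample positions are made up for the example.

```python
def r_scans(points, r):
    """Return the r-scans A_{r,1}, ..., A_{r,N-r} of a collection of points."""
    xs = sorted(points)                             # order statistics X_(1) <= ... <= X_(N)
    spacings = [b - a for a, b in zip(xs, xs[1:])]  # D_1, ..., D_{N-1}
    # Sum each run of r consecutive spacings.
    return [sum(spacings[i:i + r]) for i in range(len(spacings) - r + 1)]


def minimal_r_scan(points, r):
    """Return A_r, the smallest of the r-scans."""
    return min(r_scans(points, r))


# Hypothetical palindrome positions (in bases), not the paper's data.
positions = [10, 40, 45, 50, 90]
print(r_scans(positions, 2))         # [35, 10, 45]
print(minimal_r_scan(positions, 2))  # 10
```

Integer base positions are used here instead of points on (0,1); dividing every position by the sequence length would give the unit-interval form used in the text.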
$A_r$ is closely related to the traditional scan statistic
$$S_w = \max_{0 \le t \le 1-w} Y_t(w),$$
where 0 < w < 1 is a prescribed window length and $Y_t(w)$ is the number of points in the
interval [t, t + w].
Consider the event
$$\{A_{r,i} \le w\}$$
for i = 1, ... , N - r. It states that there are at least
r + 1 points contained in the window $[X_{(i)}, X_{(i)} + w]$. This is equivalent to the event
that there are at least r adjoining spacings, starting at $X_{(i)}$, whose
cumulative length is no greater than w. Because this is true for all i, it must also
hold for the particular window holding the maximal number of points. Therefore
we have the relation
$$P(A_r \le w) = P(S_w \ge r + 1)$$
for fixed values of $w \in (0,1)$ and r = 1, ... , N - 1.
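The equivalence behind this relation can be checked numerically. The sketch below is an illustration under assumed names (`minimal_r_scan`, `scan_statistic`) and randomly generated integer positions; it confirms that the minimal r-scan is at most w exactly when some window of length w holds at least r + 1 points.

```python
import random
from bisect import bisect_right


def minimal_r_scan(xs, r):
    """A_r: shortest span covering r consecutive spacings (r + 1 points)."""
    xs = sorted(xs)
    return min(xs[i + r] - xs[i] for i in range(len(xs) - r))


def scan_statistic(xs, w):
    """S_w: largest number of points in any window of length w.

    The maximum is always attained by a window that starts at a point,
    so only those windows are checked.
    """
    xs = sorted(xs)
    return max(bisect_right(xs, x + w) - i for i, x in enumerate(xs))


random.seed(0)
points = random.sample(range(1000), 30)   # hypothetical distinct positions
for r in range(1, 6):
    for w in (5, 20, 80):
        # {A_r <= w} and {S_w >= r + 1} are the same event.
        assert (minimal_r_scan(points, r) <= w) == (scan_statistic(points, w) >= r + 1)
```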
Because of this relation, we automatically know the distribution
of $A_r$ if the distribution of $S_w$ is available, so the minimal r-scan statistic and the
traditional scan statistic can be used interchangeably. In DNA sequence analysis, the
r-scan is usually preferred. The exact probability distribution of $A_r$ is unknown, so
an approximation is used, as shown in Leung and Yamashita.
where:
and where:
In this case N = 296 is the number of palindromes, j is the palindrome count (because
we are using the compound Poisson approximation we use $r \ge 2$), and w is the window
length.
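For orientation only, here is a sketch of a simpler Poisson-type approximation. It is not the compound Poisson formula used for the tables in this paper. It assumes that the number of r-scans no longer than w is roughly Poisson with mean $(N - r)\,P(A_{r,1} \le w)$, and it uses the fact that, for uniform points, $P(A_{r,1} \le w)$ equals the probability that a Binomial(N, w) variable is at least r. Its values should therefore be expected to differ somewhat from the table entries. The function names are illustrative.

```python
from math import comb, exp


def binomial_tail(n, p, k):
    """P(Binomial(n, p) >= k), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


def poisson_approx(n_points, r, w):
    """Simple Poisson approximation to P(A_r <= w) for uniform points on (0, 1)."""
    single = binomial_tail(n_points, w, r)   # P(one given r-scan is <= w)
    lam = (n_points - r) * single            # expected number of short r-scans
    return 1 - exp(-lam)


N = 296              # number of palindromes, from the paper
w = 500 / 229354     # window length as a fraction of the genome length
for r in range(2, 11):
    print(r, round(poisson_approx(N, r, w), 6))
```

A plain Poisson approximation ignores the tendency of short r-scans to occur in overlapping runs, which is the effect a compound Poisson approximation is designed to account for.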
Palindrome Placement and the Data at Hand
As previously explained, we will use the r-scan statistic $A_r$ to find dense
palindrome clusters on the DNA that can be considered statistically
significant at a significance level of α = .05, that is, clusters that are not due to random scatter.
The following table is the data provided and is what will be tested. Each number in the
table is a position in the DNA at which a palindrome of length ten or above occurs. As
documented, there are 296 palindromes in this table that fit the criteria.
Using this table and the compound Poisson approximation mentioned above,
we will be able to identify the clusters of palindromes of length 10 or above that are unlikely to be
due to random chance.
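The palindrome positions in this paper were taken from a provided table, but a list of this kind could be produced along the following lines. The sketch assumes the usual molecular-biology meaning of palindrome, a stretch of DNA that equals its own reverse complement; that definition, the function name `find_palindromes`, and the toy sequence are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_palindromes(seq, min_len=10):
    """Return (start, length) for each maximal reverse-complement palindrome
    of at least min_len bases. Such palindromes always have even length,
    so each one is centred on a gap between two neighbouring bases."""
    hits = []
    n = len(seq)
    for center in range(1, n):            # gap between seq[center-1] and seq[center]
        left, right = center - 1, center
        # Grow outwards while the two ends are complementary bases.
        while left >= 0 and right < n and COMPLEMENT.get(seq[left]) == seq[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length >= min_len:
            hits.append((left + 1, length))
    return hits


# Toy sequence: ACGTACGT is its own reverse complement.
print(find_palindromes("GGACGTACGTGG", min_len=6))   # [(2, 8)]
```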
Statistically Significant Palindromes With a Window Length of 500
For this window length of 500 and a DNA strand 229354 bases long,
w = 500/229354 = 0.00218. Using Excel we obtain a table for r values from 2 to 10.
This table tells us that, at a significance level of α = .05 with a
window length of 500, we only consider clusters of 6 or above to be statistically
significant. The following graph shows where clusters are found on the DNA strand.
r     $P(A_r \le w)$
2     1.00000
3     0.99815
4     0.023846
5     0.118076
6     0.013357
7     0.001225
8     9.73E-5
9     6.82E-6
10    4.20E-7
The graph shows two points above 6, which means that those points can be considered
anomalies not due to random chance. The one around 90000 agrees with
the paper by Masse et al. (1992), which carried out detailed
experimental assays around this part of the genome and characterized the segment
between 92210 bp and 93715 bp as the lytic origin of replication for HCMV. We choose
to ignore the other point that surpasses 6, because the paper is
interested only in palindromes of length 10 or higher.
Graph for Sliding Window of Length 500 (x-axis: Position with respect to bases, 0 to 250000; y-axis: Palindrome Cluster Density, 0 to 14)
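A sliding-window count like the one plotted can be sketched as follows. The positions below are hypothetical, not the paper's palindrome table, and the function name `find_clusters` is an assumption; the sketch lists every window that starts at a palindrome and contains at least a chosen number of palindromes.

```python
from bisect import bisect_right


def find_clusters(positions, window, min_count):
    """List (start, count) for every window that begins at a palindrome
    and holds at least min_count palindromes within `window` bases."""
    xs = sorted(positions)
    clusters = []
    for i, start in enumerate(xs):
        count = bisect_right(xs, start + window) - i   # palindromes in [start, start + window]
        if count >= min_count:
            clusters.append((start, count))
    return clusters


# Hypothetical positions, not the paper's palindrome table.
positions = [100, 5000, 5100, 5150, 5300, 5420, 5490, 20000, 20400]
print(find_clusters(positions, window=500, min_count=6))   # [(5000, 6)]
```

With the real data, `min_count` would be the cluster size read off the significance table for the chosen window length.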
Statistically Significant Palindromes With a Window Length of 1000
The same calculations will be used as those done with a window length of 500.
w=1000/229354=0.00436
r     $P(A_r \le w)$
2     1.00000
3     1.00000
4     0.999722
5     0.862716
6     0.34455
7     0.074703
8     0.012356
9     0.001758
10    0.000223
At a significance level of α = .05, only palindrome clusters of 8 or above are
acceptable.
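Reading the cut-off from the table can be written out as a small helper. This is a sketch with an assumed function name; it returns the smallest r whose probability, together with that of every larger r listed, falls below α, which guards against a single out-of-pattern entry.

```python
def smallest_significant_r(p_values, alpha=0.05):
    """Smallest r such that P(A_r <= w) < alpha for that r and every larger r listed."""
    for r in sorted(p_values):
        if all(p < alpha for rr, p in p_values.items() if rr >= r):
            return r
    return None   # no cluster size reaches significance


# Values copied from the window-length-1000 table above.
table_1000 = {2: 1.00000, 3: 1.00000, 4: 0.999722, 5: 0.862716, 6: 0.34455,
              7: 0.074703, 8: 0.012356, 9: 0.001758, 10: 0.000223}
print(smallest_significant_r(table_1000))   # 8
```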
Graph for Sliding Window of Length 1000 (x-axis: Position with respect to bases, 0 to 250000; y-axis: Palindrome Cluster Density, 0 to 14)
Again the same two regions are acceptable, but once again only one of the two surpasses
the 10 base pairs specified in the paper.
Conclusion
To conclude, r-scan statistical analysis is a very effective tool for finding
coding genes in DNA, thanks to the Bernoulli nature of DNA sequences. In this case we
found that the replication site for HCMV was located at the cluster from 91490 to
92643, in agreement with the analysis of Masse et al. (1992). Either of the two graphs is useful, but
those with longer window lengths are a bit more effective when accepting only
large clusters. In either case, by reading the r-scan table values and comparing them
to the cluster graphs, one can determine where important biological processes are
being coded for, with significant certainty that the result is not due to random chance. With
all the sequence banks now available, this is a big breakthrough in DNA research.
Bibliography
Leung, M.-Y., & Yamashita, T. E. (n.d.). Applications of the Scan Statistic in DNA Sequence Analysis. 27-271, 280-281.
Sadava, Hillis, Heller, & Berenbaum. (2011). Life: The Science of Biology. Sunderland: The Courier Companies Inc.