Palindromes in Viral Genomes
Ming-Ying Leung Department of Mathematical Sciences
The University of Texas at El Paso (UTEP)
Outline
• Palindromes
• Viral Genomes
• Roles of palindromes in DNA and RNA viruses
[Figures: Cytomegalovirus (CMV) particle; SARS coronavirus particle]
Palindromes in Letter Sequences
Odd Palindrome: “A nut for a jar of tuna”
Remove spaces and capitalize: ANUTFORAJAROFTUNA
Even Palindrome: “Step on no pets”
Remove spaces and capitalize: STEPONNOPETS
DNA and RNA
• DNA is deoxyribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Thymine.
• RNA is ribonucleic acid, made up of 4 nucleotide bases: Adenine, Cytosine, Guanine, and Uracil.
• The bases A and T (A and U in RNA) form a complementary pair, as do C and G.
DNA/RNA Palindromes
Virus and Eye Diseases
CMV Retinitis
• inflammation of the retina
• triggered by CMV particles
• may lead to blindness
CMV genome size: ~230 kbp
[Figure: CMV particle]
Palindromes and Replication Origins in Herpesviruses
• High concentration of palindromes exists around replication origins of herpesviruses
• Locating clusters of palindromes (above a minimal length) on herpesvirus genome sequences might reveal the likely locations of their replication origins.
Human Herpesvirus (HHV-6) Genome
• Replication origin (oriLyt) in non-coding region
• Two binding sites for origin-binding protein (OBP)
• Two 200-bp DNA unwinding elements (DUE)
Computational Prediction of Replication Origins
• Poisson approximation for palindrome distribution in an i.i.d. random sequence model
• Criterion for identifying statistically significant palindrome clusters using scan statistics
• Improve prediction accuracy
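The cluster search outlined above can be illustrated with a plain sliding-window count over palindrome positions. This is only a minimal sketch: it assumes the palindrome positions have already been located, the function name `densest_window`, the window length, and the example positions are all hypothetical, and the scan-statistic significance criterion itself is not implemented.

```python
from bisect import bisect_right

def densest_window(positions, window):
    """Return (start, count) for the window [start, start + window) that
    holds the most palindrome positions. Only windows that begin at a
    palindrome position are tried, which is enough to find the maximum."""
    pos = sorted(positions)
    best_start, best_count = None, 0
    for i, start in enumerate(pos):
        # Count integer positions p with start <= p <= start + window - 1.
        end = bisect_right(pos, start + window - 1)
        count = end - i
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count

# Hypothetical palindrome positions (not real genome data).
example_positions = [120, 950, 1000, 1010, 1040, 1100, 5000, 9000]
print(densest_window(example_positions, window=200))
```

A real analysis would compare the largest window count against the threshold given by the scan statistic before calling the window a significant cluster.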
Poisson Process Approximation of Palindrome Distribution
Use of the Scan Statistic to Identify Clusters of Palindromes
Improve Prediction Accuracy
• Using a Markov random sequence model
• Taking the lengths of palindromes and base composition of the genomes into consideration when counting palindromes. [Chew et al. 2005, Nucleic Acids Research 33(15):e134]
• Using machine learning approaches to utilize knowledge about replication origin locations in closely related genomes.
SARS Viral Particles
[Figure: SARS virus particle, showing the spike glycoprotein and the host cell]
• Single-stranded RNA genome with ~30 kilobases
• Only about 7% the genome size of Cytomegalovirus (CMV) with double-stranded DNA
SARS Virus Genome Map
• Replicase (1a and 1b), spike glycoprotein (S), X1 and X2 occupy 87% of the genome
• Two pairs of overlapping ORFs, 1a & 1b and X1 & X2 (as designated by Rota et al. 2003), are predicted in this region
• 1a and 1b are standard in all coronaviruses; X1 and X2 are unique to SARS. Whether X1 and X2 actually code for proteins is still unconfirmed
[Genome map: ORF1a, ORF1b, S, X1, X2; marked positions 256, 13398, 21485, 25268]
A Long Palindrome in X1 and X2
TCTTTAACAAGCTTGTTAAAGA
Positions: 25962-25983 (22 bases)
• Found in SARS but not in other 6 coronavirus genomes (Chew et al. 2004)
• The next longest palindrome in SARS is 14 bases long
• In the overlapping region of X1 and X2
Palindrome: A string of nucleotide bases that reads the same as its reverse complement. A palindrome must be even in length, e.g. palindrome of length 10:
5’ ….. GCAATATTGC …..3’
Positions: j - L + 1, …, j, j + 1, …, j + L
Bases: b1 b2 … bL bL+1 … b2L-1 b2L
We say that a palindrome of length 2L occurs at position j when the (j-i+1)st and the (j+i)th bases are complementary to each other for i = 1, …, L. In an i.i.d. sequence model this occurs with probability
$$\left[\,2\,(p_A p_T + p_C p_G)\,\right]^{L}.$$
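The reverse-complement definition can be checked directly in code. The following is a minimal sketch for DNA letters only; the helper names `reverse_complement` and `is_palindrome` are illustrative and not taken from the slides.

```python
def reverse_complement(seq):
    """Reverse the sequence and swap each base for its complement."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[base] for base in reversed(seq))

def is_palindrome(seq):
    """True when a non-empty DNA string equals its reverse complement."""
    return len(seq) > 0 and seq == reverse_complement(seq)

print(is_palindrome("GCAATATTGC"))              # length-10 example
print(is_palindrome("TCTTTAACAAGCTTGTTAAAGA"))  # 22-base SARS palindrome
print(is_palindrome("ATA"))                     # odd length, cannot qualify
```

An odd-length string can never pass this test, because its middle base would have to be its own complement.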
Probability of Observing a Length 22 Palindrome
• Approximate the palindrome distribution by a Poisson process with rate
$$\lambda = np = 0.01008$$
• Here n = genome length = 29727, and
$$p = \left[\,2\,(\hat{p}_A \hat{p}_T + \hat{p}_C \hat{p}_G)\,\right]^{11}$$
• The probability of the occurrence of at least one length 22 palindrome in the genome is
$$1 - e^{-0.01008} \approx 0.01$$
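The arithmetic behind this estimate can be reproduced in a few lines. This sketch uses the rate 0.01008 and genome length 29727 reported on the slide; the observed SARS base frequencies are not given here, so the uniform frequencies in the second part are a hypothetical stand-in, and the function names are illustrative.

```python
import math

def palindrome_probability(p_a, p_c, p_g, p_t, half_length):
    """Probability [2(pA*pT + pC*pG)]^L that a palindrome of length 2L
    occurs at a given position under the i.i.d. model."""
    return (2.0 * (p_a * p_t + p_c * p_g)) ** half_length

def prob_at_least_one(rate):
    """Poisson probability of one or more events for the given rate."""
    return 1.0 - math.exp(-rate)

# Rate reported on the slide for n = 29727 and palindrome length 22.
print(prob_at_least_one(0.01008))

# Hypothetical uniform base frequencies, for illustration only.
n = 29727
p = palindrome_probability(0.25, 0.25, 0.25, 0.25, half_length=11)
print(n * p, prob_at_least_one(n * p))
```

Because the rate is small, the probability of at least one occurrence is almost equal to the rate itself.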
Expression of Overlapping Genes Requires Frameshifting in Reading
Frameshifting must have the following elements:
• Slippery Sequence - a mechanism that allows the “reader” (called ribosome) to slip
• Stimulatory Element - a pseudoknot or stem-loop structure that blocks the ribosome
Slippery Sequences
Short binding sequence
Pseudoknot:
Palindrome-like sequences
Stem-loop or hairpin loop:
-1 Frameshifting
[Figure: ribosome, start of reading, and pseudoknot on the sequence CGGGTTT]
• Reading with default frame: GGG then TTT
• Reading after -1 frameshifting: CGG then GTT
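The two readings of CGGGTTT differ only in where triplet reading starts. A minimal sketch, in which the helper `codons` and its 0-based `start` argument are illustrative choices:

```python
def codons(seq, start):
    """Split seq into consecutive triplets beginning at 0-based index
    start; an incomplete trailing triplet is dropped."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

sequence = "CGGGTTT"
print(codons(sequence, 1))  # default frame
print(codons(sequence, 0))  # after slipping back one base (-1 frameshift)
```

Moving the start index back by one base is all that separates the default frame from the -1 frame in this example.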
Heptanucleotide Slippery Sequences
• ORF1a and ORF1b (Overlap: 13398, 1 base only)
• X1 and X2 (Overlap: 25689-26089, 401 bases)
A string in the form of XXXYYYN, where X = A, T, or G; Y = A or T; and N = A, T, or C
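A pattern of this form can be searched for with a regular expression. The slide does not say whether the three X bases (and the three Y bases) must be identical, so this sketch offers both readings as assumptions: `LOOSE` lets every position vary within its set, while `STRICT` requires XXX and YYY to be runs of one base. The names and the test strings are illustrative.

```python
import re

# Loose reading: each X, Y, N position is drawn independently from its set.
LOOSE = re.compile(r"(?=([ATG]{3}[AT]{3}[ATC]))")
# Strict reading: the three X bases are identical, as are the three Y bases.
STRICT = re.compile(r"(?=(([ATG])\2\2([AT])\3\3[ATC]))")

def find_slippery(seq, pattern=LOOSE):
    """Return (1-based position, heptanucleotide) for every match,
    including overlapping ones (hence the lookahead in the patterns)."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(seq)]

print(find_slippery("CCTTTAAACCC"))
print(find_slippery("TTGAAAA"))
print(find_slippery("TTGAAAA", STRICT))
```

TTTAAAC matches under both readings, while TTGAAAA matches only the loose one.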
Locations of Slippery Sequences
[Genome map: ORF1a, ORF1b, S, X1, X2; marked positions 256, 13398, 21485, 25268]
Slippery Sequences: TTTAAAC, TTGAAAA ?
• Immediately preceding the overlapping base between 1a and 1b, there is a slippery sequence followed by a pseudoknot (Thiel et al. 2003)
• Possible slippery sequences are detected in the overlapping region of X1 and X2. Is there any pseudoknot or stem-loop structure in close proximity downstream?
Pseudoknot Predicted by PknotsRG
∆G = -55.84 kcal/mol.
RNA Secondary Structure Prediction
• Predicting RNA secondary structures, including pseudoknots, is a computationally intensive problem.
• Use of heterogeneous grid computing to provide computing power.
• Use of palindrome content evaluation to improve prediction consistency.
Palindrome counts in random nucleotide sequences
Define the indicator random variable
$$I_j = \begin{cases} 1 & \text{if a palindrome of length} \ge 2L \text{ occurs at base } j \\ 0 & \text{otherwise} \end{cases}$$
Then
$$X_L = \sum_{j=L}^{n-L} I_j$$
is the total count of palindromes of length at least 2L in a sequence of length n.
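The indicator and the count translate directly into code. A minimal sketch for DNA letters, keeping the 1-based position j of the definition; the function names `indicator` and `palindrome_count` are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def indicator(seq, j, L):
    """I_j for 1-based position j: 1 if bases (j-i+1) and (j+i) are
    complementary for every i = 1..L, else 0."""
    return int(all(COMPLEMENT.get(seq[j - i]) == seq[j + i - 1]
                   for i in range(1, L + 1)))

def palindrome_count(seq, L):
    """X_L: the sum of I_j over j = L, ..., n - L."""
    n = len(seq)
    return sum(indicator(seq, j, L) for j in range(L, n - L + 1))

example = "GCAATATTGC"
for half_length in (1, 2, 5):
    print(half_length, palindrome_count(example, half_length))
```

A long palindrome is counted once at its center for every L up to its half-length, which is why the count refers to palindromes of length at least 2L.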
Mean and variance of palindrome counts
$$\mu_L = E(X_L) = \sum_{j=L}^{n-L} E(I_j) = (n - 2L + 1)\,E(I_L)$$
$$\sigma_L^2 = \operatorname{var}(X_L) = \sum_{j=L}^{n-L} \operatorname{var}(I_j) + 2 \sum_{j=L}^{n-L-1} \sum_{k=j+1}^{n-L} \operatorname{cov}(I_j, I_k)$$
Mean and variance of palindrome counts (cont’d)
If we let
$$\gamma(0) = P(I_j = 1) \quad \text{for } L \le j \le n - L$$
$$\gamma(d) = P(I_j = 1,\ I_{j+d} = 1) \quad \text{for } 1 \le d \le n - 2L$$
then
$$E(I_j) = \gamma(0), \qquad \operatorname{var}(I_j) = \gamma(0)\,(1 - \gamma(0)), \qquad \operatorname{cov}(I_j, I_{j+d}) = \gamma(d) - \gamma(0)^2$$
and
$$\mu_L = E(X_L) = (n - 2L + 1)\,\gamma(0)$$
$$\sigma_L^2 = \operatorname{var}(X_L) = (n - 2L + 1)\,\gamma(0)\,(1 - \gamma(0)) + 2 \sum_{d=1}^{n-2L} (n - 2L + 1 - d)\,\bigl(\gamma(d) - \gamma(0)^2\bigr)$$
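Once the γ's are available, the mean and variance of the count follow by direct summation. In this sketch the caller supplies `gamma` as a function of d; the toy `gamma` in the usage lines treats all indicators as independent, which is a hypothetical simplification rather than the i.i.d. or Markov result.

```python
def mean_and_variance(n, L, gamma):
    """Mean and variance of X_L:
    mean = (n-2L+1) * g0
    var  = (n-2L+1) * g0 * (1-g0)
           + 2 * sum_{d=1}^{n-2L} (n-2L+1-d) * (gamma(d) - g0^2)
    where g0 = gamma(0)."""
    m = n - 2 * L + 1          # number of positions j = L, ..., n - L
    g0 = gamma(0)
    mean = m * g0
    var = m * g0 * (1.0 - g0)
    var += 2.0 * sum((m - d) * (gamma(d) - g0 ** 2)
                     for d in range(1, n - 2 * L + 1))
    return mean, var

# Toy gamma (assumption): indicators independent, so gamma(d) = gamma(0)^2.
def toy_gamma(d):
    return 0.1 if d == 0 else 0.1 ** 2

print(mean_and_variance(100, 5, toy_gamma))
```

With independent indicators every covariance term vanishes and the variance reduces to the binomial value; overlapping palindromes make γ(d) differ from γ(0)² for small d.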
How to find the γ’s?
• Under a Markov sequence model, Chew et al. (INFORMS Journal on Computing, 16:331-340, 2004) have obtained computable formulas for the γ’s, expressed in terms of the transition and stationary probabilities of the Markov chain. These can be estimated from the observed base frequencies and dinucleotide frequencies.
• Let’s look at a special case, namely the i.i.d. random sequence model where the nucleotide bases are generated independently with probabilities pA, pC, pG, pT.
Finding γ(0) for the i.i.d. sequence model
$$\gamma(0) = P(I_j = 1) = \left[\,2\,(p_A p_T + p_C p_G)\,\right]^{L}$$
Positions: j - L + 1, …, j, j + 1, …, j + L
Bases: b1 b2 … bL bL+1 … b2L-1 b2L
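The closed form for γ(0) can be sanity-checked by summing the probabilities of all palindromic windows of 2L bases. This is a brute-force sketch for small L only; the base frequencies below are hypothetical and the function names are illustrative.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def gamma0_formula(p, L):
    """Closed form [2(pA*pT + pC*pG)]^L."""
    return (2.0 * (p["A"] * p["T"] + p["C"] * p["G"])) ** L

def gamma0_enumerated(p, L):
    """Sum the i.i.d. probabilities of every window of 2L bases whose
    i-th bases on either side of the center are complementary."""
    total = 0.0
    for bases in product("ACGT", repeat=2 * L):
        if all(COMPLEMENT[bases[L - i]] == bases[L + i - 1]
               for i in range(1, L + 1)):
            prob = 1.0
            for base in bases:
                prob *= p[base]
            total += prob
    return total

# Hypothetical base frequencies (must sum to 1).
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(gamma0_formula(freqs, 2), gamma0_enumerated(freqs, 2))
```

The enumeration grows as 4^(2L), so it is practical only as a check for small L.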
Finding γ(d) for the i.i.d. sequence model:
Case 1: d ≥ 2L
Case 2: L ≤ d < 2L
Case 3: 1 ≤ d < L
Palindromes, and More Palindromes…
i - L i - 2 i - 1 i i + 1 i + 2 i + 3 i +L+1 b1 b2 … bL - 1 bL b’L b’L - 1 … b’2 b’1
i - L i - 3 i - 2 i - 1 i i + 1 i + 2 i + 3 i +L+1 b1 b2 … bL - 1 bL b’L b’L - 1 … b’2 b’1
Acknowledgements
Collaborators
• Louis H. Y. Chen (National University of Singapore)
• David Chew (National University of Singapore)
• Kwok Pui Choi (National University of Singapore)
• Raul Cruz-Cano (The University of Texas at El Paso)
• Hans Heidner (The University of Texas at San Antonio)
• Michela Taufer (The University of Texas at El Paso)
• Aihua Xia (University of Melbourne, Australia)
Funding Support
• NIH Grants S06GM08012-35, 2G12RR008124-11 and 3T34GM008048-20S1
• THECB Advanced Research Program 003661-0008-2006