RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar. PMSB2006, June 18, Tuusula, Finland. Yuki Kato, Hiroyuki Seki and Tadao Kasami, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST).
RNA Structure Prediction Including Pseudoknots
Based on Stochastic Multiple Context-Free Grammar
PMSB2006, June 18, Tuusula, Finland
Yuki Kato, Hiroyuki Seki and Tadao Kasami
Graduate School of Information Science,
Nara Institute of Science and Technology (NAIST)
2
NAIST
3
Table of Contents
Background
  Grammatical approach to RNA structure modeling
Model
  Stochastic multiple context-free grammar
Algorithms
  Parsing and parameter estimation
Experimental results
  RNA pseudoknot prediction
Summary
4
RNA Secondary Structure: Stem-Loop
[Figure: stem-loop structure of an RNA sequence (5' to 3'), with the stem and the loop labeled. When the base pairs are connected with arcs drawn over the sequence, the arcs are nested.]
Complementary base pairs: A•U, G•C
5
Modeling RNA Secondary Structure by Context-Free Grammar (CFG)
RNA secondary structure can be modeled by the parse structure of a CFG.
Structure prediction corresponds to parsing.
Example of CFG rules: [rules not recovered]
[Figure: secondary structure of the sequence UUCAUCAGAA and the corresponding derivation tree over u u c a u c a g a a, whose internal nodes are all labeled S.]
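To make "structure prediction as parsing" concrete, here is a minimal sketch in Python. The grammar is hypothetical (S -> x S y S for a complementary pair x, y; S -> x S; S -> empty), not the rules from the slide, and the score is simply the number of base pairs (Nussinov-style maximization) rather than a probability. It only illustrates that a CFG parse yields a nested structure.

```python
from functools import lru_cache

# Watson-Crick pairs only; an assumption made for this sketch.
PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}


def max_pairs(seq: str) -> int:
    """Maximum number of nested base pairs over all parses of seq.

    Hypothetical grammar: S -> x S y S (x, y complementary) | x S | empty.
    No minimum loop length is enforced.
    """
    seq = seq.lower()

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> int:
        # best(i, j): score of the best parse of seq[i:j]
        if j - i < 2:
            return 0
        score = best(i + 1, j)  # S -> x S: seq[i] stays unpaired
        for k in range(i + 1, j):
            if (seq[i], seq[k]) in PAIRS:
                # S -> x S y S: seq[i] pairs with seq[k]
                score = max(score, 1 + best(i + 1, k) + best(k + 1, j))
        return score

    return best(0, len(seq))
```

Because every pair encloses one sub-parse and is followed by another, the pairs found this way can never cross, which is exactly the limitation discussed on the next slide.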
6
RNA Secondary Structure: Pseudoknot
CFGs cannot represent pseudoknots.
[Figure: pseudoknot structure of an RNA sequence (5' to 3'). When the base pairs are connected with arcs drawn over the sequence, the arcs are crossed.]
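The nested/crossed distinction can be checked mechanically. The sketch below is illustrative only; the function name and the representation of a structure as a list of position pairs are assumptions, not part of the slides. Two arcs (i, j) and (k, l) cross when i < k < j < l.

```python
from itertools import combinations


def has_crossing(pairs):
    """Return True if any two base pairs cross (i < k < j < l)."""
    norm = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(norm, 2):
        # norm is sorted, so i <= k here
        if i < k < j < l:
            return True
    return False
```

A stem-loop such as `[(0, 6), (1, 5), (2, 4)]` has only nested arcs, while `[(0, 5), (3, 8)]` contains a crossing and is therefore a pseudoknot.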
7
Early Studies
                         Brown & Wilson /     Rivas & Eddy /      Matsui et al.
                         Cai et al.           Uemura et al.
Generative power         Based on CFG         > CFG               > CFG
Metric for optimization  Stochastic grammar   Free energy         Stochastic grammar
Time complexity          O(n^3) / O(n^6)      O(n^6) / O(n^5)     O(n^5)
n: sequence length
8
Early Studies (cont.)
Grammars for fully describing RNA pseudoknots:
  SL-TAG and ESL-TAG [Uemura et al., 1999]
  RPG [Rivas and Eddy, 2000]
These grammars have been identified as subclasses of multiple context-free grammars. [Kato et al., 2005]
9
Motivation
Multiple context-free grammar (MCFG):
  Natural extension of CFG
  Easy to compare generative power and design algorithms
  Generative power to represent pseudoknots
  Polynomial time parsing algorithm
We have shown a candidate for the minimal subclass of MCFGs that can represent pseudoknots. [Kato et al., 2005]
10
What’s New in the Present Work
Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)
Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs
Experiments on RNA pseudoknot prediction
11
Early Studies and Present Work
                         Brown & Wilson /     Rivas & Eddy /      Matsui et al.        Our model
                         Cai et al.           Uemura et al.
Generative power         Based on CFG         > CFG               > CFG                > CFG
Metric for optimization  Stochastic grammar   Free energy         Stochastic grammar   Stochastic grammar
Time complexity          O(n^3) / O(n^6)      O(n^6) / O(n^5)     O(n^5)               O(n^5)
12
Table of Contents
Background
  Grammatical approach to RNA structure modeling
Model
  Stochastic multiple context-free grammar
Algorithms
  Parsing and parameter estimation
Experimental results
  RNA pseudoknot prediction
Summary
13
Relation between SMCFG and Major Probabilistic Models
Generative power   Grammar   Probabilistic extension   Example of what is modeled
Strong             MCFG      SMCFG                     Pseudoknot
                   CFG       SCFG                      Stem-loop
Weak               FA        HMM                       Gene finding
14
From HMM to SCFG
        HMM                                SCFG
Rule    Terminal and nonterminal           Sequence of nonterminals and terminals
        "Emit a in A and transit to B"
15
Stochastic Multiple Context-Free Grammar (SMCFG)
G = (N, T, F, P, S)
  N: finite set of nonterminals
  T: finite set of terminals
  F: finite set of functions
  P: finite set of rules with probabilities
  S ∈ N: start symbol

              SCFG                                     SMCFG
Nonterminal   generates sequence                       generates tuple of sequences
Rule          Sequence of nonterminals and terminals   A0 → f[A1, …, Ak]
                                                       A0, …, Ak: nonterminals
                                                       f: function defined over sequences
16
Functions of SMCFG
Example:
17
Rules of SMCFG
Rule: each rule carries a probability p, the probability that the rule is applied.
The probabilities of the rules with the same left-hand side should sum to one.
Example: [rule example not recovered]
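The normalization condition can be checked directly. The following is a minimal sketch under assumptions: rules are stored as tuples (left-hand side, function name, right-hand-side nonterminals, probability), and the rule set `RULES`, including the function names `f`, `g` and `h`, is made up for illustration.

```python
import math
from collections import defaultdict


def is_normalized(rules, tol=1e-9):
    """True if, for every left-hand side, the rule probabilities sum to one."""
    totals = defaultdict(float)
    for lhs, _func, _rhs, prob in rules:
        totals[lhs] += prob
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())


# Hypothetical rule set: (lhs, function name, rhs nonterminals, probability)
RULES = [
    ("S", "f", ("A",), 1.0),
    ("A", "g", ("A",), 0.6),
    ("A", "h", (), 0.4),
]
```

With this rule set, `is_normalized(RULES)` holds; changing 0.4 to 0.3 would violate the condition for the left-hand side `A`.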
18
Derivation Trees in SMCFG
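[Figure: derivation tree whose root is labeled A: f and whose subtrees are rooted at A1, …, Ak with probabilities p1, …, pk. The probability of the whole tree is the probability of the rule applied at the root multiplied by p1, …, pk.]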
19
Modeling Pseudoknot by SMCFG
UP2La[(x1, x2)] = (x1, a x2)
UP2Ru[(x1, x2)] = (x1, x2 u)
[Figure: derivation tree, listed from the bottom node to the root]
  A: (ag, cu)      Prob. 0.7
  B: (ag, acu)     Prob. 0.35
  A: (ag, acuu)    Prob. 0.28
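The derivation on this slide can be replayed in a few lines of Python. The two functions follow the definitions above; the rule probabilities 0.5 and 0.8 are not stated on the slide and are inferred here from the ratios 0.35 / 0.7 and 0.28 / 0.35, so treat them, and the assignment of rules to nonterminals, as assumptions.

```python
def up2_la(x):
    """UP2La[(x1, x2)] = (x1, a x2): attach 'a' to the left of the second component."""
    x1, x2 = x
    return (x1, "a" + x2)


def up2_ru(x):
    """UP2Ru[(x1, x2)] = (x1, x2 u): attach 'u' to the right of the second component."""
    x1, x2 = x
    return (x1, x2 + "u")


def derive():
    """Replay the derivation; returns a list of (tuple of sequences, probability)."""
    p_rule_b = 0.5  # assumed probability of the rule B -> UP2La[A]
    p_rule_a = 0.8  # assumed probability of the rule A -> UP2Ru[B]
    steps = [(("ag", "cu"), 0.7)]  # subtree given on the slide
    pair, prob = steps[-1]
    steps.append((up2_la(pair), prob * p_rule_b))
    pair, prob = steps[-1]
    steps.append((up2_ru(pair), prob * p_rule_a))
    return steps
```

Each step multiplies the probability of the subtree by the probability of the rule applied on top of it, which reproduces the values 0.35 and 0.28 shown in the figure. Note that the a added on the left and the u added on the right of the second component form a base pair.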
20
SMCFG for RNA Pseudoknot Modeling
W1, …, Wm: nonterminals
Note: W1 is the start symbol.
For each rule, two real values called transition probability p1 (0 < p1 ≤ 1) and emission probability p2 (0 < p2 ≤ 1) are specified.
The probability of each rule is defined as p = p1 · p2.
SMCFG Gs
22
Table of Contents
Background
  Grammatical approach to RNA structure modeling
Model
  Stochastic multiple context-free grammar
Algorithms
  Parsing and parameter estimation
Experimental results
  RNA pseudoknot prediction
Summary
23
Algorithms for SMCFG
CYK algorithm: calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).
Inside algorithm: calculates the probability of a sequence given an SMCFG.
Inside-outside algorithm: estimates optimal probability parameters for an SMCFG given a set of example sequences.
24
CYK Algorithm
Input: [...]
The following are calculated by dynamic programming:
  [...]: log maximum probability that Wv generates [...]
  [...]: log maximum probability that Wy generates [...]
25
CYK Algorithm (cont.)
Output: log maximum probability that W1 generates [...], i.e. [...]
  [...]: the most likely derivation tree
  [...]: entire set of probability parameters
26
Algorithm [CYK]
Initialization:
  for i←1 to n+1, j←i to n+1, v←1 to m
    do if [...]   // ε: empty sequence
       then [...]
       else [...]
Iteration:
  for i←n downto 1, j←i−1 to n, k←n+1 downto j+1, l←k−1 to n, v←1 to m
  // Some examples are shown.
27
Algorithm [CYK] (cont.)
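if [...]
[Figure: case in which Wv has two children Wy and Wz. The sequence positions i, h, h+1, j, k and l are marked, together with the components x1, x21 and x22.]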
28
Algorithm [CYK] (cont.)
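if [...]
[Figure: case in which Wv has a single child Wy. The components x1 and x2 and the terminals a_i and a_l are marked on the sequence.]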
29
Complexity of CYK Algorithm
m: # of nonterminals (m = a+b)
n: sequence length
Time complexity: O(amn^4 + bn^5)
Space complexity: O(mn^4)
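The n^4 factor comes from the four sequence positions (i, j, k, l) that index each table entry. As a rough illustration, the sketch below counts the index quadruples with 1 ≤ i ≤ j < k ≤ l ≤ n, that is, two non-empty, non-overlapping intervals. This indexing is an assumption made for the example; the algorithm on the slides also allows empty intervals, which changes the count slightly but not its order of growth.

```python
from math import comb


def count_cells_bruteforce(n: int) -> int:
    """Count quadruples (i, j, k, l) with 1 <= i <= j < k <= l <= n by enumeration."""
    return sum(
        1
        for i in range(1, n + 1)
        for j in range(i, n + 1)
        for k in range(j + 1, n + 1)
        for l in range(k, n + 1)
    )


def count_cells(n: int) -> int:
    """Closed form of the same count: C(n + 2, 4), which grows like n^4 / 24."""
    return comb(n + 2, 4)
```

Multiplying `count_cells(n)` by the number of nonterminals m gives an estimate of the table size behind the O(mn^4) space bound. A rule that additionally tries every split point h contributes one more factor of n in time.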
30
Table of Contents
Background
  Grammatical approach to RNA structure modeling
Model
  Stochastic multiple context-free grammar
Algorithms
  Parsing and parameter estimation
Experimental results
  RNA pseudoknot prediction
Summary
31
Experimental Method
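[Figure: experimental workflow. Sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and used to construct an SMCFG model. A test sequence (e.g. CUAGUCUUA) is then parsed with the CYK algorithm to obtain a secondary structure prediction.]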
32
Data Sets for Experiments
Three viral RNA families including pseudoknots from Rfam ver. 7.0
Family         Range of length   # of annotated sequences   # of test sequences
Corona_pk_3    62-64             14                         10
HDV_ribozyme   87-91             15                         10
Tombus_3_IV    89-92             18                         12
33
Corona_pk_3 in Rfam ver. 7.0
Coronavirus 3' UTR pseudoknot
Sequence length: 62-64
Consensus structure
34
HDV_ribozyme in Rfam ver. 7.0
Hepatitis delta virus ribozyme
Sequence length: 87-91
Consensus structure
35
Tombus_3_IV in Rfam ver. 7.0
Tombusvirus 3' UTR region IV
Sequence length: 89-92
Consensus structure
36
Evaluation for Prediction Results
precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs)
recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
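These two measures are straightforward to compute once a structure is written as a set of base pairs. The sketch below is illustrative; the function name, the pair representation and the convention of returning 0.0 for an empty set are assumptions rather than details taken from the slides.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for two collections of base pairs (i, j)."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in annotated}
    correct = len(pred & ref)  # predicted pairs that also appear in the annotation
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```

For example, if three pairs are predicted, all of them correct, and the annotation lists four pairs, precision is 1.0 and recall is 0.75.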
37
Experimental Results
Prediction accuracy

RNA family     Precision [%]             Recall [%]
               Average   Min     Max     Average   Min     Max
Corona_pk_3    99.4      94.4    100.0   99.4      94.4    100.0
HDV_ribozyme   100.0     100.0   100.0   100.0     100.0   100.0
Tombus_3_IV    100.0     100.0   100.0   100.0     100.0   100.0
38
Experimental Results (cont.)
Running time

RNA family     CPU time* [sec]
               Average   Min     Max
Corona_pk_3    27.8      26.0    30.4
HDV_ribozyme   252.1     219.0   278.4
Tombus_3_IV    244.8     215.2   257.5

*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB RAM
39
Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]
[Figure: PSTAG workflow. Sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and give a derivation tree representing the known structure. The PSTAG algorithm aligns a test sequence (e.g. CUAGUCUUA) with it to obtain a secondary structure prediction.]
[MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, 2005.
40
Comparison with PSTAG
Model    Average precision [%]      Average recall [%]
         Corona   HDV     Tombus    Corona   HDV     Tombus
SMCFG    99.4     100.0   100.0     99.4     100.0   100.0
PSTAG    95.5     95.6    97.4      94.6     94.1    97.4
41
Summary
A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.
Polynomial time parsing and parameter estimation algorithms have been designed.
Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.