RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar. PMSB2006, June 18, Tuusula, Finland. Yuki Kato, Hiroyuki Seki and Tadao Kasami, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST).


RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar

PMSB2006, June 18, Tuusula, Finland

Yuki Kato, Hiroyuki Seki and Tadao Kasami
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)


Table of Contents

Background: Grammatical approach to RNA structure modeling
Model: Stochastic multiple context-free grammar
Algorithms: Parsing and parameter estimation
Experimental results: RNA pseudoknot prediction
Summary

RNA Secondary Structure: Stem-Loop

Complementary base pairs: A•U, G•C

[Figure: a stem-loop drawn from 5' to 3', with the stem (stacked base pairs such as C•G and U•A) and the loop labeled. When the base pairs of the sequence C U U C A U C A G A A A A U G A C are connected with arcs, the arcs are nested.]

Modeling RNA Secondary Structure by Context-Free Grammar (CFG)

RNA secondary structure can be modeled by the parse structure of a CFG, so structure prediction corresponds to parsing.

Example of CFG rules: [rules not preserved in this transcript]

[Figure: the secondary structure of U U C A U C A G A A shown next to its derivation tree, whose internal nodes are labeled S and whose leaves are u u c a u c a g a a.]
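To make the link between parsing and structure prediction concrete, here is a minimal sketch in Python. It is not the grammar from the slides (those rules are not preserved); it assumes a toy grammar S → x S y S | x S | ε, where x and y are complementary bases, and scores a parse by its number of base pairs. The function name `max_nested_pairs` and the `min_loop` parameter are hypothetical.

```python
from functools import lru_cache

# Watson-Crick pairs only; this is an assumption of the toy grammar.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

def max_nested_pairs(seq, min_loop=3):
    """Best parse score (number of base pairs) under the toy grammar
    S -> x S y S | x S | epsilon, with x and y complementary.
    min_loop is the minimum number of unpaired bases enclosed by a pair."""
    seq = seq.lower()

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for the half-open span seq[i:j].
        if j - i <= min_loop:
            return 0
        # Rule S -> x S: leave seq[i] unpaired.
        score = best(i + 1, j)
        # Rule S -> x S y S: pair seq[i] with seq[k].
        for k in range(i + min_loop + 1, j):
            if (seq[i], seq[k]) in PAIRS:
                score = max(score, 1 + best(i + 1, k) + best(k + 1, j))
        return score

    return best(0, len(seq))
```

Because every rule of this toy grammar produces only nested pairs, the dynamic program can never return a crossed (pseudoknotted) structure.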

RNA Secondary Structure: Pseudoknot

CFGs cannot represent pseudoknots.

[Figure: a pseudoknot formed by 5'-C U U C A A G A C U U G A C-3'. When the base pairs are connected with arcs, the arcs are crossed.]
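The difference between nested and crossed arcs can be checked mechanically. The sketch below assumes a structure is given as a list of base-pair index tuples (a hypothetical representation, not one taken from the slides) and reports whether any two arcs cross, which is the signature of a pseudoknot.

```python
from itertools import combinations

def has_crossing(pairs):
    """Return True if two base pairs (i, j) and (k, l) satisfy i < k < j < l,
    i.e. their arcs cross when drawn above the sequence."""
    # Normalize each pair so the smaller index comes first.
    norm = sorted(tuple(sorted(p)) for p in pairs)
    for (i, j), (k, l) in combinations(norm, 2):
        # After sorting, i <= k always holds for the two pairs compared.
        if i < k < j < l:
            return True
    return False

stem_loop = [(0, 8), (1, 7), (2, 6)]    # nested arcs
pseudoknot = [(0, 5), (1, 4), (3, 9)]   # (0, 5) and (3, 9) cross
```

Here `has_crossing(stem_loop)` is false and `has_crossing(pseudoknot)` is true.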

Early Studies

                          Brown & Wilson /      Rivas & Eddy /        Matsui et al.
                          Cai et al.            Uemura et al.
Generative power          Based on CFG          > CFG                 > CFG
Metric for optimization   Stochastic grammar    Free energy           Stochastic grammar
Time complexity           O(n^3) / O(n^6)       O(n^6) / O(n^5)       O(n^5)

n: sequence length

Early Studies (cont.)

Grammars for fully describing RNA pseudoknots:
SL-TAG and ESL-TAG [Uemura et al., 1999]
RPG [Rivas and Eddy, 2000]

These grammars have been identified as subclasses of multiple context-free grammars [Kato et al., 2005].

Motivation

Multiple context-free grammar (MCFG):
Natural extension of CFG
Easy to compare generative power and design algorithms
Generative power to represent pseudoknots
Polynomial time parsing algorithm

We have shown a candidate subclass of the minimum grammars of MCFGs for representing pseudoknots [Kato et al., 2005].


What’s New in the Present Work

Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)

Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs

Experiments on RNA pseudoknot prediction

Early Studies and Present Work

                          Brown & Wilson /      Rivas & Eddy /        Matsui et al.         Our model
                          Cai et al.            Uemura et al.
Generative power          Based on CFG          > CFG                 > CFG                 > CFG
Metric for optimization   Stochastic grammar    Free energy           Stochastic grammar    Stochastic grammar
Time complexity           O(n^3) / O(n^6)       O(n^6) / O(n^5)       O(n^5)                O(n^5)


Relation between SMCFG and Major Probabilistic Models

Generative power   Grammar   Probabilistic extension   Example
Strong             MCFG      SMCFG                     Pseudoknot (A G A C U U)
                   CFG       SCFG                      Stem-loop (A G A C U)
Weak               FA        HMM                       Gene finding (genes)

From HMM to SCFG

       HMM                                 SCFG
Rule   Terminal and nonterminal            Sequence of nonterminals and terminals
       ("Emit a in A and transit to B")

Stochastic Multiple Context-Free Grammar (SMCFG)

G = (N, T, F, P, S)
N: finite set of nonterminals
T: finite set of terminals
F: finite set of functions
P: finite set of rules with probabilities
S ∈ N: start symbol

              SCFG                                     SMCFG
Nonterminal   generates a sequence                     generates a tuple of sequences
Rule          Sequence of nonterminals and terminals   A0 → f[A1, …, Ak]

A0, …, Ak: nonterminals
f: function defined over sequences

Functions of SMCFG

Example: [not preserved in this transcript]

Rules of SMCFG

Rule: [formula not preserved in this transcript]
Each rule carries the probability that the rule is applied.
The sum of the probabilities of the rules with the same left-hand side should be one.
Example: [not preserved in this transcript]
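The normalization constraint on rule probabilities is easy to verify in code. The sketch below assumes a rule is stored as a `(left-hand side, right-hand side, probability)` tuple; this representation, the function name `is_normalized` and the toy rules are all hypothetical and are not the grammar from the slides.

```python
import math
from collections import defaultdict

def is_normalized(rules, tol=1e-9):
    """Check that the probabilities of rules sharing a left-hand side sum to one."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return all(math.isclose(total, 1.0, abs_tol=tol) for total in totals.values())

# Toy rules for illustration only; function names on the right are made up.
toy_rules = [
    ("A", "f1[B]", 0.7),
    ("A", "f2[]", 0.3),
    ("B", "f3[A]", 1.0),
]
bad_rules = [
    ("A", "f1[B]", 0.7),
    ("A", "f2[]", 0.2),
]
```

With these inputs, `is_normalized(toy_rules)` holds and `is_normalized(bad_rules)` does not, because the rules for A in `bad_rules` sum to 0.9.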

Derivation Trees in SMCFG

[Figure: a derivation tree whose root is labeled A: f and whose subtrees are rooted at A1, …, Ak with probabilities p1, …, pk. The probability of the whole tree is given by a formula that is not preserved in this transcript.]
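The slide's own formula is lost, so the following sketch relies on the standard convention for stochastic grammars, stated here as an assumption: the probability of a derivation tree is the probability of the rule applied at the root multiplied by the probabilities of its subtrees. Trees are represented as hypothetical `(rule_probability, children)` pairs, and the computation is done in log space, as the CYK slides do.

```python
import math

def tree_log_prob(node):
    """Log probability of a derivation tree given as (rule_probability, children).
    Assumes the tree probability is the product of all applied rule probabilities."""
    prob, children = node
    return math.log(prob) + sum(tree_log_prob(child) for child in children)

# Illustrative numbers only: a chain of three rule applications.
leaf = (0.7, [])
middle = (0.5, [leaf])
root = (0.8, [middle])
```

Under that assumption, `math.exp(tree_log_prob(root))` is 0.8 × 0.5 × 0.7 = 0.28 up to floating-point error.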

Modeling Pseudoknot by SMCFG

UP2La[(x1, x2)] = (x1, a x2)
UP2Ru[(x1, x2)] = (x1, x2 u)

[Figure: a derivation tree over the tuples (a g, c u), (a g, a c u) and (a g, a c u u), with nodes labeled A, B and A and with probabilities 0.7, 0.35 and 0.28.]
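The two functions on this slide act on pairs of sequences, and they can be written down directly. The sketch below uses Python strings for sequences; the names `up2_la` and `up2_ru` are transliterations of UP2La and UP2Ru chosen for this example.

```python
def up2_la(x):
    """UP2La[(x1, x2)] = (x1, a x2): prepend 'a' to the second component."""
    x1, x2 = x
    return (x1, "a" + x2)

def up2_ru(x):
    """UP2Ru[(x1, x2)] = (x1, x2 u): append 'u' to the second component."""
    x1, x2 = x
    return (x1, x2 + "u")

start = ("ag", "cu")
step1 = up2_la(start)   # ("ag", "acu")
step2 = up2_ru(step1)   # ("ag", "acuu")
```

Applying UP2La and then UP2Ru to (a g, c u) reproduces the tuples shown in the figure, (a g, a c u) and (a g, a c u u). Because a nonterminal carries two separate components, bases can be added to one component independently of the other, which a single CFG-generated string cannot express.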

SMCFG for RNA Pseudoknot Modeling

W1, …, Wm: nonterminals
Note: W1 is the start symbol.
For each rule, two real values called transition probability p1 (0 < p1 ≤ 1) and emission probability p2 (0 < p2 ≤ 1) are specified.
The probability of each rule is defined as [formula not preserved in this transcript].

SMCFG Gs

[Rule table of Gs not preserved in this transcript]


Algorithms for SMCFG

CYK algorithm: calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).
Inside algorithm: calculates the probability of a sequence given an SMCFG.
Inside-outside algorithm: estimates optimal probability parameters for an SMCFG given a set of example sequences.

CYK Algorithm

Input: [not preserved in this transcript]
The following are calculated by dynamic programming:
[symbol not preserved]: log maximum probability that Wv generates [subsequence not preserved]
[symbol not preserved]: log maximum probability that Wy generates [subsequences not preserved]

CYK Algorithm (cont.)

Output: log maximum probability that W1 generates [sequence not preserved], i.e. [formula not preserved]
[symbol not preserved]: the most likely derivation tree
[symbol not preserved]: entire set of probability parameters

Algorithm [CYK]

Initialization:
for i←1 to n+1, j←i to n+1, v←1 to m
  do if [condition not preserved] // ε: empty sequence
     then [not preserved]
     else [not preserved]

Iteration:
for i←n downto 1, j←i−1 to n, k←n+1 downto j+1, l←k−1 to n, v←1 to m
// Some examples are shown.

Algorithm [CYK] (cont.)

if [rule condition not preserved]

[Figure: Wv with children Wy and Wz, drawn over a sequence of length n with positions i, h, h+1, j, k and l marked and the components labeled x1, x21 and x22.]

Algorithm [CYK] (cont.)

if [rule condition not preserved]

[Figure: Wv with child Wy, drawn over a sequence of length n with positions around i, j, k and l marked and the components labeled x1 and x2 together with the bases ai and al.]

Complexity of CYK Algorithm

m: # of nonterminals (m = a+b)
n: sequence length

Time complexity: O(amn^4 + bn^5)
Space complexity: O(mn^4)


Experimental Method

[Figure: workflow. Sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and used to construct an SMCFG model. A test sequence (e.g. CUAGUCUUA) is then parsed with the CYK algorithm to obtain a secondary structure prediction.]

Data Sets for Experiments

Three viral RNA families including pseudoknots from Rfam ver. 7.0

Family         Range of length   # of annotated sequences   # of test sequences
Corona_pk_3    62-64             14                         10
HDV_ribozyme   87-91             15                         10
Tombus_3_IV    89-92             18                         12

Corona_pk_3 in Rfam ver. 7.0

Coronavirus 3' UTR pseudoknot
Sequence length: 62-64
[Figure: consensus structure]

HDV_ribozyme in Rfam ver. 7.0

Hepatitis delta virus ribozyme
Sequence length: 87-91
[Figure: consensus structure]

Tombus_3_IV in Rfam ver. 7.0

Tombusvirus 3' UTR region IV
Sequence length: 89-92
[Figure: consensus structure]

Evaluation for Prediction Results

precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs)

recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
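The two measures translate directly into code. The sketch below assumes that predicted and annotated structures are given as collections of base-pair index tuples (a hypothetical representation); returning 0.0 when a denominator is empty is a choice made for this example, not something the slides specify.

```python
def precision_recall(predicted, annotated):
    """Base-pair precision and recall as defined above.
    Both arguments are iterables of (i, j) index pairs."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in annotated}
    correct = len(pred & ref)  # correctly predicted base pairs
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

# Made-up example: 2 of 3 predicted pairs are among the 4 annotated pairs.
predicted = [(0, 8), (1, 7), (2, 5)]
annotated = [(0, 8), (1, 7), (2, 6), (3, 5)]
precision, recall = precision_recall(predicted, annotated)
```

For this made-up input, precision is 2/3 and recall is 2/4.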

Experimental Results

Prediction accuracy

RNA family     Precision [%]             Recall [%]
               Average  Min    Max       Average  Min    Max
Corona_pk_3    99.4     94.4   100.0     99.4     94.4   100.0
HDV_ribozyme   100.0    100.0  100.0     100.0    100.0  100.0
Tombus_3_IV    100.0    100.0  100.0     100.0    100.0  100.0

Experimental Results (cont.)

Running time

RNA family     CPU time* [sec]
               Average  Min    Max
Corona_pk_3    27.8     26.0   30.4
HDV_ribozyme   252.1    219.0  278.4
Tombus_3_IV    244.8    215.2  257.5

*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB RAM

Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]

[Figure: workflow. Sample sequences with structure annotation (e.g. CUACUGUUC) are taken from an RNA family database and converted into a derivation tree representing the known structure. The PSTAG algorithm aligns a test sequence (e.g. CUAGUCUUA) to this tree to obtain a secondary structure prediction.]

[MSS05] Matsui et al., "Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures," Bioinformatics, 2005.

Comparison with PSTAG

Model   Average precision [%]       Average recall [%]
        Corona  HDV    Tombus       Corona  HDV    Tombus
SMCFG   99.4    100.0  100.0        99.4    100.0  100.0
PSTAG   95.5    95.6   97.4         94.6    94.1   97.4


Summary

A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.

Polynomial time parsing and parameter estimation algorithms have been designed.

Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.