
Stochastic Context-Free Grammars
for noncoding RNA gene prediction

CBB 231 / COMPSCI 261


    Formal Languages

A formal language is simply a set of strings (i.e., sequences). That set may be infinite.

Let M be a model denoting a language. If M is a generative model, such as an HMM or a grammar, then L(M) denotes the language generated by M. If M is an acceptor model, such as a finite automaton, then L(M) denotes the language accepted by M.

When all the parameters of a stochastic generative model are known, we can ask:

What is the probability that model M will generate string S?

which we denote:

P(S|M)


    Recall: The Chomsky Hierarchy

Examples:

* gene-finding assumes DNA is regular
* secondary structure prediction assumes RNA is context-free
* RNA pseudoknots are context-sensitive

The hierarchy pairs each class of languages with the machines that recognize it:

regular languages: HMMs / regular expressions
context-free languages: SCFGs / PDAs
context-sensitive languages: linear-bounded TMs
recursive languages: halting TMs
recursively enumerable languages: Turing machines
all languages

* each class is a subset of the next higher class in the hierarchy


    Context-free Grammars (CFGs)

A context-free grammar is a generative model denoted by a 4-tuple:

G = (V, Σ, S, R)

where:

Σ is a terminal alphabet (e.g., {a, c, g, t}),
V is a nonterminal alphabet (e.g., {A, B, C, D, E, ...}),
S ∈ V is a special start symbol, and
R is a set of rewriting rules called productions.

Productions in R are rules of the form:

X → λ

for X ∈ V, λ ∈ (V ∪ Σ)*; such a production denotes that the nonterminal symbol X may be rewritten as the expression λ, which may consist of zero or more terminals and nonterminals.


    A Simple Example

As an example, consider G = (VG, Σ, S, RG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

One possible derivation using this grammar is:

S ⇒ aSt            (S → aSt)
  ⇒ acSgt          (S → cSg)
  ⇒ acgScgt        (S → gSc)
  ⇒ acgtSacgt      (S → tSa)
  ⇒ acgtLacgt      (S → L)
  ⇒ acgtNNNNacgt   (L → N N N N)
  ⇒ acgtaNNNacgt   (N → a)
  ⇒ acgtacNNacgt   (N → c)
  ⇒ acgtacgNacgt   (N → g)
  ⇒ acgtacgtacgt   (N → t)
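To make the mechanics concrete, here is a minimal Python sketch (my own encoding, not from the slides) that stores this grammar as a dict and performs a random leftmost derivation:

import random

# The example grammar RG as a dict mapping each nonterminal to its
# alternative right-hand sides. Uppercase letters are nonterminals;
# lowercase letters are terminals.
GRAMMAR = {
    "S": ["aSt", "tSa", "cSg", "gSc", "L"],
    "L": ["NNNN"],
    "N": ["a", "c", "g", "t"],
}

def derive(grammar, start="S", max_steps=100):
    """Perform a random leftmost derivation; return the terminal string."""
    form = start
    for _ in range(max_steps):
        # find the leftmost nonterminal in the current sentential form
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:                  # no nonterminals left: we are done
            return form
        rhs = random.choice(grammar[form[idx]])    # pick a production
        form = form[:idx] + rhs + form[idx + 1:]   # rewrite the nonterminal
    raise RuntimeError("derivation did not terminate")

print(derive(GRAMMAR))   # e.g. 'acgtacgtacgt'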


    Derivations

Suppose a CFG G has generated a terminal string x ∈ Σ*. A derivation denotes a single way by which G may have generated x. For a grammar G and a string x, there may exist multiple distinct derivations.

A derivation (or parse) consists of a series of applications of productions from R, beginning with the start symbol S and ending with the terminal string x:

S ⇒ s1 ⇒ s2 ⇒ s3 ⇒ ... ⇒ x

We can denote this more compactly as: S ⇒* x. Each string si in a derivation is called a sentential form, and may consist of both terminal and nonterminal symbols: si ∈ (V ∪ Σ)*. Each step in a derivation must be of the form:

wXz ⇒ wλz

for w, z ∈ (V ∪ Σ)*, where X → λ is a production in R; note that w and z may be empty (ε denotes the empty string).


    Leftmost Derivations

A leftmost derivation is one in which, at each step, the leftmost nonterminal in the current sentential form is the one which is rewritten:

S ⇒ abXdYZ ⇒ abxxxdYZ ⇒ abxxxdyyyZ ⇒ abxxxdyyyzzz

For many applications, it is not necessary to restrict one's attention to only the leftmost parses. In that case, there may exist multiple derivations which can produce the same exact string.

However, when we get to stochastic CFGs, it will be convenient to assume that only leftmost derivations are valid. This will simplify probability computations, since we don't have to model the process of stochastically choosing a nonterminal to rewrite. Note that doing this does not reduce the representational power of the CFG in any way; it just makes it easier to work with.


    Context-freeness

The context-freeness of context-free grammars is imposed by the requirement that the l.h.s. of each production rule may contain only a single symbol, and that symbol must be a nonterminal:

X → λ

for X ∈ V, λ ∈ (V ∪ Σ)*. That is, X is a nonterminal and λ is any (possibly empty) string of terminals and/or nonterminals. Thus, a CFG cannot specify context-sensitive rules such as:

wXz → wλz

which states that nonterminal X can be rewritten as λ only when X occurs in the local context wXz in a sentential form. Such productions are possible in context-sensitive grammars (CSGs).


Context-free Versus Regular

The advantage of CFGs over HMMs lies in their ability to model arbitrary runs of matching pairs of palindromic elements, such as nested pairs of parentheses:

(((((((())))))))

where each opening parenthesis must have exactly one matching closing parenthesis on the right. When the number of nested pairs is unbounded (i.e., a matching close parenthesis can be arbitrarily far away from its open parenthesis), a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.

In contrast, the modeling of nested pairs of elements can be readily achieved in a CFG using rules such as X → (X). A sample derivation using such a rule is:

X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ((((X)))) ⇒ (((((X)))))

An additional rule such as X → ε is necessary to terminate the recursion.


    CFGs and Pushdown Automata

Unlike finite-state automata (and HMMs), which have only a finite amount of memory, a machine for recognizing context-free languages needs an unbounded amount of memory (in general).

One such machine is the pushdown automaton (PDA), which consists of a (nondeterministic) finite automaton plus a stack:

[figure: a finite-state automaton coupled to an unbounded stack]

The unbounded memory provided by the stack allows the PDA to recognize nested structures of arbitrary depth, whereas an HMM has no such capability since its only memory is its finite set of states. That is, CFGs relax the Markov assumption.

In practice, a recognizer for a CFL needn't have an infinite stack, but does require a dynamic programming matrix of size O(L²), for L the length of the sequence to be parsed. We will see such an algorithm later.


    Pushdown Automaton Example

the grammar: S → 1S1 | 0S0 | ε        the input: 110011

time   action     stack (top at left)   remaining input
t=0    push S     S                     110011
t=1    S → 1S1    1S1                   110011
t=2    match 1    S1                    10011
t=3    S → 1S1    1S11                  10011
t=4    match 1    S11                   0011
t=5    S → 0S0    0S011                 0011
t=6    match 0    S011                  011
t=7    S → ε      011                   011
t=8    match 0    11                    11
t=9    match 1    1                     1
t=10   match 1    (empty)               (empty): accept

Note that in principle this procedure is nondeterministic, so that back-tracking may be necessary in practice!

Q: why didn't the PDA use S → 0S0 here (at t=7)?
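A backtracking recognizer makes the nondeterminism explicit. Below is a minimal Python sketch (my own, not from the slides) that simulates this PDA, trying each production for S and backtracking on failure:

def accepts(inp, stack=("S",), pos=0):
    # empty stack: accept iff the entire input has been consumed
    if not stack:
        return pos == len(inp)
    top, rest = stack[-1], stack[:-1]      # top of stack is the last element
    if top == "S":
        # nondeterministic choice: try each production for S, backtracking
        # on failure; push the rhs reversed so its first symbol is on top
        return (accepts(inp, rest + tuple("1S1")[::-1], pos) or   # S -> 1S1
                accepts(inp, rest + tuple("0S0")[::-1], pos) or   # S -> 0S0
                accepts(inp, rest, pos))                          # S -> epsilon
    # top is a terminal: it must match the next input symbol
    return pos < len(inp) and inp[pos] == top and accepts(inp, rest, pos + 1)

print(accepts("110011"))   # True
print(accepts("110010"))   # False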


    Limitations of CFGs

One thing that CFGs can't model is the matching of arbitrary runs of matching elements in the same direction (i.e., not palindromic):

......abcdefg.......abcdefg.....

In other words, languages of the form:

w x w

for strings w and x of arbitrary length, cannot be modeled using a CFG.

More relevant to ncRNA prediction is the case of pseudoknots, which also cannot be recognized using standard CFGs:

....abcde....rstuv.....edcba.....vutsr....

The problem is that the matching palindromes (and the regions separating them) are of arbitrary length.

Q: why isn't this very relevant to RNA structure prediction? Hint: think of the directionality of paired strands.


    Stochastic CFGs (SCFGs)

A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:

G = (V, Σ, S, R, Pp)

where Pp : R → [0,1], and probabilities are normalized at the level of each l.h.s. symbol X:

∀X∈V [ Σλ Pp(X → λ) = 1 ]

Thus, we can compute the probability of a single derivation S ⇒* x by multiplying the probabilities of all productions used in the derivation:

Πi P(Xi → λi)

We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:

P(x | G) = Σj P(S ⇒*j x | G)

where j ranges over the distinct leftmost derivations of x.


    A Simple Example

As an example, consider G = (VG, Σ, S, RG, PG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L    (P = 0.2 each)
L → N N N N                      (P = 1.0)
N → a | c | g | t                (P = 0.25 each)

Then the probability of the sequence acgtacgtacgt is given by:

P(acgtacgtacgt) =
P( S ⇒ aSt ⇒ acSgt ⇒ acgScgt ⇒ acgtSacgt ⇒ acgtLacgt ⇒ acgtNNNNacgt ⇒ acgtaNNNacgt ⇒ acgtacNNacgt ⇒ acgtacgNacgt ⇒ acgtacgtacgt ) =
0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25 × 10^-6

because this sequence has only one possible leftmost derivation under grammar G.
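For a toy grammar like this one, P(x|G) can even be computed by brute force. Here is a minimal Python sketch (my own encoding, not from the slides) that sums the probabilities of all leftmost derivations consistent with x:

PROBS = {
    "S": [("aSt", 0.2), ("tSa", 0.2), ("cSg", 0.2), ("gSc", 0.2), ("L", 0.2)],
    "L": [("NNNN", 1.0)],
    "N": [("a", 0.25), ("c", 0.25), ("g", 0.25), ("t", 0.25)],
}

def prob(form, x):
    """Sum P over all leftmost derivations of x from sentential form `form`."""
    idx = next((i for i, s in enumerate(form) if s in PROBS), None)
    if idx is None:                      # all terminals: a complete derivation
        return 1.0 if form == x else 0.0
    if len(form) > len(x) or form[:idx] != x[:idx]:
        return 0.0                       # prune: form can no longer derive x
    return sum(p * prob(form[:idx] + rhs + form[idx + 1:], x)
               for rhs, p in PROBS[form[idx]])

print(prob("S", "acgtacgtacgt"))         # 1.25e-06

Summing over derivations this way is exponential in general; the Inside algorithm described later computes the same quantity in polynomial time.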


    Chomsky Normal Form (CNF)

Any CFG which does not derive the empty string (i.e., ε ∉ L(G)) can be converted into an equivalent grammar in Chomsky Normal Form (CNF). A CNF grammar is one in which all productions are of the form:

X → YZ

or:

X → a

for nonterminals X, Y, Z, and terminal a.

Transforming a CFG into CNF can be accomplished by appropriately-ordered application of the following operations:

* eliminating useless symbols (nonterminals that only derive ε)
* eliminating null productions (X → ε)
* eliminating unit productions (X → Y)
* factoring long r.h.s. expressions (A → abc factored into A → aB, B → bC, C → c)
* factoring terminals (A → cB is factored into A → CB, C → c)

(see, e.g., Hopcroft & Ullman, 1979).


CNF Example

Non-CNF:

S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

CNF:

S → A S_T | T S_A | C S_G | G S_C | N L_1
S_A → S A
S_T → S T
S_C → S C
S_G → S G
L_1 → N L_2
L_2 → N N
N → a | c | g | t
A → a
C → c
G → g
T → t

Disadvantages of CNF: (1) more nonterminals & productions, (2) a more convoluted relation to the problem domain (can be important when implementing posterior decoding).

Advantages: (1) easy implementation of parsers.


    The Parsing Problem

Two questions for a CFG:

1) Can a grammar G derive string x?
2) If so, what series of productions would be used during the derivation? (there may be multiple answers!)

Additional questions for an SCFG:

1) What is the probability that G derives string x? (likelihood)
2) What is the most probable derivation of x via G?


    The CYK Parsing Algorithm

[figure: the triangular CYK DP matrix for the input actagctatctagcttacggtaatcgcatcgcgc, showing cells (i,k), (k+1,j), and (i,j), and the corner cell (0, n-1)]

Cell (i,j) contains all the nonterminals X which can derive the entire subsequence xi...xj. Cell (i,k) contains only those nonterminals which can derive the left portion of that subsequence, and cell (k+1,j) contains only those which can derive the right portion.

initialization: X → xi (the diagonal)
inductive step: A → BC (for all A, B, C, and k)
termination: is S ∈ D(0, n-1)?


The CYK Parsing Algorithm (CFGs)

Given a grammar G = (V, Σ, S, R) in CNF, we initialize a DP matrix D such that:

D(i,i) = { X | (X → xi) ∈ R }, for 0 ≤ i ≤ n-1,

and then fill in successively longer intervals using the inductive rule:

D(i,j) = { A | (A → BC) ∈ R, B ∈ D(i,k), C ∈ D(k+1,j), for some k, i ≤ k < j }.

The input x is in L(G) iff S ∈ D(0, n-1).
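A minimal Python sketch of this recognizer (the grammar encoding via the unary/binary sets is an assumption of this sketch, not the slides' notation):

def cyk(x, unary, binary, start="S"):
    """unary:  set of (X, a) pairs, one per production X -> a
       binary: set of (A, B, C) triples, one per production A -> B C"""
    n = len(x)
    D = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):                       # initialization: the diagonal
        D[i][i] = {X for (X, a) in unary if a == x[i]}
    for span in range(2, n + 1):             # successively longer intervals
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):            # all split points k
                D[i][j] |= {A for (A, B, C) in binary
                            if B in D[i][k] and C in D[k + 1][j]}
    return start in D[0][n - 1]              # termination test

# The CNF grammar from the CNF example slide:
unary = {("N", "a"), ("N", "c"), ("N", "g"), ("N", "t"),
         ("A", "a"), ("C", "c"), ("G", "g"), ("T", "t")}
binary = {("S", "A", "S_T"), ("S", "T", "S_A"), ("S", "C", "S_G"),
          ("S", "G", "S_C"), ("S", "N", "L_1"),
          ("S_A", "S", "A"), ("S_T", "S", "T"),
          ("S_C", "S", "C"), ("S_G", "S", "G"),
          ("L_1", "N", "L_2"), ("L_2", "N", "N")}
print(cyk("acgtacgtacgt", unary, binary))    # True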


    Relation to PDAs

If this is equivalent to a pushdown automaton, then where is the pushdown stack? And where are the states of the automaton?

[figure: the PDA's automaton-plus-stack diagram set beside the CYK DP matrix, with cells (i,k), (k+1,j), (i,j) and the corner cell (0, n-1)]


    Relation to PDAs

The stack exists implicitly as a jagged slice across the DP matrix. Each such slice represents a possible configuration of the stack at some point in time during the operation of a PDA. CYK effectively checks whether there exists a series of stack configurations (plus discarded input prefixes) corresponding to a valid derivation of the input string.

Under CYK there is implicitly a single state q1, in addition to the silent start/stop state q0. Each collapse of a production during DP corresponds to the self-transition q1 → q1. At the end of the algorithm, if the answer is ACCEPT then the transition q1 → q0 : 1 (terminate and emit a 1) is taken; if the answer is REJECT then the transition q1 → q0 : 0 (terminate and emit a 0) is taken.

[figure: the two-state automaton q0, q1, with transitions labeled (stack, input : emission), including $,ε:1; $,x:0; x,ε:0; $,S:ε; x,y:ε]


    Modified CYK for SCFGs (Inside Algorithm)

CYK can be easily modified to compute the probability of a string. We associate a probability with each cell in D(i,j), as follows:

1) For each nonterminal A, we multiply the probabilities associated with B and C when applying the production A → BC (and also multiply by the probability attached to the production itself).

2) We sum the probabilities associated with different productions for A and different values of k.

The probability of the input string is then given by the probability associated with the start symbol S in cell (0, n-1).

If we instead want the single highest-scoring parse, we can simply perform an argmax operation rather than the sums in step #2.


    The Inside Algorithm

Recall that for the forward algorithm we defined a forward variable f(i,j). Similarly, for the inside algorithm we define an inside variable α(i,j,X):

α(i,j,X) = P(X ⇒* xi...xj | X)

which denotes the probability that nonterminal X will derive subsequence xi...xj. Computing this variable for all integers i and j and all nonterminals X constitutes the inside algorithm:

for i=0 up to L-1 do
  foreach nonterminal X do
    α(i,i,X) = P(X → xi);
for i=L-2 down to 0 do
  for j=i+1 up to L-1 do
    foreach nonterminal X do
      α(i,j,X) = Σ_{Y,Z} Σ_{k=i..j-1} P(X → YZ) α(i,k,Y) α(k+1,j,Z);

Note that P(X → YZ) = 0 if X → YZ is not a valid production in the grammar.

The probability P(x|G) of the full input sequence x of length L can then be found in the final cell of the matrix: α(0, L-1, S) (the corner cell). Reconstructing the most probable derivation (parse) can be done by modifying this algorithm to (1) compute maxes instead of sums, and (2) keep traceback pointers as in Viterbi.

time = O(L³M³)

Q: why did we assume the grammar was in CNF?

[figure: the triangular DP matrix, with Y spanning (i,k), Z spanning (k+1,j), and X spanning (i,j)]
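A minimal Python sketch of the inside algorithm for a CNF SCFG (the probability-table encoding via unary_p/binary_p is an assumption of this sketch):

from collections import defaultdict

def inside(x, unary_p, binary_p, nonterminals, start="S"):
    """unary_p[(X, a)] = P(X -> a);  binary_p[(X, Y, Z)] = P(X -> Y Z)."""
    L = len(x)
    alpha = defaultdict(float)                # alpha(i,j,X); missing = 0.0
    for i in range(L):                        # the diagonal: alpha(i,i,X)
        for X in nonterminals:
            alpha[(i, i, X)] = unary_p.get((X, x[i]), 0.0)
    for span in range(2, L + 1):              # longer intervals, bottom up
        for i in range(L - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary_p.items():
                alpha[(i, j, X)] += p * sum(
                    alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
                    for k in range(i, j))
    return alpha                              # P(x|G) = alpha[(0, L-1, start)]

P(x|G) is then alpha[(0, len(x)-1, "S")]; replacing the sum with a max (plus traceback pointers) yields the most probable parse, as noted above.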


    Training an SCFG

Two common methods for training an SCFG:

1) If parses are known for the training sequences, we can simply count the number of times each production occurs in the training parses and normalize these counts into probabilities. This is analogous to labeled sequence training of an HMM (i.e., when each symbol in a training sequence is labeled with an HMM state).

2) If parses are NOT known for the training sequences, we can use an EM algorithm similar to the Forward-Backward (Baum-Welch) algorithm for HMMs. The EM algorithm for SCFGs is called Inside-Outside.
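A minimal sketch of method 1 in Python (the parse representation, a list of (lhs, rhs) production applications, is my own assumption):

from collections import Counter, defaultdict

def train_from_parses(parses):
    """Count production uses across parses; normalize per l.h.s. symbol."""
    counts = Counter(prod for parse in parses for prod in parse)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

# one toy training parse: the derivation of acgtacgtacgt seen earlier
parse = [("S", "aSt"), ("S", "cSg"), ("S", "gSc"), ("S", "tSa"), ("S", "L"),
         ("L", "NNNN"), ("N", "a"), ("N", "c"), ("N", "g"), ("N", "t")]
print(train_from_parses([parse])[("S", "L")])   # 0.2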


    Recall: Forward-Backward

[figure: an HMM's forward and backward variables split the sequence at a single point (FORWARD | BACKWARD), whereas an SCFG's inside and outside variables split it into an inside interval and the two flanking outside regions (OUTSIDE | INSIDE | OUTSIDE)]

Inside-Outside uses a similar trick to estimate the expected number of times each production is used.

forward/backward: time = O(LN²) each; inside/outside: time = O(L³M³) each.


    The Outside Algorithm

For the outside algorithm we define an outside variable β(i,j,Y):

β(i,j,Y) = P( S ⇒* x0..xi-1 Y xj+1..xL-1 )

which denotes the probability that the start symbol S will derive the sentential form x0..xi-1 Y xj+1..xL-1 (i.e., that S will derive some string having prefix x0...xi-1 and suffix xj+1...xL-1, and that the region between will be derived through nonterminal Y).

β(0,L-1,S) = 1;
foreach X ≠ S do β(0,L-1,X) = 0;
for i=0 up to L-1 do
  for j=L-1 down to i do
    foreach nonterminal X do
      if β(i,j,X) undefined then
        β(i,j,X) = Σ_{Y,Z} Σ_{k=j+1..L-1} P(Y → XZ) α(j+1,k,Z) β(i,k,Y)
                 + Σ_{Y,Z} Σ_{k=0..i-1} P(Y → ZX) α(k,i-1,Z) β(k,j,Y);

time = O(L³M³)

[figure: S derives the outside region, with Y spanning (i,k) and splitting into X over (i,j) and Z over (j+1,k)]
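A minimal Python sketch of the outside algorithm (same assumed grammar encoding as the inside sketch; `alpha` is the table it returns):

from collections import defaultdict

def outside(x, binary_p, alpha, start="S"):
    L = len(x)
    beta = defaultdict(float)                 # beta(i,j,Y); missing = 0.0
    beta[(0, L - 1, start)] = 1.0             # base case: S spans everything
    for span in range(L - 1, 0, -1):          # widest intervals first
        for i in range(L - span + 1):
            j = i + span - 1
            for (Y, B, C), p in binary_p.items():
                # X = B: parent Y spans (i,k), sibling Z = C spans (j+1,k)
                for k in range(j + 1, L):
                    beta[(i, j, B)] += p * alpha[(j + 1, k, C)] * beta[(i, k, Y)]
                # X = C: parent Y spans (k,j), sibling Z = B spans (k,i-1)
                for k in range(0, i):
                    beta[(i, j, C)] += p * alpha[(k, i - 1, B)] * beta[(k, j, Y)]
    return beta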


    Inside-Outside Parameter Estimation

EM update equations (see Durbin et al., 1998, section 9.6):

Pnew(X → YZ) = E(X → YZ | x) / E(X | x)

  = [ Σ_{i=0..L-2} Σ_{j=i+1..L-1} Σ_{k=i..j-1} β(i,j,X) P(X → YZ) α(i,k,Y) α(k+1,j,Z) ]
    / [ Σ_{i=0..L-1} Σ_{j=i..L-1} α(i,j,X) β(i,j,X) ]

Pnew(X → a) = E(X → a | x) / E(X | x)

  = [ Σ_{i=0..L-1} δ(xi,a) β(i,i,X) P(X → a) ]
    / [ Σ_{i=0..L-1} Σ_{j=i..L-1} α(i,j,X) β(i,j,X) ]

where δ(xi,a) is the Kronecker delta (1 if xi = a, else 0).
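A minimal Python sketch of the binary-production update, reusing the alpha and beta tables from the sketches above (the encoding is my assumption):

def em_update_binary(x, binary_p, alpha, beta):
    L = len(x)
    expected = {}                             # E(X -> YZ | x), unnormalized
    for (X, Y, Z), p in binary_p.items():
        expected[(X, Y, Z)] = sum(
            beta[(i, j, X)] * p * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
            for i in range(L - 1)
            for j in range(i + 1, L)
            for k in range(i, j))
    denom = {}                                # E(X | x), per l.h.s. symbol
    for (X, _, _) in binary_p:
        denom[X] = sum(alpha[(i, j, X)] * beta[(i, j, X)]
                       for i in range(L) for j in range(i, L))
    return {prod: (e / denom[prod[0]] if denom[prod[0]] > 0 else 0.0)
            for prod, e in expected.items()}

Note that the common 1/P(x|G) factors in numerator and denominator cancel, which is why they do not appear in the update equations above.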


    Posterior Decoding for SCFGs

What is the probability that nonterminal X generates the subsequence xi...xj via production X → YZ, with Y generating xi...xk and Z generating xk+1...xj?

P(Xi,j → Yi,k Zk+1,j | x) = β(i,j,X) P(X → YZ) α(i,k,Y) α(k+1,j,Z) / α(0,L-1,S)

What is the probability that nonterminal X generates xi (in some particular sequence)?

P(X → xi | x) = β(i,i,X) P(X → xi) / α(0,L-1,S)

What is the probability that a structural feature of type F will occupy sequence positions i through j?

P(Fi,j | x) = (some product of α's and β's, and other things) / α(0,L-1,S)


An SCFG for Simple Stem-loop Structures

S → aSt (15%) | tSa (15%) | cSg (25%) | gSc (25%) | gSt (5%) | tSg (5%) | L (10%)
    (generate paired bases in the stem)

L → N L (75%) | N N N N (25%)
    (generate a loop of length ≥ 4)

N → a (20%) | c (20%) | g (30%) | t (30%)
    (generate each nucleotide in the loop)
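A minimal Python sketch (my encoding, with the slide's percentages paired to productions as listed above) that samples stem-loop sequences from this SCFG:

import random

STEMLOOP = {
    "S": [("aSt", .15), ("tSa", .15), ("cSg", .25), ("gSc", .25),
          ("gSt", .05), ("tSg", .05), ("L", .10)],
    "L": [("NL", .75), ("NNNN", .25)],
    "N": [("a", .20), ("c", .20), ("g", .30), ("t", .30)],
}

def sample(grammar, start="S"):
    """Random leftmost derivation, choosing productions by probability."""
    form = start
    while True:
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            return form                          # all terminals: finished
        rhss, weights = zip(*grammar[form[idx]])
        rhs = random.choices(rhss, weights)[0]   # weighted production choice
        form = form[:idx] + rhs + form[idx + 1:]

print(sample(STEMLOOP))   # e.g. a random stem, loop, and complementary stem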


    Sliding Windows to Find ncRNA Genes

Given a grammar G describing ncRNA structures and an input sequence Z, we can slide a window of length L across the sequence, computing the probability P(Zi,i+L-1 | G) that the subsequence Zi,i+L-1 falling within the current window could have been generated by grammar G.

This is the Bayesian inverse of P(G | Zi,i+L-1), the probability that this subsequence would (if transcribed into ssRNA) fold into a shape characteristic of the modeled class of ncRNAs.

Using a likelihood ratio:

R = P(Zi,i+L-1 | G) / P(Zi,i+L-1 | background),

we can impose the rule that any subsequence having a score R >> 1 is likely to contain an ncRNA gene (where the background model is typically a 0th-order Markov chain).

[figure: a sliding window moving across the sequence atcgatcgtatcgtacgatcttctctatcgcgcgattcatctgctatcattatatctattatttcaaggcattcag; for the current window, R = 0.99537, summing over all possible secondary structures under the grammar]
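A minimal sketch of the window scan in Python, reusing the inside() sketch from above (the helper names and the background-frequency table are my assumptions):

import math

def background_logprob(seq, freqs):
    """0th-order Markov chain: independent base frequencies."""
    return sum(math.log(freqs[c]) for c in seq)

def scan(Z, window, unary_p, binary_p, nonterminals, freqs):
    """Yield (position, log R) for each window of the input sequence Z."""
    for i in range(len(Z) - window + 1):
        w = Z[i:i + window]
        alpha = inside(w, unary_p, binary_p, nonterminals)
        p_g = alpha[(0, window - 1, "S")]        # P(w | G)
        if p_g > 0.0:
            log_R = math.log(p_g) - background_logprob(w, freqs)
            yield i, log_R    # windows with log R >> 0 are ncRNA candidates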


    What about Pseudoknots?

    Among the most prevalent RNA structures is a motif known as the pseudoknot. (Staple & Butcher, 2005)



    A Grammar for Simple Pseudoknots

L(G) = { x y x^r y^r | x ∈ Σ*, y ∈ Σ* }   (where w^r denotes the reverse complement of w)

S → L X R
X → aXt | tXa | cXg | gXc | Y          (generate x and x^r)
Y → aAY | cCY | gGY | tTY | ε          (generate y and an encoded 2nd copy)
Aa → aA,  Ac → cA,  Ag → gA,  At → tA
Ca → aC,  Cc → cC,  Cg → gC,  Ct → tC
Ga → aG,  Gc → cG,  Gg → gG,  Gt → tG
Ta → aT,  Tc → cT,  Tg → gT,  Tt → tT  (propagate encoded copy of y to end of sequence)
AR → tR,  CR → gR,  GR → cR,  TR → aR  (reverse-complement second y at end of sequence)
La → aL,  Lc → cL,  Lg → gL,  Lt → tL,  LR → ε   (erase extra markers)

Q: what type of grammar is this?


    Implementing Zuker in an SCFG

(Rivas & Eddy, 2000)

[figure: production templates covering the Zuker-style cases: i is unpaired; j is unpaired; i pairs with j; i+1 pairs with j; i pairs with j-1; i+1 pairs with j-1 (these allow incorporation of stacking energies); serial structures; loops, bulges, and multiloops]




    Attribute grammars & Generalized SCFGs

What is an attribute-SCFG?

An attribute-SCFG (ASCFG) is an SCFG in which each nonterminal X has associated with it a set of attributes AX = {x0, x1, ..., xn}, and each production has associated with it a set of attribute update equations, by which the attributes of the nonterminals on the left-hand side of the production can be updated from the values of the attributes of the nonterminals on the right-hand side. For example:

X → YZ   { x0 = y3 + 3z1;  x1 = 6z0 }

Attributes can be computed using these equations during the Inside algorithm, in addition to (or in place of) the usual probabilistic DP equations.

We can also think of this as being something of a Generalized SCFG (as in Generalized HMMs), since they can encapsulate arbitrary additional computations within each state (i.e., nonterminal).


Batzoglou's GSCFG (CONTRAfold)

(Do et al., 2006)

Attribute updates allow them to perform additional computation during Inside, so as to provide more fine-grained modeling of the structure.

Much like in a GHMM, they can incorporate explicit length scores, etc.


Discriminative Training of GSCFGs


Summary

* An SCFG is a generative model utilizing production rules to generate strings.
* The probability of a string S being generated by an SCFG G can be computed using the Inside algorithm.
* Given a set of productions for an SCFG, the parameters can be estimated using the Inside-Outside (EM) algorithm.
* SCFGs are more powerful than HMMs because they can model arbitrary runs of paired nested elements, such as base pairings in a stem-loop structure, though they can't model pseudoknots.
* Thermodynamic folding algorithms can be simulated in an SCFG.
* SCFGs can be generalized via explicit attribute equations, and they can be discriminatively trained.