
Stochastic Context-Free Grammars
for noncoding RNA gene prediction

CBB 231 / COMPSCI 261


    Formal Languages

A formal language is simply a set of strings (i.e., sequences). That set may be infinite.

Let M be a model denoting a language. If M is a generative model, such as an HMM or a grammar, then L(M) denotes the language generated by M. If M is an acceptor model, such as a finite automaton, then L(M) denotes the language accepted by M.

When all the parameters of a stochastic generative model are known, we can ask:

What is the probability that model M will generate string S?

which we denote:

P(S|M)


    Recall: The Chomsky Hierarchy

Examples:

* gene-finding assumes DNA is regular
* secondary structure prediction assumes RNA is context-free
* RNA pseudoknots are context-sensitive

The hierarchy pairs each class of languages with the machines that recognize it:

regular languages: HMMs / regular expressions
context-free languages: SCFGs / PDAs
context-sensitive languages: linear-bounded TMs
recursive languages: halting TMs
recursively enumerable languages: Turing machines
all languages

* each class is a subset of the next higher class in the hierarchy


    Context-free Grammars (CFGs)

A context-free grammar is a generative model denoted by a 4-tuple:

G = (V, Σ, S, R)

where:

Σ is a terminal alphabet (e.g., {a, c, g, t}),
V is a nonterminal alphabet (e.g., {A, B, C, D, E, ...}),
S ∈ V is a special start symbol, and
R is a set of rewriting rules called productions.

Productions in R are rules of the form:

X → λ

for X ∈ V, λ ∈ (V ∪ Σ)*; such a production denotes that the nonterminal symbol X may be rewritten as the expression λ, which may consist of zero or more terminals and nonterminals.


    A Simple Example

As an example, consider G = (VG, Σ, S, RG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

One possible derivation using this grammar is:

S ⇒ aSt            (S → aSt)
  ⇒ acSgt          (S → cSg)
  ⇒ acgScgt        (S → gSc)
  ⇒ acgtSacgt      (S → tSa)
  ⇒ acgtLacgt      (S → L)
  ⇒ acgtNNNNacgt   (L → N N N N)
  ⇒ acgtaNNNacgt   (N → a)
  ⇒ acgtacNNacgt   (N → c)
  ⇒ acgtacgNacgt   (N → g)
  ⇒ acgtacgtacgt   (N → t)
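To make the mechanics concrete, here is a minimal Python sketch (my own encoding, not from the slides) that stores this grammar as a dict and performs a random leftmost derivation:

import random

# The example grammar RG as a dict mapping each nonterminal to its
# alternative right-hand sides. Uppercase letters are nonterminals;
# lowercase letters are terminals.
GRAMMAR = {
    "S": ["aSt", "tSa", "cSg", "gSc", "L"],
    "L": ["NNNN"],
    "N": ["a", "c", "g", "t"],
}

def derive(grammar, start="S", max_steps=100):
    """Perform a random leftmost derivation; return the terminal string."""
    form = start
    for _ in range(max_steps):
        # find the leftmost nonterminal in the current sentential form
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:                  # no nonterminals left: we are done
            return form
        rhs = random.choice(grammar[form[idx]])    # pick a production
        form = form[:idx] + rhs + form[idx + 1:]   # rewrite the nonterminal
    raise RuntimeError("derivation did not terminate")

print(derive(GRAMMAR))   # e.g. 'acgtacgtacgt'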


    Derivations

Suppose a CFG G has generated a terminal string x ∈ Σ*. A derivation denotes a single way by which G may have generated x. For a grammar G and a string x, there may exist multiple distinct derivations.

A derivation (or parse) consists of a series of applications of productions from R, beginning with the start symbol S and ending with the terminal string x:

S ⇒ s1 ⇒ s2 ⇒ s3 ⇒ ... ⇒ x

We can denote this more compactly as: S ⇒* x. Each string si in a derivation is called a sentential form, and may consist of both terminal and nonterminal symbols: si ∈ (V ∪ Σ)*. Each step in a derivation must be of the form:

wXz ⇒ wλz

for w, z ∈ (V ∪ Σ)*, where X → λ is a production in R; note that w and z may be empty (ε denotes the empty string).


    Leftmost Derivations

A leftmost derivation is one in which, at each step, the leftmost nonterminal in the current sentential form is the one which is rewritten:

S ⇒ abXdYZ ⇒ abxxxdYZ ⇒ abxxxdyyyZ ⇒ abxxxdyyyzzz

For many applications, it is not necessary to restrict one's attention to only the leftmost parses. In that case, there may exist multiple derivations which can produce the same exact string.

However, when we get to stochastic CFGs, it will be convenient to assume that only leftmost derivations are valid. This will simplify probability computations, since we don't have to model the process of stochastically choosing a nonterminal to rewrite. Note that doing this does not reduce the representational power of the CFG in any way; it just makes it easier to work with.


    Context-freeness

The context-freeness of context-free grammars is imposed by the requirement that the l.h.s. of each production rule may contain only a single symbol, and that symbol must be a nonterminal:

X → λ

for X ∈ V, λ ∈ (V ∪ Σ)*. That is, X is a nonterminal and λ is any (possibly empty) string of terminals and/or nonterminals. Thus, a CFG cannot specify context-sensitive rules such as:

wXz → wλz

which states that nonterminal X can be rewritten as λ only when X occurs in the local context wXz in a sentential form. Such productions are possible in context-sensitive grammars (CSGs).


Context-free Versus Regular

The advantage of CFGs over HMMs lies in their ability to model arbitrary runs of matching pairs of palindromic elements, such as nested pairs of parentheses:

(((((((())))))))

where each opening parenthesis must have exactly one matching closing parenthesis on the right. When the number of nested pairs is unbounded (i.e., a matching close parenthesis can be arbitrarily far away from its open parenthesis), a finite-state model such as a DFA or an HMM is inadequate to enforce the constraint that all left elements must have a matching right element.

In contrast, the modeling of nested pairs of elements can be readily achieved in a CFG using rules such as X → (X). A sample derivation using such a rule is:

X ⇒ (X) ⇒ ((X)) ⇒ (((X))) ⇒ ((((X)))) ⇒ (((((X)))))

An additional rule such as X → ε is necessary to terminate the recursion.


    CFGs and Pushdown Automata

Unlike finite-state automata (and HMMs), which have only a finite amount of memory, a machine for recognizing context-free languages needs an unbounded amount of memory (in general).

One such machine is the pushdown automaton (PDA), which consists of a (nondeterministic) finite automaton plus a stack:

[figure: a finite-state automaton coupled to an unbounded stack]

The unbounded memory provided by the stack allows the PDA to recognize nested structures of arbitrary depth, whereas an HMM has no such capability since its only memory is its finite set of states. That is, CFGs relax the Markov assumption.

In practice, a recognizer for a CFL needn't have an infinite stack, but does require a dynamic programming matrix of size O(L²), for L the length of the sequence to be parsed. We will see such an algorithm later.


    Pushdown Automaton Example

the grammar: S → 1S1 | 0S0 | ε        the input: 110011

time   action     stack (top at left)   remaining input
t=0    push S     S                     110011
t=1    S → 1S1    1S1                   110011
t=2    match 1    S1                    10011
t=3    S → 1S1    1S11                  10011
t=4    match 1    S11                   0011
t=5    S → 0S0    0S011                 0011
t=6    match 0    S011                  011
t=7    S → ε      011                   011
t=8    match 0    11                    11
t=9    match 1    1                     1
t=10   match 1    (empty)               (empty): accept

Note that in principle this procedure is nondeterministic, so that back-tracking may be necessary in practice!

Q: why didn't the PDA use S → 0S0 here (at t=7)?
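A backtracking recognizer makes the nondeterminism explicit. Below is a minimal Python sketch (my own, not from the slides) that simulates this PDA, trying each production for S and backtracking on failure:

def accepts(inp, stack=("S",), pos=0):
    # empty stack: accept iff the entire input has been consumed
    if not stack:
        return pos == len(inp)
    top, rest = stack[-1], stack[:-1]      # top of stack is the last element
    if top == "S":
        # nondeterministic choice: try each production for S, backtracking
        # on failure; push the rhs reversed so its first symbol is on top
        return (accepts(inp, rest + tuple("1S1")[::-1], pos) or   # S -> 1S1
                accepts(inp, rest + tuple("0S0")[::-1], pos) or   # S -> 0S0
                accepts(inp, rest, pos))                          # S -> epsilon
    # top is a terminal: it must match the next input symbol
    return pos < len(inp) and inp[pos] == top and accepts(inp, rest, pos + 1)

print(accepts("110011"))   # True
print(accepts("110010"))   # False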


    Limitations of CFGs

One thing that CFGs can't model is the matching of arbitrary runs of matching elements in the same direction (i.e., not palindromic):

......abcdefg.......abcdefg.....

In other words, languages of the form:

w x w

for strings w and x of arbitrary length, cannot be modeled using a CFG.

More relevant to ncRNA prediction is the case of pseudoknots, which also cannot be recognized using standard CFGs:

....abcde....rstuv.....edcba.....vutsr....

The problem is that the matching palindromes (and the regions separating them) are of arbitrary length.

Q: why isn't this very relevant to RNA structure prediction? Hint: think of the directionality of paired strands.


    Stochastic CFGs (SCFGs)

A stochastic context-free grammar (SCFG) is a CFG plus a probability distribution on productions:

G = (V, Σ, S, R, Pp)

where Pp : R → [0,1], and probabilities are normalized at the level of each l.h.s. symbol X:

∀X∈V [ Σλ Pp(X → λ) = 1 ]

Thus, we can compute the probability of a single derivation S ⇒* x by multiplying the probabilities of all productions used in the derivation:

Πi P(Xi → λi)

We can sum over all possible (leftmost) derivations of a given string x to get the probability that G will generate x at random:

P(x | G) = Σj P(S ⇒*j x | G)

where j ranges over the distinct leftmost derivations of x.


    A Simple Example

As an example, consider G = (VG, Σ, S, RG, PG), for VG = {S, L, N}, Σ = {a, c, g, t}, and RG the set consisting of:

S → aSt | tSa | cSg | gSc | L    (P = 0.2 each)
L → N N N N                      (P = 1.0)
N → a | c | g | t                (P = 0.25 each)

Then the probability of the sequence acgtacgtacgt is given by:

P(acgtacgtacgt) =
P( S ⇒ aSt ⇒ acSgt ⇒ acgScgt ⇒ acgtSacgt ⇒ acgtLacgt ⇒ acgtNNNNacgt ⇒ acgtaNNNacgt ⇒ acgtacNNacgt ⇒ acgtacgNacgt ⇒ acgtacgtacgt ) =
0.2 × 0.2 × 0.2 × 0.2 × 0.2 × 1 × 0.25 × 0.25 × 0.25 × 0.25 = 1.25 × 10^-6

because this sequence has only one possible leftmost derivation under grammar G.
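For a toy grammar like this one, P(x|G) can even be computed by brute force. Here is a minimal Python sketch (my own encoding, not from the slides) that sums the probabilities of all leftmost derivations consistent with x:

PROBS = {
    "S": [("aSt", 0.2), ("tSa", 0.2), ("cSg", 0.2), ("gSc", 0.2), ("L", 0.2)],
    "L": [("NNNN", 1.0)],
    "N": [("a", 0.25), ("c", 0.25), ("g", 0.25), ("t", 0.25)],
}

def prob(form, x):
    """Sum P over all leftmost derivations of x from sentential form `form`."""
    idx = next((i for i, s in enumerate(form) if s in PROBS), None)
    if idx is None:                      # all terminals: a complete derivation
        return 1.0 if form == x else 0.0
    if len(form) > len(x) or form[:idx] != x[:idx]:
        return 0.0                       # prune: form can no longer derive x
    return sum(p * prob(form[:idx] + rhs + form[idx + 1:], x)
               for rhs, p in PROBS[form[idx]])

print(prob("S", "acgtacgtacgt"))         # 1.25e-06

Summing over derivations this way is exponential in general; the Inside algorithm described later computes the same quantity in polynomial time.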


    Chomsky Normal Form (CNF)

Any CFG which does not derive the empty string (i.e., ε ∉ L(G)) can be converted into an equivalent grammar in Chomsky Normal Form (CNF). A CNF grammar is one in which all productions are of the form:

X → YZ

or:

X → a

for nonterminals X, Y, Z, and terminal a.

Transforming a CFG into CNF can be accomplished by appropriately-ordered application of the following operations:

* eliminating useless symbols (nonterminals that only derive ε)
* eliminating null productions (X → ε)
* eliminating unit productions (X → Y)
* factoring long r.h.s. expressions (A → abc factored into A → aB, B → bC, C → c)
* factoring terminals (A → cB is factored into A → CB, C → c)

(see, e.g., Hopcroft & Ullman, 1979).


CNF Example

Non-CNF:

S → aSt | tSa | cSg | gSc | L
L → N N N N
N → a | c | g | t

CNF:

S → A S_T | T S_A | C S_G | G S_C | N L_1
S_A → S A
S_T → S T
S_C → S C
S_G → S G
L_1 → N L_2
L_2 → N N
N → a | c | g | t
A → a
C → c
G → g
T → t

Disadvantages of CNF: (1) more nonterminals & productions, (2) a more convoluted relation to the problem domain (can be important when implementing posterior decoding).

Advantages: (1) easy implementation of parsers.


    The Parsing Problem

Two questions for a CFG:

1) Can a grammar G derive string x?
2) If so, what series of productions would be used during the derivation? (there may be multiple answers!)

Additional questions for an SCFG:

1) What is the probability that G derives string x? (likelihood)
2) What is the most probable derivation of x via G?


    The CYK Parsing Algorithm

[figure: the triangular CYK DP matrix for the input actagctatctagcttacggtaatcgcatcgcgc, showing cells (i,k), (k+1,j), and (i,j), and the corner cell (0, n-1)]

Cell (i,j) contains all the nonterminals X which can derive the entire subsequence xi...xj. Cell (i,k) contains only those nonterminals which can derive the left portion of that subsequence, and cell (k+1,j) contains only those which can derive the right portion.

initialization: X → xi (the diagonal)
inductive step: A → BC (for all A, B, C, and k)
termination: is S ∈ D(0, n-1)?


The CYK Parsing Algorithm (CFGs)

Given a grammar G = (V, Σ, S, R) in CNF, we initialize a DP matrix D such that:

D(i,i) = { X | (X → xi) ∈ R }, for 0 ≤ i ≤ n-1,

and then fill in successively longer intervals using the inductive rule:

D(i,j) = { A | (A → BC) ∈ R, B ∈ D(i,k), C ∈ D(k+1,j), for some k, i ≤ k < j }.

The input x is in L(G) iff S ∈ D(0, n-1).
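A minimal Python sketch of this recognizer (the grammar encoding via the unary/binary sets is an assumption of this sketch, not the slides' notation):

def cyk(x, unary, binary, start="S"):
    """unary:  set of (X, a) pairs, one per production X -> a
       binary: set of (A, B, C) triples, one per production A -> B C"""
    n = len(x)
    D = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):                       # initialization: the diagonal
        D[i][i] = {X for (X, a) in unary if a == x[i]}
    for span in range(2, n + 1):             # successively longer intervals
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):            # all split points k
                D[i][j] |= {A for (A, B, C) in binary
                            if B in D[i][k] and C in D[k + 1][j]}
    return start in D[0][n - 1]              # termination test

# The CNF grammar from the CNF example slide:
unary = {("N", "a"), ("N", "c"), ("N", "g"), ("N", "t"),
         ("A", "a"), ("C", "c"), ("G", "g"), ("T", "t")}
binary = {("S", "A", "S_T"), ("S", "T", "S_A"), ("S", "C", "S_G"),
          ("S", "G", "S_C"), ("S", "N", "L_1"),
          ("S_A", "S", "A"), ("S_T", "S", "T"),
          ("S_C", "S", "C"), ("S_G", "S", "G"),
          ("L_1", "N", "L_2"), ("L_2", "N", "N")}
print(cyk("acgtacgtacgt", unary, binary))    # True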


    Relation to PDAs

If this is equivalent to a pushdown automaton, then where is the pushdown stack? And where are the states of the automaton?

[figure: the PDA's automaton-plus-stack diagram set beside the CYK DP matrix, with cells (i,k), (k+1,j), (i,j) and the corner cell (0, n-1)]


    Relation to PDAs

The stack exists implicitly as a jagged slice across the DP matrix. Each such slice represents a possible configuration of the stack at some point in time during the operation of a PDA. CYK effectively checks whether there exists a series of stack configurations (plus discarded input prefixes) corresponding to a valid derivation of the input string.

Under CYK there is implicitly a single state q1, in addition to the silent start/stop state q0. Each collapse of a production during DP corresponds to the self-transition q1 → q1. At the end of the algorithm, if the answer is ACCEPT then the transition q1 → q0 : 1 (terminate and emit a 1) is taken; if the answer is REJECT then the transition q1 → q0 : 0 (terminate and emit a 0) is taken.

[figure: the two-state automaton q0, q1, with transitions labeled (stack, input : emission), including $,ε:1; $,x:0; x,ε:0; $,S:ε; x,y:ε]


    Modified CYK for SCFGs (Inside Algorithm)

CYK can be easily modified to compute the probability of a string. We associate a probability with each cell in D(i,j), as follows:

1) For each nonterminal A, we multiply the probabilities associated with B and C when applying the production A → BC (and also multiply by the probability attached to the production itself).

2) We sum the probabilities associated with different productions for A and different values of k.

The probability of the input string is then given by the probability associated with the start symbol S in cell (0, n-1).

If we instead want the single highest-scoring parse, we can simply perform an argmax operation rather than the sums in step #2.


    The Inside Algorithm

Recall that for the forward algorithm we defined a forward variable f(i,j). Similarly, for the inside algorithm we define an inside variable α(i,j,X):

α(i,j,X) = P(X ⇒* xi...xj | X)

which denotes the probability that nonterminal X will derive subsequence xi...xj. Computing this variable for all integers i and j and all nonterminals X constitutes the inside algorithm:

for i=0 up to L-1 do
  foreach nonterminal X do
    α(i,i,X) = P(X → xi);
for i=L-2 down to 0 do
  for j=i+1 up to L-1 do
    foreach nonterminal X do
      α(i,j,X) = Σ_{Y,Z} Σ_{k=i..j-1} P(X → YZ) α(i,k,Y) α(k+1,j,Z);

Note that P(X → YZ) = 0 if X → YZ is not a valid production in the grammar.

The probability P(x|G) of the full input sequence x of length L can then be found in the final cell of the matrix: α(0, L-1, S) (the corner cell). Reconstructing the most probable derivation (parse) can be done by modifying this algorithm to (1) compute maxes instead of sums, and (2) keep traceback pointers as in Viterbi.

time = O(L³M³)

Q: why did we assume the grammar was in CNF?

[figure: the triangular DP matrix, with Y spanning (i,k), Z spanning (k+1,j), and X spanning (i,j)]
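A minimal Python sketch of the inside algorithm for a CNF SCFG (the probability-table encoding via unary_p/binary_p is an assumption of this sketch):

from collections import defaultdict

def inside(x, unary_p, binary_p, nonterminals, start="S"):
    """unary_p[(X, a)] = P(X -> a);  binary_p[(X, Y, Z)] = P(X -> Y Z)."""
    L = len(x)
    alpha = defaultdict(float)                # alpha(i,j,X); missing = 0.0
    for i in range(L):                        # the diagonal: alpha(i,i,X)
        for X in nonterminals:
            alpha[(i, i, X)] = unary_p.get((X, x[i]), 0.0)
    for span in range(2, L + 1):              # longer intervals, bottom up
        for i in range(L - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary_p.items():
                alpha[(i, j, X)] += p * sum(
                    alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
                    for k in range(i, j))
    return alpha                              # P(x|G) = alpha[(0, L-1, start)]

P(x|G) is then alpha[(0, len(x)-1, "S")]; replacing the sum with a max (plus traceback pointers) yields the most probable parse, as noted above.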


    Training an SCFG

Two common methods for training an SCFG:

1) If parses are known for the training sequences, we can simply count the number of times each production occurs in the training parses and normalize these counts into probabilities. This is analogous to labeled sequence training of an HMM (i.e., when each symbol in a training sequence is labeled with an HMM state).

2) If parses are NOT known for the training sequences, we can use an EM algorithm similar to the Forward-Backward (Baum-Welch) algorithm for HMMs. The EM algorithm for SCFGs is called Inside-Outside.
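A minimal sketch of method 1 in Python (the parse representation, a list of (lhs, rhs) production applications, is my own assumption):

from collections import Counter, defaultdict

def train_from_parses(parses):
    """Count production uses across parses; normalize per l.h.s. symbol."""
    counts = Counter(prod for parse in parses for prod in parse)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

# one toy training parse: the derivation of acgtacgtacgt seen earlier
parse = [("S", "aSt"), ("S", "cSg"), ("S", "gSc"), ("S", "tSa"), ("S", "L"),
         ("L", "NNNN"), ("N", "a"), ("N", "c"), ("N", "g"), ("N", "t")]
print(train_from_parses([parse])[("S", "L")])   # 0.2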


    Recall: Forward-Backward

[figure: an HMM's forward and backward variables split the sequence at a single point (FORWARD | BACKWARD), whereas an SCFG's inside and outside variables split it into an inside interval and the two flanking outside regions (OUTSIDE | INSIDE | OUTSIDE)]

Inside-Outside uses a similar trick to estimate the expected number of times each production is used.

forward/backward: time = O(LN²) each; inside/outside: time = O(L³M³) each.


    The Outside Algorithm

For the outside algorithm we define an outside variable β(i,j,Y):

β(i,j,Y) = P( S ⇒* x0..xi-1 Y xj+1..xL-1 )

which denotes the probability that the start symbol S will derive the sentential form x0..xi-1 Y xj+1..xL-1 (i.e., that S will derive some string having prefix x0...xi-1 and suffix xj+1...xL-1, and that the region between will be derived through nonterminal Y).

β(0,L-1,S) = 1;
foreach X ≠ S do β(0,L-1,X) = 0;
for i=0 up to L-1 do
  for j=L-1 down to i do
    foreach nonterminal X do
      if β(i,j,X) undefined then
        β(i,j,X) = Σ_{Y,Z} Σ_{k=j+1..L-1} P(Y → XZ) α(j+1,k,Z) β(i,k,Y)
                 + Σ_{Y,Z} Σ_{k=0..i-1} P(Y → ZX) α(k,i-1,Z) β(k,j,Y);

time = O(L³M³)

[figure: S derives the outside region, with Y spanning (i,k) and splitting into X over (i,j) and Z over (j+1,k)]
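A minimal Python sketch of the outside algorithm (same assumed grammar encoding as the inside sketch; `alpha` is the table it returns):

from collections import defaultdict

def outside(x, binary_p, alpha, start="S"):
    L = len(x)
    beta = defaultdict(float)                 # beta(i,j,Y); missing = 0.0
    beta[(0, L - 1, start)] = 1.0             # base case: S spans everything
    for span in range(L - 1, 0, -1):          # widest intervals first
        for i in range(L - span + 1):
            j = i + span - 1
            for (Y, B, C), p in binary_p.items():
                # X = B: parent Y spans (i,k), sibling Z = C spans (j+1,k)
                for k in range(j + 1, L):
                    beta[(i, j, B)] += p * alpha[(j + 1, k, C)] * beta[(i, k, Y)]
                # X = C: parent Y spans (k,j), sibling Z = B spans (k,i-1)
                for k in range(0, i):
                    beta[(i, j, C)] += p * alpha[(k, i - 1, B)] * beta[(k, j, Y)]
    return beta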


    Inside-Outside Parameter Estimation

EM update equations (see Durbin et al., 1998, section 9.6):

Pnew(X → YZ) = E(X → YZ | x) / E(X | x)

  = [ Σ_{i=0..L-2} Σ_{j=i+1..L-1} Σ_{k=i..j-1} β(i,j,X) P(X → YZ) α(i,k,Y) α(k+1,j,Z) ]
    / [ Σ_{i=0..L-1} Σ_{j=i..L-1} α(i,j,X) β(i,j,X) ]

Pnew(X → a) = E(X → a | x) / E(X | x)

  = [ Σ_{i=0..L-1} δ(xi,a) β(i,i,X) P(X → a) ]
    / [ Σ_{i=0..L-1} Σ_{j=i..L-1} α(i,j,X) β(i,j,X) ]

where δ(xi,a) is the Kronecker delta (1 if xi = a, else 0).
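A minimal Python sketch of the binary-production update, reusing the alpha and beta tables from the sketches above (the encoding is my assumption):

def em_update_binary(x, binary_p, alpha, beta):
    L = len(x)
    expected = {}                             # E(X -> YZ | x), unnormalized
    for (X, Y, Z), p in binary_p.items():
        expected[(X, Y, Z)] = sum(
            beta[(i, j, X)] * p * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
            for i in range(L - 1)
            for j in range(i + 1, L)
            for k in range(i, j))
    denom = {}                                # E(X | x), per l.h.s. symbol
    for (X, _, _) in binary_p:
        denom[X] = sum(alpha[(i, j, X)] * beta[(i, j, X)]
                       for i in range(L) for j in range(i, L))
    return {prod: (e / denom[prod[0]] if denom[prod[0]] > 0 else 0.0)
            for prod, e in expected.items()}

Note that the common 1/P(x|G) factors in numerator and denominator cancel, which is why they do not appear in the update equations above.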


    Posterior Decoding for SCFGs

What is the probability that nonterminal X generates the subsequence xi...xj via production X → YZ, with Y generating xi...xk and Z generating xk+1...xj?

P(Xi,j → Yi,k Zk+1,j | x) = β(i,j,X) P(X → YZ) α(i,k,Y) α(k+1,j,Z) / α(0,L-1,S)

What is the probability that nonterminal X generates xi (in some particular sequence)?

P(X → xi | x) = β(i,i,X) P(X → xi) / α(0,L-1,S)

What is the probability that a structural feature of type F will occupy sequence positions i through j?

P(Fi,j | x) = (some product of α's and β's, and other things) / α(0,L-1,S)


An SCFG for Simple Stem-loop Structures

S → aSt (15%) | tSa (15%) | cSg (25%) | gSc (25%) | gSt (5%) | tSg (5%) | L (10%)
    (generate paired bases in the stem)

L → N L (75%) | N N N N (25%)
    (generate a loop of length ≥ 4)

N → a (20%) | c (20%) | g (30%) | t (30%)
    (generate each nucleotide in the loop)
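A minimal Python sketch (my encoding, with the slide's percentages paired to productions as listed above) that samples stem-loop sequences from this SCFG:

import random

STEMLOOP = {
    "S": [("aSt", .15), ("tSa", .15), ("cSg", .25), ("gSc", .25),
          ("gSt", .05), ("tSg", .05), ("L", .10)],
    "L": [("NL", .75), ("NNNN", .25)],
    "N": [("a", .20), ("c", .20), ("g", .30), ("t", .30)],
}

def sample(grammar, start="S"):
    """Random leftmost derivation, choosing productions by probability."""
    form = start
    while True:
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            return form                          # all terminals: finished
        rhss, weights = zip(*grammar[form[idx]])
        rhs = random.choices(rhss, weights)[0]   # weighted production choice
        form = form[:idx] + rhs + form[idx + 1:]

print(sample(STEMLOOP))   # e.g. a random stem, loop, and complementary stem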


    Sliding Windows to Find ncRNA Genes

Given a grammar G describing ncRNA structures and an input sequence Z, we can slide a window of length L across the sequence, computing the probability P(Zi,i+L-1 | G) that the subsequence Zi,i+L-1 falling within the current window could have been generated by grammar G.

This is the Bayesian inverse of P(G | Zi,i+L-1), the probability that this subsequence would (if transcribed into ssRNA) fold into a shape characteristic of the modeled class of ncRNAs.

Using a likelihood ratio:

R = P(Zi,i+L-1 | G) / P(Zi,i+L-1 | background),

we can impose the rule that any subsequence having a score R >> 1 is likely to contain an ncRNA gene (where the background model is typically a 0th-order Markov chain).

[figure: a sliding window moving across the sequence atcgatcgtatcgtacgatcttctctatcgcgcgattcatctgctatcattatatctattatttcaaggcattcag; for the current window, R = 0.99537, summing over all possible secondary structures under the grammar]
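A minimal sketch of the window scan in Python, reusing the inside() sketch from above (the helper names and the background-frequency table are my assumptions):

import math

def background_logprob(seq, freqs):
    """0th-order Markov chain: independent base frequencies."""
    return sum(math.log(freqs[c]) for c in seq)

def scan(Z, window, unary_p, binary_p, nonterminals, freqs):
    """Yield (position, log R) for each window of the input sequence Z."""
    for i in range(len(Z) - window + 1):
        w = Z[i:i + window]
        alpha = inside(w, unary_p, binary_p, nonterminals)
        p_g = alpha[(0, window - 1, "S")]        # P(w | G)
        if p_g > 0.0:
            log_R = math.log(p_g) - background_logprob(w, freqs)
            yield i, log_R    # windows with log R >> 0 are ncRNA candidates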


    What about Pseudoknots?

    Among the most prevalent RNA structures is a motif known as the pseudoknot. (Staple & Butcher, 2005)



    A Grammar for Simple Pseudoknots

L(G) = { x y x^r y^r | x ∈ Σ*, y ∈ Σ* }   (where w^r denotes the reverse complement of w)

S → L X R
X → aXt | tXa | cXg | gXc | Y          (generate x and x^r)
Y → aAY | cCY | gGY | tTY | ε          (generate y and an encoded 2nd copy)
Aa → aA,  Ac → cA,  Ag → gA,  At → tA
Ca → aC,  Cc → cC,  Cg → gC,  Ct → tC
Ga → aG,  Gc → cG,  Gg → gG,  Gt → tG
Ta → aT,  Tc → cT,  Tg → gT,  Tt → tT  (propagate encoded copy of y to end of sequence)
AR → tR,  CR → gR,  GR → cR,  TR → aR  (reverse-complement second y at end of sequence)
La → aL,  Lc → cL,  Lg → gL,  Lt → tL,  LR → ε   (erase extra markers)

Q: what type of grammar is this?


    Implementing Zuker in an SCFG

(Rivas & Eddy, 2000)

[figure: production templates covering the Zuker-style cases: i is unpaired; j is unpaired; i pairs with j; i+1 pairs with j; i pairs with j-1; i+1 pairs with j-1 (these allow incorporation of stacking energies); serial structures; loops, bulges, and multiloops]




    Attribute grammars & Generalized SCFGs

What is an attribute-SCFG?

An attribute-SCFG (ASCFG) is an SCFG in which each nonterminal X has associated with it a set of attributes AX = {x0, x1, ..., xn}, and each production has associated with it a set of attribute update equations, by which the attributes of the nonterminals on the left-hand side of the production can be updated from the values of the attributes of the nonterminals on the right-hand side. For example:

X → YZ   { x0 = y3 + 3z1;  x1 = 6z0 }

Attributes can be computed using these equations during the Inside algorithm, in addition to (or in place of) the usual probabilistic DP equations.

We can also think of this as being something of a Generalized SCFG (as in Generalized HMMs), since they can encapsulate arbitrary additional computations within each state (i.e., nonterminal).


Batzoglou's GSCFG (CONTRAfold)

(Do et al., 2006)

Attribute updates allow them to perform additional computation during Inside, so as to provide more fine-grained modeling of the structure.

Much like in a GHMM, they can incorporate explicit length scores, etc.


Discriminative Training of GSCFGs


Summary

* An SCFG is a generative model utilizing production rules to generate strings.
* The probability of a string S being generated by an SCFG G can be computed using the Inside algorithm.
* Given a set of productions for an SCFG, the parameters can be estimated using the Inside-Outside (EM) algorithm.
* SCFGs are more powerful than HMMs because they can model arbitrary runs of paired nested elements, such as base pairings in a stem-loop structure, though they can't model pseudoknots.
* Thermodynamic folding algorithms can be simulated in an SCFG.
* SCFGs can be generalized via explicit attribute equations, and they can be discriminatively trained.