Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without loss of Accuracy Zasha Weinberg, and Walter L. Ruzzo Presented by: Jeff

Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs without loss of Accuracy

Zasha Weinberg and Walter L. Ruzzo

Presented by:

Jeff Bonis

CISC841 - Bioinformatics

What Are Non-Coding RNAs (ncRNA)?

• “functional molecules that do not code for proteins”

• Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements

• Over 100 known ncRNA families

Secondary Structure of ncRNAs

• Conserved, therefore useful for identifying homologs

• Secondary structure is functionally important to RNAs

• Base pairing is important in pattern searching

• e.g. 16S rRNA, part of the small subunit of the prokaryotic ribosome

What Techniques Exist?

• Two models that predict homologs in ncRNA families

– Covariance Models (CMs)

– Easy RNA Profile IdentificatioN (ERPIN) - http://tagc.univ-mrs.fr/erpin/

• Both use multiple alignment of family members with secondary structure annotation

• Statistical model is built from this multiple alignment

• Display high sensitivity and low specificity

What about ERPIN?

• DP algorithm matches the statistical profile onto a target database and returns the solutions and their scores

• Cannot take into account non-consensus bulges in helices (caused by indels)

• Need user specified score thresholds which compromises accuracy

CMs

• “specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures.”

• Can’t accommodate pseudoknots

• Very slow algorithm

Which model should be improved?

• Covariance Model (CM) is chosen because its limitation, the inability to model pseudoknots, costs little: pseudoknots contain little information anyway

• Address slow speed without sacrificing accuracy

• CMs used in Rfam - http://rfam.wustl.edu

– 8 gigabase genome DB called RFAMSEQ

– A tRNA search takes over a year on a P4

– Over 100 ncRNA families


Previous improvements on speed

• BLAST based heuristic

– Known members are BLASTed against RFAMSEQ

– CM is run on resulting set

• BLAST misses family members, especially where there is low sequence conservation

• tRNAscan-SE - http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

– Uses 2 heuristic based programs for tRNA searches

– CM is used on resulting set

– May miss tRNAs that CMs would find


How to improve sensitivity?

• Authors previously developed rigorous filters that keep 100% of the hits the CM would find

• Filters based on profile HMMs

– Profile HMM is built from CM then run on DB

– Much of DB is filtered out, CM runs on remaining set

• HMM filter based on sequence conservation

– Scanned for 126 of 139 ncRNA families in Rfam

– Other 13 display low sequence conservation, but have strong conservation of secondary structure which HMM can’t take into account

– Heuristic methods also miss these ncRNAs

How can these special biological situations be accounted for?

• Authors propose 3 innovations to overcome these setbacks

– 2 techniques to include secondary structure information in filtering at expense of CPU time

• Sub-CMs

– Hybrid filtering composed of CMs and profile HMMs

• Store-Pair

– Uses additional HMM states for modeling key base pairs

– Third technique will help reduce scan time

• Runs filters in series with quickest first ending with most selective

• Shortest path problem

Results

• Techniques worked for 11 of the 13 previously missed Rfams

– Also found new hits missed by BLAST

• In tRNAscan-SE, provided rigorous scan for 3 of 4 CMs finding missed hits

• 100 times faster than raw CM on average

• Uncovers members missed by heuristics

What are CMs anyway?

• “statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment”

• Described in terms of stochastic context-free grammars (SCFGs)

• Transformational Grammars

– Rules: describe grammar of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides

– Terminals: symbols in the actual string (nucleotides)

– Non-Terminals: abstract symbols (states)

– Parse: series of steps to obtain final output

• Example:

– RNA molecules CAG or GAC

– S1 -> c S2 g | g S2 c; S2 -> a

– Parse: S1 -> c S2 g -> cag
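The toy grammar in the example can be enumerated mechanically. The following is a minimal sketch, not code from the paper: the `RULES` table and the `expand` helper are illustrative names, and each rule is stored as a (left nucleotide, next non-terminal, right nucleotide) triple.

```python
# Toy grammar from the slide: S1 -> c S2 g | g S2 c ; S2 -> a
# Each rule is (left terminal, next non-terminal or None, right terminal).
# RULES and expand() are hypothetical names used only for this illustration.
RULES = {
    "S1": [("c", "S2", "g"), ("g", "S2", "c")],
    "S2": [("a", None, "")],
}


def expand(state):
    """Return every string the grammar can derive starting from `state`."""
    results = []
    for left, next_state, right in RULES[state]:
        # A rule with no next non-terminal ends the derivation.
        inner = expand(next_state) if next_state else [""]
        for middle in inner:
            results.append(left + middle + right)
    return results


if __name__ == "__main__":
    print(expand("S1"))
```

Because the left and right nucleotides are emitted by the same rule, the grammar can tie the first and last positions together, which is what lets a CM model base pairs.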

How are CMs used?

• Each rule is assigned a probability

– Rules more consistent w/ family have higher probability

• The probability of a parse is the product of the probabilities of the rules it uses

• CMs use log-odds ratios and sum the scores instead of multiplying probabilities

• CM Viterbi requires a window length as input, which upper bounds the family member’s length and affects scan time
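The relation between multiplying rule probabilities and summing log-odds scores can be sketched as follows. All numbers are made up for illustration: `RULE_PROB` holds hypothetical rule probabilities for the toy grammar, and `NULL_PROB` assumes a uniform background of 0.25 per emitted nucleotide.

```python
import math

# Hypothetical rule probabilities for the toy grammar (not from the paper).
RULE_PROB = {"S1->cS2g": 0.7, "S1->gS2c": 0.3, "S2->a": 1.0}

# Assumed background model: each emitted nucleotide has probability 0.25,
# so a rule emitting two nucleotides has null probability 0.25 ** 2.
NULL_PROB = {"S1->cS2g": 0.25 ** 2, "S1->gS2c": 0.25 ** 2, "S2->a": 0.25}


def parse_probability(rules):
    """Probability of a parse: product of the probabilities of its rules."""
    return math.prod(RULE_PROB[r] for r in rules)


def parse_log_odds(rules):
    """Log-odds score of a parse: sum of per-rule log2 odds ratios."""
    return sum(math.log2(RULE_PROB[r] / NULL_PROB[r]) for r in rules)


if __name__ == "__main__":
    parse = ["S1->cS2g", "S2->a"]  # the parse S1 -> c S2 g -> cag
    print(parse_probability(parse), parse_log_odds(parse))
```

Summing logs gives the same ranking as multiplying probabilities, while avoiding numerical underflow on long parses.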

How are profile HMMs and CMs combined?

• Given a CM, a profile HMM is created whose Viterbi score upper bounds the CM’s Viterbi score

– Guarantees 100% sensitivity on CM

• Filtering:

– At each nucleotide position in the subsequences of the database, an HMM is used to compute the CM score upper bound

– A CM scan is applied to all subsequences that produce an upper bound exceeding some threshold

– Subsequences that are below the threshold are filtered out.

• Profile HMMs are represented by regular grammars which cannot emit paired nucleotides, e.g.

– CM: S1 -> a S2 u | c S2 g; S2 -> ε

– HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> g | u

• A CM is expanded into a left and right HMM
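The filtering idea can be sketched on the toy pair grammar above. This is only an illustration under stated assumptions: the scores in `PAIR_SCORE` are invented, the window is fixed at two adjacent nucleotides, and the bound used here (best pair score for the left nucleotide) is a crude stand-in, not the authors' construction of the profile HMM.

```python
NEG_INF = float("-inf")

# Hypothetical log-odds scores for the CM rules S1 -> a S2 u | c S2 g.
PAIR_SCORE = {("a", "u"): 2.0, ("c", "g"): 3.0}

# Left HMM state: best score any pair starting with this nucleotide can reach.
LEFT_BOUND = {}
for (left_nt, _right_nt), pair_score in PAIR_SCORE.items():
    LEFT_BOUND[left_nt] = max(LEFT_BOUND.get(left_nt, NEG_INF), pair_score)

# Right HMM state: only checks that the nucleotide can pair with something.
RIGHT_OK = {right_nt for (_left_nt, right_nt) in PAIR_SCORE}


def cm_score(left, right):
    """Exact (expensive) score: the pair is scored jointly."""
    return PAIR_SCORE.get((left, right), NEG_INF)


def hmm_bound(left, right):
    """Cheap upper bound: left and right are scored independently."""
    if right not in RIGHT_OK:
        return NEG_INF
    return LEFT_BOUND.get(left, NEG_INF)


def filtered_scan(seq, threshold):
    """Run the CM only where the HMM bound reaches the threshold."""
    hits, cm_calls = [], 0
    for i in range(len(seq) - 1):
        left, right = seq[i], seq[i + 1]
        if hmm_bound(left, right) < threshold:
            continue  # filtered out: the CM score cannot reach the threshold
        cm_calls += 1
        score = cm_score(left, right)
        if score >= threshold:
            hits.append((i, score))
    return hits, cm_calls


if __name__ == "__main__":
    print(filtered_scan("gcgaucu", 2.5))
```

The HMM accepts "cu" even though the CM rejects it, so the filter lets some false candidates through, but it never discards a window the CM would have reported. That one-sided error is what keeps sensitivity at 100%.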

How can these be supplemented?

• Selecting an optimal series of filters

– Filtering fraction (fraction of DB left over) and run time are given by running a filter on a training sequence

– Minimize expected total CPU time

– Assumptions:

• estimated fractions and CPU times are constant for all training sequences

• A filter’s fraction is not affected by the previously run filters

• Optimal sequence of filters is solved as a shortest path problem on a graph

– nodes are filters and the CM

– Weight of edges are CPU time
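One way to read the shortest path formulation is sketched below. It assumes that the edge from filter u to filter v costs (fraction of the DB left after u) × (CPU time per nucleotide of v), and that after running v the remaining fraction is simply v's own fraction. That is an interpretation of the assumptions listed above, not a transcription of the authors' algorithm, and the filter names and numbers in the example are invented.

```python
def best_filter_chain(filters, cm_time):
    """Pick the series of filters minimizing expected CPU time per nucleotide.

    filters: list of (name, fraction_left, time_per_nucleotide) tuples.
    cm_time: time per nucleotide of the final CM scan.
    Returns (expected_time, list of names ending with "CM").
    """
    # Nodes ordered from least to most selective; START leaves the whole DB.
    nodes = [("START", 1.0, 0.0)]
    nodes += sorted(filters, key=lambda f: -f[1])
    nodes.append(("CM", 0.0, cm_time))

    n = len(nodes)
    best = [float("inf")] * n
    prev = [None] * n
    best[0] = 0.0

    # The graph is a DAG (edges only go toward more selective nodes),
    # so one pass in node order finds the shortest path.
    for v in range(1, n):
        for u in range(v):
            cost = best[u] + nodes[u][1] * nodes[v][2]
            if cost < best[v]:
                best[v], prev[v] = cost, u

    # Walk back from the CM node to recover the chosen chain.
    chain = []
    v = n - 1
    while v != 0:
        chain.append(nodes[v][0])
        v = prev[v]
    return best[-1], chain[::-1]


if __name__ == "__main__":
    # Hypothetical filters: (name, fraction of DB left, time per nucleotide).
    example = [("fast_hmm", 0.1, 1.0), ("store_pair", 0.01, 5.0), ("sub_cm", 0.001, 50.0)]
    print(best_filter_chain(example, cm_time=1000.0))
```

A slow filter that removes little is skipped automatically, because the path through it costs more than the path around it.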

Sub-CM technique

• Exploit info in hairpins (bulges and internal loops)

• Much info is stored in short hairpins that need only part of the CMs states

• Grammar contains both HMM and CMs

• Window length of sub-CM is crucial

• HMMs are created manually after sub-CMs are found

– Automation of this is a future project

Store-pair technique

• An HMM with extra states can reflect base pairs

• S1L[C] -> g S1L[C] has a score of negative infinity

• 5 states are added per HMM state, but can be reduced
