
Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, Biophysical Journal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

Article

Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data

Zhen Tan,1,2 Gaurav Sharma,2,3,4,* and David H. Mathews1,2,4,*
1Department of Biochemistry and Biophysics, 2Center for RNA Biology, 3Department of Electrical and Computer Engineering, and 4Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York

ABSTRACT Secondary structure prediction is an important problem in RNA bioinformatics because knowledge of structure is critical to understanding the functions of RNA sequences. Significant improvements in prediction accuracy have recently been demonstrated through the incorporation of experimentally obtained structural information, for instance using selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) mapping. However, such mapping data is currently available only for a limited number of RNA sequences. In this article, we present a method for extending the benefit of experimental mapping data in secondary structure prediction to homologous sequences. Specifically, we propose a method for integrating experimental mapping data into a comparative sequence analysis algorithm for secondary structure prediction of multiple homologs, whereby the mapping data benefits not only the prediction for the specific sequence that was mapped but also other homologs. The proposed method is realized by modifying the TurboFold II algorithm for prediction of RNA secondary structures to utilize basepairing probabilities guided by SHAPE experimental data when such data are available. The SHAPE-mapping-guided basepairing probabilities are obtained using the RSample method. Results demonstrate that the SHAPE mapping data for a sequence improves structure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone (TurboFold II). The updated version of TurboFold II is freely available as part of the RNAstructure software package.

INTRODUCTION

RNA functions in diverse cellular activities; it is a carrier of genetic information in transcription (1), a regulator of gene expression (2), and a catalyst (3). These cellular functions depend on the structure of RNA (4). Therefore, accurate predictions for the secondary structure, i.e., canonical basepairings between nucleotides, are critical for understanding and proposing hypotheses related to RNA functions. A commonly used approach is to predict secondary structures based on folding thermodynamics (5,6).

To achieve greater prediction accuracy, several thermodynamics-based methods incorporate experimental data derived from chemical probing to guide RNA secondary structure prediction (7–17). One mapping method, selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE), provides quantitative reactivity at each nucleotide to the SHAPE reagent, which measures the nucleotide flexibility (18,19). Because basepaired nucleotides are structurally restricted, high SHAPE reactivity is generally associated with not being canonically basepaired (20). SHAPE data can be collected with high-throughput sequencing (21–23) and can also be obtained in vivo (24–26).

Submitted March 1, 2017, and accepted for publication June 19, 2017.

*Correspondence: [email protected] or gaurav.[email protected]

Editor: Tamar Schlick.

© 2017 Biophysical Society.

RSample (Spasic, S.M. Assmann, P.C. Bevilacqua, D.H.M., unpublished data) models RNA secondary structure using SHAPE data. It focuses on matching structure models to the mapping data rather than directly integrating data into the model. In this way, it can model folding ensembles of multiple structures. A nucleotide-level comparison between experimental mapping data and modeled mapping data is used to guide a single refinement of a stochastic sample. The sample is then clustered to predict sets of structure models. The single structure prediction accuracy of RSample is similar to leading methods (>80% of predicted pairs in the accepted structure) (12), and RSample is able to estimate the population of multiple structures in the folding ensemble (27).

Another approach to improving secondary structure prediction accuracy is to use multiple homologous sequences to identify conserved basepairs (5,28–30). One method, TurboFold II (31; Z.T., Y. Fu, G. Sharma, D.H.M., unpublished data), iteratively refines basepairing probabilities for each sequence in a set of homologs by comparing the predicted basepairing probabilities across the set of homologs. Additionally, nucleotide alignment probabilities in pairwise alignments, estimated using a hidden Markov model (HMM) (32), are iteratively improved using information from estimated secondary structures (33). After the iterative updates, structures are predicted using the maximum expected accuracy algorithm (34–36) and a multiple sequence alignment is estimated using a probabilistic consistency transformation (36) and progressive alignment.

Biophysical Journal 113, 1–9, July 25, 2017

An open problem in the field is the integration of both structure mapping data and comparative data to improve secondary structure prediction accuracy. Prior work focused on the case where SHAPE data is available for all homologous sequences (37). For this situation, a multiple sequence alignment was first created by also including SHAPE data in pairwise global alignment. Then the RNAalifold method (38) was used to predict a consensus structure that is conserved given the fixed input alignment, using pseudo free energies to incorporate the SHAPE information (7). This article addresses the problem of predicting conserved secondary structures when SHAPE mapping is only available for one homolog. This use case is expected to be increasingly common as SHAPE is performed in vivo across transcriptomes. The method reported in this article is the integration of RSample into TurboFold II. In the resulting method, SHAPE-guided structure prediction and prediction of conserved structures act synergistically to improve secondary structure prediction accuracy, even for sequences for which SHAPE mapping was not performed. Results demonstrate that the SHAPE mapping data for a sequence improves structure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone (TurboFold II).

METHODS

Fig. 1 illustrates the proposed new version of TurboFold II that uses available SHAPE mapping data for one or more of the RNA sequence homologs to improve structure prediction for the sequences without SHAPE data. The input to TurboFold II is a set of homologous sequences and the outputs are the predicted secondary structures for each sequence and a multiple sequence alignment (31). To incorporate experimental mapping data into the predictions, the proposed approach makes use of RSample. Specifically, as shown in Fig. 1, within the TurboFold II iterations, RSample is used to refine estimated basepairing probabilities for sequences with SHAPE data and these estimated basepairing probabilities are incorporated in the iterations. As shown via the dashed lines in Fig. 1, in subsequent TurboFold II iterations, the incorporated SHAPE information propagates to other homologous sequences and thereby improves the prediction of structure for these sequences, in addition to improving structure prediction for the sequence with which the SHAPE data is affiliated. The major individual steps in the proposed approach are outlined next.

SHAPE-guided computation of basepairing probabilities using RSample

RSample first generates a stochastic sample (39) using a secondary structure partition function calculation (40). Then SHAPE reactivities are estimated for each nucleotide in each structure based on the status of the nucleotide: unpaired, paired at the last position of a helix, or paired in the interior of a helix. SHAPE reactivities are drawn from distributions composed of a database of 16 known secondary structures with experimentally measured SHAPE reactivities (12). The estimated SHAPE reactivity for a nucleotide is then the mean reactivity across all structures. The stochastic sampling is then repeated, where the partition function is reestimated so that the estimated SHAPE reactivities better match the experimental SHAPE mapping data. The free energy change term introduced to the partition function is

$$\Delta G_{\mathrm{bonus},i} = 0.5 \times \ln\left( \frac{R^{\mathrm{exp}}_i + 1.1}{R^{\mathrm{calc}}_i + 1.1} \right), \tag{1}$$

where $R^{\mathrm{exp}}_i$ and $R^{\mathrm{calc}}_i$ are experimentally measured reactivities and estimated reactivities of nucleotide i. This functional form was chosen so that the free energy of basepair stacking is only altered for nucleotides for which the originally estimated SHAPE reactivity does not match the experiment. The constants 0.5 and 1.1 in the equation were obtained (data not shown) via a grid search as the parameters that maximized structure prediction accuracy. The free energy bonus $\Delta G_{\mathrm{bonus},i}$ is then applied for each basepair stack involving nucleotide i. This approach focuses on matching the experimentally measured SHAPE reactivity.
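As an illustration of Eq. 1, the following is a minimal Python sketch, not the RSample implementation. The function name `shape_bonus` and its parameter names are hypothetical; the constants 0.5 and 1.1 are taken from Eq. 1, and reactivities are assumed to be greater than −1.1 so that the logarithm is defined.

```python
import math


def shape_bonus(r_exp: float, r_calc: float,
                scale: float = 0.5, offset: float = 1.1) -> float:
    """Evaluate Eq. 1: scale * ln((r_exp + offset) / (r_calc + offset)).

    r_exp  -- experimentally measured SHAPE reactivity of nucleotide i
    r_calc -- reactivity estimated from the stochastic sample
    The defaults for scale and offset are the constants quoted in the text.
    """
    return scale * math.log((r_exp + offset) / (r_calc + offset))


# The term vanishes when estimate and experiment agree, and changes sign
# depending on which of the two reactivities is larger.
print(shape_bonus(1.0, 1.0))
print(shape_bonus(2.0, 0.5), shape_bonus(0.5, 2.0))
```

The term is antisymmetric in its two arguments, so swapping the measured and estimated reactivities flips its sign while keeping its magnitude.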

Incorporation of RSample into TurboFold II

TurboFold II is a method to predict secondary structures for multiple RNA homologs and multiple sequence alignments. TurboFold II iteratively estimates basepairing probabilities for each sequence using intrinsic information and extrinsic information for sequence folding. Intrinsic information is derived from the thermodynamic model, which uses the latest set of nearest-neighbor thermodynamic parameters (11,41). Extrinsic information is a proclivity for basepairing inferred from the basepairing probabilities of other homologous sequences, mapped to the sequence of interest by the posterior probabilities of nucleotide coincidence of the other homologs to the sequence (32). The posterior coincidence probabilities can be obtained with an HMM for pairwise alignments (42). The estimated basepairing probabilities can be used to predict secondary structure using the maximum expected accuracy (MEA) algorithm (34,35,43) or the ProbKnot method (44).

RSample is integrated into TurboFold II to estimate basepairing probabilities for homologous sequences with available SHAPE mapping data on one of the homologs. The integrated algorithm uses 11 steps, illustrated in Fig. 1. We adapt the description, focusing particularly on the new elements introduced in this article.

Step 1 computes pairwise posterior coincidence probabilities using an HMM. Pairwise posterior coincidence probabilities are estimated for all pairs of sequences with an HMM as implemented by Harmanci et al. (32). Using the forward-backward algorithm, matrices of posterior coincidence probabilities for two nucleotides (one from each sequence) are computed. Details can be found in Harmanci et al. (32).

Step 2 computes basepairing probabilities of all sequences using the partition function method in RNAstructure (40).

Steps 3–5 are only performed for sequences for which there is SHAPE mapping data.

Step 3 generates an ensemble of $N_s$ = 10,000 structures by stochastic sampling for sequences with input SHAPE reactivity.

Step 4 estimates the SHAPE reactivity for each nucleotide based on the sample. The SHAPE reactivities are assigned to each nucleotide at each structure in the sample according to the distributions for three different local structures: unpaired, paired at a helix end, or paired in the interior of a helix. The SHAPE reactivity for each nucleotide is the arithmetic mean across structures in the sample. Because the size of the ensemble is large, the variance between samples is relatively low.

FIGURE 1 Flowchart for TurboFold II with incorporation of SHAPE mapping data for one or more sequences. The input is a set of H homologous RNA sequences and the outputs are the predicted secondary structures for each sequence and the predicted multiple sequence alignment. Steps 1–11 are described in Materials and Methods. The dashed arrow lines show the flow of SHAPE information and illustrate how, through the iterations, the SHAPE information contributes not only to the structure prediction for sequences with SHAPE data but also to the structure prediction for other sequences. Steps 3–5 in the dashed box show the processing for the sequences with SHAPE mapping data using RSample.



Step 5 recalculates the partition function using the free energy change term (in Eq. 1) to predict basepairing probability for the sequence with input SHAPE reactivities. Nucleotides with higher or lower estimated SHAPE reactivity than that measured by experiment are restrained with a lower or higher propensity to basepair, respectively. Nucleotides with consistent estimated and experimental SHAPE reactivity receive no restraint.

Step 6 calculates match scores that encourage alignment between nucleotide positions where both nucleotides are upstream paired, downstream paired, or unpaired. The match score was first proposed in PMcomp (33), and is utilized in TurboFold II as a prior for recalculating the posterior coincidence probability in the next step via the HMM pair alignment algorithm. For the mth sequence, based on estimated basepairing probabilities between all pairs of nucleotide positions obtained from the partition function calculation, for a nucleotide at position i, the estimated probability of downstream pairing is $P^{m}_{<}(i) = \sum_{j>i} P^{m}_{ij}$, of upstream pairing is $P^{m}_{>}(i) = \sum_{j<i} P^{m}_{ij}$, and of being unpaired is $P^{m}_{\circ}(i) = 1 - P^{m}_{<}(i) - P^{m}_{>}(i)$. The match score between nucleotides i and k in sequences m and n, respectively, is formulated as

$$\rho(i,k) = \left( \sqrt{P^{m}_{<}(i)\,P^{n}_{<}(k)} + \sqrt{P^{m}_{>}(i)\,P^{n}_{>}(k)} \right) + 0.8 \times \left( \sqrt{P^{m}_{\circ}(i)\,P^{n}_{\circ}(k)} \right) + 0.5. \tag{2}$$

For sequences without SHAPE mapping data, the basepairing probabilities from Step 2 are utilized for the computation of match scores, whereas for sequences with SHAPE mapping data, the basepairing probabilities from Step 5 are used in the computation of the match scores.
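The match score of Eq. 2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the TurboFold II code: the function names `pairing_profile` and `match_score` are hypothetical, positions are 0-indexed, and each basepairing probability matrix is assumed to be a symmetric square list of lists (`bpp[i][j] == bpp[j][i]`). The downstream, upstream, and unpaired probabilities follow the definitions given in Step 6.

```python
import math


def pairing_profile(bpp):
    """Return (downstream, upstream, unpaired) probabilities per position.

    bpp is assumed to be a symmetric matrix of basepairing probabilities.
    Following the text, 'downstream' sums over partners j > i and
    'upstream' sums over partners j < i.
    """
    n = len(bpp)
    profile = []
    for i in range(n):
        down = sum(bpp[i][j] for j in range(i + 1, n))
        up = sum(bpp[i][j] for j in range(i))
        # Clamp at zero to guard against small negative rounding errors.
        unpaired = max(0.0, 1.0 - down - up)
        profile.append((down, up, unpaired))
    return profile


def match_score(profile_m, profile_n, i, k):
    """Eq. 2 for position i of sequence m and position k of sequence n."""
    down_m, up_m, free_m = profile_m[i]
    down_n, up_n, free_n = profile_n[k]
    paired_term = math.sqrt(down_m * down_n) + math.sqrt(up_m * up_n)
    return paired_term + 0.8 * math.sqrt(free_m * free_n) + 0.5


# Toy example: a 4-nucleotide sequence whose ends pair with probability 1.
toy = [[0.0] * 4 for _ in range(4)]
toy[0][3] = toy[3][0] = 1.0
prof = pairing_profile(toy)
print(match_score(prof, prof, 0, 0), match_score(prof, prof, 0, 3))
```

In the toy case, two positions that pair in the same direction score higher than a position that pairs downstream matched against one that pairs upstream, which only receives the constant 0.5.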

Step 7 reestimates the posterior coincidence probability. Information from prior iterations is utilized to reestimate alignment posterior probabilities and basepairing probabilities for secondary structures. The iterative reestimation of alignment posterior probabilities was introduced in TurboFold II and uses the standard HMM alignment model (42), but with the match score of Eq. 2 incorporated as a prior.

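Step 8 calculates extrinsic information for each sequence by combining basepairing probabilities from other input sequences using posterior coincidence probabilities:

$$
P^{(n \to m)}(i,j) =
\begin{cases}
\displaystyle \sum_{\substack{k,l \\ 1 \le k < l \le N_n \\ k \in C^{m,n}_{i} \\ l \in C^{m,n}_{j}}} \mathrm{Prob}_{\mathrm{bp}}(k,l) \times P^{(m,n)}(i \sim k) \times P^{(m,n)}(j \sim l) \times (H-1) \times \lambda & \text{(if sequence } n \text{ is with SHAPE)} \\[3ex]
\displaystyle \sum_{\substack{k,l \\ 1 \le k < l \le N_n \\ k \in C^{m,n}_{i} \\ l \in C^{m,n}_{j}}} \mathrm{Prob}_{\mathrm{bp}}(k,l) \times P^{(m,n)}(i \sim k) \times P^{(m,n)}(j \sim l) \times \left(1 - \psi_{m,n}\right) & \text{(otherwise)},
\end{cases}
\tag{3}
$$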

where $P^{(n \to m)}$ denotes the extrinsic information for sequence m inferred from sequence n. $N_n$ indicates the length of sequence n. The notations $C^{m,n}_{i}$ and $C^{m,n}_{j}$ denote the sets of indices for which posterior coincidence alignment probabilities $P^{(m,n)}(i \sim k)$ and $P^{(m,n)}(j \sim l)$, respectively, exceed a predetermined threshold below which values are considered 0 for computational simplification. $\mathrm{Prob}_{\mathrm{bp}}(k,l)$ denotes the (estimated) basepairing probability between nucleotide k and nucleotide l within a sequence. The value "$i \sim k$" indicates the alignment between indices i and k in two sequences. H is the number of homologous sequences. To keep the ratio of extrinsic information from sequence n to every other sequence constant, the extrinsic information term for sequence n is multiplied by $H - 1$ if sequence n has SHAPE data. This ensures that more extrinsic information is used from sequences with SHAPE data than from sequences without SHAPE data. $\lambda$ is a parameter, optimized based on training. The factor $(1 - \psi_{m,n})$ weights the contribution of sequence n to the extrinsic information for sequence m using the sequence identity, $\psi_{m,n}$, for sequences m and n computed from an HMM alignment. This term is only used when sequence n does not have associated SHAPE mapping data. Because of the factor $(1 - \psi_{m,n})$, sequences that are highly similar to sequence m have a lower contribution to extrinsic information than those with lower similarities. The extrinsic information is calculated from basepairing proclivity for each sequence as inferred from every other sequence pairwise. Because the sequence with SHAPE reactivities is presumed to have more accurate estimates of basepairing probabilities, the basepairing proclivities from the sequence with SHAPE reactivities to sequences without SHAPE reactivities are assigned a different, adjustable weighting ($\lambda$). The basepairing proclivities for sequences without SHAPE data and from other sequences to the sequence with SHAPE data are computed in an identical fashion to the TurboFold II algorithm.
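To make the two branches of Eq. 3 concrete, the following Python sketch evaluates the contribution of a single sequence n to the pair (i, j) of sequence m. It is a simplified illustration, not the TurboFold II implementation. The function name and the sparse data layout are assumptions: `bpp_n` maps index pairs `(k, l)` with `k < l` to basepairing probabilities, and `coincidence` maps each position of sequence m to a dictionary of already-thresholded coincidence probabilities (the sets $C^{m,n}_{i}$).

```python
def extrinsic_information(bpp_n, coincidence, i, j, *,
                          n_has_shape, num_seqs, lam=1.0, identity=0.0):
    """Evaluate one term of Eq. 3 for positions (i, j) of sequence m.

    bpp_n       -- {(k, l): probability} for sequence n, with k < l
    coincidence -- {i: {k: P(i ~ k)}} keeping only above-threshold entries
    n_has_shape -- True if sequence n has SHAPE mapping data
    num_seqs    -- H, the number of homologous sequences
    lam         -- the weighting parameter lambda (1.0 is the reported default)
    identity    -- pairwise sequence identity psi between m and n
    """
    # SHAPE-guided sequences are weighted by (H - 1) * lambda; all other
    # sequences are down-weighted by their identity to sequence m.
    weight = (num_seqs - 1) * lam if n_has_shape else (1.0 - identity)
    total = 0.0
    for k, p_ik in coincidence.get(i, {}).items():
        for l, p_jl in coincidence.get(j, {}).items():
            if k < l:
                total += bpp_n.get((k, l), 0.0) * p_ik * p_jl
    return weight * total


# Toy example: one candidate pair in sequence n aligned to (0, 3) in m.
bpp_n = {(0, 3): 0.9}
coincidence = {0: {0: 1.0}, 3: {3: 0.5}}
plain = extrinsic_information(bpp_n, coincidence, 0, 3,
                              n_has_shape=False, num_seqs=5, identity=0.4)
guided = extrinsic_information(bpp_n, coincidence, 0, 3,
                               n_has_shape=True, num_seqs=5)
print(plain, guided)
```

The sketch omits any further combination or normalization across the other sequences n; it only shows how the SHAPE flag switches the weighting of an otherwise identical sum.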

Step 9 updates the basepairing probability by recomputing the partition function for each sequence with the addition of extrinsic information. The extrinsic information is incorporated as a pseudo free energy term in the partition function calculation for each sequence. A detailed description is in Harmanci et al. (31).

Steps 2–9 form a loop that is iterated through three times, which is shown to be optimal in Harmanci et al. (31).

Steps 10 and 11 perform progressive alignment and predict final secondary structures, respectively. In Step 10, the posterior coincidence probabilities obtained with the updated match scores via Step 6 are used to calculate a multiple sequence alignment. A probabilistic consistency transformation, as described in ProbCons (36), is used to refine alignment probabilities based on three-way alignment consistency of pairwise HMM posterior probabilities. Refined alignments are further predicted hierarchically based on a guide tree, as described in ProbCons (36).

In Step 11, the structures are predicted by the MEA algorithm. Given the basepair probabilities $P_m(i,j)$ for structure $S_m$ of sequence m, the MEA structure is defined as

$$S^{*}_{m} = \underset{S_m}{\arg\max} \left\{ \sum_{(i,j) \in S_m} 2 \cdot P_m(i,j) + \sum_{\substack{\forall i, \\ i \text{ unpaired in } S_m}} P_m(i) \right\}, \tag{4}$$

where $P_m(i)$ is the probability that nucleotide position i is not basepaired, computed as

$$P_m(i) = 1 - \sum_{j=i+1}^{N_m} P_m(i,j) - \sum_{j=1}^{i-1} P_m(j,i), \tag{5}$$

and where $N_m$ is the length of sequence m. The MEA structure is obtained with a dynamic programming algorithm, as described in Harmanci et al. (31).
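The objective in Eqs. 4 and 5 can be maximized with a standard nested (pseudoknot-free) dynamic program. The Python sketch below is one such recursion written for illustration, not the RNAstructure implementation. The names `unpaired_probs` and `mea_structure` are hypothetical, positions are 0-indexed, the probability matrix is assumed to be symmetric, and the minimum hairpin loop size `min_loop=3` is an assumption that is not stated in the text.

```python
def unpaired_probs(bpp):
    """Eq. 5: one minus the total pairing probability of each position."""
    n = len(bpp)
    return [1.0 - sum(bpp[i][j] for j in range(n) if j != i) for i in range(n)]


def mea_structure(bpp, min_loop=3):
    """Return (pairs, score) maximizing the Eq. 4 objective.

    bpp is a symmetric matrix of basepairing probabilities. min_loop is the
    assumed minimum number of unpaired nucleotides enclosed by a pair.
    """
    n = len(bpp)
    q = unpaired_probs(bpp)
    best = [[0.0] * n for _ in range(n)]
    move = [[None] * n for _ in range(n)]

    def val(i, j):
        # Empty intervals contribute nothing.
        return best[i][j] if i <= j else 0.0

    for span in range(n):
        for i in range(n - span):
            j = i + span
            # Option 1: position i stays unpaired.
            top, how = val(i + 1, j) + q[i], ("skip",)
            # Option 2: i pairs with j, if the enclosed loop is long enough.
            if span > min_loop:
                cand = val(i + 1, j - 1) + 2.0 * bpp[i][j]
                if cand > top:
                    top, how = cand, ("pair",)
            # Option 3: split the interval into i..k and k+1..j.
            for k in range(i, j):
                cand = val(i, k) + val(k + 1, j)
                if cand > top:
                    top, how = cand, ("split", k)
            best[i][j], move[i][j] = top, how

    # Trace back the chosen moves to recover the basepairs.
    pairs, stack = [], ([(0, n - 1)] if n else [])
    while stack:
        i, j = stack.pop()
        if i > j:
            continue
        how = move[i][j]
        if how[0] == "skip":
            stack.append((i + 1, j))
        elif how[0] == "pair":
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            stack.extend([(i, how[1]), (how[1] + 1, j)])
    return sorted(pairs), (best[0][n - 1] if n else 0.0)


# Toy example: an 8-nucleotide hairpin with two likely stacked pairs.
toy = [[0.0] * 8 for _ in range(8)]
toy[0][7] = toy[7][0] = 0.9
toy[1][6] = toy[6][1] = 0.8
print(mea_structure(toy))
```

The table is filled in order of increasing interval length, so every subinterval referenced by a recursion step has already been scored; the cost is cubic in the sequence length because of the split option.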

Parameter optimization

To train the parameter $\lambda$ corresponding to the weighting of the extrinsic information term in Eq. 3, 20 groups of input sequences formed by 10 homologous sequences (including the sequence with SHAPE data) were randomly chosen from the small subunit ribosomal RNA in the RNAStralign database. The range for parameter $\lambda$ was from 0 to 2.0 (with samples at 0, 0.02, 0.1, 0.2, 0.4, 1.0, 1.6, and 2.0). The resulting optimal parameter ($\lambda$ = 1.0) was then used as the default for the method. The geometric mean of sensitivity and PPV was used as the accuracy metric for optimizing the parameter $\lambda$, and the values of this metric over the training set are given in the Supporting Material (Fig. S15).



Benchmarks

For benchmarking, groups of sequence homologs were selected from several families based on the selection criterion that SHAPE data were available for a sequence in the family (12). Hepatitis C virus (HCV) IRES domain, TPP riboswitch, cyclic-di-GMP riboswitch, SAM I riboswitch, M-box riboswitch, and Lysine riboswitch RNA sequences were randomly selected from the Rfam database (45). tRNA, 5S ribosomal RNA, and group I intron sequences were selected from the RNAStralign database (http://rna.urmc.rochester.edu/RNAStralign.tar.gz). 23S rRNA sequences were selected from the Comparative RNA web site and project (http://www.rna.icmb.utexas.edu/). Specifically, 20 groups of 4-, 9-, or 19-sequence homologs were selected from each of the RNA families. All methods were benchmarked on the same groups of sequences. Detailed information on the selected sequences is in Tables S1 and S2. For comparison, a single sequence prediction accuracy was also computed as the average of the accuracies for each homolog in the set of sequences for predictions obtained using the MaxExpect (maximum expected accuracy) method from RNAstructure 5.7.

Scoring of prediction accuracy

The F1 score, which is the harmonic mean of sensitivity and PPV, is used in the structure-prediction benchmark. The F1 score is computed as

$$F_1 = \frac{2 \times \mathrm{Sensitivity} \times \mathrm{PPV}}{\mathrm{Sensitivity} + \mathrm{PPV}}. \tag{6}$$

Sensitivity is the fraction of basepairs from the Rfam database that are correctly predicted. PPV is the fraction of predicted basepairs that are correct, i.e., included in the Rfam database.

Predicted basepairs are considered correct if a nucleotide on either the 5′- or 3′-position of the helix is off by one position compared to the standard (13,46). For instance, a predicted basepair (i, j) is correct if basepair (i, j), or (i ± 1, j), or (i, j ± 1) exists in the database. This is important because of uncertainty in the determination of secondary structure by comparative analysis (47) and also because of thermodynamic fluctuations of local structures (48,49).
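The scoring rule above can be sketched as follows. This is an illustrative Python sketch rather than the scoring program used in the benchmark. The function names are hypothetical, basepairs are represented as `(i, j)` tuples, and it is an assumption of this sketch that the one-position tolerance is applied symmetrically, so that a reference pair also counts as found when a predicted pair lies within one position of it.

```python
def within_slip(pair, others):
    """True if (i, j), (i +/- 1, j), or (i, j +/- 1) is present in others."""
    i, j = pair
    candidates = ((i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
    return any(c in others for c in candidates)


def score_prediction(predicted, reference):
    """Return (sensitivity, ppv, f1) with one-position slippage allowed."""
    predicted, reference = set(predicted), set(reference)
    correct_predicted = sum(within_slip(p, reference) for p in predicted)
    found_reference = sum(within_slip(r, predicted) for r in reference)
    ppv = correct_predicted / len(predicted) if predicted else 0.0
    sensitivity = found_reference / len(reference) if reference else 0.0
    total = sensitivity + ppv
    # Eq. 6, with F1 defined as 0 when both terms are 0.
    f1 = 2.0 * sensitivity * ppv / total if total else 0.0
    return sensitivity, ppv, f1


reference = [(1, 10), (2, 9), (3, 8)]
predicted = [(1, 10), (2, 8), (4, 20)]
print(score_prediction(predicted, reference))
```

In this toy case the slipped pair (2, 8) is accepted as a match, while (4, 20) has no counterpart in the reference and lowers the PPV.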

Significance testing

To assess the statistical significance of the differences in F1 score, sensitivity, and PPV, paired t-tests were performed using R 3.0.2 (50) between TurboFold II with SHAPE data and each of the other methods (51). Alpha, the type I error rate, was set to 0.05. The figures summarizing the benchmarking results are annotated to indicate the results of the significance tests.
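For readers unfamiliar with the paired design, the test statistic can be sketched in Python as below. The analysis in this article was carried out in R; this sketch only computes the t statistic and its degrees of freedom from per-group scores, and leaves the p-value to a statistics package. The function name and the example scores are hypothetical and are not benchmark results.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    # Sample standard deviation of the paired differences.
    sd_diff = statistics.stdev(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Made-up F1 scores for two methods on the same four sequence groups.
method_a = [0.80, 0.70, 0.90, 0.60]
method_b = [0.70, 0.65, 0.80, 0.60]
t_value, dof = paired_t_statistic(method_a, method_b)
print(t_value, dof)
```

Pairing matters here because all methods were run on the same groups of sequences, so the comparison is made on per-group differences rather than on two independent samples.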

Alternative methods

Although no previous work has been reported on using SHAPE data for one homolog in the prediction of structures for other homologs, the RNAalifold (38,52) method can be used for this purpose and it is therefore used for comparison. For RNAalifold, the SHAPE reactivity data is converted to per-nucleotide pseudo free energies that are then applied for each basepair stack including a nucleotide. A log-linear fit based on Deigan et al. (7) is used to convert reactivities into pseudo free energies. The RNAalifold method does not compute an alignment and requires an input multiple sequence alignment. Input alignments for RNAalifold (2.2.5) were generated using ClustalW (2.1) (38,53). Default options and parameters were used for these programs in the benchmarking.

RESULTS

The new version of TurboFold II, capable of incorporating SHAPE data, was benchmarked for structure prediction accuracy using RNA families, where one sequence in each family has measured SHAPE reactivity (12). The method was compared with RNAalifold (38), RSample, and MaxExpect (35). RNAalifold is a method for predicting consensus structures for multiple homologs. It was previously adapted for using SHAPE data, and was benchmarked for cases when all sequences had SHAPE mapping data (37). RSample is run for the single sequences with SHAPE data available. MaxExpect is the single sequence maximum expected accuracy method, and maximum expected accuracy is used to generate the predicted structures from predicted basepairing probabilities with TurboFold. The accuracy results are represented in Figs. 2 and S1–S11; Tables S4 and S5.

Fig. 2 shows the average structure prediction accuracy for the sequences without SHAPE data. The results demonstrate that the majority of RNA families (tRNA, 5S rRNA, hepatitis C virus IRES, group I intron, lysine riboswitch, SAM I riboswitch, cyclic-di-GMP riboswitch, and 23S rRNA) have significantly (p < 0.05) better structure prediction accuracy when SHAPE is used in the calculation than when SHAPE data is not used. This shows that SHAPE data for a single sequence can inform the structure modeling for homologous sequences. However, for the M-box riboswitch and TPP riboswitch, the accuracies are not significantly improved by having SHAPE data. For the sequences without SHAPE data, the new version of TurboFold II performed better than RNAalifold using SHAPE data and MaxExpect. Fig. S12 shows that much of the improvement in accuracy is for sequences that were relatively poorly predicted in the absence of SHAPE data. The accuracy performance for those sequences is rescued by having SHAPE information for a homologous sequence.

It is observed that structure prediction accuracies by TurboFold II using SHAPE data across sizes of sequence groups are scarcely changed (from 5 to 20 sequences). The relationship between structure prediction accuracies and sequence lengths is also weak (Tables S1 and S2). For the 23S rRNA family, which has the longest average sequence length (~2900 nucleotides), all methods, except single-sequence MaxExpect, perform well. On the RNA families with sequence lengths shorter than 200 nucleotides, TurboFold II + SHAPE improves structure predictions for tRNA, 5S, lysine riboswitch, and cyclic-di-GMP riboswitch, but does not improve structure predictions for M-box riboswitch and TPP riboswitch.

For the one sequence with SHAPE mapping data in each RNA family, the results show that the majority of RNA families (5S rRNA, HCV IRES domain, group I intron, TPP riboswitch, and 23S rRNA) have significantly (p < 0.05) better prediction accuracy when SHAPE data are used than when SHAPE data are not used (Fig. S1 and Table S4). For tRNA, the lysine riboswitch, and the M-box riboswitch families, the accuracy performances are the same. In the SAM I riboswitch and the cyclic-di-GMP riboswitch families, the accuracies decreased when SHAPE data are used. In tRNA, 5S rRNA, group I intron, lysine riboswitch, SAM I riboswitch, TPP riboswitch, and 23S rRNA families, the new version of TurboFold II performed better than RSample. Only in the hepatitis C virus IRES domain and cyclic-di-GMP riboswitch families did the new version of TurboFold II perform worse than RSample. TurboFold II + SHAPE performed better than RNAalifold using SHAPE data on every family and performed better than MaxExpect on a majority of families (except the cyclic-di-GMP riboswitch and the M-box riboswitch) using SHAPE data.

FIGURE 2 Average F1 score of structure predictions of the sequences that did not have SHAPE mapping data. Given here is the average F1 score of structure predictions obtained by running the methods with 5-, 10-, or 20-input sequences on tRNA, 5S rRNA, hepatitis C virus IRES domain, group I intron, lysine riboswitch, M-box riboswitch, SAM I riboswitch, TPP riboswitch, cyclic-di-GMP riboswitch, and 23S rRNA test datasets. The methods shown are TurboFold II + SHAPE, TurboFold II, RNAalifold + SHAPE, RNAalifold, and MaxExpect. Standard errors of the mean are shown by error bars. The star (*) above the bar for a method indicates that the difference in F1 score between the method and the new TurboFold II is statistically significant, as determined by paired t-tests (51).

The alignment predictions by TurboFold II with and without SHAPE (Fig. S13) are compared with the predicted alignment by ClustalW (53), a method that is based on pairwise dynamic programming alignments, which is the input alignment for RNAalifold. Because the Rfam database alignments do not include the sequence with SHAPE data for all of the families, the alignment accuracy is assessed only over the sequences without SHAPE data within each family of homologs. With the exception of the 5S rRNA and the hepatitis C virus IRES domain, TurboFold II with SHAPE had higher sensitivity and PPV compared to ClustalW. Using SHAPE data on one sequence in each RNA family also significantly improved the alignment accuracy of other homologs without SHAPE in a majority of RNA families (group I intron, lysine riboswitch, M-box riboswitch, SAM I riboswitch, TPP riboswitch, and cyclic-di-GMP riboswitch).

DISCUSSION

Secondary structure models are important for understanding the functions of the RNA structure (54). Using SHAPE data was shown to improve structure prediction accuracy significantly for single sequence secondary structure predictions (7,12). In this work, it is demonstrated that the SHAPE data can inform the folding of other homologs by combining information from sequence comparison of RNA homologs. In particular, it is shown that given SHAPE data for one sequence out of the multiple sequences used in secondary structure prediction by comparative analysis, TurboFold II + SHAPE can substantially improve the structure prediction accuracies of the sequences that did not have SHAPE mapping data.

One of the reasons for the improvements of the structure prediction accuracies of homologs without SHAPE is the more accurate prediction of the structure of the sequence with SHAPE reactivity. In three RNA families (5S rRNA, HCV IRES, and group I intron), TurboFold II improved the average structure accuracy of both the sequences with and without SHAPE (Fig. S1). The more accurate structural information from the sequence with SHAPE is transmitted to its homologs through the extrinsic information calculation. The structure prediction of homologs is improved because the extrinsic information calculation from the sequence with SHAPE to the other (H − 1 total) homologs introduces the factor (H − 1), which ensures that the fraction of extrinsic information provided by sequences with SHAPE is high compared to other homologs.

To take advantage of SHAPE data on one of the homologs, the new method ignores pairwise sequence identity during the calculation of extrinsic information from the sequence with SHAPE to other sequences. To understand the nature of the improvements in structure prediction accuracy for sequences without SHAPE, the relationship between structure prediction accuracy and sequence identity is studied (Fig. S14). Sequence identity is defined as the ratio of the number of columns with the same pairwise aligned nucleotides in the output alignment between the sequence with SHAPE and other homologs from the TurboFold II + SHAPE method. One observed trend is that the sequences with more accurately predicted structures (higher F1 score) generally had higher sequence identity to the sequence with SHAPE. Moreover, the F1 score improvements were distributed in a roughly Gaussian shape along the sequence identity (Fig. S14). For the sequences with relatively high sequence identity, the room to improve accuracy was limited. The Gaussian shape is also partially due to improvements in structure prediction that result from a more accurate alignment. This is observed in some of the RNA families (tRNA, group I intron, lysine riboswitch, and SAM I riboswitch) (Fig. S13). The 5S rRNA, hepatitis C virus IRES domain, and cyclic-di-GMP riboswitch RNA families showed improvements in structure prediction accuracy but little or no improvement in alignment prediction accuracy, because the alignment accuracies of these RNA families were already relatively high (~90% in sensitivity and PPV).
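The sequence identity used in this analysis can be sketched in Python as follows. The text does not spell out the denominator of the ratio, so this sketch assumes it is the number of alignment columns in which at least one of the two sequences has a nucleotide. The function name `pairwise_identity` and the toy alignment are hypothetical.

```python
def pairwise_identity(row_a, row_b):
    """Fraction of alignment columns holding the same nucleotide in both rows.

    ASSUMPTION: columns that are gaps ('-') in both rows are skipped, and
    every remaining column counts toward the denominator.
    """
    matches = 0
    columns = 0
    for char_a, char_b in zip(row_a.upper(), row_b.upper()):
        if char_a == "-" and char_b == "-":
            continue  # column is empty for this pair of sequences
        columns += 1
        if char_a == char_b:  # both are nucleotides and they are identical
            matches += 1
    return matches / columns if columns else 0.0


# Toy example: two rows taken from a hypothetical output alignment.
print(pairwise_identity("GGCA-UCC", "GG-AAUCC"))  # 0.75
```

A different denominator, such as the length of the shorter sequence, would shift the values but not the ordering of closely related versus distant homologs in most cases.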

The other reason for the improved structure prediction accuracy of homologs without SHAPE is the more accurate coincidence probability as compared to the case without SHAPE data on any of the input sequences. The coincidence probability is important for mapping the basepairing probabilities of other homologous sequences to the sequence of interest, and it is also helpful for estimating the final multiple sequence alignment (Fig. S13).

One remaining challenge of structure prediction using experimental probing data on one of the homologs is the difficulty of determining the balance between information from the thermodynamics of the sequence and extrinsic information from the sequence with experimental data. In Fig. 3, an example from the TPP riboswitch family shows that the structure of one homologous sequence, BA000043, was incorrectly predicted to form three extra basepairs between the 5′ and 3′ ends when SHAPE was used as compared to when SHAPE was not used, although the longer helix contributes to a more stable structure.

FIGURE 3 Representative secondary structure prediction for TPP riboswitch (BA000043) with (a) and without (b) SHAPE data on a homologous RNA. Basepair predictions are illustrated by colored lines (green, red, and black denoting correct, incorrect, and missing basepairs, respectively) on circle plots. The circle plots were generated using the CircleCompare program in RNAstructure (55).

RNAalifold showed lower accuracies for predicted structures than TurboFold II + SHAPE in most of the RNA families. A contributing factor to this inaccuracy was the lower accuracy of the input sequence alignment (Fig. S13). Although pseudo free energies obtained from the SHAPE reactivity data at nucleotides might be helpful for estimating the structure, an inaccurate alignment between the sequence with SHAPE data and its homologs can disturb the consensus structure for the set of aligned sequences and can cause loss of basepairs in the consensus structure. For the group I intron, lysine riboswitch, SAM I riboswitch, TPP riboswitch, and cyclic-di-GMP riboswitch RNA families, the sensitivity and PPV of the predicted ClustalW alignment for sequences without SHAPE are ~10% lower than those of TurboFold II + SHAPE, and the F1 score of structure prediction on these RNA families is ~20% lower than that of TurboFold II + SHAPE.
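For reference, the structure prediction metrics quoted here can be computed as in the following Python sketch. It assumes exact matching of basepairs between the predicted and the accepted structure, which is a simplification because published benchmarks sometimes tolerate a one-nucleotide shift. The function name `structure_accuracy` and the basepair lists are hypothetical.

```python
def structure_accuracy(predicted_pairs, known_pairs):
    """Return (sensitivity, PPV, F1) for a predicted set of basepairs.

    Each basepair is an (i, j) tuple of nucleotide indices.
    ASSUMPTION: a predicted pair is correct only if it matches a known
    pair exactly.
    """
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    known = {tuple(sorted(pair)) for pair in known_pairs}
    true_positives = len(predicted & known)
    sensitivity = true_positives / len(known) if known else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    total = sensitivity + ppv
    # F1 is the harmonic mean of sensitivity and PPV.
    f1 = 2 * sensitivity * ppv / total if total else 0.0
    return sensitivity, ppv, f1


# Toy example: three of four known pairs recovered, two extra pairs predicted.
known = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(1, 20), (2, 19), (3, 18), (5, 15), (6, 14)]
sensitivity, ppv, f1 = structure_accuracy(predicted, known)
print(sensitivity, ppv, round(f1, 3))
```

Because F1 is a harmonic mean, a method cannot score well by predicting many pairs (high sensitivity, low PPV) or very few pairs (high PPV, low sensitivity).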

Another contributing factor to the worse performance of RNAalifold is the way it integrates SHAPE data. The information from experimental data weakens with an increasing number of homologs, because the pseudo energy from SHAPE reactivity is applied only to the free energy calculation of the particular sequence that was mapped.
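The weakening described above can be made concrete with a short Python sketch. It rests on two assumptions that are not stated in this section: the pseudo free energy takes the form m·ln(reactivity + 1) + b introduced by Deigan et al. (7), with m = 2.6 and b = −0.8 kcal/mol used here only as illustrative values, and the consensus energy is an average over the H aligned sequences. The function names are hypothetical.

```python
import math


def shape_pseudo_energy(reactivity, slope=2.6, intercept=-0.8):
    """Pseudo free energy change (kcal/mol) for one nucleotide.

    ASSUMPTION: the form slope * ln(reactivity + 1) + intercept, with
    illustrative parameter values; reactivity must be greater than -1.
    """
    return slope * math.log(reactivity + 1.0) + intercept


def averaged_contribution(reactivity, n_homologs):
    """Effective per-sequence contribution when only one of n_homologs
    sequences carries SHAPE data and energies are averaged (ASSUMPTION)."""
    return shape_pseudo_energy(reactivity) / n_homologs


# The same reactivity carries less weight as more homologs are added.
for n_homologs in (5, 10, 20):
    print(n_homologs, round(averaged_contribution(1.0, n_homologs), 3))
```

Under these assumptions the contribution of the mapping data falls as 1/H, which is one way to picture why adding homologs dilutes the SHAPE signal in a consensus-based method.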

TurboFold II using SHAPE data on one or more sequences maintains a computation speed comparable to TurboFold II (with complexity $O(H^2 N^2 + H N^3)$ for H sequences of average length N). The time performance on select sequence families is provided in Table S6.
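A rough feel for this scaling can be obtained from the following Python sketch, which simply evaluates the two terms of the stated complexity. Constant factors and the number of iterations are ignored, so the numbers are relative operation counts, not run times. The function name `relative_cost` is hypothetical.

```python
def relative_cost(n_sequences, avg_length):
    """Evaluate H^2 * N^2 + H * N^3 for H sequences of average length N.

    The result is a unitless operation count with constants omitted.
    """
    pairwise_term = n_sequences ** 2 * avg_length ** 2  # H^2 N^2
    folding_term = n_sequences * avg_length ** 3        # H N^3
    return pairwise_term + folding_term


# Doubling H at fixed N = 100 roughly doubles the cost, because the
# H N^3 term dominates whenever N is much larger than H.
baseline = relative_cost(5, 100)
for n_sequences in (5, 10, 20):
    print(n_sequences, relative_cost(n_sequences, 100) / baseline)
```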

CONCLUSION

A new version of TurboFold II with the ability to include SHAPE mapping data for one or more of the RNA sequence homologs can substantially improve the structure prediction accuracies of the sequences that do not have SHAPE data. This version of TurboFold II is available under the GNU license as part of the RNAstructure software package at: http://rna.urmc.rochester.edu/RNAstructure.html.

SUPPORTING MATERIAL

Supporting Materials and Methods, fifteen figures, and six tables are available at http://www.biophysj.org/biophysj/supplemental/S0006-3495(17)30689-6.

AUTHOR CONTRIBUTIONS

All authors planned experiments. Z.T. wrote code and performed experiments. Z.T. drafted the manuscript. All authors participated in the writing.


ACKNOWLEDGMENTS

This work was supported by National Institutes of Health (NIH) grants R01 GM097334 to G.S. and R01 GM076485 to D.H.M.

REFERENCES

1. Cech, T. R., and J. A. Steitz. 2014. The noncoding RNA revolution-trashing old rules to forge new ones. Cell. 157:77–94.

2. Wu, L., and J. G. Belasco. 2008. Let me count the ways: mechanisms of gene regulation by miRNAs and siRNAs. Mol. Cell. 29:1–7.

3. Doudna, J. A., and T. R. Cech. 2002. The chemical repertoire of natural ribozymes. Nature. 418:222–228.

4. Gesteland, R. F., T. Cech, and J. F. Atkins. 2006. The RNA World: The Nature of Modern RNA Suggests a Prebiotic RNA World. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.

5. Seetin, M. G., and D. H. Mathews. 2012. RNA structure prediction: an overview of methods. Methods Mol. Biol. 905:99–122.

6. Hofacker, I. L. 2014. Energy-directed RNA structure prediction. Methods Mol. Biol. 1097:71–84.

7. Deigan, K. E., T. W. Li, ..., K. M. Weeks. 2009. Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci. USA. 106:97–102.

8. Quarrier, S., J. S. Martin, ..., A. Laederach. 2010. Evaluation of the information content of RNA structure mapping data for secondary structure prediction. RNA. 16:1108–1117.

9. Washietl, S., I. L. Hofacker, ..., M. Kellis. 2012. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Res. 40:4261–4272.

10. Sloma, M. F., and D. H. Mathews. 2015. Improving RNA secondary structure prediction with structure mapping data. Methods Enzymol. 553:91–114.

11. Mathews, D. H., M. D. Disney, ..., D. H. Turner. 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. USA. 101:7287–7292.

12. Hajdin, C. E., S. Bellaousov, ..., K. M. Weeks. 2013. Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots. Proc. Natl. Acad. Sci. USA. 110:5498–5503.

13. Mathews, D. H., J. Sabina, ..., D. H. Turner. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288:911–940.

14. Eddy, S. R. 2014. Computational analysis of conserved RNA secondary structure in transcriptomes and genomes. Annu. Rev. Biophys. 43:433–456.

15. Zarringhalam, K., M. M. Meyer, ..., P. Clote. 2012. Integrating chemical footprinting data into RNA secondary structure prediction. PLoS One. 7:e45160.

16. Ouyang, Z., M. P. Snyder, and H. Y. Chang. 2013. SeqFold: genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data. Genome Res. 23:377–387.

17. Deng, F., M. Ledda, ..., S. Aviran. 2016. Data-directed RNA secondary structure prediction using probabilistic modeling. RNA. 22:1109–1119.

18. McGinnis, J. L., J. A. Dunkle, ..., K. M. Weeks. 2012. The mechanisms of RNA SHAPE chemistry. J. Am. Chem. Soc. 134:6617–6624.

19. Merino, E. J., K. A. Wilkinson, ..., K. M. Weeks. 2005. RNA structure analysis at single nucleotide resolution by selective 2′-hydroxyl acylation and primer extension (SHAPE). J. Am. Chem. Soc. 127:4223–4231.

20. Sükösd, Z., M. S. Swenson, ..., C. E. Heitsch. 2013. Evaluating the accuracy of SHAPE-directed RNA secondary structure predictions. Nucleic Acids Res. 41:2807–2816.

21. Kertesz, M., Y. Wan, ..., E. Segal. 2010. Genome-wide measurement of RNA secondary structure in yeast. Nature. 467:103–107.




22. Underwood, J. G., A. V. Uzilov, ..., D. Haussler. 2010. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat. Methods. 7:995–1001.

23. Talkish, J., G. May, ..., C. J. McManus. 2014. Mod-seq: high-throughput sequencing for chemical probing of RNA structure. RNA. 20:713–720.

24. Ding, Y., Y. Tang, ..., S. M. Assmann. 2014. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature. 505:696–700.

25. Spitale, R. C., P. Crisalli, ..., H. Y. Chang. 2013. RNA SHAPE analysis in living cells. Nat. Chem. Biol. 9:18–20.

26. Rouskin, S., M. Zubradt, ..., J. S. Weissman. 2014. Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature. 505:701–705.

27. Cordero, P., and R. Das. 2015. Rich RNA structure landscapes revealed by mutate-and-map analysis. PLOS Comput. Biol. 11:e1004473.

28. Puton, T., L. P. Kozlowski, ..., J. M. Bujnicki. 2014. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res. 42:5403–5406.

29. Havgaard, J. H., and J. Gorodkin. 2014. RNA structural alignments, part I: Sankoff-based approaches for structural alignments. Methods Mol. Biol. 1097:275–290.

30. Asai, K., and M. Hamada. 2014. RNA structural alignments, part II: non-Sankoff approaches for structural alignments. Methods Mol. Biol. 1097:291–301.

31. Harmanci, A. O., G. Sharma, and D. H. Mathews. 2011. TurboFold: iterative probabilistic estimation of secondary structures for multiple RNA sequences. BMC Bioinformatics. 12:108.

32. Harmanci, A. O., G. Sharma, and D. H. Mathews. 2007. Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign. BMC Bioinformatics. 8:130.

33. Hofacker, I. L., S. H. Bernhart, and P. F. Stadler. 2004. Alignment of RNA base pairing probability matrices. Bioinformatics. 20:2222–2227.

34. Knudsen, B., and J. Hein. 2003. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 31:3423–3428.

35. Lu, Z. J., J. W. Gloor, and D. H. Mathews. 2009. Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA. 15:1805–1813.

36. Do, C. B., M. S. Mahabhashyam, ..., S. Batzoglou. 2005. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15:330–340.

37. Lavender, C. A., R. Lorenz, ..., K. M. Weeks. 2015. Model-Free RNA sequence and structure alignment informed by SHAPE probing reveals a conserved alternate secondary structure for 16S rRNA. PLOS Comput. Biol. 11:e1004126.

38. Bernhart, S. H., I. L. Hofacker, ..., P. F. Stadler. 2008. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 9:474.

39. Ding, Y., and C. E. Lawrence. 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 31:7280–7301.

40. Mathews, D. H. 2004. Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA. 10:1178–1190.

41. Turner, D. H., and D. H. Mathews. 2010. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38:D280–D282.

42. Durbin, R., S. R. Eddy, ..., G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, United Kingdom.

43. Do, C. B., D. A. Woods, and S. Batzoglou. 2006. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 22:e90–e98.

44. Bellaousov, S., and D. H. Mathews. 2010. ProbKnot: fast prediction of RNA secondary structure including pseudoknots. RNA. 16:1870–1880.

45. Nawrocki, E. P., S. W. Burge, ..., R. D. Finn. 2015. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 43:D130–D137.

46. Fu, Y., G. Sharma, and D. H. Mathews. 2014. Dynalign II: common secondary structure prediction for RNA homologs with domain insertions. Nucleic Acids Res. 42:13939–13948.

47. Gutell, R. R., J. C. Lee, and J. J. Cannone. 2002. The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12:301–310.

48. Woodson, S. A., and D. M. Crothers. 1987. Proton nuclear magnetic resonance studies on bulge-containing DNA oligonucleotides from a mutational hot-spot sequence. Biochemistry. 26:904–912.

49. Znosko, B. M., S. B. Silvestri, ..., M. J. Serra. 2002. Thermodynamic parameters for an expanded nearest-neighbor model for the formation of RNA duplexes with single nucleotide bulges. Biochemistry. 41:10406–10417.

50. R Development Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

51. Xu, Z., A. Almudevar, and D. H. Mathews. 2012. Statistical evaluation of improvement in RNA secondary structure prediction. Nucleic Acids Res. 40:e26.

52. Lorenz, R., S. H. Bernhart, ..., I. L. Hofacker. 2011. ViennaRNA package 2.0. Algorithms Mol. Biol. 6:26.

53. Larkin, M. A., G. Blackshields, ..., D. G. Higgins. 2007. Clustal W and Clustal X version 2.0. Bioinformatics. 23:2947–2948.

54. Mauger, D. M., N. A. Siegfried, and K. M. Weeks. 2013. The genetic code as expressed through relationships between mRNA structure and protein function. FEBS Lett. 587:1180–1188.

55. Reuter, J. S., and D. H. Mathews. 2010. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 11:129.



Biophysical Journal, Volume 113

Supplemental Information

Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data

Zhen Tan, Gaurav Sharma, and David H. Mathews

Supplementary information for

“Modeling RNA secondary structure with sequence comparison and experimental mapping data”

Z. Tan, G. Sharma, and D. H. Mathews

Details are provided for the datasets used in benchmarking (Section 1), structure modeling accuracy (Section 2), parameter optimization methods (Section 2), sequences used in parameter optimization, software efficiency test (Section 3), and benchmarking (Section 4).

Section 1. Dataset information:

Family              H (sequences)   Average sequence length   Std. dev.   Average MEA sensitivity   Std. dev.   Average MEA PPV   Std. dev.
tRNA                5               75.7                      3.5         0.76                      0.23        0.75              0.24
                    10              76.2                      4.7         0.77                      0.23        0.74              0.25
                    20              76.3                      4.8         0.77                      0.21        0.74              0.23
cGMP riboswitch     5               89.0                      8.3         0.86                      0.19        0.33              0.12
                    10              87.9                      6.9         0.81                      0.26        0.31              0.13
                    20              87.5                      6.5         0.81                      0.25        0.31              0.13
TPP riboswitch
SAM I riboswitch
5S rRNA
M-box riboswitch
Lysine riboswitch
HCV                 5               267.4                     66.1        0.50                      0.16        0.46              0.16
                    10              250.7                     62.9        0.47                      0.17        0.43              0.17
                    20              251.0                     60.5        0.48                      0.18        0.43              0.17
Group I intron
23S rRNA

Table S1. Summary statistics on the sets of sequences selected for testing. Mean and standard deviation of sequence length, sensitivity and PPV of MEA structure prediction are shown for sequences from each RNA family in the test sets of homologs used.

Family              Total number of distinct sequences   Total number of sequences in database
tRNA                627                                  9245
cGMP riboswitch     150                                  155
TPP riboswitch      97                                   109
SAM I riboswitch    272                                  433
5S rRNA             429                                  710
M-box riboswitch    138                                  157
Lysine riboswitch   45                                   47
HCV                 74                                   79
Group I intron      437                                  816
23S rRNA            35                                   35

Table S2. Number of distinct sequences on the sets of sequences selected for testing. Number of distinct sequences from each RNA family in test sets and the total number of sequences available in database are shown.

Family              Sequence with SHAPE reactivity data
tRNA                E. coli
cGMP riboswitch     V. cholerae
TPP riboswitch      E. coli
SAM I riboswitch    T. tengcongensis
5S rRNA             E. coli
M-box riboswitch    B. subtilis
Lysine riboswitch   T. maritima
HCV                 Hepatitis C virus IRES domain
Group I intron      T. thermophila
23S rRNA            E. coli

Table S3. List of sequences with SHAPE reactivity data for each family.

Section 2. Structure prediction accuracy:

[Figure S1: bar charts of average F1 score (axis 0 to 1) for runs with 5, 10, and 20 input sequences. Panel titles: (A) tRNA, E. coli; (B) 5S rRNA, E. coli; (C) hepatitis C virus (HCV) IRES domain; (D) group I intron, T. thermophila; (E) lysine riboswitch, T. maritima; (F) M-box riboswitch, B. subtilis; (G) SAM I riboswitch, T. tengcongensis; (H) TPP riboswitch, E. coli; (I) cyclic-di-GMP riboswitch, V. cholerae; (J) 23S rRNA, E. coli.]

Figure S1. Average F1 score of structure predictions of sequences that did not have SHAPE mapping data. F1 scores of structure predictions obtained by running the methods with 5, 10, or 20 input sequences on (A) tRNA, (B) 5S rRNA, (C) hepatitis C virus IRES domain, (D) group I intron, (E) lysine riboswitch, (F) M-box riboswitch, (G) SAM I riboswitch, (H) TPP riboswitch, (I) cyclic-di-GMP riboswitch, and (J) 23S rRNA test datasets. The star (*) above the bar for a method indicates that the difference in F1 score between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S2. Average sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on tRNA test datasets. Sensitivity and PPV of structure predictions obtained by running the methods with H = 5, 10, or 20 input sequences on tRNA test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S3. Average sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on 5S rRNA test datasets. Sensitivity and PPV of structure predictions obtained by running the methods with H = 5, 10, or 20 input sequences on 5S rRNA test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S4. Average sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on hepatitis C virus (HCV) IRES domain test datasets. Sensitivity and PPV of structure predictions obtained by running the methods with H = 5, 10, or 20 input sequences on hepatitis C virus (HCV) IRES domain test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S5. Average sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on group I intron test datasets. Sensitivity and PPV of structure predictions obtained by running the methods with H = 5, 10, or 20 input sequences on group I intron test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S6. Average sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on lysine riboswitch test datasets. Sensitivity and PPV of structure predictions obtained by running the methods with H = 5, 10, or 20 input sequences on lysine riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

Figure S7. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on M-box riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with 5, 10, or 20 input sequences on M-box riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

[Figure S7 plot: bar charts with Sensitivity and PPV panels, grouped by 5, 10, and 20 input sequences (5seq, 10seq, 20seq); y-axis from 0.4 to 1.]

Figure S8. Average sensitivity and PPV of structure predictions for sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on SAM I riboswitch test datasets. The sensitivity and PPV values were obtained by running the methods with H = 5, 10, or 20 input sequences on SAM I riboswitch test datasets. A star (*) above the bar for a method indicates that the difference in sensitivity or PPV between that method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

[Figure S8 plot: bar charts with Sensitivity and PPV panels; y-axis from 0 to 1.]

Figure S9. Average sensitivity and PPV of structure predictions for sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on TPP riboswitch test datasets. The sensitivity and PPV values were obtained by running the methods with H = 5, 10, or 20 input sequences on TPP riboswitch test datasets. A star (*) above the bar for a method indicates that the difference in sensitivity or PPV between that method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

[Figure S9 plot: bar charts with Sensitivity and PPV panels, grouped by 5, 10, and 20 input sequences (5seq, 10seq, 20seq); y-axis from 0.2 to 1.]

Figure S10. Average sensitivity and PPV of structure predictions for sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on cyclic-di-GMP riboswitch test datasets. The sensitivity and PPV values were obtained by running the methods with H = 5, 10, or 20 input sequences on cyclic-di-GMP riboswitch test datasets. A star (*) above the bar for a method indicates that the difference in sensitivity or PPV between that method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

[Figure S10 plot: bar charts with Sensitivity and PPV panels; y-axis from 0 to 1.]

Figure S11. Average sensitivity and PPV of structure predictions for sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on 23S rRNA test datasets. The sensitivity and PPV values were obtained by running the methods with H = 5, 10, or 20 input sequences on 23S rRNA test datasets. A star (*) above the bar for a method indicates that the difference in sensitivity or PPV between that method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

[Figure S11 plot: bar charts with Sensitivity and PPV panels, grouped by 5, 10, and 20 input sequences (5seq, 10seq, 20seq); y-axis from 0 to 1.]

Figure S12. Scatter plots of F1 scores of structure predictions obtained with TurboFold II and TurboFold II+SHAPE for sequences (20 sequence groups) that do not have SHAPE mapping data. The F1 scores were obtained by running the methods with H = 20 input sequences on tRNA, 5S rRNA, hepatitis C virus IRES domain, and group I intron test datasets. Each point represents the F1 scores from TurboFold II and TurboFold II+SHAPE for one sequence.

[Figure S12 plot, panels (A) tRNA, (B) 5S rRNA, (C) hepatitis C virus (HCV) IRES domain, (D) group I intron. x-axis: F1 score (TurboFold II); y-axis: F1 score (TurboFold II + SHAPE); both axes from 0 to 1.]
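For reference, the F1 score plotted in these panels can be computed from sensitivity and PPV as their harmonic mean. The Python sketch below assumes structures are given as sets of (i, j) base pairs and counts only exact matches as correct. The function name `score_prediction` and the example pairs are hypothetical, and the scoring used for the reported results may allow additional flexibility in pair positions.

```python
def score_prediction(predicted_pairs, known_pairs):
    """Return (sensitivity, PPV, F1) for one predicted structure.

    Both arguments are iterables of (i, j) base-pair index tuples.
    Assumption: a predicted pair counts as correct only on an exact match.
    """
    # Normalize so that (i, j) and (j, i) are treated as the same pair.
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    known = {tuple(sorted(p)) for p in known_pairs}
    true_positives = len(predicted & known)

    # Sensitivity: fraction of known pairs that were predicted.
    sensitivity = true_positives / len(known) if known else 0.0
    # PPV: fraction of predicted pairs that are in the known structure.
    ppv = true_positives / len(predicted) if predicted else 0.0
    # F1: harmonic mean of sensitivity and PPV.
    total = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / total if total > 0 else 0.0
    return sensitivity, ppv, f1


# Hypothetical structures (illustration only).
known = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (5, 16)}
sens, ppv, f1 = score_prediction(predicted, known)
print(f"sensitivity = {sens:.3f}, PPV = {ppv:.3f}, F1 = {f1:.3f}")
```

Each point in a scatter plot of this kind would then pair the F1 value of one method with the F1 value of the other for the same sequence.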

Figure S12. Scatter plots of F1 scores of structure predictions obtained with TurboFold II and TurboFold II+SHAPE for sequences (20 sequence groups) that do not have SHAPE mapping data. The F1 scores were obtained by running the methods with H = 20 input sequences on lysine riboswitch, M-box riboswitch, SAM I riboswitch, and cyclic-di-GMP riboswitch test datasets. Each dot represents the F1 scores from TurboFold II and TurboFold II+SHAPE for one sequence.

[Figure S12 plot, panels (E) lysine riboswitch, (F) M-box riboswitch, (G) SAM I riboswitch, (H) cyclic-di-GMP riboswitch. x-axis: F1 score (TurboFold II); y-axis: F1 score (TurboFold II + SHAPE); both axes from 0 to 1.]

Figure S12. Scatter plots of F1 scores of structure predictions obtained with TurboFold II and TurboFold II+SHAPE for sequences (20 sequence groups) that do not have SHAPE mapping data. The F1 scores were obtained by running the methods with H = 5 input sequences (left) and H = 20 input sequences (right) on (A) tRNA, (B) 5S rRNA, (C) hepatitis C virus IRES domain, (D) group I intron, (E) lysine riboswitch, (F) M-box riboswitch, (G) SAM I riboswitch, (H) cyclic-di-GMP riboswitch, (I) 23S rRNA (5 sequences), and (J) 23S rRNA (20 sequences) test datasets. Each dot represents the F1 scores from TurboFold II and TurboFold II+SHAPE for one sequence.

[Figure S12 plot, panels (I) 23S rRNA (5 seq), (J) 23S rRNA (20 seq). y-axis: F1 score (TurboFold II + SHAPE); axes from 0 to 1.]

Table S4. Average structure prediction sensitivity and PPV on sequences without SHAPE data for each method on each dataset:

Prediction Method       H = 5 sequences      H = 10 sequences     H = 20 sequences
                        sensitivity  PPV     sensitivity  PPV     sensitivity  PPV

5S rRNA
TurboFold II + SHAPE    0.880        0.927   0.871        0.913   0.873        0.903
TurboFold II            0.861        0.888   0.864        0.883   0.869        0.873
RNAalifold + SHAPE      0.914        0.900   0.823        0.921   0.782        0.932
RNAalifold              0.912        0.914   0.815        0.928   0.776        0.932
MaxExpect               0.636        0.619   0.564        0.551   0.569        0.544

Group I intron
TurboFold II            0.735        0.769   0.742        0.774   0.750        0.775
RNAalifold + SHAPE      0.163        0.375   0.092        0.554   0.052        0.537
RNAalifold              0.160        0.398   0.095        0.547   0.054        0.558
MaxExpect               0.608        0.584   0.604        0.585   0.612        0.594

HCV
TurboFold II + SHAPE    0.705        0.676   0.710        0.686   0.717        0.685
TurboFold II            0.581        0.547   0.586        0.555   0.592        0.557
RNAalifold + SHAPE      0.510        0.510   0.493        0.579   0.549        0.737
RNAalifold              0.496        0.540   0.481        0.570   0.534        0.715
MaxExpect               0.504        0.456   0.469        0.426   0.480        0.431

tRNA
TurboFold II            0.916        0.944   0.930        0.939   0.922        0.933
RNAalifold              0.837        0.856   0.833        0.910   0.833        0.920
MaxExpect               0.763        0.752   0.768        0.742   0.771        0.742

TPP riboswitch
TurboFold II            0.752        0.775   0.820        0.833   0.816        0.801
RNAalifold              0.379        0.917   0.332        0.953   0.294        0.980
MaxExpect               0.535        0.428   0.547        0.436   0.552        0.431

SAM I riboswitch
RNAalifold              0.671        0.824   0.604        0.886   0.510        0.937
MaxExpect               0.826        0.680   0.822        0.667   0.840        0.681

M-box riboswitch
TurboFold II            0.730        0.729   0.743        0.720   0.744        0.729
RNAalifold              0.660        0.721   0.556        0.767   0.565        0.814
MaxExpect               0.636        0.608   0.658        0.615   0.663        0.626

Lysine riboswitch
RNAalifold + SHAPE      0.494        0.733   0.394        0.794   0.274        0.762
RNAalifold              0.670        0.796   0.440        0.799   0.294        0.779
MaxExpect               0.760        0.709   0.651        0.604   0.677        0.627

Cyclic-di-GMP riboswitch
TurboFold II            0.884        0.876   0.882        0.871   0.889        0.874
RNAalifold              0.665        0.881   0.623        0.904   0.498        0.974
MaxExpect               0.865        0.329   0.809        0.306   0.810        0.313

23S rRNA
TurboFold II            0.788        0.834   0.817        0.858   0.826        0.865
RNAalifold              0.764        0.828   0.746        0.885   0.718        0.902
MaxExpect               0.520        0.533   0.511        0.521   0.515        0.507

Table S5. Average structure prediction sensitivity and PPV on sequences with SHAPE data for each method on each dataset:

Prediction Method       H = 5 sequences      H = 10 sequences     H = 20 sequences
                        sensitivity  PPV     sensitivity  PPV     sensitivity  PPV

5S rRNA
TurboFold II            0.850        0.859   0.901        0.913   0.909        0.914
RNAalifold              0.876        0.914   0.797        0.955   0.761        0.967
Rsample                 0.857        0.833   0.857        0.833   0.857        0.833
MaxExpect               0.286        0.263   0.286        0.263   0.286        0.263

Group I intron
TurboFold II            0.884        0.837   0.903        0.853   0.907        0.858
MaxExpect               0.849        0.766   0.849        0.766   0.849        0.766

HCV
TurboFold II            0.473        0.527   0.474        0.519   0.469        0.513
MaxExpect               0.548        0.612   0.548        0.612   0.548        0.612

tRNA
TurboFold II            0.990        1.000   1.000        1.000   1.000        1.000


RN
