45
Article Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data Zhen Tan, 1,2 Gaurav Sharma, 2,3,4, * and David H. Mathews 1,2,4, * 1 Department of Biochemistry and Biophysics, 2 Center for RNA Biology, 3 Department of Electrical and Computer Engineering, and 4 Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York ABSTRACT Secondary structure prediction is an important problem in RNA bioinformatics because knowledge of structure is critical to understanding the functions of RNA sequences. Significant improvements in prediction accuracy have recently been demonstrated though the incorporation of experimentally obtained structural information, for instance using selective 2 0 -hydroxyl acylation analyzed by primer extension (SHAPE) mapping. However, such mapping data is currently available only for a limited number of RNA sequences. In this article, we present a method for extending the benefit of experimental mapping data in sec- ondary structure prediction to homologous sequences. Specifically, we propose a method for integrating experimental mapping data into a comparative sequence analysis algorithm for secondary structure prediction of multiple homologs, whereby the map- ping data benefits not only the prediction for the specific sequence that was mapped but also other homologs. The proposed method is realized by modifying the TurboFold II algorithm for prediction of RNA secondary structures to utilize basepairing prob- abilities guided by SHAPE experimental data when such data are available. The SHAPE-mapping-guided basepairing probabil- ities are obtained using the RSample method. Results demonstrate that the SHAPE mapping data for a sequence improves structure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone (TurboFold II). The updated version of TurboFold II is freely available as part of the RNAstructure software package. INTRODUCTION RNA functions in diverse cellular activities; it is a carrier of genetic information in transcription (1), a regulator of gene expression (2), and a catalyst (3). These cellular functions depend on the structure of RNA (4). Therefore, accurate predictions for the secondary structure, i.e., canonical base- pairings between nucleotides, are critical for understanding and proposing hypotheses related to RNA functions. A commonly used approach is to predict secondary structures based on folding thermodynamics (5,6). To achieve greater prediction accuracy, several thermo- dynamics-based methods incorporate experimental data derived from chemical probing to guide RNA secondary structure prediction (7–17). One mapping method, selec- tive 2 0 -hydroxyl acylation analyzed by primer extension (SHAPE), provides quantitative reactivity at each nucleotide to the SHAPE reagent, which measures the nucleotide flexibility (18,19). Because basepaired nucleotides are structurally restricted, high SHAPE reactivity is generally associated with not being canonically basepaired (20). SHAPE data can be collected with high-throughput sequencing (21–23) and can also be obtained in vivo (24–26). RSample (Spasic, S.M. Assmann, P.C. Bevilacqua, D.H.M., unpublished data) models RNA secondary struc- ture using SHAPE data. It focuses on matching structure models to the mapping data rather than directly integrating data into the model. In this way, it can model folding ensem- bles of multiple structures. A nucleotide-level comparison between experimental mapping data and modeled mapping data is used to guide a single refinement of a stochastic sample. The sample is then clustered to predict sets of struc- ture models. The single structure prediction accuracy of RSample is similar to leading methods (>80% of predicted pairs in the accepted structure) (12), and RSample is able to estimate the population of multiple structures in the folding ensemble (27). Another approach to improving secondary structure pre- diction accuracy is to use multiple homologous sequences to identify conserved basepairs (5,28–30). One method, TurboFold II (31; Z.T., Y. Fu, G. Sharma, D.H.M., unpub- lished data), iteratively refines basepairing probabilities for each sequence in a set of homologs by comparing the predicted basepairing probabilities across the set of Submitted March 1, 2017, and accepted for publication June 19, 2017. *Correspondence: [email protected] or gaurav. [email protected] Editor: Tamar Schlick. Biophysical Journal 113, 1–9, July 25, 2017 1 http://dx.doi.org/10.1016/j.bpj.2017.06.039 Ó 2017 Biophysical Society. Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, Biophysical Journal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

Modeling RNA Secondary Structure with Sequence ...gsharma/papers/TanModelRN...maximum expected accuracy algorithm (34–36) and a mul-tiple sequence alignment is estimated using a

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

  • Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    Article

    Modeling RNA Secondary Structure with SequenceComparison and Experimental Mapping Data

    Zhen Tan,1,2 Gaurav Sharma,2,3,4,* and David H. Mathews1,2,4,*1Department of Biochemistry and Biophysics, 2Center for RNA Biology, 3Department of Electrical and Computer Engineering, and4Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York

    ABSTRACT Secondary structure prediction is an important problem in RNA bioinformatics because knowledge of structure iscritical to understanding the functions of RNA sequences. Significant improvements in prediction accuracy have recently beendemonstrated though the incorporation of experimentally obtained structural information, for instance using selective 20-hydroxylacylation analyzed by primer extension (SHAPE) mapping. However, such mapping data is currently available only for a limitednumber of RNA sequences. In this article, we present a method for extending the benefit of experimental mapping data in sec-ondary structure prediction to homologous sequences. Specifically, we propose a method for integrating experimental mappingdata into a comparative sequence analysis algorithm for secondary structure prediction of multiple homologs, whereby the map-ping data benefits not only the prediction for the specific sequence that was mapped but also other homologs. The proposedmethod is realized by modifying the TurboFold II algorithm for prediction of RNA secondary structures to utilize basepairing prob-abilities guided by SHAPE experimental data when such data are available. The SHAPE-mapping-guided basepairing probabil-ities are obtained using the RSample method. Results demonstrate that the SHAPE mapping data for a sequence improvesstructure prediction accuracy of other homologous sequences beyond the accuracy obtained by sequence comparison alone(TurboFold II). The updated version of TurboFold II is freely available as part of the RNAstructure software package.

    INTRODUCTION

    RNA functions in diverse cellular activities; it is a carrier ofgenetic information in transcription (1), a regulator of geneexpression (2), and a catalyst (3). These cellular functionsdepend on the structure of RNA (4). Therefore, accuratepredictions for the secondary structure, i.e., canonical base-pairings between nucleotides, are critical for understandingand proposing hypotheses related to RNA functions. Acommonly used approach is to predict secondary structuresbased on folding thermodynamics (5,6).

    To achieve greater prediction accuracy, several thermo-dynamics-based methods incorporate experimental dataderived from chemical probing to guide RNA secondarystructure prediction (7–17). One mapping method, selec-tive 20-hydroxyl acylation analyzed by primer extension(SHAPE), provides quantitative reactivity at each nucleotideto the SHAPE reagent, which measures the nucleotideflexibility (18,19). Because basepaired nucleotides arestructurally restricted, high SHAPE reactivity is generally

    Submitted March 1, 2017, and accepted for publication June 19, 2017.

    *Correspondence: [email protected] or gaurav.

    [email protected]

    Editor: Tamar Schlick.

    http://dx.doi.org/10.1016/j.bpj.2017.06.039

    � 2017 Biophysical Society.

    associated with not being canonically basepaired (20).SHAPE data can be collected with high-throughputsequencing (21–23) and can also be obtained invivo (24–26).

    RSample (Spasic, S.M. Assmann, P.C. Bevilacqua,D.H.M., unpublished data) models RNA secondary struc-ture using SHAPE data. It focuses on matching structuremodels to the mapping data rather than directly integratingdata into the model. In this way, it can model folding ensem-bles of multiple structures. A nucleotide-level comparisonbetween experimental mapping data and modeled mappingdata is used to guide a single refinement of a stochasticsample. The sample is then clustered to predict sets of struc-ture models. The single structure prediction accuracy ofRSample is similar to leading methods (>80% of predictedpairs in the accepted structure) (12), and RSample is able toestimate the population of multiple structures in the foldingensemble (27).

    Another approach to improving secondary structure pre-diction accuracy is to use multiple homologous sequencesto identify conserved basepairs (5,28–30). One method,TurboFold II (31; Z.T., Y. Fu, G. Sharma, D.H.M., unpub-lished data), iteratively refines basepairing probabilitiesfor each sequence in a set of homologs by comparingthe predicted basepairing probabilities across the set of

    Biophysical Journal 113, 1–9, July 25, 2017 1

    mailto:[email protected]:[email protected]:[email protected]

  • Tan et al.

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    homologs. Additionally, nucleotide alignment probabilitiesin pairwise alignments, estimated using a hidden Markovmodel (HMM) (32), are iteratively improved using infor-mation from estimated secondary structures (33). Afterthe iterative updates, structures are predicted using themaximum expected accuracy algorithm (34–36) and a mul-tiple sequence alignment is estimated using a probabilisticconsistency transformation (36) and progressive alignment.

    An open problem in the field is the integration of bothstructure mapping data and comparative data to improvesecondary structure prediction accuracy. Prior work focusedon the case where SHAPE data is available for all homolo-gous sequences (37). For this situation, a multiple sequencealignment was first created by also including SHAPE data inpairwise global alignment. Then the RNAalifold method(38) was used to predict a consensus structure that isconserved given the fixed input alignment, using pseudofree energies to incorporate the SHAPE information (7).This article addresses the problem of predicting conservedsecondary structures when SHAPE mapping is only avail-able for one homolog. This use case is expected to beincreasingly common as SHAPE is performed in vivo acrosstranscriptomes. The method reported in this article is theintegration of RSample into TurboFold II. In the resultingmethod, SHAPE-guided structure prediction and predictionof conserved structures act synergistically to improve sec-ondary structure prediction accuracy, even for sequencesfor which SHAPE mapping was not performed. Resultsdemonstrate that the SHAPE mapping data for a sequenceimproves structure prediction accuracy of other homologoussequences beyond the accuracy obtained by sequence com-parison alone (TurboFold II).

    METHODS

    Fig. 1 illustrates the proposed new version of TurboFold II that uses avail-

    able SHAPE mapping data for one or more of the RNA sequence homo-

    logs to improve structure prediction for the sequences without SHAPE

    data. The input to TurboFold II is a set of homologous sequences and

    the outputs are the predicted secondary structures for each sequence and

    a multiple sequence alignment (31). To incorporate experimental

    mapping data into the predictions, the proposed approach makes use of

    RSample. Specifically, as shown in Fig. 1, within the TurboFold II itera-

    tions, RSample is used to refine estimated basepairing probabilities for se-

    quences with SHAPE data and these estimated basepairing probabilities

    are incorporated in the iterations. As shown via the dashed lines in

    Fig. 1, in subsequent TurboFold II iterations, the incorporated SHAPE

    information propagates to other homologous sequences and thereby

    improves the prediction of structure for these sequences, in addition to

    improving structure prediction for the sequence with which the SHAPE

    data is affiliated. The major individual steps in the proposed approach

    are outlined next.

    SHAPE-guided computation of basepairingprobabilities using RSample

    RSample first generates a stochastic sample (39) using a secondary struc-

    ture partition function calculation (40). Then SHAPE reactivities are esti-

    2 Biophysical Journal 113, 1–9, July 25, 2017

    mated for each nucleotide in each structure based on the status of the

    nucleotide: unpaired, paired at the last position of a helix, or paired in

    the interior of a helix. SHAPE reactivities are drawn from distributions

    composed of a database of 16 known secondary structures with experimen-

    tally measured SHAPE reactivities (12). The estimated SHAPE reactivity

    for a nucleotide is then the mean reactivity across all structures. The sto-

    chastic sampling is then repeated, where the partition function is reesti-

    mated so that the estimated SHAPE reactivities better match the

    experimental SHAPE mapping data. The free energy change term intro-

    duced to the partition function is

    DGbonus;i ¼ 0:5 � ln�

    Rexpi þ 1:1Rcalci þ 1:1

    �; (1)

    where Rexpi and Rcalci are experimentally measured reactivities and esti-

    mated reactivities of nucleotide i. This functional form was chosen so

    that the free energy of basepair stacking is only altered for nucleotides

    for which the originally estimated SHAPE reactivity does not match the

    experiment. The constants 0.5 and 1.1 in the equation were obtained

    (data not shown) via a grid search as the parameters that maximized struc-

    ture prediction accuracy. The free energy bonus DGbonus, i is then applied

    for each basepair stack involving nucleotide i. This approach focuses on

    matching the experimentally measured SHAPE reactivity.

    Incorporation of RSample into TurboFold II

    TurboFold II is a method to predict secondary structures for multiple RNA

    homologs and multiple sequence alignments. TurboFold II iteratively esti-

    mates basepairing probabilities for each sequence using intrinsic informa-

    tion and extrinsic information for sequence folding. Intrinsic information

    is derived from the thermodynamic model, which used the latest set of near-

    est-neighbor thermodynamic parameters (11,41). Extrinsic information is a

    proclivity for basepairing inferred from the basepairing probabilities of

    other homologous sequences, mapped to the sequence of interest by the

    posterior probabilities of nucleotide coincidence of the other homologs to

    the sequence (32). The posterior coincidence probabilities can be obtained

    with a HMM for pairwise alignments (42). The estimated basepairing prob-

    abilities can be used to predict secondary structure using the maximum ex-

    pected accuracy (MEA) algorithm (34,35,43) or the ProbKnot method (44).

    RSample is integrated into TurboFold II to estimate basepairing probabil-

    ities for homologous sequences with available SHAPE mapping data on

    one of the homologs. The integrated algorithm uses nine steps illustrated

    in Fig. 1.

    We adapt the description focusing particularly on the new elements intro-

    duced in this article.

    Step 1 computes pairwise posterior coincidence probabilities using an

    HMM. Pairwise posterior coincidence probabilities are estimated for all

    pairs of sequences with an HMM as implemented by Harmanci et al.

    (32). Using the forward-backward algorithm, matrices of posterior coinci-

    dence probabilities for two nucleotides (one from each sequence) are

    computed. Details can be found in Harmanci et al. (32).

    Step 2 computes basepairing probabilities of all sequences using the

    partition function method in RNAstructure (40).

    Steps 3–5 are only performed for sequences for which there is SHAPE

    mapping data.

    Step 3 generates an ensemble of Ns ¼ 10,000 structures by stochasticsampling for sequences with input SHAPE reactivity.

    Step 4 estimates the SHAPE reactivity for each nucleotide based on the

    sample. The SHAPE reactivities are assigned to each nucleotide at each

    structure in the sample according to the distributions for three different

    local structures: unpaired, paired at a helix end, or paired in the interior

    of a helix. The SHAPE reactivity for each nucleotide is the arithmetic

    mean across structures in the sample. Because the size of ensemble is large,

    the variance between samples is relatively low.

  • Input: H homologoussequences

    HMMalignment

    Match scorecomputation

    Extrinsicinformationcomputation

    Probability consistencytransformation;

    Guide tree computation;Progressive alignment

    Multiple sequencealignment

    MEASecondary structure

    prediction

    (2)

    (9)

    (1)

    (8)

    (10)

    H(H-1)/2 Pairwise posterior

    co-incidence probabilities

    Yes

    Partitionfunction

    Stochastic sampling to generate N

    structures

    Assign SHAPE reactivity based on

    each structure

    Estimating SHAPE reactivity by averaging

    No

    Partitionfunction

    Partition function calculation with

    restraintsH Base pairing

    probablities

    (3)

    (4) (5)

    (6)

    (7)

    (11)

    1st

    Are SHAPE data available ?

    2ndAve

    H

    H

    H

    H

    H

    H

    S

    N thS

    RSample

    FIGURE 1 Flowchart for TurboFold II with incorporation of SHAPE mapping data for one or more sequences. The input is a set of H homologous RNA

    sequences and the outputs are the predicted secondary structures for each sequence and the predicted multiple sequence alignment. Steps 1–11 are described

    in Materials and Methods. The dashed arrow lines show the flow of SHAPE information and illustrate how, through the iterations, the SHAPE information

    contributes not only to the structure prediction for sequences with SHAPE data but also to the structure prediction for other sequences. Steps 3–5 in the

    dashed box show the processing for the sequences with SHAPE mapping data using RSample.

    Modeling Conserved RNA Structure

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    Step 5 recalculates the partition function using the free energy change

    term (in Eq. 1) to predict basepairing probability for the sequence with

    input SHAPE reactivities. Nucleotides with higher or lower estimated

    SHAPE reactivity than that measured by experiment are restrained with

    a lower or higher propensity to basepair, respectively. Nucleotides with

    consistent estimated and experimental SHAPE reactivity receive no

    restraint.

    Step 6 calculates match scores that encourage alignment between nucle-

    otide positions where both nucleotides are upstream paired, downstream

    paired, or unpaired. The match score was first proposed in PMcomp

    (33), and is utilized in TurboFold II as a prior for recalculating posterior

    coincidence probability in next step via the HMM pair alignment algo-

    rithm. For the mth sequence, based on estimated basepairing probabilities

    between all pairs of nucleotide positions obtained from the partition func-

    tion calculation, for a nucleotide at position i, the estimated probability

    of downstream pairing is Pm< ðiÞ ¼P

    j > iPmij , of upstream pairing is

    P m> ðiÞ ¼P

    j < iPmij , and of being unpaired is P

    m� ðiÞ ¼ 1� Pm< ðiÞ � Pm> ðiÞ.

    The match score between nucleotides i and k in sequences m and n, respec-

    tively, is formulated as

    rði; kÞ ¼� ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

    P m< ðiÞP n< ðkÞq

    þffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP m> ðiÞP n> ðkÞ

    q �þ 0:8

    �� ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

    Pm� ðiÞPn� ðkÞq �

    þ 0:5: (2)

    For sequences without SHAPE mapping data, the basepairing probabilities

    from Step 2 are utilized for the computation of match scores, whereas for

    sequences with SHAPE mapping data, the basepairing probabilities from

    Step 5 are used in the computation of the match scores.

    Step 7 reestimates the posterior coincidence probability. Information

    from prior iterations is utilized to reestimate alignment posterior probabil-

    ities and basepairing probabilities for secondary structures. The iterative

    reestimation of alignment posterior probabilities is introduced (TurboFold

    Biophysical Journal 113, 1–9, July 25, 2017 3

  • Tan et al.

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    II) and uses the standard HMM alignment model (42), but with the match

    score of Eq. 3 incorporated as a prior.

    Step 8 calculates extrinsic information for each sequence by combining

    basepairing probabilities from other input sequences using posterior coinci-

    dence probabilities:

    Pðn/mÞði; jÞ ¼X

    8>>>>>>>>>>>>>>><>>>>>>>>>>>>>>>:

    Pk;l

    1%k < l%Nn

    k˛Cm;ni

    l˛Cm;nj

    Probbpðk; lÞ � Pðm;nÞði � kÞ � Pðm;nÞðj � lÞ � ðH � 1Þ � l ðif sequence n is with SHAPEÞ

    Pk;l

    1%k < l%Nn

    k˛Cm;ni

    l˛Cm;nj

    Probbpðk; lÞ � Pðm;nÞði � kÞ � Pðm;nÞðj � lÞ ��1� jm;n

    � ðotherwiseÞ;

    (3)

    where P(n/m) denotes the extrinsic information for sequence m inferred

    from sequence n. Nn indicates the length of sequence n. The notations

    Cm;ni and Cm;nj denote the sets of indices for which posterior coincidence

    alignment probabilities P(m,n) (i � k) and P(m,n) (j � l), respectively,exceed a predetermined threshold below which values are considered 0

    for computational simplification. Probbp(k,l) denotes the (estimated)

    basepairing probability between nucleotide k and nucleotide l within a

    sequence. The value ‘‘i � k’’ indicates the alignment between indices iand k in two sequences. H is the number of homologous sequences.

    To keep the ratio of extrinsic information from sequence n to every

    other sequence constant, the extrinsic information term for sequence n

    is multiplied by H�1 if sequence n has SHAPE data. This ensures thatmore extrinsic information is used from sequences with SHAPE data

    than from sequences without SHAPE data. l is a parameter, optimized

    based on training. The factor (1 � jm,n) weights the contributionof sequence n to the extrinsic information for sequence m using the

    sequence identity, jm,n, for sequences m and n computed from an HMM

    alignment. This term is only used when sequence n does not have associ-

    ated SHAPE mapping data. Because of the factor (1 � jm,n), sequencesthat are highly similar to sequence m have a lower contribution to extrinsic

    information than those with lower similarities. The extrinsic information is

    calculated from basepairing proclivity for each sequence as inferred from

    every other sequence pairwise. Because the sequence with SHAPE

    reactivities is presumed to have more accurate estimates of basepairing

    probabilities, the basepairing proclivities from the sequence with SHAPE

    reactivities to sequences without SHAPE reactivities are assigned a

    different, adjustable weighting (l). The basepairing proclivities for se-

    quences without SHAPE data and from other sequences to the sequence

    with SHAPE data are computed in an identical fashion to the TurboFold

    II algorithm.

    Step 9 updates the basepairing probability by recomputing the partition

    function for each sequence with the addition of extrinsic information.

    The extrinsic information is incorporated as a pseudo free energy term in

    the partition function calculation for each sequence. A detailed description

    is in Harmanci et al. (31).

    Steps 2–9 form a loop that is iterated through three times, which is shown

    to be optimal in Harmanci et al. (31).

    Steps 10 and 11 perform progressive alignment and predict final sec-

    ondary structures, respectively. In Step 10, the posterior coincidence

    4 Biophysical Journal 113, 1–9, July 25, 2017

    probabilities obtained with the updated match scores via Step 6 are

    used to calculate a multiple sequence alignment. A probabilistic

    consistency transformation, as described in ProbCons (36), is used

    to refine alignment probabilities based on three-way alignment consis-

    tency of pairwise HMM posterior probabilities. Refined alignments are

    further predicted hierarchically based on a guide tree, as described in

    ProbCons (36).

    In Step 11, the structures are predicted by the MEA algorithm. Given the

    basepair probabilities Pm(i,j) for structure sm of sequencem, the MEA struc-

    ture is defined as

    S�m ¼ argmaxSm

    8>>>><>>>>:

    Xði; jÞ˛Sm

    2 ,Pmði; jÞ þXci;

    i unpaired in Sm

    PmðiÞ

    9>>>>=>>>>;;

    (4)

    where Pm(i) is the probability that nucleotide position i is not basepaired,

    computed as

    PmðiÞ ¼ 1�XNmj¼ iþ1

    Pm ði; jÞ �Xi�1j¼ 1

    Pm ðj; iÞ; (5)

    and where Nm is the length of sequence m. The MEA structure is ob-

    tained with a dynamic programming algorithm, as described in Harmanci

    et al. (31).

    Parameter optimization

    To train the parameter l corresponding to the weighting of the extrinsic

    information term in Eq. 3, 20 groups of input sequences formed by 10

    homologous sequences (including the sequence with SHAPE data)

    were randomly chosen from the small subunit ribosomal RNA in the

    RNAStralign database. The range for parameter l was from 0 to 2.0

    (with samples at 0, 0.02, 0.1, 0.2, 0.4, 1.0, 1.6, and 2.0). The resulting

    optimal parameter (l ¼ 1.0) was then used as the default for the method.The geometric mean of sensitivity and PPV was used as the accuracy metric

    for optimizing the parameter l, and the values of this metric over the

    training set are given in the Supporting Material (Fig. S15).

  • Modeling Conserved RNA Structure

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    Benchmarks

    For benchmarking, groups of sequence homologs were selected

    from several families based on the selection criterion that SHAPE data

    were available for a sequence in the family (12). Hepatitis C virus

    (HCV) IRES domain, TPP riboswitch, cyclic-di-GMP riboswitch,

    SAM I riboswitch, M-box riboswitch, and Lysine riboswitch RNA se-

    quences were randomly selected from the Rfam database (45). tRNA,

    5S ribosomal RNA, and group I intron sequences were selected from

    the RNAStralign database (http://rna.urmc.rochester.edu/RNAStralign.

    tar.gz). 23S rRNA sequences were selected from the Comparative

    RNA web site and project (http://www.rna.icmb.utexas.edu/). Specif-

    ically, 20 groups of 4-, 9-, or 19-sequence homologs were selected

    from each of the RNA family. All methods were benchmarked on

    the same groups of sequences. Detailed information of selected

    sequences is in Tables S1 and S2. For comparison, a single sequence

    prediction accuracy was also computed as the average of the accu-

    racies for each homolog in the set of sequences for predictions obtained

    using the MaxExpect (maximum expected accuracy) method from

    RNAstructure 5.7.

    Scoring of prediction accuracy

    The F1 score, which is the harmonic mean of sensitivity and PPV, is used in

    the structure-prediction benchmark. The F1 score is computed as

    F1 ¼ 2 � Sensitivity � PPVSensitivityþ PPV : (6)

    Sensitivity is the fraction of basepairs from the Rfam database that are

    correctly predicted. PPV is the fraction of predicted basepairs that are cor-

    rect, i.e., included in the Rfam database.

    Predicted basepairs are considered correct if a nucleotide on either the

    50- or 30-position of the helix is off by one position compared to the standard(13,46). For instance, a predicted basepair (i, j) is correct if basepair (i, j), or

    (i 5 1, j), or (i, j 5 1) exists in the database. This is important because of

    uncertainty in the determination of secondary structure by comparative

    analysis (47) and also because of thermodynamic fluctuations of local struc-

    tures (48,49).

    Significance testing

    To assess the statistical significance of the differences in F1 score, sensi-

    tivity, and PPV, paired t-tests were performed using R 3.0.2 (50) between

    TurboFold II with SHAPE data and each of the other methods (51). Alpha,

    the type I error rate, was set to 0.05. The figures summarizing the bench-

    marking results are annotated to indicate the results of the significance

    tests.

    Alternative methods

    Although no previous work has been reported on using SHAPE data

    for one homolog in the prediction of structures for other homologs,

    the RNAalifold (38,52) method can be used for this purpose and it is

    therefore used for comparison. For RNAalifold, the SHAPE reactivity

    data is converted to per-nucleotide pseudo free energies that are then

    applied for each basepair stack including a nucleotide. A log-linear fit

    based on Deigan et al. (7) is used to convert reactivities into pseudo

    free energies. The RNAalifold method does not compute an alignment

    and requires an input multiple sequence alignment. Input alignments

    for RNAalifold (2.2.5) were generated using ClustalW (2.1) (38,53).

    Default options and parameters were used for these programs in the

    benchmarking.

    RESULTS

    The new version of TurboFold II, capable of incorporatingSHAPE data, was benchmarked for structure predictionaccuracy using RNA families, where one sequence ineach family has measured SHAPE reactivity (12). Themethod was compared with RNAalifold (38), RSample,and MaxExpect (35). RNAalifold is a method for predictingconsensus structures for multiple homologs. It was previ-ously adapted for using SHAPE data, and was benchmarkedfor cases when all sequences had SHAPE mapping data(37). RSample is run for the single sequences with SHAPEdata available. MaxExpect is the single sequence maximumexpected accuracy method, and maximum expected accu-racy is used to generate the predicted structures frompredicted basepairing probabilities with TurboFold. Theaccuracy results are represented in Figs. 2 and S1–S11;Tables S4 and S5.

    Fig. 2 shows the average structure prediction accuracy forthe sequences without SHAPE data. The results demonstratethat the majority of RNA families (tRNA, 5S rRNA, hepati-tis C virus IRES, group I intron, lysine riboswitch, SAM Iriboswitch, cyclic-di-GMP riboswitch, and 23S rRNA)have significantly (p < 0.05) better structure prediction ac-curacy when SHAPE is used in the calculation than whenSHAPE data is not used. This shows that SHAPE data fora single sequence can inform the structure modeling for ho-mologous sequences. However, for the M-box riboswitchand TPP riboswitch, the accuracies are not significantlyimproved by having SHAPE data. For the sequences withoutSHAPE data, the new version of TurboFold II performedbetter than RNAalifold using SHAPE data and MaxExpect.Fig. S12 shows that much of the improvement in accuracy isfor sequences that were relatively poorly predicted in theabsence of SHAPE data. The accuracy performance forthose sequences is rescued by having SHAPE informationfor a homologous sequence.

    It is observed that structure prediction accuracies byTurboFold II using SHAPE data across sizes of sequencegroups are scarcely changed (from 5 to 20 sequences).The relationship between structure prediction accuraciesand sequence lengths is also weak (Tables S1 and S2). Forthe 23S rRNA family, which has the longest averagesequence length (�2900 nucleotides), all methods, exceptsingle-sequence MaxExpect, perform well. On the RNAfamilies with sequence lengths shorter than 200 nucleotides,TurboFold II þ SHAPE improves structure predictions fortRNA, 5S, lysine riboswitch, and cyclic-di-GMP riboswitch,but does not improve structure predictions for M-box ribos-witch and TPP riboswitch.

    For the one sequence with SHAPE mapping data in eachRNA family, the results show that the majority of RNA fam-ilies (5S rRNA, HCV IRES domain, group I intron, TPPriboswitch, and 23S rRNA) have significantly (p < 0.05)improved prediction accuracy when SHAPE data are used

    Biophysical Journal 113, 1–9, July 25, 2017 5

    http://rna.urmc.rochester.edu/RNAStralign.tar.gzhttp://rna.urmc.rochester.edu/RNAStralign.tar.gzhttp://www.rna.icmb.utexas.edu/

  • TurboFoldII +SHAPETurboFoldIIRNAalifold +SHAPERNAalifold MaxExpect

    tRNA*

    * **

    ** *

    *

    ** *

    **

    ** *

    *

    * *

    *

    * *

    5 sequences 10 sequences 20 sequences 0

    0.2

    0.4

    0.6

    0.8

    15S rRNA

    *

    *

    * * *

    *

    * * *

    *

    * ** *

    0

    0.2

    0.4

    0.6

    0.8

    1

    5 sequences 10 sequences 20 sequences

    *

    *

    * *

    *

    * *

    *

    Group I Intron

    *

    *

    *

    *

    0

    0.2

    0.4

    0.6

    0.8

    1

    5 sequences 10 sequences 20 sequences

    Hepatitis C Virus(HCV) IRES Domain

    ** *

    *

    **

    ** *

    *

    * *

    0

    0.2

    0.4

    0.6

    0.8

    1

    5 sequences 10 sequences 20 sequences

    Lysine riboswitch

    * ** *

    *

    * *

    *

    M-box riboswitch

    ** * **

    * ***

    * ** *

    * *

    * * *

    0

    0.2

    0.4

    0.6

    0.8

    1

    0

    0.2

    0.4

    0.6

    0.8

    1

    23S rRNA

    * **

    *

    *

    * *

    *

    * *

    * *

    * *

    **

    *

    *

    cyclic-di-GMP riboswitch

    TPP riboswitch

    * * *** **

    *

    **

    *

    **

    **

    *

    *

    **

    SAM I riboswitch

    **

    *

    **

    *

    5 sequences 10 sequences 20 sequences5 sequences 10 sequences 20 sequences

    5 sequences 10 sequences 20 sequences5 sequences 10 sequences 20 sequences 0

    0.2

    0.4

    0.6

    0.8

    1

    0

    0.2

    0.4

    0.6

    0.8

    1

    0

    0.2

    0.4

    0.6

    0.8

    1

    0

    0.2

    0.4

    0.6

    0.8

    1

    5 sequences 10 sequences 20 sequences5 sequences 10 sequences 20 sequences

    FIGURE 2 Average F1 score of structure predic-

    tions of the sequences that did not have SHAPE

    mapping data. Given here is the average F1 score

    of structure predictions obtained by running the

    methods with 5-, 10-, or 20-input sequences on

    tRNA, 5S rRNA, hepatitis C virus IRES domain,

    group I intron, lysine riboswitch, M-box ribos-

    witch, SAM I riboswitch, TPP riboswitch, cyclic-

    di-GMP riboswitch, and 23S rRNA test datasets.

    Standard errors of the mean are shown by error

    bars. The star (*) above the bar for a method indi-

    cates that the difference in F1 score between the

    method and the new TurboFold II is statistically

    significant, as determined by paired t-tests (51).

    Tan et al.

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    than when SHAPE data are not used (Fig. S1 and Table S4).For tRNA, the lysine riboswitch, and the M-box riboswitchfamilies, the accuracy performances are the same. In theSAM I riboswitch and the cyclic-di-GMP riboswitch fam-ilies, the accuracies decreased when SHAPE data areused. In tRNA, 5S rRNA, group I intron, lysine riboswitch,SAM I riboswitch, TPP riboswitch, and 23S rRNA families,the new version of TurboFold II performed better thanRSample. Only in the hepatitis C virus IRES domain andcyclic-di-GMP riboswitch families, the new version ofTurboFold II performed worse than RSample. TheTurboFold IIþSHAPE performed better than RNAalifoldusing SHAPE data on every family and performed betterthan MaxExpect on a majority of families (except the cy-

    6 Biophysical Journal 113, 1–9, July 25, 2017

    clic-di-GMP riboswitch and the M-box riboswitch) usingSHAPE data.

    The alignment predictions by TurboFold II with andwithout SHAPE (Fig. S13) are compared with the predictedalignment by ClustalW (53), a method that is based on pair-wise dynamic programing alignments, which is the inputalignment for RNAalifold. Because the Rfam databasealignments do not include the sequence with SHAPE datafor all of the families, the alignment accuracy is assessedonly over the sequences without SHAPE data within eachfamily of homologs. With the exception of the 5S rRNAand the hepatitis C virus IRES domain, TurboFold IIwith SHAPE had higher sensitivity and PPV compared toClustalW. Using SHAPE data on one sequence in each

  • Modeling Conserved RNA Structure

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    RNA family also significantly improved the alignment accu-racy of other homologs without SHAPE in a majority ofRNA families (group I intron, lysine riboswitch, M-box ri-boswitch, SAM I riboswitch, TPP riboswitch, and cyclic-di-GMP riboswitch).

    DISCUSSION

    Secondary structure models are important for understandingthe functions of the RNA structure (54). Using SHAPE datawas shown to improve structure prediction accuracy signif-icantly for single sequence secondary structure predictions(7,12). In this work, it is demonstrated that the SHAPEdata can inform the folding of other homologs by combininginformation from sequence comparison of RNA homologs.In particular, it is shown that given SHAPE data for onesequence out of the multiple sequences used in secondarystructure prediction by comparative analysis, TurboFoldII þ SHAPE can substantially improve the structure predic-tion accuracies of the sequences that did not have SHAPEmapping data.

    One of the reasons for the improvements of the structureprediction accuracies of homologs without SHAPE is themore accurate prediction of the structure of the sequencewith SHAPE reactivity. In three RNA families (5S rRNA,HCV IRES, and group I intron), TurboFold II improvedthe average structure accuracy of both the sequences withand without SHAPE (Fig. S1). The more accurate structuralinformation from the sequence with SHAPE is transmittedto its homologs through the extrinsic information calcula-tion. Due to the specially designed extrinsic informationcalculation from the sequence with SHAPE to other (H�1total) homologs by introducing the factor (H�1), which en-sures that the fraction of extrinsic information provided bysequences with SHAPE is high compared to other homo-logs, the structure prediction of homologs is improved.

    To take the advantage of SHAPE data on one of the ho-mologs, the new method ignores pairwise sequence identityduring the calculation of extrinsic information from the

    a b

    sequence with SHAPE to other sequences. To understandthe nature of improvements in structure prediction accuraciesof sequenceswithout SHAPE, the relationship between struc-ture prediction accuracy and sequence identity is studied(Fig. S14). Sequence identity is defined as the ratio of thenumber of columns with same pairwise aligned nucleotidesat the output alignment between the sequence with SHAPEand other homologs from theTurboFold IIþSHAPEmethod.One observed trend is that the sequenceswithmore accuratelypredicted structure (higher F1 score) generally with hadhigher sequence identity to the sequencewith SHAPE.More-over, the F1 score improvementswere distributed in a roughlyGaussian shape along the sequence identity (Fig. S14). For thesequences with relatively high sequence identity, the room toimprove accuracy was limited. The Gaussian shape is alsopartially due to the effects of improvements in structure pre-diction because of a more accurate alignment. This isobserved in some of the RNA families (tRNA, group I intron,lysine riboswitch, and SAM I riboswitch) (Fig. S13). The5S rRNA, hepatitis C virus IRES domain, and cyclic-di-GMP riboswitch RNA families showed improvements onstructure prediction accuracy although little or no improve-ment on alignment prediction accuracy, because the align-ment accuracies of these RNA families were alreadyrelatively high (�90% in sensitivity and PPV).

    The other reason for the improvements of the structureprediction accuracies of homologs without SHAPE is themore accurate coincidence probability as compared to thecase without SHAPE data on any of the input sequences.The coincidence is important to map the basepairing proba-bilities of other homologous sequences to the sequence ofinterest and it is also helpful to estimate the final multiplesequence alignment (Fig. S13).

    One remaining challenge of structure prediction usingexperimental probing data on one of the homologs is the dif-ficulty to determine the balance of information from thermo-dynamics of the sequence and extrinsic information fromthe sequence using experimental data. In Fig. 3, an examplefrom the TPP riboswitch family shows that the structure of

    FIGURE 3 Representative secondary structure

    prediction for TPP riboswitch (BA000043) with

    (a) and without (b) SHAPE data on a homolo-

    gous RNA. Basepair predictions are illustrated

    by colored lines (green, red, and black denoting

    correct, incorrect, and missing basepairs, respec-

    tively) on circle plots. The circular plots were

    generated using the CircleCompare program in

    RNAstructure (55).

    Biophysical Journal 113, 1–9, July 25, 2017 7

  • Tan et al.

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    one homologous sequence BA000043 was incorrectly pre-dicted to form three extra basepairs between 50 and 30

    ends when SHAPE was used as compared to when SHAPEwas not used, although the longer helix contributes to amore stable structure.

    RNAalifold showed lower accuracies for predicted struc-tures than those of TurboFold II þ SHAPE in most of theRNA families. A contributing factor to this inaccuracywas the lower accuracy of the input sequence alignment(Fig. S13). Although pseudo free energies obtained fromthe SHAPE reactivity data at nucleotides might be helpfulfor estimating the structure, an inaccurate alignment be-tween the sequence with SHAPE data and homologs candisturb the consensus structure for the set of aligned se-quences and can cause loss of basepairs in the consensusstructure. For the group I intron, lysine riboswitch, SAM Iriboswitch, TPP riboswitch, and cyclic-di-GMP riboswitchRNA families, the sensitivity and PPV of the predictedClustalW alignment for sequences without SHAPE are�10% lower than those of TurboFold II þ SHAPE andthe F1 score of structure prediction on these RNA familiesis �20% lower than TurboFold II þ SHAPE.

    Another contributing factor for the worse performance ofRNAalifold is the integration of SHAPE data. There is aweakening of the information from experimental data withincreasing number of homologs, because the pseudo energyfrom SHAPE reactivity is only applied to the free energycalculation of the particular sequence.

    TurboFold II using SHAPE data on one or moresequences maintains a computation speed comparable toTurboFold II (with complexity O(H2N2 þ HN3) for Hsequences of average length N). The time performance onselect sequence families is provided in Table S6.

    CONCLUSION

    A new version of TurboFold II with the ability to includeSHAPE mapping data for one or more of the RNA sequencehomologs can substantially improve the structure predictionaccuracies of the sequences that do not have SHAPE data.TurboFold II with the capability to include SHAPE mappingdata for one or more sequences is available under the GNUlicense as part of the RNAstructure software package at:http://rna.urmc.rochester.edu/RNAstructure.html.

    SUPPORTING MATERIAL

    Supporting Materials and Methods, fifteen figures, and six tables are avail-

    able at http://www.biophysj.org/biophysj/supplemental/S0006-3495(17)

    30689-6.

    AUTHOR CONTRIBUTIONS

    All authors planned experiments. Z.T. wrote code and performed experi-

    ments. Z.T. drafted the manuscript. All authors participated in the writing.

    8 Biophysical Journal 113, 1–9, July 25, 2017

    ACKNOWLEDGMENTS

    This work was supported by National Institutes of Health (NIH) grants R01

    GM097334 to G.S. and R01 GM076485 to D.H.M.

    REFERENCES

    1. Cech, T. R., and J. A. Steitz. 2014. The noncoding RNA revolution-trashing old rules to forge new ones. Cell. 157:77–94.

    2. Wu, L., and J. G. Belasco. 2008. Let me count the ways: mechanisms ofgene regulation by miRNAs and siRNAs. Mol. Cell. 29:1–7.

    3. Doudna, J. A., and T. R. Cech. 2002. The chemical repertoire of naturalribozymes. Nature. 418:222–228.

    4. Gesteland, R. F., T. Cech, and J. F. Atkins. 2006. The RNAWorld: TheNature of Modern RNA Suggests a Prebiotic RNAWorld. Cold SpringHarbor Laboratory Press, Cold Spring Harbor, NY.

    5. Seetin, M. G., and D. H. Mathews. 2012. RNA structure prediction: anoverview of methods. Methods Mol. Biol. 905:99–122.

    6. Hofacker, I. L. 2014. Energy-directed RNA structure prediction.Methods Mol. Biol. 1097:71–84.

    7. Deigan, K. E., T. W. Li, ., K. M. Weeks. 2009. Accurate SHAPE-directed RNA structure determination. Proc. Natl. Acad. Sci. USA.106:97–102.

    8. Quarrier, S., J. S. Martin,., A. Laederach. 2010. Evaluation of the in-formation content of RNA structure mapping data for secondary struc-ture prediction. RNA. 16:1108–1117.

    9. Washietl, S., I. L. Hofacker,., M. Kellis. 2012. RNA folding with softconstraints: reconciliation of probing data and thermodynamic second-ary structure prediction. Nucleic Acids Res. 40:4261–4272.

    10. Sloma, M. F., and D. H. Mathews. 2015. Improving RNA secondarystructure prediction with structure mapping data. Methods Enzymol.553:91–114.

    11. Mathews, D. H., M. D. Disney, ., D. H. Turner. 2004. Incorporatingchemical modification constraints into a dynamic programming algo-rithm for prediction of RNA secondary structure. Proc. Natl. Acad.Sci. USA. 101:7287–7292.

    12. Hajdin, C. E., S. Bellaousov,., K. M.Weeks. 2013. Accurate SHAPE-directed RNA secondary structure modeling, including pseudoknots.Proc. Natl. Acad. Sci. USA. 110:5498–5503.

    13. Mathews, D. H., J. Sabina,., D. H. Turner. 1999. Expanded sequencedependence of thermodynamic parameters improves prediction ofRNA secondary structure. J. Mol. Biol. 288:911–940.

    14. Eddy, S. R. 2014. Computational analysis of conserved RNA second-ary structure in transcriptomes and genomes. Annu. Rev. Biophys.43:433–456.

    15. Zarringhalam, K., M. M. Meyer, ., P. Clote. 2012. Integrating chem-ical footprinting data into RNA secondary structure prediction. PLoSOne. 7:e45160.

    16. Ouyang, Z., M. P. Snyder, and H. Y. Chang. 2013. SeqFold: genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data. Genome Res. 23:377–387.

    17. Deng, F., M. Ledda,., S. Aviran. 2016. Data-directed RNA secondarystructure prediction using probabilistic modeling. RNA. 22:1109–1119.

    18. McGinnis, J. L., J. A. Dunkle,., K. M.Weeks. 2012. The mechanismsof RNA SHAPE chemistry. J. Am. Chem. Soc. 134:6617–6624.

    19. Merino, E. J., K. A. Wilkinson,., K. M. Weeks. 2005. RNA structureanalysis at single nucleotide resolution by selective 20-hydroxyl acyla-tion and primer extension (SHAPE). J. Am. Chem. Soc. 127:4223–4231.

    20. S€ukösd, Z., M. S. Swenson,., C. E. Heitsch. 2013. Evaluating the ac-curacy of SHAPE-directed RNA secondary structure predictions. Nu-cleic Acids Res. 41:2807–2816.

    21. Kertesz, M., Y. Wan,., E. Segal. 2010. Genome-wide measurement ofRNA secondary structure in yeast. Nature. 467:103–107.

    http://rna.urmc.rochester.edu/RNAstructure.htmlhttp://www.biophysj.org/biophysj/supplemental/S0006-3495(17)30689-6http://www.biophysj.org/biophysj/supplemental/S0006-3495(17)30689-6http://refhub.elsevier.com/S0006-3495(17)30689-6/sref1http://refhub.elsevier.com/S0006-3495(17)30689-6/sref1http://refhub.elsevier.com/S0006-3495(17)30689-6/sref2http://refhub.elsevier.com/S0006-3495(17)30689-6/sref2http://refhub.elsevier.com/S0006-3495(17)30689-6/sref3http://refhub.elsevier.com/S0006-3495(17)30689-6/sref3http://refhub.elsevier.com/S0006-3495(17)30689-6/sref4http://refhub.elsevier.com/S0006-3495(17)30689-6/sref4http://refhub.elsevier.com/S0006-3495(17)30689-6/sref4http://refhub.elsevier.com/S0006-3495(17)30689-6/sref5http://refhub.elsevier.com/S0006-3495(17)30689-6/sref5http://refhub.elsevier.com/S0006-3495(17)30689-6/sref6http://refhub.elsevier.com/S0006-3495(17)30689-6/sref6http://refhub.elsevier.com/S0006-3495(17)30689-6/sref7http://refhub.elsevier.com/S0006-3495(17)30689-6/sref7http://refhub.elsevier.com/S0006-3495(17)30689-6/sref7http://refhub.elsevier.com/S0006-3495(17)30689-6/sref8http://refhub.elsevier.com/S0006-3495(17)30689-6/sref8http://refhub.elsevier.com/S0006-3495(17)30689-6/sref8http://refhub.elsevier.com/S0006-3495(17)30689-6/sref9http://refhub.elsevier.com/S0006-3495(17)30689-6/sref9http://refhub.elsevier.com/S0006-3495(17)30689-6/sref9http://refhub.elsevier.com/S0006-3495(17)30689-6/sref10http://refhub.elsevier.com/S0006-3495(17)30689-6/sref10http://refhub.elsevier.com/S0006-3495(17)30689-6/sref10http://refhub.elsevier.com/S0006-3495(17)30689-6/sref11http://refhub.elsevier.com/S0006-3495(17)30689-6/sref11http://refhub.elsevier.com/S0006-3495(17)30689-6/sref11http://refhub.elsevier.com/S0006-3495(17)30689-6/sref11http://refhub.elsevier.com/S0006-3495(17)30689-6/sref12http://refhub.elsevier.com/S0006-3495(17)30689-6/sref12http://refhub.elsevier.com/S0006-3495(17)30689-6/sref12http://refhub.elsevier.com/S0006-3495(17)30689-6/sref13http://refhub.elsevier.com/S0006-3495(17)30689-6/sref13http://refhub.elsevier.com/S0006-3495(17)30689-6/sref13http://refhub.elsevier.com/S0006-3495(17)30689-6/sref14http://refhub.elsevier.com/S0006-3495(17)30689-6/sref14http://refhub.elsevier.com/S0006-3495(17)30689-6/sref14http://refhub.elsevier.com/S0006-3495(17)30689-6/sref15http://refhub.elsevier.com/S0006-3495(17)30689-6/sref15http://refhub.elsevier.com/S0006-3495(17)30689-6/sref15http://refhub.elsevier.com/S0006-3495(17)30689-6/sref16http://refhub.elsevier.com/S0006-3495(17)30689-6/sref16http://refhub.elsevier.com/S0006-3495(17)30689-6/sref16http://refhub.elsevier.com/S0006-3495(17)30689-6/sref17http://refhub.elsevier.com/S0006-3495(17)30689-6/sref17http://refhub.elsevier.com/S0006-3495(17)30689-6/sref18http://refhub.elsevier.com/S0006-3495(17)30689-6/sref18http://refhub.elsevier.com/S0006-3495(17)30689-6/sref19http://refhub.elsevier.com/S0006-3495(17)30689-6/sref19http://refhub.elsevier.com/S0006-3495(17)30689-6/sref19http://refhub.elsevier.com/S0006-3495(17)30689-6/sref19http://refhub.elsevier.com/S0006-3495(17)30689-6/sref19http://refhub.elsevier.com/S0006-3495(17)30689-6/sref20http://refhub.elsevier.com/S0006-3495(17)30689-6/sref20http://refhub.elsevier.com/S0006-3495(17)30689-6/sref20http://refhub.elsevier.com/S0006-3495(17)30689-6/sref20http://refhub.elsevier.com/S0006-3495(17)30689-6/sref21http://refhub.elsevier.com/S0006-3495(17)30689-6/sref21

  • Modeling Conserved RNA Structure

    Please cite this article in press as: Tan et al., Modeling RNA Secondary Structure with Sequence Comparison and Experimental Mapping Data, BiophysicalJournal (2017), http://dx.doi.org/10.1016/j.bpj.2017.06.039

    22. Underwood, J. G., A. V. Uzilov, ., D. Haussler. 2010. FragSeq:transcriptome-wide RNA structure probing using high-throughputsequencing. Nat. Methods. 7:995–1001.

    23. Talkish, J., G. May, ., C. J. McManus. 2014. Mod-seq: high-throughput sequencing for chemical probing of RNA structure. RNA.20:713–720.

    24. Ding, Y., Y. Tang, ., S. M. Assmann. 2014. In vivo genome-wideprofiling of RNA secondary structure reveals novel regulatory features.Nature. 505:696–700.

    25. Spitale, R. C., P. Crisalli,., H. Y. Chang. 2013. RNA SHAPE analysisin living cells. Nat. Chem. Biol. 9:18–20.

    26. Rouskin, S., M. Zubradt, ., J. S. Weissman. 2014. Genome-wideprobing of RNA structure reveals active unfolding of mRNA structuresin vivo. Nature. 505:701–705.

    27. Cordero, P., and R. Das. 2015. Rich RNA structure landscapes revealedby mutate-and-map analysis. PLOS Comput. Biol. 11:e1004473.

    28. Puton, T., L. P. Kozlowski, ., J. M. Bujnicki. 2014. CompaRNA: aserver for continuous benchmarking of automated methods for RNAsecondary structure prediction. Nucleic Acids Res. 42:5403–5406.

    29. Havgaard, J. H., and J. Gorodkin. 2014. RNA structural alignments,part I: Sankoff-based approaches for structural alignments. MethodsMol. Biol. 1097:275–290.

    30. Asai, K., and M. Hamada. 2014. RNA structural alignments, part II:non-Sankoff approaches for structural alignments. Methods Mol.Biol. 1097:291–301.

    31. Harmanci, A. O., G. Sharma, and D. H. Mathews. 2011. TurboFold:iterative probabilistic estimation of secondary structures for multipleRNA sequences. BMC Bioinformatics. 12:108.

    32. Harmanci, A. O., G. Sharma, and D. H. Mathews. 2007. Efficient pair-wise RNA structure prediction using probabilistic alignment con-straints in Dynalign. BMC Bioinformatics. 8:130.

    33. Hofacker, I. L., S. H. Bernhart, and P. F. Stadler. 2004. Alignment ofRNA base pairing probability matrices. Bioinformatics. 20:2222–2227.

    34. Knudsen, B., and J. Hein. 2003. Pfold: RNA secondary structure pre-diction using stochastic context-free grammars. Nucleic Acids Res.31:3423–3428.

    35. Lu, Z. J., J. W. Gloor, and D. H. Mathews. 2009. Improved RNA sec-ondary structure prediction by maximizing expected pair accuracy.RNA. 15:1805–1813.

    36. Do, C. B., M. S. Mahabhashyam, ., S. Batzoglou. 2005. ProbCons:probabilistic consistency-based multiple sequence alignment. GenomeRes. 15:330–340.

    37. Lavender, C. A., R. Lorenz, ., K. M. Weeks. 2015. Model-Free RNAsequence and structure alignment informed by SHAPE probing revealsa conserved alternate secondary structure for 16S rRNA. PLOS Com-put. Biol. 11:e1004126.

    38. Bernhart, S. H., I. L. Hofacker, ., P. F. Stadler. 2008. RNAalifold:improved consensus structure prediction for RNA alignments. BMCBioinformatics. 9:474.

    39. Ding, Y., and C. E. Lawrence. 2003. A statistical sampling algorithmfor RNA secondary structure prediction. Nucleic Acids Res. 31:7280–7301.

    40. Mathews, D. H. 2004. Using an RNA secondary structure partitionfunction to determine confidence in base pairs predicted by free energyminimization. RNA. 10:1178–1190.

    41. Turner, D. H., and D. H. Mathews. 2010. NNDB: the nearest neighborparameter database for predicting stability of nucleic acid secondarystructure. Nucleic Acids Res. 38:D280–D282.

    42. Durbin, R., S. R. Eddy, ., G. Mitchison. 1998. Biological SequenceAnalysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-bridge University Press, Cambridge, United Kingdom.

    43. Do, C. B., D. A. Woods, and S. Batzoglou. 2006. CONTRAfold: RNAsecondary structure prediction without physics-based models. Bioin-formatics. 22:e90–e98.

    44. Bellaousov, S., and D. H. Mathews. 2010. ProbKnot: fast prediction ofRNA secondary structure including pseudoknots. RNA. 16:1870–1880.

    45. Nawrocki, E. P., S. W. Burge,., R. D. Finn. 2015. Rfam 12.0: updatesto the RNA families database. Nucleic Acids Res. 43:D130–D137.

    46. Fu, Y., G. Sharma, and D. H. Mathews. 2014. Dynalign II: commonsecondary structure prediction for RNA homologs with domain inser-tions. Nucleic Acids Res. 42:13939–13948.

    47. Gutell, R. R., J. C. Lee, and J. J. Cannone. 2002. The accuracy of ribo-somal RNA comparative structure models. Curr. Opin. Struct. Biol.12:301–310.

    48. Woodson, S. A., and D. M. Crothers. 1987. Proton nuclear magneticresonance studies on bulge-containing DNA oligonucleotides from amutational hot-spot sequence. Biochemistry. 26:904–912.

    49. Znosko, B. M., S. B. Silvestri, ., M. J. Serra. 2002. Thermodynamicparameters for an expanded nearest-neighbor model for the formationof RNA duplexes with single nucleotide bulges. Biochemistry.41:10406–10417.

    50. R Development Core Team. 2013. R: A Language and Environmentfor Statistical Computing. R Foundation for Statistical Computing,Vienna, Austria.

    51. Xu, Z., A. Almudevar, and D. H. Mathews. 2012. Statistical evaluationof improvement in RNA secondary structure prediction. Nucleic AcidsRes. 40:e26.

    52. Lorenz, R., S. H. Bernhart,., I. L. Hofacker. 2011. ViennaRNA pack-age 2.0. Algorithms Mol. Biol. 6:26.

    53. Larkin, M. A., G. Blackshields,., D. G. Higgins. 2007. Clustal WandClustal X version 2.0. Bioinformatics. 23:2947–2948.

    54. Mauger, D. M., N. A. Siegfried, and K. M. Weeks. 2013. The geneticcode as expressed through relationships between mRNA structureand protein function. FEBS Lett. 587:1180–1188.

    55. Reuter, J. S., and D. H. Mathews. 2010. RNAstructure: software forRNA secondary structure prediction and analysis. BMC Bioinformat-ics. 11:129.

    Biophysical Journal 113, 1–9, July 25, 2017 9

    http://refhub.elsevier.com/S0006-3495(17)30689-6/sref22http://refhub.elsevier.com/S0006-3495(17)30689-6/sref22http://refhub.elsevier.com/S0006-3495(17)30689-6/sref22http://refhub.elsevier.com/S0006-3495(17)30689-6/sref23http://refhub.elsevier.com/S0006-3495(17)30689-6/sref23http://refhub.elsevier.com/S0006-3495(17)30689-6/sref23http://refhub.elsevier.com/S0006-3495(17)30689-6/sref24http://refhub.elsevier.com/S0006-3495(17)30689-6/sref24http://refhub.elsevier.com/S0006-3495(17)30689-6/sref24http://refhub.elsevier.com/S0006-3495(17)30689-6/sref25http://refhub.elsevier.com/S0006-3495(17)30689-6/sref25http://refhub.elsevier.com/S0006-3495(17)30689-6/sref26http://refhub.elsevier.com/S0006-3495(17)30689-6/sref26http://refhub.elsevier.com/S0006-3495(17)30689-6/sref26http://refhub.elsevier.com/S0006-3495(17)30689-6/sref27http://refhub.elsevier.com/S0006-3495(17)30689-6/sref27http://refhub.elsevier.com/S0006-3495(17)30689-6/sref28http://refhub.elsevier.com/S0006-3495(17)30689-6/sref28http://refhub.elsevier.com/S0006-3495(17)30689-6/sref28http://refhub.elsevier.com/S0006-3495(17)30689-6/sref29http://refhub.elsevier.com/S0006-3495(17)30689-6/sref29http://refhub.elsevier.com/S0006-3495(17)30689-6/sref29http://refhub.elsevier.com/S0006-3495(17)30689-6/sref30http://refhub.elsevier.com/S0006-3495(17)30689-6/sref30http://refhub.elsevier.com/S0006-3495(17)30689-6/sref30http://refhub.elsevier.com/S0006-3495(17)30689-6/sref31http://refhub.elsevier.com/S0006-3495(17)30689-6/sref31http://refhub.elsevier.com/S0006-3495(17)30689-6/sref31http://refhub.elsevier.com/S0006-3495(17)30689-6/sref32http://refhub.elsevier.com/S0006-3495(17)30689-6/sref32http://refhub.elsevier.com/S0006-3495(17)30689-6/sref32http://refhub.elsevier.com/S0006-3495(17)30689-6/sref33http://refhub.elsevier.com/S0006-3495(17)30689-6/sref33http://refhub.elsevier.com/S0006-3495(17)30689-6/sref34http://refhub.elsevier.com/S0006-3495(17)30689-6/sref34http://refhub.elsevier.com/S0006-3495(17)30689-6/sref34http://refhub.elsevier.com/S0006-3495(17)30689-6/sref35http://refhub.elsevier.com/S0006-3495(17)30689-6/sref35http://refhub.elsevier.com/S0006-3495(17)30689-6/sref35http://refhub.elsevier.com/S0006-3495(17)30689-6/sref36http://refhub.elsevier.com/S0006-3495(17)30689-6/sref36http://refhub.elsevier.com/S0006-3495(17)30689-6/sref36http://refhub.elsevier.com/S0006-3495(17)30689-6/sref37http://refhub.elsevier.com/S0006-3495(17)30689-6/sref37http://refhub.elsevier.com/S0006-3495(17)30689-6/sref37http://refhub.elsevier.com/S0006-3495(17)30689-6/sref37http://refhub.elsevier.com/S0006-3495(17)30689-6/sref38http://refhub.elsevier.com/S0006-3495(17)30689-6/sref38http://refhub.elsevier.com/S0006-3495(17)30689-6/sref38http://refhub.elsevier.com/S0006-3495(17)30689-6/sref39http://refhub.elsevier.com/S0006-3495(17)30689-6/sref39http://refhub.elsevier.com/S0006-3495(17)30689-6/sref39http://refhub.elsevier.com/S0006-3495(17)30689-6/sref40http://refhub.elsevier.com/S0006-3495(17)30689-6/sref40http://refhub.elsevier.com/S0006-3495(17)30689-6/sref40http://refhub.elsevier.com/S0006-3495(17)30689-6/sref41http://refhub.elsevier.com/S0006-3495(17)30689-6/sref41http://refhub.elsevier.com/S0006-3495(17)30689-6/sref41http://refhub.elsevier.com/S0006-3495(17)30689-6/sref42http://refhub.elsevier.com/S0006-3495(17)30689-6/sref42http://refhub.elsevier.com/S0006-3495(17)30689-6/sref42http://refhub.elsevier.com/S0006-3495(17)30689-6/sref43http://refhub.elsevier.com/S0006-3495(17)30689-6/sref43http://refhub.elsevier.com/S0006-3495(17)30689-6/sref43http://refhub.elsevier.com/S0006-3495(17)30689-6/sref44http://refhub.elsevier.com/S0006-3495(17)30689-6/sref44http://refhub.elsevier.com/S0006-3495(17)30689-6/sref45http://refhub.elsevier.com/S0006-3495(17)30689-6/sref45http://refhub.elsevier.com/S0006-3495(17)30689-6/sref46http://refhub.elsevier.com/S0006-3495(17)30689-6/sref46http://refhub.elsevier.com/S0006-3495(17)30689-6/sref46http://refhub.elsevier.com/S0006-3495(17)30689-6/sref47http://refhub.elsevier.com/S0006-3495(17)30689-6/sref47http://refhub.elsevier.com/S0006-3495(17)30689-6/sref47http://refhub.elsevier.com/S0006-3495(17)30689-6/sref48http://refhub.elsevier.com/S0006-3495(17)30689-6/sref48http://refhub.elsevier.com/S0006-3495(17)30689-6/sref48http://refhub.elsevier.com/S0006-3495(17)30689-6/sref49http://refhub.elsevier.com/S0006-3495(17)30689-6/sref49http://refhub.elsevier.com/S0006-3495(17)30689-6/sref49http://refhub.elsevier.com/S0006-3495(17)30689-6/sref49http://refhub.elsevier.com/S0006-3495(17)30689-6/sref50http://refhub.elsevier.com/S0006-3495(17)30689-6/sref50http://refhub.elsevier.com/S0006-3495(17)30689-6/sref50http://refhub.elsevier.com/S0006-3495(17)30689-6/sref51http://refhub.elsevier.com/S0006-3495(17)30689-6/sref51http://refhub.elsevier.com/S0006-3495(17)30689-6/sref51http://refhub.elsevier.com/S0006-3495(17)30689-6/sref52http://refhub.elsevier.com/S0006-3495(17)30689-6/sref52http://refhub.elsevier.com/S0006-3495(17)30689-6/sref53http://refhub.elsevier.com/S0006-3495(17)30689-6/sref53http://refhub.elsevier.com/S0006-3495(17)30689-6/sref54http://refhub.elsevier.com/S0006-3495(17)30689-6/sref54http://refhub.elsevier.com/S0006-3495(17)30689-6/sref54http://refhub.elsevier.com/S0006-3495(17)30689-6/sref55http://refhub.elsevier.com/S0006-3495(17)30689-6/sref55http://refhub.elsevier.com/S0006-3495(17)30689-6/sref55

  • Biophysical Journal, Volume 113

    Supplemental Information

    Modeling RNA Secondary Structure with Sequence Comparison and

    Experimental Mapping Data

    Zhen Tan, Gaurav Sharma, and David H. Mathews

  • Supplementary information for

    “Modeling RNA secondary structure with sequence comparison and experimental mapping data”

    Z. Tan, G. Sharma, and D. H. Mathews

    Details are provided for dataset used in benchmarking (Section 1), structure modeling accuracy (Section 2), parameter optimization methods (Section 2), sequences used in parameter optimization, software efficiency test (Section 3), and benchmarking (Section 4).

  • Section 1. Dataset information:

    Family H Average sequence length

    Standard deviation

    Average MEA 

    sensitivity

    Standard deviation

    Average MEA PPV

    Standard deviation

    tRNA 5 sequences 75.7 3.5 0.76 0.23 0.75 0.2410 sequences 76.2 4.7 0.77 0.23 0.74 0.2520 sequences 76.3 4.8 0.77 0.21 0.74 0.23

    cGMP riboswitch

    5 sequences 89.0 8.3 0.86 0.19 0.33 0.1210 sequences 87.9 6.9 0.81 0.26 0.31 0.1320 sequences 87.5 6.5 0.81 0.25 0.31 0.13

    TPP riboswitch

    5 sequences 101.5 16.8 0.54 0.29 0.43 0.2810 sequences 104.4 13.9 0.55 0.29 0.44 0.2720 sequences 106.1 13.1 0.55 0.29 0.43 0.27

    SAM I riboswitch

    5 sequences 111.3 13.9 0.83 0.18 0.68 0.1710 sequences 111.9 14.1 0.82 0.17 0.67 0.1620 sequences 111.9 15.3 0.84 0.16 0.68 0.15

    5S rRNA

    5 sequences 117.7 4.6 0.64 0.24 0.62 0.2410 sequences 117.8 3.2 0.56 0.25 0.55 0.2520 sequences 117.8 4.2 0.57 0.27 0.54 0.26

    M‐box riboswitch

    5 sequences 164.7 8.5 0.64 0.15 0.61 0.1510 sequences 167.1 8.5 0.66 0.17 0.62 0.1620 sequences 167.8 7.3 0.66 0.15 0.63 0.14

    lysine riboswitch

    5 sequences 179.1 6.8 0.76 0.17 0.71 0.1510 sequences 183.5 12.6 0.65 0.22 0.60 0.2020 sequences 182.7 10.7 0.68 0.22 0.63 0.21

    HCV 5 sequences 267.4 66.1 0.50 0.16 0.46 0.1610 sequences 250.7 62.9 0.47 0.17 0.43 0.1720 sequences 251.0 60.5 0.48 0.18 0.43 0.17

    Group I intron

    5 sequences 431.1 51.0 0.61 0.16 0.58 0.1510 sequences 433.3 52.7 0.60 0.16 0.59 0.1620 sequences 433.8 54.0 0.61 0.16 0.59 0.16

    23S rRNA

    5 sequences 2919.4 51.8 0.52 0.53 0.08 0.0710 sequences 2928.8 62.6 0.51 0.52 0.02 0.0420 sequences  2924.3  56.4  0.52  0.51  0.01  0.06 

    Table S1. Summary statistics on the sets of sequences selected for testing. Mean and standard deviation of sequence length, sensitivity and PPV of MEA structure prediction are shown for sequences from each RNA family in the test sets of homologs used.

       

  • Family Total number of distinct sequences Total number of sequences in databasetRNA 627 9245

    cGMP riboswitch 150 155TPP riboswitch 97 109SAM I riboswitch 272 433

    5S rRNA 429 710M‐box riboswitch 138 157Lysine riboswitch 45 47

    HCV 74 79Group I intron 437 81623S rRNA 35 35

    Table S2. Number of distinct sequences on the sets of sequences selected for testing. Number of distinct sequences from each RNA family in test sets and the total number of sequences available in database are shown.

    Family Sequence with SHAPE reactivity data tRNA E. coli 

    cGMP riboswitch V. cholerae TPP riboswitch E. coli SAM I riboswitch T. tencongensis 

    5S rRNA E. coli M‐box riboswitch B. subtilis Lysine riboswitch T. maritime 

    HCV Hepatitis C virus IRES domain Group I intron T. thermophila 23S rRNA E. coli 

    Table S3. List of sequences with SHAPE reactivity data for each family.

  • Section 2. Structure prediction accuracy:

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    * * * *

    *

    ** * *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    * ** *

    *

    ** * *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    ** *

    *

    *

    * * *

    *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    *

    *

    **

    *

    *

    * *

    *

    *

    tRNA, E.coli 5S rRNA, E.coli

    Hepatitis C Virus(HCV) IRES Domain Group I Intron, T. thermophila

    * * *

    *

    * * *

    *

    * **

    *

    * * * *

    *

    *

    *

    **

    *

    **

    *

    * * *

    (A) (B)

    (C) (D)

  • Figure S1. Average F1 score of structure predictions of sequences that did not have SHAPE mapping data. F1 score of structures predictions obtained by running the methods with 5, 10, or 20 input sequences on (A) tRNA, (B) 5S rRNA, (C) hepatitis C virus IRES domain, (D) group I intron, (E) lysine riboswitch, (F) M-box riboswitch, (G) SAM I riboswitch, (H) TPP riboswitch,

    23S rRNA, E. colicyclic-di-GMP riboswitch, V. cholerae

    Lysine riboswitch, T. maritime M-box riboswitch, B. subtilis

    SAM I riboswitch, T. tencongensis TPP riboswitch, E. coli

    *

    * ** * * *

    *

    *

    *

    *

    * *

    **

    *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    *

    **

    *

    *

    *

    * *

    *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    * *

    *

    * *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    * *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    *

    * **

    *

    *

    * *

    *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    *

    *

    *

    *

    *

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    * * *

    *

    *

    * * *

    * **

    * * *

    * *

    * *

    *

    *

    * *

    *

    *

    * *

    *

    *

    **

    *

    *

    *

    ** *

    *

    ***

    * *

    *

    * *

    * *

    *

    * *

    * *

    *

    (E) (F)

    (G) (H)

    (I) (J)

  • (I) cyclic-di-GMP riboswitch, and (J) 23S rRNA test datasets. The star (*) above the bar for a method indicates that the difference in F1 score between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

  • Figure S2. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on tRNA test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on tRNA test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    *

    *

    *

    * *

    *

    *

    * *

    *

    *

    *

    *

    *

    * *

    *

    * *

    *

    *

    * *

    *

    *

    *

    * **

    *

    ** * *

    *

    * *

    *

    ** *

    *

    * * *

    *

    Sensitivity PPV

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    ***

    *

    *

    * *

    * *

    *

  • Figure S3. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on 5S rRNA test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on 5S rRNA test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    *

    **

    **

    *

    *

    * * *

    *

    *

    *

    *

    ** *

    *

    *

    *

    * *

    *

    *

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    *

    * *

    *

    * *

    *

    * * *

    *

    * **

    *

    ** *

    *

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    Sensitivity PPV

    *** *

    * *

  • Figure S4. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on hepatitis C virus (HCV) IRES domain test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with 5, 10, or 20 input sequences on hepatitis C virus (HCV) IRES domain test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    *

    *

    *

    * *

    *

    *

    * *

    *

    * *

    **

    *

    *

    *

    * *

    *

    *

    *

    * *

    *

    *

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    Sensitivity PPV

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    *

    * *

    *

    * * *

    *

    * * *

    ** *

    *

    **

    *

    *

    ** *

    *

    *

    * *

    *

  • Figure S5. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on group I intron test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on group I intron test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    * *

    *

    *

    * *

    *

    **

    * *

    * *

    * *

    *

    *

    * *

    *

    *

    *

    * *

    *

    * *

    *

    * *

    *

    *

    * *

    *

    **

    *

    *

    * **

    *

    * **

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    Sensitivity PPV

    0.20.30.40.5

    0.6

    0.70.80.9 1

    00.1

    0.20.30.40.5

    0.6

    0.70.80.9 1

    00.1

    **

    * *

    * **

    *

    *

    *

  • Figure S6. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on lysine riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with 5, 10, or 20 input sequences on lysine riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    *

    * *

    *

    *

    * *

    **

    * *

    *

    *

    *

    **

    *

    * *

    *

    *

    *

    *

    *

    * *

    *

    *

    **

    *

    **

    **

    *

    *

    **

    *

    *

    *

    **

    Sensitivity PPV

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1* * * *

    *

    *

  • Figure S7. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on M-box riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with 5, 10, or 20 input sequences on M-box riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    *

    *

    * **

    *

    * *

    **

    *

    *

    *

    *

    *

    *

    **

    *

    *

    *

    * *

    *

    **

    *

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq 0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    Sensitivity PPV

    ***

    * *

  • Figure S8. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that d0 not have SHAPE mapping data (bottom) on SAM I riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on SAM I riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    * *

    * *

    *

    *

    * *

    *

    *

    * *

    * *

    * *

    *

    *

    * *

    *

    * *

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    Sensitivity PPV

    0.2 0.3 0.4 0.5

    0.6

    0.7 0.8 0.9 1

    0 0.1

    0.2 0.3 0.4 0.5

    0.6

    0.7 0.8 0.9 1

    0 0.1

    *

    * *

    * *

    ** *

    *

    *

    *

    *

    *

    *

    *

    **

    *

    *

    **

    *

    *

    *

    * *

    *

    *

    * *

    *

  • Figure S9. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on TPP riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on TPP riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    * *

    *

    *

    * *

    *

    *

    * *

    *

    **

    *

    *

    *

    * * *

    *

    *

    * *

    *

    *

    Sensitivity PPV

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    * ** *

    ***

    * *

    * *

    *

    * *

    *

    * *

    *

    **

    *

    *

    *

    *

    * *

    *

  • Figure S10. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on cyclic-di-GMP riboswitch test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on cyclic-di-GMP riboswitch test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    *

    * *

    * *

    * *

    * *

    *

    * *

    * * *

    * *

    * * ** *

    * * ** * * *

    Sensitivity PPV

    5seq 10seq 20seq 5seq 10seq 20seq

    5seq 10seq 20seq 5seq 10seq 20seq

    0.20.30.40.5

    0.6

    0.70.80.9 1

    00.1

    0.20.30.40.5

    0.6

    0.70.80.9 1

    00.1

    * * *

    * * *

    **

    * *

    *

    * *

    ** * *

    **

    *

    *

    *

    *

    * *

    *

  • Figure S11. Average Sensitivity and PPV of structure predictions of sequences that have SHAPE mapping data (top) and sequences that do not have SHAPE mapping data (bottom) on 23S rRNA test datasets. Sensitivity and PPV of structures predictions obtained by running the methods with H = 5, 10, or 20 input sequences on 23S rRNA test datasets. The star (*) above the bar for a method indicates that the difference in sensitivity or PPV between the method and TurboFold II+SHAPE is statistically significant, as determined by paired t-tests.

    00.10.20.30.40.50.60.70.80.9 1

    5seq 10seq 20seq

    *

    *

    *

    **

    *

    *

    *

    * * *

    *

    5seq 10seq 20seq

    *

    *

    *

    * * *

    *

    *

    * * *

    *

    *

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

    5seq 10seq 20seq

    *

    *

    **

    *

    * *

    *

    5seq 10seq 20seq

    * * *

    *

    * *

    *

    ** *

    *

    Sensitivity PPV

    **

    *

    * * *

    *

    **

    *

    **

    * **

  • Figure S12. Scatter plots of F1 score of structure predictions obtained with TurboFold II and TurboFold II + SHAPE for sequences (20 sequence groups) that did not have SHAPE mapping data. The F1 scores of structures predictions are obtained by running the methods with H = 20 input sequences on tRNA, 5S rRNA, hepatitis C virus IRES domain, and group I intron RNA test datasets. Each point represents the F1 scores by TurboFold II and TurboFold II + SHAPE for one sequence.

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1 0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1 0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    tRNA 5S rRNA

    Hepatitis C Virus(HCV) IRES domain Group I intron

    F1 score (TurboFold II) F1 score (TurboFold II)

    F1 score (TurboFold II) F1 score (TurboFold II)

    (A) (B)

    (C) (D)

  • Figure S12. Scatter plots of F1 score of structure predictions obtained with TurboFold II and TurboFold II+SHAPE for sequences (20 sequence groups) that do not have SHAPE mapping data. F1 score of structures predictions obtained by running the methods with H = 20 input sequences on lysine riboswitch, M-box riboswitch, SAM I riboswitch, and cyclic-di-GMP riboswitch test datasets. Each dot represents the F1 scores by TurboFold II and TurboFold II+SHAPE.

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1 0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1 0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    SAM I riboswitch cyclic-di-GMP riboswitch

    Lysine riboswitch M-box riboswitch(E)F1

    sco

    re (T

    urbo

    Fold

    II +

    SH

    AP

    E)

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )F1 score (TurboFold II) F1 score (TurboFold II)

    F1 score (TurboFold II)F1 score (TurboFold II)

    (F)

    (G) (H)

  • Figure S12. Scatter plots of F1 score of structure predictions obtained with TurboFold II and TurboFold II+SHAPE for sequences (20 sequence groups) that do not have SHAPE mapping data. F1 score of structures predictions obtained by running the methods with 5 input sequences (left) and H = 20 input sequences (right) on (A) tRNA, (B) 5S rRNA, (C) hepatitis C virus IRES domain, (D) group I intron, (E) lysine riboswitch, (F) M-box riboswitch, (G) SAM I riboswitch, (H) cyclic-di-GMP riboswitch, (I) 23S rRNA (5 sequences), and (J) 23S rRNA (20 sequences) test datasets. Each dot represents the F1 scores by TurboFold II and TurboFold II + SHAPE.

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1 0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    23S rRNA (5 seq) 23S rRNA (20 seq)

    F1 score (TurboFold II) F1 score (TurboFold II)

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    F1 s

    core

    (Tur

    boFo

    ld II

    + S

    HA

    PE

    )

    (I) (J)

  • Table S4. Average structure prediction sensitivity and PPV on sequences without SHAPE data for each method on each dataset:

    5S rRNA Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences

    sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.880 0.927 0.871 0.913 0.873 0.903

    TurboFold II 0.861 0.888 0.864 0.883 0.869 0.873 RNAalifold + SHAPE 0.914 0.900 0.823 0.921 0.782 0.932

    RNAalifold 0.912 0.914 0.815 0.928 0.776 0.932

    MaxExpect 0.636 0.619 0.564 0.551 0.569 0.544

    Group I intron Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences

    sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.749 0.797 0.754 0.800 0.763 0.807

    TurboFold II 0.735 0.769 0.742 0.774 0.750 0.775

    RNAalifold + SHAPE 0.163 0.375 0.092 0.554 0.052 0.537

    RNAalifold 0.160 0.398 0.095 0.547 0.054 0.558

    MaxExpect 0.608 0.584 0.604 0.585 0.612 0.594

    HCV

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.705 0.676 0.710 0.686 0.717 0.685 TurboFold II 0.581 0.547 0.586 0.555 0.592 0.557

    RNAalifold + SHAPE 0.510 0.510 0.493 0.579 0.549 0.737 RNAalifold 0.496 0.540 0.481 0.570 0.534 0.715 MaxExpect 0.504 0.456 0.469 0.426 0.480 0.431

  • tRNA

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.945 0.981 0.949 0.973 0.948 0.968

    TurboFold II 0.916 0.944 0.930 0.939 0.922 0.933

    RNAalifold + SHAPE 0.786 0.853 0.840 0.905 0.833 0.920

    RNAalifold 0.837 0.856 0.833 0.910 0.833 0.920

    MaxExpect 0.763 0.752 0.768 0.742 0.771 0.742

    TPP riboswitch

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.744 0.773 0.819 0.829 0.816 0.812

    TurboFold II 0.752 0.775 0.820 0.833 0.816 0.801

    RNAalifold + SHAPE 0.382 0.808 0.335 0.952 0.288 0.980

    RNAalifold 0.379 0.917 0.332 0.953 0.294 0.980

    MaxExpect 0.535 0.428 0.547 0.436 0.552 0.431

    SAM I riboswitch

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.905 0.784 0.908 0.768 0.910 0.772 TurboFold II 0.911 0.785 0.908 0.762 0.908 0.762

    RNAalifold + SHAPE 0.206 0.552 0.430 0.902 0.464 0.945

    RNAalifold 0.671 0.824 0.604 0.886 0.510 0.937 MaxExpect 0.826 0.680 0.822 0.667 0.840 0.681

  • M-box riboswitch

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.727 0.734 0.734 0.724 0.738 0.733

    TurboFold II 0.730 0.729 0.743 0.720 0.744 0.729

    RNAalifold + SHAPE 0.630 0.722 0.502 0.774 0.536 0.826

    RNAalifold 0.660 0.721 0.556 0.767 0.565 0.814

    MaxExpect 0.636 0.608 0.658 0.615 0.663 0.626

    Lysine riboswitch

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.885 0.862 0.873 0.834 0.878 0.838 TurboFold II 0.880 0.842 0.871 0.819 0.875 0.823

    RNAalifold + SHAPE 0.494 0.733 0.394 0.794 0.274 0.762 RNAalifold 0.670 0.796 0.440 0.799 0.294 0.779 MaxExpect 0.760 0.709 0.651 0.604 0.677 0.627

    Cyclic-di-GMP riboswitch

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.874 0.887 0.902 0.897 0.900 0.901

    TurboFold II 0.884 0.876 0.882 0.871 0.889 0.874

    RNAalifold + SHAPE 0.624 0.759 0.626 0.902 0.511 0.974

    RNAalifold 0.665 0.881 0.623 0.904 0.498 0.974

    MaxExpect 0.865 0.329 0.809 0.306 0.810 0.313

  • 23S rRNA

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.823 0.868 0.834 0.876 0.803 0.848

    TurboFold II 0.788 0.834 0.817 0.858 0.826 0.865

    RNAalifold + SHAPE 0.699 0.793 0.693 0.867 0.696 0.895

    RNAalifold 0.764 0.828 0.746 0.885 0.718 0.902

    MaxExpect 0.520 0.533 0.511 0.521 0.515 0.507

  • Table S5. Average structure prediction sensitivity and PPV on sequences with SHAPE data for each method on each dataset:

    5S rRNA Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences

    sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.950 0.917 0.966 0.918 0.971 0.919

    TurboFold II 0.850 0.859 0.901 0.913 0.909 0.914

    RNAalifold + SHAPE 0.871 0.896 0.803 0.945 0.764 0.964

    RNAalifold 0.876 0.914 0.797 0.955 0.761 0.967 Rsample 0.857 0.833 0.857 0.833 0.857 0.833

    MaxExpect 0.286 0.263 0.286 0.263 0.286 0.263

    Group I intron Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences

    sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.968 0.889 0.962 0.877 0.963 0.874

    TurboFold II 0.884 0.837 0.903 0.853 0.907 0.858

    RNAalifold + SHAPE 0.124 0.294 0.072 0.433 0.042 0.379

    RNAalifold 0.116 0.288 0.073 0.425 0.046 0.425 Rsample 0.924 0.816 0.924 0.816 0.924 0.816

    MaxExpect 0.849 0.766 0.849 0.766 0.849 0.766

    HCV

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 0.586 0.648 0.576 0.634 0.631 0.694

    TurboFold II 0.473 0.527 0.474 0.519 0.469 0.513

    RNAalifold + SHAPE 0.354 0.568 0.328 0.592 0.353 0.740

    RNAalifold 0.311 0.534 0.313 0.572 0.339 0.715 Rsample 0.798 0.864 0.798 0.864 0.798 0.864

    MaxExpect 0.548 0.612 0.548 0.612 0.548 0.612

  • tRNA

    Prediction Method H = 5 sequences H = 10 sequences H = 20 sequences sensitivity PPV sensitivity PPV sensitivity PPV

    TurboFold II + SHAPE 1.000 1.000 1.000 1.000 1.000 1.000

    TurboFold II 0.990 1.000 1.000 1.000 1.000 1.000

    RNAalifold + SHAPE 0.852 0.936 0.883 0.951 0.836 0.938

    RN