RNA secondary structure prediction consists of predicting the 2D fo...
The Weeks laboratory works on RNA structure prediction and has a ph...
Several illustrative examples of pseudoknots: ![Imgur](https://i.im...
There are over 250 software packages for RNA structure prediction. ...
Math background on pseudoknot prediction: http://math.mit.edu/clas...
This simple, concise equation is packed with information. The key t...
Computational RNA secondary prediction is empirically improved sign...
Riboswitches are regulatory segments of messenger RNA molecules tha...
Accurate SHAPE-directed RNA secondary structure
modeling, including pseudoknots
Christine E. Hajdin, Stanislav Bellaousov, Wayne Huggins, Christopher W. Leonard, David H. Mathews, and Kevin M. Weeks
Department of Chemistry, University of North Carolina, Chapel Hill, NC 27599-3290; and
Department of Biochemistry and Biophysics, and Center for RNA
Biology, University of Rochester Medical Center, Rochester, NY 14642
Edited by Ignacio Tinoco, University of California, Berkeley, CA, and approved February 5, 2013 (received for review November 15, 2012)
A pseudoknot forms in an RNA when nucleotides in a loop pair
with a region outside the helices that close the loop. Pseudoknots
occur relatively rarely in RNA but are highly overrepresented in
functionally critical motifs in large catalytic RNAs, in riboswitches,
and in regulatory elements of viruses. Pseudoknots are usually
excluded from RNA structure prediction algorithms. When included,
these pairings are difficult to model accurately, especially in large
RNAs, because allowing this structure dramatically increases the
number of possible incorrect folds and because it is difficult to
search the fold space for an optimal structure. We have developed
a concise secondary structure modeling approach that combines
SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension)
experimental chemical probing information and a simple, but
robust, energy model for the entropic cost of single pseudoknot
formation. Structures are predicted with iterative refinement, using
a dynamic programming algorithm. This melded experimental and
thermodynamic energy function predicted the secondary structures
and the pseudoknots for a set of 21 challenging RNAs of known
structure ranging in size from 34 to 530 nt. On average, 93% of
known base pairs were predicted, and all pseudoknots in well-folded
RNAs were identified.
nearest neighbor parameters | circle plot | polymer model
RNA constitutes the central information conduit in biology (1).
Information is encoded in an RNA molecule at two levels: in
its primary sequence and in its ability to form higher-order
secondary and tertiary structures. Nearly all RNAs can fold to form
some secondary structure and, in many RNAs, highly structured
regions encode important regulatory motifs. Such structured
regulatory elements can be composed of canonical base pairs but
may also feature specialized and distinctive RNA structures.
Among the best characterized of these specialized structures are
RNA pseudoknots. Pseudoknots are relatively rare but occur
overwhelmingly in functionally important regions of RNA (2–4).
For example, all of the large catalytic RNAs contain pseudoknots
(5, 6); roughly two-thirds of the known classes of riboswitches
contain pseudoknots that appear to be essential for ligand binding
and gene regulatory functions (7); and pseudoknots occur prominently
in the regulatory elements that viruses use to usurp cellular
metabolism (3). Pseudoknots are thus harbingers of biological
function. An important and challenging goal is to identify these
structures reliably.
Pseudoknots are excluded from the most widely used algorithms
that model RNA secondary structure (8). This exclusion is
based on the challenge of incorporating the pseudoknot structure
into the efficient dynamic programming algorithm used in
the most popular secondary structure prediction approaches and
because of the additional computational effort required. The
prediction of lowest free energy structures with pseudoknots is
NP-complete (9), which means that no known algorithm finds the
lowest free energy structure in time that is polynomial in
sequence length. In addition, allowing pseudoknots greatly increases the
number of (incorrect) helices possible and tends to reduce secondary
structure prediction accuracies, even for RNAs that include
pseudoknots. Current algorithms also have high false-positive
rates for pseudoknot prediction, necessitating extensive
follow-up testing and analysis of proposed structures.
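As a rough illustration of why nested (non-pseudoknotted) structures fit dynamic programming so well, the sketch below uses the classic Nussinov base-pair maximization recursion. This is a deliberately simpler stand-in for the free energy minimization used by the prediction programs discussed here, not the authors' method. The function name, the `min_loop` parameter, and the set of allowed pairs are choices made for this example only.

```python
# Minimal Nussinov-style sketch: maximize the number of nested base pairs.
# Assumption: this counts pairs instead of minimizing free energy, and it
# cannot represent pseudoknots, because every pair (k, j) splits the
# interval into two independent sub-intervals.

ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pairs
}


def max_nested_pairs(seq, min_loop=3):
    """Return the maximum number of nested pairs for an RNA sequence.

    min_loop is the minimum number of unpaired nucleotides enclosed by
    a hairpin-closing pair (hypothetical default of 3 for illustration).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: nucleotide j is unpaired
            # case 2: nucleotide j pairs with some k in [i, j - min_loop - 1]
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1]
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]
```

The recursion runs in cubic time because each pair divides the problem into an "inside" and an "outside" interval that can be solved separately. A pseudoknotted pair links those two intervals, which is what breaks this decomposition.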
Pseudoknot prediction is challenging, in part, for the same
reasons that RNA secondary structure prediction is difficult.
First, energy models for loops are incomplete because they
extrapolate from a limited set of experiments. Second, folding can
be affected by kinetic, ligand-mediated, tertiary, and transient
interactions that are difficult or impossible to glean from the
sequence. Prediction is also difficult for a third reason unique to
pseudoknots: Energy models for pseudoknot formation are generally
incomplete because the factors governing their stability are
not fully understood (10–12). The result is that current algorithms
that model pseudoknots predict the base pairs in the simplest
pseudoknots (termed H-type, formed when bases in a loop region
bind to a single-stranded region), when the beginning and end of
the pseudoknotted structure are known, with accuracies of only
about 75% (10). Secondary structure prediction is much less accurate
for full-length biological RNA sequences, with as few as
5% of known pseudoknotted pairs predicted correctly and with
more false-positive than correct pseudoknot predictions in some
benchmarks (13).
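The structural condition that makes a set of base pairs pseudoknotted can be stated compactly: two pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below checks a pair list for such crossings. The function names and the convention of representing a structure as a list of index tuples are assumptions made for this example.

```python
# Minimal sketch: detect crossing (pseudoknotted) base pairs.
# Assumption: a structure is given as an iterable of (i, j) index pairs.


def crossing_pairs(pairs):
    """Return every pair of base pairs ((i, j), (k, l)) with i < k < j < l."""
    # Normalize so that each pair is (smaller, larger) and sort by 5' index.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    crossings = []
    for a in range(len(normalized)):
        i, j = normalized[a]
        for b in range(a + 1, len(normalized)):
            k, l = normalized[b]
            if i < k < j < l:
                crossings.append(((i, j), (k, l)))
    return crossings


def has_pseudoknot(pairs):
    """True if any two base pairs in the structure cross."""
    return bool(crossing_pairs(pairs))


# A plain hairpin is fully nested; an H-type arrangement has a second
# helix whose 3' strand lies outside the first helix.
hairpin = [(1, 10), (2, 9), (3, 8)]
h_type = [(1, 12), (2, 11), (6, 18), (7, 17)]
```

For the two example structures, `has_pseudoknot(hairpin)` is false and `has_pseudoknot(h_type)` is true. The pairwise scan is quadratic in the number of pairs, which is adequate for RNAs of the sizes discussed here.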
Author contributions: C.E.H., S.B., W.H., D.H.M., and K.M.W. designed research; C.E.H.,
S.B., W.H., and C.W.L. performed research; C.E.H., S.B., W.H., C.W.L., D.H.M., and K.M.W.
analyzed data; and C.E.H., S.B., D.H.M., and K.M.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Structure probing data have been deposited in the single nucleotide
resolution nucleic acid structure mapping (SNRNASM) community structure probing database
(snrnasm.bio.unc.edu).
C.E.H. and S.B. contributed equally to this work.
To whom correspondence may be addressed. E-mail: weeks@unc.edu or David_
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.
April 2, 2013, vol. 110, no. 14, www.pnas.org/cgi/doi/10.1073/pnas.1219988110

The accuracy of secondary structure prediction is improved
dramatically by including experimental information as restraints
(14, 15). Selective 2′-hydroxyl acylation analyzed by primer extension
(SHAPE) probing data have proved especially useful in
yielding robust working models for RNA secondary structure
(15, 16). In essence, inclusion of SHAPE information provides
an experimental adjustment to the well-established, nearest-neighbor
model parameters (17) for RNA folding. This adjustment
is implemented as a simple pseudo-free energy change
term, $\Delta G^{\circ}_{\mathrm{SHAPE}}$. SHAPE reactivities are approximately inversely
proportional to the probability that a given nucleotide is base
paired (high reactivities correspond to a low likelihood of being
paired and vice versa) and the logarithm of a probability corresponds
to an energy, in this case $\Delta G^{\circ}_{\mathrm{SHAPE}}$, which has the form

$$\Delta G^{\circ}_{\mathrm{SHAPE}} = m \ln[\mathrm{SHAPE} + 1] + b \qquad [1]$$

The slope, m, corresponds to a penalty for base pairing that
increases with the experimental SHAPE reactivity, and the intercept,
b, reflects a favorable pseudo-free energy change term
for base pairing at nucleotides with low SHAPE reactivities.
These two parameters must be determined empirically. This
pseudo-free energy change approach yields high-quality secondary
structure models for both short RNAs and those that
are kilobases long (15, 16).
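The pseudo-free energy term of Eq. 1 can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: the default values for `m` and `b` are placeholders assumed for this example (the text states only that they must be determined empirically), and treating a missing reactivity as a zero contribution is likewise an assumption.

```python
import math

# Placeholder parameters, assumed for illustration only (kcal/mol).
DEFAULT_SLOPE_M = 1.8
DEFAULT_INTERCEPT_B = -0.6


def shape_pseudo_energy(reactivity, m=DEFAULT_SLOPE_M, b=DEFAULT_INTERCEPT_B):
    """Evaluate Eq. 1: m * ln(SHAPE + 1) + b for one nucleotide.

    Assumption: a reactivity of None means "no data" and contributes 0.
    """
    if reactivity is None:
        return 0.0
    if reactivity <= -1.0:
        # ln(SHAPE + 1) is undefined for SHAPE <= -1.
        raise ValueError("SHAPE reactivity must be greater than -1")
    return m * math.log(reactivity + 1.0) + b


def shape_profile_energies(reactivities, m=DEFAULT_SLOPE_M, b=DEFAULT_INTERCEPT_B):
    """Apply Eq. 1 to every nucleotide in a reactivity profile."""
    return [shape_pseudo_energy(r, m, b) for r in reactivities]
```

With these placeholder parameters, an unreactive nucleotide (SHAPE = 0) receives the intercept `b` as a small bonus for pairing, and the term becomes a penalty once `m * ln(SHAPE + 1)` exceeds `-b`.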
Our original SHAPE-directed algorithm did not allow for
pseudoknotted base pairs (15). Given the strong relationship
between pseudoknots and functionally critical regions in RNA
and the fact that it is impossible to know a priori whether an
RNA contains a pseudoknot, this limitation severely restricts
the accuracy and generality of experimentally directed RNA
structure analysis. Here, we describe a concise approach for
applying SHAPE-directed RNA secondary structure modeling
to include pseudoknots, in an algorithm we call ShapeKnots,
and we show that the algorithm yields high-quality structures
for diverse RNA sequences.
Challenging RNA Test Set. We developed the ShapeKnots algorithm,
using a test set of 16 nonpseudoknotted and pseudoknot-containing
RNAs that were selected for their complex, and
generally difficult to predict, structures (Table 1, Top). These
RNAs included (i) 5 RNAs with lengths >300 nt, both with and
without pseudoknots; (ii) 5 riboswitch RNAs whose structures
form only upon binding by specific ligands, for which thermodynamic
rules are obligatorily incomplete; (iii) 4 RNAs with
structures that are predicted especially poorly, with accuracies
<60% using nearest-neighbor thermodynamic parameters; and
(iv) 3 RNAs whose structures are probably modulated by protein
binding. SHAPE experiments were performed on each of the
RNAs in the presence of ligand if applicable but in the absence
of any protein. Each of the training set RNAs had SHAPE probing
patterns that suggested these RNAs folded in solution into
structures generally consistent with accepted secondary structure
models based on either X-ray crystallography or comparative sequence
analyses. The structures of the 16 RNAs in the test set
are predicted poorly by a conventional algorithm based on their
sequences alone: The average sensitivity (sens, fraction of base
pairs in the accepted structure predicted correctly), positive
predictive value (ppv, the fraction of predicted pairs that occur in
the accepted structure), and geometric average of these metrics
are 72%, 78%, and 74%, respectively (Table 1).
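The three accuracy metrics defined above can be computed directly from two lists of base pairs. The sketch below is a minimal illustration, assuming that a predicted pair counts as correct only when it matches an accepted pair exactly. The scoring used for Table 1 may be more lenient, and the function name is hypothetical.

```python
import math


def prediction_accuracy(accepted_pairs, predicted_pairs):
    """Return (sens, ppv, geo) for a predicted secondary structure.

    sens: fraction of accepted pairs that were predicted.
    ppv:  fraction of predicted pairs that occur in the accepted structure.
    geo:  geometric average of sens and ppv.
    Assumption: only exact (i, j) matches count as correct.
    """
    accepted = {tuple(sorted(p)) for p in accepted_pairs}
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    correct = len(accepted & predicted)
    sens = correct / len(accepted) if accepted else 0.0
    ppv = correct / len(predicted) if predicted else 0.0
    geo = math.sqrt(sens * ppv)
    return sens, ppv, geo


# Example: 4 accepted pairs, 3 predicted pairs, 2 of them correct.
accepted = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(1, 20), (2, 19), (5, 16)]
sens, ppv, geo = prediction_accuracy(accepted, predicted)
```

In this example, sens is 2/4 and ppv is 2/3. Reporting the geometric average alongside both values keeps a prediction from scoring well by being either very sparse (high ppv, low sens) or very dense (high sens, low ppv).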
In the process of developing this training set, we also analyzed two
RNAs, RNase P RNA and the human signal recognition particle
RNA, whose in vitro SHAPE reactivities were incompatible with
the accepted structures for these RNAs. We include prediction
statistics for these RNAs (Table 1, Bottom) but did not use these to
evaluate our SHAPE-directed modeling algorithm.
Simple, Robust Model for Pseudoknot Formation. The favorable
energetic contributions for forming the helices that comprise a
pseudoknot are likely to be predicted accurately by the Turner
Table 1. Prediction accuracies as a function of algorithm and SHAPE information
Sensitivities (sens), positive predictive value (ppv), and their geometric average (geo) are shown for four test cases: no pseudoknots allowed and no SHAPE
data, no pseudoknots allowed and with SHAPE data (both by free energy minimization), pseudoknots allowed and no SHAPE data, and pseudoknots allowed
and with SHAPE data (both using ShapeKnots). Complicating features are ligand (L) binding and protein (P) binding that are not accounted for in
nearest-neighbor thermodynamic parameters. Pseudoknot (PK) predictions are indicated with a checkmark (✓) or an X; a checkmark indicates that pseudoknots were
predicted correctly and that there were no false-positive pseudoknot predictions. For the ribosomal RNAs (), regions in which the SHAPE reactivities were
clearly incompatible with the accepted structure, as described in ref. 15, were omitted from the sensitivity and ppv calculations; for the E. coli 16S rRNA, this
included nucleotides 143–220. The HIV-1 5′ leader domain (§) was included as an example of pseudoknot prediction in a large RNA. Because the accepted
structure for this RNA is based on SHAPE-directed prediction (24), we did not include sensitivity and ppv for this RNA in the overall average values; however,
the pseudoknot was proved independently (23) and is included.