Accurate SHAPE-directed RNA secondary structure
modeling, including pseudoknots
Christine E. Hajdin^{a,1}, Stanislav Bellaousov^{b,1}, Wayne Huggins^{a}, Christopher W. Leonard^{a}, David H. Mathews^{b,2}, and Kevin M. Weeks^{a,2}
^a Department of Chemistry, University of North Carolina, Chapel Hill, NC 27599-3290; and ^b Department of Biochemistry and Biophysics, and Center for RNA Biology, University of Rochester Medical Center, Rochester, NY 14642
Edited by Ignacio Tinoco, University of California, Berkeley, CA, and approved February 5, 2013 (received for review November 15, 2012)
A pseudoknot forms in an RNA when nucleotides in a loop pair with a region outside the helices that close the loop. Pseudoknots occur relatively rarely in RNA but are highly overrepresented in functionally critical motifs in large catalytic RNAs, in riboswitches, and in regulatory elements of viruses. Pseudoknots are usually excluded from RNA structure prediction algorithms. When included, these pairings are difficult to model accurately, especially in large RNAs, because allowing this structure dramatically increases the number of possible incorrect folds and because it is difficult to search the fold space for an optimal structure. We have developed a concise secondary structure modeling approach that combines SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension) experimental chemical probing information and a simple, but robust, energy model for the entropic cost of single pseudoknot formation. Structures are predicted with iterative refinement, using a dynamic programming algorithm. This melded experimental and thermodynamic energy function predicted the secondary structures and the pseudoknots for a set of 21 challenging RNAs of known structure ranging in size from 34 to 530 nt. On average, 93% of known base pairs were predicted, and all pseudoknots in well-folded RNAs were identified.
thermodynamics | nearest neighbor parameters | circle plot | polymer model | 1M7
RNA constitutes the central information conduit in biology (1). Information is encoded in an RNA molecule at two levels: in its primary sequence and in its ability to form higher-order secondary and tertiary structures. Nearly all RNAs can fold to form some secondary structure and, in many RNAs, highly structured regions encode important regulatory motifs. Such structured regulatory elements can be composed of canonical base pairs but may also feature specialized and distinctive RNA structures. Among the best characterized of these specialized structures are RNA pseudoknots. Pseudoknots are relatively rare but occur overwhelmingly in functionally important regions of RNA (2–4). For example, all of the large catalytic RNAs contain pseudoknots (5, 6); roughly two-thirds of the known classes of riboswitches contain pseudoknots that appear to be essential for ligand binding and gene regulatory functions (7); and pseudoknots occur prominently in the regulatory elements that viruses use to usurp cellular metabolism (3). Pseudoknots are thus harbingers of biological function. An important and challenging goal is to identify these structures reliably.
Pseudoknots are excluded from the most widely used algorithms that model RNA secondary structure (8). This exclusion reflects both the challenge of incorporating pseudoknots into the efficient dynamic programming algorithm used in the most popular secondary structure prediction approaches and the additional computational effort required. The prediction of lowest free energy structures with pseudoknots is NP-complete (9), which means that no algorithm is known that finds the lowest free energy structure in a time that scales polynomially with sequence length. In addition, allowing pseudoknots greatly increases the number of possible (incorrect) helices and tends to reduce secondary structure prediction accuracies, even for RNAs that include pseudoknots. Current algorithms also have high false-positive rates for pseudoknot prediction, necessitating extensive follow-up testing and analysis of proposed structures.
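The dynamic programming approaches mentioned above rely on base pairs being nested; a pseudoknot is the case in which two pairs (i, j) and (k, l) cross, with i < k < j < l. The minimal Python sketch below checks a list of base pairs for such crossings. The function names and the toy pair lists are hypothetical and chosen only for illustration; this is a way to recognize a pseudoknot in a given structure, not a prediction method.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return all pairs of base pairs ((i, j), (k, l)) with i < k < j < l.

    `pairs` is an iterable of (position, position) tuples; the order
    within each tuple does not matter.
    """
    # Normalize each pair to (5' index, 3' index) and sort by 5' index.
    normalized = sorted((min(a, b), max(a, b)) for a, b in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k. The pairs cross when k lies inside
        # (i, j) and l lies outside it.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


def has_pseudoknot(pairs):
    """True if at least two base pairs in the structure cross."""
    return bool(crossing_pairs(pairs))


# Hypothetical toy structures (1-based nucleotide positions).
nested_hairpin = [(1, 10), (2, 9), (3, 8)]
h_type_like = [(1, 10), (2, 9), (5, 15), (6, 14)]

print(has_pseudoknot(nested_hairpin))
print(has_pseudoknot(h_type_like))
```

The check is quadratic in the number of pairs, which is adequate for inspecting a single structure; it says nothing about the cost of searching over all pseudoknotted folds.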
Pseudoknot prediction is challenging, in part, for the same reasons that RNA secondary structure prediction is difficult. First, energy models for loops are incomplete because they extrapolate from a limited set of experiments. Second, folding can be affected by kinetic, ligand-mediated, tertiary, and transient interactions that are difficult or impossible to glean from the sequence. Prediction is also difficult for a third reason unique to pseudoknots: Energy models for pseudoknot formation are generally incomplete because the factors governing their stability are not fully understood (10–12). As a result, even when the beginning and end of the pseudoknotted structure are known, current algorithms that model pseudoknots predict the base pairs in the simplest pseudoknots (termed H-type, formed when bases in a loop region bind to a single-stranded region) with accuracies of only about 75% (10). Secondary structure prediction is much less accurate for full-length biological RNA sequences, with as few as 5% of known pseudoknotted pairs predicted correctly and with more false-positive than correct pseudoknot predictions in some benchmarks (13).
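The accuracy figures quoted above can be made concrete with two standard measures: sensitivity, the fraction of known base pairs that are predicted, and positive predictive value (PPV), the fraction of predicted pairs that are known. The Python sketch below computes both under the simplifying assumption that a predicted pair counts only if it matches a known pair exactly; the benchmarks cited in the text may use a different matching rule. The function names and the example pair lists are hypothetical.

```python
def pair_set(pairs):
    """Normalize base pairs to a set of (5' index, 3' index) tuples."""
    return {(min(i, j), max(i, j)) for i, j in pairs}


def sensitivity_and_ppv(predicted, known):
    """Return (sensitivity, PPV) for a predicted structure.

    sensitivity = correctly predicted pairs / known pairs
    PPV         = correctly predicted pairs / predicted pairs
    Exact-match scoring is an assumption of this sketch.
    """
    pred, ref = pair_set(predicted), pair_set(known)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else float("nan")
    ppv = true_positives / len(pred) if pred else float("nan")
    return sensitivity, ppv


# Hypothetical example: three of four known pairs recovered,
# plus one false-positive pair.
known_pairs = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted_pairs = [(1, 20), (2, 19), (3, 18), (6, 15)]

sens, ppv = sensitivity_and_ppv(predicted_pairs, known_pairs)
print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

Restricting both inputs to the pairs involved in crossings would give the corresponding measures for pseudoknotted pairs alone.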
The accuracy of secondary structure prediction is improved dramatically by including experimental information as restraints (14, 15). Selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) probing data have proved especially useful in yielding robust working models for RNA secondary structure (15, 16). In essence, inclusion of SHAPE information provides an experimental adjustment to the well-established, nearest-neighbor model parameters (17) for RNA folding. This adjustment is implemented as a simple pseudo-free energy change term, $\Delta G^{\circ}_{\mathrm{SHAPE}}$. SHAPE reactivities are approximately inversely proportional to the probability that a given nucleotide is base paired (high reactivities correspond to a low likelihood of being paired and vice versa) and the logarithm of a probability corresponds to an energy, in this case $\Delta G^{\circ}_{\mathrm{SHAPE}}$, which has the form
$$\Delta G^{\circ}_{\mathrm{SHAPE}} = m \ln[\mathrm{SHAPE} + 1] + b \tag{1}$$
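Eq. 1 can be evaluated directly for a single nucleotide. The Python sketch below does so, taking the slope m and intercept b as arguments; the numeric values used in the example are placeholders chosen for illustration and are not the empirically determined parameters. The helper that solves Eq. 1 for the reactivity at which the term changes sign, SHAPE = exp(−b/m) − 1, is an added convenience and is not taken from the text.

```python
import math


def shape_pseudo_energy(reactivity, m, b):
    """Evaluate Eq. 1: m * ln(reactivity + 1) + b.

    The result is in the same energy units as m and b.
    """
    if reactivity <= -1.0:
        # ln(reactivity + 1) is undefined at or below -1.
        raise ValueError("reactivity must be greater than -1")
    return m * math.log(reactivity + 1.0) + b


def breakeven_reactivity(m, b):
    """Reactivity at which Eq. 1 equals zero: exp(-b / m) - 1."""
    return math.exp(-b / m) - 1.0


# Placeholder parameters (assumptions for illustration only).
M_EXAMPLE = 2.0   # slope: positive, so the term grows with reactivity
B_EXAMPLE = -0.5  # intercept: negative, so the term is favorable at zero reactivity

for reactivity in (0.0, 0.2, 1.0, 2.0):
    energy = shape_pseudo_energy(reactivity, M_EXAMPLE, B_EXAMPLE)
    print(f"SHAPE = {reactivity:.1f} -> pseudo-free energy = {energy:+.2f}")

print(f"sign change at SHAPE = {breakeven_reactivity(M_EXAMPLE, B_EXAMPLE):.3f}")
```

With a positive slope and a negative intercept, the term is negative for unreactive nucleotides and becomes positive above the sign-change reactivity.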
The slope, m, corresponds to a penalty for base pairing that increases with the experimental SHAPE reactivity, and the intercept, b, reflects a favorable pseudo-free energy change term for base pairing at nucleotides with low SHAPE reactivities. These two parameters must be determined empirically. This
Author contributions: C.E.H., S.B., W.H., D.H.M., and K.M.W. designed research; C.E.H.,
S.B., W.H., and C.W.L. performed research; C.E.H., S.B., W.H., C.W.L., D.H.M., and K.M.W.
analyzed data; and C.E.H., S.B., D.H.M., and K.M.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: Structure probing data have been deposited in the single nucleotide resolution nucleic acid structure mapping (SNRNASM) community structure probing database (snrnasm.bio.unc.edu).
^1 C.E.H. and S.B. contributed equally to this work.
^2 To whom correspondence may be addressed. E-mail: weeks@unc.edu or David_Mathews@urmc.rochester.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1219988110/-/DCSupplemental.
5498–5503 | PNAS | April 2, 2013 | vol. 110 | no. 14 | www.pnas.org/cgi/doi/10.1073/pnas.1219988110