Nucleotide secondary structure prediction programs
NASP is an attempt to improve the selectivity with which individual secondary structures can be identified. It uses base pairing probabilities provided by the UNAfold nucleic acid folding program hybrid-ss (Markham and Zuker), which applies a combined partition function calculation, stochastic sampling and dynamic programming approach to compute base pairing probabilities and minimum free energy (MFE) estimates from single-stranded nucleotide sequences.
The rationale behind NASP is simple: we assume that randomly shuffling nucleotides within sequences that have evolved to form stable secondary structures should influence their overall base pairing potential such that the shuffled sequences should yield higher MFE estimates than the real sequences from which they were produced. By comparing MFE estimates made with real sequences to those made with randomized versions of these sequences, NASP tests whether there is evidence that the real sequences have greater structure forming capability than can be accounted for by chance.
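The shuffling rationale can be sketched in a few lines of Python. This is an illustration only: `toy_energy` is a hypothetical stand-in for an MFE estimate (it is not hybrid-ss and does not use a real energy model), and `shuffle_test`, its parameters and the example sequence are names invented for this sketch. The function reports the fraction of shuffled sequences whose energy is lower than that of the real sequence.

```python
import random

# Watson-Crick and G-U wobble pairs (RNA alphabet).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_energy(seq):
    """Hypothetical stand-in for an MFE estimate (NOT hybrid-ss).

    Folds the sequence as a single hairpin, pairing position i with its
    mirror position, and returns minus the number of complementary pairs,
    so that lower values mean more pairing.
    """
    n = len(seq)
    return -sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2))


def shuffle_test(seq, energy_fn=toy_energy, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequences with a lower energy than the real one."""
    rng = random.Random(seed)
    real = energy_fn(seq)
    bases = list(seq)
    lower = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # permute nucleotides, keeping base composition
        if energy_fn("".join(bases)) < real:
            lower += 1
    return lower / n_shuffles


if __name__ == "__main__":
    hairpin = "GGGGGGGGAAAACCCCCCCC"
    print(toy_energy(hairpin), shuffle_test(hairpin))
```

A small fraction suggests that the real sequence pairs better than its base composition alone would explain. In a real analysis, `energy_fn` would be replaced by a call to an actual folding program.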
For each sequence, k, in an input alignment, hybrid-ss estimates the overall Gibbs free energy of an optimally folded nucleotide sequence and yields a list of Boltzmann probabilities, P^k_{i,j}, of individual potential base pairings.
NASP then computes a consensus base pairing matrix, M, whose entries satisfy. It is given by. Thus during the calculation of M ij the factor 2 Cij is used to weigh co-evolving base pairs more heavily than other sites.
NASP aims to find a set of base pairs i. To avoid alignment gaps obscuring signals of conserved structural motifs within M we allow relaxation on the requirement that homologous nucleotides must fall within the same alignment column.
Specifically, we consider nucleotides within different sequences to be potentially homologous if they are separated by no more than d bases within the alignment where d is a user-specified non-negative integer, usually between 0 and This translates into. NASP scans M through the anti-diagonal and recursively identifies groups of potentially base paired nucleotides displaying the highest degree of evolutionary conservation i.
At each step:. The coordinates of nucleotides within the bounds of what appear to be the largest and most evolutionarily conserved structure represented in M are added to a list of potentially paired sites Supplementary Figure S1. All alignment columns that are not included in this list are randomly shuffled or more times with the MFEs of each sequence in each shuffled alignment being compared with those of sequences in the original alignment.
The probability that there remain no unaccounted for paired nucleotides within the alignment fraction excluded from the potentially paired site list is estimated as the fraction of shuffled sequences with MFE estimates lower than those of their unshuffled counterparts.

For example, the binding affinity between different ACE2 receptors and the SARS-CoV-2 spike protein cannot be fully explained by amino acid similarity at ACE2 contact sites because protein structure similarities are not fully reflected by amino acid sequence similarities.
To comprehensively compare protein homology, secondary structure (SS) analysis is required. While protein structure is slow and difficult to obtain, SS predictions can be made rapidly, and a well-predicted SS may serve as a viable proxy to gain biological insight.
Here we review algorithms and information used in predicting protein SS to highlight its potential application in pandemics research. We also show examples of how SS predictions can be used to compare ACE2 proteins and to evaluate the zoonotic origins of viruses. As computational tools are much faster than wet-lab experiments, these applications can be important for research, especially at times when quickly obtained biological insights can help speed up the response to pandemics.
The efficacy of this interaction determines host specificity and severity of infection. The above expectation, while largely correct, is not completely accurate. Second, structural similarity is not fully reflected in sequence similarity. Only through structural studies can we hope to gain mechanistic insights into the differences in mammalian susceptibility to SARS-CoV-2. Nevertheless, protein structure is difficult to obtain, and well-predicted protein secondary structure (SS) may serve as the next best answer.
The Protein Data Bank (PDB) is the main depository of experimentally determined 3D protein structures, and around thousand protein structures are deposited. In silico structure prediction techniques are faster and cheaper, and they have been useful in many research areas. For example, SS predictions have been used in enzyme structure similarity calculations [11], ribosomal protein comparison [12], protein activity mechanisms [13], COVID proteomics [14], and many other areas.
In section 3 we review examples of protein secondary structure prediction (PSSP) algorithms, and in section 4 we review their practical uses in pandemics research. The examples described in this review highlight how PSSP can be a useful tool in pandemics research. In protein structure models, amino acid (aa) sequences are used to predict secondary and tertiary protein structures. SS are often classified into either three or eight structural states.
In addition to PSSP, protein structures can be modeled at the 2D level as contact maps [15] and at the 3D level as tertiary structures. First, unlike 2D or 3D structures, PSSP is reported as a sequence and can be used together with aa chains in multiple sequence alignments.
This makes PSSP modeling useful in identifying proteins that might be more similar in structure than in nucleotide or aa sequence. Second, the sequential nature allows alignment of SS elements with known or exploratory protein hotspots.
Q3 and Q8 represent the percentages of SS sequence positions correctly predicted by the models using three or eight structure states, respectively. SOV is a more complex measure that represents the percentage of segment overlap between predicted and correct sequences. Different protein databases can be used for the evaluation, and the best practice is to use multiple data sets. Also, depending on the data sets and metrics used, the results of PSSP program comparisons vary.
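As a concrete illustration of the Q3 and Q8 definition, the sketch below computes the percentage of positions at which a predicted SS string matches the observed one. The function names are invented for this example, and the eight-to-three state reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is one common convention assumed here, not one stated in the text.

```python
# One common 8-state to 3-state reduction (an assumption; conventions vary).
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three(ss8):
    """Map an 8-state SS string to 3 states; unlisted states become coil 'C'."""
    return "".join(EIGHT_TO_THREE.get(state, "C") for state in ss8)


def q_accuracy(predicted, observed):
    """Percentage of positions where the predicted state equals the observed one.

    With 3-state strings this is Q3; with 8-state strings it is Q8.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    observed = "HHHHEEEECC"
    predicted = "HHHEEEEECC"
    print(q_accuracy(predicted, observed))  # one mismatch in ten positions
```

Because the score is per position, it does not distinguish one long misplaced segment from many scattered errors, which is the gap that segment-based measures such as SOV are meant to address.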
While some programs are readily available through web servers, predictions through servers are often limited by sequence length or number.
For example, Mufold-SS only allows sequences of up to aa long and Jpred4 only allows sequences of up to aa long. In addition, most web servers only allow prediction of one protein sequence at a time, which is often impractical when working with a large number of sequences.

This will open a window that indicates the current structure number and allows that number to be changed. The lowest free energy structure is structure 1 and folding free energy increases with the structure number.
Alternatively, pressing the control and up arrow keys increases the number of the currently displayed structure and pressing the control and down arrow keys lowers the number of the currently displayed structure. The number of the currently displayed structure is indicated in the upper left hand corner of the window.
As indicated in Figure Base pairing probabilities are an indication of the quality of a predicted pair. Highly probable pairs are more likely to be correctly predicted than low probability pairs (Mathews). The drawing window also shows the predicted folding free energy change for each structure. Hence, the binding region is walked down the length of the sequence.
Open the OligoWalk input window Figure A default name is then chosen for output of the thermodynamic estimates. This name is the same as the ct file, but with the. This mode is the slowest, but best approximates equilibrium. The contribution made by each suboptimal structure to the total cost of opening target self-structure is weighted according to its folding free energy change.
In general, it is recommended to check this box to account for alternative possible secondary structures. Also, choose an oligonucleotide concentration. For the example shown in Figure Next, the region for oligonucleotide binding can be reduced by adjusting the start and stop locations.
To adjust the limits, use the up and down arrows next to the value. Limiting the area of interest on the target RNA strand reduces the calculation time. A window will open to show the progress of the calculation. This shows the results as they appear on Microsoft Windows 7. The OligoWalk output window provides an interactive method for displaying the calculated thermodynamic parameters. Red nucleotides are predicted to be base paired in the lowest free energy structure.
Black nucleotides are predicted to be single-stranded. The position along the target of the currently displayed oligonucleotide is indicated in the upper left-hand corner of the display. Each of the free energy terms can be graphed and a check on the menu shows the current selection.
The currently displayed oligonucleotide can be changed in several ways. For oligonucleotides with self-structure, the self-structure can be drawn on the screen by double-clicking the oligonucleotide sequence. For oligonucleotides with both bimolecular and unimolecular structure, a window opens to allow user selection of the structure type to display.
Therefore, secondary structure prediction should be viewed as a method for developing structure hypotheses. Suboptimal structures are thus alternative hypotheses for the secondary structure.
Recently developed methods for RNA secondary structure prediction can also be used to develop alternative hypotheses, including maximum expected accuracy structure prediction (Lu et al.). Using RNAstructure, constraints or restraints on the possible structures can be specified. It has been shown that the use of constraints based on experimental data improves the accuracy of secondary structure prediction (Deigan et al.).
RNAstructure can use constraints based on enzymatic cleavage revealing paired or unpaired nucleotides (Knapp), FMN cleavage revealing uracils in GU pairs (Burgstaller et al.). Base pair probabilities can be used to estimate confidence in a predicted base pair (Mathews). When only base pairs with predicted pairing probability at or above 0. For a probability threshold of 0. Nearly one quarter of predicted base pairs, on average, in the lowest free energy structure have pairing probability of at least 0.
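Using pairing probabilities as a confidence filter can be sketched as follows. This is a generic illustration and not RNAstructure's interface: the function name, the dictionary layout and every probability value in the example are made up, and the threshold is left as a parameter rather than taken from the text.

```python
def confident_pairs(pair_probs, threshold):
    """Return the base pairs whose pairing probability is at least `threshold`.

    `pair_probs` maps (i, j) nucleotide index pairs to probabilities in [0, 1].
    The result is sorted by position so it is stable across runs.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return sorted(pair for pair, prob in pair_probs.items() if prob >= threshold)


if __name__ == "__main__":
    # Hypothetical probabilities for the pairs of one predicted structure.
    predicted = {(1, 20): 0.99, (2, 19): 0.97, (5, 12): 0.40}
    print(confident_pairs(predicted, threshold=0.9))
```

Raising the threshold keeps fewer pairs but, following the reasoning above, the retained pairs are the ones more likely to be correct.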
OligoWalk provides an estimate of binding affinity of structured oligonucleotides to a structured RNA target (Mathews et al.). For an oligonucleotide to bind tightly, not only should the duplex free energy change be low (more negative), but the magnitude of the cost of opening target structure should also be minimized. It has been shown that the duplex formation free energy and oligonucleotide self-structure terms correlate with antisense oligonucleotide efficacy (Lu and Mathews; Matveeva et al.).
RNAstructure predicts secondary structures on the basis of thermodynamics. The lowest free energy structure is the structure that is most likely to occur at equilibrium, and predicting the lowest free energy structure is the traditional method for predicting RNA secondary structure. The secondary structure formation free energy change is estimated using a set of empirical nearest neighbor parameters, determined from optical melting experiments on model systems (Mathews et al.).
The partition function is likewise built from free energy changes for structure formation and implicitly considers all possible secondary structures when calculating base pair probabilities. For free energy minimization, RNAstructure uses a dynamic programming algorithm that guarantees the predicted lowest free energy structure will be found. Essentially, the structure prediction problem is divided into smaller problems and recursion builds the complete secondary structure.
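The divide-and-recurse idea can be illustrated with a much simpler problem than free energy minimization: maximizing the number of nested base pairs (the classic Nussinov-style recursion). This sketch is not RNAstructure's algorithm and uses no nearest neighbor parameters; the function name and the `min_loop` parameter are choices made for this example.

```python
# Watson-Crick and G-U wobble pairs (RNA alphabet).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, found by dynamic programming.

    best[i][j] holds the optimum for the subsequence seq[i..j]; each entry
    is built from the optima of shorter subsequences. `min_loop` is the
    minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # shorter subproblems first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]             # case 1: j is unpaired
            for k in range(i, j - min_loop):   # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))
```

The three nested loops over `span`, `i` and `k` give this sketch cubic running time in the sequence length. Because case 2 only combines a region to the left of k with a region enclosed by the pair (k, j), crossing pairs are never considered.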
Two reviews are available that explain dynamic programming in detail (Eddy; Mathews and Zuker). The partition function is also calculated with a dynamic programming algorithm (McCaskill). The dynamic programming algorithms used in RNAstructure, however, cannot predict pseudoknotted (non-nested) base pairs. On average, only 1. A review of the thermodynamics and prediction of pseudoknots is available (Liu et al.).
The ProbKnot component of RNAstructure can predict pseudoknots. The accuracy of pseudoknot prediction by this and other freely available tools is relatively low, although it can be much higher when SHAPE mapping data are used to restrain the prediction (Bellaousov and Mathews; Hajdin et al.).
The structure prediction algorithms presented in this unit scale as O(N^3), where N is the sequence length. This means that doubling the sequence length would make the calculation time approximately eight times longer. This scaling is considered costly, but in practice it does not limit most calculations. Table 1 shows sample calculation times for predicting lowest free energy structures and for partition functions.
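The arithmetic behind the "eight times longer" statement can be checked with a short helper. This is a back-of-the-envelope sketch: the function name is invented here, the reference time and lengths in the example are hypothetical rather than values from Table 1, and real timings also depend on hardware and constant factors.

```python
def scaled_time(reference_time, reference_length, new_length, exponent=3):
    """Extrapolate run time assuming time is proportional to length**exponent.

    exponent=3 corresponds to O(N^3) scaling; exponent=1 to linear scaling.
    """
    if reference_length <= 0 or new_length <= 0:
        raise ValueError("sequence lengths must be positive")
    return reference_time * (new_length / reference_length) ** exponent


if __name__ == "__main__":
    # Hypothetical reference: 10 seconds for a 500-nucleotide sequence.
    print(scaled_time(10.0, 500, 1000))              # cubic: 2**3 = 8x longer
    print(scaled_time(10.0, 500, 1000, exponent=1))  # linear: 2x longer
```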
For sequences up to 2, nucleotides long, these calculations take less than 11 minutes. For long sequences, partition function calculations in RNAstructure can be performed on graphics processor units (GPUs), but this requires running on the command line (Stern and Mathews). OligoWalk also poses little difficulty in structure prediction times (Table 2).
It scales linearly with the length of the target after the target structure has been predicted ahead of time; therefore doubling the target sequence length roughly doubles the calculation time.

Sample Structure Prediction Times. The hardware was a machine with a 3. These calculations multithread across multiple cores; therefore using a 4 core computer cuts the calculation time by almost a factor of 4. These calculations execute on a single core only.

Several other software packages are available for predicting low free energy RNA secondary structures.
The packages differ slightly from RNAstructure in the implementation of the nearest neighbor parameters for multibranch loops and exterior loops (loops that contain the ends of the sequence). For example, RNAstructure explicitly considers both coaxial stacking of adjacent helices and helices separated by a single mismatch.
These interactions are known to stabilize RNA structures (Kim et al.). The Vienna Package 2. Mfold does not consider coaxial stacking in the dynamic programming algorithm, but a second step, efn2, recalculates the free energy change of folding for each structure including coaxial stacking of adjacent helices and helices separated by a single mismatch (Mathews et al.).
Because of these differences in the energy model, the programs are not guaranteed to predict the same lowest free energy structure. Benchmarks, however, showed that these programs have similar average accuracy (Dowell and Eddy; Lorenz et al.).
XRNA is a program that can make publication quality structure drawings. A second useful program for manipulating structure drawings is VARNA, which allows interactive manipulation of the drawing (Darty et al.).
Curr Protoc Bioinformatics. Author manuscript; available in PMC Jul 8. David H.
The publisher's final edited version of this article is available at Curr Protoc Bioinformatics.
Connect to the web server and submit sequences. Enter the sequence. Select parameters. Next, the maximum loop size can be set. For most calculations, the default parameters are the best choice.
Select optional data. Enter an email address and start the calculation.