Clark Atlanta University
Department of Biological Sciences and
Center for Theoretical Study of Physical Systems
223 James Brawley Dr., S.W.
Atlanta, GA 30134
Correspondence should be addressed to:Wm Seffens, PhD.
Submitted for publication: August 15, 1998
Keywords:mRNA, folding free energy, randomization
An examination of 51 mRNA sequences in GENBANK has revealed that calculated mRNA folding is more stable than expected by chance for most genes. Free energy minimization calculations of native mRNA sequences are usually more negative than randomized mRNA sequences with the same base composition and length. Randomization of major regions of genes typically yields folding free energies of less negative magnitude than the original native mRNA sequence. This suggests that a bias in the selection of nucleotide bases favors the potential formation of mRNA structures which contribute to folding stability. A classification scheme is presented based on the significance level of this bias.
Control of gene expression is known to occur at any of the events from promotion of transcription to stabilization of the mature polypeptide product. The role of RNA structure in mRNA regulation is not well understood. The long 5' UTRs (untranslated regions) of pico RNA viruses have common secondary structures that are important to the viral life cycle. It is thought that the 3' UTRs of certain developmental genes in Drosophila are similarly important. RNA secondary structure is thought to be important in the coding regions of certain viruses. Motifs such as pseudoknots have been shown to be responsible for frame-shifting (1) . It is not clear if RNA structure has any importance for non-viral protein coding mRNAs. Several studies have demonstrated that mRNA stability may be an important factor in gene expression for certain genes (2) . Structural RNA features are suspected to be involved in the regulation of mRNA degradation in those cases. Several authors have suggested that the choice of codons in eukaryotic genes may be constrained by effects other than the frequency of codons in the whole gene (3,4,5) . This is shown by the occurrence of specific codons at specific positions in a large set of genes with a frequency larger than expected by chance (4) . Statistical nonrandomness in the occurrence of those codons was demonstrated for 96 eukaryotic genes, including the actin and beta-globin gene families (6) . For a message coding 100 amino acids, there are approximately 3^100 or 10^48 different combinations of bases using synonymous codons coding for the same polypeptide. This work tested the hypothesis that gene sequences are biased to generate mRNAs with greater negative folding free energies.
MATERIALS AND METHODS
mRNA sequences were selected from GENBANK using programs in the Wisconsin Group GCG software package (7) . mRNA sequence files were randomly selected from GENBANK with short Locus descriptors (limited to 8 or 9 characters) and which possessed sufficient information in the Features annotation to reconstruct the sequence of the mRNA. Fifty-one mRNA sequences were selected from the GENBANK sequence data base possessing the following properties identified in the Features annotation: 1) mRNA +1 start site identified, 2) MET start codon identified, 3) termination codon identified, 4) poly-A site or signal identified, and 5) the mRNA sequence must be less than 1200 bases long. A variety of sequence files were examined from diverse species including prokaryotes, plants, invertebrates and higher animals (Table 1). These mRNA sequences were folded using Zucker's FOLD program from UWGCG using a VAXStation 4000 computer (8) . The newer MFOLD program was also tried, but since it produces multiple sub- optimal folds of each sequence, it consumes more than ten times the computational execution time as FOLD. The data produced from the optimal folds from MFOLD were consistent with the data produced from FOLD.
Each in silico folding free energy of a native mRNA was compared with folding free energies calculated from mRNA sequences randomized by one of four different procedures. In the first randomization procedure, each native mRNA sequence was randomized at least ten times using the SHUFFLE program of UWGCG. SHUFFLE randomizes the order of bases in a sequence keeping the composition constant. These randomized sequences (termed "whole-random" sequences) were folded and the free energies averaged. In the second randomization procedure, the native sequences were randomized only within the coding region, yielding "CDS-random" sequences. These sequences contained unmodified 5' and 3' UTRs. In the third randomization procedure, native sequences were randomized only within the UTR sequences, yielding UTR-random sequences, containing unmodified coding sequences of the respective native mRNA. A program ( RNARANDOM ) was written in FORTRAN using the GCG software library which randomized only the codon choice to produce "codon-random" sequences for the fourth randomization procedure. Codon- randomized sequences have the same nucleotide base composition and translated polypeptide product as the respective native mRNA.
Statistical significance was tested for the biases observed in calculated folding free energy between native and randomized sequences. Large sets of randomized mRNA folding free energies were statistically tested and found to be normally distributed. Standard statistical techniques were employed using statistical analysis software. All thermodynamic energies are free energies expressed as kcal/mol.
Fifty one mRNA sequences were selected from a variety of plant, animal and bacteria sequences in GENBANK (Table 1). These sequences and their respective randomized sequences were in silico folded using Zuker's FOLD program. To ensure consistency of the folding algorithm with the selected set of mRNA sequences, the calculated native folding free energies were plotted as a function of sequence length in Figure (1). As expected folding free energies become more negative for increasing sequence length since more interactions are possible in longer molecules. A greater negative free energy indicates that a more stable folding configuration is possible. A linear regression equation computed from data in Figure (1) gives a slope of -0.27 (kcal/mole per nucleotide) with a standard error of 0.025 and a 0.70 regression coefficient. This indicates that 70% of the free energy of each mRNA is accounted by its sequence length. A further check of the data set examined normalized folding free energies (free energy divided by sequence length) as a function of C+G content in Figure (2). As expected the folding free energy per base becomes more negative as the C+G content of the mRNA increases. This is due to the greater interaction energy between C-G pairs compared with A-U pairs. A linear regression equation computed from data in Figure (2) gives a slope of -0.50 (kcal/mole per %G+C) with standard error of 0.05 and with a 0.66 regression coefficient. Thermodynamic data from others indicates that an A-U pair contributes -0.9 kcal/mole while a C-G pair contributes -2.9 kcal/mole (9) .
The difference between native folding free energies and the mean of free energies from sets of randomized sequences is shown with histograms in Figure 3. Four histograms of the percent differences between native and randomized mRNA sequences for the four randomization procedures are shown. The percent difference is calculated by subtracting the mean free energy of the randomized set from the native free energy, divided by the sequence length and multiplied by 100. Frequency of occurrence of the percent difference among the 51 mRNA sequences is plotted on the vertical axis. Differences that are negative (to the left of the highlighted zero -difference line) indicate native mRNAs that are more stable than their randomized sequences. For each randomization procedure a sample mean is shown superimposed on the histogram indicated by an arrow. Native mRNA sequences are generally more stable than the corresponding randomized sequences. The sample mean for the "whole-random" distribution is - 4.3 percent, with a 95% confidence interval of -5.96 percent to -2.72 percent. The confidence interval does not include zero and indicates a significant bias. Of the 51 mRNAs examined, 41 or 80% were more negative compared to the mean of the set of respective whole-randomized sequences. Of that negative free energy different group, 27 or 66% were significantly more negative by one standard deviation, and 17 or 41% were more negative by two standard deviations. In the case of Humifnaf, the native mRNA folding free energy is 6.2 standard deviations more stable compared to the mean of the whole-randomized sequences. In two cases found, Tomrbcsd and a histone (Tahi02), the native mRNA is less stable than expected. Interestingly, these classes of genes are suspected to be at least partially regulated by post-transcriptional events. Rubisco small subunit has been demonstrated to be post-transcriptionally regulated (10) . The cell-cycle regulation of histone mRNA stability has also been demonstrated (11) , so it is not unreasonable to expect that the folding stability for these mRNAs would be different from other mRNAs not so regulated. In addition, equal G and C contents in histone genes were hypothesized to indicate selection pressures on mRNA secondary structure (12) . The free energy of histone mRNA secondary structures was noted in Huynen's study to be only slightly lower than expected on the basis of nucleotide frequencies. The group of less stable mRNAs is not as significantly different from the native folding free energies. Only 3 of 11, or 27% are less negative by one standard deviation.
To determine if the above observed bias toward more stable mRNAs resides in the coding region or the flanking untranslated regions of the mRNAs, native sequences were randomized only within the coding region, yielding "CDS-random" sequences. These sequences contain identical 5' and 3' untranslated regions of the respective native mRNA. Again the native mRNA sequences are usually more negative than the corresponding CDS-random sequences (Figure 3). The mean of the percent difference of native from CDS-random free energies is -3.78 percent, with a 95% confidence interval from -5.12 percent to -2.44 percent, again not including zero percent difference. This demonstrates there is a significant difference in folding free energies between native and partially randomized mRNA sequences. Of the 51 mRNAs examined, 43 or 84% are more negative than the CDS- randomized sequences. The 51 mRNAs possessed a total 5'UTR length of 3014 nucleotides (nt), a total coding length of 27306 nt, and a total 3' UTR length of 8533 nt. Therefore the coding regions comprise 70% of the total mRNA nucleotides, yet randomization of the coding region does not substantially alter the number of mRNAs observed with a bias toward more negative folding free energies. The significance level of this bias is also only slightly reduced by CDS-randomization compared to whole-sequence randomization. Of the group with more negative folding free energies, 30 or 70% are significantly more negative by one standard deviation, and 19 or 44% are more negative by two standard deviations.
The above randomization procedure was modified to shuffle simultaneously the 5' and 3' UTR while preserving the native coding sequence. Native sequences were randomized only within the untranslated region, yielding "UTR-random" sequences. These files contained identical CDS sequences of the respective native mRNA. The native mRNA sequences tend to be more negative than the corresponding UTR- random sequences (Figure 3). Of the 51 mRNAs examined, 34 or 67% are more negative than their UTR-randomized mRNAs. Of the group with more negative folding free energies, 22 or 65% are significantly more negative by one standard deviation, and 8 or 24% are more negative by two standard deviations.
To determine if the above-observed bias toward more stable mRNAs resides in the choice of codons within the coding sequence, a fourth randomization procedure was performed. Native sequences were randomized only by codon choice within the coding region, yielding "codon-random" sequences. These sequences contained identical 5' and 3' untranslated regions of the respective native mRNA and coded for the same polypeptide. Again the native mRNA sequences tend to fold more negative in free energy than the corresponding codon-random sequences ( Figure 3). The mean of the percent difference of native from codon-random free energies is -2.50 percent, with a 95% confidence interval from -4.45 percent to -1.72 percent. The confidence interval does not include zero, demonstrating a significant difference in folding free energies between native and codon- randomized mRNA sequences. Of the 51 mRNAs examined, 43 or 84% are more negative than their codon-randomized mRNAs.
Current computer algorithms for predicting RNA secondary structure from base sequence are based on a nearest neighbor model of interaction (13) . Experimental evidence indicates short-ranged stacking and hydrogen bonding are important determinants of RNA stability while hydrophobic bonding is of lesser importance (14) . The thermodynamic data available and the folding rules based on those data cannot be depended on to predict a correct secondary structure. The data for base paired helical regions are thought to have an uncertainty of 0.2 to 0.5 kcal/mole, while the values for loops have an estimated uncertainty of l to 2 kcal/mole (15) . Furthermore, there are secondary features whose effect on RNA stability are unknown. In solution the molecules will have tertiary interactions (loop to loop) while proteins or other biological molecules may change secondary structure. Despite these limitations the predicted structures show many of the features that have also partially been verified experimentally (16) . For a set of 92 tRNA molecules, 51% of the sequences were folded into either perfect or similar cloverleaf structures (13) . The remainder of the sequences appeared in a variety of structures whose calculated free energy was lower than that of the cloverleaf form. The source of difficulty concerns the inadequate treatment of multibranched loops. If such structures are not general features of mRNAs, then predicted folding structures for mRNAs should be more accurate than for tRNAs. Even if the calculated structure is not similar to the in vivo structure, the calculated free energy will probably be close to the actual free energy value for the in vivo structure. Although a lowest free energy structure is found by the computer algorithms, numerous alternate structures are possible with similar free energy (17) . Therefore the folding free energy value is more realistic than the predicted RNA structure. Even if a most stable secondary structure could be calculated accurately, other structures could have important biological functions. The apparent existence of alternate structures has been reported in studies of AMV-4 RNA and E. coli 23S rRNA (18) . Alternate structures may be important in the initiation of translation, mRNA stability, and in transcriptional attenuation (19,20) . These alternate folded mRNA structures are probably of very similar free energy.
The thermodynamic treatment in this work concerns not the actual mRNA structures but the depth of the free energy potential well that a mRNA molecule could fold. The actual structures are not examined here but rather the possible basepairing and hence free energy release due to folding. Different regions of mRNA sequence were randomized, including codon choice, resulting in destabilization of the folding free energy. This suggests a classification of mRNAs based on the calculated folding free energies. mRNAs could be classed as either more stable, or less stable, than some randomized set derived from that sequence. The degree of stability could be measured in units of %-difference, or in units of standard deviations. The former measure was used by Le and Maizel for short regions of RNA sequence (21) . Correlation of this measure with mRNA half-life, or gene function may be useful to uncover new biological relationships
A survey of 51 mRNA sequences reveals a bias in the coding and untranslated regions that allows for greater negative folding free energies than predicted by sequence length or nucleotide base content. A free energy reference state is taken to be a large enough set of randomized mRNA sequences. Randomization can be implemented over the whole sequence or over sections such as the CDS, UTR, or codon choice. Randomization of most regions of the mRNA sequences display lower folding stability as measured by free energy calculated values. Randomizations of codon choice, while preserving original base composition, also results in less stable mRNAs. This suggests that a bias in the selection of codons favors mRNA structures which contribute to folding stability. This provides for a classification of mRNAs that may have biological significance.
This work was supported (or partially supported) by NIH grant GM08247, by a Research Centers in Minority Institutions award, G12RR03062, from the Division of Research Resources, National Institutes of Health, and NSF CREST Center for Theoretical Studies of Physical Systems (CTSPS) Cooperative Agreement #HRD- 9632844.