A Hybrid approach using Neural Networks and Genetic Algorithms : A case study using optimization of DNA curvature.

Optimizing transcription efficiency in eukaryotic systems using a hybrid approach involving an Artificial Neural Network and Genetic Algorithm: a case study of the β-globin gene

Rupali N. Kalate, S. S. Tambe and B. D. Kulkarni

Chemical Engineering Division

National Chemical Laboratory

Pune 411 008, India.

Abstract

Single base substitutions in the upstream region of the β-globin gene are known to alter the relative transcription level (RTL). Information on multiple base substitutions leading to higher RTL is, however, scarce. The motivation of this work is to obtain maximum gene expression using multiple base substitutions. Using a hybrid strategy based on an Artificial Neural Network (ANN) and a Genetic Algorithm (GA), we study the effects of multiple base mutations, with particular emphasis on those that can enhance RTL. The study reveals that multiple base substitutions in the conserved as well as non-conserved regions can cause substantial enhancements in RTL. We identify positions in the nucleotide sequences that preferably should not be altered, as well as positions where mutations can lead to increased RTL. The various trends observed are rationalized. The ANN-GA strategy can help in experimental planning and in reducing the search space.

Introduction

How the level of gene expression governs the fate of a cell, cell proliferation, and the survival of the organism continues to be one of the intriguing questions for molecular biologists. Even more interesting is the mechanism underlying the switching on and off of a particular gene according to developmental programs. Failure to follow these programs accurately may result in gross abnormalities in the gene structure. Most control mechanisms in the regulation of gene expression operate at the level of transcription and translation. The efficiencies of these critical processes are determined by the nucleotide sequences of the promoter and the ribosome binding sites (RBS) on the encoded mRNA. Although the nucleotide sequences of many promoters and RBS are known, the specific features determining the efficiency of transcription and translation are not well understood. The very first step of gene expression, i.e., transcription, is an intricate, highly regulated process, and its role in eukaryotes is still not clear. The biochemical events in transcription involve a series of highly specific interactions between regulatory sequences in DNA and the cellular enzyme RNA polymerase, which catalyzes the transcription reaction.

The eukaryotic promoters that have been most thoroughly studied by the molecular genetic approach are: (i) the herpesvirus thymidine kinase (tk) [1-3], (ii) the SV40 T-antigen [4], and (iii) mammalian β-globin genes [5]. These studies have focused on the DNA sequences immediately upstream from the messenger RNA (mRNA) initiation sites and have provided evidence that transcription efficiency is established via signals contained within the eukaryotic genes. However, the problem of predicting the mutations in the upstream region that may lead to maximum expression of a gene has so far remained unresolved. The problem is essentially one of optimization, in which the nucleotide content of a promoter sequence needs to be rigorously searched such that the corresponding transcription efficiency, expressed as the relative transcription level (RTL), is maximized. The general objective in optimization is to obtain a set of values of the variables and/or parameters, subject to various constraints (if applicable), that will produce the desired optimum response for the chosen objective function [6]. For performing such an optimization, conventional methods such as gradient-based algorithms require: (i) a mathematical model described by a smooth, continuous closed functional form, and (ii) derivatives of the function to be optimized. Biological systems, being often non-linear and complex, are difficult to model phenomenologically, or even empirically. Consequently, such systems are not amenable to representation in an exact mathematical form and, therefore, to optimization using gradient-based methods. In view of these difficulties, it becomes necessary to explore newer tools for solving problems such as the optimization of transcription efficiency alluded to above.

The objective of this paper is two-fold: (i) to present a hybrid non-linear strategy involving an artificial neural network (ANN) and a genetic algorithm (GA) for the optimization of transcription efficiency, and (ii) to gain insight, from the results of the ANN-GA based optimization simulations, into the structural aspects of the β-globin gene that lead to high transcription efficiency.

Philosophy of ANN-GA optimization technique

In the last decade, ANNs have been extensively used for modeling biological systems, the main reason being their ability to model not only quantitative data but also qualitative data, such as DNA sequences [7]. ANNs trained with the error-back-propagation (EBP) algorithm [8-9] represent the most widely used neural network paradigm. An EBP-based network (EBPN) possesses a multi-layered feed-forward structure that undergoes supervised learning, i.e., for training it requires an example data set comprising pairs of input and corresponding output patterns. Once trained adequately, an EBPN is capable of making output predictions for new input data. In essence, an EBPN serves as a non-phenomenological modeling technique for approximating (particularly nonlinear) relationships existing between two sets of data. For instance, an ANN model has been developed to correlate a DNA sequence with a sequence-dependent property, namely, transcription efficiency [10]. Although ANNs are a powerful modeling technique, they have the undesirable characteristic of yielding "black-box" models; that is, an ANN model cannot be easily expressed as a closed-form equation relating its inputs and outputs. Consequently, the use of gradient descent-based optimization methodologies becomes cumbersome. A novel technique known as "genetic algorithms (GAs)" that helps in overcoming this difficulty is described below.

Genetic Algorithms

GAs are nonlinear optimization techniques based on the mechanisms of natural selection and genetics [11-13]. They combine the "survival of the fittest" principle of natural selection with a randomized information exchange procedure known as crossover to arrive at a robust search and optimization technique. A prerequisite to optimization using the GA methodology is a functional form (model) whose parameters/variables are to be optimized. Given such a functional form, a GA searches its solution (parameter) space so as to maximize a pre-specified objective criterion (function). In GA parlance, the objective function is referred to as fitness function. The salient features of GAs are [14-15]:

· GAs perform a global search, as against the local search performed by gradient-based methods. Thus, GAs are more likely than local methods to arrive at the global optimum of the objective function.

· During optimization, search is conducted from a population of probable candidate solutions to the problem under study.

· The GA search procedure is stochastic, requires only values of the function to be optimized, and does not impose preconditions such as smoothness, differentiability, and continuity on the form of the function.

· GAs can easily handle functions that are highly non-linear, complex, and noisy; in such cases the traditional gradient-based methods are found to be inefficient.

It may be noted that owing to GA's leniency towards the form of the function to be optimized, it is possible to use an ANN model in place of a closed form function. In the resulting ANN-GA optimization approach, a trained ANN serves as an input-output model whose inputs are optimized using the GA methodology. The GA in essence finds the optimal values of the network inputs such that the corresponding values of the network outputs are maximized.

ANN-GA based optimization of eukaryotic transcription efficiency

In order to address the optimization problem of maximizing eukaryotic transcription efficiency, we have chosen the globin gene as a test case. The mouse globin gene family is an ideal candidate for the study of gene expression since the expression of these genes during differentiation exhibits both temporal and coordinate regulation. Thus, the globin gene has been extensively studied for its expression, function, and abnormalities. It has been observed that mutations in the β-globin gene and its upstream regions can cause many genetic disorders [16].

System and Methods

Implementation of ANN-GA methodology

Implementation of the ANN-GA methodology is a two-part procedure. The first part consists of training an EBPN to model the input-output example data. An EBPN architecture in general possesses three layers (input, hidden, and output) of neurons (also termed “nodes”). The nodes in successive layers are connected using weighted links. The two sets of example data to be modeled (correlated) by training an EBPN form the network input and the desired output, respectively. In the present study, DNA sequences of the β-globin gene and the corresponding transcription efficiency values form the EBPN input and output, respectively. Training an EBPN involves minimizing an error function such as the sum-squared-error (SSE) using a strategy known as the generalized delta rule (GDR). During minimization, the network outputs are compared with their desired values and the corresponding SSE is used to update the values of the inter-layer connection weights. The weight updating continues until a convergence criterion is satisfied. At this point the network is assumed to be trained. A detailed description of EBPN training can be found in numerous places (see e.g., [17-18]).
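The training procedure described above can be illustrated with a minimal sketch of a one-hidden-layer network trained with the generalized delta rule and a momentum term. This is not the network used in the study: the class name `TinyEBPN`, the toy data set, the network size, and the learning-rate and momentum values are assumptions chosen only to keep the example small.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyEBPN:
    """Hypothetical one-hidden-layer feed-forward network with one output node."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                      for _ in range(n_hid)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
        # Previous weight changes, needed for the momentum term.
        self.dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hid)]
        self.dw_out = [0.0] * (n_hid + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        hid = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hid]
        hb = hid + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hb)))
        return xb, hb, out

    def predict(self, x):
        return self.forward(x)[2]

    def train_pattern(self, x, target, lr, momentum):
        xb, hb, out = self.forward(x)
        # Output delta: error times the derivative of the sigmoid.
        delta_out = (target - out) * out * (1.0 - out)
        # Hidden deltas, back-propagated through the not-yet-updated output weights.
        delta_hid = [delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
                     for j in range(len(self.w_hid))]
        for j in range(len(self.w_out)):
            self.dw_out[j] = lr * delta_out * hb[j] + momentum * self.dw_out[j]
            self.w_out[j] += self.dw_out[j]
        for j, row in enumerate(self.w_hid):
            for i in range(len(row)):
                self.dw_hid[j][i] = (lr * delta_hid[j] * xb[i]
                                     + momentum * self.dw_hid[j][i])
                row[i] += self.dw_hid[j][i]


def sse(net, data):
    """Sum-squared-error of the network over all (input, target) pairs."""
    return sum((t - net.predict(x)) ** 2 for x, t in data)


def train(net, data, lr, momentum, max_epochs=2000, tol=1e-3):
    """Update weights pattern by pattern until the SSE falls below tol."""
    for _ in range(max_epochs):
        for x, t in data:
            net.train_pattern(x, t, lr, momentum)
        if sse(net, data) < tol:
            break
    return sse(net, data)


# Toy data (illustrative only): two inputs, targets kept inside (0, 1).
toy_data = [([0.0, 0.0], 0.1), ([0.0, 1.0], 0.9),
            ([1.0, 0.0], 0.9), ([1.0, 1.0], 0.9)]
net = TinyEBPN(n_in=2, n_hid=3, seed=1)
sse_before = sse(net, toy_data)
sse_after = train(net, toy_data, lr=0.5, momentum=0.5)
```

Because the output node uses a logistic sigmoid, targets must be scaled into the interval (0, 1) before training.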

In the second part of the ANN-GA hybrid methodology, a GA rigorously searches the input space of the trained EBPN so as to maximize its output. In essence, the GA searches the sequence space with a view to maximizing the magnitude of the transcription efficiency. The GA begins by randomly encoding a set (population) of possible solutions to the optimization problem in the form of “chromosome strings”. A pre-specified objective function returns the fitness value (score) of each chromosome string in the population, which serves as a measure of the goodness of the solution found by the GA. In the ANN-GA methodology, the trained EBPN acts as the objective function, and the network output represents the fitness score of the GA-searched solution string (a DNA sequence). For computing the fitness value, the DNA solution string is applied as an input to the trained EBPN and the network output is evaluated. Since a nonlinear activation function such as the logistic sigmoid is used to compute the output of the EBPN's output nodes, the fitness value is always constrained between zero and one. With this background, a simple five-step GA is described below:

Step 1 (Initialization): Create a random initial population of N chromosome strings where each string contains l elements. A string element characterizing a nucleotide is chosen randomly with equal probability of selecting either A, T, G, or C. Evaluate each chromosome in the initial population using ANN as the objective function. Set the initial population as the current population.

Step 2 (Selection): Select chromosome strings from the current population to form a mating pool to be used subsequently for offspring production. The selection procedure is stochastic in nature and is carried out using the weighted Roulette-wheel algorithm, wherein fitter chromosome strings, on a priority basis, select their partner from among the remaining strings. The probability of selecting a particular partner string is directly proportional to its fitness score. Such a selection procedure gives rise to a mating pool comprising N/2 parent pairs.

Step 3 (Crossover): The action of this most important GA operator results in creating two offspring chromosomes from each parent-pair. Typically, the two parent chromosomes are cut at the same randomly selected crossover point to obtain two sub-strings per parent string. The second sub-strings are then mutually exchanged between the parent chromosomes and combined with the respective first sub-strings to generate two offspring chromosomes (see Figure 1). The probability of crossover (P_cross) is kept high. The crossover operator essentially generates new solution strings (DNA sequences) thereby searching hitherto unexplored regions in the solution space. Repeating crossover operation on N/2 parent pairs generates N number of offspring strings following which the offspring population is merged with the parent population; the post-merger population has 2N strings.

Step 4 (Mutation): Randomly change (mutate) elements of the offspring strings, where the probability (P_mut) of an element undergoing mutation is kept small. The objective of mutation is to create new solutions in the neighborhood of the region represented by the 2N chromosome strings and thereby perform a local search around that region. Subsequently, evaluate the fitness of each chromosome using the EBPN as the objective function and rank the 2N strings in descending order of their fitness scores. Next, discard the lower half of the 2N-sized population and set the resulting population of size N as the new population (generation).

The above-described procedure is repeated until a pre-selected convergence criterion is satisfied, for example when the GA has evolved for a fixed number of generations or when the fitness of the best solution no longer improves in successive generations. The best chromosome, as judged by the highest fitness score following convergence, represents the final solution of the genetic search. The essence of the GA implementation can be stated as follows: better solutions in the current population are selected for reproduction, and their offspring, generated via the crossover and mutation operations, replace the sub-optimal solutions. Owing to the repetitive actions of the crossover and mutation operators, the population of candidate solutions improves from one generation to the next until convergence is achieved.

As most steps involved in the GA implementation are performed stochastically, the final solution depends upon the series of random numbers used during the search. Thus, to secure an overall optimal solution, it may be necessary to repeat the search procedure, each time giving a different seed to the random number generator. In this way the GA begins with different initial populations, which helps in exploring widely different regions of the solution space.
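The GA steps above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: both parents are drawn independently by roulette-wheel sampling (the priority-based partner selection described in Step 2 is not reproduced), and the fitness function is passed in as an arbitrary callable, so any stand-in can replace the trained network. The function names and parameter defaults are assumptions.

```python
import random

ALPHABET = "ATGC"


def roulette_pick(population, scores, rng):
    """Pick one string with probability proportional to its fitness score."""
    total = sum(scores)
    if total <= 0:
        return rng.choice(population)
    threshold = rng.uniform(0.0, total)
    running = 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running >= threshold:
            return chrom
    return population[-1]


def crossover(parent_a, parent_b, p_cross, rng):
    """Single-point crossover: swap the tails after a random cut point."""
    if rng.random() >= p_cross or len(parent_a) < 2:
        return parent_a, parent_b
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chrom, p_mut, rng):
    """Replace each base, with probability p_mut, by a different base."""
    return "".join(
        rng.choice([b for b in ALPHABET if b != base])
        if rng.random() < p_mut else base
        for base in chrom
    )


def run_ga(fitness, length, pop_size, n_gen,
           p_cross=1.0, p_mut=0.01, seed=0, initial=None):
    rng = random.Random(seed)
    # Step 1: random initial population unless a seeded one is supplied.
    population = list(initial) if initial else [
        "".join(rng.choice(ALPHABET) for _ in range(length))
        for _ in range(pop_size)
    ]
    for _ in range(n_gen):
        scores = [fitness(c) for c in population]
        offspring = []
        for _ in range(pop_size // 2):
            # Step 2: fitness-proportional selection of a parent pair.
            a = roulette_pick(population, scores, rng)
            b = roulette_pick(population, scores, rng)
            # Steps 3 and 4: crossover, then mutation of the offspring.
            child_a, child_b = crossover(a, b, p_cross, rng)
            offspring += [mutate(child_a, p_mut, rng), mutate(child_b, p_mut, rng)]
        # Merge parents and offspring, rank, and keep the best pop_size strings.
        merged = population + offspring
        merged.sort(key=fitness, reverse=True)
        population = merged[:pop_size]
    best = max(population, key=fitness)
    return best, fitness(best)
```

Because parents are retained in the merged population, the best fitness never decreases from one generation to the next. Calling `run_ga` with several different `seed` values corresponds to the repeated searches described in the preceding paragraph.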

Optimization of transcription efficiency

In an earlier study [10], the problem of modeling transcription efficiency was addressed using an EBPN as the modeling tool. The data for modeling were taken from the mutation studies carried out by Myers et al. [19-20], in which saturation mutagenesis was used to introduce random single base substitutions into the mouse β-globin promoter region. The effects of single base substitutions in the β-globin promoter were determined by comparing the levels of correctly initiated RNA derived from the test and reference plasmids co-transfected into HeLa cells, and were expressed as the relative transcription level (RTL) of each mutant. The expression used for computing the RTL value is:

\[ \mathrm{RTL} = \frac{M / R_1}{WT / R_2} \qquad (1) \]

where M refers to the signal of the mutant test gene; WT is the signal from the wild-type test gene; R₁ represents the signal from the reference gene co-transfected with the mutant test gene, and R₂ denotes the signal from the reference gene co-transfected with the wild-type test gene.
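As a small worked example, the sketch below assumes that Eq. (1) normalizes each test-gene signal by its co-transfected reference signal, i.e. RTL = (M/R₁)/(WT/R₂), which is the form implied by the definitions above. The function name and the signal values are hypothetical and are not experimental data.

```python
def relative_transcription_level(m, wt, r1, r2):
    """RTL under the assumed form (M / R1) / (WT / R2)."""
    if r1 <= 0 or r2 <= 0 or wt <= 0:
        raise ValueError("signals used as denominators must be positive")
    return (m / r1) / (wt / r2)


# Hypothetical signals: the mutant gives 3.5 times the wild-type signal
# while both reference signals are equal.
example_rtl = relative_transcription_level(m=350.0, wt=100.0, r1=100.0, r2=100.0)
```

With these made-up values the function returns 3.5, and the wild-type sequence compared against itself returns 1.0 by construction.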

The data used by Nair et al. [10] consisted of the β-globin promoter and its mutant sequences (network input) and their corresponding RTL values (network output). In the present work we used the available data on single base substitutions in the upstream region of β-globin and their effects on the RTL value. It is important to note that data on the effects of multiple base substitutions are practically nonexistent. It is expected, however, that a properly trained neural network would capture the intrinsic patterns. For EBPN training, the sequences with mutations were coded using the CODE-4 strategy [21], wherein A, T, G and C were represented by four binary digits: 0001 = C, 0010 = G, 0100 = A, and 1000 = T. The desired (target) output for each sequence was the experimentally determined RTL value, normalized by dividing by ten so that it lies between zero and one. The EBPN architecture had 484 neurons in the input layer for representing the DNA sequences, each of length 121 bp, eight neurons in a single hidden layer, and one neuron in the output layer to represent the RTL value (refer Figure 2). The values of the GDR parameters, namely, the learning rate and momentum coefficient, that resulted in the optimal values of the EBPN weights were 0.6 and 0.9, respectively.
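The CODE-4 representation and the RTL scaling described above can be sketched as follows. The bit patterns and the factor of ten are taken from the text; the function names are illustrative assumptions.

```python
# CODE-4 bit patterns as given in the text: 0001 = C, 0010 = G, 0100 = A, 1000 = T.
CODE4 = {
    "C": (0, 0, 0, 1),
    "G": (0, 0, 1, 0),
    "A": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}


def encode_code4(sequence):
    """Turn a DNA string into a flat list of 4 bits per nucleotide."""
    bits = []
    for base in sequence.upper():
        bits.extend(CODE4[base])  # raises KeyError for non-ATGC symbols
    return bits


def normalize_rtl(rtl):
    """Scale an RTL value into the network's output range."""
    return rtl / 10.0


def denormalize_rtl(network_output):
    """Map a network output back to an RTL value."""
    return network_output * 10.0
```

A 121-bp sequence encoded this way yields 121 × 4 = 484 values, which matches the size of the input layer.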

The flow-chart of the ANN-GA hybrid methodology as applied to the RTL optimization problem is depicted in Figure 3. The steps in flow-chart concerning the objective function (RTL) evaluation were executed using the optimal EBPN weights obtained by Nair and co-workers [10]. This essentially involves operating the trained EBPN in the prediction mode and multiplying the output by ten. The specific steps in the flow-chart relating to GA were implemented as given below.

Instead of randomly creating the initial population (step 1) of candidate solutions representing the DNA sequences, we used the promoter sequence of the mouse β-globin gene and its mutants as the initial population for the GA analysis. Specifically, 130 patterns of DNA promoter sequences and their mutants, whose experimental RTL values are known, were used as the strings in the initial population. This was done purposely so that the GA search begins directly from the most plausible region of the solution space. The values of the GA parameters used for simulation are: population size (N) = 130, probability of crossover (P_cross) = 1.0, probability of mutation (P_mut) = 0.01, total number of generations over which the GA evolves (N_gen) = 100, and the length of each chromosome string (l) = 121. The source code used to obtain the upstream regions of the β-globin gene having high RTL values is available on request from the corresponding author.

Results and Discussion

In this study, we have specifically analyzed the transcriptional control signals of a eukaryotic protein-coding gene in order to establish a relationship between the site of mutation and an increased level of eukaryotic gene transcription. Experimentally, Myers and co-workers [20] could obtain only one single base substitution pattern of the upstream region of the β-globin gene whose transcription efficiency was 3.5. However, using the ANN-GA methodology with multiple base substitutions, it was possible to obtain a large number of sequences having transcription efficiency greater than 3.5. This was achieved by repeating the ANN-GA procedure several times, each time using a different seed value for initializing the random number generator. In the ensuing paragraphs we discuss the significance of the results obtained using the ANN-GA optimization approach. For brevity, the discussion is limited to only ten sequences possessing RTL magnitudes in excess of 3.5. These sequences and their corresponding RTL values are listed in Table I.

Myers and co-workers [20] have shown that single base substitutions in three conserved regions of the promoter, namely (i) the CACCC box, (ii) the CCAAT box, and (iii) the TATA box, resulted in a significant decrease in the level of transcription. It was also shown that a promoter containing two base substitutions, one at -75 and the other at -74, results in a 40- to 50-fold decrease in the RTL. In contrast, two different mutations in nucleotides immediately upstream from the CCAAT box caused a 3- to 3.5-fold increase in transcription. Thus, the mutations at positions -78 and -79 were termed "up mutations". Apart from these cases, single base substitutions in all other regions of the promoter were shown to have no effect on transcription. The ANN-GA approach, on the other hand, could arrive at multiple base substitutions that synergistically show a significant increase in the transcription efficiency.

A comparison of the sequence of the upstream region of the β-globin gene (glo, RTL=1.00) with the ANN-GA predicted sequences from the same region (R1 to R10, RTL > 3.5) has been made using the FASTA package [22]. Such a comparison helps in understanding the role of the nucleotide variation that leads to the high transcription efficiency of the ANN-GA simulated patterns vis-a-vis the original sequence of the upstream region of the β-globin gene. The results of the comparison, shown in Table II, indicate that the sequences from the upstream region of the β-globin gene possessing maximum transcription efficiency show 74.4-95.8% sequence homology with the upstream region having a transcription efficiency value of one. The nucleotide positions in the sequences predicted by the ANN-GA method that differ from the upstream region of the β-globin gene can be considered as effective mutation points (listed in Table III) for the sequences indexed as R1 to R10. These points are most probably responsible for enhancing the transcription efficiency of the β-globin gene.
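The idea of percent identity and effective mutation points can be illustrated with a position-by-position comparison of two equal-length sequences. This is only a sketch: it is an ungapped comparison and not the alignment procedure of the FASTA package, and the position labels assume that the 121-bp window starts at -101 and that there is no position 0 (so that it ends at +20). Both the function names and that numbering are assumptions.

```python
def position_label(index, first_position=-101):
    """Map a 0-based string index to a promoter position label."""
    pos = first_position + index
    # Assumed numbering: there is no position 0, so -1 is followed by +1.
    if first_position < 0 and pos >= 0:
        pos += 1
    return pos


def percent_identity(reference, candidate):
    """Percentage of positions at which the two sequences carry the same base."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(reference, candidate))
    return 100.0 * matches / len(reference)


def mutation_points(reference, candidate, first_position=-101):
    """List (position, reference base, candidate base) for every difference."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have equal length")
    return [
        (position_label(i, first_position), a, b)
        for i, (a, b) in enumerate(zip(reference, candidate))
        if a != b
    ]
```

For two made-up 10-base strings, `percent_identity("ACGTACGTAC", "ACGAACGTCC")` gives 80.0, and `mutation_points` with `first_position=-5` reports the differences at positions -2 and +4.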

The ANN-GA simulation results show that not all mutations in the three conserved regions decrease the RTL, as is generally believed on the basis of the available experimental results [20]. A close look at the sequences R1 to R10, undertaken in order to interpret the results and better understand the role of mutations in enhancing the transcription efficiency, reveals the following: (i) mutations in conserved regions can enhance RTL (sequences R1, R3, R4, R7, R8, and R9), and (ii) mutations in non-conserved regions can also enhance RTL (sequences R2, R5, R6 and R10). In what follows we analyze these cases separately. Also, to understand the role of individual mutation positions and their surroundings, we further subdivide the sequence into seven different segments: (i) the upstream region of the CACCC box (i.e., positions -101 to -96), (ii) the CACCC box (located between positions -95 and -87), (iii) the region between the CACCC box and the CCAAT box (i.e., positions -86 to -78), (iv) the CCAAT box (present between positions -77 and -72), (v) the region between the CCAAT box and the TATA box (positions -71 to -31), (vi) the TATA box (lying between positions -30 and -26), and (vii) the region between -25 and the cap site together with the region below the cap site.
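The segment boundaries listed above can be captured in a small lookup, which makes it easy to classify any mutation position as lying in a conserved or non-conserved region. The boundaries come from the text; the segment names and function names are illustrative assumptions.

```python
# (first position, last position, segment name), boundaries as listed in the text.
SEGMENTS = [
    (-101, -96, "upstream of CACCC box"),
    (-95, -87, "CACCC box"),
    (-86, -78, "between CACCC box and CCAAT box"),
    (-77, -72, "CCAAT box"),
    (-71, -31, "between CCAAT box and TATA box"),
    (-30, -26, "TATA box"),
]

CONSERVED = {"CACCC box", "CCAAT box", "TATA box"}


def segment_of(position):
    """Return the name of the promoter segment containing the position."""
    for first, last, name in SEGMENTS:
        if first <= position <= last:
            return name
    if position >= -25:
        return "-25 to cap site and below cap site"
    raise ValueError(f"position {position} lies outside the studied region")


def is_conserved(position):
    """True if the position falls inside one of the three conserved boxes."""
    return segment_of(position) in CONSERVED
```

For example, the up mutation point -78 falls in the non-conserved region between the CACCC and CCAAT boxes, whereas -74 lies inside the CCAAT box.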

I. Mutations in conserved regions leading to higher RTL

CACCC box (located between -95 to -87 position):

· The optimal sequences having value of RTL in excess of 3.5 searched by the genetic algorithm, including the representative examples of sequences shown here (R1 to R10), reveal that the positions -87, -90, -91, -92 and -93 remain unaltered. This feature is therefore relevant for obtaining sequences with higher RTL.

· Mutations at positions other than those listed above can enhance RTL. We show one example of each such alteration. Mutation at position -88 (sequence R9) or -89 (sequence R8), along with changes at a few other positions (see sequences R8 and R9 for details), causes a several-fold increase in RTL. It is important to note that these sequences also include mutations at the 'up-mutation points'. Sequences R4 and R7 are examples in which mutation occurs at the remaining positions, viz. -94 and -95, and causes enhancement. In these examples, mutation at these positions is again accompanied by changes at a few other locations, but this time mutations at the 'up-mutation points' are not involved.

CCAAT box (present between -77 and -72 positions):

· Sequences R1 to R10 show that the nucleotide positions -73, -75, -76 and -77 remain unchanged. Leaving these positions unaltered seems to be important for high transcription efficiency. Other positions within this region, viz. -72 and -74, can undergo mutations that cause increased RTL. We show one example of each.

· Sequence R3 indicates that if mutation at position -74 is accompanied by mutation at the "up mutation points" (positions -78 and -79), then an increase in the RTL value is witnessed. Note that mutation at position -74 alone lowers the RTL magnitude, whereas mutations at positions -78 and -79 cause an increase. The simultaneous mutations have a synergistic effect, causing an enhancement greater than that known for the up mutation points.

· Upon examining sequence R8 it can be noted that if nucleotide position -72 is mutated in combination with "up mutation point" (position -78), and other favorable mutation points (especially in the region -71 to -31 and -25 to cap site), then it causes high magnitude of RTL.

TATA box (lying between -30 and -26 positions):

· For sequences R1 and R8, mutations at -27 and -30 positions effect increase in RTL value if they possess mutation at -78 position and, additionally, at other favorable mutation points such as -47 and -66 positions. These results once again underline the importance of up mutation point, such as position -78.

· At positions -26 and -29 of sequence R4, transition mutations (A → G, i.e., R → R) are witnessed. Here, despite the presence of mutations in the TATA box, a high RTL value has been obtained. This can be interpreted as follows: if specific mutations (positions -26 and -29) in the TATA box are supported by drastic variation in the nucleotide content of the region surrounding the TATA box (i.e., the region between -71 and -31, and between -25 and the cap site), then they result in increased RTL.

· The % identity (homology) of sequence R4 with the original β-globin gene promoter is 74.4. Although this value is the lowest among the ten ANN-GA predicted patterns (refer Table II), the corresponding RTL value (=4.8404) is high.

II. Mutations in non-conserved regions leading to higher RTL

Upstream region of CACCC box (positions -101 to -96):

· If mutations in this region are in favorable agreement with other mutation points, especially in the region -71 to -31, they cause an increase in the magnitude of RTL. This is evidenced by the sequence entries R2, R4, and R7-R10 listed in Table III. The sequences also indicate that G at positions -97, -84 and -78 is always mutated to A, T and C, respectively.

· For the ten patterns in Table III, positions -99 and -100 are always conserved thus indicating their importance in maintaining high transcription efficiency.

Region between CACCC box and CCAAT box (positions -86 to -78):

· The region is of prime importance since it includes the most important positions i.e., -78 and -79. These two "up mutation points" are primarily responsible for increased transcription efficiency (see sequences R1, R3, R6, R8 and R9).

· Sequences R1-R10 do not exhibit any effective mutation at -77 position. Moreover, as verified experimentally [20], the mutation at -77 position, which is in the nearest-neighbor position of up mutation points (i.e., -78 and -79 position), does not seem to help in increasing transcription efficiency.

· At position -78 of sequences R1 and R3, and at position -84 of sequences R5 and R9, transversion mutations (G → C or T at positions -84 and -78, i.e., R → Y) can be observed. It can therefore be inferred that transversion mutations at these positions can cause an increased magnitude of RTL.

Region between CCAAT box and TATA box (positions -71 to -31):

· Table III lists various combinations of multiple base substitutions for sequences R1-R10 in the region between CCAAT box and TATA box, which result in the increased RTL value. However, the average trend in the ten sequences suggests that nucleotide positions -71, -70, -68, -67, -65, -55, -48 and -43, despite remaining unchanged, still cause high RTL. Thus these positions seem to be important in obtaining high RTL.

· Transversion mutations (G → T at -60; A → T or C at -59 and -57; i.e., R → Y) seen at position -60 (sequences R4, R5 and R6), at position -59 (sequences R2, R4 and R8), and at position -57 (sequences R4, R7 and R8) appear to cause high transcription efficiency.
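The transition/transversion distinction used in this discussion can be checked mechanically: a substitution within the purines (R: A, G) or within the pyrimidines (Y: C, T) is a transition, and a substitution between the two classes is a transversion. The following sketch uses illustrative function names.

```python
PURINES = {"A", "G"}      # class R
PYRIMIDINES = {"C", "T"}  # class Y


def base_class(base):
    """Return 'R' for a purine and 'Y' for a pyrimidine."""
    base = base.upper()
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    raise ValueError(f"not a DNA base: {base!r}")


def substitution_type(old, new):
    """Classify a single base substitution as a transition or a transversion."""
    if old.upper() == new.upper():
        raise ValueError("old and new base are identical")
    if base_class(old) == base_class(new):
        return "transition"
    return "transversion"
```

For instance, A → G is classified as a transition (R → R), while G → T and A → C are transversions (R → Y).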

Region between -25 to cap site and in the region below the cap site:

· In most cases, the mutations in these regions have favorably supported the multiple base substitutions in the upstream region of the gene. It is also of interest to study the role of this region in causing increased transcription efficiency for sequences where the % identity between the original β-globin promoter sequence and the ANN-GA simulated promoter patterns is greater than 90% (refer Table II). Although R6, R9, and R10 meet the stated criterion, we concentrate only on sequence R10 since sequences R6 and R9 contain up mutation points. The % identity of sequence R10 with the β-globin promoter is 94.2 and its RTL is 3.6896. An interesting feature of this sequence is that none of the three conserved regions, i.e., the CACCC, CCAAT and TATA boxes, is subjected to any mutational change; the sequence shows variation only in the regions -101 to -96, -71 to -31, and below the cap site (position +14). Since R10 possesses maximum homology with the original β-globin gene, only eight effective mutation points that can lead to higher RTL are possible. Thus mutations at positions -101, -98, -97, -56, -51, -46, -41 and +14 can cause increased RTL.

· Among the ten sequences, R8 possesses the highest RTL magnitude (=6.7307). This pattern includes a mutation at position -78 (up mutation point) and has a % identity value of 79.3. Hence, sequence R10 gives us an idea about the effective multiple mutation points, in the regions -71 to -31, -25 to the cap site, and below the cap site, that eventually lead to the highest RTL value. This is an example of how the ANN-GA optimization methodology could be exploited for a priori estimation of multiple base substitutions before conducting the mutation experiments.

Role of curvature in gene expression

Sequence dependent DNA structure is important in packaging, recombination and transcription. Therefore it is of interest to study the role of sequence-dependent DNA structure in governing the extent of transcription efficiency. For this purpose, CURVATURE program [23] can be used. This program is useful for plotting the sequence-dependent spatial trajectory of the DNA double helix and/or distribution of curvature along the DNA molecule. The routine calculates the overall DNA path using experimentally determined local helix parameters, namely, helix twist angle, wedge (deflection) angle, and direction (of deflection) angle [24]. The CURVATURE software can thus be used to investigate possible role of curvature in modulation of gene expression and to locate curved portions of DNA that may play an important role in sequence specific DNA-protein interactions.

For conducting the above-mentioned investigation, the DNA sequence of the upstream region of the β-globin gene (glo, RTL=1.00) and the ANN-GA predicted patterns of the β-globin gene were used as inputs to the CURVATURE program, and the likely degree of curvature at each point along the molecule was computed. The graphical comparison of the curvature map of the promoter sequence of the β-globin gene with those of the ANN-GA predicted promoter sequences is depicted in Figure 4. The results suggest that sequences having maximum transcription efficiency show sequence-dependent bendability or deformability of duplex DNA. This can be justified by the fact that certain nucleic acid sequences take up a particular structure required for binding to a protein at lower free energy than other sequences. The comparison also reveals that a change in the superstructure results in an alteration of transcriptional activity. These results in essence indicate that the ANN-GA methodology is able to capture the relationship between DNA superstructures and transcriptional activity.

Figure 5 compares the spatial trajectories of the DNA double helix of the upstream region of the b-globin gene (glo, RTL=1.0) and of the promoter sequence (R8) having the highest RTL (=6.7307). In both cases, the projections are chosen such that the most curved regions of the fragments are seen best. This is done by placing the plane in which the axis is curved perpendicular to the viewing direction. Any other orientation would result in a false impression of excessive curvature. It can be seen in Figure 5 that the promoter pattern R8 is more curved at the center than the promoter sequence of the b-globin gene (glo). This structural variation, which changes the signature of the b-globin gene, is likely responsible for improved recognition by RNA polymerase and thus facilitates transcription.

Conclusion

A highly intricate process such as transcription can be captured well using a hybrid of two intelligent tools. This approach helps us to study the effect of multiple base substitutions that increase transcription efficiency. The simulation results can be used as a guide in designing mutation experiments, since an a priori estimate of the possible outcome of multiple mutations can be obtained. The methodology has also captured the role of DNA superstructures in gene expression. Such a hybrid approach, involving an ANN that maps the given inputs onto the outputs and a genetic algorithm (GA) that maximizes the output by searching the input space of the ANN, can be used for optimizing any biological property.
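The interplay between the two components can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in this work: the surrogate network has fixed random weights in place of a trained EBPN, the bit assignment of the four-unit base coding is assumed, the 121-bp start sequence is a placeholder rather than the b-globin promoter, and the GA settings (population size, mutation rate, tournament selection, elitism) are hypothetical choices.

```python
import math
import random

BASES = "ACGT"
# Assumed four-unit coding of each base (the exact bit assignment is a guess).
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def encode(seq):
    """Four inputs per base, so a 121-bp sequence gives 484 inputs."""
    return [bit for base in seq for bit in ONE_HOT[base]]


def make_surrogate(n_inputs, n_hidden=8, seed=0):
    """Stand-in for a trained network: random fixed weights, one output."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]

    def predict(seq):
        x = encode(seq)
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                  for row in w_hidden]
        return sum(w * h for w, h in zip(w_out, hidden))

    return predict


def crossover(a, b, rng):
    """Single-point crossover of two parent strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(seq, rate, rng):
    """Redraw each base at random with probability rate."""
    return "".join(rng.choice(BASES) if rng.random() < rate else base
                   for base in seq)


def ga_maximize(fitness, start, pop_size=20, generations=20, rate=0.02, seed=1):
    rng = random.Random(seed)
    pop = [start] + [mutate(start, rate, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        scores = {s: fitness(s) for s in set(pop)}  # evaluate each string once
        ranked = sorted(pop, key=scores.__getitem__, reverse=True)
        nxt = ranked[:2]  # elitism: the two best strings survive unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=scores.__getitem__)
            p2 = max(rng.sample(pop, 3), key=scores.__getitem__)
            nxt.extend(mutate(c, rate, rng) for c in crossover(p1, p2, rng))
        pop = nxt[:pop_size]
    return max(pop, key=fitness)


start_seq = ("ACGT" * 31)[:121]  # placeholder 121-bp sequence
model = make_surrogate(len(encode(start_seq)))
best_seq = ga_maximize(model, start_seq)
print(model(start_seq), model(best_seq))
```

Because the two best strings are carried over unchanged in every generation, the best model output found by this sketch cannot fall below that of the start sequence. With a trained network in place of the surrogate, the returned string would be a candidate pattern for experimental verification.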

References

1. McKnight, S.L. and Kingsbury, R. Transcription control signals of a eukaryotic protein-coding gene. (1982) Science, 217, 316-324.

2. McKnight, S.L., Kingsbury, R.C., Spence, A. and Smith M. The distal transcription signals of the herpesvirus tk gene share a common hexanucleotide control sequence. (1984) Cell, 37, 253-262.

3. Graves, B.J., Johnson, P.F. and McKnight, S.L. Homologous recognition of a promoter domain common to the MSV LTR and the HSV tk gene. (1986) Cell, 44, 565-576.

4. Gidoni, D., Kadonaga, J.T., Barrera-Saldana, H., Takahashi, K., Chambon, P. and Tjian, R. Bi-directional SV40 transcription mediated by tandem Sp1 binding interactions. (1985) Science, 230, 511-517.

5. Grosveld, G.C., de Boer, E., Shewmaker, C.K. and Flavell, R.A. DNA sequences necessary for transcription of the rabbit b-globin gene. (1982) Nature, 295, 120-126.

6. Edgar, T.F. and Himmelblau, D.M. (1989) Optimization of Chemical Processes. McGraw-Hill.

7. Nair, T.M., Tambe, S.S. and Kulkarni, B.D. Analysis of transcription control signals using artificial neural networks. (1995) Comput. Applic. Biosci., 3, 293-300.

8. Rumelhart, D.E., Hinton, G.E. and Williams, R.J. Learning representations by back-propagating errors. (1986) Nature, 323, 533-536.

9. Rumelhart, D.E. and McClelland, J.L. (1986) Parallel and Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.

10. Nair, T.M. Artificial neural networks in biological sciences. (1996) In Tambe, S.S., Kulkarni, B.D. and Deshpande, P.B. (Eds) Elements of Artificial Neural Networks with Selected Applications in Chemical Engineering and Biological Sciences. Simulation and Advanced Controls, Inc., Louisville, 395-437.

11. Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, Mass.

12. Davis, L., ed. (1991) Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York.

13. Holland, J.H. (1992) Adaptation in Natural and Artificial Systems. 2nd ed. University of Michigan Press, Ann Arbor.

14. Hanagandi, V., Ploehn, H. and Nikolaou, M. Solution of the self-consistent field model for polymer adsorption by genetic algorithms. (1996) Chem. Eng. Sci., 51, 1071-1074.

15. Schoenauer, M. and Michalewicz, Z. Evolutionary computation. (1997) Control and Cybernetics, 26, 307-338.

16. Efstratiadis, A., Posakony, J.W., Maniatis, T., Lawn, R.M., O'Connell, C., Spritz, R.A., DeRiel, J.K., Forget, B.G., Weissman, S.M., Slightom, J.L., Blechl, A.E., Smithies, O., Baralle, F.E., Shoulders, C.C. and Proudfoot, N.J. The structure and evolution of the human b-globin gene family. (1980) Cell, 21, 653-668.

17. Freeman, J.A. and Skapura, D.M. (1992) Neural Networks Algorithms, Applications, and Programming Techniques. Addison-Wesley.

18. Tambe, S.S., Kulkarni, B.D. and Deshpande, P.B. (Eds) (1996) Elements of Artificial Neural Networks with Selected Applications in Chemical Engineering and Biological Sciences. Simulation and Advanced Controls, Inc., Louisville.

19. Myers, R.M., Lerman, S.L. and Maniatis, T. A general method for saturation mutagenesis of cloned DNA fragments. (1985) Science, 229, 242-247.

20. Myers, R.M., Tilly, K. and Maniatis, T. Fine structure genetic analysis of b-globin promoter. (1986) Science, 232, 613-618.

21. Demeler, B. and Zhou, G. Neural network optimization for E. coli promoter prediction. (1991) Nucl. Acids Res., 19, 1593-1599.

22. Pearson, W. R. Rapid and sensitive sequence comparison with FASTP and FASTA. (1990) Methods Enzymol, 183, 63-98.

23. Shpigelman, E.S., Trifonov, E.N. and Bolshoy, A. CURVATURE: software for the analysis of curved DNA. (1993) Comput. Applic. Biosci., 9, 435-440.

24. Bolshoy, A., McNamara, P., Harrington, R.E. and Trifonov, E.N. Curved DNA without A-A: Experimental estimation of all 16 wedge angles. (1991) Proc. Natl. Acad. Sci. USA, 88, 2312-2316.

25. Trifonov, E.N. and Ulanovsky, L.E. Inherently curved DNA and its structural elements. (1987) In Wells, R.D. and Harvey, S.C. (Eds) Unusual DNA Structures. Springer-Verlag, Berlin, 173-187.

Legends to Figures:

Figure 1: Basic crossover of the nucleotide sequence of the two parent strings.

Figure 2: Architecture of the trained EBPN consisting of (i) 484 input neurons (the 121 bp long promoter sequences are coded using the CODE-4 representation), (ii) eight neurons in the hidden layer, and (iii) one neuron (representing the RTL value) in the output layer.

Figure 3: Flow chart for the implementation of the ANN-GA strategy for the optimization of the transcription efficiency (in terms of its RTL value) of the b-globin gene.

Figure 4: Comparison of the curvature map of the upstream region of the b-globin gene (glo, RTL=1.0) and the ANN-GA predicted promoter patterns of the b-globin gene (R1 to R10, RTL > 3.5). Curvature is given in DNA curvature units (Trifonov and Ulanovsky, 1987), where one unit is the mean DNA curvature in the crystalline nucleosome (1/42.8 Å).

Figure 5(a): DNA path of the b-globin gene (glo, RTL=1.0) calculated using the CURVATURE program.

Figure 5(b): DNA path of the ANN-GA predicted promoter sequence (R8, RTL=6.7307) calculated using the CURVATURE program.

Legends to Tables:

Table I: Sequence (simulated patterns of upstream region of b-globin gene) details along with their ANN-GA predicted Relative Transcription Level (RTL) value.

Table II: Comparison of upstream region of b-globin gene with ANN-GA predicted promoter patterns for sequence homology using FASTA package.

Table III: Effective mutation points for ANN-GA predicted promoter patterns in accordance with various sub-regions.

= Corresponding Author. Fax: 091-020-5893041 E-mail: bdk@che.ncl.res.in