Optimizing transcription efficiency in eukaryotic systems using a hybrid approach
involving an Artificial Neural Network and Genetic Algorithm: a case study of β-globin gene
Rupali N. Kalate, S. S. Tambe and B. D. Kulkarni
Chemical Engineering Division
National Chemical Laboratory
Pune 411 008, India.
Abstract
Single base substitutions in the upstream region of the β-globin gene are known to alter the relative transcription level (RTL). Information on multiple base substitutions leading to higher RTL is, however, very scarce. The motivation of this work is to obtain maximum gene expression using multiple base substitutions. Using a hybrid strategy based on an Artificial Neural Network (ANN) and a Genetic Algorithm (GA), we study the effects of multiple base mutations, with particular emphasis on those that can cause enhanced RTL. The study reveals that multiple base substitutions in the conserved as well as non-conserved regions can cause substantial enhancements in RTL. We identify positions in the nucleotide sequences that preferably should not be altered, as well as positions where mutations can lead to increased RTL. The various trends observed are rationalized. The ANN-GA strategy can help in experimental planning and in reducing the search space.
Introduction
The mechanism by which the level of gene expression governs the fate of a
cell, cell proliferation, and the survival of the organism continues to be one
of the most intriguing questions for molecular biologists. Even more
interesting is the mechanism underlying the switching on and off of a
particular gene according to developmental programs. Failure to follow these
programs accurately may result in gross abnormalities
in the gene structure. Most control mechanisms in the regulation of gene
expression occur at the level of transcription and translation. The
efficiencies of these critical processes are determined by the nucleotide
sequences of the promoter and the ribosome binding sites (RBS) on the encoded
mRNA. Although the nucleotide sequences of many promoters and the RBS are
known, the specific features determining the efficiency of transcription and
translation are not well understood. The very first step of gene expression,
i.e., transcription, is an intricate, highly regulated process and its role in
eukaryotes is still not clear. The biochemical events in transcription involve
a series of highly specific interactions between regulatory sequences in DNA
and the cellular enzyme RNA polymerase that catalyzes the transcription reaction.
The eukaryotic promoters that have been most thoroughly studied by the
molecular genetic approach are: (i) the herpesvirus thymidine kinase (tk)
[1-3], (ii) the SV40 T-antigen [4], and (iii) mammalian β-globin genes [5].
These studies have focused on the DNA sequences immediately upstream from the
messenger RNA (mRNA) initiation sites and provided evidence for the
establishment of transcription efficiency via signals contained within the
eukaryotic genes. However, the problem of prediction of the mutations in the
upstream region that may lead to maximum expression of a gene has so far
remained unresolved. The problem essentially is that of an optimization where
the nucleotide content of a promoter sequence needs to be rigorously searched
such that the corresponding transcription efficiency represented in terms of
relative transcription level (RTL) is maximized. The general objective in
optimization is to obtain a set of values of the variables and/or parameters
subject to various constraints (if applicable) that will produce the desired
optimum response for the chosen objective function [6]. For performing such an
optimization, the conventional methods such as gradient-based algorithms
require: (i) a mathematical model described by a smooth, continuous closed
functional form, and (ii) derivatives of the function to be optimized.
Biological systems, being often non-linear and complex, are difficult to
model phenomenologically, or even empirically. Consequently, such systems
are not amenable to representation in an exact mathematical form and,
therefore, to optimization using gradient-based methods. In view of these
difficulties, it becomes necessary to explore newer tools for solving problems
such as the optimization of transcription efficiency alluded to above. The
objective of this paper is two-fold: (i) to present a hybrid non-linear
strategy involving an artificial neural network (ANN) and genetic algorithm
(GA) for the optimization of transcription efficiency, and (ii) to obtain an
insight - from the results of the ANN-GA based optimization simulations - about
the structural aspects of the β-globin gene that lead to high
transcription efficiency.
In the last decade, ANNs have been extensively used for modeling biological
systems, the main reason being their ability to model not only quantitative
data but also qualitative data, such as DNA sequences [7]. ANNs trained with the
error-back-propagation (EBP) algorithm [8-9] represent the most widely used
neural network paradigm. An EBP-based network (EBPN) possesses a multi-layered
feed-forward structure that undergoes supervised learning, i.e. for training it
requires an example data set comprising pairs of input and the corresponding
output patterns. Once trained adequately, an EBPN is capable of making output
predictions for new input data. In essence, an EBPN serves as a
non-phenomenological modeling technique for approximating (particularly
nonlinear) relationships existing between two sets of data. For instance, an
ANN model has been developed to correlate a DNA sequence and the
sequence-dependent property, namely, transcription efficiency [10]. ANNs,
though a powerful modeling technique, have the undesirable characteristic that
they essentially lead to "black-box" models; that is, an ANN model
cannot be easily expressed as a closed-form equation relating its inputs and
outputs. Consequently, the use of gradient descent-based optimization
methodologies becomes cumbersome. A novel technique known as "genetic
algorithms (GAs)", which helps in overcoming this difficulty, is
described below.
Genetic Algorithms
GAs are nonlinear optimization techniques based on the
mechanisms of natural selection and genetics [11-13]. They combine the
"survival of the fittest" principle of natural selection with a
randomized information exchange procedure known as crossover to arrive at a robust search and optimization technique.
A prerequisite to optimization using the GA methodology is a functional form
(model) whose parameters/variables are to be optimized. Given such a functional
form, a GA searches its solution (parameter) space so as to maximize a
pre-specified objective criterion (function). In GA parlance, the objective
function is referred to as fitness function.
The salient features of GAs are [14-15]:
· GAs perform a global search, as against the local one performed by the
gradient-based methods. Thus, GAs are more likely to arrive at the global
optimum of the objective function.
· During optimization, the search is conducted from a population of probable
candidate solutions to the problem under study.
· The GA search procedure is stochastic, requiring only values of the function
to be optimized; it does not impose preconditions such as smoothness,
differentiability, and continuity on the form of the function.
· GAs can easily handle functions that are highly non-linear, complex, and
noisy; in such cases the traditional gradient-based methods are found to be
inefficient.
It may be noted that owing
to GA's leniency towards the form of the function to be optimized, it is
possible to use an ANN model in place of a closed form function. In the
resulting ANN-GA optimization approach, a trained ANN serves as an input-output
model whose inputs are optimized using the GA methodology. The GA in essence
finds the optimal values of the network inputs such that the corresponding
values of the network outputs are maximized.
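As a minimal sketch of this idea, the snippet below treats the output of a small feed-forward network as the objective that a search procedure tries to maximize. The network sizes, the random weights, the omission of bias terms, and the names `network_output` and `fitness` are illustrative assumptions; they are not the trained model used in this study.

```python
import math
import random

def sigmoid(x: float) -> float:
    """Logistic sigmoid; maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def network_output(inputs, w_hidden, w_out):
    """One hidden layer of sigmoid units feeding a single sigmoid output unit."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Placeholder weights (assumption): a real application would load trained weights.
rng = random.Random(0)
N_INPUTS, N_HIDDEN = 8, 3
W_HIDDEN = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
W_OUT = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]

def fitness(candidate):
    """Objective seen by the optimizer: the network output for a candidate input."""
    return network_output(candidate, W_HIDDEN, W_OUT)

# The optimizer only needs function values, never derivatives or a closed form.
candidates = [[rng.randint(0, 1) for _ in range(N_INPUTS)] for _ in range(5)]
best = max(candidates, key=fitness)
```

Here the best of five random candidates is picked by direct comparison; a GA replaces this naive step with selection, crossover, and mutation, while calling the same `fitness` function.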
In order to
address the optimization problem of maximizing the eukaryotic transcription
efficiency, we have chosen the globin gene as a test case. The mouse globin
gene family is an ideal candidate for the study of gene expression since
differentiation of these genes exhibits both temporal and coordinate regulation.
Thus, the globin gene has been extensively studied for its expression,
function, and abnormalities. It has been observed that mutations in the
β-globin gene and its upstream regions can cause many genetic disorders [16].
System and Methods
Implementation
of the ANN-GA methodology is a two-part procedure; the first part consists of
training an EBPN with a view to model the input-output example data. An EBPN
architecture in general possesses three layers (input, hidden, and output) of
neurons (also termed as “nodes”). The nodes in the successive layers are
connected using weighted links. The two sets of example data to be modeled
(correlated) by training an EBPN form the network input and the desired output,
respectively. In the present study, DNA sequences of the β-globin gene and the
corresponding transcription efficiency values form the EBPN input and output,
respectively. Training of an EBPN involves minimization of an error function
such as the sum-squared-error (SSE) using a strategy known as the generalized
delta rule (GDR). During minimization, the network outputs are compared with
their desired values and the corresponding SSE is used to update the values of
the inter-layer connection weights. The weight updating continues until a
convergence criterion is satisfied. At this point the network is assumed to be
trained. A detailed description of EBPN training can be found in numerous
places (see, e.g., [17-18]).
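For orientation, the SSE and the GDR weight update are commonly written in the following textbook form. The symbols are assumptions introduced here, not notation from this paper: t_p and o_p are the desired and computed outputs for training pattern p, w_ij is a connection weight, n is the iteration index, η is the learning rate, and α is the momentum coefficient. The factor 1/2 is a convention that only simplifies the derivative.

```latex
E = \frac{1}{2} \sum_{p} \left( t_p - o_p \right)^2
\qquad
\Delta w_{ij}(n) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(n-1)
```

The momentum term reuses a fraction of the previous weight change, which tends to damp oscillations during gradient descent.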
In the second
part of the ANN-GA hybrid methodology, a GA rigorously searches the input space
of the trained EBPN so as to maximize its output. In essence, the GA searches
the sequence space with a view to maximize the magnitude of the transcription
efficiency. GA begins by randomly encoding a set (population) of possible
solutions to the optimization problem in the form of “chromosome strings”. A
pre-specified objective function returns the fitness value (score) of each
chromosome string in a population that serves as a measure of the goodness of
the solution searched by the GA. In the ANN-GA methodology, the trained EBPN
acts as an objective function wherein the network output also represents the
fitness score of the GA-searched solution string (a DNA sequence). For
computing the fitness value, the DNA solution string is applied as an input to
the trained EBPN and the network output is evaluated. Since a nonlinear
activation function such as the logistic sigmoid is used to compute the output
of EBPN's output nodes, the fitness value is always constrained between zero
and one. With this background, a simple five-step GA has been described in the
following:
Step 1 (Initialization): Create a random initial population of N chromosome
strings where each string contains l elements. A string element characterizing
a nucleotide is chosen randomly with equal probability of selecting either A,
T, G, or C. Evaluate each chromosome in the initial population using the ANN
as the objective function. Set the initial population as the current
population.
Step 2 (Selection): Select chromosome strings from the current population to
form a mating pool to be used subsequently for offspring production. The
selection procedure is stochastic in nature and is carried out using the
weighted Roulette-wheel algorithm, wherein fitter chromosome strings, on a
priority basis, select their partner from among the remaining strings. The
probability of selecting a particular partner string is directly proportional
to its fitness score. Such a selection procedure gives rise to a mating pool
comprising N/2 parent pairs.
Step 3 (Crossover): The action of this most important GA operator
results in creating two offspring chromosomes from each parent-pair. Typically,
the two parent chromosomes are cut at the same randomly selected crossover
point to obtain two sub-strings per parent string. The second sub-strings are
then mutually exchanged between the parent chromosomes and combined with the
respective first sub-strings to generate two offspring chromosomes (see Figure
1). The probability of crossover (Pcross)
is kept high. The crossover operator essentially generates new solution strings
(DNA sequences) thereby searching hitherto unexplored regions in the solution
space. Repeating the crossover operation on the N/2 parent pairs generates N
offspring strings, following which the offspring population is merged with the
parent population; the post-merger population has 2N strings.
Step 4 (Mutation): Randomly change (mutate) elements of the offspring strings,
where the probability (Pmut) of an element undergoing mutation is kept small.
The objective of mutation is to create new solutions in the neighborhood of
the region represented by the 2N chromosome strings and thereby perform a
local search around that region. Subsequently, evaluate the fitness of each
chromosome using the EBPN as the objective function and rank the 2N strings in
descending order of their fitness scores. Next, discard the lower half of the
2N-sized population and set the resulting population of size N as the
new population (generation).
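The selection, crossover, and mutation steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' source code: `toy_fitness` is a stand-in for the trained EBPN, the partner-selection routine is one possible reading of the Roulette-wheel description, and a mutated element is assumed to be replaced by a different nucleotide.

```python
import random

ALPHABET = "ATGC"

def toy_fitness(seq: str) -> float:
    """Stand-in objective (fraction of 'A'); the paper uses the trained EBPN."""
    return seq.count("A") / len(seq)

def select_pairs(pop, fitness, rng):
    """Step 2: each of the N/2 fittest strings picks a partner from the
    remaining strings with probability proportional to fitness."""
    ranked = sorted(pop, key=fitness, reverse=True)
    pairs = []
    for i in range(len(ranked) // 2):
        others = ranked[:i] + ranked[i + 1:]
        weights = [fitness(s) + 1e-9 for s in others]  # avoid all-zero weights
        partner = rng.choices(others, weights=weights, k=1)[0]
        pairs.append((ranked[i], partner))
    return pairs

def crossover(a, b, p_cross, rng):
    """Step 3: single-point crossover at the same cut point in both parents."""
    if rng.random() < p_cross and len(a) > 1:
        cut = rng.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a, b

def mutate(seq, p_mut, rng):
    """Step 4: with probability p_mut, replace an element by a different base."""
    out = []
    for base in seq:
        if rng.random() < p_mut:
            base = rng.choice([c for c in ALPHABET if c != base])
        out.append(base)
    return "".join(out)

def next_generation(pop, fitness, p_cross, p_mut, rng):
    """Steps 2-4: build N offspring, merge with the N parents, keep the best N."""
    offspring = []
    for a, b in select_pairs(pop, fitness, rng):
        for child in crossover(a, b, p_cross, rng):
            offspring.append(mutate(child, p_mut, rng))
    merged = pop + offspring
    merged.sort(key=fitness, reverse=True)
    return merged[:len(pop)]

# Step 1 on a toy scale: 10 random strings of length 20.
rng = random.Random(42)
population = ["".join(rng.choice(ALPHABET) for _ in range(20)) for _ in range(10)]
best_before = max(toy_fitness(s) for s in population)
for _ in range(5):
    population = next_generation(population, toy_fitness, 1.0, 0.01, rng)
best_after = toy_fitness(population[0])
```

Because the unmutated parents are merged with the offspring before the lower half is discarded, the best fitness in the population cannot decrease from one generation to the next.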
The above-described procedure is repeated until a pre-selected convergence
criterion is satisfied, e.g., the GA has evolved over a fixed number of
generations or the fitness of the best solution no longer improves in
successive generations. The best chromosome, as judged by the
highest fitness score following convergence, represents the final solution of
the genetic search. The essence of GA-implementation can be stated as: better
solutions in the current population are selected for the reproduction and their
offspring generated via crossover and mutation operations replace the
sub-optimal solutions. The population of candidate solutions, owing to the
repetitive actions of the crossover and mutation operators, improves itself from
one generation to the next till convergence is achieved.
As most steps involved
in the GA implementation are performed stochastically, the final solution
depends upon the series of random numbers used during the search. Thus, it may
be necessary - for securing an overall optimal solution - to repeat the search
procedure giving each time a different seed to the random number generator.
In this way the GA begins with different initial populations, which helps in
the exploration of widely different regions of the solution space.
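The idea of repeating a stochastic search with different seeds can be sketched as below. Note that `toy_search` is a deliberately simple placeholder (random hill climbing on a toy score), not the GA itself, and `best_of_restarts` is a hypothetical helper name; any seeded search routine returning a (solution, score) pair could be plugged in.

```python
import random

def toy_search(seed: int, length: int = 12, steps: int = 200):
    """Placeholder stochastic search (random hill climbing), NOT the GA.
    Returns (best_string, best_score); the score is the fraction of 'A'."""
    rng = random.Random(seed)  # the seed fully determines this run

    def score(s):
        return s.count("A") / len(s)

    best = [rng.choice("ATGC") for _ in range(length)]
    for _ in range(steps):
        cand = best[:]
        cand[rng.randrange(length)] = rng.choice("ATGC")
        if score(cand) >= score(best):
            best = cand
    return "".join(best), score(best)

def best_of_restarts(search, seeds):
    """Run `search` once per seed and keep the highest-scoring result."""
    results = [search(seed) for seed in seeds]
    return max(results, key=lambda r: r[1])

best_seq, best_score = best_of_restarts(toy_search, seeds=range(5))
```

Keeping the seed explicit also makes every individual run reproducible.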
In an earlier
study [10], the problem of modeling transcription efficiency was addressed
using EBPN as the modeling tool. The data for modeling was taken from the
mutation studies carried out by Myers et al. [19-20] wherein saturation
mutagenesis has been used to introduce random single base substitutions into
the mouse β-globin promoter region. The
effects of single base substitutions in the β-globin promoter have been
determined by comparing the levels of correctly initiated RNA derived from the
test and reference plasmids co-transfected into HeLa cells and expressed as the
relative transcription level (RTL) of each mutant. The expression used for
computing the RTL value is:
RTL = (M / R1) / (WT / R2)    (1)
where M refers to the signal from the mutant test gene; WT is the signal from
the wild-type test gene; R1 represents the signal from the reference gene
co-transfected with the mutant test gene; and R2 denotes the signal from the
reference gene co-transfected with the wild-type test gene.
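A small worked example may help. It assumes that Equation (1) has the usual ratio-of-normalized-signals form, RTL = (M / R1) / (WT / R2), which is inferred here from the symbol definitions above. The function name and the signal values are hypothetical.

```python
def relative_transcription_level(m: float, r1: float, wt: float, r2: float) -> float:
    """RTL = (M / R1) / (WT / R2): each test-gene signal is first normalized by
    the reference-gene signal from the same co-transfection (assumed form)."""
    if r1 <= 0 or r2 <= 0 or wt <= 0:
        raise ValueError("reference and wild-type signals must be positive")
    return (m / r1) / (wt / r2)

# Hypothetical signal intensities in arbitrary units, for illustration only.
rtl_wild_type = relative_transcription_level(m=50.0, r1=100.0, wt=50.0, r2=100.0)
rtl_up_mutant = relative_transcription_level(m=150.0, r1=100.0, wt=50.0, r2=120.0)
```

With these made-up numbers the wild type evaluates to 1.0 by construction, and the second call evaluates to about 3.6. Normalizing by the reference signals compensates for differences in transfection efficiency between the two experiments.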
The data used by Nair et al. [10] consisted of the β-globin promoter and its
mutant sequences (network input) and their corresponding RTL values (network
output). In the present work we used the available data on single base
substitutions in the upstream region of β-globin and their effects on
the RTL value. It is important to note that data on the effects of multiple
base substitutions are practically nonexistent. It is expected, however, that a
properly trained neural network would capture the intrinsic patterns. For EBPN
training, the sequences with mutations were coded using the CODE-4 strategy
[21], wherein each of A, T, G and C was represented by four binary
digits: 0001 = C, 0010 = G, 0100 = A, and 1000 = T. The desired (target)
output for each sequence was the experimentally determined RTL value
normalized by dividing by ten so that it lies between zero and one. The EBPN
architecture had 484 neurons in the input layer (four per nucleotide) for
representing the DNA sequences, each of length 121 bp, eight neurons in a
single hidden layer, and one neuron in the output layer to represent the RTL
value (see Figure 2). The values of the GDR parameters,
namely, the learning rate and momentum coefficient that resulted in the optimal
values of the EBPN weights were 0.6 and 0.9, respectively.
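The CODE-4 encoding and the target scaling described above can be illustrated with a short sketch. The bit patterns are those given in the text; the function names `encode_code4` and `normalize_rtl` are assumptions made for this example.

```python
from typing import List

# Bit patterns as stated in the text: 0001 = C, 0010 = G, 0100 = A, 1000 = T.
CODE4 = {"C": "0001", "G": "0010", "A": "0100", "T": "1000"}

def encode_code4(seq: str) -> List[int]:
    """Encode a DNA string as a flat 0/1 vector, four bits per nucleotide."""
    bits: List[int] = []
    for base in seq.upper():
        if base not in CODE4:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        bits.extend(int(b) for b in CODE4[base])
    return bits

def normalize_rtl(rtl: float) -> float:
    """Scale an experimental RTL value by dividing by ten."""
    return rtl / 10.0

example = encode_code4("CACCC")
# example == [0,0,0,1, 0,1,0,0, 0,0,0,1, 0,0,0,1, 0,0,0,1]
```

A 121 bp sequence therefore yields 121 × 4 = 484 input values, matching the size of the input layer.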
The flow-chart
of the ANN-GA hybrid methodology as applied to the RTL optimization problem is
depicted in Figure 3. The steps in flow-chart concerning the objective function
(RTL) evaluation were executed using the optimal EBPN weights obtained by Nair
and co-workers [10]. This essentially involves operating the trained EBPN in
the prediction mode and multiplying the output by ten. The specific steps in
the flow-chart relating to GA were implemented as given below.
Instead of creating the initial population (step 1) of candidate solutions
representing the DNA sequences randomly, we used the promoter sequence of the
mouse β-globin gene and its mutants as the initial population for the GA
analysis. Specifically, 130 patterns of DNA promoter sequences and their
mutants, whose experimental RTL values are known, were used as the strings in
the initial population. This was done purposely so that the GA search begins
directly from the most plausible region of the solution space. The values of
the GA parameters used for simulation are: population size (N) = 130,
probability of crossover (Pcross) = 1.0, probability of mutation (Pmut) =
0.01, total number of generations over which the GA evolves (Ngen) = 100, and
the length of each chromosome string (l) = 121. The source code used to obtain
the upstream regions of the β-globin gene having high RTL values is available
on request from the corresponding author.
Results and Discussion
In this study,
we have specifically analyzed the transcriptional control signals of a
eukaryotic protein-coding gene for establishing a relationship between the site
of mutation and increased level of the process of eukaryotic gene
transcription. Experimentally, Myers and co-workers [20] could obtain only one
single base substitution pattern of the upstream region of the β-globin gene
whose transcription efficiency was 3.5. However, using the ANN-GA methodology
with multiple base substitutions, it was possible to obtain a large number of
sequences having transcription efficiency greater than 3.5. This was achieved
by repeating the ANN-GA
procedure several times while utilizing every time a different seed value for
initializing the random number generator. In the ensuing paragraphs we discuss
the significance of the results obtained using the ANN-GA optimization
approach. For brevity, the discussion is limited to only ten sequences
possessing RTL magnitudes in excess of 3.5. These sequences and their
corresponding RTL values are listed in Table I.
Myers and co-workers [20] have shown that single base substitutions in three
conserved regions of the promoter, namely (i) the CACCC box, (ii) the CCAAT
box, and (iii) the TATA box, resulted in a significant decrease in the level
of transcription. It was also shown that a promoter containing two base
substitutions, one at -75 and the other at -74, results in a 40- to 50-fold
decrease in the RTL. In contrast, two different mutations in nucleotides
immediately upstream from the CCAAT box caused a 3- to 3.5-fold increase in
transcription. Thus, positions -78 and -79 were termed "up mutations". With
these two minor exceptions, single base substitutions in all other regions of
the promoter were shown to have no effect on transcription. The ANN-GA
approach, on the other hand, could arrive at multiple base substitutions that
synergistically show a significant increase in the transcription efficiency.
A comparison of sequences in the upstream region of the β-globin gene (glo,
RTL = 1.00) with the ANN-GA predicted sequences from the same region (R1 to
R10, RTL > 3.5) has been made using the FASTA package [22]. Such a comparison
helps to understand the role of nucleotide variation leading to the high
transcription efficiency of the ANN-GA simulated patterns vis-a-vis the
original sequence of the upstream region of the β-globin gene. The results of
the comparison, shown in Table II, indicate that the predicted upstream-region
sequences possessing maximum transcription efficiency show 74.4-95.8% sequence
homology with the upstream region having a transcription efficiency value of
one. The nucleotide positions at which the sequences predicted by the ANN-GA
method differ from the upstream region of the β-globin gene can be considered
as effective mutation points (listed in Table III) for sequences indexed as R1
to R10. These points are most probably responsible for enhancing the
transcription efficiency of the β-globin gene.
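The notion of percent identity and effective mutation points can be sketched with a simple position-by-position comparison. This is a simplification: the FASTA package performs a proper alignment, whereas the code below assumes equal-length, ungapped sequences. The coordinate convention (numbering starts at -101 and skips 0) and the short example sequences are assumptions for illustration, not data from Tables II and III.

```python
def promoter_positions(length: int, start: int = -101) -> list:
    """Label each index with a promoter coordinate, skipping 0
    (assumed convention: ..., -2, -1, +1, +2, ...)."""
    positions = []
    pos = start
    while len(positions) < length:
        if pos != 0:
            positions.append(pos)
        pos += 1
    return positions

def compare_to_wild_type(wild: str, variant: str, start: int = -101):
    """Return (% identity, [(position, wild_base, variant_base), ...])."""
    if len(wild) != len(variant):
        raise ValueError("sequences must have equal length (gaps are not handled)")
    coords = promoter_positions(len(wild), start)
    mutations = [(p, w, v) for p, w, v in zip(coords, wild, variant) if w != v]
    identity = 100.0 * (len(wild) - len(mutations)) / len(wild)
    return identity, mutations

# Short made-up sequences, for illustration only (not the β-globin promoter).
identity, mutations = compare_to_wild_type("GACCAATG", "GATCAATC", start=-5)
# identity == 75.0; mutations == [(-3, 'C', 'T'), (3, 'G', 'C')]
```

Under the assumed numbering, a 121 bp sequence starting at -101 would end at position +20.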
The ANN-GA simulation results show that not all mutations in the three
conserved regions decrease the RTL, as is generally believed based upon the
available experimental results [20]. A close look at the sequences R1 to R10
reveals the following: (i) mutations in conserved regions can enhance RTL
(sequences R1, R3, R4, R7, R8, and R9), and (ii) mutations in non-conserved
regions can also enhance RTL (sequences R2, R5, R6 and R10). In what follows
we analyze these cases separately. Also, to understand the role of individual
mutation positions and their surroundings, we further subdivide the sequence
into seven segments: (i) the upstream region of the CACCC box (i.e., positions
-101 to -96), (ii) the CACCC box (positions -95 to -87), (iii) the region
between the CACCC box and the CCAAT box (positions -86 to -78), (iv) the CCAAT
box (positions -77 to -72), (v) the region between the CCAAT box and the TATA
box (positions -71 to -31), (vi) the TATA box (positions -30 to -26), and
(vii) the region from -25 to the cap site together with the region below the
cap site.
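The seven-segment subdivision can be written as a small lookup, which is convenient when tabulating mutation points by region. The boundaries are those listed above; the segment labels and the function name `segment_of` are illustrative choices.

```python
# (label, first position, last position), boundaries as listed in the text.
SEGMENTS = [
    ("upstream of CACCC box", -101, -96),
    ("CACCC box", -95, -87),
    ("between CACCC and CCAAT boxes", -86, -78),
    ("CCAAT box", -77, -72),
    ("between CCAAT and TATA boxes", -71, -31),
    ("TATA box", -30, -26),
]

def segment_of(position: int) -> str:
    """Map a promoter coordinate to one of the seven segments."""
    if position < -101:
        raise ValueError("position lies upstream of the analyzed region")
    for label, first, last in SEGMENTS:
        if first <= position <= last:
            return label
    # Everything from -25 onward falls into the seventh segment.
    return "-25 to cap site and below"

regions = [segment_of(p) for p in (-78, -74, -27, 14)]
```

For example, the "up mutation" position -78 falls in the region between the CACCC and CCAAT boxes, not in the CCAAT box itself.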
I. Mutations in conserved regions leading to higher RTL
CACCC box (located between -95 to -87 position):
· The optimal sequences having
value of RTL in excess of 3.5 searched by the genetic algorithm, including the
representative examples of sequences shown here (R1 to R10), reveal that the
positions -87, -90, -91, -92 and -93 remain unaltered. This feature is
therefore relevant for obtaining sequences with higher RTL.
· Mutations at positions other than those listed above can cause enhancement
in RTL. We show one example of each such alteration. Thus mutations at
position -88 (sequence R9) and -89 (sequence R8), along with changes at a few
other positions (see sequences R8 and R9 for details), cause a several-fold
increase in RTL. It is important to note that these sequences also include
mutations at the 'up-mutation points'. Sequences R4 and R7 showcase examples
in which mutations occur at the remaining positions, viz. -94 and -95, and
cause enhancement. In these examples, too, the mutations are accompanied by
changes at a few other locations, but this time mutations at the 'up-mutation
points' are not involved.
CCAAT box (present between -77 and -72 positions):
· Sequences R1 to R10 show that the nucleotide positions -73, -75, -76 and
-77 remain unchanged. Leaving these positions unaltered seems to be important
for high transcription efficiency. Other positions within this region, viz.
-72 and -74, can undergo mutations to cause increased RTL. We show one example
of each.
· Sequence R3 indicates that if a mutation at position -74 is accompanied by
mutations at the "up mutation points" (positions -78 and -79), then an
increase in the RTL value is witnessed. Note that a mutation at position -74
is responsible for lowering the RTL magnitude, whereas mutations at positions
-78 and -79 cause an increase. The simultaneous mutations have a synergistic
effect, causing an enhancement greater than that known for the up mutation
points.
· Upon examining sequence R8
it can be noted that if nucleotide position -72 is mutated in combination with
"up mutation point" (position -78), and other favorable mutation
points (especially in the region -71 to -31 and -25 to cap site), then it
causes high magnitude of RTL.
TATA box (lying between -30 and -26 positions):
· For sequences R1 and R8, mutations at positions -27 and -30 increase the
RTL value when accompanied by a mutation at position -78 and, additionally, at
other favorable mutation points such as positions -47 and -66. These results
once again underline the importance of up mutation points, such as position
-78.
· At positions -26 and -29 of sequence R4, transition (A → G, i.e., R → R)
mutations are witnessed. Here, despite the presence of mutations in the TATA
box, a high RTL value has been obtained. This can be interpreted as follows:
if specific mutations (positions -26 and -29) in the TATA box are supported by
drastic variation in the nucleotide content of the regions surrounding the
TATA box (i.e., the regions between -71 and -31, and between -25 and the cap
site), then they result in increased RTL.
· The % identity (homology) of sequence R4 with the original β-globin gene
promoter is 74.4. Although this value is the lowest among the ten ANN-GA
predicted patterns (refer Table II), the corresponding RTL value (= 4.8404) is
high.
II. Mutations in non-conserved regions leading to higher RTL
Upstream region of CACCC box (positions -101 to -96):
· If mutations in this region are in favorable agreement with other mutation
points, especially in the region -71 to -31, they cause an increase in the
magnitude of RTL. This is evidenced by the sequence entries R2, R4, and R7-R10
listed in Table III. The sequences also indicate that G at positions -97, -84
and -78 is always replaced by A, T and C, respectively.
· For the ten patterns in
Table III, positions -99 and -100 are always conserved thus indicating their
importance in maintaining high transcription efficiency.
Region between CACCC box and CCAAT box (positions -86 to -78):
· The region is of prime
importance since it includes the most important positions i.e., -78 and -79.
These two "up mutation points" are primarily responsible for
increased transcription efficiency (see sequences R1, R3, R6, R8 and R9).
· Sequences R1-R10 do not
exhibit any effective mutation at -77 position. Moreover, as verified
experimentally [20], the mutation at -77 position, which is in the nearest-neighbor
position of up mutation points (i.e., -78 and -79 position), does not seem to
help in increasing transcription efficiency.
· At position -78 of sequences R1 and R3, and at position -84 of sequences R5
and R9, a transversion type of mutation (at -84 and -78, G → C or T, i.e.,
R → Y) can be observed. It can therefore be inferred that transversion
mutations at these positions can cause an increased magnitude of RTL.
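For readers less familiar with the terminology: a transition replaces a purine with a purine (R → R) or a pyrimidine with a pyrimidine (Y → Y), whereas a transversion switches between the two classes (R → Y or Y → R). A minimal classifier, with an illustrative function name, is:

```python
PURINES = {"A", "G"}       # R
PYRIMIDINES = {"C", "T"}   # Y

def substitution_type(ref: str, alt: str) -> str:
    """Classify a single base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        return "none"
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

kinds = [substitution_type(r, a) for r, a in (("A", "G"), ("G", "C"), ("G", "T"))]
# kinds == ['transition', 'transversion', 'transversion']
```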
Region between CCAAT box and TATA box (positions -71 to -31):
· Table III lists various combinations of multiple base substitutions for
sequences R1-R10 in the region between the CCAAT box and the TATA box that
result in an increased RTL value. However, the average trend in the ten
sequences shows that nucleotide positions -71, -70, -68, -67, -65, -55, -48
and -43 remain unchanged in sequences with high RTL. Thus, leaving these
positions unaltered seems to be important in obtaining high RTL.
· Transversion type of mutations (-60 G → T; -59 and -57 A → T or C, i.e.,
R → Y) seen at position -60 (sequences R4, R5 and R6), at position -59
(sequences R2, R4 and R8), and at position -57 (sequences R4, R7 and R8) appear
to cause high transcription efficiency.
Region between -25 to cap site and in the region below the cap site:
· In most cases, the mutations in these regions have favorably supported the
multiple base substitutions in the upstream region of the gene. It is also of
interest to study the role of this region in causing increased transcription
efficiency for sequences where the % identity between the original β-globin
promoter sequence and the ANN-GA simulated promoter patterns is greater than
90% (refer Table II). Although R6, R9, and R10 meet the stated criterion, we
concentrate only on sequence R10, since sequences R6 and R9 show the presence
of up mutation points. The % identity of sequence R10 with the β-globin
promoter is 94.2 and its RTL is 3.6896. An interesting feature of this
sequence is that none of the three conserved regions, i.e., the CACCC, CCAAT
and TATA boxes, is subjected to any mutational change; the sequence shows
variation only in regions -101 to -96, -71 to -31, and below the cap site
(position +14). Since R10 possesses maximum homology with the original
β-globin gene, only eight effective mutation points that can lead to higher
RTL are possible. Thus mutations at positions -101, -98, -97, -56, -51, -46,
-41 and +14 can cause increased RTL.
· Among the ten sequences, R8 possesses the highest RTL magnitude (= 6.7307).
This pattern includes a mutation at position -78 (up mutation point) and has a
% identity value of 79.3. Hence, sequence R10 gives us an idea about the
effective multiple mutation points, in regions -71 to -31, -25 to the cap
site, and below the cap site, that eventually lead to the highest RTL value.
This is an example of how the ANN-GA optimization methodology could be
exploited for a priori estimation of multiple base substitutions before
conducting the mutation experiments.
Sequence-dependent DNA structure is important in packaging, recombination and
transcription. Therefore it is of interest to study the role of
sequence-dependent DNA structure in governing the extent of transcription
efficiency. For this purpose, the CURVATURE program [23] can be used. This
program is useful for plotting the sequence-dependent spatial trajectory of
the DNA double helix and/or the distribution of curvature along the DNA
molecule. The routine calculates the overall DNA path using experimentally
determined local helix parameters, namely, helix twist angle, wedge
(deflection) angle, and direction (of deflection) angle [24]. The CURVATURE
software can thus be used to investigate the possible role of curvature in the
modulation of gene expression and to locate curved portions of DNA that may
play an important role in sequence-specific DNA-protein interactions.
For conducting the above-mentioned investigation, the DNA sequence of the
upstream region of the β-globin gene (glo, RTL = 1.00) and the ANN-GA
predicted patterns of the β-globin gene were used as inputs to the CURVATURE
program, and the likely degree of curvature at each point along the molecule
was computed. The graphical comparison of the curvature map of the promoter
sequence of the β-globin gene and the ANN-GA predicted promoter sequences is
depicted in Figure 4. The results suggest that sequences having maximum
transcription efficiency show sequence-dependent bendability or deformability
of duplex DNA. This can be justified by the fact that certain nucleic acid
sequences take up a particular structure required for binding to a protein at
lower free energy than other sequences. The comparison also reveals that a
change in the superstructure results in an alteration of transcriptional
activity. These results in essence indicate that the ANN-GA methodology is
able to capture the relationship between DNA superstructures and
transcriptional activity.
Figure 5 shows the comparison of the spatial trajectories of the DNA double
helix of the upstream region of the β-globin gene (glo, RTL = 1.0) and the
promoter sequence (R8) having the highest RTL (= 6.7307). In both cases, the
projections are chosen such that the most curved regions of the fragments are
seen best. This is done by placing the plane in which the axis is curved
perpendicular to the viewing direction. Any other orientation would result in
a false impression of excessive curvature. It can be seen in Figure 5 that the
promoter pattern R8 is more curved at the center than the promoter sequence of
the β-globin gene (glo). This structural variation, which changes the
signature of the β-globin gene, is responsible for recognition by RNA
polymerase and thus facilitates transcription.
Conclusion
A highly intricate process such as transcription can be captured well using a
hybrid approach that combines two intelligent tools, an ANN and a GA. This
approach helps us to study the effect of multiple base substitutions that
increase transcription efficiency. The simulation results can be used as a
guide in designing mutation experiments, since an a priori estimate of the
possible outcome of multiple mutations can be obtained. The methodology has
also captured the role of DNA superstructure in gene expression. Such a hybrid
approach, involving an ANN that maps the given inputs onto the outputs and a
genetic algorithm (GA) that maximizes the output by searching the input space
of the ANN, can in principle be used for optimizing other biological
properties for which suitable input-output data are available.
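The division of labour between the two tools can be sketched as follows. This
is a minimal illustration under stated assumptions, not the implementation
used in this study: the function surrogate_rtl is a hypothetical toy stand-in
for the trained ANN, the four-inputs-per-base encoding is assumed to be a
one-hot form of the CODE-4 representation, and the population size, mutation
rate and selection scheme (truncation with elitism) are arbitrary choices.

```python
import random

BASES = "ACGT"


def encode(seq):
    """One-hot encode a sequence, 4 inputs per base (assumed form of CODE-4)."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def surrogate_rtl(seq):
    """Toy stand-in for the trained ANN (hypothetical, not the paper's model)."""
    x = encode(seq)
    return sum(((7 * i) % 11) / 10.0 * xi for i, xi in enumerate(x))


def crossover(a, b, rng):
    """Single-point crossover of two parent strings of equal length."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(seq, rate, rng):
    """Replace each base by a randomly drawn base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else c for c in seq)


def run_ga(score, start, pop_size=30, generations=40, rate=0.02, seed=0):
    """Search sequence space for a high-scoring variant of `start`."""
    rng = random.Random(seed)
    pop = [start] + [mutate(start, 0.1, rng) for _ in range(pop_size - 1)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        parents = pop[: pop_size // 2]  # truncation selection
        children = [pop[0]]             # elitism: keep the current best
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            c1, c2 = crossover(p1, p2, rng)
            children.extend([mutate(c1, rate, rng), mutate(c2, rate, rng)])
        pop = children[:pop_size]
    best = max(pop, key=score)
    return best, score(best), history
```

A call such as run_ga(surrogate_rtl, "ACGT" * 5) returns the best sequence
found, its score and the per-generation best scores. Because the best
individual is always carried over, the best score never decreases from one
generation to the next. In an actual application the scoring function would be
the trained network's predicted RTL rather than the toy function used here.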
References
1. McKnight, S.L. and Kingsbury, R. Transcription control signals of a eukaryotic protein-coding gene. (1982) Science, 217, 316-324.
2. McKnight, S.L., Kingsbury, R.C., Spence, A. and Smith, M. The distal transcription signals of the herpesvirus tk gene share a common hexanucleotide control sequence. (1984) Cell, 37, 253-262.
3. Graves, B.J., Johnson, P.F. and McKnight, S.L. Homologous recognition of a promoter domain common to the MSV LTR and the HSV tk gene. (1986) Cell, 44, 565-576.
4. Gidoni, D., Kadonaga, J.T., Barrera-Saldana, H., Takahashi, K., Chambon, P. and Tjian, R. Bi-directional SV40 transcription mediated by tandem Sp1 binding interactions. (1985) Science, 230, 511-517.
5. Grosveld, G.C., de Boer, E., Shewmaker, C.K. and Flavell, R.A. DNA sequences necessary for transcription of the rabbit β-globin gene. (1982) Nature, 295, 120-126.
6. Edgar, T.F. and Himmelblau, D.M. (1989) Optimization of Chemical Processes. McGraw-Hill.
7. Nair, T.M., Tambe, S.S. and Kulkarni, B.D. Analysis of transcription control signals using artificial neural networks. (1995) Comput. Applic. Biosci., 3, 293-300.
8. Rumelhart, D.E., Hinton, G.E. and Williams, R.J. Learning representations by back-propagating errors. (1986) Nature, 323, 533-536.
9. Rumelhart, D.E. and McClelland, J.L. (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
10. Nair, T.M. Artificial neural networks in biological sciences. (1996) In Tambe, S.S., Kulkarni, B.D. and Deshpande, P.B. (Eds) Elements of Artificial Neural Networks with Selected Applications in Chemical Engineering and Biological Sciences. Simulation and Advanced Controls, Inc., Louisville, 395-437.
11. Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, Mass.
12. Davis, L. (Ed.) (1991) Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York.
13. Holland, J.H. (1992) Adaptation in Natural and Artificial Systems. 2nd ed. University of Michigan Press, Ann Arbor.
14. Hanagandi, V., Ploehn, H. and Nikolaou, M. Solution of the Self-Consistent Field Model for Polymer Adsorption by Genetic Algorithms. (1996) Chem. Eng. Sci., 51, 1071-74.
15. Schoenauer, M. and Michalewicz, Z. Evolutionary Computation. (1997) Control and Cybernetics, 26, 307-338.
16. Efstratiadis, A., Posakony, J.W., Maniatis, T., Lawn, R.M., O'Connell, C., Spritz, R.A., DeRiel, J.K., Forget, B.G., Weissman, S.M., Slightom, J.L., Blechl, A.E., Smithies, O., Baralle, F.E., Shoulders, C.C. and Proudfoot, N.J. The structure and evolution of the human β-globin gene family. (1980) Cell, 21, 653-668.
17. Freeman, J.A. and Skapura, D.M. (1992) Neural Networks: Algorithms, Applications, and Programming Techniques. Addison-Wesley.
18. Tambe, S.S., Kulkarni, B.D. and Deshpande, P.B. (Eds) (1996) Elements of Artificial Neural Networks with Selected Applications in Chemical Engineering and Biological Sciences. Simulation and Advanced Controls, Inc., Louisville.
19. Myers, R.M., Lerman, L.S. and Maniatis, T. A general method for saturation mutagenesis of cloned DNA fragments. (1985) Science, 229, 242-247.
20. Myers, R.M., Tilly, K. and Maniatis, T. Fine structure genetic analysis of β-globin promoter. (1986) Science, 232, 613-618.
21. Demeler, B. and Zhou, G. Neural network optimization for E. coli promoter prediction. (1991) Nucl. Acids Res., 19, 1593-1599.
22. Pearson, W.R. Rapid and sensitive sequence comparison with FASTP and FASTA. (1990) Methods Enzymol., 183, 63-98.
23. Shpigelman, E.S., Trifonov, E.N. and Bolshoy, A. CURVATURE: software for the analysis of curved DNA. (1993) Comput. Applic. Biosci., 9, 435-440.
24. Bolshoy, A., McNamara, P., Harrington, R.E. and Trifonov, E.N. Curved DNA without A-A: Experimental estimation of all 16 wedge angles. (1991) Proc. Natl. Acad. Sci. USA, 88, 2312-2316.
25. Trifonov, E.N. and Ulanovsky, L.E. Inherently curved DNA and its structural elements. (1987) In Wells, R.D. and Harvey, S.C. (eds) Unusual DNA Structures. Springer-Verlag, Berlin, pp. 173-187.
Legends to Figures:
Figure 1: Basic crossover of the nucleotide sequence of the two parent strings.
Figure 2: Architecture of the trained EBPN consisting of (i) 484 input neurons
(121 bp long promoter sequences are coded using the CODE-4 representation),
(ii) eight neurons in the hidden layer, and (iii) one neuron (representing the
RTL value) in the output layer.
Figure 3: Flow chart for the implementation of the ANN-GA strategy for the
optimization of transcription efficiency (in terms of its RTL value) of the β-globin gene.
Figure 4: Comparison of the curvature map of the upstream region of the β-globin gene (glo, RTL=1.0) and the ANN-GA
predicted promoter patterns of the β-globin gene (R1 to R10, RTL
> 3.5). Curvature is given in DNA curvature units (Trifonov and Ulanovsky,
1987), where one unit is the mean DNA curvature in the crystalline nucleosome
(1/42.8 Å).
Figure 5(a): DNA path of the β-globin gene (glo, RTL=1.0)
plotted using the CURVATURE program.
Figure 5(b): DNA path of the ANN-GA predicted promoter sequence (R8, RTL=6.7307)
plotted using the CURVATURE program.
Legends to Tables:
Table I: Sequence details (simulated patterns of the upstream region of the β-globin gene) along with their ANN-GA
predicted Relative Transcription Level (RTL) values.
Table II: Comparison of the upstream region of the β-globin gene with the ANN-GA
predicted promoter patterns for sequence homology using the FASTA package.
Table III: Effective mutation points for the ANN-GA predicted promoter patterns,
grouped by sub-region.