The present invention relates to computer-based drug discovery. More particularly, it relates to identifying binding conformations of chemical fragments and biological molecules.
Bringing a new drug to market currently takes about 10 to 15 years, and it costs hundreds of millions of dollars. Consequently, pharmaceutical and biotechnology companies are interested in approaches to make the drug discovery process more efficient, both in terms of speed and cost. Among the technologies that have been brought to bear on this problem are computer-implemented simulation methods such as, for example, simulations that allow virtual or in-silico screening and docking of candidate drug compounds to a binding site on a pharmaceutically-relevant target molecule. By identifying from a large pool of candidate compounds those few possessing conformations consistent with a desired activity, a researcher can save considerable time that would otherwise be lost synthesizing and testing many different compounds.
Although biological molecules are flexible, the study of the binding of a rigid fragment to a rigid molecule has application in drug discovery, particularly when fragment based methods are used. Information obtained from such studies can be used to guide the process of selecting fragments and connecting them, for example, to design potent inhibitors. Such information includes the locations and distributions of fragment poses that achieve low interaction energies as well as the fragment binding affinities.
A variety of computational approaches to this problem, often referred to as rigid docking, have been used with mixed results as to the quality and usability of the results to assist in the drug discovery process. Many of these methods are based on scoring functions or other heuristics, with parameters that are often fitted (and in some cases over-fitted), to reproduce known results on known sets of protein-ligand systems. Physically based methods have also been used with mixed results, for example, in which force fields are used in conjunction with energy minimization, Molecular Dynamics, or Monte Carlo methods. The computational costs for such calculations however often render them practically applicable only for refining structures produced by other methods.
What is needed are novel methods and techniques for designing new drugs that overcome the limitations of conventional methods.
The present invention provides a new approach to identifying binding conformations of chemical fragments and biological molecules, in which fragment poses are explored in a systematic fashion. In an embodiment, for each selected pose, a fast computation is performed of the fragment interaction with the biological molecule using interpolation on a grid. Once the energies of fragment poses are computed, thermodynamical quantities such as binding affinity, binding enthalpy, and binding entropy are computed by direct sum over fragment poses. Using the present invention, it is possible to navigate fragment configuration space and identify separate binding modes. The present invention can be used to scan an entire biological molecule and identify possible binding pockets, or it can be used for localized explorations limited to interesting areas of known binding pockets.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the pertinent art to make and use the invention.
FIGS. 57 is a table that illustrates example data for a set of test fragments.
The present invention provides methods, systems, and computer program products for identifying binding conformations of chemical fragments and biological molecules. As described in detail herein, in embodiments, this is accomplished by systematically sampling fragment poses that cover a region of interest and computing, for each fragment pose, a fragment-molecule interaction energy using interpolation over a grid.
In the detailed description of the invention that follows, references to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
As shown in
Referring to
In step 104, a plurality of potential field values are calculated as described below and in subsequent sections. Each potential field value corresponds to a selected potential point of the potential grid. The calculated potential field values are independent of the bodies of the chemical fragment.
In step 106, a set of poses is selected for the chemical fragment. The selected poses correspond to rotations of the chemical fragment about the centroid of the bodies that make up the chemical fragment.
In step 108, a translation grid is selected. This grid includes a plurality of translation points useful for positioning the chemical fragment relative to the biological molecule. In embodiments, the resolution of this grid is different than the resolution of the potential grid. In other embodiments, the resolution is substantially the same. The translation grid corresponds to a region of interest of the biological molecule, which can be the entire molecule or a portion thereof.
In step 110, a plurality of first interaction values are calculated. These values are for a first pose of the chemical fragment when the centroid of the bodies of the chemical fragment coincides with a first translation point of the translation grid. Each first interaction value corresponds to a measure of interaction between the biological molecule and a selected body of the chemical fragment. The first interaction values are calculated by multiplying a charge value of the selected body with a selected potential field value. In one embodiment, the selected potential field value is generated using trilinear interpolation of the potential field values corresponding to the eight corners of the potential grid cell containing each fragment body (e.g., atom or molecule).
In step 112, a second interaction value is calculated. This value is generated by summing the first interaction values calculated in step 110. This second interaction value corresponds to a measure of interaction between the biological molecule and all of the bodies of the chemical fragment.
In step 114, additional second interaction values are calculated. These additional second interaction values are calculated by repeating steps 110 and 112 for additional poses of the chemical fragment and for instances when the centroid of the bodies of the chemical fragment coincides with translation points of the translation grid other than the first translation point. In an embodiment, an algorithm is used to accomplish this, in which the outer loop is a loop over rotations because it is very fast to translate the fragment once its rotation is fixed.
In step 116, conformations associated with selected ones of the second values are identified as possible binding conformations of the chemical fragment and the biological molecule.
A graphical representation of how the steps of method 100 are implemented in an embodiment of the present invention is provided in
As illustrated in
Referring to
Various features and embodiments of the present invention will now be described in even greater detail with regard to
In embodiments of the present invention, poses for chemical fragment 206 are selected by systematic sampling. Systematic sampling or exploration of the fragment configuration space is facilitated by using a relatively small number of dimensions (e.g., six degrees of freedom), which describe the translations and rotations of fragment 206 relative to biological molecule 202 (e.g., a protein). In some embodiments, however, a number of torsional degrees of freedom also are used.
In an embodiment, chemical fragment or ligand translations and rotations are described using a reference pose. Additional ligand poses are obtained by translating the reference pose by a chosen translation vector t, and then rotating it around the fragment centroid using a rotation matrix R. As used herein, the expression fragment rotation refers to a rotation of the fragment that leaves its centroid fixed. The centroid position rc is defined as the average position of all the fragment bodies (e.g., atoms and/or molecules), without regard to mass. The centroid can be calculated using the following equation:
where the sum extends to all na fragment bodies.
1. Translational Sampling
The sampling of fragment translations is achieved in embodiments of the present invention by successively setting the translation vector t to points of a uniform three-dimensional rectangular grid consisting of the vectors:
tijk=iΔx{circumflex over (x)}+jΔyŷ+kΔz{circumflex over (z)}
where i, j, and k are integers, {circumflex over (x)}, ŷ, and {circumflex over (z)} are unit vectors in the coordinate directions, and Δx, Δy, and Δz are translational resolutions in the three coordinate directions. This expression can also be generalized to allow arbitrary independent unit vectors, which are not necessarily orthogonal. In an embodiment, the three translational resolutions are equal (Δx=Δy=Δz={tilde over (Δ)}T).
In embodiments where the three translational resolutions are equal, for any point in space, the distance to the closest grid point is never larger than ΔT=√{square root over (3)}{tilde over (Δ)}T/2. Thus, ΔT is the worst case translational resolution. A typical translational resolution Δ*T is defined as the distance from the closest grid point, averaged in a square sense over all points in space, and it can be shown that Δ*T=ΔT/2.
2. Rotational Sampling
Each of the translation vectors constructed in the manner described above is combined with a set of fragment rotations, which provide a good sampling of the fragment rotations. With regard to fragment rotations, however, there is no set of three rotational degrees of freedom which can be discretized separately to provide a uniform coverage of rotation space. Additionally, it is desirable to sample more densely rotations around a short axis of the fragment, because such rotations generate larger body (atomic) displacements than rotations around a long axis of the fragment.
In an embodiment, the process of selecting fragment rotations is started using a large set S0 consisting of NR randomly selected fragment rotations. The fragment rotations to be used in the sampling are then selected from S0 to form a set S1 (a subset of S0) of nR chosen fragment rotations. In an embodiment, the distance between two fragment rotations as the atomic root mean square (rms) displacement generated when the fragment is moved from the first rotation to the second one. Using this metric, the distance between two rotations does not simply depend on the angle between the two rotations, and it takes into account the fragment or ligand shape. In an embodiment, the goal is to construct S1 in such a way that for any possible fragment rotation there is at least one in S1 that is close enough, according to the metric.
In an embodiment, the distance between a fragment rotation and a set of fragment rotations is defined as the minimum distance between the given rotation and any of the rotations in the set. With this definition, when constructing S1, one wants to minimize ΔR, defined to be the maximum distance between any possible fragment rotation and S1, without making S1 too large. Because of this definition, ΔR represents the worst case rms atomic displacement generated when replacing any possible fragment rotation with the closest fragment rotation in S1. Thus, ΔR is the worst case rotational resolution. Similarly to the case of translations, one can also define a typical rotational resolution Δ*R as the distance between any possible fragment rotation and S1, averaged in a square sense over all possible fragment rotations. Because of this definition, most fragment rotations will have at least one rotation in S1 at an atomic rms distance of about Δ*R.
As an approximation to achieving a small ΔR, one may minimize the maximum distance between any rotation in S0 and S1. One way to do this is to use a simple clustering algorithm, as follows:
This procedure has two parameters: {tilde over (Δ)}R, which is a sort of target rotational resolution; and m, which is the number of rotational neighbors. It is possible for m to be zero, in which case condition 3b above is always satisfied. The procedure can fail if NR if is not sufficiently large and m>0.
3. Examples of Rotational Sampling
To illustrate how the above algorithm performs, it was run with various combinations of the parameters using indole as the test fragment. The results are reported in Table 1 (see
The time needed to select the rotations is mostly dependent on NR and increases approximately as NR2 For given values of {tilde over (Δ)}R and m, ΔR and nR approximately stabilize once NR is about an order of magnitude larger than nR. Thus, a good choice for NR is one that results in NR≈10 nR, since larger values make the algorithm slower without additional advantage.
In Table 5 (see
For the reasons noted above, these data are representative of results for large NR, although it is not necessarily the case in all cases that NR>10 nR, particularly at the higher resolutions. The last two columns of Table 5 contain values of α=nRΔ3R and α*=nRΔ*R3. These are reasonable figures of merit since the space of rotations is three dimensional and the number of rotations necessary to achieve a certain value of ΔR is therefore expected to be proportional to 1/Δ3R. For smaller values of α and α*, the nR rotations are used more efficiently. These data indicate that α and α* are largely insensitive to the choice of m, although they worsen slightly as m is increased. For large NR, it can be seen from Tables 1-4 (
Tables 1-4 also show that the choice m≠0 might be advantageous if it is important to minimize the time required to generate rotations, for a given ΔR. For this typical fragment size, the number of required rotations increases from tens for ΔR=2 Å to tens of thousands for ΔR=0.25 Å. This is consistent with the fact that an improvement of a factor of 8 in resolution should require an increase of a factor 83=512 in the required number of rotations.
The computation of the rotations used for sampling can become relatively expensive for high resolution (small ΔR). If a fragment is to be used repeatedly with several proteins, the calculation of the sampling rotations for that fragment could be done once and stored for all future usage.
As will be understood by persons skilled in the relevant art(s), algorithms other than the one described above may be used without deviating from the scope of the present invention.
As described above, after a fragment pose is selected, an interaction value (e.g., energy) to describe the interaction between the fragment and the biological molecule (e.g., protein) for that pose is calculated. The procedure described below is used, for example, whenever the interaction between the fragment and the protein is a sum of pair potential terms. In the description that follows, it is presented for an embodiment in which the interaction energy E is a sum of Amber Van der Waals energies plus electrostatic interaction using an Amber distance-dependent dielectric constant:
where index a runs over fragment atoms, index b runs over atoms of the protein, qa and qb are atomic charges, εab and σab are Van der Waals parameters for the atom pair (a,b), rab is the distance between atoms a and b, and k is the electrostatic constant. The 1/rab2 dependence of the electrostatic term is due to the usage of an Amber distance dependent dielectric constant. In the description that follows, ra and rb represent the position vectors of atoms a and b, respectively.
Without making any approximation, the interaction energy can be rewritten as:
where φ(r) and ψa(r) are potential scalar fields, independent of the positions of the fragment atoms:
The number of distinct ψa(r) fields equals the number of distinct atom types in the fragment.
The above expression for the interaction energy can be evaluated very rapidly if the required values of φ(r) and ψa(r) at the positions of the fragment atoms are available, since one does not have to sum over the atoms of the protein.
In an embodiment, values of φ(r) and ψa(r) are computed on a three dimensional rectangular grid with resolution ΔF. This grid is similar but distinct from the grid used to sample translations and described in the previous section. In particular, the resolutions for the two grids, ΔT and ΔF, don't have to be the same. In an embodiment, values of φ(r) and φa(r) at atomic positions are computed by trilinear interpolation of the values at the eight corners of the grid cell containing each fragment atom. This computation is very fast. For a twelve atom fragment, it runs at a rate of 1.3×105 per second under the benchmarking conditions described in the previous section, if values for φ(r) and ψa(r) are available. This corresponds to approximately 5×108 energy calculations per hour or 1010 per day. These times increase proportionally to the number of atoms in the ligand, but they are insensitive to the protein size. The protein size only affects the time needed to calculate values of φ(r) and ψa(r) at grid points. A more detailed discussion of performance factors is presented below.
1. Energy Interpolation Accuracy (1-Dimensional Analysis)
Of interest is how interpolation affects the accuracy of computed energy values. An answer to this question is provided in
Also plotted is the exact energy computed directly by summing over all of the protein atoms. The interpolation error is plotted as a function of position for two values of ΔF in
As shown in
2. Energy Interpolation Accuracy (2-Dimensional Analysis)
The interpolation error incurred with ΔF=0.25 Å is plotted in
The interpolation error increases rapidly when approaching the high energy walls due to Van der Waals repulsion, but these regions are statistically unimportant as their Boltzmann factor e−E/kT decreases rapidly as E increases. This is accounted for by estimating the average interpolation error using a statistical mechanics average in which the Boltzmann factor is used as a weight. Therefore, the average error δE over a set of fragment positions can be estimated as:
where the sums run over the set of fragment positions in question, Ei is the energy for the i-th fragment position computed by direct sum over all the protein atoms, and {tilde over (E)}i is the same energy computed using grid interpolation. For the two dimensional data shown above, this gives δE=0.37 kcal/mol for ΔF=0.25 Å and δE=1.94 kcal/mol for ΔF=0.5 Å. For the case of a three dimensional set of fragment positions around the same binding site, distributed in a cube of size 2 Å, very similar values are obtained, δE=0.39 kcal/mol for ΔF=0.25 Å and δE=1.93 kcal/mol for ΔF=0.5 Å. These values of δE are not much larger than the interpolation error near the center of the potential well (see
These values of δE are typical energy errors for calculations done using grid interpolation. As indicated by the figures noted above, most of the average error is systematic in nature. That is the interpolated energies are systematically higher than energies computed by direct sum. The systematic error does affect computed energies, but it does not affect the quality of the poses computed because it is equivalent to an irrelevant additive constant to the energy. Therefore, it is useful to get an estimate of the nonsystematic portion of the interpolation error, δE, since this is the only error component that affects energy difference between poses. This can be done using a statistical mechanical average:
σ2E={tilde over (E)}−E−δE)2={tilde over (E)}−E)2−δ2E,
where {tilde over (E)}−E)2 is given by the statistical mechanical average:
For the same two dimensional data used above, σE=0.15 kcal/mol for ΔF=0.25 Å and σE=0.65 kcal/mol for ΔF=0.5 Å. For the three dimensional case, σE=0.16 kcal/mol for ΔF=0.25 Å and σE=0.76 kcal/mol for ΔF=5 Å. Again, there is not a large difference between the values obtained using the two dimensional or the three dimensional point sets.
These data indicate that both the systematic and the non-systematic components of the error scale according to Δ2F. This is consistent with the fact that a trilinear interpolation scheme is used, in which the most important terms neglected are of second order with respect to grid size. To confirm this, δE and σE were computed as a function of ΔF for a number of values of ΔF.
The results are plotted in
3. Energy Interpolation Accuracy (Distortion Of Equipotential Surfaces)
The description above looks at the interpolation error as an energy error, for example, something which modifies the energy value computed at one point. An alternative approach consists of looking at how equipotential surfaces for the fragment are distorted as a result of the interpolation. If the interpolated energy at point r0 is {tilde over (E)}, the point r1, defined as the point with energy equal to {tilde over (E)} closest to r0, can be located. The distance δR between r0 and r1 is the amount by which the equipotential surface with energy equal to {tilde over (E)} was translated. This is referred to herein as the distortion generated by the interpolation process. Plots of δR for the same two dimensional data discussed above are shown in
From these figures, one can see that for ΔF=0.25 Å, δR is usually less than 0.05 Å and peaks at around 0.1 Å at the center of the binding pocket. For ΔF=0.5 Å, δR is usually less than 0.2 Å and peaks at about 0.3 Å near the center of the binding pocket.
4. Energy Interpolation Accuracy (Monte Carlo Runs)
This section presents the results of Monte Carlo calculations performed using the above noted fragment to illustrate how ΔF affects the energy and position of the lowest energy pose. In each run, 106 Monte Carlo steps were attempted at a temperature of 300 K. At every 104 attempted Monte Carlo steps, a local energy minimization was performed, without affecting the Monte Carlo run, and the pose that achieves the local minimum and its energy were saved. The pose with the lowest energy encountered during the run is taken as an approximation of the global energy minimum. Such a run was performed for several values of ΔF, and one run was performed in which the energy was computed exactly by direct sum over all protein atoms, without interpolating. Energy minimizations were done using 2×104 Monte Carlo steps at zero temperature because the minimization algorithms used were designed for objective functions with continuous derivatives. For each value of ΔF, the atomic rms displacement between the lowest energy pose found and the lowest energy pose found in the run which did not use interpolation (ΔF=0) were computed. The tabulated results are shown in Table 6 (
As can be seen, there is excellent positional agreement between the approximate and the exact global energy minimum at all values of ΔF below 0.5 Å. The energy error increases quadratically with ΔF, as would be expected, and is consistent with the results shown in
A series of Monte Carlo runs also was performed to compute binding enthalpies ΔH. Since pressure effects can be neglected, the enthalpy can be estimated as the Monte Carlo average of the interaction energy. The results are tabulated in Table 7 (
This error is estimated by comparing the computed enthalpy with the value obtained in a Monte Carlo run that did not use energy interpolation. By comparison with
Based on the results presented in this section, one can conclude that ΔF=0.5 Å is sufficient for quick qualitative scans with the purpose of locating binding pockets and that ΔF=0.25 Å can be used for more detailed studies of smaller regions of interest. A more global study of the effect of ΔF on the accuracy and performance of the systematic sampling procedure is described in the sections that follow.
This section describes how to combine the techniques presented above and how to use grid interpolation to compute interaction energies for all combinations of fragment translations and rotations. As described herein, in embodiments, values of the potential fields φ(r) and ψa(r) are computed on demand. In one embodiment, only values at grid points that are actually needed are computed. Additionally, it is noted here that the techniques described herein can be made more efficient by taking into account the fact that interesting poses will have the fragment close to the protein but without any spatial overlap. This is taken advantage of as described below.
A tri-dimensional array of pointers to grid data objects is defined, wherein each contains values of φ(r) and ψa(r) at a grid point. This array corresponds to all the possible grid points in the region of interest. In embodiments, this region is extended by a guard region of size equal to the fragment diameter. The pointers are initialized to zero to indicate that data for all grid points have not yet been computed but will be computed in the future, if needed.
Grid points that are too close to atoms of the protein or too far from them are ignored. The distance between a grid point and the protein is defined herein as the minimum distance between the grid point and any of the protein atoms. A distance range of interest (rmin, rmax) is selected and a value, for example “uninteresting”, is assigned to pointers corresponding to all grid points whose distance from the protein is not in this range. In an embodiment, rmin is on the order of 1 Å and rmax is on the order of 10 Å.
After initialization is completed, a main algorithm loop is started that iterates in turn over all selected fragment rotations and translations. In an embodiment, the loop over rotations is the outer loop because it is very fast to translate the fragment once its rotation is fixed.
For each pose, the interaction energy of the fragment with the protein is computed using the interpolation described in the previous section. Initially, zero value or uncomputed pointers to grid data are encountered. These data are computed, and the zero value pointer is replaced with a pointer to the actual data. Whenever a grid point pointer marked as “uninteresting” is encountered, the energy computation for that pose is immediately aborted because the fragment is either too close to the protein or too far from it, and the pose would not be energetically relevant.
In an embodiment, values of ψa(r) are monitored at the time each grid point is computed. If any value of ψa(r) is larger than a pre-specified threshold ψcut, the grid point is also marked as “uninteresting”, even though it does lie in the distance range (rmin, rmax). In an embodiment, ψcut is on the order of 100 kcal/mol. This reduces the number of poses that have to be computed without skipping potentially relevant poses.
If the energy calculation for a pose is completed without encountering any uninteresting grid points, the pose and the computed energy is stored in a list of computed energy poses, which constitutes the raw output of a run. Poses whose interaction energy are larger than an energy cutoff value Ecut are not stored. There usually is a wide range of values of Ecut which result in substantial reductions of the number of poses Np to be stored without resulting in potentially relevant poses being discarded. It is noted here, however, the value of Ecut does not affect performance.
As described above, the output of a run consists of a list of poses and their corresponding energies Ei. If the parameters for a run are properly selected, all fragment poses not included in the list are such that their energy is high enough to make their Boltzmann weight e−Ei/kT negligible. In addition, because of the procedure used to construct translations and rotations, these provide an essentially uniform coverage of the fragment configuration space. Therefore it is permissible to replace configuration integrals which appear in statistical mechanics equations with sums over the computed poses. For example, we can compute the partition sum Z and the Helmoltz free energy F by a simple sum over poses:
The free energy difference with respect to the unbound fragment at the standard chemical reference state with a concentration of 1 mol/L can be computed by observing that in that case the fragment has an accessible volume V0=1660 Å3, and that in that state all accessible poses have zero interaction energy. The number of such poses would be nRV0/(ΔxΔyΔz). This gives the partition function for the reference state:
Z0=nRV0/(ΔxΔyΔz),
together with the corresponding Helmoltz free energy:
F0=−kTlnZ0.
The Helmoltz free energy of binding is therefore:
ΔF=F−F0.
Since changes in pressure/volume during binding are negligible, this also gives the Gibbs free energy of binding:
ΔG=ΔF.
With similar reasoning one can compute the binding enthalpy:
The binding entropy can be computed as:
ΔS=(ΔH−ΔG)/T.
In embodiments, the output of a systematic sampling run consists of a series of poses and their energies. This information can be used to analyze in detail the configuration space of the fragment. If low energy poses are organized in well separated clusters, each cluster can be considered a distinct binding mode of the fragment to the protein. Since the clusters are well separated, one can assume that the fragment will almost never jump from one cluster to another. Therefore, it is desirable to compute thermodynamical quantities ΔHb and ΔGb separately for each binding mode. These quantities are computed as described above, with summing over the poses that belong to the cluster corresponding to the binding mode of interest. This procedure corresponds to conceptually increasing the energy barrier between clusters to infinity, at which point each separate cluster can be treated as a separate thermodynamical ensemble.
It is noted here that the definition of binding mode above is not the only definition that can be used. For example, in one alternative definition, a binding mode is characterized by specific chemical contacts, but the system can switch between binding modes as part of its thermal motion. With this definition, it is not possible to assign separate thermodynamical quantities to each binding mode. This alternative definition of binding mode is not used in the description below.
The detection of clusters of poses corresponding to binding modes is implemented as follows. First, start with the pose of minimum energy and label it as belonging to a new cluster. Next, find all neighboring poses with energy below a threshold Eb. These poses are also labeled as belonging to the cluster. Two poses are considered neighboring if their atomic rms separation is less than a preset value δb which is a parameter of the procedure. As used herein, δb is the binding mode separation. In an embodiment, the procedure is continued iteratively, and any neighboring pose of a pose already in the cluster is iteratively added to the cluster if its energy is less than Eb. The iteration is continued until no more poses can be added to the cluster. Finally, any high energy poses which are neighbors of any poses in the cluster are also added to the cluster without regard to their energy. At this point, the first binding mode is considered to be described by the cluster just constructed.
Additional binding modes are found by repeating the same procedure, but without considering poses that have already been assigned to a binding mode. The process is stopped when there are no poses left, or when the value of ΔG computed using all poses left is too high to make any additional binding modes interesting.
The value of the energy threshold Eb is computed for each binding mode as the energy cutoff which gives a small predetermined error (for example, 0.01 kcal/mol) on the ΔG for all poses left at each stage. Typically, Eb turns out to be several kcal/mol higher than the energy of the lowest energy pose left, which is also the minimum energy for the current binding mode.
If a binding mode is very tight relative to the sampling resolution used, the systematic sampling procedure described herein may not be able to identify it. Stated differently, there is a critical entropy below which the binding mode may not or cannot be detected. The mode with the lowest entropy which can be detected consists of a single pose being occupied, with all remaining poses being free. On the other hand, the reference state consists of nRV0/(ΔxΔyΔz)=nRV0/{tilde over (Δ)}3T poses. Therefore, the entropy difference between the single pose mode and the reference state is given by TΔS=kTln({tilde over (Δ)}3T/nRV0). The systematic sampling procedure does not detect modes with entropy lower than this critical value. In embodiments, {tilde over (Δ)}T=0.2 Å and nR=104, and the critical value TΔS=−12.8 kcal/mol at 300 K.
Several differing runs have been performed to illustrate the accuracy and performance of the present invention using dichlorobenzene binding at a particular pocket in the allosteric site of p38. This is the same case used above to explore interpolation error. In this section, a description of how results and performance of the procedure are affected by the various parameters is presented.
1. Effect Of Translational And Rotational Sampling Resolution
In a first series of runs, referred to herein as Series1, a box of 12 Å×12 Å×12 Å was used for translational sampling. The box was centered a few Angstroms away from the ideal position of the fragment in the binding pocket. Energy interpolation was performed using a grid resolution ΔF=0.25 Å, using electrostatics in vacuum and charges computed using AM1-BCC. (See, for example, Jakalian et al., Comput. Chem. 2000, 21, 132-146, and Morton et al., Biochemistry 1995, 34, 8564-8575, which are incorporated herein by reference.) For the translational sampling grid, a set of grid spacing values {tilde over (Δ)}T varying between 0.25 Å and 1 Å were used, resulting in typical translation resolutions Δ*T between 0.125 Å and 0.5 Å. To generate fragment rotations, m=1 rotational neighbors were used, and for each run a value of {tilde over (Δ)}R was chosen which resulted in an estimated typical rotational resolution {tilde over (Δ)}*R similar to the value of Δ*T used for that run. The initial number of rotations NR was chosen in such a way that the number of selected rotations nR was about an order of magnitude smaller than NR. However, this was not the case for some of the higher accuracy runs, in which case NR was capped at a maximum practical value of 105. The remaining systematic sampling parameters were set at rmin=1 Å, rmax=10 Å, ψcut=100 kcal/mol, and Ecut=0 kcal/mol. The results of this series of runs are presented in Table 9 (
Table 9 (
To illustrate the effect of Ecut,
Referring to the Series1 runs, one can see from Table 9 (
2. Locating The Energy Minimum
In column 2 of Table 10 (
In column 3 of Table 10, the atomic rms displacements between the lowest energy pose found in each run and the pose corresponding to the global energy minimum computed exactly without interpolation are reported.
Because of the symmetry of the fragment, the lowest of the two rms values obtained by flipping the two identical halves of the fragment is reported in each case. Although all the rms values are less than 2 Å, their quality is not the same probably due to a shallow potential well and/or to the presence of secondary energy minima. This suggests it may be difficult to accurately locate global minima using sampling only at low resolution. As an alternative to higher resolution systematic sampling runs, an easy and inexpensive procedure to locate the global minimum consists of performing energy minimizations for a set of the lowest energy poses found during the systematic sampling run. This can be overcome by energy minimizing the lowest 100 poses found during each systematic sampling run, which takes a negligible amount of time compared to the systematic sampling run. Among all the 100 energy minimized poses, the one with the lowest energy (Emin,100) was selected as the best estimate of the exact energy minimum. As used herein, rms100 is the atomic rms displacement between this pose and the ideal pose (the global energy minimum found by the Monte Carlo runs). In Table 10 (
As can be seen from the results described herein, the final inexpensive energy minimization of a small set of low energy poses provides an accurate computation of the global energy minimum, and of the corresponding pose, with atomic rms displacements consistently below 0.3 Å beginning at a coarse sampling resolution (Run4). The minimized energy values are actually lower than the value −24.1 kcal/mol computed via Monte Carlo runs combined with energy minimization and reported in Table 6 (
As an illustration, the energy of the poses computed during sampling versus the corresponding atomic displacement rms relative to the global energy minimum computed exactly are plotted in
3. Performance Considerations
In Table 10 (
4. Effect Of Changing The Ratio Of Rotational To Translational Sampling Resolution
In the runs of Series1, a value of {tilde over (Δ)}R was chosen which resulted in an estimated typical rotational resolution Δ*R similar to the value of Δ*T used for that run, and it turned out that such value of {tilde over (Δ)}R was approximately exactly equal to {tilde over (Δ)}T. In order to explore the relative importance of the translation and rotational sampling resolutions, five additional series of runs were preformed, Series2 through Series6. In each series the ratio rΔ={tilde over (Δ)}R/{tilde over (Δ)}T was kept constant, as summarized in Table 8 (see
As a first measure of convergence for each run, the atomic rms displacement from the global minimum, rms100, was used (computed as described above). This rms displacement is plotted in
5. Effect Of Changing The Resolution For Energy Interpolation
All of the systematic sampling runs presented so far have used the choice ΔF=0.25 Å, which as shown in a previous section provides good accuracy. To confirm this finding, a series of runs (Series7) were performed using the same input parameters of Run36 of Series6, but in which ΔF was varied in the range 0.1 Å to 1 Å. This series also includes a run, Run48, with ΔF=0. This was performed by turning off energy interpolation and computing the interaction energy for each pose by direct sum over all protein atoms. The results of the runs of Series7 are summarized in Table 21 (
6. Full Surface Scans
In the final two series of runs described in this section, the size of the translational sampling box was enlarged to cover the entire surface of the protein. In both series, a value of rΔ={tilde over (Δ)}R/{tilde over (Δ)}T=1.5 was chosen. In Series8, see Table 23 (
In this section, a detailed analysis of systematic sampling, when applied to dichlorobenzene binding at a particular pocket in the allosteric site of p38, was presented. The results show that systematic sampling can achieve practically useful accuracies in reasonable amounts of computing time. In the sections that follow, the results of applying systematic sampling are present for a variety of fragments on T4 Lysozyme.
In this section, the results of applying systematic sampling according to the present invention on T4 Lysozyme are described. This is a useful case for demonstrating the present invention because experimental thermodynamical data (ΔH and ΔG) are available (Morton, A., et al., Biochemistry 34:8564-8575 (1995)), together with crystal structures (Morton, A. and Matthews, B. W., Biochemistry 34:8576-8588 (1995)), for 16 small, fragment-like molecules with little or no internal flexibility. Because the T4 Lysozyme binding pocket is very tight and completely buried, this case is not necessarily representative of situations of interest in drug development. However, for the same reasons, it can be considered a worst-case scenario for fragment binding computations in terms of difficulty of simulation.
1. Available Experimental Results
Table 27, in
2. Systematic Sampling Runs
Systematic sampling runs were performed on twelve fragments using a 8 Å×8 Å×8 Å cubic box, centered at the cavity where binding is known to occur. Based on the results of the previous section, the following were chosen for the remaining parameters: ΔF=0.25 Å, m=1, NR=48000, rmin=1 Å, rmax=15 Å, ψcut=100 kcal/mol, and Ecut=0. In a first series of runs (Series10) to investigate the sensitivity to sampling resolution, three runs were performed for each fragment with {tilde over (Δ)}T=0.15, 0.2 and 0.25 Å respectively. Again, based on the teachings of the previous section, a ratio rΔ=2 was used, which means that the three runs had {tilde over (Δ)}R=0.3, 0.4, and 0.5 Å respectively. The parameters of the physical model were electrostatics in vacuum, Amber Van der Waals parameters, and charges generated using GAUSSIAN/CHELPG.
The results of the runs of Series10 are summarized in Table 28 (see
3. Correlation Of Computed And Experimental Free Energies
In Table 29 (see
Only four of the molecules in the test set have no rotatable bonds and, therefore, can be truly considered rigid fragments. For all the remaining fragments, the assumption of rigidity neglects entropy changes associated with restriction of the torsional flexibility due to binding. Least square fits and standard deviations were recomputed using only these four fragments, and the results are tabulated in Table 30 (
4. Effect Of Changing The Resolution For Energy Interpolation
As an additional test of the energy interpolation accuracy, a series of runs (Series11) was performed in which all twelve fragments were run again with {tilde over (Δ)}T=0.2 Å and {tilde over (Δ)}R=0.4 Å, but this time using ΔF=0.1 Å instead of ΔF=0.25 Å. The results are summarized in Table 31 (see
5. Alternative Electrostatic Models
An investigation was performed to determine the dependence of the results presented herein on the electrostatic model used. In the runs of Series12, each fragment was rerun with {tilde over (Δ)}T=0.2 Å, {tilde over (Δ)}R=0.4 Å, and ΔF=0.25 Å with all combinations of two electrostatic models (vacuum and Amber distance dependent dielectric constant), three values of the dielectric constant (1, 2, and 4), and two choices of methods to compute charges (GAUSSLAN/CHELPG or AM1-BCC). The computed values of ΔG (which in all cases include a solvation correction) are tabulated in Table 32 (see
6. Binding Mode Analysis
Crystal structures were available for 7 of the fragments. For these fragments, a binding analysis of the results was performed with {tilde over (Δ)}T=0.15 Å, {tilde over (Δ)}R=0.3 Å, and ΔF=0.25 Å. For the binding analyses, a binding mode separation δb=1 Å was used. The results of the binding mode analysis are summarized in Table 33 (see
Since the calculations used a common protein structure (186pn-neutral.pdb) for all the fragments, there is some ambiguity in the proper way to compute rms displacements relative to the crystal structure. It was decided to fit, for each experimental crystal structure, a set of protein atoms near the cavity to the corresponding atoms in the protein structure used in the runs. This fit defines a fragment position which was used as a reference to compute the rms displacements shown in Table 33 (
The data in Table 33 (
In embodiments, the invention is directed toward a system, computer, and/or computer program product. Computer program products are intended to be executed on one or more computer systems capable of carrying out the functionality described herein. Embodiments of the present invention may be implemented using hardware, firmware, software, or a combination thereof, referred to herein as computer logic, and may be implemented in a stand-alone computer system or other processing system, or in multiple computer systems or other processing systems networked together.
The computer system 6800 includes one or more processors, such as processor 6804, and one or more user interfaces 6805 such as, for example, a display, a printer, a keyboard, a mouse, et cetera. Processor 6804 and user interface 6805 are connected to a communication bus 6806. Various embodiments are described in terms of this example computer system. After reading this description, it will become apparent to persons skilled in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures.
Computer system 6800 also includes a main memory 6808, preferably random access memory (RAM), and may also include a secondary memory 6810. The secondary memory 6810 may include, for example, a hard disk drive 6812 and/or a removable storage drive 6814, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 6814 reads from and/or writes to a removable storage unit 6818 in a well-known manner. Removable storage unit 6818, represents a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 6814. As will be appreciated, the removable storage unit 6818 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 6810 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 6800. Such means may include, for example, a removable storage unit 6822 and an interface 6820. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 6822 and interfaces 6820 which allow software and data to be transferred from the removable storage unit 6822 to computer system 6800.
Computer system 6800 may also include a communications interface 6824. Communications interface 6824 allows software and data to be transferred between computer system 6800 and external devices. Examples of communications interface 6824 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 6824 are in the form of signals 6828 which may be electronic, electromagnetic, optical or other signals capable of being received by communications interface 6824.
These signals 6828 are provided to communications interface 6824 via a communications path (i.e., channel) 6826. This channel 6826 carries signals 6828 and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage drive 6814, a hard disk installed in hard disk drive 6812, and signals 6828. These computer program products are means for providing software to computer system 6800.
Computer programs (also called computer control logic) are stored in main memory 6808 and/or secondary memory 6810. Computer programs may also be received via communications interface 6824. Such computer programs, when executed, enable the computer system 6800 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 6804 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 6800.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 6800 using removable storage drive 6814, hard drive 6812 or communications interface 6824. The control logic (software), when executed by the processor 6804, causes the processor 6804 to perform the functions of the invention as described herein. The functions can be performed in any computationally-feasible order that does not substantially alter the ultimate result. For example, in some implementations the order of certain steps is not important, so long as the steps are executed and the result is the same as if they were executed in the order presented herein.
In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
It has been shown herein, both theoretically and computationally, that systematic sampling according to the present invention is capable of significantly reducing the critical entropy limitations incurred with Monte Carlo runs, while at the same time being fast and computationally robust.
Thus, using the present invention, it is possible to substantially improve the chemical richness of fragment poses available for drug design.
While the foregoing is a complete description of exemplary embodiments of the invention, it should be evident that various modifications, alternatives and equivalents may be made and used. Accordingly, the above description should not be taken as limiting the scope of the invention which is defined by the metes and bounds of the appended claims. It will be understood that the invention includes all usable combinations of the appended claims.