The present invention relates to an apparatus for in silico screening, and a method of in silico screening.
There have been hitherto compounds commercially available from reagent supplier companies or the like, such as the compounds falling under the category of pharmaceutical products, or reagent compounds. Furthermore, like those macromolecules confirmed by various tests based mainly on mass spectrometry as a macromolecule that interacts with compounds, or those macromolecules publicly recognized through the documents included in the journals represented by, for example, Nature or Science, there are target macromolecules which interact with compounds such as invented pharmaceutical products, and thereby bring about healing of the state of illness or the state of disease of animals and plants, alleviation of symptoms, maintenance of the status quo and the like, such as drug target proteins, drug target nucleic acids, drug target saccharides, or drug target lipids.
In the occasion of performing low molecular compound docking to target macromolecules and in silico screening, the conventional procedure has been such that each compound in the compound database in which an enormous number of compounds such as the pharmaceutical candidate compounds as described above are stored, are subjected to docking interaction with, for example, a target macromolecular protein which is mainly composed of a protein, to thereby determine the conformations by which compounds amounting to several hundred thousand individuals that exist in reality, directly interact with target proteins, and thus the interaction energy or score values corresponding thereto are obtained. The procedure has also been such that the relevant score values are lined up using stability as an index, in order of high to low stability, and the order of the interaction between a compound and a target protein to drug is determined.
In conventional methods such as Dock by Kuntz, et al. [reference literature (Ewing, et al. J. Comput. Aided Mol. Des. 2001; 15(5): 411-428)], AutoDock by Goodsell et al. [reference literature (Goodsell, et al. J. Mol. Recognit. 1996; 9: 1-5)], GOLD by Gareth, et al. [reference literature (Jones et al. J. Mol. Biol. 1997; 267: 724-748)], FlexX by Rarey, et al., Fragment Potential by Nicolas, et al., calculation of the score value has been carried out by the respective methods, using the grid information of the environment for ligand binding of a target protein as the target macromolecule, or the multipoint information of compounds, which emphasizes the vector between a compound and a target macromolecule.
That is, even though there is some sort of ingenuity such as the biological environment of the target protein, the processes has been carried out such that the interaction energy and the like are calculated from the formulas for classical physical interatomic potential between the atoms of a compound and the atoms constituting a target macromolecular protein, based on the information such as the grid information or the multipoint information, and the order related to the conformation of the compound or the intensity of binding in the interaction of the compound is determined as score values. Furthermore, to determine the order of interaction, devises for determining the order regarding the conformations of various compounds that involved interaction, using techniques such as be clustering, have been implemented.
However, since the conventional methods of in silico screening focus on predicting protein-ligand complexes with high accuracy, there has been a problem that such prediction is not consistent with the results of directly selecting a large number of compounds that get a hit.
Furthermore, since the conventional methods of in silico screening make nonempirical predictions using the potential functions of classical physics, there has been a problem that screening cannot be carried out with high prediction efficiency while taking the information of biochemical experiments or the like into consideration.
The present invention has been made under such circumstances, and it is an object of the invention to provide an apparatus for in silico screening and a method of in silico screening, which can predict the binding between a protein and a compound with high accuracy while being able to select many compounds that gets a hit, and can increase the prediction efficiency.
In view of the fact that an enormous amount of information on 3-dimensional coordinates representing the interactions between a compound and a target macromolecule, which are obtained by experiments such as X-ray analysis, NMR, electron beam analysis and high-resolution electron microscope photographs, are registered on a publicly accessible database, or of the recent enhancement of the performance of computers and the progress in bioinformatics, the inventors of the present invention conceived an idea that instead of performing the general methods of in silico screening of classical physics in a traditional manner, it is possible to perform semiempirical in silico screening of compounds based on human intelligence, using the information of bioinformatics such as the collectively overlapping state of various compounds that are bound to target macromolecular proteins.
The present invention has been completed as a result of the present invention devotedly investigating by the inventors based on the idea described above. An apparatus for in silico screening for performing screening of candidate compounds that bind to a target protein includes a storage unit, and a control unit, wherein the storage unit includes a compound database produced by extracting a chemical descriptor that includes the atom type and the interatomic bonding rules as the fingerprint of a compound related to a plurality of atoms in the compound, for each of the candidate compounds, and the control unit includes a fingerprint of compound producing unit that extracts the fingerprint of compound from a binding compound known to bind to a family protein, which has a 3-dimensional structure that is identical or similar to that of the target protein, along with 3-dimensional coordinates that have been converted into the coordinate system of the target protein, to thereby produce a fingerprint set of binding compound, and an optimizing unit that computes, for the candidate compounds stored in the compound database, the 3-dimensional structures of the candidate compounds with respect to the target protein, so that the interaction scores that are based on the root-mean-square deviations of each unit of fingerprint of compound that have been calculated using the 3-dimensional coordinates of the fingerprint set of binding compound as a basic, are optimized.
That is, according to the present invention, the binding between a protein and a compound can be predicted with high accuracy, while a large number of compounds that get a hit can be selected. Also, semiempirical screening can be performed while taking the information of biochemical experiments and the like into consideration, and the prediction efficiency can be increased.
As discussed above, the present invention is different from conventional techniques, from the viewpoint of making the bioinformatics technology which uses a fingerprint set of 3-dimensional compound, to exhibit a performance equivalent to the docking of a low molecular compound and a macromolecular protein, which uses the techniques of classical physical energy. Particularly, when it is considered that technologies such as X-ray analysis, NMR, electron beam analysis and high-resolution electron microscopic analysis are making remarkable progress, it is predicted that the number of molecules of the compounds binding to target macromolecular proteins would enormously increase, and therefore, the present invention exhibits high effects.
The apparatus for in silico screening according to one aspect of the present invention, is connected to a protein database apparatus that stores the respective 3-dimensional structure and amino acid sequence of the proteins bound to a compound, wherein the control unit further includes a homology searching unit that searches for the family protein and the binding compound from the protein database system, based on the homology of the target protein with the amino acid sequence, and the fingerprint of compound producing unit extracts the fingerprint of compound from the binding compound to bind to the family protein searched by the homology searching unit, along with 3-dimensional coordinates that have been converted into the coordinate system of the target protein to thereby produce the fingerprint set of binding compound.
Here, as a specific exemplary embodiment of the present invention, when a family macromolecule set is extracted under the conditions used for extracting a collective conformation by which various low molecular compounds bind to a family macromolecule set that is similar in the 3-dimensional structure to a target protein, among target macromolecules, the detection is performed through homology search based on PSI-Blast or the like, for the sequence of the target protein as the query sequence. According to the present invention, when a protein-ligand complex that has been searched as coming under the detected proteins, contains a low molecular ligand, the protein-ligand complex is superimposed to the target protein using CE (an operation of superimposing the conformations of proteins not in awareness of the kind of atoms) or the like. According to the present invention, when the Z-score which represents similarity in the structure has a predetermined value (for example, equal to or more than 3.7), the ligand bound to the searched homologous protein is converted from the coordinate system of the homologous protein into the coordinate system of the target protein, along with the ligand coordinates, and only the ligand can be extracted.
Here, the CE performs the operation of superimposing the conformations of proteins not in awareness of the kind of atoms, but a program having the same functions can also be used in substitution. According to the present invention, when only those sequences having high homology are obtained by the homology search based on the PSI-Blast or the like, for the sequence of the target protein as the query sequence, a program that performs an operation of superimposing the conformations of proteins in awareness of the kind of atoms, may also be used. According to the present invention, the homology search is not limited to the PSI-Blast, but any homology search program may be applied as long as it is a software program capable of performing homology search using a sequence as a query, and performing an evaluation of the similarity of sequence in a quantitative manner.
The present invention is the apparatus for in silico screening according to another aspect of the present invention, wherein the fingerprint of compound producing unit converts the 3-dimensional coordinates of the binding compound that binds to the family protein, into the coordinate system of the target protein, through the superimposition of conformations of the family protein and the target protein, and extracts the fingerprints of compound along with the converted 3-dimensional coordinates of compound, to thereby produce the fingerprint set of binding compound.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the fingerprint of compound producing unit further includes a fingerprint of new compound adding unit that performs superimposition of conformations by referring to another compound that is different from the binding compound, extracts a fingerprint of compound that straddles between the atoms of the binding compound and the atoms of the other compound, and adds the fingerprint of compound to the fingerprint set of binding compound.
In a specific example according to the invention, a specific example of the fingerprint set of binding compound may be constituted of “CElib” (FP (fingerprint) set extracted from collected ligands in the binding site), which is a database of various low molecular compounds bound to a family macromolecular protein set that is similar in the 3-dimensional structure to a target protein among target macromolecules. This CElib includes the coordinates and Sybyl atom-type in the coordinate system of target proteins, and the information on bonding rules such as in a single bond, a double bond, or bonding in aromatic rings. Here, according to the present invention, an arbitrary FP (fingerprint: refers to the “fingerprint of compound”; hereinafter, the same applies) may be added to the CElib according to the necessity of the purpose of searching for low molecular compounds in regard to target proteins.
That is, according to the present invention, instead of extracting the FP from the various low molecular compounds collectively bound to a family macromolecule set that is similar in 3-dimensional structure to an existing target protein, the kind of atom is interchanged within various low molecular compounds, while maintaining the similarity between the FP and generally existing ordinary compound molecules. According to the present invention, the interaction energy with a target protein is calculated using a program that can evaluate the stability, such as “Circle,” and thereby a “modified FP,” which is slightly different in the structure and performs the interaction more stably, is obtained. According to the present invention, the modified FP which is stable against a target protein in terms of local energy is used, and is adopted as a counterpart FP in the superimposition of FP, such as in the instance of the FPs obtained from the collectively bound various low molecular compounds that are obtained as a result of the operation of superimposition of conformations between proteins, which have been considered and used as new FPs as described for the previously discussed inventions.
According to the present invention, in regard to the docking between a protein and a ligand, a ligand conformation which uses bioinformatics called a fingerprint set of compound including 3-dimensional coordinates, is obtained instead of the conventionally used physicochemical interaction functions. According to the present invention, instead of extracting the FP from the various low molecular compounds collectively bound to a family macromolecular protein set that is similar in 3-dimensional structure to an existing target protein, a fingerprint set of plural compound-bound 3-dimensional compound resembling the FP of generally existing conventional molecules is created by referring to compounds having different molecular structures, among various low molecular compounds. According to the present invention, the created fingerprint set of compound adopts the counterpart FP for the superimposition of FP, such as in the instance of the FP obtained from the collectively bound various low molecular compounds that are obtained as a result of the operation of superimposition of conformations between proteins, which have been considered and used as new FPs as described for the previously discussed inventions.
That is, according to the present invention, the various low molecular compounds collectively bound to a family macromolecule set are completely separated, so that the FPs of the various low molecular compounds that have been parted discretely are used as the basic of docking, instead of the calculation of docking by physical formulas in conventional cases. The present invention has been created as a result of careful consideration of the fact that the entity of the conformations of the various low molecular compounds collectively bound to a family macromolecular protein set that is similar in 3-dimensional structure to an existing protein, is close to the most stable conformation that has interacted with the family protein of the target protein, and thus has superior effects unlike the conventional techniques. Thus, the present invention is useful.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the fingerprint of compound producing unit further includes a fingerprint of new compound adding unit that, in regard to the compound that is similar to the binding compound on the basis of the Tanimoto coefficient, interchanges the kind of atom between the atoms of the binding compound and them of the compound, calculates the interaction energy with respect to the target protein to thereby produce a fingerprint of compound that is more stable in terms of local energy than the fingerprint of compound from the binding compound, and adding the produced fingerprint of compound to the fingerprint set of binding compound.
In a specific example according to the invention, the present invention uses an interaction calculating program such as the Circle program, in regard to the various low molecular compounds bound to the target family macromolecular protein set from the CElib, so that the interaction with a ligand would be stabilized with respect to the complexes of each family macromolecular protein and ligand. According to the present invention, the kind of atom or the kind of bonding in a fingerprint (fp) unit, that is, a chemical descriptor unit, is ameliorated or modified, and this is used as a new fingerprint (fp) unit, that is, a new chemical descriptor unit, to adopt the unit as the counterpart FP in the superimposition of FP, such as in the instance of the new unit being used as a new FP as described for the previously discussed inventions.
Furthermore, the FP of the CElib, which is a database of various low molecular compounds bound to a family macromolecular protein set that is similar in 3-dimensional structure to a target protein among target macromolecules, contributes largely to the determination of docking score. Thus, according to the present invention, when the docking structure of an ideal low molecular ligand that binds to a target macromolecular protein, has been completely analyzed experimentally in the present invention, if various substituted groups are added so as to improve the interaction energy by using a low molecular ligand that is ideal to the binding, as a lead compound, or if an arbitrary low molecular ligand is found, whose Tanimoto coefficient, which is a quantifying function of the fingerprint of compound, is very similar to that of an ideal low molecular ligand, that is, close to 1, the FP region is limited to a region (for example, 4 or 5 Angstroms) surrounding with an ideal low molecular ligand that has been completely analyzed experimentally. Thereby, according to the present invention, the docking structure and the score of these compounds having a similar chemical structure, that is, having a very similar Tanimoto coefficient, can be easily calculated. This corresponds to the lead optimization of a binding compound or a design de novo of a compound, and has high effects, unlike the conventional techniques, for the combination with the role of the FP in the inventions described above, and is useful.
Conventionally, when a functional group such as a benzene ring, which is a partial moiety of various low molecular compounds, is bound to a target protein to thereby obtain a physically stable partial structure, and these results are used to calculate the interaction between a target protein and various low molecular compounds containing a large extent of intramolecular free rotation, reducing the occurrence of the docking conformation has been generally implemented. According to the present invention, modified FPs are created by calculating the interaction energy with the target protein using a program that can evaluate the stability, such as the “Circle,” which is a technique of bioinformatics. In this regard, publications such as literature articles are not found, and when the superimposition of FP is adopted as the basic for the calculation of docking, there is no conventional technique which reports the usage of a modified FP as the basic for the calculation of the docking, as in the case of the present invention. The present invention has excellent effects, unlike the conventional techniques, and thus is useful.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the binding compound is a compound that is predicted by a known docking algorithm to have a stable conformation with respect to the target protein.
In a specific example according to the invention, the present invention adopts a first-principle approach (Ab-initio Approach), which uses a physical potential function such as hydrogen bond, hydrophobic interaction, or electrostatic interaction, that are conventionally implemented in general. For example, the present invention adds an FP (fingerprint) extracted from the 3-dimensional coordinates of a low molecular compound that is predicted such that a stable conformation has a high score, by the docking calculation using an existing docking software such as DOCK, AutoDock or GOLD, such as that a fraction that can predict the correct structure with an rmsd of 2.0 or less has been verified by a blind test concealing the correct structure.
According to the present invention, the conformation obtained by scoring of the interaction between a target protein and various low molecular compounds, may be used as the initial conformation for existing docking software programs such as DOCK, AutoDock and GOLD. As a result, the initial conformation obtained for the present invention can be conveniently obtained, and also, since the accuracy of reproduction of experiments is high, useful results are obtained from a combination with other software programs.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the optimizing unit further includes an interaction score calculating unit that calculates the interaction score, based on a function that takes into consideration of not only the root-mean-square deviation for a unit of fingerprint of compound but also the collision state of the candidate compound with the target protein, the existential rate of the candidate compound in the region of interaction of the target protein, and the fraction of direct interaction of the candidate compound with the target protein.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the optimizing unit optimizes the interaction score by determining the interaction score based on the Metropolis method, and modifying, increasing or decreasing the basal fingerprint of compound from the candidate compound according to the results of determination.
In a specific example according to the present invention, if the score of this round is larger than the score of the previous round, the Metropolis decision of the present invention accepts the structure of the candidate ligand. If the score is smaller, the adopting probability, Paccept, is calculated, and whether the structure of a candidate ligand should be rejected or accepted may be determined based on the Paccept.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the optimizing unit further includes a structure transforming unit that repeatedly changes the conformation of the candidate compound in the process of optimizing the interaction score, and repeatedly translates or rotates the candidate compound as a rigid body, for each of the conformations of the candidate compound based on a simulated annealing method, and the optimizing unit calculates the interaction score of the candidate compound for each of the conformations translated or rotated by the conformation transforming unit.
In another specific example according to the present invention, for FPs including the several information on the 3-dimensional coordinates of various low molecular compounds bound to a family macromolecule set having a similar 3-dimensional structure of the target protein, mathematical calculation is repeatedly performed by Monte Carlo type simulated annealing, so that the score would be the highest, in order to look for a conformation optimal for the interaction by docking low molecular compounds from a library of virtual compounds to the target protein.
More specifically, first, the present invention changes the conformation by randomly varying the rotatable dihedral angle of a candidate ligand, and uses the coordinates of the candidate ligand with the changed conformation. The present invention randomly selects ten FPs from an FP band derived from the binding compound set bound to a family protein of the target protein. The present invention then randomly selects a candidate ligand from the selected FP band, and an FP atomic coordinate set from the LIBRARY LIGANDS. Then, the present invention uses this state as the fingerprint (FP) alignment, and performs least square fitting for a correspondence relationship thereof. The present invention calculates the interaction score using the root-mean-square deviation (rmsd) of the superimposition at that time, and the atomic coordinates of a candidate ligand after the superimposition. Then, from the second round, the present invention stores the state of the previous round, and performs translation and rotation while maintaining the conformation of the ligand atoms, that is, rigid body translation and rotation. The present invention performs an increase or decrease of one FP, and modification and addition of the correspondence relationship of the atomic coordinate set. The present invention performs this step, for example, 10,000 times. Here, the temperature of the simulated annealing may be decreased, starting from 30 K and down to 0.07 K. As such, the present invention calculates the maximum value of the score of one conformation, compares the value for 1000 conformations generated initially, and predicts and outputs the structure with the maximum score as the protein-ligand complex structure. At this time, the process of ranking the 1000 conformations in terms of the score may be devised in regard to the time for calculation or the search for the maximum value, by using a genetic algorithm or the like.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the optimizing unit calculates the interaction score based on the following mathematical formula (1):
wherein the FPAScore represents the interaction score, the F(aligned_fp,fp_rmsd, molecule) is a function using, as variables, the degree of alignment and the root-mean-square deviation of the unit of fingerprint of compound between the binding compound and the candidate compound, and the 3-dimensional structure of the candidate compound with respect to the target protein, the BaseScore(aligned_fp,fp_rmsd) is an index representing the degree of consistency and crowded degree of the unit of fingerprint of compound, the fp_volume(molecule) is an index representing the fraction occupied by the candidate compound in a space formed by the 3-dimensional coordinates of the fingerprint set of binding compound, and the collision state with the target protein, and the fp_contact_surface(molecule) is an index representing the contacting degree of the candidate compound with the target protein, and the degree of attribution to the 3-dimensional coordinates of the fingerprint set of binding compound.
As described above, these mathematical calculations described above according to the invention calculate the interaction between a target protein and a low molecular compound of library of virtual compounds using a conventional physical interacting function. However, the mathematical calculation is different from conventional techniques in view of semiempirically performing the calculation using the information of bioinformatics, and the success ratio of the structure prediction exhibits excellent effects such that the mathematical calculation according to the present invention is never inferior to the world-renowned docking software programs, no matter however excellent the programs are. Furthermore, since accumulation of information leads the results of the calculation of interaction by the semiempirical bioinformatics technique to be favorable, the mathematical calculations are useful, unlike the conventional techniques.
The present invention is the apparatus for in silico screening according to still another aspect of the present invention, wherein the BaseScore (aligned_fp, fp_rmsd) in the mathematical formula (1) is calculated based on the following mathematical formula (2):
wherein the RawScore(aligned_fp) is an index based on the number of atoms in the fingerprints of compound that are aligned between the binding compound and the candidate compound, and the fp_rmsd is the root-mean-square deviation, the fp_volume(molecule) is calculated based on the following mathematical formula (6):
wherein the nafp is the number of lattice points occupied by the 3-dimensional coordinates of the candidate compound in a region of proper grid based on the 3-dimensional coordinates of the fingerprint set of binding compound, the nap is the number of lattice points to which the 3-dimensional coordinates of the candidate compound fall into the region of proper grid of the atoms in the 3-dimensional structure of the target protein, and the k2 and k3 are arbitrary constants, and the fp_contact_surface(molecule) is calculated based on the following mathematical formula (7):
wherein n is the number of atoms of the candidate compound, atom(i) is the 3-dimensional coordinates of the ith atom in the candidate compound, the density_of_atom(atom(i)) is a function that reciprocates the sum of the number of atoms in the target protein contacting with the atoms of the fingerprint of compound at a predetermined distance and the number of atoms in the binding compound falling into the same lattice points of the fingerprint of compound when the 3-dimensional coordinates of the atom belong to the fingerprint of compound of the fingerprint set of binding compound, and the total_density_of_atom(molecule) is the number obtained by reordering the distribution of the density_of_atom in a descending order, and summing the numeric values in order as many times as the number of atoms in the candidate compound.
In a specific example according to the present invention, the present invention looks for already known active compounds with respect to an intrinsic target protein such as EGFR or VEGFR so as to clarify the values of k2 and k3 in the matter described above, and optimizes k2 and k3. The present invention relates to a method for performing in silico screening so that the values become, for example, k2=2.0 and k3=1.0, in the in silico screening of an inhibitor of EGFR. Since adequately listing up compounds that fit into the intrinsic target protein such as EGFR or VEGFR, by the in silico screening according to the present invention, is directly related to the development of new drugs for anticancer agent, the method of the present invention has excellent effects, unlike the conventional techniques, and is thus useful.
Typically, a docking software program such as GOLD is devised to select a good set in a genetic algorithm using the atoms participating in the biologically important hydrogen bond as a point or a vector. Such a point or vector is distinguished from the FP of the 3-dimensional descriptor, which is an element as the conditions for extracting the collective conformation in which various low molecular compounds are bound to a family macromolecule set that is similar to the 3-dimensional structure of the target protein as described in the aspect of the invention previously described. According to this embodiment of the invention, when an atom point or vector set constituting the biologically important hydrogen bond or the like is to be took in from the interacting conformation according to the aspect of the invention described above, if the fp_rmsd value is obtained by the following formula, the atoms participating in the biologically important hydrogen bond, hydrophobic bond or van der Waals interaction can be incorporated into the invention described above without any problem.
That is, according to the present invention, the formula of the fp_rmsd+distance rmsd indicative atom set composed of important points vectors may be extended to the form of fp_rmsd**k1+distance_rmsd**k4 (**k1<<**k4 indicates small contribution to FP: **k1>>**k4 emphasizes the contribution to FP), or may be extended to the form of distance_rmsd**k4. Here, when there are biologically important hydrogen bond, hydrophobic bond or van der Waals interaction for the ligand atoms at the ligand binding site of a target protein in the interaction between the target protein and a low molecular compound that is docking, the distance_rmsd is defined as the least square error of the ideal coordinates at the ligand binding site of the target protein, and the final point coordinates of a vector generated from the biologically important atoms, or nearby atoms, of the target protein.
According to the present invention, in regard to various low molecular compounds, when most of the compounds are peptides having amino acid residues linked up, since the correspondence relationship of FP becomes complicated because of the large number of peptide groups, the correspondence relationship is underestimated in the process of score calculation. Thus, the part corresponding to the formula for the FP of the peptide moieties in the mathematical formula for the RawScore according to the present invention may be replaced with an underestimated number such as zero.
That is, according to the present invention, in regard to the method for calculating the interaction of a target protein and a docking low molecular compound on the basis of the FP, any of devices such as the grid information of the environment for ligand binding of a target protein that is a target macromolecule, the multipoint information of a compound which emphasizes a vector between the compound and the target macromolecule, and a vector directed from a compound representing the biological environment of the target protein to the target protein, is implemented. In addition to that, another aspect of the present invention is an extended invention of the invention described above, such as including and merging a method for calculating the interaction energy or the like from an interatomic potential formula of classical physics between various atoms of a compound and various atoms constituting a target macromolecular protein. This aspect of the invention relates to determining the order related to the conformation or the strength of binding in interaction of compounds, to be expressed as score values. Thus, the present invention has excellent effects, unlike the conventional techniques, and thus is useful.
An method of in silico screening executed by an apparatus for in silico screening for performing screening of candidate compounds that bind to a target protein includes a storage unit and a control unit, wherein the storage unit includes a compound database produced by extracting a chemical descriptor that includes the atom type and the interatomic bonding rules as the fingerprint of a compound related to a plurality of atoms in the compound, for each of the candidate compounds, and the method includes a fingerprint of compound producing step of extracting the fingerprint of compound from a binding compound known to bind to a family protein, which has a 3-dimensional structure that is identical or similar to that of the target protein, along with 3-dimensional coordinates that have been converted into the coordinate system of the target protein, to thereby produce a fingerprint set of binding compound, and an optimizing step of computing, for the candidate compounds stored in the compound database, the 3-dimensional structures of the candidate compounds with respect to the target protein, so that the interaction scores that are based on the root-mean-square deviations of each unit of fingerprint of compound that have been calculated using the 3-dimensional coordinates of the fingerprint set of binding compound as a basic, are optimized, wherein the steps are executed by the control unit.
Therefore, according to the present invention, binding between a protein and a compound can be predicted with high accuracy, and a large number of compounds that get a hit can be selected. Furthermore, semiempirical screening can be performed while taking into consideration of the information of biochemical experiments or the like, and an effect of increasing the prediction efficiency is obtained.
FIG. 35 is a diagram presenting the performance in the case of further lowering the upper limit value of the Tc range of the ligands used in the FP library to be 0.16, 0.24, 0.36, and to be 0.08 as the lower limit value, and the ratio of predictive success in the Tc range described above, namely, 0.56, 0.76, 0.96 as the upper limit value, and 0.08 as the lower limit value;
The following describes an embodiment of an apparatus for in silico screening, and a method of in silico screening according to the present invention in detail with reference to the drawings. The embodiment is illustrative only, and is not intended to limit the present invention in any way.
The following outlines the present invention, and then, a configuration and processing of the present invention are explained in detail.
Currently, 3-dimensional coordinates approximating 40,000 in number are registered on the PDB (Protein Data Bank). The 3-dimensional coordinates represent the state in which various compounds such as peptides, low molecular compounds or metals are directly interacting with target macromolecules, based on experiments such as X-ray analysis, an NMR experiment, an electron beam analysis experiment, or a high-resolution electron microscope photography. Furthermore, as a result of progress in the performance of computer and in the bioinformatics, a family macromolecular protein set, which has a 3-dimensional structure similar to that of a target macromolecular protein having various compounds bound thereto, is easily obtained, and can be extracted, by websites such as SCOP, or the program produced by the applicant of the present invention, which shows excellent results in the CASP.
Under such circumstances, the inventors of the present invention conceived an idea that if bioinformatics could be used by utilizing the collective overlapped state of various compounds bound to a target macromolecular protein, instead of the technique of determining the order of in silico screening of compounds from the results of the interaction energy obtained using the conformation, or the score value obtainable at that time, of the compounds that directly bind to a target macromolecular protein, which has been traditionally determined in general in a classical physical manner, it would be possible to determine the order by in silico screening of compounds from the results of the interaction energy using the conformation, or the score value obtainable at that time, of compounds based on the human intelligence.
The present invention was made as a result of the present application devotedly conducting investigation based on the above idea by the inventors, and the present invention roughly has the following fundamental features. That is, the present invention is an apparatus for in silico screening for performing screening of candidate compounds that bind to a target protein including at least a storage unit, and a control unit, wherein the storage unit includes a compound database produced by extracting a chemical descriptor that includes the atom type and the interatomic bonding rules as the fingerprint of a compound related to a plurality of atoms in the compound, for each of the candidate compounds.
Here, the term “fingerprint of compound” (fingerprint: FP) is more specifically a chemical descriptor that includes the atom type of atoms, such as two, three or four atoms, in a compound, and the interatomic bonding rules. The “atom type” is, for example, the Sybyl atom type, or the “valence type” and so on. The “interatomic bonding rules” represents the chemical bonding state between atoms, and for example, represents the bonding rules such as a single bond, a double bond or the bonding in aromatic rings, or the category classification according to the molecular orbital method.
The apparatus for in silico screening extracts the fingerprint of compound from a binding compound known to bind to a family protein, which has a 3-dimensional structure that is identical or similar to that of the target protein, along with 3-dimensional coordinates that have been converted into the coordinate system of the target protein, to thereby produce a fingerprint set of binding compound. That is, in the coordinate system of the target protein, the screening apparatus gathers the collective conformation of a group of the 3-dimensional structure of compounds binding to the target protein, and extracts the fingerprints of compound in correspondence to the 3-dimensional coordinates.
Here, the “family protein, which has a 3-dimensional structure that is identical or similar to that of a target protein,” may be the target protein itself, or may be a protein having a partial structure (for example, an active site, a ligand-binding site or the like) that is identical or similar to that of the target protein. Alternatively, an identical or similar protein may also be used without analyzing the 3-dimensional structure of the target protein and thereby specifying the active site. To allow a stable conformation to have a higher score, there has been a need to analyze the 3-dimensional structure of the target protein in advance and to specify the active site, in the calculation of docking using existing docking software programs such as conventional DOCK, AutoDock and GOLD. However, in comparison thereto, the present invention has a superior effect that is different from that of the conventional techniques, and the invention is useful because it is not necessary to specify the active site through a study of the literature.
Also, a homology search from a protein database which stores the 3-dimensional structures and amino acid sequences of proteins that are bound to a compound, may be carried out using the amino acid sequence of the target protein as a query sequence, and a protein found to a certain value or higher for an index that represents similarity in the structure through the superimposition of conformations with the target protein, may be designated as the family protein. Here, the “binding compound (that is) already known to bind to a protein” may be a compound for which the 3-dimensional structure of the protein-compound complex has been experimentally confirmed by X-ray structural analysis, NMR structural analysis or the like. The binding compound may be acceptable merely the compound that is known to bind to a protein, or may be a compound predicted to have a stable conformation with respect to a target protein by a known docking algorithm (DOCK, AutoDock, GOLD, or the like) or any program for generating coordinates (Corina, or the like).
Here, in order to converts the 3-dimensional coordinates of the binding compound into the coordinate system of the target protein, an operation of superimposing the conformations of the family protein and the target protein may be performed by the apparatus for in silico screening, to thereby convert a binding compound that has been bound to the family protein, along with the coordinates of the binding compound from the coordinate system of the family protein into the coordinate system of the target protein. For example, the operation of superimposition of conformations may be carried out based on an algorithm for the superimposition of conformations between proteins (CE or the like), which does not take the kind of atom into consideration, or if the homology between the target protein and the family protein is high, the superimposition of conformations may also be performed, with the kind of atom taken into consideration.
Extraction of the fingerprint of compound is not limited to direct extraction from a binding compound, but an arbitrary fingerprint of compound may also be added as necessary, for the purpose of searching for candidate compounds with respect to the target protein. For example, the superimposition of conformations may be performed by referring to another compound that is different from the binding compound, to thereby produce a new fingerprint of compound that straddles between the atoms of the binding compound and the atoms of the other compound, and the new fingerprint of compound may be added to the fingerprint set of binding compound. Alternatively, the interaction energy with respect to the target protein may be calculated for a compound that is analogous to the binding compound on the basis of the Tanimoto coefficient, using a program (“Circle” or the like) that can evaluate the stability by interchanging the kind of atom between the atoms of the binding compound and the atoms of the relevant compound, to thereby newly produce a fingerprint of compound that is more stable in terms of local energy than the fingerprint of compound from the binding compound, as a “modified fingerprint of compound (modified FP),” and this modified fingerprint of compound may be added to the fingerprint set of binding compound. That is, in the case of using, as a lead compound, a low molecular compound that is ideal to the binding to a target protein, to add various substituted groups to improve the interaction energy, or in the case of finding an arbitrary low molecular compound whose Tanimoto coefficient, which is a quantifying function of the fingerprint of compound, is very similar to that of an ideal low molecular compound, that is, close to 1, the region of the fingerprint of compound is limited to 4 or 5 Angstroms, which is a region surrounding with an ideal low molecular compound that has been completely analyzed experimentally. Thereby, the docking structure of compounds having similar chemical structures, that is, very similar Tanimoto coefficients, and the interaction score thereof, can be easily calculated.
The apparatus for in silico screening computes, for the candidate compounds stored in the compound database, the 3-dimensional structures of the candidate compounds with respect to the target protein, so that the interaction scores that are based on the root-mean-square deviations (rmsd) of units of fingerprint of compound that have been calculated using the 3-dimensional coordinates of the fingerprint set of binding compound to be fixing coordinates as a basic, are optimized.
That is, in this process for optimization, for example, the apparatus for in silico screening determines, according to the Metropolis method, the interaction score that has been calculated, on the basis of the root-mean-square deviation, after repeatedly changing the conformation of a candidate compound, and repeatedly translating or rotating the candidate compound as a rigid body, for each of the conformations of the candidate compound, and modifies, increases or decreases the fingerprint of compound from the candidate compound according to the results of determination. Here, a fingerprint set of binding compound to be fixing coordinates that will serve as the basic, may be selected by randomly extracting several fingerprints of compound. Alternatively, instead of changing the conformation by randomly modifying the rotatable dihedral angle of the candidate compound, the conformations of the candidate compound may also be changed by storing the previous conformations in memory, such as in the case of a genetic algorithm.
The calculation of the interaction scores in the optimization process is carried out, for example, based on a function that takes into consideration of the collision state of the candidate compound with the target protein, the existential rate of the candidate compound in the region of interaction of the target protein, and the fraction of direct interaction of the candidate compound with the target protein, which function is based on the root-mean-square deviation of the unit of fingerprint of compound. More specifically, the interaction scores are calculated based on the following mathematical formula (1):
wherein the FPAScore represents the interaction score, the F(aligned_fp,fp_rmsd, molecule) is a function using, as variables, the degree of alignment and the root-mean-square deviation of the unit of fingerprint of compound between the binding compound and the candidate compound, and the 3-dimensional structure of the candidate compound with respect to the target protein, the BaseScore(aligned_fp,fp_rmsd) is an index representing the degree of consistency and crowded degree of the unit of fingerprint of compound, the fp_volume(molecule) is an index representing the fraction occupied by the candidate compound in a space formed by the 3-dimensional coordinates of the fingerprint set of binding compound, and the collision state with the target protein, and the fp_contact_surface(molecule) is an index representing the contacting degree of the candidate compound with the target protein, and the degree of attribution to the 3-dimensional coordinates of the fingerprint set of binding compound.
Thus, the outline of the processing of the present invention is as discussed above. As such, the ranking of the candidate compounds for the interaction with respect to the target protein is determined on the basis of the interaction scores that have been calculated according to the optimization technique, and a significant candidate compound can be inferred from the compound database. Therefore, the binding between a protein and a compound can be predicted with high accuracy, and also, a large number of compounds that get a hit can be selected. Furthermore, it is possible to perform semiempirical screening while taking the information of biochemical experiments and the like into consideration, and the prediction efficiency can be increased.
In other words, the present invention has been achieved as a result of contemplating that the conformations of various low molecular compounds (binding compounds) that are collectively bound to a family protein, which has a 3-dimensional structure that is identical or similar to that of a target protein, are close to the most stable conformation that has made interaction with the target protein. Furthermore, the present invention can perform semiempirical in silico screening with higher prediction efficiency than conventional techniques, by scoring of appropriate interaction scores using an easily handlable fingerprint of compound as a unit, and optimizing, when a comparison is made between a binding compound and a candidate compound.
The following first describes a configuration of the apparatus for in silico screening.
In
The various databases and tables (such as a candidate compound DB 106a, a fingerprint set of binding compound 106b, and a medicinal chemical compound DB 106c) stored in the storage unit 106 are storage units such as fixed disk devices, and store various programs, various tables, various file, various databases, various web pages, and the like used in various processes.
Among these respective constituent elements of the storage unit 106, a candidate compound DB 106a is a candidate compound database unit that has been produced by extracting a fingerprint of compound for each of the compounds serving as candidates for in silico screening (referred to as “candidate compounds”).
A fingerprint set of binding compound 106b is a storage means for fingerprints of binding compound, which stores a fingerprint set of binding compound produced by extracting fingerprints of compound for the compounds known to bind (referred to as “binding compounds”) to a protein having a 3-dimensional structure that is identical or similar to that of a target protein (referred to as “family protein”), along with the 3-dimensional coordinates that have been converted into the coordinate system of the target protein.
A medicinal chemical compound DB 106c is a medicinal chemical compound database that stores a fingerprint set of medicinal chemical compound produced by extracting fingerprints of compound for known medicinal chemical compounds, such as the MDL CMC Library. That is, the medicinal chemical compound DB 106c is used to produce the fingerprint set of binding compound 106b that is specialized in drug absorption, drug metabolism, drug excretion or drug toxicity, which have been organized in advance using drug absorption, drug metabolism, drug excretion, drug toxicity or the like as an index, and using the fundamental data units as the bases for the organization of fingerprints of compound, in order to take out compound information using the pharmaceutical database.
In
In
In
The fingerprint of compound producing unit 102a is a unit that produces fingerprints of compound by extracting fingerprints of compound from compounds such as candidate compounds, binding compounds or medicinal chemical compounds. For example, the fingerprint of compound producing unit 102a produces a fingerprint set of candidate compound by extracting fingerprints of compound for candidate compounds that have been input via the input device 112, and stores the fingerprint set of candidate compound in the candidate compound DB 106a. The fingerprint of compound producing unit 102a also produces a fingerprint set of medicinal chemical compound by extracting fingerprints of compound from medicinal chemical compounds that have been acquired, and stores the fingerprint set of medicinal chemical compound in the medicinal chemical compound DB 106c.
The fingerprint of compound producing unit 102a also produces a fingerprint set of binding compound 106b by converting the 3-dimensional coordinates of atoms into the coordinate system of the target protein, and extracting fingerprints of compound for the binding compounds that are already known to bind to the family protein, along with the converted 3-dimensional coordinates. That is, the fingerprint of compound producing unit 102a gathers collective conformations for a group of the 3-dimensional structure of compounds binding to the target proteins on the coordinate system of the target protein, and extracts the fingerprints of compound in correspondence to the 3-dimensional coordinates. In other words, the fingerprint of compound producing unit 102a produces the fingerprint set of binding compound 106b by extracting, from the group of compound bound to the target protein, as many chemical descriptors as possible, which are called as fingerprints of compound, and which include the atom type of atoms, such as two, three or four atoms, and the interatomic bonding rules, together with the 3-dimensional coordinates of the chemical descriptors, and storing the chemical descriptors in the storage unit 106 in the form of a table of database.
Here, the fingerprint of compound producing unit 102a may perform, in order to convert the 3-dimensional coordinates of the binding compound into the coordinate system of the target protein, an operation of superimposing the conformations of the family protein and the target protein, and then convert the 3-dimensional coordinates of the binding compound that has been bound to the family protein, into the coordinate system of the target protein (from the coordinate system of the family protein). For example, the fingerprint of compound producing unit 102a may perform the operation of superimposition of conformations through an algorithm (CE or the like) for the superimposition of conformations between proteins (the target protein and the family protein), which does not take the kind of atom into consideration. If the homology between the target protein and the family protein is high, the superimposition of conformations may also be performed, with the kind of atoms taken into consideration.
The fingerprint of compound producing unit 102a is not limited to direct extraction of the fingerprint of compound from a binding compound, but may also add an arbitrary fingerprint of compound to the fingerprint set of binding compound 106b as necessary, for the purpose of searching for candidate compounds for the target protein. Here, the fingerprint of compound producing unit 102a includes, as shown in
The optimizing unit 102b is an optimizing unit that computes, for the candidate compounds stored in the candidate compound DB 106a, the 3-dimensional structures of the candidate compounds with respect to the target protein, so that the interaction scores that are based on the root-mean-square deviations (rmsd) of each unit of fingerprint of compound that have been calculated using the 3-dimensional coordinates stored in the fingerprint set of binding compound 106b as a basic, are optimized. For example, the optimizing unit 102b determines, according to the Metropolis method, the interaction scores calculated on the basis of the root-mean-square deviation, for each of the conformations of the generated candidate compounds and each of the 3-dimensional coordinates with respect to the target protein, and modifies, increases or decreases the fingerprints of compound of the candidate compounds according to the results of determination. Here, the optimizing unit 102b may randomly extract several fingerprints of compound from the fingerprint set of binding compound 106b, and thereby select a fingerprint set of binding compound to be fixing coordinates, which will serve as the basic. Here, the optimizing unit 102b includes, as shown in
The interaction score calculating unit 102f is an interaction score calculating unit that calculates the interaction score, based on a function that takes into consideration of not only the root-mean-square deviation for the unit of fingerprint of compound but also the collision state of the candidate compound with the target protein, the existential rate of the candidate compound in the region of interaction of the target protein, and the fraction of direct interaction of the candidate compound with the target protein in the optimization process by the optimizing unit 102b. A concrete example of calculation of the interaction scores by the interaction score calculating unit 102f will be described below in detail in explanation of processing.
The structure transforming unit 102g is a structure transforming unit that repeatedly changes the conformation of the candidate compound in the optimization process by the optimizing unit 102b, and repeatedly translates or rotates the candidate compound as a rigid body, for each of the conformations of the candidate compound based on a simulated annealing method. The structure transforming unit 102g may also change the structure of the candidate compound by storing the previous conformations in memory, such as in the case of a genetic algorithm, instead of changing the conformation by randomly modifying the rotatable dihedral angle of the candidate compound.
The screening result output unit 102c is a unit for outputting results, which determines the ranking of the candidate compounds for the interaction with the target protein, based on the interaction scores that have been optimized by the optimizing unit 102b, and outputting the results of in silico screening.
The homology searching unit 102d is a homology searching unit that searches for the family protein and the binding compound from the protein database system, based on the homology of the target protein with the amino acid sequence. That is, the homology searching unit 102d performs a homology search, in order to acquire a binding compound, by making reference to a protein database such as an external system 200 using the amino acid sequence of the target protein as a query sequence, and thereby acquires a binding compound, for which the conformation bound to a protein having homology to the target protein is already known.
As shown in
That is, in
The following describes in detail one example of the processing executed by the apparatus for in silico screening 100 composed as above with reference to
As shown in
The fingerprint of compound producing unit 102a superimposes structure of the target protein and the structure of the family protein accompanied by the binding compound (Step SA-2). Here, the fingerprint of compound producing unit 102a may perform the superimposition of conformations between proteins without taking the kind of atom into consideration, or if the homology between the target protein and the family protein is high at a predetermined value or above, the unit may also perform the superimposition of conformations while taking the kind of atoms into consideration.
The fingerprint of compound producing unit 102a then converts the 3-dimensional coordinates of the binding compound from the coordinate system of the family protein into the coordinate system of the target protein (Step SA-3).
The fingerprint of compound producing unit 102a produces the fingerprint set of binding compound 106b by extracting the fingerprint of compound from the binding compound, along with the 3-dimensional coordinates of the binding compound that have been converted into the coordinate system of the target protein, and storing the fingerprint of compound in the storage unit 106 (Step SA-4). Here, the fingerprint of new compound adding unit 102e may add an arbitrary fingerprint of compound (“modified FP” or the like) as necessary, for the purpose of searching for a candidate compound with respect to the target protein. The fingerprint of compound producing unit 102a may also narrow structures by performing the search of those resembling a medicinal chemical compound, by determining an intersection of the fingerprint set of compound stored in the fingerprint set of binding compound 106b and the fingerprint set of compound stored in the medicinal chemical compound DB 106c.
The optimizing unit 102b selects, from the fingerprint set of binding compound 106b, a fingerprint of compound of fixing coordinates that serves as the basis of the calculation of the interaction score with respect to the candidate compounds stored in the candidate compound DB 106a (Step SA-5).
The optimizing unit 102b then calculates, for the candidate compounds, the root-mean-square deviations of each unit of fingerprint of compound using the 3-dimensional coordinates to be fixing coordinates of the selected fingerprints of compound as the basic, and performs the least square fitting, and computes the 3-dimensional structures of the candidate compounds with respect to the target protein, so that the interaction scores based on the calculated root-mean-square deviations are optimized (Step SA-6). That is, the optimizing unit 102b calculates, through the processing of the interaction score calculating unit 102f, the interaction scores based on the root-mean-square deviations of the 3-dimensional coordinates between fingerprints of compound, using, as the basic, the fingerprint of compound to be fixing coordinates of the target protein that has been arbitrarily selected from the fingerprint set of binding compound 106b. Then, the optimizing unit 102b performs a simulated annealing method based on the Metropolis method using the interaction scores as an index, so that the conformations of the candidate compounds converted by the processing of the structure transforming unit 102g and the structures with respect to the target protein will be optimized.
The screening result output unit 102c determines the order of interaction of the candidate compounds in the candidate compound DB 106a with respect to the target protein, based on the interaction scores that have been optimized by the optimizing unit 102b, and outputs the results of in silico screening to the output device 114 (Step SA-7). For example, the screening result output unit 102c sorts the group of candidate compounds in a descending order with respect to the interaction scores at the maximum point obtained for each of the candidate compounds by the optimizing unit 102b, and outputs the results.
This is the end of the processing of the apparatus for in silico screening 100.
The following describes one example of the procedure for calculating the interaction score by the interaction score calculating unit 102f. The interaction score calculating unit 102f calculates the interaction score, based on a function that takes into consideration of not only the root-mean-square deviation for each unit of fingerprint of compound but also the collision state of the candidate compound with the target protein, the existential rate of the candidate compound in the region of interaction of the target protein, and the fraction of direct interaction of the candidate compound with the target protein. Specifically, the interaction score is calculated based on the following mathematical formula (1):
wherein the FPAScore represents the interaction score, the F(aligned_fp,fp_rmsd, molecule) is a function using, as variables, the degree of alignment and the root-mean-square deviation of the unit of fingerprint of compound between the binding compound and the candidate compound, and the 3-dimensional structure of the candidate compound with respect to the target protein, the BaseScore(aligned_fp,fp_rmsd) is an index representing the degree of consistency and crowded degree of the unit of fingerprint of compound, the fp_volume(molecule) is an index representing the fraction occupied by the candidate compound in a space formed by the 3-dimensional coordinates of the fingerprint set of binding compound, and the collision state with the target protein, and the fp_contact_surface(molecule) is an index representing the contacting degree of the candidate compound with the target protein, and the degree of attribution to the 3-dimensional coordinates of the fingerprint set of binding compound.
More specifically, according to the present embodiment, the respective terms in the mathematical formula (1) are calculated based on the following mathematical formulas.
This term is a function taking into consideration of the degree of consistency and crowded degree of units of fingerprint of compound:
wherein the RawScore(aligned_fp) is an index based on the number of atoms in the fingerprints of compound that are aligned between the binding compound and the candidate compound, and the fp_rmsd is the root-mean-square deviation.
The RawScore(aligned_fp) is specifically calculated based on the following mathematical formula (3):
wherein the assigned_score(i) is the score given in advance to the fingerprint of compound aligned on the ith order, based on the following formula.
The assigned_score(i) is calculated in more detail based on the following mathematical formula (4):
wherein the total_atom(i) is the number of atoms constituting the fingerprint of compound aligned on the ith order, and for example, in the case of a fingerprint of compound formed from four atoms, the number is 4; Case1_S, Case2_S and Case3_S are the scalar values given when the conditions described below are satisfied; and the n-neighbor_atom(i), which will be described later, is the number of atoms belonging to the same fingerprint of compound that is adjacent to the ith atom set.
For example, in regard to the term Case1_S, the depth-first search (see “Complete Course of C Algorithm: From Basics to Graphics, ISBN4-7649-0239-7, Kindai Kagaku Sha Co., Ltd.”) is performed up to four atoms for a binding compound present in the fingerprint set of binding compound (for example, fingerprint of compound such as C.ar-N.ar-C.ar-C.ar). In the present embodiment, since the search is ended with up to four atoms, the number of ring structure is not taken into account. That is, a benzene ring and a naphthalene ring are not distinguished. If the search is successful, a score (Case1_S) is given to each of the atoms constituting the fingerprint of compound. Herein, the scalar value per one atom is defined as 5.0. That is, a fingerprint of compound composed of four atoms is given 20.0, while a fingerprint of compound composed of three atoms is given 15.0.
The term Case2_S is a constant score given to each atom when a new fingerprint of compound is produced using the fingerprint of compound obtained in Case1, by selecting two arbitrary fingerprints of compound that superimpose at a certain distance, and linking the atoms through imaginary bonds to produce a new fingerprint of compound. A default of 2.5 may be used.
The term Case3_S is an arbitrary scalar value that can be given when there is a possibility of the presence of an atom based on biochemical information or energy calculation. Here, the Case3_S was not used in the validation calculation using a training set.
Here, the fingerprints of compound obtained from the process for producing the Case1_S and Case2_S must belong to the fingerprint set of compound obtained from a known pharmaceutical database that is capable of discriminating the information on bonding rules and the atom type.
Furthermore, in the process for producing the Case1_S, Case2_S and Case3_S, between the coordinates that belong to a same fingerprint of compound, the natural logarithm of the number of atoms for which the distance between an atomic coordinate set and another atom is within the value of dist (default is 1.0 Å), is added to the score for the coordinates of fp. In regard to the binding compounds, when most of the compounds are peptides that amino acid residues links up, the correspondence relationship of a fingerprint of compound having many peptide groups becomes complicated. Therefore, the correspondence relationship may be underestimated during the process of calculating the interaction scores, and the part corresponding to the mathematical formula (3) of the fingerprint of compound for the peptide moiety in the mathematical formula for RawScore given above, may be replaced with the underestimated number such as zero.
The denominator of right side in the above mathematical formula (2) is calculated based on the following mathematical formula (5):
1.0+ln(fp—rmsd**k1+1.0) (5)
wherein ln is natural log; k1 uses 4.0 as the optimized result; fp_rmsd is the rmsd for least square superimposition; k1 is the scaling factor determining how strict the accuracy of the superimposition of FP should be, and is a constant which makes such that a large value of k1 results in a large rmsd (bad), and therefore decreases the RawScore (score) of the formula (3).
<fp_volume(molecule) Term>
This term is a function that evaluates the fraction occupied by a candidate compound in a space formed by the 3-dimensional coordinates of the fingerprint set of binding compound, that is, how much space formed from the fingerprint of compound obtained from the fingerprint set of binding compound is filled, and the collision with the target protein:
wherein the nafp (Number of Ligand Atom covering Fingerprint) is the number of lattice points occupied by the 3-dimensional coordinates of the candidate compound in a region of proper grid based on the 3-dimensional coordinates of the fingerprint set of binding compound, the nap (Number of Ligand Atom covering Protein) is the number of lattice points to which the 3-dimensional coordinates of the candidate compound fall into the region of proper grid of the atoms in the 3-dimensional structure of the target protein, and the k2 and k3 are each coefficient, and an arbitrary constant that can be modified by the biochemical information of the target protein, the degree of induced fit, or the like. In the present embodiment, 1.0 is used as the default.
<fp_contact_surface(molecule) Term>
This term is a function taking into consideration of the contacting degree of the candidate compound with the target protein, and the degree of attribution of the fingerprint set of binding compound to the there-dimensional coordinates:
wherein n is the number of atoms of the candidate compound, atom(i) is the 3-dimensional coordinates of the ith atom in the candidate compound, the density_of_atom(atom(i)) is a function that reciprocates the sum of the number of atoms in the target protein contacting with the atoms of the fingerprint of compound at a predetermined distance and the number of atoms in the binding compound falling into the same lattice points of the fingerprint of compound when the 3-dimensional coordinates of the atom belong to the fingerprint of compound of the fingerprint set of binding compound, and the total_density_of_atom(molecule) is the number obtained by reordering the distribution of the density_of_atom in a descending order, and summing the numeric values in order as many times as the number of atoms in the candidate compound.
The density_of_atom(atom(i)) is expressed in more detail based on the following mathematical formula (8):
In this formula, if the coordinates of an atom that constitutes the candidate compound do not fall into the fingerprint of compound derived from the fingerprint set of binding compound, the crowded degree becomes 0, while if the coordinates do fall into the fingerprint of compound, the score is calculated according to the formula described above.
That is, nfpcontact is the number of atoms of the candidate compound that are contacting with the atoms belonging to the fingerprint of compound at a certain distance (default is 3.8). natom is the number of atoms constituting a compound derived from the binding compound set, which atoms fall into the same lattice point. In the case of identical binding compounds having different ID code in the PDB, the value may be appropriately changed, but in the present embodiment, the number may be counted, with overlapping being allowed. In is used in the case of having particularly important biochemical information, and 0 is used as default. That is, the value occurs depending on the modified FP, which is introduced according to the 3D-1D method of the “Circle” or the like, when stable contact with the target protein is suggested.
The formula of total_density_of_atom(molecule) is described below:
wherein total is the number of atoms (atoms in a molecule) of the compound; sort_density_of_atom is the result of arranging the distribution of density_of_atom in order, starting from larger values. That is, if the molecule is large, a larger value is added, and therefore, the total_density_of_atom is increased.
This is the end of explanation of one example of the method for calculating the interaction score by the interaction score calculating unit 102f.
An example of the processing of optimizing the conformation and configuration of candidate compounds according to simulated annealing by the optimizing unit 102b, based on the interaction scores calculated by the method for calculating interaction scores as described above, will be explained in the following.
Initially, the structure transforming unit 102g changes the conformation by randomly varying the rotatable dihedral angle of the candidate compound. In the present embodiment, the change of conformation is performed 1000 times. If this number is larger, there is a possibility of obtaining better results. However, since there is a need to perform docking calculation for the large number of low molecular compounds included in the virtual candidate compound DB 106a, it is necessary to limit the finite number of runs. It is also speculated that even though the number may depend on the degree of freedom of rotation of the candidate compound, this number of runs will be sufficient in the preliminary calculation. The initial conformation may be the binding conformation with respect to a family protein, which conformation is registered on the candidate compound DB 106a. The optimizing unit 102b uses the coordinates of the candidate compound in the following processing, for each of these changed conformations.
The optimizing unit 102b randomly selects ten fingerprints of compound from the bands of fingerprint of compound (fp bands) of the fingerprint set of binding compound 106b. When there are fewer than ten fingerprints of compound, half the maximum number of the bands of fingerprint of compound is used. More specifically, the atomic coordinates of the fingerprints of compound of the candidate compound and the fingerprint set of binding compound 106b are randomly selected from the selected bands of fingerprint of compound. This state is referred to as fingerprint alignment. In the correspondence relationship thereof, least square fitting is performed, and using the root-mean-square deviation (rmsd) of the superimposition in that case and the atomic coordinates of the candidate compound after the superimposition, the interaction score is calculated by the formula described above.
After the second round of repetition, the structure transforming unit 102g stores the state of the previous round in the storage unit 106, and performs translation and rotation while maintaining the conformation of the candidate compound, that is, while dealing with the candidate compound as a rigid body. Thereby, the structure transforming unit 102g performs an increase or decrease of one fingerprint of compound, and modification and addition of the correspondence relationship of the atomic coordinate set. In the present embodiment, this step is carried out 10,000 times.
In this process, the optimizing unit 102b performs the Metropolis decision. That is, the optimizing unit 102b accepts the configuration of the relevant candidate compound If the interaction score of this round is larger than the interaction score of the previous round. On the contrary, if the interaction score is smaller, the adopting probability, Paccept, is calculated based on the following mathematical formula:
That is, since the scope of the adopting probability, Paccept, is such that 0<Paccept<=1, the optimizing unit 102b simultaneously generates uniform random numbers in the range of 0<=r<=1 in this case. If r<Paccept, even if the interaction score is smaller than that of the previous round, the interaction score is accepted. In the simulated annealing process, T (temperature) starts from 30 K and is decreased to 0.07 K.
As such, the optimizing unit 102b calculates the maximum value of the interaction score of one conformation, makes a comparison for 1000 conformations that have been initially generated, and thereby predicts the structure with the maximum interaction score as the optimal target protein-candidate compound complex (Protein-Ligand complex) structure. At this time, in the process of ranking the 1000 conformations, the previous conformation may be stored, to thereby keep changing the ligand structure according to a certain algorithm, by using a genetic algorithm or the like instead of randomly generating the conformations, and the calculation time or the search of the maximum value may also be subjected to devising. In the process of calculation for 1000 runs, it is possible to shorten the calculation time, or to obtain the minimum score having a possibility of the ligand conformations approaching more closely to the true conformation, in order to determine the order of the ligand conformations, by using a genetic algorithm or the like, such as that employed in the GOLD program.
The explanation of the maximization of the interaction score by simulated annealing is as discussed above.
Upon producing a fingerprint set of compound, as a scale for measuring the similarity between compounds, for example, a low molecular compound set having a Tanimoto coefficient (Tc) of 0.08 or more may be used. In the case of determining the fingerprint (fp) of compound from a chemical descriptor, which is the fingerprint of compound from each compound, such as the Sybyl atom type, the Tanimoto coefficient (Tc) is calculated as follows:
wherein a is the number of fingerprints of compound that are present in the FP bands of both the binding compound and the candidate compound; and b and c are each the number of FP that are present in only one side of the FP bands.
To explain the same matter using the concept of assembly, if the assembly of fingerprints of compound possessed by the respective FP bands is designated as A and B, the following formula may be obtained:
wherein number_of_fp(assembly) is the number of fingerprints of compound that belong to a certain assembly.
Thus, the explanation of the Tanimoto index is as described above.
Hereinafter, a first example of the present embodiment to which the present invention is applied, will be described below in detail while referring to
In recent years, owing to the enhancement in the speed of computing machines, the methods for predicting the 3-dimensional structures of proteins and the evaluation method of the 3-dimensional structures in the field of pharmaceutical development [reference literature (Terashi G, Takeda-Shitaka M, Kanou K, Iwadate M, Takaya D, Hosoi A, Ohta K, Umeyama H Proteins, 2007; 69(Suppl 8): 98-107)] have improved. For example, in Homology Modeling, which is one of the methods for predicting 3-dimensional structures of proteins, the accuracy of prediction is ascentive, as a result of an increase in the structures registered to the Protein Data Bank (PDB) [reference literature (Westbrook et al. Nucleic Acids Res. 2003 Jan. 1; 31(1): 489-91)], an increase in the templates that are referred to, except for membrane proteins, and blind tests in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [reference literature (Takeda-Shitaka, M., Terashi, G., Takaya, D, Kanou, K., Iwadate, M., Umeyama, H. Protein structure prediction in CASP6 using CHIMERA and FAMS. Proteins 2005; 61: 122-127)]. Furthermore, in regard to this homology modeling, the scope of application of the 3-dimensional structure prediction method is extending to the prediction of activity changes under the effect of mutation [reference literature (Nakamachi Y et al. Computer modeling analysis of Antithrombin with Ala54 Thr and Ala249Glu mutations (P04-08).], drug design [reference literature (Takede-Shitaka, M., Takaya, D., Chiba, C., Tanaka, H., Umeyama, H. Curr. Med. Chem. 2004; 11: 551-558)], and the like.
Together with the increase in the protein 3-dimensional structures registered on the PDB, the results of X-ray structural analysis of protein-ligand complexes are also increasing, and in many cases, multiple X-ray structures that have been analyzed may exist in a single family protein [reference literature (Edgar R. Wood et al. CANCER RESEARCH 2004; 64: 6652-6659), (Jennifer et al. J. Bio. Chem. 2002; 277 (48): 46265-46272)]. Even in the CASP described above, tests for predicting the residues at the binding site of proteins are performed [reference literature (Lopez, G, Rojas, A, Tress, M, Valencia, A Proteins 2007; 69(S8): 165-174)]. Thus, the importance of improving the accuracy of prediction of protein-ligand complexes is ever raising.
Meanwhile, in recent years, experimental determination of causative proteins of diseases is becoming popular [reference literature (Nature, or the like)], and the necessity of the design of inhibitors inhibiting those proteins is more and more rising.
As a potent method for the design of inhibitors, there may be mentioned inhibitor design (SBDD) based on the 3-dimensional structures of target proteins, and in-silico screening using software for protein-ligand complex prediction (so-called docking software) is performed. Here,
As shown in
In order to perform the docking of compounds having many rotatable bonds with high accuracy, a method of disposing fragments of the compound at the ligand binding site in advance using a potential function has also been designed [reference literature (Budin et al. Biol Chem. 2001; 382(9): 1365-1372)].
There have been a number of reports on the attempts to perform re-evaluation for selecting many hit compounds, in order to select the hit compounds, performing the calculation of the distance between the protein and the ligand, the energy of classical physics, and the like from the structure of a known protein-ligand complex, to extract the information on the interaction, after docking an inhibitor candidate compound from a library of virtual compounds to a target protein using an existing docking software program to predict the structure of the protein-ligand complex [reference literature (Sukumaran et al. Eur. J. Med. Chem. 2007; 42: 966-976), (Zhan et al. J. Med. Chem. 2004; 47: 337-344)].
However, it is implied by the series of studies described above that the existing docking software programs can predict protein-ligand complexes with high accuracy, but this prediction does not coincide with (is not directly connected to) direct selection of many hit compounds from the library of virtual compounds.
In other words, currently, the development of a system that is capable of predicting the structure of protein-ligand complexes with high accuracy and also capable of detecting many hit compounds from a virtual library, is highly requested, and such a system is absolutely indispensable in drug discovery.
Under such circumstances, the inventors of the present application developed the system CHOOse information Semi-Empirically on the Ligand Docking (ChooseLD), which can predict the structure of protein-ligand complexes by efficiently picking out effective information from the biochemical information of those protein-ligand complexes, of which interactions are already known that are registered on the PDB, and performing docking, and which can detect many hit compounds, without using the potential functions of classical physics in the evaluation of the interaction of protein-ligand complexes. Furthermore, the method of the inventors of the present invention does not use the potential functions of classical physics in the evaluation of the interaction of protein-ligand complexes. Therefore, the method of the present invention is expected that physical approaches such as CHARMM [reference literature (Brooks, R. B, Bruccoleri, E. R., Olafson, D. B., States, J. D., Swaminathan, S., Karplus, M. CHARMM. A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 1983; 4: 187-217)], AMBER [reference literature (Case, A. D., Cheatham III, E. T., Darden, T., Gohlke, H., Luo, R., Merz Jr., M. K. Onufriev, A., Simmerling, C., Wang, B., Woods, J. R. The Amber Biomolecular Simulation Programs. J. Comput. Chem. 2005; 26: 1668-1688)], and quantum chemistry [reference literature (Fedorov, G. D., Kitaura, K. Extending the Power of Quantum Chemistry to Large Systems with the Fragment Molecular Orbital Method J. Phys. Chem. 2007; 111: 6904-6914)] may function effectively on the optimization of the structures of protein-ligand complexes, for which it cannot be said that the physical interaction energy has been optimized.
The following describes an outline of the present example with reference to
Here, in the present example, the LIBRARY LIGANDS corresponds to a set of binding compounds, and the CElib corresponds to the fingerprint set of binding compound 106b.
Here, in
In
That is, as shown in
The apparatus for in silico screening 100 refers the group of ligands oriented to target proteins (C) to a druggable FP database (D), which corresponds to the medicinal chemical compound DB 106c, and obtains a target protein-specific FP band (L) as union (C)̂ (D). Here, the group of ligands oriented to target proteins (C) may be added with virtual FPs such as modified FPs, through the processing of the fingerprint of new compound adding unit 102e.
Subsequently, the apparatus for in silico screening 100 extracts fingerprints of compound from the candidate ligands of a library of virtual ligand or a benchmark set, which is ligands docked to the target protein, and thereby produces an FP band (R) of the candidate ligands, which corresponds to the candidate compound DB 106a.
The apparatus for in silico screening 100 changes the conformations of the candidate ligands through the processing of the structure transforming unit 102g, and performs FP alignment between the FP bands (R) of the ligands oriented to target proteins (C) and the candidate ligands.
When the apparatus for in silico screening 100 performs docking of the candidate ligands to the binding site of the target protein using an interaction scoring function through the processing of the optimizing unit 102b, the apparatus for in silico screening 100 performs the prediction of 3-dimensional structures of the target protein-candidate ligand complexes, while optimizing the interaction scores using the simulated annealing (SA) method. This is the outline of the present example.
LIBRARY LIGANDS corresponds to sets of binding compounds. That is, the apparatus for in silico screening 100 performs alignment between the target protein and a homologous protein using CE [reference literature (Shindyalov et al. Protein Engineering 1998; 11(9): 739-747)], a 3-dimensional structure alignment generating program, when a protein among the proteins detected by the homology search using the PSI-Blast [reference literature (Altschul et al. Nucleic Acids Res. 1997; 27(17): 3389-3402)] is a protein-ligand complex, and the apparatus superimposes the homologous protein to the target protein by a least square fitting method. The LIBRARY LIGANDS is a group obtained, when the Z-score resulting from the least square fitting is 3.7 or more, by converting the coordinate system of the binding ligands into it of the target protein, and picking out only the binding ligands.
In the present example, if the Z-score is less than 3.7, the ligand is not used as a binding compound. This is because the basis of this numerical value is “3.7 to 4.0-twilight zone where some similarities of biological significance can be seen” according to the CE, and thus, 3.7 or more is accepted. According to the present example, the lowest homology of the homology search is defined as a homology of 0.1% or more. That is, most of the similar proteins detected by the homology search are superimposed using the CE.
The following describes in detail a method to make an FP band with reference to
The present example is not aimed at a precise interpretation of FPs, but in order to avoid confusion, the following terms will be unified. When a molecule is expressed as a vector that has, as an element, a combination taking into consideration of the atom form (or atom type), the order of atomic bonds and the like, the element of the vector is called “FP,” while the vector is called an “FP vector.” According to the present example, information more than or equal to simple description of the character string of atom form may be merely added to the element of vector, but the additional information may also be interpreted as one feature expressing the molecule. If the information means to be an element of the vector, the information is also referred to as “FP,” while a vector having the FP as an element is distinguished from the typical “FP vector,” and is referred to as an “FP band.” This implies that in addition, the “FP band” has a nature such as that the respective elements in the “FP vector” are atom type. Here,
In the ChooseLD method, which is the present example, the purpose of the method is to predict an unknown ligand structure which docks so as to satisfy the minimization of free energy, using a protein-ligand complex structure of which interaction is already known, and in order to achieve this purpose, the FP (fingerprint) has been defined, which is a component maintaining the partial free energy of binding from a ligand known to have interaction. The substance name of the chemical substance presented in
That is, the parts surrounded by the line along the bonding of the compound in
[Calculation of Similarity between Compounds Based on the Tanimoto Coefficient]
The method for calculating similarity between compounds based on the Tanimoto coefficient will be explained in the following. Here,
According to the present example, the Tanimoto coefficient (hereinafter, referred to as Tc) was introduced so as to calculate similarity between compounds [reference literature (J. Chem. Inf. Comput. Sci. 2000; 40: 163-166)]. In general, the Tc is a result of numerical conversion of the degree of similarity of vectors consisted of two bits, that is, 0 or 1. As shown in
The Tc was calculated by the following mathematical formula. Here, when the corresponding bits of both vectors are all on, 1 is added to a. If the bit of only one of the both vectors is on, 1 is added to b or c. That is, when the bits of both vectors are off, nothing is added to d, and d is taken out of consideration in the Tc calculation. For example, between the two bit columns shown in
According to the present example, the FP bands were defined such that when any two FP bands derived from low molecular compounds, which are obtained from a set of low molecular compounds belonging to the LIBRARY LIGANDS of binding compounds and form a set, are compared, the Tanimoto coefficient (Tc) must be 0.08 or more. In other words, the term a in the above mathematical formula is the number of FPs existing in each FP bands. The terms b and c are the numbers of FPs existing in the other FP bands.
To explain the same matter using the concept of assembly, if the assembly of FP possessed by the respective FP bands are designated as A and B, the following formula may be obtained:
wherein number_of_fp(assembly) is the number of FP that belong to the set “assembly.”
An FP library corresponds to a set of binding compounds, and is the source of acquisition of the description of atom type of FPs that are used in the ChooseLD method of the present example, as well as a ligand group serving as the origin of the atomic coordinates registered on the established FPs. Conventionally, the FP library is collected from family proteins that have been detected by a homology search or the like using the primary structure, that is, the amino acid sequence, of a target protein as a query; however, the object is not limited to family proteins, and even ligands, or proteins, peptides and the like, which are considered to bind to a target site, such as an active site of the target protein, may also be added if necessary.
According to the ChooseLD method of the present example, the FP library was established mainly from family proteins. When proteins that have been detected by a homology search using the PSI-Blast [reference literature (Nucleic Acids Res. 1997; 27: 3398-3402)], of which 3-dimensional coordinate structures are already known, and which is protein-ligand complexes, 3-dimensional structure alignment between the target protein and a family protein is performed using CE [reference literature (Protein Engineering 1998; 11: 739-747)]. The CE is a program equipped with an algorithm which performs alignment of two proteins using similar parts in terms of the 3-dimensional structure, without depending on the similarity in the amino acid sequence, and other programs for 3-dimensional structure alignment include Dali [reference literature (J. Mol. Biol. 1993; 233: 123-138), TOPOFIT[reference literature (Protein Science 2004; 13: 1865-1874)], and the like. To mention the main differences of these programs, it is possible with the CE to obtain results quickly, as a result of ameliorations such as performing superimposition of amino acid sequences sequentially from the N-terminal, but when the subject proteins have domain swapping or the like, it is difficult to perform alignment with high accuracy. In that case, high accuracy may be obtained by using Dali or the like, which performs alignment without depending on the order of the amino acid sequence.
The ChooseLD method of the present example used CE, which takes shorter calculation time, since the family proteins detected by the PSI-Blast were mainly superimposed. The family proteins were superimposed to the target protein based on least square fitting, using the alignments output by the CE. When the Z-score of the alignment from CE was 3.7 or more, the binding ligands were converted based on the coordinate system of the target protein, and thereby only the binding ligands were picked out. That is, according to the present example, only the proteins that are structurally similar to the target protein, will be used as family proteins.
The FP band is a vector of FP correlated with one or multiple atomic coordinates as additional information, and is obtained from a set of binding ligands that belong to the FP library. The binding ligands that belong to the obtained set (FP library) include the coordinates in the coordinate system of a target protein; the atom type expressed as Sybyl atom type; and the information of bonding rules such as a single bond, a double bond and the bonding in aromatic rings. Here,
The “intra-molecule FP” (rectangle in
In
The “modified FP” (diamond shape in
Thus, according to the present example, from MDL Comprehensive Medicinal Chemistry (MDL CMC) Library, which is a database for 3-dimensional coordinates of physicochemically existing pharmaceuticals (corresponding to the medicinal chemical compound DB 106c), a drug-like FP vector is produced and compared with the FP vector part of the FP band obtained from the FP library, so that the description of the atom type of FP that is included in both parties, remains in the target protein-specific FP band. In the process of calculation using an arbitrary FP (fingerprint), when a pharmaceutical database or a compound database is used to take out compound information, a pharmaceutical database or a compound database specialized in drug absorption, drug metabolism, drug excretion or drug toxicity, which has been arranged in advance using an index for drug absorption, drug metabolism, drug excretion, drug toxicity or the like in this foundational database and using fundamental data units which use fingerprints (FP) or the like as the basic for arrangement, is produced, and the same series of operations are carried out.
Specifically, when a intersection between the FP vectors derived from the ligand library and the FP vectors derived from the pharmaceutical library is determined, only the FPs present in the medicinal chemical compound DB 106c are registered on the FP bands, and the FPs not present in the medicinal chemical compound DB 106c are neglected in the present example, thereby the fingerprint set of binding compound 106b being established.
As shown in
Thus, the explanation of the method for establishing FP bands according to the present example is completed. According to the present example, it is allowed that one atom belongs to multiple FPs in the overall process of establishing FP bands. If the FPs obtained from the FP bands has already been registered, the coordinates of FPs are added. If FPs are not present, new FPs are added to the FP bands, and coordinates are added. It is allowed that one atom belongs to multiple FPs. The same operation is carried out for the candidate ligands that serve as the target of docking (docked ligands), and FP bands derived from candidate ligands (fp bands of docked ligand) are produced.
The FP band is correlated with the coordinates of an atomic set, and when two FP bands are compared, not only the atom type is used, but also the correlated coordinates are used. That is, the alignment of FP bands means that a comparison of the FP band obtained from a candidate ligand and an FP band obtained from the FP library of the binding ligands is carried out. The comparison is carried out through the following processes of (1) and (2).
(1) Comparison of Complete Correspondence of Character Strings for Description of Atom Type Constituting FP
In regard to the FP vector (bit column (1)) derived from the FP band obtained from a candidate ligand to be docked, and the FP vector (bit column (2)) derived from the FP band obtained from an FP library including binding compounds, the existence or nonexistence of FP is expressed in each bit, and a combination in which the bits in both vectors are on is selected (see
(2) Process of Defining Correspondence Relationship Between Coordinate Vectors for Atoms Registered on Selected FP
Carrying out these two processes (1) and (2) constitutes the alignment of FP according to the present example. Furthermore, the phrase “FP alignment is different” means that at least one of the following is different:
1. the total number of FPs having on for both of the two bits,
2. the type of FP to be correlated, and
3. the correspondence relationship of coordinates within FPs.
That is, the phrase “FP alignment is changed” means that at least one among these is changed. The significance of “at least one” is due to that when the atom type of FP changes, the correspondence relationship between the coordinates of FPs before change is lost, and because the correspondence relationship is redefined for FPs after change, the correspondence relationship of the coordinates is also necessarily changed.
The interaction score FPAScore is explained below in detail in the present example. The FPAScore (fingerprint alignment score) is defined in the present example such that a higher FPAScore better satisfies the structure of family protein-binding ligand complex of which interaction is already known, based on the assumption of the ChooseLD method saying that the FP is a set of partial binding free energy. The FPAScore evaluates a target protein-candidate ligand complex structure by considering the accuracy of the superimposition of FPs, the number of FPs used in alignment, the crowded degree of FP, and the protein-ligand complex interaction at the same time. According to the present example, the optimal target protein-candidate ligand complex was predicted by searching for the optimal alignment of the FP bands obtained by the operation described above.
That is, according to the present example, the interaction score, FPAScore, is defined by the following mathematical formula. Here, aligned_fp means FPs that have been aligned; fp_rmsd means the rmsd calculated by least square fitting using the alignment; and molecule means the coordinates of the complex after the candidate ligand has docked to the target protein. Each of the terms will be explained below in detail.
<1. BaseScore(fp_rmsd, aligned_fp) Term>
This term is defined as a function taking into consideration of the degree of consistency and crowded degree of FP, that is, a function for evaluating the strength of using already known FP, and can be represented by the following mathematical formula:
wherein In is natural log (natural logarithm); and k1 is a scaling factor for determining how strict the accuracy of superimposition of FPs should be. If the rmsd for the superimposition of aligned FPs is large, the denominator is increased, resulting in a smaller BaseScore. This implies exclusion of the cases where, even if the degree of consistency of FP is large, the rmsd that represents the accuracy of overlapping of the FP atomic coordinates registered on the FP is large (bad). According to the present example, k1 was set at 4.0; fp_rmsd is the rmsd calculated by least square fitting using the alignment; and aligned_fp is the correspondence relationship of FPs in that case, that is, aligned FPs.
Here, in the mathematical formula described above, the term raw_score(aligned_fp) is represented by the following formula. Here, assigned_score(i) is the score given in advance to the FP aligned on the ith order. n is the total number of aligned FPs. The term aligned FPs means the atom type and atomic coordinate set for a target protein-specific FP band (see, “alignment of FP bands” and
Here, in the mathematical formula described above, assigned_score(i) is the score given in advance to the FP aligned on the ith order, and is represented by the following mathematical formula. This score is given as follows, to the FP obtained from a ligand library such as CElib.
Here, total_atom(i) in the formula represents the number of atomic coordinates constituting the FP; and Case1_S, Case2_S and Case3_S (not described above) are scores that are given in advance to the atoms constituting the FP, and are respectively used in the following cases.
Case1_S is a score that is given to each atom when the “Intra-molecule FP” as described above has been constructed. If the value is not particularly designated, 5.0 is used. For example, when the search is successful, the score Case1_S (the default of 5.0 was used) is given to each atom that constitutes the FP, so that an FP constituted of four atoms is given 20.0 points, while an FP constituted of three atoms is given 15.0 points.
Next, Case2_S will be explained. This is a score that is given to each atom when the “Modified FP” as described above has been established. If the value is not particularly designated, 2.5 is used.
Finally, Case3_S is an arbitrary scalar value given when there is a possibility of the presence of atoms based on biochemical information or calculation of energy (“Circle” or the like). The Case3_S is not used in the present example, and is not used in the calculation for verifying the docking performance (binding mode prediction performance) using a benchmark set, and the in silico screening performance.
According to the present example, the natural logarithmic value of the crowded degree of atoms belonging to the FP library was added to the score, in addition to the score of the sum of Case1_S, Case2_S and Case3_S. This involves adding the natural logarithm of the number of atoms (n_neighbor_atom(i)) of an atomic coordinate set that belongs to another FP, which are within 1.0 Å from the atoms of an atomic coordinate set that belongs to the FP, to the score of FP, and thus this term can be said to be a term which gives preference to a dense FP. That is, in regard to the case1 and case2, the natural logarithm of the number of atoms (Neighbor_atom) of an atomic coordinate set, which are within a distance of dist (default 1.0 Å) between the coordinates that belong to the same FP, is added to the score for the coordinates of FP.
<2. fp_volume(molecule) Term>
This term is a function for evaluating the complex structure, after a candidate ligand has been docked to a target protein using aligned FPs. That is, it is a function for evaluating the number of the molecular coordinates of a candidate ligand after docking which occupy a space formed from the FPs obtained from the bound ligands of an FP library (that is, how much space formed from the FPs derived from an FP library is filled), and the collision with the target protein. The term is represented by the following mathematical formula. Here, the term molecule represents the atomic coordinates of the candidate ligand after docking.
Here, nafp (Number of Ligand Atom covering Fingerprint) is the number of the coordinates of a molecule occupying in a region of proper grid that is produced using the atoms of small molecules constituting the LIBRARY LIGAND, that is, the number of the coordinates of candidate ligands occupying in a region of proper grid produced using the atoms of binding ligands that constitute the FP library. nafp represents how much a molecule of a candidate ligand satisfies an FP (fingerprint) of fixing coordinates. nap(Number of Ligand Atom covering Protein) is the number of the coordinates of a molecule (molecule of a candidate ligand after docking) falling into a region of proper grid produced from the atomic coordinates of the target protein, and represents the collision state with the constituent atoms of the target protein.
k2 and k3 are each a coefficient, and if the value is not particularly designated (default), 1.0 is used, respectively. However, they can be respectively modified depending on the biochemical information of the target protein and the degree of induced fit. That is, k2 is a constant which makes a point of attaching importance to the region occupied a space of a group of binding ligands of a family protein to be identical or similar to the target protein, and if the coefficient is increased, a larger ligand may obtain a larger score. The k2 value is possible to be grouped, even based on the size of the binging region of the target protein. k3 is a tolerance factor for the collision of a candidate ligand in a region occupied by the target protein, and is a coefficient which makes a point of attaching importance to the collision between the atoms of a candidate ligand and the atoms of the target protein. If the value of k3 is increased, the collision between the target protein and the candidate ligand may not be allowed. In regard to k3, there is a possibility of grouping the flexibility at the active site of a protein, and the like.
As shown in
<3. fp_contact_surface(molecule) Term>
The term of fp_contact_surface is a function taking into consideration of the contacting degree of the atomic coordinates with a target protein in the after-docking structure of a candidate ligand, and the degree of attribution of the coordinates to the FP library. The term is represented by the following mathematical formula. Here, the term molecule means the atomic coordinates of the candidate ligand after docking; atom(i) means the ith atomic coordinates after docking; and n means the number of atoms. That is, this formula is calculated with respect to the complex structure obtained after the docking of a candidate ligand to a target protein, as in the case of the mathematical formula for fp_volume described above, and is a function taking into consideration of the contacting degree of the atomic coordinates of the candidate ligand with the surface of the target protein, and the degree of attribution of the atomic coordinates of the candidate ligand to the FP atoms obtained from the FP library.
In the above formula, density_of_atom is represented by the following mathematical formula. Here, nfpcontact is the number of atoms of the target protein that get in contact with the atomic coordinates of an FP that belongs to the FP library, at 3.8 Å or less if there is no particular designation (at default); and natom is the number of atoms of an FP library-derived binding ligand compound, that fall into a same lattice point. In this case, a plurality of ligand molecules of the same atom type may be present, and even in the case of the same ligand molecule with different ID codes for the PDB, the present examples include all of such molecules. In is a variable used in the case of having particularly important biochemical information, and if the value is not particularly designated (at default), 0 is used. It is envisaged that the variable be used when an FP (Modified FP, Creative FP, or the like) that does not depend on the family protein is input as a result of the 3D-1D score value of CIRCLE [reference literature (Terashi G, Takeda-Shitaka M, Kanou K, Iwadate M, Takaya D, Hosoi A, Ohta K, Umeyama H Proteins 2007; 69(S8): 98-107)] or the like. The following mathematical formula becomes 0 when atomic coordinates x of the ligand do not fall into the FP obtained from the FP library (not in contact at a distance of 3.8 Å or less). Otherwise, the score is calculated according to the formula given above. density_of_atom(x)=0 or ln(nhpcontact+natom+hi)
In the mathematical formula for the fp_contact_surface, total_dense_of_atom(molecule) is represented by the following mathematical formula. Here, the term total is the number of atoms of the candidate ligand molecule. sort_density_of_atom is the result of arranging the distribution of the scalar values of density_of_atom in order of great numerical value, starting from larger values. That is, if the molecule of the candidate ligand is large, the total_dense_of_atom is increased.
This is the end of the explanation of the interaction score FPAScore in the present example.
Next, in order to maximize the FPAScore function defined as described above, the method for performing simulated annealing (hereinafter, referred to as “SA”) according to the present example will be explained, while referring to
First of all, one cycle of steps 1. to 3. from the conformation change of the candidate ligand to the obtainment of a docking structure resulting in the maximum FPAScore for the structure, will be described.
First, the conformation is changed by randomly modifying the rotatable dihedral angle that is present in the candidate ligand to be docked (docked ligand). According to the present example, values of the van der Waals radius of the candidate ligand atoms obtained by referring to AMBER99 was used.
The candidate ligand with changed conformation is used as a rigid body, to dock to the ligand binding site (the binding site). The following operation of translation and rotation is carried out for a single conformation generated in the step 1.
First, ten atom types of FP are randomly selected from an FP band described above. If there are fewer than 10, half the maximum number of the size of the FP vector in the FP band was used. The atomic coordinate sets registered for the selected FPs are randomly selected. These are used as aligned FPs, and for the correspondence relationship, least square fitting is performed to calculate the rmsd between the atomic coordinates of the candidate ligand and the atomic coordinates derived from the FP library. The translation and rotation matrices thus obtained are operated with respect to the target ligand, to thereby obtain one target protein-candidate ligand complex structure. The FPAScore is then calculated using the aligned FPs, the rmsd, and the target protein-candidate ligand complex structure. Here,
As shown in
The change in state caused by simulated annealing is a process of modifying, increasing or decreasing the FP. That is, this change in state is carried out by repeating a process of selecting the coordinates belonging to the FP, from the FPs derived from the candidate ligands to be docked and the ligand library-derived FPs. Simulated annealing changes alignment by increasing by one or maintaining the atom type of FP, with respect to the aligned FPs, and performing modification or addition of the correspondence relationship of the atomic coordinate set registered with the FP, and reduction of FP, and thus maximizes the FPAScore. When one or more atomic coordinate sets are selected from one FP, or when the FPA score is decreased regardless of the presence of coordinates, Metropolis decision is carried out, and if the score is accepted, the state is maintained. That is, Metropolis decision is carried out in the process of SA, and if the score of this round is larger than the score of the previous round, the score is accepted. Otherwise, the adopting probability, Paccept, is calculated based on the following mathematical formula. In this case, if uniform random numbers in the range of 0<=r<=1 are generated simultaneously, and r<Paccept, even a low score is accepted. In the present example, T (temperature) was started from 30.0 K and decreased down to 0.07 K. Thus, the maximum value of the FPAScore is calculated for one conformation.
ΔScore=Score(after)−Score(before)
Paccept=exp(ΔScore/T)
Using the FP bands thus obtained, the FPAScore is optimized according to the SA method. In the present example, SA was performed 10,000 times.
The maximum FPAScore obtained for one conformation in the step 2, is stored in the structure pool of the storage unit, together with the structure.
The processing of one cycle for the maximization of FPAScore for one conformation is as discussed above.
According to the present example, since it has been set up to perform the change of conformation 1000 times, if the process is performed fewer than 1000 times, it should be controlled such that the steps 1. to 3. as described above are carried out again. A larger number of generated conformation may lead to a possibility of obtaining better results, but since it is needed to perform docking calculation for a large number of low molecular compounds that are included in the virtual compound database, the calculation must be stopped after a limited number of cycles. Thus, although the number may depend on the degree of freedom of rotation of the compound, this number of runs was sufficient in the preliminary calculation of the present example.
When the maximum value of the interaction score, FPAScore, has been calculated for each of the 1000 conformations generated, the process of the repetition of cycle is terminated, and the maximum FPAScores of the 1000 conformations stored in the structure pool are compared. Thus, a predictive structure of the target protein-candidate ligand complex (Protein-Ligand complex), which is the docking structure having the largest score, is output as the optimal conformation for the candidate ligand.
The “Results and Discussion (Materials)” in regard to the present example will be described in the following. In the construction of the FP library described in the present example, development was achieved by combining shell script languages such as Perl (http://www.perl.com/), Ruby (http://www.ruby-lang.org/), and bash (http://www.gnu.org/software/bash), and so on. The algorithm described as the method of the present example, for searching for a protein-ligand complex structure that is likely to maximize the FPAScore by changing the conformation of candidate ligands to be docked, was written in C/C++. As for the compiler, Intel (registered trademark) C++ Compiler 10.0 was used. To mention the configuration of the computing machine used, up to 200 non-shared memory-type computing machine clusters having different configurations of computing machines, such as Red Hat Linux or Scientific Linux for the OS; Pentium 4, Core2Duo or Opteron for the CPU; and 512 MB, 1024 MB or 2048 MB for the memory, were used. To mention about the calculation time for reference, when in silico screening of 20,000 compounds of the MDL Available Chemicals Directory (MDL ACD) Library (Symyx Technologies, Inc. Corporate Address: 3100 Central Expressway, Santa Clara, Calif. 95051) is carried out for the kinase domain of EGFR that will be described later, in regard to the time for implementation of calculation per CPU for the docking of one candidate ligand to one target protein, the median value was 10.2 minutes, and the average value was 18.6 minutes. The minimum calculation time was 4.8 minutes, and the maximum calculation time was 1077 minutes. Here,
As shown in the distribution of calculation time in the in silico screening of EGFR of
According to the present example, protein-ligand complex structures were acquired from the Protein Data Bank [reference literature (Nucleic Acids Res. 2003; 31: 489-491)], to test the docking performance of the ChooseLD. A bench mark used in the present example is explained with reference to
As shown in
85 PDB structures; 1G9V 1GKC 1GM8 1GPK 1HNN 1HP0 1HQ2 1HVY 1HWI 1HWW 1IA1 1IG3 1J3J 1JD0 1JJE 1JLA 1K3U 1KE5 1KZK 1L2S 1L7F 1LPZ 1LRH 1M2Z 1MEH 1MMV 1MZC 1N1M 1N2J 1N2V 1N46 1NAV 1OF1 1OF6 1OPK 1OQ5 1OWE 1OYT 1P2Y 1P62 1PMN 1Q1G 1Q41 1Q4G 1R1H 1R55 1R58 1R9O 1S19 1S3V 1SG0 1SJ0 1SQ5 1SQN 1T40 1T46 1T9B 1TOW 1TT1 1TZ8 1U1C 1U4D 1UML 1UNL 1UOU 1V0P 1V48 1V4S 1VCJ 1W1P 1W2G 1X8X 1XM6 1XOQ 1XOZ 1Y6B 1YGC 1YQY 1YV3 1YVF 1YWR 1Z95 2BM2 2BR1 2BSM
133 PDB structures; 1AAQ 1ABE 1ACJ 1ACK 1ACM 1ACO 1AEC 1AHA 1APT 1ASE 1ATL 1AZM 1BAF 1BBP 1BLH 1BMA 1BYB 1CBS 1CBX 1CDG 1CIL 1COM 1COY 1CPS 1CTR 1DBB 1DBJ 1DID 1DIE 1DR1 1DWD 1EAP 1EED 1EPB 1ETA 1ETR 1FEN 1FKG 1FKI 1FRP 1GHB 1GLP 1GLQ 1HDC 1HDY 1HEF 1HFC 1HRI 1HSL 1HYT 1ICN 1IDA 1IGJ 1IMB 1IVE 1LAH 1LCP 1LDM 1LIC 1LMO 1LNA 1LPM 1LST 1MCR 1MDR 1MMQ 1MRG 1MRK 1MUP 1NCO 1NIS 1PBD 1PHA 1PHD 1PHG 1POC 1RDS 1RNE 1ROB 1SLT 1SNC 1SRJ 1STP 1TDB 1TKA 1TMN 1TNG 1TNI 1TNL 1TPH 1TPP 1TRK 1TYL 1UKZ 1ULB 1WAP 1XID 1XIE 2ADA 2AK3 2CGR 2CHT 2CMD 2CTC 2DBL 2 GBP 2LGS 2MCP 2MTH 2PHH 2PK4 2PLV 2R07 2SIM 2YHX 3AAH 3CLA 3CPA 3GCH 3HVT 3PTB 3TPI 4CTS 4DFR 4EST 4FAB 4PHV 5P2P 6ABP 6RNT 6RSA 7TIM 8GCH
The two circles in
That is, the breakdown of these benchmark sets are such that 85 benchmark sets are collected by selecting target proteins that serve as targets of drug discovery among those registered on the PDB after Aug. 11, 2000, and finally manually selecting those judged to be druggable ligands according to the determination criteria such as whether the ligand to be docked also has heteroatoms, a hydrogen bond donor, a hydrogen bond acceptor, and a hydrophobic group and the like respectively, or whether the Lipinski's Rule of Five is satisfied. Furthermore, the Riken Benchmark [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)] uses the benchmark of GOLD [reference literature (Gareth et al. J. Mol. Biol. 1997; 267: 727-748)]. This benchmark uses the target proteins registered on the PDB before August 2000, as mentioned above. However, since this benchmark compares AutoDock or DOCK in addition to the GOLD, it was conceived that comparing with the results of this benchmark would be very useful to find out the placement of the ChooseLD in the docking software program. There is no overlapping in the PDB ID for the two benchmarks. Thus, setup of default parameters for the ChooseLD was performed with 85 sets, and the performance evaluation of ChooseLD based on the parameters was performed with the Riken Benchmark. Here,
The years of registration in PDB in these benchmark sets are distributed as shown in
As discussed above, the k1 value of the FPAScore is a coefficient that controls the degree of consistency between the atomic coordinates registered on the FP library and the atomic coordinates of the candidate ligand. The k1 value is possible to be changed in accordance with the target, but upon considering the case of performing in silico screening with respect to a large quantity of target proteins, or the case of being used by other researchers, determining the optimal parameter serves as one of data for judgment of employing the technique. Therefore, the optimal value of k1 of the FPAScore function was determined as the optimal value in the docking performance testing of the ChooseLD method using 85 sets [reference literature (Michael et al. J. Med. Chem. 2007; 50: 726-741)].
The 85 sets collect many of drug-like target proteins, and are therefore subjected to the performance evaluation of GOLD [reference literature (Gareth et al. J. Mol. Biol. 1997; 267: 727-748)]. This is because the 85 sets do not overlap with the 133 sets in regard with the PDBID, that is, the 85 sets do not use the information of the 133 sets in this process of optimization. Furthermore, the 85 sets are subjected only to the benchmarking of GOLD, and the success rate of GOLD was 75.2±0.4% in the case of docking the structure of Corina to the target protein; 80.5±0.5% in the case of defining the binding site as 6 Å by using the ligand structure of the experimental structure; 86.9±0.3% in the case of defining the binding site as 4 Å by using the ligand structure of the experimental structure; and 98.6±0.1% in the case of including the water of crystallization that is present in the X-ray crystallographic structure [reference literature (J. Med. Chem. 2007; 50: 726-741)]. That is, in the case of performing evaluation only by GOLD, the placement of ChooseLD in the existing docking software programs can be not known, and therefore, the 85 sets were used in the optimization of the k1 value. Here, the optimization of k1 described with the FPA score was performed.
The docking conditions were as described below. Similarly to other benchmarks, the ligand binding site was defined because it is advantageous to narrow the scope of search for the ligand binding site, or the like. That is, the benchmark of the docking performance testing of ChooseLD is not about predicting the amino acid residues at the ligand binding site of a protein, but is about testing the accuracy of the conformation of a candidate ligand at the ligand binding site. The size of the binding site was set at 4 Å from each atom of the ligand of the correct structure of the protein-ligand complex. Furthermore, in order to test the influence of the similarity to candidate ligands of a ligand included in the FP library, the Tc with the ligand that belongs to the FP library was calculated, and thus the ligands included in the FP library were limited. The Tanimoto coefficient between the ligands to be docked and the ligands belonging to the LIBRARY LIGANDS was calculated using a drug like fingerprint (FP). The Tc range for the FP bands was set at 0.96 for the maximum value, 0.76, 0.56, and 0.08 as the minimum value.
In regard to the initial conformation, a conformation obtained by randomly rotating the dihedral angle, and a structure having the largest rmsd and sufficiently separating from the binding site among the initial ligands, was used. Using that ligand, 10 times of docking was performed for one target. Among the 85 sets, 84 sets could be docked. Here,
k1 in the table of
As a result, when k1=4.0, the average value had the highest success rate, as high as 62.1%, and then the results were good in order of 6.0, 3.0, 5.0, and 2.0. When the k1 value was 1.0, the success rate was poorer than the success rates of other k1 values in regard to all of the Tc ranges. The case of the k1 value being 4.0 and the case of the k1 value being 6.0 were almost equal, but the case of the k1 value being 4.0, for which the average value was slightly more excellent, was used as the optimum value. Thus, this value was used for the benchmarking of the 133 species [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)].
Here,
As explained above, at 2.5 Å, approximately 70% was successful, but it was shown that in order to obtain a success rate equivalent to the success rate of the binding mode prediction, 75.2% [reference literature (Michael et al. J. Med. Chem. 2007; 50: 726-741)], in the case of not using the ligand generated by Corina, which was one of the ratio of predictive success of GOLD in the 85 set benchmarks, that is, the conformation of an experimental structure, a Tc range of 3.2 to 3.3 at 0.56 to 0.08, a Tc range of 2.8 at 0.76 to 0.08, and a Tc range of 2.6 to 2.7 at 0.96 to 0.08, should be used. In addition, when the general covalent-bond length of 1.5 Å is defined as successful, about 40% of the prediction is considered successful. Within 3.5 Å, which is close to the limit value of van der Waals interaction, about 80% of prediction is considered successful. Here,
Here, the Docking method means the name of each docking software program (Docking soft). ChooseLD performs a performance evaluation on three Tc ranges. The values of GOLD GOLDScoreSTD, GOLDScoreLib, GOLD ChemScoreSTD, AutoDock, and DOCK are the average values of Corina and MINI, while the standard deviation with respect to the success rate of each docking software program is represented by an error bar.
As shown in the graph of
Here,
According to Onodera, et al. [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)], benchmarking is performed in a state close to the initial state of each docking software program as received. According to them, in regard to the target proteins, the protein-ligand complex used in the benchmarking of GOLD [reference literature (Gareth et al. J. Mol. Biol. 1997; 267: 727-748)] are total 116 species, obtained by removing the targets that could not dock with GOLD or DOCK, and the targets that could not generate 3-dimensional coordinates by Corina, from 133 species. The removed PDBIDs are 1TPH, 1TRK, 1XID, 4FAB, 6RSA, 1BBP, 1CTR, 1HYT, 1PHG, 1POC, 1SNC, 1TMN, 1CDG, 1DR1, 1LDM, 4CTS, 4EST (Virtual Screening [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)]).
For the parameters of the respective docking software programs, those parameters provided by the respective docking software programs are used, and the parameters are not optimized for targeting. It is conceived that if optimization of the parameters is carried out, the success rate would be definitely changed. However, the same also applies to the ChooseLD, and since the parameters k1, k2 and k3 values that are variable in accordance with the target protein are defined also in the ChooseLD method, there is still room for optimization. Thus, in the performance evaluation of ChooseLD, the values mentioned in the terms in the method, and the k1 value optimized with the 85 sets, that is, 4.0 were used.
Here, the docking conditions used by ChooseLD were defined as follows, with respect to each target.
1. Binding site
The binding site was defined, similarly to the case of conventional benchmarking [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)], as a sphere of atoms of the protein present in a distance of within a radius of 5.0 Å from each atom of the ligand of the native protein-ligand complex.
2. Conformation Change of Ligand
In the benchmarking of 133 sets, three ligands for docking are prepared. That is, the three include the ligand generated by Corina, ligand with the minimum energy structure (hereinafter, referred to as MINI) among the ligands generated by Corina, and ligand with the structure as registered on the PDB, and these are respectively subjected to 1000 predictions with respect to 116 target proteins (Virtual Screening [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)]). In the docking performance test of the ChooseLD method, the conformation was randomly changed, and a ligand having a structure with the largest rmsd from the ligand of the protein-ligand complex of experimental structure, and being sufficiently far from the ligand binding site defined above, was used. That is, it was equivalent to that 10 predictions were performed with respect to each of the 116 target proteins without using the experimental structure, and thus the predictions were performed under almost the same conditions as those for the benchmarking using the 133 sets. In this process, the hydrogen atoms were eliminated in case of a ligand having hydrogens in the molecule.
3. Scope of Tanimoto Coefficient with Ligand
In regard to the LIBRARY LIGAND used, the values 0.96, which is the maximum value, 0.76 and 0.56 in the range of Tc with a candidate ligand (docked ligand) correspond to the implications that a compound very similar to the docking ligand exists, that a compound similar to the docking ligand exists, and that a compound slightly similar to the docking ligand exists, respectively. Thus, for the scope of Tc, the values corresponding to 0.96 to 0.08 (that is, solution not included), 0.76 to 0.08, and 0.56 to 0.08 were used.
4. Onodera, et al. performs docking runs 1000 times with respect to one ligand [reference literature (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)]. In this performance evaluation of ChooseLD, the candidate ligand (docked ligand) was docked 10 times. That is, docking runs for 1160 times was performed for each Tc range, so that docking was performed 3480 times in all. It was considered successful if the rmsd between the docking structure predicted in one docking run, and the ligand of the native protein-ligand complex was 2.0 Å or better.
(Comparison with Gold)
Two examples with an rmsd of 2.0 or less, that could dock in the method of the present inventors, although GOLD failed in the Riken benchmarking, will be presented.
Conditions and values in
CYAN (central cyan parts in the figure): experimental (X-ray crystallographic analysis) structure (Answer) (same in the following figures)
GREEN (central deep green parts in the figure): predicted ligand Structure (same in the following figures)
The other: the binding site (same in the following figures).
That is,
Conditions and values in
That is,
Four examples of all the docking instances that existing docking software programs (GOLD and DOCK) have failed will be presented.
Here,
Conditions, and the like in
Conditions, and the like in
Conditions, and the like in
Conditions, and the like in
Glide [reference literature (J. Med. Chem. 2004; 47: 1739-1749)] is a flexible ligand docking software program, and performs a comparison of prediction accuracy with GOLD or the like, in the method of the present example.
As shown in
Here,
Conditions, and the like in
Conditions, and the like in
PDBID: below:
Conditions, and the like in
[Results and Discussion (4): Result rmsd Regard as Suceess]
The rmsd defined as successful is changed. At the time of comparing with Riken benchmarking, the rmsd between the predictive structure defined as successful and the correct structure was defined to be 2.0 Å, but numerical values other than that (1.5, 2.5, 3.0 and 3.5) may use as well. It is because when the rmsd is 3.5 Å, the predictive ligand structure is in most cases thought to be present in the neighborhood of the ligand binding site, and thus the structure can be used as the initial structure for the calculation of molecular dynamics or quantum chemistry.
As shown in
It is also implied that in regard to the Tc range of 0.96 to 0.08 (that is, when fairly similar ligands are present in the library), about 70% is present at the ligand binding site.
Here, the value of rmsd of 2.0 as the definition of successful docking is a basic evaluation criterion for various benchmarks and the like [reference literature (Gareth et al. J. Mol. Biol. 1997; 267: 727-748), (Michael et al. J. Med. Chem. 2007; 50: 726-741), (Onodera et al. J. Chem. Inf. Model. 2007; 47: 1609-1618)]. However, in fact, even in the cases of having an rmsd larger than 2.0, when optimization such as MD or QM is performed, the structure of a protein-ligand complex can be predicted with good accuracy. That is, presenting such an rmsd defined as successful serves as useful data when an investigator using MD or QM selects the initial structure for the optimization of the complex structure. That is, it is speculated that the data may be instructive for the supposition of the time taken for the optimization (shot time 100 ps, long time 1 ns and so on) or the scope of the ligand binding site to be optimized (5 Å, 10 Å and so on).
Discussion will be mainly described below, while referring to
In the present example, an assumption is made such that the conformation of FP, which is a part of the ligand, is most stable as a structure subjected to interaction. In regard to the interaction with the target protein of an FP according to the present example, an FP in a short distance from the protein is interpreted as enthalpic interaction including hydrophobic interaction, hydrogen bond interaction, and van der Waals interaction, while an FP in a long distance from the protein is interpreted as entropic interaction such as interaction with a solvent.
That is, according to the present example, it is assumed, finally using the conformation of an FP, that when a chemical compound takes the most stable docking structure as the basic, this is equivalent to the structure taking the most stable free energy for the protein-ligand interaction.
That is, an FP conformation extracted from a group of binding ligands derived from homologous proteins showing satisfactory the overlap, includes the free energy interaction with protein.
Here, when there is one target protein, a homologous protein having low homology or low e-value is used in order to gather many ligands, but a family protein in a broad sense, which is not bound by these functional classifications, is accompanied by slight structural changes and changes in the amino acid residue at the neighborhood of the binding site. Thus, the possibility that the FP extracted from the family protein does not satisfy the assumption of stabilizing free energy, should also be considered.
Therefore, this drawback needs to be compensated, and thus the FP extracted from a family protein is changed to an FP having more stable free energy with respect to the interaction with the target protein to be a “Modified FP,” which is accepted as a slightly less reliable FP. This copes with modifying the program of the 3D-1D method. When this Modified FP is produced for a target protein, this is equivalent to the considering of an undiscovered ligand having a new skeleton. Thus, there is a possibility of finding a compound having higher activity than the known ligands that are bound to the target protein.
Meanwhile, the FP in the common region of atomic interaction of a plurality of binding compounds attaches importance to the overlap of the family proteins binding to a plurality of similar chemical compounds. Thus, it is speculated that an FP to which experimental information has been more reflected than the “Creative FP,” which is given when there is a possibility of the atomic presence by the biochemical information or energy calculation can be obtained.
With respect to the structure of a protein-ligand complex predicted by conventional energy of classic physics, ranking the docking structures obtained by the method described above, and clustering is performed, using the known information of the structure of a protein-ligand complex [reference literature (Zhan et al. J. Med. Chem. 2004; 47: 337-344)]. This implies that the output by an existing docking software program is equivalent to outputting a structure which does not certainly reflect the experimental information.
Meanwhile, there have been attempts to optimize the predicted structure of a protein-ligand complex with MD using AMBER or CHARMM [reference literature (Case, A. D., Cheatham III, E. T., Darden, T., Gohlke, H., Luo, R., Merz Jr., M. K., Onufriev, A., Simmerling, C., Wang, B. Woods, J. R. The Amber Biomolecular Simulation Programs, J. Comput. Chem. 2005; 26: 1668-1688), (Brooks, R. B, Bruccoleri, E. R., Olafson, D. B., States, J. D., Swaminathan, S. & Karplus, M. CHARMM A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 1983; 4: 187-217)], or QM [reference literature (Kamiya K, Sugawara Y, Umeyama H. J. Comput. Chem. 2003; 24: 826-841)]. With these methods of MD or QM, performing docking or in silico screening involves an excessively large amount of calculation, it is needed to performing docking of a ligand at a location somehow close from the native structure of a protein-ligand complex, and use this as the initial structure.
Existing docking software programs are used to obtain the initial structure, but since these programs are based on the physical energy previously mentioned, optimization based on physical energy is repeated.
On the other hand, the technique according to the present example mainly uses known information of a protein-ligand complex, and thus it is possible to take into consideration of the viewpoint of bioinformatics and the viewpoint of physical energy. Since the bioinformatics information called the structure information of PDB used in the present example is accumulated every year, the protein-ligand complexes attracting the public interest from a medical viewpoint are studied by many researchers, and are also conceived to be useful for such optimization of predictive structures.
In regard to the performance in the case of docking various ligands with a druggable target protein, when the Tc range was 0.56 to 0.08, 0.76 to 0.08, and 0.96 to 0.08, the probability of obtaining a structure graded Good was 40.1, 44.8, and 46.4%, respectively, and the probability of obtaining a structure graded Close was 53.2, 57.8, and 59.3%, respectively. It was shown that these performances were almost equivalent to the performances of existing docking software programs.
From the results of training calculation including compounds that are all druggable as the target protein and the ligand, when the conformations for the top ten interaction scores between a target protein and an arbitrary ligand are contemplated, the ligand structure including the solution in the range of 2.0 Å, which is said to give a good model for the solution, is found for at least one time in 83% of the entire target proteins (values down to the 10th rank, 0.96 to 0.08 of
On the other hand, when the conformations for the top ten interaction scores between a target protein and an arbitrary ligand are contemplated, the ligand structure including the solution in the range of 2.5 Å, which is said to give a model similar to the model that is good for the solution, is found for at least one ligand structure in 88% of the entire target proteins (values within the 10th rank, 0.96 to 0.08 of
From the results of training calculation including a druggable target protein, and various low molecular compounds as the ligand, when the conformations within the top ten interaction scores between a target protein and an arbitrary ligand are contemplated, the ligand structure including the solution in the range of 2.0 Å, which is said to give a good model for the solution, is found for at least one ligand structure in 65% of the entire target proteins (values within the 10th rank, 0.96 to 0.08 of
On the other hand, when the conformations within the top ten interaction scores between a target protein and an arbitrary ligand are contemplated, the ligand structure including the solution in the range of 2.5 Å, which is said to give a model similar to the model that is good for the solution, is found for at least one ligand structure in 76% of the entire target proteins (values down to within the 10th rank, 0.96 to 0.08 of
Traditionally, this interaction between a target protein and a low molecular compound from a library of virtual compounds has been calculated by a physical interaction function, but the present example differs from the conventional techniques in view of semiempirically calculating using the information of bioinformatics. Furthermore, the present example also has excellently high effects in terms of the success rate of structure prediction, as compared to the globally acknowledged docking software program, GOLD. Also, since the accumulation of information that is increasing every year leads up to obtaining better results for the interaction calculation of the semiempirical bioinformatics technique, the technique of the present example is highly useful and gives effects that are different from those of conventional techniques.
According to the present example, the conformations obtained by scoring of the interactions between a target protein and various low molecular compounds, can be used in the DOCK, AutoDock or GOLD, which are docking programs involving the calculation formulas of molecular dynamics, and also can be used as the initial conformations of existing docking software programs such as Amber or Charm, which are molecular dynamics calculation programs. In this regard, since the initial conformations obtained in the present example can be conveniently obtained, and the accuracy of reproducing the experiments is high, useful results can be obtained by combination with other software programs.
Furthermore, the present example can be carried out by a method that does not require analyzing the 3-dimensional structure of the target protein and designating the active site in the process of calculation using an arbitrary FP (fingerprint) based on the CElib (FP (fingerprint) set extracted from collected ligands in the binding site), which is a database of various low molecular compounds bound to a family macromolecular protein set having a 3-dimensional structure similar to that of the target protein. In the conventional techniques, it was necessary to analyze the 3-dimensional structure of the target protein and to designate the active site in advance, in the docking calculation using existing docking software programs such as DOCK, AutoDock or GOLD, so that stable conformations would receive higher scores. However, compared to these, the present example has high effects that are different from those of the conventional techniques, and does not require designation of the active site through learning from the literature, thus being useful.
The method according to the present example succeeded in using the score which defines the information on the interaction of a protein-ligand complex that is already known from the viewpoint of bioinformatics, and reflecting this appropriately in docking simulation.
Attempts to enhance the accuracy in the docking simulation by adding the known information of protein-ligand complex to the output of existing docking software programs have been traditionally carried out. However, these methods depend on the intelligence and action of the researcher, and thus lack generality.
The technique according to the present example automatically performs the homology search and the superimposition of 3-dimensional structure. Furthermore, by using the scoring functions suggested by the present technique, docking structure could be obtained with high accuracy.
Based on these, the technique can be widely used without requiring too much human intervention by the researcher. The scoring functions suggested by the present technique can also be combined with existing docking software programs.
That is, the method according to the present example is extremely useful for the following three aspects.
The technique of the present example differs from conventional techniques in that the information of the interaction of protein-ligand complex that is already known from the viewpoint of bioinformatics can be appropriately reflected in the docking simulation. Furthermore, the technique of the present example exhibits excellent effects of being capable of automatically adding the parameters of the physical amount of suitable for the ligand and the constraint on distance, while taking into consideration of the complementarity with the receptor, and the conformation and atomic species of the known ligand. Of course, these aspects are very useful in the search of pharmaceuticals of new skeletons or similar skeletons, because the bioinformatics information on the interaction between target protein and ligands, which is important from neomedicinal and biological viewpoints, is accumulated every year. Furthermore, due to the coming of the age of tailor-made medicine, drug design of target proteins with rich experimental information will be required, and the method according to the present example is very useful.
As a second example, the optimization of k2 and k3 and in silico screening in the case that the target protein is Epidermal growth factor receptor (EGFR) will be described below. Here,
The k2 and k3 values in the score of FPAScore defined in the ChooseLD method of the first example have been defined as coefficients capable of optimizing in accordance with the target protein. Thus, verification was performed on whether the coefficients would function effectively with respect to the target protein. EGFR, which is an epidermal growth factor receptor family, serves as an important inhibitory target in cancer therapeutic [reference literature (J. Biol. Chem. 2002; 277: 46265-46272, Cell 2006; 125: 1137-1149)]. Therefore, in silico screening was performed using EGFR as the target protein.
For the amino acid sequence of EGFR, ACCESSION ID P00533 in NCBI [reference literature (Wheeler, D. L. et al. Nucleic Acids Res. 2007 Nov. 27)] was used, and the A chain of PDBID 1M17 was used as the template. The alignment used was presented in
Homology is about 99%, and the purpose is to append the lack in residues at the C-terminus of 1M17, rather than predicting the 3-dimensional structure. A model was constructed using the alignment with a homology modeling software, FAMS Ligand & Complex [reference literature (Proteins, 2005; Suppl 7: 122-127)]. Here,
The CIRCLE score [reference literature (Terashi, G. et al. Proteins, 2007)] was 71.367. The score of the template 1M17_A was 82.110. The CIRCLE score is a statistical potential obtained from the X-ray structure of a protein which belongs to the experimental structure coordinate database obtained by the PDB or the like. As the score increases more positive, the environment of the known protein X-ray structure is more satisfied, that is, it can be said that the model is closer to the X-ray structure.
The PDBID of the ligands used as the FP library obtained according to the ChooseLD method of the second example is as follows.
1AD5,1AGW,1BYG,1E9H,1FGI,1FIN,1FPU,1FVV,1GAG,1H1P,1H1Q,1H24,1H25,1H26,1H27,1I44,1IEP,1IR3,1JPA,1JQH,1K3A,1KSW,1M17,1M5 2,1MP8,1MQB,1OEC,1OGU,1OI9,1OIU,1OPJ,1OPK,1OPL,1PF8,1PKG,1QCF,1QMZ,1QPC,1QPD,1QPE,1QPJ,1ROP,1RQQ,1SM2,1SNU,1T46,1U4D,1U54,1U59,1UWH,1UWJ,1VYW,1XBB,1XBC,1XKK,1Y57,1Y6A,1Y6B,1YKR,1YOL,1YOM,1YVJ,1YWN,2B54,287A,2BDF,2BDJ,2BKZ,2BPM,2C0I,2C0O,2C0T,2C4G,2C5N,2C5O,2C5P,2C5T,2C5V,2C5X,2DQ7,2E2B,2ETM,2EVA,2EXM,2F4J,2FB8,2FGI,2FO0,2G1T,2G2F,2G2H,2G2I,2G9X,2GNF,2GNG,2GNH,2GNI,2GQG,2GS6,2GS7,2H8H,2HCK,2HEN,2HIW,2HK5,2HWO,2HWP,2HYY,2HZ0,2HZ4,2HZI,2HZN,2I0V,2I0Y,2I1M,2I40,2ITN,2ITO,2ITP,2ITQ,2ITT,2ITU,2ITV,2ITW,2ITX,2ITY,2ITZ,2IVS,2IVT,2IVU,2IVV,2IW6,2IW8,2IW9,2J0J,2J0K,2J0L,2J0M,2J5F,2J6M,2NRU,2NRY,2OF2,2OF4,2OFU,2OFV,2OG8,2OIQ,2OJ9,2OO8,2OSC,2OZO,2P0C,2P2H,2P2I,2P4I,2SRC,2UUE
(Acquisition of Compounds with known IC50)
Eleven 2-dimensional structures of compounds that competitively inhibit EGFR, with known IC50 value were obtained from the website of BIOMOL (http://www.biomol.com/).
An experiment was performed by changing the k2 value of the FPAScore in the range of 0.5 to 5.0, assuming the MDL Comprehensive Medicinal Chemistry (MDL CMC) Library (Symyx Technologies, Inc. Corporate Address: 3100 Central Expressway, Santa Clara, Calif. 95051) to be dummy compounds having inactivity toward EGFR, and examining whether a known inhibitor would rank highly as compared to those compounds.
The lower limit value of Tc for the ligands that may be included in the FP library was determined. By defining the lower limit value of Tc, compounds that are not similar to the docking ligand can be excluded. The Tc lower limit value with which the line chart of harvest rate would become optimal, was determined.
The results for the in silico screening of EGFR under the conditions of k2=2.0, k3 value=1.0, and Tc lower limit value=0.24, will be presented below. Among the one hundred high-ranking structures, 97 structures were ATP derivatives including phosphoric acids. Thus, refined selection as follows was carried out.
(1) Including molecules having a molecular weight of 350 or more and 800 or less, and excluding molecules including phosphorus from those;
(2) Excluding molecules that do not involve important hydrogen bond (nitrogen in the main chain of MET); and
(3) Excluding docking ligand molecules for which collision between a protein atom and a ligand atom occurs at 2.0 Å or less.
The results of applying the ChooseLD method according to the examples 1 and 2 to various target proteins, are shown below. These results require demonstration by experiment. A first example relates to the search for an inhibitor of dimmer formation of EGFR. A second example is related to the prediction of the complex structures of KRN633 and KRN951 with VEGF2, and the prediction of the protein-ligand complex structure requires demonstration by X-ray structural analysis. A third example is related to in silico screening against malaria, and requires demonstration by a binding experiment.
As shown in
(Prediction of Complex Structures of KRN633 and KRN951 with Vascular Endothelial Growth Factor Receptor-2 (VEGFR2)).
VEGFR2 is a kinase participating in vascularization, and is one of the proteins that are overexpressed at the development of cancer such as lung cancer. A compound which specifically inhibits this protein serves as a therapeutic drug for cancer. As the inhibitors, KRN633 [reference literature (Mol. Cancer. Ther. 2004; 3: 1639-1649)] and KRN951 [reference literature (Cancer Res. 2006; 66: 9134-9142)] is known. However, X-ray crystallographic structure analysis of these complex has not been achieved in December 2007. Thus, the complex structures of VEGFR2 and KRN633 and the complex of VEGFR2 and KRN951 were predicted. Here,
The 3-dimensional structure of VEGFR2 was constructed using PDBID 2P2H A chain as the reference protein. To describe the conditions for the docking of KRN633 and KRN951, the ligand used in the FP library was obtained by a homology search based on PSI-Blast, and the top ten compounds in the FP library used in docking were, for the KRN633, PDBID: 2 HZN_A, 1YWN_A, 2J5F_A, 2IVU_A, 2H8H_A, 2OH4_A, 1GAG_A, 1FPU_A, 2C0I_A, 2P4I_A, and for the KRN951, PDBID: 2I0V_A, 2 HZN_A, 2OH4_A, 1FGI_A, 1YWN_A, 1FPU_A, 2OFU_A, 2C0I_A, 2H8H_A, 2FGI_A.
In order to evaluate the reliability of the predictive complex structures of KRN633 and KRN951, a statistical success rate calculated from 133 sets was calculated using the Tc maximum value of the docking ligands included in the FP library.
That is, it is possible to statistically calculate the accuracy for predictive success at the time of applying the ChooseLD method by interpolating the Tc upper limit value in the graph. However, this statistical ratio of predictive success does not take into consideration of the 3-dimensional structure and amino acid sequence of the target protein. Among the ligands included in the FP library used in the docking of KRN633, the ligand with the maximum Tc was 0.45. Thus, when the ratio of predictive success in the case of 0.36 and 0.56 were interpolated, the success rate was 34.7%. Similarly, for KRN951, the ratio of predictive success was interpolated from the ratio of predictive success in the case of 0.24 and 0.36, was 24.3%. From the ratio of predictive success in the 133 sets, GOLD Score STD, which results in the highest ratio of predictive success, was 46.0%, DOCK was 21.1%, and AutoDock was 26.6%. For KRN633, an accuracy of ChooseLD can be predicted better than that of AutoDock but insufficient compared to that of GOLD. It was conceived that for KRN951 an accuracy of ChooseLD can be predicted similar to that of AutoDock. (Docking with Plasmodium falciparum enoyl acyl carrier protein reductase under condition of existing a low molecular compound (NAD))
Plasmodium falciparum enoyl acyl carrier protein is a pathogenic protein for malarial fever, and is a protein participating in the synthesis of lipids. However, since this pathway for lipid synthesis does not exist in human being, it is conceived that inhibiting the function of this protein leads to the treatment of malarial fever [reference literature (J. Biol. Chem. 2002; 277: 13106-13114)].
According to the present example, a ligand docking and method of in silico screening, the ChooseLD method based on bioinformatics using a method for optimizing newly defined FPAScores by simulated annealing, were developed. By performing optimization of the k1 value in the 85 sets, the optimum value was determined to be 4.0 by considering the use in the high-throughput screening or the like. In the case of taking this k1 value, when the fraction capable of predicting the experimental structure at an rmsd of 2.0 Å or less for the 133 sets was used as an index, the docking performance of the ChooseLD method of the present example was equal to that of GOLD which performs docking using existing classical physical function, and when the Tc upper limit value was low, the performance was the same as that of DOCK or AutoDock. This implies that the assumption that an FP obtained according to the FP construction method from the ligands included in the FP library established from ligands derived from a family protein, is the coordinates such as having decreasing free energy, was right.
However, since conventionally existing docking software programs are not necessarily be able to search for a structure with the minimum free energy, this means that there is still room for the conventional techniques to undergo improvement. Furthermore, for 133 set, from the viewpoint of the distribution of the PDBIDs successfully predicted, the ChooseLD method was compared with Glide, GOLD and FlexX, to calculate the degree of similarity of the distribution of PDBID based on Tc. As a result, the targets successfully predicted had originality, and it showed a possibility that the accuracy of in silico screening could be increase when the ChooseLD method of the present example and a conventional method are used in combination. Furthermore, as described above, in the second example, it was shown that the k2 and k3 values of the FPAScore are variables that can be optimized in accordance with the target protein, using the kinase domain of EGFR as the target protein. From these results, it was conceived that when the k1, k2 and k3 values of the FPAScore according to the ChooseLD method of the present second example are optimized in accordance with the target protein, in silico screening of more inhibitors and lead compounds can be achieved.
Third example will be described below. In the third example, in silico screening was performed for the purpose of developing an antagonist and an agonist of AMPKhomoGAMMA1 enzyme.
First, a homology search for the amino acid sequence of AMPKhomoGAMMA1 enzyme was performed, which is a target protein. Thus, modeling of AMPKhomoGAMMA1 was carried out using FAMS Ligand including the following ligands, of which the template is 2V9J_E (E chain of 2V9J) having 99.7% homology. Here,
Subsequently, ligands except 2V9J_E were superimposed to the coordinate system of 2V9J_E by fitting based on CE (structural superimposition between proteins without considering the kind of atoms). Screening of an antagonist and an agonist was attempted on AMP_E—1329 binding site, which does not depend on MG ions among the three ATP (AMP) binging sites of the 2V9J_E model.
Upon performing ChooseLD of the present example, amino acid residues within 18A from the binding site of AMP_E—1329 was cut out to be used as a receptor model for 2V9J_E. During the CHooseLD screening, the ligands and MG ions except the receptor binding site were included in the receptor as cofactors. Also, as the FPs of ChooseLD of the present example, three adenosines having a phosphate group (PO3) removed from the ligand molecules at the receptor binding site, and 1-(5-amino-4-carboxamide-1H-imidazole-yl)-ribose were used, but the phosphate group moiety is unsuitable for the functional group of the candidate compound. Therefore, phosphate was not used directly as FP, but the relative distance between the pair of His151 and His298 (His150 and His297 in the 2V9J_E of the template protein) that are hydrogen bonded with the oxygen atom of the phosphate group, is calculated, and the structural difference was calculated by GDT_TS (0.5 Å, 1.0 Å, 1.5 Å, and 2.0 Å). Thus, a ligand which is a residue pair having a GDT_TS of 70% or more (modifiable) and which exists within 3.0 Å (modifiable) from the residue pair, was extracted as HETATM from the 95% NR_PDB. In addition, not only two amino acid residues, but also three amino acid residues can also be assigned.
GDT_TS represents the fraction of residues that can be overlapped to the native structure at XÅ or less. As a result, 1061 ligands could be extracted. For these ligands, collision with the 2V9J_E receptor was checked, and thereby 18 ligands or parts of ligands were added to the FP, so as to perform screening of the CMC [reference literature
Screening result for CMC database were set up as follows: atomic collision between the receptor site and the ligand (2.0 Å one atom or less, 2.2 Å three atoms or less, 2.4 Å five atoms or less), the ligand molecular weight was 200 to 500, LogP of the ligand was −1 to 5, and the number of rings in the ligand, the number of hydrogen donor atoms, and the number of hydrogen acceptor atoms were 0 to 5, respectively. Here,
Here,
Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art which fairly fall within the basic teaching herein set forth.
For example, in the embodiment, the apparatus for in silico screening 100 performs various processes as a stand-alone device. However, the apparatus for in silico screening 100 can be configured to perform processes in response to request from a client terminal, which is a separate unit, and return the process results to the client terminal.
All the automatic processes explained in the present embodiment can be, entirely or partially, carried out manually. Similarly, all the manual processes explained in the present embodiment can be, entirely or partially, carried out automatically by a known method.
The process procedures, the control procedures, specific names, information including registration data for each process and various parameters such as search conditions, example of display screen, and database construction, which are mentioned in the description and drawings, can be changed as required unless otherwise specified.
The constituent elements of the apparatus for in silico screening 100 are merely conceptual and may not necessarily physically resemble the structures shown in the drawings., that is, the apparatus need not necessarily have the structure that is illustrated.
For example, the process functions performed by each device of the apparatus for in silico screening 100, especially the each process function performed by the control unit 102, can be entirely or partially performed by CPU and a computer program executed by the CPU or by a hardware using wired logic. The external system 200 can be configured as a web server or ASP server or the like. Hardware configurations of the external system 200 can be configured by the information processing device such as generally commercially available personal computer, workstation, and attachment devices thereof. Each functions of the external system 200 can be operated by such as CPU, disc device, memory device, input device, output device, and communication control device in the hardware configurations of the external system 200, and programs or the like that controls these devices.
The computer program, which are recorded on a recording medium to be described later, can be mechanically read by the apparatus for in silico screening 100 on demand. In other words, the storage unit 106 such as read-only memory (ROM) or hard disk (HD) stores the computer program that can work in cooperation with OS to issue commands to the CPU and cause the CPU to perform various processes. The computer program is first loaded to the random access memory (RAM), and the instruction is executed at a control unit in collaboration with the CPU. Alternatively, the computer program can be stored in any application program server such as the external system 200 connected to the apparatus for in silico screening 100 via the network 300, and can be fully or partially loaded on demand.
The computer-readable “recording medium” on which the computer program can be stored may be a “physical medium of portable type” such as flexible disk, magneto optical (MO) disk, ROM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), compact disk-read-only memory (CD-ROM), digital versatile disk (DVD), or a “communication medium” that stores the computer program for a short term such as communication channels or carrier waves that transmit the computer program over the network 300 such as local area network (LAN), wide area network (WAN), and the Internet.
“Computer program” refers to a data processing method written in any computer language and written method, and can have software codes and binary codes in any format. The “computer program” can be not only a single component but also a dispersed construction in the form of a plurality of modules or libraries, or can perform various functions in collaboration with a different program such as the OS. In the each device according to the embodiment, any known configurations for reading the recording medium, any known process procedure for reading or the following installing the computer program can use any known configuration and procedure.
The storage unit 106 is a memory device such as RAM, ROM, and a fixed disk device such as HD or flexible disk, optical disk, and stores therein various programs, tables, databases, and web page files required for various processes and provided by websites.
The apparatus for in silico screening 100 can also be connected to information processing device such as any existing personal computer, workstation, etc. and can be operated by executing software (that includes computer program, data, etc.) that implements the method according to the present invention in the personal computer or workstation.
The distribution and integration of the apparatus for in silico screening 100 are not limited to those illustrated in the figures. The device as a whole or in parts can be functionally or physically distributed or integrated in an arbitrary unit according to various loads.
The information of which compound would significantly interact with a target macromolecular protein and would be docking to the protein, is an important factor in the development of new pharmaceutical products, and tailor-made medicine means developing pharmaceutical products that conventionally do not work, in correspondence with substitution of at least one amino acid residue. Therefore, the information of compounds bound to target macromolecular proteins is richer in the number of compounds for which experimental determination has been completed, and development of new drugs is more highly accelerated. Thus, the apparatus for in silico screening and the method of in silico screening described in the present invention have high industrial applicability.
Number | Date | Country | Kind |
---|---|---|---|
2007-293751 | Nov 2007 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2008/070973 | 11/12/2008 | WO | 00 | 5/6/2010 |