The present invention relates to a method for generating a compound structure, a program for generating a compound structure, and a device for generating a compound structure, and particularly relates to a technique for generating a compound structure having synthetic aptitude.
In the related art, the search for a structure of a compound having a desired physical property value has been performed mainly by solving a “forward problem” (giving a molecular structure as the cause and obtaining a physical property value as the result). With the development of informatics in recent years, studies on methods for solving an “inverse problem” (giving a physical property value and obtaining a molecular structure having that physical property value) are progressing rapidly. For example, “Bayesian molecular design with a chemical language model”, Hisaki Ikebata et al., “searched on Jul. 23, 2018”, internet (https://www.ncbi.nlm.nih.gov/pubmed/28281211) is known as a technique for searching for a structure by solving the inverse problem. This document discloses that, given a target value of a physical property, a structure having a physical property value close to the target value is obtained by (1) generating a plurality of initial structures (chemical structures), (2) randomly changing each structure, (3) estimating the physical property value of each structure, and (4) adopting or rejecting the change in structure based on the distance between the estimated physical property value and the target value, the processes (2) to (4) being repeated. As described above, in order to solve the inverse problem, a technique for performing (1) to (4) is required.
In a case of performing the above-described (1) to (4), a technique capable of evaluating synthetic aptitude of the compound is required. That is, it is meaningless in a case where the chemical structures generated and/or modified on a computer are difficult to synthesize. Therefore, a technique capable of generating a compound structure having synthetic aptitude is required, and as such a technique, a technique for generating a structure by learning a partial structure or a fragment (refer to “Bayesian molecular design with a chemical language model”, Hisaki Ikebata et al., “searched on Jul. 23, 2018”, internet (https://www.ncbi.nlm.nih.gov/pubmed/28281211) and “RecGen (Refined Compound Generator)”, Kyoto Constella Technologies Co., Ltd., “searched on Jul. 23, 2018”, internet (http://recgen.czeek.jp/recgen/)) has been known. Furthermore, a technique for updating the structure based on the evaluation results of physical property values is required (refer to “Bayesian molecular design with a chemical language model”, Hisaki Ikebata et al., “searched on Jul. 23, 2018”, internet (https://www.ncbi.nlm.nih.gov/pubmed/28281211)).
In the “RecGen (Refined Compound Generator)”, Kyoto Constella Technologies Co., Ltd., “searched on Jul. 23, 2018”, internet (http://recgen.czeek.jp/recgen/), in a case of connecting fragments, generation of a structure which cannot be synthesized is suppressed by preparing an overlap width portion and bonding the overlap width portion. However, in this technique, the synthetic aptitude is not evaluated. In addition, this technique is a method of adding a new structure to an existing structure, and it is difficult to delete an atom or an atomic group from the existing structure.
In order to solve the above-described inverse problem, it is required to generate a huge number of compound structures on the computer. On the other hand, in a case where a compound structure generated on the computer is difficult to synthesize, there is a problem that the structure obtained by solving the inverse problem cannot actually be synthesized.
The present invention has been made in view of such circumstances, and an object of the present invention is to provide a method for generating a compound structure, a program for generating a compound structure, and a device for generating a compound structure, which are capable of generating a compound structure by adding or deleting an atom or an atomic group while determining synthetic aptitude.
A method for generating a compound structure according to a first aspect includes:
According to the first aspect, a modified compound structure can be generated by adding or deleting an atom or an atomic group while determining the synthetic aptitude.
A method for generating a compound structure according to a second aspect includes that,
A method for generating a compound structure according to a third aspect includes that,
A method for generating a compound structure according to a fourth aspect includes that,
A method for generating a compound structure according to a fifth aspect includes that,
A method for generating a compound structure according to a sixth aspect includes that,
A method for generating a compound structure according to a seventh aspect,
A method for generating a compound structure according to an eighth aspect includes that,
The method for generating a compound structure according to a ninth aspect includes that,
The method for generating a compound structure according to a tenth aspect includes that,
The method for generating a compound structure according to an eleventh aspect includes that,
A program for generating a compound structure according to a twelfth aspect causes a computer to execute the above-described method for generating a compound structure.
A device for generating a compound structure according to a thirteenth aspect includes:
According to the present invention, it is possible to generate a modified compound structure having synthetic aptitude.
Hereinafter, a method for generating a compound structure, a program for generating a compound structure, and a device for generating a compound structure according to embodiments of the present invention will be described with reference to the accompanying drawings. In the present specification, in a case where a numerical range is expressed by using “to”, the numerical range includes the numerical values of the upper limit and lower limit indicated by “to”.
<Device for Generating Compound Structure>
<Configuration of Processing Part>
The function of each part of the processing part 100 described above can be realized by using various processors. Examples of the various processors include a CPU that is a general-purpose processor which executes software (program) to realize various functions. In addition, examples of the various processors also include a graphics processing unit (GPU) which is a processor specializing in image processing and a programmable logic device (PLD) which is a processor in which the circuit configuration can be changed after manufacturing, such as a field programmable gate array (FPGA). Furthermore, examples of the various processors also include a dedicated electric circuit which is a processor having a circuit configuration specifically designed to execute a specific process, such as an application specific integrated circuit (ASIC).
The functions of each part may be realized by one processor, or may be realized by a plurality of processors of the same type or different types (for example, a plurality of FPGAs, a combination of a CPU and an FPGA, or a combination of a CPU and a GPU). In addition, a plurality of functions may be realized by one processor. A first example of configuring a plurality of functions with one processor is an aspect in which, as typified by a computer such as a client or a server, one processor is configured by a combination of one or more CPUs and software, and this processor realizes the plurality of functions. A second example is an aspect in which, as typified by a system on chip (SoC), a processor which realizes the functions of the entire system with a single integrated circuit (IC) chip is used. As described above, the various functions are configured by using one or more of the above-described various processors as a hardware structure. Furthermore, the hardware structure of these various processors is, more specifically, an electric circuit (circuitry) in which circuit elements such as semiconductor elements are combined. This electric circuit may be an electric circuit which realizes the above-described functions by using logical sum, logical product, logical negation, exclusive logical sum, and logical operations combining these.
In a case where the above-described processor or electric circuit executes a software (program), a processor-readable code (computer-readable code) of the software to be executed is stored in a non-temporary recording medium such as ROM 122 (refer to
<Configuration of Storage Part>
The storage part 200 is configured of a non-temporary recording medium such as a digital versatile disk (DVD), a hard disk, and various semiconductor memories, and a control part thereof, and can store compound structures (initial compound structure and modified compound structure); a compound database; atomic species obtained based on the compound database, an atomic arrangement, and an appearance frequency of each result; a synthetic aptitude score; and the like.
<Configuration of Display Part and Operation Part>
The display part 300 includes a monitor 310 (display device), and can display the input image, the information stored in the storage part 200, the result of the process by the processing part 100, and the like. The operation part 400 includes a keyboard 410 and a mouse 420 as input devices and/or pointing devices, and the user can perform operations necessary for executing the method for generating a compound structure through these devices and a screen of the monitor 310. For example, the user can issue an instruction to start the process, input an initial compound structure, specify a hyperparameter for adjusting the synthesis difficulty, and the like.
<Procedure of Method for Generating Compound Structure>
<Preparation of Compound Database and Compound Structure>
A compound database for evaluating synthetic aptitude and a compound structure (initial structure) are prepared (Step S10). Data stored in the storage part 200 may be used as these data, or these data may be acquired from the external server 500 and the compound database 510 through the network NW. A compound database 510 including compounds suitable for the purpose is selected. What kind of data is prepared may be decided according to the user's instruction input through the operation part 400.
The compound structure (initial structure) can be selected from the compound database 510, or may be input by the user through the operation part 400. In a case of selecting the compound structure from the compound database 510, the compound structure can be selected randomly from the compound database 510, or can be selected probabilistically based on the appearance frequency in the compound database 510. Random selection means that every candidate is selected with equal probability, whereas probabilistic selection means that the selection is performed based on some weighting.
An example of a case of selecting in one atom unit based on the appearance frequency in the compound database 510 will be described. Table 1 is a table in which atomic species of the compound database 510 are arranged in descending order of the appearance frequency. An atomic species is specified by an atom included in a compound in the compound database 510 together with the electronic state of the atom (type of bonding). As shown in Table 1, the appearance frequency of “C.ar” is the highest, the appearance frequency of “C.3” is the second highest, and the appearance frequency of “Lr” is the lowest.
In Table 1, ar means aromatic, and “C.ar” means an aromatic carbon. The minimum value of the number of bonded atoms of “C.ar” is 2, and the maximum value thereof is 3. “C.3” is a carbon of sp3 hybrid orbital, and the minimum value of the number of bonded atoms is 1 and the maximum value of the number of bonded atoms is 4. For example, as an atom name including the electronic state of the atom, the mol2 format of Tripos can be applied. “C.1” is a carbon of sp1 hybrid orbital, and “C.2” is a carbon of sp2 hybrid orbital.
In a case of selecting the initial structure probabilistically, the atomic species are weighted according to the appearance frequency, and the initial structure is selected according to the weighting. As a result, atomic species with a high appearance frequency are more likely to be selected. On the other hand, in a case of random selection, the initial structure is selected randomly from all atomic species, so that atomic species with a low appearance frequency may also be selected.
The data of the compound database and the compound structure (initial structure) are input to the processing part 100 through the acquisition part 102. The compound structure (initial structure) may be either one atom or a compound.
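The difference between the probabilistic and random selection of an initial atomic species can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the frequency values in ATOM_FREQUENCY are hypothetical placeholders for the counts of Table 1, and the function name select_initial_atom is introduced only for this example.

```python
import random

# Hypothetical appearance frequencies; real values would come from Table 1.
ATOM_FREQUENCY = {"C.ar": 1000, "C.3": 800, "N.pl3": 50, "Lr": 1}


def select_initial_atom(frequency, probabilistic=True, rng=random):
    """Pick one atomic species as the initial structure.

    probabilistic=True  -> weighted by appearance frequency
    probabilistic=False -> every species equally likely
    """
    species = list(frequency)
    if probabilistic:
        weights = [frequency[s] for s in species]
        return rng.choices(species, weights=weights, k=1)[0]
    return rng.choice(species)


print(select_initial_atom(ATOM_FREQUENCY, probabilistic=True))
print(select_initial_atom(ATOM_FREQUENCY, probabilistic=False))
```

With weighting, a rare species such as “Lr” is chosen only rarely, while the unweighted variant picks it as often as any other species.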
As the compound database, PubChem (http://pubchem.ncbi.nlm.nih.gov/search/), DrugBank (http://www.drugbank.ca/), and the like can be used.
<Addition or Deletion of Atom or Atomic Group to Compound Structure>
The add/delete selection part 104 selects whether to add an atom or an atomic group to the compound structure or to delete an atom or an atomic group from the compound structure (Step S12). Here, immediately after the start of the method for generating a compound structure, the compound structure in the step S12 means a compound structure as the initial structure. On the other hand, after passing through the step S26 described later, the compound structure in the step S12 is a modified compound structure. In the step S12, the addition or deletion may be performed either in units of one atom or in units of an atomic group (a group of two or more atoms).
In the step S12, in a case where the compound structure prepared in the step S10 is one atom, the addition of an atom or an atomic group to the compound structure is selected.
In the step S12, by setting a threshold value for a molecular weight of the compound structure and increasing a probability of selecting the deletion of an atom or an atomic group in a case where the molecular weight of the compound structure exceeds the threshold value, the molecular weight of the generated compound structure can be limited.
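One possible way to express the selection in the step S12, including the single-atom case and the molecular weight threshold, is sketched below. The probabilities 0.5 and 0.9 and the function names are assumptions made for illustration; the text only states that the probability of selecting deletion is increased above the threshold.

```python
import random


def deletion_probability(molecular_weight, threshold, base=0.5, boosted=0.9):
    """Probability of choosing deletion; raised once the weight exceeds the threshold.

    base and boosted are hypothetical values, not taken from the embodiment.
    """
    return boosted if molecular_weight > threshold else base


def choose_operation(atom_count, molecular_weight, threshold, rng=random):
    """Return "add" or "delete" for the current compound structure."""
    if atom_count <= 1:
        # A structure consisting of one atom can only grow.
        return "add"
    p_delete = deletion_probability(molecular_weight, threshold)
    return "delete" if rng.random() < p_delete else "add"


print(choose_operation(atom_count=1, molecular_weight=12.0, threshold=500.0))
print(choose_operation(atom_count=40, molecular_weight=620.0, threshold=500.0))
```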
In the step S12, the addition of an atom or an atomic group or the deletion of an atom or an atomic group can be randomly selected, or probabilistically selected based on the appearance frequency of the atomic species included in the compound database.
<Acquisition of Modified Compound Structure>
In a case of selecting an addition of an atom or an atomic group to the compound structure in the step S12, the compound structure modification part 106 selects an atom having the number of bonded atoms less than the maximum value from atoms included in the compound structure (Step S14), and then bonds a new atom or a new atomic group to the atom selected from atoms included in the compound structure (Step S16). In addition, in a case of selecting a deletion of an atom or an atomic group from the compound structure in the step S12, the compound structure modification part 106 deletes the atom or atomic group selected from the atoms or atomic groups included in the compound structure (Step S18).
In the step S14, the compound structure modification part 106 examines the number of bonded atoms of each atom of the compound structure. The number of bonded atoms of each atom can be obtained from Table 1 filled in based on the compound database 510. For example, by selecting one atom from the compound structure and searching for the selected one atom from Table 1, the number of bonded atoms of the selected one atom can be obtained. The number of bonded atoms can be obtained in the same way for all the atoms included in the compound structure. All the atoms for which the number of bonded atoms has been obtained are listed, and from the list, one atom is probabilistically selected as the atom to which an atom or an atomic group is added.
A hydrogen atom included in the compound structure can be omitted unless it is necessary to consider the hydrogen atom. This is because the compound structure becomes complicated in a case where the hydrogen atom is extracted. In the compound structure, in a case of selecting one atom having the number of bonded atoms less than the maximum value as the atom to which an atom or an atomic group is added, it is preferable that an atom in which the number of bonded atoms does not reach the minimum value is preferentially selected. In a case where all the atoms in the compound structure have reached the minimum value, it is preferable to increase the selection probability of an atom as the difference between its number of bonded atoms and the maximum value becomes larger.
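The preferences described for the step S14 can be sketched as follows. The ranges in BOND_RANGE are the minimum and maximum values stated above for “C.ar” and “C.3”; treating “preferentially selected” as a strict priority and using the remaining capacity directly as a weight are assumptions of this example, as is the function name.

```python
import random

# (minimum, maximum) number of bonded atoms, as stated for Table 1.
BOND_RANGE = {"C.ar": (2, 3), "C.3": (1, 4)}


def choose_attachment_atom(atoms, rng=random):
    """atoms: list of (species, current number of bonded atoms).

    Returns the index of the atom to extend, or None if every atom is full.
    """
    open_atoms = [i for i, (s, n) in enumerate(atoms) if n < BOND_RANGE[s][1]]
    if not open_atoms:
        return None
    # Atoms still below their minimum are served first (assumed strict priority).
    below_min = [i for i in open_atoms if atoms[i][1] < BOND_RANGE[atoms[i][0]][0]]
    if below_min:
        return rng.choice(below_min)
    # Otherwise weight by the gap to the maximum number of bonded atoms.
    weights = [BOND_RANGE[atoms[i][0]][1] - atoms[i][1] for i in open_atoms]
    return rng.choices(open_atoms, weights=weights, k=1)[0]


# "C.3" already has 1 neighbor (minimum reached), "C.ar" has 1 (minimum is 2).
print(choose_attachment_atom([("C.3", 1), ("C.ar", 1)]))  # 1
```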
In the step S16, based on the compound database 510, the compound structure modification part 106 probabilistically selects one new atom or new atomic group, which can be bonded to the atom selected in the step S14, from an atomic arrangement (atomic species and type of bonding (single bond, double bond, and the like)), and forms a bonding.
Table 2 is a table of atomic arrangements, which is filled in based on the compound database 510. Table 2 shows atomic arrangements (atomic species, type of bonding, and appearance frequency) which can be bonded to “C.3”, in a case where the atom selected in the step S14 is “C.3”. Hereinafter, “—” represents a single bond, “═” represents a double bond, “#” represents a triple bond, and “:” represents an aromatic bond.
For example, in a case of probabilistically selecting a new atom to be bonded to the atom selected in the step S14, each atomic arrangement is weighted according to its appearance frequency. An atomic arrangement is selected according to the weighting, and the atom included in the selected atomic arrangement is bonded, as the new atom, to the atom selected in the step S14. On the other hand, in a case of random selection, the atomic arrangement is selected randomly from all atomic arrangements.
In a case where an atomic arrangement capable of forming a cyclic structure appears as a result of bonding the new atom, the cyclic structure can be probabilistically formed. As for the probability of forming a cyclic structure, it is preferable to directly estimate, from the compound database 510, the ratio at which the atomic arrangement forms a cyclic structure. However, a cyclic structure can also be formed randomly.
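The probabilistic formation of a cyclic structure can be illustrated by a short sketch. It assumes that, for a given atomic arrangement, the compound database yields two counts: how often the arrangement occurs in total and how often it occurs as part of a ring. Both counts and the function names are hypothetical.

```python
import random


def ring_closure_probability(cyclic_count, total_count):
    """Ratio of database occurrences in which the arrangement is cyclic."""
    return cyclic_count / total_count if total_count else 0.0


def should_close_ring(cyclic_count, total_count, rng=random):
    """Decide probabilistically whether to form the cyclic structure."""
    return rng.random() < ring_closure_probability(cyclic_count, total_count)


# Hypothetical counts: the arrangement is cyclic in 750 of 1000 occurrences.
print(ring_closure_probability(750, 1000))  # 0.75
print(should_close_ring(750, 1000))
```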
In the step S18, the compound structure modification part 106 determines whether or not the compound structure is split into two or more molecules in a case where an atom in the compound structure is deleted. For example, in a compound structure shown in
As for the atom to be deleted from the compound structure, for example, candidate atoms are listed. The atom to be deleted can be randomly selected from the list. In addition, as the atom to be deleted, the same atom as an atom with a low appearance frequency in the compound database 510 can also be preferentially selected from the list.
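The check in the step S18 of whether a deletion splits the compound structure can be sketched as a connectivity test on a graph. The adjacency-dictionary representation and the rule that only non-splitting atoms are listed as candidates are assumptions of this example.

```python
from collections import deque


def stays_connected_after_deletion(adjacency, atom):
    """True if removing `atom` leaves one connected, non-empty structure."""
    remaining = set(adjacency) - {atom}
    if not remaining:
        return False  # nothing would be left
    start = next(iter(remaining))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the remaining atoms
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor in remaining and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == remaining


def deletable_atoms(adjacency):
    """List atoms whose deletion does not split the structure."""
    return [a for a in adjacency if stays_connected_after_deletion(adjacency, a)]


chain = {0: [1], 1: [0, 2], 2: [1]}  # three atoms in a row
print(deletable_atoms(chain))  # [0, 2]
```

Deleting the middle atom of the chain would leave two separate fragments, so only the two terminal atoms are listed.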
The compound structure modification part 106 acquires a modified compound structure by passing through the step S16 or the step S18.
<Determination of Synthetic Aptitude>
The synthetic aptitude determination part 108 determines a synthetic aptitude of the modified compound structure, which is acquired by the compound structure modification part 106, based on information of the compound database 510 (Step S20).
The determination of the synthetic aptitude is performed, for example, by the following procedure. The procedure includes (1) extracting an atomic arrangement from a compound stored in a compound database and obtaining an appearance frequency of the atomic arrangement, (2) extracting an atomic arrangement from a modified compound structure and obtaining an appearance frequency of the atomic arrangement, (3) calculating, for each atomic arrangement in the compound structure, a frequency with which the atomic arrangement in the modified compound structure appears in the compound obtained from the compound database, as a partial score, based on the number of bonds included in the atomic arrangement in the modified compound structure and an appearance frequency of an atomic arrangement which corresponds to the atomic arrangement and is obtained from the compound database, using a function in which a numerical value decreases as the number of bonds and appearance frequency in the atomic arrangement in the modified compound structure increase, and (4) evaluating a synthetic aptitude by summing the calculated partial scores and obtaining a total score which is a synthetic aptitude score of the compound structure.
Tables 3 and 4 are tables of atomic arrangements, which are filled in based on the compound database 510, in which the number of bonds is used as the standard. Tables 3 and 4 include the number of bonds and the atomic arrangement (atomic species, type of bonding, and appearance frequency).
In Table 3, “S.3” is a sulfur of sp3 hybrid orbital.
In Table 4, “N.pl3” is a trigonal planar nitrogen, and “O.co2” is an oxygen in carboxylate and phosphate groups.
Atomic arrangements are extracted from the modified compound structure for each of the number of bonds. The appearance frequency of the extracted atomic arrangement in the modified compound structure is obtained. Table 5 is a table of atomic arrangements, which is obtained from a modified compound structure.
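Extraction of atomic arrangements with one bond from a compound structure can be sketched as follows. The toy molecule (one “C.3” bonded to a six-membered aromatic ring) and the alphabetical ordering used to give each arrangement a single name are assumptions of this example, and arrangements with two or more bonds are not handled.

```python
from collections import Counter


def count_one_bond_arrangements(species, bonds):
    """species: atom index -> atomic species; bonds: list of (i, j, bond symbol)."""
    counts = Counter()
    for i, j, symbol in bonds:
        # Sort the two species so that A-B and B-A share one name.
        left, right = sorted((species[i], species[j]))
        counts[f"{left}{symbol}{right}"] += 1
    return counts


species = {0: "C.3", 1: "C.ar", 2: "C.ar", 3: "C.ar", 4: "C.ar", 5: "C.ar", 6: "C.ar"}
# One single bond to the ring, then six aromatic bonds closing the ring 1..6.
bonds = [(0, 1, "-")] + [(i, i % 6 + 1, ":") for i in range(1, 7)]

counts = count_one_bond_arrangements(species, bonds)
print(counts["C.ar:C.ar"], counts["C.3-C.ar"])  # 6 1
```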
In Table 5, “O.3” is an oxygen of sp3 hybrid orbital, and edge means the terminal of the molecule.
In a case where n(substr) represents the number of bonds of the atomic arrangement, f(substr) represents the appearance frequency of the atomic arrangement in the compound database, and f1(substr) represents the appearance frequency of the atomic arrangement in the modified compound structure, the partial score s(substr) can be obtained by Expression (1).
s(substr)=f1(substr)/(n(substr)×(f(substr)+1)) (1)
For example, a partial score of “C.ar:C.ar” included in the modified compound structure can be calculated as follows. From Table 3, the appearance frequency of “C.ar:C.ar” in the compound database 510 is 799082034. From Table 5, the appearance frequency of “C.ar:C.ar” in the modified compound structure is 6.
s(C.ar:C.ar)=f1(C.ar:C.ar)/(n(C.ar:C.ar)×(f(C.ar:C.ar)+1))=6/(1×(799082034+1))=7.5×10^−9
A total score S can be obtained by obtaining partial scores for all the atomic arrangements included in the modified compound structure and summing the partial scores s.
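The partial score and the total score can be computed with a few lines of code. The sketch follows the form used in the worked calculation for “C.ar:C.ar”; the function names and the tuple layout (n, f, f1) are introduced only for this example.

```python
def partial_score(n_bonds, db_frequency, structure_frequency):
    """s = f1 / (n * (f + 1)); the +1 avoids division by zero for unseen arrangements."""
    return structure_frequency / (n_bonds * (db_frequency + 1))


def total_score(arrangements):
    """arrangements: iterable of (n, f, f1) for every arrangement in the structure."""
    return sum(partial_score(n, f, f1) for n, f, f1 in arrangements)


s = partial_score(n_bonds=1, db_frequency=799082034, structure_frequency=6)
print(f"{s:.1e}")  # 7.5e-09
```

An arrangement that never appears in the compound database (f = 0) contributes a large partial score, so structures containing unusual arrangements receive a high total score S.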
The determination of the synthetic aptitude can be performed by setting a threshold value for the total score S. In a case where the total score S is equal to or less than the set threshold value, the modified compound structure is determined to have the synthetic aptitude.
In a case where a threshold value is set for the total score S, a compound structure having a total score S more than the threshold value is not generated at all. In fact, in a case of performing, with respect to the compound structure (including the initial structure and the modified compound structure), the process of the addition of a new atom or a new atomic group and the deletion of an atom or an atomic group, a compound structure having a total score S less than the threshold value may be acquired after passing through the compound structure having a total score S more than the threshold value. Therefore, it is necessary to determine a synthetic aptitude which can accept the compound structure having the total score S more than the threshold value.
In a case where an adoption probability is represented by p, the total score is represented by S, and the hyperparameter is represented by σ, the determination of the synthetic aptitude which can accept the compound structure having the total score S more than the threshold value can be probabilistically performed by Expression (2). Adjustment of the synthesis difficulty of the modified compound structure is performed by changing the value of the hyperparameter σ.
p=exp[−S/σ] (2)
Next, the adjustment of the synthesis difficulty will be described.
As shown in the graph of
As shown in the graph of
In a case where the hyperparameter σ is ∞, the adoption probability p is 100% regardless of the total score S. Expression (2) includes a case where the synthetic aptitude is determined without setting a threshold value for the total score S.
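The effect of the hyperparameter σ in Expression (2) can be checked numerically. The following sketch evaluates p=exp[−S/σ] for a few values; the sample values of S and σ are arbitrary and chosen only for illustration.

```python
import math


def adoption_probability(total_score, sigma):
    """Expression (2): p = exp(-S / sigma), for S >= 0 and sigma > 0."""
    return math.exp(-total_score / sigma)


for sigma in (0.01, 0.1, 1.0, math.inf):
    p = adoption_probability(0.1, sigma)
    print(f"sigma={sigma}: p={p:.4f}")
```

For a fixed total score S, a smaller σ gives a smaller adoption probability, and σ=∞ gives p=1 for every S.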
In addition, in a case where the adoption probability is represented by p, the total score is represented by S, the hyperparameter is represented by σ, and a parameter is represented by d, the determination of the synthetic aptitude can be probabilistically performed by Expression (3) which is an extended exponential function.
In
In a case where the parameter d is 1, Expression (3) is the same as the function of the adoption probability p represented by Expression (2). In a case where the parameter d is increased, Expression (3) changes as follows.
In a case where the parameter d is ∞, the adoption probability p is 1/e in a case where total score S=hyperparameter σ. In addition, the adoption probability p is 1 in a case where total score S<hyperparameter σ, and the adoption probability p is 0 in a case where total score S>hyperparameter σ. The plotted graphs asymptotically approach the so-called Heaviside step function. In
As long as the synthetic aptitude can be determined, the determination of the synthetic aptitude is not limited to the above-described detail.
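The behavior described for the parameter d can be reproduced with a stretched exponential. The form p=exp[−(S/σ)^d] used below is an assumption: it is chosen because it reduces to Expression (2) for d=1, gives p=1/e at S=σ for any d, and approaches a step at S=σ as d grows, which matches the description above. It should not be read as the exact Expression (3).

```python
import math


def extended_adoption_probability(total_score, sigma, d):
    """Assumed form: p = exp(-(S / sigma) ** d), for S >= 0, sigma > 0, d >= 1.

    Very large d can overflow the floating-point power when S > sigma,
    so d is kept moderate in this illustration.
    """
    return math.exp(-((total_score / sigma) ** d))


sigma = 0.1
for d in (1, 4, 50):
    below = extended_adoption_probability(0.5 * sigma, sigma, d)
    at = extended_adoption_probability(sigma, sigma, d)
    above = extended_adoption_probability(2.0 * sigma, sigma, d)
    print(f"d={d}: p(S<sigma)={below:.4f} p(S=sigma)={at:.4f} p(S>sigma)={above:.4f}")
```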
<Acceptance and Rejection of Compound Structure>
The structure adoption decision part 110 probabilistically accepts the modification in a case where the modified compound structure has the synthetic aptitude (Step S22), or probabilistically rejects the modification in a case where the modified compound structure does not have the synthetic aptitude (Step S24).
Here, “probabilistically” means that the decision is made by applying the adoption probability p obtained in the step S20.
In the method for generating a compound structure according to the embodiment, the synthetic aptitude is determined every time the compound structure is modified. By the determination of the synthetic aptitude, it is possible to suppress the generation of a compound structure which is difficult to synthesize. On the other hand, by adjusting the adoption probability p of the synthetic aptitude, a compound structure having low synthetic aptitude can be accepted, which increases the degree of freedom in modifying the compound structure and promotes the generation of new compound structures.
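Applying the adoption probability p to the acceptance or rejection of a modification can be sketched by comparing p with a uniform random number. The function name and the convention of returning the retained structure are assumptions of this example.

```python
import random


def decide(current, modified, adoption_probability, rng=random):
    """Keep `modified` with probability p (Step S22), otherwise keep `current` (Step S24)."""
    return modified if rng.random() < adoption_probability else current


print(decide("C.3", "C.3-C.ar", 1.0))  # C.3-C.ar
print(decide("C.3", "C.3-C.ar", 0.0))  # C.3
```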
<Repetition of Process>
After the structure adoption decision part 110 has, as described above, probabilistically accepted the modification of the compound structure (Step S22) or probabilistically rejected the modification of the compound structure (Step S24), the control part 112 determines whether or not a termination condition is satisfied (Step S26). For example, the control part 112 can determine that “the termination condition is satisfied” in a case where the number of bonded atoms of every atom included in the compound structure is equal to or more than the minimum value. In a case where the termination condition is not satisfied, the control part 112 repeatedly performs the steps S12 to S26. On the other hand, in a case where the control part 112 determines that “the termination condition is satisfied”, the generation of the compound structure is terminated.
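The repetition of the steps S12 to S26 can be summarized as a loop. In the sketch below, modify, score and is_done are caller-supplied placeholders for the modification, the total score S and the termination condition; these names, the use of Expression (2) for the adoption probability, and the max_steps safety cap are assumptions of this example and not part of the embodiment.

```python
import math
import random


def termination_satisfied(atoms, bond_range):
    """True if every atom has at least its minimum number of bonded atoms."""
    return all(n >= bond_range[s][0] for s, n in atoms)


def generate(structure, modify, score, sigma, is_done, max_steps=1000, rng=random):
    """Repeat modify -> score -> accept/reject until is_done (or max_steps)."""
    for _ in range(max_steps):
        candidate = modify(structure, rng)            # Steps S12 to S18
        p = math.exp(-score(candidate) / sigma)       # Step S20
        if rng.random() < p:                          # Steps S22 / S24
            structure = candidate
        if is_done(structure):                        # Step S26
            break
    return structure


BOND_RANGE = {"C.ar": (2, 3), "C.3": (1, 4)}
print(termination_satisfied([("C.3", 1), ("C.ar", 1)], BOND_RANGE))  # False
```

In the usage line, “C.ar” has only one bonded atom against a minimum of 2, so the loop would continue.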
The present invention will be specifically described with reference to the example. Even in this example, the process can be performed by the generation device 10 shown in
As shown in
In the step of adding or deleting an atom or an atomic group (Step S12), the addition or deletion of an atom or an atomic group is selected probabilistically. However, since “C.3” is one atom, the deletion of an atom or an atomic group is not selected, but the addition of an atom or an atomic group is selected.
In the step of selecting an atom having the number of bonded atoms less than the maximum value (Step S14), “C.3” having the number of bonded atoms less than the maximum value is selected as an atom to which a new atom or a new atomic group is bonded. Here, the minimum value of the number of bonded atoms of “C.3” is 1 and the maximum value thereof is 4. Since “C.3” is one atom, “C.3” is in a state in which the number of bonded atoms does not reach the minimum value.
In the step of bonding a new atom or atomic group (Step S16), the atomic arrangement capable of bonding to “C.3” and the type of bonding are selected probabilistically from the list (refer to Table 2) filled in based on the compound database 510. In the example, “C.3-C.ar” which has the second highest appearance frequency is selected. In
Next, in the step of determining the synthetic aptitude (Step S20), the synthetic aptitude of “C.3-C.ar” is determined. In the example, Expression (2) in which a total score S and a hyperparameter σ=0.1 are included is applied to the calculation of an adoption probability p of the synthetic aptitude. The adoption probability p of this modification in structure is calculated by Expression (4).
Since the adoption probability p is almost 1, “C.3-C.ar” is determined to have the synthetic aptitude. The change in structure is accepted almost 100% (Step S22). As shown in accepting the modification in structure of
As a result of adding (bonding) a new atom, “C.3” reaches 1 which is the minimum value of the number of bonded atoms. On the other hand, “C.ar” does not reach 2 which is the minimum value of the number of bonded atoms. It is determined that the termination condition is not satisfied (Step S26). The process returns to the step S12, and is repeated. After that, the same process is repeated 5 times.
At the sixth time, as shown in
The adoption probability p of this modification in structure is calculated by Expression (5).
Since the adoption probability p is almost 1, “C.ar-C.ar” is determined to have the synthetic aptitude (Step S20). The change in structure is accepted almost 100% (Step S22). As shown in accepting the modification in structure of
Regarding a possible range of the number of bonded atoms, “C.3” is 1 to 4 and “C.ar” is 2 to 3. The atoms included in the modified compound structure satisfy the possible ranges of the number of bonded atoms, which is the termination condition (Step S26). The modification of the compound structure is completed, and the method for generating a compound structure is terminated.
In
By the method for generating a compound structure according to the embodiment, as shown in
The embodiments and examples of the present invention have been described above, but the present invention is not limited to the above-described aspects, and various modifications are possible without departing from the gist of the present invention.
10: device for generating compound structure
100: processing part
102: acquisition part
104: add/delete selection part
106: compound structure modification part
108: synthetic aptitude determination part
110: structure adoption decision part
112: control part
114: display control part
120: CPU
200: storage part
300: display part
310: monitor
400: operation part
410: keyboard
420: mouse
500: external server
510: compound database
NW: network
Number | Date | Country | Kind |
---|---|---|---|
2018-172577 | Sep 2018 | JP | national |
The present application is a Continuation of PCT International Application No. PCT/JP2019/036073 filed on Sep. 13, 2019 claiming priority under 35 U.S.C. § 119(a) to Japanese Patent Application No. 2018-172577 filed on Sep. 14, 2018. Each of the above applications is hereby expressly incorporated by reference, in its entirety, into the present application.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/JP2019/036073 | Sep 2019 | US |
Child | 17192530 | US |