Generating valid three-dimensional (3D) molecule structures is valuable for downstream tasks in in-silico drug discovery, such as molecular property prediction and estimating binding affinity to a molecular target. A stable 3D structure of a molecule is also known as its conformation and is specified by the 3D Cartesian coordinates of its atoms. Traditional molecular dynamics or statistical mechanics-driven Monte Carlo methods are computationally expensive, rendering them impractical for screening modern virtual drug libraries, which can contain billions of drug candidates. Deep learning-based generative methods have become an attractive alternative, but face challenges such as limited accuracy and the need for many sampling steps to obtain satisfactory performance.
Methods disclosed herein involve generating 3D molecular conformations directly from two-dimensional (2D) molecular graphs in a single sampling step. Deep learning (DL)-based generative methods for generating 3D conformations of molecules from 2D molecular graphs have become an attractive tool in recent years. DL-based generative methods can be broadly classified into three categories: distance-based, reconstruction-based, and direct methods. The main goal of distance-based methods is to learn a probability distribution over inter-atomic distances. During inference, distance matrices are sampled from the learned distribution and converted to valid 3D conformations through post-processing algorithms. A major advantage of modeling distance geometry is its roto-translation invariance. As a result, the likelihood-based objective function is also roto-translation invariant, leading to stable training and good generalization capability. A common practice adopted in distance-based methods is introducing virtual bonds between second- and third-order neighboring atoms, whose distances must also be estimated. These additional distances fix the bond angles and dihedral angles between atoms, which are crucial for generating valid conformations. However, they are not adequate to capture the interaction between atoms that are much farther apart in a long-sequence molecule. Another weakness of distance-based methods is error accumulation: random noise in the predicted distances can be exaggerated by the Euclidean Distance Geometry algorithm, leading to the generation of inaccurate conformations.
Reconstruction-based methods attempt to address these weaknesses by directly modeling the distribution of 3D coordinates. The main idea is to reconstruct valid conformations from either distorted or randomly sampled coordinates. Among the existing reconstruction-based methods, the process of transforming corrupted coordinates into stable conformations varies. Some adopt a reverse diffusion process for reconstruction, while others treat conformation generation as an optimization problem in which random conformations generated by a model are tuned to match their ground truth under supervision. Despite their promising performance, these reconstruction-based methods require the design of task-specific and complicated coordinate transformation methods. This is to ensure that the transformation is roto-translation, or SE(3), equivariant (i.e., preserves a relevant symmetry between input and output under the Euclidean group of translations and rotations in 3D). To achieve this, one method employs a specialized SE(3)-equivariant Markov transition kernel. Another method accomplishes the same by combining a task-specific adaptation of the transformer with another specialized equivariant graph neural network. Furthermore, the former method requires numerous diffusion steps for each conformation generation to attain satisfactory performance. This requirement can become a significant performance bottleneck if the model is applied to high-throughput virtual screening of a large number of molecules.
Accordingly, it would be desirable to develop a generative model that can produce a valid conformation directly from a 2D molecular graph in a single sampling step, without involving distance geometry or coordinate transformation. Some previous efforts have attempted to achieve this goal. Regrettably, their performance is significantly worse than that of their distance-based counterparts. The main reason can be attributed to the use of an inferior graph neural network for information aggregation and to direct coordinate generation without iterative refinement. Following the same framework, additional efforts aim to improve generative performance by adopting a more sophisticated graph neural network, featuring iterative conformation refinement and a loss function that accounts for the symmetric permutation of atoms. While these efforts achieve results competitive with other distance-based and reconstruction-based methods, computing the permutation-invariant loss function requires iterating through all permutations of a molecular graph, which can become computationally expensive for long-sequence molecules.
From the above, it can be seen that, regardless of their category, the aforementioned methods can be distilled to developing model architectures of ever-increasing sophistication and complexity, which is not ideal for conformation generation over a large number of molecules.
Disclosed herein is a generative framework that generates 3D conformations directly from a 2D molecular graph in a single sampling step. The disclosed generative framework forgoes building specialized generative model architectures. Instead, the generative framework focuses on intuitive and sensible input feature engineering. Briefly, the generative framework disclosed herein encodes a molecular graph using a symmetric tensor. The disclosed generative framework straightforwardly extends virtual bonds to high-order neighbors, leading to a fully connected symmetric tensor. The strength of these bonds is quantified by an additional feature channel that is inversely proportional to the number of edges (also referred to as “hops”) between atoms. For preliminary information aggregation, a kernel filter (e.g., a rectangular or square kernel filter) is run through the tensor in a one-dimensional (1D) convolution manner. For example, with a kernel width of 3, the information from two immediate neighbors as well as all of their connected neighbors can be aggregated onto the center atom in a single operation. Methods may further involve generating a token-like feature vector per atom, which can be directly fed to a standard transformer encoder for further information aggregation.
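By way of a non-limiting illustration, the inverse-hop feature channel described above may be computed from the topological distance matrix of the molecular graph. The following is a minimal Python sketch assuming the RDKit and NumPy libraries; the function name and exact encoding are illustrative rather than the disclosed implementation.

```python
# Illustrative sketch: an N x N channel whose (i, j) entry is inversely
# proportional to the number of hops (graph edges) between atoms i and j.
import numpy as np
from rdkit import Chem

def hop_strength_channel(smiles: str) -> np.ndarray:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    hops = Chem.GetDistanceMatrix(mol)          # topological (hop) distances
    n = mol.GetNumAtoms()
    channel = np.zeros((n, n), dtype=np.float32)
    off_diag = ~np.eye(n, dtype=bool)
    channel[off_diag] = 1.0 / hops[off_diag]    # 1 for real bonds, <1 for virtual bonds
    return channel
```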
The generative framework disclosed herein is considerably simpler than other existing 3D conformation generative methods. It starts by building two tensors, one holding only the molecular graph information and the other also including coordinate and distance information. Both tensors go through the same feature engineering step, and the resulting feature vectors are fed through two separate transformer encoders. The outputs of these two encoders are then combined in an intuitive way to form the input to another transformer encoder that generates the conformation directly. The generative framework disclosed herein has been demonstrated through extensive experiments on two benchmark datasets. With sensible feature engineering, the relatively simple model disclosed herein performs competitively against far more sophisticated and complex models, thereby improving the efficiency of 3D conformation generation in in-silico drug discovery.
Additionally disclosed herein is a neural network included in the disclosed generative model. Compared to other generative models that involve multiple sampling steps, the neural network included in the disclosed generative model can achieve the expected performance without going through multiple sampling steps. The neural network can be convenient to train, easy to deploy, and cheap to host. Accordingly, the generative framework disclosed herein can achieve competitive performance with much-simplified sampling and with improved ease of model training and deployment, all of which enable rapid and efficient implementation of the generative framework disclosed herein (e.g., implementation as a service providing 3D conformations in in-silico drug discovery).
Additionally disclosed herein is a method of generating a 3D conformation of a molecule. The method includes obtaining a two-dimensional (2D) graph of the molecule. A first tensor including molecular graph information and a second tensor including graph, coordinate, and distance information of atoms of the molecule are generated. A first set of feature vectors corresponding to the first tensor and a second set of feature vectors corresponding to the second tensor are further generated. The first set of feature vectors is fed to a first encoder of a generative model to generate a first output. The second set of feature vectors is fed to a second encoder of the generative model to generate a second output. The first output and the second output are combined to form an input to a decoder of the generative model to generate the 3D conformation of the molecule.
In various embodiments, generating the first tensor includes adding an additional dimension to an adjacency matrix including one or more atom features and/or one or more bond features corresponding to the molecule; a diagonal section of the first tensor holds the one or more atom features for the atoms of the molecule; the atom features include one or more of an atom type, atom charge, or atom chirality; each of the atom type, atom charge, and atom chirality is one-hot encoded; the atom type, atom charge, and atom chirality for an atom included in the molecule are stacked to form an atom feature vector corresponding to the atom; atom feature vectors for the atoms of the molecule are placed along a diagonal of the first tensor; an off-diagonal section of the first tensor holds bond features for one or more bonds corresponding to atom pairs included in the molecule; the bond features include bond types of the one or more bonds; the bond types include virtual bonds; the first tensor represents a fully connected symmetric tensor in which virtual bonds are extended across all atoms of the molecule; high-order virtual bonds share a same virtual bond type; the bond features include bond stereochemistry types of the one or more bonds; the bond features include associated ring sizes of the one or more bonds; the bond features include normalized bond lengths of the one or more bonds; a normalized bond length is calculated as an actual edge length divided by a longest chain length; a bond feature vector for an atom pair is further constructed; constructing the bond feature vector for an atom pair includes summing the atom feature vectors of the atoms of the atom pair to form a new vector; constructing the bond feature vector for the atom pair further includes stacking the new vector with a one-hot encoded bond type vector, a normalized bond length, and a one-hot encoded ring size vector for a bond associated with the atom pair; the method further includes constructing bond feature vectors for all atom pairs included in the molecule to form the first tensor; generating the second tensor includes adding a coordinate channel to one or more atom feature vectors and adding a Euclidean distance channel to one or more bond feature vectors; generating the first set of feature vectors includes encoding the first tensor using a one-dimensional (1D) convolution operation; generating the second set of feature vectors includes encoding the second tensor using a 1D convolution operation; using the 1D convolution operation includes adjusting a kernel height of a kernel to equal a height of the first tensor or the second tensor during the 1D convolution operation; a kernel width remains unchanged and equals a kernel size of a kernel used during the 1D convolution operation; the kernel size of a kernel before adjusting the kernel height is a 3×3 kernel or a 4×4 kernel; the kernel size of a kernel after adjusting the kernel height is an N×3 kernel or an N×4 kernel, wherein N is the length of the first tensor or the second tensor; the first encoder and the second encoder are transformer encoders; feeding the input to the decoder of the generative model to generate the 3D conformation of the molecule includes: computing, in a first multi-head attention layer of the decoder, query and key matrices based on the first output; feeding the input to the decoder of the generative model to generate the 3D conformation of the molecule includes: reparametrizing the second output to generate value matrices for a first multi-head attention layer of
the decoder; the first encoder, the second encoder, and the decoder included in the generative model are arranged in a conditional variational AutoEncoder framework; the generative model includes a convolutional neural network; the convolutional neural network in the generative model is trained before being applied to generate the 3D conformation of the molecule; the convolutional neural network in the generative model is trained by using a first data set containing molecules averaging a first number of atoms; the first number is a number between 30-40, 40-50, 50-60, or a number larger than 60; the convolutional neural network in the generative model is trained by using a second data set containing molecules averaging a second number of atoms; the second number is a number between 20-30, 10-20, or a number smaller than 10; each of the first data set and the second data set includes energy and statistical weight annotated molecular conformations corresponding to a set of molecules; the set of molecules includes at least 1,000 molecules or at least 10,000 molecules; the method achieves a coverage score (COV) of at least 92; and the method achieves a matching score (MAT) below 0.85.
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description and accompanying drawings. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. For example, a letter after a reference numeral, such as “third party entity 110A,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “third party entity 110,” refers to any or all of the elements in the figures bearing that reference numeral (e.g., “third party entity 110” in the text refers to reference numerals “third party entity 110A” and/or “third party entity 110B” in the figures).
Generally, the in-silico drug discovery system 130 performs methods disclosed herein, such as methods for generating 3D conformations for molecules. The 3D conformations may be generated based on the 2D molecular graphs of molecules from certain sources, such as certain programs or applications for generating 2D graphs for molecules. Example 2D graphs of molecules can express a molecule in terms of its atoms and bonds (e.g., as nodes and/or edges). 2D graphs of molecules can be expressed as, or derived from, any of a simplified molecular-input line-entry system (SMILES) string, molecular design limited (MDL), molfile (MOL), structure data file (SDF), protein data bank (PDB), XYZ, IUPAC international chemical identifier (InChI), and MOL2 format. The third party entities 110 communicate with the in-silico drug discovery system 130 for purposes associated with discovering valid small molecule compounds.
In various embodiments, the methods described herein as being performed by the in-silico drug discovery system 130 can be dispersed between the in-silico drug discovery system 130 and third party entities 110. For example, a third party entity 110A or 110B can generate training data and/or train one or more machine learning-based models included in the system architecture 100. The in-silico drug discovery system 130 can then deploy the trained machine learning models to generate 3D conformations for molecules.
Referring to the third party entities 110, in various embodiments, a third party entity 110 represents a partner entity of the in-silico drug discovery system 130. For example, the third party entity 110 can operate either upstream or downstream of the in-silico drug discovery system 130. In various embodiments, a first third party entity 110A can operate upstream of the in-silico drug discovery system 130 and a second third party entity 110B can operate downstream of the in-silico drug discovery system 130.
As one example, the third party entity 110 operates upstream of the in-silico drug discovery system 130. In such a scenario, the third party entity 110 may perform the methods of generating training data and training of machine learning-based models, as is described in further detail herein. Thus, the third party entity 110 can provide trained machine learning models to the in-silico drug discovery system 130 such that the in-silico drug discovery system 130 can deploy the trained machine learning models.
As another example, the third party entity 110 operates downstream of the in-silico drug discovery system 130. In this scenario, the in-silico drug discovery system 130 deploys trained machine learning models to generate valid 3D molecule structures for molecules. The in-silico drug discovery system 130 can provide the determined 3D molecule structures to the third party entity 110 for downstream tasks. The third party entity 110 can use the obtained 3D structural information of molecules for predicting molecular properties and estimating binding affinity to molecular targets. In various embodiments, the third party entity 110 may be an entity that identifies compounds as promising candidates for drug development.
Referring to the network 120 shown in
In some embodiments, the generative models can be trained by the in-silico drug discovery system 130. The training example generator module 160 generates training data for use in training these machine learning models. In particular embodiments, the training example generator module 160 generates training data for different machine learning-based generative models if more than one such model is included in the in-silico drug discovery system 130. The model training module 165 trains one or more machine learning models using the training data.
In various embodiments, the in-silico drug discovery system 130 may be differently configured than as shown in
Molecule Conformation Generation Module
The tensor generation module 220 is configured to generate one or more tensors for a molecule based on the received 2D molecular graph. For example, the tensor generation module 220 receives a 2D molecular graph 210 of a molecule from a certain source as illustrated in
According to an embodiment, the tensor generation module 220 generates one or more tensors based on the adjacency matrix of the molecule. As used herein, an adjacency matrix refers to a binary matrix whose entries indicate whether there is a connection between two atoms. To generate a tensor, the tensor generation module 220 may add an additional dimension to the adjacency matrix, e.g., add features about connected atoms and the corresponding bonds to the off-diagonal section and add atom features to the diagonal section of the matrix. The tensor generation module 220 can extract features from the tensor (e.g., using a graph neural network (GNN)). The input of a GNN is generally composed of three components: atom feature vectors, edge feature vectors, and an adjacency matrix. Atom and edge features normally pass through separate embedding steps before being fed to the GNN. The adjacency matrix is then used to determine neighboring atoms for layer-wise information aggregation. Although bond features are aggregated onto atom features and vice versa, these two features are maintained separately throughout the message-passing layers. To simplify 3D conformation generation, instead of having separate input components, the disclosed method combines bond features and atom features into a single input for a conformation generation neural network. Specifically, the tensor generation module 220 achieves this by adding an additional dimension to an adjacency matrix, making it a tensor graph (or tensor), similar to a tensor used to encode images in computer vision tasks. The as-generated tensor contains both atom features and bond features for 3D conformation generation.
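By way of a non-limiting illustration, the assembly of such a tensor graph may proceed as in the following sketch, which assumes RDKit and NumPy and uses simplified one-hot vocabularies; the feature set and function names are illustrative only, and the disclosed implementation may differ.

```python
# Illustrative sketch: atom feature vectors on the diagonal, bond feature
# vectors on the off-diagonal entries of a symmetric (N, N, C) tensor graph.
import numpy as np
from rdkit import Chem

ATOM_TYPES = ["C", "N", "O", "F", "H"]                      # assumed vocabulary
BOND_TYPES = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
              Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def tensor_graph(mol: Chem.Mol) -> np.ndarray:
    n = mol.GetNumAtoms()
    depth = len(ATOM_TYPES) + len(BOND_TYPES)
    t = np.zeros((n, n, depth), dtype=np.float32)
    for atom in mol.GetAtoms():                             # diagonal: atom features
        i = atom.GetIdx()
        t[i, i, ATOM_TYPES.index(atom.GetSymbol())] = 1.0
    for bond in mol.GetBonds():                             # off-diagonal: bond features
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        k = len(ATOM_TYPES) + BOND_TYPES.index(bond.GetBondType())
        t[i, j, k] = t[j, i, k] = 1.0                       # keep the tensor symmetric
    return t
```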
Similarly, the box 308 in
According to an embodiment, the tensor generation module 220 encodes the atom features and bond features according to certain predefined rules and logic. Table 1 lists some exemplary rules or logic for encoding the atom and bond features. It is to be noted that the encoding methods included in Table 1 are provided for demonstration purposes and are not exhaustive. In some embodiments, additional feature encoding rules and logic are also possible, depending on the specific composition of a molecule.
According to an embodiment, the tensor generation module 220 can purposely generate different tensor graphs for the same molecule. For example, for a benzene ring, two different tensor graphs can be generated, where each tensor graph may include different information (e.g., different types of atom features and/or bond features). As described elsewhere herein, the two encoders included in the conditional AutoEncoder framework 250 are conditioned on different inputs (e.g., different tensors). Accordingly, two different tensor graphs can be purposefully generated for a same molecule for feeding the two encoders. According to an embodiment, the atom features for a first tensor graph include atom type, charge, and chirality, but do not include coordinate information. The bond features for the first tensor graph include bond type, bond stereochemistry type, associated ring size, normalized bond length, and certain virtual bonds, but do not include distance information. On the other hand, the second tensor graph includes both coordinate and distance channels in addition to the bond features and atom features included in the first tensor. For example, an atom feature vector in the second tensor includes three additional channels incorporating the coordinates of the respective atom.
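By way of a non-limiting illustration, the second (coordinate- and distance-bearing) tensor may be derived from the first (graph-only) tensor by appending the additional channels, as in the following sketch; the helper name to_gdr is illustrative.

```python
# Illustrative sketch: append three coordinate channels (on the diagonal)
# and one Euclidean distance channel (off-diagonal) to a graph-only tensor.
import numpy as np

def to_gdr(g_tensor: np.ndarray, coords: np.ndarray) -> np.ndarray:
    # g_tensor: (N, N, C) graph-only tensor; coords: (N, 3) conformer coordinates
    n = g_tensor.shape[0]
    extra = np.zeros((n, n, 4), dtype=np.float32)
    idx = np.arange(n)
    extra[idx, idx, :3] = coords                            # coordinates on the diagonal
    extra[..., 3] = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    return np.concatenate([g_tensor, extra], axis=-1)       # (N, N, C + 4)
```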
Referring back to
Having obtained the tensor graph(s), a naïve way of building a generative model includes applying a convolutional neural network directly on top of the tensor(s) and training it to predict a distribution over the distance matrix, as many other existing methods have done. Among these methods, some further simplify the process by assuming that each Euclidean distance follows an independent Gaussian distribution, which translates the distribution prediction problem into predicting as many means and standard deviations as there are pair-wise distances. A network architecture used by such methods includes a standard UNet, which utilizes a small kernel (e.g., 3×3). This causes certain drawbacks for these naïve UNet-based methods. First, with such a small kernel size, it takes many convolution layers to achieve information aggregation between atoms that are far apart, and it does not take full advantage of the high-order bonds already made available in the input tensor graph. Second, the output size grows quadratically with the number of atoms, as compared to only linear growth in reconstruction-based or direct generation methods. The kernel expansion module 230 disclosed herein addresses the first problem by increasing the kernel size to expand its “field of view” in preliminary information aggregation.
In various embodiments, the kernel expansion module 230 generates a kernel for use in extracting features. In various embodiments, the kernel expansion module 230 generates a square kernel. For example, a square kernel may be a 2×2 kernel, a 3×3 kernel, a 4×4 kernel, a 5×5 kernel, or a 6×6 kernel. In various embodiments, the kernel expansion module 230 generates a rectangular kernel, such as an N×2 kernel, an N×3 kernel, an N×4 kernel, an N×5 kernel, or an N×6 kernel. In particular embodiments, N represents the length of the tensor graph, such that the rectangular kernel fully extends through the length of the tensor graph.
Referring back to
According to an embodiment, the feature vector generation module 240 generates a feature vector 318 by running a rectangular kernel filter through a tensor in a 1D convolution manner. For example, the feature vector generation module 240 disclosed herein may include a 1D convolution block composed of a configurable filter. The configurable filter is configured to be N×3 in the illustrated embodiment in
According to an embodiment, to generate a feature vector for an atom, the feature vector generation module 240 performs a convolution operation between the tensor and the configured filter, producing as output a new feature vector for the atom. According to an embodiment, the feature vector generation module 240 generates a feature vector for each atom by moving the filter in width (which is also referred to as a “horizontal stride”) by 1 atom at a time. As described earlier, since the tensor generated by the tensor generation module 220 is symmetric, the filter can also be set to 3×N, in which case a vertical stride is performed during the 1D convolution. In some embodiments, by striding all the way along a direction, the feature vector generation module 240 can generate a set of feature vectors corresponding to all atoms included in a molecule. In some embodiments, if multiple tensors (e.g., two tensors for two encoders) are generated by the tensor generation module 220, the feature vector generation module 240 generates a set of feature vectors for each generated tensor.
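By way of a non-limiting illustration, this extended-kernel 1D convolution may be expressed in TensorFlow as in the following sketch; the statically known tensor size N and the width padding (which yields exactly one token per atom) are assumptions of this sketch rather than the disclosed implementation.

```python
# Illustrative sketch: an N x 3 filter slides across the width of the
# (N, N, C) tensor graph with a horizontal stride of 1, emitting one
# token-like feature vector per atom.
import tensorflow as tf

def atom_tokens(tensor_graph: tf.Tensor, d_model: int) -> tf.Tensor:
    n = tensor_graph.shape[0]                            # static size, e.g., 69
    x = tensor_graph[tf.newaxis]                         # (1, N, N, C)
    x = tf.pad(x, [[0, 0], [0, 0], [1, 1], [0, 0]])      # pad width: N output positions
    conv = tf.keras.layers.Conv2D(filters=d_model, kernel_size=(n, 3),
                                  strides=(1, 1), padding="valid")
    tokens = conv(x)                                     # (1, 1, N, d_model)
    return tf.squeeze(tokens, axis=[0, 1])               # (N, d_model): one token per atom
```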
As also illustrated in
To generate 3D conformations with high confidence, the objective of a transformer architecture (e.g., the transformer included in the conditional AutoEncoder framework 250) is to obtain a generative distribution pθ(R|G) that approximates the Boltzmann distribution through maximum likelihood estimation. Particularly, given a set of molecular graphs G and their respective ground-truth conformations R, the transformer architecture may maximize the following objective:
log pθ(R|G)=log ∫ p(z) pθ(R|z, G) dz  (1)
A molecular graph can have multiple random yet valid conformations. Assume this randomness is driven by a latent process governed by a random variable z∼p(z), where p(z) is a known distribution (e.g., a standard normal distribution). As pθ(R|z, G) is often modeled by a complex function (e.g., a deep neural network), evaluation of the integral in Equation (1) is intractable, making direct maximum likelihood training impractical. Accordingly, the transformer architecture disclosed herein resorts to establishing a tractable lower bound for Equation (1) according to the following Equation (2):
log pθ(R|G)≥Eqw(z|R,G)[log pθ(R|z, G)]−DKL[qw(z|R, G)∥p(z)]  (2)
where DKL is the Kullback-Leibler divergence and qw(z|R, G) is a variational approximation of the true posterior p(z|R, G). In some embodiments, p(z)=N(0, I) is assumed and qw(z|R, G) is a diagonal Gaussian distribution whose means and standard deviations are modeled by a transformer encoder. The input of this transformer encoder is the aforementioned tensor containing both the coordinate and distance information, which is also referred to as the GDR tensor. On the other hand, pθ(R|z, G) is further decomposed into two parts: a decoder pθ2(R|z, σθ1(G)) for predicting the conformation directly and another encoder σθ1(G). The input tensor for σθ1(G) is absent of coordinate and distance information, and is therefore denoted the G tensor. In the disclosed conditional AutoEncoder framework 250, both encoder 260 and encoder 270 share the same standard transformer encoder structure. The Query (Q) and Key (K) matrices for the first multi-head attention layer are computed based on the output vectors of σθ1(G), and the Value (V) matrices come directly from the reparameterization of the output of qw(z|R, G), as z=μw+Σwϵ, where μw and Σw are the predicted mean and standard deviation, respectively, and ϵ is sampled from N(0, I). A complete picture of how the two encoders and the decoder are arranged in a conditional variational AutoEncoder framework is further illustrated below.
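By way of a non-limiting illustration, the reparameterization and the latent-conditioned first attention layer may be sketched as follows; the layer name is illustrative, the built-in MultiHeadAttention layer used here requires TensorFlow 2.4 or later, and the disclosed implementation may differ.

```python
# Illustrative sketch: Q and K come from the G-encoder output h, while V is
# the reparameterized latent sample z = mu_w + sigma_w * eps, eps ~ N(0, I).
import tensorflow as tf

def reparameterize(mu: tf.Tensor, log_sigma: tf.Tensor) -> tf.Tensor:
    eps = tf.random.normal(tf.shape(mu))                 # eps ~ N(0, I)
    return mu + tf.exp(log_sigma) * eps                  # diagonal Gaussian sample

class LatentConditionedAttention(tf.keras.layers.Layer):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)

    def call(self, h, z):
        # query/key condition the layer on the molecular graph; value injects
        # the stochastic latent sample into every decoded position.
        return self.mha(query=h, value=z, key=h)
```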
In some embodiments, there are multiple ways to join together the outputs of the two encoders to form the input to the final decoder. Popular methods include stacking or addition, but neither method can achieve desirable performance here. Instead of directly stacking or adding the sampled output of qw onto the output of σθ1(G), the disclosed framework joins the two outputs through the decoder's first multi-head attention layer, with the query and key matrices computed from the output of σθ1(G) and the value matrices obtained from the reparameterized sample of qw.
As can be seen from
According to an embodiment, an outcome of using the sliding extended kernel is that the ordering of the generated feature tokens follows the ordering of the atoms. The same permutation order is also preserved through all the transformer components. In addition, the single-sample reconstruction loss is formulated as:

ℓ(R̂, R)=Σi=1N∥A(R̂)i−Ri∥2  (3)
where A(·) is an alignment function aligning the predicted conformation R̂ onto the reference conformation R. The sum aggregation of per-atom errors and the permutation-preserving property of the feature extraction process make Equation (3) invariant to the permutation of atom order. Furthermore, the Kabsch algorithm is chosen as the alignment method, which translates and rotates the predicted conformation onto its corresponding ground truth before loss computation. This makes the reconstruction loss roto-translation invariant. Finally, the KL-loss component DKL[qw(z|R, G)∥p(z)] does not involve any coordinates. Therefore, the objective function defined in Equation (2) is both permutation and roto-translation invariant, which makes it an ideal model for 3D conformation generation.
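By way of a non-limiting illustration, the Kabsch alignment and the resulting reconstruction loss of Equation (3) may be sketched in NumPy as follows; the function names are illustrative.

```python
# Illustrative sketch: rigidly align the predicted conformation onto the
# reference with the Kabsch algorithm, then sum per-atom squared errors.
import numpy as np

def kabsch_align(pred: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # pred, ref: (N, 3) coordinate arrays
    p = pred - pred.mean(axis=0)
    r = ref - ref.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ r)                  # SVD of the covariance matrix
    d = np.sign(np.linalg.det(u @ vt))                 # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt              # optimal rotation
    return p @ rot + ref.mean(axis=0)

def reconstruction_loss(pred: np.ndarray, ref: np.ndarray) -> float:
    aligned = kabsch_align(pred, ref)
    return float(np.sum((aligned - ref) ** 2))         # Equation (3)
```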
In view of the above descriptions, to generate a single conformation during inference, the disclosed molecule conformation generation module 150 first constructs the G tensor of a molecular graph and obtains a single latent sample {z1, . . . , zN} from a standard diagonal Gaussian distribution. The G tensor is passed through the first encoder 410 to produce {h1L, . . . , hNL}, which is then combined with the latent sample via the modified multi-head attention mechanism. The output of this modified attention layer further goes through L−1 standard attention layers to be transformed into the final conformation. The entire generation process depends only on a 2D molecular graph and requires a single sampling step and a single pass of the disclosed transformer architecture, thereby greatly saving time and computation resources in generating molecular 3D conformations.
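By way of a non-limiting illustration, the single-pass inference procedure may be sketched as follows, where g_encoder and decoder stand in for the disclosed transformer stacks and d_latent for the latent dimension; all names are illustrative.

```python
# Illustrative sketch: build the G tensor, draw one latent sample, and
# decode the conformation in a single forward pass.
import tensorflow as tf

def generate_conformation(g_tensor, g_encoder, decoder, d_latent: int):
    n = g_tensor.shape[0]
    z = tf.random.normal((1, n, d_latent))      # single latent sample {z_1, ..., z_N}
    h = g_encoder(g_tensor[tf.newaxis])         # (1, N, d_model): {h_1^L, ..., h_N^L}
    coords = decoder(h, z)                      # modified attention + L-1 standard layers
    return tf.squeeze(coords, axis=0)           # (N, 3) predicted conformation
```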
Additionally disclosed herein is an example method 500 for generating a molecular 3D conformation for a molecule, as illustrated in
Non-Transitory Computer Readable Medium
Also provided herein is a computer-readable medium comprising computer executable instructions configured to implement any of the methods described herein. In various embodiments, the computer readable medium is a non-transitory computer readable medium. In some embodiments, the computer readable medium is a part of a computer system (e.g., a memory of a computer system). The computer readable medium can comprise computer executable instructions for implementing a machine learning model for the purposes of predicting a 3D conformation for a molecule.
Computing Device
The methods described above, including the methods of training and deploying machine learning models (e.g., convolutional neural network-based encoders), are, in some embodiments, performed on a computing device. Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
The storage device 608 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 606 holds instructions and data used by the processor 602. The input interface 614 is a touch-screen interface, a mouse, a track ball, a keyboard, or another type of input interface, or some combination thereof, and is used to input data into the computing device 600. In some embodiments, the computing device 600 may be configured to receive input (e.g., commands) from the input interface 614 via gestures from the user. The graphics adapter 612 displays images and other information on the display 618. The network adapter 616 couples the computing device 600 to one or more computer networks.
The computing device 600 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 608, loaded into the memory 606, and executed by the processor 602.
The types of computing devices 600 can vary from the embodiments described herein. For example, the computing device 600 can lack some of the components described above, such as graphics adapters 612, input interface 614, and displays 618. In some embodiments, a computing device 600 can include a processor 602 for executing instructions stored on a memory 606.
The methods for generating 3D conformations for molecules can, in various embodiments, be implemented in hardware or software, or a combination of both. In one embodiment, a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of displaying any of the datasets and execution and results of a machine learning model of this invention. Such data can be used for a variety of purposes, such as tensor generation, kernel expansion, feature vector generation, molecular 3D conformation generation, and the like. Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
Each program can be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The 2D molecular graphs and molecular 3D conformations generated thereupon can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that is capable of recording and reproducing the 2D molecular graphs and molecular 3D conformations of the present invention. The 2D molecular graphs and molecular 3D conformations of the present invention can be recorded on computer readable media (e.g., any medium that can be read and accessed directly by a computer). Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable media can be used to create a manufacture comprising a recording of the present graph and conformation information. “Recorded” refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage (e.g., word processing text file, database format, etc.).
Dataset and Preprocessing
The geometric ensemble of molecules (GEOM) dataset was used for evaluating the performance of the disclosed methods. GEOM contains about 37 million energy and statistical weight annotated molecular conformations corresponding to about 450,000 molecules. This dataset was further divided into two constituent datasets, Drugs and QM9. The Drugs dataset covers about 317,000 median-sized molecules averaging 44.4 atoms. The QM9 dataset contains about 133,000 smaller molecules averaging only 18 atoms. 40,000 molecules were randomly selected from each dataset to form a training set. For each molecule, the top 5 most likely conformations were selected, resulting in about 200,000 training conformations for each training set. For the validation set, 2,500 conformations were randomly selected for both the Drugs and QM9 experiments. For testing, 200 molecules were randomly selected, each having more than 50 and fewer than 500 conformations for the QM9 dataset, and more than 50 and fewer than 100 conformations for the Drugs dataset.
Determining Input Tensor Graph Size
Basic data analysis was conducted on the entire Drugs dataset, determining the 98.5th percentile of the number of atoms (including hydrogen) to be 69; the percentage of molecules having more than 69 atoms and with more than 50 and fewer than 100 conformations is only 0.19%. Accordingly, the size of the input tensor was set to 69×69 for the Drugs dataset. On the other hand, a maximum of 30 atoms was used for the QM9 dataset. The channel features for the input tensor include atom type, atom charge, atom chirality, bond type, bond stereochemistry, and bond in-ring size. For the GDR tensor, three 3D coordinate channels and a computed distance channel were further included. The resulting channel depth was 50 for the GDR tensor and 46 for the G tensor.
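By way of a non-limiting illustration, the percentile analysis that fixes the input tensor size may be reproduced as in the following sketch; the function name is illustrative.

```python
# Illustrative sketch: the 98.5th percentile of per-molecule atom counts
# (hydrogens included) determines the spatial size of the input tensor.
import numpy as np

def input_tensor_size(atom_counts) -> int:
    return int(np.ceil(np.percentile(atom_counts, 98.5)))  # 69 for GEOM-Drugs
```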
Implementation Details
The proposed generative model was implemented using TensorFlow 2.3.1. Specifically, tensor graphs of molecules were generated in accordance with
Evaluation Metrics
Widely accepted coverage score (COV) and matching score (MAT) were used to evaluate the performance of the proposed generative model. These two scores were computed as:

COV(Cg, Cr)=(1/|Cr|)·|{R∈Cr | RMSD(R, R̂)≤δ, ∃R̂∈Cg}|

MAT(Cg, Cr)=(1/|Cr|)·ΣR∈Cr minR̂∈Cg RMSD(R, R̂)

where Cg is the set of generated conformations and Cr is the corresponding reference set. The size of Cg is twice that of Cr; for every molecule, twice as many conformations were generated as there are reference conformations. δ is a predefined threshold, set to 0.5 Å for QM9 and 1.25 Å for Drugs, respectively. RMSD stands for the root-mean-square deviation between R and R̂. While the COV score measures the ability of a model to generate diverse conformations that cover all reference conformations, the MAT score measures how well the generated conformations match the ground truth. A good generative model should have a high COV score and a low MAT score.
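By way of a non-limiting illustration, given a precomputed RMSD matrix between reference and generated conformations, the two metrics may be computed as in the following sketch; the function name is illustrative.

```python
# Illustrative sketch: COV is the fraction of reference conformations within
# delta of some generated conformation; MAT is the mean best-match RMSD.
import numpy as np

def cov_and_mat(rmsd: np.ndarray, delta: float):
    # rmsd[i, j]: RMSD between reference conformation i and generated conformation j
    best = rmsd.min(axis=1)                   # closest generated match per reference
    cov = float(np.mean(best <= delta))       # coverage score
    mat = float(np.mean(best))                # matching score
    return cov, mat
```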
Results and Discussion
The performance of the disclosed generative model (or disclosed method) was evaluated by comparing it to 13 other existing methods. Specifically, the COV and MAT scores were determined by using the same test data generation configuration. Among the total 14 methods, six are distance-based methods, three are reconstruction-based methods, four are direct methods (including the disclosed method), and one is a classical method.
Distance-based methods, except for Method 7 and Method 9, had relatively poor performance compared to that of the classical Method 1. The superior performance of Method 1 may be due to the application of an additional empirical force field (FF) to optimize the generated structures. For comparison, the disclosed method was also performed with FF optimization, and was shown to outperform the other methods by a significant margin, as shown in Table 2.
The application of FF further introduces complexity to an already complex two-stage generative model. Method 7 attempted to rectify this weakness by simulating a pseudo gradient-based force field. It was observed that interatomic distances are continuously differentiable with respect to atomic coordinates, and the gradient fields of the two can be connected by the chain rule. Such a force field can be utilized in annealed Langevin dynamics sampling to sequentially guide atom positions to a valid conformation. Method 9 further improves the performance of Method 7 by using a dynamically constructed graph structure that is able to model long-range atom interactions. Despite being posed as direct methods, during inference, both Method 7 and Method 9 still need to compute atom distances as an intermediate step. Noticeably, Method 9 also requires dynamically changing the graph structure for every sampling step. As shown in Table 3, Method 7 and Method 9 outperform the classical Method 1 by a significant margin in both the QM9 and Drugs experiments. Nevertheless, their main weakness lies in the fact that they require numerous Langevin dynamics sampling steps to attain desirable performance. For example, it takes Method 7 approximately 8,500 seconds to fully decode 200 QM9 molecules and a staggering 11,500 seconds to decode 200 Drugs molecules.
One of the main goals of Method 10 is to be completely free of the dependence on atom distances. To achieve this, transformations are applied directly to the 3D conformation, through which a random or distorted conformation can be sequentially denoised to a valid conformation. Such a transformation must respect rotational and translational symmetry, which necessitates the design of a sophisticated SE(3)-equivariant transition kernel. In spite of its claim of not involving distance computation as an intermediate step, the equivariant graph field network adopted in Method 10 still relies on a distance-like quantity ∥xiL−xjL∥2 for every step of the transformation.
Method 10 produces promising performance on both the Drugs and QM9 datasets, significantly widening the gap between distance-based methods and reconstruction-based methods. Unfortunately, like its predecessor Method 9, it implements numerous diffusion steps during inference, requiring approximately the same amount of time for decoding the QM9 and Drugs testing datasets. Method 11 circumvents the iterative sampling steps by reformulating the molecular generation problem as an optimization problem. A properly optimized generative model can push randomly sampled conformations to valid ones in a single pass of the model. The structure of Method 11 requires maintaining a pair interaction matrix (N×N) through every attention layer. In addition, Method 11 has a significantly larger transformer structure as compared to the disclosed model, featuring 15 attention layers, 64 attention heads, and a latent dimension size of 2048. The Method 11 model is also first pre-trained on a much larger dataset and then fine-tuned on the GEOM dataset for conformation generation. The trained model delivers good performance on both the Drugs and QM9 datasets. Despite using a much smaller transformer model, the disclosed model offers competitive performance as compared to Method 11, yielding a slightly better COV score for the Drugs dataset.
Method 2 is one of the first methods to generate molecular conformations directly from a 2D molecular graph. However, it yielded a COV score of 0, indicating its incapability of generating any meaningful 3D conformation for the Drugs dataset. The reason could be attributed to the use of a vanilla graph neural network that is inferior to other more advanced adaptations, and a simple loss function that computes root-mean-square deviation directly from predicted coordinates. Method 12 attempted to revitalize the same framework by adapting a more advanced graph neural network structure. Another major modification adopted in Method 12 is adding symmetric permutation invariance to the loss function. This is fundamentally different from permutation order invariance, where changing the atom ordering does not affect the value of the loss function because atom coordinates are permuted together. Symmetric invariance goes a step further: swapping the coordinates of a symmetric substructure with the atom ordering fixed should yield the same root-mean-square deviation value. However, this involves enumerating all possible permutations, which can become very expensive for large molecules. Method 12 outperformed Method 11 on COV scores; nevertheless, its variant using the same loss function as the disclosed model performed significantly worse on both the Drugs and QM9 datasets. This indicates that the disclosed model has better feature extraction capability, eliminating the need for a more complex loss function to achieve comparable performance.
In summary, extensive experiments on the GEOM dataset have demonstrated the following three main advantages of the disclosed model. (1) The disclosed model is a direct generative model. It does not involve any distance computation or numerous sampling steps at inference time; it generates a conformation directly from a 2D molecular graph in a single sampling step. As compared to other distance- or reconstruction-based methods, it takes only 62 seconds using a single Xeon 8163 CPU core to decode 200 QM9 molecules and 128 seconds for 200 Drugs molecules, a roughly 100× speedup. (2) The disclosed model does not require any sophisticated adaptation of a graph neural network for feature extraction, nor does it necessitate the design of any complex SE(3)-equivariant coordinate transformation. (3) Its simple yet intuitive feature engineering enables complete information aggregation at the input feature level, making it possible to use an off-the-shelf transformer encoder and a simple loss function to achieve conformation generation capability competitive with other existing well-performing methods. These three advantages translate directly to the excellent practicality of the disclosed method.
Altogether, disclosed herein is a neural network-based molecular 3D conformation generation algorithm. The presented algorithm is simple, easy to implement, and demonstrates excellent conformation generation capacity. Its competitive performance has been demonstrated by comparison to a number of recently published state-of-the-art methods on 3D conformation generation.
The present application claims the benefit of and priority to U.S. Provisional Application Ser. No. 63/422,325, entitled “METHODS AND MODELS FOR DIRECT MOLECULAR CONFORMATION GENERATION,” filed on Nov. 3, 2022, which is hereby incorporated by reference in its entirety.