The present invention generally relates to systems and methods to design and synthesize molecules based on molecular system properties; and more particularly to systems and methods that utilize molecular-orbital-based features with machine learning quantum chemistry computing to determine the properties of synthesized chemicals.
Molecular simulations can aid discovery efforts across scientific industries, including those working on solid-state materials, polymers, fine chemicals, and pharmaceuticals. Current approaches employ physics-based methods that solve quantum mechanical equations to describe the behavior of atoms and molecules. While powerful, current methods come at extraordinary computational costs (consuming a sizable fraction of the world's supercomputing resources) and human-time costs (with necessary calculations taking months or longer of wall-clock time). Advances in molecular simulation would broaden its applications in the industrial innovation and development process.
Systems and methods in accordance with various embodiments of the invention enable the design and/or synthesis of molecules based on molecular system properties. In many embodiments, molecules with specific molecular system properties can be synthesized for a wide range of product development processes such as drug discovery and material design. Examples of materials synthesized in accordance with various embodiments of the invention include (but are not limited to): catalysts, enzymes, pharmaceuticals, proteins and antibodies, organic electronics, surface coatings, nanomaterials, solvents and electrolyte materials that can be used in the construction of batteries.
Many embodiments predict molecular system properties based on molecular orbital based features using molecular-orbital-based machine learning (MOB-ML) processes. Examples of molecular system properties in accordance with various embodiments of the invention include (but are not limited to): solubility, binding affinity for molecules, binding affinity for protein, redox potential, pKa, electrical conductivity, ionic conductivity, thermal conductivity, and light emission efficiency.
In many embodiments, MOB-ML processes can allow for at least 1000-fold speed-ups in computational and wall-clock times over existing physics-based quantum mechanical methods. In several embodiments, the processes allow for at least 100-fold increases in human efficiency. By deploying MOB-ML at scale with cloud resources, the timescale for turnaround can be reduced from days to seconds. MOB-ML in accordance with several embodiments of the invention can enable at least 10-fold prediction accuracy improvements. Some other embodiments implement the software packages, de-risk computational predictions, reduce down-stream experimental and production costs, and accelerate time-to-market.
One embodiment of the invention includes: obtaining a set of molecular orbitals for a molecular system using a computer system; generating a set of molecular-orbital-based features based upon the set of molecular orbitals of the molecular system using the computer system; determining at least one molecular system property based on the set of features using a molecular-orbital-based machine learning (MOB-ML) model implemented on the computer system; and when the determined at least one molecular system property satisfies at least one criterion by the computer system, synthesizing the molecular system.
In a further embodiment, the set of molecular-orbital-based features comprises an attributed graph representation of molecular-orbital-based features.
In another embodiment, the molecular system is one of a plurality of candidate molecular systems. In addition, determining when the determined at least one molecular system property satisfies at least one criterion further includes: generating a set of molecular-orbital-based features based upon sets of molecular orbitals for each of the candidate molecular systems; determining at least one molecular system property for each of the candidate molecular systems based on the set of molecular-orbital-based features of each of the candidate molecular systems using the MOB-ML model; screening the candidate molecular systems based upon the at least one molecular system property determined for each of the candidate molecular systems; and identifying the molecular system based upon the screening.
A still further embodiment also includes training the MOB-ML model to learn relationships between sets of molecular-orbital-based features and molecular system properties using a training dataset describing a plurality of molecular systems and their molecular system properties.
In still another embodiment, training the MOB-ML model to learn relationships between sets of molecular-orbital-based features and molecular system properties further includes: obtaining a set of molecular orbitals for each molecular system in the training dataset of molecular systems by determining occupied molecular orbitals; and obtaining a set of molecular-orbital-based features based upon at least the occupied molecular orbitals.
In a yet further embodiment, a localization process is used to determine occupied molecular orbitals.
In yet another embodiment, obtaining the set of molecular-orbital-based features further comprises performing a dimensionality reduction process on an initial set of features.
In a further embodiment again, the dimensionality reduction process is selected from the group consisting of selecting the molecular-orbital-based features from the initial set of features, and applying a transformation process to the initial set of features to obtain the molecular-orbital-based features.
In another embodiment again, the transformation process is selected from the group consisting of subspace embedding and autoencoding.
In a further additional embodiment, training the MOB-ML model comprises at least one process selected from the group consisting of regression clustering, regression, and classification.
In another additional embodiment, training the MOB-ML model comprises at least one regression process selected from the group consisting of Gaussian Process Regression, Neural Network Regression, Linear Regression, Kernel Ridge Regression with feature selection based on Random Forest Regression, Kernel Ridge Regression without feature selection based on Random Forest Regression, and Kernel Ridge Regression with feature transformation based on Principal Component Analysis.
In a still yet further embodiment, the molecular system comprises at least one of atoms, molecular bonds, and molecules formed by atoms and molecular bonds.
In still yet another embodiment, the set of features includes molecular-orbital-based (MOB) features comprising an energy operator.
In a still further embodiment again, the molecular-orbital-based features further comprise at least one feature selected from the group consisting of: elements from a Fock matrix, elements from a Coulomb matrix, and elements from an exchange matrix.
In still another embodiment again, the at least one molecular system property comprises at least one property selected from the group consisting of quantum correlation energy, force, vibrational frequency, dipole moment, response property, excited state energy and force, and spectrum.
In a still further additional embodiment, the synthesized molecular system comprises at least one molecule selected from the group consisting of a catalyst, an enzyme, a pharmaceutical, a protein, an antibody, a surface coating, a nanomaterial, a semiconductor, a solvent for a battery, and an electrolyte for a battery.
Still another additional embodiment includes: obtaining sets of molecular orbitals for a plurality of candidate molecular systems using a computer system; generating a set of molecular-orbital-based features for each candidate molecular system based upon sets of molecular orbitals for each of the candidate molecular systems using the computer system; determining at least one molecular system property for each of the candidate molecular systems based on the set of molecular-orbital-based features of each of the candidate molecular systems using a molecular-orbital-based machine learning (MOB-ML) model implemented on the computer system; screening the candidate molecular systems to identify at least one molecular system possessing at least one molecular system property that satisfies at least one criterion based upon the at least one molecular system property determined for each of the candidate molecular systems using the computer system; and generating a report describing the at least one molecular system identified during the screening of the candidate molecular systems using the computer system.
A yet further embodiment again includes: searching for a set of molecular-orbital-based features having at least one molecular system property predicted by a molecular-orbital-based machine learning (MOB-ML) model that satisfies at least one criterion using a computer system, where the MOB-ML model is trained to receive a set of molecular-orbital-based features of a molecular system and output an estimate of at least one molecular system property; mapping a located set of molecular-orbital-based features to an identified molecular system based upon a feature-to-structure map using the computer system, where the feature-to-structure map is trained to map a set of molecular-orbital-based features to a corresponding molecule structure; and generating a report describing the identified molecular system using the computer system.
Yet another embodiment again also includes screening the identified molecular system based upon at least one molecular system criterion.
Another further embodiment includes: searching for a set of molecular-orbital-based features having at least one molecular system property predicted by a molecular-orbital-based machine learning (MOB-ML) model that satisfies at least one criterion using a computer system, where the MOB-ML model is trained to receive a set of features of a molecular system and output an estimate of at least one molecular system property; mapping a located set of molecular-orbital-based features to an identified molecular system using a feature-to-structure map using the computer system, where the feature-to-structure map is trained to map a set of molecular-orbital-based features to a corresponding molecule structure; screening the identified molecular system based upon at least one screening criterion using the computer system; and when the identified molecular system satisfies the at least one screening criterion, synthesizing the identified molecular system.
In yet another further embodiment, searching for a set of molecular-orbital-based features having at least one molecular system property predicted by the MOB-ML model that satisfies at least one criterion further comprises using at least one generative model to generate candidate sets of features.
In still another further embodiment, the generative model is selected from the group consisting of a variational autoencoder (VAE) and a Generative Adversarial Network (GAN).
Another further embodiment again includes: obtaining a training dataset of molecular systems and their molecular system properties using a computer system; generating a set of molecular-orbital-based features for each molecular system in the training dataset based upon a set of molecular orbitals for each of the candidate molecular systems using the computer system; training an MOB-ML model to learn relationships between the set of molecular-orbital-based features of each molecular system in the training dataset and the molecular system properties of each of the molecular systems in the training dataset using the computer system; and utilizing the MOB-ML model to predict at least one molecular system property for a specific molecular system based upon a set of molecular-orbital-based features generated for the specific molecular system based upon a set of molecular orbitals for the specific molecular system.
In another further additional embodiment, obtaining a training dataset of molecular systems and their molecular system properties further includes: generating a set of molecular-orbital-based features for the specific molecular system based upon a set of molecular orbitals for the specific molecular system using the computer system; retrieving molecular-orbital-based features from a database based upon proximity between a retrieved molecular-orbital-based feature and a molecular-orbital-based feature from the set of molecular-orbital-based features for the specific molecular system; and forming the training dataset using the retrieved molecular systems.
In still yet another further embodiment, training the MOB-ML model to learn relationships between the sets of molecular-orbital-based features of each molecular system in the training dataset and the molecular system properties of each of the molecular systems in the training dataset further comprises utilizing a transfer learning process to train an MOB-ML model previously trained to determine the relationship between the molecular-orbital-based features of a molecular system and a different set of molecular system properties.
In still another further embodiment again, training the MOB-ML model to learn relationships between the sets of molecular-orbital-based features of each molecular system in the training dataset and the molecular system properties of each of the molecular systems in the training dataset further comprises utilizing an online learning process to update a previously trained MOB-ML model.
Additional embodiments and features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the specification or may be learned by the practice of the disclosure. A further understanding of the nature and advantages of the present disclosure may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.
The description will be more fully understood with reference to the following figures, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention. It should be noted that the patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Turning now to the drawings, systems and methods for synthesizing molecules with specific molecular system properties are described. A molecular system can be atoms, molecular bonds, and/or the resulting molecules formed by the atoms and molecular bonds. Many embodiments implement a molecular-orbital-based machine learning (MOB-ML) process to determine properties of a molecular system. In a number of embodiments, an MOB-ML generative model is utilized to perform generative design of molecular systems having particular desirable properties that can then be synthesized.
In several embodiments, specific molecular system properties are utilized as inputs of an MOB-ML process. In many embodiments, the input properties of the molecular system are a set of features based on molecular orbitals. In some embodiments, the MOB features can be energy operators of the quantum system of the molecular systems. In a number of embodiments, the input MOB features include (but are not limited to): elements of a Fock matrix, elements of a Coulomb matrix, and/or elements of an exchange matrix. As can readily be appreciated, the specific MOB features used to describe a molecular system in accordance with various embodiments of the invention are largely only limited by the requirements of specific applications.
In many embodiments, the MOB-ML processes utilize models that are trained using input datasets. Many embodiments predict certain properties of a molecular system as outputs based on relationships between the input MOB features and the properties that are learned during the training of the MOB-ML model. In some embodiments, the output properties can include (but are not limited to): (1) computable properties of molecules such as electronic energies, correlation energies, forces, vibrational frequencies, dipole moments, response properties, excited state energies and forces, and/or spectra; and (2) experimentally measurable properties of molecules such as activity coefficients, pKa, pH, partition coefficients, vapor pressures, melting, boiling, and flash points, solvation free energies, electrical conductivity, viscosity, toxicity, ADME properties, and protein binding affinities. In several embodiments, a molecular system is selected based upon the predicted property for the molecular system output by the MOB-ML model based upon the input MOB features of the molecular system. In a number of embodiments, the MOB-ML model can be used to perform generative design in which a search is performed within feature space to identify at least one set of MOB features that provide a desired molecular system property. In several embodiments, MOB features can be mapped to molecular structures using a feature-to-structure map that can be derived from a training data set using a machine learning process. The molecular system(s) corresponding to the identified set(s) of MOB features can then be further analyzed to determine the molecular system(s) most suited to a particular application. 
As can readily be appreciated, systems and methods in accordance with various embodiments of the invention can utilize any of a variety of input MOB features of a molecular system to predict any of a variety of different properties of a corresponding molecular system as appropriate to the requirements of specific applications.
In several embodiments, the molecular systems predicted by the output properties can be in the same molecular family as the input molecular systems. In many embodiments, the molecular systems predicted by the output properties can be in a different molecular family from the input molecular systems. Examples of different molecular families can include (but are not limited to): molecular compositions, molecular geometries, and/or bonding environments. Sets of input MOB features in many embodiments have no explicit dependence on atom types; thus, MOB-ML processes can enhance chemical transferability of the training results. In a number of embodiments, the MOB-ML processes are implemented as software applications.
In many embodiments, more complex models of molecular systems can be utilized including (but not limited to) graph organized MOB representations of molecular systems, as an alternative to the current matrix organized MOB representations. In a number of embodiments, quantum chemical information can be represented as an attributed graph $G(V, E, X, X^e)$. In certain embodiments, the node features of the attributed graph correspond to diagonal MOB features ($X_u = [F_{uu}, J_{uu}, K_{uu}]$) and the edge features correspond to off-diagonal MOB features ($X^e_{uv} = [F_{uv}, J_{uv}, K_{uv}]$). Graph based representations of molecular systems can enable multi-task learning. As can readily be appreciated, appropriately constructed graph representations can provide the benefit of permutation invariance and size extensivity. In many embodiments, a generalized message passing neural network (MPNN) can be utilized to perform the machine learning task from the graph-based representations to a diverse set of chemical properties. In a number of embodiments, MOB-ML processes can utilize graph representations of molecular systems to perform general chemical property classification.
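As a minimal sketch of the attributed-graph representation described above, the following Python example builds node features from the diagonal elements of the Fock (F), Coulomb (J), and exchange (K) matrices and edge features from their off-diagonal elements. The function name `build_attributed_graph`, the `edge_tol` cutoff, and the toy matrices are hypothetical and chosen only for illustration; they are not taken from any particular embodiment.

```python
import numpy as np

def build_attributed_graph(F, J, K, edge_tol=1e-8):
    """Return (node_features, edges) for MO-basis matrices F, J, K.

    node_features[u] = [F_uu, J_uu, K_uu]
    edges[(u, v)]    = [F_uv, J_uv, K_uv] for u < v
    """
    F, J, K = (np.asarray(M, dtype=float) for M in (F, J, K))
    n = F.shape[0]
    # Node features: one row of diagonal elements per orbital u.
    node_features = np.stack([np.diag(F), np.diag(J), np.diag(K)], axis=1)
    edges = {}
    for u in range(n):
        for v in range(u + 1, n):
            feat = np.array([F[u, v], J[u, v], K[u, v]])
            # Hypothetical cutoff: omit edges whose features all vanish.
            if np.any(np.abs(feat) > edge_tol):
                edges[(u, v)] = feat
    return node_features, edges

# Toy symmetric matrices for three orbitals (illustrative values only).
F = np.array([[-1.0, 0.1, 0.0], [0.1, -0.5, 0.0], [0.0, 0.0, -0.2]])
J = np.array([[0.8, 0.2, 0.0], [0.2, 0.7, 0.0], [0.0, 0.0, 0.6]])
K = np.array([[0.8, 0.05, 0.0], [0.05, 0.7, 0.0], [0.0, 0.0, 0.6]])
nodes, edges = build_attributed_graph(F, J, K)
```

Because nodes are keyed by orbital index and edges by unordered orbital pairs, relabeling the orbitals relabels the graph without changing its content, which is the sense in which such a representation can support permutation invariance.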
Previous work in quantum chemistry has focused on predicting electronic energies or densities based on atom- or geometry-specific features, such as atom-types and bonding connectivities. (See, e.g., Smith, J., et al., Chem. Sci., 2017, 8, 3192-3203; McGibbon, R. T., et al., J. Chem. Phys., 2017, 147, 161725; the disclosures of which are incorporated herein by reference). Such approaches can yield good accuracy with computational cost that is comparable to classical force fields. However, a disadvantage of the approach is that building a machine learning (ML) model to describe a diverse set of elements and chemistries can require training with respect to a number of features that grows quickly with the number of atom- and/or bond-types, and can also require vast amounts of reference data for the selection and training of those features. These issues have hindered the degree of chemical transferability of existing ML models for electronic structure. In addition, across chemical sciences and industries, computation can be hindered by the interplay between prediction accuracy and computational efficiency.
MOB-ML processes in accordance with several embodiments of the invention can improve efficiency and accuracy in quantum simulation. In a number of embodiments, the output properties generated from MOB-ML processes are transferable and thus can be used to determine molecules of different molecular systems. In some embodiments, MOB-ML processes possess transferability across molecular geometries. Several embodiments implement MOB-ML processes with transferability within a molecular family. Some embodiments implement MOB-ML processes providing transferability across bonding environments. Certain embodiments implement MOB-ML processes providing transferability across chemical elements.
Many embodiments implement chemical transferability of MOB-ML processes across molecular systems and so are capable of identifying molecules with a broad range of properties. Molecules with specific molecular system properties can be synthesized using processes in accordance with various embodiments of the invention for a wide range of product development processes such as drug discovery and material design. Examples of such embodiments include (but are not limited to): catalyst design, enzyme reactions and drug design, protein and antibody design, surface coatings, nanomaterials, solvent and electrolyte materials for batteries.
In several embodiments, the transferability of MOB-ML models is leveraged in transfer learning processes that utilize pre-trained energy based models that are transferred to general molecular properties. In a number of embodiments, the transfer learning process can include (but is not limited to) a Gaussian Process kernel transfer and/or a Neural Network based transfer learning process. Furthermore, as increasing amounts of quantum simulation data are generated, MOB-ML processes in accordance with many embodiments of the invention can actively update underlying MOB-ML models based upon new data without requiring retraining using the original training data corpus.
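One way to picture updating a model from new data without revisiting the original training corpus is sketched below. The sketch swaps in a plain linear (ridge) model, which is not necessarily the model used in any embodiment, because its fit depends on the data only through two running sums that can be accumulated batch by batch. The class name `OnlineLinearModel` and the regularization value are assumptions made for illustration.

```python
import numpy as np

class OnlineLinearModel:
    """Ridge regression kept as running sums so old batches need not be stored."""

    def __init__(self, n_features, alpha=1e-6):
        self.A = alpha * np.eye(n_features)  # accumulates X^T X + alpha * I
        self.b = np.zeros(n_features)        # accumulates X^T y

    def update(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.A += X.T @ X
        self.b += X.T @ y
        return self

    def weights(self):
        return np.linalg.solve(self.A, self.b)

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.weights()

# Toy data: an initial batch followed by a later batch of new results.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X_old = rng.normal(size=(20, 3))
X_new = rng.normal(size=(10, 3))
model = OnlineLinearModel(3).update(X_old, X_old @ w_true)
model.update(X_new, X_new @ w_true)  # uses only the new batch
```

After the second call the model matches what a single fit on both batches would give, even though the first batch was never touched again.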
Systems and methods for synthesizing molecules with specific molecular system properties and molecular-orbital-based machine learning (MOB-ML) processes that can be utilized in the design and/or synthesis of molecules in accordance with various embodiments of the invention are discussed further below.
Many embodiments utilize accurate and transferable MOB-ML processes to predict properties including (but not limited to) correlated wavefunction energies based on input features using computations including (but not limited to) a self-consistent field calculation. A method for synthesizing molecules using a MOB-ML process in accordance with an embodiment of the invention is illustrated in
Sets of MOB features for the input datasets can be obtained based on molecular orbitals (102). In some embodiments, the MOB features can include (but are not limited to) energy operators of the molecular systems. In several embodiments, input MOB features can include (but are not limited to): elements of a Fock matrix, elements of a Coulomb matrix, and/or elements of an exchange matrix. As can readily be appreciated, any of a variety of input MOB features can be utilized as appropriate to the requirements of specific applications.
In certain embodiments, quantum chemistry calculations are performed using MOB-ML processes (103). In a number of embodiments, the computations can be performed on a local computing device. In several embodiments, the calculations are performed on a remote server system. MOB-ML processes can be trained with MOB features of the input datasets.
During a training process (not shown) MOB-ML processes can learn relationships between MOB features and properties of molecular systems using a training dataset. In some embodiments, the training datasets can be subsets randomly selected from input datasets. Examples of molecular datasets in such embodiments can include (but are not limited to): QM7b, QM7b-T, GDB-13, and GDB-13-T. In several embodiments, the training datasets can be sets of molecules from the same or different molecular systems. As can readily be appreciated, any of a variety of training datasets can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
The MOB-ML processes can utilize a trained model that describes relationships between MOB features and properties of molecular systems to perform a ranking and/or categorization (104) of at least the molecules in the input dataset. In many embodiments, the MOB-ML processes can also identify novel molecules and/or molecules that are not in the input dataset based upon regions of the feature space that contain molecules that the model predicts will have desirable properties. The various ways in which MOB-ML processes can be utilized to identify molecular systems having desirable properties in accordance with various embodiments of the invention including specific examples are discussed further below.
In many embodiments, the trained MOB-ML processes generate output datasets of molecular system properties (105). The molecular system properties can include (but are not limited to): (1) computable properties of molecules such as electronic energies, correlation energies, forces, vibrational frequencies, dipole moments, response properties, excited state energies and forces, and/or spectra; and (2) experimentally measurable properties of molecules such as activity coefficients, pKa, pH, partition coefficients, vapor pressures, melting, boiling, and flash points, solvation free energies, electrical conductivity, viscosity, toxicity, ADME properties, and protein binding affinities. As can readily be appreciated, the specific features used as molecular system properties are largely only limited by the requirements of specific applications. Based on the output datasets, molecules with sets of desired molecular system properties can be identified and synthesized (106).
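The flow from features (102) through prediction (103), ranking (104), and output (105) to selection (106) can be sketched as follows. This is a minimal illustration under stated assumptions: `screen_candidates`, `toy_model`, the candidate names, and the threshold are all hypothetical, and the toy linear function merely stands in for a trained MOB-ML model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def screen_candidates(
    features: Dict[str, Sequence[float]],
    predict: Callable[[Sequence[float]], float],
    criterion: Callable[[float], bool],
) -> List[Tuple[str, float]]:
    """Predict a property per candidate, keep those meeting the criterion, rank best-first."""
    predictions = {name: predict(f) for name, f in features.items()}
    selected = [(name, p) for name, p in predictions.items() if criterion(p)]
    return sorted(selected, key=lambda item: item[1], reverse=True)

def toy_model(f: Sequence[float]) -> float:
    # Hypothetical stand-in for a trained model: a fixed linear map.
    return 2.0 * f[0] - 1.0 * f[1]

# Hypothetical candidates, each described by a two-element feature vector.
candidates = {"mol_A": [1.0, 0.5], "mol_B": [0.2, 0.9], "mol_C": [1.5, 0.1]}
shortlist = screen_candidates(candidates, toy_model, lambda p: p >= 1.0)
```

The returned shortlist contains only candidates whose predicted property meets the criterion, ordered so that the most promising candidate for synthesis comes first.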
While various processes for synthesizing chemicals using MOB-ML processes are described above with reference to
In many embodiments, MOB-ML processes enable real-time chemical modeling, design, and collaboration. In several embodiments, the MOB-ML processes are implemented in software packages that can execute on a local computer or on a remote server. Additionally, the software packages according to some embodiments, can perform calculations on many possible chemical modifications and return rank-ordered recommendations for the most promising chemical modifications. With parallel computation all of the results can be returned in seconds. In this way, processes similar to the various processes for designing molecular systems described above can be performed and the results used to generate intuitive and interactive graphical user interfaces that enable any of a variety of experimental chemists to utilize MOB-ML in the design and/or synthesis of chemicals.
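Scoring many chemical modifications concurrently and returning them in rank order might look like the sketch below. It uses Python's standard `concurrent.futures` thread pool purely for illustration; a real deployment would more plausibly use process pools or remote workers for compute-bound scoring. The names `rank_modifications` and `toy_score`, the modification labels, and the weights are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def rank_modifications(modifications, score, max_workers=4):
    """Score each candidate modification concurrently and return (name, score) best-first."""
    names = list(modifications)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so scores line up with names.
        scores = list(pool.map(lambda n: score(modifications[n]), names))
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

def toy_score(features):
    # Hypothetical stand-in for a trained property model.
    return sum(w * x for w, x in zip((0.5, -0.2, 1.0), features))

mods = {
    "add_methyl": (1.0, 0.0, 0.2),
    "add_fluoro": (0.4, 1.0, 0.9),
    "add_hydroxyl": (0.0, 0.5, 0.1),
}
ranking = rank_modifications(mods, toy_score)
```

A user interface could then display `ranking` directly as an ordered list of recommended modifications.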
A user interface that can be generated by software using a ML process implemented in accordance with an embodiment of the invention is conceptually illustrated in
While various processes for designing molecules using MOB-ML processes are described above with reference to
Dimensionality reduction of the features of molecular systems can be an important part of an MOB-ML process implemented in accordance with an embodiment of the invention. The high dimensionality of the full set of features that can be generated by a molecular system can lead to over-fitting of dimensions that serve little informative value. Many embodiments include a variety of processes that can be utilized to generate features. Many embodiments include a variety of processes that perform a dimensionality reduction of features including (but not limited to) through feature selection and/or feature transformation. Some embodiments select features based on Hartree-Fock molecular orbitals to predict post-Hartree-Fock correlated wavefunction energies. Some embodiments are based on features of orbitals defined in (tight-binding) density functional theory calculations. Several embodiments include elements of a Fock matrix, elements of a Coulomb matrix, and/or elements of an exchange matrix as features. As can readily be appreciated, any of a variety of operations can be evaluated for the molecular orbitals which can be used as input MOB features and any of a variety of input MOB features can be selected as appropriate to the requirements of a specific application. In several embodiments, dimensionality reduction can also be achieved through feature transformation techniques, such as (but not limited to) Principal Component Analysis (PCA), truncated Singular Value Decomposition (SVD), and Neural Networks. Furthermore, in a number of embodiments, MOB features can be utilized to directly train a MOB-ML model without additional dimensionality reduction. As can readily be appreciated, the specific processes for evaluating molecular orbitals, performing dimensionality reduction and/or training MOB-ML models using MOB features are largely dependent upon the requirements of specific applications.
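As an illustration of dimensionality reduction by feature transformation, the sketch below performs PCA through a singular value decomposition of the centered feature matrix. The function name `pca_reduce` and the synthetic data are assumptions for illustration only; the data are constructed to lie in a two-dimensional subspace of a five-dimensional feature space so that two components capture essentially all of the variance.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the leading principal components.

    Returns (reduced, components, mean, explained_ratio).
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean  # center each feature
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    return Xc @ components.T, components, mean, explained_ratio

# Synthetic features: 50 samples, 5 features, intrinsic dimension 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5)) + 3.0
reduced, components, mean, ratio = pca_reduce(X, 2)
```

The reduced features can then be passed to a regression model in place of the full feature set, and `reduced @ components + mean` recovers the original features up to the variance discarded by truncation.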
In many embodiments, feature generation includes a canonical ordering of the occupied and virtual molecular orbitals. Several embodiments apply localized molecular orbitals (LMOs). In a number of embodiments, MOB features can be obtained from other types of MOs including (but not limited to) canonical and natural orbitals. Some embodiments utilize Boys localization for localization in occupied space and Intrinsic Bond Orbital (IBO) localization for localization in virtual space. As can readily be appreciated, any of a variety of unitary orbital transformations can be utilized to obtain MOs as appropriate to the requirements of specific applications. In several embodiments, MOB features can be sorted by increasing distance from occupied MOs. As can readily be appreciated, any of a variety of sorting criteria can be utilized as appropriate to the requirements of specific applications. In some embodiments, automatic feature selection can be performed using any of a variety of processes including (but not limited to) random forest regression utilizing a mean decrease of accuracy criterion. As can readily be appreciated, any of a variety of processes can be utilized in the selection of features as appropriate to the requirements of specific applications. Selection and/or sorting is not required, however. A number of embodiments of the invention utilize machine learning models including (but not limited to) Neural Network models that receive MOB features as a direct input and output estimates of molecular properties for the received MOB features as an output. Various ways in which MOB-ML processes can estimate molecular properties from sets of features describing molecular systems in accordance with different embodiments of the invention are discussed further below.
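Sorting features by increasing distance from an occupied MO can be sketched as below. The distance measure is an assumption: the example uses the Euclidean distance between orbital centroids, which is only one possible choice and is not stated in the text above. The function name `sort_features_by_distance` and all numerical values are hypothetical.

```python
import numpy as np

def sort_features_by_distance(occ_centroid, other_centroids, features):
    """Order per-orbital features by distance of each orbital from an occupied MO.

    Assumption: distance is measured between orbital centroids (3D points).
    """
    occ = np.asarray(occ_centroid, dtype=float)
    cents = np.asarray(other_centroids, dtype=float)
    feats = np.asarray(features, dtype=float)
    distances = np.linalg.norm(cents - occ, axis=1)
    order = np.argsort(distances, kind="stable")  # nearest orbital first
    return feats[order], distances[order]

# One occupied MO at the origin and three other orbitals (illustrative values).
occ = [0.0, 0.0, 0.0]
centroids = [[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
raw_features = [30.0, 10.0, 20.0]
sorted_features, sorted_distances = sort_features_by_distance(occ, centroids, raw_features)
```

A fixed ordering of this kind gives each position in the feature vector a consistent meaning across molecules, which is what allows a single regression model to be trained on all of them.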
Sets of MOB features in many embodiments have no explicit dependence on atom types; thus, MOB-ML processes can enhance chemical transferability of the training results. In several embodiments, the smooth variation and local linearity of pair correlation energies as a function of MOB features of different molecular geometries and different molecules can be beneficial to the transferability of MOB-ML processes.
Many embodiments can predict properties of molecular systems including (but not limited to) post-Hartree-Fock correlated wavefunction energies using MOB features including (but not limited to) the Hartree-Fock (HF) molecular orbitals (MOs). In some embodiments, the starting point for a MOB-ML process involves decomposing the correlation energy into pairwise occupied MO contributions
$$E_{\mathrm{c}} = \sum_{ij}^{\mathrm{occ}} \varepsilon_{ij} \tag{1}$$
where the pair correlation energy εij can be written as a functional of the full set of MOs, {ϕp}, appropriately indexed by i and j
$$\varepsilon_{ij} = \varepsilon\left[\{\phi_p\}^{ij}\right]. \tag{2}$$
The functional ε can be considered universal across all chemical systems; for a given level of correlated wavefunction theory, there is a corresponding ε that maps the HF MOs to the pair correlation energy, regardless of the molecular composition or geometry. Furthermore, ε simultaneously describes the pair correlation energy for all pairs of occupied MOs (i.e., the functional form of ε does not depend on i and j). For example, in second-order Møller-Plesset perturbation theory (MP2), the pair correlation energies can be expressed as
$$\varepsilon_{ij}^{\mathrm{MP2}} = \frac{1}{4}\sum_{ab}^{\mathrm{virt}} \frac{\left|\langle ij \| ab \rangle\right|^{2}}{e_i + e_j - e_a - e_b} \tag{3}$$
where a and b index virtual MOs, ep is the Hartree-Fock orbital energy corresponding to MO ϕp, and ⟨ij∥ab⟩ are anti-symmetrized electron repulsion integrals. A corresponding expression for the pair correlation energy can exist for any post-Hartree-Fock method, but it is typically costly to evaluate in closed form.
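By way of illustration only, the MP2 pair correlation energies described above can be evaluated numerically as in the following minimal sketch. The sketch assumes the standard spin-orbital MP2 expression, and the function name `mp2_pair_energies`, the array layout `g_oovv[i, j, a, b]`, and the randomly generated toy integrals are assumptions introduced for the example rather than part of any described embodiment.

```python
import numpy as np

def mp2_pair_energies(g_oovv, e_occ, e_virt):
    """eps[i, j] = 1/4 * sum_ab |<ij||ab>|^2 / (e_i + e_j - e_a - e_b)."""
    # Denominator D[i, j, a, b] = e_i + e_j - e_a - e_b, built by broadcasting.
    denom = (e_occ[:, None, None, None] + e_occ[None, :, None, None]
             - e_virt[None, None, :, None] - e_virt[None, None, None, :])
    return 0.25 * np.sum(np.abs(g_oovv) ** 2 / denom, axis=(2, 3))

# Toy (non-physical) data: 3 occupied and 4 virtual spin orbitals.
rng = np.random.default_rng(0)
n_occ, n_virt = 3, 4
e_occ = np.sort(rng.uniform(-2.0, -0.5, n_occ))   # occupied orbital energies
e_virt = np.sort(rng.uniform(0.5, 2.0, n_virt))   # virtual orbital energies
g = rng.normal(size=(n_occ, n_occ, n_virt, n_virt))
# Enforce the antisymmetry <ij||ab> = -<ji||ab> = -<ij||ba>.
g = g - g.transpose(1, 0, 2, 3)
g = g - g.transpose(0, 1, 3, 2)

eps = mp2_pair_energies(g, e_occ, e_virt)
e_corr = eps.sum()  # correlation energy as the sum over occupied pairs
```

In this spin-orbital convention the diagonal pair energies vanish because the antisymmetrized integrals with i=j are zero; a spatial-orbital formulation would differ in that respect.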
In MOB-ML, a machine learning model can be constructed for the pair energy functional
$$\varepsilon_{ij} \approx \varepsilon^{\mathrm{ML}}\left[\mathbf{f}_{ij}\right] \tag{4}$$
where fij denotes a vector of features associated with MOs i and j. Eq. 4 thus presents the opportunity for the machine learning of a universal density matrix functional for correlated wavefunction energies, which can be evaluated at the cost of the MO calculation.
The features fij can correspond to unique elements of the Fock (F), Coulomb (J), and exchange (K) matrices between ϕi, ϕj and the set of virtual orbitals. Some embodiments include features associated with matrix elements between pairs of occupied orbitals for which one member of the pair differs from ϕi or ϕj (i.e., non-i, j occupied MO pairs). In several embodiments, the feature vector can take the form
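$$\mathbf{f}_{ij} = \left(F_{ii}, F_{ij}, F_{jj}, \mathbf{F}_{i}^{o}, \mathbf{F}_{j}^{o}, \mathbf{F}_{ij}^{vv}, J_{ii}, J_{ij}, J_{jj}, \mathbf{J}_{i}^{o}, \mathbf{J}_{i}^{v}, \mathbf{J}_{j}^{v}, \mathbf{J}_{ij}^{vv}, K_{ij}, \mathbf{K}_{i}^{o}, \mathbf{K}_{j}^{o}, \mathbf{K}_{i}^{v}, \mathbf{K}_{j}^{v}, \mathbf{K}_{ij}^{vv}\right) \tag{5}$$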
where for a given matrix (F, J, or K) the superscript o denotes a row of its occupied-occupied block, the superscript v denotes a row of its occupied-virtual block, and the superscript vv denotes its virtual-virtual block. Redundant elements can be removed, such that the virtual-virtual block is represented by its upper triangle and the diagonal elements of K (which are identical to those of J) are omitted. To increase transferability and accuracy, ϕi and ϕj can be chosen to be localized molecular orbitals (LMOs) rather than canonical MOs, and valence virtual LMOs can be employed in place of the set of all virtual MOs. Furthermore, Eq. 4 can be separated to independently machine learn the cases of i=j and i≠j,
$$\varepsilon_{ij} \approx \begin{cases} \varepsilon_{\mathrm{d}}^{\mathrm{ML}}\left[\mathbf{f}_{i}\right] & \text{if } i = j \\ \varepsilon_{\mathrm{o}}^{\mathrm{ML}}\left[\mathbf{f}_{ij}\right] & \text{if } i \neq j \end{cases} \tag{6}$$
where fi denotes fii (Eq. 5) with redundant elements removed; by separating the pair energies in this way, the situation where a single ML model is required to distinguish between the cases of i=j and ϕi being nearly degenerate to ϕj is avoided, a distinction which can represent a sharp variation in the function to be learned.
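The separation into diagonal (i=j) and off-diagonal (i≠j) models described above can be sketched as follows. This is a minimal illustration only: the function `correlation_energy`, the dictionary layout keyed by (i, j), and the two stand-in models are hypothetical and are not trained MOB-ML models.

```python
def correlation_energy(pair_features, diag_model, offdiag_model):
    """Sum ML pair energies, using separate models for i == j and i != j."""
    total = 0.0
    for (i, j), f in pair_features.items():
        # Dispatch on the pair type so neither model has to learn both cases.
        model = diag_model if i == j else offdiag_model
        total += model(f)
    return total

# Hypothetical stand-in models; real embodiments would use trained regressors.
diag_model = lambda f: -0.02 * sum(f)
offdiag_model = lambda f: -0.001 * sum(f)

# Toy feature vectors for a two-orbital system, keyed by (i, j).
pair_features = {
    (0, 0): [1.0, 2.0],
    (0, 1): [0.5, 0.5],
    (1, 0): [0.5, 0.5],
    (1, 1): [2.0, 1.0],
}
e_corr = correlation_energy(pair_features, diag_model, offdiag_model)
```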
Many embodiments introduce technical refinements to improve training efficiency, that is, the accuracy and transferability of the model as a function of the number of training examples.
Some embodiments implement occupied LMO symmetrization. In this way, the feature vector can be pre-processed to specify a canonical ordering of the occupied and virtual LMO pairs. This can reduce permutation of elements in the feature vector, resulting in greater ML training efficiency. Matrix elements Mij (M=F, J, K) associated with ϕi and ϕj can be rotated into gerade and ungerade combinations
with the sign convention that Fij is negative. Here, p indexes any LMO other than i or j, for example an occupied LMO k, such that i≠k≠j, or a valence virtual LMO. As can readily be appreciated, any rotation of pairs of orbitals can be applied as appropriate to the requirements of specific applications.
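For a symmetric one-electron matrix such as F, rotation of the orbital pair into the combinations (ϕi ± ϕj)/√2 can be sketched as below. The sketch is illustrative only: the function name `symmetrize_pair` is hypothetical, and the sign convention is interpreted here, as an assumption, to mean that the phase of ϕj is chosen so that Fij is non-positive before the rotation is applied. The indices i and j are assumed to be distinct.

```python
import numpy as np

def symmetrize_pair(F, i, j):
    """Rotate orbitals i and j of symmetric F into (phi_i +/- phi_j)/sqrt(2)."""
    n = F.shape[0]
    # Assumed sign convention: flip the phase of phi_j if F_ij is positive.
    phase = -1.0 if F[i, j] > 0 else 1.0
    s = 1.0 / np.sqrt(2.0)
    U = np.eye(n)
    U[i, i], U[i, j] = s, s * phase    # row i: "plus" combination
    U[j, i], U[j, j] = s, -s * phase   # row j: "minus" combination
    return U @ F @ U.T                 # orthogonal similarity transform

F = np.array([[-1.0, 0.2, 0.1],
              [0.2, -2.0, 0.3],
              [0.1, 0.3, 0.5]])
F_sym = symmetrize_pair(F, 0, 1)
```

After the rotation, the diagonal elements become ½(Fii+Fjj)∓|Fij|, the off-diagonal element becomes ½(Fii−Fjj), and the trace is preserved because the transformation is orthogonal.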
Several embodiments implement LMO sorting. The LMO pairs can be sorted by increasing distance from occupied orbitals ϕi and ϕj. Sorting in this way can result in features corresponding to LMOs being listed in decreasing order of heuristic importance in such a way that the mapping between LMOs and their associated features is roughly preserved. In some embodiments, the LMO pairs can be sorted by decreasing approximate energy contribution to the correlation energy of the occupied orbitals ϕi and ϕj. As can readily be appreciated, any of a variety of sorting criteria can be utilized as appropriate to the requirements of specific applications.
For purposes of sorting, distance can be defined as
$$R_{a}^{ij} = \left\| \langle\phi_i|\hat{R}|\phi_i\rangle - \langle\phi_a|\hat{R}|\phi_a\rangle \right\| + \left\| \langle\phi_j|\hat{R}|\phi_j\rangle - \langle\phi_a|\hat{R}|\phi_a\rangle \right\| \tag{8}$$
where ϕa is a virtual LMO, R̂ is the Cartesian position operator, and ∥·∥ denotes the L2-norm. The term ∥⟨ϕi|R̂|ϕi⟩−⟨ϕa|R̂|ϕa⟩∥ represents the Euclidean distance between the centroids of orbital i and orbital a. Distances can alternatively be defined based on Coulomb repulsion, although this can sometimes lead to inconsistent sorting in systems with strongly polarized bonds. The non-i, j occupied LMO pairs can be sorted in the same manner as the virtual LMO pairs. As can readily be appreciated, any of a variety of distance measurements can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
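The centroid-based sorting described above can be sketched as follows, assuming the orbital centroids ⟨ϕ|R̂|ϕ⟩ have already been computed and stored as rows of an array. The function name `sort_by_distance` and the toy coordinates are hypothetical.

```python
import numpy as np

def sort_by_distance(centroids, i, j, orbital_idx):
    """Sort orbitals a by |r_i - r_a| + |r_j - r_a| (increasing)."""
    r_i, r_j = centroids[i], centroids[j]
    dist = np.array([
        np.linalg.norm(r_i - centroids[a]) + np.linalg.norm(r_j - centroids[a])
        for a in orbital_idx
    ])
    order = np.argsort(dist, kind="stable")  # stable sort keeps ties in input order
    return [orbital_idx[k] for k in order], dist[order]

# Toy centroids (arbitrary units): orbitals 0 and 1 are the occupied pair.
centroids = np.array([
    [0.0, 0.0, 0.0],   # orbital 0 (phi_i)
    [1.0, 0.0, 0.0],   # orbital 1 (phi_j)
    [5.0, 0.0, 0.0],   # orbital 2
    [0.5, 0.0, 0.0],   # orbital 3
    [2.0, 0.0, 0.0],   # orbital 4
])
sorted_idx, sorted_dist = sort_by_distance(centroids, 0, 1, [2, 3, 4])
```

The same routine can be applied to the non-i, j occupied LMOs by passing their indices in place of the virtual indices.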
Several embodiments implement orbital localization. In some embodiments, Intrinsic Bonding Orbital (IBO) localization can be used to obtain the occupied LMOs. In a number of embodiments, Boys localization can be used to obtain the occupied LMOs. Particularly for molecules that include triple bonds or multiple lone pairs, Boys localization can provide more consistent localization as a function of small geometry changes than IBO localization; and the chemically unintuitive mixing of σ and π bonds in Boys localization (“banana bonds”) does not present a problem for the MOB-ML process. As can readily be appreciated, any of a variety of unitary orbital transformations can be utilized to obtain MOs as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments implement dimensionality reduction of MOB features. Prior to training, automatic feature selection and/or transformation can be performed using processes including (but not limited to) random forest regression with the mean decrease of accuracy criterion or permutation importance. Feature selection can be particularly beneficial in embodiments that implement Gaussian Process Regression (GPR), the performance of which is known to degrade for high-dimensional datasets (in practice, 50-100 features). The use of the full feature set with small molecules can also lead to overfitting as features become correlated. As can readily be appreciated, any of a variety of sets of MOB features can be utilized to express a feature space of a molecular system as appropriate to the requirements of specific MOB-ML and/or molecular synthesis processes in accordance with various embodiments of the invention.
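One possible realization of such feature selection, using random forest regression together with permutation importance as implemented in scikit-learn, is sketched below. The synthetic data, the number of retained features `n_keep`, and the forest size are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for MOB features: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.01 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: drop in score when one feature column is shuffled.
result = permutation_importance(forest, X, y, n_repeats=5, random_state=0)

n_keep = 5  # assumed number of features to retain
selected = np.argsort(result.importances_mean)[::-1][:n_keep]
X_reduced = X[:, selected]  # reduced feature matrix passed on to regression
```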
While various processes for MOB feature selection are described above, any of a variety of processes that utilize quantum theory to select MOB features can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Processes for identifying MOB feature distance metrics in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention can rely upon the use of distance metrics that measure the distance between the MOB features of different molecular systems in feature space. In many embodiments, chemical space structure discovery is further enhanced by utilizing subspace embedding techniques and/or autoencoder techniques to discover the local and global structures of MOB feature space. As is discussed further below, any of a variety of distance measures and/or structure discovery techniques can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments utilize MOB features together with (but not limited to) a set of distance measures between pairs of molecular orbitals in MOB feature space. In this space, a distance can be defined which distinguishes pairs based on their MOB features. Specific implementations can include (but are not limited to): Euclidean distance in the space of MOB features or in a subspace thereof; kernel distance measures such as those employed in Gaussian Process Regression in the space of MOB features or in a subspace thereof, including but not limited to exponential, squared exponential, and Matérn kernels; and measures based on manifold learning in the space of MOB features or in a subspace thereof, including but not limited to diffusion maps, t-distributed stochastic neighbor embedding, and isomap. In embodiments that utilize Gaussian Process Regression and in which kernel distance measures are utilized, the Nyström method can be utilized to perform sampling of the kernel matrix to enable the Gaussian Process Regression to be performed in a more computationally efficient manner with little or no accuracy loss. Furthermore, the kernels used in Gaussian Process Regression can be extended to functions constructed from MOB feature space using Neural Networks. In certain embodiments, physical intuition can also be incorporated into the construction of the kernel. MOB features can be ordered according to various distance measures in accordance with many embodiments of the invention. As can readily be appreciated, any of a variety of distance metric implementations can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
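As a minimal sketch of two of the distance measures named above, the following compares a Euclidean distance with the distance induced by a squared exponential kernel. The function names and the `length_scale` parameter are assumptions introduced for illustration.

```python
import numpy as np

def euclidean_distance(x, y):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def sq_exp_kernel(x, y, length_scale=1.0):
    """Squared exponential kernel k(x, y) = exp(-|x - y|^2 / (2 l^2))."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return float(np.exp(-0.5 * d2 / length_scale ** 2))

def kernel_distance(x, y, kernel=sq_exp_kernel):
    """Distance induced by a kernel: sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    d2 = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return float(np.sqrt(max(d2, 0.0)))  # clip tiny negative round-off

f1 = [0.0, 0.0, 0.0]
f2 = [0.1, 0.0, 0.0]
f3 = [10.0, 0.0, 0.0]
d_near = kernel_distance(f1, f2)
d_far = kernel_distance(f1, f3)
```

Unlike the Euclidean distance, the kernel-induced distance saturates (here at √2) for feature vectors that are far apart relative to the length scale.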
Appropriately obtained sets of MOB features can provide a faithful and structured representation of chemical space. Exploration and discovery of the local and global structures of an MOB feature space can be facilitated using discovery techniques including (but not limited to) subspace embedding techniques and/or autoencoder techniques. The use of such discovery techniques can enhance MOB-ML process accuracy and/or provide physical insights for chemists to understand trends and similarities across chemical systems. The term subspace embedding is generally used to describe a set of techniques that can simplify the analysis of high dimensional data, which can be especially useful for sparse data. In a number of embodiments, subspace embedding techniques including (but not limited to) Uniform Manifold Approximation and Projection (UMAP), t-Stochastic Neighbor Embedding (t-SNE), and/or Oblivious Subspace Embedding (OSE) are utilized to reduce a high dimensional MOB feature space to a relatively low-dimensional subspace and facilitate chemical space structure discovery in accordance with various embodiments of the invention. Similarly, an autoencoder such as (but not limited to) an autoencoder neural network can be utilized to perform dimensionality reduction by learning a vector subspace embedding for a higher dimensionality MOB feature space. In a number of embodiments, a subspace embedding can be performed that preserves relative distance measurements between sets of MOB features in the higher dimensional MOB feature space to enable exploration of the properties of different sets of MOB features in the lower dimensionality subspace. As can readily be appreciated, the specific subspace embedding process utilized is largely dependent upon the requirements of a given application.
Several embodiments include pair correlation energies as a function of MOB features such that smooth variation and local linearity can be obtained for different molecules with different molecular geometries and hence enhance the transferability of MOB-ML processes.
While systems and methods that include various MOB feature distance metrics are described above, any of a variety of processes for measuring distance between the MOB features of different molecular systems can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Processes for generating orbital pairs databases in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention are capable of generating databases of molecular orbital pairs. As is discussed further below, any of a variety of orbital pair databases can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments implement MOB-ML processes that store, organize, and classify databases that include (but are not limited to) molecular orbitals which form the basis for the associated MOB feature values. In some embodiments, the MOB feature values can be output from MOB-ML processes, using processes similar to those described above with respect to
The databases 410 can be queried to generate datasets corresponding to particular sets of molecules, molecular geometries, levels of theory, or any combination thereof. Various embodiments employ SQL databases such as MySQL or NoSQL databases such as MongoDB distributed across one or more computers. The databases, according to various embodiments, can be queried to find MOB features near a given set of MOB features on the basis of a distance metric measured between pairs of molecular orbitals in MOB feature space. Several embodiments enable the databases to be queried to find molecular systems on the basis of the MOB feature values associated with the molecular orbitals associated with those molecular systems. Examples of such embodiments can include (but are not limited to) employing k-d trees in the space of MOB features. As can readily be appreciated, any of a variety of implementations of database indexes and/or data structures that facilitate searching can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
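A nearest-neighbor query of the kind described above can be sketched with the k-d tree implementation in scikit-learn. The in-memory array standing in for the database, the random feature values, and the `pair_ids` labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical stand-in for stored MOB feature vectors (500 orbital pairs).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 8))
# Hypothetical record identifiers: (molecule name, orbital-pair index).
pair_ids = [("mol_%d" % (t // 10), t % 10) for t in range(500)]

tree = KDTree(features)  # Euclidean metric by default

# Find the three stored feature vectors closest to a query vector.
query = features[42:43]            # shape (1, 8); query must be 2-D
dist, idx = tree.query(query, k=3)
nearest = [pair_ids[t] for t in idx[0]]
```

Low-dimensional feature subspaces are generally better suited to k-d trees than the full feature space, since tree-based search degrades as dimensionality grows.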
While various processes for generating orbital pairs databases are described above, any of a variety of orbital pairs databases of different molecular systems can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Processes for harvesting MOB features in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention rely upon harvesting MOB features from quantum chemistry calculations. As is discussed further below, any of a variety of MOB feature harvesters can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments implement MOB-ML processes to collect and harvest MOB feature values from the output of quantum chemistry calculations. In some embodiments, harvested MOB feature values are included based on their distance, measured between pairs of molecular orbitals, to the MOB feature values that are already stored within a database of molecular orbitals. In other embodiments, harvested MOB feature values are eliminated based on their distance to the MOB feature values that are stored within the databases of molecular orbitals.
A method for collecting and harvesting MOB features using a MOB-ML process in accordance with an embodiment is illustrated in
While various processes for harvesting MOB features are described above, any of a variety of processes that can collect and harvest MOB features of different molecular systems can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Processes for machine learning regression in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention rely upon machine learning techniques including (but not limited to) machine learning regression. As is discussed further below, any of a variety of machine learning regression methods can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments include MOB-ML processes that incorporate molecular orbital databases to determine accurate molecular system properties. Examples of such embodiments are illustrated in
In many embodiments, the molecular system properties that are determined using the MOB-ML process include (but are not limited to) quantum mechanical energies, forces, vibrational frequencies (Hessians), dipole moments, response properties, excited state energies and forces, and spectra. As can readily be appreciated, any of a variety of molecular system properties can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In some embodiments, predicted forces and Hessians can be used to optimize the geometry of the molecular system to a local minimum or saddle point. In several embodiments, predicted forces can be used to run molecular dynamics. In other embodiments, predicted energies and forces can be used to perform configurational sampling. The predictions, according to several embodiments, can be made for high-level theories on the basis of MOB feature values that are obtained using a smaller atom-centered basis set. Examples of such high-level theories include (but are not limited to) coupled cluster theory using a large atom-centered basis set. In some embodiments, predictions can be made for high-level theories on the basis of MOB feature values that include data from intermediate-level theories. Examples of high-level theories include (but are not limited to) coupled cluster theory, and examples of intermediate-level theories include (but are not limited to) MP2 theory. As can readily be appreciated, the specific high-level and intermediate-level theories that are utilized are largely only limited by the requirements of specific applications.
As the amount of quantum simulation data increases, MOB-ML processes in accordance with many embodiments of the invention can utilize online learning techniques to continuously update MOB-ML models without retraining the models using the entirety of the original training data set. In a number of embodiments, a variational Gaussian Process formalism can be generalized to minibatched training for efficient online learning within an MOB-ML process. As can readily be appreciated, any of a variety of online ML techniques can be utilized to update previously trained MOB-ML models using additional quantum simulation data as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In several embodiments, software implementations of MOB-ML models can provide user interfaces that enable a user to efficiently update an existing MOB-ML model using additional sources of quantum simulation data selected by the user including (but not limited to) streams of quantum simulation data.
In many instances, limited numbers of quantum simulations and/or experimental data may be available with respect to a particular molecular property. In a number of embodiments, the transferability of MOB-ML models is utilized to perform a transfer learning process that utilizes a MOB-ML model trained with respect to a first set of molecular properties as an input to a training process that learns relationships between a set of quantum simulations and/or experimental data and a second set of molecular properties. In several embodiments, pre-trained energy based models can be utilized as inputs to a transfer learning process. In a number of embodiments, a transfer learning process such as (but not limited to) Gaussian Process kernel transfer and/or Neural Network transfer learning processes can be utilized as appropriate to the requirements of specific applications. The well-structured chemical space obtained from MOB features can also provide a latent space for regularizing an easily accessible atomic or sequence level representation to enhance transferability and enable an end-to-end machine learning model. Such a model can be particularly useful when limited experimental and/or quantum simulation data is available for a new molecular property.
While various processes for machine learning regression are described above, any of a variety of machine learning regression methods can be utilized in ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention including (but not limited to) ML processes that are trained using graph representations of quantum chemical information (see discussion above). MOB-ML processes that utilize clustering, regression and/or classification during training and/or evaluation in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention rely upon regression clustering, regression, and classification workflows for training and evaluating MOB-ML processes. As is discussed further below, any of a variety of workflows can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
As the cost of GPR training scales cubically with the amount of data and becomes a computational bottleneck for large training sets, many embodiments implement clustering, regression, and/or classification steps into MOB-ML processes. In some embodiments, regression clustering (RC) can be used to partition the training data to best fit an ensemble of linear regression (LR) models. In several embodiments, each cluster can be regressed independently, using either LR or GPR. In yet some embodiments, a random forest classifier (RFC) can be trained for the determination of cluster assignments based on MOB feature values. RC recapitulates chemically intuitive groupings of the frontier molecular orbitals. Embodiments of MOB-ML processes including RC, LR, and RFC steps and RC, GPR, RFC steps can provide good prediction accuracy with greatly reduced wall-clock training times. In many embodiments, any of a variety of unsupervised and/or supervised clustering strategies can be utilized including (but not limited to) clustering on an embedded subspace and/or latent space. Furthermore, classification accuracy can be improved by applying different classifiers and soft clustering with different voting schemes. As can readily be appreciated, the specific clustering, regression and/or classification techniques that are utilized are largely only limited by the requirements of specific applications.
Many embodiments utilize RC to identify linear clusters and take advantage of the local linearity of pair correlation energies as a function of MOB features. Consider the set of M datapoints {ft, εt} ⊂ ℝ^d × ℝ, where d is the length of the MOB feature vector and where each datapoint is indexed by t and corresponds to a MOB feature vector and the associated reference value (i.e., label) for the pair correlation energy. To separate these datapoints into locally linear clusters, S1, . . . SN, a solution to the following optimization problem can be sought in accordance with an embodiment:
$$\min_{S_1,\ldots,S_N} \sum_{k=1}^{N} \sum_{t \in S_k} \left| A(S_k)\cdot \mathbf{f}_t + b(S_k) - \varepsilon_t \right|^{2} \tag{9}$$
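where A(Sk) ∈ ℝ^d and b(Sk) ∈ ℝ can be obtained via the ordinary least squares (OLS) solution,
$$\left(A(S_k),\, b(S_k)\right) = \underset{A \in \mathbb{R}^{d},\; b \in \mathbb{R}}{\arg\min} \sum_{t \in S_k} \left| A \cdot \mathbf{f}_t + b - \varepsilon_t \right|^{2} \tag{10}$$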
Each resulting Sk is the set of indices t assigned to cluster k comprised of |Sk| datapoints. A modified version of the greedy algorithm (
Algorithm as shown in
$$|D_{n,t}|^{2} = \left| A(S_n)\cdot \mathbf{f}_t + b(S_n) - \varepsilon_t \right|^{2} \tag{11}$$
where Dn,t is the distance of this point to cluster n. A datapoint can be equidistant to two or more different clusters by this metric; in such cases, the datapoint is randomly assigned to only one of those equidistant clusters to enforce the pairwise-disjointness of the resulting clusters. Convergence of the greedy algorithm can be measured by the decrease in the objective function of Eq. 9.
Processes in accordance with many embodiments rely upon regression clustering. RC can be performed using the ordinary least squares linear regression implementation in the SCIKIT-LEARN package. In some embodiments, the greedy algorithm can be initialized from the results of K-means clustering, also implemented in SCIKIT-LEARN. In several embodiments, K-means initialization can improve the subsequent training of the random forest classifier (RFC) in comparison to random initialization. In some embodiments, a convergence threshold of 1×10⁻⁸ kcal²/mol² for the loss function of the greedy algorithm (Eq. 9) can lead to no degradation in the final MOB-ML regression accuracy.
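A plain alternating variant of regression clustering, initialized from K-means and built on the scikit-learn linear regression implementation, can be sketched as follows. This is an illustrative simplification and not necessarily the modified greedy algorithm of the embodiments: ties between equidistant clusters are resolved by taking the first cluster rather than by random assignment, and the function name, toy data, and `max_iter` value are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def regression_clustering(F, eps, n_clusters, tol=1e-8, max_iter=100, seed=0):
    """Alternate between per-cluster OLS fits and closest-fit reassignment."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(F)
    prev_loss, loss = np.inf, np.inf
    for _ in range(max_iter):
        # Squared residual |A(S_k).f_t + b(S_k) - eps_t|^2 for every t and k.
        resid2 = np.full((len(eps), n_clusters), np.inf)
        for k in range(n_clusters):
            mask = labels == k
            if not mask.any():
                continue  # an emptied cluster stays empty
            model = LinearRegression().fit(F[mask], eps[mask])
            resid2[:, k] = (model.predict(F) - eps) ** 2
        labels = resid2.argmin(axis=1)       # first cluster wins ties (simplification)
        loss = resid2.min(axis=1).sum()      # value of the clustering objective
        if prev_loss - loss < tol:           # converged: objective stopped decreasing
            break
        prev_loss = loss
    return labels, loss

# Toy data: two hidden linear relationships sharing one feature space.
rng = np.random.default_rng(0)
F = rng.uniform(-1.0, 1.0, size=(200, 2))
group = rng.integers(0, 2, size=200)
eps = np.where(group == 0,
               F @ np.array([2.0, -1.0]) + 0.5,
               F @ np.array([-3.0, 0.5]) - 1.0)
labels, loss = regression_clustering(F, eps, n_clusters=2)
```

Because each step either refits the linear models or reassigns points to their best-fitting cluster, the objective cannot increase from one iteration to the next, although the procedure may settle in a local minimum.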
Processes in accordance with many embodiments rely upon regression. Some embodiments include ordinary least-squares linear regression (LR) as regression models. Several embodiments include Gaussian Process Regression (GPR) as regression models. In many embodiments, regression can be independently performed for the training data associated with each cluster, yielding a local regression model for each cluster. In several embodiments, regression can be independently performed for the diagonal and off-diagonal pair correlation energies (εdML and εoML) yielding independent regression models for each (Eq. 6). GPR can be performed using a negative log marginal likelihood objective.
Processes in accordance with many embodiments rely upon classification. In many embodiments, an RFC can be trained on MOB-ML features and cluster labels for a training set and then used to predict the cluster assignment of test datapoints in MOB-ML feature space. Some embodiments include the RFC implementation in SCIKIT-LEARN, using 200 trees, the entropy split criterion, and balanced class weights. Several embodiments include alternative classifiers including (but not limited to) K-means, Linear SVM, and AdaBoost. As can readily be appreciated, the specific classifiers that are utilized are largely only limited by the requirements of specific applications.
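The classifier settings named above map onto the scikit-learn `RandomForestClassifier` as sketched below. The two synthetic, well-separated clusters and the random seed are assumptions used only to make the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: feature vectors with cluster labels from a prior
# clustering step (here simply two well-separated blobs).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                     rng.normal(10.0, 1.0, size=(100, 2))])
cluster_labels = np.array([0] * 100 + [1] * 100)

rfc = RandomForestClassifier(
    n_estimators=200,         # 200 trees
    criterion="entropy",      # entropy split criterion
    class_weight="balanced",  # balanced class weights
    random_state=0,
).fit(X_train, cluster_labels)

# Assign test datapoints to clusters before applying per-cluster regression.
X_test = np.array([[0.0, 0.0], [10.0, 10.0]])
predicted_clusters = rfc.predict(X_test)
```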
Processes in accordance with many embodiments of the invention rely upon a clustering/regression/classification workflow.
In many embodiments, the resulting MOB-ML process can be specified in terms of the method of clustering (RC), the method of regression (either LR or GPR), and the method of classification (either RFC or the perfect classifier). A notation that specifies these options (e.g., RC/LR/RFC or RC/GPR/perfect) can be used to refer to a given MOB-ML process.
To improve the accuracy and reduce the uncertainty in MOB-ML processes, many embodiments train 10 independent models using the clustering/regression/classification workflow. In several embodiments, predictions are computed by averaging over the 10 models, and the predictive mean is reported together with the corresponding standard error of the mean (SEM).
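The predictive mean and SEM over an ensemble can be computed as in the following sketch. The prediction values are made-up placeholders, and the use of the sample standard deviation (`ddof=1`) is an assumption about how the SEM is defined.

```python
import numpy as np

# Hypothetical predictions: rows are the 10 models, columns are molecules.
predictions = np.array([
    [1.0, -0.50],
    [2.0, -0.52],
    [3.0, -0.48],
    [4.0, -0.51],
    [5.0, -0.49],
    [6.0, -0.50],
    [7.0, -0.53],
    [8.0, -0.47],
    [9.0, -0.50],
    [10.0, -0.50],
])

n_models = predictions.shape[0]
predictive_mean = predictions.mean(axis=0)
# SEM = sample standard deviation / sqrt(number of models).
sem = predictions.std(axis=0, ddof=1) / np.sqrt(n_models)
```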
While various processes for regression clustering are described above, any of a variety of clustering methods can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Processes for molecular synthesis in accordance with various embodiments of the invention are discussed further below.
Processes in accordance with various embodiments of the invention can be utilized to synthesize molecules. In several embodiments, MOB-ML processes are utilized to conduct a virtual screen of a set of candidate molecular systems based upon a set of one or more criteria related to chemical properties predicted by the MOB-ML model. In a number of embodiments, a molecular system is identified using an inverse design or generative process in which a search of a MOB feature space is performed based upon a set of one or more criteria related to chemical properties predicted by the MOB-ML model. Sets of MOB features that are predicted to possess desirable chemical properties by the MOB-ML model can then be utilized to identify molecular structures corresponding to the MOB features that are likely to possess the desired chemical properties. As is discussed further below, any of a variety of chemical property criteria can be utilized to perform virtual screening and/or inverse molecular design as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments implement MOB-ML processes that screen a set of candidate molecular systems based upon a set of criteria related to one or more desirable chemical properties to identify a molecular structure to synthesize. A method for screening candidate molecular systems using a MOB-ML process as part of a process for synthesizing a molecular system having a set of desirable characteristics in accordance with an embodiment of the invention is illustrated in
In several embodiments, an ML model that estimates one or more chemical properties based upon a quantum chemistry representation of a molecular system can be utilized in the virtual screening of the set of candidate molecular systems. In the illustrated embodiment, molecular system properties for the candidate molecular systems are predicted (903) using an MOB-ML model trained using a process similar to any of the various processes described above. As can readily be appreciated, the specific ML model depends largely upon the quantum chemistry representation utilized to represent the candidate molecular systems, any processes utilized to reduce the dimensionality of the feature space of the quantum chemistry representation, the specific chemical properties predicted by the ML model, and/or the requirements of specific applications.
Predicted chemical properties of candidate molecular systems can be utilized to screen the candidate molecular systems in accordance with one or more criteria related to a desirable set of molecular system chemical properties. In many embodiments, additional criteria can also be utilized as part of the screen including known chemical properties of particular molecular systems such as (but not limited to) water solubility and/or toxicity. In several embodiments, the synthesis process can also further optimize the chemical structure of an identified molecular system to further enhance one or more desirable chemical properties. As can readily be appreciated, decreasing an undesirable chemical property can be treated in an equivalent manner to increasing a desirable chemical property. The candidate molecular system(s) determined to satisfy the set of criteria of the screening process can be output as report information, and/or synthesized (905).
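The screening step described above can be sketched as a simple filter over predicted properties. The candidate names, property keys, numeric values, and criteria ranges below are hypothetical placeholders and do not correspond to real predictions.

```python
def screen(candidates, criteria):
    """Return candidates whose predicted properties all lie within [low, high]."""
    passed = []
    for name, props in candidates.items():
        if all(low <= props[key] <= high for key, (low, high) in criteria.items()):
            passed.append(name)
    return passed

# Hypothetical predicted properties for three candidate molecular systems.
candidates = {
    "candidate_A": {"redox_potential": 1.2, "solubility": 0.8},
    "candidate_B": {"redox_potential": 0.4, "solubility": 0.9},
    "candidate_C": {"redox_potential": 1.5, "solubility": 0.1},
}
# Criteria as inclusive (low, high) ranges; an undesirable property can be
# screened by bounding it from above.
criteria = {
    "redox_potential": (1.0, 2.0),
    "solubility": (0.5, float("inf")),
}
selected = screen(candidates, criteria)
```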
While many quantum chemistry ML processes utilize candidate molecular systems as a starting point, the process of training a ML model based upon feature vectors derived from quantum chemistry information can inherently define a feature space that can be used for inverse molecule design. Accordingly, systems and methods in accordance with many embodiments of the invention utilize a quantum chemistry feature space to identify sets of quantum chemistry features that are likely to result in a molecular system with desirable chemical properties, and then identify molecular systems corresponding to the identified set of quantum chemistry features.
A process for synthesizing a molecular system having a desired set of chemical properties identified using an inverse molecule design process in accordance with an embodiment of the invention is illustrated in
A search (922) can then be performed within the feature space of the ML model to identify sets of features that the ML model predicts will have a set of chemical properties that satisfy a set of search criteria. In a number of embodiments, the search can be conducted using a non-linear optimization process. In a number of embodiments, the search can be performed using a generative model such as (but not limited to) a variational autoencoder (VAE), a generative adversarial network (GAN), and/or graph kernels. The generative models can be utilized to learn how to generate sets of features that successively improve upon the extent to which the ML model predicts that the generated sets of features satisfy the set of criteria of the search. As can readily be appreciated, any of a variety of techniques can be utilized to identify one or more sets of features within the feature space that an ML model predicts will have chemical properties satisfying a set of one or more chemical property criteria.
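As a minimal sketch of a search conducted by non-linear optimization, the following example adjusts a feature vector until a model's predicted property matches a target value. The function `predict_property` is a hypothetical stand-in for a trained ML model, and the finite-difference gradient descent, step size, and iteration count are assumptions chosen for simplicity rather than the optimizer of any particular embodiment.

```python
import numpy as np


def predict_property(features):
    """Hypothetical stand-in for a trained ML model (one scalar property)."""
    weights = np.array([0.5, -1.0, 2.0])
    return float(np.tanh(features @ weights))


def objective(features, target):
    """Squared deviation of the predicted property from the target."""
    return (predict_property(features) - target) ** 2


def numerical_gradient(func, x, h=1e-5):
    """Central finite-difference gradient of func at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad


def search_feature_space(target, x0, learning_rate=0.05, steps=500):
    """Gradient descent on the objective, starting from x0."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= learning_rate * numerical_gradient(lambda v: objective(v, target), x)
    return x


best = search_feature_space(target=0.5, x0=[0.0, 0.0, 0.0])
print(best, predict_property(best))
```

In practice the identified feature vector would also need to be constrained to regions of the feature space that correspond to realizable molecular systems, which this sketch does not attempt.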
As can readily be appreciated, the feature space corresponds to quantum chemical representations of molecular systems. Therefore, the inverse molecular design process involves identification (923) of a molecular system possessing a quantum chemical representation corresponding to the identified set of features. In a number of embodiments, the mapping of a set of features in the feature space of the ML model to a molecular system can be achieved using a feature-structure map. In several embodiments, the feature-structure map can be learned from a set of training data in which molecular structures with bonding information and/or any other atomic representations are annotated with sets of features in the feature space. In a number of embodiments, the molecular structures can be represented as SMILES strings. As can readily be appreciated, any of a variety of training data sets and/or machine learning processes can be utilized to learn a process for mapping from a feature space to specific molecular structures.
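For illustration, one very simple form of feature-structure map is a nearest-neighbor lookup over annotated training pairs. The SMILES strings below are real, but the two-dimensional feature vectors attached to them are invented placeholders, and the nearest-neighbor rule is an assumption standing in for whatever learned mapping a given embodiment uses.

```python
import math

# Hypothetical annotations: SMILES string -> feature vector in the ML feature space.
annotated = {
    "O": [0.9, 0.1],   # water
    "C": [0.1, 0.8],   # methane
    "CO": [0.5, 0.5],  # methanol
}


def nearest_structure(features, table):
    """Return the SMILES string whose annotated features are closest (Euclidean)."""
    return min(table, key=lambda smiles: math.dist(features, table[smiles]))


print(nearest_structure([0.45, 0.55], annotated))  # CO
```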
In a number of embodiments, the inverse molecule design process yields a set of candidate molecular systems with predicted chemical properties. An additional screen can be performed (924) to filter the list of candidate molecular systems based upon a variety of criteria including (but not limited to): complexity of chemical synthesis, known toxicity, water solubility, and/or any of a variety of alternative chemical properties. When an appropriate candidate molecular system is identified, a report can be generated and/or the selected molecular system synthesized (925).
While various processes for identifying molecular structures for synthesis are described above, any of a variety of processes that identify molecular structures using ML models can be utilized to perform chemical synthesis as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. ML processes can also be utilized in the context of quantum chemistry calculations for a variety of additional purposes. Processes for using ML in quantum chemistry calculations in accordance with various embodiments of the invention are discussed further below.
In a number of embodiments, a particular molecular system of interest can be utilized to identify a set of relevant molecular orbital training data from a database of molecular systems for which chemical properties are known. The database of molecular systems can be queried to identify molecular orbitals based upon distance in feature space between molecular orbitals represented within the database and molecular orbitals of the molecular system of interest. A distance metric can be utilized to measure the distance between MOB features of the molecular orbitals in the database and the MOB features of the molecular orbitals of the molecular system of interest. In this way, a molecular system specific training data set can be generated for the purposes of training an MOB-ML model to predict the chemical properties (e.g. quantum mechanical energy) of the molecular system of interest.
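The database query described above can be sketched as follows. The Euclidean metric, the choice of the `k` nearest database orbitals per query orbital, and the array contents are all assumptions made for the example; the disclosure leaves the distance metric open.

```python
import numpy as np


def select_training_orbitals(db_features, query_features, k=2):
    """Indices of database orbitals that are among the k nearest to any query orbital."""
    # Pairwise Euclidean distances, shape (n_query, n_database).
    diff = query_features[:, None, :] - db_features[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.unique(nearest)


# Hypothetical MOB feature vectors for orbitals stored in the database.
db = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
# Hypothetical MOB feature vectors for the molecular system of interest.
query = np.array([[0.2, 0.1], [5.0, 5.4]])

idx = select_training_orbitals(db, query, k=2)
print(idx)  # [0 1 3 4]
```

The selected indices define a training set specific to the molecular system of interest, on which an MOB-ML model can then be trained.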
A specific process for generating an MOB-ML model for estimating the chemical properties of a specific candidate molecular system in accordance with an embodiment of the invention is illustrated in
While the discussion of the processes described above with reference to
Processes in accordance with various embodiments of the invention rely upon quantum chemistry properties. As is discussed further below, any of a variety of quantum chemistry predictions of MOB features of different molecular systems can be utilized as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Many embodiments use physics-based quantum chemistry predictions as input MOB features of molecular systems during MOB-ML processes. Several embodiments predict the results of physics-based quantum chemistry calculations for the molecular system on the basis of MOB features. In some embodiments, the output results include molecular system properties. Examples of physics-based quantum chemistry methods include (but are not limited to) coupled-cluster theory and MP2 theory. As can readily be appreciated, the specific quantum chemistry methods used are largely only limited by the requirements of specific applications. Many embodiments are incorporated in software packages.
A system for incorporating an MOB-ML process into a software package in accordance with an embodiment of the invention is illustrated in
In some embodiments, software packages incorporating MOB-ML processes can be operated on a user-friendly platform. Examples of such platforms include (but are not limited to): smart phones, tablets, and computers. As can readily be appreciated, the specific user platforms are largely only limited by the requirements of specific applications. According to some embodiments, the software package performs quantum simulations in seconds via a backend cloud-based deployment of MOB-ML processes.
While various processes for generating quantum chemistry predictions from MOB features are described above, any variety of processes that predict molecular system properties based on MOB features can be utilized in MOB-ML processes as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Various examples implementing MOB-ML processes in accordance with various embodiments of the invention are discussed further below.
The following section provides specific examples of the use of different MOB-ML processes to determine molecular compositions and structures for synthesis. The features and training pair energies associated with the various geometries discussed below can be computed using the MOLPRO 2018.0 software package in a cc-pVTZ basis set for small molecule systems (see Examples 1-6 below); for larger molecules (e.g. QM7b-T and GDB-13-T), the MP2/cc-pVTZ and/or CCSD(T)/cc-pVDZ levels of theory can be utilized. Localized molecular orbitals used in feature construction can be determined using an Intrinsic Bond Orbital method for both occupied and virtual space (see Examples 2, 5 and 6 below), or using the Boys method for occupied space and the Intrinsic Bond Orbital method for virtual space (see remaining examples below). Reference pair correlation energies can be computed with second-order Møller-Plesset perturbation (MP2) theory and CCSD theory, as well as with CCSD with perturbative triples, CCSD(T). Density fitting for both Coulomb and exchange integrals is employed for some of the results presented below (see Example 7 below, which uses density fitting for QM7b-T and GDB-13-T). The frozen core approximation can also be utilized.
Gaussian process regression (GPR) can be employed to machine learn εdML and εoML (Eq. 6) using the GPY 1.9.6 software package. The GPR kernel is Matérn 5/2 with white noise regularization. Kernel hyperparameters can be optimized with respect to the log marginal likelihood objective for the alkane series results, as well as for εdML of the QM7b results. The Matérn 3/2 kernel instead of the Matérn 5/2 kernel can be used for the case of εoML for QM7b-T results.
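For readers unfamiliar with the kernel named above, the following is a minimal NumPy sketch of Gaussian process regression with a Matérn 5/2 kernel and a white-noise term on the diagonal. It is not the GPY implementation referenced above: the hyperparameters are fixed by hand rather than optimized against the log marginal likelihood, and the sine-function training data is a placeholder for pair correlation energies.

```python
import numpy as np


def matern52(a, b, lengthscale=1.0, variance=1.0):
    """Matern 5/2 kernel between the rows of a and the rows of b."""
    r = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    s = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def gpr_fit_predict(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean of a zero-mean GP at x_test."""
    # White-noise regularization enters as a diagonal term on the training kernel.
    k_train = matern52(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = matern52(x_test, x_train)
    alpha = np.linalg.solve(k_train, y_train)
    return k_cross @ alpha


# Placeholder data: a smooth one-dimensional function.
x_train = np.linspace(0.0, 3.0, 8).reshape(-1, 1)
y_train = np.sin(x_train).ravel()
x_test = np.array([[1.5]])

prediction = gpr_fit_predict(x_train, y_train, x_test)
print(prediction[0], np.sin(1.5))
```

Replacing the factor `(1.0 + s + s**2 / 3.0)` with `(1.0 + s)` and `np.sqrt(5.0)` with `np.sqrt(3.0)` gives the Matérn 3/2 kernel mentioned for the off-diagonal QM7b-T case.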
Feature selection can be performed using the random forest regression implementation in the SCIKIT-LEARN v0.20.0 package with a mean decrease of accuracy importance criterion.
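The idea behind a mean-decrease-of-accuracy criterion can be sketched as a permutation test: a feature is important if shuffling its column degrades the model's predictions. In the example below a least-squares linear model is swapped in for the random forest to keep the code short, the synthetic data is invented, and the threshold of 1e-3 is used only as an illustrative cutoff.

```python
import numpy as np


def permutation_importance(predict, x, y, rng, repeats=10):
    """Mean increase in squared error when each feature column is shuffled."""
    base_error = np.mean((predict(x) - y) ** 2)
    scores = np.zeros(x.shape[1])
    for j in range(x.shape[1]):
        for _ in range(repeats):
            shuffled = x.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            scores[j] += np.mean((predict(shuffled) - y) ** 2) - base_error
    return scores / repeats


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = 2.0 * x[:, 0] + 0.5 * x[:, 1]  # the third feature is irrelevant by construction

# Stand-in regressor (assumption): ordinary least squares instead of a random forest.
coef, *_ = np.linalg.lstsq(x, y, rcond=None)
importance = permutation_importance(lambda a: a @ coef, x, y, rng)

selected = np.flatnonzero(importance > 1e-3)
print(selected)  # [0 1]
```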
Training and test geometries can be sampled at 50 fs intervals from ab initio molecular dynamics trajectories performed with the Q-CHEM 5.0 software package, using the B3LYP/6-31g* level of theory and a Langevin thermostat at 350 K.
As can readily be appreciated, MOB-ML processes can be implemented in any of a variety of different ways and/or using any of a variety of different software packages. It will be understood that the specific embodiments are provided for exemplary purposes and are not limiting to the overall scope of the disclosure, which must be considered in light of the entire specification, figures and claims.
Many embodiments implement transferability of MOB-ML processes among molecular geometries. Several embodiments include the determination of correlation energies of water molecule geometries based on MOB-ML processes trained on pair energies from randomly sampled water molecule geometries.
In some embodiments, MOB-ML processes are trained on a subset of geometries of a single water molecule to predict the correlation energy at other geometries. For both the Møller-Plesset perturbation theory (MP2) and coupled-cluster with singles and doubles (CCSD) levels of theory, the diagonal (εd and εdML are used interchangeably) and off-diagonal (εo and εoML are used interchangeably) contributions to the correlation energy can be separately trained using feature set A, as listed in
The MOB-ML prediction results for a single water molecule training on 200 geometries and predictions for 1000 geometries are shown in
In some embodiments, a separate MOB-ML process can be trained to predict the correlation energy at the MP2, CCSD, and CCSD(T) levels of theory, using reference calculations on a subset of 1000 randomized water geometries to predict the correlation energy for the remainder. Feature selection with an importance threshold of 1×10⁻³ results in 12, 11 and 10 features for εoML for MP2, CCSD and CCSD(T), respectively; ten features are selected for εdML for all three post-Hartree-Fock methods.
For all three methods shown in
Many embodiments implement MOB-ML process transferability within a molecular family. For example, several embodiments include determination of CCSD and MP2 correlation energies of water clusters based on MOB-ML training on water monomers and dimers.
In one embodiment,
In another embodiment,
Many embodiments implement MOB-ML process transferability within a molecular family of covalently bonded molecules. Several embodiments include determination of CCSD and MP2 correlation energies of butane and isobutane based on MOB-ML training of shorter alkane datasets.
MOB-ML processes in accordance with many embodiments of the invention can be trained on 100 methane and 300 ethane geometries using feature set B as shown in
The mean errors of the CCSD correlation energy predictions are small (1.2 and 1.4 mH), as shown in
The effect of including additional alkane training data in
In many embodiments, the different carbon atom-types included in the training data are reflected in the differences of the MOB-ML prediction errors in
Many embodiments implement MOB-ML process transferability within a molecular family of covalently bonded molecules. Several embodiments include determination of CCSD(T) correlation energies of larger and more branched n-butane and isobutane based on MOB-ML model trained on thermalized geometries of shorter alkane datasets.
In the predictions of MOB-ML shown in
Many embodiments implement MOB-ML process transferability across molecules and elements. Several embodiments include determination of CCSD and MP2 correlation energies of water, methane, formic acid, and methanol based on MOB-ML training of water, methane, and formic acid.
In
Many embodiments implement MOB-ML process transferability across molecules and elements. Several embodiments include determination of CCSD and MP2 correlation energies of ammonia, methane, and hydrogen fluoride based on MOB-ML training of water.
In an embodiment,
In another embodiment,
Processes in accordance with various embodiments of the invention rely upon the transferability of MOB-ML processes. Many embodiments implement MOB-ML processes across a set of organic molecules. Several embodiments include the determination of CCSD and MP2 correlation energies of sets of organic molecules from the QM7b and GDB-13 datasets.
The QM7b dataset comprises 7,211 plausible organic molecules with at most 7 heavy atoms. Chemical elements in QM7b can be C, H, O, N, S, and Cl. These elements are commonly used in drugs. The QM7b-T dataset is composed of molecular geometries sampled at a temperature of about 350 K. MOB-ML processes can be trained on a randomly chosen subset of QM7b-T molecules and used to predict the correlation energy of the remainder. For comparison, a Δ-ML process is trained on the same molecules using kernel ridge regression with the FCHL representation and a Gaussian kernel function (FCHL/Δ-ML), as implemented in the QML package. (See, e.g., Ramakrishnan R., J. Chem. Theory Comput., 2015, 11, 2087; Faber F. A., J. Chem. Phys., 2018, 148, 241717, the disclosures of which are herein incorporated by reference).
The learning curves for MOB-ML processes trained at the MP2/cc-pVTZ and CCSD(T)/cc-pVDZ levels of theory are shown in
MOB-ML processes trained on 110 seven-heavy-atom molecules can yield a prediction MAE of 1.89 mH for QM7b-T and a prediction MAE of 3.88 mH for GDB-13-T. Expressed in terms of size-intensive quantities, the prediction MAE per heavy atom is 0.277 mH and 0.298 mH for QM7b-T and GDB-13-T, respectively. The accuracy of the MOB-ML results is only slightly lower when the model is transferred to the dataset of larger molecules. On a per-heavy-atom basis, MOB-ML can reach chemical accuracy with the same number of QM7b-T training calculations (approximately 100), for tests on QM7b-T or GDB-13-T.
In comparison, the FCHL/Δ-ML method is significantly less transferable from QM7b-T to GDB-13-T. For models trained using 100 seven-heavy-atom molecules, the MAE per heavy atom of FCHL/Δ-ML is over twice that of MOB-ML in
Processes in accordance with various embodiments of the invention can utilize the workflow of clustering, regression, and classification for training and evaluating MOB-ML processes. Many embodiments include clustering and classification in MOB feature space. Several embodiments include locally linear clusters that overlap in sets of molecules from QM7b-T datasets using MOB-ML processes.
Many embodiments use the QM7b-T set of drug-like molecules with thermalized geometries, with the diagonal pair correlation energies εdML computed at the MP2/cc-pVTZ level. In some embodiments, 1000 molecules are randomly selected for training, and regression clustering (RC) is performed on the dataset comprising the energy labels and feature vectors, using N=20 optimized clusters. The sensitivity of RC to the choice of N can be examined.
In many embodiments, the resulting clusters can be well separated, such that the datapoints for one cluster have small distances to the cluster to which they belong and large distances to all other clusters. In some embodiments the clusters can overlap.
Each datapoint assigned to cluster 1 in blue color can be plotted according to its distance to both cluster 1 and cluster 2; likewise for the datapoints in cluster 2 in red color. The datapoints for which the distances to both clusters approach zero can correspond to regions of overlap between the clusters in the high dimensional space of MOB-ML features, exhibiting features similar to those described above with respect to
Processes in accordance with various embodiments of the invention can utilize chemically intuitive clusters during regression clustering of MOB-ML processes. Many embodiments include evaluating consistency of the clustering and classification processes with chemical intuition.
Many embodiments include a training set of 500 randomly selected molecules from QM7b-T and regression clustering for the diagonal pair correlation energies εd with a range of total cluster numbers up to N=20. For each clustering, an RFC can be trained. Each trained RFC can be independently applied to a set of test molecules with easily characterized valence molecular orbitals to see how the feature vectors associated with valence occupied LMOs are classified among the optimized clusters.
Processes in accordance with various embodiments of the invention can utilize the sensitivity of clustering, regression, and classification workflow MOB-ML processes. Many embodiments include sensitivity of clustering, regression, and classification processes for the diagonal and off-diagonal contributions to the correlation energy for the QM7b-T set of molecules.
Many embodiments include the mean absolute error (MAE) of the MOB-ML predictions for the diagonal (Σ_i ε_ii) and off-diagonal (Σ_{i≠j} ε_ij) contributions to the total correlation energy, as a function of the number of clusters, N, used in the regression clustering. In several embodiments, the MOB-ML processes employ linear regression and RFC classification (i.e., the RC/LR/RFC protocol). The training set can comprise 1000 randomly chosen molecules from QM7b-T, and the test set can contain the remaining molecules in QM7b-T.
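The regression clustering step with linear regression can be sketched as an alternating procedure: fit one linear model per cluster, then reassign each datapoint to the cluster whose model predicts it best. The NumPy sketch below is an assumption about one straightforward way to do this, not the specific procedure of any embodiment; the synthetic two-line dataset, the random initialization, and the fixed iteration count are illustrative, and the classifier that assigns unseen datapoints to clusters is omitted.

```python
import numpy as np


def regression_clustering(x, y, n_clusters, rng, iters=20):
    """Alternate per-cluster linear fits with reassignment by smallest residual."""
    design = np.hstack([x, np.ones((len(x), 1))])   # linear model with intercept
    labels = rng.integers(n_clusters, size=len(x))  # random initial assignment
    coefs = np.zeros((n_clusters, design.shape[1]))
    history = []
    for _ in range(iters):
        for c in range(n_clusters):
            mask = labels == c
            if mask.any():  # an empty cluster keeps its previous model
                coefs[c], *_ = np.linalg.lstsq(design[mask], y[mask], rcond=None)
        # Squared residual of every point under every cluster model.
        residuals = (design @ coefs.T - y[:, None]) ** 2
        labels = residuals.argmin(axis=1)
        history.append(residuals.min(axis=1).sum())
    return labels, coefs, history


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 1))
# Synthetic data drawn alternately from two different linear relationships.
y = np.where(np.arange(100) % 2 == 0, 2.0 * x[:, 0] + 1.0, -2.0 * x[:, 0] - 1.0)

labels, coefs, history = regression_clustering(x, y, n_clusters=2, rng=rng)
print(history[0], history[-1])
```

Each fit step and each reassignment step can only lower (or leave unchanged) the total squared residual, so the recorded `history` is non-increasing, much as in k-means clustering.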
Processes in accordance with various embodiments of the invention rely upon the sensitivity of number of employed clusters of clustering, regression, and classification processes. Many embodiments include learning curves of MOB-ML processes applied to MP2/cc-pVTZ and CCSD(T)/cc-pVDZ correlation energies for the QM7b-T set of molecules.
Many embodiments include the effect of clustering on the accuracy and training costs of MOB-ML for applications to sets of drug-like molecules with up to seven heavy atoms.
Several embodiments include the training costs and transferability of MOB-ML models that employ RC. In
For the predictions for seven-heavy-atom molecules (circles),
The improved efficiency of MOB-ML training with the use of clustering can arise from the cubic scaling of standard GPR in terms of training time (O(M³), where M is the number of training pairs). Trivial parallelization over the independent regression of the clusters can reduce training time cost to the cube of the size of the largest cluster. Other kernel-based ML methods with high complexity in training time, like Kernel Ridge Regression, can similarly benefit from clustering. GPR can dominate the total training (and prediction) costs for the RC/GPR/RFC implementation, whereas training the RFC can dominate the training costs for RC/LR/RFC. In addition to improved efficiency in terms of training time, clustering can also bring benefits in terms of the memory costs for MOB-ML training, due to the quadratic scaling of GPR memory costs in terms of the size of the dataset.
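The scaling argument can be made concrete with a short calculation. The total number of training pairs and the cluster sizes below are hypothetical, and constant prefactors are ignored, so the ratios indicate only the trend implied by the cubic time and quadratic memory scaling.

```python
def gpr_time(n_pairs):
    """Relative GPR training time, O(M^3), constant factors omitted."""
    return n_pairs ** 3


def gpr_memory(n_pairs):
    """Relative GPR memory footprint, O(M^2), constant factors omitted."""
    return n_pairs ** 2


total_pairs = 20000                  # hypothetical M
cluster_sizes = [4000, 6000, 10000]  # hypothetical clusters that sum to M

# Clusters trained one after another on a single worker.
serial_speedup = gpr_time(total_pairs) / sum(gpr_time(s) for s in cluster_sizes)
# Clusters trained in parallel: the largest cluster sets the wall-clock time.
parallel_speedup = gpr_time(total_pairs) / gpr_time(max(cluster_sizes))
# Peak memory is set by the largest cluster's kernel matrix.
memory_ratio = gpr_memory(total_pairs) / gpr_memory(max(cluster_sizes))

print(serial_speedup, parallel_speedup, memory_ratio)  # 6.25 8.0 4.0
```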
For the learning curves, some embodiments compare the results for MOB-ML both with and without clustering to Faber-Christensen-Huang-Lilienfeld (FCHL) features.
Processes in accordance with various embodiments of the invention can utilize the sizes of clusters of clustering, regression, and classification processes. Many embodiments include effect of cluster-size capping on the prediction accuracy and training costs for MOB-ML with RC.
Many embodiments include that capping the number of datapoints in the largest cluster can achieve additional computational savings and adequate prediction accuracy. Some embodiments include SmaxN
As can be inferred from the above discussion, the above-mentioned concepts can be implemented in a variety of arrangements in accordance with embodiments of the invention. Accordingly, although the present invention has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. It is therefore to be understood that the present invention may be practiced otherwise than specifically described. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive.
The current application claims the benefit of priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/817,344 entitled “Harvesting, Databasing, And Regressing Molecular-Orbital Based Features for Accelerating Quantum Chemistry” filed Mar. 12, 2019, U.S. Provisional Patent Application No. 62/821,230 entitled “Molecular-Orbital-Based Features for Machine Learning Quantum Chemistry” filed Mar. 20, 2019, and U.S. Provisional Patent Application No. 62/962,097 entitled “Molecular and Materials Discovery and Optimization by Machine Learning with the Use of Molecular-Orbital-Based Features” filed Jan. 16, 2020. The disclosures of U.S. Provisional Patent Application Nos. 62/817,344, 62/821,230, and 62/962,097 are hereby incorporated by reference in their entirety for all purposes.
This invention was made with government support under Grant No. FA9550-17-1-0102 awarded by US Air Force Office of Scientific Research. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62817344 | Mar 2019 | US
62821230 | Mar 2019 | US
62962097 | Jan 2020 | US