The present invention is a method for determining molecules of interest with respect to a molecular property. In particular, the present invention correlates experimental H2S vs. CO2 selectivity values with projected absorbents using molecular descriptors developed by quantitative structure-property relationships (QSPR).
Theoretically, all of the information required to determine chemical and physical properties of a chemical compound is coded within its structural formula. Quantitative Structure-Property Relationships (QSPR) is the process by which chemical structure is quantitatively correlated with a well defined process such as chemical reactivity. The goal of QSPR is to find a mathematical relationship between an activity or property under investigation and one or more descriptive parameters (descriptors) related to the structure of the molecule for a chemical compound.
A fundamental goal of QSPR studies is to predict physical, chemical, biological and technological properties of chemicals from simpler “descriptors”, calculated solely from molecular structure. To accomplish this, numerous experimental and computed descriptors have been developed for QSPR studies. The descriptor associates a real number with a chemical, and then sorts the set of chemicals according to the numerical value of the specific property. Each descriptor or property provides a scale for a particular set of chemicals.
QSPR or quantitative structure related analysis of physicochemical properties prior to 1970 had major applications only in analytical chemistry. The last three decades, however, have seen the development of a theoretical basis of QSPR with many contributions. Review papers on QSPR are given below. The development of this methodology was also supported by the simultaneous development of molecular structure-based descriptors that made it possible to describe molecules more precisely.
QSPR is now well-established and correlates varied complex physicochemical properties of a compound with its molecular structure through a set of descriptors. The basic strategy of QSPR is to find the optimum quantitative relationship between descriptors and structures, enabling the prediction of properties. QSPR became more attractive for chemists when new software tools allowed them to discover and to understand how molecular structure influences properties and to predict and prepare optimum structures. The software is now amenable to chemical and physical interpretation. There are still significant opportunities for the application of purely structure-based molecular descriptors in QSAR models through the use of physicochemical properties predicted with QSPR.
The QSPR approach has been applied in many different areas, including (i) properties of single molecules (e.g., boiling point, critical temperature, vapor pressure, flash point and autoignition temperature, density, refractive index, melting point); (ii) interactions between different molecular species (e.g., octanol/water partition coefficient, aqueous solubility of liquids and solids, aqueous solubility of gases and vapors, solvent polarity scales, GC retention time and response factor); (iii) surfactant properties (e.g., critical micelle concentration, cloud point); and (iv) complex properties of polymers (e.g., polymer glass transition temperature, polymer refractive index, rubber vulcanization acceleration).
The present invention includes a method for generating and/or identifying molecules of interest with respect to some molecular property. The molecular property is selectivity or a property which combines selectivity, aqueous solubility and vapor pressure for finding H2S absorbents.
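Three characteristics that are of ultimate importance in determining the effectiveness of the absorbent compounds to be identified for H2S removal are “selectivity”, “loading” and “capacity”. The term “selectivity” as used throughout this document is defined as the following mole ratio fraction:
$$\text{selectivity} = \frac{(\text{moles of H}_2\text{S} / \text{moles of CO}_2)\ \text{in liquid phase}}{(\text{moles of H}_2\text{S} / \text{moles of CO}_2)\ \text{in gaseous phase}}$$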
The higher this fraction, the greater the selectivity of the absorbent solution for the H2S gas. The term “loading” is defined as the concentration of the [H2S+CO2] gases [including H2S and CO2 both physically dissolved and chemically combined] in the absorbent solution as expressed in total moles of the two gases per mole of the amine. “Capacity” is defined as the moles of H2S loaded in the absorbent solution after the absorption step minus the moles of H2S loaded in the absorbent solution after the desorption step.
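The three definitions above can be illustrated with a minimal Python sketch. It assumes the selectivity fraction is the liquid-phase H2S/CO2 mole ratio divided by the gas-phase H2S/CO2 mole ratio; the function and argument names are hypothetical and chosen only for illustration.

```python
def selectivity(h2s_liquid, co2_liquid, h2s_gas, co2_gas):
    """Assumed form: (mol H2S / mol CO2) in liquid over (mol H2S / mol CO2) in gas."""
    return (h2s_liquid / co2_liquid) / (h2s_gas / co2_gas)


def loading(h2s_moles, co2_moles, amine_moles):
    """Total moles of H2S and CO2 (dissolved and combined) per mole of amine."""
    return (h2s_moles + co2_moles) / amine_moles


def capacity(h2s_after_absorption, h2s_after_desorption):
    """Moles of H2S loaded after absorption minus moles remaining after desorption."""
    return h2s_after_absorption - h2s_after_desorption


# Illustrative numbers only, not measured data.
example_selectivity = selectivity(0.2, 0.1, 0.01, 0.1)
example_loading = loading(0.2, 0.1, 1.0)
example_capacity = capacity(0.25, 0.05)
```

A higher value returned by `selectivity` corresponds to an absorbent solution that favors H2S more strongly over CO2.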
Let P represent either selectivity alone or an alternate relationship of selectivity, aqueous solubility and vapor pressure. The alternate relationship for the property P of a molecule that is to be predicted is defined as follows:
where S is selectivity, LW is aqueous solubility of the compound, VP is vapor pressure of the compound, and X and Y are exponents which may take values from the set {0.5, 1, 2}. The choice of such a combined property was directed by the requirement that the prospective absorbents should have, apart from good selectivity, high water solubility and low volatility.
The invention includes the following steps:
The invention includes a method for generating and/or identifying molecules with respect to some molecular property via predictive correlations. In the present invention the molecular property is selectivity or a newly defined property which combines selectivity, aqueous solubility and vapor pressure for finding H2S absorbents. The predictive correlations are found via Quantitative Structure-Property Relationships (QSPR), which is the process by which chemical structure is quantitatively correlated with a well defined process with measurable and reproducible parameters. The main goals of the invention are (i) to correlate experimental H2S vs. CO2 selectivity values for a series of postulated absorbents with theoretical molecular descriptors by developing QSPR models, (ii) to predict new active compounds with better selectivity than those known so far, and (iii) to identify structural characteristics with significant influence on the selectivity.
This is achieved by either the whole molecule approach or molecular fragment approach.
Descriptive parameters (descriptors) must be chosen for use in QSPR. Descriptors may be chosen using commercial software packages. Alternatively, descriptors may be chosen based on the numerous published papers on QSPR. A list of descriptors is given in Appendix 8.
There is a huge variety of programs for QSPR/QSAR analysis. However, most of those are not interchangeable or equivalent: the programs developed especially for performing QSAR analysis are focused mainly on the description of ligand-receptor interactions, while those devoted to QSPR rely on a huge descriptor space and advanced variable selection techniques. All programs for optimization of the chemical structure (and even those used only for structure drawing) provide some rudimentary tools for descriptor calculations.
HyperChem and ChemDraw are good examples of programs to optimize chemical structures. Programs able to perform QSPR analysis on technological properties, together with links to them are listed below with a short description of their advantages and disadvantages:
Some general reviews of CODESSA applications include:
Given the set of known molecules and the complete set of descriptors under consideration, a smaller subset of the descriptors is chosen for inclusion in correlations that will be developed to assess unknown molecules in the prediction of selectivity (P). The selection of descriptor values for inclusion in a particular correlation equation can be done in a number of ways based on statistical criteria. The selectivity (P) data for the known molecules is fit to a posed equation relating the chosen subset of descriptor values to selectivity (P). This fitting can be done via linear regression or other computational methods.
Once one or more correlation equations have been generated that relate selectivity P to descriptor values, the procedure is as follows:
Given the set of known molecules, create two or more sets of molecular fragments which may be combined to form potential absorbent molecules. Molecular fragments should be based on molecular fragments that are present in the known molecules such that the known molecules can be reconstructed using these molecular fragments and any rules developed for how to combine fragments into molecules.
Draw the protonated versions of each of the molecular fragments and either manually or computationally calculate the values for their molecular descriptors for all descriptors in the given complete set of descriptors.
Screen the set of all molecular descriptors for those that are common among all known molecules with known data for selectivity, vapor pressure and solubility. Then classify each descriptor in some scheme in order to designate how it will be treated in the predictive correlations when molecular fragments are combined to form molecules. Some methodology should then be used to decide on a subset of descriptors for inclusion in the predictive correlation.
The selectivity or P data for the known molecules formed by their substituent molecular fragments is fit to a posed equation for relating the chosen subset of descriptor values to selectivity or P for molecules composed of molecular fragments. This fitting can be done via linear regression or other computational methods.
Finally, promising molecules are found by searching for the molecules composed of molecular fragments with the highest value of P (or selectivity) predicted from the correlation equation(s). This search can be conducted with some form of enumeration of combinations of molecular fragments or a search algorithm.
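The enumeration search described above can be sketched as follows. This is a minimal illustration, assuming a purely additive rule for combining fragment descriptors and a linear correlation for log P; the fragment names, descriptor values and coefficients are hypothetical placeholders, not values from the appendices.

```python
from itertools import product


def predict_log_p(fragment_descriptors, log_p0, alphas):
    """Sum each descriptor over the three fragments, then apply a linear correlation."""
    combined = [sum(values) for values in zip(*fragment_descriptors)]
    return log_p0 + sum(a * d for a, d in zip(alphas, combined))


def rank_candidates(substituents, bridges, log_p0, alphas, top=3):
    """Enumerate every (R1, G, R2) triplet and return the highest-scoring ones."""
    scored = []
    for r1, g, r2 in product(substituents, bridges, substituents):
        fragments = (substituents[r1], bridges[g], substituents[r2])
        scored.append((predict_log_p(fragments, log_p0, alphas), (r1, g, r2)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]


# Hypothetical fragment libraries: name -> [descriptor 1, descriptor 2].
substituents = {"R_a": [1.0, 0.2], "R_b": [0.5, 0.9]}
bridges = {"G_a": [0.3, 0.1], "G_b": [0.8, 0.4]}
best = rank_candidates(substituents, bridges, log_p0=0.5, alphas=[1.0, -0.5])
```

Full enumeration grows as the product of the library sizes, so for large fragment libraries a search algorithm may be preferable. The sketch also counts (R1, G, R2) and (R2, G, R1) as separate candidates, which a real implementation might deduplicate.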
The algorithm necessary to carry out the Whole Molecule and Molecular Fragment approaches is given in Appendix 7.
Examples presented are meant to be non-limiting.
To carry out Quantitative Structure Property Relationships (QSPR) analysis for H2S selectivity of potential absorbent molecules, experimental selectivity data for 33 absorbents (Appendix A1) at CO2/H2S loadings of 0.1, 0.2, 0.3 and 0.4 were used and four model-sets (Table 1-4) with common descriptors were developed (Table 5 for all loadings). Statistical parameters are acceptable for all models. The H2S selectivity values for a total of 67 (including isomers) new possible absorbents (Appendix 2) chosen using the physicochemical meaning of the theoretical molecular descriptors from model-sets #1-4 (Table 1-4) were also predicted.
Model-sets #1 and #2 (Tables 1-2) were derived by a similar method: only one descriptor differs between the model-sets, and the statistical parameters are quite similar. Experimental selectivity values decrease as the loading increases. However, when model-set #1 is used for prediction, in 21 cases the selectivity values are higher at loading 0.3 than at loading 0.2, which is not realistic. Comparison of the models in set #1 (Table 1) reveals that in the models for loadings 0.3 and 0.4, the positive coefficient of descriptor D37 (min. exchange energy for bond H-C) is considerably higher than in the respective models for loadings 0.1 and 0.2.
The most realistic results were obtained with the model-set #2 (Table 2) where there are only 9 cases when the selectivity values are higher in loading 0.3 than in loading 0.2 (Table 6).
Using the model-set #3 (Table 3) for the prediction of selectivities, 6 structures were found for which the selectivity is higher in loading 0.3 than in loading 0.2 and 11 structures for which the selectivity is higher in loading 0.4 than in loading 0.3.
Using the model-set #4 (Table 4) for the prediction, in 5 cases the selectivity is higher in loading 0.3 than in loading 0.2 and in 9 cases the selectivity is higher in loading 0.4 than in loading 0.3.
Those numbers were derived by taking into account all the structures, including the large number of possible geometric isomeric forms (from S0000034 to S0000100).
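The consistency check used above (counting structures whose predicted selectivity rises with loading, contrary to the experimental trend) can be sketched as below. The structure identifiers and selectivity values are hypothetical placeholders, not the predictions of Table 6.

```python
def count_increases(predictions, lower, higher):
    """Count structures whose predicted selectivity is higher at `higher` than at `lower` loading."""
    return sum(1 for sel in predictions.values() if sel[higher] > sel[lower])


# Hypothetical predictions: structure id -> {loading: predicted selectivity}.
predictions = {
    "S_x1": {"0.1": 40.0, "0.2": 30.0, "0.3": 32.0, "0.4": 20.0},
    "S_x2": {"0.1": 50.0, "0.2": 45.0, "0.3": 35.0, "0.4": 36.0},
    "S_x3": {"0.1": 25.0, "0.2": 20.0, "0.3": 15.0, "0.4": 10.0},
}

unrealistic_02_to_03 = count_increases(predictions, "0.2", "0.3")
unrealistic_03_to_04 = count_increases(predictions, "0.3", "0.4")
```

A model-set with fewer such increases follows the experimentally observed decrease of selectivity with loading more closely.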
Because of its low statistical reliability, model-set #4 was omitted from further consideration. Looking at the structures that give higher selectivity for higher loadings in model-sets #1 and #2, it becomes evident that none of the “problematic” structures contains an O-H group, with the sole exception of S0000078, which gives a small selectivity increase in loading 0.4 with model-set #2.
Ten of the most promising sets containing 4 descriptors each were selected with which to develop performance models, and these were built and added to the four previously built (Example 1).
Briefly, according to the Karelson approach, the molecules in a model set can be divided into distinct fragments as follows:
with a generic structure component G1 and the two substituent group components R1 and R2. One or two components may be missing.
The strategy for the development of new molecular structures with the best-pre-determined (maximum) logS, instead of selectivity values, involved the following steps:
logS=F(Di) (a)
logS=f(di) (b)
It should be noted that the experimental data set is small (only 33 absorbents); therefore, only general information about the influence of various fragments was obtained. However, the preparation and testing of new molecule entities (predicted in step 6 above) provided feedback for refinement of the models.
A fragment database of possible substituents Ri (125) and generic bridge structures Gk (94) was created and is given in Appendix 3 (list of substituents) and Appendix 4 (list of generic structures). Calculation of the fragment descriptors using CODESSA PRO (as the molecular descriptors for RiH and HGkH) was carried out for these 125 possible substituents and generic structures. The corresponding CODESSA PRO storage was then prepared for further calculations.
Later, a reoptimization of the molecular geometries, and elimination of those fragments that contain the following sequence refined the library of substituents and generic bridges:
At this point, the database consisted of 116 substituent group components and 73 generic bridge components (Appendix 3 and Appendix 4). The theoretical molecular descriptors were recalculated for all the fragments (RiH, HGH) and for the original 33 absorbents.
New Property with Solubility and Vapor Pressure
To be effective, absorbents should have a high solubility and low volatility. Therefore, a new property for the absorbents in which the solubilities (aqueous) and volatilities of the absorbents have been taken into account was defined. The properties were calculated as shown in Eq. 1 and the respective values are listed in Table 7.
$$P_n = \log\left(\frac{\text{selectivity} \times \text{solubility}}{\text{vapor pressure}}\right), \quad n = 0.1\text{-}0.4 \qquad (1)$$
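As a minimal sketch, Eq. 1 can be evaluated as below. The base-10 logarithm is an assumption (the text writes only "log"), and the optional exponents `x` and `y` on solubility and vapor pressure follow the exponent description given elsewhere in this document; with the defaults `x = y = 1` the function reduces to Eq. 1. The function name and input values are hypothetical.

```python
import math


def combined_property(selectivity, solubility, vapor_pressure, x=1.0, y=1.0):
    """log10(selectivity * solubility**x / vapor_pressure**y); the log base is assumed."""
    return math.log10(selectivity * solubility ** x / vapor_pressure ** y)


# Illustrative inputs only: a selective, soluble, low-volatility absorbent scores higher.
p_default = combined_property(10.0, 100.0, 0.1)
p_weighted = combined_property(10.0, 10.0, 100.0, x=2.0, y=0.5)
```

Because solubility appears in the numerator and vapor pressure in the denominator, the property rewards high water solubility and penalizes volatility, in line with the stated requirements for an effective absorbent.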
A preliminary collection of vapor pressure values was assembled for 29 of the 33 initial absorbents (see Table 8), calculated using Advanced Chemistry Development (ACD) Software Solaris V4.67 (© 1994-2004 ACD, http://www.acdlabs.com/), available under the SciFinder Scholar 2002 Software, http://www.cas.org/SCIFINDER.
Since the experimental vapor pressure values were missing for 4 compounds (8, 11, 20 and 26), a QSPR model was built for vapor pressure using the 29 experimental values as the property, and the model was then used to predict the missing values.
Multi-parameter correlations for the vapor pressure containing up to 7 descriptors were analyzed.
The logarithmic values of the vapor pressure were considered for developing a 4-parameter QSPR model that is given in Table 9; the respective plot of observed vs. predicted log VP values is presented in
In the case of logarithmic VP values, all data points showed a good fit on the scale (
No experimental solubility values for these 33 absorbents were found by searching both SciFinder Scholar 2002 and the Sigma-Aldrich database. As an alternative, we studied the Ostwald solubility coefficient.
The property (Pn) to be investigated by fragment descriptor based QSPR approach, is defined as follows (Equation 2):
where S denotes the selectivity of the compound to separate CO2 and H2S in the gas mixture, LW is the aqueous solubility of the compound, VP is the vapor pressure of the compound, and X, Y are the exponents of solubility and vapor pressure, respectively.
Note: The solubility in water and vapor pressure are both “saturation” properties, i.e., they are measurements of the maximum capacity which a phase has for the dissolved compound in solution. Although water/air partition coefficients (Lw) are not constant over the whole concentration range in aqueous solution, here Lw means the water/air partition coefficient for a saturated solution. Parameter Lw, also named the Ostwald solubility coefficient, is defined as the ratio of the solubility of a compound in the aqueous solution to its equilibrium concentration in the gas phase (Eq. 2)
$$L_w = \frac{\text{solubility of solute in aqueous solution}}{\text{equilibrium concentration of solute in gas phase}}$$
Experimental water solubility values were not found for the original absorbents. Thus, a 5-parameter QSPR model for the Ostwald solubility coefficient (Lw), which we developed using 179 experimental values, was used (Table 10). The log Lw values for the absorbents considered are presented in
Those three properties (selectivity, vapor pressure and solubility coefficients) were then combined into one function (property) and then the respective QSPR models were calculated.
The squared correlation coefficient is better than 0.95 for all the 3-parameter models at all loadings. Next, the models with common descriptors for all loadings were built. Such a restriction is expected to decrease R2, especially for the 3-parameter models. Therefore, 4-parameter models are also presented. The corresponding models (1-8) and plots (
Models 1-8 all contain the HDCA-2 (Area-weighted surface charge of hydrogen bonding donor atoms) related descriptor. In all models, this descriptor has a relatively high t-test value, which demonstrates its significance. The HDCA-2 descriptor is defined by Eq 3.
where SD is the solvent-accessible surface area of the H-bonding donor H atoms, selected by threshold charge, and qD is the partial charge on the H-bonding donor H atoms, selected by threshold charge.
Table 11 lists the preliminary property P values predicted for the 25 molecule entities (Appendix 5) using models 1-8. All the predicted results are in reasonable range. There are no predicted values that are unrealistically high.
As shown, the reported models for the “new property, P” where solubility and vapor pressure are included, have very good statistical characteristics.
We decided that it would be worthwhile to study the predictive power of other different exponential combinations of vapor pressure and solubility. Consequently, the general equation 4, based on equation 2, was defined as follows:
where S is the selectivity, LW is the solubility, VP is the vapor pressure of the compounds, and X and Y are the exponents of solubility and vapor pressure, respectively.
All 8 QSPR models were used to predict the Pn values for the original 33 absorbents and for 15 secondary amine structures (Table 12).
The results show that the newly defined property, which combines selectivity, solubility and vapor pressure, provides an in-depth analysis of the absorbents' behavior.
A “new dataset” consisting of 22 compounds from different chemical classes (electroneutral molecules, salts and zwitterions) was used to build the 2D-QSPR models (Appendix 6). The models included 2, 3 and 4 descriptors as independent variables and are shown in Table 13. The descriptors are shown in Table 14. The experimental values for S (selectivity) at different loadings and the predicted LogS values based on Table 13 are in Table 15.
NEW DATASET: COMPOUNDS AND (I) EXPERIMENTAL VALUES FOR S (SELECTIVITY) AT LOADINGS INDICATED;
The experimental data for the original 33 structures were collected from the plots of “Selectivity of amine solutions for H2S vs. loading of the solution with H2S and CO2 (moles per mole of amine)” available from the following ExxonMobil U.S. Pat. Nos. 4,405,580; 4,405,585; 4,405,581; 4,762,934; 4,417,075; 4,405,583; 4,405,582; 4,405,811; 4,483,833; 4,892,674; 4,895,670; 4,618,481; 4,471,138.
The particular general form of the correlation of descriptors to P (or selectivity) can be described as follows. Let set M represent the set of known molecules and let set J represent the complete set of descriptors. A smaller subset of descriptors for inclusion in the QSPR whole molecule correlation equation is designated as J′ and is a subset of J. A linear regression technique is used to best fit the P data for molecules in set M using the descriptors of set J′ in the whole molecule QSPR equation expressed below. Pm represents the value of P for each of the known molecules indexed by m in set M. Djm represents the known value of descriptor j in set J for each of the known molecules indexed by m in set M.
A linear regression method is used to calculate the best fit values for the unknowns log P0 and coefficient αj for each of the descriptors considered. Using these coefficients, and the descriptor values for the set of defined unknown molecules, a correlated value for P can then be calculated. Molecules with attractive correlated values for P can then be tested experimentally to validate the prediction.
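The regression step can be sketched with NumPy as below. The linear form log P = log P0 + Σ αj·Dj is an assumption inferred from the unknowns named in the text (log P0 and the coefficients αj), and the base-10 logarithm, function names and descriptor values are likewise illustrative assumptions.

```python
import numpy as np


def fit_qspr(descriptors, p_values):
    """Least-squares fit of log10(P) = log_p0 + sum_j alpha_j * D_j over the known molecules."""
    D = np.asarray(descriptors, dtype=float)         # shape (molecules, descriptors in J')
    y = np.log10(np.asarray(p_values, dtype=float))  # log of the measured property
    A = np.column_stack([np.ones(len(D)), D])        # leading column of ones gives the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]                          # (log_p0, alphas)


def predict_p(log_p0, alphas, descriptors):
    """Correlated value of P for a molecule with the given descriptor values."""
    return 10 ** (log_p0 + np.asarray(descriptors, dtype=float) @ alphas)


# Hypothetical descriptor matrix and property values for five known molecules.
known_descriptors = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
known_p = [10 ** (1 + 2 * a - 0.5 * b) for a, b in known_descriptors]
log_p0, alphas = fit_qspr(known_descriptors, known_p)
candidate_p = predict_p(log_p0, alphas, [1.5, 0.5])
```

Candidates with attractive values of `candidate_p` would then be the ones selected for experimental validation.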
The search for the multiparameter regression with the maximum predicting power among a huge space of independent variables is not a trivial task. The calculation of all possible combinations of descriptors and the comparison of their statistical characteristics quickly becomes impractical with an increasing number of descriptors under consideration. The following strategy is used to choose the descriptors for consideration in set J′.
Let set M represent the set of known molecules and let set J represent the complete set of descriptors. Pm represents the value of P for each of the known molecules indexed by m in set M.
The Molecular Fragment Approach procedure for QSPR is as follows:
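$$D^{ADD}_{jm} = d^{R1}_{jrm} + d^{G}_{jgm} + d^{R2}_{jr'm} \quad \forall j \in J' \cap J^{ADD},\ (r, g, r') = t_m,\ m \in M$$
$$D^{CP}_{jm} = d^{R1}_{jrm}\, d^{G}_{jgm} + d^{G}_{jgm}\, d^{R2}_{jr'm} \quad \forall j \in J' \cap J^{CP},\ (r, g, r') = t_m,\ m \in M$$
$$D^{MIN}_{jm} = \min\{d^{R1}_{jrm},\ d^{G}_{jgm},\ d^{R2}_{jr'm}\} \quad \forall j \in J' \cap J^{MIN},\ (r, g, r') = t_m,\ m \in M$$
$$D^{MAX}_{jm} = \max\{d^{R1}_{jrm},\ d^{G}_{jgm},\ d^{R2}_{jr'm}\} \quad \forall j \in J' \cap J^{MAX},\ (r, g, r') = t_m,\ m \in M$$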
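The four combination rules above (additive, cross-product, minimum and maximum) can be sketched in a single helper. The function name and rule labels are hypothetical; the arithmetic follows the expressions for the ADD, CP, MIN and MAX descriptor classes.

```python
def derive_descriptor(rule, d_r1, d_g, d_r2):
    """Combine fragment descriptor values for R1, G and R2 into a molecule-level value."""
    if rule == "ADD":   # additive descriptors: sum over the three fragments
        return d_r1 + d_g + d_r2
    if rule == "CP":    # cross-product descriptors: each substituent coupled to the bridge
        return d_r1 * d_g + d_g * d_r2
    if rule == "MIN":   # descriptors defined by the smallest fragment value
        return min(d_r1, d_g, d_r2)
    if rule == "MAX":   # descriptors defined by the largest fragment value
        return max(d_r1, d_g, d_r2)
    raise ValueError(f"unknown descriptor class: {rule}")


# Illustrative fragment values for one descriptor j and one triplet (r, g, r').
derived = {rule: derive_descriptor(rule, 1.0, 2.0, 3.0) for rule in ("ADD", "CP", "MIN", "MAX")}
```

Which class a given descriptor belongs to is decided during the classification step described earlier in the fragment procedure.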
Since a complete exhaustive enumeration of all possible descriptor combinations is computationally infeasible, BESTREG and other heuristics were developed in the literature to provide methods for choosing the descriptor combinations to use in the QSPR. However, with the use of advanced mathematical programming techniques, finding the combination of descriptors that provides the absolute best correlation should be computationally tractable. Steps (6) and (7) of the detailed procedure outlined in the previous section would be replaced with the following process.
Find the best descriptor set J′ of size N for minimizing the least squares error for the hypothesized QSPR function.
As before, the derived descriptor values for the original molecules of set M are determined by the following expressions:
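$$D^{ADD}_{jm} = d^{R1}_{jrm} + d^{G}_{jgm} + d^{R2}_{jr'm} \quad \forall j \in J' \cap J^{ADD},\ (r, g, r') = t_m,\ m \in M$$
$$D^{CP}_{jm} = d^{R1}_{jrm}\, d^{G}_{jgm} + d^{G}_{jgm}\, d^{R2}_{jr'm} \quad \forall j \in J' \cap J^{CP},\ (r, g, r') = t_m,\ m \in M$$
$$D^{MIN}_{jm} = \min\{d^{R1}_{jrm},\ d^{G}_{jgm},\ d^{R2}_{jr'm}\} \quad \forall j \in J' \cap J^{MIN},\ (r, g, r') = t_m,\ m \in M$$
$$D^{MAX}_{jm} = \max\{d^{R1}_{jrm},\ d^{G}_{jgm},\ d^{R2}_{jr'm}\} \quad \forall j \in J' \cap J^{MAX},\ (r, g, r') = t_m,\ m \in M$$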
In the search for the highest impact combination of descriptors, the development of a least-squares error combinatorial optimization approach is proposed. The model for determining the correlation parameters of the QSPR with the N best descriptors is the following:
This model is a convex mixed-integer quadratic programming (MIQP) problem. Commercial optimization algorithms such as CPLEX or XpressMP can be used to solve such MIQP problems, usually within a reasonable run-time since the number of binary variables is limited to the number of descriptors utilized. This approach would not only determine the optimum values for the correlation parameters for the QSPR model, but would also determine the N best descriptors that most impact the reduction of error in fitting the model to the actual data. Any descriptor j in which zj=1 would be a member of the QSPR descriptor set J′.
Then a sensitivity analysis is possible with a plot of globally minimum error versus N, providing not only a “best” set of descriptors, but also a basis for evaluating whether a model is being overfit. If as N is changed the descriptors within set J′ change radically from one globally minimized solution to another, this may indicate that the proposed QSPR equation form is not a good measure for predicting selectivity and should be re-evaluated.
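The idea of selecting the N best descriptors and then examining the minimum error as a function of N can be sketched as below. Note that this sketch does not solve the MIQP; it substitutes brute-force enumeration of descriptor subsets, which reaches the same global minimum only when the descriptor pool is small. The function names and synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def sse_for_subset(D, y, subset):
    """Sum of squared errors of a linear fit (with intercept) using only the chosen columns."""
    A = np.column_stack([np.ones(len(y)), D[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return float(residual @ residual)


def best_subset(D, y, n):
    """Globally best descriptor subset of size n, found by exhaustive enumeration."""
    candidates = combinations(range(D.shape[1]), n)
    return min(((sse_for_subset(D, y, s), s) for s in candidates), key=lambda item: item[0])


def error_vs_n(D, y, max_n):
    """Minimum error and chosen subset for each subset size, for the sensitivity plot."""
    return {n: best_subset(D, y, n) for n in range(1, max_n + 1)}


# Synthetic example: the property depends only on descriptors 0 and 2.
rng = np.random.default_rng(0)
D_demo = rng.normal(size=(20, 4))
y_demo = 1.0 + 2.0 * D_demo[:, 0] - 3.0 * D_demo[:, 2]
curve = error_vs_n(D_demo, y_demo, 3)
```

In `curve`, a sharp drop in error followed by a plateau suggests the point beyond which additional descriptors mainly fit noise, and comparing the chosen subsets across N shows whether the selection is stable.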
If the set of descriptors chosen for use by the model corresponds to the descriptor set(s) chosen using the heuristic methods such as BESTREG, these calculations would serve to provide strong mathematical evidence of the validity of those methods.
With the optimal descriptor set J′ and the values for the unknowns log P0 and either αj, βj, γj, or λj for each descriptor j∈J, the equation for prediction of P for any given triplet t∈T is the same as in the previous section.
This application claims the benefit of U.S. Provisional Application No. 61/278,230 filed Oct. 2, 2009.
Number | Date | Country | |
---|---|---|---|
61278230 | Oct 2009 | US |