The invention is generally related to modeling of chemical compounds for the purpose of classifying and/or predicting properties thereof.
The advent of structure-activity relationship (SAR) and quantitative SAR (QSAR) models has allowed for the prediction of toxicants and the rational design of therapeutic agents based on their similarity in chemical structure to previously tested compounds. Moreover, QSAR approaches have investigated sets of similarly shaped chemicals with discrete mechanisms of action, including binding to a specific binding site of a specific protein. However, chemical compounds associated with adverse human health effects are generally not amenable to traditional QSAR modeling due to the structural diversity of chemicals being modeled for these endpoints and also because no generalized mechanism of action is applicable to an entire set of compounds (e.g., a specific receptor site or a specific chemical fragment indicative of an adverse human health effect).
Conventionally, classifying a chemical compound may require significant resources including time to conduct the assessment and the costs associated therewith. For example, a complete cancer bioassay conducted by the National Toxicology Program (NTP) for classifying a chemical compound may require approximately two years to perform and cost in the millions of dollars. To date, approximately 538 technical reports are available from the NTP for rodent carcinogenicity. In addition, analysis and data from 6540 experiments on 1547 chemicals are available from the Carcinogenic Potency Database (CPDB). However, there are approximately 75,000 industrial chemicals on the Toxic Substances Control Act's Chemical Substance Inventory, which indicates a need for accurate, cost-efficient, and time-efficient SAR models for use in classifying chemical compounds.
SAR models have been developed to efficiently and rapidly analyze large numbers of structurally diverse chemical compounds without the need for any generalized mechanism of action. For example, SAR models have been used for carcinogenesis, such as predicting mammary carcinogens, using data from the Carcinogenic Potency Database (CPDB). These models generally use chemical descriptors that describe fragments of chemical structures of model chemical compounds known to be carcinogenic or known to be non-carcinogenic. For example, some models compared rat mammary carcinogens and rat non-carcinogens to determine whether a test chemical compound is likely to be a mammary carcinogen or non-carcinogen based on the fragment descriptors present in the model. These conventional models have provided some predictive capability for classifying chemical compounds; however, the predictive results have been moderately accurate when compared to experimental results.
As discussed above, data corresponding to chemical compounds and classifications of the chemical compounds are available from some sources. For example, data from the CPDB indicates whether a known chemical compound is carcinogenic or not, where the classification typically was determined after time consuming and costly assessment of the chemical compound. While some SAR models have been generated which compare chemical composition fragments (known as “fragment descriptors”) of the previously classified chemical compounds to classify unknown chemical compounds, these SAR models have had limited success accurately classifying the wide variety of chemical compounds used in industrial, medical, domestic, and other such settings.
Therefore, a significant need continues to exist in the art for improved modeling systems and methods for classifying a chemical compound and/or predicting properties of a chemical compound.
The invention addresses these and other problems associated with the prior art by using a hybrid modeling method and system that models not only the chemical structures of chemical compounds, e.g., using fragment descriptors, but also models biologically-relevant properties, and in particular chemical-protein interactions using “ligand descriptors” developed by virtual screening of compounds in a model's learning set, where the chemical compounds in the model's learning set have been previously classified, against a large and diverse set of proteins. Using data, including for example the carcinogenic classification of known chemical compounds, where the known chemical compounds comprise the model's learning set, a SAR model may be generated to determine classifications of unknown chemical compounds based on the known classifications from previous classification assessments and the resulting data.
In some embodiments of the invention, previously classified (i.e., “model”) chemical compounds are analyzed to determine ligand descriptors associated with each model chemical compound. The ligand descriptors associated with each model chemical compound indicate whether the model chemical compound may bind with a specific ligand binding cavity (a “binding site”) of a plurality of ligand binding sites. In some embodiments, each model chemical compound may be virtually screened against each ligand binding site, where the affinity of the model chemical compound to bind to the ligand binding site may be estimated based at least in part on hydrophobic, polar complementarity, entropic, and/or solvation attributes. As such, each model chemical compound may include a plurality of ligand descriptors associated therewith, where each ligand descriptor indicates that the model chemical compound may interact with a specific ligand binding site.
In some embodiments of the invention, a computer based structure activity relationship model is generated. In these embodiments, a computer generating the computer based structure activity relationship model receives data corresponding to a plurality of model chemical compounds, where the data also indicates a plurality of ligand descriptors associated with each of the model chemical compounds. The computer generates the computer based structure activity relationship model based on the plurality of model chemical compounds and the plurality of ligand descriptors associated with each model chemical compound. In these embodiments, the computer based structure activity relationship model is configured to receive data corresponding to a test chemical compound and classify the test chemical compound based on the model chemical compounds and associated ligand descriptors.
In some embodiments, a computer executing a computer based SAR model determines whether a test chemical compound is of a desired classification, where the computer based SAR model includes data corresponding to a plurality of model chemical compounds and the data may further indicate a plurality of ligand descriptors associated with each model chemical compound. In these embodiments, data corresponding to the test chemical compound may be input into the computer based SAR model, and the computer based SAR model determines whether the test chemical compound is of the desired classification based at least in part on the model chemical compounds and ligand descriptors associated with each model chemical compound.
For example, in some embodiments, the computer based SAR model may be configured to determine whether a test chemical compound is carcinogenic. In this example, the computer based SAR model may include a plurality of carcinogenic model chemical compounds and a plurality of ligand descriptors associated with each carcinogenic model chemical compound, and the computer based SAR model may also include a plurality of non-carcinogenic model chemical compounds and a plurality of ligand descriptors associated with each non-carcinogenic model chemical compound. Data corresponding to the test chemical compound may be input into the computer based SAR model, and the computer based SAR model may determine if the test chemical compound is carcinogenic.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with a general description of the invention given above and the detailed description given below, serve to explain the principles of the invention.
It should be understood that the appended drawings are not necessarily to scale, presenting a somewhat simplified representation of various preferred features illustrative of the basic principles of embodiments of the invention. The specific features consistent with embodiments of the invention disclosed herein, including, for example, specific dimensions, orientations, locations, sequences of operations and shapes of various illustrated components, will be determined in part by the particular intended application, use and/or environment. Certain features of the illustrated embodiments may have been enlarged or distorted relative to others to facilitate visualization and clear understanding.
Embodiments of the invention provide for methods and apparatus generally directed to generating a computer based structure activity relationship (SAR) model and/or classifying chemical compounds utilizing a computer based structure activity relationship (SAR) model. Particularly, the SAR model utilized for classification includes a plurality of descriptors associated with a plurality of model chemical compounds, and one or more test chemical compounds may be input into the SAR model to determine whether the one or more test chemical compounds are of a desired classification based at least in part on whether descriptors associated with each of the one or more test chemical compounds correspond to the descriptors associated with the model chemical compounds included in the SAR model.
While embodiments of the invention have been and may hereinafter be described as receiving a chemical compound and/or descriptors associated therewith, as is known in the relevant field, a computer may receive data representative of a chemical compound and/or descriptors associated therewith. For example, a test chemical compound and associated properties may be input into a computer based SAR model consistent with embodiments of the invention, and those skilled in the art will recognize such input may be in the form of data in a format recognized by the computer executing the computer based SAR model, such that the data indicates the chemical compound, ligand and/or fragment descriptors associated therewith, whether the chemical compound is of a desired classification and/or other such similar information. As such, in embodiments consistent with the invention, such data associated with a chemical compound may be input into and/or received by a computer based SAR model, such that the data associated with the chemical compound may be further utilized by the computer based SAR model consistent with embodiments of the invention.
Moreover, in embodiments consistent with the invention, data associated with chemical compounds may be input and/or received from data storage sources connected locally and/or over a communication network, input/output (I/O) interfaces connected locally and/or over a communication network, and/or applications executing on processors of one or more computers connected locally and/or over a communication network. For example, as discussed above, the Carcinogenic Potency Database (CPDB), accessible at URL: http://potency.berkeley.edu, includes such data associated with chemical compounds that may be input to and/or received by embodiments consistent with the invention. Other such sources include, for example, technical reports by the National Toxicology Program (NTP) (accessible at the NTP's website, URL: http://ntp.niehs.nih.gov), the Distributed Structure-Searchable Toxicity (DSSTox) Database Network (accessible at the U.S. Environmental Protection Agency's website, URL: http://www.epa.gov/ncct/dsstox/index.html), and/or similar data sources known in the relevant field.
Turning to the drawings, wherein like numbers may denote like parts throughout the several views,
Consistent with some embodiments of the invention, computer 10 may further include a computer based SAR model 18 stored in memory 14 and executable by processor 12, where SAR model 18 includes data associated with one or more model chemical compounds 20, a plurality of ligand descriptors 22 associated with the model chemical compounds 20, and/or fragment descriptors 24 associated with the model chemical compounds 20. Moreover, computer based SAR model 18 may be configured to be executed by processor 12 to cause processor 12 to perform the steps necessary to execute steps, elements, and/or blocks embodying the various aspects of embodiments of the invention. Furthermore, computer 10 may include transceiver 26, where transceiver 26 may be configured to transmit and receive data to and from communication network 28 consistent with embodiments of the invention. In addition, computer 10 may include input/output interface (I/O interface) 30, where I/O interface 30 may be configured to transmit and receive data to and from attached devices, including for example, a computer keyboard, a computer mouse, a computer monitor, a printer, computer speakers, and other such human interface devices known in the art.
As shown in
As shown in
Those skilled in the art will recognize that SAR model 60 of
In some embodiments, the computer may analyze each model chemical compound of a plurality of model chemical compounds to determine a plurality of ligand descriptors and/or a plurality of fragment descriptors associated with each model chemical compound of the plurality (block 102). In these embodiments, the data received in block 102 may not indicate the plurality of ligand descriptors and/or the plurality of fragment descriptors associated with each model chemical compound. As such, in some embodiments, the computer based SAR model may advantageously analyze the model chemical compounds to determine the ligand descriptors and/or fragment descriptors associated with the model chemical compounds.
As discussed previously, a respective ligand descriptor associated with a respective chemical compound may indicate the propensity of the respective chemical compound to act as a ligand to a specific protein of a plurality of proteins; i.e., such respective ligand descriptor indicates that the respective chemical compound may bind with the specific protein at a binding site of the specific protein. As such, in some embodiments, each respective model chemical compound of the plurality of model chemical compounds may be virtually screened by a computer consistent with embodiments of the invention to determine whether the respective model chemical compound may bind with each binding site of each protein of the plurality of proteins. Virtual screening methods consistent with embodiments of the invention virtually dock a chemical compound to a ligand binding site and determine whether the chemical compound may bind by estimating the affinity of the chemical compound to the binding site, where such estimation may be based at least in part on hydrophobic, polar complementarity, entropic, enthalpic, electrostatic, shape, fragment, trained scoring algorithms, alternate scoring algorithms, calculated properties and solvation attributes. Therefore, based on the virtual screening, a plurality of ligand binding sites may be determined for each model chemical compound of the plurality of model chemical compounds. Virtual screening consistent with some embodiments of the invention may be performed by one or more applications accessing databases storing information related to protein binding sites, including for example, the Protein Data-Bank (“PDB”) and the screening-PDB database (sc-PDB) (accessible at URL: http://bioinfo-pharma.u-strasbg.fr/scPDB). Based at least in part on the ligand binding sites determined for each model chemical compound, a plurality of ligand descriptors may be associated with each model chemical compound.
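By way of illustration only, the affinity estimation described above might be sketched as a weighted sum of per-attribute score terms compared against a cutoff. The class DockingTerms, the WEIGHTS table, the sign convention (lower scores are more favorable), and the cutoff value are all hypothetical assumptions made for this sketch; they are not taken from the specification or from any particular docking application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockingTerms:
    """Hypothetical score contributions for one compound docked to one binding site.

    Assumed convention: lower (more negative) values are more favorable.
    """
    hydrophobic: float
    polar: float
    entropic: float
    solvation: float


# Illustrative weights only; a real scoring function would be calibrated.
WEIGHTS = {"hydrophobic": 1.0, "polar": 1.0, "entropic": 0.5, "solvation": 0.5}


def estimate_affinity(terms: DockingTerms) -> float:
    """Combine the attribute terms into a single affinity estimate."""
    return sum(weight * getattr(terms, name) for name, weight in WEIGHTS.items())


def may_bind(terms: DockingTerms, cutoff: float = -6.0) -> bool:
    """Treat the compound as a possible ligand if the estimate meets the cutoff."""
    return estimate_affinity(terms) <= cutoff
```

In such a sketch, a ligand descriptor would be recorded for each binding site for which may_bind returns True.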
Furthermore, those skilled in the art will recognize that various virtual screening software applications may be used to analyze compounds to determine a ligand binding site, including, for example, AutoDock, EADock, Surflex-Dock, and/or other such software applications.
In some embodiments, a computer may analyze the model chemical compounds to determine fragment descriptors associated with each model chemical compound. In these embodiments, each model chemical compound is fragmented into all possible fragments based at least in part on atom type, bond type and atomic connections. In these embodiments, a computer may fragment a respective model chemical compound by analyzing the two-dimensional chemical structure of the compound and identifying fragments based on the properties of the two-dimensional chemical structure, such as atom type, bond type and atomic connections. Based at least in part on the identified chemical fragments determined for each model chemical compound, a plurality of fragment descriptors may be associated with each model chemical compound.
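By way of illustration only, fragmentation of a two-dimensional structure might be sketched as enumerating short atom paths through a graph of atoms and bonds. This sketch is a simplification: it enumerates only linear paths (not branched fragments), and the input representation, the function name path_fragments, and the max_atoms limit are assumptions made for the example.

```python
def path_fragments(atoms, bonds, max_atoms=3):
    """Enumerate linear fragments of 2..max_atoms atoms.

    atoms: list of element symbols, indexed by atom number.
    bonds: dict mapping (i, j) index pairs to a bond symbol such as "-" or "=".
    Returns a set of fragment strings, e.g. {"C-C", "C=O", "C-C=O"}.
    """
    # Build an undirected adjacency list carrying the bond type.
    adjacency = {i: [] for i in range(len(atoms))}
    for (i, j), bond in bonds.items():
        adjacency[i].append((j, bond))
        adjacency[j].append((i, bond))

    fragments = set()

    def extend(path, labels):
        # labels alternates atom symbol, bond symbol, atom symbol, ...
        if len(path) >= 2:
            forward = "".join(labels)
            backward = "".join(reversed(labels))
            # Keep one canonical spelling so A-B and B-A count once.
            fragments.add(min(forward, backward))
        if len(path) == max_atoms:
            return
        for neighbor, bond in adjacency[path[-1]]:
            if neighbor not in path:
                extend(path + [neighbor], labels + [bond, atoms[neighbor]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])
    return fragments
```

For the heavy atoms of acetaldehyde, represented here as atoms ["C", "C", "O"] with a single bond between atoms 0 and 1 and a double bond between atoms 1 and 2, the sketch yields the three fragments C-C, C=O, and C-C=O.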
The computer processes the data (block 106), where processing may include for example, analyzing the data to determine which model chemical compounds of the plurality are of the desired classification and which model chemical compounds of the plurality are not of the desired classification.
The computer generates a computer based SAR model based at least in part on the model chemical compounds, the desired classification, the associated ligand descriptors, and/or the associated fragment descriptors (block 108). The computer based SAR model may be stored in a memory of the computer or in a memory remotely connected to the computer including, for example, a memory of another computer, server, or other such device (block 110). The computer based SAR model may be configured to receive data associated with one or more test chemical compounds, where the data may indicate the test chemical compound, associated ligand descriptors, and/or associated fragment descriptors. Furthermore, the computer based SAR model may be configured to classify the input test chemical compound based at least in part on the model chemical compounds, the classification of each model chemical compound of the plurality, associated ligand descriptors, and/or associated fragment descriptors. Additionally, in some embodiments, the computer based SAR model may be configured to analyze the input test chemical compound to determine ligand descriptors and/or fragment descriptors associated with the input test chemical compound, similar to the methods described above with respect to analyzing the model chemical compounds to determine ligand descriptors and fragment descriptors. As those skilled in the art will recognize, the computer based SAR model may be generated using specially configured software environments, or alternatively, the computer based SAR model may be generated utilizing for example, cat-SAR (as described in: Development of an information-intensive structure-activity relationship model and its application to human respiratory chemical sensitizers, Cunningham, A. R. et al (2005)). It will be appreciated, however, that other software environments and/or utilities may be utilized to implement embodiments consistent with the invention.
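By way of illustration only, generating and storing a model of this kind might be sketched as tallying, for each descriptor, how many active and how many inactive model chemical compounds carry it, and then serializing the tallies. The tuple layout of the learning set, the descriptor strings, and the function names are hypothetical; this is not the cat-SAR implementation.

```python
import json
from collections import defaultdict


def build_model(learning_set):
    """Tally descriptor occurrences among active and inactive model compounds.

    learning_set: iterable of (compound_name, is_active, descriptors) tuples.
    Returns {descriptor: {"active": n, "inactive": m}}.
    """
    counts = defaultdict(lambda: {"active": 0, "inactive": 0})
    for _name, is_active, descriptors in learning_set:
        key = "active" if is_active else "inactive"
        # A descriptor is counted once per compound, even if listed twice.
        for descriptor in set(descriptors):
            counts[descriptor][key] += 1
    return dict(counts)


def serialize_model(model):
    """Encode the model as JSON text suitable for local or remote storage."""
    return json.dumps(model, indent=2, sort_keys=True)


def deserialize_model(text):
    """Restore a model previously produced by serialize_model."""
    return json.loads(text)
```

The serialized text could be written to a memory of the computer or transmitted to a remotely connected memory, consistent with block 110.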
The computer based SAR model determines whether descriptors associated with the test chemical compound correspond to any descriptors associated with model chemical compounds of the desired classification (i.e., “active” model chemical compounds) (block 126). As such, in some embodiments, the computer based SAR model may determine whether the ligand descriptors associated with the test chemical compound match any ligand descriptors associated with the active model chemical compounds. Additionally, the SAR model may determine whether the fragment descriptors associated with the test chemical compound match any fragment descriptors associated with the active model chemical compounds. As such, the SAR model may determine one or more ligand and/or fragment descriptor matches between the test chemical compound and the active model chemical compounds, where each such “active” match increases the likelihood that the test chemical compound is also of the desired classification.
The computer based SAR model determines whether descriptors associated with the test chemical compound correspond to any descriptors associated with model chemical compounds not of the desired classification (i.e., “inactive” model chemical compounds) (block 128). As such, in some embodiments, the SAR model may determine whether the ligand descriptors associated with the test chemical compound match any ligand descriptors associated with the inactive model chemical compounds. Additionally, the SAR model may determine whether the fragment descriptors associated with the test chemical compound match any fragment descriptors associated with inactive model chemical compounds. As such, the SAR model may determine one or more ligand and/or fragment descriptor matches between the test chemical compound and the inactive model chemical compounds, where each such “inactive” match decreases the likelihood that the test chemical compound is also of the desired classification.
Based at least in part on the determined active matches and inactive matches, the SAR model determines whether the test chemical compound is of the desired classification (block 130). Therefore, in these embodiments, the computer generated SAR model may be utilized to determine whether the test chemical compound is of a desired classification, where the computer generated SAR model includes active model chemical compounds, inactive model chemical compounds, ligand descriptors associated with the model chemical compounds, and/or fragment descriptors associated with the model chemical compounds.
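By way of illustration only, the matching of blocks 126 through 130 might be sketched as counting active and inactive descriptor matches and comparing the resulting fraction against a cutoff. The specification states only that active matches increase, and inactive matches decrease, the likelihood of the desired classification; the specific ratio and the 0.5 cutoff used here are assumptions made for the sketch.

```python
def classify(test_descriptors, active_descriptors, inactive_descriptors, cutoff=0.5):
    """Classify a test compound from its descriptor matches.

    Returns None when no descriptor matches either group (no prediction),
    otherwise a (likelihood, is_active) tuple.
    """
    test = set(test_descriptors)
    active_matches = test & set(active_descriptors)      # block 126
    inactive_matches = test & set(inactive_descriptors)  # block 128

    total = len(active_matches) + len(inactive_matches)
    if total == 0:
        return None

    # Assumed rule: fraction of all matches that are "active" matches.
    likelihood = len(active_matches) / total
    return likelihood, likelihood > cutoff               # block 130
```

A descriptor associated with both active and inactive model chemical compounds would count toward both tallies in this sketch.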
For example, a computer based SAR model consistent with embodiments of the invention may be configured to determine whether a test chemical compound is carcinogenic. In this exemplary embodiment, the computer based SAR model may include a plurality of model chemical compounds classified as carcinogenic (i.e., active model chemical compounds) and a plurality of model chemical compounds classified as non-carcinogenic (i.e., inactive model chemical compounds). The computer based SAR model may further include a plurality of ligand descriptors and/or fragment descriptors associated with the plurality of model chemical compounds. The test chemical compound may be input into the SAR model to determine whether the test chemical compound is carcinogenic. In this example, the ligand and/or fragment descriptors associated with the test chemical compound may be determined by analyzing the test chemical compound, as discussed above, or alternatively, the ligand and/or fragment descriptors associated with the test chemical compound may be indicated by the input data. The SAR model analyzes the test chemical compound to determine active matches and inactive matches, as described above, and based at least in part on the determined active matches and the inactive matches, the SAR model determines whether the test chemical compound is carcinogenic.
As discussed above, in some embodiments, the SAR model may analyze the model chemical compounds to determine ligand descriptors associated with each model chemical compound. As such, in some embodiments, the computer executing the SAR model may generate a chemical compound-ligand matrix, where each row of the matrix may represent a model chemical compound of the plurality, and each column may represent a protein of the plurality of proteins (block 146).
The computer may analyze the affinity scores for each ligand binding site to determine a plurality of ligand descriptors associated with each model chemical compound (block 148). For a respective model chemical compound, the computer may determine a subset of the plurality of proteins with which the respective model chemical compound is most likely to interact based at least in part on the affinity score determined for the respective model chemical compound for the ligand binding site associated with each protein of the plurality, and the computer may associate ligand descriptors to each model chemical compound based at least in part on the determined subset of proteins for each model chemical compound.
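By way of illustration only, the chemical compound-ligand matrix and the selection of the most likely interacting proteins might be sketched as follows. The row and column layout follows the description above; the top_k parameter, the "ligand:" descriptor prefix, and the convention that lower affinity scores indicate stronger predicted binding are assumptions made for the example.

```python
def ligand_descriptors_from_matrix(compounds, proteins, scores, top_k=2):
    """Derive ligand descriptors from a compound-by-protein affinity matrix.

    compounds: row labels (model chemical compounds).
    proteins:  column labels (one binding site per protein is assumed).
    scores:    scores[r][c] is the affinity score of compounds[r] for proteins[c];
               lower is assumed to mean stronger predicted binding.
    Returns {compound: [descriptor, ...]} with the top_k best-scoring proteins.
    """
    descriptors = {}
    for name, row in zip(compounds, scores):
        # Sort ascending so the most favorable scores come first.
        ranked = sorted(zip(row, proteins))
        descriptors[name] = [f"ligand:{protein}" for _score, protein in ranked[:top_k]]
    return descriptors
```

A fixed score cutoff could be substituted for the top_k ranking where a variable number of descriptors per compound is preferred.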
In some embodiments, a plurality of model chemical compounds may be analyzed to determine a plurality of fragment descriptors associated with each model chemical compound. In these embodiments, a computer may generate a chemical compound-fragment matrix where each row of the matrix may represent a model chemical compound of the plurality, and the columns may comprise the fragments of the chemical compound (block 166). The computer may analyze the fragments of each model chemical compound to determine the plurality of fragment descriptors to associate with each model chemical compound (block 168).
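By way of illustration only, a chemical compound-fragment matrix might be sketched as a binary presence table with one row per model chemical compound and one column per distinct fragment. The function name and the dictionary input are assumptions made for the example.

```python
def fragment_matrix(fragments_by_compound):
    """Build a binary compound-by-fragment matrix.

    fragments_by_compound: {compound_name: set of fragment strings}.
    Returns (rows, columns, matrix) where matrix[r][c] is 1 if compound
    rows[r] contains fragment columns[c], otherwise 0.
    """
    rows = list(fragments_by_compound)
    # Columns are the union of all fragments, sorted for a stable layout.
    columns = sorted(set().union(*fragments_by_compound.values()))
    matrix = [
        [1 if fragment in fragments_by_compound[name] else 0 for fragment in columns]
        for name in rows
    ]
    return rows, columns, matrix
```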
Referring now to
The SAR model may determine whether the input test chemical is DNA reactive (block 184). In these embodiments, the SAR model may include a plurality of model chemical compounds and ligand and/or fragment descriptors which may be utilized to determine whether the test chemical compound is DNA reactive (e.g., the desired classification is DNA reactive), as discussed previously. As such, in these embodiments, the computer based SAR model may determine a first classification of the test chemical compound and dynamically determine an appropriate SAR model to execute to determine a second classification of the test chemical compound based at least in part on the first classification. Furthermore, the SAR model may make a plurality of classifications based at least in part on previous classifications. As such, the SAR model may include a plurality of model chemical compounds, a plurality of ligand descriptors, and/or a plurality of fragment descriptors which may be utilized for the first classification, and the SAR model may include a plurality of model chemical compounds, a plurality of ligand descriptors and/or a plurality of fragment descriptors which may be utilized for each successive classification. As such, referring to flowchart 180, the SAR model may include a first plurality of model chemical compounds, a first plurality of ligand descriptors, and/or a first plurality of fragment descriptors for determining whether the test chemical compound is DNA reactive, where the SAR model may analyze the test chemical compound using a ligand model and/or a fragment model of the SAR model based on the DNA reactivity classification.
In response to determining that the test chemical compound is DNA reactive (block 184, “Y” branch), the computer based SAR model may cause a fragment model included in the SAR model to be executed by inputting fragment descriptors associated with the test chemical compound into the fragment model of the SAR model (block 186). The SAR model determines whether the test chemical compound is of the desired classification based at least in part on the fragment descriptors associated with the test compound (block 188).
In response to determining that the test chemical compound is not DNA reactive (block 184, “N” branch), the computer based SAR model may cause a ligand model included in the SAR model to be executed by inputting ligand descriptors associated with the test chemical compound into the ligand model of the SAR model (block 190). The SAR model determines whether the test chemical compound is of the desired classification based at least in part on the ligand descriptors associated with the test chemical compound (block 192).
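By way of illustration only, the branch at block 184 might be sketched as a dispatch between two callables, one standing in for the fragment model and one for the ligand model. The function and parameter names are hypothetical, and each model is assumed to accept a collection of descriptors and return True when the test chemical compound is of the desired classification.

```python
from typing import Callable, Iterable, Tuple

# A model is assumed to map a collection of descriptors to a yes/no classification.
Model = Callable[[Iterable[str]], bool]


def classify_with_routing(
    is_dna_reactive: bool,
    fragment_descriptors: Iterable[str],
    ligand_descriptors: Iterable[str],
    fragment_model: Model,
    ligand_model: Model,
) -> Tuple[str, bool]:
    """Select the second-stage model from the first classification.

    Returns (name of the model executed, classification result).
    """
    if is_dna_reactive:
        # "Y" branch: blocks 186 and 188.
        return "fragment", fragment_model(fragment_descriptors)
    # "N" branch: blocks 190 and 192.
    return "ligand", ligand_model(ligand_descriptors)
```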
In these embodiments, a SAR model consistent with embodiments of the invention determines a first classification of the input test chemical compound and, in response to the first classification, may choose a particular model included in the SAR model to execute to make a second classification of the test chemical compound. While flowchart 180 illustrates a SAR model determining whether the test chemical compound is DNA reactive as the first classification, the invention is not so limited. For example, a SAR model consistent with embodiments of the invention may determine whether an input test chemical compound is carcinogenic and, in response to determining that the test chemical compound is carcinogenic, may determine the target site/organ in which the carcinogenic test chemical compound may cause cancer. Alternatively, in an exemplary embodiment, a SAR model consistent with the invention may determine whether a test chemical compound is DNA reactive; based at least in part on determining that the test chemical compound is or is not DNA reactive, the SAR model may execute a model included in the SAR model to determine whether the test chemical compound is carcinogenic; and based at least in part on determining whether the test chemical compound is carcinogenic, the SAR model may execute a model included in the SAR model to determine a target site/organ with which the carcinogenic test compound interacts to cause cancer.
Embodiments consistent with the invention may determine whether unknown/unclassified test chemical compounds are of a desired classification and/or include a desired property, where such classifications include, for example, DNA reactivity, carcinogenicity, target organ/site where cancer may be caused, genotoxicity, mutagenicity, activity in target types of cells (e.g., a chemical compound may be active only in cancer cells of a specific type, and thus may be utilized to develop cancer treatment), and other such like classifications/properties.
Moreover, in embodiments similar to the exemplary embodiment provided in flowchart 180, by dynamically selecting a model included in the SAR model for execution based at least in part on a first classification, the SAR model may advantageously execute a particular model that is more effective at determining a second classification of the test chemical compound if the test chemical compound is of a first desired classification. For example, a fragment model included in the SAR model may be more effective at determining whether a test chemical compound is carcinogenic if the test chemical compound is DNA reactive. Likewise, a ligand model included in the SAR model may be more effective at determining whether a test chemical compound is carcinogenic if the test chemical compound is not DNA reactive. As such, embodiments of the invention may dynamically select different models included in the SAR model for execution to increase accuracy of classifications (as compared to classifications based on testing), effectiveness of the classifications, speed of the classification, and/or other like metrics.
The SAR model determines whether the test chemical compound is carcinogenic (block 204). As discussed, in some embodiments the SAR model may execute an included model to determine whether the test chemical compound is carcinogenic. Alternatively, in other embodiments, the data input into the SAR model may indicate that the test chemical compound is carcinogenic. In response to determining that the test chemical compound is carcinogenic, the test chemical compound is input into a model included in the SAR model (block 206). The SAR model determines whether the test chemical compound targets a specific site/organ to cause cancer (block 208). For example, the SAR model may determine whether the carcinogenic test chemical compound interacts to cause mammary cancer (i.e., the test chemical compound is a mammary carcinogen). Moreover, the SAR model may input the carcinogenic test chemical compound into a plurality of models to determine whether the carcinogenic test chemical compound interacts with a respective specific site/organ of a plurality of specific sites/organs, where a model for each respective site/organ may be included in the SAR model, consistent with some embodiments of the invention.
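By way of illustration only, applying a plurality of site/organ models to a carcinogenic test chemical compound might be sketched as follows. The per-site models are represented as hypothetical callables keyed by site name, and the descriptor strings in any usage are placeholders rather than real descriptors.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

# Each site model is assumed to report whether the compound targets that site.
SiteModel = Callable[[FrozenSet[str]], bool]


def predict_target_sites(
    descriptors: Iterable[str],
    is_carcinogenic: bool,
    site_models: Dict[str, SiteModel],
) -> List[str]:
    """Return the sites/organs predicted for a carcinogenic test compound.

    Non-carcinogenic compounds are not passed to the site models (block 204).
    """
    if not is_carcinogenic:
        return []
    observed = frozenset(descriptors)
    # Blocks 206 and 208, repeated for each site/organ model included.
    return [site for site, model in site_models.items() if model(observed)]
```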
Furthermore, while in some embodiments a SAR model consistent with embodiments of the invention may determine a first classification using a model included in the SAR model, those skilled in the art will recognize that other classification methods and systems may be utilized to make a first classification, the results of which may be input into the SAR model for further classification. Moreover, while the invention has been and hereinafter will be described as inputting a test chemical compound, those skilled in the art will recognize that a computer based SAR model consistent with embodiments of the invention may input a plurality of test chemical compounds, such that the SAR model may determine whether each test chemical compound of the plurality of input test chemical compounds is of the desired classification substantially in parallel.
Moreover, while the invention has been and hereinafter will be described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer readable media used to carry out the distribution. Examples of computer readable media include but are not limited to tangible, recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, magnetic tape, optical disks (e.g., CD-ROMs, DVDs, Blu-ray discs, etc.), among others. Moreover, those skilled in the art will recognize that such computer readable media may include remotely connected memory locations.
As described above, SAR models consistent with embodiments of the invention execute to determine whether a test chemical compound is of a desired classification. Ligand and/or fragment descriptors are utilized to determine an association between the descriptors and the activity/inactivity of a test chemical compound, where “activity” may be defined as the test chemical compound being of the desired classification, and “inactivity” may be defined as the test chemical compound not being of the desired classification. The activity or inactivity of a descriptor may be determined based on the model chemical compounds with which the descriptor is associated. For example, a respective ligand descriptor may be associated with one or more model chemical compounds of the plurality, where some of the model chemical compounds may be active and some of the model chemical compounds may be inactive. However, not all ligand binding sites and chemical fragments determined from analysis of the model chemical compounds may be indicative of the activity or inactivity of the model chemical compounds. Thus, in some embodiments of the invention, determining ligand descriptors and fragment descriptors by analyzing the model chemical compounds may include determining which ligand binding sites and which chemical fragments are important in the classification performed by the SAR model, and identifying those determined ligand binding sites and chemical fragments as descriptors for the model.
For example, in some embodiments a computer generating a SAR model consistent with embodiments of the invention may determine important ligand binding sites by requiring a threshold number of model chemical compounds to be a ligand for the protein associated with the ligand binding site. Likewise, in some embodiments, a computer generating a SAR model may require a threshold proportion of active model compounds and/or inactive model compounds to be a ligand for the protein associated with the ligand binding site. Similarly, in some embodiments a computer generating a SAR model consistent with embodiments of the invention may require a threshold number of model chemical compounds to include a particular chemical fragment, and/or the computer may require a threshold proportion of active model chemical compounds and/or inactive model chemical compounds to include the particular chemical fragment for the chemical fragment to be considered a fragment descriptor.
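One way to picture the threshold tests described above is the following Python sketch. It is hedged throughout: the function name `select_descriptors`, the default threshold values, and the reading of "threshold proportion" as the share of a candidate's associated model chemical compounds that are active (or inactive) are all assumptions made for illustration, not values or definitions stated in this description.

```python
from typing import Dict, List, Set


def select_descriptors(
    candidates: Dict[str, Set[str]],  # candidate site/fragment -> associated model compounds
    active: Set[str],                 # model compounds labeled active
    min_count: int = 3,               # assumed threshold number of compounds
    min_proportion: float = 0.75,     # assumed threshold proportion
) -> List[str]:
    """Return candidates that pass both the count and the proportion test."""
    selected = []
    for name, compounds in sorted(candidates.items()):
        n = len(compounds)
        if n < min_count:
            continue  # too few model compounds support this candidate
        frac_active = len(compounds & active) / n
        # Keep candidates that skew strongly toward active or toward inactive.
        if max(frac_active, 1.0 - frac_active) >= min_proportion:
            selected.append(name)
    return selected
```

The same routine applies to ligand binding sites (where association means the compound is a ligand for the protein) and to chemical fragments (where association means the compound contains the fragment).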
Furthermore, as discussed above, a respective descriptor may be associated with more than one model chemical compound, where a descriptor may be associated with one or more active model chemical compounds and one or more inactive model chemical compounds. As such, presence of a particular descriptor in the plurality of descriptors associated with a test chemical compound indicates a probability of activity and/or inactivity. Therefore, in some embodiments, after determining all ligand descriptors and/or fragment descriptors associated with the test chemical compound, the probability of activity (i.e., the probability that the test chemical compound is of the desired classification) must be determined, where a threshold probability of activity may be required by a SAR model consistent with embodiments of the invention to determine that the test chemical compound is of the desired classification.
SAR models consistent with embodiments of the invention may determine the probability of activity based at least in part on the number of active descriptor matches (i.e., a descriptor associated with the test chemical compound matches a descriptor associated with the active model chemical compounds) and/or the number of inactive descriptor matches (i.e., a descriptor associated with the test chemical compound matches a descriptor associated with the inactive model chemical compounds). For example, in some embodiments, all active and inactive model chemical compounds associated with each descriptor may be added, and the total active model chemical compounds are divided by the total model chemical compounds to determine the probability of activity. For example, if two descriptors are associated with a test chemical compound, one descriptor being associated with 9/10 active model chemical compounds and the other descriptor being found in 3/3 inactive model chemical compounds (i.e., 0/3 actives), the probability of activity of the test chemical compound may be determined by pooling the counts as (9+0)/(10+3)=9/13 actives, or approximately a 69% chance of activity. In other embodiments, the probability of activity may be determined by calculating the probability of activity associated with each descriptor. Using the above example, the two probabilities of activity would be 90% (9/10 actives) and 0% (0/3 actives), which may be averaged to determine a probability of activity of 45%.
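The two calculations in the preceding paragraph can be reproduced with a short Python sketch. The function names `pooled_probability` and `averaged_probability` and the `(actives, total)` tuple layout are assumptions chosen for this illustration.

```python
from typing import List, Tuple

# (active model compounds, total model compounds) for one matched descriptor
Counts = Tuple[int, int]


def pooled_probability(counts: List[Counts]) -> float:
    """Sum actives and totals over all matched descriptors, then divide."""
    total_active = sum(a for a, _ in counts)
    total = sum(n for _, n in counts)
    return total_active / total


def averaged_probability(counts: List[Counts]) -> float:
    """Compute a probability per descriptor, then take the mean."""
    return sum(a / n for a, n in counts) / len(counts)


# The worked example from the text: 9/10 actives and 0/3 actives.
example = [(9, 10), (0, 3)]
print(round(pooled_probability(example), 2))    # 0.69
print(round(averaged_probability(example), 2))  # 0.45
```

The pooled form weights each descriptor by the number of model chemical compounds behind it, whereas the averaged form gives every descriptor equal weight, which is why the same inputs yield 69% in one case and 45% in the other.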
In some embodiments consistent with the invention, a SAR model including a hybrid model, which in turn includes a ligand model and a fragment model, may execute both models to determine whether a test chemical compound is of the desired classification. As such, in some hybrid models consistent with SAR models of the invention, a determination of whether a test chemical compound is of the desired classification may require both the ligand model and the fragment model to determine that the test chemical compound is of the desired classification. In other embodiments consistent with the invention, a Bayesian hybrid model may combine determinations from the fragment model and the ligand model with a final determination as to classification based on Bayes' theorem.
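The two ways of combining the ligand and fragment determinations can be sketched as follows. The consensus rule follows directly from the paragraph above. The Bayesian rule is only one possible reading: the description says the final determination is based on Bayes' theorem but does not give the formula, so the prior, the use of each model's sensitivity and specificity as likelihoods, and the assumption that the two models err independently are all assumptions of this sketch.

```python
from typing import List, Tuple


def consensus(ligand_active: bool, fragment_active: bool) -> bool:
    """Hybrid rule requiring both models to call the compound active."""
    return ligand_active and fragment_active


# (model predicted active?, model sensitivity, model specificity)
Call = Tuple[bool, float, float]


def bayes_posterior(prior: float, calls: List[Call]) -> float:
    """Posterior probability of activity given each model's call.

    Assumes (for illustration only) that the models are conditionally
    independent given the true class of the compound.
    """
    p_active = prior
    p_inactive = 1.0 - prior
    for predicted_active, sensitivity, specificity in calls:
        if predicted_active:
            p_active *= sensitivity            # P(call active | truly active)
            p_inactive *= 1.0 - specificity    # P(call active | truly inactive)
        else:
            p_active *= 1.0 - sensitivity      # P(call inactive | truly active)
            p_inactive *= specificity          # P(call inactive | truly inactive)
    return p_active / (p_active + p_inactive)
```

A threshold on the returned posterior would then play the role of the threshold probability of activity discussed earlier.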
In some embodiments, a self-fit analysis, cross-validation analysis, and/or external validation may be performed by a computer generating a SAR model consistent with embodiments of the invention to determine whether the generated SAR model accurately determines whether a chemical compound is of a desired classification. For a self-fit analysis, after a SAR model is developed, the SAR model may be used to predict the activity (and classification) of the model chemical compounds in order to ascertain whether or not the SAR model is capable of fitting its own data. In some embodiments, a leave-one-out (LOO) validation may be conducted where each model chemical compound, one at a time, may be removed from the plurality of model chemical compounds of the SAR model (i.e., the learning set of the SAR model) and an n-1 SAR model may be derived.
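The leave-one-out procedure can be sketched generically in Python. The `fit` callable is a hypothetical placeholder for whatever routine derives a SAR model from a learning set; it is not an interface defined in this description.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, bool]            # (compound identifier, known activity)
Predictor = Callable[[str], bool]     # derived model: compound -> predicted activity


def leave_one_out(
    learning_set: Sequence[Example],
    fit: Callable[[Sequence[Example]], Predictor],
) -> List[Tuple[bool, bool]]:
    """Return (known, predicted) pairs, one per held-out model compound."""
    results = []
    for i, (compound, known) in enumerate(learning_set):
        # Derive the n-1 model from every compound except the held-out one.
        reduced = list(learning_set[:i]) + list(learning_set[i + 1:])
        model = fit(reduced)
        results.append((known, model(compound)))
    return results
```

A leave-many-out run follows the same pattern, except that each iteration removes a randomly chosen subset of the learning set rather than a single compound.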
Moreover, in some embodiments, a leave-many-out (LMO) validation may be conducted where, for example, 10,000 randomly selected sets, each containing, for example, 2.5% of the model chemical compounds, may be removed from the plurality, and an n-2.5% SAR model may be derived.
In some embodiments, an external validation may be performed on a generated SAR model. In these embodiments, random sets of a desired percentage of the model chemical compounds may be removed, and a SAR model may be generated using the remaining model chemical compounds of the learning set. For example, 10 random sets of 10% of the model chemical compounds may be removed, with the remaining 90% of the model chemical compounds used to generate a SAR model and determine the classification of the removed model chemical compounds, and the average sensitivity, specificity, and concordance values may be calculated, while predictions close to the activity threshold for the model may be excluded from the final assessment of the SAR model.
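The validation statistics named above can be computed as in the following sketch, which uses the conventional definitions: sensitivity is the fraction of active compounds predicted active, specificity is the fraction of inactive compounds predicted inactive, and concordance is the fraction of all assessed compounds predicted correctly. The `margin` parameter, which implements the exclusion of predictions close to the activity threshold, is an assumed mechanism; the description does not state how close is "close".

```python
from typing import Dict, List, Tuple


def validation_metrics(
    results: List[Tuple[bool, float]],  # (known active?, predicted activity value)
    cutoff: float,                      # activity threshold of the model
    margin: float = 0.0,                # assumed exclusion band around the cutoff
) -> Dict[str, float]:
    tp = fn = tn = fp = 0
    for known, score in results:
        if abs(score - cutoff) < margin:
            continue  # prediction too close to the threshold; not assessed
        predicted = score >= cutoff
        if known and predicted:
            tp += 1
        elif known:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    total = tp + fn + tn + fp
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "concordance": (tp + tn) / total if total else nan,
    }
```

For an external validation over several random hold-out sets, the function would be called once per set and the three values averaged across sets.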
The computer assembles protein ligand binding sites (block 284). In some embodiments, the computer may access one or more databases to determine proteins to be included in the protein ligand binding site structures used to generate the SAR model. The computer virtually screens the model chemical compounds of the learning set to the protein binding site structures to estimate affinity values for each model chemical compound to each protein binding site structure (block 286). The computer generates a model chemical compound-ligand matrix including the estimated affinity values for each model chemical compound to each protein binding site structure, and the computer analyzes the matrix to determine ligand descriptors to associate with each model chemical compound (block 288). Based on the determined ligand descriptors and the model chemical compounds of the learning set, the computer generates the computer based SAR model (block 290).
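As a hedged illustration of blocks 286 and 288, the sketch below turns a compound-by-protein matrix of estimated affinity values into ligand descriptors. How the matrix is analyzed is not specified in this description, so the rule used here (a compound is treated as a ligand for a protein when its estimated affinity value is at or below a cutoff, with lower values meaning stronger binding) is an assumption, as are the function name and the nested-dictionary layout.

```python
from typing import Dict, List

# affinity[compound][protein] = estimated affinity value from virtual screening
AffinityMatrix = Dict[str, Dict[str, float]]


def ligand_descriptors(affinity: AffinityMatrix, cutoff: float) -> Dict[str, List[str]]:
    """Associate each compound with the proteins it is predicted to bind.

    Assumption: lower (more negative) values indicate stronger binding, so a
    value at or below the cutoff counts as a predicted ligand interaction.
    """
    return {
        compound: sorted(protein for protein, value in row.items() if value <= cutoff)
        for compound, row in affinity.items()
    }
```

The same function can be applied to a test chemical compound at prediction time by passing a matrix with a single row.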
The computer may validate the generated SAR model by performing a LOO validation, LMO validation, and/or external validation (block 292). If the SAR model meets specificity, sensitivity, and/or concordance requirements, the computer may execute the SAR model to predict the classification of an unknown chemical compound (i.e., a test chemical compound). The computer executing the SAR model virtually screens the test chemical compound against the protein ligand binding site structures to estimate affinity values for the test chemical compound with each protein binding site structure, and the computer associates ligand descriptors with the test chemical compound based on the estimated affinity values (block 294). The computer determines whether the test chemical compound is of the desired classification based on the ligand descriptors and the biological relevance of the ligand descriptors to the ligand descriptors associated with the model chemical compounds (block 296).
A computer generating a SAR model assembles a learning set of chemical compounds (i.e., a plurality of chemical compounds) (block 302). The computer fragments each model chemical compound into a plurality of chemical fragments (block 304). The computer sequentially numbers all the chemical fragments of the model chemical compounds and organizes the chemical fragments (block 306). The computer generates a model chemical compound-chemical fragment matrix (block 308), where the matrix may be analyzed to determine fragment descriptors associated with each model chemical compound. The computer generates a SAR model based at least in part on the model chemical compounds and the fragment descriptors associated with each model chemical compound (block 310).
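Blocks 306 and 308 can be pictured with the following sketch, which numbers the chemical fragments sequentially and builds a presence/absence matrix. The fragmentation step itself (block 304) is not shown; the input is assumed to be a mapping from each model chemical compound to fragment strings that have already been produced, and the alphabetical ordering used for numbering is an arbitrary choice made so that the output is reproducible.

```python
from typing import Dict, Iterable, List, Tuple


def build_fragment_matrix(
    compound_fragments: Dict[str, Iterable[str]],
) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Return (fragment numbering, compound-by-fragment 0/1 matrix)."""
    per_compound = {c: set(frags) for c, frags in compound_fragments.items()}
    # Collect every distinct fragment and number them sequentially from 1.
    all_fragments = sorted(set().union(*per_compound.values())) if per_compound else []
    numbering = {frag: i for i, frag in enumerate(all_fragments, start=1)}
    # One row per compound; column k is 1 when fragment number k+1 is present.
    matrix = {
        compound: [1 if frag in frags else 0 for frag in all_fragments]
        for compound, frags in per_compound.items()
    }
    return numbering, matrix
```

The columns of the resulting matrix are the candidate fragments from which fragment descriptors may then be selected.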
The computer may validate the generated SAR model by performing a LOO validation, a LMO validation, and/or an external test validation (block 312). A computer executing the SAR model receives data indicating an unknown chemical compound (i.e., a test chemical compound), and the SAR model fragments the test chemical compound into a plurality of chemical fragments. The SAR model associates a plurality of fragment descriptors with the test chemical compound based at least in part on the chemical fragments (block 314). The SAR model analyzes the chemical fragments of the test chemical compound using the chemical fragments associated with the model chemical compounds to determine whether the test chemical compound is of the desired classification (block 316).
One area of particular difficulty in the classification of unknown/unclassified chemical compounds is determining whether or not a non-genotoxic chemical will be carcinogenic by means other than cancer bioassays, in large part because the cancer bioassays require significant resources and time to complete. The Ames Salmonella mutagenicity assay and other short-term tests for genotoxicity may be used to detect some carcinogens. These short-term genotoxicity tests only identify carcinogens that are genotoxic. However, a significant number of cancer causing (carcinogenic) chemical compounds are non-genotoxic, and do not directly interact with DNA but rather may induce cancer by alternative mechanisms. Hence, a classification on the Ames assay as non-genotoxic does not rule out the possibility that the chemical compound is a carcinogen, and conventional methods and systems fail to classify such chemical compounds.
As such, some embodiments of the invention may work in conjunction with a short-term assay, including, for example, the Ames assay, to identify non-genotoxic carcinogens from among test chemical compounds that are indicated as non-genotoxic by the short-term assay. Moreover, in some embodiments, the computer based SAR model may dynamically select a model from a plurality of models included in the SAR model to determine whether a test chemical compound is of a desired classification based at least in part on the results of one of the short-term assays. Furthermore, while short-term assays such as the Ames assay may be useful for determining that a test chemical compound is genotoxic, the rapid throughput of a computer based SAR model of the present invention provides a distinct advantage for classifying a large number of test chemical compounds. Moreover, in some embodiments a SAR model consistent with the invention may be utilized to model the Ames assay, where the SAR model may include a model configured to determine whether a test chemical compound is genotoxic (e.g., the model may be configured to model the Ames assay), and the SAR model may selectively execute an included hybrid model, ligand model, and/or fragment model to determine whether the test chemical compound is of another desired classification (e.g., carcinogenic, targeting a specific site/organ, and/or other such classifications).
While a computer based SAR model consistent with embodiments of the invention may be used to determine whether unknown chemical compounds are of a desired classification, in some embodiments, a computer based SAR model consistent with embodiments of the invention may also be utilized to determine one or more characteristics of the desired classification which the SAR model is configured to model. For example, in some embodiments, a SAR model including a learning set of model chemical compounds and a plurality of ligand descriptors associated with each model chemical compound may be analyzed to generate characteristic data based at least in part on the ligand descriptors and the model chemical compounds.
For example, if a SAR model were configured to classify compounds as carcinogenic, the SAR model may include a plurality of model chemical compounds classified as carcinogenic and a plurality of model chemical compounds classified as non-carcinogenic, and the SAR model may further include a plurality of ligand descriptors associated with each model chemical compound. As such, the computer may analyze the carcinogenic model chemical compounds to identify one or more ligand descriptors associated with multiple carcinogenic compounds as characteristic ligand descriptors. Moreover, in some embodiments, the computer may identify a ligand descriptor as not a characteristic ligand descriptor if the ligand descriptor is also associated with one or more model chemical compounds not of the classification. The computer may identify a protein associated with each characteristic ligand descriptor, where the associated protein may relate to carcinogenicity. As such, the computer may generate characteristic data which indicates biological activity characteristics of carcinogenicity, where the data may indicate the characteristic ligand descriptors, the associated proteins, or other such similar information. The characteristic data may be output in a format executable by the computer, in a format readable by an operator of the computer, etc. As those skilled in the art will recognize, the characteristic data generated from analyzing a SAR model consistent with embodiments of the invention may be invaluable in determining factors involved in causing disease, causing cancer, treating disease, treating cancer, and other such purposes, where the characteristic data may identify common properties among the model chemical compounds of a desired classification that may be used as discussed.
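The identification of characteristic ligand descriptors can be sketched as follows. The function name, the `min_active` parameter (the text says only "multiple" carcinogenic compounds, taken here as at least two), and the descriptor labels used in any example call are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def characteristic_descriptors(
    descriptors: Dict[str, Set[str]],  # model compound -> its ligand descriptors
    active: Set[str],                  # model compounds of the desired classification
    min_active: int = 2,               # assumed meaning of "multiple" active compounds
    exclude_shared: bool = True,       # drop descriptors also seen on inactive compounds
) -> List[str]:
    active_counts: Counter = Counter()
    seen_on_inactive: Set[str] = set()
    for compound, ds in descriptors.items():
        if compound in active:
            active_counts.update(ds)
        else:
            seen_on_inactive.update(ds)
    return sorted(
        d for d, n in active_counts.items()
        if n >= min_active and not (exclude_shared and d in seen_on_inactive)
    )
```

Setting `exclude_shared=False` corresponds to embodiments that do not apply the additional exclusion of descriptors associated with compounds outside the classification.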
Exemplary Structure Based Activity Relationship Models And Results.
To compare performance of SAR models consistent with some embodiments of the invention, an exemplary SAR model was generated to determine whether a test chemical compound is a mammary carcinogen. The SAR model, which may be referred to as the hybrid MC-NC model, included a plurality of model chemical compounds classified as mammary carcinogens and a plurality of model chemical compounds classified as non-carcinogens. The hybrid MC-NC model included a plurality of ligand descriptors and a plurality of fragment descriptors associated with the model chemical compounds included in the hybrid MC-NC model, where the hybrid MC-NC model includes a ligand model and a fragment model.
Leave-one-out (LOO) validation of the fragment model returned a concordance of 75%, a sensitivity of 69%, and a specificity of 81%, and the ligand model returned a concordance of 67% with a sensitivity of 69% and a specificity of 64% (Table 1). The fragment model made predictions on 182 out of the 208 chemical compounds (88%) and was based on 1583 significant fragments (724 active and 859 inactive). The ligand model made predictions on all 208 chemicals (100%) and was based on 835 proteins (216 active and 619 inactive). Through adjustment of various threshold requirements in the hybrid MC-NC model, the hybrid MC-NC model returned a concordance of 79%, a sensitivity of 72%, and a specificity of 86%.
Thus, differences exist between the classes of chemical compounds, where such classification may affect the predictive value of the two dimensional chemical structure and/or ligand binding site affinity. Since a fragment model and ligand model are both predictive and derive from different perspectives, the models may reflect different attributes of the model chemical compounds as well as different facets of the toxicological phenomena under study. Therefore, a computer based SAR model including a hybrid model, which in turn includes a ligand model and a fragment model, may improve classification accuracy.
Provided below are some experimental results classifying a test chemical compound using a computer executing a SAR model consistent with embodiments of the invention.
PhIP: PhIP (2-amino-1-methyl-6-phenylimidazo[4,5-b]pyridine) has been demonstrated to be a genotoxic carcinogen and an estrogen receptor ligand and is reported in the CPDB as a Salmonella mutagen and mammary carcinogen. The International Agency for Research on Cancer (IARC) indicates that there is inadequate evidence to determine its carcinogenicity in humans and adequate evidence for carcinogenicity in experimental animals. A fragment model analysis of rat mammary carcinogens observed that structural fragments were able to accurately classify PhIP as a mammary carcinogen. Some of the fragments used for this classification were related to genotoxicity, while other fragments, although related to carcinogenicity, were not apparently related to genotoxicity. In other words, this latter set of fragments suggested a non-genotoxic mechanism to PhIP's carcinogenic potential. With reference to table 200 provided below, analysis of PhIP by executing the ligand model determined that PhIP was accurately predicted during the LOO validation to be a mammary carcinogen rather than a non-carcinogen due to its potential interaction with 60 proteins, as indicated in table 1 (e.g., the activity value=0.64, cutoff value=0.61). Interestingly, of the 60 proteins identified, several were related to “estrogenicity,” including estrogen sulfotransferase (Protein Data Bank (PDB) 1HY3), estrogen receptor alpha (PDB 1X7E), and estrogen receptor beta (PDB 1X78).
Atrazine: Atrazine, a triazine herbicide, is reported in the CPDB as a Salmonella non-mutagen and rat mammary carcinogen. IARC indicates that while there is adequate evidence of carcinogenicity in experimental animals, there is inadequate evidence to determine its carcinogenicity in humans. Referring to table 2, provided below, and considering the LOO validation, atrazine was correctly predicted to be a rat mammary carcinogen by the ligand model (activity value=0.66, cutoff value=0.61). Of the 79 PDB structures used for the MC-NC prediction for mammary carcinogenicity, an automated Medline search identified six proteins that had references to both breast cancer and atrazine. These included aspartate aminotransferase (PDB 1AKA, 1ARG, 1CQ8), L-lactate dehydrogenase (PDB 1LLD), glycogen phosphorylase (PDB 1P4G), chitinase (PDB 1W1T), chloramphenicol aminotransferase 3 (PDB 1CLA), and glutathione S-transferase (PDB 4GST).
Given these brief examples of rat mammary carcinogens and the observation that some of the PDB structures used for their accurate assessment as a rat mammary carcinogen have already been shown to be associated with the agent in question and breast cancer, it is evident that a SAR model including a ligand model can be used to provide a degree of insight into biologically relevant descriptors of activity. In other words, if no mechanism-based explanation for the mammary carcinogenic activity of these agents had yet been discovered, the modeling process described herein would have pointed to some likely targets for the agent and its carcinogenic activity.
While various examples herein have described determining whether a test chemical compound is carcinogenic, DNA reactive, and/or targets specific organs/sites, those skilled in the art will recognize that the invention is not so limited. For example, SAR models consistent with embodiments of the invention may be configured to determine whether a test chemical compound is toxic, an endocrine disruptor, an allergen, developmentally toxic, or of other such classifications. Moreover, in some embodiments, a test chemical may be input into a SAR model to determine whether the chemical is of a classification, including, for example, cancer fighting, disease fighting, and other such beneficial classifications. As such, embodiments of the invention may be used in a wide variety of applications where it is desirable to classify chemical compounds. For example, a property of an unknown chemical compound may be predicted using a SAR model consistent with embodiments of the invention. As such, some embodiments of the invention may be utilized to select test chemical compounds from a plurality of test chemical compounds that are predicted to possess the desired property.
While the invention has been illustrated by a description of the various embodiments and the examples, and while these embodiments have been described in considerable detail, it is not the intention of the applicants to restrict or in any other way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Thus, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. In particular, any of the blocks of the above flowcharts may be deleted, augmented, made to be simultaneous with another, combined, or be otherwise altered in accordance with the principles of the invention. Accordingly, departures may be made from such details without departing from the spirit or scope of applicants' general inventive concept.
This application claims priority to U.S. Provisional Application Ser. No. 61/380,048 filed by Albert Cunningham and John Trent on Sep. 3, 2010, and entitled “HYBRID FRAGMENT-LIGAND MODELING FOR CLASSIFYING CHEMICAL COMPOUNDS,” which application is incorporated by reference in its entirety.
The invention was made with Government support under National Institutes of Health contract No. P20 RR018733. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
61380048 | Sep 2010 | US