The present invention relates to similarity search generally and to molecular similarity search in particular.
One of the mainstays of the drug industry is small-molecule drugs. Pharmaceutical researchers search for molecules that will, for example, inhibit an enzyme or activate a receptor in a desired way. Using artificial intelligence (AI) for molecular property prediction is known.
Drug makers use molecular similarity search to try to predict properties such as solubility—how well a molecule can dissolve into the blood or enter the membrane of a cell; toxicity—the degree to which a molecule can damage an organism; and blood-brain barrier (BBB) permeability—whether or not the molecule enters the brain. After first screening a molecule for structure, researchers employ deep learning techniques to find molecules whose desired properties are similar to those of known molecules.
Researchers utilize neural networks, which are mathematical models—in this case, convolutional neural networks (CNNs) or graphical convolution networks (GCNs)—to recognize the properties of molecules. These may be implemented on software platforms such as RDKit, DeepChem and others.
Reference is now made to
An input vector Vi, representing the structure and atomic features of a molecule, as described in detail hereinbelow, enters GCN 1 at input layer 2 and traverses hidden layers 3, and an output vector Vo exits GCN 1 at output layer 4.
There are two main modes of operating a GCN: training mode and operational mode (which includes testing, verification and regular use of GCN 1). During training, input vectors Vi, whose output values Vo are known, are put through GCN 1. The nodes 6, weights W, connections 7 and other features of GCN 1, explained further hereinbelow, are adjusted, for example via a cross entropy loss, so that when Vi traverses GCN 1, GCN 1 transforms Vi to equal the known value of Vo at output layer 4. Training a GCN to perform accurate transformations is a complex task, as is known in the art.
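By way of a non-limiting illustration of this training mode, the following Python sketch adjusts the weights of a small stand-in network via a cross-entropy-style loss so that its output approaches the known Vo; the dense layers, sizes and random data here are placeholders for illustration, not the GCN of the present invention.

import torch
import torch.nn as nn

# Stand-in model and data; dense layers are placeholders for graph convolutions.
model = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 12))
loss_fn = nn.BCEWithLogitsLoss()  # cross-entropy-style loss for a multi-label output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
training_pairs = [(torch.randn(75), torch.randint(0, 2, (12,)).float())
                  for _ in range(8)]  # (Vi, known Vo) pairs

for vi, vo_known in training_pairs:
    optimizer.zero_grad()
    vo_pred = model(vi)                # Vi traverses the network
    loss = loss_fn(vo_pred, vo_known)  # distance of the output from the known Vo
    loss.backward()                    # gradients with respect to weights W
    optimizer.step()                   # adjust weights so the output approaches Vo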
Once a GCN is trained, another set of input vectors is used to test and verify that the GCN transformation is reliable and accurate. This set of test input vectors, again with known output values, is passed through GCN 1, and the actual Vo results are compared against the known Vo values. If the results are acceptable, the GCN is considered trained. Once trained, the GCN may be used to predict the output of unknown query vectors.
Researchers strive to create the perfect transformation model, within a GCN, that will generate a desired output for a given input. For example, structural and atomic properties of a molecule, called features, may be input to a GCN, and the toxicological properties of such a molecule may be predicted at the output. As known by those in the art, during the training phase of a GCN, various deep learning techniques are used to refine the GCN. These techniques include, but are not limited to, neighbor feature aggregation layers, normalization layers, pooling layers, non-linear transformation layers, readout layers, and others. Current GCN techniques are described in the website publication Deep Learning, at http://www.deeplearningbook.org; in the article “SimGNN: A Neural Network Approach to Fast Graph Similarity Computation”, published by ACM, 2019; and in “Semi-Supervised Classification with Graph Convolutional Networks”, published by ICLR, 2017.
Using the toxicology example mentioned hereinabove, the U.S. Environmental Protection Agency, the U.S. National Toxicology Program, the U.S. National Center for Advancing Translational Sciences, and the U.S. Food and Drug Administration formed the Tox21 Consortium, which created the Tox21 molecular property dataset. The Tox21 dataset comprises a database of over 12,000 molecules used to train, validate and test GCNs. Training molecules have a known set of 12 toxicological properties that are used by GCN 1 during training to self-adjust nodes 6, connections 7, weights W and the other GCN features mentioned hereinabove, so that the GCN learns to output the correct 12-bit Tox21 property set for a given input molecule.
The Tox21 dataset has sets of input vectors with known output vectors that can be used to train GCN 1. Other sets of vectors are included in the dataset for testing and verification; in total, there are about 12,000 vectors available. The training molecule set is chosen to reflect the range of input types used with GCN 1. Likewise, the validation vectors are a set of molecules that tests the full breadth of the performance of the GCN but is not used during training. Finally, when GCN 1 has been tested and validated, unknown molecular vectors are input to GCN 1 and their Tox21 properties are predicted at output layer 4.
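By way of illustration, this train/validate/test workflow may be sketched with DeepChem, which ships the Tox21 dataset pre-split; the featurizer, epoch count and metric below are assumptions made for the sketch, not requirements of the present invention.

import numpy as np
import deepchem as dc

# Load Tox21 pre-split into training, validation and test sets.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="GraphConv")
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="classification")
model.fit(train, nb_epoch=10)                          # training mode
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid, [metric], transformers))   # validation of the trained model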
Reference is now made to
The output vector Vo is a 12-bit binary vector representing the Tox21 molecular properties 13 of the molecule. These 12 properties are divided into a 7-bit ‘nuclear receptor panel’ of seven toxicological properties: (1) estrogen receptor alpha, LBD (ER, LBD); (2) estrogen receptor alpha, full (ER, full); (3) aromatase; (4) aryl hydrocarbon receptor (AhR); (5) androgen receptor, full (AR, full); (6) androgen receptor, LBD (AR, LBD); (7) peroxisome proliferator-activated receptor gamma (PPAR-gamma); and a 5-bit ‘stress response panel’ of five toxicological properties: (8) nuclear factor (erythroid-derived 2)-like 2/antioxidant responsive element (Nrf2/ARE); (9) heat shock factor response element (HSE); (10) ATAD5; (11) mitochondrial membrane potential (MMP); (12) p53.
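For reference, one plausible encoding of this 12-bit property vector follows, with assay names following DeepChem's Tox21 task naming (NR = nuclear receptor, SR = stress response); the ordering mirrors the listing above and is illustrative only.

# The 12 Tox21 assays, grouped as the two panels described above.
TOX21_TASKS = [
    # 7-bit nuclear receptor panel
    "NR-ER-LBD", "NR-ER", "NR-Aromatase", "NR-AhR",
    "NR-AR", "NR-AR-LBD", "NR-PPAR-gamma",
    # 5-bit stress response panel
    "SR-ARE", "SR-HSE", "SR-ATAD5", "SR-MMP", "SR-p53",
]
assert len(TOX21_TASKS) == 12  # one bit of the output vector Vo per assay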
There is provided, in accordance with a preferred embodiment of the present invention, a method for finding molecules similar to a query molecule. The method includes transforming query atomic feature set (AFS) vectors and candidate AFS vectors into query property feature set (PFS) embedding vectors and candidate PFS embedding vectors, utilizing a GCN that has been trained to output a molecular property vector from an input query or input candidate molecular vector, respectively. The method also includes extracting the query and candidate PFS embedding vectors from hidden layers of the trained GCN, calculating a compensated similarity metric (CSM) for at least one pair of the query PFS embedding vector and one candidate PFS embedding vector, and selecting only such candidate molecular vectors which have a value of the CSM above a pre-defined threshold value.
Moreover, in accordance with a preferred embodiment of the present invention, the CSM attempts to compensate for inaccuracies caused by a varying position of atomic feature sets at the input layer of the trained GCN.
Further, in accordance with a preferred embodiment of the present invention, calculating includes, for each candidate PFS embedding vector, summing all possible combinations of dot products between property feature sets in the query PFS embedding vector and property feature sets in the candidate PFS embedding vector, and normalizing the dot product sum by dividing it by the number of property feature sets in the candidate PFS embedding vector.
Still further, in accordance with a preferred embodiment of the present invention, the trained GCN includes an input layer, four hidden layers and an output layer.
Additionally, in accordance with a preferred embodiment of the present invention, each PFS embedding vector includes a plurality of property feature sets.
Moreover, in accordance with a preferred embodiment of the present invention, the trained GCN is trained to predict one of the following properties: solubility, blood brain barrier or toxicity.
Further, in accordance with a preferred embodiment of the present invention, extracting query and candidate PFS embedding vectors is performed at the output of the fourth hidden layer.
Still further, in accordance with a preferred embodiment of the present invention, the candidate AFS vectors are vectors used to train the GCN.
Additionally, in accordance with a preferred embodiment of the present invention, adjusting the predefined threshold value changes the number of candidate molecular vectors deemed similar to the query molecular vector.
There is also provided, in accordance with a preferred embodiment of the present invention, a system for finding molecules similar to a query molecule. The system includes a GCN, a PFS vector extractor, a compensated vector comparator (CVC), and a candidate vector selector. The GCN has been trained to output a molecular property vector from an input query or input candidate molecular vector, respectively. The GCN transforms query atomic feature set (AFS) vectors and candidate AFS vectors into query property feature set (PFS) embedding vectors and candidate PFS embedding vectors. The PFS vector extractor extracts query PFS embedding vectors and candidate PFS embedding vectors from hidden layers of the trained GCN. The compensated vector comparator (CVC) calculates a compensated similarity metric (CSM) for a pair of one query PFS embedding vector and one candidate PFS embedding vector. The candidate vector selector selects only such candidate molecular vectors which have a value of the CSM above a pre-defined threshold value.
Additionally, in accordance with a preferred embodiment of the present invention, the compensated vector comparator (CVC) attempts to compensate for inaccuracies caused by a varying position of atomic feature sets at the input layer of the trained GCN.
Further, in accordance with a preferred embodiment of the present invention, the CVC includes a dot product summer and a dot product sum (DPS) normalizer. The dot product summer sums, for each candidate PFS embedding vector, all possible combinations of dot products between property feature sets in the query PFS embedding vector and property feature sets in the candidate PFS embedding vector. The DPS normalizer normalizes the DPS by dividing it by the number of property feature sets in the candidate PFS embedding vector, for each candidate PFS embedding vector.
Still further, in accordance with a preferred embodiment of the present invention, the candidate vector selector changes the value of the predefined threshold value in order to change the number of candidate molecular vectors deemed similar to the query molecular vector.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
Applicant has realized that, in a toxicologically trained graphical convolution network (GCN), as the input vector, comprising a plurality of atomic feature sets (AFS), traverses from the input layer and through a plurality of hidden layers, its AFS data are transformed into toxicological feature set (TFS) data, before being further transformed into the toxicology property vector at the output layer.
Applicant has realized that this is not only true for toxicology, but also for other molecular properties such as blood brain barrier (BBB), solubility and other properties. In such GCNs that are trained according to a particular molecular property, as input vectors traverse the GCN, AFS data is transformed to property feature sets (PFS) before being further transformed into the appropriate property vector at the output layer. The present application uses toxicology as an example.
Applicant has also realized that, rather than use the toxicology output vector from such toxicological GCNs, TFS embedding vectors may be extracted from within the hidden layers of the GCN and used outside of the GCN to mathematically compare their toxicological properties with other extracted TFS embedding vectors.
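One plausible mechanism for such extraction is a forward hook on a hidden layer, sketched below in PyTorch on a stand-in model; the framework and the dense stand-in layers are assumptions for illustration only.

import torch
import torch.nn as nn

# Stand-in for a trained GCN; dense layers substitute for graph convolutions.
model = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 128))
captured = {}

def hook(module, inputs, output):
    captured["tfs"] = output.detach()          # the embedding leaving this hidden layer

handle = model[2].register_forward_hook(hook)  # attach to the chosen hidden layer
_ = model(torch.randn(75))                     # forward pass; the hook fires
tfs_embedding = captured["tfs"]                # usable outside the network for comparison
handle.remove()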
Applicant has realized that the order that atoms are presented to the input layer of a GCN may affect the output accuracy. For example, a water molecule AFS vector having two hydrogen atoms and one oxygen atom may be presented to the GCN input layer as H—H—O, H—O—H or O—H—H.
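This ordering effect may be reproduced with RDKit, for example by renumbering the atoms of a water molecule so that the identical structure is presented in a different order; the renumbering chosen below is illustrative.

from rdkit import Chem

mol = Chem.AddHs(Chem.MolFromSmiles("O"))       # water with explicit hydrogens: O-H-H
print([a.GetSymbol() for a in mol.GetAtoms()])  # ['O', 'H', 'H']

hoh = Chem.RenumberAtoms(mol, [1, 0, 2])        # the same molecule presented as H-O-H
print([a.GetSymbol() for a in hoh.GetAtoms()])  # ['H', 'O', 'H']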
Reference is now made to
Any GCN may be utilized; one example follows. Reference is now made to
At the first hidden layer 32, the effects of the feature sets of first-degree neighboring atoms are also calculated: at the first node, H—O is included; at the second node, H—O—H; and at the third node, O—H. At the third hidden layer 32, the secondary neighbors are included, which are H—O—H on the first node and H—O—H on the third node, and at the fourth hidden layer 32, the tertiary neighbors are included. There are no tertiary neighbors in the H2O example, but in the Tox21 dataset each molecule has about 20 atoms, and the neighboring atoms may have a greater effect on the calculation.
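The widening receptive field described above can be shown with a toy numpy calculation (an illustrative assumption, not the patent's aggregation rule): multiplying node features by an adjacency-with-self-loops matrix once mixes in first-degree neighbors, and a second multiplication reaches second-degree neighbors.

import numpy as np

# Water presented as H-O-H: atoms 0 (H), 1 (O), 2 (H); bonds 0-1 and 1-2.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)  # adjacency matrix with self-loops
X = np.eye(3)                           # one-hot stand-in feature set per atom

h1 = A @ X   # after one layer: node 0 sees H-O, node 1 sees H-O-H, node 2 sees O-H
h2 = A @ h1  # after two layers: every node's receptive field covers the molecule
print((h1 > 0).astype(int))
print((h2 > 0).astype(int))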
As mentioned hereinabove, there are many deep learning techniques that are applied within a GCN to improve its performance and accuracy. In the preferred embodiment of the present invention, on the output of the first hidden layer 32 there are: a non-linear transformation (NLT) layer 36 containing 128 ReLUs; a dropout layer 38 set to 0.1; a batch normalization layer 40; and a graph pooling layer 42 set to max-pool over the feature vectors of an atom and its neighbors in the bond graph. On the output of the second hidden layer 32 there are: an NLT layer 36 containing 128 ReLUs; a dropout layer 38 set to 0.1; and a batch normalization layer 40. On the output of the third hidden layer 32 there are: an NLT layer 36 containing 128 ReLUs and a batch normalization layer 40. On the output of the fourth hidden layer 32 there are: an NLT layer 36 containing 128 ReLUs; a batch normalization layer 40; a graph pooling layer 42; a dense layer 44; another batch normalization layer 40; a graph gather layer 46; and a Softmax layer 48.
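This stack closely resembles DeepChem's GraphConvModel, so a hedged reconstruction of the configuration may read as below; the exact correspondence of its internal layers to GCN 16 is an assumption.

import deepchem as dc

model = dc.models.GraphConvModel(
    n_tasks=12,                              # the 12 Tox21 properties
    graph_conv_layers=[128, 128, 128, 128],  # four hidden layers of 128 units each
    dense_layer_size=128,                    # the dense layer 44
    dropout=0.1,                             # the dropout layers 38
    mode="classification",                   # readout ends in a per-task softmax
)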
It will be appreciated that the specific techniques employed, the number of layers and the number of nodes in GCN 16 may vary and are presented here as examples of configuring a neural network.
Applicant has realized that vectors in the Tox21 dataset may be used not only for training GCNs, but also to produce candidate TFS embedding vectors cTFS,i with which to compare a query TFS embedding vector qTFS.
Returning to
Reference is briefly made to
Applicant has realized that the arrangement of atomic feature sets in input vectors VAFS may also affect the arrangement of toxicity feature sets in TFS embedding vectors VTFS. Applicant has also realized that calculations performed on TFS embedding vectors VTFS need to compensate for the effects of such TFS arrangements in TFS embedding vectors VTFS. Applicant has realized that, in the toxicology example, by using the normalized sum of TFS dot products between embedding vector pairs as a metric, such positioning effects are minimized and a more accurate similarity metric for vector pairs can be calculated.
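A small numpy check of this realization follows: the sum of all pairwise dot products between feature sets equals the dot product of the row sums, so permuting the rows (atom positions) of either embedding leaves the metric unchanged; the shapes used are illustrative.

import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 128))   # query embedding with 2 feature sets
c = rng.normal(size=(3, 128))   # candidate embedding with 3 feature sets

dps = sum(float(qi @ cj) for qi in q for cj in c)
dps_permuted = sum(float(qi @ cj) for qi in q for cj in c[[2, 0, 1]])
assert np.isclose(dps, dps_permuted)  # the ordering of feature sets does not matter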
Reference is now made to
Reference is now made to
DPS(qTFS, cTFS,i) = [TFSq1·TFSc1] + [TFSq1·TFSc2] + [TFSq1·TFSc3] + [TFSq2·TFSc1] + [TFSq2·TFSc2] + [TFSq2·TFSc3]   equation (1)
Dot product sum normalizer 52 then completes the CSM calculation by normalizing DPS(qTFS, cTFS,i), dividing it by the number of atoms t in the candidate vector cTFS,i (which in the example is 3), as shown in equation (2):
MCVC,i = Normalized DPS(qTFS, cTFS,i) = [DPS(qTFS, cTFS,i)]/t   equation (2)
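A minimal sketch of this CSM calculation, implementing equations (1) and (2) directly, follows; the array shapes and the thresholding shown in the comments are illustrative assumptions.

import numpy as np

def compensated_similarity_metric(q_tfs: np.ndarray, c_tfs: np.ndarray) -> float:
    """q_tfs: (n_q, d) query feature sets; c_tfs: (t, d) candidate feature sets."""
    dps = float(np.sum(q_tfs @ c_tfs.T))  # equation (1): all dot product combinations
    return dps / c_tfs.shape[0]           # equation (2): normalize by t

# Selecting similar candidates, as described below, might then read:
#   scores = {i: compensated_similarity_metric(q, c) for i, c in candidates.items()}
#   similar = [i for i, m in scores.items() if m > threshold]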
CVC 24 then stores each MCVC,i for each TFS query-candidate pair qTFS−cTFS,i in CSM database 26. MCVC,i is then used by candidate vector selector 28 as a score by which it selects only those candidate vectors cAFS,i with a score over a candidate score threshold. Those candidates with a score over such a threshold are deemed similar to query vector qAFS.
It should be noted that the embodiments described hereinabove may be implemented on any suitable computing device. All databases may be implemented as individual databases or sections of a single database. Extracted TFS embedding vectors may be used for any calculation, not only similarity metrics as shown hereinabove. TFS embedding vectors may be extracted from GCNs trained with any training vector set, not only toxicity vectors as shown hereinabove.
Applicant has also realized that by enabling candidate vector selector 28 to adjust the threshold score by which candidates are deemed similar, users have the flexibility to adjust the size of the candidate pool, without having to retrain the neural network.
Applicant has also realized that the calculations can be implemented as simple Boolean functions and performed in parallel on all candidate vectors simultaneously on associative memory arrays, such as the Gemini Associative Processing Unit (APU), commercially available from GSI Technology Inc. of the USA.
As mentioned hereinabove, such a GCN could be trained using any molecular property, such as solubility, BBB or other properties. Reference is now made to
Unless specifically stated otherwise, as apparent from the preceding discussions, it is appreciated that, throughout the specification, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a general purpose computer of any type, such as a client/server system, mobile computing devices, smart appliances, cloud computing units or similar electronic computing devices that manipulate and/or transform data within the computing system's registers and/or memories into other data within the computing system's memories, registers or other such information storage, transmission or display devices.
Embodiments of the present invention may include apparatus for performing the operations herein. This apparatus may be specially constructed for the desired purposes, or it may comprise a computing device or system typically having at least one processor and at least one memory, selectively activated or reconfigured by a computer program stored in the computer. The resultant apparatus when instructed by software may turn the general-purpose computer into inventive elements as discussed herein. The instructions may define the inventive device in operation with the computer platform for which it is desired. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk, including optical disks, magnetic-optical disks, read-only memories (ROMs), volatile and non-volatile memories, random access memories (RAMs), electrically programmable read-only memories (EPROMs), electrically erasable and programmable read only memories (EEPROMs), magnetic or optical cards, Flash memory, disk-on-key or any other type of media suitable for storing electronic instructions and capable of being coupled to a computer system bus. The computer readable storage medium may also be implemented in cloud storage.
Some general-purpose computers may comprise at least one communication element to enable communication with a data network and/or a mobile communications network.
The processes and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the desired method. The desired structure for a variety of these systems will appear from the description below. In addition, embodiments of the present invention are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
This application claims priority from U.S. provisional patent applications 62/989,937, filed Mar. 16, 2020, and 63/150,597, filed Feb. 18, 2021, both of which are incorporated herein by reference.