The outbreak of the novel coronavirus disease, COVID-19, caused by the new coronavirus 2019-nCoV that is now officially designated as severe acute respiratory syndrome-related coronavirus SARS-CoV-2, represents a pandemic threat to global public health. Since the outbreak of COVID-19, this new disease and its causative virus have drawn major global attention. Scientists and physicians have been trying to understand this new emergent disease and its epidemiology in an effort to uncover possible treatment regimens, discover effective therapeutic agents, and develop vaccines.
There is an immediate need for effective treatment to contain the spread of this pandemic. Given the time and resources required to develop new compounds to treat COVID-19 and emerging viral diseases, it is not feasible to rely completely on the traditional process of compound discovery, which takes an average of 15 years and costs $2-3 billion to bring a new compound to market. A more pragmatic approach is drug repurposing: more specifically, using novel in-silico techniques to accurately identify a set of candidate compounds that can exhibit high activity against viral proteins and potentially inhibit them.
Identification of targets is important for identifying drugs with high target specificity and/or uncovering existing drugs that could be repurposed to treat SARS-CoV-2 infection. Since SARS-CoV-2 is a newly discovered pathogen, no specific drugs have been identified or are currently available. Genomic sequence information coupled with protein structure modeling could accelerate the identification of existing drugs with therapeutic potential for COVID-19.
Accordingly, there is a need for a research tool that can accurately identify a set of candidate compounds which can exhibit high activity against viral proteins and potentially inhibit them.
The present disclosure provides a new and innovative method for predicting activity value for compound-viral protein interactions. The method uses data-driven machine learning models based on a simplistic representation of compounds (simplified molecular-input line-entry system (SMILES) strings or Morgan Fingerprints) and viral protein sequence (amino acid (AA) sequence) to accurately predict activity value for compound-viral protein interactions. The method may further use two-dimensional images of compounds and physicochemical and structural properties of proteins to strengthen the model. An aim of the provided method is to accurately identify a set of candidate compounds which can exhibit high activity against viral proteins and potentially inhibit them using novel in-silico techniques.
The present disclosure provides a method for predicting an activity value for compound-viral protein interaction, employing a consensus framework of in-silico embedding-based modeling techniques, which use different combinations of representations for compounds and viral proteins including: Morgan Fingerprints (MFP) as chemoinformatic descriptors of compounds + a convolutional neural network (CNN) autoencoder based vector representation for viral protein sequences; a teacher forcing-long short term memory neural network (TF-LSTM) autoencoder based vector representation for compounds + a CNN autoencoder based vector representation for viral proteins; and canonical SMILES based sequential representation of compounds + primary structure (linear chain of amino acids) based sequential representation of viral proteins.
The present disclosure encompasses several advantages over existing models predicting compound-protein binding affinity, such as using already collected information from other viruses to infer virus-specific compound activity when new viruses emerge, which reduces costs and saves time. Additionally, unlike the most commonly used AI prediction methods, the present disclosure avoids training deep learning models on human protein sequences (e.g., kinases, nuclear receptors, G-protein-coupled receptors) that are significantly different from viral protein sequences. Finally, the present disclosure provides the ability to collect information about the primary structure (e.g., linear chain of amino acids) for proteins associated with viruses, instead of using molecular docking, which requires high-quality three-dimensional crystal structures of the protein of interest as well as annotation information about the presence of active sites. Therefore, the present disclosure provides a prediction model for compound-viral protein activity that is cost-effective and time-efficient.
The present disclosure provides a method for predicting an activity value for compound-viral protein interaction by employing a consensus framework of in-silico embedding-based modeling techniques. The proposed method uses a machine learning representation framework that uses deep learning-induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model, in turn, uses a consensus framework to rank approved compounds against viral proteins of interest. This prediction model allows the cost- and time-efficient identification of compounds for the treatment of emerging viral infections, such as COVID-19.
In
In an embodiment, for data collection of compounds 105, a dataset representing ≈2.5 million simplified molecular-input line-entry system (SMILES) strings for compounds is collected. This dataset is then filtered to remove salts and stereochemical information. The final compound set S consists of approximately 2.5 million canonical SMILES sequences 145 for small molecules. To train the traditional supervised machine learning (ML) algorithms, the set S is used to train a TF-LSTM based autoencoder, which generates a low dimensional vector representation (LSc) for each compound (e.g., compound vector representations 115). In addition, traditional cheminformatic descriptors such as Morgan Fingerprints (MFP) 155, derived from compound structures, are used as an alternative vector representation for each compound.
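The teacher-forcing idea behind the TF-LSTM autoencoder can be illustrated with a minimal sketch of how decoder inputs and targets are typically paired for a SMILES string. The special tokens, the character-level tokenizer, and the function names are assumptions made for illustration; the description does not specify a vocabulary or tokenization scheme.

```python
# Hypothetical special tokens; the description does not define a vocabulary.
START, END, PAD = "<s>", "</s>", "<pad>"


def tokenize_smiles(smiles):
    """Character-level tokenizer that keeps the two-letter halogens Cl and Br together."""
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def teacher_forcing_pair(smiles, max_len):
    """Return (decoder_input, decoder_target), both padded to max_len.

    With teacher forcing the decoder is fed the true previous token at each
    step, so the input is the target sequence shifted right by one position.
    """
    tokens = tokenize_smiles(smiles)[: max_len - 1]
    decoder_input = [START] + tokens
    decoder_target = tokens + [END]
    padding = [PAD] * (max_len - len(decoder_input))
    return decoder_input + padding, decoder_target + padding


# Chloroethane as a toy example.
dec_in, dec_out = teacher_forcing_pair("CCCl", max_len=6)
print(dec_in)
print(dec_out)
```

In an autoencoder of this kind, the encoder's final hidden state would serve as the low dimensional vector LSc; that step requires a trained network and is not shown here.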
In an embodiment, for data collection of viral proteins, the viral protein sequences 125 available in UniProt 170, comprising a total of approximately 2.7 million protein sequences, are downloaded. Among these, approximately ten thousand are deposited in SwissProt 150 (which are manually curated and functionally annotated), whereas the remaining protein sequences are obtained from TrEMBL 160 and are not well-curated. The viral protein sequences 125 are filtered to keep sequences with length L≤2000, resulting in a set V of approximately 99% of all viral proteins available in UniProt 170. The set V is used to train a CNN based autoencoder, which then generates the required low dimensional representation (LSv) for each viral protein sequence.
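The length filter and a simple integer encoding of amino acid sequences can be sketched as follows. Only the L≤2000 cutoff comes from the description; the index scheme (0 for padding, a single index for non-standard residues) and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Assumed scheme: index 0 is padding, 1..20 are standard residues,
# and 21 stands for any non-standard or ambiguous residue.
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS) + 1


def filter_by_length(sequences, max_len=2000):
    """Keep only sequences whose length L satisfies L <= max_len."""
    return [s for s in sequences if len(s) <= max_len]


def encode(sequence, max_len=2000):
    """Map residues to integer ids and right-pad with zeros to max_len."""
    ids = [AA_INDEX.get(aa, UNKNOWN) for aa in sequence]
    return ids + [0] * (max_len - len(ids))


raw = ["MKV", "M" * 2001, "MXA"]  # toy sequences, not real proteins
kept = filter_by_length(raw)
encoded = [encode(s, max_len=5) for s in kept]
print(kept)
print(encoded)
```

Fixed-length integer sequences of this form are a common input format for an embedding layer followed by convolutional layers.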
In the data gathering shown in
The compound-viral protein activities compare viral proteins and the small molecules that target these viral proteins, to identify the bioactivities between the compounds and the viral proteins. The bioactivities can include measurements such as IC50, EC50, AC50, Ki, Kd, Potency, and other standard potency measures derived from dose-response assays at different concentrations, designed to measure activation or inhibition of targets and pathways of pharmacological significance.
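Potency measures such as IC50 or Ki are reported in different concentration units, so a common scale is needed before they can serve as regression targets. One widely used convention is the pChEMBL scale, the negative base-10 logarithm of the molar value. The sketch below assumes that convention; the description reports predicted pChEMBL values but does not state how the raw measurements are normalized, and the unit table and function name are illustrative.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def negative_log_molar(value, unit):
    """Return -log10 of a potency value expressed in mol/L.

    Larger results mean higher potency (a lower concentration is needed).
    """
    if value <= 0:
        raise ValueError("potency values must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


print(round(negative_log_molar(100, "nM"), 3))  # 100 nM
print(round(negative_log_molar(1, "uM"), 3))    # 1 uM
```

On this scale a tenfold drop in the measured concentration raises the value by one unit, which keeps the regression target in a narrow numeric range.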
These models may use different inputs and reach conclusions in different ways, and the consensus between the models ensures accuracy in the prediction generated under the consensus framework 200. The compound-viral protein activity prediction for each model can be posed as a regression task to learn a mapping function g that receives joint compound and viral protein representations (xc, xv) and outputs the activity value ycv for that pair. With l defined as the model-specific loss function, the regression task reduces to estimating the parameters w that minimize Formula 1.
$$\min_{w} \sum_{c,v} l\bigl(y_{cv},\, g(x_c, x_v, w)\bigr)$$ Formula 1
The mapping function g is an ML method, including a Generalized Linear Model (GLM) 260, an XGBoost 250 model, or a Support Vector Machine (SVM) 270 model, and l is the squared loss function. In these models, xc may be passed to a TF-LSTM or Morgan fingerprint generator, and xv is passed to a CNN, to generate numeric vector representations LSc (for compounds) and LSv (for viral proteins), which are used in the ML models to estimate activity values according to Formula 2.
$$\hat{y}_{cv} = g(LS_c, LS_v, w)$$ Formula 2
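Formulas 1 and 2 can be made concrete with a small sketch in which g is a plain linear model on the concatenated embeddings and l is the squared loss. The linear form stands in for the GLM case only, and the gradient-descent fit, learning rate, and toy data are assumptions chosen for illustration, not the training procedure of the disclosure.

```python
def predict(ls_c, ls_v, w):
    """g(LSc, LSv, w): linear model on the concatenated embeddings plus a bias."""
    x = ls_c + ls_v + [1.0]
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(pairs, w):
    """Formula 1 with squared loss: sum over (c, v) of (y_cv - g(...))^2."""
    return sum((y - predict(c, v, w)) ** 2 for c, v, y in pairs)


def fit(pairs, lr=0.05, steps=2000):
    """Minimize the mean squared loss by batch gradient descent."""
    dim = len(pairs[0][0]) + len(pairs[0][1]) + 1
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for c, v, y in pairs:
            x = c + v + [1.0]
            err = predict(c, v, w) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj
        w = [wj - lr * gj / len(pairs) for wj, gj in zip(w, grad)]
    return w


# Toy one-dimensional embeddings with activity y = 5 + 1*LSc + 2*LSv.
pairs = [
    ([0.0], [0.0], 5.0),
    ([1.0], [0.0], 6.0),
    ([0.0], [1.0], 7.0),
    ([1.0], [1.0], 8.0),
]
w = fit(pairs)
print([round(wi, 3) for wi in w], round(objective(pairs, w), 6))
```

Swapping `predict` for a tree ensemble or a kernel machine changes g but leaves the objective in Formula 1 unchanged.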
In additional examples, the LSTM 210, GAT-CNN 230, and CNN-LSTM 240 models may receive inputs of the SMILES sequences 145 and the protein AA sequences 165, while the CNN 220 receives inputs from the protein AA sequences 165 and Morgan Fingerprints 155, while the XGBoost 250, GLM 260, SVM 270, and RF 280 models receive inputs from the Morgan fingerprints 155, SMILES embedding 135, and the protein embeddings 175 gathered, filtered, and processed as explained in relation to
In an embodiment, four end-to-end deep learning models are built for the regression problem, where the mapping function g is a CNN, LSTM, CNN-LSTM, or GAT-CNN. These models work directly on the compound (xc) and viral protein (xv) representations, unlike traditional ML techniques.
The LSTM model 210 includes two LSTM encoders: one based on the compound representation (xc) and another based on the viral protein representation (xv). The compound LSTM encoder generates the hidden state vector (hc), while the viral protein encoder generates the hidden state vector (hv). The two hidden vectors are then concatenated together (h). Multiple feed-forward layers are then layered on top of h, the last of which is connected to the output unit representing the activity value. Due to the availability of memory units, the LSTM encoders capture both short-term and long-term dependencies in the SMILES strings and viral protein sequences, and the feed-forward layers encapsulate the co-occurrence of such patterns driving the activity value to be high or low for a given compound-viral protein combination.
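The data flow of the two-encoder model can be summarized in compact notation. The symbols h_c, h_v and h follow the description; the names LSTM_c, LSTM_v and FFN are shorthand introduced here for the two encoders and the stack of feed-forward layers, and are not taken from the source.

```latex
\begin{aligned}
h_c &= \mathrm{LSTM}_c(x_c), \qquad h_v = \mathrm{LSTM}_v(x_v),\\
h &= [\,h_c \,;\, h_v\,],\\
\hat{y}_{cv} &= \mathrm{FFN}(h).
\end{aligned}
```

Here [ · ; · ] denotes vector concatenation, and the final layer of FFN has a single output unit for the activity value.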
The CNN 220 model comprises two CNN encoders. For the compound and protein CNN encoders, each of the compound (xc) and viral protein (xv) representations is passed through an embedding layer (e(·)) to generate a compound embedding matrix and a viral protein embedding matrix, respectively. A single convolutional layer with multiple filter sizes, k∈K={3,6,9,12}, is applied on top of each embedding matrix, followed by a max-pooling operation, to generate a hidden state vector for small molecules as well as for viral protein sequences. The hidden state vector hc for compounds and hv for viral protein sequences are then concatenated together (h) and are considered as the output of the CNN encoders. Multiple feed-forward layers are then layered on top of h, which are ultimately connected to the output unit corresponding to the activity value. The CNN encoders can capture contiguous sequences in the SMILES representations and k-mers in the viral protein sequence, whereas the feed-forward layers capture the co-occurrence of such patterns that drive the activity value to be either high or low based on the training set Dtrain. Non-linear activations are used at every layer, and the model architecture is optimized with respect to hyper-parameters such as filter sizes and learning rate.
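The encoder step described above (embedding lookup, one convolutional layer with several filter sizes, then max pooling) can be sketched in plain Python. The filter sizes come from the description; the embedding width, the number of filters per size, the ReLU activation, and the random weights are assumptions, and a practical model would use a deep learning library with trained weights.

```python
import random

FILTER_SIZES = (3, 6, 9, 12)  # K from the description


def conv_max_pool(embedded, kernel):
    """Slide one filter along the sequence and keep the largest ReLU activation.

    embedded: list of token embeddings, each a list of d floats.
    kernel:   list of k weight rows, each a list of d floats.
    Returns 0.0 when the sequence is shorter than the filter.
    """
    k = len(kernel)
    best = 0.0  # a ReLU output is never below zero
    for start in range(len(embedded) - k + 1):
        window = embedded[start:start + k]
        act = sum(w * x
                  for w_row, x_row in zip(kernel, window)
                  for w, x in zip(w_row, x_row))
        best = max(best, act)
    return best


def encode(token_ids, embedding, kernels_by_size):
    """Embed the tokens, then concatenate the pooled output of every filter."""
    embedded = [embedding[t] for t in token_ids]
    return [conv_max_pool(embedded, kernel)
            for k in FILTER_SIZES
            for kernel in kernels_by_size[k]]


rng = random.Random(0)
d, vocab_size, filters_per_size = 4, 22, 2  # assumed toy dimensions
embedding = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(vocab_size)]
kernels_by_size = {
    k: [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
        for _ in range(filters_per_size)]
    for k in FILTER_SIZES
}
h_v = encode([11, 9, 18, 1, 5, 3, 7, 2, 4, 6, 8, 10, 12, 14], embedding, kernels_by_size)
print(len(h_v))
```

Because each filter contributes one pooled value, the hidden state vector has a fixed length regardless of sequence length, which is what allows hc and hv to be concatenated and passed to the feed-forward layers.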
The CNN-LSTM 240 model is a combination of the CNN 220 and the LSTM 210 models. By combining the CNN 220 and LSTM 210 models, the CNN-LSTM 240 model can capture spatially contiguous as well as long-term dependencies in the SMILES strings and viral protein sequences. The outputs of the encoders are concatenated together to generate the hidden representation h, which is passed to multiple feed-forward layers and is ultimately connected to the output layer consisting of one unit for the activity value.
The Graph Attention Networks-Convolutional Neural Networks (GAT-CNN) 230 model is composed of two parts: graph attention networks and convolutional neural networks. For a given compound, the compound structure can be represented as a graph consisting of the atoms (nodes) in the compound, connected by edges if a bond exists between a pair of atoms. In various embodiments, the RDKit package, taking SMILES strings as input, may be used to convert a compound structure to a graph representation. Furthermore, RDKit can be used to extract different atom features such as the atom's degree, the total number of hydrogens, the number of hydrogens with the number of bonded neighbors, the atom's status as aromatic or not, the implicit value of atoms, and the atom symbol. These features can be used as node properties for atoms. In various embodiments, the atoms can include 78 such features from the SMILES strings. Given the graph-based representation of a compound molecule (xc) along with the extracted node features, the GAT portion of the GAT-CNN 230 model learns an embedding representation for a compound encapsulating the topological information available in the graph of each compound. The second component of the GAT-CNN 230 architecture is a CNN which takes protein AA sequences as an input. This component is composed of an embedding layer and multiple convolutional layers. At each convolutional layer, a non-linear activation function is applied and is followed by a max-pooling operator. The CNN portion learns a protein embedding (hv) and concatenates it with the SMILES embedding (hc) generated by the GAT portion to produce h, which is then passed to feed-forward layers. The output layer provides the value corresponding to the compound activity.
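The graph input consumed by the GAT portion can be illustrated with a hand-built example. In practice the atoms and bonds would be obtained by parsing a SMILES string with RDKit; here they are written out by hand so that the sketch has no dependencies. The shortened symbol vocabulary and the 16-dimensional feature layout are assumptions and do not reproduce the 78 features mentioned in the description.

```python
SYMBOLS = ["C", "N", "O", "S", "other"]  # hypothetical, shortened vocabulary
COUNTS = [0, 1, 2, 3, 4]                 # buckets for degree and hydrogen count


def one_hot(value, choices):
    """One-hot encode value; anything not listed falls into the last slot."""
    idx = choices.index(value) if value in choices else len(choices) - 1
    return [1.0 if i == idx else 0.0 for i in range(len(choices))]


def build_graph(atoms, bonds):
    """Build node features and an adjacency list for one molecule.

    atoms: list of (symbol, num_hydrogens, is_aromatic), one per heavy atom.
    bonds: list of (i, j) index pairs, one per bond.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    features = []
    for i, (symbol, num_h, aromatic) in enumerate(atoms):
        degree = len(neighbors[i])
        features.append(
            one_hot(symbol, SYMBOLS)
            + one_hot(degree, COUNTS)
            + one_hot(num_h, COUNTS)
            + [1.0 if aromatic else 0.0]
        )
    return features, neighbors


# Ethanol, SMILES "CCO": heavy atoms C0-C1-O2.
atoms = [("C", 3, False), ("C", 2, False), ("O", 1, False)]
bonds = [(0, 1), (1, 2)]
features, neighbors = build_graph(atoms, bonds)
print(neighbors)
print(len(features[0]))
```

A graph attention layer would then update each node's vector as an attention-weighted combination of its neighbors' vectors, and a pooling step over all nodes would yield the compound embedding hc.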
The consensus framework 200 averages the output activity estimates from the N top performing models to arrive at a consensus value for the predicted compound-viral protein activity 290, and thereby learns from different combinations of non-linear patterns in diverse representations of the input data.
At block 320, a consensus framework collects chemical data for a plurality of compounds to examine for administration to a human to treat the virus and the identified proteins/AA sequences of the virus to treat in the human. In some examples, the consensus model collects the chemical data from a variety of different sources, and standardizes the chemical representations to provide a set of two-dimensional vector embeddings for various candidate compounds at various dosages and the virus AA sequences. In some examples, the chemical data include simplified molecular-input line-entry system (SMILES) representations of a plurality of compounds and viral protein representations for the viral protein amino acid sequences.
At block 330, the consensus framework estimates the compound-viral protein activities between each compound of the plurality of compounds and the viral protein amino acid sequences according to a plurality of machine learning models. In various examples, the consensus framework includes a plurality of different ML models constructed according to a corresponding plurality of different ML architectures, including: a Generalized Linear Model (GLM), Random Forests (RF), XGBoost, Support Vector Machines (SVM), Convolutional neural network (CNN), Long Short Term Memory (LSTM), CNN-LSTM, and Graph Attention Network (GAT)-CNN.
The consensus framework evaluates the outputs of each of the ML models to select the N top performing models to arrive at a consensus value for the predicted compound-viral protein activity by averaging the outputs of those N models. The consensus framework identifies the N top performing models from among the plurality of models based on their performance with respect to a predefined number P (e.g., P=4) of evaluation metrics on the test set, which can include the mean absolute error, mean squared error, Pearson correlation R, and the coefficient of determination for each model.
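The selection and averaging steps can be sketched as follows, using the four metrics named above. The description does not say how the P metrics are combined into a single ordering, so the sum-of-ranks rule used here is an assumption, as are the model names and toy predictions.

```python
import math


def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def mse(y, p):
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)


def pearson_r(y, p):
    my, mp = sum(y) / len(y), sum(p) / len(p)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    var_y = sum((a - my) ** 2 for a in y)
    var_p = sum((b - mp) ** 2 for b in p)
    return cov / math.sqrt(var_y * var_p)


def r_squared(y, p):
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rank_models(y_true, predictions):
    """Order model names from best to worst by summed rank over the four metrics."""
    # Negate R and R^2 so that lower is better for every entry.
    scores = {
        name: (mae(y_true, p), mse(y_true, p), -pearson_r(y_true, p), -r_squared(y_true, p))
        for name, p in predictions.items()
    }
    names = list(predictions)
    total_rank = {name: 0 for name in names}
    for m in range(4):
        for rank, name in enumerate(sorted(names, key=lambda n: scores[n][m])):
            total_rank[name] += rank
    return sorted(names, key=lambda n: total_rank[n])


def consensus(predictions, top_models):
    """Average the outputs of the selected models, item by item."""
    length = len(predictions[top_models[0]])
    return [sum(predictions[m][i] for m in top_models) / len(top_models)
            for i in range(length)]


y_test = [5.0, 6.0, 7.0, 8.0]
test_predictions = {  # hypothetical model outputs on a held-out test set
    "model_a": [5.1, 5.9, 7.2, 7.8],
    "model_b": [5.5, 6.5, 6.5, 8.5],
    "model_c": [8.0, 7.0, 6.0, 5.0],
}
top_n = rank_models(y_test, test_predictions)[:2]
consensus_values = consensus(test_predictions, top_n)
print(top_n, [round(v, 2) for v in consensus_values])
```

In use, the ranking would be computed once on the test set and the same N models would then be averaged on new compound-protein pairs.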
At block 340, the consensus model identifies the M best compounds according to the estimated compound-viral protein activities. When evaluating a plurality of proteins, the M best compounds may be evaluated based on a threshold value for any one of the proteins or a composite value of two or more proteins relative to each compound. In some examples, the candidate compounds may be evaluated at different doses, where the models evaluate the compounds at different concentrations or dosages, which may be used to identify a therapeutically effective dose when the compound is used in vivo after being evaluated in silico.
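Selecting the M best compounds across several proteins can be sketched as below. The description allows either a per-protein threshold or a composite value; this sketch applies an optional threshold on any single protein and then ranks by the mean predicted activity, which is one possible composite chosen for illustration. The compound names and scores are placeholders, not results.

```python
def top_m_compounds(predicted, m, threshold=None):
    """Return the names of the m best compounds.

    predicted: {compound: {protein: predicted activity}}.
    A compound qualifies if any single protein reaches `threshold` (when one
    is given); qualifying compounds are ranked by mean activity across proteins.
    """
    def composite(scores):
        return sum(scores.values()) / len(scores)

    candidates = [
        (name, scores)
        for name, scores in predicted.items()
        if threshold is None or max(scores.values()) >= threshold
    ]
    ranked = sorted(candidates, key=lambda item: composite(item[1]), reverse=True)
    return [name for name, _ in ranked[:m]]


predicted = {  # placeholder values for illustration only
    "compound_a": {"PL-Pro": 7.0, "3CL-Pro": 6.0, "Spike": 6.5},
    "compound_b": {"PL-Pro": 5.0, "3CL-Pro": 5.5, "Spike": 5.2},
    "compound_c": {"PL-Pro": 8.0, "3CL-Pro": 4.0, "Spike": 4.5},
}
print(top_m_compounds(predicted, m=2, threshold=6.0))
print(top_m_compounds(predicted, m=3))
```

Other composites, such as the minimum across proteins, would favor compounds with consistently high predicted activity over those with a single strong target.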
For example, when evaluating the PL-Pro, 3CL-Pro, and Spike Protein of SARS-CoV-2, the consensus framework may output the M=47 best compounds according to Table 1, wherein the Predicted pChEMBL is the output of the consensus model and the binding energy (in kcal/mol) is obtained via a molecular docking experiment. Table 1 includes examples produced with high activities (e.g., predicted pChEMBL values), and may be further refined to select certain compounds that have low binding energies, that are not already being examined or used to treat the virus in question, or that are available for use in trials or in human treatment (e.g., for other conditions).
At block 350, a user (or the consensus framework) selects a certain compound for administration or further clinical trials to treat the virus in a human. For example, Rifabutin may be selected for further trials or to treat SARS-CoV-2 based on having a consistently low binding energy across the three proteins analyzed according to the examples shown in Table 1. Additionally or alternatively, Rifabutin may be selected for further trials or to treat SARS-CoV-2 based on having a lowest binding energy for any of the individual proteins under analysis. In another example, LM 565 may be selected for further trials or for treating SARS-CoV-2 based on having very high values for PL-Pro and 3CL-Pro, potentially indicating that LM 565 is a false positive for inclusion in the list displayed in Table 1.
At block 360, a user administers a therapeutically effective dose of the certain compound selected per block to a human as part of a clinical trial of the compound in treating the virus, or to directly treat the virus in question. Data from the in vivo use of the compound to treat the virus may be collected and fed back into the data sets used by the consensus framework, and may be used as training data for the various ML models used by the consensus model to improve the efficacy and accuracy of the ML models in identifying different compounds for treating the virus in question in the future, or for identifying a compound to use in treating a different virus (including mutations or variants of the virus in question) in the future.
Without further elaboration, it is believed that one skilled in the art can use the preceding description to utilize the claimed inventions to their fullest extent. The examples and aspects disclosed herein are to be construed as merely illustrative and not a limitation of the scope of the present disclosure in any way. It will be apparent to those having skill in the art that changes may be made to the details of the above-described examples without departing from the underlying principles discussed. In other words, various modifications and improvements of the examples specifically disclosed in the description above are within the scope of the appended claims. For instance, any suitable combination of features of the various examples described is contemplated.
The present disclosure claims the benefit of U.S. Provisional Patent Application 63/193,845 titled “A MODELLING FRAMEWORK FOR EMBEDDING-BASED PREDICTIONS FOR COMPOUND-VIRAL PROTEIN ACTIVITY”, filed on May 27, 2021, and which is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
63193845 | May 2021 | US