This disclosure relates to the field of computer technology, and particularly relates to methods and apparatuses for training a prediction model and predicting data, a computer device and a storage medium.
With the development of artificial intelligence technology, the affinity between a compound and a targeted protein can be predicted by a machine learning algorithm. At present, a model established by the machine learning algorithm is used for predicting the change of affinity between the targeted protein and the compound after the mutation, and then determining whether the targeted protein has drug resistance to the compound, so as to provide a reference for doctors to use drugs. However, the prediction model established by the machine learning algorithm has the problems of low accuracy and poor generalization ability.
In view of the above technical problems, this disclosure provides methods and apparatuses for training a prediction model and predicting data, a computer device and a storage medium, aiming at improving the prediction model training accuracy so as to improve the prediction accuracy.
A method for training a prediction model includes:
obtaining a training sample set, the training sample set comprising training samples, training sample weights corresponding to the training samples and target energy characteristics corresponding to the training samples, a training sample comprising wild type protein information, mutant type protein information and compound information, the target energy characteristics being obtained based on wild type energy characteristics and mutant type energy characteristics, the wild type energy characteristics being obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information, and the mutant type energy characteristics being obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information;
determining a current training sample from the training sample set based on the training sample weights;
inputting current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training to obtain a basic prediction model after completing the basic training;
updating the training sample weights corresponding to the training samples based on the basic prediction model; and
returning to perform the operation of determining the current training sample from the training sample set based on the updated training sample weights until completing model training to obtain a target prediction model.
An apparatus for training a prediction model includes a memory operable to store computer-readable instructions and a processor circuitry operable to read the computer-readable instructions. When executing the computer-readable instructions, the processor circuitry is configured to:
obtain a training sample set, the training sample set comprising training samples, training sample weights corresponding to the training samples and target energy characteristics corresponding to the training samples, a training sample comprising wild type protein information, mutant type protein information and compound information, the target energy characteristics being obtained based on wild type energy characteristics and mutant type energy characteristics, the wild type energy characteristics being obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information, and the mutant type energy characteristics being obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information;
determine a current training sample from the training sample set based on the training sample weights;
input current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training to obtain a basic prediction model after completing the basic training;
update the training sample weights corresponding to the training samples based on the basic prediction model; and
return to perform the operation of determining the current training sample from the training sample set based on the updated training sample weights until completing model training to obtain a target prediction model.
According to the method and apparatus for training a prediction model, the training sample set is obtained, the training sample set including the training samples, the training sample weights corresponding to the training samples and the target energy characteristics corresponding to the training samples, and each training sample including the wild type protein information, the mutant type protein information and the compound information. The current training sample is determined from the training sample set based on the training sample weights; the current target energy characteristics corresponding to the current training sample are inputted into the pre-trained prediction model for basic training, and the basic prediction model is obtained after completing the basic training; the training sample weights corresponding to the training samples are updated based on the basic prediction model; and the operation of determining the current training sample from the training sample set based on the updated training sample weights is performed again until model training is completed to obtain the target prediction model, the target prediction model being configured to predict the interaction state information corresponding to the inputted protein information and the inputted compound information. That is, the training sample weights are continuously updated during iteration, and the current training sample is determined from the training sample set by using the training sample weights, so that the quality of the training samples used for training can be ensured; the prediction model is then trained by using the current training sample, so that the prediction accuracy and generalization ability of the target prediction model obtained by training can be improved.
A method for predicting data includes:
obtaining original data, the original data comprising original wild type protein information, original mutant type protein information and original compound information;
performing binding energy characteristic extraction based on the original wild type protein information and the original compound information to obtain original wild type energy characteristics;
performing binding energy characteristic extraction based on the original mutant type protein information and the original compound information to obtain original mutant type energy characteristics;
determining original target energy characteristics based on the original wild type energy characteristics and the original mutant type energy characteristics;
inputting the original target energy characteristics into a target prediction model for prediction to obtain interaction state information, the target prediction model being obtained by:
obtaining a training sample set;
determining a current training sample from the training sample set based on a training sample weight;
inputting current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training to obtain a basic prediction model after completing the basic training;
updating the training sample weight based on the basic prediction model; and
returning to perform the operation of determining the current training sample from the training sample set based on the training sample weight until completing the model training.
According to the method and apparatus for predicting data, the original data is obtained, the original target energy characteristics are determined, and the original target energy characteristics are inputted into the target prediction model for prediction to obtain the interaction state information. The target prediction model is obtained by: obtaining the training sample set; determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for basic training, and obtaining the basic prediction model after completing the basic training; updating the training sample weights corresponding to the training samples based on the basic prediction model; and returning to perform the operation of determining the current training sample from the training sample set based on the training sample weights until the model training is completed. That is, the interaction state information is obtained through prediction by the target prediction model, and because the target prediction model obtained by such training has improved prediction accuracy, the accuracy of the obtained interaction state information can be improved.
To describe the technical solutions in embodiments of this disclosure more clearly, the following briefly describes the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of this disclosure, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative efforts.
To make the objectives, technical solutions, and advantages of this disclosure clearer and more understandable, this disclosure is further described in detail below with reference to accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are only used for explaining this disclosure, and are not used for limiting this disclosure.
A method for training a prediction model provided in this disclosure may be applied to an application environment shown in
In one embodiment, as shown in
Step 202: Obtain a training sample set, the training sample set including training samples, training sample weights corresponding to the training samples and the target energy characteristics corresponding to the training samples, the training sample including wild type protein information, mutant type protein information and compound information, the target energy characteristics being obtained based on wild type energy characteristics and mutant type energy characteristics, the wild type energy characteristics being obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information, and the mutant type energy characteristics being obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information.
The protein refers to a targeted protein, such as a protein kinase. The compound refers to a drug capable of interacting with the targeted protein, such as a tyrosine kinase inhibitor. The protein information is used for characterizing specific information of the targeted protein, and may include a protein structure, protein physicochemical properties, etc. The wild type protein information refers to information of a protein obtained from nature, namely a protein that has not been artificially mutated. The mutant type protein information refers to information of the mutated protein, such as mutation of a drug structure. The compound information refers to information of a compound capable of interacting with the protein, and may include a structure of the compound, physicochemical properties of the compound, etc. The training sample weight refers to a weight corresponding to the training sample and is used for characterizing the quality of the corresponding training sample. A high-quality training sample can improve the training quality when a machine learning model is trained. The binding energy characteristics refer to characteristics obtained when the protein and the compound interact with each other. They are used for characterizing interaction energy information between the targeted protein and compound molecules, and may include structural characteristics, physicochemical property characteristics, energy characteristics, etc. The binding energy characteristics are characteristics obtained after characteristic selection. The wild type energy characteristics refer to binding energy characteristics obtained when the wild type protein and the compound interact with each other. The mutant type energy characteristics refer to binding energy characteristics obtained when the mutant type protein and the compound interact with each other.
The target energy characteristics are used for characterizing a difference between the mutant type energy characteristics and the wild type energy characteristics.
Specifically, the server may directly obtain the training sample set from the database. The training sample set includes training samples, training sample weights corresponding to the training samples and the target energy characteristics corresponding to training samples. The training sample includes the wild type protein information, the mutant type protein information and the compound information. The target energy characteristics are obtained based on the wild type energy characteristics and the mutant type energy characteristics. The wild type energy characteristics are obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information. The mutant type energy characteristics are obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information. The server may further collect training samples from the Internet, extract the target energy characteristics corresponding to the training samples, and initialize the training sample weights corresponding to the training samples. The server may further obtain the training sample set from a third-party server which provides data service, for example, may obtain the training sample set from a third-party cloud server.
In one embodiment, the server may obtain wild type protein information, mutant type protein information and compound information, perform binding energy characteristic extraction based on the wild type protein information and the compound information to obtain the wild type energy characteristics, perform binding energy characteristic extraction based on the mutant type protein information and the compound information to obtain the mutant type energy characteristics, and compute the difference between the wild type energy characteristics and the mutant type energy characteristics to obtain the target energy characteristics. Meanwhile, the server initializes the corresponding training sample weights, for example, by random initialization, zero initialization, Gaussian distribution initialization, etc.
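The difference computation and weight initialization described above can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes the energy characteristics have already been extracted as equal-length numeric vectors, it assumes a mutant-minus-wild-type sign convention (the text does not fix the direction of the difference), and the function names and sample values are hypothetical.

```python
from typing import List


def target_energy_features(wild: List[float], mutant: List[float]) -> List[float]:
    """Element-wise difference between mutant type and wild type features.

    Assumption: the difference is taken as mutant minus wild type.
    """
    if len(wild) != len(mutant):
        raise ValueError("feature vectors must have the same length")
    return [m - w for w, m in zip(wild, mutant)]


def init_sample_weights(n_samples: int, value: float = 1.0) -> List[float]:
    """Give every training sample the same starting weight (one possible choice)."""
    return [value] * n_samples


# Hypothetical feature values for one wild type / mutant type pair.
wild_features = [1.5, -7.25, 0.5]
mutant_features = [1.0, -6.0, 0.5]

delta = target_energy_features(wild_features, mutant_features)
weights = init_sample_weights(4)
```

Constant initialization is used here only for simplicity; the random or Gaussian initialization mentioned above could be substituted without changing the rest of the sketch.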
Step 204: Determine a current training sample from the training sample set based on the training sample weight.
The current training sample refers to a training sample used in current training.
Specifically, the server selects the training sample from the training sample set according to the training sample weights corresponding to the training samples to obtain the current training sample. For example, a training sample whose training sample weight is larger than a preset weight threshold may be treated as the current training sample. The preset weight threshold is a weight threshold set in advance. In a specific embodiment, the training sample weight may take the value 0 or 1, namely, the training sample weight corresponding to each training sample is initialized to be 0 or 1. When the training sample weight is 1, the corresponding training sample is treated as the current training sample. In one embodiment, the server may select multiple training samples from the training sample set according to the training sample weights to obtain a current training sample set. The current training sample set includes multiple training samples. The current training sample set is used for training the basic prediction model.
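The threshold-based selection described above can be sketched as follows. This is an illustrative example only; the function name, the default threshold of 0.5 and the sample weights are assumptions chosen so that binary weights of 0 and 1 behave as described.

```python
from typing import List, Sequence


def select_current_samples(weights: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of the samples whose weight exceeds the preset threshold."""
    return [index for index, weight in enumerate(weights) if weight > threshold]


# Hypothetical binary weights: samples 0, 2 and 3 are currently selected.
sample_weights = [1, 0, 1, 1, 0]
current_indices = select_current_samples(sample_weights)
```

Returning indices rather than copies of the samples keeps the selection cheap and lets the same weight list be reused when the weights are updated in a later iteration.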
Step 206: Input current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training, and obtain a basic prediction model after completing the basic training.
The current target energy characteristics refer to target energy characteristics corresponding to the current training sample. The pre-trained prediction model is a prediction model which is preliminarily trained in advance. The prediction model is established by using a random forest algorithm. The prediction model may be configured to predict the change of affinity between compounds and proteins before and after mutation. The basic prediction model is obtained by training through the corresponding current training samples under the condition that the training sample weight is kept unchanged.
Specifically, the server may input the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction to obtain a prediction result; compute a loss according to the prediction result; reversely update the pre-trained prediction model according to the loss; return to perform, in an iterative manner, the operation of inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction until a basic training completion condition is reached; and treat the prediction model reaching the basic training completion condition as the basic prediction model. The basic training completion condition refers to a condition for obtaining the basic prediction model, and includes that the training reaches a preset upper limit on the number of iterations, that the loss reaches a preset threshold, or that the parameters of the model no longer change, etc.
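The three basic training completion conditions can be illustrated with a small training loop. The sketch below is a stand-in, not the disclosed model: it replaces the random forest with a one-parameter linear model updated by gradient descent, purely so that the stopping logic is visible. The function name, learning rate, tolerances and data are all assumptions.

```python
from typing import List, Tuple


def basic_train(
    xs: List[float],
    ys: List[float],
    slope: float = 0.0,
    learning_rate: float = 0.05,
    max_iters: int = 1000,
    loss_threshold: float = 1e-6,
    param_tolerance: float = 1e-12,
) -> Tuple[float, str]:
    """Train y = slope * x and report which completion condition ended training."""
    n = len(xs)
    for _ in range(max_iters):
        residuals = [slope * x - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n  # mean square error
        if loss <= loss_threshold:
            return slope, "loss_threshold"
        gradient = sum(2.0 * r * x for r, x in zip(residuals, xs)) / n
        updated = slope - learning_rate * gradient
        if abs(updated - slope) <= param_tolerance:
            return updated, "parameters_unchanged"
        slope = updated
    return slope, "max_iterations"


# Hypothetical data generated by y = 2x, so training should approach slope 2.
trained_slope, stop_reason = basic_train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The sample weights are not touched inside this loop, which matches the description that basic training is carried out with the training sample weights kept unchanged.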
Step 208: Judge whether model training is completed or not, perform step 208a when model training is completed, otherwise, perform step 208b, and return to perform step 204.
Step 208a: Obtain a target prediction model, the target prediction model being configured to predict interaction state information corresponding to the inputted protein information and the inputted compound information.
Step 208b: Update the training sample weight corresponding to each training sample based on the basic prediction model, and return to perform the operation of determining the current training sample from the training sample set based on the training sample weight.
The completion of the model training refers to reaching of the condition for obtaining the target prediction model. The target prediction model refers to a model which is obtained by final training and configured to predict the interaction state information corresponding to the inputted protein information and the inputted compound information. The interaction state information is used for characterizing the change of binding free energy between the compounds and the proteins before and after mutation. The binding free energy refers to the interaction between a ligand and a receptor.
Specifically, after obtaining the basic prediction model, the server further judges whether the model training is completed. The model training completion condition may include that the number of iterations reaches a preset upper limit on the number of model training iterations. When the model training completion condition is not reached, the parameters of the basic prediction model are kept unchanged, and the training sample weight corresponding to each training sample is updated by using the basic prediction model. The target energy characteristics corresponding to each training sample may be inputted into the basic prediction model to obtain the loss corresponding to each training sample. The training sample weight corresponding to each training sample is updated according to the loss corresponding to that training sample. After the training sample weights are updated, the operation of determining the current training sample from the training sample set based on the training sample weights is performed again, in an iterative manner, until the model training completion condition is reached. The basic prediction model reaching the model training completion condition is treated as the target prediction model. The target prediction model is configured to predict the interaction state information corresponding to the inputted protein information and the inputted compound information.
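The alternation between basic training and weight updating can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the disclosed implementation: a one-feature least-squares fit stands in for the random forest model, the weight update keeps a sample when its squared error does not exceed a threshold, and the number of rounds serves as the model training completion condition. All names, the threshold and the data are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[float], float]


def fit_slope(xs: List[float], ys: List[float]) -> Model:
    """Stand-in learner: least-squares fit of y = slope * x (xs must not be all zero)."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x


def self_paced_train(
    features: List[float],
    labels: List[float],
    loss_threshold: float,
    rounds: int,
) -> Tuple[Optional[Model], List[int]]:
    """Alternate between training on selected samples and updating the weights."""
    weights = [1] * len(features)  # start with every sample selected
    model: Optional[Model] = None
    for _ in range(rounds):
        current = [i for i, w in enumerate(weights) if w == 1]
        if not current:
            break  # nothing left to train on
        # Basic training: weights fixed, model fitted on the current samples.
        model = fit_slope([features[i] for i in current], [labels[i] for i in current])
        # Weight update: model fixed, per-sample loss decides the new weight.
        losses = [(model(x) - y) ** 2 for x, y in zip(features, labels)]
        weights = [1 if loss <= loss_threshold else 0 for loss in losses]
    return model, weights


# Hypothetical data: y = 2x except for one outlying last sample.
model, final_weights = self_paced_train(
    [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 20.0], loss_threshold=25.0, rounds=3
)
```

In this example the first round fits all four samples, the outlier then receives weight 0, and later rounds fit only the three consistent samples.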
The prediction model training method includes: obtaining the training sample set, the training sample set including the training samples, the training sample weights corresponding to the training samples and the target energy characteristics corresponding to the training samples, the training sample including the wild type protein information, the mutant type protein information and the compound information; determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for basic training, and obtaining the basic prediction model after completing the basic training; and updating the training sample weights corresponding to the training samples based on the basic prediction model, and returning to perform the operation of determining the current training sample from the training sample set based on the training sample weights until completing model training to obtain the target prediction model, the target prediction model being configured to predict the interaction state information corresponding to the inputted protein information and the inputted compound information. That is, the training sample weights are continuously updated during iteration, and the current training sample is determined from the training sample set by using the training sample weights, so that the quality of the training samples used for training can be ensured; the prediction model is then trained by using the current training sample, so that the prediction accuracy and generalization ability of the target prediction model obtained by training can be improved.
In one embodiment, as shown in
Step 302: Obtain training samples, the training sample including the wild type protein information, the mutant type protein information and the compound information.
Step 304: Perform binding initial energy characteristic extraction based on the wild type protein information and the compound information to obtain wild type initial energy characteristics.
The binding initial energy characteristics refer to extracted unscreened characteristics, and may include non-physical model characteristics, physical and experience potential energy-based characteristics, etc. The non-physical model characteristics include crystal protein-compound structure characteristics, physicochemical property characteristics of ligands and residues, energy characteristics obtained by computing by an experience or descriptor-based scoring function, etc. The physical and experience potential energy-based characteristics refer to energy characteristics obtained by computing through a modeling program based on mixed physical and experience potential energy. The wild type initial energy characteristics refer to binding initial energy characteristics obtained when the wild type protein information and the compound information interact with each other.
Specifically, the server may obtain training samples from the database. The training samples may be the samples used in pre-training. The training samples may be the same as or different from the training samples in the training sample set. The server may further collect training samples from the Internet. The server may further obtain training samples from a server which provides data service. Each training sample includes the wild type protein information, the mutant type protein information and the compound information. Meanwhile, the server performs characteristic extraction on the training samples, namely, performs binding initial energy characteristic extraction based on the wild type protein information and the compound information to obtain the wild type initial energy characteristics corresponding to the training samples.
Step 306: Perform binding initial energy characteristic extraction based on the mutant type protein information and the compound information to obtain mutant type initial energy characteristics, and determine target initial energy characteristics corresponding to the training samples based on the wild type initial energy characteristics and the mutant type initial energy characteristics.
The mutant type initial energy characteristics refer to the binding initial energy characteristics obtained when the mutant type protein information and the compound information interact with each other. The target initial energy characteristics are used for characterizing the difference between the wild type initial energy characteristics and the mutant type initial energy characteristics.
Specifically, the server performs binding initial energy characteristic extraction based on the mutant type protein information and the compound information to obtain the mutant type initial energy characteristics, and computes the difference between the wild type initial energy characteristics and the mutant type initial energy characteristics. The difference is treated as the target initial energy characteristics. For example, the difference between structural characteristics may be computed and treated as the target structural characteristics. The difference between physicochemical property characteristics may also be computed and treated as the target physicochemical property characteristics.
Step 308: Input the target initial energy characteristics corresponding to the training samples into an initial prediction model for prediction to obtain initial interaction state information corresponding to the training samples, the initial prediction model being established by using a random forest algorithm.
The initial prediction model is a prediction model with initialized model parameters. The model parameters may be randomly initialized, may also be subjected to zero initialization, etc. The initial prediction model is established by using the random forest algorithm. A random forest is an ensemble model that trains on and predicts samples by using multiple decision trees. For example, an ExtraTree (extremely randomized trees) algorithm may be used for establishing the initial prediction model. The initial interaction state information refers to interaction state information obtained by predicting through the initial prediction model.
Specifically, the server establishes the initial prediction model with initialized model parameters through the random forest algorithm in advance, and then inputs the target initial energy characteristics corresponding to the training samples into the initial prediction model for prediction to obtain the outputted initial interaction state information corresponding to the training samples.
Step 310: Perform loss computation based on the initial interaction state information corresponding to the training samples and the interaction state tag corresponding to the training samples to obtain initial loss information corresponding to the training samples.
The interaction state tag refers to real interaction state information. Each training sample has a corresponding interaction state tag. The initial loss information is used for characterizing the error between the initial interaction state information and the interaction state tag.
Specifically, the server computes the loss between the initial interaction state information corresponding to the training samples and the interaction state tag by using a preset loss function to obtain the initial loss information corresponding to the training samples. The loss function may be a mean square error loss function, an average absolute value error loss function, etc.
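The two loss functions named above can be written out as a short sketch. The function names and the sample predictions and tags are hypothetical and serve only to show the computation.

```python
from typing import Sequence


def mean_squared_error(predicted: Sequence[float], tags: Sequence[float]) -> float:
    """Average of the squared differences between predictions and tags."""
    return sum((p - t) ** 2 for p, t in zip(predicted, tags)) / len(tags)


def mean_absolute_error(predicted: Sequence[float], tags: Sequence[float]) -> float:
    """Average of the absolute differences between predictions and tags."""
    return sum(abs(p - t) for p, t in zip(predicted, tags)) / len(tags)


# Hypothetical predicted and real interaction state values for three samples.
predicted_states = [1.0, 2.0, 4.0]
state_tags = [1.0, 3.0, 2.0]

mse = mean_squared_error(predicted_states, state_tags)
mae = mean_absolute_error(predicted_states, state_tags)
```

The mean square error penalizes large deviations more heavily than the average absolute value error, which may matter when the training samples contain outlying measurements.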
Step 312: Update the initial prediction model based on the initial loss information, and return to perform the operation of inputting the target initial energy characteristics corresponding to the training samples into the initial prediction model for prediction until completing the pre-training, thus obtaining the pre-trained prediction model and the characteristic importance corresponding to the target initial energy characteristics.
The completion of pre-training refers to reaching of the condition for obtaining the pre-trained prediction model, namely, the pre-training reaches the preset number of iterations, or the loss of pre-training reaches the preset threshold, or the prediction model parameters for pre-training no longer change. The characteristic importance is used for characterizing the importance degree of the target initial energy characteristics. The higher the characteristic importance is, the more important the corresponding characteristic is, and the greater its contribution to model training is.
Specifically, the server computes a gradient by using the initial loss information, and then reversely updates the initial prediction model with the gradient to obtain the updated prediction model. The server then judges whether the pre-training is completed. When the pre-training is not completed, the server treats the updated prediction model as the initial prediction model and returns to perform, in an iterative manner, the operation of inputting the target initial energy characteristics corresponding to the training samples into the initial prediction model for prediction until the pre-training is completed. The server treats the updated prediction model obtained by the last iteration as the pre-trained prediction model. Because the pre-trained prediction model is established by using the random forest algorithm, the characteristic importance corresponding to the target initial energy characteristics can be obtained directly after the training. Each characteristic in the target initial energy characteristics has a corresponding characteristic importance.
Step 316: Determine the training sample weights corresponding to the training samples based on the loss information corresponding to the training samples in a case of completing pre-training, and select the target energy characteristics from the target initial energy characteristics based on the characteristic importance.
Specifically, the server may determine the training sample weights corresponding to the training samples through the loss information corresponding to the training samples after completing the pre-training. For example, the server may compare the loss information corresponding to each training sample with the weight loss threshold. When the loss information is not greater than the weight loss threshold, the corresponding training sample is treated as a sample with good quality, and the corresponding training sample weight may be set to be 1. When the loss information is greater than the weight loss threshold, the corresponding training sample is treated as a sample with poor quality, and the corresponding training sample weight may be set to be 0. Characteristic selection is carried out on the target initial energy characteristics through the characteristic importance so as to obtain the target energy characteristics. The target energy characteristics are the characteristics to be extracted during further training of the pre-trained prediction model.
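Characteristic selection by importance can be sketched as follows. This is an illustration only: the text does not say whether a fixed number of characteristics or an importance cutoff is used, so a top-k rule is assumed here, and the characteristic names and importance values are hypothetical.

```python
from typing import Dict, List


def select_by_importance(importances: Dict[str, float], top_k: int) -> List[str]:
    """Return the names of the top_k characteristics, most important first."""
    ranked = sorted(importances, key=lambda name: importances[name], reverse=True)
    return ranked[:top_k]


def project(sample: Dict[str, float], selected: List[str]) -> List[float]:
    """Keep only the selected characteristics of one sample, in ranked order."""
    return [sample[name] for name in selected]


# Hypothetical importances reported by a pre-trained tree ensemble.
importances = {
    "vdw_delta": 0.45,
    "hbond_delta": 0.30,
    "sasa_delta": 0.05,
    "elec_delta": 0.20,
}
selected = select_by_importance(importances, top_k=2)

# Hypothetical target initial energy characteristics of one training sample.
sample = {"vdw_delta": -1.2, "hbond_delta": 0.4, "sasa_delta": 3.0, "elec_delta": -0.1}
target_features = project(sample, selected)
```

Reducing each sample to the selected characteristics shortens the input vector used in further training, which is one way the training efficiency mentioned in this embodiment could be gained.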
In the above embodiment, pre-training is performed through each training sample to obtain the pre-trained model, then the training sample weight corresponding to each training sample is determined based on the loss information corresponding to each training sample after completing pre-training, and characteristic selection is performed on the target initial energy characteristics based on the characteristic importance to obtain the target energy characteristics. Therefore, the training efficiency can be improved in further training, and the accuracy of the training can be ensured.
In one embodiment, as shown in
Step 402: Input the target initial energy characteristics corresponding to the training samples into the initial prediction model.
Step 404: Treat, by the initial prediction model, the target initial energy characteristics corresponding to the training samples as a current division set and compute initial characteristic importance corresponding to the target initial energy characteristics; determine initial division characteristics from the target initial energy characteristics based on the initial characteristic importance; divide the target initial energy characteristics corresponding to each training sample based on the initial division characteristics so as to obtain each division result, the division result including target initial energy characteristics corresponding to each division sample; and treat each division result as the current division set, and return to perform the operation of computing the initial characteristic importance corresponding to the target initial energy characteristics for iteration until completing division, thus obtaining initial interaction state information corresponding to each training sample.
The initial characteristic importance refers to characteristic importance corresponding to the target initial energy characteristics. The initial division characteristics refer to characteristics for decision tree division. The division result is obtained by dividing the target initial energy characteristics. The division sample refers to a training sample corresponding to the target initial energy characteristics in the division result.
Specifically, the server inputs the target initial energy characteristics corresponding to each training sample into the initial prediction model. The initial prediction model scores the input characteristics to obtain the initial characteristic importance corresponding to the target initial energy characteristics. The initial characteristic importance may be computed by using information gain, the information gain ratio, the Gini coefficient, the mean square error, etc. The initial division characteristics are determined from the target initial energy characteristics based on the initial characteristic importance, and the target initial energy characteristics corresponding to each training sample are divided based on the initial division characteristics. That is, the target initial energy characteristics exceeding the initial division characteristics are treated as one part, and the target initial energy characteristics not exceeding the initial division characteristics are treated as the other part, thus obtaining the division result. The division result includes the target initial energy characteristics corresponding to each division sample. Each division result is then treated as the current division set, and the operation of computing the initial characteristic importance corresponding to the target initial energy characteristics is performed again, iterating until the division is completed, thus obtaining the initial interaction state information corresponding to each training sample. The division is completed when no tree node can be divided further, that is, when each leaf node has only one corresponding target initial energy characteristic. The initial interaction state information refers to the interaction state information predicted by the initial prediction model.
In the above embodiment, the target initial energy characteristics corresponding to each training sample are inputted into the initial prediction model. The initial prediction model computes the initial characteristic importance corresponding to the target initial energy characteristics; determines the initial division characteristics from the target initial energy characteristics based on the initial characteristic importance; divides the target initial energy characteristics corresponding to each training sample based on the initial division characteristics to obtain each division result, the division result including the target initial energy characteristics corresponding to each division sample; and treats each division result as the current division set, and returns to perform the operation of computing the initial characteristic importance corresponding to the target initial energy characteristics for iteration until completing the division, thus obtaining the initial interaction state information corresponding to each training sample, and improving the accuracy of the obtained initial interaction state information.
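The iterative division described above can be illustrated with a small regression tree. The sketch below assumes the mean-square-error criterion (one of the options listed above) expressed as the summed squared error of the two parts, treats each division characteristic as a single column with a threshold, and uses hypothetical function names; it is not the implementation of the initial prediction model.

```python
import numpy as np

def best_split(X, y):
    """Return (characteristic index, threshold, score) with the lowest summed squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # The largest value is skipped because it would leave the "exceeding" part empty.
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if score < best[2]:
                best = (j, float(t), float(score))
    return best

def build_tree(X, y, min_samples=2):
    """Divide recursively until a node can no longer be divided."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(y) < min_samples or np.all(y == y[0]):
        return {"leaf": float(y.mean())}
    j, t, _ = best_split(X, y)
    if j is None:  # every characteristic is constant, so no division is possible
        return {"leaf": float(y.mean())}
    mask = X[:, j] <= t
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree(X[mask], y[mask], min_samples),
        "right": build_tree(X[~mask], y[~mask], min_samples),
    }

# Hypothetical data: four samples, two characteristics, binary-valued tags.
X_demo = [[0, 5], [1, 5], [2, 5], [3, 5]]
y_demo = [0, 0, 1, 1]
tree = build_tree(X_demo, y_demo)
```

A random forest would build many such trees on resampled data and average their outputs; only a single tree is shown here.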
In one embodiment, step 202 of obtaining a training sample set, the training sample set including the training sample weights corresponding to the training samples includes:
obtaining confidence corresponding to each training sample, and determining the training sample weight corresponding to each training sample based on the confidence.
The confidence is used for characterizing the quality of the corresponding training sample. The higher the confidence is, the higher the quality corresponding to the training sample is, and the better the performance of a model obtained by training with the training sample having high confidence is.
Specifically, the server may also obtain the confidence corresponding to each training sample while acquiring each training sample. Then, the confidence may be directly treated as the training sample weight corresponding to each training sample. The confidence may be manually set or obtained by carrying out confidence evaluation on each training sample in advance. In one embodiment, the confidence corresponding to each training sample may also be compared with the preset confidence threshold. In a case of exceeding the confidence threshold, the corresponding training sample weight is set to be 1, and this training sample is treated as the current training sample. In a case of not exceeding the confidence threshold, the corresponding training sample weight is set to be 0.
In the above embodiment, the confidence is obtained, and the training sample weight corresponding to each training sample is determined according to the confidence, thus improving the efficiency of obtaining the training sample weight.
In one embodiment, as shown in
Step 502: Perform binding energy characteristic extraction based on the wild type protein information and the compound information to obtain wild type energy characteristics.
Step 504: Perform binding energy characteristic extraction based on the mutant type protein information and the compound information to obtain mutant type energy characteristics.
The wild type energy characteristics include, but are not limited to, the wild type protein characteristics, the compound characteristics, and the energy characteristics obtained when the wild type protein information and the compound information interact with each other. The wild type protein characteristics are used for characterizing characteristics corresponding to the wild type protein information, including, but not limited to, the wild type protein structural characteristics and the wild type protein physicochemical property characteristics. The compound characteristics include, but are not limited to, the compound structural characteristics and the compound physicochemical property characteristics. The mutant type energy characteristics include, but are not limited to, the mutant type protein characteristics, the compound characteristics, and the energy characteristics obtained when the mutant type protein information and the compound information interact with each other. The mutant type protein characteristics are used for characterizing characteristics corresponding to the mutant type protein information, including, but not limited to, the mutant type protein structural characteristics and the mutant type protein physicochemical property characteristics.
Specifically, the server performs characteristic extraction through the wild type protein information and the compound information to extract the wild type protein characteristics and compound characteristics, extracts the energy characteristics obtained when the wild type protein and compounds interact with each other, and treats the wild type protein characteristics, the compound characteristics and the energy characteristics as the wild type energy characteristics. The server performs characteristic extraction through the mutant type protein information to obtain the mutant type protein characteristics, then extracts the energy characteristics obtained when the mutant type protein and the compounds interact with each other, and treats the extracted mutant type protein characteristics, the compound characteristics and the energy characteristics as the mutant type energy characteristics.
Step 506: Compute a difference between the wild type energy characteristics and the mutant type energy characteristics to obtain the target energy characteristics.
Specifically, the server computes the difference between the wild type energy characteristics and the mutant type energy characteristics, for example, the difference between the wild type protein characteristics and the mutant type protein characteristics, and computes the difference between the energy characteristics obtained when the wild type protein and the compounds interact with each other and the energy characteristics obtained when the mutant type protein and the compounds interact with each other so as to obtain the target energy characteristics. In a specific embodiment, a characteristic difference value between the wild type energy characteristics and the mutant type energy characteristics may be computed to obtain the target energy characteristics.
In the above embodiment, the wild type energy characteristics and the mutant type energy characteristics are extracted, and then the difference between the wild type energy characteristics and the mutant type energy characteristics is computed to obtain the target energy characteristics, thus improving the accuracy of the obtained target energy characteristics.
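Step 506 can be sketched as an element-wise subtraction. The sketch assumes that both characteristic vectors have the same layout and that the difference is taken as wild type minus mutant type; the direction of the subtraction, the function name and the numeric values are assumptions for illustration.

```python
import numpy as np

def target_energy_characteristics(wild_type, mutant_type):
    """Element-wise difference between wild type and mutant type energy characteristics."""
    wild = np.asarray(wild_type, dtype=float)
    mutant = np.asarray(mutant_type, dtype=float)
    if wild.shape != mutant.shape:
        raise ValueError("wild type and mutant type characteristics must have the same shape")
    return wild - mutant

# Hypothetical two-component characteristic vectors for one training sample.
delta = target_energy_characteristics([-7.5, 2.0], [-6.0, 2.0])
```

Components that are unchanged by the mutation, such as the compound characteristics, yield zero in the difference.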
In one embodiment, the wild type energy characteristics include first wild type energy characteristics and second wild type energy characteristics;
As shown in
Step 602: Perform binding energy characteristic extraction by a non-physical scoring function based on the wild type protein information and the compound information to obtain the first wild type energy characteristics.
The non-physical scoring function refers to an empirical or descriptor-based scoring function. Such a scoring function is based on prior assumptions or is fitted to experimental data in order to obtain the energy characteristics, and the obtained energy characteristics do not have an obvious, interpretable physical significance. The first wild type energy characteristics refer to the first part of the extracted energy characteristics.
Specifically, the server may perform binding energy characteristic extraction by using a preset non-physical scoring function: the server evaluates the wild type protein information and the compound information with the non-physical scoring function to obtain a computing result, and treats the computing result as the first wild type energy characteristics. A scoring function (a function for evaluating how reasonable a theoretically obtained receptor-ligand binding mode is) may be used for extracting the energy characteristics.
Step 604: Perform binding energy characteristic extraction by a physical function based on the wild type protein information and the compound information to obtain the second wild type energy characteristics.
The physical function refers to an energy function based on mixed physical and empirical potential energy, and the physical function has obvious physical significance. The family of such energy functions includes a force field function fitted to experimental data, a quantum-chemical computing function based on first principles, a solvent model based on a continuous medium, etc.
Specifically, the server performs binding energy characteristic extraction on the wild type protein information and the compound information through the preset physical function to obtain the second wild type energy characteristics. For example, the energy function of the modeling program Rosetta (a macromolecular modeling software library with Monte Carlo simulated annealing as its algorithmic core), which is based on mixed physical and empirical potential energy, may be used for computing the energy characteristics.
Step 606: Perform fusing based on the first wild type energy characteristics and the second wild type energy characteristics to obtain the wild type energy characteristics.
Specifically, the server computes a characteristic difference value between the first wild type energy characteristics and the second wild type energy characteristics to obtain the wild type energy characteristics.
In the above embodiment, the first wild type energy characteristics and the second wild type energy characteristics are extracted, and fusing is performed based on the first wild type energy characteristics and the second wild type energy characteristics to obtain the wild type energy characteristics. The first wild type energy characteristics and the second wild type energy characteristics can better characterize the interaction energy information between the wild type targeted protein and the compound molecules so that the accuracy of the obtained wild type energy characteristics is improved.
In one embodiment, the mutant type energy characteristics include first mutant type energy characteristics and second mutant type energy characteristics.
As shown in
Step 702: Perform binding energy characteristic extraction by a non-physical function based on the mutant type protein information and the compound information to obtain the first mutant type energy characteristics.
Step 704: Perform binding energy characteristic extraction by a physical function based on mutant type protein information and the compound information to obtain the second mutant type energy characteristics.
Specifically, the server performs binding energy characteristic extraction on the mutant type protein information and the compound information through the preset non-physical function to obtain the first mutant type energy characteristics, and then performs binding energy characteristic extraction on the mutant type protein information and the compound information through the preset physical function to obtain the second mutant type energy characteristics.
Step 706: Perform fusing based on the first mutant type energy characteristics and the second mutant type energy characteristics to obtain the mutant type energy characteristics.
Specifically, the server computes a characteristic difference value between the first mutant type energy characteristics and the second mutant type energy characteristics to obtain the mutant type energy characteristics.
In the above embodiment, the first mutant type energy characteristics and the second mutant type energy characteristics are extracted, and fusing is performed based on the first mutant type energy characteristics and the second mutant type energy characteristics to obtain the mutant type energy characteristics. The first mutant type energy characteristics and the second mutant type energy characteristics can better characterize the interaction energy information between the mutant targeted protein and the compound molecules so that the accuracy of the obtained mutant type energy characteristics is improved.
In one embodiment, as shown in
Step 802: Obtain protein family information, and divide the training sample set based on the protein family information to obtain training sample groups.
Proteins that have similar amino acid sequences in vivo and very similar structures and functions form a “protein family”, and members of the same protein family are called “homologous proteins”. The protein family information refers to the information of the protein family. A training sample group is obtained by gathering the training samples corresponding to the same protein family.
Specifically, the server directly obtains the protein family information from the database. The protein family information may also be obtained from the Internet or from a third-party server which provides the data service. In one embodiment, the server may further place training samples whose protein information has similar structures or sequences into the same training sample group to obtain the training sample groups.
Step 804: Select the current training sample from the training sample groups based on the training sample weight to obtain a current training sample set.
Specifically, the server selects the current training samples from the training sample groups by using the training sample weights. That is, within each training sample group the server selects training samples sequentially according to their training sample weights, and the samples selected from all the training sample groups together form the current training sample set.
Step 206 of inputting current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training, and obtaining a basic prediction model after completing the basic training includes the following step:
Step 806: Input the current target energy characteristics corresponding to each current training sample in the current training sample set into the pre-trained prediction model for basic training, and obtain a target basic prediction model after completing the basic training.
Specifically, the server inputs the current target energy characteristics corresponding to each current training sample in the current training sample set into the pre-trained prediction model for basic training, and obtains the target basic prediction model after completing the basic training.
In the above embodiment, the training sample set is divided according to the protein family information to obtain the training sample groups. Then, the current training sample is selected from the training sample groups based on the training sample weight to obtain the current training sample set, and thus the basic training is performed on the pre-trained prediction model through the current training sample set to obtain the basic prediction model. That is, the current training sample is selected from the training sample groups, and the selected training samples are distributed everywhere in the space rather than concentrated in a local area. Thus the global information contained in the training samples can be learned during model training, the comprehensiveness of the model in learning knowledge in the training process can be ensured, furthermore, the convergence speed in the model training process is improved, and the generalization ability of the model obtained by training is improved.
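Steps 802 and 804 can be sketched as a grouping step followed by a per-group selection. The dictionary layout of a sample, the family names and the `per_group` cap are hypothetical and serve only to illustrate that samples are drawn from every group rather than from a single one.

```python
from collections import defaultdict

def group_by_family(samples):
    """Gather the training samples that share the same protein family."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["family"]].append(sample)
    return dict(groups)

def select_current_samples(groups, per_group):
    """From each group, take up to per_group samples whose weight is 1."""
    current = []
    for members in groups.values():
        eligible = [s for s in members if s["weight"] == 1]
        current.extend(eligible[:per_group])
    return current

# Hypothetical training samples with binary weights.
samples = [
    {"id": "s1", "family": "kinase", "weight": 1},
    {"id": "s2", "family": "kinase", "weight": 1},
    {"id": "s3", "family": "protease", "weight": 0},
    {"id": "s4", "family": "protease", "weight": 1},
]
groups = group_by_family(samples)
current = select_current_samples(groups, per_group=1)
```

With `per_group=1` the current training sample set contains one sample from each family, which is the kind of spread across groups that the embodiment aims for.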
In a specific embodiment, the basic form of the pre-trained prediction model is shown in the following formula (1).
$n$ represents the total number of training samples; $X$ represents the training sample set, $X=(x_1,\ldots,x_n)\in\mathbb{R}^{n\times m}$, where $\mathbb{R}$ represents the set of real numbers and $m$ represents the number of energy characteristics. $x_i$ represents the $i$th training sample, and $y_i$ represents the interaction state tag corresponding to the $i$th training sample. $g$ represents the pre-trained prediction model; $w$ represents the model parameters; $L$ represents the loss function; and $v$ represents the training sample weights. $v=(v^{(1)},\ldots,v^{(b)})$, where $b$ represents the number of training sample groups, namely the training sample set is divided into $b$ groups $x^{(1)},\ldots,x^{(b)}$, where $x^{(j)}\in\mathbb{R}^{n_j\times m}$, $n_j$ represents the number of training samples in the $j$th training sample group, and $\sum_{j=1}^{b} n_j=n$. $v_1^{(j)}$ represents the training sample weight corresponding to the 1st training sample in the $j$th training sample group, and $v_i$ represents the $i$th training sample weight. $\lambda$ represents the parameter of training sample difficulty, namely, the training samples are selected sequentially from those which are easy to select (high confidence) to those which are difficult to select (low confidence). $\gamma$ represents the parameter of sample diversity, that is, the samples are selected from multiple training sample groups. $\|\cdot\|_1$ represents the $L_1$ norm, and $\|\cdot\|_{2,1}$ represents the $L_{2,1}$ norm. In $-\|v\|_{2,1}=-\sum_{j=1}^{b}\|v^{(j)}\|_2$, $b$ represents the number of training sample groups, and $v^{(j)}$ represents the training sample weights of the $j$th training sample group. That is, the negative $L_1$ norm tends to select the samples with high confidence, namely the samples with small result errors in training. The negative $L_{2,1}$ norm is beneficial to selecting the training samples from the multiple training sample groups and embedding diversity information into the prediction model.
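Formula (1) itself is not reproduced in this text. A form that is consistent with the symbol definitions above, assuming the usual objective of self-paced learning with diversity, would be the following. Here $\lambda$ denotes the difficulty parameter and $\gamma$ the diversity parameter, and the constraint $v\in[0,1]^n$ is an assumption.

```latex
\min_{w,\; v \in [0,1]^{n}} \; E(w, v; \lambda, \gamma)
  = \sum_{i=1}^{n} v_i \, L\bigl(y_i, g(x_i, w)\bigr)
  - \lambda \, \lVert v \rVert_{1}
  - \gamma \, \lVert v \rVert_{2,1},
\qquad
\lVert v \rVert_{2,1} = \sum_{j=1}^{b} \lVert v^{(j)} \rVert_{2}
```

The first term is the weighted training loss, the second term rewards selecting more samples with small loss, and the third term rewards spreading the selected samples over many training sample groups.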
In one embodiment, the selecting the current training sample from the training sample groups based on the training sample weight to obtain a current training sample set includes:
obtaining current learning parameters, and determining a number of selected samples and sample distribution based on the current learning parameters; and selecting the current training sample from the training sample groups according to the training sample weight based on the number of selected samples and the sample distribution to obtain a target current training sample set.
The current learning parameters refer to learning parameters used in the current training. The current learning parameters are used for controlling the selection of the current training sample. The number of selected samples refers to a quantity of training samples to be selected currently. The sample distribution refers to distribution of selected current training samples in the training sample groups. The target current training sample set refers to a set of current training samples selected using the current learning parameters.
Specifically, the server obtains the current learning parameters. The initial values of the current learning parameters may be set in advance. The server uses the current learning parameters to compute the number of samples to be selected and the sample distribution in the current training. Then, the current training samples are selected from the training sample groups according to the training sample weights, based on the number of selected samples and the sample distribution, so as to obtain the target current training sample set.
In the above embodiment, the current learning parameters are used for further controlling the selection of the training sample so as to obtain the target current training sample set, thus the accuracy of the selected training sample is improved, furthermore, the accuracy of the prediction model obtained by training is improved, and the generalization ability of the prediction model is improved.
In one embodiment, as shown in
Step 902: Input the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction to obtain current interaction state information.
The current interaction state information is used for characterizing the change of interaction between the compounds and the proteins before and after mutation in the current training sample obtained by predicting.
Specifically, the server directly treats the current target energy characteristics corresponding to the current training sample as an input of the pre-trained prediction model. The pre-trained prediction model performs prediction according to the inputted current target energy characteristics and outputs a prediction result, namely the current interaction state information.
Step 904: Compute an error between the current interaction state information and the interaction state tag corresponding to the current training sample to obtain current loss information.
The current loss information refers to an error between a prediction result corresponding to the current training sample and a real result.
Specifically, the server obtains the interaction state tag corresponding to the current training sample. The interaction state tag may be set in advance. The interaction state tag may be the change of the interaction between the compounds and the proteins before and after the mutation measured by the experiment. Then the server computes the error between the current interaction state information and the interaction state tag corresponding to the current training sample by using the preset loss function so as to obtain the current loss information.
Step 906: Update the pre-trained prediction model based on the current loss information, and return to perform the operation of inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction to obtain the current interaction state information so as to obtain the basic prediction model after reaching basic training completion conditions.
Specifically, the server uses the current loss information to update the parameters in the pre-trained prediction model by back propagation through a gradient descent algorithm, and returns to the operation of inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction so as to obtain the current interaction state information, iterating until the preset number of basic training iterations is reached or the model parameters no longer change. The pre-trained prediction model obtained at the last iteration is treated as the basic prediction model.
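The loop of steps 902 to 906 can be sketched as weighted gradient descent in which the sample weights stay fixed and only the model parameters change. The sketch substitutes a linear model with a squared-error loss for the pre-trained prediction model, which is an assumption made purely to keep the example short; the function name, learning rate and stopping tolerance are likewise hypothetical.

```python
import numpy as np

def basic_training(X, y, v, w, lr=0.1, max_iters=500, tol=1e-10):
    """Gradient descent on the weighted squared error; v is held fixed throughout."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float).copy()
    for _ in range(max_iters):
        residual = X @ w - y                              # prediction minus interaction state tag
        grad = 2.0 * X.T @ (v * residual) / max(v.sum(), 1.0)
        new_w = w - lr * grad
        converged = np.max(np.abs(new_w - w)) < tol       # parameters no longer change
        w = new_w
        if converged:
            break
    return w

# Hypothetical data: the third sample has weight 0 and is therefore ignored.
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_demo = [2.0, 3.0, 100.0]
w_basic = basic_training(X_demo, y_demo, v=[1, 1, 0], w=[0.0, 0.0])
```

Because the third sample carries weight 0, its outlying tag has no influence on the fitted parameters, which approach (2, 3).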
In a specific embodiment, the optimization function corresponding to the pre-trained prediction model is shown as the following formula (2). The optimization function is a regression optimization function.
vi* represents that the training sample with the training sample weight exceeding the weight threshold is selected for training. For example, when the training sample weight only includes 0 and 1, the training sample with the training sample weight being 1 may be selected for training.
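Formula (2) is not reproduced in this text. Assuming it is the model-parameter subproblem obtained by holding the training sample weights fixed at $v^*$, a form consistent with the description would be:

```latex
w^{*} = \arg\min_{w} \sum_{i=1}^{n} v_i^{*} \, L\bigl(y_i, g(x_i, w)\bigr)
```

With binary weights, the sum reduces to the loss over the training samples whose weight is 1.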
In the above embodiment, the training sample weight is kept unchanged, then the current training sample is selected to train the pre-trained prediction model to obtain the basic prediction model, and thus the accuracy of the trained basic prediction model is improved.
In one embodiment, as shown in
Step 1002: Input the target energy characteristics corresponding to the training samples into the basic prediction model to obtain the basic interaction state information corresponding to the training samples.
The training samples refer to every training sample in the training sample set. The basic interaction state information refers to the interaction state information corresponding to every training sample, obtained by prediction through the basic prediction model. The interaction state information may be a relative difference value between the binding free energy of the wild type protein and the compounds and the binding free energy of the mutant type protein and the compounds.
Specifically, when the server obtains the basic prediction model by training, the parameters in the basic prediction model are kept unchanged, and the training sample weight corresponding to every training sample in the training sample set is updated. That is, the server inputs the target energy characteristic corresponding to the training samples into the basic prediction model to obtain the outputted basic interaction state information corresponding to the training samples.
Step 1004: Compute an error between the basic interaction state information corresponding to the training samples and the interaction state tag corresponding to the training samples to obtain the basic loss information.
The basic loss information refers to an error between a prediction result of the basic prediction model and a real result.
Specifically, the server computes the error of every training sample by using the preset loss function, namely computes the error between the basic interaction state information and the interaction state tag, thus obtaining the basic loss information corresponding to every training sample.
Step 1006: Update the training sample weight based on the basic loss information to obtain an updated sample weight corresponding to the training samples.
Specifically, the server updates every training sample weight by using the basic loss information corresponding to every training sample. The server may directly take the basic loss information corresponding to every training sample as the updated sample weight corresponding to every training sample.
In one embodiment, step 1006 of updating the training sample weight based on the basic loss information to obtain an updated sample weight corresponding to the training samples includes:
obtaining current learning parameters, and computing an update threshold based on the current learning parameters; comparing the update threshold with the basic loss information corresponding to each training sample to obtain a comparison result corresponding to each training sample; and determining the updated sample weight corresponding to each training sample according to the comparison result corresponding to each training sample.
The update threshold refers to a threshold for updating the training sample weight.
Specifically, the server obtains the current learning parameters and determines the update threshold by using the current learning parameters. The update threshold is compared with the basic loss information corresponding to each training sample. When the basic loss information exceeds the update threshold, it is indicated that the prediction error corresponding to the training sample is large, and the corresponding training sample weight is updated to be the first training sample weight. When the basic loss information does not exceed the update threshold, it is indicated that the error is small, and the corresponding training sample weight is updated to be the second training sample weight. Then, when the current training sample is selected, the training sample corresponding to the second training sample weight is selected as the current training sample.
In one embodiment, the current learning parameters include the diversity learning parameters and the difficulty learning parameters. The computing an update threshold based on the current learning parameters includes:
obtaining training sample groups, determining a current training sample group from the training sample groups, and computing a sample rank corresponding to the current training sample group; computing a weighted value based on the sample rank, and weighting the diversity learning parameters by using the weighted value to obtain a target weighted value; and computing a sum of the target weighted value and the difficulty learning parameters to obtain the update threshold.
The difficulty learning parameters refer to learning parameters for measuring the difficulty. The difficulty learning parameters are used for determining the confidence of the training samples selected in training. The diversity learning parameters are learning parameters for measuring the diversity. The diversity learning parameters are used for determining the distribution of the training samples selected during training in the training sample groups. The sample rank refers to a rank of the training samples in the current training sample group. The rank of a vector group refers to the number of vectors contained in a maximal linearly independent subset of the vector group. The current training sample group refers to the training sample group whose training sample weights currently need to be updated.
Specifically, the server obtains the training sample groups, determines the current training sample group from the training sample groups, and computes the sample rank corresponding to the current training sample group; computes a weighted value based on the sample rank, and weights the diversity learning parameters by using the weighted value to obtain a target weighted value; and computes a sum of the target weighted value and the difficulty learning parameters to obtain the update threshold corresponding to the current training sample group. In one specific embodiment, the training samples in each training sample group may be sorted in ascending order according to the basic loss information to obtain the sorted training sample groups. The current training sample group is determined from the sorted training sample groups, and the update threshold corresponding to the current training sample group is computed.
In a specific embodiment, the following formula (3) may be used for updating the training sample weight corresponding to the training sample.
In formula (3), α represents a rank in the jth training sample group, $g_{w^*}(x_i^{(j)})$ represents the predicted interaction state information corresponding to the ith training sample in the jth training sample group, and $y_i^{(j)}$ represents the real interaction state tag corresponding to the ith training sample in the jth training sample group.
The threshold term in formula (3) represents the computed update threshold. When the error corresponding to the ith training sample in the jth training sample group is smaller than the update threshold, the corresponding training sample weight is updated to 1. When the error corresponding to the ith training sample in the jth training sample group is greater than or equal to the update threshold, the corresponding training sample weight is updated to 0.
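The weight update described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the formula of this disclosure: the names `difficulty` and `diversity` stand for the two kinds of learning parameters, the rank of a sample is assumed to be its position within its group after sorting by loss in ascending order, and the rank weighting 1/(sqrt(a) + sqrt(a - 1)) is borrowed from self-paced learning with diversity because the text does not spell out how the weighted value is computed from the rank.

```python
import math


def update_threshold(difficulty, diversity, rank):
    """Threshold = difficulty parameter + rank-weighted diversity parameter."""
    # Assumed weighting (from self-paced learning with diversity); rank starts at 1.
    weighted_value = 1.0 / (math.sqrt(rank) + math.sqrt(rank - 1))
    return difficulty + diversity * weighted_value


def update_weights(group_losses, difficulty, diversity):
    """Map each group name to a list of 0/1 weights, one per sample.

    group_losses maps a group name to the per-sample errors of that group.
    """
    new_weights = {}
    for group, losses in group_losses.items():
        # Rank samples inside the group by ascending error.
        order = sorted(range(len(losses)), key=lambda i: losses[i])
        weights = [0] * len(losses)
        for rank, index in enumerate(order, start=1):
            # Weight 1 if the error is below the threshold, otherwise 0.
            if losses[index] < update_threshold(difficulty, diversity, rank):
                weights[index] = 1
        new_weights[group] = weights
    return new_weights


if __name__ == "__main__":
    # Hypothetical losses for two hypothetical protein families.
    losses = {"kinase": [0.2, 1.5, 0.6], "protease": [1.0]}
    print(update_weights(losses, difficulty=0.5, diversity=0.4))
```

Because the threshold shrinks as the rank grows, a group that already contributes many samples needs a lower error to contribute another one, which spreads the selection across groups.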
In the above embodiment, the sample weights are continuously updated, and the current training samples are reselected for training. Thus, training samples with relatively small error are used for training, the negative influence of training samples with relatively large error on the training process is avoided, and the accuracy of the target prediction model obtained by training is improved.
In one embodiment, after updating the training sample weight corresponding to the training samples based on the basic prediction model, the method further includes:
obtaining current learning parameters, updating the current learning parameters according to a preset increment to obtain updated learning parameters, and treating the updated learning parameters as the current learning parameters.
Specifically, the server may preset an updating condition for the current learning parameters, for example, an increment applied to the current learning parameters after each weight update. The current learning parameters are then updated according to the preset increment to obtain the updated learning parameters, and the updated learning parameters are treated as the current learning parameters. In one embodiment, the server may further obtain a preset number of samples to be added, update the current learning parameters according to the preset number of samples to be added to obtain the updated learning parameters, and treat the updated learning parameters as the current learning parameters. If the loss information obtained by training becomes larger after the number of samples is increased, the training is completed, and the prediction model obtained by training before the number of samples was increased is treated as the finally obtained target prediction model.
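The update of the learning parameters and the stopping rule can be illustrated with a short Python sketch. It is a sketch under assumptions: `train_and_evaluate` is a hypothetical callable that trains with the given parameters and returns a model and its loss, the same increment is applied to both parameters, and `max_rounds` is an illustrative safety limit that does not appear in the text.

```python
def run_schedule(train_and_evaluate, difficulty, diversity, increment, max_rounds=20):
    """Raise the learning parameters step by step until the loss gets worse.

    Returns the model and loss from the last round before the loss increased.
    """
    best_model, best_loss = None, float("inf")
    for _ in range(max_rounds):
        model, loss = train_and_evaluate(difficulty, diversity)
        if loss > best_loss:
            # Loss became larger after admitting more samples: stop and
            # keep the model trained before the increase.
            break
        best_model, best_loss = model, loss
        # Preset increment: larger parameters admit more samples next round.
        difficulty += increment
        diversity += increment
    return best_model, best_loss
```

Keeping the previous model rather than the latest one reflects the statement that the model trained without the additional samples is the one treated as the target prediction model.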
In one embodiment, as shown in
Step 1102: Obtain original data, the original data including original wild type protein information, original mutant type protein information and original compound information.
The original wild type protein information refers to wild type protein information requiring prediction of the interaction state information. The original mutant type protein information refers to mutant type protein information requiring prediction of the interaction state information. The original compound information refers to compound information requiring prediction of the interaction state information.
Specifically, the server may collect the original data from the Internet, or may also obtain the original data from the terminal. The server may further directly obtain the original data from a database. In one embodiment, the server may also obtain the original data transmitted by a third-party server. The third-party server may be a server providing a business service. The original data includes the original wild type protein information, the original mutant type protein information and the original compound information. In one embodiment, the server may obtain the original mutant type protein information and the original compound information from the terminal, and then obtain the original wild type protein information corresponding to the original mutant type protein information from the database so as to obtain the original data.
Step 1104: Perform binding energy characteristic extraction based on the original wild type protein information and the original compound information to obtain the original wild type energy characteristics, and perform binding energy characteristic extraction based on the original mutant type protein information and the original compound information to obtain the original mutant type energy characteristics.
The original wild type energy characteristics refer to energy characteristics obtained when the extracted original wild type protein information and the extracted original compound information interact with each other. The original mutant type energy characteristics refer to energy characteristics obtained when the extracted original mutant type protein information and the extracted original compound information interact with each other.
Specifically, the server performs binding energy characteristic extraction based on the original wild type protein information and the original compound information to obtain the original wild type energy characteristics. For example, the server may extract structural characteristics according to the protein structure in the original wild type protein information and the compound structure in the original compound information, and then extract physicochemical property characteristics according to the physicochemical properties in the original wild type protein information and the physicochemical properties in the original compound information. The physicochemical properties are indexes for measuring the characteristics of chemical substances and include physical properties and chemical properties. The physical properties include melting point, boiling point, and state and color at normal temperature. The chemical properties include pH value, etc. Meanwhile, the scoring function is used for computing the energy characteristics obtained when the original wild type protein information and the original compound information interact with each other, and the energy function based on mixed physical and empirical potential energy is also used for computing energy characteristics, thus obtaining the original wild type energy characteristics. Binding energy characteristic extraction is then performed based on the original mutant type protein information and the original compound information to obtain the original mutant type energy characteristics. For example, the structural characteristics may be extracted according to the protein structure in the original mutant type protein information and the compound structure in the original compound information, and the physicochemical property characteristics are then extracted according to the physicochemical properties in the original mutant type protein information and the physicochemical properties in the original compound information. Meanwhile, the scoring function is used for extracting energy characteristics, and the energy function based on mixed physical and empirical potential energy is also used for extracting energy characteristics, thus obtaining the original mutant type energy characteristics.
Step 1106: Determine the original target energy characteristics based on the original wild type energy characteristics and the original mutant type energy characteristics.
Specifically, the server computes the difference between each characteristic value in the original wild type energy characteristics and the characteristic value corresponding to the original mutant type energy characteristics to obtain the original target energy characteristics.
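The difference computation of step 1106 can be sketched as follows. The characteristic names used in the example are hypothetical, and the sign convention (wild type minus mutant type) is an assumption, since the text only states that a difference between corresponding characteristic values is computed.

```python
def target_energy_characteristics(wild_type, mutant_type):
    """Subtract each mutant type value from the matching wild type value.

    Both arguments map a characteristic name to its numeric value.
    """
    if wild_type.keys() != mutant_type.keys():
        raise ValueError("wild type and mutant type characteristics must match")
    return {name: wild_type[name] - mutant_type[name] for name in wild_type}


if __name__ == "__main__":
    # Hypothetical characteristic names and values, for illustration only.
    wild = {"vdw_energy": -5.0, "hbond_energy": -2.5}
    mutant = {"vdw_energy": -3.0, "hbond_energy": -2.0}
    print(target_energy_characteristics(wild, mutant))
```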
Step 1108: Input the original target energy characteristics into a target prediction model for prediction to obtain interaction state information. The target prediction model is obtained by: obtaining a training sample set, and determining a current training sample from the training sample set based on a training sample weight; inputting current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training, and obtaining a basic prediction model after completing the basic training; and updating the training sample weights corresponding to the training samples based on the basic prediction model, and returning to perform the operation of determining the current training sample from the training sample set based on the training sample weight until the model training is completed.
The target prediction model may be a model obtained by training in any embodiment of the method for training a prediction model. That is, the target prediction model may be obtained by: obtaining the training sample set, and determining the current training sample from the training sample set based on the training sample weight; inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for basic training, and obtaining the basic prediction model after completing the basic training; and updating the training sample weights corresponding to the training samples based on the basic prediction model, and returning to perform the operation of determining the current training sample from the training sample set based on the training sample weight until the model training is completed.
Specifically, the server inputs the original target energy characteristics into the target prediction model for prediction so as to obtain the outputted interaction state information. In a specific embodiment, the interaction state information refers to a relative difference value between the binding free energy of the original mutant type protein and the original compound and the binding free energy of the original wild type protein and the original compound. The relative difference value of the binding free energy is then compared with a drug resistance threshold. When the relative difference value of the binding free energy exceeds the drug resistance threshold, it is indicated that the original mutant type protein has drug resistance to the original compound, and the compound cannot continue to be used. When the relative difference value of the binding free energy does not exceed the drug resistance threshold, it is indicated that the original mutant type protein does not have drug resistance to the original compound, and the compound can still be used normally.
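The comparison with the drug resistance threshold can be sketched as follows. The callable `predict` is a hypothetical stand-in for the target prediction model, and the threshold value used in the example is purely illustrative; the text does not give a concrete value.

```python
def assess_resistance(predict, target_characteristics, resistance_threshold):
    """Predict the relative binding free energy difference and flag resistance."""
    relative_difference = predict(target_characteristics)
    return {
        "relative_difference": relative_difference,
        # Resistance is indicated only when the value exceeds the threshold.
        "resistant": relative_difference > resistance_threshold,
    }


if __name__ == "__main__":
    # Toy stand-in model: sums the characteristics. Not the trained model.
    toy_model = lambda characteristics: sum(characteristics)
    print(assess_resistance(toy_model, [0.5, 1.0], resistance_threshold=1.0))
```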
According to the method and apparatus for predicting data, the computer device and the storage medium, the original data is obtained, the original target energy characteristics are determined, and the original target energy characteristics are inputted into the target prediction model for prediction to obtain the interaction state information. The target prediction model is obtained by: obtaining the training sample set, and determining the current training sample from the training sample set based on the training sample weight; inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for basic training, and obtaining the basic prediction model after completing the basic training; and updating the training sample weights corresponding to the training samples based on the basic prediction model, and returning to perform the operation of determining the current training sample from the training sample set based on the training sample weight until the model training is completed. That is, the interaction state information is obtained by prediction through the target prediction model, and because the target prediction model obtained by such training has improved prediction accuracy, the accuracy of the obtained interaction state information is improved.
This disclosure further provides an application scenario, and the foregoing method for predicting data is applied to the application scenario.
In a specific embodiment, as shown in
Step 1302: Obtain a training sample set, the training sample set including each training sample, a training sample weight corresponding to each training sample and the target energy characteristics corresponding to each training sample, the training sample including wild type protein information, mutant type protein information and compound information, the target energy characteristics being obtained based on wild type energy characteristics and mutant type energy characteristics, the wild type energy characteristics being obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information, and the mutant type energy characteristics being obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information.
Step 1304: Obtain protein family information, divide the training sample set based on the protein family information to obtain each training sample group, acquire current learning parameters, and determine a number of selected samples and sample distribution based on the current learning parameters. The current training sample is selected from each training sample group according to the training sample weight based on the number of selected samples and the sample distribution to obtain a target current training sample set.
Step 1306: Input the target energy characteristics corresponding to each training sample in the target current training sample set into the basic prediction model to obtain the basic interaction state information corresponding to each training sample, and compute an error between the basic interaction state information corresponding to each training sample and the interaction state tag corresponding to each training sample to obtain the basic loss information.
Step 1308: Compute a sample rank corresponding to each training sample group. A weighted value is computed based on the sample rank. The diversity learning parameters are weighted by using the weighted value to obtain a target weighted value. A sum of the target weighted value and the difficulty learning parameters is computed to obtain the update threshold of each training sample group.
Step 1310: Compare the update threshold with the basic loss information corresponding to the training sample in each training sample group to obtain a comparison result corresponding to each training sample, and determine the updated sample weight corresponding to the training sample in each training sample group according to the comparison result corresponding to the training sample.
Step 1312: Update the current learning parameters according to the preset increment to obtain the updated learning parameters, treat the updated learning parameters as the current learning parameters, and return to perform the operation of determining the number of selected samples and the sample distribution based on the current learning parameters, so as to obtain the target prediction model after completing the model training.
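The grouping and selection of step 1304 can be illustrated with the following Python sketch. It assumes that each training sample carries a protein family label and a 0/1 weight, and it uses a single hypothetical parameter `per_group` as a stand-in for the number of selected samples and the sample distribution, since the text does not state how these are derived from the current learning parameters.

```python
from collections import defaultdict


def select_current_samples(families, weights, per_group):
    """Return indices of selected samples, at most per_group from each family.

    families[i] is the protein family of sample i; weights[i] is its 0/1 weight.
    """
    # Divide the training sample set into groups by protein family.
    groups = defaultdict(list)
    for index, family in enumerate(families):
        groups[family].append(index)

    selected = []
    for family in sorted(groups):
        # Only samples whose weight is 1 are eligible for the current round.
        eligible = [i for i in groups[family] if weights[i] == 1]
        selected.extend(eligible[:per_group])
    return selected


if __name__ == "__main__":
    # Hypothetical family labels, for illustration only.
    families = ["kinase", "kinase", "protease", "kinase"]
    print(select_current_samples(families, [1, 0, 1, 1], per_group=1))
```

Capping the contribution of each family is one simple way to keep the selected set spread across groups; other distributions are equally compatible with the description.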
This disclosure further provides an application scenario, and the foregoing method for training a prediction model is applied to the application scenario. Specifically:
The input data and the training sample group information are obtained (1401). The input data includes each training sample, the corresponding training sample weight is 0 or 1, and the training sample group information indicates a training sample group to which the training sample in the input data belongs. At this time, the model parameters and learning parameters of the prediction model are initialized (1402).
Then the training sample weights corresponding to the training samples are fixed, and the parameters of the model are trained. That is, the training samples with a training sample weight of 1 are selected according to the initialized learning parameters to obtain the current training samples (1403). The current target energy characteristics corresponding to the current training samples are extracted and inputted into the initialized prediction model for basic training. The basic prediction model is obtained after completing the basic training.
Then the parameters of the basic prediction model are fixed, and the sample weight is updated (1404). That is, the training sample weight corresponding to training samples is updated by using the formula (3), and thus the updated sample weight is obtained.
At this time, the initialized learning parameters are further updated (1405). Then the operation of fixing the training sample weight corresponding to the training sample and training the parameters of the model is performed again in a continuous iteration way, and the model parameters of the prediction model and the training sample weight are outputted after completing the model training, thus obtaining the target prediction model (1406).
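The alternation between training the model with fixed weights and updating the weights with a fixed model (1403 to 1405) can be sketched as below. The callables `fit`, `loss_of` and `reweight` are hypothetical placeholders for the basic training, the error computation and the weight update, and the fixed number of `rounds` is an assumed stopping condition used only to keep the sketch short.

```python
def alternate_training(samples, weights, fit, loss_of, reweight, params, increment, rounds):
    """Alternate between fitting the model and updating the 0/1 sample weights."""
    model = None
    for _ in range(rounds):
        # 1403: weights fixed, train on the samples whose weight is 1.
        current = [s for s, w in zip(samples, weights) if w == 1]
        if not current:
            break
        model = fit(current)
        # 1404: model fixed, recompute every sample's error and its weight.
        losses = [loss_of(model, s) for s in samples]
        weights = reweight(losses, params)
        # 1405: raise the learning parameters by the preset increment.
        params = {name: value + increment for name, value in params.items()}
    return model, weights


if __name__ == "__main__":
    # Toy setup: each sample is a label, the model is the mean label.
    samples = [1.0, 1.2, 5.0]
    fit = lambda current: sum(current) / len(current)
    loss_of = lambda model, sample: abs(model - sample)
    reweight = lambda losses, p: [1 if l < p["difficulty"] else 0 for l in losses]
    print(alternate_training(samples, [1, 1, 0], fit, loss_of, reweight,
                             {"difficulty": 0.5}, 0.5, rounds=2))
```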
In this embodiment, the target prediction model obtained by training is subjected to a comparison test. Specifically, the drug resistance standard data set Platinum (a database that widely collects drug resistance information and was developed for studying and understanding the influence of missense mutations on the interaction between ligands and a proteome) and the data set TKI are used for training and testing. The data set Platinum is used for training to obtain the target prediction model, and the data set TKI is then used for testing. The characteristics used for predicting the affinity change after protein mutation are generated by adopting RDKit (an open source toolkit for chemical informatics, used with machine learning methods for compound descriptor generation, fingerprint generation, compound structural similarity computing, 2D and 3D molecule display, etc., based on 2D and 3D molecule operations on compounds), Biopython (which provides an online resource library for developers using and studying bioinformatics), FoldX (which computes protein binding free energy), PLIP (an analysis tool for protein-ligand non-covalent interactions), AutoDock (open source molecular simulation software mainly applied to ligand-protein molecular docking) and other non-physical model tools. The modeling program Rosetta, which is based on mixed physical and empirical potential energy, is used for computing the energy characteristics. Characteristic selection is then carried out to obtain the finally selected characteristics. The details are shown in table 1 below, which lists the numbers of finally selected characteristics.
At this time, the target prediction model obtained by training is subjected to comparison test, and the test result is shown in
In all characteristics, the average value of the RMSE (the smaller the better) index in this disclosure is 0.73, the minimum value is 0.72, the maximum value is 0.74, and the average error is obviously smaller than other related art. The Pearson (the larger the better) index in this disclosure is also obviously superior to other related technologies. The AUPRC index in this disclosure is also superior to other related technologies. Therefore, compared with the related technologies, the prediction accuracy in this disclosure is obviously improved. Further,
It is to be understood that, steps in flowcharts of
In one embodiment, as shown in
Herein, the term module (and other similar terms such as unit, submodule, etc.) may refer to a software module, a hardware module, or a combination thereof. A software module (e.g., computer program) may be developed using a computer programming language. A hardware module may be implemented using processing circuitry and/or memory. Each module can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more modules. Moreover, each module can be part of an overall module that includes the functionalities of the module. A module is configured to perform functions and achieve goals such as those described in this disclosure, and may work together with other related modules, programs, and components to achieve those functions and goals.
The sample acquisition module 1702 is configured to obtain a training sample set. The training sample set includes each training sample, a training sample weight corresponding to each training sample and target energy characteristics corresponding to each training sample. The training sample includes wild type protein information, mutant type protein information and compound information. The target energy characteristics are obtained based on wild type energy characteristics and mutant type energy characteristics. The wild type energy characteristics are obtained by performing binding energy characteristic extraction based on the wild type protein information and the compound information. The mutant type energy characteristics are obtained by performing binding energy characteristic extraction based on the mutant type protein information and the compound information.
The sample determination module 1704 is configured to determine a current training sample from the training sample set based on the training sample weight.
The training module 1706 is configured to input current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training, and obtain a basic prediction model after completing the basic training.
The iteration module 1708 is configured to update the training sample weight corresponding to each training sample based on the basic prediction model, and return to perform the operation of determining the current training sample from the training sample set based on the training sample weight until completing model training to obtain a target prediction model. The target prediction model is configured to predict interaction state information corresponding to the inputted protein information and the inputted compound information.
In one embodiment, the apparatus 1700 for training a prediction model further includes:
a pre-training module, configured to obtain each training sample, the training sample including the wild type protein information, the mutant type protein information and the compound information; perform binding initial energy characteristic extraction based on the wild type protein information and the compound information to obtain wild type initial energy characteristics; perform binding initial energy characteristic extraction based on the mutant type protein information and the compound information to obtain mutant type initial energy characteristics, and determine target initial energy characteristics corresponding to each training sample based on the wild type initial energy characteristics and the mutant type initial energy characteristics; input the target initial energy characteristics corresponding to each training sample into an initial prediction model for prediction to obtain initial interaction state information corresponding to each training sample, the initial prediction model being established by using a random forest algorithm; perform loss computation based on the initial interaction state information corresponding to each training sample and the interaction state tag corresponding to each training sample to obtain initial loss information corresponding to each training sample; update the initial prediction model based on the initial loss information, and return to perform the operation of inputting the target energy characteristics corresponding to each training sample into the initial prediction model for prediction until completing the pre-training, thus obtaining the pre-trained prediction model and characteristic importance corresponding to the target initial energy characteristics; and determine the training sample weight corresponding to each training sample based on the loss information corresponding to each training sample in a case of completing pre-training, and select the target energy characteristics from the target initial energy characteristics based on the characteristic importance.
In one embodiment, the pre-training module is further configured to input the target initial energy characteristics corresponding to each training sample into the initial prediction model; and the initial prediction model is configured to treat the target initial energy characteristics corresponding to each training sample as a current division set, and compute initial characteristic importance corresponding to the target initial energy characteristics; determine initial division characteristics from the target initial energy characteristics based on the initial characteristic importance; divide the target initial energy characteristics corresponding to each training sample based on the initial division characteristics so as to obtain each division result, the division result including target initial energy characteristics corresponding to each division sample; and treat each division result as the current division set, and return to perform the operation of computing the initial characteristic importance corresponding to the target initial energy characteristics for iteration until completing division, thus obtaining initial interaction state information corresponding to each training sample.
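One division step of the kind described above can be sketched in Python. The sketch assumes that characteristic importance is measured by variance reduction, which is a common choice for regression trees but is not stated in the text; the function names and the toy data are hypothetical, and a full random forest would repeat this step recursively over many trees.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_division(rows, labels):
    """Return (characteristic index, threshold, gain) of the best single split.

    rows[i] is the list of characteristic values of sample i.
    """
    base = variance(labels)
    best = (None, None, 0.0)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            # Importance of this split: how much the label variance drops.
            weighted = (len(left) * variance(left) + len(right) * variance(right)) / len(rows)
            gain = base - weighted
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best


if __name__ == "__main__":
    rows = [[0.0, 5.0], [1.0, 6.0], [2.0, 5.0], [3.0, 6.0]]
    labels = [0.0, 0.0, 1.0, 1.0]
    print(best_division(rows, labels))
```

Each resulting division (the left and right subsets) would then be treated as the current division set and divided again until a stopping condition is met.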
In one embodiment, the sample acquisition module 1702 is further configured to obtain confidence corresponding to each training sample, and determine the training sample weight corresponding to each training sample based on the confidence.
In one embodiment, the sample acquisition module 1702 is further configured to perform binding energy characteristic extraction based on the wild type protein information and the compound information to obtain the wild type energy characteristics; perform binding energy characteristic extraction based on the mutant type protein information and the compound information to obtain the mutant type energy characteristics; and compute a difference between the wild type energy characteristics and the mutant type energy characteristics to obtain the target energy characteristics.
In one embodiment, the wild type energy characteristics include first wild type energy characteristics and second wild type energy characteristics. The sample acquisition module 1702 is further configured to perform binding energy characteristic extraction by a non-physical scoring function based on wild type protein information and compound information to obtain the first wild type energy characteristics; perform binding energy characteristic extraction by a physical function based on the wild type protein information and the compound information to obtain the second wild type energy characteristics; and perform fusing based on the first wild type energy characteristics and the second wild type energy characteristics to obtain the wild type energy characteristics.
In one embodiment, the mutant type energy characteristics include first mutant type energy characteristics and second mutant type energy characteristics. The sample acquisition module 1702 is further configured to perform binding energy characteristic extraction by a non-physical scoring function based on the mutant type protein information and the compound information to obtain the first mutant type energy characteristics; perform binding energy characteristic extraction by a physical function based on the mutant type protein information and the compound information to obtain the second mutant type energy characteristics; and perform fusing based on the first mutant type energy characteristics and the second mutant type energy characteristics to obtain the mutant type energy characteristics.
In one embodiment, the sample determination module 1704 is further configured to obtain protein family information, and divide the training sample set based on the protein family information to obtain each training sample group; and select the current training sample from each training sample group based on the training sample weight to obtain a current training sample set.
The training module 1706 is further configured to input the current target energy characteristics corresponding to each current training sample in the current training sample set into the pre-trained prediction model for basic training, and obtain a target basic prediction model after completing the basic training.
In one embodiment, the sample determination module 1704 is further configured to obtain current learning parameters, and determine a number of selected samples and sample distribution based on the current learning parameters; and select the current training sample from each training sample group according to the training sample weight based on the number of selected samples and the sample distribution to obtain a target current training sample set.
In one embodiment, the training module 1706 is further configured to input the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction to obtain current interaction state information; compute an error between the current interaction state information and the interaction state tag corresponding to the current training sample to obtain the current loss information; and update the pre-trained prediction model based on the current loss information, and return to perform the operation of inputting the current target energy characteristics corresponding to the current training sample into the pre-trained prediction model for prediction to obtain the current interaction state information so as to obtain the basic prediction model after reaching the basic training completion conditions.
In one embodiment, the iteration module 1708 is further configured to input the target energy characteristics corresponding to each training sample into the basic prediction model to obtain the basic interaction state information corresponding to each training sample; compute an error between the basic interaction state information corresponding to each training sample and the interaction state tag corresponding to each training sample to obtain basic loss information; and update the training sample weight based on the basic loss information to obtain an updated sample weight corresponding to each training sample.
In one embodiment, the iteration module 1708 is further configured to obtain current learning parameters, and compute an update threshold based on the current learning parameters; compare the update threshold with the basic loss information corresponding to each training sample to obtain a comparison result corresponding to each training sample; and determine the updated sample weight corresponding to each training sample according to the comparison result corresponding to each training sample.
In one embodiment, the current learning parameters include diversity learning parameters and difficulty learning parameters. The iteration module 1708 is further configured to obtain training sample groups, determine a current training sample group from each training sample group, and compute a sample rank corresponding to the current training sample group; compute a weighted value based on the sample rank, and weight the diversity learning parameters by using the weighted value to obtain a target weighted value; and compute a sum of the target weighted value and the difficulty learning parameters to obtain the update threshold.
In one embodiment, the iteration module 1708 is further configured to obtain current learning parameters, update the current learning parameters according to a preset increment to obtain updated learning parameters, and treat the updated learning parameters as the current learning parameters.
In one embodiment, as shown in
The data acquisition module 1802 is configured to obtain original data, the original data including original wild type protein information, original mutant type protein information and original compound information.
The characteristic extraction module 1804 is configured to perform binding energy characteristic extraction based on the original wild type protein information and the original compound information to obtain original wild type energy characteristics, and perform binding energy characteristic extraction based on the original mutant type protein information and the original compound information to obtain original mutant type energy characteristics.
The target characteristic determination module 1806 is configured to determine original target energy characteristics based on the original wild type energy characteristics and the original mutant type energy characteristics.
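One possible way to determine the target energy characteristics is sketched below, assuming that both characteristic sets are equal-length numeric vectors and that the target is their element-wise difference (mutant minus wild type). The difference rule and the function name are assumptions for illustration; the passage above only requires that the target characteristics be based on both inputs.

```python
from typing import List, Sequence


def target_energy_characteristics(wild_type: Sequence[float],
                                  mutant_type: Sequence[float]) -> List[float]:
    """Element-wise difference between mutant and wild type characteristics.

    Taking a difference is an assumption, not a stated requirement.
    """
    if len(wild_type) != len(mutant_type):
        raise ValueError("characteristic vectors must have the same length")
    return [m - w for w, m in zip(wild_type, mutant_type)]
```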
The prediction module 1808 is configured to input the original target energy characteristics into a target prediction model for prediction to obtain interaction state information. The target prediction model is obtained by: obtaining a training sample set; determining a current training sample from the training sample set based on training sample weights; inputting current target energy characteristics corresponding to the current training sample into a pre-trained prediction model for basic training, and obtaining a basic prediction model after completing the basic training; and updating the training sample weight corresponding to each training sample based on the basic prediction model, and returning to the operation of determining the current training sample from the training sample set based on the training sample weights until the model training is completed.
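The iterative procedure by which the target prediction model is obtained can be sketched as a loop. In this sketch a one-feature least-squares line stands in for the prediction model, squared error stands in for the loss, weights are binary, and the threshold grows by a fixed increment each round; all of these are simplifying assumptions, and every name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Model = Tuple[float, float]  # (slope, intercept) of the stand-in model


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def train_with_sample_weights(xs: Sequence[float], ys: Sequence[float],
                              weights: Sequence[float], threshold: float,
                              increment: float,
                              rounds: int) -> Tuple[Model, List[float]]:
    """Alternate between fitting on selected samples and re-weighting all samples."""
    if not xs or not (len(xs) == len(ys) == len(weights)):
        raise ValueError("xs, ys and weights must be non-empty and equal in length")
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    current_weights = list(weights)
    model: Optional[Model] = None
    for _ in range(rounds):
        # Determine the current training samples from the sample weights.
        chosen = [i for i, w in enumerate(current_weights) if w > 0]
        if chosen:
            model = fit_line([xs[i] for i in chosen], [ys[i] for i in chosen])
        if model is None:
            raise ValueError("no training sample has a positive weight")
        slope, intercept = model
        # Loss of the freshly trained model on every sample.
        losses = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
        # Update every sample weight by comparing its loss with the threshold.
        current_weights = [1.0 if loss < threshold else 0.0 for loss in losses]
        # Relax the threshold so that harder samples can enter later rounds.
        threshold += increment
    return model, current_weights
```

If no sample is selected in a later round, the sketch keeps the previous model and only re-evaluates the weights against the relaxed threshold.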
For specific definitions of the apparatus for training a prediction model and the apparatus for predicting data, reference may be made to the above definitions of the method for training a prediction model and the method for predicting data; details are not repeated here. The modules in the foregoing apparatus for predicting data may be implemented entirely or partially by software, hardware, or a combination thereof. The foregoing modules may be built into or be independent of a processor of a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor invokes and performs the operation corresponding to each of the foregoing modules.
In an embodiment, a computer device is provided. The computer device may be a server, and an internal structure diagram thereof may be shown in
In an embodiment, a computer device is provided. The computer device may be a terminal, and an internal structure diagram thereof may be shown in
A person skilled in the art may understand that, the structure shown in
In an embodiment, a computer device is further provided, including a memory and a processor, the memory storing computer-readable instructions, the processor, when executing the computer-readable instructions, implementing the steps in the foregoing method embodiments.
In an embodiment, a computer-readable storage medium is provided, storing computer-readable instructions, the computer-readable instructions, when executed by a processor, implementing the steps in the foregoing method embodiments.
In an embodiment, a computer program product or a computer program is provided. The computer program product or the computer program includes computer instructions, the computer instructions being stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, to cause the computer device to perform the steps in the above method embodiments.
A person of ordinary skill in the art may understand that all or some of procedures of the method in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium. When the program is executed, the procedures of the foregoing method embodiments may be implemented. Any reference to a memory, a storage, a database, or another medium used in the embodiments provided in this disclosure may include at least one of a non-volatile memory and a volatile memory. The non-volatile memory may include a read-only memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, and the like. The volatile memory may include a random access memory (RAM) or an external cache. For the purpose of description instead of limitation, the RAM is available in a plurality of forms, such as a static RAM (SRAM) or a dynamic RAM (DRAM).
The technical features in the foregoing embodiments may be combined in any manner. For concise description, not all possible combinations of the technical features in the embodiments are described. However, provided that combinations of the technical features do not conflict with each other, the combinations of the technical features are considered as falling within the scope recorded in this specification.
The foregoing embodiments only describe several implementations of this disclosure specifically and in detail, but cannot be construed as a limitation to the patent scope of this disclosure. A person of ordinary skill in the art may make various changes and improvements without departing from the ideas of this disclosure, which shall all fall within the protection scope of this disclosure. Therefore, the protection scope of this patent disclosure is subject to the protection scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2021103559296 | Apr 2021 | CN | national |
This application is a continuation application of PCT Patent Application No. PCT/CN2022/079885, filed on Mar. 9, 2022, which claims priority to Chinese Patent Application No. 2021103559296, entitled “METHODS AND APPARATUSES FOR TRAINING PREDICTION MODEL AND PREDICTING DATA AND STORAGE MEDIUM” filed with the National Intellectual Property Administration on Apr. 1, 2021, wherein the contents of the above-referenced applications are incorporated herein by reference in their entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2022/079885 | Mar 2022 | US |
Child | 18075643 | | US |