The present application belongs to the technical field of medical and health information and, in particular, to a general multi-disease prediction system based on causal check data generation.
With the development of information technology, machine learning has become an important force to promote medical development. General medicine, as the most widely accepted medical discipline in the medical field, is one of the key areas in which a machine learning model is applied in medical scenes. However, due to the complexity of general diseases and the high cost of sample acquisition, it is difficult for some diseases to obtain a large number of training data, which leads to the poor prediction effect of the existing general multi-disease prediction system for few-shot diseases. At present, a set of general multi-disease prediction systems for few-shot diseases is urgently desired.
Generating simulated data is a common way to address insufficient training samples for a machine learning model. Existing data generation methods are mainly based on generative adversarial networks, which perform well in generating image data. However, in the general practice scene, the data are of many kinds and have complex structures. Structured medical data in particular contain many kinds of patient-centered characteristic data, are heterogeneous in time and space, and follow more complicated distributions. It is difficult for a traditional generative adversarial network to handle structured data with such complex distributions. On the one hand, training with few samples is prone to problems such as unstable training, vanishing gradients or mode collapse. On the other hand, considering only the correlation between variables, without considering the causal relationship between them, yields generated data that are difficult to interpret and inconsistent with common sense. Using such data for model training may fail to improve, or may even weaken, the training effect of the model. For example, colds include viral colds and bacterial colds, which call for two different kinds of drugs. If the data of fever patients are generated based on a correlation model, the generated data may show the simultaneous use of viral cold medicines and bacterial cold medicines, which will interfere with the subsequent model construction.
The method for calculating causal effect values based on propensity scores is the most common method to measure the causal relationship between variables. Most of the existing methods for calculating propensity scores are based on logistic regression. However, due to the variety, complex structure and linear inseparability of data in the general practice scenes, the method for calculating propensity scores based on logistic regression does not perform well in general practice scenes.
In view of the shortcomings of the prior art, the present application provides, from the perspective of causality, a propensity score calculation method based on a general propensity score network and, on this basis, a medical data generation method based on causal check. The method solves the problem that the data generated by a generative adversarial network based on correlation analysis is difficult to understand. A general multi-disease prediction system is constructed accordingly, which solves the problems of poor model performance and low robustness caused by few training samples in the general practice scenes.
The object of the present application is achieved through the following technical solution: a general multi-disease prediction system based on causal check data generation, including:
Further, in the causal check module, the general propensity score network is trained by using binary variable data of the general patient; characteristic variable data and label variable data of the general patient are converted into binary variables, categorical variables are converted into the binary variables by One-Hot Encoding, and continuous variables are converted into the categorical variables by binning in advance, and the categorical variables are further converted into the binary variables by One-Hot Encoding.
Further, the general propensity score network includes an input layer, a locally-connected layer, a sigmoid activation layer and an output layer;
Further, the training process of the general propensity score network is as follows:
Further, in the causal check module, the trained general propensity score network is used to calculate a general propensity score pia of a general patient i for a first event variable a, and a causal effect value ATEa,b of the first event variable a and a second event variable b is calculated by using the general propensity score according to the following calculation formula:
where n represents a total number of patients to be studied, Ti represents a true value of the first event variable of an ith patient, and Yi represents a true value of the second event variable of the ith patient.
Further, in the data generation module, the generator is composed of multiple layers of generator modules; the generator module includes a normalization layer, a fully-connected layer and an activation layer; the activation layer of the generator module in a last layer is a sigmoid activation layer; in a training process, the random noises and the corresponding disease labels are input into the normalization layer of a first generator module, and normalized data are input into the fully connected layer of the first generator module to obtain a first characteristic representation of input data; the first characteristic representation is input into the activation layer of the first generator module to obtain a second characteristic representation of the input data, and the second characteristic representation is used as input data of the generator module in a next layer; and finally the generated sample is obtained through the sigmoid activation layer of the generator module in the last layer.
Further, in the data generation module, a formula for calculating a causal loss Lcausal is as follows:
where ATE^o_{a,r} represents a causal effect value of a first event variable a and a second event variable r of the original data, and ATE^g_{a,r} represents a causal effect value of the first event variable a and the second event variable r of the generated sample; Ar represents the first event variable set paired with the second event variable r; the second event variable set is a general disease set, and the second event variable r corresponds to a few-shot general disease r in the few-shot general disease set R.
Further, in the data generation module, a formula for calculating a discriminator adversarial loss Lζ is as follows:
where N is a data size of the random noises, and yi* is a probability that an ith generated sample is determined as real data of a corresponding disease by the discriminator;
a calculation formula of a regularization term loss Lregular is as follows:
where ∥·∥ represents an L1 norm and w represents a model parameter of the generator.
Further, in the data generation module, a formula for calculating a total loss Ld of the discriminator is as follows:
where md is the number of the positive samples, yk is the disease label corresponding to the kth positive sample, and xk, x̂k and dk are the kth extracted positive sample, the kth extracted negative sample and the kth generated sample obtained by using the generator, respectively; D(xk, yk), D(x̂k, yk) and D(dk, yk) are the probabilities that the positive sample xk, the negative sample x̂k and the generated sample dk are determined as the real data of a disease yk by the discriminator.
Further, the model prediction module is configured to:
construct an event relation graph: each first event variable constitutes a first event node in the event relation graph, each second event variable constitutes a second event node in the event relation graph, and an edge is constructed for each event pair;
generate a node embedding representation of the first event node and a node embedding representation of the second event node, construct a degree matrix Φ and an adjacency matrix A based on the event relation graph, and construct a causal effect matrix Ψ using the causal effect values of the original data;
construct the general multi-disease prediction model based on the general causal graph convolutional neural network, where the general causal graph convolutional neural network includes a plurality of causal graph convolutional modules, and each of the causal graph convolutional modules includes a causal graph convolutional layer and an activation layer;
input the node embedding representations into the causal graph convolutional layer of a first causal graph convolutional module to obtain a first graph characteristic representation h(0):
where, H(0) represents a node embedding representation, W(0) represents a weight of the causal graph convolutional layer, I represents an identity matrix, and * represents a multiplication of elements of the matrix;
input h(0) into the activation layer of the first causal graph convolutional module to obtain an output H(1) of the first causal graph convolutional module; and
input the output of the previous causal graph convolutional module into a next causal graph convolutional module until a final disease prediction result is obtained.
The present application has the following beneficial effects:
In order to better understand the technical solution of the application, the embodiments of the application will be described in detail with the attached drawings.
It should be clear that the described embodiments are only part of the embodiments of this application, not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative work belong to the protection scope of this application.
The terms used in the examples of this application are intended to describe specific embodiments only and are not intended to limit this application. The singular forms “a”, “said” and “the” used in the embodiments of this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates other meaning.
The present application provides a method for generating medical data based on a generative adversarial network of causal check, and based on this method, a general multi-disease prediction system is constructed to solve the problem that the general multi-disease prediction model predicts few-shot diseases poorly because it has too few training samples. As shown in
The following description further gives some examples of the implementation of each module of the general multi-disease prediction system based on causal check data generation that meets the requirements of this application.
For all kinds of general diseases, the number of samples of each disease is counted and the sample ratio of each disease is calculated. The sample ratio of a disease is the ratio of the number of samples of the disease with the largest number of samples to the number of samples of that disease. For example, for the four general diseases of common cold, gastritis, diarrhea and fever, the sample numbers are 10, 20, 30 and 40 respectively, and the sample ratios are 4, 2, 4/3 and 1 respectively.
Diseases whose sample ratios are greater than a set threshold (which is an adjustable parameter set according to the actual situation) are added into the few-shot general disease set R, and the frequency
of the rth few-shot general disease is calculated, where countr is the number of samples of the rth few-shot general disease.
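The ratio calculation and the threshold test described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function names and the threshold of 3/2 are assumptions made for the example, and the frequency calculation is left out because its formula is not reproduced in this text.

```python
from fractions import Fraction


def sample_ratios(counts):
    """Ratio of the largest disease sample count to each disease's sample count."""
    largest = max(counts.values())
    return {disease: Fraction(largest, n) for disease, n in counts.items()}


def few_shot_set(counts, threshold):
    """Diseases whose sample ratio is greater than the (adjustable) threshold."""
    return {d for d, ratio in sample_ratios(counts).items() if ratio > threshold}


# Sample numbers taken from the example in the text.
counts = {"common cold": 10, "gastritis": 20, "diarrhea": 30, "fever": 40}
ratios = sample_ratios(counts)

# Hypothetical threshold of 3/2, chosen only for illustration.
few_shot = few_shot_set(counts, threshold=Fraction(3, 2))
```

With the hypothetical threshold of 3/2, common cold (ratio 4) and gastritis (ratio 2) enter the few-shot general disease set, while diarrhea (ratio 4/3) and fever (ratio 1) do not.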
II. Causal Check Module, the Implementation Process of which is Shown in
The patient's characteristic variable data and label variable data are obtained. The characteristic variable data and label variable data are converted into binary variables as follows. For categorical variables, they are converted into binary variables by One-Hot Encoding. For continuous variables, they are converted into categorical variables through binning, and then into binary variables through One-Hot Encoding.
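The conversion to binary variables can be illustrated with a short sketch. The helper names, the temperature cut points and the category lists are hypothetical and serve only to show binning followed by One-Hot Encoding.

```python
def one_hot(value, categories):
    """Binary indicator vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def bin_index(x, edges):
    """Index of the bin containing x; `edges` are ascending interior cut points."""
    return sum(1 for e in edges if x >= e)


def binarize_continuous(x, edges):
    """Bin a continuous value, then one-hot encode the resulting bin index."""
    n_bins = len(edges) + 1
    return one_hot(bin_index(x, edges), list(range(n_bins)))


# Hypothetical continuous variable: body temperature in degrees Celsius,
# binned at the assumed cut points 37.3 and 38.5.
temperature_bits = binarize_continuous(38.0, edges=[37.3, 38.5])

# Hypothetical categorical variable with two categories.
sex_bits = one_hot("female", ["female", "male"])

# One patient row made only of binary variables.
patient_row = sex_bits + temperature_bits
```

Each categorical variable contributes as many binary variables as it has categories, and each continuous variable contributes one binary variable per bin.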
The characteristic variable set constitutes a first event variable set, and the label variable set constitutes a second event variable set. The first event variable set is a set of clinical manifestations, such as {hypertension, fever, chest tightness}, and the second event variable set is a set of general diseases, such as {cold, gastritis, cardiovascular diseases}.
Any first event variable in the first event variable set and any second event variable in the second event variable set form an event pair, and the causal effect values of all event pairs are calculated. The calculation method of causal effect values is as follows.
A first event variable a and a second event variable b form an event pair δ; the covariates corresponding to the event pair δ are defined as the variables other than the first event variable a in the first event variable set. Taking the event pair hypertension-cold as an example, the covariates are the variables other than the hypertension variable in the first event variable set {hypertension, fever, chest tightness}, that is, {fever, chest tightness}. Because of the variety and complexity of the data in the general practice scene, the traditional method of calculating the propensity score based on logistic regression has limited ability to deal with linearly inseparable data. Therefore, the present application constructs a general propensity score network for the general practice scene, trains the general propensity score network by using the binary variable data of general patients, and calculates the general propensity score by using the trained general propensity score network.
The general propensity score indicates the probability that the first event occurs to the patient given the covariates. Taking {hypertension, fever, chest tightness} as an example, it is the probability of hypertension in patients with fever and chest tightness.
The general propensity score network includes an input layer, a locally-connected layer, a sigmoid activation layer and an output layer.
Specifically, the number of nodes in the input layer and the number of nodes in the output layer are both the number M of first event variables in the first event variable set. Both the locally-connected layer and the sigmoid activation layer contain τM nodes, where τ is an adjustable parameter, τ≥. The uth node of the input layer is connected with all nodes in the locally-connected layer except the τ(u−1)+1th to the τuth nodes. The τ(u−1)+1th to the τuth nodes in the locally-connected layer are connected with the τ(u−1)+1th to the τuth nodes in the sigmoid activation layer in one-to-one correspondence. The τ(u−1)+1th to the τuth nodes in the sigmoid activation layer are connected only with the uth node in the output layer. The advantage of the locally-connected layer is that it ensures local connection between the input layer and the output layer: for each first event variable to be predicted, the covariate nodes of the input layer form a local network with the nodes of the locally-connected layer, the sigmoid activation layer and the output layer that correspond to that first event variable. The locally-connected layer also keeps the local networks mutually independent, so that the first event variable being predicted is never used as an input to its own prediction.
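The connection pattern described above can be written as two binary masks. The sketch below is only an illustration of the wiring under 0-based indexing; the function names and the values M = 3 and τ = 2 are assumptions chosen for the example.

```python
def input_to_local_mask(M, tau):
    """mask[u][j] == 1 iff input node u feeds locally-connected node j (0-based).

    Input node u is cut off from the block of tau nodes reserved for
    predicting first event variable u.
    """
    return [[0 if j // tau == u else 1 for j in range(tau * M)] for u in range(M)]


def sigmoid_to_output_mask(M, tau):
    """mask[j][u] == 1 iff sigmoid node j feeds output node u (0-based).

    Each block of tau sigmoid nodes feeds exactly one output node.
    """
    return [[1 if j // tau == u else 0 for u in range(M)] for j in range(tau * M)]


# Illustrative sizes: three first event variables, two hidden nodes per variable.
M, tau = 3, 2
in_mask = input_to_local_mask(M, tau)
out_mask = sigmoid_to_output_mask(M, tau)

# A variable must never reach its own prediction through the hidden block.
leaks = [
    (u, j)
    for u in range(M)
    for j in range(tau * M)
    if in_mask[u][j] == 1 and out_mask[j][u] == 1
]
```

Multiplying a dense weight matrix elementwise by such a mask is one way to realise a locally-connected layer; the empty `leaks` list confirms that no input node has a path to the output node of the same variable.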
The training process of the general propensity score network is as follows:
for each first event variable a, the covariate data corresponding to the training samples are input into the locally-connected layer to obtain a first characteristic representation of propensity; the first characteristic representation of propensity is input into the sigmoid activation layer to obtain a second characteristic representation of propensity; the second characteristic representation of propensity is input into the output layer to obtain a predicted value of the first event variable a; a propensity loss is calculated by using the predicted values of all the first event variables and the true values of all the first event variables. The propensity loss Lp is calculated as follows
where mp represents the total number of training samples, γf,a represents the true value of the first event variable a of a training sample f, and γf,a# represents the predicted value of the first event variable a of the training sample f.
The trained general propensity score network is used to calculate a general propensity score pia of a general patient i for the first event variable a, and a causal effect value ATE of the first event variable and a second event variable is calculated by using the general propensity score. The formula of the causal effect value ATEa,b of the first event variable a and the second event variable b is as follows:
where n represents the total number of patients to be studied, and Ti represents the true value of the first event variable of an ith patient; Yi represents the true value of the second event variable of the ith patient; Yi=1 represents that the second event occurs to the ith patient, and Yi=0 represents that the second event does not occur to the ith patient.
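Because the formula itself is not reproduced in this text, the following sketch uses the standard inverse-propensity-weighted estimator of the average treatment effect as an assumption; it may differ from the exact expression of the application. The function name and the clipping constant `eps` are likewise assumptions, added only to avoid division by zero.

```python
def ipw_ate(T, Y, p, eps=1e-6):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    T: true binary values of the first event variable, one per patient.
    Y: true binary values of the second event variable, one per patient.
    p: propensity scores, the estimated probability that T == 1 per patient.
    """
    n = len(T)
    total = 0.0
    for t, y, p_i in zip(T, Y, p):
        p_i = min(max(p_i, eps), 1.0 - eps)  # keep the score strictly inside (0, 1)
        total += t * y / p_i - (1 - t) * y / (1.0 - p_i)
    return total / n


# Toy data: the second event occurs exactly when the first event occurs.
ate_linked = ipw_ate(T=[1, 1, 0, 0], Y=[1, 1, 0, 0], p=[0.5, 0.5, 0.5, 0.5])

# Toy data: the second event occurs equally often with and without the first event.
ate_unlinked = ipw_ate(T=[1, 1, 0, 0], Y=[1, 0, 1, 0], p=[0.5, 0.5, 0.5, 0.5])
```

In this toy setting the estimate is 1 when the second event always accompanies the first event and 0 when the second event is equally frequent in both groups.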
For the few-shot general disease set R, a data generation model is constructed based on the generative adversarial network of causal check, and the simulated data is generated by using the trained data generation model.
Specifically, the data generation model includes a generator and a discriminator. The generator G(z, c) is composed of multiple layers of generator modules, where z represents a random noise and c represents a disease label of a sample to be generated. The generator module includes a normalization layer, a fully-connected layer and an activation layer. The activation layer of the generator module of the last layer is a sigmoid activation layer, and the activation layers of the other generator modules can be a ReLU activation layer, a sigmoid activation layer or a tanh activation layer. The discriminator D is composed of multiple layers of discriminator modules, and the discriminator module includes a fully-connected layer, a Dropout layer and an activation layer.
S1, for each disease r in the few-shot general disease set R, mg noise points zr={z1,r, z2,r, . . . , zm
S2, the random noise z and the corresponding disease label c are input into the normalization layer of the first generator module, where the normalization layer is used for normalizing the input data, for example by batch normalization or sample normalization; the normalized data are input into the fully-connected layer of the first generator module to obtain a first characteristic representation of the input data; the first characteristic representation is input into the activation layer of the first generator module to obtain a second characteristic representation of the input data, and the second characteristic representation is used as the input data of the next generator module, and so on layer by layer; finally, the generated samples are obtained through the sigmoid activation layer of the generator module of the last layer.
S3, a causal check module is used to calculate the causal effect values of all event pairs of the generated samples.
S4, the generated samples and the disease labels are input into the discriminator to obtain the probability y* that the discriminator judges the generated samples to be real data of the corresponding disease.
S5, the total loss L of the generator is calculated, including the adversarial loss Lζ of the discriminator, a causal loss Lcausal and a regularization term loss Lregular.
The adversarial loss of the discriminator measures the degree to which the generated sample of the generator is judged to be true by the discriminator. The smaller the adversarial loss of the discriminator, the easier it is for the generated sample to be judged to be true. The formula for calculating the adversarial loss Lζ of the discriminator is as follows:
where yi* is the probability that the ith generated sample is judged as the real data of the corresponding disease by the discriminator.
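The adversarial term can be illustrated with a short sketch. Since the formula is not reproduced in this text, the sketch assumes the common form of a mean negative log-probability over the generated samples, which is consistent with the statement that a smaller loss means the generated samples are more easily judged to be true; the exact expression of the application may differ. The function name and the constant `eps` are assumptions.

```python
import math


def generator_adversarial_loss(y_star, eps=1e-12):
    """Mean negative log-probability that generated samples are judged real.

    y_star: discriminator outputs in [0, 1], one per generated sample.
    eps guards against log(0) and is not part of the described method.
    """
    return -sum(math.log(max(y, eps)) for y in y_star) / len(y_star)


# Generated samples that the discriminator mostly accepts as real.
loss_accepted = generator_adversarial_loss([0.9, 0.8, 0.95])

# Generated samples that the discriminator mostly rejects.
loss_rejected = generator_adversarial_loss([0.1, 0.2, 0.05])
```

Under this assumed form the loss approaches zero as every yi* approaches 1 and grows without bound as any yi* approaches 0.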
The causal loss measures how consistent the causal relationships in the generated samples of the generator are with those in the original data. The smaller the causal loss, the more consistent the internal causality of the generated samples is with the original data. Specifically, the causal loss is a KL divergence loss, corrected by the frequency of the few-shot general disease, between the causal effect values of all event pairs of the generated samples and the causal effect values of all event pairs of the original data. For a few-shot disease, the causal effect values calculated from the original data have a large variance, so the corresponding terms are given a smaller weight to improve the stability of training. The calculation formula of the causal loss is as follows:
where ATE^o_{a,r} represents a causal effect value of the first event variable a and a second event variable r of the original data, and ATE^g_{a,r} represents a causal effect value of the first event variable a and the second event variable r of the generated sample; Ar represents the first event variable set paired with the second event variable r; qr represents the frequency of a few-shot general disease r.
The calculation formula of the regularization term loss Lregular is as follows:
where ∥·∥ represents the L1 norm and w represents a parameter of the generator model.
The total loss of the generator is as follows:
S1, md patient sample {(x1, y1), (x2,y2), . . . , (xk, yk), . . . , (xm
S2, md patient samples {(x̂1, ŷ1), (x̂2, ŷ2), . . . , (x̂k, ŷk), . . . , (x̂m
S3, md noise points {ẑ1, ẑ2, . . . , ẑk, . . . , ẑm
S4, the extracted positive and negative samples and the generated samples are respectively input into the discriminator D to obtain the predicted disease label.
S5, the total loss Ld of the discriminator is calculated with the following formula:
where D(xk, yk), D(x̂k, yk) and D(dk, yk) are respectively the probabilities that the positive sample, the negative sample and the generated sample are judged as the real data of the disease yk by the discriminator D.
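The discriminator step can be sketched as follows. The sampling details and the loss formula are not fully reproduced in this text, so the sketch makes two assumptions that should be read as such: negatives are real samples whose label differs from the label of the paired positive sample, and the loss is a binary cross-entropy that pushes D(xk, yk) towards 1 and both D(x̂k, yk) and D(dk, yk) towards 0. All function names and the toy data are hypothetical.

```python
import math
import random


def sample_discriminator_batch(data, m_d, rng):
    """Draw m_d positives and, for each, a real sample with a different label.

    data: list of (features, label) pairs from the original data.
    """
    positives = [rng.choice(data) for _ in range(m_d)]
    negatives = []
    for _, y_k in positives:
        candidates = [(x, y) for x, y in data if y != y_k]
        negatives.append(rng.choice(candidates))
    return positives, negatives


def discriminator_loss(d_pos, d_neg, d_gen, eps=1e-12):
    """Assumed binary cross-entropy over positive, negative and generated samples.

    d_pos, d_neg, d_gen: discriminator outputs D(x_k, y_k), D(x_hat_k, y_k)
    and D(d_k, y_k), each a list of m_d probabilities.
    """
    total = 0.0
    for p, n, g in zip(d_pos, d_neg, d_gen):
        total += math.log(max(p, eps))        # positives should score near 1
        total += math.log(max(1.0 - n, eps))  # negatives should score near 0
        total += math.log(max(1.0 - g, eps))  # generated samples should score near 0
    return -total / len(d_pos)


# Toy original data with two disease labels.
data = [([1, 0], "cold"), ([0, 1], "gastritis"), ([1, 1], "cold")]
positives, negatives = sample_discriminator_batch(data, m_d=4, rng=random.Random(0))

loss_sharp = discriminator_loss([0.9] * 4, [0.1] * 4, [0.1] * 4)
loss_blind = discriminator_loss([0.5] * 4, [0.5] * 4, [0.5] * 4)
```

A discriminator that separates the three kinds of samples well obtains a lower loss than one that outputs 0.5 for everything.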
IV. Model Prediction Module, the Implementation Process of which is Shown in
The characteristic data and disease label data of the general patient to be trained are obtained. For diseases with insufficient training samples, the data generation model trained in the data generation module is used to generate general disease data. The training samples together with the generated general disease data are used to train the general multi-disease prediction model. The specific process is as follows:
Firstly, an event relation graph is constructed, which is specifically as follows:
each first event variable in the first event variable set constitutes a first event node in the event relation graph, and each second event variable in the second event variable set constitutes a second event node in the event relation graph; an edge is constructed for each pair of the first event variable and the second event variable for each patient, thus completing the construction of the event relation graph.
Taking the first event variable set {fever, chest tightness} and the second event variable set {acute respiratory infection} of a patient as an example, an edge is constructed between fever and acute respiratory infection, and an edge is constructed between chest tightness and acute respiratory infection.
A graph representation learning algorithm is used to generate the embedding representations of the first event node and the second event node. Based on the event relation graph, the corresponding degree matrix Φ and adjacency matrix A are constructed. A causal effect matrix Ψ is constructed by using the causal effect values of the original data. The causal effect matrix Ψ is square, and its number of rows and its number of columns both equal the number of the first event nodes plus the number of the second event nodes. The element in row α and column β of the causal effect matrix Ψ is recorded as ψα,β: if row α is a first event node and column β is a second event node, ψα,β = ATE^o_{α,β}; otherwise, ψα,β = 0.
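The construction of the adjacency matrix A, the degree matrix Φ and the causal effect matrix Ψ can be sketched as follows. The function name, the treatment of edges as undirected and the two causal effect values are assumptions made for the example; the node order places the first event nodes before the second event nodes.

```python
def build_event_graph(first_events, second_events, patients, ate_original):
    """Build A, Phi and Psi for the event relation graph.

    patients: list of (first events observed, second events observed) per patient.
    ate_original: {(first event, second event): causal effect value of original data}.
    """
    nodes = list(first_events) + list(second_events)
    index = {name: k for k, name in enumerate(nodes)}
    size = len(nodes)

    # Adjacency matrix: one edge per (first event, second event) pair of a patient.
    A = [[0] * size for _ in range(size)]
    for observed_first, observed_second in patients:
        for a in observed_first:
            for b in observed_second:
                A[index[a]][index[b]] = 1
                A[index[b]][index[a]] = 1  # undirected edge (assumption)

    # Degree matrix: diagonal matrix of node degrees.
    Phi = [[sum(A[r]) if r == c else 0 for c in range(size)] for r in range(size)]

    # Causal effect matrix: non-zero only in (first event row, second event column).
    Psi = [[0.0] * size for _ in range(size)]
    for (a, b), value in ate_original.items():
        Psi[index[a]][index[b]] = value
    return nodes, A, Phi, Psi


first = ["fever", "chest tightness"]
second = ["acute respiratory infection"]
patients = [({"fever", "chest tightness"}, {"acute respiratory infection"})]
# Hypothetical causal effect values, for illustration only.
ate = {
    ("fever", "acute respiratory infection"): 0.3,
    ("chest tightness", "acute respiratory infection"): 0.1,
}
nodes, A, Phi, Psi = build_event_graph(first, second, patients, ate)
```

For the single patient in the example, the graph has three nodes and two edges, and Ψ carries the two causal effect values in the column of the disease node.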
A general multi-disease prediction model based on a general causal graph convolutional neural network is constructed. The general causal graph convolutional neural network includes several causal graph convolutional modules, and each causal graph convolutional module includes a causal graph convolutional layer and an activation layer. The causal graph convolutional layer is a convolutional layer corrected by the causal effect matrix, and the robustness of the model is improved by adding the causal effect correction. The node embedding representations are input into the causal graph convolutional layer of a first causal graph convolutional module to obtain a first graph characteristic representation h(0):
where, H(0) represents the node embedding representation, W(0) represents the weight of the causal graph convolutional layer of the first causal graph convolutional module, which can be obtained by training; I represents an identity matrix, and * represents the multiplication of the elements of the matrix.
The first graph characteristic representation h(0) is input into the activation layer of the first causal graph convolutional module to obtain an output H(1) of the first causal graph convolutional module;
where σ(·) represents an activation function.
The output of the previous causal graph convolutional module is input into the next causal graph convolutional module until the final disease prediction result is obtained. The loss of the general causal graph convolutional neural network is calculated, and the loss function is a cross entropy loss function.
The general causal graph convolutional neural network is iteratively trained to obtain the trained general multi-disease prediction model, and the trained general multi-disease prediction model is used to predict general diseases.
For the general practice scene, the present application provides a general propensity score network suitable for calculating the general propensity scores. A causal effect calculation method is used to perform causal check on the general data generated by the generative adversarial network, so that the generated data is more in line with the real causal logic. In the training process of the generator, the same number of noise points are generated from a binomial distribution for each few-shot disease and serve together as the input of the generator. In the training process of the discriminator, positive samples are extracted from the original data, and the same number of samples with different labels are extracted as negative samples, which are used to train the discriminator together with the negative samples generated by the generator. For the few-shot general diseases, the generative adversarial network based on causal check is used to augment the general data, so as to improve the prediction performance of the general multi-disease prediction system for the few-shot diseases. A general multi-disease prediction model based on a general causal graph convolutional neural network is proposed, and the causal effect values are integrated into it to improve the prediction performance of the general multi-disease prediction system.
It should also be noted that the terms “including”, “comprising” or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or equipment including a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such process, method, commodity or equipment. Without more restrictions, an element defined by the phrase “including a” does not exclude the existence of other identical elements in the process, method, commodity or equipment including the element.
Specific embodiments of this specification have been described above. Other embodiments shall also be within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order shown or the sequential order to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The above is only the preferred embodiments of one or more embodiments of this specification, and it is not intended to limit one or more embodiments of this specification. Any modification, equivalent substitution, improvement and the like made within the spirit and principle of one or more embodiments of this specification shall be included in the scope of protection of one or more embodiments of this specification.
Number | Date | Country | Kind |
---|---|---|---|
202210547826.4 | May 2022 | CN | national |
The present application is a continuation of International Application No. PCT/CN2023/089993, filed on Apr. 23, 2023, which claims priority to Chinese Application No. 202210347826.4, filed on May 20, 2022, the contents of both of which are incorporated herein by reference in their entireties.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/CN2023/089993 | Apr 2023 | WO |
Child | 18595379 | US |