The present application claims priority to Chinese Patent Application No. 202210838416.5, filed on Jul. 18, 2022, the content of which is incorporated herein by reference in its entirety.
The present application relates to the technical field of medical health information, in particular to a system for predicting end-stage renal disease complication risk based on contrastive learning.
End-stage renal disease has a long course, and many complications may occur during long-term treatment, including vascular infection, hypertension, coronary heart disease, insomnia, and depression, which seriously affect patients' quality of life. It is therefore necessary to perform risk prediction and early intervention for complications of end-stage renal disease. Over the course of long-term treatment, hospital electronic information systems accumulate large amounts of structured medical data, including multi-dimensional, multi-scale clinical features and various kinds of outcome event labels. Clinical data in real-world settings suffer from complex structure, imbalanced positive and negative samples, and few samples in some categories, so it is difficult to obtain effective prediction results by directly applying existing machine learning methods.

Contrastive learning is now widely used in many fields, and learning representations through a contrastive learning framework can improve the performance of the whole model, but it still faces problems when applied to risk prediction of end-stage renal disease complications. On the one hand, traditional contrastive learning is prone to feature collapse. A drawback of self-supervised contrastive learning is that, without the correction provided by positive and negative examples, it can easily map all inputs to the same vector, causing feature collapse. Even if label data are introduced for supervised learning, the embedding vectors, although they will not collapse completely, may still collapse along specific dimensions, so that they are effective only in a lower-dimensional subspace. On the other hand, traditional contrastive learning is oriented to image data and text data, and its data augmentation methods (such as image flipping, color changing, and scaling) are not suitable for structured medical data.
Aiming to overcome the shortcomings of the prior art and to solve problems such as the difficulty of fusing the complex data in the end-stage renal disease scenario and the imbalance of labels, the present application proposes a system for predicting end-stage renal disease complication risk based on contrastive learning, so as to provide accurate and effective decision support for clinical decision-making.
The present application aims to provide a system for predicting an end-stage renal disease complication risk based on contrastive learning, which solves the problems in the prior art that complex data in the end-stage renal disease scenario are difficult to fuse and that the labels are imbalanced.
The technical solution adopted by the present application is as follows:
Further, the end-stage renal disease data preparation module specifically includes:
Further, the structured data comprise demographic data, surgical data, medication data, chemical test data, diagnostic data and daily monitoring data.
Further, the data augmentation unit specifically includes:
Further, the complication risk prediction module specifically includes:
Further, the complication representation learning model constructing unit specifically includes:
Further, the complication representation learning model defining component specifically includes:
Further, the complication risk prediction model constructing unit specifically includes:
The present application has the beneficial effects that:
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the present application, its application or uses. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative work belong to the scope of protection of the present application.
A system for predicting end-stage renal disease complication risk based on contrastive learning includes:
The end-stage renal disease data preparation module specifically includes:
The structured data comprise demographic data, surgical data, medication data, chemical test data, diagnostic data and daily monitoring data.
The data augmentation unit specifically includes:
The complication risk prediction module specifically includes:
The complication representation learning model constructing unit specifically includes:
The complication representation learning model defining component specifically includes:
The complication risk prediction model constructing unit specifically includes:
An end-stage renal disease complication risk prediction method based on contrastive learning includes the following steps:
A data acquisition unit extracts structured data by using a hospital electronic information system and daily monitoring equipment. The structured data include demographic data, surgical data, medication data, chemical test data, diagnostic data and daily monitoring data: the demographic data include gender, age, nationality and region; the surgical data mainly include vascular access surgical information; the medication data include the dialysis plan, drugs used for complications, etc.; the chemical test data include creatinine, urea nitrogen, etc.; the diagnostic data include complications; and the daily monitoring data include blood pressure, weight, etc.
A data cleaning unit performs missing value processing, error value detection, duplicate data elimination and/or inconsistency elimination operations on the structured data to obtain static data, one-dimensional time series data and two-dimensional time series data. The data cleaning unit mainly screens out unreasonable dirty data. Taking blood pressure data as an example, blood pressure records containing special characters are first filtered out; then, records with a systolic blood pressure exceeding 250 mmHg or falling below a lower threshold are screened out.
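For illustration, the following is a minimal Python sketch of such a cleaning rule, assuming a pandas DataFrame with a hypothetical column name; the lower bound is an assumed value, since the disclosure does not specify one.

```python
import pandas as pd

def clean_blood_pressure(df: pd.DataFrame, sbp_col: str = "systolic_bp",
                         upper: float = 250.0, lower: float = 40.0) -> pd.DataFrame:
    """Screen out dirty blood-pressure records: entries containing special
    characters and physiologically implausible systolic readings."""
    # Coerce to numeric; entries with special characters become NaN and are dropped.
    sbp = pd.to_numeric(df[sbp_col], errors="coerce")
    # Keep rows within the plausible range; the lower bound is illustrative only.
    mask = sbp.notna() & (sbp <= upper) & (sbp >= lower)
    return df.loc[mask].copy()
```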
A data fusion unit splices the static data with the one-dimensional packed data obtained by performing one-dimensional convolution and two-dimensional convolution operations on the one-dimensional time series data and the two-dimensional time series data, respectively, to obtain an original fusion feature.
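A minimal PyTorch sketch of this fusion step; the channel counts, kernel sizes and pooling are illustrative choices that the disclosure does not specify.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuse static features with 1-D and 2-D time-series features by
    convolution, pooling and concatenation (shapes are illustrative)."""
    def __init__(self, ts1d_channels: int, ts2d_channels: int):
        super().__init__()
        self.conv1d = nn.Sequential(nn.Conv1d(ts1d_channels, 8, kernel_size=3),
                                    nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.conv2d = nn.Sequential(nn.Conv2d(ts2d_channels, 8, kernel_size=3),
                                    nn.ReLU(), nn.AdaptiveAvgPool2d(1))

    def forward(self, static, ts1d, ts2d):
        # static: (B, d_static), ts1d: (B, C1, T), ts2d: (B, C2, H, W)
        f1 = self.conv1d(ts1d).flatten(1)          # packed 1-D time-series features
        f2 = self.conv2d(ts2d).flatten(1)          # packed 2-D time-series features
        return torch.cat([static, f1, f2], dim=1)  # original fusion feature
```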
A data augmentation unit augments the original fusion features by a data augmentation method combining propensity score matching with SMOTE to obtain augmented structured data. The data augmentation unit is mainly used to increase the diversity of samples and to solve the problem of imbalanced positive and negative samples in the structured end-stage renal disease data.
A fusion feature component takes a patient with an end-stage renal disease complication as a positive sample and a patient without the end-stage renal disease complication as a negative sample, represents the positive and negative samples with the original fusion features, and performs a normalization operation on the original fusion features of the positive and negative samples to obtain a fusion feature. In this embodiment, patients with cardiovascular complications are used as positive samples and patients without cardiovascular complications are used as negative samples.
A 0-1 normalization operation is carried out on the positive samples and the negative samples, and the m-th dimension x_m of the fusion feature of a normalized sample x is

x_m = (x_ori_m − min(x_ori_m)) / (max(x_ori_m) − min(x_ori_m)),

where x_ori_m represents the m-th dimension of the original fusion feature, min(x_ori_m) represents the minimum value of the m-th dimension of the original fusion feature, and max(x_ori_m) represents the maximum value of the m-th dimension of the original fusion feature.
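A minimal NumPy sketch of this 0-1 normalization, applied column-wise to the matrix of original fusion features; the epsilon guarding constant columns is an added safeguard.

```python
import numpy as np

def min_max_normalize(x_ori: np.ndarray) -> np.ndarray:
    """0-1 normalize each dimension of the original fusion features.
    x_ori has shape (n_samples, m)."""
    col_min = x_ori.min(axis=0)
    col_max = x_ori.max(axis=0)
    # Small epsilon avoids division by zero for constant columns.
    return (x_ori - col_min) / (col_max - col_min + 1e-12)
```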
A propensity score component is used to select one dimension of the fusion feature arbitrarily to serve as an intervening variable, with other dimensions of the fusion feature serving as a concomitant variable set, to obtain a propensity score through loss function optimization.
Any one dimension x_v (v = 1, 2, . . . , m) of the fusion feature x is selected as the intervention variable, and the other dimensions x_−v = (x_1, . . . , x_{v−1}, x_{v+1}, . . . , x_m) are taken as the covariate set used to fit x_v, that is,

a_v(x) = β_0v + β_v · x_−v

is taken as the propensity score of the intervention variable x_v.

The parameters β_0v and β_v are optimized by the loss function

L(a_v, x_v) = Σ_{i=1}^{n} log(cosh(a_iv − x_iv)) + ∥β_v∥_1,

where ∥·∥_1 represents the L1 norm, n is the total sample size, x_iv is the v-th variable of the i-th sample, and a_iv is the propensity score of x_iv, that is, a_iv = a_v(x_i). The optimization method may be a gradient descent method such as Adam.
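A minimal PyTorch sketch of this propensity score fit, assuming the linear form given above; the number of steps, learning rate and the explicit penalty weight (set to 1.0 to match the loss as written) are illustrative.

```python
import torch

def fit_propensity_score(x: torch.Tensor, v: int, n_steps: int = 2000,
                         lam: float = 1.0, lr: float = 1e-2) -> torch.Tensor:
    """Fit a_v(x) = beta_0v + beta_v . x_{-v} with a log-cosh loss plus an
    L1 penalty on beta_v, and return the propensity scores.
    x: (n, m) fusion features; v: index of the intervention variable."""
    n, m = x.shape
    idx = [j for j in range(m) if j != v]
    x_cov, x_v = x[:, idx], x[:, v]
    beta0 = torch.zeros(1, requires_grad=True)
    beta = torch.zeros(len(idx), requires_grad=True)
    opt = torch.optim.Adam([beta0, beta], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        a = beta0 + x_cov @ beta                                  # fitted scores
        loss = torch.log(torch.cosh(a - x_v)).sum() + lam * beta.abs().sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return beta0 + x_cov @ beta                               # propensity scores a_iv
```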
A matching component makes all the positive samples constitute a positive sample universal set and all the negative samples constitute a negative sample universal set, and matches a negative sample subset in the negative sample universal set to the positive sample universal set based on the propensity score.
All the positive samples constitute a universal set of positive samples, recorded as {x_t}; all the negative samples constitute a universal set of negative samples, recorded as {x_f}. Any positive sample x_p ∈ {x_t} is selected, and the fusion feature of the positive sample x_p is expressed as (x_p1, x_p2, . . . , x_pm). If any feature b is selected as the intervention variable x_pb of the positive sample x_p, the propensity score of the positive sample x_p is a_pb = a_b(x_p), and a suitable negative sample x_q, whose fusion feature is expressed as (x_q1, x_q2, . . . , x_qm), is matched based on the propensity score, so that x_q = argmin_{x_q ∈ {x_f}} |a_qb − a_pb|, where a_qb = a_b(x_q). Based on the above matching method, the negative sample subset {x_e} ⊆ {x_f} matched with the universal set {x_t} of positive samples is selected.
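A minimal NumPy sketch of this matching step: for each positive sample it returns the index of the negative sample with the closest propensity score. Matching is done with replacement here, which is an assumption the disclosure does not address.

```python
import numpy as np

def match_negatives(a_pos: np.ndarray, a_neg: np.ndarray) -> np.ndarray:
    """For each positive propensity score a_pb, find the negative sample
    minimizing |a_qb - a_pb|. Returns indices into the negative universal set."""
    diff = np.abs(a_pos[:, None] - a_neg[None, :])   # (n_pos, n_neg) score differences
    return diff.argmin(axis=1)                       # nearest negative per positive
```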
A positive sample augmentation component is used to obtain an augmented positive sample by performing a SMOTE algorithm on the positive sample universal set, the positive sample universal set and the augmented positive sample constituting a positive sample augmented set.
u similar samples x_p1, x_p2, . . . , x_pu with the smallest Mahalanobis distance d from the positive sample x_p are selected from the universal set {x_t} of positive samples. The Mahalanobis distance between samples x_p and x_pu is d(x_p, x_pu) = √((x_p − x_pu)^T C_p^{−1} (x_p − x_pu)), where C_p is a covariance matrix, C_p = cov(x_p, x_pu). u augmented positive samples x̂_p1, x̂_p2, . . . , x̂_pu are obtained based on the SMOTE algorithm. The fusion feature of the augmented positive sample x̂_pu is expressed as (x̂_pu1, x̂_pu2, . . . , x̂_pum), where, following the standard SMOTE interpolation, x̂_pu = x_p + δ·(x_pu − x_p) and δ is a random number drawn uniformly from [0, 1].

The universal set {x_t} of positive samples and its augmented positive samples constitute the positive sample augmentation set.
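A minimal NumPy sketch of this Mahalanobis-neighbour SMOTE augmentation; for simplicity it uses a single covariance matrix estimated from the whole set rather than a per-pair covariance, which is a simplifying assumption.

```python
import numpy as np

def smote_mahalanobis(samples: np.ndarray, u: int, seed: int = 0) -> np.ndarray:
    """For every sample, pick its u nearest neighbours under the Mahalanobis
    distance and generate u synthetic samples by SMOTE interpolation.
    samples: (n, m) fusion features of one class (or matched subset)."""
    rng = np.random.default_rng(seed)
    cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))   # shared covariance
    synthetic = []
    for x in samples:
        d = samples - x
        maha = np.einsum("ij,jk,ik->i", d, cov_inv, d)        # squared Mahalanobis distances
        neighbours = samples[np.argsort(maha)[1:u + 1]]       # skip the sample itself
        for x_nb in neighbours:
            delta = rng.uniform(0.0, 1.0)
            synthetic.append(x + delta * (x_nb - x))          # SMOTE interpolation
    return np.vstack(synthetic)
```

The same routine can be applied to the positive sample universal set and to the matched negative sample subset to build the two augmentation sets.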
A negative sample augmentation component is used to obtain an augmented negative sample by performing a SMOTE algorithm on the negative sample subsets, the negative sample subsets and the augmented negative sample constituting a negative sample augmented set.
For a negative sample x_q in the negative sample subset {x_e}, u similar negative samples x_q1, x_q2, . . . , x_qu with the smallest Mahalanobis distance d from the negative sample x_q are selected from the universal set {x_f} of negative samples. The Mahalanobis distance between negative samples x_q and x_qu is d(x_q, x_qu) = √((x_q − x_qu)^T C_q^{−1} (x_q − x_qu)), where C_q is a covariance matrix, C_q = cov(x_q, x_qu). u augmented negative samples x̂_q1, x̂_q2, . . . , x̂_qu are obtained based on the SMOTE algorithm. The fusion feature of the augmented negative sample x̂_qu is expressed as (x̂_qu1, x̂_qu2, . . . , x̂_qum), where, as above, x̂_qu = x_q + δ·(x_qu − x_q) with δ drawn uniformly from [0, 1].

The negative sample subset {x_e} and its augmented negative samples constitute a negative sample augmentation set.
An augmentation component makes the positive sample augmented set and the negative sample augmented set jointly constitute the augmented structured data.
A complication representation learning model constructing unit is used to construct a complication representation learning model;
The encoder f_θ is a five-layer fully connected network with 1024, 512, 256, 128 and 64 nodes, and its activation function is ReLU. The projector h_θ is a three-layer attention network with 64, 128 and 256 nodes, and its activation function is ReLU.
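A minimal PyTorch sketch of the two networks; the projector is written here as a plain fully connected stack standing in for the attention network described above, and the input dimension and activation placement are assumptions.

```python
import torch.nn as nn

def make_encoder(input_dim: int) -> nn.Module:
    """Encoder f_theta: five fully connected layers with 1024-512-256-128-64 nodes."""
    dims = [input_dim, 1024, 512, 256, 128, 64]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers)

def make_projector(in_dim: int = 64) -> nn.Module:
    """Projector h_theta: three layers with 64, 128 and 256 nodes (plain MLP
    stand-in for the attention network)."""
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, 128), nn.ReLU(),
                         nn.Linear(128, 256), nn.ReLU())
```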
A feature normalization block is used to input the augmented structured data in pairs into the encoder f_θ to obtain the initial complication representation, obtain the contrastive representation from the initial complication representation through the projector h_θ, and obtain the normalization representation from the contrastive representation through a feature normalization operation.
The augmented structured data (X, X′) are input into the encoder f_θ in pairs, and the initial complication representation (R, R′) is obtained. A contrastive representation (Z, Z′) is obtained from the initial complication representation through the projector h_θ, and a normalization representation

Z^norm = (Z − μ_Z) / σ_Z, Z′^norm = (Z′ − μ_Z′) / σ_Z′

is obtained from the contrastive representation through the feature normalization operation F-norm, where μ_Z is the average value of the feature dimensions of the contrastive representation Z and σ_Z is the standard deviation of the feature dimensions of the contrastive representation Z.
A total loss definition block is used to construct the total loss function by using the normalization representation, a covariance item, a variance item, a category similarity measure item and an augmented similarity measure item.
In order to prevent feature collapse, a total loss function is constructed by using the covariance terms c(Z^norm) and c(Z′^norm), the variance terms v(Z^norm) and v(Z′^norm), the category similarity measure term s_C(Z^norm, Z′^norm) and the augmented similarity measure term s_A(Z^norm, Z′^norm):
L = Σ_{i=1}^{2(u+1)N} L_i

L_i = λ·s_i + μ·v_i + ν·c_i

s_i = s_C(Z_i^norm, Z′_i^norm) + s_A(Z_i^norm, Z′_i^norm)

v_i = v(Z_i^norm) + v(Z′_i^norm)

c_i = c(Z_i^norm) + c(Z′_i^norm)
Among them, the category similarity measure term s_C(Z_i^norm, Z′_i^norm) measures the category similarity of the whole batch of samples input in pairs, and the augmented similarity measure term s_A(Z_i^norm, Z′_i^norm) measures the similarity between the paired representations of a sample and its augmented counterpart. The variance term v(Z^norm) acts on the variance of Z^norm in each feature dimension. The covariance term c(Z^norm) represents the sum of the covariances of Z^norm between different dimensions; as a loss term, it makes the redundant information shared between different dimensions of Z^norm as small as possible, that is, it makes different dimensions of Z^norm as different as possible, thus reducing the occurrence of feature collapse.
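Since the exact formulas of s_C, s_A, v and c are not reproduced above, the following PyTorch sketch instantiates them in a VICReg-style form consistent with the descriptions (feature normalization, a variance hinge, squared off-diagonal covariances, distance-based similarity terms); the normalization axis, the similarity definitions and the weights λ, μ, ν are assumptions.

```python
import torch

def f_norm(z: torch.Tensor) -> torch.Tensor:
    """Feature normalization: per-dimension mean/std statistics over the batch
    (the normalization axis is an assumption)."""
    return (z - z.mean(dim=0)) / (z.std(dim=0) + 1e-6)

def covariance_term(z: torch.Tensor) -> torch.Tensor:
    """Sum of squared off-diagonal covariances between feature dimensions,
    a VICReg-style stand-in for c(Z^norm)."""
    n, d = z.shape
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / d

def variance_term(z: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Hinge on the per-dimension standard deviation, a VICReg-style stand-in
    for v(Z^norm)."""
    std = torch.sqrt(z.var(dim=0) + 1e-6)
    return torch.relu(gamma - std).mean()

def similarity_terms(z: torch.Tensor, z_aug: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Stand-in for s_C + s_A: pull each sample toward its augmented pair and
    pull same-label samples toward each other."""
    s_aug = ((z - z_aug) ** 2).sum(dim=1).mean()          # augmented similarity
    same = (labels[:, None] == labels[None, :]).float()   # same-category mask
    dist = torch.cdist(z, z)                              # pairwise distances
    s_cat = (same * dist).sum() / same.sum()              # category similarity
    return s_cat + s_aug

def total_loss(z, z_aug, labels, lam=25.0, mu=25.0, nu=1.0):
    # lam, mu, nu are illustrative weights, not values from the disclosure.
    zn, zn_aug = f_norm(z), f_norm(z_aug)
    s = similarity_terms(zn, zn_aug, labels)
    v = variance_term(zn) + variance_term(zn_aug)
    c = covariance_term(zn) + covariance_term(zn_aug)
    return lam * s + mu * v + nu * c
```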
A complication representation learning model optimizing component is used to optimize the parameters in the network structure through a gradient descent method so that the total loss function reaches convergence, thereby completing construction of the complication representation learning model.
The encoder f_θ and the projector h_θ are trained with the contrastive total loss function L. The goal (taking the prediction of cardiovascular complications as an example) is to obtain a contrastive representation of cardiovascular complications in patients with end-stage renal disease, such that representations of the same class are close, representations of different classes are far apart, representations of augmented samples are close, and representations of non-augmented samples are far apart. The optimization method may be a gradient descent method such as Adam.
A complication risk prediction model constructing unit is used to construct a complication risk prediction model;
a complication risk prediction model defining component is used to define a network structure of an end-stage renal disease complication risk prediction network, and select an activation function and a loss function of the end-stage renal disease complication risk prediction network and an optimization method;
a complication risk prediction model optimizing component is used to train the complication risk prediction network by using the optimization method, to complete constructing of the complication risk prediction model.
Firstly, a three-layer fully connected network is defined as the network structure of the end-stage renal disease complication risk prediction network g_θ, and the numbers of nodes in its three layers are 16, 4 and 1 in sequence.
ReLU is selected as the activation function of the fully connected layers of the end-stage renal disease complication risk prediction network g_θ, sigmoid as the activation function of the output layer, the cross-entropy loss function as the loss function, and Adam as the optimization method. The Adam optimization method is used to train the weight parameters of the complication risk prediction network to complete the construction of the complication risk prediction model.
When the total loss L converges, the weight parameters of the encoder fθ are frozen to train the weight parameters of the end-stage renal disease complication risk prediction network gθ.
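A minimal PyTorch sketch of the prediction network g_θ and of training it with the encoder f_θ frozen; the input dimension, number of epochs and learning rate are illustrative.

```python
import torch
import torch.nn as nn

def make_predictor(in_dim: int = 64) -> nn.Module:
    """Prediction network g_theta: three fully connected layers with 16, 4 and
    1 nodes, ReLU on the hidden layers and a sigmoid output (risk probability)."""
    return nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                         nn.Linear(16, 4), nn.ReLU(),
                         nn.Linear(4, 1), nn.Sigmoid())

def train_predictor(encoder: nn.Module, predictor: nn.Module,
                    x: torch.Tensor, y: torch.Tensor,
                    epochs: int = 100, lr: float = 1e-3) -> None:
    """Freeze the encoder weights and train only g_theta with a binary
    cross-entropy loss and the Adam optimizer."""
    for p in encoder.parameters():
        p.requires_grad_(False)                      # freeze f_theta
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    loss_fn = nn.BCELoss()                           # cross-entropy for binary labels
    for _ in range(epochs):
        opt.zero_grad()
        with torch.no_grad():
            r = encoder(x)                           # complication representation
        prob = predictor(r).squeeze(1)               # probability of the complication
        loss = loss_fn(prob, y.float())
        loss.backward()
        opt.step()
```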
A complication representation learning unit is used to perform training and learning on the augmented structured data through the complication representation learning model to obtain a complication representation; and
a risk prediction unit is used to perform end-stage renal disease complication risk prediction on the complication representation through the complication risk prediction model.
Taking the prediction of cardiovascular complications of end-stage renal disease as an example, the samples are input into the model in batches. A batch of samples contains N positive samples (with cardiovascular complications) and uN augmented positive samples, as well as matched N negative samples and uN augmented negative samples, totaling 2N(u+1) samples. The label y=1 indicates that cardiovascular complications occur, and y=0 indicates that cardiovascular complications do not occur. The output is the probability of cardiovascular complications in patients with end-stage renal disease.
In this application, the term “controller” and/or “module” may refer to, be part of, or include: an Application Specific Integrated Circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a Field Programmable Gate Array (FPGA); a processor circuit (shared, dedicated, or group) that executes code; a memory circuit (shared, dedicated, or group) that stores code executed by the processor circuit; other suitable hardware components (e.g., op amp circuit integrator as part of the heat flux data module) that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
The term memory is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are nonvolatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general-purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks, flowchart components, and other elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
What has been described above is only the preferred embodiment of the present application, and it is not used to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent substitution, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.