The invention relates to generating real-time alerts about a patient using an Early Warning Score (EWS) generated using vital sign information.
Increased access to Electronic Health Records (EHR) has motivated the development of data-driven systems that detect physiological derangement and secure a timely response. Commonly predicted adverse events, such as mortality, unplanned ICU admission and cardiac arrest, have been extensively investigated by EWS systems, such as the National Early Warning Score (NEWS) that is currently recommended by the Royal College of Physicians in the UK. Typically, EWS systems assign a real-time alerting score to a set of vital sign measurements based on predetermined normality thresholds to indicate the patient's degree of illness.
However, physiological data recorded in EHRs are often sparse, noisy and incomplete, especially when collected in non-critical care wards. Missingness is often dealt with through complete-case analysis, population mean imputation, or carrying the most recent value forward. Such practices may impose bias and error and do not account for the uncertainty of the imputed data.
It is an object of the invention to at least partly address one or more of the issues described above.
According to an aspect, there is provided a computer-implemented method of generating real-time alerts about a patient, comprising: receiving vital sign data representing vital sign information obtained from the patient at one or more input times within an assessment time window; using a Gaussian process model of at least a portion of the vital sign information to generate a time series of synthetic vital sign data based on the received vital sign data, the synthetic vital sign data comprising at least a posterior mean for each of one or more components of the vital sign information at each of a plurality of regularly spaced time points in the assessment time window; using the generated synthetic vital sign data as input to a trained recurrent neural network to generate an early warning score, the early warning score representing a probability of an adverse event occurring during a prediction time window of predetermined length after the assessment time window; and generating an alert about the patient dependent on the generated early warning score.
Thus, a method is provided in which Gaussian process regression is used to generate synthetic vital sign data at regularly spaced intervals, which is provided as input to a recurrent neural network (RNN). This combination of processing architectures can be implemented efficiently using relatively modest computational resources and is demonstrated to achieve a high level of performance in generating EWSs. The architecture allows long-term dependencies to be summarized efficiently. The Gaussian process regression allows computationally efficient modelling in which population-based priors can be used to set up the Gaussian process model, while the architecture as a whole achieves personalized modelling efficiently.
In an embodiment, the recurrent neural network comprises an attention mechanism.
The inventors have demonstrated that the introduction of an attention mechanism to the recurrent neural network provides a significant increase in performance. Furthermore, the attention mechanism provides the basis for improved interpretability by identifying which time points and/or which components of vital sign information are most relevant to the generated EWS.
In an embodiment, the recurrent neural network comprises a bidirectional Long Short Term Memory network.
The inventors have demonstrated that particularly high performance is achieved where the recurrent neural network is implemented as a bidirectional Long Short Term Memory (LSTM) network.
In an embodiment, the synthetic vital sign data comprises a posterior variance corresponding to each posterior mean; each posterior mean corresponding to each time point is used as input to a first recurrent neural network; each posterior variance corresponding to each time point is used as input to a second recurrent neural network; and the early warning score is generated via processing of outputs from both the first recurrent neural network and the second recurrent neural network. Furthermore, the first recurrent neural network interacts with an attention mechanism; the attention mechanism computes a respective attention weight to apply to a hidden state of the first recurrent neural network corresponding to each time point in the assessment time window; and the early warning score is generated via processing of a combination of a weighted sum of the hidden states of the first recurrent neural network weighted by the computed attention weights and an output from the second recurrent neural network.
The inventors have demonstrated that incorporating posterior variances further improves performance.
In an embodiment, the first recurrent neural network interacts with a first attention mechanism; the second recurrent neural network interacts with a second attention mechanism; the first attention mechanism computes a respective attention weight to apply to a hidden state of the first recurrent neural network corresponding to each time point in the assessment time window; the second attention mechanism computes a respective attention weight to apply to a hidden state of the second recurrent neural network corresponding to each time point in the assessment time window; and the early warning score is generated via processing of a combination of a weighted sum of the hidden states of the first recurrent neural network weighted by the computed attention weights of the first attention mechanism and a weighted sum of the hidden states of the second recurrent neural network weighted by the computed attention weights of the second attention mechanism.
The inventors have demonstrated that incorporating posterior means and variances via separate attention mechanisms further improves performance.
In an embodiment, the method further comprises receiving laboratory test data representing information obtained from one or more laboratory tests performed on the patient; receiving a diagnosis code representing a diagnosis of the patient made at a time of admission of the patient to a medical facility; using a trained model of a relationship between laboratory test data and probabilities of an adverse event occurring during the prediction time window to generate an early warning score based on the laboratory test data; using a trained model of a relationship between diagnosis codes and probabilities of an adverse event occurring during the prediction time window to generate an early warning score based on the diagnosis code; and obtaining a composite early warning score using a combination of at least the early warning score generated using the trained recurrent neural network, the early warning score based on the laboratory test data, and the early warning score based on the diagnosis code, wherein the alert is generated using the composite early warning score.
The inventors have demonstrated that the generation of alerts can be improved by such fusing of early warning scores obtained based on vital sign data, laboratory test data, and diagnosis codes.
In an embodiment, the model of the relationship between laboratory test data and probabilities of an adverse event includes a decay term to model an effect of delay between obtaining of the laboratory test data and a time at which the composite early warning score is to be obtained. The inventors have found that modelling the effect of delay in this way further improves the generation of alerts.
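The decay term is described only at this level of generality. Purely as an illustrative sketch (the exponential form, half-life, and neutral value of 0.5 are assumptions, not taken from the disclosure), delay-weighting of a laboratory-based score might look like:

```python
def lab_score_with_decay(l_l, t_delay_hours, half_life=24.0):
    # Hypothetical decay term: the laboratory-based score relaxes toward a
    # neutral 0.5 as the delay between the test and the prediction time grows
    w = 0.5 ** (t_delay_hours / half_life)
    return w * l_l + (1.0 - w) * 0.5

fresh = lab_score_with_decay(0.9, 0.0)     # recent test: full weight
stale = lab_score_with_decay(0.9, 48.0)    # two half-lives old: weight 0.25
```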
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which corresponding reference symbols indicate corresponding parts, and in which:
Methods of the present disclosure are computer-implemented. Each step of the disclosed methods may therefore be performed by a computer. The computer may comprise various combinations of computer hardware, including for example CPUs, RAM, SSDs, motherboards, network connections, firmware, software, and/or other elements known in the art that allow the computer hardware to perform the required computing operations. The required computing operations may be defined by one or more computer programs. The one or more computer programs may be provided in the form of media or data carriers, optionally non-transitory media, storing computer readable instructions. When the computer readable instructions are read by the computer, the computer performs the required method steps. The computer may consist of a self-contained unit, such as a general-purpose desktop computer, laptop, tablet, mobile telephone, smart device (e.g. smart TV), etc. Alternatively, the computer may consist of a distributed computing system having plural different computers connected to each other via a network such as the internet or an intranet.
In an embodiment, the method comprises a step S1 of providing vital sign information. This step may be performed on an ongoing basis during a patient's stay in a medical facility, such as an intensive care unit (ICU). The vital sign information may be input manually by a medical worker via a data entry system (e.g. a computer keyboard or touch screen) or the vital sign information may be provided on an automatic basis by a sensor system 12, as depicted schematically in
In step S2, vital sign data is received at a data processing apparatus 5. The vital sign data represents vital sign information obtained in an assessment time window. The assessment time window is typically a period of time ending immediately prior to when the EWS is to be generated. In some embodiments, the assessment time window is a 24 hour period. The vital sign data represents vital sign information obtained at one or more input times within the assessment time window. The vital sign information obtained at each input time may consist of a single component (e.g. a single one of the example components of vital sign information mentioned above, such as a single value representing a measured HR) or multiple different components (e.g. two or more of the example components of vital sign information mentioned above). In the schematic configuration of
In step S3, the vital sign data received in step S2 is pre-processed prior to being used as input to a trained recurrent neural network (RNN) in step S4.
An example architecture for the pre-processing is depicted in
In some embodiments, Gaussian process regression 303 is applied to continuous variables of the vital sign information (which will typically make up at least a portion of the vital sign information, such as the subset 301 of components in the example of
In some embodiments, step function modelling 304 is applied to discrete variables of the vital sign information (e.g. the subset 302 of components in the example of
The output from the Gaussian process regression 303 and the step function modelling 304 is a posterior mean and a posterior variance for each of the components of the vital sign information processed. As described in further detail below, the posterior mean may be scaled, for example so as to be in the range [−1,1], and the posterior variances may be scaled, for example so as to be in the range [0,1]. Synthetic vital sign data may then be generated at a plurality t of regularly spaced time points (e.g. t=12) to define a feature space 305 to be used as input to step S4 of
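The scaling described above is not fully specified in the disclosure. A minimal sketch using min-max style scaling, with hypothetical physiological bounds and a hypothetical maximum variance, is:

```python
import numpy as np

def scale_mean(mu, lo, hi):
    # Map posterior means into [-1, 1] via assumed physiological bounds lo/hi
    return np.clip(2.0 * (mu - lo) / (hi - lo) - 1.0, -1.0, 1.0)

def scale_variance(var, var_max):
    # Map posterior variances into [0, 1] via an assumed maximum variance
    return np.clip(var / var_max, 0.0, 1.0)

# Heart-rate posterior means/variances at t = 12 bi-hourly points (made up)
mu = np.array([80., 85., 90., 95., 100., 105., 110., 108., 104., 99., 97., 96.])
var = np.array([1.2, 0.8, 0.5, 0.4, 0.6, 1.0, 1.5, 1.1, 0.7, 0.5, 0.9, 1.4])
mu_s = scale_mean(mu, 40.0, 180.0)
var_s = scale_variance(var, 2.0)
```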
GPR generalizes multivariate Gaussian distributions to infinite dimensionality and offers a probabilistic and nonparametric approach to model a sparse vital sign time series y as a function of time from admission of a patient to a medical facility (e.g. ICU). In embodiments of the present disclosure, GPR is used to estimate missing observations y*={yi=1, . . . , yi=t} at regularly sampled time steps x*={xi=1, . . . , xi=t}, where t is the number of sampled observations (e.g. the number of time points for the synthetic vital sign data in the assessment window) and the final step xi=t is the time of observation measured in hours from admission time. In the examples discussed below, t=12 since bi-hourly sampling was performed in a 24 hour assessment window prior to xi=t.
The smoothness of the model depends on the choice of the covariance function, denoted as K. The expected value of the model is determined by the mean function m(x), which in an example implementation is defined as a constant value equal to the mean of the vital sign component across the patient population of the same age and sex. Thus,
ƒ(x)~GP(m(x),K(x,x*))
The key assumption of GPR is that y and y* are sampled from the same joint Gaussian distribution, such that
The covariance matrix in the above equation is obtained by applying the kernel to the observed and test data, with
K representing the similarity measure between all observed values,
K* representing the similarity measure between all observed and test values, and
K** representing the similarity measure between all test values.
Finally, the best estimates for y* and its variance are the mean and variance of the conditional probability p(y*|y), where
p(y*|y)˜N(K*K−1y,K**−K*K−1K*T)
In an embodiment, a radial basis function (RBF) with added white noise is adopted as covariance function, such that
where δ(x, x′) is the Kronecker delta function and Θ={l, σf, σn} is the set of hyperparameters. Since it is desired to model vital sign data of the entire patient population, log-normal distributions are applied as priors for the three hyperparameters based on clinical judgment. The model is optimized by minimizing the negative log likelihood with respect to the hyperparameters. The GPR models may be built, for example, using GPy, which is a GP framework written in Python.
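For illustration, a minimal numpy sketch of the GPR imputation step follows, using fixed kernel hyperparameters in place of the lognormal-prior optimization and GPy implementation described above; the observation values are hypothetical:

```python
import numpy as np

def rbf_kernel(x1, x2, l=2.0, sigma_f=1.0):
    # Squared-exponential (RBF) covariance between two vectors of times
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / l)**2)

def gpr_impute(x_obs, y_obs, x_star, mean=0.0, l=2.0, sigma_f=1.0, sigma_n=0.1):
    # Posterior mean and variance at regular times x_star given sparse obs,
    # following p(y*|y) ~ N(K* K^-1 y, K** - K* K^-1 K*^T) with white noise
    K = rbf_kernel(x_obs, x_obs, l, sigma_f) + sigma_n**2 * np.eye(len(x_obs))
    Ks = rbf_kernel(x_star, x_obs, l, sigma_f)    # K*: test vs observed
    Kss = rbf_kernel(x_star, x_star, l, sigma_f)  # K**: test vs test
    Kinv = np.linalg.inv(K)
    mu = mean + Ks @ Kinv @ (y_obs - mean)
    var = np.diag(Kss - Ks @ Kinv @ Ks.T)
    return mu, var

# Sparse heart-rate observations (hours from admission), imputed bi-hourly
x_obs = np.array([1.0, 5.0, 9.0, 16.0, 22.0])
y_obs = np.array([78.0, 85.0, 90.0, 88.0, 95.0])
x_star = np.arange(2.0, 26.0, 2.0)   # t = 12 regularly spaced time points
mu, var = gpr_impute(x_obs, y_obs, x_star, mean=y_obs.mean())
```

As expected, the posterior variance grows in the gaps between observations, reflecting the uncertainty of the imputed data.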
In some embodiments, components of vital sign information that are discrete variables, such as AVPU and provision of supplemental oxygen, are modelled using a piecewise step function whereby the most recent recorded value is carried forward. In the detailed examples herein, if the most recent value was unavailable, then a score of 1 (Alert) was assumed for the AVPU score and it was assumed that supplemental oxygen was not provided, so as not to affect the final score.
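The carry-forward behaviour for discrete variables can be sketched as follows (the helper function and example values are hypothetical):

```python
def step_impute(times, values, query_times, default):
    # Piecewise step function: carry the most recent recorded value forward,
    # falling back to a default (e.g. AVPU = 1 "Alert", supplemental O2 = 0)
    out = []
    for q in query_times:
        prior = [v for t, v in zip(times, values) if t <= q]
        out.append(prior[-1] if prior else default)
    return out

# AVPU recorded at hours 3 and 15 from admission, queried bi-hourly
avpu = step_impute([3, 15], [1, 2], [2, 4, 14, 16], default=1)
```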
In some embodiments, step S4 of
Because standard feed forward neural networks (FFNs) assume independence between inputs and require inputs of fixed length, recurrent neural networks (RNNs) have instead been used for various temporal prediction tasks in different levels of health care settings. Given a sequential input, an RNN produces a sequential output at each time step using the current input and the network's previous state.
In some embodiments, the trained RNN particularly comprises a Long Short Term Memory (LSTM) network. LSTM networks extend the RNN by introducing a memory cell as the hidden state, as described in general terms in, for example, Hochreiter, S., and Urgen Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8):1735-1780.
The inventors have found that a Bidirectional Recurrent Neural Network provides particular improvements. These are described in general terms in, for example, Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.
As depicted schematically in
ƒt=σ(Wƒht-1+Wƒyt+bƒ)
it=σ(Wiht-1+Wiyt+bi)
Ct=ƒt*Ct-1+it*tanh(Wcht-1+Wcyt+bc)
ht=σ(Whht-1+Whyt+bh)*tanh(Ct)
where σ is the sigmoid function, W indicates the weights of the respective feed forward neural network, and b is the bias.
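For illustration, one step of the memory-cell update can be written directly from the equations above. Note that the equations reuse a single symbol W per gate for both the recurrent and input terms; the sketch below uses separate recurrent (U) and input (W) weight matrices, as in the standard LSTM formulation, and all dimensions and random parameters are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_t, h_prev, c_prev, W, U, b):
    # One memory-cell update; W/U/b hold per-gate input weights, recurrent
    # weights and biases in the order [forget, input, cell, output]
    Wf, Wi, Wc, Wo = W
    Uf, Ui, Uc, Uo = U
    bf, bi, bc, bo = b
    f = sigmoid(Uf @ h_prev + Wf @ y_t + bf)                    # forget gate
    i = sigmoid(Ui @ h_prev + Wi @ y_t + bi)                    # input gate
    c = f * c_prev + i * np.tanh(Uc @ h_prev + Wc @ y_t + bc)   # memory cell
    o = sigmoid(Uo @ h_prev + Wo @ y_t + bo)                    # output gate
    h = o * np.tanh(c)                                          # hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 5, 8                # e.g. 5 vital-sign posterior means per step
W = [rng.normal(size=(d_h, d_in)) for _ in range(4)]
U = [rng.normal(size=(d_h, d_h)) for _ in range(4)]
b = [np.zeros(d_h) for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(12):             # unroll over 12 bi-hourly time points
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
```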
As depicted schematically in
In some embodiments, the RNN comprises an attention mechanism. An example configuration of an attention mechanism is depicted in
Due to benefits of greater interpretability and extended long-term dependencies, attention mechanisms (which may also be referred to as attention based models) have been used in various computer vision and natural language processing applications. See, for example, Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. NIPS; and Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Attention based models have not previously been used to operate on vital sign information or to provide EWSs.
As shown schematically in
where αi are the weights assigned to the hidden states, obtained by normalizing the similarity scores with a softmax function such that
αi=exp(ei)/Σj exp(ej)
and et is the similarity function
et=a(si-1,ht)
where a is considered a feed forward network. The context vector ct, output from summing node 312 is provided as input to a dense layer 314 (e.g. a fully connected neural network) which provides a mapping between the context vector ct and the output ot (e.g. an EWS at a particular time point t). Thus, in embodiments of this type an attention mechanism computes a respective attention weight to apply to a hidden state corresponding to each time point in the assessment time window, and the early warning score is generated via processing of (e.g. via a dense layer 314) a weighted sum of the hidden states weighted by the calculated attention weights (e.g. a context vector).
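A minimal numpy sketch of this attention computation follows. The disclosure only states that a is a feed forward network, so the additive (tanh) scoring form and all parameter shapes below are assumptions:

```python
import numpy as np

def attention_context(hidden_states, s_prev, va, Wa, Ua):
    # e_t = a(s_{t-1}, h_t) with an additive feed forward scorer (assumed),
    # softmax-normalized weights, and context vector c_t = sum_i alpha_i * h_i
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in hidden_states])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()          # attention weights sum to one
    c = sum(a * h for a, h in zip(alpha, hidden_states))
    return c, alpha

rng = np.random.default_rng(1)
d_h = 8
H = [rng.normal(size=d_h) for _ in range(12)]   # one hidden state per time point
c, alpha = attention_context(
    H, rng.normal(size=d_h), rng.normal(size=d_h),
    rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)))
```

The returned weights alpha indicate how relevant each of the 12 time points is to the context vector, which is the basis of the interpretability property discussed below.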
The generation of the attention weights provides an indication of how the relevance of the input data varies as a function of time. For example, time points in the assessment window having relatively high attention weights indicate a relatively high relevance of those time points to the EWS generated by the RNN. This is demonstrated in the discussion below referring to
In some embodiments, the attention weights are learned, for each component of the vital sign information, based on the posterior mean of the component, at each of the time points in the assessment time window. This is the case, for example, in the configuration of
Configurations of the type depicted in
In some embodiments, the generation of the EWS in step S4 uses the posterior variances generated by the pre-processing of step S3 in addition to the posterior means generated by the pre-processing of step S3. Thus, the mean and variance of each component of the vital sign information generated by the Gaussian process model at each time point t in the assessment window may be used as input to step S4.
Example architectures are depicted in
In the example of
Configurations of the type depicted in
In the example of
Configurations of the type depicted in
Experiments to validate embodiments were conducted on an anonymized dataset of vital sign observations recorded from adult patients. We included in our model continuous vital signs, such as heart rate (HR), respiratory rate (RR), systolic blood pressure (SBP), diastolic blood pressure (DBP), temperature (TEMP), and peripheral capillary oxygen saturation (SPO2); consciousness level (Alert, Voice, Pain & Unresponsive—AVPU score); and a variable indicating whether supplemental oxygen was provided to the patient at the time of observation. The age and sex of the patient and the timings of unplanned ICU admission, mortality, and cardiac arrest occurrences were also available.
Considering the problem as a binary classification task, an event was defined as the composite outcome of the first occurrence of unplanned ICU admission, cardiac arrest or mortality. In the case of multiple occurrences of adverse events, account was taken only of the timing of the first event, and observations recorded after an event were removed. Patient episodes were split into a labeled set of event and non-event windows. An event window was defined as an observation measurement and its preceding 24 hours of observations that is within N hours of a composite outcome. A non-event window was defined as an observation measurement and its preceding 24 hours that is not within N hours of a composite outcome. N was set to 24 hours in our study, which is a common evaluation window in the development of EWS systems. We split our dataset into 70% for a training set, 15% for a validation set and 15% for a test set. We tested our method on approximately 4,000 observation windows.
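The event/non-event labelling scheme described above can be sketched as follows (the helper function and example times, in hours, are hypothetical):

```python
def label_windows(obs_times, event_time, horizon=24):
    # Label each observation (and, implicitly, its preceding 24 h window):
    # 1 if within `horizon` hours of the first adverse event, else 0.
    # Observations recorded after the event are removed.
    labels = []
    for t in obs_times:
        if event_time is not None and t > event_time:
            continue
        is_event = event_time is not None and 0 <= event_time - t <= horizon
        labels.append((t, 1 if is_event else 0))
    return labels

# Observations at hours 10, 30, 50, 80; first adverse event at hour 60
labels = label_windows([10, 30, 50, 80], event_time=60)
```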
The following different classification approaches were compared, where Simple LSTM, LSTM-ATT, UA-LSTM-ATT-1, and UA-LSTM-ATT-2 correspond to the configurations introduced above.
Each patient admission has a set of vital sign time series data of 5 continuous variables: HR, SBP, RR, TEMP, and SPO2, and 2 discrete variables: AVPU and the provision of supplemental oxygen, recorded manually at observation times x.
GPR Modelling Lognormal priors over the hyperparameters for the vital signs were selected using a combination of a grid-based search and clinical expertise. The lognormal distributions chosen as priors for the radial basis function length scales were (μ=1.0, σ=0.1) for HR, RR, TEMP, and SPO2 and (μ=1.5, σ=0.1) for SBP and DBP. The lognormal distributions chosen as priors for the radial basis function variance were (μ=0.0, σ=0.1) for HR, SBP, DBP, and SPO2, (μ=1.5, σ=0.1) for RR, and (μ=3.5, σ=0.1) for TEMP. The lognormal distributions chosen as priors for the Gaussian noise were (μ=0.0, σ=4.0) for HR, SBP, DBP, and SPO2, (μ=0.0, σ=0.1) for RR, and (μ=1.5, σ=0.1) for TEMP. All GPR models were re-optimized for each of the first five observations, and then once every six new observations, if applicable. Applying lognormal distributions to the three hyperparameters of the GPR enabled us to efficiently model the vital signs of a heterogeneous population.
RNNs All of the RNNs used in step S4 of
Performance Evaluation We evaluated the performance using the area under receiver operating characteristics (AUROC) curve, area under precision-recall curve (AU-PR), F1 score, and sensitivity at a generic threshold of 50%, to predict the binary output of a composite outcome. All metrics were evaluated using a bootstrapping technique (number of bootstraps=100). All methods were implemented in Python and Keras.
Table 1 shows the performance results of all models on the testing set. The simple LSTM achieves a lower AUROC of 0.883 [95% CI 0.881-0.885] than the clinical benchmark NEWS, AUROC 0.888 [95% CI 0.886-0.890]. Incorporating the attention mechanism on top of a bidirectional LSTM network improves the mean AUROC from 0.883 to 0.895, and the AU-PR from 0.895 to 0.907. With regards to incorporating uncertainty, the first version of our proposed model UA-LSTM-ATT-1 achieves a comparable performance to LSTM-ATT (AUROC 0.896 [95% CI 0.894-0.898]). However, applying an attention mechanism to the variance input separately achieves the highest mean AUROC of 0.902 [95% CI 0.900-0.903] and the highest mean sensitivity of 0.795 [95% CI 0.792-0.799]. Our model also outperforms NEWS in terms of AU-PR (0.905 vs 0.890) and F1-score (0.814 vs 0.510).
To further investigate the effect of incorporating the uncertainty of the data, we visualize the attention weights learned from and applied to the mean function in the UA-LSTM-ATT-2, which achieved the highest AUROC, and the LSTM-ATT model in
We also compare the performance of LSTM (dot chain line), LSTM-ATT (broken line), and UA-LSTM-ATT-2 (solid line) for sequences of different lengths in
Based on an alerting threshold of 0.5, we applied a multinomial logistic regression to classify four classes where windows were (1) True Positive (TP) in UA-LSTM-ATT-2 and False Negative (FN) in NEWS (22.6%), (2) TP in NEWS and FN in UA-LSTM-ATT-2 (0.048%), (3) True Negative (TN) in UA-LSTM-ATT-2 and False Positive (FP) in NEWS (0.048%), and (4) TN in NEWS and FP in UA-LSTM-ATT-2 (7.5%). Diagnosis codes, grouped by official ICD-10 guidelines (ICD), were considered a significant predictor variable (p<0.05) in distinguishing Classes 1 and 4 only. With the primary objective of alerting for deteriorating patients, UA-LSTM-ATT-2 improved the alerting performance, defined as the ratio of class 1 windows to FN in NEWS, for several diagnosis groups as shown in Table 2, reaching up to 84.3% improvement for patients with diseases of the respiratory system.
Methodology of the type described above can be adapted to take account of supplementary information in addition to the vital sign information. The supplementary information may comprise a diagnosis code representing a diagnosis of the patient at a time of admission of the patient to a medical facility, for example an ICD-10 code from the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list published by the World Health Organisation. Alternatively or additionally, the supplementary information may comprise laboratory test data. Embodiments described below explain how such information can be fused with information obtained from vital sign data in order to provide an improved alert. Embodiments described below also include a variation on how the recurrent neural network can be configured to provide an early warning score. The overall model described below is referred to as iFEWS in the present disclosure.
The problem of detecting clinical deterioration may be considered as a binary classification task. For each component of vital sign information recorded for a patient, a model (e.g. iFEWS) may be provided that predicts the probability of a composite outcome (e.g. represented as an early warning score) within the next N hours. Each component of vital sign information may be considered as an event or non-event window DW=[xi,yi]i=1n, with N=24 hours for example. As will be described in further detail below, laboratory test data may also be taken into account. Laboratory test data may be represented as a vector of the most recently-measured laboratory tests DL=[xl,z] in the last k days for example. As will be described in further detail below, diagnosis codes may also be taken into account. The diagnosis codes may include a first ICD-10 diagnosis code d assigned to the patient at admission for example. In this case, d is a categorical variable. The model may then estimate the posterior probability l of being within N hours of an adverse outcome, such that l∈[0,1].
The performance of deep learning models depends on the representation of the input data. It is therefore desirable to learn an efficient representation of the explanatory features of the data, which can then be used for subsequent predictive tasks. The data available for calculating early warning scores considered in the present disclosure can be heterogeneous in nature, ranging from both dense and sparse time-series variables, such as vital signs and laboratory tests, respectively, to discrete categorical variables such as diagnosis codes. The different variables may be treated based on how and when they were collected relative to the point of prediction as will be described below. A model may then be trained by learning an efficient representation of each variable type (e.g. using an autoencoder for the vital sign information) before combining those representations for our classification task. We now describe example data pre-processing and learning techniques for each variable type (i.e. vital sign data, laboratory test data and diagnosis codes).
As described earlier, since the vital signs are irregularly sampled, a Gaussian process model may be used to generate a time series of synthetic vital sign data at each of a plurality of regularly spaced time points in an assessment time window. This may be done by first applying a patient-specific feature transformation for each window using Gaussian process regression (GPR) with a squared-exponential kernel to obtain equally sampled posterior mean and variance estimates. The squared-exponential kernel has been shown to be suitable for modelling physiological data. These posterior mean and variance estimates are concatenated for all the vital signs to obtain: Yμ=[yμ,j]j=1m and Yσ=[yσ,j]j=1m, where Yμ, Yσ∈ℝm×T and yμ,j and yσ,j are the GPR mean and variance for the jth vital sign, such that j=1, . . . , m.
As described earlier, a recurrent neural network may be used to generate an early warning score using the generated synthetic vital sign data. In the present embodiment, the recurrent neural network forms part of an autoencoder 400. An example of such a configuration is depicted schematically in
An autoencoder learns an efficient lower-dimensional representation of the (higher-dimensional) data through unsupervised learning. The basic architecture consists of an encoder 406 that learns a compact latent representation Lv from the input data 404, and a decoder 410 that reconstructs the input data 404 using the latent representation Lv (to provide reconstructed input 412). In embodiments of this type, the early warning score is generated using the latent representation from the autoencoder 400.
In an embodiment, as exemplified by
In an embodiment, each encoder channel 406 comprises an attention mechanism 408. Each attention mechanism is configured to compute a context vector. The latent representation Lv is obtained by combining the context vectors from the multiple encoder channels 406 and associated attention mechanisms 408.
As a specific example, a joint latent representation Lv of m components of vital sign information may be jointly reconstructed using a multi-channel attention-based autoencoder 400 that consists of m attention-based encoders 406 and a single decoder 410, in accordance with the architecture shown in
The context vectors of the m vital signs are concatenated to obtain the latent representation Lv:
Lv=[c1T, . . . ,cmT]
In an embodiment, the autoencoder 400 comprises a single decoder channel 410. The single decoder channel 410 may comprise plural layers. In the example shown the decoder channel 410 comprises three dense layers. The decoder channel 410 outputs a reconstructed input 412 corresponding to each of the encoder channels 406.
In an embodiment, the latent representation Lv is mapped by applying a sigmoid function to obtain the reconstructed input 412 of all vital signs ŷ:
ŷ=σ(W4g3(W3g2(W2g1(W1Lv+b1)+b2)+b3)+b4)
where W1, W2, and W3 are the weight matrices and b1, b2, and b3 are the bias vectors of the dense layers of the decoder channel 410. W4 is the weight matrix and b4 is the bias vector of the final sigmoid layer. The activation functions of the dense layers are g1, g2, and g3.
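The decoder mapping can be sketched directly in numpy; the layer dimensions, tanh activations, and random parameters below are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(Lv, Ws, bs, gs):
    # y_hat = sigmoid(W4 g3(W3 g2(W2 g1(W1 Lv + b1) + b2) + b3) + b4)
    h = Lv
    for W, b, g in zip(Ws[:3], bs[:3], gs):
        h = g(W @ h + b)                  # three dense layers g1, g2, g3
    return sigmoid(Ws[3] @ h + bs[3])     # final sigmoid layer

rng = np.random.default_rng(2)
dims = [16, 32, 64, 96, 60]   # latent size -> ... -> m*T reconstructed inputs
Ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(4)]
bs = [np.zeros(dims[i + 1]) for i in range(4)]
y_hat = decode(rng.normal(size=dims[0]), Ws, bs, [np.tanh] * 3)
```

The sigmoid output layer keeps every reconstructed input in (0, 1), consistent with the scaled vital-sign features.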
In an embodiment, the parameters of the autoencoder 400 are optimised by minimising a binary cross-entropy loss for all of the encoder channels 406 (i.e. for each of the components of vital sign information):
where m×T is the total number of input features from all of the vital-sign components.
In an embodiment, the latent representation Lv is further processed (in the block labelled σv in
lv=σ(WvLv+bv)
where Wv is the weights matrix and bv is the bias vector. This component of the iFEWS model may be denoted as MC-AE-ATT-CLv, corresponding to the multichannel autoencoder with attention (MC-AE-ATT) with subsequent (-CLv) classification of the latent representation.
Learning from Laboratory Test Data
As mentioned above, laboratory test data may be used to improve a generated early warning score. Thus, the methods described above may be adapted to additionally provide the step of receiving laboratory test data. The laboratory test data represents information obtained from one or more laboratory tests performed on the patient. In an embodiment, the laboratory test data comprises measurement results relating to one or more of the following components: Haemoglobin (HGB), the protein in red blood cells that transports oxygen to the body organs and carries carbon dioxide back to the lungs, measured by a blood test; White Blood Cells (WBC), or leukocytes, which are counted in blood tests to help detect infection that the immune system is trying to fight; Sodium (Na), measured by a blood test, an electrolyte that regulates the amount of water surrounding the cells and maintains blood pressure; Potassium (K), also an electrolyte, which is vital for regulating fluid volumes in cells and blood pH; Albumin (ALB), a protein made by the liver that prevents fluid in the bloodstream from leaking; Urea (UR), measured by urine or blood tests, the metabolic waste product of protein breakdown; Creatinine (CR), a waste product generated by the breakdown of muscle tissue that specifically indicates kidney function; Hematocrit (HCT), which measures the proportion of red blood cells in the total blood volume; Bilirubin (BIL), a yellow pigment in the blood produced by the breakdown of red blood cells, used as an indicator of anaemia, jaundice or liver disease; Troponin (TROP), a group of proteins released into the blood when the heart muscle is damaged; and C-Reactive Protein (CRP), an acute-phase protein released by the liver after tissue injury, such as sepsis or strokes, that indicates the degree of infection or inflammation.
In comparison to vital signs, laboratory tests are normally measured less frequently. In embodiments of the present disclosure, the laboratory test data may be pre-processed to yield a real-time alerting score in the same manner as provided using the vital sign data (as described above). In an exemplary approach, each of one or more of the components of vital sign information is associated with a most recently-collected set of laboratory test data D_L = [x_l, z] during the previous N×k hours, where k is the number of days, x_l is the time at which the laboratory tests were measured, and z ∈ ℝ^q is a vector of q (scalar-valued) laboratory-test measurements. The time between a vital-sign measurement and the laboratory test measurements is denoted t_v-l = x_n − x_l, where x_n is the time of prediction based on the vital-sign measurements. Physiologically implausible and missing values are replaced by the mean of the respective variable in the training set, and the features are then scaled to obtain the final feature set ẑ.
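The pre-processing step above may be sketched as follows. The variable names, example bounds and example values are assumptions for illustration only:

```python
# Minimal sketch of the laboratory-test pre-processing: implausible or missing
# values are replaced by the training-set mean of each variable, then the
# features are scaled (standardised) to give the final feature vector z_hat.
import numpy as np

def preprocess_labs(z, train_mean, train_std, plausible_lo, plausible_hi):
    """z: (q,) raw lab measurements with NaN where missing.
    Means, stds and plausibility bounds come from the training set only."""
    z = np.asarray(z, dtype=float).copy()
    bad = np.isnan(z) | (z < plausible_lo) | (z > plausible_hi)
    z[bad] = train_mean[bad]                 # mean imputation from training data
    return (z - train_mean) / train_std      # scaled feature set z_hat

# Illustrative values for two variables (e.g. urea and sodium):
train_mean = np.array([8.0, 140.0])
train_std = np.array([3.0, 4.0])
z_hat = preprocess_labs([np.nan, 152.0], train_mean, train_std,
                        np.array([0.0, 100.0]), np.array([60.0, 180.0]))
```

Here the missing first value is imputed to the training mean (and so scales to zero), while the second is standardised against the training statistics.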
A trained model of a relationship between laboratory test data and probabilities of an adverse event occurring during the prediction time window is used to generate an early warning score based on the laboratory test data. In an embodiment, the trained model comprises a logistic regression model. The use of a logistic regression model makes it possible to assess the learned coefficients assigned to each component (variable) of the laboratory test data. In the block labelled σ_l in the figures, the early warning score l_l is computed as

l_l = σ(W_l ẑ + b_l)

where W_l is the weights matrix, ẑ is the vector of processed laboratory tests, and b_l is the vector of biases. This module may be denoted with the suffix -CL_l.
A composite early warning score may be obtained using a combination of at least the early warning score l_v generated using the trained recurrent neural network (based on the vital sign data) and the early warning score l_l based on the laboratory test data. An example implementation is described in further detail below with reference to the figures.
In an embodiment, the model of the relationship between laboratory test data and probabilities of an adverse event includes a decay term to model an effect of delay between obtaining of the laboratory test data and a time at which the composite early warning score is to be obtained. This may be implemented, for example, by accounting for the time difference t_v-l between the vital-sign measurements and the laboratory-test measurements by further processing l_l using an exponential decay model (depicted as block 420 in the figures):

l̂_l = l_l exp(−λ t_v-l)

where λ is learned during training of the model. This equation adjusts the posterior probability of an outcome computed using the laboratory tests using the exponential decay model.
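The effect of the decay term may be sketched as follows, assuming the common multiplicative form l̂_l = l_l·exp(−λ·t_v-l); the values of λ and the time differences below are illustrative assumptions, since λ is learned during training:

```python
# Sketch of the exponential decay adjustment: the lab-based score is
# down-weighted as the laboratory tests grow stale relative to the
# vital-sign measurement time. lambda (lam) is learned in training.
import math

def decay_lab_score(l_l, t_vl_hours, lam):
    """l_l: lab-based score; t_vl_hours: age of the lab tests in hours."""
    return l_l * math.exp(-lam * t_vl_hours)

fresh = decay_lab_score(0.8, 0.0, 0.05)    # tests taken at prediction time
stale = decay_lab_score(0.8, 24.0, 0.05)   # tests taken a day earlier
```

With this form, tests taken at the prediction time contribute their full score, and older tests contribute progressively less.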
As validation of this approach the inventors considered two sets of laboratory tests as input variables: (1) set S consisting of 7 laboratory tests; and (2) set U consisting of 4 additional laboratory-test variables. (Set S∪U therefore contains 11 variables in total.) The results are discussed below.
Learning from Diagnosis Code Data
As mentioned above, diagnosis codes may be used to improve the generated early warning score. Thus, the methods described above may be adapted to additionally provide the step of receiving a diagnosis code (alternatively or additionally to receiving laboratory test data). In an embodiment, the diagnosis code represents information representing a diagnosis of the patient made at a time of admission of the patient to a medical facility.
In some embodiments, the diagnosis code is provided in a standard format, such as the ICD-10 format. Each diagnosis code may consist of several characters that represent a particular disease or illness. In an embodiment, diagnosis codes were grouped into 21 groups based on the high-level grouping of the ICD-10 codes. An additional group was created to represent missing or incorrect diagnosis codes that do not map to the ICD-10 dictionary. Thus, in total there were 22 possible diagnosis categories. To learn a representation of the discrete diagnosis codes, we incorporated an embedding module 422 (depicted in the figures).
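The grouping of raw codes into the 22 categories described above may be sketched as follows. Only a handful of ICD-10 chapters are shown, and the chapter indices are illustrative assumptions:

```python
# Illustrative sketch of diagnosis-code grouping: each ICD-10 code is mapped
# to a high-level chapter index, with an extra category (here 21) reserved for
# codes that are missing or do not map to the ICD-10 dictionary.
import re

# (prefix range) -> chapter index; a few of the 21 chapters, for illustration
CHAPTERS = [("A00", "B99", 0),   # infectious and parasitic diseases
            ("C00", "D48", 1),   # neoplasms
            ("I00", "I99", 8),   # diseases of the circulatory system
            ("J00", "J99", 9)]   # diseases of the respiratory system
MISSING_GROUP = 21               # 22nd category: missing/invalid codes

def diagnosis_group(code):
    if not code or not re.match(r"^[A-Z][0-9]{2}", code):
        return MISSING_GROUP
    prefix = code[:3]
    for lo, hi, group in CHAPTERS:
        if lo <= prefix <= hi:
            return group
    return MISSING_GROUP         # unmapped in this abbreviated table

groups = [diagnosis_group(c) for c in ["J18.9", "I21", None, "ZZZ"]]
```

The resulting category index is then what the embedding module consumes.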
The embedding module maps each diagnosis category to a dense vector e_d, and the early warning score l_d is computed as

l_d = σ(W_d e_d + b_d)

where W_d is the weights matrix, e_d is the learned embedding of the diagnosis code, and b_d is the bias vector. Thus, a trained model of a relationship between diagnosis codes and probabilities of an adverse event occurring during the prediction time window is used to generate an early warning score based on the diagnosis code.
A composite early warning score may be obtained using a combination of at least the early warning score l_v generated using the trained recurrent neural network (based on the vital sign data) and the early warning score l_d based on the diagnosis code. In some embodiments, a composite early warning score is obtained using a combination of the early warning score l_v generated using the trained recurrent neural network (based on the vital sign data), the early warning score l_d based on the diagnosis code, and the early warning score l_l based on the laboratory test data (optionally updated as described above to give l̂_l). An example implementation is described in further detail below with reference to the figures.
l = σ(W_o [l_v; l̂_l; l_d] + b_o)

As described above, the three different types of input are first processed with different feature-learning techniques to compute the three separate early warning scores (l_v, l̂_l, and l_d). The final output l is then computed to indicate the probability of an occurrence of a composite outcome within N hours of a vital-sign measurement.
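The output layer above may be sketched as follows; the weight values are illustrative assumptions (in the model they are learned):

```python
# Sketch of the composite output layer: the three auxiliary scores are
# concatenated and passed through one more sigmoid unit to produce the
# final composite probability l.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def composite_score(lv, ll_hat, ld, Wo, bo):
    """lv, ll_hat, ld: scalar auxiliary scores; Wo: (1, 3); bo: (1,)."""
    aux = np.array([lv, ll_hat, ld])        # [l_v; l_hat_l; l_d]
    return float(sigmoid(Wo @ aux + bo))    # probability in (0, 1)

Wo = np.array([[2.0, 1.0, 0.5]])            # illustrative learned weights
l = composite_score(0.9, 0.4, 0.1, Wo, np.array([-1.5]))
```

The learned weights W_o indicate how much each input modality contributes to the composite score, which supports the feature-saliency analysis described later.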
In comparison with data encountered in computer vision and natural language processing, clinical datasets tend to be smaller. To address this, in some embodiments the performance of the iFEWS model is improved by first pre-training its components independently and then fine-tuning their parameters as part of the larger model. In an embodiment, the model may be trained in a two-stage process. First, the MC-AE-ATT component is pre-trained independently by minimizing the binary cross-entropy loss described above. Second, the CL_l component is pre-trained independently by minimizing the binary cross-entropy loss, but with a newly defined output l̂ ∈ (0,1), which indicates the probability of an adverse event at any time in the future during the current admission.
The pre-trained weights of the MC-AE-ATT and CL_l components may then be used to initialise their corresponding weights in the iFEWS model. The classification objective of iFEWS is the binary cross-entropy loss between the true labels l̃ and the predicted labels (early warning scores) l:

CL(l̃, l) = −(1/N) Σᵢ [l̃ᵢ log lᵢ + (1 − l̃ᵢ) log(1 − lᵢ)]

where N is the number of training samples.
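The classification loss CL is the standard mean binary cross-entropy, which may be sketched as (the example labels and predictions are illustrative):

```python
# Mean binary cross-entropy between true labels l_tilde and predicted
# probabilities l over N training samples, with clipping to avoid log(0).
import numpy as np

def binary_cross_entropy(l_tilde, l, eps=1e-12):
    l = np.clip(l, eps, 1.0 - eps)
    return float(-np.mean(l_tilde * np.log(l) + (1 - l_tilde) * np.log(1 - l)))

loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```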
The final objective function of iFEWS consisted of the joint loss function:

J_T = RL(y, ŷ) + CL(l̃, l)

We included the reconstruction loss function RL of the MC-AE-ATT component, since that component contains the majority of the parameters that compute the latent representation of the vital-sign measurements. (We note that the losses RL and CL could instead be combined in the affine combination βRL + (1 − β)CL; β = 0.5 performed best empirically for our task.)
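The affine weighting mentioned above amounts to a one-line combination; the loss values below are illustrative assumptions:

```python
# Sketch of the joint objective with affine weighting:
# J_T = beta * RL + (1 - beta) * CL, with beta = 0.5 reported as best here.
def joint_loss(rl, cl, beta=0.5):
    """rl: reconstruction loss of MC-AE-ATT; cl: classification loss."""
    return beta * rl + (1.0 - beta) * cl

jt = joint_loss(0.02, 0.30)   # 0.5 * 0.02 + 0.5 * 0.30
```

With β = 0.5 this is simply half of the unweighted sum J_T above, so the minimiser is unchanged.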
To evaluate the effect of the design choices on the overall performance of the model, and to justify model complexity, we assessed several simpler variants of iFEWS. For learning the representation of the vital signs, we first developed and evaluated a single-channel autoencoder (SC-AE) that simply concatenated all the vital-sign sequences as one input. The inputs were then processed by three dense layers. In order to encode temporal information, we then designed the multichannel autoencoder (MC-AE), which processed each vital-sign sequence independently using a BiLSTM network. Since the BiLSTM network lacks interpretability, we finally incorporated an attention mechanism in each channel (MC-AE-ATT).
We also compared the iFEWS model to LDTEWS and LDTEWS:NEWS as standard clinical benchmarks. Both LDTEWS and LDTEWS:NEWS only included the 7 routinely collected laboratory tests (i.e. Hb, WCC, U, ALB, CR, Na, and K) included in set S. We further included TROP, HCT, TBIL, and CRP in set U and evaluated our deep learning models using both sets.
We evaluated the performance of our models using several metrics based on the respective task. For the autoencoders, we measured the mean squared error (MSE) to assess the reconstruction quality.
During model development and validation, we assessed the model variants and components using the AUROC and AUPRC. For our proposed iFEWS model and other classifiers, we used the AUROC, sensitivity, specificity, and PPV evaluated on the testing sets. All metrics were computed using bootstrapping with replacement with a fixed number of bootstrap resamples (nb). We compared the performance of the models across patients aged 16-45 years and >45 years, and across three outcomes (unplanned ICU admission, cardiac arrest, and mortality) independently.
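The bootstrap evaluation may be sketched as follows; the rank-based AUROC, helper names and toy data are assumptions for illustration:

```python
# Sketch of the evaluation: rank-based AUROC plus a bootstrap with
# replacement over a fixed number of resamples nb, yielding a 95% CI.
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auroc(labels, scores, nb=200, seed=0):
    rng = np.random.default_rng(seed)
    n, stats = len(labels), []
    for _ in range(nb):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        if labels[idx].min() == labels[idx].max():
            continue                              # skip single-class resamples
        stats.append(auroc(labels[idx], scores[idx]))
    return np.percentile(stats, [2.5, 97.5])      # 95% CI bounds

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.4, 0.8, 0.9])
point = auroc(labels, scores)
lo, hi = bootstrap_auroc(labels, scores)
```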
All hyperparameters of the model were optimised empirically using a balanced training and validation set, referred to as DO,1B. The regularly-spaced mean vital-sign measurements (yμ) were transformed with min-max scaling to the range [0,1]. All of the vital-sign autoencoder models were trained for 20 epochs, with early stopping by monitoring the loss on the validation set. The encoder module of the SC-AE consisted of four dense layers with 64 nodes followed by a latent-space dense layer consisting of 12 nodes. The decoder module of the SC-AE consisted of four dense layers with 64 nodes and a final sigmoid layer with 84 output nodes (corresponding to the 12 equidistant timesteps of the 7 vital signs). The encoder of the MC-AE model consisted of a BiLSTM with 5 output nodes at each timestep, and the decoder consisted of four dense layers with 64 nodes each. The classifier consisted of five dense layers and a final sigmoid layer.
To assess the predictive power of vital signs and the continued learning scheme, we trained MC-AE-ATT-CLv independently using three different training schemes. The first training scheme involved pre-training MC-AE-ATT independently, and then fixing its weights during the training of the latent space classifier -CLv. The second scheme involved joint training of MC-AE-ATT and the latent space classifier -CLv with random initialisation of weights. The third scheme, continued learning, involved pre-training the MC-AE-ATT independently followed by joint learning with the latent space classifier -CLv.
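The distinction between the three schemes may be sketched with a toy linear model; everything here (data, shapes, learning rates, the linear "encoder") is an assumption for illustration, not the architecture described above:

```python
# Toy sketch of the training schemes: a linear "encoder" is pre-trained as an
# autoencoder, then a logistic head is trained with the encoder frozen
# (scheme 1) or fine-tuned jointly from the pre-trained weights (scheme 3,
# continued learning). Scheme 2 would pass randomly initialised encoder
# weights with fine_tune=True.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_encoder(X, steps=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], X.shape[1]))
    for _ in range(steps):                   # minimise ||XW - X||^2
        W -= lr * X.T @ (X @ W - X) / len(X)
    return W

def train(X, y, W_enc, fine_tune, steps=300, lr=0.5, seed=1):
    rng = np.random.default_rng(seed)
    W, w, b = W_enc.copy(), rng.normal(scale=0.1, size=X.shape[1]), 0.0
    for _ in range(steps):
        H = X @ W
        g = (sigmoid(H @ w + b) - y) / len(y)   # BCE gradient wrt logits
        if fine_tune:                           # joint (continued) learning
            W -= lr * np.outer(X.T @ g, w)
        w -= lr * H.T @ g
        b -= lr * g.sum()
    return sigmoid((X @ W) @ w + b)

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)
W_pre = pretrain_encoder(X)
p_frozen = train(X, y, W_pre, fine_tune=False)     # scheme 1: pre-init, frozen
p_continued = train(X, y, W_pre, fine_tune=True)   # scheme 3: continued learning
```

The only difference between the schemes is how the encoder weights are initialised and whether they continue to receive gradient updates during classifier training.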
The laboratory-test measurements were standardised to zero mean and unit variance. For the models using laboratory tests, we trained and evaluated our models for the original label l̃ (i.e. whether the vital-sign measurements are within N hours of an outcome). The models were trained for 100 epochs with early stopping by monitoring the classification loss on the validation set in order to avoid overfitting.
The diagnosis-code embedding module performed best when it computed 3-dimensional vector representations. We also compared embeddings to one-hot encoding, and we found (in experiments not shown here for brevity) that the model using embeddings performed better. We did not pre-train the embedding in the continued learning training scheme, because it did not show any predictive power when learning in isolation from the components of the larger model. Weights that were not pre-initialised with the continued learning scheme were randomly initialised. All the models were optimised using the Adam optimiser and implemented using Keras (v 2.2.2) (a high-level neural networks API, www.keras.io) with a TensorFlow backend (v 1.5.0, www.tensorflow.org).
The reconstruction errors, in terms of the MSE of the vital-sign sequences, in the training set DO,1B and the testing sets DO,2 and DP are shown in Table A.
The MSE increases as the model complexity increases across all datasets. While MC-AE-ATT is the most interpretable, since it incorporates an attention mechanism, it yields the highest reconstruction error in all datasets. Additionally, DP has the highest standard deviation of errors across the three datasets. This may be because the vital-sign sequences in DP were scaled using transformations learned from an independent, foreign dataset, DO,1B. On the other hand, DO,1B and DO,2 belong to the same distribution, as they were both obtained from the same hospital source.
Table B presents the performance of the different training schemes on a validation set DO,V.
Pre-initialisation has the lowest number of trainable parameters, since it only involves training of the latent space classifier. It also achieves the lowest AUROC [95% CI 85.7-85.8] and AUPRC [95% CI 86.3-86.4] values across all schemes. Continued learning achieves the highest AUROC [95% CI 89.3-89.4] across all schemes; we therefore adopt it for training our overall model. We note that the AUPRC values are high because the validation set DO,V is balanced, as is the training set from which it was derived.
Table C summarises the performance of LDTEWS and the LR models on the validation set DO,1V using the two sets of laboratory-test variables, S and U.
LDTEWS achieves the lowest performance for both labels in terms of AUROC [95% CI 67.1-67.2] and AUPRC [95% CI 67.3-67.4]. We also observe that LR achieves the highest AUROC [95% CI 72.6-72.8] and AUPRC [95% CI 73.5-73.7] when using the laboratory-tests dataset U. This suggests that incorporating the additional variables in set U over set S improves the predictive performance of a laboratory-tests-based classifier.
Performance Evaluation of iFEWS
Table D summarises the performance results of the final models on DO,2.
iFEWS and a variant of iFEWS without attention (iFEWSMC-AE) achieved the highest AUROC values, [95% CI 90.0-90.0] and [95% CI 90.2-90.2] respectively. iFEWS also had the highest sensitivity [95% CI 77.0-77.1]. With respect to NEWS, the clinical baseline adopted in practice, the AUROC of our model is approximately 4% higher. iFEWSSC-AE achieved the lowest AUROC [95% CI 89.6-89.7] across the three autoencoder models. Despite MC-AE-ATT having the highest reconstruction error (as shown in Table A), the performance of iFEWS is comparable with that of iFEWSMC-AE. This suggests that incorporating an attention mechanism improves interpretability while maintaining model performance. All models achieved a comparable PPV.
Table E shows the performance of iFEWS on sub-populations in DO,2.
Across the younger patients, iFEWS achieved a higher AUROC than LDTEWS:NEWS, [95% CI 87.1-87.4] and [95% CI 81.5-81.9] respectively. The performance of iFEWS for patients aged 16-45 years is also superior to that of a supervised learning model, DEWS (AUROC [95% CI 81.8-82.2]), and to NEWS (AUROC [95% CI 75.7-76.2]). This represents a more than 10% increase relative to the performance of the current state-of-the-art (i.e. NEWS) for the young patient group. For the group of older patients and for unplanned ICU admission, iFEWS consistently performed better than LDTEWS:NEWS in terms of the AUROC. For mortality, iFEWS achieved a similar AUROC to LDTEWS:NEWS, [95% CI 93.6-93.7] and [95% CI 93.6-93.7] respectively; however, iFEWS had a higher sensitivity, [95% CI 85.7-85.9] compared with [95% CI 84.0-84.2].
Table F presents the performance of iFEWS across the different patient sub-populations in D.
For the overall dataset, iFEWS achieved a higher AUROC than LDTEWS:NEWS, [95% CI 89.5-89.5] and [95% CI 88.5-88.6] respectively. For the 16-45 years age group, iFEWS achieved a higher AUROC [95% CI 94.2-94.3] than LDTEWS:NEWS [95% CI 89.1-89.2]. For the older patient group and across all outcomes, iFEWS had the highest AUROC. Thus, even on a completely independent testing set, we conclude that iFEWS had discriminatory performance superior to that of the multi-modal state-of-the-art EWS.
To get a better understanding of the decision-making process of iFEWS, we examined feature saliency of the LR components. This involved investigating the weights assigned to the features after model training in the sigmoid-based layers. For example,
We also examined the weights assigned to the auxiliary outputs ({circumflex over (l)}l, ld, and lv) using the different variable types.
The performance of iFEWS in comparison to LDTEWS:NEWS, in terms of the trigger rate and the AUROC presented earlier, highlights the ability of iFEWS to ease staff burden by reducing false-positive alerts while providing superior discrimination ability.
Number | Date | Country | Kind |
---|---|---|---|
1820004.8 | Dec 2018 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2019/053437 | 12/5/2019 | WO | 00 |