Clinical trials (sometimes called “clinical studies”) are often used to assess the safety and efficacy of a drug or a medical device. In some trials, hundreds or thousands of test sites enroll thousands or tens of thousands of subjects or patients.
One metric that may be monitored during a clinical trial is the occurrence of adverse events, sometimes abbreviated “AE.” An AE typically includes any event that is experienced by a clinical trial subject during his/her participation in the trial that may have a negative impact on health or well-being, such as headache, stomachache, dry mouth, high blood pressure, fast heart rate, migraines, seizures, stroke, heart attack, etc.
A specific type of AE is a serious adverse event or “SAE.” An adverse event is considered serious if, according to the clinical trial investigator, the outcome of the event is any of the following: death, a life-threatening event, inpatient hospitalization (initial or prolonged), disability, significant incapacity to conduct normal life functions, congenital anomaly or birth defect, or other important medical events that may lead to one of the aforementioned outcomes. It is critical to be able to quickly detect and/or determine SAEs that occur during a clinical trial to prevent other clinical trial subjects from suffering from the same SAE or at least to understand when such an SAE may occur.
Where considered appropriate, reference numerals may be repeated among the drawings to indicate corresponding or analogous elements. Moreover, some of the blocks depicted in the drawings may be combined into a single function.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the invention. However, it will be understood by those of ordinary skill in the art that the embodiments of the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
A sponsor (i.e., the drug company that makes the drug being tested) is responsible for ongoing safety evaluation of investigational products. Sponsors are thus required by the FDA and other regulatory bodies to submit reports of Suspected Unexpected Serious Adverse Reactions (SUSARs) to all investigators, ethics committees, and competent authorities. The information included in such safety reports must be consistent, accurate, and complete. It has been reported that the accuracy and completeness of SAE case reports are poor, which can delay the identification of safety signals. See, e.g., S. Crépin et al., Pharmacoepidemiol. Drug Safety 2016; 25: 719-24.
Reference is now made to
In operation 125, a medical expert conducts a review of the reported seriousness of adverse events based at least in part on a data management system (operation 120) that provides subject profiles. An example of such a system is JReview®, provided by Integrated Clinical Systems, Inc. This medical review of trial data, both cumulative data and individual subject data, aims to identify potential issues that could affect either the safety of trial subjects or the progress of the trial. The medical review includes an ongoing, real-time review per subject, as well as a periodic, comprehensive review across subjects at specific time points (e.g., at the Data Monitoring Committee (DMC) meeting, which occurs prior to a final, blind review meeting) for plausibility and consistency from a medical perspective, as planned in the Medical Review Plan (MRP). The medical review supports the medical quality of the clinical data (e.g., efficacy and safety data). It is based on data from the clinical database (e.g., data sets/tables/listings) and on pharmacovigilance data (e.g., CIOMS (Council for International Organizations of Medical Sciences) and blinded SUSAR reports), in the format and with the content specified in the MRP. Operation 130 asks whether the medical expert agrees with the site investigator's determination of whether an adverse event is serious or not. If so, then in operation 150 the adverse event is reported with the site investigator's determination of seriousness. If the medical expert does not agree with the determination, the expert in operation 135 raises a query to the site investigator including evidence to support the expert's determination. In operation 140, the site investigator then may or may not update the determination of seriousness based on the expert's query. (The decision is ultimately up to the site investigator: the sponsor's medical expert can query and re-query to urge the investigator to reclassify, but cannot override the site investigator.) The resulting event is then reported in operation 150.
Another method used to determine the seriousness of adverse events is the EMA's IME list, as described above. The list identifies preferred terms (PTs) from the Medical Dictionary for Regulatory Activities (MedDRA®), developed by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), that are medically important regardless of the presence of other regulatory seriousness criteria. The list's purpose is to facilitate the classification of suspected serious adverse reactions, aggregated data analysis, and case assessment for pharmacovigilance activities. Some pharmaceutical companies use this list to identify potential serious adverse events in clinical studies. It is updated with each MedDRA version (twice a year) and is based on a Medicines and Healthcare products Regulatory Agency (MHRA) list. It is intended for guidance purposes only: there is no mandatory requirement to use it for regulatory reporting, and it may optionally be used for other purposes.
There are challenges in detecting and/or determining SAEs using these current methods. First, the current methods are highly manual and subjective processes. Second, there are different standards within and among organizations in determining what is an SAE. Third, SAE detection and/or determination suffers from systemic and human error, time-consuming and costly review, and frequent lack of quantitative evidence.
The inventors have developed a system and method to address these challenges by automatically and reliably determining SAEs. The system and method use a data-driven approach that provides quantitative evidence to investigators and sponsors during the medical review process. The system and method improve both the accuracy and efficiency of SAE reporting. Using the data-gathering platform developed by Medidata Solutions, Inc., the assignee of the present invention, the inventors identified and used patterns from more than one million adverse event records, including nearly 70,000 serious adverse events, from over 1,800 clinical trials completed since 2007. The benefits of the present invention include improving the probability of detecting and/or determining SAEs without reducing the probability of correctly classifying non-serious AEs, prioritizing (S)AEs for medical review with greater precision, and providing quantitative and objective evidence of the seriousness of adverse events.
In addition to distinguishing among types of adverse events to determine whether one is serious, the invention also addresses determining whether an adverse event that is sometimes serious and sometimes not qualifies as an SAE. Patterns begin to emerge once all the data are pooled together and the terms used to describe the adverse events are standardized. In the data reviewed by the inventors, over 3,850 distinct terms have at least one serious observation, yet the same event is not always categorized consistently as serious or non-serious. The differences in categorization may be due to other contextual features, e.g., the severity of the event, the age of the subject, the indication of the drug under investigation, etc. For example, a “headache” may be considered serious in a trial for a brain cancer treatment but perhaps not in a trial for an allergy treatment.
Reference is now made to
Part of the utility of this invention comes from standardized clinical trial database 410, which is very comprehensive and includes data from thousands of clinical trials, hundreds of sites, multiple therapeutic areas, and multiple sponsors. As the size of the database increases, the benefits of the invention are more easily realized.
The development workflow first includes clinical data standardizer 405, to which data from multiple clinical trials are input and which standardizes the data and the form and field names across the clinical trials. These data come from EDC 401a, b, c, which indicates electronic data capture from multiple clinical trials (of which three are indicated in
AE data processor 420 processes the standardized data from clinical data standardizer 405 and/or standardized clinical trial database 410, including the standardized form and field names, to generate standardized adverse event terms, and stores/pools them together across trials in standardized adverse events repository 430, along with other adverse event-level data described below. Storing adverse events using standard terms improves the input to model developer 440. One type of such processor is disclosed in U.S. patent application Ser. No. 15/443,828, filed Feb. 27, 2017, assigned to Medidata Solutions, Inc., the assignee of the present invention, and incorporated herein by reference in its entirety. That application discloses an apparatus and method for automatically mapping verbatim narratives to terms in a terminology dictionary. In contrast to some types of clinical trial data, such as blood pressure and heart rate, which may be recorded as numbers, an adverse event that occurs during a clinical trial is generally recorded as a text or verbatim narrative. The format of such narratives may differ from one recorder (e.g., subject, doctor, nurse, technician, etc.) to another and may even differ for the same recorder at different times. Such differences may be as simple as spelling differences, which may be caused by typographical errors or by the fact that some words are spelled differently in different geographic areas. One example is “diarrhea,” which may be misspelled (e.g., “diarrea”) or may be spelled differently in different countries (e.g., in the United Kingdom it is spelled “diarrhoea”). Other times, the same condition is described by its symptoms rather than by a specific name. So “diarrhea” may be described as “loose stools,” “Soft bowels,” “soft stools,” “Loose bowel movements,” etc., and each of these phrases may be capitalized, may appear in singular or plural, or may be misspelled. The processing maps each narrative to a term or terms in a terminology dictionary, such as MedDRA.
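A minimal sketch of the kind of normalization-and-lookup such a processor performs is shown below. The synonym table, function names, and normalization steps are assumptions made for illustration; they are not the method of the referenced application.

```python
import re
from typing import Optional

# Illustrative synonym table mapping normalized narratives to a MedDRA-style
# preferred term; the table, function names, and normalization steps are
# assumptions for this sketch, not the method of the referenced application.
SYNONYMS = {
    "diarrhea": "Diarrhoea",             # MedDRA uses the British spelling
    "diarrea": "Diarrhoea",              # common misspelling
    "diarrhoea": "Diarrhoea",
    "loose stools": "Diarrhoea",
    "soft bowels": "Diarrhoea",
    "soft stools": "Diarrhoea",
    "loose bowel movements": "Diarrhoea",
}

def normalize(verbatim: str) -> str:
    """Lower-case a raw AE narrative and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", verbatim.strip().lower())

def map_to_preferred_term(verbatim: str) -> Optional[str]:
    """Return the standardized dictionary term for a narrative, if known."""
    return SYNONYMS.get(normalize(verbatim))

assert map_to_preferred_term("  Loose   bowel movements ") == "Diarrhoea"
```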
Standardized adverse event data are input to model developer 440, along with other attributes such as demographic information or data 406 and trial features 409. Adverse event-level data or information (also called “features”) include the adverse event MedDRA preferred term (PT), the severity grade (on the NCI's five-level scale), the duration of the event (e.g., if the subject recovered in one day or less, which may indicate lesser seriousness), whether the adverse event is on the IME list, and the adverse event's relationship to the trial treatment or AEREL. (This is a standard field in the SDTM (Study Data Tabulation Model), which is an industry standard data model for clinical trial data. AEREL is the relationship of the adverse event to the treatment that was under investigation in that particular trial, i.e., an indication of whether the trial treatment had a causal effect on the adverse event, as reported by the clinician/investigator.)
Subject-level demographic data or information (also called “features”) 406 include age, gender, race, concurrent events (including severity and seriousness) experienced by the subject on the same day, and previous events (including severity and seriousness) experienced by the subject. Trial-level features 409, which come from a table containing curated trial-level data 407, include indication, phase, sponsor, therapeutic area, and the primary purpose of the trial reported to the NIH on clinicaltrials.gov. Other features may include hospitalization, lab values, medical history, and concomitant medications. These features are not exhaustive; others may be used in addition or instead.
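Conceptually, the three levels of features can be flattened into a single record per adverse event. A sketch of such a record follows; the field names are hypothetical, as the document does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AERecord:
    """One adverse event with its three levels of features (field names
    are hypothetical; the document does not specify a schema)."""
    # Adverse event-level features
    preferred_term: str            # MedDRA PT, e.g., "Diarrhoea"
    severity_grade: int            # NCI five-level scale
    duration_days: float           # <=1 day may indicate lesser seriousness
    on_ime_list: bool              # whether the PT appears on the IME list
    aerel: str                     # relationship to trial treatment (SDTM AEREL)
    # Subject-level demographic features
    age: Optional[int] = None
    gender: Optional[str] = None
    race: Optional[str] = None
    concurrent_events: List[str] = field(default_factory=list)  # same-day events
    previous_events: List[str] = field(default_factory=list)    # prior events
    # Trial-level features
    indication: str = ""           # e.g., "breast cancer"
    phase: str = ""                # I, II, or III
    sponsor: str = ""
    therapeutic_area: str = ""
    primary_purpose: str = ""      # as reported on clinicaltrials.gov
    # Label reported during the trial
    serious: Optional[bool] = None
```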
Model developer 440 evaluates several models to develop a multivariate, probabilistic SAE model 450 for use in determining whether an adverse event is serious. The models considered herein include a benchmark model (i.e., a rules-based algorithm using the IME list plus AE severity), logistic regression, extreme gradient boosting (“XGBoost”), and neural network models such as feed-forward and bi-directional long short-term memory (Bi-LSTM), both of which are discussed in detail below.
Reference is now made to
The details of the models considered will now be described. The first is a benchmark or baseline model. This model follows the rule that an adverse event is classified as an SAE (score=1) if the AE is on the IME list or the severity is Grade 3 (severe) or higher on the CTCAE scale; otherwise it is classified as a non-SAE (score=0). This baseline reflects industry practice to some extent, as some sponsors use the severity grade and/or the IME list to provide directional guidance in reviewing events. Although this model is called a “benchmark,” it does not generate a probability score as the other approaches do. It treats all events on the IME list or with severity grade 3 and above as having the same seriousness likelihood, which is often not a realistic assumption. In addition, this model does not consider any other factors, such as the target disease or indication of the trial. These drawbacks motivate testing a range of probabilistic modeling approaches, as described herein.
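Because the benchmark is a simple rule, it can be stated in a few lines. A sketch, assuming an `ime_list` set of MedDRA preferred terms:

```python
def benchmark_score(preferred_term: str, severity_grade: int,
                    ime_list: set) -> int:
    """Benchmark rule: SAE (1) if the term is on the IME list or the CTCAE
    severity is Grade 3 (severe) or higher; otherwise non-SAE (0)."""
    return 1 if (preferred_term in ime_list or severity_grade >= 3) else 0

# Hypothetical usage: "Myocardial infarction" is on the IME list.
print(benchmark_score("Headache", 1, {"Myocardial infarction"}))              # 0
print(benchmark_score("Myocardial infarction", 2, {"Myocardial infarction"})) # 1
```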
The next model is a machine learning model that uses logistic regression with manually engineered features. That is, the interactions between features are manually specified (i.e., input manually into the model's formula) and tested. Several such models were tested, each with a different combination of features. The features used in all of the models included the five-level severity score and the MedDRA preferred term. Later models added one or more of the indication of the trial drug, the outcome of the adverse event, the subject's age, the average seriousness of previous occurrences, and the sponsor name. Another feature that may be used (in this model and in the XGBoost model described below) is the percentage of time that an adverse event of the same type and severity grade is labeled as serious.
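A minimal sketch of one such model using scikit-learn is shown below. The toy data, column names, and the particular engineered interaction (preferred term crossed with severity grade) are illustrative assumptions, not the actual feature set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the standardized AE table; the column names and
# values are illustrative, not the actual schema.
df = pd.DataFrame({
    "preferred_term": ["Diarrhoea", "Myocardial infarction", "Headache"],
    "severity_grade": [2, 4, 1],
    "indication":     ["allergy", "diabetes", "brain cancer"],
    "serious":        [0, 1, 0],
})

# Manually engineered interaction: preferred term crossed with severity
# grade, so each (term, grade) pair gets its own one-hot column.
df["pt_x_severity"] = df["preferred_term"] + "|" + df["severity_grade"].astype(str)

features = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["preferred_term", "indication", "pt_x_severity"])],
    remainder="passthrough",   # severity_grade passes through as a number
)
model = Pipeline([("features", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(df.drop(columns="serious"), df["serious"])
print(model.predict_proba(df.drop(columns="serious"))[:, 1])  # SAE probabilities
```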
The next model, also a machine learning model, is extreme gradient boosting (“XGBoost”; see Tianqi Chen and Carlos Guestrin, “XGBoost: A scalable tree boosting system,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016), which offers benefits over logistic regression because it can automatically model the interactions between features. This model uses an ensemble classification and regression tree (CART) method, employing boosting/additive training (as in a gradient boosting machine, GBM) as well as random subsampling of samples and features (as in the random forest algorithm). The XGBoost model includes regularization on the complexity of trees but requires feature engineering, e.g., converting high-cardinality variables to numerical values (or one-hot encoding), to model the relationships between the features. As with logistic regression, several of these models were tested, using the same combinations of features: five-level severity score, MedDRA preferred term, indication of the trial drug, outcome of the adverse event, the subject's age, the average seriousness of previous occurrences, and the sponsor name.
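A minimal sketch using the xgboost package follows. The toy data and the numerical encoding chosen for the preferred term (its historical serious-label rate, mentioned above as one possible feature) are assumptions of this sketch.

```python
import numpy as np
import xgboost as xgb

# Feature engineering as described: the high-cardinality preferred term is
# converted to a numerical value, here (as one assumed choice) the historical
# fraction of occurrences of that term labeled serious.
X = np.array([
    # severity_grade, pt_serious_rate, age
    [2, 0.02, 54],
    [4, 0.91, 67],
    [1, 0.01, 33],
    [3, 0.45, 71],
])
y = np.array([0, 1, 0, 1])

clf = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=4,
    subsample=0.8,         # random subsampling of samples, as in random forests
    colsample_bytree=0.8,  # random subsampling of features
    reg_lambda=1.0,        # regularization on tree complexity
)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])   # estimated SAE probabilities
```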
The next models are neural networks (deep learning models), such as feed-forward, Bi-LSTM, and convolutional (whose performance was similar to feed-forward). The deep learning models learn the complex interactions between variables. They have a flexible architecture that can incorporate different sources of information, whereas logistic regression requires the interaction terms to be manually defined and selected. The deep learning models also learn vectorized representations of high-dimensional variables, whereas XGBoost requires high-dimensional variables to be manually transformed to numerical values. The input structure for the deep learning models represents each event or high-dimensional variable (or group of same-day events) as embedded numerical vectors, as seen in
One deep learning model is a feed-forward model, illustrated in
Block 620 indicates the number of hidden layers and that each hidden layer has a ReLU (rectified linear unit) activation function. The network makes a decision Y, in block 630, of either a 0 or a 1. One example of the architecture of a feed-forward model uses 50 embedding dimensions which, together with other numerical features, add up to an input vector dimension of 183. The input vector is connected to a hidden layer with 512 nodes, whose output is connected to a hidden layer with 256 nodes, which in turn is connected to an output layer with one node.
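In PyTorch-style code, that architecture might look as follows. This is a sketch: the document does not specify the framework, the output activation, or any training details, so the sigmoid output and batch size are assumptions.

```python
import torch
import torch.nn as nn

# The architecture described above: a 183-dimensional input vector
# (embeddings plus numerical features), two ReLU hidden layers of 512 and
# 256 nodes, and a single output node.
model = nn.Sequential(
    nn.Linear(183, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Sigmoid(),   # maps the single output node to an SAE probability score
)

x = torch.randn(8, 183)        # a batch of 8 hypothetical event vectors
probs = model(x).squeeze(1)    # one seriousness probability per event
```

In training, the sigmoid would typically be folded into a numerically stabler loss such as `BCEWithLogitsLoss`; it is shown explicitly here to match the 0-to-1 decision Y described above.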
Another deep learning model is a Bi-LSTM model. An LSTM (long short-term memory) model is a type of recurrent neural network (RNN) used in sequence modeling problems such as language processing. The model takes inputs sequentially and updates its internal representation at each step, based on both the current input and the previous representation. See Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation 9(8): 1735-1780 (1997). A Bi-LSTM model extends the standard LSTM model by being bi-directional. See Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE 2013). Such a model combines two LSTMs, one running forward in time and the other running backward, so the context window around each input includes information both before and after the current input. In the present invention, a bi-directional LSTM is used to model complicated relationships between concurrent events: the adverse events of each subject are sorted by starting date and separated into events occurring on the current (i.e., same) day (to be evaluated for seriousness likelihood) and events that occurred previously (to be used as additional model inputs).
One embodiment of a Bi-LSTM model is illustrated in
More specifically, each event input consists of dense vectors 641, 642, 643 (embeddings) (where a dense vector has continuous values rather than just 0 or 1) representing the event preferred term (PT), trial indication, and sponsor; one-hot encoded vectors representing the five-level severity grade, the event's relationship to treatment, subject race, gender, trial phase, and primary purpose; and the values of the remaining numerical features. The embeddings of the event, indication, and sponsor can either be learned together with the other parameters in the model or be pre-trained. The input also includes a value based on a subject's previous events (Hist1, Hist2, etc.) (i.e., attention on subject history). The value is computed in block 660 as a weighted sum of the seriousness labels (SAE=1, non-SAE=0) of all previous events (hAE1, hAE2, etc.) that happened to the same subject. Each weight is computed as the dot-product of the current event (concatenated event embedding and severity grade) and a previous event, and the weights are normalized across previous events so that they sum to one. This block thus learns how to use subject history without manual feature engineering.
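A sketch of that history computation follows. The tensor shapes are assumptions, and softmax is used here as one way of normalizing the dot-product weights so that they sum to one; the document does not specify the normalization.

```python
import torch

def history_value(current: torch.Tensor,
                  prev_events: torch.Tensor,
                  prev_labels: torch.Tensor) -> torch.Tensor:
    """Weighted sum of previous seriousness labels (SAE=1, non-SAE=0).

    current:     (d,)   concatenated embedding + severity of the current event
    prev_events: (n, d) representations of the subject's previous events
    prev_labels: (n,)   the reported seriousness labels of those events
    """
    weights = prev_events @ current           # dot-product similarity, shape (n,)
    weights = torch.softmax(weights, dim=0)   # normalize so the weights sum to one
    return (weights * prev_labels).sum()      # scalar history feature

value = history_value(torch.randn(64),
                      torch.randn(5, 64),     # five previous events
                      torch.tensor([1., 0., 0., 1., 0.]))
```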
One example of the architecture of a Bi-LSTM model uses 50 embedding dimensions (as in the feed-forward model) and an input vector dimension of 181. There can be multiple recurrent (LSTM or gated recurrent unit, GRU) hidden layers per direction; the embodiment of this application uses one hidden layer per direction with 256 nodes each. The concatenated forward and backward hidden layers, with a dimension of 512, are then connected to the output layer of dimension one. Compared to the feed-forward model, the Bi-LSTM model considers all same-day adverse events together (where adverse events occurring within three days of one another are treated as same-day events) and learns how to use AE history instead of relying on feature engineering.
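A sketch of this architecture, using LSTM cells (the framework, batch layout, and sigmoid output are assumptions):

```python
import torch
import torch.nn as nn

class BiLSTMSAE(nn.Module):
    """Sketch of the described architecture: 181-dimensional event inputs,
    one 256-node recurrent layer per direction, and the concatenated
    512-dimensional state mapped to a one-node output."""
    def __init__(self, input_dim: int = 181, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)   # 512 -> 1

    def forward(self, same_day_events: torch.Tensor) -> torch.Tensor:
        # same_day_events: (batch, n_events, 181); all of a subject's
        # same-day events (within three days of one another) are scored
        # together, each in the context of the others.
        states, _ = self.lstm(same_day_events)
        return torch.sigmoid(self.out(states)).squeeze(-1)

scores = BiLSTMSAE()(torch.randn(2, 3, 181))   # 2 subjects x 3 events each
```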
The blocks shown in
Operation 442 assesses the SAE model. The model architecture that is ultimately selected, as well as the hyper-parameters for the model, are assessed based on validation set performance. Model hyper-parameters may include the number of hidden layers or the number of nodes within each hidden layer. For each hyper-parameter set, a model is trained on the training set data (in operation 436), and the resulting model is used to generate an SAE probability score for each AE in the validation set. These predicted probability scores are then compared to the actual “serious” labels in the data. (The “serious” labels are those reported during the trial.) This operation in effect selects the optimal hyper-parameter set. The performance of the model is summarized for each set using two area under curve (AUC) metrics: the area under the receiver operating characteristic (ROC) curve (or AUROC) and the area under the precision-recall (PR) curve. These areas are used to compare performance and select the model with the highest AUCs.
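The selection loop can be sketched as follows, with synthetic data standing in for the training and validation sets and scikit-learn's `MLPClassifier` standing in for whichever model family is being tuned; `average_precision_score` is used here as the customary summary of the precision-recall curve. None of these substitutions come from the document.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_val, y_val = rng.normal(size=(100, 10)), rng.integers(0, 2, 100)

best = None
for hidden in [(256,), (512, 256)]:       # candidate hyper-parameter sets
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500,
                        random_state=0).fit(X_train, y_train)
    probs = clf.predict_proba(X_val)[:, 1]          # SAE score per validation AE
    auroc = roc_auc_score(y_val, probs)             # area under the ROC curve
    pr_auc = average_precision_score(y_val, probs)  # summary of the PR curve
    if best is None or (auroc, pr_auc) > best[:2]:
        best = (auroc, pr_auc, hidden)
print("selected hyper-parameters:", best[2])
```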
Both of these AUC metrics take into account a combination of true positives, false positives, false negatives, and true negatives. With respect to precision and recall, precision is the number of true positives (TP) divided by the sum of the true positives and false positives (FP):

Precision = TP / (TP + FP)
Precision measures the percentage of predicted SAEs that are actually SAEs (i.e., that are actually reported serious in the trial). Precision describes how good a model is at predicting the positive class, i.e., the percentage of the results that are relevant. (“Positive” class = class 1 = SAE; “negative” class = class 0 = nSAE or non-serious AE.) Recall is the number of true positives divided by the sum of the true positives and false negatives (FN):

Recall = TP / (TP + FN)
Recall measures the percentage of actual SAEs that were predicted to be SAEs. Recall describes the percentage of total relevant results correctly classified by the algorithm. Reviewing both precision and recall is useful in cases in which there is an imbalance in the observations between two classes, in this case a non-serious AE (nSAE) class (class 0) and an SAE class (class 1). Specifically, here there are many examples of nSAE (class 0) and only a few examples of an SAE (class 1). The large number of class 0 examples typically means that the accuracy of the model in predicting class 0 correctly, e.g., high specificity, is less important.
Key to the calculation of precision and recall is that the calculations do not make use of the true negatives. The calculations are concerned only with the correct prediction of the minority class, class 1.
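A small worked example makes this concrete; the labels are toy values.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels for an imbalanced problem: few SAEs (1), many non-SAEs (0).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# TP=2, FP=1, FN=1, so precision = recall = 2/3; the six true negatives
# never enter either calculation.
print(precision_score(y_true, y_pred))  # 0.666...
print(recall_score(y_true, y_pred))     # 0.666...
```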
The other curves represent the three tested models: a logistic regression model is indicated by 720, an XGBoost model by 730, and a Bi-LSTM model by 740. The AUC for the logistic regression model is 0.71, for the XGBoost model 0.75, and for the Bi-LSTM model 0.77. Thus, all three probabilistic models perform much better than the IME+Severity baseline (AUC=0.55), and the Bi-LSTM model has the best performance. Arrow 712 indicates that for the same precision (0.29) as the benchmark model, recall increases from 0.79 to 0.95 using the Bi-LSTM model. Similarly, arrow 713 indicates that for the same recall (0.79) as the benchmark model, precision increases from 0.29 to 0.60 using the Bi-LSTM model.
Performance of each model may also be assessed by examining the area under the ROC curve. The ROC curve plots the true positive rate vs. the false positive rate. Below is a table comparing the area under the ROC curve for various numbers of variables (features) used with the logistic regression (“log”) and XGBoost (“xgb”) models. These results came from testing on the validation set:
With six variables, the XGBoost model performed better than the logistic regression model, 0.957/0.958 to 0.950. Note that using the sponsor name rather than age as a variable made the logistic regression model perform worse, but the same variable change made the XGBoost model perform better. While adding variables generally made the models perform the same or better, it sometimes made them perform worse (e.g., adding age as a third variable to the XGBoost model increased AUC by 0.001, whereas adding age as a sixth variable to the same model decreased AUC by 0.001).
The actual ROC curves for a number of models are shown in
This table shows the substantial performance increase of the models over the benchmark model.
Another measure of performance is SAE coverage by review amount. As discussed earlier, medical reviewers can use the model-estimated SAE likelihood to prioritize their review. The following simulation analysis was performed to show the advantage of doing so. The events in the test set are ordered either by the IME+Severity grade benchmark or by the SAE likelihood estimated by the Bi-LSTM model. Then the percentage of all SAEs in the dataset that are contained in the riskiest 10%, 20%, etc. of adverse events, as ranked by each model, is computed. In other words, this metric shows that to identify x% of the SAEs in the dataset (x% is the same as recall), y% of all AEs need to be reviewed, as illustrated in the sketch below.
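A sketch of this simulation on hypothetical data follows; the scores and labels are synthetic stand-ins for the test-set events and model scores described above.

```python
import numpy as np

def sae_coverage(scores: np.ndarray, is_sae: np.ndarray, review_frac: float) -> float:
    """Fraction of all SAEs captured if only the riskiest `review_frac`
    of adverse events (ranked by model score) are reviewed."""
    n_review = int(np.ceil(review_frac * len(scores)))
    riskiest = np.argsort(scores)[::-1][:n_review]   # highest scores first
    return is_sae[riskiest].sum() / is_sae.sum()

rng = np.random.default_rng(0)
is_sae = rng.random(1000) < 0.07                  # ~7% of 1,000 AEs are serious
scores = 0.35 * is_sae + 0.65 * rng.random(1000)  # scores correlated with labels
for frac in (0.1, 0.2, 0.3):
    print(f"review {frac:.0%} of AEs -> "
          f"capture {sae_coverage(scores, is_sae, frac):.0%} of SAEs")
```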
This efficiency translates into real cost savings. For example, in a typical, multi-year, phase 3 clinical trial containing 5,000 subjects, a senior medical reviewer may review 36,000 adverse events each year, at one to two minutes per AE. If the estimated reduction in reviews is 30-50%, the reviewer may save between 150 and 500 hours per year per trial. Based on a reviewer annual salary of $200,000, the annual reviewer time savings per trial can range from $15,000 to $48,000.
Referring again to
Referring back to
Several results of using the invention are shown in
More results of the invention are shown in
The invention also allows the user to review the factors contributing to the SAE probability score.
CRF data may include the adverse event, its severity, whether there is a concurrent adverse event (and the severity of that AE), how long the AE lasted, and whether there is a relationship between the adverse event and the treatment that was under investigation in that particular trial. Demographic data for the subject may include age, subject adverse event history, gender, race, etc. Profile data for the clinical trial may include phase (I, II, or III), indication (e.g., diabetes, breast cancer, pneumonia, etc.), and purpose (e.g., treatment, prevention, diagnostic, supportive care, screening, health services research, and basic science). Sponsor data may include the name of the sponsor. Each of these factors may be associated with an increase or a decrease in the SAE probability score (expressed as a change in log-odds units). In this embodiment, for a subject who experienced moderately severe diarrhea, the diarrhea provides an increase of 0.25 units; the severity being considered “moderate” provides a decrease of 0.25 units; concurrent vomiting provides an increase of 0.10 units; the “moderate” severity of the vomiting provides a decrease of 0.08 units; the fact that the diarrhea cleared up in a day provides a decrease of 0.38 units; and the relationship between the diarrhea and the treatment under test provides a decrease of 0.18 units.
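These contributions are additive in log-odds space: summing them and applying the logistic transform to a baseline score yields the final probability. A sketch, with a hypothetical baseline (the document does not give one):

```python
import math

# Per-factor contributions, in log-odds units, from the example above.
contributions = {
    "diarrhea (event term)":           +0.25,
    "severity of diarrhea: moderate":  -0.25,
    "concurrent vomiting":             +0.10,
    "severity of vomiting: moderate":  -0.08,
    "recovered within one day":        -0.38,
    "related to treatment under test": -0.18,
}

baseline_log_odds = -2.0   # hypothetical baseline score; not from the document

log_odds = baseline_log_odds + sum(contributions.values())
probability = 1 / (1 + math.exp(-log_odds))   # inverse of the logit transform
print(f"net change: {sum(contributions.values()):+.2f} log-odds units")  # -0.54
print(f"SAE probability: {probability:.3f}")
```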
Reference is now made to
Besides the operations shown in
The results illustrated above address some of the shortcomings, stated above, of the prior art method of determining the seriousness of adverse events. First, review of adverse events using this invention takes less time because, in the example above, only 41% of the adverse events would need to be reviewed to capture 99% of all of the SAEs, whereas 96% of the adverse events in the prior scheme would need to be reviewed to capture the same percentage. Second, this scheme reduces systemic and human error because it is more objective. Third, this scheme considers many more factors than the prior scheme could consider, e.g., whether there is a concurrent AE, the severity of the concurrent AE, and whether the subject recovered from the AE and/or the concurrent AE in one day, as well as demographic and trial information, such as gender, race, age, phase, indication, and sponsor. Fourth, this scheme uses quantitative evidence, whereas in the prior scheme such evidence was often lacking.
The invention also contributes to safety. In the case of the diabetes drug Avandia, in the prior scheme, three stroke events, which the FDA considers to be SAEs, were not reported as SAEs because the subjects were not hospitalized. See https://www.medpagetoday.com/upload/2010/7/9/20100713-14-EMDAC-DSaRM-B1-01-FDA.pdf. In contrast, this inventive scheme does not rely on hospitalization to determine whether an adverse event is considered serious.
In sum, the invention includes machine learning models trained on a large amount of historical data, combining adverse event-, subject-, and trial-level information. It is able to distinguish SAEs from non-SAEs with high accuracy, as evaluated by the area under the ROC curve. The machine-learning models include logistic regression, extreme gradient boosting, and deep learning models. The advantages of the present invention over the prior methods include: (1) using a large amount of standardized AE data across sponsors reduces individual investigators' and/or sponsors' inconsistencies in identifying SAEs; (2) using a machine-learning approach to model complicated relationships across factors associated with the adverse event, the subject, and the trial improves the accuracy of SAE identification; and (3) using the model-estimated SAE likelihood to prioritize events more likely to be serious for review reduces the review workload.
Aspects of the present invention may be embodied in the form of a system, a computer program product, or a method. Similarly, aspects of the present invention may be embodied as hardware, software or a combination of both. Aspects of the present invention may be embodied as a computer program product saved on one or more computer-readable media in the form of computer-readable program code embodied thereon.
The computer-readable medium may be a computer-readable storage medium. A computer-readable storage medium may be, for example, an electronic, optical, magnetic, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
Computer program code in embodiments of the present invention may be written in any suitable programming language. The program code may execute on a single computer, or on a plurality of computers. The computer may include a processing unit in communication with a computer-usable medium, where the computer-usable medium contains a set of instructions, and where the processing unit is designed to carry out the set of instructions.
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.