Performing site selection for clinical trials is a valuable step for ensuring on-time and on-target enrollment completion. Sluggish patient recruitment may disrupt clinical trial timelines and affect a clinical trial site's performance. Relying solely on historical performance has been shown to be a weak predictor of a site's future performance and of a trial's overall timeline. To deliver robust predictions of a site's enrollment, an advanced analytics platform to assist site selection and planning is needed.
As described herein, systems, non-transitory computer readable media, and methods are used to predict target variables informative of site enrollment (e.g., number of patients a site will enroll) and site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) of one or more clinical trial sites. The prediction(s) can be used to assist selection of trial sites (e.g., healthcare facility and principal investigator pairs) for planning and supporting one or more clinical trials. The systems and methods described herein involve engineering features, predicting two target variables (which may include any of enrolled patients, enrollment rate, default, and/or site agility) using machine learning models, and ranking sites, thereby improving the selection of clinical trial sites that are likely to be successful.
Various embodiments disclosed herein involve building machine learning models including features that are selected through a specific feature selection process. Namely, the feature selection process involves generating features of historical clinical trial data over time periods in relation to reference time windows and reference entities, and selecting top features for inclusion in machine learning models. In various embodiments, the systems and methods select top-performing models from among a large selection of machine learning models. Predictions from the top-performing models are visualized, e.g., in quadrant graphs to elucidate site rankings. Simulations of patient enrollment capture the stochastic fluctuations in multi-site enrollment timelines with limited assumptions, producing statistically-robust enrollment curves. The final output may be a ranked list of sites with corresponding contact information to deliver to feasibility stakeholders. This final output assists in identifying and prioritizing the best performing sites for enrollment of patients for a specific clinical trial.
Disclosed herein is an automated method for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising: obtaining input data comprising data of an upcoming trial protocol; for each of the one or more clinical trial sites: generating a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data; ranking the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites; and selecting top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
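For illustration only, the median-threshold selection described above can be sketched as follows. The site identifiers and prediction values are hypothetical; the model outputs (predicted enrollment, predicted default likelihood) are assumed to be given.

```python
# Sketch of median-threshold site selection: keep sites whose predicted
# enrollment is above the median and whose predicted default likelihood
# is below the median, then rank the survivors.
from statistics import median

sites = {
    # site_id: (predicted_enrollment, predicted_default_likelihood)
    "site_a": (14.0, 0.10),
    "site_b": (6.0, 0.45),
    "site_c": (18.0, 0.30),
    "site_d": (9.0, 0.05),
}

enroll_threshold = median(v[0] for v in sites.values())   # first threshold
default_threshold = median(v[1] for v in sites.values())  # second threshold

# Keep sites above the enrollment threshold and below the default threshold,
# then rank by enrollment (descending) and default likelihood (ascending).
selected = [
    s for s, (enr, dfl) in sites.items()
    if enr > enroll_threshold and dfl < default_threshold
]
selected.sort(key=lambda s: (-sites[s][0], sites[s][1]))
```

With median thresholds on both axes, the selected sites correspond to the favorable quadrant (high enrollment, low default) of the quadrant graph described below.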
In various embodiments, the method further comprises visualizing the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the method further comprises generating a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
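A minimal sketch of such a Monte Carlo enrollment simulation follows. The per-site monthly rates and default likelihoods are hypothetical; each run draws a default outcome per site and, for non-defaulting sites, Poisson monthly enrollment counts, and the sorted run totals yield percentile-based timeline values such as median patients enrolled at 12 months.

```python
# Monte Carlo sketch: simulate many multi-site enrollment runs and
# summarize the distribution of total patients enrolled at 12 months.
import random

random.seed(0)

# Hypothetical per-site parameters:
# (predicted monthly enrollment rate, predicted default likelihood)
sites = [(1.2, 0.10), (0.8, 0.30), (2.0, 0.05)]

def poisson_draw(lam):
    """Poisson sample via counting exponential inter-arrivals in one unit of time."""
    count, t = 0, random.expovariate(lam)
    while t < 1.0:
        count += 1
        t += random.expovariate(lam)
    return count

def simulate_run(months=12):
    """Total patients enrolled across all sites in one simulated trial."""
    total = 0
    for rate, p_default in sites:
        if random.random() < p_default:
            continue  # site defaults and enrolls no patients
        total += sum(poisson_draw(rate) for _ in range(months))
    return total

runs = sorted(simulate_run() for _ in range(2000))
median_12mo = runs[len(runs) // 2]   # median patients enrolled at 12 months
p10, p90 = runs[200], runs[1800]     # 10th/90th percentile enrollment band
```

Repeating the summary at different horizons (4, 18, 24 months, etc.) or inverting it (first run total reaching a target count) yields the other quantitative values listed above.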
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the method provides an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the method further comprises generating a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
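As one hedged sketch of the train-many, keep-the-best approach described above, several candidate model families can be fit and compared on a held-out set. This uses scikit-learn with synthetic data standing in for real clinical trial features, and `ExtraTreesClassifier` standing in for the XRT model.

```python
# Train several candidate models and retain the top performer by
# validation AUC (illustrative data; model set mirrors those listed above).
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "extra_trees": ExtraTreesClassifier(random_state=0),   # XRT analogue
    "gbm": GradientBoostingClassifier(random_state=0),
    "glm": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

best_name = max(scores, key=scores.get)  # top-performing model is retained
```

An AutoML framework can automate the same loop (including stacked ensembles and neural networks); the selection criterion here, AUC, matches the default-likelihood metric discussed below.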
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
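For illustration, the per-entity aggregates above (min, max, EWMA, mean, median) can be computed with pandas. The site labels and enrollment counts are made up; the reference entity here is a site, but grouping by investigator, country, etc. works identically.

```python
# Sketch: per-site historical enrollment aggregates over past trials.
import pandas as pd

history = pd.DataFrame({
    "site": ["s1", "s1", "s1", "s2", "s2"],
    "enrolled": [4, 9, 7, 12, 10],   # patients enrolled in each past trial
})

def ewma_last(s: pd.Series) -> float:
    # Exponentially weighted moving average; keep the most recent value.
    return s.ewm(alpha=0.5).mean().iloc[-1]

features = history.groupby("site")["enrolled"].agg(
    enrolled_min="min",
    enrolled_max="max",
    enrolled_mean="mean",
    enrolled_median="median",
    enrolled_ewma=ewma_last,
)
```

Restricting `history` to trials within the reference time window (e.g., the past 3 years) before grouping yields the windowed variants of these features.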
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, performing feature engineering on historical clinical trial data comprises converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
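As one small example of these conversions, free-text trial metadata (here, made-up study titles) can be turned into numeric vectors with TF-IDF over word n-grams; the other listed methods (Word2vec, BERT, etc.) would slot in the same way, producing one vector per text field.

```python
# Sketch: TF-IDF vectorization of trial metadata text fields.
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "phase 3 study of drug X in Crohn's disease",
    "randomized trial of drug Y in lupus",
    "phase 2 study of drug X in lupus nephritis",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
matrix = vectorizer.fit_transform(titles)         # one sparse vector per title
```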
In various embodiments, performing feature engineering on historical clinical trial data comprises applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
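A minimal sketch of that random-forest feature selection, assuming synthetic data and an illustrative 0.02 importance cutoff, looks like this:

```python
# Sketch: fit a random forest, then keep only features whose importance
# exceeds a threshold value.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

threshold = 0.02  # illustrative importance cutoff
kept = np.where(forest.feature_importances_ > threshold)[0]
X_selected = X[:, kept]  # reduced feature matrix for downstream models
```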
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
Additionally disclosed herein is a non-transitory computer readable medium for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising instructions that, when executed by a processor, cause the processor to: obtain input data comprising data of an upcoming trial protocol; for each of the one or more clinical trial sites: generate a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data; rank the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites; and select top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to visualize the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to generate a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the instructions provide an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to generate a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, the instructions that cause the processor to perform feature engineering on historical clinical trial data comprise converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
In various embodiments, the instructions that cause the processor to perform feature engineering on historical clinical trial data comprise applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
Additionally disclosed herein is a system for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising: a computer system configured to obtain input data comprising data of an upcoming trial protocol, wherein for each of the one or more clinical trial sites: the computer system generates a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data, wherein the computer system ranks the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites, wherein the computer system selects top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, and wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
In various embodiments, the system further comprises: an apparatus configured to visualize the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the computer system generates a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the system provides an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the computer system further generates a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, performing feature engineering on historical clinical trial data comprises converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
In various embodiments, performing feature engineering on historical clinical trial data comprises applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description and accompanying drawings. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. For example, a letter after a reference numeral, such as “top-performing MLM 220A,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “top-performing MLM 220,” refers to any or all of the elements in the figures bearing that reference numeral (e.g., “top-performing MLM 220” in the text refers to reference numerals “top-performing MLM 220A” and/or “top-performing MLM 220B” in the figures).
The system environment 100 may include one or more subjects 110 who were enrolled in clinical trials that provide the clinical trial data 120. In various embodiments, a subject (or patient) may comprise a human or non-human, whether in vivo, ex vivo, or in vitro, male or female, a cell, tissue, or organism. In various embodiments, the subject 110 may have met eligibility criteria for enrollment in the clinical trials. For example, the subject 110 may have been previously diagnosed with a disease indication. Thus, the subject 110 may have been enrolled in a clinical trial that provides the clinical trial data 120 that tested a therapeutic intervention for treating the disease indication. Although
The clinical trial data 120 refers to clinical trial data related to one or more clinical trial sites and/or data of an upcoming trial protocol. In various embodiments, the clinical trial data 120 are related to one or more clinical trial sites that may have previously conducted a clinical trial (e.g., such that there are clinical operations data related to the previously conducted clinical trial). For example, the clinical trial sites to which the clinical trial site data 120 relates may have previously conducted one or more clinical trials that enrolled subjects 110. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that include at least one clinical facility and/or investigator that were previously used to conduct a clinical trial (e.g., in which the subjects 110 were enrolled) or can be used for one or more prospective clinical trials. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that are located in different geographical locations. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that generate or store clinical trial site data 120 describing the prior clinical trials (e.g., in which the subjects 110 were enrolled) that were conducted at the sites. In various embodiments, the clinical trial data 120 includes clinical operations data (e.g., clinical operations data that is not related to a subject 110) from one or more clinical trial sites. In various embodiments, the clinical trial data 120 includes site level enrollment data. In various embodiments, the clinical trial data 120 includes trial level enrollment data. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that conducted clinical trials for one or more different disease indications.
Example disease indications are associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience. In particular embodiments, the disease indication is any one of multiple myeloma, prostate cancer, non-small cell lung cancer, treatment resistant depression, Crohn's disease, systemic lupus erythematosus, hidradenitis suppurativa/atopic dermatitis, diabetic kidney disease, or respiratory syncytial virus (RSV). In various embodiments, the clinical trial data 120 are data from one or more datasets related to an upcoming clinical trial. For example, the clinical trial data 120 includes data of one or more protocols for an upcoming clinical trial related to a disease indication. Thus, the clinical trial data 120 related to one or more protocols for the upcoming clinical trial can be analyzed to predict likely top-performing sites that can be enrolled in the upcoming clinical trial.
In various embodiments, the clinical trial data 120 are obtained from internal clinical trial data, such as clinical trial data stored by a party operating the site prediction system 130. In various embodiments, the clinical trial data 120 are obtained from external clinical trial data, such as clinical trial data stored by a party different from the party operating the site prediction system 130. In various embodiments, the clinical trial data 120 are obtained from a combination of internal clinical trial data and external clinical trial data. In various embodiments, the clinical trial data 120 are obtained from one or more clinical trial sites. In various embodiments, the clinical trial data 120 are obtained from a real-world database (e.g., a hospital). In various embodiments, the clinical trial data 120 are obtained from a public data set (e.g., a library).
The site prediction system 130 analyzes clinical trial data 120 and generates a site prediction 140. In particular embodiments, the site prediction system 130 generates a site prediction 140 for a specific disease indication that is to be treated in a future clinical trial, the site prediction 140 identifying the likely best performing clinical trial sites for the specific disease indication. In various embodiments, the site prediction system 130 applies one or more machine learning models and/or a stochastic model to analyze or evaluate clinical trial data 120 to generate the site prediction 140. In various embodiments, the site prediction system 130 includes or deploys one or more machine learning models that are trained using historical datasets from internal and/or external resources (e.g., industry sponsors and/or contract research organizations (CROs), etc.).
In various embodiments, the site prediction system 130 can include one or more computers, embodied as a computer system 400 as discussed below with respect to
The site prediction 140 is generated by the site prediction system 130 and includes predictions of one or more clinical trial sites based on the clinical trial data 120 for selecting sites for a prospective clinical trial. In various embodiments, the site prediction system 130 may generate a site prediction 140 for each clinical trial site. For example, if there are X possible clinical trial sites that are undergoing site selection, the site prediction system 130 may generate a site prediction 140 for each of the X clinical trial sites. In various embodiments, X is at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 50, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, at least 450, at least 500, at least 750, at least 1000, at least 1500, at least 2000, at least 2500, at least 3000, at least 3500, at least 4000, at least 4500, at least 5000, at least 5500, at least 6000, at least 6500, or at least 7000 clinical trial sites. In particular embodiments, X is at least 5000 clinical trial sites. In particular embodiments, X is at least 6000 clinical trial sites.
In various embodiments, the site prediction 140 includes a predicted site enrollment (e.g., number of patients a site will enroll) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, the site prediction 140 includes a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, the site prediction 140 includes a predicted site enrollment (e.g., number of patients a site will enroll) and a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, with regard to the predicted site default likelihood, the predetermined threshold may be less than 100 patients, less than 75 patients, less than 50 patients, less than 40 patients, less than 30 patients, less than 20 patients, less than 15 patients, less than 10 patients, less than 9 patients, less than 8 patients, less than 7 patients, less than 6 patients, less than 5 patients, less than 4 patients, less than 3 patients, or less than 2 patients.
In various embodiments, the site prediction 140 includes predicted enrollment performance related to an enrollment timeline. For example, predicted enrollment performance related to an enrollment time may include a time to enroll a specific number of patients. As another example, predicted enrollment performance related to an enrollment time may include a predicted number of patients enrolled by a certain timepoint after enrollment begins.
In various embodiments, the site prediction 140 is or includes a list of ranked sites (e.g., sites that will enroll the highest number of patients) for a prospective clinical trial. In various embodiments, the site prediction 140 is or includes at least 5 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 10 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 20 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 50 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes a list of the lowest-ranked sites (e.g., sites with the highest likelihood of enrolling zero patients or fewer patients than a predetermined threshold) for a prospective clinical trial, such that the site prediction 140 enables a recipient of the list to avoid selecting the lowest-ranked sites for the prospective clinical trial. In various embodiments, the site prediction 140 is or includes at least 5 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 10 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 20 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 50 of the lowest-ranked sites. In various embodiments, the site prediction 140 can be transmitted to stakeholders so they can select sites for inclusion. In various embodiments, the site prediction 140 can be transmitted to principal investigators at the clinical trial site and/or stakeholders so they can determine whether to run the clinical trial at their site.
In various embodiments, the one or more clinical trial sites are categorized into tiers. For example, the one or more clinical trial sites can be categorized into a first tier representing the best performing clinical trial sites, a second tier representing the next best performing clinical trial sites, and so on. In various embodiments, the one or more clinical trial sites are categorized into four tiers. In various embodiments, the top tier of clinical trial sites is selected and included in a prediction, e.g., a site prediction 140 shown in
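The tiering described above can be sketched as a simple quartile split. This is a hypothetical illustration assuming each site is summarized by a single predicted-enrollment score; the function name `assign_tiers` and the scores are not from the source.

```python
# Hypothetical sketch: split sites into four tiers by a predicted
# enrollment score, tier 1 being the best-performing quarter.
def assign_tiers(site_scores, n_tiers=4):
    """Return {site: tier}, with tier 1 holding the highest-scoring sites."""
    ranked = sorted(site_scores, key=site_scores.get, reverse=True)
    per_tier = -(-len(ranked) // n_tiers)  # ceiling division
    return {site: i // per_tier + 1 for i, site in enumerate(ranked)}

scores = {"site_a": 42, "site_b": 7, "site_c": 18, "site_d": 3}
tiers = assign_tiers(scores)
# site_a -> tier 1, site_c -> tier 2, site_b -> tier 3, site_d -> tier 4
```

With more sites than tiers, each tier simply holds an equal-sized slice of the ranked list; other tiering criteria (e.g., combining enrollment and default predictions) would slot into the same structure.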
Reference is now made to
As shown in
Generally, the data processing module 145 processes (e.g., ingests, cleans, integrates, enriches) the input data (e.g., clinical trial data 120 in
The feature engineering module 150 extracts and selects features from the data processed by the data processing module 145. In various embodiments, the feature engineering module 150 provides extracted values of selected features to the model training module 155 for developing (e.g., training, validating, etc.) machine learning models. In various embodiments, the feature engineering module 150 provides extracted values of selected features to the model deployment module 160 for selecting top-performing machine learning models and for deploying the selected top-performing machine learned models to generate a site prediction (e.g., number of patients a site will enroll) and a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites.
The model training module 155 develops (e.g., trains, validates, etc.) a plurality of machine learning models using selected features of the input data, and provides the trained machine learning models to the model deployment module 160. In various embodiments, a platform utilizes a proprietary framework or an open-source framework (e.g., H2O's AutoML framework) to automatically train and perform hyperparameter tuning on the plurality of machine learning models (e.g., generalized linear model (GLM), gradient boosting machine (GBM), XGBoost, stacked ensembles, deep learning, etc.). In various embodiments, the open-source framework may be scalable. The trained machine learning models may be locked and stored in the trained models store 180 to provide to the model deployment module 160 after the training is completed (e.g., until a quantitative improvement in the output of each model between each epoch or between each iteration of training is less than a pre-defined threshold, or until a maximum number of iterations is reached).
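The parenthetical stopping criterion above can be sketched as follows. This is a minimal illustration; `train_step` is a hypothetical callable standing in for one training pass that returns the model's current score, and the tolerance is an assumed value.

```python
# Sketch of the stopping rule: iterate until the score improvement
# between successive iterations drops below a tolerance, or until a
# maximum number of iterations is reached.
def train_until_converged(train_step, tol=1e-2, max_iters=100):
    """Call train_step() up to max_iters times; return (score, iterations)."""
    prev = train_step()
    for iteration in range(2, max_iters + 1):
        score = train_step()
        if score - prev < tol:  # quantitative improvement too small
            return score, iteration
        prev = score
    return prev, max_iters

# Hypothetical validation scores produced by successive training passes:
scores = iter([0.50, 0.70, 0.80, 0.805])
result = train_until_converged(lambda: next(scores))
# stops at the fourth pass, where improvement (0.005) < tol (0.01)
```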
In various embodiments, the model deployment module 160 selects top-performing machine learning models and deploys the top-performing machine learning models. The model deployment module 160 may select top-performing machine learning models by evaluating or assessing the generated site predictions (e.g., a predicted site enrollment, a predicted site default likelihood, etc.). In various embodiments, the model deployment module 160 selects a best-performing machine learning model for each type of site prediction, based on the best training score as well as model interpretability. For example, the model deployment module 160 selects a best-performing machine learning model for predicting site enrollment, and a best-performing machine learning model for predicting site default likelihood. In various embodiments, the selected models for the site prediction variables are the same model. In various embodiments, the selected models for the site prediction variables are different models.
The model deployment module 160 implements the trained machine learning models stored in the trained models store 180 to analyze the values of selected features of the input data to generate site predictions such as a predicted site enrollment and a predicted site default likelihood. The model deployment module 160 provides the site predictions generated from selected machine learning models to the simulation module 165.
In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next year. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next 3 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next 5 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll within a time period M. In various embodiments, M is any of 6 months, 1 year, 1.5 years, 2 years, 2.5 years, 3 years, 3.5 years, 4 years, 4.5 years, 5 years, 5.5 years, 6 years, 6.5 years, 7 years, 7.5 years, 8 years, 8.5 years, 9 years, 9.5 years, 10 years, 10.5 years, 11 years, 11.5 years, 12 years, 12.5 years, 13 years, 13.5 years, 14 years, 14.5 years, 15 years, 15.5 years, 16 years, 16.5 years, 17 years, 17.5 years, 18 years, 18.5 years, 19 years, 19.5 years, or 20 years. In various embodiments, M can be any number.
In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next year. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next 3 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next 5 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold within a time period M. In various embodiments, M is any of 6 months, 1 year, 1.5 years, 2 years, 2.5 years, 3 years, 3.5 years, 4 years, 4.5 years, 5 years, 5.5 years, 6 years, 6.5 years, 7 years, 7.5 years, 8 years, 8.5 years, 9 years, 9.5 years, 10 years, 10.5 years, 11 years, 11.5 years, 12 years, 12.5 years, 13 years, 13.5 years, 14 years, 14.5 years, 15 years, 15.5 years, 16 years, 16.5 years, 17 years, 17.5 years, 18 years, 18.5 years, 19 years, 19.5 years, or 20 years. In various embodiments, M can be any number. In various embodiments, the predetermined threshold may be less than 100 patients, less than 75 patients, less than 50 patients, less than 40 patients, less than 30 patients, less than 20 patients, less than 15 patients, less than 10 patients, less than 9 patients, less than 8 patients, less than 7 patients, less than 6 patients, less than 5 patients, less than 4 patients, less than 3 patients, or less than 2 patients.
The simulation module 165 applies a stochastic model (e.g., Monte Carlo simulation) using the site predictions generated from selected machine learning models, as input, to generate enrollment timeline prediction 245 (e.g., multi-site enrollment timelines). Example descriptions of a Monte Carlo simulation are found in Abbas I. et al.: Clinical trial optimization: Monte Carlo simulation Markov model for planning clinical trials recruitment, Contemporary Clinical Trials 28:220-231, 2007, which is hereby incorporated by reference in its entirety.
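As a rough illustration (a minimal sketch under stated assumptions, not the Markov formulation of Abbas et al.), a Monte Carlo simulation of multi-site enrollment might draw each site's monthly enrollment from a Poisson distribution whose mean is the site's predicted enrollment rate, and drop defaulting sites according to the predicted default likelihood. The rates, default probabilities, and function names below are hypothetical.

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw via Knuth's algorithm (adequate for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_time_to_target(sites, target, n_runs=500, max_months=120, seed=7):
    """sites: list of (monthly_rate, default_likelihood) pairs.
    Returns the median number of months to enroll `target` patients."""
    rng = random.Random(seed)
    months_needed = []
    for _ in range(n_runs):
        # A site that defaults contributes no patients in this run.
        active = [rate for rate, p_def in sites if rng.random() > p_def]
        total, month = 0, 0
        while total < target and month < max_months:
            month += 1
            total += sum(poisson(rng, rate) for rate in active)
        months_needed.append(month)
    months_needed.sort()
    return months_needed[len(months_needed) // 2]

# Ten sites, each expected to enroll ~1 patient/month with 10% default risk:
median_months = simulate_time_to_target([(1.0, 0.1)] * 10, target=50)
```

Collecting the full distribution of `months_needed` across runs, rather than only the median, yields the kind of statistically robust enrollment curves described herein.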
The visualization module 170 generates a visualization of the predictions generated by deploying top-performing machine learning models using the model deployment module 160 and/or by the stochastic model using the simulation module 165. In various embodiments, the visualization module 170 generates a visualization of the predicted site enrollment 225 and of the predicted site default likelihood 230 for the clinical trials sites generated by the top-performing models by the model deployment module 160. For example, the visualization module 170 may present the predicted site enrollment 225 and the predicted site default likelihood 230 in a quadrant graph. In various embodiments, the visualization module 170 generates a visualization of the enrollment timeline prediction 245 generated by the stochastic model 240 in a graph that includes statistically-robust enrollment curves. Examples of visualizations are shown in 8-19 described below in the context of specific examples. Similar visualizations may be generated in relation to other executions of the site prediction system 130.
The input data store 175 stores clinical trial data (e.g., clinical trial data 120 in
The trained models store 180 stores trained machine learning models (e.g., GLM, GBM, XGBoost, stacked ensembles, deep learning, etc.) for selection and implementation in the deployment phase.
The output data store 185 stores the site predictions (e.g., site predictions 140 in
In various embodiments, the components of the site prediction system 130 are applied during one of the training phase and the deployment phase. For example, the model training module 155 is applied during the training phase to train a model. Additionally, the model deployment module 160 is applied during the deployment phase. In various embodiments, the components of the site prediction system 130 can be performed by different parties depending on whether the components are applied during the training phase or the deployment phase. In such scenarios, the training and deployment of the prediction model are performed by different parties. For example, the model training module 155 and training data applied during the training phase can be employed by a first party (e.g., to train a model) and the model deployment module 160 applied during the deployment phase can be performed by a second party (e.g., to deploy the model). Training models and deploying models are described in further detail below.
Embodiments described herein include methods for generating a site prediction for one or more clinical trial sites by applying one or more trained models to analyze selected features of the input data related to the one or more clinical trial sites. Such methods can be performed by the site prediction system 130 described in
As shown in
As shown in
In various embodiments, the selected features may include features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics. In particular embodiments, the selected features include state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and/or facility address. In particular embodiments, the selected features associated with historical site enrollment metrics include statistical measures, such as any of a minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median values. In particular embodiments, the selected features associated with historical site enrollment metrics includes at least 1, 3, 5, or 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients consented for a trial, number of patients completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity. In particular embodiments, the reference time is at least 1 year, at least 3 years, at least 5 years, or at least 10 years. In various embodiments, the reference time can be any time period. In particular embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
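The historical-enrollment statistics named above (minimum, maximum, median, EWMA) can be computed for a single site as in this sketch; the enrollment history and the smoothing factor are illustrative assumptions.

```python
import statistics

def ewma(values, alpha=0.5):
    """Exponentially weighted moving average of a chronologically
    ordered sequence (oldest first); alpha is the smoothing factor."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def site_history_features(values):
    """Summary statistics of a site's per-trial enrollment counts
    over a reference time window."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "ewma": ewma(values),
    }

# Enrollment counts for a site's last four trials in the window:
features = site_history_features([12, 8, 15, 20])
# {'min': 8, 'max': 20, 'median': 13.5, 'ewma': 16.25}
```

The same statistics would be computed per reference entity (site, investigator, country, etc.) and per metric (enrolled, consented, completed, screen failures, agility).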
As shown in
In various embodiments, the top-performing MLMs 220A and 220B were previously trained using training data, as is described in further detail herein. In various embodiments, the training data can be historical trial data. In various embodiments, the top-performing MLMs 220A and 220B were previously determined by training a plurality of machine learning models, and by selecting the top-performing model of the trained machine learning models to predict site enrollment (e.g., number of patients a site will enroll) and/or site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for a specific disease indication. For example, the top-performing MLM 220A may be the best-performing MLM for generating predicted site enrollment 225, and the top-performing MLM 220B may be the best-performing MLM for generating site default likelihood 230. In various embodiments, the top-performing MLMs 220A and 220B are constructed as a single model. For example, MLMs 220A and 220B are constructed as a single model, which outputs predicted site enrollment 225 and predicted site default likelihood 230. In various embodiments, the top-performing MLMs 220A and 220B are separate models. In various embodiments, the top-performing MLMs 220A or 220B are independently any one of a random forest model, an extremely randomized trees (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network). In various embodiments, MLM 220A is a regression model that predicts a continuous value representing the predicted site enrollment 225. In various embodiments, MLM 220B is a classifier that predicts a classification representing the predicted site default likelihood 230 (e.g., default or no default).
In various embodiments, the predicted site enrollment 225 represents a “number enrolled” variable, and the predicted site default likelihood 230 represents a “site default” variable. In various embodiments, a “site default” variable that is equal to zero refers to a site that enrolled more than one patient, and thus the site has not defaulted. In various embodiments, a “site default” variable that is equal to 1 refers to a site that enrolled zero or one patient, and thus the site has defaulted. In various embodiments, the “number enrolled” variable refers to number of patients enrolled at a site.
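Following the definitions above, the two target variables can be derived from a site's historical enrollment count as in this small sketch (the field names are illustrative, not the source schema):

```python
def make_targets(patients_enrolled):
    """Encode the "number enrolled" and "site default" target variables:
    a site defaults when it enrolled zero or one patient."""
    return {
        "number_enrolled": patients_enrolled,
        "site_default": 1 if patients_enrolled <= 1 else 0,
    }

make_targets(0)   # {'number_enrolled': 0, 'site_default': 1}
make_targets(14)  # {'number_enrolled': 14, 'site_default': 0}
```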
In various embodiments, the predicted site enrollment 225 includes enrollment rate (e.g., number of patients per site per month/year) and/or agility (e.g., time required for a site to start up and begin recruitment).
In various embodiments, the predicted site enrollment 225 and predicted site default likelihood 230 are validated by using one or more of the historical clinical trial data and/or prospective clinical trial data.
The predicted site enrollment 225 and predicted site default likelihood 230 can be used to generate predicted site rankings 235. In various embodiments, the predicted site enrollment 225 and predicted site default likelihood 230 are compared to one or more threshold values to generate predicted site rankings 235. For example, the predicted site enrollment 225 for a site can be compared to a first threshold value and the predicted site default likelihood 230 for a site can be compared to a second threshold value. Generally, a site that has a predicted site enrollment that is above the first threshold value and a predicted site default likelihood that is below the second threshold value will be ranked more highly than another site in which either the predicted site enrollment is below the first threshold or the predicted site default likelihood is above the second threshold.
In various embodiments, the first threshold value and the second threshold value are statistical measures. A statistical measure can be a mean value, a median value, or a mode value. For example, the first threshold value can be the median site enrollment across historical data of all clinical trial sites or a specified value (e.g., a value in the top-performing quadrant or quartile). The second threshold value can be the median predicted site default likelihood across historical data of all clinical trial sites or a specified value (e.g., a value in the low-performing quadrant or quartile). In various embodiments, the first threshold value and the second threshold value are fixed values. For example, the first threshold value may be a fixed value of at least 1 enrolled patient, at least 2 enrolled patients, at least 3 enrolled patients, at least 4 enrolled patients, at least 5 enrolled patients, at least 6 enrolled patients, at least 7 enrolled patients, at least 8 enrolled patients, at least 9 enrolled patients, at least 10 enrolled patients, at least 15 enrolled patients, at least 20 enrolled patients, at least 25 enrolled patients, at least 30 enrolled patients, at least 35 enrolled patients, at least 40 enrolled patients, at least 50 enrolled patients, at least 75 enrolled patients, at least 100 enrolled patients, at least 200 enrolled patients, at least 300 enrolled patients, at least 400 enrolled patients, at least 500 enrolled patients, or at least 1000 enrolled patients.
As another example, the second threshold value may be a fixed value of less than 30% likelihood of default, less than 25% likelihood of default, less than 20% likelihood of default, less than 15% likelihood of default, less than 14% likelihood of default, less than 13% likelihood of default, less than 12% likelihood of default, less than 11% likelihood of default, less than 10% likelihood of default, less than 9% likelihood of default, less than 8% likelihood of default, less than 7% likelihood of default, less than 6% likelihood of default, less than 5% likelihood of default, less than 4% likelihood of default, less than 3% likelihood of default, less than 2% likelihood of default, or less than 1% likelihood of default.
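One way to realize the two-threshold ranking described above is sketched below: sites meeting both thresholds sort ahead of the rest, with predicted enrollment (descending) and default likelihood (ascending) as tie-breakers. The function name, threshold values, and predictions are illustrative assumptions.

```python
# Hypothetical sketch of ranking sites by two predictions and two thresholds.
def rank_sites(predictions, enroll_threshold, default_threshold):
    """predictions: {site: (predicted_enrollment, default_likelihood)}.
    Sites above the enrollment threshold and below the default threshold
    are ranked ahead of all other sites."""
    def key(site):
        enroll, default = predictions[site]
        favored = enroll >= enroll_threshold and default < default_threshold
        return (not favored, -enroll, default)
    return sorted(predictions, key=key)

preds = {
    "site_a": (30, 0.05),  # high enrollment, low default risk
    "site_b": (40, 0.40),  # highest enrollment, but risky
    "site_c": (25, 0.08),
}
ranking = rank_sites(preds, enroll_threshold=20, default_threshold=0.10)
# ['site_a', 'site_c', 'site_b'] -- site_b drops despite its raw enrollment
```

The same boolean (`favored`) also identifies the top-performing quadrant when the two predictions are plotted against the two thresholds.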
In various embodiments, the predicted site rankings 235 is a list of all ranked sites. In various embodiments, the predicted site rankings 235 is a list of selected top-ranked clinical trial sites. In various embodiments, each of the top-ranked clinical trial sites included in the predicted site rankings 235 has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value.
In various embodiments, the predicted site rankings 235 is a list of at least 3 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 5 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 10 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 20 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 50 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 includes corresponding contact information useful for feasibility stakeholders, such as address, country, investigator, contact information, and other suitable information of each site listed in the predicted site rankings 235.
The predicted site enrollment 225 and predicted site default likelihood 230 can be used as an input to a stochastic model 240 (e.g., Monte Carlo simulation) to generate a plurality of quantitative values informative of enrollment timeline predictions 245.
In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll a number of patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 5 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 10 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 50 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 100 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 500 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 1000 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 2000 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll a range of 50-1000 patients.
In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a time period. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 1 month. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 4 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 6 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 12 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 18 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 24 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a range of 18-24 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a range of 3-48 months.
In particular embodiments, the plurality of quantitative values informative of predicted enrollment performance comprises one or more of time to enroll 500 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, or number of patients enrolled in 24 months.
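Given a simulated cumulative enrollment curve, the quantitative values above can be read off directly, as in the following sketch; the curve values are hypothetical simulation output, not data from the source.

```python
# Sketch: extract timeline metrics from a cumulative enrollment curve,
# where curve[m] is the total patients enrolled after month m+1.
def time_to_enroll(curve, target):
    """Month at which cumulative enrollment first reaches target,
    or None if the target is never reached within the curve."""
    for month, total in enumerate(curve, start=1):
        if total >= target:
            return month
    return None

def enrolled_by(curve, month):
    """Cumulative patients enrolled by the given month."""
    return curve[month - 1] if month <= len(curve) else curve[-1]

curve = [20, 55, 110, 190, 290, 410, 520]  # hypothetical monthly totals
time_to_enroll(curve, 500)  # -> 7 (500 patients reached in month 7)
enrolled_by(curve, 4)       # -> 190 patients by month 4
```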
Reference is now made to
At step 260, input data comprising data of an upcoming trial protocol is obtained.
At step 265, for each of one or more clinical trial sites, one or more machine learning models (e.g., top-performing MLM 220A and 220B) are applied to selected features of the input data to generate a predicted site enrollment and a predicted site default likelihood for the clinical trial site. In various embodiments, the selected features are previously determined by performing feature engineering on historical clinical trial data (e.g., historical clinical trial data 310 in
At step 270, the one or more clinical trial sites are ranked according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
At step 275, top-ranked clinical trial sites are selected from the ranked clinical trial sites. In various embodiments, each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value. In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites. In various embodiments, the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites.
At step 280, the predicted site enrollment and the predicted site default likelihood for the ranked clinical trial sites are visualized in a quadrant graph, and a site list of the selected top-ranked clinical trial sites is generated. The quadrant graph and/or the site list can be evaluated or provided to appropriate stakeholders for determining or selecting sites for an upcoming clinical trial.
At step 285, a plurality of quantitative values informative of enrollment timeline prediction is generated by applying a stochastic model (e.g., stochastic model 240 in
III. Training a Machine Learning Model for Deployment in a Site Prediction System
As shown in
In various embodiments, the historical clinical trial data 310 may be a subset of the input data (e.g., clinical trial data 120 in
In various embodiments, the historical clinical trial data 310 includes site level enrollment data and/or trial level data of a historical clinical trial. For example, the historical clinical trial data 310 includes enrollment number per site, default status (e.g., 0 or 1 patients were enrolled), enrollment rate (e.g., number of patients per site per month/year), enrollment dates such as agility (e.g., time required for a site to start up and begin recruitment) or enrollment period, etc., investigator names, site locations, trial sponsor, list of trial identifiers for disease indication, eligibility criteria, protocol information, trial dates (e.g., start date, end date, etc.), and/or site ready time of a historical clinical trial.
The historical clinical trial data 310 is processed (e.g., cleaned, integrated, and enriched) using the data processing module 145. In various embodiments, the data processing module 145 cleans the historical clinical trial data 310 by assessing each column of the historical clinical trial data 310, followed by cleaning methods such as standardizing date formats, removing null values, removing new line characters, cleaning column names, parsing or cleaning age criteria, and other appropriate cleaning steps.
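The per-column cleaning steps above can be sketched for a single record. This is a minimal illustration; the column names, null markers, and accepted date formats are assumptions for the example, not the module's actual schema.

```python
from datetime import datetime

def clean_record(record):
    """Sketch of per-column cleaning: standardize date formats to ISO 8601,
    remove null-like values, strip newline characters, clean column names."""
    cleaned = {}
    for column, value in record.items():
        if value in (None, "", "NA", "N/A", "null"):
            continue  # remove null values
        if isinstance(value, str):
            value = value.replace("\n", " ").strip()
        if column.lower().endswith("_date"):
            # accept a few common source formats, emit ISO 8601
            for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
                try:
                    value = datetime.strptime(value, fmt).date().isoformat()
                    break
                except ValueError:
                    continue
        cleaned[column.strip().lower()] = value
    return cleaned
```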
In various embodiments, the data processing module 145 integrates the cleaned historical clinical trial data 310 by merging datasets of the historical clinical trial data 310 based on the National Clinical Trial (NCT) number. The data processing module 145 may perform the integration and merging of datasets if the historical clinical trial data 310 includes multiple datasets that are obtained from multiple databases. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that each row includes trial performance for each site-investigator-pair. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there are multiple rows for each trial. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there are multiple rows for each site-investigator pair. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there is a unique row for each site-investigator performance for a given trial.
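The NCT-keyed integration can be sketched as a dictionary join that leaves one row per site-investigator pair per trial. Field names here are illustrative assumptions; a production pipeline would more likely use a dataframe merge.

```python
def merge_on_nct(site_rows, trial_rows):
    """Join site-level rows to trial-level metadata on the National
    Clinical Trial (NCT) number, producing one output row per
    site-investigator pair's performance in a given trial."""
    trials = {row["nct_id"]: row for row in trial_rows}
    merged = []
    for row in site_rows:
        trial = trials.get(row["nct_id"])
        if trial is None:
            continue  # site row with no matching trial record
        combined = {**trial, **row}  # site-level fields take precedence
        merged.append(combined)
    return merged
```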
In various embodiments, the data processing module 145 enriches the cleaned and integrated historical clinical trial data 310 by splitting inclusion and exclusion criteria, and/or standardizing names.
Generally, the feature engineering module 150 extracts features that are related to facilities or investigators of the processed historical clinical trial data 310, and selects top features by applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value. In various embodiments, the feature engineering module 150 extracts features by converting or transforming tagged trial metadata (e.g., text, words) from the historical clinical trial data 310 into a numerical representation of a single value or vector of values using n-grams, TF-IDF, Word2Vec, GloVe, fastText, BERT, ELMo, or InferSent. In various embodiments, the feature engineering module 150 extracts time series features that capture historical performance of a site over the past M time periods.
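One of the text representations named above, TF-IDF, can be sketched from scratch for tagged trial metadata. This is a toy illustration assuming whitespace tokenization and the plain tf × log(N/df) weighting; a production system would use a library vectorizer with smoothing and normalization.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Minimal TF-IDF sketch: turn each text document (e.g., tagged trial
    metadata) into a numeric vector over the shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # document frequency per term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [(counts[t] / len(doc)) * math.log(n_docs / df[t])
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors
```

A term that appears in every document receives a weight of zero (log of one), so only discriminative metadata terms contribute to the feature vector.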
In various embodiments, the selected features may include features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics. In particular embodiments, the selected features include state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and/or facility address. In particular embodiments, the selected features associated with historical site enrollment metrics include at least 1, 3, 5, or 10 of: the minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of the number of patients enrolled in a trial, the number of patients consented for a trial, the number of patients who completed a trial, the number of patients who failed screening for a trial, and agility (e.g., the time it took to start recruiting in a trial), each computed over a reference time window at a reference entity. In particular embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years. In particular embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
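The windowed enrollment metrics can be sketched for one reference entity. This example assumes a 2-year reference time window at a site-level reference entity; the feature names, the EWMA smoothing factor, and the (year, enrolled) input shape are illustrative.

```python
def window_features(trials, reference_years=2, now_year=2022, alpha=0.5):
    """Sketch of historical-enrollment features for one reference entity
    (e.g., a site) over a reference time window. `trials` is a list of
    (year, enrolled_patients) tuples."""
    in_window = sorted(
        (y, n) for y, n in trials if now_year - y <= reference_years
    )
    counts = [n for _, n in in_window]
    if not counts:
        return None
    # exponentially weighted moving average, most recent trial weighted most
    ewma = counts[0]
    for n in counts[1:]:
        ewma = alpha * n + (1 - alpha) * ewma
    return {
        "min_enrolled_2y": min(counts),
        "max_enrolled_2y": max(counts),
        "median_enrolled_2y": sorted(counts)[len(counts) // 2],
        "ewma_enrolled_2y": ewma,
    }
```

Repeating the computation across windows (1, 2, 5, 10 years) and entities (site, investigator, country, etc.) yields the large feature combinations described below.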
As a specific example, given a reference time window of 2 years and a reference entity of a site, example resulting features include:
As another specific example, given a reference time window of 1 year and a reference entity of an investigator, example resulting features include:
As another specific example, given a reference time window of 5 years and a reference entity of a country, example resulting features include:
In various embodiments, the feature engineering module 150 extracts or selects at least 3 features. In various embodiments, the feature engineering module 150 extracts or selects at least 5 features. In various embodiments, the feature engineering module 150 extracts or selects at least 10 features. In various embodiments, the feature engineering module 150 extracts or selects at least 50 features. In various embodiments, the feature engineering module 150 extracts or selects at least 100 features. In various embodiments, the feature engineering module 150 extracts or selects at least 500 features. In various embodiments, the feature engineering module 150 extracts or selects at least 1000 features. In various embodiments, the feature engineering module 150 extracts or selects at least 2000 features. In particular embodiments, the feature engineering module 150 extracts or selects at least 1700 features.
The model training module 155 trains the plurality of machine learning models (MLMs) 320A, 320B, 320C, etc. by providing the extracted features of the historical clinical trial data 310 as input. As shown in
In various embodiments, one or more of MLMs 320A, 320B, 320C, etc. are individually trained to minimize a loss function such that the output of each model is improved over successive training epochs. In various embodiments, the loss function is constructed for any of a least absolute shrinkage and selection operator (LASSO) regression, Ridge regression, or ElasticNet regression. In such embodiments, the dotted lines for the models shown in
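Minimizing a regularized loss over successive epochs can be illustrated with a toy single-feature Ridge regression trained by gradient descent. The data, learning rate, penalty weight, and epoch count are arbitrary examples, not the platform's actual configuration.

```python
def train_ridge(xs, ys, lam=0.1, lr=0.01, epochs=2000):
    """Toy Ridge regression: minimize mean squared error plus an L2
    penalty on the weight, improving the fit over successive epochs."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        grad_w += 2 * lam * w  # gradient of the L2 (Ridge) penalty
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Swapping the L2 penalty for an L1 term gives LASSO, and combining both gives ElasticNet, the other two regression losses named above.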
Generally, a machine learning model (MLM) is structured such that it analyzes input data or extracted features of input data associated with a clinical trial site and/or an upcoming trial protocol, and predicts site enrollment, site default likelihood, and/or other related output for clinical trial sites based on the input data. In various embodiments, the MLM is any one of a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, gradient boosted machine learning model, support vector machine, Naïve Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, generative adversarial networks, or recurrent networks (e.g., long short-term memory networks (LSTM), bi-directional recurrent networks, deep bi-directional recurrent networks)), or any combination thereof. In particular embodiments, the MLM is any one of a random forest model, an extremely randomized trees (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
The MLM can be trained using a machine learning implemented method, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naïve Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof. In particular embodiments, the machine learning implemented method is a logistic regression algorithm. In particular embodiments, the machine learning implemented method is a random forest algorithm. In particular embodiments, the machine learning implemented method is a gradient boosting algorithm, such as XGBoost. In various embodiments, the model is trained using supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms (e.g., partial supervision), weak supervision, transfer learning, multi-task learning, or any combination thereof.
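The logistic regression embodiment for the binary default target can be sketched as a toy classifier over one feature. Here the feature is a site's historical enrollment expressed as the deviation from the cohort mean (so the decision boundary sits near zero); the data, learning rate, and epoch count are illustrative assumptions.

```python
import math

def train_logistic(features, labels, lr=0.1, epochs=3000):
    """Toy logistic regression for a site-default classifier: fit a
    sigmoid over one feature against a binary default label, then
    return a function mapping the feature to a default probability."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted default prob
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```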
In various embodiments, the MLM for analyzing selected features of the input data may include parameters, such as hyperparameters or model parameters. Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function. Model parameters are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of a neural network, support vectors in a support vector machine, node values in a decision tree, and coefficients in a regression model. The model parameters of a machine learning model are trained (e.g., adjusted) using the training data to improve the predictive capacity of the machine learning model.
Embodiments disclosed herein are useful for identifying clinical trial sites that are likely to be high performing clinical trial sites. Thus, these high performing clinical trial sites can be enrolled in a clinical trial for investigating therapeutics for a variety of disease indications. In various embodiments, a disease indication for a clinical trial can include any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience. In particular embodiments, the disease indication is any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV). Example clinical trials supported among the different therapeutic areas are: Tremfya for Crohn's Disease and Stelara for Lupus (Immunology), Invokana for DKD (CVM), JNJ 61186372/Lazertinib for Lung Cancer (Oncology) and VAC18193 for RSV (IDV).
The methods of the disclosed embodiments, including the methods of implementing MLM and a stochastic simulation for predicting clinical trial sites, are, in some embodiments, performed on one or more computers.
For example, the building and deployment of a MLM or a stochastic simulation can be implemented in hardware or software, or a combination of both. In one embodiment of the invention, a machine-readable storage medium is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing the training or deployment of machine learning models and/or displaying any of the datasets or results described herein. The embodiments can be implemented in computer programs executing on programmable computers, comprising a processor, and a data storage system (including volatile and non-volatile memory and/or storage elements). Some computing components (e.g., those used to display the user interfaces described herein) may include additional components such as a graphics adapter, a pointing device, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage media or device readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information. The databases of the present invention can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer. Such media include, but are not limited to, magnetic storage media such as hard disc storage media and magnetic tape; optical storage media; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
In some embodiments, the methods of the invention, including the methods for predicting enrollment of clinical trial sites, are performed on one or more computers in a distributed computing system environment (e.g., in a cloud computing environment). In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared set of configurable computing resources. Cloud computing can be employed to offer on-demand access to the shared set of configurable computing resources. The shared set of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly. A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
The storage device 408 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 406 holds instructions and data used by the processor 402. The input interface 414 is a touch-screen interface, a mouse, track ball, or other type of pointing device, a keyboard, or some combination thereof, and is used to input data into the computer 400. In some embodiments, the computer 400 may be configured to receive input (e.g., commands) from the input interface 414 via gestures from the user. The network adapter 416 couples the computer 400 to one or more computer networks.
The graphics adapter 412 displays representation, graphs, tables, and other information on the display 418. In various embodiments, the display 418 is configured such that the user (e.g., data scientists, data owners, data partners) may input user selections on the display 418 to, for example, predict enrollment for a clinical trial site for a particular disease indication or order any additional exams or procedures. In one embodiment, the display 418 may include a touch interface. In various embodiments, the display 418 can show one or more predicted enrollments of a clinical trial site. Thus, a user who accesses the display 418 can inform the subject of the predicted enrollment of a clinical trial site.
The computer 400 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 408, loaded into the memory 406, and executed by the processor 402.
The types of computers 400 used by the entities of
Further disclosed herein are systems for implementing MLMs for generating site predictions for clinical trial sites. In various embodiments, such a system can include at least the site prediction system 130 described above in
As shown in
As shown in
As shown in
IX. Example 2: Example Site Prediction Systems and Methods for a Particular Disease (Lupus)
As shown in
As shown in
As shown in
As shown in
Generating the many different combinations of features and time windows has improved the performance of the models.
While various specific embodiments have been illustrated and described, the above specification is not restrictive. It will be appreciated that various changes can be made without departing from the spirit and scope of the present disclosure(s). Many variations will become apparent to those skilled in the art upon review of this specification.
All references, issued patents and patent applications cited within the body of the instant specification are hereby incorporated by reference in their entirety, for all purposes.
This application claims the benefit of U.S. Provisional Application No. 63/242,753 filed on Sep. 10, 2021, which is incorporated by reference herein.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/IB2022/058525 | 9/9/2022 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63242753 | Sep 2021 | US |