Performing site selection for clinical trials is a valuable step for ensuring on-time and on-target enrollment completion. Sluggish patient recruitment may disrupt clinical trial timelines and affect a clinical trial site's performance. Relying solely on historical performance has been shown to be a weak predictor of a site's future performance and of a trial's overall timeline. To deliver robust predictions of a site's enrollment, an advanced analytics platform to assist site selection and planning is needed.
As described herein, systems, non-transitory computer readable media, and methods are used to predict target variables informative of site enrollment (e.g., number of patients a site will enroll) and site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) of one or more clinical trial sites. The prediction(s) can be used to assist selection of trial sites (e.g., healthcare facility and principal investigator pairs) for planning and supporting one or more clinical trials. The systems and methods described herein involve engineering features, predicting two target variables (which may include any of enrolled patients, enrollment rate, default, and/or site agility) using machine learning models, and ranking sites, thereby improving the selection of clinical trial sites that are likely to be successful.
Various embodiments disclosed herein involve building machine learning models including features that are selected through a specific feature selection process. Namely, the feature selection process involves generating features of historical clinical trial data over time periods in relation to reference time windows and reference entities, and selecting top features for inclusion in machine learning models. In various embodiments, the systems and methods select top-performing models from among a large selection of machine learning models. Predictions from the top-performing models are visualized, e.g., in quadrant graphs to elucidate site rankings. Simulations of patient enrollment capture the stochastic fluctuations in multi-site enrollment timelines with limited assumptions, producing statistically-robust enrollment curves. The final output may be a ranked list of sites with corresponding contact information to deliver to feasibility stakeholders. This final output assists in identifying and prioritizing the best performing sites for enrollment of patients for a specific clinical trial.
Disclosed herein is an automated method for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising: obtaining input data comprising data of an upcoming trial protocol; for each of the one or more clinical trial sites: generating a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data; ranking the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites; and selecting top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
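For illustration only, the median-threshold selection described above can be sketched as follows. The site identifiers and prediction values are hypothetical; the model outputs (predicted enrollment, predicted default likelihood) are assumed to be given.

```python
# Sketch of median-threshold site selection: keep sites whose predicted
# enrollment is above the median and whose predicted default likelihood
# is below the median, then rank the survivors.
from statistics import median

sites = {
    # site_id: (predicted_enrollment, predicted_default_likelihood)
    "site_a": (14.0, 0.10),
    "site_b": (6.0, 0.45),
    "site_c": (18.0, 0.30),
    "site_d": (9.0, 0.05),
}

enroll_threshold = median(v[0] for v in sites.values())   # first threshold
default_threshold = median(v[1] for v in sites.values())  # second threshold

# Keep sites above the enrollment threshold and below the default threshold,
# then rank by enrollment (descending) and default likelihood (ascending).
selected = [
    s for s, (enr, dfl) in sites.items()
    if enr > enroll_threshold and dfl < default_threshold
]
selected.sort(key=lambda s: (-sites[s][0], sites[s][1]))
```

With median thresholds on both axes, the selected sites correspond to the favorable quadrant (high enrollment, low default) of the quadrant graph described below.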
In various embodiments, the method further comprises visualizing the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the method further comprises generating a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
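A minimal sketch of such a Monte Carlo enrollment simulation follows. The per-site monthly rates and default likelihoods are hypothetical; each run draws a default outcome per site and, for non-defaulting sites, Poisson monthly enrollment counts, and the sorted run totals yield percentile-based timeline values such as median patients enrolled at 12 months.

```python
# Monte Carlo sketch: simulate many multi-site enrollment runs and
# summarize the distribution of total patients enrolled at 12 months.
import random

random.seed(0)

# Hypothetical per-site parameters:
# (predicted monthly enrollment rate, predicted default likelihood)
sites = [(1.2, 0.10), (0.8, 0.30), (2.0, 0.05)]

def poisson_draw(lam):
    """Poisson sample via counting exponential inter-arrivals in one unit of time."""
    count, t = 0, random.expovariate(lam)
    while t < 1.0:
        count += 1
        t += random.expovariate(lam)
    return count

def simulate_run(months=12):
    """Total patients enrolled across all sites in one simulated trial."""
    total = 0
    for rate, p_default in sites:
        if random.random() < p_default:
            continue  # site defaults and enrolls no patients
        total += sum(poisson_draw(rate) for _ in range(months))
    return total

runs = sorted(simulate_run() for _ in range(2000))
median_12mo = runs[len(runs) // 2]   # median patients enrolled at 12 months
p10, p90 = runs[200], runs[1800]     # 10th/90th percentile enrollment band
```

Repeating the summary at different horizons (4, 18, 24 months, etc.) or inverting it (first run total reaching a target count) yields the other quantitative values listed above.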
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the method provides an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the method further comprises generating a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
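As one hedged sketch of the train-many, keep-the-best approach described above, several candidate model families can be fit and compared on a held-out set. This uses scikit-learn with synthetic data standing in for real clinical trial features, and `ExtraTreesClassifier` standing in for the XRT model.

```python
# Train several candidate models and retain the top performer by
# validation AUC (illustrative data; model set mirrors those listed above).
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "extra_trees": ExtraTreesClassifier(random_state=0),   # XRT analogue
    "gbm": GradientBoostingClassifier(random_state=0),
    "glm": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

best_name = max(scores, key=scores.get)  # top-performing model is retained
```

An AutoML framework can automate the same loop (including stacked ensembles and neural networks); the selection criterion here, AUC, matches the default-likelihood metric discussed below.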
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
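For illustration, the per-entity aggregates above (min, max, EWMA, mean, median) can be computed with pandas. The site labels and enrollment counts are made up; the reference entity here is a site, but grouping by investigator, country, etc. works identically.

```python
# Sketch: per-site historical enrollment aggregates over past trials.
import pandas as pd

history = pd.DataFrame({
    "site": ["s1", "s1", "s1", "s2", "s2"],
    "enrolled": [4, 9, 7, 12, 10],   # patients enrolled in each past trial
})

def ewma_last(s: pd.Series) -> float:
    # Exponentially weighted moving average; keep the most recent value.
    return s.ewm(alpha=0.5).mean().iloc[-1]

features = history.groupby("site")["enrolled"].agg(
    enrolled_min="min",
    enrolled_max="max",
    enrolled_mean="mean",
    enrolled_median="median",
    enrolled_ewma=ewma_last,
)
```

Restricting `history` to trials within the reference time window (e.g., the past 3 years) before grouping yields the windowed variants of these features.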
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, performing feature engineering on historical clinical trial data comprises converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
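As one small example of these conversions, free-text trial metadata (here, made-up study titles) can be turned into numeric vectors with TF-IDF over word n-grams; the other listed methods (Word2vec, BERT, etc.) would slot in the same way, producing one vector per text field.

```python
# Sketch: TF-IDF vectorization of trial metadata text fields.
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "phase 3 study of drug X in Crohn's disease",
    "randomized trial of drug Y in lupus",
    "phase 2 study of drug X in lupus nephritis",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
matrix = vectorizer.fit_transform(titles)         # one sparse vector per title
```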
In various embodiments, performing feature engineering on historical clinical trial data comprises applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
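A minimal sketch of that random-forest feature selection, assuming synthetic data and an illustrative 0.02 importance cutoff, looks like this:

```python
# Sketch: fit a random forest, then keep only features whose importance
# exceeds a threshold value.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

threshold = 0.02  # illustrative importance cutoff
kept = np.where(forest.feature_importances_ > threshold)[0]
X_selected = X[:, kept]  # reduced feature matrix for downstream models
```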
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
Additionally disclosed herein is a non-transitory computer readable medium for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising instructions that, when executed by a processor, cause the processor to: obtain input data comprising data of an upcoming trial protocol; for each of the one or more clinical trial sites: generate a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data; rank the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites; and select top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to visualize the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to generate a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the instructions provide an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the non-transitory computer readable medium further comprises instructions that, when executed by the processor, cause the processor to generate a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, the instructions that cause the processor to perform feature engineering on historical clinical trial data comprise converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
In various embodiments, the instructions that cause the processor to perform feature engineering on historical clinical trial data comprise applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
Additionally disclosed herein is a system for determining or selecting one or more clinical trial sites for inclusion in a clinical trial, comprising: a computer system configured to obtain input data comprising data of an upcoming trial protocol, wherein for each of the one or more clinical trial sites: the computer system generates a predicted site enrollment and a predicted site default likelihood for the clinical trial site by applying one or more machine learning models to selected features of the input data, wherein the computer system ranks the one or more clinical trial sites according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites, wherein the computer system selects top-ranked clinical trial sites, wherein each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value, and wherein the selected features are previously determined by performing feature engineering on historical clinical trial data.
In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites or a first specified value, and wherein the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites or a second specified value.
In various embodiments, the system further comprises: an apparatus configured to visualize the predicted site enrollment and the predicted site default likelihood for the clinical trial sites in a quadrant graph.
In various embodiments, the computer system generates a plurality of quantitative values informative of predicted enrollment timelines by applying a stochastic model to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
In various embodiments, the stochastic model comprises a Monte Carlo simulation.
In various embodiments, the plurality of quantitative values informative of predicted enrollment timeline comprises one or more of time to enroll between 50-1000 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, number of patients enrolled in 24 months, and number of patients enrolled between 3-48 months.
In various embodiments, the predicted site enrollment and the predicted site default likelihood are validated by using one or more of the historical clinical trial data and prospective clinical trial data.
In various embodiments, the system provides an improvement of at least 11% in identifying the top-ranked clinical trial sites.
In various embodiments, the computer system further generates a site list of the selected top-ranked clinical trial sites.
In various embodiments, the site list comprises corresponding contact information useful for feasibility stakeholders.
In various embodiments, the one or more machine learning models are determined by training a plurality of machine learning models, and by selecting the top performing model of the trained machine learning models.
In various embodiments, the plurality of machine learning models are automatically trained.
In various embodiments, the one or more machine learning models are independently any one of a random forest model, an extremely randomized tree (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
In various embodiments, the one or more machine learning models are trained to predict site enrollment and site default likelihood for a specific disease indication.
In various embodiments, the one or more machine learning models achieve an AUC performance metric of at least 0.68 for predicting likelihood of a site default.
In various embodiments, the one or more machine learning models achieve a root mean squared error performance metric between 3.1-6.7 for predicting estimated number of enrolled patients at a site.
In various embodiments, the selected features comprise features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics.
In various embodiments, the selected features comprise at least 3 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 5 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the selected features comprise at least 10 of state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and facility address.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 3 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 5 of minimum, maximum, exponentially weighted moving average (EWMA), average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility.
In various embodiments, the features associated with historical site enrollment metrics comprise at least 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients that consented for a trial, number of patients that completed a trial, number of patients that failed screening for a trial, and agility over a reference time window at a reference entity.
In various embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years.
In various embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
In various embodiments, performing feature engineering on historical clinical trial data comprises converting trial metadata from the historical clinical trial data into a numerical representation of a single value or vector of values using one or more of n-grams, TF-IDF, Word2vec, GloVe, fastText, BERT, ELMo, or InferSent.
In various embodiments, performing feature engineering on historical clinical trial data comprises applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value.
In various embodiments, the historical clinical trial data is sourced from clinical trial databases selected from one or more of Data Query System (DQS), a contract research organization (CRO), Clinical Trial Management System (CTMS), or clinicaltrials.gov.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience.
In various embodiments, the upcoming trial protocol comprises information for a clinical trial for any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV).
In various embodiments, the predicted site enrollment comprises number of patients a site will enroll.
In various embodiments, the predicted site enrollment further comprises enrollment rate and/or agility, wherein the enrollment rate comprises number of patients per site per month or year, and wherein the agility comprises time it took to start recruiting in a trial.
In various embodiments, the predicted site default likelihood comprises how likely a site is to enroll zero patients or fewer patients than a predetermined threshold.
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description and accompanying drawings. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. For example, a letter after a reference numeral, such as “top-performing MLM 220A,” indicates that the text refers specifically to the element having that particular reference numeral. A reference numeral in the text without a following letter, such as “top-performing MLM 220,” refers to any or all of the elements in the figures bearing that reference numeral (e.g., “top-performing MLM 220” in the text refers to reference numerals “top-performing MLM 220A” and/or “top-performing MLM 220B” in the figures).
The system environment 100 may include one or more subjects 110 who were enrolled in clinical trials that provide the clinical trial data 120. In various embodiments, a subject (or patient) may comprise a human or non-human, whether in vivo, ex vivo, or in vitro, male or female, a cell, tissue, or organism. In various embodiments, the subject 110 may have met eligibility criteria for enrollment in the clinical trials. For example, the subject 110 may have been previously diagnosed with a disease indication. Thus, the subject 110 may have been enrolled in a clinical trial that provides the clinical trial data 120 that tested a therapeutic intervention for treating the disease indication. Although
The clinical trial data 120 refers to clinical trial data related to one or more clinical trial sites and/or data of an upcoming trial protocol. In various embodiments, the clinical trial data 120 are related to one or more clinical trial sites that may have previously conducted a clinical trial (e.g., such that there are clinical operations data related to the previously conducted clinical trial). For example, the clinical trial sites to which the clinical trial site data 120 relates may have previously conducted one or more clinical trials that enrolled subjects 110. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that include at least one clinical facility and/or investigator that were previously used to conduct a clinical trial (e.g., in which the subjects 110 were enrolled) or can be used for one or more prospective clinical trials. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that are located in different geographical locations. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that generate or store clinical trial site data 120 describing the prior clinical trials (e.g., in which the subjects 110 were enrolled) that were conducted at the sites. In various embodiments, the clinical trial data 120 includes clinical operations data (e.g., clinical operations data that is not related to a subject 110) from one or more clinical trial sites. In various embodiments, the clinical trial data 120 includes site level enrollment data. In various embodiments, the clinical trial data 120 includes trial level enrollment data. In various embodiments, the clinical trial site data 120 is related to one or more clinical trial sites that conducted clinical trials for one or more different disease indications.
Example disease indications are associated with any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience. In particular embodiments, the disease indication is any one of multiple myeloma, prostate cancer, non-small cell lung cancer, treatment resistant depression, Crohn's disease, systemic lupus erythematosus, hidradenitis suppurativa/atopic dermatitis, diabetic kidney disease, or respiratory syncytial virus (RSV). In various embodiments, the clinical trial data 120 are data from one or more datasets related to an upcoming clinical trial. For example, the clinical trial data 120 includes data of one or more protocols for an upcoming clinical trial related to a disease indication. Thus, the clinical trial data 120 related to one or more protocols for the upcoming clinical trial can be analyzed to predict likely top-performing sites that can be enrolled in the upcoming clinical trial.
In various embodiments, the clinical trial data 120 are obtained from internal clinical trial data, such as clinical trial data stored by a party operating the site prediction system 130. In various embodiments, the clinical trial data 120 are obtained from external clinical trial data, such as clinical trial data stored by a party different from the party operating the site prediction system 130. In various embodiments, the clinical trial data 120 are obtained from a combination of internal clinical trial data and external clinical trial data. In various embodiments, the clinical trial data 120 are obtained from one or more clinical trial sites. In various embodiments, the clinical trial data 120 are obtained from a real-world database (e.g., a hospital). In various embodiments, the clinical trial data 120 are obtained from a public data set (e.g., a library).
The site prediction system 130 analyzes clinical trial data 120 and generates a site prediction 140. In particular embodiments, the site prediction system 130 generates a site prediction 140 for a specific disease indication that is to be treated in a future clinical trial, the site prediction 140 identifying the likely best performing clinical trial sites for the specific disease indication. In various embodiments, the site prediction system 130 applies one or more machine learning models and/or a stochastic model to analyze or evaluate clinical trial data 120 to generate the site prediction 140. In various embodiments, the site prediction system 130 includes or deploys one or more machine learning models that are trained using historical datasets from internal and/or external resources (e.g., industry sponsors and/or contract research organizations (CROs), etc.).
In various embodiments, the site prediction system 130 can include one or more computers, embodied as a computer system 400 as discussed below with respect to
The site prediction 140 is generated by the site prediction system 130 and includes predictions of one or more clinical trial sites based on the clinical trial data 120 for selecting sites for a prospective clinical trial. In various embodiments, the site prediction system 130 may generate a site prediction 140 for each clinical trial site. For example, if there are X possible clinical trial sites that are undergoing site selection, the site prediction system 130 may generate a site prediction 140 for each of the X clinical trial sites. In various embodiments, X is at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 50, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, at least 450, at least 500, at least 750, at least 1000, at least 1500, at least 2000, at least 2500, at least 3000, at least 3500, at least 4000, at least 4500, at least 5000, at least 5500, at least 6000, at least 6500, or at least 7000 clinical trial sites. In particular embodiments, X is at least 5000 clinical trial sites. In particular embodiments, X is at least 6000 clinical trial sites.
In various embodiments, the site prediction 140 includes a predicted site enrollment (e.g., number of patients a site will enroll) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, the site prediction 140 includes a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, the site prediction 140 includes a predicted site enrollment (e.g., number of patients a site will enroll) and a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites involved in the clinical trial data 120. In various embodiments, with regard to the predicted site default likelihood, the predetermined threshold may be less than 100 patients, less than 75 patients, less than 50 patients, less than 40 patients, less than 30 patients, less than 20 patients, less than 15 patients, less than 10 patients, less than 9 patients, less than 8 patients, less than 7 patients, less than 6 patients, less than 5 patients, less than 4 patients, less than 3 patients, or less than 2 patients.
In various embodiments, the site prediction 140 includes predicted enrollment performance related to an enrollment timeline. For example, predicted enrollment performance related to an enrollment time may include a time to enroll a specific number of patients. As another example, predicted enrollment performance related to an enrollment time may include a predicted number of patients enrolled by a certain timepoint after enrollment begins.
In various embodiments, the site prediction 140 is or includes a list of ranked sites (e.g., sites that will enroll the highest number of patients) for a prospective clinical trial. In various embodiments, the site prediction 140 is or includes at least 5 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 10 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 20 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes at least 50 of the top-ranked sites. In various embodiments, the site prediction 140 is or includes a list of the lowest-ranked sites (e.g., sites with the highest likelihood of enrolling zero patients or fewer patients than a predetermined threshold) for a prospective clinical trial, such that the site prediction 140 enables a recipient of the list to avoid selecting the lowest-ranked sites for the prospective clinical trial. In various embodiments, the site prediction 140 is or includes at least 5 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 10 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 20 of the lowest-ranked sites. In various embodiments, the site prediction 140 is or includes at least 50 of the lowest-ranked sites. In various embodiments, the site prediction 140 can be transmitted to stakeholders so they can select sites for inclusion. In various embodiments, the site prediction 140 can be transmitted to principal investigators at the clinical trial site and/or stakeholders so they can determine whether to run the clinical trial at their site.
In various embodiments, the one or more clinical trial sites are categorized into tiers. For example, the one or more clinical trial sites can be categorized into a first tier representing the best performing clinical trial sites, a second tier representing the next best performing clinical trial sites, and so on. In various embodiments, the one or more clinical trial sites are categorized into four tiers. In various embodiments, the top tier of clinical trial sites is selected and included in a prediction, e.g., a site prediction 140 shown in
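The tiering described above can be sketched as a simple quartile split. This is a hypothetical illustration assuming each site is summarized by a single predicted-enrollment score; the function name `assign_tiers` and the scores are not from the source.

```python
# Hypothetical sketch: split sites into four tiers by a predicted
# enrollment score, tier 1 being the best-performing quarter.
def assign_tiers(site_scores, n_tiers=4):
    """Return {site: tier}, with tier 1 holding the highest-scoring sites."""
    ranked = sorted(site_scores, key=site_scores.get, reverse=True)
    per_tier = -(-len(ranked) // n_tiers)  # ceiling division
    return {site: i // per_tier + 1 for i, site in enumerate(ranked)}

scores = {"site_a": 42, "site_b": 7, "site_c": 18, "site_d": 3}
tiers = assign_tiers(scores)
# site_a -> tier 1, site_c -> tier 2, site_b -> tier 3, site_d -> tier 4
```

With more sites than tiers, each tier simply holds an equal-sized slice of the ranked list; other tiering criteria (e.g., combining enrollment and default predictions) would slot into the same structure.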
Reference is now made to
As shown in
Generally, the data processing module 145 processes (e.g., ingests, cleans, integrates, enriches) the input data (e.g., clinical trial data 120 in
The feature engineering module 150 extracts and selects features from the data processed by the data processing module 145. In various embodiments, the feature engineering module 150 provides extracted values of selected features to the model training module 155 for developing (e.g., training, validating, etc.) machine learning models. In various embodiments, the feature engineering module 150 provides extracted values of selected features to the model deployment module 160 for selecting top-performing machine learning models and for deploying the selected top-performing machine learned models to generate a site prediction (e.g., number of patients a site will enroll) and a predicted site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for one or more clinical trial sites.
The model training module 155 develops (e.g., trains, validates, etc.) a plurality of machine learning models using selected features of the input data, and provides the trained machine learning models to the model deployment module 160. In various embodiments, a platform utilizes a proprietary framework or an open-source framework (e.g., H2O's AutoML framework) to automatically train and perform hyperparameter tuning on the plurality of machine learning models (e.g., generalized linear model (GLM), gradient boosting machine (GBM), XGBoost, stacked ensembles, deep learning, etc.). In various embodiments, the open-source framework may be scalable. The trained machine learning models may be locked and stored in the trained models store 180 to provide to the model deployment module 160 after the training is completed (e.g., until a quantitative improvement in the output of each model between each epoch or between each iteration of training is less than a pre-defined threshold, or until a maximum number of iterations is reached).
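The parenthetical stopping criterion above can be sketched as follows. This is a minimal illustration; `train_step` is a hypothetical callable standing in for one training pass that returns the model's current score, and the tolerance is an assumed value.

```python
# Sketch of the stopping rule: iterate until the score improvement
# between successive iterations drops below a tolerance, or until a
# maximum number of iterations is reached.
def train_until_converged(train_step, tol=1e-2, max_iters=100):
    """Call train_step() up to max_iters times; return (score, iterations)."""
    prev = train_step()
    for iteration in range(2, max_iters + 1):
        score = train_step()
        if score - prev < tol:  # quantitative improvement too small
            return score, iteration
        prev = score
    return prev, max_iters

# Hypothetical validation scores produced by successive training passes:
scores = iter([0.50, 0.70, 0.80, 0.805])
result = train_until_converged(lambda: next(scores))
# stops at the fourth pass, where improvement (0.005) < tol (0.01)
```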
In various embodiments, the model deployment module 160 selects top-performing machine learning models and deploys the top-performing machine learning models. The model deployment module 160 may select top-performing machine learning models by evaluating or assessing the generated site predictions (e.g., a predicted site enrollment, a predicted site default likelihood, etc.). In various embodiments, the model deployment module 160 selects a best-performing machine learning model for each type of site prediction, based on the best training score as well as model interpretability. For example, the model deployment module 160 selects a best-performing machine learning model for predicting site enrollment, and a best-performing machine learning model for predicting site default likelihood. In various embodiments, the selected models for the site prediction variables are the same model. In various embodiments, the selected models for the site prediction variables are different models.
The model deployment module 160 implements the trained machine learning models stored in the trained models store 180 to analyze the values of selected features of the input data to generate site predictions such as a predicted site enrollment and a predicted site default likelihood. The model deployment module 160 provides the site predictions generated from selected machine learning models to the simulation module 165.
In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next year. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next 3 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll in the next 5 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict the number of patients a clinical trial site will enroll within a time period M. In various embodiments, M is any of 6 months, 1 year, 1.5 years, 2 years, 2.5 years, 3 years, 3.5 years, 4 years, 4.5 years, 5 years, 5.5 years, 6 years, 6.5 years, 7 years, 7.5 years, 8 years, 8.5 years, 9 years, 9.5 years, 10 years, 10.5 years, 11 years, 11.5 years, 12 years, 12.5 years, 13 years, 13.5 years, 14 years, 14.5 years, 15 years, 15.5 years, 16 years, 16.5 years, 17 years, 17.5 years, 18 years, 18.5 years, 19 years, 19.5 years, or 20 years. In various embodiments, M can be any number.
In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next year. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next 3 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold in the next 5 years. In various embodiments, the machine learning models deployed by the model deployment module 160 can predict how likely a site is to enroll zero patients or fewer patients than a predetermined threshold within a time period M. In various embodiments, M is any of 6 months, 1 year, 1.5 years, 2 years, 2.5 years, 3 years, 3.5 years, 4 years, 4.5 years, 5 years, 5.5 years, 6 years, 6.5 years, 7 years, 7.5 years, 8 years, 8.5 years, 9 years, 9.5 years, 10 years, 10.5 years, 11 years, 11.5 years, 12 years, 12.5 years, 13 years, 13.5 years, 14 years, 14.5 years, 15 years, 15.5 years, 16 years, 16.5 years, 17 years, 17.5 years, 18 years, 18.5 years, 19 years, 19.5 years, or 20 years. In various embodiments, M can be any number. In various embodiments, the predetermined threshold may be less than 100 patients, less than 75 patients, less than 50 patients, less than 40 patients, less than 30 patients, less than 20 patients, less than 15 patients, less than 10 patients, less than 9 patients, less than 8 patients, less than 7 patients, less than 6 patients, less than 5 patients, less than 4 patients, less than 3 patients, or less than 2 patients.
The simulation module 165 applies a stochastic model (e.g., Monte Carlo simulation) using the site predictions generated from selected machine learning models, as input, to generate enrollment timeline prediction 245 (e.g., multi-site enrollment timelines). Example descriptions of a Monte Carlo simulation are found in Abbas I. et al.: Clinical trial optimization: Monte Carlo simulation Markov model for planning clinical trials recruitment, Contemporary Clinical Trials 28:220-231, 2007, which is hereby incorporated by reference in its entirety.
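As a rough illustration (a minimal sketch under stated assumptions, not the Markov formulation of Abbas et al.), a Monte Carlo simulation of multi-site enrollment might draw each site's monthly enrollment from a Poisson distribution whose mean is the site's predicted enrollment rate, and drop defaulting sites according to the predicted default likelihood. The rates, default probabilities, and function names below are hypothetical.

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw via Knuth's algorithm (adequate for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_time_to_target(sites, target, n_runs=500, max_months=120, seed=7):
    """sites: list of (monthly_rate, default_likelihood) pairs.
    Returns the median number of months to enroll `target` patients."""
    rng = random.Random(seed)
    months_needed = []
    for _ in range(n_runs):
        # A site that defaults contributes no patients in this run.
        active = [rate for rate, p_def in sites if rng.random() > p_def]
        total, month = 0, 0
        while total < target and month < max_months:
            month += 1
            total += sum(poisson(rng, rate) for rate in active)
        months_needed.append(month)
    months_needed.sort()
    return months_needed[len(months_needed) // 2]

# Ten sites, each expected to enroll ~1 patient/month with 10% default risk:
median_months = simulate_time_to_target([(1.0, 0.1)] * 10, target=50)
```

Collecting the full distribution of `months_needed` across runs, rather than only the median, yields the kind of statistically robust enrollment curves described herein.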
The visualization module 170 generates a visualization of the predictions generated by deploying top-performing machine learning models using the model deployment module 160 and/or by the stochastic model using the simulation module 165. In various embodiments, the visualization module 170 generates a visualization of the predicted site enrollment 225 and of the predicted site default likelihood 230 for the clinical trials sites generated by the top-performing models by the model deployment module 160. For example, the visualization module 170 may present the predicted site enrollment 225 and the predicted site default likelihood 230 in a quadrant graph. In various embodiments, the visualization module 170 generates a visualization of the enrollment timeline prediction 245 generated by the stochastic model 240 in a graph that includes statistically-robust enrollment curves. Examples of visualizations are shown in 8-19 described below in the context of specific examples. Similar visualizations may be generated in relation to other executions of the site prediction system 130.
The input data store 175 stores clinical trial data (e.g., clinical trial data 120 in
The trained models store 180 stores trained machine learning models (e.g., GLM, GBM, XGBoost, stacked ensembles, deep learning, etc.) for selection and implementation in the deployment phase.
The output data store 185 stores the site predictions (e.g., site predictions 140 in
In various embodiments, the components of the site prediction system 130 are applied during one of the training phase and the deployment phase. For example, the model training module 155 is applied during the training phase to train a model. Additionally, the model deployment module 160 is applied during the deployment phase. In various embodiments, the components of the site prediction system 130 can be performed by different parties depending on whether the components are applied during the training phase or the deployment phase. In such scenarios, the training and deployment of the prediction model are performed by different parties. For example, the model training module 155 and training data applied during the training phase can be employed by a first party (e.g., to train a model) and the model deployment module 160 applied during the deployment phase can be performed by a second party (e.g., to deploy the model). Training models and deploying models are described in further detail below.
Embodiments described herein include methods for generating a site prediction for one or more clinical trial sites by applying one or more trained models to analyze selected features of the input data related to the one or more clinical trial sites. Such methods can be performed by the site prediction system 130 described in
As shown in
As shown in
In various embodiments, the selected features may include features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics. In particular embodiments, the selected features include state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and/or facility address. In particular embodiments, the selected features associated with historical site enrollment metrics include statistical measures, such as any of a minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median values. In particular embodiments, the selected features associated with historical site enrollment metrics includes at least 1, 3, 5, or 10 of minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of number of enrolled patients for a trial, number of patients consented for a trial, number of patients completed a trial, number of patients that failed screening for a trial, and agility (e.g., time it took to start recruiting in a trial) over a reference time window at a reference entity. In particular embodiments, the reference time is at least 1 year, at least 3 years, at least 5 years, or at least 10 years. In various embodiments, the reference time can be any time period. In particular embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
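The historical-enrollment statistics named above (minimum, maximum, median, EWMA) can be computed for a single site as in this sketch; the enrollment history and the smoothing factor are illustrative assumptions.

```python
import statistics

def ewma(values, alpha=0.5):
    """Exponentially weighted moving average of a chronologically
    ordered sequence (oldest first); alpha is the smoothing factor."""
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1 - alpha) * avg
    return avg

def site_history_features(values):
    """Summary statistics of a site's per-trial enrollment counts
    over a reference time window."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "ewma": ewma(values),
    }

# Enrollment counts for a site's last four trials in the window:
features = site_history_features([12, 8, 15, 20])
# {'min': 8, 'max': 20, 'median': 13.5, 'ewma': 16.25}
```

The same statistics would be computed per reference entity (site, investigator, country, etc.) and per metric (enrolled, consented, completed, screen failures, agility).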
As shown in
In various embodiments, the top-performing MLMs 220A and 220B were previously trained using training data, as is described in further detail herein. In various embodiments, the training data can be historical trial data. In various embodiments, the top-performing MLMs 220A and 220B were previously determined by training a plurality of machine learning models, and by selecting the top-performing model of the trained machine learning models to predict site enrollment (e.g., number of patients a site will enroll) and/or site default likelihood (e.g., how likely a site is to enroll zero patients or fewer patients than a predetermined threshold) for a specific disease indication. For example, the top-performing MLM 220A may be the best-performing MLM for generating predicted site enrollment 225, and the top-performing MLM 220B may be the best-performing MLM for generating site default likelihood 230. In various embodiments, the top-performing MLMs 220A and 220B are constructed as a single model. For example, MLMs 220A and 220B are constructed as a single model, which outputs predicted site enrollment 225 and predicted site default likelihood 230. In various embodiments, the top-performing MLMs 220A and 220B are separate models. In various embodiments, the top-performing MLMs 220A or 220B are independently any one of a random forest model, an extremely randomized trees (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network). In various embodiments, MLM 220A is a regression model that predicts a continuous value representing the predicted site enrollment 225. In various embodiments, MLM 220B is a classifier that predicts a classification representing the predicted site default likelihood 230 (e.g., default or no default).
In various embodiments, the predicted site enrollment 225 represents a “number enrolled” variable, and the predicted site default likelihood 230 represents a “site default” variable. In various embodiments, a “site default” variable that is equal to zero refers to a site that enrolled more than one patient, and thus the site has not defaulted. In various embodiments, a “site default” variable that is equal to 1 refers to a site that enrolled zero or one patient, and thus the site has defaulted. In various embodiments, the “number enrolled” variable refers to number of patients enrolled at a site.
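Following the definitions above, the two target variables can be derived from a site's historical enrollment count as in this small sketch (the field names are illustrative, not the source schema):

```python
def make_targets(patients_enrolled):
    """Encode the "number enrolled" and "site default" target variables:
    a site defaults when it enrolled zero or one patient."""
    return {
        "number_enrolled": patients_enrolled,
        "site_default": 1 if patients_enrolled <= 1 else 0,
    }

make_targets(0)   # {'number_enrolled': 0, 'site_default': 1}
make_targets(14)  # {'number_enrolled': 14, 'site_default': 0}
```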
In various embodiments, the predicted site enrollment 225 includes enrollment rate (e.g., number of patients per site per month/year) and/or agility (e.g., time required for a site to start up and begin recruitment).
In various embodiments, the predicted site enrollment 225 and predicted site default likelihood 230 are validated by using one or more of the historical clinical trial data and/or prospective clinical trial data.
The predicted site enrollment 225 and predicted site default likelihood 230 can be used to generate predicted site rankings 235. In various embodiments, the predicted site enrollment 225 and predicted site default likelihood 230 are compared to one or more threshold values to generate predicted site rankings 235. For example, the predicted site enrollment 225 for a site can be compared to a first threshold value and the predicted site default likelihood 230 for a site can be compared to a second threshold value. Generally, a site that has a predicted site enrollment that is above the first threshold value and a predicted site default likelihood that is below the second threshold value will be ranked more highly than another site in which either the predicted site enrollment is below the first threshold or the predicted site default likelihood is above the second threshold.
In various embodiments, the first threshold value and the second threshold value are statistical measures. A statistical measure can be a mean value, a median value, or a mode value. For example, the first threshold value can be the median site enrollment across historical data of all clinical trial sites or a specified value (e.g., a value in the top-performing quadrant or quartile). The second threshold value can be the median predicted site default likelihood across historical data of all clinical trial sites or a specified value (e.g., a value in the low-performing quadrant or quartile). In various embodiments, the first threshold value and the second threshold value are fixed values. For example, the first threshold value may be a fixed value of at least 1 enrolled patient, at least 2 enrolled patients, at least 3 enrolled patients, at least 4 enrolled patients, at least 5 enrolled patients, at least 6 enrolled patients, at least 7 enrolled patients, at least 8 enrolled patients, at least 9 enrolled patients, at least 10 enrolled patients, at least 15 enrolled patients, at least 20 enrolled patients, at least 25 enrolled patients, at least 30 enrolled patients, at least 35 enrolled patients, at least 40 enrolled patients, at least 50 enrolled patients, at least 75 enrolled patients, at least 100 enrolled patients, at least 200 enrolled patients, at least 300 enrolled patients, at least 400 enrolled patients, at least 500 enrolled patients, or at least 1000 enrolled patients.
As another example, the second threshold value may be a fixed value of less than 30% likelihood of default, less than 25% likelihood of default, less than 20% likelihood of default, less than 15% likelihood of default, less than 14% likelihood of default, less than 13% likelihood of default, less than 12% likelihood of default, less than 11% likelihood of default, less than 10% likelihood of default, less than 9% likelihood of default, less than 8% likelihood of default, less than 7% likelihood of default, less than 6% likelihood of default, less than 5% likelihood of default, less than 4% likelihood of default, less than 3% likelihood of default, less than 2% likelihood of default, or less than 1% likelihood of default.
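One way to realize the two-threshold ranking described above is sketched below: sites meeting both thresholds sort ahead of the rest, with predicted enrollment (descending) and default likelihood (ascending) as tie-breakers. The function name, threshold values, and predictions are illustrative assumptions.

```python
# Hypothetical sketch of ranking sites by two predictions and two thresholds.
def rank_sites(predictions, enroll_threshold, default_threshold):
    """predictions: {site: (predicted_enrollment, default_likelihood)}.
    Sites above the enrollment threshold and below the default threshold
    are ranked ahead of all other sites."""
    def key(site):
        enroll, default = predictions[site]
        favored = enroll >= enroll_threshold and default < default_threshold
        return (not favored, -enroll, default)
    return sorted(predictions, key=key)

preds = {
    "site_a": (30, 0.05),  # high enrollment, low default risk
    "site_b": (40, 0.40),  # highest enrollment, but risky
    "site_c": (25, 0.08),
}
ranking = rank_sites(preds, enroll_threshold=20, default_threshold=0.10)
# ['site_a', 'site_c', 'site_b'] -- site_b drops despite its raw enrollment
```

The same boolean (`favored`) also identifies the top-performing quadrant when the two predictions are plotted against the two thresholds.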
In various embodiments, the predicted site rankings 235 is a list of all ranked sites. In various embodiments, the predicted site rankings 235 is a list of selected top-ranked clinical trial sites. In various embodiments, each of the top-ranked clinical trial sites included in the predicted site rankings 235 has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value.
In various embodiments, the predicted site rankings 235 is a list of at least 3 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 5 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 10 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 20 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 is a list of at least 50 top-ranked clinical trial sites. In various embodiments, the predicted site rankings 235 includes corresponding contact information useful for feasibility stakeholders, such as address, country, investigator, contact information, and other suitable information of each site listed in the predicted site rankings 235.
The predicted site enrollment 225 and predicted site default likelihood 230 can be used as an input to a stochastic model 240 (e.g., Monte Carlo simulation) to generate a plurality of quantitative values informative of enrollment timeline predictions 245.
In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll a number of patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 5 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 10 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 50 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 100 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 500 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 1000 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll at least 2000 patients. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes time to enroll a range of 50-1000 patients.
In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a time period. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 1 month. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 4 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 6 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 12 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 18 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in 24 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a range of 18-24 months. In various embodiments, the plurality of quantitative values informative of enrollment timeline prediction 245 includes number of patients enrolled in a range of 3-48 months.
In particular embodiments, the plurality of quantitative values informative of predicted enrollment performance comprises one or more of time to enroll 500 patients, number of patients enrolled in 4 months, number of patients enrolled in 12 months, number of patients enrolled in 18 months, or number of patients enrolled in 24 months.
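Given a simulated cumulative enrollment curve, the quantitative values above can be read off directly, as in the following sketch; the curve values are hypothetical simulation output, not data from the source.

```python
# Sketch: extract timeline metrics from a cumulative enrollment curve,
# where curve[m] is the total patients enrolled after month m+1.
def time_to_enroll(curve, target):
    """Month at which cumulative enrollment first reaches target,
    or None if the target is never reached within the curve."""
    for month, total in enumerate(curve, start=1):
        if total >= target:
            return month
    return None

def enrolled_by(curve, month):
    """Cumulative patients enrolled by the given month."""
    return curve[month - 1] if month <= len(curve) else curve[-1]

curve = [20, 55, 110, 190, 290, 410, 520]  # hypothetical monthly totals
time_to_enroll(curve, 500)  # -> 7 (500 patients reached in month 7)
enrolled_by(curve, 4)       # -> 190 patients by month 4
```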
Reference is now made to
At step 260, input data comprising data of an upcoming trial protocol is obtained.
At step 265, for each of one or more clinical trial sites, one or more machine learning models (e.g., top-performing MLM 220A and 220B) are applied to selected features of the input data to generate a predicted site enrollment and a predicted site default likelihood for the clinical trial site. In various embodiments, the selected features are previously determined by performing feature engineering on historical clinical trial data (e.g., historical clinical trial data 310 in
At step 270, the one or more clinical trial sites are ranked according to the predicted site enrollment and the predicted site default likelihood for the one or more clinical trial sites.
At step 275, top-ranked clinical trial sites are selected from the ranked clinical trial sites. In various embodiments, each of the selected clinical trial sites has a predicted site enrollment above a first threshold value and a predicted site default likelihood below a second threshold value. In various embodiments, the first threshold value is a median predicted site enrollment across the one or more clinical trial sites. In various embodiments, the second threshold value is a median predicted site default likelihood across the one or more clinical trial sites.
At step 280, the predicted site enrollment and the predicted site default likelihood for the ranked clinical trial sites are visualized in a quadrant graph, and a site list of the selected top-ranked clinical trial sites is generated. The quadrant graph and/or the site list can be evaluated or provided to appropriate stakeholders for determining or selecting sites for an upcoming clinical trial.
At step 285, a plurality of quantitative values informative of enrollment timeline prediction is generated by applying a stochastic model (e.g., stochastic model 240 in
III. Training a Machine Learning Model for Deployment in a Site Prediction System
As shown in
In various embodiments, the historical clinical trial data 310 may be a subset of the input data (e.g., clinical trial data 120 in
In various embodiments, the historical clinical trial data 310 includes site level enrollment data and/or trial level data of a historical clinical trial. For example, the historical clinical trial data 310 includes enrollment number per site, default status (e.g., 0 or 1 patients were enrolled), enrollment rate (e.g., number of patients per site per month/year), enrollment dates such as agility (e.g., time required for a site to start up and begin recruitment) or enrollment period, etc., investigator names, site locations, trial sponsor, list of trial identifiers for disease indication, eligibility criteria, protocol information, trial dates (e.g., start date, end date, etc.), and/or site ready time of a historical clinical trial.
The historical clinical trial data 310 is processed (e.g., cleaned, integrated, and enriched) using the data processing module 145. In various embodiments, the data processing module 145 cleans the historical clinical trial data 310 by assessing each column of the historical clinical trial data 310, followed by cleaning methods such as standardizing date formats, removing null values, removing new line characters, cleaning column names, parsing or cleaning age criteria, and other appropriate cleaning steps.
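The per-column cleaning steps above can be sketched for a single record. This is a minimal illustration; the column names, null markers, and accepted date formats are assumptions for the example, not the module's actual schema.

```python
from datetime import datetime

def clean_record(record):
    """Sketch of per-column cleaning: standardize date formats to ISO 8601,
    remove null-like values, strip newline characters, clean column names."""
    cleaned = {}
    for column, value in record.items():
        if value in (None, "", "NA", "N/A", "null"):
            continue  # remove null values
        if isinstance(value, str):
            value = value.replace("\n", " ").strip()
        if column.lower().endswith("_date"):
            # accept a few common source formats, emit ISO 8601
            for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
                try:
                    value = datetime.strptime(value, fmt).date().isoformat()
                    break
                except ValueError:
                    continue
        cleaned[column.strip().lower()] = value
    return cleaned
```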
In various embodiments, the data processing module 145 integrates the cleaned historical clinical trial data 310 by merging datasets of the historical clinical trial data 310 based on the National Clinical Trial (NCT) number. The data processing module 145 may perform the integration and merging of datasets if the historical clinical trial data 310 includes multiple datasets that are obtained from multiple databases. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that each row includes trial performance for each site-investigator-pair. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there are multiple rows for each trial. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there are multiple rows for each site-investigator pair. In various embodiments, the cleaned historical clinical trial data 310 is integrated so that there is a unique row for each site-investigator performance for a given trial.
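The NCT-keyed integration can be sketched as a dictionary join that leaves one row per site-investigator pair per trial. Field names here are illustrative assumptions; a production pipeline would more likely use a dataframe merge.

```python
def merge_on_nct(site_rows, trial_rows):
    """Join site-level rows to trial-level metadata on the National
    Clinical Trial (NCT) number, producing one output row per
    site-investigator pair's performance in a given trial."""
    trials = {row["nct_id"]: row for row in trial_rows}
    merged = []
    for row in site_rows:
        trial = trials.get(row["nct_id"])
        if trial is None:
            continue  # site row with no matching trial record
        combined = {**trial, **row}  # site-level fields take precedence
        merged.append(combined)
    return merged
```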
In various embodiments, the data processing module 145 enriches the cleaned and integrated historical clinical trial data 310 by splitting inclusion and exclusion criteria, and/or standardizing names.
Generally, the feature engineering module 150 extracts features that are related to facilities or investigators of the processed historical clinical trial data 310, and selects top features by applying a random forest feature selection algorithm to identify high importance features that have feature importance values above a threshold value. In various embodiments, the feature engineering module 150 extracts features by converting or transforming tagged trial metadata (e.g., text, words) from the historical clinical trial data 310 into a numerical representation of a single value or vector of values using n-grams, TF-IDF, Word2Vec, GloVe, fastText, BERT, ELMo, or InferSent. In various embodiments, the feature engineering module 150 extracts time series features that capture historical performance of a site over the past M time periods.
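One of the text representations named above, TF-IDF, can be sketched from scratch for tagged trial metadata. This is a toy illustration assuming whitespace tokenization and the plain tf × log(N/df) weighting; a production system would use a library vectorizer with smoothing and normalization.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Minimal TF-IDF sketch: turn each text document (e.g., tagged trial
    metadata) into a numeric vector over the shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # document frequency per term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [(counts[t] / len(doc)) * math.log(n_docs / df[t])
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors
```

A term that appears in every document receives a weight of zero (log of one), so only discriminative metadata terms contribute to the feature vector.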
In various embodiments, the selected features may include features associated with geographic locations, protocol complexity, study design, competitive landscape, and historical site enrollment metrics. In particular embodiments, the selected features include state of clinical trial site, study title, conditions, country, heading, sponsor, outcome measures, features associated with historical site enrollment metrics, investigator, and/or facility address. In particular embodiments, the selected features associated with historical site enrollment metrics include at least 1, 3, 5, or 10 of: the minimum, maximum, exponentially weighted moving average (EWMA), weighted average, and median of the number of patients enrolled in a trial, the number of patients consented for a trial, the number of patients who completed a trial, the number of patients who failed screening for a trial, and agility (e.g., the time it took to start recruiting in a trial), each computed over a reference time window at a reference entity. In particular embodiments, the reference time window is at least 1 year, at least 3 years, at least 5 years, or at least 10 years. In particular embodiments, the reference entity is one of a site, an investigator, a country, a state, a city, or a selected location.
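The windowed enrollment metrics can be sketched for one reference entity. This example assumes a 2-year reference time window at a site-level reference entity; the feature names, the EWMA smoothing factor, and the (year, enrolled) input shape are illustrative.

```python
def window_features(trials, reference_years=2, now_year=2022, alpha=0.5):
    """Sketch of historical-enrollment features for one reference entity
    (e.g., a site) over a reference time window. `trials` is a list of
    (year, enrolled_patients) tuples."""
    in_window = sorted(
        (y, n) for y, n in trials if now_year - y <= reference_years
    )
    counts = [n for _, n in in_window]
    if not counts:
        return None
    # exponentially weighted moving average, most recent trial weighted most
    ewma = counts[0]
    for n in counts[1:]:
        ewma = alpha * n + (1 - alpha) * ewma
    return {
        "min_enrolled_2y": min(counts),
        "max_enrolled_2y": max(counts),
        "median_enrolled_2y": sorted(counts)[len(counts) // 2],
        "ewma_enrolled_2y": ewma,
    }
```

Repeating the computation across windows (1, 2, 5, 10 years) and entities (site, investigator, country, etc.) yields the large feature combinations described below.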
As a specific example, given a reference time window of 2 years and a reference entity of a site, example resulting features include:
As another specific example, given a reference time window of 1 year and a reference entity of an investigator, example resulting features include:
As another specific example, given a reference time window of 5 years and a reference entity of a country, example resulting features include:
In various embodiments, the feature engineering module 150 extracts or selects at least 3 features. In various embodiments, the feature engineering module 150 extracts or selects at least 5 features. In various embodiments, the feature engineering module 150 extracts or selects at least 10 features. In various embodiments, the feature engineering module 150 extracts or selects at least 50 features. In various embodiments, the feature engineering module 150 extracts or selects at least 100 features. In various embodiments, the feature engineering module 150 extracts or selects at least 500 features. In various embodiments, the feature engineering module 150 extracts or selects at least 1000 features. In various embodiments, the feature engineering module 150 extracts or selects at least 2000 features. In particular embodiments, the feature engineering module 150 extracts or selects at least 1700 features.
The model training module 155 trains the plurality of machine learning models (MLMs) 320A, 320B, 320C, etc. by providing the extracted features of the historical clinical trial data 310 as input. As shown in
In various embodiments, one or more of MLMs 320A, 320B, 320C, etc. are individually trained to minimize a loss function such that the output of each model is improved over successive training epochs. In various embodiments, the loss function is constructed for any of a least absolute shrinkage and selection operator (LASSO) regression, Ridge regression, or ElasticNet regression. In such embodiments, the dotted lines for the models shown in
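Minimizing a regularized loss over successive epochs can be illustrated with a toy single-feature Ridge regression trained by gradient descent. The data, learning rate, penalty weight, and epoch count are arbitrary examples, not the platform's actual configuration.

```python
def train_ridge(xs, ys, lam=0.1, lr=0.01, epochs=2000):
    """Toy Ridge regression: minimize mean squared error plus an L2
    penalty on the weight, improving the fit over successive epochs."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        grad_w += 2 * lam * w  # gradient of the L2 (Ridge) penalty
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Swapping the L2 penalty for an L1 term gives LASSO, and combining both gives ElasticNet, the other two regression losses named above.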
Generally, a machine learning model (MLM) is structured such that it analyzes input data or extracted features of input data associated with a clinical trial site and/or an upcoming trial protocol, and predicts site enrollment, site default likelihood, and/or other related output for clinical trial sites based on the input data. In various embodiments, the MLM is any one of a regression model (e.g., linear regression, logistic regression, or polynomial regression), decision tree, random forest, gradient boosted machine learning model, support vector machine, Naïve Bayes model, k-means cluster, or neural network (e.g., feed-forward networks, convolutional neural networks (CNN), deep neural networks (DNN), autoencoder neural networks, generative adversarial networks, or recurrent networks (e.g., long short-term memory networks (LSTM), bi-directional recurrent networks, deep bi-directional recurrent networks)), or any combination thereof. In particular embodiments, the MLM is any one of a random forest model, an extremely randomized trees (XRT) model, a generalized linear model (GLM), a gradient boosting machine (GBM), XGBoost, a stacked ensemble, and a deep learning algorithm (e.g., fully connected multi-layer artificial neural network).
The MLM can be trained using a machine learning implemented method, such as any one of a linear regression algorithm, logistic regression algorithm, decision tree algorithm, support vector machine classification, Naïve Bayes classification, K-Nearest Neighbor classification, random forest algorithm, deep learning algorithm, gradient boosting algorithm, and dimensionality reduction techniques such as manifold learning, principal component analysis, factor analysis, autoencoder regularization, and independent component analysis, or combinations thereof. In particular embodiments, the machine learning implemented method is a logistic regression algorithm. In particular embodiments, the machine learning implemented method is a random forest algorithm. In particular embodiments, the machine learning implemented method is a gradient boosting algorithm, such as XGBoost. In various embodiments, the model is trained using supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms (e.g., partial supervision), weak supervision, transfer learning, multi-task learning, or any combination thereof.
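The logistic regression embodiment for the binary default target can be sketched as a toy classifier over one feature. Here the feature is a site's historical enrollment expressed as the deviation from the cohort mean (so the decision boundary sits near zero); the data, learning rate, and epoch count are illustrative assumptions.

```python
import math

def train_logistic(features, labels, lr=0.1, epochs=3000):
    """Toy logistic regression for a site-default classifier: fit a
    sigmoid over one feature against a binary default label, then
    return a function mapping the feature to a default probability."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted default prob
            gw += (p - y) * x / n
            gb += (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
```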
In various embodiments, the MLM for analyzing selected features of the input data may include parameters, such as hyperparameters or model parameters. Hyperparameters are generally established prior to training. Examples of hyperparameters include the learning rate, depth or leaves of a decision tree, number of hidden layers in a deep neural network, number of clusters in a k-means cluster, penalty in a regression model, and a regularization parameter associated with a cost function. Model parameters are generally adjusted during training. Examples of model parameters include weights associated with nodes in layers of a neural network, support vectors in a support vector machine, node values in a decision tree, and coefficients in a regression model. The model parameters of a machine learning model are trained (e.g., adjusted) using the training data to improve the predictive capacity of the machine learning model.
Embodiments disclosed herein are useful for identifying clinical trial sites that are likely to be high performing clinical trial sites. Thus, these high performing clinical trial sites can be enrolled in a clinical trial for investigating therapeutics for a variety of disease indications. In various embodiments, a disease indication for a clinical trial can include any one of immunology, cardiovascular and metabolic diseases, infectious diseases, oncology, and neuroscience. In particular embodiments, the disease indication is any one of Crohn's disease, lupus, diabetic kidney disease, lung cancer, or respiratory syncytial virus (RSV). Example clinical trials supported among the different therapeutic areas are: Tremfya for Crohn's Disease and Stelara for Lupus (Immunology), Invokana for DKD (CVM), JNJ 61186372/Lazertinib for Lung Cancer (Oncology) and VAC18193 for RSV (IDV).
The methods of the disclosed embodiments, including the methods of implementing MLM and a stochastic simulation for predicting clinical trial sites, are, in some embodiments, performed on one or more computers.
For example, the building and deployment of a MLM or a stochastic simulation can be implemented in hardware or software, or a combination of both. In one embodiment of the invention, a machine-readable storage medium is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of executing the training or deployment of machine learning models and/or displaying any of the datasets or results described herein. The embodiments can be implemented in computer programs executing on programmable computers, comprising a processor, and a data storage system (including volatile and non-volatile memory and/or storage elements). Some computing components (e.g., those used to display the user interfaces described herein) may include additional components such as a graphics adapter, a pointing device, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.
Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage media or device readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information. The databases of the present invention can be recorded on computer readable media, e.g., any medium that can be read and accessed directly by a computer. Such media include, but are not limited to, magnetic storage media such as hard disc storage media and magnetic tape; optical storage media; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.
In some embodiments, the methods of the invention, including the methods for predicting enrollment of clinical trial sites, are performed on one or more computers in a distributed computing system environment (e.g., in a cloud computing environment). In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared set of configurable computing resources. Cloud computing can be employed to offer on-demand access to the shared set of configurable computing resources. The shared set of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly. A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
The storage device 408 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 406 holds instructions and data used by the processor 402. The input interface 414 is a touch-screen interface, a mouse, track ball, or other type of pointing device, a keyboard, or some combination thereof, and is used to input data into the computer 400. In some embodiments, the computer 400 may be configured to receive input (e.g., commands) from the input interface 414 via gestures from the user. The network adapter 416 couples the computer 400 to one or more computer networks.
The graphics adapter 412 displays representation, graphs, tables, and other information on the display 418. In various embodiments, the display 418 is configured such that the user (e.g., data scientists, data owners, data partners) may input user selections on the display 418 to, for example, predict enrollment for a clinical trial site for a particular disease indication or order any additional exams or procedures. In one embodiment, the display 418 may include a touch interface. In various embodiments, the display 418 can show one or more predicted enrollments of a clinical trial site. Thus, a user who accesses the display 418 can inform the subject of the predicted enrollment of a clinical trial site.
The computer 400 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 408, loaded into the memory 406, and executed by the processor 402.
The types of computers 400 used by the entities of
Further disclosed herein are systems for implementing MLMs for generating site predictions for clinical trial sites. In various embodiments, such a system can include at least the site prediction system 130 described above in
As shown in
As shown in
As shown in
IX. Example 2: Example Site Prediction Systems and Methods for a Particular Disease (Lupus)
As shown in
As shown in
As shown in
As shown in
Generating the many different combinations of features and time windows has improved the performance of the models.
While various specific embodiments have been illustrated and described, the above specification is not restrictive. It will be appreciated that various changes can be made without departing from the spirit and scope of the present disclosure(s). Many variations will become apparent to those skilled in the art upon review of this specification.
All references, issued patents and patent applications cited within the body of the instant specification are hereby incorporated by reference in their entirety, for all purposes.
This application claims the benefit of U.S. Provisional Application No. 63/242,753 filed on Sep. 10, 2021, which is incorporated by reference herein.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/IB2022/058525 | 9/9/2022 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63242753 | Sep 2021 | US |