Research and Development (R&D) comprises investigative activities that a business or other organization conducts with the intention of making discoveries that can lead either to the development of new products or procedures or to the improvement of existing ones. R&D may proceed in a linear or non-linear manner and typically involves several steps over long periods of time.
Every field of industry engages in extensive efforts of Research and Development for New Product Development. In many industries, such R&D may last for years or decades and costs may reach or exceed the multi-billion dollar range (as for example in Pharmaceutical development, Defense and other fields of application). A major problem in managing such R&D is that of optimally allocating resources to competing R&D activities since it is not generally known which research activities are most likely to “convert” to scientific-technological results that facilitate new products. Another problem is to accelerate the successful R&D efforts and eliminate the unsuccessful ones as early as possible.
For example, in the Life Sciences, “Translational Research” describes the research activities that eventually lead to practical applied innovations such as new diagnostic technologies and products, new drugs, and improvements in the guidelines that determine the standard of care. Both private industry (e.g., pharmaceutical companies) and the public sector (e.g., federal funding agencies such as the NIH) face the pressing problem of allocating limited resources to a small number of efforts out of many candidate R&D initiatives. In many cases, one has to decide which R&D programs that have yielded partial results should be prioritized over other incomplete or yet-to-begin ones. In addition, since time-to-market directly affects profitability (e.g., to the tune of >1 billion USD/year for “blockbuster” drugs), it is highly desirable to accelerate the R&D that is likely to be successful and to eliminate the R&D that is likely to be unsuccessful as early as possible.
The same considerations are true for all industries where R&D plays a significant role in New Product Development (NPD). Examples include: electronics, telecommunications, computer and information technology, defense, aeronautics, aviation and aerospace, Internet commerce, financing and investing, energy, automotive and transportation, marketing and advertising to name a few.
The present invention provides a method, process and apparatus for:
Users of the invention may use it for:
Table 1 lists example input features for Model Training in the Translational Research Field of Application. These features can either be content-based or meta-data (e.g., bibliometric) features. Content features are based on document content such as the title or abstract. Bibliometric features are information based on the authors, publication, or other metadata.
Table 2 lists the top 10 important features for two use cases with different training corpora in Translational Research Field of Application.
The invention method comprises 3 stages, which are implemented in the system described and claimed.
I. Knowledge Base Creation & Configuration to the Specific Field of Application
Creating this Knowledge Base involves the following elements:
II. Ex Post Facto R&D Success Model and Corresponding Decision Support System
Creating this model and decision support system involves the following elements:
The model can now be used to assess retroactively (i.e., “historically”) the impact of a stage of R&D on successful endpoints by using standard graph algorithms to determine all paths from a stage or stages of interest to one or more success exemplars of interest. The existence of one or more paths is direct evidence that a stage of R&D had an impact on the success of the overall effort; the absence of any path is evidence of lack of impact. Other ways to describe and infer macro properties of the R&D process modelled by the graph model, and to identify critical components, include a variety of standard Network Science analytics tools (e.g., clustering coefficient, hubs, percent shortest path, characteristic path length, betweenness centrality, clusters, etc.).
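As an illustration of the path-based assessment described above, the following is a minimal sketch, assuming the dependence network is stored as an adjacency mapping in which an edge from A to B means that stage B builds on stage A. The node names and the helper `all_paths` are hypothetical and are not part of the claimed system.

```python
from typing import Dict, List


def all_paths(graph: Dict[str, List[str]], start: str, goal: str) -> List[List[str]]:
    """Enumerate every simple path from start to goal with an iterative depth-first search."""
    paths: List[List[str]] = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # skip nodes already on this path so cycles cannot loop forever
                stack.append((nxt, path + [nxt]))
    return paths


# Hypothetical dependence network (illustrative names only).
network = {
    "basic_finding": ["assay", "animal_model"],
    "assay": ["phase1_trial"],
    "animal_model": ["phase1_trial"],
    "phase1_trial": ["approved_drug"],
    "side_project": [],
}

paths = all_paths(network, "basic_finding", "approved_drug")
no_paths = all_paths(network, "side_project", "approved_drug")
```

Here a non-empty `paths` would be read as evidence that the stage contributed to the success exemplar, while an empty `no_paths` would be read as evidence of no impact.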
III. Prospective Predictive R&D Success Model and Corresponding Decision Support System
Creating this model and decision support system involves the following elements:
The steps followed include:
The provided technical report (attached hereto as Appendix 1, and incorporated herein by reference) provides details of the method as applied to the specific field of application of R&D for the Life Sciences (also commonly labeled as “Translational Research”). It demonstrates empirically that the invention leads to accurate predictions and in depth understanding of R&D process in a real-life complex domain (that of translational biomedical research leading to new drug development).
Differences from General-Purpose Text Categorization and Classification Methods
These endpoint exemplars are NOT training exemplars for predictive modeling but need to be coupled with the dependence network that tracks paths from any stage of interest to the endpoint exemplars.
While the invention has been described in its preferred embodiments, it is to be understood that the words which have been used are words of description rather than of limitation and that changes may be made within the purview of the appended claims without departing from the true scope and spirit of the invention in its broader aspects. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the spirit of the invention. The inventors further require that the scope accorded their claims be in accordance with the broadest possible construction available under the law as it exists on the date of filing hereof (and of the application from which this application obtains priority, if any) and that no narrowing of the scope of the appended claims be allowed due to subsequent changes in the law, as such a narrowing would constitute an ex post facto adjudication, and a taking without due process or just compensation.
Predicting and Understanding Success of Translational Research and Development in The Life Sciences
Translational research is a notoriously hard endeavor that requires significant amounts of time and effort, and it is currently poorly understood from a process perspective. The goal of this work is to improve the understanding of the process, eventually leading to improved efficiency of translation. Our overarching program seeks to: (a) develop a quantitative predictive framework for large-scale modeling of translational research, (b) use the framework to identify examples of translational success, and (c) analyze these cases to determine factors that lead to translational success. Our approach utilizes a Markov process methodology combined with custom citation analysis and special-purpose predictive modeling (comprising task-customized, machine-learning-based text categorization techniques). We demonstrate the feasibility of the approach by constructing accurate models to predict translational success based on analysis of the biomedical literature. Our experimental results show that this methodology can predict translational success with high accuracy. These initial results provide a foundation for future work that will quantitatively and accurately model the entire translational research process. Because the approach is not domain specific, it can be used for R&D processes across domains.
Translating basic science discoveries into clinical care is a lengthy, expensive, and currently poorly understood process. For example, it requires 13 years to produce a new drug after target discovery, the failure rate exceeds 95%, and the cost exceeds $1 billion [1]. Incorporating new knowledge into clinical care requires additional time after developing a treatment, and the entire translational process requires about 17 years [2], [3]. From the public funding point of view, significant research and resources have been dedicated to improving the efficiency of translational research. Examples are the Clinical and Translational Science Awards (CTSA) and the National Center for Advancing Translational Science (NCATS). From the private industry R&D investment viewpoint, the ability to prioritize correctly among competing R&D investment targets is essential. In medical research the translational process is conceptualized as comprising 4 stages (T1, T2, T3, T4) that span from discoveries in the lab all the way to delivery at the bedside and in the patient community. Translational success takes a long time because translational science is a complicated research enterprise spanning many research domains. It is currently very difficult to anticipate which basic science discoveries will impact clinical care. One reason why such a prediction is very hard is that there are relatively few examples of translational success compared to the total volume of commercial R&D activity and the scientific literature.
The unpredictability of translational research makes it difficult to evaluate the effectiveness of efforts to accelerate translation. If public funders or industry allocate resources to one area, there is no guarantee that the translational results will materialize or even that the process will speed up. Current methods for resource allocation are based on fundamentally heuristic or otherwise unproven assumptions. A classic example is the debate over the best relative allocation of funds between basic and clinical science. If more funding and resources are allocated to one area (e.g., basic science), proponents of that area argue that translation should accelerate. However, unobserved bottlenecks or unanticipated consequences may hinder the entire process and nullify this intuitive rationale. For example, the benefits of an increased rate of basic science discoveries could be offset by a decreased rate of clinical research due to a relatively smaller allocation of funds for clinical research, or due to shifts in the talent pool distribution between the two areas (among many other hard-to-predict factors).
In short, one cannot currently answer major policy, investment and planning questions without a much more detailed, quantitative, and reproducibly predictive understanding of the entire translational process.
An accurate model for future translational success would determine the factors that led to translational success so that translational research could become a repeatable, predictable process. The model would also identify high-impact research based on its likelihood of impacting clinical care. Such models could be used to allocate resources in a targeted, principled manner.
A number of conceptual frameworks exist for modeling translational research [4]-[7]. The specifics of each framework vary, but they share some common characteristics that Trochim combines into a “process marker model” [7]. This conceptual framework designates progress milestones throughout translational research. For example, one marker indicates when individual clinical studies are synthesized into general knowledge through meta-analyses, systematic reviews, and guidelines. The elapsed time is measured by comparing publication dates of the initial article and guideline. Existing frameworks such as this one are useful for a general understanding of translational research, but they are limited in their usability since they are not designed to be operational or to be used quantitatively for large-scale analysis.
It is worth mentioning two additional frameworks for studying the efficiency of translational research. The “Payback Model” literature quantifies research outputs to measure efficient use of funding [8]-[10]. Comroe and Dripps [11] evaluated the contribution of basic science research and clinical research to translational research, and this motivated a number of similar studies [12]-[16]. These two approaches are limited since they rely on manual literature reviews and case studies. The findings are not scalable since analyzing many topics and time periods is not feasible with manual review.
To summarize prior efforts in this area, the existing work on modeling translational research has identified several factors but has several practical weaknesses: (a) it relies on manual review of the literature, which requires significant time and effort; (b) its findings are not necessarily generalizable, since it is not feasible in practice to study more than a few topics; and (c) it has not produced reproducible predictive models that provide concrete estimates of future success that can be used in formal ROI and risk modeling analyses. A comprehensive, quantitative and scalable framework for modeling translational research is needed to thoroughly understand and rationally plan translational research. The ideal model should be an automated, computational method with no or minimal manual literature review so that it can span many topics and time periods. The model should enable large-scale analysis and provide accurate predictive information.
The current work proposes a methodology for modeling translational research that fulfills these requirements. The model is based on an automatically generated citation network. Translational success is indicated by a citation path between basic science research and the clinical literature. Citation information is automatically extracted from publications. Multiple states of translational progress are defined using a Markov process formulation where the probability of transitioning to a given state only depends on its previous state. Using the citation network and Markov process framework, we train machine learning models to predict which articles are likely to lead to translational success. Although we define multiple states of translational progress, this work focused on direct transition of the initial basic science discovery to translational success. This transition is equivalent to predicting which research results represented by scientific papers will lead to translational success. The long-term plan is to model all transitions as well as the entire translational research process. The preliminary results demonstrate that it is possible to train machine learning models capable of predicting translational success and confirm the feasibility of modeling translational research using this framework.
We first describe an ex post facto framework for capturing translational success using strictly citation information. We then describe a more nuanced and semantically more informative Markov Process model that can model explicitly various intermediate steps of the translational process. We finally operationalize modeling by constructing and evaluating a truly prospective predictive model for long term translational success.
In the medical scientific domain successful translational paths from basic discoveries to clinical deployment can be traced over time using citation paths between the basic science and clinical literatures. Most papers are not cited, and even fewer papers eventually impact clinical care. Yet the existing citations are numerous and form a graph reflecting vastly complicated relationships over time. Some of the relations are explicit (e.g., what paper cites which papers) and some are implicit (i.e., sets of papers describe loosely coordinated and interacting programs of research conducted by groups of researchers in several sites over time).
Because citations occur for a large number of reasons, most of the citation paths and relationships are not relevant to translational success, however.
This is more of an attribution framework that seeks to explain which discoveries had impact toward a clinical modality of interest.
Limitations of this framework include:
The next two modeling refinements (sections B and C) remove the limitations of the ex post facto citation model.
So far we utilized the fact that translational research is the transmission of knowledge from basic science research to clinical care, and this process is observable through citations. We operationally defined translational success as evidence that basic science research impacted clinical care, and this evidence is a citation path from the clinical literature (e.g., clinical guideline or clinical trial) to a basic science article. The citation path may be indirect with multiple articles connected by citations. This is the main value of the ex post facto citation model.
Other document characteristics and metadata may also provide useful information in addition to citations. Other intermediate states of translational progress also exist. We use these observations to improve upon the basic citation model using a Markov process, which consists of states and transition probabilities. The probability of transitioning to a given state only depends on the previous state. In other words, we assume that the likelihood of research leading to translational success depends on its current state of maturity and not prior steps leading to the current state of progress. The Markov process framework allows us to make useful inferences using a variety of mathematical tools such as calculating the probability of transitioning to a given state (e.g., probability that an article will lead to translational success). Later in this report we use machine learning models to provide the necessary transition probabilities.
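To make the kind of inference mentioned above concrete, the following is a minimal sketch of computing the probability of eventually reaching a success state in a small Markov chain. The state names and transition probabilities are invented for illustration only; they are not the states or estimates used in this report, where the transition probabilities would be supplied by the machine learning models.

```python
from typing import Dict

# Hypothetical states and transition probabilities; every row sums to 1.
# "success" and "abandoned" are absorbing: they only transition to themselves.
transitions: Dict[str, Dict[str, float]] = {
    "discovery": {"discovery": 0.5, "clinical_study": 0.3, "abandoned": 0.2},
    "clinical_study": {"clinical_study": 0.4, "success": 0.4, "abandoned": 0.2},
    "success": {"success": 1.0},
    "abandoned": {"abandoned": 1.0},
}


def success_probability(chain: Dict[str, Dict[str, float]],
                        target: str = "success",
                        sweeps: int = 1000) -> Dict[str, float]:
    """Estimate, for every state, the probability of eventually reaching `target`.

    Repeatedly applies h(s) = sum over next states s' of P(s, s') * h(s'),
    holding h(target) fixed at 1, until the values settle.
    """
    prob = {state: (1.0 if state == target else 0.0) for state in chain}
    for _ in range(sweeps):
        for state, row in chain.items():
            if state != target:
                prob[state] = sum(p * prob[nxt] for nxt, p in row.items())
    return prob


prob = success_probability(transitions)
```

Because each update uses only the current state's outgoing probabilities, the calculation relies on exactly the Markov assumption stated above: the path by which research reached its current state does not enter the computation.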
In using the literature for analysis of translational success, articles are mapped to Markov process states based on publication metadata or citation information (e.g., types of papers citing it or cited by it, content, etc.). In many cases it is reasonable to operationally model translational success as occurring if there is a citation path (i.e., papers connected by citations) between an article and a document demonstrating clinical impact such as a clinical guideline or clinical trial. On the other hand, if, for example, a mathematical discovery contributes to clinical progress but is never cited by the clinical literature, the model fails to capture that success because the two literatures are disconnected. More expansive operational criteria can be used to address such limitations.
By definition, a Markov process satisfies the Markov property, which is defined as follows:
$$\Pr(X_{n+1}=x \mid X_1=x_1, X_2=x_2, \ldots, X_n=x_n) = \Pr(X_{n+1}=x \mid X_n=x_n)$$
The probability of a state transition, Pr(X_{n+1}=x), depends only on the previous state (i.e., X_n=x_n). We use the following Markov process states as a useful starting point for modeling translational research. This list is not intended to be exhaustive or definitive for this application domain, and we anticipate that it can be improved over time.
The state transitions are shown in
This transition represents the case where research stalls but eventually is successful. Ideally we want to avoid prematurely abandoning research.
In the next and final methods section we introduce the process for building a predictive model for future translational success that focuses on the most important transition of the initial discovery to translational success (i.e., transition 1 above). The purpose is to demonstrate the feasibility of modeling translational research as a Markov process and to use this framework to predict “macro-level” translational success. This transition has its own intrinsic value independent of the other transitions. Modeling other transitions follows the same methodology and will not be repeated.
We use a machine learning approach similar to prior work of ours and others in text categorization and article and citation classification. The usual steps involve operational definition of positives/negatives, data design and capture, model selection and error estimation. Because of the nature of the modeling several task-specific modifications to standard protocols had to be introduced as we explain below.
A number of data design issues were considered to decide which articles to include in the training corpus. For example, modeling short-term impact would include recently published articles (e.g., up to 5 years old) while modeling long-term impact would require older articles. Modeling direct impact would require articles cited directly by the clinical literature, but modeling indirect impact would involve multiple citation levels. The operational definition for success that we chose for the present modeling (i.e., cited by the clinical literature) determined that only articles representing direct impact would be included regardless of the age. The topic of the article was also considered.
Using a citation network restricted to a specific topic is a different modeling task than using a network containing multiple topics. We focused on the literature involving the cancer testis antigen NY-ESO-1, which has led to targeted molecular treatments for cancer. NY-ESO-1 is a recent advancement that is clinically relevant and an ideal example of translational success. Also, the NY-ESO-1 literature is relatively small, with 551 MEDLINE articles, so it can be examined manually to debug the modeling process if necessary.
We defined a number of use cases to guide training corpus construction.
Corpus construction started with a seed set of examples of translational success.
We first identified 31 MEDLINE-indexed clinical trials about NY-ESO-1. Corpora were constructed for Use Case 1 (i.e., predicting which NY-ESO-1 articles will be cited by NY-ESO-1 clinical trials) and Use Case 2 (i.e., predicting which articles about any topic will be cited by NY-ESO-1 clinical trials). We consider use cases 1 and 2 for the present experiments. Article bibliographies were parsed to identify articles at citation level 1 (i.e., articles cited directly by the seed set, and thus connected to it by 1 citation).
The articles that were cited by the clinical trials were labeled as positive cases since there was a citation link connecting them to translational success. Negative examples were collected by randomly selecting articles from the same journal and volume as the positive cases. This procedure ensured that negative cases were from similar domains and same time frame as the positive cases. Negatives were also restricted to the NY-ESO-1 topic for use case 1. It was verified that the negatives were not previously included as positive cases.
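The negative-sampling procedure above can be sketched as follows. This is an illustrative sketch only: the record fields (`id`, `journal`, `volume`), the choice of one negative per positive, and the fixed random seed are assumptions made here, not details taken from the report.

```python
import random
from typing import Dict, List


def sample_negatives(positives: List[Dict], candidates: List[Dict], seed: int = 0) -> List[Dict]:
    """For each positive article, draw one random article from the same journal and volume.

    Articles already labeled positive, or already drawn as negatives, are excluded.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    positive_ids = {article["id"] for article in positives}
    used = set()
    negatives = []
    for pos in positives:
        pool = [
            c for c in candidates
            if c["journal"] == pos["journal"]
            and c["volume"] == pos["volume"]
            and c["id"] not in positive_ids
            and c["id"] not in used
        ]
        if pool:  # skip positives with no eligible match
            pick = rng.choice(pool)
            used.add(pick["id"])
            negatives.append(pick)
    return negatives


# Hypothetical records; journal names and volumes are placeholders.
positives = [
    {"id": "p1", "journal": "Journal A", "volume": 160},
    {"id": "p2", "journal": "Journal B", "volume": 58},
]
candidates = positives + [
    {"id": "n1", "journal": "Journal A", "volume": 160},
    {"id": "n2", "journal": "Journal A", "volume": 161},
    {"id": "n3", "journal": "Journal B", "volume": 58},
    {"id": "n4", "journal": "Journal B", "volume": 58},
]
negatives = sample_negatives(positives, candidates)
```

For use case 1, the candidate pool would additionally be restricted to articles on the NY-ESO-1 topic before sampling.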
Input features were extracted from each selected article, and the documents were pre-processed and formatted for learning. Input features were a combination of content and metadata (i.e., bibliometric features). Content features included the article title, abstract, and Medical Subject Heading (MeSH) terms. MEDLINE was the data source for this information. Bibliometric features included the publication history of the authors. These features were the publication and citation counts for the first and last authors in the 10 years prior to the publication of a given article. Only information available at the time of an article's publication was used. The ISI Web of Science was the data source for these features. These features were chosen since they have been useful in predicting long-term citation count and automatically classifying instrumental citations [17], [18]. Table 1 contains the full list of features where the first 3 rows are the content features.
The corpus was used to train models for predicting which articles were likely to lead to translational success. Articles were pre-processed and formatted for learning prior to training. For content features, a bag-of-words approach was used that considered each word separately. Stopwords were removed (e.g., “a”, “the”, and other common words), and multiple forms of the same concept were eliminated with Porter stemming [19]. Then, the terms were weighted based on their frequency using log frequency with redundancy [20]. Each weight was a value between 0 and 1. The bibliometric features were normalized into values between 0 and 1 based on the maximum and minimum values for a given feature. In the end, all documents were represented as a matrix of weights (i.e., values between 0 and 1) where rows corresponded to documents and columns represented input features. Articles were labeled positive if a citation path to translational success existed. The learning task was to predict this label.
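The scaling of bibliometric features into the range 0 to 1 can be sketched as below. This is a minimal illustration assuming each feature is held as a list of raw values, one per document; the feature values are invented, and mapping a constant column to 0.0 is a choice made here rather than one stated in the report.

```python
from typing import Dict, List


def min_max_normalize(columns: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each feature column into [0, 1] using that column's own minimum and maximum."""
    normalized = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column has zero span; map it to 0.0 (an assumption) to avoid dividing by zero.
        normalized[name] = [(v - lo) / span if span else 0.0 for v in values]
    return normalized


# Hypothetical raw counts for three documents.
features = {
    "first_author_publication_count": [2, 10, 6],
    "last_author_citation_count": [100, 100, 100],
}
scaled = min_max_normalize(features)
```

The resulting columns can then be placed alongside the weighted content terms so that every entry of the document-by-feature matrix lies between 0 and 1.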
Support vector machines (SVMs) with heterogeneous polynomial kernel were chosen as the learning method. They resist overfitting and are able to handle the high-dimensional data that is typical of text data. This statistical machine learning method has been successful in many text categorization studies with biomedical articles and web sites [17], [18], [20]-[22].
Model selection was performed using a nested stratified 5-fold cross validation design [23]. SVM cost and polynomial kernel degree parameters were optimized in the inner loop (e.g., between training and validation parts of the data). Error estimation was performed in the outer loop with the remaining independent test data. The outer loop produced an unbiased estimate of model predictivity within each fold. The final estimate was averaged over all folds to reduce error estimate variance from the randomized data splits during training, validation, and testing. Performance was measured using the area under the receiver operating characteristic curve (AUC).
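For readers unfamiliar with the performance measure, the following is a minimal sketch of how AUC can be computed from labels and classifier scores. It uses the pairwise-ranking formulation (the fraction of positive/negative pairs in which the positive receives the higher score, with ties counted as one half); the labels and scores shown are invented, and the function name `auc` is illustrative.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-fold labels (1 = cited by a clinical trial) and classifier scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.8, 0.4]
fold_auc = auc(labels, scores)
```

In the nested design described above, a value like `fold_auc` would be computed on each outer test fold and the final estimate taken as the mean over the 5 folds.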
After model training, feature selection was performed to identify the most important features which were most associated with translational success. We selected the Markov Boundary of the response variable (i.e., translational success or cited by clinical trial) in order to reduce the total number of features to only the essential (i.e., “strongly relevant”) ones for classification. The Markov Boundary is the minimal Markov Blanket, that is the smallest set of features conditioned on which all remaining features are independent of the response variable. It excludes irrelevant and redundant variables, and it provably results in maximum variable compression and maximal predictivity under broad distributional assumptions [24]. Then, logistic regression estimated the magnitude of each feature's effect and its statistical significance.
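The definition above can be written in symbols as follows. The notation is introduced here for illustration and does not appear in the original report: T denotes the response variable, V the full set of input features, and M a candidate subset of V.

```latex
M \subseteq V \text{ is a Markov blanket of } T
\iff
T \perp\!\!\!\perp (V \setminus M) \mid M ,
\qquad
\mathrm{MB}(T) = \text{a Markov blanket of } T \text{ with no proper subset that is also a Markov blanket of } T .
```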
The learning task was to predict whether an article would be cited by a clinical trial as evidence of translational success. The first model predicts which NY-ESO-1 articles will be cited by NY-ESO-1 clinical trials. Model performance was very good, with an AUC of 0.87. For reference, an AUC of 0.75 indicates a mediocre classifier, an AUC of 0.85 indicates a very good classifier, and an AUC greater than 0.9 indicates an excellent classifier. This performance means that the models were able to predict which NY-ESO-1 articles would be cited by NY-ESO-1 clinical trials and lead to translational success. The second model predicts which articles, regardless of topic, would be cited by NY-ESO-1 clinical trials. Model performance was excellent, with an AUC of 0.92. Since model performance was very good for both use cases, the results demonstrate that modeling translational research with this framework yields useful models for predicting which articles would lead to translational success.
Feature selection was performed to find the most predictive features. The total number of features was reduced to the Markov Boundary, and logistic regression was performed on the selected features. For use case 1, the original set of 15,128 features was reduced to 110 features. For use case 2, the original set of 23,575 features was reduced to 175 features. Table 4 lists the top 10 features ranked by absolute value of the regression coefficient. The bibliometric feature “Last Author Citation Count” was highly ranked for use case 1, where the topic was restricted to NY-ESO-1. The model for use case 2 relied only on content features.
This report presented a method for automated, large-scale analysis of translational research. A framework was presented that modeled translational research using citation network information and defined states of translational progress using Markov processes. Corpora were constructed to train machine learning models that predicted which articles would be cited by clinical trials for a given topic. The experimental analysis demonstrated the feasibility of the machine learning text-categorization framework for modeling translational research and predicting success.
This work focused on the direct transition between an initial discovery and translational success. Modeling additional transitions using the approach described here is straightforward.
The present work developed and conducted preliminary validation of a novel approach for modeling translational research. Previous methods relied on manual literature reviews that do not provide generalizable information. The automated, machine-learning-based approach has the potential to model the entire translational research process. Being able to predict which papers will impact clinical care and lead to translational success would greatly improve our understanding of the translational research process. This knowledge can guide research efforts and resource allocation. The method described here is by design domain independent and thus can be used in any R&D field.
This application is a continuation of and claims the benefit of U.S. application Ser. No. 14/623,428 filed Feb. 16, 2015 which claims priority from provisional application 61940727, filed Feb. 17, 2014.
Number | Date | Country
---|---|---
61940727 | Feb 2014 | US

 | Number | Date | Country
---|---|---|---
Parent | 14623428 | Feb 2015 | US
Child | 16423890 | | US