This invention is directed to identifying patients for clinical trials.
The large, heterogeneous, and ever-increasing volume of patient databases, the difficulties of manually indexing these collections, and the inadequacy of human language alone to describe their rich contents, such as image information that is visually recognizable and medically significant, all provide impetus for research and development toward practical content-based image and information retrieval (CBIR) systems that could become a standard offering of the medical library of the future. Although CBIR has been used for diagnosis support during or after clinical trials, there is no prior work focusing on the application of content-based retrieval and learning for the purpose of patient identification for recruitment prior to clinical trials.
Exemplary embodiments of the invention as described herein generally include methods and systems for the use of CBIR techniques for patient identification for clinical trials. According to an embodiment of the invention, a patient identification process for clinical trials can be modeled as a cross-modality content-based retrieval process, with integration of multiple modalities, including image, genomic, clinical, and financial information, in an automatic and semi-automatic content-based retrieval system with experts in the loop. According to an embodiment of the invention, textual information can be combined with categorical, numerical, and visual data representing clinical, genomic, financial, and imaging information. Computer vision and machine learning tools can extract descriptors or features to represent the visual and genomic data. A system according to an embodiment of the invention can retrieve qualified patients from a large, heterogeneous database by learning from examples selected by the experts and from on-line feedback provided by the experts. On-line learning from user feedback can give the user the flexibility to select patients based on different criteria, without tedious and difficult tuning of distance-measure parameters. The patient identification process is supported by query by example, query by profile/template/sketch, and learning from user feedback. According to an embodiment of the invention, long-term feedback and learning from multiple experts is supported, which can be performed in the background throughout the usage of the retrieval system. Long-term learning can provide automatic and semi-automatic knowledge representation and discovery. With sufficient statistics, hidden correlations or dependencies across modalities can be discovered and represented in quantifiable forms.
With an expert user in the process, a CBIR system according to an embodiment of the invention can support not only basic similarity searching, but also on-line, adaptive distance metric tuning of the search and retrieval algorithms according to the specific need of the current user and the current task.
According to an aspect of the invention, there is provided a method for identifying a patient for a clinical study including the steps of creating a database of patients and patient information, providing criteria for selecting one or more patients from the database, performing a content based similarity search of the database to retrieve the one or more patients who meet the selection criteria, and presenting said selected one or more patients to a user.
According to a further aspect of the invention, providing the criteria for selecting one or more patients comprises providing an example patient suitable for said study to a search engine, wherein said criteria are determined from characteristic feature values of said example patient.
According to a further aspect of the invention, providing the criteria for selecting one or more patients comprises providing a plurality of example patients suitable for said study to a search engine, wherein said criteria are determined from characteristic feature values of said plurality of example patients.
According to a further aspect of the invention, the database is created by extracting features that support distance based comparisons from at least one of financial, demographic, image, clinical, and genomic data.
According to a further aspect of the invention, these features include numerical data and discrete information represented by words.
According to a further aspect of the invention, the similarity search comprises a distance measure performed on said selection criteria.
According to a further aspect of the invention, the method includes receiving user feedback regarding the one or more selected patients, wherein the feedback concerns whether each of the one or more selected patients presented to the user is suitable for the clinical study, improving said content based similarity search based on said user feedback, performing the improved content based similarity search of the database to retrieve one or more additional patients who meet the selection criteria, and presenting said selected additional patients to the user.
According to a further aspect of the invention, improving said content based similarity search comprises selecting and re-weighting distance measures of said features stored in said database.
According to a further aspect of the invention, improving said content based similarity search comprises utilizing discriminative density estimators and kernel machine techniques.
According to a further aspect of the invention, improving said content based similarity search comprises a biased discriminant analysis.
According to a further aspect of the invention, the method includes selecting one or more additional patients wherein said content based similarity search is uncertain whether said additional patients meet the selection criteria.
According to a further aspect of the invention, the method includes using statistical analysis to determine consistent hidden information and dependencies among keywords and key-features within said database.
According to a further aspect of the invention, the steps of receiving user feedback, learning from said feedback, performing an improved content based similarity search, and presenting said selected additional subjects are repeated until a sufficient sample of subjects for said clinical study has been selected.
According to another aspect of the invention, there is provided a program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform the method steps for identifying a patient for a clinical study.
Exemplary embodiments of the invention as described herein generally include systems and methods for patient identification for clinical trials using content-based retrieval and learning. In the interest of clarity, not all features of an actual implementation which are well known to those of skill in the art are described in detail herein.
A content-based retrieval and learning system according to an embodiment of the invention can provide automatic patient identification that incorporates knowledge and intelligence. By intelligence is meant the use of machine learning, image processing, and computer vision algorithms for feature extraction from genomic data, images, or image sequences, so that non-numerical and non-categorical information sources can be analyzed by machines. By knowledge is meant the use of AI and machine learning tools for extracting quantitative dependencies among different data modalities and disease categories, either from the data or from relevance feedback learning processes. These dependencies can represent new knowledge, or existing knowledge expressed in a more quantitative form.
A retrieval system for patient identification according to an embodiment of the invention can include modules for performing the following functions: (1) content extraction and representation; (2) patient selection through content-based similarity search; (3) user feedback and on-line learning; and (4) long-term learning from user input and feedback.
Once a suitable database is in place, a physician planning a clinical trial would determine a target patient profile 101 suitable for the planned trial, along with one or more examples of patients fitting this profile. The search and content-based image and information retrieval algorithms according to an embodiment of the invention can include a query-by-example based search and retrieval, and a query-by-profile/template/sketch based search and retrieval. In a query-by-example scenario a user submits an example patient who fits the desired criteria to the search engine, while in a query-by-profile/template/sketch scenario, a user can submit a plurality of suitable patients to the search engine. A CBIR system according to an embodiment of the invention can infer appropriate selection criteria from the characteristic feature values of the example (or examples) provided. Alternatively, a user can provide a value or a range of values for one or more characteristics of one or more suitable patients, such as an average value and a standard deviation for a characteristic of a distribution of patients. An initial retrieval result for the patient selection is based on a direct similarity matching between the input, i.e. characteristics of the patients submitted as examples, and those patients in the database. The initial distance measure can be any suitable distance measure, such as a Euclidean distance, weighted Euclidean distance, or Mahalanobis distance. In the case of query-by-profile/template/sketch, where the descriptor can be a distribution, the initial distance measure can be a K-L divergence, a histogram intersection, or an Earth Mover's Distance. These distance measures are exemplary, and other distance measures as are known in the art are within the scope of the embodiment of the invention.
The subjects returned to the user will be, in the case of query-by-example, those subjects who either exactly match the example or closely match the example by some closeness criteria provided by the user. In the case of query-by-profile/template/sketch, subjects within the ranges provided will be returned to the user.
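The initial similarity matching described above can be illustrated with a minimal sketch. It assumes each patient has already been reduced to a fixed-length numeric feature vector (or, for profile queries, a normalized histogram); the function names, patient identifiers, and feature values are hypothetical and serve only as illustration, not as the implementation of any embodiment.

```python
import math

def weighted_euclidean(a, b, weights=None):
    """Weighted Euclidean distance; plain Euclidean when weights is None."""
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence between two normalized histograms (eps avoids log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def histogram_intersection(p, q):
    """Similarity (not distance): 1.0 for identical normalized histograms."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def rank_patients(query, database, weights=None, top_k=3):
    """Return the ids of the top_k patients closest to the query vector."""
    ranked = sorted(database.items(),
                    key=lambda item: weighted_euclidean(query, item[1], weights))
    return [patient_id for patient_id, _ in ranked[:top_k]]

# Hypothetical feature vectors, assumed normalized to [0, 1].
database = {"p1": [0.50, 0.20], "p2": [0.90, 0.90], "p3": [0.45, 0.25]}
example_patient = [0.50, 0.20]
closest = rank_patients(example_patient, database, top_k=2)
```

Here the query-by-example case uses a vector distance, while the two histogram functions indicate how a distribution-valued profile could be compared instead.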
In
According to an embodiment of the invention, user interaction can improve the patient selection process to better match the intentions and needs of the doctors conducting the trial. This can be achieved by techniques referred to herein as relevance feedback. Relevance feedback can treat each task as being different, as even for the same trial a researcher may want to select patients using different criteria. Although current CBIR systems provide interfaces for a user to hand-tune weights on different features to support such requests, the similarity measure in the researcher's mind is often not easily expressed in terms of exact weights of system parameters. In addition, the researcher's perceived similarity may not be expressible by a linear weighting scheme, which assumes feature independence that may not be true in reality.
A flowchart of a relevance feedback method according to an embodiment of the invention is presented in
At step 403, the system uses the improved search and content-based image and information retrieval to select a new sample of potential trial subjects. The system then returns to step 401 to present the new selection to the user. These new samples are representative of a system that can learn from user feedback and return more cases that are a good match according to the feedback. This feedback process can be repeated as many times as necessary until a sufficient patient sample has been selected for the trials.
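One simple way such feedback could re-weight a distance measure is sketched below. The inverse-spread heuristic used here is an assumption chosen for illustration: features on which the user-approved patients agree closely receive larger weights. It is not stated in this description to be the re-weighting used by any embodiment, and all names and values are hypothetical.

```python
import statistics

def reweight_features(positive_examples, eps=1e-6):
    """Weight each feature by the inverse spread of the approved examples.

    A feature on which all approved patients agree gets a large weight;
    a feature that varies widely among them gets a small one.
    Weights are normalized to sum to 1.
    """
    columns = list(zip(*positive_examples))
    raw = [1.0 / (statistics.pstdev(col) + eps) for col in columns]
    total = sum(raw)
    return [r / total for r in raw]

def refine_query(positive_examples):
    """Move the query to the centroid of the approved examples."""
    return [statistics.fmean(col) for col in zip(*positive_examples)]

# Hypothetical feedback: the user approved three patients who agree on
# feature 0 but differ widely on feature 1.
approved = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.5]]
weights = reweight_features(approved)
query = refine_query(approved)
```

The resulting weights and query could then be passed to a weighted distance for the next retrieval round, so the loop of presenting, collecting feedback, and re-searching proceeds without the user setting any weight by hand.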
The relevance feedback techniques just presented involve the use of on-line user interactions. Such user interactions typically provide a relatively small number of training samples, usually in the dozens as compared to hundreds or thousands for off-line training. This small training sample can cause two difficulties in a statistical learning framework: the bias in the density estimates, and the asymmetry in representative power for different classes. Asymmetry in representative power means that a small number of examples cannot represent the positive and the negative classes well enough, and in most cases, one is much worse than the other. For example, five horses represent the “horse” class much better than five examples of non-horse animals represent the “non-horse” class. One technique for handling small samples is biased discriminant analysis (BDA), a kernel machine based discriminative density estimator.
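To make the asymmetric treatment of the two classes concrete, the following is a minimal sketch of a linear, regularized form of biased discriminant analysis. It is a simplification: the kernel-based form mentioned above is not shown, and the regularization constant, iteration count, and function names are assumptions. The sketch finds the single direction along which negative examples scatter far from the positive centroid while positive examples stay tightly clustered, by power iteration on (S_P + reg*I)^-1 S_N, where both scatter matrices are measured around the positive centroid.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def scatter(vectors, center):
    """Sum of outer products of (v - center); both classes use the positive centroid."""
    d = len(center)
    s = [[0.0] * d for _ in range(d)]
    for v in vectors:
        diff = [a - c for a, c in zip(v, center)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def bda_direction(positives, negatives, reg=1e-3, iters=200):
    """Leading eigenvector of (S_P + reg*I)^-1 S_N via power iteration."""
    center = mean(positives)
    sp = scatter(positives, center)
    sn = scatter(negatives, center)
    d = len(center)
    for i in range(d):
        sp[i][i] += reg  # regularize: few positives make S_P near-singular
    w = [1.0] * d
    for _ in range(iters):
        sn_w = [sum(sn[i][j] * w[j] for j in range(d)) for i in range(d)]
        w = solve(sp, sn_w)
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w

# Hypothetical feedback: positives vary in feature 0 but agree on feature 1;
# negatives lie far from the positives along feature 1.
positives = [[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0]]
negatives = [[1.0, 2.0], [2.0, -2.0], [1.5, 3.0], [0.5, -3.0]]
w = bda_direction(positives, negatives)
```

Because only the positive class is assumed to form a cluster, the negatives are never modeled as a single class of their own, which is the point of the bias.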
Another aspect of relevance feedback, according to an embodiment of the invention, is the use of active learning techniques. Active learning refers to a strategy for the learner (i.e., the machine) to actively select samples to query a teacher (i.e., the user) for feedback to maximize information gain or minimize entropy/uncertainty in decision-making. Active learning can provide more efficient and more intelligent user interactions. Referring back to
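A minimal sketch of one active learning strategy, uncertainty sampling, is given here for illustration. It assumes the retrieval system can assign each candidate patient an estimated probability of being suitable; the score values, patient identifiers, and function name are hypothetical, and other selection strategies would fit the description above equally well.

```python
def select_uncertain(scores, k=2):
    """Pick the k patients whose estimated suitability is closest to 0.5.

    scores maps a patient id to an estimated probability in [0, 1] that the
    patient meets the selection criteria. Scores near 0.5 mark the cases
    where the system is least certain, so user feedback on them is assumed
    to be the most informative.
    """
    ranked = sorted(scores, key=lambda patient_id: abs(scores[patient_id] - 0.5))
    return ranked[:k]

# Hypothetical scores from the current retrieval model.
scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45}
to_ask_user = select_uncertain(scores, k=2)
```

Patients "a" and "c" are already confidently classified, so asking the user about them would add little; the sketch instead surfaces the borderline cases.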
During long-term usage of a retrieval system of an embodiment of the invention, each user input and feedback comprises valuable information. In accordance with an embodiment of the invention, long-term learning from multiple experts over time can be incorporated by using statistical analysis to identify consistent hidden information and dependencies among the keywords and the key-features within databases. Such long-term learning can, as a by-product, signal unusual or changing behavior/action on the part of a user. With expert guidance, long-term relevance feedback tools can facilitate advanced research activities toward the discovery of new disease patterns/trends and drug interactions or effects. In accordance with an embodiment of the invention, an implementation for long term learning includes one or more processes that can be invoked by the improvement and updating of the search and content-based image and information retrieval techniques of step 403 of
Simulations have shown the feasibility of such long-term learning. The results of a simulated experiment on long-term learning from multiple sessions of user feedback are displayed in
It is to be understood that the present invention can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In one embodiment, the present invention can be implemented in software as an application program tangibly embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.
Referring now to
The computer system 501 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention. Accordingly, the protection sought herein is as set forth in the claims below.
This application claims priority from “Patient Identification for Clinical Trials using Content-Based Retrieval and Learning”, U.S. Provisional Application No. 60/554,462 of Zhou, et al., filed Mar. 19, 2004, the contents of which are incorporated herein by reference.
Number | Date | Country
---|---|---
60554462 | Mar 2004 | US