Patient data mining

Description

FIELD OF THE INVENTION

The present invention relates to data mining, and more particularly, to systems and methods for mining high-quality structured clinical information from patient medical records.

BACKGROUND OF THE INVENTION

Health care providers accumulate vast stores of clinical information. However, efforts to mine clinical information have not proven to be successful. In general, data mining is a process to determine useful patterns or relationships in data stored in a data repository. Typically, data mining involves analyzing very large quantities of information to discover trends hidden in the data.

Clinical information maintained by health care organizations is usually unstructured. Therefore, it is difficult to mine using conventional methods. Moreover, since clinical information is collected to treat patients, as opposed to, for example, for use in clinical trials, it may contain missing, incorrect, and inconsistent data. Often key outcomes and variables are simply not recorded.

While many health care providers maintain billing information in a relatively structured format, this type of information is limited by insurance company requirements. That is, billing information generally only captures information needed to process medical claims, and more importantly reflects the “billing view” of the patient, i.e., coding the bill for maximum reimbursement. As a result, billing information often contains inaccurate and missing data, from a clinical point of view. Furthermore, studies show that billing codes are incorrect in a surprisingly high fraction of patients (often 10% to 20%).

Given that mining clinical information could lead to insights that otherwise would be difficult or impossible to obtain, it would be desirable and highly advantageous to provide techniques for mining structured high-quality clinical information.

SUMMARY OF THE INVENTION

The present invention provides a data mining framework for mining high-quality structured clinical information.

In various embodiments of the present invention, systems and methods are provided for mining information from patient records. A plurality of data sources are accessed. At least some of the data sources can be unstructured. The system includes a domain knowledge base including domain-specific criteria for mining the data sources. A data miner is configured to mine the data sources using the domain-specific criteria, to create structured clinical information.

Preferably, the data miner includes an extraction component for extracting information from the data sources to create a set of probabilistic assertions, a combination component for combining the set of probabilistic assertions to create one or more unified probabilistic assertions, and an inference component for inferring patient states from the one or more unified probabilistic assertions.

The extraction component may employ domain-specific criteria to extract information from the data sources. Likewise, the combination component may use domain-specific criteria to combine the probabilistic assertions, and the inference component may use domain-specific criteria to infer patient states. The patient state is simply a collection of variables that one may care about relating to the patient, for example, conditions and diagnoses.

The extraction component may be configured to extract key phrases from free text treatment notes. Other natural language processing/natural language understanding methods may also be used instead of, or in conjunction with, phrase extraction to extract information from free text.

Data sources may include one or more of medical information, financial information, and demographic information. The medical information may include one or more of free text information, medical image information, laboratory information, prescription information, and waveform information.

Probability values may be assigned to the probabilistic assertions. The structured clinical information may include probability information relating to the stored information. The structured clinical information may be stored in a data warehouse. The structured clinical information may include corrected information, including corrected ICD-9 diagnosis codes. (The International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) is based on the World Health Organization's Ninth Revision, International Classification of Diseases (ICD-9). ICD-9-CM is the official system of assigning codes to diagnoses and procedures associated with hospital utilization in the United States. The Tenth Revision (ICD-10) has recently been released and differs from the Ninth Revision (ICD-9); it is expected to be implemented soon.)

The system may be run at arbitrary intervals, periodic intervals, or in online mode. When run at intervals, the data sources are mined when the system is run. In online mode, the data sources may be continuously mined.

The domain-specific criteria for mining the data sources may include institution-specific domain knowledge. For example, this may include information about the data available at a particular hospital, document structures at a hospital, policies of a hospital, guidelines of a hospital, and any variations of a hospital.

The domain-specific criteria may also include disease-specific domain knowledge. For example, the disease-specific domain knowledge may include various factors that influence risk of a disease, disease progression information, complications information, outcomes and variables related to a disease, measurements related to a disease, and policies and guidelines established by medical bodies.

Furthermore, a repository interface may be used to access at least some of the information contained in the data source used by the data miner. This repository interface may be a configurable data interface. The configurable data interface may vary depending on which hospital is under consideration.

The data source may include structured and unstructured information. Structured information may be converted into standardized units, where appropriate. Unstructured information may include ASCII text strings, image information in DICOM (Digital Imaging and Communication in Medicine) format, and text documents partitioned based on domain knowledge.

In various embodiments of the present invention, the data miner may be run using the Internet. The created structured clinical information may also be accessed using the Internet.

In various embodiments of the present invention, the data miner may be run as a service. For example, several hospitals may participate in the service to have their patient information mined, and this information may be stored in a data warehouse maintained by the service provider. The service may be performed by a third party service provider (i.e., an entity not associated with the hospitals).

These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer processing system to which the present invention may be applied according to an embodiment of the present invention;

FIG. 2 shows an exemplary computerized patient record (CPR); and

FIG. 3 shows an exemplary data mining framework for mining high-quality structured clinical information.

DESCRIPTION OF PREFERRED EMBODIMENTS

To facilitate a clear understanding of the present invention, illustrative examples are provided herein which describe certain aspects of the invention. However, it is to be appreciated that these illustrations are not meant to limit the scope of the invention, and are provided herein to illustrate certain concepts associated with the invention.

It is also to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Preferably, the present invention is implemented in software as a program tangibly embodied on a program storage device. The program may be uploaded to, and executed by, a machine comprising any suitable architecture.

Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.

It is to be understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed.

FIG. 1 is a block diagram of a computer processing system 100 to which the present invention may be applied according to an embodiment of the present invention. The system 100 includes at least one processor (hereinafter processor) 102 operatively coupled to other components via a system bus 104. A read-only memory (ROM) 106, a random access memory (RAM) 108, an I/O interface 110, a network interface 112, and external storage 114 are operatively coupled to the system bus 104. Various peripheral devices such as, for example, a display device, a disk storage device (e.g., a magnetic or optical disk storage device), a keyboard, and a mouse, may be operatively coupled to the system bus 104 by the I/O interface 110 or the network interface 112.

The computer system 100 may be a standalone system or be linked to a network via the network interface 112. The network interface 112 may be a hard-wired interface. However, in various exemplary embodiments, the network interface 112 can include any device suitable to transmit information to and from another device, such as a universal asynchronous receiver/transmitter (UART), a parallel digital interface, a software interface or any combination of known or later developed software and hardware. The network interface may be linked to various types of networks, including a local area network (LAN), a wide area network (WAN), an intranet, a virtual private network (VPN), and the Internet.

The external storage 114 may be implemented using a database management system (DBMS) managed by the processor 102 and residing on a memory such as a hard disk. However, it should be appreciated that the external storage 114 may be implemented on one or more additional computer systems. For example, the external storage 114 may include a data warehouse system residing on a separate computer system.

Those skilled in the art will appreciate that other alternative computing environments may be used without departing from the spirit and scope of the present invention.

Increasingly, health care providers are employing automated techniques for information storage and retrieval. The use of a computerized patient record (CPR) to maintain patient information is one such example. As shown in FIG. 2, an exemplary CPR (200) includes information that is collected over the course of a patient's treatment. This information may include, for example, computed tomography (CT) images, X-ray images, laboratory test results, doctor progress notes, details about medical procedures, prescription drug information, radiological reports, other specialist reports, demographic information, and billing (financial) information.

A CPR typically includes a plurality of data sources, each of which typically reflects a different aspect of a patient's care. Structured data sources, such as financial, laboratory, and pharmacy databases, generally maintain patient information in database tables. Information may also be stored in unstructured data sources, such as, for example, free text, images, and waveforms. Often, key clinical findings are only stored within physician reports.

FIG. 3 illustrates an exemplary data mining system for mining high-quality structured clinical information. The data mining system includes a data miner (350) that mines information from a CPR (310) using domain-specific knowledge contained in a knowledge base (330). The data miner (350) includes components for extracting information from the CPR (352), combining all available evidence in a principled fashion over time (354), and drawing inferences from this combination process (356). The mined information may be stored in a structured CPR (380).

The extraction component (352) deals with gleaning small pieces of information from each data source regarding a patient, which are represented as probabilistic assertions about the patient at a particular time. These probabilistic assertions are called elements. The combination component (354) combines all the elements that refer to the same variable at the same time period to form one unified probabilistic assertion regarding that variable. These unified probabilistic assertions are called factoids. The inference component (356) deals with the combination of these factoids, at the same point in time and/or at different points in time, to produce a coherent and concise picture of the progression of the patient's state over time. This progression of the patient's state is called a state sequence.
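By way of illustration only, the grouping step that precedes combination may be sketched as follows. The `Element` record, its field names, the string time key, and the sample values are hypothetical and are not taken from the described system; the sketch shows only how elements referring to the same variable at the same time period might be collected so that one factoid can be formed per group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One probabilistic assertion drawn from a single piece of source data."""
    variable: str      # e.g. "Cancer"
    time: str          # time period the assertion refers to (hypothetical key)
    value: str         # asserted value, e.g. "True"
    confidence: float  # probability attached to the assertion
    source: str        # hypothetical label for the originating data source


def group_elements(elements):
    """Collect elements that refer to the same variable at the same time period.

    Each resulting group is the input from which one factoid would be formed.
    """
    groups = defaultdict(list)
    for element in elements:
        groups[(element.variable, element.time)].append(element)
    return dict(groups)


# Illustrative values only.
elements = [
    Element("Cancer", "2001-12", "True", 0.9, "progress note"),
    Element("Cancer", "2001-12", "False", 0.7, "radiology report"),
    Element("Diabetes", "2001-12", "True", 0.8, "pharmacy table"),
]
groups = group_elements(elements)
print(sorted(groups))
```

In this sketch the two conflicting "Cancer" elements fall into one group and the "Diabetes" element into another; how each group is reduced to a factoid is left to the combination component.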

The present invention can build an individual model of the state of a patient. The patient state is simply a collection of variables that one may care about relating to the patient. The information of interest may include a state sequence, i.e., the value of the patient state at different points in time during the patient's treatment.

Advantageously, the architecture depicted in FIG. 3 supports plug-in modules wherein the system can be easily expanded for new data sources, diseases, and hospitals. New element extraction algorithms, element combining algorithms, and inference algorithms can be used to augment or replace existing algorithms.

Each of the above components uses detailed knowledge regarding the domain of interest, such as, for example, a disease of interest. This domain knowledge base (330) can come in two forms. It can be encoded as an input to the system, or as programs that produce information that can be understood by the system. The part of the domain knowledge base (330) that is input to the present form of the system may also be learned from data.

Domain-specific knowledge for mining the data sources may include institution-specific domain knowledge. For example, this may include information about the data available at a particular hospital, document structures at a hospital, policies of a hospital, guidelines of a hospital, and any variations of a hospital.

The domain-specific knowledge may also include disease-specific domain knowledge. For example, the disease-specific domain knowledge may include various factors that influence risk of a disease, disease progression information, complications information, outcomes and variables related to a disease, measurements related to a disease, and policies and guidelines established by medical bodies.

As mentioned, the extraction component (352) takes information from the CPR (310) to produce probabilistic assertions (elements) about the patient that are relevant to an instant in time or time period. This process is carried out with the guidance of the domain knowledge that is contained in the domain knowledge base (330). The domain knowledge required for extraction is generally specific to each source.

Extraction from a text source may be carried out by phrase spotting, which requires a list of rules that specify the phrases of interest and the inferences that can be drawn therefrom. For example, if there is a statement in a doctor's note with the words “There is evidence of metastatic cancer in the liver,” then, in order to infer from this sentence that the patient has cancer, a rule is needed that directs the system to look for the phrase “metastatic cancer,” and, if it is found, to assert that the patient has cancer with a high degree of confidence (which, in the present embodiment, translates to generating an element with name “Cancer”, value “True” and confidence 0.9).
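A minimal sketch of such a phrase-spotting rule is given below for illustration. The rule table layout and the function name `spot_phrases` are assumptions made for this example; only the phrase “metastatic cancer” and the resulting element (name “Cancer”, value “True”, confidence 0.9) come from the example above.

```python
# Hypothetical rule table: phrase -> (element name, value, confidence).
PHRASE_RULES = {
    "metastatic cancer": ("Cancer", "True", 0.9),
}


def spot_phrases(text, rules=PHRASE_RULES):
    """Return one element (as a dict) for every rule phrase found in the text."""
    lowered = text.lower()
    found = []
    for phrase, (name, value, confidence) in rules.items():
        # Plain case-insensitive substring match; no linguistic analysis.
        if phrase in lowered:
            found.append({"name": name, "value": value, "confidence": confidence})
    return found


note = "There is evidence of metastatic cancer in the liver."
elements_from_note = spot_phrases(note)
print(elements_from_note)
```

This sketch matches phrases literally and does not account for negated statements, which would have to be handled by additional rules.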

The data sources include structured and unstructured information. Structured information may be converted into standardized units, where appropriate. Unstructured information may include ASCII text strings, image information in DICOM (Digital Imaging and Communication in Medicine) format, and text documents partitioned based on domain knowledge. Information that is likely to be incorrect or missing may be noted, so that action may be taken. For example, the mined information may include corrected information, including corrected ICD-9 diagnosis codes.

Extraction from a database source may be carried out by querying a table in the source, in which case, the domain knowledge needs to encode what information is present in which fields in the database. On the other hand, the extraction process may involve computing a complicated function of the information contained in the database, in which case, the domain knowledge may be provided in the form of a program that performs this computation whose output may be fed to the rest of the system.

Extraction from images, waveforms, etc., may be carried out by image processing or feature extraction programs that are provided to the system.

Combination includes the process of producing a unified view of each variable at a given point in time from potentially conflicting assertions from the same/different sources. In various embodiments of the present invention, this is performed using domain knowledge regarding the statistics of the variables represented by the elements (“prior probabilities”).

Inference is the process of taking all the factoids that are available about a patient and producing a composite view of the patient's progress through disease states, treatment protocols, laboratory tests, etc. Essentially, a patient's current state can be influenced by a previous state and any new composite observations.

The domain knowledge required for this process may be a statistical model that describes the general pattern of the evolution of the disease of interest across the entire patient population and the relationships between the patient's disease and the variables that may be observed (lab test results, doctor's notes, etc.). A summary of the patient may be produced that is believed to be the most consistent with the information contained in the factoids, and the domain knowledge.

For instance, if observations seem to state that a cancer patient is receiving chemotherapy while he or she does not have cancerous growth, whereas the domain knowledge states that chemotherapy is given only when the patient has cancer, then the system may decide either: (1) the patient does not have cancer and is not receiving chemotherapy (that is, the observation is probably incorrect), or (2) the patient has cancer and is receiving chemotherapy (the initial inference—that the patient does not have cancer—is incorrect); depending on which of these propositions is more likely given all the other information. Actually, both (1) and (2) may be concluded, but with different probabilities.

As another example, consider the situation where a statement such as “The patient has metastatic cancer” is found in a doctor's note, and it is concluded from that statement that <cancer=True (probability=0.9)>. (Note that this is equivalent to asserting that <cancer=True (probability=0.9), cancer=unknown (probability=0.1)>).

Now, further assume that there is a base probability of cancer <cancer=True (probability=0.35), cancer=False (probability=0.65)> (e.g., 35% of patients have cancer). Then, we could combine this assertion with the base probability of cancer to obtain, for example, the assertion <cancer=True (probability=0.93), cancer=False (probability=0.07)>.

Similarly, assume conflicting evidence indicated the following:

1. <cancer=True (probability=0.9), cancer=unknown (probability=0.1)>

2. <cancer=False (probability=0.7), cancer=unknown (probability=0.3)>

3. <cancer=True (probability=0.1), cancer=unknown (probability=0.9)> and

4. <cancer=False (probability=0.4), cancer=unknown (probability=0.6)>.

In this case, we might combine these elements with the base probability of cancer <cancer=True (probability=0.35), cancer=False (probability=0.65)> to conclude, for example, that <cancer=True (probability=0.67), cancer=False (probability=0.33)>.
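One way in which figures of this kind could be obtained is sketched below. The sketch assumes (i) that the “unknown” mass of each element is spread over True and False in proportion to the base probability, and (ii) that the k smoothed elements are combined by multiplying them and dividing by the base probability raised to the power k−1, treating the elements as conditionally independent. The function names `smooth` and `combine` are hypothetical, and it is not asserted that the example figures above were produced in exactly this manner.

```python
from math import prod


def smooth(observation, prior):
    """Spread the 'unknown' mass of an observation over the values by the prior."""
    unknown = 1.0 - sum(observation.values())
    return {value: observation.get(value, 0.0) + unknown * p
            for value, p in prior.items()}


def combine(observations, prior):
    """Combine k observations about one variable into one distribution.

    Uses posterior(v) proportional to prod_j P[v | O_j] / P[v] ** (k - 1),
    then normalizes so the result sums to 1. Assumes every prior value is > 0.
    """
    smoothed = [smooth(obs, prior) for obs in observations]
    k = len(smoothed)
    scores = {value: prod(s[value] for s in smoothed) / p ** (k - 1)
              for value, p in prior.items()}
    total = sum(scores.values())
    return {value: score / total for value, score in scores.items()}


# Base probability of cancer from the example above.
prior = {"True": 0.35, "False": 0.65}

# The single assertion <cancer=True (probability=0.9)>.
single = combine([{"True": 0.9}], prior)

# The four conflicting elements listed above.
conflicting = combine(
    [{"True": 0.9}, {"False": 0.7}, {"True": 0.1}, {"False": 0.4}], prior)

print({v: round(p, 3) for v, p in single.items()})
print({v: round(p, 3) for v, p in conflicting.items()})
```

Under these assumptions the single assertion yields a probability of cancer of about 0.935, and the four conflicting elements yield about 0.67, which is in line with the illustrative values given above.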

It should be appreciated that the present invention typically must access numerous data sources, and deal with missing, incorrect, and/or inconsistent information. As an example, consider that, in determining whether a patient has diabetes, the following information might have to be extracted:

(a) ICD-9 billing codes for secondary diagnoses associated with diabetes;

(b) drugs administered to the patient that are associated with the treatment of diabetes (e.g., insulin);

(d) doctor mentions that the patient is a diabetic in the H&P (history & physical) or discharge note (free text); and

(e) patient procedures (e.g., foot exam) associated with being a diabetic.

As can be seen, there are multiple independent sources of information, observations from which can support (with varying degrees of certainty) that the patient is diabetic (or more generally has some disease/condition). Not all of them may be present, and in fact, in some cases, they may contradict each other. Probabilistic observations can be derived, with varying degrees of confidence. Then these observations (e.g., about the billing codes, the drugs, the lab tests, etc.) may be probabilistically combined to come up with a final probability of diabetes. Note that there may be information in the patient record that contradicts diabetes. For instance, the patient has some stressful episode (e.g., an operation) and the patient's blood sugar does not go up.

It should be appreciated that the above examples are presented for illustrative purposes only and are not meant to be limiting. The actual manner in which elements are combined depends on the particular domain under consideration as well as the needs of the users of the system. Further, it should be appreciated that while the above discussion refers to a patient-centered approach, actual implementations may be extended to handle multiple patients simultaneously. Additionally, it should be appreciated that a learning process may be incorporated into the domain knowledge base (330) for any or all of the stages (i.e., extraction, combination, inference) without departing from the spirit and scope of the present invention.

The data miner may be run using the Internet. The created structured clinical information may also be accessed using the Internet.

Additionally, the data miner may be run as a service. For example, several hospitals may participate in the service to have their patient information mined, and this information may be stored in a data warehouse owned by the service provider. The service may be performed by a third party service provider (i.e., an entity not associated with the hospitals).

Once the structured CPR (380) is populated with patient information, it will be in a form that is conducive to answering questions regarding individual patients, and about different cross-sections of patients.

The following describes REMIND (Reliable Extraction and Meaningful Inference from Non-structured Data), an innovative data mining system developed by Siemens Corporate Research (SCR), a subsidiary of Siemens Corporation. REMIND is based upon an embodiment of the present invention.

Initially, an analogy is provided that describes the spirit in which REMIND performs inferences.

A French medical student who has some knowledge about cancer is provided with cancer patient CPRs. The CPRs contain transcribed English dictations and pharmacy data. The student's task is to classify which patients have had a recurrence, and if they have, determine when it occurred. Unfortunately, his English is poor, though he does know some key medical words and a few of the drug names. However, he cannot rely purely on the presence of some key words, such as metastases, in the dictation, because he knows that physicians often make negative statements (“Patient is free of evidence of metastases.”). How might the student best carry out his task?

The student can collect all relevant evidence from the CPR—without trusting any single piece of evidence—and combine it to reconcile any disparities. He can use his knowledge about the treatment of cancer—for instance, on noting that a patient had a liver resection, the student can conclude that the patient (probably) previously had a recurrence.

Problem Definition

Let S be a continuous-time random process taking values in Σ that represents the state of the system. Let T = {t₁, t₂, . . . , t_n}, where t_i < t_{i+1}, be the n “times of interest” when S has to be inferred. Let S_i refer to the sample of S at time t_i ∈ T. Let V be the set of variables that depend upon S. Let O be the set of all (probabilistic) observations for all variables, v ∈ V. Let O_i be the set of all observations “assigned” to t_i ∈ T; i.e., all observations about variables, v ∈ V, that are relevant for this time-step t_i. Similarly, let O^j_i(v) be the j-th observation for variable v assigned to t_i. Let seq = <S₁, S₂, . . . , S_n> be a random variable in Σ^n; i.e., each realization of seq is a state sequence across T. GOAL: Estimate the most likely state sequence, seq_MAP (the maximum a posteriori estimate of seq), given O:

$seq_{MAP} = \arg\max_{seq} P [seq | O]$

REMIND extracts information, O_i, from every data source in a uniform format called probabilistic observations. Each O_i is drawn entirely from a single piece of information in a data source (e.g., from a phrase in a sentence, or a row in a database table), and hence is assumed to be inherently undependable. The observation {“Recurrent”, “12/17/01”, <T=0.1, F=0.0>} states that the Boolean variable “Recurrent” has an associated distribution over all possible values that can be taken by “Recurrent”. The probabilities do not have to add up to 1.0; any remainder (here 0.9) is assigned to unknown, and is smoothed over T/F, based upon the (time-dependent) a priori distribution.

Extraction from Structured Data:

REMIND communicates with all databases via JDBC, Java's built-in interface to relational databases. The result of executing a query (e.g., retrieving the drugs administered) is expressed as a probabilistic observation.
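For illustration, the expression of a query result as a probabilistic observation may be sketched as follows. The sketch uses Python's `sqlite3` module with an in-memory database purely as a stand-in for a JDBC connection. The table layout, the `DRUG_RULES` mapping, the function name `observations_for`, and the value 0.8 are hypothetical; only the association of insulin with the treatment of diabetes comes from the description above.

```python
import sqlite3

# In-memory database standing in for a hospital pharmacy system
# (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pharmacy (patient_id TEXT, drug TEXT, administered_on TEXT)")
conn.executemany(
    "INSERT INTO pharmacy VALUES (?, ?, ?)",
    [("p1", "insulin", "2001-12-17"), ("p1", "aspirin", "2001-12-18")])

# Hypothetical domain knowledge: drug -> (variable, distribution).
DRUG_RULES = {"insulin": ("Diabetes", {"T": 0.8})}


def observations_for(patient_id):
    """Query drugs administered and express matches as probabilistic observations."""
    rows = conn.execute(
        "SELECT drug, administered_on FROM pharmacy "
        "WHERE patient_id = ? ORDER BY administered_on",
        (patient_id,))
    # Each matching row becomes (variable, date, distribution); other rows are skipped.
    return [(DRUG_RULES[drug][0], date, DRUG_RULES[drug][1])
            for drug, date in rows if drug in DRUG_RULES]


print(observations_for("p1"))
```

Here the domain knowledge encodes which table and fields hold the drug information and which variable each drug bears upon; rows for drugs with no rule produce no observation.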

Extraction from Free Text:

REMIND strips document headers/footers, and tokenizes free text. Information from the token stream is extracted via phrase spotting, an easy-to-implement method from computational linguistics. Phrase spotting is about as simple as it sounds. A phrase-spotting rule is applied within a single sentence. The rule:

[metastasis & malignant] ⇒ {“Recurrent”, <T=0.5>}

states that if the two words (actually aliases) in the rule are found in a sentence, a probabilistic observation about recurrence should be generated. REMIND also has compound rules to detect “negation” and “imprecision”, which modify the probabilities in existing observations.
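A sentence-scoped rule of this kind may be sketched as follows, for illustration only. The alias lists, the negation cues, and the function name `apply_rule` are hypothetical. In particular, moving the probability mass from T to F when a negation cue is present is an assumption made for this example; the manner in which REMIND's compound rules modify probabilities is not specified here.

```python
import re

# Hypothetical alias lists: each rule word stands for several surface forms.
ALIASES = {
    "metastasis": {"metastasis", "metastases", "metastatic"},
    "malignant": {"malignant", "malignancy"},
}
# Hypothetical negation cues (assumption for this sketch).
NEGATION_CUES = {"no", "not", "without", "free"}


def apply_rule(text, rule_words=("metastasis", "malignant"),
               variable="Recurrent", p_true=0.5):
    """Apply one conjunction rule sentence by sentence.

    Emits (variable, {"T": p_true}) when every rule word, via its aliases,
    occurs in the same sentence. As an illustrative assumption, a negation
    cue in that sentence moves the probability mass from T to F.
    """
    observations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if all(tokens & ALIASES[word] for word in rule_words):
            if tokens & NEGATION_CUES:
                observations.append((variable, {"F": p_true}))
            else:
                observations.append((variable, {"T": p_true}))
    return observations


print(apply_rule("Malignant metastases seen in the liver. Follow up in two weeks."))
print(apply_rule("Patient is free of malignant metastases."))
print(apply_rule("Malignant lesion removed. No metastases."))
```

In the third call the two rule words occur in different sentences, so no observation is generated, reflecting that a rule is applied within a single sentence.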

The primary focus of our interest is estimating what happened to the patient across T, the duration of interest. The estimation of the MAP state sequence can be done in two steps, the first of which is combination of observations at a fixed point in time and the second is the propagation of these inferences across time.

Each (smoothed) O_i is in the form of an a posteriori probability of a variable given the small context that it is extracted from. All observations, O^j_i(v), about a variable for a single time t_i are combined into one assertion in a straightforward manner by using Bayes' theorem:

$P [v_{i} | O_{i}^{1} (v_{i}), \dots, O_{i}^{k} (v_{i})] \propto P [v_{i}] \cdot \prod_{j = 1}^{k} P [O_{i}^{j} (v_{i}) | v_{i}] \propto \frac{\prod_{j = 1}^{k} P [v_{i} | O_{i}^{j} (v_{i})]}{{P [v_{i}]}^{k - 1}}$
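The second proportionality may be seen as follows, on the assumption (implicit in the product form) that the observations are conditionally independent given v_i. Applying Bayes' theorem to each observation, and noting that P[O_i^j(v_i)] does not depend on the value of v_i:

$P [O_{i}^{j} (v_{i}) | v_{i}] = \frac{P [v_{i} | O_{i}^{j} (v_{i})] \cdot P [O_{i}^{j} (v_{i})]}{P [v_{i}]} \propto \frac{P [v_{i} | O_{i}^{j} (v_{i})]}{P [v_{i}]}$

Substituting this into the product over the k observations gives:

$P [v_{i}] \cdot \prod_{j = 1}^{k} \frac{P [v_{i} | O_{i}^{j} (v_{i})]}{P [v_{i}]} = \frac{\prod_{j = 1}^{k} P [v_{i} | O_{i}^{j} (v_{i})]}{{P [v_{i}]}^{k - 1}}$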

At every t_i ∈ T, the relationships among S_i and V are modeled using a Bayesian Network. Because the state process is modeled as being Markov and the state as being causative (directly or indirectly) of all the variables that we observe, we have the following equation:

$P [seq | O] \propto P [S_{1}] \cdot \prod_{i = 2}^{n} P [S_{i} | S_{i - 1}] \cdot \prod_{i = 1}^{n} P [O_{i} | S_{i}] \propto \prod_{i = 2}^{n} \frac{P [S_{i} | S_{i - 1}]}{P [S_{i}]} \cdot \prod_{i = 1}^{n} P [S_{i} | O_{i}]$

This equation connects the a posteriori probability of seq (any sequence of samples of the state process across time) given all observations, to P[S_i | O_i], the temporally local a posteriori probability of the state given the observations for each time instant. Essentially, we string together the temporally local Bayesian Networks by modeling each state sample, S_i, as the cause of the next sample, S_{i+1}.
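For a finite state space, a sequence maximizing the right-hand side of the above equation can be found by dynamic programming of the Viterbi type. The following sketch is given for illustration only, and it is not asserted that REMIND uses this procedure. The function name `map_state_sequence`, the two-state example, and all numeric values are hypothetical, and the a priori distribution is assumed, as a simplification, to be the same at every time step.

```python
import math


def map_state_sequence(states, local_posteriors, transition, prior):
    """Return the state sequence maximizing
    prod_i P[S_i | O_i] * prod_{i>=2} P[S_i | S_{i-1}] / P[S_i].

    local_posteriors: list of dicts, one per time step, giving P[S_i | O_i].
    transition[a][b]: P[S_i = b | S_{i-1} = a].
    prior[s]: P[S_i = s], assumed positive and identical at every time step.
    """
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # best[s] is the best log-score of any sequence ending in state s.
    best = {s: log(local_posteriors[0][s]) for s in states}
    back = []
    for post in local_posteriors[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda a: best[a] + log(transition[a][s]))
            new_best[s] = (best[prev] + log(transition[prev][s])
                           - log(prior[s]) + log(post[s]))
            pointers[s] = prev
        best = new_best
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    sequence = [state]
    for pointers in reversed(back):
        state = pointers[state]
        sequence.append(state)
    return sequence[::-1]


# Hypothetical two-state example with illustrative numbers.
states = ["remission", "recurrence"]
prior = {"remission": 0.7, "recurrence": 0.3}
transition = {
    "remission": {"remission": 0.8, "recurrence": 0.2},
    "recurrence": {"remission": 0.05, "recurrence": 0.95},
}
local_posteriors = [
    {"remission": 0.95, "recurrence": 0.05},
    {"remission": 0.3, "recurrence": 0.7},
    {"remission": 0.2, "recurrence": 0.8},
]
result = map_state_sequence(states, local_posteriors, transition, prior)
print(result)
```

With these illustrative numbers the sketch returns remission at the first time of interest followed by recurrence at the second and third.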

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

Claims

1. A system for producing structured clinical information from patient records, the system comprising: a patient record comprising at least two data sources having patient information, at least one of the data sources being an unstructured data source and at least one of the data sources being a structured data source; a probabilistic data miner of a computer platform configured to (a) extract multiple pieces of information related to a variable for a patient from mining structured data of the at least one structured data source and mining unstructured data of the at least one unstructured data source of the patient record, the mining of the at least one unstructured data source comprising mining free text information, and (b) combine the extracted multiple pieces of information related to the variable into a value of the variable for the patient, the value being a function of the multiple pieces related to the variable, and the data miner configured to repeat (a) and (b) for a plurality of different variables of the same patient for a same time, each repetition of extracting and combining multiple pieces of information related to the variable of the different variables being handled in the repetition such that the multiple pieces of information for one variable are different than the multiple pieces of information for other ones of the different variables, the variable and the different variables comprising characteristics of the patient at the time; wherein one or both of (a) and (b) are performed as a function of domain-specific criteria.
2. The system of claim 1 wherein the data miner is operable to extract the multiple pieces of information as a function of the domain-specific criteria.
3. The system of claim 1 wherein the data miner is operable to combine the extracted multiple pieces of information as a function of the domain-specific criteria, the domain-specific criteria comprising knowledge about a disease, the multiple pieces being combined to determine the value for the variable.
4. The system of claim 1 wherein the data miner comprises an extraction component for extracting the multiple pieces of information and outputting a probabilistic assertion.
5. The system of claim 1 wherein the data miner is operable to infer a patient state as a function of the combined pieces of information.
6. The system of claim 5 wherein the data miner is operable to infer the patient state as a function of probabilities, one of the probabilities assigned to each of the combined pieces of information.
7. The system of claim 5 wherein the inference is a function of a statistical model of a pattern of evolution of a disease across a patient population and the relationship between a patient's disease and observed variables.
8. The system of claim 1 wherein the patient information includes one or more of: medical information, financial information, demographic information or combinations thereof, the at least one unstructured data source including two or more of: the free text information, medical image information, laboratory information, prescription drug information, waveform information or combinations thereof.
9. The system of claim 1 wherein the data miner is operable to extract key phrases from the at least one unstructured data source, the at least one unstructured data source comprising free text treatment notes, the key phrases comprising at least part of the domain-specific criteria.
10. The system of claim 1 wherein the data miner is operable to output structured clinical information with probability information.
11. The system of claim 1 wherein the domain-specific criteria includes institution-specific domain knowledge, disease-specific domain knowledge, or combinations thereof.
12. The system of claim 1 wherein the at least one unstructured data source includes two or more of: ASCII text strings, image information in DICOM format, text documents, or combinations thereof, partitioned based on domain knowledge.
13. The system of claim 1 wherein the data miner is run at arbitrary intervals, periodic intervals, in an online mode, or combinations thereof.
14. The system of claim 1 wherein a repository interface is used to access at least some of the information contained in the at least two data sources used by the data miner, wherein the repository interface is a configurable data interface, which varies depending on hospital.
15. A method for producing structured clinical information from patient records, comprising: (a) extracting, by a machine, multiple pieces of information from at least one structured data source and at least one unstructured data source of a patient record of a patient, the extracting from the at least one unstructured data source comprising mining unstructured free text information, the multiple pieces indicating different first values for a same variable recorded for the patient; (b) combining, probabilistically by the machine, the extracted multiple pieces of information into a second value of the variable representing the patient, the second value being a function of the first values recorded for the patient; wherein one or both of (a) and (b) are performed as a function of domain-specific criteria; and repeating (a) and (b) for a different variable, the repetition searching for different pieces of information relevant to the different variable and combining the different pieces of information into a value for the different variable, the variable and the different variable representing the patient at a given time.
16. The method of claim 15 wherein (a) is performed as a function of the domain-specific criteria.
17. The method of claim 15 wherein the combining of (b) is performed as a function of the domain-specific criteria, the domain-specific criteria comprising knowledge of a disease.
18. The method of claim 15 wherein (a) comprises outputting a probabilistic assertion representative of each of the different first values for the same variable, and wherein (b) comprises combining the probabilistic assertions into a unified probabilistic assertion for the variable.
19. The method of claim 15 further comprising: (c) inferring a patient state as a function of the combined pieces of information.
20. The method of claim 19 wherein (c) comprises inferring the patient state as a function of probabilities associated with the combined pieces of information.
21. The method of claim 19 wherein (c) comprises inferring as a function of a statistical model of a pattern of evolution of a disease across a patient population and the relationship between a patient's disease and observed variables.
22. The method of claim 15 wherein (a) comprises extracting from one or more of: medical information, financial information, demographic information or combinations thereof, the at least one unstructured data source including two or more of: the free text information, medical image information, laboratory information, prescription drug information, waveform information or combinations thereof.
23. The method of claim 15 wherein (a) comprises extracting key phrases from the at least one unstructured data source, the at least one unstructured data source comprising free text treatment notes, the key phrases comprising at least part of the domain-specific criteria.
24. The method of claim 15 further comprising: (c) outputting structured clinical information with probability information.
25. The method of claim 15 wherein the domain-specific criteria includes institution-specific domain knowledge, disease-specific domain knowledge, or combinations thereof.
26. The method of claim 15 wherein the at least one unstructured data source includes one or more of: ASCII text strings, image information in DICOM format, text documents or combinations thereof partitioned based on domain knowledge.
27. A method for providing structured clinical information from patient records, the method comprising: (a) mining, by a processor, a patient record having at least one unstructured data source comprising unstructured patient information, the mining comprising mining unstructured free text information, the patient record being from a healthcare provider, the mining including extracting at least one of multiple pieces of information related to each of multiple variables; (b) creating, probabilistically by the processor, structured clinical data for each of the variables from the extracted multiple pieces of information, including the at least one piece of the unstructured patient information mined from the unstructured data source, the structured clinical data being stored for answering a question regarding patients; (c) providing (a) as a service to the healthcare provider; and (d) mining from the structured clinical data as a function of the question.
28. The method of claim 27 wherein (a) comprises mining the patient records of the healthcare provider and another healthcare provider and wherein (c) comprises providing the mining of (a) as a service to the healthcare provider and the other healthcare provider.
29. The method of claim 27 wherein (c) comprises providing the service by a third party service provider.
30. The method of claim 27 wherein (b) comprises: (b1) combining a set of probabilistic assertions into one or more unified probabilistic assertions; and (b2) inferring a patient state from the one or more unified probabilistic assertions; and wherein (c) comprises communicating the patient state.
31. The method of claim 27 wherein one or both of (a) or (b) is performed as a function of domain-specific criteria, the domain-specific criteria comprising knowledge of a disease.
32. The method of claim 31 wherein the domain-specific criteria for mining comprises institution-specific domain knowledge of the healthcare provider.
33. The method of claim 32 wherein the institution-specific domain knowledge relates to one or more of: data at a hospital, document structures at a hospital, policies of a hospital, guidelines of hospital, variations at a hospital or combinations thereof.
34. The method of claim 27 wherein (a) comprises mining the at least one unstructured data source comprising one or more of medical information, financial information, demographic information or combinations thereof, wherein the medical information includes two or more of: the free text information, medical image information, laboratory information, prescription drug information, waveform information or combinations thereof.
35. The method of claim 27 further comprising: (e) assigning probability values to the structured clinical data; wherein (c) comprises communicating a probability as a function of the probability values.
36. The method of claim 27 further comprising: (e) storing the created structured clinical information in a database maintained by a service provider, the service provider being different than the healthcare provider, wherein the service provider performs (c).
37. The method of claim 27 wherein (a) comprises mining the at least one unstructured data source comprising the unstructured patient information, the unstructured patient information comprising one or more of: ASCII text strings, image information in DICOM format, text documents or combinations thereof.
38. The method of claim 27 wherein (c) comprises accessing the created structured clinical information using the Internet.
39. The method of claim 27 wherein (a) comprises running the data mining using the Internet.
40. The method of claim 27 wherein (c) comprises communicating a diagnosis.
41. The method of claim 27 wherein (c) comprises communicating corrected information related to the patient record.
42. The system of claim 1 wherein the multiple pieces of information related to the variable for the patient are represented as probabilistic assertions related to the variable about the patient at a particular time, and wherein the value of the variable for the patient is a unified probabilistic assertion related to the variable about the patient at the particular time, the unified probabilistic assertion formed from the probabilistic assertions.
43. A system for producing structured clinical information from patient records, comprising: a patient record comprising at least unstructured data and structured data; and a probabilistic data miner machine configured to (a) extract multiple pieces of information from the structured data and the unstructured data of the patient record of a patient, the extracting from the at least unstructured data comprising mining unstructured free text information, the multiple pieces indicating different first values for a same variable recorded for the patient, and configured to (b) combine the extracted multiple pieces of information into a second value of the variable representing the patient, the second value being a function of the first values recorded for the patient; wherein one or both of (a) and (b) are performed as a function of domain-specific criteria; and wherein the probabilistic data miner is configured to repeat (a) and (b) for a different variable, the repetition searching for different pieces of information relevant to the different variable and combining the different pieces of information into a value for the different variable, the variable and the different variable representing the patient at a given time.
44. The system of claim 43 wherein the probabilistic data miner is configured to perform (a) as a function of the domain-specific criteria.
45. The system of claim 43 wherein the probabilistic data miner is configured to combine in (b) as a function of the domain-specific criteria, the domain-specific criteria comprising knowledge of a disease.
46. The system of claim 43 wherein the probabilistic data miner is configured to output a probabilistic assertion representative of each of the different first values for the same variable for (a), and to combine in (b) the probabilistic assertions into a unified probabilistic assertion for the variable.
47. The system of claim 43 wherein the probabilistic data miner is configured to (c) infer a patient state as a function of the combined pieces of information.
48. The system of claim 47 wherein the probabilistic data miner is configured to infer in (c) the patient state as a function of probabilities associated with the combined pieces of information.
49. The system of claim 47 wherein the probabilistic data miner is configured to infer in (c) as a function of a statistical model of a pattern of evolution of a disease across a patient population and the relationship between a patient's disease and observed variables.
50. The system of claim 43 wherein the probabilistic data miner is configured to extract in (a) from one or more of: medical information, financial information, demographic information or combinations thereof, the at least one unstructured data source including two or more of: the free text information, medical image information, laboratory information, prescription drug information, waveform information or combinations thereof.
51. The system of claim 43 wherein the probabilistic data miner is configured to extract in (a) key phrases from the at least one unstructured data source, the at least one unstructured data source comprising free text treatment notes, the key phrases comprising at least part of the domain-specific criteria.
52. The system of claim 43 wherein the probabilistic data miner is configured to (c) output structured clinical information with probability information.
53. The system of claim 43 wherein the probabilistic data miner is configured to perform (a) and/or (b) with the domain-specific criteria including institution-specific domain knowledge, disease-specific domain knowledge, or combinations thereof.
54. The system of claim 43 wherein the free text comprises ASCII text strings, text documents or combinations thereof partitioned based on domain knowledge.
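
For illustration only, and not as part of the claims, steps (a) and (b) recited in claims 1, 15 and 43 might be sketched as follows. The sketch extracts probabilistic assertions from free text and from a structured field, then combines them per variable. Every name here (`Assertion`, `KEY_PHRASES`, `extract_from_text`, `extract_from_structured`, `combine`, the `diagnosis_code` field and its `"DM2"` value), every probability, and the normalised-vote combination rule are assumptions made for the example; the claims do not prescribe any particular combination function.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One source's probabilistic statement that `variable` has `value`."""
    variable: str
    value: str
    probability: float
    source: str


# Hypothetical domain-specific criteria: key phrase -> (variable, value, probability).
KEY_PHRASES = {
    "smokes": ("smoker", "yes", 0.9),
    "denies smoking": ("smoker", "no", 0.8),
    "type 2 diabetes": ("diabetic", "yes", 0.9),
}


def extract_from_text(note: str) -> list[Assertion]:
    """Step (a), unstructured side: scan a free-text note for key phrases."""
    text = note.lower()
    return [
        Assertion(variable, value, prob, "free_text")
        for phrase, (variable, value, prob) in KEY_PHRASES.items()
        if phrase in text
    ]


def extract_from_structured(row: dict) -> list[Assertion]:
    """Step (a), structured side: read a hypothetical coded diagnosis field."""
    found = []
    if row.get("diagnosis_code") == "DM2":  # assumed code, not a real standard
        found.append(Assertion("diabetic", "yes", 0.85, "structured"))
    return found


def combine(assertions: list[Assertion]) -> dict[str, tuple[str, float]]:
    """Step (b): per variable, sum probabilities per value and normalise.

    This is a simple weighted vote chosen for brevity, not a Bayesian update.
    Returns {variable: (most supported value, its share of the total weight)}.
    """
    scores: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for a in assertions:
        scores[a.variable][a.value] += a.probability
    unified = {}
    for variable, by_value in scores.items():
        total = sum(by_value.values())
        best = max(by_value, key=by_value.get)
        unified[variable] = (best, by_value[best] / total)
    return unified
```

Calling `combine` on the assertions gathered from two conflicting notes ("Patient smokes daily." and "Pt denies smoking.") yields a single value for `smoker` with a reduced confidence, which mirrors the claimed reduction of several first values to one second value per variable.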

CROSS REFERENCE TO RELATED APPLICATIONS

The present patent document is a divisional of application Ser. No. 10/287,055 filed Nov. 4, 2002, which claims the benefit of U.S. Provisional Application Ser. No. 60/335,542, filed on Nov. 2, 2001, which are both incorporated by reference herein in their entirety.

US Referenced Citations (149)

Number	Name	Date	Kind
4946679	Thys-Jacobs	Aug 1990	A
5172418	Ito et al.	Dec 1992	A
5307262	Ertel	Apr 1994	A
5359509	Little et al.	Oct 1994	A
5365425	Torma et al.	Nov 1994	A
5486999	Mebane	Jan 1996	A
5508912	Schneiderman	Apr 1996	A
5544044	Leatherman	Aug 1996	A
5557514	Seare et al.	Sep 1996	A
5619991	Sloane	Apr 1997	A
5652842	Siegrist, Jr. et al.	Jul 1997	A
5657255	Fink et al.	Aug 1997	A
5664109	Johnson et al.	Sep 1997	A
5669877	Blomquist	Sep 1997	A
5706441	Lockwood	Jan 1998	A
5724379	Perkins et al.	Mar 1998	A
5724573	Agrawal et al.	Mar 1998	A
5737539	Edelson et al.	Apr 1998	A
5738102	Lemelson	Apr 1998	A
5811437	Singh et al.	Sep 1998	A
5832447	Ricker et al.	Nov 1998	A
5832450	Myers et al.	Nov 1998	A
5835897	Dang	Nov 1998	A
5845253	Rensimer et al.	Dec 1998	A
5899998	McGauley et al.	May 1999	A
5903889	de la Huerga et al.	May 1999	A
5908383	Brynjestad	Jun 1999	A
5924073	Tyuluman et al.	Jul 1999	A
5924074	Evans	Jul 1999	A
5935060	Iliff	Aug 1999	A
5939528	Clardy et al.	Aug 1999	A
5991731	Colon et al.	Nov 1999	A
6039688	Douglas et al.	Mar 2000	A
6067466	Selker et al.	May 2000	A
6076088	Paik et al.	Jun 2000	A
6078894	Clawson et al.	Jun 2000	A
6081786	Barry et al.	Jun 2000	A
6083693	Nandabalan et al.	Jul 2000	A
6125194	Yeh et al.	Sep 2000	A
6128620	Pissanos et al.	Oct 2000	A
6139494	Cairnes	Oct 2000	A
6151581	Kraftson et al.	Nov 2000	A
6173280	Ramkumar et al.	Jan 2001	B1
6196970	Brown	Mar 2001	B1
6212519	Segal	Apr 2001	B1
6212526	Chaudhuri et al.	Apr 2001	B1
6253186	Pendleton, Jr.	Jun 2001	B1
6259890	Driscoll et al.	Jul 2001	B1
6266645	Simpson	Jul 2001	B1
6272472	Danneels et al.	Aug 2001	B1
6302844	Walker et al.	Oct 2001	B1
6322502	Schoenberg et al.	Nov 2001	B1
6322504	Kirshner	Nov 2001	B1
6338042	Paizis	Jan 2002	B1
6341265	Provost et al.	Jan 2002	B1
6347329	Evans	Feb 2002	B1
6381576	Gilbert	Apr 2002	B1
6468210	Iliff	Oct 2002	B1
6478737	Bardy	Nov 2002	B2
6484144	Martin et al.	Nov 2002	B2
6523019	Borthwick	Feb 2003	B1
6529876	Dart	Mar 2003	B1
6551243	Bocionek et al.	Apr 2003	B2
6551266	Davis, Jr.	Apr 2003	B1
6587829	Camarda et al.	Jul 2003	B1
6611825	Billheimer et al.	Aug 2003	B1
6611846	Stoodley	Aug 2003	B1
6641532	Iliff	Nov 2003	B2
6645959	Bakker-Arkema et al.	Nov 2003	B1
6678669	Lapointe et al.	Jan 2004	B2
6754655	Segal	Jun 2004	B1
6802810	Ciamiello et al.	Oct 2004	B2
6804656	Rosenfeld et al.	Oct 2004	B1
6826536	Forman	Nov 2004	B1
6839678	Schmidt et al.	Jan 2005	B1
6903194	Sato et al.	Jun 2005	B1
6915254	Heinze et al.	Jul 2005	B1
6915266	Saeed et al.	Jul 2005	B1
6941271	Soong	Sep 2005	B1
6961687	Myers et al.	Nov 2005	B1
6988075	Hacker	Jan 2006	B1
7058658	Mentzer	Jun 2006	B2
7130457	Kaufman et al.	Oct 2006	B2
7249006	Lombardo et al.	Jul 2007	B2
7307543	Rosenfeld et al.	Dec 2007	B2
7353238	Gliklich	Apr 2008	B1
7617078	Rao et al.	Nov 2009	B2
7630908	Amrien et al.	Dec 2009	B1
20010011243	Dembo et al.	Aug 2001	A1
20010023419	LaPointe et al.	Sep 2001	A1
20010032195	Graichen et al.	Oct 2001	A1
20010051882	Murphy et al.	Dec 2001	A1
20020002474	Michelson et al.	Jan 2002	A1
20020010597	Mayer et al.	Jan 2002	A1
20020019746	Rienhoff et al.	Feb 2002	A1
20020026332	Snowden et al.	Feb 2002	A1
20020029155	Hetzel et al.	Mar 2002	A1
20020029157	Marchosky	Mar 2002	A1
20020032581	Reitberg	Mar 2002	A1
20020035316	Drazen	Mar 2002	A1
20020038227	Fey et al.	Mar 2002	A1
20020077756	Arouh et al.	Jun 2002	A1
20020077853	Boru et al.	Jun 2002	A1
20020082480	Riff et al.	Jun 2002	A1
20020087361	Benigino et al.	Jul 2002	A1
20020099570	Knight	Jul 2002	A1
20020107641	Schaeffer et al.	Aug 2002	A1
20020123905	Goodroe et al.	Sep 2002	A1
20020138492	Kil	Sep 2002	A1
20020143577	Shiffman et al.	Oct 2002	A1
20020165736	Tolle et al.	Nov 2002	A1
20020173990	Marasco	Nov 2002	A1
20020177759	Schoenberg et al.	Nov 2002	A1
20020178031	Sorensen et al.	Nov 2002	A1
20030028401	Kaufman et al.	Feb 2003	A1
20030036683	Kehr et al.	Feb 2003	A1
20030046114	Davies et al.	Mar 2003	A1
20030050794	Keck	Mar 2003	A1
20030065535	Karlov et al.	Apr 2003	A1
20030108938	Pickar et al.	Jun 2003	A1
20030120133	Rao et al.	Jun 2003	A1
20030120134	Rao et al.	Jun 2003	A1
20030120458	Rao et al.	Jun 2003	A1
20030120514	Rao et al.	Jun 2003	A1
20030125984	Rao et al.	Jul 2003	A1
20030125985	Rao et al.	Jul 2003	A1
20030125988	Rao et al.	Jul 2003	A1
20030126101	Rao et al.	Jul 2003	A1
20030130871	Rao et al.	Jul 2003	A1
20030135391	Edmundson et al.	Jul 2003	A1
20030208382	Westfall	Nov 2003	A1
20040049506	Ghouri	Mar 2004	A1
20040067547	Harbron et al.	Apr 2004	A1
20040078216	Togo	Apr 2004	A1
20040143462	Hunt et al.	Jul 2004	A1
20040158193	Bui et al.	Aug 2004	A1
20040167804	Simpson et al.	Aug 2004	A1
20040172297	Rao et al.	Sep 2004	A1
20040184644	Leichter et al.	Sep 2004	A1
20040243586	Byers	Dec 2004	A1
20050187794	Kimak	Aug 2005	A1
20050191716	Surwit et al.	Sep 2005	A1
20050256738	Petrimoulx	Nov 2005	A1
20060020465	Cousineau et al.	Jan 2006	A1
20060064415	Guyon et al.	Mar 2006	A1
20060122864	Gottesman et al.	Jun 2006	A1
20060136259	Weiner et al.	Jun 2006	A1
20080275731	Rao et al.	Nov 2008	A1
20090171163	Mates et al.	Jul 2009	A1

Foreign Referenced Citations (14)

Number	Date	Country
19820276	Nov 1999	DE
0596247	May 1994	EP
0641863	Mar 1995	EP
0917078	Oct 1997	EP
2332544	Jun 1999	GB
11328073	Nov 1999	JP
2001297157	Oct 2001	JP
9829790	Jul 1998	WO
9839720	Sep 1998	WO
0051054	Aug 2000	WO
0069331	Nov 2000	WO
0166007	Sep 2001	WO
0178005	Oct 2001	WO
0182173	Nov 2001	WO

Non-Patent Literature Citations (41)

Entry
Morik et al. “Combining statistical learning with a knowledge based approach A case study in intensive care monitoring”, Apr. 27, 1999.
Kamp, et al. “Database system support for multidimensional data analysis in environmental epidemiology”, The 1997 International Database Engineering & Applications Symposium; Montreal; Can; Aug. 25-27, 1997. pp. 180-188. 1997.
King, et al., “MEDUS/A: Distributing Database Management for Medical Research”, Proceedings of Computer Networks Compcon 82, 20.09.1982-23.09.1982 pp. 635-642.
Boxwala et al, “Architecture for a Multipurpose Guideline Execution Engine”, Proc. AMIA Symp 1999, pp. 701-705.
“Guidance for Institutional Review Boards and Clinical Investigators 1998 Update”, Sep. 1998, U.S. Food and Drug Administration, http://www.fda.gov/ScienceResearch/SpecialTopics/RunningClinicalTrials/GuidancesInformationSheetsandNotices/ucm113793.htm#IRBMember.
Kassirer, “The Use and Abuse of Practice Profiles”, The New England Journal of Medicine, vol. 330:634-636, Mar. 3, 1994.
Chen, et al., Do “America's Best Hospitals” Perform Better for Acute Myocardial Infarction?, The New England Journal of Medicine, vol. 340, No. 4:286-292, Jan. 28, 1999.
Hofer, et al., “The Unreliability of Individual Physician ‘Report Cards’ for Assessing the Costs and Quality of Care of a Chronic Disease”, JAMA, Jun. 9, 1999, vol. 281, No. 22, pp. 2098-2105.
Ong et al, “The Colorectal Cancer Recurrence Support (CARES) System; Artificial Intelligence in Medicine”, Nov. 1997, Elsevier, Netherlands, vol. 11, pp. 175-188.
Nahm, et al., “A Mutually Beneficial Integration of Data Mining and Information Extraction”, In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Jul. 30, 2000, pp. 627-632, Austin, TX, 20001.
Rao, et al., “Data mining for disease management: adding value to patient records”, Electromedica, vol. 68, 2000, pp. 63-67.
Mills, “Computer Technology of the Not-Too-Distant Future” Sep. 1993, Medical Laboratory Observer, vol. 25, No. 9, p. 78.
Duda, et al., “Pattern Classification—Chapter 1” 2001, John Wiley & Sons, New York, US, XP002536377, pp. 14-17.
Hudson, et al., “The Feasibility of Using Automated Data to Assess Guideline-Concordant Care for Schizophrenia”, Journal of Medical Systems, vol. 23 No. 4 1999, pp. 299-307.
PR Newswire, “Diabetes Health Management Award Honors Mayo Clinic's Zimmerman”, Sep. 25, 2000, http://www.thefreelibrary.com/Diabetes Health Management Award Honors Mayo Clinic's Zimmerman.-a065465402.
Hudson, Mary E., “CAATS and compliance—computer-assisted audit techniques in health care”, Internal Auditor, Apr. 1998, vol. 55, No. 2, p. 25, http://findarticles.com/p/articles/mi_m4153/is_n2_v55/ai_20860208/.
Grimes, Seth, “Structure, Models and Meaning, Is ‘Unstructured’ data merely unmodeled?”, Intelligent Enterprise, Mar. 2005, http://intelligent-enterprise.informationweek.com/showArticle.jhtml?articleID=59301538.
Berkus, “Unstructured Data as an Oxymoron”, ITtoolbox Blogs, Sep. 1, 2005, http://it.toolbox.com/blogs/database-soup/unstructured-data-as-an-oxymoron-5588.
Larsen, “Fast and Effective Text Mining Using Linear-time Document Clustering”, In Knowledge Discovery and Data Mining (1999), pp. 16-22.
Rao, “From Unstructured Data to Actionable Intelligence”, IT Pro, Nov./Dec. 2003, pp. 29-35.
Mitchell, “Machine Learning and Data Mining”, Communications of the ACM, Nov. 1999, vol. 42, No. 11, pp. 31-36.
Kleissner, “Data Mining for the Enterprise”, System Sciences, 1998, Proceedings of the Thirty-First Hawaii International Conference on Kohala Coast, HI, Jan. 6-9, 1998, IEEE Comput. Soc. US, pp. 295-304.
Evans, et al., “Using Data Mining to Characterize DNA Mutations by Patient Clinical Features”, Proc AMIA Annu Fall Symp. 1997: 253-257.
Waltz, “Information Understanding Integrating Data Fusion and Data Mining Processes”, Circuits and Systems, 1998, Proceedings of the 1998 IEEE International Symposium in Monterey, CA, USA, May 31-Jun. 3, 1998, NY, NY, pp. 553-556.
Roemer, et al., “Improved diagnostic and prognostic assessments using health management information fusion,” AUTOTESTCON Proceedings, 2001. IEEE Systems Readiness Technology Conference , vol., No., pp. 365-377, 2001.
Chemical and Biological Arms Control Institute: “Bioterrorism in the United States:Threat, Preparedness and Response”, Contract No. 200199900132, Nov. 2000.
Hanson, et al., “Probabilistic Heuristic Estimates”, Annals of Mathematics and Artificial Intelligence, vol. 2, Nos. 1-4, pp. 209-220, 1990.
Stephen F. Jencks, MD., M.P.H., et al., “Rehospitalizations among Patients in the Medicare Fee-for-Service Program,” The New England Journal of Medicine, pp. 1418-1428, Apr. 2009.
Report to Congress, “Promoting Greater Efficiency in Medicare,” MedPac, pp. 288, Jun. 2007.
Eleanor Herriman, et al., “Patient Non-adherence—Pervasiveness, Drivers, and Interventions,” Sciences Spotlight Series, vol. 2, No. 4, Aug. 2007.
Cook, et al., “The Importance of medication adherence: from the doctor patient interaction to impact on health outcomes,” The Forum 10, Oct. 13-15, 2010.
Gordon K. Norman, “It Takes More than Wireless to Unbind Healthcare,” Presentation at Healthcare Unbound 2007 Conference (Abstract).
Lee Jacobs, “Are Your Patients Taking What You Prescribe? A Major Determinant: Clinician-Patient Communication,” Physician Work Environment, vol. 6, No. 3, Summer 2002.
Michael P. Ho et al., “Impact of Medication Therapy Discontinuation on Mortality After Myocardial Infarction,” Archives of Internal Medicine, 166: 1842-1847, 2007 (Abstract only).
Sergio B. Wey, et al., “Risk Factors for Hospital-Acquired Candidemia,” A Matched Case Control Study, Archives of Internal Medicine, 1989;149(10), (abstract only).
Andrew Pollack, “Rising Threat of Infections Unfazed by Antibiotics,” New York Times, Feb. 27, 2010.
http://en.wikipedia.org/wiki/Nosocomial infection, downloaded Jun. 2, 2011.
Janice Morse, “Preventing Patient Falls,” Sage Publications, 1997.
Ann Hendrich, “How to Try This: Predicting Patient Falls,” American Journal of Nursing, Nov. 2007, vol. 107, No. 11, pp. 50-58.
Un Yong Nahm, et al., “A Mutually Beneficial Integration of Data Mining and Information Extraction”, Proceedings AAAI, National Conference on Artificial Intelligence, Jul. 30, 2000, pp. 627-632.
N. Sager, M. Lyman, C. Buchnall, N. Nhan and L. Tick, “Natural Language Processing and the Representation of Clinical Data,” JAMIA, vol. 1 (2), pp. 142-160 (1994).

Related Publications (1)

	Number	Date	Country
	20090259487 A1	Oct 2009	US

Provisional Applications (1)

	Number	Date	Country
	60335542	Nov 2001	US

Divisions (1)

	Number	Date	Country
Parent	10287055	Nov 2002	US
Child	12488083		US

Abstract