The present application claims priority from Japanese patent application JP 2013-104664 filed on May 17, 2013, the content of which is hereby incorporated by reference into this application.
This invention relates to a data analysis technology, and more particularly, to a system for analyzing medical data, to thereby support a healthcare business.
A health insurance society operates, as part of its insurance business, a healthcare guidance program that provides healthcare guidance for preventing lifestyle diseases and for preventing their severity from increasing, in order to reduce medical costs. However, resources such as the public health nurses employed for the healthcare guidance and the budget for the healthcare guidance are limited. Therefore, a system for supporting effective and efficient operation of the insurance business is desired.
As a method for supporting the operation of the insurance business, in Japanese Patent Application Laid-open No. 2012-128670 A, there is described a healthcare business support system for selecting people subject to a healthcare guidance based on healthcare cost information, health checkup information, and healthcare guidance information. The healthcare business support system includes a medical cost model generation unit for generating a medical cost model representing an estimated medical cost for each of severities and test values of an insured person to a health insurance, a test value improvement model generation unit for generating a test value improvement model representing an improvement amount for each of the severities and the test values, an estimated medical cost reduction effect calculation unit for calculating an estimated medical cost reduction amount by a healthcare guidance for each of the severities and the test values, and a subject person selection unit for selecting an insured person to the health insurance belonging to a severity and a test value high in estimated medical cost reduction amount as a healthcare guidance subject person.
It is necessary to select people subject to the healthcare guidance by priority in order to effectively and efficiently operate the insurance business within resources of a health insurance society. Moreover, a content of the healthcare guidance appropriate for each of the subject people needs to be selected.
When the medical cost is estimated according to Japanese Patent Application Laid-open No. 2012-128670 A, the future medical cost is estimated based on the current severity and test value. For example, a future severity of diabetes is estimated based on current severity and blood sugar level of diabetes, and an average medical cost corresponding to the severity is considered as an estimated medical cost.
However, a factor (blood sugar level for diabetes) effective for estimating the future medical cost and severity needs to be manually set as prior knowledge. Moreover, definition of the severity also needs to be manually set.
Various factors such as age, sex, other test values, a prescription state of medicines, and lifestyle are considered, in addition to blood sugar level, to be effective for estimating the future medical cost, and more precise estimation can thus be carried out by considering these factors. However, it is difficult to enumerate the factors manually. Moreover, the factors need to be set based on prior knowledge for each disease. Therefore, it is difficult to analyze all diseases.
The representative one of inventions disclosed in this application is outlined as follows. There is provided an analysis system comprising a processor configured to execute a program and a memory configured to store the program. The analysis system executes the program to analyze medical data. The analysis system is capable of making access to a database storing medical information including an injury and illness name of an insured person and a medical action performed on the insured person and health checkup information including a test value obtained by a health checkup on the insured person. The analysis system further comprises: a causation/transition structure calculating unit configured to control the processor to generate a graph structure including a node corresponding to an item defined by a plural of or the medical information or the health checkup information and a probability variable relating to the item, and a probabilistic dependency defined by one of a directed link or an undirected link between the nodes, and to store the generated graph structure in the database; a node generating unit configured to control the processor to generate an event space of the nodes based on the medical information and the health checkup information, and to store the generated event space in the database; a probability calculating unit configured to control the processor to calculate a conditional probability of the graph structure based on the medical information, the health checkup information and the event space, and to store the calculated conditional probability in the database; a state transition model reconstructing unit configured to control the processor to reconstruct a state transition model with a graph structure, an event space and a conditional probability including specified probability variables based on a state transition model constructed from the graph structure, the event space and the conditional probability, and to store the reconstructed state transition model in the database; a disease state transition estimating unit configured to control the processor to estimate a disease state transition probability based on the reconstructed state transition model; and a health guidance supporting unit configured to control the processor to select a subject for health guidance and a content of health guidance based on the estimated disease state transition probability.
According to one embodiment of this invention, a related future event can be accurately estimated based on various kinds of data. Objects, configurations, and effects other than those described above are readily apparent from the following description of embodiments.
The present invention can be appreciated by the description which follows in conjunction with the following figures, wherein:
A first embodiment of this invention describes an example of a medical data analysis system that selects a person to be subjected to healthcare guidance, proposes a healthcare guidance approach, and estimates a healthcare guidance effect for prevention of disease onset and transition to a critical state thereof based on medical data (e.g., healthcare cost information, health checkup information, and medical inquiry information).
The healthcare cost information records the injury and illness names, the prescribed medicines, the clinical actions performed, and the medical costs (points) when an insured person of a health insurance visits a medical institution for consultation. An example of the healthcare cost information is described later referring to
Health checkup information stores a test value when an insured person has had a health checkup. An example of the health checkup information is described later referring to
According to the first embodiment, the causal relations of a disease and the disease state transition structure are modeled based on medical data. Based on this model, the first embodiment provides various functions such as selection of a person to be subjected to healthcare guidance, proposal of a healthcare guidance approach, and estimation of a healthcare guidance effect.
The medical data analysis apparatus 101 according to this embodiment includes an input unit 102, an output unit 103, a processing device 104, a memory 105, and a storage medium 106.
The input unit 102 is a human interface such as a mouse and a keyboard, and receives an input to the medical data analysis apparatus 101. The output unit 103 includes a display and a printer for outputting an arithmetic operation result by the medical data analysis system. The storage medium 106 is a storage apparatus for storing various programs for realizing medical data analysis processing by the medical data analysis system, an execution result of the medical data analysis processing, and the like, and is, for example, a non-volatile storage medium (such as a magnetic disk drive or a non-volatile memory). The programs stored in the storage medium 106 are loaded into the memory 105. The processing device 104 is an arithmetic operation apparatus for executing a program loaded into the memory 105, and is, for example, a CPU or a GPU. The processing and arithmetic operations described later are carried out by the processing device 104.
The medical data analysis system according to the first embodiment may be a computer system including a single computer, or a computer system including a server and client terminals. A data formatting unit 107, a disease causation/transition model generating unit 108, and a disease onset probability/medical cost estimating unit 112 of the medical data analysis apparatus 101 may be configured as separate apparatus as illustrated in FIGS. 3, 4, and 5. In this case, the apparatus illustrated in
The medical data analysis system is a computer system configured on a single computer, or on a plurality of computers logically or physically constructed, and may operate on separate threads on the same computer or operate on virtual computers configured on a plurality of physical computer resources.
First, a description is given of medical data used in the first embodiment.
A medical information storage unit 117 stores the medical data input to the input unit 102. The medical data includes the healthcare cost information, the health checkup information, and the medical inquiry information. The healthcare cost information includes healthcare cost basic information, injury and illness name information, clinical action information, medicine information, injury and illness name classification information, clinical action classification information, and medicine classification information.
A description is now given of the healthcare cost information.
The healthcare cost basic information 601 is information holding relationships between a healthcare cost and an insured person. The healthcare cost basic information 601 includes search numbers 602, insured person IDs 603, sex 604, ages 605, months and years of clinical action 606, and total points 607.
The search number 602 is an identifier for uniquely identifying a healthcare cost record. The insured person ID 603 is an identifier for uniquely identifying an insured person to the health insurance. The sex 604 is information representing a sex of the insured person. The age 605 is information representing an age of the insured person.
The month and year of clinical action 606 is a month and a year when the insured person visits a medical institute. The total point 607 is information representing a total point of one healthcare cost record. It should be noted that a medical cost (in yen) is calculated by multiplying the total point by “10”. If a plurality of injury and illness names are registered to one search number in injury and illness name information 901 illustrated in
The injury and illness name information 901 includes search numbers 602, injury and illness codes 902, and injury and illness names 903.
The search number 602 is an identifier for uniquely identifying a healthcare cost record, and uses the same number as the search number (
It should be noted that a plurality of injury and illness names can be described in one healthcare cost record. For example, the injury and illness names 903 of entries having “11” in the search number 602 are “diabetes” and “hypertension” in the injury and illness name information 901 illustrated in
Injury and illness name classification information 1001 is information for associating an injury and illness classification with the injury and illness names belonging to that classification, and includes an injury and illness classification 1002, an injury and illness code 902, an injury and illness name 903, and a complication 1003.
The injury and illness classification 1002 is the classification to which an injury and illness in question belongs. The injury and illness code 902 is an injury and illness code described in the healthcare cost record, and uses the same numbers as used in the injury and illness code 902 of the injury and illness name information 901 illustrated in
Clinical action information 1101 includes search numbers 602, clinical action codes 1102, clinical action names 1103, and clinical action points 1104.
The search number 602 is an identifier for uniquely identifying a healthcare cost record, and uses the same number as the search number (
In
Clinical action classification information 1201 includes injury and illness classifications 1002, clinical action codes 1102, and clinical action names 1103.
The injury and illness classification 1002 uses the same classification as the injury and illness classification 1002 (
Medicine information 1301 includes search numbers 602, medicine codes 1302, medicine names 1303, and medicine points 1304.
The search number 602 is an identifier for uniquely identifying a healthcare cost record, and uses the same number as the search number 602 (
In
Medicine classification information 1401 includes injury and illness classifications 1002, medicine codes 1302, and medicine names 1303.
The injury and illness classification 1002 uses the same classification as the injury and illness classification 1002 (
It should be noted that the clinical action information 1101 illustrated in
A description is now given of the health checkup information.
Health checkup information 701 is information for managing health checkup information on a plurality of insured persons for a plurality of years, and includes insured person IDs 603, health checkup dates 702, and various test values (such as BMIs 703, abdominal circumferences 704, fasting blood sugars 705, systolic blood pressures 706, and neutral fats 707) in the health checkup.
The insured person ID 603 is an identifier of an insured person to the health insurance who has had the health checkup, and uses the same identifier as the insured person ID 603 (
If an insured person has not had a specific test, the corresponding data in the health checkup information may be absent. For example, in
A description is now given of the medical inquiry information.
Medical inquiry information 801 is information for managing medical inquiry information on a plurality of insured persons for a plurality of years, and includes insured person IDs 603, medical inquiry dates 802, and answers of the medical inquiry (such as smoking 803, drinking 804, and walking 805). The medical inquiry may include a lifestyle, a health history, a constitution such as allergy, and subjective symptoms.
The insured person ID 603 is an identifier of an insured person to the health insurance who has had the medical inquiry, and uses the same identifier as the insured person ID 603 (
Detailed information such as the number of steps walked, the amount of alcohol consumed, and the number of cigarettes smoked may not be available from the medical inquiry. Instead of a specific amount of drinking, the respondent may select one of the frequency categories defined in advance in the questionnaire. For example, the frequency of drinking may be divided into a certain number of levels (such as (1) none, (2) once or twice per week, and (3) three times or more per week), and the applicable level is given as the answer. Alternatively, only the presence or absence of habits such as smoking and drinking may be acquired. In these cases, the value in the medical inquiry information is a number without a quantitative meaning.
If an answer to a specific item is not received, data on the medical inquiry may be absent. For example, in
A description is now given of the processing by the data formatting unit 107. The data formatting unit 107 aggregates and unifies the healthcare cost information, the health checkup information, and the medical inquiry information for each insured person and each period from the medical data stored in the medical information storage unit 117, and formats the result into a tabular form. In the following, a description is given while assuming that one period is one year, but the period may be another length such as half a year, two years, or three years.
The formatted information 1501 includes healthcare cost formatted information acquired by formatting the healthcare cost information in the year of 2004. Each row of the formatted information 1501 represents data summed for one insured person ID for one year.
An insured person ID 603, a sex 604, an age 605, and a total point 607 are the same as the insured person ID 603, the sex 604, the age 605, and the total point 607 (
The injury and illness code 10 (1503) is the number of healthcare cost records having “10” in the injury and illness code out of the healthcare cost records for the insured person ID. The injury and illness code 20 (1504) is similarly the number of healthcare cost records having “20” in the injury and illness code out of the healthcare cost records for the insured person ID. The clinical action code 1000 (1505) is the number of healthcare cost records for which a clinical action having 1000 in the clinical action code has been performed out of the healthcare cost records for the insured person ID. The medicine code 110 (1506) is the number of healthcare cost records for which a medicine having the medicine code of “110” has been prescribed out of the healthcare cost records for the insured person ID.
A specific description is now given of the processing by the data formatting unit 107 for a case where the data in the year of 2004 are formatted.
First, one insured person ID is selected. The data formatting unit 107 acquires search numbers of healthcare cost records for the insured person ID having 2004 in the month and year of clinical action from the healthcare cost basic information 601. Then, the data formatting unit 107 refers to the injury and illness name information 901, and counts, for each injury and illness code, the number of healthcare cost records having the injury and illness code described thereon. As a result, the number of healthcare cost records for each of the injury and illness codes is acquired. Similarly, the data formatting unit 107 refers to the clinical action information 1101, counts the number of healthcare cost records for each of the clinical action codes, refers to the medicine information 1301, and counts the number of healthcare cost records for each of the medicine codes. As a result, a data row for the year of 2004 is generated for the selected insured person ID. This processing is carried out for all combinations of each of insured person IDs and each of the years subject to the analysis.
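The counting described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the tuple layouts, the sample rows, and the name format_counts are assumptions made for the example, and only the injury and illness codes are counted (the clinical action codes and medicine codes would be handled in the same way).

```python
from collections import Counter, defaultdict

# Hypothetical rows mirroring the healthcare cost tables; values are illustrative only.
basic_info = [  # (search number, insured person ID, year of clinical action)
    (11, "K0001", 2004),
    (12, "K0001", 2004),
    (13, "K0001", 2004),
    (14, "K0002", 2004),
]
injury_info = [  # (search number, injury and illness code)
    (11, 10), (11, 20), (12, 10), (13, 10), (14, 20),
]

def format_counts(basic_info, code_rows):
    """Count, per (insured person ID, year), the records that carry each code."""
    record_key = {search_no: (pid, year) for search_no, pid, year in basic_info}
    counts = defaultdict(Counter)
    # The set removes duplicate (record, code) pairs so each record counts once per code.
    for search_no, code in set(code_rows):
        if search_no in record_key:
            counts[record_key[search_no]][code] += 1
    return counts

formatted = format_counts(basic_info, injury_info)
```

Each key of formatted corresponds to one row of the formatted information (one insured person ID and one year), and each counter entry corresponds to one code column.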
For example, search numbers “11”, “12”, and “13” can be acquired from the healthcare cost basic information 601 for the data of the insured person ID “K0001” in the first row in 2004 in the formatted information 1501 illustrated in
The formatted information 1501 illustrated in
A value of each of the items is a value of the health checkup data of an insured person and a year respectively corresponding to the insured person ID 603 and the data year 1502. The health checkup data can be acquired from the health checkup information 701. If the health checkup information 701 includes a plurality of pieces of health checkup data corresponding to the same insured person ID and the same year, the data of any one of the health checkup dates or an average of the health checkup results for the year may be used. If the data of one health checkup date is used, it is preferred to use the data of a simultaneous health checkup day held at approximately the same time every year. Alternatively, the data with the fewest missing values may be selected. A numerical value defined in advance for representing a missing value is used for the missing data. In the example illustrated in
The formatted information 1501 illustrated in
A value of each of the items is a value of the medical inquiry data of an insured person and a year respectively corresponding to the insured person ID 603 and the data year 1502. The medical inquiry data can be acquired from the medical inquiry information 801. If the medical inquiry information 801 includes a plurality of pieces of medical inquiry data corresponding to the same insured person ID and the same year, the data of any one of the medical inquiry dates or an average of the medical inquiry results for the year may be used. If the data of one health checkup date is used, it is preferred to use the data of a simultaneous health checkup day held at approximately the same time every year. Alternatively, the data with the fewest missing values may be selected. A numerical value defined in advance for representing a missing value is used for the missing data. In the example illustrated in
As a result of the processing, healthcare cost formatted information, health checkup formatted information, and medical inquiry formatted information can be generated. The data only for the year of 2004 is illustrated in
On this occasion, when the healthcare cost formatted information is generated, similar items may be summarized, thereby unifying the plurality of items. For example, if functions of the oral antidiabetic A and functions of the oral antidiabetic B out of the items of the medicines are similar, the oral medicines A and B may be summarized, and may be treated as one item. On this occasion, a sum of the number of prescriptions of the oral antidiabetic A and the number of prescriptions of the oral antidiabetic B in the same year is used as a value of the item newly summarized. Criteria for determining whether items are similar or not are described below. Clinical action names belonging to the same injury and illness classification in the clinical action classification information 1201 are considered as similar items. Moreover, medicine names belonging to the same injury and illness classification in the medicine classification information 1401 are considered as similar items. Moreover, similar item information is manually generated in advance.
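The unification of similar items can be sketched as below. The mapping similar_items, the item names, and the counts are hypothetical; in the embodiment the grouping would come from the classification information or from similar item information prepared in advance.

```python
# Hypothetical item names and grouping, used only for illustration.
similar_items = {
    "oral antidiabetic A": "oral antidiabetic",
    "oral antidiabetic B": "oral antidiabetic",
}

def unify_items(row, similar_items):
    """Sum the yearly counts of items that map to the same unified item."""
    unified = {}
    for item, count in row.items():
        name = similar_items.get(item, item)  # items without a group keep their name
        unified[name] = unified.get(name, 0) + count
    return unified

# One formatted row (one insured person, one year) with made-up prescription counts.
row_2004 = {"oral antidiabetic A": 3, "oral antidiabetic B": 2, "insulin": 1}
unified_row = unify_items(row_2004, similar_items)
```

The value of the unified item is the sum of the prescription counts of its members in the same year, as described above.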
The generated healthcare cost formatted information, health checkup formatted information, and medical inquiry formatted information shown in
The value in the healthcare cost formatted information is acquired by summing the number of healthcare cost records, namely the number of prescriptions, but the value may instead be information representing whether or not a prescription exists. In other words, a case where the number of prescriptions is equal to or more than 1 (prescription exists) is set to 1, and a case where the number of prescriptions is 0 (no prescription) is set to 0, resulting in a binary representation. Moreover, the number of prescriptions may be considered to represent the severity, and the value of the healthcare cost formatted information may be a value representing a level obtained by classifying the number of prescriptions. For example, a case where the number of prescriptions is 0 is set to 0, a case where the number of prescriptions is 1 to 4 is set to 1, and a case where the number of prescriptions is equal to or more than 5 is set to 2, resulting in a representation in three levels.
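The two encodings described above can be written down directly. The function names to_binary and to_level are chosen for this sketch; the thresholds are those of the example in the text.

```python
def to_binary(count):
    """1 if at least one prescription exists, otherwise 0."""
    return 1 if count >= 1 else 0

def to_level(count):
    """Three levels: 0 for no prescription, 1 for 1 to 4, 2 for 5 or more."""
    if count == 0:
        return 0
    if count <= 4:
        return 1
    return 2
```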
In the above-mentioned example, the healthcare cost formatted information, health checkup formatted information, and medical inquiry formatted information are summarized in a period of one year. However, the period may be set to a different period of two years, three years, or the like. In the following, a description is given of a case where the period for the summarizing is one year.
Next, the disease causation/transition model generating unit 108 is described.
The disease causation/transition model generating unit 108 includes a causation/transition structure calculating unit 109, a node generating unit 110, and a probability table calculating unit 111. The disease causation/transition model generating unit 108 generates a model representing a cause of a disease and disease state transition based on a graphical model by using formatted information stored in the formatted information storage unit 118.
Using the disease causation/transition model, an estimated value for a medical cost of a year (X+n) after a certain year (year X) can be calculated from personal health checkup data, medical inquiry data, and healthcare cost data of the certain year, to thereby estimate a disease onset probability. Further, an estimated value for a medical cost of the next year for a group in a specific state in year X (e.g., group whose blood sugar levels lie within a certain range) can be calculated, to thereby estimate the disease onset probability of a disease. Although the following describes estimation of a medical cost and the state of a disease of the next year (when n=1), the estimation may be made for a different period such as two years later or three years later.
At this time, generation of a model needs medical data acquired in years separated at least by n years. For n=3, for example, pieces of medical data acquired in years separated by three years, such as medical data of year 2004 and medical data of year 2007, are needed. The following description is given on the assumption that pieces of medical data acquired in years separated by n years are stored in the medical information storage unit 117, and formatted information generated from the medical data by the data formatting unit 107 is stored in the formatted information storage unit 118. The disease causation/transition model generating unit 108 generates a model representing the cause of a disease and disease state transition by using the formatted information stored in the formatted information storage unit 118.
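The requirement for data separated by n years can be illustrated with a small pairing routine. This is a sketch under assumptions: the dictionary layout keyed by (insured person ID, year), the item name "bmi", and the values are hypothetical.

```python
def pair_years(formatted, n):
    """Join each insured person's row of year X with the row of year X+n."""
    pairs = []
    for (pid, year), row in formatted.items():
        later = formatted.get((pid, year + n))
        if later is not None:  # persons without data in year X+n cannot be used
            pairs.append((pid, year, row, later))
    return pairs

# Made-up formatted rows keyed by (insured person ID, year).
formatted = {
    ("K0001", 2004): {"bmi": 24.1},
    ("K0001", 2007): {"bmi": 25.0},
    ("K0002", 2004): {"bmi": 22.3},
}
pairs = pair_years(formatted, n=3)
```

Only insured persons with rows in both years contribute a training pair; here K0002 is dropped because no row for 2007 exists.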
First, a graphical model is briefly described.
The graphical model is formed by nodes and edges. The node represents a probability variable, and the edge represents the dependency relation between nodes (between probability variables). The edges have two types: a directed link and an undirected link.
Now, two probability variables X1 and X2 are considered.
A structure 1701 illustrated in
Because the probability variable X1 does not have a parent node, the probability distribution of X1 is given by a prior probability P(X1). Therefore, the joint probability distribution of X1 and X2 is given by P(X1, X2)=P(X1)P(X2|X1). X1 and X2 both take three values (states) of 1, 2, and 3. Expressing this probability distribution simply needs the probability distribution P(X1) and the probability distribution P(X2|X1). The probability distribution P(X1) and the probability distribution P(X2|X1) are expressed by a probability table 1702 and a probability table 1703, respectively, shown in
A structure 1704 illustrated in
Accordingly, the dependency relation between probability variables can be expressed.
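As a numerical illustration of the directed case, the factorization P(X1, X2)=P(X1)P(X2|X1) can be evaluated from two probability tables. The probability values below are invented for the sketch and are not the values of the probability tables 1702 and 1703.

```python
# Hypothetical prior P(X1) over the three states 1, 2, and 3.
p_x1 = {1: 0.5, 2: 0.3, 3: 0.2}

# Hypothetical conditional table P(X2 | X1): outer key is x1, inner key is x2.
p_x2_given_x1 = {
    1: {1: 0.7, 2: 0.2, 3: 0.1},
    2: {1: 0.2, 2: 0.6, 3: 0.2},
    3: {1: 0.1, 2: 0.3, 3: 0.6},
}

def joint(x1, x2):
    """Joint probability P(X1=x1, X2=x2) = P(X1=x1) * P(X2=x2 | X1=x1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2]

states = (1, 2, 3)
total = sum(joint(a, b) for a in states for b in states)
# Marginal P(X2) obtained by summing the joint distribution over X1.
marginal_x2 = {b: sum(joint(a, b) for a in states) for b in states}
```

Because each row of the conditional table sums to 1, the joint distribution also sums to 1, and any marginal can be recovered from the two tables alone.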
According to the first embodiment, a node (probability variable) is selected from items in formatted information of year X and items in formatted information of year X+n. For example, an injury and illness code 10 in year X, BMI in year X, smoking in year X, an injury and illness code 10 in year X+n, BMI in year X+n, smoking in year X+n, and so forth in
The number of those items is, for example, several hundred to several thousand in consideration of the healthcare cost, health checkup, and medical inquiry when items in healthcare cost information are limited to those related to diabetes, and is several hundred thousand in consideration of all the items in healthcare cost information, all the items of health checkup, and all the items of medical inquiry. In other words, the number of nodes ranges from several hundred to as large as several hundred thousand.
The disease causation/transition model generating unit 108 uses formatted information generated from past healthcare cost information, health checkup information, and medical inquiry information to generate a model for estimating the disease onset probability and medical cost of a disease of an insured person n years after a certain year based on the healthcare cost information, health checkup information, and medical inquiry information for the insured person in the certain year. At this time, pieces of past formatted information of at least n years are needed. For n=3, for example, two years of past formatted information of year 2004 and year 2007 are used to generate a model for estimating the disease onset probability and medical cost of three years later. Assuming that the present year is 2008, and all or parts of the healthcare cost information, health checkup information, and medical inquiry information of an insured person are given, it is possible to estimate the disease onset probability and medical cost of year 2011 for the insured person.
The model illustrated in
The following description is given on the assumption that n=1.
The causation/transition structure calculating unit 109 forms an edge based on the dependency between those nodes. The node generating unit 110 generates space (event space) taken by the value of each node. The probability table calculating unit 111 calculates a conditional probability.
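One common way to obtain a conditional probability for a single edge is by relative frequency, sketched below. This is an assumption for illustration only; the calculation actually used by the probability table calculating unit 111 is not specified in this passage, and the samples are made up.

```python
from collections import Counter

def conditional_probability_table(samples):
    """Estimate P(child | parent) by relative frequency from (parent, child) pairs."""
    pair_counts = Counter(samples)
    parent_counts = Counter(parent for parent, _ in samples)
    return {
        (parent, child): count / parent_counts[parent]
        for (parent, child), count in pair_counts.items()
    }

# Made-up observations of (parent state, child state) for one edge.
samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1)]
cpt = conditional_probability_table(samples)
```

With many states or few samples, some combinations are never observed and receive no entry here; a practical system would need smoothing or a coarser event space, which this sketch leaves out.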
The causation/transition structure calculating unit 109 forms an edge (dependency relation) between those nodes (probability variables) from the data. The formation is described referring to a simple example.
Now, a model of determining whether a system including two sensors is normal or abnormal based on the statuses of the sensors is considered. Each of the probability variables, X1 and X2, which respectively indicate the statuses of the two sensors, takes two states. In addition, the system has a probability variable X3 which takes two statuses of normal and abnormal, respectively expressed by “0” and “1”.
It is defined that when the sensors are in the status of “1”, it is likely that the system is abnormal. For example, X1 is for a temperature sensor and X1=1 is established when the temperature sensor indicates a temperature higher than a certain value, and X2 is for a sound sensor and X2=1 is established when the sound sensor detects a sound different from normal sounds. This implies that the case where the two sensors are effective in determining whether the system is normal or abnormal becomes a structure expressed by a structure 1801 in
This example is described in comparison with the example illustrated in
When there are N probability variables, the number M of pairs of probability variables that can be selected from among the N probability variables is M = N(N-1)/2. Because an edge may be present or absent between each pair, there are 2^M combinations of the presence/absence of edges between nodes. In consideration of the direction of an edge, the number of possible models becomes even greater. Therefore, it is not possible to check all the possibilities. To cope with this difficulty, there is a method of limiting expression of the model to a structure called a Bayesian network to search for a structure suitable for expressing data.
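The growth in the number of candidate structures can be checked with a short calculation: N nodes give N(N-1)/2 node pairs, and each pair either has an edge or not. The function name structure_counts is chosen for this sketch, and edge direction is ignored, so the figures are lower bounds on the number of models.

```python
from math import comb

def structure_counts(n_nodes):
    """Return (number of node pairs, number of edge presence/absence combinations)."""
    pairs = comb(n_nodes, 2)   # N * (N - 1) / 2
    combinations = 2 ** pairs  # each pair independently has an edge or not
    return pairs, combinations
```

Even for a few hundred nodes the second figure is far beyond what exhaustive search can cover, which is why the structure has to be restricted.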
The Bayesian network is a structure where every edge is a directed link, and there are no plurality of routes directed from a variable X1 to a variable X2 passing along directed links. For example, a structure 1901 illustrated in
Various methods of automatically learning the structure of a Bayesian network based on data have been proposed. However, with any of these methods it is difficult to check all the possibilities when the number of nodes becomes large. When the scale of the system is large and data of different types and qualities are mixed, as in the first embodiment, it is difficult to automatically learn an accurate network.
In this respect, the graphical model structure calculating unit 208 according to the first embodiment first defines the causation/transition relation between nodes as edges based on the features of the individual items of the healthcare cost, health checkup, and medical inquiry. Next, the graphical model structure calculating unit 208 calculates the dependency between nodes based on two dependencies, namely the quantitative dependency and co-occurrence dependency. Then, the graphical model structure calculating unit 208 deletes the edge between low-dependency nodes. In the graphical model according to the first embodiment, two types of edges, namely an edge expressing the pathologic cause and effect and an edge expressing a disease state transition, are taken into account.
The following describes the process executed by the causation/transition structure calculating unit 109 referring to
In causation/transition structure defining step 2001 in
The injury and illness names are the items in the injury and illness codes 1503 and 1504 of healthcare cost formatted information, and the medical actions are the items in the clinical action code 1505 and the medicine code 1506 of the healthcare cost formatted information. The test value represents the items of a test value obtained by health checkup formatted information. The lifestyle habits are the items on the lifestyle habit and subjective symptom found by medical inquiry and obtained from the medical inquiry formatted information. The basic information represents the age and sex.
Nodes are classified based on the above-mentioned classifications of items. In other words, when nodes correspond to items in healthcare cost information, health checkup information, and medical inquiry information, the nodes are classified to the classifications to which the items belong, and when nodes correspond to a unified item obtained by unifying a plurality of items, the nodes are classified to the classification to which the unified item belongs. Through the above-mentioned processing, the nodes are classified into an injury and illness name, a medical action, a test value, a lifestyle habit, and basic information.
Referring to
It is an object of the model of the first embodiment to estimate the probability of a disease state transition (disease onset) and the medical cost in future and/or specify the cause of the disease state transition based on personal data of this year. To achieve the object, it is desired to estimate medical actions of the next year. The situations of this year's medical actions seem to be helpful information in the estimation of the next year's medical actions. Therefore, edges from the items of this year to items of the next year are generated between the nodes of this year's medical actions and the nodes of the next year's medical actions as seen from a structure 2101 illustrated in
The conditional probability for this model can be calculated by using data of two years of healthcare cost formatted information as shown in a table 2102 of
It appears that the probability of transition varies depending on the test values and lifestyle habits of individual persons. For example, the probability that a person who is given a prescription for an oral antidiabetic this year will be given a prescription for insulin next year is expected to be higher for a person having a higher blood sugar level. Accordingly, acquiring more detailed personal information can provide a more accurate probability of transition.
Because the probabilities of being given prescriptions for individual medical actions next year seem to depend on this year's test value, a directed link from this year's test value to the next year's medical action is defined. Likewise, a lifestyle habit seems to affect a next year's medical action, and hence a directed link from this year's lifestyle habit to the next year's medical action is defined. Those definitions are illustrated in a structure 2201 in
Further, a medical cost is calculated based on a medical action, and hence, when the medical cost is to be estimated, a directed link from the medical action of this year to the total points (medical cost) of the next year is defined. Further, in order to improve the accuracy of the medical cost, a directed link from this year's total points to the next year's total points is defined. Those definitions are illustrated in a structure 2202 in
The above-mentioned edges of the causation/transition relations are organized in a table 2301 shown in
Other definitions of the causation/transition relations are illustrated in
A model expressed by a table 2302 illustrated in
A model expressed by a table 2303 illustrated in
The direction of an edge is described now. As illustrated in
Because the age and sex which are basic information are items that widely affect all the items in the causation/transition structure defining step 2001 in
Through the above-mentioned process, the direction and the presence or absence of an edge between nodes belonging to different classifications are defined. When the model illustrated in
The description of the process of the causation/transition structure defining step 2001 is completed. In the following, nodes with different periods are treated as nodes belonging to different classifications. In other words, the classification of test values of year X is treated as different from the classification of test values of year X+n.
Next, among the transition and causal edges between nodes (probability variables) belonging to different classifications defined in the causation/transition structure defining step 2001, the dependency between the probability variables is calculated, and an edge between low-dependency probability variables is eliminated.
In Inter-node dependency calculating step 2002, the dependency between nodes (probability variables) is calculated. At this time, individual nodes have values of different properties. For example, a test value, such as BMI or fasting blood sugar, is a continuous value whose scale varies from one item to another. The item “medical action” in healthcare cost formatted information is an integer value representing the number of prescriptions. Further, the answer number for a medical inquiry item, for example, a subjective symptom, is a value which does not have a quantitative meaning. Further, some values may be missing. Under such circumstances, a method of comparing dependencies of variables of different properties with one another is needed.
The first embodiment shows an example of calculating the dependency between nodes using two references, namely, the quantitative dependency reference and the co-occurrence dependency reference. The quantitative dependency reference is a reference for calculating the similarity between values that have quantitative meanings, whereas the co-occurrence dependency reference is a reference for calculating the similarity between values that do not have quantitative meanings or between a value that has a quantitative meaning and a value that does not have a quantitative meaning.
First, a method of calculating the quantitative dependency is described. The dependency between two probability variables X1 and X2 is calculated. As observation data of X1 and X2, x1=(x11, x12, . . . , x1n) and x2=(x21, x22, . . . , x2n) are given, respectively. The quantitative dependency to be described below is an example based on the coefficient of correlation when x1 and x2 are considered as vectors.
The correlation coefficient between the vectors x1 and x2 is defined as r(x1, x2). Because x1 and x2 may contain missing values, every element position at which either x1 or x2 has a missing value is eliminated from both vectors. When x1i is missing, for example, x2i is also eliminated. The vectors obtained by eliminating those positions from x1 and x2 are given as v1=(v11, v12, . . . , v1m) and v2=(v21, v22, . . . , v2m).
Even when two pairs of variables have substantially the same dependency, the value of the correlation coefficient r(v1, v2) varies due to the difference in property between the values of v1 and v2. Therefore, the elements in v1 and v2 are first rearranged at random, independently of each other. The resultant vectors, w1 and w2, are expected not to have a dependency therebetween. Using the vectors, |r(v1, v2)|−|r(w1, w2)| is calculated. When |r(v1, v2)|<|r(w1, w2)|, it can be determined that there is not a quantitative dependency. Accordingly, the quantitative dependency in this case is set to 0, and the quantitative dependency in the other cases is set to |r(v1, v2)|−|r(w1, w2)|. This makes it possible to calculate the quantitative dependency relative to that in a random case (when there is not a dependency).
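The calculation of the quantitative dependency can be sketched as follows. This is only an illustration under stated assumptions: missing values are represented by None, the Pearson correlation coefficient is used as r, a correlation that cannot be computed (fewer than two elements or a constant vector) is treated as 0, and the names pearson and quantitative_dependency are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation; returns 0.0 when it is undefined (assumption)."""
    n = len(a)
    if n < 2:
        return 0.0
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def quantitative_dependency(x1, x2, rng=None):
    """Return max(0, |r(v1, v2)| - |r(w1, w2)|) for observation vectors x1, x2."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    # Eliminate every position where either vector has a missing value.
    pairs = [(a, b) for a, b in zip(x1, x2) if a is not None and b is not None]
    v1 = [a for a, _ in pairs]
    v2 = [b for _, b in pairs]
    # Rearrange each vector independently at random to get the baseline.
    w1, w2 = v1[:], v2[:]
    rng.shuffle(w1)
    rng.shuffle(w2)
    return max(0.0, abs(pearson(v1, v2)) - abs(pearson(w1, w2)))
```

A single random rearrangement is used here, as in the description; averaging over several rearrangements would be a possible variation but is not stated in the embodiment.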
Here, the quantitative dependency is effective in comparing pieces of data having quantitative values. In an example 2005 illustrated in
A method of calculating the co-occurrence dependency is described by way of an example where dependency between two probability variables X1 and X2 is calculated.
As observation data of X1 and X2, x1=(x11, x12, . . . , x1n) and x2=(x21, x22, . . . , x2n) are given, respectively. The co-occurrence dependency to be described below is an example based on the entropies of x1 and x2.
First, as in the case of the quantitative dependency, vectors with missing values eliminated are set as v1 and v2. Next, the set of element pairs of the vectors v1 and v2 is set to S={(v1i, v2i)} (where i is an integer value of 1 to m). The number of the elements of S is m. For each distinct pair of values p=(p1, p2) appearing in S, the number of the elements of S equal to p is given by np. In addition, the number of different elements of S is given by L. Then, the entropy of a pair of v1 and v2 normalized with L is given by the following equation.
e(v1,v2)=Σ[−(np/m)log(np/m)]/L
where Σ is the sum over all the different elements p of S. As in the case of the quantitative dependency, e(w1, w2) is calculated for randomized w1 and w2. e(v1, v2) is a non-negative value which becomes smaller as the degree of co-occurrence of v1 and v2 gets larger. Accordingly, when e(v1, v2) normalized with the value in the random case, namely e(v1, v2)/e(w1, w2), is larger than 1, it can be determined that v1 and v2 do not have a dependency relation. Further, e(v1, v2)/e(w1, w2) is a value equal to or larger than 0. Accordingly, the co-occurrence dependency when e(v1, v2)/e(w1, w2) is larger than 1 is set to 0, and the co-occurrence dependency in the other cases is set to 1−e(v1, v2)/e(w1, w2).
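The normalized entropy and the mapping from the entropy ratio to the co-occurrence dependency can be sketched as follows. This is a minimal illustration, assuming that missing values have already been eliminated, that the vectors are not empty, and that the natural logarithm is used (the base cancels in the ratio). The function names are hypothetical, and a baseline entropy of 0 is treated as "no dependency" purely as an assumption of this sketch.

```python
import math
from collections import Counter


def normalized_pair_entropy(v1, v2):
    """e(v1, v2): entropy of the value pairs divided by the number L of distinct pairs."""
    m = len(v1)
    counts = Counter(zip(v1, v2))  # n_p for each distinct pair p
    n_distinct = len(counts)       # L
    entropy = -sum((n / m) * math.log(n / m) for n in counts.values())
    return entropy / n_distinct


def cooccurrence_dependency_from(e_observed, e_random):
    """Map e(v1, v2) and the randomized baseline e(w1, w2) to a value in [0, 1]."""
    if e_random == 0:
        return 0.0  # assumption: no usable baseline, so report no dependency
    ratio = e_observed / e_random
    return 0.0 if ratio > 1 else 1.0 - ratio
```

For example, for v1=(1, 1, 2, 2) and v2=(a, a, b, b) there are two distinct pairs, each occurring with relative frequency 1/2, so the normalized entropy is log(2)/2.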
The quantitative dependency and the co-occurrence dependency which are defined in the above-mentioned manner are values equal to or larger than 0 and equal to or less than 1, and each dependency becomes greater as the value becomes larger. The dependencies are calculated for every pair of probability variables having edges defined in the causation/transition structure defining step 2001. In the following, the quantitative dependency is expressed as Q, and the co-occurrence dependency is expressed as C.
In Dependency calibration step 2003 in
f(C)=αC*C+βC+γ
A method of determining parameters for f (α, β, and γ in the above-mentioned case) is described below. For the examples 2005 and 2006 respectively illustrated in
Accordingly, the dependency is defined by D=max{Q, f(C)}.
In Low-dependency inter-node edge deleting step 2004, of the edges determined in the causation/transition structure defining step 2001, an edge located between nodes having D smaller than a predetermined threshold is deleted.
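The calibration, the combined dependency D, and the deletion of low-dependency edges can be sketched as follows. The parameter values, the threshold, and the node names in the usage example are illustrative assumptions only; the embodiment determines the parameters of f from examples that are not reproduced here.

```python
def calibrate(c, alpha, beta, gamma):
    """f(C) = alpha*C*C + beta*C + gamma, the calibration applied to C."""
    return alpha * c * c + beta * c + gamma


def combined_dependency(q, c, params):
    """D = max{Q, f(C)} for quantitative dependency q and co-occurrence dependency c."""
    alpha, beta, gamma = params
    return max(q, calibrate(c, alpha, beta, gamma))


def delete_low_dependency_edges(edges, dependency, threshold):
    """Keep only the edges whose dependency D is equal to or larger than the threshold."""
    return [edge for edge in edges if dependency[edge] >= threshold]


if __name__ == "__main__":
    params = (0.0, 1.0, 0.0)  # hypothetical parameters: f(C) = C
    edges = [("bmi", "insulin_next_year"), ("smoking", "insulin_next_year")]
    dependency = {
        edges[0]: combined_dependency(0.2, 0.5, params),
        edges[1]: combined_dependency(0.02, 0.05, params),
    }
    print(delete_low_dependency_edges(edges, dependency, threshold=0.1))
```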
Through the above-mentioned processing, the direction and the presence/absence of an edge between nodes belonging to different classifications are defined. In other words, for a node N1 and a node N2 belonging to different classifications, when an edge between the node N1 and the node N2 is defined in the causation/transition structure defining step 2001, and when the dependency between the node N1 and the node N2 is equal to or larger than a predetermined threshold, the edge defined in the causation/transition structure defining step 2001 is defined between the node N1 and the node N2. Otherwise, an edge is not defined between the node N1 and the node N2.
In Constraint structure learning step 2007, a final inter-node edge structure is determined. Three examples are described hereinbelow.
First, a first example is described. In the first example, only an edge between nodes that belong to different classifications defined by the processes up to Low-dependency inter-node edge deleting step 2004 and whose dependency is equal to or larger than the threshold is treated as an edge of a final disease causation/transition model. At this time, an edge between nodes that have the same period and belong to the same classification is not defined.
A second example is described. Between nodes that belong to different classifications, the method of the first example is used for definition, and the structure of an edge between nodes that belong to the same classification is learned using an existing structure learning method. The edge structure can be efficiently learned by limiting the edge structure that is constructed as a result of the learning to, for example, the structure of a Bayesian network. Accordingly, the presence or absence of an edge is defined between nodes belonging to the same classification, and the direction of the edge is further defined when the edge is directed. The structure of an edge between nodes belonging to different classifications has already been defined. The edge structure formed through the above-mentioned processing may not be a Bayesian network structure as a whole even when the structure between nodes belonging to the same classification is a Bayesian network structure.
A third example is described. The presence/absence and the direction of an edge between nodes that belong to different classifications are restricted based on edges that are defined by the processes up to Low-dependency inter-node edge deleting step 2004. In general, there are four ways of defining the relation between two nodes: a definition in which an edge is not present between the nodes, a definition in which there is an undirected edge, and two definitions in which there is a directed edge (one for each direction of the edge). By contrast, when an edge between nodes belonging to different classifications is defined through the processes up to Low-dependency inter-node edge deleting step 2004, the relation is restricted to two cases: the case where an edge is not present between the nodes, or the case where an edge in the same direction is present. When an edge is not defined through the processes up to Low-dependency inter-node edge deleting step 2004, a restriction that an edge is not present between the nodes is made. Under this restriction, the edge structure of all the nodes is learned using the existing structure learning method.
When the values of nodes need to be discretized using the existing structure learning method in the above-mentioned second and third examples, a method of performing discretization based on the ratio of the number of persons, which is described later in the description of a node generating unit 209 may be used.
The above completes the process of the causation/transition structure calculating unit 109. Through this process, a structure (edge) between nodes is determined.
To generate a probability table in the probability table calculating unit 111, the node generating unit 110 defines event spaces for nodes and consolidates the nodes. A node may have a continuous value such as a test value. Likewise, when the value of a medical action in healthcare cost information is the number of prescriptions, the estimation accuracy becomes low if the number of prescriptions is treated with too fine a granularity. Accordingly, discretization is preferably performed with an adequate granularity. When the individual numbers of prescriptions are treated separately, the number of cases for each number of prescriptions becomes small, which may undesirably reduce the accuracy of the probability table or may make generation of the probability table difficult.
The event space may be manually defined beforehand. For example, the weight may be expressed in divided ranks of 5 kilograms, and the event spaces for the corresponding nodes may be set to { . . . , 50 to 54, 55 to 59, . . . }. In this case, a collection of weight values from 50 kilograms to 54 kilograms is treated as a single event.
Further, another example of defining event space is described. The above-mentioned method requires defining of the event space for each node. For example, the height and the weight differ from each other in meaning and scale, and hence different divided levels are required to be defined. In the example to be described, the value is divided by the ratio of the number of persons. Therefore, event space can be defined by a uniform method that does not depend on nodes. Specifically, the scale is divided by every k %, and sub scales from the lower p % to (p+k) % are grouped into one. For example, provided that lower 5% of the weights of all the insured persons are put into a group of w1 kilograms or less, and lower 5% to 10% of the weights are put into a group of w1 kilograms to w2 kilograms, event spaces become {w1 or less, w1 to w2, . . . }. For division by 5%, the number of states becomes 20.
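The division by the ratio of the number of persons can be sketched as follows. This is a minimal illustration, assuming that the boundary of each group is taken as the value of the last person inside the lower (i×k) %; the function names and the sample weights are hypothetical.

```python
from bisect import bisect_left


def percentile_boundaries(values, k_percent):
    """Return the upper boundaries w1, w2, ... obtained by dividing every k percent."""
    ordered = sorted(values)
    n = len(ordered)
    n_states = round(100 / k_percent)
    boundaries = []
    for i in range(1, n_states):
        # Index of the last person inside the lower (i * k) percent (ceiling division).
        idx = max(0, -(-i * n // n_states) - 1)
        boundaries.append(ordered[idx])
    return boundaries


def assign_state(value, boundaries):
    """State 0 is 'w1 or less', state 1 is 'above w1 up to w2', and so on."""
    return bisect_left(boundaries, value)


if __name__ == "__main__":
    weights = list(range(40, 140))  # hypothetical weights of 100 insured persons
    bounds = percentile_boundaries(weights, 5)
    print(bounds[:3], assign_state(47, bounds))
```

Because the boundaries are derived from the data itself, the same function can be applied to height, weight, or any other ordered item without defining ranks per node.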
Nodes need not be consolidated. When event spaces are given by the above-mentioned method and nodes are not consolidated, the process proceeds to the process of the probability table calculating unit 111. When the nodes are not consolidated, the number of cases for calculating a conditional probability becomes 0 in some circumstances. In this case, a process of estimating the conditional probability is needed, and this process is described later.
Next, an example in which the event space of a node is defined, and nodes are consolidated is described.
First, definition of the event space of a node is described. The event space of a node defines the state (value) a probability variable takes, and is generated by discretizing value space of a corresponding item.
Next, a discretization method is described. According to the first embodiment, discretization of a node is performed using two references. The first reference serves to achieve discretization in such a way that a sufficient number of cases for each state of the node after discretization are acquired. A sufficient number of cases are acquired when the discretization is rough, and hence a statistically reliable probability table can be generated. When the discretization is too rough, however, the dependency of the probability distribution of a child node with respect to the state of this node cannot be expressed adequately. In this respect, the second reference serves to achieve discretization in such a way that the expression of the dependency of the conditional probability distribution of a child node with respect to the state of this node after discretization is not lost.
First, the necessity for discretization is described referring to an example 1701 illustrated in
Generation of a model requires generation of a probability table 1702 for X1 and a probability table 1703 for X2. For example, a22 in the probability table 1703 is the probability of X2=2 when X1=2, and estimating it requires a sufficient number of cases where X1=2 and X2=2. When the granularity of X1 is fine, the number of cases becomes small, and may even become 0 in some situations. An insufficient number of cases raises a problem in that the probability value cannot be estimated or the reliability of the probability value is reduced. It is therefore necessary to perform discretization to provide an adequate granularity. For X1=1 and X1=2, when the probability distribution P(X2|X1=1) is substantially equal to the probability distribution P(X2|X1=2), it is favorable to put the states X1=1 and X1=2 into a single state from the viewpoint of the number of cases and the amount of calculation.
First, a discretization method to obtain a sufficient number of cases for each state of the node after discretization is described.
Assume that X1 is a node of interest, and X2 is a child node thereof, which has already been adequately discretized. A case number 2401 illustrated in
First, the leftmost state of X1 is selected in Minimum value state selecting step 2501. Here, a case number 2402 represents the number of cases for each state of X2 when X1 is in a state expressed by a minimum value. The values in the case number 2402 become larger from left to right similarly to the values in the case number 2401. Likewise, a case number 2403 represents the number of cases for each state of X2 when X1 takes a state larger than the minimum value by “1”.
In the following description, the currently selected state is S. The initial state of S is a state expressed by the minimum value of X1.
In Step 2502, the number of cases for each state of X2 conditioned to make X1=S is compared with a predetermined threshold. When the number of cases is smaller than the predetermined threshold, it is determined that the number of cases is insufficient, and the state is combined with a next state on the right (Step 2503). When there is no next state on the right, the state may be combined with a next state on the left. As two left states of X1 are put together, the number of cases becomes case numbers 2404 and 2405 illustrated in
When the number of cases is sufficient, on the other hand, S is regarded as a completed state, and it is checked in Step 2504 whether there is an uncompleted state (next state on the right). When there is an uncompleted state, this state is set as S, and the process then returns to Step 2502. When there is not any uncompleted state, the process proceeds to Step 2505.
This process can carry out discretization in such a way that each state has a stable number of cases, as shown in a case number 2407 of
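The combination of states in Steps 2501 to 2504 can be sketched as follows. This is an illustration under an assumption: the number of cases is judged sufficient when every state of X2 has at least the threshold number of cases under the currently selected state of X1. The function name and the data layout (one list of per-X2-state counts for each ordered state of X1) are hypothetical.

```python
def merge_sparse_states(case_counts, threshold):
    """Combine ordered states of X1 from left to right until each has enough cases.

    case_counts[i][j] is the number of cases with X1 in state i and X2 in state j.
    Returns a list of (member_state_indices, combined_counts).
    """
    groups = []
    members, counts = [], None
    for idx, row in enumerate(case_counts):
        # Combine the current state with the next state on the right.
        members.append(idx)
        counts = list(row) if counts is None else [a + b for a, b in zip(counts, row)]
        if min(counts) >= threshold:
            groups.append((members, counts))  # the state is completed
            members, counts = [], None
    if members:
        # No state remains on the right: combine with the state on the left.
        if groups:
            left_members, left_counts = groups.pop()
            merged = [a + b for a, b in zip(left_counts, counts)]
            groups.append((left_members + members, merged))
        else:
            groups.append((members, counts))
    return groups


if __name__ == "__main__":
    print(merge_sparse_states([[1, 2], [3, 1], [5, 6], [2, 1]], threshold=3))
```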
Further, discretization is carried out so that the probability dependency of a child node with respect to a parent node is not lost. Specifically, a state “0” at the left end of the case number 2407 and an adjoining state “1” are selected (Step 2505), and when two probability distributions, P(X2|X1=0) and P(X2|X1=1), do not have a significant difference (NO in Step 2506), the state “0” and the state “1” are combined (Step 2507). This process is repeated until the probability distributions have a significant difference. Then, a next state in X1 on the right is targeted (YES in Step 2508), and states are combined by a similar method. Regarding a difference between P(X2|X1=0) and P(X2|X1=1), for example, when there is a state a of X2 that makes the difference between P(X2=a|X1=0) and P(X2=a|X1=1) equal to or larger than the predetermined threshold, it is determined that the probability distributions have a significant difference therebetween.
Specifically, two minimum states after combination, namely the left end state and an adjoining state in the example of the case number 2407, are selected in Step 2505. The selected states are set as S1 and S2, respectively. Next, whether P(X2|X1=S1) and P(X2|X1=S2) have a significant difference is determined in the above-mentioned manner in Step 2506. When the two probability distributions do not have a significant difference, the process proceeds to Step 2507. In Step 2507, the state S1 and the state S2 are combined, and the combined state is newly set as S1, after which the process proceeds to Step 2508. When the two probability distributions have a significant difference in Step 2506, the state S2 is newly set as S1, after which the process proceeds to Step 2508. In Step 2508, when there is a state next to S1 on the right, the state is set as S2, and the process returns to Step 2506. When there is not a state next to S1 on the right, the process is terminated.
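The loop over S1 and S2 in Steps 2505 to 2508 can be sketched as follows. This is an illustration under an assumption: two conditional distributions are judged to have a significant difference when, for some state of X2, the absolute difference between the two probabilities is equal to or larger than the threshold. Every state is assumed to have at least one case, which the preceding combination of sparse states is intended to ensure. The function names are hypothetical.

```python
def distribution(counts):
    """Convert case counts per state of X2 into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def significantly_different(counts_1, counts_2, threshold):
    """True when some state of X2 differs in probability by the threshold or more."""
    p1, p2 = distribution(counts_1), distribution(counts_2)
    return max(abs(a - b) for a, b in zip(p1, p2)) >= threshold


def merge_similar_states(case_counts, threshold):
    """Combine adjoining states of X1 whose distributions of X2 do not differ."""
    groups = []
    s1_members, s1_counts = [0], list(case_counts[0])
    for idx in range(1, len(case_counts)):
        s2_counts = case_counts[idx]
        if significantly_different(s1_counts, s2_counts, threshold):
            # Keep S1 as it is and continue with S2 as the new S1.
            groups.append((s1_members, s1_counts))
            s1_members, s1_counts = [idx], list(s2_counts)
        else:
            # Combine S1 and S2; the combined state becomes the new S1.
            s1_members = s1_members + [idx]
            s1_counts = [a + b for a, b in zip(s1_counts, s2_counts)]
    groups.append((s1_members, s1_counts))
    return groups


if __name__ == "__main__":
    print(merge_similar_states([[8, 2], [16, 4], [2, 8], [1, 9]], threshold=0.2))
```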
The above-mentioned process can perform discretization of X1 with the child node X2 discretized.
Accordingly, discretization is performed on the nodes in order starting with a leaf node which does not have a child node. When there is a node of total points indicating a medical cost, the total-point node becomes a leaf node. The total-point node is discretized beforehand so that the granularity needed for estimation is provided. When there is not a total-point node, a node relating to a medical action becomes a leaf node. This discretization method is predetermined. When distinction is made based on the presence or absence of a prescription, for example, discretization is performed based on two states of 0 and at least 1. When a finer granularity is needed, discretization is performed based on three states of, for example, 0, 1 to 5, and at least 6.
The above process achieves recursive discretization from a leaf node to a root node (node which does not have a parent) in order.
Next, the node generating unit 209 consolidates nodes.
As mentioned above, discretization is carried out paying attention to only a relation with child nodes. However, when a certain node has two or more parents as shown in a structure 2601 illustrated in
Consolidation of some nodes and combination of the states thereof are considered. First, in Step 2701, it is determined whether there are cases equal to or larger in number than a predetermined number for the combination of all the states of a parent node. When the number of cases is sufficient, this process is terminated.
When the number of cases is insufficient, in Maximum dependency pair consolidating step 2702, the dependency of parent nodes is calculated in the same way as used in Inter-node dependency calculating step 2002 to select a pair of nodes having a maximum dependency. It is considered that similar nodes having high dependencies give similar influences on child nodes. In this respect, two nodes having high dependencies are consolidated into a new node. When the numbers of states of the original two nodes are n1 and n2, the number of states of the new node is n1×n2, which is the number of combinations of the states of the two nodes. A structure 2602 shows a state where the node X2 and the node X3 are consolidated into the node X5 (
Next, the states of the consolidated nodes are combined in State combining step 2703. A process of combining the states of the consolidated nodes is described referring to
Similarly to
First, a state at the upper left end is selected in Upper-left-end state selecting step 2901. In other words, the state is a combination that permits both X2 and X3 to take minimum values. In the following, a selected state is expressed by S. Initially, S is the state at the upper left end. Next, in Step 2902, the number of cases of each state of X4 with a condition of X5=S is checked to determine whether there are a sufficient number of cases. When there are a sufficient number of cases for each state of X4, the process for this state is terminated, and proceeds to Step 2905. In Step 2905, the state 2801 is searched downward from the top row and from the left end to the right end in each row for any uncompleted state, and the first uncompleted state found is set as S, after which the process returns to Step 2902.
When the number of cases for each state of X4 is insufficient, on the other hand, an optimal adjoining state to be combined is selected in Optimal-adjoining-state selecting step 2903. The adjoining states are uncompleted states adjoining a currently selected state above, below, to the left, and to the right. Among those four adjoining states, a state that less influences the conditional probability distribution of the child node X4 when combined is an optimal state to be combined. When there is not any uncompleted state, an optimal state is selected from adjoining completed states, and is combined with the currently selected state. The combined state is newly set as S, after which the process returns to Step 2902.
The following describes an example of a method of calculating an influence on the conditional probability distribution of the child node X4 in Step 2903. Assume that the currently selected state is a and that b is one of the adjoining states. The influence of combining a and b is defined as I(b)=max|P(X4=s|X5=a)−P(X4=s|X5=b)|, where max is taken over all the states s of X4. Among the adjoining states b, the state which has the minimum value of I(b) is selected as the optimal adjoining state.
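The selection of the optimal adjoining state can be sketched as follows, reading I(b) as the largest absolute difference, over the states s of X4, between the conditional probabilities under the current state a and under an adjoining state b. This reading, the function names, and the representation of states of X5 as (row, column) positions are assumptions of this sketch.

```python
def influence(counts_a, counts_b):
    """I(b): largest difference in P(X4=s | X5) between state a and state b."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return max(abs(ca / total_a - cb / total_b) for ca, cb in zip(counts_a, counts_b))


def select_optimal_adjoining_state(current, adjoining, case_counts):
    """Choose the adjoining state whose combination changes the distribution least."""
    return min(adjoining, key=lambda b: influence(case_counts[current], case_counts[b]))


if __name__ == "__main__":
    # Hypothetical case counts of X4 (two states) for three states of X5.
    case_counts = {(0, 0): [5, 5], (0, 1): [9, 3], (1, 0): [4, 4]}
    print(select_optimal_adjoining_state((0, 0), [(0, 1), (1, 0)], case_counts))
```

In the usage example, the state below has the same distribution of X4 as the current state, so it is selected in preference to the state on the right.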
Through the above-mentioned process, the states are combined two-dimensionally. A state 2804 schematically shows how the two-dimensional combination is achieved. The state 2804 is illustrated with lines between combined states deleted.
This process is recursively repeated from a leaf node toward a root node to complete consolidation of the nodes. This overcomes the problem that the number of combinations of the states of parent nodes grows so large that the number of cases per combination is reduced, which otherwise makes estimation difficult or reduces the estimation accuracy.
Next, the node generating unit 110 generates a causation/transition structure after consolidation of the nodes. The node generating unit 110 deletes nodes to be consolidated, and inserts the consolidated node which is newly generated by the consolidation. At this time, all the parent nodes of the nodes to be consolidated are set as parent nodes of the consolidated node. For example, as illustrated in
Finally, the node generating unit 110 stores information on this structure in a causation/transition model storage unit 119, and information on the consolidation of the nodes and the combination of the states in a node information storage unit 120.
The probability table calculating unit 111 generates a conditional probability table with the structure that is generated by the node generating unit 110 and is stored in the causation/transition model storage unit 119. This is the calculation of P(X|X1, X2, . . . , Xn) for each state of X, X1, . . . , Xn, where X1, . . . , Xn are the parent nodes of each node X.
The process is described with the structure 3103 illustrated in
For example, X6 represents the blood sugar level in year X, X7 represents the presence or absence of a prescription for an oral antidiabetic in year X, and X5 represents the presence or absence of a prescription for an insulin preparation in year X+n; “1” represents the presence of a prescription. Assume that there are p insured persons who are given a prescription for an oral antidiabetic in year X and whose blood sugar levels are in a state S, and there are q insured persons, among the p insured persons, who are given prescriptions for an insulin preparation n years later. At this time, P(X5=1|X6=S, X7=1)=q/p is satisfied.
When there are no cases and thus the conditional probability cannot be calculated, a uniform distribution, for example, may substitute for the conditional probability. When p=0 is satisfied in the above-mentioned example, P(X5|X6=S, X7=1) cannot be calculated. Accordingly, the distribution of X5 is assumed to be uniform, and when X5 takes two values as in the above-mentioned example, P(X5=1|X6=S, X7=1)=½ and P(X5=0|X6=S, X7=1)=½ are set.
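The calculation of q/p with the uniform distribution as a substitute can be sketched as follows. This is a minimal illustration; the record layout (one dictionary per insured person), the state labels, and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def build_conditional_probability(records, child, parents, child_states):
    """Return a function prob(child_value, parent_values) estimated from case counts."""
    counts = defaultdict(Counter)
    for record in records:
        key = tuple(record[p] for p in parents)
        counts[key][record[child]] += 1

    def prob(child_value, parent_values):
        case_counts = counts.get(tuple(parent_values))
        if not case_counts:
            # No cases (p = 0): substitute a uniform distribution.
            return 1 / len(child_states)
        total = sum(case_counts.values())     # p
        return case_counts[child_value] / total  # q / p

    return prob


if __name__ == "__main__":
    records = [{"X6": "high", "X7": 1, "X5": 1}] * 3 + [{"X6": "high", "X7": 1, "X5": 0}]
    prob = build_conditional_probability(records, "X5", ["X6", "X7"], [0, 1])
    print(prob(1, ("high", 1)), prob(1, ("low", 1)))
```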
The probability table calculating unit 111 calculates this probability for all the nodes, and stores the generated probability table in the causation/transition model storage unit 119.
The above is the description of the process of the disease causation/transition model generating unit 108.
Next, the disease onset probability/medical cost estimating unit 112 is described. The disease onset probability/medical cost estimating unit 112 includes a model reconstructing unit 113, a disease state transition probability/medical cost estimating unit 114, and a healthcare guidance supporting unit 115.
The model reconstructing unit 113 reconstructs a model matching the user's intention from a causation/transition model stored in the causation/transition model storage unit 119 in response to a request from the healthcare guidance supporting unit 115. The reconstructed model is stored in a reconstructed model storage unit 121. The disease state transition probability/medical cost estimating unit 114 estimates disease onset probability and a medical cost by using the reconstructed model generated by the model reconstructing unit 113. The result of the estimation is stored in an estimation result storage unit 122.
First, the process of the model reconstructing unit 113 is described.
The model generated by the disease causation/transition model generating unit 108 is a large-scale model where a large number of nodes are related to one another. However, users are often interested in a part of this model. Therefore, the model reconstructing unit 113 provides a function of reconstructing only a model related to nodes necessary for a user. This not only can reduce the amount of calculation but also provides a model which is easy for a user to handle.
When a model is constructed from the beginning in response to a request from the user, a large amount of calculation is needed. However, the calculation cost for the reconstructing process is small. In this respect, information obtained from a huge amount of data can be used efficiently and effectively by a two-level configuration including the disease causation/transition model generating unit that generates an exquisite model, and the model reconstructing unit that reconstructs a compact model matching the purpose as employed in the first embodiment. When the system is configured differently by different apparatus as illustrated in
The model reconstructing unit 113 reconstructs a model matching the user's intention in response to a request from the healthcare guidance supporting unit 115. In other words, when provided with a list of nodes to be included in a reconstructed model, the model reconstructing unit 113 constructs a model relating to the nodes. The node list includes nodes before consolidation. In other words, the nodes correspond to items in formatted information. For example, when a pathologic cause and effect and disease state transition which are related to diabetes are of interest, items, test values, and medical inquiry results that relate to associated medical actions are set up into a node list.
First, a description is given of the process of the model reconstructing unit 113 when the node generating unit 110 does not consolidate nodes, and a graphical model generated by the disease causation/transition model generating unit 108 is a directed graph.
As a list of nodes, N1, N2, . . . , Nk are selected. First, an edge structure is set in such a way that in the model generated by the disease causation/transition model generating unit 108, when there is a route moving through an edge directed from Ni to Nj, an edge directed from Ni to Nj is set, and when there is a route moving through an edge directed from Nj to Ni, an edge directed from Nj to Ni is set. Otherwise, an edge is not set. Next, a conditional probability that is defined by the edge structure is acquired by marginalization of nodes that are not included in the list.
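The edge-setting rule above can be read as a reachability test on the original directed graph. The following is a minimal sketch under that reading; the function names and the example graph are hypothetical, and the sketch does not perform the marginalization that yields the conditional probabilities. It also sets an edge for every route, including routes that pass through other selected nodes, which is one possible interpretation of the rule.

```python
def reachable(edges, start):
    """Return every node reachable from start by following directed edges."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reconstruct_edges(edges, node_list):
    """Set an edge Ni -> Nj whenever the original model has a route from Ni to Nj."""
    kept = []
    for ni in node_list:
        descendants = reachable(edges, ni)
        for nj in node_list:
            if nj != ni and nj in descendants:
                kept.append((ni, nj))
    return kept

# Hypothetical original model: N1 -> N2 -> N3 -> N5 and N4 -> N3.
original_edges = [("N1", "N2"), ("N2", "N3"), ("N4", "N3"), ("N3", "N5")]
reduced_edges = reconstruct_edges(original_edges, ["N1", "N3", "N4"])
```

In this example N2 is not in the node list, so the route N1 → N2 → N3 is replaced by a direct edge from N1 to N3, while N1 and N4 remain unconnected because no route joins them.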
For example, a model illustrated in
The process executed by the model reconstructing unit 113 when the node generating unit 110 consolidates nodes is described referring to an example illustrated in
Next, the model reconstructing unit 113 reconstructs a model including only X5, X4, and X8. At this time, when there is a route connecting the nodes selected as the nodes of the reconstructed model, a directed link is formed between the nodes even in the reconstructed model. A structure 3302 becomes a structure 3303.
Next, the conditional probability is calculated to complete the reconstructed model. An example of calculating P(X4|X5) is described instead of the description of the process. P(X4|X5) can be calculated by ΣP(X4|X1=s, X5), where Σ is the sum of all the states of X1. In other cases, the conditional probability can also be acquired from the model stored in the causation/transition model storage unit 119.
The above-mentioned process reconstructs a model which is constructed from the nodes selected as a list of nodes when there is no consolidation of nodes, and from the nodes after consolidation when nodes are consolidated. The definition of an edge and the calculation of the conditional probability that follow the reconstruction of the model are the same as those when there is no node consolidation.
When all the nodes are specified as a list of nodes, the model reconstructing unit 113 need not perform model reconstruction, and hence the model generated by the disease causation/transition model generating unit 108 is used. Further, the model generated by the disease causation/transition model generating unit 108 may be used for the model that is used by the disease state transition probability/medical cost estimating unit 114 in the estimation, and the model reconstructing unit 113 may form only a network chart to be displayed on a display apparatus in the healthcare guidance supporting unit 115 into a reconstructed model. The network chart and the probability table in this case are based on the above-mentioned reconstructed model.
The disease state transition probability/medical cost estimating unit 114 estimates the disease onset probability of a disease and the medical cost thereof by using the model reconstructed by the model reconstructing unit 113, or the model generated by the disease causation/transition model generating unit 108 and stored in the causation/transition model storage unit 119.
This process is described using the structure 3302. In case of acquiring the probability of X5=s (e.g., X5 is an item relating to the number of prescriptions for insulin next year), the probability indicates a probability that the number of prescriptions for insulin becomes a number specified by s. The joint distribution of X1, X4, X5, X6, X7, and X8 is given by the following expression.
P(X1,X4,X5,X6,X7,X8)=P(X1)P(X6)P(X8)P(X7|X8)P(X5|X6,X7)P(X4|X1,X5)
P(X5=s) is given by the following expression, where Σ is the sum of the states of all the probability variables except X5.
P(X5=s)=ΣP(X1,X4,X5,X6,X7,X8)
This calculation can be executed by using the probability table that is generated by the probability table calculating unit 111 and is stored in the causation/transition model storage unit 119. When there is an observed probability variable other than X5 (e.g., when X1=t), the probability P(X5=s) is given by the following expression, where Σ is the sum of the states of all the probability variables except the observation node X1 and the node X5 to be estimated.
P(X5=s)=ΣP(X1=t,X4,X5,X6,X7,X8)
This is equivalent to, for example, a case of estimating the next year's medical actions and medical cost while the state of an obtained test value of this year's health checkup as a node is fixed.
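The two sums above can be evaluated by brute-force enumeration over the factorized joint distribution. The sketch below assumes that every variable is binary and uses made-up probability tables; all numerical values and names are hypothetical and serve only to show the mechanics. Note that the sum with X1=t fixed is the joint probability of X5=s and X1=t; dividing it by the same sum taken over all s, a standard normalization step that the text does not spell out, would give the conditional probability.

```python
from itertools import product

# Hypothetical probability tables for binary variables (states 0 and 1).
P_X1 = {0: 0.7, 1: 0.3}
P_X6 = {0: 0.6, 1: 0.4}
P_X8 = {0: 0.5, 1: 0.5}
P_X7_GIVEN_X8 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # [x8][x7]
P_X5_GIVEN_X6_X7 = {                                          # [(x6, x7)][x5]
    (0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.7, 1: 0.3},
    (1, 0): {0: 0.6, 1: 0.4},   (1, 1): {0: 0.2, 1: 0.8},
}
P_X4_GIVEN_X1_X5 = {                                          # [(x1, x5)][x4]
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.6, 1: 0.4}, (1, 1): {0: 0.1, 1: 0.9},
}

def joint(x1, x4, x5, x6, x7, x8):
    """P(X1)P(X6)P(X8)P(X7|X8)P(X5|X6,X7)P(X4|X1,X5)."""
    return (P_X1[x1] * P_X6[x6] * P_X8[x8]
            * P_X7_GIVEN_X8[x8][x7]
            * P_X5_GIVEN_X6_X7[(x6, x7)][x5]
            * P_X4_GIVEN_X1_X5[(x1, x5)][x4])

def sum_for_x5(s, x1_observed=None):
    """Sum the joint over all variables except X5 (and X1, if it is observed)."""
    x1_states = [0, 1] if x1_observed is None else [x1_observed]
    total = 0.0
    for x1 in x1_states:
        for x4, x6, x7, x8 in product([0, 1], repeat=4):
            total += joint(x1, x4, s, x6, x7, x8)
    return total

p_x5_is_1 = sum_for_x5(1)                      # no observation
joint_with_x1 = sum_for_x5(1, x1_observed=1)   # X1=1 observed
conditional = joint_with_x1 / (sum_for_x5(0, 1) + sum_for_x5(1, 1))
```

With these made-up tables, `p_x5_is_1` evaluates to about 0.2985. Because X1 has no route to X5 in this structure other than through their common child X4, observing X1 alone leaves the normalized value unchanged.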
The above-mentioned processing can ensure estimation of the states of nodes equivalent to the next year's medical action and medical cost with this year's information obtained. When P(X) is acquired with the medical cost node being denoted by X, estimated probability values are acquired for the individual points of the medical cost. The next year's medical cost can be estimated as the expected value.
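The expected-value step can be illustrated with a short sketch. It assumes that each state of the medical cost node is represented by one cost figure; the function name and the sample distribution are hypothetical.

```python
def expected_cost(distribution):
    """Return the expected medical cost.

    distribution maps the representative cost of each state of the
    medical cost node to the estimated probability of that state.
    """
    total_probability = sum(distribution.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(cost * prob for cost, prob in distribution.items())

# Hypothetical estimate P(X) for next year's medical cost.
cost_distribution = {0: 0.5, 10000: 0.3, 50000: 0.2}
next_year_cost = expected_cost(cost_distribution)  # 0*0.5 + 10000*0.3 + 50000*0.2
```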
The above-mentioned expression calculates the sum of all the states, and hence it takes a considerable time for the calculation. Algorithms that efficiently acquire the sum have been proposed. Examples of the algorithms include the message passing algorithm and the junction tree algorithm. The disease state transition probability/medical cost estimating unit 114 may use those algorithms.
The healthcare guidance supporting unit 115 provides a function of supporting healthcare guidance for preventing future disease onset. The following describes two functions including a support function of supporting a health insurance business operator in preparing a healthcare guidance plan, and a function of supporting a person responsible for healthcare guidance or a subject person.
First, the support function of supporting a health insurance business operator in preparing a healthcare guidance plan is described. A health insurance business operator wants to select subject persons who get a high prevention effect brought by healthcare guidance by priority within the budget of healthcare guidance, and to perform guidance suitable for each subject person. There are a plurality of healthcare guidance services (healthcare guidance service 1, healthcare guidance service 2, etc.) that can be provided by a health insurance business operator. For example, the healthcare guidance service 1 is a guidance to mainly reduce the BMI value, and the healthcare guidance service 2 is a guidance to drop the cholesterol value.
The process for the support function for a health insurance business operator is described.
First, in Subject disease setting step 3401, a target disease for the process is set. When the three major lifestyle-related diseases, namely, diabetes, lipid abnormality, and hypertension, are targets, for example, the model reconstructing unit 113 reconstructs a model by using items in medical actions, items in health checkup, and medical inquiry items corresponding to diabetes, lipid abnormality, and hypertension among items in healthcare cost formatted information. When all diseases are targets, the model generated by the disease causation/transition model generating unit 108 and stored in the causation/transition model storage unit 119 is used.
In next Healthcare guidance service setting step 3402, the types of healthcare guidance services and assumed effects of the individual healthcare guidance services are set. For example, the assumed effects of the healthcare guidance service 1 include a weight reduction of 5 kilograms.
In next Healthcare-guidance effect estimating step 3403, the effect of reducing the medical cost is estimated for all combinations of the healthcare guidance services and the subject candidates for the healthcare guidance. First, a description is given of how to calculate the effect of reducing the medical cost for the combination of the healthcare guidance service 1 and a healthcare-guidance subject candidate 1.
First, the next year's medical cost for the healthcare-guidance subject candidate 1 when the healthcare guidance service is not provided is estimated. Based on the healthcare cost, and the values of the health checkup and medical inquiry for the healthcare-guidance subject candidate 1 this year, the states of nodes corresponding to the items of this year are set, and the disease state transition probability/medical cost estimating unit 114 estimates the medical cost (C1). Next, values of test values that are improved by the healthcare guidance service are set in this year's values of the healthcare-guidance subject candidate 1, and the disease state transition probability/medical cost estimating unit 114 estimates the next year's medical cost (C2). C1 is the estimated medical cost when healthcare guidance is not provided and C2 is the estimated medical cost when healthcare guidance is provided, and hence, given that C3 is a cost needed for the healthcare guidance, the cost effectiveness for reducing the medical cost can be calculated from E=C1−C2−C3. This process is performed for all the combinations of the healthcare guidance services and the healthcare-guidance subject candidates to calculate the cost effectiveness E for reducing the medical cost.
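The calculation E=C1−C2−C3 can be sketched as below. The estimator passed in stands in for the disease state transition probability/medical cost estimating unit 114; the toy estimator, the item name `bmi`, and all figures are hypothetical and are used only to make the arithmetic concrete.

```python
def cost_effectiveness(estimate_cost, this_year, improved_values, guidance_cost):
    """Return E = C1 - C2 - C3 for one candidate and one guidance service.

    estimate_cost   : callable mapping node states to next year's medical cost
    this_year       : this year's item values of the candidate
    improved_values : item values expected after the guidance service
    guidance_cost   : cost C3 of providing the guidance service
    """
    c1 = estimate_cost(this_year)                          # without guidance
    c2 = estimate_cost({**this_year, **improved_values})   # with guidance
    return c1 - c2 - guidance_cost

def toy_estimator(state):
    """Hypothetical stand-in for the model-based estimation; not the actual model."""
    return 2000 * state["bmi"]

effect = cost_effectiveness(
    toy_estimator,
    this_year={"bmi": 30},
    improved_values={"bmi": 28},
    guidance_cost=3000,
)
```

Here C1 is 60000, C2 is 56000, and C3 is 3000, so E is 1000. A negative E would indicate that the guidance costs more than the estimated reduction in medical cost.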
Next, in Healthcare guidance contents designing step 3404, the combination that provides the greatest cost effectiveness for reducing the medical cost is selected from all the combinations of the healthcare guidance services and the healthcare-guidance subject candidates, and the healthcare-guidance subject candidate in that combination is treated as selected. Then, the combination that provides the greatest cost effectiveness is selected from the combinations of the healthcare guidance services and the still unselected healthcare-guidance subject candidates, and that candidate is likewise treated as selected. By repeating this, the combinations of the healthcare guidance services and the healthcare-guidance subject candidates can be selected in descending order of the effect. Finally, the combinations that provide a large effect are selected within the range of the budget of healthcare guidance to set the healthcare-guidance subject persons and the contents of healthcare guidance therefor.
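The selection in descending order of the effect can be sketched as a greedy procedure. Repeatedly picking the best combination among unselected candidates is equivalent to taking each candidate's best service and sorting the candidates by that value. The sketch assumes that the budget constrains the sum of the guidance costs and that selection stops at the first candidate that no longer fits; both are assumptions, and the names and figures are hypothetical.

```python
def select_guidance(effects, costs, budget):
    """Pick (candidate, service) pairs in descending order of E within the budget.

    effects : {(candidate, service): cost effectiveness E}
    costs   : {service: guidance cost C3}
    budget  : total budget for healthcare guidance
    """
    # Best service for each candidate.
    best = {}
    for (candidate, service), e in effects.items():
        if candidate not in best or e > best[candidate][1]:
            best[candidate] = (service, e)

    # Candidates in descending order of their best cost effectiveness.
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)

    plan, spent = [], 0
    for candidate, (service, e) in ranked:
        if spent + costs[service] > budget:
            break  # assumption: stop once the next candidate exceeds the budget
        plan.append((candidate, service, e))
        spent += costs[service]
    return plan

# Hypothetical candidates A, B, C and services s1, s2.
effects = {
    ("A", "s1"): 5000, ("A", "s2"): 2000,
    ("B", "s1"): 1000, ("B", "s2"): 4000,
    ("C", "s1"): 3000,
}
costs = {"s1": 3000, "s2": 2000}
plan = select_guidance(effects, costs, budget=6000)
```

In this example candidates A and B are selected with services s1 and s2, using 5000 of the budget, and candidate C is left out because adding s1 would exceed 6000.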
In Effect estimating step 3405, the values of the cost effectiveness for reducing the medical cost of the combinations selected in Healthcare guidance contents designing step 3404 are summed up, and a value obtained by subtracting the healthcare guidance cost from the effect of reducing the medical cost is output as an effect.
Next, the process for the support function for a person responsible for healthcare guidance and a subject person is described.
First, in Subject disease setting step 3401, a target disease for the process is set. When the three major lifestyle-related diseases, namely, diabetes, lipid abnormality, and hypertension, are targets, for example, the model reconstructing unit 113 reconstructs a model by using items in medical actions, items in health checkup, and medical inquiry items corresponding to diabetes, lipid abnormality, and hypertension among items in healthcare cost information. When all diseases are targets, the model generated by the disease causation/transition model generating unit 108 and stored in the causation/transition model storage unit 119 is used.
Another example of the process of Subject disease setting step 3401 is described. A subject person or a person responsible for healthcare guidance selects a disease to be treated. In other words, an item corresponding to a certain medical action is selected. Then, the dependencies of this item with respect to all of the other items are calculated by methods similar to those of Steps 2002 and 2003. Then, those items which each have at least a certain level of dependency to the selected item are extracted, and the model reconstructed by the model reconstructing unit 113 based on a list of the selected item and the extracted items is used.
In Disease onset probability calculating step 3406, with the states of all the nodes being not set, the disease state transition probability/medical cost estimating unit 114 estimates the next year's disease state transition probability and medical cost of each of the diseases. The disease state transition probability of each disease can be acquired as a probability that the next year's number of prescriptions for nodes relating to the medical action corresponding to this disease is at least one. This can be considered as the average disease probability of the disease. Then, based on the healthcare cost, and the values of the health checkup and medical inquiry for the subject person of this year, the states of nodes corresponding to the items of this year are set, and the disease state transition probability/medical cost estimating unit 114 estimates the next year's disease state transition probability and medical cost for each disease. The disease probability of each disease at this point of time is the disease probability of the disease of the subject person. Accordingly, for each disease, the disease probability of the disease of the subject person is divided by the average disease probability of the disease to calculate how many times higher than the average the subject person's risk of developing the disease is.
In High-risk disease presenting step 3407, a disease whose risk of developing disease is higher than the average disease probability by at least a predetermined threshold, and the risk of developing disease are presented. Accordingly, the subject person or the person responsible for healthcare guidance can know the risk of developing disease of the subject person.
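The risk ratio and the threshold test of the two steps above can be sketched as follows. The sketch treats the threshold as a minimum ratio to the average disease probability, which is one possible reading of the text; the function name and all probabilities are hypothetical.

```python
def high_risk_diseases(subject_prob, average_prob, threshold):
    """Return {disease: risk ratio} for diseases whose ratio meets the threshold.

    subject_prob : disease probability of the subject person per disease
    average_prob : average disease probability per disease
    threshold    : minimum ratio (subject / average) to be presented
    """
    results = {}
    for disease, p_subject in subject_prob.items():
        p_average = average_prob[disease]
        if p_average == 0:
            continue  # ratio undefined; skipped in this sketch
        ratio = p_subject / p_average
        if ratio >= threshold:
            results[disease] = ratio
    return results

# Hypothetical estimates for one subject person.
subject = {"diabetes": 0.3, "hypertension": 0.1}
average = {"diabetes": 0.1, "hypertension": 0.1}
presented = high_risk_diseases(subject, average, threshold=2.0)
```

Here only diabetes is presented, with a risk of about three times the average.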
In Item-to-be-improved presenting step 3408, a test value that has at least a certain level of dependency to the medical action node corresponding to the high-risk disease calculated in High-risk disease presenting step 3407 is presented. The dependency is calculated by methods similar to those of Steps 2002 and 2003 of
Next, in Step 3409 of allowing the user to input a target value (Target value user inputting step 3409), the user is prompted to input an improving target value (e.g., the target value of weight) for a test item presented in Item-to-be-improved presenting step 3408.
Finally, in Effect estimating step 3410, the test item input in Target value user inputting step 3409 is updated with the target value, the disease probability of the disease after the target is achieved is estimated by a method similar to the one used for Step 3406, and a change in the risk of developing disease is presented. Viewing a change in the risk of developing disease, the user can set the improving target, and make good use of the change in self-management.
The healthcare guidance supporting unit 115 may display a model to be used in analysis as a network chart. The healthcare guidance supporting unit 115 may also display the risk of developing disease in the vicinity of an edge. Accordingly, the user can easily grasp how the state of the disease changes and the factor that affects the change. This feature is effective at the time of preparing the contents of healthcare guidance and setting the target to be improved by the healthcare guidance.
According to the configuration of the first embodiment, the disease causation/transition model generating unit 108 constructs a graphical model constructed by nodes based on items in healthcare cost information, health checkup information, and medical inquiry information. Then, the model reconstructing unit 113 reconstructs a graphical model of an adequate scale matching the purpose. This configuration can ensure estimation using a compact model, and fast estimation. It is not necessary to handle a large-scale model including nodes which do not match the purpose, helping the user understand the structure of a model. This improves the readability and facilitates an analysis.
There is another approach to prepare a model from medical data purpose by purpose. However, this approach cannot cope with various purposes unless medical data is always held. This approach thus still has a problem from the viewpoint of concealment of personal information. When medical data is not held, a model is generated for each purpose based on an application previously assumed, and hence this approach can cope with only a specific purpose such as a specific disease. Further, the generation of a model from medical data involves a huge amount of calculation compared with reconstructing of a model, and is disadvantageous from the viewpoint of the amount of calculation. The apparatus of the configuration of the first embodiment can be separated as illustrated in
The causation/transition structure calculating unit 109 restricts the direction of an edge between nodes representing a medical cost, a medical action, a test value, and a lifestyle habit. This signifies that a lifestyle habit influences a test value, a test value influences a medical action, a medical action influences a medical cost, and those past states influence the states in future. Such restriction to the direction of an edge between nodes can reduce the amount of calculation for learning a structure, and can provide a model which is intuitively easy to understand.
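A restriction of this kind can be expressed as a predicate that a structure-learning procedure consults before considering an edge. The sketch below is one possible encoding, not the rule used by the causation/transition structure calculating unit 109: it assumes that, within a year, edges may only point downstream in the order lifestyle habit, test value, medical action, medical cost, and that across years edges may only point from the past to the future without moving upstream. The layer names and the `(kind, year)` node representation are hypothetical.

```python
# Assumed ordering of node kinds, from cause to effect.
LAYER = {"lifestyle": 0, "test_value": 1, "medical_action": 2, "medical_cost": 3}

def edge_allowed(src, dst):
    """Return True if an edge src -> dst respects the assumed direction rule.

    src and dst are (kind, year) pairs.
    """
    src_kind, src_year = src
    dst_kind, dst_year = dst
    if dst_year < src_year:
        return False  # a future state never influences a past state
    if dst_year == src_year:
        # Within a year, only downstream edges (e.g. test value -> medical action).
        return LAYER[src_kind] < LAYER[dst_kind]
    # Across years, the same kind or a downstream kind may be influenced.
    return LAYER[src_kind] <= LAYER[dst_kind]

candidate_edges = [
    (("lifestyle", 0), ("test_value", 0)),
    (("test_value", 0), ("lifestyle", 0)),
    (("medical_cost", 0), ("medical_cost", 1)),
    (("medical_cost", 1), ("medical_action", 0)),
]
permitted = [edge for edge in candidate_edges if edge_allowed(*edge)]
```

Pruning candidate edges in this way shrinks the search space that structure learning has to explore, which is the reduction in calculation described above.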
The node generating unit 110 consolidates nodes to define event space from two viewpoints: to secure the number of cases at the time of generating a conditional probability table and to maintain the dependency of the probability distribution of a child node on a parent node. Accordingly, a statistically reliable conditional probability table can be generated, and the estimation accuracy can be increased. Further, the event space of a node (probability variable) can be made small, which is advantageous from the viewpoint of the amount of calculation.
The healthcare guidance supporting unit 115 estimates a future state of a disease and a future medical cost by using a reconstructed model. The model of the first embodiment considers various factors, and hence it is possible to achieve accurate estimation. In addition, with the presence of healthcare cost information, it is possible to cope with any target disease. Further, an intervention effect originating from healthcare guidance can be estimated by performing estimation with the current test value of an insured person replaced with a value representing an expected improvement originating from healthcare guidance.
Moreover, displaying a model which is used in those analyses in the form of a network chart enables a user to grasp the influence originating from a change in state of a disease, and is effective in preparing the contents of healthcare guidance and setting a target to be improved by the healthcare guidance. This model is a reconstructed model, and is a chart formed by nodes to which a user pays attention, and hence the user is likely to show an interest in the model.
According to the first embodiment, as described above, a future disease probability and a future medical cost can be estimated accurately based on medical data such as healthcare cost information, health checkup information, and medical inquiry information. Factors effective for estimation can be automatically selected based on data, thus ensuring estimation in view of multiple factors. Further, diseases included in healthcare cost information can be analyzed, and hence a healthcare-guidance subject and the contents of healthcare guidance which show high cost effectiveness can be selected for various diseases.
In addition, the analysis system configured to include the model generating function (causation/transition structure calculating unit 109) and the model reconstructing function (model reconstructing unit 113) can achieve fast estimation for various diseases with high concealment of personal information.
Specifically, with the analysis system configured to include the model generating function and the model reconstructing function, the model generating function generates an exquisite and large-scale model designed for all diseases (all items in the healthcare cost record, and health checkup items), and the model reconstructing function reconstructs a compact model matching the purpose. For example, the model generating function alone increases the scale of a model and increases the amount of calculation for estimation, and hence the model is difficult to use. Moreover, when only a specific disease is to be analyzed, a model including an irrelevant disease as well is difficult to use. A model designed for each purpose (e.g., diabetes, lipid abnormality, or hypertension) may be generated as another approach, but this approach requires a significant amount of calculation in order to construct a model, undesirably making it necessary to hold original data (healthcare cost information, and health checkup information).
In the first embodiment, the model generating function generates an exquisite and large-scale model designed for all diseases, and a model matching the purpose is reconstructed from the generated model. The amount of calculation for reconstructing a model is not huge, and hence a model can be reconstructed easily. Further, the reconstructed model is compact, and hence the calculation cost for estimation is small. As long as the model generated by the model generating function is held, original data is unnecessary, and hence confidential information (personal information) need not be held at the time of performing estimation. This ensures effective and efficient use of a large amount of data.
With items in the healthcare cost record and health checkup items serving as nodes, the nodes are generated from a graphical model including the states of the nodes as the values of the items and the probability dependencies between the nodes serving as edges. Accordingly, the state of a child node depends on the state of a parent node, and can be given by the conditional probability of the parent node.
The edges of the graphical model are characterized by the transition and a cause and effect. For example, the current lifestyle habit and the current test value have a causal relation therebetween, the current test value and the current clinical action have a causal relation therebetween, the current clinical action and a future clinical action have a transitional relation therebetween, and a future clinical action and a future medical cost have a transitional relation therebetween. Further, the current lifestyle habit and the current test value have a causal relation therebetween, the current test value and a future test value have a transitional relation therebetween, a future test value and a future clinical action have a causal relation therebetween, and a future clinical action and a future medical cost have a transitional relation therebetween. Further, the current medical cost and a future medical cost have a transitional relation therebetween.
When the above-mentioned model generating function generates a large-scale model, the increase in the scale of the model may leave an insufficient number of cases for defining the conditional probability of each node given its parent nodes. When the number of parent nodes is large, the probability distribution of the states of a child node is given for each combination of the states of the parent nodes, requiring a sufficient number of cases for all the combinations of the states of the parent nodes. Accordingly, it is preferred that the resolution of the states of a parent node and the number of parent nodes be small. When the resolution of the states of a parent node and the number of parent nodes are small, however, the accuracy of a generated model drops. Therefore, the node generating unit 110 performs consolidation and discretization of parent nodes in such a way as to reduce the influence on the probability distribution of child nodes and provide a sufficient number of cases. This process is carried out in order from a leaf node toward a root node.
In addition, the model generating function generates a model for each of items that are always distinguished, i.e., for each age range and each sex of insured persons, making it possible to construct a highly usable model.
The healthcare guidance supporting unit selects a list of all or some of the probability variables of diabetes, hypertension, and lipid abnormality, and hence lifestyle-related diseases which are major causes to increase medical costs can be analyzed.
In a second embodiment of this invention, a graphical model is constructed based on tabular information formed from items and data entries. The following describes an example of an analysis system that estimates an unknown value of newly acquired data based on the constructed model.
The analysis system according to the second embodiment includes a data analysis apparatus 201 and a database 214.
The data analysis apparatus 201 includes an input unit 202, an output unit 203, a processing device 204, a memory 205, and a storage medium 206. The configurations and functions of those components are identical to those of the input unit 102, the output unit 103, the processing device 104, the memory 105, and the storage medium 106 of the first embodiment, respectively.
First, data that is handled in the second embodiment is described. The data that is handled in the second embodiment is tabular data 3701 shown in
In the second embodiment, a graphical model having items X1, X2, so forth as nodes (probability variables) is constructed. Nodes are hereinafter expressed by Xi representing item names. Each row corresponds to an insured person of the first embodiment, and respective items correspond to items in healthcare cost information, health checkup information, and medical inquiry information of the first embodiment.
A graphical model generating unit 207 constructs a graphical model having the items X1, X2, so forth as nodes.
A graphical model structure calculating unit 208 defines an edge between items. With prior knowledge given, the presence/absence of a node, and the type of a node may be restricted. Assuming that a structure is a Bayesian network, then there is an efficient algorithm that learns edge structures. At this time, the dependency between items may be calculated by a method similar to the one used by the causation/transition structure calculating unit 109, and when the dependency is equal to or less than a threshold, the structure may be learned with a restriction made to indicate absence of an edge. The generated edge structure is stored in a graphical model storage unit 216.
The node generating unit 209 performs a process similar to that of the node generating unit 110 according to the first embodiment. The generated node information is stored in a node information storage unit 217.
A probability table calculating unit 210 performs a process similar to that of the probability table calculating unit 111 according to the first embodiment. The generated probability table is stored in a graphical model storage unit 216.
An estimating unit 211 estimates an unknown value included in a new data entry made. When data 3702 shown in
A simple graphical model reconstructing unit 212 reconstructs a model constructed from a list of specified nodes. The simple graphical model reconstructing unit 212 performs a process similar to that of the model reconstructing unit 113 according to the first embodiment. The reconstructed model is stored in a reconstructed model storage unit 218.
A probability inferring unit 213 specifies a list of nodes needed depending on the purpose for the simple graphical model reconstructing unit 212 to reconstruct a model. Further, the probability inferring unit 213 estimates unknown values in data input from the input unit 202 by using the model reconstructed by the simple graphical model reconstructing unit 212. The results of the estimation are stored in an estimation result storage unit 219.
The analysis system according to the second embodiment may be a computer system including a single computer, or a computer system including a server and client terminals. The graphical model generating unit 207 and the estimating unit 211 of the data analysis apparatus 201 may be configured as separate apparatuses.
The analysis system is a computer system configured on a single computer, or a plurality of computers logically or physically constructed, and may operate on separate threads on the same computer or may operate on virtual computers configured on a plurality of physical computer resources.
Each server is provided with a program that is executed by the processing device 204 through a removable medium (CD-ROM, flash memory, or the like) or over a network, and is stored in a non-volatile storage apparatus which is a non-transitory storage medium. Therefore, it is preferred that the computer system include an interface that ensures reading from a removable medium.
As described above, according to the second embodiment, it is possible to accurately estimate an event which occurs in future based on various kinds of data other than medical data.
This invention is not limited to the above-described embodiments but includes various modifications. The above-described embodiments are explained in detail for better understanding of this invention, and this invention is not limited to embodiments including all the configurations described above. A part of the configuration of one embodiment may be replaced with that of another embodiment, and the configuration of one embodiment may be incorporated into the configuration of another embodiment. For a part of the configuration of each embodiment, a different configuration may be added, deleted, or substituted.
The above-described configurations, functions, processing modules, and processing means may be implemented, in whole or in part, by hardware, for example, by designing an integrated circuit. The above-described configurations and functions may also be implemented by software, which means that a processor interprets and executes programs providing the functions. The information of programs, tables, and files to implement the functions may be stored in a storage device such as a memory, a hard disk drive, or a solid state drive (SSD), or a storage medium such as an IC card or an SD card. The drawings show control lines and information lines as considered necessary for explanation but do not show all control lines or information lines in the products. It can be considered that almost all components are actually interconnected.
Number | Date | Country | Kind |
---|---|---|---|
2013-104664 | May 2013 | JP | national |