This application claims priority to and the benefit of Taiwan Application No. 111105739, filed on Feb. 17, 2022, the entirety of which is incorporated by reference herein.
The disclosure is related to a method of assisting in disease prediction, and in particular, it is related to a method, an electronic system, and a computer program product for establishing a decision tree for disease prediction.
At present, doctors must rely on their experience to determine whether a patient suffers from a particular disease before conducting further examinations such as blood work or computed tomography. However, most doctors will first let patients take painkillers or anti-inflammatory drugs to address their symptoms, which may delay medical treatment. Therefore, it is necessary to adopt Fast Healthcare Interoperability Resources (FHIR), a common international format, so that patients can provide complete medical records to hospitals in various places, reducing the chance of misdiagnosis and gaining valuable treatment time.
Now, more and more medical institutions are introducing artificial intelligence to help judge images, reducing the burden on pathologists and increasing the possibility of early detection of diseases. In the case of young doctors with less experience, artificial intelligence can also be used to assist in disease recognition, reducing the possibility of misdiagnosis during experiential learning. Therefore, how to establish a mechanism to assist in disease prediction has become an important issue.
In order to resolve the issue described above, the present disclosure provides a method for establishing a decision tree for disease prediction. The method includes the following steps. A plurality of physiological measurement data corresponding to different diseases is received. The physiological measurement data are classified according to their purpose. At least one cutting point of the physiological measurement data is calculated. The decision tree is branched at the cutting point. The decision tree is pruned, completing the establishment of the decision tree.
According to the method disclosed above, the step of calculating the cutting point of the physiological measurement data includes calculating the value of the cutting point of the physiological measurement data by using the specific function associated with the physiological measurement data and the absolute value of the correlation coefficient that is associated with the physiological measurement data.
According to the method disclosed above, the step of branching the decision tree corresponding to the cutting point includes setting the cutting point with the smallest value as a branch node of the decision tree; and determining whether the step of branching can be continued or not.
According to the method disclosed above, the step of pruning the decision tree to complete the establishment of the decision tree includes pruning the decision tree by using the Akaike Information Criterion (AIC).
According to the method disclosed above, the step of classifying the physiological measurement data corresponding to the purpose includes classifying the physiological measurement data as classification data when the physiological measurement data are used for the estimation of the probability of occurrence of different diseases.
According to the method disclosed above, when the physiological measurement data are classified as the classification data, the specific function is a Gini coefficient formula; the Gini coefficient formula is as follows: $\mathrm{Gini}(D)=\sum_{i=1}^{n}p(x_i)\times(1-p(x_i))=1-\sum_{i=1}^{n}p(x_i)^2$; wherein $x_i$ is the data corresponding to a disease among the physiological measurement data; $p(x_i)$ is the probability of occurrence of the data corresponding to the disease among the physiological measurement data; and n is the number of disease types corresponding to the physiological measurement data.
According to the method disclosed above, the correlation coefficient is as follows:
wherein i is one of the physiological measurement data; n is the number of physiological measurement data; xj is an independent variable and represents the physiological measurement data;
According to the method disclosed above, the physiological measurement data comprise gender, Body Mass Index (BMI), uric acid, total cholesterol, white blood cells, and blood sugar.
According to the method disclosed above, the value of the cutting point of the physiological measurement data is equal to Gini(D)×|r(i)|.
According to the method disclosed above, the AIC is as follows: AIC=−2×l+2×(k+1); wherein l is a likelihood function, and k is the number of parameters.
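As an illustration only, the criterion above can be evaluated in a few lines of Python. The sketch assumes that the likelihood term is supplied as a log-likelihood value, as is usual for AIC; the function name `aic`, its parameter names, and the numbers used are hypothetical and do not appear in the disclosure.

```python
def aic(log_likelihood: float, k: int) -> float:
    """Evaluate AIC = -2*l + 2*(k+1) as stated in the text.

    Assumption: `log_likelihood` is the likelihood term l of a candidate
    tree, and `k` is its number of parameters.
    """
    return -2.0 * log_likelihood + 2.0 * (k + 1)


# Hypothetical numbers: the larger tree fits slightly better but pays a
# larger penalty, so the pruned tree ends up with the lower AIC.
full_tree = aic(log_likelihood=-10.0, k=6)    # 20 + 14
pruned_tree = aic(log_likelihood=-11.0, k=3)  # 22 + 8
print(full_tree, pruned_tree)
```

Under this reading, a pruning step would keep whichever candidate tree yields the smaller AIC value.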
According to the method disclosed above, the method further includes calculating the correct rate of each terminal branch of the decision tree corresponding to the different diseases.
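One possible reading of the correct rate of a terminal branch is sketched below. It assumes that a terminal branch predicts its most common disease and that the correct rate is the share of records in that branch carrying the predicted disease; the function name and the leaf contents are hypothetical.

```python
from collections import Counter


def leaf_correct_rate(diseases: list[str]) -> tuple[str, float]:
    """Return (predicted disease, correct rate) for one terminal branch.

    Assumption: the branch predicts its most common disease, and the
    correct rate is the fraction of its records with that disease.
    """
    predicted, hits = Counter(diseases).most_common(1)[0]
    return predicted, hits / len(diseases)


# Hypothetical leaf contents, not taken from the disclosure.
leaves = {
    "leaf_A": ["diabetes", "diabetes", "fatty liver"],
    "leaf_B": ["hypertension"],
}
rates = {name: leaf_correct_rate(d) for name, d in leaves.items()}
print(rates)
```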
According to the method disclosed above, in response to determining whether the step of branching can be continued or not, the method includes repeating the step of calculating the value of the cutting point of the physiological measurement data and the step of setting the cutting point with the smallest value as the branch node of the decision tree, until the step of branching cannot be continued; or repeating the step of calculating the value of the cutting point of the physiological measurement data and the step of setting the cutting point with the smallest value as the branch node of the decision tree, until the number of physiological measurement data included in the branch node is less than or equal to a preset number of physiological measurement data corresponding to each disease.
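The two stopping conditions can be sketched as a single check. This is a minimal sketch under stated assumptions: branching is taken to be impossible when all records in a node share one disease, and `preset_number` stands for the preset number of physiological measurement data; the function name is hypothetical.

```python
def can_continue_branching(diseases: list[str], preset_number: int) -> bool:
    """Decide whether a node may be split again.

    Assumptions: branching stops when the node holds no more than
    `preset_number` records, or when every record in the node has the
    same disease (nothing is left to separate).
    """
    if len(diseases) <= preset_number:
        return False
    return len(set(diseases)) > 1


print(can_continue_branching(["diabetes", "fatty liver", "diabetes"], 1))
print(can_continue_branching(["diabetes"], 1))
print(can_continue_branching(["diabetes", "diabetes", "diabetes"], 1))
```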
According to the method disclosed above, in response to determining whether the step of branching can be continued or not, the method includes sorting the physiological measurement data according to gender from female to male; sorting the physiological measurement data according to BMI from low to high; sorting the physiological measurement data according to uric acid from low to high; sorting the physiological measurement data according to total cholesterol from least to most; sorting the physiological measurement data according to the number of white blood cells from least to most; and sorting the physiological measurement data according to blood sugar from low to high.
According to the method disclosed above, in response to determining whether the step of branching can be continued or not, the method includes calculating the product of the specific function and the absolute value of the correlation coefficient according to the results of sorting the physiological measurement data by gender, BMI, uric acid, total cholesterol, white blood cells, and blood sugar.
The present disclosure also provides an electronic system to establish a decision tree for disease prediction. The electronic system includes a first processor, a data base and a second processor. The first processor is configured to receive a plurality of physiological measurement data corresponding to different diseases from a hospital. The data base is configured to store the physiological measurement data. The second processor is configured to obtain the physiological measurement data from the data base to execute the following steps. The steps include classifying the physiological measurement data corresponding to the purpose; calculating at least one cutting point of the physiological measurement data; branching the decision tree corresponding to the cutting point; and pruning the decision tree to complete the establishment of the decision tree.
According to the electronic system disclosed above, the second processor's calculation of the cutting point of the physiological measurement data includes calculating the value of the cutting point of the physiological measurement data by using the specific function that is associated with the physiological measurement data and the absolute value of the correlation coefficient that is associated with the physiological measurement data.
According to the electronic system disclosed above, when the physiological measurement data are used for the estimation of the probability of occurrence of different diseases, the second processor classifies the physiological measurement data as classification data.
According to the electronic system disclosed above, when the second processor classifies the physiological measurement data as the classification data, the specific function is a Gini coefficient formula; the Gini coefficient formula is as follows: $\mathrm{Gini}(D)=\sum_{i=1}^{n}p(x_i)\times(1-p(x_i))=1-\sum_{i=1}^{n}p(x_i)^2$; wherein $x_i$ is the data corresponding to a disease among the physiological measurement data; $p(x_i)$ is the probability of occurrence of the data corresponding to the disease among the physiological measurement data; and n is the number of disease types corresponding to the physiological measurement data.
According to the electronic system disclosed above, the correlation coefficient is as follows:
wherein i is one of the physiological measurement data; n is the number of physiological measurement data; xj is an independent variable and represents the physiological measurement data;
According to the electronic system disclosed above, the value of the cutting point of the physiological measurement data is equal to Gini(D)×|r(i)|.
The present disclosure also provides a computer program product to establish a decision tree for disease prediction. The computer program product is applied to an electronic system having a first processor, a second processor, and a data base. The computer program product includes a receiving instruction, a storing instruction, a reading instruction, a classifying instruction, a calculating instruction, a branching instruction, and a pruning instruction. The receiving instruction enables the first processor to receive a plurality of physiological measurement data corresponding to different diseases from a hospital. The storing instruction enables the data base to store the physiological measurement data. The reading instruction enables the second processor to obtain the physiological measurement data from the data base. The classifying instruction enables the second processor to classify the physiological measurement data corresponding to the purpose. The calculating instruction enables the second processor to calculate at least one cutting point of the physiological measurement data. The branching instruction enables the second processor to branch the decision tree corresponding to the cutting point. The pruning instruction enables the second processor to prune the decision tree. After the first processor finishes the receiving instruction, the data base finishes the storing instruction, and the second processor finishes the reading instruction, the classifying instruction, the calculating instruction, the branching instruction, and the pruning instruction, the establishment of the decision tree is completed.
The disclosure can be more fully understood by reading the subsequent detailed description with references made to the accompanying figures. It should be understood that the figures are not drawn to scale in accordance with standard practice in the industry. In fact, the size of components may be arbitrarily enlarged or reduced for clarity of illustration. Many specific details, relationships and methods are disclosed to provide a complete understanding of the disclosure.
In order to make the above purposes, features, and advantages of some embodiments of the present disclosure more comprehensible, the following is a detailed description in conjunction with the accompanying drawings.
It should be understood that the words “comprise” and “include” used in the present disclosure are used to indicate the existence of specific technical features, values, method steps, operations, units and/or components. However, it does not exclude that more technical features, values, method steps, work processes, units, components, or any combination of the above can be added.
The words “first”, “second”, “third”, and “fourth” are used to describe components; they do not indicate a priority order or precedence relationship, but only distinguish components with the same name.
In detail, in step S104, the method for establishing the decision tree for disease prediction of the present disclosure further includes the following step: The value of the cutting point of the physiological measurement data is calculated by using a specific function associated with the physiological measurement data and the absolute value of the correlation coefficient associated with the physiological measurement data. In step S106, the method for establishing the decision tree for disease prediction of the present disclosure further includes the following steps. The cutting point with the smallest value is set as a branch node in the decision tree. A determination is made as to whether the step of branching can be continued or not. In step S108, the method for establishing the decision tree for disease prediction of the present disclosure further includes pruning the decision tree using the Akaike Information Criterion (AIC).
In some embodiments, the decision tree established in the present disclosure for disease prediction is a Classification and Correlation Coefficient Regression Trees (CCRT) decision tree. The CCRT decision tree is an improved version of the traditional and well-known Classification and Regression Trees (CART) decision tree. The correlation coefficient is added into the CCRT decision tree to adjust the parameters to improve the disease prediction ability of the CCRT decision tree. In step S100, the physiological measurement data are the medical record data of each patient from the hospital. For example, a patient's medical record data may include gender, Body Mass Index (BMI), uric acid, total cholesterol, white blood cells, and blood sugar, but the present disclosure is not limited thereto.
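Table 1 shows physiological measurement data, corresponding to different diseases, of five patients from the hospital. The physiological measurement data in Table 1 are provided as examples.
TABLE 1
Patient No. | Gender | BMI | Uric acid | Total cholesterol | White blood cells | Blood sugar | Disease
1 | Female | 18 | 7.3 | 150 | 15.3 | 201 | Diabetes
2 | Female | 36 | 9.8 | 285 | 20.8 | 125 | Atherosclerosis
3 | Male | 32 | 6.5 | 201 | 8.51 | 100 | Hypertension
4 | Male | 24 | 5.7 | 187 | 4.38 | 131 | Fatty liver
5 | Male | 28 | 7.4 | 235 | 18.1 | 185 | Diabetes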
As shown in Table 1, patient No. 1 is a female, her BMI is 18, uric acid is 7.3, total cholesterol is 150, white blood cells is 15.3, and blood sugar is 201, and the doctor judges that the disease that patient No. 1 suffers from is diabetes. Patient No. 2 is a female, her BMI is 36, uric acid is 9.8, total cholesterol is 285, white blood cells is 20.8, and blood sugar is 125, and the doctor judges that the disease that patient No. 2 suffers from is atherosclerosis. Patient No. 3 is a male, his BMI is 32, uric acid is 6.5, total cholesterol is 201, white blood cells is 8.51, and blood sugar is 100, and the doctor judges that the disease that patient No. 3 suffers from is hypertension. Patient No. 4 is a male, his BMI is 24, uric acid is 5.7, total cholesterol is 187, white blood cells is 4.38, and blood sugar is 131, and the doctor judges that the disease that patient No. 4 suffers from is fatty liver. Patient No. 5 is a male, his BMI is 28, uric acid is 7.4, total cholesterol is 235, white blood cells is 18.1, and blood sugar is 185, and the doctor judges that the disease that patient No. 5 suffers from is diabetes.
In step S102, the physiological measurement data are classified as classification data when the physiological measurement data are used for the estimation of the probability of occurrence of different diseases. In some embodiments, the physiological measurement data are classified as numerical data when the physiological measurement data from the hospital are used for classification of different diseases. The CCRT decision tree of the present disclosure can process both classification data and numerical data. In some embodiments, when the physiological measurement data is classified as classification data in step S102, the specific function associated with the physiological measurement data in step S104 is a Gini coefficient formula. In detail, the Gini coefficient formula is shown as Equation 1 below.
$\mathrm{Gini}(D)=\sum_{i=1}^{n}p(x_i)\times(1-p(x_i))=1-\sum_{i=1}^{n}p(x_i)^2$ (Equation 1)
Here, $x_i$ is the data corresponding to a disease among the physiological measurement data; $p(x_i)$ is the probability of occurrence of the data corresponding to the disease among the physiological measurement data; and n is the number of disease types corresponding to the physiological measurement data.
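As a minimal sketch, Equation 1 can be evaluated by counting how often each disease occurs in a set of records. The function name `gini` is hypothetical; the disease list is the one given for patients No. 1 to 5 in Table 1.

```python
from collections import Counter


def gini(diseases: list[str]) -> float:
    """Gini(D) = 1 - sum_i p(x_i)^2 over the diseases present in D."""
    total = len(diseases)
    return 1.0 - sum((c / total) ** 2 for c in Counter(diseases).values())


# Diseases of patients No. 1 to 5 in Table 1.
table_1_diseases = ["diabetes", "atherosclerosis", "hypertension",
                    "fatty liver", "diabetes"]
root_gini = gini(table_1_diseases)
print(round(root_gini, 4))  # 1 - (4/25 + 1/25 + 1/25 + 1/25) = 0.72
```

A set of records that all share one disease gives a value of 0, which is why single-record branches in the worked examples contribute nothing.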
In step S104, the correlation coefficient is shown as Equation 2.
i is one of the physiological measurement data; n is the number of physiological measurement data; xj is an independent variable and represents the physiological measurement data;
In detail, in step S104, the value of the cutting point of the physiological measurement data is equal to Gini(D)×|r(i)|, which is Equation 3.
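Equation 3 is a plain product, so a sketch of it only needs the two factors as inputs. The correlation coefficient is passed in as a number here because its formula is not reproduced in this text; the function name and the input values are hypothetical.

```python
def cutting_point_value(gini_value: float, r: float) -> float:
    """Equation 3 as stated: Gini(D) x |r(i)|."""
    return gini_value * abs(r)


# Hypothetical inputs: the sign of r does not affect the result.
negative = cutting_point_value(0.5, -0.8)
positive = cutting_point_value(0.5, 0.8)
print(negative, positive)
```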
In some embodiments, before the method of the present disclosure calculates the product of the Gini coefficient, Gini(D), and the absolute value of the correlation coefficient, |r(i)|, the physiological measurement data are sorted according to gender from female to male, BMI from low to high, uric acid from low to high, total cholesterol from least to most, the number of white blood cells from least to most, and blood sugar from low to high. In some embodiments, the method of the present disclosure calculates the product of the Gini coefficient, Gini(D), and the absolute value of the correlation coefficient, |r(i)|, to obtain the value of the cutting point of the physiological measurement data according to the sorting of gender, BMI, uric acid, total cholesterol, white blood cells, and blood sugar in the physiological measurement data.
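The sorting step can be sketched as follows for one numerical measurement. The sketch assumes that candidate cutting points are the midpoints between adjacent sorted values, which is how the BMI thresholds are formed in the worked example of this disclosure; the function name is hypothetical and the BMI values are those of Table 1.

```python
def candidate_cutting_points(values: dict[int, float]):
    """Sort patients by one measurement and list midpoints between neighbours.

    `values` maps patient number -> measurement. Returns the sorted
    patient order and the candidate thresholds.
    """
    order = sorted(values, key=values.get)
    sorted_vals = [values[p] for p in order]
    midpoints = [(a + b) / 2
                 for a, b in zip(sorted_vals, sorted_vals[1:]) if a != b]
    return order, midpoints


bmi = {1: 18, 2: 36, 3: 32, 4: 24, 5: 28}  # BMI values from Table 1
order, thresholds = candidate_cutting_points(bmi)
print(order)       # [1, 4, 5, 3, 2]
print(thresholds)  # [21.0, 26.0, 30.0, 34.0]
```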
The physiological measurement data in Table 1 are taken as an example. The method of the present disclosure sorts the data of patients No. 1 to 5 as (1, 2, 3, 4, 5) according to gender, that is, the sorting of gender of patients No. 1 to 5 is (F, F, M, M, M). After that, the method of the present disclosure calculates the value of the cutting point between male and female in the data of patients No. 1 to 5, as shown in Equation 4 below.
The method of the present disclosure converts gender into values and substitutes the values into Equations 3, 2 and 1 according to the sorting of the physiological measurement data (F, F, M, M, M) by gender to obtain Equation 4. For example, in the method of the present disclosure, after sorting the physiological measurement data according to gender, the cutting point is between the first two F and the last three M. The data of the first two F correspond to different diseases (e.g., diabetes and atherosclerosis, respectively), thus the probability is ½ each. The left branch is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). Similarly, the data of the last three M correspond to different diseases (e.g., hypertension, fatty liver, and diabetes), thus the probability is ⅓ each. The right branch is $1-\left[\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). According to the result of Equation 4, it can be obtained that the value of the cutting point sorted by gender is 0.6.
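The branch arithmetic for the gender cut can be checked with a short sketch. It computes only the size-weighted Gini term of the two branches and does not apply the |r(i)| factor of Equation 3, because the correlation coefficient formula is not reproduced in this text; the function names are hypothetical.

```python
from collections import Counter


def gini(diseases):
    """Gini(D) = 1 - sum_i p(x_i)^2."""
    total = len(diseases)
    return 1.0 - sum((c / total) ** 2 for c in Counter(diseases).values())


def weighted_gini(left, right):
    """Size-weighted sum of the two branch Gini values."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


female = ["diabetes", "atherosclerosis"]            # patients No. 1 and 2
male = ["hypertension", "fatty liver", "diabetes"]  # patients No. 3, 4 and 5
gender_cut = weighted_gini(female, male)
print(round(gender_cut, 4))  # 2/5 * 1/2 + 3/5 * 2/3 = 0.6
```

For this cut the weighted Gini term alone equals the 0.6 quoted for Equation 4.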
Then, the method of the present disclosure sorts the data of patients No. 1 to 5 as (1, 4, 5, 3, 2) according to BMI, that is, the sorting of BMI of patients No. 1 to 5 is (18, 24, 28, 32, 36). The method of the present disclosure first calculates the first cutting point according to BMI, that is, the first cutting point for BMI<((18+24)/2), as shown in Equation 5 below.
For example, after sorting the physiological measurement data according to BMI, the first cutting point is between 18 and 24. In the physiological measurement data whose BMI is 18 (e.g., patient No. 1), the disease it corresponds to is diabetes, thus the probability is 1/1. Therefore, the left branch of the first cutting point is $1-\left(\tfrac{1}{1}\right)^2$, multiplied by ⅕ (1 of 5 data). Similarly, in the physiological measurement data whose BMI values are 24, 28, 32, and 36 (e.g., patients No. 2 to 5), the diseases they correspond to are all different (such as atherosclerosis, hypertension, fatty liver, and diabetes), thus the probability is ¼ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2\right]$, multiplied by ⅘ (4 of 5 data). According to the result of Equation 5, it can be obtained that the value of the first cutting point sorted by BMI is 0.6.
Then, the method of the present disclosure calculates the next cutting point according to BMI, that is, the second cutting point for BMI<((24+28)/2), as shown in Equation 6 below.
For example, after sorting the physiological measurement data according to BMI, the second cutting point is between 24 and 28. In the physiological measurement data whose BMI values are 18 and 24 (e.g., patients No. 1 and No. 4), the diseases they correspond to are diabetes and fatty liver, thus the probability is ½ each. Therefore, the left branch of the second cutting point is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). Similarly, in the physiological measurement data whose BMI values are 28, 32 and 36 (e.g., patients No. 2, No. 3 and No. 5), the diseases they correspond to are all different (such as atherosclerosis, hypertension, and diabetes), thus the probability is ⅓ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). According to the result of Equation 6, it can be obtained that the value of the second cutting point sorted by BMI is 0.6.
Then, the method of the present disclosure calculates the next cutting point according to BMI, that is, the third cutting point for BMI<((28+32)/2), as shown in Equation 7 below.
For example, after sorting the physiological measurement data according to BMI, the third cutting point is between 28 and 32. In the physiological measurement data whose BMI values are 18, 24 and 28 (e.g., patients No. 1, No. 4 and No. 5), the diseases they correspond to are diabetes and fatty liver, thus the probability of occurrence of diabetes is ⅔, and the probability of occurrence of fatty liver is ⅓. Therefore, the left branch of the third cutting point is $1-\left[\left(\tfrac{2}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). Similarly, in the physiological measurement data whose BMI values are 32 and 36 (e.g., patients No. 2 and No. 3), the diseases they correspond to are all different (such as atherosclerosis and hypertension), thus the probability is ½ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). According to the result of Equation 7, it can be obtained that the value of the third cutting point sorted by BMI is 0.054.
Then, the method of the present disclosure calculates the next cutting point according to BMI, that is, the fourth cutting point for BMI<((32+36)/2), as shown in Equation 8 below.
For example, after sorting the physiological measurement data according to BMI, the fourth cutting point is between 32 and 36. In the physiological measurement data whose BMI values are 18, 24, 28 and 32 (e.g., patients No. 1, No. 3, No. 4 and No. 5), the diseases they correspond to are diabetes, hypertension, and fatty liver, thus the probability of occurrence of diabetes is 2/4, the probability of occurrence of hypertension is ¼, and the probability of occurrence of fatty liver is ¼. Therefore, the left branch of the fourth cutting point is $1-\left[\left(\tfrac{2}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2\right]$, multiplied by ⅘ (4 of 5 data). Similarly, in the physiological measurement data whose BMI is 36 (e.g., patient No. 2), the disease it corresponds to is atherosclerosis, thus the probability is 1/1. Therefore, the right branch is $1-\left(\tfrac{1}{1}\right)^2$, multiplied by ⅕ (1 of 5 data). According to the result of Equation 8, it can be obtained that the value of the fourth cutting point sorted by BMI is 0.158.
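The sweep over the four BMI cutting points can be sketched as a loop over the sorted records. As a loudly stated assumption, the sketch evaluates only the size-weighted Gini term of each split and omits the |r(i)| factor of Equation 3, since the correlation coefficient formula is not reproduced in this text. Its outputs (0.6, 0.6, about 0.4667, and 0.5) therefore agree with the quoted values for the first two cutting points but not with the 0.054 and 0.158 quoted for the third and fourth, which may reflect that omitted factor. Function names are hypothetical, and ties between equal measurements are not handled.

```python
from collections import Counter

# (BMI, disease) pairs for patients No. 1 to 5 in Table 1.
records = [(18, "diabetes"), (36, "atherosclerosis"), (32, "hypertension"),
           (24, "fatty liver"), (28, "diabetes")]


def gini(diseases):
    """Gini(D) = 1 - sum_i p(x_i)^2."""
    total = len(diseases)
    return 1.0 - sum((c / total) ** 2 for c in Counter(diseases).values())


def weighted_gini_per_cut(pairs):
    """Map each midpoint threshold to the weighted Gini of its split."""
    pairs = sorted(pairs)  # low to high by measurement
    n = len(pairs)
    result = {}
    for i in range(1, n):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [d for _, d in pairs[:i]]
        right = [d for _, d in pairs[i:]]
        result[threshold] = (len(left) / n * gini(left)
                             + len(right) / n * gini(right))
    return result


bmi_cuts = weighted_gini_per_cut(records)
for threshold, value in bmi_cuts.items():
    print(f"BMI < {threshold}: {value:.4f}")
```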
Moreover, the method of the present disclosure sorts the data of patients No. 1 to 5 as (4, 3, 1, 5, 2) according to uric acid, that is, the sorting of uric acid of patients No. 1 to 5 is (5.7, 6.5, 7.3, 7.4, 9.8). The method of the present disclosure first calculates the first cutting point according to uric acid, that is, the first cutting point for uric acid<((5.7+6.5)/2), as shown in Equation 9 below.
For example, after sorting the physiological measurement data according to uric acid, the first cutting point is between 5.7 and 6.5. In the physiological measurement data whose uric acid is 5.7 (e.g., patient No. 4), the disease it corresponds to is fatty liver, thus the probability is 1/1. Therefore, the left branch of the first cutting point is $1-\left(\tfrac{1}{1}\right)^2$, multiplied by ⅕ (1 of 5 data). Similarly, in the physiological measurement data whose uric acid values are 6.5, 7.3, 7.4 and 9.8 (e.g., patients No. 1 to 3 and No. 5), the diseases they correspond to are diabetes, atherosclerosis, and hypertension, thus the probability of occurrence of diabetes is 2/4, the probability of occurrence of atherosclerosis is ¼, and the probability of occurrence of hypertension is ¼. Therefore, the right branch is $1-\left[\left(\tfrac{2}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2\right]$, multiplied by ⅘ (4 of 5 data). According to the result of Equation 9, it can be obtained that the value of the first cutting point sorted by uric acid is 0.5.
Then, the method of the present disclosure calculates the next cutting point according to uric acid, that is, the second cutting point for uric acid<((6.5+7.3)/2), as shown in Equation 10 below.
For example, after sorting the physiological measurement data according to uric acid, the second cutting point is between 6.5 and 7.3. In the physiological measurement data whose uric acid values are 5.7 and 6.5 (e.g., patients No. 3 and No. 4), the diseases they correspond to are fatty liver and hypertension, thus the probability is ½ each. Therefore, the left branch of the second cutting point is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). Similarly, in the physiological measurement data whose uric acid values are 7.3, 7.4 and 9.8 (e.g., patients No. 1, No. 5, and No. 2), the diseases they correspond to are diabetes and atherosclerosis, thus the probability of occurrence of diabetes is ⅔, and the probability of occurrence of atherosclerosis is ⅓. Therefore, the right branch is $1-\left[\left(\tfrac{2}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). According to the result of Equation 10, it can be obtained that the value of the second cutting point sorted by uric acid is 0.4667.
Then, the method of the present disclosure calculates the next cutting point according to uric acid, that is, the third cutting point for uric acid<((7.3+7.4)/2), as shown in Equation 11 below.
For example, after sorting the physiological measurement data according to uric acid, the third cutting point is between 7.3 and 7.4. In the physiological measurement data whose uric acid values are 5.7, 6.5 and 7.3 (e.g., patients No. 4, No. 3 and No. 1), the diseases they correspond to are all different (such as fatty liver, hypertension, and diabetes), thus the probability is ⅓ each. Therefore, the left branch of the third cutting point is $1-\left[\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). Similarly, in the physiological measurement data whose uric acid values are 7.4 and 9.8 (e.g., patients No. 5 and No. 2), the diseases they correspond to are different (such as diabetes and atherosclerosis), thus the probability is ½ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). According to the result of Equation 11, it can be obtained that the value of the third cutting point sorted by uric acid is 0.589.
Then, the method of the present disclosure calculates the next cutting point according to uric acid, that is, the fourth cutting point for uric acid<((7.4+9.8)/2), as shown in Equation 12 below.
For example, after sorting the physiological measurement data according to uric acid, the fourth cutting point is between 7.4 and 9.8. In the physiological measurement data whose uric acid values are 5.7, 6.5, 7.3 and 7.4 (e.g., patients No. 4, No. 3, No. 1, and No. 5), the diseases they correspond to are fatty liver, hypertension, and diabetes, thus the probability of occurrence of fatty liver is ¼, the probability of occurrence of hypertension is ¼, and the probability of occurrence of diabetes is 2/4. Therefore, the left branch of the fourth cutting point is $1-\left[\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{2}{4}\right)^2\right]$, multiplied by ⅘ (4 of 5 data). Similarly, in the physiological measurement data whose uric acid is 9.8 (e.g., patient No. 2), the disease it corresponds to is atherosclerosis, thus the probability is 1/1. Therefore, the right branch is $1-\left(\tfrac{1}{1}\right)^2$, multiplied by ⅕ (1 of 5 data). According to the result of Equation 12, it can be obtained that the value of the fourth cutting point sorted by uric acid is 0.4938.
After that, the method of the present disclosure sorts the data of patients No. 1 to 5 as (1, 4, 3, 5, 2) according to total cholesterol, that is, the sorting of total cholesterol of patients No. 1 to 5 is (150, 187, 201, 235, 285). The method of the present disclosure first calculates the first cutting point according to total cholesterol, that is, the first cutting point for total cholesterol<((150+187)/2), as shown in Equation 13 below.
For example, after sorting the physiological measurement data according to total cholesterol, the first cutting point is between 150 and 187. In the physiological measurement data whose total cholesterol is 150 (e.g., patient No. 1), the disease it corresponds to is diabetes, thus the probability is 1/1. Therefore, the left branch of the first cutting point is $1-\left(\tfrac{1}{1}\right)^2$, multiplied by ⅕ (1 of 5 data). Similarly, in the physiological measurement data whose total cholesterol values are 187, 201, 235 and 285 (e.g., patients No. 2 to 5), the diseases they correspond to are all different (such as atherosclerosis, hypertension, fatty liver, and diabetes), thus the probability is ¼ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2+\left(\tfrac{1}{4}\right)^2\right]$, multiplied by ⅘ (4 of 5 data). According to the result of Equation 13, it can be obtained that the value of the first cutting point sorted by total cholesterol is 0.6.
Then, the method of the present disclosure calculates the next cutting point according to total cholesterol, that is, the second cutting point for total cholesterol<((187+201)/2), as shown in Equation 14 below.
For example, after sorting the physiological measurement data according to total cholesterol, the second cutting point is between 187 and 201. In the physiological measurement data whose total cholesterol values are 150 and 187 (e.g., patients No. 1 and No. 4), the diseases they correspond to are diabetes and fatty liver, thus the probability is ½ each. Therefore, the left branch of the second cutting point is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). Similarly, in the physiological measurement data whose total cholesterol values are 201, 235 and 285 (e.g., patients No. 2, No. 3 and No. 5), the diseases they correspond to are all different (such as atherosclerosis, hypertension and diabetes), thus the probability is ⅓ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). According to the result of Equation 14, it can be obtained that the value of the second cutting point sorted by total cholesterol is 0.6.
Then, the method of the present disclosure calculates the next cutting point according to total cholesterol, that is, the third cutting point for total cholesterol <((201+235)/2), as shown in Equation 15 below.
For example, after sorting the physiological measurement data according to total cholesterol, the third cutting point is between 201 and 235. In the physiological measurement data whose total cholesterol values are 150, 187 and 201 (e.g., patients No. 1, No. 4 and No. 3), the diseases they correspond to are all different (such as diabetes, fatty liver, and hypertension), thus the probability is ⅓ each. Therefore, the left branch of the third cutting point is $1-\left[\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2+\left(\tfrac{1}{3}\right)^2\right]$, multiplied by ⅗ (3 of 5 data). Similarly, in the physiological measurement data whose total cholesterol values are 235 and 285 (e.g., patients No. 5 and No. 2), the diseases they correspond to are different (such as diabetes and atherosclerosis), thus the probability is ½ each. Therefore, the right branch is $1-\left[\left(\tfrac{1}{2}\right)^2+\left(\tfrac{1}{2}\right)^2\right]$, multiplied by ⅖ (2 of 5 data). According to the result of Equation 15, it can be obtained that the value of the third cutting point sorted by total cholesterol is 0.4944.
Then, the method of the present disclosure calculates the next cutting point according to total cholesterol, that is, the fourth cutting point for total cholesterol<((235+285)/2), as shown in Equation 16 below.
For example, after sorting the physiological measurement data according to total cholesterol, the fourth cutting point is between 235 and 285. In the physiological measurement data whose total cholesterol are 150, 187, 201 and 235 (e.g., patients No. 1, No. 4, No. 3, and No. 5), the diseases they correspond to are fatty liver, hypertension, and diabetes, thus the probability of occurrence of fatty liver is ¼, the probability of occurrence of hypertension is ¼, and the probability of occurrence of diabetes is 2/4. Therefore, the left branch of the fourth cutting point is
multiplying by ⅘ (4 of 5 data). Similarly, in the physiological measurement data whose total cholesterol is 285 (e.g. patient No. 2), the disease it corresponds to is atherosclerosis, thus the probability is 1/1. Therefore, the right branch is
multiplying by ⅕ (1 of 5 data). According to the result of Equation 16, it can be obtained that the value of the fourth cutting point sorted by total cholesterol is 0.01.
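The branch probabilities and branch weights described for the fourth cutting point can be tallied as follows. This is a minimal sketch; the helper name `branch_probabilities` is hypothetical, and the disease labels are taken from the patient data described above.

```python
from collections import Counter
from fractions import Fraction

def branch_probabilities(diseases):
    """Return the share of each disease among the data in one branch."""
    n = len(diseases)
    return {disease: Fraction(count, n) for disease, count in Counter(diseases).items()}

# Fourth cutting point for total cholesterol: 150, 187, 201, 235 | 285
left = ["diabetes", "fatty liver", "hypertension", "diabetes"]  # patients No. 1, 4, 3, 5
right = ["atherosclerosis"]                                      # patient No. 2

left_probs = branch_probabilities(left)
right_probs = branch_probabilities(right)
left_weight = Fraction(len(left), len(left) + len(right))    # 4 of 5 data
right_weight = Fraction(len(right), len(left) + len(right))  # 1 of 5 data

print(left_probs, left_weight)
print(right_probs, right_weight)
```

Exact fractions are used so that 2/4 and ¼ are not affected by floating-point rounding.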
After that, the method of the present disclosure sorts the data of patients No. 1 to 5 as (4, 3, 1, 5, 2) according to white blood cells, that is, the sorting of white blood cells of patients No. 1 to 5 is (4.38, 8.51, 15.3, 18.1, 20.8). The method of the present disclosure first calculates the first cutting point according to white blood cells, that is, the first cutting point for white blood cells<((4.38+8.51)/2), as shown in Equation 17 below.
For example, after sorting the physiological measurement data according to white blood cells, the first cutting point is between 4.38 and 8.51. In the physiological measurement data whose white blood cells is 4.38 (e.g., patient No. 4), the disease it corresponds to is fatty liver, thus the probability is 1/1. Therefore, the left branch of the first cutting point is
multiplying by ⅕ (1 of 5 data). Similarly, in the physiological measurement data whose white blood cells are 8.51, 15.3, 18.1 and 20.8 (e.g., patients No. 1 to No. 3 and No. 5), the diseases they correspond to are diabetes, atherosclerosis and hypertension, thus the probability of occurrence of diabetes is 2/4, the probability of occurrence of atherosclerosis is ¼, and the probability of occurrence of hypertension is ¼. Therefore, the right branch is
multiplying by ⅘ (4 of 5 data). According to the result of Equation 17, it can be obtained that the value of the first cutting point sorted by white blood cells is 0.5.
Then, the method of the present disclosure calculates the next cutting point according to white blood cells, that is, the second cutting point for white blood cells <((8.51+15.3)/2), as shown in Equation 18 below.
For example, after sorting the physiological measurement data according to white blood cells, the second cutting point is between 8.51 and 15.3. In the physiological measurement data whose white blood cells are 4.38 and 8.51 (e.g., patients No. 4 and No. 3), the diseases they correspond to are fatty liver and hypertension, thus the probability is ½ each. Therefore, the left branch of the second cutting point is
multiplying by ⅖ (2 of 5 data). Similarly, in the physiological measurement data whose white blood cells are 15.3, 18.1 and 20.8 (e.g., patients No. 1, No. 5 and No. 2), the diseases they correspond to are diabetes and atherosclerosis, thus the probability of occurrence of diabetes is ⅔, and the probability of occurrence of atherosclerosis is ⅓. Therefore, the right branch is
multiplying by ⅗ (3 of 5 data). According to the result of Equation 18, it can be obtained that the value of the second cutting point sorted by white blood cells is 0.4667.
Then, the method of the present disclosure calculates the next cutting point according to white blood cells, that is, the third cutting point for white blood cells<((15.3+18.1)/2), as shown in Equation 19 below.
For example, after sorting the physiological measurement data according to white blood cells, the third cutting point is between 15.3 and 18.1. In the physiological measurement data whose white blood cells are 4.38, 8.51 and 15.3 (e.g., patients No. 4, No. 3 and No. 1), the diseases they correspond to are all different (such as fatty liver, hypertension, and diabetes), thus the probability is ⅓ each. Therefore, the left branch of the third cutting point is
multiplying by 3/5 (3 of 5 data). Similarly, in the physiological measurement data whose white blood cells are 18.1 and 20.8 (e.g., patients No. 5 and No. 2), the diseases they correspond to are different (such as diabetes and atherosclerosis), thus the probability is 1/2 each. Therefore, the right branch is
multiplying by ⅖ (2 of 5 data). According to the result of Equation 19, it can be obtained that the value of the third cutting point sorted by white blood cells is 0.599.
Then, the method of the present disclosure calculates the next cutting point according to white blood cells, that is, the fourth cutting point for white blood cells<((18.1+20.8)/2), as shown in Equation 20 below.
For example, after sorting the physiological measurement data according to white blood cells, the fourth cutting point is between 18.1 and 20.8. In the physiological measurement data whose white blood cells are 4.38, 8.51, 15.3 and 18.1 (e.g., patients No. 4, No. 3, No. 1, and No. 5), the diseases they correspond to are fatty liver, hypertension, and diabetes, thus the probability of occurrence of fatty liver is ¼, the probability of occurrence of hypertension is ¼, and the probability of occurrence of diabetes is 2/4. Therefore, the left branch of the fourth cutting point is
multiplying by 4/5 (4 of 5 data). Similarly, in the physiological measurement data whose white blood cells is 20.8 (e.g. patient No. 2), the disease it corresponds to is atherosclerosis, thus the probability is 1/1. Therefore, the right branch is
multiplying by ⅕ (1 of 5 data). According to the result of Equation 20, it can be obtained that the value of the fourth cutting point sorted by white blood cells is 0.4916.
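The four candidate cutting points for white blood cells are the midpoints between neighboring sorted values. A minimal sketch of that step follows; the function name `candidate_cutting_points` is hypothetical.

```python
def candidate_cutting_points(values):
    """Sort the measurements and return the midpoint of each adjacent pair."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

# White blood cells of patients No. 1 to No. 5, in patient order
white_blood_cells = [15.3, 20.8, 8.51, 4.38, 18.1]
points = candidate_cutting_points(white_blood_cells)

for index, point in enumerate(points, start=1):
    print(f"cutting point {index}: white blood cells < {point:.3f}")
```

Five measurements give four cutting points, which matches the first to fourth cutting points walked through above.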
The method of the present disclosure sorts the data of patients No. 1 to 5 as (3, 2, 4, 5, 1) according to blood sugar, that is, the sorting of blood sugar of patients No. 1 to 5 is (100, 125, 131, 185, 201). The method of the present disclosure first calculates the first cutting point according to blood sugar, that is, the first cutting point for blood sugar<((100+125)/2), as shown in Equation 21 below.
For example, after sorting the physiological measurement data according to blood sugar, the first cutting point is between 100 and 125. In the physiological measurement data whose blood sugar is 100 (e.g., patient No. 3), the disease it corresponds to is hypertension, thus the probability is 1/1. Therefore, the left branch of the first cutting point is
multiplying by ⅕ (1 of 5 data). Similarly, in the physiological measurement data whose blood sugar values are 125, 131, 185 and 201 (e.g., patients No. 2, No. 4, No. 5 and No. 1), the diseases they correspond to are diabetes, atherosclerosis and fatty liver, thus the probability of occurrence of diabetes is 2/4, the probability of occurrence of atherosclerosis is ¼, and the probability of occurrence of fatty liver is ¼. Therefore, the right branch is
multiplying by ⅘ (4 of 5 data). According to the result of Equation 21, it can be obtained that the value of the first cutting point sorted by blood sugar is 0.5.
Then, the method of the present disclosure calculates the next cutting point according to blood sugar, that is, the second cutting point for blood sugar<((125+131)/2), as shown in Equation 22 below.
For example, after sorting the physiological measurement data according to blood sugar, the second cutting point is between 125 and 131. In the physiological measurement data whose blood sugar values are 100 and 125 (e.g., patients No. 3 and No. 2), the diseases they correspond to are hypertension and atherosclerosis, thus the probability is ½ each. Therefore, the left branch of the second cutting point is
multiplying by ⅖ (2 of 5 data). Similarly, in the physiological measurement data whose blood sugar are 131, 185 and 201 (e.g., patients No. 4, No. 5 and No. 1), the diseases they correspond to are diabetes and fatty liver, thus the probability of occurrence of diabetes is ⅔, and the probability of occurrence of fatty liver is ⅓. Therefore, the right branch is
multiplying by ⅗ (3 of 5 data). According to the result of Equation 22, it can be obtained that the value of the second cutting point sorted by blood sugar is 0.4667.
Then, the method of the present disclosure calculates the next cutting point according to blood sugar, that is, the third cutting point for blood sugar<((131+185)/2), as shown in Equation 23 below.
For example, after sorting the physiological measurement data according to blood sugar, the third cutting point is between 131 and 185. In the physiological measurement data whose blood sugar are 100, 125 and 131 (e.g., patients No. 3, No. 2 and No. 4), the diseases they correspond to are all different (such as fatty liver, hypertension, and atherosclerosis), thus the probability is ⅓ each. Therefore, the left branch of the third cutting point is
multiplying by ⅗ (3 of 5 data). Similarly, in the physiological measurement data whose blood sugar are 185 and 201 (e.g., patients No. 5 and No. 1), the disease they correspond to is the same (such as diabetes), thus the probability is 2/2. Therefore, the right branch is
multiplying by ⅖ (2 of 5 data). According to the result of Equation 23, it can be obtained that the value of the third cutting point sorted by blood sugar is 0.073.
Then, the method of the present disclosure calculates the next cutting point according to blood sugar, that is, the fourth cutting point for blood sugar<((185+201)/2), as shown in Equation 24 below.
For example, after sorting the physiological measurement data according to blood sugar, the fourth cutting point is between 185 and 201. In the physiological measurement data whose blood sugar values are 100, 125, 131 and 185 (e.g., patients No. 3, No. 2, No. 4, and No. 5), the diseases they correspond to are all different (such as fatty liver, hypertension, atherosclerosis and diabetes), thus the probability of occurrence of fatty liver is ¼, the probability of occurrence of hypertension is ¼, the probability of occurrence of atherosclerosis is ¼, and the probability of occurrence of diabetes is ¼. Therefore, the left branch of the fourth cutting point is
multiplying by ⅘ (4 of 5 data). Similarly, in the physiological measurement data whose blood sugar is 201 (e.g. patient No. 1), the disease it corresponds to is diabetes, thus the probability is 1/1. Therefore, the right branch is
multiplying by ⅕ (1 of 5 data). According to the result of Equation 24, it can be obtained that the value of the fourth cutting point sorted by blood sugar is 0.4048. So far, the method of the present disclosure has completed step S104 in
In step S104, the method of the present disclosure obtains the value of the cutting point sorted by gender as 0.6. It obtains the values of the first, second, third and fourth cutting points sorted by BMI as 0.6, 0.6, 0.054, and 0.158; sorted by uric acid as 0.5, 0.4667, 0.589, and 0.4938; sorted by total cholesterol as 0.6, 0.6, 0.4944, and 0.01; sorted by white blood cells as 0.5, 0.4667, 0.599, and 0.4916; and sorted by blood sugar as 0.5, 0.4667, 0.073, and 0.4048.
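Choosing the branching feature from these values can be sketched as a minimum search. This is a minimal sketch that assumes the cutting point with the smallest value is selected, as the description of the later branch nodes indicates; the dictionary layout and variable names are hypothetical.

```python
# Cutting-point values from step S104, listed in cutting-point order per feature
values = {
    "gender": [0.6],
    "BMI": [0.6, 0.6, 0.054, 0.158],
    "uric acid": [0.5, 0.4667, 0.589, 0.4938],
    "total cholesterol": [0.6, 0.6, 0.4944, 0.01],
    "white blood cells": [0.5, 0.4667, 0.599, 0.4916],
    "blood sugar": [0.5, 0.4667, 0.073, 0.4048],
}

# Pick the (value, feature, cutting-point number) triple with the smallest value
best_value, best_feature, best_index = min(
    (value, feature, index)
    for feature, feature_values in values.items()
    for index, value in enumerate(feature_values, start=1)
)

print(best_feature, best_index, best_value)  # total cholesterol 4 0.01
```

The fourth cutting point for total cholesterol lies at (235+285)/2 = 260, which corresponds to the threshold total cholesterol<260 used at the branch node 200.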
After that, in step S106 in
Since the left branch of the branch node 200 (total cholesterol<260) leaves 4 of the physiological measurement data (data of patients No. 1, No. 3, No. 4 and No. 5), the method of the present disclosure also executes steps S104 and S106, and obtains that the third cutting point sorted by BMI (BMI<((28+32)/2)=30) has the smallest value, so a branch node 202 is set as BMI. The left branch of the branch node 202 is the physiological measurement data with BMI<30 (for example, the data of patients No. 1, No. 4 and No. 5), and the right branch of the branch node 202 is the physiological measurement data with BMI>=30 (for example, the data of patient No. 3). In step S106, since the right branch of the branch node 202 leaves one of the physiological measurement data corresponding to hypertension (for example, the data of patient No. 3), the number of physiological measurement data included in a branch node 208 (for example, one data) is less than or equal to a preset number of physiological measurement data (for example, one data, that is, the data of patient No. 3) corresponding to a disease (for example, hypertension). Therefore, the method of the present disclosure sets the branch node 208 as a terminal branch node (e.g., the branching of the branch node 208 cannot be continued), and sets hypertension in the branch node 208.
Since the left branch of the branch node 202 (BMI<30) leaves 3 of the physiological measurement data (data of patients No. 1, No. 4 and No. 5), the method of the present disclosure also executes steps S104 and S106, and obtains that the first cutting point sorted by blood sugar (blood sugar<((131+185)/2)=158) has the smallest value, so a branch node 204 is set as blood sugar. The left branch of the branch node 204 is the physiological measurement data with blood sugar<158 (for example, the data of patient No. 4), and the right branch of the branch node 204 is the physiological measurement data with blood sugar>=158 (for example, the data of patients No. 1 and No. 5). In step S106, since the right branch of the branch node 204 leaves two of the physiological measurement data corresponding to diabetes (for example, the data of patients No. 1 and No. 5), the number of physiological measurement data included in a branch node 212 (for example, two data) is less than or equal to a preset number of physiological measurement data (for example, two data, that is, the data of patients No. 1 and No. 5) corresponding to a disease (for example, diabetes). Therefore, the method of the present disclosure sets the branch node 212 as a terminal branch node (e.g., the branching of the branch node 212 cannot be continued), and sets diabetes in the branch node 212.
Moreover, since the left branch of the branch node 204 leaves one of the physiological measurement data corresponding to fatty liver (for example, the data of patient No. 4), the number of physiological measurement data included in the branch node 210 (for example, one data) is less than or equal to a preset number of physiological measurement data (for example, one data, that is, the data of patient No. 4) corresponding to a disease (for example, fatty liver). Therefore, the method of present disclosure sets a branch node 210 as a terminal branch node (e.g., the branching of the branch node 210 cannot be continued), and sets fatty liver in the branch node 210. To put it simply, the branch nodes 200, 202 and 204 are obtained by being determined as “no” in step S106 in
In step S108, Akaike Information Criterion (AIC) is a criterion used to check whether the decision tree in
AIC=−2×l+2×(k+1) (Equation 25)
In Equation 25, l is a likelihood function, and k is the number of parameters. In some embodiments, the method of the present disclosure further calculates the correct rate of each terminal branch node (e.g., the branch nodes 206, 208, 210, and 212 in
The method of the present disclosure inputs the following three prediction data in Table 2 into the decision tree of
Table 3 is the judgment of disease characteristics by the decision tree in
According to Table 3, the method of the present disclosure can obtain that the total cholesterol for patient A is lower than 260, and the BMI for patient A is higher than or equal to 30, thus patient A may suffer from hypertension. Similarly, patient B may suffer from diabetes, and patient C may suffer from fatty liver. The above results can be used as auxiliary conditions for doctors' diagnosis.
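The traversal of the decision tree for a new patient can be sketched as nested threshold checks. This is a minimal sketch, not the disclosed implementation: the function name `predict` and the input values are hypothetical (the contents of Table 2 are not reproduced here), and the atherosclerosis result for total cholesterol>=260 is an assumption inferred from the data of patient No. 2 (total cholesterol 285).

```python
def predict(total_cholesterol, bmi, blood_sugar):
    """Walk the thresholds of branch nodes 200, 202 and 204 in order."""
    if total_cholesterol >= 260:   # right branch of branch node 200
        return "atherosclerosis"   # assumed terminal node (inferred from patient No. 2)
    if bmi >= 30:                  # right branch of branch node 202
        return "hypertension"      # terminal branch node 208
    if blood_sugar >= 158:         # right branch of branch node 204
        return "diabetes"          # terminal branch node 212
    return "fatty liver"           # terminal branch node 210

# Hypothetical inputs: (total cholesterol, BMI, blood sugar)
print(predict(200, 32, 120))  # hypertension
print(predict(200, 25, 180))  # diabetes
print(predict(200, 25, 120))  # fatty liver
```

The first call follows the same path described for patient A: total cholesterol lower than 260 and BMI higher than or equal to 30.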
The present disclosure also provides a computer program product to establish a decision tree (for example, the decision tree in
The calculating instruction enables the processor 314 to execute step S104 in
The more physiological measurement data the hospital provides, the more accurate the prediction results obtained by the method, electronic system, and computer program product of the present disclosure for establishing a decision tree for disease prediction. The method, electronic system, and computer program product of the present disclosure can assist doctors in medical diagnosis, so that preventive medication can be given in advance according to the prediction results. The method, electronic system, and computer program product of the present disclosure can calculate the data of each terminal branch node of the decision tree to obtain the probability of a single disease, which can improve the accuracy of predictions across more diseases.
The embodiments of the present disclosure are disclosed above, but they are not used to limit the scope of the present disclosure. A person skilled in the art can make some changes and modifications without departing from the spirit and scope of the embodiments of the present disclosure. Therefore, the scope of protection in the present disclosure shall be deemed as defined by the scope of the attached claims.
Number | Date | Country | Kind |
---|---|---|---|
111105739 | Feb 2022 | TW | national |