This application claims the benefit of Taiwan application Serial No. 113102037, filed Jan. 18, 2024, the subject matter of which is incorporated herein by reference.
The invention relates in general to a processing method and an electronic device using the same, and more particularly to a data processing method for machine learning and an electronic device using the same.
According to machine learning technology, using more detection items in model building normally increases prediction accuracy. However, if unsuitable detection items are used, prediction accuracy will decrease instead. Therefore, the detection items need to be selected before they are used in the model building and prediction inference of the machine learning model.
Besides, in real applications, data come from numerous sources. For instance, in the medical field, commonly seen source differences include the brands and models of inspection apparatuses or the hospitals. Such source differences may reduce the performance of model building and prediction inference using machine learning technology. Take blood tests for instance. Different hospitals may use different inspection apparatuses or different numeric units, so blood values fall in different numeric intervals for different inspection apparatuses. For instance, the blood values can be 20 to 2000 in the AA machine and 100 to 10000 in the BB machine. Since the same value can mean different things in different machines, the predictive effect will be poor if the model is trained using data of different numeric intervals.
The invention is directed to a data processing method for machine learning and an electronic device using the same. During the process of selecting the detection items of the original measuring data, the original measuring data from different sources are properly treated, so that the prediction accuracy of the machine learning model can be increased. The quantity-balanced and value-scaled original measuring data possess excellent extendibility and are beneficial to the training and amendment of the machine learning model.
According to one embodiment of the present invention, a data processing method for machine learning is provided. The data processing method includes the following steps. For a plurality of sources, a source balancing procedure is performed on an original measuring data to obtain a balanced distribution map. In the balanced distribution map, the quantity of data items obtained from each of the sources is identical, and the original measuring data comprises a plurality of detection values of a plurality of subjects corresponding to a plurality of detection items. For each of the subjects, a personalization scaling procedure is performed on the detection values to obtain a personalized scaled measuring data. In the personalized scaled measuring data, the detection values of each of the subjects are scaled to the same numeric interval. For each of the sources, a source scaling procedure is performed on the detection values to obtain a by-source scaled measuring data. In the by-source scaled measuring data, the detection values of each of the sources are scaled to the same numeric interval. The balanced distribution map, the personalized scaled measuring data and the by-source scaled measuring data are combined to obtain a balanced personalized scaled data and a balanced by-source scaled data. The balanced personalized scaled data and the balanced by-source scaled data are split into a plurality of splits, each corresponding to all of the sources. Each of the splits is sampled to obtain a predictive ability table through analysis. The predictive ability table comprises a predictive ability of each of the detection items. Based on the predictive ability table, some of the detection items are outputted. The outputted detection items are used for a machine learning model to perform model building, training or prediction inference.
According to another embodiment of the present invention, an electronic device is provided. The electronic device includes a source quantity balancing unit, a personalization scaling unit, a source scaling unit, a combination unit and an extraction unit. The source quantity balancing unit is used to, for a plurality of sources, perform a source balancing procedure on an original measuring data to obtain a balanced distribution map. In the balanced distribution map, the quantity of data items obtained from each of the sources is identical. The original measuring data includes a plurality of detection values of a plurality of subjects corresponding to a plurality of detection items. The personalization scaling unit is used to, for each of the subjects, perform a personalization scaling procedure on the detection values to obtain a personalized scaled measuring data. In the personalized scaled measuring data, the detection values of each of the subjects are scaled to the same numeric interval. The source scaling unit is used to, for each of the sources, perform a source scaling procedure on the detection values to obtain a by-source scaled measuring data. In the by-source scaled measuring data, the detection values of each of the sources are scaled to the same numeric interval. The combination unit is used to combine the balanced distribution map, the personalized scaled measuring data and the by-source scaled measuring data to obtain a balanced personalized scaled data and a balanced by-source scaled data. The extraction unit includes a splitter, a calculator and a selector. The splitter is used to split the balanced personalized scaled data and the balanced by-source scaled data into a plurality of splits, each corresponding to all of the sources. The calculator is used to sample each of the splits and obtain a predictive ability table through analysis. The predictive ability table comprises a predictive ability of each of the detection items.
The selector is used to, based on the predictive ability table, output some of the detection items. The outputted detection items are used for a machine learning model to perform model building, training or prediction inference.
The above and other aspects of the invention will become better understood with regard to the following detailed description of the preferred but non-limiting embodiment(s). The following description is made with reference to the accompanying drawings.
Technical terms are used in the specification with reference to the prior art used in the technology field. For any terms described or defined in the specification, the descriptions and definitions in the specification shall prevail. Each embodiment of the present disclosure has one or more technical features. Given that each embodiment is implementable, a person ordinarily skilled in the art can selectively implement or combine some or all of the technical features of any embodiment of the present disclosure.
Referring to
The model building of the machine learning model MD requires a large volume of data for training purposes. To increase the data volume of the original measuring data DT0, the original measuring data DT0 are normally obtained from different sources SR. The sources SR may include different institutes DM (such as different hospitals or different research institutes) or different equipment EP.
However, the original measuring data DT0 obtained from different sources SR may contain different numeric intervals. If the original measuring data DT0 are directly used to train the machine learning model MD, the predictive effect of the machine learning model MD will be poor.
Referring to
In the present embodiment, the process of selecting the detection items IT of the original measuring data DT0 includes performing suitable quantity balancing and value scaling treatments on the original measuring data DT0 of different sources SR, so that prediction accuracy of the machine learning model MD can be increased. The quantity balanced and value scaled original measuring data DT0 possess excellent extendibility and are beneficial to the training and amendment of the machine learning model MD.
Refer to
The original measuring data DT0 includes the subject US and the source SR. As indicated in Table 1, of the original measuring data DT0, 3 items are obtained from the “1” source SR, 3 items are obtained from the “2” source SR, and 4 items are obtained from the “3” source SR. For the quantity of data items obtained from each of the sources SR to be identical, the data volume of the original measuring data DT0 can be increased by an up-sampling process. For instance, as indicated in Table 2, the “CC” subject US (corresponding to the “1” source SR) is sampled twice, and the “EE” subject US (corresponding to the “2” source SR) is likewise sampled twice, so that the quantity of data items obtained from each of the “1” source SR, the “2” source SR, and the “3” source SR is identically 4, and the balanced distribution map MP is obtained. In the balanced distribution map MP, the quantity of data items corresponding to each of the sources SR is identical.
Besides, the problem of the same subject US being sampled too many times can be resolved through a weighting arrangement. For instance, each time a subject US is sampled, a weight of 1/a can be assigned to this subject US.
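The up-sampling and 1/a weighting described above might be sketched as follows. This is only an illustrative assumption of one possible implementation: the function name, the subject names other than “CC” and “EE”, and the choice of which subject gets re-sampled (here, simply the first in each source's pool, rather than “CC” and “EE” as in Table 2) are not specified by the disclosure; a is taken to be the number of times a subject was sampled.

```python
from collections import Counter

def balance_sources(records):
    """Up-sample so that every source contributes the same number of records.

    records: list of (subject, source) pairs, as in Table 1.
    Returns (balanced, weights), where weights maps each record to 1/a,
    with a = the number of times that subject was sampled.
    """
    counts = Counter(src for _, src in records)
    target = max(counts.values())  # raise every source to the largest count
    balanced = list(records)
    for src, n in counts.items():
        pool = [r for r in records if r[1] == src]
        for i in range(target - n):
            balanced.append(pool[i % len(pool)])  # re-sample existing subjects
    sample_counts = Counter(balanced)
    weights = {rec: 1 / a for rec, a in sample_counts.items()}
    return balanced, weights

# Counts per source follow Table 1 (3, 3, 4 items); subject names are illustrative.
records = [("AA", 1), ("BB", 1), ("CC", 1),
           ("DD", 2), ("EE", 2), ("FF", 2),
           ("GG", 3), ("HH", 3), ("II", 3), ("JJ", 3)]
balanced, weights = balance_sources(records)
```

After balancing, every source contributes 4 records, and any subject sampled twice carries a weight of 1/2.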
Referring to Table 3, a particular set of original measuring data DT0 is shown. The original measuring data DT0 include a plurality of detection values VL (such as “325”, “270”, . . . ) of a plurality of subjects US (such as “AA”, “BB”, . . . ) corresponding to a plurality of detection items IT (such as “Protein 1”, “Protein 2”, . . . ).
Next, the method proceeds to step S120, for each of the subjects US, a personalization scaling procedure P2 is performed on the detection values VL by the personalization scaling unit 120 to obtain a personalized scaled measuring data DT1. Referring to Table 4, the personalized scaled measuring data DT1 are shown.
As indicated in Table 4, in the personalized scaled measuring data DT1, the detection values VL of each of the subjects US are scaled to the same numeric interval. For instance, of the detection values VL (such as “325”, “270”, and “200”) of the “AA” subject US, “325” is scaled as “1”, “200” is scaled as “0”, and “270” is scaled by proportion as “0.56”. Of the detection values VL (such as “155”, “810”, and “310”) of the “BB” subject US, “810” is scaled as “1”, “155” is scaled as “0”, and “310” is scaled by proportion as “0.24”. By the same analogy, the maximum and minimum values of the detection values VL of the same subject US are respectively scaled as 1 and 0, and the remaining detection values VL are scaled by proportion. Thus, the differences between the detection items IT of each of the subjects US are maintained, and the numeric interval of the detection values VL of every subject US becomes identical.
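The per-subject min-max scaling above can be sketched as follows, using the “AA” subject's values from Table 3 (the function name is an illustrative assumption):

```python
def personalize_scale(values):
    """Min-max scale one subject's detection values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# "AA" subject from Table 3: 325 -> 1, 200 -> 0, 270 -> 0.56 by proportion
aa = personalize_scale([325, 270, 200])
```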
Then, the method proceeds to step S130, for each of the sources SR, a source scaling procedure P3 is performed on the detection values VL by the source scaling unit 130 to obtain a by-source scaled measuring data DT2. Referring to Table 5, the by-source scaled measuring data DT2 are shown.
In the by-source scaled measuring data DT2, the detection values VL of each of the sources SR are scaled to the same numeric interval. The source scaling unit 130 performs a z-score transform on the detection values VL of the same source SR. For instance, the mean value of the detection values VL (such as “325”, “155”, and “160”) of the “1” source SR is “213.3”, and the standard deviation is “78.99”. “325” is scaled as “1.41”: first, the mean value is subtracted from “325”, then the difference is divided by the standard deviation. Likewise, “155” is scaled as “−0.74”, and “160” is scaled as “−0.67”. By the same analogy, the z-score transform can also be performed on the detection values VL (such as “270”, “30”, and “265”) of the “2” source SR. Thus, the differences between the subjects US of each of the sources SR are maintained, and the numeric interval of the detection values VL of every source SR becomes identical.
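The per-source z-score transform can be sketched as follows, using the “1” source's values from Table 3. Note that reproducing the stated figures (“78.99”, “1.41”, “−0.74”) requires the population standard deviation (dividing by n rather than n−1); the function name is an illustrative assumption:

```python
import math

def source_scale(values):
    """z-score transform of one source's detection values (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# "1" source from Table 3: mean 213.3, std 78.99; 325 -> 1.41, 155 -> -0.74
src1 = source_scale([325, 155, 160])
```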
Through the above steps, the balanced distribution map MP as indicated in Table 2, the personalized scaled measuring data DT1 as indicated in Table 4, and the by-source scaled measuring data DT2 as indicated in Table 5 are obtained.
Then, the method proceeds to step S140, the balanced distribution map MP, the personalized scaled measuring data DT1 and the by-source scaled measuring data DT2 are combined by the combination unit 140 to obtain a balanced personalized scaled data DT1′ and a balanced by-source scaled data DT2′. Referring to Table 6, the balanced personalized scaled data DT1′ and the balanced by-source scaled data DT2′ are shown.
Then, the method proceeds to steps S151 to S153. Referring to
In step S151, the balanced personalized scaled data DT1′ and the balanced by-source scaled data DT2′ are split into a plurality of splits SP by the splitter 151 of the extraction unit 150, wherein each of the splits SP corresponds to all of the sources SR.
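One way to guarantee that every split SP covers all of the sources SR is to deal each source's records round-robin across the splits. The following sketch assumes this strategy, which the disclosure does not prescribe; the function and variable names are illustrative:

```python
from collections import defaultdict

def split_by_source(records, n_splits):
    """Split records so that every split contains records from every source.

    records: list of (subject, source) pairs. The records of each source
    are dealt round-robin across the splits (an illustrative assumption).
    """
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec[1]].append(rec)
    splits = [[] for _ in range(n_splits)]
    for src, recs in by_source.items():
        for i, rec in enumerate(recs):
            splits[i % n_splits].append(rec)
    return splits

# Example: 4 records per source, split into 2 splits covering all 3 sources.
recs = [(f"S{i}", src) for src in (1, 2, 3) for i in range(4)]
splits = split_by_source(recs, 2)
```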
Referring to
Then, the method proceeds to step S152, each of the splits SP is sampled by the calculator 152 of the extraction unit 150 to obtain a predictive ability table TB through analysis. Referring to Table 7, the predictive ability table TB is shown.
The predictive ability table TB includes the predictive ability of each of the detection items IT. The calculator 152 can perform random sampling on each of the splits SP. For instance, to calculate the predictive ability, each of the splits SP is sampled 10 times. Then, the mean value can be calculated to obtain the values of Table 7.
Next, the method proceeds to step S153, where some of the detection items IT* are outputted by the selector 153 based on the predictive ability table TB. Referring to Table 8, the total scores of predictive ability of each detection item IT are shown.
The selector 153 adds up the predictive abilities over the splits SP to obtain a total score of predictive ability for each of the detection items IT, then outputs the two detection items IT* whose total scores are the highest. Take Table 8 for instance: the “Protein 1” of the balanced by-source scaled data DT2′ and the “Protein 3” of the balanced personalized scaled data DT1′ are outputted.
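The selection rule of the selector 153 can be sketched as follows. The item labels and scores below are hypothetical placeholders, not the actual values of Tables 7 and 8; the function name is also an illustrative assumption:

```python
def select_items(predictive_tables, top_k=2):
    """Sum each item's predictive ability across the splits; return the
    top_k detection items by total score.

    predictive_tables: one dict per split, mapping item -> predictive ability.
    """
    totals = {}
    for table in predictive_tables:
        for item, score in table.items():
            totals[item] = totals.get(item, 0.0) + score
    return sorted(totals, key=totals.get, reverse=True)[:top_k]

# Hypothetical per-split scores (two splits, three candidate items).
tables = [
    {"Protein 1 (by-source)": 0.9, "Protein 3 (personalized)": 0.8, "Protein 2": 0.3},
    {"Protein 1 (by-source)": 0.7, "Protein 3 (personalized)": 0.9, "Protein 2": 0.4},
]
chosen = select_items(tables)
```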
Referring to Table 9, the “Protein 1” of the balanced by-source scaled data DT2′ and the “Protein 3” of the balanced personalized scaled data DT1′, which are lastly outputted, are shown.
The outputted detection items IT* are used for the machine learning model MD to perform model building, training or prediction inference. As indicated in
Refer to Table 10. The technology of the present disclosure possesses excellent extendibility. Each time new source data are added, the existing model can directly be used for prediction, or the new source data can directly be added to the training data, without having to amend the data format or add a column for the new source. In the predictive experiment of physical cognitive decline syndrome (PODS), the model trained with the technology of the present disclosure using combined data of various sources produces higher accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the ROC curve than the model trained using mono-source data.
According to the above embodiments, the process of selecting the detection items IT of the original measuring data DT0 includes performing a source balancing procedure P1, a personalization scaling procedure P2, and a source scaling procedure P3 on the original measuring data DT0 of different sources SR, so that prediction accuracy of the machine learning model MD can be increased. Moreover, the quantity balanced and value scaled original measuring data DT0 possess excellent extendibility and are beneficial to the training and amendment of the machine learning model MD.
The characteristics of some implementations or examples for implementing the present disclosure are disclosed above. Some specific examples describing the elements and disposition of the present disclosure (such as the values and names) are provided to simplify or exemplify some implementations of the present disclosure. The elements and configurations are for exemplary purposes only, not for limiting the present disclosure. Moreover, the designations and/or alphabets can be repeated in some implementations of the present disclosure for the purpose of clarity and simplicity without specifying the relationships between various implementations and/or configurations of the present disclosure.
While the invention has been described by way of example and in terms of the preferred embodiment(s), it is to be understood that the invention is not limited thereto. Based on the technical features of the embodiments of the present invention, a person ordinarily skilled in the art will be able to make various modifications and similar arrangements and procedures without departing from the spirit and scope of protection of the invention. Therefore, the scope of protection of the present invention should be accorded with what is defined in the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 113102037 | Jan 2024 | TW | national |