This application claims the benefit of Taiwan Patent Application No. 099141008, filed on Nov. 26, 2010, which is hereby incorporated by reference for all purposes as if fully set forth herein.
1. Field of Invention
The present invention relates to a data imputation system and method, and more particularly to a system and a method for imputing missing values and a computer program product thereof.
2. Related Art
Nowadays, in the collection and processing of data for biological and medical use, a large volume of data is usually collected at remote ends or from different places and is then summarized, processed and analyzed. For example, one technology for collecting gene data uses a chip or an inspection apparatus to inspect tissues of a living body or to collect physiological signals of a living body, for example, cells, body fluid, or physiological signals of biological motion of an animal or a plant, as well as various other gene expression data. The gene expression data is recorded in a data matrix in a storage unit of the chip or inspection apparatus.
However, when gene expression data is collected for medical analysis as described above, gene expression values may be missing. Currently, if a value is missing from the gene expression data, many analyses cannot be carried out, so the gene expression data is considered invalid and the incomplete data transactions are deleted. If too many data transactions are deleted, however, the analysis becomes inaccurate or cannot be carried out at all, and in this case the most commonly used remedy is to collect the gene expression data again with the same or a different chip or inspection apparatus. Obviously, both collecting the data again and using other chips or inspection apparatuses waste precious medical data. On the other hand, current data imputation technologies are mostly based on linear regression, neural networks and K-nearest neighbors (KNN). However, linear regression and neural networks are difficult to apply to categorical data, and if different value imputation technologies are used for correlated data matrixes, the analytical result will be doubtful. Moreover, KNN is not applicable to data matrixes with a large data volume because it requires a long time for searching data, and thus has a rather narrow range of applications.
Therefore, how to provide a value imputation method that is applicable to various data matrixes, does not require a long time for data processing, and has a low error rate is a problem to be considered by manufacturers.
Accordingly, the present invention is directed to a system and a method for imputing missing values of unknown data attributes by pairing highly similar data transactions to obtain correlated initial estimated data and a computer program product thereof.
To solve the above system problems, the present invention provides a system for imputing missing values, comprising a storage unit and a computing device. The storage unit stores a data matrix, the data matrix comprises a plurality of data transactions and a plurality of data attributes, the data transactions comprise a plurality of complete data transactions and a plurality of incomplete data transactions, and each incomplete data transaction comprises at least one unknown data. The computing device comprises an analysis program and a processor, and the processor is for reading and using the analysis program to analyze the data matrix.
The processor finds at least one target data transaction approximate to each incomplete data transaction from the complete data transactions, obtains at least one known data from the at least one target data transaction to compute an initial estimated data, uses the initial estimated data to replace the corresponding unknown data and serve as a plurality of data to be corrected, finds a specific data to be corrected from the data to be corrected, selects a first designated data attribute and a second designated data attribute respectively having an approximate variation with the specific data to be corrected from the data attributes, finds a data transaction group according to data in the transaction to which the specific data to be corrected belongs in a manner of grouping same data into one group, divides the data transactions into a plurality of subgroups according to an attribute combination of the data transaction group and the second designated data attribute in a manner of grouping same data into one group, finds at least one target group having data matching the data transaction group from the subgroups, uses data of the specific data attribute to be corrected corresponding to the at least one target group to compute an imputed data for imputing the attribute of the specific data to be corrected, and judges whether the transaction to which the specific data to be corrected belongs has other data to be corrected, so as to determine whether to designate another specific data to be corrected.
To solve the above method problems, the present invention provides a method for imputing missing values, applicable to a data matrix, wherein the data matrix comprises a plurality of data transactions and a plurality of data attributes. The method comprises: finding a plurality of complete data transactions and a plurality of incomplete data transactions from the data matrix, each incomplete data transaction comprising at least one unknown data; respectively obtaining at least one target data transaction approximate to each incomplete data transaction from the complete data transactions; obtaining at least one known data from the at least one target data transaction corresponding to the incomplete data transaction according to an attribute position of each unknown data in the incomplete data transaction, and using the at least one known data to compute an initial estimated data; using the initial estimated data to replace the corresponding unknown data and serve as a plurality of data to be corrected; designating a specific data to be corrected from the data to be corrected, the transaction to which the specific data to be corrected belongs being a correction data transaction; selecting a first designated data attribute having the most approximate variation with the specific data to be corrected from the data attributes, and finding a data transaction group comprising the correction data transaction according to data in the transaction to which the specific data to be corrected belongs in a manner of grouping same data into one group; selecting a second designated data attribute having a secondary approximate variation with the specific data to be corrected from the data attributes, and dividing the data transactions into a plurality of subgroups according to an attribute combination of the attribute to which the specific data to be corrected belongs and the second designated data attribute in a manner of grouping same data into one group; finding at least one target group having data matching the data transaction group from the subgroups, and using data of the specific data attribute to be corrected corresponding to the at least one target group to compute an imputed data for imputing the attribute of the specific data to be corrected; and judging whether the transaction to which the specific data to be corrected belongs has other data to be corrected, so as to determine whether to designate another specific data to be corrected.
The present invention further provides a computer program product, read by a computing device to execute the above method for imputing missing values, and the process is as described above, so that the details will not be described herein again.
The present invention is characterized in that, by combining a Pearson Correlation Coefficient (PCC) with a rough set, a two-stage data imputation technology first imputes high-precision estimated data and then corrects the imputed data, which helps to improve the accuracy and validity of analysis. Furthermore, because such a technology imputes the missing values, a large amount of data can be retained, so that the imputed data can be applied to more data analyses rather than being discarded. This avoids repeated collection of gene expression data, thereby saving medical resources, labor and technical cost.
The present invention will become more fully understood from the detailed description given herein below for illustration only, and thus is not limitative of the present invention, and wherein:
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The computing device 20 may be an ordinary electronic device with data processing capability, such as various types of computers, personal computers, notebook computers, servers, workstations or personal digital assistants (PDAs). The storage unit 10 may be an element or apparatus with storage capability, such as a chip, memory, hard disk or flash drive, and may also be disposed in or integrated with other apparatuses, such as various inspection apparatuses (for inspecting a biopsy to generate various inspection data), health care boxes (for collecting various physiological signals of human body) or signal collection apparatuses (for collecting various signals).
As shown in
A plurality of complete data transactions and a plurality of incomplete data transactions are found from the data matrix, each incomplete data transaction including at least one unknown data (Step S110). As shown in
It is assumed that the data matrix 11a includes 10 data transactions, in which the 4th, 5th, and 9th data transactions are complete data transactions, the 1st, 2nd, 3rd, 6th, 7th, 8th, and 10th data transactions are incomplete data transactions, and each incomplete data transaction includes at least one unknown data 71 (represented as 0 in the figure), for example, the unknown data of the 1st data transaction is the 3rd attribute, the unknown data of the 2nd data transaction is the 1st attribute, the unknown data of the 3rd data transaction is the 4th attribute, the unknown data of the 6th data transaction is the 2nd and 3rd attributes, and so on.
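By way of a non-limiting illustration, Step S110 may be sketched in Python as follows. The matrix values, the width of five attributes, the function name `split_transactions`, and the positions of the unknown data in the 7th, 8th and 10th data transactions are assumptions made only for this sketch; only the unknown positions of the 1st, 2nd, 3rd and 6th data transactions and the use of 0 as the marker for unknown data follow the description above.

```python
# Hypothetical 10 x 5 data matrix; 0 marks unknown data, as in the figure.
# All numeric values are invented for illustration.
UNKNOWN = 0

data_matrix = [
    [3, 1, 0, 2, 4],  # 1st: 3rd attribute unknown
    [0, 2, 3, 1, 4],  # 2nd: 1st attribute unknown
    [2, 3, 1, 0, 4],  # 3rd: 4th attribute unknown
    [1, 2, 3, 4, 2],  # 4th: complete
    [4, 3, 2, 1, 3],  # 5th: complete
    [2, 0, 0, 3, 1],  # 6th: 2nd and 3rd attributes unknown
    [1, 4, 2, 3, 0],  # 7th: unknown position assumed
    [3, 0, 2, 4, 1],  # 8th: unknown position assumed
    [2, 1, 4, 3, 2],  # 9th: complete
    [4, 2, 1, 0, 3],  # 10th: unknown position assumed
]


def split_transactions(matrix, unknown=UNKNOWN):
    """Return (indices of complete rows, {incomplete row index: unknown columns})."""
    complete, incomplete = [], {}
    for row_index, row in enumerate(matrix):
        missing = [col for col, value in enumerate(row) if value == unknown]
        if missing:
            incomplete[row_index] = missing
        else:
            complete.append(row_index)
    return complete, incomplete


# Indices are 0-based, so the 4th, 5th and 9th transactions are rows 3, 4 and 8.
complete, incomplete = split_transactions(data_matrix)
```

Using 0 as the marker only works when 0 is not a legitimate measured value; a dedicated sentinel such as `None` would be a safer choice in practice.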
At least one target data transaction approximate to each incomplete data transaction is respectively obtained from the complete data transactions (Step S120). This step is described with reference to
A complete data curve of each complete data transaction is established (Step S121), and an incomplete data curve of each incomplete data transaction is established (Step S122).
Here, each complete data transaction is analyzed first, and data of the complete data transaction is projected to a two-dimensional coordinate system, so as to obtain a complete data curve corresponding to each complete data transaction. Likewise, each incomplete data transaction is analyzed, the existence of unknown data is ignored, and data of the incomplete data transaction is projected to a two-dimensional coordinate system, so as to obtain an incomplete data curve corresponding to each incomplete data transaction.
Similarities between each incomplete data curve and the complete data curves are compared, so as to find at least one approximate target data curve corresponding to each incomplete data curve from the complete data curves (Step S123). Here, each incomplete data curve is compared with all the complete data curves, and after the incomplete data curves are compared with the complete data curves one by one, approximation ratios of the most-relevant complete data curves corresponding to the incomplete data curves are generated. Afterwards, according to the approximation ratios, at least one approximate target data curve can be obtained by pairing with each incomplete data curve.
Afterwards, at least one target data transaction most approximate to each incomplete data transaction is found by pairing the incomplete data curves with the target data curves (Step S124). The target data curves are those generated by mapping the target data transactions to the two-dimensional coordinate system as described herein, so pairing of the incomplete data transactions and the target data transactions can be derived from pairing of the incomplete data curves with the target data curves.
However, Step S120 may also adopt a method of comparing the numerical values of attributes of the same order, so as to determine data differences, and thus data similarities, between the incomplete data transactions and the complete data transactions. In this way, each incomplete data transaction is paired with the complete data transactions having high similarities to it. Since this method is well known to those of ordinary skill in the art of data comparison, the details will not be described herein.
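The attribute-by-attribute comparison mentioned above may, for example, be sketched as follows. The use of the mean absolute difference as the measure of data difference, the parameter `k`, and the function names are assumptions of this sketch rather than features recited in the description.

```python
def attribute_distance(incomplete_row, complete_row, unknown=0):
    """Mean absolute difference over the attributes known in the incomplete row."""
    known = [i for i, value in enumerate(incomplete_row) if value != unknown]
    if not known:
        return float("inf")  # nothing to compare against
    total = sum(abs(incomplete_row[i] - complete_row[i]) for i in known)
    return total / len(known)


def nearest_complete(incomplete_row, complete_rows, k=1, unknown=0):
    """Indices of the k complete rows with the smallest data difference."""
    ranked = sorted(
        range(len(complete_rows)),
        key=lambda j: attribute_distance(incomplete_row, complete_rows[j], unknown),
    )
    return ranked[:k]


# Hypothetical data: the third attribute of the incomplete row is unknown.
complete_rows = [[1, 2, 3, 4], [5, 5, 5, 5], [1, 2, 9, 4]]
targets = nearest_complete([1, 2, 0, 5], complete_rows, k=2)
```

Unknown positions are skipped, so the two complete rows that differ only in the third attribute are judged equally similar to the incomplete row.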
At least one known data is obtained from the target data transaction corresponding to the incomplete data transaction according to an attribute position of each unknown data in the incomplete data transaction, and is used to compute an initial estimated data (Step S130), and the initial estimated data is used to replace the corresponding unknown data and serve as a plurality of data to be corrected (Step S140).
In this step, the initial estimated data for an unknown data is a mean of the known data, at the same attribute position, of the target data transactions corresponding to the incomplete data transaction to which the unknown data belongs. For example, data of the data transactions shown in
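The mean-based initial estimate of Steps S130 and S140 may be sketched as follows. The function name `initial_estimates`, the sample values and the use of 0 as the marker for unknown data are assumptions of this sketch.

```python
def initial_estimates(incomplete_row, target_rows, unknown=0):
    """Replace each unknown value with the mean of the target rows at that position."""
    filled = list(incomplete_row)
    for col, value in enumerate(incomplete_row):
        if value == unknown:
            known = [row[col] for row in target_rows if row[col] != unknown]
            if known:  # leave the value unknown if no target row provides data
                filled[col] = sum(known) / len(known)
    return filled


# Hypothetical incomplete transaction with two unknown attributes and
# two target data transactions paired with it in Step S120.
estimated = initial_estimates([2, 0, 0, 3], [[1, 4, 6, 3], [3, 2, 8, 3]])
```

The values written into the unknown positions are the data to be corrected in the subsequent steps; the known values of the incomplete data transaction are left untouched.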
Then, an operation of correcting the initial estimated data is performed, as shown in
Then, a first designated data attribute having the most approximate variation with the specific data to be corrected is selected from the data attributes, and a data transaction group including the correction data transaction is found according to data in the transaction to which the specific data to be corrected belongs in a manner of grouping same data into one group (Step S160). The data variation of the attribute to which the specific data to be corrected belongs is based on data benefit values of each attribute, and as for computation of the data benefit values, reference is made to
Hence, {cor(1, the number of unknown data attributes of the correction data transaction), cor(2, the number of unknown data attributes of the correction data transaction), cor(4, the number of unknown data attributes of the correction data transaction), cor(5, the number of unknown data attributes of the correction data transaction)}={0.867, −0.419, −0.062, 0.600}, in which the number of unknown data attributes of the correction data transaction 81 is 3. It can be seen from this embodiment that the 1st data attribute has the highest data benefit value, and thus the 1st data attribute is considered the first designated data attribute 83. Therefore, for data of the 1st data attribute, all the data transactions are divided into groups in a manner of grouping same data into one group, that is, as shown in
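The selection of the designated data attributes and the grouping of same data may be sketched as follows. The data benefit values are those quoted above; the sample matrix, the helper `group_by_same_data`, and the reading of the "secondary approximate variation" as the second-highest data benefit value are assumptions of this sketch.

```python
from collections import defaultdict

# Data benefit values quoted in the description, keyed by 1-based attribute
# number; the 3rd attribute is the one being corrected.
benefit = {1: 0.867, 2: -0.419, 4: -0.062, 5: 0.600}

# Rank attributes from highest to lowest data benefit value.
ranked = sorted(benefit, key=benefit.get, reverse=True)
first_attr, second_attr = ranked[0], ranked[1]


def group_by_same_data(matrix, attrs):
    """Group row indices whose values in the given 1-based attributes are identical."""
    groups = defaultdict(list)
    for row_index, row in enumerate(matrix):
        key = tuple(row[a - 1] for a in attrs)
        groups[key].append(row_index)
    return dict(groups)


# Hypothetical matrix; row 0 plays the role of the correction data transaction.
matrix = [
    [2, 1, 3, 4, 2],
    [2, 3, 3, 1, 2],
    [1, 2, 4, 4, 3],
    [2, 2, 1, 3, 1],
]
groups = group_by_same_data(matrix, [first_attr])
correction_row = 0
transaction_group = next(rows for rows in groups.values() if correction_row in rows)
```

Here the data transaction group consists of every row that shares the value of the first designated data attribute with the correction data transaction.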
A second designated data attribute having a secondary approximate variation with the specific data to be corrected is selected from the data attributes, and the data transactions are divided into a plurality of subgroups according to an attribute combination of the attribute to which the specific data to be corrected belongs and the second designated data attribute in a manner of grouping same data into one group (Step S170).
In this step, in order to reduce the complexity of data comparison, for the data attribute formed by the attribute to which the specific data to be corrected 82 of the correction data transaction belongs, all the data transactions may be divided into groups in a manner of grouping same data into one group. As shown in
As for
At least one target group having data matching the data transaction group is found from the subgroups, and data of the attribute of the specific data to be corrected corresponding to the at least one target group is used to compute an imputed data for imputing the attribute of the specific data to be corrected (Step S180). This step is performed in the following manner: when a data transaction of a specific group in the subgroups is consistent with any data transaction in the data transaction group, judging that the specific group is the target group; and at this time, designating data attributes to be corrected as designated data attributes.
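One possible reading of Step S180 may be sketched as follows. The description does not fix how the imputed data is computed from the target groups, so averaging the attribute values of the matching transactions, excluding the correction data transaction itself, is an assumption of this sketch, as are the function names, the 0-based column indices and the sample values.

```python
from collections import defaultdict


def group_rows(matrix, cols):
    """Partition row indices into groups with identical values in `cols`."""
    groups = defaultdict(set)
    for r, row in enumerate(matrix):
        groups[tuple(row[c] for c in cols)].add(r)
    return list(groups.values())


def impute_from_target_groups(matrix, row, target_col, first_col, second_col):
    """Correct matrix[row][target_col] from subgroups overlapping the transaction group."""
    # Data transaction group: rows sharing the first designated attribute value.
    transaction_group = {
        r for r, other in enumerate(matrix)
        if other[first_col] == matrix[row][first_col]
    }
    # Subgroups: same data in the (corrected attribute, second attribute) combination.
    subgroups = group_rows(matrix, (target_col, second_col))
    # Target groups: subgroups containing a member of the data transaction group.
    target_groups = [g for g in subgroups if g & transaction_group]
    donors = {r for g in target_groups for r in g} - {row}
    if not donors:
        return matrix[row][target_col]  # keep the initial estimate
    # Assumption: the imputed data is the mean over the donor transactions.
    return sum(matrix[r][target_col] for r in donors) / len(donors)


# Hypothetical data; the value 11 in row 0 is an initial estimate to be corrected.
sample = [[1, 5, 11], [1, 5, 10], [1, 6, 12], [2, 6, 12], [2, 7, 20]]
imputed = impute_from_target_groups(sample, row=0, target_col=2,
                                    first_col=0, second_col=1)
```

For categorical data, the mean in the last step would have to be replaced by, for example, the most frequent value.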
As shown in
Afterwards, it is judged whether the transaction to which the specific data to be corrected belongs has other data to be corrected (Step S190). When all the data to be corrected have been corrected, the operation is ended; otherwise, another specific data to be corrected is designated, that is, the process returns to Step S150, and Step S150 to Step S190 are repeated until all the data to be corrected are corrected.
Reference is made to
Likewise, through Step S110 to Step S140, all the unknown data of the data matrix shown in
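The PCC equation is as follows:
$$S_{u,v} = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}}$$
where $I = I_u \cap I_v$,
and where u and v respectively represent two data transactions, $r_{u,i}$ and $r_{v,i}$ are respectively values of the ith attribute of the uth and vth transactions, and $\bar{r}_u$ and $\bar{r}_v$ are respectively mean attribute values of the uth and vth transactions.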
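The PCC between two data transactions may be computed, for example, as sketched below. Treating 0 as unknown data, taking the means over the commonly known attributes only, and returning 0.0 when the coefficient is undefined are assumptions of this sketch.

```python
import math


def pcc(r_u, r_v, unknown=0):
    """Pearson correlation of two transactions over their commonly known attributes."""
    common = [i for i in range(len(r_u)) if r_u[i] != unknown and r_v[i] != unknown]
    if len(common) < 2:
        return 0.0  # too few shared attributes to correlate
    mean_u = sum(r_u[i] for i in common) / len(common)
    mean_v = sum(r_v[i] for i in common) / len(common)
    numerator = sum((r_u[i] - mean_u) * (r_v[i] - mean_v) for i in common)
    spread_u = math.sqrt(sum((r_u[i] - mean_u) ** 2 for i in common))
    spread_v = math.sqrt(sum((r_v[i] - mean_v) ** 2 for i in common))
    if spread_u == 0 or spread_v == 0:
        return 0.0  # a constant transaction has no defined correlation
    return numerator / (spread_u * spread_v)


# Hypothetical transactions; the second attribute of the first one is unknown.
similarity = pcc([1, 0, 3, 5], [2, 9, 6, 10])
```

A value near 1 indicates that the two data transactions vary in the same way over their shared attributes, which makes the complete one a candidate target data transaction.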
Next, a result is estimated according to a target attribute value of the most similar transaction, and a commonly used equation is defined as follows:
$$P_{u,i} = \bar{r}_u + \frac{\sum_{v \in U} S_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in U} \lvert S_{u,v} \rvert}$$
where U is the set of all users similar to u,
and where $P_{u,i}$ is a target attribute value of the ith attribute of the uth transaction, $\bar{r}_u$ is a mean attribute value of the uth transaction, $S_{u,v}$ represents the similarity between the uth transaction and the vth transaction, and taking
In the foregoing embodiment, data of the data transactions are numerical data, and the initial estimated data 72′ is a mean of the correlated known data of the target data transaction corresponding to the incomplete data transaction to which the unknown data 71′ to be replaced by the initial estimated data 72′ belongs. In this embodiment, by contrast, data of the data transactions are categorical data, and the initial estimated data 72′ is the data having the highest frequency of occurrence in the correlated known data of the target data transaction corresponding to the incomplete data transaction to which the unknown data 71′ to be replaced by the initial estimated data 72′ belongs. For example, assuming that the target data transactions corresponding to the 5th data transaction are the 1st data transaction to the 4th data transaction, and L has the highest frequency of occurrence in the 1st attributes of these data transactions, it is estimated that the value of the 1st attribute of the 5th data transaction is L.
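For categorical data, the initial estimate by highest frequency of occurrence may be sketched as follows. The category labels other than L, the sample rows, the function name and the use of `None` as the marker for unknown data are assumptions of this sketch.

```python
from collections import Counter


def most_frequent_known(target_rows, col, unknown=None):
    """Most frequent known value in column `col` of the target data transactions."""
    known = [row[col] for row in target_rows if row[col] != unknown]
    if not known:
        return unknown  # no target row provides data for this attribute
    # On ties, Counter.most_common keeps the value encountered first.
    return Counter(known).most_common(1)[0][0]


# Hypothetical 1st to 4th data transactions serving as targets for the 5th.
targets = [
    ["L", "M", "H"],
    ["L", "H", "H"],
    ["H", "M", "L"],
    ["L", "M", "M"],
]
estimate = most_frequent_known(targets, 0)
```

With these sample rows, L occurs three times in the 1st attribute, so L becomes the initial estimated data for the 1st attribute of the 5th data transaction.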
Similarly, after the initial estimated data 72′ is preliminarily imputed to the second data matrix shown in
In this embodiment, Step S150 to Step S190 may be performed with reference to the prior art, for example, [T. P. Hong, L. H. Tseng, and S. L. Wang, “Learning rules from incomplete training examples by rough sets”, Expert Systems with Applications, Vol. 22, pp. 285, 2002].
The invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
099141008 | Nov 2010 | TW | national |