The present invention relates to a multi-class classification method, a multi-class classification program, and a multi-class classification device which select a feature amount and classify a sample into any of a plurality of classes based on a value of the selected feature amount, and a feature amount selection method, a feature amount selection device, and a feature amount set which are used for such multi-class classification.
In recent years, although the application and expansion of machine learning in the industrial field have progressed, feature selection and multi-class classification remain major issues. There are various feature selection methods, and an example focusing on pairwise couplings of classes has been proposed (see “Feature selection for multi-class classification using pairwise class discriminatory measure and covering concept”, Hyeon Ji et al., ELECTRONICS LETTERS, 16 Mar. 2000, vol. 36, No. 6, p. 524-525). The technology disclosed in this document is a method that, noting that the basic form of class classification is “binary-class classification” with two classes, uses pairwise couplings of the classes to focus on the discrimination ability of each feature amount and performs the selection on that basis.
In addition, as a method of multi-class classification, for example, a one-versus-one (OVO) method in which two-class discrimination is repeated is known.
In addition, in the field of biotechnology, for example, methods of the feature selection and the multi-class classification are actively studied for cancer and the like. Generally, these studies apply general machine learning methods; for example, feature selection by a t-test, information gain, or the like, and classification by a support vector machine (SVM), random forest, naive Bayes, or the like are applied. Such a technology is disclosed in JP2012-505453A, for example.
The study disclosed in “Feature selection for multi-class classification using pairwise class discriminatory measure and covering concept”, Hyeon Ji et al., ELECTRONICS LETTERS, 16 Mar. 2000, vol. 36, No. 6, p. 524-525 is limited to feature selection only, and uses an existing method as it is in the subsequent multi-class classification. In addition, the extension to a set cover problem, which will be described below for the present invention, is not specified. Also, independence between feature amounts for selecting robust feature amounts is not verified, only basic multi-class classification is assumed, and discrimination unneeded classes are not introduced. Therefore, it is difficult to apply the study as it is to the extended multi-class classification. Similarly, in the technology disclosed in JP2012-505453A, examining a gene cluster needed for discrimination as the set cover problem is not considered.
In addition, in the methods of repeating the two-class discrimination and performing the multi-class classification, the problem that “higher ranking cannot be trusted” is pointed out in a voting method. In addition, in the tournament hierarchy method, the problem that it is difficult to decide a comparison order is pointed out.
In the feature amount selection and the multi-class classification in the field of biotechnology, in the case of methods based on mRNA expression levels, which are frequently reported, there is a problem that “the accuracy drops in a case in which the number of classes handled reaches about 10”. For example, in one report that develops a multi-class cancer classifier based on mutation information, the result is that 5 types of cancer can be discriminated with an F-measure exceeding 0.70. The feature selection and the multi-class classification based on DNA methylation are also studied. However, the applied classes remain small in number, and the trials have small sample sizes.
In recent years, there are also studies that apply deep learning; however, in the first place, learning does not proceed well due to the underdetermined nature of omics data (the sample size is small relative to the number of parameters; there are hundreds of thousands of methylated sites, whereas fewer than 10,000 open-data tumor records are available). Even in a case in which learning succeeds, there is a problem that the result is difficult to accept in diagnostic applications because the reason for discrimination cannot be clarified.
As described above, in the related art, a sample having a plurality of feature amounts cannot be robustly and highly accurately classified into any of a plurality of classes based on a value of a part of the selected feature amount.
The present invention has been made in view of such circumstances, and is to provide a multi-class classification method, a multi-class classification program, and a multi-class classification device which can robustly and highly accurately classify a sample having a plurality of feature amounts into any of a plurality of classes based on a value of a part of the selected feature amount. In addition, the present invention is to provide a feature amount selection method, a feature amount selection device, and a feature amount set used for such multi-class classification.
A first aspect of the present invention relates to a feature amount selection method of selecting a feature amount group to be used for determining which of N (two or more) classes a sample belongs to, the method comprising an input step of inputting a learning data set including a known sample group belonging to a given class, which is a target, and a feature amount group of the known sample group, and a selection step of selecting a feature amount group needed for class determination for an unknown sample of which a belonging class is unknown, from the feature amount group based on the learning data set, in which the selection step includes a quantification step of quantifying, by a pairwise coupling that combines two classes among the N classes, a discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group by using the learning data set, and an optimization step of totalizing the quantified discrimination possibilities for all the pairwise couplings and selecting a combination of the feature amount groups for which a result of the totalization is optimized.
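The quantification step and the optimization step of the first aspect can be illustrated by the following minimal sketch. It is not the implementation of the present invention: the function names `quantify` and `select_greedy`, the toy learning data, the non-overlapping-range criterion used as the discrimination possibility, and the greedy search are all assumptions introduced for illustration.

```python
from itertools import combinations

def quantify(learning_data, n_features):
    """Quantification step (sketch).

    learning_data maps class label -> list of samples (each a list of
    feature values). For each feature, each pairwise coupling (a, b) gets 1
    if the value ranges of the two classes do not overlap (an assumed,
    deliberately crude criterion), else 0.
    """
    pairs = list(combinations(sorted(learning_data), 2))
    table = []
    for f in range(n_features):
        rng = {c: (min(s[f] for s in rows), max(s[f] for s in rows))
               for c, rows in learning_data.items()}
        table.append({(a, b): int(rng[a][1] < rng[b][0] or rng[b][1] < rng[a][0])
                      for a, b in pairs})
    return pairs, table

def select_greedy(pairs, table, n_select):
    """Optimization step (sketch): repeatedly add the feature that most
    improves the totalized value of the currently weakest couplings.
    Assumes n_select does not exceed the number of features."""
    total = {p: 0 for p in pairs}
    chosen = []
    for _ in range(n_select):
        weakest = min(total.values())

        def gain(f):
            return sum(table[f][p] for p in pairs if total[p] == weakest)

        best = max((f for f in range(len(table)) if f not in chosen), key=gain)
        chosen.append(best)
        for p in pairs:
            total[p] += table[best][p]
    return chosen, total

# Hypothetical learning data: 3 classes, 2 samples each, 3 feature amounts.
data = {
    "A": [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8]],
    "B": [[0.9, 0.2, 0.9], [0.8, 0.3, 0.8]],
    "C": [[0.1, 0.9, 0.1], [0.2, 0.8, 0.2]],
}
pairs, table = quantify(data, n_features=3)
chosen, total = select_greedy(pairs, table, n_select=2)
```

In this toy data, no single feature amount separates all three pairwise couplings, but the two selected feature amounts together cover every coupling at least once.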
A second aspect relates to the feature amount selection method according to the first aspect, in which the selection step further includes a first marking step of marking a part of the given classes as first discrimination unneeded class groups that do not need to be discriminated from each other, and a first exclusion step of excluding the pairwise coupling in the marked first discrimination unneeded class groups from pairwise couplings to be expanded.
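The marking and exclusion steps of the second aspect can be sketched as follows, assuming that classes are identified by string labels and that each discrimination unneeded class group is given as a list of labels. The function name `expand_pairs` and the class names are hypothetical.

```python
from itertools import combinations

def expand_pairs(classes, unneeded_groups=()):
    """Expand pairwise couplings, skipping every pair whose two classes
    belong to the same discrimination unneeded class group."""
    skip = set()
    for group in unneeded_groups:
        skip.update(frozenset(p) for p in combinations(group, 2))
    return [(a, b) for a, b in combinations(classes, 2)
            if frozenset((a, b)) not in skip]

# Hypothetical example: the three normal tissues need not be told apart.
classes = ["cancer_1", "cancer_2", "normal_1", "normal_2", "normal_3"]
pairs = expand_pairs(classes, unneeded_groups=[["normal_1", "normal_2", "normal_3"]])
```

Of the 10 pairwise couplings of the 5 classes, the 3 couplings inside the marked group are excluded, so that 7 couplings remain to be covered by the selected feature amounts.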
A third aspect relates to the feature amount selection method according to the first or second aspect, in which the selection step further includes a similarity evaluation step of evaluating similarity between the feature amounts based on the discrimination possibility for each pairwise coupling of each feature amount, and a priority setting step of setting a priority of the feature amount to be selected, based on an evaluation result of the similarity.
A fourth aspect relates to the feature amount selection method according to the third aspect, in which the similarity is an overlap relationship and/or an inclusion relationship of the discrimination possibility for each pairwise coupling.
A fifth aspect relates to the feature amount selection method according to the third or fourth aspect, in which the similarity is a distance between discrimination possibility vectors for each pairwise coupling or a metric value in accordance with the distance.
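As a sketch of the similarity evaluation in the third to fifth aspects, assume that each feature amount is represented by a binary vector with one entry per pairwise coupling (1 where the feature amount discriminates the coupling). The particular measures below (a Jaccard-style overlap, an inclusion test, and the Hamming distance) are illustrative choices and are not specified in the text.

```python
def overlap(u, v):
    """Jaccard-style overlap of the couplings that two features discriminate."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

def includes(u, v):
    """True if every coupling discriminated by v is also discriminated by u."""
    return all(a >= b for a, b in zip(u, v))

def hamming(u, v):
    """Distance between two binary discrimination possibility vectors."""
    return sum(a != b for a, b in zip(u, v))

# Hypothetical vectors over four pairwise couplings.
f1 = [1, 1, 0, 1]
f2 = [1, 0, 0, 1]
f3 = [0, 0, 1, 0]
```

Here `f1` includes `f2`, so `f2` adds no new coupling once `f1` is selected and could be given a lower priority, whereas `f3` is complementary to `f1` and could be given a higher priority.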
A sixth aspect relates to the feature amount selection method according to any one of the first to fifth aspects, further comprising a selected number input step of inputting a selected number M of the feature amounts in the selection step, in which the optimization is maximizing a minimum value of a totalized value in all the pairwise couplings in accordance with M selected feature amounts.
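The optimization of the sixth aspect (maximizing the minimum totalized value over all pairwise couplings with M selected feature amounts) can be written down directly for a tiny example. The exhaustive search below is only an assumption for illustration; it is not feasible for realistic numbers of feature amounts, and the table values are hypothetical.

```python
from itertools import combinations

def min_total(table, subset):
    """Smallest totalized discrimination possibility over all couplings.
    table[f][p] is the quantified value of feature f for coupling p."""
    n_pairs = len(table[0])
    return min(sum(table[f][p] for f in subset) for p in range(n_pairs))

def best_subset_maxmin(table, m):
    """Exhaustively choose m features maximizing the minimum totalized value."""
    return max(combinations(range(len(table)), m),
               key=lambda s: min_total(table, s))

# Hypothetical table: 4 features (rows) x 3 pairwise couplings (columns).
table = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
best = best_subset_maxmin(table, 2)
```

Features 0 and 1 are individually strong but redundant, so the max-min criterion prefers pairing one of them with a feature that covers the remaining coupling.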
A seventh aspect relates to the feature amount selection method according to any one of the first to sixth aspects, in which the optimization step includes an importance input step of inputting importance of the class or pairwise discrimination, and a weighting step of performing weighting based on the importance in a case of the totalization.
An eighth aspect relates to the feature amount selection method according to any one of the first to seventh aspects, in which the number of the feature amounts selected in the selection step is 25 or more.
A ninth aspect relates to the feature amount selection method according to the eighth aspect, in which the number of the feature amounts selected in the selection step is 50 or more.
A tenth aspect relates to the feature amount selection method according to the ninth aspect, in which the number of the feature amounts selected in the selection step is 100 or more.
An eleventh aspect of the present invention relates to a feature amount selection program causing a computer to execute the feature amount selection method according to any one of the first to tenth aspects.
A twelfth aspect of the present invention relates to a multi-class classification method of determining, in a case in which N is an integer of 2 or more, which of N classes a sample belongs to, from a feature amount of the sample, the method comprising the input step and the selection step executed by using the feature amount selection method according to any one of the first to tenth aspects, and a determination step of performing the class determination for the unknown sample based on the selected feature amount group, which includes an acquisition step of acquiring a feature amount value of the selected feature amount group and a class determination step of performing the class determination based on the acquired feature amount value, in which, in the determination step, the class determination for the unknown sample is performed by configuring a multi-class discriminator that uses the selected feature amount group in association with the pairwise coupling.
The feature selection is particularly useful in a case in which referring to (including acquiring, storing, and the like) the feature amount of the sample incurs a cost (including time, expense, and the like). Therefore, for example, a unit that refers to the feature amount of the learning data and a unit that refers to the feature amount of the unknown sample may be different, and after selecting a small number of feature amounts, a suitable feature amount acquisition unit may be developed and prepared.
On the other hand, the multi-class classification (STEP 2) is a discrimination problem of deciding which of a plurality of classes the given unknown sample belongs to, and is a general problem in machine learning. It should be noted that many actual multi-class classifications are not simply a problem of selecting one of the N classes. For example, even in a case in which a plurality of classes are actually present, the discrimination itself may not be needed. Conversely, for example, in a sample set labeled as one class, a plurality of sample groups having different appearances may be mixed. A method that withstands such a complicated extended multi-class classification is desirable.
As the simplest feature selection method, it is conceivable to evaluate, by using the learning data set, all ways of selecting a small number of the feature amounts from a large number of candidate feature amounts. However, there is a risk of over-learning (overfitting) to the learning data set, and the number of candidates is too large to evaluate exhaustively, so some kind of framework is essential.
An example of applying the twelfth aspect of the present invention (multi-class classification involving the feature selection) to the field of biotechnology is shown. Cancer or a body tissue has a unique DNA methylated pattern. In addition, DNA liberated from the body tissue (cell free DNA: cfDNA) is mixed in human blood, and in particular, cfDNA derived from cancer is detected. Therefore, by analyzing the methylated pattern of cfDNA, it is possible to determine the presence or absence of the cancer and to specify the primary lesion in a case in which the cancer is present. That is, an early cancer screening test by blood sampling and guidance to an appropriate detailed test can be realized.
Therefore, the problem of discriminating “whether it is cancer or non-cancer” and the origin tissue from the DNA methylated pattern is extremely important. This problem can be defined as the multi-class classification problem that discriminates the cancer from blood or a normal tissue. However, there are many types of human organs (for example, 8 types of major cancers and 20 types or more of normal tissues), and there are subtypes of cancer, so that some cancers of the same organ have different aspects from each other; therefore, it can be said that the classification problem is difficult.
In addition, suppressing the measurement cost is desired from the assumption that it is used for the screening test, so that an expensive array that comprehensively measures methylated sites cannot be used as it is. Therefore, it is necessary to narrow down in advance a small number of sites needed for the discrimination from hundreds of thousands of DNA methylated sites, that is, the feature selection is needed in the previous stage.
Therefore, the technology (the method proposed in the present invention) of narrowing down to a small number of DNA methylated sites and configuring the feature selection and multi-class classification methods that can discriminate the cancer from the normal tissue and specify the origin tissue based on that small number of sites is useful. It should be noted that, since the number of ways of selecting a small number of DNA methylated sites from, for example, 300,000 sites can exceed 10 to the 1,000th power, it can be seen that a comprehensive search method cannot be used.
Therefore, the inventors of the present application propose a feature selection method that lists the DNA methylated sites acting like switches that contribute to robust discrimination, and that is based on a combination search sufficiently covering the pairwise discrimination of the needed classes. Further, the inventors of the present application propose a method of configuring a multi-class classifier from simple binary-class classifiers in combination with a tournament hierarchy method by using only a robust discrimination portion among the selected sites.
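The scale of the combinatorial search can be checked with a short calculation. The sketch below computes the base-10 logarithm of a binomial coefficient through the log-gamma function; the choice of 300 selected sites is an assumption made only for illustration and is not taken from the text.

```python
import math

def log10_combinations(n, k):
    """log10 of the binomial coefficient C(n, k), using log-gamma so that
    the huge intermediate factorials are never formed."""
    ln = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln / math.log(10)

# Assumed example: choosing 300 sites out of 300,000 candidate sites.
exponent = log10_combinations(300_000, 300)
```

Under this assumption, `exponent` is greater than 1,000, that is, the number of candidate combinations exceeds 10 to the 1,000th power, which is consistent with the statement that an exhaustive search is impractical.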
As a result, it is possible to support the multi-class classification involving the feature selection that incorporates various characteristics of actual problems. Actually, it can be applied to the multi-class classification that greatly exceeds 10 classes of cancer and normal, as seen in the example of cancer diagnosis described above. The feature amount selection and the multi-class classification methods proposed by the inventors of the present application are extremely useful in industry.
It should be noted that the present description is one specific case, and the application of the twelfth aspect of the present invention is not limited to the field of biotechnology. Actually, just as many common machine learning technologies are applicable to the field of biotechnology, there is no problem in applying a technology developed in the field of biotechnology to general machine learning problems.
A thirteenth aspect relates to the multi-class classification method according to the twelfth aspect, in which, in the quantification step, a statistically significant difference in the feature amounts in the learning data set between pairwise-coupled classes is used.
A fourteenth aspect relates to the multi-class classification method according to the twelfth or thirteenth aspect, in which, in the quantification step, in a case in which a feature amount of the unknown sample belonging to any of pairwise-coupled classes is given under a threshold value set with reference to the learning data set, a probability of correctly discriminating a class to which the unknown sample belongs by the given feature amount is used.
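One possible reading of the fourteenth aspect is sketched below for a single feature amount and a single pairwise coupling. The threshold rule (the midpoint of the two class means) and the use of the learning samples themselves to estimate the probability of correct discrimination are assumptions; the text does not fix either choice, and the function name and data are hypothetical.

```python
def threshold_accuracy(values_a, values_b):
    """Set a threshold midway between the class means of one feature and
    return (threshold, fraction of learning samples on the correct side)."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    threshold = (mean_a + mean_b) / 2
    # The class with the lower mean is expected below the threshold.
    low, high = (values_a, values_b) if mean_a <= mean_b else (values_b, values_a)
    correct = sum(v < threshold for v in low) + sum(v >= threshold for v in high)
    return threshold, correct / (len(values_a) + len(values_b))

# Hypothetical feature values of two pairwise-coupled classes.
class_a = [0.1, 0.2, 0.6]
class_b = [0.8, 0.9, 0.7]
threshold, accuracy = threshold_accuracy(class_a, class_b)
```

In this example one sample of the first class falls on the wrong side of the threshold, so the estimated probability of correct discrimination is 5/6.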
A fifteenth aspect relates to the multi-class classification method according to any one of the twelfth to fourteenth aspects, in which, in the quantification step, a quantification value of the discrimination possibility is a value obtained by performing multiple test correction on a statistical probability value by the number of feature amounts.
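The multiple test correction of the fifteenth aspect can be illustrated with the Bonferroni correction, which is one well-known correction and is used here only as an assumed example because the text does not name a specific method. The p-values and the significance level of 0.05 are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of
    tests (here, the number of feature amounts), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def discriminable(p_values, alpha=0.05):
    """Mark a feature as discriminating the pairwise coupling when its
    corrected p-value does not exceed alpha."""
    return [p <= alpha for p in bonferroni(p_values)]

# Hypothetical p-values of four feature amounts for one pairwise coupling.
p_values = [0.001, 0.02, 0.2, 0.0001]
marks = discriminable(p_values)
```

The second feature amount would pass an uncorrected test at 0.05 but is rejected after correction, which reduces the chance of selecting a feature amount that appears discriminative only by coincidence among many candidates.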
A sixteenth aspect relates to the multi-class classification method according to any one of the twelfth to fifteenth aspects, further comprising a subclass setting step of clustering one or more samples belonging to the classes based on a given feature amount from the learning data set to form a cluster and setting the formed cluster to a subclass in each class, a second marking step of marking each subclass in each class as second discrimination unneeded class groups that do not need to be discriminated from each other in each class, and a second exclusion step of excluding the pairwise coupling in the marked second discrimination unneeded class groups from pairwise couplings to be expanded.
A seventeenth aspect relates to the multi-class classification method according to any one of the twelfth to sixteenth aspects, in which the totalization is calculating a total value or an average value of quantitative values of the discrimination possibility.
An eighteenth aspect relates to the multi-class classification method according to any one of the twelfth to seventeenth aspects, further comprising a target threshold value input step of inputting a target threshold value T of a totalized value indicating a result of the totalization, in which the optimization is setting a minimum value of the totalized value in all the pairwise couplings in accordance with a selected feature amount to be equal to or more than the target threshold value T.
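The optimization of the eighteenth aspect resembles a set multi-cover problem: feature amounts are added until every pairwise coupling reaches the target threshold value T. The greedy procedure below is a sketch under that assumption and is not stated in the text to be the search actually used; the function name and table are hypothetical.

```python
def select_until_covered(table, target):
    """Greedily add features until every pairwise coupling has a totalized
    value of at least `target`. table[f][p] is the quantified value of
    feature f for coupling p. Returns None if the target is unreachable."""
    n_pairs = len(table[0])
    total = [0] * n_pairs
    remaining = set(range(len(table)))
    chosen = []
    while min(total) < target:

        def gain(f):
            # Count only couplings that still fall short of the target.
            return sum(table[f][p] for p in range(n_pairs) if total[p] < target)

        best = max(sorted(remaining), key=gain, default=None)
        if best is None or gain(best) == 0:
            return None
        remaining.remove(best)
        chosen.append(best)
        total = [t + v for t, v in zip(total, table[best])]
    return chosen

# Hypothetical table: 4 features (rows) x 3 pairwise couplings (columns).
table = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
chosen = select_until_covered(table, target=2)
```

With T = 2, three of the four feature amounts suffice in this example, whereas a target that no combination can reach is reported by returning `None`.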
A nineteenth aspect relates to the multi-class classification method according to any one of the twelfth to eighteenth aspects, in which, in the determination step, binary-class discriminators that each use the selected feature amount group in association with each pairwise coupling are configured, and the binary-class discriminators are combined to configure the multi-class discriminator.
A twentieth aspect relates to the multi-class classification method according to any one of the twelfth to nineteenth aspects, further comprising a step of evaluating a degree of similarity between the sample and each class by a binary-class discriminator, and a step of configuring the multi-class discriminator based on the degree of similarity.
A twenty-first aspect relates to the multi-class classification method according to any one of the twelfth to twentieth aspects, further comprising a step of evaluating a degree of similarity between the sample and each class by a binary-class discriminator, and a step of configuring the multi-class discriminator by applying the binary-class discriminator used for evaluation of the degree of similarity between classes having a higher degree of similarity again to the classes.
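The nineteenth to twenty-first aspects can be read together as the following sketch: binary-class discriminators tied to each pairwise coupling are combined, the number of pairwise wins serves as a degree of similarity between the sample and each class, and the discriminator is applied again to the two most similar classes. The class profiles, the function names, and the rule inside `decide` are hypothetical, and this is only one possible interpretation of the aspects.

```python
from itertools import combinations

# Hypothetical switch-like (binary) profiles of three classes over three features.
profiles = {"A": (1, 0, 0), "B": (1, 1, 0), "C": (0, 1, 1)}

def decide(sample, a, b):
    """Binary-class discriminator for the coupling (a, b): use only the
    features on which the two class profiles differ, and pick the class
    whose profile the binary sample matches more often (ties go to a)."""
    idx = [i for i, (x, y) in enumerate(zip(profiles[a], profiles[b])) if x != y]
    score_a = sum(sample[i] == profiles[a][i] for i in idx)
    score_b = len(idx) - score_a
    return a if score_a >= score_b else b

def classify(sample, classes):
    """Score each class by its pairwise wins, then apply the binary
    discriminator again to the two highest-scoring classes."""
    wins = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        wins[decide(sample, a, b)] += 1
    first, second = sorted(classes, key=lambda c: -wins[c])[:2]
    return decide(sample, first, second)

predicted = classify((1, 1, 0), sorted(profiles))
```

Each binary discriminator looks only at the feature amounts associated with its own pairwise coupling, which is what allows a simple discriminator to be used for every coupling.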
A twenty-second aspect relates to the multi-class classification method according to any one of the twelfth to twenty-first aspects, in which, in the determination step, a decision tree that uses the selected feature amount group in association with each pairwise coupling is configured, and one or more decision trees are combined to configure the multi-class discriminator.
A twenty-third aspect relates to the multi-class classification method according to the twenty-second aspect, in which, in the determination step, the multi-class discriminator is configured as a random forest by combining a plurality of the decision trees.
A twenty-fourth aspect relates to the multi-class classification method according to any one of the twelfth to twenty-third aspects, in which omics information of a living body tissue piece is measured to determine a class to which the living body tissue piece belongs from the N classes.
A twenty-fifth aspect relates to the multi-class classification method according to any one of the twelfth to twenty-fourth aspects, in which omics switch-like information of a living body tissue piece is measured to determine a class to which the living body tissue piece belongs from the N classes.
A twenty-sixth aspect relates to the multi-class classification method according to any one of the twelfth to twenty-fifth aspects, in which the number of classes to be discriminated is 10 or more.
A twenty-seventh aspect relates to the multi-class classification method according to the twenty-sixth aspect, in which the number of the classes to be discriminated is 25 or more.
A twenty-eighth aspect of the present invention relates to a multi-class classification program causing a computer to execute the multi-class classification method according to any one of the twelfth to twenty-seventh aspects. It should be noted that a non-transitory recording medium on which a computer-readable code of the program according to the twenty-eighth aspect is recorded can also be used as an aspect of the present invention.
A twenty-ninth aspect of the present invention relates to a feature amount selection device that selects a feature amount group to be used for determining which of N (two or more) classes a sample belongs to, the device comprising a first processor, in which the first processor performs input processing of inputting a learning data set including a known sample group belonging to a given class, which is a target, and a feature amount group of the known sample group, and selection processing of selecting a feature amount group needed for class determination for an unknown sample of which a belonging class is unknown, from the feature amount group based on the learning data set, and the selection processing includes quantification processing of quantifying, by a pairwise coupling that combines two classes among the N classes, a discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group by using the learning data set, and optimization processing of totalizing the quantified discrimination possibilities for all the pairwise couplings and selecting a combination of the feature amount groups for which a result of the totalization is optimized.
A thirtieth aspect of the present invention relates to a multi-class classification device that determines, in a case in which N is an integer of 2 or more, which of N classes a sample belongs to, from a feature amount of the sample, the device comprising the feature amount selection device according to the twenty-ninth aspect, and a second processor, in which the second processor performs the input processing and the selection processing using the feature amount selection device, and determination processing of performing the class determination for the unknown sample based on the selected feature amount group, which includes acquisition processing of acquiring a feature amount value of the selected feature amount group and class determination processing of performing the class determination based on the acquired feature amount value, and in the determination processing, the class determination for the unknown sample is performed by configuring a multi-class discriminator that uses the selected feature amount group in association with the pairwise coupling.
A thirty-first aspect of the present invention relates to a feature amount set that is used by a multi-class classification device to determine which of N (two or more) classes a sample belongs to, the feature amount set comprising a feature amount data set of the sample belonging to each class, which is a target, in which, in a case in which, by a pairwise coupling that combines two classes among the N classes, a discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group is quantified with reference to the feature amount data set, the feature amount set is marked to be discriminable by at least one feature amount in all the pairwise couplings.
A thirty-second aspect relates to the feature amount set according to the thirty-first aspect, in which, in a case in which, by the pairwise coupling that combines the two classes among the N classes, the discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group is quantified with reference to the feature amount data set, the feature amount set is marked to be discriminable by at least 5 or more feature amounts in all the pairwise couplings.
A thirty-third aspect relates to the feature amount set according to the thirty-first aspect, in which, in a case in which, by the pairwise coupling that combines the two classes among the N classes, the discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group is quantified with reference to the feature amount data set, the feature amount set is marked to be discriminable by at least 10 or more feature amounts in all the pairwise couplings.
A thirty-fourth aspect relates to the feature amount set according to the thirty-first aspect, in which, in a case in which, by the pairwise coupling that combines the two classes among the N classes, the discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group is quantified with reference to the feature amount data set, the feature amount set is marked to be discriminable by at least 60 or more feature amounts in all the pairwise couplings.
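Whether a given feature amount set satisfies the coverage conditions of the thirty-first to thirty-fourth aspects can be checked by counting, for each pairwise coupling, the feature amounts marked as discriminating it. The binary table below is a hypothetical example, and the function name is an assumption.

```python
def min_pairwise_coverage(table):
    """table[f][p] is 1 if feature f is marked as discriminating pairwise
    coupling p. Returns the smallest number of discriminating features
    over all couplings."""
    n_pairs = len(table[0])
    return min(sum(row[p] for row in table) for p in range(n_pairs))

# Hypothetical feature amount set: 4 features x 3 pairwise couplings.
table = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
coverage = min_pairwise_coverage(table)
```

In this example every coupling is discriminable by 3 feature amounts, so the set meets the condition of at least one feature amount but not the condition of at least 5.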
A thirty-fifth aspect relates to the feature amount set according to any one of the thirty-first to thirty-fourth aspects, in which the number of selected feature amounts is equal to or less than 5 times a presented minimum cover number.
A thirty-sixth aspect relates to the feature amount set according to any one of the thirty-first to thirty-fifth aspects, in which the number of classes to be discriminated is 10 or more.
A thirty-seventh aspect relates to the feature amount set according to the thirty-sixth aspect, in which the number of classes to be discriminated is 25 or more.
A thirty-eighth aspect relates to the feature amount set according to any one of the thirty-first to thirty-seventh aspects, in which the number of selected feature amounts is 25 or more.
A thirty-ninth aspect relates to the feature amount set according to the thirty-eighth aspect, in which the number of the selected feature amounts is 50 or more.
A fortieth aspect relates to the feature amount set according to the thirty-ninth aspect, in which the number of the selected feature amounts is 100 or more.
In the following, embodiments of a feature amount selection method, a feature amount selection program, a multi-class classification method, a multi-class classification program, a feature amount selection device, a multi-class classification device, and a feature amount set according to the present invention will be described with reference to the attached drawings.
<Schematic Configuration of Multi-Class Classification Device>
<Configuration of Processing Unit>
As shown in
The functions of the units of the processing unit 100 can be realized using various processors and a recording medium. The various processors include, for example, a central processing unit (CPU) which is a general-purpose processor which executes software (program) to realize various functions. In addition, the various processors also include a graphics processing unit (GPU) which is a processor specialized for image processing, and a programmable logic device (PLD) which is a processor of which a circuit configuration can be changed after manufacturing, such as a field programmable gate array (FPGA). In a case in which learning or recognition of the image is performed, the configuration using the GPU is effective. Further, the various processors also include a dedicated electric circuit which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an application specific integrated circuit (ASIC).
The functions of the units may be realized by one processor, or may be realized by a plurality of processors of the same type or different types (for example, a plurality of FPGAs, or a combination of the CPU and the FPGA, or a combination of the CPU and the GPU). In addition, a plurality of the functions may be realized by one processor. As an example of configuring the plurality of functions with one processor, first, as represented by a computer, there is a form in which one processor is configured by a combination of one or more CPUs and software, and the processor realizes the plurality of functions. Second, as represented by a system-on-chip (SoC) or the like, there is a form in which a processor that realizes the functions of the entire system with one integrated circuit (IC) chip is used. As described above, various functions are configured by one or more of the various processors as the hardware structure. Further, the hardware structure of these various processors is more specifically an electric circuit (circuitry) in which circuit elements, such as semiconductor elements, are combined. The electric circuit may be an electric circuit that realizes the functions using a logical sum, a logical product, a logical negation, an exclusive logical sum, and a logical operation of a combination thereof.
In a case in which the processor or the electric circuit executes software (program), a code readable by a computer (for example, various processors or electric circuits constituting the processing unit 100 and/or a combination thereof) of the executed software is stored in a non-transitory recording medium, such as the ROM 118, and the computer refers to the software. The software stored in the non-transitory recording medium includes a program (feature amount selection program and multi-class classification program) for executing the feature amount selection method and/or the multi-class classification method according to the embodiment of the present invention, and data (data related to acquisition of the learning data, and data used for the feature amount selection and the class determination) used in a case of execution. The code may be recorded in a non-transitory recording medium, such as various magneto-optical recording devices and a semiconductor memory, instead of the ROM 118. In a case of the processing using the software, for example, the RAM 120 is used as a transitory storage region, and the data stored in, for example, an electrically erasable and programmable read only memory (EEPROM) (not shown) can also be referred to. The storage unit 200 may be used as the “non-transitory recording medium”.
Details of the processing by the processing unit 100 having the configuration described above will be described below.
<Configuration of Storage Unit>
The storage unit 200 is configured by various storage devices, such as a hard disk and a semiconductor memory, and a control unit thereof, and can store the learning data set described above, an execution condition of the selection processing and the class determination processing and results thereof, the feature amount set, and the like. The feature amount set is the feature amount set that is used by the multi-class classification device 10 to determine which of N (N is an integer of 2 or more) classes the sample belongs to and comprises the feature amount data set of the sample belonging to each class, which is a target, in which, in a case in which, by a pairwise coupling that combines two classes among the N classes, a discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group is quantified with reference to the feature amount data set, the feature amount set is marked to be discriminable by at least one feature amount in all the pairwise couplings. The feature amount set can be generated by an input step (input processing) and a selection step (selection processing) in the feature amount selection method (feature amount selection device) according to the embodiment of the present invention. In addition, the feature amount set is preferably marked as discriminable with at least 5 feature amounts, more preferably marked as discriminable with at least 10 feature amounts, and still more preferably marked as discriminable with at least 60 feature amounts. In addition, the feature amount set is effective in a case in which the number of classes to be discriminated is 10 or more, and further effective in a case in which the number of classes is 25 or more. In addition, it is effective in a case in which the number of the selected feature amounts is 50 or more, and further effective in a case in which the number of the selected feature amounts is 100 or more.
<Configuration of Display Unit>
The display unit 300 comprises a monitor 310 (display device) configured by a display, such as a liquid crystal display, and can display the acquired learning data and the result of the selection processing and/or the class determination processing. The monitor 310 may be configured by a touch panel type display, and may receive command input by a user.
<Configuration of Operation Unit>
The operation unit 400 comprises a keyboard 410 and a mouse 420, and the user can perform operations related to execution of the multi-class classification method according to the embodiment of the present invention, result display, and the like via the operation unit 400.
<1. Processing of Feature Amount Selection Method and Multi-Class Classification Method>
The selection step includes a quantification step (step S112) of quantifying, by a pairwise coupling that combines two classes among the N classes, the discrimination possibility between the two classes in accordance with each feature amount of the selected feature amount group by using the learning data set, and an optimization step (step S114) of totalizing the quantified discrimination possibilities for all the pairwise couplings and selecting a combination of the feature amount groups for which a result of the totalization is optimized. In addition, in the determination step, the class determination for the unknown sample is performed by configuring the multi-class discriminator that uses the selected feature amount group in association with the pairwise coupling.
<2. Basic Policy of Present Invention>
The present invention is particularly suitable in a case of selecting the feature amount having a characteristic close to a binary value, and in a case of deciding the class by combining such feature amounts like a “switch”. That is, the present invention does not target a case in which the class is quantitatively tied to the feature amounts linearly or non-linearly. Even so, the problem is not always simple, and it is a sufficiently complicated problem in a case in which there are many switches. Therefore, the present invention is based on the policy of “searching for and selecting a combination of a large number of the feature amounts having a switch-like function, and configuring the multi-class classifier with a simple classifier”.
It should be noted that, in the learning data set, every sample is given values of a plurality of common feature amounts (for example, methylated sites) (note that some “missing values” may be included as values: hereinafter, referred to as NA) and one correct answer class label (for example, cancer or non-cancer, and tissue classification) (input (input step and input processing: step S100) of the learning data set by the input processing unit 102 is performed).
In addition, although the above assumption is made here for the sake of simplicity, in a case in which some of the samples are not given a correct answer class label, so-called semi-supervised learning may be incorporated. Since this is a combination with known methods, two typical processing examples are shown briefly, and these can be used in combination: a method (1) of, as preprocessing, giving a class label to each sample to which the correct answer class label is not given, based on data comparison with the samples to which the correct answer class label is given; and a method (2) of repeating a cycle of learning once with the data to which the class label is given, estimating the belonging class of the other unlabeled samples, regarding the class labels estimated with high accuracy as the “correct label”, adding these samples to the learning data, and performing learning again.
<2.1 Feature Amount Selection Method>
In the present chapter, the selection of the feature amount (step S110: selection step) by the selection processing unit 104 (quantification processing unit 106 and optimization processing unit 108) will be described. First, the principle of feature amount selection (selection step, selection processing) according to the embodiment of the present invention will be described in a simplified case. Next, methods of extending this principle step by step will be described. Finally, the procedure of the feature amount selection that incorporates all the extensions is summarized. It should be noted that all the feature amounts described in the present chapter refer to the learning data.
<2.2 Principle of Feature Amount Selection: Reduction to Set Cover Problem>
First, the principle of the feature amount selection (selection step) for the multi-class classification will be described. For the sake of simplicity in the present section, it is assumed that the values of all the feature amounts of the samples belonging to the same class exactly match, and the feature amounts take a fixed value of binary (0 or 1).
In a case in which a value of a feature amount i of a class s is denoted by Xi(s), “the classes s and t can be discriminated by the selected feature set f” means that any feature amount is different, that is, Expression (1) is satisfied.
∃i∈f, Xi(s)≠Xi(t) (1)
Therefore, a necessary and sufficient condition that all the given classes C={1, 2, . . . , N} can be discriminated from each other is that Expression (2) is satisfied.
∀{s,t|s≠t∈C}, ∃i∈f, Xi(s)≠Xi(t) (2)
Here, a class binary relationship is pairwise expanded, the exclusive logical sum Yi(k) (see Expression (3)) of the binary feature amounts i of the classes s and t is introduced for the pair k={s, t} ∈ P2(C) in the binary combination, and this is called a “discrimination switch”.
Yi(k={s,t})=xor(Xi(s), Xi(t)) (3)
From the above, the necessary and sufficient condition that all the given classes C can be discriminated from each other can be rewritten as Expression (4).
∀k∈P2(C), ∃i∈f, Yi(k)=1 (4)
That is, in a case in which the whole feature set is denoted by F, the feature amount selection for the multi-class classification can be reduced to the set cover problem of selecting a subset f⊆F satisfying Expression (4).
It should be noted that the “set cover problem” can be defined as, for example, “a problem of selecting a subset of S such that it includes (covers) all the elements of U at least once in a case in which the set U and the subset S of a power set of U are given” (other definitions are possible).
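The relationship between Expressions (3) and (4) can be illustrated by a minimal sketch under the simplifying assumption of the present section (one fixed binary value per class and feature amount). The class names, the toy values, and the function names `discrimination_switches` and `covers_all_pairs` are hypothetical and are introduced only for illustration.

```python
from itertools import combinations

def discrimination_switches(X):
    """X maps each class to a tuple of binary feature values (one fixed value per class)."""
    classes = sorted(X)
    n_features = len(X[classes[0]])
    pairs = list(combinations(classes, 2))  # pairwise expansion P2(C)
    # Y[i][k] = xor(X_i(s), X_i(t)) for the pair k = (s, t), as in Expression (3)
    Y = [{(s, t): X[s][i] ^ X[t][i] for (s, t) in pairs} for i in range(n_features)]
    return pairs, Y

def covers_all_pairs(f, pairs, Y):
    """Expression (4): every pair has at least one selected feature whose switch is 1."""
    return all(any(Y[i][k] == 1 for i in f) for k in pairs)

# Toy data (hypothetical): 3 classes, 3 binary feature amounts
X = {"A": (0, 0, 1), "B": (0, 1, 1), "C": (1, 1, 1)}
pairs, Y = discrimination_switches(X)
print(covers_all_pairs({0}, pairs, Y))     # feature 0 alone cannot separate A and B
print(covers_all_pairs({0, 1}, pairs, Y))  # features 0 and 1 together cover every pair
```

In this toy case, feature 2 takes the same value in every class, so its switch is 0 for all pairs and it never contributes to the cover.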
Here, the switch set Ii={k|Yi(k)=1} for the feature amount i is a subset of the binary combination P2(C) of the classes. Therefore, the family of sets I={Ii|i ∈ F} corresponding to the whole feature set F is a subset of the power set of P2(C). That is, the problem is “a problem of selecting the subset (corresponding to f) of I such that it includes all elements of P2(C) at least once in a case in which the subset I (corresponding to F) of the power set of P2(C) is given”, that is, it can be regarded as the set cover problem. Specifically, it is necessary to select the feature amount (and/or a combination thereof) such that at least one discrimination switch value is “1” for all pairs that are pairwise expanded. In the cases of
<2.3 Substitute Exclusive Logical Sum with Quantitative Value of Discrimination Possibility>
Here, in a case in which the feature amount is originally the binary value, the feature amount and its representative value (median value or the like) may be regarded as the discrimination possibility as it is. It should be noted that, in general, the feature amount is not limited to the binary value, and even samples belonging to the same class can fluctuate to various values. Therefore, the quantification processing unit 106 (selection processing unit 104) desirably substitutes the discrimination switch value (exclusive logical sum) with the quantitative value (quantification value) of discrimination possibility based on the feature amount of the learning data set.
First, the quantification processing unit 106 estimates the distribution parameter θi(s) and the distribution D(θi(s)) of the class s and the feature amount i from the measurement value group of the feature amount i of the sample belonging to the class s (step S112: quantification step). It is particularly desirable to quantify the discrimination possibility from the distribution or the distribution parameter. It should be noted that, the sample of which the value of the feature amount is NA need only be excluded from the quantitative processing. Of course, in a case in which all the samples are NA, the feature amounts cannot be used.
For example, the quantification processing unit 106 can obtain a p-value by performing a statistical test on the presence or absence of the significant difference between the pairwise parameter θi(s) and θi(t), specifically, can use Welch's t-test. The Welch's t-test is a method that assumes a normal distribution and is a general-purpose applicable method (as an image, the significant difference is determined depending on whether the feature amount distribution of s and t is close to any of
It should be noted that, in a case in which there are a particularly large number of feature amount candidates and the determination is repeated for the whole feature set F, a multiple comparison problem arises. Therefore, the quantification processing unit 106 desirably corrects a p-value group obtained for the same pairwise k={s, t} to a so-called q-value group (step S112: quantification step). Examples of the method of multiple test correction include the Bonferroni method and the BH method [Benjamini, Y., and Y. Hochberg, 1995], and the more desirable method is the latter, which performs correction based on a so-called false discovery rate (FDR), but the method is not limited to this.
As shown in Expression (5), the quantification processing unit 106 compares the obtained q-value with a predetermined reference value α and assigns 0 or 1 to the discrimination switch (particularly, a case in which the discrimination switch is 1 is called “marked”).
It should be noted that, the discrimination switch is discretized and binarized from the standpoint of extending the set cover problem, but continuous variables may be handled, for example, by setting the discrimination switch to 1-q.
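The correction and thresholding described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the p-values are hypothetical, the function names are introduced here, and the comparison “q < α” is an assumption because the exact form of Expression (5) is not reproduced in this text. The adjustment follows the usual BH step-up procedure.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted values for the p-values of one class pair."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])  # indices by ascending p
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values stay monotone
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q[idx] = running_min
    return q

def switches_from_q(q_values, alpha):
    """Assign 1 ("marked") when q is below the reference value alpha, else 0 (assumed rule)."""
    return [1 if q < alpha else 0 for q in q_values]

p = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values, one per feature amount
q = bh_q_values(p)
Y = switches_from_q(q, alpha=0.03)
print(q, Y)
```

For the continuous variant mentioned above, the list `[1 - v for v in q]` could be used in place of the binarized switches.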
Further, since the p-value or the q-value is the statistical difference and not the probability that the sample can be discriminated, the quantification processing unit 106 may further perform, under the appropriate threshold value which is set with reference to the learning data set, the quantification by the probability that the belonging class can be correctly discriminated by the feature amount in a case in which the feature amount is given to the unknown sample belonging to any of the pairwise-coupled classes. In addition, the quantification processing unit 106 may correct such a statistical probability value by the multiple test correction in accordance with the number of the feature amounts.
Further, in addition to the reference related to the statistical test, a reference such as having a certain difference in the average value may be added or used as a substitute. Of course, various statistics other than the average value and the standard deviation may be used as the reference.
<2.4 Extension of Set Cover Problem to Optimization Problem, Such as Maximizing Minimum Pairwise Cover Number>
In a case in which the feature amount is a random variable, even in a case in which the discrimination switch is marked, it is not always possible to reliably discriminate the corresponding pairwise. Therefore, it is desirable to extend the set cover problem.
Therefore, as shown in Expression (6), the quantification processing unit 106 (selection processing unit 104) totals the quantitative value of the discrimination possibilities by using discrimination redundancy as the pairwise cover number Zf(k) (calculation of the totalized value as the total value; step S112, quantification step).
Zf(k∈P2(C))=Σi∈f Yi(k) (6)
The definition of Zf(k) is not limited to the definition shown in Expression (6). For example, for the continuous variable version of Yi(k), it may be defined as the product of (1−Yi(k)) as the probability that all discriminations will fail, or the success probability of at least U discriminations may be calculated from Yi(k) by using a certain appropriate threshold value U. In addition, an average value of individual discrimination possibilities may be calculated. In this way, various totalizing methods can be considered.
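The totalizing methods listed above can be compared in a short sketch for a single pair k. The function names are hypothetical, and treating the individual discriminations as independent events is an assumption made only for this illustration.

```python
from math import prod

def cover_number_sum(y):
    """Expression (6): sum of the switch values of the selected feature amounts."""
    return sum(y)

def all_fail_probability(y):
    """Product of (1 - Y_i(k)): probability that every discrimination fails."""
    return prod(1.0 - v for v in y)

def prob_at_least(y, u):
    """Probability of at least u successful discriminations (independence assumed)."""
    dist = [1.0]  # dist[j] = probability of exactly j successes so far
    for p in y:
        nxt = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            nxt[j] += d * (1.0 - p)   # this discrimination fails
            nxt[j + 1] += d * p       # this discrimination succeeds
        dist = nxt
    return sum(dist[u:])

def cover_mean(y):
    """Average of the individual discrimination possibilities."""
    return sum(y) / len(y)

y_binary = [1, 0, 1]        # binarized switches for one pair
y_continuous = [0.5, 0.5]   # hypothetical continuous values, e.g. 1 - q
print(cover_number_sum(y_binary))
print(all_fail_probability(y_continuous), prob_at_least(y_continuous, 1))
```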
Next, from the standpoint that “it is desirable to reduce the bottleneck of the discrimination as much as possible”, the optimization processing unit 108 (selection processing unit 104) can reduce the feature amount selection problem to the problem of maximizing the minimum pairwise cover number, with the number of the feature amounts to be selected denoted by m, by Expression (7) (step S114: optimization step, optimization processing).
arg maxf⊆F,|f|=m min{Zf(k)|k∈P2(C)} (7)
The above is an example of the reduction in a case in which the selected number of the feature amounts is determined (in a case in which the selected number M of the feature amounts is input, that is, in a case in which the selected number input step/processing is performed). On the contrary, the optimization processing unit 108 (selection processing unit 104) may set the threshold value (target threshold value T) for the minimum pairwise cover number (minimum value of the totalized value of the discrimination possibilities) (target threshold value input step/processing), and may select the feature amounts so as to satisfy the threshold value (step S114: optimization step/processing, selection step/processing). In this case, of course, it is desirable that the number of the feature amounts to be selected be smaller, and it is particularly preferable that the number of the feature amounts be the minimum.
Alternatively, various optimization methods, such as combining these two, can be considered.
Since the set cover problem is a field that is actively studied, there are various solutions. The problem of maximizing the minimum cover number, which is an extension of this, can be handled in almost the same procedure. It should be noted that, since it is generally an NP-complete problem, it is not easy to obtain an exact solution.
Therefore, of course, it is desirable to obtain the exact solution and literally solve the problem of maximizing the minimum pairwise cover number and the problem of achieving the set cover number with the minimum feature amount, but the optimization processing unit 108 (selection processing unit 104) may use a method of increasing the cover number as much as possible, reducing the number of the selected feature amounts as much as possible, or obtaining a local minimum, by a heuristic method.
Specifically, for example, the optimization processing unit 108 (selection processing unit 104) may adopt a simple greedy search procedure. In addition to the minimum pairwise cover number of the currently selected feature set, “a method of sequentially defining the i-th smallest pairwise cover number as the i-th rank pairwise cover number, and sequentially selecting the feature amount that maximizes the i-th rank pairwise cover number, giving priority to smaller i” can be considered.
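One possible reading of the greedy procedure above is sketched below. Comparing the ascending list of pairwise cover numbers lexicographically prefers a larger minimum first, then a larger second smallest value, and so on, which corresponds to the i-th rank pairwise cover number. The function name, the toy switch values, and the tie-breaking rule (smallest feature index) are assumptions for illustration, and a greedy search does not guarantee an exact solution.

```python
def greedy_select(Y, pairs, m):
    """Y[i][k] in {0, 1}; greedily pick m feature amounts for the max-min cover problem."""
    selected = []
    cover = {k: 0 for k in pairs}          # current pairwise cover numbers Z_f(k)
    remaining = set(range(len(Y)))
    for _ in range(min(m, len(Y))):
        def rank_key(i):
            # Ascending cover numbers after adding feature i (1st rank, 2nd rank, ...)
            return sorted(cover[k] + Y[i][k] for k in pairs)
        # max() keeps the first maximum, so ties go to the smallest feature index
        best = max(sorted(remaining), key=rank_key)
        selected.append(best)
        remaining.remove(best)
        for k in pairs:
            cover[k] += Y[best][k]
    return selected, min(cover.values())

# Hypothetical discrimination switches for 4 feature amounts and 3 class pairs
pairs = ["AB", "AC", "BC"]
Y = [
    {"AB": 0, "AC": 1, "BC": 1},
    {"AB": 1, "AC": 1, "BC": 0},
    {"AB": 0, "AC": 0, "BC": 0},
    {"AB": 1, "AC": 0, "BC": 1},
]
selected, min_cover = greedy_select(Y, pairs, m=2)
print(selected, min_cover)
```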
Further, the importance of the class or the pairwise discrimination may be input (step S112: quantification step, importance input step/processing), and weighting based on the importance may be performed in a case of the optimization (weighting step/processing). For example, Expression (7) can be modified to Expression (8).
argmax min{Zk/wk} (8)
Here, wk indicates the importance of the pairwise discrimination. Alternatively, the importance of the class may be designated to obtain wk=wswt and the like, and the importance of the pairwise may be decided based on the importance of the class. It should be noted that, the calculation expression that reflects the importance of the class of the pairwise based on the product is merely an example, and the specific calculation expression for weighting may be another method to the same effect.
Specifically, for example, in a case in which the discrimination between a disease A and a disease B is particularly important in the discrimination of pathological tissues, while the discrimination between the disease B and a disease C is not important, it is desirable to set a large value to wk={A, B} and set a small value to wk={B, C}. As a result, for example, the method of appropriate feature amount selection or class classification (diagnosis) can be provided in a case in which early detection of the disease A is particularly important but a symptom thereof is similar to a symptom of the disease B, and in a case in which early detection of the disease B and the disease C are not important and there is a large difference in the symptoms from each other.
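The effect of the weighting in Expression (8) can be seen in a small sketch. The cover numbers and weights below are hypothetical, the function names are introduced only for illustration, and treating pairs without a designated weight as having the weight 1 is an assumption.

```python
def weighted_min_cover(cover, weights):
    """Minimum over pairs of Z_k / w_k; pairs without a weight default to 1 (assumption)."""
    return min(z / weights.get(k, 1.0) for k, z in cover.items())

def bottleneck_pair(cover, weights):
    """Pair that currently attains the weighted minimum."""
    return min(cover, key=lambda k: cover[k] / weights.get(k, 1.0))

# Hypothetical cover numbers: A vs B is already covered by more feature amounts than B vs C
cover = {("A", "B"): 4, ("B", "C"): 2}
weights = {("A", "B"): 4.0, ("B", "C"): 0.5}  # A vs B is important, B vs C is not

print(weighted_min_cover(cover, {}), bottleneck_pair(cover, {}))          # unweighted
print(weighted_min_cover(cover, weights), bottleneck_pair(cover, weights))
```

Without weights the bottleneck is the pair {B, C}; with the weights above it becomes {A, B}, so that an optimization based on Expression (8) would add feature amounts for the important pair first.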
<2.5 Exclusion of Similar Feature Amount>
In general, feature amounts with high similarity (degree of similarity), which take close values across all the discrimination target classes, are highly correlated with each other, so that it is desirable to avoid selecting them redundantly in consideration of the robustness of the discrimination. In addition, since the optimization search described in the previous section can be made more efficient in a case in which |F| can be reduced, the optimization processing unit 108 (selection processing unit 104) desirably narrows down the feature amounts to be considered in advance based on the evaluation result of the similarity (step S110: selection step/processing, similarity evaluation step/processing, priority setting step/processing). In practice, for example, there are hundreds of thousands of methylated sites.
Here, the set of k Ii={k|Yi(k)=1} in which Yi(k)=1 for the feature amount i is called the “switch set”. From this switch set, it is possible to consider the similarity (or degree of similarity) of the feature amounts, that is, the equivalence relationship (overlap relationship) and the inclusion relationship of the feature amounts.
For the feature amount i, all l for which Ii=Il are collected, and the equivalent feature set Ui is created as shown in Expression (9). In addition, all l for which Ii⊃Il are collected to create an inclusion feature set Hi as shown in Expression (10).
Ui={i, l1(i), l2(i), . . . |Ii=Il} (9)
Hi={l1(i), l2(i), . . . |Ii⊃Il} (10)
Since the equivalent feature set is obtained by grouping the overlapping feature amounts and the inclusion feature set is obtained by grouping the dependent feature amounts, the feature amount having high similarity can be excluded by narrowing down the feature amounts to one representative feature amount. Therefore, for example, the whole feature set F may be replaced with the similar exclusion feature set as in Expression (11).
F′=F\{{l(i)|∃Ui, l(i)≠i, l(i)∈Ui}∪{l(i)|∃Hi, l(i)∈Hi}} (11)
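The narrowing down by the equivalence relationship and the inclusion relationship can be sketched with ordinary set comparisons on the switch sets. The function name and the toy switch sets are hypothetical, and keeping the feature amount with the smallest index as the representative of an equivalent feature set is an assumption for illustration.

```python
def exclude_similar(switch_sets):
    """switch_sets maps a feature index to its switch set I_i; returns the kept indices."""
    kept = []
    for i in sorted(switch_sets):
        I_i = switch_sets[i]
        dominated = any(
            I_i < switch_sets[l]                    # strictly included in another switch set
            or (I_i == switch_sets[l] and l < i)    # duplicate of an earlier feature amount
            for l in switch_sets
            if l != i
        )
        if not dominated:
            kept.append(i)
    return kept

# Hypothetical switch sets over the class pairs AB, AC, BC
switch_sets = {
    0: {"AB", "AC"},
    1: {"AB", "AC"},  # equivalent to feature 0
    2: {"AB"},        # included in feature 0
    3: {"BC"},
}
print(exclude_similar(switch_sets))
```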
Of course, the selection processing unit 104 may consider only one of the equivalent feature set or the inclusion feature set as the similarity, or may create another index. For example, a method of obtaining a vector distance between the feature amounts (distance between discrimination possibility vectors) and regarding the vector distance equal to or less than a certain threshold value as the similar feature amount can be considered. In addition to the simple distance, any distance or a metric value equivalent thereto may be introduced, such as normalizing the discrimination possibilities of a plurality of feature amounts and then calculating the distance.
Further, although the narrowing down is performed in the above, the selection processing unit 104 may use a method of lowering the selection priority (priority) of the feature amount for which the similar feature amount is already selected in a case of the optimization search (priority setting step) to decide the ease of selection. Of course, a method of raising the selection priority (priority) of the feature amount having a low degree of similarity to the already selected feature amount (priority setting step) may be used.
<2.6 Introduction of Discrimination Unneeded Pairwise (Class Set)>
The class binary relationship extends to |P2(C)|=NC2=N(N−1)/2 pairs for the given class number N. This is simply the binary relationship of the classes, but there may be pairs that do not need to be discriminated in practice.
For example, in a case of assuming a cancer diagnosis problem (see examples described below), the discrimination between the cancer tissues and the discrimination between the cancer tissue and the normal tissue are essential, but the discrimination between the normal tissues is not needed.
Therefore, the selection processing unit 104 may partially suppress the pairwise expansion of the class binary relationship. That is, the given class C={c|c ∈ CT, CN} is divided into the class set CT that needs to be discriminated and the class set CN that does not need to be discriminated (first discrimination unneeded class group), and the pairs between CT and CT and between CT and CN are considered (pairwise expansion), while the pairs within CN are excluded from the class binary relationship (step S110: selection step, first marking step/processing, first exclusion step/processing). Specifically, the selection processing unit 104 calculates P2(C)′ by Expression (12), and replaces the existing P2(C) with P2(C)′.
P2(C)′=P2(C)\{{s, t}|s≠t∈CN} (12)
It should be noted that, two or more such divisions or marks may be present.
In this case, the selection processing unit 104 performs the pairwise expansion between the classes T (for example, classes T1 and T2 and classes T1 and T3) and between the class T and the class N (for example, classes T1 and N1 and classes T1 and N2), but does not perform the pairwise expansion between the classes N.
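The partial suppression of the pairwise expansion in Expression (12) can be sketched as follows. The class names follow the example above, and the function name `pairwise_expansion` is hypothetical.

```python
from itertools import combinations

def pairwise_expansion(classes, unneeded):
    """P2(C)': all class pairs except those whose two members both belong to C_N."""
    return [
        (s, t)
        for s, t in combinations(classes, 2)
        if not (s in unneeded and t in unneeded)
    ]

classes = ["T1", "T2", "T3", "N1", "N2"]
pairs = pairwise_expansion(classes, unneeded={"N1", "N2"})
print(len(pairs))  # 10 pairs in P2(C) minus the single pair within C_N
print(pairs)
```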
<2.7 Introduction of Subclass from Sample Clustering>
Even in a case in which the correct answer class label is given to the sample, there is a case in which a plurality of groups with different appearances are actually mixed in the sample of the same class nominally. Even in a case in which it is sufficient to discriminate the nominal class, the feature amounts do not always follow the same distribution parameter, so that the discrimination switch cannot be correctly given.
For example, there are subtypes of the cancer, and some cancers of the same organ have different appearances [Holm, Karolina, et al., 2010]. It should be noted that, in a case in which application to the screening test (combined with detailed tests) is assumed, it is not needed to discriminate the subtype.
Therefore, in order to correspond to the subtype, a special class unit called the subclasses, which do not need to be discriminated from each other, may be introduced (step S110: selection step, subclass setting step/processing, second marking step/processing).
The subclass can be automatically configured from samples. It should be noted that, since it is difficult to perform identification from a single feature amount, a method can be considered in which the selection processing unit 104 clusters (forms clusters of) the samples by the total feature amount (given feature amount) for each class, divides the samples into an appropriate number L of clusters (or by a minimum cluster size nC), and makes the subclasses correspond to the clusters. For example, as shown in
It should be noted that there are various clustering methods, clustering may be performed by another method, and clustering criteria may be set in various ways.
For example, in a case in which the class J is divided into {J1, J2, . . . , JL} (second discrimination unneeded class group), the given class C={1, 2, . . . , J, . . . , N} can be extended by Expression (13).
C+J={1, 2, . . . , J1, J2, . . . , JL, . . . , N} (13)
As in the previous section, the class binary relationship is replaced by Expression (14) by excluding the pairs of subclasses that do not need to be discriminated (second exclusion step).
P2(C+J)′−J=P2(C+J)′\{{s, t}|s≠t∈J*} (14)
It should be noted that the final class binary relationship, obtained by applying these exclusions in sequence together with the exclusion for CN in the previous section, is referred to as P2(C+C)′−C.
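The extension by subclasses in Expressions (13) and (14) can be sketched in the same manner. The subclass assignment is given by hand here (in practice it would come from the clustering described above), and the function names are hypothetical.

```python
from itertools import combinations

def expand_with_subclasses(classes, subclasses):
    """Replace each class J that has subclasses by J1..JL; also return a parent map."""
    expanded, parent = [], {}
    for c in classes:
        for sub in subclasses.get(c, [c]):  # a class without subclasses stays as it is
            expanded.append(sub)
            parent[sub] = c
    return expanded, parent

def pairs_excluding_siblings(expanded, parent):
    """All pairs except those between subclasses of the same nominal class."""
    return [(s, t) for s, t in combinations(expanded, 2) if parent[s] != parent[t]]

classes = ["A", "J", "B"]
expanded, parent = expand_with_subclasses(classes, {"J": ["J1", "J2"]})
pairs = pairs_excluding_siblings(expanded, parent)
print(expanded)
print(pairs)
```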
<2.8 Summary of Procedure of Feature Selection Method>
The procedure of the feature selection method (selection step by the selection processing unit 104, selection processing) proposed by the inventors of the present application is summarized.
(i) In the given class set C, the class set CN that does not need to be discriminated is set.
(ii) The samples are clustered for each class with all the feature amounts to make each obtained cluster correspond to a subclass (subclasses are special classes that do not need to be discriminated from each other).
(iii) The pairwise expansion P2(C+C)′−C of all the class binary relationships, which are discrimination targets, excluding those that do not need to be discriminated is determined.
(iv) The distribution parameter from the sample belonging to each class is estimated and the significant difference in the feature amount in the class pair k={s, t} is determined by the statistical test to assign 0/1 to the discrimination switch Yi(k={s, t}).
(v) From the discrimination switch, the equivalent feature amount set and the inclusion feature amount set are configured to create the similar exclusion feature set F′.
(vi) The feature set f (feature amount set) that maximizes the minimum value of the pairwise cover number Zf(k) obtained from the sum of the discrimination switches is selected from the F′ for the entire pairwise expansion P2(C+C)′−C of the discrimination target class.
It should be noted that the above (i) to (vi) are a comprehensive example, it is not always needed to implement all of (i) to (vi), and a procedure that omits some of them may be used. In addition, of course, a configuration may be used in which the alternative methods specified or suggested in each section are used. It should be noted that the multi-class classification device 10 may execute only the steps of the feature amount selection method (feature amount selection processing) to obtain the feature amount set used for the multi-class classification.
<3. Multi-Class Classification Method>
In the present chapter, the processing (step S120: determination step, determination processing) performed by the class determination processing unit 114 (determination processing unit 110) will be described. First, a configuration example (class determination step, determination step) of the binary-class classifier (binary-class discriminator) based on the selected feature amount (selected feature amount group, feature amount set) will be described. Next, an example (class determination step, determination step) of a method of configuring (configuring the multi-class discriminator that uses the selected feature amount group in association with the pairwise coupling) the multi-class classifier (multi-class discriminator) from the binary-class classifier, by two-stage procedure of (1) brute force match ranking and (2) final tournament match will be described.
<3.1 Configuration of Binary-Class Classifier>
The fact that the selected feature amounts contribute to the pairwise discrimination is utilized here. That is, the binary-class classifier can be configured only from the combination of the pairwise and the feature amounts marked with the discrimination switch (each of the binary-class discriminators that use the selected feature amount group in association with each pairwise coupling is configured). It should be noted that in a case of the class classification, the acquisition processing unit 112 acquires the value of the feature amount of the selected feature amount group (step S122: acquisition step, acquisition processing).
For example, the class determination processing unit 114 (determination processing unit 110) can decide the discrimination switch state yi(k=(s, t), j) for the class pairwise {s, t} of the given sample j (belonging class is unknown) and the selected feature amount i by comparing with the learning distribution (step S124: class determination step, see
It should be noted that the “?” in Expression (15) indicates that the class to which the sample x belongs is unknown. In addition, in a case in which the value of the feature amount of the sample is NA, y is set to 0.
The class determination processing unit 114 (determination processing unit 110) totals the above values to calculate a discrimination score rj(s, t), and configures the binary-class classifier Bj(s, t) as shown in Expressions (16) and (17) (step S124: class determination step)
rj(s, t)=Σi∈f yi(k={s, t}, j) (16)
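The totalization in Expression (16) can be sketched as follows. Since the rule for deciding the discrimination switch state and the decision rule of the binary-class classifier are not reproduced in this text, the functions `switch_state` and `binary_classify` below are hypothetical stand-ins: the state is assumed to be +1 in a case in which the value is closer to the learned mean of the class s, −1 in a case in which it is closer to that of the class t, and 0 for NA, and the classifier is assumed to follow the sign of the score.

```python
def switch_state(value, mean_s, mean_t):
    """Stand-in for y_i(k, j): +1 if closer to class s, -1 if closer to t, 0 for NA or a tie."""
    if value is None:  # NA
        return 0
    d_s, d_t = abs(value - mean_s), abs(value - mean_t)
    return (d_s < d_t) - (d_s > d_t)

def discrimination_score(sample, means, s, t, features):
    """r_j(s, t): sum of the switch states over the feature amounts marked for the pair."""
    return sum(switch_state(sample.get(i), means[s][i], means[t][i]) for i in features)

def binary_classify(score, s, t):
    """Assumed decision rule: positive score -> s, negative -> t, zero -> undecided."""
    return s if score > 0 else t if score < 0 else None

# Hypothetical learned means per class and feature amount, and one unknown sample
means = {"A": {0: 0.1, 1: 0.9}, "B": {0: 0.8, 1: 0.2}}
sample = {0: 0.2, 1: None}  # feature amount 1 is NA for this sample
r = discrimination_score(sample, means, "A", "B", features=[0, 1])
print(r, binary_classify(r, "A", "B"))
```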
<3.2 Procedure (1) of Multi-Class Classification: Brute Force Match Ranking>
The class determination processing unit 114 (determination processing unit 110) can further total the above-described discrimination scores (note that, in order to normalize the number of discrimination switches, it is desirable to take the sign of each score) to calculate a class score (pair score) as shown in Expression (18) (step S124: class determination step).
This class score indicates “how similar the unknown sample j is to the class s”. Further, the class determination processing unit 114 (determination processing unit 110) lists the discrimination candidate classes in descending order of the class score and creates a brute force match ranking G (step S124: class determination step). In a case of the creation, replacement processing (replace with +1 in a case in which the class score is positive, leave the value at ±0 in a case in which the class score is zero, and replace the value with −1 in a case in which the class score is negative) may be performed.
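The creation of the brute force match ranking can be sketched as follows, with each discrimination score replaced by +1, 0, or −1 as in the replacement processing described above. The score table is hypothetical, the function names are introduced only for illustration, and the antisymmetry r(t, s)=−r(s, t) is an assumption.

```python
def sign(x):
    """+1 for a positive value, 0 for zero, -1 for a negative value."""
    return (x > 0) - (x < 0)

def round_robin_ranking(classes, score):
    """score(s, t) returns r_j(s, t); classes are listed in descending order of class score."""
    class_score = {
        s: sum(sign(score(s, t)) for t in classes if t != s) for s in classes
    }
    ranking = sorted(classes, key=lambda s: -class_score[s])  # stable for equal scores
    return ranking, class_score

# Hypothetical discrimination scores of one unknown sample
r = {("A", "B"): 3, ("A", "C"): -1, ("B", "C"): -2}

def score(s, t):
    # Assumed antisymmetry: r(t, s) = -r(s, t)
    return r[(s, t)] if (s, t) in r else -r[(t, s)]

ranking, class_score = round_robin_ranking(["A", "B", "C"], score)
print(ranking, class_score)
```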
In a case in which the subtotal is calculated for all the class pairs in this way, the result shown in
<3.3 Procedure (2) of Multi-Class Classification: Final Tournament Match>
In multi-class classification including the present problem, the discrimination between the similar classes often becomes a performance bottleneck. Therefore, in the present invention, the feature amount group (feature amount set) capable of discriminating all the pairwise including the similar classes is selected.
On the other hand, in the brute force match ranking G, although it is expected that highly similar classes will gather near the top rank, most of the class scores are determined by comparison with the lower rank classes. That is, the ranking near the top rank (ranking between the classes D, N, and A in the examples of
Therefore, the class determination processing unit 114 (determination processing unit 110) can decide the final discrimination class based on an irregular tournament match Tj of g higher rank classes in the brute force match ranking, as shown in Expression (19) (step S124: class determination step).
That is, the class determination processing unit 114 applies the binary-class classifier again to the pairwise of the lower rank two classes from the g classes at the higher rank of the list, determines survival to reduce the number of lists by one, and sequentially performs the same procedures (finally, the G top rank class is compared with the surviving class).
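The procedure of the final tournament match can be sketched as follows. The ranking and the score table are hypothetical, the function names are introduced only for illustration, and letting the higher rank class survive in a case of a tie is an assumption.

```python
def final_tournament(ranking, g, winner):
    """Repeatedly match the two lowest-ranked of the top g classes until one class survives."""
    survivors = list(ranking[:g])
    while len(survivors) > 1:
        t = survivors.pop()  # lowest-ranked remaining class
        s = survivors.pop()  # next lowest-ranked class
        survivors.append(winner(s, t))
    return survivors[0]

# Hypothetical discrimination scores of one unknown sample
r = {("A", "B"): 3, ("A", "C"): -1, ("B", "C"): -2}

def score(s, t):
    # Assumed antisymmetry: r(t, s) = -r(s, t)
    return r[(s, t)] if (s, t) in r else -r[(t, s)]

def winner(s, t):
    # Binary-class classifier; on a tie the higher-ranked class s survives (assumption)
    return s if score(s, t) >= 0 else t

ranking = ["A", "C", "B"]  # hypothetical brute force match ranking
print(final_tournament(ranking, g=3, winner=winner))
```

In this toy case, the class C survives the match against the class B and then beats the top rank class A, so that the final discrimination class differs from the top of the ranking.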
For example, as shown in
<3.4 Configuration of Other Multi-Class Classifiers>
It should be noted that the above description is an example of the classifier configuration, and various machine learning methods may be used in addition to the above example. For example, the configuration may be basically a random forest configuration, in which only the feature amounts for which the discrimination switch of the selected feature amount is effective are used (determination step) in the decision trees in the middle. Specifically, the class determination processing unit 114 (determination processing unit 110) may configure the decision tree that uses the selected feature amount group in association with each pairwise coupling, and may combine one or more decision trees to configure the multi-class discriminator (step S124: class determination step). In this case, the class determination processing unit 114 may configure the multi-class discriminator as the random forest by combining a plurality of the decision trees (step S124: class determination step).
<4. Output>
The output processing unit 115 can output the input data, the processing condition described above, the result, and the like in accordance with the operation of the user via the operation unit 400 or without the operation of the user. For example, the output processing unit 115 can output the input learning data set, the selected feature amount set, the result of the brute force match ranking or the final tournament match, and the like by the display on the display device, such as the monitor 310, the storage in the storage device, such as the storage unit 200, the print by a printer (not shown), and the like (output step, output processing;
<5. Test Data and Examples>
The inventors of the present application select 8 types of cancer (large intestine cancer, stomach cancer, lung cancer, breast cancer, prostate cancer, pancreatic cancer, liver cancer, and cervical cancer) as the diagnosis targets. Since these cancers account for about 70% of Japanese cancers [Hori M, Matsuda T, et al., 2015], they are considered to be appropriate targets for the early screening test.
In addition, since normal tissue needs to cover everything that can flow out into the blood, a total of 24 possible types, such as blood, kidney, and thyroid gland, are listed in addition to the organs corresponding to the 8 types of the cancer.
A total of 5,110 open data samples including the measurement values of the methylated site are collected assuming the discrimination of the extracted cell aggregates (living body tissue piece) by positioning as a feasibility study (
For the cancer tumor and the normal organ (excluding blood), 4,378 samples are collected from the registered data of “The Cancer Genome Atlas” (TCGA) [Tomczak, Katarzyna, et al., 2015]. In addition, 732 samples of the blood are also collected [Johansson, Asa, Stefan Enroth, and Ulf Gyllensten, 2013].
The class to which each sample belongs (origin tissue, including the distinction between cancer and non-cancer) is assigned in accordance with the registered annotation information.
In addition, methylation is measured at a total of 485,512 sites, which is reduced to 291,847 sites after excluding those for which all sample values cannot be measured (NA). It should be noted that, in the registered data described above, the data subjected to the post-processing, such as normalization, is used as it is.
Further, the entire data set is mechanically divided into two equal parts; one part is used as the learning data set and the other part is used as the test data set.
The trial tasks set in the present example are as follows.
i. About 5,000 samples of the data sets are prepared
Allocation class (32 in total): Cancer (8 types) or normal tissue (24 types)
Feature amount (methylated site): about 300,000 items
ii. From the learning data set (one half of the above), at most 10 to 300 items of methylated site (omics information, omics switch-like information) that can be used for the discrimination are selected in advance (together with the learning of parameters, such as the subclass division and the distribution parameters)
iii. Answer is made (one sample at a time) to the discrimination problem of the given sample (particularly from the other half of the test data set)
Input: selected methylated site measurement value of the sample (at maximum 300 items corresponding to the selection in ii)
Output: one estimated class is selected from 9 types, that is, “cancer + origin tissue” (selected from 8 types) or “non-cancer” (only 1 type)
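The output side of the trial task above, in which the 32 allocation classes collapse to 9 estimated classes, can be sketched as follows. This is an illustrative assumption rather than part of the example: the label strings, the tuple `CANCER_ORIGINS`, and the function `to_output_class` are hypothetical names introduced here.

```python
# Origin tissues of the 8 cancer types named in the example
# (the label strings themselves are hypothetical).
CANCER_ORIGINS = (
    "large intestine", "stomach", "lung", "breast",
    "prostate", "pancreas", "liver", "cervix",
)


def to_output_class(origin_tissue: str, is_cancer: bool) -> str:
    """Map an allocation class to one of the 9 estimated classes."""
    if not is_cancer:
        # All normal tissues share a single output class.
        return "non-cancer"
    if origin_tissue not in CANCER_ORIGINS:
        raise ValueError(f"unknown cancer origin: {origin_tissue!r}")
    return f"cancer + {origin_tissue}"


# 8 cancer classes plus the single non-cancer class.
OUTPUT_CLASSES = [f"cancer + {o}" for o in CANCER_ORIGINS] + ["non-cancer"]
```

Under this mapping, discrimination between two normal tissues never changes the output, which is why such pairwise couplings do not need to be discriminated.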
It should be noted that, in the example, the following method is adopted as a related-art method to be compared with a proposal method (method according to the embodiment of the present invention).
Feature selection method: Shannon entropy criteria with methylated site research cases [Kadota, Koji, et al., 2006; Zhang, Yan, et al., 2011]
Multi-class classification: naive Bayes classifier (simple but known for its high performance [Zhang, Harry, 2004])
<5.1 Comparison Result between Proposal Method and Related-Art Method>
<5.1.1 Discrimination Accuracy of Test Data>
Learning with the learning data is performed, 277 sites (omics information, omics switch-like information) are selected, the discrimination accuracy of the test data is confirmed, and the proposal method (multi-class classification method according to the embodiment of the present invention) is compared with the related-art method (
The average F-number (F-measure) of the related-art method is 0.809, while the average F-number of the proposal method reaches 0.953. In addition, in the related-art method, some classes, such as lung cancer, pancreatic cancer, and stomach cancer, have an F-number, sensitivity, or goodness of fit (precision) of less than 0.8, but the proposal method achieves 0.8 or more in all items.
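For reference, the following sketch shows one way such scores may be computed, assuming that the F-number is the usual F-measure (harmonic mean of the goodness of fit and the sensitivity) and that the average is an unweighted mean over the classes. Both points are assumptions, and the function names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return the F-measure of each class, F = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for cls in set(y_true) | set(y_pred):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores[cls] = 2 * tp[cls] / denom if denom else 0.0
    return scores


def average_f(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F-measures (an assumed averaging rule)."""
    scores = per_class_f(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels, not data from the example.
truth = ["lung", "lung", "liver", "non-cancer"]
guess = ["lung", "liver", "liver", "non-cancer"]
scores = per_class_f(truth, guess)
mean_f = average_f(truth, guess)
```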
<5.1.2 Robustness of Discrimination>
The robustness of the discrimination is confirmed by the average F-number difference between the learning and the test in the previous term, and the proposal method is compared with the related-art method (
It can be seen that the related-art method shows an almost perfect average F-number of 0.993 for the learning data, whereas its accuracy on the test data is greatly reduced (difference 0.185), which indicates over-learning (overfitting).
On the other hand, in the proposal method, the reduction in the average F-number is only 0.008. In addition, the discrimination ability for the pancreatic cancer has a relatively low value (F-number 0.883) within the proposal method, but it also has a relatively low value (0.901) during learning. This suggests that, with the proposal method, the discrimination accuracy and tendency in the test data can be predicted to some extent at the stage of learning completion.
<5.1.3 Relationship between Number of Selected Features and Discrimination Accuracy>
The relationship between the number of the selected feature amounts and the discrimination accuracy (F-number) is confirmed (
Therefore, it is shown that, in the cancer diagnosis problem of discriminating “whether it is cancer or non-cancer” and the origin tissue, particularly from the methylated pattern of cfDNA, the discrimination ability is not sufficient with 10 selected feature amounts, and multi-measurement of at least 25 to 100 items is required. Accordingly, in the multi-class classification problem with a large number of classes, the number of the feature amounts (selected feature amount group) selected in the selection step (selection processing) is preferably 25 or more, more preferably 50 or more, and still more preferably 100 or more.
<5.1.4 Exclusion of Similar Feature Amount and Introduction of Discrimination-Unneeded Pairwise>
In the proposal method, the similar feature amounts are not selected (similarity evaluation step, similarity evaluation processing). In addition, discrimination unneeded pairwise is introduced.
There are a total of 291,847 effective methylated sites (feature amounts in the present problem). Among these, 59,052 similar feature amounts (equivalence relationship, inclusion relationship) can be specified and excluded from the target (reduction by 20.2%). In addition, since the original 32 classes are divided into 89 classes by sample clustering, the total number of simple pairwise couplings is 4,005. Of these, 551 non-target pairwise couplings between normal tissues and cancer subclasses can be excluded (reduction by 13.8%).
Combining the two, the search space can be reduced by 31.2% (1 − (1 − 0.202) × (1 − 0.138) ≈ 0.312). It can be confirmed that the efficiency of the discrimination switch combination search is improved by excluding the similar feature amounts and introducing the discrimination unneeded pairwise.
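The reduction figures in this item can be checked with the short calculation below. It assumes that the search space is proportional to the product of the number of candidate feature amounts and the number of pairwise couplings; that assumption, and the variable names, are introduced here only for illustration, while the counts are those stated above.

```python
# Counts stated in the example.
total_sites = 291_847     # effective methylated sites
similar_sites = 59_052    # similar feature amounts excluded
total_pairs = 4_005       # simple pairwise couplings
unneeded_pairs = 551      # non-target (discrimination unneeded) pairwise couplings

feature_reduction = similar_sites / total_sites   # about 20.2%
pair_reduction = unneeded_pairs / total_pairs     # about 13.8%

# Assumed model: search space ~ (number of features) x (number of pairs),
# so the remaining fractions multiply.
search_space_reduction = 1 - (1 - feature_reduction) * (1 - pair_reduction)  # about 31.2%
```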
<5.1.5 Subclass Division>
In the proposal method, sample clustering is introduced to internally divide the given class into the subclasses. Since the combination with the discrimination unneeded pairwise is also important, the effect of both is confirmed.
For comparison, a trial is performed in which subclass division is not performed, discrimination unneeded pairwise of the feature selection is not introduced, and other procedures are the same. As a result, even in a case of being limited to the cancer tissue, the correct answer rate of the discrimination is reduced from the original 95.9% to 85.6% (since there are 24 types of the normal tissues without division, particularly to confirm the effect of the subclass division, the comparison is limited to the cancer tissue).
It can be confirmed that highly accurate discrimination is realized by introducing subclass division and discrimination unneeded pairwise.
<5.1.6 Combined Use of Final Tournament Match>
In the proposal method, in the multi-class classification, the brute force match ranking (in the present section, the first rank class is called the “qualifying top class”) and the final tournament match are used in combination.
Among the 2,555 test data, there are 278 cases in which the qualifying top class does not match the correct answer class. Among these, there are 162 cases that can be corrected to the correct discrimination by the final tournament match. On the other hand, there are 19 opposite cases (the qualifying top class matches the correct answer class, but it is changed to the wrong discrimination by the final tournament match).
That is, by using the final tournament match in combination, a net 51.4% of the discrimination errors of the qualifying top class can be corrected ((162 − 19)/278), and the overall correct answer rate is improved by 5.6% ((162 − 19)/2,555). This confirms that the configuration skillfully brings out the performance of the binary-class classifier through pairwise discrimination.
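The net effect of the final tournament match can be reproduced from the counts given above. The variable names are hypothetical; the counts are those stated in this item.

```python
test_samples = 2_555   # test data
top_errors = 278       # qualifying top class differs from the correct answer class
corrected = 162        # errors fixed by the final tournament match
broken = 19            # correct top classes changed to a wrong discrimination

net_corrected = corrected - broken                   # net number of cases gained
error_correction_rate = net_corrected / top_errors   # about 51.4%
accuracy_gain = net_corrected / test_samples         # about 5.6 points

# Correct answer rate before and after the final tournament match.
accuracy_before = (test_samples - top_errors) / test_samples
accuracy_after = accuracy_before + accuracy_gain
```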
In the proposal method, the discrimination procedure, the comparative study class, and the dependent feature amount are clear. Therefore, it is possible to trace back the discrimination result and easily confirm and describe the difference from the feature amount or the threshold value that is the basis. It can be said that the proposal method is an “AI that can be described” (explainable AI), which is particularly advantageous for application to medical diagnosis that requires a basis for the discrimination.
In the same manner, each class score Ri (s) can be confirmed in the 7 rows from the row of “cancer tissue 1” to the row of “normal tissue 1”. In addition, each class pairwise discrimination score rj(s, t) can be confirmed in the three rows from the row “<cancer tissue 1|cancer tissue 3>” to the row “<cancer tissue 1|cancer tissue 5>”.
In addition, in the table shown in a portion (b) of
As described above, with the proposal method (the present invention), after the classification (selection), the processing steps are traced in the reverse order and each score or the like is shown, so that the basis for the discrimination can be confirmed and visualized. As a result, the reliability degree of the final discrimination result can be estimated from the class scores, the discrimination scores, or the like of the other similar candidates. In addition, specifying the feature amount that is the basis allows its interpretation to be used for consideration after the classification.
<Relationship between Number of Selected Feature Amounts and Minimum Cover Number>
The relationship between the number of the selected feature amounts and the minimum cover number in the example described above is shown in a graph of
(min{Zf(k)}) (20)
Here, a linear relationship with a slope of approximately ⅕ is obtained. This means that, even for an advanced multi-class classification problem such as cancer 8 classes/normal 24 classes with internal subclass division, a feature amount set that covers all the class discriminations can be selected approximately every five selections. That is, the feature selection of the method according to the embodiment of the present invention is reduced to the set cover problem, and it is shown that this extension is highly effective and that the minimum cover number can be efficiently improved in the multi-class classification problem. In addition, it can be seen that, from
<Relationship between Minimum Cover Number and Minimum F-Number>
The relationship between the minimum cover number in the selected feature amount set and the minimum F-number (the minimum value of the discrimination ability F-number in the test data in the discrimination target class) is shown in a graph of
(min{Zf(k)}) (21)
From this, it can be read that, in a case in which the minimum cover number is 0, almost no performance can be obtained, and that the minimum F-number becomes 0.8 at a minimum cover number of around 5, 0.85 at around 10, and 0.9 at around 60. That is, first, it can be seen that almost no performance can be obtained unless a feature amount set having a minimum cover number of 1 or more is selected. In addition, the detailed criteria for the F-number actually needed vary depending on the problem, but 0.80, 0.85, and 0.90 are easy-to-understand criteria, so it can be seen that a feature amount set having a minimum cover number of 5 or more, 10 or more, or 60 or more is valuable. Together with the previous item (relationship between the number of the selected feature amounts and the minimum cover number), “achieving the cover number by a relatively small number of the selected feature amounts (5 times or less of the presented minimum cover number)”, which can be realized by the present invention, is particularly valuable.
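The minimum cover number discussed in the two items above can be illustrated with the following sketch. It assumes that each feature amount is described by the set of pairwise couplings for which its discrimination switch is effective, and that the cover number of a pairwise coupling is the number of selected feature amounts that can discriminate it. The data layout, the function name, and the toy values are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def minimum_cover_number(
    switches: Dict[str, Set[Pair]],
    selected: Iterable[str],
    pairs: Iterable[Pair],
) -> int:
    """Return the smallest cover number over all pairwise couplings.

    switches -- for each feature amount, the pairwise couplings it can discriminate
    selected -- the selected feature amount set
    pairs    -- the pairwise couplings that need to be discriminated (non-empty)
    """
    chosen = list(selected)
    return min(
        sum(1 for feature in chosen if pair in switches[feature])
        for pair in pairs
    )


# Toy example with three classes A, B, and C.
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
switches = {
    "f1": {("A", "B"), ("A", "C")},
    "f2": {("A", "B"), ("B", "C")},
    "f3": {("A", "C"), ("B", "C")},
}
cover_one = minimum_cover_number(switches, ["f1"], pairs)
cover_two = minimum_cover_number(switches, ["f1", "f2"], pairs)
cover_all = minimum_cover_number(switches, ["f1", "f2", "f3"], pairs)
```

In this toy case, a single feature amount leaves one pairwise coupling uncovered (minimum cover number 0), which corresponds to the situation in which almost no performance can be obtained.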
It should be noted that, the example about “methylated site and living body tissue classification” described above is merely one of specific examples. The method according to the embodiment of the present invention is sufficiently generalized and can be applied to any feature amount selection and multi-class classification outside the field of biotechnology. For example, in a case of performing the class classification of people in the image (for example, Asia, Oceania, North America, South America, Eastern Europe, Western Europe, Middle East, and Africa), the feature amount can be selected by the method according to the embodiment of the present invention from a large number of the feature amounts, such as the size and shape of the face, skin color, hair color, and/or the position, the size, and the shape of the eyes, nose, and mouth, and the multi-class classification can be performed by using the selected feature amount. In addition, the method according to the embodiment of the present invention may be applied to the feature amount selection and the class classification of agricultural, forestry and fishery products, industrial products, or various statistical data.
The embodiments and other examples of the present invention have been described above, but the present invention is not limited to the aspects described above and can have various modifications without departing from the gist of the present invention.
10: multi-class classification device
100: processing unit
102: input processing unit
104: selection processing unit
106: quantification processing unit
108: optimization processing unit
110: determination processing unit
112: acquisition processing unit
114: class determination processing unit
115: output processing unit
116: CPU
118: ROM
120: RAM
200: storage unit
300: display unit
310: monitor
400: operation unit
410: keyboard
420: mouse
NW: network
S100 to S124: each processing of multi-class classification method
Number | Date | Country | Kind |
---|---|---|---|
2020-022822 | Feb 2020 | JP | national |
The present application is a Continuation of PCT International Application No. PCT/JP2021/004193 filed on Feb. 5, 2021 claiming priority under 35 U.S.C. § 119(a) to Japanese Patent Application No. 2020-022822 filed on Feb. 13, 2020. Each of the above applications is hereby expressly incorporated by reference, in its entirety, into the present application.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2021/004193 | Feb 2021 | US |
Child | 17876324 | | US |