This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2022-087670, filed on May 30, 2022, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a non-transitory recording medium storing a training data generation program, a training data generation device, and a training data generation method.
Machine learning methods often demand large-scale data as training data for training machine learning models. However, there are often cases in which it is difficult to collect a sufficient amount of training data. To address this issue, prepared training data may be converted to generate new training data, thereby augmenting the training data.
For example, there is a proposal for a data generation device that generates supervised data capable of building an analysis model with a high degree of generalizability. In this device, when classifying a first supervised data into specific categories using a trained analysis model, a characteristic site contributing to classification into a specific category is detected in the first supervised data, and second supervised data is generated by manipulating the first supervised data according to the characteristic site.
Moreover, for example, there is a proposal for a neural network learning device that extracts a feature from training data using a neural network undergoing training, and uses the neural network undergoing training to generate an adversarial feature from the extracted feature. This device uses the training data and the adversarial feature to compute a recognition result of the neural network, and trains the neural network such that the recognition result is close to a desired output.
Moreover, for example, there is a proposal for a system that augments a training sample for a minority class in a machine learning model using unbalanced training samples. In this system a training sample value is selected from a training sample set, a combination ratio value is selected from a continuous probability distribution, and selected training sample values are modified using the combination ratio value. This system generates a synthesized training sample by combining the modified training sample values.
Moreover, for example, there is a proposal for a system that generates a set of data samples of a minority data class to balance up an unbalanced training data set including both a majority data class and a minority data class. For example, related arts are disclosed in International Publication (WO) No. 2021/130995, International Publication (WO) No. 2018/167900, United States Patent Application Laid-Open No. 2021/0073671, and United States Patent Application Laid-Open No. 2015/0088791.
According to an aspect of the embodiments, a non-transitory recording medium storing a program that causes a computer to execute a training data generation process comprising: classifying, based on a feature value, each of a first plurality of training data having a first attribute and each of a second plurality of training data having a second attribute that are contained in a plurality of training data; based on a comparison of a number of training data classified in a first group from among the first plurality of training data against a number of training data classified in a second group from among the first plurality of training data, selecting a third plurality of training data from training data classified in a third group from among the second plurality of training data and training data classified in a fourth group from among the second plurality of training data, the third group corresponding to the first group, the fourth group corresponding to the second group; and converting each of the third plurality of training data into a fourth plurality of training data having the first attribute.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
Explanation follows regarding an example of an exemplary embodiment according to technology disclosed herein, with reference to the drawings.
As illustrated in
When the respective training data have been divided into groups based on one or other attribute of each training data in the training data set, sometimes this results in an imbalance in data size between groups, namely an imbalance in the number of training data included in each group. In cases in which a machine learning model has been trained using a training data set having an imbalance in the number of training data due to an attribute to be considered for fairness (hereafter referred to as a “sensitive attribute”), there is a high probability that prediction results from such a machine learning model will be discriminatory. There is accordingly a desire to rectify such an imbalance in data size in the training data set. In the following, in group division based on a sensitive attribute, a group having a large data size is referred to as a “majority group”, and data for the majority group is referred to as “majority data”. Moreover, a group having a small data size is referred to as a “minority group”, and data for the minority group is referred to as “minority data”.
Moreover, in cases in which there is an imbalance in data size for respective groups resulting from classifying training data by a sensitive attribute, a drop in prediction accuracy by the machine learning model more readily occurs for minority data compared to majority data due to there being less training data for the minority group. There is accordingly a demand to raise the prediction accuracy of the minority group by augmenting the data.
The following first reference example might be considered as a method to augment minority group data. Consider for example, as illustrated in
Moreover, the following second reference example might be considered as a method for augmenting minority group data. In the second reference example, for example as illustrated in
In order to address this issue, the present exemplary embodiment proposes a method of data augmentation so as to have a diversity of expression while maintaining characteristics of the minority data. The present exemplary embodiment focuses on the fact that training data having characteristics similar to those of the minority data are also sometimes contained in the majority group. For example, as illustrated in
Detailed explanation follows regarding functional sections of the training data generation device 10 according to the present exemplary embodiment. Note that a specific example of the present exemplary embodiment will be described for a case that envisages a task of predicting whether or not a facial image of a person is “attractive”. Moreover, gender is used as the sensitive attribute, and a small difference in prediction accuracy between male and female is considered to be fair. Furthermore, the present exemplary embodiment envisages a case in which there is an insufficient number of training data for the male group, and the number of training data for the male group is also less than the number of training data for the female group. Namely, the case presumes that the male data is minority data and the female data is majority data. Note that in the present task, whether or not a facial image is “attractive” is presumed to be strongly influenced by characteristics of hair style.
As illustrated in
The classification section 14 classifies, based on a feature value, each of a first plural number of training data having a first attribute and each of a second plural number of training data having a second attribute that are contained in the training data set. The first attribute is male and the second attribute is female. Namely, the first plural number of training data are training data classified in a male group that is the minority group, and the second plural number of training data are training data classified in a female group that is the majority group.
More specifically, the classification section 14 extracts a feature value from each instance of training data. For example, in cases in which each training data has been input to a deep neural network, which is an example of a machine learning model, the classification section 14 extracts, as a feature value of the training data, a value output from at least one of a middle layer or an output layer of the deep neural network. The classification section 14 subjects the training data to clustering based on similarity of the feature values, as illustrated at an upper part of
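The clustering into similarity groups described above can be pictured as follows. This is an illustrative sketch only, not the embodiment's implementation: the toy lists of floats stand in for feature values that the embodiment would extract from a middle or output layer of a deep neural network, and the minimal k-means with deterministic farthest-point initialization is an assumed clustering method.

```python
import math


def kmeans(features, k, iters=10):
    """Minimal k-means with deterministic farthest-point initialization."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Deterministic initialization: start from the first feature, then
    # repeatedly add the feature farthest from all chosen centroids.
    centroids = [list(features[0])]
    while len(centroids) < k:
        far = max(features, key=lambda f: min(dist(f, c) for c in centroids))
        centroids.append(list(far))

    labels = [0] * len(features)
    for _ in range(iters):
        # Assign each feature vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(f, centroids[c]))
                  for f in features]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [f for f, l in zip(features, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Two well-separated similarity groups of toy feature vectors.
feats = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(feats, k=2)
```

Each resulting label corresponds to one similarity group; both minority and majority data would be clustered together in this way so that corresponding groups share the same label.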
The selection section 16 compares a number of training data classified in a first similarity group from among the minority data against a number of training data classified in a second similarity group therefrom. Based on this comparison, the selection section 16 selects, from among the majority data, training data to be used for augmentation from training data classified in a third similarity group corresponding to the first similarity group. Moreover, from among the majority group training data, the selection section 16 selects training data to be used for augmentation from training data classified in a fourth similarity group corresponding to the second similarity group. Note that the training data to be used for augmentation is an example of “a third plurality of training data” of technology disclosed herein. Moreover, the present exemplary embodiment will be described for a case in which the first similarity group and the third similarity group are the same as each other, and the second similarity group and the fourth similarity group are the same as each other.
More specifically, as illustrated in
As illustrated in
More specifically as illustrated on the left of
Moreover, as illustrated at the right of
The conversion section 18 converts each item of the majority data selected by the selection section 16 as the training data to be used in augmentation into data having the first attribute, namely, data having characteristics of an attribute of the minority group. More specifically as illustrated in
The conversion section 18 is able to perform augmentation that considers a feature of minority data due to employing majority data of the same similarity group in augmentation, as illustrated at the middle of
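As a crude feature-space analogue of the conversion step (an assumption for illustration only, not the embodiment's converter, which operates on the training data themselves), one could shift each selected majority feature vector by the difference between the minority and majority group centroids:

```python
def convert_to_minority(selected, minority, majority):
    """Toy stand-in: move selected majority vectors toward the minority
    region of feature space by a centroid-difference shift."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_min, c_maj = centroid(minority), centroid(majority)
    shift = [m - j for m, j in zip(c_min, c_maj)]
    return [[x + s for x, s in zip(row, shift)] for row in selected]


minority = [[0.0, 0.0], [0.0, 2.0]]
majority = [[10.0, 0.0], [10.0, 2.0]]
converted = convert_to_minority(majority, minority, majority)
# The converted rows land on the minority side of feature space while
# keeping their relative arrangement, i.e. their within-group diversity.
```

The point of the sketch is that the converted items acquire minority-attribute characteristics while preserving the variation they carried in the majority group.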
The determination section 20 determines whether or not to employ the data converted by the conversion section 18 as the augmentation data. More specifically, as illustrated in
The determination section 20 removes post conversion data determined not to be employed as augmentation data from the group of post conversion data, and takes the remaining data as augmentation data. Note that, as illustrated in
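The determination step can be sketched as follows, assuming (as an illustration) that a similarity group is identified by nearest-centroid assignment: a converted item is kept only when its nearest similarity-group centroid matches the group of the majority data it was converted from.

```python
import math


def determine(converted, source_groups, centroids):
    """Keep converted items whose nearest centroid matches their
    pre-conversion similarity group; drop the rest."""
    def nearest(f):
        return min(range(len(centroids)),
                   key=lambda c: math.dist(f, centroids[c]))

    return [f for f, g in zip(converted, source_groups) if nearest(f) == g]


centroids = [[0.0, 0.0], [10.0, 10.0]]          # two similarity groups
converted = [[0.5, 0.2], [9.0, 9.5], [8.0, 8.0]]
source_groups = [0, 1, 0]                        # last item drifted away
kept = determine(converted, source_groups, centroids)
```

In this toy example the third item, converted from group 0 but now nearest to the centroid of group 1, is removed, mirroring the removal of post conversion data that no longer belongs to its original similarity group.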
The training data generation device 10 may, for example, be implemented by a computer 40 as illustrated in
The storage device 43 is, for example, a hard disk drive (HDD), solid state drive (SSD), or flash memory. A training data generation program 50 that causes the computer 40 to function as the training data generation device 10 is stored on the storage device 43 serving as a storage medium. The training data generation program 50 includes a classification process control command 54, a selection process control command 56, a conversion process control command 58, and a determination process control command 60.
The CPU 41 reads the training data generation program 50 from the storage device 43, expands the training data generation program 50 into the memory 42, and sequentially executes the control commands of the training data generation program 50. The CPU 41 operates as the classification section 14 illustrated in
Note that the functionality implemented by the training data generation program may be implemented by, for example, a semiconductor integrated circuit, and more particularly by an application specific integrated circuit (ASIC).
Next, description follows regarding operation of the training data generation device 10 according to the present exemplary embodiment. Training data generation processing illustrated in
At step S10, the classification section 14 acquires the training data set that has been input to the training data generation device 10. The classification section 14 then extracts a feature value from each training data, and classifies each training data into one or other similarity group by clustering the training data based on similarity of feature value.
Next at step S12, the selection section 16 tallies the number of minority data classified in each similarity group, and computes a proportion of the number of items in each similarity group. The selection section 16 also tallies the total number of majority data. Next at step S14, the selection section 16 computes an augmentation item number of each similarity group so as to make the data size of the minority group equivalent to the data size of the majority group while maintaining the computed proportions in the minority group.
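The computation at steps S12 to S14 can be sketched as follows; the rounding scheme is an assumption for illustration. The shortfall between the majority total and the minority total is distributed across similarity groups in proportion to the minority group's per-group item counts.

```python
def augmentation_numbers(minority_counts, majority_total):
    """Per-group augmentation item numbers that bring the minority group
    up to the majority group's size while keeping the minority's
    per-group proportions."""
    minority_total = sum(minority_counts)
    shortfall = majority_total - minority_total
    proportions = [c / minority_total for c in minority_counts]
    return [round(p * shortfall) for p in proportions]


# Minority data split 30/20/10 across three similarity groups;
# majority group holds 120 items, so 60 items must be augmented.
nums = augmentation_numbers([30, 20, 10], 120)
# The 60-item shortfall is split 30/20/10, preserving the 3:2:1 ratio.
```

Preserving the minority group's own proportions is what maintains the pre-augmentation imbalance of characteristics within the minority data.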
Next at step S16, the selection section 16 determines, for each of the similarity groups, whether or not the number of majority data classified in the same similarity group is greater than the computed augmentation item number of this similarity group. Processing transitions to step S18 in cases in which the number of majority data is greater, and processing transitions to step S20 in cases in which the augmentation item number is equal to or greater than the number of majority data.
At step S18, the selection section 16 selects, from among the majority data classified in a given same similarity group, an amount of the augmentation item number of the majority data in sequence from the highest similarity to the minority data classified in this same similarity group. At step S20, the selection section 16 selects all the majority data in the same similarity group. Note that the processing of step S16 to step S20 is executed for each of the similarity groups.
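The per-group selection at steps S16 to S20 can be sketched as follows. Cosine similarity to the centroid of the group's minority data is an assumed similarity measure for illustration; the embodiment only requires selection in sequence from the highest similarity to the minority data.

```python
import math


def select_majority(majority_feats, minority_feats, n_augment):
    """Select up to n_augment majority items from one similarity group,
    most similar to the group's minority data first."""
    if len(majority_feats) <= n_augment:
        return list(majority_feats)              # step S20: take all

    # Rank by cosine similarity to the minority centroid (step S18).
    centroid = [sum(c) / len(minority_feats) for c in zip(*minority_feats)]

    def cos(a, b):
        return (sum(x * y for x, y in zip(a, b))
                / (math.hypot(*a) * math.hypot(*b)))

    ranked = sorted(majority_feats, key=lambda f: cos(f, centroid),
                    reverse=True)
    return ranked[:n_augment]


minority = [[1.0, 0.0], [0.9, 0.1]]
majority = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.2]]
picked = select_majority(majority, minority, 2)
```

Here the majority item pointing away from the minority data is passed over, so only majority data resembling the minority group's characteristics feed into conversion.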
Next, at step S22, the conversion section 18 converts each item of the majority data selected at step S18 or step S20 into data having characteristics of an attribute of the minority group. Next at step S24, in cases in which post conversion data is not classified in the same similarity group as the similarity group of the majority data prior to conversion, the determination section 20 removes this post conversion data from the post conversion data group, and outputs the remaining data as an augmented data set. The training data generation processing is then ended.
As described above, in the training data generation device according to the present exemplary embodiment, a training data set is classified into similarity groups based on a feature value of both the minority data and the majority data contained in the training data set with respect to a sensitive attribute. The training data generation device then computes an augmentation item number for each of the similarity groups so as to augment the minority group to the total number of the majority group while maintaining the proportions of the number of items of each similarity group in the minority group. The training data generation device also selects, as data to be used for augmenting each similarity group, an amount of the computed augmentation item number of data from the majority data of the same similarity group. The training data generation device then converts the selected majority data into data having characteristics of an attribute of the minority group so as to generate augmentation data. This thereby enables generation of a training data set after data augmentation to rectify fairness, by generating post-augmentation training data so as to maintain an imbalance of characteristics of the training data set prior to augmentation. The prediction accuracy for the minority group is also raised due to generating the augmentation data by converting majority data that has similarity to features of the minority data.
Note that there are various fairness indices, since there are various ways of thinking about, and criteria for, fairness. The exemplary embodiment described above presumes an accuracy parity index, under which matching prediction accuracy across groups is considered impartial. Thus in the exemplary embodiment described above, data augmentation is performed such that the data size is equivalent across the groups classified by the sensitive attribute. Another representative fairness index is, for example, a demographic parity index. This is an index under which impartiality requires the rate of positive predictions to be matched across the sensitive attribute groups. In such cases, as illustrated in
More specifically, similarly to in the exemplary embodiment described above, the selection section computes the augmentation item number of each of the similarity groups so as to make the data sizes equivalent for the minority group and the majority group while maintaining the proportions of the number of items of each similarity group in the minority group. Then, when selecting majority data for each similarity group from the same similarity group, the selection section selects majority data such that the rates of positive predictions are equivalent for this similarity group. However, the rates of positive predictions being equivalent does not only mean cases in which the rate of positive predictions is the same for both the minority group and the majority group, but also includes cases in which a difference between the rate of positive predictions for the minority group and the rate of positive predictions for the majority group lies within a second threshold or within a third threshold. The second threshold and the third threshold are thresholds for each similarity group, and are values predetermined such that the rate of positive predictions of the minority group and the rate of positive predictions of the majority group can be regarded as equivalent. More specifically, the selection section preferentially selects positive predictions in the majority data in cases in which the rate of positive predictions of the minority group is lower than the rate of positive predictions of the majority group. Conversely, the selection section preferentially selects negative predictions in the majority data in cases in which the rate of positive predictions of the minority group is higher than the rate of positive predictions of the majority group.
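The preferential selection for the demographic parity variant can be sketched as follows; representing each majority item as a (features, label) pair with label 1 for a positive prediction is an assumption for illustration.

```python
def select_for_parity(majority, minority_pos_rate, majority_pos_rate, n):
    """Pick n majority items for conversion, preferring positive items
    when the minority group's positive rate is the lower one, and
    negative items otherwise."""
    # majority: list of (features, label) pairs, label 1 = positive.
    prefer_positive = minority_pos_rate < majority_pos_rate
    ranked = sorted(majority, key=lambda item: item[1],
                    reverse=prefer_positive)
    return ranked[:n]


majority = [([0.1], 0), ([0.2], 1), ([0.3], 1), ([0.4], 0)]
# Minority positive rate 0.2 is below the majority's 0.5, so positively
# labeled majority items are selected first for conversion.
picked = select_for_parity(majority, 0.2, 0.5, 2)
```

Converting preferentially positive (or preferentially negative) majority items nudges the minority group's rate of positive predictions toward that of the majority group, as the demographic parity index requires.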
Moreover although the training data generation program is pre-stored (installed) on the storage device in the exemplary embodiment described above, there is no limitation thereto. The program according to the technology disclosed herein may be provided in a format stored on a storage medium such as CD-ROM, DVD-ROM, USB memory, or the like.
Related technology considers augmenting training data for a minority group based on training data of a majority group in order to rectify fairness in training data. However, in the related technology, in cases in which there is an imbalance in a feature between the training data of the minority group and the training data of the majority group, the originally existing imbalance in the feature of the minority group is lost. In cases in which the originally existing imbalance in the feature of the minority group is no longer maintained after augmentation, there is a possibility that the prediction accuracy of the machine learning model for the minority group will fall.
The technology disclosed herein enables a training data set after data augmentation to rectify fairness to be generated as post augmentation training data based on an imbalance of features in the training data set prior to augmentation.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2022-087670 | May 2022 | JP | national |