This application is based on and claims the benefit of priority of the prior Japanese Patent Application No. 2022-038624 filed on Mar. 11, 2022, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a data converting program, a data converting device, and a data converting method.
There are cases in which the values of specific attributes included in the training data used to train a machine-learned model are biased, and the judgements made by that machine-learned model are consequently discriminatory. For example, consider training a machine-learned model that estimates success or failure from the attributes of a person, by using training data whose explanatory variables are the sex, age, birthplace or the like of the person, and whose objective variable is the success or failure of that person with respect to employment, a test or the like. In such a case, if the training data is a past history in which females were treated unfavorably with respect to success or failure, a machine-learned model trained by using that training data will carry out discriminatory estimation, such as handing down judgements that are disadvantageous to women.
Techniques of eliminating bias such as described above by converting data have been proposed. For example, there has been proposed a technique of converting data such that the data distributions become the same in cases in which there are attributes that have the possibility of bringing about discriminatory behavior and in cases in which there are no such attributes. Further, a technique has been proposed of converting data, which correspond to conversion rules that are set in advance, in accordance with those conversion rules. Moreover, there has been proposed a technique of providing constraints that suppress the degree of change in the distribution, and then converting from arbitrary data X1 to arbitrary data X2 at probability P(X1,X2). For example, related arts are disclosed in Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C. and Venkatasubramanian S., “Certifying and removing disparate impact”, In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, August, pp. 259-268., Hajian, S. and Domingo-Ferrer, J., “A methodology for direct and indirect discrimination prevention in data mining”, IEEE transactions on knowledge and data engineering, 25(7), 2012, pp.1445-1459., and Calmon, F.P., Wei, D., Vinzamuri, B., Ramamurthy, K.N. and Varshney, K.R., “Optimized pre-processing for discrimination prevention”, In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, December, pp. 3995-4004.
According to an aspect of the embodiments, there is provided a data converting program causing a computer to execute a process of: for each of plural conversion rules, specifying a difference between pre-conversion data and post-conversion data generated by applying the plural conversion rules respectively to the pre-conversion data; determining application probabilities of the plural conversion rules respectively, in accordance with deviations in first plural data based on a first attribute of the first plural data and the differences for the plural conversion rules; and generating second plural data by applying the plural conversion rules to the first plural data in accordance with the application probabilities.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
An example of an embodiment relating to the technique of the disclosure is described hereinafter with reference to the drawings.
Before details of the embodiment are described, the elimination of bias by data conversion is described first.
Pre-conversion data 100 illustrated in
Here, for the above-described data conversion, it is desirable that the distribution of the data does not change greatly before and after conversion. This is because, if the distribution changes greatly, there are cases in which the estimation accuracy of a machine-learned model, which is trained by using the post-conversion data as the training data, will deteriorate. Further, it is preferable that the data conversion can be interpreted by humans, i.e., that the data conversion be interpretable. This is because, if the data conversion is not interpretable, it is difficult to manually check the appropriateness of the conversion with respect to the post-conversion data. As interpretable data conversion, a technique of converting data based on predetermined conversion rules can be considered. Thus, in the present embodiment, the data conversion is based on conversion rules, and bias is eliminated from the data by data conversion that suppresses a change in the distribution of the post-conversion data. The data converting device relating to the present embodiment is described in detail hereinafter.
As illustrated in
As illustrated in
For each of the plural conversion rules, the specifying section 12 specifies a distance (difference) between pre-conversion data and post-conversion data, which is generated by applying the respective plural rules to the pre-conversion data. Here, the value of the general attribute of data Xk is xk, the value of the target attribute is yk, and the value of the sensitive attribute is sk, and the data Xk is expressed by the vector (xk,yk,sk). For arbitrary data Xk = (xk,yk,sk) and data Xm = (xm,ym,sm), the specifying section 12 acquires the definition of distance c(Xk,Xm) between Xk and Xm. For example, the distance c(Xk,Xm) may be the Euclidean distance of Xk and Xm.
X1 = (20,1,1), X2 = (50,1,1), c(X1,X2) = 30
X1 = (20,1,1), X3 = (25,1,1), c(X1,X3) = 5
In this case, a greater distance means that the data differs more. For example, the above-described example illustrates that the difference with data X1 is greater for data X2 than for data X3. Namely, this distance c(Xk,Xm) is an index expressing the degree of change in the distribution of data in a case in which data Xk is converted into data Xm. The specifying section 12 specifies the distances c(Xk,Xm) for all combinations of data that can be supposed as combinations of values of the respective attributes.
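The distance computation above can be sketched as follows. This is a minimal illustration that assumes the Euclidean distance mentioned as one possible definition of c(Xk,Xm); the attribute domains `ages`, `targets` and `sensitives` are hypothetical and are not taken from the embodiment.

```python
import math
from itertools import product

def distance(a, b):
    """Euclidean distance between two records of the form (x, y, s)."""
    return math.dist(a, b)

# The two worked examples from the text.
X1, X2, X3 = (20, 1, 1), (50, 1, 1), (25, 1, 1)
print(distance(X1, X2))  # 30.0
print(distance(X1, X3))  # 5.0

# Distances for every pair of value combinations (hypothetical attribute domains).
ages, targets, sensitives = (20, 25, 50), (0, 1), (0, 1)
records = list(product(ages, targets, sensitives))
table = {(a, b): distance(a, b) for a in records for b in records}
```

Precomputing `table` mirrors the statement that the distances are specified for all combinations of attribute values; for large attribute domains a lazily evaluated function may be preferable.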
The determining section 14 determines the application probability of each of the plural conversion rules based on the deviation of the data in a case in which the sensitive attribute is used as the reference, and the difference in the data before and after conversion. Specifically, the determining section 14 determines the application probability of each of the plural conversion rules such that both the deviation of the post-conversion data in a case in which the sensitive attribute is used as the reference, and the difference in the data before and after conversion, are minimized.
The conversion rule is a rule for converting data that matches a condition into new data, and is expressed as follows for example.
The determining section 14 acquires set R of conversion rules r that match the data X = (x,y,s), and determines application probability p(r) that expresses the proportion of data to which conversion rule r∈R is to be applied, among the total number of the data X. Here, in order to eliminate bias from the pre-conversion data, data conversion must be carried out such that, in the post-conversion data, the number of data whose target attribute is a predetermined value is fair regardless of the value of the sensitive attribute. For example, the numbers of data corresponding to the sensitive attribute and the target attribute are written as follows. data set D =
Here, the respective (x,y,s) are discrete values. Further, 1(yn=j) is a function that returns 1 in a case in which yn = j, and returns 0 in other cases. Namely, formula (1) expresses, among the data within the data set, the number of data whose target attribute is a predetermined value. Formula (2) expresses, among the data within the data set, the number of data whose sensitive attribute is a predetermined value. Formula (3) expresses, among the data within the data set, the number of data whose target attribute is a predetermined value and whose sensitive attribute is a predetermined value.
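The three counts described above can be computed directly from a data set. The following is a small sketch; the data set `D` and the names `n_y`, `n_s` and `n_ys` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical data set D of (x, y, s) records with discrete values.
D = [(20, 1, 1), (25, 1, 1), (50, 0, 1), (30, 1, 0), (40, 0, 0), (45, 0, 0)]

n_y = Counter(y for _, y, _ in D)        # data per target attribute value
n_s = Counter(s for _, _, s in D)        # data per sensitive attribute value
n_ys = Counter((y, s) for _, y, s in D)  # data per (target, sensitive) pair

print(n_y[1], n_s[1], n_ys[(1, 1)])  # 3 3 2
```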
Further, for the data conversion to be fair, the probability that the value of the target attribute becomes a predetermined value must not change depending on the sensitive attribute. Accordingly, it suffices to carry out data conversion such that, in the post-conversion data, following formula (4) and following formula (5) become equal, i.e., such that following formula (6) is satisfied.
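The formulas referred to in this passage are not reproduced in this text. The following is a reconstruction that is consistent with the surrounding descriptions and with the quantity Ns Ny/N used later for the edges into the sink node; the notation and the assignment of formula numbers are assumptions, not the original formulas.

```latex
% Counts over the N data in data set D (descriptions of formulas (1) to (3))
N_{y=j} = \sum_{n=1}^{N} \mathbb{1}(y_n = j), \qquad
N_{s=k} = \sum_{n=1}^{N} \mathbb{1}(s_n = k), \qquad
N_{y=j,\,s=k} = \sum_{n=1}^{N} \mathbb{1}(y_n = j)\,\mathbb{1}(s_n = k)

% Fairness: the proportion with y = j among data with s = k equals the overall proportion
\frac{N_{y=j,\,s=k}}{N_{s=k}} = \frac{N_{y=j}}{N}
\quad\Longleftrightarrow\quad
N_{y=j,\,s=k} = \frac{N_{s=k}\, N_{y=j}}{N}
```

Under this reading, the left-hand and right-hand sides of the first fairness equality would correspond to formulas (4) and (5), and the rearranged form to formula (6).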
The determining section 14 determines the application probability p(r) for each conversion rule so as to suppress a change in the distributions of the data before and after conversion, while carrying out fair data conversion such as described above. In the present embodiment, the problem that determines the application probability p(r) per conversion rule is formulated into a minimum cost flow problem. Specifically, as illustrated in
The source node corresponds to the supply point of the flow in the minimum cost flow problem, and the sink node corresponds to the demand point. The determining section 14 causes the number of data that are included in data set D (the pre-conversion data) to flow from the source node toward the sink node. The first nodes are nodes respectively corresponding to the combinations (x′,y′,s′) of values of the respective attributes of the pre-conversion data. The determining section 14 connects the source node and the respective first nodes by edges, and sets (0,Nx′y′s′) at each edge. Nx′y′s′ is the number of data at which x = x′, y = y′ and s = s′, among the data X = (x,y,s) that are included in the data set D.
The second nodes are nodes respectively corresponding to the conversion rules r. The determining section 14 connects each first node by edges to the second nodes corresponding to the conversion rules that the data corresponding to that first node matches, and sets (c((x′,y′,s′),(x″,y″,s″)),∞) for each edge. Here, c((x′,y′,s′),(x″,y″,s″)) is the distance between the data before and after conversion by the conversion rule r corresponding to the second node that is connected by the edge.
The third nodes are nodes corresponding to groups expressing pairs of value y of the target attribute and value s of the sensitive attribute. The determining section 14 connects the second nodes by edges with the third node, which corresponds to the group to which the post-conversion data in accordance with the conversion rules r corresponding to those second nodes belong, and sets (0,∞) for those edges. Further, the determining section 14 connects the respective third nodes and the sink node by edges, and sets (0, Ns″Ny″/N) at those edges. The determining section 14 sets the value of Ns″Ny″/N such that the post-conversion data becomes fair, specifically, such that above formula (6) is satisfied.
As described above, by setting the nodes, the edges and the cost and capacity per edge, the solution to the minimum cost flow problem of this network expresses a converting process in which the data set D becomes fair by using the conversion rules, and expresses conversion in which the change in the distributions before and after conversion is the minimum. Due to the determining section 14 solving the minimum cost flow problem of a network such as illustrated in
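The network described above can be sketched on a toy data set as follows. This is an illustrative sketch only: it reads each pair set on an edge as (cost, capacity), uses a simple successive-shortest-path solver written for this example rather than any solver named in the embodiment, and assumes hypothetical rules (`r_a`, `r_b`, and identity rules named `keep...`) that each match exactly one combination of attribute values.

```python
import math
from collections import Counter, defaultdict

class MinCostFlow:
    """Successive shortest paths with Bellman-Ford; adequate for tiny networks."""

    def __init__(self):
        self.graph = defaultdict(list)  # node -> [to, capacity, cost, reverse index]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])
        return self.graph[v][-1]  # reverse edge; its capacity equals the flow sent

    def solve(self, source, sink):
        flow = cost = 0
        while True:
            dist, prev, changed = {source: 0}, {}, True
            while changed:  # Bellman-Ford relaxation on the residual graph
                changed = False
                for u in list(dist):
                    for i, (v, cap, c, _) in enumerate(self.graph[u]):
                        if cap > 0 and dist[u] + c < dist.get(v, math.inf):
                            dist[v], prev[v], changed = dist[u] + c, (u, i), True
            if sink not in dist:
                return flow, cost
            push, v = math.inf, sink
            while v != source:  # bottleneck capacity along the path
                u, i = prev[v]
                push, v = min(push, self.graph[u][i][1]), u
            v = sink
            while v != source:  # update residual capacities
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[sink]

# Hypothetical pre-conversion counts per (x, y, s) and hypothetical rules (before, after).
counts = {(0, 1, 1): 3, (0, 0, 1): 1, (0, 1, 0): 1, (0, 0, 0): 3}
rules = {f"keep{rec}": (rec, rec) for rec in counts}  # identity rules, cost 0
rules["r_a"] = ((0, 1, 1), (0, 0, 1))
rules["r_b"] = ((0, 0, 0), (0, 1, 0))

N = sum(counts.values())
n_y, n_s = Counter(), Counter()
for (_, y, s), n in counts.items():
    n_y[y] += n
    n_s[s] += n

net, sent = MinCostFlow(), {}
for rec, n in counts.items():  # source -> first nodes: (0, N_x'y's')
    net.add_edge("source", ("rec", rec), n, 0)
for name, (before, after) in rules.items():
    # first node -> second node: (c(before, after), infinity)
    sent[name] = net.add_edge(("rec", before), ("rule", name), math.inf, math.dist(before, after))
    # second node -> third node: (0, infinity)
    net.add_edge(("rule", name), ("group", after[1], after[2]), math.inf, 0)
for y in n_y:  # third nodes -> sink: (0, Ns * Ny / N)
    for s in n_s:
        net.add_edge(("group", y, s), "sink", n_s[s] * n_y[y] / N, 0)

total_flow, total_cost = net.solve("source", "sink")
flow_per_rule = {name: edge[1] for name, edge in sent.items()}
print(total_flow, total_cost, flow_per_rule)
```

In this toy example all eight records reach the sink, and one record flows through each of `r_a` and `r_b`. Dividing the flow through a rule by the number of records that match it gives one possible reading of the application probability p(r); whether the embodiment normalizes in exactly this way is an assumption.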
The generating section 16 generates post-conversion data by applying plural conversion rules to the pre-conversion data, based on the application probabilities determined by the determining section 14. In the case of the example of
The outputting section 18 outputs the plural post-conversion data generated by the generating section 16. Further, the outputting section 18 may also output, together therewith, the application probability for each conversion rule that was applied by the generating section 16. Due thereto, the interpretability of the data conversion is further improved.
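The generating and outputting steps can be sketched as follows, assuming that each record is converted by a rule drawn at random according to the application probabilities. The rule functions `keep` and `flip_target`, the table `plan`, and the probability values are hypothetical.

```python
import random

def keep(rec):
    """Identity rule: leave the record unchanged."""
    return rec

def flip_target(rec):
    """Hypothetical rule: invert a binary target attribute y."""
    x, y, s = rec
    return (x, 1 - y, s)

# Hypothetical application probabilities per matching record: [(rule, p(r)), ...]
plan = {
    (0, 1, 1): [(keep, 2 / 3), (flip_target, 1 / 3)],
    (0, 0, 0): [(keep, 2 / 3), (flip_target, 1 / 3)],
}

def convert(data, plan, seed=0):
    """Apply one rule per record, sampled according to its application probability."""
    rng = random.Random(seed)
    out = []
    for rec in data:
        options = plan.get(rec, [(keep, 1.0)])  # records without rules are kept
        rules, weights = zip(*options)
        rule = rng.choices(rules, weights=weights, k=1)[0]
        out.append(rule(rec))
    return out

data = [(0, 1, 1)] * 3 + [(0, 0, 1)] + [(0, 1, 0)] + [(0, 0, 0)] * 3
converted = convert(data, plan)

# Output the converted data together with the probabilities, for interpretability.
for rec, options in plan.items():
    print(rec, [(rule.__name__, round(p, 3)) for rule, p in options])
```

Random sampling meets the target proportions only in expectation. If exact counts are required, an alternative is to convert a fixed number of matching records per rule instead of sampling.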
The data converting device 10 may be realized, for example, by a computer 40 illustrated in
The storage 43 may be realized by an HDD (Hard Disk Drive), an SSD (Solid State Drive), a flash memory or the like. A data converting program 50 for causing the computer 40 to function as the data converting device 10 is stored in the storage 43 that serves as a storage medium. The data converting program 50 has a specifying process 52, a determining process 54, a generating process 56 and an outputting process 58.
The CPU 41 reads out the data converting program 50 from the storage 43, expands the data converting program 50 in the memory 42, and successively executes the processes of the data converting program 50. By executing the specifying process 52, the CPU 41 operates as the specifying section 12 illustrated in
Note that the functions realized by the data converting program 50 can also be realized by, for example, a semiconductor integrated circuit, and, more specifically, an ASIC (Application Specific Integrated Circuit) or the like.
Operation of the data converting device 10 relating to the present embodiment is described next. When plural pre-conversion data and plural conversion rules are inputted to the data converting device 10, the data converting processing illustrated in
In step S10, the specifying section 12 acquires the plural pre-conversion data and the plural conversion rules that were inputted to the data converting device 10. Next, in step S12, for each of the plural conversion rules, the specifying section 12 specifies the distance between the pre-conversion data, and the post-conversion data that was generated by applying the plural conversion rules respectively to the pre-conversion data.
Next, in step S14, the determining section 14 determines the respective application probabilities of the plural conversion rules, such that both the deviation of the post-conversion data in a case in which the sensitive attribute is used as the reference, and the distance of the data before and after conversion, are minimized. Next, in step S16, the generating section 16 applies the plural conversion rules to the pre-conversion data based on the application probabilities determined in above step S14, and generates post-conversion data. Next, in step S18, the outputting section 18 outputs the plural post-conversion data generated in above step S16, and the data converting processing ends.
As described above, for each of plural conversion rules, the data converting device relating to the present embodiment specifies a distance between pre-conversion data, and post-conversion data generated by applying the plural conversion rules respectively to the pre-conversion data. Further, the data converting device determines application probabilities of the plural conversion rules respectively, based on the deviations in data in cases in which the sensitive attribute is used as the reference, and the distances of the data before and after the conversion. Then, the data converting device applies the plural conversion rules to the pre-conversion data based on the determined application probabilities, and generates post-conversion data. Due thereto, the data converting device can suppress a change in the distributions of the data due to data conversion that is for eliminating bias.
Note that the above embodiment describes a case in which a minimum cost flow problem is applied to the determining of the application probabilities, but the present disclosure is not limited to this. For example, in patterns that allocate numbers of data such that there is fair data conversion, i.e., such that above formula (6) is satisfied, the data converting device may specify the distances of the data before and after conversion by round robin, and may determine the application probability per conversion rule based on the pattern in which the distance is the minimum. However, the application probabilities can be determined efficiently by applying a minimum cost flow problem as in the above-described embodiment.
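The round-robin alternative mentioned above can be sketched as an exhaustive search. This is an illustrative sketch under stated assumptions: the counts and rules are hypothetical, records not assigned to a rule are left unchanged, and fairness is checked against the pre-conversion totals per target value and per sensitive value, matching the quantity Ns Ny/N used for the sink edges.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical pre-conversion counts per (x, y, s) and non-identity rules (before, after).
counts = {(0, 1, 1): 3, (0, 0, 1): 1, (0, 1, 0): 1, (0, 0, 0): 3}
rules = [((0, 1, 1), (0, 0, 1)), ((0, 0, 0), (0, 1, 0))]

N = sum(counts.values())
n_y, n_s = Counter(), Counter()
for (_, y, s), n in counts.items():
    n_y[y] += n
    n_s[s] += n

def apply(alloc):
    """Post-conversion counts when alloc[i] records go through rules[i], else None."""
    remaining, post = Counter(counts), Counter()
    for (before, after), k in zip(rules, alloc):
        if remaining[before] < k:
            return None  # more records requested than are available
        remaining[before] -= k
        post[after] += k
    return post + remaining

def is_fair(post):
    """Check N_ys * N == N_s * N_y for every (y, s) group."""
    groups = Counter()
    for (_, y, s), n in post.items():
        groups[(y, s)] += n
    return all(groups[(y, s)] * N == n_s[s] * n_y[y] for y in n_y for s in n_s)

def cost(alloc):
    """Total distance moved by the records that are converted."""
    return sum(k * math.dist(before, after) for (before, after), k in zip(rules, alloc))

ranges = [range(counts[before] + 1) for before, _ in rules]
feasible = [a for a in product(*ranges) if (p := apply(a)) is not None and is_fair(p)]
best = min(feasible, key=cost)
print(best, cost(best))
```

The number of candidate allocations grows multiplicatively with the number of rules and matching records, which illustrates why the minimum cost flow formulation is described as the more efficient option.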
Further, although the above embodiment describes a form in which the data converting program is stored in advance (is installed) in a storage, the present disclosure is not limited to this. The program relating to the technique of the disclosure can also be provided in a form of being stored on a storage medium such as a CD-ROM, a DVD-ROM, a USB memory or the like.
If the distributions of the data change greatly before and after conversion by data conversion for eliminating bias as in the related art, there is the problem that the estimation accuracy of a machine-learned model, which is trained by using the post-conversion data as training data, deteriorates.
In accordance with the technique of the disclosure, change in the distribution of data due to data conversion for eliminating bias can be suppressed.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2022-038624 | Mar 2022 | JP | national |