This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2022-037534, filed on Mar. 10, 2022, the entire contents of which are incorporated herein by reference.
The disclosed technology discussed herein is related to a storage medium, an information processing device, and an information processing method.
In the field of machine learning, it may be desirable to analyze a difference between two datasets, such as a difference between the dataset used for training a machine learning model and the dataset used when the machine learning model is applied. For example, it may be desirable to check the behavior of the machine learning model at the application destination by detecting such a difference, or to analyze the data group that causes the difference.
As a technique related to the analysis of a machine learning model, a factor analysis device has been proposed that quantitatively searches for factors having an impact on results based on a training result of a neural network. In this device, a plurality of input values included in a dataset is set to input nodes of an etiology model, and an output value included in the dataset is set to an output node. Furthermore, this device adjusts weight coefficients of a plurality of nodes included in the etiology model based on the output value and the plurality of input values, and calculates an influence value of each of a plurality of input items on the output value based on the result of the weight coefficient adjustment. Then, this device calculates a contribution of each of the plurality of input items to the output based on the influence values calculated from the plurality of datasets.
Japanese Laid-open Patent Publication No. 2018-198027 is disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable storage medium storing an information processing program that causes at least one computer to execute a process, the process includes acquiring an update amount of a classification criterion of a classification model in retraining, the classification model being trained by using a first dataset, the classification model classifying input data into one of a plurality of classes, the retraining being performed by using a second dataset; and detecting data with a largest change amount among the second dataset when changing each piece of data included in the second dataset so as to decrease the update amount.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
At the time of applying a machine-learned model, the training data used for the machine learning of the model may no longer be available. For example, in business using customer data, it may not be allowed, contractually or from the perspective of a risk of information leakage, to retain certain customer data for a long period of time or to reuse a model machine-learned with that customer data for a task of another customer. In such a case, it is not possible to detect the data that causes the difference between the dataset at the time of training and the dataset at the time of application by directly comparing the two datasets.
Furthermore, the existing technique described above is a technique related to factor analysis in the relationship between input and output, and the calculated contribution represents a feature influence level of each input on the output. For example, according to the existing technique, it is not possible to detect the data that causes the difference between the two datasets.
In one aspect, the disclosed technology aims to detect data that causes a difference between two datasets.
Hereinafter, an exemplary embodiment according to the disclosed technology will be described with reference to the drawings.
Here, as a method of analyzing the difference between two datasets, a method of comparing statistics or the like of the two datasets and identifying the presence or absence of a difference and its cause is conceivable. For example, consider a case of comparing a dataset A and a dataset B.
However, such a comparison is not possible in a case where one of the datasets no longer exists, for example, in a case where only a machine learning model trained with the dataset A remains and the dataset A itself does not remain.
For example, consider comparing a classification model, which has been trained using a training dataset to classify input data into one of a plurality of classes, with a target dataset different from the training dataset. In this case, a method is conceivable that calculates, for each piece of target data, a confidence level based on the distance, in the feature space, from the decision plane indicating the boundary between the individual classes in the classification model. The confidence level is calculated to decrease as the distance from the decision plane decreases.
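As a minimal sketch of this idea (not the source's implementation), the confidence level of a point may be computed from its distance to an assumed linear decision plane w·x + b = 0, squashed so that points near the plane receive low confidence:

```python
import math

def confidence_from_margin(w, b, x):
    """Confidence level that grows with distance from the decision plane
    w . x + b = 0 (an illustrative choice; tanh squashes the distance
    into [0, 1) so points on the plane get confidence 0)."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    distance = abs(margin) / math.sqrt(sum(wi * wi for wi in w))
    return math.tanh(distance)

# A point on the plane has confidence 0; far points approach 1.
on_plane = confidence_from_margin([1.0, 0.0], 0.0, [0.0, 5.0])
far_away = confidence_from_margin([1.0, 0.0], 0.0, [3.0, 0.0])
```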
Furthermore, in a case where the classification model is retrained with the target dataset, the difference between the training dataset and the target dataset is likely to be large when the change of the decision plane of the classification model before and after the retraining is large.
In view of the above, in the present embodiment, how the classification model would change if retraining were carried out is estimated, and the target data that causes the change is detected.
The information processing device 10 functionally includes a calculation unit 12, a determination unit 14, and a detection unit 16. Furthermore, the classification model 20 is stored in a predetermined storage area of the information processing device 10.
The calculation unit 12 calculates an update amount of the weight that identifies the decision plane of the classification model 20 in the case of retraining, based on the target dataset, the classification model 20 that has been trained using the training dataset to classify the input data into one of the plurality of classes. Note that the weight is an example of the classification criterion according to the disclosed technology.
For example, the calculation unit 12 labels each piece of the target data based on the output obtained by inputting that piece of the target data included in the target dataset to the classification model 20. Then, the calculation unit 12 calculates a total loss ΣL, which is the sum of the classification errors between the labels and the outputs of the classification model 20 for the individual pieces of the target data.
Furthermore, the calculation unit 12 calculates a weight update amount |Δw| for the total loss ΣL as an index representing the difference between the training dataset and the target dataset. In a case where the classification model 20 is a differentiable model such as a neural network, the calculation unit 12 may calculate, as the update amount, the magnitude of the gradient indicating the impact of the loss of the classification model 20 for the target dataset on the weight.
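For instance, the update amount may be sketched as the magnitude of the loss gradient with respect to the weights; in this hedged stand-in, central finite differences replace an autograd computation, and the quadratic loss is purely illustrative:

```python
import math

def weight_update_amount(loss, w, eps=1e-6):
    """|Δw|: magnitude of the gradient of the loss with respect to the
    weight vector w, estimated by central finite differences."""
    grad = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += eps
        w_minus[k] -= eps
        grad.append((loss(w_plus) - loss(w_minus)) / (2 * eps))
    return math.sqrt(sum(g * g for g in grad))

# Toy quadratic loss: the gradient at w = (3, 4) is (6, 8), so |Δw| = 10.
update_amount = weight_update_amount(lambda w: w[0] ** 2 + w[1] ** 2, [3.0, 4.0])
```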
In a case where the update amount calculated by the calculation unit 12 is equal to or larger than a predetermined threshold value, the determination unit 14 determines that there is a difference between the training dataset and the target dataset.
When the determination unit 14 determines that there is a difference between the datasets, the detection unit 16 detects target data that causes the difference from the target dataset. For example, the detection unit 16 calculates a movement amount in a case of moving, in the feature space, each data point corresponding to the target data included in the target dataset to decrease the update amount calculated by the calculation unit 12. Note that the movement amount in the case of moving each data point corresponding to the target data in the feature space is one type of a target data change amount. Then, the detection unit 16 detects the target data whose calculated movement amount is relatively large within the target dataset as target data that causes the difference between the datasets, and outputs it as a detection result. Furthermore, the detection unit 16 may output a detection result indicating that the target data that causes the difference between the datasets is unknown data. The unknown data is data not classified into any of the plurality of classes when the target data is input to the classification model 20.
For example, the detection unit 16 considers a movement amount |Δx| of the target data in the case of moving each piece of the target data (input data x) so as to decrease the weight update amount |Δw| as a contribution to the difference between the datasets. Then, the detection unit 16 outputs the target data with a movement amount equal to or larger than a predetermined threshold value as a cause of the change of the classification model 20. Furthermore, in the case where the classification model 20 is a differentiable model, the detection unit 16 may calculate, as the movement amount of each piece of the target data, the magnitude of the gradient, with respect to that piece of the target data, of the magnitude of the gradient of the weight for the loss.
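This nested gradient may be sketched numerically as follows; loss(w, data) is a hypothetical callable standing in for the model's classification error, central finite differences stand in for the gradient computation, and the squared-error loss at the end is only a toy check, not the source's model:

```python
import math

def fd_grad(f, v, eps=1e-5):
    # Central finite-difference gradient of scalar function f at vector v.
    g = []
    for k in range(len(v)):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        g.append((f(vp) - f(vm)) / (2 * eps))
    return g

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def movement_amounts(loss, w, data):
    """|Δx_i| for each data point: the gradient magnitude, with respect
    to x_i, of the weight-gradient magnitude |Δw| = ||dL/dw||."""
    def update_amount(d):
        return norm(fd_grad(lambda ww: loss(ww, d), w))
    amounts = []
    for i in range(len(data)):
        def f(xi, i=i):
            moved = [list(x) for x in data]
            moved[i] = list(xi)
            return update_amount(moved)
        amounts.append(norm(fd_grad(f, data[i])))
    return amounts

# Toy squared-error loss: every point's movement amount works out to 2.
toy = movement_amounts(
    lambda ww, d: sum((ww[0] - x[0]) ** 2 + (ww[1] - x[1]) ** 2 for x in d),
    [0.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]],
)
```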
The information processing device 10 may be implemented by, for example, a computer 40. The computer 40 includes a central processing unit (CPU) 41, a memory 42 as a temporary storage area, and a nonvolatile storage unit 43.
The storage unit 43 may be implemented by a hard disk drive (HDD), a solid state drive (SSD), a flash memory, or the like. The storage unit 43 as a storage medium stores an information processing program 50 for causing the computer 40 to function as the information processing device 10. The information processing program 50 includes a calculation process 52, a determination process 54, and a detection process 56. Furthermore, the storage unit 43 includes an information storage area 60 for storing information included in the classification model 20.
The CPU 41 reads out the information processing program 50 from the storage unit 43, loads it to the memory 42, and sequentially executes the processes included in the information processing program 50. The CPU 41 executes the calculation process 52, thereby operating as the calculation unit 12. Similarly, the CPU 41 executes the determination process 54, thereby operating as the determination unit 14, and executes the detection process 56, thereby operating as the detection unit 16. Furthermore, the CPU 41 reads out the information from the information storage area 60 and loads the classification model 20 to the memory 42.
Note that the functions implemented by the information processing program 50 may also be implemented by, for example, a semiconductor integrated circuit, more specifically, an application specific integrated circuit (ASIC) or the like.
Next, operation of the information processing device 10 according to the present embodiment will be described. When the classification model 20 machine-learned with the training dataset is stored in the information processing device 10 and the target dataset is input to the information processing device 10, the information processing described below is executed in the information processing device 10.
In step S10, the calculation unit 12 obtains the target dataset input to the information processing device 10. Next, in step S12, the calculation unit 12 labels the target data based on the output obtained by inputting each piece of the target data included in the target dataset to the classification model 20. This step may be omitted if the target data is labeled in advance.
Next, in step S14, the calculation unit 12 calculates the total loss, which is the sum of the classification errors between the ground truth label and the output of the classification model 20 for each piece of the target data included in the target dataset. Then, the calculation unit 12 calculates a weight update amount of the classification model 20 with respect to the total loss.
Next, in step S16, the determination unit 14 determines whether or not the weight update amount calculated in step S14 described above is equal to or larger than a predetermined threshold value TH1. The process proceeds to step S18 if the update amount is equal to or larger than the threshold value TH1, and proceeds to step S22 if the update amount is smaller than the threshold value TH1.
In step S18, the detection unit 16 calculates a movement amount in the case of moving each piece of the target data included in the target dataset to decrease the weight update amount calculated in step S14 described above. Next, in step S20, the detection unit 16 detects target data whose calculated movement amount is equal to or larger than a predetermined threshold value TH2 as target data that causes a difference between the datasets. The threshold value TH2 may be a predetermined value, or may be a value dynamically determined to detect a predetermined number of pieces of the target data in descending order of movement amount.
Meanwhile, in step S22, the detection unit 16 determines that there is no difference between the datasets. Next, in step S24, the detection unit 16 outputs a detection result detected in step S20 described above or a detection result indicating no difference in step S22 described above, and the information processing is terminated.
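The flow of steps S14 to S22 may be sketched as follows, under illustrative assumptions: loss(w, data) is a hypothetical callable standing in for the model's classification error, central finite differences stand in for the gradient computation, and the squared-error loss and threshold values are toy choices:

```python
import math

def fd_grad(f, v, eps=1e-5):
    # Central finite-difference gradient of scalar function f at vector v.
    g = []
    for k in range(len(v)):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        g.append((f(vp) - f(vm)) / (2 * eps))
    return g

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def detect_difference(loss, w, data, th1, th2):
    # Step S14: weight update amount |Δw| for the target dataset.
    update_amount = norm(fd_grad(lambda ww: loss(ww, data), w))
    # Steps S16/S22: below TH1, judge that there is no difference.
    if update_amount < th1:
        return []
    causes = []
    # Steps S18/S20: movement amount per data point, thresholded by TH2.
    for i in range(len(data)):
        def f(xi, i=i):
            moved = [list(x) for x in data]
            moved[i] = list(xi)
            return norm(fd_grad(lambda ww: loss(ww, moved), w))
        if norm(fd_grad(f, data[i])) >= th2:
            causes.append(i)
    return causes

# Toy squared-error loss for illustration only.
def sq_loss(w, data):
    return sum((w[0] - x[0]) ** 2 + (w[1] - x[1]) ** 2 for x in data)

detected = detect_difference(sq_loss, [0.0, 0.0],
                             [[1.0, 0.0], [0.0, 1.0]], th1=1.0, th2=1.0)
no_diff = detect_difference(sq_loss, [0.0, 0.0],
                            [[1.0, 0.0], [0.0, 1.0]], th1=5.0, th2=1.0)
```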
Next, the information processing described above will be described more specifically using a simple example.
In this example, it is assumed that the total loss ΣL of the classification model 20 for a dataset is expressed by the following equation.
ΣL = Σi exp((∥p − ai∥ − 1)ci)/N
Here, ai represents a two-dimensional coordinate of the i-th training data, ci represents a label of the i-th training data (positive example: 1, negative example: −1), and N represents the number of pieces of the training data included in the training dataset.
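The loss above can be transcribed directly into code; the ((x, y), c) pair layout is an assumed representation of the data:

```python
import math

def total_loss(p, data):
    """ΣL = Σi exp((||p - ai|| - 1)ci)/N: data holds ((x, y), c) pairs
    with label c = 1 (positive example) or c = -1 (negative example)."""
    return sum(math.exp((math.dist(p, a) - 1.0) * c) for a, c in data) / len(data)

# A positive example located exactly at p contributes exp(-1) ≈ 0.368.
val = total_loss((0.0, 0.0), [((0.0, 0.0), 1)])
```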
The weight in the classification model 20 is p. It is assumed that p optimized by machine learning using the training dataset is (−0.5, 0.0). Furthermore, it is assumed that the target dataset includes the following three pieces of target data:
a1=(0.0,0.0)
a2=(1.0,0.0)
a3=(0.0,1.0)
In this case, the calculation unit 12 calculates the update amount |Δw| of the weight p for the total loss ΣL of the target dataset.
Then, the detection unit 16 calculates, for each piece of the target data, the gradient of the magnitude of the gradient of the weight p with respect to that piece of the target data, and its magnitude, as set out below.
a1:∥(−0.09,−0.36)∥=0.37
a2:∥(−0.09,0.12)∥=0.15
a3:∥(−0.13,−0.26)∥=0.30
Then, for example, when the threshold value TH2 = 0.2 is used, the detection unit 16 detects the target data a1 and a3 as data that causes the difference between the training dataset and the target dataset, that is, as unknown data with respect to the training dataset.
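The values above can be reproduced numerically; the labeling rule used here (a point within distance 1 of p is treated as a positive example) is an inference from the form of the loss, and central finite differences stand in for the analytical gradients:

```python
import math

def total_loss(p, data):
    # ΣL = Σi exp((||p - ai|| - 1)ci)/N, the toy loss from the text.
    return sum(math.exp((math.dist(p, a) - 1.0) * c) for a, c in data) / len(data)

def fd_grad(f, v, eps=1e-6):
    # Central finite-difference gradient of scalar function f at a 2-D point v.
    g = []
    for k in range(len(v)):
        vp, vm = list(v), list(v)
        vp[k] += eps
        vm[k] -= eps
        g.append((f(vp) - f(vm)) / (2 * eps))
    return g

def norm(v):
    return math.sqrt(sum(x * x for x in v))

p = (-0.5, 0.0)
targets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # a1, a2, a3
# Labeling assumption: a point within distance 1 of p is a positive example.
data = [(a, 1 if math.dist(p, a) < 1.0 else -1) for a in targets]

# |Δw|: gradient magnitude of the total loss with respect to the weight p.
update_amount = norm(fd_grad(lambda w: total_loss(w, data), p))  # ≈ 0.30

def movement(i):
    # |Δxi|: gradient magnitude of |Δw| with respect to target data ai,
    # with the label assigned above held fixed while ai moves.
    def f(xi):
        moved = [(tuple(xi) if j == i else a, c) for j, (a, c) in enumerate(data)]
        return norm(fd_grad(lambda w: total_loss(w, moved), p))
    return norm(fd_grad(f, targets[i]))

movements = [movement(i) for i in range(3)]  # ≈ [0.37, 0.15, 0.30]
```

With TH2 = 0.2, only a1 and a3 exceed the threshold, matching the detection result in the text.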
As described above, the information processing device according to the present embodiment calculates the update amount of the classification criterion of the classification model when the classification model, which has been trained using the training dataset to classify the input data into one of the plurality of classes, is retrained based on the target dataset. The target dataset is a dataset different from the training dataset. Then, the information processing device calculates, for each piece of target data included in the target dataset, the movement amount in the case of moving that piece of target data so as to decrease the calculated update amount, and detects the target data whose movement amount is relatively large within the target dataset. As a result, it becomes possible to detect the data that causes the difference between the two datasets even when the training dataset no longer exists.
Note that, while a mode in which the information processing program is stored (installed) in the storage unit in advance has been described in the embodiment above, it is not limited to this. The program according to the disclosed technology may also be provided in a form stored in a storage medium such as a compact disc read only memory (CD-ROM), a digital versatile disc read only memory (DVD-ROM), or a universal serial bus (USB) memory.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2022-037534 | Mar 2022 | JP | national |