This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2022-37362, filed on Mar. 10, 2022, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a determination program, a determination method, and an information processing device.
In analysis using machine learning, when a dataset is given, the given dataset is usually divided into three datasets, namely, training data, validation data, and test data. It is common to first split the test data from the dataset and then divide the remaining data into the training data and the validation data. In addition, there are cases where the division into the training data and the validation data is performed a plurality of times, as in cross-validation.
The training data is the data used when a machine learning pipeline is created. The validation data is the data used for primary evaluation and is used mainly to compare diverse machine learning models. The test data is the data used for final evaluation of the machine learning model selected using the validation data.
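For reference, the three-way division described above can be sketched in Python as follows. The function name, ratios, and seed are illustrative assumptions and not part of the embodiments; the test data is carved off first, and the remainder is then divided into training and validation data:

```python
import random

def split_dataset(rows, test_frac=0.2, val_frac=0.2, seed=0):
    """Split rows into (train, val, test): the test data is carved off
    first, and the remainder is then divided into training and
    validation data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
```

With 100 rows and the fractions above, this yields 64 training rows, 16 validation rows, and 20 test rows, with every row landing in exactly one of the three datasets.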
The purpose of machine learning is to train the relationship between features and an objective variable and to predict the objective variable from the features of new data. In addition, after a machine learning pipeline is selected using the training data and the validation data, training is performed again on “the training data and the validation data” collectively when final evaluation is performed using the test data. Since precise evaluation is not possible unless the division between “(training data+validation data) and (test data)” and the division between “training data and validation data” are performed by the same method, there is only one division method, and that one method is applied twice.
For example, there are cases where the training data and the validation data have already been divided using a division method with a certain criterion, but the division method is unknown due to diverse factors such as the change of analysts. In this case, when new data is added or when the training data is further divided into the training data and the validation data, it is desired to use the same division method as the certain division method. As an approach for identifying the certain division method, there is known an approach of generating a combined dataset by combining the training data and the validation data that have already been divided and dividing the combined dataset by diverse division methods to look for a division method that matches the original divided data.
Examples of the related art include Japanese Laid-open Patent Publication No. 2019-152964.
According to an aspect of the embodiments, there is provided a non-transitory computer-readable recording medium storing a determination program for causing a computer to execute processing. In an example, the processing includes: generating a plurality of division candidate datasets divided in accordance with different criteria from each other, from a combined dataset obtained by combining training data and validation data in a divided dataset that has been divided into the training data and the validation data used for machine learning; generating respective machine learning pipelines that execute machine learning, separately for each of the divided dataset and the plurality of division candidate datasets; using each of the divided dataset and the plurality of division candidate datasets to calculate respective prediction performances, each of the respective prediction performances indicating a prediction performance when a corresponding machine learning pipeline of the respective machine learning pipelines is executed; identifying a division candidate dataset that has the prediction performances closest to the respective prediction performances calculated by using the divided dataset, from among the plurality of division candidate datasets; and determining division criteria used for the identified division candidate dataset to be the division criteria used for the divided dataset.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
However, with the above-described technique, it is not practicable to identify the division method used for the divided data from among a plurality of division method candidates. For example, even when the number of division method candidates is finite, the original divided dataset sometimes does not match the dataset produced by any division method, and additionally, division often includes elements of random numbers, which does not allow the above-described technique to identify the division method accurately.
Note that automated machine learning (AutoML) or the like that automates data analysis has been used, but AutoML uses random division to divide data, and it is thus not feasible to identify the division method used for the divided data.
In addition, it is also conceivable to manually search for the division method, but manual data division is difficult, especially for beginners. For example, using information that is not supposed to be used when making predictions on the test data is called a “leak”, and when a leak occurs, the prediction performance will not be evaluated precisely, the machine learning model may not be properly selected, and the prediction performance during services may consequently degrade.
An object of one aspect is to provide a determination program, a determination method, and an information processing device capable of identifying a division method used to generate divided data used for machine learning.
Hereinafter, embodiments of a determination program, a determination method, and an information processing device disclosed in the present application will be described in detail with reference to the drawings. Note that the present embodiments will not be limited by the following embodiments. In addition, the embodiments may be appropriately combined with each other unless otherwise contradicted.
<Description of Information Processing Device>
First, data used for machine learning will be described.
The purpose of machine learning is to train the relationship between features and an objective variable and to predict the objective variable from the features of new data. In addition, after a machine learning pipeline is selected using the training data and the validation data, training is performed again on “the training data and the validation data” collectively when final evaluation is performed using the test data. Since precise evaluation is not possible unless the division between “(training data+validation data) and (test data)” and the division between “training data and validation data” are performed by the same method, there is only one division method, and that one method is applied twice.
Here, in data division, division is performed in consideration of (1) date and time, (2) groups, and (3) the distribution of the objective variable. For example, (1) date and time impose a requirement that, when making predictions for a certain point in time, future information after that point in time is not permitted to be used. (2) Groups impose a requirement that the training data and the validation data are not permitted to contain the same group. For example, when a machine learning model is used to predict sales for a new store, it is desired that store data in the test data is not included in the training data. Conversely, when sales of miscellaneous merchandise are predicted, it may be unnecessary to consider stores. For (3) the distribution of the objective variable, it is desirable, in the case of a classification model, that each label is included in the training data and the validation data in the same proportion; for example, when there is a label having a small number of samples, division in consideration of labels is desired. In the case of a regression model, it is desirable that the training data and the validation data are close in distribution.
In this manner, since the method for data division has requirements to consider, it is important to identify a division method that satisfies (1), (2), and (3) above, and if the division method is imprecise, the accuracy of the machine learning model thereafter will also deteriorate.
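To make requirement (2) above concrete, the following Python sketch divides rows by whole groups so that no group appears in both the training data and the validation data. The helper names and the fraction of validation groups are illustrative assumptions, not part of the embodiments:

```python
import random
from collections import defaultdict

def group_split(rows, group_key, val_frac=0.25, seed=0):
    """Divide rows so that no group appears in both the training data
    and the validation data (requirement (2) in the text)."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[group_key(row)].append(row)
    groups = sorted(by_group)
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val_groups = max(1, int(len(groups) * val_frac))
    val_groups = set(groups[:n_val_groups])
    train = [r for g in groups if g not in val_groups for r in by_group[g]]
    val = [r for g in val_groups for r in by_group[g]]
    return train, val
```

For instance, rows keyed by a store identifier would be split so that every store's data lands entirely on one side of the division, mirroring the new-store sales example above.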
Next, a usually used reference technique for identifying a data division approach will be described.
For example, the reference technique generates a division dataset 1 obtained by dividing a dataset into training data 1 and validation data 1 using a division method 1, a division dataset 2 obtained by dividing the dataset into training data 2 and validation data 2 using a division method 2, and so forth. Thereafter, the reference technique compares the original division dataset with each of the division datasets 1 to N generated by the division methods 1 to N and, by searching for a matching division dataset, identifies the division method used for the original division dataset.
However, in the approach of the reference technique, even when the candidates for division are assumed to be finite, there are cases where the original division dataset does not match any of the division datasets 1 to N in the first place, and the division often includes elements of random numbers. In addition, in the approach of the reference technique, when there is no perfect match, it is not clear which index should be used to evaluate which division is most plausible. Furthermore, even if the existing AutoML technique is used, whether random division is adopted is given as an input, and it is not feasible to select a plausible division method.
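The exact-match search of the reference technique can be sketched as follows (Python; the function and method names are hypothetical). Note that it returns nothing when no candidate reproduces the original division, which illustrates the limitation described above:

```python
def identify_by_exact_match(original_split, combined, candidate_methods):
    """Reference technique: re-divide the combined dataset with each
    candidate method and search for a division that exactly matches the
    original one.  Returns the method name, or None when nothing matches
    (often the case when the division involves random numbers)."""
    orig_train, orig_val = original_split
    for name, method in candidate_methods.items():
        train, val = method(combined)
        if set(train) == set(orig_train) and set(val) == set(orig_val):
            return name
    return None

# Two hypothetical candidate division methods over a toy dataset.
methods = {
    "first_half": lambda d: (d[:5], d[5:]),
    "even_odd": lambda d: (d[::2], d[1::2]),
}
combined = list(range(10))
```

An original split of even rows versus odd rows is found by "even_odd", whereas a split that neither candidate can reproduce yields None, leaving no indication of which candidate is most plausible.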
Thus, by focusing on the facts that “the prediction performance can be evaluated similarly to the given division” and that “comparison results between a plurality of machine learning pipelines are similar to those in the given division”, the information processing device 10 according to the first embodiment identifies an unknown division method by trying a plurality of machine learning pipelines and regarding as correct the division with which the machine learning pipelines have the closest prediction performances.
For example, the information processing device 10 generates a combined dataset obtained by combining the training data and the validation data in a divided dataset that has been divided into the training data and the validation data used for machine learning but for which the division method is unknown. Then, the information processing device 10 uses the combined dataset to generate a plurality of division candidate datasets divided by each of the division methods 1 to N in accordance with different criteria from each other.
The information processing device 10 generates respective machine learning pipelines that execute machine learning, separately for each of the divided dataset and the plurality of division candidate datasets. The information processing device 10 uses each of the divided dataset and the plurality of division candidate datasets to calculate respective prediction performances when the respective machine learning pipelines are executed.
The information processing device 10 identifies a division candidate dataset that has the prediction performances closest to the respective prediction performances calculated using the divided dataset, from among the plurality of division candidate datasets, and determines the division criterion used for the identified division candidate dataset to be the division criterion used for the divided dataset for which the division method is unknown.
As a result, the information processing device 10 may identify the division method used to generate the divided data used for machine learning.
<Functional Configuration>
The communication unit 11 is a processing unit that controls communication with another device and is implemented by, for example, a communication interface. For example, the communication unit 11 receives various instructions from a terminal of an administrator and transmits the processing result of the control unit 20 to the terminal of the administrator.
The storage unit 12 is a processing unit that stores various types of data, programs executed by the control unit 20, and the like and is implemented by a memory or a hard disk, for example. This storage unit 12 stores a divided dataset d0.
The divided dataset d0 is a dataset used to train a machine learning model using a neural network or the like and is a dataset that has been divided into training data and validation data by a certain division method.
The control unit 20 is a processing unit that exercises overall control of the information processing device 10 and, for example, is implemented by a processor or the like. This control unit 20 includes a generation unit 21, a prediction processing unit 22, and a determination unit 23. Note that the generation unit 21, the prediction processing unit 22, and the determination unit 23 are implemented by an electronic circuit included in a processor, a process executed by the processor, or the like.
The generation unit 21 is a processing unit that generates a plurality of division candidate datasets divided in accordance with different criteria from each other, from a combined dataset obtained by combining the training data and the validation data in the divided dataset d0 that has been divided into the training data and the validation data used for machine learning. Then, the generation unit 21 outputs generated division candidate datasets d1 to dN to the prediction processing unit 22.
For example, the division methods are the known division methods S1 to SN given in advance, such as random division and division based on numerical values set in columns. To give an example, the division method S1 is an approach for generating the division candidate dataset d1 by randomly dividing the combined dataset into the training data and the validation data. In addition, the division method S2 is an approach for generating the division candidate dataset d2 by dividing into the training data and the validation data such that the proportion of data of which the value of a column (for example, the height) is equal to or greater than a threshold value becomes similar to the proportion before the division. The division method S3 is an approach for generating the division candidate dataset d3 by dividing into the training data and the validation data such that the proportion of data of which the value of the first column (for example, the height) is less than a threshold value and the value of the second column (for example, the weight) is equal to or greater than a threshold value becomes similar to the proportion before the division. Note that, while the division method used for the divided dataset d0 is unknown, the ratio between its training data and validation data is known, and each division candidate dataset is assumed to be divided at a similar ratio.
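Hypothetical sketches of two such division methods are given below in Python. S1 divides randomly, while S2 divides so that the proportion of rows whose "height" column is at or above a threshold stays similar to the proportion before division; the names, threshold, and ratios are illustrative assumptions:

```python
import random

def s1_random(rows, val_frac=0.25, seed=0):
    """Division method S1 (sketch): random division."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]   # (training, validation)

def s2_stratified(rows, val_frac=0.25, threshold=170.0, seed=0):
    """Division method S2 (sketch): divide each stratum (height above or
    below the threshold) separately so the proportion of each stratum is
    similar before and after the division."""
    high = [r for r in rows if r["height"] >= threshold]
    low = [r for r in rows if r["height"] < threshold]
    train, val = [], []
    for stratum in (high, low):
        t, v = s1_random(stratum, val_frac, seed)
        train += t
        val += v
    return train, val
```

With 20 rows of heights 160 to 179 and the defaults above, half of the validation rows have a height at or above the threshold, matching the one-half proportion in the data before division.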
The prediction processing unit 22 is a processing unit that generates respective machine learning pipelines that execute machine learning, separately for each of the divided dataset and the plurality of division candidate datasets, and calculates the respective prediction performances when the respective machine learning pipelines are executed, using each of the divided dataset and the plurality of division candidate datasets.
For example, the prediction processing unit 22 uses a technique such as AutoML to generate a machine learning pipeline P0 corresponding to the divided dataset d0. Similarly, the prediction processing unit 22 uses a technique such as AutoML to generate machine learning pipelines P1 to PN corresponding to the division candidate datasets d1 to dN, respectively. Note that a machine learning pipeline represents a series of processes including preprocessing, such as missing-value interpolation and scaling, and machine learning model generation.
The determination unit 23 is a processing unit that identifies a division candidate dataset that has the prediction performances closest to the respective prediction performances calculated using the divided dataset, from among the plurality of division candidate datasets, and determines the division criterion used for the identified division candidate dataset to be the division criterion used for the divided dataset.
For example, the determination unit 23 generates a first vector whose components are the respective prediction performances when each of the machine learning pipelines P0 and P1 to PN is executed using the divided dataset d0. Similarly, the determination unit 23 generates second vectors whose components are the respective prediction performances when each of the machine learning pipelines P0 and P1 to PN is executed, for each of the plurality of division candidate datasets d1 to dN. Then, the determination unit 23 calculates the similarity between the second vectors separately corresponding to each of the plurality of division candidate datasets d1 to dN and the first vector and identifies the division candidate dataset corresponding to the second vector with the highest similarity, from among the plurality of division candidate datasets d1 to dN.
In addition, the determination unit 23 identifies the tendency of the respective prediction performances when each of the machine learning pipelines P0 and P1 to PN is executed using the divided dataset d0. Similarly, the determination unit 23 identifies the tendency of the respective prediction performances when each of the machine learning pipelines P0 and P1 to PN is executed, for each of the plurality of division candidate datasets d1 to dN. Then, the determination unit 23 identifies a division candidate dataset having prediction performances of which the tendency is similar to the tendency of the respective prediction performances corresponding to the divided dataset d0, from among the plurality of division candidate datasets d1 to dN.
Here, a specific example of the above process of the determination unit 23 will be described with reference to the drawings.
As illustrated in the drawing, using the divided dataset d0, the determination unit 23 identifies a prediction performance e0,0 when the machine learning pipeline P0 is executed, a prediction performance e0,1 when the machine learning pipeline P1 is executed, a prediction performance e0,j when the machine learning pipeline Pj is executed, and a prediction performance e0,N when the machine learning pipeline PN is executed.
Similarly, using the division candidate dataset d1, the determination unit 23 identifies a prediction performance e1,0 when the machine learning pipeline P0 is executed, a prediction performance e1,1 when the machine learning pipeline P1 is executed, a prediction performance e1,j when the machine learning pipeline Pj is executed, and a prediction performance e1,N when the machine learning pipeline PN is executed.
Similarly, using the division candidate dataset di, the determination unit 23 identifies a prediction performance ei,0 when the machine learning pipeline P0 is executed, a prediction performance ei,1 when the machine learning pipeline P1 is executed, a prediction performance ei,j when the machine learning pipeline Pj is executed, and a prediction performance ei,N when the machine learning pipeline PN is executed.
Similarly, using the division candidate dataset dN, the determination unit 23 identifies a prediction performance eN,0 when the machine learning pipeline P0 is executed, a prediction performance eN,1 when the machine learning pipeline P1 is executed, a prediction performance eN,j when the machine learning pipeline Pj is executed, and a prediction performance eN,N when the machine learning pipeline PN is executed.
Then, the determination unit 23 generates a vector V0 whose components are the prediction performance e0,0, the prediction performance e0,1, the prediction performance e0,j, and the prediction performance e0,N for the divided dataset d0. Similarly, the determination unit 23 generates a vector V1 whose components are the prediction performance e1,0, the prediction performance e1,1, the prediction performance e1,j, and the prediction performance e1,N for the division candidate dataset d1. The determination unit 23 generates a vector Vi whose components are the prediction performance ei,0, the prediction performance ei,1, the prediction performance ei,j, and the prediction performance ei,N for the division candidate dataset di. Similarly, the determination unit 23 generates a vector VN whose components are the prediction performance eN,0, the prediction performance eN,1, the prediction performance eN,j, and the prediction performance eN,N for the division candidate dataset dN.
Thereafter, as illustrated in (1) of the drawing, the determination unit 23 calculates the similarity between the vector V0 and each of the vectors V1 to VN, identifies the division candidate dataset corresponding to the vector with the highest similarity, and determines the division method for the identified division candidate dataset to be the division method S0 used for the divided dataset d0.
As another approach, as illustrated in (2) of the drawing, the determination unit 23 identifies a division candidate dataset having prediction performances whose tendency is similar to the tendency of the respective prediction performances corresponding to the divided dataset d0.
In addition, the determination unit 23 identifies the order of respective prediction performances starting from the highest prediction performance for each division candidate dataset and determines the division method Sn for the division candidate dataset dn having the same order as the order of the divided dataset d0 to be the division method S0 used for the divided dataset d0.
Furthermore, the determination unit 23 calculates the similarity between the prediction performance of the divided dataset d0 and the prediction performances of the plurality of division candidate datasets d1 to dN for each of the machine learning pipelines P0 and P1 to PN. For example, the determination unit 23 separately calculates the differences between the prediction performance e0,0 of the divided dataset d0 and the respective prediction performances e1,0, ei,0, and eN,0 of the division candidate datasets d1 to dN. Then, the determination unit 23 identifies, for example, the division candidate dataset dn having an average value or variance of the differences less than a threshold value, or the division candidate dataset dn having the smallest difference between the maximum value and the minimum value of the differences. Then, the division method Sn for the identified division candidate dataset dn is determined to be the division method S0 used for the divided dataset d0.
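The vector-similarity determination and the order-of-performance tendency check can be sketched as follows in Python. Cosine similarity is one plausible choice of similarity measure, assumed here for illustration, and the function names are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two prediction-performance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranks(v):
    """Rank of each component, 0 meaning the highest prediction
    performance; equal ranks indicate the same tendency."""
    order = sorted(range(len(v)), key=lambda i: -v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def identify_division(v0, candidates):
    """candidates maps a division-method name to the performance vector
    measured on the corresponding division candidate dataset; the method
    whose vector is most similar to V0 is determined to be the one used."""
    return max(candidates, key=lambda name: cosine(v0, candidates[name]))

v0 = [0.9, 0.7, 0.5]
candidates = {"S1": [0.5, 0.9, 0.7], "S2": [0.88, 0.72, 0.49]}
```

In this toy example, S2's performance vector is both closest to V0 in cosine similarity and identical to it in rank order, so S2 would be determined to be the division method used for the divided dataset.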
Note that the determination unit 23, for example, outputs and displays the determined division method S0 used for the divided dataset d0 on a display or the like, or transmits the determined division method S0 to the terminal of the administrator. In addition, the determination approaches (1) and (2) described above may be used in combination.
<Processing Flow>
Subsequently, the control unit 20 acquires the divided dataset d0 from the storage unit 12 (S103) and generates a combined dataset obtained by combining the training data and the validation data in the divided dataset d0 (S104).
Then, the control unit 20 uses each of the division methods S1 to SN to generate the division candidate datasets d1 to dN from the combined dataset (S105). Subsequently, the control unit 20 generates each of the machine learning pipelines P0 and P1 to PN for each of the divided dataset d0 and the division candidate datasets d1 to dN (S106).
Thereafter, the control unit 20 executes each of the machine learning pipelines P0 and P1 to PN for each of the divided dataset d0 and the division candidate datasets d1 to dN to acquire the respective prediction performances (S107).
Then, the control unit 20 identifies the division candidate dataset dn having the prediction performances closest to the prediction performances of the divided dataset d0 (S108) and determines the division method Sn for the identified division candidate dataset dn to be the division method used for the divided dataset d0 (S109).
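Steps S103 to S109 can be traced end to end on a toy example as below. The division methods, the majority-class "pipeline", and accuracy as the prediction performance index are all hypothetical simplifications of what the embodiments describe:

```python
def div_first_half(rows):
    half = len(rows) // 2
    return rows[:half], rows[half:]

def div_even_odd(rows):
    return rows[::2], rows[1::2]

def accuracy(train, val):
    # Stand-in "pipeline": predict the majority label of the training data.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in val if y == majority) / len(val)

# S103-S104: the divided dataset d0 (division method unknown) is recombined.
rows = [(i, 0 if i < 12 else 1) for i in range(20)]
d0_train, d0_val = rows[::2], rows[1::2]          # in fact "even_odd"
combined = sorted(d0_train + d0_val)

# S105-S107: re-divide by each candidate method and measure performance.
methods = {"first_half": div_first_half, "even_odd": div_even_odd}
e0 = accuracy(d0_train, d0_val)
perf = {name: accuracy(*m(combined)) for name, m in methods.items()}

# S108-S109: the candidate whose performance is closest to d0's is
# determined to be the division method used for d0.
best = min(perf, key=lambda name: abs(perf[name] - e0))
```

Here the "even_odd" candidate reproduces the performance measured on the original division exactly, while "first_half" does not, so "even_odd" is determined to be the division method used.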
<Effects>
As described above, when given a divided dataset including the objective variable, candidates for division, and an index for the prediction performance, the information processing device 10 according to the first embodiment may identify the division method used to generate the divided data from among the division candidates. In addition, since the information processing device 10 according to the first embodiment can identify the division method using the similarity of the prediction performances, the tendency of the prediction performances, and the like, the occurrence of a leak may also be suppressed, and a human error caused by manual operation may also be suppressed.
Furthermore, the information processing device 10 according to the first embodiment may achieve reduction of the time involved in identifying the division method, compared with comparing all candidates as in the reference technique or with manual identification. Additionally, since the information processing device 10 according to the first embodiment can identify the division method using objective information such as the prediction performance, the validity of the identified division method is high, and degradation in prediction performance during services after machine learning using the identified division method may also be suppressed.
Next, an example of a usage scene of the division method identified using the approach according to the above first embodiment will be described.
As a result, the information processing device 10 can further divide the already existing training data into new training data and validation data and can treat the already existing validation data as test data. Therefore, the information processing device 10 may generate the training data and the validation data while ensuring the validity of the division method, for example, when the number of training targets increases.
As a result, when supervised data is added to the divided dataset, the information processing device 10 may generate the training data and the validation data while ensuring the validity of the division method. Therefore, the information processing device 10 may also improve the accuracy of the machine learning model to be trained while increasing the amount of data for training by adding supervised data.
Incidentally, while the embodiments have been described above, the embodiments may be carried out in a variety of different modes in addition to the embodiments described above.
<Numerical Values, Etc.>
The exemplary numerical values, exemplary data, column names, number of columns, number of pieces of data, and the like used in the embodiments described above are merely examples and may be optionally modified. In addition, the processing flow described in each flowchart may be appropriately modified unless otherwise contradicted. Each division method is an example of different criteria.
<System>
Pieces of information including the processing procedure, control procedure, specific name, and various types of data and parameters described above or illustrated in the drawings may be optionally modified unless otherwise noted.
Furthermore, each constituent element of each device illustrated in the drawings is functionally conceptual and does not necessarily have to be physically configured as illustrated in the drawings. For example, specific forms of distribution and integration of the respective devices are not limited to those illustrated in the drawings. For example, all or a part of the devices may be configured by being functionally or physically distributed or integrated in optional units according to various loads, use situations, or the like.
Moreover, all or any part of individual processing functions performed in each device may be implemented by a central processing unit (CPU) and a program analyzed and executed by the CPU or may be implemented as hardware by wired logic.
<Hardware>
The communication device 10a is a network interface card or the like and communicates with another device. The HDD 10b stores a program that activates the functions illustrated in the drawings, and data.
The processor 10d reads a program that executes processing similar to the processing of each processing unit illustrated in the drawings from the HDD 10b or the like and loads the read program into the memory 10c, thereby running a process that executes each of the functions described above.
In this manner, the information processing device 10 works as an information processing device that executes an information processing method by reading and executing a program. In addition, the information processing device 10 may implement functions similar to the functions in the embodiments described above by reading the program described above from a recording medium with a medium reading device and executing the read program described above. Note that the program referred to in other embodiments is not limited to being executed by the information processing device 10. For example, the embodiments described above may be similarly applied also to a case where another computer or server executes the program or a case where these computer and server cooperatively execute the program.
This program may be distributed via a network such as the Internet. In addition, this program may be recorded in a computer-readable recording medium such as a hard disk, a flexible disk (FD), a compact disc read only memory (CD-ROM), a magneto-optical disk (MO), or a digital versatile disc (DVD) and may be executed by being read from the recording medium by a computer.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.