This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2020-150056, filed on Sep. 7, 2020, the entire contents of which are incorporated herein by reference.
One embodiment of the present disclosure relates to an information processing device and an information processing method.
A regression model with penalty terms has been proposed as a method for extracting a feature amount from a large amount of data (big data). This regression model has a problem that a feature amount similar to one selected as an explanatory variable cannot be extracted. Therefore, there is a problem that important factors included in big data can be easily overlooked.
Further, the work of extracting a feature amount or a similar feature amount from big data depends on a data size of the big data, and the larger the data size, the longer the extraction work takes.
According to one embodiment, an information processing device has an inputter configured to input analysis target data including a plurality of explanatory variables, a screening processor configured to generate intermediate data with the number of the explanatory variables included in the analysis target data reduced by using a part of the plurality of explanatory variables as objective variables, a first feature amount extractor configured to extract a first feature amount from the intermediate data based on the objective variables, and a similar feature amount extractor configured to extract a similar feature amount from the intermediate data based on a degree of similarity between the explanatory variables included in the intermediate data and the first feature amount.
Hereinafter, embodiments of an information processing device will be described with reference to the drawings. In the following, main components of the information processing device will be mainly described, but the information processing device may have components and functions not illustrated in the drawings or described. The following descriptions do not exclude components or functions not illustrated in the drawings or described.
The input unit 2 inputs analysis target data including a plurality of explanatory variables. The specific contents of the analysis target data are not particularly limited; they are, for example, a large amount of data (big data) exceeding tens of thousands of dimensions. Individual data items in the analysis target data are called explanatory variables, and some of the explanatory variables are used as objective variables. The present embodiment is intended to select, from the explanatory variables, those that affect an objective variable. As a specific example, the analysis target data may be data generated in a manufacturing process of a semiconductor factory or may be other data.
The screening processing unit 3 uses a part of the explanatory variables as the objective variable and generates intermediate data by reducing the number of explanatory variables included in the analysis target data. More specifically, the screening processing unit 3 generates the intermediate data by deleting some explanatory variables from the analysis target data in such a way that the feature amounts are not lost. Therefore, although the intermediate data contain fewer explanatory variables than the analysis target data, they retain feature amounts comparable to those of the analysis target data. For example, when the analysis target data have more than tens of thousands of dimensions, the screening processing unit 3 generates intermediate data narrowed down to several thousand dimensions. The degree to which the screening processing unit 3 reduces the analysis target data to generate the intermediate data is arbitrary.
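As a non-limiting illustration of the screening described above, the following Python sketch ranks the explanatory variables by the absolute value of their correlation with the objective variable and keeps only the highest-ranked ones as the intermediate data. The ranking criterion, the function names pearson and screen, the parameter keep, and the sample values are assumptions introduced for illustration; the embodiment does not prescribe a specific screening method.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def screen(columns, objective, keep):
    """Return the names of the `keep` columns most correlated with the objective.

    Assumed criterion: absolute Pearson correlation. `columns` maps a
    hypothetical variable name to its list of observed values.
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], objective)),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical analysis target data: three explanatory variables, one objective.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "x3": [1.0, -1.0, 1.0, -1.0, 1.0],
}
objective = [1.1, 2.0, 2.9, 4.2, 5.0]

intermediate = screen(columns, objective, keep=2)
print(intermediate)
```

In this sketch the value of keep plays the role of the arbitrary reduction degree mentioned above; a larger value retains more explanatory variables at the cost of a larger intermediate data size.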
The feature amount extraction unit 4 extracts the feature amount from the intermediate data based on the objective variable. A feature amount is an explanatory variable that affects the objective variable included in the analysis target data. That is, the feature amount is an explanatory variable having a high degree of correlation with the objective variable. As will be described later, in the present specification, the feature amount extracted by the feature amount extraction unit 4 may be referred to as a first feature amount, and the feature amount extraction unit 4 may be referred to as a first feature amount extraction unit. The degree of correlation is represented by a correlation value as described later, and the larger the correlation value, the higher the degree of correlation.
The similar feature amount extraction unit 5 extracts the similar feature amount from the intermediate data based on a degree of similarity between the explanatory variables included in the intermediate data and the feature amount.
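The extraction of similar feature amounts may be sketched, for example, as follows. This Python sketch assumes that the degree of similarity is the absolute value of the Pearson correlation between an explanatory variable and an already extracted feature amount, and that a variable is regarded as similar when this value is at or above a threshold. The similarity measure, the threshold value, and all names are assumptions for illustration only.

```python
import math


def abs_corr(a, b):
    """Absolute Pearson correlation (0.0 if either sequence is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


def similar_features(columns, feature, threshold):
    """Names of columns whose assumed similarity to `feature` reaches the threshold."""
    reference = columns[feature]
    return [
        name
        for name, values in columns.items()
        if name != feature and abs_corr(values, reference) >= threshold
    ]


# Hypothetical intermediate data: "f" is an extracted feature amount.
columns = {
    "f": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly proportional to "f"
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to "f"
}

similar = similar_features(columns, "f", threshold=0.9)
print(similar)
```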
The information processing device 1 of
The information processing device 1 of
The information processing device 1 of
The characteristic analysis unit 8 described above may have a distribution detection unit 9, a distribution evaluation unit 10, and a correlation calculating unit 11.
The distribution detection unit 9 detects distribution of the explanatory variables included in the analysis target data. The distribution evaluation unit 10 evaluates the distribution of the explanatory variables detected by the distribution detection unit 9. The correlation calculating unit 11 extracts the characteristic data based on the evaluation result of the distribution evaluation unit 10.
The information processing device 1 of
The regression model construction unit 6 extracts the feature amounts contained in the intermediate data by using a sparse modeling technique. Further, the similar feature amount extraction unit 5 extracts the similar feature amounts from the intermediate data based on the degree of similarity between the explanatory variables included in the intermediate data and the feature amounts. Calculation methods for extracting the similar feature amounts from the intermediate data are not particularly limited.
A mathematical formula of the regression model constructed by the regression model construction unit 6 is represented by, for example, formula (1).
$$y = X\beta \;\left(= \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\right) \tag{1}$$
The feature amounts extracted by the feature amount extraction unit 4 are obtained, for example, by using the Lasso formula illustrated in formula (2) below. The objective function in formula (2) is formed by adding an L1 penalty term (the second term on the right-hand side) to a squared error term (the first term on the right-hand side). Among the explanatory variables X, those whose regression coefficients remain non-zero when this objective function is minimized are the feature amounts.
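$$\hat{\beta}_{\mathrm{LASSO}} = \underset{\beta}{\arg\min}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1, \qquad \lVert \beta \rVert_1 = |\beta_1| + \cdots + |\beta_p| \tag{2}$$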
The formula (1) is an example of a regression model, and the formula (2) is an example of a mathematical formula for obtaining the feature amounts. The feature amounts may be extracted using mathematical formulae other than the formulae (1) and (2).
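As one possible way of evaluating formula (2), the following Python sketch minimizes the squared error plus the L1 penalty by cyclic coordinate descent with soft thresholding. Coordinate descent is a well-known technique for this objective but is not stated in the embodiment; it is an assumption here, as are the function names, the number of sweeps, and the sample data. The columns are assumed to be centred so that the intercept β0 of formula (1) can be omitted.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(columns, y, lam, sweeps=200):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent.

    `columns` is a list of centred columns of X (an assumption of this sketch).
    For one coordinate the minimiser is soft_threshold(rho, lam / 2) / z,
    where z is the squared norm of the column and rho is its inner product
    with the partial residual.
    """
    beta = [0.0] * len(columns)
    residual = list(y)  # equals y - X beta for beta = 0
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col)
            if z == 0.0:
                continue  # a constant (all-zero after centring) column is skipped
            # Inner product with the residual that excludes column j's own term.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual))
            new = soft_threshold(rho, lam / 2.0) / z
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new
    return beta


# Hypothetical centred data: y depends only on the first column.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -1.0, 0.0, -1.0, 1.0]
y = [3.0 * v for v in x1]

beta = lasso([x1, x2], y, lam=2.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

In this example the coefficient of the first column is shrunk from 3 to approximately 2.9 by the penalty, and the coefficient of the second column is exactly zero, so only the first column is treated as a feature amount.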
As described above, in the first embodiment, the feature amounts are extracted based on the intermediate data generated by screening the analysis target data and significantly reducing the data size, and the similar feature amounts are extracted based on the degree of similarity between the explanatory variables included in the intermediate data and the feature amounts. Since the intermediate data are data whose data size is significantly smaller than that of the analysis target data while maintaining the feature amounts of the analysis target data, the similar feature amounts can be quickly extracted. In particular, since the intermediate data maintains the feature amounts of the analysis target data, the similar feature amounts can be extracted accurately without omission. By extracting the similar feature amounts, it is possible to extract important factors included in the analysis target data without overlooking them.
In an information processing device 1a according to a second embodiment, the processing operation of the screening processing unit 3 is different from that of the first embodiment.
After the screening processing unit 3 finishes generating multiple intermediate data, the first feature amount extraction unit 4a extracts a plurality of feature amounts in association with the multiple intermediate data. The similar feature amount extraction unit 5 extracts similar feature amounts from the intermediate data corresponding to each of a plurality of first feature amounts. Each time the screening processing unit 3 generates new intermediate data, the second feature amount extraction unit 4b extracts a second feature amount based on the new intermediate data. The first feature amount is a feature amount that is finally extracted from the analysis target data, while the second feature amount is an intermediate feature amount that is extracted in a process of screening processing.
The second feature amount extraction unit 4b extracts the second feature amount each time the screening processing unit 3 generates the intermediate data. More specifically, the second feature amount extraction unit 4b extracts the second feature amount included in the intermediate data based on the regression model constructed by the regression model construction unit 6 using the sparse modeling technique.
The information processing device 1a of
The objective variable update unit 13 generates a new objective variable each time the second feature amount extraction unit 4b extracts the second feature amount. The explanatory variable update unit 14 generates a new explanatory variable each time the second feature amount extraction unit 4b extracts the second feature amount. The analysis target update unit 15 updates the analysis target data so as to include a new objective variable and a new explanatory variable. The screening processing unit 3 generates new intermediate data from the updated analysis target data.
The information processing device 1a of
The information processing device 1a of
The number-of-times determination unit 17 determines whether the number of times the second feature amount has been extracted by the second feature amount extraction unit 4b has reached a predetermined number of times. When it is determined that the predetermined number of times has not been reached, the correlation calculation unit 18 calculates a correlation value between the new objective variable and the new analysis target data. The correlation degree determination unit 19 determines whether the correlation value is equal to or greater than a predetermined threshold value. When the correlation value is equal to or greater than the predetermined threshold value, the screening processing unit 3 continues the generation of the intermediate data, and when the correlation value is less than the threshold value, the screening processing unit 3 stops the generation of the intermediate data.
The information processing device 1a of
The information processing device 1a of
The information processing device 1a of
In
The screening processing unit 3 generates the intermediate data X′j having the data size corresponding to the characteristic data. The second feature amount extraction unit 4b extracts the second feature amount X″j from the intermediate data X′j.
The processings of the broken line portions in
After the screening processing by the screening processing unit 3 is completed, the first feature amount extraction unit 4a extracts the first feature amounts using all the intermediate data generated by the screening processing unit 3. At that time, the first feature amount extraction unit 4a examines, for each extracted first feature amount, which of the intermediate data generated by the screening processing unit 3 (that is, which round of the screening processing) it was extracted from. The similar feature amount extraction unit 5 does not use all the intermediate data but extracts a similar feature amount only from the intermediate data from which the individual first feature amount was extracted.
As a specific example, it is assumed that the screening processing unit 3 repeats the processing of generating the intermediate data three times. Assuming that the intermediate data generated by the screening processing unit 3 each time are “data 1”, “data 2”, and “data 3”, intermediate data “data” finally output by the screening processing unit 3 are data=“data 1”+“data 2”+“data 3”.
The first feature amount extraction unit 4a extracts the first feature amount from the intermediate data “data”. At this time, for example, it is assumed that four first feature amounts F1, F2, F3, and F4 are extracted. The first feature amount extraction unit 4a examines, for example, that the first feature amount F1 is extracted from the intermediate data “data 1”, the first feature amounts F2 and F3 are extracted from the intermediate data “data 2”, and the first feature amount F4 is extracted from the intermediate data “data 3”.
In this case, the similar feature amount extraction unit 5 extracts the similar feature amount of the first feature amount F1 from the intermediate data “data 1”, extracts the similar feature amounts of the first feature amounts F2 and F3 from the intermediate data “data 2”, and extracts the similar feature amount of the first feature amount F4 from intermediate data “data 3”.
In this way, by limiting a range in which the similar feature amount extraction unit 5 extracts the similar feature amount, a processing speed for extracting the similar feature amount can be improved.
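The limitation of the search range described in the example above may be sketched as follows. In this Python sketch each piece of intermediate data is represented as a dictionary of named columns, the originating intermediate data of a first feature amount is found by a name lookup, and similarity is assumed to be the absolute Pearson correlation compared against a threshold. The data layout, the similarity measure, and all names and values are assumptions for illustration.

```python
import math


def abs_corr(a, b):
    """Absolute Pearson correlation (0.0 if either sequence is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


def similar_within_origin(batches, first_features, threshold):
    """For each first feature, search only the intermediate data it came from.

    `batches` maps a hypothetical label such as "data 1" to its columns.
    A feature that appears in no batch raises StopIteration in this sketch.
    """
    result = {}
    for feature in first_features:
        origin = next(label for label, cols in batches.items() if feature in cols)
        cols = batches[origin]
        similar = [
            name
            for name in cols
            if name != feature and abs_corr(cols[name], cols[feature]) >= threshold
        ]
        result[feature] = (origin, similar)
    return result


# Hypothetical intermediate data generated in two rounds of screening.
batches = {
    "data 1": {
        "F1": [1.0, 2.0, 3.0, 4.0],
        "u": [2.0, 4.0, 6.0, 8.5],
        "v": [1.0, -1.0, -1.0, 1.0],
    },
    "data 2": {
        "F2": [4.0, 1.0, 3.0, 2.0],
        "w": [8.0, 2.0, 6.0, 4.2],
    },
}

result = similar_within_origin(batches, ["F1", "F2"], threshold=0.9)
print(result)
```

Because each first feature amount is compared only with the columns of its own originating intermediate data, the number of similarity calculations in this sketch grows with the size of one piece of intermediate data rather than with the size of all of them combined.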
Next, the characteristic analysis unit 8 extracts the characteristic data from the analysis target data (step S2). A detailed processing procedure of the characteristic analysis unit 8 will be described later.
Next, the screening processing unit 3 performs the screening processing based on the analysis target data and the characteristic data and generates intermediate data X′0 having the data size corresponding to the characteristic data (step S3). The analysis target data in step S3 are the analysis target data input in step S1, and X0=X and d0=Y.
Next, the second feature amount extraction unit 4b extracts a second feature amount X″0 from the intermediate data X′0 (step S4). The second feature amount extraction unit 4b extracts the second feature amount by using, for example, the Lasso formula of formula (2) above.
Next, a linear prediction value Ŷ0 of the extracted second feature amount X″0 is calculated (step S5). The linear prediction value Ŷ0 is a value obtained by multiplying the second feature amount X″0 by a coefficient β0.
Next, an objective variable d1=d0−Ŷ0 is calculated (step S6). Next, an explanatory variable X1=X−X′0 is set (step S7). The analysis target data are updated with the objective variable d1 and the explanatory variable X1.
Next, a variable j=1 for counting the number of screenings is set (step S8).
It is determined whether the variable j is within a predetermined number of times value D_Iteration (step S9). When the variable j exceeds the predetermined number of times value D_Iteration, the processing ends. The processing of step S9 is performed by the number-of-times determination unit 17 of
When the variable j is within the predetermined number of times value D_Iteration, the characteristic analysis unit 8 extracts characteristic data Xj and dj from the updated analysis target data (step S10).
Next, the screening processing unit 3 performs the screening processing based on the analysis target data and the characteristic data and generates the intermediate data X′j having the data size corresponding to the characteristic data (step S11).
Next, the second feature amount extraction unit 4b extracts the second feature amount X″j from the intermediate data X′j (step S12). Next, a linear prediction value Ŷj of the extracted second feature amount X″j is calculated (step S13). The linear prediction value Ŷj is a value obtained by multiplying the second feature amount X″j by a coefficient βj.
Next, the objective variable dj+1=dj−Ŷj is calculated (step S14). Next, the explanatory variable Xj+1=X−X′j is set (step S15).
Next, processing of the determination processing unit is performed (step S16). The determination processing unit determines whether to repeat the processings of steps S9 to S15, as will be described later.
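The repeated processing of the steps above may be sketched, under several simplifying assumptions, as follows. In this Python sketch the screening keeps the columns most correlated with the current objective variable, the second feature amount is simply the best-ranked screened column fitted by least squares (a stand-in for the Lasso formula, chosen only for brevity), and the repetition stops when the round limit is reached or when the largest correlation falls below a threshold. These choices, the function names, and the sample data are assumptions and are not taken from the embodiment.

```python
import math


def centred(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def abs_corr(a, b):
    a, b = centred(a), centred(b)
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return abs(dot(a, b)) / denom if denom > 0 else 0.0


def iterate_screening(X, Y, keep, max_rounds, threshold):
    """Return one (screened names, second feature name, coefficient) per round.

    `threshold` must be positive so that a selected column is never constant.
    """
    d = centred(Y)   # d_0 = Y, centred so that no intercept is needed
    Xj = dict(X)     # X_0 = X
    rounds = []
    for _ in range(max_rounds):  # round limit (D_Iteration)
        ranked = sorted(Xj, key=lambda name: abs_corr(Xj[name], d), reverse=True)
        if not ranked or abs_corr(Xj[ranked[0]], d) < threshold:
            break    # assumed stop rule: too little correlation remains
        screened = ranked[:keep]            # intermediate data X'_j
        best = screened[0]                  # second feature amount X''_j
        col = centred(X[best])
        beta = dot(col, d) / dot(col, col)  # least-squares coefficient beta_j
        # d_{j+1} = d_j - Yhat_j, where Yhat_j = beta_j * X''_j
        d = [r - beta * c for r, c in zip(d, col)]
        # X_{j+1} = X - X'_j, read literally from the text
        Xj = {name: values for name, values in X.items() if name not in screened}
        rounds.append((screened, best, beta))
    return rounds


# Hypothetical data: Y = 3 * a + 2 * b, and c is unrelated.
X = {
    "a": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "b": [1.0, -1.0, 0.0, -1.0, 1.0],
    "c": [-1.0, 2.0, 0.0, -2.0, 1.0],
}
Y = [3.0 * a + 2.0 * b for a, b in zip(X["a"], X["b"])]

rounds = iterate_screening(X, Y, keep=1, max_rounds=5, threshold=0.5)
print(rounds)
```

With this sample data the first round selects "a" with a coefficient of 3, the second round selects "b" with a coefficient of 2, and the third round stops because the updated objective variable is zero and no correlation remains.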
First, the analysis target data including the explanatory variable X and the objective variable Y are input (step S21). Next, for example, a third feature amount is extracted using the Lasso's mathematical formula illustrated in the above formula (2) (step S22). The extraction of the third feature amount in this processing means to detect distribution characteristic of the analysis target data. The processing of step S22 is performed by the distribution detection unit 9 in
Next, distribution of the third feature amount is evaluated (step S23). Here, for example, a ratio of the third feature amounts to the explanatory variables X and a value of a regression coefficient for each third feature amount are calculated. From these, characteristic values are calculated, such as how far the explanatory variables X can be screened while the final third feature amounts can still be extracted. The processing of step S23 is performed by the distribution evaluation unit 10 in
Next, a correlation between the explanatory variable and the objective variable, for example, is calculated, and the characteristic data are extracted (step S24). From the distribution evaluation result of the third feature amount, for example, when there is a strong bias in distribution of the regression coefficient, it can be judged that the data after screening may be small. The processing of step S24 is performed by the correlation calculating unit 11 of
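The kind of characteristic values mentioned for the distribution evaluation may be illustrated, for example, by the following Python sketch. It computes the ratio of explanatory variables with non-zero regression coefficients and the share of the total absolute coefficient held by the largest one, the latter being used here as a simple indicator of bias in the coefficient distribution. Both indicators, the function name coefficient_profile, and the sample coefficients are assumptions for illustration; the embodiment does not define how the bias is measured.

```python
def coefficient_profile(coefficients):
    """Return (ratio of non-zero coefficients, share held by the largest one).

    A share close to 1.0 is taken in this sketch as a strongly biased
    distribution, which suggests that few variables need to be kept.
    """
    magnitudes = [abs(b) for b in coefficients if b != 0.0]
    ratio = len(magnitudes) / len(coefficients)
    total = sum(magnitudes)
    top_share = max(magnitudes) / total if total > 0 else 0.0
    return ratio, top_share


# Hypothetical regression coefficients obtained for four explanatory variables.
ratio, top_share = coefficient_profile([2.9, 0.0, 0.0, 0.1])
print(ratio, top_share)
```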
Next, it is determined whether the correlation value is equal to or less than a predetermined threshold value (step S33). When the correlation value is equal to or less than the threshold value, it is determined that the processings of steps S9 to S17 in
As described above, in the second embodiment, the screening processing is repeated a plurality of times, the intermediate data are generated for each screening processing, and the second feature amount is generated for each intermediate data. Based on the generated second feature amount, the analysis target data are updated to generate the next intermediate data. As a result, the analysis target data can be divided into small pieces, and the intermediate data can be generated in small pieces, and the individual intermediate data can be generated quickly. In addition, the first feature amount extraction unit 4a extracts the first feature amount based on all the intermediate data generated by the screening processing unit 3 in the multiple screening processings and examines which intermediate data of the screening processing unit 3 each of the extracted first feature amounts was extracted from. Then, the similar feature amount extraction unit 5 extracts the similar feature amount from the intermediate data from which each first feature amount is extracted. As a result, the range for extracting the similar feature amount can be narrowed, and the similar feature amount can be extracted at high speed.
At least a part of the information processing devices 1 and 1a described in the above-described embodiments may be configured by hardware or software. When configured by software, a program that realizes at least a part of the functions of the information processing device 1 may be stored in a recording medium such as a flexible disk or a CD-ROM, read by a computer, and executed. The recording medium is not limited to a removable medium such as a magnetic disk or an optical disk and may be a fixed recording medium such as a hard disk device or a memory.
In addition, a program that realizes at least a part of the functions of the information processing devices 1 and 1a may be distributed via a communication line (including wireless communication) such as the Internet. Further, the program may be distributed in a state of being encrypted, modulated, or compressed via a wired line or wireless line such as the Internet or after being stored in a recording medium.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the disclosures. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosures. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosures.
Number | Date | Country | Kind |
---|---|---|---|
2020-150056 | Sep 2020 | JP | national |