The present invention relates to a data prediction device, a data prediction method, and a program.
Various technologies have been proposed as prediction schemes to predict data to be observed next, from data observed in the past. For example, multiple linear regression analysis, which is a multivariate analysis, is a model for describing a variable using the other variables when multiple parameter variables are given, and is used as a data prediction scheme.
Here, consider as an example that a certain user predicts a throughput (the amount of data transferred per unit time) when using a certain Internet service. For the sake of simplicity, assume that the throughput can be estimated (described) from the following three parameters. Also, each of the parameters has five elements as follows:
Assume here that these information items are stored as illustrated in
However, the number of combinations of the time range, location, and service may become extremely large, and it may be difficult to cover all the cases. In other words, the number of samples of past observation data may be insufficient for data prediction. For example, in the data observed as illustrated in
In order to solve this problem, one may intuitively consider using multiple prediction schemes in which the granularity of use parameters is changed. For example, in the case of the above three parameters, one may consider a prediction scheme using all of {T, L, S}, prediction schemes using pairs of two different parameters {T, L}, {T, S}, and {L, S}, and prediction schemes using just one parameter {T} and {L}. Changing the granularity of the parameters to be used in this way increases the number of sample data usable for prediction, and enables prediction. However, there is a problem that if the number of use parameters is reduced, prediction accuracy may become lower. For example, in
Note that Non-patent document 1, which relates to an IP reputation technology for spam determination of e-mail, describes a technology to increase SPAM determination accuracy by taking classification precision of multiple IP reputation databases into consideration at the same time. However, the present invention is a technology aiming at determining application priorities of multiple schemes, and is a technology different from the one in Non-patent document 1 that takes precision of the respective schemes into consideration at the same time.
Also, Non-patent document 2 describes a technology that, for given multiple input parameters, obtains a combination of parameters input into a regression equation such that the prediction accuracy becomes the highest. However, the present invention is a technology that obtains application order of regression equations to be applied to data, based on past prediction accuracy, which is different from the technology described in Non-patent document 2.
The present invention has been made in view of the above, and has an object to make it possible to select, among multiple data prediction schemes, a scheme that yields higher prediction accuracy.
Therefore, in order to solve the above problem, a data prediction device includes a predictor configured to use, among a plurality of data prediction schemes each of which has an application condition and a priority, a first data prediction scheme having the highest priority among the data prediction schemes whose application conditions are satisfied by a set of past observation data, so as to calculate a predicted value of next observation data, based on the set of the observation data; an accuracy calculator configured, in response to receiving the next observation data as input, to compare the predicted value with the observation data, so as to calculate precision of the first data prediction scheme; and a changer configured to change the priority of the first data prediction scheme, based on the precision calculated by the accuracy calculator.
It is possible to select a scheme that makes prediction accuracy higher among multiple data prediction schemes.
In the following, embodiments will be described with reference to the drawings.
A program that implements processing on the data prediction device 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive unit 100, the program is installed into the auxiliary storage unit 102 from the recording medium 101 via the drive unit 100. However, the program need not be installed from the recording medium 101, and may instead be downloaded from another computer via the network. The auxiliary storage unit 102 stores the installed program, and stores required files, data, and the like as well.
Upon receiving a command to activate the program, the memory unit 103 reads the program from the auxiliary storage unit 102, to load the program. The CPU 104 executes a function which relates to the data prediction device 10 according to the program stored in the memory unit 103. The interface unit 105 is used as an interface for connecting with the network.
The data collector 11 collects observation data required for data prediction, and stores the collected observation data in the observation data storage unit 121. The observation data storage unit 121 stores, for example, a set of observation data as illustrated in
The prediction function information series generator 12 takes as input a set of prediction functions (referred to as a “prediction function set”), application conditions of the respective prediction functions, application priorities of the respective prediction functions, and initial prediction accuracy of the respective prediction functions, to generate prediction function information series, which is an array of prediction function information. The prediction function information series generator 12 outputs the generated prediction function information series to the data predictor 13. Note that one item of the prediction function information is information including the application condition, the application priority, and the prediction accuracy of one prediction function.
The data predictor 13 takes as input the prediction function information series output from the prediction function information series generator 12, and the set of the observation data stored in the observation data storage unit 121, to execute data prediction using one of the prediction functions in the prediction function set, and to store a prediction result in the prediction result storage unit 122.
The prediction accuracy updater 14 compares a prediction result stored in the prediction result storage unit 122 with observation data that has been actually observed and corresponds to the prediction result, to calculate the prediction accuracy of the prediction function used for the data prediction. The prediction accuracy updater 14 associates the calculated prediction accuracy with the prediction function used for the data prediction, and stores the associated accuracy in the prediction accuracy information storage unit 124. Also, the prediction accuracy updater 14 outputs the prediction function information series to the prediction function information series rebuilder 15.
The prediction function information series rebuilder 15 takes as input the prediction function information series output from the prediction accuracy updater 14, and the prediction accuracy stored in the prediction accuracy information storage unit 124, to sort and reconstruct the prediction function information series.
Note that the units illustrated in
In the following, processing steps executed by the data prediction device 10 will be separately described for Step 1 to Step 4.
First, Step 1 will be described in detail. Step 1 is executed by the prediction function information series generator 12.
At Step S101, the prediction function information series generator 12 receives as input a prediction function set F, an application condition set C being a set of the application conditions of the respective prediction functions, an application priority set P being a set of the application priorities of the respective prediction functions, and an initial prediction accuracy set A being a set of the initial prediction accuracy of the respective prediction functions. The prediction function set F includes multiple prediction functions that are different from each other in terms of the granularity or the range (collectively referred to as the “granularity”, below) of observation data to be applied. Here, the “granularity of observation data” means the granularity of commonality of observation conditions required by a prediction function for observation data to be applied. The “commonality of observation conditions required by a prediction function for observation data” includes, in the example in
Also, the application condition is information representing, for example, a lower limit of the number of observation data items that satisfy a commonality of the observation conditions required by a prediction function, for the items to which the prediction function is applied, in a set of observation data items. The application condition of a prediction function having a commonality of the time being common is, for example, a lower limit of the number of observation data items having a common time.
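The application condition described above can be sketched as a simple predicate. The following is a minimal illustration only, assuming that each observation data item is a dictionary and that the condition is a lower limit on the number of items sharing certain keys; the names `select_common`, `make_application_condition`, `data`, and `query` are hypothetical and do not appear in the embodiment.

```python
def select_common(data, conditions, keys):
    """Return the observation items whose values for `keys` equal those in `conditions`."""
    return [d for d in data if all(d[k] == conditions[k] for k in keys)]


def make_application_condition(keys, lower_limit):
    """Build a predicate that holds when at least `lower_limit` items share `keys`."""
    def condition(data, conditions):
        return len(select_common(data, conditions, keys)) >= lower_limit
    return condition


# Hypothetical observation data: three items with a throughput value each.
data = [
    {"time": "t1", "location": "l1", "service": "s1", "throughput": 10.0},
    {"time": "t1", "location": "l2", "service": "s1", "throughput": 14.0},
    {"time": "t2", "location": "l1", "service": "s2", "throughput": 8.0},
]
query = {"time": "t1", "location": "l1", "service": "s1"}

# Condition for a prediction function that requires a common time: at least 2 items.
common_time = make_application_condition(("time",), lower_limit=2)
# Condition for a finer granularity: common time and location, at least 2 items.
common_time_location = make_application_condition(("time", "location"), lower_limit=2)

print(common_time(data, query))           # True: two items share time t1
print(common_time_location(data, query))  # False: only one item shares t1 and l1
```

The lower limit of 2 is chosen only to keep the sample data small; a practical lower limit would be considerably larger.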
Also, the application priority is a numerical value that represents a priority (priority order) with which a prediction function is used for prediction. In the present embodiment, a smaller value is defined to represent a higher priority. Alternatively, it may be defined that the greater the value is, the higher the priority is. The initial prediction accuracy of each prediction function may be set to any appropriate numerical value.
Next, the prediction function information series generator 12 generates the following prediction function information x for each prediction function from the prediction function set F, the application condition set C, the application priority set P, and the initial prediction accuracy set A (Step S102). Here, the size of each set is defined to be N.
$x_i = (f_i, c_i, p_i, a_i)$
where $f_i \in F$, $c_i \in C$, $p_i \in P$, $a_i \in A$, and $1 \le i \le N$.
Next, the prediction function information series generator 12 sorts the prediction function information x in descending order of priority according to the application priorities p (that is, the item having the highest priority comes first), to generate prediction function information series L (Step S103). Defining that a smaller subscript has a higher priority, the prediction function information series L is obtained as follows:
$L = [x_1, x_2, \ldots, x_N]$
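Steps S101 to S103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `PredictionFunctionInfo`, the function `build_series`, and the representation of an application condition as an integer lower limit are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence


@dataclass
class PredictionFunctionInfo:
    f: Callable[[Sequence[float]], float]  # prediction function f_i
    c: int    # application condition c_i (assumed here: lower limit on sample count)
    p: int    # application priority p_i (smaller value = higher priority)
    a: float  # prediction accuracy a_i (initial value)


def build_series(F, C, P, A) -> List[PredictionFunctionInfo]:
    """Combine the four input sets into items x_i and sort them by priority."""
    if not (len(F) == len(C) == len(P) == len(A)):
        raise ValueError("F, C, P and A must all have the same size N")
    items = [PredictionFunctionInfo(f, c, p, a) for f, c, p, a in zip(F, C, P, A)]
    # Highest priority first; with "smaller value = higher priority" this is ascending p.
    return sorted(items, key=lambda x: x.p)


# Hypothetical inputs: three averaging functions given in arbitrary order.
F = [mean, mean, mean]
C = [100, 100, 100]
P = [3, 1, 2]
A = [0.0, 0.0, 0.0]

L = build_series(F, C, P, A)
print([x.p for x in L])  # [1, 2, 3]
```

If the alternative definition is adopted, in which a greater value represents a higher priority, the sort would use `reverse=True` instead.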
Next, Step 2 will be described in detail. Step 2 is a process that is executed by the data predictor 13 in response to input of a data prediction request, to calculate a predicted value relating to an observation value of next observation data, from data D being a set of past observation data stored in the observation data storage unit 121. The “next observation data” means observation data that is predicted to be observed next.
At Step S201, the data predictor 13 receives a data prediction request as input. Upon the prediction request, prediction conditions are input. The “prediction conditions” are a set of values corresponding to items that constitute observation conditions. For example, in the example in
Next, the data predictor 13 substitutes 1 for a variable i (Step S202). The variable i is a variable for identifying an item of prediction function information x to be processed among the prediction function information x included in the prediction function information series L.
Next, the data predictor 13 extracts an item of prediction function information xi=(fi, ci, pi, ai), which is the i-th item of the prediction function information x from the prediction function information series L (Step S203).
Next, the data predictor 13 applies the application condition ci to the data D, to determine whether the data D satisfies the application condition ci (Step S204). If the application condition ci is “the number of observation data items having common time range and location being 100 or more”, and the prediction conditions are “time range=t1, location=l1, and service=s1”, then, for example, observation data whose time range is t1 and whose location is l1 is extracted from the data D, and it is determined whether the number of extracted observation data items is 100 or more.
If the data D satisfies the application condition ci (YES at Step S204), the data predictor 13 extracts data X as a set of observation data to which the prediction function fi is to be applied, from the data D (Step S205). For example, if the application condition ci is “the number of observation data items having common time range and location being 100 or more”, and the prediction conditions are “time range=t1, location=l1, and service=s1”, then observation data whose time range is t1 and whose location is l1 is extracted as the data X.
Next, the data predictor 13 inputs the data X into the prediction function fi to calculate a predicted value fi(X) (Step S206). For example, the prediction function fi may be a function that calculates an average of the throughput of the data X. The data predictor 13 associates the predicted value fi(X) with the prediction function fi and the prediction conditions, and stores the associated value in the prediction result storage unit 122.
On the other hand, if the data D does not satisfy the application condition ci (NO at Step S204), the data predictor 13 determines whether the value of the variable i is greater than or equal to N (Step S207). If the value of the variable i is less than N (NO at Step S207), the data predictor 13 adds 1 to the variable i (Step S208), and repeats Step S203 and after. If the value of the variable i is greater than or equal to N (YES at Step S207), the data predictor 13 ends the process.
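The loop of Steps S202 to S208 can be sketched as follows. This is a minimal illustration only, assuming that each entry of the series is a tuple of a prediction function, the keys that must be common, and a lower limit on the number of matching items; the names `predict`, `mean_throughput`, `series`, and the sample data are hypothetical. For brevity, the sketch extracts the candidate data X first and then checks the lower limit, which gives the same result as checking the condition at Step S204 and extracting at Step S205.

```python
from statistics import mean


def mean_throughput(X):
    """Example prediction function: average throughput of the data X."""
    return mean(d["throughput"] for d in X)


def predict(series, data, conditions):
    """Return (index, predicted value) for the first applicable entry, or None."""
    for i, (f, keys, lower_limit) in enumerate(series):  # S202, S203, S208
        # Items of D that share the required keys with the prediction conditions.
        X = [d for d in data if all(d[k] == conditions[k] for k in keys)]
        if len(X) >= lower_limit:                        # S204: c_i is satisfied
            return i, f(X)                               # S205, S206
    return None                                          # S207: no entry applies


# Series sorted by priority: finest granularity first.
series = [
    (mean_throughput, ("time", "location", "service"), 2),
    (mean_throughput, ("time", "location"), 2),
    (mean_throughput, ("time",), 2),
]
data = [
    {"time": "t1", "location": "l1", "service": "s1", "throughput": 10.0},
    {"time": "t1", "location": "l1", "service": "s2", "throughput": 14.0},
    {"time": "t1", "location": "l2", "service": "s1", "throughput": 9.0},
]
conditions = {"time": "t1", "location": "l1", "service": "s1"}

print(predict(series, data, conditions))  # (1, 12.0)
```

In this sample, only one item matches all three conditions, so the first entry is skipped and the second entry, which requires only a common time range and location, produces the predicted value.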
According to the process in
Next, Step 3 will be described in detail. Step 3 is a process that is executed by the prediction accuracy updater 14 to calculate and update prediction accuracy of a prediction function used for data prediction.
At Step S301, the prediction accuracy updater 14 receives actually observed data d as input. Inputting the data d may be executed by the user, or may be observed automatically. According to the example in
Next, the prediction accuracy updater 14 obtains a prediction function f and a predicted value f(X) associated with the prediction conditions matching the data d, from the prediction result storage unit 122 (Step S302).
If a corresponding prediction function f and a predicted value f(X) cannot be obtained (NO at Step S303), the process in
$\mathrm{error}_f = |f(X) - d|$
In other words, the prediction error errorf is the absolute value of the difference between the predicted value f(X) and the observation data d. According to the example in
Next, the prediction accuracy updater 14 associates the calculated prediction error errorf with the prediction function f, and stores it in the prediction error series storage unit 123 (Step S305). Therefore, the prediction error series storage unit 123 stores prediction error series LD being an array of prediction errors (history of the prediction error) for each prediction function used for data prediction. Note that the prediction error series LDf of the prediction function f is as follows:
$LD_f = [\mathrm{error}_{1f}, \mathrm{error}_{2f}, \ldots, \mathrm{error}_{Mf}]$
Next, the prediction accuracy updater 14 calculates the prediction accuracy of the prediction function f, based on the prediction error series LDf stored in the prediction error series storage unit 123 for the prediction function f (Step S306). Denoting the prediction accuracy of the prediction function f by a, a is calculated, for example, as follows:
$a = \dfrac{\sum LD_f}{|LD_f|}$
where |LDf| represents the number of prediction errors included in the prediction error series LDf.
Next, the prediction accuracy updater 14 updates the prediction accuracy that has been stored in association with the prediction function f in the prediction accuracy information storage unit 124, to the prediction accuracy a calculated at Step S306.
In this way, the prediction accuracy is updated every time data prediction is executed. Note that another indicator (geometric mean or the like) other than the above may be used as the prediction accuracy.
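The update of Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `AccuracyUpdater`, its method names, and the use of in-memory dictionaries in place of the storage units 122, 123, and 124 are hypothetical, and prediction functions are identified by a string name for simplicity.

```python
from collections import defaultdict


class AccuracyUpdater:
    def __init__(self):
        self.results = {}                      # prediction conditions -> (name, f(X))
        self.error_series = defaultdict(list)  # name -> prediction error series LD_f
        self.accuracy = {}                     # name -> prediction accuracy a

    def store_prediction(self, conditions, name, predicted):
        """Record a prediction result so that it can be matched with later data."""
        self.results[conditions] = (name, predicted)

    def observe(self, conditions, d):
        """Update the accuracy from actually observed data d; return a, or None."""
        hit = self.results.get(conditions)        # S302
        if hit is None:                           # NO at S303
            return None
        name, predicted = hit
        error = abs(predicted - d)                # error_f = |f(X) - d|
        self.error_series[name].append(error)     # S305
        ld = self.error_series[name]
        self.accuracy[name] = sum(ld) / len(ld)   # S306: a = (sum of LD_f) / |LD_f|
        return self.accuracy[name]


updater = AccuracyUpdater()
key = ("t1", "l1", "s1")
updater.store_prediction(key, "f_TL", 12.0)
first = updater.observe(key, 10.0)   # error 2.0, so a = 2.0
updater.store_prediction(key, "f_TL", 12.0)
second = updater.observe(key, 13.0)  # error 1.0, so a = (2.0 + 1.0) / 2 = 1.5
print(first, second)  # 2.0 1.5
```

Because the prediction accuracy a in this formula is an average of prediction errors, a smaller value of a corresponds to a more accurate prediction function.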
Next, Step 4 will be described in detail. Step 4 is a process that is executed by the prediction function information series rebuilder 15, to sort the prediction function information series based on the prediction accuracy and default application priorities of the respective prediction functions, and to reconstruct the application priorities of the prediction functions. Note that Step 4 may be executed periodically, or may be executed every time the prediction accuracy of one of the prediction functions is updated at Step 3.
At Step S401, the prediction function information series rebuilder 15 extracts two items of prediction function information xi=(fi, ci, pi, ai) and prediction function information xj=(fj, cj, pj, aj) that are different from each other, from the prediction function information series L. Note that as the prediction accuracy ai and the prediction accuracy aj, values that have been stored in the prediction accuracy information storage unit 124 with respect to the prediction function fi and the prediction function fj, respectively, are used.
Next, the prediction function information series rebuilder 15 obtains the prediction error series LDi corresponding to the prediction function fi, and the prediction error series LDj corresponding to prediction function fj, from the prediction error series storage unit 123 (Step S402).
Next, in order to determine whether the difference between the prediction error series LDi and the prediction error series LDj is statistically significant, the prediction function information series rebuilder 15 executes a t-test (Step S403). The test is executed because, if the number of times a prediction function has been applied is small, the difference ai−aj of the prediction accuracy calculated at Step S405, which will be described later, may not be statistically significant. Note that the test method is not limited to the t-test, and another method, such as a nonparametric test, may be used.
If a significant difference has been recognized as a result of the t-test (YES at Step S404), the prediction function information series rebuilder 15 compares the prediction accuracy ai and aj, to determine the priorities of the prediction function information xi and xj (Step S405). Specifically, ai−aj is calculated, and if the calculation result is negative, the prediction function information xi is determined to have a priority higher than that of the prediction function information xj, because the prediction accuracy a is an average of prediction errors and a smaller value therefore indicates a more accurate prediction function. On the other hand, if the calculation result is positive, xj is determined to have a priority higher than xi. Also, if the calculation result is 0 (if ai=aj), the priority is determined based on the default application priorities pi and pj.
On the other hand, if a significant difference has not been recognized as a result of the t-test (NO at Step S404), the prediction function information series rebuilder 15 compares the default application priorities pi and pj, to determine the priorities of the prediction function information xi and xj (Step S406).
By using the comparison step of different elements as described above, the prediction function information series rebuilder 15 sorts and updates the prediction function information series L. Any method of sorting may be adopted, including quick sort and merge sort.
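The comparison of Steps S403 to S406 and the subsequent sort can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the embodiment: the sketch computes Welch's t statistic with the standard library and, as a simplification, treats the difference as significant when the absolute value of the statistic exceeds a fixed critical value `t_crit`, instead of deriving a p-value from the t distribution. The names `welch_t`, `make_comparator`, and `t_crit`, and the sample error series, are hypothetical.

```python
from functools import cmp_to_key
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(ld_i, ld_j):
    """Welch's t statistic for two error series (each needs at least two values)."""
    diff = mean(ld_i) - mean(ld_j)
    se = sqrt(variance(ld_i) / len(ld_i) + variance(ld_j) / len(ld_j))
    if se == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def make_comparator(error_series, t_crit=2.0):
    """Build a comparison function over items {"name": ..., "p": ...}."""
    def compare(xi, xj):
        ld_i = error_series.get(xi["name"], [])          # S402
        ld_j = error_series.get(xj["name"], [])
        significant = (len(ld_i) >= 2 and len(ld_j) >= 2
                       and abs(welch_t(ld_i, ld_j)) > t_crit)  # S403, S404
        if significant:
            diff = mean(ld_i) - mean(ld_j)               # S405: a_i - a_j
            if diff != 0:
                return -1 if diff < 0 else 1             # smaller mean error first
        return xi["p"] - xj["p"]                         # S406, or tie at S405
    return compare


series = [
    {"name": "f_TLS", "p": 1},
    {"name": "f_TL", "p": 2},
    {"name": "f_T", "p": 3},
]
error_series = {
    "f_TLS": [5.0, 6.0, 5.5, 6.5],
    "f_TL": [1.0, 1.5, 0.5, 1.0],
    "f_T": [1.2, 0.9],
}

rebuilt = sorted(series, key=cmp_to_key(make_comparator(error_series)))
print([x["name"] for x in rebuilt])  # ['f_TL', 'f_T', 'f_TLS']
```

In this sample, f_TLS has clearly larger errors than the other two functions and moves to the end, while the difference between f_TL and f_T is not significant, so their default application priorities decide the order. A practical implementation would more likely obtain a p-value from a statistics library and compare it with a significance level.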
As described above, according to the present embodiment, in a scheme to predict next observation data from past observation data, an appropriate prediction function is selected among multiple prediction functions, based on the priorities and prediction accuracy of the respective prediction functions, and data prediction is executed based on the selected prediction function. Accordingly, even if sample data is insufficient for a certain prediction function when making a prediction, it is possible to execute data prediction by using another prediction function as long as sample data for the other prediction function is sufficient. Also, since a prediction function having a higher precision is prioritized to be used, it is possible to raise the precision of data prediction. Also, since the priority is dynamically determined based on a result of data prediction, it is possible to reduce analysis operations of a person who performs data analysis (an operator or the like).
Note that in the present embodiment, the data predictor 13 is an example of a predictor. The prediction accuracy updater 14 is an example of an accuracy calculator. The prediction function information series rebuilder 15 is an example of a changer. The prediction function is an example of a data prediction scheme.
As above, the embodiments of the present invention have been described in detail. Note that the present invention is not limited to such specific embodiments, but various variations and modifications may be made within the scope of the subject matters of the present invention described in the claims.
The present patent application claims priority based on Japanese Patent Application No. 2015-123273 filed on Jun. 18, 2015, and entire contents of the Japanese Patent Application are incorporated herein by reference.
Number | Date | Country | Kind |
---|---|---|---|
2015-123273 | Jun 2015 | JP | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2016/065689 | 5/27/2016 | WO | 00 |