The present disclosure relates to a data extension device, a data extension method, and a data extension program.
Conventionally, a machine learning technology has been known which automatically learns a feature amount from input data and generates a learner for executing tasks such as moving-image and voice recognition and time-series data prediction. The machine learning technology improves task performance by training the learner using learning data. If the number of learning data is too small, overtraining is likely to occur.
A technique is known which performs data extension for increasing the number of learning data in order to suppress overtraining and improve the generalization performance of the learner (e.g., Non-Patent Literature 1). In Non-Patent Literature 1, for the purpose of predicting the stock price after 30 seconds, data extension for the time-series data of stock buying and selling orders is performed by shifting the starting date/time of sampling for creation of the learning execution data set later for each completion of creation of the learning execution data set. The learning execution data set is a set of learning execution data which is a collection of learning data used by a learner to execute a single task. In an example of Non-Patent Literature 1, for creating a learning execution data set, learning data, which is acquired time-series data of stock buying and selling orders for a period of about several hundred days, is repeatedly sampled at an interval of 30 seconds from a starting date/time to obtain 90-minute data as one learning execution data. Then, the starting date/time of sampling is shifted 10 seconds later, and similar sampling is performed again. Thus, data extension is achieved corresponding to the number of times of shifting although some overlapping is observed between the learning data.
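As a non-limiting illustration of the shifting approach described above, the following sketch enumerates sampling timestamps for 90-minute learning execution data sampled every 30 seconds, with the starting date/time shifted 10 seconds per repetition. The function name `sliding_windows`, its parameter names, and the 2-hour data span are assumptions made for illustration and do not appear in Non-Patent Literature 1.

```python
from datetime import datetime, timedelta

def sliding_windows(data_start, data_end,
                    window=timedelta(minutes=90),
                    interval=timedelta(seconds=30),
                    shift=timedelta(seconds=10)):
    """Return one list of sampling timestamps per shifted starting date/time."""
    samples_per_window = int(window / interval)  # 90 min / 30 s = 180 samples
    windows = []
    start = data_start
    # Keep shifting the start while a full window still fits in the data.
    while start + window <= data_end:
        windows.append([start + i * interval for i in range(samples_per_window)])
        start += shift
    return windows

# Hypothetical 2-hour span of order data, for illustration only.
windows = sliding_windows(datetime(2019, 3, 1, 9, 0), datetime(2019, 3, 1, 11, 0))
```

Under these assumptions, consecutive windows differ by only 10 seconds at each end, which corresponds to the overlap between learning data noted above.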
In a case where an interval between the starting date/time and the ending date/time of the learning execution data is uniquely defined, sufficient data extension can be achieved by shifting the starting date/time of sampling. However, in a case where the starting date/time and ending date/time of the learning execution data are uniquely defined rather than the interval, there is a problem that sufficient data extension cannot be achieved. For example, in a case where it is uniquely defined that the starting date/time is at 6:00 on a certain day and the ending date/time is at 23:00 on a day three days after the starting date/time, the time interval at which the starting date/time can be shifted is at least 24 hours. Therefore, the number of times of shifting the starting date/time is reduced, and sufficient data extension cannot be achieved.
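The difference between the two cases can be sketched numerically as follows. The sketch counts how many learning execution data fit in a data span when the clock times are fixed (start at 6:00, end at 23:00 three days later), and compares this with a window of the same length whose start may be shifted freely in 10-second steps. The 10-day data span, the function names, and the 10-second shift are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def count_fixed_clock_windows(data_start, data_end,
                              start_hour=6, end_hour=23, span_days=3):
    """Count windows that start at start_hour and end at end_hour span_days later."""
    count = 0
    day = datetime(data_start.year, data_start.month, data_start.day)
    while day <= data_end:
        window_start = day + timedelta(hours=start_hour)
        window_end = day + timedelta(days=span_days, hours=end_hour)
        if window_start >= data_start and window_end <= data_end:
            count += 1
        day += timedelta(days=1)  # the start can only move in 24-hour steps
    return count

def count_free_windows(data_start, data_end, length, shift):
    """Count windows of a fixed length whose start moves freely in steps of shift."""
    return int((data_end - data_start - length) / shift) + 1

# Hypothetical 10-day span of data, for illustration only.
data_start = datetime(2019, 3, 1, 0, 0)
data_end = datetime(2019, 3, 11, 0, 0)
fixed = count_fixed_clock_windows(data_start, data_end)
free = count_free_windows(data_start, data_end,
                          length=timedelta(days=3, hours=17),
                          shift=timedelta(seconds=10))
```

With these assumed values, `fixed` is 7 while `free` is 54361, which illustrates why shifting alone yields little data extension when the starting and ending clock times are fixed.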
The disclosed technology is made in view of the above described problem, and an object thereof is to provide a data extension device, a data extension method, and a data extension program which can achieve sufficient data extension even if the starting date/time and ending date/time of the learning execution data have been uniquely defined.
A first aspect of the present disclosure is a data extension method, wherein based on entire learning data that is a set of time-series data, wherein the entire learning data is a set of minimum constitution unit data, each of which is time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is each assigned a first label that indicates a feature in a time series of the first interval; a generation unit generates learning execution data that is a set of time-series data to be used in learning, by combining the minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained.
A second aspect of the present disclosure is a data extension device including a generation unit, wherein based on entire learning data that is a set of time-series data, wherein the entire learning data is a set of minimum constitution unit data, each of which is time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is each assigned a first label that indicates a feature in a time series of the first interval; the generation unit generates learning execution data that is a set of time-series data to be used in learning, by combining the minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained.
A third aspect of the present disclosure is a data extension program for causing a computer to execute: based on entire learning data that is a set of time-series data, wherein the entire learning data is a set of minimum constitution unit data, each of which is time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is each assigned a first label that indicates a feature in a time series of the first interval; a generation unit generating learning execution data that is a set of time-series data to be used in learning, by combining the minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained.
According to the disclosed technology, sufficient data extension can be achieved even if the starting date/time and ending date/time of the learning execution data have been uniquely defined.
<Configuration of a Data Extension Device According to an Embodiment of the Technology of the Present Disclosure>
Hereinafter, an example embodiment of the disclosed technology will be described with reference to the drawings. In each drawing, the same or equivalent components and parts are given the same reference numerals. Additionally, the dimensional ratios in the drawings are exaggerated for the sake of explanation, and may differ from the actual ratios.
The CPU 11 is a central processing unit which executes various programs and controls various parts. That is, the CPU 11 reads a program from the ROM 12 or the storage 14, and executes the program using the RAM 13 as a work area. The CPU 11 controls the above-described components and performs various arithmetic processing according to a program stored in the ROM 12 or the storage 14. In the present embodiment, the ROM 12 or the storage 14 stores a data extension program for executing data extension processing.
The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 includes a storage device such as an HDD (Hard Disk Drive) or SSD (Solid State Drive) which stores various programs including an operating system and various data.
The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used for various inputs.
The display unit 16 is, for example, a liquid crystal display for displaying various types of information. The display unit 16 may function as the input unit 15 by employing a touch panel system.
The communication interface 17 is an interface for communicating with other equipment, and employs a standard such as Ethernet (R), FDDI, or Wi-Fi (R).
Next, a functional configuration of the data extension device 10 will be described.
In the present disclosure, a case where a task is to predict the number of people observed at a facility will be described as an example. Using learning execution data generated by the data extension device 10 according to the present disclosure, a learner executing the task is made to learn. Learning execution data, which is time-series data used in learning, has a starting date/time and an ending date/time that are uniquely defined. Specifically, it is assumed that the starting time Ts is 0:00 on the A-th day and the ending time Te is 24:00 on the (A+6)-th day. That is, the learning execution data is defined as time-series data of the number of people for 7 consecutive days. Further, the sampling time t is set to 1 hour. Of the learning execution data, the first 6 days are used as input data to the learner, and the last day is used as teacher data corresponding to output data of the learner.
The input unit 101 receives input of overall learning data which is a set of time-series data for each sampling time t and which is a set of minimum constitution unit data. Here, the minimum constitution unit data is a time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is assigned a first label that indicates a feature in a time series of the first interval.
In the present disclosure, a case where the overall learning data is time-series data obtained with sampling time t for 10 days from 0:00 on Friday, Mar. 1, 2019, to 24:00 on Sunday, Mar. 10, 2019, will be described as an example. If the minimum constitution unit is defined as a 1-day time series, that is, 24 samples, in the overall learning data, then “A set of 10 minimum constitution unit data=Overall learning data” (
Seven kinds of “day of week labels”, Monday to Sunday, are defined as first labels indicating features of the minimum constitution units that are time-series data of time intervals. Overall learning data is assumed to be assigned a first label for each minimum constitution unit. A first label may be assigned to each minimum constitution unit for the input overall learning data. Then, the input unit 101 passes the accepted overall learning data to the process unit 102.
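As a non-limiting illustration, the division of the overall learning data into minimum constitution unit data with day-of-week first labels might be sketched as follows. The dictionary layout, the function name `split_into_units`, and the placeholder values are assumptions made for illustration; the disclosure does not prescribe a data structure.

```python
from datetime import datetime, timedelta

SAMPLES_PER_UNIT = 24  # 1-hour sampling time, 1-day minimum constitution unit
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def split_into_units(values, first_timestamp):
    """Split an hourly series into 1-day units, each tagged with a first label."""
    if len(values) % SAMPLES_PER_UNIT != 0:
        raise ValueError("series length must be a whole number of days")
    units = []
    for i in range(0, len(values), SAMPLES_PER_UNIT):
        day = first_timestamp + timedelta(hours=i)
        units.append({
            "date": day.date(),
            "first_label": DAY_NAMES[day.weekday()],  # weekday(): Monday is 0
            "values": values[i:i + SAMPLES_PER_UNIT],
        })
    return units

# Placeholder counts standing in for the observed number of people.
values = list(range(240))
units = split_into_units(values, datetime(2019, 3, 1, 0, 0))
```

For the 10-day example, this yields 10 units whose first labels run from “Friday” to “Sunday”.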
The process unit 102 extracts the regularity of the first label based on correlation between a difference series of the minimum constitution unit data and a difference series of the minimum constitution unit data assigned a different kind of first label from that of the respective minimum constitution unit data, with respect to each of the minimum constitution unit data. Then, the process unit 102 assigns a second label to each minimum constitution unit data included in the entire learning data, where the number of kinds of the second labels is smaller than the number of kinds of the first labels.
Specifically, the process unit 102 first computes a correlation coefficient between the difference series of minimum constitution unit data having different first labels. The difference series of time-series data is the series obtained by taking the difference between each data point and the data point one time point away. For example, in the overall learning data of this disclosure, when both first labels are any of “Monday” to “Friday”, the difference series of the minimum constitution unit data are highly correlated (e.g., a correlation coefficient of 0.7). However, when the correlation coefficient is calculated between a difference series of minimum constitution unit data with the first label “Saturday” or “Sunday” and a difference series of minimum constitution unit data with any of the first labels “Monday” to “Friday”, the correlation is low (e.g., a correlation coefficient of 0.3). In addition, when the first labels are “Saturday” and “Sunday”, the correlation between the difference series of the minimum constitution unit data is high (e.g., a correlation coefficient of 0.8). From this, the following regularity can be found for the first label: there are two groups within which the correlation coefficient is high, namely a group of “Monday” to “Friday” and a group of “Saturday” and “Sunday”. Therefore, based on the regularity of the first label, the first labels “Monday” to “Friday” can be clustered into “group 1”, and “Saturday” and “Sunday” can be clustered into “group 2”.
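The correlation-based grouping described above might be sketched as follows. This is a simplified illustration under several assumptions that are not part of the disclosure: one representative series per first label is used, labels are grouped greedily against the first member of each group, the threshold of 0.5 is chosen arbitrarily between the example coefficients of 0.3 and 0.7, and the hourly profiles are synthetic.

```python
import math

def difference_series(x):
    """Difference between each point and the point one time step earlier."""
    return [b - a for a, b in zip(x, x[1:])]

def pearson(x, y):
    """Pearson correlation coefficient (assumes neither series is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def group_labels(series_by_label, threshold=0.5):
    """Greedily group labels whose difference series correlate above threshold."""
    groups = []
    for label, series in series_by_label.items():
        d = difference_series(series)
        for group in groups:
            representative = difference_series(series_by_label[group[0]])
            if pearson(d, representative) >= threshold:
                group.append(label)
                break
        else:
            groups.append([label])  # no sufficiently correlated group found
    return groups

# Synthetic hourly profiles (assumed): weekdays peak at commuting hours,
# weekends peak around midday.
def weekday_profile(scale):
    return [scale * (50 if 7 <= h <= 9 else 40 if 17 <= h <= 19 else 5)
            for h in range(24)]

def weekend_profile(scale):
    return [scale * (30 if 11 <= h <= 15 else 5) for h in range(24)]

series_by_label = {
    "Monday": weekday_profile(1.0), "Tuesday": weekday_profile(1.1),
    "Wednesday": weekday_profile(0.9), "Thursday": weekday_profile(1.2),
    "Friday": weekday_profile(1.3),
    "Saturday": weekend_profile(1.0), "Sunday": weekend_profile(0.8),
}
groups = group_labels(series_by_label)
```

With this synthetic data, the first group contains “Monday” to “Friday” and the second contains “Saturday” and “Sunday”. Real observations would be noisier, and a different clustering procedure or threshold may be more appropriate.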
As described above, the process unit 102 assigns the second label “group 1” or the second label “group 2” to each minimum constitution unit data as a meta label, based on the correlation between the difference series of the minimum constitution unit data having the first labels. When a second label is assigned in this way to each minimum constitution unit data of the overall learning data of the present disclosure, 6 minimum constitution unit data are assigned the second label “group 1”, and 4 minimum constitution unit data are assigned the second label “group 2”. By assigning the second label, non-obvious features of the minimum constitution unit data can be visualized.
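The counts of 6 and 4 can be checked with a short sketch that maps each day of the 10-day example to its first label and then to its second label. The mapping table `FIRST_TO_SECOND` and the variable names are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Second label (meta label) for each first label, following the grouping above.
FIRST_TO_SECOND = {
    name: "group 2" if name in ("Saturday", "Sunday") else "group 1"
    for name in DAY_NAMES
}

# The 10 days of the example: Mar. 1, 2019 through Mar. 10, 2019.
days = [date(2019, 3, 1) + timedelta(days=i) for i in range(10)]
first_labels = [DAY_NAMES[d.weekday()] for d in days]
second_labels = [FIRST_TO_SECOND[label] for label in first_labels]

first_counts = Counter(first_labels)
second_counts = Counter(second_labels)
```

Here each first label covers at most 2 minimum constitution unit data, whereas the second labels cover 6 and 4, so more candidates are available per label when units are recombined.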
Then, process unit 102 passes the entire learning data assigned the second label to the generation unit 103.
The generation unit 103 selects the minimum constitution unit data assigned the second label and included in the entire learning data based on regularity of the second label by which the first label has been replaced with the second label, and generates the learning execution data by combining the selected minimum constitution unit data such that the regularity of the second label is maintained.
Here, when focusing on the first label of the minimum constitution unit data used for teacher data, the arrangement of the minimum constitution unit data constituting the learning execution data has several patterns depending on the target data. For example, in the learning execution data of the present disclosure, since the sequence of days of the week is unique and the number of kinds of the first label is 7, the arrangement of the minimum constitution unit data constituting the learning execution data has 7 patterns. Further, when the arrangement of the minimum constitution unit data is represented using the assigned second labels, the arrangement also has 7 patterns. For example, if the first label of the minimum constitution unit data serving as teacher data is “Monday”, the first labels of the six minimum constitution unit data serving as input data are “Tuesday” to “Sunday”, in order from the earliest. When first labels are replaced with second labels, the arrangement is as shown in pattern 1 of FIG. By repeating the same process, a total of 7 patterns of the arrangement of the second labels for 7 consecutive days are obtained. The generation unit 103 uses, for the obtained 7-pattern arrangement, the regularity of the second label obtained by replacing the first label with the second label.
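The derivation of the 7 arrangement patterns might be sketched as follows. Each pattern is keyed here by the first label of the teacher data, because the pattern numbering of the drawing is not reproduced in this text; the function name `arrangement_patterns` and the label strings are assumptions made for illustration.

```python
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def second_label(first_label):
    """Map a day-of-week first label to its second label."""
    return "group 2" if first_label in ("Saturday", "Sunday") else "group 1"

def arrangement_patterns(length=7):
    """For each teacher-data day, list second labels in chronological order."""
    patterns = {}
    for t, teacher_day in enumerate(DAY_NAMES):
        # The input days precede the teacher day; the teacher day comes last.
        days = [DAY_NAMES[(t - offset) % 7]
                for offset in range(length - 1, -1, -1)]
        patterns[teacher_day] = [second_label(d) for d in days]
    return patterns

patterns = arrangement_patterns()
# Teacher data "Monday": input Tuesday..Sunday, then Monday.
monday_pattern = patterns["Monday"]
```

For teacher data labeled “Monday”, the arrangement is four “group 1” units, two “group 2” units, and a final “group 1” unit. Every pattern contains five “group 1” units and two “group 2” units, and the 7 arrangements are mutually distinct.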
For each pattern “a” (a = 1 to 7), the number of combinations Ka for selecting minimum constitution unit data having second labels from the 10 minimum constitution unit data of the overall learning data is calculated as shown in
Next, the generation unit 103 selects the minimum constitution unit data that can be combined for each of all the patterns such that the regularity of the second label is maintained. More specifically, the generation unit 103 selects minimum constitution unit data from 504 combinations, the number of which is obtained by summing the numbers of combinations K1 to K7 for respective patterns (the following equation (1)).
[Math. 1]
$$D = \sum_{a=1}^{7} K_a = 72 \times 7 = 504 \qquad (1)$$
The generation unit 103 generates learning execution data by combining the selected minimum constitution unit data with respect to each of the patterns. At this time, the generation unit 103 randomly sets a sequence in which the minimum constitution unit data constituting one generated learning execution data is arranged within the second label. An example of the case of the pattern 1 will be described with reference to
The output unit 104 outputs the learning execution data set generated by the generation unit 103. The learner is made to learn based on the learning execution data set output by the output unit 104.
<Operation of the Data Extension Device According to the Embodiment of the Technology of the Present Disclosure>
Next, the operation of the data extension device 10 will be described.
In step S101, as the input unit 101, the CPU 11 receives input of overall learning data which is a set of time-series data for each sampling time t and which is a set of minimum constitution unit data. Here, the minimum constitution unit data is a time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is assigned a first label that indicates a feature in a time series of the first interval.
In step S102, as the process unit 102, the CPU 11 extracts the regularity of the first label based on correlation between a difference series of the minimum constitution unit data and a difference series of the minimum constitution unit data assigned a different kind of first label from that of the respective minimum constitution unit data, with respect to each of the minimum constitution unit data.
In step S103, as the process unit 102, the CPU 11 assigns a second label to each minimum constitution unit data included in the entire learning data, where the number of kinds of the second labels is smaller than the number of kinds of the first labels.
In step S104, as the generation unit 103, the CPU 11 selects the minimum constitution unit data assigned the second label and included in the entire learning data based on regularity of the second label by which the first label has been replaced with the second label. Then, as the generation unit 103, the CPU 11 generates the learning execution data by combining the selected minimum constitution unit data such that the regularity of the second label is maintained.
In step S105, as the output unit 104, the CPU 11 outputs the learning execution data and terminates the process.
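As a non-limiting illustration of step S104, the following sketch generates one learning execution data by drawing, for each position of an arrangement pattern, a minimum constitution unit data carrying the required second label, in random order within each second label. The function name, the tuple representation of units, and the choice not to reuse a unit within one learning execution data are assumptions made for illustration; the disclosure does not state whether reuse is permitted.

```python
import random

def generate_execution_data(units, pattern, rng):
    """Arrange unit ids so that their second labels follow the given pattern.

    units:   list of (unit_id, second_label) pairs from the entire learning data
    pattern: list of second labels, one per position, in chronological order
    rng:     random.Random instance used to randomize the order within a label
    """
    pools = {}
    for unit_id, label in units:
        pools.setdefault(label, []).append(unit_id)
    for pool in pools.values():
        rng.shuffle(pool)  # random order within each second label
    # pop() raises IndexError if a label has too few units for the pattern.
    return [pools[label].pop() for label in pattern]

# The 10-day example: unit ids 0..9 for Mar. 1 (Friday) through Mar. 10 (Sunday).
second_labels = ["group 1", "group 2", "group 2", "group 1", "group 1",
                 "group 1", "group 1", "group 1", "group 2", "group 2"]
units = list(enumerate(second_labels))

# Arrangement for teacher data labeled "Monday".
pattern = ["group 1", "group 1", "group 1", "group 1",
           "group 2", "group 2", "group 1"]

execution_data = generate_execution_data(units, pattern, random.Random(0))
```

Calling the function repeatedly with different random states and with each of the 7 patterns yields multiple learning execution data from the same 10 units while keeping the regularity of the second label.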
As described above, the data extension device according to the embodiment of the present disclosure generates, based on entire learning data, learning execution data that is a set of time-series data to be used in learning, by combining minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained. Accordingly, sufficient data extension can be achieved even if the starting date/time and ending date/time of the learning execution data have been uniquely defined. Such entire learning data is a set of time-series data, and is a set of minimum constitution unit data. Such minimum constitution unit data is a time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is assigned a first label that indicates a feature in a time series of the first interval.
Further, the regularity of the first label is extracted based on correlation between one minimum constitution unit data and others of the minimum constitution unit data, with respect to each of the minimum constitution unit data, so that non-obvious features of time-series data of the minimum constitution unit data can be used for data extension.
Further, the second label, the number of kinds of which is smaller than that of the first label, is assigned to minimum constitution unit data included in the entire learning data, so that the number of minimum constitution unit data per one kind of the second label can be increased. Thereby, the number of combinations for rearranging minimum constitution unit data can be increased.
The present disclosure is not limited to the above-described embodiments, and various modifications and applications can be made without departing from the gist of the present invention.
The data extension processing, which in the above-described embodiments is executed by the CPU reading and executing software (a program), may be executed by various processors other than the CPU. Exemplary processors in this case include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacturing, such as an FPGA (Field-Programmable Gate Array), and a dedicated electrical circuit, which is a processor having a circuit configuration designed exclusively for performing specific processing, such as an ASIC (Application Specific Integrated Circuit). The data extension processing may be executed by one of these various processors or by a combination of two or more processors of the same or different types (e.g., a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electrical circuit that combines circuit elements such as semiconductor devices.
Further, in the above-described embodiments, a mode in which the data extension program is prestored (installed) in the ROM 12 or the storage 14 has been described, but the manner of providing the program is not limited thereto. The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
Regarding the foregoing embodiments, the following supplementary items will be further disclosed.
(Supplementary Item 1)
A data extension device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
based on entire learning data that is a set of time-series data, wherein the entire learning data is a set of minimum constitution unit data, each of which is time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is each assigned a first label that indicates a feature in a time series of the first interval,
generate learning execution data that is a set of time-series data to be used in learning, by combining the minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained.
(Supplementary Item 2)
A non-transitory storage medium having a data extension program stored therein, the data extension program causing a computer to:
based on entire learning data that is a set of time-series data, wherein the entire learning data is a set of minimum constitution unit data, each of which is time-series data having a first time interval that is a time interval required for learning, wherein the time-series data having the first time interval is each assigned a first label that indicates a feature in a time series of the first interval,
generate learning execution data that is a set of time-series data to be used in learning, by combining the minimum constitution unit data included in the entire learning data such that regularity of the first label in a time series of the entire learning data is maintained.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2019/041995 | 10/25/2019 | WO |