This application is based upon and claims the benefit of the prior Japanese Patent Application No. 2020-100693, filed on Jun. 10, 2020, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a data analysis method and a data analysis device.
In the related art, a data analysis by topological data analysis (TDA) is performed on time-series data that change with the passage of time, such as stock prices, to perform a feature extraction of the time-series data.
For this data analysis by TDA, a technique of the related art is known in which the persistent homology is applied to an attractor obtained by using the time-series data to perform the feature extraction of the attractor shape.
Related techniques are disclosed in, for example, Japanese Laid-Open Patent Publication No. 2017-097643.
According to an aspect of the embodiment, a non-transitory computer-readable recording medium has stored therein a program that causes a computer to execute a process, the process including determining numerical values indicating features at respective timings having a predetermined time interval with respect to time-series data to be analyzed, numbers of the numerical values at the respective timings being made same, and generating an attractor related to the time-series data based on the determined numerical values.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In the above-mentioned technique of the related art, since the number of pieces of data included in the time-series data is limited, it may be difficult to clearly extract the features of the attractor shape, which causes a problem that the feature extraction performance deteriorates.
Hereinafter, an embodiment will be described with reference to the accompanying drawings. The data analysis program, data analysis method, and data analysis device described in the following embodiment are merely examples, and the embodiments are not limited thereto.
As illustrated in
The time-series data are multi-dimensional data. For example, time-series data of a stock price include four prices (four values): opening price, high price, low price, and closing price. Here, the opening price is the price of a stock traded (contracted) first in a predetermined period (e.g., half-day or daily unit). The high price is the highest price of the stock traded in the predetermined period. The low price is the lowest price of the stock traded in the predetermined period. The closing price is the last price of the stock traded in the predetermined period.
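As a non-limiting illustration, the four prices defined above may be derived from the individual trades of one period as sketched below. The names `FourPrices` and `four_prices`, and the representation of a trade as a (contract time, price) pair, are assumptions made for this sketch and are not part of the embodiment.

```python
from typing import List, NamedTuple, Tuple


class FourPrices(NamedTuple):
    """Hypothetical container for the four values of one period."""
    opening: float
    high: float
    low: float
    closing: float


def four_prices(trades: List[Tuple[float, float]]) -> FourPrices:
    """Derive the four prices from (contract_time, price) pairs of one period."""
    if not trades:
        raise ValueError("the period contains no trades")
    # Order the trades by contract time so that first/last are well defined.
    prices = [price for _, price in sorted(trades, key=lambda t: t[0])]
    return FourPrices(
        opening=prices[0],   # first traded price in the period
        high=max(prices),    # highest traded price in the period
        low=min(prices),     # lowest traded price in the period
        closing=prices[-1],  # last traded price in the period
    )


# Illustrative trades only; times are seconds from the start of the period.
period = [(30.0, 99.0), (0.0, 101.0), (55.0, 100.5), (12.5, 103.5)]
bar = four_prices(period)
```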
For example, the features of time-series data of stock prices often appear in half-day or daily units. Therefore, the closing price data among the four prices of opening price, high price, low price, and closing price is often used for analyzing the time-series data of stock prices.
In Case C1 of the related art, the attractor is reconstructed based only on the closing price in the time-series data of a stock price (x), and the Betti series is obtained by TDA for the generated attractor. Because the data are limited to the closing price, it is difficult to clearly extract the features of the attractor shape. For example, in the Betti series of Case C1, there is a sudden descent where the scale (r) is small, and the change is then not smooth as a whole. Therefore, it is difficult to clearly extract the features because the series lacks smoothness of change as a whole.
In Case C2 of the present embodiment, a plurality of numerical values indicating the features of the time-series data is determined at each of respective timings (time i) having a predetermined time interval (e.g., one-minute intervals within 90 minutes) so that the number of numerical values is the same at every timing, and the attractor is reconstructed based on the determined numerical values. Specifically, at each timing, the high price and the low price in the time-series data of the stock price are determined, and the interpolation points between them are determined by, for example, equally dividing the range between the high price and the low price.
In this way, the numerical values indicating a plurality of features determined so that the number of numerical values per timing is the same for each timing may be state points on the attractor in a phase space. Therefore, by reconstructing the attractor using these numerical values, the density of the attractor in the phase space increases so that the shape of the attractor is clarified and the Betti series obtained by TDA is stabilized. Specifically, in the Betti series of Case C2, the change is smooth as a whole. Therefore, in Case C2, the features of the time-series data may be accurately extracted based on the Betti series.
In addition, the opening price and the closing price always lie between the high price and the low price, which are examples of the highest point and the lowest point in the time interval corresponding to each timing. The existence range of the attractor in the phase space may therefore be expressed more widely with the high price and the low price than with the opening price and the closing price. Because the existence range is expressed more widely, it is highly possible that a difference in the attractor shape, and a difference in the Betti series based on that difference, may be clearly distinguished. In that respect, the high price and the low price are considered preferable to the opening price and the closing price.
Regarding the time-series data to be analyzed, the time-series data indicating the transition of the stock price are illustrated in this embodiment, but the present disclosure is not limited to the time-series data of the stock price. For example, the time-series data may include biological data (time-series data such as brain wave, pulse, or body temperature) other than heart rate, wearable sensor data (time-series data of a gyro sensor, an acceleration sensor, a geomagnetic sensor, or the like), financial data (time-series data of interest rate, commodity price, international balance, stock price, or the like), natural environment data (time-series data of temperature, humidity, carbon dioxide concentration, or the like), social data (data of labor statistics, population statistics, or the like), etc.
For example, in the case of time-series data of an acceleration sensor installed on a bridge, the highest point and the lowest point of acceleration at each timing and the interpolation points between the points are determined to reconstruct an attractor. Next, a Betti series is obtained by TDA for the generated attractor, and a difference in time-series data is detected. As a result, the characteristic state that occurs in response to the deterioration of the strength of the bridge may be detected and the deterioration of the bridge may be detected accordingly.
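The embodiment does not specify how the difference between Betti series is detected. Purely as an assumed example, two Betti series sampled at the same scales may be compared by their Euclidean distance and flagged when the distance exceeds a threshold. The function name `betti_distance`, the sample series, and the threshold value 1.0 are all hypothetical.

```python
import math
from typing import Sequence


def betti_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two Betti series sampled at the same scales."""
    if len(a) != len(b):
        raise ValueError("both series must be sampled at the same scales")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative values only: a reference series and a newly measured one.
baseline = [5, 4, 3, 2, 1]
current = [5, 5, 4, 2, 1]

THRESHOLD = 1.0  # assumed value; in practice it would be tuned on known data
distance = betti_distance(baseline, current)
changed = distance > THRESHOLD
```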
Under the control of the control unit 30, the communication unit 10 communicates with other devices (e.g., a display device, a server device, etc.) via a communication cable or the like. The communication unit 10 is implemented by, for example, a communication interface connected to a display device, a NIC (network interface card) connected to a communication network such as a LAN (local area network) or the like.
The storage unit 20 corresponds to, for example, a semiconductor memory device such as a RAM (random access memory) or a flash memory, or a storage device such as an HDD (hard disk drive). The storage unit 20 stores time-series data 21 and the like to be analyzed, which are received by an input reception unit 31. In the case of stock prices, the time-series data 21 are, for example, Tick data indicating individual transactions (contract time, stock price, and number of stocks).
The control unit 30 includes the input reception unit 31, a determination unit 32, an attractor generation unit 33, an analysis processing unit 34, and an output unit 35. The control unit 30 may be implemented by a CPU (central processing unit), an MPU (micro processing unit), or the like. The control unit 30 may also be implemented by hard-wired logic such as an ASIC (application specific integrated circuit) or an FPGA (field programmable gate array).
The input reception unit 31 is a processing unit that receives data input. Specifically, the input reception unit 31 receives input of the time-series data 21 to be analyzed by the operation input using a keyboard, a touch panel, or the like, or the file input by communication via the communication unit 10. Next, the input reception unit 31 stores the input time-series data 21 in the storage unit 20.
The determination unit 32 is a processing unit that determines a plurality of numerical values indicating the features at respective timings having a predetermined time interval for the time-series data 21 to be analyzed so that the number of numerical values per timing is the same.
Specifically, the determination unit 32 reads out data having a predetermined time width from the storage unit 20 based on the respective timings having the predetermined time interval for the time-series data 21 to be analyzed, and determines the same number of numerical values indicating the features at each timing. In addition, the time interval for taking timing and the time width for reading data from the time-series data 21 after each timing are set in advance by, for example, a user. As an example, the time interval for taking timing may be one-minute interval. Further, the time width for reading the data may be between the reference timing and the next timing.
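A minimal sketch of this read-out is given below, assuming that each record is a (contract time in seconds, price) pair and that the time width equals the time interval, so that each window runs from one timing up to, but not including, the next. The function name `split_into_windows` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_into_windows(
    records: Sequence[Tuple[float, float]], interval: float = 60.0
) -> Dict[int, List[float]]:
    """Group prices by timing index i, where window i covers
    [i * interval, (i + 1) * interval)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    windows: Dict[int, List[float]] = defaultdict(list)
    for time, price in records:
        windows[int(time // interval)].append(price)
    # Return the windows in order of timing.
    return dict(sorted(windows.items()))


# Illustrative records only.
records = [(0.0, 100.0), (59.9, 101.0), (60.0, 102.0), (130.0, 99.0)]
windows = split_into_windows(records, interval=60.0)
```

A timing for which no record exists is simply absent from the result; how such a timing is handled is left open here.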
Further, the numerical value indicating the feature determined by the determination unit 32 at each timing may be determined by extracting from the data having a predetermined time width after each timing. For example, the determination unit 32 obtains the values of the highest point and the lowest point at each timing. Then, the determination unit 32 obtains the interpolation points between the obtained highest point and lowest point by equally dividing them into the same number, for example, at each timing. The determination unit 32 determines the obtained values of the highest point and the lowest point and the values of the obtained interpolation points between the highest point and the lowest point as the numerical values indicating the features.
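The equal division described above may be sketched as follows. The function name `feature_values` and the parameter `n_interp` (the number of interpolation points per timing) are assumptions made for illustration.

```python
from typing import List, Sequence


def feature_values(window: Sequence[float], n_interp: int) -> List[float]:
    """Return [highest, interpolation points..., lowest] for one timing.

    The interpolation points divide the range between the highest point and
    the lowest point into n_interp + 1 equal parts, so every timing yields
    exactly n_interp + 2 values.
    """
    if not window:
        raise ValueError("the window contains no data")
    if n_interp < 0:
        raise ValueError("n_interp must be non-negative")
    high, low = max(window), min(window)
    step = (high - low) / (n_interp + 1)
    interior = [high - k * step for k in range(1, n_interp + 1)]
    # The end points are taken directly from the data to avoid rounding drift.
    return [high] + interior + [low]


# Illustrative window only.
values = feature_values([100.0, 104.0, 102.5, 101.0], n_interp=3)
```

When the highest point and the lowest point coincide, all returned values are equal, which still keeps the number of values per timing the same.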
The attractor generation unit 33 is a processing unit that generates an attractor from the time-series data 21. Specifically, the attractor generation unit 33 generates virtual time-series data by introducing a characteristic time shift term (T) for each dimension of the plurality of numerical values determined by the determination unit 32 at each timing of the time-series data 21, that is, of the multi-dimensional time-series data. Then, the attractor generation unit 33 generates an attractor from the generated virtual time-series data. As a method of deriving the characteristic time shift term (T) from the time-series data, a well-known statistical method used in informatics, such as a multi-dimensional autocorrelation coefficient or mutual information, may be used.
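One way to picture this step is a time-delay embedding applied to each series of numerical values, with the resulting points pooled into one point cloud. The sketch below assumes a fixed embedding dimension of 3 and a time shift given as an integer number of timings; the embodiment fixes neither, and the names `delay_embed` and `pooled_attractor` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def delay_embed(series: Sequence[float], shift: int, dim: int = 3) -> List[Point]:
    """Map x(i) to the points (x(i), x(i + shift), ..., x(i + (dim - 1) * shift))."""
    if shift < 1 or dim < 1:
        raise ValueError("shift and dim must be positive")
    count = len(series) - (dim - 1) * shift  # number of complete delay vectors
    return [
        tuple(series[i + k * shift] for k in range(dim)) for i in range(count)
    ]


def pooled_attractor(
    values_per_timing: Sequence[Sequence[float]], shift: int, dim: int = 3
) -> List[Point]:
    """Embed each feature series (highest, interpolation..., lowest) and pool the points."""
    if len({len(v) for v in values_per_timing}) != 1:
        raise ValueError("every timing must supply the same number of values")
    cloud: List[Point] = []
    # zip(*...) turns per-timing rows into one series per feature.
    for series in zip(*values_per_timing):
        cloud.extend(delay_embed(series, shift, dim))
    return cloud


# Illustrative values only: five timings, three values per timing.
rows = [
    [4.0, 3.0, 2.0],
    [5.0, 4.0, 3.0],
    [6.0, 4.5, 3.0],
    [5.0, 4.0, 3.0],
    [4.0, 3.5, 3.0],
]
cloud = pooled_attractor(rows, shift=1, dim=3)
```

The check on the row lengths reflects the condition that the number of numerical values per timing is the same; without it the per-feature series could not be formed.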
The analysis processing unit 34 is a processing unit that generates a Betti series by executing a persistent homology conversion on the attractor generated by the attractor generation unit 33. Here, the term “homology” refers to a method of expressing the feature of an object by the number of m (m≥0)-dimensional holes. The term “hole” mentioned herein refers to a generator (element) of a homology group. The 0-dimensional hole is a connected component, the 1-dimensional hole is a hole (tunnel), and the 2-dimensional hole is a cavity. The number of holes in each dimension is called a Betti number. The phrase “persistent homology” refers to a method of characterizing the transition of m-dimensional holes in an object (here, a set of points (point cloud)). Persistent homology may be used to examine the features related to the arrangement of points. In this method, each point in the object is gradually inflated into a sphere, and during this process the time when each hole appears (represented by the radius of the sphere at the time of appearance) and the time when it disappears (represented by the radius of the sphere at the time of disappearance) are specified. The radius corresponds to the scale (r) described above.
Although the case of generating the 0-dimensional Betti series is illustrated in this embodiment, the analysis processing unit 34 may generate a one-dimensional or two-dimensional Betti series.
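For the 0-dimensional case, the Betti number at scale r is the number of connected components that remain when every point is inflated into a sphere of radius r. A minimal sketch is shown below; it assumes that two spheres are joined once their centers are no more than 2r apart, and it uses a union-find pass over the sorted pairwise distances rather than a general persistent homology library. The names `merge_radii` and `betti0_series` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def merge_radii(points: Sequence[Point]) -> List[float]:
    """Radii at which two connected components merge (0-dimensional holes disappear)."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the component representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    radii = []
    for distance, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            radii.append(distance / 2.0)  # spheres of radius d/2 touch at distance d
    return radii


def betti0_series(points: Sequence[Point], scales: Sequence[float]) -> List[int]:
    """Number of connected components at each scale r."""
    deaths = merge_radii(points)
    return [len(points) - sum(1 for d in deaths if d <= r) for r in scales]


# Illustrative point cloud only.
series = betti0_series([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], [0.0, 0.5, 1.0, 2.0, 3.0])
```

The pairwise-distance pass grows quadratically with the number of points, so this form is suited to illustration rather than to large attractors.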
The output unit 35 is a processing unit that performs an output process such as a display output to a display device and a file output. Specifically, the output unit 35 outputs, to a user, the analysis results of the Betti series or the like analyzed by the analysis processing unit 34 as the display output to the display device or the file output. In addition, the output unit 35 may output a result obtained by inputting the Betti series analyzed by the analysis processing unit 34, as the feature amount, into a known machine learning model, that is, a classification result by the machine learning model.
Next, the attractor generation unit 33 generates an attractor regarding the time-series data 21 based on a plurality of numerical values (the highest point (xh), the lowest point (xl), and the interpolation points (xin1, xin2, . . . )) determined by the determination unit 32 at each timing (S3).
Referring back to
Here, the conditions for increasing the number of pieces of data for attractor generation related to the time-series data 21 will be described. First, in order to reconstruct the attractor in a phase space, the number of data points at each timing is made the same. Further, among the feature points included in the time-series data 21 (e.g., the opening price, the high price, the low price, and the closing price in a stock price), a feature point whose point sequence fluctuates drastically, and whose attractor may therefore be unstable, may not be preferable as a basis for increasing the number of pieces of data for attractor generation.
For example, the four values (i.e., the opening price, the high price, the low price, and the closing price) in the stock price are only representative points (feature points) at each timing. Therefore, a point sequence connecting the values for each of the four values is not originally data that are connected in time, and therefore has little physical meaning. However, when the attractor is reconstructed, the arrangement of each point in the phase space becomes meaningful, so that meaningful point-sequence data may be selected and used effectively.
For example, as is clear from the comparison between the graph G12 and the graph G22, the attractor of the high and low prices and the interpolation points thereof has a wider trajectory. In addition, since the attractors of the high and low prices determine the upper and lower limits, respectively, the attractors may be expected to be clarified when the number of data points is increased by the interpolation points. In contrast, the shapes of the attractors of the opening and closing prices and the interpolation points thereof appear clear at first glance, but the attractors exhibit a distorted shape due to the influence of noise caused by severe fluctuations, and the density of points is sparse as a whole.
Therefore, the distance between the phase points forming the attractor increases at the high and low prices and the interpolation points thereof. Further, in comparison between the graph G13 of the Betti series by the attractors of the high and low prices and the interpolation points thereof and the graph G23 of the Betti series by the attractors of the opening and closing prices and the interpolation points thereof, in the graph G13, the Betti number holds a large value for a particularly small r (scale), expressing the feature more clearly.
Specifically, the determination unit 32 determines, as the interpolation points, projection points obtained by projecting the contract prices (black circles) indicated by the tick data D onto each timing (break time) at one-minute intervals. Further, in order to make the number of interpolation points the same at each timing, when the number of projection points is larger than the number determined for the interpolation points (designated number), the determination unit 32 may randomly select the designated number of projection points. Conversely, when the number of projection points is less than the designated number, the determination unit 32 may set the designated number to the minimum number of projection points over all timings, or may interpolate to reach the designated number.
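The matching of the number of interpolation points may be sketched as follows. Random selection follows the description above; the linear interpolation used when there are too few projection points is only one assumed way of interpolating, and the function name `fixed_count` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def fixed_count(
    values: Sequence[float], designated: int, rng: Optional[random.Random] = None
) -> List[float]:
    """Return exactly `designated` values, sorted from highest to lowest."""
    if designated < 1 or not values:
        raise ValueError("need at least one value and a positive designated number")
    rng = rng or random.Random()
    ordered = sorted(values, reverse=True)
    if len(ordered) >= designated:
        # Too many (or exactly enough) projection points: select at random.
        return sorted(rng.sample(ordered, designated), reverse=True)
    if len(ordered) == 1:
        return ordered * designated
    # Too few: interpolate linearly along the sorted values (assumed scheme).
    result = []
    for k in range(designated):
        position = k * (len(ordered) - 1) / (designated - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        result.append(ordered[lower] + (ordered[upper] - ordered[lower]) * fraction)
    return result


# Illustrative values only.
padded = fixed_count([100.0, 102.0], designated=3)
picked = fixed_count([5.0, 1.0, 3.0, 4.0, 2.0], designated=3, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible, which may be convenient when analysis results are compared across runs.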
As described above, the data analysis device 1 includes the determination unit 32 and the attractor generation unit 33. The determination unit 32 determines the plurality of numerical values indicating the features of the time-series data 21 to be analyzed at respective timings having a predetermined time interval so that the number of numerical values at each timing is made same. The attractor generation unit 33 generates the attractors ATh, ATin1, ATin2, . . . , ATl related to the time-series data 21 based on the numerical values determined by the determination unit 32.
The numerical values indicating a plurality of features determined for the time-series data 21 by aligning the conditions at each timing so that the number of numerical values is the same may be the state points on the attractor in the phase space. Therefore, by generating the attractors ATh, ATin1, ATin2, . . . , ATl based on the numerical values indicating the plurality of determined features, the density of attractors in the phase space may increase, so that the existence range of the attractors on the phase space may be expressed more widely. As a result, the attractor shapes are clarified to distinguish the changes of the attractors clearly, so that the attractors and the Betti series by TDA are stabilized. Further, the Betti series becomes smooth. For this reason, the data analysis device 1 improves the performance of feature extraction in data analysis by TDA, thereby facilitating extraction of the features of the time-series data with high accuracy.
Further, the determination unit 32 determines the numerical values of the highest point (e.g., the high price xh) and the lowest point (e.g., the low price xl) included in the time-series data 21 within the time interval corresponding to a timing, and the numerical values of the interpolation points (xin1, xin2, . . . ) with the same number of interpolation points per timing between the highest point and the lowest point.
The interpolation points between the highest point and the lowest point are considered to be points near the phase points that originally exist on the attractor. By determining the interpolation points as numerical values indicating the features, the density of the phase points on the attractor in the phase space increases, thereby expressing the existence range of the attractor in more detail. As a result, the attractor shapes are clarified to easily distinguish a difference between the attractors at the time of data analysis by TDA.
Further, the determination unit 32 determines the numerical values of the interpolation points by equally dividing between the highest point and the lowest point (e.g., between the high price and the low price). In this way, the data analysis device 1 may determine the interpolation points by equally dividing between the highest point and the lowest point.
Further, the determination unit 32 determines the measured values (e.g., the contract prices in the stock price) included in the time-series data 21 within the time interval corresponding to a timing, as the numerical values of the interpolation points. The measured values included in the time-series data 21 within the time interval corresponding to the timing may be considered closer to the phase points originally existing on the attractor. Therefore, by determining the measured values as the numerical values of the interpolation points, the attractor shapes are clarified to easily distinguish the difference between the attractors at the time of data analysis by TDA.
Further, the time-series data 21 are data indicating the temporal transition of the stock price. The determination unit 32 determines, at each timing, the high and low prices of the stock price and the numerical values of the interpolation points between the high and low prices, with the same number of interpolation points per timing. The interpolation points between the high and low prices in the stock price correspond to the degree of fluctuation in the stock price. Therefore, attractors generated using the high and low prices of the stock price and the interpolation points between the prices are considered to reflect more accurately the dynamic characteristics of the actual phenomenon in the stock price (transition of the stock price over time), so that the accuracy of stock price feature extraction may be expected to increase.
Each constituent element of each of the illustrated devices does not necessarily have to be physically configured as illustrated in the drawings. That is, the specific form of distribution/integration of the devices is not limited to those illustrated in the drawings, and all or a part of the devices may be configured to be functionally or physically distributed/integrated in arbitrary units according to various loads and usage conditions.
Further, all or a part of various types of processing functions performed by the data analysis device 1 may be executed on a CPU (or a microcomputer such as an MPU or an MCU (Micro Controller Unit)). Further, it is needless to say that all or a part of the various types of processing functions may be executed on a program analyzed and executed by a CPU (or a microcomputer such as an MPU or an MCU) or on hardware by wired logic. Further, the various types of processing functions performed by the data analysis device 1 may be executed by a plurality of computers in cooperation by cloud computing.
The various types of processes described in the above embodiment may be implemented by executing a program prepared in advance on a computer. Therefore, an example of a computer (hardware) that executes a program having the same function as that of the above embodiment will be described below.
As illustrated in
The hard disk device 209 stores a program 211 for executing various types of processes in the input reception unit 31, the determination unit 32, the attractor generation unit 33, the analysis processing unit 34, the output unit 35, and the like described in the above embodiment. Further, the hard disk device 209 stores various types of data 212 referred to by the program 211. The input device 202 receives, for example, input of operation information from an operator. The monitor 203 displays, for example, various types of screens operated by the operator. The interface device 206 is connected to, for example, a printing device or the like. The communication device 207 is connected to a communication network such as a LAN (Local Area Network) to exchange a variety of information with external devices via the communication network.
The CPU 201 reads the program 211 stored in the hard disk device 209 and deploys the read program 211 onto the RAM 208 to perform various types of processes related to the input reception unit 31, the determination unit 32, the attractor generation unit 33, the analysis processing unit 34, the output unit 35, and the like. The program 211 may not be stored in the hard disk device 209. For example, the computer 200 may read and execute the program 211 stored in a readable storage medium. The storage medium that may be read by the computer 200 includes a portable recording medium such as a CD-ROM, a DVD disk, or a USB (universal serial bus) memory, a semiconductor memory such as a flash memory, a hard disk drive, or the like. Further, the program 211 may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 200 may read and execute the program 211 from this device.
According to an aspect of the embodiment, it is possible to extract the features of time-series data with high accuracy.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to an illustrating of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2020-100693 | Jun 2020 | JP | national |