This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2024-006595, filed on Jan. 19, 2024; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to an information processing device, a computer program product, and an information processing method.
A technique has been proposed that inputs time-series data (sequence data) to a machine learning model and performs various kinds of inference. The machine learning model is, for example, a deep learning model. For example, a model has been proposed that receives input of video data capturing a mobile object such as a person, or receives input of skeleton sequence data obtained by analyzing video data, and performs inference about the actions of the mobile object (the behavior of the person).
If the inference is performed using a plurality of types of time-series data, the accuracy of the inference is expected to improve. On the other hand, since time-series data contains data (frames) at a plurality of times, a configuration that uses a plurality of pieces of time-series data risks increasing the amount of calculation in operations such as inference and learning performed by a machine learning model.
According to an embodiment, an information processing device includes one or more hardware processors. The one or more hardware processors are configured to perform: obtaining one or more pieces of first time-series data and one or more pieces of second time-series data, types of the one or more pieces of first time-series data being different from each other, types of the one or more pieces of second time-series data being different from the types of the one or more pieces of first time-series data and being different from each other; calculating, regarding each of a plurality of first frames included in the first time-series data, a degree of attention indicating a degree at which attention is given in inference performed by a first model configured to receive input of the first time-series data and perform the inference; selecting, from the second time-series data, an N number of second frames using the degree of attention, N being an integer equal to or greater than 1; and performing training-inference including performing training of a second model configured to perform the inference or performing the inference by the second model using information based on the first time-series data and information based on the selected N number of second frames.
Preferred embodiments of an information processing device according to the present invention are described below in detail with reference to the accompanying drawings.
The following explanation is mainly given about an exemplary configuration in which a model is used that receives input of a plurality of types of time-series data (sequence data) and that estimates (infers) the behavior of a person. Herein, the inference applicable in the embodiments described below is not limited to the estimation of the behavior of a person, and it is possible to apply any type of inference performed using time-series data. For example, it is possible to have a configuration in which the actions of a non-human mobile object such as an animal or a robot are estimated.
The time-series data contains a plurality of pieces of data (frames) continuous in chronological order. Each of the frames included in the time-series data represents, for example, one of the following types of data. However, those are not the only possible examples.
Meanwhile, the types of data (time-series data) are sometimes called the modals of data or the formats of data. The embodiments described below may also be interpreted to be an example of performing inference using multimodal time-series data.
As explained above, as a result of performing inference using a plurality of types of time-series data, it becomes possible to enhance the accuracy of the inference. Depending on the type of the time-series data, sometimes it is not necessary to use all of the frames included in the time-series data for the intended inference.
For example, consider a case in which the action of eating a particular food item as performed by a particular person is estimated using the time-series data of the skeleton data (hereinafter, called skeleton sequence data) and using the time-series data of the color image data (hereinafter, called video data). In such a case, the action of eating a food item may be estimated using, for example, the skeleton sequence data containing a plurality of frames. On the other hand, the food item being eaten may be possibly estimated using, for example, only those frames in which the food item is captured from among the frames included in the video data.
Thus, in the case where inference is performed using a plurality of types of time-series data, regarding some of the types of time-series data, even when the inference is performed using only some of the frames instead of using all of the frames included in the time-series data, the accuracy of inference may still be enhanced.
An information processing device according to a first embodiment performs predetermined inference using a plurality of types of time-series data. At that time, in the first embodiment, using the degree of attention of each frame calculated from a particular type of time-series data (hereinafter, called time-series data SDA), N number of frames (where N is an integer equal to or greater than 1) are selected (hereinafter, called frame data FDB) from other types of time-series data (hereinafter, time-series data SDB), and the inference is performed using the selected frame data FDB. As a result, as compared to a configuration in which the inference is performed using the time-series data SDB that contains all frames, it becomes possible to reduce the amount of calculation.
The degree of attention represents the information indicating the degree at which attention is given in the inference performed using the time-series data SDA. Then, for example, frame data FDB at the time corresponding to the frame having the highest degree of attention is selected. Hence, even without using the time-series data SDB that contains all frames, it becomes possible to enhance the accuracy of the processing performed using a machine learning model.
The obtaining unit 101 obtains a variety of information used in the information processing device 100. For example, the obtaining unit 101 obtains one or more pieces of time-series data SDA (pieces of first time-series data) that are of mutually different types, and obtains one or more pieces of time-series data SDB (pieces of second time-series data) that are of different types than the time-series data SDA and that are of mutually different types.
In the first embodiment, the following explanation is given about an example in which a single set of time-series data SDA is obtained and a single set of time-series data SDB is obtained. An example in which two pieces of time-series data SDB are used is explained in a second embodiment. An example in which two pieces of time-series data SDA are used is explained in a third embodiment. If the second and third embodiments are combined, it becomes possible to have a configuration in which two or more pieces of time-series data SDA are used and two or more pieces of time-series data SDB are used.
As explained later, the time-series data SDA is equivalent to the time-series data from which all frames are used in behavior estimation. On the other hand, the time-series data SDB is equivalent to the time-series data from which N number of frames are selected and the selected frames are used in behavior estimation.
The time-series data SDA and the time-series data SDB may have the same start time and the same end time. That eliminates the need to perform the operation of associating the times among a plurality of pieces of time-series data.
If a plurality of pieces of time-series data have different frame rates, for example, the obtaining unit 101 interpolates frames in the time-series data having the lower frame rates, and thereby ensures that the frame rate is the same among the plurality of pieces of time-series data.
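The frame-rate alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the embodiment does not specify the interpolation method, so linear interpolation between neighboring frames is assumed here, each frame is represented as a flat list of floats, and the name `resample_linear` is hypothetical.

```python
from typing import List

Frame = List[float]


def resample_linear(frames: List[Frame], target_len: int) -> List[Frame]:
    """Resample a sequence to target_len frames by linear interpolation.

    The first and last frames are preserved, so the start time and the
    end time of the resampled sequence match those of the original.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if len(frames) == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]

    scale = (len(frames) - 1) / (target_len - 1)
    out: List[Frame] = []
    for i in range(target_len):
        pos = i * scale                       # fractional index in the source
        lo = min(int(pos), len(frames) - 2)   # left neighbor
        w = pos - lo                          # weight of the right neighbor
        out.append(
            [(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[lo + 1])]
        )
    return out
```

For example, a low-rate sequence of two frames can be brought to three frames so that it lines up frame by frame with a higher-rate sequence of length three.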
The time-series data having a large data volume may be obtained as the time-series data from which the frames are selected. Thus, the data volume of the time-series data SDB may be greater than the data volume of the time-series data SDA. As a result, the reduction in the data volume to be processed becomes greater.
For example, the obtaining unit 101 may obtain skeleton sequence data as the time-series data SDA, and may obtain video data as the time-series data SDB. For example, the skeleton sequence data represents the coordinate data of the time series expressing the joint points of a person. The skeleton sequence data may be obtained according to an arbitrary method. For example, the skeleton sequence data may be obtained according to a skeleton detection method such as OpenPose, or according to a method in which a sensor is used.
In the following explanation, it is assumed that a single person represents the estimation target. When the data of a plurality of persons is included in the time-series data, the obtaining unit 101 may use an object detector to generate time-series data that contains data clipped on a person-by-person basis. In the case where inference other than the estimation of the actions of a person is performed, it is not necessary to perform the operation clipping the data on a person-by-person basis using an object detector.
The calculating unit 102 calculates the degree of attention regarding each of a plurality of frames (first frames) included in the time-series data SDA. The degree of attention indicates the degree at which attention is given in the inference performed using a model MA (a first model) that is configured to receive input of the time-series data SDA and perform inference.
A frame having a high degree of attention may be interpreted as a frame that is important in the inference performed using the model MA, or as a frame serving as the basis of the inference performed using the model MA, or as a frame having a significant contribution in the inference performed using the model MA. Thus, the degree of attention may also be referred to as the degree of importance or the degree of contribution.
When the skeleton sequence data is used as the time-series data SDA, for example, the model MA may be a deep learning model such as ST-GCN (Spatial Temporal Graph Convolutional Network) that, regarding each of a plurality of frames included in the skeleton sequence data, outputs the output data, which contains the degree of attention in the temporal axis, as the result of inference (inference result).
For example, ST-GCN receives input of the skeleton sequence data and outputs a feature (an example of the output data) including three-dimensional elements of (the skeleton points, the number of persons, and the degree of attention in the temporal axis). The calculating unit 102 may use the model MA and calculate the degree of attention in the temporal axis as the degree of attention of each frame.
For example, the calculating unit 102 implements a dimensional compression method such as Global Average Pooling and compresses, from among the elements included in the feature, the elements in the dimensions other than the degree of attention in the temporal axis; and calculates the feature that includes only the degree of attention in the temporal axis. That feature is equivalent to the degree of attention of each frame (the degree of attention in the temporal axis). It is assumed that the greater the degree at which attention is given in the inference, the greater is the value of the degree of attention. Meanwhile, it is also possible to have a configuration in which dimensional compression is not performed and, for example, the degree of attention is calculated for each skeleton point, and the degree of attention having the highest value is used from among the degrees of attention of the skeleton points.
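The dimensional compression described above can be sketched as follows. This is a minimal illustration, not the implementation of the model MA: it assumes the feature is available as a nested list indexed as `feature[t][v][m]` (frame, skeleton point, person), and the function names are hypothetical. In the second function, averaging over persons before taking the maximum over skeleton points is also an assumption.

```python
from typing import List

Feature = List[List[List[float]]]  # assumed layout: [frame][skeleton point][person]


def temporal_attention_gap(feature: Feature) -> List[float]:
    """Global average pooling over skeleton points and persons.

    Returns one degree of attention per frame.
    """
    degrees = []
    for frame in feature:
        values = [x for joint in frame for x in joint]
        degrees.append(sum(values) / len(values))
    return degrees


def temporal_attention_max_joint(feature: Feature) -> List[float]:
    """Alternative without full compression.

    Computes a degree of attention per skeleton point (here, the average
    over persons, which is an assumption) and keeps the highest value.
    """
    return [max(sum(joint) / len(joint) for joint in frame) for frame in feature]
```

Either function yields a list with one value per frame, which is the form the frame selection step needs.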
However, the method for calculating the degree of attention is not limited to the method explained above. Thus, as long as the degree at which attention is given in the inference based on the model MA may be calculated for each of a plurality of frames included in the time-series data SDA, it is possible to implement any other method.
The selecting unit 103 uses the degrees of attention calculated by the calculating unit 102 and selects, from the time-series data SDB, the frame data FDB representing the N number of frames (second frames).
For example, as the frame data FDB, the selecting unit 103 selects frames in the time-series data SDB at the same time as the time of the frames having the highest degree of attention. When a single frame has the highest degree of attention, the selecting unit 103 selects a single (N=1) set of frame data FDB.
When there are two or more frames having the highest degree of attention, the selecting unit 103 may select, as the frame data FDB from the time-series data SDB, two or more frames at the same times as the two or more frames having the highest degree of attention.
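The selection of the frames at the times of the highest degree of attention, including the case of ties, can be sketched as follows. The sketch assumes that the time-series data SDA and SDB are already aligned frame by frame, so that the same index means the same time; the name `select_top_frames` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_top_frames(attention: Sequence[float], sdb: Sequence[T]) -> List[T]:
    """Return the frames of SDB at the times where attention is highest.

    If several frames share the highest degree of attention, all of the
    corresponding SDB frames are returned, so N may be greater than 1.
    """
    if len(attention) != len(sdb):
        raise ValueError("attention and sdb must have the same number of frames")
    if not attention:
        raise ValueError("at least one frame is required")
    best = max(attention)
    return [frame for a, frame in zip(attention, sdb) if a == best]
```

With degrees of attention `[0.1, 0.9, 0.3]`, only the second frame of SDB is selected (N=1); with `[0.5, 0.2, 0.5]`, the first and third frames are selected (N=2).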
The method by which the selecting unit 103 selects a frame is not limited to the method for selecting the frame corresponding to the frame having the highest degree of attention. Alternatively, for example, the selecting unit 103 may apply the following selection methods.
Returning to the explanation with reference to
The model 601 receives input of the time-series data SDA, performs inference, and outputs the inference result. In the example illustrated in
The model 603 receives input of the frame data FDB, performs inference, and outputs the inference result. In the example illustrated in
The model 603 may be configured to perform batch processing of N number of frames, that is, to receive collective input of N number of frames and output N number of inference results. Alternatively, the model 603 may be configured to receive input of N number of frames one by one, and output a single inference result at a time. In the latter case, the model 603 repeatedly performs inference for N number of times and obtains N number of inference results. When a plurality of inference results is obtained (i.e., when N≥2 holds true), a statistical value (the average value or the median value) of the inference results may be used as the final inference result obtained by the model 603. Still alternatively, the model 603 may be configured to receive collective input of N number of frames and output only a single inference result.
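The case in which N inference results are reduced to a single result by a statistical value can be sketched as follows. The sketch assumes that each inference result is a vector of per-class scores of equal length; the name `aggregate_results` and the `method` parameter are hypothetical.

```python
from statistics import mean, median
from typing import List, Sequence


def aggregate_results(
    results: Sequence[Sequence[float]], method: str = "mean"
) -> List[float]:
    """Combine N per-frame inference results into one result.

    Each element of `results` is one inference result (for example, a
    score per class). The statistic is taken element by element.
    """
    if not results:
        raise ValueError("at least one inference result is required")
    reducers = {"mean": mean, "median": median}
    if method not in reducers:
        raise ValueError("method must be 'mean' or 'median'")
    reducer = reducers[method]
    # zip(*results) groups the values of the same class across the N results.
    return [reducer(column) for column in zip(*results)]
```

The median is less sensitive than the average to a single outlying frame, which may matter when one of the selected frames is, for example, blurred.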
The model 602 uses the inference results obtained by the models 601 and 603, and outputs the final inference result. In the example illustrated in
Meanwhile, in the case where inference is performed, the training-inference unit 104 outputs, as the processing result, the inference result obtained by the model 602.
When performing training, the training-inference unit 104 trains each model using the data for learning. The training method implemented by the training-inference unit 104 may be any method that is implementable with respect to each model. For example, the training-inference unit 104 calculates the value of a loss function using the output (the inference result) of the model 602, and trains the model 602 to optimize the value of the loss function.
The training-inference unit 104 may train at least either the model 601 (the model MA) or the model 603 along with the model 602 (the model MB). For example, the training-inference unit 104 may train the parameters of the model 601, the parameters of the model 602, and the parameters of the model 603 to optimize the value of the loss function. The processing result obtained when performing training indicates, for example, the parameters representing the trained models.
Meanwhile, when the training-inference unit 104 trains the model 601 along with the model 602, the calculating unit 102 calculates the degrees of attention using the model 601 that is being trained. On the other hand, when the training-inference unit 104 does not train the model 601 along with the model 602, the calculating unit 102 calculates the degrees of attention using the already-trained model 601.
The configuration of the training-inference unit 104 is not limited to the configuration illustrated in
In that case, the model 602 receives input of the probability information PA (another example of the information based on the time-series data SDA) and the probability information PB (another example of the information based on the selected N number of pieces of frame data FDB), and outputs the inference result. For example, the model 602 outputs, as the inference result, the probability information obtained by adding the probability information PA and the probability information PB.
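The addition of the probability information PA and PB can be sketched as follows. The sketch assumes both are per-class probability vectors of the same length; the function names are hypothetical, and taking the class with the largest summed value as the estimated class is an assumption about how the result is used.

```python
from typing import List, Sequence


def fuse_probabilities(pa: Sequence[float], pb: Sequence[float]) -> List[float]:
    """Add the probability information PA and PB class by class."""
    if len(pa) != len(pb):
        raise ValueError("pa and pb must cover the same classes")
    return [a + b for a, b in zip(pa, pb)]


def predict_class(pa: Sequence[float], pb: Sequence[float]) -> int:
    """Return the index of the class with the largest fused value."""
    fused = fuse_probabilities(pa, pb)
    return max(range(len(fused)), key=fused.__getitem__)
```

Note that the summed values no longer add up to 1; if normalized probabilities are needed, the sum can be divided by the number of inputs, which does not change which class is largest.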
Returning to the explanation with reference to
The selected pieces of frame data FDB are equivalent to the frames having high degrees of attention, that is, the frames to which attention is given in the inference, or the frames serving as the basis for the inference, or the frames important in the inference. Thus, as a result of outputting the selected pieces of frame data FDB, it becomes possible to figure out the basis for the inference.
The output control unit 105 may output the output information according to any method. For example, it is possible to apply the method for displaying the information in a display device such as a display, or the method for transmitting the information to external devices connected via a network.
At least some of the constituent elements explained above (i.e., the obtaining unit 101, the calculating unit 102, the selecting unit 103, the training-inference unit 104, and the output control unit 105) may be implemented using one or more processing units. For example, the constituent elements are implemented using one or more processors. For example, the constituent elements may be implemented by causing a processor such as a central processing unit (CPU) or a graphics processing unit (GPU) to execute computer programs, that is, may be implemented using software. Alternatively, the constituent elements may be implemented using a processor such as a dedicated integrated circuit (IC), that is, may be implemented using hardware. Still alternatively, the constituent elements may be implemented using a combination of software and hardware. In the case where a plurality of processors is used, each processor either may implement one of the constituent elements or may implement two or more constituent elements.
The memory unit 121 is used to store a variety of information that is used in the information processing device. For example, the memory unit 121 is used to store: the information obtained by the obtaining unit 101 (the time-series data SDA and the time-series data SDB); the frame data FDB selected by the selecting unit 103; and the inference result obtained by the training-inference unit 104.
The memory unit 121 may be configured using any commonly-used memory medium such as a flash memory, a memory card, a random access memory (RAM), a hard disk drive (HDD), or an optical disc.
The information processing device 100 may be physically configured either using a single device or using a plurality of devices. For example, the information processing device 100 may be built in a cloud environment. Meanwhile, the constituent elements of the information processing device 100 may be dispersedly included across a plurality of devices.
Given below is the explanation of the information processing performed by the information processing device 100 according to the first embodiment.
The obtaining unit 101 obtains the time-series data SDA and the time-series data SDB (Step S101). The calculating unit 102 uses the time-series data SDA and calculates the degree of attention of each frame included in the time-series data SDA (Step S102). The selecting unit 103 uses the degrees of attention and selects N number of frames (pieces of frame data FDB) from the time-series data SDB (Step S103). For example, the selecting unit 103 selects, from the time-series data SDB, the pieces of frame data FDB corresponding to the frames having the highest degree of attention. The training-inference unit 104 performs processing (inference or training) using the time-series data SDA and the frame data FDB (Step S104). The output control unit 105 outputs the processing result (the inference result or the training result) (Step S105). That marks the end of the information processing.
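The flow of Steps S102 to S104 for the inference case can be sketched as follows. This is a sketch under stated assumptions, not the device's implementation: the data is assumed to be already obtained and aligned frame by frame (Step S101), the models are passed in as plain callables, and the names `run_inference`, `attention_fn`, and `infer_fn` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

FA = TypeVar("FA")  # frame type of the time-series data SDA
FB = TypeVar("FB")  # frame type of the time-series data SDB
R = TypeVar("R")    # type of the inference result


def run_inference(
    sda: Sequence[FA],
    sdb: Sequence[FB],
    attention_fn: Callable[[Sequence[FA]], Sequence[float]],
    infer_fn: Callable[[Sequence[FA], List[FB]], R],
) -> R:
    """Run Steps S102 to S104 on already obtained data (Step S101)."""
    if len(sda) != len(sdb):
        raise ValueError("sda and sdb must be aligned frame by frame")

    # Step S102: degree of attention of each frame of SDA.
    attention = attention_fn(sda)
    if len(attention) != len(sda):
        raise ValueError("attention_fn must return one value per frame")

    # Step S103: frames of SDB at the times of the highest degree of attention.
    best = max(attention)
    fdb = [frame for a, frame in zip(attention, sdb) if a == best]

    # Step S104: inference using all of SDA and only the selected frames of SDB.
    # The caller outputs the returned result (Step S105).
    return infer_fn(sda, fdb)
```

Because `infer_fn` only ever receives the selected frames of SDB, the cost of processing SDB scales with N rather than with the full sequence length.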
In this way, in the first embodiment, inference is performed using the time-series data of a particular type and using the frames of the time-series data of another type that are selected according to the degrees of attention calculated from the particular type of time-series data. As a result, as compared to the inference performed using a single set of time-series data, it becomes possible to enhance the accuracy of the processing. Meanwhile, regarding some pieces of time-series data, selected frames are used for inference instead of using the time-series data in entirety. As a result, as compared to the configuration in which the inference is performed using the time-series data SDB that contains all frames, it becomes possible to reduce the amount of calculation.
Moreover, as a result of performing inference using the selected frames, it also becomes possible to reduce the load attributed to building a model for the purpose of inference. For example, regarding a model that performs inference using only some of the frames (for example, a single frame) instead of using the time-series data in entirety, it is possible to have a configuration in which a model is used that is trained in advance using large learning data made open to the public.
In the first embodiment, frames are selected from only a single type of time-series data. However, there may be two or more types of time-series data from which the frames are selected. In a second embodiment, the explanation is given about an example in which the frames are selected from each of two types of time-series data.
In the second embodiment, the obtaining unit 101-2, the selecting unit 103-2, and the training-inference unit 104-2 have different functions than the functions according to the first embodiment. The other configurations and the other functions are identical to
The obtaining unit 101-2 obtains a single set of time-series data SDA and obtains two pieces of time-series data SDB-1 and SDB-2 of mutually different types. For example, the time-series data SDA represents skeleton sequence data; the time-series data SDB-1 represents video data; and the time-series data SDB-2 represents sequence data of the optical flow calculated from the video data (the time-series data SDB-1).
The selecting unit 103-2 selects N number of frames from each of the two pieces of time-series data SDB-1 and SDB-2. In the following explanation, the frames selected from the time-series data SDB-1 are referred to as frame data FDB-1, and the frames selected from the time-series data SDB-2 are referred to as frame data FDB-2.
Regarding each set of time-series data (the time-series data SDB-1 and the time-series data SDB-2), the method for selecting the frames is identical to the method implemented by the selecting unit 103 according to the first embodiment. For example, the selecting unit 103-2 selects, as the frame data FDB-1 (FDB-2), frames in the time-series data SDB-1 (SDB-2) at the same time as the time of the frame having the highest degree of attention.
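The selection from two pieces of time-series data using a single set of degrees of attention can be sketched as follows. The sketch assumes that all streams are aligned frame by frame with the time-series data SDA; the name `select_from_streams` and the use of a dictionary keyed by stream name are hypothetical.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def select_from_streams(
    attention: Sequence[float], streams: Dict[str, Sequence[T]]
) -> Dict[str, List[T]]:
    """Select, from every stream, the frames at the times of highest attention.

    The same time indices are used for all streams, so the selected
    frames of the different streams correspond to each other in time.
    """
    if not attention:
        raise ValueError("at least one frame is required")
    for name, frames in streams.items():
        if len(frames) != len(attention):
            raise ValueError(f"stream {name!r} is not aligned with the attention")
    best = max(attention)
    indices = [i for i, a in enumerate(attention) if a == best]
    return {name: [frames[i] for i in indices] for name, frames in streams.items()}
```

For instance, passing the video frames under the key `"SDB-1"` and the optical flow frames under `"SDB-2"` returns the frame data FDB-1 and FDB-2 under the same keys.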
The training-inference unit 104-2 uses the information that is based on the time-series data SDA and uses the information that is based on the N number of pieces of frame data FDB-1 and FDB-2, and performs training of the model MB or inference.
The model 603 is identical to
The model 604-2 receives input of the frame data FDB-2 and performs inference, and outputs the inference result. In the example illustrated in
The model 602-2 uses the inference results of the models 601, 603, and 604-2, and outputs the final inference result. In the example illustrated in
In an identical manner to the first embodiment, the models 601, 603, and 604-2 may output probability information as the inference result instead of outputting the features. In that case, the model 602-2 outputs the inference result using a variety of probability information. In the case where the time-series data SDB of three or more types are used, the configuration may be such that a model corresponding to each type is used in an identical manner to the explanation given earlier.
In the information processing device 100-2 according to the second embodiment, the flow of information processing is identical to
In this way, in the second embodiment, it becomes possible to have a configuration in which the frames are selected from each of two types of time-series data.
In the first embodiment, the degrees of attention are calculated from only a single type of time-series data. However, there may be two or more types of time-series data from which the degrees of attention are calculated. In a third embodiment, the explanation is given about an example in which the degrees of attention are calculated from each of two types of time-series data.
In the third embodiment, the obtaining unit 101-3, the calculating unit 102-3, the selecting unit 103-3, and the training-inference unit 104-3 have different functions than the functions according to the first embodiment. The other configurations and the other functions are identical to
The obtaining unit 101-3 obtains two pieces of time-series data SDA-1 and SDA-2 of mutually different types, and obtains a single set of time-series data SDB. For example, the time-series data SDA-1 represents skeleton sequence data; the time-series data SDB represents video data; and the time-series data SDA-2 represents sequence data of the optical flow calculated from the video data (the time-series data SDB).
The calculating unit 102-3 calculates the degree of attention of each frame in each of the pieces of time-series data SDA-1 and SDA-2. The method for calculating the degrees of attention in each set of time-series data (each of the pieces of time-series data SDA-1 and SDA-2) is identical to the method implemented by the calculating unit 102 according to the first embodiment. For example, using a model MA-1 (an example of the first model) that receives input of the time-series data SDA-1 and performs inference, the calculating unit 102-3 calculates the degree of attention of each of a plurality of frames included in the time-series data SDA-1 (hereinafter, referred to as an attention degree AD-1). Moreover, using a model MA-2 (an example of the first model) that receives input of the time-series data SDA-2 and performs inference, the calculating unit 102-3 calculates the degree of attention of each of a plurality of frames included in the time-series data SDA-2 (hereinafter, referred to as an attention degree AD-2).
Then, using the attention degrees AD-1 and the attention degrees AD-2, the selecting unit 103-3 selects N number of frames from a single set of time-series data SDB. The selecting unit 103-3 may apply the following selection methods.
The training-inference unit 104-3 uses the information based on the two pieces of time-series data SDA-1 and SDA-2 and uses the information based on the N number of frames of frame data FDB, and performs training of the model MB or inference.
The model 601 is identical to
A model 605-3 receives input of the time-series data SDA-2 and performs inference, and outputs the inference result. In the example illustrated in
The model 602-3 uses the inference results of the models 601, 603, and 605-3, and outputs the final inference result. In the example illustrated in
In an identical manner to the first embodiment, the models 601, 603, and 605-3 may output probability information as the inference result instead of outputting the features. In that case, the model 602-3 outputs the inference result using a variety of probability information. In the case where the time-series data SDA of three or more types are used, the configuration may be such that a model corresponding to each type is used in an identical manner to the explanation given earlier.
In the example illustrated in
In the information processing device 100-3 according to the third embodiment, the flow of information processing is identical to
In this way, in the third embodiment, it becomes possible to have a configuration in which the degrees of attention are calculated from two types of time-series data.
As explained above, according to the first to third embodiments, it becomes possible to hold down an increase in the amount of calculation, while enhancing the accuracy of the processing performed using machine learning models.
Explained below with reference to
The information processing device according to the first to third embodiments includes a control device such as a central processing unit (CPU) 51; memory devices such as a read only memory (ROM) 52 and a random access memory (RAM) 53; a communication interface (I/F) 54 that performs communication by establishing connection with a network; and a bus 61 that connects the constituent elements to each other.
A computer program executed in the information processing device according to the first to third embodiments is stored in advance in the ROM 52.
Alternatively, the computer program executed in the information processing device according to the first to third embodiments may be recorded as an installable file or an executable file in a computer-readable recording medium such as a CD-ROM (which stands for Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (which stands for Compact Disk Recordable), or a DVD (which stands for Digital Versatile Disk).
Still alternatively, the computer program executed in the information processing device according to the first to third embodiments may be stored in a downloadable manner in a computer that is connected to a network such as the Internet. Still alternatively, the computer program executed in the information processing device according to the first to third embodiments may be distributed via a network such as the Internet.
The computer program executed in the information processing device according to the first to third embodiments may cause a computer to function as the constituent elements of the information processing device. In the computer, the CPU 51 reads the computer program from a computer-readable memory medium into the main memory device, and then may execute the computer program.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2024-006595 | Jan 2024 | JP | national |