The present invention relates to a learning apparatus, a learning method, and a learning program.
Generally, when classifying videos, it is important to grasp the temporal context of each frame image, and various proposals have been made in the past. For example, the non-patent literature cited below discloses technologies for estimating temporal sequence relationships among frame images in a video.
Non-Patent Literature 1: Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, Ming-Hsuan Yang, “Unsupervised Representation Learning by Sorting Sequences”, The IEEE International Conference on Computer Vision (ICCV) 2017, pp. 667-676, 2017.
Non-Patent Literature 2: Dahun Kim, Donghyeon Cho, In So Kweon, "Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles", Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, No. 01 (AAAI-19, IAAI-19, EAAI-20), pp. 8545-8552, 2019.
Meanwhile, to grasp the temporal context of each frame image in a video, it is desirable to be capable of estimating not only the temporal sequence relationships, but also the temporal interval. This is because if the temporal interval between the frame images in a video can be estimated, it is possible to compute not only the movement direction but also properties such as the movement speed of an object included in each frame image.
In one aspect, an objective is to generate a model that estimates the temporal interval between frame images in a video.
According to an aspect of the present disclosure, a learning apparatus includes:
a first model configured to accept a plurality of frame images included in a video as input, and output a feature vector for each frame image;
a second model configured to accept the feature vector for each frame image as input, and output a temporal interval between a frame image treated as a reference and each of the frame images other than the frame image treated as the reference; and
a learning unit configured to update parameters of the first and second models such that each of the temporal intervals output from the second model approaches each temporal interval computed from time-related information pre-associated with each frame image.
According to the present disclosure, a model that estimates the temporal interval between frame images in a video can be generated.
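The two-stage arrangement described above can be sketched in simplified form as follows. This is an illustrative assumption, not the disclosed implementation: `embed` stands in for the shared feature-extraction network of the first model, and `head` stands in for the interval-regression network of the second model.

```python
def model_one(frames, embed):
    """First model: map each frame image to a feature vector.
    `embed` stands in for a shared feature extractor applied per frame."""
    return [embed(f) for f in frames]

def model_two(feature_vectors, head):
    """Second model: from the feature vectors, estimate the temporal
    interval between the reference (first) frame and each other frame.
    `head` stands in for the interval-regression network."""
    ref = feature_vectors[0]
    return [head(ref, h) for h in feature_vectors[1:]]
```

With toy stand-ins (identity "features" and a difference "regressor"), three frames at times 1, 18, and 34 yield intervals of 17 and 33 relative to the reference frame.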
Hereinafter, embodiments will be described with reference to the attached drawings. Note that, in the present specification and drawings, structural elements that have substantially the same functions and structures are denoted with the same reference signs, and duplicate description of these structural elements is omitted.
<Application Example of Model Generated by Learning Apparatus>
First, an application example of a model generated by a learning apparatus according to the first embodiment will be described.
As illustrated in the upper part of
The video 101 contains frame images captured in a temporal sequence proceeding from left to right in the upper part of
Note that the two types of models (model I, model II) subjected to the learning process by the learning apparatus 100 and the model (model III) subjected to a fine-tuning process by a task implementation apparatus (fine-tuning) 110 described later are assumed to be a combination selected from among the base models
As illustrated in the upper part of
Note that although not illustrated in the upper part of
Also, if the feature vector a, the feature vector b, and the feature vector c output from the model I are input into the model II, temporal intervals between
Specifically, in the model II, the differences in the time information (respective time differences) or the differences in the frame IDs (respective frame differences) between the first frame image in the temporal sequence and each of the second and third frame images in the temporal sequence are output.
In the learning apparatus 100, the parameters of the model I and the model II are updated such that the time differences or the frame differences output from the model II approach,
In the case illustrated in the upper part of
Also, in the case illustrated in the upper part of
Note that the phase in which the parameters of the model I and the model II are updated through the learning process performed by the learning apparatus 100 is hereinafter referred to as the “pre-learning phase”.
When the pre-learning phase ends, the process proceeds to a “fine-tuning phase”. As illustrated in the lower part of
The video 102 contains frame images captured in a temporal sequence proceeding from left to right in the lower part of
Note that a correct answer label of the objective task may be associated with each of the plurality of frame images including the three frame images xb, xa, xc, for example. Specifically,
As illustrated in the lower part of
On the other hand, the model III used for the objective task is a model on which a fine-tuning process is executed to implement an objective task (for example, a task of computing the movement speed of an object included in the input frame images).
As illustrated in the lower part of
In the task implementation apparatus 110, the parameters of the model III used for the objective task are updated (for the already-trained model I (trained), the parameters are fixed in the fine-tuning phase) such that the output result L (or the information such as output results Lb, La, Lc) output by the model III used for the objective task approach,
When the fine-tuning phase ends, the process proceeds to an “estimation phase”. As illustrated in
The video 103 contains frame images captured in a temporal sequence proceeding from left to right in
The task implementation apparatus 120 includes two types of models, of which the model I (trained) is a trained model I generated by having the learning apparatus 100 perform the learning process on the model I in the pre-learning phase.
Also, the model III (trained) used for the objective task is a trained model III generated by having the task implementation apparatus 110 perform the fine-tuning process on the model III used for the objective task.
As illustrated in
<Hardware Configuration of Learning Apparatus>
Next, a hardware configuration of the learning apparatus 100 will be described.
The processor 201 includes various computational apparatuses such as a central processing unit (CPU) and a graphics processing unit (GPU). The processor 201 reads various programs (such as a learning program described later, for example) into the memory 202 and executes them.
The memory 202 includes main memory apparatuses such as read-only memory (ROM) and random access memory (RAM). The processor 201 and the memory 202 form what is called a computer, and the computer implements various functions by causing the processor 201 to execute the various programs read into the memory 202.
The auxiliary storage apparatus 203 stores various programs and various data used when the various programs are executed by the processor 201.
The I/F apparatus 204 is a connecting apparatus that connects an operating apparatus 210 and a display apparatus 211, which are examples of external apparatuses, to the learning apparatus 100. The I/F apparatus 204 receives operations with respect to the learning apparatus 100 through the operating apparatus 210. The I/F apparatus 204 also outputs results of processes performed by the learning apparatus 100 to the display apparatus 211.
The communication apparatus 205 is a communication apparatus for communicating with other apparatuses over a network.
The drive apparatus 206 is an apparatus for mounting a recording medium 212. The recording medium 212 referred to herein includes media on which information is recorded optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disc. Additionally, the recording medium 212 may also include media such as a semiconductor memory on which information is recorded electrically, such as ROM or flash memory.
Note that various programs installed in the auxiliary storage apparatus 203 may be installed by mounting a distributed recording medium 212 on the drive apparatus 206 and causing the drive apparatus 206 to read the various programs recorded on the recording medium 212, for example. Alternatively, the various programs installed in the auxiliary storage apparatus 203 may be installed by being downloaded from a network through the communication apparatus 205.
<Functional Configuration and Specific Example of Process by Learning Apparatus>
Next, a functional configuration of the learning apparatus 100 will be described.
The self-supervised data generation unit 330 samples and reads a plurality of frame images from a video stored in an image data storage unit 310, generates pseudo-labels (frame differences or time differences), associates them with the frame images, and then randomly rearranges the frame images.
Also, the self-supervised data generation unit 330 notifies the preprocessing unit 340 of the rearranged plurality of frame images together with the associated pseudo-labels.
The preprocessing unit 340 executes various preprocesses (such as a normalization process, a cutting process, and a channel separation process, for example) on the plurality of frame images included in the notification from the self-supervised data generation unit 330. In addition, the preprocessing unit 340 stores the plurality of preprocessed frame images together with the associated pseudo-labels in a training data set storage unit 320 as a training data set.
The learning unit 350 includes a feature extraction unit 351, a self-supervised estimation unit 352, and a model update unit 353.
The feature extraction unit 351 corresponds to the model I described in
The self-supervised estimation unit 352 corresponds to the model II described in
The model update unit 353 compares the pseudo-labels (frame differences or time differences) included in the training data set read by the learning unit 350 from the training data set storage unit 320 to the frame differences or time differences output by the self-supervised estimation unit 352. Additionally, the model update unit 353 updates the parameters of the feature extraction unit 351 and the self-supervised estimation unit 352 so as to minimize the error (for example, the squared loss) between
<Details About Respective Units of Learning Apparatus>
Next, details about the respective units (the self-supervised data generation unit 330, the preprocessing unit 340, and the learning unit 350) of the learning apparatus 100 will be described.
(1) Self-Supervised Data Generation Unit
First, details about the self-supervised data generation unit 330 will be described.
Note that the following description assumes that frame IDs (for example, v1_f1, v2_f2, . . . ) including
Also, the following description assumes that
As illustrated in
The image data acquisition unit 401 samples a plurality of frame images (here, the frame images xv1_f1, xv1_f1020, xv1_f1980) from, for example, the video v1 from among the videos (v1, v2, . . . , vn) stored in the image data storage unit 310.
As described above, t, t+17, and t+33 are associated with the respective sampled frame images xv1_f1, xv1_f1020, and xv1_f1980 as time information. Also, v1_f1, v1_f1020, and v1_f1980 are associated with the respective sampled frame images xv1_f1, xv1_f1020, and xv1_f1980 as frame IDs.
Note that the inclusion of the first frame image of the video v1 (the frame image with the frame ID=v1_f1) among the plurality of frame images sampled by the image data acquisition unit 401 is merely for the sake of convenience and is not a requirement. For example, the present embodiment assumes that a method of sampling on the basis of random numbers in a uniform distribution is adopted as the method of sampling the plurality of frame images read by the image data acquisition unit 401.
Also, the present embodiment assumes that the number of frame images read by the image data acquisition unit 401 is determined on the basis of a hyperparameter, for example. Alternatively, it is assumed that the number of samples is determined by calculation from properties such as the epoch (the number of times that all videos usable in the learning process have been used in the learning process) and the lengths of the videos.
The sequence changing unit 402 rearranges the sequence of the plurality of frame images (frame images xv1_f1, xv1_f1020, and xv1_f1980) read by the image data acquisition unit 401. The example in
The pseudo-label generation unit 403 generates pseudo-labels (pv1_f1020, pv1_f1, and pv1_f1980) for the rearranged plurality of frame images (frame images xv1_f1020, xv1_f1, and xv1_f1980). As described above, frame differences or time differences are included in the pseudo-labels, and the frame differences in the read plurality of frame images (frame images xv1_f1, xv1_f1020, and xv1_f1980) are calculated according to the differences in the frame IDs between
Also, the time differences in the read plurality of frame images (frame images xv1_f1, xv1_f1020, and xv1_f1980) are calculated according to the differences in the time information between
Consequently, as illustrated in
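The sampling, pseudo-label generation, and rearrangement steps performed by the self-supervised data generation unit 330 can be sketched as follows. This is a minimal sketch, assuming uniform sampling and frame-difference pseudo-labels; the function and variable names (e.g. `make_self_supervised_sample`) are illustrative, not from the specification.

```python
import random

def make_self_supervised_sample(frame_ids, frames, num_samples=3, seed=None):
    """Sample frames uniformly, attach frame-difference pseudo-labels
    (relative to the earliest sampled frame), then randomly rearrange."""
    rng = random.Random(seed)
    # Uniform sampling of frame indices, kept in temporal order.
    idx = sorted(rng.sample(range(len(frames)), num_samples))
    reference = idx[0]  # the earliest sampled frame acts as the reference
    # Pseudo-label = frame difference from the reference frame.
    sample = [(frames[i], frame_ids[i], i - reference) for i in idx]
    rng.shuffle(sample)  # random rearrangement; labels stay attached
    return sample
```

Because the pseudo-label travels with each frame image, the rearrangement does not disturb the correspondence between a frame and its frame difference.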
(2) Preprocessing
Next, details about the preprocessing unit 340 will be described.
Specifically, in the case where sensor data is associated with each of the plurality of frame images, the preprocessing unit 340 performs a normalization process on each of the plurality of frame images on the basis of the sensor data. Note that sensor data refers to data indicating an image capture status when the plurality of frame images were captured (for example, in the case where the image capture apparatus is mounted on a moving object, data such as movement speed data and position data of the moving object).
Additionally, the preprocessing unit 340 performs a cutting process of cutting out an image of a predetermined size from each of the plurality of frame images. For example, the preprocessing unit 340 may be configured to cut out a plurality of images at different cutting positions from a single frame image.
In addition, the preprocessing unit 340 performs a channel separation process of selecting an image of a specific color component from among the images of each color component (R image, G image, B image) included in each of the plurality of frame images, and replacing the value of each pixel with the selected color component. For example, the preprocessing unit 340 may be configured to perform the channel separation process such that an (R, G, B) frame image is converted to (R, R, R), (G, G, G), or (B, B, B).
Note that the above preprocesses are examples, and the preprocessing unit 340 may also execute a preprocess other than the above on each of the plurality of frame images. Moreover, the preprocessing unit 340 may execute all of the above preprocesses or only a portion of the above preprocesses.
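The three preprocesses described above (normalization, cutting, channel separation) can be sketched on plain nested pixel lists as follows; this is an illustrative sketch, assuming 8-bit (R, G, B) tuples, and is not the disclosed implementation.

```python
def normalize(pixel_rows):
    """Normalization process: scale 8-bit values to [0, 1]."""
    return [[tuple(v / 255.0 for v in px) for px in row] for row in pixel_rows]

def crop(pixel_rows, top, left, size):
    """Cutting process: extract a size x size patch at (top, left)."""
    return [row[left:left + size] for row in pixel_rows[top:top + size]]

def channel_separate(pixel_rows, channel):
    """Channel separation process: replace all three channels with one
    selected color component, e.g. (R, G, B) -> (G, G, G)."""
    return [[(px[channel],) * 3 for px in row] for row in pixel_rows]
```

As in the specification, cutting at several positions can produce multiple training images from one frame, and channel separation keeps the three-channel shape while discarding the unselected components.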
The example in
(3) Learning Unit
Next, details about the learning unit 350 will be described.
Also, as illustrated in
Also, in the final layer, the self-supervised estimation unit 352 converts each feature vector (hv1_f1020, hv1_f1, and hv1_f1980) to a one-dimensional scalar value. Accordingly, the self-supervised estimation unit 352 outputs
p̂_v1_f1020, p̂_v1_f1, and p̂_v1_f1980 [Math. 1]
as the frame differences or the time differences.
The model update unit 353 acquires
p_v1_f1020, p_v1_f1, and p_v1_f1980 [Math. 2]
as the pseudo-labels (frame differences or time differences) included in the training data set read by the learning unit 350 from the training data set storage unit 320.
The model update unit 353 also compares the frame differences or time differences output by the self-supervised estimation unit 352, and the pseudo-labels (frame differences or time differences) included in the training data set. Furthermore, the model update unit 353 updates the parameters of the feature extraction unit 351 and the parameters of the self-supervised estimation unit 352 so as to minimize the error in the comparison result.
In the case of the example in
Note that the model update unit 353 stores the updated parameters of the feature extraction unit 351 in a model I parameter storage unit 610 (although the feature extraction unit 351 includes a plurality of CNN units, the parameters are assumed to be shared). The model update unit 353 also stores the updated parameters of the self-supervised estimation unit 352 in a model II parameter storage unit 620.
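The update performed by the model update unit 353 (minimizing the squared loss between the estimated intervals and the pseudo-labels) can be illustrated with a deliberately tiny stand-in: a scalar "model" whose single parameter w maps a feature to an interval. The names and the linear form are assumptions for illustration only.

```python
def train_step(w, feats, labels, lr=0.05):
    """One gradient-descent update of a toy scalar model
    (estimated interval = w * feature) toward the pseudo-label
    intervals, minimizing the mean squared loss."""
    preds = [w * h for h in feats]
    # Analytic gradient dL/dw of the mean squared error.
    grad = sum(2 * (p - t) * h for p, t, h in zip(preds, labels, feats)) / len(feats)
    return w - lr * grad

feats = [1.0, 2.0, 3.0]     # stand-ins for feature vectors from the model I
labels = [0.0, 17.0, 33.0]  # pseudo-label time differences
w = 0.0
for _ in range(200):
    w = train_step(w, feats, labels)
```

In the actual apparatus the same principle applies end to end: the gradient of the loss flows through both the self-supervised estimation unit 352 and the feature extraction unit 351, so both sets of parameters move toward outputs that reproduce the pseudo-labels.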
<Flow of Task Implementation Process>
Next, the overall flow of the task implementation process will be described.
In step S701 of the pre-learning phase, the self-supervised data generation unit 330 of the learning apparatus 100 acquires a plurality of frame images.
In step S702 of the pre-learning phase, the self-supervised data generation unit 330 of the learning apparatus 100 generates pseudo-labels and then randomly rearranges the plurality of frame images.
In step S703 of the pre-learning phase, the preprocessing unit 340 of the learning apparatus 100 executes preprocessing on the randomly rearranged plurality of frame images.
In step S704 of the pre-learning phase, the learning unit 350 of the learning apparatus 100 executes learning using the preprocessed plurality of frame images and the corresponding pseudo-labels, and updates the parameters of the feature extraction unit 351 and the parameters of the self-supervised estimation unit 352.
Next, the flow proceeds to the fine-tuning phase. In step S705 of the fine-tuning phase, the task implementation apparatus 110 applies the parameters of the feature extraction unit 351 and generates the model I (trained).
In step S706 of the fine-tuning phase, the task implementation apparatus 110 acquires a plurality of frame images with associated correct answer labels for the objective task.
In step S707 of the fine-tuning phase, the task implementation apparatus 110 executes preprocessing as in step S703.
In step S708 of the fine-tuning phase, the task implementation apparatus 110 executes the fine-tuning process using the preprocessed plurality of frame images and the correct answer labels, and updates the parameters of the model III used for the objective task.
Next, the flow proceeds to the estimation phase. In step S709 of the estimation phase, the task implementation apparatus 110 applies the parameters of the model III used for the objective task to generate the model III (trained) used for the objective task.
In step S710 of the estimation phase, the task implementation apparatus 110 acquires a plurality of frame images.
In step S711 of the estimation phase, the task implementation apparatus 110 executes preprocessing similarly to step S703.
In step S712 of the estimation phase, the task implementation apparatus 110 executes the estimation process for the objective task by treating the preprocessed plurality of frame images as input.
Next, specific Examples (Example 1 and Example 2) of the task implementation process will be described using
In the first Example, under the above preconditions, feature vectors hv1_f1, . . . , hv1_fn are output by the feature extraction unit 351, and the pseudo-labels
p̂_v1_f1, …, p̂_v1_fn [Math. 3]
are output by the self-supervised estimation unit 352. Furthermore, the parameters of the feature extraction unit 351 and the parameters of the self-supervised estimation unit 352 are updated by the model update unit 353.
On the other hand, as illustrated in
As in the first Example, the video v1 is a video recorded by a dashboard camera, for example. Also, the n frame images xv1_f1 to xv1_fn are frame images obtained after performing a normalization process using sensor data and furthermore performing a cutting process and a channel separation process, for example.
However, in the second Example, the feature extraction unit 351 includes CNN units and FC units that process the frame images, FC units that process the sensor data, and FC units that process the object data. Additionally, in the second Example, the feature extraction unit 351 includes a fusion unit and an FC unit that process the frame images, sensor data, and object data processed by the above units.
In the second Example, under the above preconditions, feature vectors hv1_f1, . . . , hv1_fm are output by the feature extraction unit 351, and the pseudo-labels
p̂_v1_f1, …, p̂_v1_fm [Math. 4]
are output by the self-supervised estimation unit 352. Furthermore, the parameters of the feature extraction unit 351 and the parameters of the self-supervised estimation unit 352 are updated by the model update unit 353.
Additionally, the parameters of the feature extraction unit 351 updated by executing the learning process in the pre-learning phase described using
Furthermore, in the task implementation apparatus 110 illustrated in
By having the task implementation apparatus 110 illustrated in
This is because the feature vectors output from the feature extraction unit 1010 include information indicating the temporal interval between the frame images within the same video, thereby making it easy to grasp
Additionally, the parameters of the feature extraction unit 351 updated by executing the learning process in the pre-learning phase described using
Furthermore, in the task implementation apparatus 110 illustrated in
By having the task implementation apparatus 110 illustrated in
Although Examples 1 and 2 above describe a case of detecting or classifying near-miss incidents by using a video recorded by a dashboard camera, specific examples of the task implementation apparatus are not limited thereto. For example, a task implementation apparatus that recognizes human behavior may also be constructed by using frame images in a video in which people are moving.
In such a case, a configuration similar to Examples 1 and 2 above may also be used to execute a learning process and a fine-tuning process in the pre-learning phase and the fine-tuning phase, and thereby construct a task implementation apparatus that recognizes human behavior from frame images.
This is because it is easy to grasp
<Conclusion>
As is clear from the above description, the learning apparatus 100 according to the first embodiment
With this configuration, according to the learning apparatus 100 according to the first embodiment, a model that estimates the temporal interval between frame images in a video can be generated.
The first embodiment above describes a case of computing pseudo-labels (frame differences or time differences) as the temporal interval on the basis of time-related information pre-associated with each frame image. However, the temporal interval computed on the basis of the time-related information is not limited to frame differences or time differences, and temporal intervals corresponding to the objective task may also be computed as the pseudo-labels.
p^A_v1_f1020, p^A_v1_f1, and p^A_v1_f1980 [Math. 5]
are input into the model update unit 353 as the pseudo-labels.
Also, in the case of
p̂^A_v1_f1020, p̂^A_v1_f1, and p̂^A_v1_f1980 [Math. 6]
as the temporal intervals corresponding to the task with the objective A.
In this way, in the pre-learning phase, the learning unit 350 may perform the learning process using temporal intervals corresponding to the objective task.
In the first embodiment above, the image data acquisition unit 401 is described as sampling a plurality of frame images on the basis of random numbers in a uniform distribution. However, the sampling method used by the image data acquisition unit 401 when sampling a plurality of frame images is not limited to the above.
For example, the image data acquisition unit 401 may also prioritize reading out frame images with a large amount of movement according to optical flow, or reference sensor data associated with the frame images (details to be described later) and prioritize reading out frame images that satisfy a predetermined condition.
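Prioritized reading of frame images with a large amount of movement can be sketched as weighted sampling, with each frame's weight proportional to a motion magnitude (for example, a mean optical-flow norm). This is a sketch under that assumption; the function name and the dedup-until-unique loop are illustrative choices, and it presumes at least `num_samples` frames have nonzero motion.

```python
import random

def prioritized_sample(motion_scores, num_samples, seed=None):
    """Sample frame indices with probability proportional to a
    per-frame motion magnitude (e.g. mean optical-flow norm)."""
    rng = random.Random(seed)
    total = sum(motion_scores)
    weights = [s / total for s in motion_scores]
    # random.choices draws with replacement, so keep drawing
    # until enough unique indices are collected.
    chosen = set()
    while len(chosen) < num_samples:
        chosen.update(rng.choices(range(len(motion_scores)),
                                  weights=weights,
                                  k=num_samples - len(chosen)))
    return sorted(chosen)
```

A predetermined condition on sensor data could be handled the same way, by deriving the weights from the sensor values instead of (or in addition to) the optical-flow magnitudes.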
Also, in the second Example of the first embodiment above, the task implementation apparatus 110 is described as inputting sets of a frame image, sensor data, and object data included in the video v2 into the feature extraction unit. However, it is not necessary to input both the sensor data and the object data, and it is also possible to input only one of the sensor data or the object data.
Note that the present invention is not limited to the configurations described above, such as the combinations with other elements in the configurations cited in the above embodiments. These points may be modified without departing from the gist of the present invention and may be defined appropriately according to the form of application.
100 learning apparatus
110 task implementation apparatus (fine-tuning)
120 task implementation apparatus (estimation)
330 self-supervised data generation unit
340 preprocessing unit
350 learning unit
351 feature extraction unit
352 self-supervised estimation unit
353 model update unit
303 frequency analysis unit
304 data generation unit
320 training data set storage unit
1010 feature extraction unit
1020 model for near-miss incident detection
1110 feature extraction unit
1120 model for near-miss incident
1210 self-supervised estimation unit for task with objective A
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/019009 | 5/12/2020 | WO |