The present invention relates to, for example, a learning device that learns know-how regarding a method of scoring a competitor's performance in a competition, a learning model data generating method and a program corresponding to the learning device, and an estimation device that estimates a score for a competition based on a learning result, together with an estimating method and a program corresponding to the estimation device.
In sports, there are competitions, such as the high jump, figure skating, and gymnastics, in which an official referee scores the performance of each competitor and the competitors are ranked based on the awarded scores. Such competitions have quantitative scoring criteria.
In recent years, techniques for automatically estimating scores in such competitions have been studied as part of activity quality evaluation in the field of computer vision, and a technique called Action Quality Assessment (AQA) is known for this purpose.
For example, Non Patent Literature 1 proposes a method in which video data recording a series of motions performed by a competitor is used as input data, and features are extracted from the video data by deep learning to estimate a score.
The learning unit 101 provides the video data to the DNN, obtains the estimated score y_score as the output value, and calculates the loss L_SR using y_score and the true value score t_score corresponding to the video data. The learning unit 101 calculates a new coefficient to be applied to the DNN by the error back propagation method so as to reduce the calculated loss L_SR. The learning unit 101 updates the coefficient by writing the calculated new coefficient in the learning model data storage unit 102.
By repeating this coefficient update processing, the coefficients gradually converge, and the finally converged coefficients are stored in the learning model data storage unit 102 as learned learning model data. Non Patent Literature 1 uses the loss function L_SR = L1(y_score, t_score) + L2(y_score, t_score), that is, the sum of the L1 distance and the L2 distance between the estimated score and the true value score, to calculate the loss L_SR.
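For reference, a minimal sketch of this loss is shown below. Python with PyTorch is assumed here; Non Patent Literature 1 does not prescribe a particular framework, and the function name is illustrative.

```python
import torch

def loss_sr(y_score: torch.Tensor, t_score: torch.Tensor) -> torch.Tensor:
    """Sum of the L1 and L2 distances between the estimated score y_score
    and the true value score t_score, as used in Non Patent Literature 1."""
    l1_distance = torch.abs(y_score - t_score).sum()
    l2_distance = torch.sqrt(((y_score - t_score) ** 2).sum())
    return l1_distance + l2_distance
```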
An estimation device 200 includes an estimation unit 201 including a DNN having the same configuration as that of the learning unit 101, and a learning model data storage unit 202 that stores, in advance, the learned learning model data stored in the learning model data storage unit 102 of the learning device 100. The learned learning model data stored in the learning model data storage unit 202 is applied to the DNN of the estimation unit 201. The estimation unit 201 provides, as input data to the DNN, video data recording a series of motions performed by an arbitrary competitor, thereby obtaining an estimated score y_score for the competition as an output value of the DNN.
The following experiment was performed on the technique described in Non Patent Literature 1. The video data (hereinafter referred to as "original video data") in which a series of motions performed by a competitor illustrated in
As illustrated in
In the technique described in Non Patent Literature 1, only video data is provided as learning data; characteristics of the competitor's motion, for example joint coordinates, are not explicitly provided. From the above experimental results, it is therefore presumed that the technique of Non Patent Literature 1 extracts features in the video that are unrelated to the motion of the competitor, for example features of the background of the venue, and that the learning model is not generalized to the motion of the competitor. Because features of the background of the venue and the like are extracted, it is also presumed that the accuracy of the technique of Non Patent Literature 1 deteriorates for video data containing an unknown background.
Although there are methods that explicitly give joint information such as human joint coordinates, estimating such information is difficult because joints move in complicated ways, and incorrect joint information adversely affects accuracy. It is therefore desirable to avoid techniques that explicitly give joint information.
In view of the above circumstances, an object of the present invention is to provide a technique capable of generating learning model data generalized to the motion of a competitor from video data recording the motion of the competitor, without explicitly providing joint information, and of improving scoring accuracy in a competition.
An aspect of the present invention is a learning device including a learning unit that generates learning model data in a learning model that receives, as inputs, original video data in which a background and a motion of a competitor are recorded, competitor mask video data in which an area surrounding the competitor is masked in each of a plurality of image frames included in the original video data, and background mask video data in which an area other than the area surrounding the competitor is masked in each of the plurality of image frames included in the original video data, and that outputs a true value competition score, which is an evaluation value for the competition of the competitor, in a case where the original video data is input, an arbitrarily determined true value background score in a case where the competitor mask video data is input, and an arbitrarily determined true value competitor score in a case where the background mask video data is input.
An aspect of the present invention is an estimation device including an input unit that captures video data to be evaluated in which a motion of a competitor is recorded, and an estimation unit that estimates an estimated competition score for the video data to be evaluated based on the video data to be evaluated captured by the input unit and a learned learning model that receives, as inputs, original video data in which a background and a motion of a competitor are recorded, competitor mask video data in which an area surrounding the competitor is masked in each of a plurality of image frames included in the original video data, and background mask video data in which an area other than the area surrounding the competitor is masked in each of the plurality of image frames included in the original video data, and that outputs a true value competition score, which is an evaluation value for the competition of the competitor, in a case where the original video data is input, an arbitrarily determined true value background score in a case where the competitor mask video data is input, and an arbitrarily determined true value competitor score in a case where the background mask video data is input.
An aspect of the present invention is a learning model data generating method including generating learning model data in a learning model that receives, as inputs, original video data in which a background and a motion of a competitor are recorded, competitor mask video data in which an area surrounding the competitor is masked in each of a plurality of image frames included in the original video data, and background mask video data in which an area other than the area surrounding the competitor is masked in each of the plurality of image frames included in the original video data, and that outputs a true value competition score, which is an evaluation value for the competition of the competitor, in a case where the original video data is input, an arbitrarily determined true value background score in a case where the competitor mask video data is input, and an arbitrarily determined true value competitor score in a case where the background mask video data is input.
An aspect of the present invention is an estimating method including capturing video data to be evaluated in which a motion of a competitor is recorded, and estimating an estimated competition score for the video data to be evaluated based on the captured video data to be evaluated and a learned learning model that receives, as inputs, original video data in which a background and a motion of a competitor are recorded, competitor mask video data in which an area surrounding the competitor is masked in each of a plurality of image frames included in the original video data, and background mask video data in which an area other than the area surrounding the competitor is masked in each of the plurality of image frames included in the original video data, and that outputs a true value competition score, which is an evaluation value for the competition of the competitor, in a case where the original video data is input, an arbitrarily determined true value background score in a case where the competitor mask video data is input, and an arbitrarily determined true value competitor score in a case where the background mask video data is input.
An aspect of the present invention is a program for causing a computer to operate as the learning device or the estimation device.
According to the present invention, it is possible to generate learning model data generalized to the motion of a competitor from video data recording the motion of the competitor without explicitly providing joint information, and improve scoring accuracy in a competition.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
The input unit 11 captures original video data in which a series of motions to be scored, among the motions performed by the competitor, is recorded together with the background. For example, in a case where the competitor is a high diver, the original video data records, together with the background, the motions from when the competitor stands on the diving platform and jumps, through motions such as twisting and rotating, until the competitor completes entry into the water. The image frames illustrated in
The input unit 11 captures a true value competition score, which is an evaluation value for the motion of the competitor recorded in the original video data. The true value competition score is, for example, the score obtained when a referee scores the motion of the competitor recorded in the original video data, based on the quantitative scoring criterion actually employed in the competition, at the time the original video data is recorded. The input unit 11 sets a training data set of the original video data by associating the captured original video data with the true value competition score corresponding to the original video data.
The input unit 11 captures the competitor mask video data corresponding to the original video data. Here, the competitor mask video data is video data in which a rectangular area surrounding an area of the competitor is masked in each of a plurality of image frames included in the original video data. The image frames illustrated in
The input unit 11 captures a true value background score corresponding to the competitor mask video data. The true value background score is an evaluation value for the competitor mask video data. The competitor mask video data is video data in which the competitor cannot be seen at all. Therefore, in consideration of the fact that a referee could not score such video, a score that is not awarded in the competition, for example the lowest score in the competition, is determined as the true value background score. For example, in a case where the score given when no evaluation is performed in the competition is "0", the value "0" is determined in advance as the true value background score. The input unit 11 sets a training data set of the competitor mask video data by associating the captured competitor mask video data with the true value background score corresponding to the competitor mask video data.
The input unit 11 captures the background mask video data corresponding to the original video data. Here, the background mask video data is video data in which an area other than a rectangular area surrounding the area of the competitor is masked in each of the plurality of image frames included in the original video data. The image frames illustrated in
The input unit 11 captures a true value competitor score corresponding to the background mask video data. The true value competitor score is an evaluation value for the background mask video data. The background mask video data is video data in which the competitor is visible. Therefore, for example, the true value competition score of the original video data corresponding to the background mask video data is determined in advance as the true value competitor score corresponding to the background mask video data. The input unit 11 sets a training data set of the background mask video data by associating the captured background mask video data with the true value competitor score captured in correspondence with the background mask video data.
In a case where training data sets of a plurality of pieces of original video data are captured, the input unit 11 captures a training data set of the competitor mask video data and a training data set of the background mask video data corresponding to each of the plurality of pieces of original video data.
The ranges of the rectangular areas 41, 42, and 43 illustrated in
In a case where the technology described in the above reference literature is adopted, the input unit 11 may, for example, capture the original video data, detect the range of the rectangular area from the captured original video data, and generate the competitor mask video data and the background mask video data from the original video data based on the detected range of the rectangular area. In this case, it is assumed that, for example, the value "0" described above is defined to be applied as the true value background score and the true value competition score is defined to be applied as the true value competitor score. The input unit 11 can then generate the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data by capturing only the original video data and the true value competition score.
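As one possible sketch of this generation processing (NumPy is assumed; the detection of the rectangular area itself is left to the technology of the reference literature, and the function and argument names are illustrative), the two kinds of mask video data can be produced frame by frame as follows, filling masked areas with the average color of each image frame as described later for the masking color:

```python
import numpy as np

def generate_mask_videos(frames, boxes):
    """Generate competitor mask video data (rectangle surrounding the
    competitor hidden) and background mask video data (everything except
    that rectangle hidden) from the original video frames.

    frames: list of H x W x 3 uint8 arrays (image frames of the original
    video data); boxes: list of (x0, y0, x1, y1) rectangular areas, one
    per frame, detected in advance."""
    competitor_mask, background_mask = [], []
    for frame, (x0, y0, x1, y1) in zip(frames, boxes):
        # Fill masked areas with the average color of this image frame.
        avg_color = frame.reshape(-1, 3).mean(axis=0).astype(np.uint8)
        cm = frame.copy()
        cm[y0:y1, x0:x1] = avg_color             # hide the competitor
        bm = np.empty_like(frame)
        bm[:, :] = avg_color                     # hide the background
        bm[y0:y1, x0:x1] = frame[y0:y1, x0:x1]   # keep the competitor
        competitor_mask.append(cm)
        background_mask.append(bm)
    return competitor_mask, background_mask
```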
Note that each of the true value competition score, the true value background score, and the true value competitor score is not limited to the evaluation value as described above, and may be arbitrarily determined. For example, the score of the scoring result obtained by scoring the competition of the competitor recorded in the original video data by a criterion other than the quantitative scoring criterion adopted in the competition may be set as the true value competition score. As the true value competitor score, a value other than the true value competition score may be adopted. The true value background score and the true value competitor score may be changed in the middle of processing.
The learning unit 12 includes a learning processing unit 13 and a function approximator 14. For example, a DNN is applied as the function approximator 14. Note that the DNN may have any network structure. The function approximator 14 is provided with the coefficient stored in the learning model data storage unit 15 by the learning processing unit 13. Here, in a case where the function approximator 14 is a DNN, the coefficient is a weight or a bias applied to each of the plurality of neurons included in the DNN.
By providing the original video data included in the training data set of the original video data to the function approximator 14, the learning processing unit 13 performs learning processing of updating the coefficient so that the estimated competition score obtained as the output value of the function approximator 14 approaches the true value competition score corresponding to the original video data provided to the function approximator 14. By providing the competitor mask video data included in the training data set of the competitor mask video data to the function approximator 14, the learning processing unit 13 performs learning processing of updating the coefficient so that the estimated background score obtained as the output value of the function approximator 14 approaches the true value background score corresponding to the competitor mask video data provided to the function approximator 14. By providing the background mask video data included in the training data set of the background mask video data to the function approximator 14, the learning processing unit 13 performs learning processing of updating the coefficient so that the estimated competitor score obtained as the output value of the function approximator 14 approaches the true value competitor score corresponding to the background mask video data provided to the function approximator 14.
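The three kinds of learning processing above differ only in which true value the output of the function approximator is pulled toward. A minimal sketch is shown below; the function name and the kind labels are illustrative, and it is assumed, as in the running example, that the lowest score "0" is used as the true value background score and that the true value competition score doubles as the true value competitor score:

```python
import torch

def true_value_for(kind: str, t_competition: torch.Tensor) -> torch.Tensor:
    """Select the training target the output of the function approximator
    should approach, depending on which kind of video data was provided."""
    if kind == "original":          # -> true value competition score
        return t_competition
    if kind == "competitor_mask":   # -> true value background score ("0")
        return torch.zeros_like(t_competition)
    if kind == "background_mask":   # -> true value competitor score
        return t_competition
    raise ValueError(f"unknown video kind: {kind}")
```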
The learning model data storage unit 15 stores coefficients to be applied to the function approximator 14, that is, learning model data. The learning model data storage unit 15 stores the initial value of the coefficient in advance in the initial state. The coefficient stored in the learning model data storage unit 15 is rewritten to a new coefficient by the learning processing unit 13 every time the learning processing unit 13 calculates a new coefficient by learning processing.
That is, by the learning processing performed by the learning processing unit 13, the learning unit 12 generates the learning model data in the learning model in which the original video data, the competitor mask video data, and the background mask video data are input, the true value competition score is output in a case where the original video data is input, the true value background score is output in a case where the competitor mask video data is input, and the true value competitor score is output in a case where the background mask video data is input. Here, the learning model is the function approximator 14 to which the coefficient stored in the learning model data storage unit 15, that is, the learning model data, is applied.
Next, processing by the learning device 1 will be described with reference to
For example, it is assumed that the following learning rule is determined in advance in the learning processing unit 13. That is, the number of training data sets of each of the original video data, the competitor mask video data, and the background mask video data is N; the mini-batch size is M; and using all of the training data sets of the original video data, the competitor mask video data, and the background mask video data constitutes the processing for one epoch. The learning rule also predetermines that processing is performed in the order of the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data. Here, N and M are integers of 1 or more and may be any values as long as M < N. Hereinafter, as an example, a case where N is "300" and M is "10" will be described.
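Expressed as configuration values (a sketch; the names are illustrative), this learning rule amounts to:

```python
# Learning rule used in the running example.
N = 300            # training data sets per video type (N >= 1)
M = 10             # mini-batch size (1 <= M < N)
EPOCH_ORDER = ["original", "competitor_mask", "background_mask"]
BATCHES_PER_TYPE = N // M   # 30 mini-batches of each type per epoch
```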
The input unit 11 of the learning device 1 captures the 300 pieces of original video data and the true value competition scores respectively corresponding to the 300 pieces of original video data, associates the 300 pieces of captured original video data with the true value competition scores respectively corresponding to the captured original video data pieces, and generates training data sets of the 300 pieces of original video data.
The input unit 11 captures the 300 pieces of competitor mask video data corresponding to each of the 300 pieces of original video data and the true value background score corresponding to each of the competitor mask video data pieces, and generates a training data set of the 300 pieces of competitor mask video data by associating the 300 pieces of captured competitor mask video data with the true value background score corresponding to each of the captured competitor mask video data pieces.
The input unit 11 captures the 300 pieces of background mask video data corresponding to each of the 300 pieces of original video data and the true value competitor score corresponding to each of the background mask video data pieces, and generates a training data set of the 300 pieces of background mask video data by associating the 300 pieces of the captured background mask video data with the true value competitor score corresponding to each of the captured background mask video data pieces.
The input unit 11 outputs the training data sets of the 300 pieces of original video data, the training data sets of the 300 pieces of competitor mask video data, and the training data sets of the 300 pieces of background mask video data to the learning processing unit 13. The learning processing unit 13 captures the training data sets of the 300 pieces of original video data, the training data sets of the 300 pieces of competitor mask video data, and the training data sets of the 300 pieces of background mask video data output from the input unit 11. The learning processing unit 13 writes and stores the captured training data sets of the 300 pieces of original video data, the captured training data set of the 300 pieces of competitor mask video data, and the captured training data set of the 300 pieces of background mask video data in the internal storage area.
The learning processing unit 13 provides, in an internal storage area, an area for storing the number of epochs, and initializes the number of epochs to "0". The learning processing unit 13 also provides, in the internal storage area, an area for storing the parameters of the mini-batch learning, that is, the numbers of times of processing indicating how many times each of the original video data, the competitor mask video data, and the background mask video data has been provided to the function approximator 14, and initializes the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data to "0" (step Sa1).
The learning processing unit 13 selects a training data set according to the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data stored in the internal storage area and the predetermined learning rule (step Sa2). Here, the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data is "0", and none of the 300 pieces of original video data, the 300 pieces of competitor mask video data, or the 300 pieces of background mask video data has been used for processing. As described above, the learning rule predetermines that processing is performed in the order of the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data. Therefore, the learning processing unit 13 first selects the training data set of the original video data (Step Sa2, Original video data).
The learning processing unit 13 reads out the coefficient stored in the learning model data storage unit 15 and applies the read coefficient to the function approximator 14 (step Sa3-1). For the training data set of the original video data selected in the processing of step Sa2, the learning processing unit 13 reads out, from the internal storage area, training data sets of the original video data equal in number to the mini-batch size M defined in the learning rule, in order from the head.
Here, since the mini-batch size M is “10”, the learning processing unit 13 reads out the training data set of the 10 pieces of original video data from the internal storage area. The learning processing unit 13 selects one piece of original video data from the read training data set of the 10 pieces of original video data and provides the selected original video data to the function approximator 14. The learning processing unit 13 captures the estimated competition score output by the function approximator 14 by providing the original video data. The learning processing unit 13 writes and stores the captured estimated competition score and the true value competition score corresponding to the original video data provided to the function approximator 14 in an internal storage area in association with each other. Every time the original video data is provided to the function approximator 14, the learning processing unit 13 adds 1 to the number of times of processing of the original video data stored in the internal storage area (Step Sa4-1).
The learning processing unit 13 repeatedly performs the processing of step Sa4-1 on each of the 10 pieces of original video data included in the training data set of the 10 pieces of original video data (loops L1s to L1e), and generates 10 combinations of the estimated competition score and the true value competition score in an internal storage area.
The learning processing unit 13 calculates a loss based on a predetermined loss function based on the 10 combinations of the estimated competition score and the true value competition score stored in an internal storage area. Based on the calculated loss, the learning processing unit 13 calculates a new coefficient to be applied to the function approximator 14 by, for example, the error back propagation method. The learning processing unit 13 rewrites and updates the coefficient stored in the learning model data storage unit 15 with the calculated new coefficient (step Sa5-1).
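A sketch of one such mini-batch update (steps Sa3-1 to Sa5-1) might look as follows, assuming `model` is a function approximator mapping a video tensor to a scalar score, `loss_fn` is a loss such as the L1 + L2 loss sketched earlier, and a PyTorch optimizer stands in for the explicit error back propagation and coefficient rewriting:

```python
import torch

def minibatch_update(model, optimizer, videos, targets, loss_fn):
    """Provide M videos to the function approximator one by one (loop L1s
    to L1e), collect the M estimated scores, compute the loss against the
    M true values, and update the coefficients (steps Sa4-1 to Sa5-1)."""
    estimates = [model(video.unsqueeze(0)).squeeze() for video in videos]
    loss = loss_fn(torch.stack(estimates), torch.stack(targets))
    optimizer.zero_grad()
    loss.backward()      # error back propagation
    optimizer.step()     # rewrite the stored coefficients
    return loss.item()
```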
The learning processing unit 13 refers to the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data stored in the internal storage area, and determines whether the processing for one epoch has ended (step Sa6). As described above, the learning rule defines that using all of the training data sets of the original video data, the competitor mask video data, and the background mask video data constitutes the processing of one epoch. Therefore, the processing for one epoch is complete when the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data is "300" or more. Here, the number of times of processing of the original video data is "10", and the number of times of processing of each of the competitor mask video data and the background mask video data is "0". Therefore, the learning processing unit 13 determines that the processing for one epoch has not ended (Step Sa6, No), and advances the processing to step Sa2.
In a case where the number of times of processing of the original video data is not “300” or more in the processing of step Sa2 to be performed again, the learning processing unit 13 selects the training data set of the original video data again in the processing of step Sa2 (Step Sa2, Original video data), and performs the processing of step Sa3-1 and subsequent steps.
On the other hand, in a case where the number of times of processing of the original video data is “300” or more in the processing of step Sa2 performed again, the learning processing unit 13 then selects the training data set of the competitor mask video data according to the learning rule (Step Sa2, Competitor mask video data).
The learning processing unit 13 reads out the coefficient stored in the learning model data storage unit 15 and applies the read coefficient to the function approximator 14 (step Sa3-2).
For the training data set of the competitor mask video data selected in the processing of step Sa2, the learning processing unit 13 reads out the training data sets of the 10 pieces of competitor mask video data from the internal storage area in order from the head. The learning processing unit 13 selects one piece of competitor mask video data from the read training data sets of the 10 pieces of competitor mask video data and provides the selected competitor mask video data to the function approximator 14. The learning processing unit 13 captures the estimated background score output by the function approximator 14 in response to the provided competitor mask video data. The learning processing unit 13 writes and stores the captured estimated background score and the true value background score corresponding to the competitor mask video data provided to the function approximator 14 in the internal storage area in association with each other. Every time the competitor mask video data is provided to the function approximator 14, the learning processing unit 13 adds 1 to the number of times of processing of the competitor mask video data stored in the internal storage area (Step Sa4-2).
The learning processing unit 13 repeatedly performs the processing of step Sa4-2 on each of the 10 pieces of competitor mask video data included in the training data set of the 10 pieces of competitor mask video data (loops L2s to L2e), and generates 10 combinations of the estimated background score and the true value background score in an internal storage area.
The learning processing unit 13 calculates a loss based on a predetermined loss function using the 10 combinations of the estimated background score and the true value background score stored in the internal storage area. Based on the calculated loss, the learning processing unit 13 calculates a new coefficient to be applied to the function approximator 14 by, for example, the error back propagation method. The learning processing unit 13 rewrites and updates the coefficient stored in the learning model data storage unit 15 with the calculated new coefficient (step Sa5-2).
The learning processing unit 13 determines whether the processing for one epoch has been completed (step Sa6). In a case where the number of times of processing of the competitor mask video data is not “300” or more, the learning processing unit 13 determines that the processing for one epoch has not been ended (Step Sa6, No), and advances the processing to step Sa2.
In a case where the number of times of processing of the competitor mask video data is not “300” or more in the processing of step Sa2 to be performed again, the learning processing unit 13 again selects the training data set of the competitor mask video data (Step Sa2, Competitor mask video data). Thereafter, the learning processing unit 13 performs the processing of step Sa3-2 and subsequent steps.
On the other hand, in a case where the number of times of processing of the competitor mask video data is “300” or more in the processing of step Sa2 performed again, the learning processing unit 13 then selects the training data set of the background mask video data according to the learning rule (Step Sa2, Background mask video data).
The learning processing unit 13 reads out the coefficient stored in the learning model data storage unit 15. The learning processing unit 13 applies the read coefficient to the function approximator 14 (step Sa3-3).
The learning processing unit 13 reads out the training data sets of the 10 pieces of background mask video data sequentially from the top from the internal storage area for the training data set of the background mask video data selected in the processing of step Sa2. The learning processing unit 13 selects one piece of background mask video data from the read training data set of the 10 pieces of background mask video data and provides the selected background mask video data to the function approximator 14. The learning processing unit 13 captures the estimated competitor score output by the function approximator 14 by providing the background mask video data. The learning processing unit 13 writes and stores the captured estimated competitor score and the true value competitor score corresponding to the background mask video data provided to the function approximator 14 in the internal storage area in association with each other. Every time the background mask video data is provided to the function approximator 14, the learning processing unit 13 adds 1 to the number of times of processing of the background mask video data stored in the internal storage area (Step Sa4-3).
The learning processing unit 13 repeatedly performs the processing of step Sa4-3 on each of the 10 pieces of background mask video data included in the training data set of the 10 pieces of background mask video data (loops L3s to L3e), and generates 10 combinations of the estimated competitor score and the true value competitor score in an internal storage area.
The learning processing unit 13 calculates a loss based on a predetermined loss function based on the 10 combinations of the estimated competitor score and the true value competitor score stored in an internal storage area. Based on the calculated loss, the learning processing unit 13 calculates a new coefficient to be applied to the function approximator 14 by, for example, the error back propagation method. The learning processing unit 13 rewrites and updates the coefficient stored in the learning model data storage unit 15 with the calculated new coefficient (step Sa5-3).
The learning processing unit 13 determines whether the processing for one epoch has been completed (step Sa6). In a case where the number of times of processing of the background mask video data is not “300” or more, the learning processing unit 13 determines that the processing for one epoch has not been ended (Step Sa6, No). In this case, the learning processing unit 13 advances the processing to step Sa2.
In a case where the number of times of processing of the background mask video data is not “300” or more in the processing of step Sa2 to be performed again, the learning processing unit 13 selects the training data set of the background mask video data again in the processing of step Sa2 (Step Sa2, Background mask video data). Thereafter, the learning processing unit 13 performs the processing of step Sa3-3 and subsequent steps.
On the other hand, in the processing of step Sa6, in a case where the processing for one epoch has been completed, that is, the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data is “300” or more, the learning processing unit 13 determines that the processing for one epoch has been completed (Step Sa6, Yes). The learning processing unit 13 adds 1 to the number of epochs stored in the internal storage area. The learning processing unit 13 initializes the parameter of the mini-batch learning stored in the internal storage area to “0” (step Sa7). That is, the learning processing unit 13 initializes the number of times of processing of each of the original video data, the competitor mask video data, and the background mask video data to “0”.
The learning processing unit 13 determines whether or not the number of epochs stored in the internal storage area satisfies the end condition (step Sa8). For example, in a case where the number of epochs reaches a predetermined upper limit value, the learning processing unit 13 determines that the end condition is satisfied. On the other hand, for example, in a case where the number of epochs has not reached a predetermined upper limit value, the learning processing unit 13 determines that the end condition is not satisfied.
In a case where it is determined that the number of epochs satisfies the end condition (Step Sa8, Yes), the learning processing unit 13 ends the processing. On the other hand, in a case where the learning processing unit 13 determines that the number of epochs does not satisfy the end condition (Step Sa8, No), the processing proceeds to step Sa2. In the processing of step Sa2 performed again after the processing of step Sa8, the learning processing unit 13 again selects the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data in this order according to the learning rule. Thereafter, the learning processing unit 13 performs the processing of step Sa3-1 and subsequent steps, the processing of step Sa3-2 and subsequent steps, and the processing of step Sa3-3 and subsequent steps on the respective selected training data sets.
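Putting the pieces together, the repetitive processing of steps Sa2 to Sa8 can be sketched as follows, reusing `EPOCH_ORDER`, `M`, and `minibatch_update` from the earlier sketches; `datasets` is assumed to map each video type to its list of (video, true value) training pairs:

```python
def train(model, optimizer, datasets, loss_fn, max_epochs=100):
    """Skeleton of steps Sa2 to Sa8: for each epoch, consume all three
    training data sets in the predetermined order, one mini-batch of M
    pairs at a time, then check the end condition on the epoch count."""
    for epoch in range(max_epochs):                 # step Sa8
        for kind in EPOCH_ORDER:                    # step Sa2
            pairs = datasets[kind]
            for start in range(0, len(pairs), M):   # steps Sa3 to Sa5
                batch = pairs[start:start + M]
                videos = [video for video, _ in batch]
                targets = [target for _, target in batch]
                minibatch_update(model, optimizer, videos, targets, loss_fn)
        # All three data sets consumed: one epoch completed (steps Sa6, Sa7).
```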
As a result, when the learning processing unit 13 ends the processing, the learned coefficient, that is, the learned learning model data, is generated in the learning model data storage unit 15. Note that the learning processing performed by the learning processing unit 13 is processing of updating the coefficient to be applied to the function approximator 14 by the repetitive processing illustrated in steps Sa2 to Sa8 in
Note that, in the processing of
In the processing of
As a learning rule, for example, the upper limit value of the number of epochs is predetermined to be "100", and in order to stabilize the learning processing, that is, to moderate the convergence of the coefficients, the learning processing unit 13 selects, in the processing of step Sa2, the training data set of the original video data and the training data set of the competitor mask video data in this order, and does not select the training data set of the background mask video data, until the number of epochs reaches "50". After the number of epochs reaches "50", for the next 50 epochs, the learning rule may cause the learning processing unit 13 to select, in the processing of step Sa2, the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data in this order. As a result, in the processing of
Note that the number of epochs "50" is an example, and another value may be determined. Instead of defining only one epoch count at which the combination of selected training data sets changes, a plurality of such epoch counts may be determined, and the learning processing unit 13 may use a learning rule that changes the selected training data sets each time the number of epochs reaches one of the determined counts. In this case, the combination of training data sets selected by the learning processing unit 13 in the processing of step Sa2 is not limited to the above example and may be any combination. The learning rule may also randomly change the training data set selected in the processing of step Sa2 as the number of epochs increases.
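Such a staged learning rule might be sketched as a function of the epoch count, which could replace the fixed `EPOCH_ORDER` in the training skeleton above (the threshold of "50" and the function name are illustrative):

```python
def kinds_for_epoch(epoch: int, warmup_epochs: int = 50) -> list[str]:
    """Omit the background mask video data for the first warmup_epochs
    epochs to moderate the convergence of the coefficients, then use all
    three training data sets."""
    if epoch < warmup_epochs:
        return ["original", "competitor_mask"]
    return ["original", "competitor_mask", "background_mask"]
```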
For example, in a case where the true value background score is set to "0", it is known from simulation results that, even after the learning processing has progressed to a certain extent, the estimated background score output by the function approximator 14 when the competitor mask video data is provided does not become exactly "0" but instead takes values such as "1" or "2". This is considered to be because a referee may, in effect, slightly score the background. Similarly, in a case where the true value competition score is used as the true value competitor score, it is known that, even after the learning processing has progressed to a certain extent, the function approximator 14 does not output a value that exactly matches the true value competition score when the background mask video data is provided.
On the assumption that the background thus affects the referee's scoring, the following learning rule may be set: when the number of epochs reaches a predetermined number less than the predetermined upper limit value, the learning processing unit 13 replaces all the true value background scores included in the training data set of the competitor mask video data with the estimated background scores output by the function approximator 14 when the corresponding competitor mask video data is provided at that time, and replaces all the true value competitor scores included in the training data set of the background mask video data with the estimated competitor scores output by the function approximator 14 when the corresponding background mask video data is provided at that time.
In a case where this learning rule is applied, the learning processing unit 13 performs the processing of
In the above description, the true value background score and the true value competitor score are replaced when the number of epochs reaches a predetermined number. However, the true value background score and the true value competitor score may be replaced at any timing during the learning processing other than the timing at which the number of epochs reaches the predetermined number. For example, the replacement may be performed at a timing at which the learning processing unit 13 detects that the difference between the estimated background score output by the function approximator 14 and the immediately preceding estimated background score has remained equal to or less than a certain value for a predetermined number of consecutive times, and that the difference between the estimated competitor score output by the function approximator 14 and the immediately preceding estimated competitor score has remained equal to or less than a certain value for a predetermined number of consecutive times.
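Whichever timing triggers it, the replacement itself could be sketched as follows (`pairs` is assumed to be a list of (video, true value) training pairs, as in the earlier training skeleton):

```python
import torch

@torch.no_grad()
def replace_targets_with_estimates(model, pairs):
    """Replace every true value score in a training data set with the score
    the function approximator currently estimates for the same video."""
    return [(video, model(video.unsqueeze(0)).squeeze()) for video, _ in pairs]
```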
In the processing of
In the processing of
In the processing of
On the other hand, for example, the learning processing unit 13 advances the processing to step Sa6 without performing step Sa5-1 after the processing of the loops L1s to L1e ends in the processing of
For example, the learning processing unit 13 advances the processing to step Sa6 without performing step Sa5-1 after the processing of the loops L1s to L1e ends in the processing of
For example, the learning processing unit 13 advances the processing to step Sa6 without performing step Sa5-2 after the processing of the loops L2s to L2e ends in the processing of
In the processing of
In this manner, the selection order of the training data set of the original video data, the training data set of the competitor mask video data, and the training data set of the background mask video data in the processing of step Sa2 may be arbitrarily determined. The learning processing unit 13 may calculate a loss from an arbitrarily selected mixture of the combinations of the estimated competition score and the true value competition score, the combinations of the estimated background score and the true value background score, and the combinations of the estimated competitor score and the true value competitor score, and may calculate a new coefficient based on the calculated loss.
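One such variant, in which the intermediate coefficient updates are skipped and a single new coefficient is calculated from all the accumulated score combinations (10 per video type, 30 in total in the running example), might be sketched as follows; the names are illustrative, and `batches` is assumed to hold one (videos, targets) mini-batch per video type:

```python
import torch

def combined_update(model, optimizer, batches, loss_fn):
    """Accumulate the estimated/true value score combinations from all
    three mini-batches, then perform a single loss calculation and
    coefficient update covering all 30 combinations."""
    estimates, targets = [], []
    for videos, batch_targets in batches:
        for video, target in zip(videos, batch_targets):
            estimates.append(model(video.unsqueeze(0)).squeeze())
            targets.append(target)
    loss = loss_fn(torch.stack(estimates), torch.stack(targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```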
In the processing of
A learning rule in which the other learning rules described above and the learning rule (part 1), the learning rule (part 2), and the learning rule (part 3) are arbitrarily combined may be determined in advance.
The estimation unit 22 internally includes a function approximator having the same configuration as the function approximator 14 included in the learning processing unit 13. The estimation unit 22 calculates an estimated score corresponding to the video data based on the evaluation target video data captured by the input unit 21 and the function approximator to which the learned coefficient stored in the learning model data storage unit 23 is applied, that is, the learned learning model.
The estimation unit 22 provides the captured evaluation target video data to the function approximator (step Sb3). The estimation unit 22 outputs the output value of the function approximator as the estimated score for the evaluation target video data (step Sb4).
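A sketch of this estimation processing (steps Sb3 and Sb4) is shown below, assuming the learned coefficients were saved with PyTorch's standard state-dict mechanism and that `model` has the same configuration as the function approximator 14:

```python
import torch

@torch.no_grad()
def estimate_score(model, weights_path: str, video: torch.Tensor) -> float:
    """Apply the learned learning model data to the function approximator,
    provide the evaluation target video data, and return the output value
    as the estimated score."""
    model.load_state_dict(torch.load(weights_path))  # learned coefficients
    model.eval()
    return model(video.unsqueeze(0)).squeeze().item()
```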
The learning device 1 of the present embodiment described above generates the learning model data in the learning model in which the original video data, the competitor mask video data, and the background mask video data are input, the true value competition score is output in a case where the original video data is input, the true value background score is output in a case where the competitor mask video data is input, and the true value competitor score is output in a case where the background mask video data is input. By performing the learning processing using the original video data, the competitor mask video data, and the background mask video data, the learning device 1 is encouraged to extract features related to the motion of the competitor from the video data. As a result, the learning device 1 can generate learning model data generalized to the motion of the competitor from the video data recording the motion of the competitor without explicitly giving joint information. Estimation processing performed by the estimation device 2 using the learned learning model, which is obtained by applying the learned learning model data generated by the learning device 1 to the function approximator, can thus improve the scoring accuracy in the competition.
Note that the above embodiment illustrates an example in which one competitor is included in the original video data; however, the competition recorded in the original video data may be a competition performed by a plurality of competitors, and the rectangular area in that case is an area surrounding the plurality of competitors.
In the above embodiment, the shape surrounding the area of the competitor is rectangular, but the shape is not limited to the rectangular shape, and may be a shape other than the rectangular shape.
In the above embodiment, in the competitor mask video data and the background mask video data, the masking color is the average color of the image frame in which the masking is performed. Alternatively, the average color of all the image frames included in the original video data corresponding to each of the competitor mask video data and the background mask video data may be selected as the masking color, or an arbitrarily determined color may be used as the masking color for each piece of video data. Note that, since the masking color should be inconspicuous, an inconspicuous color needs to be selected according to the overall hue of each image frame; in that respect, selecting for each image frame the average color, which harmonizes with the background and yields an inconspicuous hue, is considered most effective as the masking color.
The function approximator 14 included in the learning unit 12 of the learning device 1 and the function approximator included in the estimation unit 22 of the estimation device 2 according to the above-described embodiment are, for example, DNNs. However, a neural network other than a DNN, other means based on machine learning, or any means for calculating the coefficient of the function approximated by the function approximator may be applied instead.
The learning device 1 and the estimation device 2 may be integrated. In such a configuration, a device in which the learning device 1 and the estimation device 2 are integrated has a learning mode and an estimation mode. The learning mode is a mode in which learning processing is performed by the learning device 1 to generate learning model data. That is, in the learning mode, the device in which the learning device 1 and the estimation device 2 are integrated executes the processing illustrated in
The learning device 1 and the estimation device 2 according to the above-described embodiment may be implemented by a computer. In that case, a program for implementing these functions may be recorded in a computer-readable recording medium, and the program recorded in the recording medium may be read and executed by a computer system to implement the functions. The “computer system” mentioned herein includes an OS and hardware such as a peripheral device. The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disc, a ROM, or a CD-ROM, or a storage device such as a hard disk included in the computer system. The “computer-readable recording medium” may include a medium that dynamically stores the program for a short time, such as a communication line in a case where the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that stores the program for a certain period of time, such as a volatile memory inside the computer system serving as a server or a client in that case. Also, the foregoing program may be for implementing some of the functions described above, may be implemented in a combination of the functions described above and a program already recorded in a computer system, or may be implemented with a programmable logic device such as a field programmable gate array (FPGA).
Although the embodiment of the present invention has been described in detail with reference to the drawings, specific configurations are not limited to the embodiment and include design changes and the like within a scope not departing from the gist of the present invention.
The present invention can be used for scoring competitions in sports.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2021/018964 | 5/19/2021 | WO |