The present disclosure relates to an action segment estimation model building device, an action segment estimation model building method, and a non-transitory recording medium storing an action segment estimation model building program.
Recognition of postures from a video of a person imaged with a normal RGB camera has become possible due to progress in deep learning technology, and various research and development is being performed into estimating actions of a person utilizing such recognition information. Under such circumstances, effort is being put into estimating time segments where a specified action occurred from time series data of postures detected from videos of people.
In one exemplary embodiment, in a hidden semi-Markov model, observation probabilities for each type of movement of plural first hidden Markov models are learned using unsupervised learning. The hidden semi-Markov model includes plural second hidden Markov models each containing plural of the first hidden Markov models using types of movement of a person as states and with the plural second hidden Markov models each using actions determined by combining plural of the movements as states. The learnt observation probabilities are fixed, input first supervised data is augmented so as to give second supervised data, and transition probabilities of the movements of the first hidden Markov models are learned by supervised learning in which the second supervised data is employed. The learnt observation probabilities and the learnt transition probabilities are used to build the hidden semi-Markov model that is a model for estimating segments of the actions. Augmentation is performed on the first supervised data by adding teacher information of the first supervised data to each item of data generated by performing at least one out of oversampling in the time direction or oversampling in feature space.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In the present exemplary embodiment, a hidden semi-Markov model (hereafter referred to as HSMM) such as that illustrated in
The HSMM of the present exemplary embodiment includes plural first HMMs employing each movement of a person as states, and a second HMM employing actions each with a determined combination of plural movements as states. m1, m2, m3 are examples of movements, and a1, a2, a3 are examples of actions. An action is a combination of plural movements, and a movement is a combination of plural postures.
When time series sensor data generated by detecting postures of a person is given to an HSMM built by setting parameters, the HSMM estimates optimal action time segments (hereafter referred to as action segments). d1, d2, d3 are examples of action segments.
Observation probabilities and transition probabilities are present in the parameters of an HMM. O1, . . . , O8 are examples of observation probabilities, and transition probabilities are the probabilities corresponding to arrows linking states. The observation probabilities are probabilities that a given feature is observed in each state, and the transition probabilities are the probabilities of transitioning from a given state to another state. Transition probabilities are not needed for cases in which an order of transition is determined. Note that the number of movements and the number of actions, namely the number of the first HMMs and the number of second HMMs, are merely examples thereof, and are not limited to the numbers of the example illustrated in
A target of the present exemplary embodiment is an action limited to achieving a given task goal. Such an action is, for example, an action in a standard task performed on a production line of a factory, and has the following properties.
Property 1: a difference between each action configuring a task is a difference in a combination of limited plural movements.
Property 2: plural postures observed when the same task is performed are similar to each other.
In the present exemplary embodiment, based on property 1, all actions are configured by movements contained in a single movement set. As illustrated in the example in
For example, the movement m11 may be “raise arm”, the movement m12 may be “lower arm”, and the movement m13 may be “extend arm forward”. The number of movements contained in the movement set is not limited to the example illustrated in
In the HMM of
More specifically, a model employed for unsupervised learning of observation probabilities may be a Gaussian mixture model (GMM). For each observation, a single movement is selected probabilistically from out of the movements, and a Gaussian distribution is generated for this movement. This is a different assumption to supervised learning not using a time series dependency relationship of observation. The parameters of each Gaussian distribution of the trained GMM are assigned to Gaussian distributions that are probability distributions of the observation probabilities for each movement.
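As a minimal sketch of the unsupervised step described above, the following fits a one-dimensional Gaussian mixture by expectation maximization, treating each mixture component as one movement. The function name `fit_gmm_1d`, the evenly spaced initialization, the variance floor, and the scalar feature are all assumptions made for illustration; actual feature vectors would be multi-dimensional.

```python
import math


def fit_gmm_1d(samples, n_components=2, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture with EM and return (weights, means, variances)."""
    n = len(samples)
    lo, hi = min(samples), max(samples)
    # Assumed initialisation: means evenly spaced over the data range.
    means = [lo + (hi - lo) * (k + 0.5) / n_components for k in range(n_components)]
    mean_all = sum(samples) / n
    var_all = sum((x - mean_all) ** 2 for x in samples) / n
    variances = [max(var_all, var_floor)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: probability that each movement generated each observation.
        resp = []
        for x in samples:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / n_components] * n_components)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean and variance of each movement.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk == 0.0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, samples)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, samples)) / nk
            variances[k] = max(var_k, var_floor)
    return weights, means, variances
```

The returned means and variances would then play the role of the per-movement Gaussian observation distributions, which are held fixed while the transition probabilities are learned.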
As described below, the transition probability learning section 12 learns the transition probabilities of the movements of the first HMMs using learning data appended with teacher information (hereafter referred to as supervised data). The teacher information is information giving a correct answer of a time segment in which each action occurs for posture time series data. The training is, for example, performed using maximum likelihood estimation and an expectation maximization algorithm (EM algorithm) or the like (another approach may also be employed therefor, such as machine learning, a neural network, deep learning, or the like).
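As a simplified illustration of maximum likelihood estimation of transition probabilities, the sketch below counts movement-to-movement transitions and normalizes each row. It assumes that a movement label has already been assigned to every clock-time (for example, by choosing the most probable movement under the fixed observation probabilities); this hard assignment replaces the EM algorithm mentioned above and is an assumption of the sketch. The name `estimate_transitions` is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(label_sequences):
    """Maximum likelihood transition probabilities from labelled movement sequences."""
    counts = defaultdict(Counter)
    for seq in label_sequences:
        # Count each consecutive pair (previous movement, next movement).
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # Normalise every row so the outgoing probabilities sum to one.
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


# Two short movement sequences observed inside the same action segment.
probs = estimate_transitions([["m1", "m2", "m2", "m3"], ["m1", "m2", "m3"]])
```

With more augmented sequences per action, the counts in each row grow, which is the reason augmentation helps when little supervised data exists.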
Generating supervised data takes both time and effort. Thus in the present exemplary embodiment the learnt observation probabilities are fixed in the observation probability learning section 11, and transition probabilities are learned from the existing supervised data.
More specifically, as illustrated in
Explanation follows regarding oversampling in the time direction. The oversampling in the time direction considers, for example, temporal extension and contraction related to a length of time taken for different movements depending on the person. More specifically, the oversampling proceeds as follows.
(1) As illustrated in
(2) The stretch strength of each clock-time is propagated to the preceding and following clock-times while being attenuated. The stretch strength is attenuated so as to become zero at a prescribed number of clock-times distant. In the example of
(3) A feature value at a clock-time corresponding to maximum strength from out of the original stretch strength of each clock-time and the propagated stretch strengths corresponding to the parameters propagated from before and after clock-times is selected as the feature value of this clock-time. In the example of
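Steps (2) and (3) above can be sketched as follows. The text of step (1) is cut off in this section, so the sketch simply takes the stretch strength of each clock-time as an input. The linear attenuation profile, the parameter name `reach` (the distance at which a propagated strength reaches zero), and the function name `oversample_in_time` are assumptions made for illustration.

```python
def oversample_in_time(features, strengths, reach):
    """Replace each clock-time's feature by the feature of the strongest nearby clock-time."""
    if reach < 1:
        raise ValueError("reach must be at least 1")
    n = len(features)
    out = []
    for t in range(n):
        # Start from the clock-time's own (unattenuated) stretch strength.
        best_u, best_s = t, strengths[t]
        for u in range(max(0, t - reach + 1), min(n, t + reach)):
            # Assumed linear attenuation that becomes zero at distance `reach`.
            s = strengths[u] * (1.0 - abs(t - u) / reach)
            if s > best_s:
                best_u, best_s = u, s
        # Take the feature value from the clock-time with maximum strength.
        out.append(features[best_u])
    return out


stretched = oversample_in_time(
    features=[10, 20, 30, 40, 50],
    strengths=[0.1, 0.9, 0.1, 0.3, 0.1],
    reach=2,
)
# stretched == [20, 20, 20, 40, 40]
```

In this example the feature at the second clock-time has a high strength and spreads over its neighbors, while weaker clock-times disappear, which mimics a movement being held longer by one person and skipped by another. The series length is unchanged, so the teacher information of the seed data still applies.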
Explanation follows regarding oversampling in feature space. According to the above property 2, postures of the same task are similar to each other, and so by adding noise, data can be generated that has a variation similar to the variation of each actual observation, as illustrated in the example of
The supervised data is augmented by applying teacher information TI of the seed data SD commonly across respective items of the augmented data. The augmented supervised data, which is an example of second supervised data, is employed to learn the transition probabilities of plural movements of the first HMMs using supervised learning.
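The way teacher information is shared across augmented items can be sketched as below. The representation of teacher information as a list of (action, start, end) tuples, the function name `augment_supervised`, and the simple jitter used as a stand-in oversampler are assumptions for illustration only.

```python
import random


def augment_supervised(seed_series, teacher_info, oversample, n_copies):
    """Return n_copies (series, teacher_info) pairs generated from one seed item."""
    # Every generated series reuses the teacher information of the seed unchanged.
    return [(oversample(seed_series), teacher_info) for _ in range(n_copies)]


rng = random.Random(0)
seed = [0.0, 0.1, 0.5, 0.9, 1.0]
teacher = [("a1", 0, 2), ("a2", 2, 5)]  # hypothetical (action, start, end) segments


def jitter(series):
    # Stand-in oversampler: small Gaussian noise on every clock-time.
    return [x + rng.gauss(0.0, 0.05) for x in series]


augmented = augment_supervised(seed, teacher, jitter, n_copies=3)
```

Because the oversampling keeps the number of clock-times unchanged, the segment boundaries in the teacher information remain valid for every generated item.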
In the oversampling, noise is generated and added to the feature value at each clock-time. For example, noise added may be generated from a multivariate Gaussian distribution having a covariance that is a fixed multiple of the covariance of the sample set of the identified movement. Moreover, a center distance d may be computed from the sample set of the identified movement to the sample set of the movement having a nearest center distance thereto, and the noise added may be generated from an isotropic Gaussian distribution (i.e. with a covariance matrix that is a diagonal matrix) such that a standard deviation in each axis direction of feature space is a fixed multiple of d.
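The second noise model above, an isotropic Gaussian whose standard deviation is a fixed multiple of the center distance d, can be sketched as follows. The function names and the use of the sample mean as the center of a sample set are assumptions for illustration.

```python
import math
import random


def center(samples):
    """Mean vector of a set of feature vectors."""
    dim = len(samples[0])
    return [sum(s[j] for s in samples) / len(samples) for j in range(dim)]


def nearest_center_distance(target_samples, other_sample_sets):
    """Distance d from the target movement's center to the nearest other center."""
    c = center(target_samples)
    return min(math.dist(c, center(o)) for o in other_sample_sets)


def add_isotropic_noise(vector, d, multiple, rng):
    """Add Gaussian noise with the same standard deviation on every axis."""
    sigma = multiple * d
    return [x + rng.gauss(0.0, sigma) for x in vector]


target = [[0.0, 0.0], [2.0, 0.0]]
others = [[[4.0, 4.0], [4.0, 4.0]], [[1.0, 3.0]]]
d = nearest_center_distance(target, others)  # nearest center is (1, 3)
noisy = add_isotropic_noise([1.0, 0.0], d, multiple=0.1, rng=random.Random(0))
```

Scaling the noise by d keeps the generated samples well inside the region of the identified movement, so that augmented samples are unlikely to be confused with the neighboring movement.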
In the present exemplary embodiment, noise related to the speed of each body location of a person performing a movement is added to the feature value of the movement by body location. For example, diagonal components that are variance components in a covariance matrix of Gaussian distribution change by body location of a person performing a movement. More specifically, a standard deviation σi′ (variance σi′²) of a feature value that is a posture component of a feature vector at body location i (wherein i is a natural number) is computed according to Equation (1) using an angular speed ωi of body location i, a value σi (variance σi²) of a standard deviation serving as a base, and a constant coefficient k.
σi′=σi+kωi Equation (1)
σi and k are constants determined in advance experimentally, and do not vary with body location. As illustrated by the second term of Equation (1), noise, namely variation in posture, is increased in proportion to a magnitude of angular speed. For example, a horizontal axis of
Although feature space is expressed in two dimensions in
In cases in which an angular speed component of a motion of a body location 1 and an angular speed component of a motion of a body location 2 are substantially the same as each other, as illustrated on the left of
Oversampling in the time direction enables changes in the time direction to be accommodated. Namely, even in cases in which the same task is performed, a given movement (motion feature) will be observed for a shorter time, or observed for a longer time, due to a fast motion or a slow motion. For a fast motion, a given movement is sometimes not observed at all.
For example, a worker A takes about three clock-times for the action 2 as in the example illustrated on the left of
Oversampling in feature space enables variation in feature values expressing posture to be accommodated. For example as illustrated in the example on the left of
However, a change in posture of the second arm is proportional to speed and small, and accordingly variance in feature values is also small. Performing oversampling in feature space enables, in this manner, samples having different variances in feature value due to body location to be augmented.
Both oversampling in the time direction and oversampling in the feature direction may be performed, or one thereof may be performed alone. In cases in which oversampling is performed only in the feature direction, the noise added to the feature value of each body location at each clock-time is noise related to the speed of that body location of the person performing the movement.
The building section 13 uses the observation probabilities learnt in the observation probability learning section 11 and the state transition probabilities learnt in the transition probability learning section 12 to build an HSMM such as in the example illustrated in
The action segment estimation model building device 10 of the present exemplary embodiment includes the following characteristics.
1. Observation probabilities of common movements for all actions of the first HMMs are learned by unsupervised learning.
2. Transition probabilities between movements of the first HMMs are learned by supervised learning using the supervised data resulting from augmenting the supervised seed data.
The action segment estimation model building device 10 includes, for example, a central processing unit (CPU) 51, a primary storage device 52, a secondary storage device 53, and an external interface 54, as illustrated in
The primary storage device 52 is, for example, volatile memory such as random access memory (RAM) or the like. The secondary storage device 53 is, for example, non-volatile memory such as a hard disk drive (HDD) or a solid state drive (SSD).
The secondary storage device 53 includes a program storage area 53A and a data storage area 53B. The program storage area 53A is, for example, stored with a program such as an action segment estimation model building program. The data storage area 53B is, for example, stored with supervised data, unsupervised data, learnt observation probabilities, transition probabilities, and the like.
The CPU 51 reads the action segment estimation model building program from the program storage area 53A and expands the action segment estimation model building program in the primary storage device 52. The CPU 51 acts as the observation probability learning section 11, the transition probability learning section 12, and the building section 13 illustrated in
Note that the program such as the action segment estimation model building program may be stored on an external server, and expanded in the primary storage device 52 over a network. Moreover, the program such as the action segment estimation model building program may be stored on a non-transitory recording medium such as a digital versatile disc (DVD), and expanded in the primary storage device 52 through a recording medium reading device.
An external device is connected to the external interface 54, and the external interface 54 performs a role in exchanging various information between the external device and the CPU 51.
The action segment estimation model building device 10 may, for example, be a personal computer, a server, a computer in the cloud, or the like.
At step 103, the CPU 51 augments supervised data by appending teacher information of supervised seed data to data generated by oversampling the supervised seed data, as described later. At step 104, the CPU 51 allocates the feature vectors for the supervised data to respective time segments of the actions appended with the teacher information.
At step 105, the CPU 51 takes a time series of the feature vectors in the time segments allocated at step 104 as observation data, and uses the supervised data augmented at step 103 to learn the transition probabilities of the movements of the first HMMs using supervised learning.
At step 106, the CPU 51 sets, as a probability distribution of successive durations of respective actions, a uniform distribution having a prescribed range for the successive durations of the respective actions appended with the teacher information. The CPU 51 uses the observation probabilities learnt at step 102 and the transition probabilities learnt at step 105 to build an HSMM. The HSMM is built such that actions of the second HMMs transition in the order of the respective actions appended with the teacher information after a fixed period of time set at step 106 has elapsed. The built HSMM may, for example, be stored in the data storage area 53B.
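The uniform duration distribution of step 106 can be sketched as below. How the prescribed range is derived is not stated here, so the `margin` parameter, which widens the range around the durations found in the teacher information, is purely an assumption, as are the function names.

```python
import math


def duration_range(teacher_durations, margin=0.5):
    """Assumed rule: widen the observed duration range by a relative margin."""
    lo = max(1, math.floor(min(teacher_durations) * (1.0 - margin)))
    hi = math.ceil(max(teacher_durations) * (1.0 + margin))
    return lo, hi


def uniform_duration_pmf(lo, hi):
    """Uniform probability over integer durations lo..hi (inclusive)."""
    if lo < 1 or hi < lo:
        raise ValueError("need 1 <= lo <= hi")
    p = 1.0 / (hi - lo + 1)
    return {d: p for d in range(lo, hi + 1)}


# Durations (in clock-times) of one action in the teacher information.
lo, hi = duration_range([4, 6], margin=0.5)
pmf = uniform_duration_pmf(lo, hi)
```

A uniform distribution makes every duration inside the range equally likely, so the segment boundaries are decided by the observation and transition probabilities rather than by a learned duration preference.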
At step 153, the CPU 51 acquires time series data of motion information for each location on a body from the time series data of the posture information acquired at step 152. The time series data of the motion information may, for example, be curvature, curvature speed, and the like for each location. The locations may, for example, be an elbow, a knee, or the like.
At step 154, the CPU 51 uses a sliding time window to compute feature vectors by averaging the motion information of step 153 in the time direction within a window for each fixed time interval.
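The sliding-window averaging of step 154 can be sketched as follows. The window length, the stride, and the function name `window_features` are assumptions for illustration; each input row is the motion information of all body locations at one clock-time.

```python
def window_features(motion, window, stride):
    """Average motion vectors over a sliding time window, one feature vector per step."""
    features = []
    for start in range(0, len(motion) - window + 1, stride):
        chunk = motion[start:start + window]
        dim = len(chunk[0])
        # Average each component over the clock-times inside the window.
        features.append([sum(row[j] for row in chunk) / window for j in range(dim)])
    return features


motion = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
feats = window_features(motion, window=2, stride=2)
# feats == [[1.0, 2.0], [5.0, 10.0]]
```

A stride smaller than the window gives overlapping windows and a smoother feature series; a stride equal to the window, as here, gives one feature vector per non-overlapping interval.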
At step 253, the CPU 51 takes a feature value of observation data at a clock-time corresponding to the maximum stretch strength from out of the values of stretch strength of this clock-time and the stretch strengths propagated from other clock-times, and selects this as the feature value for this clock-time. At step 254, the CPU 51 computes a Gaussian distribution covariance matrix based on the values of angular speed at each of the body locations.
At step 255, the CPU 51 adds noise generated with the Gaussian distribution of the covariance matrix computed at step 254 to each of the feature values selected at step 253. The supervised data is augmented by repeating the processing of step 251 to step 255.
The processing of step 254 and step 255 may be repeated alone. In such cases, the noise is added to the original feature values at each of the clock-times. Alternatively, the processing of step 251 to step 253 may be repeated alone.
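Steps 254 and 255, combined with Equation (1), can be sketched as below for a diagonal covariance matrix. Using the magnitude of the angular speed, drawing each component independently, and the function names are assumptions of this sketch; the base standard deviation and the coefficient k would be determined experimentally.

```python
import random


def posture_noise_std(base_std, k, angular_speeds):
    """Equation (1): per-location standard deviation sigma_i' = sigma_i + k * omega_i."""
    return [base_std + k * abs(w) for w in angular_speeds]


def add_speed_dependent_noise(posture, angular_speeds, base_std, k, rng):
    """Add zero-mean Gaussian noise whose spread grows with each location's angular speed."""
    stds = posture_noise_std(base_std, k, angular_speeds)
    # Diagonal covariance: each body location is perturbed independently.
    return [x + rng.gauss(0.0, s) for x, s in zip(posture, stds)]


stds = posture_noise_std(base_std=0.1, k=0.5, angular_speeds=[0.0, 2.0])
noisy_posture = add_speed_dependent_noise(
    posture=[30.0, 90.0], angular_speeds=[0.0, 2.0], base_std=0.1, k=0.5,
    rng=random.Random(0),
)
```

A body location that is almost stationary receives only the base noise, while a fast-moving location receives proportionally more, which matches the observation that posture varies more where motion is faster.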
At step 201, the CPU 51 extracts feature vectors from sensor data generated by detecting postures of a person using sensors. The sensors are devices to detect person posture and may, for example, be a camera, infrared sensor, motion capture device, or the like. Step 201 of
At step 202, the CPU 51 takes a series of the feature vectors extracted at step 201 as observation data, and estimates successive durations of each action state by comparing to the HSMM built with the action segment estimation model building processing. At step 203, the CPU 51 estimates time segments of each action from the successive durations of each action state estimated at step 202.
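A much simplified sketch of steps 202 and 203 is given below. It assumes that a log-likelihood of each feature vector under each action has already been computed (in the model described here this would come from the first HMMs), that the actions occur in a fixed order, and that the duration distribution is uniform so that it contributes only a range constraint. The function name `segment_actions` and its parameters are hypothetical.

```python
import math


def segment_actions(loglik, dmin, dmax):
    """Return [(start, end), ...] per action, maximising total log-likelihood.

    loglik[a][t] is the log-likelihood of clock-time t under action a;
    actions occur in index order and each lasts between dmin and dmax clock-times.
    """
    n_actions, n_times = len(loglik), len(loglik[0])
    # Prefix sums give the score of any candidate segment in constant time.
    prefix = [[0.0] * (n_times + 1) for _ in range(n_actions)]
    for a in range(n_actions):
        for t in range(n_times):
            prefix[a][t + 1] = prefix[a][t] + loglik[a][t]

    neg_inf = -math.inf
    best = [[neg_inf] * (n_times + 1) for _ in range(n_actions + 1)]
    back = [[0] * (n_times + 1) for _ in range(n_actions + 1)]
    best[0][0] = 0.0
    for a in range(1, n_actions + 1):
        for t in range(1, n_times + 1):
            for d in range(dmin, min(dmax, t) + 1):
                prev = best[a - 1][t - d]
                if prev == neg_inf:
                    continue
                score = prev + prefix[a - 1][t] - prefix[a - 1][t - d]
                if score > best[a][t]:
                    best[a][t], back[a][t] = score, d
    if best[n_actions][n_times] == neg_inf:
        raise ValueError("no segmentation satisfies the duration range")

    # Backtrack the chosen durations into (start, end) segments, end exclusive.
    segments, t = [], n_times
    for a in range(n_actions, 0, -1):
        d = back[a][t]
        segments.append((t - d, t))
        t -= d
    segments.reverse()
    return segments


loglik = [[-1, -1, -1, -1, -5, -5], [-5, -5, -5, -5, -1, -1]]
segments = segment_actions(loglik, dmin=1, dmax=5)
# segments == [(0, 4), (4, 6)]
```

The estimated successive durations (4 and 2 clock-times here) translate directly into the time segments of each action, which corresponds to step 203.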
For example, in technology employing a video as input so as to recognize a particular action in the video, basic movement recognition, element action recognition, and higher level action recognition are performed. A particular action in a video is a more complicated higher level action formed by combining element actions. Basic movement recognition is posture recognition for each frame, and element action recognition is performed by temporal spatial recognition and recognizes a simple action over a given length of time. Higher level action recognition is recognition of a complex action over a given length of time. Such technology utilizes action segment estimation model building processing and a built action segment estimation model to enable estimation of action segments.
An HSMM in which movements included in actions are not particularly limited may be employed in related technology. In such related technology, for example as illustrated in the example in
(1) raise arm, (2) lower arm, (3) extend arm forward, (4) bring both hands close together in front of body, (5) move forward, (6) move sideways, (7) squat, (8) stand.
Examples of actions are, for example, as set out below:
Action a31: (1) raise arm→(3) extend arm forward→(1) raise arm→(4) bring both hands close together in front of body→(7) squat;
Action a32: (7) squat→(4) bring both hands close together in front of body→(8) stand→(5) move forward→(3) extend arm forward; and the like.
As described above, in cases in which an HMM includes movements of general actions, namely plural movements not limited for the action to be estimated, the observation probabilities of the movements are difficult to express as a single simple probability distribution. In order to address this issue there is technology that employs a hierarchical hidden Markov model. As illustrated in the example in
As illustrated in the example in
However as illustrated in the example of
As illustrated in the example on the left of
As illustrated in the example at the bottom right of
However, in cases in which there is only a small volume of supervised data, many fluctuations are unable to be learnt directly, and accommodation of fluctuations in the observation data is weak. However, in the present exemplary embodiment, performing oversampling in the time direction and oversampling in feature space enables appropriate supervised data to be augmented so as to enable accommodation of fluctuations in the observation data.
The present exemplary embodiment thereby enables modeling of the way movements are arrayed under presumed fluctuations in the observation data even in cases in which there is a small volume of existing supervised data. This thereby enables time segments to be estimated at high precision even in cases in which there is fluctuation in the observation data.
In the present exemplary embodiment, in a hidden semi-Markov model, observation probabilities for each type of movement of plural first hidden Markov models are learned using unsupervised learning. The hidden semi-Markov model includes plural second hidden Markov models each containing plural of the first hidden Markov models using types of movement of a person as states and with the plural second hidden Markov models each using actions determined by combining plural of the movements as states. The learnt observation probabilities are fixed, input first supervised data is augmented so as to give second supervised data, and transition probabilities of the movements of the first hidden Markov models are learned by supervised learning in which the second supervised data is employed. The learnt observation probabilities and the learnt transition probabilities are used to build the hidden semi-Markov model that is a model for estimating segments of the actions. Augmentation is performed on the first supervised data by adding teacher information of the first supervised data to each item of data generated by at least one out of oversampling in the time direction or oversampling in feature space.
The present disclosure enables an action segment estimation model to be built efficiently. Namely, for plural actions composed of movements performed in a decided order, such as in standard tasks in a factory, in dance choreography, and in martial art forms, the present disclosure enables, for example, the time segments of each action to be estimated accurately under the condition that there is a restriction on the order of occurrence.
There is a high cost to generating teacher information of supervised data when training a model to estimate time segments of actions according to the related arts.
One object of the present disclosure is to efficiently build an action segment estimation model.
One of the aspects of the present disclosure enables an action segment estimation model to be built efficiently.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
This application is a continuation application of International Application No. PCT/JP2021/002817, filed on Jan. 27, 2021, the disclosure of which is incorporated herein by reference in its entirety.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2021/002817 | Jan 2021 | US |
Child | 18341583 | | US |