Techniques relevant to the present invention are disclosed in Patent Documents 1 to 3 and in Non-Patent Document 1.
Patent Document 1 discloses a technique for computing a feature value of each of a plurality of key points of a human body included in an image and, based on the computed feature values, classifying a plurality of poses and a plurality of movements of the human body extracted from the image by collecting similar poses and movements.
Patent Document 2 discloses a technique for classifying, based on a feature value of time-series position data of a user for each day, a movement pattern of the user for each day into a plurality of clusters.
Patent Document 3 discloses a technique for classifying time-series position data of a human body part into a plurality of position data groups, and analyzing a motion for each of the plurality of position data groups.
Non-Patent Document 1 discloses a technique related to skeleton estimation of a person.
Patent Document 1: International Patent Publication No. WO2021/084677
Patent Document 2: International Patent Publication No. WO2017/187584
Patent Document 3: Japanese Patent Application Publication No. 2021-022323
Non-Patent Document 1: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291-7299
When classifying human movements presented in a plurality of frames by collecting similar movements, it is necessary to compute a similarity between two movements. The technique for computing a similarity between two movements disclosed in Patent Document 1 presumes that the two movements are presented in the same number of frames. A limitation that all movements to be classified must be presented in the same number of frames is not convenient. None of the Patent Documents and the Non-Patent Document discloses this problem or a solution thereof.
An object of the present invention is to improve the convenience of a technique for classifying human movements presented in a plurality of frames by collecting similar movements.
According to the present invention, an action classification apparatus is provided, including:
Further, according to the present invention, an action classification method is provided, including,
Further, according to the present invention, a program is provided, causing a computer to function as:
According to the present invention, convenience of a technique for classifying human movements presented in a plurality of frames by collecting similar movements is improved.
The above-described object, other objects, characteristics, and advantages will be further clarified by the suitable example embodiments described below and the accompanying drawings.
In the following, example embodiments of the present invention will be described with reference to the drawings. Note that, in all the drawings, a similar component is denoted with a similar reference sign, and description thereof is omitted as appropriate.
An action classification apparatus according to the present example embodiment computes a similarity between human movements presented in any number of frames, and classifies, based on a computation result, a plurality of movements by collecting similar movements. In a case of the present example embodiment, a movement to be classified may be presented in any number of frames. In comparison to a case in which the number of frames in which a movement to be classified is presented is limited to one specific value, convenience is improved.
Next, one example of a hardware configuration of the action classification apparatus will be described. Each function unit of the action classification apparatus is achieved by any combination of software and hardware, mainly including a central processing unit (CPU) of any computer, a memory, a program loaded onto the memory, a storage unit, such as a hard disk, storing the program (in addition to a program stored in advance from a stage of shipping an apparatus, a program downloaded from a storage medium such as a compact disk (CD) or from a server on the Internet can also be stored), and an interface for network connection. Further, it is understood by a person skilled in the art that there are various modification examples of the method and the apparatus for achieving the action classification apparatus.
The bus 5A is a data transmission path for the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A to transmit and receive data to and from one another. The processor 1A is, for example, an arithmetic operation processing apparatus such as a CPU and a graphics processing unit (GPU). The memory 2A is, for example, a memory such as a random access memory (RAM) and a read only memory (ROM). The input/output interface 3A includes an interface for acquiring information from an input apparatus, an external apparatus, an external server, an external sensor, a camera, and the like, an interface for outputting information to an output apparatus, an external apparatus, an external server, and the like, and the like. The input apparatus is, for example, a keyboard, a mouse, a microphone, a physical button, a touch panel, and the like. The output apparatus is, for example, a display, a speaker, a printer, a mailer, and the like. The processor 1A can issue an instruction to each module and execute an arithmetic operation, based on a result of an arithmetic operation by each module.
One example of a function block diagram of an action classification apparatus 10 according to the present example embodiment is illustrated in
The extraction unit 11 extracts, from a moving image, a plurality of human movements presented in any number of frames, and stores an extraction result in a storage unit. The storage unit may be provided in the action classification apparatus 10, or may be provided in an external apparatus configured in such a way as to be accessible from the action classification apparatus 10.
“Any number of frames” means that the number of frames is not limited to one predetermined number and may be any number among a plurality of options. Specifically, the number of frames in which a human movement extracted in the present example embodiment is presented is not limited to one fixed value such as “five frames”, and may be any number within a numerical range set with a certain width, for example, “any of 5 to 20 frames”.
The above-described numerical range may be defined as desired according to a required performance. The wider the numerical range is, the weaker the limitation on the number of frames becomes. By defining this numerical range sufficiently wide, the limitation on the number of frames can be virtually eliminated. Meanwhile, when the numerical range is too wide, there will be human movements whose numbers of frames differ greatly from each other, and computation of a similarity between movements and the like becomes troublesome. When the numerical range is narrowed to a certain extent, there will be no human movements whose numbers of frames differ greatly from each other, and computation of a similarity between movements and the like becomes easier.
One example of an extraction result stored in the storage unit is schematically illustrated in
The movement identification information is information for identifying the plurality of human movements detected by the extraction unit 11 from each other. A new piece of movement identification information is issued each time a new human movement is extracted.
The frame number is the number of each frame in which the extracted human movement is presented. In the example illustrated in
The in-image position information is information indicating where a person making each of the movements is located in each frame. In the illustrated example, a position of a person making each of the movements is indicated by coordinates of four vertices of a rectangle enclosing the person making the movement, but this method is one example and a position of a person in a frame may be indicated in another method.
Note that, although the extraction result illustrated in
There are various means by which the extraction unit 11 extracts, from a moving image, a human movement presented in any number of frames, and any technique can be employed. For example, for each of a plurality of human movements, a user may give an input, to the action classification apparatus, specifying a start frame and an end frame of any number of frames in which the human movement is presented, and a position of a person making the movement in each of the frames. Further, the extraction unit 11 may extract the plurality of human movements from a moving image, based on a user input, and store an extraction result in the storage unit.
Alternatively, a human movement presented in any number of frames may be extracted from a moving image by arithmetic operation processing by a computer without a user input specifying a start frame, an end frame, and a position in a frame as described above. One example of a means for achieving by arithmetic operation processing by a computer will be described in the following example embodiment.
Returning to
Herein, processing by the time-series feature value computation unit 12 is described in more detail by taking a movement determined by the movement identification information “000001” illustrated in
In the present example embodiment, any technique can be employed as a means for computing a feature value of a human pose. One example thereof will be described in the following example embodiment.
Returning to
A means for computing a similarity between two time-series feature values for a same number of frames is not particularly limited, and any technique can be employed. For example, the similarity computation unit 13 may compute a similarity between two time-series feature values by using the technique disclosed in Patent Document 1.
Alternatively, the similarity computation unit 13 may determine, for example based on an order of appearance of the frames, a frame of one time-series feature value that is relevant to each frame of the other time-series feature value. In this case, the similarity computation unit 13 associates frames having the same order of appearance with each other. Further, the similarity computation unit 13 may compute, for each pair of frames relevant to each other, a similarity of feature values of human poses, and compute a statistical value (an average value, a median value, a mode, a maximum value, a minimum value, or the like) of the similarities computed for the plurality of pairs, as the similarity between the two time-series feature values.
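As a non-limiting illustration, the above order-of-appearance association may be sketched in Python as follows. The function names, the inverse-distance score used as the per-frame pose similarity, and the choice of an average value as the statistical value are assumptions of this sketch, not part of the disclosed technique.

```python
import numpy as np

def frame_pose_similarity(feat_a, feat_b):
    # Placeholder per-frame pose similarity: closer feature vectors score higher.
    # Any pose similarity measure (e.g. the one in the third example embodiment)
    # may be substituted here.
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(feat_a, float) - np.asarray(feat_b, float)))

def same_length_similarity(seq_a, seq_b, statistic=np.mean):
    """Similarity between two time-series feature values presented in the same
    number of frames: frames with the same order of appearance are paired, a
    similarity of human-pose feature values is computed for each pair, and a
    statistical value of those per-pair similarities is returned."""
    assert len(seq_a) == len(seq_b), "both movements must be presented in the same number of frames"
    return statistic([frame_pose_similarity(a, b) for a, b in zip(seq_a, seq_b)])
```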
Meanwhile, when two time-series feature values for which a similarity is computed are time-series feature values for different numbers of frames, the similarity computation unit 13 may compute a similarity between the two time-series feature values by using a “technique for computing a similarity between sets that differ from each other in number of elements”. Note that, in the following example embodiment, another example of a means for computing a similarity between two time-series feature values for different numbers of frames will be described.
The classification unit 14 classifies the plurality of human movements extracted by the extraction unit 11, by grouping similar movements together, based on the similarity between the plurality of time-series feature values computed by the similarity computation unit 13. There are various classification methods and, for example, a plurality of human movements of which similarity between time-series feature values of each other is equal to or more than a reference value may be classified into a same cluster (a group of similar movements).
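A minimal sketch of such threshold-based classification is given below; the greedy assignment to the first sufficiently similar cluster and the reference value of 0.8 are assumptions introduced for illustration only.

```python
def classify_movements(movement_ids, similarity, reference=0.8):
    """Group movements whose time-series similarity to a cluster's first member
    is equal to or more than `reference` into the same cluster.

    `similarity(i, j)` returns the similarity between the time-series feature
    values of two movements identified by movement identification information."""
    clusters = []  # each cluster is a list of movement identification information
    for mid in movement_ids:
        for cluster in clusters:
            if similarity(mid, cluster[0]) >= reference:
                cluster.append(mid)  # join the first cluster that is similar enough
                break
        else:
            clusters.append([mid])   # otherwise start a new cluster
    return clusters
```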
Next, one example of a flow of processing by the action classification apparatus 10 will be described with reference to a flowchart in
First, the action classification apparatus 10 extracts, from a moving image, a plurality of human movements presented in any number of frames (S10). Next, the action classification apparatus 10 computes, for each of the human movements extracted in S10, a feature value of a human pose in each of the any number of frames, and thereby computes a time-series feature value for the any number of frames (S11). Next, the action classification apparatus 10 computes a similarity between a plurality of the time-series feature values (S12). Then, the action classification apparatus 10 classifies, based on the similarity computed in S12, the plurality of extracted human movements (S13).
The action classification apparatus 10 according to the present example embodiment computes a similarity between human movements presented in any number of frames, and classifies the plurality of human movements by collecting similar movements, based on a computation result. In a case of the present example embodiment, a movement to be classified may be presented in any number of frames. In comparison to a case in which the number of frames in which a movement to be classified is presented is limited to one specific value, convenience is improved.
According to an action classification apparatus 10 of the present example embodiment, processing of extracting, from a moving image, a plurality of human movements presented in any number of frames is automated. Detailed description will be given in the following.
An extraction unit 11 detects, from a moving image, a plurality of persons appearing consecutively in any number of frames, by using a tracking engine that tracks a same person. Further, the extraction unit 11 extracts, as a human movement presented in the any number of frames, a movement of each of the plurality of persons that is presented in the any number of frames and is detected by the tracking engine.
The tracking engine tracks a same person, based on at least one of a feature value of a face, a feature value of an outfit, a feature value of a possessed item, a feature value of a pose of the person, and a position in a frame.
The tracking engine may determine that a person is the same person, for example, when feature values of faces are similar to each other at or above a reference level. Further, the tracking engine may determine that a person is the same person when feature values of outfits are similar to each other at or above a reference level. Further, the tracking engine may determine that a person is the same person when feature values of possessed items are similar to each other at or above a reference level.
Further, the tracking engine may determine that a person is the same person when poses in two frames that are consecutive in a time-series order are similar to each other at or above a reference level. Further, the tracking engine may determine that a person is the same person when in-frame positions in two frames that are consecutive in the time-series order are similar to each other at or above a reference level.
Further, the tracking engine may determine that a person is the same person when an integrated similarity computed based on similarities of feature values of any two or more of the above-described plurality of types is equal to or more than a reference value. Examples of the integrated similarity include, but are not limited to, an average value, a maximum value, a minimum value, a mode, a median value, a weighted average value, and a weighted sum of the similarities of the feature values of the two or more types. When an integrated similarity is computed, it is desirable that the similarities of the feature values of the plurality of types be normalized so as to be comparable.
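One possible way to combine normalized similarities into an integrated similarity is sketched below; the weighted-average form, the function names, and the reference value of 0.7 are hypothetical examples, not the disclosed implementation.

```python
def integrated_similarity(similarities, weights=None):
    """Combine similarities of two or more feature types (face, outfit,
    possessed item, pose, in-frame position) into one integrated score.

    `similarities` maps a feature type to a similarity already normalized so
    as to be comparable; a weighted average is used here, but a maximum, a
    median, or another statistical value could be used instead."""
    if weights is None:
        weights = {}
    total_w = sum(weights.get(k, 1.0) for k in similarities)
    return sum(v * weights.get(k, 1.0) for k, v in similarities.items()) / total_w

def is_same_person(similarities, reference=0.7):
    # Hypothetical reference value; two detections are treated as the same
    # person when the integrated similarity reaches it.
    return integrated_similarity(similarities) >= reference
```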
A specific example of processing by the extraction unit 11 is described with reference to
The person A appears in the moving image from a time t11 to a time t15.
Further, the person A walks from the time t11 to a time t12, stops from the time t12 to a time t13, and falls down from the time t13 to the time t15.
The person B appears in the moving image from the time t11 to the time t12. Further, the person B walks from the time t11 to the time t12.
When such a moving image is processed by the face tracking engine, for example, from the time t11 to a time t14, the person A is tracked as a same person, but at the time t14, for some reason (for example, a feature value of a face can no longer be acquired sufficiently because the person A has fallen), tracking of the person A is temporarily stopped. Further, from the time t14 to t15, the person A is tracked while being recognized as a person different from the person who had been tracked from the time t11 to t14. As a result, the person A from the time t11 to t14 is assigned with one piece of person identification information (“ID: 1” in the diagram), and the person A from the time t14 to t15 is assigned with another piece of person identification information (“ID: 2” in the diagram).
Further, from the time t11 to t12, the person B is tracked as a same person. As a result, the person B from the time t11 to t12 is assigned with one piece of person identification information (“ID: 3” in the diagram).
On a basis of a result of such tracking by the face tracking engine, the extraction unit 11 extracts a movement made by the person A (“ID: 1” in the diagram) from the time t11 to t14 as one human movement, extracts a movement made by the person A (“ID: 2” in the diagram) from the time t14 to t15 as another human movement, and extracts a movement made by the person B (“ID: 3” in the diagram) from the time t11 to t12 as yet another human movement.
Another specific example of the processing by the extraction unit 11 is described in
In the example in
Note that, when a person detected by the tracking engine consecutively appears in equal to or more than a predetermined upper limit number (matter of design) of frames, the extraction unit 11 may divide a plurality of frames in which the person consecutively appears into a plurality of groups by using any method, and extract each human movement presented in the plurality of frames belonging to each of the plurality of groups, as one human movement. In this case, one piece of movement identification information (see
In the example in
A method for dividing a plurality of frames into a plurality of groups is not particularly limited, and the number of frames belonging to each group may be less than the predetermined upper limit number. For example, a predetermined number (less than the predetermined upper limit number) of frames may be grouped together in a time-series order of the plurality of frames into one group. Note that, one frame may belong in duplicate to a plurality of groups, or such duplication may not be allowed.
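A minimal sketch of dividing a long run of frames into groups is given below; the simple chunking scheme, the optional overlap parameter, and the function name are assumptions of this illustration.

```python
def split_into_groups(frame_numbers, upper_limit, overlap=0):
    """Divide a run of consecutive frames into groups of at most `upper_limit`
    frames so that each group can be extracted as one human movement.

    `overlap` frames are shared between adjacent groups when duplication is
    allowed; set it to 0 to forbid duplication. Groups containing too few
    frames may further be discarded according to a predetermined lower limit."""
    step = max(1, upper_limit - overlap)
    groups = []
    for start in range(0, len(frame_numbers), step):
        group = frame_numbers[start:start + upper_limit]
        if group:
            groups.append(group)
    return groups

# Example: frames 1..120, upper limit 50, no overlap
# -> [1..50], [51..100], [101..120]
```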
Further, when the number of frames in which the detected person appears consecutively is equal to or less than a predetermined lower limit number (matter of design), the extraction unit 11 may not extract a human movement presented in equal to or less than the lower limit number of frames as one human movement.
Other configurations of the action classification apparatus 10 according to the present example embodiment are similar to those in the first example embodiment.
According to the action classification apparatus 10 of the present example embodiment, an advantageous effect similar to that in the first example embodiment is achieved. Further, according to the action classification apparatus 10 of the present example embodiment, the processing of extracting, from a moving image, a plurality of human movements presented in any number of frames is automated. As a result, convenience is improved.
In the present example embodiment, a means for computing a feature value of a human pose is embodied. Detailed description is as follows. A time-series feature value computation unit 12 includes a skeleton structure detection unit and a feature value computation unit.
The skeleton structure detection unit executes processing of detecting N (N is an integer equal to or greater than two) key points of a human body included in a frame. The processing by the skeleton structure detection unit is achieved by using the technique disclosed in Patent Document 1. Although details are omitted, in the technique disclosed in Patent Document 1, detection of a skeleton structure is performed by using a skeleton estimation technique such as OpenPose disclosed in Non-Patent Document 1. The skeleton structure detected by using the technique consists of a “key point” being a characteristic point such as a joint and a “bone (bone link)” indicating a link between the key points.
The skeleton structure detection unit extracts, for example, a feature point that may be a key point from an image, and detects N key points of a human body by referring to information acquired by performing machine learning on images of key points. The N key points to be detected are defined in advance. The number of key points to be detected (specifically, the value of N) and which parts of the human body are set as the key points to be detected may vary, and any variation can be employed.
In the example in
The feature value computation unit computes a feature value of the detected two-dimensional skeleton structure. For example, the feature value computation unit computes a feature value of each of the detected key points.
A feature value of a skeleton structure indicates a feature of a skeleton of a person, and is an element for classifying a state (a pose or a movement) of the person, based on the skeleton of the person. Usually, the feature value includes a plurality of parameters. Further, the feature value may be a feature value of an entire skeleton structure, may be a feature value of a part of the skeleton structure, or may include a plurality of feature values such as a feature value of each part of the skeleton structure. A computation method of the feature value may be any method such as machine learning or normalization, and a minimum value and a maximum value may be determined as normalization. As one example, the feature value is a feature value acquired by machine learning on the skeleton structure, a size of the skeleton structure from a head portion to a foot portion on an image, a relative positional relationship between a plurality of key points in a vertical direction in a skeleton area including the skeleton structure on the image, a relative positional relationship between a plurality of key points in a horizontal direction in the skeleton area, and the like. The size of the skeleton structure is a height of the vertical direction, an area, or the like of the skeleton area including the skeleton structure on the image. The vertical direction (a height direction or a longitudinal direction) is an up/down direction (Y-axis direction) in the image, and, for example, is a direction perpendicular to a ground (reference plane). Further, the horizontal direction (a lateral direction) is a left/right direction (X-axis direction) in the image, and, for example, is a direction parallel to the ground.
Note that, in order to perform classification desired by a user, it is preferable to use a feature value having robustness against classification processing. For example, when a user desires classification independent of an orientation or a body shape of a person, a feature value being robust against an orientation and a body shape of a person may be used. The feature value independent of an orientation or a body shape of a person can be acquired by learning skeletons of persons in a same pose facing various orientations and skeletons of persons of various body shapes in a same pose, or by extracting only a feature of the vertical direction of a skeleton.
The above-described processing by the feature value computation unit is achieved by using the technique disclosed in Patent Document 1.
In this example, a feature value of a key point indicates a relative positional relationship between a plurality of key points in a vertical direction in a skeleton area including a skeleton structure on an image. Since the key point A2 of the neck is used as a reference point, a feature value of the key point A2 is 0.0, and feature values of the key point A31 of the right shoulder and the key point A32 of the left shoulder at a same height as the neck are also 0.0. A feature value of the key point A1 of the head at a higher position than the neck is −0.2. Feature values for the key point A51 of the right hand and the key point A52 of the left hand at positions lower than the neck are 0.4, and feature values of the key point A81 of the right foot and the key point A82 of the left foot are 0.9. When the person raises the left hand from this state, as illustrated in Fig. 12, the left hand becomes higher than the reference point, and therefore the feature value of the key point A52 of the left hand becomes −0.4. Meanwhile, since normalization is performed by using only a Y-axis coordinate, the feature value does not change even when a width of the skeleton structure changes as in
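A minimal sketch of such a feature value computation is given below, assuming that each feature value is the Y-coordinate offset of a key point from the neck key point, normalized by the height of the skeleton area on the image; the exact normalization constant and the function name are assumptions of this sketch.

```python
import numpy as np

def keypoint_features(keypoints, reference="neck"):
    """Relative vertical position of each key point: the neck key point is the
    reference (feature value 0.0), key points above it get negative values and
    key points below it get positive values. Only the Y coordinate is used, and
    values are normalized by the height of the skeleton area on the image.

    `keypoints` maps a key point name to its (x, y) image coordinates, with the
    Y axis pointing downward as is usual for image coordinates."""
    ys = np.array([y for _, y in keypoints.values()], dtype=float)
    height = max(float(ys.max() - ys.min()), 1e-6)  # height of the skeleton area
    ref_y = float(keypoints[reference][1])
    return {name: (float(y) - ref_y) / height for name, (_, y) in keypoints.items()}

# Example: with the neck at y=100, the head at y=80, and the feet at y=190,
# the head gets a negative feature value and the feet a value close to 1.
```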
There are various methods for computing a similarity of poses indicated by such a feature value. For example, after a similarity of feature values is computed for each of the key points, a similarity of poses may be computed based on a plurality of feature values of the key points. For example, an average value, a maximum value, a minimum value, a mode, a median value, a weighted average value, a weighted sum, or the like of the plurality of feature values of the key points may be computed as a similarity of poses. When a weighted average value or a weighted sum is computed, a weight of each key point may be set by a user, or may be defined in advance.
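The per-key-point aggregation described above may be sketched as follows; the inverse-distance per-key-point similarity and the weighted-average aggregation are only examples among the statistical values listed above, and the function names are hypothetical.

```python
def pose_similarity(features_a, features_b, weights=None):
    """Pose similarity from per-key-point feature values: a similarity is
    computed for each key point, and a weighted average of those per-key-point
    similarities is returned.

    `features_a` / `features_b` map key point names to feature values;
    `weights` optionally maps key point names to weights (default 1.0)."""
    weights = weights or {}
    num, den = 0.0, 0.0
    for name in features_a.keys() & features_b.keys():
        w = weights.get(name, 1.0)
        # per-key-point similarity: closer feature values score higher
        sim = 1.0 / (1.0 + abs(features_a[name] - features_b[name]))
        num += w * sim
        den += w
    return num / den if den else 0.0
```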
Other configurations of an action classification apparatus 10 according to the present example embodiment are similar to those in the first and second example embodiments.
According to the action classification apparatus 10 of the present example embodiment, an advantageous effect similar to that in the first and second example embodiments is achieved. Further, according to the action classification apparatus 10 of the present example embodiment, a similarity of pose can be computed with high accuracy. As a result, accuracy of action classification is improved.
In the present example embodiment, a means for computing a similarity between two time-series feature values for different numbers of frames is embodied. Detailed description is as follows.
When computing a similarity between two time-series feature values for different numbers of frames, a similarity computation unit 13 computes the similarity between the two time-series feature values by executing processing illustrated in a flowchart in
In S20, the similarity computation unit 13 determines, based on a similarity of feature values of human poses in the respective frames, a frame of the other time-series feature value that is associated with each frame of one time-series feature value. Detailed description is as follows.
The similarity computation unit 13 searches the frames of the other time-series feature value for one or a plurality of frames in which a pose similar (with a similarity equal to or more than a threshold value) to a human pose in a first frame of the one time-series feature value is presented, and associates the searched one or plurality of frames with the first frame. One example of a result of determining a correlation is illustrated in
The above-described determination of a correlation can be achieved, for example, by using a technique such as dynamic time warping (DTW). In such a case, as a distance score required in the determination of the correlation, a distance (a Manhattan distance, a Euclidean distance, or the like) between feature values, and the like can be used.
Returning to
In S22, the similarity computation unit 13 computes a similarity between the two time-series feature values, based on the similarity computed in S21. The similarity computation unit 13 computes, for example, a statistic value (an average value, a median value, a mode, a maximum value, a minimum value, or the like) of the similarity computed for each of the plurality of pairs, as the similarity between the two time-series feature values.
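A minimal sketch of S20 to S22 using dynamic time warping with a Euclidean distance score is given below; the inverse-distance conversion of a distance into a similarity and the use of a mean value as the statistic value are assumptions of this sketch.

```python
import numpy as np

def dtw_similarity(seq_a, seq_b):
    """Similarity between two time-series feature values for different numbers
    of frames: frames are associated by dynamic time warping using the
    Euclidean distance between pose feature vectors as the distance score
    (S20), a similarity is computed for every associated pair (S21), and the
    mean of those similarities is returned (S22).

    seq_a, seq_b: arrays of shape (num_frames, feature_dim)."""
    a, b = np.asarray(seq_a, float), np.asarray(seq_b, float)
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # accumulate the DTW cost
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                  cost[i, j - 1],
                                                  cost[i - 1, j - 1])

    # trace back the warping path, i.e. the pairs of associated frames
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1

    # per-pair similarity from the distance, then a statistical value
    sims = [1.0 / (1.0 + dist[p, q]) for p, q in path]
    return float(np.mean(sims))
```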
Other configurations of an action classification apparatus 10 according to the present example embodiment are similar to those in the first to third example embodiments.
According to the action classification apparatus 10 of the present example embodiment, an advantageous effect similar to that in the first to third example embodiments is achieved. Further, according to the action classification apparatus 10 of the present example embodiment, a similarity between two time-series feature values for different numbers of frames can be computed with high accuracy. As a result, convenience is improved.
In the present example embodiment, a means for computing a similarity between two time-series feature values for different numbers of frames is embodied in a method different from that in the fourth example embodiment. Detailed description is as follows.
When computing a similarity between two time-series feature values for different numbers of frames, a similarity computation unit 13 computes the similarity between the two time-series feature values by executing processing illustrated in a flowchart in
In S30, the similarity computation unit 13 extracts a plurality of key frames from any number of frames of one time-series feature value.
The “key frames” are some frames among the any number of frames of the one time-series feature value. As illustrated in
In the extraction processing 1, the similarity computation unit 13 extracts a key frame, based on a user input. Specifically, a user gives an input specifying some of the plurality of frames as the key frames. Then, the similarity computation unit 13 extracts the frame specified by the user as the key frame.
In the extraction processing 2, the similarity computation unit 13 extracts the key frame in accordance with a predetermined rule.
Specifically, as illustrated in
In extraction processing 3, the similarity computation unit 13 extracts the key frame in accordance with a predetermined rule.
Specifically, as illustrated in
Next, the similarity computation unit 13 computes a similarity between the newly extracted key frame and each frame subsequent to the newly extracted key frame in the time-series order. Then, the similarity computation unit 13 extracts, as a new key frame, a frame whose similarity is equal to or less than the reference value (a matter of design) and that is earliest in the time-series order. The similarity computation unit 13 repeats this processing, and thereby extracts the plurality of key frames. According to this processing, poses of the human body in adjacent key frames differ from each other to some extent. Thus, the plurality of key frames in which characteristic poses of the human body are presented can be extracted while suppressing an increase in the number of key frames. The above-described reference value may be defined in advance, may be selected by a user, or may be set by another means.
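Extraction processing 3 may be sketched as follows; the pose similarity function is a placeholder, and the reference value of 0.6 is a hypothetical example (the reference value is a matter of design).

```python
def extract_key_frames(frames, pose_sim, reference=0.6):
    """Key frame extraction along the lines of extraction processing 3:
    the first frame in the time-series order becomes a key frame, and then the
    earliest subsequent frame whose pose similarity to the latest key frame is
    equal to or less than the reference value becomes the next key frame,
    repeatedly.

    `frames` is a list of per-frame pose feature values in time-series order,
    and `pose_sim(a, b)` returns a pose similarity."""
    if not frames:
        return []
    key_indices = [0]
    for idx in range(1, len(frames)):
        # poses of adjacent key frames differ from each other to some extent
        if pose_sim(frames[key_indices[-1]], frames[idx]) <= reference:
            key_indices.append(idx)
    return key_indices
```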
Returning to
The “key-relevance frame” is a frame including a human body in a pose whose similarity to a pose of a human body included in a key frame is equal to or higher than a predetermined level. A means for computing a similarity of poses is not particularly limited, and, for example, the means described in the third example embodiment can be employed. When Q (Q is an integer of 2 or more) key frames are extracted, Q key-relevance frames respectively associated with the Q key frames are extracted.
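A minimal sketch of determining a key-relevance frame for each key frame is given below; selecting the most similar candidate frame and the predetermined level of 0.7 are assumptions of this illustration, and None is recorded when no frame reaches the level (so that fewer key-relevance frames than key frames may be determined).

```python
def find_key_relevance_frames(key_frames, other_frames, pose_sim, level=0.7):
    """For each key frame, search the other time-series feature value for a
    key-relevance frame: a frame whose pose similarity to the key frame is
    equal to or higher than the predetermined level.

    Returns one index per key frame (or None when no frame reaches the level)."""
    result = []
    for kf in key_frames:
        sims = [pose_sim(kf, f) for f in other_frames]
        best = max(range(len(other_frames)), key=lambda i: sims[i]) if other_frames else None
        result.append(best if best is not None and sims[best] >= level else None)
    return result
```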
In
Further, in the example in
Returning to
In a first computation method, the similarity computation unit 13 computes a similarity between the two time-series feature values, based on a pose similarity.
The “pose similarity” is a similarity between a feature value of a human pose in each of a plurality of key frames and a feature value of a human pose in each of a plurality of key-relevance frames.
First, the similarity computation unit 13 computes, for each pair of a key frame and a key-relevance frame associated with each other, a similarity of feature values of human poses (pose similarity). A computation method of the pose similarity is not particularly limited, and, for example, the method described in the third example embodiment can be employed. Then, the similarity computation unit 13 computes a statistic value (an average value, a median value, a mode, a maximum value, a minimum value, and the like) of the pose similarity computed for each of a plurality of pairs, as the similarity between the two time-series feature values. Note that, the similarity computation unit 13 may compute, as the similarity between the two time-series feature values, a value acquired by standardizing the computed statistic value in accordance with a predetermined rule.
In a second computation method, the similarity computation unit 13 computes a similarity between the two time-series feature values, based on a time interval similarity.
The “time interval similarity” is a similarity between a time interval between a plurality of key frames and a time interval between a plurality of key-relevance frames.
First, concepts of the “time interval between a plurality of key-relevance frames” and the “time interval between a plurality of key frames” are described with reference to
In the illustrated example, the time interval between a plurality of key-relevance frames is a time interval between each of the first to fifth key-relevance frames.
For example, the time interval between the plurality of key-relevance frames may be a concept that includes a time interval between temporally adjacent key-relevance frames. In the example in
Alternatively, the time interval between the plurality of key-relevance frames may be a concept that includes a time interval between a temporally first key-relevance frame and a temporally last key-relevance frame. In the example in
Alternatively, the time interval between the plurality of key-relevance frames may be a concept that includes a time interval between a reference key-relevance frame determined in any method and each of the other key-relevance frames. In the example in
The “time interval between a plurality of key-relevance frames” may be any one of the above-described plurality of types of time intervals, or may include a plurality of types of time intervals. It is preliminarily defined which of the above-described plurality of types of time intervals is to be the time interval between the plurality of key-relevance frames. In the example in
A concept of the time interval between the plurality of key frames is similar to the above-described concept of the time interval between the plurality of key-relevance frames.
Note that, a time interval between two frames may be indicated by the number of frames in between the two frames, or may be indicated by an elapsed time between the two frames computed based on the number of frames between the two frames and a frame rate.
Next, a computation method of the time interval similarity will be described. When the time interval between the plurality of key-relevance frames and the time interval between the plurality of key frames are one type of time interval, the similarity computation unit 13 computes a difference in the time intervals as the time interval similarity. The difference in the time intervals is a gap or a variation rate. Note that, the similarity computation unit 13 may compute a value acquired by standardizing the computed difference in the time intervals in accordance with a predetermined rule, as the time interval similarity. In this example, the computed time interval similarity is the similarity between the two time-series feature values.
Meanwhile, when the time interval between the plurality of key-relevance frames and the time interval between the plurality of key frames include a plurality of types of time intervals, first, the similarity computation unit 13 computes, for each type of time interval, a difference in the time intervals, as the time interval similarity. The difference in the time intervals is a gap or a variation rate. After that, the similarity computation unit 13 computes a statistic value of the time interval similarity computed for each type of time interval, as the similarity between the two time-series feature values. Examples of the statistic value include an average value, a maximum value, a minimum value, a mode, a median value, and the like, but are not limited thereto. Note that, the similarity computation unit 13 may compute a value acquired by standardizing the computed statistic value in accordance with a predetermined rule, as the similarity between the two time-series feature values.
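A minimal sketch of the second computation method is given below, assuming that the same number of key-relevance frames as key frames has been determined; converting a pair of time intervals into a score in [0, 1] by their ratio, and choosing the adjacent intervals plus the first-to-last interval as the interval types, are assumptions of this sketch.

```python
import numpy as np

def interval_similarity(key_idx, rel_idx):
    """Time interval similarity: time intervals between the key frames and
    between the key-relevance frames are compared per type, and the mean of
    the per-type similarities is returned.

    `key_idx` / `rel_idx` are frame numbers of the key frames and of the
    key-relevance frames in time-series order (same length assumed)."""
    def intervals(idx):
        idx = np.asarray(idx)
        adjacent = list(np.diff(idx))   # between temporally adjacent frames
        overall = idx[-1] - idx[0]      # between the first and the last frame
        return adjacent + [overall]

    sims = []
    for a, b in zip(intervals(key_idx), intervals(rel_idx)):
        ratio = min(a, b) / max(a, b) if max(a, b) > 0 else 1.0
        sims.append(ratio)              # 1.0 when the two intervals match
    return float(np.mean(sims))
```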
In a third computation method, the similarity computation unit 13 computes a similarity between the two time-series feature values, based on a change direction similarity.
The “change direction similarity” is a similarity between a direction of change in a feature value of a human pose in a plurality of key frames and a direction of change in a feature value of a human pose in a plurality of key-relevance frames.
First, the similarity computation unit 13 computes a direction of temporal change in a feature value of a human pose in a plurality of key frames in time-series. The similarity computation unit 13 computes a direction of change in a feature value of a human pose, for example, between key frames that are adjacent in time-series order.
For example, the feature value may be a feature value of a key point that has been described with reference to
By computing the above-described direction of change in a numerical value between adjacent key frames, the similarity computation unit 13 can compute, for each key point, time-series data indicating a time-series change in a direction of change in a feature value. The time-series data are, for example, the “direction in which the numerical value increases”→the “direction in which the numerical value increases”→the “direction in which the numerical value increases”→“no change in the numerical value”→“no change in the numerical value”→the “direction in which the numerical value increases”, and the like. When the “direction in which the numerical value increases” is, for example, represented as “1”, “no change in the numerical value” is, for example, represented as “0”, and the “direction in which the numerical value decreases” is, for example, represented as “−1”, the time-series data can be represented as a numerical string, for example, “111001”.
Alternatively, a feature value of a pose may be indicated by a height and an area of a skeleton area, an angle of a predetermined joint (an angle formed by three key points), or the like. Also in this case, a direction of change in a numerical value is divided into three categories that are a “direction in which the numerical value increases”, “no change in the numerical value”, and a “direction in which the numerical value decreases”. Further, when three or more key frames are to be processed, as described above, the similarity computation unit 13 can compute time-series data indicating a time-series change in the direction of change in a feature value.
The similarity computation unit 13 computes a similarity (change direction similarity) between numerical strings computed as described above, as the similarity between the two time-series feature values. Note that, the similarity computation unit 13 may compute a value acquired by standardizing, in accordance with a predetermined rule, the similarity (change direction similarity) between the numerical strings computed as described above, as the similarity between the two time-series feature values. A computation method of a similarity between two numerical strings is not particularly limited, and, for example, a method in which each numerical string is regarded as a character string and a similarity between the two character strings is computed may be employed.
Further, when a plurality of types of the above-described numerical strings are computed (for example, a numerical string for each key point, a numerical string of angles of a plurality of joints, and the like), the similarity computation unit 13 computes a similarity (change direction similarity) between the numerical strings of each type, and then computes a statistic value of the similarities of the respective types of numerical strings, as the similarity between the two time-series feature values. The statistic value is an average value, a maximum value, a minimum value, a mode, a median value, a weighted average value, a weighted sum, or the like, but is not limited thereto. When a weighted average value or a weighted sum is computed, a weight of each type of numerical string may be set by a user, or may be defined in advance.
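A minimal sketch of the third computation method is given below; encoding the directions with the symbols “1”, “0”, and “-1” follows the representation described above, while the use of difflib.SequenceMatcher as the character string similarity is an assumption of this sketch.

```python
from difflib import SequenceMatcher

def change_direction_string(values):
    """Encode the direction of change of a feature value between key frames that
    are adjacent in the time-series order as a numerical string:
    "1" for an increase, "0" for no change, and "-1" for a decrease."""
    symbols = []
    for prev, curr in zip(values, values[1:]):
        symbols.append("1" if curr > prev else "-1" if curr < prev else "0")
    return "".join(symbols)  # e.g. "111001"

def change_direction_similarity(key_values, relevance_values):
    """Change direction similarity: the two numerical strings are regarded as
    character strings and a similarity between them is computed."""
    return SequenceMatcher(None,
                           change_direction_string(key_values),
                           change_direction_string(relevance_values)).ratio()
```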
In a fourth computation method, the similarity computation unit 13 computes a similarity between the two time-series feature values, based on a determination result of a key-relevance frame.
As described above, a key-relevance frame is a frame that includes a human body in a pose being equally or more similar to a pose of a human body included in a key frame than a predetermined level. When there are Q key frames, Q key-relevance frames may be determined, or a smaller number of key-relevance frames may be determined. Further, a time-series order of the Q key frames and a time-series order of the determined plurality of key-relevance frames may be identical, or may be different. The similarity computation unit 13 computes the similarity between the two time-series feature values, based on this viewpoint.
For example, the similarity computation unit 13 determines whether a same number of key-relevance frames as the key frames are determined. Then, the similarity computation unit 13 computes the similarity between the two time-series feature values, based on a result of the determination. When the same number of key-relevance frames as the key frames are determined, the similarity computation unit 13 computes a similarity higher than that in a case in which fewer key-relevance frames than key frames are determined. Further, when fewer key-relevance frames than key frames are determined, the similarity computation unit 13 computes a higher similarity as the number of determined key-relevance frames is greater. An algorithm for computing a similarity based on this criterion is not particularly limited, and any method can be employed.
Alternatively, the similarity computation unit 13 computes a similarity between a time-series order of the plurality of key frames and a time-series order of the plurality of key-relevance frames, as the similarity between the two time-series feature values. A computation method of the similarity between the time-series orders is not particularly limited, and, for example, a method described in the following may be employed.
The time-series order of the plurality of key frames can be indicated by using the ordinal numbers of the key frames, for example, as a numerical string such as “12345”. This numerical string indicates that the time-series order of the first to fifth key frames is “the first key frame→the second key frame→the third key frame→the fourth key frame→the fifth key frame”. Likewise, the time-series order of the plurality of key-relevance frames can also be indicated by using the ordinal numbers, for example, as a numerical string such as “12435”. This numerical string indicates that the time-series order of the first to fifth key-relevance frames is “the first key-relevance frame→the second key-relevance frame→the fourth key-relevance frame→the third key-relevance frame→the fifth key-relevance frame”. Further, the similarity computation unit 13 may compute a similarity between the time-series order of the plurality of key frames and the time-series order of the plurality of key-relevance frames by using a method in which each numerical string is regarded as a character string and a similarity between the two character strings is computed.
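A minimal sketch of this order comparison is given below; the use of difflib.SequenceMatcher as the character string similarity is an assumption, and a penalty for key frames without a determined key-relevance frame is omitted for brevity.

```python
from difflib import SequenceMatcher

def order_similarity(key_order, relevance_order):
    """Compare the time-series order of the key frames with the time-series
    order of the key-relevance frames, both written as numerical strings and
    regarded as character strings."""
    s1 = "".join(str(n) for n in key_order)
    s2 = "".join(str(n) for n in relevance_order)
    return SequenceMatcher(None, s1, s2).ratio()

# Example: order_similarity([1, 2, 3, 4, 5], [1, 2, 4, 3, 5]) -> 0.8
```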
In a fifth computation method, the similarity computation unit 13 computes a similarity between the two time-series feature values by using a plurality of the first to fourth computation methods.
The similarity computation unit 13 standardizes similarities computed by any plurality of the first to fourth computation methods in such a way as to be comparable to each other. Then, the similarity computation unit 13 computes a statistic value of the similarities computed by the respective computation methods, as the similarity between the two time-series feature values. The statistic value is an average value, a maximum value, a minimum value, a mode, a median value, a weighted average value, a weighted sum, or the like, but is not limited thereto. When a weighted average value or a weighted sum is computed, a weight of the similarity computed by each computation method may be set by a user, or may be defined in advance.
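A minimal sketch of the fifth computation method is given below; the weighted-average aggregation, the dictionary keys, and the example weights are assumptions of this sketch.

```python
def combined_similarity(similarities, weights):
    """Combine similarities obtained by two or more of the first to fourth
    computation methods (already standardized so as to be comparable) into one
    statistic value; a weighted average is used here, with weights that may be
    set by a user or defined in advance."""
    total = sum(weights.get(k, 1.0) for k in similarities)
    return sum(v * weights.get(k, 1.0) for k, v in similarities.items()) / total

# Example with hypothetical weights (cf. the "shape", "change", "length" sliders
# of the sixth example embodiment):
# combined_similarity({"pose": 0.8, "change_direction": 0.6, "time_interval": 0.9},
#                     {"pose": 1.0, "change_direction": 0.5, "time_interval": 0.5})
```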
Other configurations of an action classification apparatus 10 according to the present example embodiment are similar to those in the first to third example embodiments.
According to the action classification apparatus 10 of the present example embodiment, an advantageous effect similar to that in the first to third example embodiments is achieved. Further, according to the action classification apparatus 10 of the present example embodiment, a similarity between two time-series feature values for different numbers of frames can be computed with high accuracy. As a result, accuracy of action classification is improved.
An action classification apparatus 10 according to the present example embodiment outputs a characteristic user interface (UI) screen. Detailed description is as follows.
A classification unit 14 displays a UI screen as illustrated in
In the area in which a classification result is displayed, a result of classifying a plurality of human movements extracted by an extraction unit 11 is displayed. As described above, the classification unit 14 generates a plurality of clusters by grouping together similar movements of the plurality of human movements extracted by the extraction unit 11. In the example in
As a selection method of a representative, (1) a method of selecting a predetermined number of representatives in order from closest to a center of a cluster, (2) a method of selecting a predetermined number of representatives at random, and the like are conceivable. Further, a predetermined condition such as excluding overlapping movements of the same person from being representative may be set. A computation method of a center of a cluster is not particularly limited, and any technique can be employed.
An analyzed moving image is reproduced on the moving image confirmation screen. A reproduction position can be specified by a user. For example, a user may give an input of selecting one thumbnail from the illustrated classification result. Further, the classification unit 14 may reproduce the moving image from a beginning of a scene including a selected human movement (or from a predetermined time before that point). Note that, in the illustrated example, a key point and a bone detected from each person are superimposed on each person, but the key point and the bone may or may not be displayed.
In the area in which a UI component for receiving a user input specifying various weights is displayed, sliders corresponding to “shape”, “change”, and “length” are displayed. Further, a weight can be specified within a range of zero to one for each of “shape”, “change”, and “length”. The “shape” corresponds to the pose similarity described in the fifth example embodiment. The “change” corresponds to the change direction similarity described in the fifth example embodiment. The “length” corresponds to the time interval similarity described in the fifth example embodiment.
Note that, in this example, the three weights that are the pose similarity, the change direction similarity, and the time interval similarity can be specified, but this is one example, and a weight that can be specified is not limited thereto. Further, a weight of a determination result of a key-relevance frame described in the fifth example embodiment may be specifiable, and any two types of weights may be specifiable.
Further, in the illustrated example, a weight for each of a plurality of key points can be specified. In the diagram, the values 1 and 2 displayed in association with the key points are the weights of the respective key points. Further, a key point that is not filled in black indicates that its weight is zero (that is, the key point is not considered in similarity computation). For example, a user gives a predetermined input for each key point, and can thereby set a weight for each key point, as illustrated. Further, the user can recognize the various weights that have been set so far, from the illustrated screen.
Note that, when a user gives an input that changes various weights to the illustrated UI component, in response to the input, a similarity computation unit 13 may re-compute a similarity, based on a newly set weight. Then, the classification unit 14 may re-classify the plurality of human movements extracted from the moving image, based on the newly computed similarity, and update the illustrated classification result to a new classification result.
Other configurations of the action classification apparatus 10 according to the present example embodiment are similar to those in the first to fifth example embodiments.
According to the action classification apparatus 10 of the present example embodiment, an advantageous effect similar to that in the first to fifth example embodiments is achieved. Further, according to the action classification apparatus 10 of the present example embodiment, a user can easily set various weights, and easily recognize current settings. Further, a user can easily recognize a classification result.
Although the example embodiments of the present invention have been described with reference to the drawings, these are examples of the present invention, and various configurations other than the above-described configurations can also be employed. The configurations of the above-described example embodiments may be combined with each other, and may be partially replaced with another configuration. Further, various modifications may be added to the configurations of the above-described example embodiments without departing from the scope. Further, the configuration and the processing disclosed in each of the above-described example embodiments and modification examples may be combined with each other.
Further, a plurality of steps (pieces of processing) are described in order in a plurality of flowcharts in the above description, but an execution order of the steps executed in each example embodiment is not limited to the described order. In each example embodiment, the illustrated order of steps can be changed to an extent that does not interfere with the contents of the steps. Further, the above-described example embodiments can be combined with each other to an extent that the contents of the example embodiments do not conflict with each other.
A part or the entirety of the above-described example embodiments may be described as the following supplementary notes, but is not limited thereto.