The present invention relates to the technical field of video image processing, and in particular to a method for estimating continuous human postures based on key joint motion estimation.
In recent years, human postures in an image or a video have generally been estimated by using a deep neural network. However, a deep neural network places high demands on computer hardware and consumes substantial computing resources. Although its results are accurate, the calculation is slow and real-time performance is poor. These defects limit the application of human posture estimation and make it difficult to popularize. In addition, estimating human postures in a video with a deep neural network in essence divides the video into individual frames for calculation and does not use the prior knowledge that human posture information is continuous between frames. The result for each frame is therefore independent of the others, and the estimated posture easily jumps between frames.
A solution for extracting human postures based on the above deep neural network has the following references:
In the broad fields of image processing and video encoding, there are H.264, H.265 and other industry-recognized encoding standards and protocols. In these standards, the ultimate goal of motion estimation is to compress continuous image frames in a video by dividing a whole image into many small regions (macroblocks) and then searching for the region most similar to each macroblock. A motion estimation algorithm based on such block matching is called a block matching algorithm.
In view of the above deficiencies of prior-art human posture estimation with a deep neural network, the present invention provides a continuous human posture estimation algorithm that integrates a deep neural network human posture estimation algorithm with a block matching motion estimation algorithm. The provided algorithm may give full play to the advantages of the two technical routes, compensate for their respective disadvantages, and implement fast and accurate continuous human posture estimation.
There is provided a key joint motion estimation based method for estimating continuous human postures. A system for estimation includes two estimators:
The key joint motion estimation based method for estimating continuous human postures includes three stages:
The set threshold ε in the third stage may be set as required, and there is no unified standard.
Preferably, the estimator 1 uses a VNect, DeepPose, Stacked Hourglass or RMPE neural network model trained on the MPI-INF-3DHP data set.
Preferably, the algorithm in the estimator 2 detects coordinate changes of the key joints by using a block matching algorithm; in the block matching algorithm, the current frame is searched, based on a given matching criterion, for the block most similar to the macroblock to be matched in the previous frame; the macroblock is a small rectangular region centered on a selected key joint; the range of block matching is called a search window, which is a larger rectangular region centered on the selected key joint; and the macroblock in the search window having the minimum error with respect to the macroblock to be matched serves as the matching result.
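The geometry described above can be sketched as follows. This is a minimal illustration and not the claimed implementation: the function names, the square (rather than general rectangular) regions, the inclusive `(x0, y0, x1, y1)` bounds and the half-size parameters are all assumptions introduced here.

```python
def square_region(center, half_size):
    """Inclusive bounds (x0, y0, x1, y1) of a square centered on `center`."""
    cx, cy = center
    return (cx - half_size, cy - half_size, cx + half_size, cy + half_size)


def candidate_displacements(joint, block_half, window_half, frame_w, frame_h):
    """List displacements (dx, dy) whose shifted macroblock stays inside
    both the search window and the frame.

    All parameter names are hypothetical; the source only states that the
    macroblock and the larger search window are centered on the key joint.
    """
    jx, jy = joint
    reach = window_half - block_half  # farthest shift that stays in the window
    displacements = []
    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            x0, y0, x1, y1 = square_region((jx + dx, jy + dy), block_half)
            if x0 >= 0 and y0 >= 0 and x1 < frame_w and y1 < frame_h:
                displacements.append((dx, dy))
    return displacements


# Example: a 5x5 macroblock inside an 11x11 search window around joint (10, 10).
macroblock = square_region((10, 10), 2)
search_window = square_region((10, 10), 5)
shifts = candidate_displacements((10, 10), 2, 5, frame_w=64, frame_h=48)
```

Because only one macroblock per tracked joint is needed, the number of candidate positions depends on the window size and the number of joints, not on the image size.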
More preferably, the matching criterion in the block matching algorithm is a minimum mean square error (MSE), minimum mean absolute deviation (MAD), or minimum sum of absolute differences (SAD) criterion, defined as follows:
$$\mathrm{MSE}(\nu) = \frac{1}{|B|} \sum_{p \in B} \bigl(f(p + \nu) - f_{\text{last}}(p)\bigr)^{2}$$
$$\mathrm{MAD}(\nu) = \frac{1}{|B|} \sum_{p \in B} \bigl|f(p + \nu) - f_{\text{last}}(p)\bigr|$$
$$\mathrm{SAD}(\nu) = \sum_{p \in B} \bigl|f(p + \nu) - f_{\text{last}}(p)\bigr|$$
wherein p represents a pixel in the macroblock to be matched B, |B| represents the number of pixels in B, ν represents the motion vector between the two macroblocks that are being matched, f(a) represents the pixel value at a position a in the current video frame, and f_last(a) represents the pixel value at a position a in the previous video frame; that is, f(p + ν) is the pixel value at the position p + ν in the current video frame, and f_last(p) is the pixel value at the position p in the previous video frame.
More preferably, after the matching criterion is determined, the candidate macroblocks must actually be matched; when the block matching algorithm selects candidate macroblocks in the current frame, the candidates are chosen by using a search template. Further preferably, when the candidates are chosen by using the search template, the motion estimation search algorithm used is a three-step search method, a diamond search method, or a four-step search method.
Further preferably, the three-step search method includes the following steps:
Further preferably, the diamond search method has two different matching templates of a big diamond and a small diamond; the big diamond has nine search points, and the small diamond has only five search points; firstly, a coarse search is performed by using the big diamond search template with a larger step length, and then a fine search is performed by using the small diamond template; and the diamond search method includes the following steps:
Further preferably, the four-step search method includes the following steps:
Preferably, during definition of human key joints, a total of 21 key joints are defined as follows:
[Table of key joints numbered 0 to 20; the joint names are not preserved in this text.]
According to a basic idea of the present invention, a motion estimation block matching algorithm is applied to human joint tracking, so as to obtain continuous human posture results, while the results are continuously corrected by using a deep neural network based human posture estimator. For human key joint tracking, it is not necessary to divide the whole image into a plurality of macroblocks as in image compression; it is only required to define one macroblock centered on each selected key joint to be tracked, and then to search two adjacent image frames, according to a specific strategy, for the optimal motion estimation result of each macroblock.
The present invention may estimate the continuous human postures in a video stream, where the human postures are specifically embodied as coordinate positions of human key joints in a video frame. Compared with a posture estimation method completely relying on a deep neural network, the posture estimation method provided by the present invention has the advantages of high frame rate, low hardware requirements, and sequential continuity of recognition results; and compared with a posture estimation method completely relying on a motion estimation algorithm, the present invention may correct a cumulative error, to improve the estimation accuracy.
The video stream processed in the technical solution of the present application may be a video read from a hard disk or a real-time video acquired by a camera. When a real-time video acquired by a camera is processed, the requirements for real-time performance are higher, so the advantages of the present invention are better highlighted.
Human joints tracked in the embodiment are defined below. A total of 21 key joints are defined. Names and numbers of all the key joints are as shown in
[Table of key joints numbered 0 to 20 in two columns (0 to 10 and 11 to 20); the joint names are not preserved in this text.]
A flowchart of a key joint motion estimation based method for estimating continuous human postures in the present invention is as shown in
The core of the algorithm comprises two modules: a pretrained deep neural network posture estimator, namely an estimator (1), and a motion estimator based on the H.264 video encoding standard, namely an estimator (2).
For the estimator (1), a VNect neural network model trained on the MPI-INF-3DHP data set is used in the embodiment (other feasible network models include DeepPose, Stacked Hourglass, RMPE, etc.). The estimator (1) has a frame rate of about 30 Hz and an average coordinate error of 82.5 mm in an environment of Intel Core i5-8400 CPU and NVIDIA GeForce GTX 1060 6 GB GPU.
The estimator (2) detects coordinate changes of the key joints by using a block matching algorithm.
For the block matching algorithm, as shown in
The matching criteria frequently used in the block matching algorithm include the minimum mean square error (MSE), minimum mean absolute deviation (MAD), and minimum sum of absolute differences (SAD) criteria, defined as follows:
$$\mathrm{MSE}(\nu) = \frac{1}{|B|} \sum_{p \in B} \bigl(f(p + \nu) - f_{\text{last}}(p)\bigr)^{2}$$
$$\mathrm{MAD}(\nu) = \frac{1}{|B|} \sum_{p \in B} \bigl|f(p + \nu) - f_{\text{last}}(p)\bigr|$$
$$\mathrm{SAD}(\nu) = \sum_{p \in B} \bigl|f(p + \nu) - f_{\text{last}}(p)\bigr|$$
where p represents a pixel in the macroblock to be matched B, |B| represents the number of pixels in B, ν represents the motion vector (a relative position) between the two macroblocks that are being matched, f(a) represents the pixel value at a position a in the current video frame, and f_last(a) represents the pixel value at a position a in the previous video frame. Among these criteria, SAD is the most widely used.
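The three criteria can be illustrated with a short sketch. It assumes grayscale frames stored as lists of rows (`frame[y][x]`), a square macroblock of half-size `half` centered on the joint, and function names chosen here for illustration; none of these details come from the source. MAD and MSE are taken as SAD and the summed squared difference divided by the number of pixels in the block.

```python
def block_differences(cur, prev, joint, half, v):
    """Differences f(p + v) - f_last(p) for every pixel p of macroblock B.

    B is the square of half-size `half` centered on `joint` in the previous
    frame. The caller must keep the shifted block inside the frame, because
    negative list indices would silently wrap around in Python.
    """
    jx, jy = joint
    dx, dy = v
    return [
        cur[y + dy][x + dx] - prev[y][x]
        for y in range(jy - half, jy + half + 1)
        for x in range(jx - half, jx + half + 1)
    ]


def sad(cur, prev, joint, half, v):
    """Sum of absolute differences."""
    return sum(abs(d) for d in block_differences(cur, prev, joint, half, v))


def mad(cur, prev, joint, half, v):
    """Mean absolute deviation: SAD divided by the number of pixels."""
    diffs = block_differences(cur, prev, joint, half, v)
    return sum(abs(d) for d in diffs) / len(diffs)


def mse(cur, prev, joint, half, v):
    """Mean square error."""
    diffs = block_differences(cur, prev, joint, half, v)
    return sum(d * d for d in diffs) / len(diffs)


# Example frames: the current frame is uniformly 3 gray levels brighter.
prev_frame = [[10] * 8 for _ in range(8)]
cur_frame = [[13] * 8 for _ in range(8)]
```

For a fixed block size, SAD and MAD rank candidates identically; SAD simply avoids the division, which may be one reason it is popular in practice.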
After the matching criterion is determined, the candidate macroblocks must actually be matched. If the block matching algorithm matches every candidate macroblock in the search region of the current frame in sequence, the globally optimal match in the region is eventually found. However, this exhaustive method requires too much calculation to meet lightweight requirements and is rarely used in the field of video encoding. Instead, a search template may be used to select only some of the candidate macroblocks for matching. Classical motion estimation search algorithms include the three-step search method, the diamond search method, and the four-step search method.
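To make the cost of the exhaustive approach concrete, the following sketch matches every displacement within a square range and counts the evaluations. The frame layout (`frame[y][x]`), the SAD helper, the `reach` parameter and the synthetic test frames are assumptions made for illustration only.

```python
def sad(cur, prev, joint, half, v):
    """Sum of absolute differences between the shifted block in `cur`
    and the block centered on `joint` in `prev`."""
    jx, jy = joint
    dx, dy = v
    return sum(
        abs(cur[y + dy][x + dx] - prev[y][x])
        for y in range(jy - half, jy + half + 1)
        for x in range(jx - half, jx + half + 1)
    )


def full_search(cur, prev, joint, half, reach):
    """Evaluate every displacement in [-reach, reach]^2 and keep the best.

    Returns (best displacement, best cost, number of cost evaluations).
    """
    best_v = (0, 0)
    best_cost = sad(cur, prev, joint, half, best_v)
    evaluations = 1
    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            if (dx, dy) == (0, 0):
                continue
            cost = sad(cur, prev, joint, half, (dx, dy))
            evaluations += 1
            if cost < best_cost:
                best_v, best_cost = (dx, dy), cost
    return best_v, best_cost, evaluations


# Synthetic frames: a smooth intensity bowl centered on the joint that
# moves by (5, -3) pixels between the previous and the current frame.
SIZE = 32
JOINT = (16, 16)
prev_frame = [[(x - 16) ** 2 + (y - 16) ** 2 for x in range(SIZE)] for y in range(SIZE)]
cur_frame = [[(x - 21) ** 2 + (y - 13) ** 2 for x in range(SIZE)] for y in range(SIZE)]
result = full_search(cur_frame, prev_frame, JOINT, half=2, reach=7)
```

With a reach of 7 pixels, the exhaustive search evaluates (2 × 7 + 1)² = 225 displacements per joint per frame, which is the workload that template-based searches aim to reduce.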
The three-step search method includes the following steps:
The four-step search method includes the following steps:
The diamond search has two different matching templates of a big diamond and a small diamond, where the big diamond has nine search points, and the small diamond has only five search points. Firstly, a coarse search is performed by using the big diamond search template with a larger step length, and then a fine search is performed by using the small diamond template. The diamond search method includes the following steps:
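As a minimal sketch of a conventional diamond search with the two templates described above: the large template is applied repeatedly until its best point is the center, and the small template is then applied once. The `cost` callable (for example an SAD function of the displacement), the `max_reach` limit and all names are assumptions introduced here, and the loop structure follows the commonly described form of the method rather than the steps of this embodiment.

```python
# Offsets around the current center (the center itself is the ninth / fifth point).
LARGE_DIAMOND = [(0, -2), (-1, -1), (1, -1), (-2, 0), (2, 0), (-1, 1), (1, 1), (0, 2)]
SMALL_DIAMOND = [(0, -1), (-1, 0), (1, 0), (0, 1)]


def best_around(cost, center, center_cost, offsets, start, max_reach):
    """Return the lowest-cost point among `center` and its template offsets."""
    best_v, best_cost = center, center_cost
    for dx, dy in offsets:
        v = (center[0] + dx, center[1] + dy)
        # Skip points outside the (hypothetical) square search window.
        if max(abs(v[0] - start[0]), abs(v[1] - start[1])) > max_reach:
            continue
        c = cost(v)
        if c < best_cost:
            best_v, best_cost = v, c
    return best_v, best_cost


def diamond_search(cost, start=(0, 0), max_reach=7):
    """Coarse search with the large diamond, then one fine small-diamond step."""
    center, center_cost = start, cost(start)
    while True:
        best_v, best_cost = best_around(
            cost, center, center_cost, LARGE_DIAMOND, start, max_reach
        )
        if best_v == center:  # minimum is at the center: switch templates
            break
        center, center_cost = best_v, best_cost
    return best_around(cost, center, center_cost, SMALL_DIAMOND, start, max_reach)


# Example with a smooth stand-in cost whose minimum lies at displacement (5, -3).
example_cost = lambda v: (v[0] - 5) ** 2 + (v[1] + 3) ** 2
found = diamond_search(example_cost)
```

The cost strictly decreases every time the center moves, so the coarse loop terminates within the bounded window.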
The three-step search method based on the minimum sum of absolute difference criterion is preferred in the embodiment of the present application.
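A conventional three-step search driven by an SAD cost might look like the sketch below. The initial step of 4 pixels (halved to 2, then 1), the frame layout `frame[y][x]`, the 5x5 block and the synthetic frames are assumptions chosen for illustration and are not taken from the embodiment.

```python
def sad(cur, prev, joint, half, v):
    """Sum of absolute differences for displacement v = (dx, dy)."""
    jx, jy = joint
    dx, dy = v
    return sum(
        abs(cur[y + dy][x + dx] - prev[y][x])
        for y in range(jy - half, jy + half + 1)
        for x in range(jx - half, jx + half + 1)
    )


def three_step_search(cost, start=(0, 0), step=4):
    """Check the 8 neighbors at distance `step`, move to the best point,
    halve the step, and repeat until the step of 1 has been processed."""
    best_v, best_cost = start, cost(start)
    while step >= 1:
        cx, cy = best_v  # center is fixed for this step
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue
                v = (cx + dx, cy + dy)
                c = cost(v)
                if c < best_cost:
                    best_v, best_cost = v, c
        step //= 2
    return best_v, best_cost


# Synthetic frames: an intensity bowl centered on the joint moves by (5, -3).
SIZE = 32
JOINT = (16, 16)
prev_frame = [[(x - 16) ** 2 + (y - 16) ** 2 for x in range(SIZE)] for y in range(SIZE)]
cur_frame = [[(x - 21) ** 2 + (y - 13) ** 2 for x in range(SIZE)] for y in range(SIZE)]
cost = lambda v: sad(cur_frame, prev_frame, JOINT, 2, v)
result = three_step_search(cost)
```

With steps of 4, 2 and 1 the search covers displacements of up to 7 pixels using 1 + 3 × 8 = 25 cost evaluations. Being a greedy method, it can settle on a local minimum when the cost surface is not smooth.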
The estimator (2) is simple and fast in calculation, with a frame rate of about 50,000 Hz in the environment of Intel Core i5-8400 CPU and NVIDIA GeForce GTX 1060 6 GB GPU, far exceeding the frame rate required for real-time processing of video streams. However, its estimates may drift seriously over time, gradually deviating from the tracked target and eventually losing it.
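One hypothetical way to bound such drift is to compare the tracked joint coordinates with a reference estimate, for instance one produced by the estimator (1), and to replace any joint that deviates by more than a threshold ε. The rule, the Euclidean distance measure and the function below are assumptions for illustration; they are not a reproduction of the stages of the method.

```python
import math


def correct_drift(tracked, reference, epsilon):
    """Replace tracked joints that lie farther than `epsilon` pixels from
    the reference estimate (hypothetical correction rule).

    `tracked` and `reference` are equal-length lists of (x, y) coordinates.
    Returns the corrected list and the indices of the replaced joints.
    """
    corrected = []
    replaced = []
    for index, (t, r) in enumerate(zip(tracked, reference)):
        deviation = math.hypot(t[0] - r[0], t[1] - r[1])
        if deviation > epsilon:
            corrected.append(r)      # trust the reference for this joint
            replaced.append(index)
        else:
            corrected.append(t)      # keep the smoother tracked value
    return corrected, replaced


# Example: joint 1 has drifted 10 pixels away from the reference estimate.
tracked_joints = [(10, 10), (20, 20)]
reference_joints = [(11, 10), (30, 20)]
corrected_joints, replaced_ids = correct_drift(tracked_joints, reference_joints, epsilon=5)
```

A larger ε keeps more of the tracker's continuous output, while a smaller ε follows the reference estimator more closely.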
The video stream processed may be a video stored in a storage device and read frame by frame, or a real-time video acquired by a camera (in this case, the requirements for real-time performance are higher, so that the advantages of the present invention are better highlighted).
An algorithm flow includes the following three stages:
Number | Date | Country | Kind |
---|---|---|---|
202210418358.0 | Apr 2022 | CN | national |