The present application is based on, and claims priority from, Chinese application number 2023102166797, filed on Mar. 7, 2023, the disclosure of which is hereby incorporated by reference herein in its entirety.
The present disclosure relates to the field of computer vision and image processing, and in particular, to a method for estimating the gaze directions of multiple persons in images.
The human gaze direction is an important channel for expressing intentions and has noteworthy applications in fields such as human-computer interaction and virtual and augmented reality. Using facial images to calculate the gaze direction is currently a hot research topic, but existing methods usually assume that the input image contains only one cropped and calibrated face. If the application scenario contains multiple target persons, the gaze directions of the multiple persons cannot be calculated at real-time speed. Regarding this issue, the goal of the present technique is to simultaneously estimate the gaze directions of multiple target persons in an image through one single calculation, and ultimately achieve real-time estimation of the gaze directions of multiple persons in video.
This content of the present disclosure is provided to briefly introduce concepts that will be described in detail in the detailed description section later. This content is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
At present, appearance-based gaze direction estimation is a hot research topic, and various strategies have been applied to improve estimation accuracy, such as coarse-to-fine estimation strategies, adversarial learning methods, and self-attention mechanisms. At the same time, multiple large-scale gaze estimation image datasets have been published. Most of these images were collected in a laboratory environment with strict multi-view camera settings, participants in fixed positions, and designated gaze targets. This type of collection process usually results in these datasets containing only single-face images in limited scenarios. Correspondingly, the gaze estimation methods already proposed assume that there is only one calibrated face in the input image. However, this leads to a drawback: the running time of current gaze direction estimation methods grows in proportion to the number of faces in the input image. When there are multiple persons in the image, most methods cannot process video data at real-time speed.
The task of understanding facial images has received great attention due to its widespread applications. Many practical methods, such as face localization, facial expression recognition, and head posture estimation, have been proposed. With the development of object detection methods, one-time calculating methods that process multiple faces in parallel are favored in real-time applications due to their lightweight design and high accuracy. For example, recent face detection methods apply a single-stage structure and design more effective modules for facial features. At the same time, a plurality of corresponding large-scale facial datasets have also been published, many of which are constructed through extensive manual annotation. In addition, it has been found that performing multi-task learning (estimation of facial key points, head posture, gender, etc.) while conducting face detection is an efficient approach, because the calculations of these tasks all rely on common facial features. The technique in the present disclosure is inspired by these works and aims to develop a face-based one-time calculating method for gaze direction estimation.
Based on the above practical demands and technical difficulties, the purposes of the present disclosure are as follows: a novel method for one-time calculation of the gaze directions of multiple face regions, capable of predicting the gaze direction of a single face region or multiple face regions in an image accurately and in real-time; a novel method for generating multi-person gaze direction replacement data, capable of quickly generating a large amount of labeled realistic data for training and testing a deep learning model; based on the generated dataset, a novel multi-task learning network structure, which can simultaneously predict the gaze directions of multiple face regions in the image through one-time calculation and supports end-to-end training; and a self-supervised loss function based on two-dimensional projection, which may be used to supervise three-dimensional gaze direction estimation and improve calculation accuracy.
Some embodiments of the present disclosure provide a method for determining gaze directions of multiple persons in images, which comprises: obtaining a facial image, wherein the facial image includes at least one face region; constructing a deep network model for multi-task learning, wherein the deep network model includes a plurality of multi-task processing structures capable of parallel calculation, and the plurality of multi-task processing structures output, through one-time calculation, the gaze direction of each face region and the face position information of each face region in the facial image; training the deep network model end-to-end on a dataset to obtain a trained deep network model, wherein the trained deep network model is used to determine the gaze directions of multiple persons; and inputting the facial image to the trained deep network model to obtain the gaze direction of the at least one face region included in the facial image.
The method for determining the gaze directions of multiple face regions in images disclosed by the present disclosure has the following beneficial features compared to other gaze direction estimation methods: (1) a novel method for one-time calculation of the gaze directions of multiple face regions is invented; its operating speed is not affected by the number of faces in the image, and it can predict the gaze direction of a single face region or multiple face regions in the image accurately and in real-time; (2) a novel multi-task learning network structure is designed, which can simultaneously predict the gaze directions of multiple face regions in the image by processing the image once, as well as, at the same time, predict information such as face position and head posture; (3) a novel self-supervised loss function based on two-dimensional projection is proposed, which can pertinently supervise the estimation of the three-dimensional gaze direction and improve calculation accuracy; (4) a novel method for generating gaze direction replacement data for multiple face regions is designed, which can quickly generate a large amount of labeled realistic data for training and testing of a deep learning model; (5) the proposed network can be trained end-to-end on the constructed data, and the trained model accurately predicts gaze directions in real-time during deployment and testing.
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent in conjunction with the accompanying drawings and with reference to the following specific implementations. Throughout the drawings, the same or similar reference signs indicate the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale.
Hereinafter, the embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms, and shall not be construed as being limited to the embodiments set forth herein. On the contrary, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are used only for illustrative purposes, not to limit the protection scope of the present disclosure.
Besides, it should be noted that, for ease of description, only the portions related to the relevant invention are shown in the drawings. In the case of no conflict, the embodiments in the present disclosure and the features in the embodiments may be combined with each other.
It should be noted that such modifiers as "one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The present disclosure will be described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
Step 101, obtaining a facial image.
In some embodiments, the executing body of the multi-person gaze direction estimation method in images may obtain a facial image through a wired or wireless connection. Wherein, the facial image includes at least one face region.
Optionally, each face region in the at least one face region may include, but is not limited to, at least one of the following: facial size information, posture information, facial expression information, and gender information.
Step 102, constructing a deep network model for multi-task learning.
In some embodiments, the executing body mentioned above may construct a deep network model for multi-task learning. Wherein, the deep network model includes a plurality of multi-task processing structures capable of parallel calculations. The plurality of multi-task processing structures capable of parallel calculations output the gaze direction of each face region and the face position information of each face region in the facial image, through one-time calculation.
Optionally, the deep network model is a multi-task single-stage deep network model, which includes one encoder and multiple decoders. Wherein, the decoders in the multiple decoders are decoders that simultaneously complete different types of tasks. The deep network model is used to simultaneously determine the gaze directions of multiple face regions, and to output face position information and key point information of the multiple face regions.
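By way of illustration only, the following PyTorch sketch shows one possible form of such a single-stage, multi-task structure: a shared encoder followed by several parallel decoder heads evaluated in a single forward pass. The module names, channel widths, and anchor count are illustrative assumptions, not the disclosure's exact configuration.

```python
# Minimal PyTorch sketch (illustration only) of a single-stage multi-task model with one
# shared encoder and several parallel decoder heads evaluated in a single forward pass.
# Module names, channel widths, and the anchor count are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskGazeNet(nn.Module):
    def __init__(self, feat_channels=64, num_anchors=2):
        super().__init__()
        # Shared encoder (a stand-in for a MobileNet/ResNet backbone with a feature pyramid).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Parallel decoders: one lightweight head per task over the shared features.
        self.cls_head = nn.Conv2d(feat_channels, num_anchors * 1, 1)   # face presence score
        self.box_head = nn.Conv2d(feat_channels, num_anchors * 4, 1)   # face box offsets
        self.lmk_head = nn.Conv2d(feat_channels, num_anchors * 10, 1)  # 5 facial key points
        self.gaze_head = nn.Conv2d(feat_channels, num_anchors * 2, 1)  # gaze (theta, phi)

    def forward(self, image):
        feat = self.encoder(image)
        # One forward pass yields every per-anchor output for all faces in the image.
        return {
            "score": self.cls_head(feat),
            "box": self.box_head(feat),
            "landmarks": self.lmk_head(feat),
            "gaze": self.gaze_head(feat),
        }

outputs = MultiTaskGazeNet()(torch.randn(1, 3, 256, 256))  # dict of dense per-anchor maps
```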
Step 103, training the deep network model end-to-end on the dataset to obtain a trained deep network model.
In some embodiments, the executing body may train the deep network model end-to-end on the dataset to obtain a trained deep network model. Wherein, the trained deep network model is used to determine the gaze directions of multiple persons.
Optionally, the above dataset is generated by a generation framework for multi-person gaze direction images with replaced eye regions, which takes two types of data as input. Wherein, one type of data is single-person image data with gaze direction labels, and the other type of data is multi-person image data with multiple face regions. The generation framework is used to automatically cluster the single-person image data based on at least one of gender information, race information, age information, and head posture information, for easy retrieval. The generation framework is also used to retrieve, for each face region in the multi-person image data, the single-person image data that is closest to the face region, and to replace the eye region, so as to generate a corresponding gaze direction label.
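As a hedged illustration of the automatic clustering described above, the sketch below buckets single-person images by coarse attribute keys so that candidate faces for a given face region can be retrieved quickly; the specific attribute names and bin widths are assumptions made for illustration.

```python
# Illustrative sketch of the automatic clustering step: single-person images are bucketed
# by coarse attribute keys so candidates for a given face region can be retrieved quickly.
# The attribute names ("gender", "race", "age", "head_yaw_deg") and bin widths are assumptions.
from collections import defaultdict

def attribute_key(attrs, pose_bin_deg=15, age_bin=10):
    # Coarse, hashable key combining discrete and binned continuous attributes.
    return (attrs["gender"], attrs["race"],
            int(attrs["age"] // age_bin),
            int(attrs["head_yaw_deg"] // pose_bin_deg))

def build_index(single_face_items):
    # single_face_items: iterable of (image_id, attribute dict) from the single-person gaze data.
    index = defaultdict(list)
    for image_id, attrs in single_face_items:
        index[attribute_key(attrs)].append(image_id)
    return index

def candidates_for(face_attrs, index):
    # Single-person images in the same attribute bucket serve as retrieval candidates.
    return index.get(attribute_key(face_attrs), [])
```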
Optionally, the overall loss function during the end-to-end training process may be obtained by the following formula:

L = αLface + βLgaze
Wherein, L represents the overall loss function. α represents the first adjustable hyperparameter. Lface represents the loss function related to face position information and key point information. β represents the second adjustable hyperparameter. Lgaze represents the loss function related to the gaze direction. Wherein, the first adjustable hyperparameter and the second adjustable hyperparameter may be set according to actual needs.
Wherein, Lgaze may be obtained by the following formula:

Lgaze = λ1Lself + λ2∥yg − yg*∥1
Wherein, Lgaze represents the loss function related to the gaze direction. λ1 represents the first hyperparameter used to balance different loss terms. Lself represents the self-supervised loss function. λ2 represents the second hyperparameter used to balance different loss terms. yg represents the gaze direction of the face region. yg* represents the truth label. ∥·∥1 represents the L1 norm. Wherein, the first hyperparameter used to balance different loss terms and the second hyperparameter used to balance different loss terms may be set according to actual needs.
Step 104, inputting the facial image to the trained deep network model to obtain the gaze direction of at least one face region included in the facial image.
In some embodiments, the executing body may input the facial image in real-time to the trained deep network model in a deployment environment, so as to accurately obtain, through one-time calculation, the gaze direction of the at least one face region included in the facial image.
Optionally, the above method may also comprise: for the gaze direction of each face region among the gaze directions of the above at least one face region, performing the following determination steps:
The first step is to determine the positions yF,yT,yS of the gaze projection points of the gaze direction yg=(θ,ϕ) of the above face region on the front, top and side projection planes. Wherein, yF represents the position of the gaze projection point in the front direction. yT represents the position of the gaze projection point in the top direction. yS represents the position of the gaze projection point in the side direction. F represents the front direction. T represents the top direction. S represents the side direction. yg represents the gaze direction of the face region mentioned above. θ represents the angle of nutation of the gaze direction. ϕ represents the angle of rotation of the gaze direction.
The second step is to determine whether the positions yF,yT,yS of the gaze projection points on the front, top and side projection planes are equal to the three projections of the three-dimensional gaze prediction values. Wherein, the three projections of the three-dimensional gaze prediction values may be obtained by the following formula:
Wherein, Π represents the projection function. F represents the front direction. θ represents the angle of nutation in the gaze direction. ϕ represents the angle of rotation in the gaze direction. ΠF(θ,ϕ) represents projecting the gaze direction onto the front plane. sin ϕ represents the sine value of ϕ. cos θ represents the cosine value of θ. T represents the top direction. ΠT(θ,ϕ) represents projecting the gaze direction onto the top plane. cos ϕ represents the cosine value of ϕ. S represents the side direction. ΠS(θ,ϕ) represents projecting the gaze direction onto the side plane. sin θ represents the sine value of θ.
Optionally, the above deep network model includes a self-supervised loss function, which is obtained by the following formula:
Wherein, Lself represents the self-supervised loss function. τ represents the front, top, or side direction, and takes a value in {F,T,S}. F represents the front direction. T represents the top direction. S represents the side direction. yτ represents the position of the gaze projection point in the τ direction. Π represents the projection function. yg represents the gaze direction of the face region. Πτ(yg) represents the projection of the three-dimensional gaze direction onto the two-dimensional τ plane. ∥·∥1 represents the L1 norm. e represents the natural constant. pτ represents a trainable correction coefficient for the τ projection, and e^(−pτ) represents e raised to the power of −pτ.
See
The system operation flowchart in the present disclosure is shown in
The specific contents of the innovations made in the various steps are explained in combination with relevant attached drawings as follows:
The present disclosure adopts a method for estimating the gazes of multiple persons by one-time calculation. Unlike other gaze estimation models that process face regions one by one, this model is the first model that estimates the gazes of multiple persons through one-time calculation. This model may accurately estimate the gaze directions of multiple target objects in the image in real-time during the testing process.
Firstly, for the input image, as shown in
To be specific, the multi-scale feature extraction module may be realized through a feature pyramid structure during the implementation process. The extracted features are calculated by applying top-down and lateral connections to the outputs of the depthwise separable convolution MobileNet model or the residual network ResNet at different stages. The feature pyramid structure includes two modules, namely a bottom-up feature extraction module and a top-down feature aggregation module. In the bottom-up process, features are extracted from the image by a multi-stage convolutional module. Then, feature aggregation is carried out through the feature aggregation module. For the output of each feature layer, this technique adds a context module to increase the receptive field. By doing so, rich semantic features are extracted from the image, and faces of different scales may be processed equally.
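A compact sketch of such a top-down, lateral-connection aggregation with a per-level context module is shown below, assuming a PyTorch implementation; the three backbone stage widths and the simple 3×3 context convolution are illustrative assumptions, not the disclosure's exact configuration.

```python
# Compact sketch of a top-down / lateral-connection feature aggregation (feature-pyramid
# style) with a small per-level context module, assuming a PyTorch implementation. The
# three backbone stage widths and the 3x3 context convolution are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        # Lateral 1x1 convolutions project each bottom-up stage to a common channel width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # A context module per level enlarges the receptive field of the aggregated features.
        self.context = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels])

    def forward(self, c3, c4, c5):
        # Top-down pathway: upsample the coarser map and add the lateral projection.
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [ctx(p) for ctx, p in zip(self.context, (p3, p4, p5))]

feats = FeaturePyramid()(torch.randn(1, 64, 64, 64),
                         torch.randn(1, 128, 32, 32),
                         torch.randn(1, 256, 16, 16))
```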
After feature extraction, a 1×1 convolution is used in the multi-task parallel processing module as the calculation for each downstream task. For the face detection task, three calculation modules are used during the implementation process: a classification module, a positioning module, and a key point module, which are respectively used to estimate the possibility yp of face existence, the position yb of the face region, and the positions yl of the facial key points. This technique also includes a three-dimensional gaze estimation module for calculating yg, and three additional two-dimensional gaze projection point estimation modules for calculating yF,yT,yS. For each training anchor box i, the goal of this technique is to minimize the loss L=αLface+βLgaze. Wherein, L represents the overall loss function. α represents the first adjustable hyperparameter. Lface represents the loss function related to face position information and key point information. β represents the second adjustable hyperparameter. Lgaze represents the loss function related to the gaze direction. Wherein, Lface=Lclass(ypi,ypi*)+λ1ypi*Lbox(ybi,ybi*)+λ2ypi*Llandmark(yli,yli*). Wherein, Lface represents the loss function related to face position information and key point information. Lclass represents the loss function used for classification. i represents the index of the training anchor box. ypi represents the predicted possibility of presence of the ith face. ypi* represents the possibility of presence of the ith face labeled in the dataset. λ1 represents the first hyperparameter used to balance different loss terms. Lbox represents the loss function used for face bounding box regression. ybi represents the predicted position of the ith face region. ybi* represents the position of the ith face region labeled in the dataset. λ2 represents the second hyperparameter used to balance different loss terms. Llandmark represents the loss function used to calculate key points. yli represents the position of the ith facial key point predicted by the network. yli* represents the position of the ith facial key point labeled in the dataset.
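The following sketch illustrates how the per-anchor face loss and the overall combination L = αLface + βLgaze described above could be assembled; the concrete choices of binary cross-entropy for classification and smooth-L1 for boxes and key points are common defaults assumed here, not necessarily the disclosure's exact losses.

```python
# Hedged sketch of the per-anchor face loss L_face = L_class + lam1*yp*·L_box + lam2*yp*·L_landmark
# and the overall combination L = alpha*L_face + beta*L_gaze described above. The concrete loss
# choices (binary cross-entropy for classification, smooth-L1 for boxes and key points) are
# common defaults and are assumptions here.
import torch
import torch.nn.functional as F

def face_loss(pred, target, lam1=1.0, lam2=1.0):
    # pred/target: dicts of per-anchor tensors, e.g. score [N], box [N, 4], landmarks [N, 10].
    l_class = F.binary_cross_entropy_with_logits(pred["score"], target["score"])
    pos = target["score"]  # yp*: 1 for anchors that contain a face, 0 otherwise
    l_box = (pos * F.smooth_l1_loss(pred["box"], target["box"],
                                    reduction="none").mean(-1)).mean()
    l_lmk = (pos * F.smooth_l1_loss(pred["landmarks"], target["landmarks"],
                                    reduction="none").mean(-1)).mean()
    return l_class + lam1 * l_box + lam2 * l_lmk

def total_loss(pred, target, l_gaze, alpha=1.0, beta=1.0):
    # Overall training objective: L = alpha * L_face + beta * L_gaze.
    return alpha * face_loss(pred, target) + beta * l_gaze
```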
The present disclosure designs a novel projection-based self-supervised loss function that may improve the prediction accuracy of the gaze direction. The loss function is Lgaze, and Lgaze may be defined as:

Lgaze = λ1Lself + λ2∥yg − yg*∥1
Wherein, Lgaze represents the loss function related to the gaze direction. λ1 represents the first hyperparameter used to balance different loss terms. Lself represents the self-supervised loss function. λ2 represents the second hyperparameter used to balance different loss terms. yg represents the gaze direction of the face region. yg* represents the truth label. ∥·∥1 represents the L1 norm.
Here, Lself is a projection-based self-supervised loss designed for three-dimensional gaze direction. To be specific, as shown in
Wherein, GS represents the two-dimensional gaze sensitivity. g represents the three-dimensional gaze direction. x represents the projection of the gaze in the image coordinate system. dg/dx represents the rate of change of g with respect to x. φ represents the angle of rotation. dφ/dx represents the rate of change of φ with respect to x. r represents the radius of the unit circle.
The two-dimensional gaze sensitivity defines the rate of change of g caused by a change in x, and this formula indicates that the farther the position of x is from the origin, the greater the sensitivity GS will be. At the implementation level, the objective function of gaze direction estimation not only predicts the three-dimensional gaze direction, but also calculates the positions of the gaze projection points on the three projection planes, i.e. yF,yT,yS. The calculation method is given by the following formula:
Wherein, Π represents the projection function. F represents the front direction. θ represents the angle of nutation in the gaze direction. ϕ represents the angle of rotation in the gaze direction. ΠF(θ,ϕ) represents projecting the gaze direction onto the front plane. sin ϕ represents the sine value of ϕ. cos θ represents the cosine value of θ. T represents the top direction. ΠT(θ,ϕ) represents projecting the gaze direction onto the top plane. cos ϕ represents the cosine value of ϕ. S represents the side direction. ΠS(θ,ϕ) represents projecting the gaze direction onto the side plane. sin θ represents the sine value of θ.
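For illustration only, the sketch below projects a gaze direction given by (θ, ϕ) onto front, top, and side planes under the commonly used pitch-yaw unit-vector convention; the disclosure's exact projection formulas are those defined by its equations, so this parameterization is an assumption.

```python
# Illustration only: projecting a gaze direction (theta, phi) onto front, top, and side
# planes under the commonly used pitch-yaw unit-vector convention
# g = (cos(theta)*sin(phi), sin(theta), cos(theta)*cos(phi)); the disclosure's exact
# projection formulas are those given by its equations, so this convention is an assumption.
import math

def gaze_vector(theta, phi):
    return (math.cos(theta) * math.sin(phi),   # x: horizontal component
            math.sin(theta),                   # y: vertical component
            math.cos(theta) * math.cos(phi))   # z: depth component

def project(theta, phi):
    x, y, z = gaze_vector(theta, phi)
    return {"F": (x, y),   # front-plane projection point y_F (drop depth)
            "T": (x, z),   # top-plane projection point y_T (drop vertical)
            "S": (y, z)}   # side-plane projection point y_S (drop horizontal)

print(project(theta=0.1, phi=-0.3))
```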
By limiting the positions yF,yT,yS of three two-dimensional gaze projection points to be the same as the projection position of the three-dimensional gaze prediction value yg, the implementation method may be achieved by the following loss function:
Wherein, Lself represents the self-supervised loss function. τ represents the front, top, or side direction, and takes a value in {F,T,S}. F represents the front direction. T represents the top direction. S represents the side direction. yτ represents the position of the gaze projection point in the τ direction. Π represents the projection function. yg represents the gaze direction of the face region. Πτ(yg) represents the projection of the three-dimensional gaze direction onto the two-dimensional τ plane. ∥·∥1 represents the L1 norm. e represents the natural constant. pτ represents a trainable correction coefficient for the τ projection, and e^(−pτ) represents e raised to the power of −pτ.
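The sketch below shows one way such a projection-consistency loss with trainable correction coefficients pτ could be written, assuming the familiar uncertainty-weighted form e^(−pτ)·∥yτ − Πτ(yg)∥1 plus a pτ regularizer, together with the Lgaze combination described above; the exact expressions are those defined by the disclosure's equations.

```python
# One possible form of the projection-consistency self-supervised loss with trainable
# correction coefficients p_tau, assuming the familiar uncertainty-weighted expression
# exp(-p_tau) * ||y_tau - Pi_tau(y_g)||_1 + p_tau, together with the L_gaze combination
# described above; the exact expressions are those defined by the disclosure's equations.
import torch
import torch.nn as nn

class SelfSupervisedGazeLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # One trainable correction coefficient p_tau per projection plane (F, T, S).
        self.p = nn.ParameterDict({t: nn.Parameter(torch.zeros(1)) for t in ("F", "T", "S")})

    def forward(self, y_proj, y_gaze_proj):
        # y_proj[tau]:      2D projection point predicted directly by the network head.
        # y_gaze_proj[tau]: projection Pi_tau(y_g) of the predicted 3D gaze direction.
        loss = 0.0
        for tau in ("F", "T", "S"):
            l1 = (y_proj[tau] - y_gaze_proj[tau]).abs().sum(dim=-1).mean()
            loss = loss + torch.exp(-self.p[tau]) * l1 + self.p[tau]  # assumed regularizer term
        return loss

def gaze_loss(l_self, y_g, y_g_true, lam1=1.0, lam2=1.0):
    # L_gaze = lam1 * L_self + lam2 * ||y_g - y_g*||_1, as described above.
    return lam1 * l_self + lam2 * (y_g - y_g_true).abs().mean()
```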
In innovation point (2), a new multi-person gaze replacement process framework is proposed. Existing datasets either contain only facial information (such as bounding boxes, key points, etc.), or contain only a single calibrated face with gaze direction labels. Therefore, the first challenge faced by this technique is to construct a dataset that includes both multiple target persons and the corresponding gaze direction truth labels, thereby merging the advantages of both the face dataset and the gaze dataset. The framework of this process is shown in
Firstly, facial attribute calculations are performed for each suitable face in Widerface and ETH-XGaze:

A = F(I)
Wherein, A represents the set of facial attributes calculated from the input image. F represents a feature extractor. I represents the input image.
Wherein, A is a set of facial attributes calculated from a single-face image, including facial key point information, head posture information, age information, gender information, and race information.
Next, for each suitable face fw in the Widerface dataset, retrieve the closest face in ETH-XGaze. During the retrieval process, the retrieval results are guided by a scoring formula. The scoring formula is as follows:
Wherein, e represents an image to be retrieved from the gaze dataset, and w represents the query image from the face dataset. S(fe,fw) represents the scoring function. fe represents the features of image e. fw represents the features of image w. τ denotes the attribute type, namely key points or head posture. ατ is a weighting coefficient determined by comparing experimental results. a represents a facial attribute. aτ,w represents the attribute τ of the query face fw, and aτ,e represents the attribute τ of the candidate face fe retrieved from the gaze dataset.
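As a hedged illustration of this retrieval step, the sketch below scores each candidate face by a weighted distance over its key-point and head-posture attributes against the query face and returns the closest candidate; the attribute names and the default weights ατ are illustrative assumptions.

```python
# Hedged illustration of the attribute-based retrieval: candidate faces from the gaze
# dataset are scored by a weighted distance over facial attributes (here key points and
# head posture) against the query face from the face dataset, and the closest candidate
# is kept. The attribute names and default weights alpha_tau are illustrative assumptions.
import numpy as np

def score(attrs_e, attrs_w, alpha=None):
    # Lower score = candidate face f_e better matches the query face f_w.
    alpha = alpha or {"landmarks": 1.0, "head_pose": 1.0}
    return sum(alpha[tau] * np.linalg.norm(np.asarray(attrs_e[tau]) - np.asarray(attrs_w[tau]))
               for tau in alpha)

def retrieve_closest(query_attrs, candidates):
    # candidates: list of (image_id, attribute dict) from the single-face gaze dataset.
    return min(candidates, key=lambda item: score(item[1], query_attrs))
```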
Then, in the process of replacing the eye region with one having gaze direction truth labels, a gaze replacement method is designed to generate a synthetic facial image fw′ with a ground-truth gaze direction label g′. Through the affine transformation, the gaze direction truth attribute of fe may be retained.
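A minimal sketch of such an affine eye-region replacement is given below, assuming OpenCV and three matching eye-region key points per face; the key-point layout, the binary region mask, and the hard paste without blending are illustrative assumptions rather than the disclosure's exact procedure.

```python
# Minimal sketch of the eye-region replacement via an affine transformation, assuming
# OpenCV and three matching eye-region key points per face (e.g. the two outer eye
# corners and a point between the eyes); the key-point layout, the binary region mask,
# and the hard paste without blending are illustrative assumptions.
import cv2
import numpy as np

def replace_eye_region(target_img, donor_img, target_pts, donor_pts, donor_eye_mask):
    # Affine transform mapping the donor's eye-region key points onto the target's.
    M = cv2.getAffineTransform(np.float32(donor_pts[:3]), np.float32(target_pts[:3]))
    h, w = target_img.shape[:2]
    warped = cv2.warpAffine(donor_img, M, (w, h))
    warped_mask = cv2.warpAffine(donor_eye_mask, M, (w, h))
    # Paste the warped eye region; because the transform is affine, the donor's
    # gaze-direction label can be carried over to the synthesized face.
    out = target_img.copy()
    out[warped_mask > 0] = warped[warped_mask > 0]
    return out
```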
In the process of constructing the data, the present disclosure adopts 24,282 faces in the Widerface training set for gaze replacement, while other faces that are too small, too blurry, or have too many facial occlusions do not undergo gaze replacement. Compared with other known gaze datasets, the advantage of the newly proposed composite dataset lies in that each image contains one or more target persons, and the entire training set contains about 20,000 target persons. Besides, it has the advantage of a face detection dataset, which means that its data collection environments are very diverse.
Although the specific implementations of the present disclosure have been described above to facilitate the understanding of the present disclosure by those ordinarily skilled in the art, it should be clear that the present disclosure is not limited to the scope of the specific implementations. For those ordinarily skilled in the art, various changes that remain within the spirit and scope of the present disclosure as defined by the attached claims are obvious, and all inventions and creations that utilize the concept of the present disclosure fall within the scope of protection.