This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-099500, filed on Jun. 15, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a facial expression determination technology.
Facial expressions play an important role in nonverbal communication. Facial expression estimation is an indispensable technology for developing a computer that understands and supports people. To estimate a facial expression, a method of describing the facial expression has to be specified first. Action units (AUs) are known as such a method. An AU represents a facial movement involved in a facial expression, defined based on anatomical findings about the facial muscles. Technologies for estimating AUs have also been proposed.
A representative form of an AU estimation engine employs machine learning based on a large amount of training data. Image data of facial expressions, together with the occurrence (occurrence or non-occurrence) and intensity (occurrence intensity) of each AU obtained as a determination result of the facial expression, are used as the training data. The occurrence and intensity in the training data are annotated by a specialist called a coder.
Japanese Laid-open Patent Publication No. 2018-036734, Japanese Laid-open Patent Publication No. 2020-057111, U.S. patent Ser. No. 10/339,369, and U.S. Patent Publication No. 2019/213403 are disclosed as related art.
According to an aspect of the embodiment, a non-transitory computer-readable recording medium stores a program causing a computer to execute a process, the process includes acquiring a taken image including a face with a plurality of markers attached, determining whether a movement amount of a first marker, among the plurality of markers, in a first direction is equal to or greater than a first threshold, based on a first position of the first marker in the taken image and criteria for an occurrence state of a first facial muscle movement, determining whether a movement amount of a second marker, among the plurality of markers, in a second direction is less than a second threshold, based on a second position of the second marker in the taken image and the criteria for the occurrence state of the first facial muscle movement, determining that there is occurrence as to an occurrence state of the first facial muscle movement upon determining that the movement amount of the first marker in the first direction is equal to or greater than the first threshold and the movement amount of the second marker in the second direction is less than the second threshold, and determining that there is no occurrence as to the occurrence state of the first facial muscle movement upon determining that the movement amount of the first marker in the first direction is equal to or greater than the first threshold and the movement amount of the second marker in the second direction is equal to or greater than the second threshold.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The existing method has a problem in that it may be difficult to generate training data for facial expression estimation. For example, annotation by a coder is costly and time-consuming, making it difficult to create large amounts of data. It is also difficult to accurately capture small changes in the movement of each part of a face by image processing of a face image, and it is difficult for a computer to determine a facial expression from the face image without human judgment.
Hereinafter, examples according to the embodiment will be described in detail based on the drawings. The examples do not limit the present embodiment. The embodiment will be described by taking AUs as an example, but is not limited to AUs.
A configuration of a determination system according to the present embodiment will be described with reference to
As illustrated in
The determination device 10 acquires the image taken by the RGB camera 31 and the result of motion capture by the IR camera 32. The determination device 10 determines an occurrence intensity 121 of an AU, and outputs, to the machine learning device 20, the occurrence intensity 121 and an image 122 obtained by removing the markers from the taken image through image processing. For example, the occurrence intensity 121 may be data that is expressed using a 6-level rating system with a scale from 0 to 5 for each AU and is annotated such as “AU1: 2, AU2: 5, AU4: 0, . . . ”. The occurrence intensity 121 may also be data that is expressed using 0 for no occurrence and a 5-level rating system with a scale from A to E for each AU, and is annotated such as “AU1: B, AU2: E, AU4: 0, . . . ”. The occurrence intensity is not limited to such a 6-level rating system, and may be expressed based on, for example, a 2-level evaluation (occurrence or non-occurrence).
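As a concrete illustration of the label formats described above, the following is a minimal Python sketch of how such annotations might be represented; the AU names and values are hypothetical examples and are not actual annotation data.

```python
# Hypothetical occurrence-intensity annotation for one taken image, using the
# 6-level (0-5) rating system described above. AU names and values are examples.
occurrence_intensity = {"AU1": 2, "AU2": 5, "AU4": 0}

# The same annotation on the alternative scale (0 for no occurrence, A-E otherwise).
occurrence_intensity_letters = {"AU1": "B", "AU2": "E", "AU4": 0}

# A 2-level (occurrence / non-occurrence) evaluation derived from the 0-5 scale.
occurrence = {au: int(v > 0) for au, v in occurrence_intensity.items()}
print(occurrence)  # {'AU1': 1, 'AU2': 1, 'AU4': 0}
```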
The machine learning device 20 performs machine learning using the image 122 and the AU occurrence intensity 121 outputted from the determination device 10 as training data, and generates a machine learning model for calculating an estimated value of the AU occurrence intensity from the image. The machine learning device 20 may use the AU occurrence intensity as a label. The processing performed by the machine learning device 20 may be performed by the determination device 10. In this case, the machine learning device 20 does not have to be included in the determination system 1.
Arrangement of cameras will be described with reference to
A plurality of markers are attached to the face of the subject to be imaged so as to cover target AUs (for example, AU1 to AU28). The positions of the markers change with changes in the facial expression of the subject. For example, a marker 401 is arranged near the base of an eyebrow. A marker 402 and a marker 403 are arranged near the marionette lines. The markers may be placed on skin regions corresponding to the movement of one or more AUs and facial muscles. The markers may be arranged so as to avoid areas of the skin surface where the texture changes significantly due to wrinkles or the like.
The subject wears an instrument 40 with a reference marker. It is assumed that the position of the reference marker attached to the instrument 40 does not change even if the facial expression of the subject changes. Therefore, the determination device 10 may detect changes in the positions of the markers attached to the face based on a change in position relative to the reference marker. The determination device 10 may also identify the coordinates on a plane or in space of each marker based on the positional relationship with the reference marker. The determination device 10 may determine the marker positions from a reference coordinate system or a projection position of a reference plane. By setting the number of the reference markers to three or more, the determination device 10 may identify the marker positions in a three-dimensional space.
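One possible way to realize the relative-position handling described above is sketched below in Python. The use of NumPy, the construction of a coordinate frame from three reference markers, and the sample values are assumptions for illustration, not a description of the actual implementation.

```python
import numpy as np

def reference_frame(ref_pts):
    """Build an orthonormal frame from three reference markers (assumed rigid).

    ref_pts: (3, 3) array, one row per reference marker in camera coordinates.
    Returns (origin, 3x3 matrix whose rows are the frame axes).
    """
    origin = ref_pts[0]
    x = ref_pts[1] - ref_pts[0]
    x = x / np.linalg.norm(x)
    z = np.cross(x, ref_pts[2] - ref_pts[0])
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)
    return origin, np.vstack([x, y, z])

def to_reference_coords(marker_pos, ref_pts):
    """Express a face-marker position in the reference-marker frame so that
    rigid head motion does not appear as marker movement."""
    origin, axes = reference_frame(ref_pts)
    return axes @ (np.asarray(marker_pos) - origin)

# Hypothetical positions in arbitrary units.
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(to_reference_coords([0.3, 0.2, 0.1], ref))  # [0.3 0.2 0.1]
```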
The instrument 40 is, for example, a headband, and the reference marker is placed outside the contour of the face. The instrument 40 may be a VR headset, a mask made of a hard material, or the like. In that case, the determination device 10 may use a rigid surface of the instrument 40 as the reference marker.
The determination device 10 determines whether or not each of a plurality of AUs has occurred, based on AU criteria and the positions of the plurality of markers. The determination device 10 also determines the occurrence intensity for one or more AUs determined to have occurred among the plurality of AUs.
For example, the determination device 10 determines an occurrence intensity of a first AU based on a movement amount of a first marker that is included in the criteria and substantially moves when the first AU has occurred. The movement amount of the first marker is calculated based on the distance between a reference position of the first marker and the position of the first marker. The first marker may be said to be one or a plurality of markers corresponding to a specific AU. The first marker may include a marker used to determine both the occurrence and the occurrence intensity of the AU, and a marker used only to determine the occurrence of the AU without affecting its occurrence intensity. In this case, when both the marker used to determine the occurrence and occurrence intensity of the AU and the marker used only to determine the occurrence of the AU have movements of a threshold or more, it may be determined that the AU has occurred. The AU occurrence intensity may be determined based only on the movement amount of the marker used to determine the AU occurrence and occurrence intensity.
The AU criteria indicate, for example, one or more markers used to determine the AU occurrence intensity for each AU, among the plurality of markers. The AU criteria may include reference positions of the plurality of markers. The AU criteria may include a relationship (conversion rule) between the occurrence intensity and the movement amount of the marker used to determine the occurrence intensity for each of the plurality of AUs. The reference position of the marker may be determined according to each position of the plurality of markers in the taken image of the subject in an expressionless state (no AUs have occurred).
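The following is a minimal sketch, in Python, of how the AU criteria and the movement amount described above might be represented; the marker identifier, the association with AU4, the determination direction, and the numerical values are all hypothetical.

```python
import numpy as np

# Hypothetical criteria: for each AU, the marker used for its determination,
# the determination direction (specified vector), and a movement-amount threshold.
criteria = {
    "AU4": {
        "marker": "m401",                    # marker near the base of the eyebrow
        "direction": np.array([0.0, -1.0]),  # downward determination direction
        "threshold": 0.5,                    # arbitrary units
    },
}

# Reference positions taken from a taken image in the expressionless state.
reference_positions = {"m401": np.array([10.0, 20.0])}

def movement_amount(marker_id, current_position):
    """Movement amount = distance between the reference position of the marker
    and its current position in the taken image."""
    return float(np.linalg.norm(np.asarray(current_position)
                                - reference_positions[marker_id]))

print(movement_amount("m401", [10.0, 19.2]))  # approximately 0.8
```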
The movement of the marker will be described with reference to
As illustrated in
Various rules are conceivable for the determination device 10 to convert the variation into the occurrence intensity. The determination device 10 may perform conversion according to one predetermined rule, or may perform conversion according to a plurality of rules and adopt the one with the highest occurrence intensity.
For example, the determination device 10 may acquire the maximum variation, which is the variation when the subject changes the facial expression to the maximum, and convert the variation into the occurrence intensity based on a ratio of the variation to the maximum variation. The determination device 10 may determine the maximum variation using data tagged by the coder with an existing method. The determination device 10 may linearly convert the variation into the occurrence intensity. The determination device 10 may perform conversion using an approximate expression created from preliminary measurements of a plurality of subjects.
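As one illustration of the ratio-based, linear conversion rule mentioned above, a minimal sketch follows; the 0-5 output scale matches the rating system described earlier, while the rounding behavior and sample values are assumptions.

```python
def intensity_from_ratio(variation, max_variation):
    """Convert a marker variation into a 0-5 occurrence intensity based on the
    ratio of the variation to the subject's maximum variation (linear rule)."""
    if max_variation <= 0:
        return 0
    ratio = min(max(variation / max_variation, 0.0), 1.0)
    return round(ratio * 5)

# Example: a variation of 60% of the maximum variation maps to intensity 3.
print(intensity_from_ratio(0.6, 1.0))  # 3
```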
For example, the determination device 10 may determine the occurrence intensity based on a movement vector of the first marker calculated from the reference position set in the criteria and the position of the first marker. In this case, the determination device 10 determines the occurrence intensity of the first AU based on the degree of matching between the movement vector of the first marker and a specified vector associated with the first AU. The determination device 10 may use an existing AU estimation engine to correct the correspondence between the magnitude of the vector and the occurrence intensity.
The method of determining the occurrence intensity of AU will be described more specifically.
In
In some cases, the movement vectors of the markers are dispersed and do not completely match the determination direction of the specified vector.
As illustrated in
However, even if the movement vectors are dispersed, the occurrence intensity of the AU corresponding to the specified vector may be determined by calculating the inner product of the movement vector and the specified vector. In
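A minimal sketch of the inner-product evaluation described above follows; normalizing the specified vector to unit length is an assumption made here so that the result can be read directly as a movement amount in the determination direction.

```python
import numpy as np

def directional_movement(movement_vector, specified_vector):
    """Inner product of the marker movement vector with the (unit-normalized)
    specified vector: only the component of the movement along the AU
    determination direction contributes, so dispersed movement directions can
    still be evaluated consistently."""
    u = np.asarray(specified_vector, dtype=float)
    u = u / np.linalg.norm(u)
    return float(np.dot(np.asarray(movement_vector, dtype=float), u))

# Example: a movement of (0.3, -0.4) evaluated against a downward
# determination direction (0, -1) yields 0.4.
print(directional_movement([0.3, -0.4], [0.0, -1.0]))
```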
As described above, the AU occurrence intensity may be determined based on the inner product of the specified vector and the movement vector of the marker, and on the variation in the distance between markers. However, a marker that substantially moves when the target AU occurs may also move in response to another facial expression, for example, upon occurrence of another AU even when the target AU has not occurred. Therefore, the occurrence of an AU that has not actually occurred may be falsely detected.
In order to avoid such false detection of AU occurrence, one conceivable approach is to provide a separation boundary in the vector space and determine the AU based on a region including the movement vector, for example.
When referring to the determination result of
Therefore, in the present embodiment, in order to determine the target AU, the AU determination is made based on not only the movement of the marker that substantially moves when the target AU occurs, but also the movement of other markers.
A functional configuration of the determination device 10 according to the present embodiment will be described with reference to
The input unit 11 receives data input via an input device such as the RGB camera 31, the IR camera 32, a mouse, and a keyboard, for example. For example, the input unit 11 receives an image taken by the RGB camera 31 and the result of motion capture by the IR camera 32. The output unit 12 outputs data to an output device such as a display, for example. For example, the output unit 12 outputs the occurrence intensity 121 of AU and the image 122 obtained by removing markers through image processing from the taken image.
The storage unit 13 has a function to store data, programs to be executed by the controller 14, and the like, and is implemented by a storage device such as a hard disk or a memory, for example. The storage unit 13 stores AU information 131 and an AU occurrence intensity estimation model 132.
The AU information 131 represents the correspondence between markers and AUs. For example, a reference position of each marker, one or more AUs corresponding to each marker, and the direction and magnitude of a specified vector of each AU are stored in association with each other.
The AU occurrence intensity estimation model 132 is a machine learning model generated by machine learning with the taken image, from which the markers are removed, as a feature amount and the AU occurrence intensity as a correct label.
The controller 14 is a processing unit that controls the entire determination device 10, and includes an acquisition unit 141, a calculation unit 142, a determination unit 143, and a generation unit 144.
The acquisition unit 141 acquires the taken image including a face. For example, the acquisition unit 141 acquires a group of continuously taken images including the face of a subject with markers attached at a plurality of reference positions corresponding to the plurality of AUs. The taken images acquired by the acquisition unit 141 are taken by the RGB camera 31 and the IR camera 32 as described above.
The subject changes his/her facial expressions as the images are taken by the RGB camera 31 and the IR camera 32. In this event, the subject may freely change the facial expressions, or may change the facial expressions according to a predetermined scenario. As a result, the RGB camera 31 and the IR camera 32 may take images of how the facial expressions change in chronological order. The RGB camera 31 may also shoot a video. For example, the video may be regarded as a plurality of still images arranged in chronological order.
The calculation unit 142 calculates a movement vector based on the positions of the markers included in the taken image. For example, the calculation unit 142 derives a movement amount and a movement direction of the markers moved by the change in the facial expression of the subject from the reference positions of the markers in the taken image. The calculation unit 142 calculates an inner product of the movement vector and the specified vector indicating the AU determination direction.
As described above, the determination unit 143 determines the occurrence intensity of the AU corresponding to each of the specified vectors based on each specified vector. The determination unit 143 may also determine whether or not an AU has occurred, based on not only the occurrence intensity but also whether the movement amount of the marker indicated by the movement vector or the inner product of the movement vector and the specified vector exceeds a predetermined threshold. This point will be described using a specific example with reference to
As described above, there may be a case where the determination of the target AU is not performed properly only with the movement of the marker that substantially moves when the target AU occurs. Therefore, the determination unit 143 performs determination of the target AU based on the movement of not only a marker that substantially moves when the target AU has occurred, but also a marker other than that marker (which may be hereinafter referred to as a “cancel marker”). For example, as will be described specifically later, the determination unit 143 determines whether or not the target AU has occurred, based on the movement of the cancel marker.
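A minimal sketch of the cancel-marker judgment described above is given below; the threshold values in the example are hypothetical, and the movement amounts are assumed to already be measured in the respective determination directions.

```python
def determine_occurrence(first_movement, cancel_movement,
                         first_threshold, cancel_threshold):
    """Occurrence is determined only when the first marker (which substantially
    moves when the target AU occurs) has moved by the first threshold or more
    AND the cancel marker has moved by less than the cancel threshold;
    otherwise the occurrence of the target AU is cancelled."""
    return first_movement >= first_threshold and cancel_movement < cancel_threshold

# Examples with hypothetical movement amounts and thresholds.
print(determine_occurrence(0.8, 0.1, first_threshold=0.5, cancel_threshold=0.3))  # True
print(determine_occurrence(0.8, 0.6, first_threshold=0.5, cancel_threshold=0.3))  # False
```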
The AU determination using the cancel marker will be described more specifically with reference to
In the example of
Although
Next, an example in a case of canceling the occurrence of the AU12 will be described.
First, in the example of
Although the AU determination using the cancel marker has been described with reference to
In the example of
When a predetermined threshold is set for the determination of the distance between the markers 406 and 407 and the distance between the markers becomes equal to or lower than the threshold, the determination unit 143 may determine that the distance between the markers is reduced. The threshold may be a value different from the threshold of the inner product for the specified vector of AU, and may be determined for each subject based on the position of each marker during the expressionless state.
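The per-subject, distance-based judgment described above might look like the following sketch; the marker numbers, the 0.9 ratio, and the coordinate values are hypothetical.

```python
import numpy as np

def distance_reduced(pos_a, pos_b, reference_distance, ratio=0.9):
    """Judge whether the distance between two markers (for example, markers 406
    and 407) has become equal to or lower than a threshold. The threshold is
    defined as a fraction of the distance measured in the expressionless state,
    so it is determined for each subject; the ratio of 0.9 is hypothetical."""
    distance = float(np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b)))
    return distance <= ratio * reference_distance

# Example: the inter-marker distance shrinks from 2.0 to 1.5, so it is judged reduced.
print(distance_reduced([0.0, 0.0], [0.0, 1.5], reference_distance=2.0))  # True
```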
Next, an example in a case of canceling the occurrence of the AU09 will be described.
First, in the example of
Referring back to
The generation unit 144 may remove the markers by using a mask image.
The method of removing the markers by the generation unit 144 is not limited to that described above. For example, the generation unit 144 may generate a mask image by detecting the position of a marker based on a specified shape of the marker. The relative positions of the IR camera 32 and the RGB camera 31 may be calibrated in advance. In this case, the generation unit 144 may detect the position of the marker based on information of marker tracking by the IR camera 32.
The generation unit 144 may adopt different detection methods depending on the marker. For example, since a marker on the nose moves little and its shape is easy to recognize, the generation unit 144 may detect its position by shape recognition. Since a marker on the side of the mouth moves greatly and its shape is hard to recognize, the generation unit 144 may detect its position using a method of extracting the representative color.
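One conceivable realization of the mask-based marker removal described above is sketched below using OpenCV; the color-range mask, the dilation, and the inpainting are assumptions for illustration, since the embodiment may instead detect markers by shape recognition or by IR tracking.

```python
import cv2
import numpy as np

def remove_markers(image_bgr, marker_lower, marker_upper, inpaint_radius=3):
    """Remove markers from a taken image using a mask image.

    The mask is built by extracting a representative marker color range
    (marker_lower / marker_upper are hypothetical BGR bounds), slightly
    enlarged, and the masked region is then filled by inpainting."""
    mask = cv2.inRange(image_bgr, marker_lower, marker_upper)
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8))  # cover marker edges too
    return cv2.inpaint(image_bgr, mask, inpaint_radius, cv2.INPAINT_TELEA)

# Example usage with a dummy image and a hypothetical green-ish marker color.
image = np.zeros((64, 64, 3), np.uint8)
cleaned = remove_markers(image,
                         np.array([0, 200, 0], np.uint8),
                         np.array([80, 255, 80], np.uint8))
```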
Next, with reference to
As illustrated in
Next, the determination device 10 calculates a movement vector based on the position of the marker included in the taken image acquired in step S101 (step S102).
Then, the determination device 10 calculates an inner product of the movement vector acquired in step S102 and a specified vector corresponding thereto (step S103). The calculation of the inner product is executed for each AU.
Thereafter, the determination device 10 determines cancellation of the AU occurrence (step S104). As for the determination of the cancellation of AU occurrence, as described with reference to
When it is determined that the AU occurrence is cancelled (step S104: Yes), the determination device 10 determines that no target AU has occurred (step S105). After the execution of step S105, the determination processing illustrated in
On the other hand, when it is determined that the AU occurrence is not cancelled (step S104: No), it is determined whether or not the target AU has occurred based on the inner product calculated in step S103 (step S106). When it is determined that no target AU has occurred since the inner product calculated in step S103 is less than the predetermined threshold (step S106: No), the determination device 10 determines that no target AU has occurred (step S105), and the determination processing illustrated in
On the other hand, when it is determined that the target AU has occurred (step S106: Yes), the determination device 10 calculates the occurrence intensity of the target AU based on the inner product calculated in step S103 (step S107). The calculation of the AU occurrence intensity in step S107 may be executed based on the distance between markers instead of the inner product. After the execution of step S107, the determination processing illustrated in
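The flow of steps S103 to S107 for a single target AU can be summarized by the following sketch; the threshold values, the normalization by a maximum inner product, and the linear intensity rule are assumptions used only to make the sketch self-contained.

```python
def determine_au(movement_inner_product, cancel_inner_product,
                 occurrence_threshold, cancel_threshold, max_inner_product):
    """Returns the occurrence intensity of one target AU on a 0-5 scale
    (0 = no occurrence), following the order of the flowchart described above."""
    # Steps S104-S105: cancel the occurrence when the cancel marker has moved
    # by the threshold or more.
    if cancel_inner_product >= cancel_threshold:
        return 0
    # Step S106: no occurrence when the inner product is below the threshold.
    if movement_inner_product < occurrence_threshold:
        return 0
    # Step S107: intensity from the inner product (linear rule as one example).
    ratio = min(movement_inner_product / max_inner_product, 1.0)
    return max(1, round(ratio * 5))

# Example with hypothetical values: occurrence with intensity 4.
print(determine_au(0.8, 0.1, occurrence_threshold=0.3,
                   cancel_threshold=0.2, max_inner_product=1.0))  # 4
```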
As described above, the determination device 10 acquires a taken image including a face with a plurality of markers attached. The determination device 10 determines whether a movement amount of a first marker, among the plurality of markers, in a first direction is equal to or greater than a first threshold, based on a first position of the first marker in the taken image and criteria for an occurrence state of a first facial muscle movement. The determination device 10 determines whether a movement amount of a second marker, among the plurality of markers, in a second direction is less than a second threshold, based on a second position of the second marker in the taken image and the criteria for the occurrence state of the first facial muscle movement. The determination device 10 determines that there is occurrence as to an occurrence state of the first facial muscle movement upon determining that the movement amount of the first marker in the first direction is equal to or greater than the first threshold and the movement amount of the second marker in the second direction is less than the second threshold. The determination device 10 determines that there is no occurrence as to the occurrence state of the first facial muscle movement upon determining that the movement amount of the first marker in the first direction is equal to or greater than the first threshold and the movement amount of the second marker in the second direction is equal to or greater than the second threshold.
Thus, even when it is difficult to perform adequate detection of a target AU only by the determination based on the movement of the marker that substantially moves when the target AU has occurred, the determination device 10 may perform more accurate facial expression determination based on the face image.
In the processing of determining that there is occurrence as to the occurrence state of the first facial muscle movement, the determination device 10 determines an occurrence intensity for the occurrence state of the first facial muscle movement, based on the movement amount of the first marker in the first direction.
Thus, the determination device 10 may perform more accurate facial expression determination based on the face image.
The determination device 10 generates training data for machine learning based on an image obtained by removing the first marker and the second marker from the taken image and on the occurrence intensity.
Thus, the determination device 10 may generate the training data for a machine learning model that enables more accurate facial expression determination based on the face image.
In the processing of determining that there is no occurrence as to the occurrence state of the first facial muscle movement, the determination device 10 determines the occurrence intensity for the occurrence state of the first facial muscle movement to be a value corresponding to no occurrence upon determining that there is no occurrence as to the occurrence state of the first facial muscle movement.
Thus, the determination device 10 may perform more accurate facial expression determination based on the face image.
The determination device 10 further determines whether a movement amount of a third marker, among the plurality of markers, in a third direction is equal to or greater than a third threshold, based on a third position of the third marker in the taken image and criteria for an occurrence state of a second facial muscle movement. The determination device 10 determines whether a movement amount of a fourth marker, among the plurality of markers, in a fourth direction is equal to or greater than a fourth threshold, based on a fourth position of the fourth marker in the taken image and the criteria for the occurrence state of the second facial muscle movement. The determination device 10 determines that there is occurrence as to an occurrence state of the second facial muscle movement upon determining that the movement amount of the third marker in the third direction is equal to or greater than the third threshold and the movement amount of the fourth marker in the fourth direction is equal to or greater than the fourth threshold. The determination device 10 determines an occurrence intensity for the occurrence state of the second facial muscle movement based on only the movement amount of the third marker in the third direction among the movement amount of the third marker in the third direction and the movement amount of the fourth marker in the fourth direction.
Thus, the determination device 10 may perform more accurate facial expression determination based on the face image.
The determination device 10 determines whether or not the facial muscle movement has occurred, based on a distance between the third marker and the fourth marker among the plurality of markers.
Thus, even when it is difficult to perform adequate detection of a target AU only by the determination based on the movement of the marker that substantially moves when the target AU has occurred, the determination device 10 may perform more accurate facial expression determination based on the face image.
The determination device 10 determines the first threshold and the second threshold based on the positions of the plurality of markers when the face is expressionless.
Thus, even for subjects having different face sizes, the determination device 10 may perform more accurate facial expression determination based on the face image.
Unless otherwise specified, processing procedures, control procedures, specific names, and information including various types of data and parameters described above in this document or in the drawings may be arbitrarily changed. The specific examples, distributions, numerical values, and so forth described in the embodiment are merely exemplary and may be arbitrarily changed.
The specific form of distribution or integration of units included in each apparatus is not limited to that illustrated in the drawings. For example, the calculation unit 142 of the determination device 10 may be distributed to a plurality of processing units, or the calculation unit 142 and the determination unit 143 of the determination device 10 may be integrated into one processing unit. For example, all or part of the units may be configured so as to be functionally or physically distributed or integrated in arbitrary units in accordance with various types of loads, usage states, or the like. All or an arbitrary part of the processing functions performed by each apparatus may be implemented by a central processing unit (CPU) and a program analyzed and executed by the CPU or may be implemented by hardware using wired logic.
The communication interface 10a is a network interface card or the like and performs communication with other servers. The HDD 10b stores a program or DB that operates the functions illustrated in
The processor 10d is a CPU, a microprocessor unit (MPU), a graphics processing unit (GPU), or the like. Alternatively, the processor 10d may be implemented by an integrated circuit such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). The processor 10d is a hardware circuit that reads, from the HDD 10b or the like, the program for executing processes similar to those of the processing units illustrated in
The determination device 10 may also implement functions similar to those of the above-described embodiment by reading out the above-described programs from a recording medium with a medium reading device and executing the read programs. The programs described in other embodiments are not limited to the programs to be executed by the determination device 10. For example, the above-described embodiment may be similarly applied when another computer or a server executes the program, or when the other computer and the server cooperate with each other to execute the program.
The programs may be distributed over a network such as the Internet. The programs may be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a compact disc read-only memory (CD-ROM), a magneto-optical (MO) disk, or a digital versatile disc (DVD), and may be executed by being read from the recording medium by a computer.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.