This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2020-002467, filed on Jan. 9, 2020, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a training data generating technique.
In nonverbal communication, facial expressions play an important role. An expression estimating technique is indispensable for understanding and sensing people. A technique called action units (AUs) is known as a tool for estimating expressions. An AU is a technique for quantifying and decomposing expressions based on facial regions and the muscles of facial expressions.
An AU estimating engine is constructed by machine learning performed on a large amount of teacher data. Image data of facial expressions, together with the Occurrence (presence or absence of occurrence) and Intensity (occurrence intensity) of each AU, are used as the teacher data. Furthermore, the Occurrence and Intensity in the teacher data are annotated by specialists called coders.
For example, a related technique is disclosed in Patent Document 1: Japanese Laid-open Patent Publication No. 2011-237970.
Another related technique is disclosed in X. Zhang, L. Yin, J. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard, "BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database," Image and Vision Computing, 32, 2014, 1, 692-705.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores therein a training data generating program that causes a computer to execute a process including: acquiring a captured image including a face; specifying a position of a marker included in the captured image; selecting a first action unit from among a plurality of action units based on a judgment criterion of an action unit and the position of the specified marker; generating an image by performing image processing of deleting the marker from the captured image; and generating training data for machine learning by adding information on the first action unit to the generated image.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
FIG. 12 is a flowchart illustrating the flow of a generating process performed on training data.
With the related technique, there is a problem in that it may sometimes be difficult to generate teacher data that is used to estimate AUs. For example, cost and time are needed for annotation performed by coders; therefore, it is difficult to generate a large amount of data. Furthermore, it is difficult to accurately find a small change by measuring the movement of each facial region through image processing on a facial image, and it is thus difficult for a computer to judge AUs from the facial image without judgement by persons. Therefore, it is difficult for the computer to generate teacher data in which labels of AUs are added to a face image.
Preferred embodiments will be explained with reference to the accompanying drawings. Furthermore, the present invention is not limited to the embodiments. Furthermore, the embodiments can be used in any appropriate combination as long as the processes do not conflict with each other.
A configuration of a machine learning system according to an embodiment will be described with reference to
As illustrated in
The generating device 10 acquires the results of the images captured by the RGB camera 31 and the motion captured by the IR cameras 32. Then, the generating device 10 outputs, to the machine learning device 20, an occurrence intensity 121 of each AU and an image 122 in which the markers are deleted from the captured image by performing image processing. For example, the occurrence intensity 121 may be data in which the occurrence intensity of each AU is indicated by a five-level evaluation using A to E and annotated, for example, as "AU 1: 2, AU 2: 5, AU 4: 1, . . . ". The occurrence intensity is not limited to a five-level evaluation and may also be indicated by, for example, a two-level evaluation (presence or absence of occurrence).
The machine learning device 20 performs machine learning by using the image 122 and the occurrence intensity 121 of each AU output from the generating device 10 and generates a model that is used to estimate the occurrence intensity of each AU from an image. The machine learning device 20 can use the occurrence intensity of each AU as a label.
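The following is a minimal Python sketch of such a training step, given purely as an illustration: the model architecture (AUEstimator), the number of AUs, and the numeric 0 to 5 intensity encoding are assumptions and do not represent the embodiment's actual model or framework.

import torch
import torch.nn as nn

NUM_AUS = 28  # assumption: AU 1 to AU 28, as in the marker arrangement described later

class AUEstimator(nn.Module):
    # Small illustrative CNN that maps a face image to one intensity score per AU.
    def __init__(self, num_aus=NUM_AUS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_aus)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = AUEstimator()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a dummy batch: images 122 as inputs and the
# occurrence intensities 121 (here encoded as 0 to 5 per AU) as labels.
images = torch.rand(8, 3, 128, 128)
labels = torch.randint(0, 6, (8, NUM_AUS)).float()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()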
Here, the arrangement of the cameras will be described with reference to
Furthermore, a plurality of markers are attached to the face of the subject whose image is captured so as to cover target AUs (for example, an AU 1 to an AU 28). The positions of the markers change in accordance with a change in the expression of the subject. For example, a marker 401 is arranged in the vicinity of the root of an eyebrow (glabella). Furthermore, a marker 402 and a marker 403 are arranged in the vicinity of the smile line (nasolabial fold). The markers may be arranged on the skin associated with one or more AUs and with the motions of the muscles of facial expressions. Furthermore, the markers may be arranged so as to avoid skin where the change in texture is large due to, for example, wrinkling.
Furthermore, the subject wears an instrument 40 to which reference point markers are attached. It is assumed that the positions of the reference point markers attached to the instrument 40 do not change even if the expression of the subject changes. Consequently, the generating device 10 can detect a change in the positions of the markers attached to the face based on a change in their relative positions from the reference point markers. Furthermore, by setting the number of reference point markers to three or more, the generating device 10 can specify the positions of the markers in three-dimensional space.
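The following Python sketch illustrates one way, under the stated assumptions, to express a face-marker position relative to three reference point markers so that head motion is not counted as marker movement; the function names and the data layout are assumptions made for illustration.

import numpy as np

def face_frame(ref_points):
    # Build an orthonormal, face-fixed frame from three reference point markers
    # (ref_points: array of shape (3, 3), one row per reference marker).
    origin = ref_points[0]
    x_axis = ref_points[1] - origin
    x_axis = x_axis / np.linalg.norm(x_axis)
    z_axis = np.cross(x_axis, ref_points[2] - origin)
    z_axis = z_axis / np.linalg.norm(z_axis)
    y_axis = np.cross(z_axis, x_axis)
    return origin, np.stack([x_axis, y_axis, z_axis])

def to_face_coords(marker_xyz, ref_points):
    # Map a marker position from camera coordinates into the face-fixed frame,
    # so only the movement caused by the expression remains.
    origin, axes = face_frame(ref_points)
    return axes @ (np.asarray(marker_xyz, dtype=float) - origin)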
The instrument 40 is, for example, a headband, in which the reference point markers are arranged outside the facial contour. Furthermore, the instrument 40 may also be a VR headset, a mask formed of a rigid material, or the like. In this case, the generating device 10 can use the rigid surface of the instrument 40 as the reference point markers.
A functional configuration of the generating device 10 will be described with reference to
The input unit 11 is an interface that is used to input data. For example, the input unit 11 receives an input of data via an input device, such as a mouse or a keyboard. Furthermore, the output unit 12 is an interface that is used to output data. For example, the output unit 12 outputs data to an output device, such as a display.
The storage unit 13 is an example of a storage device that stores therein data or programs executed by the control unit 14 and is, for example, a hard disk or a memory. The storage unit 13 stores therein AU information 131. The AU information 131 is information indicating an association relationship between markers and AUs.
The control unit 14 is implemented by, for example, a central processing unit (CPU), a micro processing unit (MPU), a graphics processing unit (GPU), or the like executing, in a RAM as a work area, a program that is stored in an internal storage device. Furthermore, the control unit 14 may also be implemented by, for example, an integrated circuit, such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA). The control unit 14 includes an acquiring unit 141, a specifying unit 142, a judgement unit 143, an image generating unit 144, and a training data generating unit 145.
The acquiring unit 141 acquires a captured image including a face. For example, the acquiring unit 141 acquires a captured image including a face to which a plurality of markers are attached at a plurality of positions associated with a plurality of AUs. The acquiring unit 141 acquires images captured by the RGB camera 31.
Here, when images are captured by the IR cameras 32 and the RGB camera 31, the subject changes expressions. Consequently, the generating device 10 can acquire, as images, the state in which the expressions change in time series. Furthermore, the RGB camera 31 may also capture a moving image. The moving image is assumed to be a plurality of still images arranged in time series. Furthermore, the subject may change expressions freely or may change expressions in accordance with a predetermined scenario.
The specifying unit 142 specifies the positions of the markers included in the captured image. The specifying unit 142 specifies each of the positions of the plurality of markers included in the captured image. Furthermore, when a plurality of images are acquired in time series, the specifying unit 142 specifies the positions of the markers in each of the images. Furthermore, the specifying unit 142 can specify the coordinates of each of the markers on a plane or space based on the positional relationship with the reference point markers attached to the instrument 40. Furthermore, the specifying unit 142 may also determine the positions of the markers based on a reference coordinate system or based on a projection position of the reference plane.
The judgement unit 143 judges the presence or absence of occurrence of each of the plurality of AUs based on a judgment criterion of the AUs and the positions of the plurality of markers. The judgement unit 143 judges an occurrence intensity related to one or more AUs that occur from among the plurality of AUs. At this time, if the judgement unit 143 judges, based on the judgment criterion and the position of the marker, that occurrence is present in an AU associated with a marker from among the plurality of AUs, the judgement unit 143 can select the AU associated with the subject marker.
For example, the judgement unit 143 judges an occurrence intensity of a first AU based on an amount of movement of a first marker calculated based on a distance between the reference position of the first marker that is associated with the first AU included in the judgment criterion and the position of the first marker specified by the specifying unit 142. Furthermore, it can be said that the first marker is one or a plurality of markers associated with specific AUs.
The judgment criterion of AUs indicates, for example, from among the plurality of markers, one or the plurality of markers used to judge an occurrence intensity of each AU. The judgment criterion of AUs may also include the reference positions of the plurality of markers.
Regarding each of the plurality of AUs, the judgment criterion of AUs may also include a relationship (conversion rule) between an occurrence intensity and an amount of movement of a marker that is used to judge the occurrence intensity. Furthermore, the reference position of each marker may be determined in accordance with the position of each of the plurality of markers in a captured image in which the subject is in a lack-of-expression state (none of the AUs occur).
Here, movements of markers will be described with reference to
As illustrated in
Furthermore, variations in the distance of the marker 401 from the reference point marker in the X direction and the Y direction are represented by the tables illustrated in
Various rules can be considered for the judgement unit 143 to convert an amount of variation into an occurrence intensity. The judgement unit 143 may perform conversion in accordance with a predetermined single rule, or may perform conversion based on a plurality of rules and use the largest occurrence intensity.
For example, the judgement unit 143 may previously acquire the maximum amount of variation, that is, the amount of variation obtained when the subject changes its expression to the maximum, and convert the amount of variation into an occurrence intensity based on the ratio of the amount of variation to the maximum amount of variation. Furthermore, the judgement unit 143 may determine the maximum amount of variation by using data to which a coder attaches tags by using a related technique. Furthermore, the judgement unit 143 may linearly convert an amount of variation into an occurrence intensity. Furthermore, the judgement unit 143 may perform conversion by using an approximate expression generated from measurements of a plurality of subjects obtained in advance.
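The following Python sketch shows one such conversion rule, a linear mapping of the ratio of the amount of movement to the maximum amount of variation onto a five-level intensity; the function names and the numeric encoding (0 for absence, 1 to 5 for intensity) are assumptions made for illustration.

import numpy as np

def movement_amount(reference_pos, current_pos):
    # Amount of movement: distance between the reference position and the
    # position specified from the captured image.
    return float(np.linalg.norm(np.asarray(current_pos, dtype=float)
                                - np.asarray(reference_pos, dtype=float)))

def to_intensity(amount, max_amount, levels=5):
    # Linear conversion rule (assumption): the ratio of the amount of movement
    # to the maximum amount of variation is mapped to 0 (no occurrence)
    # through `levels` (maximum intensity).
    if max_amount <= 0:
        return 0
    ratio = min(amount / max_amount, 1.0)
    return int(round(ratio * levels))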
Furthermore, for example, the judgement unit 143 can judge an occurrence intensity based on a movement vector of the first marker calculated based on the position that is previously set as the judgment criterion and the position of the first marker specified by the specifying unit 142. In this case, the judgement unit 143 judges the occurrence intensity of the first AU based on the degree of match between the movement vector of the first marker and the vector that is previously associated with the first AU. Furthermore, the judgement unit 143 may also correct the association between the magnitude of the vector and the occurrence intensity by using an existing AU estimating engine.
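As a sketch of this vector-based judgement, the following Python function scores the degree of match as the cosine between the movement vector and the vector previously associated with the AU, and scales the result by the relative magnitude of the movement; treating the length of the associated vector as the maximal expected movement is an assumption made for illustration.

import numpy as np

def intensity_from_vector(movement_vec, au_vec, levels=5):
    # Degree of match between the marker's movement vector and the vector
    # previously associated with the AU (cosine), combined with the relative
    # magnitude of the movement. Returns 0 (no occurrence) to `levels`.
    movement_vec = np.asarray(movement_vec, dtype=float)
    au_vec = np.asarray(au_vec, dtype=float)
    norm_m = np.linalg.norm(movement_vec)
    norm_a = np.linalg.norm(au_vec)
    if norm_m == 0 or norm_a == 0:
        return 0
    cosine = float(np.dot(movement_vec, au_vec) / (norm_m * norm_a))
    if cosine <= 0:  # movement away from the AU's direction: no occurrence
        return 0
    ratio = min(norm_m / norm_a, 1.0)  # assumption: |au_vec| is the maximal movement
    return int(round(cosine * ratio * levels))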
Furthermore, for example, as illustrated in
Furthermore, the generating device 10 may also output an occurrence intensity by associating the occurrence intensity with the image that has been subjected to image processing. In this case, the image generating unit 144 generates an image by performing image processing in which a marker is deleted from a captured image.
The image generating unit 144 can delete a marker by using a mask image.
Furthermore, the method of deleting markers performed by the image generating unit 144 is not limited to the method described above. For example, the image generating unit 144 may also detect a position of a marker based on the shape of a predetermined marker and generate a mask image. Furthermore, it may also be possible to previously perform calibration on the relative positions of the IR cameras 32 and the RGB camera 31. In this case, the image generating unit 144 can detect the position of the marker from the information on the marker tracking received from the IR cameras 32.
Furthermore, the image generating unit 144 may also use different detecting methods depending on the markers. For example, for a marker above the nose, the movement is small and the shape is thus easy to recognize; therefore, the image generating unit 144 may detect the position by recognizing the shape. Furthermore, for a marker beside the mouth, it is difficult to recognize the shape; therefore, the image generating unit 144 may detect the position by using a method of extracting the representative color.
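The following Python sketch, using OpenCV, illustrates one possible combination of these steps: a mask image is built from a representative color, and the masked pixels are filled in from their surroundings (inpainting). The color range and the use of inpainting are assumptions made for illustration and are not the embodiment's prescribed processing.

import cv2
import numpy as np

def delete_markers(image_bgr):
    # Detect marker regions by a representative color (here an assumed green
    # range in HSV, e.g. for green round seals) and build a mask image.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([40, 60, 60], dtype=np.uint8)
    upper = np.array([80, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # Slightly enlarge the mask so that the marker edges are also covered.
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=1)
    # Delete the markers by filling the masked pixels from the surrounding skin.
    return cv2.inpaint(image_bgr, mask, 5, cv2.INPAINT_TELEA)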
The training data generating unit 145 generates training data for machine learning by attaching information related to the first AU to the generated image. For example, the training data generating unit 145 generates training data for machine learning by attaching the occurrence intensity of the first AU judged by the judgement unit 143 to the generated image. Furthermore, the machine learning device 20 may also perform training by adding the training data generated by the training data generating unit 145 to existing training data.
For example, the training data containing an image as an input can be used for training of an estimation model for estimating AUs that occur. Furthermore, the estimation model may also be a model dedicated to each of the AUs. When an estimation model is dedicated to a specific AU, the generating device 10 may change the generated training data to training data in which only the information related to the specific AU is used for the teacher labels. Namely, regarding an image in which another AU that is different from the specific AU occurs, the generating device 10 can delete the information on the other AU and add information, as a teacher label, indicating that the specific AU does not occur.
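A minimal Python sketch of this relabeling follows; the dictionary-based label layout and the function name are assumptions made for illustration.

def to_single_au_labels(samples, target_au="AU 4"):
    # samples: list of (image, intensities) pairs, where intensities is, e.g.,
    # {"AU 1": 2, "AU 2": 5, "AU 4": 0, ...}. Keep only the specific AU; an
    # image in which only other AUs occur becomes a "does not occur" example.
    converted = []
    for image, intensities in samples:
        label = intensities.get(target_au, 0)  # 0 means the specific AU does not occur
        converted.append((image, {target_au: label}))
    return converted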
According to the embodiment, it is possible to estimate the quality and the amount of training data that are needed. In general, an enormous calculation cost is needed to perform machine learning. The calculation cost includes the usage of time, GPUs, and the like. If the quality and the amount of the data sets are improved, the accuracy of a model obtained by training is improved. Consequently, if it is possible to roughly estimate, in advance, the quality and the amount of data sets required for a target accuracy, the calculation cost can be reduced. Here, for example, the quality of the data sets refers to the deletion rate and the deletion accuracy of the markers. Furthermore, for example, the amount of the data sets refers to the number of data sets and subjects.
Among the combinations of AUs, there may be a combination having a high correlation. Consequently, it is assumed that an estimation performed with respect to a certain AU can be applied to another AU having a high correlation with that AU. For example, it is known that an AU 18 has a high correlation with an AU 22, and an associated marker may possibly be common. Consequently, if it is possible to estimate data sets having the quality and the amount sufficient to reach the target estimation accuracy of the AU 18, it is possible to roughly estimate the quality and the amount of data sets sufficient to reach the target estimation accuracy of the AU 22.
The machine learning device 20 performs machine learning by using training data generated by the generating device 10, and then generates a model for estimating an occurrence intensity of each AU from an image. Furthermore, an estimating device 60 actually performs estimation by using the model generated by the machine learning device 20.
A functional configuration of the estimating device 60 will be described with reference to
The input unit 61 is a device or an interface for inputting data. For example, the input unit 61 is a mouse or a keyboard. Furthermore, the output unit 62 is a device or an interface for outputting data. For example, the output unit 62 is a display or the like that is used to display a screen.
The storage unit 63 is an example of a storage device that stores therein data, programs, or the like executed by the control unit 64 and is, for example, a hard disk or a memory. The storage unit 63 stores therein model information 631. The model information 631 is parameters or the like that construct the model generated by the machine learning device 20.
The control unit 64 is implemented by, for example, a CPU, an MPU, a GPU, or the like executing, in a RAM as a work area, a program that is stored in an internal storage device. Furthermore, the control unit 64 may also be implemented by, for example, an integrated circuit, such as an ASIC or an FPGA. The control unit 64 includes an acquiring unit 641 and an estimating unit 642.
The acquiring unit 641 acquires a first captured image that includes a face. For example, the first captured image is an image in which the face of a person is captured and the occurrence intensity of each AU is unknown.
The estimating unit 642 inputs the first captured image to a machine learning model that is generated by machine learning performed based on training data in which information on the first AU, selected based on the judgment criterion of each AU and the position of each marker included in a captured image, is used as a teacher label. Then, the estimating unit 642 acquires the output of the machine learning model as the estimation result of the expression of the face.
For example, the estimating unit 642 acquires data, such as "AU 1: 2, AU 2: 5, AU 4: 1, . . . ", expressed by a five-level evaluation in which the occurrence intensity of each AU is indicated by A to E. Furthermore, the output unit 62 outputs the estimation result acquired by the estimating unit 642.
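The following Python sketch shows an assumed inference interface for this step; the model is any trained model whose output is one score per AU (for example, the illustrative AUEstimator sketched earlier), and the clamping to the 0 to 5 range is an assumption about the label encoding.

import torch

def estimate(model, image_tensor):
    # image_tensor: a (3, H, W) float tensor for the first captured image.
    # Returns an assumed result format: one intensity level (0 to 5) per AU.
    model.eval()
    with torch.no_grad():
        scores = model(image_tensor.unsqueeze(0)).squeeze(0)
    return {f"AU {i}": max(0, min(5, int(round(s))))
            for i, s in enumerate(scores.tolist(), start=1)}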
The flow of a process performed by the generating device 10 will be described with reference to
The flow of the occurrence intensity judgement process (Step S20 in
Then, the generating device 10 calculates a movement vector of the marker based on the position of the specified marker and the reference position (Step S202). Then, the generating device 10 judges the occurrence intensity of the AU based on the movement vector (Step S203).
The flow of a training data generating process will be described with reference to
As described above, the acquiring unit 141 in the generating device 10 acquires a captured image including a face. The specifying unit 142 specifies the positions of the markers included in the captured image. The judgement unit 143 selects the first AU from among the plurality of AUs based on the judgment criterion of the AUs and the position of the specified marker. The image generating unit 144 generates an image by performing the image processing of deleting a marker from the captured image. The training data generating unit 145 generates training data for machine learning by attaching information on the first AU to the generated image. In this way, the generating device 10 can automatically obtain high-quality training data in which the marker is deleted. Consequently, according to the embodiment, it is possible to generate teacher data for estimating AUs.
When the judgement unit 143 judges, based on the judgment criterion and the position of the marker, that the AU associated with the marker from among the plurality of AUs occurs, the judgement unit 143 selects the subject AU. In this way, the judgement unit 143 can judge the AU associated with the marker.
The judgement unit 143 judges the occurrence intensity of the AU based on an amount of movement of the marker calculated based on the distance between the reference position of a marker included in the judgment criterion and the position of the specified marker. In this way, the judgement unit 143 can judge the AU based on the distance.
The acquiring unit 641 in the estimating device 60 acquires the first captured image including a face. The estimating unit 642 inputs the first captured image to a machine learning model that is generated from machine learning based on training data in which information on the first AU selected based on the judgment criterion of the AUs and the positions of the markers included in the captured image is used as a teacher label. The estimating unit 642 acquires an output of the machine learning model as the estimation result of the expression of the face. In this way, the estimating device 60 can perform estimation with high accuracy by using the model generated at low cost.
As described above, the acquiring unit 141 in the generating device 10 acquires a captured image including a face to which a plurality of markers are attached at a plurality of positions that are associated with a plurality of AUs. The specifying unit 142 specifies each of the positions of the plurality of markers included in the captured image. The judgement unit 143 judges an occurrence intensity of a specific AU based on a judgment criterion of the specific AU selected from the plurality of AUs and the positions of one or a plurality of markers, from among the plurality of markers, associated with the specific AU. The output unit 12 outputs the occurrence intensity of the specific AU by associating the occurrence intensity with the captured image. In this way, the generating device 10 can judge the occurrence intensity of the specific AU from the captured image without annotation performed by a coder. Consequently, it is possible to generate teacher data for estimating AUs.
The judgement unit 143 judges the occurrence intensity based on an amount of movement of the marker calculated based on the distance between the position that is previously set as the judgment criterion and the position of the one or the plurality of markers specified by the specifying unit 142. In this way, the generating device 10 can calculate the occurrence intensity of each AU with high accuracy by using the judgment criterion.
The judgement unit 143 judges the occurrence intensity of the specific AU based on the degree of match between a vector that is previously associated with the specific AU and a movement vector of one or the plurality of markers calculated based on the position that is previously set as the judgment criterion and the position of the first marker specified by the specifying unit 142. In this way, by calculating the movement vector, the generating device 10 can evaluate the movement of the marker including directions and improve the judgement accuracy of the occurrence intensity.
The judgement unit 143 judges the occurrence intensity based on a change in the distance between the position of the first marker specified by the specifying unit 142 and the position of the second marker. In this way, by using the positions of the plurality of markers, the generating device 10 can cope with a complicated movement of a marker caused by a change in the surface texture of the face.
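The following Python sketch illustrates this pair-distance criterion; measuring the change relative to the distance between the two markers in the lack-of-expression state, and the linear mapping to a five-level intensity, are assumptions made for illustration.

import numpy as np

def intensity_from_pair(first_pos, second_pos, reference_distance, levels=5):
    # Change in the distance between the first marker and the second marker,
    # relative to their distance in the lack-of-expression state (assumption),
    # mapped linearly to 0 (no occurrence) through `levels`.
    current = float(np.linalg.norm(np.asarray(first_pos, dtype=float)
                                   - np.asarray(second_pos, dtype=float)))
    if reference_distance <= 0:
        return 0
    change_ratio = abs(current - reference_distance) / reference_distance
    return min(levels, int(round(change_ratio * levels)))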
In the embodiment described above, a description has been given with the assumption that the judgement unit 143 judges the occurrence intensity of each AU based on an amount of movement of each marker. In addition, a state in which a marker does not move can also be used as the judgment criterion of the occurrence intensity judged by the judgement unit 143.
Furthermore, a color that is easily detected may also be arranged around the marker. For example, a green round adhesive seal in which an IR marker is placed at the center may also be attached to a subject. In this case, the image generating unit 144 can detect a green round area from the captured image and delete the area together with the IR marker.
The flow of the processes, the control procedures, the specific names, and the information containing various kinds of data or parameters indicated in the above specification and drawings can be arbitrarily changed unless otherwise stated. Specific examples, distributions, values, and the like described in the embodiment are only examples and can be arbitrarily changed.
Furthermore, the components of each unit illustrated in the drawings only conceptually illustrate the functions thereof and are not always physically configured as illustrated in the drawings. In other words, the specific form of separation or integration of the devices is not limited to the drawings. Specifically, all or part of the device can be configured by functionally or physically separating or integrating any of the units depending on various loads or use conditions. Furthermore, all or any part of each of the processing functions performed by each of the devices can be implemented by a CPU and programs analyzed and executed by the CPU, or implemented as hardware by wired logic.
The communication interface 10a is a network interface card or the like and communicates with other servers. The HDD 10b stores therein the programs and DBs that operate the functions illustrated in
by reading the program that executes the same process as that performed by each of the processing units illustrated in
In this way, by reading and executing the programs, the generating device 10 operates as an information processing apparatus that executes a machine learning method. Furthermore, the generating device 10 can also implement the same functions as those described above in the embodiments by reading the programs described above from a recording medium by a medium reading device and executing the read programs. Furthermore, the programs described in the other embodiment are not limited to being executed by the generating device 10. For example, the present invention may also be similarly used in a case in which another computer or a server executes a program, or in a case in which another computer and a server cooperatively execute the program with each other.
These programs can be distributed via a network, such as the Internet. Furthermore, these programs can be executed by storing them in a computer-readable recording medium, such as a hard disk, a flexible disk (FD), a CD-ROM, a magneto-optical disk (MO), or a digital versatile disk (DVD), and reading them from the recording medium by a computer.
According to an aspect of the present invention, it is possible to generate teacher data for estimating AUs.
All examples and conditional language provided herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventors to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.