This patent application is based on and claims priority pursuant to 35 U.S.C. § 119(a) to Japanese Patent Application No. 2021-045560, filed on Mar. 19, 2021, in the Japan Patent Office, the entire disclosure of which is hereby incorporated by reference herein.
Embodiments of this disclosure relate to a learning apparatus, a learning system, and a nonverbal information learning method.
In recent years, the development of deep learning has enabled accurate real-time recognition of nonverbal information, such as a person's line of sight and facial expression, from a video image. This technology is applied to various applications such as automatic analysis of surveillance camera images and health condition monitoring. Further, in recent years, a nonverbal information conversion technology developed in conjunction with the nonverbal information recognition technology is attracting attention. Such techniques make it possible, for example, to give a desired impression to a partner in a conversation over a video call.
Further, in such deep learning technologies, improving the efficiency of annotation, that is, efficiently adding label information to a large-scale data set, is becoming increasingly important. For example, a method is known that extracts a region to which a user pays attention in a video by using line-of-sight data representing a line of sight of the user when the user annotates label information to be paired with the video.
An embodiment of the present disclosure includes a learning apparatus. The learning apparatus includes circuitry. The circuitry receives an input of first label information to be given to a facial expression image indicating a face of a person. The circuitry estimates second label information to be given to the facial expression image based on an interpolated image generated using the facial expression image and line-of-sight information indicating a direction of a line of sight of an annotator, the direction being detected at a time when the input is received. The circuitry calculates a difference between the first label information of which the input is received and the estimated second label information. The circuitry updates a parameter used for processing of estimating the second label information based on the calculated difference.
Another embodiment of the present disclosure includes a learning system. The learning system includes circuitry. The circuitry receives an input of first label information to be given to a facial expression image indicating a face of a person. The circuitry estimates second label information to be given to the facial expression image based on an interpolated image generated using the facial expression image and line-of-sight information indicating a direction of a line of sight of an annotator, the direction being detected at a time when the input is received. The circuitry calculates a difference between the first label information of which the input is received and the estimated second label information. The circuitry updates a parameter used for processing of estimating the second label information based on the calculated difference.
A more complete appreciation of the disclosure and many of the attendant advantages and features thereof can be readily obtained and understood from the following detailed description with reference to the accompanying drawings, wherein:
The accompanying drawings are intended to depict embodiments of the present disclosure and should not be interpreted to limit the scope thereof. The accompanying drawings are not to be considered as drawn to scale unless explicitly noted. Also, identical or similar reference numerals designate identical or similar components throughout the several views.
In describing embodiments illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the disclosure of this specification is not intended to be limited to the specific terminology so selected and it is to be understood that each specific element includes all technical equivalents that have a similar function, operate in a similar manner, and achieve a similar result.
Referring now to the drawings, embodiments of the present disclosure are described below. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Embodiments of the present disclosure are described with reference to the drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant descriptions thereof are omitted.
Overview of Nonverbal Information Processing System:
Referring to
As illustrated in
The nonverbal information conversion apparatus 50 is a computer that converts nonverbal information so that an intention of the sender is intelligibly communicated to the recipient. The nonverbal information conversion apparatus 50 acquires data including nonverbal information of the sender, converts the nonverbal information so that the intention of the sender is intelligibly communicated to the recipient, and outputs processed data obtained by applying the conversion of the nonverbal information to the acquired data.
In the disclosure, the nonverbal information includes a feature amount such as a user's line of sight, a user's facial expression, a posture of a user's upper limb, a shape of a user's hand, a shape or a posture of a user's arm or foot, or a tone or intonation of user's voice. The intention of the sender includes one or more of a condition of the sender (e.g., pleasant, concentrated, or active), an emotion of the sender (e.g., happy, angry, sad, pleasure, composed, or disgusted), and will of the sender (e.g., instruct, deny, or request) that the sender wants to communicate to the recipient.
In one example, the nonverbal information conversion apparatus 50 is implemented by a single computer. In another example, the nonverbal information conversion apparatus 50 is implemented by a plurality of computers. In one example, the nonverbal information conversion apparatus 50 is implemented by a computer residing in a cloud environment. In another example, the nonverbal information conversion apparatus 50 is implemented by a computer residing in an on-premises environment.
The transmission apparatus 70 is a computer such as a laptop computer used by the sender in dialogue communication. The reception apparatus 90 is a computer such as a laptop computer used by the recipient in dialogue communication. The transmission apparatus 70 transmits, to the nonverbal information conversion apparatus 50, video data obtained by capturing the sender from the front, for example. The reception apparatus 90 controls a display to display video in which the sender appears, based on video data (conversion data) converted by the nonverbal information conversion apparatus 50. The laptop computer is merely one example of each of the transmission apparatus 70 and the reception apparatus 90. In another example, each of the transmission apparatus 70 and the reception apparatus 90 is implemented by a smartphone, a tablet terminal, a wearable terminal, or a desktop personal computer (PC). Although
The nonverbal information processing system 1 further includes a learning system 5 including a learning apparatus 10 used by an annotator. The learning apparatus 10 is a computer for performing machine learning of training data used for conversion of the nonverbal information.
The annotator looks at a facial expression image displayed on the learning apparatus 10 and inputs a corresponding facial expression label (label information). Further, the learning apparatus 10 detects line-of-sight information of the annotator at the time of the input of the facial expression label, and stores the detected line-of-sight information in addition to the facial expression image and the facial expression label. The learning apparatus 10 treats the facial expression image, the facial expression label, and the line-of-sight information as one set of data. The learning apparatus 10 repeats the above processing for each of the facial expression images (frames) to generate a data set.
In recent years, with the development of deep learning, the importance of improving the efficiency of annotation, that is, of efficiently providing label information to a large-scale data set, has been increasing. The purpose of improving the efficiency of annotation is to reduce the burden on an annotator and to maintain the quality of the obtained label information. If the burden on the annotator increases, the reliability of the label information obtained by annotation may degrade due to a decrease in concentration at the time of annotating.
Further, a method is known that improves the efficiency of annotation by aggregating data to be learned. Furthermore, a method is known that uses a reaction of an annotator as it is for learning. As a method of adding a reaction of an annotator to label information when annotating the label information to be paired with video, a region to which the annotator responds is identified by using line-of-sight information, for example. With this configuration, efficient learning is performed even with a smaller amount of data. In particular, in the case of a task such as adding a label to a target video, a region to which the annotator pays attention in selecting the label is explicitly given. Such a configuration saves the annotator from performing complicated operations. Thus, a burden on the annotator is reduced. With such a method, a region of high importance included in a video is extracted, and the annotator directly specifies the importance without using an algorithm for calculating importance.
However, in the above-described method, only a portion corresponding to a line-of-sight region is used as an input, and a peripheral region around the line-of-sight region is not used. For example, David Whitney, Dennis M. Levi, “Visual crowding: a fundamental limit on conscious perception and object recognition”, Trends in Cognitive Sciences, 2011, 15.4, 160-168 discloses that, in image recognition, information of a peripheral region is important as well as information of a central region. Accordingly, it is not appropriate to apply the method using only the line-of-sight region of the annotator to tasks such as object recognition and facial expression recognition, in which recognition of a peripheral region is significant in addition to recognition of a central region. Thus, there is room for improving the efficiency of annotation.
To address such an issue, the learning system 5 uses an interpolation-type learning algorithm using line-of-sight information of an annotator, to implement learning of video based on a central region, which is a line-of-sight region in an input image, and a peripheral region which is a region around the central region.
Hardware Configuration:
Referring to
The CPU 101 controls overall operation of the computer. The ROM 102 stores a program such as an initial program loader (IPL) to boot the CPU 101. The RAM 103 is used as a work area for the CPU 101. The HD 104 stores various data such as a program. The HDD controller 105 controls reading or writing of various data from or to the HD 104 under control of the CPU 101. The display 106 is an example of a display device (display means) that displays various types of information such as a cursor, a menu, a window, characters, or an image. In one example, the display 106 is a touch panel display provided with an input device (input means). The external device connection I/F 107 is an interface that connects the computer to various external devices. The communication I/F 108 is an interface for data transmission and reception with other computers or electronic devices. The communication I/F 108 is, for example, a communication interface such as a wired or wireless LAN. In another example, the communication I/F 108 includes a communication interface for mobile communication such as 3G, 4G, 5G, LTE, Wi-Fi®, or WiMAX. The bus line 110 is, for example, an address bus or a data bus, which electrically connects the elements such as the CPU 101 illustrated in
The keyboard 111 is an example of an input device (input means) including a plurality of keys for inputting characters, numerical values, various instructions, and the like. The pointing device 112 is an example of an input device (input means) that allows a user to select or execute a specific instruction, select an object for processing, or move a cursor being displayed. The keyboard 111 and the pointing device 112 are merely examples of the input device (input means). In another example, a touch panel, a voice input device, or the like is used as the input device (input means). In still another example, in alternative to the display device (display means) such as the display 106 and the input device (input means) such as the keyboard 111 and the pointing device 112, a user interface (UI) external to the computer is used. The audio input/output I/F 113 is a circuit for inputting or outputting an audio signal between the microphone 114 and the speaker 115 under control of the CPU 101. The microphone 114 is an example of an audio collecting device (audio collecting means), which is a built-in type, that receives an input of audio. The speaker 115 is an example of an output device (output means), which is a built-in type, that outputs an audio signal. The camera 116 is an example of an image capturing device (image capturing means), which is a built-in type, that captures an image of an object to obtain image data. In another example, each of the microphone 114, the speaker 115, and the camera 116 is an external device in alternative to the built-in device of the computer. The DVD-RW drive 117 controls reading or writing of various data to or from a DVD-RW 118, which is an example of a removable storage medium. In another example, the removable storage medium includes at least one of a digital versatile disk-recordable (DVD-R) and a Blu-ray® disc, in addition to or in alternative to the DVD-RW. The medium I/F 119 controls reading or writing (storing) of data from or to a storage medium 121 such as a flash memory. The line-of-sight detection device 123 is a sensor device that detects movement of a line of sight of a user who uses the learning apparatus 10. As the line-of-sight detection device 123, an infrared light emitting diode (LED) lighting device and an infrared camera are used, for example. In this case, the infrared LED lighting device of the line-of-sight detection device 123 irradiates the face of the user with infrared light, and a position on the cornea of the reflected light (corneal reflex) is set as a reference point. Further, the line-of-sight detection device 123 detects the line of sight of the user with the infrared camera based on a position of a pupil with respect to the position of the corneal reflex. The line-of-sight detection device 123 described above is merely one example. In another example, any known apparatus capable of performing a general line-of-sight detection method is used.
For example, any one of the above-described control programs is recorded in a file in a format installable or executable on a computer-readable storage medium for distribution. Examples of the storage medium include, but are not limited to, a compact disc-recordable (CD-R), a digital versatile disk (DVD), a Blu-ray® disc, a secure digital (SD) card, and a universal serial bus (USB) memory. In addition, such a storage medium may be provided in the form of a program product to users within a certain country or outside that country. For example, the learning apparatus 10 executes a program according to the present disclosure to implement a nonverbal information learning method according to the present disclosure.
Functional Configuration:
Referring to
The data acquisition unit 11 is implemented mainly by the communication I/F 108 or the external device connection I/F 107 operating under control of the CPU 101. The data acquisition unit 11 acquires various data input from an external apparatus. The data output unit 12 is implemented mainly by the communication I/F 108 or the external device connection I/F 107 operating under control of the CPU 101. The data output unit 12 outputs various data obtained by processing by the learning apparatus 10 to an external apparatus.
The input receiving unit 13 is implemented mainly by the keyboard 111 or the pointing device 112 operating under control of the CPU 101. The input receiving unit 13 receives various selections or inputs from the user. The image generation unit 14, which is implemented mainly by instructions of the CPU 101, generates a facial expression image to be machine-learned, based on video information in which a person appears, the video information being input from an external apparatus. The display control unit 15, which is implemented mainly by instructions of the CPU 101, displays various screens on a display device (display means) such as the display 106.
The line-of-sight detection unit 16 is implemented mainly by the line-of-sight detection device 123 operating under control of the CPU 101. The line-of-sight detection unit 16 detects line-of-sight information indicating a direction of a line of sight of the annotator.
The interpolation unit 17, which is implemented mainly by instructions of the CPU 101, generates an interpolated image based on the facial expression image generated by the image generation unit 14 and the line-of-sight information detected by the line-of-sight detection unit 16.
The inference unit 18, which is implemented mainly by instructions of the CPU 101, estimates label information to be added to the facial expression image based on the interpolated image generated using the facial expression image generated by the image generation unit 14 and the line-of-sight information detected by the line-of-sight detection unit 16.
The loss calculation unit 19, which is implemented mainly by instructions of the CPU 101, calculates a difference between label information whose input is received by the input receiving unit 13 and the label information estimated by the inference unit 18.
The optimization unit 20, which is implemented mainly by instructions of the CPU 101, updates a parameter used for processing by the inference unit 18 based on the difference calculated by the loss calculation unit 19.
The storing/reading unit 29 stores various data (or information) in the storage unit 1000 and/or reads various data (or information) from the storage unit 1000. The storage unit 1000 stores a data set used for learning by the learning apparatus 10 and learning data obtained as a result of the learning. In another example, the storage unit 1000 is configured as one or more storage devices that are external to the learning apparatus 10.
Overview:
Referring to
First, the learning system 5A prepares a data set including a facial expression image generated from certain video information, line-of-sight information indicating a direction of a line of sight of an annotator, and label information, which is a facial expression label added to the facial expression image by the annotator. The interpolation unit 17 generates an interpolated image including a central region and a peripheral region by pattern interpolation using the facial expression image and the line-of-sight information as inputs. The central region is the line-of-sight region, that is, a region corresponding to the direction of the line of sight in the input facial expression image. The peripheral region is an area around the central region. Then, the inference unit 18 estimates label information, which is a facial expression label to be added to the facial expression image, using the interpolated image generated by the interpolation unit 17 as an input.
Further, the learning system 5A uses a loss calculated by the loss calculation unit 19 based on the label information estimated by the inference unit 18 and the label information added by the annotator, for the parameter update of the inference unit 18 by the optimization unit 20. The loss calculation by the loss calculation unit 19 and the parameter update of the inference unit 18 by the optimization unit 20 are performed in the same or substantially the same manner as in a general-purpose learning system.
Referring to
First, the image generation unit 14 of the learning apparatus 10 generates a facial expression image by using video information in which a person appears, the video information being input from an external apparatus (step S11). Specifically, the image generation unit 14 detects a face of a person from the video information input from the external apparatus and detects landmarks of the face using a method described in T. Baltrusaitis, P. Robinson, and L. P. Morency, “OpenFace: an open source facial behavior analysis toolkit”, IEEE Winter Conf. Appl. Comput. Vision, WACV, 2016, for example. The image generation unit 14 performs left/right tilt correction and size correction using the detected face landmarks. In the left/right tilt correction, for example, the input video information and the detected face landmarks are rotated so that the heights (y-values) of the left and right eyes become the same. In the size correction, for example, the input video information and the detected face landmarks are enlarged or reduced so that the top, bottom, left, and right extremes of the detected face landmarks fall within a designated image size.
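Purely as an illustrative, non-limiting sketch (Python and OpenCV are used here only for explanation and are not part of the disclosed apparatus), the left/right tilt correction and size correction described above may be performed roughly as follows; the landmark dictionary keys and the output image size are assumptions.

```python
import numpy as np
import cv2  # assumed available; used only for the affine warp and resize

def correct_tilt_and_size(image, landmarks, out_size=128):
    """Illustrative sketch: rotate so that the left and right eyes share the
    same height (y-value), then enlarge or reduce so that the landmark
    extremes fit within a designated square image size."""
    left_eye = np.asarray(landmarks["left_eye"], dtype=np.float32)    # (x, y), assumed key
    right_eye = np.asarray(landmarks["right_eye"], dtype=np.float32)  # (x, y), assumed key
    # Angle of the eye line; rotating by this angle levels the eyes.
    dx, dy = right_eye - left_eye
    angle = float(np.degrees(np.arctan2(dy, dx)))
    center = (float((left_eye[0] + right_eye[0]) / 2), float((left_eye[1] + right_eye[1]) / 2))
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))
    # Apply the same affine transform to every landmark and take the bounding box.
    pts = np.asarray(list(landmarks.values()), dtype=np.float32)
    pts = pts @ rot[:, :2].T + rot[:, 2]
    x0, y0 = np.floor(pts.min(axis=0)).astype(int)
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int)
    face = rotated[max(y0, 0):y1, max(x0, 0):x1]
    # Size correction: scale the cropped face region to the designated size.
    return cv2.resize(face, (out_size, out_size))
```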
Next, the display control unit 15 controls a display unit such as the display 106 to display the facial expression image generated in step S11 (step S12). The facial expression image is, for example, a still image expressing the basic six emotions with a face as described in E. Goeleven, R. De Raedt, L. Leyman, and B. Verschuere, “The Karolinska directed emotional faces: A validation study”, Cogn. Emot., vol. 22, no. 6, pp. 1094-1118, 2008.
Next, the input receiving unit 13 receives an input of a facial expression label according to a predetermined input operation performed by the annotator on an input device (input means) such as the keyboard 111 (step S13). For example, the annotator observes the facial expression image displayed in step S12 and inputs a corresponding facial expression label. The learning apparatus 10 stores the answer received by the input receiving unit 13 as label information represented by a one-hot vector.
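As a small illustrative example (not part of the disclosure), the label information stored as a one-hot vector may be encoded as follows; the particular set of facial expression labels is an assumption.

```python
import numpy as np

# Assumed label set: the basic six emotions plus a neutral face.
EXPRESSIONS = ["neutral", "happy", "angry", "sad", "surprised", "disgusted", "afraid"]

def to_one_hot(label: str) -> np.ndarray:
    """Encode the annotator's answer as a one-hot vector."""
    vec = np.zeros(len(EXPRESSIONS), dtype=np.float32)
    vec[EXPRESSIONS.index(label)] = 1.0
    return vec

# Example: the annotator selects "happy" for the displayed facial expression image.
print(to_one_hot("happy"))  # [0. 1. 0. 0. 0. 0. 0.]
```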
Further, the line-of-sight detection unit 16 detects a direction of a line of sight of the annotator at the time when the input of the facial expression label is received in step S13 (step S14). Specifically, the line-of-sight detection unit 16 detects the direction of the line of sight of the annotator on the display 106 in real time using the line-of-sight detection device 123, for example. In order to improve the estimation accuracy, the line-of-sight detection unit 16 performs calibration at the first detection, to correct the influence of eyeball characteristics and display characteristics, for example. The direction of the line of sight is represented by a pixel position (x, y) of the display 106, and this coordinate information is acquired as line-of-sight information.
Next, the storing/reading unit 29 stores the facial expression image generated in step S11, the label information input in step S13, and the line-of-sight information indicating the direction of the line of sight detected in step S14 in the storage unit 1000 as one data set (step S15). Then, in a case where the learning apparatus 10 has performed the above processes on all of the image frames of the input video information (YES in step S16), the operation ends. By contrast, when there is a frame on which the above processes are not yet performed (NO in step S16), the learning apparatus 10 repeats the processes from step S11 until the processes are performed on all of the image frames of the input video information.
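For illustration only, one record of the data set stored in step S15 can be pictured as the following structure; the field names and the helper function are hypothetical and are introduced solely to make the per-frame repetition concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class AnnotationRecord:
    """One set stored in step S15: image, annotator's label, and gaze."""
    face_image: np.ndarray        # facial expression image generated in step S11 (H x W x 3)
    label: np.ndarray             # one-hot facial expression label received in step S13
    gaze_xy: Tuple[float, float]  # display pixel position (x, y) detected in step S14

def build_dataset(frames, labels, gazes) -> List[AnnotationRecord]:
    """Repeat the per-frame processing for all image frames of the input video."""
    return [AnnotationRecord(f, l, g) for f, l, g in zip(frames, labels, gazes)]
```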
As described, the learning system 5A generates, as preprocessing of learning, the data set, which is a set of the facial expression image representing the face of the certain person, the line-of-sight information indicating the direction of the line of sight of the annotator, and the label information input by the annotator.
Next, referring to
First, the storing/reading unit 29 of the learning apparatus 10 reads a data set generated by the processes described above with reference to
Next, the interpolation unit 17 performs generalization processing on the facial expression image included in the data set read in step S31 (step S32). For example, the interpolation unit 17 generalizes the facial expression image using a pre-learned variational auto-encoder (VAE), to reproduce pattern interpolation in which a peripheral region is generated. Then, the interpolation unit 17 generates an interpolated image using the generalized image obtained by the generalization processing in step S32 and the facial expression image read in step S31 (step S33). For example, the interpolation unit 17 generates the interpolated image by combining the generalized image and the facial expression image by weighted addition using the line-of-sight information read in step S31.
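As a hedged sketch of one possible realization of steps S32 and S33 (not necessarily the exact implementation of the embodiment), the generalized image could be the reconstruction produced by the pre-learned VAE, and the weighted addition could use a Gaussian weight centered at the gaze position; the kernel width `sigma` and the placeholder `vae` object are assumptions.

```python
import numpy as np

def make_interpolated_image(face_img, generalized_img, gaze_xy, sigma=40.0):
    """Blend the original image (kept around the central, line-of-sight region)
    with the generalized image (used for the peripheral region) by weighted
    addition with a Gaussian weight centered at the gaze point."""
    h, w = face_img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    # Weight is close to 1 near the line-of-sight position and decays toward the periphery.
    weight = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2.0 * sigma ** 2))[..., None]
    return weight * face_img + (1.0 - weight) * generalized_img

# Example usage (shapes only): the generalized image could be obtained as
# generalized_img = vae.decode(vae.encode(face_img)), where `vae` is a
# pre-learned variational auto-encoder (placeholder object).
```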
Referring to
Referring again to
Next, the loss calculation unit 19 calculates a difference between the label information (an example of first label information) read in step S31 and the label information (an example of second label information) estimated in step S34 (step S35). Specifically, the loss calculation unit 19 calculates, as a loss, the cross-entropy between the label information added by the annotator and the label information, which is the facial expression label estimated by the inference unit 18.
Next, the optimization unit 20 updates a parameter used for the processing by the inference unit 18 based on the difference calculated in step S35 (step S36). Specifically, the optimization unit 20 updates the parameter used for the processing by the inference unit 18 based on a predetermined optimization method using the loss obtained in the processing by the interpolation unit 17 and the inference unit 18. As the optimization method, a method generally used in machine learning is used.
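Steps S34 to S36 together correspond to an ordinary supervised training step. The following PyTorch-style sketch is offered only as an illustration, under the assumption that the inference unit 18 is a classifier network (`inference_net`) trained with cross-entropy loss and a generic optimizer; these names are not taken from the disclosure.

```python
import torch
import torch.nn as nn

def training_step(inference_net, optimizer, interpolated_image, annotator_label):
    """Estimate label information (S34), compute the cross-entropy loss against
    the annotator's label (S35), and update the parameters (S36)."""
    criterion = nn.CrossEntropyLoss()
    logits = inference_net(interpolated_image.unsqueeze(0))  # estimated (second) label information
    target = annotator_label.argmax().unsqueeze(0)           # one-hot vector -> class index
    loss = criterion(logits, target)                         # difference between the two labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                         # parameter update by the optimization unit
    return loss.item()
```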
Then, in a case where the learning apparatus 10 has performed the above processes on all of the facial expression images read in step S31 (YES in step S37), the operation ends. By contrast, when there is a remaining facial expression image on which the above processes are not yet performed (NO in step S37), the learning apparatus 10 repeats the processes from step S32 until the processes are performed on all of the read facial expression images. In one example, the learning apparatus 10 performs the operation illustrated in
As described, the learning system 5A according to the first embodiment performs learning using the interpolated image generated by pattern interpolation based on the input facial expression image and line-of-sight information in an interpolation-type learning algorithm using line-of-sight information of an annotator. With this configuration, learning based on the central region and the peripheral region in the input facial expression image is efficiently implemented.
Referring to
As described, the learning system 5B according to the second embodiment performs learning using the interpolated image generated by down-sampling based on the input facial expression image and line-of-sight information. With this configuration, learning based on the central region and the peripheral region in the input facial expression image is efficiently implemented.
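The concrete down-sampling procedure of the second embodiment is described with reference to the drawings and is not reproduced here. Purely as a hedged illustration of the general idea, a foveated-style interpolated image could keep full resolution near the line-of-sight position while substituting a down-sampled and re-enlarged copy in the peripheral region; the scale factor, kernel width, and blending below are assumptions, not the disclosed implementation.

```python
import numpy as np
import cv2  # assumed available; used only for resizing

def downsample_interpolation(face_img, gaze_xy, scale=0.25, sigma=40.0):
    """Illustrative sketch: keep the central (line-of-sight) region sharp and
    use a down-sampled, re-enlarged copy for the peripheral region."""
    h, w = face_img.shape[:2]
    small = cv2.resize(face_img, (max(1, int(w * scale)), max(1, int(h * scale))))
    coarse = cv2.resize(small, (w, h))  # low-detail version used for the periphery
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    weight = np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2.0 * sigma ** 2))[..., None]
    return weight * face_img + (1.0 - weight) * coarse
```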
Referring to
First, the storing/reading unit 29 of the learning apparatus 10 reads a data set generated by the processes described above with reference to
The inference unit 18 performs such layer filtering processing to estimate a facial expression label (label information) to be added to the read facial expression image. Processes from step S53 to step S55 are performed in the same or substantially the same manner as described above referring to step S35 to step S37 of
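The specific layer filtering of the third embodiment is likewise defined with reference to the drawings. As one hedged interpretation only, the sketch below weights each convolutional layer's feature map by a Gaussian mask centered on the normalized gaze position; the network structure, the mask shape, and all parameter values are assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class GazeFilteredConvNet(nn.Module):
    """Illustrative only: applies a gaze-centered spatial weight to the feature
    map of each convolutional layer before estimating the facial expression label."""

    def __init__(self, num_classes=7, sigma=0.3):
        super().__init__()
        self.sigma = sigma
        self.convs = nn.ModuleList([
            nn.Conv2d(3, 16, 3, padding=1),
            nn.Conv2d(16, 32, 3, padding=1),
        ])
        self.head = nn.Linear(32, num_classes)

    def gaze_mask(self, h, w, gaze_xy, device):
        # gaze_xy is assumed to be normalized to [0, 1]; the mask has shape (1, 1, h, w).
        ys = torch.linspace(0, 1, h, device=device).view(h, 1)
        xs = torch.linspace(0, 1, w, device=device).view(1, w)
        gx, gy = gaze_xy
        return torch.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * self.sigma ** 2))[None, None]

    def forward(self, x, gaze_xy):
        for conv in self.convs:
            x = torch.relu(conv(x))
            # Layer filtering: emphasize the central region, attenuate the periphery.
            x = x * self.gaze_mask(x.shape[2], x.shape[3], gaze_xy, x.device)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)     # estimated facial expression label (logits)
```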
As described, the learning system 5C according to the third embodiment estimates label information based on results of the layer filtering processing performed on each layer using the input facial expression image and line-of-sight information. With this configuration, learning based on the central region and the peripheral region in the input facial expression image is efficiently implemented.
As described, the learning system 5 uses an interpolation-type learning algorithm using line-of-sight information of an annotator, to implement learning of video based on a central region, which is a line-of-sight region in an input image, and a peripheral region, which is a region around the central region. In other words, the learning system 5 implements efficient annotation based on the line-of-sight region and the peripheral region in an input image by using the interpolation-type learning algorithm using the line-of-sight information of the annotator.
According to one or more embodiments, a non-transitory computer-executable medium storing a program including instructions is provided, which, when executed by a processor of a computer, causes the computer to perform a nonverbal information learning method. The nonverbal information learning method includes receiving an input of first label information to be given to a facial expression image indicating a face of a person. The nonverbal information learning method includes estimating second label information to be given to the facial expression image based on an interpolated image generated using the facial expression image and line-of-sight information indicating a direction of a line of sight of an annotator, the direction being detected at a time when the input is received. The nonverbal information learning method includes calculating a difference between the first label information of which the input is received and the estimated second label information. The nonverbal information learning method includes updating a parameter used for processing of the estimating based on the calculated difference.
Applying a method that uses only the line-of-sight region of the annotator to tasks such as object recognition and facial expression recognition is not appropriate, because recognition of a peripheral region is significant in addition to recognition of a central region. Accordingly, in the related art, there is room for improving the efficiency of annotation.
According to one or more embodiments of the present disclosure, efficient annotation is implemented based on a line-of-sight region and a peripheral region in an input image by using line-of-sight information of an annotator.
The functionality of the elements disclosed herein may be implemented using circuitry or processing circuitry which includes general purpose processors, special purpose processors, integrated circuits, application specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), system on a chips (SOCs), graphics processing units (GPUs), conventional circuitry and/or combinations thereof which are configured or programmed to perform the disclosed functionality. Processors are considered processing circuitry or circuitry as they include transistors and other circuitry therein. In the disclosure, the circuitry, units, or means are hardware that carry out or are programmed to perform the recited functionality. The hardware may be any hardware disclosed herein or otherwise known which is programmed or configured to carry out the recited functionality. When the hardware is a processor which may be considered a type of circuitry, the circuitry, means, or units are a combination of hardware and software, the software being used to configure the hardware and/or processor.
Although the learning apparatus, the learning system, the nonverbal information learning method, and the program according to embodiments of the present disclosure have been described above, the above-described embodiments are illustrative and do not limit the present invention. Thus, numerous additional modifications and variations are possible in light of the above teachings. For example, elements and/or features of different illustrative embodiments may be combined with each other and/or substituted for each other within the scope of the present invention.
Any one of the above-described operations may be performed in various other ways, for example, in an order different from the one described above.