The present invention relates to a learning device, an estimation device, a learning method, an estimation method, a learning program, and an estimation program.
There is a conventional technology of a conversation system that generates a speech responding to a speech of a user and achieves smooth interaction between the user and the system. In such a conversation system, quick response is an important element, and for example, there is a technology of randomly generating a quick response (for example, see Patent Literature 1).
However, the conventional technology cannot generate a natural quick response of the kind a listener would give. For example, the conventional technology is limited in its ability to produce a speech at an appropriate timing, and the content of the speech is far from a natural quick response.
The present invention has been made in view of the above, and an object thereof is to provide a learning device, an estimation device, a learning method, an estimation method, a learning program, and an estimation program capable of generating a more natural quick response as a listener.
In order to solve the above-described problems and achieve the object, a learning device of the present invention includes: an acquisition unit that acquires speech data of a speaker, information on the speaker, conversation data of a listener, and information on the listener; and a creation unit that creates a learned model of estimating a quick response of the listener to a conversation of the speaker using the information acquired by the acquisition unit with a quick response included in the conversation data of the listener as correct answer data.
In addition, an estimation method is an estimation method performed by an estimation device, the estimation method including: an acquisition step of acquiring speech data of a speaker, and information on the speaker; and an estimation step of inputting the information acquired by the acquisition step as input data to a learned model of predicting a quick response of a listener from a conversation content of the speaker and estimating the quick response of the listener to the conversation of the speaker.
According to the present invention, it is possible to generate a more natural quick response as a listener.
Hereinafter, embodiments of a learning device, an estimation device, a learning method, an estimation method, a learning program, and an estimation program according to the present application will be described in detail with reference to the drawings. Moreover, the present invention is not limited to the embodiment described below.
For example, the learning device 10 acquires, as learning data, speech data (speech sentences) of a speaker and a listener, together with multimodal information such as facial expressions, motions, and voices. Then, the learning device 10 performs machine learning using the acquired information and creates a learned model (multimodal generator).
The estimation device 20 predicts a quick response of the listener from the conversation content of the speaker using the learned model created by the learning device 10. For example, the estimation device 20 acquires speech data of a speaker and information on the speaker, inputs the acquired information as input data to a learned model that predicts a quick response of the listener from the conversation content of the speaker, and estimates the quick response of the listener to the conversation of the speaker. That is, the estimation device 20 inputs the multimodal information of the speaker to the learned model (multimodal generator), and generates the multimodal information of the listener.
That is, the estimation device 20 can estimate a natural and appropriate back channel by using the multimodal information of the speaker as an input. As described above, the learning device 10 can learn a quick response of the listener with respect to the speech of the speaker. Then, the estimation device 20 can generate a back channel (quick response) having more appropriate contents, and can generate a more natural reaction as a listener.
[Configuration of Learning Device]
The communication processing unit 11 is implemented by a network interface card (NIC) or the like, and controls communication via an electric communication line such as a local area network (LAN) or the Internet.
The input unit 12 is implemented by using an input device such as a keyboard or a mouse and inputs various types of instruction information such as processing start to the control unit 14 in response to an input operation by an operator. The output unit 13 is implemented by a display device such as a liquid crystal display.
The storage unit 15 stores data and programs necessary for various types of processing by the control unit 14, and includes a learned model storage unit 15a. For example, the storage unit 15 is a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disk.
The learned model storage unit 15a stores a learned model learned by the creation unit 14b to be described later. For example, the learned model storage unit 15a stores, as a learned model, a back channel generator for estimating a quick response of a listener to a conversation of a speaker.
The control unit 14 includes an internal memory for storing a program defining various processing procedures and the like and required data, and executes various types of processing using the program and the data. For example, the control unit 14 includes a learning data acquisition unit 14a and a creation unit 14b. Here, the control unit 14 is an electronic circuit such as a central processing unit (CPU) or a micro processing unit (MPU), or an integrated circuit such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
The learning data acquisition unit 14a acquires the speech data of the speaker and the information on the speaker, and the conversation data of the listener and the information on the listener. For example, the learning data acquisition unit 14a acquires any one or more of the expression, the motion, and the voice of the speaker as the information on the speaker, and acquires any one or more of the expression, the motion, and the voice of the listener as the information on the listener. The learning data acquisition unit 14a may acquire, for example, image data of the face or the entire body of the speaker, or may acquire labels such as the expression “smile” or the motion “absence” as the information on the expressions and the motions of the speaker and the listener.
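The learning data described above could be represented, for example, as follows. This is a minimal sketch: the class names `ParticipantInfo` and `LearningSample`, the field names, and the representation of the voice as a list of numbers are all assumptions made for illustration and are not specified in the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParticipantInfo:
    """Information on one participant (hypothetical structure).

    Every field is optional because only "any one or more" of the
    expression, the motion, and the voice needs to be acquired.
    """
    expression: Optional[str] = None      # e.g. the label "smile"
    motion: Optional[str] = None          # a motion label
    voice: Optional[List[float]] = None   # assumed acoustic features

    def available(self) -> List[str]:
        """Return the names of the modalities that were acquired."""
        return [name for name in ("expression", "motion", "voice")
                if getattr(self, name) is not None]


@dataclass
class LearningSample:
    """One item of learning data: speaker side and listener side."""
    speaker_speech: str
    speaker_info: ParticipantInfo
    listener_conversation: str
    listener_info: ParticipantInfo

    def is_valid(self) -> bool:
        """True when each side carries at least one modality."""
        return (bool(self.speaker_info.available())
                and bool(self.listener_info.available()))
```

A sample in which the speaker smiles and only the voice of the listener was recorded would be valid under this sketch, whereas a sample with no information on the speaker would not.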
The creation unit 14b uses the information acquired by the learning data acquisition unit 14a to create a learned model of estimating a quick response of the listener to the conversation of the speaker using the quick response included in the conversation data of the listener as correct answer data. The creation unit 14b may use any method as the learning method for the model. Here, the quick response included in the conversation data of the listener is, for example, a speech such as “yes, yes, yes”, “yeah, yeah”, “good”, “I see”, “true”, “yeah”, “yes”, “oh”, “ooh”, “hmm”, “great”, or “what” included in the conversation data. Thereafter, the creation unit 14b stores the created learned model in the learned model storage unit 15a.
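One conceivable way to obtain the correct answer data is to search the conversation data of the listener for the quick responses exemplified above. The following is only a sketch under stated assumptions: the function name `extract_quick_response` and the matching strategy (case-insensitive, whole-phrase, longest phrase first) are hypothetical and are not prescribed by the embodiment.

```python
import re
from typing import Optional

# Quick responses exemplified in the text, sorted longest first so that
# "yes, yes, yes" is tried before the shorter "yes".
QUICK_RESPONSES = sorted(
    ["yes, yes, yes", "yeah, yeah", "good", "I see", "true", "yeah",
     "yes", "oh", "ooh", "hmm", "great", "what"],
    key=len,
    reverse=True,
)


def extract_quick_response(listener_utterance: str) -> Optional[str]:
    """Return the first listed quick response found in the utterance.

    Matching is case-insensitive and respects word boundaries, so "oh"
    is not found inside "ooh". Returns None when nothing matches.
    """
    text = listener_utterance.lower()
    for response in QUICK_RESPONSES:
        pattern = r"\b" + re.escape(response.lower()) + r"\b"
        if re.search(pattern, text):
            return response
    return None
```

An utterance for which the function returns `None` would simply contribute no correct answer data in this sketch.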
[Configuration of Estimation Device]
The communication processing unit 21 is implemented by an NIC or the like, and controls communication via a telecommunication line such as a LAN or the Internet. The input unit 22 is implemented by using an input device such as a keyboard or a mouse and inputs various types of instruction information such as processing start to the control unit 24 in response to an input operation by an operator. The output unit 23 is implemented by a display device such as a liquid crystal display.
The storage unit 25 stores data and programs necessary for various types of processing by the control unit 24, and includes a learned model storage unit 25a. For example, the storage unit 25 is a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disk.
The learned model storage unit 25a stores a learned model learned by the creation unit 14b. For example, the learned model storage unit 25a stores, as a learned model, a back channel generator for estimating a quick response of a listener to a conversation of a speaker.
The control unit 24 includes an internal memory for storing a program defining various processing procedures and the like and required data, and executes various types of processing using the program and the data. For example, the control unit 24 includes an input data acquisition unit 24a and an estimation unit 24b. Here, the control unit 24 is an electronic circuit such as a central processing unit (CPU) or a micro processing unit (MPU), or an integrated circuit such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
The input data acquisition unit 24a acquires speech data of a speaker and information on the speaker. For example, the input data acquisition unit 24a acquires any one or more of the expression, the motion, and the voice of the speaker as the information on the speaker.
The estimation unit 24b inputs the information acquired by the input data acquisition unit 24a as input data to a learned model of predicting a quick response of the listener from the conversation content of the speaker, and estimates the quick response of the listener with respect to the conversation of the speaker. Then, the estimation unit 24b outputs the estimated quick response information.
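The flow from the input data to the estimated quick response can be sketched as follows. The embodiment does not fix the form of the learned model, so the lookup table used here (`TableModel`) is merely a stand-in for it, and the names `InputData`, `TableModel`, and `estimate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InputData:
    """Speech data of the speaker and information on the speaker."""
    speech: str
    expression: Optional[str] = None
    motion: Optional[str] = None


class TableModel:
    """Stand-in for a learned model (assumption, not the actual model).

    It looks up a quick response from the expression of the speaker and
    returns a fallback response when the expression is unknown or absent.
    """

    def __init__(self, table: Dict[str, str], fallback: str) -> None:
        self.table = table
        self.fallback = fallback

    def predict(self, data: InputData) -> str:
        return self.table.get(data.expression or "", self.fallback)


def estimate(model: TableModel, data: InputData) -> str:
    """Role of the estimation unit: input the data, output the response."""
    return model.predict(data)
```

For example, with `TableModel({"smile": "great"}, fallback="hmm")`, an input whose expression is "smile" yields "great", and an input without an expression yields "hmm".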
Here, processing of estimating a quick response of a listener to a conversation of a speaker will be described with reference to
As a result, the estimation device 20 allows the speaker to continue the conversation naturally by naturally and appropriately generating the back channel, which is important in communication, according to the multimodal information of the speaker.
[Processing Procedure by Learning Device] Next, an example of a processing procedure of processing executed by the learning device 10 will be described with reference to
As illustrated in
Then, the creation unit 14b uses the information acquired by the learning data acquisition unit 14a to create a learned model of predicting a quick response of the listener to the conversation of the speaker using the quick response included in the conversation data of the listener as correct answer data (step S103). Thereafter, the creation unit 14b stores the created learned model in the learned model storage unit 15a (step S104).
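Steps S103 and S104 can be illustrated with a deliberately simple stand-in. Since any learning method may be used, this sketch replaces the learned model with a frequency table that maps an expression label of the speaker to the quick response most often given by the listener. The function names, the JSON file format, and the use of the expression alone as the input are assumptions made for illustration.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path
from typing import Dict, Iterable, Tuple


def create_model(samples: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Step S103 (sketch): learn from (speaker expression, quick response).

    The quick response of the listener serves as correct answer data; the
    most frequent response per expression is kept.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for speaker_expression, quick_response in samples:
        counts[speaker_expression][quick_response] += 1
    return {expr: c.most_common(1)[0][0] for expr, c in counts.items()}


def store_model(model: Dict[str, str], path: Path) -> None:
    """Step S104 (sketch): store the created model as a JSON file."""
    path.write_text(json.dumps(model, ensure_ascii=False), encoding="utf-8")


def load_model(path: Path) -> Dict[str, str]:
    """Read a stored model back, as an estimation device might do."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A practical system would more likely train a neural multimodal generator, but the order of operations (acquire, create, store) is the same.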
[Processing Procedure by Estimation Device] Next, an example of a processing procedure of processing executed by the estimation device 20 will be described with reference to
As illustrated in
[Effects of Embodiment] As described above, the learning device 10 according to the embodiment acquires speech data of a speaker, information on the speaker, conversation data of a listener, and information on the listener, and creates a learned model of estimating a quick response of the listener to a conversation of the speaker using the acquired information with a quick response included in the conversation data of the listener as correct answer data. Therefore, the learning device 10 can generate a learned model capable of estimating a more natural quick response as a listener. Furthermore, the estimation device 20 acquires speech data of a speaker, and information on the speaker, and inputs the acquired information as input data to a learned model of predicting a quick response of the listener from a conversation content of the speaker and estimates the quick response of the listener to the conversation of the speaker. Therefore, the estimation device 20 can generate a more natural quick response as a listener.
That is, the estimation device 20 can estimate a natural and appropriate back channel by using the multimodal information of the speaker as an input. As described above, the learning device 10 can learn a quick response of the listener with respect to the speech of the speaker. Then, the estimation device 20 can generate a back channel (quick response) having more appropriate contents, and can generate a more natural reaction as a listener.
Each component of each device illustrated according to the above embodiments is functionally conceptual and does not necessarily have to be physically configured as illustrated. That is, a specific form of distribution and integration of each device is not limited to the illustrated form, and all or a part thereof can be functionally or physically distributed and integrated in any unit according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed in each device can be implemented by a CPU and a program analyzed and executed by the CPU, or can be implemented as hardware by wired logic.
Furthermore, among the processing described in the above embodiments, all or a part of the processing described as being automatically performed can be manually performed, or all or a part of the processing described as being manually performed can be automatically performed by a known method. In addition, the processing procedures, the control procedures, the specific names, and the information including various data and parameters illustrated in the above document and drawings can be arbitrarily changed unless otherwise specified.
In addition, it is also possible to create a program in which the processing to be executed by the learning device 10 or the estimation device 20 described in the embodiment described above is described in a language that can be executed by a computer. In this case, the computer executes the program, and thus the effects similar to those of the above embodiments can be obtained. Furthermore, the program may be recorded in a computer-readable recording medium, and the program recorded in the recording medium may be read and executed by the computer to implement processing similar to that of the above embodiments.
As exemplified in
Here, as illustrated in
In addition, various data described in the above embodiments is stored as program data in, for example, the memory 1010 and the hard disk drive 1031. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1031 to the RAM 1012 as necessary, and executes various processing procedures.
Note that the program module 1093 and the program data 1094 related to the program are not limited to being stored in the hard disk drive 1031, and may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive or the like. Alternatively, the program module 1093 and the program data 1094 related to the program may be stored in another computer connected via a network (such as a local area network (LAN) or a wide area network (WAN)) and read by the CPU 1020 via the network interface 1070.
Although the embodiments to which the invention made by the present inventor is applied have been described above, the present invention is not limited by the description and drawings constituting a part of the disclosure of the present invention according to the present embodiments. In other words, other embodiments, examples, operational techniques, and the like made by those skilled in the art or the like on the basis of the present embodiment are all included in the scope of the present invention.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2022/007727 | 2/24/2022 | WO | |