The present disclosure relates to a dialog assistance apparatus, a dialog assistance method, and a program.
When two or more speakers are engaged in a dialog, it is difficult for each of them to speak according to the knowledge level of the other party.
For example, when a dialog about Information and Communication Technology (ICT) takes place between a speaker A who is highly literate in ICT (that is, who has a high level of understanding of ICT terms) and a speaker B who is less literate in ICT (that is, who has a low level of understanding of ICT terms), the speaker B may not understand what the speaker A says, and a breakdown of the dialog may occur.
Techniques have been devised heretofore to prevent the breakdown of the dialog between a user and a robot.
The techniques in the prior art do not take into account the knowledge levels of the speakers engaged in the dialog, and thus cannot help one speaker understand another speaker's speech content. As a result, it has been difficult to assist in facilitating the dialog.
The present disclosure has been made in view of the above, and it is an object of the present disclosure to assist in facilitating a dialog.
To achieve the object, a dialog assistance apparatus includes a first estimation unit that estimates, with respect to a field related to speech content of a first speaker, a knowledge level of a second speaker having a dialog with the first speaker, an acquisition unit that acquires, from a storage unit that stores a question in association with a keyword and a knowledge level, a question that corresponds to a keyword included in the speech content and that corresponds to the knowledge level of the second speaker, and an output unit that outputs the acquired question to the first speaker.
It is possible to assist in facilitating the dialog.
Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. The present embodiment assumes a situation in which a speaker A with high literacy (high knowledge level) and a speaker B with relatively low literacy (low knowledge level) in a certain field (for example, Information and Communication Technology (ICT)) have a dialog. For example, the speaker A may be a person who is in charge at the counter of a certain store, and the speaker B may be a person who consults the speaker A over the counter. This situation setting intends to facilitate understanding of the present embodiment and does not intend that the present embodiment is effective only in the above situation.
A dialog assistance apparatus 10 is placed where the speaker A and the speaker B have a dialog, to assist the dialog. The dialog assistance apparatus 10 may be shaped like a robot. Alternatively, a device such as a personal computer (PC), a smart phone, or the like may be utilized as the dialog assistance apparatus 10.
A program for implementing processing performed by the dialog assistance apparatus 10 is provided as a recording medium 101 such as a compact disc read-only memory (CD-ROM). When the recording medium 101 storing the program is set in the drive device 100, the program is installed on the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
The memory device 103 reads and stores the program from the auxiliary storage device 102 when an instruction to start the program is given. The CPU 104 implements functions relevant to the dialog assistance apparatus 10 in accordance with the program stored in the memory device 103. The microphone 105 is used to input the voice of the dialog (in particular, the speech content of the speaker A). The display device 106 is, for example, a liquid crystal display and is used to output (display) a question to the speaker A when the speaker B is unable to understand the speech content of the speaker A, as will be described later. The display device 106 may be shaped like a window disposed, for example, between the speaker A and the speaker B. The camera 107 is, for example, a digital camera and is used to input an image of the face (hereinafter referred to as a "face image") of the speaker B. The microphone 105, the display device 106, and the camera 107 need not be built into the dialog assistance apparatus 10 and may instead be connected to the dialog assistance apparatus 10, for example, wirelessly or by wire.
Hereinafter, processing executed by the dialog assistance apparatus 10 will be described.
When the speaker A starts speaking, the keyword extraction unit 11 inputs the spoken voice of the speaker A via the microphone 105 (S101). For example, at the timing of the end of the speech, the keyword extraction unit 11 applies speech recognition to the spoken voice that has been input with respect to the speech, and extracts at least one keyword from text data acquired as a result of the speech recognition (S102). For example, “tethering” may be extracted as a keyword when the spoken voice is “do you use tethering?”.
Such keyword extraction can be performed using known techniques. For example, the keyword extraction may be performed using the method cited in “Keyword Recognition and Extraction for Speech-Driven Web Retrieval Task”, Masahiko Matsushita, Hiromitsu Nishizaki, Takehito Utsuro, and Seiichi Nakagawa, Information Processing Society of Japan, Research Report, Speech language information processing (SLP), 2003 (104 (2003-SLP-048)), 21-28. Alternatively, the keywords registered in the knowledge level DB 122, which will be described later, may be extracted.
Subsequently, the keyword extraction unit 11 records the extracted keyword in the keyword storage unit 121 (S103) and waits for the next speech of the speaker A (S101). In the keyword storage unit 121, the keywords are recorded in such a manner that the order in which they were extracted (the order of the speeches) can be identified.
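As a minimal sketch of steps S102 and S103, the following assumes the alternative mentioned later, in which keywords registered in the knowledge level DB 122 are matched against the recognized text; the registered keywords and variable names are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch: extract registered keywords from recognized speech
# text (S102) and record them in extraction order (S103).

REGISTERED_KEYWORDS = ["wireless LAN", "tethering", "router"]  # assumed contents of DB 122

keyword_storage = []  # stands in for the keyword storage unit 121


def extract_keywords(recognized_text: str) -> list:
    """Return the registered keywords that appear in the recognized text."""
    text = recognized_text.lower()
    return [kw for kw in REGISTERED_KEYWORDS if kw.lower() in text]


def record_keywords(recognized_text: str) -> None:
    """Append extracted keywords so the order of speeches is preserved."""
    keyword_storage.extend(extract_keywords(recognized_text))


record_keywords("Do you use tethering?")
print(keyword_storage)  # ['tethering']
```

A list (rather than a set) is used for the storage so that the extraction order remains identifiable, as the description requires.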
The understanding level estimation unit 12 inputs the face image of the speaker B, which is continuously captured by the camera 107 (S201), and estimates (calculates), based on the face image, the understanding level of the speaker B with respect to the speech content of the speaker A (S202). Specifically, the facial expression of the speaker B is likely to change when the speech content of the speaker A is difficult to understand, so the understanding level estimation unit 12 estimates the understanding level based on the expression of the speaker B. Such estimation of the understanding level may be performed using the technique described in, for example, "Understanding Presumption System from Facial Images", Jun Mimura and Masafumi Hagiwara, IEEJ Journal of Industry Applications, C, 120 (2), 2000, 273-278. In that case, the understanding level is estimated in five levels (0 to 4) ranging from no understanding at all to complete understanding. Although the understanding level is estimated from the input face image in the present embodiment, other understanding level estimation methods may be used. For example, the speech content of the speaker A or the speaker B may be input, and the understanding level may be estimated using an existing speech recognition technique or text analysis technique.
Subsequently, the understanding level estimation unit 12 determines whether the understanding level of the speaker B is smaller than a threshold (S203). Assume that, in the present embodiment, the lower the understanding level value, the lower the level of understanding. That is, in step S203, it is determined whether the speaker B has a low understanding level.
If the understanding level of the speaker B is equal to or greater than the threshold (No in S203), it is estimated that the speaker B is able to understand the speech content of the speaker A, and there is no need to assist the speaker B, so that the process returns to step S201. If the understanding level of the speaker B is smaller than the threshold (Yes in S203), the knowledge level estimation unit 13 estimates the knowledge level of the speaker B for the field (for example, ICT) related to the speech content of the speaker A in accordance with at least one keyword stored in the keyword storage unit 121 and the knowledge level DB 122 (S204). That is, how much knowledge the speaker B has for the field is estimated.
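The branching in step S203 can be sketched as follows; the threshold value and the 0-to-4 level range follow the five-level estimation mentioned above, but the concrete threshold is an assumption, since the disclosure does not fix it.

```python
# Hypothetical sketch of step S203: decide whether assistance is needed
# from the estimated understanding level (0 = no understanding at all,
# 4 = complete understanding). The threshold value is assumed.

UNDERSTANDING_THRESHOLD = 2


def needs_assistance(understanding_level: int) -> bool:
    """True when the speaker B's understanding level is below the threshold
    (Yes in S203), meaning the knowledge level estimation should follow."""
    return understanding_level < UNDERSTANDING_THRESHOLD


print(needs_assistance(1))  # True  -> proceed to S204
print(needs_assistance(3))  # False -> return to S201 and keep monitoring
```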
When a plurality of keywords are included in the target keyword group, the knowledge level estimation unit 13 may acquire, for example, the knowledge level from the knowledge level DB 122 for each target keyword, and estimate the lowest value of the acquired knowledge levels to be the knowledge level of the speaker B. Alternatively, the knowledge level estimation unit 13 may estimate the highest value of the knowledge levels corresponding to any target keyword, which has been recorded in the keyword storage unit 121 before the understanding level is estimated to be smaller than the threshold, to be the knowledge level of the speaker B. This is because the speaker B is more likely to have understood the keywords that have been recorded before the understanding level is estimated to be smaller than the threshold.
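The two estimation strategies described above can be sketched as follows; the keyword-to-level mapping shown is an illustrative assumption about the contents of the knowledge level DB 122.

```python
# Hypothetical sketch of step S204 under the two strategies described
# above. knowledge_level_db stands in for the knowledge level DB 122.

knowledge_level_db = {"wireless LAN": 1, "tethering": 3, "VPN": 4}  # assumed rows


def estimate_by_lowest(target_keywords) -> int:
    """Strategy 1: the lowest knowledge level among the target keyword group."""
    return min(knowledge_level_db[kw] for kw in target_keywords)


def estimate_by_highest_understood(understood_keywords) -> int:
    """Strategy 2: the highest level among keywords recorded before the
    understanding level dropped below the threshold, on the premise that
    the speaker B understood those keywords."""
    return max(knowledge_level_db[kw] for kw in understood_keywords)


print(estimate_by_lowest(["wireless LAN", "tethering"]))             # 1
print(estimate_by_highest_understood(["wireless LAN", "tethering"]))  # 3
```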
In addition to the above, the technique disclosed in JP 2013-167765 A may also be used. In that case, the history of dialogs between the speaker A and the speaker B is recorded, and the knowledge level estimation unit 13 may estimate the knowledge level (knowledge amount) of the speaker B with reference to the history. Alternatively, the technique disclosed in JP 2019-28604 A may be used to estimate the knowledge level of the speaker B.
Subsequently, the question acquisition unit 14 acquires, from the question DB 123, the question to be output to the speaker A in accordance with the target keyword group and the knowledge level estimated for the speaker B (S205).
Accordingly, in step S205, the question acquisition unit 14 acquires the “question” from the record that includes any keyword included in the target keyword group in the “keyword” and that indicates the “required knowledge level” to be equal to or smaller than the knowledge level of the speaker B. When there are a plurality of “questions”, the questions may be sorted, for example, in descending order of the “number of outputs”.
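The selection in step S205 can be sketched as a filter and sort over the question DB 123; the record fields mirror the description ("keyword", "required knowledge level", "number of outputs"), but the concrete rows, including the second question, are illustrative assumptions.

```python
# Hypothetical sketch of step S205: pick records whose keyword is in the
# target keyword group and whose required knowledge level does not exceed
# the speaker B's estimated level, sorted by the number of past outputs.

question_db = [  # stands in for the question DB 123; rows are assumed
    {"keyword": "tethering", "required_level": 2,
     "question": "By tethering, can I use the Internet on my laptop computer?",
     "outputs": 12},
    {"keyword": "tethering", "required_level": 4,
     "question": "Does tethering work over Wi-Fi?",  # hypothetical entry
     "outputs": 3},
]


def acquire_questions(target_keywords, knowledge_level: int) -> list:
    """Return candidate questions in descending order of output count."""
    candidates = [r for r in question_db
                  if r["keyword"] in target_keywords
                  and r["required_level"] <= knowledge_level]
    return sorted(candidates, key=lambda r: r["outputs"], reverse=True)


qs = acquire_questions({"tethering"}, knowledge_level=2)
print(qs[0]["question"])
```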
Subsequently, the question output unit 15 outputs (displays) the question acquired by the question acquisition unit 14 to the display device 106 (S206). The display device 106 is disposed so as to be visually recognizable by the speaker A and the speaker B.
Then, the speaker A speaks the answer to the question. In accordance with the question and the answer, it can be expected that the speaker B is able to understand the speech content of the speaker A, which the speaker B could not understand before.
The following is a specific example of the dialog between the speaker A and the speaker B and the questions output by the dialog assistance apparatus 10.
A(1): "Do you use wireless LAN at home?"
A(2): "Do you use tethering when you are out?"
B(2): "Well . . . "
Dialog assistance apparatus 10: "By tethering, can I use the Internet on my laptop computer?"
A(3): "Yes."
B(3): "I do not use my laptop computer outside, so I do not think I use tethering."
In the above, A(m) (m=1 to 3) represents a speech uttered by the speaker A, and B(m) (m=1 to 3) represents a speech uttered by the speaker B. In this dialog, step S202 and the subsequent steps are performed according to the facial expression of the speaker B when the speaker B has spoken "Well . . . ". In step S206 performed as a result, the dialog assistance apparatus 10 outputs the question "By tethering, can I use the Internet on my laptop computer?" to the speaker A on behalf of the speaker B. In response, the speaker A answers ("Yes."). This answer allows the speaker B to respond to the speech A(2) (with the speech B(3)) even if the speaker B does not fully understand the meaning of "tethering", thus facilitating the dialog between the two. In other words, the dialog between the two remains engaged, and a breakdown of the dialog is avoided.
In this case, like the specific example described above, the question output unit 15 outputs the question "By tethering, can I use the Internet on my laptop computer?" on behalf of the speaker B. Although the present embodiment has described the example in which the output form of the question is display, the question output unit 15 may also output the question by voice. In that case, the dialog assistance apparatus 10 needs to include a loudspeaker.
Another case is also assumable, as illustrated in the drawings, in which a plurality of speakers B are present.
Even when a plurality of speakers B are present, there may be no need to limit the output of the question based on the threshold. In this case, the understanding level estimation unit 12 may estimate the understanding level of each speaker B in parallel, and the knowledge level estimation unit 13 may estimate the knowledge level of each speaker B in parallel. The question acquisition unit 14 may then acquire, from the question DB 123, the question to be output to the speaker A based on the lowest knowledge level among the plurality of estimated knowledge levels. In this manner, the question is output according to the speaker B having the lowest knowledge level.
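For the plural-speakers-B variant, the selection of the knowledge level that drives question acquisition reduces to taking the minimum of the per-speaker estimates; the example values below are illustrative.

```python
# Hypothetical sketch of the plural-speakers-B variant: the question is
# acquired according to the lowest estimated knowledge level, so that it
# suits the least knowledgeable listener.

def level_for_question(estimated_levels) -> int:
    """Select the knowledge level used for question acquisition."""
    return min(estimated_levels)


print(level_for_question([3, 1, 2]))  # 1
```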
In accordance with the present embodiment, as described above, when the speaker B cannot understand the speech content of the speaker A (the content of the dialog with the speaker A), the dialog assistance apparatus 10 outputs (gives notice of) the question to the speaker A according to the knowledge level of the speaker B on behalf of the speaker B. As the speaker A answers the question, the speaker B can respond to the speech content based on the answer without fully understanding the speech content. This makes it possible to assist in facilitating the dialog.
In the present embodiment, the knowledge level estimation unit 13 is an example of a first estimation unit. The question acquisition unit 14 is an example of an acquisition unit. The question output unit 15 is an example of an output unit. The understanding level estimation unit 12 is an example of a second estimation unit. The speaker A is an example of a first speaker. The speaker B is an example of a second speaker.
Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to such specific embodiments, and various modifications and changes can be made within the scope of the gist of the present disclosure described in the aspects.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2020/011193 | 3/13/2020 | WO |