This application is based on and claims priority under 35 U.S.C. § 119 to Japanese Patent Application 2022-024907, filed on Feb. 21, 2022, the entire content of which is incorporated herein by reference.
This disclosure relates to a dialogue system and a dialogue unit.
JP 2011-215900A (Reference 1) discloses a multimodal dialogue device. The device identifies whether a user is having a dialogue with the dialogue device by inputting a voice feature and a feature related to a face direction of the user to an identification model, for example.
However, when determining whether the user is having a dialogue with the device based on the feature related to the face direction and the voice feature, the accuracy of the determination is not necessarily high.
According to an aspect of this disclosure, there is provided a dialogue system that dialogues with a user, the dialogue system including: an execution device configured to execute a detection process and a recognition process, in which the recognition process is a process of recognizing that the user talks to the dialogue system when a predetermined state is detected by the detection process, the detection process is a process of detecting the predetermined state using, as an input, an output signal of a camera that images the user, and the predetermined state includes a state in which a mouth of the user moves.
According to another aspect of this disclosure, there is provided a dialogue unit included in the dialogue system described above.
The foregoing and additional features and characteristics of this disclosure will become more apparent from the following detailed description considered with reference to the accompanying drawings, wherein:
Hereinafter, a first embodiment will be described with reference to the drawings.
A control device 20 controls an image displayed on the display unit 12 by operating the display unit 12. At this time, the control device 20 refers to RGB image data Drgb output by an RGB camera 30 in order to control the image. The RGB camera 30 is disposed toward a direction in which the user is assumed to be located. The RGB image data Drgb includes luminance data of each of three primary colors including red, green, and blue. In addition, the control device 20 refers to infrared image data Dir output by an infrared camera 32 in order to control the image. The infrared camera 32 is also disposed toward the direction in which the user is assumed to be located. In addition, the control device 20 refers to an output signal Ss of a microphone 34 in order to control the image. The microphone 34 is provided to sense a sound signal generated by the user.
The control device 20 outputs a voice signal by operating a speaker 36 according to an action of the agent image 14. The voice signal is a signal indicating a content uttered by the agent indicated by the agent image 14.
The control device 20 includes a PU 22 and a storage device 24. The PU 22 is a software processing device including at least one of a CPU, a GPU, a TPU, and the like. The storage device 24 stores scenario data 24b. The scenario data 24b includes a finite automaton. The PU 22 controls a dialogue between the user and the agent according to the scenario data 24b. In the following description, among processes executed by the control device 20, an “authentication process” and a “process related to user utterance detection” will be particularly described in detail.
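To make the role of the scenario data concrete, the following is a minimal sketch, assuming a Python representation in which states and transitions are plain dictionaries; the class name ScenarioAutomaton and the example states and events are hypothetical and are not taken from the embodiment.

```python
# Minimal sketch (not the embodiment's implementation): scenario data as a
# finite automaton whose states hold an agent utterance and whose transitions
# are keyed by events such as "wearing_flag_on".
class ScenarioAutomaton:
    def __init__(self, states, transitions, start):
        self.states = states            # state name -> utterance text
        self.transitions = transitions  # (state, event) -> next state name
        self.current = start

    def on_event(self, event):
        key = (self.current, event)
        if key in self.transitions:
            self.current = self.transitions[key]
        return self.states[self.current]

# Hypothetical scenario: if a mask or sunglasses are detected, move to a state
# that asks the user to remove the wearing object.
scenario = ScenarioAutomaton(
    states={"greet": "Hello, how can I help you?",
            "ask_remove_wearing": "Could you remove your mask or sunglasses?"},
    transitions={("greet", "wearing_flag_on"): "ask_remove_wearing"},
    start="greet",
)
print(scenario.on_event("wearing_flag_on"))
```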
In a series of processes shown in
Next, the PU 22 determines whether a face of the user is located in a predetermined region in a region indicated by the RGB image data Drgb (S14). The predetermined region may be a center region of the region indicated by the RGB image data Drgb. This process can be executed by calculating coordinates of a predetermined position in, for example, a contour of the face. This process can be implemented by including, for example, in the mapping data 24c, data defining a contour output mapping that receives the RGB image data Drgb as an input and outputs the coordinates of the predetermined position. The contour output mapping is a trained model that is trained using the RGB image data Drgb and the corresponding coordinates of the predetermined position as training data. The contour output mapping may be implemented by, for example, the CNN. Alternatively, for example, a Transformer or the like may be used.
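The check of S14 may be pictured as follows; this is a minimal sketch, assuming that the contour output mapping is already available as a callable (here the hypothetical name predict_face_contour_point) that returns pixel coordinates of the predetermined position, and that the center region is the middle third of the frame.

```python
import numpy as np

def face_in_center_region(rgb_image: np.ndarray, predict_face_contour_point) -> bool:
    """Return True if the predetermined contour point lies in the center region.

    `predict_face_contour_point` stands in for the trained contour output
    mapping described above (e.g. a CNN); it is assumed to take the RGB image
    and return (x, y) pixel coordinates of the predetermined position.
    """
    h, w = rgb_image.shape[:2]
    x, y = predict_face_contour_point(rgb_image)
    # Hypothetical "center region": the middle third of the frame in both axes.
    return (w / 3 <= x <= 2 * w / 3) and (h / 3 <= y <= 2 * h / 3)

# Usage with a dummy predictor that always reports the image center.
dummy_predictor = lambda img: (img.shape[1] / 2, img.shape[0] / 2)
frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(face_in_center_region(frame, dummy_predictor))  # True
```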
When it is determined that the face of the user is within the predetermined region (S14: YES), the PU 22 determines whether the face direction of the user is toward the front of the RGB camera 30 by using the infrared image data Dir as an input (S16). The face direction detection process here is the same as a process in S46 of
Meanwhile, when it is determined that the face direction is toward the front (S16: YES), the PU 22 determines whether it was determined in the process of S12 that there is any wearing object (S20). When it is determined that either the mask or the sunglasses is worn (S20: YES), the PU 22 turns on a wearing flag (S22). As a result, it is possible, for example, to make a transition to a state defined by the scenario data 24b when the wearing flag is turned on and to handle that state. Specifically, for example, the transition destination state may be a state that defines an utterance process in which the agent asks the user to remove the wearing object.
Meanwhile, when it is determined that there is no wearing object (S20: NO), the PU 22 executes a personal authentication process of the user (S24). This process is a process in which the PU 22 determines whether a face image of the user matches any of the face images included in registration data 24d stored in the storage device 24 shown in
Meanwhile, when it is determined that there is the registration information (S26: YES), the PU 22 extracts the registration information from the registration data 24d (S34). When a response content according to the scenario data 24b is selected, the response content is stored in the registration data 24d (S36). In this manner, it is possible to avoid repeatedly uttering content that has already been spoken to the user.
When the PU 22 makes a negative determination in the processes of S14 and S30 and when the processes of S18, S22, S32, and S36 are completed, the PU 22 temporarily ends the series of processes shown in
When having a dialogue with the user, it is desirable to accurately and quickly detect user utterance. Hereinafter, the process related to the user utterance detection will be described in detail.
In a series of processes shown in
Next, the PU 22 determines whether there is a mouth movement according to a determination as to whether the amount indicating the distance between the upper lip and the lower lip is equal to or larger than a predetermined value (S44). This process is a process of determining whether the user moves the mouth for utterance. In the process of S44, it is determined that the user moves the mouth for utterance when the amount indicating the distance between the upper lip and the lower lip is equal to or larger than the predetermined value. When it is determined that there is a mouth movement (S44: YES), the PU 22 executes a process of detecting the face direction of the user (S46). Here, the PU 22 detects the face direction by shape model fitting. Specifically, the PU 22 calculates coordinates of a predetermined feature point of the face using a regression model. Then, the PU 22 calculates the face direction by fitting the calculated coordinates of the feature point to a predetermined shape model. This process can be implemented by including the regression model and the shape model in the mapping data 24c.
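The threshold test of S44 may be pictured as in the following sketch; the landmark coordinates, the threshold value, and the helper name mouth_is_moving are illustrative assumptions rather than values from the embodiment.

```python
import numpy as np

MOUTH_OPEN_THRESHOLD = 4.0  # hypothetical predetermined value, in pixels

def mouth_is_moving(upper_lip_xy: np.ndarray, lower_lip_xy: np.ndarray) -> bool:
    """S44 sketch: the user is judged to be moving the mouth for utterance
    when the amount indicating the distance between the upper lip and the
    lower lip is equal to or larger than a predetermined value."""
    distance = float(np.linalg.norm(upper_lip_xy - lower_lip_xy))
    return distance >= MOUTH_OPEN_THRESHOLD

# The lip coordinates would come from the regression model applied to the
# infrared image data Dir; here they are simply given as example points.
print(mouth_is_moving(np.array([100.0, 200.0]), np.array([100.0, 207.0])))  # True
```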
Then, the PU 22 determines whether the face direction of the user is an agent direction (S48). When it is determined that the face direction is the agent direction (S48: YES), the PU 22 acquires sound data Ds by converting the analog output signal Ss into the digital sound data Ds (S50). Then, the PU 22 inputs the sound data Ds to a VAD mapping, thereby calculating a value of a variable indicating whether the sound data Ds is in a voice detection section (S52). The VAD mapping is a trained model that outputs the value of the variable indicating whether the sound data Ds is in the voice detection section based on time-series data of the sound data Ds. The VAD mapping is one of mappings defined by the mapping data 24c. The VAD mapping is implemented using, for example, a hidden Markov model (hereinafter referred to as HMM).
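The VAD mapping described above is a trained model such as an HMM; the following sketch only illustrates its input/output contract (frames of the digital sound data Ds in, a per-frame voice/non-voice flag out), with a simple energy heuristic standing in for the trained model.

```python
import numpy as np

def vad_mapping(sound_data: np.ndarray, frame_len: int = 400,
                energy_threshold: float = 1e-3) -> np.ndarray:
    """Stand-in for the VAD mapping: returns one flag per frame, True while a
    voice detection section is judged to be present.

    The embodiment uses a trained model such as an HMM; the energy threshold
    here is a hypothetical placeholder with the same interface.
    """
    n_frames = len(sound_data) // frame_len
    frames = sound_data[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames.astype(np.float64) ** 2, axis=1)
    return energy >= energy_threshold

# Example: one second of silence followed by a louder segment at 16 kHz.
ds = np.concatenate([np.zeros(16000), 0.1 * np.random.randn(16000)])
flags = vad_mapping(ds)
print(flags.any())  # True: a voice detection section is reported
```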
When it is determined that the sound data Ds is in the voice detection section (S54: YES), the PU 22 determines that the user is uttering toward the dialogue system 10 (S56). Then, the PU 22 determines whether the agent is uttering (S58). “The agent is uttering” means a state in which a voice signal is output from the speaker 36 according to the action of the agent image 14. When it is determined that the agent is uttering (S58: YES), the PU 22 stops the utterance (S60).
When the process of S60 is completed and when a negative determination is made in the process of S58, the PU 22 operates the display unit 12 to display an image of a listening posture as the agent image 14 (S62). The listening posture includes a nodding action. The listening posture may also include a posture in which the face of the user is gazed at.
When the process of S62 is completed and when the PU 22 makes a negative determination in the processes of S44, S48, and S54, the PU 22 temporarily ends the series of processes shown in
Here, for example, the PU 22 causes the agent to explain that a shift to the left panel can be made by looking at the left panel 16, uttering “left panel”, or pointing at the left panel 16. In addition, for example, the PU 22 causes the agent to explain that a shift to the right panel can be made by looking at the right panel 16, uttering “right panel”, or pointing at the right panel 16. Accordingly, when the user desires to shift to the left panel, the user looks at the left panel 16, utters “left panel”, or points at the left panel 16.
Here, for example, when the user utters “left panel” after the agent explains a method of selecting the left panel and before the agent explains a method of selecting the right panel 16, the agent immediately takes the listening posture. That is, when the user utters “left panel”, the mouth of the user moves, and thus the PU 22 makes a positive determination in the process of S44. When the user hears the explanation of the agent and utters “left panel”, the face direction of the user is the agent direction. Therefore, the PU 22 makes a positive determination in the process of S48. In addition, when the user utters “left panel”, the PU 22 determines that the utterance is in a voice section. Therefore, the PU 22 determines that the user utters, and sets the agent to the listening posture. Therefore, the user can recognize that his/her request is accepted.
Further, for example, when the user utters “left panel” while the agent is explaining the method of selecting the right panel 16 after explaining the method of selecting the left panel, the agent stops explaining and takes the listening posture. Therefore, the user can recognize that his/her request is accepted.
Further, for example, when the user points at the left panel 16 while the agent is explaining the method of selecting the right panel 16 after explaining the method of selecting the left panel, the agent stops explaining.
According to the present embodiment described above, the following effects are further obtained.
(1) The PU 22 determines that the user utters under a condition that the mouth moves, in addition to a condition for determining that the utterance is in the voice section. As a result, even when it is erroneously determined that the utterance is in the voice section due to noise, it is possible to prevent an erroneous determination that the user utters. Therefore, it is possible to prevent an occurrence of barge-in, in which the user interrupts the utterance of the agent because the agent has stopped its utterance.
(2) Among the inputs of the determination process as to whether the user utters, the input corresponding to the output signal Ss is the detection result of the voice section detection. As a result, it is possible to reduce a calculation load of the determination process, for example, as compared with a case of taking into consideration a content obtained by converting the output signal into text data by voice recognition. Therefore, the determination process can be quickly completed.
Hereinafter, a second embodiment will be described with reference to the drawings, focusing on differences from the first embodiment.
In a series of processes shown in
Next, the PU 22 determines whether a logical product of the following condition (A) and condition (B) is true (S48a).
Condition (A): a condition that the face direction of the user is the agent direction. The determination process as to whether this condition is satisfied is the same as the processes of S46 and S48 of
Condition (B): a condition that the line of sight of the user is the agent direction. Here, the PU 22 estimates a direction of the line of sight by the shape model fitting using a 3D model of eyes. This process can be implemented by including, in the mapping data 24c, the 3D model of the eyes and a mapping for outputting a shape model fitting result using the feature point as an input.
When it is determined that the logical product is true (S48a: YES), the PU 22 proceeds to the process of S50, and when it is determined that the logical product is false (S48a: NO), the PU 22 temporarily ends the series of processes shown in
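Conceptually, S48a reduces to a logical AND over the two detection results; in the following sketch the estimator callables standing in for the face-direction and line-of-sight processes are assumptions.

```python
from typing import Callable
import numpy as np

def user_addresses_agent(ir_image: np.ndarray,
                         face_dir_is_agent: Callable[[np.ndarray], bool],
                         gaze_is_agent: Callable[[np.ndarray], bool]) -> bool:
    """S48a sketch: true only when condition (A) (face direction toward the
    agent) and condition (B) (line of sight toward the agent) both hold."""
    return face_dir_is_agent(ir_image) and gaze_is_agent(ir_image)

# Stub example: face direction matches but the gaze does not.
frame = np.zeros((240, 320), dtype=np.uint8)
print(user_addresses_agent(frame, lambda img: True, lambda img: False))  # False
```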
As described above, in the present embodiment, the condition that the line of sight of the user is the agent direction is added as the condition for determining that the user utters. Accordingly, it is possible to determine whether the user utters with higher accuracy.
Hereinafter, a third embodiment will be described with reference to the drawings, focusing on differences from the first embodiment.
In the first embodiment, when the user performs a pointing action such as pointing at the left panel 16, the agent does not take the listening posture. However, it is also assumed that the user utters “left panel” or the like at the same time as performing the pointing action. In the present embodiment, a process in such a case is separately provided.
In a series of processes shown in
Next, the PU 22 determines whether the user is pointing at a predetermined region of the display unit 12 (S72). Here, in the example shown in
Hereinafter, a fourth embodiment will be described with reference to the drawings, focusing on differences from the first embodiment.
The backend unit 50 executes an arithmetic process using data transmitted from the control device 20 as an input. The backend unit 50 includes a PU 52, a storage device 54, and a communication device 56. The PU 52 is a software processing device including at least one of a CPU, a GPU, a TPU, and the like. The storage device 54 stores the mapping data 24c.
In a series of processes shown in (a) of
Meanwhile, as shown in (b) of
Meanwhile, as shown in (a) of
Condition (C): a condition that the mouth of the user moves.
Condition (D): a condition that the VAD is performed.
When it is determined that the logical product is true (S86: YES), the PU 22 executes the processes of S56 to S62. Meanwhile, when it is determined that the logical product is false (S86: NO) and when the process of S62 is completed, the PU 22 temporarily ends the series of processes shown in (a) of
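On the dialogue unit side, S86 amounts to combining the two determination results returned from the backend unit; the message format and field names in the following sketch are hypothetical.

```python
import json

def s86_logical_product(backend_reply: str) -> bool:
    """Sketch of S86: the backend unit is assumed to return a JSON message
    containing the two determination results; the dialogue unit only checks
    that condition (C) (mouth movement) and condition (D) (VAD) both hold."""
    result = json.loads(backend_reply)
    return bool(result.get("mouth_moves")) and bool(result.get("vad"))

# Example reply from the backend unit (hypothetical field names).
reply = '{"mouth_moves": true, "vad": true}'
if s86_logical_product(reply):
    print("User utterance recognized: switch the agent to the listening posture")
```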
As described above, in the present embodiment, the processes corresponding to the processes of S42, S44, S46, S48, S52, and S54 in
The processes executed by the backend unit 50 are not limited to the processes corresponding to the processes of S42, S44, S46, S48, S52, and S54 in
A correspondence relationship between matters in the above embodiments and matters described below in the section “Solution to Problem” is as follows. In the following, the correspondence relationship is shown for each of the numbers assigned to the solutions described below in the section “Solution to Problem”. [1-3] The execution device corresponds to the PU 22 in
The present embodiment can be modified and implemented as follows. The present embodiment and the following modifications can be implemented in combination with each other within a technically consistent range.
State in which Line of Sight is Directed to Dialogue System
A condition for determining that the line of sight is directed to the dialogue system is not limited to the condition exemplified in the above embodiments. For example, the condition (B) may be replaced with a condition that the line of sight of the user is directed to the display unit 12. However, the condition for determining that the line of sight is directed to the dialogue system is not limited to the condition that the line of sight of the user is directed to a display region of the display device. For example, when the dialogue unit does not include the display device as will be described in the section “Dialogue Unit” below, a condition that the line of sight of the user is directed to the microphone 34 may be set as the condition for the above determination.
The predetermined state serving as a condition for recognizing that the user talks to the agent is not limited to the state exemplified in the above embodiments. For example, the predetermined state may be a state in which a logical product of the state in which the mouth moves and the state in which the VAD is performed is true. In addition, for example, the predetermined state may be a state in which a logical product of the state in which the mouth moves, the state in which the line of sight is directed to the dialogue system, and the state in which the VAD is performed is true. In addition, for example, the predetermined state may be a state in which a logical product of the state in which the line of sight is directed to the dialogue system and the state in which the VAD is performed is true. In addition, for example, the predetermined state may be a state in which a logical product of the state in which the face direction is the agent direction, the state in which the line of sight is directed to the dialogue system, and the state in which the VAD is performed is true. In addition, for example, in the process of
The predetermined state does not have to include the state in which the VAD is performed. For example, when the state in which the mouth moves is detected, it may be determined that the state is the predetermined state, and the PU 22 may set the agent to the listening posture. However, in this case, when the VAD is not performed for a predetermined period thereafter, the PU 22 may cancel the listening posture of the agent. In this case, the PU 22 may operate the speaker 36 to output a voice signal having a content for confirming intention of the utterance, such as “what did you say” to the user after the cancellation.
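One way to picture this modification is as a small timer-based procedure, sketched below; the period length, the callable names, and the confirmation utterance handling are illustrative assumptions.

```python
import time

def listening_with_timeout(mouth_moves, vad_detected, speak, set_listening,
                           timeout_s: float = 3.0):
    """Sketch of the modification: enter the listening posture as soon as the
    mouth is seen to move, then cancel the posture and ask for confirmation if
    no voice section (VAD) is detected within the predetermined period."""
    if not mouth_moves():
        return
    set_listening(True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if vad_detected():
            return  # a voice arrived in time; keep the listening posture
        time.sleep(0.05)
    set_listening(False)
    speak("What did you say?")  # confirm the intention of the utterance

# Stub example: the mouth moves but no voice follows within the period.
listening_with_timeout(lambda: True, lambda: False, print,
                       lambda on: print(f"listening posture: {on}"),
                       timeout_s=0.2)
```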
(a) Detection of State in which Mouth Moves
In the above embodiments, the infrared image data Dir is used as the input of the regression model that outputs coordinate values of the predetermined position of the contour of the mouth. However, the image data is not limited thereto. For example, the RGB image data Drgb may be used.
For example, the state in which the mouth moves may be detected using an identification model that determines whether the mouth moves according to time-series data of image data including the mouth. As the identification model, a recurrent neural network (hereinafter referred to as RNN) or the like may be used. In this case, the image data input to the identification model at one time is data imaged at one timing, but one output value is calculated with reference to not only the latest image data but also the past image data. However, the identification model is not limited to the RNN. For example, a neural network using an attention mechanism, such as a Transformer, may be used. In this case, a model may be obtained in which a predetermined number of pieces of image data imaged at adjacent timings are input at one time to calculate an output value. The identification model in the case of using the time-series data does not have to be a neural network. For example, after features of the respective pieces of image data constituting the time-series data are extracted, it may be determined whether the mouth moves by inputting the extracted features to a support vector machine.
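As a sketch of the non-neural-network variant mentioned last (features extracted per frame and classified by a support vector machine), the following uses synthetic data and a deliberately simple feature extractor; none of this is the embodiment's actual model.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(frame_window: np.ndarray) -> np.ndarray:
    """Dummy feature extractor: per-frame mean brightness of the mouth region
    over a fixed-length window (a real system would use richer features)."""
    return frame_window.reshape(frame_window.shape[0], -1).mean(axis=1)

rng = np.random.default_rng(0)
# 100 synthetic windows of 8 frames x 16x16 pixels, labelled 1 = mouth moving.
windows = rng.random((100, 8, 16, 16))
labels = rng.integers(0, 2, size=100)
X = np.stack([extract_features(w) for w in windows])

clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))  # 0 or 1: whether the mouth is judged to move
```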
(b) Detection of State in which Line of Sight is Directed to Dialogue System
In the above embodiments, the infrared image data Dir is used as the input of the regression model that outputs coordinate values of predetermined feature points of contours of the eyes, the mouth, the nose, and the like. However, the image data is not limited thereto. For example, the RGB image data Drgb may be used.
A method of estimating the line of sight is not limited to a model-based method such as the shape model fitting. For example, an appearance-based method using a trained model that outputs a gaze point using image data as an input may be used. Here, as the trained model, for example, a linear regression model, a Gaussian process regression model, the CNN, or the like may be used.
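A minimal appearance-based sketch follows, assuming flattened eye-image patches as the input and a two-dimensional gaze point as the output; the synthetic data and the choice of a linear regression model are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic training set: flattened 12x24 eye patches and their gaze points
# (x, y) on the display, as an appearance-based method would use.
eye_patches = rng.random((200, 12 * 24))
gaze_points = rng.random((200, 2))

model = LinearRegression().fit(eye_patches, gaze_points)
predicted = model.predict(eye_patches[:1])[0]
print(f"estimated gaze point: ({predicted[0]:.2f}, {predicted[1]:.2f})")
```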
(c) Method of Detecting Section in which Voice is Output
The trained model for executing the VAD is not limited to the model using the HMM. For example, the trained model may be a support vector machine that identifies a section in which a voice is output by inputting time series data of a feature at one time. In addition, for example, an identification model using the RNN or the like may be used.
The method of detecting the section in which the voice is output is not limited to the method using the trained model. For example, the method may be a process of determining whether a sound pressure level having a frequency component characteristic of the voice is equal to or higher than a threshold value. At this time, an input of the process of detecting the section in which the voice is output is not limited to the digital sound data Ds, and may be the analog output signal Ss itself. In other words, the process of detecting the section in which the voice is output may be executed by an analog process. This process can be implemented by using a dedicated analog circuit or the like.
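The threshold-based alternative can be sketched as a band-energy check over a voice-characteristic frequency range; the band limits, the threshold, and the normalization below are assumptions.

```python
import numpy as np

def voice_band_active(signal: np.ndarray, sample_rate: int = 16000,
                      band=(300.0, 3400.0), threshold: float = 1e-2) -> bool:
    """Return True when the energy in a voice-characteristic frequency band
    (hypothetically 300-3400 Hz) is at or above a threshold."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_energy = np.mean(spectrum[in_band] ** 2) / len(signal)
    return band_energy >= threshold

# Example: a 1 kHz tone registers as in-band voice energy.
t = np.arange(16000) / 16000
print(voice_band_active(0.1 * np.sin(2 * np.pi * 1000 * t)))  # True
```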
(d) Detection of State in which Voice is Output
The detection of the state in which the voice is output is not limited to the VAD. For example, the process may be a process of determining whether a word is actually spoken by a voice recognition process.
(e) Detection of State in which Face Direction is Directed to Dialogue System
In the above embodiments, the infrared image data Dir is used as the input of the regression model that outputs coordinate values of predetermined feature points of contours of the eyes, the mouth, the nose, and the like. However, the image data is not limited thereto. For example, the RGB image data Drgb may be used.
For example, the state in which the face direction is the agent direction may be detected using an identification model that determines whether the face direction is the agent direction using image data as an input. Here, as the identification model, a neural network or the like can be used.
The posture of listening to the user does not have to include the nodding action. For example, the posture may be a posture of looking at the user without moving.
The utterance process is not limited to a process executed according to the scenario data 24b. For example, the process may be a process of uttering a response sentence obtained by inputting text data indicating an utterance content of the user to the trained model. In this case as well, when personal authentication is performed, content that has already been spoken to the user may be excluded from being an utterance target again.
In
The process executed by the backend unit 50 is not limited to the process in the system shown in
The number of backend units does not have to be one. For example, a backend unit that executes the process of converting the sound data Ds into the text data and a backend unit that executes the determination process as to whether the mouth moves may be disposed at different places.
The display device is not limited to the device including the display unit 12. For example, the display device may be a display device using holography. In addition, for example, a head-up display or the like may be used.
The dialogue unit does not have to include the display device.
The dialogue system does not have to include the backend unit 50.
The execution device is not limited to a device that executes software processing, such as a CPU, a GPU, and a TPU. For example, a dedicated hardware circuit that performs hardware processing on at least a part of what is software-processed in the above embodiment may be provided. Here, the dedicated hardware circuit may be, for example, an ASIC or the like. That is, the execution device may have any one of the following configurations (a) to (c). (a) The execution device includes a processing device that executes all of the above processes according to a program, and a program storage device that stores the program. (b) The execution device includes a processing device that executes a part of the above processes according to a program, a program storage device, and a dedicated hardware circuit that executes the remaining processes. (c) The execution device includes a dedicated hardware circuit that executes all the processes described above. Here, there may be multiple software execution devices each including a processing device and a program storage device, and there may be multiple dedicated hardware circuits.
Hereinafter, solutions to the related art problems and effects thereof will be described.
1. A dialogue system that dialogues with a user, the dialogue system including: an execution device configured to execute a detection process and a recognition process, in which the recognition process is a process of recognizing that the user talks to the dialogue system when a predetermined state is detected by the detection process, the detection process is a process of detecting the predetermined state using, as an input, an output signal of a camera that images the user, and the predetermined state includes a state in which a mouth of the user moves.
In the above configuration, it is recognized that the user talks to the dialogue system under the condition that the predetermined state is detected. Here, since the predetermined state includes the state in which the mouth of the user moves, utterance of the user can be determined with a high probability. Therefore, by determining that the user talks to the dialogue system based on this state, it is possible to improve determination accuracy.
2. The dialogue system according to the above 1, in which the predetermined state further includes at least one of two states including a state in which a line of sight of the user is directed toward the dialogue system and a state in which the user points at the dialogue system.
In the above configuration, when the predetermined state includes the state in which a line of sight of the user is directed toward the dialogue system, it is possible to infer the user's intention to have a dialogue with the dialogue system. Therefore, by determining that the user talks to the dialogue system based on this state, it is possible to improve determination accuracy. When the predetermined state includes the state in which the user points at the dialogue system, it is possible to infer the user's intention to try to communicate with the dialogue system. Therefore, by determining that the user talks to the dialogue system based on this state, it is possible to improve determination accuracy.
3. The dialogue system according to the above 1 or 2, in which the detection process is a process of detecting the predetermined state using, as an input, an output signal of a microphone in addition to the output signal of the camera, and the predetermined state includes a state in which both the state in which the mouth of the user moves and a state in which a voice is output are established.
In the above configuration, the predetermined state includes the state in which both the state in which the mouth of the user moves and the state in which a voice is output are established. Therefore, it is possible to identify a situation in which a probability that the output of the voice is from the user is high or a situation in which a probability that the output of the voice is from the user toward the dialogue system is high.
4. The dialogue system according to the above 3, in which the detection process includes a voice section detection process of detecting a section in which the voice is output, and the state in which the voice is output in the predetermined state is a state in which the voice is determined to be in a voice section by the voice section detection process.
In the above configuration, the state in which the voice is determined to be in the voice section is detected as the state in which the voice is output. Therefore, it is possible to reduce a calculation load of the determination process as to whether the state is the predetermined state, for example, as compared with a case of taking into consideration a content obtained by converting the output signal into text data by voice recognition. Therefore, the determination process as to whether the state is the predetermined state can be quickly completed.
5. The dialogue system according to any one of the above 1 to 4, in which the execution device further executes a display process of displaying an agent image by operating a display device, the agent image is an image of an agent, the agent is a person who dialogues with the user, and the display process includes a process of displaying an image indicating the agent taking a posture listening to the user when it is recognized by the recognition process that the user talks to the dialogue system.
In the above configuration, when it is recognized that the user talks to the dialogue system, the agent takes a listening posture, and thus the user can recognize that the agent is listening to him/her.
6. The dialogue system according to any one of the above 3 to 5, in which the execution device further executes an utterance process of talking to the user by operating the speaker, and a stop process of stopping the utterance process when it is recognized by the recognition process that the user talks to the dialogue system during execution of the utterance process.
In the above configuration, when it is recognized that the user talks to the dialogue system, the utterance process is stopped. Here, the recognition that the user talks to the dialogue system is performed with high accuracy since the recognition process is based on the predetermined state. Therefore, it is possible to prevent the execution of the stop process when noise is erroneously detected as being in the state in which the voice is output. When the stop process is executed in a case where noise is erroneously detected as being in the state in which the voice is output, there is a concern that so-called barge-in, in which the user talks over the agent, may occur. Therefore, in the above configuration, it is possible to prevent the occurrence of the barge-in.
7. The dialogue system according to the above 6, further including: a storage device, in which the execution device further executes a registration process, an authentication process, and a history information association process, the registration process is a process of storing information on a face image of the user in the storage device by using the output signal of the camera as an input, the authentication process includes a process of determining whether the user who is imaged by the camera is the user stored in the storage device, using the output signal of the camera as an input, the history information association process is a process of storing, in the storage device, history information on communication between the user stored in the storage device and the dialogue system, and the utterance process includes a process of determining a content of utterance based on the history information associated with the user authenticated by the authentication process.
In the above configuration, since the content of the utterance is determined according to the history information, it is possible to prevent content that has already been spoken to the user from being repeated.
8. The dialogue system according to any one of the above 1 to 7, in which the execution device includes a first execution device and a second execution device, the dialogue system further includes a dialogue unit and a backend unit, the dialogue unit includes the first execution device and a first communication device, the backend unit includes the second execution device, a second storage device, and a second communication device, the second storage device stores mapping data, the detection process includes an output signal acquisition process, an image data transmission process, an image data receiving process, a state determination process, a determination result transmission process, and a determination result receiving process, the output signal acquisition process is a process in which the first execution device acquires the output signal of the camera, the image data transmission process is a process in which the first execution device operates the first communication device to transmit image data corresponding to the output signal of the camera to the backend unit, the image data receiving process is a process in which the second execution device receives the image data, the mapping data is data that defines a determination mapping, the determination mapping is a process of outputting a variable used for determining whether the image data is in the predetermined state, using the image data as an input, the state determination process is a process in which the second execution device inputs the image data to the determination mapping to calculate a variable related to a determination result of whether the image data is in the predetermined state, the determination result transmission process is a process in which the second execution device operates the second communication device to transmit the variable related to the determination result to the dialogue unit, and the determination result receiving process is a process in which the first execution device receives the determination result.
In the above configuration, since the state determination process is executed by the backend unit, a calculation load on the dialogue unit can be reduced.
9. The dialogue unit included in the dialogue system according to 8 described above.
The principles, preferred embodiment and mode of operation of the present invention have been described in the foregoing specification. However, the invention which is intended to be protected is not to be construed as limited to the particular embodiments disclosed. Further, the embodiments described herein are to be regarded as illustrative rather than restrictive. Variations and changes may be made by others, and equivalents employed, without departing from the spirit of the present invention. Accordingly, it is expressly intended that all such variations, changes and equivalents which fall within the spirit and scope of the present invention as defined in the claims, be embraced thereby.