The present disclosure relates to information processing apparatuses, control methods, and programs.
Techniques of performing speech recognition and semantic analysis on a user's speech and responding by voice output have conventionally been developed. In particular, recent progress in speech recognition algorithms and advances in computer technologies have allowed speech recognition to be processed in a practical time, and therefore, user interfaces (UIs) using voice have become widely used in smartphones, tablet terminals, and the like.
For example, a voice UI application installed in a smartphone, tablet terminal, or the like can respond by voice to inquiries made through a user's voice, or perform processes corresponding to instructions given through a user's voice.
Patent Literature 1: JP 2012-181358A
However, in a typical voice UI using speech recognition, only a single, finally determined response is returned with respect to a user's voice input. Therefore, the user must wait until the system has completed its processing. During this waiting time, no feedback is given from the system to the user, so that the user may worry that their voice input is not being properly processed.
Also, Patent Literature 1 described above proposes a technique of automatically converting an input voice into text; specifically, a system for converting an input voice into text and displaying the text in real time. However, that system does not assume the above voice UI. Specifically, only text obtained by converting an input voice is displayed; neither semantic analysis nor a response (also referred to as a responding action) based on semantic analysis is fed back, unlike in voice interaction. Therefore, the user cannot observe the specific action caused by their speech until the system has started the action.
With the above in mind, the present disclosure proposes an information processing apparatus, control method, and program capable of notifying a user of a candidate for a response, from the middle of a speech, through a voice UI.
According to the present disclosure, there is provided an information processing apparatus including: a semantic analysis unit configured to perform semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech; a score calculation unit configured to calculate a score for a response candidate on the basis of a result of the analysis performed by the semantic analysis unit; and a notification control unit configured to perform control to notify of the response candidate, in the middle of the speech, according to the score calculated by the score calculation unit.
According to the present disclosure, there is provided a control method including: performing semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech; calculating, by a score calculation unit, a score for a response candidate on the basis of a result of the semantic analysis; and performing control to notify of the response candidate, in the middle of the speech, according to the calculated score.
According to the present disclosure, there is provided a program for causing a computer to function as: a semantic analysis unit configured to perform semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech; a score calculation unit configured to calculate a score for a response candidate on the basis of a result of the analysis performed by the semantic analysis unit; and a notification control unit configured to perform control to notify of the response candidate, in the middle of the speech, according to the score calculated by the score calculation unit.
As described above, according to the present disclosure, a user can be notified of a candidate for a response, from the middle of a speech, through a voice UI.
Note that the effects described above are not necessarily limitative. With or in the place of the above effects, there may be achieved any one of the effects described in this specification or other effects that may be grasped from this specification.
Hereinafter, (a) preferred embodiment(s) of the present disclosure will be described in detail with reference to the appended drawings. In this specification and the appended drawings, structural elements that have substantially the same function and structure are denoted with the same reference numerals, and repeated explanation of these structural elements is omitted.
Also, description will be provided in the following order.
1. Overview of speech recognition system according to one embodiment of the present disclosure
2. Configuration
3. Operation process
4. Display examples of candidates for responding action
4-1. Display of speech text
4-2. Display method according to score
4-3. Display method where there are plurality of speakers
4-4. Display method in regions other than main display region
4-5. Different display methods for different screen states
4-6. Other icon display examples
5. Conclusion
A speech recognition system according to one embodiment of the present disclosure has a basic function of performing speech recognition and semantic analysis on a user's speech and responding by outputting a voice. An overview of the speech recognition system according to one embodiment of the present disclosure will now be described with reference to the drawings.
Here, in a typical voice UI using speech recognition, only a single, finally determined response is returned with respect to a user's voice input. Therefore, the user must wait until the system has completed its processing. During this waiting time, no feedback is given from the system to the user, so that the user may worry that their voice input is not being properly processed.
With this in mind, in the speech recognition system according to one embodiment of the present disclosure, the user can be notified of a candidate for a response, from the middle of a speech, through a voice UI.
Specifically, the information processing apparatus 1 sequentially performs speech recognition and semantic analysis in the middle of a speech, and on the basis of the result, acquires a candidate for a response, produces an icon (or text) representing the acquired response candidate, and notifies the user of the icon.
As a result, the user can understand that their voice input is recognized in the middle of the speech, and can know a candidate for a response in real time.
In the foregoing, an overview of the speech recognition system according to the present disclosure has been described. Note that the shape of the information processing apparatus 1 is not limited to the cylindrical shape shown in the drawings.
The control unit 10 controls each component of the information processing apparatus 1. The control unit 10 is implemented by a microcomputer including a central processing unit (CPU), a read only memory (ROM), a random access memory (RAM), and a non-volatile memory. Also, as shown in the drawings, the control unit 10 according to this embodiment functions as a speech recognition unit 10a, a semantic analysis unit 10b, a responding action acquisition unit 10c, a score calculation unit 10d, a display control unit 10e, and an execution unit 10f.
The speech recognition unit 10a recognizes the user's voice collected by the microphone 12 of the information processing apparatus 1, and converts the voice into a string of characters to acquire speech text. Also, the speech recognition unit 10a can identify a person who is uttering a voice on the basis of a feature of the voice, or estimate the direction of the source of the voice, i.e., the speaker.
Also, the speech recognition unit 10a according to this embodiment sequentially performs speech recognition in real time from the start of the user's speech, and outputs the result of speech recognition in the middle of the speech to the semantic analysis unit 10b.
The semantic analysis unit 10b performs a natural language process or the like on speech text acquired by the speech recognition unit 10a for semantic analysis. The result of the semantic analysis is output to the responding action acquisition unit 10c.
Also, the semantic analysis unit 10b according to this embodiment can sequentially perform semantic analysis on the basis of the result of speech recognition in the middle of a speech which is output from the speech recognition unit 10a. The semantic analysis unit 10b outputs the result of the semantic analysis performed sequentially to the responding action acquisition unit 10c.
The responding action acquisition unit 10c acquires a responding action with respect to the user's speech on the basis of the result of semantic analysis. Here, the responding action acquisition unit 10c can acquire a candidate for a responding action at the current time on the basis of the result of semantic analysis in the middle of a speech. For example, the responding action acquisition unit 10c acquires, as a candidate, an action corresponding to an example sentence having a high level of similarity, on the basis of comparison of speech text recognized by the speech recognition unit 10a with example sentences registered for learning of semantic analysis. In this case, because the speech text to be compared is not complete, the responding action acquisition unit 10c may compare the speech text with a first half of each example sentence, depending on the length of the speech. Also, the responding action acquisition unit 10c can acquire a candidate for a responding action by utilizing the occurrence probability of each word contained in speech text. Here, a semantic analysis engine which uses a natural language process may be constructed in a learning-based manner. Specifically, a large number of speech examples assumed in the system are collected in advance, and each is correctly associated (also referred to as "labeled") with a responding action of the system, i.e., learnt as a data set. Thereafter, by comparing the data set with speech text obtained by speech recognition, a responding action of interest can be obtained. Note that this embodiment does not depend on the type of semantic analysis engine. Also, the data set learnt by a semantic analysis engine may be personalized for each user.
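For illustration only, the following is a minimal sketch of the prefix-based example-sentence matching described above. The data structure, the toy word-overlap similarity, and the truncate-to-speech-length rule are assumptions for the sketch; the disclosure deliberately leaves the actual semantic analysis engine open.

```python
from dataclasses import dataclass

@dataclass
class LabeledExample:
    sentence: str   # speech example collected in advance
    action: str     # responding action the example is labeled with

def similarity(a: str, b: str) -> float:
    """Toy word-overlap similarity; a real engine would use a learned
    natural-language model instead."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def acquire_candidates(partial_text: str,
                       dataset: list[LabeledExample]) -> dict[str, float]:
    """Compare incomplete speech text against each labeled example. Because
    the speech is still in progress, each example sentence is truncated to
    roughly the length of the partial text before comparison."""
    candidates: dict[str, float] = {}
    n_words = len(partial_text.split())
    for ex in dataset:
        prefix = " ".join(ex.sentence.split()[:n_words])
        s = similarity(partial_text, prefix)
        if s > candidates.get(ex.action, 0.0):
            candidates[ex.action] = s
    return candidates  # action -> similarity, passed to score calculation
```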
The responding action acquisition unit 10c outputs the acquired candidate for a responding action to the score calculation unit 10d.
Also, when a responding action is based on the result of semantic analysis after the end of a speech, the responding action acquisition unit 10c determines that the responding action is a final one, and outputs the final responding action to the execution unit 10f.
The score calculation unit 10d calculates scores for the candidates for a responding action acquired by the responding action acquisition unit 10c, and outputs the score calculated for each responding action candidate to the display control unit 10e. For example, the score calculation unit 10d calculates a score according to the level of similarity obtained, during acquisition of the responding action candidate, by comparison with the example sentences registered for semantic analysis learning.
Also, the score calculation unit 10d can calculate a score taking a user environment into account. For example, during operation of the voice UI according to this embodiment, the user environment is continually acquired and stored as the user's history. As the user environment, for example, a time zone, a day of the week, a person who is present together with the user, a state of an external apparatus around the user (e.g., the on state of a TV, etc.), a noise environment, the lightness of a room (i.e., an illuminance environment), or the like may be acquired. When the user can be identified, the score calculation unit 10d can then calculate a score taking into account the history of operations by the user and the current situation. Basically, this weighting according to the user environment may be performed in combination with the score calculation according to the level of similarity with an example sentence described above.
There may be various examples of the operation history and the current situation, and a portion thereof will be described below. The information processing apparatus 1 may weight a score according to the current user environment after learning a data set of such examples.
As a result, for example, if the user has a history of using a moving image application alone on weekend nights, then when the user is alone in a room on a weekend night, the score calculation unit 10d weights the score of the action candidate corresponding to activation of a moving image application. Note that, in this way, a recommended responding action candidate can be presented to the user according to the operation history and the current user environment.
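As a rough illustration of this weighting, the sketch below combines the example-sentence similarity with a history-based weight. The field names, the matching criteria, and the 0.1 increment per matching history entry are assumptions, not values given in the disclosure.

```python
def calculate_score(similarity: float, action: str,
                    current_env: dict, history: list[dict]) -> float:
    """Weight the example-sentence similarity by how often this action was
    chosen in past user environments resembling the current one (time zone,
    day of week, and so on)."""
    matches = [h for h in history
               if h["action"] == action
               and h["time_zone"] == current_env["time_zone"]
               and h["day_type"] == current_env["day_type"]]
    weight = 1.0 + 0.1 * len(matches)  # assumed weighting rule
    return similarity * weight
```

Under this sketch, a history of choosing the moving image application on weekend nights raises that candidate's score whenever the current environment is again a weekend night.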
Also, as described above, the speech recognition unit 10a sequentially acquires speech text and the semantic analysis unit 10b sequentially performs semantic analysis on it; therefore, the responding action acquisition unit 10c sequentially updates the acquired responding action candidates. The score calculation unit 10d sequentially updates the score for each responding action candidate according to the acquisition and updating of responding action candidates, and outputs the scores to the display control unit 10e.
The display control unit 10e functions as a notification control unit which performs control to notify the user of each responding action candidate in the middle of a speech according to a score for each responding action candidate calculated by the score calculation unit 10d. For example, the display control unit 10e controls the projection unit 16 so that the projection unit 16 projects and displays an icon indicating each responding action candidate on the wall 20. Also, when the score calculation unit 10d updates a score, the display control unit 10e updates a display to notify the user of each responding action candidate according to the new score.
Here, a display of a responding action candidate corresponding to a score will be described with reference to the drawings, which show how the display changes as the speech proceeds: a left portion at the start of the speech, a middle portion partway through the speech, and a right portion near the end of the speech.
Thus, speech recognition is sequentially performed from the middle of a speech, and responding action candidates are fed back to the user. Also, as the speech proceeds, the responding action candidates are updated, and after the end of the speech, a finally determined responding action is executed.
In the foregoing, display examples of candidates for a responding action by the display control unit 10e have been described.
When final speech text is determined (i.e., speech recognition is ended) after the end of a speech, the execution unit 10f executes the responding action finally determined by the responding action acquisition unit 10c. The responding action is assumed to be, for example, activation of an application, such as the moving image or weather applications mentioned herein, or output of a voice response.
The communication unit 11 transmits and receives data to and from an external apparatus. For example, the communication unit 11 connects to a predetermined server on a network, and receives various items of information required during execution of a responding action by the execution unit 10f.
The microphone 12 has the function of collecting a sound therearound and outputting the sound as an audio signal to the control unit 10. Also, the microphone 12 may be implemented by an array microphone.
The loudspeaker 13 has the function of converting an audio signal into a sound and outputting the sound under the control of the control unit 10.
The camera 14 has the function of capturing an image of a surrounding area using an imaging lens provided in the information processing apparatus 1, and outputting the captured image to the control unit 10. Also, the camera 14 may be implemented by an omnidirectional camera or a wide-angle camera.
The distance measurement sensor 15 has the function of measuring a distance between the information processing apparatus 1 and the user or a person around the user. The distance measurement sensor 15 is implemented by, for example, a photosensor (a sensor which measures a distance to an object of interest on the basis of information about a phase difference between light emission timing and light reception timing).
The projection unit 16, which is an example of a display apparatus, has the function of projecting (and magnifying) and displaying an image on a wall or a screen.
The storage unit 17 stores a program for causing each component of the information processing apparatus to function. Also, the storage unit 17 stores various parameters which are used by the score calculation unit 10d to calculate a score for a responding action candidate, and an application program executable by the execution unit 10f. Also, the storage unit 17 stores registered information of the user. The registered information of the user includes personally identifiable information (the feature amount of voice, the feature amount of a facial image or a human image (including a body image), a name, an identification number, etc.), age, sex, interests and preferences, an attribute (a housewife, employee, student, etc.), information about a communication terminal possessed by the user, and the like.
The light emission unit 18, which is implemented by a light emitting device such as an LED, can perform full emission, partial emission, flicker, emission position control, and the like. For example, under the control of the control unit 10, the light emission unit 18 can emit light from a portion thereof in the direction of a speaker recognized by the speech recognition unit 10a, thereby appearing as if it were gazing at the speaker.
In the foregoing, a configuration of the information processing apparatus 1 according to this embodiment has been specifically described. Note that the configuration shown in the drawings is an example, and this embodiment is not limited thereto.
Next, an operation process of the speech recognition system according to this embodiment will be specifically described with reference to the drawings.
Next, in step S106, the speech recognition unit 10a acquires speech text by a speech recognition process.
Next, in step S109, the control unit 10 determines whether or not speech recognition has been completed, i.e., whether or not speech text has been finally determined. A situation where a speech is continued (the middle of a speech) means that speech recognition has not been completed, i.e., speech text has not been finally determined.
Next, if speech recognition has not been completed (S109/No), the semantic analysis unit 10b acquires speech text which has been uttered until the current time, from the speech recognition unit 10a in step S112.
Next, in step S115, the semantic analysis unit 10b performs a semantic analysis process on the basis of speech text which has been uttered until a time point in the middle of the speech.
Next, in step S118, the responding action acquisition unit 10c acquires a candidate for a responding action to the user's speech on the basis of the result of the semantic analysis performed by the semantic analysis unit 10b, and the score calculation unit 10d calculates a score for the current responding action candidate.
Next, in step S121, the display control unit 10e determines a method for displaying the responding action candidate. Examples of the method for displaying a responding action candidate include displaying an icon representing the responding action candidate, displaying text representing the responding action candidate, displaying in a sub-display region, and displaying in a special footer region provided below a main display region when the user is viewing a movie in the main display region. Specific methods for displaying a responding action candidate will be described below with reference to the drawings.
Next, in step S124, the display control unit 10e performs control to display N responding action candidates ranked highest. For example, the display control unit 10e controls the projection unit 16 so that the projection unit 16 projects icons representing responding action candidates onto the wall 20.
The processes in S112-S124 described above are sequentially performed until the speech has been completed. When a responding action candidate or its score is updated, the display control unit 10e changes the displayed information accordingly.
Meanwhile, if a speech has been ended and speech recognition has been completed (final speech text has been determined) (S109/Yes), the semantic analysis unit 10b performs a semantic analysis process on the basis of the final speech text in step S127.
Next, in step S130, the responding action acquisition unit 10c finally determines a responding action with respect to the user's speech on the basis of the result of the semantic analysis performed by the semantic analysis unit 10b. Note that when the user explicitly selects a responding action, the responding action acquisition unit 10c can determine that a final responding action is one selected by the user.
Thereafter, in step S133, the execution unit 10f executes the final responding action determined by the responding action acquisition unit 10c.
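Putting steps S106-S133 together, the loop below sketches the overall process. Every interface (recognizer.next_result, analyzer.analyze, and so on) is a hypothetical stand-in for the corresponding unit described above, not an API given by the disclosure.

```python
def voice_ui_loop(recognizer, analyzer, acquirer, scorer, display, executor,
                  n: int = 3) -> None:
    """Sketch of steps S106-S133: while the speech continues, partial speech
    text is analyzed and the N highest-scored candidates are (re)displayed;
    once the final speech text is determined, the final action is executed."""
    while True:
        text, is_final = recognizer.next_result()        # S106 / S109
        if is_final:
            break                                        # speech text finalized
        meaning = analyzer.analyze(text)                 # S112 / S115
        candidates = acquirer.acquire(meaning)           # S118 (candidates)
        scored = scorer.score(candidates)                # S118 (scores)
        top_n = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:n]
        display.show(top_n)                              # S121 / S124
    final_meaning = analyzer.analyze(text)               # S127
    action = acquirer.determine_final(final_meaning)     # S130
    executor.execute(action)                             # S133
```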
In the foregoing, an operation process of the speech recognition system according to this embodiment has been specifically described. Note that, to accumulate a history of operations performed by the user, a process of storing a data set of the result of sensing the user environment during the speech and the finally determined responding action may be performed following step S133. Next, display examples of candidates for a responding action according to this embodiment will be described with reference to the drawings.
<4-1. Display of Speech Text>
In the above example, the speech text recognized in the middle of the speech may also be displayed together with the responding action candidates, whereby the user can confirm how their speech is being recognized.
<4-2. Display Method According to Score>
Also, in this embodiment, the region in which a responding action candidate is displayed and the amount of displayed information may be dynamically changed according to the score. This will now be described with reference to the drawings.
Also, in this embodiment, a responding action candidate having a low score may be displayed using another display method, for example grayed out, instead of being hidden, whereby it can be explicitly indicated that the score is lower than a predetermined value. This will now be described with reference to the drawings.
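As one way to realize this, the mapping below sketches how a score might select a display form. The two thresholds and the form attributes are assumptions for illustration; the disclosure does not fix concrete values.

```python
def display_form(score: float,
                 full_threshold: float = 0.7,
                 gray_threshold: float = 0.3) -> dict:
    """Map a candidate's score to its presentation: higher scores get a
    larger display area and more information; scores below a predetermined
    value are shown grayed out rather than hidden."""
    if score >= full_threshold:
        return {"size": "large", "info": "icon+text", "grayed_out": False}
    if score >= gray_threshold:
        return {"size": "small", "info": "icon", "grayed_out": False}
    return {"size": "small", "info": "icon", "grayed_out": True}
```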
In the above display method, a list of responding action candidates is displayed, and therefore, the user can immediately select a desired responding action even in the middle of a speech. Specifically, a displayed responding action candidate can be utilized as a short-cut to an action. In this case, the user can also select a responding action candidate which is displayed grayed out.
For example, when there is a desired action among responding action candidates displayed in the middle of a speech, the user can choose that action by saying "The left icon!," "The third icon!," or the like. Also, the choice can be made by using not only a voice but also a gesture, touch operation, remote controller, or the like. Also, such a choice performed by the user may be used not only for the function of determining what action is to be activated but also for the function of cancelling. For example, when a speech "Konshu no tenki ... a, sorejyanakute (this week's weather ... oops, this is not it)" is uttered, a responding action candidate which has been displayed in a larger size (with a higher score) in association with "Konshu no tenki (this week's weather) ..." can be cancelled (hidden) and its score can be reduced.
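A minimal sketch of this choice/cancel handling follows. The trigger phrases, the assumption that icons are laid out in descending score order, and the zeroed-out score on cancellation are all illustrative, not taken from the disclosure.

```python
def handle_mid_speech_utterance(fragment: str,
                                displayed: dict[str, float]) -> str | None:
    """Interpret an utterance made while candidates are on screen, either as
    a short-cut choice of a displayed candidate or as a cancellation."""
    ordered = sorted(displayed, key=displayed.get, reverse=True)  # score order
    if fragment == "The third icon!" and len(ordered) >= 3:
        return ordered[2]            # chosen candidate becomes the action
    if "sorejyanakute" in fragment:  # "...oops, this is not it"
        displayed[ordered[0]] = 0.0  # cancel and hide the top candidate
    return None
```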
<4-3. Display Method where there are Plurality of Speakers>
Also, the speech recognition system according to this embodiment can be used by a plurality of users. For example, it is assumed that the locations of users (speakers) are recognized by using an array microphone or a camera, a display region is divided according to the users' locations, and action candidates are displayed for each user. In this case, the real-time speech recognition, semantic analysis, and responding action acquisition processes and the like shown in the above flow are performed for each user.
Note that when a plurality of users are using the system, the information processing apparatus 1 according to this embodiment may perform real-time speech recognition, semantic analysis, responding action acquisition processes, and the like in an integrated manner without dividing the display region for the users, and feed a single result back.
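The per-user mode might be organized as sketched below; identify_speaker, the per-speaker pipeline objects, and the divided display regions are hypothetical interfaces assumed for illustration.

```python
def process_multi_speaker(audio_frames, identify_speaker,
                          pipelines, display_regions) -> None:
    """Route each audio frame to the pipeline of the identified speaker and
    show updated candidates in that speaker's portion of the display."""
    for frame in audio_frames:             # frames from the array microphone
        speaker = identify_speaker(frame)  # by voice feature or direction
        candidates = pipelines[speaker].feed(frame)  # recognize/analyze/score
        if candidates is not None:         # updated top-N for this speaker
            display_regions[speaker].show(candidates)
```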
<4-4. Display Method in Regions other than Main Display Region>
Also, the speech recognition system according to this embodiment can notify of a responding action candidate in the middle of a speech in a region other than a main display region. Here, the main display region refers to a region for projection and display performed by the projection unit 16. The information processing apparatus 1 may display a responding action candidate on, for example, a sub-display (not shown) formed by a liquid crystal display or the like provided on a side surface of the information processing apparatus 1, or an external display apparatus such as a TV, smartphone, or tablet terminal located around the user, a wearable terminal worn by the user, or the like, as a display region other than the main display region.
When display is performed in a region other than the main display region, only an icon or text for the responding action candidate having the highest score may be displayed, instead of the display methods described above.
<4-5. Different Display Methods for Different Screen States>
Also, the speech recognition system according to this embodiment may change the method for displaying a responding action candidate according to the current screen state of the display region. This will now be specifically described with reference to the drawings.
With this in mind, for example, when a moving image 50 is being displayed in the main display region, the information processing apparatus 1 displays icons for responding action candidates in a footer region 45 provided below the main display region.
Also, when displaying icons for responding action candidates in the footer region 45, the information processing apparatus 1 can adjust the number or sizes of the displayed icons so as not to obstruct the view of the moving image.
Thus, the display control unit 10e of the information processing apparatus 1 according to this embodiment can perform optimum display control by using a predetermined display layout pattern according to a screen state (e.g., the amount of displayed information, the size of a display region, etc.) or a display state (icons, text, display amounts, etc.) of the displayed responding action candidates. Also, the information processing apparatus 1 may use a method for displaying in a region other than the main display region, such as those described above, during playback of a moving image. As a result, the user can be notified of responding action candidates without the candidates overlapping the moving image screen played back in the main display region at all.
<4-6. Other Icon Display Examples>
In the above display screen examples, icons indicating activation actions of various applications are shown as icons for responding action candidates. This embodiment is not limited to this. Other display examples of candidates for a responding action will now be described with reference to the drawings.
As described above, in a speech recognition system according to an embodiment of the present disclosure, the user can be notified of a response candidate (responding action candidate) through a voice UI from the middle of a speech, i.e., semantic analysis is sequentially performed in real time, and a response candidate can be fed back to the user.
The preferred embodiment(s) of the present disclosure has/have been described above with reference to the accompanying drawings, whilst the present disclosure is not limited to the above examples. A person skilled in the art may find various alterations and modifications within the scope of the appended claims, and it should be understood that they will naturally come under the technical scope of the present disclosure.
For example, a computer program can be provided which causes hardware including a CPU, ROM, RAM, and the like included in the information processing apparatus 1 to provide the functions of the information processing apparatus 1. Also, a computer readable storage medium storing the computer program is provided.
Also, the display control unit 10e may display at least a predetermined number of responding action candidates, all responding action candidates having a score exceeding a predetermined threshold, or at least a predetermined number of responding action candidates until a score exceeds a predetermined threshold.
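As a rough sketch of such a selection policy, under the assumption that "at least a predetermined number" means padding the above-threshold set with the next highest-scored candidates, one might write:

```python
def select_for_display(scored: dict[str, float],
                       min_count: int = 3,
                       threshold: float = 0.5) -> list[str]:
    """Show every candidate whose score exceeds the threshold; if fewer than
    min_count qualify, pad with the next highest-scored candidates."""
    ordered = sorted(scored, key=scored.get, reverse=True)
    above = [a for a in ordered if scored[a] > threshold]
    return above if len(above) >= min_count else ordered[:min_count]
```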
Also, the display control unit 10e may display a responding action candidate together with its score.
Further, the effects described in this specification are merely illustrative or exemplified effects, and are not limitative. That is, with or in the place of the above effects, the technology according to the present disclosure may achieve other effects that are clear to those skilled in the art based on the description of this specification.
Additionally, the present technology may also be configured as below.
(1)
An information processing apparatus including:
a semantic analysis unit configured to perform semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech;
a score calculation unit configured to calculate a score for a response candidate on the basis of a result of the analysis performed by the semantic analysis unit; and
a notification control unit configured to perform control to notify of the response candidate, in the middle of the speech, according to the score calculated by the score calculation unit.
(2)
The information processing apparatus according to (1),
wherein the score calculation unit updates the score according to the semantic analysis sequentially performed on the speech by the semantic analysis unit, and
the notification control unit performs control to update display of the response candidate in association with the updating of the score.
(3)
The information processing apparatus according to (1),
wherein the notification control unit performs control to notify of a plurality of the response candidates in display forms corresponding to the scores.
(4)
The information processing apparatus according to (3),
wherein the notification control unit performs control to display a predetermined number of the response candidates having highest scores on the basis of the scores.
(5)
The information processing apparatus according to (3) or (4),
wherein the notification control unit performs control to display the response candidate or candidates having a score exceeding a predetermined value.
(6)
The information processing apparatus according to any one of (3) to (4),
wherein the notification control unit performs control to display the response candidates using display areas corresponding to values of the scores.
(7)
The information processing apparatus according to any one of (3) to (5),
wherein the notification control unit performs control to display icons for the response candidates, each icon including information about a display dot size corresponding to the score.
(8)
The information processing apparatus according to any one of (3) to (6),
wherein the notification control unit performs control to display the response candidate or candidates having a score lower than a predetermined value, in a grayed-out fashion.
(9)
The information processing apparatus according to any one of (3) to (8),
wherein the notification control unit performs control to display the recognized speech text together with the response candidates.
(10)
The information processing apparatus according to any one of (1) to (8),
wherein the score calculation unit calculates the score, additionally taking a current user environment into account.
(11)
The information processing apparatus according to any one of (1) to (10), further including:
an execution control unit configured to perform control to execute a final response.
(12)
The information processing apparatus according to (11),
wherein control is performed so that a final response determined on the basis of a result of the semantic analysis on speech text finally determined after end of the speech is executed.
(13)
The information processing apparatus according to (11),
wherein control is performed so that a final response chosen by a user is executed.
(14)
A control method including:
performing semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech;
calculating, by a score calculation unit, a score for a response candidate on the basis of a result of the semantic analysis; and
performing control to notify of the response candidate, in the middle of the speech, according to the calculated score.
(15)
A program for causing a computer to function as:
a semantic analysis unit configured to perform semantic analysis on speech text recognized by a speech recognition unit in the middle of a speech;
a score calculation unit configured to calculate a score for a response candidate on the basis of a result of the analysis performed by the semantic analysis unit; and
a notification control unit configured to perform control to notify of the response candidate, in the middle of the speech, according to the score calculated by the score calculation unit.
1 information processing apparatus
10 control unit
10a speech recognition unit
10b semantic analysis unit
10c responding action acquisition unit
10d score calculation unit
10e display control unit
10f execution unit
11 communication unit
12 microphone
13 loudspeaker
14 camera
15 distance measurement sensor
16 projection unit
17 storage unit
18 light emission unit
20 wall
Priority application: JP 2015-073894 (Japan, national), filed March 2015.
Filing document: PCT/JP2015/085845 (WO), filed Dec. 22, 2015.