One or more embodiments relate to a voice recognition system for monitoring a user and modifying speech translation based on the user's movement and appearance.
An example of a voice recognition system for controlling cellphone functionality is the “S Voice” system by Samsung. An example of a voice recognition system for controlling portable speaker functionality is the “JBL CONNECT” application by JBL®.
In one embodiment, a voice recognition system is provided with a user interface to display content, a camera to provide a signal indicative of an image of a user viewing the content and a microphone to provide a signal indicative of a voice command. The voice recognition system is further provided with a controller that communicates with the user interface, the camera and the microphone and is configured to filter the voice command based on the image.
In another embodiment, a voice recognition system is provided with a user interface to display content, a camera to provide a first signal indicative of an image of a user viewing the content, and a microphone to provide a second signal indicative of a voice command that corresponds to a requested action. The voice recognition system is further provided with a controller that is programmed to receive the first signal and the second signal, filter the voice command based on the image, and perform the requested action based on the filtered voice command.
In yet another embodiment, a computer-program product embodied in a non-transitory computer-readable medium and programmed for controlling a voice recognition system is provided. The computer-program product includes instructions for: receiving a voice command that corresponds to a requested action; receiving a visual command indicative of a user viewing content on a user interface; filtering the voice command based on the visual command; and performing the requested action based on the filtered voice command.
In another embodiment, a method for controlling a voice recognition system is provided. A first signal is received that is indicative of a voice command that corresponds to a requested action. A second signal is received that is indicative of an image of a user viewing content on a user interface. The voice command is filtered based on the image, and the requested action is performed based on the filtered voice command.
As such, the voice recognition system improves the accuracy of the translation of a voice command by combining the voice command with eye gaze tracking and/or facial recognition to narrow the search field and limit the speech-to-text translation to the item in which the user is interested.
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
The voice recognition system 10 monitors a user's features and compares the features to predetermined data to determine if the user is recognized and if an existing profile of the user's interests is available. If the user is recognized and their profile is available, the system 10 translates the user's speech using filters based on the profile. The system 10 also monitors the user's movement (e.g., eye gaze and/or lip movement) and filters the user's speech based on such movement. Such filters narrow the search field used to translate the user's speech to text and improve the accuracy of the translation, especially in environments with loud ambient noise, e.g., the passenger compartment of an automobile.
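For illustration only, the profile-based and movement-based filtering described above can be sketched as follows in Python; the UserProfile fields and helper names are hypothetical placeholders, not part of any embodiment.

```python
# Illustrative sketch only; the UserProfile fields and helper names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    name: str
    favorite_artists: list = field(default_factory=list)
    recent_commands: list = field(default_factory=list)

def build_filters(profile, gaze_target):
    """Collect vocabulary hints that narrow the speech-to-text search field."""
    hints = []
    if profile is not None:
        hints.extend(profile.favorite_artists)    # profile-based filter
        hints.extend(profile.recent_commands)
    if gaze_target is not None:
        hints.append(gaze_target)                 # movement-based filter (eye gaze)
    return hints

# Example: a recognized user whose gaze rests on an on-screen item.
profile = UserProfile("Driver A", favorite_artists=["The Beatles"])
print(build_filters(profile, gaze_target="Artist 2, Album B"))
# ['The Beatles', 'Artist 2, Album B']
```

In this sketch, the returned hints would be passed to the speech-to-text engine to bias or restrict its search field.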
The controller 20 generally includes any number of microprocessors, ASICs, ICs, memory (e.g., FLASH, ROM, RAM, EPROM and/or EEPROM) and software code that co-act with one another to perform the operations noted herein. The controller 20 also includes predetermined data, or "look-up tables", that are based on calculations and test data and stored within the memory. The controller 20 communicates with other components of the media device 12 (e.g., the camera 14, the microphone 16, the user interface 18, etc.) over one or more wired or wireless connections using common bus protocols (e.g., CAN and LIN).
The user interface 18 displays content such as vehicle controls for various vehicle systems. For example, the user interface 18 displays a climate controls icon 22, a communication controls icon 24 and an audio system controls icon 26, according to the illustrated embodiment. The user interface 18 adjusts the content displayed to the user in response to a user tactile (touch) command, voice command or visual command. For example, the voice recognition system 10 controls the user interface 18 to display additional climate controls in response to the user selecting the climate controls icon 22.
The voice recognition system 10 adjusts the content displayed to the user based on a voice command. For example, rather than pressing the available audio content icon 28 for Artist 2, the user could say "Play Artist 2, Album B, Song 1", and the voice recognition system 10 controls the audio system to stop playing the current audio content (i.e., Artist 1, Album A, Song 2) and start playing the newly requested audio content. The voice recognition system 10 converts, or translates, the user's voice command to text and compares the text to predetermined data, e.g., a database of different commands, to interpret the command. However, in some conditions it may be difficult for the voice recognition system 10 to interpret the command. For example, the user may be driving with the windows open, or other passengers may be talking in the vehicle, creating noise that complicates the translation.
The voice recognition system 10 improves the accuracy of the translation of the voice command by combining it with eye gaze tracking to narrow the search field and limit the speech-to-text translation to the menu item on which the user is focused, according to an embodiment. In one example, the user provides the voice command: "Play Artist 2, Album B, Song 1" while looking at the Artist 2, Album B icon 28. However, other passengers in the vehicle are talking during the command, so the voice recognition system 10 is only able to translate "Play . . . Song 1" from the voice command. The voice recognition system 10 determines that the user's eye gaze was focused on the Artist 2, Album B icon 28 and therefore narrows the search field to the correct available audio content.
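A minimal sketch of this gaze-based narrowing, assuming a hypothetical menu representation and a simple word-overlap matcher, might look like the following:

```python
# Sketch only: resolve a noisy, partial transcript against the menu item the
# user is looking at. The menu contents and matching heuristic are hypothetical.
def tokens(text):
    return set(text.lower().replace(",", "").split())

def resolve_command(partial_transcript, menu_items, gaze_target=None):
    """Limit candidates to the gazed-at item when gaze data is available."""
    candidates = [gaze_target] if gaze_target in menu_items else menu_items
    heard = tokens(partial_transcript)
    # Pick the candidate sharing the most words with what was heard.
    return max(candidates, key=lambda item: len(heard & tokens(item)))

menu = ["Artist 1, Album A, Song 2", "Artist 2, Album B, Song 1"]
# Only "Play ... Song 1" survived the cabin noise, but gaze disambiguates it.
print(resolve_command("Play Song 1", menu, gaze_target="Artist 2, Album B, Song 1"))
```

A production recognizer would use its own hypothesis scores rather than word overlap; the point of the sketch is that gaze shrinks the candidate set before matching.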
The voice recognition system 10 improves the accuracy of the translation of the voice command by combining it with facial recognition to narrow the search field, according to an embodiment. In another example, the available audio content includes a song by the artist The Beatles® and a song by the artist Justin Bieber®. The user provides the voice command: "Play The Beatles®" while looking at the road and not at the user interface 18. However, the windows in the vehicle are open and external noise is present during the command, so the voice recognition system 10 is only able to translate "Play Be . . ." from the voice command. The voice recognition system 10 determines, using facial recognition software, that driver A (Dad) is driving, not driver B (Child), and narrows the search field to the correct available audio content based on a profile indicative of driver A's audio preferences and/or history.
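This profile-based narrowing can be sketched as follows; the profiles and library below are invented placeholders standing in for stored listening history:

```python
# Sketch only: both candidates match the heard fragment, so the recognized
# driver's listening history breaks the tie. All data here is hypothetical.
profiles = {
    "driver_a": {"history": ["The Beatles"]},
    "driver_b": {"history": ["Justin Bieber"]},
}
library = ["The Beatles", "Justin Bieber"]

def narrow_by_profile(heard_fragment, driver_id):
    """Prefer artists from the recognized driver's history when audio is ambiguous."""
    # Both "The Beatles" and "Justin Bieber" contain the fragment "be".
    matches = [a for a in library if heard_fragment.lower() in a.lower()]
    history = profiles[driver_id]["history"]
    preferred = [a for a in matches if a in history]
    return preferred[0] if preferred else matches

print(narrow_by_profile("Be", "driver_a"))   # The Beatles
```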
In another embodiment, the voice recognition system 10 further improves the accuracy of the translation of the voice command by combining it with facial recognition and lip-reading to narrow the search field. The voice recognition system 10 uses facial recognition to detect face and lip motions and correlates the motions to predetermined facial motions that correspond to the phonics of the speech.
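One way to read this correlation is as a fusion of acoustic and visual evidence. The probability tables below are invented placeholders; a real system would obtain them from trained acoustic and lip-reading (viseme) models:

```python
# Sketch only: fuse acoustic and lip-motion evidence for two hypotheses.
acoustic = {"beatles": 0.5, "bieber": 0.5}   # noisy audio is ambiguous
visual   = {"beatles": 0.8, "bieber": 0.2}   # lip shapes favor "beatles"

def fuse(acoustic_scores, visual_scores):
    """Combine per-hypothesis scores and renormalize."""
    scores = {w: acoustic_scores[w] * visual_scores[w] for w in acoustic_scores}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

print(fuse(acoustic, visual))   # {'beatles': 0.8, 'bieber': 0.2}
```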
The voice recognition system 10 responds to a user command using audio and/or visual communication, according to an embodiment. After receiving a command to play audio content, the system 10 may ask the user to confirm the command, e.g., "Please confirm, you would like to play Artist 2, Album B, Song 1." Alternatively, or in addition to such audio communication, the voice recognition system 10 may provide visual feedback through dynamic and responsive changes to the user interface 18. For example, the voice recognition system 10 may control the available audio content icon 28 for Artist 2, Album B to blink, move, or change size (e.g., shrink or enlarge), as depicted by motion lines 30.
The voice recognition system 10 initiates audio communication (i.e., wakes) using eye gaze tracking, according to an embodiment. For example, the system 10 initiates audio communication after determining that the user's eye gaze was focused on the user interface 18 for a predetermined period of time. The voice recognition system 10 may also notify the user once it wakes, using audio or visual communication. In the illustrated embodiment, the user interface 18 includes a wake icon 32 that depicts an open eyeball. After waking, the voice recognition system 10 notifies the user by controlling the wake icon to blink, as depicted by motion lines 34.
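The dwell-time wake behavior can be sketched as follows; the two-second threshold and the gaze_on_interface() helper are hypothetical, not values from any embodiment:

```python
# Sketch only: wake once gaze has rested on the user interface long enough.
import time

WAKE_DWELL_SECONDS = 2.0   # hypothetical predetermined period of time

def wait_for_wake(gaze_on_interface, poll=0.1):
    """Block until the gaze has stayed on the user interface for the dwell time."""
    dwell_start = None
    while True:
        if gaze_on_interface():
            dwell_start = dwell_start or time.monotonic()
            if time.monotonic() - dwell_start >= WAKE_DWELL_SECONDS:
                return True   # caller would then blink the wake icon 32
        else:
            dwell_start = None   # gaze left the interface; reset the timer
        time.sleep(poll)
```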
The voice recognition system 10 also supports macros, in which a single voice command combined with a visual command triggers multiple actions. For example, the voice recognition system 10 implemented in a home entertainment system 42 may provide personalized sports scores and news, turn on the surround sound, and apply specific optical settings for the television, in response to a "Sports" voice command combined with an eye gaze focusing on a sports icon (not shown). Additionally, the voice recognition system 10 implemented in a cellphone 44 may set a home security system and check interior lights, thermostat settings and door locks in response to a "Sleep" voice command combined with an eye gaze focusing on a sleep icon (not shown).
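Such macros amount to a table keyed on a (voice command, gaze target) pair. The sketch below uses hypothetical action names standing in for the behaviors described above:

```python
# Sketch only: macros keyed on a (voice command, gaze target) pair.
MACROS = {
    ("sports", "sports_icon"): ["show_sports_scores", "enable_surround_sound",
                                "apply_tv_picture_settings"],
    ("sleep", "sleep_icon"):   ["arm_security_system", "check_interior_lights",
                                "check_thermostat", "check_door_locks"],
}

def run_macro(voice_command, gaze_target):
    actions = MACROS.get((voice_command.lower(), gaze_target))
    if actions is None:
        return None   # fall through to ordinary speech-to-text translation
    for action in actions:
        print(f"performing: {action}")   # a real system would dispatch here
    return actions

run_macro("Sleep", "sleep_icon")
```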
A method for controlling the voice recognition system 10 is described with reference to operations 110-130, according to one or more embodiments. At operation 110, the voice recognition system 10 begins monitoring the user's features.
The voice recognition system 10 initiates audio communication with the user (i.e., wakes) at operation 116. This initiation is in response to a voice command (e.g., a "wake word") or in response to a visual command, e.g., a determination that the user's eye gaze was focused on the user interface 18 for longer than a predetermined period of time, according to one or more embodiments. As discussed above, the system 10 may notify the user once it wakes, e.g., by controlling the wake icon 32 to blink.
At operation 118, the voice recognition system 10 continues to monitor the user's features and compares the features to predetermined data to determine if the user is recognized. If the user is recognized, the voice recognition system 10 acquires their profile at operation 120, e.g., through the cloud-based network 38.
The voice recognition system 10 receives a voice command at operation 122. Then, at operation 124, the voice recognition system 10 determines if the voice command, combined with a non-verbal command, e.g., eye gaze, corresponds to a macro. If so, the system 10 proceeds to operation 130 and performs the action(s).
If the voice command does not correspond to a macro, then the voice recognition system 10 filters the user's speech. If a profile was acquired at operation 120, the system 10 filters the voice command at operation 126 based on the profile. The system 10 also monitors the user's movement (e.g., eye gaze and/or lip movement) and filters the voice command based on such movement. Such filters narrow the search field used to translate the voice command to text and improve the accuracy of the translation. The voice recognition system 10 translates the voice command at operation 128 and then performs the action or actions (e.g., adjusting content displayed on the user interface 18, controlling the climate system to increase the temperature within the vehicle, or controlling the audio system to play a different song) at operation 130.
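Read end to end, operations 110-130 form a single control loop. The sketch below assumes a hypothetical system object whose methods stand in for the components described above:

```python
# Sketch only: the flow of operations 110-130 with hypothetical helpers.
def control_loop(system):
    user = system.recognize_user()                 # monitor features (110, 118)
    system.wait_for_wake()                         # wake word or gaze dwell (116)
    profile = system.load_profile(user) if user else None   # acquire profile (120)
    command = system.listen()                      # receive voice command (122)
    gaze = system.current_gaze_target()
    macro = system.match_macro(command, gaze)      # check for a macro (124)
    if macro is not None:
        return system.perform(macro)               # perform action(s) (130)
    filters = system.build_filters(profile, gaze)  # profile + movement filters (126)
    action = system.translate(command, filters)    # narrowed translation (128)
    return system.perform(action)                  # perform action(s) (130)
```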
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.
This application claims the benefit of U.S. provisional application Ser. No. 62/440,893 filed Dec. 30, 2016, the disclosure of which is hereby incorporated in its entirety by reference herein.
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US2017/068856 | 12/29/2017 | WO | 00 |
| Number | Date | Country |
| --- | --- | --- |
| 62440893 | Dec 2016 | US |