Currently, many devices do not permit voice command recognition during a telephone call because the call and voice command software would both need to occupy the same audio channel being used for the telephone call. Some devices have sought to overcome this by employing additional hardware. For instance, one microphone on the device might be used for conducting the telephone call and a separate microphone on the device might be used for receiving voice commands provided during the call. As another example, one chipset might be used to conduct the telephone call and a separate chipset might be used for audio processing to identify commands. However, as recognized herein, this adds to manufacturing costs owing to multiple pieces of the same types of hardware having to be included on a single device, and also unnecessarily takes up valuable physical space within the device. There are currently no adequate solutions to the foregoing computer-related, technological problem.
Accordingly, in one aspect a first device includes at least one processor and storage accessible to the at least one processor. The storage includes instructions executable by the at least one processor to facilitate audio communication between the first device and a second device and to select a threshold amount of the audio communication. The threshold amount does not include the entirety of the audio communication. The instructions are also executable by the at least one processor to transcribe to text words that are recognized from the threshold amount of the audio communication, determine whether the text comprises a command to the first device, and request confirmation that a command to the first device has been issued based on a determination that the text comprises a command to the first device.
In another aspect, a method includes facilitating audio communication between a first device and a second device and selecting a threshold amount of the audio communication. The threshold amount does not include the entirety of the audio communication. The method also includes converting to text words that are recognized from the threshold amount of the audio communication, determining whether the text comprises a command to a device, and presenting a request to confirm that a command to the device has been provided based on determining that the text comprises a command to the device.
In still another aspect, a computer readable storage medium includes instructions executable by at least one processor to facilitate audio communication between a first device and a second device, convert to text at least one word that is recognized from the audio communication, and determine whether the text comprises a command to a device. The instructions are also executable by the at least one processor to present a request to confirm that a command to the device has been provided based on a determination that the text comprises a command to the device.
The details of present principles, both as to their structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
The present application deals with voice commands being recognized during a telephone or video conferencing call, for example, and providing a subsequent user interface for a user to acknowledge or disregard actions the device has identified to perform based on a potential voice command received during the call. This may be done using a single microphone feed rather than two feeds, one for the call and one for voice commands.
Accordingly, audio of the call may be transcribed by the device using software that, e.g., runs in the background. Audio of a defined window of time may be captured and transcribed, and then the transcription may be further analyzed by the device to determine if any word(s) from the transcription match commands in a predefined database of voice commands. When a voice command is identified within the transcription, the words of the transcription that come before and after the command itself may also be analyzed utilizing, e.g., natural language processing to determine whether there is intention to use command keywords or just regular speech for which a command should not be executed. Additionally, a “command” icon or symbol may appear on screen whenever a voice command is detected by the device as another way to confirm a user's intention to provide a voice command. Thus, the same audio channel from the same microphone as used to conduct the call itself may also be used to determine whether the user might have also provided a voice command to the device itself.
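By way of a non-limiting illustration, the matching of transcribed words against a predefined command database may be sketched as follows in Python. The command phrases and the function name here are hypothetical examples only, not part of any particular implementation:

```python
# Illustrative sketch: scan a transcribed audio window for known voice
# commands. COMMAND_DATABASE below is a hypothetical, example-only listing.
COMMAND_DATABASE = {"okay assistant", "create calendar entry", "add task"}

def find_commands(transcript: str) -> list:
    """Return any known command phrases found in the transcribed window."""
    text = transcript.lower()
    return [cmd for cmd in COMMAND_DATABASE if cmd in text]
```

In this sketch, a window transcribed as "okay assistant, what is the weather like?" would yield one match, while ordinary conversation would yield none.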
Furthermore, in some embodiments portions of the entire conversation may be recorded and transcribed separately and then discarded if those segments contain no voice command so that the device may consume relatively less memory for determining whether a voice command has been provided than had the entire call been transcribed, e.g., throughout or at the end of the call. Additionally, if a voice command was received toward the beginning or end of a recorded segment and additional context before or after the voice command would be helpful that is not actually included in that same segment (or if the command itself was cut off), the device may provide an audible prompt via speakers and/or a visual prompt via a GUI on a display for the user to repeat the command and context so that it may all be captured in a single audio segment and then that segment may be transcribed as described herein.
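The segment-by-segment transcribe-and-discard approach described above may be sketched as follows; the transcription and command-detection callables are assumptions supplied by the caller for illustration:

```python
def process_segments(segments, transcribe, contains_command):
    """Hypothetical sketch: transcribe fixed-length audio segments one at a
    time, keep only transcriptions that contain a voice command, and let the
    rest be discarded so the entire call never has to be stored."""
    kept = []
    for segment in segments:
        text = transcribe(segment)
        if contains_command(text):
            kept.append(text)
        # otherwise the segment and its transcription are simply dropped
    return kept
```

Because each segment is released as soon as it is found to contain no command, memory consumption stays proportional to one segment rather than to the whole call.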
With respect to any computer systems discussed herein, a system may include server and client components, connected over a network such that data may be exchanged between the client and server components. The client components may include one or more computing devices including televisions (e.g., smart TVs, Internet-enabled TVs), computers such as desktops, laptops and tablet computers, so-called convertible devices (e.g., having a tablet configuration and laptop configuration), and other mobile devices including smart phones. These client devices may employ, as non-limiting examples, operating systems from Apple Inc. of Cupertino Calif., Google Inc. of Mountain View, Calif., or Microsoft Corp. of Redmond, Wash. A Unix® or similar operating system such as Linux® may be used. These operating systems can execute one or more browsers such as a browser made by Microsoft or Google or Mozilla or another browser program that can access web pages and applications hosted by Internet servers over a network such as the Internet, a local intranet, or a virtual private network.
As used herein, instructions refer to computer-implemented steps for processing information in the system. Instructions can be implemented in software, firmware or hardware, or combinations thereof and include any type of programmed step undertaken by components of the system; hence, illustrative components, blocks, modules, circuits, and steps are sometimes set forth in terms of their functionality.
A processor may be any conventional general purpose single- or multi-chip processor that can execute logic by means of various lines such as address lines, data lines, and control lines and registers and shift registers. Moreover, any logical blocks, modules, and circuits described herein can be implemented or performed with a general purpose processor, a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device such as an application specific integrated circuit (ASIC), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can also be implemented by a controller or state machine or a combination of computing devices. Thus, the methods herein may be implemented as software instructions executed by a processor, suitably configured application specific integrated circuits (ASIC) or field programmable gate array (FPGA) modules, or any other convenient manner as would be appreciated by those skilled in the art. Where employed, the software instructions may also be embodied in a non-transitory device that is being vended and/or provided that is not a transitory, propagating signal and/or a signal per se (such as a hard disk drive, CD ROM or Flash drive). The software code instructions may also be downloaded over the Internet. Accordingly, it is to be understood that although a software application for undertaking present principles may be vended with a device such as the system 100 described below, such an application may also be downloaded from a server to a device over a network such as the Internet.
Software modules and/or applications described by way of flow charts and/or user interfaces herein can include various sub-routines, procedures, etc. Without limiting the disclosure, logic stated to be executed by a particular module can be redistributed to other software modules and/or combined together in a single module and/or made available in a shareable library.
Logic when implemented in software, can be written in an appropriate language such as but not limited to C# or C++, and can be stored on or transmitted through a computer-readable storage medium (that is not a transitory, propagating signal per se) such as a random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), compact disk read-only memory (CD-ROM) or other optical disk storage such as digital versatile disc (DVD), magnetic disk storage or other magnetic storage devices including removable thumb drives, etc.
In an example, a processor can access information over its input lines from data storage, such as the computer readable storage medium, and/or the processor can access information wirelessly from an Internet server by activating a wireless transceiver to send and receive data. Data typically is converted from analog signals to digital by circuitry between the antenna and the registers of the processor when being received and from digital to analog when being transmitted. The processor then processes the data through its shift registers to output calculated data on output lines, for presentation of the calculated data on the device.
Components included in one embodiment can be used in other embodiments in any appropriate combination. For example, any of the various components described herein and/or depicted in the Figures may be combined, interchanged or excluded from other embodiments.
“A system having at least one of A, B, and C” (likewise “a system having at least one of A, B, or C” and “a system having at least one of A, B, C”) includes systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.
The term “circuit” or “circuitry” may be used in the summary, description, and/or claims. As is well known in the art, the term “circuitry” includes all levels of available integration, e.g., from discrete logic circuits to the highest level of circuit integration such as VLSI, and includes programmable logic components programmed to perform the functions of an embodiment as well as general-purpose or special-purpose processors programmed with instructions to perform those functions.
Now specifically in reference to
As shown in
In the example of
The core and memory control group 120 includes one or more processors 122 (e.g., single core or multi-core, etc.) and a memory controller hub 126 that exchange information via a front side bus (FSB) 124. As described herein, various components of the core and memory control group 120 may be integrated onto a single processor die, for example, to make a chip that supplants the conventional “northbridge” style architecture.
The memory controller hub 126 interfaces with memory 140. For example, the memory controller hub 126 may provide support for DDR SDRAM memory (e.g., DDR, DDR2, DDR3, etc.). In general, the memory 140 is a type of random-access memory (RAM). It is often referred to as “system memory.”
The memory controller hub 126 can further include a low-voltage differential signaling interface (LVDS) 132. The LVDS 132 may be a so-called LVDS Display Interface (LDI) for support of a display device 192 (e.g., a CRT, a flat panel, a projector, a touch-enabled display, etc.). A block 138 includes some examples of technologies that may be supported via the LVDS interface 132 (e.g., serial digital video, HDMI/DVI, display port). The memory controller hub 126 also includes one or more PCI-express interfaces (PCI-E) 134, for example, for support of discrete graphics 136. Discrete graphics using a PCI-E interface has become an alternative approach to an accelerated graphics port (AGP). For example, the memory controller hub 126 may include a 16-lane (×16) PCI-E port for an external PCI-E-based graphics card (including, e.g., one or more GPUs). An example system may include AGP or PCI-E for support of graphics.
In examples in which it is used, the I/O hub controller 150 can include a variety of interfaces. The example of
The interfaces of the I/O hub controller 150 may provide for communication with various devices, networks, etc. For example, where used, the SATA interface 151 provides for reading, writing or reading and writing information on one or more drives 180 such as HDDs, SSDs or a combination thereof, but in any case the drives 180 are understood to be, e.g., tangible computer readable storage mediums that are not transitory, propagating signals. The I/O hub controller 150 may also include an advanced host controller interface (AHCI) to support one or more drives 180. The PCI-E interface 152 allows for wireless connections 182 to devices, networks, etc. The USB interface 153 provides for input devices 184 such as keyboards (KB), mice and various other devices (e.g., cameras, phones, storage, media players, etc.).
In the example of
The system 100, upon power on, may be configured to execute boot code 190 for the BIOS 168, as stored within the SPI Flash 166, and thereafter to process data under the control of one or more operating systems and application software (e.g., stored in system memory 140). An operating system may be stored in any of a variety of locations and accessed, for example, according to instructions of the BIOS 168.
The system may also include an audio receiver/microphone 193 that provides input from the microphone 193 to the processor 122 based on audio that is detected, such as via a user providing audible input to the microphone during a telephone call or other audio communication while the speakers 194 output audio from the other end(s) of the call in accordance with present principles. The system may further include camera 195 that gathers one or more images and provides input related thereto to the processor 122. The camera 195 may be a thermal imaging camera, a digital camera such as a webcam, a three-dimensional (3D) camera, and/or a camera otherwise integrated into the system 100 and controllable by the processor 122 to gather pictures/images and/or video, such as images to be used for eye tracking and video conferencing in accordance with present principles.
Additionally, though not shown for clarity, in some embodiments the system 100 may include a gyroscope that senses and/or measures the orientation of the system 100 and provides input related thereto to the processor 122, as well as an accelerometer that senses acceleration and/or movement of the system 100 and provides input related thereto to the processor 122. Still further, the system 100 may include a GPS transceiver that is configured to communicate with at least one satellite to receive/identify geographic position information and provide the geographic position information to the processor 122. However, it is to be understood that another suitable position receiver other than a GPS receiver may be used in accordance with present principles to determine the location of the system 100.
It is to be understood that an example client device or other machine/computer may include fewer or more features than shown on the system 100 of
Turning now to
Describing the headset 216 in more detail, it may be a virtual reality (VR) headset, an augmented reality (AR) headset, a pair of smart glasses, or even an earpiece headset for making telephone calls. It may include a head-mounted display 218 on which VR and AR images are presentable as well as the graphical elements described herein. The headset 216 may also include speakers for outputting audio in accordance with present principles as well as one or more cameras 220 so that the headset or a connected device may track a user's eyes in accordance with present principles based on input from the camera(s) 220 using eye tracking software.
Referring to
From block 300 the logic of
From block 302 the logic may proceed to block 304. At block 304 the device may select the recorded segment of audio of the communication so that it may be transcribed. The logic may then move to block 306 where, using voice to text software, the device may transcribe the words spoken by the user as indicated in the recorded audio segment. After block 306 the logic may proceed to block 308 where the device may access a database of voice commands that may be stored locally on the device or remotely on, e.g., a cloud server to which the device has access. The database itself may be, for example, a relational database of various words and corresponding entries for whether those words constitute a voice command for which the device's personal assistant should take action. Additionally or alternatively, the database may simply be a listing of words that, when recognized by the device, are to constitute a voice command for which the device's personal assistant should take action. Regardless, the device at block 308 may access the database and parse it until a match to one or more of the words from the transcribed audio segment is located in the database.
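Where the database takes the relational form described above, the lookup at block 308 might resemble the following in-memory sketch; the schema, table name, and entries are illustrative assumptions only:

```python
import sqlite3

# Hypothetical schema: each row holds a word and a flag indicating whether
# that word constitutes a voice command for the personal assistant.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commands (word TEXT PRIMARY KEY, is_command INTEGER)")
conn.executemany(
    "INSERT INTO commands VALUES (?, ?)",
    [("weather", 1), ("hello", 0), ("remind", 1)],
)

def word_is_command(word: str) -> bool:
    """Look up a transcribed word and report whether it is a command."""
    row = conn.execute(
        "SELECT is_command FROM commands WHERE word = ?", (word.lower(),)
    ).fetchone()
    return bool(row and row[0])
```

The same check against a simple listing of command words, per the alternative described above, would reduce to a set-membership test.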
The logic may then proceed to decision diamond 310 where the device may determine, based on parsing the database, whether one or more words from the text of the transcription are indicated in the database. A negative determination at diamond 310 may cause the device to proceed to block 312 where the device may discard the transcription and/or the recorded audio segment itself (e.g., delete it or remove it from memory), after which the device may proceed to block 314. Block 314 may be an instruction for the logic to proceed back to block 302 and to proceed therefrom to analyze another, subsequently recorded audio segment.
However, if an affirmative determination is made at diamond 310 instead of a negative one, the logic of
Thus, from block 316 the logic may proceed to decision diamond 318 where the device may in fact determine, based on execution of the natural language processing software/artificial intelligence at block 316, whether there was an intent by the user to provide a voice command to the device to execute a function. A negative determination at diamond 318 may cause the logic to revert back to block 312 and proceed therefrom as described above. However, an affirmative determination at diamond 318 will instead cause the logic to proceed to block 320.
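The intent determination of blocks 316 and 318 may be approximated, for illustration only, by a simple heuristic that inspects the words immediately preceding a detected command keyword; a production system would instead employ full natural language processing, and the cue words below are assumptions:

```python
# Illustrative stand-in for natural language processing: if the words just
# before the command keyword suggest reported or hypothetical speech, treat
# the keyword as regular conversation rather than an intended command.
REPORTED_SPEECH_CUES = {"said", "says", "if", "whether", "asked"}

def intended_as_command(words, cmd_index, window=3):
    """words: list of transcribed words; cmd_index: position of the keyword."""
    before = words[max(0, cmd_index - window):cmd_index]
    return not any(w.lower() in REPORTED_SPEECH_CUES for w in before)
```

For example, "he said okay assistant..." would be classified as regular speech, while "okay assistant, what time..." would be classified as an intended command.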
At block 320 the device may, as another step to confirm that a voice command has in fact been issued, request confirmation from the user that a voice command has actually been issued to the device. The confirmation request may take one or more different forms. For instance, the request may include presentation of a predetermined audio tone or chime via the device's speaker(s) that the user would know as being a cue that the device has picked up on a voice command to the device. A graphical element such as an icon or symbol may also be presented on the device's touch-enabled display as part of the request so that when the predetermined audio tone/chime is played the user has a threshold non-zero period of time to provide touch input selecting the graphical element to provide input confirming that a voice command has in fact been provided to the device. However, in other embodiments the graphical element itself might be provided without also providing the predetermined audio tone/chime, as might be appropriate if the user were engaging in video conferencing and were already looking at the display anyway as part of the conferencing.
Accordingly, from block 320 the logic may proceed to decision diamond 322 where the device may determine whether a response to the request was received within a threshold non-zero time of one or both of the predetermined chime/tone being played and the graphical element being presented. For example, the threshold time may be thirty seconds. A negative determination will cause the logic to revert back to block 312 and proceed therefrom as described above. However, an affirmative determination at diamond 322 will instead cause the logic to proceed to block 324. At block 324 the device may perform a function or task indicated by the voice command and any surrounding portions that might provide context for the voice command.
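The confirmation wait at diamond 322 may be sketched as a simple polling loop; the poll_response callable, standing in for whatever touch or gesture input the device actually monitors, is an assumption for illustration:

```python
import time

def await_confirmation(poll_response, timeout_s=30.0, poll_interval_s=0.1):
    """Poll for a user confirmation until the threshold time elapses.
    Returns True on confirmation, False if the threshold expires first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll_response():
            return True
        time.sleep(poll_interval_s)
    return False
```

A return value of False here corresponds to the negative determination at diamond 322, reverting the logic to block 312.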
As an example, the voice command may be “Okay assistant, what is the weather like over there?” Then, based on that command and the device also identifying that Morrisville, N.C. was being discussed in surrounding parts of the conversation, the device may access weather information over the Internet to determine the current weather in Morrisville, N.C. to report to the user. Other examples of voice commands may include commands to create electronic calendar entries, commands to find recipes for a particular type of dinner, commands to add tasks to a “to do” list, commands to turn on other devices such as TVs or smart home lights, or any other commands that might be provided to a personal assistant application.
Continuing the detailed description in reference to
Conversely, the user not selecting the icon 404 within a threshold non-zero amount of time of presentation of the icon 404, the user selecting the icon 404 but not for the threshold selection time referenced in the paragraph immediately above, and/or the user gesturing another predetermined gesture other than to select the icon 404 with his or her finger may be interpreted by the device as one or more of the following: input that a voice command was not provided, input that the device should not take action in conformance with the voice command, and/or input that the icon 404 should be deleted/removed from the GUI 400 without taking action in conformance with the voice command. The predetermined gesture referenced in the sentence immediately above may be, for example, a drag and drop gesture using the user's hand or the device's cursor to drag and drop the icon 404 in a graphical trash can 408 presented on the device's display. The predetermined gesture may also be a dragging or swiping of the icon 404 offscreen by the user taking his or her index finger and swiping against the device's touch-enabled display to swipe the icon 404 off the display.
Still in reference to
Additionally or alternatively, the icon 504 may be selected by the user to confirm the user's voice command by the user gazing at the icon 504 for a threshold non-zero amount of time and by the user also providing a gesture with his or her hand that the headset or a connected device would recognize as a predetermined gesture indicating user confirmation. For example, one or more cameras within the user's environment or on the headset itself may gather images of the user and provide them to the headset's processor (or a connected device's processor) for the processor to execute gesture recognition using the images to identify the gesture as a “thumbs up” gesture with the user's hand that indicates user confirmation of the voice command.
Additionally or alternatively, the predetermined gesture may be an “air tap” where a user uses his or her index finger to provide a tapping gesture in free space where the icon 504 appears to the user to exist in 3D space owing to the headset using AR or VR processing to present the icon 504 in such a manner. The “tapping” on the icon 504 as it appears to the user may thus be interpreted by the headset as selection of the icon 504 and hence user confirmation of the voice command the headset has identified.
Notwithstanding the foregoing, also note that in some embodiments identification of the predetermined gesture without also identifying the user gazing at the icon 504 past the threshold amount of time may still constitute confirmation from the user.
In any case, it is to be further understood that a user's gaze at the icon 504 for less than the threshold amount of time, the user not looking at the icon 504 at all, and/or the user gesturing another predetermined gesture may be interpreted by the headset as one or more of the following: input that a voice command was not provided, input that the headset should not take action in conformance with the voice command, and/or input that the icon 504 should be deleted/removed from the GUI 500 without taking action in conformance with the voice command. This predetermined “no” gesture may be, for example, a “thumbs down” gesture using the user's hand.
This predetermined “no” gesture may also include the user pressing and holding the icon 504 using his or her index finger where the icon 504 appears to the user to exist in 3D space owing to the headset using AR or VR processing to present the icon 504 in such a manner. Once the headset identifies the user as pressing and holding the icon 504 for a threshold non-zero amount of time, the headset may enable the user to drag the icon 504 offscreen by taking his or her index finger and swiping in free space from where the icon 504 appears to be presented to another location, relative to the user, that cannot be seen by the user while wearing the headset (such as down and to the right of the user's right leg).
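For illustration, the gaze-dwell confirmation described above might be implemented over a stream of eye-tracker samples as follows; the sample period and dwell threshold are hypothetical defaults:

```python
def gaze_confirms(gaze_samples, threshold_s=2.0, sample_period_s=0.1):
    """gaze_samples: booleans, True when the eye tracker reports the user is
    looking at the icon. Confirm when a continuous run of True samples spans
    at least threshold_s of dwell time."""
    needed = int(round(threshold_s / sample_period_s))
    run = 0
    for on_icon in gaze_samples:
        run = run + 1 if on_icon else 0  # reset the run when gaze leaves
        if run >= needed:
            return True
    return False
```

Note that an interrupted gaze resets the dwell count, so only a continuous look at the icon for the threshold time counts as confirmation.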
Still in reference to
Before moving on to the description of
Now describing
The GUI 600 may also include an option 604 that is selectable by directing touch or cursor input to the check box adjacent to it to enable a setting for the device in which, prior to requesting confirmation of a voice command from a user, the device may use natural language understanding software as described herein. For example, a user may select option 604 to enable the device to execute step 316 of
Even further, the GUI 600 may include a setting 606 for a user to establish the length of the threshold amount of audio that is to be recorded as, e.g., referenced above when describing blocks 302 and 304. Thus, a user may direct input to input box 608 by selecting it with touch input or a cursor and then using a soft or hard keyboard to specify a particular length of time, such as fifteen seconds, to establish as the threshold amount of time.
The GUI 600 may also include a setting 610 for a user to establish the threshold amount of time for selection of a graphical element as disclosed herein. Accordingly, a user may direct input to input box 612 by selecting it with touch input or a cursor and then using a soft or hard keyboard to specify a particular length of time, such as five seconds.
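Gathering the user-configurable thresholds from GUI 600 into a single structure might look as follows; the field names are hypothetical, and the defaults merely echo the example values above (fifteen seconds of recorded audio, five seconds for element selection):

```python
from dataclasses import dataclass

@dataclass
class VoiceCommandSettings:
    # Hypothetical settings structure mirroring GUI 600's options.
    use_nlu_before_confirming: bool = True  # option 604: run NLU first
    segment_length_s: float = 15.0          # setting 606/608: audio window
    confirm_selection_s: float = 5.0        # setting 610/612: selection time
```

User input to the GUI's input boxes would then simply overwrite the corresponding fields.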
It is to be understood that while present principles have been described with reference to some example embodiments, these are not intended to be limiting, and that various alternative arrangements may be used to implement the subject matter claimed herein. Components included in one embodiment can be used in other embodiments in any appropriate combination. For example, any of the various components described herein and/or depicted in the Figures may be combined, interchanged or excluded from other embodiments.