Speech recognition analysis via identification information

Abstract
Embodiments are disclosed that relate to the use of identity information to help avoid the occurrence of false positive speech recognition events in a speech recognition system. One embodiment provides a method comprising receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person in an image. The acoustic locational data is compared to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of an image sensor, and the confidence data is adjusted depending upon this determination.
Description
BACKGROUND

Speech recognition technology allows a user of a computing device to make inputs via speech commands, rather than via a keyboard or other peripheral input device. One difficulty shared by speech recognition systems is discerning intended speech inputs from other received sounds, including but not limited to background noise, background speech, and speech from a current system user that is not intended to be an input.


Various methods have been proposed to discern intended speech inputs from other sounds. For example, some speech input systems require a user to say a specific command, such as “start listening,” before any speech will be accepted and analyzed as an input. However, such systems may still be susceptible to background noise that randomly matches recognized speech patterns and that therefore may be interpreted as input. Such “false positives” may result in a speech recognition system performing actions not intended by a user, or performing actions even when no users are present.


SUMMARY

Accordingly, various embodiments are disclosed herein that relate to the use of identity information to help avoid the occurrence of false positive speech recognition events in a speech recognition system. For example, one disclosed embodiment provides a method of operating a speech recognition input system. The method comprises receiving speech recognition data comprising a recognized speech segment, acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array, and confidence data comprising a recognition confidence value, and also receiving image data comprising visual locational information related to a location of each person located in a field of view of an image sensor. The acoustic locational data is compared to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor. The method further comprises adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.


This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows an embodiment of an example speech input environment in the form of a video game environment.



FIG. 2 shows a block diagram of an embodiment of a computing system comprising a speech recognition input system.



FIG. 3 shows a flow diagram depicting an embodiment of a method of analyzing speech input using identity data.



FIG. 4 shows a flow diagram depicting another embodiment of a method of analyzing speech input using identity data.



FIG. 5 shows a block diagram of an embodiment of a system for analyzing a speech input using identity data.



FIG. 6 shows a schematic depiction of a portion of an embodiment of a frame of a depth image.





DETAILED DESCRIPTION

The present disclosure is directed to avoiding false positive speech recognition events in a speech recognition input system. Further, the disclosed embodiments also may help to ensure that a speech recognition event originated from a desired user in situations where there are multiple users in the speech recognition system environment. For example, where a plurality of users are playing a game show-themed video game and the game requests a specific person to answer a specific question, the disclosed embodiments may help to block answers called out by other users. It will be understood that speech recognition input systems may be used to enable speech inputs for any suitable device. Examples include, but are not limited to, interactive entertainment systems such as video game consoles, digital video recorders, digital televisions and other media players, and devices that combine two or more of these functionalities.



FIG. 1 shows an example speech recognition use environment in the form of an interactive entertainment system 10 that may be used to play a variety of different games, play one or more different media types, and/or control or manipulate non-game applications. The interactive entertainment system 10 comprises a console 102 configured to display an image on a display 104, shown as a television, which may be used to present game visuals to one or more game players. It will be understood that the example embodiment shown in FIG. 1 is presented for the purpose of illustration, and is not intended to be limiting in any manner.


Entertainment system 10 further comprises an input device 100 having a depth-sensing camera and a microphone array. The depth-sensing camera may be used to visually monitor one or more users of entertainment system 10, and the microphone array may be used to receive speech commands made by the players. The use of a microphone array, rather than a single microphone, allows information regarding the location of a source of a sound (e.g. a player speaking) to be determined from the audio data.


The data acquired by input device 100 allows a player to make inputs without the use of a hand-held controller or other remote device. Instead, speech inputs, movements, and/or combinations thereof may be interpreted by entertainment system 10 as controls that can be used to affect the game being executed by entertainment system 10.


The movements and speech inputs of game player 108 may be interpreted as virtually any type of game control. For example, the example scenario illustrated in FIG. 1 shows game player 108 playing a boxing game that is being executed by interactive entertainment system 10. The gaming system uses television 104 to visually present a boxing opponent 110 to game player 108. Furthermore, the entertainment system 10 also visually presents a player avatar 112 that game player 108 controls with movements. For example, game player 108 can throw a punch in physical space as an instruction for player avatar 112 to throw a punch in game space. Entertainment system 10 and input device 100 can be used to recognize and analyze the punch of game player 108 in physical space so that the punch can be interpreted as a game control that causes player avatar 112 to throw a punch in game space. Speech commands also may be used to control aspects of play.


Furthermore, some movements and speech inputs may be interpreted as controls that serve purposes other than controlling player avatar 112. For example, the player may use movements and/or speech commands to end, pause, or save a game, select a level, view high scores, communicate with a friend, etc. The illustrated boxing scenario is provided as an example, but is not meant to be limiting in any way. To the contrary, the illustrated scenario is intended to demonstrate a general concept, which may be applied to a variety of different applications without departing from the scope of this disclosure.



FIG. 2 shows a block diagram of the embodiment of FIG. 1. As mentioned above, input device 100 comprises an image sensor, such as a depth-sensing camera 202 to detect player motion, and also comprises a microphone array 204 to detect speech inputs from players. Depth-sensing camera 202 may utilize any suitable mechanisms for determining the depth of a target object (e.g. a player) in the field of view of the camera, including but not limited to structured light mechanisms. Likewise, microphone array 204 may have any suitable number and arrangement of microphones. For example, in one specific embodiment, microphone array 204 may have four microphones that are spatially arranged so that sound from a single source does not destructively interfere at all four microphones simultaneously. In other embodiments, the input device 100 may comprise an image sensor other than a depth-sensing camera.


Input device 100 also comprises memory 206 comprising instructions executable by a processor 208 to perform various functions related to receiving inputs from depth-sensing camera 202 and microphone array 204, processing such inputs, and/or communicating such inputs to console 102. Embodiments of such functions are described in more detail below. Console 102 likewise includes memory 210 having instructions stored thereon that are executable by a processor 212 to perform various functions related to the operation of entertainment system 10, embodiments of which are described in more detail below.


As described above, it may be difficult for a speech recognition system to discern intended speech inputs from other received sounds, such as background noise, background speech (i.e. speech not originating from a current user), etc. Further, it also may be difficult for a speech recognition system to differentiate speech from a current system user that is not intended to be an input. Current methods that involve a user issuing a specific speech command, such as “start listening,” to initiate a speech-recognition session may be subject to false positives in which background noise randomly matches such speech patterns. Another method involves the utilization of a camera to detect the gaze of a current user to determine if speech from the user is intended as a speech input. However, this method relies upon a user being positioned in an expected location during system use, and therefore may not be effective in a dynamic use environment in which users move about, in which users may be out of view of the camera, and/or in which non-users may be present.


Accordingly, FIG. 3 shows a flow diagram depicting an embodiment of a method 300 for operating a speech recognition input system. Method 300 comprises, at 302, receiving speech recognition data. The speech recognition data may include data such as a recognized speech segment 304, acoustic location information 306 that indicates a direction and/or location of an origin of the recognized speech segment, and/or confidence data 308 that represents a confidence value indicating the certainty of a match of the recognized speech segment to the speech pattern to which it was matched. The recognized speech segment 304 and confidence data 308 may each be determined from analysis of sounds received by the microphone array, for example, by combining the signals from the microphones into a single speech signal via digital audio processing and then performing speech recognition analysis. Likewise, the acoustic location information 306 may be determined from the output of the microphone array via analysis of the relative times at which the recognized speech segment was received at each microphone. Various techniques are known for each of these processes.
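By way of a non-limiting illustration only, the Python sketch below shows one way the speech recognition data of 302 might be packaged, together with a simple time-difference-of-arrival estimate of the direction of origin from one microphone pair; the class, the field names, the microphone spacing, and the speed-of-sound constant are assumptions made for this sketch, not part of the disclosed embodiments.

    # Illustrative sketch only; the class, field names, microphone spacing, and
    # speed-of-sound constant are assumptions, not the disclosed implementation.
    import math
    from dataclasses import dataclass

    SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air

    @dataclass
    class SpeechRecognitionData:
        recognized_segment: str        # recognized speech segment (304)
        acoustic_angle_deg: float      # direction of origin from the microphone array (306)
        recognition_confidence: float  # certainty of the match to a speech pattern, 0..1 (308)

    def estimate_direction_deg(delay_seconds: float, mic_spacing_m: float) -> float:
        """Estimate a direction of arrival from the relative time at which the same
        sound reached two microphones of the array (time-difference-of-arrival)."""
        sine = max(-1.0, min(1.0, SPEED_OF_SOUND_M_S * delay_seconds / mic_spacing_m))
        return math.degrees(math.asin(sine))

    # Example: a 0.2 ms inter-microphone delay across 15 cm gives roughly 27 degrees.
    data = SpeechRecognitionData("start listening",
                                 estimate_direction_deg(0.0002, 0.15),
                                 0.82)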


Next, method 300 comprises, at 312, receiving image data. The image data may comprise, for example, processed image data that was originally received by the depth-sensing camera and then processed to identify persons or other objects in the image. In some embodiments, individual pixels or groups of pixels in the image may be labeled with metadata that represents a type of object imaged at that pixel (e.g. “player 1”), and that also represents a distance of the object from an input device. This data is shown as “visual location information” 314 in FIG. 3. An example embodiment of such image data is described in further detail below.


After receiving the speech recognition data and the image data, method 300 next comprises, at 316, comparing the acoustic location information to the visual location information, and at 318, adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the image sensor field of view. For example, if it is determined that the recognized speech segment did not originate from a player in view, the confidence value may be lowered, or a second confidence value may be added to the confidence data, wherein the second confidence value is an intended input confidence value configured (in this case) to communicate a lower level of confidence that the recognized speech segment came from an active user. Likewise, where it is determined that the recognized speech segment did originate from a player in view, the confidence value may be raised or left unaltered, or an intended input confidence value may be added to the confidence data to communicate a higher level of confidence that the recognized speech segment came from an active user.
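As one hedged sketch of the comparison at 316 and the adjustment at 318, the function below assumes the acoustic location information has been reduced to a direction angle and the visual location information to one direction angle per person in view; the angular tolerance and the scaling factors are illustrative assumptions only.

    # Illustrative sketch; the angular tolerance and scaling factors are assumptions.
    def adjust_confidence(acoustic_angle_deg, recognition_confidence,
                          person_angles_deg, tolerance_deg=10.0):
        """Compare acoustic location information (316) with the direction of each
        person in view and adjust the confidence data accordingly (318).

        Returns (recognition_confidence, intended_input_confidence)."""
        in_view = any(abs(acoustic_angle_deg - angle) <= tolerance_deg
                      for angle in person_angles_deg)
        if in_view:
            # Speech originated from a person in the field of view: leave the
            # recognition confidence unaltered and attach a high intent confidence.
            return recognition_confidence, 0.9
        # Otherwise lower the recognition confidence (or, equivalently, attach a
        # low intended-input confidence) so a receiving application can discount it.
        return recognition_confidence * 0.5, 0.2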


In either case, the recognized speech segment and modified confidence data may be provided to an application for use. Using this data, the application may decide whether to accept or reject the recognized speech segment based upon the modified confidence data. Further, in some cases where it is determined that it is highly likely that the recognized speech segment was not intended to be a speech input, method 300 may comprise rejecting the recognized speech segment, and thus not passing it to an application. In this case, such rejection of a recognized speech segment may be considered an adjustment of a confidence level to a level below a minimum confidence threshold. It will be understood that the particular examples given above for adjusting the confidence data are described for the purpose of illustration, and that any other suitable adjustments to the confidence values may be made.
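For example, an application receiving the recognized speech segment and the modified confidence data might apply a minimum confidence threshold as sketched below; the threshold value and the on_command callback are hypothetical names introduced only for this illustration.

    # Illustrative sketch; the threshold and the on_command callback are hypothetical.
    MIN_CONFIDENCE = 0.6  # assumed minimum confidence threshold

    def handle_speech_event(segment, recognition_confidence,
                            intended_input_confidence, on_command):
        """Application-side decision: act on the segment only if both confidence
        values meet the assumed threshold; otherwise ignore it as background sound."""
        if (recognition_confidence >= MIN_CONFIDENCE
                and intended_input_confidence >= MIN_CONFIDENCE):
            on_command(segment)

    # Example usage: handle_speech_event("pause game", 0.85, 0.9, print) prints "pause game".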


In some cases, other information than acoustic location information and visual location information may be used to help determine a level of confidence that a recognized speech segment is intended to be an input. FIG. 4 shows a flow diagram depicting an embodiment of a method 400 that utilizes various examples of data that may be used to help determine whether a recognized speech segment is intended to be a speech input. Further, FIG. 5 shows an embodiment of a system 500 suitable for performing method 400.


Method 400 comprises, at 402, receiving a recognized speech segment and confidence data. As illustrated in FIG. 5, such signals may be received as output from an audio processing pipeline configured to receive a plurality of audio signals from a microphone array via an analog-to-digital converter (ADC), as indicated at 502. The illustrated audio processing pipeline embodiment comprises one or more digital audio processing stages, illustrated generically by box 504, and also a speech recognition stage 506.


The digital audio processing stage 504 may be configured to perform any suitable digital audio processing on the digitized microphone signals. For example, the digital audio processing stage 504 may be configured to remove noise, to combine the four microphone signals into a single audio signal, and to output acoustic location information 507 that comprises information on a direction and/or location from which a speech input is received. The speech recognition stage 506, as described above, may be configured to compare inputs received from the digital audio processing stage 504 to a plurality of recognized speech patterns to attempt to recognize speech inputs. The speech recognition stage 506 may then output recognized speech segments and also confidence data for each recognized speech segment to an intent determination stage 508. Further, the intent determination stage 508 may also receive the acoustic location information from the digital audio processing stage 504. It will be understood that, in some embodiments, the acoustic location information may be received via the speech recognition stage 506, or from any other suitable component.
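The hand-off of data among the stages of FIG. 5 might be arranged as in the following sketch; the stage objects and their method names (process, recognize, evaluate) are assumptions introduced only to mirror the data flow described above, not a real interface.

    # Data-flow sketch of the FIG. 5 pipeline; the stage objects and their method
    # names (process, recognize, evaluate) are assumptions, not a real API.
    def process_audio_frame(mic_signals, digital_audio_stage, speech_stage, intent_stage):
        # Digital audio processing stage (504): remove noise, combine the microphone
        # signals into one speech signal, and estimate acoustic location information (507).
        speech_signal, acoustic_location = digital_audio_stage.process(mic_signals)

        # Speech recognition stage (506): compare the combined signal to recognized
        # speech patterns, yielding recognized segments with confidence data.
        for segment, confidence in speech_stage.recognize(speech_signal):
            # Intent determination stage (508): receives the segment, its confidence
            # data, and the acoustic location information for comparison with image data.
            intent_stage.evaluate(segment, confidence, acoustic_location)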


Referring back to FIG. 4, method 400 next comprises determining whether the recognized speech segment originated from a player in view of the image sensor. This determination may be made in any suitable manner. For example, referring again to FIG. 5, image data from a depth-sensing camera may be received by a video processing stage 510 that performs such video processing as skeletal tracking.


The video processing stage 510 may output any suitable data, including but not limited to a synthesized depth image that includes information regarding the locations and depths of objects at each pixel as determined from skeletal tracking analysis. FIG. 6 shows a schematic view of data contained in a portion of an example embodiment of a synthesized depth image 600. Synthesized depth image 600 comprises a plurality of pixels each comprising image data and associated metadata that comprises information related to persons located in the image as determined via skeletal tracking. For example, a first pixel 602 comprises a first set of metadata 604. The first set of metadata 604 is shown as comprising, from top to bottom, a pixel index (shown as [x,y] coordinates), a depth value that indicates a depth of a part of a person's body in the image (e.g. a distance from the depth-sensing camera), a body part identification (here shown generically as “bp 4”, or body part 4), and a player number (“P1”, or player 1). Further, a second pixel 606 is seen to comprise a second set of metadata 608. Comparing the first set of metadata 604 and the second set of metadata 608, it can be seen that the first pixel 602 and second pixel 606 are identified as imaging different body parts at different distances from the depth-sensing camera. Thus, the processed image data comprises visual locational information related to a distance of each person in the field of view of the depth-sensing camera.
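The per-pixel metadata of synthesized depth image 600 might be represented as sketched below, together with a helper that collapses the labeled pixels into a rough image-space centroid and mean depth for each player; the field names and the centroid calculation are assumptions for illustration.

    # Illustrative sketch; the field names and the centroid calculation are assumptions.
    from collections import defaultdict
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class DepthPixel:
        x: int
        y: int
        depth_mm: int   # distance of the imaged surface from the depth-sensing camera
        body_part: int  # body part identification, e.g. 4 for "bp 4"
        player: int     # player number, e.g. 1 for "P1"; 0 if no person is imaged here

    def player_locations(pixels):
        """Collapse the labeled pixels of a synthesized depth image into a rough
        image-space centroid and mean depth for each player in view."""
        grouped = defaultdict(list)
        for p in pixels:
            if p.player:
                grouped[p.player].append(p)
        return {player: (mean(p.x for p in pts),
                         mean(p.y for p in pts),
                         mean(p.depth_mm for p in pts))
                for player, pts in grouped.items()}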


Referring again to FIG. 4, such visual location information may be compared, at 404, to the acoustic location information to help determine whether the recognized speech segment originated from a player in the field of view of the depth-sensing camera or other image sensor. If it is determined that the recognized speech segment did not originate from a player in the field of view of the depth-sensing camera, then method 400 comprises, at 406, determining whether the person from whom the recognized speech segment originated can be identified by voice. This may be performed in any suitable manner. For example, referring again to FIG. 5, a database of user voice patterns 514 may be maintained by an interactive entertainment system (e.g. each new user of the system may be asked to input a voice sample to allow the system to maintain a record of the user's voice pattern) to allow the subsequent identification of users by voice. Referring back to FIG. 4, if it is determined that the recognized speech segment did not originate from a player in view and the speaker cannot be identified by voice, then method 400 comprises rejecting the recognized speech segment, as shown at 408. In this instance, the recognized speech segment is not passed to an application for use. On the other hand, if the speaker can be identified by voice, then the confidence data is modified at 410 to reflect a reduction in confidence that the recognized speech input was intended to be an input. It will be understood that, in other embodiments, where the speaker is not in the field of view of the depth-sensing camera and cannot be identified by voice, the recognized speech segment may not be rejected, but instead the confidence data may be modified.
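One assumption-laden sketch of this branch is shown below, in which a speaker outside the field of view is rejected outright unless a stored voice pattern matches, in which case the segment is retained with reduced confidence; the identify_speaker lookup and the reduction factor are hypothetical.

    # Illustrative sketch; identify_speaker and the reduction factor are hypothetical.
    def evaluate_out_of_view_speaker(segment, confidence, identify_speaker, audio_clip):
        """Handle a recognized speech segment whose origin is not in the camera's view."""
        speaker = identify_speaker(audio_clip)  # stand-in for a lookup against voice patterns 514
        if speaker is None:
            # Not in view and not a recognized voice: reject the segment (408).
            return None
        # Recognized voice but out of view (410): keep the segment with reduced confidence.
        return segment, confidence * 0.6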


Returning to process 404, if it is determined that the recognized speech segment originated from a person in the field of view of the depth-sensing camera, then method 400 comprises, at 412, determining if the person is facing the depth-sensing camera. This may comprise, for example, determining if the visual location data indicates that any facial features of the player are visible (e.g. eyes, nose, mouth, overall face, etc.). Such a determination may be useful, for example, to distinguish a user sitting side-by-side with and talking to another user (i.e. speech made by a non-active user) from a user making a speech input (i.e. speech made by an active user). If it is determined at 412 that the user is not facing the camera, then method 400 comprises, at 414, adjusting the confidence data to reflect a reduction in confidence that the recognized speech input was intended to be an input. On the other hand, if it is determined that the user is facing the camera, then the confidence data is not adjusted. It will be understood that, in other embodiments, any other suitable adjustments to the confidence data, other than those described herein, may be made to reflect the different confidence levels resulting from the determination at 412.
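As a minimal sketch, assuming the tracking data exposes a flag indicating whether any facial features of the speaking player are visible, the adjustment at 414 might simply scale the confidence value; the scaling factor is an assumption.

    # Illustrative sketch; the scaling factor is an assumption.
    def facing_adjustment(face_features_visible: bool, confidence: float) -> float:
        """Reduce confidence when the speaker's facial features are not visible (414)."""
        # A player facing the camera is more likely addressing the system than a
        # player turned toward another person in the room.
        return confidence if face_features_visible else confidence * 0.7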


Next, at 416, it is determined whether the person from whom the recognized speech segment originated can be identified by voice. As described above for process 406, this may be performed in any suitable manner, such as by consulting a database of user voice patterns 514. If the person cannot be identified by voice, then method 400 comprises, at 418, adjusting the confidence data to reflect a reduction in confidence that the recognized speech input was intended to be an input. On the other hand, if the person can be identified by voice, then the confidence data is not adjusted. It will be understood that, in other embodiments, any other suitable adjustments to the confidence data, other than those described herein, may be made to reflect the different confidence levels resulting from the determination at 416.


Method 400 next comprises, at 420, determining whether the user's speech input contains a recognized keyword. Such recognized keywords may be words or phrases considered to be indicative that subsequent speech is likely to be intended as a speech input, and may be stored in a database, as indicated at 516 in FIG. 5. If it is determined at 420 that the recognized speech segment was not preceded by a keyword received in a predetermined window of time, then method 400 comprises, at 422, adjusting the confidence data. On the other hand, if it is determined that the recognized speech segment was preceded by a keyword within the predetermined window of time, then method 400 comprises adjusting the confidence data based upon an amount of time that passed between receiving the keyword and the recognized speech segment. For example, in some embodiments, the magnitude of the adjustment applied may follow a decay-type curve as a function of time, such that the adjustment reflects a progressively lesser confidence as more time passes between receiving the keyword and receiving the recognized speech segment. In other embodiments, the adjustment may be binary or stepped in nature, such that no adjustment is made to the confidence data until a predetermined amount of time passes between receiving a keyword and receiving the recognized speech segment. It will be understood that these examples of time-dependent adjustments are described for the purpose of illustration, and are not intended to be limiting in any manner.
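One illustrative sketch of such a time-dependent adjustment, assuming an exponential decay-type curve, is shown below; the window length, decay constant, and boost magnitude are assumptions.

    # Illustrative sketch; window length, decay constant, and boost size are assumptions.
    import math
    from typing import Optional

    KEYWORD_WINDOW_S = 10.0  # assumed predetermined window of time
    DECAY_TAU_S = 4.0        # assumed decay constant for the decay-type curve

    def keyword_adjusted_confidence(confidence: float,
                                    seconds_since_keyword: Optional[float]) -> float:
        """Adjust confidence based on how recently a recognized keyword was received."""
        if seconds_since_keyword is None or seconds_since_keyword > KEYWORD_WINDOW_S:
            # No keyword within the predetermined window (422): reduce confidence.
            return confidence * 0.5
        # Keyword received within the window: the boost decays as more time passes
        # between the keyword and the recognized speech segment.
        boost = math.exp(-seconds_since_keyword / DECAY_TAU_S)
        return min(1.0, confidence * (1.0 + 0.5 * boost))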


It further will be understood that the examples of, and order of, processes shown in FIG. 4 are presented for the purpose of example and are not intended to be limiting. In other embodiments, a determination of an intent of a user to make a speech input may utilize only a subset of the illustrated processes and/or additional processes not shown. Furthermore, such processes may be applied in any suitable order.


It also will be appreciated that the computing devices described herein may be any suitable computing device configured to execute the programs described herein. For example, such a computing device may be a mainframe computer, personal computer, laptop computer, portable data assistant (PDA), set-top box, game console, computer-enabled wireless telephone, networked computing device, or other suitable computing device, and such devices may be connected to each other via computer networks, such as the Internet. These computing devices typically include a processor and associated volatile and non-volatile memory, and are configured to execute programs stored in non-volatile memory using portions of volatile memory and the processor. As used herein, the term “program” refers to software or firmware components that may be executed by, or utilized by, one or more computing devices described herein, and is meant to encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc. It will be appreciated that computer-readable storage media may be provided having program instructions stored thereon, which, upon execution by a computing device, cause the computing device to execute the methods described above and cause operation of the systems described above.


It is to be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated may be performed in the sequence illustrated, in other sequences, in parallel, or in some cases omitted. Likewise, the order of the above-described processes may be changed.


The subject matter of the present disclosure includes all novel and nonobvious combinations and subcombinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

Claims
  • 1. In a computing system comprising a microphone array and an image sensor, a method of operating a speech recognition input system, the method comprising: receiving speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;receiving acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;receiving image data comprising visual locational information related to a location of each person located in a field of view of the image sensor;comparing the acoustic locational data to the visual locational information to determine whether the recognized speech segment originated from a person in the field of view of the image sensor; andadjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
  • 2. The method of claim 1, wherein adjusting the confidence data comprises adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the recognized speech segment is determined not to have originated from a person in the field of view of the image sensor than if the recognized speech segment is determined to have originated from a person in the field of view of the image sensor.
  • 3. The method of claim 1, further comprising determining to reject the recognized speech segment as a speech input when the recognition confidence value is below a minimum confidence threshold.
  • 4. The method of claim 1, further comprising adjusting the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  • 5. The method of claim 1, wherein, if the recognized speech segment is determined not to have originated from a recognized speaker and is determined not to have originated from a person in the field of view of the image sensor, then adjusting the confidence data comprises rejecting the recognized speech segment.
  • 6. The method of claim 1, wherein, if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determining whether a face of the person is facing the image sensor, and adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
  • 7. The method of claim 1, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the recognition confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  • 8. The method of claim 1, wherein the image sensor is a depth-sensing camera, and wherein receiving image data comprising visual locational information comprises receiving image data comprising information related to a distance of each person in the field of view of the depth-sensing camera.
  • 9. An interactive entertainment system, comprising: a depth-sensing camera;a microphone array comprising a plurality of microphones; anda computing device comprising a processor and memory comprising instructions stored thereon that are executable by the processor to: receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;receive acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from the microphone array;receive image data comprising visual locational information related to a location of each person located in a field of view of the depth-sensing camera;compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of an image sensor; andadjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.
  • 10. The interactive entertainment system of claim 9, wherein the instructions are executable to adjust the confidence data by adjusting the recognition confidence value such that the recognition confidence value has a lower value after adjusting if the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera than if the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera.
  • 11. The interactive entertainment system of claim 9, wherein the instructions are further executable to determine to reject the recognized speech segment as a speech input when the confidence value is below a minimum confidence threshold.
  • 12. The interactive entertainment system of claim 9, wherein the instructions are further executable to: determine if the recognized speech segment originated from a recognized speaker; andadjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  • 13. The interactive entertainment system of claim 12, wherein the instructions are further executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.
  • 14. The interactive entertainment system of claim 9, wherein the instructions are further executable to: determine that the recognized speech segment originated from a person in the field of view of the image sensor,determine whether a face of the person is facing the image sensor; andadjust the confidence data such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
  • 15. The interactive entertainment system of claim 9, further comprising receiving a speech input of a keyword before receiving the recognized speech segment, and wherein adjusting the confidence data comprises adjusting the recognition confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  • 16. A hardware computer-readable storage device comprising instructions stored thereon that are executable by a computing device to: receive speech recognition data as an output from a speech recognition stage of an audio processing pipeline, the speech recognition stage being configured to compare inputs received from a digital audio processing stage of the audio processing pipeline to a plurality of recognized speech patterns to recognize speech inputs, and the speech recognition data comprising a recognized speech segment and confidence data comprising a recognition confidence value that represents a confidence in a certainty of a match of the recognized speech segment to a speech pattern;receive acoustic locational data as an output from a digital audio processing stage of the audio processing pipeline, the acoustic locational data related to a location of origin of the recognized speech segment as determined via signals from a microphone array;receive image data comprising visual locational information related to a location of each person located in a field of view of a depth-sensing camera;compare the acoustic locational data to the visual locational data to determine whether the recognized speech segment originated from a person in the field of view of an image sensor;adjust the confidence data based upon whether the recognized speech segment is determined to have originated from a person in the field of view of the depth-sensing camera;if it is determined that the recognized speech segment originated from a person in the field of view of the image sensor, then determine whether a face of the person is facing the image sensor; andadjusting the confidence data such that the recognition confidence value has a lower value after adjusting if the face of the person is not facing the image sensor than if the face of the person is facing the image sensor.
  • 17. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to: determine if the recognized speech segment originated from a recognized speaker; andadjust the confidence data based upon whether the recognized speech segment is determined to have originated from a recognized speaker.
  • 18. The hardware computer-readable storage device of claim 17 wherein the instructions are executable to reject the recognized speech segment if the recognized speech segment is determined not to have originated from a recognized speaker and the recognized speech segment is determined not to have originated from a person in the field of view of the depth-sensing camera.
  • 19. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to receive a speech input of a keyword before receiving the recognized speech segment, and adjust the recognition confidence value based upon an amount of time that passed between receiving the speech input of the keyword and receiving the recognized speech segment.
  • 20. The hardware computer-readable storage device of claim 16, wherein the instructions are further executable to adjust the confidence data by one or more of adjusting the recognition confidence value and including an intended input confidence value in the confidence data.