The present disclosure relates to a dialogue system suitable for use in vehicles. More particularly, the disclosure relates to a computer-implemented natural language query system that resolves ambiguity using driver data, such as head or face position, eye gaze, hand and arm posture and gestures, and other orientation data sensed by internal sensors, and correlates that driver data with external data, such as image data, sound data, and 3D depth data obtained external to the vehicle.
Speech-enabled dialogue systems for use within vehicles present unique problems. Road noise within the vehicle cabin can significantly degrade speech recognition capabilities. In the past, this difficulty has been addressed by restricting the recognition system vocabulary and by using other recognition model training techniques that attempt to deal with the undesirably high degree of variability in signal-to-noise ratio and the resultant variability in recognition likelihood scores. In instances where the recognizer fails, the conventional approach has been to use a dialogue system that prompts the user to repeat the utterance that was not understood.
The present solution to the aforementioned problem takes a much different approach. The system uses sensors disposed inside the vehicle that are positioned to monitor physical activity of the driver, such as by monitoring the position and orientation of the driver's head or face. From this information, the system derives driver activity data that are stored as a function of time. In addition, a camera system disposed on the vehicle is positioned to collect visual data regarding conditions external to the vehicle as a function of time. A correlation processor stores in non-transitory computer-readable memory the driver activity data time-correlated with the visual data, and also optionally with location data from a cellular system or GPS system.
A disambiguation processor, operating in conjunction with the speech recognition processor, uses the time-correlated driver activity data, visual data and optional location data to ascertain from the driver activity what external condition the driver is referring to during an utterance. The disambiguation processor formulates a computer system query to correspond to the ascertained external condition.
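By way of non-limiting illustration, the time correlation underlying this approach may be sketched in Python as follows. The data structures, the 100 ms tolerance, and all names below are illustrative assumptions rather than part of the disclosure; the sketch simply pairs each time-stamped driver activity sample with the nearest-in-time frame of external visual data:

    from bisect import bisect_left
    from dataclasses import dataclass

    @dataclass
    class DriverSample:
        t: float             # timestamp in seconds
        gaze_azimuth: float  # looking direction, in degrees

    @dataclass
    class VisionFrame:
        t: float
        objects: list        # external objects detected in this frame

    def correlate(driver_samples, vision_frames, tolerance=0.1):
        """Pair each driver sample with the nearest vision frame in time."""
        times = [f.t for f in vision_frames]  # assumed sorted by time
        pairs = []
        for s in driver_samples:
            i = bisect_left(times, s.t)
            # Candidates: the frame just before and just after the sample.
            candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
            if not candidates:
                continue
            j = min(candidates, key=lambda k: abs(times[k] - s.t))
            if abs(times[j] - s.t) <= tolerance:
                pairs.append((s, vision_frames[j]))
        return pairs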
The natural query processing apparatus is thus able to infer what the driver was looking at when the utterance was made, or to infer meaning from other gestures performed by the driver during the utterance, in order to "fill in the gaps" or disambiguate the driver's utterance so that it can be used to generate a computer system query. The apparatus thus addresses disambiguation issues caused by poor recognition. However, the apparatus has uses beyond dealing with cabin noise-induced recognition problems. The apparatus can formulate queries responding to driver utterances such as, "What is the name of that restaurant?" or "What did that sign say?" In both of these examples, a conventional recognition system, even one that perfectly recognized the words of the utterance, would not be able to generate a computer system query because it would have no way to know what restaurant or what sign the driver was referring to.
FIGS. 6a and 6b are a detailed flow chart illustrating the disambiguation process;
Referring now to the drawings, the query processing apparatus has as one of its components a speech processor to perform speech recognition upon utterances made within the vehicle. Thus, a suitable microphone 22 is positioned within the vehicle cabin so that it can pick up utterances made by the driver. The speech processing system may also include a speech synthesis or digitized speech playback system that issues audible information to the vehicle occupants through a suitable audio sound system, such as the sound system of the infotainment center.
In one embodiment, shown in the accompanying drawing, the apparatus includes a processor 24 having an associated computer memory data store 28.
The apparatus further includes an input/output circuit 34 to which the internal and external sensors are attached. Thus, as illustrated, the input/output circuit may be supplied with visual data from a camera array 36 comprising one or more cameras pointing outwardly from the vehicle to capture information about conditions external to the vehicle. Gesture sensors 20 and the 3D camera sensor 14 are also connected to the input/output circuitry 34. The apparatus can connect to online databases 37, such as internet databases and services, using a wireless connection, which may be connected to or incorporated into the input/output circuitry 34. One or more microphones 22, or a microphone array, also communicate with the input/output circuit 34. Internal microphones capture human speech within the vehicle. By arranging the internal microphones in a spaced array, the system can localize the source of speech (who is talking) and filter out noise (beam steering). External microphones can also be included to capture sounds from outside the vehicle.
The processor 24 is programmed to execute computer instructions stored within memory 28 to effect various processor functions that will now be described.
In the illustrated embodiment, the computer memory data store 28 may include a buffer of suitable size to hold camera array motion picture data for a predetermined interval of time suitable to store events happening external to the vehicle for a brief interval (e.g., 10 seconds) prior to a speech utterance by the driver. In this way, phrases spoken by the driver can be disambiguated based on events the driver may have seen immediately prior to his or her utterance. Of course, a larger buffer can be provided, if desired, to store longer intervals of external event information.
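Such a buffer may be realized, for example, as a simple ring buffer. The following Python sketch assumes, purely for illustration, a 30 frame-per-second camera and the 10-second window mentioned above:

    from collections import deque

    FPS = 30
    SECONDS = 10

    # Once full, the deque discards the oldest frame automatically, so it
    # always holds roughly the last 10 seconds of camera array output.
    frame_buffer = deque(maxlen=FPS * SECONDS)

    def on_new_frame(timestamp, frame):
        frame_buffer.append((timestamp, frame))

    def frames_before(utterance_time, window=10.0):
        """Return buffered frames within `window` seconds before an utterance."""
        return [(t, f) for (t, f) in frame_buffer
                if utterance_time - window <= t <= utterance_time]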
The correlation processor 42, in effect, links or correlates the driver activity data with the visual data and with the vehicle location data. This allows the disambiguation processor 46 to correlate the direction the driver is looking, or gestures the driver has made, with visible objects external to the vehicle and with the vehicle's current location. In this way, when the driver issues an utterance such as, "What is the name of that building?", the disambiguation processor can ascertain the meaning of "that building" by reference to the direction the driver was looking when making the utterance or from gestural cues received from the gesture sensors 20.
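One non-limiting way to perform this correlation is to compare the driver's looking direction against the bearing of each object the external cameras detected at the utterance time. In the Python sketch below, the 15° angular threshold and the object representation are illustrative assumptions:

    import math

    def bearing_deg(x, y):
        """Bearing of an object in vehicle coordinates (0 = straight ahead)."""
        return math.degrees(math.atan2(x, y))

    def resolve_reference(gaze_azimuth_deg, detected_objects, max_error_deg=15.0):
        """Pick the object whose bearing best matches the gaze direction.

        detected_objects: list of (label, x, y) in vehicle-centered meters.
        Returns the best-matching label, or None if nothing is close enough.
        """
        best, best_err = None, max_error_deg
        for label, x, y in detected_objects:
            # Signed smallest angular difference, wrapped to [-180, 180).
            err = abs((bearing_deg(x, y) - gaze_azimuth_deg + 180) % 360 - 180)
            if err < best_err:
                best, best_err = label, err
        return best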
Like the correlation processor 42, the disambiguation processor 46 may be implemented by programming processor 24.
The disambiguation processor 46 works in conjunction with the speech recognition system to assist in extracting meaning from the driver's utterances. Spoken utterances are picked up by microphone 22 and processed by speech processor 50. The speech processor works in conjunction with a dialogue processor 52 to parse phrases uttered by the driver and ascertain their meaning. In a presently preferred embodiment, the speech processor 50 operates using a set of trained hidden Markov models (HMMs), such as a set of speaker-independent models. The dialogue processor embeds stored knowledge of grammar rules and operates to map uttered phrases onto a set of predefined query templates. The templates are configured to match the sentence structure of the uttered phrase to one of a set of different query types, where information implied by the driver's looking direction or gestural activity is represented as unknowns to be filled in by the disambiguation processor. Speaker identification from the optical sensor can be used to increase the accuracy of the speech recognition and to load appropriate profiles in terms of preferences (e.g., search history). If a network connection is available, the system can distribute the speech recognition process to processors outside the vehicle via the network.
For example, one dialogue template seeks to ascertain the name of something: "What is the name of that ______?"
The dialogue processor would parse this phrase and recognize that the placeholder “that ______” corresponds to information that may be potentially supplied by the disambiguation processor.
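For illustration only, such a template may be represented in code as a pattern with a named slot that the disambiguation processor later fills. The template set and regular expressions below are assumptions, not the disclosure's actual grammar:

    import re

    # Each query type pairs a sentence pattern with the unknown it leaves open.
    QUERY_TEMPLATES = [
        ("name_query",    re.compile(r"what is the name of that (?P<thing>\w+)", re.I)),
        ("signage_query", re.compile(r"what did that (?P<thing>\w+) say", re.I)),
    ]

    def match_template(utterance):
        """Map an utterance onto a query type, leaving the referent unknown."""
        for query_type, pattern in QUERY_TEMPLATES:
            m = pattern.search(utterance)
            if m:
                # "referent" is the unknown supplied later by disambiguation.
                return {"type": query_type,
                        "category": m.group("thing"),
                        "referent": None}
        return None

    # Example: match_template("What is the name of that restaurant?")
    # -> {'type': 'name_query', 'category': 'restaurant', 'referent': None}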
In a more sophisticated version of the preferred embodiment, the dialogue processor will first attempt to ascertain the meaning of the spoken phrase based on context information (location-based data, vehicle sensor data) and, where possible, on previous utterances. As it attempts to extract meaning from the uttered phrase, the dialogue processor generates a likelihood score associated with each different possible meaning that it generates. In cases where the likelihood score is below a predetermined threshold, the dialogue processor can use information from the disambiguation processor to assist in selecting the most likely candidate, as sketched below. A context-based grammar may also be used, keyed to the vehicle's location and anticipating what the user might say, combined with query history, to boost query accuracy.
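The fallback behavior can be summarized in a few lines of illustrative Python; the 0.7 threshold and the interfaces shown are assumptions:

    LIKELIHOOD_THRESHOLD = 0.7  # illustrative value only

    def interpret(candidates, disambiguator):
        """candidates: list of (meaning, likelihood) pairs from the dialogue
        processor. Falls back to gaze/gesture evidence when confidence is low."""
        best_meaning, best_score = max(candidates, key=lambda c: c[1])
        if best_score >= LIKELIHOOD_THRESHOLD:
            return best_meaning
        # Low confidence: let the disambiguation processor pick among the
        # candidates using time-correlated driver activity and visual data.
        return disambiguator.select(candidates)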
Once the speech recognizer of the dialogue processor has ascertained the meaning of the uttered phrase (e.g., expressed in the SPARQL format), the dialogue processor uses semantic and ontology analysis to match it to a set of predefined available services that can service the query. The dialogue processor thus issues a command to the query generation processor 54, which in turn constructs a computer query suitable for accessing information stored in the local database 48 or obtained from online databases or online services 37. In addition, the query generation processor 54 can also be configured to issue control commands to various electronic components located within the vehicle.
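As a purely illustrative example, once "that restaurant" has been resolved to a specific point of interest, the query generation processor 54 might emit a SPARQL query along the following lines; the ex: vocabulary and identifiers are placeholders, not part of the disclosure:

    def build_name_query(poi_id):
        """Construct a SPARQL query for the name of a resolved point of
        interest. The graph vocabulary (ex:) is a placeholder."""
        return f"""
        PREFIX ex: <http://example.org/poi#>
        SELECT ?name WHERE {{
            ex:{poi_id} ex:name ?name .
        }}
        """

    # Example: build_name_query("restaurant_42") yields a query asking the
    # database for the name associated with that point of interest.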
The driver sensor system responsible for determining what direction the driver is looking can operate upon facial information, head location and orientation information, eye movement information, and combinations thereof. In one embodiment, the 3D camera sensor 14 identifies a set of feature points on the driver's face.
The identified feature points are connected by line segments, as illustrated, and the relative positions of these feature points and the respective lengths of the interconnecting line segments provide information about the position and orientation of the driver's head and face. For example, if the driver turns his or her head to look in an upwardly left direction, the positions of and distances between the feature points will change as seen from the vantage point of the 3D camera 14. This change is used to calculate a face position and orientation as depicted at 80. The algorithm uses the individual feature points to define a surface having a centroid positioned at the origin of a coordinate system (XYZ), where the surface is positioned to lie within a face coordinate frame (xyz). The angular difference between these respective reference frames is calculated (α, β, γ), and the resulting angular direction serves as the facial pointing direction. Preferably, the reference frame XYZ is referenced to the reference frame used by the GPS system. Thus, for illustration purposes, the north vector N has been illustrated to represent the GPS reference frame. In an alternate embodiment, Active Appearance Model (AAM) algorithms can also be used to track the head orientation.
As the driver moves his or her head to look in different directions, the plane defined by the identified feature points will change in orientation and the resulting change, when compared with the reference frame XYZ, serves as a metric indicating the driver's looking direction.
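One common way to compute such a metric, offered here only as a sketch and not as the disclosure's exact algorithm, is to fit a plane to the 3D feature points and compare its normal with the fixed reference frame XYZ:

    import numpy as np

    def face_normal(points):
        """Estimate the facial pointing direction from 3D feature points.

        points: (N, 3) array of feature points from the 3D camera, expressed
        in the GPS-aligned reference frame XYZ. Returns the unit normal of
        the least-squares plane through the points."""
        pts = np.asarray(points, dtype=float)
        centroid = pts.mean(axis=0)  # origin of the face frame xyz
        # The right singular vector with the smallest singular value is the
        # normal of the best-fit plane through the centered points.
        _, _, vt = np.linalg.svd(pts - centroid)
        normal = vt[-1]
        return normal / np.linalg.norm(normal)

    def yaw_pitch_deg(normal):
        """Express the pointing direction as yaw and pitch angles, degrees."""
        yaw = np.degrees(np.arctan2(normal[0], normal[1]))
        pitch = np.degrees(np.arcsin(np.clip(normal[2], -1.0, 1.0)))
        return yaw, pitch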
The operation of the natural query processing apparatus and its method of use will now be described with reference to the flow chart. Speech recognition is performed upon the driver's utterance at step 100, and gesture and posture analysis of the driver is performed at step 102 using the driver sensor system.
Meanwhile, environment analysis is performed at 104, using the camera array 36. As with the driver sensor system data, the camera array data are also time-stamped and stored under mediation by the correlation processor 42.
Next, at 106, the user's speech input is interpreted based on the recognized speech from step 100 and further based on the gesture and posture analysis and environment analysis from steps 102 and 104. Once the meaning of the user's utterance has been determined, the query processing apparatus connects to an online database or service, or to a local database, to find information based on the user's query or command. Then, in step 110, the information is presented to the user or used directly to control electronic components within the vehicle. User information can be supplied visually on the screen of the navigation system 18 or audibly through the infotainment system 30.
FIGS. 6a and 6b provide additional detail on how the dialogue processor performs its function, utilizing services of the disambiguation processor 46.
By way of explanation, when referring to the flow charts of FIGS. 6a and 6b, the illustrated steps are performed by the dialogue processor 52, utilizing services of the disambiguation processor 46, as follows.
Step 222 may also be reached by an alternate path, as illustrated in the flow chart.
From step 224, if the user did not look at something outside the vehicle (other than the road), the dialogue processor 52 proceeds along an alternate branch of the flow chart.
Returning focus now to step 220, if the user did not look at something outside the vehicle (other than the road), the system at step 240 looks for points of interest near locations the user pointed at or looked at in the past, using online database access if necessary. Then, at step 242, environmental features are extracted from the data captured by the external sensors (e.g., camera array 36). Flow then proceeds to step 244, where the dialogue manager attempts to match the voice input, the gesture (if there was one), and one of the points of interest found with a location-related command. Once this is accomplished, flow branches to step 228, where further processing proceeds as previously described.
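The point-of-interest lookup of step 240 amounts to a radius search around the location pointed at or looked at. An illustrative haversine-based sketch follows; the 200 m radius and the POI format are assumptions:

    import math

    EARTH_RADIUS_M = 6_371_000

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance between two latitude/longitude points, meters."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = (math.sin(dp / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
        return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    def nearby_pois(lat, lon, pois, radius_m=200.0):
        """pois: list of (name, lat, lon). Returns names within radius_m,
        nearest first."""
        hits = sorted((haversine_m(lat, lon, plat, plon), name)
                      for name, plat, plon in pois)
        return [name for d, name in hits if d <= radius_m]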
While a number of different possible embodiments are envisioned, one exemplary configuration of the external sensors, user tracking sensors, and databases is described below.
The external camera analyzes the surrounding environment and creates a depth map of the immediate vicinity of the car. It is a 360° 3D camera that can be realized using, for instance, stereoscopic or time-of-flight technologies. Its output is used to segment and parse the environment into several categories of visual information (buildings, vehicles, people, road signs, other signage, etc.). The output of the segmentation is stored in a short-term buffer that is mined once the user performs a query. A secondary narrow-angle, high-resolution PTZ camera can be used to capture visual information in higher detail when needed, once the direction of interest has been identified.
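The short-term buffer of segmented output can be mined by category at query time. A minimal sketch follows, with illustrative category names and data layout:

    from collections import deque

    class SegmentBuffer:
        """Short-term store of categorized detections from the 360° camera."""

        def __init__(self, max_items=1000):
            self.items = deque(maxlen=max_items)

        def add(self, timestamp, category, bearing_deg, descriptor):
            # category: e.g. "building", "vehicle", "person", "road_sign"
            self.items.append((timestamp, category, bearing_deg, descriptor))

        def query(self, category, since):
            """All detections of the given category seen after time `since`."""
            return [it for it in self.items
                    if it[1] == category and it[0] >= since]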
The external microphone is composed of several sensing units to perform audio source localization and source separation/beam steering. The raw sound input of the different microphone units is recorded in a short-term buffer that is then processed depending on the query issued by the driver, allowing different processing effects to be achieved at query time. For instance, if the driver asks, "What was that sound?", he or she is most likely interested in sounds emanating from the car and correlated with vibration patterns coming from the car. If he or she asks, "What was that lady shouting?", the system will first have to identify a high-pitch, high-volume human voice in the recordings and then process the various inputs to separate that voice from other sources.
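Locating a high-pitch, high-volume voice in the buffered audio can be approximated with per-frame energy and pitch estimates. The numpy sketch below is a deliberate simplification of real source-separation processing; its thresholds and autocorrelation pitch estimate are illustrative assumptions:

    import numpy as np

    def frame_rms(frame):
        """Root-mean-square energy of one audio frame."""
        return float(np.sqrt(np.mean(frame ** 2)))

    def frame_pitch_hz(frame, sample_rate):
        """Crude pitch estimate from the autocorrelation peak."""
        frame = frame - frame.mean()
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        # Search the lag range corresponding to roughly 70-500 Hz voices.
        lo, hi = sample_rate // 500, sample_rate // 70
        lag = lo + int(np.argmax(corr[lo:hi]))
        return sample_rate / lag

    def find_shouting(frames, sample_rate, min_rms=0.1, min_pitch_hz=200.0):
        """Indices of buffered frames that look like loud, high-pitched voice."""
        return [i for i, f in enumerate(frames)
                if frame_rms(f) >= min_rms
                and frame_pitch_hz(f, sample_rate) >= min_pitch_hz]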
GPS: Localization of the vehicle is done using a standard GPS unit in coordination with other sensors, such as a compass, accelerometer, gyroscope, altimeter, or a full-fledged IMU, in order to increase localization accuracy. RF triangulation can also be used as an additional source of information.
User tracking sensors (shown collectively at 252) may comprise video cameras, depth cameras, and microphones. Video cameras: Depth cameras can be used to track the user's head position and record its six degrees of freedom, as well as to locate the driver's arms when he or she is pointing in a particular direction outside the cockpit interior. IR cameras can be used to track the user's gaze and identify where the user is looking. Visible light cameras can be used to identify the user, perform lip/audio synchronization to identify who is speaking, and infer the mood of the user. Visible light cameras are also useful for head position computation and tracking, and for gesture analysis.
Microphone arrays are used to capture the user's voice and identify who is speaking by creating several beams steered at the location of each of the occupants (the position of the mouth can be calculated from the information provided by the depth camera and the visible light camera).
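Beam steering toward a computed mouth position can be sketched as classic delay-and-sum processing over the array. The geometry below and the 343 m/s speed of sound are the only physics assumed; this is not presented as the disclosure's specific implementation:

    import numpy as np

    SPEED_OF_SOUND = 343.0  # meters per second

    def delay_and_sum(signals, sample_rate, mic_positions, source_position):
        """Steer a microphone array toward a known source position.

        signals: (n_mics, n_samples) array of synchronized recordings.
        mic_positions: (n_mics, 3) coordinates in meters; source_position: (3,).
        """
        signals = np.asarray(signals, dtype=float)
        dists = np.linalg.norm(np.asarray(mic_positions)
                               - np.asarray(source_position), axis=1)
        # Advance each channel so wavefronts from the source line up,
        # then average across channels to reinforce that source.
        delays = (dists - dists.min()) / SPEED_OF_SOUND
        shifts = np.round(delays * sample_rate).astype(int)
        n = signals.shape[1] - int(shifts.max())
        aligned = np.stack([sig[s:s + n] for sig, s in zip(signals, shifts)])
        return aligned.mean(axis=0)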
The system is able to access both local databases (shown collectively at 254) and online databases and services (shown collectively at 256). The local databases may be stored within the memory 28.
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.