The present invention is in the field of speech dialog systems, and more specifically in the field of synthesizing confirmation phrases in response to input phrases spoken by a user.
Current dialog systems often use speech as both an input and an output modality. For example, a speech recognition function may be used to convert speech input to text, and a text-to-speech (TTS) function may then use the text generated by the conversion as input to synthesize speech output. In many dialog systems, speech generated using TTS provides audio feedback to a user to solicit the user's confirmation and thereby verify the result of the system's recognition analysis of the speech input. For example, in handheld communication devices, a user can use the speech input modality of a dialog system incorporated within the device to dial a number based on a spoken name. The reliability of this application is improved when TTS is used to synthesize a response phrase giving the user the opportunity to confirm that the system correctly analyzed the received speech input. Conventional response generation functions that employ TTS as described above, however, require significant time and resources to develop, especially when multiple languages are involved. Moreover, dialog systems implemented with TTS consume significant amounts of the limited memory available within a handheld communication device. The foregoing factors can create a major impediment to the world-wide deployment of multi-lingual devices using such dialog systems.
One alternative is to synthesize confirmation responses by reconstructing speech directly from features derived from the speech input, or from a most likely set of acoustic states determined by the recognition process. The most likely set of acoustic states is determined during speech recognition through a comparison of the input speech with a set of trained speech models. This alternative can significantly reduce the cost and resource issues noted above, but providing confirmation speech of acceptable quality in this manner presents significant challenges.
The present invention is illustrated by way of example and not limitation in the accompanying figures, in which like references indicate similar elements, and in which:
Before describing in detail the particular embodiments of speech dialog systems in accordance with the present invention, it should be observed that the embodiments of the present invention reside primarily in combinations of method steps and apparatus components related to speech dialog systems. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
It will also be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein. In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
A “set” as used in this document may mean an empty set. The term “another”, as used herein, is defined as at least a second or more. The terms “including” and/or “having”, as used herein, are defined as comprising. The term “coupled”, as used herein with reference to electro-optical technology, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term “program”, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A “program”, or “computer program”, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, source code, object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
Related U.S. patent application Ser. No. 11/118,670 entitled “Speech Dialog Method and System” discloses embodiments of a speech dialog method and device for performing speech recognition on received speech and for generating a confirmation phrase from the most likely acoustic states derived from the recognition process. This represents an improvement over past techniques that use TTS techniques for generating confirmation phrases in response to input speech.
The perceived quality of a synthesized confirmation phrase generated directly from features extracted from input speech or from the most likely set of acoustic states as determined from a recognition process can vary significantly depending upon the manner in which the speech is mathematically represented (e.g., which features are extracted) and the manner by which it is then synthesized. For example, some features that may be mathematically extracted from input speech are better suited to distinguishing elements of speech in a manner similar to the way the human ear perceives speech. Thus, they tend to be better suited to the speech recognition function of a dialog system than they are to the speech synthesis function. Moreover, these types of extracted features are typically used to train speech models over a large number of speakers so that a recognition function employing the trained models can recognize speech over a broad range of speakers and speaking environments. This renders speech reconstruction from a most likely set of acoustic states, derived from such broadly trained models during the recognition process, even less desirable.
Likewise, certain types of features can be extracted from a received speech signal that are better suited to modeling speech as it is generated by the human vocal tract rather than as it is discerned by the ear. Using vectors consisting of these synthesis-type feature parameters to generate speech tends to produce more natural-sounding speech than does the use of recognition-type feature parameters. On the other hand, synthesis-type feature parameters tend not to be very stable when averaged over a large number of speakers and are therefore less advantageous for use in speech recognition. Thus, it would be desirable to implement a speech dialog method and device that employs vectors of recognition-type feature parameters for performing the recognition function and vectors of synthesis-type feature parameters for generating the appropriate confirmation phrase back to the user, rather than using just one type of feature parameter and thus disadvantaging one process relative to the other.
A speech dialog device and method in accordance with embodiments of the invention can receive, for example, an input phrase that includes both a non-variable segment and an instantiated variable. For the instantiated variable, the recognition process can be used to determine a most likely set of acoustic states, in the form of recognition feature vectors, from a set of trained speech models. This most likely set of recognition feature vectors can then be applied to a map to determine a most likely set of synthesis feature vectors that likewise represent the most likely set of acoustic states determined for the instantiated variable (assuming that the recognition process correctly recognized the input speech). The synthesis feature vectors can then be used to synthesize the variable as part of a generated confirmation phrase. For a non-variable segment that is associated with the instantiated variable of the input phrase, the recognition process can identify the non-variable segment and determine an appropriate response phrase to be generated as part of the confirmation phrase. Response phrases can be pre-stored acoustically in any form suitable for good-quality speech synthesis, including the same synthesis-type feature parameters as those used to represent the instantiated variable for synthesis purposes. In this way, both the recognition and synthesis functions can be optimized for a dialog system rather than compromising one function in favor of the other.
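By way of illustration only, the following is a minimal sketch in Python of the overall flow just described: recognition yields recognition feature vectors for the instantiated variable, a trained map converts them to synthesis feature vectors, and the recognized non-variable segment selects a pre-stored response phrase. The function names (recognize, map_to_synthesis, build_confirmation) and the toy data are hypothetical placeholders, not elements of the embodiments above.

```python
import numpy as np

def recognize(input_features):
    """Stand-in for the recognition function: returns the identified
    non-variable segment and the most likely set of recognition feature
    vectors (acoustic states) for the instantiated variable."""
    non_variable = "dial"                       # e.g. the recognized command
    variable_rec_vectors = input_features       # placeholder for the decoded states
    return non_variable, variable_rec_vectors

def map_to_synthesis(rec_vectors, rec_to_syn_map):
    """Convert recognition feature vectors to synthesis feature vectors
    using a previously trained map (here a trivial placeholder map)."""
    return np.array([rec_to_syn_map(v) for v in rec_vectors])

def build_confirmation(non_variable, syn_vectors, response_store):
    """Pair the pre-stored response phrase selected by the recognized
    non-variable segment with the synthesis vectors of the variable."""
    return response_store[non_variable], syn_vectors

# Toy usage with random "features" and an identity map.
rec_to_syn_map = lambda v: v
response_store = {"dial": "Do you want to call"}
features = np.random.randn(50, 13)              # 50 frames, 13 coefficients each
segment, rec_vecs = recognize(features)
syn_vecs = map_to_synthesis(rec_vecs, rec_to_syn_map)
phrase, variable_states = build_confirmation(segment, syn_vecs, response_store)
print(phrase, variable_states.shape)
```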
As part of a dialog process, it may also be desirable for the dialog method or device to respond to a received speech phrase by synthesizing a response phrase that includes no instantiated variable, such as “Please repeat the name,” for example when the recognition process was unable to find a match between the input speech and the set of trained speech models close enough to meet a metric that ensures reasonable accuracy. A valid user input response to such a synthesized phrase may include only a name, with no non-variable segment such as a command. In another example, the input speech phrase from a user could be “Email the picture to John Doe”. Here, “Email” is a non-variable segment, “picture” is an instantiated variable of the type <email object>, and “John Doe” is an instantiated variable of the type <dialed name>.
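Purely for illustration, the parsed form of this alternate example might be represented by a structure along the following lines; the field names are hypothetical, while the variable types are those given above.

```python
# Hypothetical parsed representation of the example phrase above.
parsed_input = {
    "non_variable_segment": "Email",
    "instantiated_variables": [
        {"type": "<email object>", "value": "picture"},
        {"type": "<dialed name>", "value": "John Doe"},
    ],
}
print(parsed_input["non_variable_segment"],
      [v["type"] for v in parsed_input["instantiated_variables"]])
```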
The following description of some embodiments of the present invention makes reference to FIGS. 1 and 2, where a flow chart for a ‘Train Map and Models’ process 100 (
At step 105 (
At step 110, synthesis feature vectors are also derived from the same training speech uttered by one or more of the training speakers. The synthesis feature vectors can be generated at the same frame rate as the recognition feature vectors so that there is a one-to-one correspondence between the two sets of feature vectors (i.e., recognition and synthesis) for a given training utterance of a given speaker. Thus, for at least one training speaker, the utterances have both a set of recognition feature vectors and a set of synthesis feature vectors, with each feature vector corresponding one-to-one to a member of the other set because both represent the same sample frame of the training speech utterance for that speaker. These synthesis feature vectors, along with their corresponding recognition feature vectors, can be used to train the map. Those of skill in the art will recognize that these synthesis feature vectors can be made up of coefficients that are more suited to speech synthesis than to recognition. An example of such parameters is line spectrum pair (LSP) coefficients, which are compatible with a vocal tract model of speech synthesis such as linear predictive coding (LPC). It will be appreciated that deriving the recognition features (e.g., MFCCs) and the synthesis features (e.g., the LSPs) from the training utterances of just one training speaker may be preferable, because the quality of speech synthesis, unlike that of recognition, is not necessarily improved by averaging the synthesis feature vectors over many speakers.
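As an illustrative sketch only, frame-aligned extraction of recognition-type and synthesis-type features might look as follows in Python, assuming the librosa library, a synthetic stand-in for a training utterance, and a simplified LPC-to-LSF conversion; none of these specific choices is mandated by the embodiments above.

```python
import numpy as np
import librosa

sr, n_fft, hop = 16000, 1024, 256
t = np.arange(sr * 2) / sr
# Synthetic stand-in for one training utterance (a real system would load audio).
y = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.05 * np.random.randn(t.size)

# Recognition-type features: one 13-dimensional MFCC vector per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop, center=False)

def lpc_to_lsf(a):
    """Convert an LPC polynomial [1, a1, ..., ap] to sorted line spectral
    frequencies in (0, pi) via the roots of the sum/difference polynomials."""
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])

# Synthesis-type features: LPC computed per frame at the same hop, then
# converted to LSFs, so frame k of `lsf` corresponds to frame k of `mfcc`.
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
order = 12
lsf = np.stack([lpc_to_lsf(librosa.lpc(frames[:, k], order=order))
                for k in range(frames.shape[1])], axis=1)

print(mfcc.shape, lsf.shape)   # same number of columns (frames) in each
```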
At step 115, a mapping between recognition and synthesis feature vectors is established and trained using the sets of corresponding recognition and synthesis feature vectors as derived in steps 105 and 110. It will be appreciated that there are a number of possible techniques by which this can be accomplished. For example, vector quantization (VQ) can be employed to compress the feature data and to first generate a codebook for the recognition feature vectors using conventional vector quantization techniques. One such technique clusters or partitions the recognition feature vectors into distinct subsets by iteratively determining their membership in one of the clusters or partitions based on minimizing their distance to the centroid of a cluster. Thus, each cluster or partition subset is identified by a mean value (i.e., the centroid) of the cluster. The mean value of each cluster is then associated with an index value in the VQ codebook and represents all of the feature vectors that are members of that cluster. One way to train the map is to search the training database (i.e., the two corresponding sets of feature vectors derived from the same training utterances) for the recognition feature vector that is the closest in distance to the centroid value for each entry in the codebook. The synthesis feature vector that corresponds to that closest recognition feature vector is then stored in the mapping table for that entry.
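A minimal sketch of this VQ-based map training, assuming scikit-learn's KMeans as the conventional vector quantization step and random placeholder arrays in place of real, frame-aligned training features, might look like the following:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
rec_train = rng.standard_normal((5000, 13))   # recognition vectors (e.g. MFCCs)
syn_train = rng.standard_normal((5000, 12))   # frame-aligned synthesis vectors (e.g. LSFs)

# Step 1: build the VQ codebook by clustering the recognition feature vectors;
# each cluster centroid becomes one codebook entry.
n_codes = 256
kmeans = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(rec_train)
codebook = kmeans.cluster_centers_

# Step 2: for each codebook entry, find the training recognition vector closest
# to the centroid and store its corresponding synthesis vector in the map table.
map_table = np.empty((n_codes, syn_train.shape[1]))
for k, centroid in enumerate(codebook):
    nearest = np.argmin(np.linalg.norm(rec_train - centroid, axis=1))
    map_table[k] = syn_train[nearest]
print(codebook.shape, map_table.shape)
```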
As will be seen later, the most likely set of recognition feature vectors determined for an instantiated variable of input speech during the recognition process can be converted to a most likely set of synthesis feature vectors based on this mapping. For each member of the most likely set of recognition feature vectors, the map table is searched for the entry whose centroid value is closest to that member. The synthesis feature vector from the training database that has been mapped to that entry then becomes the corresponding member of the most likely set of synthesis feature vectors used to generate the response phrase.
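A corresponding runtime lookup, again using toy arrays in place of a trained codebook and map table, might be sketched as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 13))     # recognition-space centroids (from training)
map_table = rng.standard_normal((256, 12))    # synthesis vector stored per codebook entry

def convert(rec_vectors, codebook, map_table):
    """For each recognition vector, find the nearest codebook centroid and
    return the synthesis vector mapped to that entry."""
    dists = np.linalg.norm(rec_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return map_table[np.argmin(dists, axis=1)]

most_likely_rec = rng.standard_normal((40, 13))   # e.g. 40 frames of the instantiated variable
most_likely_syn = convert(most_likely_rec, codebook, map_table)
print(most_likely_syn.shape)                      # (40, 12)
```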
Another possible method for training the map involves a more statistical approach where a Gaussian mixture model (GMM) is employed to model the conversion between the most likely set of recognition feature vectors and the most likely set of synthesis feature vectors. In an embodiment, the training recognition feature vectors are not coded as a set of discrete partitions, but as an overlapping set of Gaussian distributions, the mean of each Gaussian distribution being analogous to the cluster mean or the centroid value in the VQ table described above. The probability density of a recognition vector x in a GMM is given by
p(x) = Σ_{i=1}^{m} α_i N(x; μ_i, Σ_i),
where m is the number of Gaussians, α_i ≥ 0 is the weight corresponding to the ith Gaussian with Σ_{i=1}^{m} α_i = 1, and N(·) is a p-variate Gaussian distribution defined as
N(x; μ, Σ) = (2π)^(−p/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ^(−1) (x − μ)),
with μ being the p×1 mean vector and Σ being the p×p covariance matrix.
Thus, when performing a conversion for each member x of the most likely set of recognition feature vectors, this technique does not simply find the mean to which x is closest and take the converted synthesis vector to be the training synthesis feature vector associated with that mean. Rather, the statistical technique computes the joint probability densities p(x, i) of x being associated with each of the Gaussian distributions, forms the conditional probability densities p(x, i)/p(x), and uses them to weight the training synthesis feature vectors corresponding to the GMM means in order to establish the most likely synthesis feature vector.
In one embodiment, then, the most likely synthesis feature vector y converted from the most likely recognition feature vector x is given by the weighted average
y = Σ_{i=1}^{m} [p(x, i) / p(x)] y_i,
where p(x, i) = α_i N(x; μ_i, Σ_i) and the y_i, i = 1, …, m, represent the training synthesis feature vectors corresponding to the mean vectors in the GMM. The training synthesis feature vector y_i corresponding to the GMM mean μ_i can be found by identifying the training recognition feature vector closest to that mean, or the training recognition feature vector with the highest joint probability density p(x, i), and selecting its corresponding training synthesis feature vector. The GMM can be trained from the set of recognition feature vectors extracted from the training speech using the well-known expectation-maximization (EM) algorithm. While this embodiment is somewhat more complex, it provides improved speech synthesis quality because the mapping accounts for the variances of the distributions as well as closeness to the means. It will be appreciated that the statistical model of the conversion may be applied in a number of different ways.
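As an illustrative sketch of this statistical conversion, scikit-learn's GaussianMixture can stand in for the GMM (its predict_proba method returns the conditional weights p(x, i)/p(x) used above); the training arrays below are random placeholders rather than real features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
rec_train = rng.standard_normal((5000, 13))     # recognition vectors (e.g. MFCCs)
syn_train = rng.standard_normal((5000, 12))     # frame-aligned synthesis vectors

# Train the GMM on the recognition feature vectors (EM is run internally).
m = 64
gmm = GaussianMixture(n_components=m, covariance_type="diag", random_state=0)
gmm.fit(rec_train)

# y_i: the synthesis vector whose recognition counterpart is closest to mean mu_i.
closest = np.array([np.argmin(np.linalg.norm(rec_train - mu, axis=1))
                    for mu in gmm.means_])
Y = syn_train[closest]                           # shape (m, 12)

# Conversion: y = sum_i [p(x, i) / p(x)] * y_i; predict_proba returns exactly
# those conditional weights for each input vector x.
most_likely_rec = rng.standard_normal((40, 13))
weights = gmm.predict_proba(most_likely_rec)     # shape (40, m)
most_likely_syn = weights @ Y                    # shape (40, 12)
print(most_likely_syn.shape)
```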
At step 120, speech models are established and then trained using the recognition feature data for the training utterances. As previously mentioned, these models can be HMMs, which work well with the features of speech represented by recognition feature parameters in the form of MFCCs. Techniques for modeling speech using HMMs and recognition feature vectors, such as MFCCs extracted from training utterances, are known to those of skill in the art. It will be appreciated that these models can be trained using the speech utterances of many training speakers.
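For illustration only, HMM training on MFCC-style recognition vectors might be sketched with the hmmlearn package; this toolkit choice and the synthetic utterances are assumptions, as the embodiments do not prescribe any particular implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Synthetic stand-ins for the MFCC sequences of 20 training utterances.
utterances = [rng.standard_normal((int(rng.integers(40, 80)), 13)) for _ in range(20)]

X = np.concatenate(utterances)                  # all frames stacked
lengths = [len(u) for u in utterances]          # frames per utterance

# In practice one model per word or sub-word unit would be trained; a single
# 5-state model stands in for the idea here.
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X, lengths)
print(model.score(utterances[0]))               # log-likelihood of one utterance
```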
At step 205 of a ‘Speech Dialog Process’ 200 (
At step 210 (
At steps 215, 220 (
The most probable set of acoustic states selected by the recognition function for a non-variable segment determines a value 425 (
Thus, in the example shown in
The most likely set of acoustic states determined and output by the recognition function 410 (
The set of synthesis acoustic states for the response phrase “Do you want to call?” 340 in the example of
In the case of the instantiated variable, the most likely set of recognition acoustic states 335 (
In the example illustrated in
In some embodiments, an optional quality assessment function 445 (
In those embodiments in which the optional quality assessment function 445 determines a quality metric of the most likely set of acoustic states, when the quality metric does not meet the criterion, the quality assessment function 445 controls an optional selector 450 to couple a digitized audio signal from an out-of-vocabulary (OOV) response audio function 460 to the speaker function 455 that presents a phrase to a user at step 245 (
The metric used in those embodiments in which a determination is made as to whether to present an OOV phrase may represent a confidence that a correct selection of the most likely set of acoustic states has been made. For example, the metric may be a measure of the distance between the set of acoustic vectors representing an instantiated variable and the selected most likely set of acoustic states.
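A minimal sketch of such a distance-based confidence check follows; the distance measure and the threshold value are illustrative assumptions rather than values taken from the embodiments.

```python
import numpy as np

def confident_enough(input_vectors, most_likely_states, threshold=2.0):
    """Return True if the average frame distance between the input acoustic
    vectors and the selected most likely acoustic states meets the criterion;
    otherwise the device would present an OOV phrase such as
    'Please repeat the name.'"""
    avg_dist = float(np.mean(np.linalg.norm(input_vectors - most_likely_states, axis=1)))
    return avg_dist <= threshold

rng = np.random.default_rng(0)
observed = rng.standard_normal((40, 13))                   # vectors of the instantiated variable
decoded = observed + 0.1 * rng.standard_normal((40, 13))   # a close match to the decoded states
print(confident_enough(observed, decoded))                 # True -> present confirmation phrase
```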
The embodiments of the speech dialog methods 100, 200 and electronic device 400 described herein may be used in a wide variety of electronic apparatus such as, but not limited to, a cellular telephone, a personal entertainment device, a pager, a television cable set top box, an electronic equipment remote control unit, a portable, desktop, or mainframe computer, or electronic test equipment. The embodiments provide the benefits of less development time and fewer processing resources than prior art techniques that involve speech recognition, determination of a text version of the most likely instantiated variable, and text-to-speech synthesis of that variable. These benefits result in part from avoiding the development of text-to-speech software systems for synthesizing the variables in different spoken languages.
It will be appreciated that the speech dialog embodiments described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the speech dialog embodiments described herein. The unique stored programs may be conveyed in a medium such as a floppy disk or a data signal that downloads a file including the unique program instructions. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method to perform accessing of a communication system. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein.
In the foregoing specification, the invention and its benefits and advantages have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Some aspects of the embodiments are described above as being conventional, but it will be appreciated that such aspects may also be provided using apparatus and/or techniques that are not presently known. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims.
The present application is related to U.S. patent application Ser. No. 11/118,670 entitled “Speech Dialog Method and System,” which is incorporated herein in its entirety by this reference.