The automation of information-based phone calls, such as directory assistance calls, may substantially reduce operator costs for the provider. However, users can become frustrated with automated phone calls, reducing customer satisfaction and repeat business.
A method of storing a characteristic of an utterance to be received by a speech recognition engine, receiving the utterance from a user in response to a prompt of an automated conversation, analyzing the received utterance to determine whether it conforms to the characteristic, and indicating to the user that the utterance conformed to the characteristic.
A speech engine including a storage module to store a characteristic of an utterance to be received from a user, a receiving module to receive the utterance from the user in response to a prompt, an analyzing module to analyze the received utterance to determine whether it conforms to the characteristic, and an indication module to indicate to the user that the utterance conformed to the characteristic.
A system comprising a memory to store a set of instructions and a processor to execute the set of instructions, the set of instructions being operable to store a characteristic of an utterance to be received by a speech recognition engine, receive the utterance from a user in response to a prompt of an automated conversation, analyze the received utterance to determine whether it conforms to the characteristic, and indicate to the user that the utterance conformed to the characteristic.
The present invention may be further understood with reference to the following description and the appended drawings, wherein like elements are provided with the same reference numerals. The present invention is described with reference to an automated directory assistance phone call. However, those of skill in the art will understand that the present invention may be applied to any type of automated conversation. These automated conversations are not limited to phone calls, but may be carried out on any system which receives voice responses to prompts from the system. Furthermore, the terms “speech”, “utterance” and “words” are used with reference to a user's response in this description. Each of these terms refers to the sounds made by the user in responding to prompts in an automated conversation.
An automated conversation system usually includes a series of prompts (e.g., voice prompts) to which a user will respond by providing a speech input or utterances. The system will then analyze the utterances to determine what the user has said. If the automated conversation system is an automated phone call, the series of prompts may be referred to as the call flow, e.g., prompt 1—response 1—prompt 2—response 2, etc.
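The prompt–response call flow described above can be sketched as an ordered sequence of prompts whose responses are collected in turn. The helper function and the canned responses below are illustrative assumptions, not part of the described system.

```python
# Minimal sketch of a call flow: play each prompt in order and record the
# user's response before moving to the next prompt.
def run_call_flow(prompts, get_response):
    """Return the (prompt, response) pairs of one automated conversation."""
    pairs = []
    for prompt in prompts:
        pairs.append((prompt, get_response(prompt)))
    return pairs

# Example: a two-prompt directory-assistance flow with canned responses.
canned = {"What city and state?": "Brooklyn, New York",
          "What listing?": "Joe's Pizza"}
flow = run_call_flow(["What city and state?", "What listing?"], canned.get)
```

In a real system, `get_response` would block on the caller's speech and return the recognizer's output rather than a canned string.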
The pre-processed speech signal may then be routed to a recognition processing module 30 and an attribute processing module 40. The recognition processing module 30 will process the speech signal to determine the content of the speech signal, i.e., what the person said. For example, the recognition processing module 30 may determine that the person stated “yes” in response to a prompt. Readers interested in how the recognition processing module 30 recognizes human speech are referred to “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition” by Daniel Jurafsky and James H. Martin. The output of the recognition processing module 30 may then be sent to other system modules for further processing. This further processing will be described in greater detail below.
The pre-processed speech signal may also be routed to the attribute processing module 40 to determine attributes of the speech signal. The attributes of the speech signal may be considered to be parameters or characteristics which are unrelated to the grammatical meaning of the speech. As described above, the recognition processing module 30 will determine the grammatical meaning (i.e., what the person said). The attribute processing module 40 may determine attributes of the speech signal itself, for example, the length of the signal, the amount of noise in the signal, the amplitude of the signal, etc. The output of the attribute processing module 40 may be the attributes associated with the input speech signal which may be sent to other system modules for further processing. This further processing will be described in greater detail below.
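Attribute extraction of the kind performed by the attribute processing module 40 can be sketched over a list of amplitude samples. The attribute names mirror those in the text (signal length, amplitude); the sample rate, formulas, and function name are assumptions for illustration.

```python
import math

def extract_attributes(samples, sample_rate=8000):
    """Compute content-independent attributes of a sampled speech signal."""
    duration = len(samples) / sample_rate        # length of the signal in seconds
    peak = max(abs(s) for s in samples)          # peak amplitude
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # A fuller implementation might also estimate noise, e.g., via the
    # RMS-to-peak ratio or a spectral measure.
    return {"duration": duration, "peak": peak, "rms": rms}

attrs = extract_attributes([0.0, 0.5, -0.5, 0.25] * 2000)
```

These attributes say nothing about what was spoken; they describe the signal itself, which is exactly the division of labor between modules 30 and 40 described above.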
The ASR engine 10, as part of an overall system for automated conversations, is used to recognize the speech input by a user of the system. If the speech is completely recognized, the system will generally progress the user through a series of prompts to arrive at the desired result for the user, e.g., the system may output a telephone listing desired by the user. However, there are many instances where a user inputs speech which is an inappropriate response to the prompt provided by the system or the input speech signal experiences some type of problem (e.g., noise). The exemplary embodiments of the present invention are directed at systems and methods for identifying problems with the user's response and allowing the system to aid the user in correcting the response.
The outputs of the recognition processing module 30 and the attribute processing module 40 may be used to identify the type of inappropriate response and allow the system to provide the user with a new or additional prompt that allows the user to correct the response. For example, the attribute processing module 40 may determine that the speech signal has a certain duration. The attribute processing module 40 (or a further processing module) may have information that the duration of the response is very long compared to the expected duration of the response (e.g., the prompt requested a yes/no answer and the duration of the response was significantly longer than an expected duration for either yes or no). In such a case, the system may provide the user with a new prompt to correct the problem, e.g., a prompt stating “please respond by stating yes or no only.” Other attributes of the speech signal generated by the attribute processing module 40 may be used in a similar manner. Each attribute may be associated with one or more categories of problems. The category of problem may then correspond to a particular corrective action that may be taken by the system.
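The duration comparison in the yes/no example above can be sketched as follows. The threshold factor and the re-prompt wording (taken from the example in the text) are assumptions; a real system would tune these per prompt.

```python
def check_duration(response_duration, expected_duration, factor=3.0):
    """Flag a response much longer than expected and return a corrective
    re-prompt, or None if the duration looks appropriate."""
    if response_duration > expected_duration * factor:
        return "please respond by stating yes or no only."
    return None

# A yes/no answer is expected to take roughly 0.5 s; a 4 s response is flagged.
reprompt = check_duration(4.0, 0.5)
```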
The output of the recognition processing module 30 may be used in a similar manner to identify problems with a response. In such a circumstance, the recognition processing module 30 may determine the content of the speech, but the content does not match the expected response. As will be described in greater detail below, the recognition processing module 30 may include a grammar which includes words or other utterances which may be recognized by the recognition processing module 30, but the recognized response is inappropriate. In the same manner as described above for the speech attributes, the recognized response may be output by the recognition processing module 30 and categorized as a particular type of problem. The system may then take a corrective action corresponding to the categorized problem.
After the content and attribute information is extracted from the speech signal, this information is used in step 370 to determine whether there are any problems with the response. Some examples of problems with responses were described above and additional examples will be provided below. If there are no problems based on the content or attributes of the speech signal, the method will be complete because the system will have recognized the speech as an appropriate response to the prompt and will take the appropriate action based on the recognized response. However, if there is a problem with the response based on the content or attribute of the speech signal, the method will continue to step 375 where the problem will be categorized. There are numerous categories of problems which may be identified based on the content (e.g., too many recognized but low-priority utterances, etc.) and attributes (e.g., response too long, response too short, too much noise in response, amplitude of signal too low, etc.).
After the problem has been categorized in step 375, the method continues to step 380 where the system will select the corrective action which corresponds to the category of problem identified in the speech signal. Again, there are numerous types of corrective actions that may correspond to the identified problem category, e.g., re-prompt with previous prompt, re-prompt with new prompt, change selected grammar for the recognition processing module 30, attempt different type of noise cancellation, raise volume of incoming signal, etc. The system may implement the selected corrective action and the method is then complete. Those of skill in the art will understand that the method 350 may be carried out for each response (or speech signal) received by the system.
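Steps 375 and 380 amount to a lookup from a problem category to a corrective action. The table below paraphrases the examples given in the text; the category names, the table itself, and the default fallback are illustrative assumptions.

```python
# Hypothetical mapping from categorized problems to corrective actions.
CORRECTIVE_ACTIONS = {
    "response_too_long":  "re-prompt with new prompt",
    "response_too_short": "re-prompt with previous prompt",
    "too_much_noise":     "attempt different type of noise cancellation",
    "amplitude_too_low":  "raise volume of incoming signal",
    "faux_content":       "change selected grammar",
}

def select_corrective_action(category):
    # Fall back to repeating the previous prompt for an unknown category.
    return CORRECTIVE_ACTIONS.get(category, "re-prompt with previous prompt")
```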
In the method 350 of
In another example, the attribute processing module 40 may determine that the speech signal includes an excessive amount of noise. Once again, the ASR engine 10 may use this attribute information to determine that the speech signal should not be sent to the recognition processing module 30 because it would be unlikely that the content would be determined from a noisy signal. These examples also point out that the attribute processing module 40 may receive the speech signal without the benefit of the pre-processing of the sampling module 20 or may have a separate pre-processing module from the pre-processing module used for the recognition processing module 30.
In an alternative example, the attribute processing module 40 may be placed after the recognition processing module 30 so that attributes of the speech signal are determined after the content is determined. This arrangement may be advantageous because, if the recognition processing module 30 is able to determine the content of the speech signal, and that content is an appropriate response, the ASR engine 10 may determine that the processing of the speech signal by the attribute processing module 40 is not necessary. In addition, if the ASR engine 10 determines the problem with the response using only the content identified by the recognition processing module 30, the ASR engine 10 may also determine that the processing of the speech signal by the attribute processing module 40 is not necessary in this situation. Thus, the time and processing requirements for the attribute processing module 40 may be saved.
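The alternative ordering described above, where attribute processing runs only when recognition alone cannot resolve the response, can be sketched with hypothetical stand-ins for modules 30 and 40:

```python
def process(signal, recognize, is_appropriate, extract_attributes):
    """Run recognition first; invoke attribute extraction only when the
    content is missing or inappropriate, saving the attribute module's work."""
    content = recognize(signal)
    if content is not None and is_appropriate(content):
        return content, None            # attribute processing skipped
    return content, extract_attributes(signal)

# A clean "yes" skips attribute processing; an unrecognized signal does not.
ok = process("sig", lambda s: "yes", lambda c: True, lambda s: {"duration": 1})
bad = process("sig", lambda s: None, lambda c: False, lambda s: {"duration": 1})
```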
The ASR engine 10 is shown as including modules 20, 30 and 40. Those of skill in the art will understand that each of these modules may include additional functionality to that described herein and the functionality described herein may be included in more or less modules of an actual implementation of an ASR engine. Furthermore, the ASR engine 10 may include additional functionality, e.g., the entire system described above may be included in the ASR engine 10.
The following provides an exemplary implementation of the exemplary system and method for providing corrective actions in an automated conversation. The exemplary implementation is a directory assistance (“DA”) service providing phone listings to users. The DA service includes an ASR engine implementing the functionality of the exemplary recognition processing module 30 and attribute processing module 40. The DA service also includes a database or a series of databases that include the listing information. These databases may be accessed based on information provided by the user in order to obtain the listing information requested by the user.
The general operation of the DA service will be described with reference to the exemplary conversation 50 illustrated by
On line 56 of the conversation 50, the user responds to the voice prompt of line 54. In this example, the user says “Brooklyn, New York” and this speech is presented to the ASR engine of the DA service (e.g., ASR engine 300). As described above, the ASR engine may determine the content of the speech, i.e., the speech signal corresponds to the content of Brooklyn, New York. The DA service then generates a further voice prompt in line 58 based on the information provided by the city/state response in line 56. The voice prompt in line 58 prompts “What listing?” On line 60 of the conversation 50, the user responds to the voice prompt of line 58. In this example, the user says “Joe's Pizza” and the ASR engine recognizes the speech as corresponding to a listing for Joe's Pizza. The ASR engine provides this content information to the DA service which searches for the desired listing. For example, the automated call service may access a database associated with Brooklyn, New York and search for the listing Joe's Pizza. The DA service then generates a listing such as that shown in line 62 of the conversation 50.
The conversation 50 of
In an automated conversation system, the expected responses to the prompts may form a grammar for the ASR engine which is defined as a formal definition of the syntactic structure of a response and the actual words and/or sounds which are expected to make up the response. The grammar may be used by the ASR engine to parse a response to determine the content of the user's response.
The grammar 70 shows a first exemplary entry 72 which has an expected grammar for a response of Hoboken, New Jersey to the city/state prompt. Thus, if a user were to respond to the city/state prompt with a response of Hoboken, New Jersey, the ASR engine would recognize this response and indicate to the DA service that the content of the user's response corresponded to a desired city/state of Hoboken, New Jersey. The DA service may then provide the next prompt of the automated conversation to the user, e.g., the listing prompt. The grammar 70 shows additional entries 74-80 having city/state responses corresponding to Philadelphia, Pennsylvania, Philly, Pennsylvania, Mountain View, California, and Brooklyn, New York, respectively. The entries 74 and 76 show that a single city, e.g., Philadelphia, may have multiple grammar entries because users may use slang or other alternate names for the same city.
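The entries of grammar 70 can be sketched as a table mapping accepted response strings to canonical city/state pairs, including the alternate "Philly" entry. The normalization and data structure are assumptions; a real grammar would also define syntactic structure, not just literal strings.

```python
# Hypothetical city/state grammar keyed on normalized response text.
CITY_STATE_GRAMMAR = {
    "hoboken, new jersey":         ("Hoboken", "New Jersey"),
    "philadelphia, pennsylvania":  ("Philadelphia", "Pennsylvania"),
    "philly, pennsylvania":        ("Philadelphia", "Pennsylvania"),  # slang entry
    "mountain view, california":   ("Mountain View", "California"),
    "brooklyn, new york":          ("Brooklyn", "New York"),
}

def parse_city_state(response):
    """Return the canonical (city, state) for a recognized response, or None."""
    return CITY_STATE_GRAMMAR.get(response.strip().lower())
```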
However, users do not always provide responses in the expected manner or in the exact syntax which the ASR engine is expecting.
One exemplary manner of handling this non-standard syntax is by also providing an additional grammar which includes expected faux responses to the prompt.
There may be other responses which are provided by the user which do not include a valid city/state response. For example, the user may simply respond to the city/state prompt by asking “what?” The faux grammar 100 may also be used to handle this situation. The entry 107 of faux grammar 100 is “what.” The ASR engine may recognize this speech by the user and inform the DA service that the content of the user's response was the question “what?” in response to the city/state prompt. The DA service may be programmed to replay the city/state prompt when the user asks this question. Thus, the faux grammar 100 may be used in a variety of manners to help determine the action that is taken by the DA service based on the response of the user.
The faux grammar 100 may be an example of the ASR engine identifying the content of the speech signal, but not identifying the information that was meant to be conveyed by the speech. For example, the recognition processing module of the ASR engine may identify a plurality of faux responses in the speech signal based on the faux grammar 100. The recognition processing module may output the content or the number of instances of faux content in the response. As described above, the ASR engine may then identify a problem with the response based on the faux content or the number of instances of faux content in the response. The DA service may then take the appropriate corrective action based on the identified problem. An example will be provided below.
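Counting faux-grammar matches in a recognized response, as described above, can be sketched as follows. The faux entries beyond "what" are illustrative assumptions.

```python
# Hypothetical faux grammar: entries the recognizer can identify but which
# carry no valid city/state information.
FAUX_GRAMMAR = {"what", "um", "uh", "hmm"}

def count_faux(recognized_words):
    """Return the number of faux-grammar matches in the recognized words."""
    return sum(1 for w in recognized_words if w.lower() in FAUX_GRAMMAR)

# A bare "what" matches once, prompting the DA service to replay the prompt.
n = count_faux(["what"])
```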
As described above, an attribute of the speech may also be used to determine the problem with the response. An example of an attribute may be the duration of the response. Thus, the attribute processing module of the ASR engine may determine the duration of each speech signal and categorize the speech signals based on these durations. For example, it may be determined that when a short response is provided, the user may have the correct syntax for the response, but did not provide enough information in the response. The re-prompt may be short and may simply be a repeat of the initial prompt.
Referring back to
In the above examples, it can be seen that the utterance-attribute-based re-prompting may contribute to the overall satisfaction that the customer feels when using the DA service. In the example of
Those of skill in the art will also understand that other attributes or a combination of attributes and content may be used to determine a proper re-prompt. For example, a response may include speech where the content is partially recognized, e.g., identifiable and unidentifiable utterances. The ASR engine may be configured to determine the number of unintelligible utterances compared to the identifiable utterances and provide re-prompts based on this comparison. Any attribute which can be determined from the utterance of the user may be used as a factor in determining the type of re-prompt that is presented to the user.
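The comparison of unintelligible to identifiable utterances described above can be sketched as a simple ratio test. The 0.5 threshold and the re-prompt wording are assumptions.

```python
def reprompt_on_unintelligible(identifiable, unintelligible, threshold=0.5):
    """Return a re-prompt when too large a fraction of the response was
    unintelligible, or None if enough was recognized to proceed."""
    total = identifiable + unintelligible
    if total == 0:
        return "re-prompt: no speech detected"
    if unintelligible / total > threshold:
        return "re-prompt: please repeat your request"
    return None
```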
In addition, different attribute/content information may be combined in different manners to form the basis for various re-prompts. Thus, a short response that is completely unintelligible may be treated differently than a short response that has some discernable grammar. Furthermore, the number of different types of re-prompts is not limited. In the example provided above of the duration based re-prompt, the responses were characterized as being short or long duration responses. In another example, the responses may be characterized as short, medium or long duration responses. Each of these categories may have a different corresponding re-prompt.
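The short/medium/long characterization mentioned above can be sketched by bucketing a response by duration and keying a re-prompt off the bucket. The boundaries (in seconds) and the re-prompt texts are illustrative assumptions.

```python
def duration_category(seconds, short_max=1.0, long_min=4.0):
    """Bucket a response duration into short, medium, or long."""
    if seconds < short_max:
        return "short"
    if seconds > long_min:
        return "long"
    return "medium"

# Hypothetical re-prompts, one per duration category.
REPROMPTS = {
    "short":  "What listing?",  # simply repeat the initial prompt
    "medium": "Please say just the listing name.",
    "long":   "Please respond with the listing name only, for example Joe's Pizza.",
}
```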
In the examples provided, the smart re-prompt was described with reference to the listing state. However, the smart re-prompt may also be implemented in the locality state. Where the automated conversation is not a DA service, the exemplary embodiment of the smart re-prompt may be implemented in any of the states of the automated conversation. For example, if the automated conversation is a phone banking related application, there may be a transaction type state, e.g., the automated system prompts the user as to the type of transaction that the user desires to perform, such as balance requests, money transfers, etc. The smart re-prompt may be implemented in this state or any other state of the banking application. The automated conversation may not be a phone call related conversation. For example, a retail store may have a device that provides automated conversations for its customers related to, for example, product checks, store directories, returns, etc. The smart re-prompt may be implemented in any prompting state of this type of device.
The present invention has been described with reference to the above exemplary embodiments. One skilled in the art would understand that the present invention may also be successfully implemented if modified. Accordingly, various modifications and changes may be made to the embodiments without departing from the broadest spirit and scope of the present invention as set forth in the claims that follow. The specification and drawings, accordingly, should be regarded in an illustrative rather than restrictive sense.
The present application claims priority to U.S. Provisional Patent Application No. 60/665,710 entitled “System and Method for Handling a Voice Prompted Conversation” filed on Mar. 28, 2005, the specification of which is expressly incorporated, in its entirety, herein.