The present disclosure relates generally to name grammars utilized in speech recognition systems to identify spoken names.
Speech recognition software can be utilized to analyze input in the form of spoken words and phrases and determine what has been said. Existing speech recognition software systems are not designed to recognize any possible utterance but are constrained by a grammar of recognizable word or phonetic patterns in order to provide reasonable response time and accuracy. These grammars are generally context sensitive. For example, an automobile control context might include a limited set of grammar definitions including entries for “start the engine” and “turn on the lights,” whereas an airline application might include context-specific commands such as “what is the departure time of flight 788X?” or “I'd like to upgrade to first-class.”
Grammars are often created utilizing existing text definition descriptions such as Augmented Backus-Naur Form (ABNF), Grammar Syntax Language (GSL), and Speech Recognition Grammar Specification (SRGS). Each of these grammar formats specifies how recognition grammars are defined. A common element among grammar definitions is that entries in the grammar may be assigned weights indicating the likelihood of the entry being spoken, which signals the speech recognition software to give precedence to certain words or phrases when returning a result. Appropriate weights are difficult to determine, and guessing weights does not always improve recognition performance because of gaps in expected behavior or usage between the designer of a system and the user of a system. Effective weights are usually obtained by studying real speech and result data collected from a system in use in its intended context.
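As an illustration of weighted grammar entries, the following minimal sketch renders a set of phrases and weights as a single rule of weighted alternatives in the style of the SRGS ABNF form, where a weight is written between slashes before an alternative. The function name render_weighted_rule, the rule name, the language declaration, and the weight values are assumptions made for illustration and are not taken from the disclosure.

```python
def render_weighted_rule(rule_name: str, entries: dict[str, float]) -> str:
    """Render entries as one rule of weighted alternatives (SRGS ABNF style)."""
    if not entries:
        raise ValueError("a rule needs at least one entry")
    if any(weight <= 0 for weight in entries.values()):
        raise ValueError("weights are assumed to be positive")
    # Each alternative is written as /weight/ phrase; alternatives are joined by '|'.
    alternatives = " | ".join(
        f"/{weight:g}/ {phrase}" for phrase, weight in entries.items()
    )
    return (
        "#ABNF 1.0;\n"
        "language en-US;\n"
        f"root ${rule_name};\n"
        f"public ${rule_name} = {alternatives};\n"
    )


# Hypothetical weights for the automobile phrases mentioned above.
grammar = render_weighted_rule(
    "command", {"start the engine": 3, "turn on the lights": 1}
)
print(grammar)
```

A higher weight on “start the engine” would bias a recognizer toward that phrase; how strongly depends on the recognizer, which is outside the scope of this sketch.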
Grammars involving names are a common special case in speech recognition in that the context is generally the same (identify a person or group of people by a name) but the content is almost certain to be unique for each implementation. For example, if two companies sell widgets through a speech recognition application, they might have commands in common like “buy a widget” or “I'd like help for my widget.” However, since each company may have different internal structures and employees, commands like “call Steve Jones” only make sense if the company has an employee named Steve Jones. Additionally, each company may refer to processes or groups by different names, so one widget company might require “I'd like technical support” while the other requires “I'd like widget help.” These differences make it extremely difficult to identify weighting or probability structures for grammars that include things like personnel, department, or even location names.
An adaptive name grammar for a speech recognition system implements a user-specific personal name grammar definition having entries for a group of members with each entry including identification information that identifies an associated member of the group and with each entry including a weight or probability value indicating the likelihood of the name of the associated member being spoken. Environmental information is analyzed to determine group members likely to be contacted by the user and the weight value in an entry associated with a group member is altered to indicate the likelihood that the group member will be contacted by the user.
Reference will now be made in detail to various embodiments of the invention. Examples of these embodiments are illustrated in the accompanying drawings. While the invention will be described in conjunction with these embodiments, it will be understood that it is not intended to limit the invention to any embodiment. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments. However, the present invention may be practiced without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure the present invention. Further, each appearance of the phrase “an example embodiment” at various places in the specification does not necessarily refer to the same example embodiment.
Using speech recognition to call, address, or otherwise identify people or groups of people by their spoken names is one of the more difficult problems to overcome in a voice user interface. This is because the number of names, complexity of their composition, variations in pronunciation, and external interference all factor into the ability to correctly match a name against a given audio input or utterance. In a global communications environment, the sheer number and diversity of names coupled with the allowable variation in individual pronunciation presents an enormous technical challenge.
An example embodiment will now be described that uses concepts from social interaction to dynamically adjust the accuracy of voice addressing by spoken name. By collecting user-specific contact and addressing information, work group structure, inter-personal communications, egress monitoring, and call history information that may exist within a messaging or corporate information system, utterance resolution via speech recognition can be improved when calling or addressing another user by name.
The operation of this embodiment will now be described in an example context of a user (User A) working in a corporate or other professional or social environment where user-specific contact and addressing information, work group structure, inter-personal communications, egress monitoring, and call history information exists within a corporate messaging and information system. In this example, the name grammar would include every employee of the corporation or business.
The following is a pseudo-example of activity for User A:
Note that in this example there are three types of information utilized. The first is dynamic activity such as calling another user (item 1), leaving a message (item 2) or attending a meeting with another user (item 5). The second is social and environmental information such as the customer or personal distribution list (item 3), the buddy list (item 4), the physical location (item 6) and the reporting structure (item 7). The third is time. Weighting is degraded over time because communications and social interactions between Users vary in intensity over time. To keep weights current, communications activity between two entities must continue to occur. As shown in the list, different activities and different information may be assigned different weights based on their relative importance within the organization.
A subsequent constructed collection of addressable names or weighting improvements based on the operations described in the pseudo-example would result in the following for User A: +9 to User B; +5 to User C; +4 to User D; +1 to User E; +1 to User F; +1 to User G. Over time, if User A did not continue to contact other Users, their individual weighting improvements could slowly reset or normalize.
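The accumulation and gradual normalization of weighting improvements described above can be sketched as follows. This is only an illustration under stated assumptions: the per-activity increments in ACTIVITY_BONUS, the exponential half-life decay, and the function names accumulate and decay are hypothetical choices, since the description states only that different activities may carry different weights and that improvements could slowly reset over time.

```python
from collections import defaultdict

# Hypothetical per-activity increments (not values from the disclosure).
ACTIVITY_BONUS = {"call": 3.0, "message": 2.0, "meeting": 1.0}


def accumulate(events: list[tuple[str, str]]) -> dict[str, float]:
    """Sum weighting improvements per member from (member, activity) events."""
    boosts: dict[str, float] = defaultdict(float)
    for member, activity in events:
        boosts[member] += ACTIVITY_BONUS[activity]
    return dict(boosts)


def decay(
    boosts: dict[str, float], days_idle: float, half_life_days: float = 30.0
) -> dict[str, float]:
    """Shrink every improvement toward zero; halves once per half-life (assumed)."""
    factor = 0.5 ** (days_idle / half_life_days)
    return {member: boost * factor for member, boost in boosts.items()}


boosts = accumulate(
    [("User B", "call"), ("User B", "message"), ("User C", "meeting")]
)
print(boosts)             # improvements right after the activity
print(decay(boosts, 30))  # the same improvements after 30 idle days
```

With these assumed values, an improvement that is not refreshed by further contact loses half its size every 30 days, which is one possible way to realize the slow reset described above.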
This small collection of commonly accessed or important members is an important consideration in resolving addressing relationships within large organizations with tens or even hundreds of thousands of individual contacts. It draws on common patterns of repeated, regular and regulated social interactions between individuals to improve speech recognition accuracy for names. Name recognition is improved by limiting or weighting the scope of potential addressable names based upon meta-information relative to the sociological hierarchy of a user, thereby increasing the likelihood of a positive match.
The operation of an example embodiment will now be described with reference to
In this example, each Personal Name Grammar Member entry in User A's personal name grammar includes different identifiers for the member, such as the member's identifier in the Personal Name Grammar system (ObjectId), the member's identifier in the phone call database (MemberUserObjectID), the member's identifier in the contact database (MemberContactObjectID), the member's identifier in the personal contact database (MemberPersonalContactObjectID), and the member's identifier in the personal group database (MemberPersonalGroupObjectID). The Personal Name Grammar Member entry also includes information on the date the entry was entered (DateEntered), the current weight assigned to the member (CurrentWeight) and statistics (Inputs and Outputs).
There is a Personal Name Grammar entry for every member included in a user's personal name grammar that includes the member's identifier (ObjectId), the maximum age of any member entry in the user's personal grammar (MaxMemberAge) and the maximum number of entries in the user's personal grammar (MaxMemberCount).
The fields in the Personal Name Grammar entry are used for management purposes to control the age of entries in a user's personal name grammar so that stale entries can be removed (MaxMemberAge) and to limit the number of entries in the user's personal name grammar (MaxMemberCount).
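The two entry types and their management role can be sketched as plain records. The field names below follow the description; the field types, the use of a time interval for MaxMemberAge, the treatment of Inputs and Outputs as counters, and the policy of keeping the highest-weighted entries when the count limit is exceeded are assumptions made for this sketch, as is the function name prune.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PersonalNameGrammarMember:
    ObjectId: str
    DateEntered: datetime
    CurrentWeight: float
    MemberUserObjectID: Optional[str] = None
    MemberContactObjectID: Optional[str] = None
    MemberPersonalContactObjectID: Optional[str] = None
    MemberPersonalGroupObjectID: Optional[str] = None
    Inputs: int = 0   # statistics; exact meaning assumed
    Outputs: int = 0  # statistics; exact meaning assumed


@dataclass
class PersonalNameGrammar:
    ObjectId: str
    MaxMemberAge: timedelta  # assumed to be a time interval
    MaxMemberCount: int


def prune(
    grammar: PersonalNameGrammar,
    members: list[PersonalNameGrammarMember],
    now: datetime,
) -> list[PersonalNameGrammarMember]:
    """Drop stale entries, then keep at most MaxMemberCount entries."""
    fresh = [m for m in members if now - m.DateEntered <= grammar.MaxMemberAge]
    # Assumed policy: when over the limit, keep the highest-weighted entries.
    fresh.sort(key=lambda m: m.CurrentWeight, reverse=True)
    return fresh[: grammar.MaxMemberCount]
```

Other retention policies, such as keeping the most recently entered members, would fit the same two limits equally well.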
The operation of the example system depicted in
Upon startup the system is initialized and the tables are set up. The corporate information and messaging system is searched and table entries are created for members in User A's custom and distribution list (item 3 in the pseudo-example), members in User A's buddy list (item 4), members with whom User A has regular ad-hoc meetings (item 5), members in the same physical location as User A (item 6) and members who report to User A (item 7).
During initialization weights can be assigned to each member as described above in the context of the pseudo-example.
Subsequent to initialization, in the first step the environmental and social context for User A is rechecked for changes and table entries are updated or new table entries are created.
The personal name grammar system then monitors whether a call has been received by User A. If so, then User A's Personal Name Grammar Member Entry for the calling member has its weight adjusted (item 1) if the personal name grammar of User A includes a table entry for the calling member. If there is no existing table entry for the calling member then a table entry is created with the appropriate weight assigned.
The personal name grammar system then monitors whether a call has been made. If so, then User A's Personal Name Grammar Member Entry for the called target member has its weight adjusted (item 2) if the personal name grammar of User A includes a table entry for the called target member. If there is no existing table entry for the called target member then a table entry is created with the appropriate weight assigned.
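The adjust-or-create behavior for received and placed calls can be sketched as a single update function. The table is modeled here as a dictionary from member identifier to weight, and the adjustment values in CALL_ADJUSTMENT and the function name on_call are hypothetical; the description says only that an appropriate weight is assigned.

```python
# Hypothetical adjustments per call direction (not values from the disclosure).
CALL_ADJUSTMENT = {"received": 1.0, "made": 2.0}


def on_call(table: dict[str, float], member_id: str, direction: str) -> float:
    """Adjust the member's weight, creating the table entry if it is missing."""
    adjustment = CALL_ADJUSTMENT[direction]
    if member_id in table:
        # Existing entry: adjust its weight.
        table[member_id] += adjustment
    else:
        # No entry yet: create one with the (assumed) initial weight.
        table[member_id] = adjustment
    return table[member_id]


table = {"User B": 4.0}
on_call(table, "User B", "made")      # existing entry is adjusted
on_call(table, "User C", "received")  # new entry is created
print(table)
```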
The flow chart of
Accordingly, the weights assigned to the different names in the name grammar of the speech recognition system have been adaptively adjusted to take into account the specific social interactions and environment of User A. Those members that are more likely to be contacted by User A have been assigned higher weight values so that when the speech recognition system attempts to recognize a name spoken by User A the search will be weighted towards members with whom User A has social or environmental contacts.
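One way such weights could bias a search is to combine each candidate's recognizer score with a prior derived from its weight. The sketch below is an assumption-laden illustration, not the method of the disclosure: the acoustic scores, the add-one smoothing, the multiplicative combination, the function name rescore, and the competing name “Steve Johns” are all hypothetical.

```python
def rescore(
    acoustic_scores: dict[str, float], boosts: dict[str, float]
) -> list[tuple[str, float]]:
    """Rank candidate names by acoustic score times a weight-derived prior."""
    # Add one so that names with no weighting improvement remain recognizable.
    raw = {name: 1.0 + boosts.get(name, 0.0) for name in acoustic_scores}
    total = sum(raw.values())
    combined = {
        name: acoustic_scores[name] * raw[name] / total
        for name in acoustic_scores
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Two similar-sounding candidates; only one has a weighting improvement.
ranked = rescore(
    {"Steve Jones": 0.50, "Steve Johns": 0.55},
    {"Steve Jones": 9.0},
)
print(ranked[0][0])
```

In this example the candidate with the social or environmental weighting ranks first even though its assumed acoustic score is slightly lower.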
The invention has now been described with reference to the example embodiments. Alternatives and substitutions will now be apparent to persons of skill in the art. For example, the structure of the table entries, the values of the weights assigned, and the types of meta-information searched are described by way of example, not limitation. Accordingly, it is not intended to limit the invention except as provided by the appended claims.