Not applicable.
This invention relates in general to the use of local and network-based engines for voice recognition and text conversion, and more particularly to a method and system for arbitrating the selection of such engines.
A mobile communication device architecture is typically optimized for minimizing power consumption and fails to match the performance capability of desktop computing stations in terms of MIPS or otherwise. High performance Voice Recognition (VR) engines are available on high performance server platforms, whereas speaker dependent recognition engines for name tag or digit dialing are available on mobile platforms. For applications that require very high quality voice recognition as well as high quality (natural) speech synthesis, a server solution is the most favored. For applications where the grammar is limited and time is of the essence, local recognition and local synthesis are desirable.
In a wireless mobile environment, existing systems have contemplated downloading larger grammar dictionaries from a remote server if a local dictionary was insufficient to process a particular input. Other systems use either a local engine or a remote engine, but not both. No existing system arbitrates between the use of a local engine and a remote engine based on factors specifically affecting a mobile environment.
Embodiments in accordance with the invention illustrate systems and methods that intelligently use both local and network-based voice recognition (VR) and text-to-speech synthesis engines. As a result, an enhancement in the system level performance for the recognition rate for VR and the quality of speech for TTS can be achieved.
In a first aspect of an embodiment in accordance with the present invention, a method of arbitrating the selection of engines between a local engine and a network-based engine in a mobile communication network for voice recognition or text conversion (e.g., text-to-speech synthesis (TTS) or text synthesis) can include the step of determining at least one factor among an available bandwidth on a given channel, a signal quality on the given channel, a latency indication, a desired application need, a cost factor, a background environment indication (e.g., a background noise level or a predetermined background noise characteristic), and a number of unsuccessful attempts on the given channel and the step of automatically selecting one among the local engine and the network-based engine based upon the at least one factor determined when performing at least one among voice recognition and text conversion. Each of the factors that can be determined can be achieved in numerous ways. Some examples are provided. Determining the available bandwidth can involve detecting the available bandwidth at a given time period and automatically selecting among the local engine and the network-based engine can involve providing a recommended suggestion to a dialog manager for selection by a user. The step of determining the signal quality can include the step of measuring signal strength between a portable communication unit and a base station. The step of determining the cost factor can include the step of determining a cost associated with communication in at least one among a predetermined number of networks. The step of determining the latency indication can include the step of measuring a delay that the mobile communication network experiences to process a request compared to a predetermined threshold. The step of determining the background environment indication can include the step of measuring a background noise level compared to a threshold level of noise. 
The step of determining the number of unsuccessful attempts can include the step of accounting for the number of unsuccessful attempts in voice recognition in comparison to a predetermined number. The step of determining the desired application need can involve determining at least one among a quality level of processing required, a speed requirement, and a grammar and language dictionary requirement. Note that the automatically selected engine(s) can be used either for performing voice recognition or text conversion.
In a second aspect, a mobile communication system having an arbitrated selection between a local engine and a network-based engine for voice recognition or text conversion can include at least one remote server having the network-based engine, and a portable communication unit having the local engine and a processor. The processor in the portable communication unit can be programmed to determine at least one factor among an available bandwidth on a given channel, a signal quality on the given channel, a latency indication, a desired application need, a cost factor, a background environment indication, and a number of unsuccessful attempts on the given channel and to automatically select one among the local engine and the network-based engine based upon the at least one factor determined when performing at least one among voice recognition and text conversion. As explained above, the processor can be programmed to determine many factors. For example, the processor can be programmed to determine the signal quality by measuring signal strength between a portable communication unit and a base station. The processor can be further programmed to automatically select by weighting the selection for the local engine in a weak signal environment and weighting the selection for the network-based engine in a strong signal environment such that the weak signal environment or the strong signal environment is determined using at least one threshold value.
In a third aspect, a mobile communication system having an arbitrated selection between a local engine and a network-based engine for voice recognition or text conversion can include at least one remote server having the network-based engine and a remote processor, and a portable communication unit having the local engine and a local processor. In this embodiment, at least one among the remote processor and the local processor can be programmed to determine at least one factor among an available bandwidth on a given channel, a signal quality on the given channel, a latency indication, a desired application need, a cost factor, a background environment indication, a number of unsuccessful attempts on the given channel, and a server traffic condition and to automatically select one among the local engine and the network-based engine based upon the at least one factor determined when performing at least one among voice recognition and text conversion.
In a fourth aspect, a mobile communication unit that arbitrates a selection between a local engine and a network-based engine for voice recognition or text conversion can include a transceiver unit coupled to a processor and a local engine. The transceiver unit can communicate with a remote server having the network-based engine. The processor can be programmed to determine at least one factor among an available bandwidth on a given channel, a signal quality on the given channel, a latency indication, a desired application need, a cost factor, a background environment indication, and a number of unsuccessful attempts on the given channel and to automatically select one among the local engine and the network-based engine based upon the at least one factor determined when performing at least one among voice recognition and text conversion.
While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the following description in conjunction with the drawing figures, in which like reference numerals are carried forward.
Referring now to
The bandwidth available can be detected by the system 10 at the moment required to make a decision on which engine to use. In a high data rate network such as WLAN, the algorithm can select the network-based engines. Referring to the high level architecture 20 of
In determining signal quality and preventing high error rates, the system 10 can measure signal strength. In a low signal strength environment, the algorithm can select the local version of the engine. The dialog manager (22 or 36) (VR selection) can use the signal quality to decide between the local version and the server version. If the signal quality (e.g., RSSI) drops below a certain threshold (predetermined by the system) and the VR engine selected is the server version, then the dialog manager can choose to use the local version to prevent retransmissions and poor quality conversions. Conversely, if the signal quality or RSSI rises, the system can automatically switch to the server engine. A hysteresis window can be provided to avoid false switches.
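The signal-quality switching with a hysteresis window described above can be sketched as follows. This is a minimal illustration only: the class name `SignalArbiter` and the two threshold values are assumptions chosen for the example and are not taken from the description, which states only that a predetermined threshold and a hysteresis window can be used.

```python
from dataclasses import dataclass

LOCAL = "local"
SERVER = "server"


@dataclass
class SignalArbiter:
    """Selects an engine from RSSI readings using two thresholds."""

    low_dbm: float = -95.0   # assumed value: below this, fall back to local
    high_dbm: float = -85.0  # assumed value: above this, return to server
    engine: str = SERVER

    def update(self, rssi_dbm: float) -> str:
        # Switch only when the reading leaves the hysteresis window,
        # so small fluctuations around one threshold cause no flapping.
        if self.engine == SERVER and rssi_dbm < self.low_dbm:
            self.engine = LOCAL
        elif self.engine == LOCAL and rssi_dbm > self.high_dbm:
            self.engine = SERVER
        return self.engine


arbiter = SignalArbiter()
readings = [-80, -90, -96, -90, -86, -84]
choices = [arbiter.update(r) for r in readings]
print(choices)
```

With these assumed thresholds, the readings of -90 and -86 dBm that follow the drop to -96 dBm do not trigger a return to the server engine; the switch back occurs only once the reading exceeds the upper threshold.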
With respect to cost, a user can set up a cost preference (or preferred network). There are several wireless networks available (WLAN, iDEN, GSM, CDMA, 3G, etc) and each has a different cost structure for the user. The proposed methodology can factor the cost into the decision algorithm. A user can also enter their cost preferences (versus performance) enabling the system 10 to take cost into account. For example, if a user enters a zone (roaming to a new network), the system can detect the new network and based on cost, switch to the new network or not. The cost information can be prerecorded on the phone or directly downloaded from the network. More specifically, if WLAN service is available with a flat fee structure, the user may want to switch from a cellular service to the WLAN service for cost purposes, not to mention the likely advantage with respect to bandwidth.
The algorithm can also measure the delay that the system 10 takes to process a request and compare it to, for example, a predetermined threshold. Based on the measured delay, the algorithm can then decide which system to choose. The algorithm can also account for the number of unsuccessful recognition attempts when deciding which engine to use. A threshold can be set on the number of attempts, enabling the algorithm to decide which system to choose. When the background is noisy, the algorithm can choose to use the network-based or distributed version as the recognition becomes more difficult. Of course, each of these factors can be weighted or given priority as desired.
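One way the latency, attempt-count, and background-noise factors could be weighted against each other is sketched below. The threshold values, the weights, and the scoring scheme itself are assumptions made for illustration; the description says only that each factor can be compared to a threshold and weighted or prioritized as desired. The sketch further assumes that excessive delay favors the local engine and that repeated failures favor whichever engine is not currently in use.

```python
from dataclasses import dataclass

LOCAL, SERVER = "local", "server"


@dataclass(frozen=True)
class Limits:
    latency_ms: float = 800.0  # assumed threshold
    attempts: int = 3          # assumed threshold
    noise_db: float = 70.0     # assumed threshold


@dataclass(frozen=True)
class Weights:
    latency: float = 1.0
    attempts: float = 1.0
    noise: float = 1.0


def choose_engine(current, latency_ms, failed_attempts, noise_db,
                  limits=Limits(), weights=Weights()):
    """Positive score favors the server engine, negative favors local."""
    score = 0.0
    if latency_ms > limits.latency_ms:
        score -= weights.latency          # slow round trip: prefer local
    if failed_attempts >= limits.attempts:
        # Too many failures: vote for the engine not currently in use.
        score += weights.attempts if current == LOCAL else -weights.attempts
    if noise_db > limits.noise_db:
        score += weights.noise            # noisy background: prefer server
    if score > 0:
        return SERVER
    if score < 0:
        return LOCAL
    return current                        # tie: keep the current engine


noisy = choose_engine(LOCAL, latency_ms=200, failed_attempts=0, noise_db=85)
slow = choose_engine(SERVER, latency_ms=1500, failed_attempts=0, noise_db=40)
tie = choose_engine(SERVER, latency_ms=1500, failed_attempts=0, noise_db=85)
print(noisy, slow, tie)
```

Raising one weight gives that factor priority. For example, passing `Weights(latency=2.0)` makes the tie case above resolve to the local engine.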
The application selected by the user can have the ability to feed the algorithm with different application needs that will affect the choice of engine(s). For example, an application (or user preference for the application) requiring high quality processing in terms of recognition would favor a network-based engine, whereas an application requiring speed of conversion (where conversion time is critical for the application) would favor a local engine. In the case where a grammar or language dictionary required for an application is unavailable or insufficient in a local form, the application can request a new grammar or language dictionary, and the algorithm can either download it to use the local engine or send the request to the network-based engine for remote processing. Of course, the algorithm can also monitor the user patterns to enhance the local recognition engine (such as in a speaker dependent engine) and download new grammars (dictionaries), acoustic models, languages, and common words as examples and as available and appropriate.
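The way application needs might steer the choice can be sketched as below. The names `AppNeeds` and `plan_for_app`, the `can_download` flag, and the order in which the needs are checked are all hypothetical; the description does not specify how the algorithm decides between downloading a grammar and forwarding the request to the network-based engine.

```python
from dataclasses import dataclass

LOCAL, SERVER = "local", "server"


@dataclass
class AppNeeds:
    needs_high_quality: bool = False
    time_critical: bool = False
    grammar: str = "default"


def plan_for_app(needs, local_grammars, can_download):
    """Return (engine, action) for an application request."""
    if needs.grammar not in local_grammars:
        # Grammar missing locally: download it, or process remotely.
        if can_download:
            return LOCAL, "download grammar " + needs.grammar
        return SERVER, "forward request for remote processing"
    if needs.time_critical:
        return LOCAL, "none"       # conversion speed favors the local engine
    if needs.needs_high_quality:
        return SERVER, "none"      # recognition quality favors the server
    return LOCAL, "none"           # assumed default when nothing is stated


installed = {"default", "digits"}
fast = plan_for_app(AppNeeds(time_critical=True), installed, can_download=True)
streets = plan_for_app(AppNeeds(grammar="street_names"), installed, can_download=False)
print(fast, streets)
```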
Among the local recognition engines, speaker dependent recognition is more accurate than speaker independent recognition. This criterion can be added to the decision algorithm. If a particular speaker always uses the device, the local version of the speaker dependent engine can have priority in most instances. The user can be detected using speaker verification techniques known in the art.
Referring to
After the decision is made at block 74, a local Dialog Manager (DM) can verify whether it is possible to grant the application request. The DM first verifies, at decision block 76, whether the application pre-selected the use of both engines. If both engines are to be used, then both engines are launched at block 78. If a single engine was selected, then the DM verifies whether the application has some other preferences at decision block 80. If the local engine is preferred at decision block 82, then the background noise can be assessed at decision block 84 to determine whether it exceeds a threshold. In other words, the VR engine measures the noise level and informs the DM of the decision. If no appreciable background noise (or an acceptable level) is measured, then the DM selects the local engine and the method 50 uses it at block 86. If appreciable background noise (or an unacceptable level) is measured at decision block 84, the server engine is used at block 88, overriding the initial application preference. Since the application did not request this engine, the DM grants temporary access to the server engine (Force Server). This means that the DM will keep checking the background noise so that it can return to the local version. When the application indicates no preference or “best engine”, the DM can assume the server engine by default (although the “no preference/best engine” default can be configured by the user to be the local engine or both engines).
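The preference handling described above can be sketched as a small function. The enum `Pref`, the function name, and the noise threshold value are illustrative assumptions; the block numbers in the comments refer to the decision blocks named in the text.

```python
from enum import Enum


class Pref(Enum):
    LOCAL = "local"
    SERVER = "server"
    BOTH = "both"
    NONE = "none"   # application has no preference / asks for best engine


def resolve_preference(pref, noise_db, noise_limit_db=70.0,
                       default=Pref.SERVER):
    """Return (engines, forced) for an application request.

    noise_limit_db is an assumed threshold; default is the user-configurable
    choice applied when the application states no preference.
    """
    if pref is Pref.NONE:
        pref = default
    if pref is Pref.BOTH:                      # decision block 76
        return {"local", "server"}, False      # block 78
    if pref is Pref.LOCAL:                     # decision block 82
        if noise_db > noise_limit_db:          # decision block 84
            return {"server"}, True            # block 88, Force Server
        return {"local"}, False                # block 86
    # Server requested: still subject to the later network verifications.
    return {"server"}, False


quiet = resolve_preference(Pref.LOCAL, noise_db=40)
loud = resolve_preference(Pref.LOCAL, noise_db=85)
print(quiet, loud)
```

The `forced` flag marks a temporary grant, which tells the caller to keep re-checking the noise level so that the application's original preference can be restored.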
If the server version is requested at decision block 82 (or no preference is indicated at decision block 80), then the selection engine goes into a series of verifications. At decision block 90, the Modem Layer can inform the local DM of the decision based on network availability (WLAN, 2.5G, 3G, etc.). If a high bandwidth network is unavailable at decision block 90, then the system forces the use of the local engine at block 98. Next, based on User Preferences and Server DM information (cost information from the network), a decision whether to use a network-based engine can be made at decision block 92. If the cost determined is unacceptable, then the local engine is again used at block 98. With respect to signal quality, the Modem Layer can inform the local DM of the level of the signal strength at decision block 94. If signal strength or quality is poor, then once again the method forces the use of the local engine at block 98. If the bandwidth, cost, and signal quality factors are favorable and all the respective verifications are successful, then the server engine is selected for the application at block 96 and the DM provides access to the server engine.
Again, if at least one of the verifications fails, then the engine selected is the local engine. Since the application did not request this engine, the DM grants temporary access to the local engine (Force Local) at block 98. This means that the DM will keep verifying the parameters until they become acceptable for the server version.
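The series of verifications for a server request can be sketched as an ordered chain in which the first failed check forces the local engine. The `LinkStatus` fields and the function name are hypothetical stand-ins for the information that the Modem Layer, the user preferences, and the server DM would supply.

```python
from dataclasses import dataclass


@dataclass
class LinkStatus:
    high_bandwidth_available: bool  # decision block 90
    cost_acceptable: bool           # decision block 92
    signal_ok: bool                 # decision block 94


def verify_server_request(status):
    """Return (engine, reason); any failed check forces the local engine."""
    checks = [
        ("bandwidth", status.high_bandwidth_available),
        ("cost", status.cost_acceptable),
        ("signal", status.signal_ok),
    ]
    for name, ok in checks:
        if not ok:
            # Block 98: temporary grant of the local engine (Force Local).
            return "local", "forced local: " + name + " check failed"
    # Block 96: every verification succeeded.
    return "server", "all checks passed"


good = verify_server_request(LinkStatus(True, True, True))
pricey = verify_server_request(LinkStatus(True, False, True))
print(good, pricey)
```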
Referring to
Referring to
If the server engine 108 is selected by the application or by a change of state, the DM 102 keeps monitoring the bandwidth, latency, signal strength, and number of attempts, and verifies whether those parameters are under normal conditions. If any of those parameters changes and reaches unacceptable limits, then the DM 102 changes the engine to the transitional mode (Forced Local 110). If all the parameters remain the same, there is no state change and the system keeps using the preferred mode. If the selected engine is in the forced local option 110 (in Transition mode) or in a change of state, the DM 102 keeps monitoring the bandwidth, number of attempts, signal strength, cost changes, etc. If all parameters reach acceptable limits, then the DM 102 can change the state back to the server engine 108 (the original decision made by the application). If all the parameters remain the same, there is no state change and the system keeps using the transitional mode. When both engines are used at 112, the DM 102 keeps monitoring all the parameters (for example, whether a network is available), and in some extreme cases the engine might be forced to either the local or the server engine.
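The transitions between the server engine and the Forced Local mode described above can be sketched as a two-state machine. Only those two states are modeled here; the parameter names in the dictionary are illustrative, and how each parameter is judged acceptable is left out as an assumption-free placeholder.

```python
from enum import Enum


class State(Enum):
    SERVER = "server"              # preferred mode (108)
    FORCED_LOCAL = "forced_local"  # transitional mode (110)


def next_state(state, params_ok):
    """Advance the monitor by one step.

    params_ok maps each monitored parameter name to True when that
    parameter is within acceptable limits.
    """
    all_ok = all(params_ok.values())
    if state is State.SERVER and not all_ok:
        return State.FORCED_LOCAL   # any parameter out of limits
    if state is State.FORCED_LOCAL and all_ok:
        return State.SERVER         # restore the application's choice
    return state                    # no change in conditions


state = State.SERVER
history = []
for signal_ok in (True, False, False, True):
    state = next_state(state, {"bandwidth": True, "latency": True,
                               "signal": signal_ok, "attempts": True})
    history.append(state.value)
print(history)
```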
Referring to
Note, the selection algorithm for TTS conversion in
In summary, the architecture as depicted in
As shown in
One practical example is the case of requesting and receiving driving directions. This application can provide a mechanism for a user to request and receive driving directions via speech. Due to the limitations of a mobile device, the voice recognition engine may not necessarily hold all the street names in its database for local speech recognition. A distributed or network-based voice recognition engine becomes the ideal solution when using directions (street names, etc.). When using the network-based system, the user might enter an area with poor reception or no reception at all. The driving direction application (using the network-based engine) then becomes unusable, preventing the user from asking for directions. The proposed solution allows the system to switch to the local engine and a limited version of the voice recognition, allowing the user to access the application for simple requests.
Likewise, with TTS, the server solution can be ideal, but if the system cannot deliver the best solution, the system can automatically select a limited local TTS version. In this manner, the user does not lose the ability to get driving directions via synthesized voice.
In another example for air travel reservations, a user may perform one or more among name or number dialing and command or control operations. The user can say a name or a number into a mobile device. An illustrative use case for this application is when a user is driving. While driving, the user can benefit from hands-free and eyes-free operation. Noise may be present while the user tries to use the name/number dialing application. With the current technology, the user will not be able to use the speech recognition feature of the application when it is most needed. The proposed solution can automatically switch to the network-based (or distributed) recognition engine (a more noise robust solution). Also, the user can have their contact list stored locally as well as on the server. The system application can select the local engine, but if the number of attempts is too high, then the system can automatically switch to the server version. This way, the user has a more robust VR engine available for use by the mobile unit.
Much of the inventive functionality and many of the inventive principles are best implemented with or in software programs or instructions and integrated circuits (ICs) such as application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts in accordance with the present invention, any discussion of such software and ICs, if any, is limited to the essentials with respect to the principles and concepts of the preferred embodiments.
In light of the foregoing description, it should be recognized that embodiments in accordance with the present invention can be realized in hardware, software, or a combination of hardware and software. A communications system or device according to the present invention can be realized in a centralized fashion in one computer system or processor, or in a distributed fashion where different elements are spread across several interconnected computer systems or processors (such as a microprocessor and a DSP). Any kind of computer system, or other apparatus adapted for carrying out the functions described herein, is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the functions described herein.
Additionally, the description above is intended by way of example only and is not intended to limit the present invention in any way, except as set forth in the following claims.