Aspects of the invention relate generally to speech recognition. More specifically, aspects of the invention relate to the seamless transfer of automatic speech recognition sessions from one speech recognition engine to another.
A variety of mobile computing devices exist, such as personal digital assistants (PDAs), mobile phones, digital cameras, digital players, mobile terminals, etc. (hereinafter referred to as “mobile devices”). These mobile devices perform various functions specific to the device and are often able to communicate (via wired or wireless connection) with other devices. A mobile device may, for example, provide Internet access, maintain a personal calendar, provide mobile telephony, take digital photographs and provide speech recognition services. However, memory capacity is typically limited on mobile devices.
Automatic Speech Recognition (ASR) is a resource-intensive service. Using an ASR system on resource-constrained devices requires employing lightweight algorithms and methodologies. An often suggested workaround for resource constraints is a client-server based architecture (also known as DSR, distributed speech recognition). In DSR, an ASR client resides on the computing device and the resource-intensive tasks are handled on a network-based server. Thus, a client-server based approach (DSR) maintains the convenience of a mobile ASR client and enables the use of complex techniques at the server, where resource availability is very high.
However, a client-server network may not be available, such as when a user is physically moving from a first location to a second location while dictating a memorandum or other document. For example, a user may begin a dictation at a first location, such as in an automobile, and continue or finish the dictation at a home or office located at a second location.
In addition, inefficiencies may arise as users may be forced to use only one ASR engine for all speech recognition services. After a speech recognition service has begun on a first ASR engine, it may be beneficial to switch to a different ASR engine that is optimized for a particular speech recognition service. Moreover, because an ASR engine works in a sequential manner, switching ASR engines seamlessly must be accomplished in real time, which remains a problem in the art for which a solution has not been implemented.
For these and other reasons, there remains a need for an apparatus and method by which an ASR session may be seamlessly transferred from one ASR engine to another ASR engine in an efficient manner.
The following presents a simplified summary in order to provide a basic understanding of some aspects of the invention. The summary is not an extensive overview of the invention. It is neither intended to identify key or critical elements of the invention nor to delineate the scope of the invention. The following summary merely presents some concepts of the invention in a simplified form as a prelude to the more detailed description below.
In an aspect of the invention, a method and apparatus are provided for efficient and seamless switching between ASR engines. For example, a mobile terminal may switch from a first ASR engine located on the mobile terminal to a second ASR engine located on a personal computer.
In an aspect of the invention, ASR state information may be used to create a state matrix in the first ASR engine. The matrix may be transferred to a second ASR engine during an ASR transfer enabling the second ASR engine to begin from the ending point of the first ASR engine.
In another aspect of the invention, the state matrix information may include data such as timing and acoustic and language model scores. The second ASR engine may utilize its own set of acoustic and language models to rescore a word lattice diagram.
A more complete understanding of the present invention and the advantages thereof may be acquired by referring to the following description in consideration of the accompanying drawings, in which like reference numbers indicate like features, and wherein:
In the following description of the various embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration various embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope and spirit of the present invention. It is noted that various connections are set forth between elements in the following description. In addition, it is further noted that these connections in general and, unless specified otherwise, may be direct or indirect and that this specification is not intended to be limiting in this respect.
Enabling seamless ASR session transfers provides for a pleasing user experience. In an aspect of the invention, each node in a network may have an ASR engine. The ASR engine may benefit from context information, which includes both the ambience of the user and the context of the dictated utterance. For example, in an embodiment the user may be interacting with the ASR engine in a hallway that contains high ambient noise. Information regarding the noisy hallway may be used to apply suitable noise-robust ASR techniques. When the user moves into his/her office, which may be relatively quiet, the ASR engine that the user is interacting with may use an algorithm suitable for a high signal-to-noise ratio (SNR). Seamless transfer of an ASR session in progress, such as from one ASR engine located in the hallway to another ASR engine located in an office, may be necessary for usability in environments that require multiple and frequent transfers of ASR sessions between ASR engines.
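As an illustration only, the choice between a noise-robust engine and an engine suited to high SNR could be sketched as follows. The function names, the profile labels and the 15 dB threshold are assumptions made for this example and do not come from the description above.

```python
import math


def estimate_snr_db(signal_power: float, noise_power: float) -> float:
    """Return the signal-to-noise ratio in decibels from two power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def select_engine(snr_db: float, threshold_db: float = 15.0) -> str:
    """Pick a hypothetical engine profile from the estimated SNR.

    The threshold is an arbitrary illustrative value, not one taken from
    the specification.
    """
    return "noise_robust" if snr_db < threshold_db else "high_snr"


# A noisy hallway (low SNR) versus a quiet office (high SNR).
hallway_profile = select_engine(estimate_snr_db(4.0, 1.0))
office_profile = select_engine(estimate_snr_db(1000.0, 1.0))
```

In this sketch a change in the selected profile would be the trigger for transferring the session to the other engine.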
In one aspect of the invention, mobile device 112 may include a wireless interface configured to send and/or receive digital wireless communications within network 118. The information received by mobile device 112 through the network 118 includes user selection, applications, services, electronic images, audio clips, video clips, and/or WTAI (Wireless Telephony Application Interface) messages.
A server such as server 126 may act as a file server, such as a personal server for a network such as home network, some other Local Area Network (LAN), or a Wide Area Network (WAN). Server 126 may be a computer, laptop, set-top box, DVD, television, PVR, DVR, TiVo device, personal portable server, personal portable media player, network server or other device capable of storing, accessing and processing data. Mobile device 112 may communicate with server 126 in a variety of manners. For example, mobile device 112 may communicate with server 126 via wireless communication.
In another aspect of the invention, a server such as server 127 may alternatively (or also) have one or more other communication network connections. For example, server 127 may be linked (directly or via one or more intermediate networks) to the Internet 129, to a conventional wired telephone system, or to some other communication or broadcasting network, such as a TV, a radio or IP datacasting networks.
In an embodiment, mobile device 112 has a wireless interface configured to send and/or receive digital wireless communications within wireless network 118. As part of wireless network 118, one or more base stations (not shown) may support digital communications with mobile device 112 while the mobile device 112 is located within the administrative domain of wireless network 118. Mobile device 112 may also be configured to access data previously stored on server 126. In one embodiment, file transfers between mobile device 112 and server 126 may occur via Short Message Service (SMS) messages and/or Multimedia Messaging Service (MMS) messages transmitted via a short message service center (SMSC) and/or a multimedia messaging service center (MMSC). The transfer may also occur via IMS or over a standard Internet Protocol (IP) stack.
As shown in
As shown in
Server 126 may also include volatile memory 154 (e.g., RAM) and/or non-volatile memory 156 (such as a hard disk drive, tape system, or the like). Software and applications may be stored within memory 154 and/or memory 156 that provide instructions to processor 142 for enabling server 126 to perform various functions, such as processing file transfer requests (such as for image files), storing files in memory 154 or memory 156, displaying images and other data, and organizing images and other data. The other data may include but is not limited to video files, audio files, emails, SMS/MMS messages, other message files, text files, or presentations. In one aspect of the invention, memory 154 may include a DSR client 157. The DSR client 157 may convert an incoming stream from an ASR engine into recognized text.
Although shown as part of server 126, memory 156 could be remote storage coupled to server 126, such as an external drive or another storage device in communication with server 126. Preferably, server 126 also includes or is coupled to a display device 158 that may have a speaker 155, via a video interface (not shown). Display 158 may be a computer monitor, a television set, an LCD projector, or other type of display device. In at least some embodiments, server 126 also includes a speaker 155 over which audio clips (or audio portions of video clips) stored in memory 154 or 156 may be played.
In an aspect of the invention, a user may record some speech on his/her mobile device using a mobile device-based ASR application. When the user reaches his/her home or office, he/she may seamlessly begin to use an ASR application present on his/her PC or laptop. Thus, the user utilizes the mobility of his/her mobile device while on the move and takes advantage of the greater resources available to his/her PC-based ASR engine. In another aspect of the invention, a user may seamlessly move between different environments. For example, a first environment may include a noisy hallway, whereas a second environment may comprise a quiet office. In an embodiment, a first ASR engine used in the first environment (noisy hallway) may be tuned for a high ambient noise environment, whereas a second ASR engine employed in the second environment (quiet office) may be tuned for a lower ambient noise level. As the user moves from the first environment to the second environment, the ASR session may be transferred seamlessly, without the user knowing that the ASR session has been transferred from the first ASR engine to the second ASR engine.
Speech recognition systems provide the most probable decoding of the acoustic signal as the recognition output, but keep multiple hypotheses that are considered during the recognition.
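The idea of keeping several hypotheses while reporting only the most probable one can be sketched with a small word lattice. The `Arc` structure, the log-domain scores and the example words below are illustrative assumptions, not a format defined by this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    start: int       # source node id
    end: int         # destination node id (assumed greater than start)
    word: str        # word hypothesis carried by this arc
    acoustic: float  # log-domain acoustic score
    lm: float        # log-domain language model score


def best_path(arcs, first, last):
    """Return (score, words) of the highest-scoring path from first to last.

    Arcs are visited in order of their start node, which is a valid
    topological order because every arc points to a higher-numbered node.
    """
    best = {first: (0.0, [])}
    for arc in sorted(arcs, key=lambda a: a.start):
        if arc.start not in best:
            continue  # start node is not reachable from the first node
        score, words = best[arc.start]
        candidate = (score + arc.acoustic + arc.lm, words + [arc.word])
        if arc.end not in best or candidate[0] > best[arc.end][0]:
            best[arc.end] = candidate
    return best[last]


# Two competing hypotheses per segment; all of them stay in the lattice.
lattice = [
    Arc(0, 1, "recognize", -2.0, -1.0),
    Arc(0, 1, "wreck a nice", -2.5, -2.0),
    Arc(1, 2, "speech", -1.0, -0.5),
    Arc(1, 2, "beach", -1.2, -1.5),
]
score, words = best_path(lattice, 0, 2)
```

The losing arcs are not discarded, which is what would allow a second engine to reach a different result from the same lattice.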
In
As illustrated in
Exemplary lattice and state information is illustrated in
In an aspect of the invention, timing information along with the acoustic and language model scores may be transferred to ASR engine B 606 from ASR engine A 604. In an embodiment, if the speech signal is saved in the memory of the ASR engine for every utterance, the speech signal may also be transferred, depending on the bandwidth and quality of the connection between the ASR engines.
In an aspect of the invention, each of the ASR engines may use its own set of acoustic and language models to rescore the word lattice. As those skilled in the art will realize, the receiving or second ASR engine may use the acoustic models to rescore the lattice only if the speech data is available. If the recorded speech signal from the beginning of the sentence or phrase is not available, then it may be the case that only timing information is used and the new engine uses its own language model and the acoustic score from the lattice to find the spoken utterance. In addition, the lattice may not be encoded with words alone; it may also contain other acoustic information carried by the speech signal, such as prosody, speaker identity and language identity.
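The rescoring behavior described above can be sketched as follows. When no acoustic scorer is supplied (standing in for the case where the recorded speech is unavailable), the acoustic score carried in the lattice is kept and only the language model score is replaced. The `Arc` fields, the per-word (unigram) language model table and the `default_lm` penalty are simplifying assumptions for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Arc:
    start: int
    end: int
    word: str
    acoustic: float    # log-domain acoustic score from the first engine
    lm: float          # log-domain language model score from the first engine
    start_time: float  # seconds
    end_time: float    # seconds


def rescore(arcs, lm_scores, acoustic_scorer=None, default_lm=-10.0):
    """Return new arcs scored with the receiving engine's own models.

    lm_scores is a hypothetical word-to-score table standing in for the
    second engine's language model. acoustic_scorer, if given, is a callable
    (word, start_time, end_time) -> score that needs the speech data.
    """
    rescored = []
    for arc in arcs:
        new_lm = lm_scores.get(arc.word, default_lm)
        new_acoustic = arc.acoustic  # reuse the transferred score by default
        if acoustic_scorer is not None:
            new_acoustic = acoustic_scorer(arc.word, arc.start_time, arc.end_time)
        rescored.append(replace(arc, acoustic=new_acoustic, lm=new_lm))
    return rescored


received = [Arc(0, 1, "speech", -1.0, -0.5, 0.0, 0.4)]
lm_only = rescore(received, {"speech": -0.2})
```

A real engine would typically use an n-gram or neural language model over word sequences rather than a per-word table.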
In an aspect of the invention, an ASR session transfer between the two ASR engines may include session establishment and context information transfer. In session establishment, standard signaling protocols such as HTTP or SIP may be used to provide the high-level framework to establish a session. This may provide for parameter negotiation before establishment and could be used to agree on the formats or syntax to be used to transport and interpret the context information from one ASR engine to another. In another aspect of the invention, session establishment may also include verifying the usefulness of the first ASR engine's context information to the second ASR engine.
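A minimal sketch of the negotiation step, independent of any particular signaling protocol, might look like the following. The offer/answer style, the format labels and the language-match test for usefulness are assumptions chosen for illustration; the description does not prescribe them.

```python
def negotiate_format(offered, supported):
    """Return the first offered format the receiving engine supports, or None.

    offered is ordered by the sending engine's preference.
    """
    for fmt in offered:
        if fmt in supported:
            return fmt
    return None


def context_is_useful(sender_ctx, receiver_ctx):
    """Hypothetical usefulness check: both engines must target the same language."""
    return sender_ctx.get("language") == receiver_ctx.get("language")


chosen = negotiate_format(["xml", "sdp", "ascii"], {"ascii", "sdp"})
useful = context_is_useful({"language": "en-US"}, {"language": "en-US"})
```

If no common format is found, or the context is judged not useful, the second engine would presumably start a fresh session rather than continue from the transferred state.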
In another aspect of the invention, context information transfer may include formatting lattice information in a mutually agreed syntax and format. The lattice information may be transferred from one ASR engine to another using any commonly used representation techniques such as SDP, XML, ASCII-Text file, or any other format deemed suitable by the two engines involved in the session transfer.
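As one possible representation, lattice arcs could be written to and read from XML using Python's standard library. The element name `arc` and the attribute names are assumptions for this sketch and do not reflect an agreed syntax.

```python
import xml.etree.ElementTree as ET


def lattice_to_xml(arcs):
    """Serialize a list of arc dictionaries into an XML string."""
    root = ET.Element("lattice")
    for arc in arcs:
        # XML attribute values must be strings.
        ET.SubElement(root, "arc", {key: str(value) for key, value in arc.items()})
    return ET.tostring(root, encoding="unicode")


def lattice_from_xml(text):
    """Parse the XML produced by lattice_to_xml back into arc dictionaries."""
    arcs = []
    for el in ET.fromstring(text).findall("arc"):
        arcs.append({
            "start": int(el.get("start")),
            "end": int(el.get("end")),
            "word": el.get("word"),
            "acoustic": float(el.get("acoustic")),
            "lm": float(el.get("lm")),
            "start_time": float(el.get("start_time")),
            "end_time": float(el.get("end_time")),
        })
    return arcs


original = [
    {"start": 0, "end": 1, "word": "recognize", "acoustic": -2.0,
     "lm": -1.0, "start_time": 0.0, "end_time": 0.6},
    {"start": 1, "end": 2, "word": "speech", "acoustic": -1.0,
     "lm": -0.5, "start_time": 0.6, "end_time": 1.1},
]
restored = lattice_from_xml(lattice_to_xml(original))
```

The round trip preserves timing and both scores, which is the information the receiving engine needs in order to continue from the sending engine's end point.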
The embodiments herein include any feature or combination of features disclosed herein either explicitly or any generalization thereof. While the invention has been described with respect to specific examples including presently preferred modes of carrying out the invention, those skilled in the art will appreciate that there are numerous variations and permutations of the above described systems and techniques.