Continuous speech transcription performance indication

Information

  • Patent Grant
  • Patent Number
    9,583,107
  • Date Filed
    Friday, October 17, 2014
  • Date Issued
    Tuesday, February 28, 2017
Abstract
A method of providing speech transcription performance indication includes receiving, at a user device, data representing text transcribed from an audio stream by an ASR system, and data representing a metric associated with the audio stream; displaying, via the user device, said text; and via the user device, providing, in user-perceptible form, an indicator of said metric. Another method includes displaying, by a user device, text transcribed from an audio stream by an ASR system; and via the user device, providing, in user-perceptible form, an indicator of a level of background noise of the audio stream. Another method includes receiving data representing an audio stream; converting said data representing an audio stream to text via an ASR system; determining a metric associated with the audio stream; transmitting data representing said text to a user device; and transmitting data representing said metric to the user device.
Description
II. COPYRIGHT STATEMENT

All of the material in this patent document is subject to copyright protection under the copyright laws of the United States and of other countries. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the governmental files or records, but otherwise reserves all copyright rights whatsoever.


III. BACKGROUND OF THE PRESENT INVENTION

Automatic Speech Recognition (“ASR”) systems convert spoken audio into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text messages), by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).


As their accuracy has improved, ASR systems have become commonplace in recent years. For example, ASR systems have found wide application in the customer service centers of companies, where vendors offer middleware and solutions for contact centers that, for example, answer and route calls to decrease costs for airlines, banks, etc. In order to accomplish this, companies such as IBM and Nuance create assets known as IVR (Interactive Voice Response) systems that answer the calls, then use ASR paired with TTS (Text-To-Speech) software to decode what the caller is saying and communicate back to them.


More recently, ASR systems have found application with regard to text messaging. Text messaging usually involves the input of a text message by a sender who presses letters and/or numbers associated with the sender's mobile phone. As recognized, for example, in the aforementioned, commonly-assigned U.S. patent application Ser. No. 11/697,074, it can be advantageous to make text messaging far easier for an end user by allowing the user to dictate his or her message rather than requiring the user to type it into his or her phone. In certain circumstances, such as when a user is driving a vehicle, typing a text message may not be possible and/or convenient, and may even be unsafe. On the other hand, text messages can be advantageous to a message receiver as compared to voicemail, as the receiver actually sees the message content in a written format rather than having to rely on an auditory signal.


Many other applications for speech recognition and ASR systems will be recognized as well.


Of course, the usefulness of an ASR system is generally only as good as its speech recognition accuracy. Recognition accuracy for a particular utterance can vary based on many factors including the audio fidelity of the recorded speech, correctness of the speaker's pronunciation, and the like. The contribution of these factors to a recognition failure is complex and may not be obvious to an ASR system user when a transcription error occurs. The only indication that an error has occurred may be the resulting (incorrect) transcription text.


Some ASR systems are able to provide an indication of confidence in the transcription performance. The confidence might be expressed as a number, such as a percentage on a scale of 0% to 100%. In addition, an indication of interference (background noise, etc.) may be given. However, known systems do not provide an approach whereby transcription metrics, such as metrics relating to confidence or interference, can be communicated to the user of an ASR system by graphical or audio integration into the results of the transcription, while minimizing user interface clutter and distraction.
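The idea of integrating a confidence metric directly into the rendered transcription, rather than showing it as a separate number, can be sketched as follows. This is an illustrative sketch only: the thresholds and CSS attributes are assumptions, not taken from the patent.

```python
# Illustrative sketch: map a per-word confidence metric (0-100%) onto
# inline text styling, so the metric is integrated into the transcription
# itself instead of cluttering the interface with separate figures.

def confidence_style(confidence_pct):
    # Low confidence: red and underlined; medium: orange; high: plain black.
    if confidence_pct < 60:
        return {"color": "red", "text-decoration": "underline"}
    if confidence_pct < 80:
        return {"color": "orange", "text-decoration": "none"}
    return {"color": "black", "text-decoration": "none"}

def style_words(word_confidences):
    """word_confidences: list of (word, confidence_pct) pairs."""
    spans = []
    for word, conf in word_confidences:
        css = ";".join(f"{k}:{v}" for k, v in confidence_style(conf).items())
        spans.append(f'<span style="{css}">{word}</span>')
    return " ".join(spans)

styled = style_words([("Please", 90), ("meet", 92), ("toffee", 50)])
```

A reader scanning the styled output immediately sees which words the recognizer was unsure of, without any extra on-screen numbers.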


Additionally, when speech is transcribed to text, some natural speech elements can be lost during the transcription process. Specifically, verbal volume, or emphasis, as well as pauses between words and phrases, are difficult to render within a language model. Known systems do not provide an approach for at least partially compensating for these shortcomings by recreating these missing elements as visual cues.
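One way to recreate those missing elements as visual cues can be sketched as follows: font size tracks spoken volume (emphasis), and long silences become visible gaps. The word list, volume scale, and pause threshold are illustrative assumptions only.

```python
# Hypothetical sketch: recreate emphasis and pauses, which are lost in
# plain transcription, as visual cues in the rendered text.

def render_with_cues(words):
    """words: list of (text, volume_pct, preceding_pause_ms) tuples."""
    spans = []
    for text, volume, pause in words:
        if pause > 500:                       # long pause -> visible gap
            spans.append("...")
        size = 10 + round(volume / 10)        # map 0-100% volume to 10-20 pt
        spans.append(f'<span style="font-size:{size}pt">{text}</span>')
    return " ".join(spans)

html = render_with_cues([("Please", 50, 0), ("meet", 55, 100),
                         ("me", 45, 50), ("for", 40, 50),
                         ("COFFEE", 90, 800), ("at", 45, 100),
                         ("one", 50, 100)])
```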


IV. SUMMARY OF THE INVENTION

The present invention includes many aspects and features. Moreover, while many aspects and features relate to, and are described in, the context of speech transcription performance indication, the present invention is not limited to use only in speech transcription performance indication, as will become apparent from the following summaries and detailed descriptions of aspects, features, and one or more embodiments of the present invention.


Accordingly, one aspect of the present invention relates to a method of providing speech transcription performance indication. The method includes displaying, by a user device, text transcribed from an audio stream by an ASR system; and via the user device, providing, in user-perceptible form, an indicator of a level of background noise of the audio stream.


In a feature of this aspect of the invention, the method further includes, before the displaying step, receiving, by the user device, the transcribed text from the ASR system.


Another aspect of the present invention relates to a method of providing speech transcription performance indication. The method includes receiving, at a user device, data representing text transcribed from an audio stream by an ASR system, and data representing a metric associated with the audio stream; displaying, via the user device, said text; and via the user device, providing, in user-perceptible form, an indicator of said metric.


In a feature of this aspect of the invention, said transcribed text was converted at a server from data representing said audio stream transmitted to the server from a transmitting device.


In a feature of this aspect of the invention, said transcribed text was confirmed by a user of said transmitting device prior to said receiving step.


In a feature of this aspect of the invention, said transcribed text was confirmed prior to said receiving step.


In a feature of this aspect of the invention, said text comprises a word and said metric is a metric of said word.


In a feature of this aspect of the invention, said text comprises a plurality of words and said metric is one of a plurality of metrics represented by data received at the user device and provided via the user device, each of said plurality of metrics being a metric of a distinct one of said plurality of words.


In a feature of this aspect of the invention, said text comprises a syllable and said metric is a metric of said syllable.


In a feature of this aspect of the invention, said text comprises a plurality of syllables and said metric is one of a plurality of metrics represented by data received at the user device and provided via the user device, each of said plurality of metrics being a metric of a distinct one of said plurality of syllables.


In a feature of this aspect of the invention, said text comprises a phrase and said metric is a metric of said phrase.


In a feature of this aspect of the invention, said text comprises a plurality of phrases and said metric is one of a plurality of metrics represented by data received at the user device and provided via the user device, each of said plurality of metrics being a metric of a distinct one of said plurality of phrases.


In a feature of this aspect of the invention, said text comprises a sentence and said metric is a metric of said sentence.


In a feature of this aspect of the invention, said text comprises a plurality of sentences and said metric is one of a plurality of metrics represented by data received at the user device and provided via the user device, each of said plurality of metrics being a metric of a distinct one of said plurality of sentences.


In a feature of this aspect of the invention, said text comprises a plurality of units and said metric is one of a plurality of metrics represented by data received at the user device and provided via the user device, each of said plurality of metrics being a metric of one of said plurality of units, and each of said units being either a word, a sentence, a phrase, or a syllable.


In a feature of this aspect of the invention, said user device is a mobile phone.


In a feature of this aspect of the invention, said data representing transcribed text and said data representing said metric are received in the same manner as data representing a text message.


In a feature of this aspect of the invention, said text is displayed as a text message.


In a feature of this aspect of the invention, said user device is a computer.


In a feature of this aspect of the invention, said data representing transcribed text and said data representing said metric are received in the same manner as data representing an instant message.


In a feature of this aspect of the invention, said text is displayed as an instant message.


In a feature of this aspect of the invention, said data representing transcribed text and said data representing said metric are received in the same manner as data representing an email.


In a feature of this aspect of the invention, said text is displayed as an email.


In a feature of this aspect of the invention, said metric associated with the audio stream comprises a volume of the audio stream.


In a feature of this aspect of the invention, said metric associated with the audio stream comprises background noise of the audio stream.


In a feature of this aspect of the invention, said metric associated with the audio stream comprises a confidence level of the audio stream.


In a feature of this aspect of the invention, said indicator comprises a font color.


In a feature of this aspect of the invention, said indicator comprises a font weight.


In a feature of this aspect of the invention, said indicator comprises a font size.


In a feature of this aspect of the invention, said indicator comprises underlining.


In a feature of this aspect of the invention, said indicator comprises an audible indicator.


Another aspect of the present invention relates to a method of providing speech transcription performance indication. The method includes receiving data representing an audio stream; converting said data representing an audio stream to text via an ASR system; determining a metric associated with the audio stream; transmitting data representing said text to a user device; and transmitting data representing said metric to the user device.


In addition to the aforementioned aspects and features of the present invention, it should be noted that the present invention further encompasses the various possible combinations and subcombinations of such aspects and features.





V. BRIEF DESCRIPTION OF THE DRAWINGS

Further aspects, features, embodiments, and advantages of the present invention will become apparent from the following detailed description with reference to the drawings, wherein:



FIG. 1 is a block diagram of a communication system in accordance with a preferred embodiment of the present invention;



FIG. 2 is a block diagram of a communication system in accordance with another preferred embodiment of the present invention;



FIG. 3 is a block diagram of an exemplary implementation of the system of FIG. 1;



FIG. 4 is a schematic diagram illustrating the operation of continuous speech transcription performance indication in conjunction with a portion of the communication system of FIGS. 1 and 3;



FIG. 5 is a bar graph illustrating the volume of each word in the utterance of FIG. 4;



FIG. 6 is a bar graph illustrating the level of background noise present during each word in the utterance of FIG. 4;



FIG. 7 is a bar graph illustrating the confidence level for each word in the utterance of FIG. 4;



FIG. 8 is an XML fragment describing the volume, background noise and confidence level for each word in the utterance of FIG. 4;



FIGS. 9A-9F are graphical depictions, on a receiving device, of the transcription of the utterance of FIG. 4 using performance indications for each word thereof;



FIG. 10 is a block diagram illustrating the operation of continuous speech transcription verbal loudness or emphasis and pause or silence indication in conjunction with a portion of the communication system of FIG. 2;



FIG. 11 is a graphical depiction, on a receiving device, of the transcription of the utterance of FIG. 10 using font size and spacing to indicate emphasis and silent spaces between portions of the utterance;



FIG. 12 is a block diagram of the system architecture of one commercial implementation;



FIG. 13 is a block diagram of a portion of FIG. 12;



FIG. 14 is a typical header section of an HTTP request from the client in the commercial implementation;



FIG. 15 illustrates exemplary protocol details for a request for a location of a login server and a subsequent response;



FIG. 16 illustrates exemplary protocol details for a login request and a subsequent response;



FIG. 17 illustrates exemplary protocol details for a submit request and a subsequent response;



FIG. 18 illustrates exemplary protocol details for a results request and a subsequent response;



FIG. 19 illustrates exemplary protocol details for an XML hierarchy returned in response to a results request;



FIG. 20 illustrates exemplary protocol details for a text to speech request and a subsequent response;



FIG. 21 illustrates exemplary protocol details for a correct request;



FIG. 22 illustrates exemplary protocol details for a ping request; and



FIG. 23 illustrates exemplary protocol details for a debug request.





VI. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As a preliminary matter, it will readily be understood by one having ordinary skill in the relevant art (“Ordinary Artisan”) that the present invention has broad utility and application. Furthermore, any embodiment discussed and identified as being “preferred” is considered to be part of a best mode contemplated for carrying out the present invention. Other embodiments also may be discussed for additional illustrative purposes in providing a full and enabling disclosure of the present invention. Moreover, many embodiments, such as adaptations, variations, modifications, and equivalent arrangements, will be implicitly disclosed by the embodiments described herein and fall within the scope of the present invention.


Accordingly, while the present invention is described herein in detail in relation to one or more embodiments, it is to be understood that this disclosure is illustrative and exemplary of the present invention, and is made merely for the purposes of providing a full and enabling disclosure of the present invention. The detailed disclosure herein of one or more embodiments is not intended, nor is to be construed, to limit the scope of patent protection afforded the present invention, which scope is to be defined by the claims and the equivalents thereof. It is not intended that the scope of patent protection afforded the present invention be defined by reading into any claim a limitation found herein that does not explicitly appear in the claim itself.


Thus, for example, any sequence(s) and/or temporal order of steps of various processes or methods that are described herein are illustrative and not restrictive. Accordingly, it should be understood that, although steps of various processes or methods may be shown and described as being in a sequence or temporal order, the steps of any such processes or methods are not limited to being carried out in any particular sequence or order, absent an indication otherwise. Indeed, the steps in such processes or methods generally may be carried out in various different sequences and orders while still falling within the scope of the present invention. Accordingly, it is intended that the scope of patent protection afforded the present invention is to be defined by the appended claims rather than the description set forth herein.


Additionally, it is important to note that each term used herein refers to that which the Ordinary Artisan would understand such term to mean based on the contextual use of such term herein. To the extent that the meaning of a term used herein—as understood by the Ordinary Artisan based on the contextual use of such term—differs in any way from any particular dictionary definition of such term, it is intended that the meaning of the term as understood by the Ordinary Artisan should prevail.


Furthermore, it is important to note that, as used herein, “a” and “an” each generally denotes “at least one,” but does not exclude a plurality unless the contextual use dictates otherwise. Thus, reference to “a picnic basket having an apple” describes “a picnic basket having at least one apple” as well as “a picnic basket having apples.” In contrast, reference to “a picnic basket having a single apple” describes “a picnic basket having only one apple.”


When used herein to join a list of items, “or” denotes “at least one of the items,” but does not exclude a plurality of items of the list. Thus, reference to “a picnic basket having cheese or crackers” describes “a picnic basket having cheese without crackers”, “a picnic basket having crackers without cheese”, and “a picnic basket having both cheese and crackers.” Finally, when used herein to join a list of items, “and” denotes “all of the items of the list.” Thus, reference to “a picnic basket having cheese and crackers” describes “a picnic basket having cheese, wherein the picnic basket further has crackers,” as well as describes “a picnic basket having crackers, wherein the picnic basket further has cheese.”


Referring now to the drawings, in which like numerals represent like components throughout the several views, the preferred embodiments of the present invention are next described. The following description of the preferred embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.



FIG. 1 is a block diagram of a communication system 10 in accordance with a preferred embodiment of the present invention. As shown therein, the communication system 10 includes at least one transmitting device 12 and at least one receiving device 14, one or more network systems 16 for connecting the transmitting device 12 to the receiving device 14, and an ASR system 18, including an ASR engine. Transmitting and receiving devices 12,14 may include cell phones 21, smart phones 22, PDAs 23, tablet notebooks 24, various desktop and laptop computers 25,26,27, and the like. One or more of the devices 12,14, such as the illustrated iMac and laptop computers 25,26, may connect to the network systems 16 via wireless access point 28. The various transmitting and receiving devices 12,14 (one or both types of which being sometimes referred to herein as “client devices”) may be of any conventional design and manufacture.



FIG. 2 is a block diagram of a communication system 60 in accordance with another preferred embodiment of the present invention. This system 60 is similar to the system 10 of FIG. 1, except that the ASR system 18 of FIG. 1 has been omitted and the ASR engine has instead been incorporated into the various transmitting devices 12, including cell phones 61, smart phones 62, PDAs 63, tablet notebooks 64, various desktop and laptop computers 65,66,67, and the like.


It will be appreciated that the illustrations of FIGS. 1 and 2 are intended primarily to provide context in which the inventive features of the present invention may be placed. A more complete explanation of one or more system architectures implementing such systems is provided elsewhere herein, in the incorporated applications and/or in the incorporated Appendices attached hereto. Furthermore, in the context of text messaging, the communication systems 10,60 each preferably includes, inter alia, a telecommunications network. In the context of instant messaging, the communications systems 10,60 each preferably includes, inter alia, the Internet.


More particularly, and as described, for example, in the aforementioned U.S. Patent Application Pub. No. US 2007/0239837, FIG. 3 is a block diagram of an exemplary implementation of the system 10 of FIG. 1. In this implementation, the transmitting device 12 is a mobile phone, the ASR system 18 is implemented in one or more backend servers 160, and the one or more network systems 16 include transceiver towers 130, one or more mobile communication service providers 140 (operating under joint or independent control) and the Internet 150. The backend server 160 is or may be placed in communication with the mobile phone 12 via the mobile communication service provider 140 and the Internet 150. The mobile phone has a microphone, a speaker and a display.


A first transceiver tower 130A is positioned between the mobile phone 12 (or the user 32 of the mobile phone 12) and the mobile communication service provider 140, for receiving an audio message (V1), a text message (T3) and/or a verified text message (V/T1) from one of the mobile phone 12 and the mobile communication service provider 140 and transmitting it (V2, T4, V/T2) to the other of the mobile phone 12 and the mobile communication service provider 140. A second transceiver tower 130B is positioned between the mobile communication service provider 140 and mobile devices 170, generally defined as receiving devices 14 equipped to communicate wirelessly via mobile communication service provider 140, for receiving a verified text message (V/T3) from the mobile communication service provider 140 and transmitting it (V5 and T5) to the mobile devices 170. In at least some embodiments, the mobile devices 170 are adapted for receiving a text message converted from an audio message created in the mobile phone 12. Additionally, in at least some embodiments, the mobile devices 170 are also capable of receiving an audio message from the mobile phone 12. The mobile devices 170 include, but are not limited to, a pager, a palm PC, a mobile phone, or the like.


The system 10 also includes software, as disclosed below in more detail, installed in the mobile phone 12 and the backend server 160 for causing the mobile phone 12 and/or the backend server 160 to perform the following functions. The first step is to initialize the mobile phone 12 to establish communication between the mobile phone 12 and the backend server 160, which includes initializing a desired application from the mobile phone 12 and logging into a user account in the backend server 160 from the mobile phone 12. Then, the user 32 presses and holds one of the buttons of the mobile phone 12 and speaks an utterance, thus generating an audio message, V1. At this stage, the audio message V1 is recorded in the mobile phone 12. By releasing the button, the recorded audio message V1 is sent to the backend server 160 through the mobile communication service provider 140.


In the exemplary embodiment of the present invention as shown in FIG. 3, the recorded audio message V1 is first transmitted to the first transceiver tower 130A from the mobile phone 12. The first transceiver tower 130A outputs the audio message V1 into an audio message V2 that is, in turn, transmitted to the mobile communication service provider 140. Then the mobile communication service provider 140 outputs the audio message V2 into an audio message V3 and transmits it (V3) to the Internet 150. The Internet 150 outputs the audio message V3 into an audio message V4 and transmits it (V4) to the backend server 160. The content of all the audio messages V1-V4 is identical.


The backend server 160 then converts the audio message V4 into a text message, T1, and/or a digital signal, D1, in the backend server 160 by means of a speech recognition algorithm including a grammar algorithm and/or a transcription algorithm. The text message T1 and the digital signal D1 correspond to two different formats of the audio message V4. The text message T1 and/or the digital signal D1 are sent back to the Internet 150 that outputs them into a text message T2 and a digital signal D2, respectively.


The digital signal D2 is transmitted to a digital receiver 180, generally defined as a receiving device 14 equipped to communicate with the Internet and capable of receiving the digital signal D2. In at least some embodiments, the digital receiver 180 is adapted for receiving a digital signal converted from an audio message created in the mobile phone 12. Additionally, in at least some embodiments, the digital receiver 180 is also capable of receiving an audio message from the mobile phone 12. A conventional computer is one example of a digital receiver 180. In this context, a digital signal D2 may represent, for example, an email or instant message.


It should be understood that, depending upon the configuration of the backend server 160 and software installed on the mobile phone 12, and potentially based upon the system set up or preferences of the user 32, the digital signal D2 can either be transmitted directly from the backend server 160 or it can be provided back to the mobile phone 12 for review and acceptance by the user 32 before it is sent on to the digital receiver 180.


The text message T2 is sent to the mobile communication service provider 140 that outputs it (T2) into a text message T3. The output text message T3 is then transmitted to the first transceiver tower 130A. The first transceiver tower 130A then transmits it (T3) to the mobile phone 12 in the form of a text message T4. It is noted that the substantive content of all the text messages T1-T4 may be identical, each being the corresponding text form of the audio messages V1-V4.


Upon receiving the text message T4, the user 32 verifies it and sends the verified text message V/T1 to the first transceiver tower 130A that, in turn, transmits it to the mobile communication service provider 140 in the form of a verified text V/T2. The verified text V/T2 is transmitted to the second transceiver tower 130B in the form of a verified text V/T3 from the mobile communication service provider 140. Then, the transceiver tower 130B transmits the verified text V/T3 to the mobile devices 170.


In at least one implementation, the audio message is simultaneously transmitted to the backend server 160 from the mobile phone 12, when the user 32 speaks to the mobile phone 12. In this circumstance, it is preferred that no audio message is recorded in the mobile phone 12, although it is possible that an audio message could be both transmitted and recorded.


Such a system may be utilized to convert an audio message into a text message. In at least one implementation, this may be accomplished by first initializing a transmitting device so that the transmitting device is capable of communicating with a backend server 160. Second, a user 32 speaks to or into the client device so as to create a stream of an audio message. The audio message can be recorded and then transmitted to the backend server 160, or the audio message can be simultaneously transmitted to the backend server 160 through a client-server communication protocol. Streaming may be accomplished according to processes described elsewhere herein and, in particular, in FIG. 4, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837. The transmitted audio message is converted into the text message in the backend server 160. The converted text message is then sent back to the client device 12. Upon the user's verification, the converted text message is forwarded to one or more recipients 34 and their respective receiving devices 14, where the converted text message may be displayed on the device 14. Incoming messages may be handled, for example, according to processes described elsewhere herein and, in particular, in FIG. 2, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837.
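The conversion flow just described (record, transmit, convert on the server, return for verification, forward only on acceptance) can be sketched in deliberately simplified form. All function names below are illustrative stand-ins, not an actual API of the described system.

```python
# Hypothetical, simplified sketch of the audio-to-text message flow:
# the server converts the audio, the client verifies the result, and the
# message is forwarded to recipients only after the user's acceptance.

def transcribe_on_server(audio_bytes):
    # Stand-in for the backend ASR conversion (audio V4 -> text T1).
    return "please meet me for coffee at one"

def send_message_flow(audio_bytes, verify):
    text = transcribe_on_server(audio_bytes)      # server-side conversion
    if verify(text):                              # user reviews the text
        return {"status": "sent", "text": text}   # forwarded to recipients
    return {"status": "held", "text": text}       # held pending correction

result = send_message_flow(b"\x00\x01", verify=lambda t: "coffee" in t)
```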


Additionally, in at least one implementation, advertising messages and/or icons may be displayed on one or both types of client device 12,14 according to keywords contained in the converted text message, wherein the keywords are associated with the advertising messages and/or icons.


Still further, in at least one implementation, one or both types of client device 12,14 may be located through a global positioning system (GPS); and listing locations, proximate to the position of the client device 12,14, of a target of interest may be presented in the converted text message.



FIG. 4 is a block diagram illustrating communications between two users 32,34 via a portion of the communication system 10 of FIGS. 1 and 3. As shown therein, a first user 32, sometimes referred to herein as a transmitting user, is communicating with a second user 34, sometimes referred to herein as a receiving user, by way of respective transmitting and receiving devices 12,14. In the context of text messaging, the transmitting user 32 may send text messages using his transmitting device 12, for example via SMS, and the receiving user 34 receives text messages on his receiving device 14, in this case also via SMS. In the context of instant messaging, the transmitting user 32 may send instant messages via an IM client using his transmitting device 12, and the receiving user 34 receives instant messages on his receiving device 14 via an IM client. In either case, the transmitting user 32 preferably speaks into his transmitting device 12 with his utterances being converted to text for communicating to the receiving device 14, all as more fully described hereinbelow.


When the first user 32 speaks an utterance 36 into the transmitting device 12, the recorded speech audio is sent to the ASR system 18, as described previously. In the example of FIG. 4, the utterance 36 is “Please meet me for coffee at one.” The ASR engine in the ASR system 18 attempts to recognize and transcribe the utterance 36 into text. Speech recognition requests received by the ASR engine may be handled, for example, according to processes described elsewhere herein and, in particular, in FIG. 3, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837. Further, speech recognition may be carried out, for example, according to processes described elsewhere herein and, in particular, in FIGS. 6A-6H, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837.


It will be appreciated that automated transcription of recorded utterances 36 is useful in other environments and applications as well. For example, in another system (not separately illustrated), a user speaks an utterance 36 into a device as a voicemail, and the recorded speech audio is sent to the ASR system 18. Other applications to which the teachings of the present invention are applicable will be apparent to the Ordinary Artisan.


During the recording, recognition and transcription process, various parameters may be measured or otherwise determined. For example, the volume of each word in the utterance 36 may be measured, the background noise present during each word in the utterance 36 may be measured, and a confidence level (referring to the relative level of confidence the ASR system 18 has that the particular word has been converted into text properly) may be determined. In this regard, FIG. 5 is a bar graph illustrating the volume of each word in the utterance 36 of FIG. 4; FIG. 6 is a bar graph illustrating the level of background noise present during each word in the utterance 36 of FIG. 4; and FIG. 7 is a bar graph illustrating the confidence level for each word in the utterance 36 of FIG. 4. It will be appreciated that a variety of additional parameters may likewise be measured or otherwise determined, and that different combinations of parameters may be chosen.


When the ASR 18 returns the transcription results text, it also returns a stream of metrics that are linked to the text elements. The resulting parameters may be coupled with the transcribed speech on a word-by-word basis, a syllable-by-syllable basis, a phrase-by-phrase basis, a sentence-by-sentence basis, or the like, and placed into any desired format. At least some embodiments may utilize XML fragments that may be passed around as necessary. FIG. 8 is an XML fragment describing the volume, background noise and confidence level for each word in the utterance 36 of FIG. 4.
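The per-word coupling of metrics to text can be illustrated with a small XML fragment of the kind FIG. 8 describes. Since FIG. 8 is not reproduced here, the element and attribute names below are illustrative assumptions, as are the volume and noise values; only the 50% confidence for “toffee” is taken from the accompanying discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-word metrics for the utterance of FIG. 4; the schema
# (element/attribute names) is an assumption, not the actual FIG. 8 format.
words = [
    ("Please", 72, 10, 88),
    ("meet",   70, 12, 91),
    ("me",     68, 11, 85),
    ("for",    60, 20, 65),
    ("toffee", 74, 55, 50),   # noise spike during "coffee" -> low confidence
    ("at",     69, 12, 90),
    ("one",    71, 10, 92),
]

root = ET.Element("transcription")
for text, volume, noise, confidence in words:
    word = ET.SubElement(root, "word", {
        "volume": str(volume),            # relative level during the word
        "background_noise": str(noise),   # relative noise level
        "confidence": str(confidence),    # percent, 0-100
    })
    word.text = text

xml_fragment = ET.tostring(root, encoding="unicode")
print(xml_fragment)
```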


As illustrated in FIGS. 6 and 8, it will be noted that a noise spike occurred while the speaker was saying “coffee,” causing the ASR 18 to mis-recognize the word as “toffee.” However, the ASR 18 also noted in the results that confidence in the word “toffee” was only 50%, which was generally lower than the confidence for the rest of the utterance 36. Other words in the utterance 36 were recognized correctly, but also had varying metric levels.


If the results shown in FIGS. 5-7 are returned to the transmitting device 12 for verification by the first user 32 prior to transmission to the receiving device 14 of the second user 34, as described, for example, in FIG. 3 and accompanying text, then the first user 32 may discover the error involving “coffee” and “toffee” and may further take steps to correct the problem, such as by executing the process again (perhaps using better diction) or by manually editing the results text presented to him. However, if the first user 32 fails to discover the error before causing the message to be transmitted to the receiving device 14, if the first user 32 ignores the error, or if the message is sent directly to the receiving device 14 without first being presented to the first user 32 for verification, then the original recognition results, including the error involving “coffee” and “toffee,” are provided to the receiving device 14 for presentation to the second user 34. According to the present invention, various approaches may be implemented in order to provide the second user 34 with information about the message that may aid the user 34 in assessing the likely accuracy of the message. Furthermore, other parameters pertaining to the message may likewise be provided to the receiving device 14 for presentation to the second user 34, whether or not the first user verifies, accurately or not, the results text. According to the present invention, various approaches may be implemented in order to provide the second user 34 with such additional information about the message that may aid the user 34 in better understanding or assessing the message.


When presenting the recognition results to the user 34 in a visual context, there are many options available to integrate ASR metrics with the results text. For example, graphical elements can be added to the textual representation of the results in several ways including, but not limited to, the use of font color, font grayscale, font weight (bold, etc.), font size, underlining, or any combination thereof. FIG. 9A is a graphical depiction, on a receiving device 14, of the transcription of the utterance 36 of FIG. 4 using font color to indicate confidence level, wherein the words “Please meet me” and “at one” appear in green to indicate a confidence level of 80-100%, the word “for” appears in orange to indicate a confidence level of 60-79%, and the word “toffee” appears in red to indicate a confidence level of 59% or below. It will be appreciated that the meaning of the various colors may be varied, greater or fewer numbers of colors may be used, different colors may be chosen, different thresholds may be chosen, or the like, all without departing from the scope of the invention.
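The color banding just described can be sketched as a simple threshold mapping. The band boundaries (80-100%, 60-79%, 59% and below) follow the example in the text; as the text notes, both the bands and the colors are configurable choices rather than part of the technique itself.

```python
def confidence_color(confidence):
    """Map a per-word confidence percentage to a display color.

    Thresholds mirror the example bands in the text; both the bands
    and the colors are configurable, not fixed by the technique.
    """
    if confidence >= 80:
        return "green"
    elif confidence >= 60:
        return "orange"
    return "red"

# The transcription of FIG. 4, rendered as (word, color) pairs
# (the confidence values other than "toffee" are hypothetical):
transcript = [("Please", 88), ("meet", 91), ("me", 85), ("for", 65),
              ("toffee", 50), ("at", 90), ("one", 92)]
rendered = [(word, confidence_color(conf)) for word, conf in transcript]
print(rendered)
```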



FIG. 9B is a graphical depiction, on a receiving device 14, of the transcription of the utterance 36 of FIG. 4 using font grayscale to indicate confidence level, wherein the words “Please meet me” and “at one” appear in black to indicate a confidence level of 80-100%, the word “for” appears in medium gray to indicate a confidence level of 60-79%, and the word “toffee” appears in light gray to indicate a confidence level of 59% or below. It will be appreciated that the meaning of the various shades of gray may be varied, greater or fewer numbers of shades of gray may be used, different shades of gray may be chosen, different thresholds may be chosen, or the like, all without departing from the scope of the invention.



FIG. 9C is a graphical depiction, on a receiving device 14, of the transcription of the utterance 36 of FIG. 4 using font weight to indicate confidence level, wherein the words “Please meet me” and “at one” appear in double bold font to indicate a confidence level of 80-100%, the word “for” appears in bold font to indicate a confidence level of 60-79%, and the word “toffee” appears in normal font to indicate a confidence level of 59% or below. It will be appreciated that the meaning of the various font weights may be varied, greater or fewer numbers of font weights may be used, different font weights may be chosen, different thresholds may be chosen, or the like, all without departing from the scope of the invention.



FIG. 9D is a graphical depiction, on a receiving device 14, of the transcription of the utterance 36 of FIG. 4 using font size to indicate confidence level, wherein the words “Please meet me” and “at one” appear in font size 18 to indicate a confidence level of 80-100%, the word “for” appears in font size 14 to indicate a confidence level of 60-79%, and the word “toffee” appears in font size 10 to indicate a confidence level of 59% or below. It will be appreciated that the meaning of the various font sizes may be varied, greater or fewer numbers of font sizes may be used, different font sizes may be chosen, different thresholds may be chosen, or the like, all without departing from the scope of the invention.



FIG. 9E is a graphical depiction, on a receiving device 14, of the transcription of the utterance 36 of FIG. 4 using underlining to indicate confidence level, wherein the words “Please meet me” and “at one” appear without underlining to indicate a confidence level of 80-100%, the word “for” is single underlined to indicate a confidence level of 60-79%, and the word “toffee” is double underlined to indicate a confidence level of 59% or below. It will be appreciated that the meaning of the various underlinings may be varied, greater or fewer numbers of underlinings may be used, different underlining styles may be chosen, different thresholds may be chosen, or the like, all without departing from the scope of the invention.


A combination of indications could be used to emphasize the various parameter levels. For example, the words “Please meet me” and “at one” could appear in black, double bold, size 18 font, without underlining, to indicate a confidence level of 80-100%, the word “for” could appear in medium gray, bold, single underlined size 14 font to indicate a confidence level of 60-79%, and the word “toffee” could appear in light gray, normal, double underlined size 10 font to indicate a confidence level of 59% or below.


This general technique could integrate any text formatting or continuous graphical element (background, shading, color wash, texture, font type, etc.) to communicate one or more ASR metrics to the receiving user 34. In addition to the variations for each indication type described previously, it will further be appreciated that any combination of different indication types may be utilized, the meaning of the various indication types may be varied, greater or fewer numbers of indication types may be used, or the like, all without departing from the scope of the invention. It will still further be appreciated that one type of indication may be used for one parameter and another type of indication may be used simultaneously for a different parameter. For example, in FIG. 9F, font color has been used to indicate confidence level, underlining style has been used to indicate utterance volume, and font size has been used to indicate utterance background noise.
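One way to combine several indication types on a single word, as in the FIG. 9F example, is to emit styled markup per word. The following sketch assumes HTML/CSS output; the thresholds, the metric units, and the specific style values are all illustrative assumptions, with only the metric-to-attribute pairing (color from confidence, underline from volume, size from noise) taken from the text.

```python
def style_word(word, confidence, volume, noise):
    """Render one word as an HTML span combining three indication types:
    font color from confidence, underline style from volume, and font
    size from background noise. Thresholds and CSS values are assumed."""
    color = "green" if confidence >= 80 else "orange" if confidence >= 60 else "red"
    underline = "underline double" if volume >= 72 else "underline" if volume >= 65 else "none"
    size = 18 if noise < 20 else 14 if noise < 40 else 10
    style = f"color:{color};text-decoration:{underline};font-size:{size}px"
    return f'<span style="{style}">{word}</span>'

# Hypothetical metrics (confidence %, volume, noise) for two words:
print(style_word("Please", 88, 72, 10))
print(style_word("toffee", 50, 74, 55))
```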


A similar technique may be utilized to indicate verbal volume or emphasis and silent pauses or spaces between portions of an utterance 36. FIG. 10 is a block diagram illustrating the operation of continuous speech transcription emphasis and silence indication in conjunction with a portion of the communication system of FIG. 2. As shown therein, a first user 32 is utilizing the system 10 to communicate with a second user 34. More particularly, the user 32 speaks an utterance 36 into the transmitting device 12, and the recorded speech audio is sent to the ASR system 18. In the example of FIG. 10, the utterance 36 is “Hey! I'm talking to you, buddy!” The ASR 18 attempts to recognize and transcribe the utterance 36 into text.


As described previously, during the recording, recognition and transcription process, various parameters may be measured or otherwise determined. For example, the volume of each word in the utterance 36 may be measured, and the length of silent spaces or pauses between the words may be measured. In this regard, Table 1 (hereinbelow) illustrates the volume of each word in the utterance and the duration of each silent period between words. It will be appreciated that a variety of additional parameters, some of which may be described herein, may likewise be measured or otherwise determined, and that different combinations of parameters may be chosen.


As noted previously, when the ASR 18 returns the transcription results text, it also returns a stream of metrics that are linked to the text elements. As described previously with regard to the confidence level parameter, parameters such as volume and background noise may be coupled with the transcribed speech on a word-by-word basis, a syllable-by-syllable basis, a phrase-by-phrase basis, a sentence-by-sentence basis, or the like, and placed into any desired format. As also described previously with regard to the confidence level parameter, at least some embodiments may utilize XML fragments that may be passed around as necessary.


As with the performance indication described above, it is also possible to use graphical display elements to visually express speech elements such as punctuation. For example, pauses in the English language can be displayed as a space of variable length between words, and verbal emphasis, which would conventionally be shown in other text-based communication contexts by an exclamation point or the use of bolded text, can be displayed graphically using font size, boldness, or any of the other elements described for transcription metrics.



FIG. 11 is a graphical depiction, on a receiving device, of the transcription of the utterance of FIG. 10 using font size and spacing to indicate emphasis and silent spaces between portions of the utterance 36, wherein the word “you” appears in font size 24 to indicate a volume of “very loud,” the words “Hey” and “buddy” appear in font size 18 to indicate a volume of “loud,” the words “I'm talking” appear in font size 14 to indicate a volume of “medium,” and the word “to” appears in font size 12 to indicate a volume of “quiet.” Furthermore, the words “Hey” and “I'm” are separated by several spaces to indicate a long pause therebetween; the words “I'm” and “talking,” the words “talking” and “to,” and the words “to” and “you” are each separated by a single space to indicate a short pause therebetween; and the words “you” and “buddy” are separated by several spaces to indicate a long pause therebetween. It will be appreciated that the meaning of the various font sizes may be varied, greater or fewer numbers of font sizes may be used, different font sizes may be chosen, or the like, and that the meaning of the number of spaces may be varied, greater or fewer numbers of spaces may be used, or the like, all without departing from the scope of the invention.













TABLE 1

Element      Length    Volume

Hey!         200 ms    Loud
<silence>    300 ms    Silent
I'm          100 ms    Medium
<silence>     20 ms    Silent
talking      300 ms    Medium
<silence>     30 ms    Silent
to           125 ms    Quiet
<silence>     25 ms    Silent
you,         180 ms    Very Loud
<silence>    250 ms    Silent
buddy!       300 ms    Loud
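Using the Table 1 data, the FIG. 11 rendering can be sketched as follows. The 150 ms boundary between a “short” and a “long” pause and the bracketed size tags (standing in for actual font rendering) are illustrative assumptions.

```python
# Render the Table 1 element stream as text, spacing words by pause length
# and tagging each word with a font size derived from its volume label.
SIZE = {"Very Loud": 24, "Loud": 18, "Medium": 14, "Quiet": 12}

elements = [("Hey!", 200, "Loud"), ("<silence>", 300, "Silent"),
            ("I'm", 100, "Medium"), ("<silence>", 20, "Silent"),
            ("talking", 300, "Medium"), ("<silence>", 30, "Silent"),
            ("to", 125, "Quiet"), ("<silence>", 25, "Silent"),
            ("you,", 180, "Very Loud"), ("<silence>", 250, "Silent"),
            ("buddy!", 300, "Loud")]

parts = []
for text, length_ms, volume in elements:
    if text == "<silence>":
        # long pause -> several spaces, short pause -> one space
        parts.append("     " if length_ms >= 150 else " ")
    else:
        parts.append(f"[{SIZE[volume]}pt]{text}")

line = "".join(parts)
print(line)
```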










Recognition results can also be presented to the user in an audio format, for example, by converting the recognition text results back into speech using a text-to-speech conversion and playing the speech back after recognition is complete. Such steps may be carried out, for example, according to processes described elsewhere herein and, in particular, in FIG. 5, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837. During playback of the results audio, there are also many options available to integrate ASR metrics into the presented results. For example, artificial audio and speech artifacts can be injected into the speech playback to give the user cues as to what external factors might have impacted ASR performance. These cues could be done in several ways, including, but not limited to, those shown in Table 2.










TABLE 2

Cue                         Description

Tone injection              A tone of varying frequency, pitch, volume, phase,
                            or other characteristic is added.
Artificial noise injection  Artificial noise of varying characteristics (e.g.,
                            volume, white vs. pink noise) is added.
Noise playback              Noise derived from the original speech recording is
                            isolated and injected back into the results playback
                            to give the user some indication as to what audio
                            event may have reduced ASR accuracy (for example, a
                            truck horn, jackhammer, door slamming, or shouting).
                            The user may not have been aware of the event when
                            making the recording, but now has more understanding
                            as to why the recording failed or was subpar.
TTS pronunciation           Emphasis, pauses and questioning inflections (among
                            others) can be added to the TTS playback in order to
                            set off words that have low confidence. For example:
                            “please meet me for (pause) coffee? (pause) at one”
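The “tone injection” cue from Table 2 can be sketched as mixing a sine tone into a playback buffer over the span of a low-confidence word. The frequency, level, sample rate, and float-sample representation below are all assumptions; only the idea of adding a cue tone to the results audio comes from the table.

```python
import math

def inject_tone(samples, rate, start, duration, freq=440.0, level=0.2):
    """Mix a cue tone into a playback buffer over [start, start+duration)
    seconds. A minimal sketch of the Table 2 'tone injection' cue."""
    out = list(samples)
    first = int(start * rate)
    last = min(len(out), int((start + duration) * rate))
    for n in range(first, last):
        out[n] += level * math.sin(2 * math.pi * freq * (n - first) / rate)
    return out

rate = 8000
silence = [0.0] * rate          # one second of silence standing in for TTS audio
cued = inject_tone(silence, rate, start=0.25, duration=0.5)
```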









Commercial Implementation

One commercial implementation of the foregoing principles is the Yap® and Yap9™ service (collectively, “the Yap service”), available from Yap Inc. of Charlotte, N.C. The Yap service includes one or more web applications and a client device application. The Yap web application is a J2EE application built using Java 5. It is designed to be deployed on an application server like IBM WebSphere Application Server or an equivalent J2EE application server. It is designed to be platform neutral, meaning the server hardware and OS can be anything supported by the web application server (e.g. Windows, Linux, MacOS X).



FIG. 12 is a block diagram of the system architecture of the Yap commercial implementation. With reference to FIG. 12, the operating system may be implemented in Red Hat Enterprise Linux 5 (RHEL 5); the application servers may include the Websphere Application Server Community Edition (WAS-CE) servers, available from IBM; the web server may be an Apache server; the CTTS Servlets may include CTTS servlets from Loquendo, including US/UK/ES male and US/UK/ES female; the Grammar ASP may be the latest WebSphere Voice Server, available from IBM; suitable third party ads may be provided by Google; a suitable third party IM system is Google Talk, available from Google; and a suitable database system is the DB2 Express relational database system, available from IBM.



FIG. 13 is a block diagram of the Yap EAR of FIG. 12. The audio codec JARS may include the VoiceAge AMR JAR, available from VoiceAge of Montreal, Quebec and/or the QCELP JAR, available from Qualcomm of San Diego, Calif.


The Yap web application includes a plurality of servlets. As used herein, the term “servlet” refers to an object that receives a request and generates a response based on the request. Usually, a servlet is a small Java program that runs within a Web server. Servlets receive and respond to requests from Web clients, usually across HTTP and/or HTTPS, the HyperText Transfer Protocol. Currently, the Yap web application includes nine servlets: Correct, Debug, Install, Login, Notify, Ping, Results, Submit, and TTS. Each servlet is described below in the order typically encountered.


The communication protocol used for all messages between the Yap client and Yap server applications is HTTP and HTTPS. Using these standard web protocols allows the Yap web application to fit well in a web application container. From the application server's point of view, it cannot distinguish between the Yap client midlet and a typical web browser. This aspect of the design is intentional to convince the web application server that the Yap client midlet is actually a web browser. This allows a user to use features of the J2EE web programming model like session management and HTTPS security. It is also an important feature of the client as the MIDP specification requires that clients are allowed to communicate over HTTP.


More specifically, the Yap client uses the POST method and custom headers to pass values to the server. The body of the HTTP message in most cases is irrelevant with the exception of when the client submits audio data to the server in which case the body contains the binary audio data. The Server responds with an HTTP code indicating the success or failure of the request and data in the body which corresponds to the request being made. Preferably, the server does not depend on custom header messages being delivered to the client as the carriers can, and usually do, strip out unknown header values. FIG. 14 is a typical header section of an HTTP request from the Yap client.


The Yap client is operated via a user interface (UI), known as “Yap9,” which is well suited for implementing methods of converting an audio message into a text message and messaging in mobile environments. Yap9 is a combined UI for SMS and web services (WS) that makes use of the buttons or keys of the client device by assigning a function to each button (sometimes referred to as a “Yap9” button or key). Execution of such functions is carried out by “Yaplets.” This process, and the usage of such buttons, are described elsewhere herein and, in particular, in FIGS. 9A-9D, and accompanying text, of the aforementioned U.S. Patent Application Pub. No. US 2007/0239837.


Usage Process—Install: Installation of the Yap client device application is described in the aforementioned U.S. Patent Application Pub. No. US 2007/0239837 in a subsection titled “Install Process” of a section titled “System Architecture.”


Usage Process—Notify: When a Yap client is installed, when the install fails, or when the install is canceled by the user, the phone sends a message with a short description to the Notify servlet. This can be used for tracking purposes and to help diagnose any install problems.


Usage Process—Login: When the Yap midlet is opened, the first step is to create a new session by logging into the Yap web application using the Login servlet. Preferably, however, multiple login servers exist, so as a preliminary step, a request is sent to find a server to log in to. Exemplary protocol details for such a request can be seen in FIG. 15. An HTTP string pointing to a selected login server will be returned in response to this request. It will be appreciated that this selection process functions as a poor man's load balancer.


After receiving this response, a login request is sent. Exemplary protocol details for such a request can be seen in FIG. 16. A cookie holding a session ID is returned in response to this request. The session ID is a pointer to a session object on the server which holds the state of the session. This session data will be discarded after a period determined by server policy.


Sessions are typically maintained using client-side cookies; however, the Yap client cannot rely on the set-cookie header successfully returning to it because the carrier may remove that header from the HTTP response. The solution to this problem is to use the technique of URL rewriting. To do this, the session ID is extracted from the session API and returned to the client in the body of the response. This is called the “Yap Cookie” and is used in every subsequent request from the client. The Yap Cookie looks like this:


;jsessionid=C240B217F2351E3C420A599B0878371A


The client simply appends this cookie to the end of each request, and the session is maintained:


/Yap/Submit;jsessionid=C240B217F2351E3C420A599B0878371A
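The POST-plus-custom-headers pattern, with the Yap Cookie appended by URL rewriting, might look like the following sketch. The header names, content type, and host are hypothetical (FIGS. 14 and 17 are not reproduced here); only the pattern itself, that metadata travels in headers while the body carries raw audio bytes, comes from the text.

```python
def build_submit_request(host, audio_bytes, session_id):
    """Assemble raw HTTP request bytes for an audio submit: custom headers
    carry metadata, the body carries binary audio, and the session ID is
    appended to the path via URL rewriting (the 'Yap Cookie')."""
    path = f"/Yap/Submit;jsessionid={session_id}"
    headers = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/octet-stream",
        f"Content-Length: {len(audio_bytes)}",
        "X-Audio-Format: amr",    # hypothetical custom header
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + audio_bytes

request = build_submit_request(
    "server.example.net", b"\x00\x01\x02",
    "C240B217F2351E3C420A599B0878371A")
```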


Usage Process—Submit: After receiving a session ID, audio data may be submitted. The user presses and holds one of the Yap9 buttons, speaks aloud, and releases the pressed button. The speech is recorded, and the recorded speech is then sent in the body of a request to the Submit servlet, which returns a unique receipt that the client can use later to identify this utterance. Exemplary protocol details for such a request can be seen in FIG. 17.


One of the header values sent to the server during the login process is the format in which the device records. That value is stored in the session so the Submit servlet knows how to convert the audio into a format required by the ASR engine. This is done in a separate thread as the process can take some time to complete.


The Yap9 button and Yap9 screen numbers are passed to the Submit server in the HTTP request header. These values are used to lookup a user-defined preference of what each button is assigned to. For example, the 1 button may be used to transcribe audio for an SMS message, while the 2 button is designated for a grammar based recognition to be used in a web services location based search. The Submit servlet determines the appropriate “Yaplet” to use. When the engine has finished transcribing the audio or matching it against a grammar, the results are stored in a hash table in the session.


In the case of transcribed audio for an SMS text message, a number of filters can be applied to the text returned from the ASR engine. Such filters may include, but are not limited to, those shown Table 3.










TABLE 3

Filter Type         Function

Ad Filter           Used to scan the text and identify keywords that can be
                    used to insert targeted advertising messages, and/or
                    convert the keywords into hyperlinks to ad-sponsored web
                    pages.
Currency Filter     Used to format currency returned from the speech engine
                    into the user's preferred format (e.g., “one hundred
                    twenty dollars” -> “$120.00”).
Date Filter         Used to format dates returned from the speech engine into
                    the user's preferred format (e.g., “march fourth two
                    thousand seven” -> “3/4/2007”).
Digit Filter        Used to format spelled-out single digits returned from the
                    speech engine into a multi-digit number such as a zip code
                    (e.g., “two eight two one one” -> “28211”).
Engine Filter       Used to remove speech engine words.
Number Filter       Used to convert the spelled-out numbers returned from the
                    speech engine into a digit-based number (e.g., “one hundred
                    forty seven” -> “147”).
Obscenity Filter    Used to place asterisks in for the vowels in street slang
                    (e.g., “sh*t”, “f*ck”, etc.).
Punctuation Filter  Used to format punctuation.
SMS Filter          Used to convert regular words into a spelling which more
                    closely resembles an SMS message (e.g., “don't forget to
                    smile” -> “don't 4get 2 :)”, etc.).
Time Filter         Used to format time phrases.










Notably, after all of the filters are applied, both the filtered text and original text are returned to the client so that if text to speech is enabled for the user, the original unfiltered text can be used to generate the TTS audio.
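Two of the Table 3 filters can be sketched as follows. The regular expressions and word lists are illustrative assumptions, not the production filters; the examples mirror the “28211” and asterisked-slang examples from the table.

```python
import re

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def digit_filter(text):
    """Collapse runs of two or more spelled-out single digits into one
    number, e.g. 'two eight two one one' -> '28211' (the Digit Filter)."""
    words = "|".join(DIGITS)
    pattern = r"\b(?:%s)(?:\s+(?:%s))+\b" % (words, words)
    return re.sub(pattern,
                  lambda m: "".join(DIGITS[w] for w in m.group(0).split()),
                  text)

def obscenity_filter(text, words=("shit",)):
    """Replace vowels with asterisks in listed words (the Obscenity Filter).
    The word list here is a tiny illustrative sample."""
    for w in words:
        censored = re.sub(r"[aeiou]", "*", w)
        text = re.sub(re.escape(w), censored, text, flags=re.IGNORECASE)
    return text

filtered = digit_filter("my zip is two eight two one one")
print(filtered)  # -> "my zip is 28211"
```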


Usage Process—Results: The client retrieves the results of the audio by taking the receipt returned from the Submit servlet and submitting it as a request to the Results servlet. Exemplary protocol details for such a request can be seen in FIG. 18. This is done in a separate thread on the device and a timeout parameter may be specified which will cause the request to return after a certain amount of time if the results are not available. In response to the request, a block of XML is preferably returned. Exemplary protocol details for such a return response can be seen in FIG. 19. Alternatively, a serialized Java Results object may be returned. This object contains a number of getter functions for the client to extract the type of results screen to advance to (i.e., SMS or results list), the text to display, the text to be used for TTS, any advertising text to be displayed, an SMS trailer to append to the SMS message, etc.
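The retrieval flow above can be sketched as a client-side wait loop. In the actual service the timeout is a parameter of a single blocking request to the Results servlet, so the polling loop below is an illustrative stand-in, and the `fetch` callable (returning None while results are pending) is hypothetical.

```python
import time

def poll_results(fetch, receipt, timeout=10.0, interval=0.5):
    """Repeatedly ask for results by receipt until they are ready or the
    timeout elapses; returns None on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        results = fetch(receipt)
        if results is not None:
            return results
        time.sleep(interval)
    return None

# Simulated server: results become available on the third poll.
calls = {"n": 0}
def fake_fetch(receipt):
    calls["n"] += 1
    return {"text": "please meet me for coffee at one"} if calls["n"] >= 3 else None

result = poll_results(fake_fetch, receipt="abc123", timeout=5.0, interval=0.01)
```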


Usage Process—TTS: The user may choose to have the results read back via Text to Speech. This can be an option the user could disable to save network bandwidth, but adds value when in a situation where looking at the screen is not desirable, like when driving. If TTS is used, the TTS string is extracted from the results and sent via an HTTP request to the TTS servlet. Exemplary protocol details for such a request can be seen in FIG. 20. The request blocks until the TTS is generated and returns audio in the format supported by the phone in the body of the result. This is performed in a separate thread on the device since the transaction may take some time to complete. The resulting audio is then played to the user through the AudioService object on the client. Preferably, TTS speech from the server is encrypted using Corrected Block Tiny Encryption Algorithm (XXTEA) encryption.


Usage Process—Correct: As a means of tracking accuracy and improving future SMS based language models, if the user makes a correction to transcribed text on the phone via the keypad before sending the message, the corrected text is submitted to the Correct servlet along with the receipt for the request. This information is stored on the server for later use in analyzing accuracy and compiling a database of typical SMS messages. Exemplary protocol details for such a submission can be seen in FIG. 21.


Usage Process—Ping: Typically, web sessions will timeout after a certain amount of inactivity. The Ping servlet can be used to send a quick message from the client to keep the session alive. Exemplary protocol details for such a message can be seen in FIG. 22.


Usage Process—Debug: Used mainly for development purposes, the Debug servlet sends logging messages from the client to a debug log on the server. Exemplary protocol details can be seen in FIG. 23.


Usage Process—Logout: To logout from the Yap server, an HTTP logout request needs to be issued to the server. An exemplary such request would take the form: “/Yap/Logout;jsessionid=1234”, where 1234 is the session ID.


User Preferences: In at least one embodiment, the Yap website has a section where the user can log in and customize their Yap client preferences. This allows them to choose from available Yaplets and assign them to Yap9 keys on their phone. The user preferences are stored and maintained on the server and accessible from the Yap web application. This frees the Yap client from having to know about all of the different back-end Yaplets. It just records the audio, submits it to the server along with the Yap9 key and Yap9 screen used for the recording and waits for the results. The server handles all of the details of what the user actually wants to have happen with the audio.


The client needs to know what type of format to utilize when presenting the results to the user. This is accomplished through a code in the Results object. The majority of requests fall into one of two categories: sending an SMS message, or displaying the results of a web services query in a list format. Notably, although these two are the most common, the Yap architecture supports the addition of new formats.


Based on the foregoing description, it will be readily understood by those persons skilled in the art that the present invention is susceptible of broad utility and application. Many embodiments and adaptations of the present invention other than those specifically described herein, as well as many variations, modifications, and equivalent arrangements, will be apparent from or reasonably suggested by the present invention and the foregoing descriptions thereof, without departing from the substance or scope of the present invention.


Accordingly, while the present invention has been described herein in detail in relation to one or more preferred embodiments, it is to be understood that this disclosure is only illustrative and exemplary of the present invention and is made merely for the purpose of providing a full and enabling disclosure of the invention. The foregoing disclosure is not intended to be construed to limit the present invention or otherwise exclude any such other embodiments, adaptations, variations, modifications or equivalent arrangements, the present invention being limited only by the claims appended hereto and the equivalents thereof.

Claims
  • 1. A computer-implemented method comprising: receiving first text, wherein the first text was created by performing automatic speech recognition on a first portion of audio data, wherein the audio data includes a plurality of portions;receiving a first value associated with the first text;causing presentation of the first text with a first graphical element indicating the first value;receiving, during presentation of the first text, second text, wherein the second text was created by performing automatic speech recognition on a second portion of the plurality of portions of audio data and wherein the second portion of the audio data is subsequent to the first portion of the audio data; andcausing presentation of the second text.
  • 2. The computer-implemented method of claim 1, wherein the first value indicates a volume level of a portion of the first portion of the audio data corresponding to the first text, a level of background noise in the portion of the first portion of the audio data or a confidence level associated with the first text as determined by an automatic speech recognition engine.
  • 3. The computer-implemented method of claim 1, wherein the first graphical element comprises font color, font grayscale, font weight, font size or underlining.
  • 4. The computer-implemented method of claim 1, wherein the first portion of the audio data comprises at least part of a voicemail message.
  • 5. The computer-implemented method of claim 1, further comprising: receiving a second value associated with the second text; and causing presentation of the second text with a second graphical element indicating the second value, wherein the second text comprises a modified version of the first text.
  • 6. The computer-implemented method of claim 1, further comprising: receiving a third value associated with a third text, wherein the third text was created by performing automatic speech recognition on the first portion of the audio data; and causing presentation of the third text with a third graphical element indicating the third value.
  • 7. A system comprising: an electronic data store configured to store transcription information; and one or more computing devices in communication with the electronic data store, the one or more computing devices configured to at least: receive first text, wherein the first text was created by performing automatic speech recognition on a first portion of audio data, wherein the audio data includes a plurality of portions; receive a first value associated with the first text; cause presentation of the first text with a first graphical element indicating the first value; receive, during presentation of the first text, second text, wherein the second text was created by performing automatic speech recognition on a second portion of the plurality of portions of audio data and wherein the second portion of the audio data is subsequent to the first portion of the audio data; and cause presentation of the second text.
  • 8. The system of claim 7, wherein the first graphical element comprises a lighter color if the first value is in a lower range, or a darker color if the first value is in a higher range.
  • 9. The system of claim 8, wherein the lighter color is gray and the darker color is black.
  • 10. The system of claim 7, wherein the first portion of the audio data comprises at least part of a voicemail message.
  • 11. The system of claim 7, wherein the one or more computing devices are further configured to: receive a second value associated with the second text; and cause presentation of the second text with a second graphical element indicating the second value, wherein the second text comprises a modified version of the first text.
  • 12. The system of claim 7, wherein the one or more computing devices are further configured to: receive a third value associated with third text, wherein the third text was created by performing automatic speech recognition on the first portion of the audio data; and cause presentation of the third text with a third graphical element indicating the third value.
  • 13. The system of claim 12, wherein the third value indicates a volume level of a portion of the first portion of the audio data corresponding to the third text, a level of background noise in the portion of the first portion of the audio data or a confidence level associated with the third text as determined by an automatic speech recognition engine.
  • 14. A non-transitory computer-readable medium storing instructions that, when executed by a processor on a computing device, cause the computing device to at least: receive first text, wherein the first text was created by performing automatic speech recognition on a first portion of audio data, wherein the audio data includes a plurality of portions; receive a first value associated with the first text; cause presentation of the first text with a first graphical element indicating the first value; receive, during presentation of the first text, second text, wherein the second text was created by performing automatic speech recognition on a second portion of the plurality of portions of audio data and wherein the second portion of the audio data is subsequent to the first portion of the audio data; and cause presentation of the second text.
  • 15. The non-transitory computer-readable medium of claim 14, wherein the first graphical element indicates a volume level associated with a portion of the first portion of the audio data corresponding to the first text and is presented substantially simultaneously with the presentation of the first text.
  • 16. The non-transitory computer-readable medium of claim 14, wherein the first graphical element comprises font color, font grayscale, font weight, font size or underlining.
  • 17. The non-transitory computer-readable medium of claim 14, wherein the first portion of the audio data comprises at least part of a voicemail message.
  • 18. The non-transitory computer-readable medium of claim 14, further comprising instructions to filter the first text by replacing one or more words in the first text with corresponding numbers or digits.
  • 19. The non-transitory computer-readable medium of claim 14, wherein the first portion of the audio data is captured at a first device and the first text is presented at a second device.
  • 20. The non-transitory computer-readable medium of claim 14, further comprising instructions to: receive a second value associated with the second text; and cause presentation of the second text with a second graphical element indicating the second value, wherein the second text comprises a modified version of the first text.
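Claims 2, 3, and 8-9 recite presenting transcribed text with a graphical element tied to a per-text metric value (volume, background noise, or ASR confidence): for example, a lighter font color such as gray when the value falls in a lower range and a darker color such as black when it falls in a higher range. The following Python sketch illustrates one such mapping; the function names, the 0-to-1 metric scale, and the HTML-span rendering are illustrative assumptions, not part of the claims.

```python
def metric_to_gray(value, lo=0.0, hi=1.0):
    """Map a metric in [lo, hi] to a CSS gray level: lower values render
    lighter (e.g., gray), higher values render darker (e.g., black)."""
    clamped = min(max(value, lo), hi)
    # 0x00 (black) for the top of the range .. 0xC0 (light gray) for the bottom
    level = int(round((1.0 - (clamped - lo) / (hi - lo)) * 0xC0))
    return "#{0:02x}{0:02x}{0:02x}".format(level)

def render_tokens(tokens):
    """tokens: list of (word, metric) pairs. Returns HTML spans whose font
    color reflects each word's associated metric value, so stronger results
    stand out and weaker ones recede."""
    return " ".join(
        '<span style="color:{}">{}</span>'.format(metric_to_gray(m), w)
        for w, m in tokens
    )
```

Per claim 3, the same value could equally drive font grayscale, weight, size, or underlining rather than color.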
I. CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 14/010,433, filed Aug. 26, 2013, which is a continuation of U.S. application Ser. No. 13/621,179, filed Sep. 15, 2012, now U.S. Pat. No. 8,543,396, which is a continuation of U.S. application Ser. No. 12/197,213, filed on Aug. 22, 2008, now U.S. Pat. No. 8,510,109, which claims the benefit of priority from: (1) U.S. provisional patent application Ser. No. 60/957,386, filed Aug. 22, 2007 and entitled “CONTINUOUS SPEECH TRANSCRIPTION PERFORMANCE INDICATION;” (2) U.S. provisional patent application Ser. No. 60/957,393, filed Aug. 22, 2007 and entitled “VOICE CLICK FOR SPEECH-ENABLED APPLICATIONS;” (3) U.S. provisional patent application Ser. No. 60/957,701, filed Aug. 23, 2007 and entitled “CONTINUOUS SPEECH TRANSCRIPTION PERFORMANCE INDICATION;” (4) U.S. provisional patent application Ser. No. 60/957,702, filed Aug. 23, 2007 and entitled “VOICE CLICK FOR SPEECH-ENABLED APPLICATIONS;” (5) U.S. provisional patent application Ser. No. 60/957,706, filed Aug. 23, 2007 and entitled “POST-PROCESSING TRANSCRIPTION RESULTS WITH FILTERS AND FINITE GRAMMARS;” (6) U.S. provisional patent application Ser. No. 60/972,851, filed Sep. 17, 2007 and entitled “SYSTEM AND METHOD FOR DELIVERING MOBILE ADVERTISING WITHIN A THREADED SMS OR IM CHAT CONVERSATION ON A MOBILE DEVICE CLIENT;” (7) U.S. provisional patent application Ser. No. 60/972,853, filed Sep. 17, 2007 and entitled “METHOD AND SYSTEM FOR DYNAMIC PERSONALIZATION AND QUERYING OF USER PROFILES BASED ON SMS/IM CHAT MESSAGING ON A MOBILE DEVICE;” (8) U.S. provisional patent application Ser. No. 60/972,854, filed Sep. 17, 2007 and entitled “LOCATION, TIME & SEASON AWARE MOBILE ADVERTISING DELIVERY;” (9) U.S. provisional patent application Ser. No. 60/972,936, filed Sep. 17, 2007 and entitled “DELIVERING TARGETED ADVERTISING TO MOBILE DEVICE FOR PRESENTATION WITHIN SMSes OR IM CONVERSATIONS;” (10) U.S. provisional patent application Ser. No. 60/972,943, filed Sep. 17, 2007 and entitled “Dynamic Personalization and Querying of User Profiles Based on SMSes and IM Conversations;” (11) U.S. provisional patent application Ser. No. 60/972,944, filed Sep. 17, 2007 and entitled “Location, Time, and Season Aware Advertising Delivery to and Presentation on Mobile Device Within SMSes or IM Conversations or User Interface Thereof;” (12) U.S. provisional patent application Ser. No. 61/016,586, filed Dec. 25, 2007 and entitled “VALIDATION OF MOBILE ADVERTISING FROM DERIVED INFORMATION;” (13) U.S. provisional patent application Ser. No. 61/021,335, filed Jan. 16, 2008 and entitled “USING A PHYSICAL PHENOMENA DETECTOR TO START AND STOP RECORDING FOR A SPEECH RECOGNITION ENGINE;” (14) U.S. provisional patent application Ser. No. 61/021,341, filed Jan. 16, 2008 and entitled “CONTINUOUS SPEECH TRANSCRIPTION UTTERANCE EMPHASIS AND SILENCE INDICATION;” (15) U.S. provisional patent application Ser. No. 61/034,815, filed Mar. 7, 2008 and entitled “USE OF INTERMEDIATE SPEECH TRANSCRIPTION RESULTS IN EDITING FINAL SPEECH TRANSCRIPTION RESULTS;” (16) U.S. provisional patent application Ser. No. 61/038,046, filed Mar. 19, 2008 and entitled “CORRECTIVE FEEDBACK LOOP FOR AUTOMATED SPEECH RECOGNITION;” (17) U.S. provisional patent application Ser. No. 61/041,219, filed Mar. 31, 2008 and entitled “USE OF METADATA TO POST PROCESS SPEECH RECOGNITION OUTPUT.” Each of the foregoing is incorporated by reference in its entirety herein. Additionally, the disclosure of U.S. patent application Ser. No. 11/697,074, filed Apr. 5, 2007, entitled “HOSTED VOICE RECOGNITION SYSTEM FOR WIRELESS DEVICES” and published as U.S. Patent Application Pub. No. US 2007/0239837, is incorporated in its entirety herein by reference and is intended to provide background and technical information with regard to the systems and environments of the inventions of the present patent application.
Further, the foregoing nonprovisional patent application references and incorporates a previously filed provisional patent application (U.S. Provisional Patent Application Ser. No. 60/789,837, filed Apr. 5, 2006, entitled “Apparatus And Method For Converting Human Speech Into A Text Or Email Message In A Mobile Environment Using Grammar Or Transcription Based Speech Recognition Software Which Optionally Resides On The Internet,” by Victor R. Jablokov, which is incorporated in its entirety herein by reference). The disclosure of that provisional patent application is contained in Appendix A attached hereto and, likewise, is incorporated herein in its entirety by reference and is intended to provide background and technical information with regard to the systems and environments of the inventions of the present patent application. Still further, the disclosure of U.S. provisional patent application Ser. No. 61/091,330, filed Aug. 22, 2008 and entitled “METHODS, APPARATUSES, AND SYSTEMS FOR PROVIDING TIMELY USER CUES PERTAINING TO SPEECH RECOGNITION,” is incorporated herein in its entirety by reference. Finally, a brochure espousing benefits of one or more inventions of the present patent application, as well as of these prior filed applications, is contained in Appendix B attached hereto. The disclosure of this brochure is likewise incorporated herein in its entirety by reference.

US Referenced Citations (376)
Number Name Date Kind
5036538 Oken et al. Jul 1991 A
5675507 Bobo, II Oct 1997 A
5822730 Roth et al. Oct 1998 A
5852801 Hon Dec 1998 A
5864603 Haavisto et al. Jan 1999 A
5948061 Merriman et al. Sep 1999 A
5974413 Beauregard et al. Oct 1999 A
5995928 Nguyen Nov 1999 A
6026368 Brown et al. Feb 2000 A
6100882 Sharman et al. Aug 2000 A
6173259 Bijl et al. Jan 2001 B1
6212498 Sherwood et al. Apr 2001 B1
6219407 Kanevsky et al. Apr 2001 B1
6219638 Padmanabhan et al. Apr 2001 B1
6253177 Lewis et al. Jun 2001 B1
6298326 Feller Oct 2001 B1
6366886 Dragosh et al. Apr 2002 B1
6401075 Mason et al. Jun 2002 B1
6453290 Jochumson Sep 2002 B1
6490561 Wilson et al. Dec 2002 B1
6519562 Phillips et al. Feb 2003 B1
6532446 King Mar 2003 B1
6571210 Hon et al. May 2003 B2
6604077 Dragosh et al. Aug 2003 B2
6654448 Agraharam et al. Nov 2003 B1
6687339 Martin Feb 2004 B2
6687689 Fung et al. Feb 2004 B1
6704034 Rodriguez et al. Mar 2004 B1
6760700 Lewis et al. Jul 2004 B2
6775360 Davidson et al. Aug 2004 B2
6816578 Kredo et al. Nov 2004 B1
6820055 Saindon et al. Nov 2004 B2
6850609 Schrage Feb 2005 B1
6856960 Dragosh et al. Feb 2005 B1
6865258 Polcyn Mar 2005 B1
6895084 Saylor et al. May 2005 B1
6980954 Zhao et al. Dec 2005 B1
7007074 Radwin Feb 2006 B2
7013275 Arnold et al. Mar 2006 B2
7035804 Saindon et al. Apr 2006 B2
7035901 Kumagai et al. Apr 2006 B1
7039599 Merriman et al. May 2006 B2
7047200 Schmid et al. May 2006 B2
7062435 Tzirkel-Hancock et al. Jun 2006 B2
7089184 Rorex Aug 2006 B2
7089194 Berstis et al. Aug 2006 B1
7133513 Zhang Nov 2006 B1
7136875 Anderson et al. Nov 2006 B2
7146320 Ju et al. Dec 2006 B2
7146615 Hervet et al. Dec 2006 B1
7181387 Ju et al. Feb 2007 B2
7181398 Thong et al. Feb 2007 B2
7200555 Ballard et al. Apr 2007 B1
7206932 Kirchhoff Apr 2007 B1
7225125 Bennett et al. May 2007 B2
7225224 Nakamura May 2007 B2
7233655 Gailey et al. Jun 2007 B2
7236580 Sarkar et al. Jun 2007 B1
7254384 Gailey et al. Aug 2007 B2
7260534 Gandhi et al. Aug 2007 B2
7280966 Ju et al. Oct 2007 B2
7302280 Hinckley et al. Nov 2007 B2
7310601 Nishizaki et al. Dec 2007 B2
7313526 Roth et al. Dec 2007 B2
7319957 Robinson et al. Jan 2008 B2
7324942 Mahowald et al. Jan 2008 B1
7328155 Endo et al. Feb 2008 B2
7330815 Jochumson Feb 2008 B1
7363229 Falcon et al. Apr 2008 B2
7376556 Bennett May 2008 B2
7379870 Belvin et al. May 2008 B1
7392185 Bennett Jun 2008 B2
7401122 Chen Jul 2008 B2
7418387 Mowatt et al. Aug 2008 B2
7475404 Hamel Jan 2009 B2
7539086 Jaroker May 2009 B2
7555431 Bennett Jun 2009 B2
7571100 Lenir et al. Aug 2009 B2
7577569 Roth et al. Aug 2009 B2
7590534 Vatland Sep 2009 B2
7634403 Roth et al. Dec 2009 B2
7640158 Detlef et al. Dec 2009 B2
7640160 Di Cristo et al. Dec 2009 B2
7650284 Cross et al. Jan 2010 B2
7657424 Bennett Feb 2010 B2
7668718 Kahn et al. Feb 2010 B2
7672841 Bennett Mar 2010 B2
7680661 Co et al. Mar 2010 B2
7685509 Clark et al. Mar 2010 B1
7689415 Jochumson Mar 2010 B1
7702508 Bennett Apr 2010 B2
7707163 Anzalone et al. Apr 2010 B2
7716058 Roth et al. May 2010 B2
7725307 Bennett May 2010 B2
7725321 Bennett May 2010 B2
7729904 Bennett Jun 2010 B2
7729912 Bacchiani et al. Jun 2010 B1
7747437 Verhasselt et al. Jun 2010 B2
7757162 Barrus et al. Jul 2010 B2
7769764 Ramer et al. Aug 2010 B2
7796980 McKinney et al. Sep 2010 B1
7822610 Burns et al. Oct 2010 B2
7852993 Ju et al. Dec 2010 B2
7890329 Wu et al. Feb 2011 B2
7890586 McNamara et al. Feb 2011 B1
7899670 Young et al. Mar 2011 B1
7899671 Cooper et al. Mar 2011 B2
7904301 Densham et al. Mar 2011 B2
7907705 Huff et al. Mar 2011 B1
7908141 Belknap Mar 2011 B2
7908273 DiMaria et al. Mar 2011 B2
7925716 Zhang et al. Apr 2011 B2
7949529 Weider et al. May 2011 B2
7957975 Burns et al. Jun 2011 B2
7970610 Downey Jun 2011 B2
8010358 Chen Aug 2011 B2
8027836 Baker et al. Sep 2011 B2
8032372 Zimmerman et al. Oct 2011 B1
8050918 Ghasemi et al. Nov 2011 B2
8069047 Cross et al. Nov 2011 B2
8073700 Jaramillo et al. Dec 2011 B2
8106285 Gerl et al. Jan 2012 B2
8117268 Jablokov et al. Feb 2012 B2
8121838 Kobal et al. Feb 2012 B2
8126120 Stifelman et al. Feb 2012 B2
8135578 Hébert Mar 2012 B2
8140632 Jablokov et al. Mar 2012 B1
8145485 Brown Mar 2012 B2
8145493 Cross, Jr. et al. Mar 2012 B2
8209184 Dragosh et al. Jun 2012 B1
8229743 Carter Jul 2012 B2
8296377 Jablokov et al. Oct 2012 B1
8301454 Paden Oct 2012 B2
8311825 Chen Nov 2012 B2
8326636 White Dec 2012 B2
8335829 Jablokov et al. Dec 2012 B1
8335830 Jablokov et al. Dec 2012 B2
8352261 Terrell, II et al. Jan 2013 B2
8352264 White Jan 2013 B2
8355920 Gopinath et al. Jan 2013 B2
8380511 Cave et al. Feb 2013 B2
8401850 Jochumson Mar 2013 B1
8417530 Hayes Apr 2013 B1
8433574 Jablokov et al. Apr 2013 B2
8498872 White et al. Jul 2013 B2
8510094 Chin et al. Aug 2013 B2
8510109 Terrell, II et al. Aug 2013 B2
8543396 Terrell, II et al. Sep 2013 B2
8589164 Mengibar et al. Nov 2013 B1
8611871 Terrell, II Dec 2013 B2
8670977 Saraclar et al. Mar 2014 B2
8793122 White et al. Jul 2014 B2
8898065 Newman et al. Nov 2014 B2
9009055 Jablokov et al. Apr 2015 B1
9053489 Jablokov et al. Jun 2015 B2
9093061 Secker-Walker et al. Jul 2015 B1
9099087 Adams et al. Aug 2015 B2
20010047294 Rothschild Nov 2001 A1
20010056369 Takayama et al. Dec 2001 A1
20020016712 Geurts et al. Feb 2002 A1
20020029101 Larson et al. Mar 2002 A1
20020035474 Alpdemir Mar 2002 A1
20020052781 Aufricht et al. May 2002 A1
20020087330 Lee et al. Jul 2002 A1
20020091570 Sakagawa Jul 2002 A1
20020165719 Wang et al. Nov 2002 A1
20020165773 Natsuno et al. Nov 2002 A1
20030008661 Joyce et al. Jan 2003 A1
20030028601 Rowe Feb 2003 A1
20030050778 Nguyen et al. Mar 2003 A1
20030093315 Sato May 2003 A1
20030101054 Davis et al. May 2003 A1
20030105630 MacGinitie et al. Jun 2003 A1
20030115060 Junqua et al. Jun 2003 A1
20030125955 Arnold et al. Jul 2003 A1
20030126216 Avila et al. Jul 2003 A1
20030139922 Hoffmann et al. Jul 2003 A1
20030144906 Fujimoto et al. Jul 2003 A1
20030149566 Levin et al. Aug 2003 A1
20030182113 Huang Sep 2003 A1
20030187643 Van Thong et al. Oct 2003 A1
20030191639 Mazza Oct 2003 A1
20030200086 Kawazoe et al. Oct 2003 A1
20030200093 Lewis et al. Oct 2003 A1
20030212554 Vatland Nov 2003 A1
20030220792 Kobayashi et al. Nov 2003 A1
20030220798 Schmid et al. Nov 2003 A1
20030223556 Ju et al. Dec 2003 A1
20040005877 Vaananen Jan 2004 A1
20040015547 Griffin et al. Jan 2004 A1
20040019488 Portillo Jan 2004 A1
20040059632 Kang et al. Mar 2004 A1
20040059708 Dean et al. Mar 2004 A1
20040059712 Dean et al. Mar 2004 A1
20040107107 Lenir et al. Jun 2004 A1
20040133655 Yen et al. Jul 2004 A1
20040151358 Yanagita et al. Aug 2004 A1
20040176906 Matsubara et al. Sep 2004 A1
20040193420 Kennewick et al. Sep 2004 A1
20050004799 Lyudovyk Jan 2005 A1
20050010641 Staack Jan 2005 A1
20050021344 Davis et al. Jan 2005 A1
20050027538 Halonen et al. Feb 2005 A1
20050080786 Fish et al. Apr 2005 A1
20050101355 Hon et al. May 2005 A1
20050102142 Soufflet et al. May 2005 A1
20050149326 Hogengout et al. Jul 2005 A1
20050154587 Funari et al. Jul 2005 A1
20050182628 Choi Aug 2005 A1
20050187768 Godden Aug 2005 A1
20050188029 Asikainen et al. Aug 2005 A1
20050197145 Chae et al. Sep 2005 A1
20050197840 Wang et al. Sep 2005 A1
20050209868 Wan et al. Sep 2005 A1
20050239495 Bayne Oct 2005 A1
20050240406 Carroll Oct 2005 A1
20050261907 Smolenski et al. Nov 2005 A1
20050266884 Marriott et al. Dec 2005 A1
20050288926 Benco et al. Dec 2005 A1
20060004570 Ju et al. Jan 2006 A1
20060009974 Junqua et al. Jan 2006 A1
20060052127 Wolter Mar 2006 A1
20060053016 Falcon et al. Mar 2006 A1
20060074895 Belknap Apr 2006 A1
20060075055 Littlefield Apr 2006 A1
20060111907 Mowatt et al. May 2006 A1
20060122834 Bennett Jun 2006 A1
20060129455 Shah Jun 2006 A1
20060143007 Koh et al. Jun 2006 A1
20060149558 Kahn et al. Jul 2006 A1
20060149630 Elliott et al. Jul 2006 A1
20060159507 Jawerth et al. Jul 2006 A1
20060161429 Falcon et al. Jul 2006 A1
20060195318 Stanglmayr Aug 2006 A1
20060195541 Ju et al. Aug 2006 A1
20060217159 Watson Sep 2006 A1
20060235684 Chang Oct 2006 A1
20060235695 Thrift et al. Oct 2006 A1
20070005368 Chutorash et al. Jan 2007 A1
20070005795 Gonzalez Jan 2007 A1
20070033005 Cristo et al. Feb 2007 A1
20070038451 Cogne et al. Feb 2007 A1
20070038740 Steeves Feb 2007 A1
20070038923 Patel Feb 2007 A1
20070043569 Potter et al. Feb 2007 A1
20070061146 Jaramillo et al. Mar 2007 A1
20070061148 Cross et al. Mar 2007 A1
20070061300 Ramer et al. Mar 2007 A1
20070079383 Gopalakrishnan Apr 2007 A1
20070086773 Ramsten et al. Apr 2007 A1
20070106506 Ma et al. May 2007 A1
20070106507 Charoenruengkit et al. May 2007 A1
20070115845 Hochwarth et al. May 2007 A1
20070118374 Wise et al. May 2007 A1
20070118426 Barnes, Jr. May 2007 A1
20070118592 Bachenberg May 2007 A1
20070123222 Cox et al. May 2007 A1
20070133769 Da Palma et al. Jun 2007 A1
20070133771 Stifelman et al. Jun 2007 A1
20070150275 Garner et al. Jun 2007 A1
20070156400 Wheeler Jul 2007 A1
20070180718 Fourquin et al. Aug 2007 A1
20070233487 Cohen et al. Oct 2007 A1
20070233488 Carus et al. Oct 2007 A1
20070239837 Jablokov et al. Oct 2007 A1
20070255794 Coutts Nov 2007 A1
20080016142 Schneider Jan 2008 A1
20080037720 Thomson et al. Feb 2008 A1
20080040683 Walsh Feb 2008 A1
20080052075 He et al. Feb 2008 A1
20080063154 Tamari et al. Mar 2008 A1
20080063155 Doulton Mar 2008 A1
20080065481 Immorlica et al. Mar 2008 A1
20080065737 Burke et al. Mar 2008 A1
20080077406 Ganong, III Mar 2008 A1
20080091426 Rempel et al. Apr 2008 A1
20080133232 Doulton Jun 2008 A1
20080147404 Liu et al. Jun 2008 A1
20080154600 Tian et al. Jun 2008 A1
20080154870 Evermann et al. Jun 2008 A1
20080155060 Weber et al. Jun 2008 A1
20080172781 Popowich et al. Jul 2008 A1
20080177551 Schalk Jul 2008 A1
20080195588 Kim et al. Aug 2008 A1
20080198898 Taylor et al. Aug 2008 A1
20080198980 Skakkebaek et al. Aug 2008 A1
20080198981 Skakkebaek et al. Aug 2008 A1
20080200153 Fitzpatrick et al. Aug 2008 A1
20080201139 Yu et al. Aug 2008 A1
20080208582 Gallino Aug 2008 A1
20080208590 Cross, Jr. et al. Aug 2008 A1
20080221897 Cerra et al. Sep 2008 A1
20080243500 Bisani et al. Oct 2008 A1
20080243504 Poi Oct 2008 A1
20080261564 Logan Oct 2008 A1
20080275864 Kim et al. Nov 2008 A1
20080275873 Bosarge et al. Nov 2008 A1
20080301250 Hardy Dec 2008 A1
20080313039 Altberg et al. Dec 2008 A1
20080317219 Manzardo Dec 2008 A1
20090006194 Sridharan et al. Jan 2009 A1
20090012793 Dao et al. Jan 2009 A1
20090037255 Chiu et al. Feb 2009 A1
20090043855 Bookstaff et al. Feb 2009 A1
20090055175 Terrell, II et al. Feb 2009 A1
20090055179 Cho et al. Feb 2009 A1
20090063151 Arrowood et al. Mar 2009 A1
20090063268 Burgess et al. Mar 2009 A1
20090076821 Brenner et al. Mar 2009 A1
20090076917 Jablokov et al. Mar 2009 A1
20090077493 Hempel et al. Mar 2009 A1
20090083032 Jablokov et al. Mar 2009 A1
20090086958 Altberg et al. Apr 2009 A1
20090100050 Erol et al. Apr 2009 A1
20090117922 Bell May 2009 A1
20090124272 White et al. May 2009 A1
20090125299 Wang May 2009 A1
20090141875 Demmitt et al. Jun 2009 A1
20090150156 Kennewick et al. Jun 2009 A1
20090150405 Grouf et al. Jun 2009 A1
20090157401 Bennett Jun 2009 A1
20090163187 Terrell, II Jun 2009 A1
20090170478 Doulton Jul 2009 A1
20090182559 Gerl et al. Jul 2009 A1
20090182560 White Jul 2009 A1
20090199101 Cross et al. Aug 2009 A1
20090204410 Mozer et al. Aug 2009 A1
20090210214 Qian et al. Aug 2009 A1
20090228274 Terrell, II et al. Sep 2009 A1
20090240488 White et al. Sep 2009 A1
20090248415 Jablokov et al. Oct 2009 A1
20090271194 Davis et al. Oct 2009 A1
20090276215 Hager Nov 2009 A1
20090282363 Jhaveri et al. Nov 2009 A1
20090307090 Gupta et al. Dec 2009 A1
20090312040 Gupta et al. Dec 2009 A1
20090319187 Deeming et al. Dec 2009 A1
20100017294 Mancarella et al. Jan 2010 A1
20100049525 Paden et al. Feb 2010 A1
20100058200 Jablokov et al. Mar 2010 A1
20100121629 Cohen May 2010 A1
20100145700 Kennewick et al. Jun 2010 A1
20100146077 Davies et al. Jun 2010 A1
20100180202 Del Valle Lopez Jul 2010 A1
20100182325 Cederwall et al. Jul 2010 A1
20100191619 Dicker et al. Jul 2010 A1
20100223056 Kadirkamanathan Sep 2010 A1
20100268726 Gorodyansky et al. Oct 2010 A1
20100278453 King Nov 2010 A1
20100279667 Wehrs et al. Nov 2010 A1
20100286901 Geelen et al. Nov 2010 A1
20100293242 Buchheit et al. Nov 2010 A1
20100312619 Ala-Pietila et al. Dec 2010 A1
20100312640 Haldeman et al. Dec 2010 A1
20110029876 Slotznick et al. Feb 2011 A1
20110040629 Chiu et al. Feb 2011 A1
20110047452 Ativanichayaphong et al. Feb 2011 A1
20110054900 Phillips et al. Mar 2011 A1
20110064207 Chiu et al. Mar 2011 A1
20110144973 Bocchieri et al. Jun 2011 A1
20110161072 Terao et al. Jun 2011 A1
20110161276 Krumm et al. Jun 2011 A1
20110296374 Wu et al. Dec 2011 A1
20110313764 Bacchiani et al. Dec 2011 A1
20120022875 Cross et al. Jan 2012 A1
20120046950 Jaramillo et al. Feb 2012 A1
20120059653 Adams et al. Mar 2012 A1
20120095831 Aaltonen et al. Apr 2012 A1
20120166202 Carriere et al. Jun 2012 A1
20120259729 Linden et al. Oct 2012 A1
20120324391 Tocci Dec 2012 A1
20130041667 Longe et al. Feb 2013 A1
20130158994 Jaramillo et al. Jun 2013 A1
20130226894 Venkataraman et al. Aug 2013 A1
20130281007 Edge et al. Oct 2013 A1
20150255067 White et al. Sep 2015 A1
Foreign Referenced Citations (2)
Number Date Country
1 274 222 Jan 2003 EP
WO 2006101528 Sep 2006 WO
Non-Patent Literature Citations (41)
Entry
Desilets, A., Bruijn, B., Martin, J., 2002, Extracting keyphrases from spoken audio documents, Springer-Verlag Berlin Heidelberg, 15 pages.
Fielding, et al., Hypertext Transfer Protocol-HTTP/1.1, RFC 2616, Network Working Group (Jun. 1999), sections 7, 9.5, 14.30, 12 pages total.
Glaser et al., Web-based Telephony Bridges for the Deaf, Proc. South African Telecommunications Networks & Applications Conference (2001), Wild Coast Sun, South Africa, 5 pages total.
J2EE Application Overview, publicly available on http://www.orionserver.com/docs/j2eeoverview.html since Mar. 1, 2001. Retrieved on Oct. 26, 2007, 3 pages total.
Kemsley, et al., A Survey of Neural Network Research and Fielded Applications, 1992, in International Journal of Neural Networks: Research and Applications, vol. 2, No. 2/3/4, pp. 123-133. Accessed on Oct. 25, 2007 at https://citeseer.ist.psu.edu/cache/papers/cs/25638lftp:zSzzSzaxon.cs.byu.eduzSzpubzSzpaperszSzkemsley92.pdf/kemsley92survey.pdf, 12 pages total.
Transl8it! Translation engine, publicly available on http://www.transl8it.com since May 30, 2002. Retrieved on Oct. 26, 2007, 6 pages total.
vBulletin Community Forum, thread posted on Mar. 5, 2004. Page retrieved on Oct. 26, 2007 from http://www.vbulletin.com/forum/showthread.php?t=96976, 1 page total.
Lewis et al., SoftBridge: An Architecture for Building IP-based Bridges over the Digital Divide. Proc. South African Telecommunications Networks & Applications Conference (SATNAC 2002), Drakensberg, South Africa, 5 pages total.
“International Search Report” and “Written Opinion of the International Search Authority” (Korean Intellectual Property Office) in Yap, Inc. International Patent Application No. PCT/US2007/008621, dated Nov. 13, 2007, 13 pages total.
Marshall, James, HTTP Made Really Easy, Aug. 15, 1997, retrieved from http://www.jmarshall.com/easy/http/ on Jul. 25, 2008, 15 pages total.
Knudsen, Jonathan, Session Handling in MIDP, Jan. 2002, retrieved from http://developers.sun.com/mobility/midp/articles/sessions on Jul. 25, 2008, 7 pages total.
Allauzen, C., et al., A Generalized Composition Algorithm for Weighted Finite-State Transducers, Interspeech, Brighton, U.K., Sep. 2009, pp. 1203-1206.
Bisani, M., et al., Automatic Editing in a Back-End Speech-to-Text System, 2008, 7 pages.
Board of Patent Appeals and Interferences Answer in U.S. Appl. No. 12/352,442 dated May 15, 2012.
Brown, E., et al., Capitalization Recovery for Text, Springer-Verlag Berlin Heidelberg, 2002, 12 pages.
Glaser, M., et al., Web-Based Telephony Bridges for the Deaf, Proceedings of the South African Telecommunications Networks & Applications Conference, Wild Coast Sun, South Africa, 2001, 5 pages.
Gotoh, Y., et al., Sentence Boundary Detection in Broadcast Speech Transcripts, Proceedings of the ISCA Workshop, 2000, 8 pages.
Hori, T., et al., Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 4, May 2007, pp. 1352-1365.
Huang, J., et al., Extracting Caller Information From Voicemail, Springer-Verlag Berlin Heidelberg, 2002, 11 pages.
Huang, J., et al., Maximum Entropy Model for Punctuation Annotation From Speech, in ICSLP 2002, pp. 917-920.
Huang, J., et al., Extracting Caller Information from Voicemail, IBM T.J. Watson Research Center, 2002, pp. 67-77.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s) dated Jun. 4, 2010.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Dec. 6, 2010.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Feb. 14, 2012.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Jul. 21, 2009.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Jul. 21, 2011.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Mar. 17, 2011.
Information Disclosure Statement (IDS) Letter Regarding Common Patent Application(s), dated Nov. 24, 2009.
International Search Report and Written Opinion International Patent Application No. PCT/US2007/008621, dated Nov. 13, 2007.
J2EE Application Overview, http://www.orionserver.com/docs/j2eeoverview.html, Mar. 1, 2001.
Justo, R., et al., Phrase Classes in Two-Level Language Models for ASR, Springer-Verlag London Limited, 2008, 11 pages.
Kimura, K., et al., 1992, Association-Based Natural Language Processing With Neural Networks, Proceedings of the 7th Annual Meeting of the Association of Computational Linguistics, pp. 223-231.
Lewis, J., et al., SoftBridge: An Architecture for Building IP-Based Bridges Over the Digital Divide, Proceedings of the South African Telecommunications Networks & Applications Conference (SATNAC 2002), Drakensberg, South Africa, 5 pages.
Li, X., et al., Time Based Language Models, CIKM '03 Proceedings of the 12th International Conference on Information and Knowledge Management, 2003, pp. 469-475.
Office Action in Canadian Application No. 2648617 dated Feb. 27, 2014.
Ries, K., Segmenting Conversations by Topic, Initiative, and Style, Springer-Verlag Berlin Heidelberg, 2002, 16 pages.
Schalkwyk, J., et al., Speech Recognition With Dynamic Grammars Using Finite-State Transducers, Eurospeech 2003-Geneva, pp. 1969-1972.
Shriberg, E., et al., Prosody-Based Automatic Segmentation of Speech Into Sentences and Topics, 2000, 31 pages.
Soltau, H., and G. Saon, Dynamic Network Decoding Revisited, Automatic Speech Recognition and Understanding, 2009, IEEE Workshop, pp. 276-281.
Stent, A., et al., Geo-Centric Language Models for Local Business Voice Search, Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pp. 386-396, 2009.
Thomae, M., et al., Hierarchical Language Models for One-Stage Speech Interpretation, in Interspeech, 2005, pp. 3425-3428.
Related Publications (1)
Number Date Country
20160027443 A1 Jan 2016 US
Provisional Applications (17)
Number Date Country
60957386 Aug 2007 US
60957393 Aug 2007 US
60957701 Aug 2007 US
60957702 Aug 2007 US
60957706 Aug 2007 US
60972851 Sep 2007 US
60972853 Sep 2007 US
60972854 Sep 2007 US
60972936 Sep 2007 US
60972943 Sep 2007 US
60972944 Sep 2007 US
61016586 Dec 2007 US
61021335 Jan 2008 US
61021341 Jan 2008 US
61034815 Mar 2008 US
61038046 Mar 2008 US
61041219 Mar 2008 US
Continuations (3)
Number Date Country
Parent 14010433 Aug 2013 US
Child 14517720 US
Parent 13621179 Sep 2012 US
Child 14010433 US
Parent 12197213 Aug 2008 US
Child 13621179 US