The present invention relates to speech recognition lexicons, and more particularly to a tool for developing desired voice-tag “sounds like” pairs.
Developments in digital technologies in professional broadcasting, the movie industry, and home video have led to an increased production of multimedia data. Users of applications that involve large amounts of multimedia content must rely on metadata inserted in a multimedia data file to effectively manage and retrieve multimedia data. Metadata creation and management can be time-consuming and costly in multimedia applications. For example, to manage metadata for video multimedia data, an operator may be required to view the video in order to properly generate metadata by tagging specific content. The operator must repeatedly stop the video to apply metadata tags. This process may take as much as four or five times the real-time length of the video data. As a result, metadata tagging is one of the largest expenses associated with multimedia production.
Voice-tagging systems allow a user to speak a voice-tag into an automatic speech recognition (ASR) system. The ASR system converts the voice-tag into text to be inserted as metadata in a multimedia data stream. Because the user does not need to stop or replay the data stream, voice-tagging can be done in real time. In some embodiments, voice-tagging can be accomplished during live recording of multimedia data. An exemplary voice-tagging system 10 is shown in the accompanying drawings.
A system for developing voice-tag “sounds like” pairs for a voice-tagging lexicon comprises a voice-tag editor receptive of alphanumeric characters indicative of a voice-tag. The voice-tag editor is configured to display and edit the alphanumeric characters. A text parser is connected to the editor and is operable to generate normalized text corresponding to the alphanumeric characters. The normalized text serves as recognition text for the voice-tag and is displayed by the voice-tag editor. A storage mechanism is connected to the editor and is operable to update a lexicon with the displayed alphanumeric characters and the corresponding normalized text, thereby developing a desired voice-tag “sounds like” pair.
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The present invention will become more fully understood from the detailed description and the accompanying drawings.
The following description of the preferred embodiment(s) is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
A voice-tag “sounds like” pair is a combination of two text strings, where the voice-tag is the text that will be used to tag the multimedia data and the “sounds like” is the verbalization that the user is supposed to utter in order to insert the voice-tag into the multimedia data. For example, if the user wants to insert the voice-tag “Address 1” when the phrase “101 Broadway St” is spoken, then the user creates a voice-tag “sounds like” pair of “Address 1” and “101 Broadway St” in the voice-tagging lexicon.
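By way of illustration only, such a pair can be represented as a simple pair of strings. The following Python sketch is not part of the claimed invention; the class and field names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class VoiceTagPair:
        """One lexicon entry: the tag text and the phrase the user speaks."""
        voice_tag: str    # text inserted as metadata, e.g. "Address 1"
        sounds_like: str  # verbalization that triggers it, e.g. "101 Broadway St"

    # The example pair from the text above:
    entry = VoiceTagPair(voice_tag="Address 1", sounds_like="101 Broadway St")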
A voice-tagging system 20 for generating and/or modifying a voice-tagging lexicon is shown in the accompanying drawings.
The text parser 24 receives the alphanumeric “sounds like” input 31 from the voice-tag editor 22 and generates corresponding normalized text 34 according to a rule set 36. Normalization is the process of identifying alphanumeric input such as numbers, dates, acronyms, and abbreviations and transforming them into full text, as is known in the art. The normalized text 34 serves as recognition text for the voice-tag and as user feedback for the voice-tag editor 22. The voice-tag editor 22 is configured to display the voice-tag data 38 to the user. A storage mechanism 40 receives the voice-tag data 38 and updates the lexicon 26 with the voice-tag data 38. For example, a user may intend that “Address 1” is a voice-tag for the “sounds like” input “101 Broadway St.” The parser normalizes the “sounds like” text, and phonetic transcriptions are generated from the normalized text. Subsequently, during the voice-tagging process, if the user says “one oh one broadway street,” the voice-tag “Address 1” will be associated with the corresponding timestamp of the multimedia data.
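By way of illustration only, the following Python sketch shows one minimal form such a normalization rule set could take. The digit-by-digit reading and the abbreviation table are assumptions for this example, not the rule set 36 itself.

    import re

    # Hypothetical rule set: abbreviation expansions and digit names.
    ABBREVIATIONS = {"st": "street", "dr": "drive", "ave": "avenue"}
    DIGITS = {"0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
              "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def normalize(text: str) -> str:
        """Expand digits and known abbreviations into full words."""
        words = []
        for token in re.findall(r"[A-Za-z]+|\d", text):
            token = token.lower()
            words.append(DIGITS.get(token) or ABBREVIATIONS.get(token, token))
        return " ".join(words)

    print(normalize("101 Broadway St"))  # -> "one oh one broadway street"

A production normalizer would also handle dates, acronyms, and alternative number readings (e.g., “one hundred one”).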
The transcription generator 28 receives the voice-tag data 38. The transcription generator 28 may be configured in a variety of different ways. In one embodiment of the present invention, the transcription generator 28 accesses a baseline dictionary 42 or conventional letter-to-sound rules to produce a suggested phonetic transcription. An initial phonetic transcription of the normalized text 34 may be derived through a lookup in the baseline dictionary 42. In the event that no pronunciation is found for the spelled word, conventional letter-to-sound rules may be used to generate an initial phonetic transcription.
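By way of illustration only, the lookup-with-fallback logic can be sketched as follows in Python. The toy dictionary and the trivial one-symbol-per-letter fallback are assumptions standing in for the baseline dictionary 42 and conventional letter-to-sound rules.

    # Hypothetical baseline dictionary mapping words to phoneme strings.
    BASELINE_DICTIONARY = {
        "broadway": "b r aa d w ey",
        "street": "s t r iy t",
    }

    def letter_to_sound(word: str) -> str:
        """Placeholder fallback: one symbol per letter. A real system
        would apply trained letter-to-sound rules here."""
        return " ".join(word)

    def initial_transcription(phrase: str) -> str:
        """Dictionary lookup per word, falling back to letter-to-sound."""
        parts = [BASELINE_DICTIONARY.get(w) or letter_to_sound(w)
                 for w in phrase.lower().split()]
        return " ".join(parts)

    print(initial_transcription("Broadway street"))  # -> "b r aa d w ey s t r iy t"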
An exemplary voice-tag pair system is shown in the accompanying drawings.
An exemplary voice-tag editor allowing the user to input and/or modify voice-tags is shown in the accompanying drawings.
As the user enters the alphanumeric input in the voice-tag field 52, the parser operates automatically on the alphanumeric input and returns normalized text in a parsed text field 62. A “sounds like” field 64 is initially automatically filled in with text identical to the alphanumeric input entered in the voice-tag field 52. The user may view the normalized text to determine whether the parser correctly parsed the alphanumeric input and select a desired entry from the parsed text field 62. For example, the user may wish that the voice-tag “50 m” be associated with the spoken input “fifty meters.” Therefore, the user selects “fifty meters” from the parsed text field 62. The “sounds like” field 64 is subsequently filled in with the selected entry. If the normalized text in the parsed text field 62 is not correct, the user may modify the “sounds like” field 64. The parser operates automatically on the modified “sounds like” field 64 to generate revised normalized text in the parsed text field 62. Additionally, the voice-tag editor may notify the user that the alphanumeric input cannot be parsed. For example, if the alphanumeric input includes a symbol that cannot be parsed, the voice-tag editor may prompt the user to replace the symbol or the entire alphanumeric input.
The user may add the voice-tag in the voice-tag field 52 to the lexicon by selecting the add button 66. The voice-tag will be stored as a voice-tag recognition pair with the text in the “sounds like” field 64. A transcription generator generates a phonetic transcription of the text in the “sounds like” field 64. Henceforth, the phonetic transcription will be paired with the corresponding voice-tag. Adding the voice-tag to the lexicon will cause the voice-tag to be displayed in the voice-tag lexicon window 54. The user can delete voice-tags from the lexicon by selecting a voice-tag from the voice-tag lexicon window 54 and selecting a delete button 68. The user can update a selected voice-tag by selecting the update button 70. The user can test the automatic speech recognition associated with a voice-tag by selecting a test ASR button 72. The update and test ASR functions of the voice-tag editor are explained in more detail below.
Referring now to the update function of the voice-tag editor, the transcription generator produces an n-best list of candidate phonetic transcriptions for a selected voice-tag, and the current transcription appears in a transcription field 84.
The user may select the correct transcription from the n-best list by selecting a drop-down arrow 86. The user may edit the existing transcription that appears in the transcription field 84 if none of the transcriptions in the n-best list are correct. The user may select an update button 88 to update a transcription list 90. The user can add a selected transcription to the transcription list 90 by selecting an add button 92. The user can delete a transcription from the transcription list 90 by selecting a delete button 94. The user may select a “new” button 96 to clear all fields and begin anew.
The transcriptions in the transcription list 90 represent possible pronunciations of the selected voice-tag.
Referring back to the voice-tag editor, the user may reduce potential recognition confusion by selecting a disambiguate button 112. For example, selecting the disambiguate button 112 initiates a procedure to minimize recognition confusion by detecting whether two or more words are confusingly similar. The user may then have the option of selecting a different phrase to use for a particular voice-tag to avoid confusion. Alternatively, the user interface may employ other methods to optimize speech ergonomics. “Speech ergonomics” refers to addressing potential problems in the voice-tag lexicon before they disrupt the voice-tagging process. Such problems are further described below.
One known problem in speech recognition is confusable speech entries. In the context of voice-tagging, confusable speech entries are phrases in the lexicon that are very close in pronunciation. In one scenario, isolated words such as “car” and “card” may have confusingly similar pronunciations. Similarly, certain combinations of words may have confusingly similar pronunciations. Another problem in speech recognition is unbalanced phrase lengths. Unbalanced phrase lengths can occur when some phrases in the lexicon are very short and others are very long. The length of a particular phrase is not determined by the length of the alphanumeric input or the “sounds like” field; rather, length is measured on the phonetic transcription associated therewith. Still another problem in speech recognition is hard-to-pronounce phrases. Such phrases require increased attention and effort to verbalize.
In order to compensate for confusingly similar entries, the present invention may incorporate technology to measure the similarity of two or more transcriptions. For example, a measure distance may be generated that indicates the similarity of two or more transcriptions. A measure distance of zero indicates that two entries are identical; the measure distance increases as similarity decreases. The measure distance may be calculated using a variety of suitable methods. Source code for an exemplary measure distance method is provided in Appendix A. One method measures the number of edits that would be necessary to make a first transcription identical to a second transcription. “Edits” refers to insert, delete, and replace operations. Each particular edit may have a corresponding penalty, and the penalties for all edits may be stored in a penalty matrix. Another method to generate the measure distance is to build an actual speech recognition grammar for each entry and determine a difference between the Hidden Markov Models (HMMs) that correspond to each entry. For example, the difference between the HMMs may be determined using an entropy measure.
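By way of illustration only (and distinct from the Appendix A source code), the following Python sketch computes a weighted edit distance between two phoneme sequences. The uniform penalties are an assumption, standing in for a full penalty matrix.

    def measure_distance(a: list[str], b: list[str],
                         ins: float = 1.0, dele: float = 1.0,
                         sub: float = 1.0) -> float:
        """Weighted edit distance between two phoneme sequences.
        Uniform penalties are used here; a penalty matrix would make
        the substitution cost depend on which phonemes are exchanged."""
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * dele
        for j in range(1, n + 1):
            d[0][j] = j * ins
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0.0 if a[i - 1] == b[j - 1] else sub
                d[i][j] = min(d[i - 1][j] + dele,      # delete
                              d[i][j - 1] + ins,       # insert
                              d[i - 1][j - 1] + cost)  # replace or match
        return d[m][n]

    # Identical transcriptions yield zero; "car" vs. "card" differs by one insert:
    print(measure_distance("k aa r".split(), "k aa r d".split()))  # -> 1.0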
With respect to unbalanced phrase lengths, the speech recognition technology of the present invention operates on the “sounds like” field. In other words, the lengths of the transcriptions associated with the “sounds like” field are compared. One method to address the problem of unbalanced phrase lengths is to build a length histogram that represents the distribution of phrases with a particular length. The present invention may incorporate statistical analysis methods to identify phrases that diverge too much from the center of the histogram and mark such phrases as too short or too long.
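By way of illustration only, the following Python sketch flags outliers with a simple z-score threshold; this is one assumed statistical criterion standing in for the histogram-based analysis described above.

    from statistics import mean, stdev

    def flag_unbalanced(transcriptions: list[str], z: float = 2.0) -> list[str]:
        """Flag phrases whose transcription length (in phonemes) diverges
        too far from the center of the length distribution."""
        lengths = [len(t.split()) for t in transcriptions]
        mu = mean(lengths)
        sigma = stdev(lengths) if len(lengths) > 1 else 0.0
        return [t for t, n in zip(transcriptions, lengths)
                if sigma and abs(n - mu) > z * sigma]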
With respect to hard-to-pronounce phrases, such phrases can be identified by observing the syllabic structure of the phrases. Each phrase is syllabified so that the individual syllables may be noted. The syllables may then be identified as unusual or atypical. The method for identifying the syllables can be a rule- or knowledge-based system, a statistical learning system, or a combination thereof. The unusual syllables may be caused by a word with an unusual pronunciation, a word for which the letter-to-sound rules perform poorly, or a combination thereof. Additionally, a transcription that is incorrectly entered by the user may be problematic. A problematic transcription may be marked for future resolution. Subsequently, inter-word and/or inter-phrase problems are analyzed.
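By way of illustration only, a statistical variant of this check might treat rare syllables as atypical. The following Python sketch assumes syllabification has already happened upstream; the frequency threshold is an assumption.

    from collections import Counter

    def flag_unusual_syllables(syllabified_phrases: list[list[str]],
                               min_count: int = 2) -> set[str]:
        """Mark syllables seen fewer than `min_count` times across the
        lexicon as unusual; phrases containing them merit attention."""
        counts = Counter(s for phrase in syllabified_phrases for s in phrase)
        return {s for s, c in counts.items() if c < min_count}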
Therefore, the above problems may be addressed by the voice-tag editor of the present invention, for instance by way of the disambiguate button 112 described above.
An exemplary disambiguating process 120 for a voice-tag editor is shown in the accompanying drawings.
Upon detection of a problem with the selected voice-tag, the disambiguate button 112 becomes available for selection. At step 130, the user may continue to add the selected voice-tag to the lexicon, or disambiguate the selected voice-tag at step 132. For example, the user may select the disambiguate button 112. The voice-tag editor may provide various solutions for the problem. For example, the voice-tag editor may incorporate a thesaurus. If the desired voice-tag entered by the user is determined to have one or more of the above-mentioned problems, the voice-tag editor may provide synonyms for the spoken phrase that would avoid the problem. For example, if the spoken phrase “fifty meters” sounds confusingly similar to “fifteen meters,” the voice-tag editor may suggest that the spoken phrase “five zero meters” be used instead. Additionally, the voice-tag editor may give the user the option of editing one or more of the transcriptions associated with the selected voice-tag. The user may ignore the suggestions of the voice-tag editor and continue to add the selected voice-tag to the lexicon, or modify the voice-tag, at step 134.
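By way of illustration only, the following Python sketch shows one way such suggestions could be produced. The synonym table is a hypothetical stand-in for the thesaurus, and `is_confusable` represents whatever confusability test (e.g., the measure distance above) the editor applies.

    from typing import Callable

    # Hypothetical synonym table standing in for a full thesaurus.
    SYNONYMS = {"fifty": ["five zero"]}

    def suggest_alternatives(phrase: str,
                             is_confusable: Callable[[str], bool]) -> list[str]:
        """Rephrase a confusable phrase by substituting synonyms one word
        at a time, keeping only candidates that pass the confusability test."""
        words = phrase.split()
        suggestions = []
        for i, word in enumerate(words):
            for alt in SYNONYMS.get(word, []):
                candidate = " ".join(words[:i] + [alt] + words[i + 1:])
                if not is_confusable(candidate):
                    suggestions.append(candidate)
        return suggestions

    # e.g. suggest_alternatives("fifty meters", lambda p: p == "fifty meters")
    # -> ["five zero meters"]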
The description of the invention is merely exemplary in nature and, thus, variations that do not depart from the gist of the invention are intended to be within the scope of the invention. Such variations are not to be regarded as a departure from the spirit and scope of the invention.