This invention relates to speech training and, in particular, to techniques for teaching non-native speakers the meaning and/or pronunciation of phrases in a given language.
With the development of digital technologies, high tech methods of teaching the meaning and pronunciation of phrases in a given language have come into wide use. These technologies include methods that both do and do not require a relatively expensive apparatus such as a personal computer. Additionally, there are devices that either use or do not use speech recognition as part of the learning strategy.
Typical examples of devices that require a relatively expensive electronic apparatus such as a personal computer and that do not use speech recognition in the learning experience include the following:
Typical examples of devices that require a relatively expensive apparatus such as a personal computer and that do include speech recognition in the learning experience include:
Currently, there are no handheld, inexpensive devices on the market that employ speech recognition to offer feedback to the user on the quality of pronunciation. BBK, TCL, JF, and SOCO are Asian companies that offer language assistance products in the price range of $25.00 to $95.00. They are all record-and-playback devices that offer different levels of playback control, none of which provide information to the user other than his original recording. Some also contain electronic Chinese/English dictionaries.
In the price range up to $400.00, Global View, Lexicomp, Golden, GSL, Minjin and BBK offer models that are also record-and-playback devices. They allow storage of larger recordings, and some contain speech recordings by professional voices that allow a user to make his own audio comparison of his recording with that of a professional voice. None of these devices provides evaluation and feedback on the quality of the user's recording.
Current art pronunciation trainers, such as those described above, suffer from two drawbacks. First, many of them require use of a complicated apparatus such as a personal computer. Many potential students either do not have access to personal computers or have access to them only in classrooms. Pronunciation training is better done in private than in a classroom environment, because the latter may be embarrassing to the individual and because correcting individuals upsets the normal pace of classroom activity.
The second deficiency of current art pronunciation trainers is that feedback to the user is either non-existent or is offered in ways that many users have difficulty assimilating. These include graphs of formant frequencies, scores given as numbers, and pictures of the placement of the tongue and lips for correct pronunciation.
Embodiments of the present invention include a computer-implemented pronunciation training method comprising receiving a spoken utterance from a user, the spoken utterance including a plurality of sub-units of sound, analyzing the speech quality of the plurality of sub-units of sound of the spoken utterance, and generating an audio signal of the spoken utterance from the user while simultaneously displaying the speech quality of each sub-unit of the spoken utterance as each sub-unit is generated.
In one embodiment, the present invention provides an electronic device that teaches the elements of correct pronunciation through use of a speech recognizer that evaluates the prosody, intonation, phonetic accuracy and lip and tongue placement of a spoken phrase.
In accordance with one embodiment of the invention, the user may practice pronunciation in private because the pronunciation trainer is an inexpensive, hand-held, battery operated device.
In accordance with another embodiment of the invention, feedback on the prosody, intonation, and phonetic accuracy of a user's spoken phrases is provided through an intuitive visual means that is easy for a non-technical person to interpret.
In accordance with another embodiment of the invention, the user may listen to his recording while observing the visual feedback in order to learn where pronunciation errors were made in a phrase.
In accordance with another embodiment of the invention, the recording of the user can be played back at slow speed while the user observes the visual feedback in order for the user to better identify the location of pronunciation errors in a phrase.
In accordance with another embodiment of the invention, the user can compare his recording with that of a professional voice to learn the correct pronunciation of those parts of phrases that he learned from the visual feedback were not well-spoken.
In accordance with another embodiment of the invention, the user can set the level of the analysis of his speech in order to increase the subtlety of the analysis as his proficiency improves.
In accordance with another embodiment of the invention, the electronic device that teaches pronunciation may be used without modification by speakers having different native tongues because a small instruction manual in the language of the speaker provides all the information required for the speaker to operate the electronic device.
In accordance with another embodiment of the invention, the visual means used to provide pronunciation feedback is also used as a signal level indicator during recordings in order to guarantee an appropriate signal amplitude.
In accordance with another embodiment of the invention, the background noise level is monitored by the electronic device and the user is alerted whenever the signal-to-noise ratio is too low for a reliable analysis by the speech recognizer.
In accordance with another embodiment of the invention, the performance of the speech recognizer is improved by normalizing its output according to the mean and standard deviation of the outputs from a corpus of good speakers saying the phrases being studied.
The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present invention.
Described herein are techniques for implementing pronunciation training. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these examples and specific details. In other instances, certain methods and processes are shown in block diagram form in order to avoid obscuring the present invention. Furthermore, while the present invention may be used for pronunciation training in any language, the present description uses English. However, it is recognized that any language may be taught using the methods and devices described in this disclosure.
There are basically two properties to proper speech: the sounds and how those sounds are spoken. Words or phrases spoken by a user are referred to herein as “utterances.” Utterances may be broken down into sub-units for more detailed analysis. One common example of this is to break an utterance down into phonemes. Phonemes are the sounds in an utterance, and thus represent the first of the two properties mentioned above. The second property is how the sounds are spoken. The term used herein to describe “how” the sounds are spoken is “prosody.” Prosody may include pitch contour as a function of time and emphasis. Emphasis may include the volume (loudness) of the sounds as a function of time, the duration of the various sub-units of the utterance, the location or duration of pauses in the utterance or the duration of different parts of each utterance.
Proper pronunciation involves speaking the phonemes (sounds) of the language correctly and using the correct prosody (where “correct” means according to local, regional or business customs, which may be programmable). As used herein, the term “speech quality” refers to the sounds and prosody of a user's utterance. For example, speech quality may be improved by minimizing the difference between the sound and prosody of a reference utterance and the sound and prosody of a user's utterance.
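As an illustration of these definitions, the following sketch shows one way an utterance might be represented as phoneme sub-units together with prosody features, and one crude way a speech-quality comparison against a reference utterance might be computed. The class and function names, fields, and the simple distance measure are assumptions for illustration only and are not part of the disclosed device.

```python
# Illustrative data structures for an utterance and its sub-units of sound.
# All names and fields here are assumptions, not part of the disclosure.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubUnit:
    phoneme: str          # e.g. "i:" or "l"
    duration_ms: float    # how long the sub-unit lasted
    energy: float         # loudness (emphasis) of the sub-unit
    pitch_hz: List[float] = field(default_factory=list)  # pitch contour samples

@dataclass
class Utterance:
    text: str
    sub_units: List[SubUnit]

def prosody_distance(user: Utterance, reference: Utterance) -> float:
    """Crude per-sub-unit comparison of duration and energy; smaller is better."""
    pairs = zip(user.sub_units, reference.sub_units)
    return sum(abs(u.duration_ms - r.duration_ms) + abs(u.energy - r.energy)
               for u, r in pairs)
```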
A user may press the ‘ON/OFF’ button 501 to turn the unit on and then press the ‘REPEAT PHRASE’ button 503. The user will hear ‘One’ and ‘Please say that again’ in a first language (e.g., typically a language that the user wants to learn, such as English). ‘One’ means that this is phrase number one (1). ‘Please say that again’ is the phrase that the user will learn to say. The user may refer to an instruction manual to find the translation of the phrase ‘Please say that again.’ A user may press the ‘REPEAT PHRASE’ button 503 if the user wants to hear the English pronunciation of this phrase again, or press the ‘NEXT PHRASE’ 505 or ‘LAST PHRASE’ 507 button to hear the next or previous phrase along with its phrase number. A user may press the ‘MODE’ button 509 to toggle whether the analysis of the user's upcoming recording will be in the ‘BEGINNER’ or ‘ADVANCED’ mode. The lights 511 and 513 below the ‘MODE’ button 509 indicate the mode. In ‘BEGINNER’ mode, the quality of the spoken utterance is evaluated against a lower standard than in the ‘ADVANCED’ mode (i.e., the speech quality can be lower in ‘BEGINNER’ mode for a given output). Additional modes, or standards, could also be used. Moreover, the standard used for evaluating the user's spoken utterance may be altered after the spoken utterance is first analyzed so that a user can determine his level of sophistication.
After a user selects a phrase and the analysis mode, the user may press and hold down the ‘RECORD’ button 515. A fraction of a second after pressing the ‘RECORD’ button 515, the user may say the phrase of interest. The user may then release the ‘RECORD’ button 515 a moment after finishing the utterance. During recording, the row of lights 517 will monitor the loudness of the input speech. If the user speaks too softly or is too far from the unit, only the first one or two lights at the left of the group will come on. If the user speaks too loudly or is too close to the unit, the last red light at the right of the group will come on. In other words, the system may produce a visual output that is used to indicate the amplitude of the user's spoken utterance. Either of these situations will produce a low quality recording, so the user should practice until the speech volume is adjusted to turn on the middle, green lights while speaking.
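A minimal sketch of how the row of lights might double as a level meter during recording is given below. The use of an RMS amplitude, the full-scale value, and the function name are illustrative assumptions, not details taken from this disclosure.

```python
# Hypothetical level meter: map the loudness of the input speech (taken here
# as an RMS value, an assumption) to the number of lights turned on.
def level_meter(rms, full_scale=32768.0, num_lights=12):
    """Return a list of booleans, one per light, left to right."""
    fraction = min(rms / full_scale, 1.0)
    lit = max(1, round(fraction * num_lights))
    return [i < lit for i in range(num_lights)]

# A quiet input lights only the leftmost light or two (speak louder or move
# closer); an input near full scale lights the whole row, including the
# rightmost red light (speak more softly or move back).
print(level_meter(1500.0))
print(level_meter(32000.0))
```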
After the user finishes speaking, the unit will analyze the user's input utterance and report the quality of the pronunciation on the twelve light emitting diodes 517 as the spoken utterance is simultaneously played back. Each of the twelve lights represents a segment of the recorded utterance, with the left-to-right arrangement of lights corresponding to the beginning-to-end of the utterance. The light emitting diodes produce different color outputs depending on the accuracy of the user's utterance. If a light is green, that segment of the user's spoken utterance has a good speech quality. If it is yellow, that segment is questionable, while, if it is red, that segment of the user's spoken utterance has a poor speech quality. In other words, the colors of the light emitting diodes correspond to the speech quality at successive portions of the user's spoken utterance.
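The mapping from per-segment speech quality to light color might be sketched as follows. The numeric thresholds, and the assumption that a larger score indicates a larger departure from the reference corpus, are chosen only to illustrate the green/yellow/red scheme.

```python
# Hypothetical color mapping for the twelve segment scores. Thresholds and
# the direction of the scores are assumptions, not part of the disclosure.
def segment_colors(scores, good_limit=12.0, questionable_limit=18.0):
    colors = []
    for score in scores:
        if score <= good_limit:
            colors.append("green")       # segment spoken well
        elif score <= questionable_limit:
            colors.append("yellow")      # segment questionable
        else:
            colors.append("red")         # segment spoken poorly
    return colors
```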
A user can listen carefully for the parts of the recording where the lights are either yellow or red by pressing the ‘PLAY BACK’ button 519. Alternatively, a user can obtain a more precise location of any pronunciation problems in the phrase by pressing the ‘SLOW PLAY BACK’ button 521 and then watching the lights. A user can also hear the correct English pronunciation by pressing the ‘REPEAT PHRASE’ button 503 again. By comparison of the user's spoken utterance with the correct English reference utterance, a user can learn how the poorly spoken parts of his/her spoken utterance may be improved, and the user can improve them by making new recordings.
Embodiments of the present invention may include a variety of phrases. Example phrases that may be used in a system are shown below for illustrative purposes. The following may be included with a system according to the present invention so that a user will have a reference to translate utterances being produced by the system during a pronunciation and language training session. The first phrase is in a first language, which is typically a language the user is trying to learn. These phrases are indicated by an underscore-one (e.g., “One_1,” in English). The second phrase is in a second language, which is typically a language the user understands (e.g., the user's native language). These phrases are indicated by an underscore-two (e.g., “One_2,” in Chinese).
The pronunciation trainer may also provide audio feedback that explains situations that arise during its use. For example, the analysis of a user's utterance may not be reliable if the user's speech volume was not large enough compared to the background noise. In this case, the pronunciation trainer may say, “Please talk louder or move to a quieter location.” Embodiments of the present invention may work better in a quiet location. If the noise level is too large for a reliable analysis of the user's recording, the pronunciation trainer may say, “This product will work better if you move to a quiet location.” If the user's utterance is distorted during playback, the user may have spoken too loudly, so the unit might say, “Please talk more softly while watching the lights.” If only part of the spoken utterance is played back, the user may have paused in the middle of the recording for a long enough time that the pronunciation trainer thought the user was done speaking. In this case, the user may be prompted with a phrase suggesting that the pauses in his recording should be shorter.
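One way the signal-to-noise check behind these prompts might look is sketched below. The 10 dB threshold and the selection of a single prompt are assumptions for illustration only.

```python
import math

# Hypothetical signal-to-noise check for deciding whether a recording can be
# analyzed reliably. Threshold and prompt wording are assumed values.
def snr_db(speech_rms, noise_rms):
    return 20.0 * math.log10(max(speech_rms, 1e-9) / max(noise_rms, 1e-9))

def recording_prompt(speech_rms, noise_rms, min_snr_db=10.0):
    if snr_db(speech_rms, noise_rms) >= min_snr_db:
        return None  # recording is reliable enough to analyze
    return "Please talk louder or move to a quieter location."
```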
The following describes an example of one speech recognizer that may be used in embodiments of the present invention. This speech recognizer is used in this specific embodiment, and features of the description that follows should not be imported into the claims or definitions of the claim elements unless specifically so stated by this disclosure. Additional support for some of the concepts described below may be found in commonly-owned U.S. patent application Ser. No. 10/866,232, entitled Method and Apparatus for Specifying and Performing Speech Recognition Operations, filed Jun. 10, 2004, naming Pieter J. Vermeulen, Robert E. Savoie, Stephen Sutton and Forrest S. Mozer as inventors, the contents of which are hereby incorporated herein by reference in their entirety. Any definitions of any claim terms provided in the present disclosure take precedence over definitions in U.S. patent application Ser. No. 10/866,232 to the extent any such definitions are conflicting or related.
The operation of a pronunciation trainer according to one example implementation may be understood by reference to the example data discussed below.
The third column of data in
The fifth column of the figure gives the data used to normalize the raw scores. Each row of this column contains three numbers, the first of which is the energy associated with that block of data, where a bigger number corresponds to a louder volume of that segment of speech. The second and third numbers are the means and standard deviations of the raw scores of the corpus of good speakers who recorded the same phrase. These means and standard deviations are segregated by the triples of phones. That is, the raw scores for each good speaker whose preceding phone, current phone and following phone were the same were accumulated and the means and standard deviations of this accumulation were computed off-line. Thus, for example, the mean and standard deviation of all speakers who said /l-R/ followed by /i:-L/, followed by /i:-R/ was computed to be 10.79 and 6.68, as can be seen in the data associated with block 9.
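A sketch of how the per-triple means and standard deviations might be accumulated off-line from the corpus of good speakers is given below. The corpus format, the keying by (preceding, current, following) phone, and the two-example minimum are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical off-line accumulation of corpus statistics, keyed by the
# (preceding phone, current phone, following phone) triple.
def triple_statistics(corpus_scores):
    """corpus_scores: iterable of ((prev, cur, nxt), raw_score) pairs."""
    grouped = defaultdict(list)
    for triple, raw_score in corpus_scores:
        grouped[triple].append(raw_score)
    stats = {}
    for triple, raw_scores in grouped.items():
        if len(raw_scores) < 2:
            continue  # too few examples for a reliable mean and deviation
        stats[triple] = (mean(raw_scores), pstdev(raw_scores))
    return stats
```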
The raw scores for each block were converted to a normalized score, which is the right-most number of column 4, using the equation:
normalized score = 10 + 10 * (raw score − mean) / (standard deviation)
Thus, the normalized score of block 9 is 6 because the raw score was less than the mean score (10.79) by a fraction of the standard deviation (6.68).
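Transcribed into code, the normalization is as follows. The raw score of 8.1 used in the example is an assumed value chosen only to be consistent with the reported result for block 9.

```python
# The normalization equation from the text.
def normalize(raw_score, mean, std_dev):
    return 10.0 + 10.0 * (raw_score - mean) / std_dev

# Block 9: mean 10.79, standard deviation 6.68. A raw score a fraction of one
# standard deviation below the mean gives a normalized score of about 6.
print(round(normalize(8.1, 10.79, 6.68)))  # -> 6
```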
There are two corrections to the normalized scores produced as described above. The first occurs for cases where there were not sufficient examples of a triple in the corpus of the good speakers to produce a reliable mean and standard deviation. When this happens, as for blocks 22 and 37, the normalized score is recorded as 255. In a second pass, this score is replaced by the average of the scores on either side of it. Thus, for example, the final normalized score for block 37, given as the first number in the normalized score column, is the average of 6 and 20, or 13.
Because the distribution of scores of phone triples is not a normal distribution, there are sometimes outliers that produce large normalized scores. This happens in the case of block 11 because the mean and standard deviation for this triple are small. Thus, even though the raw score for this block was small (3), it was a standard deviation above the mean of the corpus of good speakers, so the first normalized score was 20. To handle such cases that usually arise from small standard deviations, the normalized score of any block is replaced by the average of the normalized scores of its neighbors if it is two or more times larger than the average of its neighbors. Thus, the final normalized score of block 11 became 10.
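The two corrections might be implemented along the following lines. The use of 255 as the marker for missing corpus statistics follows the text; the handling of blocks at the ends of the utterance and of adjacent missing blocks is an assumption.

```python
# Hypothetical implementation of the two corrections described above.
# Assumes no two adjacent blocks are both missing corpus statistics.
def correct_scores(scores):
    out = list(scores)
    # First correction: fill in blocks marked 255 (no reliable corpus stats)
    # with the average of the scores on either side.
    for i, s in enumerate(out):
        if s == 255:
            left = out[i - 1] if i > 0 else out[i + 1]
            right = out[i + 1] if i < len(out) - 1 else out[i - 1]
            out[i] = (left + right) / 2.0
    # Second correction: replace any score that is two or more times larger
    # than the average of its neighbors (outliers from small deviations).
    for i in range(1, len(out) - 1):
        neighbor_avg = (out[i - 1] + out[i + 1]) / 2.0
        if neighbor_avg > 0 and out[i] >= 2.0 * neighbor_avg:
            out[i] = neighbor_avg
    return out

# Block 37 from the example: a 255 between scores of 6 and 20 becomes 13.
print(correct_scores([6, 255, 20]))  # -> [6, 13.0, 20]
```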
The importance of normalizing the raw scores is evidenced by the data of blocks 29, 30, and 31, all of which produced raw scores that were large. However, the mean and standard deviation of block 29 for example, was 61.49 and 17.55, so the raw score of 54 was less than the mean, resulting in a normalized score of 6.
The final normalized scores are averaged to produce 12 values that control the 12 light emitting diodes 517 described above.
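Reducing the corrected per-block scores to the twelve values that drive the lights might be done as sketched below. The even division of the blocks into twelve contiguous groups is an assumption.

```python
# Hypothetical reduction of per-block scores to one value per light, left to
# right, each value being the average of its group of blocks.
def led_values(block_scores, num_lights=12):
    values = []
    n = len(block_scores)
    for i in range(num_lights):
        start = i * n // num_lights
        end = max(start + 1, (i + 1) * n // num_lights)
        group = block_scores[start:end]
        values.append(sum(group) / len(group))
    return values
```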
The above description of an example embodiment concerns teaching a user to correct the phonetic pronunciation of his speech. However, good English also requires that the emphasis and duration of the sub-units of a phrase be correct. In another embodiment, the speech recognizer analyzes the relative duration of different parts of an utterance. For example, the durations of the phones in the user's utterance may be compared with those of the corpus of good speakers, as in the sketch below.
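The following is a hedged sketch of such a relative-duration comparison. Working in fractions of the total utterance length removes differences in overall speaking rate; the tolerance value and function name are assumptions.

```python
# Hypothetical duration comparison: flag sub-units whose relative duration
# differs from the reference by more than a tolerance.
def duration_feedback(user_durations_ms, reference_durations_ms, tolerance=0.5):
    total_user = sum(user_durations_ms)
    total_ref = sum(reference_durations_ms)
    flags = []
    for u, r in zip(user_durations_ms, reference_durations_ms):
        user_frac = u / total_user    # fraction of the user's utterance
        ref_frac = r / total_ref      # fraction of the reference utterance
        flags.append(abs(user_frac - ref_frac) > tolerance * ref_frac)
    return flags
```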
In yet another embodiment, the prosody of the user's phrase can be compared to the average of that for the corpus of good speakers. Prosody may consist of emphasis, which is the amplitude of the speech at any point in the phrase, and the pitch frequency as a function of time. The amplitude of the speech is given by the energy values described above for each block of the utterance.
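One simple way such a prosody comparison might be computed is sketched below. The fixed-length sampling of the pitch and emphasis curves and the equal weighting of the two terms are assumptions.

```python
# Hypothetical prosody comparison: average absolute difference between the
# user's pitch and energy curves and the corpus-average curves.
def prosody_score(user_pitch, ref_pitch, user_energy, ref_energy,
                  pitch_weight=1.0, energy_weight=1.0):
    n = min(len(user_pitch), len(ref_pitch), len(user_energy), len(ref_energy))
    pitch_err = sum(abs(user_pitch[i] - ref_pitch[i]) for i in range(n)) / n
    energy_err = sum(abs(user_energy[i] - ref_energy[i]) for i in range(n)) / n
    return pitch_weight * pitch_err + energy_weight * energy_err
```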
Likewise, in another embodiment, the placement of the lips and tongue, and their variations during the playback of the phrase, can be displayed so the user can see how to form the vocal cavity for any portion of the phrase that is mispronounced. For example, the placement of the lips and tongue that form the vocal cavity may be displayed in synchronization with the playback of the user's spoken utterance or the synthesized reference utterance.
Because the cost of on-board memory in a small hand-held device limits the number of phrases that can be stored in the device at any one time, a hand-held unit may include a memory (e.g., a programmable memory) such that many different vocabularies containing different reference utterances for training can be downloaded from an external source sequentially to produce a large amount of training material. Each such vocabulary might contain a few dozen phrases covering special topics such as business, sports, games, slang, etc.
The above description illustrates various embodiments of the present invention along with examples of how aspects of the present invention may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present invention as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims. The terms and expressions that have been employed here are used to describe the various embodiments and examples. These terms and expressions are not to be construed as excluding equivalents of the features shown and described, or portions thereof, it being recognized that various modifications are possible within the scope of the appended claims.