Reading software focuses on increasing a user's reading skills and uses voice recognition to determine whether the user correctly reads a passage. Reading software that assesses audio or speech input from a user often relies on an underlying voice model used by a speech recognition program. The voice model provides a standard representation of sounds (e.g., phonemes) to which the speech recognition program compares user input to determine whether the input is correct.
In one aspect, a method for generating a custom voice model includes receiving audio input from a user and comparing the received audio input to an expected input. The expected input is determined based on an initial or default voice model. The method also includes determining a number of words read incorrectly in a sentence or portion of the passage and adding the sentence audio data to a set of data for producing the custom voice model if the number of words read incorrectly is less than a threshold value.
Embodiments can include one or more of the following.
The method can include determining a number of words read incorrectly based on a subset of words from the passage. The method can include signaling the user to re-read a sentence if the number of words read incorrectly is greater than the threshold value.
The method can include playing a recorded reading of the sentence or of a portion of the sentence, and indicating that the user should repeat what they hear. The audio can be used to signal the user to re-read the sentence if the number of words read incorrectly is greater than the threshold value. The method can include receiving input from the user related to the re-read sentence and determining a number of words read incorrectly in the re-read sentence. The method can also include proceeding to the next sentence without adding the re-read sentence audio data to the set of data for producing the custom voice model, if the number of words read incorrectly is greater than the threshold value. The method can include determining the number of sentences that have been included in the set of data for producing the custom voice model, and aborting the generation of a custom voice model if the number of sentences is less than a threshold. The method can include playing a recorded reading of a sentence based upon user request, either before the user starts reading the passage or after the user has requested to pause the reading of the passage and associated audio collection.
In another aspect, a method for generating a custom voice model includes using an existing model to determine if a received audio input matches an expected input. A new model is estimated based on the received audio input. The expected input is represented by a sequence of phones, e.g., single speech sounds considered as physical events without reference to their place in the structure of a language. Each phone is modeled by a sequence of Hidden Markov Model (HMM) states whose output distributions are represented by a weighted mixture of Gaussian or Normal distributions. Each of the Gaussian distributions is parameterized by a mean vector and a covariance matrix. The method includes aligning the received audio against the expected sequence of HMM states and using this alignment to re-estimate the observed HMM output distribution parameters. For example, the Normal distribution arithmetic means and covariances can be re-estimated to produce the custom voice model.
The method can include storing the custom Gaussian voice model. Receiving audio input can include receiving less than about 100 words of audio input, or less than the amount of audio input associated with one page of text. Both the variance and the arithmetic mean can be adjusted. Analyzing phonemes to adjust the mean and/or variance can include calculating a new variance and arithmetic mean based on the received audio. Alternatively, analyzing can include calculating a new variance and arithmetic mean based on the received audio and merging or combining the calculated variance and arithmetic mean with the original variance and arithmetic mean for the Gaussian to determine the custom voice model.
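For illustration, the following Python sketch evaluates one HMM state's output distribution as a weighted mixture of Gaussians, each parameterized by a mean vector and variance vector. This is a minimal sketch, not the patented implementation: the function names are hypothetical and diagonal covariances are assumed.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Diagonal-covariance multivariate normal density N(x; mean, var)."""
    diff = x - mean
    return np.exp(-0.5 * np.sum(diff * diff / var)) / np.sqrt(
        np.prod(2.0 * np.pi * var))

def state_output_prob(x, weights, means, variances):
    """Output probability of one HMM state: a weighted mixture of M
    Gaussians, each with its own mean vector and (diagonal) covariance."""
    return float(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))

# Example: a 2-Gaussian mixture over 3-dimensional feature vectors.
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))
p = state_output_prob(np.array([0.5, 0.2, 0.1]), weights, means, variances)
```

Re-estimating the means and variances of these mixtures against a user's aligned audio is what produces the custom voice model described above.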
In another aspect, a device is configured to receive audio input from a user and compare the received audio input to an expected input that is determined based on an initial or default voice model. The device is further configured to determine a number of words read incorrectly in a sentence and add the sentence audio data to the set of data for producing a custom voice model if the number of words read incorrectly is less than a threshold value.
Embodiments can include one or more of the following.
The device can be configured to determine a number of words read incorrectly based on a subset of less than all of the words from the passage. The device can be configured to signal the user to re-read a sentence if the number of words read incorrectly is greater than the threshold value. The device can be configured to receive input from the user related to the re-read sentence and determine a number of words read incorrectly in the re-read sentence. The device can be further configured to proceed to the next sentence without adding the re-read sentence audio data to the set of data for producing the custom voice model, if the number of words read incorrectly is greater than the threshold value.
In another aspect, a device is configured to determine if received audio input matches an expected input. The expected input is represented by a sequence of phones, with each phone represented by an HMM whose output distributions consist of a weighted mixture of Gaussian or Normal distributions.
Each of the Gaussian functions has a weight factor, an arithmetic mean, and a variance. The device is configured to decompose the received audio input into phonemes and analyze the phonemes to adjust at least one of the variance and the arithmetic mean without adjusting the weight factor for at least one of the Gaussian distributions to produce the custom voice model.
Embodiments can include one or more of the following.
The device can be configured to adjust both the variance and the arithmetic mean. The device can be configured to calculate a new variance and arithmetic mean based on the received audio. The device can be configured to calculate a new variance and arithmetic mean based on the received audio and average the calculated variance and arithmetic mean with the original variance and arithmetic mean for the Gaussian to determine a custom voice model.
In another aspect, a computer program product is tangibly embodied in an information carrier, for executing instructions on a processor. The computer program product can be operable to cause a computer to receive audio input from a user and compare the received audio input to an expected input. The expected input can be determined based on an initial or default voice model. In addition, the computer program product can include instructions to determine a number of words read incorrectly in a sentence of the passage and add the sentence audio data to the set of data for producing a custom voice model if the number of words read incorrectly is less than a threshold value.
Embodiments can include one or more of the following.
The computer program product can include instructions to determine a number of words read incorrectly based on a subset of less than all of the words from the passage. The computer program product can include instructions to signal the user to re-read a sentence if the number of words read incorrectly is greater than the threshold value. The computer program product can include instructions to receive input from the user related to the re-read sentence, determine a number of words read incorrectly in the re-read sentence, and proceed to the next sentence without adding the re-read sentence audio data to the set of data for producing the custom voice model, if the number of words read incorrectly is greater than the threshold value.
In another aspect, a computer program product is tangibly embodied in an information carrier, for executing instructions on a processor. The computer program product can be operable to cause a machine to determine if a received audio input matches an expected input. The expected input can be represented by a set of phonemes with at least some of the phonemes represented by a plurality of Gaussian functions. Each of the Gaussian functions has a weight factor, an arithmetic mean, and a variance. The computer program product can include instructions to decompose the received audio input into phonemes and analyze the phonemes to adjust at least one of the variance and the arithmetic mean without adjusting the weight factor for at least one of the Gaussian functions for a particular phoneme to produce the custom Gaussian voice model. In addition, the computer program product can include instructions to store the custom Gaussian voice model.
Embodiments can include one or more of the following.
The computer program product can include instructions to adjust both the variance and the arithmetic mean. The computer program product can include instructions to calculate a new variance and arithmetic mean based on the received audio.
Referring to the figure, the software includes an operating system 30, which can be any operating system; speech recognition software 32, which can be any engine, such as the Sphinx II open source recognition engine, that provides sufficient access to recognizer functionality and uses a semi-continuous acoustic model; and tutoring software 34, which is discussed below. The reading tutor software 34 is useful in developing reading fluency. The software also includes a set of acoustic models 52 used by the speech recognition engine and the tutor software 34 in assessing fluency. The acoustic models 52 can include standard acoustic models and custom acoustic models or voice profiles. The custom acoustic models are acoustic models adapted to the speech of a particular user. A user would interact with the computer system principally through mouse 29a and microphone/headset 29b.
Referring now to the figure, the server computer 44 would include, among other things, a file 46 stored, e.g., on storage device 47, which holds aggregated data generated by the computer systems 10 through use by students executing software 34. The files 46 can include text-based results from execution of the tutoring software 34, as described below. Also residing on the storage device 47 can be individual speech files resulting from execution of the tutor software 34 on the systems 10. In other embodiments, the speech files, being rather large in size, would reside on the individual systems 10. Thus, in a classroom setting an instructor can access the text-based files over the server via system 45 and can individually visit a student system 10 to play back audio from the speech files if necessary. Alternatively, in some embodiments the speech files can be selectively downloaded to the server 44.
Like many advanced skills, reading depends on a collection of underlying skills and capabilities. The tutoring software 34 fits into development of reading skills based on existence of interdependent areas such as physical capabilities, sensory processing capabilities, and language and reading skills. In order for a person to learn to read written text, the eyes need to focus properly and the brain needs to properly process resulting visual information. The person develops an understanding of language, usually through hearing language, which requires that the ear mechanics work properly and the brain processes auditory information properly. Speaking also contributes strongly to development of language skills, but speech requires its own mechanical and mental processing capabilities. Before learning to read, a person should have the basic language skills typically acquired during normal development and should learn basic phoneme awareness, the alphabet, and basic phonics. In a typical classroom setting, a person should have the physical and emotional capability to sit still and “tune out” distractions and focus on a task at hand. With all of these skills and capabilities in place, a person can begin to learn to read fluently, with comprehension, and to develop a broad vocabulary.
The tutor software 34 described below is particularly useful once a user has developed proper body mechanics and sensory processing, and has acquired basic language, alphabet, and phonics skills. The tutor software 34 can improve reading comprehension, which depends heavily on reading fluency. The tutor software 34 can develop fluency by supporting frequent and repeated oral reading. The reading tutor software 34 provides this frequent and repeated supported oral reading, using speech recognition technology to listen to the student read and to provide help when the student struggles. In addition, the reading tutor software 34 can assist in vocabulary development. The software 34 can be used with users of all ages, and especially with children in early through advanced stages of reading development.
Vocabulary, fluency, and comprehension all interact as a person learns. The more a person reads, the more fluent the person becomes, and the more vocabulary the person learns. As a person becomes more fluent and develops a broader vocabulary, the person reads more easily.
Referring to the figure, the acoustic model 52 represents the sounds of speech (e.g., phonemes). Due to differences in speech for different groups of people and for individual users, the speech recognition engine 32 includes multiple acoustic models 52, such as an adult male acoustic model 54, an adult female acoustic model 56, a child acoustic model 58, and a custom acoustic model 60. In addition, other acoustic models (not shown) can be included.
The pronunciation dictionary 70 is based on words 68 and phonetic representations 72. The words 68 come from the story texts or passages, and the phonetic representations 72 are generated based on human speech input or knowledge of how the words are pronounced. In addition, the pronunciations or phonetic representations of words can be obtained from existing databases of words and their associated pronunciations. Both the pronunciation dictionary 70 and the language model 64 are associated with the story texts to be recognized. For the pronunciation dictionary 70, the words are taken independently from the story text. In contrast, the language model 64 is based on sequences of words from the story text or passage. The recognizer uses the language model 64 and the dictionary 70 to constrain the recognition search and to determine what is considered from the acoustic model when processing the audio input from the user 50. In general, the speech recognition process 32 uses the acoustic model 52, the language model 64, and the pronunciation dictionary 70 to generate the speech recognition result 66.
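As a rough sketch of how these passage-specific resources might be assembled: the lexicon entries and function names below are hypothetical, and a real language model would be smoothed and probability-weighted rather than raw bigram counts.

```python
from collections import defaultdict

# Hypothetical pronunciation lookup standing in for a pronunciation
# database; entries map a word to its phonetic representation.
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def build_passage_models(passage):
    """Build the two resources described above: a pronunciation
    dictionary (words taken independently from the passage) and a
    simple bigram language model (word sequences from the passage)."""
    words = passage.lower().split()
    dictionary = {w: LEXICON[w] for w in set(words) if w in LEXICON}
    bigram_counts = defaultdict(int)
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[(w1, w2)] += 1
    return dictionary, bigram_counts

dictionary, bigrams = build_passage_models("the cat sat")
```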
Referring to the figure, a setup screen 80 allows a user to select a representative voice profile 84, 86, or 88. In addition to selecting a representative voice profile 84, 86, or 88, a user can generate a custom acoustic model 60 by selecting one of the options in the custom voice profile section 90 of setup screen 80. To begin set-up of a custom voice model, the user selects button 92 to train the custom model. If a custom model has previously been generated for the user, the user selects button 94 to add additional training to the previously generated custom model. In addition, the user can delete a previously generated custom model by selecting button 96.
Referring to the figures, exemplary training passages (e.g., passages 112 and 114) are presented to the user to read during collection of audio for the custom voice model.
In order to obtain accurate acoustic representations of the phonemes in a particular user's speech, accurate or semi-accurate speech input is needed from the user. When the speech acoustic model is adapted for use with reading tutor software, the speech input received to modify the voice model can be limited. For example, the passages (e.g., passages 112 and 114) presented to the user to read may be short (e.g., less than about 100-150 words), although a passage could exceed 150 words and still be relatively short in length (e.g., about 1-2 pages), and other lengths are possible. In addition, the passages presented to the user may be at or below the user's current reading level. By selecting passages at or below a user's reading level, it is more probable that the received pronunciations of the words are accurate. In addition to selecting a passage based on the skill level of the user and the length of the passage, the text of the passage can be selected based on the phonemes included in the text. For example, a text with multiple occurrences of different phonemes allows the voice model for those phonemes to be adjusted with increased accuracy.
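A minimal sketch of the phoneme-coverage criterion described above, assuming a simple occurrence-count threshold; the threshold value and the scoring rule are illustrative assumptions, not values from the text.

```python
from collections import Counter

def phoneme_coverage(passage_phones, min_occurrences=3):
    """Count how many distinct phonemes occur at least
    `min_occurrences` times in a candidate passage."""
    counts = Counter(passage_phones)
    return sum(1 for n in counts.values() if n >= min_occurrences)

def pick_passage(candidates):
    """Among candidate passages (each a list of phonemes), prefer the
    one with the broadest repeated phoneme coverage."""
    return max(candidates, key=phoneme_coverage)
```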
Due to the length of the passages presented to the user, a limited amount of data is available to the speech recognition engine for adjusting the voice model. Based on the limited data, the speech recognition engine uses a statistical method (e.g., a Bayesian method) to adjust the underlying arithmetic means and variances of the Gaussians in the voice model (as described below). In addition, since the data is limited, in some embodiments the speech recognition engine merges or averages a model calculated from the user input with the original voice model to produce the custom voice model. Thus, the custom model may be a variation of a previously stored model rather than being based solely on the received audio.
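The merge step might look like the following sketch, assuming a simple weighted blend; the blending weight `alpha` is an assumption, and the MAP combination actually detailed later in the text uses posterior-weighted interpolation instead.

```python
import numpy as np

def merge_with_original(orig_mean, orig_var, est_mean, est_var, alpha=0.5):
    """Blend per-Gaussian statistics estimated from the user's
    (limited) audio with the original model's statistics."""
    mean = (1.0 - alpha) * np.asarray(orig_mean) + alpha * np.asarray(est_mean)
    var = (1.0 - alpha) * np.asarray(orig_var) + alpha * np.asarray(est_var)
    return mean, var
```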
In order to generate an accurate acoustic model for a system used by children, for example, accurate or robust data collection should be maintained. However, the adaptation algorithm requires that children (or other users) who may struggle with a passage read it so that the system can compute new models, which increases the difficulty of collecting accurate or robust data. The adaptation system in the tutor software and voice recognition system therefore handles reading errors in a manner that allows the voice recognition system to collect a reasonable amount of audio data without frustrating the child (or other user).
The system allows up to, e.g., two reader errors in content words per 150 words. As described above, the user or child reads a short (e.g., less than about 150 words) passage with a reading level below the user's current reading level. In some embodiments, the system also allows common words in the passage to be misspoken or omitted. If errors are detected in the child's reading, the system allows the child to read the sentence again. If the user still appears to have difficulty, then the sentence is read to the user and the user is given a further chance to read the passage back to the system.
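A sketch of this tolerance rule, assuming a small hand-picked common-word list and a linear scaling of the two-errors-per-150-words allowance; both the word list and the scaling are assumptions.

```python
# Assumed list of "common" words whose misreadings are not counted.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

def reading_acceptable(misread_words, passage_length,
                       max_errors=2, per_words=150):
    """Apply the tolerance described above: up to two content-word
    errors per 150 words, with common-word errors ignored."""
    content_errors = [w for w in misread_words
                      if w.lower() not in COMMON_WORDS]
    allowed = max_errors * passage_length / per_words
    return len(content_errors) <= allowed
```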
Referring to the figure, a voice profile training process 120 receives audio input 126 as the user reads each sentence of a passage and determines the number of incorrect pronunciations in the sentence, which is compared to a threshold.
If the number of errors is determined 132 to be greater than the threshold, the input for that reading of the sentence is not used to adapt the voice model and the user is prompted 134 to re-read the sentence. Process 120 determines 138 a number of incorrect pronunciations for the re-read sentence and determines 142 if the number of incorrect pronunciations in the re-read sentence is greater than the threshold. If the number of errors is determined 142 to be greater than the threshold, the sentence is skipped 143 and the process proceeds to the following sentence without adding the sentence data to the custom voice profile training data set. Process 120 subsequently determines 137 if the number of sentences that have been skipped is greater than a threshold. If the total number of sentences skipped is greater than the threshold, the speech model will not be adjusted and process 120 aborts 139 the custom voice profile training.
If the number of pronunciation errors for a sentence is determined 132 to be less than the threshold, the sentence data is added 133 to the custom voice profile training data set.
After sentence data has been added 133 to the voice profile training data set, or after a sentence has been skipped and the total number of skipped sentences has been determined 137 to be less than the threshold, the voice profile training process determines 135 whether there is a next sentence in the passage. If there is a next sentence, process 120 returns to receiving audio input 126 for the subsequent sentence. If the user has reached the end of the passage (e.g., there is not a next sentence), process 120 determines 136, based on the voice profile training data set, a set of arithmetic means and standard deviations for the Gaussian models used to represent the phonemes detected in the received utterances. Process 120 calculates 140 a set of arithmetic means and standard deviations for the custom model based on the original or previously stored arithmetic means and deviations and the arithmetic means and deviations determined from the user input. Process 120 adjusts 144 the original model based on the calculated set of arithmetic means and standard deviations for the custom model and stores the adjusted model as the custom speech model for the user.
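Putting the per-sentence logic together, a hedged sketch of the collection loop; the callback names and threshold values are illustrative assumptions, not taken from the text.

```python
def collect_voice_profile_data(sentences, recognize, replay_sentence,
                               error_threshold=1, max_skipped=2):
    """Per-sentence collection loop sketched from the process above.
    `recognize` returns (audio, error_count) for one reading of a
    sentence; `replay_sentence` plays a recorded reading back to the
    user as the re-read prompt."""
    training_set = []
    skipped = 0
    for sentence in sentences:
        audio, errors = recognize(sentence)
        if errors <= error_threshold:
            training_set.append(audio)       # first reading accepted
            continue
        replay_sentence(sentence)            # prompt the user to re-read
        audio, errors = recognize(sentence)
        if errors <= error_threshold:
            training_set.append(audio)       # re-read accepted
        else:
            skipped += 1                     # skip this sentence
            if skipped > max_skipped:
                return None                  # abort custom training
    return training_set                      # used to re-estimate the model
```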
Two examples of speech models that can be used by a speech recognition program are semi-continuous acoustic models and continuous acoustic models. Adaptation of a semi-continuous hidden Markov model (HMM) acoustic model for a speech recognition program differs from adaptation of a fully continuous model. For example, adaptation algorithms derived for a fully continuous recognizer may use techniques such as maximum-likelihood linear regression (MLLR). Because of the small amounts of data used to adjust the voice model, the fully continuous model space can be partitioned into clusters of arithmetic mean vectors whose affine transforms are calculated individually. In a semi-continuous model, such partitions may not readily exist. In a continuous model, the output density comprises a weighted sum of multidimensional Gaussian density functions, as shown in equation (1):

b_s(x_t) = Σ_{k=1}^{M} w_{s,k} N(x_t; μ_{s,k}, Σ_{s,k})    (1)
For each state 's' in the hidden Markov model (HMM) there exists an associated set of M weights and Gaussian density functions that define an output probability. For a semi-continuous model, the Gaussian density functions are shared across multiple states, effectively generating a pool of shared Gaussian distributions. Thus, a state is represented by the weights given to the shared distributions, as shown in equation (2):

b_s(x_t) = Σ_{k=1}^{K} w_{s,k} N(x_t; μ_k, Σ_k)    (2)

where the K pooled Gaussians (μ_k, Σ_k) are shared by all states and only the weights w_{s,k} vary from state to state.
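In code, the semi-continuous evaluation of equation (2) might look like this sketch; diagonal covariances and all names are assumptions.

```python
import numpy as np

def log_densities(x, pool_means, pool_vars):
    """Log density of frame x under every Gaussian in the shared pool
    (diagonal covariances assumed)."""
    diff = pool_means - x
    return -0.5 * (np.sum(diff * diff / pool_vars, axis=1)
                   + np.sum(np.log(2.0 * np.pi * pool_vars), axis=1))

def semicontinuous_output_prob(x, state_weights, pool_means, pool_vars):
    """Equation (2) in code: states share one Gaussian pool and differ
    only in their weight vectors."""
    dens = np.exp(log_densities(x, pool_means, pool_vars))
    return float(state_weights @ dens)
```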
For example, in one embodiment, 256 mean vectors and diagonal covariance matrices are used in a process to partition the feature space, i.e., the space spanned by the feature vectors, into distinct regions. The feature space can be a 55-dimensional space including the real spectrum coefficients, the deltas and double deltas of these coefficients, and three power features. In the case of a fully continuous HMM, it is therefore possible to apply a number of differing affine transforms to subsets of the density mean/covariance vectors, effectively altering the partitioning of the feature space. States are usually divided into subsets using phonologically based rules or by applying clustering techniques to the central moments of the respective models.
As described above, because the semi-continuous model shares a limited number of Gaussian densities between multiple states, clustering the underlying Gaussian distributions is not easily accomplished. In addition, a useful partitioning can be unlikely due to the small number of distributions (e.g., 256 distributions).
It can be advantageous to re-estimate the free parameters of the semi-continuous HMM from a small amount of data. The free parameters are the codebook arithmetic means and variances estimated to arrive at the acoustic model. In some embodiments, the codebook is limited to 256 entries; thus, only 256×55 mean elements, and a like number of variances, have to be estimated. In a fully continuous model, the number of free parameters is much higher because there is no codebook: each state has its own set of mean and variance vectors, so given 5000-6000 states, each with perhaps 50 mean and covariance vectors, the number of parameters to be estimated is much higher. Due to the high number of parameters, it may not be desirable to use an algorithm such as MLLR. As described above, for the semi-continuous model the number of free parameters is much smaller than for the fully continuous model. Thus, it is possible to apply a maximum a posteriori (MAP) model estimation criterion directly, even with relatively small amounts of adaptation data, as discussed below.
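A short worked comparison of the free-parameter counts, using the figures quoted above; the fully continuous numbers take the text's rough mid-range values as assumptions.

```python
# Semi-continuous codebook described above:
codebook_size, dims = 256, 55
semi_continuous = codebook_size * dims * 2        # means + variances
# semi_continuous == 28,160 parameters

# Fully continuous comparison (assumed mid-range: 5500 states,
# ~50 Gaussians each):
states, gaussians_per_state = 5500, 50
fully_continuous = states * gaussians_per_state * dims * 2
# fully_continuous == 30,250,000 parameters
```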
If large quantities of audio adaptation data are available (e.g., 10-30 minutes of audio input), the voice model can be adapted by modifying the mixture weights w_{s,k}. However, in a 100-300 word story, adjusting the weights generally does not provide adequate state coverage. For example, a state's weights might be adapted based on only one or two samples, reducing the reliability of the estimate.
The output of data collection for rapid adaptation is a collection of audio data and the recognizer's state- and word-level transcription for each sentence recorded. The adaptation algorithm takes this data and uses a Forward-Backward algorithm to align the audio against the expected sequence of HMM states and to accumulate the statistics used to re-estimate the model parameters.
The adaptation process follows the schematic shown in the figure.
In order to reduce the computational intensity of generating the custom voice model, the speech recognition software may consider only a set of the most probable Gaussians (e.g., the top four most probable Gaussians) when evaluating output probabilities for each state. If four Gaussians are used, the output probability is given by equation (3) as:

b_s(x_t) ≈ Σ_{k∈η(x_t)} w_{s,k} N(x_t; μ_k, Σ_k)    (3)

where η(x_t) is the set of indices of the four most probable Gaussians for the frame x_t.
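A sketch of the top-four shortlist of equation (3); diagonal covariances are assumed and the names are illustrative.

```python
import numpy as np

def top_n_output_prob(x, state_weights, pool_means, pool_vars, n=4):
    """Equation (3) in code: evaluate the mixture over only the n most
    probable pool Gaussians for this frame (the set eta(x_t))."""
    diff = pool_means - x
    log_dens = -0.5 * (np.sum(diff * diff / pool_vars, axis=1)
                       + np.sum(np.log(2.0 * np.pi * pool_vars), axis=1))
    eta = np.argsort(log_dens)[-n:]          # indices of the top n
    return float(state_weights[eta] @ np.exp(log_dens[eta]))
```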
Equation (9) re-estimates the kth component of the ith state's Gaussian mixture model mean vector and covariance matrix:

μ̂_{i,k} = Σ_t ζ_t(i,k) x_t / Σ_t ζ_t(i,k),    Σ̂_{i,k} = Σ_t ζ_t(i,k) (x_t − μ̂_{i,k})(x_t − μ̂_{i,k})′ / Σ_t ζ_t(i,k)    (9)

where ζ_t(i,k) is the posterior probability of occupying state i with mixture component k at time t. For a semi-continuous HMM, these components are shared over multiple states. Integrating out the state dependency from ζ_t(i,k) gives the semi-continuous model estimates shown in equations (10) to (12) below:

ζ_t(k) = Σ_i ζ_t(i,k)    (10)

μ̂_k = Σ_t ζ_t(k) x_t / Σ_t ζ_t(k)    (11)

Σ̂_k = Σ_t ζ_t(k) (x_t − μ̂_k)(x_t − μ̂_k)′ / Σ_t ζ_t(k)    (12)
The above estimates for the general model parameters can be applied to each of the feature vectors (e.g., four feature vectors). The set η(x_t) can be different for each of the feature vectors; hence, the output probability can be calculated according to equation (13) below:

b_s(x_t) = Π_{f=1}^{4} Σ_{k∈η_f(x_{t,f})} w_{s,k,f} N(x_{t,f}; μ_{k,f}, Σ_{k,f})    (13)
To determine a new estimate for û_{f_k}, the posterior probabilities ζ_t(f_k) are calculated for the Gaussians in each set η_f(x_{t,f}).
These probabilities can be substituted into the equations for ML mean vector and covariance matrix estimation given above.
The rapid adaptation algorithm also includes a MAP model estimation. During the model estimation, the arithmetic mean and covariance vectors of the ML and speaker-independent (SI) models are combined according to the posterior probabilities ζ_t(f_k) and a hyper-parameter λ, as shown in equations (16) and (17) below:

μ̂_k = (μ_k^{SI} + λ_μ Σ_t ζ_t(f_k) μ_k^{ML}) / (1 + λ_μ Σ_t ζ_t(f_k))    (16)

Σ̂_k = (Σ_k^{SI} + λ_Σ Σ_t ζ_t(f_k) Σ_k^{ML}) / (1 + λ_Σ Σ_t ζ_t(f_k))    (17)
The hyper-parameter values for passages of around 200 words of training data can be approximately set such that λ_μ is in the range of 1.0e-4 to 5.0e-2 (e.g., λ_μ = 2.0e-4) and λ_Σ is in the range of 1.0e-4 to 5.0e-3 (e.g., λ_Σ = 3.0e-4).
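A sketch of the MAP combination with the example hyper-parameter values, following the interpolation form given in equations (16)-(17) above; since that form is itself a reconstruction, the exact published computation may differ.

```python
import numpy as np

def map_combine(si_mean, si_var, ml_mean, ml_var, occupancy,
                lam_mu=2.0e-4, lam_sigma=3.0e-4):
    """Combine speaker-independent (SI) and maximum-likelihood (ML)
    statistics for one Gaussian. `occupancy` is the summed posterior
    mass sum_t zeta_t(f_k) for that Gaussian."""
    a_mu = lam_mu * occupancy / (1.0 + lam_mu * occupancy)
    a_sg = lam_sigma * occupancy / (1.0 + lam_sigma * occupancy)
    mean = (1.0 - a_mu) * np.asarray(si_mean) + a_mu * np.asarray(ml_mean)
    var = (1.0 - a_sg) * np.asarray(si_var) + a_sg * np.asarray(ml_var)
    return mean, var
```

With λ_μ = 2.0e-4, a Gaussian needs on the order of thousands of frames of posterior mass before the ML estimate dominates, which matches the intent of merging limited user data with the original model.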
Thus, based on a limited amount of user voice data, the custom voice model is adapted for the user by adjusting the arithmetic means and variances of the underlying Gaussian functions in the voice model.
The use of voice models adapted to the user's speech can reduce false negative interventions and increase the number of errors caught by the application. For speakers who match the acoustic model well, this can be observed as a reduction in the Gaussian variances across the model.
The process for collecting the voice data used to generate a custom voice model is designed specifically for children, hence the instructive user interface. Because children may mis-speak during the collection phase, the output of the Forward-Backward algorithm can be analyzed to ensure that the observed word sequence approximately or closely matches the recorded data. This is accomplished by checking that the best terminating state matched against the audio data falls within the last five states of the HMM of the last word in the recognized sequence. If it does not, the utterance is discarded.
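The terminating-state check reduces to a one-line test; in sketch form, with illustrative names:

```python
def utterance_usable(final_state_index, last_word_num_states, window=5):
    """Accept an utterance only if the best alignment terminated within
    the last `window` states of the final word's HMM, per the check
    described above."""
    return final_state_index >= last_word_num_states - window
```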
A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.