When only a small number of manual accent labels is available, taking the best advantage of them is important in training high performance classifiers. Disclosed embodiments utilize unlabeled data (i.e., data without accent labels), which is more abundant than its labeled counterpart, to improve labeling performance. Improving labeling performance without manually labeling a large corpus potentially saves time and cost, while still providing the training data required to train high performance classifiers.
Referring now to FIG. 1, a system 100 for automatically labeling accent of words in a speech corpus 105 is illustrated.
First classifier 110 is configured to analyze words in the speech corpus 105 and to automatically label accent of the analyzed words based on first criteria. For example, when first classifier 110 is a linguistic classifier as shown in FIG. 1, the first criteria can be linguistic criteria such as part-of-speech (POS) information for the analyzed words.
Second classifier 120 is also configured to analyze words in the speech corpus database 105 in order to automatically label accent of the analyzed words based on second criteria. For example, when the second classifier 120 is a hidden Markov model (HMM) based acoustic classifier as illustrated in FIG. 1, the second criteria can be acoustic criteria derived from the speech signal.
System 100 also includes a comparison engine or component 130 which is configured to compare the first accent labels 112 provided by the first classifier and the second accent labels 122 provided by the second classifier to determine if there is agreement between the first classifier 110 and the second classifier 120 on accent labels for particular words. For any words having first and second accent labels 112, 122 which indicate agreement by the first and second classifiers, the comparison engine 130 provides the agreed upon accent labels 112, 122 as final accent labels 132 for those words. For any words having first and second accent labels 112, 122 which are not in agreement, a third classifier 140 is included to analyze those words.
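The comparison logic can be stated compactly. The following is a minimal, illustrative sketch (the function and classifier interfaces are assumptions; the disclosure does not prescribe a particular implementation):

```python
def combine_labels(first_labels, second_labels, third_classifier, features):
    """Resolve final per-word accent labels from two automatic classifiers.

    first_labels / second_labels: dicts mapping word index to
    'accented' or 'unaccented' (labels 112 and 122, respectively).
    third_classifier: fallback used only where the two disagree.
    """
    final = {}
    for idx, label_1 in first_labels.items():
        label_2 = second_labels[idx]
        if label_1 == label_2:
            # Agreement: adopt the shared label directly (final labels 132).
            final[idx] = label_1
        else:
            # Disagreement: defer to the combined classifier (final labels 142),
            # which also considers additional acoustic/linguistic features 144.
            final[idx] = third_classifier.predict(label_1, label_2, features[idx])
    return final
```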
Third classifier 140 is, in some embodiments, a combined classifier which includes both linguistic and acoustic classifier aspects or functionality. For words in the speech corpus where the comparison engine 130 determines that there is not agreement between the first and second classifiers, third classifier 140 is configured to provide the final accent labels 142 for those words. Final accent labels 142 are provided, in some embodiments, as a function of the first accent labels 112 provided by the first classifier and the second accent labels 122 provided by the second classifier for those words. Final accent labels 142 can also be provided based on other features 144 from speech corpus database 105. In some embodiments, additional features 144 include other acoustic features 146 and/or other linguistic features 148. In some embodiments, combined classifier 140 is trained using only the limited amount of manually labeled accent data, but this need not be the case in all embodiments. Further discussion of these aspects is provided below.
In some embodiments, system 100 includes an output component or module 150 which provides as an output the final accent labels 132 from comparison engine 130 for words in which there was accent label agreement, and final accent labels 142 from third classifier 140 for the remaining words. As illustrated in FIG. 1, these final accent labels can then serve as training data for training high performance classifiers.
Referring specifically to the embodiment illustrated in FIG. 1, first classifier 110 is a linguistic classifier and second classifier 120 is an HMM based acoustic classifier.
Referring to linguistic classifier 110, usually content words, which carry more semantic weight in a sentence, are accented, while function words are unaccented. Classifier 110 is configured, in exemplary embodiments, to follow this rule: according to their POS tags, content words are deemed accented while non-content or function words are deemed unaccented.
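This rule reduces to a simple lookup. A minimal sketch, assuming a hypothetical content-word tag set (any POS tagger distinguishing content from function words would serve):

```python
# Hypothetical content-word POS tags; the disclosure requires only a
# content/function distinction, not this particular tag set.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def linguistic_label(pos_tag):
    """Linguistic classifier 110: content words are deemed accented."""
    return "accented" if pos_tag in CONTENT_POS else "unaccented"
```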
Referring next to HMM based acoustic classifier 120, in exemplary embodiments this classifier uses segmental information that can distinguish accented vowels from unaccented ones. To this end, a set of segmental units to be modeled is chosen. A first set of segmental units includes accent and position dependent phone sets.
In a conventional speech recognizer, about 40 phones are used in English, and for each vowel a universal HMM is used to model both its accented and unaccented realizations. In disclosed embodiments, the accented and unaccented realizations are instead modeled separately as two different phones. Furthermore, to model the syllable structure, which includes onset, vowel nucleus and coda, with higher precision, consonants at the onset position are treated differently from the same phones at the coda position. This accent and position dependent (APD) phone set increases the number of labels from 40 to 78, while the corresponding HMMs can be trained similarly.
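The enlargement of the phone set can be sketched as follows (the phone naming scheme is an illustrative assumption; splitting roughly 40 base phones into accented/unaccented vowels and onset/coda consonants yields the enlarged label set, 78 in the disclosed embodiment):

```python
def build_apd_phone_set(base_vowels, base_consonants):
    """Build an accent and position dependent (APD) phone set.

    Each vowel is split into accented/unaccented variants and each
    consonant into onset/coda variants, roughly doubling the label count.
    """
    apd = []
    for v in base_vowels:
        apd += [v + "_acc", v + "_unacc"]    # accented vs. unaccented vowel
    for c in base_consonants:
        apd += [c + "_onset", c + "_coda"]   # syllable-onset vs. coda consonant
    return apd
```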
Before training the new HMMs, the pronunciation lexicon is adjusted in terms of the APD phone set. Each word pronunciation is encoded into both accented and unaccented versions. In the accented version, the vowel in the primary stressed syllable is accented and all the other vowels are unaccented. In the unaccented version, all vowels are unaccented. All consonants at syllable-onset position are replaced with corresponding onset consonant models, and similarly for consonants at coda position.
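A sketch of this lexicon adjustment, assuming syllabified pronunciations are available as (onset, nucleus, coda) triples (this representation and the suffix naming are illustrative assumptions):

```python
def encode_pronunciations(syllables, primary_stress_index):
    """Produce accented and unaccented APD pronunciations for one word.

    syllables: list of (onset_consonants, vowel, coda_consonants) triples.
    primary_stress_index: index of the primary stressed syllable.
    """
    def render(accented_syllable):
        phones = []
        for i, (onset, nucleus, coda) in enumerate(syllables):
            phones += [c + "_onset" for c in onset]
            phones.append(nucleus + ("_acc" if i == accented_syllable else "_unacc"))
            phones += [c + "_coda" for c in coda]
        return phones

    accented = render(primary_stress_index)  # stressed vowel accented
    unaccented = render(None)                # all vowels unaccented
    return accented, unaccented
```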
In order to train HMMs for the APD phones, accents in the training data have to be labeled, either manually or automatically. Then, in the training process, the phonetic transcription of the accented version of a word is used if the word is accented; otherwise, the unaccented version is used. Aside from this adjustment, the whole training process can be the same as conventional speech recognition training. APD HMMs can be trained with the standard Baum-Welch algorithm in the HTK software package. The trained acoustic model (classifier 120) is then used to label accents.
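The transcription selection during training is then a simple lookup; a minimal sketch under the lexicon representation assumed above (the Baum-Welch training itself is carried out by HTK, as the disclosure notes):

```python
def training_transcription(word, is_accented, lexicon):
    """Pick the APD transcription used when training the HMMs.

    lexicon maps word -> (accented_pron, unaccented_pron);
    is_accented comes from the manual or automatic accent label.
    """
    accented_pron, unaccented_pron = lexicon[word]
    return accented_pron if is_accented else unaccented_pron
```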
Using the APD HMMs in acoustic classifier 120, accent labeling is actually performed as decoding in a finite state network 300, an example of which is shown in FIG. 3.
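Since the adjusted lexicon gives each word accented and unaccented pronunciation variants, the per-word decoding decision can be sketched as below. This simplified sketch scores each variant independently via a hypothetical forced-alignment scoring function, whereas network 300 decodes over the whole utterance:

```python
def label_by_decoding(words, lexicon, score_fn):
    """Accent labeling as a choice between per-word pronunciation paths.

    score_fn(pronunciation) is assumed to return the acoustic log
    likelihood of the pronunciation under the trained APD HMMs.
    """
    labels = []
    for word in words:
        accented_pron, unaccented_pron = lexicon[word]
        if score_fn(accented_pron) >= score_fn(unaccented_pron):
            labels.append((word, "accented"))
        else:
            labels.append((word, "unaccented"))
    return labels
```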
Referring now back to combined classifier 140 shown in FIG. 1, this classifier combines linguistic and acoustic information to provide final accent labels 142 for words on which the first and second classifiers disagree.
Three accent related feature types are used by combined classifier 140. The first type is the likelihood scores of the accented and unaccented vowel models and their differences. The second type addresses prosodic features that cannot be directly modeled by the HMMs, such as the normalized vowel duration and the fundamental frequency differences between the current and the neighboring vowels. The third type is linguistic features beyond POS, such as the uni-gram, bi-gram and tri-gram scores of a given word, because frequently used words tend to be produced with reduced pronunciations. For each type of feature, an individual classifier is trained first. The somewhat weak results provided by these individual classifiers are then combined by classifier 140 into a stronger one. The combining scheme which classifier 140 implements is, in an exemplary embodiment, the well-known AdaBoost algorithm.
As noted, the AdaBoost algorithm is often used to adjust the decision boundaries of weak classifiers to minimize classification errors, and it has resulted in better performance than any of the individual classifiers alone. The advantage of AdaBoost is that it can combine a sequence of weak classifiers by dynamically adjusting the weight of each classifier according to the errors in the previous learning step. In each boosting step, one additional classifier of a single feature is incorporated.
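A minimal sketch of discrete AdaBoost in this setting, where each weak classifier is built on a single feature type and one is incorporated per boosting round (the interfaces are assumptions; the disclosure names the algorithm but not an implementation):

```python
import math

def adaboost(weak_classifiers, samples, labels, rounds):
    """Combine per-feature weak classifiers into a stronger one.

    weak_classifiers: functions mapping a sample to +1 (accented) or -1.
    labels: ground-truth +1/-1 accent labels for the training samples.
    Returns the combined classifier as a function sample -> +1/-1.
    """
    n = len(samples)
    w = [1.0 / n] * n                    # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        # Incorporate the single-feature classifier with lowest weighted error.
        best, best_err = None, float("inf")
        for clf in weak_classifiers:
            err = sum(wi for wi, x, y in zip(w, samples, labels) if clf(x) != y)
            if err < best_err:
                best, best_err = clf, err
        best_err = max(best_err, 1e-10)  # guard against a perfect classifier
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # Re-weight samples: emphasize those the chosen classifier got wrong.
        w = [wi * math.exp(-alpha * y * best(x))
             for wi, x, y in zip(w, samples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        return 1 if sum(a * clf(x) for a, clf in ensemble) >= 0 else -1
    return predict
```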
Referring now to FIG. 4, manually accent labeled data 415 is illustrated, which can be used in training the classifiers described above.
Referring now to FIG. 5, a flow diagram illustrates a method of labeling accent in a speech corpus in accordance with disclosed embodiments.
In further embodiments, represented as being optional by dashed connecting lines, the method includes the further step 520 of automatically accent relabeling the data in the database using a third classifier 140. Then, at step 525, the second classifier 120 is retrained, or further trained, using the automatically accent relabeled data in the database. Another step, occurring before step 520, can include step 530 of training the third classifier 140 using manually accent labeled data 415.
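The optional relabel-and-retrain loop of steps 520 and 525 can be sketched as follows (the classifier interfaces are illustrative assumptions):

```python
def relabel_and_retrain(corpus, second_classifier, third_classifier):
    """Sketch of optional steps 520/525.

    Step 520: relabel accents in the corpus with the combined classifier.
    Step 525: retrain the HMM based acoustic classifier on those labels.
    """
    relabeled = {utterance: third_classifier.label(utterance)  # step 520
                 for utterance in corpus}
    second_classifier.train(corpus, relabeled)                 # step 525
    return second_classifier
```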
With reference to FIG. 6, an exemplary system for implementing some embodiments includes a general-purpose computing device in the form of a computer 610. Components of computer 610 may include, but are not limited to, a processing unit 620, a system memory 630, and a system bus 621 that couples various system components including the system memory to the processing unit 620.
Computer 610 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 610 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 610.
The system memory 630 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 631 and random access memory (RAM) 632. A basic input/output system 633 (BIOS), containing the basic routines that help to transfer information between elements within computer 610, such as during start-up, is typically stored in ROM 631. RAM 632 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 620. By way of example, and not limitation, FIG. 6 illustrates an operating system, application programs, other program modules, and program data stored in RAM 632.
The computer 610 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 6 illustrates a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media, and an optical disk drive that reads from or writes to a removable, nonvolatile optical disk.
The drives and their associated computer storage media discussed above and illustrated in FIG. 6 provide storage of computer readable instructions, data structures, program modules and other data for the computer 610.
A user may enter commands and information into the computer 610 through input devices such as a keyboard 662, a microphone 663, and a pointing device 661, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a scanner or the like. These and other input devices are often connected to the processing unit 620 through a user input interface 660 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB). A monitor 691 or other type of display device is also connected to the system bus 621 via an interface, such as a video interface 690.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.