The present disclosure relates generally to language instruction. More particularly, the present disclosure relates to a system and related methods for modeling phonological errors.
The use of technology in classrooms has increased steadily over the past decade, and the comfort level of students in using technology has never been higher. Computer Assisted Pronunciation Training (CAPT) has been quietly making its way into many language learning curricula. The high demand for, and shortage of, language tutors, especially in Asia, has led to CAPT systems playing a prominent and increasing role in language learning.
CAPT systems can be very effective among language learners who prefer to go through the curriculum at their own pace. Also, CAPT systems exhibit infinite patience while administering repeated practice drills, which is a necessary evil in order to achieve automaticity. Most CAPT systems are first language (L1) independent (i.e., independent of the language learner's first language) and cater to a wide audience of language learners from different language backgrounds. These systems take the learner through pre-designed prompts and provide limited feedback based on the closeness of the acoustics of the learner's pronunciation to that of native/canonical pronunciation. In most of these systems, the corrective feedback, if any, is implicit in the form of pronunciation scores. The learner is forced to self-correct based on his or her own intuition about what went wrong. This method can be very ineffective, especially when the learner suffers from an inability to perceive certain native sounds.
A recent trend in CAPT systems is to capture language transfer effects between the learner's L1 and L2 (second language) languages. This makes the CAPT system better equipped to detect and identify errors and to provide actionable feedback to the learner. These specialized systems have become more viable with the enormous demand for English language learning products in Asian countries such as China and India. If the system is able to successfully pinpoint errors, it can not only help the learner identify and self-correct a problem, but can also be used as input for a host of other applications, including content recommendation systems and individualized curriculum-based systems. For example, if the learner consistently mispronounces a phoneme (the smallest sound unit in a language capable of conveying a distinct meaning), the learner can be recommended remedial perception exercises before continuing the speech production activities. Also, language tutors can receive regular error reports on learners, which might be very useful in periodic tuning of customizable curricula.
Linguistic experience and literature can be used to assemble a collection of error rules that represent negative transfer effects for a given L1-L2 pair. But this is not a foolproof process, as most linguists are biased toward certain errors based on their personal experience. Also, there are always inconsistencies among literature sources that list error rules for a given L1-L2 pair. Most of the relevant studies have been conducted on limited speaker populations, and most of them lack sufficient coverage of all phonological error phenomena. It would therefore be convenient and cost effective to automatically derive error rules from L2 data.
The prior art has tried automatically deriving context sensitive phonological rules (i.e., rules concerning the speech sounds in a language) by aligning the canonical pronunciations with phonetic transcriptions (i.e., visual representations of speech sounds) obtained from an annotator. Most alignment techniques used in similar automated approaches are variants of a basic edit distance (ED) algorithm. The algorithm is constrained to one-to-one mapping, which is ineffective in discovering phonological error phenomena that occur over phone chunks. Because edit distance based techniques poorly model dependencies between error rules, it is not straightforward to generate all possible non-native pronunciations given a set of error rules. Extensive rule selection and application criteria need to be developed, as such criteria are not modeled as part of the alignment process.
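The one-to-one limitation of the edit distance approach described above can be illustrated with a short sketch. The code below (an illustration only, not part of the disclosure; the phone symbols are invented) aligns a canonical phone sequence with an annotated phone sequence via a basic edit distance backtrace; note that each canonical phone pairs with at most one annotated phone, so chunk-level phenomena cannot be recovered.

```python
# Illustrative sketch: basic edit-distance alignment of a canonical phone
# sequence against an annotated (learner) phone sequence. Each canonical
# phone maps to at most one annotated phone, which is why chunk-level
# error phenomena are missed by this approach.
def edit_distance_align(canonical, annotated):
    n, m = len(canonical), len(annotated)
    # cost[i][j] = minimum edits aligning first i canonical phones
    # with first j annotated phones
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if canonical[i - 1] == annotated[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # substitution/match
                             cost[i - 1][j] + 1,        # deletion
                             cost[i][j - 1] + 1)        # insertion
    # backtrace to recover the one-to-one phone alignments
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and canonical[i - 1] == annotated[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            pairs.append((canonical[i - 1], annotated[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))  # deleted phone
            i -= 1
        else:
            pairs.append((None, annotated[j - 1]))  # inserted phone
            j -= 1
    return list(reversed(pairs))

# e.g. a learner substituting /l/ for /r/, as in "right" -> "light"
pairs = edit_distance_align(["r", "ay", "t"], ["l", "ay", "t"])
```

Each resulting pair is a single-phone substitution, insertion, or deletion; a many-to-many error rule such as p1p2-to-np1 cannot be expressed.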
Accordingly, a system and method is needed for modeling phonological errors.
Disclosed herein is a method for teaching a user a non-native language. The method comprises creating, in a computer process, models representing phonological errors in the non-native language; and generating with the models, in a computer process, non-native pronunciations for a native pronunciation.
Further disclosed herein is a system for teaching a user a non-native language. In some embodiments, the system comprises a word aligning module for aligning native pronunciations with corresponding non-native pronunciations, the aligned native and non-native pronunciations for use in creating a native to non-native phone translation model; a language modeling module for generating a non-native phone language model using annotated native and non-native phone sequences; and a non-native pronunciation generator for generating non-native pronunciations using the phone translation and phone language models.
In other embodiments, the system comprises a memory containing instructions and a processor executing the instructions contained in the memory. The instructions, in some embodiments, may include aligning native pronunciations with corresponding non-native pronunciations, the aligned native and non-native pronunciations for use in creating a native to non-native phone translation model; generating a non-native phone language model using annotated native and non-native phone sequences; and generating non-native pronunciations using the phone translation and phone language models.
The instructions in other embodiments may include creating models representing phonological errors in the non-native language; and generating with the models non-native pronunciations for a native pronunciation.
The present disclosure presents a system for modeling phonological errors in non-native language data using statistical machine translation techniques. In some embodiments, the phonological error modeling (PEM) system may be a separate and discrete system, while in other embodiments, the PEM system may be a component or sub-system of a CAPT system. The output of the PEM system may be used by a speech recognition engine of the CAPT system to detect non-native phonological errors.
The PEM system of the present disclosure formulates the phonological error modeling problem as a machine translation (MT) problem. An MT system translates a sentence in a source language into a sentence in a target language. The PEM system of the present disclosure may comprise a statistical MT sub-system that considers the canonical pronunciation to be in the source language and then generates the best non-native pronunciation (the target language) that is a good representative translation of the canonical pronunciation for a given L1 population (native language speakers). The MT sub-system allows the PEM system of the present disclosure to model phonological errors and the dependencies between error rules. The MT sub-system also provides a more principled search paradigm that is capable of generating the N-best non-native pronunciations for a given canonical pronunciation.
MT relates to the problem of generating the best sequence of words in the target language (the language to be learned) that is a good representation of a sequence of words in the source language. The Bayesian formulation of the MT problem is as follows:

T* = argmax_T P(T|S) = argmax_T P(S|T) P(T)
where T and S are word sequences in the target and source languages, respectively. P(S|T) is a translation model that models word/phrase correspondences between the source and target languages. P(T) represents a language model of the target language. The MT sub-system of the PEM system of the present disclosure may comprise a Moses phrase-based machine translation system.
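As an illustration of the noisy-channel decomposition above (the probability tables below are invented toy values, not data from the disclosure; a real system such as Moses estimates them from parallel corpora), the following sketch selects the candidate target sequence maximizing P(S|T) P(T) in log space:

```python
import math

# Toy translation model P(source_word | target_word); values are invented
# for illustration only.
translation_model = {
    ("casa", "house"): 0.8,
    ("casa", "home"): 0.6,
}
# Toy unigram stand-in for the target language model P(T).
language_model = {"house": 0.02, "home": 0.03}

def score(source, target):
    # log P(S|T) + log P(T), accumulated word by word for this toy example
    return sum(math.log(translation_model[(s, t)]) + math.log(language_model[t])
               for s, t in zip(source, target))

# argmax over the candidate translations T
candidates = [["house"], ["home"]]
best = max(candidates, key=lambda t: score(["casa"], t))
```

Here the language model outweighs the translation model (0.6 x 0.03 > 0.8 x 0.02), so "home" wins, illustrating how the two knowledge sources trade off during the argmax.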
Language model 60 learns the most probable sequence of words that occur in the target language. It guides the search during a decoding phase by providing prior knowledge about the target language. The language model 60, in some embodiments, may comprise a trigram (3-gram) language model 60 with Witten-Bell smoothing applied to its probabilities. A decoder 70 can read language models 60 created from popular open source language modeling toolkits 50 including but not limited to SRI-LM, RandLM and IRST-LM.
The decoder 70 may comprise a Moses decoder. The Moses decoder 70 implements a beam search to generate the best sequence of words in the target language that represents the word sequence in the source language. At each state, the current cost of the hypothesis is computed by combining the cost of the previous state with the cost of translating the current phrase and the language model cost of the phrase. The cost also includes a distortion metric that takes into account the difference in phrasal positions between the source and the target language. Competing hypotheses can potentially be of different lengths, and a word can compete with a phrase as a potential translation. In order to solve this problem, a future cost is estimated for each competing path. As the search space is too large for an exhaustive search, competing paths are pruned away using a beam, which is usually based on a combination of a cost threshold and histogram pruning.
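The beam search just described can be sketched in simplified, monotone form (this illustration omits the distortion metric and future-cost estimation that Moses layers on top; the phrase table and language-model scores are toy values):

```python
import math

# Simplified monotone phrase-based beam search with histogram pruning.
# Hypotheses covering the same number of source words compete in one beam.
def beam_search(source, phrase_table, lm_score, beam_size=3):
    n = len(source)
    # beams[i] holds hypotheses covering the first i source words:
    # (accumulated cost in negative log probability, target words so far)
    beams = [[] for _ in range(n + 1)]
    beams[0].append((0.0, ()))
    for pos in range(n):
        # histogram pruning: keep only the beam_size cheapest hypotheses
        beams[pos] = sorted(beams[pos])[:beam_size]
        for cost, out in beams[pos]:
            for length in (1, 2):  # candidate source phrases of length 1 or 2
                if pos + length > n:
                    break
                src = tuple(source[pos:pos + length])
                for tgt, p_trans in phrase_table.get(src, []):
                    # combine translation cost with language-model cost
                    new_cost = (cost - math.log(p_trans)
                                - math.log(lm_score(out, tgt)))
                    beams[pos + length].append((new_cost, out + tgt))
    return min(beams[n])[1] if beams[n] else None

# toy phone-level phrase table: /r/ is most often realized as /l/
table = {("r",): [(("l",), 0.6), (("r",), 0.4)],
         ("ay",): [(("ay",), 1.0)],
         ("t",): [(("t",), 1.0)]}
best = beam_search(["r", "ay", "t"], table, lambda out, tgt: 0.5)
```

With a constant language-model score, the cheapest path follows the highest translation probabilities, yielding the /l/ substitution.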
In accordance with the present disclosure, phonological errors in L2 (non-native target language) data are reformulated as a machine translation problem by considering a native/canonical phone sequence to be in the source language and attempting to generate the best non-native phone sequence (non-native target language) that represents a good translation of the native/canonical phone sequence. The corresponding Bayesian formulation may comprise:

NN* = argmax_NN P(NN|N) = argmax_NN P(N|NN) P(NN)
where N and NN are the corresponding native and non-native phone sequences. P(N|NN) is a translation model that models the phonological transformations between the native and non-native phone sequences. P(NN) is a language model for the non-native phone sequences, which models the likelihood of a certain non-native phone sequence occurring in L2 data.
The training of the phonological translation error and non-native phone language models 140 and 160, respectively, will now be described. A parallel phone (pronunciation) corpus of canonical phone sequences (native pronunciations) and annotated phone sequences (non-native pronunciations) from L2 data 190 is applied to the word aligning and language modeling toolkits 20 and 50, respectively. The parallel phone corpus may include prompted speech data from an assortment of different types of content. The parallel phone corpus may include minimal pairs (e.g. right/light), stress minimal pairs (e.g. CONtent/conTENT), short paragraphs of text, sentence prompts, isolated loan words and words with particularly difficult consonant clusters (e.g. refrigerator). Phone level annotation may be conducted on each corpus by plural human annotators (e.g. 3 annotators). The word aligning toolkit 20 generates phone alignments in response to the applied phone corpus 190. The phone alignments at the output of the word aligning toolkit 20 are applied to the native to non-native phone translation trainer 30, which grows the one-to-one phone alignments into phone-chunk based alignments, thereby training the phonological translation model 140. This process is analogous to growing word alignments into phrasal alignments in traditional machine translation. By way of example and not limitation, if p1, p2 and p3 are native phones and np1, np2, np3 are non-native phones (occurring one after the other in a sample phone sequence), the one-to-one phone alignments may comprise p1-to-np1, p2-to-np2 and p3-to-np3 (three separate phone alignments). The trainer 30 may then grow these one-to-one phone alignments into the phone chunk p1p2p3-to-np1np2np3.
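The chunk-growing step can be sketched as follows. This is a simplification of the phrase extraction heuristics used in statistical MT toolkits (it assumes the one-to-one alignments are already in order and simply emits every consecutive run up to a maximum chunk length):

```python
# Illustrative sketch: grow one-to-one phone alignments into phone-chunk
# translation pairs by emitting every consecutive run of aligned pairs up
# to max_len phones. Real MT toolkits apply consistency checks on top of
# many-to-many word alignments.
def grow_phone_chunks(alignments, max_len=3):
    # alignments: ordered list of (native_phone, non_native_phone) pairs
    chunks = []
    for start in range(len(alignments)):
        for end in range(start + 1, min(start + max_len, len(alignments)) + 1):
            native = tuple(p for p, _ in alignments[start:end])
            non_native = tuple(np for _, np in alignments[start:end])
            chunks.append((native, non_native))
    return chunks

# the example from the disclosure: p1-to-np1, p2-to-np2, p3-to-np3
pairs = grow_phone_chunks([("p1", "np1"), ("p2", "np2"), ("p3", "np3")])
```

The output contains the three original one-to-one pairs as well as the grown chunks, including p1p2p3-to-np1np2np3; each chunk pair would then receive a translation probability from its corpus counts.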
The resulting phonological translation error model 140 may have phone-chunk pairs with differing phone lengths and a translation probability associated with each one of them. The application of the annotated phone sequences from the L2 data of the parallel phone corpus 190 to the language modeling toolkit 50 trains the non-native phone language model 160.
Given the phonological (phone) translation error model 140 and the non-native phonological (phone) language model 160, the decoder (non-native pronunciation generator) 70 can generate the N-best non-native phone sequences for a given canonical native phone sequence supplied by the native lexicon unit 180 (which contains native pronunciations). The generated non-native phone sequences are stored in the non-native pronunciation lexicon unit 110.
The PEM system using MT was evaluated against a prior art edit distance (ED) based system. The PEM system was used to detect phonological errors in a test set. To build the edit distance based baseline system, phonological errors were initially extracted from the training set using ED. Phonological errors were ranked by occurrence probability. Based on empirical observations, the cutoff probability threshold was set at 0.001, which yielded approximately 1500 frequent error patterns. The frequent error rules were loaded into the Lingua Phonology Perl module to generate non-native phone sequences. The tool was constrained to apply rules only once for a given triphone context, as the edit distance approach does not model interdependencies between error rules. The N-best list obtained from the Lingua module was ranked by the occurrence probability of the rules that were applied to obtain each particular alternative. The non-native lexicon was created with an N-best cutoff of 4 so that it is comparable to the non-native lexicon produced by the PEM system. The PEM and ED systems were evaluated using the following metrics: (i) overall accuracy of the system; (ii) diagnostic performance as measured by precision and recall; and (iii) F-1 score, the harmonic mean of precision and recall, which provides a single number for tracking changes in the operating point of the systems. These metrics were calculated for the phone detection and phone identification tasks along with their corresponding human annotator upper bounds.
Phone error detection is defined as the task of flagging a phoneme as containing a mispronunciation. The accuracy metric measures overall classification accuracy of the system on the phone error detection task, while precision and recall measure the diagnostic performance of the system. Precision measures the number of correct mispronunciations over all the mispronunciations flagged by the system. Recall measures the number of correct mispronunciations over the total number of mispronunciations found in the test set (as flagged by the annotator).
Phone identification is defined as the task of identifying the phone label spoken by the learner. The identification accuracy metric measures the overall performance on the identification task. Precision measures the number of correctly identified error rules over the total number of error rules discovered by the system. Recall measures the number of correctly identified error rules over the number of error rules in the test set (as annotated by the human annotator).
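The detection metrics defined above can be computed as follows (a sketch assuming per-phone binary labels, where 1 marks a mispronunciation as flagged by the annotator or by the system; the example labels are invented):

```python
# Accuracy, precision, recall and F-1 for the phone error detection task.
# annotator: gold per-phone labels (1 = mispronounced), system: system flags.
def detection_metrics(annotator, system):
    tp = sum(1 for a, s in zip(annotator, system) if a == 1 and s == 1)
    fp = sum(1 for a, s in zip(annotator, system) if a == 0 and s == 1)
    fn = sum(1 for a, s in zip(annotator, system) if a == 1 and s == 0)
    tn = sum(1 for a, s in zip(annotator, system) if a == 0 and s == 0)
    accuracy = (tp + tn) / len(annotator)          # overall classification
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct flags / all flags
    recall = tp / (tp + fn) if tp + fn else 0.0     # correct flags / all errors
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean
    return accuracy, precision, recall, f1

# invented example: 5 phones, 3 true mispronunciations, 3 system flags
acc, p, r, f1 = detection_metrics([1, 0, 1, 0, 1], [1, 1, 0, 0, 1])
```

The identification metrics follow the same pattern, with correctly identified error rules taking the place of correctly flagged phones.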
The computer 750 and audio equipment shown in
In one embodiment, software for enabling computer system 750 to interact with user 702 may be stored on volatile or non-volatile memory within computer 750. However, in other embodiments, software and/or data for enabling computer 750 may be accessed over a local area network (LAN) and/or a wide area network (WAN), such as the Internet. In some embodiments, a combination of the foregoing approaches may be employed. Moreover, embodiments of the present disclosure may be implemented using equipment other than that shown in
In an embodiment, RAM 806 and/or ROM 808 may hold user data, system data, and/or programs. I/O adapter 810 may connect storage devices, such as hard drive 812, a CD-ROM (not shown), or other mass storage device to computer system 800. Communications adapter 822 may couple computer system 800 to a local, wide-area, or global network 824. User interface adapter 816 may couple user input devices, such as keyboard 826, scanner 828 and/or pointing device 814, to computer system 800. Moreover, display adapter 818 may be driven by CPU 802 to control the display on display device 820. CPU 802 may be any general purpose CPU.
While exemplary drawings and specific embodiments of the disclosure have been described and illustrated, it is to be understood that the scope of the invention as set forth in the claims is not to be limited to the particular embodiments discussed. By way of example and not limitation, one of ordinary skill in the speech recognition art will appreciate that the MT approach may also be used to construct a non-native speech recognition system, that is, a system that recognizes words spoken by a non-native speaker with a higher degree of accuracy by modeling the variations that such a speaker would produce while speaking. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by persons skilled in the art without departing from the scope of the invention as set forth in the claims that follow and their structural and functional equivalents.
Number | Date | Country | |
---|---|---|---|
61503325 | Jun 2011 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/US2012/044992 | Jun 2012 | US |
Child | 14141774 | US |