This application has prior art in United States patent applications, the entire contents of all of which are incorporated herein by reference: U.S. patent application Ser. No. 12/165,258, "INTERACTIVE LANGUAGE PRONUNCIATION TEACHING", filed Jun. 30, 2008; and U.S. patent application Ser. No. 14/705,634, "PRONUNCIATION LEARNING FROM USER CORRECTION", filed May 6, 2015.
TajweedMate Mobile Application, “https://www.tajweedmate.com”, accessed Aug. 28, 2022
Tarteel Quran Application, “https://www.tarteel.ai/”, accessed Aug. 28, 2022
The Quran is a religious text revealed in Arabic. Muslims around the world have been learning the Quran's Arabic pronunciation in a well-structured way for more than a thousand years.
Learning to read The Quran requires frequent practice to master the pronunciation. Automated feedback with speech recognition has been demonstrated to be an invaluable tool for reducing the need for human evaluation.
However, existing tools require frequent user interaction beyond speech itself: users need to interact with a touch interface, select the lessons, and listen to the expected pronunciation. These interactions often slow down the learning progress.
It is desirable to have a computer-implemented method for speech practice in continuous conversation (dialogue) mode, without interruptions during learning and without the need for an electronic display. This would benefit the learning process significantly, especially while driving or walking.
The present invention provides a system to teach the correct pronunciation of the Quran from exemplary phrases defined in an expert-defined curriculum, using automatic speech recognition without requiring any navigational user interaction except speech input.
An exemplary system comprises playing the true pronunciation of a first selected phrase from a plurality of phrases, starting a sound recording, and applying an automated speech recognition algorithm to convert the first sound record into token probabilities. These token probabilities are then compared to the first selected phrase to check correctness. The correctness result could be used in the decision to advance to the next phrase in the learning curriculum.
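The control flow described above can be sketched as follows. This is an illustrative sketch only; `play_audio`, `record_utterance`, `recognize`, and `score_fn` are hypothetical placeholders standing in for the playback, sound capture, speech recognition, and correctness-checking components of an actual embodiment.

```python
def practice_loop(phrases, play_audio, record_utterance, recognize,
                  score_fn, threshold=0.8):
    """Advance through the curriculum phrase by phrase, repeating each
    phrase until the user's pronunciation score passes the threshold.

    `phrases` is a sequence of (reference_audio, expected_text) pairs.
    """
    for ref_audio, text in phrases:
        while True:
            play_audio(ref_audio)        # play the true pronunciation
            audio = record_utterance()   # capture the user's attempt
            probs = recognize(audio)     # token probabilities
            if score_fn(probs, text) >= threshold:
                break                    # accepted: advance to next phrase
```

The inner loop realizes the repetition-until-acceptance behavior: a phrase is replayed and re-recorded until the score clears the decision threshold.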
An exemplary system may also include repeating the true pronunciation of the first selected phrase until the speech input is accepted. This feature ensures that the phrase is learned before advancing to the next one.
Optionally, a performance indicator for visual feedback, or a visual demonstration of the correct and incorrect parts of the pronunciation in text form, could be shown automatically to indicate the quality of the pronunciation so that progress can be monitored.
The feedback loop is essential to learning any new skill. Learning the pronunciation of the Quran in the Arabic language is no different from this perspective. People learn the language by listening, imitating, and getting feedback. The method described herein automates the feedback loop of learning the pronunciation of the Quran with the help of speech recognition.
Referring to
In step 102, the user tries to imitate the sound from step 101. This step consists of automatically activating the sound capture device after the playback of step 101, recording the user's utterance, and deactivating the device after the utterance ends.
The capture device could be further automated by stopping the recording after a predefined time interval, for example, twice the length of the sound record of the original exemplary phrase.
In another alternative, silence detection could be used to stop recording automatically, without the need to define a record time duration.
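One way to realize such silence detection is a simple frame-energy endpoint detector. The sketch below is an illustrative assumption rather than the invention's specific implementation: the function name, the energy threshold, and the frame counts are all hypothetical, and each frame is assumed to be a list of audio samples.

```python
def detect_end_of_utterance(frames, energy_threshold=1e-3,
                            min_silence_frames=30):
    """Return the frame index at which recording may stop: the point
    where `min_silence_frames` consecutive low-energy frames follow at
    least one high-energy (speech) frame. Returns None if the utterance
    has not yet ended."""
    speech_seen = False
    silent_run = 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)  # mean power
        if energy >= energy_threshold:
            speech_seen = True      # user is (still) speaking
            silent_run = 0
        elif speech_seen:
            silent_run += 1         # trailing silence after speech
            if silent_run >= min_silence_frames:
                return i
    return None
```

In practice the threshold would be tuned to the capture device's noise floor, or replaced by a statistical voice activity detector.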
In step 103, the recorded signal is analyzed using a speech recognition system.
Speech recognition systems consist of input signal processing and a machine learning method that maps the processed input signals to output probabilities in character/phoneme space.
Input signal processing could be, but is not limited to, taking Fast Fourier Transforms of the raw audio signal data, normalizing, or thresholding based on statistical values of the raw signal data.
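As a minimal sketch of such preprocessing, the following function splits the raw signal into overlapping frames and takes the magnitude of each frame's discrete Fourier transform, producing the time-by-frequency matrix consumed later in step 201. A naive DFT is used here only to keep the sketch self-contained; a real system would use an FFT library, and the frame and hop lengths are illustrative assumptions.

```python
import cmath

def spectrogram(signal, frame_len=64, hop=32):
    """Short-time Fourier magnitude: returns a matrix (list of lists)
    whose rows are time points and whose columns are frequency bins."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    matrix = []
    for frame in frames:
        mags = []
        for k in range(frame_len // 2):  # keep the non-redundant bins
            s = sum(x * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                    for n, x in enumerate(frame))
            mags.append(abs(s))
        matrix.append(mags)
    return matrix
```

For a pure tone, the resulting matrix peaks in the frequency bin corresponding to that tone, which is what makes the representation useful as neural network input.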
A machine learning method could be, but is not limited to, training a neural network with a prior automated speech recognition dataset consisting of pairs of sound records and corresponding written forms, or using a previously pre-trained automated speech recognition model for the selected language.
In step 104, the system quantifies the correctness of the output and passes the result to decision control, which either advances to the next exemplary phrase or repeats the current one.
Checking the correctness could be, but is not limited to, registering the machine learning output to the expected written form of the first sound record, or calculating the number of matches between the decoded result of the automated speech recognition system and the expected written form of the first sound record.
Referring to
In step 201, the system receives the preprocessed sound record input in the form of a matrix: one dimension represents the time points, and the other represents the different frequencies present in the sound record.
In step 202, the system executes the pre-trained neural network and computes the output character/phoneme probabilities shown in step 203.
In step 204, the registration unit takes the expected output phrase and compares the neural network output to the expected phrase in the text form.
The simplest comparison method takes the most probable character at each time point from the output probabilities to form a character sequence. Correctness could then be calculated by checking for an exact match, or the ratio of matching characters, between the speech recognition output and the expected output after removing a prespecified list of control characters.
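This simplest method can be sketched as greedy decoding followed by a similarity ratio. The sketch below is illustrative: it assumes a single blank/control symbol and collapses consecutive repeats (as in CTC-style decoding), and it borrows the standard-library `SequenceMatcher` as one possible match-ratio measure rather than the invention's prescribed one.

```python
from difflib import SequenceMatcher

def greedy_decode(prob_matrix, alphabet, blank="-"):
    """Pick the most probable symbol at each time point, then collapse
    consecutive repeats and drop the blank/control symbol."""
    best = [alphabet[max(range(len(row)), key=row.__getitem__)]
            for row in prob_matrix]
    out, prev = [], None
    for ch in best:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

def correctness(decoded, expected):
    """Ratio of character matches between decoded and expected text,
    in [0, 1]."""
    return SequenceMatcher(None, decoded, expected).ratio()
```

A perfect imitation yields a ratio of 1.0; dropped or substituted characters lower the score proportionally.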
In some examples, a more comprehensive registration-based method could be used to check correctness. In those examples, a dynamic programming-based module could be used to assign elements of the expected output sequence to corresponding probabilities in the output matrix. The dynamic programming algorithm maximizes the total global matching score, constrained by the order given in the expected output sequence.
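One standard instance of such order-constrained global matching is Needleman-Wunsch-style alignment on the decoded character sequence. The following is a simplified sketch under that assumption: it aligns two discrete sequences with illustrative match/gap/mismatch scores, whereas an actual embodiment could align the expected sequence directly against the probability matrix.

```python
def align_score(decoded, expected, match=1, gap=-1, mismatch=-1):
    """Global alignment score via dynamic programming: maximize the
    total matching score subject to the order of `expected`."""
    n, m = len(decoded), len(expected)
    # dp[i][j]: best score aligning decoded[:i] with expected[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap       # unmatched decoded prefix
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap       # missing expected prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if decoded[i - 1] == expected[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # align the two symbols
                           dp[i - 1][j] + gap,      # extra decoded symbol
                           dp[i][j - 1] + gap)      # missing expected symbol
    return dp[n][m]
```

Tracing back through the `dp` table would additionally recover which expected elements went unmatched, which is what allows missing terms to be penalized and displayed.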
In some examples, a penalty score could be associated with missing terms, and a display device could show those missing terms.
In step 205, the registration results could be quantified into a numeric value between 0 and 1, or a percentage score. A decision threshold on the calculated score could determine whether to advance to the next exemplary phrase or repeat the same phrase.
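The quantification and decision of step 205 reduce to a small normalization-and-threshold step, sketched below. The function name, the normalization by a maximum attainable score, and the default threshold are all illustrative assumptions.

```python
def advance_decision(score, max_score, threshold=0.8):
    """Normalize a registration score to [0, 1] and decide whether to
    advance to the next phrase (True) or repeat it (False).

    Returns (decision, normalized_score)."""
    normalized = max(0.0, score / max_score) if max_score > 0 else 0.0
    return normalized >= threshold, normalized
```

The normalized score can double as the optional visual performance indicator, while the boolean drives the decision control of step 104.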