The invention is in the field of foreign language instruction and more particularly directed to creating custom instructional curricula from uploaded videos.
Video recordings (video files that include an audio channel or track) have become a significant part of modern life due to advances in multimedia technology. Millions of video recordings are produced every day for various purposes, both commercial and personal. Some of these videos are expressly made as instructional aids for foreign language learning, while others are later repurposed for this task. Developing a new curriculum for language learning, however, commonly takes a month or two.
Non-native Chinese speakers who learn to speak Chinese by watching videos of others speaking Chinese tend to make several types of pronunciation mistakes. One type comprises mistakes that stem from the pronunciation patterns of the learner's native language. Native English speakers learning Chinese will tend to make the same kinds of pronunciation mistakes as one another, but a different set of mistakes than native French speakers make. Other mistakes are copied from the speakers in the videos themselves. For instance, those learning Chinese sometimes utilize videos of non-native Chinese speakers speaking Chinese. In this case, the learner may end up making the pronunciation mistakes of the non-native Chinese speaker in the video. The same can be true if the speaker in the video is a native speaker of Chinese but speaks the language with poor enunciation.
The present invention provides systems and methods for creating educational curricula from video files such as user-provided videos. These systems and methods, among other things, identify common pronunciation errors, highlight mispronunciations made by the speaker in the video, and mark words in a transcription of the human speech component of the audio portion of the video where the user is likely to mispronounce those words. These systems and methods employ machine learning to improve at one or more of the functions described herein by learning from the users' inputs.
Sentences are the most natural learning unit for speaking and listening practice in a spoken language; thus, the present invention automatically segments video files into smaller pieces according to identified sentence boundaries. The present invention further includes automatically transcribing the speech identified in the audio tracks of videos to obtain corresponding transcriptions. The present invention can further comprise determining a learning value of a video file, and/or automatically identifying language-based pronunciation error patterns. Moreover, the present invention can comprise building a curriculum by automatically generating additional learning materials related to the video's transcriptions.
The present invention allows a variety of videos, including those made for other purposes, to be automatically converted into instructional materials for language learning. Using these instructional materials, students can learn by listening to the speech in the audio track of the video as well as by practicing speaking, using the video as a reference. The automatic conversion of videos into good-quality learnable lessons (i.e., a curriculum) greatly improves and speeds the curriculum generation process. Common error patterns for learners who share a native language are also incorporated for more effective learning. Accordingly, the present invention allows curricula to be personalized according to the user's native language by identifying where the user is likely to make pronunciation mistakes, based on that native language, and marking those places on the transcription derived from the video.
Various embodiments of the video-to-curriculum systems and methods disclosed herein can also identify and mark pronunciation mistakes made by the speakers in the input video in order to prevent the user from being misled by these mispronunciations. The number and type of pronunciation mistakes by speakers in videos can also be used to determine the value of a video as part of a curriculum. It should be noted that while good pronunciation by a speaker in a video can be a positive learning example, poor pronunciation can also serve as a learning example of what mistakes to avoid. Additionally, where machine learning is employed in the present invention, models thereof can be continuously trained and refined through continued use.
In a typical application of the present invention, a teacher picks an arbitrary video, for instance, a video found online. The system automatically transcribes the input video and takes the transcriptions as a corpus. The system generates a temporary curriculum based on the input video, and the teacher can rearrange or rewrite the materials if needed. After the final curriculum has been generated, course-related materials and exercises based on the content of the curriculum are automatically attached to it. The system also automatically detects the grammar points of the sentences, labels pinyin and language proficiency levels for all words, annotates corresponding video/audio clips to the sentences, and integrates all essential elements to build a curriculum. Producing a curriculum takes the teacher only a few minutes, and the content can be changed easily. Additionally, all standards of language proficiency are stored in a database and can be automatically matched to content.
The present invention is directed to systems and methods for automatically generating language-learning curricula from ordinary video files produced for other reasons, both personal and commercial. An exemplary curriculum comprises the original video with noise removed from the audio track, a transcript of the words spoken in the video, and information that can be accessed related to the speech in the video, such as pronunciation mistakes, helpful suggestions, and so forth.
Both the host and the client device of the system 100 include a processor and non-transitory memory storing computer-readable instructions that, when executed by the processor, cause the host and the client device to perform the steps of the methods disclosed herein. An exemplary method that can be performed by the host, such as a server, comprises a step of receiving an input video file 110 having an audio track, a step of removing noise from the audio track using a denoising system 120, a step of segmenting the input video according to human speech sentence boundaries using a segmenting module 130, a step of transcribing the spoken sentences identified within the cleaned audio track using a transcription module 140 to create sentence transcripts 145, and a step of generating learning materials from the transcripts 145 using a curricula module 150.
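By way of illustration and not limitation, the following Python skeleton shows the control flow of this exemplary method; each function body is a placeholder stand-in for the corresponding module 120-150, not an implementation of it.

    def denoise(audio):                    # denoising system 120 (placeholder)
        return audio

    def segment_sentences(audio):          # segmenting module 130 (placeholder)
        return [audio]

    def transcribe(segment):               # transcription module 140 (placeholder)
        return "<sentence transcript>"

    def build_curriculum(transcripts):     # curricula module 150 (placeholder)
        return {"lessons": transcripts}

    def process_video(audio_track):
        cleaned = denoise(audio_track)                    # noise removal step
        segments = segment_sentences(cleaned)             # segmentation step
        transcripts = [transcribe(s) for s in segments]   # sentence transcripts 145
        return build_curriculum(transcripts)              # learning materials

    print(process_video("<audio track of input video 110>"))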
The step of receiving an input video file can include receiving, for instance, an MPEG file, a Windows Media Video file, or a WebM file. The input video preferably includes one or more people speaking a language that the user seeks to become more proficient in. For the purposes of illustration, the present disclosure uses Chinese as the example of the language to be learned, and assumes the user has a different native language. However, the systems and methods disclosed herein can be used to generate curricula to be used to gain proficiency in any language. Input videos can be supplied by a user or can be supplied by an organization seeking to produce suitable curricula for language education. In the case of user-supplied input videos, in this step the user can upload videos from a client device (e.g., a PC, tablet, or smartphone) across a network connection to the host server to generate learning materials therefrom.
In the noise removal step, noise is removed from the audio track of the input video to produce a cleaned audio track. Noise, in the present context, is any sound other than the human speech component of the audio. Removing noise can include, in some embodiments, analyzing the track to differentiate the speech component from background sounds such as music, traffic, animal sounds, and so forth. Exemplary denoising systems suitable for performing this step can employ machine learning, such as through the use of generative adversarial networks (GANs). GAN-based speech enhancement systems are well known in the art, see, for example, “Towards Generalized Speech Enhancement with Generative Adversarial Networks” by S. Pascual et al., Cornell University, April, 2019.
In a denoising system 120 that employs machine learning, before the denoising system 120 is used in the denoising step, a denoising model of the denoising system 120 is trained with various noise data (music and so forth) as non-speech signatures. In some training embodiments, an ordinary noise model can be used to generate noises that simulate noises found in various environments. The generated noises are then added to clean speech audio files to generate training data for training the denoising model. During the denoising step, in some embodiments, the denoising system 120 identifies the non-speech signals, removes them by filtering, and amplifies the remaining speech signal to produce the cleaned audio track.
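By way of illustration, the following Python sketch shows one way the training pairs described above can be generated: synthetic noise is scaled to a chosen signal-to-noise ratio and added to clean speech. The sine wave and random noise here are stand-ins for recorded speech and simulated environmental noise.

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        """Scale `noise` so the mixture has the requested SNR, then add it."""
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        target_noise_power = clean_power / (10 ** (snr_db / 10))
        scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
        return clean + scaled_noise

    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in "speech"
    noise = rng.standard_normal(16000)                          # stand-in noise
    noisy = mix_at_snr(clean, noise, snr_db=10.0)  # (noisy, clean) training pair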
The step of segmenting the video according to human speech sentence boundaries can include, in various embodiments, taking the cleaned audio track as the input, determining the sentence boundaries (i.e., the beginning and ending times of each sentence), and using the sentence boundaries to partition the video into smaller video segments. Sentence boundary detection can also be based on machine learning, for example, through the use of neural network techniques. During training, a sentence boundary model is trained to automatically extract useful features, such as sound volumes, durations of silences, characteristics of human voices, etc., to identify the sentence boundaries. At runtime, the trained model is used to predict the sentence boundaries in the cleaned audio track. This time-boundary information is sometimes provided in Conversation Time Marked (CTM) formatted files.
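For illustration, the following Python sketch parses CTM-formatted lines, which conventionally take the form "<recording> <channel> <start> <duration> <token>"; the sample data below is invented.

    def parse_ctm(lines):
        """Return (token, start_sec, end_sec) tuples from CTM-formatted lines."""
        words = []
        for line in lines:
            _rec, _channel, start, duration, token = line.split()[:5]
            words.append((token, float(start), float(start) + float(duration)))
        return words

    sample = [
        "lesson1 1 0.00 0.50 ni3",    # invented sample data
        "lesson1 1 0.50 0.25 hao3",
    ]
    print(parse_ctm(sample))          # [('ni3', 0.0, 0.5), ('hao3', 0.5, 0.75)]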
In some embodiments, the cleaned audio track is sampled at a succession of sample points, and for each sample point a prediction is made as to whether or not the sample point represents human speech. In these embodiments, if a series of sample points is predicted as being human speech for a duration that meets a threshold, such as 300 ms, then the series of sample points is considered to be human speech; otherwise, the sample points are considered to be silence.
The cleaned audio track can sometimes still retain some residual noise. If the cleaned audio track lacks residual noise, a sample point can be determined to represent human speech if the volume at the sample point is over a threshold; otherwise, it is treated as silence. On the other hand, if the cleaned audio track still contains residual noise, a voice activity detection model is used to distinguish human speech from silence.
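The following Python sketch illustrates the run-length logic described above, assuming per-sample speech/no-speech predictions are already available; the rate of one prediction every 10 ms is an assumption rather than a requirement.

    SAMPLES_PER_SECOND = 100    # one prediction every 10 ms (assumed)
    MIN_SPEECH_MS = 300         # runs shorter than this are treated as silence

    def speech_runs(predictions):
        """Group True/False speech predictions into (start, end) index runs,
        keeping only runs lasting at least MIN_SPEECH_MS."""
        min_samples = MIN_SPEECH_MS * SAMPLES_PER_SECOND // 1000
        runs, start = [], None
        for i, is_speech in enumerate(predictions):
            if is_speech and start is None:
                start = i
            elif not is_speech and start is not None:
                if i - start >= min_samples:
                    runs.append((start, i))
                start = None
        if start is not None and len(predictions) - start >= min_samples:
            runs.append((start, len(predictions)))
        return runs

    # A 250 ms burst is discarded; a 400 ms burst is kept.
    preds = [False] * 20 + [True] * 25 + [False] * 30 + [True] * 40 + [False] * 10
    print(speech_runs(preds))   # [(75, 115)]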
To segment the cleaned audio track into sentences, the portions that are identified as sentences are delineated by their start and end times, each such delineated portion being an audio segment. The same start and end times are then applied to the video to create video segments synchronized to the audio segments. In some embodiments, a user can be provided with options, through the application operating on the client device, to adjust the boundary prediction results on the display of the client device.
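By way of example, the widely available ffmpeg command-line tool can apply the same start and end times to the media file without re-encoding, keeping the resulting audio and video segments synchronized; the file names and times below are hypothetical.

    import subprocess

    def cut_segment(src, start_sec, end_sec, dst):
        """Cut [start_sec, end_sec) out of `src` without re-encoding."""
        subprocess.run(
            ["ffmpeg", "-i", src, "-ss", str(start_sec), "-to", str(end_sec),
             "-c", "copy", dst],
            check=True,
        )

    # One output file per detected sentence (hypothetical boundaries).
    for i, (start, end) in enumerate([(0.0, 3.2), (3.2, 7.9)]):
        cut_segment("input_video.mp4", start, end, f"sentence_{i}.mp4")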
Returning to the system 100, after the transcripts 145 are obtained, the curricula module 150 receives the transcripts 145 to generate a language learning curriculum comprising language learning materials based on the input video. The curricula module 150 determines whether the input video is a positive or a negative example in terms of its value as a curriculum. The curricula module 150 also builds a statistical model to record frequently mispronounced words for the language spoken in the video. The curricula module 150 further automatically labels frequently mispronounced words according to the learner's native language to improve learning efficiency.
A model of frequently mispronounced words, in the language spoken in the video, for native speakers of the learner's language is also used to tag frequently mispronounced words when they are present in a transcript 145. In some embodiments, the corresponding statistical model is used to predict the 100 most frequent word mispronunciations and label them in the text.
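Illustratively, this tagging pass can be as simple as marking transcript words that appear in the model's most-frequently-mispronounced list; the words below are hypothetical.

    # Hypothetical output of the statistical model: the most frequently
    # mispronounced words for this native-language/target-language pairing.
    frequently_mispronounced = {"zhong1", "xie4"}

    transcript = ["ni3", "xie4", "zhong1", "wen2"]
    tagged = [(word, word in frequently_mispronounced) for word in transcript]
    print(tagged)  # [('ni3', False), ('xie4', True), ('zhong1', True), ('wen2', False)]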
To produce the initial model, in some embodiments, five Chinese sentences that together include all phonemes in Chinese are given to native speakers of another language, such as Spanish, to read aloud while their voices are recorded. A speech scoring system is then used to score all the voice recordings to build an initial mispronunciation model for the combination of target language and native tongue (e.g., Chinese as spoken by Spanish native speakers). As learning curricula are used, these models can be revised. Every time a learner with a given native language practices the target language spoken in a video, the words that are scored below a threshold are noted, and as words are found over time to be mispronounced more or less frequently than in the existing model, the weightings assigned to those words are adjusted accordingly.
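The following Python sketch shows one possible form of this revision step: each practice session contributes below-threshold observations, and the per-word weightings drift toward the observed mispronunciation rates. The threshold, update rate, and scores are illustrative assumptions.

    SCORE_THRESHOLD = 60    # assumed: a word scored below this was mispronounced
    UPDATE_RATE = 0.05      # assumed: how quickly weightings drift

    def update_weightings(weightings, session_scores):
        """weightings: word -> estimated mispronunciation rate (0..1).
        session_scores: word -> score from the speech scoring system."""
        for word, score in session_scores.items():
            observed = 1.0 if score < SCORE_THRESHOLD else 0.0
            old = weightings.get(word, 0.0)
            weightings[word] = old + UPDATE_RATE * (observed - old)
        return weightings

    model = {"zhong1": 0.40, "xie4": 0.10}
    model = update_weightings(model, {"zhong1": 85, "xie4": 40})
    print(model)   # zhong1 drifts down to 0.38, xie4 drifts up to 0.145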
Another model of frequently mispronounced words, in the language spoken in the video, for native speakers of the same language as spoken in the video (e.g., mispronunciations of Chinese by native Chinese speakers, where the speaker in the video is a native Chinese speaker speaking in Chinese) is also employed to tag words in the transcript that were mispronounced by the speaker in the video.
Next, the scores of the words of the video are evaluated to determine whether the video constitutes a positive or a negative example. In an exemplary embodiment, both a threshold percentage of the number of words in the video and a threshold score are employed.
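A minimal sketch of this determination follows; the two threshold values are illustrative assumptions, not values prescribed by the invention.

    SCORE_THRESHOLD = 60      # assumed: a word below this counts as mispronounced
    MAX_BAD_FRACTION = 0.10   # assumed: tolerated fraction of mispronounced words

    def classify_video(word_scores):
        """Label the video a positive or negative example from its word scores."""
        bad = sum(1 for score in word_scores if score < SCORE_THRESHOLD)
        fraction = bad / len(word_scores)
        return "positive" if fraction <= MAX_BAD_FRACTION else "negative"

    print(classify_video([90, 85, 72, 55, 88, 91, 79, 83, 95, 86]))  # positive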
Curricula module 150 can also perform one or more of word segmentation, word-level labeling, pinyin labeling, vocabulary targeting, and grammar detection. Word segmentation tokenizes characters into a word sequence that makes sense in context. For instance, in Chinese, every character has its own meaning, and two or three characters can be combined to form a word that sometimes has a different meaning. Unlike written Western languages, where spaces denote word boundaries, written Chinese relies on the reader to infer the word boundaries from the context. When Chinese text is processed, word segmentation is applied to determine whether the characters in a sentence should be separated individually or combined according to the context. The process of taking all contextual clues into consideration and inferring the correct word boundaries without creating a nonsense sentence is termed word segmentation or tokenization.
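For illustration, the open-source jieba tokenizer is one widely used tool for this task (the invention is not tied to any particular tokenizer):

    import jieba  # open-source Chinese word segmentation library

    sentence = "我喜欢学习中文"         # "I like studying Chinese"
    print(list(jieba.cut(sentence)))   # e.g. ['我', '喜欢', '学习', '中文']

Note that the characters 中 and 文 are emitted as the single word 中文 ("Chinese language"), as the context requires, rather than as two separate characters.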
Word-level labeling labels each word according to various national and international proficiency standards. For instance, the oral proficiency levels of the ACTFL Guidelines map to the continuum of language proficiency from highly articulate to little or no functional ability. Pinyin labeling labels each word with its phonetic notation. Vocabulary targeting identifies vocabulary and links words to corresponding vocabulary profile pages.
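For pinyin labeling, the open-source pypinyin library illustrates the idea (again, one possible choice rather than a required component):

    from pypinyin import pinyin  # open-source pinyin annotation library

    for word in ["中文", "学习"]:
        print(word, pinyin(word))   # e.g. 中文 [['zhōng'], ['wén']]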
Grammar detection identifies the grammar points present in each sentence, as noted above.
The tagged transcripts 145 are then provided to a learning module creator. The learning module creator produces learning modules for both positive and negative videos. A learning module is audiovisual content to be presented through a graphical user interface like a browser window or a smartphone display. Each learning module comprises one or more sentence transcripts 145 and corresponding audio segments from a video. In exemplary embodiments, the transcripts 145 are displayed on the graphical user interface with certain words visually differentiated as noted above. The graphical user interface provides the user the ability to play the audio segment corresponding to a transcript 145, for example, with a selectable audio icon.
Other tools can likewise be made available in various embodiments. Such tools can include the ability to play the video segment corresponding to the transcript 145, the ability to select a word or character to access more information thereon, or the ability to play a recording of the proper pronunciation from a library. Another tool that can be provided by the module is an interface that allows the user to practice speaking the sentence: the audio of the user's speech is recorded, the pronunciations of the words of the sentence are scored, and the resulting scores and detailed diagnostics of any errors are provided to the user.
Further, the module can visually differentiate words of the transcripts 145 that are commonly mispronounced by people learning the language who are native speakers of the same language of which the user is a native speaker.
To build a curriculum around a target word or phrase, a curriculum creator evaluates the relevance of available learning modules to that target. Relevance can be assessed in both the audio domain and the text domain.
Relevance in the audio domain refers to words or phrases that sound similar to the target. As an example of audio domain relevance, consider a target word pronounced as “eye”; words sharing similar pronunciations, such as words pronounced “b-eye”, “wh-eye”, and “k-eye”, have relevance, and therefore learning modules that include these words could be good selections. Similarly, relevance in the text domain refers to the meaning and grammatical usage of the target word. Thus, for example, for the word “loves,” learning modules that present phrases or sentences about relationships or romance would be relevant, as would those that use the verb and noun senses of the word “loves.”
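One simple heuristic for audio-domain relevance, sketched below, compares pinyin finals. The characters used (爱 ài "love", 白 bái, 外 wài, 开 kāi, 中 zhōng) are assumptions chosen to be consistent with the pronunciations given above, and the heuristic itself is illustrative rather than the specific method of the invention.

    from pypinyin import lazy_pinyin  # open-source pinyin annotation library

    def shares_final(target, candidate, final="ai"):
        """True if the pinyin of both characters ends with the given final."""
        return (lazy_pinyin(target)[0].endswith(final)
                and lazy_pinyin(candidate)[0].endswith(final))

    target = "爱"                    # assumed target: ai, "love"
    for candidate in ["白", "外", "开", "中"]:
        print(candidate, shares_final(target, candidate))  # True, True, True, False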
The curriculum creator then selects a number of learning modules from a library thereof, the learning modules with the highest relevance being chosen.
The descriptions herein are presented to enable persons skilled in the art to create and use the systems and methods described herein. Various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the inventive subject matter. Moreover, in the following description, numerous details are set forth for the purpose of explanation. However, one of ordinary skill in the art will realize that the inventive subject matter might be practiced without the use of these specific details. Flowcharts in drawings are used to represent processes. A hardware processor system may be configured to perform some of these processes. Modules within flow diagrams representing computer implemented processes represent the configuration of a processor system according to computer program code to perform the acts described with reference to these modules. Thus, the inventive subject matter is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The use of the term “means” within a claim of this application is intended to invoke 112(f) only as to the limitation to which the term attaches and not to the whole claim, while the absence of the term “means” from any claim should be understood as excluding that claim from being interpreted under 112(f). As used in the claims of this application, “configured to” and “configured for” are not intended to invoke 112(f).