This application claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2017-0034849, filed on 20 Mar. 2017, the disclosure of which is incorporated herein by reference in its entirety.
The present invention relates to foreign language fluency or pronunciation evaluation technology applicable to a computer-based foreign language learning service. More particularly, the invention relates to a method and system for grading foreign language fluency on the basis of an end-to-end technique which, by using a convolution neural network, omits the intermediate processes of conventional fluency or pronunciation grading.
A conventional foreign language fluency evaluation system is largely configured with a grading model training unit and an automatic grading unit. The grading model training unit trains a grading model so as to increase a correlation between a result obtained by the automatic grading unit evaluating speech pronounced by a non-native speaker and a result obtained through grading performed by a grading expert(s). Such a process will be described below with reference to
A raw non-native speech signal (hereinafter referred to as a ‘raw signal’) is collected in step 10. A feature vector suitable for speech recognition is extracted from the raw signal in step 11. Generally, mel-frequency cepstral coefficients (MFCC) or perceptual linear prediction (PLP) features are used. Word and time alignment information for the extracted feature is obtained through automatic speech recognition, and a feature necessary for pronunciation evaluation is extracted based on the obtained word and time alignment information in step 13. In this case, the extracted feature varies according to the characteristics of the language. For example, features shown in
In a process of automatically grading speech pronounced by a non-native speaker by means of the trained grading model, steps 10 to 13 of the model training process described above with reference to
In the related art foreign language fluency evaluation system, 1) a feature vector for speech recognition must be extracted from a raw signal, and 2) speech recognition itself is not fully accurate. For this reason, 3) the precision of a system that grades fluency using the above-described information is inevitably reduced, and 4) features for grading foreign language fluency are extracted through a subjective and intuitive method. Also, 5) the modules used for fluency grading (for example, a speech recognition module, a feature extraction module, a grading model, etc.) operate separately, so each is tuned on its own and the related art foreign language fluency evaluation system as a whole does not reach its overall optimal performance.
Accordingly, it is an object of the present invention to provide a method and system for grading foreign language fluency on the basis of an end-to-end technique, in which the multi-step intermediate process of related art foreign language fluency grading is omitted.
To accomplish the above object, the method and system for grading foreign language fluency on the basis of an end-to-end technique according to the present invention propose end-to-end automatic grading, in which a convolution neural network (CNN) that directly receives a raw signal corresponding to speech pronounced by a non-native speaker is trained so that its output is a grade comparable to that given by a skilled grader.
In one general aspect, an end-to-end foreign language fluency grading method of grading a foreign language fluency of a non-native speaker from a non-native raw speech signal includes: inputting the raw speech to a convolution neural network (CNN); training a filter coefficient of the CNN based on a fluency grading score calculated by a human rater for the raw speech signal so as to generate a foreign language fluency grading model; and grading foreign language fluency for a non-native speech signal newly input to the trained CNN by using the foreign language fluency grading model to output a grading result.
In another general aspect, an end-to-end foreign language fluency grading system for grading a foreign language fluency of a non-native speaker from a non-native raw speech signal includes: a convolution neural network (CNN) for receiving the raw speech, training a filter coefficient of the CNN based on a fluency grading score calculated by a human rater for the raw speech signal so as to generate a foreign language fluency grading model, and grading foreign language fluency for a non-native speech signal newly input to the foreign language fluency grading model generated through the training to output a grading result.
When the above end-to-end foreign language fluency grading method trains the filter coefficient, it may use a number of [(non-native speech signal), (fluency grading score by the human rater)] data pairs.
The CNN may include a convolution multilayer. The convolution multilayer may include a first convolution layer which may perform a convolution operation based on local filtering on a non-native raw speech signal input thereto to provide a result of the convolution operation to an n-th (where n is a natural number equal to or more than two) convolution layer subsequent thereto.
The CNN may further include a plurality of fully connected layers for additionally training a result obtained from the convolution multilayer.
The grading of the foreign language fluency may be based on a silence section and an envelope included in the non-native speech signal newly input.
The convolution multilayer may include first to n-th convolution layers, and as n increases, a filter size is reduced.
In another general aspect, a convolution neural network (CNN) for grading foreign language fluency on an end-to-end basis includes: a first unit receiving a non-native raw speech signal and training a filter coefficient of the CNN based on a fluency grading score calculated by a human rater for the raw speech signal so as to generate a foreign language fluency grading model; and a second unit grading foreign language fluency for a non-native speech signal newly input to the generated foreign language fluency grading model to output a grading result.
A number of [(non-native speech signal), (fluency grading score by the human rater)] data pairs may be used for training the foreign language fluency grading model.
The second unit may include a convolution multilayer. The convolution multilayer may include a first convolution layer, which may perform a convolution operation based on local filtering on a non-native raw speech signal input thereto to provide a result of the convolution operation to an n-th (where n is a natural number equal to or more than two) convolution layer subsequent thereto.
The convolution multilayer may include first to n-th convolution layers, and as n increases, a filter size is reduced.
The second unit may further include a plurality of fully connected layers for additionally training a result obtained from the convolution multilayer.
The grading performed by the second unit may be based on a silence section and an envelope included in the non-native speech signal.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
In order to solve the problems of conventional foreign language fluency grading technology, the present invention proposes an end-to-end foreign language fluency grading system. The system inputs a raw signal, corresponding to speech pronounced by a non-native speaker, to a convolution neural network (CNN), and trains the CNN to reproduce the score (grade) given by a human rater or grader, thereby building a foreign language fluency grading model. The system then grades foreign language fluency by using the built model, directly and automatically outputting an optimal grading score without performing the related art feature vector extraction process.
A concept of the present invention uses a CNN as in
The CNN predicts a fluency grading score of an input speech signal through a training process. In order to train the CNN, a number of [(speech signal), (fluency grading score by a human rater)] pairs are needed. Here, the fluency grading score made by the human rater is pronunciation score data obtained when a human rater actually listens to and grades the speech. That is to say, training the CNN means training its filter coefficients so that, for an input signal, the CNN outputs the fluency grading score that a human rater gave to that signal.
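As an illustration of the training data described above, the following minimal sketch builds [(speech signal), (score)] pairs from one rated utterance. The names `TrainingPair` and `make_pairs` are hypothetical. The 32,000-sample segment length follows the 2-second input of the embodiment, and giving every segment the score of the whole utterance is an assumption that the text does not state.

```python
from dataclasses import dataclass
from typing import List, Sequence

# 32,000 samples per 2-second segment, as in the described embodiment
# (this implies a 16 kHz sampling rate).
SEGMENT_SAMPLES = 32_000


@dataclass
class TrainingPair:
    signal: List[float]  # raw time-domain samples of one segment
    score: float         # fluency grading score given by the human rater


def make_pairs(waveform: Sequence[float], score: float,
               segment: int = SEGMENT_SAMPLES) -> List[TrainingPair]:
    """Cut one rated utterance into non-overlapping fixed-length segments.

    Assumption: each segment inherits the utterance-level score.
    A trailing partial segment is dropped.
    """
    pairs = []
    for start in range(0, len(waveform) - segment + 1, segment):
        pairs.append(TrainingPair(list(waveform[start:start + segment]), score))
    return pairs


# A 70,000-sample utterance rated 3.5 yields two full segments.
pairs = make_pairs([0.0] * 70_000, 3.5)
print(len(pairs), len(pairs[0].signal), pairs[0].score)
```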
First, an input “xi” 51 denotes a raw time-domain signal, each segment of which is 32,000 samples long, corresponding to 2 seconds. “yi” 57 denotes the fluency grading score assigned to the input “xi” 51 by a grading expert. “Conv-1” 52 is a first convolution layer and is configured with 32 filters. Each of the filters outputs a convolution result for 320 input samples and slides in units of 160 samples. “Conv-2” 53 is a second convolution layer and performs a convolution operation on an output of the conv-1 52 by using 32 filters to output a result of the convolution. In this case, the filter size, that is, the convolution size, corresponds to 50 samples, and sliding is performed in units of 10 samples. “Conv-3” 54 is a third convolution layer and performs a convolution operation on the result obtained by the conv-2 53. In this case, the filter size corresponds to 20 samples, and sliding is performed in units of one sample.
“fc-1” 55 and “fc-2” 56 are each a fully connected layer. The activation function for the fully connected layer 55 is ‘softmax’, and the activation function for the fully connected layer 56 is ‘linear’. The output of the fully connected layers is configured to correspond to the grade given by a human rater. When features obtained through the convolution layers are additionally trained through the fully connected layers, stronger signal characteristics can be captured, and recognition that is robust to topology changes can thus be obtained.
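The layer sizes above can be checked with a short calculation. The following sketch traces the number of output frames per layer for one 32,000-sample input. The text does not state a padding scheme, so both common conventions are shown as assumptions. It is also assumed that the filter sizes and strides of conv-2 and conv-3 are counted in frames of the preceding layer's output.

```python
import math

# (layer name, filter size, stride) as given in the description
LAYERS = [("conv-1", 320, 160), ("conv-2", 50, 10), ("conv-3", 20, 1)]


def out_len(n: int, k: int, s: int, padding: str) -> int:
    """Number of output frames of a 1-D convolution over n input frames."""
    if padding == "same":   # input is zero-padded so that only the stride shrinks it
        return math.ceil(n / s)
    if padding == "valid":  # no padding: the filter must fit entirely inside the input
        return (n - k) // s + 1 if n >= k else 0
    raise ValueError(f"unknown padding: {padding}")


def trace(n: int, padding: str):
    """Return [(layer name, output length)] for an input of n samples."""
    sizes = []
    for name, k, s in LAYERS:
        n = out_len(n, k, s, padding)
        sizes.append((name, n))
    return sizes


print(trace(32_000, "same"))   # [('conv-1', 200), ('conv-2', 20), ('conv-3', 20)]
print(trace(32_000, "valid"))  # [('conv-1', 199), ('conv-2', 15), ('conv-3', 0)]
```

Under these assumptions, only the padded variant leaves frames for conv-3 to operate on, since without padding conv-2 produces 15 frames, fewer than the 20-frame filter of conv-3. This suggests, but does not prove, that the embodiment pads its convolutions.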
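The two fully connected layers can be sketched as follows, reading the text as fc-1 using a softmax activation and fc-2 using a linear activation that produces a single score. The function names, layer widths, and weight values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import List


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer without activation: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def fc_head(features, w1, b1, w2, b2) -> float:
    h = softmax(dense(features, w1, b1))  # fc-1 with softmax activation
    return dense(h, w2, b2)[0]            # fc-2 with linear activation, one score


# Toy weights: fc-1 is all zeros, so its softmax output is uniform (1/3 each);
# fc-2 then averages its weights 1, 2 and 3.
score = fc_head([1.0, 2.0],
                w1=[[0.0, 0.0]] * 3, b1=[0.0] * 3,
                w2=[[1.0, 2.0, 3.0]], b2=[0.0])
print(score)
```

Because the softmax output sums to 1, the linear layer in this sketch returns a weighted average of its weights plus the bias. If those weights happened to be grade levels, the output could be read as an expected grade, although the text does not say that this interpretation is intended.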
Therefore, the CNN is trained so that its output value is the grade that a human rater gave to a raw signal generated by a non-native speaker. As described above, the coefficients constituting the CNN are trained through a forward-path process and a backward-path process. In the forward path, a fluency grading score predicted by the CNN for an input speech signal is output; in the backward path, the filter coefficients are trained by backward propagating the difference between the predicted fluency grading score and the grading score given by the human rater.
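The forward-path and backward-path processes can be illustrated with a deliberately tiny model: a single strided 1-D filter followed by mean pooling and a bias. This is a sketch under stated assumptions, not the network of the embodiment. The squared-error loss, the plain gradient-descent update, and every name and hyperparameter below are hypothetical, since the text does not specify a loss function or optimizer.

```python
from typing import List, Sequence, Tuple


def forward(x: Sequence[float], w: List[float], b: float, stride: int) -> float:
    """Forward path: strided convolution, mean over positions, plus bias."""
    k = len(w)
    outs = [sum(w[j] * x[p + j] for j in range(k))
            for p in range(0, len(x) - k + 1, stride)]
    return sum(outs) / len(outs) + b


def backward(x: Sequence[float], w: List[float], error: float,
             stride: int) -> Tuple[List[float], float]:
    """Backward path: gradients of 0.5 * error**2 w.r.t. the filter and the bias."""
    k = len(w)
    positions = list(range(0, len(x) - k + 1, stride))
    grad_w = [error * sum(x[p + j] for p in positions) / len(positions)
              for j in range(k)]
    return grad_w, error


def train(pairs, k: int = 4, stride: int = 2, lr: float = 0.05, epochs: int = 300):
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        for x, y in pairs:
            error = forward(x, w, b, stride) - y      # predicted minus human score
            grad_w, grad_b = backward(x, w, error, stride)
            w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


# Two toy "signals" with human scores 4.0 and 1.0.
toy_pairs = [([1.0] * 8, 4.0), ([0.0] * 8, 1.0)]
w, b = train(toy_pairs)
print(round(forward(toy_pairs[0][0], w, b, 2), 2),
      round(forward(toy_pairs[1][0], w, b, 2), 2))
```

After training, the two predictions should lie close to the human scores, which is the sense in which the filter coefficients are fitted to the rater's grades.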
To this end, in an embodiment of the present invention, a model having a CNN structure illustrated in
As described above, training in the end-to-end fluency grading according to an embodiment of the present invention means automatically finding which filters are suited to producing an accurate fluency grading result. The following explains that the filters of a CNN found through the end-to-end training are relevant for fluency grading.
As described above, according to the embodiments of the present invention, the related art step-based foreign language grading processes, which are complicated and independent of one another, may be performed as one integrated process by using the CNN, thereby solving the problems of the related art and considerably improving grading performance.
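The summary states that grading may be based on a silence section and an envelope of the speech signal. For intuition only, the sketch below computes hand-crafted versions of these two cues. It does not reproduce the learned CNN filters. The frame length and hop mirror the 320-sample filter and 160-sample stride of conv-1 as an assumption, and the function names and the silence threshold are hypothetical.

```python
from typing import List, Sequence


def frame_envelope(x: Sequence[float], frame: int = 320, hop: int = 160) -> List[float]:
    """Mean absolute amplitude of each frame: a coarse amplitude envelope."""
    return [sum(abs(v) for v in x[s:s + frame]) / frame
            for s in range(0, len(x) - frame + 1, hop)]


def silence_ratio(envelope: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose envelope falls below the threshold."""
    if not envelope:
        return 0.0
    return sum(1 for e in envelope if e < threshold) / len(envelope)


# Toy signal: 1,600 samples of constant amplitude followed by 1,600 silent samples.
signal = [0.5] * 1600 + [0.0] * 1600
env = frame_envelope(signal)
ratio = silence_ratio(env, threshold=0.01)
print(len(env), ratio)
```

A speaker who pauses often would show a higher silence ratio in such a measure, which is one plausible reason these cues relate to fluency.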
A number of exemplary embodiments have been described above. Nevertheless, it will be understood that various modifications may be made. For example, suitable results may be achieved if the described techniques are performed in a different order and/or if components in a described system, architecture, device, or circuit are combined in a different manner and/or replaced or supplemented by other components or their equivalents. Accordingly, other implementations are within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2017-0034849 | Mar 2017 | KR | national |