It is often important to measure the quality of input data for a variety of reasons. For example, it can be beneficial to determine the quality of certain sound acquired in a system so that feedback can be provided to the source of that sound. That feedback can enable improvement of the sound at the source, supporting better communication of information in the future. Traditionally, such physical data extraction and analysis has utilized time-aggregated features (e.g., mean length of silence periods) to characterize the quality of the input data. Such systems fail to take advantage of contextual information that can be acquired by looking at data, not only as a whole, but at individual segments within the data, in view of what has happened before and after those individual segments.
Systems and methods are provided for a processor-implemented method of analyzing quality of sound acquired via a microphone. An input metric is extracted from a sound recording at each of a plurality of time intervals. The input metric is provided at each of the time intervals to a neural network that includes a memory component, where the neural network provides an output metric at each of the time intervals, where the output metric at a particular time interval is based on the input metric at a plurality of time intervals other than the particular time interval using the memory component of the neural network. The output metric is aggregated from each of the time intervals to generate a score indicative of the quality of the sound acquired via the microphone.
As another example, a processor-implemented system for analyzing quality of sound acquired via a microphone includes a processing system comprising one or more data processors and a non-transitory computer-readable medium encoded with instructions for commanding the processing system to execute steps of a method. In the method, an input metric is extracted from a sound recording at each of a plurality of time intervals. The input metric is provided at each of the time intervals to a neural network that includes a memory component, where the neural network provides an output metric at each of the time intervals, where the output metric at a particular time interval is based on the input metric at a plurality of time intervals other than the particular time interval using the memory component of the neural network. The output metric is aggregated from each of the time intervals to generate a score indicative of the quality of the sound acquired via the microphone.
As a further example, a non-transitory computer-readable medium is encoded with instructions for commanding one or more data processors to execute steps of a method of analyzing quality of sound acquired via a microphone. In the steps, an input metric is extracted from a sound recording at each of a plurality of time intervals. The input metric is provided at each of the time intervals to a neural network that includes a memory component, where the neural network provides an output metric at each of the time intervals, where the output metric at a particular time interval is based on the input metric at a plurality of time intervals other than the particular time interval using the memory component of the neural network. The output metric is aggregated from each of the time intervals to generate a score indicative of the quality of the sound acquired via the microphone.
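For illustration only, the following minimal Python sketch traces the claimed flow end to end: a per-interval input metric is extracted from a recording, a network whose recurrent hidden state acts as the memory component emits an output metric at each time interval, and those outputs are aggregated into a single score. The toy features, the untrained network, and all names and dimensions are assumptions made for the sketch, not details taken from this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_input_metrics(recording, frame=160):
    """Hypothetical front end: one small feature vector per 10 ms interval
    (at 16 kHz). A real system would extract prosodic and cepstral features."""
    frames = recording[: len(recording) // frame * frame].reshape(-1, frame)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-8)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([log_energy, zcr], axis=1)            # (T, 2)

class TinyRecurrentScorer:
    """Untrained stand-in for the memory-equipped neural network."""
    def __init__(self, d_in=2, d_h=8):
        self.W_xh = rng.normal(scale=0.3, size=(d_in, d_h))
        self.W_hh = rng.normal(scale=0.3, size=(d_h, d_h))
        self.W_hy = rng.normal(scale=0.3, size=(d_h, 1))

    def __call__(self, x):
        h = np.zeros(self.W_hh.shape[0])                  # the memory component:
        outputs = []                                      # h carries context across
        for x_t in x:                                     # time intervals
            h = np.tanh(x_t @ self.W_xh + h @ self.W_hh)
            outputs.append(h @ self.W_hy)
        return np.concatenate(outputs)                    # one output metric per interval

audio = rng.normal(size=16000)                            # one second of fake audio
per_interval = TinyRecurrentScorer()(extract_input_metrics(audio))
score = float(np.mean(per_interval))                      # aggregate into a quality score
```

A bidirectional network, as described below, would additionally condition each interval's output on future intervals rather than on past intervals alone.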
A sound quality determining neural network system can be implemented in a variety of contexts. For example, such a system can be utilized in a system configured to automatically (e.g., without any human input on speech quality) analyze the quality of spontaneous speech (e.g., non-native spontaneous speech spoken as part of a learning exercise or evaluation). Receptive language skills, i.e., reading and listening, are typically assessed using a multiple-choice paradigm, while productive skills, i.e., writing and speaking, are usually assessed by eliciting constructed responses from the test taker. Constructed responses are written or spoken samples, such as essays or spoken utterances, produced in response to prompt and stimulus materials in a language test. Due to the complexity of constructed responses, scoring has traditionally been performed by trained human raters, who follow a rubric that describes the characteristics of responses for each score point. However, there are a number of disadvantages associated with human scoring, including time and cost, scheduling issues for large-scale assessments, rater consistency, rater bias, central tendency, etc.
Automated scoring provides a computerized system that mimics human scoring, but in the context of a computer system that inherently operates much differently from the human brain, which makes such evaluations effortlessly. The processes described herein approach automated scoring problems in a significantly different manner than a human would evaluate the same problem, even though the starting and ending points are sometimes the same. The systems and methods described herein are directed to a problem that arises uniquely in the computer realm, where a system is sought that can mimic the behavior of a human scorer using a computer-processing system that functions much differently than a human brain.
Many state-of-the-art automated speech scoring systems leverage an automatic speech recognition (ASR) front-end system that provides word hypotheses about what the test taker said in the response. Training such a system requires a large corpus of non-native speech as well as manual transcriptions thereof. The outputs of this ASR front-end are then used to design further features (lexical, prosodic, semantic, etc.) specifically for automatic speech assessment, which are then fed into a machine-learning-based scoring model. Certain embodiments herein reduce or eliminate the need for one or more of these actions.
In one embodiment, a Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM) is used to combine different features for scoring spoken constructed responses. The use of BLSTMs enables capture of information regarding the spatiotemporal structure of the input spoken response time series. In addition, by using a bidirectional optimization process, both past and future context are integrated into the model. Further, by combining higher-level abstractions obtained from the BLSTM model with time-aggregated response-level features, a system provides an automated scoring system that can utilize both time-sequence and time-aggregated information from speech.
For example, a system can combine fine-grained, time-aggregated features at the level of the entire response that capture pronunciation, grammar, etc. (e.g., features that a system like the SpeechRater system can produce) with time-sequence features that capture frame-by-frame information regarding prosody, phoneme content, and speaker voice quality of the input speech. An example system uses a BLSTM with either a multilayer perceptron (MLP) or a linear regression (LR) based output layer to jointly optimize the automated scoring model.
As noted above, a system can provide a quality score based in part on time aggregated features. In one example, SpeechRater extracts a range of features related to several aspects of the speaking construct. These include pronunciation, fluency, intonation, rhythm, vocabulary use, and grammar. A selection of 91 of these features was used to score spontaneous speech.
In addition to the time-aggregated features discussed above, one or more time-sequence features are generated that utilize one or more neural networks having memory capabilities. The time-aggregated features computed from the input spoken response take into account delivery, prosody, lexical, and grammatical information. Among these, features such as the number of silences capture aggregated information over time. However, some pauses might be more salient than others for purposes of scoring; for instance, silent pauses that occur at clause boundaries in particular are highly correlated with language proficiency grading. In addition, time-aggregated features do not fully consider the evolution of the response over time. Thus, systems and methods described herein utilize time-sequence features that capture the evolution of information over time and use machine learning methods to discover structural patterns in this information stream. In one example, a system extracts six prosodic features: “Loudness,” “F0,” “Voicing,” “Jitter Local,” “Jitter DDP,” and “Shimmer Local.” “Loudness” captures the loudness of speech, i.e., the normalized intensity. “F0” is the smoothed fundamental frequency contour. “Voicing” stands for the voicing probability of the final fundamental frequency candidate, which captures the breathy level of the speech. “Jitter Local” and “Jitter DDP” are measures of the frame-to-frame jitter, which is defined as the deviation in pitch period length, and the differential frame-to-frame jitter, respectively. “Shimmer Local” is the frame-to-frame shimmer, which is defined as the amplitude deviation between pitch periods.
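As a concrete illustration, the sketch below computes a subset of such per-frame contours (loudness approximated as RMS intensity, F0, and voicing probability) with the librosa library; jitter and shimmer are omitted because they are typically obtained from a dedicated extractor such as openSMILE. The frame parameters and the bundled example clip are assumptions made for the sketch.

```python
import numpy as np
import librosa

# Hedged sketch of per-frame prosodic time-sequence features. RMS energy
# stands in for the normalized-intensity loudness measure described above.
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)   # stand-in for a spoken response

hop = int(0.010 * sr)                                   # 10 ms frame shift
frame = int(0.025 * sr)                                 # 25 ms frame size

loudness = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
f0, voiced_flag, voicing_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, frame_length=4 * frame, hop_length=hop)
f0 = np.nan_to_num(f0)                                  # zero-fill unvoiced frames

T = min(len(loudness), len(f0))                         # align the contours
prosodic_seq = np.stack([loudness[:T], f0[:T], voicing_prob[:T]], axis=1)  # (T, 3)
```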
Apart from prosodic features, in certain examples a group of Mel-Frequency Cepstrum Coefficients (MFCCs) is extracted from 26 filter-bank channels. MFCCs capture an overall timbre parameter that measures both what is said (phones) and the specifics of the speaker's voice quality, providing speech information beyond the prosodic features described above. MFCCs are computed, in one example, using a frame size of 25 ms and a frame shift size of 10 ms, based on the configuration file parameters. MFCC features can be useful in phoneme classification, speech recognition, and higher-level multimodal social signal processing tasks.
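Under those stated parameters, a minimal extraction might look as follows; the input file name and the choice to retain 13 coefficients are assumptions, since the disclosure fixes only the channel count and frame timing.

```python
import librosa

y, sr = librosa.load("response.wav", sr=16000)  # hypothetical recorded response
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                      # retained coefficients (assumed)
    n_mels=26,                      # 26 filter-bank channels
    n_fft=int(0.025 * sr),          # 25 ms frame size
    hop_length=int(0.010 * sr),     # 10 ms frame shift
).T                                 # (T, 13): one MFCC vector per frame
```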
An LSTM architecture can include a set of recurrently connected subnets, known as memory blocks. Each block contains one or more self-connected memory cells and three multiplicative units (the input, output, and forget gates) that provide continuous analogues of write, read, and reset operations for the cells. An LSTM network is formed, in one example, like a simple RNN, except that the nonlinear units in the hidden layers are replaced by memory blocks.
The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby avoiding the vanishing gradient problem. For example, as long as the input gate remains closed (i.e. has an activation close to 0), the activation of the cell will not be overwritten by the new inputs arriving in the network, and can therefore be made available to the net much later in the sequence, by opening the output gate.
Given an input sequence $x = (x_1, \ldots, x_T)$, a standard recurrent neural network (RNN) computes the hidden vector sequence $h = (h_1, \ldots, h_T)$ and the output vector sequence $y = (y_1, \ldots, y_T)$ by iterating the following equations from $t = 1$ to $T$:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$
where the $W$ terms denote weight matrices (e.g., $W_{xh}$ is the input-hidden weight matrix), the $b$ terms denote bias vectors (e.g., $b_h$ is the hidden bias vector), and $\mathcal{H}$ is the hidden layer function. $\mathcal{H}$ is usually an element-wise application of a sigmoid function. In some embodiments, the LSTM architecture, which uses custom-built memory cells to store information, is better at finding and exploiting long-range context.
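Before turning to the LSTM form of $\mathcal{H}$, the two RNN equations above translate directly into code. The following sketch uses a sigmoid for $\mathcal{H}$ and purely illustrative dimensions and weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    """Iterate h_t = H(W_xh x_t + W_hh h_{t-1} + b_h) and y_t = W_hy h_t + b_y."""
    h_t = np.zeros(W_hh.shape[0])
    hs, ys = [], []
    for x_t in x:                                  # t = 1 .. T
        h_t = sigmoid(W_xh @ x_t + W_hh @ h_t + b_h)
        hs.append(h_t)
        ys.append(W_hy @ h_t + b_y)
    return np.array(hs), np.array(ys)

T, D, H = 5, 3, 4                                  # illustrative sizes
rng = np.random.default_rng(0)
h, y = rnn_forward(rng.normal(size=(T, D)),
                   rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                   rng.normal(size=(1, H)), np.zeros(H), np.zeros(1))
```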
In one embodiment, $\mathcal{H}$ is implemented as follows.
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$
where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$, and $c$ are respectively the input gate, forget gate, output gate, and cell activation vectors, all of which are the same size as the hidden vector $h$. $W_{hi}$ is the hidden-input gate matrix and $W_{xo}$ is the input-output gate matrix. The weight matrices from the cell to the gate vectors (e.g., $W_{ci}$) are diagonal, so element $m$ in each gate vector only receives input from element $m$ of the cell vector.
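One time step of this cell can be written down directly from the equations. In the sketch below, the cell-to-gate weights are stored as vectors applied element-wise, which is exactly the stated diagonal constraint; parameter names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the peephole LSTM defined above; p holds the parameters.
    The element-wise products w_c* * c implement the diagonal cell-to-gate
    matrices W_ci, W_cf, W_co."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    return o * np.tanh(c), c                       # h_t = o_t tanh(c_t), and c_t

H, D = 4, 3                                        # illustrative sizes
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.3,
                   size=(H, D) if k.startswith("W_x")
                   else (H, H) if k.startswith("W_h") else H)
     for k in ["W_xi", "W_hi", "w_ci", "b_i", "W_xf", "W_hf", "w_cf", "b_f",
               "W_xc", "W_hc", "b_c", "W_xo", "W_ho", "w_co", "b_o"]}
h_t, c_t = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), p)
```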
Bidirectional RNNs (BRNNs) utilize context by processing the data in both directions with two separate hidden layers, which are then fed forward to the same output layer. A BRNN computes the forward hidden sequence $\overrightarrow{h}$, the backward hidden sequence $\overleftarrow{h}$, and the output sequence $y$ by iterating the backward layer from $t = T$ to $1$, the forward layer from $t = 1$ to $T$, and then updating the output layer:
$$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$$

$$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$$

$$y_t = W_{\overrightarrow{h} y} \overrightarrow{h}_t + W_{\overleftarrow{h} y} \overleftarrow{h}_t + b_y$$
Combining BRNNs with LSTM gives the bidirectional LSTM (BLSTM), which can access long-range context in both input directions. In automatic grading, where whole responses are collected at once, future context and history context can be utilized together.
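A hedged PyTorch illustration of this bidirectional arrangement is shown below: two hidden layers process the same input sequence in opposite directions, and the output layer sees both at every time step. The model is untrained and all sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

T, D, H = 50, 6, 32                       # frames, feature dim, hidden dim (arbitrary)
x = torch.randn(1, T, D)                  # one spoken response as a feature sequence

blstm = nn.LSTM(input_size=D, hidden_size=H, batch_first=True, bidirectional=True)
out, _ = blstm(x)                         # (1, T, 2H): forward and backward states
y = nn.Linear(2 * H, 1)(out)              # output layer combines both directions at each t
```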
In one embodiment, two neural network architectures are used to generate a sound quality score: the multilayer perceptron (MLP) and the bidirectional long short-term memory recurrent neural network (BLSTM). A BLSTM is used to learn a high-level abstraction of the time-sequence features, and an MLP or LR is used as the output layer to combine the hidden state outputs of the BLSTM with the time-aggregated features. The BLSTM and the MLP/LR are optimized jointly.
With reference to the BLSTM, the input layer dimension of the BLSTM is the dimension of the time-sequence features. The input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. LSTM blocks use the logistic sigmoid for the input and output squashing functions of the cell. The BLSTM can be augmented, in some embodiments, by concatenating the time-aggregated features to the last hidden state output of the LSTM and reverse-LSTM.
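A minimal sketch of this augmented architecture, under assumed dimensions (e.g., 91 time-aggregated features as noted above, and an arbitrary time-sequence feature dimension), might look as follows; replacing the MLP with a single nn.Linear yields the LR variant, and training the whole module against human scores optimizes the BLSTM and the output layer jointly.

```python
import torch
import torch.nn as nn

class BLSTMScorer(nn.Module):
    """Hedged sketch: a BLSTM over time-sequence features, with the final
    forward and backward hidden states concatenated to the response-level
    (time-aggregated) features and passed to an MLP output layer."""
    def __init__(self, d_seq=45, d_agg=91, d_hidden=64, d_mlp=32):
        super().__init__()
        self.blstm = nn.LSTM(d_seq, d_hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(                      # MLP output layer
            nn.Linear(2 * d_hidden + d_agg, d_mlp),
            nn.Sigmoid(),
            nn.Linear(d_mlp, 1),
        )

    def forward(self, x_seq, x_agg):
        _, (h_n, _) = self.blstm(x_seq)                # h_n: (2, B, d_hidden)
        last = torch.cat([h_n[0], h_n[1]], dim=1)      # last LSTM and reverse-LSTM states
        return self.mlp(torch.cat([last, x_agg], dim=1))

model = BLSTMScorer()
scores = model(torch.randn(4, 100, 45), torch.randn(4, 91))  # (B, 1) quality scores
```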
Neural network models, as described herein, can be implemented in a variety of configurations, including: a BLSTM with an MLP output layer; a BLSTM with an LR output layer; a standalone MLP; and a BLSTM with an MLP output layer that utilizes prosodic and MFCC features as a time-sequence feature set and a content feature set as a time-aggregated feature set.
Each of the element managers, real-time data buffer, conveyors, file input processor, database index shared access memory loader, reference data buffer and data managers may include a software application stored in one or more of the disk drives connected to the disk controller 690, the ROM 658 and/or the RAM 659. The processor 654 may access one or more components as required.
A display interface 687 may permit information from the bus 652 to be displayed on a display 680 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 682.
In addition to these computer-type components, the hardware may also include data input devices, such as a keyboard 679, or other input device 681, such as a microphone, remote control, pointer, mouse and/or joystick.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein and may be provided in any suitable language, such as C, C++, or JAVA, for example. Other implementations may also be used, however, such as firmware or even appropriately designed hardware (e.g., ASICs, FPGAs) configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” In addition, use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
While the disclosure has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the embodiments. Thus, it is intended that the present disclosure cover the modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents.
This application claims priority to U.S. Provisional Application No. 62/195,359, filed Jul. 22, 2015, the entirety of which is herein incorporated by reference.
| Number | Name | Date | Kind |
|---|---|---|---|
| 9489864 | Evanini | Nov 2016 | B2 |
| 9514109 | Yoon | Dec 2016 | B2 |
| 20100145698 | Chen | Jun 2010 | A1 |
| 20150248608 | Higgins | Sep 2015 | A1 |
Hönig, Florian, et al.; Automatic Modelling of Depressed Speech: Relevant Features and Relevance of Gender; Proceedings of Interspeech; 2014.
Yu, Zhou, et al.; Using Bidirectional LSTM Recurrent Neural Networks to Learn High-Level Abstractions of Sequential Features for Automated Scoring of Non-Native Spontaneous Speech; Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU); Dec. 2015.
Metallinou, Angeliki, Cheng, Jian; Using Deep Neural Networks to Improve Proficiency Assessment for Children English Language Learners; Proceedings of Interspeech; 2014.
Gonzalez-Dominguez, Javier, et al.; Automatic Language Identification Using Long Short-Term Memory Recurrent Neural Networks; Proceedings of Interspeech; 2014.
Attali, Yigal, Burstein, Jill; Automated Essay Scoring with E-Rater, V.2; Journal of Technology, Learning, and Assessment, 4(3); 2006.
Bernstein, Jared, Cheng, Jian, Suzuki, Masanori; Fluency and Structural Complexity as Predictors of L2 Oral Proficiency; Proceedings of Interspeech; pp. 1241-1244; 2010.
Bishop, Christopher; Pattern Recognition and Machine Learning; Singapore: Springer; 2006.
Chen, Lei, Zechner, Klaus; Applying Rhythm Features to Automatically Assess Non-Native Speech; Proceedings of Interspeech; 2011.
Chen, Lei, Zechner, Klaus, Xi, Xiaoming; Improved Pronunciation Features for Construct-Driven Assessment of Non-Native Spontaneous Speech; Proceedings of the North American Chapter of the ACL, Human Language Technologies; pp. 442-449; 2009.
Chen, Lei, Tetreault, Joel, Xi, Xiaoming; Towards Using Structural Events to Assess Non-Native Speech; Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications; pp. 74-79; 2010.
Chen, Lei, Yoon, Su-Youn; Application of Structural Events Detected on ASR Outputs for Automated Speaking Assessment; Proceedings of Interspeech; 2012.
Cucchiarini, Catia, Strik, Helmer, Boves, Lou; Quantitative Assessment of Second Language Learners' Fluency: Comparisons Between Read and Spontaneous Speech; Journal of the Acoustical Society of America, 111(6); pp. 2862-2873; 2002.
Eyben, Florian, Wollmer, Martin, Schuller, Bjorn; openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor; Proceedings of ACM Multimedia, 10; pp. 1459-1462; 2010.
Fan, Yuchen, Qian, Yao, Xie, Fenglong, Soong, Frank; TTS Synthesis with Bidirectional LSTM Based Recurrent Neural Networks; Proceedings of Interspeech; pp. 1964-1968; Sep. 2014.
Franco, Horacio, Bratt, Harry, Rossier, Romain, Gade, Venkata Rao, Shriberg, Elizabeth, Abrash, Victor, Precoda, Kristin; EduSpeak: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications; Language Testing, 27(3); pp. 401-418; 2010.
Graves, Alex, Jaitly, Navdeep, Mohamed, Abdel-rahman; Hybrid Speech Recognition with Deep Bidirectional LSTM; Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU); pp. 273-278; 2013.
Graves, Alex; Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence, vol. 385; Springer-Verlag; 2012.
Graves, Alex, Schmidhuber, Jurgen; Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures; Neural Networks, 18(5); pp. 602-610; 2005.
Higgins, Derrick, Xi, Xiaoming, Zechner, Klaus, Williamson, David; A Three-Stage Approach to the Automated Scoring of Spontaneous Spoken Responses; Computer Speech and Language, 25; pp. 282-306; 2011.
Hochreiter, Sepp, Schmidhuber, Jurgen; Long Short-Term Memory; Neural Computation, 9(8); pp. 1735-1780; 1997.
Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey; ImageNet Classification with Deep Convolutional Neural Networks; Proceedings of the Advances in Neural Information Processing Systems; pp. 1097-1105; 2012.
Landauer, Thomas, Laham, Darrell, Foltz, Peter; Automated Scoring and Annotation of Essays with the Intelligent Essay Assessor; Ch. 6, In Automated Essay Scoring: A Cross-Disciplinary Perspective, M. Shermis and J. Burstein (Eds.); pp. 87-112; 2001.
Loukina, Anastassia, Zechner, Klaus, Chen, Lei, Heilman, Michael; Feature Selection for Automated Speech Scoring; Proceedings of the 10th Workshop on Innovative Use of NLP for Building Educational Applications; pp. 12-19; Jun. 2015.
Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, Ng, Andrew; Multimodal Deep Learning; Proceedings of the 28th International Conference on Machine Learning; pp. 689-696; 2011.
Pedregosa, Fabian, Varoquaux, Gael, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier, Blondel, Mathieu, Prettenhofer, Peter, Weiss, Ron, Dubourg, Vincent, Vanderplas, Jake, Passos, Alexandre, Cournapeau, David, Brucher, Matthieu, Perrot, Matthieu, Duchesnay, Edouard; Scikit-learn: Machine Learning in Python; Journal of Machine Learning Research, 12; pp. 2825-2830; 2011.
Rumelhart, David, Hinton, Geoffrey, Williams, Ronald; Learning Internal Representations by Error Propagation; Institute for Cognitive Science, DTIC; Sep. 1985.
Schuster, Mike, Paliwal, Kuldip; Bidirectional Recurrent Neural Networks; IEEE Transactions on Signal Processing, 45(11); pp. 2673-2681; Nov. 1997.
Smola, Alex, Scholkopf, Bernhard; A Tutorial on Support Vector Regression; Statistics and Computing, 14(3); pp. 199-222; 2004.
Wang, Xinhao, Evanini, Keelan, Zechner, Klaus; Coherence Modeling for the Automated Assessment of Spontaneous Spoken Responses; Proceedings of NAACL-HLT; pp. 814-819; Jun. 2013.
Wang, Zhen, Von Davier, Alina; Monitoring of Scoring Using the E-Rater Automated Scoring System and Human Raters on a Writing Test; Educational Testing Service, Research Report RR-14-04; Jun. 2014.
Yu, Zhou, Gerritsen, David, Ogan, Amy, Black, Alan, Cassell, Justine; Automatic Prediction of Friendship via Multi-Modal Dyadic Features; Proceedings of the SIGDIAL 2013 Conference; pp. 51-60; Aug. 2013.
Zechner, Klaus, Higgins, Derrick, Xi, Xiaoming, Williamson, David; Automatic Scoring of Non-Native Spontaneous Speech in Tests of Spoken English; Speech Communication, 51(10); pp. 883-895; 2009.
| Number | Date | Country |
|---|---|---|
| 62195359 | Jul 2015 | US |