The disclosed embodiments relate generally to data processing systems and more particularly, but not exclusively, to data processing systems and methods suitable for training and utilizing multilingual neural network systems that are designed to evaluate the quality of translations generated by machine translation systems, sometimes referenced herein as multilingual machine translation evaluation models.
Historically, metrics for evaluating the quality of machine translation (or MT) have relied on assessing the similarity between an MT-generated translation hypothesis and a human-generated reference translation in the target language. Traditional metrics have largely focused on basic, lexical-level features such as counting the number of matching words and sequences of words (or n-grams) between the MT hypothesis and the reference translation. Metrics such as Bilingual Evaluation Understudy (or BLEU), as described in “BLEU: a Method for Automatic Evaluation of Machine Translation,” by Kishore Papineni et al., 2002, and METEOR, as described in “The METEOR metric for automatic evaluation of machine translation,” by Alon Lavie et al., 2009, remain popular as a means of evaluating MT systems due to their lightweight and fast computation.
Modern neural approaches to MT produce translations of much higher quality than earlier technology; such translations often deviate from monotonic lexical transfer between languages and are much more expressive than can be captured and reflected in a single reference translation. For this reason, it has become increasingly evident that metrics such as BLEU are no longer able to provide an accurate estimate of the quality of current state-of-the-art MT systems.
While an increased research interest in neural methods for training MT models and systems has resulted in a recent, dramatic improvement in MT quality, MT evaluation has lagged behind. The MT research community still largely relies on outdated metrics and no new, widely-adopted standard has emerged. For example, in 2019, the WMT News Translation Shared Task, a recognized annual benchmark evaluation of MT technology, received a total of 153 MT system submissions as described in “Findings of the 2019 Conference on Machine Translation (WMT19),” by Loïc Barrault et al., 2019. The Metrics Shared Task of the same year, a track for benchmarking MT evaluation metrics, saw only twenty-four submissions, almost half of which were entrants to the Quality Estimation Shared Task, adapted to serve as metrics as described in “Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges,” by Qingsong Ma et al., 2019.
The findings of the above-mentioned task highlighted two major challenges that existing MT evaluation metrics have been largely unable to address: current metrics struggle to accurately correlate with human quality scores at the segment level, and they fail to correctly rank the highest-performing MT systems.
Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human-reference translations. Metrics such as BLEU, METEOR, and chrF as described in “CHRF: character n-gram F-Score for automatic MT evaluation,” by Maja Popović, 2015, have been widely studied and improved (“Moses: Open Source Toolkit for Statistical Machine Translation,” Philipp Koehn et al., 2007; “CHRF++: words helping character n-grams,” by Maja Popović, 2017; “Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems,” Michael Denkowski et al., 2011; “Meteor++ 2.0: Adopt Syntactic Level Paraphrase Knowledge into Machine Translation Evaluation,” by Yinuo Guo et al., 2019), but, due to their lexical nature, they usually fail to recognize and capture semantic similarity and translation nuances beyond the lexical level.
In recent years, word embeddings (“Distributed Representations of Words and Phrases and their Compositionality,” Tomas Mikolov et al., 2013; “GloVe: Global Vectors for Word Representation,” Jeffrey Pennington et al., 2014; “Deep contextualized word representations,” Matthew E. Peters et al., 2018; “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Jacob Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word and segment-level semantic similarity. More recent embedding-based metrics like YiSi-1 (“YiSi—A Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources,” Chi-kiu Lo, 2019), MoverScore (“MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance,” Wei Zhao et al., 2019) and BERTScore (“BERTScore: Evaluating Text Generation with BERT,” Tianyi Zhang et al., 2020) create soft-alignments between reference and hypothesis in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human quality scores such as Direct Assessment (or DA) (“Continuous Measurement Scales in Human Evaluation of Machine Translation,” Yvette Graham et al., 2013) and Multidimensional Quality Metrics (or MQM) (“Multidimensional Quality Metrics (MQM): A Framework for Declaring and Describing Translation Quality Metrics,” Arle Lommel et al., 2014), capture much more than just semantic similarity, thus limiting the ability of the scores generated by such metrics to correlate well with these forms of human quality scores.
Learnable metrics (“RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation,” Hiroki Shimanaka et al., 2018; “Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation,” Nitika Mathur et al., 2019) attempt to learn parameters that directly optimize the correlation with human quality scores, and have recently shown promising results. BLEURT (“BLEURT: Learning Robust Metrics for Text Generation,” Thibault Sellam et al., 2020), a recent learnable metric based on BERT (“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Jacob Devlin et al., 2019), has exhibited state-of-the-art performance on data from the last three years of the WMT Metrics Shared Task. However, all previously proposed learnable metrics have focused on optimizing their parameters to Direct Assessment (DA) data which, due to a scarcity of annotators, can be inherently noisy as described in “Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges,” by Qingsong Ma et al., 2019.
Reference-less MT evaluation, also known as Quality Estimation (or QE), has historically been trained and evaluated on predicting Human-mediated Translation Edit Rate (or HTER) (“A Study of Translation Edit Rate with Targeted Human Annotation,” Snover et al., 2006) in segment-level evaluation settings (“Findings of the 2013 Workshop on Statistical Machine Translation,” Ondřej Bojar et al., 2013; “Findings of the 2014 Workshop on Statistical Machine Translation,” Ondřej Bojar et al., 2014; “Findings of the 2015 Workshop on Statistical Machine Translation,” Ondřej Bojar et al., 2015; “Findings of the 2016 Workshop on Statistical Machine Translation,” Ondřej Bojar et al., 2016; “Findings of the 2017 Workshop on Statistical Machine Translation,” Ondřej Bojar et al., 2017). More recently, MQM has been used for document-level evaluation (“Findings of the WMT 2018 Shared Task on Quality Estimation,” Lucia Specia et al., 2018; “Findings of the WMT 2019 Shared Task on Quality Estimation,” Erick Fonseca et al., 2019). Recent new QE systems, such as “Unbabel's Participation in the WMT19 Translation Quality Estimation Shared Task,” Fabio Kepler et al., 2019, have exhibited dramatically improved correlations with human quality scores by leveraging highly multilingual pretrained encoders such as multilingual BERT (“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Jacob Devlin et al., 2019) and cross-lingual language models such as XLM (“Cross-lingual Language Model Pretraining,” Alexis Conneau et al., 2019). Concurrently, the OpenKiwi framework (“OpenKiwi: An Open Source Framework for Quality Estimation,” Fabio Kepler et al., 2019) has made it easier for researchers to push the field forward and build stronger QE models.
In view of the foregoing, a need exists for an improved system and method for training multilingual machine translation evaluation models that overcomes the aforementioned obstacles and deficiencies of currently-available methods for evaluating the quality of machine translation.
It should be noted that the figures are not drawn to scale and that elements of similar structures or functions may be generally represented by like reference numerals for illustrative purposes throughout the figures. It also should be noted that the figures are only intended to facilitate the description of the preferred embodiments. The figures do not illustrate every aspect of the described embodiments and do not limit the scope of the present disclosure.
Since currently-available methods for evaluating machine translation (MT) quality rely on outdated metrics, lack any widely-adopted standard, struggle to accurately correlate with human quality scores and fail to correctly rank highest performing MT systems, a system and method for training multilingual machine translation evaluation models that can use cross-lingual language modeling and a predictive neural network to generate prediction estimates of human quality scores can prove desirable and provide a basis for a wide range of system applications, such as generation of a statistically-informed prediction of machine translation quality based on one or more examples of prior human action and/or generation of a score for a new machine translation based upon scores assigned by humans to previous translations. This result can be achieved, according to selected embodiments disclosed herein, by an evaluation model training system 100 for training multilingual machine translation evaluation models as illustrated in
In selected embodiments, the evaluation model training system 100 can comprise a framework for training highly multilingual and adaptable machine translation (or MT) evaluation models (not shown) that can function as metrics. The framework, for example, can be implemented using the PyTorch neural software library (“PyTorch: An Imperative Style, High-Performance Deep Learning Library”, by Adam Paszke et al. 2019), primarily developed by Facebook's AI Research Lab. Turning to
In selected embodiments, the evaluation model training system 100 can be configured to receive any suitable number of machine translations 214 of the original source language input text segment 212. Exemplary numbers of machine translations 214 can include one or two machine translations 214, without limitation. The original source language input text segment 212, for example, can comprise a source-language input word, a source-language input sentence and/or a source-language input segment, comprising a plurality of source-language input words or sentences. At least one feature based on the original source language input text segment 212 can be incorporated into the machine translation evaluation models.
In selected embodiments, the encoding system 110 can comprise one or more transformer encoder layers (not shown). An exemplary building block of the MT evaluation models can be a pretrained, cross-lingual encoder model. Exemplary pretrained, cross-lingual encoder models can include multilingual BERT (“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Jacob Devlin et al., 2019), and cross-lingual language models such as XLM (“Cross-lingual Language Model Pretraining,” Alexis Conneau et al., 2019) and/or XLM-RoBERTa (“Unsupervised Cross-lingual Representation Learning at Scale”, Alexis Conneau et al., 2020), without limitation. The pretrained, cross-lingual model can include at least one of the transformer encoder layers. When trained with large amounts of data from multiple languages, these pretrained, cross-lingual models can be highly effective in serving as an encoder model for providing a basis to train other neural models that perform various cross-lingual tasks such as document classification and natural language inference and can generalize well to unseen languages and scripts.
The example analysis presented herein relies on XLM-RoBERTa (base), as described in “Unsupervised Cross-lingual Representation Learning at Scale”, Alexis Conneau et al., 2020, as the encoder model. Given an input sequence x=[x0, x1, . . . , xn], the encoder system 110 can produce an embedding ej(l) for each token xj and each layer l∈{0, 1, . . . , k}. The embedding process can be applied to the original source language input text segment 212, the machine translation 214 and/or the reference translation 216 to map the original source language input text segment 212, the machine translation 214 and/or the reference translation 216 into a shared embedding feature space. The embeddings generated by the last, or any other, layer of the pretrained encoders of the encoder system 110 can be used for fine-tuning the model parameters to support one or more new tasks, including the prediction of MT evaluation scores.
Advantageously, different transformer encoder layers of the encoder system 110 can capture linguistic information that can be relevant for one or more different downstream tasks. In the case of MT evaluation, the different transformer encoder layers can encode different aspects of meaning representation that can be useful as input features for predicting the quality of an MT hypothesis, generalizing and improving upon the utility of leveraging only the last transformer encoder layer. In selected embodiments, the pooling system 120 can pool information from the most important transformer encoder layers into a single embedding for each token, ej, by using a layer-wise attention mechanism. The resultant embedding can be computed as:
ej=μEj⊤α (Equation 1)
where μ is a trainable weight coefficient, Ej=[ej(0), ej(1), . . . , ej(k)] corresponds to a vector of transformer encoder layer embeddings for token xj, and α=softmax([α(1), α(2), . . . , α(k)]) is a vector corresponding to layer-wise trainable weights. To avoid overfitting to the information contained in any single transformer encoder layer, the pooling system 120 can use layer dropout, whereby, with a probability p, the weight α(i) is set to −∞.
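The layer-wise pooling of Equation 1 can be sketched in PyTorch (named above as an exemplary framework) roughly as follows; the module name, tensor layout and default dropout probability are illustrative assumptions rather than a definitive implementation.

```python
import torch
import torch.nn as nn

class LayerwisePooling(nn.Module):
    """Pools per-layer token embeddings into a single embedding per token,
    e_j = mu * E_j^T * alpha (Equation 1), with layer dropout on the attention weights."""

    def __init__(self, num_layers: int, layer_dropout: float = 0.1):
        super().__init__()
        self.mu = nn.Parameter(torch.tensor(1.0))                    # trainable scalar weight
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))   # alpha^(i) before softmax
        self.layer_dropout = layer_dropout

    def forward(self, layer_embeddings: torch.Tensor) -> torch.Tensor:
        # layer_embeddings: (num_layers, batch, seq_len, hidden_dim), one slice per encoder layer
        weights = self.layer_weights.clone()
        if self.training and self.layer_dropout > 0:
            # Layer dropout: with probability p, a layer's weight is set to -inf so that it
            # receives zero attention after the softmax (no guard against dropping all layers).
            drop = torch.rand_like(weights) < self.layer_dropout
            weights = weights.masked_fill(drop, float("-inf"))
        alpha = torch.softmax(weights, dim=0)                        # layer-wise attention weights
        pooled = (alpha.view(-1, 1, 1, 1) * layer_embeddings).sum(dim=0)
        return self.mu * pooled                                      # (batch, seq_len, hidden_dim)
```

A segment-level embedding can then be obtained by average-pooling the resulting token embeddings over the sequence dimension, as described below.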
In selected embodiments, the pooling system 120 can apply average pooling to the resulting word embeddings to derive a sentence and/or segment embedding for each of the inputs: the source-language input 212, the machine translation hypothesis 214, the reference translation 216 and/or other system inputs 210. The pooling system 120 thereby can leverage features extracted from these sentence and/or segment embedded inputs to evaluate the machine translation 214, and provide one or more system outputs 220, such as a machine translation quality score 222, for setting forth at least one evaluation result for the machine translation 214 of the original source language input text segment 212.
The evaluation model training system 100 can utilize a multilingual embedding space to leverage information from the system inputs 210, including the original source language input text segment 212, the machine translation 214 of the original source language input text segment 212 and the reference translation 216. The evaluation model training system 100 thereby can improve the accuracy of translation quality predictions by basing the machine translation quality score 222 assigned to the machine translation 214 on all of the system inputs 210. The machine translation quality score 222 advantageously can demonstrate value added by using the original source language input text segment 212 as an input to machine translation evaluation models.
In selected embodiments, the evaluation model training system 100 advantageously can utilize cross-lingual language modeling and/or a predictive neural network to generate prediction estimates of various human quality scores. Exemplary prediction estimates can include, but are not limited to, Direct Assessments (or DA), Multidimensional Quality Metric (or MQM) and/or Human-mediated Translation Edit Rate (or HTER). Direct Assessments optionally can be converted into pairs of relative rankings (or DARR), for example, when a number of annotations per segment of the original source language input text segment 212 is limited. Stated somewhat differently, for two machine translations 214 of a selected source-language input segment of the original source language input text segment 212, if the Direct Assessment score associated with the first machine translation 214 is higher than the Direct Assessment score associated with the second machine translation 214, the first machine translation 214 can be regarded as being the better translation of the two. Additionally and/or alternatively, if a difference between the first and second Direct Assessment scores is not higher than twenty-five points, the selected source-language input segment can be excluded from the DARR data.
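A minimal sketch of the DA-to-DARR conversion described above, assuming DA scores are grouped per source segment; the data layout and function name are hypothetical.

```python
from itertools import combinations

def build_darr_pairs(da_scores_by_segment, min_gap: float = 25.0):
    """Converts Direct Assessment (DA) scores into relative-ranking (DARR) pairs.
    `da_scores_by_segment` is a hypothetical mapping from a source-segment id to a
    list of (hypothesis, da_score) tuples for that segment."""
    pairs = []
    for segment_id, hypotheses in da_scores_by_segment.items():
        for (h1, s1), (h2, s2) in combinations(hypotheses, 2):
            if abs(s1 - s2) <= min_gap:      # scores too close: drop this comparison
                continue
            better, worse = (h1, h2) if s1 > s2 else (h2, h1)
            pairs.append((segment_id, better, worse))
    return pairs
```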
The evaluation model training system 100 advantageously can evaluate the original source language input text segment 212, the machine translation 214, the reference translation 216 and/or other system inputs 210 and generate the machine translation quality score 222 and/or other system outputs 220 in an effective and/or flexible manner. In selected embodiments, the evaluation model training system 100 can train two or more exemplary machine translation evaluation models for estimating different types of human quality scores. For example, the evaluation model training system 100 can support two or more distinct system architectures to train the exemplary machine translation evaluation models for estimating different types of human quality scores.
Exemplary embedding, combining and outputting operations of the evaluation model training system 100 are illustrated in
The evaluation model training system 100 of
The tokenizer system 105 can provide the tokens 232A-C to a pretrained language model encoder system 114 of the evaluation model training system 100. The pretrained language model encoder system 114 can receive the tokens 232A-C and, based at least in part upon the tokens 232A-C, generate at least one token embedding 234. As illustrated in
The evaluation model training system 100 can further include a vector pooling system 124 for receiving the token embeddings 234 and pooling the received token embeddings 234 into at least one source vector 236.
Turning to
The vector combination system 116 can provide the pooled vector 239 to a neural network regressor system 118 as shown in
A first exemplary system architecture, sometimes referenced herein as being an estimator model architecture, of the evaluation model training system 100 is shown in
In selected embodiments, the original source language input text segment 212, the machine translation 214 and the reference translation 216 can be independently encoded via the pretrained and/or cross-lingual encoder system 112. The resulting word embeddings can be passed through the layered pooling system 122 to an embeddings concatenation system 130. The embeddings concatenation system 130 can create a sentence embedding for each segment. Additionally and/or alternatively, the embeddings concatenation system 130 can combine and concatenate the resulting sentence embeddings into a single vector that is passed to a feed-forward neural network that can serve as a regressor system 140. The entire multilingual machine translation evaluation model thereby can be trained on the collection of available training examples for all language pairs by minimizing a Mean Squared Error (MSE) value 224 between the scores predicted by the model and the human-generated scores associated with the training examples.
For example, the pooling system 120 can provide a d-dimensional sentence embedding for the original source language input text segment 212, the machine translation (or hypothesis) 214 of the original source language input text segment 212 and the reference translation 216 to the embeddings concatenation system 130. The embeddings concatenation system 130 can calculate and/or extract multiple features from these embeddings, including but not limited to, an element-wise product between the embeddings of the machine translation (or hypothesis) 214 and the embedding for the original source language input text segment 212, an element-wise product between the embeddings of the machine translation (or hypothesis) 214 and the embedding for the reference translation 216, an absolute element-wise difference between the hypothesis 214 and the source 212, and/or an absolute element-wise difference between the hypothesis 214 and the reference 216, in accordance with Equations 2-5.
Element-wise source product: h⊙s (Equation 2)
Element-wise reference product: h⊙r (Equation 3)
Absolute element-wise source difference: |h−s| (Equation 4)
Absolute element-wise reference difference: |h−r| (Equation 5)
wherein h represents a hypothesis embedding of the machine translation (or hypothesis) 214, s represents a source embedding of the original source language input text segment 212 and r represents a reference embedding of the reference translation 216.
The embeddings concatenation system 130 can concatenate the element-wise source product h⊙s of Equation 2, the element-wise reference product h⊙r of Equation 3, the absolute element-wise source difference |h−s| of Equation 4 and/or the absolute element-wise reference difference |h−r| of Equation 5 to the reference embedding r and/or the hypothesis embedding h into a single vector x=[h; r; h⊙s; h⊙r; |h−s|; |h−r|], which can be provided as an input to the feed-forward regression system 140. By augmenting the d-dimensional embeddings of the MT hypothesis h and the reference r, the element-wise source product h⊙s, the element-wise reference product h⊙r, the absolute element-wise source difference |h−s| and/or the absolute element-wise reference difference |h−r| advantageously can help to highlight any differences between these embeddings in a semantic feature space.
While cross-lingual pretrained models are trained to cover multiple languages, the feature space between the languages is not well aligned. Accordingly, although the element-wise source product h⊙s and the absolute element-wise difference |h−s| can be useful features for the embeddings concatenation system 130, the raw source embedding s may be omitted as input to the embeddings concatenation system 130 in selected embodiments.
The multilingual machine translation evaluation model is then trained on a collection of MT evaluation training examples to minimize the mean squared error 224 between the predicted scores and human-generated quality scores, such as Direct Assessments, Multidimensional Quality Metric and/or Human-mediated Translation Edit Rate.
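A minimal sketch of the estimator architecture's feature combination and MSE objective, assuming d-dimensional segment embeddings h, s and r have already been produced by the encoding and pooling systems; the regressor's layer sizes and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def combine_features(h, s, r):
    """Builds x = [h; r; h*s; h*r; |h-s|; |h-r|] from d-dimensional segment embeddings,
    per Equations 2-5; the raw source embedding s itself is not concatenated."""
    return torch.cat([h, r, h * s, h * r, torch.abs(h - s), torch.abs(h - r)], dim=-1)

class FeedForwardRegressor(nn.Module):
    """Maps the combined 6d-dimensional feature vector to a single predicted quality score."""
    def __init__(self, dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden // 2), nn.Tanh(),
            nn.Linear(hidden // 2, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

def training_step(regressor, optimizer, h, s, r, human_scores):
    """One optimization step minimizing the MSE between predicted and human scores."""
    optimizer.zero_grad()
    predictions = regressor(combine_features(h, s, r))
    loss = F.mse_loss(predictions, human_scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```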
An exemplary evaluation model training method 300 for the evaluation model training system 100 is illustrated in
The evaluation model training method 300 can include extracting an element-wise source product between the embedding representation of the machine translation 214 and the embedding representation of the original source language segment 212, at 320. In selected embodiments, the element-wise source product, at 320, can be generated in the manner discussed in more detail above with reference to Equation 2. At 330, the evaluation model training method 300 can include extracting an element-wise reference product between the embedding representation of the machine translation 214 and the embedding representation of the reference translation 216. The element-wise reference product, at 330, can be generated in the manner discussed in more detail above with reference to Equation 3.
At 340, an absolute element-wise source difference between the embedding representation of the machine translation 214 and the embedding representation of the original source language segment 212 can be extracted. The absolute element-wise source difference, at 340, can be generated in the manner discussed in more detail above with reference to Equation 4. An absolute element-wise reference difference between the embedding representation of the machine translation 214 and the embedding representation of the reference translation 216 can be extracted, at 350. The absolute element-wise reference difference, at 350, can be generated in the manner discussed in more detail above with reference to Equation 5.
As shown in
Additionally and/or alternatively, the evaluation model training system 100 can be provided with a second exemplary system architecture, sometimes referenced herein as being a translation ranking model architecture, as illustrated in
Turning to
Alternative exemplary evaluation model training methods 400, 500 for the evaluation model training system 100 are illustrated in
At 420, the method 400 can include pooling and combining the token-level embedding representations into segment-level embedding representations. Multiple contrastive feature vector representations from the segment-level embedding representations can be extracted and the vector representations can be combined into a single vector representation, at 430. The method of
The evaluation model training method 500, alternatively, can involve training a translation ranking-based multilingual machine translation evaluation model by iteratively optimizing weights of the entire neural system, including the encoder system and/or the layer attention mechanism, via standard neural back-propagation triplet-margin-loss optimization on data collections of triplets of anchors (a source segment and a reference translation of the source segment) paired with two ranked MT-generated translations (a “better” MT hypothesis and a “worse” MT hypothesis). Turning to
The token-level embedding representations can be pooled and combined, at 520, into segment-level embedding representations. At 530, the method 500 can calculate a triplet margin loss for a training example consisting of the segment-level embedding representations of an original source language text segment 212, a first “better” machine translation 214A of the original source language text segment 212, a second “worse” machine translation 214B of the original source language text segment 212 and a reference translation 216 of the original source language text segment 212. The weights of the entire neural system, including the encoder system and/or the layer attention mechanism, can be iteratively optimized, at 540, via the standard neural weight back-propagation triplet-margin-loss optimization on data collections of triplets of anchors (a source segment and a reference translation of the source segment) paired with two ranked MT-generated translations (a “better” MT hypothesis and a “worse” MT hypothesis).
In operation, the system 100 can receive the original source language input text segment 212, the better machine translation 214A, the worse machine translation 214B and the reference translation 216. The original source language input text segment 212, better machine translation 214A, the worse machine translation 214B and the reference translation 216 can be independently encoded using the pretrained encoder system 112 and the layered pooling system 122. Using a triplet margin loss the resulting embedding space can be optimized to minimize or otherwise reduce a distance between the better machine translation 214A and the translation anchors 218.
For example, the translation ranking model training system 104 can receive a tuple χ=(s, h+, h−, r), wherein s represents a source embedding of the original source language input text segment 212, h+ represents the better machine translation 214A that has been ranked higher than the worse machine translation 214B, h− represents the worse machine translation 214B and r represents a reference embedding of the reference translation 216. The tuple χ can be passed through the encoding system 110 and the pooling system 120 and provided to a sentence embeddings system 150. The sentence embeddings system 150 can generate a sentence embedding for each segment in the tuple χ. For example, the sentence embeddings system 150 can utilize the embeddings {s, h+, h−, r} to calculate a triplet margin loss 226 in relation to the source embedding s and the reference embedding r, which can be computed in accordance with Equations 6-8:
L(χ)=L(s,h+,h−)+L(r,h+,h−) (Equation 6)
wherein:
L(s,h+,h−)=max{0,d(s,h+)−d(s,h−)+ε} (Equation 7)
L(r,h+,h−)=max{0,d(r,h+)−d(r,h−)+ε} (Equation 8)
wherein d(u, v) denotes a Euclidean distance function between u and v and ε is a margin. Thus, during training, the multilingual machine translation evaluation model can optimize the embedding space so that the distance between the translation anchors 218 and the worse machine translation 214B is greater by at least ε than the distance between the translation anchors 218 and the better machine translation 214A.
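A minimal sketch of the triplet margin loss of Equations 6-8, assuming batched segment embeddings for the source, reference and the two ranked hypotheses; the margin value ε=1.0 is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, h_better, h_worse, epsilon: float = 1.0):
    """L(anchor, h+, h-) = max{0, d(anchor, h+) - d(anchor, h-) + epsilon} (Equations 7-8)."""
    d_pos = F.pairwise_distance(anchor, h_better, p=2)   # Euclidean distance to the better MT
    d_neg = F.pairwise_distance(anchor, h_worse, p=2)    # Euclidean distance to the worse MT
    return torch.clamp(d_pos - d_neg + epsilon, min=0.0)

def ranking_loss(s, r, h_better, h_worse, epsilon: float = 1.0):
    """L(chi) = L(s, h+, h-) + L(r, h+, h-) (Equation 6), averaged over a batch of tuples."""
    return (triplet_margin_loss(s, h_better, h_worse, epsilon)
            + triplet_margin_loss(r, h_better, h_worse, epsilon)).mean()
```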
During inference, the described multilingual machine translation evaluation model can receive a triplet (s, ĥ, r) that includes a single MT hypothesis ĥ. The single MT hypothesis ĥ, in selected embodiments, can refer to a translation produced by an independent MT system (not shown) that is being evaluated. Stated somewhat differently, the hypothesis ĥ can be a hypothesis translation presented for evaluation to the MT evaluation model that was trained by the evaluation model training system.
The translation quality score 222 (shown in
The harmonic mean between the first distance d(s, ĥ) and the second distance d(r, ĥ) can be converted into a similarity score bounded between 0 and 1 in accordance with Equation 10:
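Because the bodies of Equations 9 and 10 are not reproduced above, the following sketch assumes the formulation used in the COMET metric on which this disclosure builds: the quality score is the harmonic mean of the hypothesis-to-source and hypothesis-to-reference distances, mapped into a bounded similarity via 1/(1+d).

```python
import torch
import torch.nn.functional as F

def inference_score(s: torch.Tensor, r: torch.Tensor, h_hat: torch.Tensor) -> torch.Tensor:
    """Scores a single hypothesis embedding h_hat against the source and reference
    embeddings: harmonic mean of the two Euclidean distances, mapped to (0, 1]."""
    d_src = F.pairwise_distance(s, h_hat, p=2)
    d_ref = F.pairwise_distance(r, h_hat, p=2)
    harmonic = (2 * d_src * d_ref) / (d_src + d_ref + 1e-8)  # harmonic mean of the distances
    return 1.0 / (1.0 + harmonic)  # bounded similarity: higher means closer to both anchors
```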
During standard training of the multilingual machine translation evaluation models, the evaluation model training system 100 can receive the selected system inputs 210. The evaluation model training system 100 preferably receives the selected system inputs 210 in the following order: the original source language input text segment 212 followed by any machine translations 214 and then followed by one or more reference translations 216. The evaluation model training system 100 thereby can concatenate the embeddings.
Another alternative embodiment of the evaluation model training system 100 is shown in
The training method for the multi-reference architecture is modified in order to promote the learning of model parameters that perform well when presented at inference time with zero, one or more reference translations. In order to support the learning of such effective parameters, the positions of the original source language input text segment 212 and the reference translation 216 can be switched during training with probability of 0.5. Stated somewhat differently, the system 110 can receive any one of the reference translations 216 as the original source language input text segment 212 and the original source language input text segment 212 as the reference translation 216. The multi-reference model training system 106 thereby can receive the selected system inputs 210 in the following order: any of the one or more reference translations 216 followed by any machine translations 214 and then followed by the original source language input text segment 212. This order switching can be performed with a probability of 0.5 throughout the course of training the model.
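The input-switching step can be sketched as follows; the function name and signature are hypothetical.

```python
import random

def maybe_switch_source_and_reference(source: str, reference: str, p: float = 0.5):
    """With probability p, swaps the source segment and the reference translation so the
    model learns to treat the two inputs as interchangeable."""
    if random.random() < p:
        return reference, source
    return source, reference
```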
By switching the positions of the original source language input text segment 212 and the reference translations 216, the source embeddings can be aligned with the target language embedding space during fine-tuning of the underlying multilingual machine translation evaluation model and can result in more useful source embeddings. Switching the positions of the original source language input text segment 212 and the reference translations 216 likewise can force the underlying multilingual machine translation evaluation model to treat the original source language input text segment 212 and the reference translations 216 as being interchangeable system inputs 210. The multi-reference model training system 106 thereby trains a model that can handle switching of inputs at inference time without excessively hindering a predictive ability of the multilingual machine translation evaluation model.
At inference time, the multi-reference machine translation evaluation model can embed the original source language input text segment 212, the machine translation (or hypothesis) 214, the reference translation 216 and an alternative reference translation (not shown) via, for example, the embeddings concatenation system 130 (shown in
The feed-forward regressor system 140 can receive each respective permutation of the embeddings and provide a prediction based upon the permutation of the embeddings. The resulting score predictions for the various permutations of the embeddings can be the same, or different. The feed-forward regressor system 140, for example, can generate aggregated scores by computing a mean of the predictions and multiplying the mean of the predictions by a scaling factor (1−σ) that is equal to one minus a standard deviation (σ) of the predictions. The scaling factor (1−σ) advantageously can provide a confidence score for the multilingual machine translation evaluation model at the segment level. Additionally and/or alternatively, scaling the mean prediction by the scaling factor (1−σ) to penalize lower confidence can better align the multilingual machine translation evaluation model with human quality scores.
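A minimal sketch of the described aggregation, assuming the per-permutation predictions for a single segment are collected into one tensor; the choice of standard-deviation estimator is an assumption.

```python
import torch

def aggregate_permutation_scores(scores: torch.Tensor) -> torch.Tensor:
    """Aggregates the per-permutation predictions for one segment into a single
    score scaled by (1 - sigma), penalizing low-confidence (high-variance) cases."""
    mean = scores.mean()
    sigma = scores.std()   # standard deviation of the predictions; exact estimator is an assumption
    return mean * (1.0 - sigma)
```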
At inference time, the original source language input text segment 212 and the reference translation 216 can be introduced in varying configurations to the model resulting from selected embodiments of the multi-reference model training system 106. If no reference translation 216 is available, for example, the resulting MT evaluation model can receive the original source language input text segment 212 twice with the second instance of the original source language input text segment 212 being received as the reference translation 216.
If two reference translations 216 are available at inference time, the resulting MT evaluation model alternatively can receive both reference translations 216 with the second instance of the reference translation 216 being received as the original source language input text segment 212.
Corpora
To demonstrate the effectiveness of the evaluation model training system 100, three MT evaluation models were trained, where each model targeted a different type of human scoring of translation quality. To train these multilingual machine translation evaluation models, data from four different corpora was used: the QT21 corpus; the DARR from the WMT Metrics shared task (2017 to 2019); an extension of the latter corpus containing multiple references established by Freitag et al. (2020) (“BLEU might be Guilty but References are not Innocent,” Freitag et al., 2020), and a proprietary MQM annotated corpus.
The QT21 Corpus
The QT21 corpus is a dataset that is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2390 and contains industry generated sentences from the information technology and life sciences domains (“Translation Quality and Productivity: A Study on Rich Morphology Languages,” Specia et al., 2017). The QT21 corpus contains a total of 173K tuples with source sentence, respective human-generated reference translation, MT hypothesis (either from a phrase-based statistical MT or from a neural MT system), and a human post-edited correction of the MT hypothesis (PE). The language pairs represented in this corpus are English to German (en-de), English to Latvian (en-lt), English to Czech (en-cs) and German to English (de-en).
For each tuple in the corpus, the HTER score is obtained by computing the translation edit rate (TER) (“A Study of Translation Edit Rate with Targeted Human Annotation,” Snover et al., 2006) between the MT hypothesis and the corresponding PE. Finally, after computing the HTER for each MT hypothesis, a training dataset D={(si, hi, ri, yi)}i=1N was built, wherein si denotes the source text, hi denotes the MT hypothesis, ri the reference translation, and yi the HTER score for the hypothesis hi. In this manner a regression ƒ(s, h, r)→y is learned that predicts the human effort required to correct the hypothesis by looking at the source, hypothesis, and reference (but not the post-edited hypothesis).
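A minimal sketch of the dataset construction, assuming QT21-style (source, hypothesis, reference, post-edit) tuples; compute_ter is a hypothetical wrapper around any TER implementation.

```python
def build_hter_dataset(qt21_tuples, compute_ter):
    """Builds the training dataset D = {(s_i, h_i, r_i, y_i)} from QT21-style tuples.
    `compute_ter` is a hypothetical callable wrapping any TER implementation; the HTER
    label y_i scores the MT hypothesis against its human post-edited correction (PE)."""
    dataset = []
    for source, hypothesis, reference, post_edit in qt21_tuples:
        y = compute_ter(hypothesis, post_edit)   # HTER = TER(hypothesis, PE)
        dataset.append((source, hypothesis, reference, y))
    return dataset
```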
The WMT DARR Corpus
Since 2017, the organizers of the WMT News Translation Shared Task (“Findings of the 2019 Conference on Machine Translation (WMT19),” Loïc Barrault et al., 2019) have collected human quality scores in the form of adequacy DAs (“Continuous Measurement Scales in Human Evaluation of Machine Translation,” Yvette Graham et al., 2013, “Is Machine Translation Getting Better over Time?”, Yvette Graham et al., 2014, “Can Machine Translation Systems be Evaluated by the Crowd Alone?,” Yvette Graham et al., 2017). The DAs are then mapped into relative rankings (DARR) (“Results of the WMT19 Metrics Shared Task: Segment-level and Strong MT Systems Pose Big Challenges,” Ma et al., 2019a). The resulting data for each year (2017-19) form a dataset D={(si, hi+, hi−, ri)}i=1N where hi+ denotes a “better” hypothesis and hi− denotes a “worse” one. Here, a function ƒ(s, h, r) is learned such that the score assigned to hi+ is, in an embodiment, higher than the score assigned to hi− (ƒ(si, hi+, ri)>ƒ(si, hi−, ri)). This data contains a total of twenty-four high and low-resource language pairs such as Chinese to English (zh-en) and English to Gujarati (en-gu).
The Multi-Reference Corpus
The Multi-Reference corpus was established by Freitag et al. (2020) (“BLEU might be Guilty but References are not Innocent,” Freitag et al. 2020) and extends the WMT DARR corpus for English to German and German to English with three additional reference translations: AR reference (an additional high-quality reference translation), ARp reference (a “paraphrased-as-much-as-possible” version of AR), and WMTp reference (a “paraphrased-as-much-as-possible” version of the original WMT reference). For the latter, the evaluation model training system 100 can use the alternative reference given in the WMT19 News shared task test set being part of the WMT DARR corpus defined herein. The corpus also provides human-generated adequacy assessments for each reference.
The MQM Corpus
The MQM corpus is an Unbabel Inc. proprietary internal database of MT-generated translations of customer support chat messages that were annotated according to the guidelines set out in “Practical Guidelines for the Use of MQM in Scientific Research on Translation quality,” by Burchardt and Lommel (2014). This data contains a total of 12K tuples, covering twelve language pairs from English to: German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv), and Turkish (en-tr). Note that in this corpus English is always present as the source language, but never as the target language. Each tuple consists of a source sentence, a human-generated reference, an MT hypothesis, and its MQM score annotated by one (or more) professional editors. The MQM scores range from −∞ to 100 and are defined as:
where IMinor denotes the number of minor errors, IMajor the number of major errors and ICrit. the number of critical errors.
MQM takes into account the severity of the errors identified in the MT hypothesis, leading to a more fine-grained metric than HTER or DA. When used experimentally, these values were divided by 100 and truncated at 0. A training dataset D={(si, hi, ri, yi)}i=1N was constructed in the manner set forth above with reference to the QT21 corpus, where si denotes the source text, hi denotes the MT hypothesis, ri denotes the reference translation, and yi denotes the MQM score for the hypothesis hi.
For purposes of experimentation, three MT evaluation models trained using alternative embodiments of the evaluation model training system 100 were examined. Two models were trained using the estimator model training system 102 as shown and described with reference to
Training Setup
The two models trained using the estimator model training system 102 of
Before initializing the multilingual machine translation evaluation models, a random seed was set to three in all libraries that perform “random” operations (torch, numpy, random and cuda).
For training, the pretrained and/or cross-lingual encoder system 112 (shown in
To set up the training of the Rank-DARR model using the translation ranking model training system 104 of
The Rank-DARR multilingual machine translation evaluation model trained using the translation ranking model training system 104 was trained on the WMT DARR corpus in the manner described in more detail above. With a probability of 0.5, the positions of the original source language input text segment 212 and the reference translation 216 at input are switched, allowing the model to better align the multilingual embedding space and to treat the original source language input text segment 212 and the reference translation 216 interchangeably. All model parameters are otherwise as described above with reference to other models.
Evaluation Setup
The test data and setup of the WMT 2019 Metrics Shared Task (“Results of the WMT19 Metrics Shared Task: Segment-level and Strong MT Systems Pose Big Challenges,” Ma et al., 2019) were used to compare the three example multilingual machine translation evaluation models (Est-HTER, Est-MQM and Rank-DARR) trained by the respective embodiments of the evaluation model training system 100, with the top performing submissions of the shared task and other recent state-of-the-art metrics such as BERTScore and BLEURT. The evaluation method used is the official Kendall's Tau-like formulation, τ, from the WMT 2019 Metrics Shared Task (Ma et al., 2019), defined as τ=(Concordant−Discordant)/(Concordant+Discordant),
where Concordant is a number of times a metric assigns a higher score to the “better” hypothesis h+, such as the better machine translation 214A (shown in
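A minimal sketch of this Kendall's Tau-like computation over DARR pairs; the handling of ties is an assumption, following the convention that they count against the metric.

```python
def kendall_tau_like(metric_better, metric_worse):
    """WMT Kendall's Tau-like correlation over DARR pairs; element i of each list holds
    the metric score for the human-judged better / worse hypothesis of pair i."""
    concordant = sum(b > w for b, w in zip(metric_better, metric_worse))
    # Ties and reversals both count against the metric here (an assumption about tie handling).
    discordant = len(metric_better) - concordant
    return (concordant - discordant) / (concordant + discordant)
```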
As mentioned in the findings of “Results of the WMT19 Metrics Shared Task: Segment-level and Strong MT Systems Pose Big Challenges,” Ma et al., 2019, segment-level correlations of all originally submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether the three multilingual machine translation evaluation models trained by the evaluation model training system 100 better address these issues, the described evaluation setup used in the analysis presented in Ma et al., 2019, was followed, where correlation levels are computed for portions of the DARR data that include only the top 10, 8, 6 and 4 MT systems.
Results
Results for the above-referenced experiments are set forth below.
From English into X
Table 2 shows results for all eight language pairs with English as source. The three example models of embodiments of the invention were contrasted against baseline metrics such as BLEU and chrF, the 2019 task-winning metric YiSi-1, as well as the more recent BERTScore.
For BERTScore, results were reported both with its default encoder model and with XLM-RoBERTa (base) for a complete comparison. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019).
It was observed that all three multilingual machine translation evaluation models trained by the evaluation model training system 100 outperformed all of the other metrics across the board, often by significant margins. The Rank-DARR model trained using training system 104 with the WMT DARR corpus outperformed the two models trained using the estimator model training system 102 (Est-HTER and Est-MQM) in seven out of eight language pairs. Also, even though trained on only 12K annotated segments, the estimator model trained using training system 102 regressed on MQM (Est-MQM) performed roughly on par with the estimator model trained using training system 102 regressed on HTER (Est-HTER) for most language pairs and outperformed all the other metrics in en-ru.
From X into English
Table 3 shows results for the seven to-English language pairs.
Results for BERTScore and for BLEURT are reported for two model versions: the base model, which is comparable in size with the XLM-RoBERTa (base) model that was used as the pretrained model for encoding system 110 (shown in
Again, the three models trained using the evaluation model training system 100 are contrasted against baseline metrics such as BLEU and chrF, the 2019 task-winning metric YiSi-1, as well as the recently published metrics BERTScore and BLEURT. As in Table 2, the Rank-DARR model trained using the translation ranking model training system 104 with the WMT DARR corpus showed strong correlations with human judgments, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs. Furthermore, the estimator model trained using training system 102 regressed on MQM (Est-MQM) again showed surprisingly strong results despite the fact that this model was trained with data that did not include English as a target language. Although the encoding system 110 used in the trained models of the evaluation model training system 100 is highly multilingual, this powerful “zero-shot” result is likely due to the inclusion of the original source language input text segment 212 in the models of the evaluation model training system 100.
Language Pairs not Involving English
All three of the evaluation models trained with training system 100 were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that the models trained using the evaluation model training system 100 generalize well to other languages, these models were also tested on data from the three WMT 2019 language pairs that do not include English as either the source or target language. Results of these tests are shown in Table 4.
As can be seen in Table 4, the results are consistent with observations in Tables 2 and 3.
Multi-Reference Experiments
Similar experiments also were performed with the multi-reference model training system 106 (shown in
Based upon the above results, a positive correlation can be seen between reference quality and its utility to the predictive model.
Utilizing a second reference improved prediction accuracy only when the adequacy of the second reference was as good as or better than that of the first reference. These results show that, for approaches such as that employed in the multi-reference model training system 106, quality is more important than quantity, and that lower-quality additional references can hurt rather than help improve the correlations obtained using only one single high-quality reference. These results highlight that a single high-quality reference translation is sufficient in order for the MT evaluation models trained with embodiments of training system 100 to learn accurate quality score predictions.
Robustness to High-Quality MT
The three trained models based on training system 100 were further analyzed with respect to their ability to correctly rank high-quality MT systems. The DARR corpus from the 2019 Shared Task was used for evaluating on the subset of the data from the top performing MT systems for each language pair. This example analysis included language pairs for which data for at least ten different MT systems (i.e. all but kk-en and gu-en) could be retrieved. The performance of the models trained using evaluation model training system 100 was contrasted against the strong, recently proposed BERTScore and BLEURT metrics, with BLEU as a baseline. Results are presented in
Importance of the Source
To shed some light on the actual value and contribution of the original source language input text segment 212 to the ability of the evaluation model training system 100 to learn accurate predictions, two versions of the Rank-DARR model using the ranking model training system 104 were trained using the WMT DARR corpus: one of the Rank-DARR models used only the reference translation 216, whereas the other Rank-DARR model used both the reference translation 216 and the original source language input text segment 212. Both models were trained using the WMT 2017 corpus, which only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for either version of the model, the training of the second version included English source embeddings. The two versions of the Rank-DARR model trained using the translation ranking model training system 104 were then tested on the WMT 2018 corpus for these language pairs and for the reversed directions. The test results are shown in Table 6.
The results in Table 6 clearly show that for the translation ranking architecture model training system 104, including the original source language input text segment 212 improves the overall correlation with human quality rankings. Furthermore, the inclusion of the original source language input text segment 212 exposed the second version of the model to English embeddings, which is reflected in a higher Δτ for the language pairs with English as the target language.
External Validation
A recent research paper from Google (“Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation,” Freitag et al. 2021) measured the correlation of scores produced by a variety of automated MT evaluation metrics, including the models formed via the evaluation model training system 100, with a significant corpus of human quality scores in the form of MQM scores. Two experiments, with data from English to German and Chinese to English, concluded that the models herein described showed significantly higher correlation with human quality scores than all other evaluated metrics.
Independently, a recent research paper from Microsoft (“To Ship or Not to Ship: Extensive Evaluation of Automatic Metrics for Machine Translation”, Kocmi et al. 2021) conducted an in-depth investigation of the correlation between the scores generated by multiple MT evaluation metrics, including the models trained using embodiments of model training system 100 described herein, and a significant corpus of human-generated MT system rankings, for a large collection of MT systems developed by Microsoft that cover multiple language-pairs and domains. Results indicated that the MT evaluation models trained using embodiments of training system 100 exhibited significantly higher levels of correlation with the human rankings than all other MT evaluation metrics. The authors further recommended that the models described herein be broadly adopted by the MT community at large as a primary MT evaluation metric.
Data Statistics
Tables 7-12 show key data statistics for the corpora used to train and test the models trained using embodiments of the evaluation model training system 100.
In selected embodiments, one or more of the features disclosed herein can be provided as a computer program product being encoded on one or more non-transitory machine-readable storage media. As used herein, a phrase in the form of at least one of A, B, C and D herein is to be construed as meaning one or more of A, one or more of B, one or more of C and/or one or more of D.
The described embodiments are susceptible to various modifications and alternative forms, and specific examples thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the described embodiments are not to be limited to the particular forms or methods disclosed, but to the contrary, the present disclosure is to cover all modifications, equivalents, and alternatives.
This application claims the benefit of, and priority to, U.S. Provisional Application Ser. No. 63/055,272, filed Jul. 22, 2020, the disclosure of which is hereby incorporated herein by reference in its entirety and for all purposes.