Machine translation can be used to translate text from one language to another. Machine translation quality estimation or prediction involves evaluating the output of a machine translation system without access to a “gold” label sequence. Machine translation models can be trained with large parallel datasets, such as millions (or more) of sentence pairs. However, real-world datasets may contain a substantial amount of noisy data. Use of such data to train a machine translation model can produce poor training results, which, in turn, can result in low-quality translations.
Aspects of the technology employ a machine translation quality prediction (MTQP) model to refine datasets that are used in training machine translation systems. The MTQP model is configured to provide indications on the quality of a sentence pair. Given a large dataset containing sentence pairs (e.g., hundreds of thousands, millions or billions of sentence pairs) from real-world datasets, the MTQP model assigns a score to each sentence pair. The model flags low-scoring pairs that fall below a selected threshold. The resultant high quality dataset pairs can then be used to train various types of machine translation models, such as neural machine translation (NMT) models. Example implementations are thus directed to a specific technical implementation of a machine translation training system which filters training data using an MTQP model and then uses the filtered training data to train a machine translation model.
According to one aspect of the technology, a computer-implemented method comprises receiving, by a machine translation quality prediction model, a sentence pair of a source sentence and a translated output; performing feature extraction on the sentence pair using a set of two or more feature extractors, each feature extractor generating a corresponding feature vector; concatenating the corresponding feature vectors from the set of feature extractors together; and applying the concatenated feature vectors to a feedforward neural network, the feedforward neural network generating a machine translation quality prediction score for the translated output.
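By way of a non-limiting illustration of this aspect, the following sketch shows one possible arrangement of the described pipeline, assuming a PyTorch implementation; the dimensions and the stand-in feature vectors are hypothetical, and the feature extractors themselves are not shown.

```python
# Minimal sketch (not the actual implementation): several feature extractors
# each produce a feature vector for a sentence pair, the vectors are
# concatenated, and a feedforward network maps the concatenated vector to a
# quality prediction score. Dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class MTQPHead(nn.Module):
    def __init__(self, feature_dims, hidden_dim=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(sum(feature_dims), hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # single quality prediction score
        )

    def forward(self, feature_vectors):
        # feature_vectors: list of tensors, one per feature extractor
        concatenated = torch.cat(feature_vectors, dim=-1)
        return self.ffn(concatenated)

# Hypothetical usage with two extractors producing 128- and 64-dim features.
head = MTQPHead(feature_dims=[128, 64])
quasi_mt_features = torch.randn(1, 128)   # stand-in for Quasi-MT extractor output
nmt_features = torch.randn(1, 64)         # stand-in for NMT extractor output
score = head([quasi_mt_features, nmt_features])
```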
In one example, the method further comprises storing the machine translation quality prediction score in a database in association with the translated output. In another example, the method further comprises transmitting the machine translation quality prediction score to a user. In either case, the set of two or more feature extractors may comprise at least two of a Quasi-MT feature extractor, a neural machine translation feature extractor, a language model extractor, and a LogPr feature extractor. The Quasi-MT feature extractor may use internal scores of a Quasi-MT model that is trained by trying to predict each token in a gold-label sentence by using information in both the source sentence and the gold-label sentence. The neural machine translation feature extractor may use internal scores from at least a decoder of a neural machine translation model. The language model extractor may use internal scores from two kinds of language models. Here, a first one of the language models is trained on a selected corpus of a source language, and a second one of the language models is a contrastive language model that is first trained on the selected corpus and then incrementally trained on a corpus formed by source sentences in a set of training sentence pairs.
In a further example, the method also includes determining whether the machine translation quality prediction score exceeds a quality threshold, and when the machine translation quality prediction score does not exceed the quality threshold, filtering the translated output. Filtering the translated output may comprise storing a flag with the translated output to indicate that the machine translation quality prediction score does not exceed the quality threshold. Filtering the translated output may comprise removing the translated output from a corpus of translated output sentences.
In another example, the method further comprises determining whether the machine translation quality prediction score exceeds a quality threshold, and when the machine translation quality prediction score exceeds the quality threshold, adding the translated output to a corpus of translated output sentences. In yet another example, the method further includes training a machine translation model using the translated output when the machine translation quality prediction score exceeds a quality threshold.
In a further example, the method also includes creating a curated data set of source sentences and corresponding translated outputs, where each translated output exceeds a quality threshold, and then training a machine translation model using the curated data set. The trained machine translation model may be a neural machine translation model.
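As a hedged illustration of the filtering and curation steps described above, the following sketch scores each sentence pair with a hypothetical `mtqp_score` function and keeps only pairs exceeding a quality threshold; the threshold value and data structures are illustrative only.

```python
# Illustrative sketch of dataset curation: score each sentence pair with an
# MTQP model, keep pairs that meet a quality threshold, and flag (rather than
# discard) the rest. `mtqp_score` is a hypothetical stand-in for the model.
def curate_dataset(sentence_pairs, mtqp_score, quality_threshold=0.9):
    curated, flagged = [], []
    for source, translation in sentence_pairs:
        score = mtqp_score(source, translation)
        record = {"source": source, "translation": translation, "score": score}
        if score > quality_threshold:
            curated.append(record)        # eligible for NMT training
        else:
            record["low_quality"] = True  # flag for post-editing or exclusion
            flagged.append(record)
    return curated, flagged

# Hypothetical usage with a stand-in scoring function.
pairs = [("hola", "hello"), ("adios", "banana")]
curated, flagged = curate_dataset(
    pairs, mtqp_score=lambda s, t: 0.95 if t == "hello" else 0.1)
```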
According to another aspect of the technology, a system is provided that comprises memory configured to store machine translation quality prediction information and one or more processors operatively coupled to the memory. The one or more processors are configured to implement a machine translation quality prediction model by: reception of a sentence pair of a source sentence and a translated output; performance of feature extraction on the sentence pair using a set of two or more feature extractors, each feature extractor generating a corresponding feature vector; concatenation of the corresponding feature vectors from the set of feature extractors; and application of the concatenated feature vectors to a feedforward neural network, in which the feedforward neural network is configured to generate a machine translation quality prediction score for the translated output.
In one example, the set of two or more feature extractors comprises at least two of a Quasi-MT feature extractor, a neural machine translation feature extractor, a language model extractor, and a LogPr feature extractor.
In another example, the one or more processors are further configured to: determine whether the machine translation quality prediction score exceeds a quality threshold; and when the machine translation quality prediction score does not exceed the quality threshold, filter the translated output. The one or more processors may be configured to filter the translated output by storing a flag with the translated output to indicate that the machine translation quality prediction score does not exceed the quality threshold. The one or more processors may be further configured to: determine whether the machine translation quality prediction score exceeds a quality threshold; and when the machine translation quality prediction score exceeds the quality threshold, add the translated output to a corpus of translated output sentences. The one or more processors may be further configured to train a machine translation model using the translated output when the machine translation quality prediction score exceeds a quality threshold. And the one or more processors may be further configured to: create a curated data set of source sentences and corresponding translated outputs, where each translated output exceeds a quality threshold; store the curated data set in the memory; and train a machine translation model using the curated data set.
Overview
Machine translation quality prediction (MTQP), also referred to as machine translation quality estimation (MTQE), aims to evaluate the output of a machine translation system without access to reference translations. For example, given a source sentence (a sentence in a source language) and a translation output (a sentence generated by a machine translation system), it is beneficial to be able to predict the quality score of this translation, even without knowing the machine translation system or the gold-label sentence (e.g., a human generated reference translation sentence). In particular, MTQP predicts if the translation output matches the meaning of the source sentence, and whether the target sentence is fluent.
Different metrics may be used to evaluate the quality of a machine translation. For instance, a BLEU score that is based on n-gram precision may be employed. Here, a BLEU score may be calculated between the translation output and gold-label sentences. A parallel corpus containing source sentences and corresponding gold-label sentences may be used to evaluate the translation quality. For instance, the BLEU metric may be averaged over a corpus to provide an indication of how well the machine translation system is trained. MTQP, in turn, may be used to evaluate the quality of a specific translation output given only the source sentence. In contrast to BLEU, it may not be meaningful to calculate an average MTQP for a whole corpus, because MTQP is most useful for flagging individual low-quality translation outputs that would then be subject to post-editing as a next step.
By way of example, there can be a significant benefit to an application service provider or other customer to receive a confidence score (quality estimation) along with a translation. This score can be used to judge whether the machine translation can be used directly without post-editing, or how much post-editing is needed. Since human post-editing can be the most significant cost in the localization workflow, the confidence score feature is of high importance for reducing cost and providing additional information on translation quality.
For situations where post-editing may be required, MTQP allows experts to concentrate on translations that are estimated to be of low quality, further reducing post-editing cost. For example, a service provider may use a given machine translation system to translate 10 million sentences, while also wanting to ensure that all translations are good, e.g., of at least some threshold quality. By way of example, the quality threshold may only be met by the top 30-40% of the translations (or more or less). There may be a large cost factor in terms of human and/or computing resources to check all 10 million sentences and perform post-editing. However, if quality estimation (QE) scores are provided along with translations, a threshold may be set for which translations to review. Here, for instance, the service provider may pick only the lowest-scoring 10,000 sentences (or set a threshold QE score) and send those sentences falling below the threshold to experts for post-editing, as in the sketch below. In such a scenario, the costs associated with post-editing could be reduced by 99.9% when compared to performing post-edit evaluations on the entire set of translations.
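The following is a non-limiting sketch of this selection policy, assuming the translations have already been scored; the budget of 10,000 sentences is simply the illustrative figure from the example above.

```python
# Hedged sketch: rank translations by their QE score and send only the
# lowest-scoring ones for post-editing. The data structures are hypothetical.
def select_for_post_editing(scored_translations, budget=10_000):
    # scored_translations: list of (source, translation, qe_score) tuples
    ranked = sorted(scored_translations, key=lambda item: item[2])
    return ranked[:budget]   # the lowest-scoring translations go to experts

scored = [("src1", "tr1", 0.2), ("src2", "tr2", 0.9), ("src3", "tr3", 0.4)]
print(select_for_post_editing(scored, budget=2))   # the two lowest-scoring pairs
```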
There are also scenarios where post-editing may not be required, but a fast turnaround time is needed. In such cases, it may be particularly beneficial for the service provider to publish only good quality translations. Here, the lower quality translations that fall below the threshold may be left in the source language. In this type of situation, an MTQP score can provide a reliable metric for picking out (or otherwise selecting) only high-quality translation sentences.
As shown in the level below block 102, the system may use either general or custom QE, and general or custom machine translation. For general QE, [source sentence, translation output] pairs are provided, and the general QE model (e.g., trained on general data labeled by the system) is used to predict a quality score for each sentence pair. For custom QE, a dataset is provided that comprises [source sentence, translation output, quality label] tuples. Here, the system fine-tunes or retrains the QE model based on this labeled dataset. In this case, the customized QE model can be used to predict a quality score for each sentence pair. For general machine translation, received source sentences are used by a neural machine translation (NMT) model to generate translation sentences. And for custom machine translation, the system employs a parallel corpus. Here, the system fine-tunes the NMT model to derive a custom machine translation model. Source sentences are applied to the custom machine translation model to generate translation sentences.
Block 104 shows a configuration using general QE for general machine translation. Block 106 shows a configuration using general QE for custom machine translation. Block 108 shows a configuration using custom QE for general machine translation. And block 110 shows a configuration using custom QE for custom machine translation. Each of these may be suited to different customers' needs, for instance depending on whether the customer has the ability to provide its own data quality labeling and/or its own data set.
Blocks 112 show options where data is labeled by users (e.g., customers), and blocks 114 show options where data is labeled by the system pipeline. For the user-labeled data of blocks 112, the users are responsible for choosing which sentences in the dataset to label, as well as for labeling the QE score. This approach allows users to design their own labeling rules and/or follow guidelines for the MTQP system. For labeling according to the system pipeline as in blocks 114, the pipeline can be used not only for labeling the general QE data to train the general QE model, but also for labeling the translation output for the user's data. In addition, for applications where a general QE approach is satisfactory, there may not be a need for custom QE. In contrast, in applications where the data may be field-specific, such as for movies or other videos, a custom QE + custom machine translation approach may be most appropriate.
By way of example, the Quasi-MT model 200 is trained to predict each token in the translation output based on the source sentence. Given a source sentence [a, b, c] and a translation sentence [A, B, C, D], the Quasi-MT model 200 seeks to predict each token in the translation sentence based on “the source sentence + bi-directional information in the translation sentence”. More specifically in this example, the model attempts to predict “A” based on [a, b, c] and [B, C, D]; predict “B” based on [a, b, c] and [A, C, D]; predict “C” based on [a, b, c] and [A, B, D]; and predict “D” based on [a, b, c] and [A, B, C]. These predictions are made in parallel (independently of each other).
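The following is a minimal, non-limiting sketch of this prediction scheme; it only enumerates the (source, bidirectional context, target token) prediction instances and does not show the Quasi-MT model that would consume them.

```python
# Illustrative sketch: each token of the translation sentence is predicted
# from the full source sentence plus all *other* translation tokens, and the
# predictions are independent of one another.
def quasi_mt_contexts(source_tokens, translation_tokens):
    contexts = []
    for i, target_token in enumerate(translation_tokens):
        bidirectional_context = translation_tokens[:i] + translation_tokens[i + 1:]
        contexts.append((source_tokens, bidirectional_context, target_token))
    return contexts

# For source [a, b, c] and translation [A, B, C, D], "A" is predicted from
# [a, b, c] and [B, C, D], "B" from [a, b, c] and [A, C, D], and so on.
print(quasi_mt_contexts(["a", "b", "c"], ["A", "B", "C", "D"]))
```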
The data for training the Quasi-MT model can be a large parallel corpus, which consists of [source sentence, gold-label sentence] pairs. For training the QE model, the data can be MTQP data, which consists of [source sentence, translation output, quality label] tuples.
For any machine learning problem with sentence pairs as input, such as textual entailment, semantic similarity, etc., the MTQP approaches discussed herein can be used to provide a feature score for the input. It is important to note that the effectiveness of a machine translation system can be limited by the quality of the data used to train the machine translation model(s). In particular, for models trained on sentence pairs, the performance of a given model may be highly dependent on the quality of the dataset. A large number of sentence pairs may be collected (e.g., hundreds of thousands, millions or more), and the MTQP service may then be used to score all sentence pairs so that only high-quality sentence pairs are kept in the dataset. However, in some situations it can be beneficial to avoid using the same MTQP service to evaluate the resulting trained machine translation models, in order to avoid bias.
As discussed herein, the MTQP model takes a sentence pair as the input, and returns a (predicted) score as the output.
According to one aspect of the technology, there may be different tiers for the predicted MTQP score. For instance, in a tier 1 scenario, a neural machine translation model may be fixed (static), which means that no training is needed for this model (or it has been previously trained offline). Here, the language pairs that are supported depend on the neural machine translation model. The tier 1 scenario can be used to obtain a forced decoding score, which is calculated by summing up the cross-entropy (e.g., the log probability) of each token, and then normalizing (dividing) by the sentence length.
For example, the input sentence pair may be (“a b c d”, “e f g”). The following steps can be performed to calculate the forced decoding score (FDS). First, run the translation model on this sentence pair; at each token position in the second sentence, the model produces a probability distribution. Assume that the distribution produced at the first token is {a:0.5, b:0.3, e:0.1, g:0.1}. Then the log probability for “e” at the first token is log(0.1). Similarly, the log probability for “f” can be obtained at the second token, and for “g” at the third token. Finally, the system can sum up all of the log probabilities and divide by the number of tokens (e.g., the sentence length), which is 3 in this example. This score is the forced decoding score. A benefit of this approach is that it does not require additional training data. It is also applicable to the machine translation-supported language pairs.
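The following is a hedged, self-contained sketch of this calculation, assuming the per-position probability distributions from forced decoding are already available; the distributions mirror the worked example above and are otherwise illustrative.

```python
# Forced decoding score: sum the log probability of each target token under
# the distributions produced when the model is forced to follow the given
# target sentence, then normalize by the sentence length.
import math

def forced_decoding_score(target_tokens, distributions):
    # distributions[k] maps candidate tokens to probabilities at position k
    log_probs = [math.log(distributions[k][token])
                 for k, token in enumerate(target_tokens)]
    return sum(log_probs) / len(target_tokens)   # normalize by sentence length

# Position-wise distributions for the target "e f g" (values illustrative).
dists = [
    {"a": 0.5, "b": 0.3, "e": 0.1, "g": 0.1},
    {"e": 0.2, "f": 0.6, "g": 0.2},
    {"f": 0.3, "g": 0.7},
]
print(forced_decoding_score(["e", "f", "g"], dists))
```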
Other tier scores, such as tier 2 and tier 3 scores, may be produced by the same model structure, but trained on different data sets either statically (tier 2) or dynamically (tier 3). Unlike the approach for tier 1, these other tiers need to be trained. By way of example, a tier 2 approach may be particularly beneficial for users (or apps) that require very high quality MTQP scores. In this case, a large training set can be collected for some popular language pairs, and the system may train a general model for each of those language pairs. Here, the model is static (trained offline). Each entry in the dataset contains a source sentence, a translation sentence produced by the machine translation system, and a label indicating whether the translation is good enough. In this way, the MTQP model is able to learn to distinguish between good and bad translations. However, human labeling for a large data collection is expensive (considering the annotator needs to be a bilingual speaker who can determine whether a translation is good or not), especially for low-resource languages. A tier 2 approach may employ around 100,000 samples (e.g., on the order of 80,000-120,000 samples) to train and validate the general MTQP model.
In one example of a tier 3 (dynamic training) approach, the user could provide custom training data to train a customized MTQP model. For instance, this may involve around 15,000 (e.g., on the order of 10,000-20,000) samples to train a customized MTQP model from scratch. The data size employed for fine-tuning from the general MTQP model could be much smaller, such as ⅓ the size (e.g., 5,000 samples). Here, because the custom training data may be tailored to a specific app (e.g., subtitles for a movie), it can produce very effective results.
As shown in view 400 of the figures, the MTQP model takes a sentence pair as input and generates a predicted score.
For the classification setting, the predicted score is an n-dimensional vector, where n is the number of classes and each element represents the probability of that class. For the regression setting, the predicted score is a single value. When in a training mode, the loss is calculated and the gradient can be back-propagated to update parameters in the MTQP model. For instance, a gradient descent approach, which finds a local minimum, can be used when training the MTQP model. For the classification setting, the loss is the cross-entropy loss. For the regression setting, the loss is the mean squared error (MSE). In one scenario, the predicted scores 510 may be normalized, for instance to make the distribution of predicted scores similar across different languages.
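By way of a non-limiting illustration of the two training settings, the following PyTorch sketch shows the corresponding losses; the tensor shapes and label values are hypothetical.

```python
# Classification setting: n-class probability vector trained with
# cross-entropy. Regression setting: single score trained with MSE.
import torch
import torch.nn as nn

logits = torch.randn(4, 2, requires_grad=True)   # batch of 4, n=2 classes
class_labels = torch.tensor([0, 1, 1, 0])
classification_loss = nn.CrossEntropyLoss()(logits, class_labels)

predicted = torch.rand(4)                        # one regression score per pair
regression_labels = torch.tensor([0.9, 0.2, 0.7, 0.4])
regression_loss = nn.MSELoss()(predicted, regression_labels)

classification_loss.backward()   # gradient back-propagated to update the model
```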
The Quasi-MT feature extractor 504 uses the internal scores of a Quasi-MT model, which is trained on a large parallel sentence corpus. The Quasi-MT model is trained by trying to predict each token in a gold-label sentence by using the information in both the source sentence and the gold-label sentence. For example, in view of the discussion above regarding Quasi-MT model 200, assume there is a source sentence [a, b, c] and a gold-label sentence [A, B, C, D]. In this case, A is predicted with [a, b, c] and [B, C, D]; B with [a, b, c] and [A, C, D]; C with [a, b, c] and [A, B, D]; and D with [a, b, c] and [A, B, C]. Note that for a conventional MT model, a beam search is needed when the model generates a translation at inference time, where it processes one token at a time (in a sequential manner). However, because Quasi-MT processes all tokens at the same time, a beam search is not needed.
The NMT feature extractor 504b uses internal scores from the encoder and decoder of the NMT model. Here, use of the encoder scores is optional. Besides these internal scores, this feature extractor may also use the mismatching features and Monte-Carlo dropout word-level confidence features. To use internal scores from the decoder, this feature extractor takes the output of the decoder (element 204).
The language model feature extractor 504c uses internal scores from two kinds of language models. The first one is a language model trained on a large corpus of the source language. The second one is a contrastive language model which is first trained on the large corpus and then incrementally trained on the corpus formed by the source sentences in the training sentence pairs. In addition to the internal scores from the two kinds of language models, this feature extractor also has the mismatch and entropy features as in the NMT feature extractor. The entropy (H_k) at position k may be obtained from the predictor's probability distribution P_k according to the following: H_k = −Σ_t P_k(t) log P_k(t), where the sum runs over the candidate tokens t.
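A minimal sketch of that entropy feature follows, assuming the predictor's per-position probability distribution over candidate tokens is available; the distribution values below are illustrative only.

```python
# H_k = -sum over candidate tokens t of P_k(t) * log P_k(t)
import math

def position_entropy(distribution):
    return -sum(p * math.log(p) for p in distribution.values() if p > 0)

print(position_entropy({"a": 0.5, "b": 0.3, "e": 0.1, "g": 0.1}))
```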
The source-side language model feature extractor 504c can be extended by using a contrastive language model, in which a second (adapted) language model is trained on the confidence estimation training data incrementally from the previous one. The aim of this second language model is to capture differences between the domain in which the machine translation and confidence estimation models are to be employed, and the domain in which the machine translation model was trained. Here, the same features as for the base language model are used for the adapted language model. In the case of using a contrastive language model feature extractor, the concatenation of the feature sequences from the two language models can be augmented with two difference features and sent to the LSTM layer, which encodes them into a fixed-dimensional feature, where:
arg max P_base(s_k) == arg max P_adapted(s_k) (binary)
log P_base(s_k*) − log P_adapted(s_k*)
The LogPr feature extractor 504d calculates log P(target|source)/len(target) from an NMT model as a single feature, based on the target (translated) sentence and the source sentence. Here, len(target) is the length of the target. The log P(t_k) produced by the NMT model at each position k in the target sentence T = [t_1, . . . , t_k, . . . , t_length(T)] is summed over all k = 1 . . . length(T):
log P(T) = Σ_k log P(t_k)
Ideally, the calculated value is equivalent to the forced decoding score.
Various adjustments may be made to the model structure. For instance, to evaluate confidence in the model, it may be run multiple times with different dropout rates, or may generate different top-n candidates during decoding. The more diverse the results, the less confident the model is, and the lower the MTQP score that should be produced. Dropout involves dropping out nodes at random during training. By way of example, the top-n candidates may have an n value of 5, or more or less.
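The following is a hedged sketch of the dropout-based variant of this idea, using a stand-in model; it only illustrates the sampling pattern in which dropout is kept active and the spread of the sampled outputs serves as an inverse confidence signal.

```python
# Monte-Carlo dropout sketch: run the model several times with dropout kept
# active at inference time; more diverse outputs imply lower confidence.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                      nn.Dropout(p=0.3), nn.Linear(32, 1))
model.train()                        # keep dropout active during inference

features = torch.randn(1, 16)        # stand-in input features
with torch.no_grad():
    samples = torch.stack([model(features) for _ in range(10)])

confidence_spread = samples.std()    # larger spread -> lower confidence
print(float(confidence_spread))
```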
A back-translation forced decoding score can also be used to evaluate system performance. For instance, since the forced decoding score for each sentence pair can be calculated directly, the system can swap the sentences in the pair and calculate the forced decoding score again; this is the back-translation forced decoding score. Then, using those two forced decoding scores, the system can combine them (for example, by simply taking the average value) to see whether performance improves. This would involve adding the FDS and the back-translation FDS to the features. Just as with the mismatching features, the system can add any FDS features, which can make the MTQP model better because there would be more features overall.
In another scenario, the NMT decoder may generate posterior probability lattices. In this case, the posterior probability for each token on the target side could be used for confidence scoring. This functionality is also applicable in other areas, e.g., generating alternatives for a given token/phrase on the target side.
It can be beneficial to control the amount of noise introduced into the downstream machine translation pipeline. In order to help evaluate performance, different metrics may be employed. For instance, to evaluate noise (or whether the translation data is accurate enough to be used by the translation system), a primary performance metric R@P=t can be used. Here, R denotes recall (or sensitivity), which corresponds to the percentage of relevant instances that were retrieved by the system. P denotes precision (or the positive predictive value), which corresponds to the percentage of retrieved instances that are relevant. In this evaluation, the metric maximizes recall subject to the constraint that precision is above the threshold t.
Setting a high value for t controls the amount of noise introduced into the downstream pipeline. By way of example, a value of t on the order of 0.9 (e.g., +/−10%) provides sufficient precision for most machine translation situations. Having t=0.9 means that when the user directly uses the translations with a high MTQP score, 90% of them are truly good translations (that would not require post-editing). In other examples, t may be higher or lower than 0.9. This parameter may be tunable, for instance based on the type of information being translated, the type of application (e.g., video subtitles, scientific paper translations, etc.) or other factors.
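A hedged sketch of the R@P=t calculation described above follows, using scikit-learn's precision-recall curve; the labels and scores below are invented purely for illustration.

```python
# R@P=t: among all score thresholds, take the largest recall whose precision
# is at least t.
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(labels, scores, t=0.9):
    precision, recall, _ = precision_recall_curve(labels, scores)
    feasible = recall[precision >= t]
    return feasible.max() if feasible.size else 0.0

labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])      # 1 = good translation
scores = np.array([0.9, 0.4, 0.8, 0.7, 0.6, 0.95, 0.2, 0.5])
print(recall_at_precision(labels, scores, t=0.9))
```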
In the classification setting, where the data label is binary, evaluation metrics based on the area under the curve (AUC) of the precision-recall curve, or the AUC of the receiver operating characteristic curve, may also be employed. And in the regression setting, where the data label may have a value between 0 and 1, one or more of the following metrics may be employed: mean squared error (MSE), mean absolute error (MAE), the Pearson correlation coefficient (Pearson), Spearman's rank correlation coefficient (Spearman), or the Kendall rank correlation coefficient (Kendall). For situations where the data set is provided by a user, the metric information may be considered by the user when setting operating criteria. For instance, the precision-recall curve information may be used to decide the operating point (e.g., to set t).
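As a brief, non-authoritative illustration, these standard metrics can be computed with scikit-learn and SciPy as follows; the label and score arrays are invented for the example, and average_precision_score is used merely as a common summary of the precision-recall curve.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import (average_precision_score, mean_absolute_error,
                             mean_squared_error, roc_auc_score)

labels = np.array([0.9, 0.1, 0.8, 0.3, 0.6])    # regression-style labels in [0, 1]
scores = np.array([0.85, 0.2, 0.7, 0.4, 0.5])   # predicted MTQP scores

print(mean_squared_error(labels, scores), mean_absolute_error(labels, scores))
print(pearsonr(labels, scores)[0], spearmanr(labels, scores)[0],
      kendalltau(labels, scores)[0])

binary_labels = (labels > 0.5).astype(int)      # classification-style labels
print(roc_auc_score(binary_labels, scores),
      average_precision_score(binary_labels, scores))
```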
By way of example only for a tier 3 score using a custom-trained MTQP model, the primary performance metric may be: 0.2 R@P=0.9, which means to achieve at least 0.2 recall when the precision is 0.9. Both the recall value and the precision value may vary, e.g., by 5-15%, or more or less. For tier 1 and tier 2 scores, the target value t could be loosened, e.g., to between 0.75-0.85.
The following is an example evaluation for parallel sentence mining comparing the forced decoding score of tier 1 with the MTQP score of tier 3. In this example, the data sources may include legacy data, such as sampled sentence pairs from a mixture of translations and web data, with labeling indicating whether a sentence pair has consistent meaning. The data sources may also include mined data, where the sentence pairs are mined from a large corpus, e.g., from natural data on the web. Here, the mined data may not be translation data. According to one example, the system may split all sentences that appeared on the web into deduped monolingual sentences, and filter the high quality portions using a sentence quality score. From this, sentence pairs may be mined directly from the monolingual sentences using language-agnostic embeddings. In one scenario, the legacy data has about 30,000 sentence pairs and the mined data has about 10,000 sentence pairs. The language pairs evaluated include: English (En)-Chinese (Zh), English (En)-Russian (Ru), English (En)-Hindi (Hi), English (En)-French (Fr), English (En)-Spanish (Es), and English (En)-Portuguese (Pt).
Table 1 illustrates scores where the primary performance evaluation metric is R@P=0.9.
Table 2 illustrates scores where the primary performance evaluation metric is R@P=0.8.
As can be seen, the MTQP score outperforms (i.e., is higher than) the forced decoding score for each language translation except En:Fr, and for some languages is 50% or more higher than the forced decoding score.
Another metric may be used to show how well the forced decoding score model and the MTQP model perform when only the top translation samples are considered. This metric may be calculated by first ranking the samples according to the predicted scores (forced decoding score or MTQP score). Then, the top X percent are selected (e.g., 10%, 15%, 20%, 25% and 30%). For each top X percent, the number of samples in the group that provide satisfactory translations (e.g., that would not require any post-editing) is counted, as well as the number of samples that provide unsatisfactory translations (e.g., that may require significant post-editing). Note that there may be translations that fall in between satisfactory and unsatisfactory because they may require a minimal amount of post-editing. Based on such criteria, this metric is shown below in Table 3 for the En:Zh machine translations, where X is evaluated between 10% and 30%.
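The following is a hedged sketch of that calculation; the labels are shown as illustrative strings and would in practice come from human annotation.

```python
# Rank samples by predicted score, take the top X percent, and count how many
# are satisfactory versus unsatisfactory translations.
def top_percentile_counts(samples, x_percent):
    # samples: list of (predicted_score, label) where label is
    # "satisfactory", "unsatisfactory", or "in-between"
    ranked = sorted(samples, key=lambda item: item[0], reverse=True)
    top = ranked[:max(1, int(len(ranked) * x_percent / 100))]
    satisfactory = sum(1 for _, label in top if label == "satisfactory")
    unsatisfactory = sum(1 for _, label in top if label == "unsatisfactory")
    return satisfactory, unsatisfactory

samples = [(0.95, "satisfactory"), (0.9, "satisfactory"), (0.8, "in-between"),
           (0.6, "unsatisfactory"), (0.4, "unsatisfactory")]
print(top_percentile_counts(samples, x_percent=20))
```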
Tables 4 and 5 illustrate examples of other metrics as applied to training with classification or regression for a tier 3 MTQP approach, as compared to a tier 1 forced decoding score approach. Table 4 shows the results for English to French translation, and Table 5 shows the results for English to Russian translation. In these examples, R@P=0.9. For MSE or MAE, a lower value indicates a higher quality machine translation, while for Pearson, Spearman, Kendall and R@P, a larger value indicates a higher quality machine translation.
It can be seen that both the classification and regression training strategies perform comparably across various metrics. The actual performance can depend on the particular language pair. However, in some situations the classification approach may be more suitable, for instance when a translation memory can be easily converted to training data. In addition, the classification setting may be compatible with regression data, but not vice versa, because a regression label can be converted to a binary label by setting a threshold.
As shown by arrow 610, the user 602 may send a request to the translation API 604. Here, the request includes one or more source sentences. As shown by arrow 612, the translation API 604 sends a request to the MTQP service 606, which includes the received source sentence(s) and one or more translated sentences. As shown by arrow 614, the MTQP service 606 requests that the dependent service 608 perform model inference, and per arrow 616 the dependent service 608 returns a predicted score. For instance, when sentence pairs arrive at the MTQP service 606, preprocessing can be performed to convert those sentence pairs into tensors that can be consumed by the MTQP model. The tensors are then passed to the dependent service, where the MTQP model is served. After getting the output tensor back (arrow 616), the MTQP service 606 performs post-processing to convert the tensors into predicted MTQP scores. Based on the predicted score, the MTQP service 606 returns an MTQP score to the translation API 604, as shown by arrow 618. The translation API 604 returns the translated sentence(s) with the MTQP score(s), as shown by arrow 620. Alternatively, as shown by arrow 622, the user 602 may send a request with a source sentence and a translated sentence directly to the MTQP service 606. Here, in response the MTQP service 606 (after performing model inference and receiving the predicted score) provides the MTQP score to the user 602 directly. Based on the MTQP scores for the translated sentences, the system may flag translated sentences that fall below a quality threshold, and modify a translation database accordingly. Alternatively, the translated sentences that satisfy the quality threshold may be flagged and the database updated accordingly. The user(s) may access the high-quality translations and use them in various applications. Conversely, the translations flagged as not satisfying the quality threshold may be post-edited with adjustments so that they would meet the quality threshold. Thus, according to one aspect of the technology, nothing need be discarded even if a translated sentence falls below the quality threshold. Providing the MTQP scores to the user(s) leaves the choice to the user(s) about what to do with those scores.
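The following is a hedged, toy sketch of the service flow just described; the class, method names, and threshold are hypothetical and simply mirror the preprocessing, dependent-service inference, and post-processing steps.

```python
# Toy sketch of the MTQP service: preprocess the sentence pair into tensors,
# call the dependent model-serving service (arrows 614/616), and post-process
# the result into a predicted MTQP score (arrow 618).
class MTQPService:
    def __init__(self, dependent_service, quality_threshold=0.7):
        self.dependent_service = dependent_service
        self.quality_threshold = quality_threshold

    def score(self, source_sentence, translated_sentence):
        tensors = self._preprocess(source_sentence, translated_sentence)
        output = self.dependent_service(tensors)          # model inference
        mtqp_score = self._postprocess(output)
        return {"score": mtqp_score,
                "flag_low_quality": mtqp_score < self.quality_threshold}

    def _preprocess(self, source, translation):
        # In practice: tokenize and convert the pair into model-ready tensors.
        return (source.split(), translation.split())

    def _postprocess(self, output):
        # In practice: convert the output tensor into a calibrated score.
        return float(output)

# Stand-in dependent service in place of the served MTQP model.
service = MTQPService(dependent_service=lambda tensors: 0.8)
print(service.score("a b c", "A B C D"))
```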
TPU, CPU or other computing architectures can be employed to implement the MTQP model approach in accordance with the features disclosed herein. One example computing architecture is described below.
The processors may be any conventional processors, such as commercially available CPUs, TPUs, etc. Alternatively, each processor may be a dedicated device such as an ASIC or other hardware-based processor.
The input data, such as source sentences or translated outputs, may be operated on by the MTQP module to generate one or more predicted scores and associated information. The predicted scores may be used to filter the translation results so that only results exceeding a threshold (e.g., the top 10-40%) are provided to or otherwise utilized by the user. The user devices may utilize such information in various apps or other programs to provide accurate, high quality translations in accordance with a variety of applications as discussed herein. For instance, this can include using the scoring as a filter to identify high-quality sentence pairs to use as training data for a better translation model. According to one aspect of the technology, MTQP-analyzed data (sentence pairs) is used to perform NMT training data filtering. For instance, quality predictions from the MTQP model are used to “nominate” sentence pairs, e.g., as labeling data. This enables the system to curate a more suitable data set (with quality predictions satisfying some quality metric) that is then used to train machine translation models (e.g., an NMT model).
The computing devices may include all of the components normally used in connection with a computing device, such as the processor and memory described above, as well as a user interface subsystem for receiving input from a user and presenting information to the user (e.g., text, imagery and/or other graphical elements). The user interface subsystem may include one or more user inputs (e.g., at least one front (user) facing camera, a mouse, keyboard, touch screen and/or microphone) and one or more display devices (e.g., a monitor having a screen or any other electrical device that is operable to display information such as text, imagery and/or other graphical elements). Other output devices, such as speaker(s), may also provide information to users.
The user-related computing devices (e.g., 712-714) may communicate with a back-end computing system (e.g., server 702) via one or more networks, such as network 710. The network 710, and intervening nodes, may include various configurations and protocols including short range communication protocols such as Bluetooth™, Bluetooth LE™, the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi and HTTP, and various combinations of the foregoing. Such communication may be facilitated by any device capable of transmitting data to and from other computing devices, such as modems and wireless interfaces.
In one example, computing device 702 may include one or more server computing devices having a plurality of computing devices, e.g., a load balanced server farm or cloud computing system, that exchange information with different nodes of a network for the purpose of receiving, processing and transmitting the data to and from other computing devices. For instance, computing device 702 may include one or more server computing devices that are capable of communicating with any of the computing devices 712-714 via the network 710.
According to aspects of the technology, given a pair of sentences [a source sentence and a translation output], the MTQP service returns a score to indicate the quality of the translation. The MTQP score can be used in various ways by an application, service or other user. For instance, the score can be used to estimate the post-editing effort for each sentence. Alternatively, post-editing may be omitted for high-quality translation output that exceeds a selected threshold. Here, the user can directly use the high-MTQP-score translations and send other translations for post-editing (either automated or human post-editing), which saves post-editing cost. In this situation, an upper bound threshold can be configured for translations that do not need post-editing. This approach can also improve post-editing efficiency by bucketing scores into different queues so that translators can concentrate on similar types of work.
In another scenario, the system may make sure that low-quality translations falling below the selected threshold are post-edited. Here, the user may not have strict requirements about the quality of the translations, which would mean that most machine translations would be acceptable, and there may only be a need to pick out very low-quality translations (for example, the bottom 10%, or more or less) and have those translations post-edited. In this situation, a lower bound threshold can be configured for translations that need post-editing. Yet another scenario may involve direct human translation of poor machine translations. Here, the user can directly send source sentences for human translation and discard the corresponding machine translations with very low MTQP scores, because a poor translation could be misleading and the post-editor may need to take some time to read that translation. This approach removes the burden of poor machine translations and instead utilizes the translator to perform human translation directly. A sketch of this routing by score appears below.
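As a hedged, non-limiting illustration of routing translations by MTQP score, the following sketch uses hypothetical upper and lower bound thresholds; the values are not prescribed by the description above.

```python
# Route each translation to a queue based on its MTQP score: publish directly,
# send for post-editing, or send the source for direct human translation.
def route_translation(mtqp_score, upper_bound=0.8, lower_bound=0.3):
    if mtqp_score >= upper_bound:
        return "publish directly (no post-editing needed)"
    if mtqp_score >= lower_bound:
        return "send to post-editing queue"
    return "discard machine translation; send source for human translation"

for score in (0.9, 0.5, 0.1):
    print(score, "->", route_translation(score))
```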
Although the technology herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present technology. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present technology as defined by the appended claims.
The present application is a continuation of International Application No. PCT/US2021/40492, filed Jul. 6, 2021, the entire disclosure of which is hereby incorporated herein by reference.
 | Number | Date | Country
---|---|---|---
Parent | PCT/US2021/040492 | Jul 2021 | US
Child | 17852863 | | US