This specification relates to machine learning.
Manual translation of text by a human operator can be time consuming and costly. One goal of machine translation is to automatically translate text in a source language to corresponding text in a target language. There are several different approaches to machine translation including example-based machine translation and statistical machine translation. Statistical machine translation attempts to identify a most probable translation in a target language given a particular input in a source language. For example, when translating a sentence from French to English, statistical machine translation identifies the most probable English sentence given the French sentence.
A commonly used training technique in statistical machine translation is the Minimum Error Rate Training (MERT) technique. The MERT technique is described, for example, in Franz Josef Och, “Minimum Error Rate Training in Statistical Machine Translation,” Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, July 2003.
Many conventional statistical machine translation systems use the MERT technique. The MERT technique trains parameters for a linear statistical machine translation model directly with respect to automatic evaluation metrics, i.e., metrics that do not require human evaluation, which can be time-consuming. Some examples of automatic evaluation metrics include word error rate, position independent error rate, National Institute of Standards and Technology (NIST) score, and Bilingual Evaluation Understudy (BLEU) score.
The MERT technique directly optimizes the objective function of interest and thereby avoids approximating other objective functions, for example, likelihood or margin. However, the MERT technique is generally efficient for training model parameters (i.e., weights) for only a relatively small number of feature functions (e.g., fewer than 20 or 30 feature functions). The MERT technique is slow if a large number of feature functions are considered, because only one feature function is updated at a time and the computation involves iterating over the complete training corpus. Additionally, in the case of highly correlated features, the MERT technique tends to assign most of the weight to one of the correlated features, causing instability. Instability in the MERT technique occurs when different values of the initial weights result in very different final weights.
Systems, methods, and apparatuses including computer program products for machine learning are provided. In general, in one aspect, a method is provided. The method includes determining model parameters for a plurality of feature functions for a linear machine learning model, ranking the plurality of feature functions according to a quality criterion, and selecting, using the ranking, a group of feature functions from the plurality of feature functions to update with the determined model parameters.
Other embodiments of the aspect include systems and computer program products.
Implementations can include one or more of the following features. Determining model parameters for the plurality of feature functions can further include, for each feature function in the plurality of feature functions: calculating a source sentence error surface for each source sentence of a plurality of source sentences as a function of feature function model parameter, merging the source sentence error surfaces into an aggregate error surface for the feature function, and identifying an optimal model parameter for the feature function that minimizes the aggregate error surface for the feature function. The quality criterion can be BLEU score gain.
Selecting, using the ranking, the group of feature functions to update with the determined model parameters can further include, for each source sentence in a plurality of source sentences, calculating a source sentence error surface as a function of number of updates for ranked feature functions, merging all source sentence error surfaces into an aggregate error surface, and identifying an optimal number of updates for ranked feature functions that minimizes the aggregate error surface. Selecting, using the ranking, the group of feature functions to update with the determined model parameters can further include selecting the group of feature functions to include a particular feature function if updating the particular feature function with the respective optimal model parameter does not increase an error count. The linear machine learning model can be a linear statistical machine translation model.
In general, in one aspect, a method is provided. The method includes determining a group of candidate translations for each source sentence in a plurality of source sentences, and, for one or more iterations: calculating a first aggregate error surface and an optimal model parameter for each feature function in a plurality of feature functions for a linear statistical machine translation model, ranking the plurality of feature functions according to a quality criterion, calculating a second aggregate error surface and an optimal number of updates for ranked feature functions, determining a group of feature functions from the plurality of feature functions using the optimal number of updates for ranked feature functions, where the group of feature functions includes a particular feature function if updating the particular feature function with the respective optimal model parameter does not increase an error count, and updating each feature function of the group of feature functions with the corresponding optimal model parameter.
Other embodiments of the aspect include systems and computer program products.
Implementations can include one or more of the following features. Calculating the first aggregate error surface and the optimal model parameter for each feature function can further include, for each feature function in the plurality of feature functions: for each source sentence in the plurality of source sentences: calculating a minimum cost surface as a function of feature function model parameter, and calculating a source sentence error surface using the minimum cost surface, merging the source sentence error surfaces for each source sentence into the first aggregate error surface for the feature function, and identifying the optimal model parameter for the feature function that minimizes the first aggregate error surface for the feature function. The quality criterion can be BLEU score gain.
Calculating the second aggregate error surface and the optimal number of updates for ranked feature functions can further include, for each source sentence in the plurality of source sentences: calculating a minimum cost surface as a function of number of updates for ranked feature functions, and calculating a source sentence error surface using the minimum cost surface, merging all source sentence error surfaces into the second aggregate error surface, and identifying the optimal number of updates for ranked feature functions that minimizes the second aggregate error surface.
The aspect can further include recalculating the second aggregate error surface and the optimal number of updates for ranked feature functions using the determined group of feature functions. Updating with the optimal model parameters a group of feature functions can further include updating with the optimal model parameters reduced in step size. The first aggregate error surface and the optimal model parameter for each feature function can be calculated using a first training corpus, and the second aggregate error surface and the optimal number of updates for ranked feature functions can be calculated using a second training corpus.
Calculating the first aggregate error surface and the optimal model parameter for each feature function can further include calculating the first aggregate error surface and the optimal model parameter for each feature function in parallel across a plurality of machines. Calculating the second aggregate error surface and the optimal number of updates for ranked feature functions can further include calculating the second aggregate error surface and the optimal number of updates for ranked feature functions in parallel across a plurality of machines.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. The MERT technique is extended to scale to an arbitrary number (e.g., millions) of features and an arbitrary number (e.g., millions) of training examples. A translation system can efficiently calculate the effect of updating increasing groups of model parameters essentially simultaneously. Modifying step size in updates to model parameters can reduce overfitting to training data. This technique is easy to parallelize efficiently over many machines and provides solid improvements in BLEU score over previous techniques.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Like reference numbers and designations in the various drawings indicate like elements.
A commonly used training technique in statistical machine translation is the MERT technique. However, the MERT technique can also be applied to other machine learning applications and problems where parameters of a log-linear model need to be trained. The MERT technique and an extension to the MERT technique, as described below, can be used in speech recognition, optical character recognition, search ranking, and advertisement targeting, for example. The application determines the type of objective function for which parameter training is needed. For example, word error rate (e.g., how many words have been recognized correctly) can be used for speech recognition, and dialog success rate (e.g., how many dialogs have been handled successfully) can be used for dialog systems. Without loss of generality, the MERT technique and an extension to the MERT technique will be described below as applied to statistical machine translation.
An extension to the MERT technique allows a large number of features of a linear statistical machine translation model to be trained on a large number of training examples by optimizing the linear model with respect to an arbitrary error function. For example, a phrase-based statistical machine translation system can use the technique to train millions of lexicalized language model features (e.g., lexical n-gram features) to improve the BLEU score. BLEU is a method for evaluating the quality of text which has been translated from one natural language to another using machine translation. The BLEU score provides a measure of the statistical closeness of machine translations to reference translations.
An n-gram is a sequence of n consecutive words. An n-gram has an order, which is the number of words in the n-gram. For example, a 1-gram (or unigram) includes one word; a 2-gram (or bigram) includes two words. In some implementations, a translation system uses the technique to train other forms of language model features, e.g., long-distance language model features, phrase table features, or syntactic features.
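For illustration, a lexical n-gram feature can be implemented as a count of how often a particular word sequence occurs in a candidate translation. The following sketch, with a hypothetical ngram_features helper, shows one simple way such features might be extracted; it is not the feature extraction of any particular system.

```python
def ngram_features(tokens, max_order=2):
    """Count lexical n-gram features (here, unigrams and bigrams) in a candidate translation."""
    counts = {}
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            ngram = " ".join(tokens[i:i + order])
            counts[(order, ngram)] = counts.get((order, ngram), 0) + 1
    return counts

# Example: n-gram features for a short candidate translation.
print(ngram_features("the house of the president".split()))
```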
For a given source sentence f in a first language (e.g., French), statistical machine translation attempts to identify the most probable target sentence e in a second language (e.g., English) given the source sentence. A model parameter λm corresponds to each of a group of M feature functions hm(e,f), where m=1, . . . , M. In some implementations, the model parameter λm has a default value of zero. The cost of a target sentence is defined as Σ_{m=1}^{M} λm hm(e,f), which the statistical machine translation system will seek to minimize according to the following decision rule:

ê(f; λ1M) = argmin_{e∈C} Σ_{m=1}^{M} λm hm(e,f)   (Eqn. 1)
The translation system identifies as ê(f; λ1M) the target sentence e (e.g., where e is in a group C of multiple target sentences) for which the cost, as defined by Eqn. 1, has the smallest value. The modeling problem includes developing suitable feature functions h1M that capture the relevant properties of the translation task. The training problem focuses on identifying suitable model parameter values λ1M.
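The decision rule of Eqn. 1 amounts to scoring each candidate translation with the weighted sum of its feature values and keeping the candidate with the smallest cost. The following sketch assumes a hypothetical sparse representation in which each candidate carries a dictionary of feature values hm(e,f); it is illustrative rather than a description of any particular decoder.

```python
def cost(features, lam):
    """Linear model cost of one candidate: sum over m of lambda_m * h_m(e, f)."""
    return sum(lam.get(m, 0.0) * h for m, h in features.items())

def best_translation(candidates, lam):
    """Decision rule of Eqn. 1: return the candidate with the smallest cost."""
    return min(candidates, key=lambda e: cost(e["features"], lam))

# Toy example with two candidates and two feature functions.
lam = {"lm": 0.5, "tm": 1.0}
candidates = [
    {"text": "the house", "features": {"lm": 2.0, "tm": 1.0}},
    {"text": "house the", "features": {"lm": 4.0, "tm": 0.5}},
]
print(best_translation(candidates, lam)["text"])  # -> "the house"
```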
One assumption made by the translation system is that the number of errors in target sentence e is calculated by comparing the target sentence e with a reference translation r using a function E(r,e). Another assumption is that the number of errors for a group of target sentences e1S and the corresponding group of reference translations r1S are obtained by summing the errors for the individual target sentences: E(r1S, e1S) = Σ_{s=1}^{S} E(rs, es).
A single error count is typically insufficient to calculate corpus-wide scores (i.e., scores calculated across a representative corpus of source sentences) for common metrics including, for example, a BLEU score or F-Measure. However, it is typically straightforward to accumulate the sufficient statistics to calculate such corpus-level scores.
The BLEU score is described, for example, in Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, “BLEU: a Method for Automatic Evaluation of Machine Translation,” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, July 2002. The BLEU score provides a geometric mean of the ratio of matching n-grams of length one to four between a candidate translation and a group of reference translations, along with a length term penalizing short sentences. The sufficient statistics of the BLEU score are the number of matching n-grams (i.e., n-gram precisions for the group of reference translations), the candidate translation length, and the effective length of the reference translations of the group.
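As a worked illustration, the corpus-level BLEU score can be recomputed from the accumulated sufficient statistics alone. The following sketch assumes the n-gram match counts, candidate n-gram counts, candidate length, and effective reference length have already been accumulated over the corpus; it uses a standard unsmoothed formulation and is not tied to any particular BLEU implementation.

```python
import math

def bleu_from_stats(matches, totals, cand_len, ref_len):
    """BLEU from sufficient statistics.

    matches[n-1]: number of matching n-grams for n = 1..4
    totals[n-1]:  number of candidate n-grams for n = 1..4
    cand_len:     total candidate translation length
    ref_len:      effective reference length
    """
    if min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions for n = 1..4.
    log_precisions = [math.log(m / t) for m, t in zip(matches, totals)]
    geo_mean = math.exp(sum(log_precisions) / len(log_precisions))
    # Brevity penalty for candidates shorter than the references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return brevity_penalty * geo_mean

# Example: statistics accumulated over a (tiny) corpus.
print(bleu_from_stats([9, 6, 4, 2], [10, 9, 8, 7], cand_len=10, ref_len=11))
```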
As part of parameter training, the translation system obtains a minimal error count on a representative corpus of source sentences f1S, given reference translations r1S and a group Cs={es,1, . . . , es,N} of N different candidate translations (i.e., target sentences) for each source sentence fs. The error count for a specific sentence s, which for notational simplicity will be referred to as Es(λ1M), is given by:
Es(λ1M) = Σ_{k=1}^{N} E(rs, es,k) δ(ê(fs; λ1M), es,k)   (Eqn. 2)
The δ(ê(fs; λ1M), es,k) function of Eqn. 2 is the Kronecker delta function, which is equal to 1 when ê(fs; λ1M) is equal to es,k and 0 otherwise. The translation system obtains the optimal parameter values by minimizing the sum of the errors over all source sentences in the representative corpus:

λ̂1M = argmin_{λ1M} Σ_{s=1}^{S} Es(λ1M)   (Eqn. 3)
This optimization criterion is computationally difficult as the objective function has a large number of local optima, is piecewise constant, and does not allow the computation of a gradient.
The MERT technique can be the basis for the extension technique described in further detail below. The MERT technique trains parameters for a linear statistical machine translation model directly with respect to an automatic evaluation criterion (e.g., the BLEU score) that measures translation quality. A globally optimal value for each model parameter λm is identified while holding all other model parameters fixed. Each corresponding feature function hm(e,f) is updated greedily in turn (i.e., by applying the optimal value as the model parameter for the particular feature function hm(e,f) without regard to other feature functions).
The MERT technique includes several steps: calculating a minimum cost surface function for each source sentence fs of a representative corpus; calculating an error surface Es(λm) for each source sentence fs of the representative corpus; calculating an aggregate error surface E(λm) across all source sentences f1S of the representative corpus; and identifying a globally optimal model parameter λ̂m, which minimizes the aggregate error surface E(λm).
To find the lowest-cost (e.g., as defined by Eqn. 1) of a group C={e1, . . . , eN} of candidate translations (i.e., target sentences) as a function of λm, the translation system solves an optimization problem of the following functional form:

ê(f; λm) = argmin_{e∈C} {λm hm(e,f) + K(e,f)}   (Eqn. 4)

where K(e,f) = Σ_{m′≠m} λm′ hm′(e,f) corresponds to the weighted feature function sum excluding the feature m that is being optimized. Therefore, K(e,f) is a constant with respect to λm. If cost is plotted as a function of λm, every candidate translation e∈C corresponds to a line with slope hm(e,f), as illustrated in example 100.
The minimum cost surface f(f; λm) 120, illustrated as the bold line in example 100, is the piecewise linear lower envelope of these lines:

f(f; λm) = min_{e∈C} {λm hm(e,f) + K(e,f)}   (Eqn. 5)
As described above, each candidate translation e∈C has an associated error count function defined by E(r,e). Using this error function, the translation system calculates the error surface (i.e., an error count) for each candidate translation e in the minimum cost surface f(f; λm) 120. This error surface is the source sentence error surface Es(λm) as a function of λm, which defines the error count of the minimum cost surface f(f; λm) 120 at every possible value of λm. The source sentence error surface Es(λm) for a specific sentence s, illustrated in example 200, is given by:
Es(λm) = Σ_{k=1}^{N} E(rs, es,k) δ(ê(fs; λm), es,k)   (Eqn. 6)
Once the translation system has calculated source sentence error surfaces Es(λm) for all source sentences f1S of the representative corpus, the translation system aggregates error counts while traversing the source sentence error surfaces Es(λm) in parallel, merging them into the aggregate error surface E(λm), illustrated in example 300, and identifying the optimal parameter value λ̂m that minimizes E(λm).
The model parameter λm for the feature function hm(e,f) can then be updated to the identified optimal parameter value λ̂m.
The overall MERT optimization technique therefore includes the following steps:
1. ERROR SURFACE CALCULATION: For each source sentence fs, calculate the piecewise linear minimum cost surface f(f; λm) and its associated source sentence error surface Es(λm) as functions of λm.
2. MERGING AND MINIMIZATION: Merge all source sentence error surfaces Es(λm) into the aggregate error surface E(λm) and identify the optimal parameter value λ̂m, which minimizes the aggregate error surface E(λm) (a sketch of these two steps follows).
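The two steps above can be sketched for a single feature function hm as follows. The sketch assumes each candidate translation has been reduced to its slope hm(e,f), its offset K(e,f), and a per-sentence error count E(r,e) that is summed across sentences (per the additivity assumption above); for metrics such as BLEU, a real system would merge sufficient statistics rather than plain error counts. This is an illustrative line-sweep implementation, not the system's actual code.

```python
def sentence_error_surface(cands):
    """Error surface Es(lambda_m) for one source sentence.

    cands: list of (slope, offset, error_count) per candidate translation, where
    cost(lambda_m) = slope * lambda_m + offset (cf. Eqn. 4) and error_count is
    the candidate's error against the reference.
    Returns (base_error, deltas): the error of the minimum-cost candidate as
    lambda_m -> -infinity, plus (threshold, error_change) pairs at every point
    where the minimizing candidate changes."""
    # As lambda_m -> -infinity the line with the largest slope is minimal, so
    # sweep the lines in order of decreasing slope (ties: keep the lower offset).
    cands = sorted(cands, key=lambda c: (-c[0], c[1]))
    envelope, thresholds = [], []
    for slope, offset, err in cands:
        if envelope and envelope[-1][0] == slope:
            continue  # parallel line with a higher offset is never minimal
        while envelope:
            s0, o0, _ = envelope[-1]
            x = (offset - o0) / (s0 - slope)  # crossing with the last envelope line
            if thresholds and x <= thresholds[-1]:
                envelope.pop()    # that line never becomes minimal; discard it
                thresholds.pop()
            else:
                thresholds.append(x)
                break
        envelope.append((slope, offset, err))
    base_error = envelope[0][2]
    deltas = [(x, envelope[i + 1][2] - envelope[i][2]) for i, x in enumerate(thresholds)]
    return base_error, deltas

def optimal_parameter(per_sentence_candidates):
    """Merge all sentence error surfaces into E(lambda_m) and minimize it."""
    total_error, events = 0, []
    for cands in per_sentence_candidates:
        base, deltas = sentence_error_surface(cands)
        total_error += base
        events.extend(deltas)
    events.sort()
    running = best_error = total_error
    best_lambda = events[0][0] - 1.0 if events else 0.0  # leftmost unbounded interval
    for i, (x, delta) in enumerate(events):
        running += delta
        if running < best_error:
            # Place lambda_m in the interior of the interval that begins at x.
            right = events[i + 1][0] if i + 1 < len(events) else x + 1.0
            best_error, best_lambda = running, (x + right) / 2.0
    return best_lambda, best_error
```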
The MERT technique is generally efficient for only a relatively small number of features and does not scale well to a large number of features, because only one feature is updated at a time and the computation involves iterating over the complete training corpus. However, an extension of the MERT technique allows the effect of updating increasing groups of features (i.e., batch updates) to be efficiently calculated at once. Further efficiencies are gained by parallelizing the MERT technique and the extension technique, including the efficient batch updates, over many machines.
A translation system determines model parameters for multiple feature functions for a linear machine learning model (step 1010). In some implementations, the linear machine learning model is a linear statistical machine translation model. The model parameters can be determined using the MERT technique described above. The translation system ranks the multiple feature functions according to a quality criterion (step 1020). The translation system selects, using the ranking, a group of feature functions from the multiple feature functions to update with the determined model parameters (step 1030). Typically, the group of feature functions does not include all of the multiple feature functions. Ranking of the feature functions and selection of the group of feature functions will be described in more detail below.
The translation system determines a group of candidate translations for each source sentence of multiple source sentences (step 1110). The translation system can use a decoder to apply a language model (e.g., a syntactic language model) and a translation model (e.g., word alignment or phrase-based translation) to the respective source sentence in order to determine each candidate translation in the group of candidate translations. In particular, for a source sentence f, the decoder can determine the candidate sentence e that maximizes the product of P(e) (i.e., the probability of e) determined by the language model and P(f|e) (i.e., the conditional probability of f given e) determined by the translation model.
The translation system calculates a first aggregate error surface and an optimal model parameter for each feature function of multiple feature functions for a linear statistical machine translation model (step 1120). For example, the first aggregate error surfaces and the optimal model parameters can be calculated using the MERT technique. The translation system ranks the plurality of feature functions according to a quality criterion (step 1130). As described above, the MERT technique identifies an optimal parameter value λ̂m for each feature function hm. The aggregate error surface E(λ̂m) at the optimal parameter value λ̂m is a measure of the quality of the corresponding feature function hm. The translation system can rank the feature functions h1M by quality, for example, by the gain in the evaluation metric (e.g., a gain in BLEU score). For the following analysis, it is assumed that the feature functions h1M are sorted according to quality, such that E(λ̂m−1)≦E(λ̂m), i.e., the highest-quality feature functions are ranked first.
With the ordered list of feature functions h1M, the translation system determines which subgroup of feature function updates results in a minimal error count. The problem can be simplified by using the quality ranking of the feature functions and restricting the considered subgroups to the M subgroups ordered by quality:
{{h1}, {h1, h2}, . . . , {h1, . . . , hM}}
Using only the first m ordered feature functions (i.e., hi(e,f), where i=1, . . . , m) to rank the candidate translations, the translation system obtains the following decision rule for finding the lowest-cost candidate translation out of the group of candidate translations C={e1, . . . , eN}:
In the corresponding Eqn. 4, each candidate translation e∈C corresponds to a line when cost is plotted as a function of λm. In contrast, in Eqn. 7, each candidate translation e∈C corresponds to a piecewise constant surface, as illustrated in example 400.
The minimum cost surface f(f; m) 420 for the number of updates m for a source sentence fs is defined by the function forming the lower boundary of these piecewise constant surfaces, illustrated as the bold line in example 400.
Each candidate translation e∈C has an associated error count function defined by E(r,e). Using the error count function E(r,e), the translation system can obtain the error surface (i.e., an error count) for each candidate translation e∈C in the minimum cost surface f(f; m) 420, as illustrated in example 500. The resulting source sentence error surface Es(m) is given by:
Es(m) = Σ_{k=1}^{N} E(rs, es,k) δ(ê(fs; m), es,k)   (Eqn. 9)
The translation system then merges all source sentence error surfaces Es(m) into an aggregate error surface E(m) and identifies the optimal number of updates m̂ for ranked feature functions that minimizes the aggregate error surface E(m).
The translation system determines a group of feature functions from the multiple feature functions using the optimal number of ranked feature function updates (step 1150). The translation system updates each feature function of the group of feature functions with the corresponding optimal model parameter (step 1160). The translation system applies the optimal parameter values λ̂1, . . . , λ̂m̂ to update the corresponding feature functions in the determined subgroup {h1, . . . , hm̂} while retaining the present values λm for all feature functions not in the subgroup {h1, . . . , hm̂}.
The translation system repeats step 1120 through step 1160 of example process 1100 if multiple iterations are to be performed (decision 1170). For example, the number of iterations can be determined using a threshold, e.g., a convergence criterion or a minimum gain in the evaluation metric.
The efficient batch update technique therefore includes the following steps:
1. ERROR SURFACE CALCULATION: For each source sentence fs, calculate the piecewise constant minimum cost surface f(f; m) 420 and its associated source sentence error surface Es(m) as functions of m.
2. MERGING AND MINIMIZATION: Merge all source sentence error surfaces Es(m) into the aggregate error surface E(m) and identify the optimal number of updates m̂, which minimizes the aggregate error surface E(m).
The steps of the efficient batch update technique mirror the steps of the MERT technique. However, the resulting aggregate error surfaces are different. Instead of being a function of λm, the aggregate error surface in step 2 of the efficient batch update is a function of m. Overall, the batch update technique is generally efficient. In step 1, the translation system only processes the complete group of N·S candidate translations once. Additionally, for each candidate translation e∈C, the translation system only iterates through all non-zero optimal parameter values λ̂m. In step 2, the translation system iterates through the non-trivial decision boundaries of the S sentence-specific error surfaces Es(m).
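One plausible formulation of these two steps is sketched below: the cost of each candidate is tracked as the first m ranked features are switched from their current values λm to their optimal values λ̂m (with zero-initialized parameters this is the same as using only the first m updated features). The candidate and parameter representations are assumptions for illustration, not the system's actual data structures, and the per-m minimum below is a straightforward rather than a maximally efficient computation.

```python
def batch_error_surface(sentence, lam, lam_hat, ranked_features):
    """Error surface Es(m) for one source sentence as a function of the number m
    of ranked feature updates applied.

    sentence: list of candidates, each a dict with sparse 'features' ({name: value})
              and 'error' (error count against the reference translation).
    lam:      current model parameters {name: value}
    lam_hat:  per-feature optimal parameters from the MERT step
    ranked_features: feature names sorted by quality, best first
    Returns errors[m] for m = 0..len(ranked_features)."""
    costs = [sum(lam.get(f, 0.0) * h for f, h in c["features"].items()) for c in sentence]
    errors = []
    for m in range(len(ranked_features) + 1):
        if m > 0:
            f = ranked_features[m - 1]
            delta = lam_hat[f] - lam.get(f, 0.0)
            if delta != 0.0:
                for i, c in enumerate(sentence):
                    h = c["features"].get(f, 0.0)
                    if h:
                        costs[i] += delta * h  # switch feature f to its optimal value
        best = min(range(len(sentence)), key=costs.__getitem__)
        errors.append(sentence[best]["error"])
    return errors

def optimal_update_count(corpus, lam, lam_hat, ranked_features):
    """Merge the sentence error surfaces into E(m) and return the minimizing m."""
    aggregate = [0] * (len(ranked_features) + 1)
    for sentence in corpus:
        for m, e in enumerate(batch_error_surface(sentence, lam, lam_hat, ranked_features)):
            aggregate[m] += e
    m_hat = min(range(len(aggregate)), key=aggregate.__getitem__)
    return m_hat, aggregate
```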
Although the translation system can efficiently calculate the impact of updating millions of feature functions, problems can exist if there are correlated features. Correlated features are common in machine learning problems. For example, strong correlations can occur in translation systems using individual n-grams as features. Strong correlations can be expected between n-grams that subsume or are subsumed by each other. For example, the effects of updating the features “of” and “of the” by applying the identified optimal parameter values λ̂m are expected to be highly correlated, which suggests that it might be better not to update the features together.
In some implementations, the translation system avoids applying the optimal parameter value λ̂m (i.e., the feature weight) to update a feature if the update leads to an increase in the error count. Instead, the optimal parameter value λ̂m for the detrimental feature is not applied (i.e., the feature model parameter remains at its present value λm). The translation system can then repeat the steps (i.e., run another iteration) of the batch update technique to produce a new aggregate error surface E(m) without including the updates to the detrimental features. Each iteration of this filtering step typically reduces the resulting error count.
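Building on the batch-update sketch above, the filtering can be read as follows: a ranked feature is treated as detrimental if applying its update increases the aggregate error surface E(m) relative to E(m−1), and the surface is then recomputed without the detrimental features. This reading of the criterion is an assumption; a minimal sketch:

```python
def decorrelation_filter(corpus, lam, lam_hat, ranked_features):
    """Feature decorrelation filtering (sketch): drop ranked features whose updates
    increase the aggregate error, then recompute E(m) without them.
    Uses optimal_update_count() from the batch-update sketch above."""
    features = list(ranked_features)
    while True:
        m_hat, aggregate = optimal_update_count(corpus, lam, lam_hat, features)
        detrimental = [features[m - 1]
                       for m in range(1, len(features) + 1)
                       if aggregate[m] > aggregate[m - 1]]
        if not detrimental:
            return features[:m_hat], m_hat  # selected group and optimal update count
        features = [f for f in features if f not in detrimental]
```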
Table 1 illustrates an example of the top seven n-gram features ranked by gain in the BLEU score, where the gain in the BLEU score is calculated under the assumption that the translation system updates each feature individually. As mentioned above, the effects of updating the second feature (“of”) and the third feature (“of the”) are expected to be highly correlated.
Combining the MERT technique with batch updating and feature decorrelation filtering results in a combined technique with three components, DECODE, MERT, and BATCH, performed over one or more iterations:
In the DECODE step, the translation system translates the training corpus, which potentially includes millions of source sentences, and produces an N-best list of candidate translations Cs for each source sentence fs. An N-best list of candidate translations Cs is a list of the top N candidate translations for the respective source sentence fs as determined by, for example, translation scores or confidence estimations. The remaining steps can be implemented as described above. In some implementations, the numbers of iterations I and J are fixed. In other implementations, the numbers of iterations I and J are determined using a threshold, e.g., a convergence criterion or a minimum gain in the evaluation metric.
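A possible high-level composition of the three components is sketched below. The decode_nbest(), mert_step(), and batch_step() names are hypothetical stand-ins for the DECODE, MERT, and BATCH components described above, and the mapping of the iteration counts I and J onto the two loops is an assumption.

```python
def combined_training(source_sentences, lam, num_decode_iters=2, num_train_iters=3):
    """Structural sketch of the combined technique (not the actual implementation).
    decode_nbest, mert_step, and batch_step are hypothetical stand-ins."""
    for _ in range(num_decode_iters):
        # DECODE: translate the training corpus, producing an N-best list per sentence.
        nbest_lists = [decode_nbest(f, lam) for f in source_sentences]
        for _ in range(num_train_iters):
            # MERT: per-feature optimal parameters and a quality ranking of the features.
            lam_hat, ranked_features = mert_step(nbest_lists, lam)
            # BATCH: batch updating with feature decorrelation filtering.
            selected_features, _ = batch_step(nbest_lists, lam, lam_hat, ranked_features)
            for f in selected_features:
                lam[f] = lam_hat[f]  # apply only the selected updates
    return lam
```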
The batch updating and feature decorrelation filtering can be used with other machine learning techniques for linear models, not just the MERT technique. For example, a translation system can use conditional random fields or the Perceptron algorithm to learn feature function model parameters for a linear model, rank the features according to different quality criteria, and use the batch updating and feature decorrelation filtering to select an optimal group of features according to the BLEU score, the NIST score, or another automatic evaluation metric.
In some implementations, the translation system reduces the step size of a feature model parameter update to reduce overfitting. That is, instead of immediately updating each feature model parameter from its current value λm to its optimal value λ̂m, the translation system sets the feature model parameter to an intermediate value, λm+γ(λ̂m−λm), using a step size parameter γ (e.g., γ=0.1). Smaller step sizes can also reduce the problem of instability in the case of correlated features. In some cases, reducing the step size might increase the number of iterations needed to reach a determined threshold for the evaluation metric.
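A minimal sketch of the damped update, assuming the parameters are kept in dictionaries keyed by feature name:

```python
def damped_update(lam, lam_hat, selected_features, gamma=0.1):
    """Move each selected parameter only part of the way toward its optimum:
    lambda_m <- lambda_m + gamma * (lambda_hat_m - lambda_m)."""
    updated = dict(lam)
    for f in selected_features:
        current = lam.get(f, 0.0)
        updated[f] = current + gamma * (lam_hat[f] - current)
    return updated
```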
Other techniques to reduce overfitting are possible. For example, the translation system can calculate the aggregate error surface E(m) and hence, the optimal number of updates m̂, using training data that is different from the training data used to determine the optimal feature model parameters λ̂m. A much smaller training corpus is needed to determine the optimal number of updates m̂ than is needed to determine the optimal feature model parameters λ̂m, because only one parameter, m̂, is being optimized. In another example, the translation system can limit the number of considered features by retaining, after ranking features by quality, a determined number of features for the batch updating. Alternatively, the translation system only uses those features that occur in a determined number of sentences.
Training a large number of features for machine translation over very large training corpora is computationally challenging. However, the combined technique of MERT, batch updating, and correlated feature filtering, can be parallelized over many machines using an efficient distributed implementation, for example, using the MapReduce programming model. The MapReduce programming model is described, for example, in Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 137-150, December 2004.
MapReduce is a powerful programming model for processing large data sets. The basic paradigm is borrowed from functional programming and requires the user to provide two functions: map and reduce. The map function processes a single key and value pair from the input and produces a set of intermediate key and value pairs. The reduce function merges all values associated with a single intermediate key, producing a set of output values. MapReduce's run-time system handles the details of coordination among thousands of machines. The strength of MapReduce lies in parallel execution. Typically, the map function is replicated on different machines operating on different parts of the input data. The intermediate data is partitioned by the intermediate key so that the reduce function can also be executed in parallel. Because there are no dependencies among map workers or among reduce workers, the parallelization of each step is highly efficient.
Each of the three components (i.e., DECODE, MERT, and BATCH) of the combined technique described above corresponds to one MapReduce.
For the DECODE step, MapReduce is used solely to handle parallelization. The input to this step includes the training corpus sentences f1s. The map function implements the actual translation process and outputs an N-best list of candidate translations Cs for each input sentence fs. The reduce function outputs these N-best lists into a distributed file representation.
The input to the MERT step includes the group of all N-best lists output from the DECODE step. The map function calculates the error surface Es(λm) for a single sentence. For example, each machine can independently calculate the error surface Es(λm) for a respective sentence. The reduce function merges all error surfaces Es(λm) for a single feature hm, producing the aggregate error surface E(λm) and identifying the optimal value λ̂m for the feature over the entire corpus.
The input to the BATCH step includes the output from the DECODE and MERT steps. The map function reads the output of the MERT step and calculates the error surface for a single sentence Es(m). The reduce function merges all error surfaces Es(m) to produce an aggregate error surface E(m). This reduce function can be effectively identical to the reduce function for the MERT step.
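The MERT-step MapReduce can be sketched as a map function and a reduce function; the BATCH-step reduce would merge Es(m) surfaces analogously. The key/value layout and the per-candidate precomputation of slopes hm(e,f) and offsets K(e,f) are assumptions for illustration, the shuffle is simulated in memory rather than run on a MapReduce framework, and sentence_error_surface() is the single-feature sweep sketched earlier.

```python
from collections import defaultdict

def mert_map(sentence_id, nbest, feature_ids):
    """Map: for one sentence, emit (feature m, per-sentence error surface Es(lambda_m)).
    Each candidate is assumed to carry, per feature, its slope 'h', offset 'K',
    and an 'error' count against the reference."""
    for m in feature_ids:
        cands = [(c["h"][m], c["K"][m], c["error"]) for c in nbest]
        yield m, sentence_error_surface(cands)

def mert_reduce(feature_id, surfaces):
    """Reduce: merge all per-sentence surfaces for one feature into E(lambda_m)
    and identify the optimal parameter value for that feature."""
    total = sum(base for base, _ in surfaces)
    events = sorted(e for _, deltas in surfaces for e in deltas)
    running = best_error = total
    best_lambda = events[0][0] - 1.0 if events else 0.0
    for i, (x, delta) in enumerate(events):
        running += delta
        if running < best_error:
            right = events[i + 1][0] if i + 1 < len(events) else x + 1.0
            best_error, best_lambda = running, (x + right) / 2.0
    return feature_id, best_lambda, best_error

def run_mert_mapreduce(nbest_lists, feature_ids):
    """In-memory simulation of the shuffle: group map outputs by feature, then reduce."""
    grouped = defaultdict(list)
    for s, nbest in enumerate(nbest_lists):
        for m, surface in mert_map(s, nbest, feature_ids):
            grouped[m].append(surface)
    return [mert_reduce(m, surfaces) for m, surfaces in grouped.items()]
```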
Combining the MERT technique with batch updating and feature decorrelation filtering results in a technique that can efficiently train a large number of features and improve the BLEU score over conventional statistical machine translation systems. This technique can improve translation quality for many applications of statistical machine translation, including automatic translation of text content on the Internet and military applications. The technique can also be applied to other problems where parameters of a log-linear model need to be trained.
The memory 1216 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the system 1200. The memory 1216 can store processes related to the functionality of a machine translation engine, for example. The storage device 1252 is capable of providing persistent storage for the system 1200. The storage device 1252 can include a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage media. The storage device 1252 can store the various databases described above. The input/output device 1254 provides input/output operations for the system 1200. The input/output device 1254 can include a keyboard, a pointing device, and a display unit for displaying graphical user interfaces.
The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 60/920,242, titled “Minimum Error Rate Training with Millions of Features for Statistical Machine Translation,” filed Mar. 26, 2007, which is incorporated here by reference.