Machine translation is a process by which a textual input in a first language is automatically translated, using a computerized machine translation system, into a textual output in a second language. Some such systems operate using word-based translation. In those systems, each word in the input text, in the first language, is translated into some number of corresponding words in the output text, in the second language. Better performing systems, however, are referred to as phrase-based translation systems. In order to train either of these two types of systems (and many other machine translation systems), current training systems often access a parallel bilingual corpus; that is, a text in one language and its translation into another language. The training systems first align text fragments in the bilingual corpus such that a text fragment (e.g., a sentence) in the first language is aligned with a text fragment (e.g., a sentence) in the second language that is the translation of the text fragment in the first language. When the text fragments are aligned sentences, this is referred to as a bilingual sentence-aligned data corpus.
In order to train the machine translation system, the training system must also know the individual word alignments within the aligned sentences. In other words, even though sentences have been identified as translations of one another in the bilingual, sentence-aligned corpus, the machine translation training system must also know which words in each sentence of the first language translate to which words in the aligned sentence in the second language.
One current approach to word alignment makes use of five translation models. This approach to word alignment is sometimes augmented by a Hidden Markov Model (HMM) based model.
These word alignment models are less than ideal, however, in a number of different ways. For instance, although the standard models can theoretically be trained without supervision, in practice various parameters are introduced that should be optimized using annotated data. In the models that include an HMM model, supervised optimization of a number of parameters is suggested, including the probability of jumping to the empty word in the HMM, as well as smoothing parameters for the distortion probabilities and fertility probabilities of the more complex models. Since the values of these parameters affect the values of the translation, alignment, and fertility probabilities trained by the expectation maximization (EM) algorithm, there is no effective way to optimize them other than to run the training procedure with a particular combination of values and to evaluate the accuracy of the resulting alignments. Since evaluating each combination of parameter values in this way can take hours to days on a large training corpus, it is likely that these parameters are rarely, if ever, truly jointly optimized for a particular alignment task.
Another problem associated with these models is the difficulty of adding features to them, because they are standard generative models. Generative models require a generative “story” as to how the observed data is generated by an inter-related set of stochastic processes. For example, the generative story for models 1 and 2 mentioned above and the HMM alignment model is that a target language translation of a given source language sentence is generated by first choosing a length for the target language sentence, then for each target sentence position, choosing a source sentence word, and then choosing the corresponding target language word.
One prior system attempted to add a fertility component to create models 3, 4 and 5 mentioned above. However, this generative story did not fit any longer, because it did not include the number of target language words needed to align to each source language word as a separate decision. Therefore, to model this explicitly, a different generative “story” was required. Thus, a relatively large amount of additional work is required in order to add features.
In addition, the higher accuracy models are mathematically complex, and also difficult to train, because they do not permit a dynamic programming solution. It can thus take many hours of processing time on current standard computers to train the models and produce an alignment of a large parallel corpus.
The present invention addresses one, some, or all of these problems. However, these problems are not to be used to limit the scope of the invention in any way, and the invention can be used to address different problems, other than those mentioned, in machine translation.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
A weighted linear word alignment model linearly combines weighted features to score a word alignment for a bilingual, aligned pair of text fragments. The features are each weighted by a feature weight. One of the features is a word association metric generated from surface statistics.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The present invention deals with bilingual word alignment. However, before describing the present invention in greater detail, one illustrative environment in which the present invention can be used will be discussed.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Bilingual corpus 206 illustratively includes bilingual data in which text in the first language is found, along with a translation of that text into a second language. For instance, using the English and French languages as an example, bilingual corpus 206 will illustratively include a relatively large amount of English language text along with a French translation of that text. A relatively small amount of bilingual corpus 206 is word-aligned by a person fluent in both languages. Illustratively, bilingual corpus 206 might consist of 500,000 pairs, each pair having an English sentence and its French translation, of which 200 to 300 pairs have been word-aligned by hand.
In order to word-align all the sentences in corpus 206, text fragment alignment component 208 first accesses bilingual corpus 206 to generate pairs of aligned text fragments from bilingual corpus 206. In one illustrative embodiment, the text fragments are sentences, although the text fragments could be other fragments such as clauses, etc.
Text fragment alignment component 208 thus outputs a first text fragment 214 in a first language E (such as English) and a second text fragment 216 in a second language F (such as French) which is the translation of the first text fragment 214. The bilingual, aligned text fragments 214 and 216 (such as bilingual, aligned sentences) are then input to word alignment component 202.
Either text fragment alignment component 208, or a different component, illustratively calculates values of a statistical measure of the strength of word associations in the text-fragment-aligned data. These values are referred to as word association scores and are indicative of a strength of association between a bilingual pair of words, or a bilingual cluster of words. Each pair or cluster of words is referred to as a word association type and is shown in
In one embodiment, index generator 210 accesses all of the various word association types identified in the training data (and stored in word association type data store 212) and indexes those word association types. This is described in greater detail below with respect to
Word alignment component 202 then sorts the list of possible association types based on their association scores. This is indicated by block 306 in
Finally, word alignment component 202 identifies the best alignment according to word alignment model 204, by accessing word alignment model 204, and employing the various features 218 in model 204. This is indicated by block 308 in
In one embodiment, model 204 is generated based on discriminative training of a weighted linear combination of a relatively small number of features. For a given parallel sentence pair, for each possible word alignment considered, model 204 simply multiplies the value of each of the features by a corresponding weight to give a score for that feature, and sums the feature scores to give an overall score for the alignment. The possible alignment having the best overall score is selected as the word alignment for that sentence pair. Thus, for a sentence pair e,f (where e is the sentence in English and f is the sentence in French) model 204 identifies an alignment â such that:
$$\hat{a} = \operatorname*{argmax}_{a} \sum_{i=1}^{n} \lambda_i f_i(a, e, f) \qquad \text{(Equation 1)}$$
where f_i are the features, and λ_i are the corresponding feature weights.
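By way of illustration only, the selection rule described above can be sketched in a few lines of Python. The names `alignment_score`, `best_alignment`, `link_count`, and `unlinked_count`, the two toy features, and the weights 1.0 and -0.5 are hypothetical and are not taken from the described system; the sketch also assumes that the candidate alignments are supplied as an explicit list rather than found by a search.

```python
from typing import Callable, Iterable, Sequence, Tuple

Link = Tuple[int, int]          # (source position, target position), 1-based
Alignment = Tuple[Link, ...]
Feature = Callable[[Alignment, Sequence[str], Sequence[str]], float]


def alignment_score(alignment: Alignment, e: Sequence[str], f: Sequence[str],
                    features: Sequence[Feature], weights: Sequence[float]) -> float:
    """Weighted linear score: sum over features of weight * feature value."""
    return sum(w * feat(alignment, e, f) for feat, w in zip(features, weights))


def best_alignment(candidates: Iterable[Alignment], e: Sequence[str], f: Sequence[str],
                   features: Sequence[Feature], weights: Sequence[float]) -> Alignment:
    """Return the candidate alignment with the highest overall score."""
    return max(candidates, key=lambda a: alignment_score(a, e, f, features, weights))


# Two toy features (hypothetical, for illustration only).
def link_count(alignment: Alignment, e: Sequence[str], f: Sequence[str]) -> float:
    return float(len(alignment))


def unlinked_count(alignment: Alignment, e: Sequence[str], f: Sequence[str]) -> float:
    linked_e = {i for i, _ in alignment}
    linked_f = {j for _, j in alignment}
    return float((len(e) - len(linked_e)) + (len(f) - len(linked_f)))


e = ["the", "house"]
f = ["la", "maison"]
features = [link_count, unlinked_count]
weights = [1.0, -0.5]           # hypothetical weights
candidates = [(), ((1, 1),), ((1, 1), (2, 2))]
best = best_alignment(candidates, e, f, features, weights)
```

With these toy weights the fully linked candidate scores highest, because each link adds to the score and each unlinked word subtracts from it.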
The weights can be optimized using a modified version of averaged perceptron learning as described below with respect to
The specific implementation of word alignment model 204 can be any of a variety of different implementations incorporating a variety of different features. In one embodiment described herein, word alignment model 204 incorporates a feature computed from the different word association scores, mentioned above, intended to indicate how likely various pairs of words or groups of words are to be mutual translations, plus additional features measuring how much word reordering is required by a given alignment, and how many words are left unlinked in that alignment. As discussed below, embodiments of model 204 can also include a feature measuring how often one word is linked to several words in the alignment under analysis.
In the following discussion, and as used above, the term “alignment” is used to mean an overall word alignment of a sentence pair. The term “link” on the other hand is used to mean the alignment of a particular pair of words or small group of words.
In any case, one embodiment of model 204 uses a statistical measure of word association in order to perform bilingual word alignment. The term “word” here and subsequently should be taken very broadly to include any relatively fixed sequence of characters (including a single character) for which a translation relationship can be meaningfully considered. For example, a single punctuation character such as a period or comma may be treated as a word. In the Chinese language, words are conventionally considered to include usually no more than one or two characters. For the purposes of bilingual word alignment, however, it has sometimes proved useful to treat each individual Chinese character as a single word.
On the other hand, many languages, including English, include fixed phrases, such as “in spite of”, “according to”, or “more than”, which function as a single unit and might desirably be treated as single words for purposes of bilingual word alignment or translation. One might also consider breaking what are conventionally regarded as single words into a stem and an inflectional marker (or series of markers) and using each of those as a basic unit for word alignment. For example, the English word “went” might be decomposed into “go” followed by an inflectional marker that might be represented as “+PAST”. In what follows, it is simply assumed that the system is dealing with bilingual text segments that have been “tokenized”, i.e., broken up, and perhaps transformed, in some way into discrete tokens that we may treat as words for alignment purposes.
While any statistical measure indicative of the strength of association between words can be used, one illustrative statistical measure is referred to as the log likelihood ratio (LLR) statistic. Assume, for instance, that the two languages being discussed are English and French. The log likelihood ratio statistic is a measure of the strength of association between a particular English word and a particular French word. Basically, the log likelihood ratio is computed from bilingual, aligned sentences. The LLR statistic takes into account how often an English word occurs in the English sentences, and how often a French word occurs in the French sentences, and how often they occur together in an aligned sentence pair. One way of calculating LLR scores for words in the training corpus is as follows:
$$\mathrm{LLR}(f,e) = \sum_{f? \in \{f, \neg f\}} \; \sum_{e? \in \{e, \neg e\}} C(f?, e?) \log \frac{p(f? \mid e?)}{p(f?)} \qquad \text{(Equation 2)}$$
In Equation 2, f and e refer to the words (in French and in English, respectively) whose degree of association is being measured. When the terms f and e are used, it means that those words occur in the respective target and source sentences of an aligned sentence pair, and ¬f and ¬e mean that the corresponding words do not occur in the respective sentences, whereas f? and e? are variables ranging over these values, and C(f?,e?) is the observed joint count for the values of f? and e?. The probabilities in Equation 2, p(f?|e?) and p(f?), illustratively refer to maximum likelihood estimates.
The LLR score computed using Equation 2 for a pair of words is high if the words have either a strong positive association or a strong negative association. Therefore, in accordance with one embodiment, any negatively associated word pairs are discarded by requiring that p(f,e)>p(f)·p(e). Any word pairs with an LLR score of less than 1 can be discarded as well.
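By way of example, and not limitation, the following Python sketch computes an LLR score from the four joint counts of a word pair and applies the two filters just described. It assumes the statistic is the sum, over the four occurrence/non-occurrence combinations, of the joint count times the log of p(f?|e?)/p(f?), with maximum likelihood estimates; the function names and the handling of zero counts (a zero count contributes nothing) are assumptions made for illustration.

```python
import math


def llr_score(c_fe: int, c_f_only: int, c_e_only: int, c_neither: int) -> float:
    """LLR from joint counts C(f,e), C(f,not e), C(not f,e), C(not f,not e)."""
    n = c_fe + c_f_only + c_e_only + c_neither
    counts = {
        (True, True): c_fe,
        (True, False): c_f_only,
        (False, True): c_e_only,
        (False, False): c_neither,
    }
    f_marginal = {True: c_fe + c_f_only, False: c_e_only + c_neither}
    e_marginal = {True: c_fe + c_e_only, False: c_f_only + c_neither}
    score = 0.0
    for (f_val, e_val), c in counts.items():
        if c == 0:
            continue  # a zero count contributes nothing to the sum
        p_f_given_e = c / e_marginal[e_val]
        p_f = f_marginal[f_val] / n
        score += c * math.log(p_f_given_e / p_f)
    return score


def is_positively_associated(c_fe: int, c_f_only: int, c_e_only: int, c_neither: int) -> bool:
    """True when the pair co-occurs more often than independence would predict."""
    n = c_fe + c_f_only + c_e_only + c_neither
    p_fe = c_fe / n
    p_f = (c_fe + c_f_only) / n
    p_e = (c_fe + c_e_only) / n
    return p_fe > p_f * p_e


def keep_pair(c_fe: int, c_f_only: int, c_e_only: int, c_neither: int,
              min_llr: float = 1.0) -> bool:
    """Keep only positively associated pairs whose LLR score reaches min_llr."""
    counts = (c_fe, c_f_only, c_e_only, c_neither)
    return is_positively_associated(*counts) and llr_score(*counts) >= min_llr
```

Two words that occur independently of one another receive a score of zero, while words that always occur together and never apart receive a large score.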
In this particular embodiment of model 204, the word association scores are used to compute word association features 230 used in model 204, and the value of the principal word association feature for an alignment is simply the sum of all the individual log-likelihood ratio scores for the word pairs linked by the alignment. The log-likelihood ratio-based model also includes a plurality of other features.
For instance, one set of features is referred to as non-monotonicity features 232. It may be observed that in closely related languages, word alignments of sentences that are mutual translations tend to be approximately monotonic (i.e., corresponding words tend to be in nearly corresponding sentence positions). Even for distantly related languages, the number of crossing links is far less than chance, since phrases tend to be translated as contiguous chunks. To model these tendencies, non-monotonicity features 232 provide a measure of the monotonicity (or more accurately the non-monotonicity) of the alignment under consideration.
To find the points of non-monotonicity of a word alignment, one of the languages in the alignment is arbitrarily designated as the source, and the other as the target. The word pairs in the alignment are sorted, first by source word position, then by target word position. (That is, the ordering is determined primarily by source word position, and target word position is considered only if the source word positions are the same.) The alignment is traversed, looking only at the target word positions. The points of non-monotonicity in the alignment are places where there are backward jumps in this sequence of target word positions.
For example, suppose a sorted alignment contains the following pairs of linked word positions ((1,1) (2,4) (2,5) (3,2) (5,6)). The first term in this sequence (1,1) means that the first word in the source sentence is aligned with the first word in the target sentence. The second term (2,4) means that the second word in the source sentence is aligned with the fourth word in the target sentence. The third term (2,5) means that the second word in the source sentence is also aligned with the fifth word in the target sentence. The fourth term (3,2) means that the third word in the source sentence is aligned with the second word in the target sentence, and the last term (5,6) means that the fifth word in the source sentence is aligned with the sixth word in the target sentence. The sequence of target word positions in this sorted alignment is (1,4,5,2,6). Therefore, there is one point of non-monotonicity where target word position 2 follows target word position 5.
The particular way in which the degree of non-monotonicity of an alignment is measured can vary. For instance, in one embodiment, the magnitudes of the backward jumps in the target word sequence are summed, and this sum is the measure of non-monotonicity. In another embodiment, the number of backward jumps is counted, and the number of jumps is indicative of the non-monotonicity. Finally, rather than choosing between those embodiments, both of them can be used. Thus, the non-monotonicity features 232 in word alignment model 204 illustratively comprise one or both of these measures of non-monotonicity, or a different set of measures of non-monotonicity.
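The two measures just described can be sketched as follows, by way of illustration only. The function names are hypothetical, and links are assumed to be given as (source position, target position) pairs.

```python
from typing import Iterable, List, Tuple

Link = Tuple[int, int]  # (source position, target position)


def backward_jumps(links: Iterable[Link]) -> List[int]:
    """Magnitudes of the backward jumps in the target position sequence."""
    # Tuples sort by source position first, then by target position.
    ordered = sorted(links)
    targets = [t for _, t in ordered]
    return [prev - cur for prev, cur in zip(targets, targets[1:]) if cur < prev]


def nonmonotonicity_features(links: Iterable[Link]) -> Tuple[int, int]:
    """Return (sum of backward jump magnitudes, number of backward jumps)."""
    jumps = backward_jumps(links)
    return sum(jumps), len(jumps)


example = [(1, 1), (2, 4), (2, 5), (3, 2), (5, 6)]
features_232 = nonmonotonicity_features(example)  # (3, 1)
```

For the example alignment discussed above, the target sequence (1,4,5,2,6) contains a single backward jump, from position 5 to position 2, of magnitude 3.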
Another set of features is referred to as a set of multiple link features 234 in word alignment model 204. It has often been observed that word alignment links tend to be 1-to-1. Indeed, word alignment results can often be improved by restricting more general models to permit only 1-to-1 links between words.
In order to model the tendency for links to be 1-to-1, one embodiment of the invention defines a 1-to-many feature as the number of links connecting two words such that exactly one of them participates in at least one other link. The system can also define a many-to-many feature as the number of links that connect two words that both participate in other links. Multiple link features 234 in word alignment model 204 can be either or both of these features. However, in one embodiment, the 1-to-many feature is the only one used in multiple link features 234, while the many-to-many feature is not used directly in the model, but is simply used to reduce the number of alignments that must be considered, as any alignments having a non-zero value of the many-to-many feature are discarded.
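As a minimal sketch of the two counts defined above, and assuming that an alignment is a collection of distinct (source position, target position) links, the features might be computed as shown below. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source position, target position)


def multiple_link_features(links: Iterable[Link]) -> Tuple[int, int]:
    """Return (1-to-many count, many-to-many count) for an alignment."""
    links = set(links)  # each link is counted once
    source_degree = Counter(s for s, _ in links)
    target_degree = Counter(t for _, t in links)
    one_to_many = 0
    many_to_many = 0
    for s, t in links:
        source_has_other = source_degree[s] > 1
        target_has_other = target_degree[t] > 1
        if source_has_other and target_has_other:
            many_to_many += 1   # both words participate in other links
        elif source_has_other or target_has_other:
            one_to_many += 1    # exactly one word participates in another link
    return one_to_many, many_to_many
```

In an embodiment that discards alignments with a non-zero many-to-many value, a candidate would be rejected whenever the second element of the returned pair is greater than zero.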
Another exemplary feature used in word alignment model 204 is referred to as a set of unlinked word features 236. The unlinked word features 236 simply count the total number of unlinked words in both sentences in an aligned sentence pair. This is used to control the number of words that get linked to something in the aligned sentence pair.
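By way of illustration, the unlinked word count might be computed as in the following sketch, which assumes 1-based word positions and takes the two sentence lengths as arguments. The function name is hypothetical.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source position, target position), 1-based


def unlinked_word_count(links: Iterable[Link], source_len: int, target_len: int) -> int:
    """Total number of words, in both sentences, that take part in no link."""
    links = list(links)
    linked_source = {s for s, _ in links}
    linked_target = {t for _, t in links}
    return (source_len - len(linked_source)) + (target_len - len(linked_target))
```

For instance, with the links (1,1), (2,4), (2,5), (3,2), and (5,6), a five-word source sentence, and a six-word target sentence, source word 4 and target word 3 are unlinked, giving a value of 2.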
The rank of an association with respect to a word in a sentence pair can be defined to be the number of association types (word-type to word-type) for that word that have higher association scores, such that words of both types occur in the sentence pair. In one embodiment, there are two association score rank features 231 that are based on association score rank. One feature totals the sum of the association ranks with respect to both words involved in each link. The second feature sums the minimum of association ranks with respect to both words involved in each link.
So far, as discussed above, the only features relating to word order are those measuring non-monotonicity. The likelihoods of various forward jump distances are not modeled. If alignments are dense enough, measuring non-monotonicity models this indirectly. That is, if every word is aligned, it is impossible to have large forward jumps without correspondingly large backwards jumps, because something has to link to the words that are jumped over. If word alignments are sparse, however, due to free translation, it is possible to have alignments with very different forward jumps, but the same backwards jumps. To differentiate such alignments, in one embodiment, a jump distance difference feature 233 is used that sums the differences between the distance between consecutive aligned source words and the distance between the closest target words they are aligned to. In another embodiment, jump distance difference feature 233 sums the differences between the distance between consecutive aligned target words and the distance between the closest source words they are aligned to.
It may be that the likelihood of a large forward jump on either the source or target side of an alignment is much less if the jump is between words that are both linked to the same word of the other language. In one embodiment, this is modeled by including two many-to-one jump distance features 235. One feature sums, for each word w, the number of words not linked to w that fall between the first and last words linked to w. The other feature counts only such words that are linked to some word other than w. The point of the second of these features is that it is likely not as detrimental to have a function word not linked to anything, between two words linked to the same word.
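The two counts described above can be sketched as follows, by way of example only. The sketch assumes that both counts are accumulated over words on both sides of the alignment; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source position, target position)


def many_to_one_jump_features(links: Iterable[Link]) -> Tuple[int, int]:
    """Return (all intervening words, intervening words linked elsewhere)."""
    links = set(links)
    all_gap = 0
    linked_gap = 0
    # side 0: w is a source word and its partners are target positions;
    # side 1: w is a target word and its partners are source positions.
    for side in (0, 1):
        partners = defaultdict(set)
        linked_on_other_side = set()
        for link in links:
            w, other = link[side], link[1 - side]
            partners[w].add(other)
            linked_on_other_side.add(other)
        for linked_to_w in partners.values():
            for pos in range(min(linked_to_w) + 1, max(linked_to_w)):
                if pos not in linked_to_w:
                    all_gap += 1
                    if pos in linked_on_other_side:
                        linked_gap += 1  # linked, but to some word other than w
    return all_gap, linked_gap
```

With the links (1,1), (1,4), and (2,2), source word 1 is linked to target words 1 and 4; target words 2 and 3 fall between them, and only target word 2 is linked to another word, so the two values are 2 and 1.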
In other embodiments, an exact match feature 237 sums the number of words linked to identical words. This can be included because proper names or specialized terms are often the same in both languages, and it can be advantageous to take advantage of this to link such words even when they are too rare to have a high association score.
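A minimal sketch of such a count is given below, assuming 1-based word positions and a case-sensitive string comparison; both choices, and the function name, are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple

Link = Tuple[int, int]  # (source position, target position), 1-based


def exact_match_count(links: Iterable[Link], source_words: Sequence[str],
                      target_words: Sequence[str]) -> int:
    """Number of links that connect two identical word strings."""
    return sum(1 for s, t in links if source_words[s - 1] == target_words[t - 1])


english = ["Mr.", "Smith", "arrived"]
french = ["M.", "Smith", "est", "arrivé"]
matches = exact_match_count([(1, 1), (2, 2), (3, 4)], english, french)  # 1
```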
In one embodiment, benefits may be gained by including lexical features 239 that count the links between particular high frequency words. Such features can cover all pairs of the five most frequent non-punctuation words in each language, for instance. In one embodiment, features are included for all bilingual word pairs that have at least two occurrences in the labeled training data. In addition, features can be included for counting the number of unlinked occurrences of each word having at least two occurrences in the labeled training data.
In training the present model, it was believed that using so many lexical features 239 might result in over-fitting to the training data. To try to prevent this, the model can be trained by first optimizing the weights for all other features, then optimizing the weights for the lexical features 239, with the other weights held fixed to the optimum values without lexical features 239.
In accordance with another embodiment of word alignment model 204, word association features 230 are not simply the sum of log-likelihood ratio-based word association statistics. Instead, those statistics are replaced with the logarithm of the estimated conditional probability of two words (or combinations of words) being linked, given that they co-occur in a pair of aligned sentences. These estimates are derived from the best alignments according to another, simpler model. For example, if “former” occurs 100 times in English sentences whose French translations contain “ancien”, and the simpler alignment model links them in 60 of those sentence pairs, the conditional link probability (CLP) can be estimated for this word pair as 60/100, or 0.6. However, it may be more desirable to adjust the probabilities by subtracting a small fixed discount from the link count as follows:
$$LP_d(f,e) = \frac{links_1(f,e) - d}{cooc(f,e)}$$
where LP_d(f,e) represents the estimated conditional link probability for the words f and e, links_1(f,e) is the number of times they are linked by the simpler alignment model, d is the discount, and cooc(f,e) is the number of times they co-occur. This adjustment prevents assigning high probabilities to links between pairs of words that rarely co-occur. Illustratively, this discount may have a value between 0 and 1.
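By way of illustration, the discounted estimate can be computed as in the sketch below. The function name and the discount value of 0.4 used in the example are hypothetical; the text states only that the discount may lie between 0 and 1.

```python
def conditional_link_probability(links: int, cooc: int, discount: float = 0.0) -> float:
    """Estimate (links - discount) / cooc for a pair of co-occurring words."""
    if cooc <= 0:
        raise ValueError("the words must co-occur at least once")
    return (links - discount) / cooc


undiscounted = conditional_link_probability(60, 100)      # 60/100
frequent = conditional_link_probability(60, 100, 0.4)     # (60 - 0.4) / 100
rare = conditional_link_probability(1, 1, 0.4)            # (1 - 0.4) / 1
```

The discount barely changes the estimate for the frequent pair, but it lowers the estimate for a pair that co-occurs once and is linked once from 1.0 to 0.6, which is the effect the adjustment is intended to have.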
One difference between the LLR-based model and the CLP-based model is that the LLR-based model considers each word-to-word link separately, but allows multiple links per word, as long as they lead to alignments consisting only of 1-to-1 and 1-to-many links (in either direction). In the CLP-based model, however, conditional probabilities are allowed for both 1-to-1 and 1-to-many clusters, but all clusters are required to be disjoint.
For instance, the conditional probability of linking “not” (in English) to “ne . . . pas” (in French) can be estimated by considering the number of sentence pairs in which “not” occurs in the English sentence and both “ne” and “pas” occur in the French sentence, compared to the number of times “not” is linked to both “ne” and “pas” in pairs of corresponding sentences. However, when this estimate is made in the CLP-based model, a link between “not” and “ne . . . pas” is not counted if the same instance of “not”, “ne” or “pas” is linked to any other words.
The CLP-based model incorporates the same additional features as the LLR-based model, except that it omits the 1-to-many feature since it is assumed that the 1-to-1 versus the 1-to-many tradeoff is already modeled in the conditional link probabilities for particular 1-to-1 and 1-to-many clusters. In other embodiments, the 1-to-many feature may be retained in the CLP-based model, in case it turns out that the conditional link probability estimates are more reliable for 1-to-1 clusters than for 1-to-many clusters, or vice versa.
There are a variety of different bases for estimating the conditional link probabilities. For instance, one estimate of the conditional link probabilities can be derived from the LLR-based model described above, optimized on an annotated development set. Another estimate can be derived from a heuristic alignment model. It should also be noted that, in addition to the LLR-based model and the CLP-based model, other weighted linear models using word association scores based on surface statistics can be used as well. By “surface statistics” is meant any association metric that can be defined on a contingency table. A contingency table for two words is a two-by-two matrix in which the four cells indicate counts of the cases where neither of the words is present, where one of the words is present but the other is not and vice versa, and where both words are present. There are many different association metrics which can be calculated from such a matrix, including the χ² statistic, the Dice coefficient, or any of a wide variety of other metrics.
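By way of example, and not limitation, two such metrics can be computed from the four cells of a contingency table as sketched below, using the standard textbook definitions of the Dice coefficient and of Pearson's χ² statistic for a two-by-two table. The function and argument names are hypothetical.

```python
def dice_coefficient(both: int, f_only: int, e_only: int, neither: int) -> float:
    """Dice coefficient: 2 * C(both) / (C(f present) + C(e present))."""
    denominator = 2 * both + f_only + e_only
    return 2 * both / denominator if denominator else 0.0


def chi_squared(both: int, f_only: int, e_only: int, neither: int) -> float:
    """Pearson chi-squared statistic for a two-by-two contingency table."""
    n = both + f_only + e_only + neither
    denominator = ((both + f_only) * (e_only + neither)
                   * (both + e_only) * (f_only + neither))
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association is measurable
    return n * (both * neither - f_only * e_only) ** 2 / denominator
```

Note that the Dice coefficient ignores the cell counting sentence pairs in which neither word is present, whereas the χ² statistic uses all four cells.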
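In another embodiment, the estimated conditional probability of a cluster of words being linked is replaced with the estimated conditional odds of a cluster of words being linked, as follows:
$$LO(w_1, \ldots, w_k) = \frac{links_1(w_1, \ldots, w_k) + 1}{\left(cooc(w_1, \ldots, w_k) - links_1(w_1, \ldots, w_k)\right) + 1}$$
where LO(w_1, . . . ,w_k) represents the estimated conditional link odds for the cluster of words w_1, . . . ,w_k, links_1(w_1, . . . ,w_k) is the number of times the words are linked by the simpler alignment model, and cooc(w_1, . . . ,w_k) is the number of times they co-occur. In this exemplary embodiment, “add-one” smoothing is used in place of a discount.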
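A minimal sketch of such an odds estimate is shown below. It assumes that the odds are the number of co-occurrences in which the cluster is linked divided by the number in which it is not linked, with one added to both counts as the “add-one” smoothing; this reading, and the function name, are assumptions made for illustration.

```python
def conditional_link_odds(links: int, cooc: int) -> float:
    """Add-one smoothed odds that a co-occurring cluster of words is linked."""
    if links < 0 or cooc < links:
        raise ValueError("expected 0 <= links <= cooc")
    return (links + 1) / ((cooc - links) + 1)


odds = conditional_link_odds(60, 100)     # 61 / 41
always_linked = conditional_link_odds(5, 5)  # 6 / 1, finite thanks to smoothing
```

Without smoothing, a cluster that is linked every time it co-occurs would have infinite odds; adding one to both counts keeps the estimate finite.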
Some embodiments include additional features. One such feature is a symmetrized non-monotonicity feature 241, in which the previous non-monotonicity feature that sums the magnitudes of backwards jumps is symmetrized by averaging the sum of backwards jumps in the target sentence order relative to the source sentence order with the sum of the backwards jumps in the source sentence order relative to the target sentence order. In this exemplary embodiment, the feature that counts the number of backwards jumps can be omitted.
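The symmetrization can be sketched as follows, by way of illustration only. The function names are hypothetical, and links are assumed to be (source position, target position) pairs.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]  # (source position, target position)


def sum_backward_jumps(links: Iterable[Link], order_by: int) -> int:
    """Sum of backward jumps on one side when links are sorted by the other.

    order_by=0 sorts by source position and inspects target positions;
    order_by=1 sorts by target position and inspects source positions.
    """
    other = 1 - order_by
    ordered = sorted(links, key=lambda link: (link[order_by], link[other]))
    sequence = [link[other] for link in ordered]
    return sum(prev - cur for prev, cur in zip(sequence, sequence[1:]) if cur < prev)


def symmetrized_nonmonotonicity(links: Iterable[Link]) -> float:
    """Average of the backward jump sums taken in both directions."""
    links = list(links)
    return (sum_backward_jumps(links, 0) + sum_backward_jumps(links, 1)) / 2


example = [(1, 1), (2, 4), (2, 5), (3, 2), (5, 6)]
value = symmetrized_nonmonotonicity(example)  # (3 + 1) / 2
```

For the example alignment, sorting by source position gives the target sequence (1,4,5,2,6) with a backward jump sum of 3, while sorting by target position gives the source sequence (1,3,2,2,5) with a backward jump sum of 1, so the symmetrized value is 2.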
A multi-link feature 243 counts the number of link clusters that are not one-to-one. This enables modeling whether the link scores for these clusters are more or less reliable than the link scores for one-to-one clusters.
Another feature is an empirically parameterized jump distance feature 245, which measures the jump distances between alignment links in a way that is more sophisticated than simply measuring the difference in source and target distances. The (signed) source and target distances between all pairs of links are measured in the simpler alignment of the full training data that is used to estimate the conditional link probability and conditional link odds features. From this, the odds of each possible target distance given the corresponding source distance are estimated as:
Similarly, the odds of each possible source distance given the corresponding target distance are estimated. The feature values include the sum of the scaled log odds of the jumps between consecutive links in a hypothesized alignment, computed in both source sentence and target sentence order. This feature is applied only when both the source and target jump distances are non-zero, so that it applies only to jumps between clusters, not to jumps on the “many” side of a many-to-one cluster. In one embodiment, these feature values are linearly scaled in order to get good results (in terms of training set alignment error rate (AER)) when using perceptron training. It has been found empirically that good results can be obtained in terms of training set AER by dividing each log odds estimate by the largest absolute value of any such estimate computed.
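The estimation and scaling steps can be sketched as follows. The sketch assumes an add-one smoothed odds estimate of the form (count of the target distance with the source distance, plus one) divided by (count of any other target distance with that source distance, plus one), mirroring the smoothing used for the conditional link odds; the function name and input format are hypothetical.

```python
from collections import Counter
from math import log


def scaled_log_odds(distance_pairs):
    """Map each observed (source_distance, target_distance) to a scaled log odds.

    distance_pairs: list of (source_distance, target_distance) tuples measured
    between pairs of links in a simpler alignment of the training data.
    """
    joint = Counter(distance_pairs)
    source_totals = Counter(s for s, _ in distance_pairs)
    log_odds = {}
    for (s, t), count in joint.items():
        # Assumed add-one smoothed odds of target distance t given source distance s.
        log_odds[(s, t)] = log((count + 1) / ((source_totals[s] - count) + 1))
    if not log_odds:
        return {}
    largest = max(abs(v) for v in log_odds.values())
    if largest == 0:
        return log_odds
    # Divide every estimate by the largest absolute value, as described above.
    return {key: value / largest for key, value in log_odds.items()}
```

After scaling, every value lies in the range from -1 to 1, which keeps the feature on a scale comparable to the other features during perceptron training.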
Additional embodiments of the empirically parameterized jump distance feature are based on the probability, rather than the odds, of each possible target distance given the corresponding source distance and/or each possible source distance given the corresponding target distance, or on other quantities computed using the frequency with which a given jump distance between two words in the source language occurs with a given jump distance between the words in the target language that are linked to those two source-language words.
While the discriminative models discussed above are relatively straightforward to describe, finding the optimal alignment according to these models is non-trivial. Adding a link for a new pair of words can affect the non-monotonicity scores, the 1-to-many score, and the unlinked word score differently, depending on what other links are present in the alignment.
However, a beam search procedure can be used which is highly effective in finding good alignments, when used with these models. This was discussed in brief with respect to
Index generator 210 then indexes the given association type by the selected words. This is indicated by block 354, and results in the index of word association types 220 shown in
It should be noted that index generator 210 may illustratively generate index 220 prior to runtime. It may illustratively be done at set up time or at any other time as desired.
Word alignment component 202 then selects one of the word pairs from the list. This is indicated by block 358. Word alignment component 202 then determines whether there is an index entry for the selected word pair. In doing so, word alignment component 202 accesses index 220 to determine whether it contains an entry for the selected word pair from the list of word pairs generated from the aligned text fragments 214 and 216. Checking for the index entry is indicated by block 360 in
If there is no index entry, then word alignment component 202 determines whether there are any more possible word pairs in the list to be considered. If so, processing reverts to block 358 where another word pair is selected. Determination of whether there are more word pairs to be considered is indicated by block 362 in
If, at block 360, word alignment component 202 determines that there is an index entry in index 220 for the selected word pair, then word alignment component 202 determines whether the index entry is for a 1-to-1 association type. In other words, component 202 determines whether the index entry is only a link between a single word in text fragment 214 and a single word in text fragment 216, where neither of the words has additional links specified by the association type. This is indicated by block 364. If the index entry is for a 1-to-1 association type, then the association type represented by the index entry is simply added to the list of possible association types generated for the aligned text fragments 214 and 216. This is indicated by block 366 in
If, at block 364, it is determined that the index entry is not for a 1-to-1 association type, then word alignment component 202 determines whether the other words in the association type represented by the index entry (other than those which are listed in the index entry) occur in the pair of aligned text fragments 214 and 216. This is indicated by block 368 in
It will be noted that, in accordance with one embodiment, many-to-many association types are not considered. In that case, those association types can be omitted from index 220, in which case the many-to-many association type will never be selected. Other ways of omitting many-to-many association types can be used as well, and it may in some cases be desirable to use such association types, in which case they are left in and treated as a 1-to-many association type at this point.
Once all of the word pairs have been considered as determined at block 362, then the list of possible association types for the aligned text fragments 214 and 216 is sorted based on association scores, from strongest association score to weakest association score. This is indicated by block 370 in
In another embodiment of the present invention, instead of first generating all possible word pairs in the sentence pair as in block 356, and then determining which ones index a possible association type for the sentence pair, the possible association types can be determined incrementally as the possible word pairs are generated. That is, as each word pair is generated, the operations indicated in blocks 360, 364, 368, and 366 are performed for that word pair, before the next possible word pair is generated.
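The candidate-generation steps described above can be sketched as follows. The data layout is an assumption made for illustration: an association type is represented as a tuple (score, source words, target words), and the index maps a (source word, target word) pair to the association types it indexes.

```python
def candidate_association_types(src_words, tgt_words, index):
    """List the association types possible for one aligned fragment pair.

    index: dict mapping (src_word, tgt_word) -> list of association types,
    each a tuple (score, source_words, target_words).
    """
    src_set, tgt_set = set(src_words), set(tgt_words)
    found = set()
    for s in src_set:
        for t in tgt_set:
            for assoc in index.get((s, t), ()):
                _score, a_src, a_tgt = assoc
                # A 1-to-1 type passes trivially; any other type is kept only
                # if all of its remaining words also occur in the fragment pair.
                if set(a_src) <= src_set and set(a_tgt) <= tgt_set:
                    found.add(assoc)
    # Sort from strongest association score to weakest.
    return sorted(found, key=lambda assoc: assoc[0], reverse=True)
```

Because each word pair is checked as it is generated, this sketch follows the incremental variant; collecting all word pairs first and then filtering would produce the same list.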
Once this list of possible association types for the pair of aligned sentences 214 and 216 under consideration has been generated, word alignment component 202 then identifies the best alignment according to word alignment model 204 using the list of possible association types.
Word alignment component 202 first initializes a list of existing alignments to contain only an empty alignment along with its overall score. Since an empty alignment has no links, the overall score for an empty alignment will simply be the total number of words in both sentences, multiplied by the unlinked word feature weight. This is indicated by block 400 in
Component 202 then incrementally adds all possible instances of the selected association type to copies of each of the alignments in a list of current alignments, keeping the previous alignments as well (before each instance of the association type is added). This is indicated by block 404 in
If there is more than one instance, in the aligned text fragments 214 and 216, of the selected association type being processed, then component 202 picks one instance and tries adding that instance to each of the alignments, and repeats that process for each of the instances. As each instance is considered, the alignments created by adding earlier instances are included in the existing potential alignments that component 202 adds the new instance to.
Without pruning, the number of possible alignments generated by component 202 would grow combinatorially. Therefore, component 202 prunes the set of alignments as new alignments are generated, as indicated by block 404 in
Component 202 iterates through the sorted list of association types, from best to worst, creating new alignments that add links for all instances of the association type currently being considered to existing alignments, potentially keeping both the old and new alignments in the set of possible alignments being generated. This continues until there are no more association types in the list to consider. This is indicated by block 408 in
Once the final set of potential alignments has been generated, component 202 simply outputs the best scoring word alignment 222 (shown in
First, a possible link “I” that is an instance of the selected association type is selected in the sentence pair. This is indicated by block 504 in
The set of recent alignments is initialized to be empty. This is indicated by block 506 in
Once the set of new alignments is created, an alignment (A′) is selected from the set of new alignments. This is indicated by block 512 in
Component 202 then determines whether A′ already exists in the set of recent alignments, or whether it has any many-to-many links in it, or whether it has any one-to-many links with more than a predetermined value “M” branches. This is indicated by block 514 in
If, at block 514, word alignment component 202 determines that the selected alignment A′ either already exists in the set of recent alignments or has many-to-many links in it, or has any one-to-many links with more than M branches, then processing moves to block 516 where component 202 determines whether there are any more alignments A′ to consider. However, if, at block 514, component 202 determines that A′ does not already exist in the set of recent alignments, and it does not have any many-to-many links in it, and it does not have any one-to-many links with more than “M” branches, then word alignment component 202 computes the score for the alignment A′ according to the model 204. Computing the score is indicated by block 518 in
Word alignment component 202 then determines whether the score for the alignment A′ is worse than the best score computed so far by more than a pruning threshold amount. This is indicated by block 520 in
If, at block 520, word alignment component 202 determines that the score for the alignment A′ is not worse than the best score so far by more than the pruning threshold, then component 202 adds the alignment A′ to the list of recent alignments. This is indicated by block 524 in
Component 202 then determines whether there are more existing alignments “A” to be processed. If so, processing reverts back to block 508. If not, however, component 202 adds the recent alignments to the set of existing alignments. This is indicated by block 534 in
Component 202 then determines whether there are more possible links “I” that are instances of the selected association type in the sentence pair currently being processed. If so, processing reverts back to block 504. Determining whether there are more existing alignments “A” is indicated by block 528, and determining whether there are more possible links “I” is indicated by block 530.
If, at block 530, component 202 determines that there are no more instances of the association type to be processed, then component 202 has completed the processing indicated by block 404.
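The loop over instances, existing alignments, and pruning checks can be sketched as follows. This is a simplified illustration under stated assumptions: each instance is a single (source position, target position) link, a new alignment A′ is formed simply by adding the link to A, an alignment is a frozenset of links, and a higher score is better. The function names are hypothetical.

```python
from collections import Counter


def is_allowed(alignment, max_branches):
    """Reject many-to-many links and clusters with more than max_branches links."""
    src_deg = Counter(i for i, _ in alignment)
    tgt_deg = Counter(j for _, j in alignment)
    for i, j in alignment:
        if src_deg[i] > 1 and tgt_deg[j] > 1:
            return False                      # link is part of a many-to-many cluster
        if max(src_deg[i], tgt_deg[j]) > max_branches:
            return False                      # more than "M" branches
    return True


def add_instances(existing, instances, score_fn, threshold, max_branches):
    """Add every instance of one association type to the existing alignments."""
    existing = list(existing)                 # must contain at least the empty alignment
    best = max(score_fn(a) for a in existing)
    for link in instances:                    # each possible link "I"
        recent = []
        for a in existing:                    # each existing alignment "A"
            candidate = frozenset(a | {link})
            if candidate in recent or not is_allowed(candidate, max_branches):
                continue
            score = score_fn(candidate)
            if score < best - threshold:
                continue                      # worse than the best by more than the threshold
            best = max(best, score)
            recent.append(candidate)
        existing.extend(recent)               # earlier alignments are kept as well
    return existing
```

Starting from a list that contains only the empty alignment, calling this once per association type, in sorted order, yields the set of alignments from which the best-scoring one is chosen.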
Component 202 first initializes the set of new alignments to be empty. This is indicated by block 540 in
An extra pruning technique can also be used with the LLR-based model. In generating the list of possible association types to be used in aligning a given sentence pair, we use only association types which have the best association score for this sentence pair for one of the word types involved in the association. The idea is to discard associations not likely to be used. For example, in data from the Canadian Parliament, “Prime Minister” and “premier ministre” frequently occur in parallel sentence pairs. In one illustrative training corpus, the association scores for each pair of one of these English words and one of these French words are as follows:
4125.02019332218 Minister ministre
2315.88778082931 Prime premier
1556.9205658087 Prime ministre
1436.06392959541 Minister premier
All four pairs have quite high association scores, but in aligning a sentence pair that contains both “Prime Minister” and “premier ministre”, we would not consider the associations between “Prime” and “ministre” and between “Minister” and “premier”, because in those two pairings, neither word is the most strongly associated with the other for this sentence pair. This pruning step can be applied as the list of possible association types for a selected sentence pair is being generated in block 304, just before block 366.
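This pruning rule can be sketched as follows for the 1-to-1 case, using the four scores listed above. The function name and the tuple layout (score, English word, French word) are assumptions made for illustration.

```python
def prune_to_best_associations(scored_pairs):
    """Keep a pair only if its score is the best one for at least one of its words.

    scored_pairs: list of (score, src_word, tgt_word) for one sentence pair.
    """
    best_src, best_tgt = {}, {}
    for score, s, t in scored_pairs:
        best_src[s] = max(best_src.get(s, score), score)
        best_tgt[t] = max(best_tgt.get(t, score), score)
    return [
        (score, s, t)
        for score, s, t in scored_pairs
        if score == best_src[s] or score == best_tgt[t]
    ]


pairs = [
    (4125.02019332218, "Minister", "ministre"),
    (2315.88778082931, "Prime", "premier"),
    (1556.9205658087, "Prime", "ministre"),
    (1436.06392959541, "Minister", "premier"),
]
kept = prune_to_best_associations(pairs)
```

With these scores, only the “Minister”/“ministre” and “Prime”/“premier” associations remain.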
Component 202 simply lets the set of new alignments contain only an alignment having a link for the instance “I” plus all links in the alignment “A” that are not conflicting with the instance “I”. This is indicated by block 546 in
$$\lambda_i \rightarrow \lambda_i + \left(f_i(a_{ref}, e, f) - f_i(a_{hyp}, e, f)\right) \qquad \text{(Eq. 6)}$$
The updated feature weights are used to compute ahyp for the next sentence pair.
Iterating through the data continues until the weights stop changing, because aref=ahyp for each sentence pair, or until some other stopping condition is met.
In the averaged perceptron learning technique, the feature weights for the final model are the average of the weight values over all the data, rather than simply the values after the final sentence pair of the final iteration.
In accordance with one embodiment of the optimization technique, the present system averages the weight values over each pass through the data, rather than over all passes. It is believed that this leads to faster convergence. After each pass of perceptron learning through the data, another pass is made through the data with feature weights fixed to their average value for the previous learning pass, in order to evaluate current performance of the model. The system iterates over this procedure until a local optimum is found.
Also, in accordance with one embodiment of the present system, a fixed weight is provided for the word association feature 230. It is believed that this feature is of significant importance in the model, and its weight can be fixed to any desired or empirically determined value. In one embodiment, the weight is fixed to 1.0. Allowing all weights to vary allows many equivalent sets of weights that differ only by a constant scale factor. Fixing one weight thus eliminates a spurious apparent degree of freedom.
By eliminating this degree of freedom, and fixing one of the weights, the present system thus employs a version of perceptron learning that uses a learning rate parameter. As is known, the perceptron update rules involve incrementing each weight by the difference in the feature values being compared. If the feature values are discrete, however, the minimum difference may be too large compared to the unweighted association score. Therefore, the present system multiplies the feature value difference by a learning rate parameter η to allow smaller increments when needed as follows:
$$\lambda_i \rightarrow \lambda_i + \eta\left(f_i(a_{ref}, e, f) - f_i(a_{hyp}, e, f)\right) \qquad \text{(Eq. 7)}$$
For the CLP-based model, based on the typical feature values expected, the learning rate can be set to any empirically determined value. In one embodiment, the learning rate is set to 0.01, although different rates can be used and optimizations on the rate can be performed as desired.
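The update of Eq. 7, with one weight held fixed, can be sketched as follows. The function name, the list-based feature vectors, and the `fixed` argument are hypothetical; the feature values in the usage lines are made up for illustration.

```python
def perceptron_update(weights, ref_features, hyp_features, learning_rate, fixed=()):
    """Apply Eq. 7 to every weight whose index is not listed in `fixed`."""
    return [
        w if i in fixed else w + learning_rate * (r - h)
        for i, (w, r, h) in enumerate(zip(weights, ref_features, hyp_features))
    ]


# Index 0 plays the role of the word association feature, fixed at 1.0.
updated = perceptron_update(
    weights=[1.0, 0.0, 0.0],
    ref_features=[5.0, 2.0, 1.0],
    hyp_features=[4.0, 3.0, 1.0],
    learning_rate=0.01,
    fixed={0},
)
```

When the hypothesized alignment matches the reference alignment, every feature difference is zero and the weights are left unchanged.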
For the LLR-based model, the LLR scores can become very large (such as 100,000 for a 500,000 pair corpus) but small differences can be significant. Thus, small differences in the weighting values are also likely to be significant. This means that a learning rate small enough to allow convergence on a desired weight value may require a very large number of iterations through the data in order to reach those values. Thus, in accordance with one embodiment, the present system uses a progression of learning rates, starting at a relatively large value (which can be empirically determined, and one example of which is approximately 1000) and reducing each successive rate until a final desired learning rate is reached. Of course, the level of reduction can be empirically determined or set as desired. In one embodiment, the learning rate is reduced successively by an order of magnitude until a learning rate of 1 is reached. Of course, other values can be used as well. At each transition between learning rates, the feature weights are reinitialized to the optimum values found with the previous learning rate. This can be done based on error rate or any other desired measure.
With this in mind,
A training sample sentence pair, annotated with its correct word alignment, is then processed as described above with respect to the previous figures, in order to obtain a best guess at a word alignment for the sentence pair. This is indicated by block 562.
The best guess is then compared to the known correct alignment for the sentence pair. This is indicated by block 564.
The weights (λi) are then adjusted based on the difference in feature values between the correct alignment and the best guess. This is indicated by block 566 in
It is then determined whether enough data has been processed in order to check the error rate. This is indicated by block 568. In other words, it may be desirable not to check the error rate after processing each training sentence pair, but instead to process a plurality of training sentence pairs before checking it. Illustratively, it may be desirable to process all the annotated training sentence pairs once between occurrences of checking the error rate.
If so, then the error rate is checked to determine whether it is still decreasing since the last time it was checked. This check is performed using the average values for the feature weights since the last time the error rate was checked, applied to a specified set of annotated sentence pairs. This set may be the entire set of training sentence pairs used in adjusting the feature weights, a subset of that set, or an independent set of annotated sentence pairs. This is indicated by block 569 in
However, if, at block 570, the error rate has started to increase (or is at least no longer decreasing) then it is determined that training has flattened out with respect to the current learning rate. It is thus determined whether there are any additional learning rates to try during the training process. This is indicated by block 572. If not, training is complete and the weights that yielded the lowest error rate are used.
However, if, at block 572, it is determined that there are more learning rates to try, then the learning rate is set to its next lowest value, and the feature weights are reset to the values that have yielded the lowest error rate so far. This is indicated by blocks 574 and 576. Processing then continues at block 562 in which training samples are again processed in order to continue training the model feature weights λi.
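The overall training procedure with a progression of learning rates can be sketched as follows. The callables `run_pass` and `error_rate` are hypothetical stand-ins: `run_pass(weights, rate)` is assumed to perform one perceptron pass and return both the final and the averaged weights, and `error_rate(weights)` is assumed to return the error rate on the chosen set of annotated sentence pairs. The `max_passes` guard is an addition for safety and is not part of the description.

```python
def train_with_rate_schedule(initial, rates, run_pass, error_rate, max_passes=1000):
    """Train with successively smaller learning rates, keeping the best weights."""
    best_weights, best_err = list(initial), error_rate(initial)
    for rate in rates:                        # e.g. 1000, 100, 10, 1
        current = list(best_weights)          # reset to the best weights found so far
        last_err = float("inf")
        for _ in range(max_passes):
            current, averaged = run_pass(current, rate)
            err = error_rate(averaged)        # evaluate with the averaged weights
            if err < best_err:
                best_weights, best_err = list(averaged), err
            if err >= last_err:
                break                         # error no longer decreasing at this rate
            last_err = err
    return best_weights, best_err
```

When no further learning rates remain, the weights that yielded the lowest error rate are returned.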
Practitioners skilled in the art will recognize that many other variations of perceptron learning may be used to optimize the model feature weights, and that other learning methods such as maximum entropy modeling or maximum margin methods, including support vector machines, may be used to optimize the feature weights. If the number of feature weights is small, direct optimization methods such as Powell's method or the downhill simplex method may also be used.
In one alternative embodiment, a support vector machine (SVM) method for structured output spaces can be used. The method can be based on known large margin methods for structured and interdependent output variables. Like standard SVM learning, this method tries to find the hyperplane that separates the training examples with the largest margin. Despite a very large number of possible output labels (e.g., all possible alignments of a given pair of sentences), the optimal hyperplane can be efficiently approximated given the desired error rate, using a cutting plane algorithm. In each iteration of the algorithm, it adds the “best” incorrect predictions given the current model as constraints, and optimizes the weight vector subject only to them.
One advantage of this algorithm is that it does not pose special restrictions on the output structure, as long as “decoding” can be done efficiently. This can be beneficial because several features mentioned above are believed to be very effective in this task, but are difficult to incorporate into structured learning methods that require decomposable features. This method also allows a variety of loss functions, but can also use only simple 0-1 loss, which in this context means whether or not the alignment of a sentence pair is completely correct.
In the embodiment in which an SVM method is used, the SVM method has a number of free parameters, which can be tuned in a number of different ways. One way is by minimizing training set AER. Another is five-fold cross validation. In this method, training is performed five times on 80% of the training data, with testing on the other 20%, and with five disjoint subsets used for testing. The parameter values yielding the best average AER on the five test subsets of the training set are used to train the final model on the entire training set.
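The five-fold procedure can be sketched as follows. The callable `train_and_score(param, train, test)` is a hypothetical stand-in that is assumed to train a model with the given parameter value on `train` and return its AER on `test`; the interleaved split is one simple way of obtaining five disjoint test subsets.

```python
def five_fold_splits(items, folds=5):
    """Yield (train, test) pairs with disjoint test subsets covering all items."""
    for k in range(folds):
        test = items[k::folds]
        train = [x for i, x in enumerate(items) if i % folds != k]
        yield train, test


def pick_parameter(candidates, items, train_and_score):
    """Return the candidate with the lowest average AER over the five test subsets."""
    def average_aer(param):
        scores = [train_and_score(param, tr, te) for tr, te in five_fold_splits(items)]
        return sum(scores) / len(scores)
    return min(candidates, key=average_aer)
```

The selected parameter value would then be used to train the final model on the entire training set.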
It will also be appreciated that log-conditional-odds-based features, as mentioned above, are not only useful in bilingual word alignment as discussed above. In addition, log conditional odds can be used to define features in other applications as well.
For word segmentation, one might want to use as a local feature the log-probability that a segment is a word, given the character sequence it spans. A curious property of this feature is that it induces a counterintuitive asymmetry between the is-word and is-not-word cases: the component generative model can efficiently dictate that a certain chunk is not a word, by assigning it a very low probability (driving the feature value to negative infinity), but it cannot dictate that a chunk is a word, because the log-probability is bounded above. If instead the log conditional odds

$$\log\frac{P(\text{word} \mid x)}{P(\text{not word} \mid x)}$$

is used, the asymmetry disappears. Such a log-odds feature provides much greater benefit than the log-probability, and it is useful to include such a feature even when the model also includes indicator function features for every word in the training corpus.
Therefore, a feature that can be used in word segmentation is the smoothed log conditional odds that a given sub-sequence xab=(xa, . . . , xb-1) forms a word, estimated as:

$$\log\frac{wordcount(x_{ab}) + 1}{nonwordcount(x_{ab}) + 1}$$

where wordcount(xab) is the number of times (xab) forms a word in the training set, and nonwordcount(xab) is the number of times (xab) occurs, not segmented into a single word. As in our word alignment features, we use “add-one” smoothing so that neither the numerator nor the denominator of the ratio is ever 0.
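A minimal sketch of this feature, assuming the add-one smoothed ratio of the word count to the non-word count described above, is given below. The function name is hypothetical.

```python
from math import log


def smoothed_log_odds(word_count, nonword_count):
    """Add-one smoothed log conditional odds that a sub-sequence forms a word."""
    return log((word_count + 1) / (nonword_count + 1))
```

Unlike a log-probability, the value is unbounded in both directions: it grows as the word count dominates and falls as the non-word count dominates, and it is zero when the two counts are equal.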
The word alignment problem and sequence segmentation problems described above are both instances of structured classification problems, because a word alignment or a segmentation of a sequence can both be viewed as structured labels of the inputs, with the individual alignment links or word segment boundaries being partial labels. Log-conditional-odds features were also found very useful in a multi-class classification model for another natural language prediction problem, which can be another application for classifier 600, in which a fixed set of more than two unstructured labels is used. The exemplary problem is to predict Japanese case markers given the rest of the words in a Japanese sentence. Case markers are words which indicate grammatical relations (such as subject, object, and location) of the complement noun phrase to the predicate.
The following Japanese sentence (in Table 1) shows an example of the use of case markers in Japanese. The case markers that need to be predicted are kara (from) for the second phrase and ni for the third phrase (the case markers are underlined below).
This task can be viewed as a multi-class classification problem where the goal is to predict one case marker for each phrase in the Japanese sentence, given a set of features from the context of the phrase, and possibly features from a corresponding English sentence. In one example, the number of classes to be predicted is 19, which includes 18 case markers and a class NONE, meaning that a phrase has no case marker.
One exemplary model for this task uses nominal-valued features, such as HeadPOS (the part of speech tag of the head word of the phrase), HeadWord, PrevHeadWord, NextHeadWord (head words of surrounding phrases), as well as features from a syntactic dependency tree.
To describe the model in more detail, some notation is now introduced. Denote the context features of a phrase by a vector of nominal features X=[x1,x2,x3, . . . ,xm], where there is one dimension for each of the nominal features included in the model. Denote also by y1,y2, . . . ,yk the k possible classes (case marker assignments). The form of the model is as follows:
In EQ. 9, the trainable parameters of the model are the λjyi parameters: one parameter for every class yi, i=1 . . . k, and for every feature type j, j=0 . . . m. Feature type 0 is added to model the prior likelihood of every class. Each of the log-odds features in this equation represents the logarithm of the probability of class yi given some feature divided by the probability of the complement of yi (denoted by ȳi) given that feature. The complement of class yi is the set of all classes other than yi.
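One form of the model that is consistent with this description is shown below. The normalization over all k classes is an assumption, as is the placement of the class index as a subscript and the feature type as a superscript on λ.

$$P(y_i \mid X) = \frac{\exp\left(\lambda^{0}_{y_i}\log\frac{P(y_i)}{P(\bar{y}_i)} + \sum_{j=1}^{m}\lambda^{j}_{y_i}\log\frac{P(y_i \mid x_j)}{P(\bar{y}_i \mid x_j)}\right)}{\sum_{i'=1}^{k}\exp\left(\lambda^{0}_{y_{i'}}\log\frac{P(y_{i'})}{P(\bar{y}_{i'})} + \sum_{j=1}^{m}\lambda^{j}_{y_{i'}}\log\frac{P(y_{i'} \mid x_j)}{P(\bar{y}_{i'} \mid x_j)}\right)}$$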
The odds values were estimated by smoothed relative frequency estimates from the training data. The unsmoothed relative frequency estimates are:
In one embodiment, add-α smoothing is used to improve this estimate. The add-α estimate is as follows:
In EQ. 11, k denotes the number of classes as before. The parameters λjyi of the model can be trained using a standard technique. For instance, the sum of the conditional log-likelihoods of the training data instances can be maximized, and a Gaussian prior on the parameters can be included.
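One way to arrive at an add-α odds estimate that involves the number of classes k is sketched below. It assumes that α is added to the count of each of the k classes for a given feature value, so that the class itself receives α and its complement, which covers the other k−1 classes, receives (k−1)α. This is an illustrative derivation under that assumption, and the function name and arguments are hypothetical.

```python
def add_alpha_odds(count_class_and_feature, count_feature, alpha, num_classes):
    """Smoothed odds of a class versus its complement, given a feature value.

    count_class_and_feature: times the class occurs with the feature value
    count_feature:           times the feature value occurs with any class
    """
    complement = count_feature - count_class_and_feature
    # The class gets alpha; the complement spans (num_classes - 1) classes.
    return (count_class_and_feature + alpha) / (complement + (num_classes - 1) * alpha)
```

With α set to zero this reduces to the unsmoothed relative frequency ratio of the class count to the complement count.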
When this model using log-odds features was compared to one using log-probability features, the log-odds model was found to outperform the latter. In particular, the model using log-probability features had the following form:
Equation 12 closely corresponds to the method of discriminatively training parameters (the λjyi), for weighting log-probability features from generative models.
For the model in EQ. 12, the probability features P(xj|yi) were estimated by smoothed relative frequency estimates from the training data, using add-α smoothing. The smoothed relative frequency estimate for P(xj|yi) is:
where V denotes the number of possible values for the j-th nominal feature xj. A range of values for the α smoothing parameter can be tried in any known manner for both Equations 9 and 12, finding approximately optimal values for them. This can be done empirically, using a systematic approach, or otherwise.
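A standard add-α estimate that matches this description, with count(·) as assumed notation for training-set frequencies, can be written as:

$$P(x_j \mid y_i) = \frac{count(x_j, y_i) + \alpha}{count(y_i) + \alpha V}$$

Adding α to each of the V possible feature values adds αV to the denominator, so the smoothed estimates for a given class still sum to one.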
Therefore, conditional log odds (either a single log of conditional odds or a sum of a plurality of logs of estimated conditional odds) can be used in classification. This can be done in binary classification, multi-class classification (with a fixed set of more than 2 classes) and structured classification (such as bilingual word alignment or word segmentation). In some cases of classification, the logarithm of the ratio of the probability of some label given a feature and the probability of not the label, given the feature, produces the same results as the logarithm of the ratio of the probability of the feature given the label and the probability of the feature given not the label. That is,

$$\log\frac{P(y \mid x)}{P(\bar{y} \mid x)}$$

can be replaced by

$$\log\frac{P(x \mid y)}{P(x \mid \bar{y})}$$
Even in cases where the results are not mathematically equivalent, such substitutions may be effective. The present embodiment is intended to cover, in both multi-class classification and structured classification, such substitutions.
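The relationship between the two forms follows from Bayes' rule. Writing y for a label, ȳ for its complement, and x for a feature (notation assumed here for illustration):

$$\log\frac{P(y \mid x)}{P(\bar{y} \mid x)} = \log\frac{P(x \mid y)\,P(y)/P(x)}{P(x \mid \bar{y})\,P(\bar{y})/P(x)} = \log\frac{P(x \mid y)}{P(x \mid \bar{y})} + \log\frac{P(y)}{P(\bar{y})}$$

The two logarithms therefore differ only by the log prior odds of the label, which does not depend on x. A model that includes a separately weighted prior term for each label can, under that assumption, absorb this offset, which is one reason the substitution can leave the results unchanged.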
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
The present application is a continuation of and claims priority of U.S. patent application Ser. No. 11/242,290, filed Oct. 3, 2005, the content of which is hereby incorporated by reference in its entirety.
 | Number | Date | Country |
---|---|---|---|
Parent | 11242290 | Oct 2005 | US |
Child | 11485015 | Jul 2006 | US |