Text-to-text applications include machine translation and other machine intelligence systems such as speech recognition and automated summarization. These systems often rely on training carried out using information from specified databases of text known as corpora.
A training pipeline may include many millions of words, and it is not uncommon for the training to take weeks. There is often a tradeoff between the speed of the processing and the accuracy of the information obtained.
It is desirable to speed up the training of such a system.
The present application describes parallelization of certain aspects of training. Specifically, an embodiment describes how to parallelize a training task which requires knowledge about previous training portions.
The general structure and techniques, and more specific embodiments which can be used to effect different ways of carrying out the more general goals, are described herein.
A current training system may require as long as two weeks to train on 100 million words. Of course, faster processors may reduce that time. Parallelization of these operations by partitioning the input corpus is not straightforward, however, since certain operations require the accumulated results of other operations. Operations running on multiple processors would not otherwise have access to the results produced on the other processors.
In evaluating the entire training pipeline for machine translation, it was noticed that word alignment takes by far the most time of the entire process. For example, word alignment may take an order of magnitude longer than any of the other 11 processes used during training. Parallelizing word alignment can hence speed up training.
The embodiment shown in
In operation, the expectation maximization algorithm collects counts that are formed from initially arbitrary choices of probabilities between words in the full corpus. The words in the corpus are analyzed to find all word-to-word pairings ("hookups"). The current probabilities of these hookups are used to form a table of counts. That table of counts is then used, along with the corpus, to determine further probabilities. This process is then iterated.
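By way of illustration, the following is a minimal Python sketch of one such iteration in the style of IBM Model 1 (the data layout, the uniform starting value, and the toy corpus are illustrative assumptions, not the system's actual implementation): expected counts are collected for every word-to-word hookup in the corpus and then normalized into new probabilities.

```python
from collections import defaultdict

def em_iteration(corpus, t_table):
    """One expectation-maximization pass over a parallel corpus.

    corpus  : list of (source_words, target_words) sentence pairs
    t_table : mapping (source_word, target_word) -> probability
    Returns a new table of normalized probabilities.
    """
    counts = defaultdict(float)   # expected counts per hookup
    totals = defaultdict(float)   # totals per source word, for normalization

    # E-step: distribute each target word's count over the source words in
    # its sentence, in proportion to the current probabilities.
    for src, tgt in corpus:
        for t_word in tgt:
            denom = sum(t_table[(s_word, t_word)] for s_word in src)
            for s_word in src:
                frac = t_table[(s_word, t_word)] / denom
                counts[(s_word, t_word)] += frac
                totals[s_word] += frac

    # M-step: normalize the accumulated counts into new probabilities.
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}

# Example: start every observed hookup at a uniform probability and iterate.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t_table = defaultdict(lambda: 0.25)
for _ in range(5):
    t_table = defaultdict(lambda: 1e-12, em_iteration(corpus, t_table))
```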
The task of determining the word alignments requires analysis of both the table of probabilities from the final iteration of the expectation maximization algorithm and the corpus information.
Since the accumulation and normalization of count information is necessary, dividing this task across multiple processors is not a straightforward matter of simply splitting the work among processors and performing multiple isolated iterations of expectation maximization.
The master computer 99 runs a T table manager 105 which updates the interim T table and other model parameters 110 with counts and probabilities. The T table manager accumulates all of the data from all of the different evaluation passes through the corpus. These evaluations may create parameters and information other than the T table. The embodiment emphasizes the T table because it is usually very large, and hence its manipulation and storage requires significant resources, such as computer RAM. Many, if not all, word alignment models share this set of parameters. The embodiment contemplates operation with other models such as the HMM model, Model 2, and others. These models may use additional parameters, which may not be specifically discussed herein.
At 200, the master determines pieces of the corpus, shown as 120. Each of those pieces forms a sub-corpus 121, 122, 123. These form one component of a "work unit". The master also creates sub T tables at 210 that include only the word-to-word hookups that occur in the corresponding sub-corpus, shown as 125, 126, 127. These smaller tables minimize the memory requirements of each work unit.
If the model has additional parameters, these are included in the work unit as well.
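The following is a minimal sketch, under assumed names and data layouts, of how such work units might be assembled: the corpus is cut into pieces, and each piece is paired with a sub T table holding only the hookups that actually occur in it, along with any additional model parameters.

```python
def make_work_units(corpus, t_table, n_pieces, uniform=1e-4, extra_params=None):
    """Split the corpus into n_pieces sub-corpora, pairing each with a
    sub-T-table that contains only the hookups occurring in that piece."""
    # Assign sentence pairs to pieces (here: simple contiguous blocks).
    size = (len(corpus) + n_pieces - 1) // n_pieces
    sub_corpora = [corpus[i:i + size] for i in range(0, len(corpus), size)]

    work_units = []
    for sub_corpus in sub_corpora:
        # Collect only the word-to-word hookups seen in this sub-corpus.
        sub_t = {}
        for src, tgt in sub_corpus:
            for s_word in src:
                for t_word in tgt:
                    sub_t[(s_word, t_word)] = t_table.get((s_word, t_word), uniform)
        unit = {"sub_corpus": sub_corpus, "sub_t_table": sub_t}
        if extra_params:                 # e.g. HMM jump or Model 2 parameters
            unit["extra_params"] = extra_params
        work_units.append(unit)
    return work_units
```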
Computing which word-to-word hookups appear in a given sub-corpus is expensive in terms of computer resources. The system described herein uses multiple computing iterations, and one aspect enables reusing the sub-T-table output returned from previous iterations rather than recomputing those hookups for each iteration.
The first iteration must build the sub-T-tables from scratch. However, rather than creating all of those sub-T-tables on the master machine, the first iteration is made "special": only the sub-corpus is sent as a work unit. Each worker computes the hookups and creates its own sub-T-table. Each worker machine then uses the sub-T-table and sub-corpus to compute parameter counts as in the normal expectation maximization operation. When all desired iterations are complete, the worker machines compute the final alignment of the sub-corpus, using the same sub-T-table and other parameters of the model.
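A sketch of the worker side of this scheme (illustrative names only): on the first iteration the worker receives only its sub-corpus and builds its own uniform sub-T-table; on later iterations it reuses the table sent back by the master; after the final iteration it computes the alignment of its sub-corpus.

```python
from collections import defaultdict

def worker_pass(sub_corpus, sub_t_table=None):
    """One worker pass: build the sub-T-table if absent (the "special"
    first iteration), then collect expected counts for its hookups."""
    if sub_t_table is None:
        sub_t_table = {}
        for src, tgt in sub_corpus:
            for s in src:
                for t in tgt:
                    sub_t_table[(s, t)] = 1.0   # uniform start; scale cancels out
    counts = defaultdict(float)
    for src, tgt in sub_corpus:
        for t in tgt:
            denom = sum(sub_t_table[(s, t)] for s in src)
            for s in src:
                counts[(s, t)] += sub_t_table[(s, t)] / denom
    return counts                                # returned to the T table manager

def worker_align(sub_corpus, sub_t_table):
    """After the final iteration: link each target word to its most
    probable source word under the trained sub-T-table."""
    alignments = []
    for src, tgt in sub_corpus:
        links = [max(range(len(src)), key=lambda i: sub_t_table[(src[i], t)])
                 for t in tgt]
        alignments.append(links)
    return alignments
```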
These counts, in the form of sub T tables 131, 132, 133, and possibly other parameter tables shown generically as 136, are then returned to the T table manager 105 at 215. The T table manager 105 collects the count information and normalizes it to form new probabilities at 220. The T table manager sends the new probabilities back to the work units for use in evaluating their next units of work. After all iterations are complete, the work units return a final alignment of their sub-corpora, which allows the master machine to simply concatenate these alignments in the proper order, completing the full word alignment process.
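A sketch of the manager side of 215 and 220 (again with illustrative names and the work-unit layout assumed above): the counts returned by the work units are merged and normalized into new probabilities, each work unit is sent back only the rows it uses, and the final per-piece alignments are concatenated in order.

```python
from collections import defaultdict

def accumulate_and_normalize(returned_counts):
    """Merge the count tables returned by all work units and normalize
    them into new word-to-word probabilities."""
    merged = defaultdict(float)
    totals = defaultdict(float)
    for counts in returned_counts:               # one table per work unit
        for (s, t), c in counts.items():
            merged[(s, t)] += c
            totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in merged.items()}

def redistribute(t_table, work_units):
    """Send each work unit only the rows of the new T table that its
    sub-corpus actually uses; every such hookup received a count."""
    for unit in work_units:
        unit["sub_t_table"] = {pair: t_table[pair]
                               for pair in unit["sub_t_table"]}

def concatenate(alignments_per_unit):
    """After the last iteration, join the per-piece alignments in corpus order."""
    return [a for unit_alignments in alignments_per_unit for a in unit_alignments]
```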
The probabilities include word-to-word translation parameters and other model parameters. In operation, for example, the corpus may be passed through both the Model 1 algorithm and the HMM algorithm five times. Each pass through an algorithm updates the probabilities in the T table and other tables. The tables are then used for further iterations and, eventually, alignment.
The T table manager is shown in
The work units should each receive roughly similar amounts of work. The amount of work to be done by a work unit is roughly proportional to the lengths of its sentences. Accordingly, it is desirable for the different sub-corpora to represent roughly similar amounts of work.
A first way of breaking up the data relies on the sub-corpora being probabilistically similar: if sentences are assigned at random, the lengths of the sentences within each sub-corpus should be approximately average. Therefore, a first way of effecting 200 in
Another embodiment of 200 sorts the corpus by sentence length and assigns sentences in order from the length-sorted corpus. In this way, all work units receive a roughly similar mix of sentence lengths.
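A minimal sketch of this length-balanced assignment (illustrative; the original sentence indices are carried along so the final alignments can later be restored to corpus order):

```python
def split_by_length(corpus, n_pieces):
    """Assign sentence pairs to pieces in round-robin order from a
    length-sorted corpus, so each piece gets a similar mix of lengths."""
    order = sorted(range(len(corpus)),
                   key=lambda i: len(corpus[i][0]) + len(corpus[i][1]))
    pieces = [[] for _ in range(n_pieces)]
    for rank, idx in enumerate(order):
        # Keep the original index so alignments can be concatenated in order.
        pieces[rank % n_pieces].append((idx, corpus[idx]))
    return pieces
```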
The T table manager 105 normalizes between each iteration to produce new T table information from the sub T tables.
According to another embodiment, the T table manager may divide the information into N units, where N is different from the number of machines doing the actual computations. The units are queued up in the T table manager and are consumed by the machines during their operation. A work unit queuing system, such as "Condor", may be used to allocate and provide work to the different machines as each machine becomes available.
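A sketch of this decoupling of work units from machines (the in-process queue and threads here merely stand in for a cluster scheduler such as Condor; all names are illustrative):

```python
import queue
import threading

def run_work_units(work_units, n_machines, process):
    """Queue N work units and let n_machines workers pull from the queue
    as they become free; N need not equal n_machines."""
    q = queue.Queue()
    for unit_id, unit in enumerate(work_units):
        q.put((unit_id, unit))
    results = {}

    def machine():
        while True:
            try:
                unit_id, unit = q.get_nowait()
            except queue.Empty:
                return
            results[unit_id] = process(unit)   # e.g. one count-collection pass
            q.task_done()

    threads = [threading.Thread(target=machine) for _ in range(n_machines)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [results[i] for i in range(len(work_units))]
```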
The master processor may also carry out other operations in between accumulating the T table results. For example, the master processor may allocate the work units, and may itself act as a worker, either for a complete work unit or for some unit smaller than the usual work unit.
The calculations by the work units may also be time-monitored by either the master processor or some other processor. Some units may become stragglers, either because they are processing a particularly difficult work unit or because the computer itself has a hardware or software fault. According to another aspect, the work allocation unit maintains a time-out unit shown as 225. If the elapsed time exceeds a specified limit, the work unit may be reallocated to another machine. The first machine to return a result is accepted.
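A sketch of such a time-out scheme (illustrative bookkeeping only; a deployed system might leave this to the scheduler): dispatch times are recorded, overdue units are reissued, and only the first result returned for each unit is kept.

```python
import time

class TimeoutAllocator:
    """Track dispatched work units and reissue stragglers."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.dispatched = {}      # unit_id -> dispatch time
        self.results = {}         # unit_id -> first result returned

    def dispatch(self, unit_id):
        self.dispatched[unit_id] = time.time()

    def overdue(self):
        """Units that should be reallocated to another machine."""
        now = time.time()
        return [uid for uid, started in self.dispatched.items()
                if uid not in self.results and now - started > self.timeout]

    def receive(self, unit_id, result):
        # The first machine to return a result is accepted; later
        # (duplicate) results for the same unit are ignored.
        self.results.setdefault(unit_id, result)
```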
The word alignment procedure, from splitting the corpus through concatenating the final alignments, proceeds as follows.
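A compact Python sketch of the whole flow (the sequential loop over pieces stands in for the parallel worker machines; all names and the toy corpus are illustrative assumptions):

```python
from collections import defaultdict

def parallel_word_alignment(corpus, n_pieces=2, iterations=5):
    # 1. Split the corpus into pieces.
    size = (len(corpus) + n_pieces - 1) // n_pieces
    pieces = [corpus[i:i + size] for i in range(0, len(corpus), size)]

    # 2. Initialize a small T table per piece with uniform probabilities.
    sub_tables = [{(s, t): 1.0
                   for src, tgt in piece for s in src for t in tgt}
                  for piece in pieces]

    # 3. Iterate: each piece collects counts; the counts are added and
    #    normalized globally; each piece then receives its updated rows.
    for _ in range(iterations):
        merged, totals = defaultdict(float), defaultdict(float)
        for piece, sub_t in zip(pieces, sub_tables):
            for src, tgt in piece:
                for t in tgt:
                    denom = sum(sub_t[(s, t)] for s in src)
                    for s in src:
                        c = sub_t[(s, t)] / denom
                        merged[(s, t)] += c
                        totals[s] += c
        new_t = {(s, t): c / totals[s] for (s, t), c in merged.items()}
        sub_tables = [{pair: new_t[pair] for pair in sub_t}
                      for sub_t in sub_tables]

    # 4. Align each piece with the trained model and concatenate in order.
    alignment = []
    for piece, sub_t in zip(pieces, sub_tables):
        for src, tgt in piece:
            alignment.append([max(range(len(src)),
                                  key=lambda i: sub_t[(src[i], t)])
                              for t in tgt])
    return alignment

# Toy usage: each target word is linked to a source-word position.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "maison", "bleue"], ["the", "blue", "house"])]
print(parallel_word_alignment(corpus))
```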
It may also be useful to return some of the intermediate parameter tables themselves, as is commonly done in machine translation.
To summarize the above procedure, the operations of the computer are as follows: first, the corpus is split into pieces and small T tables with uniform probabilities are formed as an initialization. Counts are then added and normalized over multiple iterations of the different models. After the final iteration, alignment is carried out using the most recently trained model, and the per-piece alignments are concatenated to obtain an alignment of the full corpus.
Although only a few embodiments have been disclosed in detail above, other embodiments are possible and are intended to be encompassed within this specification. The specification describes specific examples to accomplish a more general goal that may be accomplished in other ways. This disclosure is intended to be exemplary, and the claims are intended to cover any modification or alternative which might be predictable to a person having ordinary skill in the art. For example, while the above describes parallelizing a word alignment, it should be understood that any machine-based text application that requires accumulation of probabilities can be parallelized in this way. While the above has described the work being broken up in a specified way, it should be understood that the work can be broken up in different ways. For example, the T-table manager can receive data other than counts and/or probabilities from the sub-units and may compute information from raw data obtained from the T-table manager.
Also, only those claims which use the words “means for” are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.
The present application is a continuation of U.S. patent application Ser. No. 11/196,785, filed Aug. 2, 2005 now abandoned, and entitled “Task Parallelization in a Text-to-Text System,” which is herein incorporated by reference.
Relation | Application No. | Filing Date | Country
---|---|---|---
Parent | 11196785 | Aug 2005 | US
Child | 11412307 | | US