Text-to-text applications include machine translation and other machine intelligence systems such as speech recognition and automated summarization. These systems often rely on training carried out using information from specified databases of text known as corpora.
A training pipeline may include many millions of words, and it is not uncommon for the training to take weeks. There is often a tradeoff between the speed of the processing and the accuracy of the information obtained.
It is desirable to speed up the training of such a system.
The present application describes parallelization of certain aspects of training. Specifically, an embodiment describes how to parallelize a training task which requires knowledge about previous training portions.
The general structure and techniques, and more specific embodiments which can be used to effect different ways of carrying out the more general goals, are described herein.
A current training system may require as long as two weeks to train on 100 million words. Of course, faster processors may reduce that time. Parallelization of these operations by partitioning the input corpus is not straightforward, however, since certain operations require the accumulated results of other operations. Operations running on multiple processors would not otherwise have access to the results produced on the other processors.
In evaluating the entire training pipeline for machine translation, it was noticed that word alignment takes by far the most time of the entire process. For example, word alignment may take an order of magnitude longer than any of the other 11 processes used during training. Parallelizing word alignment can hence speed up training.
The embodiment shown in
In operation, the expectation maximization algorithm collects counts that are formed from initially arbitrary choices of probabilities between words in the full corpus. The words in the corpus are analyzed to find all word-to-word pairings ("hookups"). The current probabilities of these hookups are used to form a table of counts. That table of counts is then used, along with the corpus, to determine further probabilities. This process is then iterated.
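By way of illustration, the following is a minimal Python sketch of one such iteration in the style of IBM Model 1 (the data layout, the uniform starting value, and the toy corpus are illustrative assumptions, not the system's actual implementation): expected counts are collected for every word-to-word hookup in the corpus and then normalized into new probabilities.

```python
from collections import defaultdict

def em_iteration(corpus, t_table):
    """One expectation-maximization pass over a parallel corpus.

    corpus  : list of (source_words, target_words) sentence pairs
    t_table : mapping (source_word, target_word) -> probability
    Returns a new table of normalized probabilities.
    """
    counts = defaultdict(float)   # expected counts per hookup
    totals = defaultdict(float)   # totals per source word, for normalization

    # E-step: distribute each target word's count over the source words in
    # its sentence, in proportion to the current probabilities.
    for src, tgt in corpus:
        for t_word in tgt:
            denom = sum(t_table[(s_word, t_word)] for s_word in src)
            for s_word in src:
                frac = t_table[(s_word, t_word)] / denom
                counts[(s_word, t_word)] += frac
                totals[s_word] += frac

    # M-step: normalize the accumulated counts into new probabilities.
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}

# Example: start every observed hookup at a uniform probability and iterate.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t_table = defaultdict(lambda: 0.25)
for _ in range(5):
    t_table = defaultdict(lambda: 1e-12, em_iteration(corpus, t_table))
```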
The task of determining the word alignments requires analysis of both the table of probabilities from the final iteration of the expectation maximization algorithm and the corpus information.
Since the accumulation and normalization of count information is necessary, dividing this task across multiple processors is not a straightforward matter of simply splitting the work among processors and performing multiple isolated iterations of expectation maximization.
The master computer 99 runs a T table manager 105 which updates the interim T table and other model parameters 110 with counts and probabilities. The T table manager accumulates all of the data from all of the different evaluation passes through the corpus. These evaluations may create parameters and information other than the T table. The embodiment emphasizes the T table because it is usually very large, and hence its manipulation and storage requires significant resources, such as computer RAM. Many, if not all, word alignment models share this set of parameters. The embodiment contemplates operation with other models such as the HMM model, Model 2, and others. These models may use additional parameters, which may not be specifically discussed herein.
At 200, the master determines pieces of the corpus, shown as 120. Each of those pieces forms a sub-corpus 121, 122, 123. These form one component of a "work unit". The master also creates sub T tables at 210 that include only the word-to-word hookups that occur in the corresponding sub-corpus, shown as 125, 126, 127. These smaller tables minimize the memory requirements of each work unit.
If the model has additional parameters, these are included in the work unit as well.
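The following is a minimal sketch, under assumed names and data layouts, of how such work units might be assembled: the corpus is cut into pieces, and each piece is paired with a sub T table holding only the hookups that actually occur in it, along with any additional model parameters.

```python
def make_work_units(corpus, t_table, n_pieces, uniform=1e-4, extra_params=None):
    """Split the corpus into n_pieces sub-corpora, pairing each with a
    sub-T-table that contains only the hookups occurring in that piece."""
    # Assign sentence pairs to pieces (here: simple contiguous blocks).
    size = (len(corpus) + n_pieces - 1) // n_pieces
    sub_corpora = [corpus[i:i + size] for i in range(0, len(corpus), size)]

    work_units = []
    for sub_corpus in sub_corpora:
        # Collect only the word-to-word hookups seen in this sub-corpus.
        sub_t = {}
        for src, tgt in sub_corpus:
            for s_word in src:
                for t_word in tgt:
                    sub_t[(s_word, t_word)] = t_table.get((s_word, t_word), uniform)
        unit = {"sub_corpus": sub_corpus, "sub_t_table": sub_t}
        if extra_params:                 # e.g. HMM jump or Model 2 parameters
            unit["extra_params"] = extra_params
        work_units.append(unit)
    return work_units
```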
Computing which word-to-word hookups appear in a given sub-corpus is expensive in terms of computer resources. The system described herein uses multiple computing iterations, and one aspect enables reusing the sub-T-table output returned from previous iterations rather than recomputing those hookups for each iteration.
The first iteration must build the sub-T-tables from scratch. However, rather than creating all of those sub-T-tables on the master machine, the first iteration is made "special": only the sub-corpus is sent as a work unit. Each worker computes the hookups and creates its own sub-T-table. Each worker machine then uses the sub-T-table and sub-corpus to compute parameter counts as in the normal expectation maximization operation. When all desired iterations are complete, the worker machines compute the final alignment of the sub-corpus, using the same sub-T-table and other parameters of the model.
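A sketch of the worker side of this scheme (illustrative names only): on the first iteration the worker receives only its sub-corpus and builds its own uniform sub-T-table; on later iterations it reuses the table sent back by the master; after the final iteration it computes the alignment of its sub-corpus.

```python
from collections import defaultdict

def worker_pass(sub_corpus, sub_t_table=None):
    """One worker pass: build the sub-T-table if absent (the "special"
    first iteration), then collect expected counts for its hookups."""
    if sub_t_table is None:
        sub_t_table = {}
        for src, tgt in sub_corpus:
            for s in src:
                for t in tgt:
                    sub_t_table[(s, t)] = 1.0   # uniform start; scale cancels out
    counts = defaultdict(float)
    for src, tgt in sub_corpus:
        for t in tgt:
            denom = sum(sub_t_table[(s, t)] for s in src)
            for s in src:
                counts[(s, t)] += sub_t_table[(s, t)] / denom
    return counts                                # returned to the T table manager

def worker_align(sub_corpus, sub_t_table):
    """After the final iteration: link each target word to its most
    probable source word under the trained sub-T-table."""
    alignments = []
    for src, tgt in sub_corpus:
        links = [max(range(len(src)), key=lambda i: sub_t_table[(src[i], t)])
                 for t in tgt]
        alignments.append(links)
    return alignments
```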
These counts, in the form of sub T tables 131, 132, 133, and possibly other parameter tables shown generically as 136, are then returned to the T table manager 105 at 215. The T table manager 105 collects the count information and normalizes it to form new probabilities at 220. The T table manager sends the new probabilities back to the work units for use in evaluating their next units of work. After all iterations are complete, the work units return a final alignment of their sub-corpora, which allows the master machine to simply concatenate these alignments in the proper order, completing the full word alignment process.
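A sketch of the manager side of 215 and 220 (again with illustrative names and the work-unit layout assumed above): the counts returned by the work units are merged and normalized into new probabilities, each work unit is sent back only the rows it uses, and the final per-piece alignments are concatenated in order.

```python
from collections import defaultdict

def accumulate_and_normalize(returned_counts):
    """Merge the count tables returned by all work units and normalize
    them into new word-to-word probabilities."""
    merged = defaultdict(float)
    totals = defaultdict(float)
    for counts in returned_counts:               # one table per work unit
        for (s, t), c in counts.items():
            merged[(s, t)] += c
            totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in merged.items()}

def redistribute(t_table, work_units):
    """Send each work unit only the rows of the new T table that its
    sub-corpus actually uses; every such hookup received a count."""
    for unit in work_units:
        unit["sub_t_table"] = {pair: t_table[pair]
                               for pair in unit["sub_t_table"]}

def concatenate(alignments_per_unit):
    """After the last iteration, join the per-piece alignments in corpus order."""
    return [a for unit_alignments in alignments_per_unit for a in unit_alignments]
```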
The probabilities include word-to-word translation parameters and other model parameters. In operation, for example, the corpus may be passed through both the Model 1 algorithm and the HMM algorithm five times. Each pass through an algorithm updates the probabilities in the T table and other tables. The tables are then used for further iterations and, eventually, alignment.
The T table manager is shown in
The work units should each receive roughly similar amounts of work. The amount of work to be done by a work unit is roughly proportional to the lengths of its sentences. Accordingly, it is desirable for the different sub-corpora to represent roughly similar amounts of work.
A first way of breaking up the data relies on the sub-corpora being probabilistically similar: if sentences are assigned at random, the lengths of the sentences within each sub-corpus should be approximately average. Therefore, a first way of effecting 200 in
Another embodiment of 200 sorts the corpus by sentence length and assigns sentences in order from the length-sorted corpus. In this way, all work units receive a roughly similar mix of sentence lengths.
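A minimal sketch of this length-balanced assignment (illustrative; the original sentence indices are carried along so the final alignments can later be restored to corpus order):

```python
def split_by_length(corpus, n_pieces):
    """Assign sentence pairs to pieces in round-robin order from a
    length-sorted corpus, so each piece gets a similar mix of lengths."""
    order = sorted(range(len(corpus)),
                   key=lambda i: len(corpus[i][0]) + len(corpus[i][1]))
    pieces = [[] for _ in range(n_pieces)]
    for rank, idx in enumerate(order):
        # Keep the original index so alignments can be concatenated in order.
        pieces[rank % n_pieces].append((idx, corpus[idx]))
    return pieces
```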
The T table manager 105 normalizes between each iteration to produce new T table information from the sub T tables.
According to another embodiment, the T table manager may divide the information into N units, where N is different from the number of machines doing the actual computations. The units are queued up in the T table manager and are consumed by the machines during their operation. A work unit queuing system, such as "Condor", may be used to allocate and provide work to the different machines as each machine becomes available.
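A sketch of this decoupling of work units from machines (the in-process queue and threads here merely stand in for a cluster scheduler such as Condor; all names are illustrative):

```python
import queue
import threading

def run_work_units(work_units, n_machines, process):
    """Queue N work units and let n_machines workers pull from the queue
    as they become free; N need not equal n_machines."""
    q = queue.Queue()
    for unit_id, unit in enumerate(work_units):
        q.put((unit_id, unit))
    results = {}

    def machine():
        while True:
            try:
                unit_id, unit = q.get_nowait()
            except queue.Empty:
                return
            results[unit_id] = process(unit)   # e.g. one count-collection pass
            q.task_done()

    threads = [threading.Thread(target=machine) for _ in range(n_machines)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [results[i] for i in range(len(work_units))]
```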
The master processor may also carry out other operations in between accumulating the T table results. For example, the master processor may allocate the work units, and may itself act as a worker, either for a complete work unit or for some unit smaller than the usual work unit.
The calculations by the work units may also be time-monitored by either the master processor or some other processor. Some units may become stragglers, either because they are processing a particularly difficult work unit or because the computer itself has a hardware or software fault. According to another aspect, the work allocation unit maintains a time-out unit shown as 225. If the elapsed time exceeds a specified limit, the work unit may be reallocated to another machine. The first machine to return a result is accepted.
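A sketch of such a time-out scheme (illustrative bookkeeping only; a deployed system might leave this to the scheduler): dispatch times are recorded, overdue units are reissued, and only the first result returned for each unit is kept.

```python
import time

class TimeoutAllocator:
    """Track dispatched work units and reissue stragglers."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.dispatched = {}      # unit_id -> dispatch time
        self.results = {}         # unit_id -> first result returned

    def dispatch(self, unit_id):
        self.dispatched[unit_id] = time.time()

    def overdue(self):
        """Units that should be reallocated to another machine."""
        now = time.time()
        return [uid for uid, started in self.dispatched.items()
                if uid not in self.results and now - started > self.timeout]

    def receive(self, unit_id, result):
        # The first machine to return a result is accepted; later
        # (duplicate) results for the same unit are ignored.
        self.results.setdefault(unit_id, result)
```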
The word alignment procedure, from splitting the corpus through concatenating the final alignments, proceeds as follows.
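A compact Python sketch of the whole flow (the sequential loop over pieces stands in for the parallel worker machines; all names and the toy corpus are illustrative assumptions):

```python
from collections import defaultdict

def parallel_word_alignment(corpus, n_pieces=2, iterations=5):
    # 1. Split the corpus into pieces.
    size = (len(corpus) + n_pieces - 1) // n_pieces
    pieces = [corpus[i:i + size] for i in range(0, len(corpus), size)]

    # 2. Initialize a small T table per piece with uniform probabilities.
    sub_tables = [{(s, t): 1.0
                   for src, tgt in piece for s in src for t in tgt}
                  for piece in pieces]

    # 3. Iterate: each piece collects counts; the counts are added and
    #    normalized globally; each piece then receives its updated rows.
    for _ in range(iterations):
        merged, totals = defaultdict(float), defaultdict(float)
        for piece, sub_t in zip(pieces, sub_tables):
            for src, tgt in piece:
                for t in tgt:
                    denom = sum(sub_t[(s, t)] for s in src)
                    for s in src:
                        c = sub_t[(s, t)] / denom
                        merged[(s, t)] += c
                        totals[s] += c
        new_t = {(s, t): c / totals[s] for (s, t), c in merged.items()}
        sub_tables = [{pair: new_t[pair] for pair in sub_t}
                      for sub_t in sub_tables]

    # 4. Align each piece with the trained model and concatenate in order.
    alignment = []
    for piece, sub_t in zip(pieces, sub_tables):
        for src, tgt in piece:
            alignment.append([max(range(len(src)),
                                  key=lambda i: sub_t[(src[i], t)])
                              for t in tgt])
    return alignment

# Toy usage: each target word is linked to a source-word position.
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "maison", "bleue"], ["the", "blue", "house"])]
print(parallel_word_alignment(corpus))
```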
It may also be useful to return some of the intermediate parameter tables themselves, as is commonly done in machine translation.
To summarize the above procedure, the operations of the computer are as follows: first, the corpus is split into pieces and small T tables with uniform probabilities are formed as an initialization. Counts are then added and normalized over multiple iterations of the different models. After the final iteration, alignment is carried out using the most recently trained model, and the per-piece alignments are concatenated to obtain an alignment of the full corpus.
Although only a few embodiments have been disclosed in detail above, other embodiments are possible and are intended to be encompassed within this specification. The specification describes specific examples to accomplish a more general goal that may be accomplished in other ways. This disclosure is intended to be exemplary, and the claims are intended to cover any modification or alternative which might be predictable to a person having ordinary skill in the art. For example, while the above describes parallelizing a word alignment, it should be understood that any machine-based text application that requires accumulation of probabilities can be parallelized in this way. While the above has described the work being broken up in a specified way, it should be understood that the work can be broken up in different ways. For example, the T-table manager can receive data other than counts and/or probabilities from the sub-units and may compute information from raw data obtained from the T-table manager.
Also, only those claims which use the words “means for” are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.
The present application is a continuation of U.S. patent application Ser. No. 11/196,785, filed Aug. 2, 2005 now abandoned, and entitled “Task Parallelization in a Text-to-Text System,” which is herein incorporated by reference.
Relation | Application No. | Filing Date | Country
---|---|---|---
Parent | 11196785 | Aug 2005 | US
Child | 11412307 | | US