Identifying documents which form translated pairs within a document collection

Information

  • Patent Grant
  • Patent Number
    7,813,918
  • Date Filed
    Wednesday, August 3, 2005
  • Date Issued
    Tuesday, October 12, 2010
Abstract
A training system for a text-to-text application. The training system finds groups of documents and automatically identifies documents within those groups that are similar to one another. The automatically identified documents can then be used for training of the text-to-text application. The comparison uses reduced size versions of the documents in order to minimize the amount of processing.
Description
BACKGROUND

Text-to-text applications include machine translation, automated summarization, question answering, and other similar applications in which a machine carries out the function of understanding some kind of input information and generating text. The input information is often “text” but, more generally, can be any kind of information that the machine can receive and understand.


Conventional text-to-text applications use heterogeneous methods for implementing the generation phase. Machine translation often produces sentences using application-specific decoders based on work originally conducted in speech recognition. Automated summarization produces abstracts using task-specific strategies.


Machine translation systems rely on training that is carried out based on corresponding, or “parallel,” information that exists in each of two languages. The information in the two languages can come from many sources. Sometimes, it is known that the contents of two documents represent the same information.


The Internet is a source of information. Documents on the Internet are often available in multiple languages. However, it may be difficult to identify mutual translations among the many different web pages on the Internet. Comparing all documents within the document pool using conventional systems would require a number of computations that scales with the square of the number of documents, that is, with the number of document pairs.


For example, each English-language page can be compared with every known French-language page to determine the best match. This naive approach would require extremely long computation times to identify the training pairs.


Philip Resnik has suggested a method that identifies parallel documents by finding pairs of similar URLs which are presumed to point to the same content in different languages. For example, if one URL contains “EN” and an otherwise identical URL contains “FR”, the two are presumed to be parallel URLs.


Not all Web documents are in this form, and Resnik's system is quite specific to web pages that have this kind of URL structure.


SUMMARY

The present application teaches a system that forms a similarity measure which returns a score for a given document pair. Techniques are disclosed whose computational cost scales as n log n with the number of documents n.


One aspect forms a reduced-size version of a document that is associated with the document contents, and compares that reduced-size version with comparably reduced versions of documents in other languages. The reduced-size version can be a document fingerprint.


Another aspect compares the documents using a probabilistic shuffling technique, in which the document representations are reordered and then compared against some, but not all, of the information about the other documents. The shuffling may be carried out numerous times in order to obtain a best match.





BRIEF DESCRIPTION OF THE DRAWINGS

The drawings show:



FIG. 1 shows a block diagram of a system;



FIG. 2 shows a flowchart of operation to find parallel information;



FIG. 3 shows a flowchart of an embodiment of determining the signatures of the documents; and



FIG. 4 shows a flowchart of another embodiment.





DETAILED DESCRIPTION

The general structure and techniques, as well as more specific embodiments that can be used to carry out the more general goals in different ways, are described herein.



FIG. 1 illustrates an exemplary hardware device and its data flow, which may execute the operations described with reference to the flowcharts. This system can be used for any text-to-text application; however, the embodiment described here is the specific application of machine translation.


A processor is assumed to have access to various sources 105. The sources may be parallel corpora of multiple language information. Specifically, the sources may include translation memories, probabilistic and non-probabilistic word- and phrase-based dictionaries, glossaries, Internet information, parallel corpora in multiple languages, non-parallel corpora in multiple languages having similar subject matter, and human-created translations. The processor creates training data 110.


Speech engine 120 carries out a text-to-text application based on the training data.


The present application teaches a system of identifying mutual translations within a collection of documents such as 105. The documents are assumed to be in first and second languages.


A first embodiment describes the first and second languages as being English and French. It should be understood, however, that any first and second languages could be used. The language information is used to train a machine based text to text system. That system can be machine translation, automated summarization, speech recognition, or any other machine application.


Data from the Web can be gathered by focused crawling. Alternatively, other data can be obtained. The data includes a collection of information in first and second languages that does not necessarily have any subject matter connection. This data is used as the input to the system.


The processing computer operates to find mutual translations according to the flowchart of FIG. 2. At 200, each of the French-language documents is translated into English using a rough machine translator. This rough translation is done quickly and produces an adequate, but not perfect, translation. The translation technique used at 200 is optimized for speed, not for accuracy. This translation produces two sets of documents in the same language, here English. One of those sets is the original English documents, called herein the native documents. The other set is the translated documents.
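

Purely as an illustration of how the two document sets produced at 200 might be organized in code, the following minimal sketch is offered; the function name make_document_sets and the rough_translate argument, which stands in for any fast, approximate French-to-English translator, are hypothetical and not taken from the patent.

    def make_document_sets(english_docs, french_docs, rough_translate):
        # english_docs, french_docs: dicts mapping document id -> raw text.
        # The originals are the "native" set; the quick translations form the
        # "translated" set. rough_translate is a stand-in for any fast,
        # approximate French-to-English machine translator.
        native = dict(english_docs)
        translated = {doc_id: rough_translate(text) for doc_id, text in french_docs.items()}
        return native, translated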


At 210, reduced size versions of the documents are created for both the native and translated documents. The reduced size version has parts that are associated with the document contents. The reduced size version can be a document fingerprint. The fingerprint has “keys” that relate to words and to their placement within a set of dictionaries, described below. In effect, the fingerprint concisely summarizes information about the words contained in the document.


Different techniques of forming fingerprints may be used, and one specific technique is described herein with reference to FIG. 3. At 300, n dictionaries are obtained. A dictionary can be any grouping of words that includes the words in the language of the documents. The dictionaries can be conventional dictionaries or any other collections of words. Each of the dictionaries will have different words in different orders. At 305, the system identifies which word in the document appears first, or at some specified location, within each dictionary. At 310, the position of that word within the dictionary is assigned to a key that corresponds to the dictionary. Each dictionary will be different, and therefore each dictionary will produce a different key. Each of the keys is associated with the document contents.
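

The following is a minimal sketch of one plausible reading of the FIG. 3 fingerprinting step, in which each dictionary is a random ordering of a common vocabulary and each key is the smallest dictionary position of any word in the document. The function names, the random-ordering assumption, and the sentinel value for unmatched words are illustrative only and are not details taken from the patent.

    import random

    NUM_DICTIONARIES = 128

    def build_dictionaries(vocabulary, n=NUM_DICTIONARIES, seed=0):
        # Each "dictionary" is one random ordering of the shared vocabulary,
        # stored as a word -> position map.
        rng = random.Random(seed)
        words = list(vocabulary)
        dictionaries = []
        for _ in range(n):
            rng.shuffle(words)
            dictionaries.append({word: pos for pos, word in enumerate(words)})
        return dictionaries

    def fingerprint(document_words, dictionaries):
        # One key per dictionary: the smallest dictionary position of any word
        # that occurs in the document (a sentinel is used if no word matches).
        keys = []
        for dictionary in dictionaries:
            positions = [dictionary[w] for w in document_words if w in dictionary]
            keys.append(min(positions) if positions else len(dictionary))
        return keys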


The keys collectively form a fingerprint. A typical system of this type may use 128 different dictionaries, and hence the fingerprint, shown at 315, is formed of 128 different keys. Each document will form a unique set of keys; conversely, the keys effectively form a signature that allows identification of the document. Any other type of signature that identifies the document can alternatively be used.


At 220, each of the native and translated documents is compared to its neighboring documents, that is, not to all documents in the database, but only to specified neighboring documents. The comparison may use, for example, fast Hamming matching. The comparison may be only to the left and right neighbors, or alternatively to the 2-5 nearest left and right neighbors, or to some other number of neighbors. The Hamming distance is found at 225 and represents how many pieces of the pair of fingerprints do not match.


Even a document and an exact translation of it would not match exactly, because of imperfections in the rough translator, for example. The Hamming distance indicates the amount by which the fingerprints do not match.
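

For concreteness, a minimal sketch of the Hamming comparison between two fingerprints of equal length, assuming the fingerprints are lists of integer keys as in the sketch above:

    def hamming_distance(fingerprint_a, fingerprint_b):
        # Count the key positions at which the two fingerprints disagree.
        return sum(1 for a, b in zip(fingerprint_a, fingerprint_b) if a != b)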


At 230, a shuffle is carried out, in which the order of the keys within the native and translated fingerprints is randomly permuted; the same reordering is applied to every fingerprint. After shuffling, the documents are sorted at 235 according to their fingerprints. The documents are again compared to their nearest neighbor(s) at 225. Flow continues until a desired match is obtained. The output, at 240, is the closest neighbor.


The shuffle operation uses statistical properties to find the nearest neighbor. For a database with 1,000 documents, for example, the shuffle can find the nearest neighbor after approximately 50 shuffles.


The shuffle is carried out so that the keys can be sorted in a way that brings similar items “nearby” one another, so that they get compared.


For example, consider the following two 5-key signatures:

    • doc-1: 1 0 3 4 5
    • doc-2: 1 9 3 4 5


These two documents are quite similar because they differ in only one key (i.e., 0 vs. 9).


However, the ordering of the documents may be very different, depending on the key order. A worst-case shuffle, for example, may lead to the following key re-ordering:

    • doc-1: 0 1 3 4 5
    • doc-2: 9 1 3 4 5


When the documents are sorted according to their keys under this worst-case reordering, doc-1 and doc-2 are likely to end up very far apart. An example sorted ordering might be:

    • doc-1: 0 1 3 4 5
    • . . .
    • doc-11: 2 0 3 4 5
    • doc-12: 2 9 3 4 5
    • doc-13: 3 0 3 4 5
    • doc-17: 4 0 3 4 5
    • doc-22: 4 9 3 4 5
    • doc-29: 5 9 3 4 9
    • . . .
    • doc-2: 9 1 3 4 5


In contrast, a best-case shuffle orders the key positions so that the keys on which the documents agree come first. For example, a best-case shuffle might produce:

    • doc-1: 1 3 4 5 0
    • doc-2: 1 3 4 5 9


In this case, after sorting, the documents will be very close.
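

The shuffle-sort-compare loop of FIG. 2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the same random permutation of key positions is applied to every fingerprint, compares each native document only to a small window of translated neighbors in the sorted order, and keeps the smallest Hamming distance seen over a number of shuffles. The name find_parallel_pairs and the window and shuffles parameters are illustrative.

    import random

    def find_parallel_pairs(native_fps, translated_fps, shuffles=50, window=2, seed=0):
        # native_fps, translated_fps: dicts mapping document id -> list of integer keys.
        # Returns, for each native document, the (distance, translated id) of the
        # closest translated fingerprint found across all shuffles.
        rng = random.Random(seed)
        num_keys = len(next(iter(native_fps.values())))
        best = {}

        for _ in range(shuffles):
            # Shuffle: one random permutation of key positions, applied to every fingerprint.
            perm = list(range(num_keys))
            rng.shuffle(perm)

            pool = [(tuple(fp[i] for i in perm), doc_id, side)
                    for side, fps in (("native", native_fps), ("translated", translated_fps))
                    for doc_id, fp in fps.items()]
            # Sort: similar fingerprints tend to end up next to each other.
            pool.sort(key=lambda entry: entry[0])

            for idx, (fp, doc_id, side) in enumerate(pool):
                if side != "native":
                    continue
                # Compare only to a few neighbors on either side, not to the whole pool.
                for jdx in range(max(0, idx - window), min(len(pool), idx + window + 1)):
                    neighbor_fp, neighbor_id, neighbor_side = pool[jdx]
                    if neighbor_side != "translated":
                        continue
                    distance = sum(1 for a, b in zip(fp, neighbor_fp) if a != b)  # Hamming distance
                    if doc_id not in best or distance < best[doc_id][0]:
                        best[doc_id] = (distance, neighbor_id)
        return best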


Another embodiment is described with reference to the flowchart of FIG. 4. This embodiment does not require a rough translation, but instead compares aspects of the documents that are in the document collection.


At 400, each document in the collection is analyzed according to an assessment vector technique. The analysis may look for any category or feature within each document. For example, the assessment operation at 400 may count the number of times that a specified word is used, and keep those counts. The analyzed information is used to form vectors indicative of the documents. In this embodiment, the vectors serve as the reduced-size versions.


The vectors can be native, or may use a translation technique. For example, a word frequency vector can be used for the English documents, while a modified word frequency vector can be used to place the words from the French documents into the English vector space.
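

As a rough illustration only, word-frequency vectors of this kind might be built as follows. The bilingual lexicon used to map French words into the English space and the cosine comparison of the resulting vectors are assumptions made for this sketch, not details taken from the patent.

    from collections import Counter

    def english_frequency_vector(words):
        # Word-frequency vector for an English document.
        return Counter(words)

    def french_frequency_vector(words, fr_en_lexicon):
        # Map French words into the English vector space through a bilingual
        # lexicon, then count frequencies; words without a lexicon entry are skipped.
        return Counter(fr_en_lexicon[w] for w in words if w in fr_en_lexicon)

    def vector_similarity(vec_a, vec_b):
        # Cosine similarity between two frequency vectors (one possible comparison).
        dot = sum(count * vec_b.get(word, 0) for word, count in vec_a.items())
        norm_a = sum(c * c for c in vec_a.values()) ** 0.5
        norm_b = sum(c * c for c in vec_b.values()) ** 0.5
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0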


At 420, the vectors are compared, and at 430 they are shuffled, using techniques similar to those of the previous embodiment.


According to exemplary embodiments, a processor may determine reduced size versions of documents such as those that may be included in a database. A processor may also compare the reduced size versions to determine documents that represent similar information. Additionally, a text-to-text application module may use documents that represent similar information to train a text-to-text application. The text-to-text application may be a machine translation system in some embodiments. Furthermore, the text-to-text application may carry out a rough translation of documents to form a group of translated documents.


Although only a few embodiments have been disclosed in detail above, other embodiments are possible and are intended to be encompassed within this specification. The specification describes specific examples to accomplish a more general goal that may be accomplished in other ways. This disclosure is intended to be exemplary, and the claims are intended to cover any modification or alternative which might be predictable to a person having ordinary skill in the art. For example, the above techniques can be used with other sources of information, other languages, and other signature techniques.


Also, only those claims which use the words “means for” are intended to be interpreted under 35 USC 112, sixth paragraph. Moreover, no limitations from the specification are intended to be read into any claims, unless those limitations are expressly included in the claims.

Claims
  • 1. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising: obtaining a group of documents; determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor; changing an order of information within the reduced size versions; sorting the reduced size versions; comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and using the documents that represent similar information for training for the text-to-text application.
  • 2. The method of claim 1, wherein the text-to-text application is a machine translation system.
  • 3. The method of claim 1, further comprising: carrying out a rough translation to a second language of documents in the group to form a group of translated documents; and comparing the group of translated documents to other documents prior to the determining.
  • 4. The method of claim 1, wherein determining the reduced size versions comprises: forming vectors indicative of the documents; and comparing the vectors.
  • 5. A method for identifying documents that represent similar information to train a text-to-text application, the method comprising: obtaining a group of documents; determining reduced size versions of the documents, wherein the reduced size versions summarize information about words contained in the documents and the determining is performed by a processor; comparing the reduced size versions to determine documents that represent similar information, wherein the comparing is performed by a processor; and using the documents that represent similar information for training for the text-to-text application, wherein determining the reduced size versions includes comparing words in the documents to specified dictionaries of words and defining the documents in terms of information about the words in the dictionaries.
  • 6. The method of claim 5, wherein the reduced size versions include keys representing positions of words in the dictionaries.
  • 7. The method of claim 6, further comprising changing an order of the keys prior to comparing the reduced size versions.
  • 8. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising: a database including a group of documents; a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents; and a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application, wherein the text-to-text application is executable to carry out a rough translation to a second language of documents in the group to form a group of translated documents, and to compare the group of translated documents to other documents prior to determining the documents that represent similar information.
  • 9. The system of claim 8, wherein the text-to-text application is a machine translation system.
  • 10. The system of claim 8, wherein the text-to-text application module is executable to change an order of information within the reduced size versions prior to the comparing.
  • 11. The system of claim 10, wherein the text-to-text application module is executable to sort the reduced size versions.
  • 12. The system of claim 8 wherein the text-to-text application is executable to form vectors indicative of the documents, and compares the vectors.
  • 13. A system for identifying documents that represent similar information to train a text-to-text application, the system comprising: a database including a group of documents; a processor that determines reduced size versions of the documents and compares the reduced size versions to determine documents within the group that represent similar information, wherein the reduced size versions summarize information about words contained in the documents; a text-to-text application module stored in memory and executable to use the documents that represent similar information for training a text-to-text application; and a plurality of word dictionaries each having a plurality of words therein, and wherein the reduced size versions are determined at least in part by comparing words in the documents to words in the dictionaries.
  • 14. The system of claim 13, wherein the reduced size versions include keys representing positions of words in the dictionaries.
  • 15. The system of claim 14, wherein the text-to-text application module is executable to change an order of said keys prior to said comparing.
  • 16. A method for identifying documents that represent similar information, the method comprising: obtaining a first group of documents in a first language, and a second group of documents in a second language; carrying out a rough translation to the first language of the second group of documents to form a third group of translated documents, the carrying out of the rough translation performed by a machine translation system; determining reduced size versions of the first and third groups of documents, wherein the reduced size versions summarize information about words contained in the first and third groups of documents, and the determining is performed by a processor; and comparing the reduced size versions to determine documents that represent similar information, the comparing performed by a processor.
  • 17. The method of claim 16, further comprising using the documents that represent similar information to train a text-to-text application system.
  • 18. The method of claim 16, further comprising changing an order of information within the reduced size versions prior to determining the documents that represent similar information.
  • 19. The method of claim 18, further comprising sorting the reduced size versions.
GOVERNMENT INTERESTS

This invention was made with government support under Contract No. N66001-00-1-8914 awarded by the Space and Naval Warfare Systems Command. The U.S. government has certain rights in the claimed inventions.

Related Publications (1)
Number Date Country
20070033001 A1 Feb 2007 US