Building a useful machine translation system requires a lot of data. In particular, the data cannot simply be the translation of the words in one language to those in another language, but rather needs to include phrases and sentences so that the context of multiple words is considered. While there are some sources of translated data available, such as the same web page content translated into different languages, and governmental documents (e.g., the European Union translates documents into multiple languages), there are drawbacks with using these sources.
While a great deal of parallel text exists in digital form (web data, scanned books, and so forth), the nature of this data is skewed in various ways. For instance, certain domains (e.g., government, science) tend to be very well-represented, while others (e.g., entertainment, sports) are much less well-covered. Even more significant is the skewing for particular language pairs; e.g., while there is a substantial amount of English-Spanish data available in digital form, there is very little Hungarian-Spanish or Vietnamese-Spanish data. When parallel speech data is considered, the problem is even greater. Relatively little parallel speech data exists, and collecting it can be extremely expensive because of the laborious nature of speech transcription.
Attempts have been made to use bilingual speakers to translate sentences and phrases from one language to another. However, employing such bilingual speakers is generally costly, and thus only a limited amount of data can be practically collected in this way. Gathering translation data from bilingual speakers within the general public (“crowd-sourcing”) could in principle help collect large amounts of parallel data, but this approach is also problematic. For one, translation quality can vary greatly from speaker to speaker, and motivating highly skilled bilingual contributors can be difficult. If translators are offered significant financial rewards for contributing data, cheating can become a problem, e.g., unscrupulous programmers can write automated “bots” that simply call an existing machine translation engine to supply a translation.
Paraphrase data refers to different sentences and phrases that mean approximately the same thing in a given language. This is generally similar to translation data except that only monolingual annotators are needed to produce paraphrase data. However, collecting paraphrase data has its own problems, including that the annotator paraphrasing a source sentence or phrase into target data is biased by the source sentence/phrase. For example, many people tend to simply replace each source noun with a different target noun and/or each source verb with a different target verb, similar to using a thesaurus. Other people find it difficult to construct paraphrases in general, e.g., they are confused as to whether they are supposed to reorder the words, substitute words and/or do something else with the source text to provide the target text. As with translation data, the paucity of paraphrase data is even more extreme with respect to spoken language. There is virtually no spoken paraphrase data that might be used to train models aimed at understanding spoken monolingual utterances.
In sum, existing techniques for collecting translation or paraphrase data have a number of drawbacks that adversely impact how much data can be collected, as well as the quality of the data. Notwithstanding, it is desirable to have large amounts of good quality translation and/or paraphrase data for building machine-based systems.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which translation and paraphrase data are collected by showing a stimulus to contributors, such as video clips to viewers (e.g., of a crowd-sourcing service), who then respond with linguistic (text and/or speech) descriptions of that stimulus in a language of their choice. Data contributors may be entirely monolingual, and each piece of collected data is a description of the same stimulus and is thus associated with each other piece. The collected data includes translation data that relates the descriptions in various languages to one another, and paraphrase data that relates the descriptions in the same language to one another in that language. Although these descriptions are not exactly “parallel” in the linguistic sense, they are parallel in a more abstract semantic sense, because they describe the same scene and action in one or more languages.
Paired descriptions corresponding to different languages in the translation data may be used as a basis for translation training data provided to train a machine translation system. Descriptions in the paraphrase data may be used as a basis for paraphrase training data provided to a machine paraphrasing system.
In one aspect, there is provided a mechanism for evaluating the quality of a machine paraphrasing system. This includes a metric for measuring the distinctiveness of a machine-generated paraphrased sentence or phrase with respect to the original sentence or phrase. Another metric may measure how well the machine-generated paraphrased sentence or phrase retains the original sentence's or phrase's meaning, and these metrics may be combined to determine the quality of the machine output.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards collecting translation data without bilingual speakers, as well as collecting natural paraphrase data without presenting the annotator with a source sentence or phrase to paraphrase. To this end, a large number of contributors are shown a selected stimulus (e.g., a video clip, a still image or another stimulus) that is generally intended to elicit universal responses from among the contributors. The contributors are asked to describe the stimulus, e.g., the main action or event that occurred in the video, in the language of their choosing, with the descriptions (text and/or speech) saved for each stimulus. This set of contributors may span a broad range, such as contributors from all over the world. As such, translation data that describes the same event/stimulus in various languages is obtained, as well as paraphrase data that describes the same event/stimulus in the same language.
It should be understood that any of the examples herein are non-limiting. For one, many of the examples herein describe a stimulus in the form of a brief video clip shown to contributors who are viewers of that clip. However, any suitable stimulus that results in returned translation and/or paraphrase data may be employed, such as one or more still photographs, audio (e.g., a “woman humming,” a “dog barking” and so forth), scent, temperature and/or texture. Another type of stimulus comprises an action carried out by a program, such as to have contributors narrate some programmatic behavior, e.g., making someone's eyes bigger in an application for editing photographs, and then using that data to produce a command-and-control interface; similarly, program developers may narrate code snippets so that code/intent mappings can be learned. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing in general.
Each describer 104₁-104ₙ outputs a description 106₁-106ₙ comprising text and/or speech as to what the video clip 102 conveyed to that describer. Each describer 104₁-104ₙ provides the description 106₁-106ₙ in a language of his or her choosing, which the describer may designate or which may be automatically detected.
As exemplified in
Note that for simplicity,
By way of example of a very small sample, consider showing a group of describers a brief video clip of a man eating spaghetti. The following English-language descriptions may be collected for the same video clip, each of which is a paraphrase of the others (allowing for identical “paraphrases” to exist):
The following foreign language data may be collected from the same video clip:
Thus, in this simplified example there are five languages for which translation data is available, and two languages (English and Romanian) for which paraphrase data is available. Any two (or possibly more than two) different language sentences may be paired for use in training a machine translation engine, and any two (or more) same language sentences may be paired for use in training a machine paraphrasing system. Note that identical “paraphrase” sentences can be valuable in training, reinforcing the probability of specific word/phrase mappings associated with the “centroid” utterance for a cluster.
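The pairing described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming each collected description is stored as a (language, text) tuple for a single stimulus; the function name `build_pairs` and the sample sentences are hypothetical and are not taken from the collected data.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(descriptions):
    """Pair descriptions of ONE stimulus.

    descriptions: list of (language, text) tuples (an assumed storage format).
    Returns (translation_pairs, paraphrase_pairs).
    """
    by_lang = defaultdict(list)
    for lang, text in descriptions:
        by_lang[lang].append(text)

    # Paraphrase pairs: any two descriptions in the same language.
    # Identical descriptions from different contributors are kept as pairs.
    paraphrase_pairs = []
    for lang, texts in by_lang.items():
        for a, b in combinations(texts, 2):
            paraphrase_pairs.append((lang, a, b))

    # Translation pairs: any two descriptions in different languages.
    translation_pairs = []
    for lang_a, lang_b in combinations(sorted(by_lang), 2):
        for a in by_lang[lang_a]:
            for b in by_lang[lang_b]:
                translation_pairs.append((lang_a, a, lang_b, b))

    return translation_pairs, paraphrase_pairs


# Hypothetical sample descriptions for one clip.
sample = [
    ("en", "A man is eating spaghetti."),
    ("en", "A man eats pasta."),
    ("es", "Un hombre come espaguetis."),
]
translation_pairs, paraphrase_pairs = build_pairs(sample)
```

With the three sample descriptions, this yields two translation pairs (each English sentence with the Spanish one) and one English paraphrase pair.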
As can be readily appreciated, scaling a video or other stimulus to many thousands of users will result in a significant amount of translation data 110 and paraphrase data 112. As represented in
Other data may be used to vet the collected data; for example, another set of computer users may be paid to review subsets of the available descriptions and indicate which one or ones they think are the outliers, e.g., pick the worst three descriptions out of ten shown. Such vetting data may be used to home in on the outliers.
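One way such vetting data might be aggregated is sketched below. It assumes each review is recorded as the list of description identifiers shown to a reviewer plus the subset picked as worst; the function names, the identifiers, and the `min_rate` and `min_shown` thresholds are hypothetical choices for illustration, not values from this description.

```python
from collections import Counter


def vote_rates(reviews):
    """Return {description_id: (worst_rate, times_shown)}.

    reviews: list of (shown_ids, worst_ids) tuples, one per reviewer task.
    """
    shown, flagged = Counter(), Counter()
    for shown_ids, worst_ids in reviews:
        shown.update(shown_ids)
        # Ignore votes for descriptions that were not actually shown.
        flagged.update(w for w in worst_ids if w in shown_ids)
    return {d: (flagged[d] / shown[d], shown[d]) for d in shown}


def likely_outliers(reviews, min_rate=0.5, min_shown=2):
    """Descriptions picked as 'worst' in at least min_rate of their showings."""
    return sorted(
        d
        for d, (rate, n) in vote_rates(reviews).items()
        if rate >= min_rate and n >= min_shown
    )


# Hypothetical review data: three tasks over four descriptions.
reviews = [
    (["d1", "d2", "d3"], ["d3"]),
    (["d1", "d3", "d4"], ["d3"]),
    (["d2", "d4"], ["d4"]),
]
outliers = likely_outliers(reviews)
```

Requiring a minimum number of showings keeps a single unlucky vote from marking a description as an outlier.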
Note that while only one video clip has been described thus far as an example stimulus, it is understood that a describer may view and describe many hundreds or thousands of different clips. For each clip, the description data is collected and associated with the descriptions of other viewers of that same clip. The descriptions may be text or speech, or both, improving the quality of text-to-text, text-to-speech, speech-to-speech and speech-to-text translation across languages, as well as paraphrasing in the same language.
Moreover, the clips or other stimulus instances that are presented may be tailored to a specific class of activity or the like for which translation/paraphrase models are of value. By way of example, consider a video game running on a gaming console that allows players to communicate with one another as they play. A combat-style game may have only so many operations that a user can command, but many users may verbally express the command in different ways, e.g., “attack the building” or “charge the enemy compound” may represent the same command. Collecting speech data from game players and associating that data with the actions that occurred allow training a game (or future versions of that game) to provide spoken command-and-control operation, including with paraphrases, instead of only allowing a limited set of commands that need to be explicitly spoken to be recognized.
Note that in addition to command-and-control, machine paraphrasing systems may be used in other applications. These include question answering, search help, providing writing assistance, and so forth, and thus a well-trained machine paraphrasing system is desirable.
Another use of paraphrase data is in translation, such as when there are relatively few descriptions in one language, yet many descriptions (and thus abundant paraphrase data) in another language with which translation is desired. Consider for example that there are thousands of English language descriptions for a video, and only a few in a language such as Tamil. Each source sentence in Tamil may be paired with one target sentence in English, with some number of them (e.g., five or ten), or with all of them. The large number of variations in English may help expand the Tamil dataset; e.g., “man” and “guy” in the English descriptions may map well to the same one word in Tamil. Indeed, tests showed that in general, the more targets to which such a source is mapped, the better the improvement in translation.
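The one-to-many pairing can be sketched as follows. This is an illustration under stated assumptions: the function name, the `k` and `seed` parameters, and the random choice of targets are hypothetical, and the source strings are placeholders rather than real Tamil text.

```python
import random


def pair_sources_with_targets(sources, targets, k=None, seed=0):
    """Pair every low-resource source sentence with k target sentences.

    k=None (or k >= len(targets)) pairs each source with ALL targets.
    Otherwise k targets are sampled at random per source (an assumed policy).
    """
    rng = random.Random(seed)
    pairs = []
    for src in sources:
        if k is None or k >= len(targets):
            chosen = targets
        else:
            chosen = rng.sample(targets, k)
        pairs.extend((src, tgt) for tgt in chosen)
    return pairs


# Placeholder source descriptions and hypothetical English descriptions.
tamil = ["<Tamil description 1>", "<Tamil description 2>"]
english = [
    "A man is eating spaghetti.",
    "A guy eats pasta.",
    "Someone is eating noodles.",
    "A man eats.",
    "The man is having spaghetti.",
]
few = pair_sources_with_targets(tamil, english, k=2)   # 2 sources x 2 targets
full = pair_sources_with_targets(tamil, english)       # 2 sources x 5 targets
```

Varying `k` makes it straightforward to compare translation quality as each source is mapped to more targets.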
In this manner, video data or other stimulus is used to create translation and/or paraphrase data. Note that unlike prior solutions, there is no need to start with a specific linguistic utterance, which inevitably biases the lexical and syntactic choices a contributor might make. With videos or other Internet-provided stimulus, such as from an online video streaming site as the source, and with an online crowd-sourcing service, such as one that pays participants for input, useful data in large amounts may be collected.
Moreover, the selection of stimulus used for data collection is a task that can be delegated to the crowd. For example, the online crowd-sourcing service may gather suitable video segments that clearly show an unambiguous action or event. Such videos are usually five to ten seconds in length, and no more than a minute. After (e.g., manually) filtering them for inappropriate content or otherwise unsuitable videos, the videos are presented to a group of users, such as others in the online crowd-sourcing service.
Still further, users' actions with respect to a stimulus may help determine which ones others receive. For example, if users tend to skip over certain videos or still images, such as if they are confusing or too long, those may be removed and replaced with others, and so on.
Turning to another aspect, while machine translation has a well-known automated metric, BLEU, for evaluating the quality of a machine translation system, heretofore there has been no well-accepted metric for evaluating the quality of a machine paraphrasing system. Note that while the below examples generally refer to the paraphrase of a sentence, it is understood that this includes any part of a sentence that may be paraphrased, including a phrase, or even a longer set of words, such as a paragraph.
In generating a good paraphrase, the paraphrase needs to retain the meaning of the original input sentence. A measure based upon the BLEU-like n-gram overlap metric may be used as a score to measure the success of retaining the meaning of the original sentence, that is, how well a candidate paraphrase remains focused on the correct topic as well as remains fluent.
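A simplified BLEU-like overlap score can be sketched as below. It is not the full BLEU metric: it assumes whitespace tokenization, averages clipped n-gram precisions arithmetically for n=1 to 4, and omits the brevity penalty. The function names and the choice of using the collected descriptions as references are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_score(candidate, references, max_n=4):
    """Mean clipped n-gram precision of candidate against the references."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        if not cand_counts:
            continue  # candidate is shorter than n tokens
        # For each n-gram, the most times it occurs in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(matched / sum(cand_counts.values()))
    return sum(precisions) / len(precisions) if precisions else 0.0


score = overlap_score("a guy is laughing", ["a man is laughing"])
```

A candidate identical to a reference scores 1.0, and a candidate sharing no words with any reference scores 0.0.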
Further, a general observation is that a paraphrase is most likely more valuable if the paraphrase is as different as possible from the original sentence (while retaining the meaning of the original sentence). For example, “a man is laughing” may be paraphrased as “a guy is laughing” or as “a man finds something to be very funny.” The simple substitution of “guy” for “man” in the former phrase is not particularly valuable in most scenarios; however, the latter phrase is helpful. For example, the latter phrase may be used to suggest an alternative way that a writer may better convey his or her idea when writing something.
Described herein is a scoring metric that measures n-gram dissimilarity for evaluating differences between paraphrases of sentences. In one implementation, the number of n-grams (up to n=4 by default) that do not appear in the original sentence is counted in the paraphrase candidate. This count is divided by the total number of n-grams in the paraphrase candidate. The scores for n=1, 2, 3, 4 may be averaged to compute an overall distinctiveness score. As can be readily appreciated, other dissimilarity scores based upon n-grams or the like are feasible and may be alternatively or additionally employed.
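The distinctiveness computation described above can be sketched as follows. Two details are assumptions not fixed by the description: tokens are obtained by lowercasing and splitting on whitespace, and n-grams are counted by occurrence rather than by unique type. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinctiveness(original, paraphrase, max_n=4):
    """Average, over n = 1..max_n, of the fraction of paraphrase n-grams
    that do not appear in the original sentence."""
    orig = original.lower().split()
    para = paraphrase.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        para_grams = ngrams(para, n)
        if not para_grams:
            continue  # paraphrase is shorter than n tokens
        orig_grams = set(ngrams(orig, n))
        novel = sum(1 for g in para_grams if g not in orig_grams)
        scores.append(novel / len(para_grams))
    return sum(scores) / len(scores) if scores else 0.0


original = "a man is laughing"
simple = distinctiveness(original, "a guy is laughing")
richer = distinctiveness(original, "a man finds something to be very funny")
```

Under these assumptions an unchanged sentence scores 0.0, and the single-word substitution scores lower than the more heavily reworded paraphrase.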
The use of descriptions from the same stimulus enables collecting arbitrarily many naturally-occurring descriptions of the same event/stimulus, whereby a good range of possible linguistic descriptions of that event/stimulus is obtained. This gives the metric enough to work with in deciding whether a machine-generated paraphrase of some input string retains the original meaning, yet deviates enough so as to provide value.
In one implementation, the BLEU-like n-gram overlap score and the dissimilarity metric may be mathematically combined to find the paraphrase that retains the meaning of the original sentence, yet is most dissimilar. For example, the BLEU-like n-gram overlap score may act as a constraint, such that the selected paraphrase is the one that is maximally distinct within a meaning/fluency range appropriately permitted by the constraint.
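The constrained selection can be sketched as below. This is one possible reading, not a definitive implementation: the overlap score is a simplified, unclipped stand-in for a BLEU-like measure, the threshold `min_overlap` is a hypothetical parameter, and the reference and candidate sentences are made up for illustration.

```python
def _ngrams(text, n):
    """Contiguous n-grams of a lowercased, whitespace-tokenized string."""
    toks = text.lower().split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]


def overlap(candidate, references, max_n=4):
    """Mean fraction of candidate n-grams found in any reference."""
    scores = []
    for n in range(1, max_n + 1):
        grams = _ngrams(candidate, n)
        if grams:
            seen = {g for r in references for g in _ngrams(r, n)}
            scores.append(sum(g in seen for g in grams) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0


def distinctiveness(original, candidate, max_n=4):
    """Mean fraction of candidate n-grams absent from the original."""
    scores = []
    for n in range(1, max_n + 1):
        grams = _ngrams(candidate, n)
        if grams:
            seen = set(_ngrams(original, n))
            scores.append(sum(g not in seen for g in grams) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0


def select_paraphrase(original, candidates, references, min_overlap=0.5):
    """Most distinct candidate whose overlap meets the constraint, or None."""
    admissible = [c for c in candidates if overlap(c, references) >= min_overlap]
    if not admissible:
        return None
    return max(admissible, key=lambda c: distinctiveness(original, c))


# Hypothetical crowd descriptions serve as references.
references = [
    "a man is laughing",
    "a guy is laughing",
    "a man finds something very funny",
]
candidates = [
    "a guy is laughing",
    "a man finds something very funny",
    "the dog runs away",
]
best = select_paraphrase("a man is laughing", candidates, references)
```

Here the off-topic candidate fails the overlap constraint, and of the two remaining candidates the more heavily reworded one is selected.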
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 410 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 410 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 410. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 430 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 431 and random access memory (RAM) 432. A basic input/output system 433 (BIOS), containing the basic routines that help to transfer information between elements within computer 410, such as during start-up, is typically stored in ROM 431. RAM 432 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 420. By way of example, and not limitation,
The computer 410 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 410 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 480. The remote computer 480 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 410, although only a memory storage device 481 has been illustrated in
When used in a LAN networking environment, the computer 410 is connected to the LAN 471 through a network interface or adapter 470. When used in a WAN networking environment, the computer 410 typically includes a modem 472 or other means for establishing communications over the WAN 473, such as the Internet. The modem 472, which may be internal or external, may be connected to the system bus 421 via the user input interface 460 or other appropriate mechanism. A wireless networking component, such as one comprising an interface and antenna, may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 410, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.