Hereinafter, an embodiment in which a dictionary creation support system, a method and a program of the present invention are applied to creation of a bilingual dictionary used in machine translation will be explained with reference to the drawings.
In the embodiment, the past history is stored, and when the dictionary creation process is performed on candidate words for registration in the dictionary that have been extracted from input text (text data), this history is referred to in order to inhibit output of unnecessary candidate words to the dictionary. In addition, in this embodiment, a candidate word that does not satisfy the set conditions for registration within just one file can still be output to the dictionary if it is determined, based on the result of cumulative total processing, that the candidate word satisfies the set conditions.
Referring to
The input output device 1 includes an input portion 11 and an output portion 12. The input portion 11 is used to fetch various types of input information, such as a plurality of input texts (text data sequences) and instructions related to registering of registration candidate words, that are used as a basis for creating the content that is registered in a dictionary 32. The output portion 12 is used to output (usually, submit to the user) candidate words for registration in the dictionary 32.
The input portion 11 is able to fetch the various types of input information by use of an input device such as a keyboard or a mouse, a scanner and character recognition processing, a microphone and voice recognition processing, or by reading a file. The output portion 12 is able to display the data on a display device, print it using a printer, convert the data to sound and generate a sound output, or output the data to a file.
Note that the input portion 11 and the output portion 12 may be able to input and output data from/to other devices via a network or a predetermined communication line. For example, as the input text (the text data sequence), a file that is already stored on the computer or the network may be designated, or the output of an internet search engine may be used as is.
The storage device 3 is configured by hardware such as, for example, a hard disk, an optical disk, or a memory, that has a large storage capacity. The storage device 3 includes a saved history data base 31 and a dictionary (dictionary file) 32 as functional units. The saved history data base 31 saves the history of dictionary registration candidate words that have been extracted from the input texts. The dictionary 32 stores information that can be used in machine translation, for example, terms and information related to terms.
The saved history data base 31 includes a field 31a, a field 31b and a field 31c. The field 31a stores information that is used to determine whether or not registration candidate words should be registered, namely, their usage frequency or their importance. The field 31b stores the heading of the dictionary candidate word, and the field 31c stores information related to the history, for example, whether or not the user has completed giving instructions related to each candidate word, or whether each word has been fully registered in the dictionary.
The dictionary 32 includes at least a field 32a that stores words or word sequences (headings) of a first language, and a field 32b that stores the corresponding words or word sequences (translations) of a second language. In addition, the dictionary 32 may also include fields that store information required for translation, such as information related to parts of speech and information related to meanings.
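The field layout described above can be pictured with a minimal Python sketch. The class names, attribute names and the use of in-memory dictionaries keyed by heading are assumptions made for illustration only; the embodiment does not prescribe a particular storage format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HistoryRecord:
    """One record of the saved history data base 31 (names are hypothetical)."""
    evaluation: float               # field 31a: usage frequency or importance
    heading: str                    # field 31b: heading of the candidate word
    history: Optional[str] = None   # field 31c: e.g. "registered in dictionary"


@dataclass
class DictionaryEntry:
    """One record of the dictionary 32 (names are hypothetical)."""
    heading: str                          # field 32a: first-language word or word sequence
    translation: str                      # field 32b: second-language translation
    part_of_speech: Optional[str] = None  # optional extra field


# Assumed in-memory stand-ins for the two stores, keyed by heading.
history_db: Dict[str, HistoryRecord] = {}
dictionary: Dict[str, DictionaryEntry] = {}

history_db["host cell"] = HistoryRecord(evaluation=0.0, heading="host cell")
```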
The processing device 2 is configured by hardware such as, for example, a CPU, a ROM, a RAM, an EEPROM, or a hard disk, and is a structural member that can run a dictionary creation support program (excluding the portions of the above-described input output device 1 and the storage device 3).
The processing device 2 includes a term extraction portion 21, an information update portion 22 and a dictionary creation portion 23 as functional units. The term extraction portion 21 extracts dictionary registration candidate words from the input text data sequences (input texts). The information update portion 22 rewrites the contents of the saved history data base 31 based on information related to the extracted terms and information related to the dictionary creation operation. The dictionary creation portion 23 creates the dictionary 32 by determining and outputting dictionary registration candidate words that need to be registered in the dictionary 32 while referring to the contents of the updated saved history data base 31.
Next, the functions of the term extraction portion 21, the information update portion 22 and the dictionary creation portion 23 will be explained in more detail.
The term extraction portion 21 performs morphological analysis processing, usage frequency calculation processing, and the like, on the text data sequences input from the input portion 11. It extracts the dictionary registration candidate words that are determined to need registration in the dictionary, together with information related to the usage frequency or the level of importance of those candidate words within the text data (hereinafter referred to as the “evaluation value”).
The information update portion 22 saves the extracted information related to the dictionary registration candidate words in the saved history data base 31. When storage is performed, if the dictionary registration candidate word is already stored in the saved history data base 31, the extracted information related to the candidate word (the evaluation value) and the information stored in the saved history data base 31 are used as a basis for re-calculating the evaluation value. Accordingly, the content of the saved history data base 31 is updated. In addition, as will be described later, the information update portion 22 also updates the information in the saved history data base 31 when information, which indicates whether the user has instructed that a given dictionary registration candidate word is to be registered in the dictionary, is received from the dictionary creation portion 23.
The dictionary creation portion 23 uses the output portion 12 to output (submit) dictionary registration candidate words that meet with pre-set conditions, while referring to the contents of the updated saved history data base 31. In addition, the dictionary creation portion 23 transfers to the information update portion 22 the information about whether the user has instructed that a given dictionary registration candidate word is to be registered in the dictionary.
Next, the operation of the dictionary creation support system 100 (the dictionary creation support method of the embodiment) having the above-described functional structure will be explained with reference to the drawings.
When a text data sequence is input from the input portion 11 (step S1), the term extraction portion 21 performs morphological analysis processing and usage frequency calculation processing and the like on the input text data sequence, and extracts the dictionary registration candidate words that it is determined need to be registered, and their evaluation values (step S2).
As an example of the simplest method of performing the term extraction operation, a method is known in which the usage frequency of word N-grams is computed from an input text on which morphological analysis has been performed, and then terms whose frequency exceeds a threshold value are extracted. Furthermore, set limits related to parts of speech, grammar structures or the like, such as extracting just noun sequences, may be added to the above-described method. In addition, a method may be applied in which computation is used to derive evaluation values of word strings, such as that described in “Extraction of Specialist Terminology based on Usage Frequency and Sequence Frequency” (Authors: Nakagawa, Yumoto and Mori, 2003, Journal of Natural Language Processing, Vol. 10, No. 1, pp. 27-45).
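The simplest method mentioned above, counting word N-grams and keeping those whose frequency exceeds a threshold, might be sketched as follows. This is an illustration only: the function name and parameters are hypothetical, the input is assumed to be a list of tokens already produced by morphological analysis, and no part-of-speech limits are applied.

```python
from collections import Counter


def extract_candidates(tokens, max_n=2, threshold=2):
    """Count word N-grams (n = 1..max_n) and keep those above the threshold.

    `tokens` is assumed to be the output of morphological analysis.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # "Exceed" is read strictly here: frequency must be greater than threshold.
    return {term: freq for term, freq in counts.items() if freq > threshold}


tokens = "host cell and host cell or host cell".split()
candidates = extract_candidates(tokens, max_n=2, threshold=2)
```

In this toy input, only "host", "cell" and "host cell" occur three times, so only those three terms are retained.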
The evaluation value attributed to each term is a value that is calculated using a given calculation formula from the usage frequency of each term in the input text, etc. (for example, by dividing the usage frequency by the total number of terms in the input text).
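As a small illustration of the example formula in parentheses, the relative frequency could be computed as below. The function name is hypothetical, and this is only one of the formulas the embodiment allows.

```python
def evaluation_value(frequency, total_terms):
    """Relative frequency: usage frequency divided by the total number of terms."""
    if total_terms <= 0:
        raise ValueError("total_terms must be positive")
    return frequency / total_terms


# A term used 25 times in a text of 100 terms.
value = evaluation_value(25, 100)
```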
The information related to the extracted dictionary registration candidate word is stored in the saved history data base 31 by the information update portion 22 (step S3). When storage is performed, if the same dictionary registration candidate word is already stored in the saved history data base 31, the information related to the extracted candidate word and the information stored in the saved history data base 31 are used as a basis for re-calculating the evaluation value, without creating a new record. Accordingly, just the evaluation value is updated.
Next, while referring to the contents of the updated saved history data base 31, the dictionary creation portion 23 controls the output portion 12 such that the output portion 12 outputs (for example, on a display) one of the dictionary registration candidate words that meets the pre-set conditions (for example, having an evaluation value equal to or above a given threshold value, or not being a word that the user has rejected for dictionary registration in the past) (step S4). The output information related to the dictionary registration candidate word may include not just a word sequence, but also evaluation values, parts of speech, etc.
The user determines, based on the output contents, whether the dictionary registration candidate word is to be registered in the dictionary 32, and uses the input portion 11 to give instructions about whether to register the candidate word. When registration is performed, the user inputs necessary information such as a translation, and instructs that registration to the dictionary 32 is to be performed.
In the case that one dictionary registration candidate word has been output, the dictionary creation portion 23 waits for an instruction from the input portion 11 related to whether registration is to be performed or not. When the instruction is received, the dictionary creation portion 23 determines whether the instruction is requesting registration to be performed or not (step S5). Note that the contents of the instruction related to whether registration is to be performed or not are sent from the dictionary creation portion 23 to the information update portion 22.
If the instruction requests registration to be performed, the dictionary creation portion 23 registers the information related to the dictionary registration candidate word that is presently subject to processing in the dictionary 32 (step S6). In addition, the information update portion 22 writes information that indicates that registration to the dictionary 32 has been performed, information that registration to the dictionary 32 has not yet been performed, or the like, in the saved history data base 31 (step S7).
Once the processing of steps S4 to S7 has been completed for the dictionary registration candidate word that is subject to processing, it is determined whether there are any remaining dictionary registration candidate words for which the user has not yet decided whether or not to register them in the dictionary (step S8). In step S8, if it is determined that no dictionary registration candidate words remain, the series of processing steps shown in
When the term extraction operation is ended by the term extraction portion 21, the information update portion 22 starts the processing shown in
If the given dictionary registration candidate word is already stored in the saved history data base 31, the information update portion 22 re-calculates the evaluation value (step S14), and then updates the information related to the given dictionary registration candidate word contained in the saved history data base 31 (step S15).
On the other hand, if the dictionary registration candidate word read in step S11 is not stored in the saved history data base 31, the information update portion 22 adds an evaluation value and a heading for the given dictionary registration candidate word in the saved history data base 31 (step S16).
The processing of steps S11 to S16 described above is repeated for all of the extracted dictionary registration candidate words (step S17).
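The update loop described above might look like the following sketch. The record layout and the function name are hypothetical, and the re-calculation in step S14 is assumed here to be a simple sum of the stored and newly extracted values; the embodiment leaves the actual formula open.

```python
def update_history(history_db, extracted):
    """Merge newly extracted candidates into the saved history.

    history_db: heading -> {"evaluation": number, "history": str or None}
    extracted:  heading -> evaluation value from the current input text
    """
    for heading, value in extracted.items():       # read each candidate (S11), loop (S17)
        record = history_db.get(heading)
        if record is not None:
            # Already stored: re-calculate (S14) and update in place (S15).
            # Assumption: the new value is the sum of old and new values.
            record["evaluation"] += value
        else:
            # Not stored yet: add evaluation value and heading (S16).
            history_db[heading] = {"evaluation": value, "history": None}
    return history_db


db = {"cell": {"evaluation": 300, "history": None}}
update_history(db, {"cell": 250, "zooblast": 40})
```

Because an existing record is updated rather than replaced, its history information is preserved across input texts.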
Next, the flow of steps S3 to S6 (the update operation of the saved history data base 31 and the registration operation to the dictionary) will be explained with reference to a specific example.
In addition, it is assumed that at the phase at which the dictionary registration candidate words shown in
In the update operation (
Processing like that described above is repeatedly performed with respect to the data for second and following dictionary registration candidate words, namely, “host cell”, “zooblast”, and “vegetable cell”.
The first datum, “cell” of
Next, the second datum, “host cell”, shown in
The usage frequency of the data for the third and following dictionary registration candidate words of
Next, a new input text is input, and the term extraction processing is performed to extract the dictionary registration candidate words shown in
In the update operation (
The processing described above is repeatedly performed on the data for the second and following dictionary registration candidate words shown in
Next, dictionary registration candidate words are appropriately output (displayed) based on the contents of the saved history data base 31 shown in
The usage frequency of the first word “cell” in
The frequency of the second word “host cell” is also 500 or more. However, since the word is already registered in the dictionary 32, the word is not output (displayed), and the processing moves to the next datum (a negative result in step S4).
The new frequency of the third word “zooblast” is 500 or more, and thus the word is output (displayed) as a dictionary registration candidate word. Assuming that the user instructs that “zooblast” is to be registered in the dictionary, “zooblast” is registered in the dictionary 32, and the information “registered in dictionary” is written in the saved history field of the saved history data base 31 (steps S6, S7).
The usage frequencies of the fourth and following dictionary registration candidate words are below 500, and thus the words are not output (displayed) for the user to determine whether or not they are to be registered in the dictionary.
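The selection just walked through can be sketched as a filter over the saved history. The threshold of 500 comes from the example above; the frequencies in the sample data are invented for illustration and are not the figures' values, and treating any non-empty history entry as a reason to skip a word is an assumption.

```python
THRESHOLD = 500  # threshold used in the worked example


def candidates_to_submit(history_db, threshold=THRESHOLD):
    """Return headings that reach the threshold and have no history entry."""
    return [
        heading
        for heading, record in history_db.items()
        if record["evaluation"] >= threshold and record["history"] is None
    ]


# Hypothetical frequencies, chosen only to mirror the three cases described.
sample_db = {
    "host cell": {"evaluation": 620, "history": "registered in dictionary"},
    "zooblast": {"evaluation": 510, "history": None},
    "vegetable cell": {"evaluation": 120, "history": None},
}
to_show = candidates_to_submit(sample_db)
```

With this sample data, "host cell" is skipped because it is already registered, "vegetable cell" is skipped because it is below the threshold, and only "zooblast" is submitted.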
In the above-described embodiment, when the dictionary registration operation is repeatedly performed on a plurality of input texts (text data sequences), the results of past registration operations are referred to using the history. Accordingly, in the above-described embodiment, terms that have already been determined as not requiring registration and terms that have already been registered etc. in previous dictionary creation processing are no longer submitted as they would be in known technology. Accordingly, repeated operations are eliminated, and operation efficiency can be improved.
In addition, in the above-described embodiment, even if a term is excluded from the dictionary registration candidate words because it does not meet the conditions such as the threshold value in a single performance of the dictionary creation processing, the word may become a candidate word as a result of totaling the results of a plurality of repetitions of the processing. In other words, in the above-described embodiment, it is possible to process a plurality of small texts to obtain similar extraction results as when processing a large text.
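The effect of totaling can be illustrated with a short sketch. The function name and the counts are hypothetical, and summation is assumed as the way of totaling. A term that stays below the threshold in every individual file can still become a candidate once the per-file results are accumulated.

```python
from collections import Counter


def cumulative_candidates(per_file_counts, threshold):
    """Sum per-file frequencies and return terms whose total reaches the threshold."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return sorted(term for term, freq in total.items() if freq >= threshold)


# Invented counts: no single file reaches 500 for any term.
files = [
    {"zooblast": 300},
    {"zooblast": 250, "vegetable cell": 100},
]
single_file_hits = [t for counts in files for t, f in counts.items() if f >= 500]
accumulated_hits = cumulative_candidates(files, threshold=500)
```

Here no term qualifies within a single file, but "zooblast" qualifies after accumulation with a total of 550.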
The above-described embodiment explains a configuration in which dictionary registration candidate words that have “registered in dictionary” or “displayed” entered in the history information of the saved history data base are not submitted to the user. However, the submission conditions are not limited to those described above. For example, as other possible submission conditions, the dictionary registration candidate words may be displayed along with the history information such as “registered in dictionary” or “displayed”. Alternatively, in the case of “registered in dictionary”, the contents already registered in the dictionary may be displayed.
Furthermore, the above-described embodiment explains a configuration in which the user inputs information related to the translation. However, registration to the dictionary may be performed with the translation column left blank, and a known translation determination method may be used to determine the translation of the blank column. As the translation determination method, for example, the method disclosed in Japanese Patent Laid-open Publication No. 2006-146610, or the method described in “Machine Translation System Capable of Autonomous Vocabulary Expansion, Authors Kamiyama and Ito, presented at the 65th Annual Meeting of the Information Processing Society of Japan, 1B-4, 2003” may be used.
In addition, the above-described embodiment explains a configuration in which dictionary registration candidate words are submitted one at a time to the user who inputs information about whether or not registration is to be performed. However, a batch of words or a given number of words that meet submission conditions may be submitted, while instructions about whether registration is to be performed or not may be made individually. As an example of another embodiment, a given number of dictionary registration candidate words may be displayed on a screen along with check boxes that can be checked to indicate whether registration is to be performed or not. In addition, an execute icon may also be displayed on the screen, and when the execute icon is operated, this may be taken as an instruction to register the words that have a check in their check boxes. Accordingly, the given words are fetched.
Moreover, the above-described embodiment explains a configuration in which support is provided for creating a parallel translation dictionary used in machine translation. However, the present invention may be applied to supporting creation of other dictionaries. For example, the present invention can be applied to creation of a dictionary that includes a keyword and a descriptive text explaining the keyword.
Number | Date | Country | Kind |
---|---|---|---|
JP2006-262699 | Sep 2006 | JP | national |