The invention relates generally to the field of computer software products. More particularly, the invention relates to methods and systems for producing language specific versions of text in a software product.
Users of word processing and text intensive visual aid presentation software, such as Microsoft® Word and Microsoft® PowerPoint programs, in the Bosnian and Serbian languages, for example, are required to provide copies of documents in both Cyrillic and Latin script. As a result, the user typically must type an entire document twice, once in Cyrillic script and once in Latin script. This is extremely time intensive and redundant.
There is thus a need for a method and system for transliteration capability back and forth between these two language scripts that is convenient for the user and robust enough to handle the semantic differences between the language scripts. It is with respect to these needs that the present invention has been developed.
Embodiments of the present invention are a system and a method for transliterating either language script easily and at the user's command. The method involves loading a text of characters and words in one of a Cyrillic or Latin script into a character transliteration module and converting each character in the one of a Cyrillic or Latin script into a corresponding opposite Cyrillic or Latin character. Each word is then sequentially also loaded into a word capitalization exception module where the word is examined for occurrences of any capitalization exceptions. If there are exceptions, one or more predetermined rules may be applied, and if the word matches an applicable predetermined rule, the character capitalization in the word is modified in accordance with the applicable predetermined resource rule.
In accordance with other aspects, the present invention relates to a system for transliterating Cyrillic to Latin script and vice versa that involves loading a text of characters and words in one of a Cyrillic or Latin script into a character transliteration module and converting each character in the one of a Cyrillic or Latin script into a corresponding opposite Cyrillic or Latin character. Each word is also sequentially loaded into a word capitalization exception module where the word is examined for occurrences of any capitalization exceptions. If there are exceptions, one or more predetermined rules may be applied, and if the word matches an applicable predetermined rule, the character capitalization in the word is modified in accordance with the applicable predetermined resource rule. This results in a system for script transliteration between Cyrillic and Latin scripts, and vice versa, that is fast, simple to use, and permits substantial productivity gains to the user.
The invention may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product or computer readable media. The computer program product may be a computer storage media readable by a computer system and encoding a computer program of instructions for executing a computer process. The computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process.
These and various other features as well as advantages, which characterize the present invention, will be apparent from a reading of the following detailed description and a review of the associated drawings.
The system 100 includes a character transliteration module 102 and a word capitalization module 104 that both draw character data from a transliteration mapping table 106. Text that is to be transliterated 108 is highlighted or otherwise identified by a user as needing transliteration. This text or script 108 is fed first to the character transliteration module 102, where all of the script 108 is transliterated, and then to the word capitalization module 104. Both modules draw from the transliteration mapping table 106 in order to generate transliterated text data 110.
The Cyrillic characters with their corresponding Latin characters are shown in the table 400 of
In three cases a single Cyrillic character maps to two Latin characters. These are: Љ into Lj, Њ into Nj, and Џ into Dž. This is fine if they are lowercase characters, as the lowercase Cyrillic character simply maps to two lowercase Latin characters, and vice versa. However, when the Cyrillic character is capitalized, a question arises: Should the second Latin character in the mapping be lowercase or uppercase (the first Latin character will definitely be uppercase)? This can only be answered by considering the word in which the characters reside. There are a number of rules that govern this. These rules basically look at the next character's case to determine the case of the second Latin character. The following rules are exemplary and regard usage of capital and small letters involving combination characters in Cyrillic script with 2 characters in Serbian (Latin).
1. At the beginning of any sentence, Latin double character letters should be written with the first letter always a capital letter and second letter a small letter. Thus for Latin to Cyrillic script:
2. In titles, the letters LJ, NJ and DŽ should always be written with capital letters. Thus:
3. When using these three combinations of letters in the middle of sentences, the letters are always small. Thus:
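By way of illustration only, the case decision described above may be sketched as follows. The sketch assumes that looking at the case of the character following the capitalized Cyrillic letter is sufficient; the function name `latin_digraph`, the table `DIGRAPHS` and the sample word are hypothetical and are not taken from the drawings.

```python
# Hypothetical table for the three Cyrillic letters that map to Latin digraphs.
DIGRAPHS = {"Љ": "Lj", "Њ": "Nj", "Џ": "Dž"}


def latin_digraph(cyrillic_upper: str, next_char: str) -> str:
    """Return the Latin digraph for an uppercase Љ, Њ or Џ.

    The first Latin letter is always uppercase. The second letter is
    uppercase only when the following character is itself uppercase,
    as in a title written entirely in capitals.
    """
    digraph = DIGRAPHS[cyrillic_upper]      # e.g. "Lj"
    if next_char.isupper():
        return digraph.upper()              # e.g. "LJ"
    return digraph


# Illustrative word only: "ЉУБАВ" (all capitals) versus "Љубав".
print(latin_digraph("Љ", "У"))  # second letter capitalized
print(latin_digraph("Љ", "у"))  # second letter left small
```

An empty string may be passed as `next_char` when the letter ends the text; in this sketch the second Latin letter then stays small.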
System 200 may also contain communications connection(s) 212 that allow the system to communicate with other devices. Communications connection(s) 212 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
System 200 may also have input device(s) 214 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 216 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
A computing device, such as system 200, typically includes at least some form of computer-readable media. Computer readable media can be any available media that can be accessed by the system 200. By way of example, and not limitation, computer-readable media might comprise computer storage media and communication media.
The logical operations of the various embodiments of the present invention are implemented (1) as a sequence of computer implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation is a matter of choice dependent on the performance requirements of the computing system implementing the invention. Accordingly, the logical operations making up the embodiments of the present invention described herein are referred to variously as operations, structural devices, acts or modules. It will be recognized by one skilled in the art that these operations, structural devices, acts and modules may be implemented in software, in firmware, in special purpose digital logic, and any combination thereof without deviating from the spirit and scope of the present invention as recited within the claims attached hereto.
In query operation 306, the question is asked whether the first/next character in the word being examined is transliteratable. If there is a corresponding character in the opposite language, then control transfers to operation 308. However, if the character is not transliteratable, the character remains unchanged and control returns to operation 304 for examination of the next character in sequence.
In operation 308, the transliteration mapping table 106 is accessed to provide the appropriate replacement character, an example of which is found in
Query operation 310 asks whether the character under examination is the last character in the last word in the script to be transliterated. If the character being examined is the last character in the last word in the script sequence, control transfers to query operation 312. If it is not the last character, control transfers back to operation 304 and the next character is examined as described above.
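The character loop of operations 304 through 310 may be sketched, for illustration only, as shown below. The table `CYR_TO_LAT` is an assumption built from the standard Serbian Cyrillic to Latin correspondence rather than from the mapping table 106 itself, and the function name `transliterate_chars` is hypothetical.

```python
# Assumed lowercase Cyrillic-to-Latin correspondence (standard Serbian alphabet).
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Uppercase entries: only the first Latin letter is capitalized ("Љ" -> "Lj").
CYR_TO_LAT = dict(_LOWER)
CYR_TO_LAT.update({k.upper(): v.capitalize() for k, v in _LOWER.items()})


def transliterate_chars(text: str, table: dict = CYR_TO_LAT) -> str:
    """Replace each mapped character; leave unmapped characters unchanged."""
    out = []
    for ch in text:                      # operation 304: take the next character
        # query 306 / operation 308: substitute when a counterpart exists,
        # otherwise keep the character as it is
        out.append(table.get(ch, ch))
    return "".join(out)                  # query 310: loop ends after the last character


print(transliterate_chars("Здраво, свете!"))
print(transliterate_chars("ЉУБАВ"))
```

In this sketch an all-capital word such as "ЉУБАВ" comes out as "LjUBAV", which is the kind of result the subsequent word capitalization check is intended to correct.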
In query operation 312, the question is asked whether the first/next word in the script that was transliterated is capitalized. If the answer is yes, control transfers to query operation 318. If the first/next word is not capitalized, transliteration of the current word is complete, and control transfers to query operation 322.
Query operation 318 examines the word to determine whether the word contains a capitalization exception, that is, a situation in which a letter within the mid portion of the current word is capitalized. Such exceptions occur only in certain situations that can be characterized by a set of grammar rules also contained in the transliteration mapping table 106. If the word contains an exception, control transfers to operation 320. If not, control transfers to query operation 322.
In operation 320, the word is checked against rules from the mapping table 106 in order to determine whether a character within the transliterated current word should be capitalized. If the check finds that a rule is matched, the requisite character in the word is capitalized, and control transfers to query operation 322. The following rules are exemplary and regard usage of capital and small letters involving combination characters in Cyrillic script with 2 characters in Serbian (Latin).
1. At the beginning of any sentence, Latin double character letters should be written with the first letter always a capital letter and second letter a small letter. Thus for Latin to Cyrillic script:
2. In titles, the letters LJ, NJ and DŽ should always be written with capital letters. Thus:
3. When using these three combinations of letters in the middle of sentences, the letters are always small. Thus:
In query operation 322, the current transliterated word is complete and is thus transferred to the transliterated text data store 324, and the query is made whether there is another word in the transliterated script sequence. If the answer is no, control transfers to operation 324, which returns control to the calling program or to the user. If the answer is yes, that is, there is another transliterated word, control transfers back to operation 312, where the next word is examined for capitalization. The process from 312 through 322 is repeated as many times as necessary until all the words in the transliterated script have been examined for capitalization exceptions, thus completing transliteration of the desired text contained in operation 324.
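For illustration only, the word-by-word pass of operations 312 through 322 may be sketched as follows for Cyrillic to Latin output. The names `fix_word_caps` and `fix_text_caps` are hypothetical. The fallback that inspects the preceding letter when the digraph ends the word is an assumption added for this sketch and is not stated in the rules above.

```python
import re

# Latin digraphs as produced by character transliteration of Љ, Њ and Џ.
TITLE_DIGRAPHS = ("Lj", "Nj", "Dž")


def fix_word_caps(word: str) -> str:
    """Capitalize the second letter of a digraph inside an all-capital context."""
    if not word[:1].isupper():           # query 312: word is not capitalized
        return word
    chars = list(word)
    for i in range(len(chars) - 1):      # query 318: look for an exception
        if chars[i] + chars[i + 1] not in TITLE_DIGRAPHS:
            continue
        before = chars[i - 1] if i > 0 else ""
        after = chars[i + 2] if i + 2 < len(chars) else ""
        # operation 320: the next character's case decides; when the digraph
        # ends the word, fall back to the preceding letter (assumption).
        if after.isupper() or (not after and before.isupper()):
            chars[i + 1] = chars[i + 1].upper()
    return "".join(chars)


def fix_text_caps(text: str) -> str:
    """Apply fix_word_caps to each whitespace-delimited word (query 322 loop)."""
    return re.sub(r"\S+", lambda m: fix_word_caps(m.group()), text)


print(fix_text_caps("LjUBAV je Ljubav"))
```

The pass only ever raises the second letter of a digraph, so words that are already correctly cased, such as "Ljubav" at the start of a sentence, pass through unchanged.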
Although the invention has been described in language specific to computer structural features, methodological acts and computer readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific structures, acts or media described. As an example, other types of data may be included in the language map in place of the string data discussed herein. Additionally, different manners of referencing the language specific data of the language map from the system calls in the base product may be used. Therefore, the specific structural features, acts and media are disclosed as exemplary embodiments implementing the claimed invention.
The various embodiments described above are provided by way of illustration only and should not be construed to limit the invention. Those skilled in the art will readily recognize various modifications and changes that may be made to the present invention without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.