The subject matter of the present disclosure generally relates to electronic searching, and more particularly relates to improvements in electronic cross-lingual searching.
All languages possess words, terms, and/or gestures that do not always translate neatly into other vernaculars. Often, even when a direct translation exists, it may still contain errors due to semantic use, idioms, or the context of the expression when crossing languages. This reality creates difficulties when attempting to translate a single word across languages, as multiple forms of the word within a single language can be relevant based on its use or purpose. Translating from character-based to pictographic (e.g., Chinese, Japanese, Korean) languages exacerbates these problems because there is no true character-for-character or word-for-word association available.
Current computer cross-lingual search systems utilize a single-source translation that converts the query, be it word, phrase, or gesture, into the appropriate language representation used in the text communication. Using this single source, the electronic search is thus limited just to the direct translation of the query, without taking into account semantics or lexicon. Thus, the translation of the query may not accurately account for the context of the original use. Existing computer processes limit the full scope of available sources of information since documents that do not contain the correct translated form of the word, phrase, or gesture of interest would not appear as a match, leaving the user unaware of the existence of search results of interest when search results are returned. This can severely limit the utility of current computer search systems.
Disclosed is a method and system for conducting cross lingual searching of text based communications using a multi-language ontology. In an embodiment, a word of interest is received and propagated through an ontology for multiple languages identifying all associations within a database to create a search set. For the purposes of the present disclosure, the term “WORD” will represent, without limitation, “words, phrases, gestures, slang terms, expressions, and pictographic representations.” The search set is composed of all representations for the parent entity for each language as a set of sub-sets (i.e., an individual sub-set for each language). The search is then performed using the search set to identify text-based communications containing some equivalent representation of the parent entity within the document's respective language. The resulting documents, containing WORDs within the search sets, are then indexed to correlate with the parent entity. The product is a set of documents containing one or more of the ontology search set entities for the parent word indexed back to the initial search entry for direct retrieval and future searching.
Discovery of key terms, phrases, or gestures within text based communication across multiple languages using an ontology based approach increases the effectiveness of searching compared to the use of direct single source translations. A multi-language ontology effectively represents each individual language's lexicon for the word of interest which allows for the creation of a search set. The use of the search set provides a larger breadth of searching capability compared to the use of a single direct translation. Once complete, the results are stored in an electronic database with an index to the parent entity to permit efficient retrieval and future searching. The method therefore accounts for subtle differences in semantics, vernacular, and dialect that may not transform accurately from a single source translation. Thus, the search identifies potential matches that may have otherwise been lost with the use of a preprocessed single word direct translation.
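The search-and-index step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `search_and_index`, the document fields (`id`, `lang`, `text`), the placeholder terms, and the choice of case-insensitive substring matching are all hypothetical and introduced only for the example.

```python
from typing import Dict, List, Set


def search_and_index(parent: str,
                     search_set: Dict[str, Set[str]],
                     documents: List[dict]) -> Dict[str, List[str]]:
    """Index each document that contains any search-set term for its own
    language back to the parent entity.

    Hypothetical helper: matching is a simple case-insensitive substring test.
    """
    matched_ids: List[str] = []
    for doc in documents:
        # Use only the sub-set that belongs to the document's language.
        terms = search_set.get(doc["lang"], set())
        text = doc["text"].casefold()
        if any(term.casefold() in text for term in terms):
            matched_ids.append(doc["id"])
    return {parent: matched_ids}


# Placeholder data (not real translations): one sub-set per language.
search_set = {"lang_a": {"alpha", "alfa"}, "lang_b": {"beta-form"}}
documents = [
    {"id": "d1", "lang": "lang_a", "text": "A report mentioning Alfa."},
    {"id": "d2", "lang": "lang_b", "text": "No relevant term here."},
    {"id": "d3", "lang": "lang_b", "text": "Uses the BETA-FORM variant."},
]

index = search_and_index("ALPHA", search_set, documents)
print(index)
```

Keying the result on the parent entity means a later query for the same entity can retrieve the stored document identifiers directly rather than repeating the search.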
Using a multi-language ontology to represent multiple forms and related terms associated with a word of interest increases the effectiveness of cross lingual searching by expanding the body of available information that would otherwise be inaccessible for a direct single source translation. This ontology accommodates the use of a wide array of terms covering dialect, jargon, slang, contextual relationships, or gestures (including pictograph representations) in creating a search set. This will improve search capabilities by ensuring the semantic influences and context of the words are accurately represented in the search results for all languages of interest.
Disclosed is a method for conducting cross lingual searches of electronic text based media for WORDs that accounts for the semantics and contextual differences across vernaculars. Embodiments utilize a multi-language ontology to establish a search set that will contain multiple forms and word relationships to the parent entity in the respective languages prior to conducting a search process. The end result is a set of documents that have one or more entries within the search set indexed to the parent entity.
In an embodiment and with reference to
This ontology becomes the search set, which is composed of all the associated WORDs collected from the individual language ontologies. The search set is thus a list of searchable terms used to process text-based media.
The process uses the search set to filter for ontology matches in steps 104 and 105 and then store the matching documents and index them to the parent entity in step 106. This indexing of results is depicted in
After indexing, the documents are directly correlated to the parent entity. This process is represented in
Now with reference to
To improve the comprehension of the process described above, the following example provides an exemplary use case of an embodiment.
At the time of the present disclosure, the Islamic State of Iraq and Syria (ISIS) is a mainstream concern for the United States and other nations. Searching for the term ISIS across languages presents challenges due to its representations in different cultures and the inability of traditional translation methods to capture these variants. Additionally, the term is an acronym but is also recognized as a proper noun. If a user were to enter the term “ISIS” into an engine performing searches across languages, the term is still represented as “ISIS.” Even when converting to the primary alphabet of other languages (e.g., Cyrillic or Arabic), the response is still a single word.
For example, GOOGLE TRANSLATE and SYSTRAN form the backbone for the majority of translation tools easily available to consumers. The translation of the entity “ISIS” into Russian and Croatian yields in both cases simply “ISIS.”
Using these translated forms of the entity will produce results, but only when “ISIS” appears in a document. The drawback is that the term can be represented quite differently, and without proper correlation a large amount of data will go unobserved. Overcoming this problem is one advantage of the disclosed method.
Embodiments use an ontology to capture the representations that a WORD may have within other languages. This ensures that an exhaustive search of available sources will contain the greatest number of relevant documents.
Croatians typically use the phonetic spelling of ISIS in their own dialect but also the spelling in Cyrillic. In previous systems, the translation tools would have overlooked documents containing this subtle difference. The disclosed method would identify these items as possessing the same usage as the searched entity because a comprehensive ontology mapping of equivalents is developed for use in searching. Specifically, a plurality of language sets are stored on at least one computer readable storage medium. In each language set, a WORD from another language is associated with (indexed to) its equivalents in that language. When a processor receives a query containing a parent entity, it retrieves the indexed equivalents from each language set and combines those equivalents into an ontology mapping. Afterwards, the processor searches another database for results based on the ontology mapping.
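The retrieval and combination of equivalents described above might be sketched as below. This is an illustrative sketch only: the names `LANGUAGE_SETS`, `build_ontology_mapping`, and `search_database`, the in-memory dictionaries standing in for the storage medium and the searched database, and all terms shown are assumptions made for the example, not details taken from the disclosure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical language sets: language -> {WORD: [equivalents in that language]}.
# All entries are placeholders rather than real translations.
LANGUAGE_SETS: Dict[str, Dict[str, List[str]]] = {
    "lang_a": {"TERM": ["term", "variant_a1"]},
    "lang_b": {"TERM": ["variant_b1", "variant_b2"]},
    "lang_c": {"OTHER": ["other_c1"]},
}


def build_ontology_mapping(parent: str,
                           language_sets: Dict[str, Dict[str, List[str]]]
                           ) -> Dict[str, Set[str]]:
    """Retrieve the indexed equivalents of `parent` from each language set
    and combine them into a single mapping keyed by language."""
    mapping: Dict[str, Set[str]] = {}
    for lang, entries in language_sets.items():
        equivalents = entries.get(parent)
        if equivalents:  # skip languages with no entry for this parent
            mapping[lang] = set(equivalents)
    return mapping


def search_database(mapping: Dict[str, Set[str]],
                    database: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (document id, first matching term) pairs, using a simple
    case-insensitive substring test as a stand-in for a real search."""
    all_terms = sorted(set().union(*mapping.values()))
    hits: List[Tuple[str, str]] = []
    for doc_id, text in database.items():
        lowered = text.casefold()
        for term in all_terms:
            if term.casefold() in lowered:
                hits.append((doc_id, term))
                break
    return hits


mapping = build_ontology_mapping("TERM", LANGUAGE_SETS)
database = {
    "doc1": "A passage with variant_b2 inside",
    "doc2": "An unrelated passage",
    "doc3": "Mentions the TERM itself",
}
hits = search_database(mapping, database)
print(hits)
```

In this sketch, a search limited to the single form "term" would return only `doc3`, whereas the combined mapping also surfaces `doc1`.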
The Russian ontology contains many representations for ISIS in its primary alphabet, Cyrillic. Therefore, in this instance, while the translation tools would search for a single translation of the entity, the proposed method would search for five different versions of the term: one Latin-alphabet spelling (the same as the other tools) plus the four Cyrillic versions.
Using direct translation tools, the translation of ISIS into Arabic abjad does not account for many manifestations of “ISIS” found in Arabic communications. The disclosed method would, however, identify those representations and use them in searching for relevant documents.
Although the disclosed subject matter has been described and illustrated with respect to embodiments thereof, it should be understood by those skilled in the art that features of the disclosed embodiments can be combined, rearranged, etc., to produce additional embodiments within the scope of the invention, and that various other changes, omissions, and additions may be made therein and thereto, without departing from the spirit and scope of the present invention.
The present application claims priority to U.S. Provisional Patent Application No. 62/349,709, entitled “Cross Lingual Search using Multi-Language Ontology for Text Based Communication” and filed Jun. 14, 2016. The contents of U.S. 62/349,709 are hereby incorporated by reference herein in their entirety.
Number | Date | Country
---|---|---
62349709 | Jun 2016 | US