The present invention relates to a system and method of a language continuum and more particularly, to context based systems and methods for dialect language harmonization.
An important aspect of automatic speech recognition (ASR) systems is the ability to distinguish between dialects in order to properly identify and recognize speech in acoustic data. However, current solutions train ASR systems using all available acoustic data, regardless of the type of accent or dialect employed by the speaker. It is accepted that a dialect is a particular form of a language group that is peculiar to a specific region or social group. A useful metric in defining a dialect is a set of settlement languages that share 90% of vocabulary and 75% of the exact meanings of that vocabulary. With regard to Arabic speech recognition in particular, most recent work has focused on recognizing Modern Standard Arabic (MSA). The problem of recognizing dialectal Arabic has not been adequately addressed. Arabic dialects differ from MSA and from each other morphologically, lexically, syntactically, phonologically and, indeed, in many dimensions of the linguistic spectrum. Heretofore there has been an effort to provide tools that enable words from one language to be translated into another language. Still other efforts have focused on the effect that dialects have on accurate translations. Many of these efforts focus on data mining, computer generated statistical analyses and machine learning of published languages such as those set forth in US20170011739 and US20150287405, both of which are incorporated herein in their entirety. Certain languages, such as Arabic, Cantonese and others, are composed of dialects from various countries, regions, cities, and villages that contain homophones (words that have the same spelling or structure but have different meanings). Many of these dialects lack sufficient written records to allow the machine based translation methods of the prior art to provide accurate translations.
This problem is well addressed in the publication titled “Machine Translation of Arabic Dialects” by Rabih Zbib et al., also incorporated herein in its entirety.
Still using Arabic as an example, general knowledge about the content of the Arabic dialects has been limited by the use of the formal language MSA, which is virtually absent from everyday speech. Specific knowledge about available vocabulary has been limited by the lack of a written definition of the Arabic dialects and by overlapping vocabulary having differences (from large to subtle) in semantics, which limits the recording of dialect vocabulary. If words could be both recorded and categorized by dialect, new markets could emerge, such as the Colloquial Arabic language learning industry, sources, dictionaries, and applications, including Colloquial Arabic online content. It is important to note that it is the lack of online dialect specific content that prevents the above mentioned data mining and statistical machine translation processes from being applied.
Translating across Arabic dialects, as well as across the dialects of other dialect-rich languages, can often be inaccurate and confusing using prior art methods, such as a conventional dictionary, or electronic methods such as an electronic translator. Arabic dialects, and distinct dialects of other languages, can have problems not only with homophones in general, but with some specific homophones. For instance, in English there is the word “bear” as in “the big fuzzy animal” and there is also “bear” as in “to wield a weapon” (“to bear arms”). These words should not pose a problem for a dictionary or an electronic translator when translated into or out of English. The shared spelling and pronunciation do not prevent a clear, concise definition of each from making the difference between the two uses of the word obvious. But unlike English, Arabic and other dialect-rich languages are filled not just with homophones, but with homophones having minute differences therebetween and overlapping meanings. Minute differences between two distinct dialects' identically spelled words necessitate that any definitions of the words be explicit enough not only to define the word, but also to let the reader or user know what the meaning of the word is not. Imagine if a first dialect of English used “bear” in both of the ways described directly herein above, a second English dialect used “bear” to mean “bear arms/weapons . . . but really only as in for hunting bears”, a third English dialect's “bear” meant “to bear weapons but only in the sense that the user means to use the weapon non-lethally”, and a fourth English dialect's use of the word “bear” meant “non-lethal weapon”. Continuing with this example, imagine that someone from the first dialect says “The protestors will bear arms” (as in, bear any kind of weapon) to the third dialect speaker, who thinks it only means “non-lethal”.
The third dialect speaker wouldn't realize he had misunderstood the story (thinking an upcoming confrontation will involve only non-lethal weapons, when guns are actually to be used in a lethal manner), while the first dialect speaker would not realize he had been misunderstood. If or when the misunderstanding is realized, the explanation can be obtuse and confusing, especially since neither speaks the dialect of the other perfectly. If one were to use an electronic translator, the definitions would have to be detailed enough not only to make the “wield a weapon” sense clear, but also to let the person in need know that it is not simply “bear any weapon”.
However, the problem of contextual recognition of dialectal languages has not been adequately addressed. What is needed is a system and methods for producing contextually accurate translations between different dialects of the same language.
A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a computer-implemented method of continuum translation for a plurality of dialects, including, at a computer having one or more processors and non-volatile memory for storing programs to be executed by the one or more processors, entering an input word from a first region having an input word spelling, at least one input word definition and a first dialect, selecting a second region where the second region includes a second dialect, assigning a High SpeVal to the input word, matching the High SpeVal to at least one Low SpeVal, identifying at least one second dialect word having a second dialect word definition and a second dialect word spelling in dependence of the at least one Low SpeVal, comparing the input word definition for equality to the second dialect word definition of the at least one second dialect word, comparing the second dialect word spelling of the at least one second dialect word for equality to the input word spelling; and outputting any of at least one identical word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is equal to the second dialect word spelling of the at least one second dialect word, at least one similar word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word, and at least one conflicting word when the input word definition is not equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
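By way of a non-limiting illustration, the comparison and output steps recited above might be sketched as follows. The class name `DialectWord`, the function name `classify` and the returned labels are assumptions introduced only for this sketch. The handling of a word whose spelling matches but whose definition differs is also an assumption: the sketch reports it as conflicting, which is one reading consistent with the “Balad” discussion later in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialectWord:
    """A word as recorded for one dialect (field names are illustrative)."""
    spelling: str
    definition: str


def classify(input_word: DialectWord, candidate: DialectWord) -> str:
    """Label a second dialect candidate relative to the input word."""
    same_definition = input_word.definition == candidate.definition
    same_spelling = input_word.spelling == candidate.spelling
    if same_definition and same_spelling:
        return "identical"
    if same_definition:
        return "similar"
    # Assumption: any definition mismatch is reported as conflicting,
    # including the same-spelling case the summary does not spell out.
    return "conflicting"


cairo_word = DialectWord("ganbu", "next to")
print(classify(cairo_word, DialectWord("hda", "next to")))
```

In this usage, the Cairo word and the Algiers word share a definition but not a spelling, so the candidate would be reported as a similar word.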
Implementations may include one or more of the following features. The computer-implemented method further includes identifying at least one context sentence associated with the High SpeVal in the first dialect, outputting the at least one context sentence, matching a specific context sentence from the at least one context sentence in dependence of a predetermined meaning, where matching of the High SpeVal to the at least one Low SpeVal is in dependence of the specific context sentence and the High SpeVal, identifying at least one similar context sentence associated with the at least one similar word in the second dialect, comparing the specific context sentence with the at least one similar context sentence; and outputting any of at least one alternative word from the at least one similar word when the specific context sentence is substantially similar to the at least one similar context sentence and at least one conflicting word from the at least one similar word when the specific context sentence is not substantially similar to the at least one similar context sentence.
The computer-implemented method further includes creating a first database including a plurality of first dialect words from the first dialect, creating a second database including a plurality of second dialect words from the second dialect, creating a third database including at least one High SpeVal for each of the plurality of first dialect words in the first database and each of the plurality of second dialect words from the second dialect, creating a fourth database including at least one Low SpeVal for each of the plurality of first dialect words in the first database and each of the plurality of second dialect words from the second dialect, creating a fifth database including at least one context sentence for each of the plurality of first dialect words in the first database, creating a sixth database including at least one context sentence for each of the plurality of second dialect words in the second database, creating a seventh database including at least one definition for each of the plurality of first dialect words in the first database, and creating an eighth database including at least one definition for each of the plurality of second dialect words in the second database. The computer-implemented method further includes populating at least a portion of the second database, the third database, the fourth database, the sixth database and the eighth database using a plurality of human speakers of the second dialect. The computer-implemented method may also include populating at least a portion of the first database, the third database, the fourth database, the fifth database and the seventh database using a plurality of human speakers of the first dialect. The computer-implemented method where at least a portion of the populating is performed by any of crowd sourcing, translation software and machine learning.
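As a rough sketch only, the databases recited above could be modeled in memory as shown below. The class names `DialectTables` and `Continuum`, the method `add_meaning`, and the codes “C-YOM”, “G-NHAR” and “DAY24” are hypothetical and are not taken from this disclosure. For brevity the sketch also folds the High SpeVal and Low SpeVal databases into a single shared mapping, which is a simplification rather than the recited arrangement.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DialectTables:
    """Word, definition and context sentence tables for one dialect."""
    words: dict = field(default_factory=dict)        # High SpeVal code -> spelling
    definitions: dict = field(default_factory=dict)  # Low SpeVal code -> definition
    sentences: dict = field(default_factory=dict)    # Low SpeVal code -> context sentence


@dataclass
class Continuum:
    """Shared SpeVal links plus one DialectTables per dialect."""
    dialects: dict = field(default_factory=lambda: defaultdict(DialectTables))
    high_to_low: dict = field(default_factory=lambda: defaultdict(set))

    def add_meaning(self, dialect, high, spelling, low, definition, sentence):
        """Record one meaning of one word, as a human contributor might."""
        tables = self.dialects[dialect]
        tables.words[high] = spelling
        tables.definitions[low] = definition
        tables.sentences[low] = sentence
        self.high_to_low[high].add(low)


cdg = Continuum()
# The identifying codes below are made up for this illustration.
cdg.add_meaning("Cairo", "C-YOM", "yom", "DAY24",
                "a period of twenty-four hours",
                "The past few ayam I slept a lot")
cdg.add_meaning("Algiers", "G-NHAR", "nhar", "DAY24",
                "a period of twenty-four hours",
                "The past few days (nharat) I slept a lot")
print(cdg.high_to_low["C-YOM"] & cdg.high_to_low["G-NHAR"])
```

Because both words are linked to the same Low SpeVal code, the final line shows the meaning the two dialects share, even though the spellings differ.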
The computer-implemented method where the at least one High SpeVal includes a unique High SpeVal identifying computer code for each of the plurality of first dialect words in the first database and each of the plurality of second dialect words from the second dialect. The computer-implemented method where the at least one Low SpeVal includes a unique Low SpeVal identifying computer code for each of the at least one High SpeVal. The computer-implemented method where the at least one context sentence is further associated with the unique High SpeVal identifying computer code. The computer-implemented method where the at least one context sentence is further associated with the unique Low SpeVal identifying computer code. The method where the first dialect and the second dialect are two distinct dialects from a common language group. The method where the first dialect is from a first language group and the second dialect is from a second language group. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
One general aspect includes a computer system, including one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform operations, the operations including inputting an input word from a first region having an input word spelling, at least one input word definition and a first dialect, selecting a second region where the second region includes a second dialect, assigning a High SpeVal to the input word, matching the High SpeVal to at least one Low SpeVal, identifying at least one second dialect word having a second dialect word definition and a second dialect word spelling in dependence of the at least one Low SpeVal, comparing the input word definition for equality to the second dialect word definition of the at least one second dialect word, comparing the second dialect word spelling of the at least one second dialect word for equality to the input word spelling; and outputting any of at least one identical word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is equal to the second dialect word spelling of the at least one second dialect word, at least one similar word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word, and at least one conflicting word when the input word definition is not equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word.
Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The computer system further includes identifying at least one context sentence associated with the High SpeVal in the first dialect, outputting the at least one context sentence, matching a specific context sentence from the at least one context sentence in dependence of a predetermined meaning, where matching of the High SpeVal to the at least one Low SpeVal is in dependence of the specific context sentence and the High SpeVal. The computer system further includes identifying at least one similar context sentence associated with the at least one similar word in the second dialect, comparing the specific context sentence with the at least one similar context sentence; and outputting any of, at least one alternative word from the at least one similar word when the specific context sentence is substantially similar to the at least one similar context sentence, at least one conflicting word from the at least one similar word when the specific context sentence is not substantially similar to the at least one similar context sentence. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
One general aspect includes a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform operations, the operations including inputting an input word from a first region having an input word spelling, at least one input word definition and a first dialect, selecting a second region where the second region includes a second dialect, assigning a High SpeVal to the input word, matching the High SpeVal to at least one Low SpeVal, identifying at least one second dialect word having a second dialect word definition and a second dialect word spelling in dependence of the at least one Low SpeVal, comparing the input word definition for equality to the second dialect word definition of the at least one second dialect word, comparing the second dialect word spelling of the at least one second dialect word for equality to the input word spelling; and outputting any of at least one identical word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is equal to the second dialect word spelling of the at least one second dialect word, at least one similar word in the second dialect when the input word definition is equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word, and at least one conflicting word when the input word definition is not equal to the second dialect word definition and the input word spelling is not equal to the second dialect word spelling of the at least one second dialect word.
Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The non-transitory computer-readable medium further includes identifying at least one context sentence associated with the High SpeVal in the first dialect. The non-transitory computer-readable medium may also include outputting the at least one context sentence. The non-transitory computer-readable medium may also include matching a specific context sentence from the at least one context sentence in dependence of a predetermined meaning, where matching of the High SpeVal to the at least one Low SpeVal is in dependence of the specific context sentence and the High SpeVal. The non-transitory computer-readable medium further includes identifying at least one similar context sentence associated with the at least one similar word in the second dialect. The non-transitory computer-readable medium may also include comparing the specific context sentence with the at least one similar context sentence; and outputting any of at least one alternative word from the at least one similar word when the specific context sentence is substantially similar to the at least one similar context sentence, and at least one conflicting word from the at least one similar word when the specific context sentence is not substantially similar to the at least one similar context sentence. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
In the following detailed description of the embodiments, reference is made to the accompanying drawings, which form a part hereof, and within which are shown by way of illustration specific embodiments by which the examples described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the disclosure.
The examples disclosed herein relate to a continuum dictionary generator (CDG) to provide contextual continuity between dialects of a common language group. In many of the examples the Arabic language is used; however, the systems and methods of the present disclosure are equally useful in other languages having dialect issues similar to those described herein above, and such uses are considered part of the present disclosure. In accordance with the present disclosure, words are recorded contextually based on their regional meaning, be that of cities or settlements. The words are identified by these regions and entered into the CDG, wherein the relationships between vocabulary and the meaning of each word of each settlement (or city) can be known. The method of the present disclosure can be performed at a computer having one or more processors and memory for storing programs to be executed by the one or more processors.
The CDG of the present disclosure is based on the recognition that some words in any particular language are homophones, that is, the same structural word used for multiple meanings. The CDG resolves these similarities and differences by being based on “Specific Values” or “SpeVals”, that is, a sort of micro word, wherein values are assigned to specific words and their single regional meanings. A SpeVal, for example, is a specific meaning of a word: the word “country” referring to a nation and its homophone “country” referring to a rural area are treated as two different values. The system and methods of the present disclosure utilize SpeVals to identify overlaps in vocabulary that diverge slightly in semantics. In certain embodiments this is done by maintaining a single dictionary of SpeVals used for all languages, while every language and dialect is assigned its own dictionaries, or databases, of words, definitions, and context sentences. The CDG uses both standard definitions and standardized context sentences to determine the slight dialect differences in meaning amongst words. The CDG of the present disclosure assigns SpeVals to “contrasts” or “relationships”. The SpeVals are placed in a database that is equally separate from all dialects and languages, and act as the means of relating these languages by using specific meanings, not words, as the base unit of the CDG. This is critical for dialects of languages that have evolved using the same words for different uses, but might still share one or two meanings due to their shared origin. For instance, in a traditional translation dictionary from English, “Country” in Algeria (“Balad”) is set to equal “Balad” in Cairo, regardless of differences in specific uses. The CDG would NOT say “Country=Balad”; instead, “Country” maps to different definitions, indicated by the SpeVals, indicating that this is a conflicting word.
In this example, if “SpeVal A1=rural area outside of a town”, “SpeVal A2=nation or state”, and “SpeVal A3=hometown”, then “Balad (of the Algiers database)=A1+A2+A3”. In the database of SpeVals for Cairo, “Balad=A1+A2”. It should be noted that, to avoid mistakes when analyzing the data, the present disclosure includes High SpeVals and Low SpeVals. By way of example, a word native to a specific city is itself assigned a “High SpeVal” (for Algiers “balad”=“5R”, and for Cairo “balad”=“7P”) in the form of a unique High SpeVal identifying computer code, which itself is set equal to specific SpeVals (like A1, A2, etc., which we now refer to as “Low SpeVals”), each in the form of a unique Low SpeVal identifying computer code. Continuing this example, if “5R=A1+A2+A3” and “7P=A1+A2”, then “5R≠7P”; that is to say, the Algerian “balad” doesn't equal the Cairene “balad”. But “5R=7P+A3”; that is to say, “balad” in Algiers is different from “balad” in Cairo in that the former can also be used to refer to “hometown” and the latter cannot. This method is useful not only for immediate translation, which the specific and general search functions offer as described herein below, but also for mapping dialects. Further, the present disclosure provides for the production of foreign language dialectical dictionaries and language learning material that pinpoint the specific ways one word, though sharing spelling and some meanings, is different from its counterpart in another dialect.
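The relationship between High SpeVals and Low SpeVals in the “balad” example can be pictured as set arithmetic. The following is a minimal sketch under that assumption; the codes 5R, 7P, A1, A2 and A3 come from the example above, while the names `HIGH_TO_LOW`, `LOW_MEANINGS` and `compare_high_spevals` are hypothetical.

```python
# Each High SpeVal code is treated as the set of Low SpeVal codes it covers.
HIGH_TO_LOW = {
    "5R": frozenset({"A1", "A2", "A3"}),  # "balad" in Algiers
    "7P": frozenset({"A1", "A2"}),        # "balad" in Cairo
}

LOW_MEANINGS = {
    "A1": "rural area outside of a town",
    "A2": "nation or state",
    "A3": "hometown",
}


def compare_high_spevals(first: str, second: str) -> dict:
    """Split the meanings of two High SpeVals into shared and unshared sets."""
    a, b = HIGH_TO_LOW[first], HIGH_TO_LOW[second]
    return {"shared": a & b, "only_first": a - b, "only_second": b - a}


result = compare_high_spevals("5R", "7P")
print(sorted(result["shared"]))       # ['A1', 'A2']
print(sorted(result["only_first"]))   # ['A3']
print([LOW_MEANINGS[code] for code in sorted(result["only_first"])])
```

Modeling a High SpeVal as a set keeps the comparison symmetric: the same call reports which meanings the two words share and which belong to only one dialect, here the “hometown” sense that only the Algiers word carries.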
Referring to
While still referring to
Still referring to CDG 300 of
The CDG 300 of the present disclosure can be more readily understood by way of the examples presented hereinafter, wherein Arabic words are presented and further used in English sentences. In this first example the word “Yom”, wherein the plural form is “ayam”, is explored. Yom is an Arabic word and a homophone in both a dialect native to Cairo and a dialect native to Algiers. In the Cairo dialect, a first definition of Yom differs from that of the same word in Algiers and can be defined as “a period of twenty-four hours as a unit of time, reckoned from one midnight to the next, corresponding to a rotation of the earth on its axis”. An appropriate contextual sentence in Cairo can be “The past few “ayam” I slept a lot”. In a second definition of the word Yom, the word is defined the same in Cairo as it is in Algiers and can be defined as “the current day” (similar to “today” in the English language). In this context the word is used with the definite article “al”, or in Arabic “ال”, which is sensibly the equivalent of “this” or “the” in the English language. An appropriate contextual sentence can be “Al-yom was the best day of my life”. In the Algiers dialect the definition is the same as the second definition in the Cairo dialect and the contextual sentence would be the same. A second word in the dialect of Algiers is “Nhar”, wherein the plural form is “nharat”. In the Algiers dialect, its only definition can be “a period of twenty-four hours as a unit of time, reckoned from one midnight to the next, corresponding to a rotation of the earth on its axis”. Since “nhar” (plural “nharat”) is identical in meaning to the Cairene yom/ayam (and thus shares the same SpeVal), an appropriate contextual sentence (x1) will also be identical: “The past few days (nharat) I slept a lot”.
In using the CDG of the present disclosure with the example given above, suppose the user is a native Cairo speaker, or a speaker of any language, who wants to see how to say “yom” (“day” in Cairene), if it is used at all, in the dialect of Arabic spoken daily in the Algerian city of Algiers. Referring back to
Referring back to
Now referring back to
The following examples are meant to further illustrate the general search function 200 and the specific search function 100 of CDG 300. The various steps of the method of the present disclosure refer to those found in the various figures as outlined herein above. In this example the word to be translated by the CDG 300 means “Next to them” and “side”. This particular word is “Ganbu” (used in both Cairo and Algiers for “side” and in Cairo only for “next to”). In Cairo the definition for “Next (to)” could be “in or into a position immediately to one side of; beside.” and “Side” could be defined as “an upright or sloping surface of a structure or object that is not the top or bottom and generally not the front or back”. Similarly, in Algiers “Side” could be defined as “an upright or sloping surface of a structure or object that is not the top or bottom and generally not the front or back”, while the word “Hda” is used in Algiers for the meaning of “next to”, having a definition of “in or into a position immediately to one side of; beside.”
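A hedged sketch of how a general search over the “Ganbu” data just described might group results is given below. The Low SpeVal codes “NEXT_TO” and “SIDE”, the table layout and the function name `general_search` are assumptions made for illustration only; the sketch is not a statement of how the general search function 200 is implemented.

```python
# Hypothetical tables: spelling -> set of Low SpeVal codes used in that dialect.
CAIRO = {"ganbu": {"NEXT_TO", "SIDE"}}
ALGIERS = {"ganbu": {"SIDE"}, "hda": {"NEXT_TO"}}


def general_search(word, source, target):
    """Group every meaning of `word` by how the target dialect expresses it."""
    same_form, different_form, missing = {}, {}, []
    for low in sorted(source[word]):
        # All target dialect spellings recorded for this specific meaning.
        matches = sorted(s for s, lows in target.items() if low in lows)
        if word in matches:
            same_form[low] = word
        elif matches:
            different_form[low] = matches
        else:
            missing.append(low)
    return {"same_form": same_form,
            "different_form": different_form,
            "missing": missing}


print(general_search("ganbu", CAIRO, ALGIERS))
```

For the Cairo word “Ganbu”, this reports the “side” meaning under the same form in Algiers and the “next to” meaning under the different form “Hda”, which mirrors the relationship stated in the example.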
In this particular example a user may have knowledge of Cairene Arabic (or be from Cairo) and desire to know how to say “Next” (the preposition, as in “next to . . . ”) so as to be best understood by a native of Algiers. Referring to
In Step 2 the high SpeVal of “Ganbu” is then located by processor 303 in the High SpeVal database 308 from its link to the word “Ganbu” in W1 305. In Step 3 the high SpeVal of “Ganbu” is linked by processor 303 to the context sentences for each SpeVal (each specific meaning) that “Ganbu” is used for in the Cairo dialect. Ganbu, as we see, is used for at least two meanings in Cairene:
In Step 4 these two context sentences are returned by processor 303 to the user for the proper selection, as part of a graphical interface 302 or via a website through which a user can use the CDG 300. In Step 5, using user input device 301, the user then selects the context sentence which displays his or her intended use of the word; in this particular example, desiring the meaning for “next to”, the user would select the first option. In Step 6 the low SpeVal of the selected context sentence is located by processor 303 in the low SpeVal database 309. In Step 7 the word and its definition used by dialect 2 for that SpeVal are located by processor 303 in the dialect 2 word database W2 310 and definition database D2 312, respectively. In Step 8, as described herein before, the word for Algiers is “Hda”, which would be included with the returned results so the user can be sure they picked the right word; the definition would be the same appropriate definition of “next to” in English (with any differences noted).
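The steps of the specific search described above can be sketched, under stated assumptions, as a chain of lookups. Every code (“H1”, “NEXT_TO”, “SIDE”), every context sentence and every table name below is a placeholder invented for this illustration; in particular, the two Cairo context sentences are not those of the disclosure.

```python
# All codes and context sentences below are illustrative placeholders.
WORD_TO_HIGH = {("Cairo", "ganbu"): "H1"}
HIGH_TO_LOW = {"H1": ["NEXT_TO", "SIDE"]}
CONTEXT = {
    ("Cairo", "NEXT_TO"): "He sat ganbu the door.",
    ("Cairo", "SIDE"): "The ganbu of the box is red.",
}
TARGET_WORD = {("Algiers", "NEXT_TO"): "hda", ("Algiers", "SIDE"): "ganbu"}
TARGET_DEFINITION = {
    ("Algiers", "NEXT_TO"): "in or into a position immediately to one side of; beside",
    ("Algiers", "SIDE"): "an upright or sloping surface of a structure or object",
}


def context_options(dialect, word):
    """Word -> High SpeVal -> one context sentence per Low SpeVal."""
    high = WORD_TO_HIGH[(dialect, word)]
    return [(low, CONTEXT[(dialect, low)]) for low in HIGH_TO_LOW[high]]


def specific_search(dialect, word, choice, target):
    """The chosen sentence fixes the Low SpeVal used for the target lookup."""
    low, _sentence = context_options(dialect, word)[choice]
    return TARGET_WORD[(target, low)], TARGET_DEFINITION[(target, low)]


for index, (_low, sentence) in enumerate(context_options("Cairo", "ganbu")):
    print(index, sentence)
print(specific_search("Cairo", "ganbu", 0, "Algiers"))
```

Selecting the first option, which stands for the “next to” meaning, leads to the Algiers word “hda” together with its definition. The design point illustrated is that the user never chooses a target word directly; the user chooses a context sentence, and the Low SpeVal tied to that sentence drives the lookup.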
Referring back to
In Step 5, the Algiers words for each of the two low SpeVals are tested for equality with the word used by the Cairene dialect:
In Step 6 the word(s) that are equal in form for the same SpeVal are returned to the user via graphical interface 302 and can be labeled “Same form and meaning”; in this particular example, “Ganbu” is returned, having been tested by processor 303 as identical to the Algiers form. Though not shown, the input word definition and context sentence of this word, according to dialect 1 (Cairene), could also be returned from D1 305 and S1 306, respectively. In
In this next example the word to be translated by the CDG 300 is “thing” in English; in Arabic it is “Haga”. It is used in both Cairo and Algiers for “an object that one need not, cannot, or does not wish to give a specific name to”, but in the Algiers dialect “Haga” is used to refer to some unspecified noun, similar to how “anything” is used in English, wherein these could be referred to as similar words. The Cairo dialect also uses the word for that meaning, in addition to the sense of specific objects someone has in mind, like “belongings, baggage, or stuff”. In this example a user may have knowledge of Cairene Arabic (or be from Cairo) and desire to know how to say “thing” (as in “belongings, baggage”) so as to be understood while traveling in the city of Algiers. Referring to
In Step 4 these two context sentences are returned to the user by processor 303 for the proper selection, as part of graphical interface 302 or a website through which a user can use the CDG 300. In addition, similar context sentences can be returned to the user as well. In Step 5, using user input device 301, the user then selects the context sentence which displays the meaning for “baggage”. In Step 6 that context sentence's low SpeVal is located by processor 303 in the low SpeVal database 309. In Step 7 the word and its definition used by dialect 2 (the Algiers dialect) for that SpeVal are located by processor 303 in the dialect 2 word database W2 310 and definition database D2 312, respectively. In Step 8, as described herein before, the word for Algiers is “Durzan”, which would be included with the returned results so the user can be sure they picked the right word; the definition would be the same appropriate definition of “baggage” in English (with any differences noted).
Referring to
In Step 5, the Algiers words for each of the two low SpeVals are tested by processor 303 for equality with the word used by the Cairene dialect:
Since “Haga” (for the meaning of general ‘thing’) tested identical by processor 303 to the Algiers form of the word, it is returned (along with its definition, from D1 305, and context sentence, from S1 306) to the user via graphical interface 302 under the label “same form and meaning” or something to that effect, as occurs with all such words in Step 6. In
Furthermore, while the invention has been shown and described with respect to certain preferred embodiments, it is obvious that equivalent alterations and modifications will occur to others skilled in the art upon the reading and understanding of the specification. The present invention includes all such equivalent alterations and modifications, and is limited only by the scope of the appended claims.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/711,211 filed 27 Jul. 2018. The disclosure of the application above is incorporated herein by reference in its entirety.