The present disclosure relates to language detection and, in particular, to systems and methods for detecting languages in short text messages.
In general, language detection or identification is a process in which a language present in a body of text is detected automatically based on the content of the text. Language detection is useful in the context of automatic language translation, where the language of a text message must generally be known before the message can be translated accurately into a different language.
While traditional language detection is usually performed on a collection of many words and sentences (i.e., on the document level), a particularly challenging domain is the chat text domain, where messages often include only a few words (e.g., four or fewer), some or all of which may be informal and/or misspelled. In the chat text domain, existing language detection approaches have proven to be inaccurate and/or slow, given the lack of information and the informalities present in such messages.
Embodiments of the systems and methods described herein are used to detect the language in a text message based on, for example, content of the message, information about the keyboard used to generate the message, and/or information about the language preferences of the user who generated the message. Compared to previous language detection techniques, the systems and methods described herein are generally faster and more accurate, particularly for short text messages (e.g., of four words or fewer).
In various examples, the systems and methods use a plurality of language detection tests and classifiers to determine probabilities associated with possible languages in a text message. Each language detection test may output a set or vector of probabilities associated with the possible languages. The classifiers may combine the output from the language detection tests to determine a most likely language for the message. The particular language detection test(s) and classifier(s) chosen for the message may depend on a predicted accuracy, a confidence score, and/or a linguistic domain for the message.
In one aspect, the invention relates to a computer-implemented method of identifying a language in a message. The method includes: performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a respective set of scores, each score in the set of scores representing a likelihood that the message is in one of a plurality of different languages; providing one or more combinations of the score sets as input to one or more distinct classifiers; obtaining as output from each of the one or more classifiers a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score; and identifying the language in the message as being the indicated language from one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.
In certain examples, a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation. Identifying the language in the message may include selecting the indicated language based on the confidence score.
Identifying the language in the message may include selecting the classifier based on the identified linguistic domain. In some instances, the linguistic domain is or includes video games, sports, news, parliamentary proceedings, politics, health, and/or travel.
In some examples, the message includes two or more of the following: a letter, a number, a symbol, and an emoticon. The plurality of different language detection tests may include at least two methods selected from the group consisting of a byte n-gram method, a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. The plurality of different language detection tests may be performed simultaneously (e.g., with parallel processing). The one or more combinations may include score sets from a byte n-gram method and a dictionary-based method. The one or more combinations may further include score sets from the user language profile method and/or the alphabet-based method.
In another aspect, the invention relates to a system for identifying a language in a message. The system includes a computer storage device having instructions stored thereon. The system also includes a data processing apparatus configured to execute the instructions to perform operations that include: performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a respective set of scores, each score in the set of scores representing a likelihood that the message is in one of a plurality of different languages; providing one or more combinations of the score sets as input to one or more distinct classifiers; obtaining as output from each of the one or more classifiers a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score; and identifying the language in the message as being the indicated language from one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.
In certain examples, a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation. Identifying the language in the message may include selecting the indicated language based on the confidence score. Identifying the language in the message may include selecting the classifier based on the identified linguistic domain. In some instances, the linguistic domain is or includes video games, sports, news, parliamentary proceedings, politics, health, and/or travel.
In some examples, the message includes two or more of the following: a letter, a number, a symbol, and an emoticon. The plurality of different language detection tests may include at least two methods selected from the group consisting of a byte n-gram method, a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. The plurality of different language detection tests may be performed simultaneously (e.g., with parallel processing). The one or more combinations may include score sets from a byte n-gram method and a dictionary-based method. The one or more combinations may further include score sets from the user language profile method and/or the alphabet-based method.
In another aspect, the invention relates to a computer program product stored in one or more storage devices for controlling a processing mode of a data processing apparatus. The computer program product is executable by the data processing apparatus to cause the data processing apparatus to perform operations that include: performing a plurality of different language detection tests on a message associated with a user, each language detection test determining a respective set of scores, each score in the set of scores representing a likelihood that the message is in one of a plurality of different languages; providing one or more combinations of the score sets as input to one or more distinct classifiers; obtaining as output from each of the one or more classifiers a respective indication that the message is in one of the plurality of different languages, the indication including a confidence score; and identifying the language in the message as being the indicated language from one of the one or more classifiers, based on at least one of the confidence score and an identified linguistic domain.
In certain examples, a particular classifier is a supervised learning model, a partially supervised learning model, an unsupervised learning model, or an interpolation. Identifying the language in the message may include selecting the indicated language based on the confidence score. Identifying the language in the message may include selecting the classifier based on the identified linguistic domain. In some instances, the linguistic domain is or includes video games, sports, news, parliamentary proceedings, politics, health, and/or travel.
In some examples, the message includes two or more of the following: a letter, a number, a symbol, and an emoticon. The plurality of different language detection tests may include at least two methods selected from the group consisting of a byte n-gram method, a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. The plurality of different language detection tests may be performed simultaneously (e.g., with parallel processing). The one or more combinations may include score sets from a byte n-gram method and a dictionary-based method. The one or more combinations may further include score sets from the user language profile method and/or the alphabet-based method.
Elements of embodiments described with respect to a given aspect of the invention may be used in various embodiments of another aspect of the invention. For example, it is contemplated that features of dependent claims depending from one independent claim can be used in apparatus and/or methods of any of the other independent claims.
In general, the language detection systems and methods described herein can be used to identify the language in a text message when language information for the message (e.g., keyboard information from a client device) is absent, malformed, or unreliable. The systems and methods improve the accuracy of language translation methods used to translate text messages from one language to another. Language translation generally requires the source language to be identified accurately; otherwise, the resulting translation may be inaccurate.
An application such as a web-based application can be provided as an end-user application to allow users to provide messages to the server system 12. The end-user applications can be accessed through a network 32 by users of client devices, such as a personal computer 34, a smart phone 36, a tablet computer 38, and a laptop computer 40. Other client devices are possible. The user messages may be accompanied by information about the devices used to create the messages, such as information about the keyboard, client device, and/or operating system used to create the messages.
In some implementations, the language indication from the one or more classifiers is selected by the manager module 20 according to a computed confidence score and/or a linguistic domain. For example, the classifiers may compute a confidence score indicating a degree of confidence associated with the language prediction. Additionally or alternatively, certain classifier output may be selected according to the linguistic domain associated with the user or the message. For example, if the message originated in a computer gaming environment, a particular classifier output may be selected as providing the most accurate language prediction. Likewise, if the message originated in the context of sports (e.g., regarding a sporting event), a different classifier output may be selected as being more appropriate for the sports linguistic domain. Other possible linguistic domains include, for example, news, parliamentary proceedings, politics, health, travel, web pages, newspaper articles, and microblog messages. In general, certain language detection methods or combinations of language detection methods (e.g., from a classifier) may be more accurate for certain linguistic domains, when compared to other linguistic domains. In some implementations, the domain can be determined based on the presence of words from a domain vocabulary in a message. For example, a domain vocabulary for computer gaming could include common slang words used by gamers.
The language detection methods used by the detection method module 16 may include, for example, an n-gram method (e.g., a byte n-gram method), a dictionary-based method, an alphabet-based method, a script-based method, and a user language profile method. Other language detection methods are possible. Each of these language detection methods may be used to detect a language present in a message. The output from each method may be, for example, a set or vector of probabilities associated with each possible language in the message. In some instances, two or more of the language detection methods may be performed simultaneously, using parallel computing, which can reduce computation times considerably.
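By way of illustration only, running several language detection methods concurrently may be sketched as follows. The function name `run_tests_in_parallel` and the two stand-in tests are hypothetical and are not part of the disclosure; each test is assumed to be a callable that returns a mapping from language code to probability.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tests_in_parallel(message, tests):
    """Run each detection test on the message concurrently.

    `tests` maps a test name to a callable returning {language: probability}.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        futures = {name: pool.submit(fn, message) for name, fn in tests.items()}
        return {name: future.result() for name, future in futures.items()}

# Hypothetical stand-ins for real detection tests (fixed outputs for illustration).
def fake_ngram(message):
    return {"en": 0.3, "fr": 0.4, "pl": 0.3}

def fake_dictionary(message):
    return {"en": 0.1, "fr": 0.2, "pl": 0.7}

scores = run_tests_in_parallel(
    "lol gtg", {"ngram": fake_ngram, "dictionary": fake_dictionary}
)
```

A thread pool is used here only for brevity. For CPU-bound tests in CPython, a process pool or another parallel computing arrangement may be more suitable.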
In one implementation, a byte n-gram method uses byte n-grams instead of word or character n-grams to detect languages. The byte n-gram method is preferably trained over a mixture of byte n-grams (e.g., with 1≦n≦4), using a naive Bayes classifier having a multinomial event model. The model preferably generalizes to data from different linguistic domains, such that the model's default configuration is accurate over a diverse set of domains, including newspaper articles, online gaming, web pages, and microblog messages. Information about the language identification task may be integrated from a variety of domains.
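As a minimal sketch of the byte n-gram approach, assuming UTF-8 encoding, add-one smoothing, and a toy training set (none of which is specified above), a multinomial naive Bayes model over byte n-grams with 1 ≤ n ≤ 4 might look like the following. The class name `ByteNgramNB` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def byte_ngrams(text, n_min=1, n_max=4):
    """Yield all byte n-grams of the UTF-8 encoded text for n_min <= n <= n_max."""
    data = text.encode("utf-8")
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

class ByteNgramNB:
    """Multinomial naive Bayes over byte n-grams (illustrative sketch only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                 # smoothing constant (assumed)
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()         # language -> number of training texts
        self.vocab = set()

    def fit(self, samples):
        for text, lang in samples:
            self.doc_counts[lang] += 1
            for gram in byte_ngrams(text):
                self.counts[lang][gram] += 1
                self.vocab.add(gram)
        return self

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for lang, n_docs in self.doc_counts.items():
            total = sum(self.counts[lang].values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in byte_ngrams(text):
                numerator = self.counts[lang][gram] + self.alpha
                score += math.log(numerator / (total + self.alpha * vocab_size))
            log_scores[lang] = score
        # Convert log scores to a normalized probability vector.
        peak = max(log_scores.values())
        exp = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
        norm = sum(exp.values())
        return {lang: e / norm for lang, e in exp.items()}

# Toy training data, for illustration only.
model = ByteNgramNB().fit([
    ("hello how are you", "en"),
    ("thanks see you later", "en"),
    ("bonjour comment ça va", "fr"),
    ("merci à bientôt", "fr"),
])
probs = model.predict_proba("hello you")
```

Working on bytes rather than characters avoids any need to tokenize the message or to know its script in advance.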
The task of attaining high accuracy may be relatively easy for language identification in a traditional text categorization setting, for which in-domain training data is available. This task may be more difficult when attempting to use learned model parameters for one linguistic domain to classify or categorize data from a separate linguistic domain. This problem may be addressed by focusing on important features that are relevant to the task of language identification. This may be based on, for example, a concept called information gain, which was originally introduced for decision trees as a splitting criterion, and later found to be useful for selecting features in text categorization. In certain implementations, a detection score is calculated that represents the difference in information gain relative to domain and language. Features having a high detection score may provide information about language without providing information about domain. For simplicity, the candidate feature set may be pruned before information gain is calculated, by means of a feature selection based on term frequency.
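The feature selection idea may be sketched as follows. This example assumes that the detection score is the information gain of a binary feature with respect to language minus its information gain with respect to domain; the exact formula is not given above, so this definition and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(has_feature, labels):
    """Information gain of a binary feature with respect to the labels."""
    total = len(labels)
    present = [label for flag, label in zip(has_feature, labels) if flag]
    absent = [label for flag, label in zip(has_feature, labels) if not flag]
    conditional = sum(
        len(part) / total * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional

def detection_score(has_feature, languages, domains):
    """Assumed definition: gain about language minus gain about domain."""
    return information_gain(has_feature, languages) - information_gain(has_feature, domains)

# Four toy documents, each labeled with a language and a domain.
languages = ["en", "en", "fr", "fr"]
domains = ["chat", "news", "chat", "news"]

# Feature A appears only in English documents; feature B only in chat documents.
score_a = detection_score([True, True, False, False], languages, domains)
score_b = detection_score([True, False, True, False], languages, domains)
```

Under this assumed definition, feature A receives a high score because it separates languages without separating domains, while feature B receives a low score because it does the opposite.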
In general, the dictionary-based language detection method counts the number of tokens or words belonging to each language by looking up words in a dictionary or other word listing associated with the language. The language having the most words in the message is chosen as the best language. In the case of multiple best languages, the more frequent or commonly used of the best languages may be chosen. The language dictionaries may be stored in the dictionaries database 24.
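A minimal sketch of the dictionary-based method is given below. The toy word lists, the whitespace tokenization, the normalization of counts into fractions, and the `language_frequency_rank` tie-breaker are assumptions made for illustration.

```python
def dictionary_scores(message, dictionaries):
    """Count message tokens found in each language's word list and normalize."""
    tokens = message.lower().split()
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in dictionaries.items()
    }
    total = sum(counts.values())
    if total == 0:
        return {lang: 0.0 for lang in dictionaries}
    return {lang: count / total for lang, count in counts.items()}

def best_language(scores, language_frequency_rank):
    """Pick the top-scoring language; break ties in favor of the more common one."""
    top = max(scores.values())
    tied = [lang for lang, score in scores.items() if score == top]
    return min(tied, key=lambda lang: language_frequency_rank.get(lang, float("inf")))

# Hypothetical dictionaries that include informal chat words.
dictionaries = {
    "en": {"lol", "gtg", "hello", "you"},
    "fr": {"lol", "mdr", "bonjour"},
}
rank = {"en": 0, "fr": 1}  # lower rank means more commonly used (assumed ordering)

scores = dictionary_scores("lol gtg", dictionaries)
winner = best_language(scores, rank)
tie_winner = best_language(dictionary_scores("lol", dictionaries), rank)
```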
To ensure accuracy of the dictionary-based language detection method, particularly for short sentences, it is preferable to use dictionaries that include informal words or chat words (e.g., abbreviations, acronyms, slang words, and profanity), in addition to formal words. Informal words are commonly used in short text communications and in chat rooms. The dictionaries are preferably augmented to include informal words on an ongoing basis, as new informal words are developed and used.
The alphabet-based method is generally based on character counts for each language's alphabet and relies on the observation that many languages have unique alphabets or different sets of characters. For example, Russian, English, Korean, and Japanese each use a different alphabet. Although the alphabet-based method may be unable to distinguish some languages precisely (e.g., languages that use similar alphabets, such as Latin languages), the alphabet-based method can generally detect certain languages quickly. In some instances it is preferable to use the alphabet-based method in combination with one or more other language detection methods (e.g., using a classifier), as discussed herein. The language alphabets may be stored in the alphabets database 26.
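The character-counting idea behind the alphabet-based method may be sketched as follows. The three alphabets shown are a small illustrative subset, and normalizing the counts into fractions is an assumption.

```python
def alphabet_scores(message, alphabets):
    """Count characters belonging to each language's alphabet and normalize."""
    text = message.lower()
    counts = {
        lang: sum(1 for ch in text if ch in letters)
        for lang, letters in alphabets.items()
    }
    total = sum(counts.values())
    if total == 0:
        return {lang: 0.0 for lang in alphabets}
    return {lang: count / total for lang, count in counts.items()}

# Illustrative alphabets only; a real system would load these from storage.
alphabets = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "es": set("abcdefghijklmnñopqrstuvwxyzáéíóúü"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

russian = alphabet_scores("привет", alphabets)
ambiguous = alphabet_scores("hola", alphabets)
```

The second call illustrates the limitation noted above: "hola" scores equally for English and Spanish because the two alphabets overlap, which is one reason to combine this method with others.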
In general, the script-based language detection method determines the character counts for each possible script (e.g., Latin script, CJK script, etc.) that is present in the message. The script-based method relies on the observation that different languages may use different scripts, e.g., Chinese and English. The method preferably uses a mapping that maps a script to a list of languages that use the script. For example, the mapping may consider the Unicode values for the characters or symbols present in the message, and these Unicode values may be mapped to a corresponding language or set of possible languages for the message. The language scripts and Unicode values or ranges may be stored in the scripts database 28.
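One possible sketch of the script-based method follows. It approximates a character's script by the first word of its Unicode character name, which is a heuristic chosen for brevity rather than the mapping described above, and the `SCRIPT_TO_LANGUAGES` table is a hypothetical, partial example.

```python
import unicodedata
from collections import Counter

# Hypothetical, partial script-to-language mapping (codes follow this document's usage).
SCRIPT_TO_LANGUAGES = {
    "LATIN": ["en", "fr", "es"],
    "CJK": ["cn", "ja"],
    "CYRILLIC": ["ru"],
    "HANGUL": ["ko"],
}

def script_counts(message):
    """Count alphabetic characters per script, using the Unicode name prefix."""
    counts = Counter()
    for ch in message:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

def candidate_languages(message):
    """Return the languages mapped to the most frequent script in the message."""
    counts = script_counts(message)
    if not counts:
        return []
    script, _ = counts.most_common(1)[0]
    return SCRIPT_TO_LANGUAGES.get(script, [])

mixed = script_counts("你好吗 ok")
russian = candidate_languages("привет")
```

A production mapping would more likely rely on explicit Unicode ranges or script properties, as suggested above.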
The user language profile based method uses the user profiles database 30, which stores historical messages sent by various users. The languages of these stored messages are detected using, for example, one or more other language detection methods described herein (e.g., the byte n-gram method), to identify the language(s) used by each user. For example, if all of a user's prior messages are in Spanish, the language profile for that user may indicate the user's preferred language is Spanish. Likewise, if a user's prior messages are in a mixture of different languages, the language profile for the user may indicate probabilities associated with the different languages (e.g., 80% English, 15% French, and 5% Spanish). In general, the user language profile based method addresses language detection issues associated with very short messages, which often do not have enough information in them to make an accurate language determination. In such an instance, the language preference of a user can be used to predict the language(s) in the user's messages, by assuming the user will continue to use the language(s) he or she has used previously.
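Building a user language profile from previously detected message languages may be sketched as follows. The function name and the toy history are hypothetical; the history is assumed to hold one detected language code per prior message.

```python
from collections import Counter

def build_language_profile(detected_languages):
    """Turn per-message language labels into a probability per language."""
    counts = Counter(detected_languages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in counts.items()}

# Twenty prior messages: 16 English, 3 French, 1 Spanish.
history = ["en"] * 16 + ["fr"] * 3 + ["es"]
profile = build_language_profile(history)
```

For this history the profile corresponds to the 80% English, 15% French, and 5% Spanish example given above.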
The output from the various language detection methods in the detection method module 16 may be combined using the classifier module 18.
The interpolation module 802 is used to perform a linear interpolation of the results from two or more language detection methods. For example, the language of a text message may be determined by interpolating between results from the byte n-gram method and the dictionary-based method. For the chat message “lol gtg,” the byte n-gram method may determine the likelihood of English is 0.3, the likelihood of French is 0.4, and the likelihood of Polish is 0.3 (e.g., the output from the byte n-gram method may be {en:0.3, fr:0.4, pl:0.3}). The dictionary-based method may determine the likelihood of English is 0.1, the likelihood of French is 0.2, and the likelihood of Polish is 0.7 (e.g., the output may be {en:0.1, fr:0.2, pl:0.7}). To interpolate between the results of these two methods, the output from the byte n-gram is multiplied by a first weight and the output from the dictionary-based method is multiplied by a second weight, such that the first and second weights add to one. The weighted outputs from the two methods are then added together. For example, if the byte n-gram results are given a weight of 0.6, then the dictionary-based results are given a weight of 0.4, and the interpolation between the two methods is: {en:0.3, fr:0.4, pl:0.3}*0.6+{en:0.1, fr:0.2, pl:0.7}*0.4={en:0.22, fr:0.32, pl:0.46}.
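The interpolation in the example above can be reproduced with a short sketch. The function name `interpolate` is hypothetical; the score sets and weights are those of the "lol gtg" example.

```python
def interpolate(score_sets, weights):
    """Linearly interpolate several {language: score} sets; weights must sum to one."""
    if len(score_sets) != len(weights):
        raise ValueError("one weight is required per score set")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    languages = set().union(*score_sets)
    return {
        lang: sum(w * s.get(lang, 0.0) for s, w in zip(score_sets, weights))
        for lang in languages
    }

ngram = {"en": 0.3, "fr": 0.4, "pl": 0.3}
dictionary = {"en": 0.1, "fr": 0.2, "pl": 0.7}

combined = interpolate([ngram, dictionary], [0.6, 0.4])
best = max(combined, key=combined.get)
```

Up to floating-point rounding, `combined` matches {en:0.22, fr:0.32, pl:0.46}, so the most likely language in this example is Polish.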
In general, the optimal weights for interpolating between two or more values may be determined numerically through trial and error. Different weights can be tried to identify the best set of weights for a given set of messages. In some instances, the weights may be a function of the number of words or characters in the message. Alternatively or additionally, the weights may depend on the linguistic domain of the message. For example, the optimal weights for a gaming environment may be different than the optimal weights for a sports environment. For a combination of the byte n-gram method and the dictionary-based method, good results may be obtained using a weight of 0.1 on the byte n-gram method and a weight of 0.9 on the dictionary-based method.
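The trial-and-error search for interpolation weights may be sketched as a simple grid search over labeled messages. The three labeled examples below are invented for illustration, and the step size of 0.1 is an assumption.

```python
def accuracy(weight_ngram, labeled):
    """Fraction of messages whose interpolated top language matches the truth."""
    correct = 0
    for ngram, dictionary, truth in labeled:
        combined = {
            lang: weight_ngram * ngram[lang] + (1 - weight_ngram) * dictionary[lang]
            for lang in ngram
        }
        if max(combined, key=combined.get) == truth:
            correct += 1
    return correct / len(labeled)

def search_weight(labeled, steps=10):
    """Try evenly spaced n-gram weights in [0, 1]; return the first best one."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda w: accuracy(w, labeled))

# Each entry: (byte n-gram scores, dictionary scores, known language).
labeled = [
    ({"en": 0.3, "fr": 0.4, "pl": 0.3}, {"en": 0.1, "fr": 0.2, "pl": 0.7}, "pl"),
    ({"en": 0.6, "fr": 0.3, "pl": 0.1}, {"en": 0.5, "fr": 0.4, "pl": 0.1}, "en"),
    ({"en": 0.8, "fr": 0.1, "pl": 0.1}, {"en": 0.3, "fr": 0.4, "pl": 0.3}, "en"),
]

best_weight = search_weight(labeled)
best_accuracy = accuracy(best_weight, labeled)
```

The same search could be repeated per linguistic domain or per message-length bucket to obtain weights that depend on those factors.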
The SVM module 804 may be or include a supervised learning model that analyzes language data and recognizes language patterns. The SVM module 804 may be a multi-class SVM classifier, for example. For an English SVM classifier, the feature vector may be the concatenation of the two distributions above (i.e., {en:0.3, fr:0.4, pl:0.3, en:0.1, fr:0.2, pl:0.7}). The SVM classifier is preferably trained on labeled training data. The trained model acts as a predictor for an input. The features selected in the case of language detection may be, for example, sequences of bytes, words, or phrases. Input training vectors may be mapped into a multi-dimensional space. The SVM algorithm may then use kernels to identify the optimal separating hyperplane in this space, which gives the algorithm the ability to distinguish between languages. The kernel may be, for example, a linear kernel, a polynomial kernel, or a radial basis function (RBF) kernel. A preferred kernel for the SVM classifier is the RBF kernel. After training the SVM classifier using training data, the classifier may be used to output a best language among all the possible languages.
The training data may be or include, for example, the output vectors from different language detection methods and an indication of the correct language, for a large number of messages having, for example, different message lengths, linguistic domains, and/or languages. The training data may include a large number of messages for which the language in each message is known.
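Assembling classifier training data from detection method output may be sketched as follows. The fixed language order in `LANGUAGES` and the helper names are assumptions; the actual training of an SVM is not shown, since it would depend on the particular library used.

```python
LANGUAGES = ["en", "fr", "pl"]  # fixed feature order (assumed)

def feature_vector(score_sets, languages=LANGUAGES):
    """Concatenate several {language: score} sets into one flat feature list."""
    return [scores.get(lang, 0.0) for scores in score_sets for lang in languages]

def training_row(score_sets, correct_language):
    """Pair the concatenated features with the known language of the message."""
    return feature_vector(score_sets), correct_language

ngram = {"en": 0.3, "fr": 0.4, "pl": 0.3}
dictionary = {"en": 0.1, "fr": 0.2, "pl": 0.7}

features, label = training_row([ngram, dictionary], "pl")
```

Here `features` is the concatenation of the two distributions from the interpolation example, and many such rows, covering different message lengths, domains, and languages, would form the training set.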
The linear SVM module 806 may be or include a large-scale linear classifier. An SVM classifier with a linear kernel may perform better than other linear classifiers, such as linear regression. The linear SVM module 806 differs from the SVM module 804 at the kernel level. There are some cases when a polynomial model works better than a linear model, and vice versa. The optimal kernel may depend on the linguistic domain of the message data and/or the nature of the data.
Other possible classifiers used by the systems and methods described herein include, for example, decision tree learning, association rule learning, artificial neural networks, inductive logic programming, random forests, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, and sparse dictionary learning. One or more of these classifiers, or other classifiers, may be incorporated into and/or form part of the classifier module 18.
In the detection method module 16, one or more language detection methods are used (step 904) to detect a language in the message. Each method used by the detection method module 16 may output a prediction regarding the language present in the message. The prediction may be in the form of a vector that includes a probability for each possible language that may be in the message.
The output from the detection method module 16 is then delivered to the classifier module 18 where the results from two or more language detection methods may be combined (step 906). Various combinations of the results from the language detection methods may be obtained. In one example, the results from the byte n-gram method and the dictionary-based method are combined in the classifier module 18 by interpolation. In another example, an SVM combination or classification is performed on the results from the byte n-gram method, the dictionary-based method, the alphabet method, and the user profile method. Alternatively or additionally, the combination may include or consider results from the script-based method. A further example includes a large linear combination of the byte n-gram method, the language profile method, and the dictionary method. In general, however, the results from any two or more of the language detection methods may be combined in the classifier module 18.
The method 900 uses the manager module 20 to select output (step 908) from a particular classifier. The output may be selected based on, for example, a confidence score computed by a classifier, an expected language detection accuracy, and/or a linguistic domain for the message. A best language is then chosen (step 910) from the selected classifier output.
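The selection step may be sketched as follows. The `ClassifierOutput` record, the `domain_preferences` table, and the rule that a domain preference takes priority over the confidence score are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ClassifierOutput:
    classifier: str    # name of the classifier that produced the output
    language: str      # indicated language
    confidence: float  # confidence score reported by the classifier

def select_output(outputs, domain=None, domain_preferences=None):
    """Prefer the classifier mapped to the domain; otherwise take the most confident."""
    if domain and domain_preferences and domain in domain_preferences:
        preferred = domain_preferences[domain]
        for output in outputs:
            if output.classifier == preferred:
                return output
    return max(outputs, key=lambda output: output.confidence)

outputs = [
    ClassifierOutput("interpolation", "pl", 0.46),
    ClassifierOutput("svm", "en", 0.81),
    ClassifierOutput("linear_svm", "en", 0.64),
]
# Hypothetical table of which classifier tends to work best per domain.
preferences = {"gaming": "interpolation", "sports": "linear_svm"}

by_confidence = select_output(outputs)
by_domain = select_output(outputs, domain="gaming", domain_preferences=preferences)
```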
In some instances, the systems and methods described herein choose the language detection method(s) according to the length of the message.
Otherwise, if the candidate language is not a language with a unique alphabet and/or script, then the length of the text message is evaluated. If the message length is less than a threshold length (e.g., four bytes or four characters) and the text message includes or is accompanied by a keyboard language used by the client device (step 1110), then the language of the message is chosen (step 1112) to be the keyboard language.
Alternatively, if the message length is greater than the threshold length or the keyboard language is not available, then the message is processed with an n-gram method (e.g., the byte n-gram method) to identify (step 1114) a first set of possible languages for the text message. The message is also then processed with the dictionary-based method to identify (step 1116) a second set of possible languages for the text message. If a user language profile exists for the user (step 1118), then the first set of possible languages, the second set of possible languages, and the user language profile 1120 are combined (e.g., using an SVM classifier or a large linear classifier) to obtain a first combination of possible languages (step 1122). The language of the text message is then chosen (step 1124), based on the first combination of possible languages. Otherwise, if the user language profile is not available, then the first set of possible languages and the second set of possible languages are combined (e.g., using a linear interpolator or other classifier) to obtain a second combination of possible languages (step 1126). Finally, the language of the text message is chosen (step 1128), based on the second combination of possible languages.
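The length-dependent flow of steps 1110 through 1128 may be sketched as follows. The sketch begins after the unique alphabet or script check, measures length in characters, and uses a plain average as a stand-in for the SVM, linear classifier, or interpolator named above; all function and parameter names are hypothetical.

```python
def average(*score_sets):
    """Stand-in combiner: unweighted mean of several {language: score} sets."""
    languages = set().union(*score_sets)
    return {
        lang: sum(s.get(lang, 0.0) for s in score_sets) / len(score_sets)
        for lang in languages
    }

def choose_language(message, ngram, dictionary, keyboard_language=None,
                    user_profile=None, threshold=4):
    """Pick a language using the keyboard, or a combination of detection methods."""
    # Steps 1110-1112: very short message with a known keyboard language.
    if len(message) < threshold and keyboard_language:
        return keyboard_language
    first = ngram(message)        # step 1114
    second = dictionary(message)  # step 1116
    if user_profile:              # steps 1118-1124
        combined = average(first, second, user_profile)
    else:                         # steps 1126-1128
        combined = average(first, second)
    return max(combined, key=combined.get)

# Hypothetical detection methods returning fixed scores.
def fake_ngram(message):
    return {"en": 0.3, "fr": 0.4, "pl": 0.3}

def fake_dictionary(message):
    return {"en": 0.1, "fr": 0.2, "pl": 0.7}

short = choose_language("hi", fake_ngram, fake_dictionary, keyboard_language="es")
no_profile = choose_language("lol gtg", fake_ngram, fake_dictionary)
with_profile = choose_language("lol gtg", fake_ngram, fake_dictionary,
                               user_profile={"en": 1.0})
```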
In some instances, language detection is performed by combining the output from multiple language detection methods in two or more steps. For example, a first step may use the alphabet-script based method to detect special languages that use their own unique alphabets or scripts, such as Chinese (cn), Japanese (ja), Korean (ko), Russian (ru), Hebrew (he), Greek (el), and Arabic (ar). If necessary, the second step may use a combination (e.g., from a classifier) of multiple detection methods (e.g., the byte n-gram method, the user language profile based method, and the dictionary-based method) to detect other languages (e.g., Latin languages) in the message.
In certain examples, the message provided or received for language detection includes certain digits, characters, or images (e.g., emoticons or emojis) that are not specific to any particular language and/or are recognizable to any user, regardless of language preference. The systems and methods described herein may ignore such characters or images when doing language detection and may ignore messages that include only such characters or images.
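Ignoring language-neutral content may be sketched by keeping only letters and whitespace, as identified by Unicode general category. This filtering rule and the function names are assumptions; text emoticons that contain letters (e.g., ":D") would need additional handling not shown here.

```python
import unicodedata

def language_bearing_text(message):
    """Keep letters and whitespace; drop digits, punctuation, symbols, and emoji."""
    kept = [
        ch for ch in message
        if unicodedata.category(ch).startswith("L") or ch.isspace()
    ]
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join("".join(kept).split())

def should_detect(message):
    """Return False for messages with no language-bearing characters."""
    return bool(language_bearing_text(message))

cleaned = language_bearing_text("gg 123 😀!!")
skip = should_detect("😀 42 :)")
```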
In the depicted example method 1200, the detection method module 16 includes ten different language detection methods. Three of the language detection methods in the detection method module 16 are Byte n-gram A 1206, Byte n-gram B 1208, and Byte n-gram C 1210, which are all byte n-gram methods and may be configured to detect a different set or number of languages. For example, Byte n-gram A 1206 may be configured to detect 97 languages, Byte n-gram B 1208 may be configured to detect 27 languages, and Byte n-gram C 1210 may be configured to detect 20 languages. Two of the language detection methods in the detection method module 16 are Dictionary A 1212 and Dictionary B 1214, which are both dictionary-based methods and may be configured to detect a different set or number of languages. For example, Dictionary A 1212 may be configured to detect 9 languages, and Dictionary B 1214 may be configured to detect 10 languages. Two of the language detection methods in the detection method module 16 are Language Profile A 1216 and Language Profile B 1218, which are user language profile methods and may be configured to detect a different set or number of languages. For example, Language Profile A 1216 may be configured to detect 20 languages, and Language Profile B 1218 may be configured to detect 27 languages. Two of the language detection methods in the detection method module 16 are Alphabet A 1220 and Alphabet B 1222, which are alphabet-based methods and may be configured to detect a different set or number of languages. For example, Alphabet A 1220 may be configured to detect 20 languages, and Alphabet B 1222 may be configured to detect 27 languages. The detection method module 16 also includes a script-based language detection method 1224.
Output from the different language detection methods in the detection method module 16 is combined and processed by the classifier module 18. For example, an interpolation classifier 1226 combines output from Byte n-gram B 1208 and Dictionary B 1214. Weights for the interpolation may be, for example, 0.1 for Byte n-gram B 1208 and 0.9 for Dictionary B 1214. The classifier module 18 may also use an SVM classifier 1228 that combines output from Byte n-gram C 1210, Dictionary B 1214, Language Profile B 1218, and Alphabet B 1222. The classifier module 18 may also use a first combination 1230 of the script-based method 1224 and an SVM classifier combination of Byte n-gram C 1210, Dictionary A 1212, Language Profile A 1216, and Alphabet A 1220. Additionally, the classifier module 18 may use a second combination 1232 of the script-based method 1224 and a Linear SVM classifier combination of Byte n-gram C 1210, Dictionary A 1212, and Language Profile A 1216.
For both the first combination 1230 and the second combination 1232, the script-based method 1224 and the classifier may be used in a tiered approach. For example, the script-based method 1224 may be used to quickly identify languages having unique scripts. When such a language is identified in the message 1204, use of the SVM classifier in the first combination 1230 or the Linear SVM classifier in the second combination may not be required.
In general, the manager module 20 may select specific language detection methods, classifiers, and/or combinations of detection method output to identify the language in the message 1204. The manager module 20 may make the selection according to the linguistic domain or according to an anticipated language for the message. The manager module 20 may select specific classifiers according to a confidence score determined by the classifiers. For example, the manager module 20 may select the output from the classifier that is the most confident in its prediction.
In certain implementations, the systems and methods described herein are suitable for making language detection available as a service to a plurality of users. Such a service is made possible and/or enhanced by the speed at which the systems and methods identify languages, and by the ability of the systems and methods to handle multiple identification techniques at runtime, based on service requests from diverse clients.
Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. For example, parallel processing may be used to perform multiple language detection methods simultaneously. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
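As a non-limiting sketch of the parallel processing mentioned above, several language detection methods can be submitted to a worker pool and their probability vectors collected once all have finished. The two detection functions below (`alphabet_test` and `dictionary_test`), their tiny word lists, and the helper `run_tests_in_parallel` are toy stand-ins assumed for illustration; they are not the detection methods of this specification.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Probabilities = Dict[str, float]


def alphabet_test(message: str) -> Probabilities:
    """Toy test: share of Cyrillic letters decides between 'ru' and 'en'."""
    letters = [ch for ch in message if ch.isalpha()]
    if not letters:
        return {"en": 0.5, "ru": 0.5}
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    p_ru = cyrillic / len(letters)
    return {"en": 1.0 - p_ru, "ru": p_ru}


def dictionary_test(message: str) -> Probabilities:
    """Toy test: count tokens found in small, made-up word lists."""
    vocab = {"en": {"hello", "the", "you"}, "ru": {"привет", "ты", "да"}}
    tokens = message.lower().split()
    counts = {lang: sum(tok in words for tok in tokens) for lang, words in vocab.items()}
    total = sum(counts.values())
    if total == 0:
        return {lang: 1.0 / len(vocab) for lang in vocab}
    return {lang: count / total for lang, count in counts.items()}


def run_tests_in_parallel(
    message: str, tests: Dict[str, Callable[[str], Probabilities]]
) -> Dict[str, Probabilities]:
    """Run every detection test concurrently and gather the results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        futures = {name: pool.submit(fn, message) for name, fn in tests.items()}
        return {name: future.result() for name, future in futures.items()}


TESTS = {"alphabet": alphabet_test, "dictionary": dictionary_test}
results = run_tests_in_parallel("привет hello", TESTS)
```

A thread pool is used here only for brevity. For CPU-bound detection methods written in pure Python, a process pool (e.g., `concurrent.futures.ProcessPoolExecutor`) may be the more suitable choice, because threads in the standard CPython interpreter do not execute Python bytecode simultaneously.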
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
This application is a continuation of and claims priority from U.S. patent application Ser. No. 14/517,183, filed Oct. 17, 2014, entitled “Systems and Methods for Language Detection.” The preceding patent application is hereby incorporated by reference in its entirety.
Number | Date | Country |
---|---|---|
20160267070 A1 | Sep 2016 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14517183 | Oct 2014 | US |
Child | 15161913 | | US |