The invention generally relates to the field of machine learning and more particularly to two-dimensional symbols for facilitating machine learning of written Chinese language using “pinyin” letters.
Written Chinese has been traced back to around 1000 BC in the form of ancient Chinese characters, which evolved over time into the modern Chinese characters (i.e., Hanzi in the Chinese “pinyin” system). Chinese characters are logosyllabic; that is, a character generally represents one syllable of spoken Chinese and may be a word on its own or part of a polysyllabic word. The characters themselves are often composed of parts that may represent physical objects, abstract notions, or pronunciation. Literacy requires the memorization of a great many characters (e.g., about three to four thousand characters). The large number of Chinese characters has in part led to the adoption of the Latin alphabet as an auxiliary means of representing Chinese (i.e., the Chinese “pinyin” system). Standardization of the Chinese character set has also been evolving over the past decades. The latest standard is referred to as GB18030, which is a Chinese government standard titled “Information technology—Chinese coded character set” for defining the entire Chinese character set. GB18030 defines the required language and character support for software.
Traditionally, written Chinese has been learned and mastered with rote learning techniques such as memorization with repetition. Students generally learn the written Chinese language from individual characters, to compound phrases, idioms, proverbs, sentences, poems, paragraphs, articles (i.e., written works), etc.
Machine learning is an application of artificial intelligence. In machine learning, a computer or computing device is programmed to think like human beings so that the computer may be taught to learn on its own. The development of neural networks has been key to teaching computers to think and understand the world in the way human beings do. One particular implementation is referred to as a Cellular Neural Networks or Cellular Nonlinear Networks (CNN) based computing system. CNN based computing systems have been used in many different fields and problems including, but not limited to, image processing.
This section is for the purpose of summarizing some aspects of the invention and to briefly introduce some preferred embodiments. Simplifications or omissions in this section as well as in the abstract and the title herein may be made to avoid obscuring the purpose of the section. Such simplifications or omissions are not intended to limit the scope of the invention.
Two-dimensional symbols for facilitating machine learning of written Chinese language are disclosed. According to one aspect of the invention, a two-dimensional symbol comprises a matrix of N×N pixels of data containing a “super-character” that represents a specific form and meaning of written Chinese language. Each pixel contains a K-bit binary number for representing a Chinese “pinyin” letter. The matrix is partitioned into a number of sections, each sized to store an identical training set of at least Y Chinese characters in a specific order maintained by a Cellular Neural Networks or Cellular Nonlinear Networks (CNN) based computing system. As a result, a first section contains the first P rows of the matrix, while the remaining sections contain the respective subsequent P rows of the matrix. N, K, P and Y are positive integers. Each pixel is either “on” or “off”. A particular Chinese character is recognized out of the training set in each section when the corresponding consecutive pixels are “on”.
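The relationship between N, P and Y described above may be illustrated with a minimal sketch. It assumes, purely for illustration, that each character of the training set occupies one pixel slot within a section; the function names `section_layout` and `section_rows` are hypothetical and do not appear in the disclosure.

```python
import math


def section_layout(n: int, y: int) -> tuple[int, int]:
    """Return (P, section_count) for an n x n matrix.

    Assumption (not stated in the disclosure): one pixel slot per
    training-set character, so a section needs at least y pixels.
    """
    if n <= 0 or y <= 0:
        raise ValueError("n and y must be positive integers")
    p = math.ceil(y / n)          # fewest whole rows that hold y slots
    if p > n:
        raise ValueError("training set does not fit in the matrix")
    return p, n // p              # whole sections that fit in n rows


def section_rows(index: int, p: int) -> range:
    """Rows of the matrix covered by the section with the given 0-based index."""
    return range(index * p, (index + 1) * p)


# Example sizes taken from the description: 224 x 224 pixels, 1000 characters.
p, count = section_layout(224, 1000)
print(p, count)                   # 5 44
print(list(section_rows(1, p)))   # [5, 6, 7, 8, 9]
```

Under this assumption, a 224×224 matrix with a training set of 1000 characters yields sections of P = 5 rows, with 44 whole sections and four rows left unused. Other slot sizes would give a different P.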
The “super-character” represents specific form and meaning of written Chinese language including, for example, compounded phrases, idioms, proverbs, poems, passages, sentences, articles (i.e., written works), etc.
One of the objectives, features and advantages of the invention is to use a two-dimensional symbol for representing more than an individual ideogram, logosyllabic script or character (e.g., a Chinese character). Such a two-dimensional symbol helps a CNN based computing system learn the meaning of a specific combination of a plurality of Chinese characters contained in a “super-character” using image processing techniques, e.g., convolutional neural networks, recurrent neural networks, etc.
Other objects, features, and advantages of the invention will become apparent upon examining the following detailed description of an embodiment thereof, taken in conjunction with the attached drawings.
These and other features, aspects, and advantages of the invention will be better understood with regard to the following description, appended claims, and accompanying drawings as follows:
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will become obvious to those skilled in the art that the invention may be practiced without these specific details. The descriptions and representations herein are the common means used by those experienced or skilled in the art to most effectively convey the substance of their work to others skilled in the art. In other instances, well-known methods, procedures, and components have not been described in detail to avoid unnecessarily obscuring aspects of the invention.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the order of blocks in process flowcharts or diagrams or circuits representing one or more embodiments of the invention does not inherently indicate any particular order, nor does it imply any limitations in the invention. As used herein, the terms “vertical”, “horizontal”, “left”, “right”, “upper”, “lower”, “column”, “row” are intended to provide relative positions for the purposes of description, and are not intended to designate an absolute frame of reference.
Embodiments of the invention are discussed herein with reference to
Referring first to
The Chinese “pinyin” system uses Latin letters to represent pronunciation sounds of Chinese characters.
Each pixel of data 202 can be shown or displayed with a specific color or grayscale. Each pixel of data 202 can also be turned either “on” or “off”.
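One possible way to store a “pinyin” letter as a K-bit pixel value may be sketched as follows. The sketch assumes that only the 26 unaccented Latin letters are used (tone marks and “ü” are ignored) and that the value 0 is reserved for an “off” pixel. The disclosure prescribes neither choice, and the names `LETTERS`, `K`, `encode_letter`, `decode_value` and `encode_syllable` are hypothetical.

```python
import math
import string

# Assumption: 26 unaccented lowercase letters; tone marks are not encoded.
LETTERS = string.ascii_lowercase

# Assumption: value 0 means "off", so 27 distinct values are needed.
K = math.ceil(math.log2(len(LETTERS) + 1))   # 5 bits


def encode_letter(ch: str) -> int:
    """Map a letter to a nonzero K-bit value (a -> 1, ..., z -> 26)."""
    return LETTERS.index(ch) + 1


def decode_value(value: int):
    """Map a pixel value back to its letter, or None for an "off" pixel."""
    return None if value == 0 else LETTERS[value - 1]


def encode_syllable(syllable: str) -> list[int]:
    """Encode a syllable as consecutive pixel values, one per letter."""
    return [encode_letter(ch) for ch in syllable]


pixels = encode_syllable("xue")
print(K)                                     # 5
print(pixels)                                # [24, 21, 5]
print([format(v, f"0{K}b") for v in pixels])  # ['11000', '10101', '00101']
print("".join(decode_value(v) for v in pixels))  # xue
```

With these assumptions K = 5 suffices; a scheme that also encodes tones or “ü” would need a larger alphabet and possibly a larger K.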
For facilitating machine learning,
Before the contents are recognized, the two-dimensional symbol 400 is simply a matrix of N×N pixels with certain pixels “on” and others “off”. The “super-character” contains at least two Chinese characters that represent a specific form and meaning of written Chinese language including, but not necessarily limited to, compound phrases, idioms, proverbs, passages, sentences, poems. In another embodiment, when there is only one character in a two-dimensional symbol, the “super-character” contains one Chinese character.
In
All of the recognized Chinese characters in the two-dimensional symbol 400 together represent a specific meaning (i.e., “xue”, “xi”, “zhong” and “wen”, which means learning Chinese language) instead of a group of unrelated Chinese characters. In one embodiment, the specific meaning includes, but is not limited to, compound phrases, idioms, proverbs, etc. These recognized Chinese characters may not necessarily be in any particular order. In other words, the order of the recognized Chinese characters in each two-dimensional symbol 400 is arbitrary.
The “super-character” may contain more than one meaning in certain instances. A “super-character” can tolerate certain errors that can be corrected with error-correction techniques. In other words, the pixels representing Chinese “pinyin” letters do not have to be exact. The errors may have different causes, for example, data corruption, errors during data retrieval, etc.
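Because the order of the recognized characters is arbitrary, a lookup of the combined meaning may compare the characters as an unordered collection. The following is a minimal sketch under that interpretation. The table `KNOWN_PHRASES` and the function `match_phrase` are hypothetical illustrations and are not part of the disclosed CNN based computing system.

```python
from collections import Counter

# Hypothetical lookup table: meaning -> pinyin syllables of its characters.
KNOWN_PHRASES = {
    "learning Chinese language": ("xue", "xi", "zhong", "wen"),
}


def match_phrase(recognized):
    """Return the meaning whose characters match, ignoring their order."""
    bag = Counter(recognized)     # multiset, so repeated characters count
    for meaning, syllables in KNOWN_PHRASES.items():
        if Counter(syllables) == bag:
            return meaning
    return None                   # no known combination matched


# The same four characters in a different order yield the same meaning.
print(match_phrase(["zhong", "wen", "xue", "xi"]))  # learning Chinese language
print(match_phrase(["xue", "xi"]))                  # None
```

A multiset is used rather than a plain set so that a phrase containing the same character twice is not confused with one containing it once.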
The training set can be initially established using many techniques, for example, inputted manually or generated with a default setting. An example set 510 is shown in
“Super-character” is extracted out of the matrix (e.g., the example two-dimensional symbol 400 of
Referring now to
The CNN based computing system 600 may be implemented on integrated circuits as a digital semi-conductor chip (e.g., a silicon substrate) and contains a controller 610, and a plurality of CNN processing units 602a-602b operatively coupled to at least one input/output (I/O) data bus 620. Controller 610 is configured to control various operations of the CNN processing units 602a-602b, which are connected in a loop with a clock-skew circuit.
In one embodiment, each of the CNN processing units 602a-602b is configured for processing imagery data (e.g., the example two-dimensional symbol 400 of
In another embodiment, the CNN based computing system is a digital integrated circuit that can be extendable and scalable. For example, multiple copies of the digital integrated circuit may be implemented on a single semi-conductor chip.
Although the invention has been described with reference to specific embodiments thereof, these embodiments are merely illustrative of, and not restrictive of, the invention. Various modifications or changes to the specifically disclosed example embodiments will be suggested to persons skilled in the art. For example, whereas the two-dimensional symbol has been described and shown with a specific example of a matrix of 224×224 pixels, other sizes may be used for achieving substantially similar objectives of the invention. Additionally, whereas 1000 Chinese characters in a training set have been shown and described, other numbers of Chinese characters may be used for achieving the same. Furthermore, the Chinese “pinyin” letters shown in the examples are arbitrarily selected; other “pinyin” letters may be used for achieving objectives of the invention. In summary, the scope of the invention should not be restricted to the specific example embodiments disclosed herein, and all modifications that are readily suggested to those of ordinary skill in the art should be included within the spirit and purview of this application and scope of the appended claims.
This application claims priority from a co-pending U.S. Provisional Patent Application Ser. No. 62/541,081, entitled “Two-dimensional Symbol For Facilitating Machine Learning Of Natural Languages Having Logosyllabic Characters” filed on Aug. 3, 2017, the contents of which are incorporated by reference in their entirety for all purposes.
Number | Date | Country
---|---|---
62541081 | Aug 2017 | US