1. Technical Field
The embodiments of the present disclosure relate to classification systems and methods, and particularly to an electronic device and a document classification method of the electronic device.
2. Description of Related Art
Documents can be classified into different categories according to a certain attribute of the subject matters of the documents. For example, LCD patent documents can be classified into a wide view category and a transflective/reflective category according to the technical field of the subject matters of the documents. However, in some cases it would be desirable to classify the documents into different categories according to different attributes of the subject matters.
The disclosure is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean “at least one.”
In general, the word “module”, as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions, written in a programming language, such as Java, C, or assembly. One or more software instructions in the modules may be embedded in firmware, such as in an erasable programmable read only memory (EPROM). The modules described herein may be implemented as software and/or hardware modules and may be stored in any type of non-transitory computer-readable medium or other storage device. Some non-limiting examples of non-transitory computer-readable media include CDs, DVDs, BLU-RAY, flash memory, and hard disk drives.
The electronic device 2 may further include a storage system 23 and at least one processor 25. The storage system 23 can be a dedicated memory, such as EPROM, a hard disk drive (HDD), or a flash memory. In some embodiments, the storage system 23 can also be an external storage device, such as an external hard disk, a storage card, or other data storage medium. The at least one processor 25 can be a central processing unit (CPU), a microprocessor, or other suitable data processor chip that performs various functions of the electronic device 2.
At block 31, the receipt module 240 receives a plurality of documents to be classified and receives categorical descriptions of the documents. The documents may be obtained from the storage system 23 according to keywords input by the user. For example, the user inputs keywords “liquid crystal display (LCD)” and “patent” and obtains a plurality of LCD patent documents. The categorical descriptions specify one or more attributes of the subject matters of the documents, according to which the documents may be classified. The categorical descriptions further specify various categories corresponding to each attribute. Each category may include several sub-categories.
At block 32, the extraction module 241 extracts core terms of the documents and core terms of the categorical descriptions. The extraction module 241 may divide each document into different blocks and extract the core terms of the documents from the blocks. For example, for a patent document, each of the parts (for example, title, abstract, detailed description, and claims) of the patent document is regarded as a single block. The core terms may be extracted using a natural language processing method, such as a term frequency-inverse document frequency method. In one embodiment, the extraction module 241 may set a weight for each core term of a document. The weight may be adjusted according to a position of the core term in the document. For example, for a patent document, if a core term is extracted from the abstract, the weight for the core term is adjusted to a larger value.
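The extraction of block 32 can be illustrated with a minimal sketch, assuming whitespace tokenization, a smoothed inverse document frequency, and position weights applied before scoring. The function name `extract_core_terms`, the `BLOCK_WEIGHTS` values, and the sample texts are hypothetical and are not taken from the disclosure.

```python
import math
from collections import Counter

# Hypothetical position weights: terms found in the abstract count for more.
BLOCK_WEIGHTS = {"title": 1.5, "abstract": 1.5, "claims": 1.0, "detailed description": 1.0}


def extract_core_terms(documents, top_n=3):
    """Return, per document, the top_n (term, score) pairs by weighted TF-IDF.

    Each document is a dict mapping a block name to the text of that block.
    """
    # Position-weighted term counts, one Counter per document.
    weighted_counts = []
    for blocks in documents:
        counts = Counter()
        for block_name, text in blocks.items():
            for token in text.lower().split():
                counts[token] += BLOCK_WEIGHTS.get(block_name, 1.0)
        weighted_counts.append(counts)

    # Number of documents in which each term occurs.
    document_frequency = Counter()
    for counts in weighted_counts:
        document_frequency.update(counts.keys())

    n_docs = len(documents)
    core_terms = []
    for counts in weighted_counts:
        total = sum(counts.values())
        scores = {
            term: (weight / total)
            * (math.log((1 + n_docs) / (1 + document_frequency[term])) + 1.0)
            for term, weight in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        core_terms.append(ranked[:top_n])
    return core_terms


documents = [
    {"abstract": "wide viewing angle", "detailed description": "liquid crystal panel"},
    {"abstract": "reflective electrode", "detailed description": "liquid crystal panel"},
]
core = extract_core_terms(documents)
```

In this sketch the terms shared by both documents score lower than the terms unique to one abstract, so the abstract terms are the ones retained as core terms.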
At block 33, the processing module 242 constructs a term-document matrix of the documents according to the core terms of the documents, and performs a dimension reduction operation on the term-document matrix to obtain a concept matrix of the documents in a concept space.
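The disclosure does not name the dimension reduction operation used at block 33. The following sketch assumes a truncated singular value decomposition, as commonly used in latent semantic analysis, approximated here by power iteration with deflation so that the example needs only the standard library. All function names and the sample term weights are hypothetical.

```python
import math
import random


def term_document_matrix(doc_terms):
    """Build a matrix whose rows are terms and whose columns are documents."""
    vocab = sorted({term for terms in doc_terms for term in terms})
    matrix = [[float(terms.get(term, 0.0)) for terms in doc_terms] for term in vocab]
    return vocab, matrix


def truncated_svd(matrix, k, iterations=300, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v)."""
    rows, cols = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]
    rng = random.Random(seed)
    triplets = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(cols)]
        for _ in range(iterations):
            # One power step on (A^T A): u = A v, then w = A^T u.
            u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
            w = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma < 1e-12:
            break  # nothing left to extract
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found.
        for i in range(rows):
            for j in range(cols):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets


def concept_matrix(triplets, n_docs):
    """Rows are concepts; column j is document j in the concept space."""
    return [[sigma * v[j] for j in range(n_docs)] for sigma, _, v in triplets]


# Hypothetical weighted core terms for three documents.
doc_terms = [
    {"viewing": 2, "angle": 1},
    {"viewing": 1, "angle": 2},
    {"reflective": 4},
]
vocab, matrix = term_document_matrix(doc_terms)
triplets = truncated_svd(matrix, k=2)
concepts = concept_matrix(triplets, n_docs=len(doc_terms))
```

In this sketch the first two documents, which share the same terms, end up with the same coordinates in the two-dimensional concept space, while the third document occupies a separate concept axis.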
At block 34, the processing module 242 determines a vector of each category specified in the categorical descriptions in the concept space according to the core terms of the categorical descriptions, and determines a vector of each document in the concept space from the concept matrix. The vector of each category and the vector of each document may be concept vectors. In one embodiment, the processing module 242 determines an overall vector of all categories specified in the categorical descriptions in the concept space according to all the core terms of the categorical descriptions. The processing module 242 parses the overall vector to obtain a vector corresponding to each attribute specified in the categorical descriptions, and parses the vector corresponding to each attribute to obtain the vector of each category.
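One possible way to place a category in the concept space is sketched below, assuming that a category vector is obtained by projecting the weighted core terms of its description onto the term axes produced by the dimension reduction. The function `project_terms`, the `term_basis` values, and the category core terms are hypothetical, and the parsing of an overall vector into per-attribute vectors is not modeled.

```python
def project_terms(term_weights, vocab, term_basis):
    """Project weighted terms onto the concept axes.

    term_basis[c][i] is the loading of vocab[i] on concept axis c.
    """
    query = [float(term_weights.get(term, 0.0)) for term in vocab]
    return [sum(b * q for b, q in zip(axis, query)) for axis in term_basis]


vocab = ["angle", "reflective", "viewing"]
# Hypothetical orthonormal term axes for a two-dimensional concept space.
term_basis = [
    [0.6, 0.0, 0.8],
    [0.0, 1.0, 0.0],
]
# Hypothetical core terms extracted from two categorical descriptions.
category_terms = {
    "wide view": {"viewing": 1, "angle": 1},
    "transflective/reflective": {"reflective": 1},
}
category_vectors = {
    name: project_terms(terms, vocab, term_basis)
    for name, terms in category_terms.items()
}
```

Projecting the documents with the same term axes keeps category vectors and document vectors in one coordinate system, which is what makes the later similarity comparison meaningful.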
At block 35, for each document, the classification module 243 classifies the document into one or more categories according to a similarity between the vector of each category specified in the categorical descriptions and the vector of the document. Further details of block 35 are described below in reference to
At block 36, the output module 244 outputs the one or more categories of each document on the display device 20 when all the documents have been classified.
At block 41, the classification module 243 selects an attribute specified in the categorical descriptions. In one example with respect to
At block 42, the classification module 243 selects a category corresponding to the selected attribute specified in the categorical descriptions. In one example, the classification module 243 selects the wide view category corresponding to the selected attribute of technical field. In another example, the classification module 243 selects the FFS sub-category corresponding to the selected attribute of technical field.
At block 43, the classification module 243 calculates a similarity between the vector of the selected category and the vector of the document. In one embodiment, the similarity is a cosine value of an angle between the vector of the selected category and the vector of the document. The smaller the angle between the two vectors, the larger the cosine value of the angle, and the greater the similarity between the two vectors.
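The cosine similarity of block 43 can be sketched as follows. The function name `cosine_similarity` and the convention of returning 0.0 for a zero-length vector are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
forty_five_degrees = cosine_similarity([1.0, 1.0], [1.0, 0.0])
```

Because the value depends only on the angle, a long document and a short categorical description can still be compared directly.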
At block 44, the classification module 243 determines whether the similarity between the vector of the selected category and the vector of the document is greater than a preset value α, for example, α=0.8.
If the similarity is greater than the preset value, at block 45, the classification module 243 classifies the document into the selected category.
If the similarity is less than or equal to the preset value, at block 46, the classification module 243 does not classify the document into the selected category.
At block 47, the classification module 243 determines whether there are any other categories corresponding to the selected attribute which have not been selected. If there are corresponding but unselected other categories, the flow returns to block 42.
If there are no other corresponding but unselected categories, at block 48, the classification module 243 determines whether there are any other attributes that have not been selected. If there are other unselected attributes, the flow returns to block 41. If all attributes have been selected, the flow ends.
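The flow of blocks 41 to 48 can be sketched as two nested loops, assuming the category vectors are held in a nested mapping from attribute to category to vector. The function names, the second attribute, and all vector values are hypothetical; only the threshold comparison against α follows the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_document(doc_vector, category_vectors, alpha=0.8):
    """Return {attribute: [categories]} for every category whose similarity exceeds alpha."""
    assigned = {}
    for attribute, categories in category_vectors.items():      # blocks 41 and 48
        for category, vector in categories.items():             # blocks 42 and 47
            similarity = cosine_similarity(vector, doc_vector)  # block 43
            if similarity > alpha:                               # block 44
                assigned.setdefault(attribute, []).append(category)  # block 45
            # Otherwise the document is not placed in this category (block 46).
    return assigned


# Hypothetical concept-space vectors.
category_vectors = {
    "technical field": {
        "wide view": [1.0, 0.0],
        "transflective/reflective": [0.0, 1.0],
    },
    "hypothetical attribute": {
        "hypothetical category": [1.0, 1.0],
    },
}
result = classify_document([0.9, 0.1], category_vectors, alpha=0.8)
```

Because every category is tested independently, a document may be assigned to several categories under one attribute, or to none, which matches the one-or-more classification described at block 35.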
Although certain disclosed embodiments of the present disclosure have been specifically described, the present disclosure is not to be construed as being limited thereto. Various changes or modifications may be made to the present disclosure without departing from the scope and spirit of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
201310169201X | May 2013 | CN | national |