This application claims the priority benefit of Taiwan application Ser. No. 93131521, filed on Oct. 18, 2004.
1. Field of Invention
The present invention relates to a method for analyzing documents. More particularly, the present invention relates to a method for analyzing and classifying electronic documents.
2. Description of Related Art
In a highly competitive industrial environment, businesses seeking to increase and maintain their research potential not only invest money in research projects but also enhance the value of intangible assets such as knowledge documents, patents, trademarks and copyrights. Businesses have therefore begun to take seriously the management of knowledge related to their operations. Moreover, with the rapid development of information technology and network transmission technology, electronic technology removes the barriers of time and space to accessing knowledge and information, so that any kind of information can be obtained rapidly. Electronic documents, which are easy to manage, transmit and store, are therefore gradually replacing conventional document storage media such as books and paper.
The primary purpose of a knowledge document is to transmit information. Hence, a knowledge document should possess a structure that allows the reader to understand it easily. The primary purpose of electronic document management is to understand the basic data definitions for later analysis. The first step in managing electronic documents is to differentiate the types of the documents. Tyrvainen et al. provide an electronic document management system for analyzing and classifying internal business documents (Tyrvainen and Paivarinta, 1999).
In summary, the conventional analysis method requires the document categories to be defined in advance, and it cannot be certain whether those definitions fully meet the classification requirements. Nor is it certain how detailed the categories should be, or whether some specific categories need to be defined at all. Moreover, within some categories the technical contents of the classified documents differ considerably from one another, so the classification fails to let a reader consult and fully understand a technology from the smallest possible set of documents. Additionally, personal subjective factors sometimes influence the classification result, and because there are no uniform and rigorous standards, large classification discrepancies arise when results are compared.
Accordingly, at least one objective of the present invention is to provide a method for analyzing and classifying electronic documents capable of defining document groups based on the technology groups obtained by analyzing the key words in the documents. Therefore, the usage frequency of each document group is increased.
At least a second objective of the present invention is to provide a method for analyzing and classifying electronic documents capable of grouping a mass of documents without pre-classification. Hence, when the user searches for documents about a certain technology, the documents highly related to that technology can be found and the search efficiency is increased.
The present invention provides a method for analyzing and classifying electronic documents. The method comprises steps of fetching an electronic document from an electronic document folder, wherein the electronic document comprises a plurality of key words. Then, the key words are retrieved. Further, according to an appearance frequency of each key word, a correlation between each two key words is calculated. Finally, according to the correlations between the key words, the key words are classified into at least one technology group.
In the present invention, the step of retrieving the key words includes at least one step selected from a group consisting of word section analysis, rhetoric analysis, vocabulary comparison, word frequency maintenance, retrieving the key words of a candidate word library and retrieving the key words of a word library awaiting confirmation.
Moreover, in the present invention, the step of calculating the correlation between each two key words according to the appearance frequency of each key word comprises steps of de-duplicating identical key words while merging their appearance frequencies, and then calculating the correlation between each two key words.
Furthermore, the aforementioned step of de-duplicating identical key words while merging their appearance frequencies comprises steps of retrieving the key words from the electronic document, merging the duplicated key words and re-calculating the appearance frequencies of the key words.
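The merging step can be illustrated with a minimal sketch. The function name `merge_keyword_counts`, the sample key words and the normalization rule (lower-casing and trimming whitespace before comparing key words) are assumptions made for illustration only and are not taken from the specification.

```python
from collections import Counter


def merge_keyword_counts(extracted_keywords):
    """Merge duplicated key words and total their appearance counts.

    `extracted_keywords` is a list of (key word, count) pairs in which the
    same key word may occur more than once. Treating key words as identical
    after lower-casing and trimming is an assumption of this sketch.
    """
    merged = Counter()
    for word, count in extracted_keywords:
        merged[word.strip().lower()] += count
    return dict(merged)


# Hypothetical output of the key word retrieval step for one document.
raw = [("Etching", 2), ("wafer", 1), ("etching", 3), ("Wafer ", 4)]
frequencies = merge_keyword_counts(raw)
print(frequencies)
```

After merging, each key word appears once with a re-calculated appearance frequency, which is the input expected by the correlation step.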
Additionally, the step of calculating the correlation between each two key words comprises steps of obtaining the appearance frequency of each key word and calculating a correlation coefficient between each two key words, wherein the correlation coefficient between each two key words denotes the correlation between the appearance frequencies of those key words.
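As a hedged illustration of the correlation step, the sketch below builds a correlation coefficient table from per-document appearance frequencies. It assumes the Pearson correlation coefficient; the specification only speaks of a correlation coefficient between appearance frequencies, so this choice, the function names and the sample data are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_table(freq_by_keyword):
    """Return a nested dict holding the coefficient for every key word pair."""
    words = sorted(freq_by_keyword)
    return {
        a: {b: pearson(freq_by_keyword[a], freq_by_keyword[b]) for b in words}
        for a in words
    }


# Hypothetical appearance frequencies of three key words in four documents.
freq_by_keyword = {
    "etching": [3, 0, 2, 0],
    "plasma": [6, 0, 4, 0],
    "router": [0, 5, 0, 1],
}
table = correlation_table(freq_by_keyword)
```

In this sample, "etching" and "plasma" rise and fall together across the documents and therefore correlate strongly, while "router" tends to appear where they do not.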
Also, the step of classifying the key words comprises steps of forming vocabulary data by using the correlations and a Cartesian coordinate system with a dimension corresponding to the number of the key words, wherein each key word is represented by a data point with a coordinate composed of the correlation coefficients. The data points in the vocabulary data are grouped into at least one technology group by using the K-Means algorithm.
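The grouping of key words can be sketched as follows. This is a plain K-Means (Lloyd's iteration) written for illustration; the seed selection, the stopping rule, the function names and the sample correlation values are assumptions and are not specified in the text.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two coordinates."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seeds, max_iter=100):
    """Plain K-Means. Returns one group index per data point."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical correlation coefficient table for four key words: row i is
# the coordinate of key word i in a four-dimensional space.
keywords = ["etching", "plasma", "router", "packet"]
coords = [
    [1.0, 0.9, -0.6, -0.5],
    [0.9, 1.0, -0.7, -0.6],
    [-0.6, -0.7, 1.0, 0.8],
    [-0.5, -0.6, 0.8, 1.0],
]
labels = k_means(coords, seeds=[coords[0], coords[2]])
groups = {}
for word, label in zip(keywords, labels):
    groups.setdefault(label, []).append(word)
```

Each resulting group collects key words whose correlation profiles are similar and can be read as one technology group.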
The method of the present invention further comprises a step of obtaining a maturity of a technology group by using the number of the key words, the number of the electronic documents in the technology group and the number of the key words in the technology group.
The present invention also provides a method for analyzing and classifying electronic documents. The method comprises steps of fetching a plurality of electronic documents from an electronic document folder, wherein at least one of the electronic documents includes at least one technology group. Then, the technology groups in the electronic documents are obtained and an appearance frequency of each technology group in the electronic documents is statistically calculated. Finally, according to the appearance frequency of each technology group in the electronic documents, the electronic documents are classified into at least one document group.
In the present invention, the step of obtaining the technology groups in the electronic documents comprises steps of retrieving a plurality of key words in the electronic documents and calculating a correlation between each two key words according to an appearance frequency of each key word. Then, the key words are classified into at least one technology group according to the correlations between the key words.
Moreover, the aforementioned step of retrieving the key words includes at least one step selected from a group consisting of word section analysis, rhetoric analysis, vocabulary comparison, word frequency maintenance, retrieving the key words of a candidate word library and retrieving the key words of a word library awaiting confirmation.
Moreover, in the present invention, the step of calculating the correlation between each two key words according to the appearance frequency of each key word comprises steps of de-duplicating identical key words while merging their appearance frequencies, and then calculating the correlation between each two key words.
Furthermore, the aforementioned step of de-duplicating identical key words while merging their appearance frequencies comprises steps of retrieving the key words from the electronic document, merging the duplicated key words and re-calculating the appearance frequencies of the key words.
Additionally, the step of calculating the correlation between each two key words comprises steps of obtaining the appearance frequency of each key word and calculating a correlation coefficient between each two key words, wherein the correlation coefficient between each two key words denotes the correlation between the appearance frequencies of those key words.
Also, the step of classifying the key words comprises steps of forming vocabulary data by using the correlations and a Cartesian coordinate system with a dimension corresponding to the number of the key words, wherein each key word is represented by a data point with a coordinate composed of the correlation coefficients. The data points in the vocabulary data are grouped into at least one technology group by using the K-Means algorithm.
In the present invention, the step of classifying the electronic documents comprises steps of forming technology data by using the appearance frequency of each technology group and a Cartesian coordinate system with a dimension corresponding to the number of the technology groups, wherein each electronic document is represented by a data point with a coordinate composed of the appearance number of each technology group in that document. The data points in the technology data are grouped into at least one document group by using the K-Means algorithm.
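Forming the technology data can be sketched as below, assuming that a mapping from key words to technology groups is already available from the key word grouping. The function name `technology_vector`, the group names and the sample documents are hypothetical.

```python
from collections import Counter


def technology_vector(doc_keywords, keyword_to_group, group_names):
    """Count how often each technology group appears in one document.

    The result has one element per technology group, so it can serve as the
    coordinate of the document. Key words without a group are ignored, which
    is an assumption of this sketch.
    """
    counts = Counter(
        keyword_to_group[word] for word in doc_keywords if word in keyword_to_group
    )
    return [counts[name] for name in group_names]


# Hypothetical mapping produced by the key word grouping step.
keyword_to_group = {
    "etching": "semiconductor",
    "plasma": "semiconductor",
    "router": "networking",
    "packet": "networking",
}
group_names = ["semiconductor", "networking"]
documents = {
    "doc1": ["etching", "plasma", "etching"],
    "doc2": ["router", "packet", "etching"],
}
technology_data = {
    name: technology_vector(words, keyword_to_group, group_names)
    for name, words in documents.items()
}
print(technology_data)
```

The resulting vectors, one per document, are the data points that a K-Means routine would then group into document groups.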
In summary, the method for analyzing and classifying electronic documents of the present invention comprises the steps of retrieving the key words in the documents and then statistically calculating and merging the appearance frequencies of the key words. Further, the correlations between the key words are established and the key words are then grouped into several technology groups mentioned in the electronic documents. Each technology group consists of the key words belonging to that technology, so each technology group can serve as the classification basis for the means for classifying the documents, and the usage frequency and the level of detail of the classification are increased. Moreover, without pre-classification, or when further analyzing highly similar documents in the same class, the user can easily use the technology groups and key words to search for certain documents and can also retrieve other documents with highly similar technical content. Accordingly, the accuracy of automatic analysis and classification is improved and the search efficiency is increased.
It is to be understood that both the foregoing general description and the following detailed description are exemplary, and are intended to provide further explanation of the invention as claimed.
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
In the present invention, the method for analyzing and classifying documents is capable of analyzing the technology groups of the documents according to the key words in the documents. Therefore, the means for classifying documents can define the categories of the documents based on the technology groups, so as to increase the usage frequency and the level of detail of each document category. Moreover, even when no prior classification has been made, a mass of documents can be classified by using the method. Therefore, when assisting the user in searching for a specific technology, the method provides a more efficient way to find the documents related to that technology. Hence, the intangible knowledge assets of an enterprise can be managed well and efficiently, and the user can analyze known technology by using this method to determine future research directions.
A preferred embodiment is provided to describe the present invention in detail.
In the above equation, Xi,l denotes the appearance number of a de-duplicated key word Vi in the l-th document Dl, and ND denotes the total number of the documents in the document folder. Therefore,
After the correlation coefficient table is established in step S307, the key words are classified into several technology groups by using the correlation coefficient table (step S205). Based on the correlation coefficient table obtained from the analytic result and the correlation analysis of the historic technology vocabularies, if there are N key words, each key word can be represented by an N-dimensional coordinate with N elements in an N-dimensional Cartesian coordinate system, wherein each element is the correlation coefficient between the key word and one of the other key words or itself. More specifically, taking the correlation coefficient table shown in
The following is a description of the process of the K-Means algorithm.
In order to describe the spirit of the present invention in detail, the symbols used below are defined as follows:
KPi: the ith group of the key words;
nc: the seed number, that is, the number of the groups;
v: dimension of the key words;
nj: the number of the data in the jth dimension;
nij: the number of the data in the jth dimension in the ith group;
SSw: the sum of squares of the data points within the technology groups;
SSb: the sum of squares of the data points between the technology groups;
SSt: the total sum of squares of all the data points;
n: the number of the key words in certain technology classification; and
N: total number of the key words.
RMSSTD and RS can be expressed by the following equations respectively:
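For reference, the definitions of RMSSTD (root-mean-square standard deviation) and RS (R-squared) commonly used in clustering validity analysis can be written with the symbols defined above as follows. Here x_k denotes a data value, and the group-wise mean of the jth dimension in the ith group and the overall mean of the jth dimension are notation introduced for this sketch. Whether these forms agree term by term with the equations of the embodiment is an assumption.

```latex
\[
\mathrm{RMSSTD} =
\sqrt{\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}}
      \left(x_{k}-\bar{x}_{ij}\right)^{2}}
     {\sum_{i=1}^{n_c}\sum_{j=1}^{v}\left(n_{ij}-1\right)}}
\]
\[
\mathrm{RS} = \frac{SS_b}{SS_t} = \frac{SS_t - SS_w}{SS_t},
\qquad
SS_t = \sum_{j=1}^{v}\sum_{k=1}^{n_j}\left(x_{k}-\bar{x}_{j}\right)^{2},
\qquad
SS_w = \sum_{i=1}^{n_c}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}}
       \left(x_{k}-\bar{x}_{ij}\right)^{2}
\]
```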
Since the objective of the classification is to obtain technology groups whose members are highly similar to each other, the smaller the variation within the groups represented by RMSSTD is, the better the result is. Conversely, the greater the variation between the groups represented by RS is, the better the result is. By comparing these two values, the results of grouping the N key words into anywhere from one group to N groups can be examined to obtain the best grouping result. This grouping result can also be used to analyze the technology maturity (step S211 in
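The comparison of candidate groupings by RMSSTD and RS can be sketched as below. The sketch assumes the commonly used definitions of the two indices (pooled within-group standard deviation, and the between-group share of the total sum of squares); the function names and the sample data points are hypothetical.

```python
import math


def sum_of_squares(points):
    """Squared deviations from the per-dimension mean, summed over dimensions."""
    if not points:
        return 0.0
    total = 0.0
    for dim in zip(*points):
        mean = sum(dim) / len(dim)
        total += sum((x - mean) ** 2 for x in dim)
    return total


def rmsstd_and_rs(points, labels):
    """Return (RMSSTD, RS) for a grouping given as one label per data point."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    v = len(points[0])                                   # number of dimensions
    ss_w = sum(sum_of_squares(g) for g in groups.values())  # within groups
    ss_t = sum_of_squares(points)                        # total
    dof = sum(v * (len(g) - 1) for g in groups.values())
    rmsstd = math.sqrt(ss_w / dof) if dof > 0 else 0.0
    rs = (ss_t - ss_w) / ss_t if ss_t > 0 else 0.0
    return rmsstd, rs


# Hypothetical data points forming two well separated pairs.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good_rmsstd, good_rs = rmsstd_and_rs(points, [0, 0, 1, 1])
bad_rmsstd, bad_rs = rmsstd_and_rs(points, [0, 1, 0, 1])
print(good_rmsstd, good_rs)
print(bad_rmsstd, bad_rs)
```

The grouping that keeps the nearby points together yields the smaller RMSSTD and the larger RS, which is the pattern used to select among candidate numbers of groups.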
As shown in
wherein n denotes total number of the electronic documents, Nij denotes the number of the electronic documents belonging to the ith technology group and N denotes the number of the technology groups.
In step S207, according to the classified technology groups obtained from step S205, a statistical calculation is performed to count the appearances of the technologies and the key words in the documents, so as to establish a technology group statistic table shown in
In summary, the method for analyzing and classifying electronic documents of the present invention comprises the steps of retrieving the key words in the documents and then statistically calculating and merging the appearance frequencies of the key words. Further, the correlations between the key words are established and the key words are then grouped into several technology groups mentioned in the electronic documents. Each technology group consists of the key words belonging to that technology, so each technology group can serve as the classification basis for the means for classifying the documents, and the usage frequency and the level of detail of the classification are increased. Moreover, without pre-classification, or when further analyzing highly similar documents in the same class, the user can easily use the technology groups and key words to search for certain documents and can also retrieve other documents with highly similar technical content. Accordingly, the accuracy of automatic analysis and classification is improved and the search efficiency is increased.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing descriptions, it is intended that the present invention covers modifications and variations of this invention if they fall within the scope of the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
93131521 | Oct 2004 | TW | national |