The present invention generally relates to document classification, and particularly to schemes of assigning, according to a specific database, at least one document-category title to an object document.
Analysis, induction, merging or integration, sharing, communication, and access authorization of information (including knowledge, messages, and data) have played significant roles for years, as people have long been besieged by an enormous amount of information. This is particularly evident now that diversity dominates almost every activity in the world and the flow of information among people, organizations, and nations has grown so large. Information management, whether in terms of analysis, induction, merging/integration, sharing, communication, or access authorization, relies on the classification of various documents (including knowledge, messages, data, and other types of information). Although a variety of methods/systems for managing electronic files have been developed to raise the efficiency and reliability of transmitting and sharing messages/data/information/documents, one providing ideal schemes for classifying documents is still awaited.
Although document classification may be done by charging a group of administrative staff with the responsibility of classifying all documents, heavy reliance on human knowledge, experience, care, stable mood, and consistent criteria for making judgments constitutes a critical problem. This is particularly true considering the difficulty of having the same staff work all the time. Even if it is possible to keep the same group of people working all the time, differences of judgment among members of the group can still be a problem, not to mention that the same person may make different judgments at different times. Furthermore, the huge amount of information now faced by people and organizations, if classified solely on the basis of human judgment, is certain to consume a great deal of manpower, resulting in high cost in addition to mistakes originating from subjective views. The problem will become more serious in the future as the amount of information not only increases but also diversifies.
Improper classification of documents inevitably results in heavy time consumption, poor efficiency, or uncontrollable/inconsistent/disorderly procedures in managing information. Specifically, improper classification leaves any related database and communication in a state of chaos, and the unreliable access authorization originating therefrom further brings about redundant communication, which not only occupies the capacity of the communication channel but also adds extra workload for people or organizations who must filter irrelevant messages/data/information/documents out of the bulky material received all the time.
Although U.S. Pat. Nos. 6,243,723 and 5,832,470 might be deemed to relate to fields similar to that of the present invention, they are substantially different from the present invention in terms of either algorithm or achievements. No prior art is known to substantially address the aforementioned issues of classifying documents. This is why a method/apparatus providing ideal classification of documents (or messages/data/information/knowledge) on the basis of automation or computer processing is broadly expected now and will be even more so in the future.
Definition
The expression “document” or “documents” in the disclosure means “message” or “messages” or “data” or “knowledge” or any information which can be stored and is readable.
The expression “word” or “words” or “word code” or “word codes” in the disclosure means “one or more than one symbol which can be stored in a machine and is readable by a machine and/or human being”. For example, the English expression “a” or “people” or “security”, or the punctuation mark “;”, etc., is a word or word code according to the disclosure. Any word in another language is likewise a word or word code according to the disclosure.
Objects
An object of the present invention is to provide a method/apparatus for managing documents, by which an organization, agency, or person can promote its capability of adapting to a knowledge-based economy.
Another object of the present invention is to overcome the bottleneck in achieving what is expected of processing documents electronically or systematically.
A further object of the present invention is to provide a method/apparatus for managing documents, by which network communication can be better exploited by various organizations and enterprises to process their internal documents.
Yet another object of the present invention is to provide a method/apparatus for managing documents, by which the communication of information between different people, organizations, and enterprises can be smoother and more efficient.
Still another object of the present invention is to provide a method/apparatus for managing documents, by which various people, organizations, and enterprises can manage documents with less time consumption, lower cost, and minimum complication.
Operating Algorithm
The present invention features a process for assigning, according to a database, at least one of a plurality of document-category titles to an object document, wherein the object document includes one or more than one key word, and the database includes a plurality of keyword-to-document-category-relevance-referring numbers respectively corresponding to the key words and to the document-category titles. One of the keyword-to-document-category-relevance-referring numbers which corresponds to an arbitrarily selected one of the key words, and to an arbitrarily selected one of the document-category titles, represents or relates to the probability that the arbitrarily selected key word appears in a document with the arbitrarily selected document-category title, i.e., represents or relates to the probability that the arbitrarily selected key word appears in a document classified into the arbitrarily selected document-category.
The present invention also features a process for obtaining, according to a record file, the plurality of keyword-to-document-category-relevance-referring numbers, the record file including a plurality of record documents each corresponding to at least one of the document-category titles.
The present invention further features an apparatus for storing the plurality of keyword-to-document-category-relevance-referring numbers and/or another information/data.
Furthermore the present invention features an apparatus for performing the aforementioned processes.
The present invention may best be understood through the following description with reference to the accompanying drawings, in which:
A method provided by the present invention for classifying documents comprises a document-category-assigning process for assigning, according to a plurality of reference-number groups, at least one of a plurality of document-category titles to an object document, wherein the object document includes a plurality (at least two, for example) of key words (denoted by KW(1), . . . , KW(j), . . . , KW(m) in this disclosure), the reference-number groups [denoted by R(1), . . . , R(q), . . . , R(u) in this disclosure] correspond to the document-category titles g(1), . . . , g(q), . . . , g(u) in a one-to-one manner, i.e., each of the reference-number groups corresponds to a different one of the plurality of document-category titles, and each of the reference-number groups R(1), . . . , R(q), . . . , R(u) includes a plurality of keyword-to-document-category-relevance-referring numbers corresponding to the key words KW(1), . . . , KW(j), . . . , KW(m) in a one-to-one manner, i.e., each of the keyword-to-document-category-relevance-referring numbers included in each [R(q), for example] of the reference-number groups R(1), . . . , R(q), . . . , R(u) corresponds to a different one of the key words KW(1), . . . , KW(j), . . . , KW(m). One of the keyword-to-document-category-relevance-referring numbers which corresponds to an arbitrarily selected key word [KW(j), for example], and is included in one reference-number group [R(q), for example] that corresponds to an arbitrarily selected document-category title [g(q), for example], represents or relates to the probability that the arbitrarily selected key word KW(j) appears in a document with the arbitrarily selected document-category title g(q), i.e., represents or relates to the probability that the arbitrarily selected key word KW(j) appears in a document which has the arbitrarily selected document-category title g(q) assigned thereto.
For easier understanding of the method provided by the present invention for classifying documents, an example of obtaining these keyword-to-document-category-relevance-referring numbers [or the reference-number groups R(1), . . . , R(q), . . . , R(u)], i.e., a reference-number-calculation process, is described as follows. The reference-number-calculation process obtains the reference-number groups (or the keyword-to-document-category-relevance-referring numbers) according to a record file including a plurality of record documents each corresponding to at least one of the document-category titles, i.e., each of the record documents has been assigned one or more than one of the document-category titles g(1), . . . , g(q), . . . , g(u). In other words, each of the record documents has been classified into one or more than one document-category. The plurality of record documents are denoted by D1, . . . , Dn, . . . , Dy hereinafter. A scheme for embodying the reference-number-calculation process comprises the steps of:
By repeating the steps (c) and (d) above for different key words, i.e., key words KW(1), . . . , KW(j−1), KW(j+1), . . . , KW(m) in addition to KW(j), a group of keyword-to-document-category-relevance-referring numbers [denoted by KTRB(1,q), KTRB(2,q), . . . , KTRB(m,q) in this disclosure] which respectively correspond to the key words KW(1), KW(2), . . . , KW(m) and all correspond to the document-category title g(q) is obtained.
By repeating the steps (a), (b), (c), and (d) above for different document-category titles g(1), . . . , g(q−1), g(q+1), . . . , g(u) in addition to g(q), and for all key words KW(1), . . . , KW(m), a plurality of reference-number groups are obtained, wherein the reference-number groups correspond to the document-category titles g(1), . . . , g(u) in a one-to-one manner, and each of the reference-number groups includes a plurality of keyword-to-document-category-relevance-referring numbers corresponding to the key words KW(1), . . . , KW(m) in a one-to-one manner, so that all the keyword-to-document-category-relevance-referring numbers included in the reference-number group which corresponds to a document-category title [g(u), for example] correspond to that document-category title g(u).
An arbitrarily selected one [KTRB(i,j), for example] of the keyword-to-document-category-relevance-referring numbers represents or relates to the probability a key word KW(i) appears in a document with document-category title g(j), i.e., represents or relates to the probability a key word KW(i) appears in a document classified into a document category entitled g(j).
The aforementioned scheme for embodying the reference-number-calculation process according to the present invention may be such that a frequency value (Fjn, for example) representing the frequency with which the arbitrarily selected key word KW(j) appears in record document Dn is the result of dividing the number of times (denoted by JNT in this disclosure) the key word KW(j) appears in record document Dn by the total number of words (denoted by NWDn in this disclosure) in record document Dn, i.e., Fjn=JNT÷NWDn. Alternatively, the frequency value may be obtained in another way, as can be seen from another scheme for embodying the reference-number-calculation process, which comprises:
By repeating the steps (e), (f), (g), (i), and (j) above for different document-category titles g(1), . . . , g(q−1), g(q+1), . . . , g(u) in addition to g(q), and for all key words KW(1), . . . , KW(m), a plurality of reference-number groups are obtained, wherein the reference-number groups correspond to the document-category titles g(1), . . . , g(q), . . . , g(u) in a one-to-one manner, and each of the reference-number groups includes a plurality of keyword-to-document-category-relevance-referring numbers corresponding to the key words KW(1), . . . , KW(m) in a one-to-one manner, so that all the keyword-to-document-category-relevance-referring numbers included in the reference-number group which corresponds to a document-category title [g(u), for example] correspond to that document-category title g(u). For example, for the reference-number group which corresponds to document-category title g(u), the keyword-to-document-category-relevance-referring numbers KTRB(1,u), KTRB(2,u), . . . , KTRB(m,u) therein all correspond to document-category title g(u), and respectively correspond to the key words KW(1), . . . , KW(m) in a one-to-one manner. An arbitrarily selected one [KTRB(i,j), for example] of the keyword-to-document-category-relevance-referring numbers represents or relates to the probability that a key word KW(i) appears in a document with document-category title g(j).
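The frequency value Fjn=JNT÷NWDn defined above may be illustrated by the following minimal Python sketch. It is offered only as an illustration under stated assumptions: the function name frequency_value, the splitting of a document into words on whitespace, the treatment of a key word as a single word, and the sample sentence are all hypothetical and are not taken from the disclosure.

```python
def frequency_value(key_word, document_words):
    """Return Fjn = JNT / NWDn for one key word and one record document.

    document_words is the record document already split into words
    (splitting on whitespace is an assumption made for this sketch).
    """
    total_words = len(document_words)  # NWDn: total number of words in Dn
    if total_words == 0:
        return 0.0  # assumption: an empty document yields frequency 0
    # JNT: number of times the key word appears in Dn
    times = sum(1 for word in document_words if word == key_word)
    return times / total_words


# Hypothetical record document, for illustration only.
record_document = "the fund carries risk and the risk is high".split()
f_risk = frequency_value("risk", record_document)  # 2 occurrences / 9 words
```

A key word consisting of several words (such as “investment risk”) would need a phrase-matching step that this sketch does not attempt.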
The reference-number-calculation process according to the present invention, may also be configured to comprise the steps of:
By repeating the steps (k), (l), and (m) above for different document-category titles g(1), . . . , g(q−1), g(q+1), . . . , g(u) in addition to g(q), and for all key words KW(1), . . . , KW(m), a plurality of reference-number groups are obtained, wherein the reference-number groups correspond to the document-category titles g(1), . . . , g(u) in a one-to-one manner, and each of the reference-number groups includes a plurality of keyword-to-document-category-relevance-referring numbers corresponding to the key words KW(1), . . . , KW(m) in a one-to-one manner, so that all the keyword-to-document-category-relevance-referring numbers included in the reference-number group which corresponds to a document-category title [g(u), for example] correspond to that document-category title g(u).
All the keyword-to-document-category-relevance-referring numbers and/or the reference-number groups usually constitute or are included in a database residing on a data storage portion of a device (particularly an information management system, specifically a computer). The plurality of key words to which these keyword-to-document-category-relevance-referring numbers correspond, and the document-category titles g(1), . . . , g(u) to which the reference-number groups correspond, may also constitute or be included in a database residing on a data storage portion of the device.
The aforementioned method provided by the present invention may further comprise a reference-number-adjusting process for adjusting the keyword-to-document-category-relevance-referring numbers, to adapt the method to the condition that a record document contains unusually many or unusually few occurrences of a key word, i.e., one (or more than one) key word appears in a record document too many or too few times compared to the average number of times the key word appears in all the record documents with the same document-category title (i.e., in the same document-category). A scheme for embodying the reference-number-adjusting process with reference to the step (d) above comprises:
A scheme for embodying the reference-number-adjusting process with reference to the step (j) above is analogous to the one above, and needs no further description.
Another scheme for embodying the reference-number-adjusting process with reference to the step (d) above, comprises:
Obviously the frequency values such as Fj1, Fj2, . . . , Fjn or the like, the adjust-criteria value ACV, the first adjust-criteria value FACV, the first-adjusting amount FAA, the second adjust-criteria value SACV, and the second-adjusting amount SAA, may also constitute or be included in a database residing on a data storage portion of a device (particularly an information management system, specifically a computer).
Based on the keyword-to-document-category-relevance-referring numbers, each [KTRB(i,j), for example] representing or relating to the probability that a key word KW(i) appears in a document with document-category title g(j), the present invention provides a document-category-assigning process for assigning, according to a plurality of reference-number groups, at least one of a plurality of document-category titles g(1), . . . , g(u) to an object document (denoted by Dt in this disclosure), wherein the object document Dt includes at least two key words KW(1), . . . , KW(m), the reference-number groups correspond to the document-category titles g(1), . . . , g(u) in a one-to-one manner, and each of the reference-number groups includes a plurality of keyword-to-document-category-relevance-referring numbers corresponding to the key words KW(1), . . . , KW(m) in a one-to-one manner. One scheme for embodying the document-category-assigning process comprises:
In the document-category-assigning process above, if more than one [DREN(p) in addition to DREN(q), for example] of the category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(q), . . . , DREN(u) is identified meeting the reference condition, the object document is classified into more than one document-category, i.e., in document-categories entitled g(p) and g(q).
For easier understanding, the category-to-object-document-relevance-evaluation numbers may be written out as follows:
DREN(1)=FON(1,1)⊕FON(2,1)⊕ . . . ⊕FON(m,1)=F1t⊗KTRB(1,1)⊕F2t⊗KTRB(2,1)⊕ . . . ⊕Fmt⊗KTRB(m,1);
DREN(q)=FON(1,q)⊕FON(2,q)⊕ . . . ⊕FON(m,q)=F1t⊗KTRB(1,q)⊕F2t⊗KTRB(2,q)⊕ . . . ⊕Fmt⊗KTRB(m,q);
DREN(u)=FON(1,u)⊕FON(2,u)⊕ . . . ⊕FON(m,u)=F1t⊗KTRB(1,u)⊕F2t⊗KTRB(2,u)⊕ . . . ⊕Fmt⊗KTRB(m,u).
In other words, for each q where q=1, 2, . . . , u, a first-operation-result group FR(q) includes the first-operation numbers FON(1,q)=F1t⊗KTRB(1,q), FON(2,q)=F2t⊗KTRB(2,q), . . . , FON(m,q)=Fmt⊗KTRB(m,q). By performing the second mathematical operation ⊕ among the first-operation numbers FON(1,q), FON(2,q), . . . , FON(m,q) in the first-operation-result group FR(q), a category-to-object-document-relevance-evaluation number DREN(q)=FON(1,q)⊕FON(2,q)⊕ . . . ⊕FON(m,q)=F1t⊗KTRB(1,q)⊕F2t⊗KTRB(2,q)⊕ . . . ⊕Fmt⊗KTRB(m,q) is obtained corresponding to a document-category entitled g(q), where u≧q≧1.
In the document-category-assigning process above, the first mathematical operation ⊗ may be multiplication, usually denoted by ×, and the second mathematical operation ⊕ may be addition, usually denoted by +.
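With the first mathematical operation taken as multiplication and the second as addition, as this paragraph permits, each DREN(q) is the sum over the key words of Fjt×KTRB(j,q). A minimal Python sketch follows; the function name evaluation_numbers, the dictionary layout, and every numeric value are hypothetical and are chosen for illustration only.

```python
def evaluation_numbers(frequencies, reference_groups):
    """Compute DREN(q) = F1t*KTRB(1,q) + ... + Fmt*KTRB(m,q) for every q.

    frequencies      -- list [F1t, ..., Fmt] for the object document Dt
    reference_groups -- dict mapping each document-category title g(q) to
                        its reference-number group [KTRB(1,q), ..., KTRB(m,q)]
    """
    dren = {}
    for title, group in reference_groups.items():
        if len(group) != len(frequencies):
            raise ValueError("each group needs one number per key word")
        # First operation: multiplication; second operation: addition.
        dren[title] = sum(f * k for f, k in zip(frequencies, group))
    return dren


# Hypothetical values, not taken from the tables of the disclosure.
frequencies = [0.5, 0.25]
reference_groups = {"g(1)": [0.2, 0.4], "g(2)": [0.1, 0.8]}
dren = evaluation_numbers(frequencies, reference_groups)
```

Here DREN for g(1) is 0.5×0.2+0.25×0.4 and DREN for g(2) is 0.5×0.1+0.25×0.8.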
In the document-category-assigning process above, the reference condition may be “larger than a category-judge-criteria-value”, i.e., the reference condition is such that one [DREN(q), for example] of the category-to-object-document-relevance-evaluation numbers is identified if the magnitude thereof [the magnitude of DREN(q)] is larger than the category-judge-criteria-value. Alternatively, the reference condition may be such that one [DREN(p), for example] of the category-to-object-document-relevance-evaluation numbers is identified if the magnitude of DREN(p), in an order among the category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(p), . . . , DREN(u), is within an order-criteria range. For example, if the order-criteria range is “the biggest”, and DREN(p) is the biggest among the category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(p), . . . , DREN(u), then DREN(p) is the identified one of the category-to-object-document-relevance-evaluation numbers. As another example, if the order-criteria range is “no smaller than the second biggest”, and DREN(p) and DREN(q) are respectively the biggest and the second biggest among the category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(p), . . . , DREN(q), . . . , DREN(u), then both DREN(p) and DREN(q) are the identified ones of the category-to-object-document-relevance-evaluation numbers, and the object document can be classified into two document-categories.
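The two kinds of reference condition described above may be sketched in Python as follows. This is an illustration under stated assumptions rather than the claimed process itself: the function names, the handling of ties (all numbers equal to the cutoff are identified), and the numeric values are hypothetical.

```python
def identify_by_criteria_value(dren, criteria_value):
    """Identify every category whose DREN is larger than the
    category-judge-criteria-value."""
    return [title for title, value in dren.items() if value > criteria_value]


def identify_by_order(dren, rank):
    """Identify every category whose DREN is no smaller than the
    rank-th biggest DREN (rank=1 means "the biggest")."""
    if not dren or rank < 1:
        return []
    ordered = sorted(dren.values(), reverse=True)
    cutoff = ordered[min(rank, len(ordered)) - 1]
    return [title for title, value in dren.items() if value >= cutoff]


# Hypothetical evaluation numbers, for illustration only.
dren = {"g(1)": 0.42, "g(2)": 0.35, "g(3)": 0.15, "g(4)": 0.08}
biggest = identify_by_order(dren, 1)
top_two = identify_by_order(dren, 2)
above = identify_by_criteria_value(dren, 0.30)
```

With these values, the order-criteria range “the biggest” identifies only g(1), while “no smaller than the second biggest” identifies g(1) and g(2), so the object document would be classified into two document-categories.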
Another scheme for embodying the document-category-assigning process comprises:
In case the object document Dt includes only one key word KW, the document-category-assigning process can be simplified to comprise: computing the frequency the key word KW appears in the object document Dt, to obtain a frequency value Ft representing the frequency the key word KW appears in the object document Dt;
In the document-category-assigning process above, if the reference condition is “larger than a category-judge-criteria-value” instead of being based on the order among the category-to-object-document-relevance-evaluation numbers, then the present invention provides an evaluation-number-normalizing process to make sure the reference condition can always be relied upon. The evaluation-number-normalizing process includes:
The descriptions above may be better understood by referring to the following Tables 1-8, Matrix M1, and Matrix M2, as well as the notes associated therewith.
In Table 1 below, record documents D1, D2, . . . , Dn are in a same-category group corresponding to a document-category title g(q), and Fij represents the frequency with which the key word KW(i) appears in document Dj.
In Table 2 below, D3, D4, . . . , Dp are in a same-category group corresponding to a document-category title g(s), and Fij represents the frequency with which the key word KW(i) appears in document Dj.
In Table 3 below, D1, D2, . . . , Dn are in a same-category group corresponding to a document-category title g(q), and TN(j,k) represents the number of times the key word KW(j) appears in document Dk, where j=1, 2, 3, 4, 5, and k=1, . . . , n.
all listed on Table 4 below.
Repeating the above steps for each q where q=1, . . . , u, a plurality of keyword-to-document-category-relevance-referring numbers listed on Table 5 below are obtained.
All the keyword-to-document-category-relevance-referring numbers on each row of Table 5 correspond to the same document-category title. For example, KTRB(1,1), . . . , KTRB(m,1) all correspond to document-category title g(1); KTRB(1,q), . . . , KTRB(m,q) all correspond to document-category title g(q).
Another scheme for obtaining the plurality of keyword-to-document-category-relevance-referring numbers is represented by Table 6 below, where NWK is the number of words in the same-category group g(q) of record documents, i.e., NWK is the total number of words in all of the record documents D1, . . . , Dn classified into same-category group g(q).
A plurality of frequency values F1t, F2t, . . . , Fmt representing the frequencies the key words KW(1), . . . , KW(m) appear in object document Dt, are listed on Table 7 below.
Table 8 below lists a plurality of category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(q), . . . , DREN(u) obtained by performing mathematical operations ⊗ and ⊕ between the keyword-to-document-category-relevance-referring numbers listed on Table 5 and the frequency values listed on Table 7.
The category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(u) may also be obtained by performing matrix operation (multiplication) between a matrix M1 and a matrix M2 as shown below.
Tables 9, 10, and 11 below, as a whole, represent a specific example characterizing Tables 5, 7, and 8 above, and are to illustrate main features of the document-category-assigning process provided by the present invention.
Frequency values 8, 2, and 6 above respectively represent the frequencies the key words KW(1), KW(2), KW(3) appear in object document Dt.
If the reference condition is such that a category-to-object-document-relevance-evaluation number is identified when its magnitude is the biggest among the category-to-object-document-relevance-evaluation numbers DREN(1), DREN(2), DREN(3), DREN(4), then DREN(1) is identified, and object document Dt is classified into the document-category entitled g(1), which corresponds to DREN(1). If the reference condition is “larger than a category-judge-criteria-value” instead of being based on the order among the category-to-object-document-relevance-evaluation numbers, then the magnitudes 3.9, 3.4, 3.0, and 1.8 of DREN(1), DREN(2), DREN(3), and DREN(4) should preferably be normalized, for example by an evaluation-number-normalizing process, to make sure the reference condition can always be relied upon. The normalized magnitudes are 3.9÷(3.9+3.4+3.0+1.8), 3.4÷(3.9+3.4+3.0+1.8), 3.0÷(3.9+3.4+3.0+1.8), and 1.8÷(3.9+3.4+3.0+1.8), i.e., approximately 0.322, 0.281, 0.248, and 0.149. Assuming the category-judge-criteria-value is set to 0.32, only 3.9÷(3.9+3.4+3.0+1.8) is larger than 0.32, so DREN(1) is identified, and object document Dt is thereby classified into the document-category entitled g(1), which corresponds to DREN(1).
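The normalization in this example may be reproduced with the following minimal Python sketch. The magnitudes 3.9, 3.4, 3.0, and 1.8 and the category-judge-criteria-value 0.32 are those given above; the function name normalize and the choice to divide each magnitude by the sum of all magnitudes follow the arithmetic of this example and are otherwise an assumption about how an evaluation-number-normalizing process could be coded.

```python
def normalize(dren):
    """Divide each evaluation number by the sum of all evaluation numbers."""
    total = sum(dren.values())
    if total == 0:
        raise ValueError("cannot normalize when all evaluation numbers are 0")
    return {title: value / total for title, value in dren.items()}


# Magnitudes of DREN(1)..DREN(4) from the example above.
dren = {"g(1)": 3.9, "g(2)": 3.4, "g(3)": 3.0, "g(4)": 1.8}
normalized = normalize(dren)

# Reference condition: "larger than a category-judge-criteria-value".
criteria_value = 0.32
identified = [t for t, v in normalized.items() if v > criteria_value]
```

Because the normalized magnitudes always sum to 1, one fixed category-judge-criteria-value remains meaningful regardless of how long the object document is or how large the raw evaluation numbers are.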
The method provided by the present invention may further comprise a key-word-identification process for identifying the key words in an arbitrary document (including the object document). The key-word-identification process may comprise:
According to the category-classification process provided by the present invention and described above, the key-word-reference database is configured to contain a plurality of reference codes. The reference code corresponding to a candidate key word includes the candidate key word. The reference code also includes an attribute represented by a first symbol or a second symbol. The candidate key word is determined to be a key word if the attribute of the reference code is represented by the first symbol, and is determined not to be a key word if the attribute of the reference code is represented by the second symbol. For example, if the candidate key word is the words “investment risk” and the reference code is “investment risk +” with its attribute represented by a first symbol “+”, the candidate key word is determined to be a key word, whereas it is determined not to be a key word if the reference code is “investment risk −” with its attribute represented by a second symbol “−”. The reference code may include one or more than one word in addition to an attribute.
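The reference-code lookup described above may be sketched in Python as follows. The sketch makes several assumptions that are not stated in the disclosure: a reference code is stored as a string whose last space-separated token is the attribute, the second symbol is written as an ASCII hyphen, a candidate with no reference code is treated as not being a key word, and the function names and the sample database are hypothetical.

```python
FIRST_SYMBOL = "+"  # attribute marking a key word


def split_reference_code(reference_code):
    """Split "investment risk +" into ("investment risk", "+")."""
    words, _, attribute = reference_code.rpartition(" ")
    return words, attribute


def is_key_word(candidate, reference_codes):
    """Return True if the reference code for the candidate carries the
    first symbol; False for the second symbol or when no code exists
    (the latter is an assumption of this sketch)."""
    for reference_code in reference_codes:
        words, attribute = split_reference_code(reference_code)
        if words == candidate:
            return attribute == FIRST_SYMBOL
    return False


# Hypothetical key-word-reference database.
key_word_reference_database = ["investment risk +", "the -"]
risk_is_key = is_key_word("investment risk", key_word_reference_database)
the_is_key = is_key_word("the", key_word_reference_database)
```

Storing the attribute alongside the words keeps the decision a simple lookup, at the cost of one database entry per candidate key word.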
The present invention may also be embodied as an apparatus 11 (in
Alternatively the database according to the present invention may comprise:
In the apparatus 11 provided by the present invention, the database may further comprise a plurality of frequency values respectively representing the frequencies with which the key words KW(1), . . . , KW(m) appear in the plurality of record documents D1, . . . , Dy to which at least one of the document-category titles g(1), . . . , g(u) has been assigned, i.e., the database further comprises frequency values F11, F21, F31, . . . , Fm1 respectively representing the frequencies with which the key words KW(1), . . . , KW(m) appear in record document D1, and frequency values F12, F22, F32, . . . , Fm2 respectively representing the frequencies with which the key words KW(1), . . . , KW(m) appear in record document D2, or in other words, comprises frequency values F1v, F2v, F3v, . . . , Fmv respectively representing the frequencies with which the key words KW(1), . . . , KW(m) appear in record document Dv, where v=1, 2, . . . , y.
Alternatively, in the apparatus 11 provided by the present invention, the database may further comprise a plurality of times-numbers respectively representing the times the key words KW(1), . . . , KW(m) appear in the plurality of record documents D1, . . . , Dy to which at least one of the document-category titles g(1), . . . , g(u) has been assigned.
The apparatus 11 provided by the present invention may further comprise an operational portion 15 (shown in
The operational portion 15 according to the present invention may have a program residing therein, and the database according to the present invention further comprises the plurality of record documents D1, . . . , Dy. The program is for performing any of the reference-number-calculation processes described hereinbefore.
The operational portion 15 according to the present invention may also be for performing any of the document-category-assigning processes described hereinbefore.
The database according to the present invention may further comprise the aforementioned category-judge-criteria-value, and the operational portion 15 according to the present invention is such that a category-to-object-document-relevance-evaluation number DREN(j) is identified if the magnitude of the DREN(j), in an order among the category-to-object-document-relevance-evaluation numbers DREN(1), . . . , DREN(j), . . . , DREN(u), is larger than the category-judge-criteria-value.
Apparatus 11 (as shown in
While the invention has been described in terms of what are presently considered to be the most practical and preferred schemes or embodiments, it shall be understood that the invention is not limited to the disclosure. On the contrary, it is to cover various modifications or similar arrangements suggested by the disclosure or included within the spirit and scope of the appended claims.