Embodiments described herein relate generally to a document classification apparatus and a document classification method for classifying an enormous number of digitized documents in accordance with their contents.
Along with the growth in computer performance and the capacity of storage media or proliferation of computer networks in recent years, it has become possible to collect, store, and use an enormous number of digitized documents using a computer system. Automatic classification, clustering, and the like of documents are expected as technologies for organizing such an enormous number of documents into a form easy to use.
In particular, activities of corporations and the like have undergone rapid globalization of late. Under the circumstances, it is required to efficiently classify documents described in not only one language but also a plurality of natural languages such as Japanese, English, and Chinese.
There is a need to, for example, classify patent documents filed in a plurality of countries based not on the difference in language but on the similarity of contents, and to analyze trends in applications. There is also a need to, for example, accept, at contact centers in a plurality of countries, information such as questions and complaints from customers concerning a product on sale in the countries and classify/analyze the information. There also exists a need to, for example, collect and analyze information such as news articles and ratings/opinions about a product/service, or the like, which are described in various languages and made open to the public via the Internet.
One method of cross-lingually classifying document sets of different languages based on the similarity of contents uses machine translation technology. In this method, each document described in a language (for example, English or Chinese when Japanese is the native language) other than the native language is translated such that all documents are processable as documents of one language (that is, native language), and after that, automatic classification, clustering, or the like is performed.
However, this method has a problem of accuracy; for example, the accuracy of automatic classification depends on the accuracy of machine translation, and documents cannot appropriately be classified due to a translation error and the like. In addition, since the calculation cost for processing of machine translation is generally high, a problem of performance arises when processing an enormous number of documents.
Furthermore, when a plurality of users classify and use documents, the native languages of the documents are also considered to vary. It is therefore difficult to translate an enormous number of documents into a plurality of languages in advance.
Another method of cross-lingually classifying document sets described in a plurality of languages uses a bilingual dictionary (translation dictionary). Here, the bilingual dictionary is a dictionary or thesaurus that associates an expression such as a word or a phrase described in a given language with a synonymous expression in a different language. For the sake of simplicity, the expression, including a compound word and a phrase, will simply be referred to as a word hereinafter.
As an example of the method of implementing cross-lingual classification using a bilingual dictionary, first, out of a document set described in a plurality of languages, a subset of documents described in a language a is classified, and categories are created. Words in the language a representing the feature of each category are obtained in the form of, for example, a word vector. On the other hand, for a document in another language b, a word vector in the language b representing the feature of the document is obtained.
Here, when each dimension (that is, word in the language a) of the word vector of each category in the language a and each dimension (that is, word in the language b) of the word vector of a document in the language b can be associated using the bilingual dictionary, the similarity between the word vector in the language a and the word vector in the language b can be calculated. The document in the language b can thus be classified into an appropriate one of the categories in the language a based on the similarity.
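The dictionary-based approach described above can be illustrated with a minimal Python sketch. It is not taken from the cited method: the function names, the representation of a bilingual dictionary as a mapping from a language-b word to a list of language-a words, the use of cosine similarity, and the sample German/English words are all assumptions made for illustration.

```python
import math

def translate_vector(vec_b, dictionary_b_to_a):
    """Map a word vector in language b onto language-a dimensions (hypothetical helper)."""
    out = {}
    for word_b, weight in vec_b.items():
        # A word with no dictionary entry contributes nothing.
        for word_a in dictionary_b_to_a.get(word_b, []):
            out[word_a] = out.get(word_a, 0.0) + weight
    return out

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_vec_b, category_vecs_a, dictionary_b_to_a):
    """Return the language-a category most similar to the language-b document."""
    mapped = translate_vector(doc_vec_b, dictionary_b_to_a)
    return max(category_vecs_a, key=lambda c: cosine(category_vecs_a[c], mapped))

# Illustrative data only.
categories = {"cameras": {"camera": 1.0, "lens": 0.5},
              "printers": {"printer": 1.0, "ink": 0.5}}
dictionary = {"kamera": ["camera"], "objektiv": ["lens"], "drucker": ["printer"]}
best = classify({"kamera": 2.0, "objektiv": 1.0}, categories, dictionary)
```

The sketch also shows why dictionary coverage matters: a language-b word without an entry simply drops out of the mapped vector and cannot contribute to the similarity.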
In the method using a bilingual dictionary, the quality and quantity of the bilingual dictionary are important. However, labor is necessary to manually create the whole bilingual dictionary. As a method of semiautomatically creating a bilingual dictionary, there is a method of obtaining, in correspondence with a word described in a certain language, a word described in another appropriate language as an equivalent based on a general-purpose bilingual dictionary and the cooccurrence frequency of the word in the corpus (database of model sentences) of each language.
In this method, for example, a technical term or the like whose expression in one language is known but whose expression in the other language corresponding to the above expression is unknown needs to be designated as a word for which a bilingual dictionary is to be created. However, when classifying documents of unknown contents, a word for which a bilingual dictionary should be created cannot be assumed in advance.
Hence, the method using the cooccurrence frequency and the bilingual dictionary is not suitable for the purpose of classifying documents of unknown contents by a heuristic method such as clustering. Additionally, the above-described method needs a general-purpose bilingual dictionary as well as the semiautomatically created bilingual dictionary. However, it may be impossible to sufficiently prepare the general-purpose bilingual dictionary in advance depending on the target language.
Furthermore, Japanese words corresponding to, for example, an English word “character” are “”, “”, “”, “”, and the like. For this reason, especially when using the general-purpose bilingual dictionary, an appropriate equivalent needs to be selected in accordance with the document set to be classified.
There is also a method of automatically classifying a document using a thesaurus of equivalents created by the above-described method. In this method, if the document is not classified into an appropriate category, the user corrects the meaning of a word in the thesaurus corresponding to a category, thereby coping with a classification error or the like. However, this operation is particularly laborious for a user who is unfamiliar with the target language.
In general, according to one embodiment, there is provided a document classification apparatus including a document storage unit configured to store a plurality of documents in different languages, an inter-document corresponding relationship storage unit configured to store a corresponding relationship between the documents in the different languages which are stored in the document storage unit, and a category storage unit configured to store a category to classify the plurality of documents stored in the document storage unit.
The document classification apparatus includes a word extraction unit configured to extract words from the documents stored in the document storage unit.
The document classification apparatus includes an inter-word corresponding relationship extraction unit configured to extract the corresponding relationship between the words extracted by the word extraction unit, using the corresponding relationship between the documents described in the different languages and stored in the inter-document corresponding relationship storage unit and based on a frequency with which the words extracted by the word extraction unit co-occurrently appear between the documents having the corresponding relationship.
The document classification apparatus includes a category generation unit configured to generate the category for each language by clustering the plurality of documents described in that language, based on a similarity of the frequency with which the words extracted by the word extraction unit appear in the documents in the same language, which are stored in the document storage unit.
The document classification apparatus includes an inter-category corresponding relationship extraction unit configured to extract the corresponding relationship between the categories into which the documents described in the different languages are classified, by assuming that the more inter-word corresponding relationships there are between a word that frequently appears in a document classified into a certain category and a word that frequently appears in a document classified into another category, the higher the similarity between the categories is, based on the frequency of the word that appears in the document classified into each category generated for each language by the category generation unit and the corresponding relationship between the words described in different languages, which is extracted by the inter-word corresponding relationship extraction unit.
An embodiment will now be described with reference to the accompanying drawings.
Referring to
The word extraction unit 2 extracts a word from the data of a document. More specifically, the word extraction unit 2 extracts a word that is data necessary for processing of, for example, classifying a document by morphological analysis or the like, and obtains, for example, the appearance frequency of each word in each document.
To cope with documents in different languages, the word extraction unit 2 is formed from units for the languages, that is, a first word extraction unit, a second word extraction unit, . . . , an nth word extraction unit, as shown in
The category storage unit 3 stores and manages data of categories to classify documents. The category storage unit 3 is implemented by a storage device, for example, a nonvolatile memory. Generally, in the category storage unit 3, the documents are classified by a plurality of categories having a hierarchical structure in accordance with the contents. The category storage unit 3 stores data of documents classified into each category and data of the parent-child relationship between the categories in the hierarchical structure of the categories.
The category operation unit 4 accepts an operation such as browsing or editing by the user for the data of categories stored in the category storage unit 3.
The category operation unit 4 is generally implemented using a graphical user interface (GUI). By the category operation unit 4, the user can perform an operation for a document.
More specifically, the operation is an operation for a category, an operation of classifying a document into a category, or an operation of moving a document classified in a category to another category. The operation for a category is, for example, creating, deleting, moving (changing the parent-child relationship in the hierarchical structure), copying, or integrating (merging a plurality of categories into one) a category.
The inter-document corresponding relationship storage unit 5 stores the corresponding relationship between the documents stored in the document storage unit 1. The inter-document corresponding relationship storage unit 5 is implemented by a storage device, for example, a nonvolatile memory. Generally, the inter-document corresponding relationship storage unit 5 stores and manages data representing the corresponding relationship between documents described in different languages. When classifying patent documents, an example of the specific corresponding relationship between documents is the relationship between a Japanese patent and a U.S. patent that are linked by a right of priority or by an international patent application.
The inter-word corresponding relationship extraction unit 6 automatically extracts the corresponding relationship between words described in different languages based on a word extracted by the word extraction unit 2 from a document described in each language and the corresponding relationship between the documents stored in the inter-document corresponding relationship storage unit 5.
An example of the specific corresponding relationship between the words described in different languages, which is extracted by the inter-word corresponding relationship extraction unit 6, is a corresponding relationship close to equivalents such as the corresponding relationship between a Japanese word “”, an English word “character”, and a Chinese word “”.
A category generation unit 7 and an inter-category corresponding relationship extraction unit 8 shown in
The category generation unit 7 automatically generates categories by clustering a plurality of documents described in the same language based on the similarity of appearance frequencies of a word extracted from each document by the word extraction unit 2.
The inter-category corresponding relationship extraction unit 8 generally automatically extracts the corresponding relationship between a plurality of categories that are the categories generated by the category generation unit 7 and used to classify document groups of different languages. The categories and the corresponding relationship between the categories generated by these units are stored in the category storage unit 3.
According to the embodiment shown in
In an arrangement according to an embodiment shown in
The case-based document classification unit 9 performs automatic classification processing. More specifically, for one or a plurality of categories stored in the category storage unit 3, the case-based document classification unit 9 automatically determines, based on one or a plurality of classified documents which are already classified into the categories, whether to classify, into the category, an unclassified document yet to be classified into a category.
Based on words extracted from each document by the word extraction unit 2 and the corresponding relationship between words extracted by the inter-word corresponding relationship extraction unit 6, the case-based document classification unit 9 can determine whether to classify not only an unclassified document described in the same language as the classified documents of a category but also an unclassified document described in another language to the category.
According to the embodiment shown in
In an arrangement according to an embodiment shown in
For one or a plurality of categories stored in the category storage unit 3, the category feature word extraction unit 10 extracts characteristic words representing the contents of documents classified into each category. The characteristic word will be referred to as a feature word hereinafter as needed.
The feature word is a word extracted by selecting an appropriate word representing the feature of a category well from the words extracted by the word extraction unit 2 from the documents classified into the category, as will be described later.
The category feature word conversion unit 11 converts a feature word described in a certain language and extracted from a category into a feature word described in another language based on the corresponding relationship between words described in different languages, which is extracted by the inter-word corresponding relationship extraction unit 6.
According to the embodiment shown in
In an arrangement according to an embodiment shown in
By a classification rule set for each category stored in the category storage unit 3, the rule-based document classification unit 12 determines a document to be classified into the category. In general, the classification rule of each category is defined to classify, into the category, a document in which one or a plurality of words out of words extracted from documents by the word extraction unit 2 appear.
The classification rule conversion unit 13 converts a classification rule used to classify a document described in a certain language into a classification rule used to classify a document described in another language based on the corresponding relationship between words described in different languages, which is extracted by the inter-word corresponding relationship extraction unit 6.
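As a rough illustration of the two units described above, the following Python sketch treats a classification rule as a set of words and a document as the set of words extracted from it. The rule representation, the function names, the triple format of the word correspondences, the score threshold, and the sample words are all assumptions made for this sketch; the text does not specify them at this point.

```python
def matches(rule_words, doc_words):
    """A document matches a rule if at least one rule word appears in it."""
    return any(word in doc_words for word in rule_words)

def convert_rule(rule_words, correspondences, min_score=1.0):
    """Convert a rule into another language.

    correspondences: iterable of (source_word, target_word, score) triples,
    a hypothetical format for the extracted inter-word corresponding relationships.
    """
    return {target for source, target, score in correspondences
            if source in rule_words and score >= min_score}

# Illustrative data only.
correspondences = [("camera", "kamera", 1.5),
                   ("camera", "objektiv", 0.6),
                   ("printer", "drucker", 1.4)]
rule_en = {"camera"}
rule_de = convert_rule(rule_en, correspondences)
```

Filtering by score keeps only the correspondences most likely to be true equivalents, at the cost of leaving some rule words without a converted counterpart.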
According to the embodiment shown in
In an arrangement according to an embodiment shown in
That is, the dictionary storage unit 14 stores a dictionary that defines a word use method in the processing of the category generation unit 7 shown in
According to the embodiment shown in
As will be described later, one or a plurality of types of important words that are words on which importance is placed, unnecessary words that are words to be neglected, and synonyms that are combinations of words regarded as identical in processing such as document classification and category feature word extraction can be set as dictionary words in each dictionary stored in the dictionary storage unit 14. The dictionary setting unit 15 sets the dictionary words in the dictionary.
The dictionary conversion unit 16 converts a dictionary word described in a certain language and set in a dictionary into a dictionary word described in another language based on the corresponding relationship between words described in different languages, which is extracted by the inter-word corresponding relationship extraction unit 6.
As the language that describes the document, a row 602 shown in
As shown in
For example, the parent category of the category shown in
The parent category of the category shown in
A title such as “” (Digital camera) in a row 703 of
The data of each category sets documents classified into the category in the form of a classification rule or a document set. For example, in the category shown in
In the category shown in
In the category shown in
Processing of classifying a document by a classification rule is executed by the rule-based document classification unit 12 shown in
Each row such as a row 801 or a row 802 shown in
Similarly, the row 802 shown in
According to rows 804 and 805 shown in
An important word is a word on which importance is placed in processing such as document classification (to be described later). For example, when performing processing such as document classification by a method using word vectors, as in this embodiment, processing of, for example, doubling the weight of an important word in a word vector is performed. An unnecessary word is a word to be neglected in processing such as document classification. In this embodiment, processing of, for example, removing unnecessary words from word vectors and prohibiting them from being used as the dimensions of the word vectors is performed.
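The effect of important words and unnecessary words on a word vector can be sketched in Python as follows. The function name, the dictionary-based vector representation, and the sample words are assumptions; the doubling factor follows the example given in the text.

```python
def apply_dictionary(vector, important=(), unnecessary=(), boost=2.0):
    """Return a copy of a word vector adjusted by dictionary words.

    vector: {word: weight}. Unnecessary words are removed so that they are
    never used as dimensions; important words have their weight multiplied.
    """
    adjusted = {}
    for word, weight in vector.items():
        if word in unnecessary:
            continue  # dropped from the vector entirely
        adjusted[word] = weight * boost if word in important else weight
    return adjusted

# Illustrative data only.
adjusted = apply_dictionary({"invention": 3.0, "camera": 1.5, "lens": 1.0},
                            important={"camera"}, unnecessary={"invention"})
```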
When classifying, for example, a patent document, a word such as “invention” or “apparatus” rarely represents the contents of the patent. For this reason, in this embodiment, such words are defined as unnecessary words, as shown in
First, the word extraction unit 2 acquires a text from a document as the target of word extraction (step S1001). In the example shown in
Next, the word extraction unit 2 screens the morphemes to which predetermined parts of speech are added, thereby leaving only necessary morphemes and removing unnecessary morphemes (step S1003). In general, the word extraction unit 2 performs processing of leaving an independent word or a content word as a morpheme used for processing such as classification and removing a dependent word or a function word. This processing depends on the language.
If a morpheme is, for example, an English or Chinese verb, the word extraction unit 2 can leave this morpheme as a necessary morpheme. If a morpheme is a Japanese verb, the word extraction unit 2 can remove this morpheme as an unnecessary morpheme. The word extraction unit 2 may remove an English verb such as “have” or “make” as a so-called stop word.
Next, the word extraction unit 2 normalizes the expressions of the morphemes (step S1004). This processing also depends on the language. For example, if the extracted text is Japanese, the word extraction unit 2 may absorb an expression fluctuation between “” (combination) and “” (combination) or the like and handle them as the same morpheme. If the extracted text is English, the word extraction unit 2 may perform processing called stemming and handle morphemes including the same stem as the same morpheme.
The word extraction unit 2 obtains the appearance frequency (here, TF (Term Frequency)) in the document for each morpheme that is normalized in step S1004 (step S1005). Finally, the word extraction unit 2 outputs the combination of each morpheme normalized in step S1004 and its appearance frequency (step S1006).
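The flow of screening, normalization, and frequency counting can be sketched in Python for English text. This is a simplified stand-in, not the described implementation: a regular expression replaces morphological analysis, a small hand-written stop-word list replaces part-of-speech screening, and `naive_stem` is a crude suffix stripper rather than a real stemming algorithm. All names are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would screen by part of speech.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "have", "make"}

def naive_stem(token):
    """Strip a few common English suffixes, keeping at least four letters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def extract_words(text):
    """Return the term frequency (TF) of each normalized word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())       # stand-in for morphological analysis
    kept = [t for t in tokens if t not in STOP_WORDS]  # screening (step S1003)
    normalized = [naive_stem(t) for t in kept]         # normalization (step S1004)
    return Counter(normalized)                         # frequencies (steps S1005, S1006)

frequencies = extract_words("The cameras have lenses and the camera lens")
```

Because screening and normalization depend on the language, a multilingual system would need one such function per language, which matches the per-language word extraction units described earlier.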
First, the inter-word corresponding relationship extraction unit 6 acquires data stored in the inter-document corresponding relationship storage unit 5. Using the data, the inter-word corresponding relationship extraction unit 6 defines the set of corresponding relationships between documents dk belonging to a document set Dk in a language k and documents dl belonging to a document set Dl in a language l as Dkl={(dk,dl): dk∈Dk, dl∈Dl, dk↔dl}, where dk↔dl denotes that the documents dk and dl have a corresponding relationship (step S1101).
Next, the inter-word corresponding relationship extraction unit 6 obtains the union of words extracted by the word extraction unit 2 from each of the documents dk in the language k in Dkl for all documents dk in Dkl, thereby obtaining a word set Tk in the language k (step S1102). As a result, words in the language k included in the documents in Dkl and their appearance frequencies (here, DF (Document Frequencies)) are obtained.
For the language l as well, the inter-word corresponding relationship extraction unit 6 obtains the union of words extracted by the word extraction unit 2 from each of the documents dl in the language l in Dkl for all documents dl in Dkl, thereby obtaining a word set Tl in the language l (step S1103). Then, the inter-word corresponding relationship extraction unit 6 repetitively (step S1104) performs the following processes of steps S1105 to S1112 for each word tk in the word set Tk.
The inter-word corresponding relationship extraction unit 6 obtains a document frequency df(tk, Dkl) of the word tk in Dkl (step S1105). If the document frequency is equal to or higher than a predetermined threshold (YES in step S1106), the inter-word corresponding relationship extraction unit 6 repetitively (step S1107) performs the following processes of steps S1108 to S1112 for each word tl in the word set Tl.
The inter-word corresponding relationship extraction unit 6 obtains a document frequency df(tl, Dkl) of the word tl (step S1108). If the document frequency is equal to or higher than the predetermined threshold (YES in step S1109), the inter-word corresponding relationship extraction unit 6 performs the following process from step S1110.
If the document frequency df(tk, Dkl) of the word tk, that is, the number of documents in which the word appears is smaller than the predetermined threshold (for example, smaller than 5) (NO in step S1106), the inter-word corresponding relationship extraction unit 6 returns to step S1104, based on the fact that data necessary to accurately obtain the corresponding relationship between the word and that described in another language is insufficient in Dkl.
If the document frequency df(tl, Dkl) of the word tl, that is, the number of documents in which the word appears is smaller than the predetermined threshold (for example, smaller than 5) (NO in step S1109), the inter-word corresponding relationship extraction unit 6 returns to step S1107, based on the fact that data necessary to accurately obtain the corresponding relationship between the word and that described in another language is insufficient in Dkl.
If the document frequency df(tl, Dkl) is equal to or higher than the predetermined threshold (YES in step S1109), the inter-word corresponding relationship extraction unit 6 obtains a cooccurrence frequency df(tk, tl, Dkl) of the words tk and tl in Dkl. The cooccurrence frequency is the number of corresponding relationships between documents including the word tk and documents including the word tl. Using the cooccurrence frequency, the inter-word corresponding relationship extraction unit 6 also obtains a Dice coefficient representing the magnitude of cooccurrence of the words tk and tl in Dkl by
dice(tk,tl,Dkl)=df(tk,tl,Dkl)/(df(tk,Dkl)+df(tl,Dkl)) (1).
In addition, the inter-word corresponding relationship extraction unit 6 obtains a Simpson coefficient also representing the magnitude of cooccurrence in Dkl by
simp(tk,tl,Dkl)=df(tk,tl,Dkl)/min(df(tk,Dkl),df(tl,Dkl)) (2) (step S1110).
If each of the cooccurrence frequency df(tk, tl, Dkl), the Dice coefficient dice(tk, tl, Dkl), and the Simpson coefficient simp(tk, tl, Dkl) is equal to or more than a predetermined threshold (YES in step S1111), the inter-word corresponding relationship extraction unit 6 sets the relationship between the words tk and tl as a candidate of the corresponding relationship between the words. The inter-word corresponding relationship extraction unit 6 sets a score corresponding to the candidate of the corresponding relationship between the words to α*dice(tk,tl,Dkl)+β*simp(tk,tl,Dkl) (α and β are constants) (step S1112). Finally, the inter-word corresponding relationship extraction unit 6 outputs a plurality of thus obtained candidates of the corresponding relationship between the words in the descending order of score (step S1113).
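Steps S1101 to S1113 can be sketched in Python as follows. The sketch makes several assumptions: each document appears in exactly one corresponding pair, so that counting pairs equals counting documents; each pair is given as two sets of already extracted words; and the function name, parameter names, default thresholds, and sample words are hypothetical. Equation (1) is used as written in the text; the commonly cited Dice coefficient carries an additional factor of 2 in the numerator, which would only rescale the threshold and the score.

```python
from itertools import product

def extract_word_pairs(doc_pairs, min_df=2, min_cooc=2, min_dice=0.45,
                       min_simp=0.5, alpha=1.0, beta=1.0):
    """doc_pairs: list of (words_k, words_l) sets, one per corresponding document pair."""
    df_k, df_l, cooc = {}, {}, {}
    for words_k, words_l in doc_pairs:
        for tk in words_k:
            df_k[tk] = df_k.get(tk, 0) + 1            # df(tk, Dkl)
        for tl in words_l:
            df_l[tl] = df_l.get(tl, 0) + 1            # df(tl, Dkl)
        for tk, tl in product(words_k, words_l):
            cooc[(tk, tl)] = cooc.get((tk, tl), 0) + 1  # df(tk, tl, Dkl)

    candidates = []
    for (tk, tl), c in cooc.items():
        # Skip words with too little evidence (steps S1106 and S1109).
        if df_k[tk] < min_df or df_l[tl] < min_df:
            continue
        dice = c / (df_k[tk] + df_l[tl])       # equation (1), as written in the text
        simp = c / min(df_k[tk], df_l[tl])     # equation (2)
        if c >= min_cooc and dice >= min_dice and simp >= min_simp:
            candidates.append((tk, tl, alpha * dice + beta * simp))
    # Output in descending order of score (step S1113).
    return sorted(candidates, key=lambda item: item[2], reverse=True)

# Illustrative data only: four corresponding German/English document pairs.
pairs = [({"kamera", "objektiv"}, {"camera", "lens"}),
         ({"kamera", "blitz"}, {"camera", "flash"}),
         ({"kamera", "objektiv"}, {"camera", "lens"}),
         ({"drucker"}, {"printer"})]
result = extract_word_pairs(pairs)
```

The tiny thresholds are chosen only so that the four-pair example yields output; the text suggests a document frequency threshold of, for example, 5.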
In this embodiment, it is determined using the Dice coefficient and the Simpson coefficient based on the DF whether the relationship between the words tk and tl described in different languages is appropriate as equivalents or associated words. According to this method, the multilingual document classification apparatus can accurately extract the corresponding relationship between words using only a corresponding relationship on a document basis, that is, a rough corresponding relationship that is not a translation relationship on a sentence basis. However, this embodiment is not limited to the above-described method and equations, and another equation of, for example, a mutual information amount may be used, or a method considering the TF may be used.
As shown in
The score added to the corresponding relationship between the words quantitatively indicates the degree of appropriateness of the corresponding relationship. Hence, the multilingual document classification apparatus can also selectively use, for example, only corresponding relationships of high scores, that is, corresponding relationships representing correct equivalents with a high possibility depending on the application purpose.
In this processing, clustering is performed for a document set described in a certain language, thereby automatically generating categories (clusters) each including documents of similar contents.
First, the category generation unit 7 defines a document set in the language l that is the target of category generation as Dl, and sets the initial value of a category set Cl that is the result of category generation as an empty set (step S1301). The category generation unit 7 repetitively (step S1302) executes the following processes of steps S1303 to S1314 for each document dl of the document set Dl.
The category generation unit 7 obtains a word vector vdl of the document dl by words extracted from the document dl by the word extraction unit 2 (step S1303). A word vector is a vector that uses each word appearing in a document as a dimension of the vector and has the weight of each word as the value of the dimension of the vector. This word vector can be obtained using a conventional technique. The weight of each word of the word vector can be calculated by a method generally called TFIDF, as indicated by, for example,
tfidf(tl,dl,Dl)=tf(tl,dl)*log(|Dl|/df(tl,Dl)) (3)
where tf(tl, dl) is the TF for the word tl in the document dl, and df(tl, Dl) is the DF for the word tl in the document set Dl. Note that tf(tl, dl) may simply be the appearance count of the word tl in the document dl. Alternatively, tf(tl, dl) may be, for example, a value obtained by dividing the appearance count of each word by the sum of the appearance counts of all words appearing in the document dl and normalizing the quotient.
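Equation (3) can be sketched in Python as follows, taking tf(tl, dl) to be the raw appearance count. The function name and the representation of a document as a word-to-count dictionary are assumptions; the sketch also assumes that the document being weighted belongs to the document set, so that every document frequency is at least 1.

```python
import math

def tfidf_vector(doc_tf, doc_set):
    """Compute the TFIDF word vector of one document.

    doc_tf: {word: count} for the document dl.
    doc_set: list of such dicts for every document in Dl (must include dl).
    """
    n_docs = len(doc_set)
    vector = {}
    for word, tf in doc_tf.items():
        df = sum(1 for d in doc_set if word in d)   # df(tl, Dl)
        vector[word] = tf * math.log(n_docs / df)   # equation (3)
    return vector

# Illustrative data only.
docs = [{"camera": 2, "lens": 1},
        {"camera": 1, "printer": 3},
        {"printer": 1, "ink": 2},
        {"ink": 1}]
vec = tfidf_vector(docs[0], docs)
```

A word that appears in every document receives a weight of log(1) = 0, so it does not influence the similarity between documents.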
When obtaining a word vector for a subset Dcl (Dcl⊂Dl) of certain documents, the category generation unit 7 can calculate the weight of the word tl of the word vector as the sum of the weights of the words tl of the word vectors of the documents dl in Dcl, as indicated by
tfidf(tl,Dcl,Dl)=(Σ_{dl∈Dcl} tf(tl,dl))*log(|Dl|/df(tl,Dl)) (4).
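The reason equation (4) equals the sum of the per-document weights of equation (3) can be written out as a short derivation. The logarithmic factor depends only on the word and the whole document set, not on the individual document, so it can be taken outside the sum. The notation below simply restates the symbols of the text in LaTeX.

```latex
\begin{aligned}
\sum_{d_l \in D_{cl}} \mathrm{tfidf}(t_l, d_l, D_l)
  &= \sum_{d_l \in D_{cl}} \mathrm{tf}(t_l, d_l)\,
     \log\frac{|D_l|}{\mathrm{df}(t_l, D_l)} \\
  &= \Bigl(\sum_{d_l \in D_{cl}} \mathrm{tf}(t_l, d_l)\Bigr)
     \log\frac{|D_l|}{\mathrm{df}(t_l, D_l)} \\
  &= \mathrm{tfidf}(t_l, D_{cl}, D_l)
\end{aligned}
```

This is why a word vector for a set of documents can be maintained incrementally by adding the word vector of each newly classified document.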
Note that in the embodiment configured to use a dictionary, as described with reference to
Calculation in the category generation unit 7 is not limited to equation (3) or (4). More specifically, calculation for obtaining the weight of each word in the word vector suffices. If the same processing is performed, the calculation need not always be performed by the category generation unit 7.
Next, the category generation unit 7 sets the initial value of a classification destination category cmax of the document dl to “absent” and the initial value of a maximum value smax of the similarity between dl and cmax to 0 (step S1304). The category generation unit 7 repetitively (step S1305) executes the following processes of steps S1306 to S1308 for each category cl in the category set Cl.
The category generation unit 7 obtains a similarity s between the category cl and the document dl based on a cosine value cos(vcl, vdl) between a word vector vcl of the category cl and the word vector vdl of the document dl (step S1306).
If the similarity s is equal to or more than a predetermined threshold and more than smax (YES in step S1307), the category generation unit 7 sets cmax=cl and smax=s (step S1308).
If the category cmax exists (YES in step S1309) as the result of the repetitive process (step S1305), the category generation unit 7 classifies the document dl into the category cmax (step S1310). Then, the category generation unit 7 adds the word vector vdl of the document dl to a word vector vcmax of the category cmax (step S1311). As a result, a weight by the TF of the document dl is added to the weight of each word of the word vector vcmax, as indicated by equation (4).
On the other hand, if the category cmax does not exist (NO in step S1309), the category generation unit 7 newly creates a category cnew and adds it to the category set Cl (step S1312). The category generation unit 7 classifies the document dl into the category cnew (step S1313) and sets a word vector vcnew of the category cnew as the word vector vdl of the document dl (step S1314).
As the result of the repetitive process (step S1302), categories as the result of clustering the document set are generated in the category set Cl. The category generation unit 7 deletes, out of the generated categories, categories in which the number of documents is smaller than a predetermined threshold (step S1315). That is, for example, a category including only one document is meaningless. The category generation unit 7 removes such categories from the category generation result.
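The single-pass clustering of steps S1301 to S1315 can be sketched in Python as follows. The function names, the dictionary representation of a category, the default thresholds, and the sample vectors are assumptions; the word vectors are taken as already computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(doc_vectors, sim_threshold=0.4, min_docs=2):
    """Assign each document to its most similar category or open a new one."""
    categories = []  # each category: {"vector": {...}, "docs": [document indices]}
    for index, vd in enumerate(doc_vectors):
        best, best_sim = None, 0.0                       # step S1304
        for cat in categories:                           # step S1305
            s = cosine(cat["vector"], vd)                # step S1306
            if s >= sim_threshold and s > best_sim:      # step S1307
                best, best_sim = cat, s                  # step S1308
        if best is None:
            # No sufficiently similar category: create one (steps S1312 to S1314).
            categories.append({"vector": dict(vd), "docs": [index]})
        else:
            # Classify and add the document vector to the category (S1310, S1311).
            best["docs"].append(index)
            for word, weight in vd.items():
                best["vector"][word] = best["vector"].get(word, 0.0) + weight
    # Delete categories with too few documents (step S1315).
    return [c for c in categories if len(c["docs"]) >= min_docs]

# Illustrative data only.
vectors = [{"camera": 1.0, "lens": 1.0},
           {"printer": 1.0, "ink": 1.0},
           {"camera": 1.0, "flash": 1.0},
           {"scanner": 1.0}]
clusters = cluster(vectors)
```

Because each document is compared only with the categories that exist when it is processed, the result of this kind of single-pass clustering depends on the order of the documents.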
In addition, for each generated category cl, the category generation unit 7 sets the title of the category using the word vector vcl (step S1316). The category generation unit 7 sets the title by, for example, selecting one or a plurality of words of largest weights out of the word vectors of the category. For example, in the example shown in
This processing is executed as the processes of step S1504 (inter-category corresponding relationship extraction unit 8) of
To determine the similarity of contents between such various categories, processing shown in
Note that in the first embodiment corresponding to
In the word vector generation processing, first, the multilingual document classification apparatus repetitively (step S1401) executes the following processes of steps S1402 to S1406 for each language l out of a plurality of languages. The apparatus defines a document set in the language l classified into a category c as Dcl (step S1402). The document set Dcl may be an empty set depending on the category c and the language l. Next, the apparatus sets the initial value vcl of a word vector in the language l in the category c to an empty vector (all dimensions have a weight of 0) (step S1403).
Next, in the word vector generation processing, the multilingual document classification apparatus repetitively (step S1404) obtains the word vector vdl of the document dl for each document dl in the document set Dcl (step S1405). In the word vector generation processing, the multilingual document classification apparatus adds the word vector vdl of the document dl to the word vector vcl in the language l in the category c (see equation (4)) (step S1406). In the above-described way, the word vectors in each language l are generated first based on the document set Dcl itself in the language l, which is actually classified into the category c. However, if the document set Dcl is an empty set, as described above, the word vectors vcl are empty vectors as well.
Next, in the word vector generation processing, the multilingual document classification apparatus repetitively (step S1407) executes the following processes of steps S1408 to S1413 again for each language l out of the plurality of languages. In the word vector generation processing, the multilingual document classification apparatus sets a word vector vcl′ in the language l in the category c to an empty vector (step S1408). The word vector vcl′ is different from the word vector vcl obtained in step S1405. In the word vector generation processing, first, the word vector vcl is added to the word vector vcl′ (step S1409).
Next, in the word vector generation processing, the multilingual document classification apparatus repetitively (step S1410) executes the following processes of steps S1411 to S1413 for each language k other than the language l. In the word vector generation processing, the multilingual document classification apparatus acquires the corresponding relationship between words in the languages k and l by the processing shown in
Then, in the word vector generation processing, the multilingual document classification apparatus converts a word vector vck in the language k in the category c into a word vector vckl in the language l (step S1412). In the corresponding relationship between words acquired in step S1411, the word tk in the language k, the word tl in the language l, and the score of the corresponding relationship between them are obtained, as described with reference to
Using the acquired result, the multilingual document classification apparatus obtains the weight of the word tl of the word vector vckl in the language l by equation (5), where the sum is taken over the words tk in the language k:
weight(vckl,tl)=Σtk(weight(vck,tk)*score(tk,tl)) (5)
Here, the weight weight(vck, tk) of the word tk of the word vector vck may be TFIDF described concerning equation (4). The score score(tk, tl) of the corresponding relationship between the words tk and tl may be α*dice(tk,tl,Dkl)+β*simp(tk,tl,Dkl) described with reference to
In the word vector generation processing, the multilingual document classification apparatus thus adds the word vector vckl obtained by converting the word vector in the language k into the language l to the word vector vcl′ (step S1413).
The word vectors vcl′ in the language l in the category c are generated by the repetitive process of step S1410. Additionally, the word vectors in all languages in the category c are generated by the repetitive process of step S1407.
As is apparent from the above explanation, even for a category into which, for example, only Japanese documents are classified, the multilingual document classification apparatus can generate a word vector in English or a word vector in Chinese using the corresponding relationship between a Japanese word and an English word or the corresponding relationship between a Japanese word and a Chinese word.
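The second loop (steps S1407 to S1413), including the conversion of equation (5), can be sketched as below. This is a minimal sketch under stated assumptions: the function name, the dictionary layout of the word correspondence table, and the romanized placeholder token in the usage note are hypothetical and chosen only for illustration.

```python
def category_word_vectors(own_vectors, word_maps):
    """Build vcl' for every language l of one category c.

    own_vectors: dict lang -> word vector vcl built from the category's own
                 documents (may be an empty dict).
    word_maps:   dict (k, l) -> dict tk -> dict tl -> score(tk, tl).
    """
    result = {}
    for l, vcl in own_vectors.items():                # loop of step S1407
        v = dict(vcl)                                 # steps S1408 and S1409
        for k, vck in own_vectors.items():            # loop of step S1410
            if k == l:
                continue
            mapping = word_maps.get((k, l), {})       # step S1411
            for tk, weight in vck.items():            # step S1412, equation (5)
                for tl, score in mapping.get(tk, {}).items():
                    # Step S1413: accumulate the converted weight into vcl'.
                    v[tl] = v.get(tl, 0.0) + weight * score
        result[l] = v
    return result
```

For instance, with `own_vectors = {"ja": {"kao": 2.0}, "en": {}}` and a correspondence `{("ja", "en"): {"kao": {"face": 0.5}}}`, the English vector of a category that holds only Japanese documents becomes `{"face": 1.0}`.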
The processing from step S1408 to step S1413 of
This processing extracts the corresponding relationship between each category cl of a certain category set Cl and each category ck of another category set Ck. In particular, this processing aims at extracting a corresponding relationship based on the similarity of contents between categories into which documents described in different languages are classified. The languages of documents classified into the categories of the category sets Ck and Cl are not particularly limited in the processing of
The inter-category corresponding relationship extraction unit 8 sets the corresponding category set whose corresponding relationship with the category set Ck is to be obtained as Cl (step S1501). The inter-category corresponding relationship extraction unit 8 repetitively (step S1502) executes the following processes of steps S1503 to S1509 for each category ck of the category set Ck.
First, the inter-category corresponding relationship extraction unit 8 sets the initial value of the category cmax corresponding to the category ck to “absent”, and sets the maximum value smax of the similarity between the categories ck and cmax to 0 (step S1503).
Next, the inter-category corresponding relationship extraction unit 8 obtains a word vector vckk′ in the language k in the category ck and a word vector vckl′ in the language l (step S1504). The process of step S1504 is performed by the processing described with reference to
The inter-category corresponding relationship extraction unit 8 first obtains the word vector vclk′ in the language k in the category cl and a word vector vcll′ in the language l (step S1506). The process of step S1506 is performed by the processing described with reference to
The inter-category corresponding relationship extraction unit 8 then obtains the similarity between the categories ck and cl as s=cos(vckk′, vclk′)+cos(vckl′, vcll′) using the word vectors obtained in steps S1504 and S1506 (step S1507). That is, the inter-category corresponding relationship extraction unit 8 obtains the similarity between the categories as the sum of the cosine value between the word vectors in the language k and the cosine value between the word vectors in the language l.
If the similarity s is equal to or more than a predetermined threshold and more than smax (YES in step S1508), the inter-category corresponding relationship extraction unit 8 sets category cmax=cl and smax=s (step S1509). If the category cmax exists after the repetitive process of step S1505, the inter-category corresponding relationship extraction unit 8 determines the category cmax as the category corresponding to the category ck (step S1510). That is, the inter-category corresponding relationship extraction unit 8 obtains cmax as the category assumed to have contents most similar to those of the category ck out of the category set Cl. In this case, the similarity (score) of the corresponding relationship is smax.
Note that although the score of the corresponding relationship between the categories ck and cl is obtained in step S1507 as the sum of the cosine values between the word vectors in the languages k and l, the method of obtaining the score is not limited to this. For example, the inter-category corresponding relationship extraction unit 8 may calculate the score as the maximum of the cosine value between the word vectors in the language k and the cosine value between the word vectors in the language l, that is, s=max(cos(vckk′, vclk′), cos(vckl′, vcll′)).
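The search for the corresponding category (steps S1503 to S1510) can be illustrated as follows. This is a sketch only: the function names, the data layout, and the default threshold are assumptions, and the `combine` parameter is a hypothetical switch between the sum and the maximum of the two cosine values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse word vectors (dict: word -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def match_category(ck_vecs, cl_candidates, k, l, threshold=0.3, combine=sum):
    """ck_vecs: dict lang -> word vector of category ck (vckk', vckl').
    cl_candidates: dict category name -> dict lang -> word vector.
    Returns (cmax, smax); cmax is None when no category qualifies."""
    cmax, smax = None, 0.0                                   # step S1503
    for name, cl_vecs in cl_candidates.items():              # loop of step S1505
        s = combine([cosine(ck_vecs[k], cl_vecs[k]),         # step S1507
                     cosine(ck_vecs[l], cl_vecs[l])])
        if s >= threshold and s > smax:                      # step S1508
            cmax, smax = name, s                             # step S1509
    return cmax, smax                                        # step S1510
```

Passing `combine=max` gives the alternative score s=max(cos(vckk′, vclk′), cos(vckl′, vcll′)).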
Each row such as a row 1601 or a row 1602 in
As described concerning step S1316 of
The categories for which an appropriate corresponding relationship has been obtained may be integrated using the category operation unit 4 shown in
In this example, the category titles are connected in the form of “-face-detect”, as indicated by a row 1603 in
According to this arrangement, for example, when classifying a document set in which Japanese documents, English documents, and Chinese documents coexist, a classification structure used to cross-lingually classify these documents based on the similarity between the contents can efficiently be created. That is, the multilingual document classification apparatus first performs clustering of the document set of Japanese, English, and Chinese documents separately on a language basis and automatically generates categories to classify the documents of similar contents in each language.
Next, the multilingual document classification apparatus extracts the corresponding relationship between words described in different languages based on the corresponding relationship between documents described in different languages. Here, the corresponding relationship between documents described in different languages is an equivalent relationship or a relationship close to it. As a detailed example, when classifying patent documents, the corresponding relationship between a Japanese patent and a U.S. patent that are linked by a right of priority or by an international patent application is extracted.
As the extracted corresponding relationship between words, for example, a corresponding relationship close to an equivalent relationship like the corresponding relationship between a Japanese word “”, an English word “character”, and a Chinese word “” is automatically obtained. The multilingual document classification apparatus automatically extracts the corresponding relationship between categories described in different languages based on the corresponding relationship between words.
The multilingual document classification apparatus cross-lingually integrates the categories whose corresponding relationship has been obtained, thereby creating categories to classify documents of similar contents independently of the languages such as Japanese, English, and Chinese.
Processing according to the embodiment shown in
As a conventional technique, a case-based classification (automatic supervised classification) technique has been implemented. In this technique, using a document already classified into a category as a classification case (supervisor document), it is determined based on the document whether to classify an unclassified document into the category. However, according to the processing shown in
In the procedure of the processing shown in
Next, the case-based document classification unit 9 repetitively (step S1705) executes the following processes of steps S1706 to S1711 for each document dl (document described in the language l) of the document set D.
First, the case-based document classification unit 9 obtains the word vector vdl of the document dl in the language l (step S1706). This processing is performed by obtaining the weight of each word in the language l using equation (3).
Then, the case-based document classification unit 9 repetitively (step S1707) executes the following processes of steps S1708 to S1711 for each category c of the category set C.
First, if the document dl is not classified into the category c yet (NO in step S1708), the case-based document classification unit 9 obtains the similarity s between the category c and the document dl as s=cos(vcl′,vdl) based on the cosine value of the word vectors (step S1709). The word vector vdl of the document dl is the word vector in the language l. For this reason, as the word vector of the category whose similarity to the document is to be obtained, the word vector vcl′ in the same language l is used. This is the word vector obtained for the language l by the case-based document classification unit 9 out of the word vectors obtained for the respective languages in step S1704.
If the similarity s is equal to or more than a predetermined threshold (YES in step S1710), the case-based document classification unit 9 classifies the document dl into the category c (step S1711). The processes of steps S1710 and S1711 can be modified. For example, the case-based document classification unit 9 may classify the document into only the one category having the maximum similarity, or into at most three categories selected in descending order of similarity.
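Steps S1706 to S1711, together with the "top one" and "top three" modifications, can be sketched as follows. The function names, the parameters `threshold`, `top_n`, and `already`, and the data layout are hypothetical and serve only to illustrate the flow.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse word vectors (dict: word -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(vd, lang, category_vectors, threshold=0.2, top_n=None, already=()):
    """vd: word vector vdl of the document; lang: its language l.
    category_vectors: dict category name -> dict lang -> vcl'.
    already: categories the document is classified into already (step S1708).
    Returns category names in descending order of similarity."""
    scored = []
    for name, vecs in category_vectors.items():       # loop of step S1707
        if name in already:
            continue
        s = cosine(vecs.get(lang, {}), vd)            # step S1709, same language l
        if s >= threshold:                            # step S1710
            scored.append((s, name))
    scored.sort(reverse=True)
    if top_n is not None:                             # modification: keep the best n
        scored = scored[:top_n]
    return [name for _, name in scored]               # step S1711
```

Calling `classify(..., top_n=1)` keeps only the most similar category, and `top_n=3` keeps at most three.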
In the processing of
According to this arrangement, after only several documents in the native language that the user can easily understand (for example, Japanese documents) are manually classified into a category, the multilingual document classification apparatus can automatically classify English or Chinese documents having similar contents into the category based on the classification cases of the Japanese documents, that is, the supervisor documents.
Processing according to the embodiment shown in
A feature word of a category is a characteristic word representing the contents of documents classified into the category. The feature word is automatically extracted from each category for the purpose of, for example, allowing the user to easily understand what kind of documents are classified into each category.
In the processing shown in
Next, for each word tcl of the word set Tcl, the category feature word extraction unit 10 repetitively (step S1802) obtains the score of tcl by equation (6) (step S1803), where t denotes the word tcl:
mi(t,Dcl,Dl)=df(t,Dcl)/|Dl|*log(df(t,Dcl)*|Dl|/df(t,Dl)/|Dcl|)
+(df(t,Dl)−df(t,Dcl))/|Dl|*log((df(t,Dl)−df(t,Dcl))*|Dl|/df(t,Dl)/(|Dl|−|Dcl|))
+(|Dcl|−df(t,Dcl))/|Dl|*log((|Dcl|−df(t,Dcl))*|Dl|/(|Dl|−df(t,Dl))/|Dcl|)
+(|Dl|−df(t,Dl)−|Dcl|+df(t,Dcl))/|Dl|*log((|Dl|−df(t,Dl)−|Dcl|+df(t,Dcl))*|Dl|/(|Dl|−df(t,Dl))/(|Dl|−|Dcl|)) (6)
If df(t,Dcl)/df(t,Dl)≦|Dcl|/|Dl|, then mi(t,Dcl,Dl)=0.
Here, using a mutual information amount, the category feature word extraction unit 10 obtains the score of the feature word based on the strength of correlation between an event representing whether a document has been classified into a category and an event representing whether the word tcl appears in the document. The event representing whether a document has been classified into a category equals an event representing whether a document is included in the document set Dcl.
Dl in equation (6) is the universal set of documents described in the language l (Dl⊇Dcl in general, and Dl⊃Dcl in many cases). A word and a category may have a negative correlation. To exclude this correlation, when df(tcl,Dcl)/df(tcl,Dl)≦|Dcl|/|Dl|, the category feature word extraction unit 10 sets the score to 0, as indicated by the proviso of equation (6).
Finally, the category feature word extraction unit 10 selects a predetermined number of (for example, 10) words tcl in descending order of score, and sets the result as the feature words in the language l in the category c (step S1804).
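The scoring of steps S1803 and S1804 can be sketched as below. This is a minimal sketch, not the unit's actual implementation: the function and parameter names are hypothetical, zero-count cells are treated as contributing 0 (an assumption), and the third term uses |Dcl| as the category marginal, following the standard definition of mutual information over the two binary events.

```python
import math

def mi_score(df_c, df_all, n_c, n_all):
    """Mutual-information score of a word for a category.

    df_c   = df(t, Dcl): documents of the category containing the word
    df_all = df(t, Dl):  documents of the language containing the word
    n_c    = |Dcl|, n_all = |Dl|
    """
    # Proviso of equation (6): zero score unless the correlation is positive.
    if df_all == 0 or n_c == 0 or df_c * n_all <= df_all * n_c:
        return 0.0

    def term(joint, word_marginal, cat_marginal):
        # One cell of the 2x2 contingency table; an empty cell contributes 0.
        if joint == 0:
            return 0.0
        return joint / n_all * math.log(joint * n_all / (word_marginal * cat_marginal))

    return (term(df_c, df_all, n_c)                                  # word present, in category
            + term(df_all - df_c, df_all, n_all - n_c)               # word present, not in category
            + term(n_c - df_c, n_all - df_all, n_c)                  # word absent, in category
            + term(n_all - df_all - n_c + df_c, n_all - df_all, n_all - n_c))

def top_feature_words(scores, n=10):
    """Step S1804: the n words with the largest scores, in descending order."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

When a word appears in every document of a category and in no other document, and the category holds half of the documents, the score equals log 2, the largest value mutual information can take for two binary events.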
According to the processing described with reference to
In the processing shown in
As in step S1901, the category feature word conversion unit 11 obtains a feature word set Tcl in the language l in the category c using the result of processing shown in
Next, the corresponding relationship between words in the language k and those in the language l is obtained by the category feature word conversion unit 11 and the inter-word corresponding relationship extraction unit 6 (processing of
The category feature word conversion unit 11 repetitively (step S1905) executes the following processes of steps S1906 to S1910 for each feature word tck of the feature word set Tck.
First, the category feature word conversion unit 11 obtains the word tcl in the language l corresponding to the feature word tck using the corresponding relationship between words acquired in step S1903. In general, zero or more words tcl can exist. Hence, the category feature word conversion unit 11 defines a combination of the feature words tck and tcl as pckl, including the case where no corresponding word tcl exists (step S1906).
The category feature word conversion unit 11 obtains the score of pckl. The score of tck as a feature word is obtained by the process of step S1901.
The score of tcl as a feature word is available when the feature word tcl is included in the feature word set Tcl obtained in step S1902. However, the score of a word tcl that is not included in the feature word set Tcl is 0. Considering the above case, the category feature word conversion unit 11 sets the score of pckl as the maximum value of the score of the feature word tck and the score of the feature word tcl (step S1907).
Next, the category feature word conversion unit 11 checks whether words in the language k or l overlap between an already created combination qckl and the combination pckl created this time in a set Pckl of feature word combinations (step S1908).
If qckl in which the words overlap exists (YES in step S1908), the category feature word conversion unit 11 integrates pckl into qckl. For example, when pckl=({tck1},{tcl1,tcl2}) and qckl=({tck2},{tcl2,tcl3}), the feature word tcl2 in the language l is shared by pckl and qckl. Hence, the category feature word conversion unit 11 integrates them to obtain qckl=({tck1,tck2},{tcl1,tcl2,tcl3}). The score of qckl after the integration is the larger of the scores of qckl and pckl before the integration, that is, the maximum value of the scores of the feature words tck1, tck2, tcl1, tcl2, and tcl3 (step S1909).
On the other hand, if qckl in which the words overlap those of pckl does not exist (NO in step S1908), the category feature word conversion unit 11 adds pckl to Pckl (step S1910). After the repetitive process of step S1905, the category feature word conversion unit 11 outputs the combinations of feature words in Pckl in descending order of score (step S1911).
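The combination and integration of feature words in steps S1905 to S1911 can be illustrated as follows. The function name and data layout are assumptions; in addition, this sketch merges every existing combination that overlaps the new one, which is one possible reading of step S1909 when more than one overlap occurs.

```python
def combine_feature_words(scores_k, scores_l, word_map):
    """scores_k, scores_l: dict word -> feature word score in language k / l.
    word_map: dict tk -> list of corresponding words tl.
    Returns a list of [words_k, words_l, score] sorted by descending score."""
    combos = []
    for tk, score_k in scores_k.items():                       # loop of step S1905
        tls = set(word_map.get(tk, []))                        # step S1906 (may be empty)
        # Step S1907: a word tl outside the feature word set Tcl scores 0.
        score = max([score_k] + [scores_l.get(tl, 0.0) for tl in tls])
        p = [{tk}, tls, score]
        # Step S1908: look for combinations sharing a word in either language.
        overlapping = [q for q in combos if q[0] & p[0] or q[1] & p[1]]
        for q in overlapping:                                  # step S1909
            p[0] |= q[0]
            p[1] |= q[1]
            p[2] = max(p[2], q[2])
            combos.remove(q)
        combos.append(p)                                       # step S1910
    return sorted(combos, key=lambda c: c[2], reverse=True)    # step S1911
```

Applied to the example in the text, the combinations ({tck1},{tcl1,tcl2}) and ({tck2},{tcl2,tcl3}) collapse into ({tck1,tck2},{tcl1,tcl2,tcl3}) with the largest of the individual scores.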
As shown in
According to this arrangement, from, for example, a category into which many Chinese documents are classified, a Chinese feature word is automatically extracted as the feature word of the category. Next, the feature word is automatically converted into a Japanese or English feature word. The user can use the feature word described in a language that is easy for the user to understand and can therefore easily grasp the contents of the category.
Processing according to the embodiment shown in
As described with reference to
First, the classification rule conversion unit 13 acquires the corresponding relationship between words in the languages k and l from the inter-word corresponding relationship extraction unit 6 (corresponding to the processing of
Next, the classification rule conversion unit 13 repetitively (step S2102) executes the following processes of steps S2103 to S2106 for an element (in the example of
The classification rule conversion unit 13 first determines, using the corresponding relationship between words acquired in step S2101, whether the word tl in the language l corresponding to the word tk in an element rk of the classification rule exists (step S2103).
If the word tl exists (YES in step S2103), the classification rule conversion unit 13 creates an element rl by replacing the word tk of rk with the word tl (step S2104). In the example of
In the process from step S2105 of
If the word tk′ exists (YES in step S2105), the classification rule conversion unit 13 creates an element rk′ by replacing the word tl of the element rl created in step S2104 with the word tk′ (step S2106). In the example indicated by the row 712 of
The classification rule conversion unit 13 replaces the portion of rl of the classification rule with (rl OR rk′). In this case, the element rk of the original classification rule is eventually replaced with (rk OR rl OR rk′).
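The expansion of one rule element into (rk OR rl OR rk′) in steps S2103 to S2106 can be sketched as below. This is a simplified illustration: a rule element is modeled as a single word test, and the function names, the correspondence tables, and the romanized placeholder words in the usage note are all hypothetical.

```python
def expand_rule_word(tk, k_to_l, l_to_k):
    """Expand the word tk of one rule element rk.

    k_to_l: dict word in language k -> list of corresponding words in language l.
    l_to_k: dict word in language l -> list of corresponding words in language k.
    Returns (words_k, words_l): tk plus any tk' (language k), and every tl (language l).
    """
    words_l = list(k_to_l.get(tk, []))           # steps S2103 and S2104: rl
    words_k = [tk]                               # the original element rk
    for tl in words_l:                           # steps S2105 and S2106: rk'
        for tk2 in l_to_k.get(tl, []):
            if tk2 not in words_k:
                words_k.append(tk2)
    return words_k, words_l

def matches(doc_words, words_k, words_l):
    """True if the document satisfies (rk OR rl OR rk')."""
    return any(w in doc_words for w in words_k + words_l)
```

For example, with `k_to_l = {"angou": ["encrypt"]}` and `l_to_k = {"encrypt": ["angou", "angouka"]}`, the element for "angou" expands to the words "angou", "angouka", and "encrypt", so a document in either language that contains any of them satisfies the converted rule.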
A classification rule indicated by a row 2202 of
According to this arrangement, the multilingual document classification apparatus creates a classification rule to classify a document including, for example, a Japanese word “” into a certain category and then converts the classification rule into English or Chinese. This makes it possible to classify a document including an equivalent or related term of the Japanese word “”, for example, an English word “encrypt” or a Chinese word “” into the category.
Processing according to the embodiment shown in
As described with reference to
In the processing shown in
The dictionary conversion unit 16 first determines, using the corresponding relationship between words acquired in step S2301, whether the word tl in the language l corresponding to the dictionary word tk exists (step S2303). If the word tl exists (YES in step S2303), the dictionary conversion unit 16 employs the word tl as a dictionary word. The dictionary conversion unit 16 sets the type (important word, unnecessary word, synonym, or the like) of the dictionary word to the same type as the dictionary word tk. If a plurality of words tl corresponding to the one dictionary word tk exist, the dictionary conversion unit 16 sets these words to synonyms (step S2304).
A row 2401 of
A row 2402 of
A row 2403 of
As indicated by a row 2404 of
Note that if the conversion of synonyms yields one word or no word (that is, if no corresponding word exists in the conversion destination language or if the words are all converted into a single word), the entry no longer has any meaning as a synonym. Hence, the dictionary conversion unit 16 may delete the synonym from the converted dictionary.
Next, the dictionary conversion unit 16 performs processing of extending the synonyms of the dictionary in the language k as the conversion source. This processing is not essential. The dictionary conversion unit 16 determines, using the corresponding relationship between words acquired in step S2301, whether a word tk′ (a word different from tk) in the language k corresponding to the word tl in the language l exists (step S2305). If the word tk′ exists (YES in step S2305), the dictionary conversion unit 16 sets the original word tk and the word tk′ in the language k as synonyms (step S2306).
For example, the English important word “exposure” indicated by the row 2402 of
According to this arrangement, the multilingual document classification apparatus can efficiently create, for example, a dictionary suitable for classifying English or Chinese documents from a dictionary created for the purpose of appropriately classifying Japanese documents.
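The dictionary conversion of steps S2303 to S2306 can be sketched as follows. This is a simplified illustration that handles only single-word entries such as important words and unnecessary words; the function name, the tuple layout, and the romanized placeholder words in the usage note are hypothetical.

```python
def convert_dictionary(entries, k_to_l, l_to_k):
    """entries: list of (word in language k, type), e.g. ("rokou", "important").
    k_to_l, l_to_k: word correspondence tables (dict word -> list of words).
    Returns (converted, synonyms_l, synonyms_k)."""
    converted, synonyms_l, synonyms_k = [], [], []
    for tk, kind in entries:
        tls = k_to_l.get(tk, [])                       # step S2303
        for tl in tls:
            converted.append((tl, kind))               # step S2304: same type as tk
        if len(tls) > 1:
            synonyms_l.append(set(tls))                # several tl become synonyms
        # Step S2305: words tk' in language k, other than tk, reachable through tl.
        extra = {tk2 for tl in tls for tk2 in l_to_k.get(tl, []) if tk2 != tk}
        if extra:
            synonyms_k.append({tk} | extra)            # step S2306
    return converted, synonyms_l, synonyms_k
```

With `entries = [("rokou", "important")]`, `k_to_l = {"rokou": ["exposure", "expose"]}`, and `l_to_k = {"exposure": ["rokou", "roshutsu"]}`, both English words become important words and synonyms of each other, and the source dictionary gains the synonym pair "rokou" and "roshutsu".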
In the embodiments, the above-described functions can be implemented using only the corresponding relationship between documents described in different languages that are included in the very document set to be classified. It is therefore unnecessary to prepare a bilingual dictionary or the like in advance. In addition, when an existing general-purpose bilingual dictionary is used, appropriate equivalents need to be selected in accordance with the document to be classified. In this embodiment, however, a word corresponding relationship extracted from the document set to be classified itself is used. Hence, the multilingual document classification apparatus need not select equivalents. Furthermore, the multilingual document classification apparatus can avoid using inappropriate equivalents.
As a consequence, the multilingual document classification apparatus can accurately implement processing of automatically extracting the cross-lingual corresponding relationship between categories or processing of automatically cross-lingually classifying a document. If the above-described classification rule or dictionary word is converted by a conventional method using a general-purpose bilingual dictionary, an inappropriate classification rule or dictionary word is often created. In this embodiment, such a problem does not arise, and the multilingual document classification apparatus can obtain a classification rule or dictionary word to appropriately classify the document to be classified.
While a certain embodiment has been described, this embodiment has been presented by way of example only, and is not intended to limit the scope of the inventions. Indeed, the novel embodiment described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2012-183534 | Aug 2012 | JP | national |
This application is a Continuation application of PCT Application No. PCT/JP2013/072481, filed Aug. 22, 2013 and based upon and claiming the benefit of priority from Japanese Patent Application No. 2012-183534, filed Aug. 22, 2012, the entire contents of all of which are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2013/072481 | Aug 2013 | US |
Child | 14627734 | | US |