The present invention relates to a classification method that determines the class of a new data instance (e.g. a text document) using a naive Bayes classifier.
The naive Bayes classifier is still a popular method for classification, especially in text classification, where it often performs on par with the Support Vector Machine (SVM) classifier (see Non-Patent Document 1). One advantage of the naive Bayes classifier is that it can be interpreted as a generative model, which can easily be extended to model more complex relations (e.g. see Non-Patent Document 3).
In order to learn a naive Bayes classifier, for each class z, we determine the distribution of words that occur in documents belonging to class z. Let us denote the word distribution for class z as θz, and the probability of a specific word w in class z as θw|z. Often this distribution is modeled by a multinomial distribution. In order to classify a new text, the probability of class z given the new text is calculated by multiplying the probabilities θw|z for each word w in the new document.
Note that the naive Bayes classifier estimates the probabilities θw|z using only the training data instances (instances with known class). However, words like “I” or “the”, which occur often in many documents independently of the class, often introduce noise, and in this way the estimates of θw|z become unreliable. One approach is to use a stop-word list to filter out such words. However, such a stop-word list is static and depends on the domain of the documents. Another approach is to weight the words by their inverse document frequency, as suggested, for example, in Non-Patent Document 1. However, when assigning these weights, the interpretation of the naive Bayes classifier as a generative model is lost. As a consequence, the weights and their interaction with the parameters of the naive Bayes classifier cannot be learned jointly. Therefore, the weights are either fixed or must be tuned using part of the training data (for example, by using cross-validation).
Another line of research tries to improve classification accuracy by additionally using instances (e.g. text documents) for which the class is not known. In contrast to training data instances (instances with known class), such additional instances are often available in large quantities. For example, in contrast to a few newspaper articles that are manually annotated with a class (e.g. whether the article is about “Animals” or about “Computer”), there is a vast number of newspaper articles for which no such class information is available (unlabeled instances). Such an approach to learning a classifier is often referred to as “semi-supervised”. The method in Non-Patent Document 2 describes such a semi-supervised approach that can improve the estimation of the probabilities θw|z by using unlabeled instances. Using the Expectation Maximization (EM) algorithm to assign class probabilities to unlabeled instances, the authors are able to estimate θw|z for words w that occur in the unlabeled corpus but do not occur in the training data. However, their approach does not provide a solution to the problem of high-frequency words.
The naive Bayes model is not able to down-weight high-frequency words like “I” or “the” that are often irrelevant for determining the class of a document. However, due to the small sample of training data instances, these irrelevant words might by chance occur more often in one class than in the other. As a consequence, for high-frequency words the probabilities θw|z are not spread evenly over all classes z, and thus some documents are wrongly classified due to the presence of high-frequency words.
To overcome the above problem, we propose an extended generative model of the naive Bayes classifier. The extended model introduces a background distribution γ, which is set to the frequency distribution of the words in the whole corpus. The whole corpus includes the training data and can additionally include all other instances for which no class information is available. Under the proposed model, any word in the document is sampled either from the distribution θz defined by its class z or from the background distribution γ. As a result, words, and especially high-frequency words, can be explained by the background distribution γ rather than by any distribution θz. In order to decide whether a word is sampled from the distribution θz or from the distribution γ, we introduce a binary indicator variable d, one for each word in the document. The prior probability of the variable d controls how likely it is that a word is sampled from γ, and in this way controls the impact of high-frequency words on the classification result. The formulation as a generative model allows us to learn this prior probability efficiently using all instances (labeled and unlabeled); thus, this prior probability does not need to be tuned manually.
The present invention has the effect of reducing the impact of high-frequency words on the classification result of a naive Bayes classifier. High-frequency words often tend to be less informative than middle- or low-frequency words. The proposed method takes this into account by explaining the high-frequency words by a background distribution (the word frequency distribution of the whole corpus) rather than by the word distribution of any individual class. The proposed method extends the generative model of the naive Bayes classifier, and the additional parameters can be learned from unlabeled data (i.e., there is no need for cross-validation or additional training data).
We demonstrate the proposed idea by extending the naive Bayes classifier for text classification. Given the class z of a document, we assume that each word in the document is independently generated from a distribution θz. A popular choice for this distribution is the categorical distribution (=multinomial distribution for one word occurrence). Using the plate notation, we depict the model in
Where θz is the parameter vector of the categorical distribution, with Σwθw|z=1; and p(z) is the prior probability of class z. Accordingly, we have
Let us denote by θ the parameter vectors θz, for all classes z. Given a collection of texts D={(t1,z1), . . . , (tn,zn)} with known classes, stored in a non-transitory computer storage medium such as a hard disk drive and a semiconductor memory 1 in
using the usual iid-assumption, and that zi is independent from θ. Furthermore, using Equation (1), we get, in a block 10, in
For simplicity, let us assume that p(θ) is constant, then the above expression is maximized by
where freqz(w) is the number of times word w occurs in the collection of documents that have class z. The prior probability p(z) can be estimated in a similar way, and is constant if the number of training documents per class is the same for all classes.
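As an illustration of the estimate described above, the following minimal Python sketch counts word occurrences per class and normalizes them, assuming the maximum-likelihood form θw|z = freqz(w) / Σw′ freqz(w′). The function name `estimate_theta` and the toy documents are hypothetical and are not taken from the specification.

```python
from collections import Counter, defaultdict


def estimate_theta(labeled_docs):
    """Return {class: {word: probability}} from (words, class) pairs.

    Assumes a constant prior p(theta), so each theta_{w|z} is the
    relative frequency of word w among all words of class z.
    """
    freq = defaultdict(Counter)
    for words, z in labeled_docs:
        freq[z].update(words)  # freq[z][w] plays the role of freq_z(w)
    theta = {}
    for z, counts in freq.items():
        total = sum(counts.values())
        theta[z] = {w: c / total for w, c in counts.items()}
    return theta


# Hypothetical toy training data: (list of words, class label).
docs = [
    (["I", "mouse"], "A"),
    (["I", "cat"], "A"),
    (["mouse", "keyboard"], "C"),
]
theta = estimate_theta(docs)
```

In this sketch, words that never occur in a class simply receive no entry; smoothing is deliberately left out because the text assumes a constant prior.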
For classifying a new document in a block 20′ in
Let us consider a concrete example. Assume that we have two classes with 5 instances each: 5 documents that are about animals (for short, z=A) and 5 documents that are about computers (for short, z=C). For simplicity, we assume that each document has two words, so that each class contains 10 word occurrences. Let us assume that the word “I” occurs 3 times and 2 times in documents that belong to class “Animal” and class “Computer”, respectively. Therefore, the probability that the word “I” occurs in a document belonging to class “Animal” is 3/10, and the probability that it occurs in a document belonging to class “Computer” is 2/10. Analogously, assume that the word “mouse” occurs 2 times and 3 times in documents that belong to class “Animal” and class “Computer”, respectively. To summarize, we have the following maximum-likelihood (ML) estimates: θI|A=3/10, θI|C=2/10, θmouse|A=2/10, θmouse|C=3/10.
Let us now consider a new document that contains the two words “I” and “mouse”. The class for the new document is decided by considering the ratio
If this ratio is larger than 1, the document is classified as “Animal”; if it is smaller than 1, it is classified as “Computer”. Using the naive Bayes classifier, this can be written as follows
Therefore, we see that the naive Bayes classifier is not able to distinguish between the two classes. However, in general, by inspecting a large collection of documents, we know that the word “I” is a high-frequency word that is not very informative; that is, the word is of little help for document classification. In contrast, the word “mouse” is a more specific word and can in general better help to distinguish between two classes. As a consequence, the word “mouse” should have (slightly) more weight for deciding the class, and therefore the document with the words “I” and “mouse” should be classified as “Computer” rather than “Animal”.
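The tie in this example can be checked with a few lines of Python. The sketch below uses exact fractions and the estimates stated in the example (3/10 and 2/10 for “I”, 2/10 and 3/10 for “mouse”), together with equal class priors, which follow from the equal number of training documents per class. The helper name `nb_score` is hypothetical.

```python
from fractions import Fraction

# ML estimates from the example: 10 word occurrences per class.
theta = {
    "A": {"I": Fraction(3, 10), "mouse": Fraction(2, 10)},
    "C": {"I": Fraction(2, 10), "mouse": Fraction(3, 10)},
}
# Equal priors, since both classes have 5 training documents.
prior = {"A": Fraction(1, 2), "C": Fraction(1, 2)}


def nb_score(words, z):
    """Unnormalized naive Bayes score p(z) * prod_j theta_{w_j|z}."""
    score = prior[z]
    for w in words:
        score *= theta[z][w]
    return score


doc = ["I", "mouse"]
ratio = nb_score(doc, "A") / nb_score(doc, "C")
print(ratio)  # prints 1: the document lies on the decision boundary
```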
We now describe our extension of the naive Bayes model, displayed in
More specifically, under the proposed model, the joint probability of the text document with words w1, . . . , wk, hidden variables d1, . . . , dk and class z is
where the word probability p(w|z, d) is defined as follows:
The variables dj are binary random variables that indicate whether the word wj is drawn from the class's word distribution θz (dj=1) or from the background distribution γ (dj=0). The variables dj are hidden variables which cannot be observed from the training documents. To obtain the probability of a training document (w1, . . . , wk, z), we sum over all d1, . . . , dk, leading to
We assume that the prior probability p(dj) is independent of the class of the document and independent of the word position j. Therefore, we define δ:=p(dj=1), which is constant for all words. This way, the joint probability of the document with class z can be expressed as follows
For a class z, the word distribution θw|z can be estimated as before using Equation (1). For estimating the background distribution γ and the prior probability δ in a block 15 in
where freqD*(w) is the frequency of word w in D*. (For example, if D* contains two documents, where in the first document word w occurs 3 times, and in the second document it occurs 2 times, then freqD* (w) equals 5.)
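A minimal Python sketch of this estimate follows, assuming that γw is the relative frequency freqD*(w) / Σw′ freqD*(w′) over all documents in D*, labeled or not. The function name `estimate_gamma` and the toy corpus are hypothetical; the corpus mirrors the parenthetical example, with word "w" occurring 3 times in the first document and 2 times in the second.

```python
from collections import Counter


def estimate_gamma(corpus):
    """Return (gamma, freq) for a corpus given as a list of word lists.

    freq[w] is the total count of w over all documents (freq_{D*}(w));
    gamma[w] is that count divided by the total number of word tokens.
    """
    freq = Counter()
    for words in corpus:
        freq.update(words)
    total = sum(freq.values())
    gamma = {w: c / total for w, c in freq.items()}
    return gamma, freq


# Hypothetical D* with two documents.
corpus = [["w", "w", "w", "x"], ["w", "w", "y"]]
gamma, freq = estimate_gamma(corpus)
print(freq["w"])  # 5
```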
The parameter δ can also be learned using D*, as we show later, or can be set manually to a value in the interval ]0, 1]. Note that if δ is 1, the model reduces to the original naive Bayes classifier.
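The role of δ can be sketched in Python as follows, assuming that summing out the indicator dj gives the per-word mixture p(w|z) = δ·θw|z + (1−δ)·γw, which is consistent with δ:=p(dj=1) selecting the class distribution. The function names are hypothetical, and the values in `gamma` are invented purely for illustration.

```python
def word_prob(w, z, theta, gamma, delta):
    """Per-word mixture: delta * theta_{w|z} + (1 - delta) * gamma_w."""
    return delta * theta[z].get(w, 0.0) + (1.0 - delta) * gamma.get(w, 0.0)


def joint_prob(words, z, prior, theta, gamma, delta):
    """p(w_1, ..., w_k, z) with all indicators d_j summed out."""
    p = prior[z]
    for w in words:
        p *= word_prob(w, z, theta, gamma, delta)
    return p


theta = {"A": {"I": 0.3, "mouse": 0.2}, "C": {"I": 0.2, "mouse": 0.3}}
prior = {"A": 0.5, "C": 0.5}
gamma = {"I": 0.1, "mouse": 0.01}  # hypothetical background probabilities

# With delta = 1 the background term vanishes and the score equals the
# ordinary naive Bayes product p(z) * theta_{I|z} * theta_{mouse|z}.
nb_like = joint_prob(["I", "mouse"], "A", prior, theta, gamma, delta=1.0)
mixed = joint_prob(["I", "mouse"], "A", prior, theta, gamma, delta=0.5)
```

Here `nb_like` equals 0.5 · 0.3 · 0.2 up to floating-point rounding, while `mixed` blends in the background probabilities.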
Finally, in order to classify a new document w1, . . . , wk in a block 20 in
To see that the proposed method can diminish the impact of high-frequency words, consider the same example as before. Let us assume that we additionally have 90 documents without class information (unlabeled corpus) in the non-transitory computer storage medium 2 in
and therefore we have
The class probabilities θw|z for the words “I” and “mouse” are set to the probabilities p(w|z) of the original naive Bayes model, i.e.:
Furthermore, for simplicity we assume that δ is set to 0.5, which means p(dj=1)=p(dj=0)=0.5 for all j. Let us now consider the document containing the two words “I” and “mouse”, whose class is decided by the following ratio
Therefore, the document is classified as a “Computer” article, in contrast to the result obtained before using the (original) naive Bayes classifier. We can see that here the weight of the word “mouse” dominates the weight of the word “I”, which is a high-frequency word. In general, high-frequency words get a lower weight for deciding the class, and therefore their (negative) impact is diminished.
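The effect can be reproduced numerically with the Python sketch below. The class probabilities and δ=0.5 are those of the example. The background probabilities `gamma_I` and `gamma_mouse` are assumptions chosen only so that “I” is more frequent in the corpus than “mouse”; they are not the values of the specification, which are not reproduced in this text.

```python
from fractions import Fraction

theta = {
    "A": {"I": Fraction(3, 10), "mouse": Fraction(2, 10)},
    "C": {"I": Fraction(2, 10), "mouse": Fraction(3, 10)},
}
delta = Fraction(1, 2)

# Assumed (hypothetical) background probabilities: "I" is frequent,
# "mouse" is rare in the whole corpus.
gamma_I = Fraction(1, 10)
gamma_mouse = Fraction(1, 100)
gamma = {"I": gamma_I, "mouse": gamma_mouse}


def score(words, z):
    """prod_j (delta * theta_{w_j|z} + (1 - delta) * gamma_{w_j}).

    Equal class priors cancel in the ratio and are omitted.
    """
    s = Fraction(1)
    for w in words:
        s *= delta * theta[z][w] + (1 - delta) * gamma[w]
    return s


doc = ["I", "mouse"]
ratio = score(doc, "A") / score(doc, "C")
print(ratio)  # 28/31, smaller than 1, so the class "Computer" is chosen
```

With these assumed values the ratio drops below 1 although the two words carry mirror-image class probabilities, because the rarer word “mouse” is less well explained by the background distribution.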
We note that the above example holds more generally. Consider a document that has two words a and b, with θa|z1=θb|z2 and θa|z2=θb|z1. Without loss of generality, we assume that θa|z1>θa|z2. In words, this means that word a suggests class z1 exactly as strongly as word b suggests class z2. Furthermore, let δ be in ]0,1[. Assuming that the prior probabilities of classes z1 and z2 are the same, we can see whether the document a, b is classified as class z1 or z2 by inspecting the ratio:
If the ratio is larger than 1, the document is classified as class z1, if the ratio is smaller than 1 the document is classified as class z2. We can show that this ratio is smaller than 1, if, and only if, γa>γb. Therefore, if the word b is less frequent than a, the weight of word b becomes higher than the weight of word a.
As a consequence, the proposed method can have a similar effect to idf-weighting, in the sense that it mitigates the impact of high-frequency words. Note that a naive Bayes classifier cannot classify the document, because in that case we would be exactly on the decision boundary.
Proof of the Above Statement:
To simplify notation let
γ′a:=(1−δ)·γa
γ′b:=(1−δ)·γb
θ′a|z1:=δ·θa|z1
θ′b|z1:=δ·θb|z1
θ′a|z2:=δ·θa|z2
θ′b|z2:=δ·θb|z2
since θa|z1=θb|z2, and θa|z2=θb|z1, and θa|z1>θa|z2, we then have
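The algebra behind this step can be written out as follows. This is a reconstruction under stated assumptions, not the original derivation: it assumes the per-word mixture δ·θw|z + (1−δ)·γw, equal class priors, and strictly positive probabilities, and it introduces the shorthand x and y (not used in the specification) for the scaled class probabilities, x := δ·θa|z1 = δ·θb|z2 and y := δ·θa|z2 = δ·θb|z1, with x > y because δ > 0.

```latex
\begin{align*}
\frac{p(z_1 \mid a, b)}{p(z_2 \mid a, b)}
  &= \frac{(\theta'_{a|z_1} + \gamma'_a)\,(\theta'_{b|z_1} + \gamma'_b)}
          {(\theta'_{a|z_2} + \gamma'_a)\,(\theta'_{b|z_2} + \gamma'_b)}
   = \frac{(x + \gamma'_a)\,(y + \gamma'_b)}
          {(y + \gamma'_a)\,(x + \gamma'_b)}, \\[4pt]
(x + \gamma'_a)(y + \gamma'_b) - (y + \gamma'_a)(x + \gamma'_b)
  &= x\gamma'_b + y\gamma'_a - y\gamma'_b - x\gamma'_a
   = (x - y)\,(\gamma'_b - \gamma'_a).
\end{align*}
```

Since the denominator is positive and x − y > 0, the ratio is smaller than 1 exactly when γ′b − γ′a < 0. Because γ′a = (1−δ)·γa and γ′b = (1−δ)·γb with δ < 1, this is equivalent to γa > γb.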
It is not difficult to see that the parameter δ controls how much the impact of high-frequency words is reduced. We will now show that the parameter can be learned from the corpus D*. We suggest setting δ such that, if there are many high-frequency words in D* that cannot be explained by any θz, the parameter δ is closer to 0. We can achieve this by choosing the parameter δ* that maximizes p(D*) under our proposed model for fixed parameters θz and γ.
This means
To find an approximate solution to this problem we can, for example, use the EM-algorithm, considering all class labels zi and all indicator variables dj as unobserved.
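One way such an EM procedure could look is sketched below in Python. It is a minimal illustration under stated assumptions, not the procedure of the specification: θz, γ and the class prior p(z) are held fixed, every document in D* is treated as unlabeled, every corpus word has a positive background probability, and the update for δ is the expected fraction of words whose indicator is dj=1. The function name `em_delta` and the toy parameters are hypothetical.

```python
import math


def em_delta(corpus, prior, theta, gamma, delta=0.5, iterations=50):
    """Estimate delta = p(d_j = 1) by EM with theta, gamma, prior fixed.

    Both the class label of each document and the indicators d_j are
    treated as unobserved. Assumes gamma[w] > 0 for every corpus word.
    """
    classes = list(prior)
    total_words = sum(len(doc) for doc in corpus)
    for _ in range(iterations):
        expected_d1 = 0.0
        for doc in corpus:
            # E-step, part 1: posterior p(z | doc), computed in log space.
            log_post = []
            for z in classes:
                lp = math.log(prior[z])
                for w in doc:
                    mix = delta * theta[z].get(w, 0.0) + (1 - delta) * gamma[w]
                    lp += math.log(mix)
                log_post.append(lp)
            top = max(log_post)
            weights = [math.exp(lp - top) for lp in log_post]
            norm = sum(weights)
            # E-step, part 2: expected number of words with d_j = 1.
            for z, wt in zip(classes, weights):
                resp = wt / norm
                for w in doc:
                    from_class = delta * theta[z].get(w, 0.0)
                    mix = from_class + (1 - delta) * gamma[w]
                    expected_d1 += resp * from_class / mix
        # M-step: delta is the expected fraction of class-generated words.
        delta = expected_d1 / total_words
    return delta


# Hypothetical toy parameters over the vocabulary {I, mouse, cat, cpu}.
theta = {
    "A": {"I": 0.3, "mouse": 0.2, "cat": 0.5},
    "C": {"I": 0.2, "mouse": 0.3, "cpu": 0.5},
}
prior = {"A": 0.5, "C": 0.5}
gamma = {"I": 0.7, "mouse": 0.1, "cat": 0.1, "cpu": 0.1}

# A corpus consisting only of the frequent word "I" is explained better
# by gamma than by any theta_z, so the estimate moves towards 0.
delta_hat = em_delta([["I", "I"]] * 5, prior, theta, gamma)
```

EM only guarantees a local optimum of p(D*), so in practice the starting value of `delta` and the number of iterations are choices to be validated.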
We note that, in the same way as δ, it is also possible to estimate γ instead of setting it to the word frequency distribution (as in Equation (2)). In doing so, for high-frequency words w that can be well explained by a class z, i.e. for which θw|z is high, the probability γw is reduced. This has the advantage that such high-frequency words w retain a high weight that favors class z.
For simplicity, in this example, we set the probability θw|z and the probability γw to the categorical distribution (or multinomial distribution without the combinatorial factor for the word frequency). However, in practice, for modeling text it is advantageous to use a mixture distribution model instead, most notably a mixture of multinomial distributions as in Non-Patent Document 2. The number of components can be determined using cross-validation, and the word probabilities for each component can be learned, for example, using the EM algorithm from labeled and unlabeled data. It is also possible to assume an infinite mixture model by placing a Dirichlet process prior over the number of components. In that case, the probabilities θw|z and the probability γw can be estimated using Markov chain Monte Carlo (MCMC) methods.
As an alternative to Equation (3), we can set the interpolation parameter δ such that the expected document classification accuracy is optimized. This can be achieved by using cross-validation on the training data instances with class information (i.e. labeled data).
The method for classifying a new data instance, such as a text document of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other computation and processing device. On the other hand, the functions may be realized by execution of a program used to realize the steps of the method for classifying a new data instance, such as a text document.
Moreover, a program to realize the steps of the method for classifying a new data instance, such as a text document, may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform the processing of the method for classifying a new data instance, such as a text document. Here, a “computer system” may include an OS, peripheral equipment, or other hardware.
Further, “computer-readable storage media” means a flexible disk, magneto-optical disc, ROM, flash memory or other writable nonvolatile memory, CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.
Further, “computer readable storage media” also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.
For convenience, we use the term “word” to describe a feature in the present specification and claims below. However, we note that the method can also be applied for other features that are not lexical.
The present invention allows an input text to be classified with a naive Bayes classifier without prior feature selection that removes uninformative high-frequency words (such as stop-words). Feature selection is known to improve the performance of a classifier, since it removes noise. However, feature selection needs to be done partly manually, which involves additional costs. The present invention makes it possible to determine automatically how to diminish the noise caused by high-frequency words, by learning word distributions from unlabeled text. This means that no parameters need to be manually tuned, and no additional manually labeled training data is necessary. The present invention is formulated as an extension of the generative process of the naive Bayes classifier, which allows it to be easily extended to model more complex interactions of words, or to model words together with other types of attributes (e.g. for spam detection, the actual email text plus additional attributes such as the number of times an email from the same sender was removed). As a consequence, the present invention allows high text classification accuracy without additional costs.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2014/067090 | 6/20/2014 | WO | 00 |