The present invention relates to an information processing system, an information processing method and a computer readable storage medium thereof.
A bag-of-words (BOW) representation for text documents is known as one of the most effective vector representations for text classification and clustering. In this representation, each word corresponds to one dimension in an n-dimensional weight vector, where n is the number of words in the vocabulary. The i-th dimension of the weight vector represents the weight of the i-th (i=1, . . . , n) word in the vocabulary. As word weights, the number of times the word occurs in a text document (frequency count) or another weighting scheme such as tf-idf (Term Frequency, Inverse Document Frequency) can be used.
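As a minimal sketch of this representation, the following builds frequency-count and tf-idf weight vectors over a small vocabulary. The function names, the toy vocabulary, and the particular tf-idf variant (raw count multiplied by log(N/df)) are assumptions chosen for illustration; other weighting variants exist.

```python
import math
from collections import Counter

def bow_counts(tokens, vocabulary):
    """Return the frequency-count weight vector of one tokenized text."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def tf_idf(count_vectors):
    """Scale counts by log(N / df); one of several common tf-idf variants."""
    n_texts = len(count_vectors)
    n_words = len(count_vectors[0])
    # Document frequency: number of texts in which each word occurs.
    df = [sum(1 for vec in count_vectors if vec[i] > 0) for i in range(n_words)]
    idf = [math.log(n_texts / d) if d > 0 else 0.0 for d in df]
    return [[c * w for c, w in zip(vec, idf)] for vec in count_vectors]

# Hypothetical vocabulary; words outside it ("in", "the") are ignored.
vocabulary = ["i", "love", "like", "playing", "soccer", "park", "meadow"]
texts = [
    "i love playing soccer in the park".split(),
    "i like playing in the meadow".split(),
]
counts = [bow_counts(t, vocabulary) for t in texts]
weights = tf_idf(counts)
```

Each row of `counts` or `weights` is one n-dimensional weight vector, with one dimension per vocabulary word.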
However, one major drawback of the BOW representation is that it does not model semantic relationships between words. For example, a text which contains a word “park” but not a word “meadow”, might actually be considered similar to a text which contains the word “meadow” but not the word “park”. However, the BOW representation is not able to consider that “meadow” and “park” can be synonymous, and thus fails to detect that two texts like “I love playing soccer in the park.” and “I like playing in the meadows.” are similar.
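The drawback can be illustrated with a small sketch. Cosine similarity is used here only as an assumed, common way to compare weight vectors; the text does not prescribe a particular similarity measure, and the three-word vocabulary is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: ["park", "meadow", "soccer"]
text_with_park = [1, 0, 0]
text_with_meadow = [0, 1, 0]

# The two texts share no vocabulary word, so the BOW vectors are orthogonal.
similarity = cosine_similarity(text_with_park, text_with_meadow)
```

Because the vectors have no non-zero dimension in common, the similarity is zero regardless of how closely related the two words are.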
One solution to this problem is to map the text representation to lower dimensional semantic space using, for example, Latent-Semantic-Indexing (LSI) or Latent Dirichlet Allocation (LDA), as described in NPL 1. However, these methods cannot make use of word similarity information that is known a-priori.
As a different solution that is able to use such word similarity information, a bag-of-clusters (BOC) representation is disclosed in NPL 2.
The BOC representation uses given word similarities to create word clusters, and then represents a text using these clusters.
Note that, as a related technology, PTL 1 discloses a method for reducing an appearance frequency of a word in document data based on a style to which the document data belongs. PTL 2 discloses a technology for complementing evaluation expressions missing in a sentence by using feature data of sentences including and excluding the evaluation expressions. PTL 3 discloses a method of speech recognition using continuous Mahalanobis DP (Dynamic Programming).
A problem of the BOC representation disclosed in NPL 2 is that, depending on the text document, the word clusters result in coarse-grained representations. Consequently, in the BOC representation, information needed for text classification and clustering may be lost.
An object of the present invention is to solve the problem mentioned above. Specifically, the object is to provide an information processing system, an information processing method and a computer readable storage medium thereof which increase weights of words that are related to a text but do not explicitly occur in the text.
An information processing system according to an exemplary aspect of the invention includes: a distance storing means for storing distances between any two terms of a plurality of terms, the distance between two terms becoming smaller as the two terms are semantically more similar; and an adjusting means for adjusting a weight of each term of the plurality of terms in a weight vector including weights of the plurality of terms and representing a text, on the basis of a distance between each term and other term in the weight vector and a weight of the other term.
An information processing method according to an exemplary aspect of the invention includes: reading out a distance between each term of a plurality of terms in a weight vector and other term in the weight vector from a distance storing means which stores distances between any two terms of the plurality of terms, the distance between two terms becoming smaller as the two terms are semantically more similar, the weight vector including weights of the plurality of terms and representing a text; and adjusting a weight of each term of the plurality of terms in the weight vector on the basis of the distance between each term and other term in the weight vector and a weight of the other term.
A computer readable storage medium according to an exemplary aspect of the invention records thereon a program, causing a computer to perform a method including: reading out a distance between each term of a plurality of terms in a weight vector and other term in the weight vector from a distance storing means which stores distances between any two terms of the plurality of terms, the distance between two terms becoming smaller as the two terms are semantically more similar, the weight vector including weights of the plurality of terms and representing a text; and adjusting a weight of each term of the plurality of terms in the weight vector on the basis of the distance between each term and other term in the weight vector and a weight of the other term.
An advantageous effect of the present invention is increasing weights of words that are related to a text, but that do not explicitly occur in the text, in a weight vector representing the text.
A first exemplary embodiment of the present invention will be described below.
First, a configuration of the first exemplary embodiment of the present invention will be described.
The distance storing unit 110 stores distances between any two words (also referred to as terms hereafter) in a plurality of words (vocabulary). Here, the distance between two words becomes smaller as the two words are semantically more similar. The distances are previously set in the distance storing unit 110 by a user, as a word similarity matrix, for example.
As a distance between two words, a Mahalanobis or any other distance between feature vectors of the two words, which represent the words, may be used. As the word feature vector, a word embedding may be used. The word embedding is calculated, for example, with word2vec described in NPL 4. In addition, as a word feature vector, a topic distribution p(topic|w) for each word may be used. The topic distribution is calculated, for example, using the Latent Dirichlet Allocation (LDA).
Alternatively, the word feature vector of a word may include information about a local-word window context and/or information about a global-context (topical context). In the case that the word feature vector includes information about the local-word window context, the distance between two words becomes smaller if the two words tend to be surrounded by the same words. In the case that the word feature vector includes information about the global-context, the distance between two words becomes smaller if the two words tend to occur in the same texts.
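A minimal sketch of how such distances might be obtained from word feature vectors is given below. It assumes plain Euclidean distance and hypothetical two-dimensional feature vectors; in practice the vectors would come from a word embedding or a topic distribution as described above.

```python
import numpy as np

def pairwise_distances(E):
    """Euclidean distance between every pair of rows of E (shape n x d)."""
    diff = E[:, None, :] - E[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical feature vectors for three words.
words = ["park", "meadow", "politics"]
E = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [-1.0, 0.5],
])
R = pairwise_distances(E)  # R[i, j] is the distance between word i and word j
```

The resulting matrix is symmetric with a zero diagonal, and semantically similar words (here "park" and "meadow") receive a smaller distance than unrelated ones.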
The adjusting unit 120 acquires a weight vector for each text to be classified. The weight vector corresponds to the BOW mentioned above and represents the text. The weight vector includes weights of the plurality of words (vocabulary).
The adjusting unit 120 adjusts a weight of each word of the plurality of words (vocabulary) in the weight vector, on the basis of a distance between each word and other word in the weight vector and a weight of the other word. Here, the adjusting unit 120 adjusts a weight of each word in such a way that the weight is increased when the distance between each word and other word is smaller and the weight of the other word is larger, for example.
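One simple rule with this behavior can be sketched as follows. The rule itself (adding a sum of the other weights, each scaled by exp(-distance)) is an assumption made purely for illustration and is not the specific adjustment of any embodiment; the function name `adjust_weights` and the `scale` parameter are hypothetical.

```python
import numpy as np

def adjust_weights(weights, distances, scale=1.0):
    """Add to each weight a similarity-weighted sum of the other words' weights."""
    similarity = np.exp(-distances / scale)  # small distance -> value near 1
    np.fill_diagonal(similarity, 0.0)        # a word does not boost itself
    return weights + similarity @ weights

# Hypothetical distances for the words ["park", "meadow", "politics"].
distances = np.array([
    [0.0, 0.1, 3.0],
    [0.1, 0.0, 3.0],
    [3.0, 3.0, 0.0],
])
weights = np.array([1.0, 0.0, 0.0])  # only "park" occurs in the text
adjusted = adjust_weights(weights, distances)
```

In this toy case the weight of "meadow" rises clearly above zero because it is close to the occurring word "park", while "politics" is barely affected.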
The classify unit 130 performs text classification using the adjusted weight vector (adjusted BOW representation).
Note that the adjusting system 100 may be a computer which includes a central processing unit (CPU) and a storage medium storing a program and which operates according to the program-based control.
With reference to
Alternatively, each of the units in the adjusting system 100 may be implemented on an individual logical circuit. The units in the adjusting system 100 may be allocated respectively to a plurality of devices interconnected with wired or wireless channels.
Next, operations of the adjusting system 100 according to the first exemplary embodiment of the present invention will be described.
In this way, the weight in
By using the adjusted weight vectors (adjusted BOW representation), the classify unit 130 can obtain accurate classification results for these texts.
Next, a characteristic configuration of the first exemplary embodiment of the present invention will be described.
With reference to
According to the first exemplary embodiment of the present invention, it is possible to increase weights of words that are related to a text, but that do not explicitly occur in the text, in a weight vector representing the text. This is because the adjusting unit 120 adjusts a weight of each word in a weight vector for a text, on the basis of a distance between each word and other word and a weight of the other word. This achieves a richer BOW representation of a text that contains more semantic information with fewer zero entries. As a consequence, the rich BOW representation helps text classification and clustering, among others.
A second exemplary embodiment of the present invention will be described below. In the second exemplary embodiment of the present invention, it is assumed that real but unobserved weights of words are latent variables which are disturbed by noise, and the weight vector is adjusted (estimated) based on a noise model.
The distance storing unit 110 stores distance ri,j between each pair of the i-th (i=1, . . . , n) word and j-th (j=1, . . . , n) word.
The adjusting unit 120 calculates a covariance matrix for all words from the distances between words stored in the distance storing unit 110. The covariance matrix contains all information about the correlation between any two words.
The adjusting unit 120 uses the covariance matrix to adjust the word weights. It is assumed that the true but unobserved (latent) word weights have been generated from a multivariate Gaussian with the covariance matrix. However, the true weights are disturbed by some noise. The disturbed word weights are what is actually observed as the word weights of an input weight vector.
As the noise, two types of noise, additive and multiplicative, are mainly considered.
First, the adjustment of word weights assuming the additive noise is explained. In the following, a function f maps a word feature vector of the i-th word ei (i=1, . . . , n) to a weight f(ei). The weight f(ei) is also referred to as fi (i=1, . . . , n). Here, it is assumed that the function f is distributed according to a Gaussian Process (GP) with a fixed mean mi (i=1, . . . , n) and a covariance function k(ei, ej).
As the covariance function, a squared exponential covariance function expressed by Math. 1 can be used, for example.
where l is a characteristic length scale described in NPL 3, and the distance ri,j is defined by Math. 2.
$r_{i,j} := \lVert e_i - e_j \rVert$ [Math. 2]
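The squared exponential covariance over the distance of Math. 2 can be sketched as follows. The form exp(-r^2 / (2 l^2)) with unit signal variance is the standard one from the Gaussian process literature and is assumed here; the function name and the toy feature vectors are hypothetical.

```python
import numpy as np

def squared_exponential_kernel(E, length_scale=1.0):
    """Covariance matrix with entries exp(-r_ij^2 / (2 l^2)) for the rows of E."""
    diff = E[:, None, :] - E[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance r_ij (Math. 2)
    return np.exp(-r ** 2 / (2.0 * length_scale ** 2))

# Hypothetical feature vectors for ["park", "meadow", "politics"].
E = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [-1.0, 0.5],
])
K = squared_exponential_kernel(E, length_scale=1.0)
```

Under this assumed form, a small distance yields a covariance close to 1 and a large distance yields a covariance close to 0.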
The above choice of the covariance function ensures that two similar words have similar word weights. To see this, note that k(ei, ej) equals the covariance E[(f(ei)−m)(f(ej)−m)], and that, for similar words, the corresponding word feature vectors are similar and thus the distance ri,j is small. As a consequence, k(ei, ej) is large.
Furthermore, it is assumed that the true, but unobserved word weights fi are disturbed by isotropic Gaussian noise, leading to the observed word weights yi (i=1, . . . , n). The generative model can be expressed by Math. 3.
1. Sample f from a GP:
$f \sim GP(m, k(e_i, e_j))$
2. Add isotropic noise:
$y_i = f(e_i) + \epsilon$ [Math. 3]
where the noise ϵ is sampled from an isotropic Gaussian with zero mean and variance σ^2, N(0, σ^2).
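The two steps of the generative model can be simulated for a finite vocabulary, where the GP reduces to a multivariate Gaussian over the n word weights. The covariance matrix, mean, and function name below are hypothetical values chosen for illustration.

```python
import numpy as np

def sample_observed_weights(m, K, sigma2, rng):
    """Draw latent weights f ~ N(m, K), then add isotropic Gaussian noise."""
    f = rng.multivariate_normal(m, K)                       # step 1
    noise = rng.normal(0.0, np.sqrt(sigma2), size=len(m))   # step 2
    return f, f + noise

# Hypothetical covariance: words 0 and 1 are strongly correlated.
K = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
m = np.full(3, 0.5)
rng = np.random.default_rng(0)
f, y = sample_observed_weights(m, K, sigma2=1.0, rng=rng)
```

Here `f` plays the role of the true but unobserved weights and `y` the role of the observed weights of the input weight vector.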
Given a text with word weights yi, the true word weights fi can be estimated with a MAP (maximum-a-posteriori) estimate expressed by Math. 4.
Referring to NPL 3, it is easy to show that p(fi|y1, . . . , yn) is a Gaussian distribution with a mode expressed by Math. 5.
where m (an n-dimensional vector) is the mean of the GP; each element mi (i=1, . . . , n) is set to the empirical mean of all word weights in the text, expressed by Math. 6. Note that mi has the same value for every i.
The vector y (an n-dimensional vector) contains the observed word weights y1, . . . , yn. The covariance matrix K (an n×n matrix with elements Ki,j (i=1, . . . , n, j=1, . . . , n)) is defined by Math. 7.
$K_{i,j} = k(e_i, e_j)$ [Math. 7]
The vector Ki (Ki is an n dimensional vector, i=1, . . . , n) denotes the i-th column vector of the matrix K.
The adjusting unit 120 estimates the latent word weights with Math. 5 to obtain the adjusted word weights.
The smoothing vector $K_i^T (K + \sigma^2 I)^{-1}$ in Math. 5 is constant for all texts and can therefore be calculated in advance. The variance σ^2 can be set to a predetermined constant value, such as 1.0 or 2.0. Alternatively, the variance σ^2 can be determined using cross-validation or by maximizing a marginal likelihood of training data.
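The estimation step can be sketched as follows. The sketch assumes the standard Gaussian process posterior mean, m + K (K + σ^2 I)^{-1} (y − m), which is consistent with the smoothing vector named above, and takes m as the empirical mean of the observed weights. The function name and the toy covariance matrix are hypothetical.

```python
import numpy as np

def smooth_weights(y, K, sigma2=1.0):
    """Estimate latent word weights from observed weights y (assumed posterior mean)."""
    n = len(y)
    m = np.full(n, y.mean())  # same empirical mean for every word
    # Solve (K + sigma2 * I) x = (y - m) instead of forming the inverse explicitly.
    x = np.linalg.solve(K + sigma2 * np.eye(n), y - m)
    return m + K @ x

# Hypothetical covariance for ["park", "meadow", "politics"].
K = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
y = np.array([1.0, 0.0, 0.0])  # only "park" occurs in the text
f_hat = smooth_weights(y, K, sigma2=1.0)
```

In this toy case the estimated weight of "meadow" ends up above that of "politics", although neither word occurs in the text, because "meadow" is strongly correlated with the occurring word "park".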
Note that, for the observed word weight yi, there are several choices. For example, for text classification, as a word weight, tf-idf or a supervised weight like ConfWeight as defined in NPL 5 can be used.
Here, concrete examples using the texts shown in
In the example of
In the example of
Next, the adjustment of word weights assuming the multiplicative noise is explained. In this case, the observed word weights yi can be expressed by Math. 8.
$y_i = 1_\theta(f_i) \cdot f_i$ [Math. 8]
where the noise 1θ(fi) is 1 with probability proportional to fi, and 0 otherwise. This model explicitly explains the typically sparse BOW representation of texts. It is assumed that f is distributed according to a multivariate normal distribution with mean weight vector m and covariance matrix K, as expressed by Math. 9.
$f \sim N(m, K)$ [Math. 9]
Although there is no analytic solution to find the MAP estimate of fi, Gibbs sampling can be used to find an approximation of fi, for example.
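The forward direction of the multiplicative noise model (generating sparse observations, not the Gibbs sampling estimate) can be simulated as a rough sketch. How "proportional to fi" is turned into a probability is not specified in the text; clipping fi to the interval [0, 1] is an assumption made here for illustration, as are the function name and the toy values.

```python
import numpy as np

def sample_multiplicative(m, K, rng):
    """Draw f ~ N(m, K) and zero out each entry with probability 1 - p_i."""
    f = rng.multivariate_normal(m, K)
    # Assumption: use f clipped to [0, 1] as the keep-probability of each word.
    p = np.clip(f, 0.0, 1.0)
    keep = rng.random(len(f)) < p
    return f, np.where(keep, f, 0.0)

# Hypothetical covariance and mean for three words.
K = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
m = np.full(3, 0.5)
rng = np.random.default_rng(0)
f, y = sample_multiplicative(m, K, rng)
```

Every observed weight is either the latent weight itself or exactly zero, which mirrors the sparsity of typical BOW vectors.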
According to the second exemplary embodiment of the present invention, it is also possible to increase weights of words that are related to a text in a weight vector. This is because the adjusting unit 120 adjusts a weight of each word in the weight vector by calculating a covariance matrix from the distances between words and estimating a latent weight of each word based on the calculated covariance matrix and a noise model.
A third exemplary embodiment of the present invention will be described below. In the third exemplary embodiment of the present invention, distances between words are calculated from word feature vectors.
In the third exemplary embodiment of the present invention, the distance ri,j is defined by Math. 10.
$r_{i,j} := \alpha_{i,j} \sqrt{(e_i - e_j)^T B (e_i - e_j)}$ [Math. 10]
where B is an n×n dimensional feature weighting matrix, and αi,j is a factor to correct the distance between the i-th word and the j-th word for class-specificity.
First, the feature weighting matrix B is explained.
In its easiest setting, the feature weighting matrix B is simply set to an identity matrix.
As described above, as a word feature vector, a word embedding or topic distribution may be used, for example. Alternatively, both of the word embedding and the topic distribution may also be combined to a (d1+d2) dimensional feature vector e, where d1 and d2 are the dimension of the word embedding and the number of topics, respectively. If, for example, the word embedding information is more important than the topic distribution, then it is beneficial to additionally specify an appropriate matrix B. In this case, the matrix may be learned using cross-validation or by maximizing the likelihood of the training data, as described in NPL 3.
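The distance of Math. 10 can be sketched as follows. In this sketch B is taken as a square matrix whose size matches the feature-vector dimension so that the quadratic form is defined, and the diagonal B that emphasizes the embedding part is a hypothetical choice; both are assumptions of the example.

```python
import numpy as np

def weighted_distance(e_i, e_j, B=None, alpha=1.0):
    """alpha * sqrt((e_i - e_j)^T B (e_i - e_j)); B defaults to the identity."""
    diff = np.asarray(e_i, dtype=float) - np.asarray(e_j, dtype=float)
    if B is None:
        B = np.eye(diff.shape[0])
    return alpha * np.sqrt(diff @ B @ diff)

# Hypothetical combined feature vectors: 2 embedding dimensions + 2 topic dimensions.
e_i = [1.0, 0.0, 0.5, 0.5]
e_j = [0.0, 0.0, 0.5, 0.5]

d_identity = weighted_distance(e_i, e_j)  # plain Euclidean distance
# Hypothetical B that weights the embedding part more than the topic part.
B = np.diag([2.0, 2.0, 0.5, 0.5])
d_weighted = weighted_distance(e_i, e_j, B=B)
```

With the identity matrix the distance reduces to the Euclidean distance; with the hypothetical B, differences in the embedding dimensions count more.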
Next, the correction factor αi,j is explained.
Even if local or topical context of two words is quite similar, the two words might tend to occur in different classes. In that case their semantic similarity should be small. In the third exemplary embodiment of the present invention, it is assumed that some texts have been assigned to classes. This is in particular the case if the final representation (weight vector) for the text is used for classification. For example, if the final task is to classify texts as either being about “sport” or being about “politics”, it is possible to assume having a training data set of texts with the assigned classes.
The distance between two words assigned to classes is corrected by using correlation between the two words and the assigned classes. For simplicity, assume there are only two classes A and B. In this case, the correlation coefficient ρA,i, that is a correlation coefficient between class A and the i-th word, ranging from −1 to 1, is calculated. The correction factor αi,j is expressed by Math. 11, using the correlation coefficient ρA,i and ρA,j.
For example, Pearson's correlation coefficient can be used as the correlation coefficient ρA,i. Let A denote a random variable indicating whether a text belongs to class A, and let i denote a random variable indicating whether the i-th word occurs (one or more times) in the text. Let N1 denote the number of labeled training data. Furthermore, let At (At is 0 or 1) denote a variable which indicates whether a text t belongs to class A. Analogously, let it (it is 0 or 1) denote a variable which indicates whether the i-th word occurs (one or more times) in the text t. In this case, the correlation coefficient ρA,i can be expressed by Math. 12.
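The correlation between class membership and word occurrence can be sketched with the standard sample Pearson coefficient over the 0/1 indicators At and it. The standard formula is assumed here, and the four labeled texts are hypothetical.

```python
import math

def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

# Hypothetical labeled training texts: 1 if the text belongs to class A, else 0.
in_class_a = [1, 1, 0, 0]
# 1 if the word occurs in the text, else 0.
word_only_in_a = [1, 1, 0, 0]
word_only_outside_a = [0, 0, 1, 1]
word_unrelated = [1, 0, 1, 0]

rho_same = pearson(in_class_a, word_only_in_a)
rho_opposite = pearson(in_class_a, word_only_outside_a)
rho_unrelated = pearson(in_class_a, word_unrelated)
```

A word that occurs only in class A texts gets a coefficient of 1, a word that occurs only outside class A gets −1, and a word that occurs independently of the class gets 0.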
The distance calculation unit 140 calculates the distances between words using Math. 10, Math. 11, and Math. 12.
According to the third exemplary embodiment of the present invention, it is possible to obtain more accurate classification results using the weight vectors adjusted based on distances between words. This is because the distance calculation unit 140 corrects distances in such a way that the distance between two words becomes smaller if the two words tend to occur in texts that belong to the same class.
A fourth exemplary embodiment of the present invention will be described below. In the fourth exemplary embodiment of the present invention, an approximation of the Gaussian process (GP) model described in the second exemplary embodiment is used.
Calculating the inverse of the matrix (K+σ^2 I) shown in the second exemplary embodiment becomes infeasible for n > 10^4. However, in practice, the size of the vocabulary is often much larger, at around 10^5 or more words. In the fourth exemplary embodiment, for a fixed i, words w1, . . . , ws (s being sufficiently small compared with n) are chosen from the vocabulary such that the covariance k(ei, ew) is maximal among the words w in the vocabulary, and the approximation expressed by Math. 13 is used.
$p(f_i \mid y_1, \ldots, y_n) \approx p(f_i \mid y_{w_1}, \ldots, y_{w_s})$ [Math. 13]
The words w in the vocabulary are sorted by the distance (|ei−ew|) between the i-th word and each word w, and the s closest words are taken. Let sub(i) denote the taken words w1, . . . , ws (a subset of the vocabulary for the i-th word), and let ysub(i), msub(i), Ki,sub(i), and Ksub(i) denote the vectors and matrices restricted to the words in sub(i). The approximation can be expressed by Math. 14.
By using biT defined in Math. 15, the approximation can be expressed as Math. 16.
Note that the computationally expensive calculation of biT can be done in advance during the training phase. Furthermore, instead of calculating the inverse of the matrix (Ksub(i)+σ^2 I), a Cholesky decomposition may be used. In this case, the matrix (Ksub(i)+σ^2 I) is decomposed as Ksub(i)+σ^2 I = L L^T, where L is a Cholesky factor, and the linear equation L L^T biT = Ki,sub(i) can be solved by forward and back substitution. As shown in NPL 3, using the Cholesky decomposition is in general faster and numerically more stable than calculating the inverse.
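The restricted computation can be sketched as follows. The sketch assumes that bi solves (Ksub(i) + σ^2 I) bi = Ki,sub(i), as in the Cholesky equation above, and that the adjusted weight is mi + bi · (ysub(i) − msub(i)) in analogy to the full model; function names and toy data are hypothetical. `np.linalg.solve` is a general solver used here for brevity; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of L.

```python
import numpy as np

def local_smoother(i, E, K, sigma2, s):
    """Return sub(i), the s words closest to word i, and the vector b_i."""
    dist = np.linalg.norm(E - E[i], axis=1)
    sub = np.argsort(dist)[:s]             # includes word i itself (distance 0)
    A = K[np.ix_(sub, sub)] + sigma2 * np.eye(s)
    L = np.linalg.cholesky(A)              # A = L L^T
    z = np.linalg.solve(L, K[i, sub])      # forward substitution step
    b = np.linalg.solve(L.T, z)            # back substitution step
    return sub, b

def adjust_weight(i, y, m, sub, b):
    """Approximate adjusted weight of word i from the words in sub(i) only."""
    return m[i] + b @ (y[sub] - m[sub])

# Hypothetical feature vectors and squared exponential covariance (length scale 1).
E = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
r = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
K = np.exp(-r ** 2 / 2.0)
y = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
m = np.full(len(y), y.mean())

sub, b = local_smoother(0, E, K, sigma2=0.5, s=2)  # precomputed in training
f0 = adjust_weight(0, y, m, sub, b)                # cheap step per new text
```

Only the s×s system has to be factorized per word, and with s equal to n the result coincides with the full posterior mean under the same assumptions.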
The adjusting unit 120 adjusts a weight of each word in the weight vector according to Math. 16.
Note that, for training, the approximation reduces the computational complexity from O(n^3) to O(ns^3) and the space complexity from O(n^2) to O(s^2). Also for testing, i.e., for calculating f1, . . . , fn for a new text given y1, . . . , yn, the computational complexity drops from O(n^2) to O(s^2).
According to the fourth exemplary embodiment of the present invention, it is possible to reduce computational complexity for adjusting weights of words. This is because the adjusting unit 120 adjusts a weight of each word in the weight vector by using the covariance matrix restricted to a subset of words close to the word for which the weight is adjusted.
While the invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
The present invention is applicable to a text classification or clustering system. In particular, the present invention makes it possible to detect that two texts are semantically similar even if they have only a few words in common. Therefore, even if the text to be classified is short, like a Tweet, the present invention allows high text classification accuracy.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2015/002242 | 4/24/2015 | WO | 00 |