The present disclosure relates to methods for identifying and extracting subjective information from natural language text. More particularly, the present disclosure relates to a method and system for classifying the sentiment of text using n-gram analysis.
Sentiment analysis (SA), or polarity mining, involves the tasks of identifying and extracting subjective information from natural language text. Automatic sentiment analysis has received significant attention in recent years, largely due to the explosion of socially oriented content online (e.g., user reviews, blogs, etc.). As one of the basic SA tasks, sentiment classification aims to accurately classify the polarity of a given text into a label or a score that indicates whether the opinion expressed in the text is positive, negative, or neutral.
Prior art sentiment classification methods classify the polarity of a given text at the word, sentence (or paragraph), or document level. Some methods relate the polarity of an article to the sentiment orientation of the words in the article. Latent semantic analysis has been used to calculate the semantic orientation of extracted words according to their co-occurrences with seed words such as "excellent" and "poor." The polarity of an article is then determined by averaging the sentiment orientation of the words in the article.
Instead of limiting the analysis to the word level, other prior art methods perform sentiment classification at the article level. Various methods have been proposed; they differ mainly in the features used, with most methods focusing on unigrams and/or filtered bigrams only. Inverse document frequency (IDF) weighting schemes have also been used as features and have been found to effectively improve sentiment classification accuracy.
Still other methods capture substructures existing in the article in order to help polarity prediction. For example, some methods use a hidden Markov-based model to describe the dependency between local content substructures in text in order to improve sentiment polarity prediction. Similarly, other methods learn a different content model (an aspect-sentiment model) from large-scale data sets in an unsupervised fashion.
Accordingly, an improved method for sentiment classifying text is needed.
Disclosed herein is a method for determining the sentiment of a text document. The method may comprise embedding each word of the document into feature space in a computer process to form word embedding vectors; linking the word embedding vectors into an n-gram in a computer process to generate a vector; mapping the vector into latent space in a computer process to generate a plurality of n-gram vectors; generating a document embedding vector in a computer process using the n-gram vectors; and classifying the document embedding vector in a computer process to determine the sentiment of the document.
The method of the present disclosure classifies the sentiment orientation of text at the article level using high-order n-grams (i.e., short phrases of 3 or more words), because intuitively longer phrases tend to be less ambiguous in terms of their polarity. An n-gram is a sequence of n neighboring items from a string of text or speech, such as syllables, letters, words, and the like.
The method of the present disclosure uses high-order n-grams to capture sentiments in text. For example, the term "good" commonly appears in positive reviews, but "not good" or "not very good" is less likely to appear in positive comments. If a bag-of-unigrams (bag of all possible words) model is used and the term "not" is separated from the term "good," the model cannot capture the "not good" combination. Similarly, if a bag-of-bigrams model is used, the model cannot represent the short pattern "not very good." In another example, if a product review uses the phrase "Terrible, Terrible, Terrible," the review expresses a more negative opinion than three separate occurrences of "Terrible" in the review.
Building n-gram features can remedy the above-identified issue; however, it is computationally very difficult to model n-gram (for $n \geq 3$) raw features directly. This is due to the extremely large parameter space associated with n-grams. For instance, assuming the English word dictionary size is $D$, a bigram representation of text involves $D^2$ free parameters, while a trigram representation involves $D^3$ free parameters. When the number of training samples is limited, this can easily lead to overfitting. If the unigram dictionary has a size $D = 10{,}000$, then $D^2 = 10^8$ or $D^3 = 10^{12}$ free parameters need to be estimated, which is far too many for small corpora (bodies of writing). As more and more web-scale sentiment data sets become available, however, large corpora with sentiment labels are becoming easily accessible to researchers.
To solve this excessively high-dimensional problem, the method of the present disclosure represents each n-gram as an embedding vector, hereinafter referred to as a "latent n-gram." A multi-level embedding strategy may be used to project n-grams into a low-dimensional latent semantic space, where the projection parameters are trained in a supervised fashion together with the sentiment classification task. Using, for example, a deep convolutional neural network, the semantic embedding of the n-grams, the bag-of-occurrence representation of text from n-grams, and the classification function from each review to the sentiment class are learned jointly in one unified discriminative framework. The method of the present disclosure advantageously utilizes an embedding space to greatly reduce the dimensionality of the n-grams, making them much easier to model than raw n-gram features. Further, the n-gram embeddings are learned using supervised signals from the main sentiment classification task; therefore, the n-gram embeddings are optimized for that task and require little human input in feature engineering.
In some embodiments, the elements of the embedding vector may be initialized by an unsupervised method such as, but not limited to, latent semantic indexing. Each element $e_i^j \in \mathbb{R}$, $j \in [1 \ldots m]$, in the context of latent semantic indexing, represents the component of concept $j$ in the $i$-th word. Given a sentence of $n$ words, this sentence may be represented by a sequence of $n$ word ($w$) embedding vectors
$s = (e_{w_1}, e_{w_2}, \ldots, e_{w_n}).$
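By way of illustration only, the word-embedding step can be sketched as a simple lookup table; the toy dictionary, dimensions, and random initialization below are assumptions of this sketch (in practice the table may instead be initialized by latent semantic indexing, as noted above):

```python
import numpy as np

# Illustrative lookup-table word embedding: each of the D dictionary words is
# mapped to an m-dimensional embedding vector (one row of E).
D, m = 10_000, 50                        # dictionary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(D, m))  # could instead be initialized by LSI

vocab = {"not": 0, "very": 1, "good": 2}  # toy dictionary for illustration

def embed_sentence(words, vocab, E):
    """Map a sentence of n words to the sequence s = (e_w1, ..., e_wn)."""
    return np.stack([E[vocab[w]] for w in words])  # shape (n, m)

s = embed_sentence(["not", "very", "good"], vocab, E)
print(s.shape)  # (3, 50)
```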
In block 20, the embedding vectors generated in block 10 are used in a phrase or n-gram embedding process to generate phrase or n-gram vectors. The term "phrase" in the present disclosure refers to a sliding window of length $k$ over a sentence of the text. For example but not limitation, if $k = 3$, phrase 1 can be $(w_1, w_2, w_3)$, phrase 2 can be $(w_2, w_3, w_4)$, etc. The maximum index of the phrases is $n - k + 1$. If the sentence is not long enough (i.e., $n < k$), artificial words can be appended as "padding" to make up the shortage. The phrase embedding vector $p_i$ of the $i$-th phrase may be $p_i = h(F \cdot c_i)$. The concatenation vector $c_i \in \mathbb{R}^{km}$ is the concatenation of the word embeddings of the words in the $i$-th phrase: $c_i = [e_{w_i}^1, e_{w_i}^2, \ldots, e_{w_i}^m, e_{w_{i+1}}^1, \ldots, e_{w_{i+k-1}}^m]^T$, and $F \in \mathbb{R}^{b \times km}$ is an embedding matrix. Each row in $F$ can be viewed as a "loading vector" onto which a concatenation vector can be projected to generate the corresponding component. This behavior is similar to other dimension reduction methods such as PCA and LSI; the difference is that the loading vectors of the present disclosure are generated by semi-supervised training. The nonlinear function $h(x)_i = \tanh(x_i)$ is an element-wise operator on the phrase embedding vector $p_i$; it converts an unlimited output range to $[-1, 1]$.
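A minimal sketch of the sliding-window phrase embedding $p_i = h(F \cdot c_i)$ follows, assuming NumPy arrays, a zero-vector padding word, and toy dimensions; none of these choices are prescribed by the disclosure:

```python
import numpy as np

def phrase_embeddings(s, F, k=3, pad=None):
    """Sliding-window phrase embedding p_i = tanh(F @ c_i).

    s   : (n, m) array of word embeddings for one sentence.
    F   : (b, k*m) embedding matrix (each row a learned "loading vector").
    pad : (m,) embedding of an artificial padding word, used when n < k.
    """
    n, m = s.shape
    if n < k:                               # pad short sentences with artificial words
        pad = np.zeros(m) if pad is None else pad
        s = np.vstack([s, np.tile(pad, (k - n, 1))])
        n = k
    phrases = []
    for i in range(n - k + 1):              # one window per phrase
        c_i = s[i:i + k].reshape(-1)        # concatenation vector, length k*m
        phrases.append(np.tanh(F @ c_i))    # squash each component to [-1, 1]
    return np.stack(phrases)                # shape (n - k + 1, b)

# toy usage: a 5-word sentence with m = 50, embedding dimension b = 20, window k = 3
m, b, k = 50, 20, 3
F = np.random.default_rng(1).normal(scale=0.1, size=(b, k * m))
s = np.random.default_rng(2).normal(size=(5, m))
P = phrase_embeddings(s, F, k=k)
print(P.shape)  # (3, 20)
```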
Referring again to the figure, the phrase embedding vectors generated in block 20 are used in a document embedding process to generate a document embedding vector $d$ as the mean of the phrase embeddings:
$d = \frac{1}{N-k+1} \sum_{i=1}^{N-k+1} p_i.$
Here, $d \in \mathbb{R}^b$ is a $b$-dimensional embedding. $P$ is the matrix with all the phrase embeddings as its columns, $P = [p_1 | p_2 | \ldots | p_{N-k+1}]$, where $N$ is the length of the document.
In other words, the $i$-th element in the document embedding $d$ is the mean value of the $i$-th dimension over all phrase embeddings. The rationale behind this is that the sentiment of a document is related to the average polarity of all of its phrases: the more positive phrases in the document, the more likely the document expresses a positive opinion. The mean value is therefore a good summary of the sentiment of the document.
Another fundamental reason for this formulation is that the number of phrases in a sentence varies with the sentence length $n$; thus, a function is needed to compress the information from these phrases into a fixed-length document embedding vector. There are of course many options for this operation. For example, in some embodiments, a max function, which selects the maximum value in each phrase embedding dimension and outputs a fixed-dimension vector, may be used for this operation instead of the mean function described earlier.
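A minimal sketch of both pooling options (mean and max) over the phrase embeddings, with toy dimensions chosen only for illustration:

```python
import numpy as np

def document_embedding(P, pooling="mean"):
    """Compress a variable number of phrase embeddings into a fixed-length
    document embedding d (one value per embedding dimension).

    P : (num_phrases, b) array of phrase embeddings.
    """
    if pooling == "mean":            # average polarity of all phrases
        return P.mean(axis=0)
    if pooling == "max":             # strongest activation per dimension
        return P.max(axis=0)
    raise ValueError("pooling must be 'mean' or 'max'")

P = np.random.default_rng(3).normal(size=(7, 20))  # 7 phrases, b = 20
d_mean = document_embedding(P, "mean")
d_max = document_embedding(P, "max")
print(d_mean.shape, d_max.shape)  # (20,) (20,)
```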
Referring still to the figure, the document embedding vector is applied to a classifier to determine the sentiment of the document. In some embodiments, the classifier may comprise a binary classifier $\hat{c}$ selected from a family of classifiers $\mathcal{C}$ so as to minimize a classification loss $\ell$ between its predictions $c(d)$ and the sentiment labels $y_d$ over a training set $D$ of documents:
$\hat{c} = \arg\min_{c \in \mathcal{C}} \sum_{d \in D} \ell\big(c(d), y_d\big).$
Then, a linear classifier $c(x) = \operatorname{sgn}(Wx + b)$ can be selected to optimize classification performance.
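As an illustration, the decision function $c(x) = \operatorname{sgn}(Wx + b)$ and one possible per-sample training update are sketched below; the hinge-style update is an assumption of this sketch, not a loss mandated by the disclosure:

```python
import numpy as np

def classify(d, W, b):
    """Binary sentiment decision c(d) = sgn(W.d + b): +1 positive, -1 negative."""
    return np.sign(W @ d + b)

def hinge_update(d, y, W, b, lr=0.01):
    """One hinge-loss gradient step on a single (d, y) pair, y in {-1, +1}.
    This particular loss is an illustrative choice, not taken from the disclosure."""
    if y * (W @ d + b) < 1.0:       # margin violated -> move decision boundary
        W = W + lr * y * d
        b = b + lr * y
    return W, b

rng = np.random.default_rng(4)
dim = 20                                   # document embedding dimension b
W, bias = rng.normal(scale=0.1, size=dim), 0.0
d, y = rng.normal(size=dim), +1            # one toy (embedding, label) pair
W, bias = hinge_update(d, y, W, bias)
print(classify(d, W, bias))
```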
In other embodiments, the classifier may comprise an ordinal classifier, which performs an ordinal regression scheme that ranks the document, for example but not limitation, on a Likert scale such that the class labels are in rank order. Utilizing ordinal information in the classification may achieve better performance than treating each class separately. There are different methods for ordinal classification/regression. In some embodiments, the ordinal classification scheme may comprise a simple marginal ordinal loss, as described below.
In this embodiment, a $t$-point Likert-scale system is disclosed in which a boundary $B_l$ is provided for each class $l \in [1, t]$. These boundaries are in ascending order, i.e., $B_i < B_j$ for all $i < j$. The function $f(d)$ outputs a score for a document embedding vector $d$. The objective is to find the parameters (the parameters $\theta$ of the score function $f(\cdot)$ and the class boundaries $B_i$, $i \in [1, t]$) that minimize the loss $L(D)$ over the training set. The classifier $c(d)$ may then be defined in terms of the score and the boundaries, for example by assigning the document to the lowest class $l$ whose boundary $B_l$ the score $f(d)$ does not exceed.
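One common thresholds-based formulation of such an ordinal scheme is sketched below; the particular prediction rule and margin loss shown are illustrative assumptions and may differ from the exact definitions of $c(d)$ and $L(D)$ in the disclosure:

```python
import numpy as np

def ordinal_predict(score, B):
    """Assign the lowest class l whose boundary the score does not exceed.
    B: ascending boundaries B_1 < ... < B_t (the last is effectively unbounded)."""
    for l, boundary in enumerate(B, start=1):
        if score <= boundary:
            return l
    return len(B)

def marginal_ordinal_loss(score, label, B, margin=1.0):
    """Illustrative margin loss: the score should clear every boundary below its
    class by `margin` and stay `margin` below every boundary at or above it."""
    loss = 0.0
    for l, boundary in enumerate(B, start=1):
        if l < label:
            loss += max(0.0, margin - (score - boundary))  # must exceed boundary
        else:
            loss += max(0.0, margin - (boundary - score))  # must stay below it
    return loss

B = [-2.0, -0.5, 0.5, 2.0, np.inf]    # boundaries for a t = 5 point scale
print(ordinal_predict(0.3, B))         # -> class 3
print(marginal_ordinal_loss(0.3, 4, B))
```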
The method of the present disclosure can be implemented using a layered network structure, such as but not limited to a convolutional neural network. In one exemplary embodiment, the neural network may comprise a 5-layer architecture including a lookup table layer (first level) for word embedding, a temporal convolution layer (second level) for phrase embedding, a transfer tanh layer (third level) for phrase embedding, a mean function layer (fourth level) for document embedding, and a classifier layer (fifth level; e.g., binary, ordinal, etc.) for classifying the sentiment of the document. The use of a neural network allows for easy training using back propagation. The stacked layers in the neural network can be written as a composition of nested functions:
$y = f_n(f_{n-1}(\ldots(f_1(x))\ldots)).$
For a layer $f_i$, $i \in [1, n]$, the derivative for updating its parameter set $\theta_i$ is $\frac{\partial y}{\partial \theta_i} = \frac{\partial y}{\partial f_i} \cdot \frac{\partial f_i}{\partial \theta_i}$, and the first factor on the right can be recursively calculated as $\frac{\partial y}{\partial f_i} = \frac{\partial y}{\partial f_{i+1}} \cdot \frac{\partial f_{i+1}}{\partial f_i}$, which is the standard back-propagation recursion through the stacked layers.
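For concreteness, the 5-layer stack described above can be sketched in PyTorch as follows; the layer sizes, padding index, and the use of nn.Embedding and nn.Conv1d are assumptions of this sketch rather than requirements of the disclosure, and training of all parameters would proceed by back propagation as just described:

```python
import torch
import torch.nn as nn

class LatentNGramNet(nn.Module):
    """Illustrative 5-layer stack: lookup table -> temporal convolution ->
    tanh -> mean over phrases -> linear classifier."""
    def __init__(self, vocab_size=10_000, m=50, b=20, k=3, num_classes=2):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, m, padding_idx=0)  # word embedding
        self.conv = nn.Conv1d(m, b, kernel_size=k)                # phrase embedding
        self.classifier = nn.Linear(b, num_classes)               # sentiment classes

    def forward(self, word_ids):
        # word_ids: (batch, n) integer word indices, padded with index 0
        s = self.lookup(word_ids)           # (batch, n, m) word embeddings
        s = s.transpose(1, 2)               # (batch, m, n) for temporal convolution
        p = torch.tanh(self.conv(s))        # (batch, b, n - k + 1) phrase embeddings
        d = p.mean(dim=2)                   # (batch, b) document embedding (mean pool)
        return self.classifier(d)           # (batch, num_classes) class scores

model = LatentNGramNet()
scores = model(torch.randint(1, 10_000, (4, 12)))  # 4 toy documents of 12 words
print(scores.shape)  # torch.Size([4, 2])
```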
Furthermore, a stochastic gradient descent (SGD) method may be used to accelerate training of the network. For a set of training samples, instead of calculating the true gradient of the objective over all training samples, SGD calculates the gradient and updates the parameters on each individual training sample. SGD has proven to be more scalable and more efficient than the batch-mode gradient descent method. In one embodiment, the training algorithm iterates over the training samples in random order and applies one gradient update per sample.
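A minimal illustrative per-sample SGD loop follows; the stand-in linear model, learning rate, loss function, and toy data are assumptions for illustration only (in the disclosure the same per-sample update would be applied to the full layered network via back propagation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)                   # stand-in classifier on b = 20 dims
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# toy training set: (document embedding, sentiment label) pairs
data = [(torch.randn(1, 20), torch.randint(0, 2, (1,))) for _ in range(8)]

for epoch in range(3):
    for idx in torch.randperm(len(data)).tolist():  # visit samples in random order
        d, y = data[idx]
        optimizer.zero_grad()
        loss = loss_fn(model(d), y)        # gradient on this single sample,
        loss.backward()                    # not on the whole training set
        optimizer.step()                   # parameter update per sample
```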
While exemplary drawings and specific embodiments of the present disclosure have been described and illustrated, it is to be understood that the scope of the invention as set forth in the claims is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by persons skilled in the art without departing from the scope of the invention as set forth in the claims that follow and their structural and functional equivalents.
This application claims the benefit of U.S. Provisional Application No. 61/469,297, filed Mar. 30, 2011, the entire disclosure of which is incorporated herein by reference.