1. Technical Field
The present invention relates to document classification and, more particularly, to document classification using supervised weighted n-gram embedding.
2. Description of the Related Art
The task of document classification is defined as automatic assignment of one or more categorical labels to a given document. Examples of document classification include topic categorization, sentiment analysis, and formality studies. A document may include a sentence, paragraph, or any snippet of text—the term is defined herein to encompass all such objects.
Previous techniques applied to this task are either generative or discriminative supervised methods. Discriminative document classification techniques commonly rely on the so-called “bag-of-words” (BoW) representation that maps text articles of variable lengths into a fixed-dimensional vector space, parameterized by a finite vocabulary. The BOW model treats a document as an unordered collection of word-features and utilizes the distribution of the words as the primary evidence for its classification. The “bag-of-unigrams” is the most common form of BoW representation that utilizes a word dictionary as its vocabulary.
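The bag-of-unigrams mapping described above can be illustrated with a minimal sketch. The toy vocabulary and the function name `bag_of_unigrams` are hypothetical and chosen only for illustration; the point is that word order does not affect the resulting fixed-length vector.

```python
from collections import Counter

def bag_of_unigrams(text, vocabulary):
    """Map a text of any length to a fixed-length count vector over `vocabulary`."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocabulary]

vocab = ["good", "bad", "movie", "generally"]  # hypothetical toy dictionary
v1 = bag_of_unigrams("generally good movie", vocab)
v2 = bag_of_unigrams("movie good generally", vocab)  # same words, different order
```

Both calls yield the same vector, which is exactly the loss of ordering information that the unordered "bag" assumption implies.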
Some classification attempts have found short phrases to be more effective than single words (unigrams) for the task. These approaches extend the “bag-of-unigrams” model by incorporating n-grams (contiguous sequences of n words) in the vector space representation of the text. However, the complexity of modeling n-grams grows exponentially with n. For an English word dictionary $\mathcal{D}$ of size $|\mathcal{D}|$, bi-gram and trigram representations of text involve on the order of $|\mathcal{D}|^2$ and $|\mathcal{D}|^3$ free parameters, respectively.
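As a rough illustration of this growth, the following sketch extracts n-grams with a sliding window and counts how many distinct n-grams a dictionary could in principle produce. The function names and the dictionary size of 10,000 are assumptions made for the example.

```python
def ngrams(words, n):
    """Return all contiguous n-word windows of `words`, in order."""
    return [tuple(words[j:j + n]) for j in range(len(words) - n + 1)]

def num_possible_ngrams(dictionary_size, n):
    """Upper bound on distinct n-grams: one free parameter per n-gram."""
    return dictionary_size ** n

bigrams = ngrams("not generally good".split(), 2)
bigram_params = num_possible_ngrams(10_000, 2)   # hypothetical dictionary size
trigram_params = num_possible_ngrams(10_000, 3)
```

With a 10,000-word dictionary, the bound is already 10^8 for bi-grams and 10^12 for trigrams.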
Despite the simplicity and relative success of document classification using n-gram features, previous models disregard all the spatial and ordering information of the n-grams, even though such information is important for text data. In the example of sentiment analysis, phrases with strong polarity might be more important in deciding the polarity of the whole document. For example, a document with the phrase “generally good” at its beginning is very likely to express a positive sentiment. When the same phrase appears in the middle of another text, the document is less likely to be a positive comment. Similarly, the start and the end of sentences or paragraphs in an on-line news article might contain more critical and subjective information than its other parts. Completely capturing such relationships would require full semantic understanding, which is beyond the current state of technology.
A method for document classification includes embedding n-grams from an input text in a latent space; embedding the input text in the latent space based on the embedded n-grams and weighting said n-grams according to spatial evidence of the respective n-grams in the input text; classifying the document along one or more axes using a processor; and adjusting weights used to weight the n-grams based on the output of the classifying step.
A system for document classification includes an n-gram embedding module configured to embed n-grams from an input text in a latent space; a document embedding module configured to embed the input text in the latent space based on the embedded n-grams, weighted according to spatial evidence of the respective n-grams in the input text; a classifier configured to classify the document along one or more axes using a processor; and a weight learning module configured to adjust the weights for the n-grams based on the output of the classifier.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
The present principles provide a unified deep learning framework for using high-order n-grams and for exploring spatial information. This goes beyond the bag-of-words (BOW) strategy for document classification. First, a supervised embedding mechanism is provided to directly model n-grams in a low-dimensional latent semantic space. Then, to achieve a fine-grained model of target documents, a controller module is trained by spatially re-weighting the sub-parts of each document. The document representation is then provided to a classifier that is trained for target classification tasks. A deep neural network learns the parameters of the latent space, the article modeling layers, and the classifier jointly in one end-to-end discriminative framework.
Compared to a bag-of-n-grams (BON) strategy using feature selection, or to the n-gram embedding based method, an advantage of the present principles is that whole-document modeling with the controller module provides a spatial weighting scheme for different subparts, driven by the supervised classification signals. All layers are optimized for the target task and need little human intervention, making the system readily adaptable to other tasks.
Before addressing the particulars of the present principles, the following notations are used throughout. $\mathcal{D}$ denotes an underlying word (unigram) dictionary and $S$ denotes the set of all finite length sequences of words from $\mathcal{D}$. The operator $|\cdot|$ denotes the cardinality of a set. An input text sequence of length $N$ will be denoted as $x=(w_1, \ldots, w_N)$, where $w_j \in \mathcal{D}$, $x \in S$, and $j$ indicates the j-th position in $x$. Formally, the basic BOW representation applies a mapping $\phi(\cdot)$ to a text string $x$ and converts it into a feature space of fixed dimension (e.g., $|\mathcal{D}|$). It is in this space that a standard classifier such as a linear perceptron or support vector machine can be applied. The mapping $\phi: S \to \mathbb{R}^M$ takes word sequences in $S$ and maps them to a finite-dimensional feature space. The document labels form a set $\mathcal{Y}=\{1, \ldots, C\}$. For example, $C=2$ denotes sentiment classes such as “positive” or “negative.” A labeled training set with training labels from $\mathcal{Y}$ is denoted as $X=\{(x_i, y_i)\}_{i=1,\ldots,L}$, with $x_i \in S$ and $y_i \in \mathcal{Y}$.
Referring now to
Block 106 uses factors to weight the n-gram vectors according to where each n-gram is located within the document. For example, n-grams that are located near the beginning or end of the document may have greater weight than those n-grams located in the middle. Block 108 uses the weighted vectors to obtain a fixed-dimension representation of the document, a process called “embedding.” The resulting vector representation of the text document is used in block 110 to classify the document according to a predetermined classification system. For example, a classifier may discriminate between “positive” and “negative” statements, or may instead use any well-defined classification system. Block 112 uses the results of the classification to “backpropagate” and train the weights and document embeddings used in blocks 106 and 108.
Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
The present principles are exemplified using common text classification tasks: sentimental text binary classification and news text categorization. Sentimental text binary classification predicts a binary sentiment, positive or negative, expressed by a given document, whereas news text categorization predicts the semantic class a news text piece belongs to.
To overcome the dimensionality burdens that high-order models impose, the present principles project n-grams to a low-dimensional latent semantic space at block 104 using only unigram word dictionaries. Associated parameters are estimated with a bias for target classification. This projection of n-grams can be performed with a word embedding or with direct phrase embedding.
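$\Gamma$ denotes the vocabulary of n-grams in an input corpus, with each n-gram $\gamma_j=(w_j, w_{j+1}, \ldots, w_{j+n-1})$, where $j$ indicates the j-th position in the input document $x$. In a bag-of-unigrams representation, the function $\phi(x)$ would map the input $x$ in a natural way to a (sparse) vector of dimensionality $M=|\mathcal{D}|$. Similarly, in a bag-of-n-grams, the function $\phi(x)$ maps $x$ to an $M=|\Gamma|$-dimensional representation, with $|\Gamma|=O(|\mathcal{D}|^n)$. Using a sparse vector representation, a unigram word $w_j$ can be described as a vector $e_{w_j}$ of dimension $|\mathcal{D}|$ whose only non-zero entry is a 1 at the index of $w_j$, with an n-gram vector $e_{\gamma_j}$ formed by concatenating the vectors of its $n$ words:
$$e_{\gamma_j} = [e_{w_j}^T, e_{w_{j+1}}^T, \ldots, e_{w_{j+n-1}}^T]^T$$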
Referring now to
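In block 202, each word is projected into a latent space of dimension $m$:
$$E_{w_j} = E \times e_{w_j} \quad (1)$$
where $E \in \mathbb{R}^{m \times |\mathcal{D}|}$ is a matrix with word embedding parameters to be learned. Here, $E_{w_j} \in \mathbb{R}^m$ is the embedding of the word $w_j$, that is, the column of $E$ selected by $e_{w_j}$.
The n-grams are formed by a sliding window of length $n$. Given an n-gram $\gamma_j$ of $n$ adjacent words at position $j$, in block 204 the word lookup table layer applies the same operation to each word inside the n-gram, producing the $nm$-dimensional vector $z_{\gamma_j} = [E_{w_j}^T, E_{w_{j+1}}^T, \ldots, E_{w_{j+n-1}}^T]^T$. The n-gram embedding is then obtained by a second projection: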
$$p_{\gamma_j} = h(F \times z_{\gamma_j}) \quad (2)$$
where the projection matrix $F \in \mathbb{R}^{M \times nm}$ maps the vector $z_{\gamma_j}$ into an $M$-dimensional latent space and $h(\cdot)=\tanh(\cdot)$. The $h(\cdot)$ function is not limited to the hyperbolic tangent, but may instead be any appropriate function that converts an unbounded range into a range from −1 to 1. In this manner, word-based embedding constructs a low-dimensional latent embedding for all phrases $\gamma_j \in x$ by first projecting each word into a latent space in block 202, followed by a second projection to obtain the latent embedding of each n-gram in block 204.
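The two projections described above (a per-word lookup in E, followed by the projection F and the tanh non-linearity) can be sketched as follows. This is a minimal illustration, not the implementation of the present principles: the toy sizes, the random initialization, and the function names are assumptions, and E is stored with one row per word so that indexing a row plays the role of multiplying E by a one-hot vector.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def embed_ngram(word_ids, E, F):
    """Embed one n-gram: look up each word in E, concatenate, project with F, apply tanh."""
    z = [value for w in word_ids for value in E[w]]  # n*m-dimensional concatenation
    return [math.tanh(a) for a in matvec(F, z)]      # M-dimensional n-gram embedding

rng = random.Random(0)
dict_size, m, n, M = 5, 3, 2, 4                      # hypothetical toy sizes
E = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(dict_size)]
F = [[rng.uniform(-0.1, 0.1) for _ in range(n * m)] for _ in range(M)]
p = embed_ngram([1, 3], E, F)                        # bi-gram made of word ids 1 and 3
```

Only E (m values per dictionary word) and F (M × nm values) need to be learned, rather than one parameter per possible n-gram.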
With the goal of classifying the whole document into certain classes, the whole document may be represented based on its included n-grams. The structured spatial patterns in natural language text cannot be captured by the unstructured “bag” assumption of BOW.
The n-gram embedding computes a low-dimensional representation of all n-grams in a given text. Some function $\phi(\cdot)$ is used to compress the information in a text document of variable length into a finite-dimensional feature space (a document embedding vector). The class prediction for an input text is then computed using a function $g(\cdot)$, defined in the resulting low-dimensional space. While there are many possibilities to combine latent n-grams into a document embedding vector, an averaging strategy is described herein. Formally, the document representation is defined as:
$$d_x = \frac{1}{N} \sum_{j=1}^{N} p_{\gamma_j} \quad (3)$$
where $d_x \in \mathbb{R}^M$ and $x=(w_1, \ldots, w_N)$. In other words, $d_x$ is the centroid of the vectors associated with the n-grams of the document $x$. Using sentiment classification as a test case, the sentiment polarity of a document is intuitively related to the aggregated semantics or polarity of all its n-grams. In other words, the more positive n-grams are present in the document, the more likely it is that the document expresses a positive opinion. While there are many possibilities for the aggregation function, a mean value function provides a good summarization of the document's sentiment in this latent space. One can also use a maximizing function that selects the maximum value along each latent dimension.
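The two aggregation choices mentioned above, the mean (centroid) and the per-dimension maximum, might look like this in a minimal sketch. The function names and the sample vectors are hypothetical.

```python
def mean_embedding(ngram_vectors):
    """Centroid of the n-gram embeddings: one average per latent dimension."""
    count = len(ngram_vectors)
    return [sum(column) / count for column in zip(*ngram_vectors)]

def max_embedding(ngram_vectors):
    """Alternative aggregation: the maximum value along each latent dimension."""
    return [max(column) for column in zip(*ngram_vectors)]

vectors = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.1]]  # three 2-dimensional n-gram embeddings
d_mean = mean_embedding(vectors)
d_max = max_embedding(vectors)
```

Both functions return a vector of the same dimension as a single n-gram embedding, regardless of how many n-grams the document contains.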
Referring to
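The weighted document representation is defined as:
$$d_x = h\left(\sum_{j=1}^{N} q_j \, p_{\gamma_j}\right) \quad (4)$$
where $d_x \in \mathbb{R}^M$, $x=(w_1, \ldots, w_N)$, and $h(\cdot)=\tanh(\cdot)$. The convex combination parameter for the phrase $\gamma_j$ is initially defined as $q_j=1/N$, and is subsequently learned based on the location of the phrase $\gamma_j$ in block 304. The weight of every $\gamma_j$ is modeled as a scalar $q_j$ using the following mixture model. Let $\gamma_j \in x$, $|x|=N$, and $j \in \{1, \ldots, N\}$ indicate the position of an n-gram in $x$, and define the weight associated with $\gamma_j$ as:
$$q_j = \frac{1}{Q} \sum_{k=1}^{K} \mathrm{sigmoid}\left(a_k \cdot \frac{j}{N} + b_k\right) \quad (5)$$
where $a_k, b_k$ are parameters to be learned, $Q$ is the normalization term (the sum of the unnormalized weights over $j=1, \ldots, N$, so that the $q_j$ sum to one), $K$ specifies the number of mixture quantities, and $\mathrm{sigmoid}(\cdot)$ is a non-linear transfer function. The spatial re-weighting attempts to capture longer “trends” within each document and preserves spatial information for phrases within a document. The parameters in equation 5 include two vectors, $a$ and $b$. They are learned by a “backpropagation” training process. The n-gram vectors $p_{\gamma_j}$ are then combined using the learned weights $q_j$ to form the document embedding $d_x$.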
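A minimal sketch of the position-based re-weighting follows. It assumes that each mixture component is a sigmoid of the relative position j/N scaled by a_k and shifted by b_k, and that the weights are normalized to sum to one; the function names and parameter values are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def spatial_weights(N, a, b):
    """Normalized weights q_1..q_N from K sigmoid components (assumed form)."""
    raw = [sum(sigmoid(a_k * (j / N) + b_k) for a_k, b_k in zip(a, b))
           for j in range(1, N + 1)]
    Q = sum(raw)                      # normalization so the weights sum to one
    return [r / Q for r in raw]

def weighted_embedding(ngram_vectors, q):
    """tanh of the weighted sum of n-gram embeddings, one value per latent dimension."""
    M = len(ngram_vectors[0])
    return [math.tanh(sum(q_j * p[i] for q_j, p in zip(q, ngram_vectors)))
            for i in range(M)]

q_uniform = spatial_weights(4, a=[0.0], b=[0.0])    # a_k = 0 ignores position
q_front = spatial_weights(4, a=[-8.0], b=[4.0])     # favors early positions
d = weighted_embedding([[0.5], [0.5], [0.5], [0.5]], q_uniform)
```

Setting every a_k to zero recovers uniform weights of 1/N, while a negative a_k shifts weight toward the beginning of the document.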
An alternative embodiment for modeling the weight of every $\gamma_j$ uses both the position evidence and the content of the n-gram. Let $\gamma_j \in x$, $|x|=N$, and $j \in \{1, \ldots, N\}$ indicate the position of an n-gram in $x$. Another linear projection layer is used:
$$q_j = h\left(B^T \times [\,j/N,\; p_{\gamma_j}^T\,]^T\right) \quad (6)$$
where the projection matrix $B \in \mathbb{R}^{(M+1) \times 1}$ maps the vector $[\,j/N,\; p_{\gamma_j}^T\,]^T$ (concatenating relative position and the n-gram representation) into a scalar and $h(\cdot)=\tanh(\cdot)$. The resulting weight value $q_j$ for phrase $\gamma_j$ considers not only the spatial evidence, but also the n-gram itself. The parameters in equation 6 include the projection matrix $B$, which is learned by the “backpropagation” training process.
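The content-aware weight can be sketched in a few lines. This illustration assumes that the relative position j/N is placed first in the concatenated vector and that B is stored as a flat list of M+1 values; the function name and sample values are hypothetical.

```python
import math

def content_weight(j, N, p, B):
    """Weight for the n-gram at position j: tanh of B applied to [j/N, p]."""
    features = [j / N] + list(p)      # (M+1)-dimensional concatenation
    return math.tanh(sum(b_i * f_i for b_i, f_i in zip(B, features)))

q_content = content_weight(j=1, N=4, p=[0.5, -0.5], B=[0.0, 1.0, 1.0])
q_position = content_weight(j=4, N=4, p=[0.0, 0.0], B=[1.0, 0.0, 0.0])
```

The first call depends only on the n-gram embedding (its position coefficient is zero), while the second depends only on the relative position.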
Referring now to
x=(w1, . . . ,wN)=[x1,x2, . . . ,xk, . . . xK],
where the notation [•] denotes the concatenation of subsequences into an N-dimensional vector. For kε{1, . . . K}, it can be shown that
As with
$$d_x = h([p_1^T, p_2^T, \ldots, p_K^T]^T) \quad (7)$$
which is a vector of size $KM$. The document vector is built from the concatenation of the embedding vectors from its $K$ subsequences in block 512. This captures even “long range” spatial patterns, where each subsequence is normally much longer than a typical n-gram. The concatenation operation also keeps the original linear ordering between subsequences, which is useful for the classification of a document.
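The subsequence-based embedding can be sketched as below. How the document is split is not specified in the surviving text, so the split into K contiguous chunks of near-equal length is an assumption of this sketch, as are the function names.

```python
import math

def split_into_subsequences(items, K):
    """Split `items` into K contiguous chunks of near-equal length (assumed scheme)."""
    N = len(items)
    bounds = [k * N // K for k in range(K + 1)]
    return [items[bounds[k]:bounds[k + 1]] for k in range(K)]

def concat_document_embedding(subsequence_vectors):
    """Concatenate the K subsequence embeddings in order and apply tanh elementwise."""
    return [math.tanh(v) for p_k in subsequence_vectors for v in p_k]

chunks = split_into_subsequences(list(range(6)), 3)
d_concat = concat_document_embedding([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])  # K=3, M=2
```

Because the subsequence embeddings are concatenated rather than averaged, swapping two subsequences changes the resulting document vector.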
Document classification in block 110 may be performed using a linear projection function. Given the document embedding dx, and C candidate classes, W represents a linear projection matrix which projects the embedding representation into a vector with size C. The output vector from this classifier layer is,
g(x)=(W×dx) (8)
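The predicted class label can be calculated as
$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} g_c(x)$$
where $g_c(x)$ denotes the c-th component of the output vector $g(x)$.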
This predicted class belongs to one of the candidate classes {1 . . . C}.
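The linear classifier layer and the class prediction can be sketched as follows, assuming that W is stored as C rows of M values and that class labels run from 1 to C. The function name and the sample values are hypothetical.

```python
def classify(d_x, W):
    """Return the C class scores W x d_x and the 1-based index of the largest score."""
    scores = [sum(w * d for w, d in zip(row, d_x)) for row in W]
    predicted = max(range(len(scores)), key=scores.__getitem__) + 1
    return scores, predicted

W = [[1.0, 0.0],   # class 1 reads the first latent dimension
     [0.0, 1.0]]   # class 2 reads the second latent dimension
scores, label = classify([0.2, 0.7], W)
```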
The last layer measures how much the predicted class of a document differs from its true class label:
The whole network is trained by minimizing the loss function summed over a set of training examples, which includes a number of documents and their true class labels $X=\{(x_i, y_i)\}_{i=1,\ldots,L}$, with $x_i \in S$ and $y_i \in \mathcal{Y}$. The training procedure searches for the parameters (Table 1 below) that minimize the following total loss (commonly named the “negative log likelihood” loss),
Stochastic gradient descent (SGD) may be used to optimize the above loss. In SGD, for a training set $X$, instead of calculating the true gradient of the objective with all the training samples, the gradient is computed with a randomly chosen training sample $(x, y_{TRUE}) \in X$, with all parameters being updated to optimize equation 9. The SGD optimization method is scalable and has been shown to rival the performance of batch-mode gradient descent methods when dealing with large-scale datasets.
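A minimal sketch of SGD on a negative log likelihood loss is given below. The exact form of the loss does not survive in this text, so the sketch assumes a softmax over the classifier scores; it also updates only the classifier matrix W on fixed document embeddings, uses 0-based class indices, and uses hypothetical function names and data.

```python
import math
import random

def linear_scores(W, d_x):
    return [sum(w * d for w, d in zip(row, d_x)) for row in W]

def softmax(scores):
    peak = max(scores)                               # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(W, d_x, true_class):
    """Negative log of the softmax probability assigned to the true class."""
    return -math.log(softmax(linear_scores(W, d_x))[true_class])

def sgd_step(W, d_x, true_class, learning_rate):
    """One update of W, in place, from a single training example."""
    probs = softmax(linear_scores(W, d_x))
    for c, row in enumerate(W):
        grad_score = probs[c] - (1.0 if c == true_class else 0.0)
        for i, d_i in enumerate(d_x):
            row[i] -= learning_rate * grad_score * d_i

data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]            # (document embedding, class)
W = [[0.0, 0.0], [0.0, 0.0]]
loss_before = sum(nll_loss(W, d, y) for d, y in data)
rng = random.Random(0)
for _ in range(200):
    d_x, y = rng.choice(data)                        # randomly chosen training sample
    sgd_step(W, d_x, y, learning_rate=0.5)
loss_after = sum(nll_loss(W, d, y) for d, y in data)
```

Each step touches a single example, so the cost of an update does not grow with the size of the training set.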
Backpropagation is used to optimize the loss function and learn the parameters (called “training”) of each layered module of the network. Each step of the network (word embedding, n-gram embedding, spatial re-weighting, and classification) can be written more generally as a set of functions, $l_x = f_T(f_{T-1}(\ldots(f_1(x))\ldots))$, where $l_x$ denotes the loss on a single example $x$, and the last layer $f_T$ is the loss function defined above in equation 9, evaluated using a single training example $x$. Each function has a set of parameters $\theta_i$ as described in Table 1. For example, $\theta_1=\{E\}$, $\theta_2=\{F\}$, and $\theta_4=\{W\}$ or $\theta_4=\{a,b\}$. The overall system can be represented in a 6-layer network architecture ($T=6$):
Table 1
Each parameter listed in Table 1 corresponds to a parameter from one of the above equations. In particular, the parameter E is used in equation 1, the parameter F is used in equation 2, the parameters a and b are used in equation 5, the parameter B is used in equation 6, and the parameter W is used in equation 8.
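For a layer $f_i$, $i \in [1,T]$, the derivative $\partial l / \partial \theta_i$ is used for updating its parameter set $\theta_i$ using the delta rule
$$\theta_i \leftarrow \theta_i - \lambda \frac{\partial l}{\partial \theta_i}$$
where $\lambda$ is a small constant called the learning rate, which influences the speed of learning the parameters. The delta rule is derived through gradient descent, which tries to optimize the parameters by minimizing the error (loss) in the output of each single-layer module. It can be seen that:
$$\frac{\partial l}{\partial \theta_i} = \frac{\partial f_T}{\partial f_i} \times \frac{\partial f_i}{\partial \theta_i}$$
and the first factor on the right can be recursively calculated:
$$\frac{\partial f_T}{\partial f_i} = \frac{\partial f_T}{\partial f_{i+1}} \times \frac{\partial f_{i+1}}{\partial f_i}$$
where $\partial f_{i+1} / \partial f_i$ and $\partial f_i / \partial \theta_i$ are Jacobian matrices. Backpropagation is a generalization of the delta rule which provides an efficient strategy to perform parameter learning and to optimize multi-layered network modules together.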
Referring now to
Block 410 begins a loop by initializing an iterator index $i$ to zero and an accumulator variable to one. Block 412 multiplies the accumulator by $\partial f_i / \partial \theta_i$ and stores that value as $\partial l / \partial \theta_i$. Block 414 weights $\partial l / \partial \theta_i$ by a factor $\lambda$ and subtracts the weighted value from the parameter $\theta_i$, storing the updated parameter. Block 416 multiplies the accumulator by $\partial f_i / \partial f_{i-1}$ and stores the value as the new accumulator. Block 418 increments $i$, and decision block 420 determines whether to continue the loop based on whether all of the $T$ layers have been updated. If not, processing returns to block 412. If so, processing returns to block 404 to determine whether the updated parameters have converged. The overall loop may be performed any number of times, whether until convergence or until a maximum number of iterations has been reached.
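The accumulator loop described above can be sketched on a toy network of scalar layers. The layer classes, the squared-error loss, and the parameter values are all hypothetical; the sketch also assumes that layers are visited from the loss layer downward (the order implied by an accumulator that starts at one), and it evaluates both local derivatives of a layer before changing that layer's parameter.

```python
import math

class ScaleTanh:
    """Toy scalar layer f(u) = tanh(theta * u) with a single parameter."""
    def __init__(self, theta):
        self.theta = theta
    def forward(self, u):
        self.u, self.out = u, math.tanh(self.theta * u)
        return self.out
    def d_input(self):                 # derivative of the output w.r.t. the input
        return (1.0 - self.out ** 2) * self.theta
    def d_param(self):                 # derivative of the output w.r.t. theta
        return (1.0 - self.out ** 2) * self.u

class SquaredError:
    """Toy loss layer (u - target)^2; it has no parameters."""
    theta = None
    def __init__(self, target):
        self.target = target
    def forward(self, u):
        self.u = u
        return (u - self.target) ** 2
    def d_input(self):
        return 2.0 * (self.u - self.target)
    def d_param(self):
        return None

def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x

def backprop_update(layers, learning_rate):
    accumulator = 1.0                                # derivative of the loss w.r.t. itself
    for layer in reversed(layers):                   # from the loss layer downward
        d_param, d_input = layer.d_param(), layer.d_input()
        if d_param is not None:
            grad = accumulator * d_param             # cf. block 412
            layer.theta -= learning_rate * grad      # cf. block 414
        accumulator *= d_input                       # cf. block 416

layers = [ScaleTanh(0.5), ScaleTanh(0.5), SquaredError(0.8)]
loss_before = forward(layers, 1.0)
for _ in range(200):
    forward(layers, 1.0)                             # forward pass caches activations
    backprop_update(layers, 0.1)
loss_after = forward(layers, 1.0)
```

The accumulator carries the derivative of the loss with respect to the output of the layer below the one just processed, so each layer's gradient costs one extra multiplication.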
Referring now to
Having described preferred embodiments of a system and method (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application claims priority to provisional application Ser. No. 61/492,228 filed on Jun. 1, 2011, incorporated herein by reference. The application further claims priority to provisional application Ser. No. 61/647,012, filed on May 15, 2012, incorporated herein by reference.
Number | Date | Country
---|---|---
61492228 | Jun 2011 | US
61647012 | May 2012 | US