The World Wide Web (“web”) provides access to an enormous collection of information that is available via the Internet. The Internet is a worldwide collection of thousands of networks that span over a hundred countries and connect millions of computers. As the number of users of the web continues to grow, the web has become an important means of communication, collaboration, commerce, entertainment, and so on. The web pages accessible via the web cover a wide range of topics including politics, sports, hobbies, sciences, technology, current events, and so on. The web provides many different mechanisms through which users can post, access, and exchange information on various topics. These mechanisms include newsgroups, bulletin boards, web forums, web logs (“blogs”), news service postings, discussion threads, product review postings, and so on.
Because the web provides access to such an enormous amount of information, it is used extensively by users to locate information of interest. Almost any type of information is electronically accessible; because of this enormous quantity, however, locating information of interest can be very difficult. Many search engine services, such as Google and Yahoo, provide for searching for information that is accessible via the Internet. These search engine services allow a user to search for web pages that may be of interest. After a user submits a search request (also referred to as a “query”) that includes search terms, the search engine service identifies web pages that may be related to those search terms. The search engine service then displays to the user links to those web pages, which may be ordered based on their relevance to the search request and/or their importance.
Various types of experts, such as political advisors, social psychologists, marketing directors, pollsters, and so on, may be interested in analyzing information available via the Internet to identify views, opinions, moods, attitudes, and so on that are being expressed. For example, a company may want to mine web logs and discussion threads to determine the views of consumers of the company's products. If a company can accurately determine consumer views, the company may be able to respond more effectively to consumer demand. As another example, a political adviser may want to analyze public response to a proposal of a politician so that the adviser may advise his clients how to respond to the proposal based in part on this public response.
Such experts may want to concentrate their analyses on subjective content (e.g., opinions or views), rather than objective content (e.g., facts). Typical search engine services, however, do not classify search results as being subjective or objective. As a result, it can be difficult for an expert to identify subjective content from the search results.
Some attempts have been made to categorize documents as subjective or objective, referred to as subjectivity categorization. These attempts, however, have not effectively addressed the “unseen word” problem. An unseen word is a word within a document being categorized that was not in the training data used to train the categorizer. If the categorizer encounters an unseen word, the categorizer will not know whether the word relates to subjective content, objective content, or neutral content. Unseen words are especially problematic in web logs. Because web logs are generally far less focused and less topically organized than other sources of content, they include words drawn from a wide variety of topics that may be used infrequently in the web logs. As a result, categorizers trained on a small fraction of the web logs will likely encounter many unseen words, and thus often cannot effectively categorize documents (e.g., entries, paragraphs, or sentences) of web logs.
A method and system is provided for classifying documents based on the subjectivity of the content of the documents using a part-of-speech analysis to help account for unseen words. A classification system trains a classifier using the parts of speech of training documents so that the classifier can classify an unseen word based on the part of speech of the unseen word. The classification system identifies n-grams of the parts of speech of the words of each training document. The classification system also identifies n-grams of the terms of the training documents. The classification system then trains a part-of-speech model using the part-of-speech n-grams and the labels of the training documents, and trains a term model using the term n-grams and the labels. The models are trained by calculating the probabilities of the n-grams being subjective. To classify a target document, the classification system applies the part-of-speech model to the part-of-speech n-grams of the target document and the term model to the term n-grams of the target document. Each model combines the probabilities of its n-grams to give a probability for that model. The classification system combines the probabilities of the models and designates the target document as being subjective or not based on the combined probabilities.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A method and system is provided for classifying documents based on the subjectivity of the content of the documents using a part-of-speech analysis to help account for unseen words. In some embodiments, a classification system trains a classifier using the parts of speech of training documents so that the classifier can classify unseen words based on their parts of speech. The classification system initially collects the training documents and labels the training documents based on the subjectivity of their content. For example, the classification system may crawl various web logs and treat each sentence or paragraph of a web log as a training document. The classification system may have a person manually label each training document as being subjective or objective. The classification system then identifies the parts of speech of the words or terms of the training documents. For example, the classification system may have a training document with the content “the script is a tired one.” The classification system, disregarding noise words, may identify the parts of speech as noun for “script,” verb for “is,” adjective for “tired,” and noun for “one.” The classification system then identifies n-grams of the parts of speech of each training document. For example, when the n-grams are bigrams, the classification system may identify the n-grams “noun-verb,” “verb-adjective,” and “adjective-noun.” The classification system also identifies n-grams of the terms of the training documents. For example, when the n-grams are unigrams, the classification system may identify the n-grams “script,” “is,” “tired,” and “one.” The classification system then trains a part-of-speech model using the part-of-speech n-grams and the labels, and trains a term model using the term n-grams and the labels. The models may be for Bayesian classifiers. The models are trained by calculating the probabilities of the n-grams being subjective. To classify a target document, the classification system applies the part-of-speech model to the part-of-speech n-grams of the target document and the term model to the term n-grams of the target document. Each model combines the probabilities of its n-grams to give a probability for that model. The classification system combines the probabilities of the models and designates the target document as being subjective or not based on the combined probabilities. Because the classification system uses the part-of-speech model, a document with an unseen word will be classified based at least in part on the part of speech of that unseen word. In this way, the classification system is able to provide more effective classifications than classifiers that do not account for unseen words.
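For illustration only, the following Python sketch shows one way the term and part-of-speech n-grams described above could be extracted. The part-of-speech tags are hard-coded for the example sentence rather than produced by a real tagger, and the function and variable names are hypothetical.

```python
# Minimal sketch of term and part-of-speech n-gram extraction. The POS tags
# below are hard-coded for the example sentence; a real system would use a
# part-of-speech tagger. All names are illustrative only.

NOISE_WORDS = {"the", "a", "an"}

def ngrams(items, n):
    """Return the list of n-grams (as tuples) over a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Example training document: "the script is a tired one"
tokens = ["the", "script", "is", "a", "tired", "one"]
pos_tags = {"script": "noun", "is": "verb", "tired": "adjective", "one": "noun"}

terms = [t for t in tokens if t not in NOISE_WORDS]   # disregard noise words
parts = [pos_tags[t] for t in terms]

term_unigrams = ngrams(terms, 1)  # ('script',), ('is',), ('tired',), ('one',)
pos_bigrams = ngrams(parts, 2)    # ('noun','verb'), ('verb','adjective'), ('adjective','noun')

print(term_unigrams)
print(pos_bigrams)
```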
In some embodiments, the classification system may use several different models for term n-grams and part-of-speech n-grams of varying lengths (e.g., unigrams, bigrams, and trigrams). To generate a combined score from the models, the classification system learns weights for the various models. To learn the weights, the classification system may collect additional training documents and label those training documents. The classification system then uses each model to classify the additional training documents. The classification system may use a linear regression technique to calculate weights for each of the models that minimize the error between the classifications generated by the weighted models and the labels. The classification system may iteratively calculate new weights and classify the training documents until the error reaches an acceptable level or changes by less than a threshold amount from one iteration to the next.
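The iterative weight learning described above might be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: it uses simple gradient descent on the squared error as one way to iteratively recalculate the weights, and the matrix and function names are hypothetical.

```python
import numpy as np

# Sketch of iterative weight learning for the models (gradient descent on the
# squared error, as one possible realization of the regression described above).
# X: one row per additional labeled training document, one column per model
# (each entry is that model's probability that the document is subjective).
# y: labels for those documents (1 = subjective, 0 = objective).
def learn_model_weights(X, y, learning_rate=0.1, tol=1e-6, max_iter=1000):
    n_docs, n_models = X.shape
    weights = np.full(n_models, 1.0 / n_models)   # start from uniform weights
    prev_error = np.inf
    for _ in range(max_iter):
        scores = X @ weights                      # classify with current weights
        residual = scores - y
        error = np.mean(residual ** 2)
        # Stop when the error is acceptable or barely changes between iterations.
        if error < tol or abs(prev_error - error) < tol:
            break
        prev_error = error
        weights -= learning_rate * (2.0 / n_docs) * (X.T @ residual)
    return weights
```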
The classification system uses a naïve Bayes classification technique. The goal of naïve Bayes classification is to classify a document d by the conditional probability P(c|d). Bayes' rule is represented by the following:

P(c|d) = P(d|c)P(c)/P(d) (1)
where c denotes a classification (e.g., subjective or objective) and d denotes a document. The probability P(c) is the prior probability of category c. A naïve Bayes classifier can be constructed by seeking the optimal category which maximizes the posterior conditional probability P(c|d) as represented by the following:
c* = arg maxc {P(c|d)} (2)
Basic naïve Bayes (“BNB”) introduces an additional assumption that all the features (e.g., n-grams) are independent given the classification label. Since the probability of a document P(d) is a constant for every classification c, the maximum of the posterior conditional probability can be represented by the following:

c* = arg maxc {P(c) Πi=1..N P(wi|c)} (3)

where document d is represented by a vector of N features that are treated as terms appearing in the document, d = (w1, w2, . . . , wN).
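As a minimal sketch of the decision in equation (3), assuming the priors P(c) and the conditional probabilities P(wi|c) have already been estimated from labeled training documents, the classification can be computed in log space to avoid numeric underflow. All names and values below are hypothetical.

```python
import math

# Sketch of the basic naïve Bayes decision of equation (3). Log probabilities
# are summed instead of multiplying raw probabilities to avoid underflow.
def classify(features, priors, cond_prob):
    """features: list of n-grams; priors: {class: P(c)};
    cond_prob: {class: {feature: P(w|c)}}."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in features:
            # A tiny floor stands in for the smoothing discussed later.
            score += math.log(cond_prob[c].get(w, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up probabilities.
priors = {"subjective": 0.5, "objective": 0.5}
cond_prob = {"subjective": {"tired": 0.02}, "objective": {"tired": 0.001}}
print(classify(["tired", "script"], priors, cond_prob))  # -> "subjective"
```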
In some embodiments, the classification system uses a naïve Bayes classifier based on term n-grams and part-of-speech n-grams. The classification system uses n-grams and Markov n-grams. An n-gram takes a sequence of n consecutive terms (which may be alphabetically ordered) as a single unit. A Markov n-gram considers the local Markov chain dependence in the observed terms. The classification system may use 10 different types of models and combine the models into an overall model. Each model is a variant of basic naïve Bayes that uses term or part-of-speech information to calculate P(wi|c).
The classification system may use a BNB model based on term unigrams, where PBNB(wi|c) represents the probability for the BNB model.
The classification system may also use a naïve Bayes model based on part-of-speech n-grams (a “PNB” model). The PNB model uses part-of-speech information in subjectivity categorization. The probability of a part of speech is used for smoothing of the unseen word probabilities. The probability for the PNB model is represented by the following:
PPNB(wi|c) = P(posi|c) (4)
where PPNB represents the probability for the PNB model and posi represents the part of speech of wi.
The classification system may also use a naïve Bayes model based on term n-grams, where n is greater than 1 (“an NG model”). The probability of a term trigram (“TG”) model is represented by the following:
PTG(wi|c) = P(wi-2wi-1wi|c) (i ≥ 3) (5)
where PTG represents the probability of the TG model.
The classification system may also use a naïve Bayes model based on a part-of-speech n-gram, where n is greater than 1 (“a PNG model”). The PNG model helps solve the sparseness of n-grams and makes n-gram classification more effective. N-gram sparseness means that the n-gram with n greater than 1 has a very low probability of occurrence compared to a unigram. The probability of a part-of-speech trigram (“PTG”) model is represented by the following:
PPTG(wi|c) = P(posi-2posi-1posi|c) (i ≥ 3) (6)
where PPTG represents the probability of the PTG model.
The classification system may also use a naïve Bayes model using a Markov term n-gram (“an MNG model”). The model relaxes some of the independence assumptions of naïve Bayes and allows a local Markov chain dependence in the observed variables. The probability of a Markov term trigram (“MTG”) model is represented by the following:
PMTG(wi|c) = P(wi|wi-2wi-1c) (i ≥ 3) (7)
where PMTG represents the probability of the MTG model.
The classification system may also use a naïve Bayes model based on a Markov part-of-speech n-gram (“an MPNG model”). The MPNG model combines the concept of a Markov n-gram with parts of speech. The probability of a Markov part-of-speech trigram (“MPTG”) model is represented by the following:
PMPTG(wi|c) = P(posi|posi-2posi-1c) (i ≥ 3) (8)
where PMPTG represents the probability of the MPTG model.
The classification system may also use models based on bigrams that are analogous to those described above for trigrams. Thus, the classification system may use a term bigram (“BG”) model, a Markov term bigram (“MBG”) model, a part-of-speech bigram (“PBG”) model, and a Markov part-of-speech bigram (“MPBG”) model. One skilled in the art will appreciate that the classification system may use n-grams of any length and may omit n-grams of one length while using n-grams of a longer length. Also, the models based on terms and parts of speech need not use n-grams of the same length.
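For illustration, the following sketch shows one way the TG-style probability of equation (5) and the MTG-style probability of equation (7) could be estimated from per-class counts; the counts and names are hypothetical, and the same scheme would apply to the bigram and part-of-speech variants.

```python
from collections import Counter

# Sketch of estimating the trigram (TG) and Markov trigram (MTG) probabilities
# from counts accumulated over the training documents of one class c.
def tg_probability(trigram, trigram_counts, total_trigrams):
    # P(w_{i-2} w_{i-1} w_i | c): relative frequency of the whole trigram.
    return trigram_counts[trigram] / total_trigrams

def mtg_probability(trigram, trigram_counts, bigram_counts):
    # P(w_i | w_{i-2} w_{i-1}, c): frequency of the trigram divided by the
    # frequency of its two-term context (the local Markov chain dependence).
    context = trigram[:2]
    if bigram_counts[context] == 0:
        return 0.0
    return trigram_counts[trigram] / bigram_counts[context]

# Toy counts for one class, purely for illustration.
trigram_counts = Counter({("script", "is", "tired"): 2})
bigram_counts = Counter({("script", "is"): 5})
print(tg_probability(("script", "is", "tired"), trigram_counts, 100))          # 0.02
print(mtg_probability(("script", "is", "tired"), trigram_counts, bigram_counts))  # 0.4
```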
The classification system may use smoothing techniques to overcome the problem of underestimating the probability of words that are unseen in a document. In general, smoothing techniques discount the probabilities of the words seen in the text and then assign the extra probability mass to the unseen words. A standard naïve Bayes model uses a Laplace smoothing technique. Laplace smoothing is represented by the following:

P(wj|c) = (Njc + 1) / (Nc + |V|) (9)
where Njc represents the frequency of word j appearing in category c, Nc represents the sum of the frequencies of the words appearing in category c, and |V| is the vocabulary size of the training data.
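Equation (9) corresponds to the following short sketch, with hypothetical names and toy counts.

```python
# Sketch of Laplace smoothing as in equation (9): every word receives one
# extra count, so words unseen in the training data get a small non-zero
# probability instead of zero.
def laplace_probability(word, word_counts_for_class, vocabulary_size):
    n_jc = word_counts_for_class.get(word, 0)       # frequency of word j in class c
    n_c = sum(word_counts_for_class.values())       # total word frequency in class c
    return (n_jc + 1) / (n_c + vocabulary_size)

# Toy example: "tired" was seen 3 times in the subjective class.
subjective_counts = {"tired": 3, "script": 1}
print(laplace_probability("tired", subjective_counts, vocabulary_size=1000))
print(laplace_probability("unseenword", subjective_counts, vocabulary_size=1000))
```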
The classification system may also employ smoothing for unseen words in subjectivity classification using parts of speech. The classification system uses a linear interpolation of a term model and a part-of-speech model. The classification system smooths based on the PNB model as represented by the following:
The classification system also smooths based on the PNG model as represented by the following:
The classification system also smooths based on the MPNG model as represented by the following:
where the linear interpolation coefficients or weights α and β represent the contribution of each model.
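As an illustration of the linear interpolation just described, the following sketch mixes a term-based probability with the corresponding part-of-speech probability using weights α and β. The pairing of the models and any normalization of the weights are assumptions for illustration rather than the exact form of the equations above.

```python
# Sketch of linear interpolation smoothing (one plausible reading): a term
# probability is mixed with the corresponding part-of-speech probability so
# that a term unseen in training still receives mass from its part of speech.
def interpolated_probability(p_term, p_pos, alpha, beta):
    # alpha and beta represent the contribution of each model.
    return alpha * p_term + beta * p_pos

# Example: a term unseen in training (p_term ~ 0) smoothed by the PNB model.
print(interpolated_probability(p_term=1e-9, p_pos=0.12, alpha=0.7, beta=0.3))
```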
The classification system may represent the overall combination of the models into a combined model by the following:
The classification system uses a linear regression model to learn the coefficients automatically. Regression is used to determine the relationship between two random variables x = (x1, x2, . . . , xp) and y. Linear regression attempts to explain the relationship of x and y with a straight line fit to the data. The linear regression model is represented by the following:

y = b0 + b1x1 + b2x2 + . . . + bpxp + e
where the “residual” e represents a random variable with mean zero and the coefficients bj(0≦j≦p) are determined by the condition that the sum of the square residuals is as small as possible. The independent variable x is the probability that a single term belongs to a classification under the 10 models, x=(PBNB, PBG, PTG, PMBG, PMTG, PPNB, PPBG, PPTG, PMPBG, PMPTG), and the dependent variable y is the probability between 0 and 1, which indicates whether the word belongs to a classification or not.
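As a minimal sketch of fitting this regression, assuming the per-term probabilities under the models have been collected into a matrix, ordinary least squares can be used; the names and toy values below are hypothetical, and a column of ones is appended so that b0 plays the role of the intercept.

```python
import numpy as np

# Sketch of fitting the linear regression coefficients by least squares.
# X: one row per term, one column per model probability (the values of x);
# y: 1.0 if the term belongs to the classification, else 0.0.
def fit_coefficients(X, y):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)    # minimizes squared residuals
    return coeffs                                      # b0, b1, ..., bp

# Toy example with 3 "models" instead of 10, purely for illustration.
X = np.array([[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.7, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(fit_coefficients(X, y))
```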
The classification system also includes a generate classifier component 121, a train models component 122, a generate n-grams component 123, a learn model weights component 124, and a classify documents based on model component 125. The generate classifier component collects and labels the training documents, trains the models, and then learns the weights for the models. The generate classifier component invokes the train models component to train the models, which invokes the generate n-grams component to generate n-grams. The generate classifier component invokes the learn model weights component to learn the model weights, and the learn model weights component invokes the classify documents based on model component to determine the classification of training documents.
The classification system also includes a classify document component 126 and a get classification probability component 127. The classify document component generates the n-grams for the models and then invokes the get classification probability component for each classifier to determine the probability that a target document is within that classification. The component then selects the classification with the highest probability.
The computing device on which the classification system is implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may be encoded with computer-executable instructions that implement the classification system; in other words, a computer-readable medium contains instructions that implement the system. In addition, the instructions, data structures, and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, a point-to-point dial-up connection, a cell phone network, and so on.
Embodiments of the classification system may be implemented in or used in conjunction with various operating environments that include personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, programmable consumer electronics, digital cameras, network PCs, minicomputers, mainframe computers, cell phones, personal digital assistants, smart phones, distributed computing environments that include any of the above systems or devices, and so on.
The classification system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. For example, a separate computing system may crawl the web to collect the training data.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. The classification system may be used to classify documents based on any type of classification such as interrogative sentences or imperative sentences, questions and answers in a discussion thread, and so on. The classification system may be trained with documents from one domain and used to classify documents in a different domain. The classification system may be used in conjunction with other supervised machine learning techniques such as support vector machines, neural networks, and so on. Accordingly, the invention is not limited except as by the appended claims.