Social media is becoming increasingly prevalent with the advent of new technologies, advances in communication channels, and other factors. Social media platforms and websites, such as Facebook®, Twitter®, and others, attract users to post and share messages for a variety of purposes, including daily conversation, sharing uniform resource locators (URLs) and other data, among other uses. Companies, individuals, and other entities may desire to detect topics described in social media data, for the purpose of discovering “trending topics,” tracking online users' interests, understanding users' complaints or mentions about a product or service, or other purposes. For example, when a company launches a marketing campaign for a newly-released product, the company may desire to investigate whether the product or service is relevant to the trending topics recently discussed on social media, or whether the product or service might be desired by selected users.
Existing topic detection methodologies are generally based on probabilistic language models, such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). However, these analyses assume that a single document contains rich information, which is not applicable to some forms of social media. For example, a “tweet” from the social media platform Twitter® (“Twitter”) is limited to 140 characters. In addition, the topics detected by probabilistic methods are difficult to interpret in a human-understandable way, in part because the methods can only identify a set of “key” terms associated with a set of numerical values that indicate how important the terms are for each detected topic. An additional concern of topic detection in social media is scalability. More particularly, tweets and other data postings are posted on various social networking websites and platforms every day in volumes on the order of hundreds of millions.
Therefore, it may be desirable to have systems and methods for detecting topics in social media. In particular, it may be desirable to have systems and methods for leveraging available data to construct topic models and interpret generated results in a human-understandable way.
An embodiment generally relates to a method of processing data. The method comprises processing a data set to extract a hierarchy of topics comprising a plurality of layers and linking a portion of the data set to an appropriate topic of the hierarchy of topics. Further, for each layer of the plurality of layers, the method comprises employing a classification technique to train a topic model based on a subset of the portion of the data set that is residing in the layer. Still further, the method comprises evaluating an accuracy of each topic model of the topic models and examining the accuracy of each topic model of the topic models to identify a layer of the plurality of layers corresponding to a topic model that is most appropriate for social media data.
Another embodiment pertains generally to a system for processing data. The system comprises an interface to a storage device configured to store a data set and a processor that communicates with the storage device via the interface. The processor is configured to process the data set to extract a hierarchy of topics comprising a plurality of layers, and link a portion of the data set to an appropriate topic of the hierarchy of topics. Further, for each layer of the plurality of layers, the processor is configured to employ a classification technique to train a topic model based on a subset of the portion of the data set that is residing in the layer. Still further, the processor is configured to evaluate an accuracy of each topic model of the topic models and examine the accuracy of each topic model of the topic models to identify a layer of the plurality of layers corresponding to a topic model that is most appropriate for social media data.
Various features of the embodiments can be more fully appreciated, as the same become better understood with reference to the following detailed description of the embodiments when considered in connection with the accompanying figures, in which:
For simplicity and illustrative purposes, the principles of the present teachings are described by referring mainly to exemplary embodiments thereof. However, one of ordinary skill in the art would readily recognize that the same principles are equally applicable to, and can be implemented in, all types of analysis systems, and that any such variations do not depart from the true spirit and scope of the present invention. Moreover, in the following detailed description, references are made to the accompanying figures, which illustrate specific embodiments. Electrical, mechanical, logical and structural changes can be made to the embodiments without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense and the scope of the present invention is defined by the appended claims and their equivalents.
Embodiments generally relate to systems and methods for detecting topics in social media data. In particular, the systems and methods are configured to extract a concept hierarchy from a set of data, the concept hierarchy comprising a plurality of layers. Further, the systems and methods can train topic models based on the content in each of the layers. Still further, the systems and methods can select the most appropriate topic model for social media topic detection by balancing the complexity of the model and the accuracy of the topic detection result. Moreover, the systems and methods can use the most appropriate topic model to detect topics in social media data.
Topic detection in social media can be beneficial in various fields and environments. For example, in business environments, when a company releases a new product or service, customers or potential customers might post or comment on the product or service in a social media environment. Therefore, it would be beneficial for the company to obtain instant access to what is being said about the product, service, or the brand to get a stream of ideas, content, links, resources, and/or tips focused on the new product or service, and to monitor what is being said about the customers or target audience.
To achieve this, the present embodiments are configured to detect topics in social media, by, for example, discovering trending topics or tracking interests of users. Despite the advent of new technologies, topic detection in social media is not trivial for at least three reasons. First, the special characteristics of social media data such as, for example, brevity, vulnerability to noise, and others, render detecting potential topics in social media data much more challenging. Second, the volume of social media data requires more research for efficient implementation of topic detection algorithms and real-time processing. Third, the generated results or topics from most existing topic detection algorithms cannot be easily interpreted.
According to the present embodiments, processing components of the systems and methods can perform topic detection on social media data related to a specific trend or a specific user by concatenating the content of the social media data and preprocessing the social media data using well-known natural language processing techniques. Further, the processing components can employ scalable machine learning techniques, associated with a map-reduce programming framework for topic modeling, and then adopt a locality-sensitive hashing technique to expedite the topic detection procedure. Still further, the processing components can utilize available data, such as data from Wikipedia, for topic modeling and detection. The detected topics can be interpreted by human ontology.
It should be appreciated that, although the systems and methods as described herein incorporate the use of Wikipedia, using other available encyclopedia or reference data is envisioned, either singularly or in combination with Wikipedia. Further, although the systems and methods as described herein incorporate the use of Twitter and data thereof, using other available social media data and platforms is envisioned, either singularly or in combination with Twitter. As used herein, a “tweet” can be any data (e.g., comments, links, etc.) posted by a Twitter user via the Twitter platform or associated applications or plug-ins.
The use of Wikipedia is beneficial because Wikipedia offers a comparatively complete, freely accessible, and structured collection of world knowledge and content. Further, Wikipedia contains abundant conceptual and semantic information that can be utilized to model the topics in a variety of sets of documents. In the embodiments as described herein, Wikipedia articles in combination with Twitter data can be used as a substitute corpus to train topic models. In particular, articles of the Wikipedia encyclopedia and tweets on Twitter bear a resemblance, which can be quantitatively measured. Further, whenever a tweet is statistically similar to a particular Wikipedia article, there can also be a similarity in the corresponding topics. Moreover, the hierarchical structure of Wikipedia in combination with the abundant conceptual information can allow a thorough interpretation of the detected topics.
Referring to
The systems and methods as described herein can rely on the Wikipedia conceptual hierarchy for topic modeling and, therefore, constructing the Wikipedia hierarchy is helpful. According to embodiments, the Wikipedia hierarchy can be stored according to various conventions. Referring to Table 1, an exemplary storing convention for the Wikipedia hierarchy is illustrated:
Referring to Table 1, the “Concept ID” column and the “Concept Title” column can identify the associated concept in the Wikipedia hierarchy. Further, the “Parent IDs” and the “Children IDs” can illustrate any correlations between the associated concept and other related concepts, and the “Level” specifies the layer in which the associated concept is lying. It should be appreciated that a concept may belong to different layers, since the concept may be relevant to concepts within different levels. As a result, the table can record many possible levels for concepts.
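The record layout described for Table 1 can be sketched as a small data structure, together with one way such records might be populated by walking down from the top-level concepts. This is a minimal single-machine sketch, not the distributed implementation; the names `Concept`, `label_levels`, and `max_level` are hypothetical, and the input is assumed to already contain each concept's children.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One row of the hierarchy table: ID, title, links, and levels."""
    concept_id: int
    title: str
    parent_ids: set = field(default_factory=set)
    children_ids: set = field(default_factory=set)
    levels: set = field(default_factory=set)  # a concept may sit in several layers


def label_levels(concepts, top_level_ids, max_level):
    """Record parents and levels, labeling the top level as 0.

    concepts: dict mapping concept ID -> Concept with children_ids filled in.
    A concept reached along paths of different lengths records every level
    at which it is found. max_level bounds the walk if the graph has cycles.
    """
    queue = deque((cid, 0) for cid in top_level_ids)
    seen = set()  # (concept_id, level) pairs already expanded
    while queue:
        cid, level = queue.popleft()
        if (cid, level) in seen or level > max_level:
            continue
        seen.add((cid, level))
        concepts[cid].levels.add(level)
        for child in concepts[cid].children_ids:
            concepts[child].parent_ids.add(cid)
            queue.append((child, level + 1))
    return concepts
```

Storing `levels` as a set reflects the statement that a concept may belong to more than one layer.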
Because of the large volume of Wikipedia concept and article information, the present embodiments can adopt the Apache™ Hadoop™ project, or similar techniques, for distributed processing of dump files across clusters of computing resources using, for example, the MapReduce programming framework available from Google®. More particularly, in MapReduce processing, a user can select a number of concepts as a top level of an exemplary hierarchy. For example, a user can select eight (8) general concepts (e.g., Arts, Science, Technology, History, etc.) as the top level of the hierarchy. The MapReduce framework can attempt to find the “children” of each of the concepts in a Wikipedia hierarchy, and then put the children into a “mapper collector.” In addition, the mapper collector can identify and record the parents and the level of each of the concepts by, for example, labeling from “0” for the top level. The MapReduce framework can then combine each concept's parents and children, and associate the articles in the Wikipedia article dump with the concept. As a result, the Wikipedia hierarchy can be efficiently extracted from dump files. Further, the extracted Wikipedia hierarchy contains not only the concepts in each level, but also the articles related to each concept. For example,
Referring back to the training step 110 of
To train the topic models, the systems and methods can employ the “Naive Bayes classifier,” which is a probabilistic learning technique. Specifically, the probability of an article (d) being in a topic class (c) can be computed as equation (1):
$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n} P(t_k \mid c) \qquad (1)$$
In equation (1), $P(t_k \mid c)$ is the conditional probability of term $t_k$ occurring in an article of class $c$. More particularly, $P(t_k \mid c)$ can be interpreted as a measure of how much evidence $t_k$ contributes that $c$ is the correct class. $P(c)$ is the prior probability of a document occurring in class $c$. Further, if the terms in a document do not provide clear evidence for one class versus another, the class that has the higher prior probability can be selected. $\langle t_1, t_2, \ldots, t_n \rangle$ are the terms in an article $d$ that are part of the vocabulary that is used for classification, and $n$ is the number of such terms in $d$.
Further, the “best” class in the Naive Bayes classification is the most likely or maximum a posteriori (MAP) class, $C_{\mathrm{MAP}}$, which can be defined according to equation (2):
$$C_{\mathrm{MAP}} = \arg\max_{c \in C} \hat{P}(c \mid d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n} \hat{P}(t_k \mid c) \qquad (2)$$
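The selection of the MAP class can be sketched as follows. This is a minimal illustration, assuming the estimated probabilities are already available as dictionaries, that all classes share one vocabulary, and that every stored probability is strictly positive. Summing logarithms in place of multiplying probabilities is a common implementation choice to avoid floating-point underflow on long documents; it is an assumption here and not something the description states. The function name `map_class` is hypothetical.

```python
import math


def map_class(terms, prior, cond):
    """Return the maximum a posteriori class for a list of terms.

    prior: dict mapping class -> estimated P(c)
    cond:  dict mapping class -> dict mapping term -> estimated P(t|c)
    Terms outside the vocabulary are skipped, since only vocabulary
    terms of the document take part in the product.
    """
    best_class, best_score = None, -math.inf
    for c in prior:
        # log P(c) + sum of log P(t_k|c) has the same argmax as the product
        score = math.log(prior[c])
        for t in terms:
            if t in cond[c]:
                score += math.log(cond[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because the logarithm is monotonic, the class that maximizes the summed log score is the same class that maximizes the product in equation (2).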
Because the true values of $P(c)$ and $\prod_{1 \le k \le n} P(t_k \mid c)$ cannot be directly obtained, equation (2) instead incorporates $\hat{P}(c)$ and $\prod_{1 \le k \le n} \hat{P}(t_k \mid c)$, both of which are estimated from the training set (e.g., the Wikipedia articles). Specifically, to estimate these two parameters, the systems and methods can use maximum likelihood estimation (MLE), which is the relative frequency and corresponds to the most likely value of each parameter given the training data. The priors $\hat{P}(c)$ are estimated using
$$\hat{P}(c) = \frac{N_c}{N}$$
where $N_c$ is the number of articles in category $c$ and $N$ is the total number of articles in a specific Wikipedia level. The conditional probability $\hat{P}(t \mid c)$ can be estimated as a relative frequency of term $t$ in articles belonging to class $c$, as detailed in equation (3):
$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} \qquad (3)$$
In equation (3), $T_{ct}$ is the number of occurrences of $t$ in training articles from class $c$, including multiple occurrences of a term in an article. In some cases, the training data may not be large enough to represent the frequency of rare events due to the sparseness of terms. To accommodate such sparseness, the systems and methods can use “Laplacian smoothing” to estimate $\hat{P}(t \mid c)$, which adds a “Dirichlet prior” related to the term $t$ to each count, as detailed in equation (4):
Further, a simple version of Laplacian smoothing is to add a uniform prior to each count, as detailed in equation (5):
More particularly, $|V|$ is the number of terms in the vocabulary $V$. By using Laplacian smoothing, the systems and methods can obtain more accurate estimates of the parameters.
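The estimation step can be sketched as below. This is a minimal illustration that assumes the uniform prior takes the common add-one form, in which each smoothed estimate is $(T_{ct} + 1) / (\sum_{t'} T_{ct'} + |V|)$; the text does not give the value of the prior, so the constant 1 is an assumption. The function name `train_naive_bayes` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(articles):
    """Estimate priors and add-one smoothed term probabilities.

    articles: list of (class_label, list_of_terms) pairs for one layer.
    Returns (prior, cond) where prior[c] = N_c / N and
    cond[c][t] = (T_ct + 1) / (sum over t' of T_ct' + |V|).
    """
    class_counts = Counter()           # N_c per class
    term_counts = defaultdict(Counter)  # T_ct per class and term
    vocabulary = set()                 # V
    for label, terms in articles:
        class_counts[label] += 1
        term_counts[label].update(terms)
        vocabulary.update(terms)

    total_articles = sum(class_counts.values())  # N
    prior = {c: n / total_articles for c, n in class_counts.items()}
    cond = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocabulary)
        cond[c] = {t: (term_counts[c][t] + 1) / denom for t in vocabulary}
    return prior, cond
```

With this form, a term that never occurs in a class still receives a small nonzero probability, so a single unseen term does not force the whole product for that class to zero.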
Referring back to the model selection stage 115 of
The systems and methods can extract ten (10) levels, or other amounts, of Wikipedia hierarchy. In the case of any computation resource limitations, the systems and methods can train topic models on a subset of Wikipedia articles. More particularly, to select the best layer for topic modeling, the systems and methods can use a small set of tweets, with about 50 trends, or other amounts of trends, to validate the model in each level of the Wikipedia hierarchy. Referring to Table 2, shown are statistics related to an exemplary validation dataset.
As shown in Table 2, for each “trend,” there exists a certain amount of tweets that were posted by users using the Twitter platform. According to embodiments, for each layer of the hierarchy, human labels of tweets are assigned using the concepts of the layer, and the systems and methods can detect topics using the techniques as described herein. In embodiments, the systems and methods can report the accuracy of the result for each layer. Referring to
Once the model level is selected, a topic detection technique can be applied according to the selected model level to detect topics in social media data. Further, to handle large-scale topic detection, the systems and methods can use locality-sensitive hashing (LSH) to reduce the number of comparisons required during model fitting. More particularly, the systems and methods can hash Wikipedia articles in a median hierarchy level, such as level 6 in
To find similar Wikipedia articles, the systems and methods can use a standard Jaccard index-based hashing procedure. More particularly, the procedure can first decompose Wikipedia articles into “shingles” that can represent articles, to overcome disadvantages of the traditional bag-of-words model. Specifically, a k-shingle for an article is a sequence of k consecutive words that appear in the article. Before decomposing the articles into the shingles, the systems and methods can execute a sequence of preprocessing steps, such as removing stop words, tokenizing, and stemming, on the original articles. Because typical Wikipedia articles are of moderate length, the systems and methods can choose k=10 to reduce the probability of any given shingle appearing in any article. By shingling, the systems and methods can quantify the original Wikipedia corpus as a shingle-article matrix M, where rows represent shingles and columns represent articles.
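Word-level shingling and the Jaccard index can be sketched as follows. This is a minimal illustration that assumes the article has already been tokenized and preprocessed into a list of words; the function names `shingles` and `jaccard` are hypothetical.

```python
def shingles(words, k):
    """Return the set of k-shingles (tuples of k consecutive words)."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Each set of shingles corresponds to one column of the matrix M, with a row being present in the column when the shingle is in the set. An article shorter than k words yields an empty set.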
In general, the shingle-article matrix M may not fit into main memory, as the number of articles tends to be substantial. To alleviate this issue, the systems and methods can use a “Minhashing” technique to generate a succinct signature for each column in M, such that the probability of two articles agreeing on any one component of their signatures is equal to the Jaccard index similarity between the two articles. More particularly, the systems and methods can construct a Minhash signature of length 100, or other lengths, using one or more known techniques. In some cases, the randomized nature of the Minhash generation method can require further checks to increase the probability of uncovering all pairs of related articles in terms of the signature. Thus, the systems and methods can utilize LSH to increase such probability.
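One way to build such a signature is sketched below. The description leaves the hash functions to known techniques, so the choice here is an assumption: each of the hash functions is simulated by a 64-bit BLAKE2b digest salted with the function's index, which stands in for a random permutation of the rows of M. The names `minhash_signature` and `estimate_jaccard` are hypothetical, and the shingle set is assumed to be non-empty.

```python
import hashlib


def _hash(shingle, i):
    """i-th hash function: 64-bit BLAKE2b digest salted with the index i."""
    data = " ".join(shingle).encode("utf-8")
    salt = i.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=100):
    """Return one minimum hash value per hash function (set must be non-empty)."""
    return [min(_hash(s, i) for s in shingle_set) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The fraction of agreeing positions approximates the Jaccard index of the underlying shingle sets, and the approximation tightens as the signature grows longer.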
More particularly, the systems and methods can employ LSH to reduce the number of comparisons, where the generated Minhash signatures can be initially decomposed into multiple bands. Further, for each band of the multiple bands, the systems and methods can adopt a standard hash function to hash the band of each column into a large hash table. The columns whose bands are hashed into the same bucket at least once can be treated as similar. As a result, the original Wikipedia corpus can be separated into multiple small groups of articles.
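The banding step can be sketched as follows. This is a minimal illustration in which a Python dictionary keyed by the band's values plays the role of the standard hash function and hash table, with a separate table per band; the function name `lsh_candidates` and the input layout are hypothetical.

```python
from collections import defaultdict


def lsh_candidates(signatures, bands):
    """Return pairs of article IDs whose signatures collide in at least one band.

    signatures: dict mapping article ID -> signature (lists of equal length)
    bands: number of bands; the signature length must be divisible by it.
    """
    length = len(next(iter(signatures.values())))
    if length % bands:
        raise ValueError("signature length must be divisible by bands")
    rows = length // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)  # one hash table per band
        for article_id, sig in signatures.items():
            key = tuple(sig[b * rows:(b + 1) * rows])
            buckets[key].append(article_id)
        # every pair sharing a bucket in this band becomes a candidate
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates
```

Only candidate pairs need a full similarity comparison. Using more bands with fewer rows each makes a collision more likely and so admits less similar pairs as candidates.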
In embodiments, when a group of new tweets is identified or received, the systems and methods can concatenate the group of new tweets into a single document and perform the same preprocessing on the single document. The single document can then be hashed using LSH to check whether there are similar Wikipedia articles. If there are similar Wikipedia articles, the systems and methods can choose, as the label of the tweets, the topic label of the article which is the most similar to the tweets. Otherwise, the systems and methods can run the topic model against the tweets to obtain the topic label.
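The labeling flow for a new group of tweets can be sketched as below. This is a minimal illustration of the control flow only: `find_similar`, `classify`, and `preprocess` are hypothetical hooks standing in for the LSH lookup, the trained topic model, and the preprocessing steps, none of which are implemented here.

```python
def label_tweets(tweets, find_similar, classify, preprocess=str.lower):
    """Label a group of tweets with a single topic.

    find_similar(doc) -> (topic_label, similarity) for the most similar
        hashed article, or None when no similar article is found.
    classify(doc) -> topic label obtained from the topic model.
    """
    # Concatenate the group into a single document, then preprocess it.
    document = preprocess(" ".join(tweets))
    match = find_similar(document)
    if match is not None:
        label, _similarity = match
        return label
    # No similar article: fall back to running the topic model.
    return classify(document)
```

The topic model runs only when the hash lookup finds no similar article, which is where the reduction in comparisons comes from.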
As shown in
In 420, the processing module can employ a classification technique to train a topic model based on a subset of the portion of the data set that is residing in the layer. More particularly, a Naive Bayes classification method can be used to train the topic model based on the articles residing in the particular layer and associated with its corresponding categories. In 425, the processing module can determine if a topic model has been determined for each layer of the plurality of layers. If a topic model has not been determined for each layer (NO), the processing module can repeat 420. In contrast, if a topic model has been determined for each layer (YES), then in 430 the processing module can evaluate an accuracy of each topic model of the topic models.
In 435, the processing module can examine the accuracy of each topic model of the topic models to identify a layer of the plurality of layers corresponding to a topic model that is most appropriate for a topic detection technique. For example, the most appropriate topic model identification can be based on the accuracy and complexity of the topic models. In 440, the processing module can detect one or more topics in social media data according to the topic model that is most appropriate for the topic detection technique. In embodiments, the social media data can be data collected from Twitter over a period of time such as, for example, seven (7) to fifteen (15) days. In 445, the processing can end, repeat, or return to any previous step.
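The layer identification in 435 could be sketched as follows. The description says the choice can be based on accuracy and complexity but does not give a rule, so the rule here is purely an assumption: among the layers whose accuracy is within a tolerance of the best accuracy, pick the one with the fewest topics. The names `accuracy`, `select_layer`, and `tolerance` are hypothetical.

```python
def accuracy(predicted, expected):
    """Fraction of validation items whose predicted topic matches the human label."""
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def select_layer(layer_accuracy, layer_num_topics, tolerance=0.02):
    """Pick the least complex layer among those close to the best accuracy.

    layer_accuracy:   dict mapping layer -> accuracy of that layer's model
    layer_num_topics: dict mapping layer -> number of topics in that layer,
                      used here as a stand-in for model complexity
    """
    best = max(layer_accuracy.values())
    eligible = [layer for layer, acc in layer_accuracy.items()
                if best - acc <= tolerance]
    return min(eligible, key=lambda layer: layer_num_topics[layer])
```

Other trade-offs, such as a weighted score combining accuracy and model size, would fit the same interface.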
The processor 508 can further communicate with a network interface 504, such as an Ethernet or wireless data connection, which in turn communicates with the network 509, such as the Internet or other public or private networks. The processor 508 can also communicate with the database 515 or any applications 505, such as a topic detection application or other logic, to execute control logic and perform the data processing and/or topic detection functionality as described herein.
While
Certain embodiments can be performed as a computer program. The computer program can exist in a variety of forms both active and inactive. For example, the computer program can exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats; firmware program(s); or hardware description language (HDL) files. Any of the above can be embodied on a computer readable medium, which includes storage devices and signals, in compressed or uncompressed form. Exemplary computer readable storage devices include conventional computer system RAM (random access memory), ROM (read-only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Exemplary computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running the present invention can be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of executable software program(s) of the computer program on a CD-ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general.
While the invention has been described with reference to the exemplary embodiments thereof, those skilled in the art will be able to make various modifications to the described embodiments without departing from the true spirit and scope. The terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. In particular, although the method has been described by examples, the steps of the method can be performed in a different order than illustrated or simultaneously. Those skilled in the art will recognize that these and other variations are possible within the spirit and scope as defined in the following claims and their equivalents.