This Application claims priority of Taiwan Patent Application No. 101149250, filed on Dec. 22, 2012, and Taiwan Patent Application No. 102124478, filed on Jul. 9, 2013, the entireties of which are incorporated by reference herein.
1. Technical Field
The disclosure relates to a system and method for analyzing text stream messages, and in particular to the analysis of real-time network messages.
2. Description of the Related Art
A blog is a network platform on which users publish comments and communicate with friends. Micro-blogs, such as Twitter and Plurk, are popular network community platforms. Through a micro-blog, users can publish daily trifles, share their daily lives, and get updates on friends.
Because a micro-blog gathers real-time information on specific topics, it has a large influence on news, the economy, politics, and society. Micro-blogs draw everyone's attention to popular topics (events) around the world. For example, when a natural disaster or mass movement occurs, local residents may provide real-time information through micro-blogs; it is therefore helpful to analyze how that real-time information evolves.
Text stream messages on micro-blogs are usually shorter than 140 characters, as on Twitter. A micro-blog message therefore contains few features, and a concept-drift phenomenon may occur in these features for a topic over different time durations. Concept-drift occurs when the meaning of a topic changes from one time duration to another. The popular keywords of a topic vary as the topic evolves with time. For example, when a tsunami occurs, the word “tsunami” becomes a popular word. As the topic evolves, the tsunami leads to a nuclear disaster. The word “tsunami” then becomes less popular in this topic, and other words, such as “nuclear”, become more popular. That is, the popularity of the word “tsunami” decreases and the popularity of the word “nuclear” increases. A concept-drift occurs when the popularity of the word “tsunami” and that of the word “nuclear” change in this way. Therefore, the real-time topic is clustered and observed to determine whether it is a popular topic. Data mining is applied to process the messages of the real-time topic. For general micro-blogs, data mining technology can be divided into two types: graph mining and text mining. Graph mining is applied to analyze the graph relationships between messages, and text mining is applied to analyze the text content of messages for detecting and tracking topics. Therefore, text stream mining technology is applied to analyze real-time topics, wherein the text stream mining technology comprises the research areas of Micro-blogging Topic Detection and Tracking and Text Stream Mining.
In Term Frequency-Inverse Document Frequency (TF-IDF) technology, the Term Frequency (TF) is affected by the length of the topic data; it may therefore not be objective when dealing with text messages of different lengths. Although the Inverse Document Frequency (IDF) weights words across the text messages, it may not be suitable for detecting popular topics.
Therefore, it is important to provide a stream message analyzing method that lets users obtain real-time information rapidly and accurately from the large number of topics in micro-blogs.
An embodiment of the disclosure provides a system for analyzing text stream messages, comprising: a sliding window module, storing a plurality of text stream messages and updating the plurality of text stream messages by a sliding window once every preset duration; a dynamic text weight module, receiving the plurality of text stream messages and calculating the plurality of text stream messages for generating a burst weight according to a dynamic text stream weight algorithm; a clustering module, clustering the plurality of text stream messages by a clustering algorithm according to the plurality of text stream messages and the burst weight for generating a plurality of clusters and selecting one or more keywords with higher burst weight in each of the clusters as concept words, wherein as the concept words of the clusters vary with time, the time-varying sequence of concept words is identified as the concept word sequence denoting the concept drift of the clusters; and a memory device, storing the clusters which are clustered by the clustering module.
An embodiment of the disclosure provides a method for analyzing text stream messages, comprising: storing a plurality of text stream messages and updating the plurality of text stream messages by a sliding window once every preset duration; receiving the plurality of text stream messages and calculating the plurality of text stream messages for generating a burst weight according to a dynamic text stream weight algorithm; and clustering the plurality of text stream messages by a clustering algorithm according to the plurality of text stream messages and the burst weight for generating at least one cluster.
An embodiment of the disclosure provides a system for analyzing text stream messages, comprising: an analyzing device, comprising: a sliding window module, storing a plurality of text stream messages and updating the plurality of text stream messages by a sliding window once every preset duration; a dynamic text weight module, receiving the plurality of text stream messages and calculating the plurality of text stream messages for generating a burst weight according to a dynamic text stream weight algorithm; and a clustering module, clustering the plurality of text stream messages by a clustering algorithm according to the plurality of text stream messages and the burst weight for generating at least one cluster; a memory device, storing the clusters which are clustered by the clustering module; and an electrical device, displaying information of the clusters stored in the memory device.
The disclosure will become more fully understood by referring to the following detailed description with reference to the accompanying drawings, wherein:
In an embodiment of the disclosure, the sliding window module 110 comprises a sliding window for storing text stream micro-blog messages, such as text stream messages from Twitter. The stored text stream messages are then updated by the sliding window once every preset duration. In addition, the sliding window module 110 is configured to delete the stored text stream messages whose time points fall outside the sliding window. A detailed description of the sliding window module 110 will be introduced below.
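The store-and-evict behaviour described above can be sketched as follows. This is a minimal illustration only: the class name `SlidingWindow`, its method names, the use of seconds as the time unit, and the assumption that messages arrive in time order are all hypothetical and are not taken from the disclosure.

```python
from collections import deque


class SlidingWindow:
    """Keeps only messages whose timestamp lies inside the window (sketch)."""

    def __init__(self, window_length):
        self.window_length = window_length  # assumed to be in seconds
        self.messages = deque()             # (timestamp, text), oldest first

    def add(self, timestamp, text):
        # Assumption: messages arrive in non-decreasing timestamp order.
        self.messages.append((timestamp, text))

    def update(self, now):
        # Intended to be called once every preset duration:
        # drop every stored message that is older than the window.
        cutoff = now - self.window_length
        while self.messages and self.messages[0][0] < cutoff:
            self.messages.popleft()
        return list(self.messages)


# Illustrative data; the message texts are made up.
window = SlidingWindow(window_length=7200)
window.add(1349300000, "an older message")
window.add(1349308793, "first debate message")
window.add(1349309284, "second debate message")
remaining = window.update(now=1349309284)
```

A double-ended queue is used here because eviction always happens at the old end while new messages are appended at the new end.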
In an embodiment of the disclosure, a dynamic text weight module 130 is configured to receive the text stream messages, wherein the plurality of text stream messages received by the dynamic text weight module 130 are pre-processed by the pre-processing module 120 in advance. During pre-processing, every text stream message is processed through a word segmentation or tokenization process and a sentence segmentation process, and non-important words are then filtered out to generate at least one keyword. For example, the pre-processing module 120 may extract the keywords “global warming”, “Arctic”, “iceberg” and “sea level” from the sentence “global warming will make the icebergs in the Arctic melt, resulting in rising sea levels”.
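As a rough illustration of the pre-processing steps, the sketch below splits a message into sentences, tokenizes each sentence, and filters out non-important words. The stop-word list and the function names are hypothetical. The sketch also does not merge multi-word terms such as “global warming” or reduce “icebergs” to “iceberg”, which the example above implies the pre-processing module 120 can do.

```python
import re

# Hypothetical list of non-important words; a real list would be much larger.
STOP_WORDS = {"will", "the", "in", "as", "a", "make", "result", "resulting"}


def split_sentences(message):
    """Sentence segmentation on simple end-of-sentence punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", message) if s.strip()]


def extract_keywords(message):
    """Tokenize each sentence and drop the non-important words."""
    keywords = []
    for sentence in split_sentences(message):
        for token in re.findall(r"[A-Za-z']+", sentence.lower()):
            if token not in STOP_WORDS:
                keywords.append(token)
    return keywords


keywords = extract_keywords(
    "Global warming will make the icebergs in the Arctic melt, "
    "resulting in rising sea levels."
)
```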
Because the importance of every keyword may change as time goes on, the dynamic text weight module 130 has to provide different weighted values for every keyword at different time points according to concept-drift. The dynamic text weight module 130 processes the plurality of text stream messages which have been pre-processed by the pre-processing module 120 according to a dynamic text stream weight algorithm for generating the burst weight, wherein, in the dynamic text stream weight algorithm, the burst scores (BS) of the keywords and a Term Occurrence Probability (TOP) are calculated for generating the burst weight. weight_{w,t} is the burst weighted value of a keyword w at time point t; it is calculated from the frequency of the keyword and reflects whether that frequency is increasing or decreasing. In an embodiment, weight_{w,t} is generated according to two factors, BS_{w,t} and TOP_{w,t}. BS_{w,t} is the burst score of a keyword w at time point t, and TOP_{w,t} is the probability of a keyword w occurring at time point t.
In an embodiment, the detailed mathematical formulas of weight_{w,t}, BS_{w,t} and TOP_{w,t} are expressed as follows:
wherein ar_{w,t} is the arrival rate of a keyword w at time point t, E(ar_{w,t}) is the expected value of ar_{w,t}, P(w_t | c_t) is the conditional probability of a keyword w at time point t in the message set c, |{m : w_t ∈ c_t}| is the number of messages m in the message set c that contain the keyword w at time point t, and |c_t| is the number of messages at time point t in the message set c. In an embodiment of the disclosure, the words of the plurality of text stream messages may be classified into three types (uninformative words, common words, and topic words), and the dynamic text weight module 130 provides different weighted values according to the importance of the three types of words.
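The quantities defined above can be illustrated with a short sketch. Only the Term Occurrence Probability follows directly from the stated definitions (messages containing the keyword divided by all messages in the set). The forms used for the burst score and for the combined weight are assumptions made purely for illustration: the burst score is taken as the relative excess of the arrival rate over its expected value, floored at zero, and the weight is taken as the product of the two factors. The function names and sample data are hypothetical.

```python
def term_occurrence_probability(keyword, messages):
    """TOP: share of the messages in the set that contain the keyword."""
    if not messages:
        return 0.0
    containing = sum(1 for keywords in messages if keyword in keywords)
    return containing / len(messages)


def burst_score(arrival_rate, expected_rate):
    # ASSUMPTION: relative excess of the arrival rate over its expected
    # value, floored at zero. The disclosure's exact formula may differ.
    if expected_rate <= 0:
        return 0.0
    return max((arrival_rate - expected_rate) / expected_rate, 0.0)


def burst_weight(keyword, messages, arrival_rate, expected_rate):
    # ASSUMPTION: the two factors are combined by multiplication.
    return burst_score(arrival_rate, expected_rate) * term_occurrence_probability(
        keyword, messages
    )


# Each message is represented by its set of extracted keywords (made-up data).
messages = [
    {"debate", "obama"},
    {"debate", "romney"},
    {"presidential"},
    {"debate"},
]
top_debate = term_occurrence_probability("debate", messages)
weight_debate = burst_weight("debate", messages, arrival_rate=6.0, expected_rate=2.0)
```

Under these assumptions, a keyword whose arrival rate does not exceed its expected value receives a weight of zero, however often it occurs.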
For example, Table 1 shows some text stream messages that have been received from Twitter:
In Table 2, keywords such as “debate”, “Obama”, “presidential”, and “Romney” are extracted by the pre-processing module 120 from every text stream message.
Then, in Table 3, the dynamic text weight module 130 processes the plurality of text stream messages which have been pre-processed by the pre-processing module 120 according to the dynamic text stream weight algorithm for generating the burst weight.
In an embodiment of the disclosure, the clustering module 140 is configured to cluster the plurality of text stream messages which have been pre-processed by the pre-processing module 120 by a clustering algorithm for generating at least one cluster, wherein the clustering module 140 clusters the plurality of text stream messages by performing a similarity estimation according to the different keywords and the burst weight of the keywords. Each of the clusters generated by the clustering module 140 is a detected topic, and one or more keywords with higher burst weight in each of the clusters are selected as concept words, wherein as the concept words of the clusters vary with time, the time-varying sequence of concept words is identified as the concept word sequence denoting the concept drift of the clusters.
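The idea of a concept word sequence can be sketched as follows: for each time point, the keywords with the highest burst weight are taken as concept words, and a change between consecutive time points marks a drift. The function names and the burst weight values are hypothetical and chosen only to mirror the “tsunami” and “nuclear” example from the background section.

```python
def concept_word_sequence(weights_by_time, k=1):
    """Return [(time_point, top-k keywords by burst weight), ...] in time order."""
    sequence = []
    for time_point in sorted(weights_by_time):
        weights = weights_by_time[time_point]
        ranked = sorted(weights, key=weights.get, reverse=True)
        sequence.append((time_point, ranked[:k]))
    return sequence


def drift_points(sequence):
    """Time points at which the concept words differ from the previous one."""
    return [
        current[0]
        for previous, current in zip(sequence, sequence[1:])
        if previous[1] != current[1]
    ]


# Hypothetical burst weights for one cluster over three window updates.
weights_by_time = {
    1: {"tsunami": 0.9, "nuclear": 0.1},
    2: {"tsunami": 0.6, "nuclear": 0.5},
    3: {"tsunami": 0.2, "nuclear": 0.8},
}
sequence = concept_word_sequence(weights_by_time)
drifts = drift_points(sequence)
```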
According to the above example, in Table 4, the two messages have four keywords, “debate”, “Obama”, “presidential” and “Romney”, and the time difference between the two messages is 491 seconds (Thu Oct 04 08:08:04 CST 2012 − Thu Oct 04 07:59:53 CST 2012 = 1349309284 − 1349308793 = 491). In addition, the window length is 7200. Therefore, the similarity estimation is as follows:
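The time-difference arithmetic in this example (not the similarity estimation itself) can be checked with a few lines of code. The sketch assumes that “CST” denotes UTC+8, which is the offset under which the two timestamps match the quoted epoch values, and that the window length of 7200 is expressed in seconds.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: "CST" in the example is UTC+8.
CST = timezone(timedelta(hours=8))

first = datetime(2012, 10, 4, 7, 59, 53, tzinfo=CST)
second = datetime(2012, 10, 4, 8, 8, 4, tzinfo=CST)

first_epoch = int(first.timestamp())
second_epoch = int(second.timestamp())
time_difference = second_epoch - first_epoch

window_length = 7200  # assumed to be in seconds
within_window = time_difference <= window_length
```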
In an embodiment of the disclosure, if the similarity estimated by the clustering module 140 is more than a threshold, the two messages will be added to the same cluster, and if the similarity estimated by the clustering module 140 is less than the threshold, the two messages will be deleted. For example, if the threshold is set to 0.6 and the similarity of the two messages is 0.68, the two messages will be added to the same cluster. Namely, in the embodiment of the disclosure, the clustering algorithm has two stages: a deleting stage and an adding stage. The deleting stage is divided into three methods for handling messages: Removal, Reduction and Potential. The adding stage is divided into four cases: Noise, Creation, Absorption and Merge, wherein Creation means that a new cluster is created, Absorption means that elements in some clusters have been absorbed, and Merge means that it is determined whether clusters may be merged according to the sum of the burst weights of the keywords they have in common, the clusters being merged when their similarity is more than a threshold.
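The threshold decision can be sketched as below. The similarity function shown is a hypothetical stand-in, not the similarity estimation of the disclosure: it divides the summed burst weight of the keywords two messages have in common by the summed burst weight of all their keywords. The function names and weight values are likewise made up.

```python
def weighted_overlap(keywords_a, keywords_b, weights):
    """Hypothetical similarity: weight of shared keywords / weight of all keywords."""
    shared = keywords_a & keywords_b
    union = keywords_a | keywords_b
    total = sum(weights.get(word, 0.0) for word in union)
    if total == 0:
        return 0.0
    return sum(weights.get(word, 0.0) for word in shared) / total


def belongs_together(similarity, threshold=0.6):
    """Two messages join the same cluster only if similarity exceeds the threshold."""
    return similarity > threshold


# Made-up burst weights and keyword sets.
weights = {"debate": 2.0, "obama": 1.0, "presidential": 0.5, "romney": 0.5}
message_a = {"debate", "obama", "presidential"}
message_b = {"debate", "obama", "romney"}

similarity = weighted_overlap(message_a, message_b, weights)
same_cluster = belongs_together(similarity)
```

With the figures quoted in the text, `belongs_together(0.68)` is true for a threshold of 0.6, so the two messages would be placed in the same cluster.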
In an embodiment of the disclosure, the memory device 150 is configured to collect and store the clusters corresponding to different topics after the above clustering process. In an embodiment of the disclosure, the memory device 150 comprises a cloud database established by a cloud method. In an embodiment of the disclosure, the memory device 150 may assemble the collected and stored data into a topic abstract and transmit the topic abstract to a client electrical device, such as a desktop computer, smart phone, or tablet, so that users can view and search it. In an embodiment of the disclosure, the sliding window module 110, the pre-processing module 120, the dynamic text weight module 130 and the clustering module 140 may be integrated in an analyzing device (not expressed in
In an embodiment of the disclosure, the text stream message analyzing system 100 further comprises a displaying device (not expressed in
One or more keywords with the highest occurrence counts can be selected as the concept word(s) for each topic. Alternatively, one or more keywords with higher burst weight can be selected as the concept word(s) for each topic. Other algorithms, such as the term frequency-inverse document frequency (TF-IDF) algorithm, can also be adopted as the concept word selection criterion. In addition, the concept words for each topic can be selected by choosing one or more keywords with each of the above methods separately and then combining the keywords obtained from the different methods.
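The selection criteria described above can be sketched as follows: one function picks concept words by occurrence count, one by burst weight, and a third combines the results of both. The TF-IDF criterion is not shown. All function names and the sample data are hypothetical.

```python
from collections import Counter


def concept_words_by_count(cluster_messages, k=1):
    """Keywords that occur most often across the messages of a cluster."""
    counts = Counter(word for keywords in cluster_messages for word in keywords)
    return [word for word, _ in counts.most_common(k)]


def concept_words_by_weight(burst_weights, k=1):
    """Keywords with the highest burst weight."""
    ranked = sorted(burst_weights, key=burst_weights.get, reverse=True)
    return ranked[:k]


def combined_concept_words(cluster_messages, burst_weights, k=1):
    """Union of both selections, keeping first-seen order without duplicates."""
    combined = []
    candidates = concept_words_by_count(cluster_messages, k) + concept_words_by_weight(
        burst_weights, k
    )
    for word in candidates:
        if word not in combined:
            combined.append(word)
    return combined


# Made-up cluster contents and burst weights.
cluster_messages = [
    ["tsunami", "japan"],
    ["tsunami", "nuclear"],
    ["nuclear", "tsunami"],
]
burst_weights = {"tsunami": 0.2, "nuclear": 0.9, "japan": 0.1}

by_count = concept_words_by_count(cluster_messages)
by_weight = concept_words_by_weight(burst_weights)
combined = combined_concept_words(cluster_messages, burst_weights)
```

In this made-up data the two criteria disagree: the most frequent keyword is “tsunami”, while the keyword with the highest burst weight is “nuclear”, so the combined selection contains both.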
Every cluster ct clustered from the clustering module 140 at time point t can be identified as a detected topic. The topic energy tec
wherein nm,c
#distWords ∈ ct denotes the number of distinct keywords in the topic ct;
nw,c
wc
BSw
In an embodiment of the disclosure, the text stream message analyzing method further comprises deleting, by the sliding window module once every preset duration, the stored text stream messages whose time points fall outside the sliding window.
In an embodiment of the disclosure, the plurality of text stream messages received by the dynamic text weight module has to be pre-processed by the pre-processing module 120. During pre-processing, every text stream message is processed through a word segmentation or tokenization process and a sentence segmentation process, and non-important words are then filtered out to generate a plurality of keywords. In an embodiment of the disclosure, the text stream message analyzing method further comprises calculating burst scores (BS) and a Term Occurrence Probability (TOP) of the keywords via the dynamic text stream weight algorithm for generating the burst weight.
In an embodiment of the disclosure, the plurality of text stream messages are clustered through the clustering algorithm, which performs a similarity estimation according to the plurality of text stream messages and the burst weight, for generating the clusters. In an embodiment of the disclosure, the memory device comprises a cloud database established by a cloud method for storing the clusters which are clustered by the clustering module.
In the traditional method, the parameters are fixed, so the method is not well suited to detecting an unknown number of topics; it also needs more computing time, so it is not well suited to real-time topic detection. In addition, the traditional weighting method cannot represent the variation of the dynamic weighted values of the text stream messages and thus cannot overcome the concept-drift problem of the text stream messages. In the disclosure, text stream messages are added and deleted by a sliding window module to maintain the system dynamically. The importance of the messages, which changes as time goes by, is detected through the dynamic text weight technology. Continuous messages are clustered by the clustering module immediately. When real-time topics are detected and the clusters of the topics are generated, the clusters are stored in a cloud database. Therefore, the method helps analyze the evolution of real-time topics, for example to follow market changes and their impact, and supports goals such as product market development or disaster warning.
The above paragraphs describe many aspects of the disclosure. Obviously, the teaching of the disclosure can be accomplished by many methods, and any specific configurations or functions in the disclosed embodiments only present a representative condition. Those who are skilled in this technology can understand that all of the disclosed aspects in the disclosure can be applied independently or be incorporated.
While the disclosure has been described by way of example and in terms of embodiment, it is to be understood that the disclosure is not limited thereto. Those who are skilled in this technology can still make various alterations and modifications without departing from the scope and spirit of this disclosure. Therefore, the scope of the present disclosure shall be defined and protected by the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
101149250 | Dec 2012 | TW | national |
102124478 | Jul 2013 | TW | national |