The present invention relates to a system and method for topic-based analysis of information derived from microblogs. More particularly, the present invention relates to a system and method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information.
A topic may be or concern an event such as a political election; a geographical location such as a tourist attraction; or an entity such as an individual or a corporate body. More broadly, a topic can be identified from a user query representing a particular user's need for information. The topic could be a particular event, such as an election, a sports event or a natural disaster; an entity, such as a person, a location, an organisation or a concept (such as a religion, a philosophy or a language); or a product.
Microblogs are a popular tool for users to post news, information or queries online for public (or private group) dissemination, review and reply. Twitter is a popular microblog site with over 300,000,000 microblogs exchanged each day. A Twitter microblog, or Tweet, comprises a message of up to 140 characters.
The problem addressed by the present invention and implementations thereof is to provide for the conduct of a meaningful analysis and exploration of microblogs based on a user input.
Searching functionality in social networking sites in general and through microblogs specifically is basic and limited. This limitation is particularly pronounced when searching for a particular event or entity which has multiple facets or aspects, as most topics do. The search functionality that is currently implemented through microblogging sites and tools is a simple word-matching search that retrieves the most recent posts on a given query. Further, a user may obtain hundreds or perhaps thousands of hits comprising individual microblogs/posts in response to a given query. This leads to instant information overload and unusable search results.
Many microblog and social websites provide a search capability to users to allow them to find relevant posts using a word-matching search in response to a user query. The current state of the art in microblog searching returns as search results any recent posts that contain the search word/s. In this way, a user can be updated by recent posts mentioning the search term—a given event or certain entity—most recently. Microblog searching is discussed in:
The search scenarios in the microblog environment are limited and the information provided in any one microblog (140 characters perhaps) is sparse. Equally sparse on information are comments or posts on social networking sites. Nevertheless, attempts have been made to provide useful information derived from microblogs—a rich seam of social data. These attempts include:
Interest in microblog retrieval has significantly increased in recent years. Several studies investigated the nature of microblog search compared to other search tasks [N. Naveed, T. Gottron, J. Kunegis, A. Alhadi. (2011). Searching microblogs: coping with sparsity and document quality. CIKM-2011. ] and [J. Teevan, D. Ramage, M. Morris. (2011). #Twittersearch: A comparison of microblog search and web search. WSDM 2011]. [N. Naveed, T. Gottron, J. Kunegis, A. Alhadi. (2011). Searching microblogs: coping with sparsity and document quality. CIKM-2011] illustrated the challenges of microblog retrieval, where documents are very short and typically focus on a single topic. [J. Teevan, D. Ramage, M. Morris. (2011). #Twittersearch: A comparison of microblog search and web search. WSDM 2011] highlighted the differences between web queries and microblog queries, where microblog queries usually represent users' interest to find updates about a given event or person as opposed to finding relevant pages on a given topic in a web search.
Due to this increased interest in microblog search, TREC introduced a new track focused on microblog retrieval in 2011 [I. Ounis, C. Macdonald, J. Lin, I. Soboroff. (2011). Overview of the TREC-2011 Microblog Track. TREC-2011]. The aim was to find the best methods for achieving high precision retrieval for microblog search. A collection of 14 million tweets from Twitter and a test set of 50 topics were provided for investigation [I. Ounis, C. Macdonald, J. Lin, I. Soboroff. (2011). Overview of the TREC-2011 Microblog Track. TREC-2011]. Although the track led to a variety of effective retrieval approaches, the issue of modelling the search scenario remains important as the TREC track setup models search like a standard ad-hoc retrieval task, which may be suboptimal [J. Teevan, D. Ramage, M. Morris. (2011). #Twittersearch: A comparison of microblog search and web search. WSDM 2011].
The absence of a sensible definition for a microblog search scenario led some researchers to create different useful tasks other than direct search. For example, [I. Subasic, B. Berendt. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. ECIR-2011] used tweets as a news source and compared them to other online news media to detect features for automatic news detection from Twitter. In [O. Phelan, K. McCarthy, M. Bennett, and B. Smyth. (2011). Terms of a feather: content-based news recommendation and discovery using twitter. ECIR 2011], tweets were used to recommend news to users based on their preferences. In [J. Bollen, H. Mao, X-J. Zeng. (2010). Twitter mood predicts the stock market. Journal of Computational Science. 2(1)], users' mood on Twitter was utilized to predict stock market changes. Many other tasks have been suggested for achieving information gain to users based on social data from Twitter.
Other references of note are F. W. Lancaster, E. G. Fayen. (1973). Information Retrieval On-Line. Melville Publishing Co., Los Angeles, California; O. Phelan, K. McCarthy, M. Bennett, and B. Smyth. (2011). Terms of a feather: content-based news recommendation and discovery using twitter. ECIR 2011; I. Subasic, B. Berendt. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. ECIR-2011; B. Han, T. Baldwin. (2011). Lexical Normalisation of Short Text Messages: Makn Sens a #twitter. ACL-HLT 2011; and W. X. Zhao, J. Jiang, Ji. Weng, J. He, E-P. Lim, Ho. Yan, X. Li. (2011). Comparing twitter and traditional media using topic models. ECIR 2011.
The Twitter microblog uses “hashtags”: “The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages.” (source: www.twitter.com). In other words, a user creates a hashtag by prefixing a term with a # symbol to identify the prefixed term as the intended topic of that microblog. The hashtag can be seen as a “Subject:” line or topic identifier so that other users can search for that particular hashtag to identify further microblogs referencing the same hashtag. More than one hashtag can be present in a single microblog.
Many microblog and social websites, such as Twitter, provide search capabilities to allow users to find relevant posts that match their information need. The currently implemented microblog search on Twitter provides recent tweets that match search words. A user may elect to search (or follow) for specific entities, persons, or events, via the use of hashtags “#tag” or name mention “@user”, to get continuous updates [J. Teevan, D. Ramage, M. Morris. (2011). #Twittersearch: A comparison of microblog search and web search. WSDM 2011]. One disadvantage of this kind of search is that a query may yield a large number of tweets, overwhelming the user. In this scenario, a user is presented with a flat list of matching tweets (tweets and microblogs are used interchangeably in the paper), leaving much to be desired, such as time span, the tweet sentiment and topic modelling.
Some sites allow searching by hashtags, in which case, the hashtags are used as keywords: http://truthy.indiana.edu/. This website provides a tool for analysing a population of microblogs for hashtags and plots a link graph between a hashtag forming a user query and other hashtags that co-occur within each microblog. The website also allows a user to search for a hashtag and then displays recent tweets that contain the given hashtag, as well as an indication of the distribution of how many times the searched hashtag is mentioned over time.
The main shortcoming of the current technology is that a search through microblogs provides only the most recent hits (relevant posts) based on a given user search query. Searching social content and social networking site content in general and microblogs (aka tweets) in particular has been basic and limited, especially for time-sensitive topics. The currently implemented microblog search on sites such as Twitter is based on simple word matching and retrieves the most recent microblogs that match a given query.
Furthermore, a user may obtain hundreds or perhaps thousands of microblogs in response to a given query, leading to information overload. The problem with this scenario is that there will typically be a large number of relevant posts for any one search term and a user will be swamped by the volume of returns—so-called “information overload”. It is a technical problem to present relevant search results without providing an overwhelming volume of relevant search results.
A typical reaction of a user faced with a large number of hits is to narrow the search down by using more specific search terminology, i.e. longer or plural hashtags. This means that a user receives updates on a very specifically defined topic presented as a hashtag. The present systems provide too many relevant hits to be useful and relatively little information content. The situation is analogous to not being able to see the wood for the trees, where the trees are the relevant hits and the wood is the information being sought.
There is, therefore, a desire to overcome one or more of the problems associated with the prior art and create, for example, a system and method for topic analysis based on information derived from microblogs.
The technical solution is to present a system and method of searching microblogs embodying the present invention. The solution provides for a search to be conducted which leads to more useful information being gained by the user compared to current systems which return an overwhelming number of relevant hits but little in the way of useful information for a user.
Embodiments of the present invention seek to ameliorate one or more problems associated with the prior art.
An aspect of the present invention provides a method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information, the method comprising: collecting a population of microblogs comprising microblog data, each microblog containing a limited number of characters; providing a user interface allowing entry of a search query; matching a search query entered on the user interface to data in the microblog data; providing the results of the matching process as a sub-set of microblog data; applying processing techniques to the sub-set of microblog data; and generating a summary report of the processed sub-set of microblog data.
In embodiments of the invention, the method further comprises: dividing the sub-set of microblog data into different categories of microblog; and incorporating results for each of the different categories in the summary report.
Preferably, natural language processing is used, such as: text normalization; named entity recognition; keyword/key-phrase extraction; or sentiment analysis.
A further aspect of the invention provides a system for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information, the system including: a computing device having a processor and a memory; and a storage device, the computing device being configured to perform the method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information, the method comprising: collecting a population of microblogs comprising microblog data, each microblog containing a limited number of characters; providing a user interface allowing entry of a search query; matching a search query entered on the user interface to data in the microblog data; providing the results of the matching process as a sub-set of microblog data; applying processing techniques to the sub-set of microblog data; and generating a summary report of the processed sub-set of microblog data.
In embodiments, the system further includes a visual display for displaying an interface to a user, and to receive a search query from a user, such that the input of a search query by the user causes the computing device to output to the interface a summary report of the processed sub-set of microblog data corresponding to the search query.
Another aspect of the present invention provides a computer-readable medium storing instructions which when executed to run on a processor cause the processor to perform the steps according to the method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information, the method comprising: collecting a population of microblogs comprising microblog data, each microblog containing a limited number of characters; providing a user interface allowing entry of a search query; matching a search query entered on the user interface to data in the microblog data; providing the results of the matching process as a sub-set of microblog data; applying processing techniques to the sub-set of microblog data; and generating a summary report of the processed sub-set of microblog data.
Another aspect of the present invention provides a search tool operable to generate automatically information-rich content from multiple microblogs, each microblog containing only sparse information, the tool comprising: a collection of microblogs comprising microblog data, each microblog containing a limited number of characters; a user interface allowing entry of a search query; a matching processor to match a search query entered on the user interface to data in the microblog data; a result set comprising a matched sub-set of the microblog data; and a report generator to apply processing techniques to the result set to generate a summary report of the processed sub-set of microblog data.
Another aspect of the present invention may provide a topic-based microblog analysis tool.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Embodiments of the present invention provide a new multidimensional microblog search tool that generates a comprehensive report from microblogs instead of a flat list of recent/relevant microblogs for a given query. Reports may include tag-clouds, topic time series, and most popular and funny microblogs and an analysis of those displaying sentiment. The tool can be configured for monitoring time-sensitive topics using a set of predefined queries. Embodiments of the present invention provide a user experience that is different from the tweet search available from Twitter.
A search scenario embodying the present invention leads to significant information gain to the user compared to the current scenario, for example a word search through the most recent microblogs. This disclosure deals with scenarios involving more general queries and information needs (requiring more than a simple hashtag or user mention search). The outcome is a more comprehensive summary of hits for the search query in the microblog domain, or in social media in general, which is richer in information than a simple list of results.
Referring to
In further detail embodiments of the invention provide:
Embodiments of the invention can be implemented as:
However, embodiments of the invention can also be implemented by other means such as applets, apps and bespoke desktop solutions.
Embodiments of the invention can provide:
Referring to
Embodiments of the present invention provide a microblog search tool, system and method of indexing that generates a comprehensive report in response to a given search query based on indexed microblog data.
In the embodiment shown in
The microblog feed 110 may in some embodiments be a live feed of microblogs or in other embodiments may be a saved database of microblogs mirrored or compiled from a live feed of microblogs.
The microblog feed 110 makes a population of microblogs together with their associated metadata available to the system 100. For example, the microblog feed 110 is of microblogs from the Twitter website (“tweets”) which are collected for a given language and saved in a database 110a in the feed 110. Tweets are collected by issuing generic queries, such as “lang:xx” (ex. “lang:ar” for Arabic) against Twitter, which retrieves tweets in a given language. Collected tweets contain the author ID, tweet ID, timestamps, etc.
A normaliser processor 102a is located downstream of the microblog feed 110. The normaliser processor 102a normalises the text of the microblogs (the “Tweet text”) from the database 110a using advanced text normalization techniques operating on the informal or slang language that is used in tweets and social media in general. For example, English language normalisation can be used as described in [B. Han, T. Baldwin. (2011). Lexical Normalisation of Short Text Messages: Makn Sens a #twitter. ACL-HLT 2011] and for Arabic as described in [K. Darwish, W. Magdy, A. Mourad. (2012). Language Processing for Arabic Microblog Retrieval. CIKM 2012].
The normalization process can also operate on emoticons. Emoticons are commonly used in microblogs reflecting the sentiment of a blogger. Microblogs can be normalised and sentiment displayed in a microblog is detected from use of emoticons and language.
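By way of illustration only, emoticon-based sentiment detection might be sketched as follows. The emoticon lists and the function name `emoticon_sentiment` are assumptions introduced for this example; the description does not enumerate the emoticons used, and the sketch ignores the language-based cues also mentioned above.

```python
# Hypothetical emoticon lists; the description does not specify them.
POSITIVE = {":)", ":-)", ":D", ";)"}
NEGATIVE = {":(", ":-(", ":'("}


def emoticon_sentiment(text: str) -> str:
    """Label a microblog 'positive', 'negative', 'mixed' or 'none' from its emoticons."""
    tokens = text.split()  # emoticons are assumed to be whitespace-delimited
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos and has_neg:
        return "mixed"
    if has_pos:
        return "positive"
    if has_neg:
        return "negative"
    return "none"
```

A lookup table of this kind is cheap to apply to every microblog at normalisation time, so the resulting label could be stored alongside the other metadata.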
An indexer 102b then indexes the normalized tweets together with their metadata, such as author ID, time stamp, and tweet ID. A retrieval system is configured to use a simple Boolean retrieval model [F. W. Lancaster, E. G. Fayen. (1973). Information Retrieval On-Line. Melville Publishing Co., Los Angeles, Calif.] instead of a ranking model because the system should operate on and analyse “all” tweets that match a query in a given time window, as opposed to a top-ranked selection.
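A minimal sketch of such an unranked index is given below, assuming an in-memory inverted index and integer timestamps. The class names `Tweet` and `BooleanIndex` and the method `search` are hypothetical and are not taken from the described system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author_id: int
    timestamp: int  # seconds since the epoch (assumed representation)
    text: str       # assumed to be already normalised


class BooleanIndex:
    """Inverted index that returns every match, with no ranking."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of tweet IDs
        self.tweets = {}                  # tweet ID -> Tweet (text plus metadata)

    def add(self, tweet):
        self.tweets[tweet.tweet_id] = tweet
        for term in tweet.text.lower().split():
            self.postings[term].add(tweet.tweet_id)

    def search(self, term, start, end):
        """Return all tweets containing `term` with start <= timestamp < end."""
        ids = self.postings.get(term.lower(), set())
        return [self.tweets[i] for i in sorted(ids)
                if start <= self.tweets[i].timestamp < end]
```

Because no scoring is involved, the result is the complete matching set for the window, which is what the later grouping and counting steps require.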
A data storage facility 103 is fed with and holds the normalised and indexed microblogs.
As will be appreciated, the user interface 111 provides for search query entry 112, allowing a user to input a search query, or choose a search query from a drop-down list of pre-selected or pre-generated search queries. A user provides a search query, which would preferably be an entity or an event, or could be a hashtag (#tag), a name mention (@some_user), or a free-form query.
Queries used for the system can be rich Boolean queries. Boolean queries, although they require time to construct manually, do not require training and can help disambiguate entities or events that may be referred to in multiple topics. For example, searching for the French president “Hollande” can retrieve many tweets referring to different persons carrying the same name. The Boolean query can be formulated as: “Hollande AND (François OR France OR president)” to disambiguate the entity.
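The example query above can be evaluated against a single tweet as sketched below. The nested-tuple query representation and the names `tokenize` and `matches` are assumptions made for illustration; the description does not specify how queries are parsed or stored.

```python
import re

# "Hollande AND (François OR France OR president)" as a nested tuple (assumed form).
QUERY = ("AND", "Hollande", ("OR", "François", "France", "president"))


def tokenize(text):
    """Case-fold the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def matches(query, terms):
    """Evaluate a query that is either a term or a tuple (operator, sub-queries...)."""
    if isinstance(query, str):
        return query.lower() in terms
    op, *subs = query
    results = (matches(sub, terms) for sub in subs)
    return all(results) if op == "AND" else any(results)
```

For instance, `matches(QUERY, tokenize("President Hollande visits Berlin"))` evaluates to true, whereas a tweet mentioning only “Hollande” without any of the disambiguating terms is rejected.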
In addition to search query entry 112, the user interface 111 incorporates an optional time window filter 113 which is operable by the user to limit search results down to a specific time window. Note that the metadata for microblogs incorporates a time stamp allowing microblogs to be categorised by date of creation or posting. If there is no user input to the time window filter 113, then a default time window is preferably set by the filter 113. In embodiments, the default time window is set as from the present time on the current day back to midnight of the previous day. Other default time windows can be preset or defined as simply “the last 2, 4, 6, 12 hours”.
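The default windows described above might be computed as sketched below. The sketch assumes naive `datetime` values and reads “midnight of the previous day” as 00:00 at the start of the previous day, which is only one possible interpretation; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def default_window(now):
    """Return (start, end) running from 00:00 of the previous day up to `now`."""
    previous_day = now.date() - timedelta(days=1)
    start = datetime.combine(previous_day, datetime.min.time())
    return start, now


def last_hours_window(now, hours):
    """Return (start, end) covering the last `hours` hours before `now`."""
    return now - timedelta(hours=hours), now
```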
In combination, the search query entry 112 and the time window filter 113 generate a composite search query 114 which is transmitted from the user interface 111 for interrogation of the index database 103.
All resulting microblogs satisfying the composite search query 114 in any specified time window are retrieved from the index database 103 and present a fresh population 115 of indexed microblogs. An extractor module 116 serves to analyse the retrieved population 115 and extract at least some of the following information from the retrieved population:
For 1 and 2, all retrieved tweets for a given search query are grouped to aggregate all similar tweets into the same group. For fast and robust matching between tweets, an additional normalization step is applied which involves case-folding and removal of all hashtags, name mentions, URLs, punctuation marks, symbols, emoticons, and retweet symbols. Tweets that match exactly after normalization are grouped together. Groups are presented in descending order of size, with the most common tweet form as the representative of the group along with the number of tweets in the group. Top funny tweets (sentiment tweets) are extracted in the same manner, with the grouping applied only to those tweets that contain smiley emoticons.
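The grouping step can be sketched as follows. The regular expressions, in particular the emoticon pattern, and the names `group_key` and `group_tweets` are illustrative assumptions and not the patterns used by the described system.

```python
import re
from collections import Counter, defaultdict

# Assumed emoticon pattern: eyes, optional nose, mouth, delimited by whitespace.
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[)(dp/\\|\]\[](?!\S)")


def group_key(text):
    """Reduce a tweet to the form used for exact-match grouping."""
    text = text.lower()                         # case-folding
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = EMOTICON.sub(" ", text)              # emoticons
    text = re.sub(r"\brt\b", " ", text)         # retweet symbol
    text = re.sub(r"[#@]\w+", " ", text)        # hashtags and name mentions
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation and other symbols
    return " ".join(text.split())


def group_tweets(tweets):
    """Return (representative, size) pairs sorted by descending group size."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[group_key(tweet)].append(tweet)
    ranked = []
    for members in groups.values():
        representative, _ = Counter(members).most_common(1)[0]  # most common form
        ranked.append((representative, len(members)))
    ranked.sort(key=lambda group: group[1], reverse=True)
    return ranked
```

Exact matching on the reduced form keeps the grouping linear in the number of tweets, at the cost of missing near-duplicates that differ by even one word.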
The URLs in the tweets of the top 100 clusters are extracted. Since URLs in tweets are typically shortened and some URLs may have multiple shortened forms, all URLs are expanded to reveal the original URLs. URLs pointing to a video hosting site such as YouTube are used to obtain a ranked list of the most popular videos, for example, which can then be embedded in the resultant report. Other URLs are extracted and their titles are presented, ordered by the number of appearances in tweets, along with their links and number of appearances. Different categories of links pointing to non-video material are also possible, so links to news stories, audio clips and geographic locations can also be ranked and incorporated in a report. For example, the most frequently occurring place names or geographic co-ordinates could be shown on a map in the report.
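A sketch of the URL ranking is given below. The `expand` argument is a hypothetical stand-in for whatever mechanism resolves shortened URLs (for example, following HTTP redirects), and the list of video hosts is an assumption; neither is specified in the description.

```python
import re
from collections import Counter
from urllib.parse import urlparse

VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}  # assumed list


def rank_urls(tweets, expand):
    """Count expanded URLs and return (videos, other_links), most frequent first."""
    videos, links = Counter(), Counter()
    for text in tweets:
        for short_url in re.findall(r"https?://\S+", text):
            url = expand(short_url)  # several short forms may map to one original
            host = urlparse(url).netloc.lower()
            if host in VIDEO_HOSTS:
                videos[url] += 1
            else:
                links[url] += 1
    return videos.most_common(), links.most_common()
```

Counting is performed on the expanded form so that different shortened versions of the same address contribute to a single total.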
For Arabic, a base-phrase chunker is used which is akin to AMIRA [M. Diab. (2009). Second generation tools (AMIRA 2.0): Fast and robust tokenization, POS tagging, and base phrase chunking. MEDAR 2009.] to extract noun phrases. For English, Open Calais is used to extract keywords/keyphrases. Extracted noun phrases and/or keywords/keyphrases are sorted by their frequency and are displayed in a tag-cloud, see
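Once the phrases have been extracted, the frequency sorting behind a tag-cloud might be sketched as below. Phrase extraction itself (AMIRA, Open Calais) is not reproduced; the input is assumed to be a flat list of already extracted phrases, and the linear mapping from frequency to font size is an assumption made for illustration.

```python
from collections import Counter


def tag_cloud(phrases, top_n=50, min_size=10, max_size=40):
    """Return (phrase, count, font_size) triples, most frequent first."""
    counts = Counter(p.lower() for p in phrases).most_common(top_n)
    if not counts:
        return []
    highest, lowest = counts[0][1], counts[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts are equal
    return [
        (phrase, count, min_size + (max_size - min_size) * (count - lowest) // span)
        for phrase, count in counts
    ]
```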
The number of tweets across time is plotted and presented to the user in an interactive graph, see
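The counts underlying such a graph can be obtained by bucketing tweet timestamps, as in the sketch below. The one-hour bucket width and the function name `tweets_per_hour` are assumptions; the description does not state the granularity of the plot.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Return (hour, count) pairs in chronological order; empty hours are omitted."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(buckets.items())
```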
A report generator 117 is provided which takes the extracted information 201-205 from the retrieved tweets and creates a summarised report 120, preferably presented in a user-friendly standardised or customised format, in which top tweets, top funny tweets, and most circulated videos and links are sorted by frequency of appearance. The most frequent terms and phrases are presented in the form of a tag-cloud. A time-series graph shows the popularity of the topic on Twitter over time as in
The generated report derived from the microblog information provides a higher level of information content compared to a standard list of word-matched search results. Providing a summarised report derived from what would otherwise be an overload of sparse microblog data provides a user with useful information tailored to the search query term selected by the user.
In embodiments, report generation and microblog retrieval can be pre-configured for special events. The system and method can be used for tasks beyond searching for a given topic on Twitter. Embodiments can be configured to monitor the popularity of specific entities or events over time and report on the same. In such embodiments, the system is fed with a set of fixed queries, and summarized reports are updated continuously at fixed time intervals to provide users with updated reports. Multiple entities can be monitored within a given event, and the relation among these entities can be extracted and plotted in graphs to show the connection among different entities.
In summary, referring to
For the resulting tweets in the specified time-span, embodiments of the present invention generate reports showing the top tweets (“top” means most (re)tweeted), top funny tweets, most circulated videos and links, most popular terms and phrases, and statistics about the entity/event over time. A user can also navigate through the resulting report over time to see how the popularity of a given entity/event has changed. In addition, the system can be configured for automatically collecting tweets related to a given topic to monitor special events for a period of time.
Data collection
Configuring the system—Creating rich queries
Calculating Candidates Popularity and Relations
In embodiments, a translation module is provided which is configured such that microblogs in a plurality of languages can be used to generate a collection of microblog data in a single language. Translation can be regarded as a normalisation step.
When used in this specification and claims, the terms “comprises” and “comprising” and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.
The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
Number | Date | Country | Kind |
---|---|---|---|
1211853.5 | Jul 2012 | GB | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2012/065367 | 8/6/2012 | WO | 00 | 4/21/2015 |