This disclosure relates generally to information gathering and more particularly to a technique to combine a plurality of short communications into a larger document to more readily understand the context of the overall communication.
The growth of internet use in recent years has provided unparalleled access to informational resources. Over the past decade, social networking and microblogging services such as Facebook and Twitter have become popular communication tools among internet users, being employed for a wide range of purposes including marketing, expressing opinions, broadcasting events or simply conversing with friends. Thus, there has been a growth in development of rapid automatic processing technologies that not only provide insights but also keep up with the rate at which information is produced. Recent work has included sentiment analysis, mining coherent discussions, identifying trending topics, detecting events, etc. There is a need for technologies that can process content from these services, extract entities, sentiment, topics, location, etc., and enable linking the attributes, such as sentiment to topic, topic to location and such.
In accordance with the present disclosure, a document building system is provided including: a user interface device having access to a communication system having a plurality of short media message units available to collect the short media message units; memory to cache the short media message units in the system; a collator to collect a plurality of related short media message units among users over a predetermined period of time; and a user interface to output to a single file the plurality of related short media message units when the file reaches a predetermined size to construct a cohesive document or to output to a single file a plurality of related short media message units after a maximum predetermined period of time to construct a cohesive document.
In accordance also with the present disclosure, a method for constructing a cohesive document includes: accessing a communication system having a plurality of social media message units accessible; collecting a plurality of related social media message units among users over a predetermined period of time; outputting to a single file the plurality of related social media message units when the file reaches a predetermined size to construct a cohesive document or outputting to a single file a plurality of related social media message units after a maximum predetermined period of time to construct a different cohesive document.
The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the disclosure will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
The present disclosure describes techniques to create cohesive documents from multiple social media message units (SMMUs) produced in services such as Twitter, Facebook, WhatsApp and others. Documents are created based on content and themes from users, temporal information in content or metadata, geospatial information in content or metadata, and other attributes. The following description primarily discusses Twitter, but it should be appreciated that the description is also applicable to other services such as Facebook, WhatsApp and others that communicate with short bits and pieces or snippets of information.
The growth of Internet use in recent years has provided unparalleled access to informational resources. Micro-blogging services such as Twitter have become a very popular communication tool among Internet users, being employed for a wide range of purposes including marketing, expressing opinions, broadcasting events or simply conversing with friends.
More than 200 million active users publish more than 400 million tweets per day on the social network, sharing significant events in their daily lives. With such a large, geographically diverse user base, Twitter has essentially published many terabytes of real-time sensor data in the form of status updates. Additionally, Twitter allows researchers and governments unprecedented access to digital trails of data as users share information and communicate online. This is helpful to parties seeking to understand trends and patterns ranging from customer feedback to the mapping of health pandemics. Hence, every Twitter user can be described as a sensor that can provide spatiotemporal information capable of detecting major events such as earthquakes, hurricanes or other man-made or natural events.
Location and language are crucial attributes to understanding the ways in which the online flow of information might reveal underlying economic, social, political, and environmental trends and patterns. Localization facilitates temporal analyses of trending news topics and events from a geospatial perspective, which is often useful in selecting localized events for further analysis. Studies have addressed the capability to track emergency events and how they evolve, as people usually post news on Twitter first and the news is later broadcast by traditional media corporations. Alerts can be sent as soon as an emergency event is detected (known as First Story Detection—FSD), providing relevant information gathered from the conversations around it to the corresponding emergency response teams. One of the challenges to this process is identifying the location where the emergency is taking place.
Geospatial tagging features are certainly not new to Twitter, which has a check-in feature as most social networking sites do. This feature allows users to geographically tag their tweets by listing their location in their Twitter User Profile. Unfortunately, Twitter users have been slow to adopt such geospatial features. In our sampling of approximately 3 million Twitter users, only 30% have listed a user location, with entries ranging from something as granular as a city name (e.g. Riyadh, Saudi Arabia) to something overly general (e.g. Asia) or unhelpful (e.g. The World). In addition to location via user profile, Twitter supports a per-tweet geo-tagging feature which provides extremely fine-grained Twitter user tracking by associating each tweet with a latitude and longitude.
In a sampling of 17 million tweets over the first quarter of 2013, less than 0.70% of all tweets actually used this functionality. When this feature is enabled, it generally functions automatically when a tweet is published, with the coordinate data coming either from the user's device itself via GPS, or from detecting the location of the user's Internet (IP) address. Additionally, neither of these Twitter-provided features for geo-location provides location estimates based on the textual content of the user-posted tweet messages. On the whole, the lack of adoption and availability of per-user and per-tweet geo-tagging features indicates that the capability of Twitter as a location-based sensing and information tracking tool may have only limited reach and impact.
Although Twitter provides vast amounts of data, it introduces several natural language processing (NLP) challenges: multilingual posts and code-switching between languages make it harder to develop language models and may require machine translation (MT); with the limitation of 140 characters per tweet, Twitter users often use shorthand and non-standard vocabulary, which makes named-entity detection and geo-location via gazetteer more challenging; tweets are inherently noisy and may contain limited information for geo-location detection on a per-tweet basis; and Twitter content tends to be very volatile, with pieces of content becoming popular and fading away within a matter of hours.
It should be appreciated that, using the Twitter Spritzer streaming API with a filter to differentiate selected users of interest, one can access multiple social media message units among users. The Twitter Spritzer feed streams approximately 1% of the entire world's tweets in real-time. A users filter can then further down-sample the 1% feed into tweets within the users' network, which includes tweets based on user mentions and re-tweets, in addition to the tweets from the selected users. Further information on accessing streaming can be found at https://dev.twitter.com/docs/streaming-apis and http://blog.gnip.com/tag/spritzer/.
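As a minimal sketch of this down-sampling step (not an official client), the following assumes tweets arrive as parsed JSON dictionaries in the shape of Twitter's v1.1 streaming format, with fields such as user, entities.user_mentions and retweeted_status; the function names are illustrative.

```python
def in_user_network(tweet, selected_ids):
    """Return True if a tweet belongs to the selected users' network:
    authored by, mentioning, or retweeting a selected user."""
    # Authored by a selected user
    if tweet["user"]["id"] in selected_ids:
        return True
    # Mentions a selected user
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        if mention["id"] in selected_ids:
            return True
    # Retweet of a selected user's tweet
    rt = tweet.get("retweeted_status")
    if rt is not None and rt["user"]["id"] in selected_ids:
        return True
    return False


def filter_stream(stream, selected_ids):
    """Down-sample a 1% sample stream (e.g., Spritzer) to the network of
    selected users. `stream` is any iterable of parsed tweet dicts."""
    for tweet in stream:
        if in_user_network(tweet, selected_ids):
            yield tweet
```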
A technique of collating a group of tweets into a document structure, based on parameters such as the user's tweeting frequency and the minimum and maximum time window over which the topic of interest (such as a news topic) is expected to evolve, trend and fade in the Twittersphere, will now be described. Once the document is defined, further processing such as analysis by NLP and Information Extraction (IE) algorithms can be performed to further glean information from the content of the document.
The motivation for defining a document is two-fold: (1) as a single tweet is limited to 140 characters, it may not have sufficient textual content to understand the significance that corresponds to a specific topic (or a news story), and, (2) most Twitter users post tweets on specific trending topics and move on to other topics within a certain temporal window. Content from social media sites, such as Twitter, Facebook, WhatsApp, is produced in small snippets or posts and often a complete story is expressed over multiple posts. Running natural language processing (NLP) and information extraction (IE) algorithms on small snippets of content becomes challenging, since the algorithms may not have sufficient context to produce useful output. A new method has been developed to create cohesive documents from multiple social media message units (SMMUs) based on content and themes from users, temporal and geospatial information in content and metadata, and other attributes or combination of attributes. The NLP algorithms run on these cohesive documents instead of SMMUs to produce improved named entity recognition, sentiment analysis, geolocation, and machine translation.
There are several advantages of this approach over traditional techniques that work on SMMUs or a group of SMMUs. The cohesive document produced by the present method contains contextual information that may not be present in a single SMMU. Since a typical conversation spans several SMMUs among multiple users, combining the SMMUs produces documents analogous to text documents that present a cohesive narrative. Moreover, the document size can be tuned based either on the SMMU attributes, such as the frequency at which SMMUs are produced, time windows, users and hashtags, or on the requirements of the NLP and IE algorithms.
Shallow processing technologies designed for “big data” can deal with the volume, velocity and variety of the data, but lack the richer and more in-depth analysis provided by natural language processing (NLP) and information extraction (IE) algorithms. The present disclosure defines a process for creating cohesive documents from the content produced on social networking and microblogging services. NLP and IE techniques can then be employed on documents instead of the message units.
Referring now to
As described above, individual SMMUs may not provide enough information to understand the content of a conversation, so by converting SMMUs into a cohesive document unit that can be used as a subject of analysis by NLP and Information Extraction (IE) algorithms, a better analysis of the SMMUs can be accomplished. The motivation for defining the document is two-fold: (1) a single message is often limited, for example to 140 characters in the case of Twitter, and may not have sufficient textual content for understanding the information that corresponds to a specific topic (or a news story), and (2) most users post messages on specific trending topics and move on to other topics within a certain temporal window.
Referring now to
If the set of SMMUs meets the document creation criteria, then a document is created and added to the document list (DocumentList) for NLP processing 32 (
Referring again now to
Referring now to
A tweets-to-document generation process 400 is formulated in Algorithm 1 and is shown in text form in
Input:
tweets: List of n tweets from m Twitter users in time window t
minWindowSize: The minimum size of the time window in hours
maxWindowSize: The maximum size of the time window in hours
minTweetsInWindow: The minimum number of tweets per-user in a time window
maxTweetsInDocument: The maximum number of tweets allowed in a document
Output:
documentList: List of documents in time window t
Notation: { }—List, [ ]—Array
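As a rough illustration of how these inputs combine, the following Python sketch collates a time-ordered tweet stream into per-user documents. It is a simplified, hedged reading of Algorithm 1: only the size cap (maxTweetsInDocument), the maximum time window and the minTweetsInWindow floor are enforced; minWindowSize and any additional closing criteria in the full algorithm are not reproduced here, and the parameter defaults are illustrative.

```python
from datetime import timedelta


def tweets_to_documents(tweets,
                        max_window=timedelta(hours=4),
                        min_tweets_in_window=2,
                        max_tweets_in_document=50):
    """Collate time-ordered tweets into per-user documents.

    `tweets` is an iterable of (user_id, timestamp, text) tuples sorted
    by timestamp. A user's open buffer becomes a document when it reaches
    max_tweets_in_document, or when its time window exceeds max_window.
    Buffers with fewer than min_tweets_in_window tweets are dropped as
    lacking sufficient context for NLP/IE processing."""
    buffers = {}         # user_id -> list of (timestamp, text) pairs
    document_list = []   # the DocumentList handed to NLP processing

    def close(user_id):
        buf = buffers.pop(user_id, [])
        if len(buf) >= min_tweets_in_window:
            document_list.append({
                "user": user_id,
                "start": buf[0][0],
                "end": buf[-1][0],
                "text": "\n".join(text for _, text in buf),
            })

    for user_id, ts, text in tweets:
        buf = buffers.setdefault(user_id, [])
        # Close the buffer on the size cap or when the window is exceeded.
        if buf and (len(buf) >= max_tweets_in_document
                    or ts - buf[0][0] > max_window):
            close(user_id)
        buffers.setdefault(user_id, []).append((ts, text))

    for user_id in list(buffers):    # flush whatever remains at stream end
        close(user_id)
    return document_list
```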
Once all the tweets in a time-delineated window are converted into documents, such that each document contains multiple tweet posts from a specific user, each document can be further processed using NLP and Information Extraction as necessary.
It should be appreciated that, in addition to applying the technique described to text produced on social networking, microblogging and chat services, the technique can be extended to other domains and modes. The document creation technique can be extended to audio and speech processing, where an audio document can be created from many short segments of audio or conversation. The technique can further be applied to videos generated on video-sharing and video-blogging sites. In general, it can be applied to any content that has well-defined attributes and is produced over a period of time.
Having described a document building system using a service such as Twitter, one may implement such a system for gathering information. In one environment, the system can be used to capture information from first responders when responding to an incident. Each first responder can be assigned a Twitter account, and each account can be configured with a certain set of attributes. As can be appreciated, when first responders respond to an incident and report to the chain of command providing situational awareness, it can be difficult to collect and verify the accuracy of the information during the initial period of response. By using a service such as Twitter or the like instead of hand-held voice communication radios, first responders can tweet information (send SMMUs) to the team and the team's leadership. Using the document building system as taught herein, documents can be created from the SMMUs that can then be analyzed by intelligence personnel to collect information and provide cohesive information to the decision makers, so that the decision makers can provide guidance and instructions. In another environment, the SMMUs generated in the geographical area of a significant event can be captured and cached, a set of attributes can be defined, and those SMMUs meeting the set of attributes can then be captured and documents created accordingly. The created documents can then be analyzed using natural language processing techniques or information extraction techniques to glean information of interest.
According to the disclosure an article includes: a non-transitory computer-readable medium that stores computer-executable instructions, the instructions causing a machine to: access a communication system having a plurality of social media message units available; collect a plurality of related social media message units among users over a predetermined period of time; output to a single file the plurality of social media message units when the file reaches a predetermined size to construct a cohesive document; and output to a single file the plurality of related social media message units after a maximum predetermined period of time to construct a cohesive document. Furthermore, a method for constructing a cohesive document includes: accessing a communication system having a plurality of social media message units accessible; collecting a plurality of related social media message units among users over a predetermined period of time; outputting to a single file the plurality of related social media message units when the file reaches a predetermined size to construct a cohesive document; and outputting to a single file a plurality of related social media message units after a maximum predetermined period of time to construct a different cohesive document.
Referring to
The processes and techniques described herein are not limited to use with the hardware and software of
The processes described herein may be implemented in hardware, software, or a combination of the two. The processes described herein may be implemented in computer programs executed on programmable computers/machines that each includes a processor, a non-transitory machine-readable medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform any of the processes described herein and to generate output information.
The system may be implemented, at least in part, via a computer program product (e.g., in a non-transitory machine-readable storage medium such as, for example, a non-transitory computer-readable medium), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a non-transitory machine-readable medium that is readable by a general or special purpose programmable computer for configuring and operating the computer when the non-transitory machine-readable medium is read by the computer to perform the processes described herein. For example, the processes described herein may also be implemented as a non-transitory machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate in accordance with the processes. A non-transitory machine-readable medium may include but is not limited to a hard drive, compact disc, flash memory, non-volatile memory, volatile memory, magnetic diskette and so forth but does not include a transitory signal per se.
The processes described herein are not limited to the specific examples described. Rather, any of the processing blocks as described above may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above.
The processing blocks associated with implementing the system may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special-purpose logic circuitry (e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit)). All or part of the system may be implemented using electronic hardware circuitry that includes electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device or a logic gate.
Having described a document building system to gather information, we will now discuss a process to identify social media users across the Middle East who are influential contributors on the Twitter social media platform. The goal was to identify a total of 300-350 users selected from countries across the region, with the distribution roughly matching the population of each country. Through this process, a list of Twitter users was created, culled from mainstream journalism feeds, diplomatic circles, and political circles having wide Arabic regional appeal.
Tweets were collected over a period of 3 months using the Twitter Spritzer streaming API with a filter for selected users of interest. The Twitter Spritzer feed streams approximately 1% of the entire world's tweets in real-time. The users filter further down-samples the 1% feed into tweets within the users' network, which includes tweets based on user mentions and re-tweets, in addition to the tweets from the selected users. Using this setup, approximately 17 million multilingual tweets were collected, distributed as 85% Arabic and 15% English, from 2.6 million Twitter users as shown in
To measure the performance of the tweet geo-location detection algorithm, evaluation across two dimensions was performed: (1) comparing the estimated tweet geo-location with the device-based geospatial data, and (2) comparing the estimated tweet geo-location with the geo-location of the user that posted the tweet. The first metric we consider is the error distance, which quantifies the distance in miles between the actual geo-location of the tweet, $l_{act}(t)$, and the estimated geo-location, $l_{est}(t)$. The Error Distance for tweet $t$ is defined as:
$$\mathrm{ErrDist}(t) = d\big(l_{act}(t),\, l_{est}(t)\big) \qquad (1)$$
The overall performance of the content-based tweet geo-location detector can further be measured using the Average Error Distance across all the geo-located tweets $T$ using Equation (2):

$$\mathrm{AvgErrDist} = \frac{1}{|T|}\sum_{t \in T} \mathrm{ErrDist}(t) \qquad (2)$$
A low Average Error Distance indicates that the geo-location detector can, on average, geo-locate tweets close to the geo-location provided by the user profile or user device. This metric does not, however, provide insight into the distribution of the geo-location detection errors. We therefore apply maximum-allowed-distance thresholding at three points, 100 miles, 500 miles and 1000 miles, and calculate the next metric, $\mathrm{Accuracy}_{100}$, $\mathrm{Accuracy}_{500}$ and $\mathrm{Accuracy}_{1000}$, using Equation (3):

$$\mathrm{Accuracy}_{K} = \frac{\big|\{\, t \in T : \mathrm{ErrDist}(t) \le K \,\}\big|}{|T|} \qquad (3)$$

where $K$ is the distance threshold in miles.
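For concreteness, the three metrics above can be computed as follows. The distance function $d(\cdot,\cdot)$ in Equation (1) is not specified in the text; this sketch assumes the great-circle (haversine) distance over (latitude, longitude) pairs, and all function names are illustrative.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8


def err_dist(l_act, l_est):
    """Great-circle distance in miles between actual and estimated
    (lat, lon) points -- ErrDist(t) of Equation (1), with the haversine
    formula assumed for d(., .)."""
    lat1, lon1, lat2, lon2 = map(radians, (*l_act, *l_est))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def avg_err_dist(pairs):
    """Average Error Distance of Equation (2) over a list of
    (actual, estimated) location pairs."""
    return sum(err_dist(a, e) for a, e in pairs) / len(pairs)


def accuracy_at(pairs, k_miles):
    """Fraction of tweets geo-located within k_miles of their reference
    point -- Accuracy_K of Equation (3)."""
    hits = sum(1 for a, e in pairs if err_dist(a, e) <= k_miles)
    return hits / len(pairs)
```

For example, accuracy_at(pairs, 100) corresponds to $\mathrm{Accuracy}_{100}$.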
Referring now to
As described above, the motivation for defining the Document was two-fold: (1) as a single tweet is limited to 140 characters, it may not have sufficient textual content for estimating a location that corresponds to a specific topic (or a news story), and (2) most Twitter users post tweets on specific trending topics and move on to other topics within a certain temporal window. Hence it is desirable to provide this tweets-to-document generation as formulated in the algorithm as shown in
Referring again to
Our geo-location detection algorithm has three distinct phases as shown in
In phase two, individual locations are identified. In this phase, the list of named entities which were discovered in phase one is now employed to select location records from several gazetteers. This selection is sometimes enhanced with an alias file that provides supplementary information. Each match is then given a preliminary score based on features both internal to the location record and features from external sources. Points are then duplicated proportionally to their scores to create a weighting scheme for k-means clustering. The randomly assigned points are then rescored based on how close they are to their cluster's center or centroid location. Prior to each k-means iteration, the points are reassigned to whichever cluster has the nearest centroid to that point. When clusters are stable, they are scored. Finally, location identities are assigned to location names according to their membership in the cluster with the highest score containing that name.
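A compact sketch of this score-weighted clustering step is given below, assuming each gazetteer candidate is a (latitude, longitude) point with a preliminary score. The choice of k, the iteration cap, and rounding scores to duplication counts are illustrative assumptions, and plain Euclidean distance on raw coordinates is a simplification of whatever geographic distance the actual system uses.

```python
import random


def kmeans_weighted(points, scores, k=3, iters=20, seed=0):
    """Cluster candidate gazetteer points (lat, lon) with k-means,
    weighting each point by duplicating it in proportion to its
    preliminary score, as described above."""
    rng = random.Random(seed)
    # Duplicate each point proportionally to its score (weighting scheme).
    weighted = [p for p, s in zip(points, scores)
                for _ in range(max(1, round(s)))]
    centroids = rng.sample(weighted, k)

    def nearest(p):
        # Squared Euclidean distance on raw lat/lon -- a simplification.
        return min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                         + (p[1] - centroids[i][1]) ** 2)

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in weighted:              # reassign to the nearest centroid
            clusters[nearest(p)].append(p)
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:            # clusters are stable
            break
        centroids = new
    return centroids, clusters
```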
The third and last phase of the system is concerned with selecting the best overall location associated with the document. This phase begins by iterating through the locations identified in the previous step. During this initial pass, common features such as political administrative unit membership are identified, as well as other features such as order of occurrence. In a second pass, each location is scored by comparing it to the results of the first pass; certain features are biased and others receive an anti-bias. After each point is scored, the highest scoring city belonging to the highest scoring country is returned. If no matching cities are found, the highest scoring country is returned as the estimated location.
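The selection logic of this final phase might look roughly as follows; the concrete features (shared-country membership, order of occurrence) and the bias weights are illustrative guesses, since the text does not enumerate them.

```python
from collections import Counter


def select_best_location(candidates):
    """Pick the best overall document location from scored candidates.

    `candidates` is a list of dicts with keys 'city', 'country', 'score'
    and 'order' (order of occurrence). First pass: tally feature
    agreement (here, shared country membership). Second pass: bias each
    score toward widely shared countries and earlier mentions."""
    country_votes = Counter(c["country"] for c in candidates)

    def biased(c):
        # Bias toward countries shared by many candidates; small bonus
        # for earlier mentions. Weights are illustrative, not canonical.
        return (c["score"]
                + 2.0 * country_votes[c["country"]]
                - 0.1 * c["order"])

    best_country = max(country_votes,
                       key=lambda k: sum(biased(c) for c in candidates
                                         if c["country"] == k))
    cities = [c for c in candidates
              if c["country"] == best_country and c.get("city")]
    if cities:                      # highest scoring city in best country
        return max(cities, key=biased)["city"], best_country
    return None, best_country      # fall back to the country alone
```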
A goal is to measure the accuracy of content-based geo-location of tweets against both the device-based tweet geo-location as well as the user profile-based geo-location. A key point to be noted is that we are measuring the performance of a content-based geo-location detector against geospatial data that is based solely on either the location from which the users were tweeting or their location when they created their Twitter profile. While these results help us assess the performance of the geo-location detector, we believe that creating a manually annotated set would allow us to demonstrate greater accuracy. This is due to the discrepancy between a user's physical location and the subjects a user may be tweeting about. For example, a user with a profile-provided location of Boston, Mass., USA might be traveling in Egypt, while tweeting about trending news in Syria.
As mentioned previously, Twitter offers a per-tweet geo-tagging feature which provides extremely fine-grained user tracking by associating each tweet with a latitude and longitude. In our sampling of 17 million tweets over the first quarter of 2013, less than 0.70% of all tweets actually used this functionality. To minimize outliers, we filtered tweets that are from potential spammers based on two criteria: (1) filter tweets that are not from our core selected users, and (2) filter tweets that are auto-generated by advert-spreading tools. After filtering, we had approximately 50K tweets with Twitter-provided device-based geospatial data in terms of latitude and longitude points.
Table 1 shows the results of our content-based geo-location detection algorithm using the average distance error and accuracy metrics defined above.
We found that only 12% of the 50K tweets in the test set could be geo-located within 100 miles of their device-provided geospatial points, and that the AvgErrDist across all 50K was 1,881 miles. The accuracy does improve, to close to 50%, for tweets that could be geo-located within 1000 miles of their device-provided location.
Twitter's profile geo-tagging feature allows users to geographically tag their tweets by listing their location in their Twitter User Profile. Unfortunately, Twitter users have been slow to adopt such geospatial features. In our sampling of approximately 3 million Twitter users, only 30% have listed a user location, with entries ranging from something as granular as a city name (e.g. Riyadh, Saudi Arabia) to something overly general (e.g. Asia) or unhelpful (e.g. The World). We further filtered this set of users to consider only our core selected Middle East users who provided valid location (city/country) names in their user profiles. Further, we resolved the location names to geospatial points using the Google Maps API. Based on this, we had 325 users with valid geospatial information, which we then transferred to the 50K tweets that we had selected as our test set above. Table 2 shows the results of our content-based geo-location detector using user profile based geo-location as reference.
We found that only 9% of the 50K tweets in the test set could be geo-located within 100 miles of their user-profile-provided geospatial points, and the AvgErrDist was 2,053 miles. In comparison to the device-based evaluation, the $\mathrm{Accuracy}_{100}$ degraded relatively by 75%. This result indicates that our core users, who contribute to mainstream journalism feeds, diplomatic circles, and political circles having wide Arabic regional appeal, have tweeting profiles that vary from the user profiles they created when they opened their accounts with Twitter. For our baseline evaluation, we set the parameters minWindowSize and maxWindowSize of our Tweet-to-Document generation (
In Table 3, we present some results with variations of these parameters and analyze the impact on content-based geo-location detection performance. Our main motivation for varying these parameters was that user tweeting frequency varies depending on the time of day, the trending news stories on that day, as well as other factors pertaining to users' work schedules.
In Variant 1, we changed the minWindowSize parameter from 4 hours to 2 hours, which reduced the contextual time window, leading to smaller-length documents localized to the tweeting profile in the 2-hour window. The maxWindowSize parameter was not changed in this experiment. We noticed that the $\mathrm{Accuracy}_{100}$ increased by 156% relative to our baseline parameters, and the AvgErrDist also reduced to 773 miles from 1,881 miles. This improvement indicates that, even though a shorter time window leads to smaller-length documents, the content is more localized to a specific city/country as compared to the larger 4-hour window, which might have content from topics pertaining to more than one location.
In Variant 2, we changed both the minWindowSize and maxWindowSize parameters, to 2 hours and 4 hours respectively. This led to a further improvement in $\mathrm{Accuracy}_{100}$: 209% relative to baseline and 20% relative to Variant 1. This improvement indicates that a time window of 4 hours provides a more optimal context for all tweets that pertain to a topic or news story. Content-based geo-location detection has many applications in the sector of advertising and user modeling. Our application of content-based geo-location detection is to segregate tweets pertaining to specific hashtags or trending news stories and localize them on the global map. Such geo-location leads to detection of news or events that are trending in a specific city, country or region.
It should now be appreciated a cohesive document building system according to the disclosure includes: a user interface device having access to a communication system having a plurality of short media message units available to collect the short media message units; memory to cache the short media message units in the system; a collator to collect a plurality of related short media message units among users over a predetermined period of time; and a user interface to output to a single file the plurality of related short media message units when the file reaches a predetermined size to construct a cohesive document or to output to a single file a plurality of related short media message units after a maximum predetermined period of time to construct a cohesive document.
The document building system may include one or more of the following features, independently or in combination with another feature: a caching mechanism that supports harvesting content from online social networking and microblogging services; generating documents from SMMUs based on specific attributes, such as users, location, or specific words; creating documents by collating SMMUs from multiple languages; incorporating the temporal aspects (i.e., relating to tense or the linguistic expression of time) of the message in document creation; a multi-phased windowing approach to handle processing based on the attribute-SMMU distribution; an online algorithm that runs on streaming data; and temporal windows and document sizes which can be tuned to control the quality of NLP and IE.
Elements of different embodiments described herein may be combined to form other embodiments not specifically set forth above. Other embodiments not specifically described herein are also within the scope of the following claims.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 62/032,189 filed Aug. 1, 2014, which application is incorporated herein by reference in its entirety.
This invention was made with Government support under Contract No. N41756-11-C-3878 awarded by the Department of the Navy. The Government has certain rights in this invention.