Social media networks such as Facebook®, Twitter®, and Google Plus® have experienced exponential growth in recent years as web-based communication platforms. Hundreds of millions of people use various forms of social media networks every day to communicate and stay connected with each other.
The resulting activities/content items from the users on the social media networks, such as tweets posted on Twitter®, have grown to a phenomenal volume and can be collected for various kinds of measurement, presentation, and analysis. Specifically, these user activity data can be retrieved from the social data sources of the social networks through their respective publicly available Application Programming Interfaces (APIs), indexed, processed, and stored locally for further analysis.
These stream data from the social networks, collected in real time along with those collected and stored over time, provide the basis for a variety of measurements, presentation, and analysis. Some of the metrics for measurement and analysis include but are not limited to:
Unlike traditional web traffic sources, social media content items such as citations/Tweets/posts are typically opinions expressed by sources/subjects/authors about certain objects on the social network. Due to the subjective nature of the social media content items, it is important to customize the search results or analytics over the social network by taking into account the sentiments expressed by the content items and/or the influence of the subjects who authored them during filtering and computing of the search results or analytics.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent upon a reading of the specification and a study of the drawings.
The approach is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” or “some” embodiment(s) in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
A new approach is proposed that contemplates systems and methods to filter and/or rank a plurality of content items retrieved from a social network based on the sentiments expressed by the authors of the content items and/or the influence level of the authors over the social network. First, content items matching a set of keywords submitted by a user are retrieved from the social network. The sentiments and/or the influence levels of the authors of these content items are then identified in real time. Once identified, the sentiments and/or influence levels of the authors are used to filter and/or rank the retrieved content items to generate a search result that matches with the sentiment and/or influence level specified by the user. Finally, the customized search result based on the sentiments and/or the influence levels of the authors is presented to the user.
As referred to hereinafter, a social media network or social network, can be any publicly accessible web-based platform or community that enables its users/members to post, share, communicate, and interact with each other. For non-limiting examples, such social media network can be but is not limited to, Facebook®, Google+®, Twitter®, LinkedIn®, blogs, forums, or any other web-based communities.
As referred to hereinafter, a user's activities/content items on a social media network include but are not limited to, citations, Tweets, replies and/or re-tweets to the tweets, posts, comments to other users' posts, opinions (e.g., Likes), feeds, connections (e.g., add other user as friend), references, links to other websites or applications, or any other activities on the social network. Such social content items are alternatively referred to hereinafter as citations, Tweets, or posts. In contrast to a typical web content, whose creation time may not always be clearly associated with the content, one unique characteristic of a content item on the social network is that there is an explicit time stamp associated with the content, making it possible to establish a pattern of the user's activities over time on the social network.
In some embodiments, subjects 102 representing any entities or sources that make citations may correspond to one or more of the following:
In some embodiments, some subjects/authors 102 who create the citations 104 can be related to each other, for a non-limiting example, via an influence network or community and influence scores can be assigned to the subjects 102 based on their authorities in the influence network.
In some embodiments, objects 106 cited by the citations 104 may correspond to one or more of the following: Internet web sites, blogs, videos, books, films, music, image, video, documents, data files, objects for sale, objects that are reviewed or recommended or cited, subjects/authors, natural or legal persons, citations, or any entities that are or may be associated with a Uniform Resource Identifier (URI), or any form of product or service or information of any means or form for which a representation has been made.
In some embodiments, the links or edges 104 of the citation graph/diagram 100 represent different forms of association between the subject nodes 102 and the object nodes 106, such as citations 104 of objects 106 by subjects 102. For non-limiting examples, citations 104 can be created by authors citing targets at some point in time and can be one of a link, description, keyword, or phrase by a source/subject 102 pointing to a target (subject 102 or object 106). Here, citations may include one or more of the expression of opinions on objects, expressions of authors in the form of Tweets, blog posts, reviews of objects on Internet web sites, Wikipedia entries, postings to social media such as Twitter® or Jaiku®, postings to websites, postings in the form of reviews, recommendations, or any other form of citation made to mailing lists, newsgroups, discussion forums, comments to websites, or any other form of Internet publication.
In some embodiments, citations 104 can be made by one subject 102 regarding an object 106, such as a recommendation of a website or a restaurant review, and can be treated as representing an expression of opinion or description. In some embodiments, citations 104 can be made by one subject 102 regarding another subject 102, such as a recommendation of one author by another, and can be treated as representing an expression of trustworthiness. In some embodiments, citations 104 can be made by a certain object 106 regarding other objects, wherein the object 106 is also a subject.
In some embodiments, citation 104 can be described in the format of (subject, citation description, object, timestamp, type). Citations 104 can be categorized into various types based on the characteristics of subjects/authors 102, objects/targets 106 and citations 104 themselves. Citations 104 can also reference other citations. The reference relationship among citations is one of the data sources for discovering influence network.
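For illustration only, the five-field citation format described above could be represented as a small record type. The following is a minimal sketch, not the disclosed implementation; the class name, field names, sample values, and the `references` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    subject: str         # author/source making the citation
    description: str     # citation text, e.g. the body of a Tweet
    obj: str             # cited target: a URI, another subject, or another citation
    timestamp: datetime  # explicit creation time of the citation
    type: str            # hypothetical category label, e.g. "tweet" or "retweet"


def references(citations):
    """Map each cited target to the list of subjects that cited it."""
    edges = {}
    for c in citations:
        edges.setdefault(c.obj, []).append(c.subject)
    return edges


sample = [
    Citation("alice", "Great review", "http://example.com/a",
             datetime(2012, 1, 1, tzinfo=timezone.utc), "tweet"),
    Citation("bob", "RT @alice", "alice",
             datetime(2012, 1, 2, tzinfo=timezone.utc), "retweet"),
]
edges = references(sample)
```

A mapping of this kind is one possible starting point for the reference relationships from which an influence network may be discovered.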
In the example of
In the example of
In the example of
In the example of
In some embodiments, social media content collection engine 102 utilizes explicit first order literal matching of keywords over the social networks. Specifically, social media content collection engine 102 may search for keywords in a citation/Tweet's ‘text’ field. If a Tweet is a native re-tweet, then social media content collection engine 102 searches in the citation/Tweet's ‘retweeted_status->text’ field. Here, keyword matches of the social content are case-insensitive. For a non-limiting example, ‘gadaffi’ will match ‘gadaffi’ or ‘Gadaffi’ or ‘GADAFFI’ but will not match ‘kadaffi’ or ‘qadhafi’ or ‘#gadaffi’, and ‘#gadafficrimes’ will match ‘#gadafficrimes’ or ‘#Gadafficrimes’ but will not match ‘gadafficrimes.’
In some embodiments, social media content collection engine 102 may remove punctuation and common words determined as extraneous when matching the keywords. Here, the words to be ignored when matching keywords include but are not limited to: the, to, and, on, in, of, for, i, you, at, with, it, by, this, your, from, that, my, an, what, as. For a non-limiting example, if ‘airplane’ or ‘airplane!’ appeared in the Tweet's text as a standalone word or at the end of a tweet, then it would return as a match for ‘airplane.’
In some embodiments, social media content collection engine 102 enables matching based on commonly used citation conventions on social networks. For a non-limiting example, social media content collection engine 102 would enable the user to match on citations/tweets about a stock by using the common Twitter® convention for referencing a stock by inserting a dollar sign in front of the ticker symbol, e.g., Tweets about Apple can be matched using the keyword ‘$aapl’ which will match all tweets that contain the text ‘$aapl’ or ‘$AAPL.’
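The matching behavior described above (case-insensitive literal matching, significant ‘#’ and ‘$’ prefixes, and tolerance of surrounding punctuation) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the engine's actual matcher: the function names are hypothetical, the Tweet is assumed to be a dictionary carrying the ‘text’ and ‘retweeted_status’ fields named above, and treating ‘@’ as a significant prefix is an added assumption.

```python
import string

# Assumption: '#', '$' and '@' carry meaning and are never stripped from a token.
_STRIP = "".join(ch for ch in string.punctuation if ch not in "#$@")


def searchable_text(tweet):
    """Use the retweeted status text for native re-tweets, else the Tweet's own text."""
    retweeted = tweet.get("retweeted_status")
    return retweeted["text"] if retweeted else tweet["text"]


def tokens(text):
    """Lower-cased whitespace tokens with surrounding punctuation removed."""
    found = set()
    for raw in text.split():
        token = raw.strip(_STRIP).lower()
        if token:
            found.add(token)
    return found


def matches(keyword, text):
    """Literal, case-insensitive match of one keyword against whole tokens."""
    return keyword.lower() in tokens(text)


print(matches("gadaffi", "GADAFFI speaks"))        # True
print(matches("gadaffi", "#gadaffi speaks"))       # False: the hashtag is a different token
print(matches("$aapl", "buying $AAPL today"))      # True
print(matches("airplane", "look, an airplane!"))   # True
```

Comparing whole tokens instead of substrings is what keeps ‘gadaffi’ from matching ‘#gadaffi’ in this sketch.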
In some embodiments, the user interface of the social media content collection engine 102 further provides a plurality of search options via a search menu (shown as the gear image to the left of keywords in the example of
In some embodiments, the social media content collection engine 102 provides at least two options for the displaying keywords in the search result:
In some embodiments, the social media content collection engine 102 enables the user to save user-defined sets of keywords and report parameters that define a search as a saved topic/search. Saved topics can be used as logical groupings of terms/keywords commonly associated with a particular country or event (e.g., #egypt, #mubarak, #muslimbrotherhood, #jan25, @egyptocracy). Such saved topic or search allows users to save keywords and parameters so they can be used again as shown in the example depicted in
In some embodiments, social media content collection engine 102 provides a saved search dropdown menu, which allows the user to easily find and retrieve previously saved topics. If there are a lot of saved searches, the user can enter parts of the saved search name in a search box to find the specified search topic as shown in the example depicted in
In some embodiments, social media content collection engine 102 enables a user to download a saved topic/search and the corresponding search results or analytics from the topic to a specific file/data format (e.g., CSV format) by clicking the Export button on the user interface. In addition, social media content collection engine 102 may also provide an Application Programming Interface (API) URL for users who want to access the Secure Reporting API to programmatically retrieve data. All citations/Tweets from the search query can be downloaded in batch mode, including “significant posts”, which are tweets that have links or tweets that have been retweeted.
In some embodiments, social media content collection engine 102 enables a user to copy a topic by clicking the “Save As . . . ” button and choosing “Create a new Topic” to save a copy of the existing topic under a new name. Social media content collection engine 102 further enables a user to share a topic with another user by clicking the gear icon next to the list of keywords (as shown in
In the example of
In some embodiments, social media content collection engine 102 enables the user to restrict the search results or analytics based on dates/timestamps of the content items/citations. For a non-limiting example, the default selection of time range can be last 24 hours, which can be changed to any of the following: last hour, last 24 hours, last 7 days, last 30 days, last 90 days, last 180 days, or a specific date range as specified by the user as shown by the example depicted in
In some embodiments, social media content collection engine 102 filters the search results or analytics based on the originating locations of the content items/citations/posts/tweets. Here, the filtering location can be specified at the country, state, county, or city level. Additionally, the filtering location can be specified by latitude and longitude coordinates as shown by the example depicted in
In some embodiments, social media content collection engine 102 adopts various language detection and processing techniques to filter and rank the search results or analytics by language, wherein the language detection techniques include but are not limited to tokenization, domain-specific handling, stemming, and lemmatization. Here, the tokenization of the search results or analytics is language dependent. Specifically, whitespace and punctuation are used as delimiters for European languages, Japanese is tokenized using grammatical hints to guess word boundaries, and other Asian languages are tokenized using overlapping n-grams. As referred to hereinafter, an n-gram is a contiguous sequence of n items/words from a given sequence of text or speech, which can be used by a probabilistic model for predicting the next item in such a sequence.
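As a rough illustration of two of the tokenization strategies named above, the sketch below splits delimited text on whitespace and punctuation and produces overlapping n-grams for text without word delimiters. The function names are hypothetical, and the use of character-level n-grams with n = 2 is an assumption; the grammatical-hint tokenization for Japanese is not shown.

```python
import re


def tokenize_delimited(text):
    """Split on runs of whitespace/punctuation (languages written with word delimiters)."""
    return [t for t in re.split(r"[\W_]+", text) if t]


def tokenize_ngrams(text, n=2):
    """Overlapping character n-grams for text written without word delimiters."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


print(tokenize_delimited("Hello, world!"))  # ['Hello', 'world']
print(tokenize_ngrams("中文分词"))            # ['中文', '文分', '分词']
```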
In some embodiments, social media content collection engine 102 searches and returns search results or analytics for social media content in any language regardless of character set. Since social media content collection engine 102 matches the content items based on literal keywords, the user can enter any word from a foreign language and social media content collection engine 102 will return exact matches for the words entered. In addition, social media content collection engine 102 uses various methods of language morphology (e.g., tokenization) to isolate search results or analytics to just the language specified for a specific set of languages, which include but are not limited to English, Japanese, Korean, Chinese, Arabic, Farsi and Russian as shown by the example depicted in
In some embodiments, social media content collection engine 102 uses character set processing as a first pass through character sets (e.g., Chinese, Japanese, Korean), while statistical models can be used to refine other languages (English, French, German, Turkish, Spanish, Portuguese, Russian), and n-grams can be used for Arabic and Farsi. In some embodiments, domain-specific handling is utilized to identify and handle short strings and domain-specific features such as #hashtags and RT @replies for search results or analytics from social networks such as Twitter®. Stemming and lemmatization features are available for the English and Russian languages. As referred to herein, a hashtag is a word or a phrase prefixed with the symbol # as a form of metadata tag for short messages or micro blogs on a social network.
In some embodiments, social media content collection engine 102 utilizes a user's historical comments/posts/citations to improve accuracy for language detection for search results or analytics. If the user is consistently identified as a user of one specific language upon examining his/her historical comments, future comments from that user will be tagged with that specific language, which largely eliminates false negatives for such user.
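The history-based tagging described above might be approximated as follows. This is only a sketch: the function names and the thresholds for what counts as "consistently identified" (`min_posts`, `min_share`) are hypothetical, and the per-post language detector that produces the history is assumed to exist elsewhere.

```python
from collections import Counter


def dominant_language(history, min_posts=5, min_share=0.9):
    """Return the user's consistent language, or None if the history is too short or mixed."""
    if len(history) < min_posts:
        return None
    language, count = Counter(history).most_common(1)[0]
    return language if count / len(history) >= min_share else None


def tag_language(detected, history):
    """Prefer the language established by the user's history over a per-post guess."""
    return dominant_language(history) or detected


print(tag_language("pt", ["es"] * 10))         # es
print(tag_language("pt", ["es", "en", "pt"]))  # pt
```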
In some embodiments, social media content collection engine 102 detects and identifies the sentiments expressed by the authors of the content items with respect to/toward a specific event or topic via a number of sentiment text scoring schemes. Here, the sentiment of each user can be characterized as very positive, positive, flat, negative, or very negative. Specifically, social media content search engine 10 identifies the sentiment expressed by the author of a content item by analyzing the posted English text of the content item. In some embodiments, social media content collection engine 102 uses a curated sentiment dictionary of sentiment-weighted words and phrases to fine-tune its sentiment detection for the content items retrieved from the specific social network, accounting for characteristics such as Twitter's® unique 140-character limit and “twitterisms”. By combining this dictionary with some English grammar rules, social media content collection engine 102 is able to fine-tune its results to relatively high accuracy, typically reaching a 70% agreement rate with manually reviewed content. Such measurement of the sentiments of the users provides real-time gauges of their views/opinions expressed over the social network.
In some embodiments, social media content collection engine 102 is further able to identify and ignore entities in the content items with misleading names (e.g., Angry Birds) for sentiment detection by applying stemming and lemmatization to expand the scope of the sentiment dictionary. Here, the curated dictionary of sentiment-weighted words and phrases can grow organically based on real-world data as more and more search results or analytics are generated and as grammar rules found to be significant in helping to determine sentiment are included. For a non-limiting example, the use of the word “not” before a word is used as a negativity rule. In addition, since stemming can introduce errors in categorization of sentiment (for example, the root by itself could have negative sentiment but root+suffix could have positive sentiment), such stemming errors are handled on a case-by-case basis by adding the improper sentiment categorizations due to stemming as exceptions to the dictionary.
In some embodiments, social media content collection engine 102 takes into consideration the ways and the nuances of how people express themselves over social media networks in general, and specifically within Twitter®. In the non-limiting example of Twitter®, there are significant differences in how people express themselves within the 140-character constraint of a tweet that traditional sentiment measurement techniques do not handle well. Based on the analysis and testing of the mass amount of data that has been collected in real time and stored over time, social media content collection engine 102 is able to identify a number of “twitterisms” in the tweets, i.e., specific characteristics of sentiment expressions in the collected content items that are not only indicative of how people feel about certain events or things, but are also unique to how people express themselves on a social network such as Twitter® using tweets. These identified characteristics of sentiment expressions are utilized by the number of sentiment text scoring schemes for detecting the sentiments expressed by the users on the social network.
In some embodiments, social media content collection engine 102 generates the search result by filtering the content items retrieved based on the sentiments expressed by their authors. Specifically, social media content collection engine 102 enables the user to determine a specific sentiment expressed by the authors as shown by the example depicted in
In some embodiments, social media content collection engine 102 filters search results or analytics to those authored by users determined to be influential only as specified by the user and shown by the example depicted in
In some embodiments, social media content collection engine 102 calculates the influence level of an author transitively, i.e., the author's influence level is higher if he/she receives attention from other people with influence than if the author receives attention from people without influence. For a non-limiting example, politicians as identified by their social media source IDs (e.g., “barackobama”) will frequently have high influence because they are mentioned by many influential users, including news organizations. Likewise, many celebrities (e.g., “justinbieber”) have high influence since they are frequently mentioned by other influential users. In some embodiments, social media content collection engine 102 utilizes a decay factor, so that an account of a user which is inactive (and which therefore no other user is mentioning) will fall to the bottom of the influence ranking, as will an account from spammers or celebrities who do not post things that other influential users find interesting.
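A dictionary-based scorer with a “not” negation rule and a list of ignored entity names can be sketched as below. The lexicon weights, the ignored-entity list, and the thresholds that map a score to the five sentiment labels are all hypothetical values chosen for illustration; the sketch also uses a plain ignore list rather than stemming or lemmatization.

```python
# Hypothetical sentiment-weighted dictionary; real weights would be curated.
LEXICON = {"love": 2.0, "good": 1.0, "bad": -1.0, "angry": -1.0, "hate": -2.0}
# Hypothetical entity names whose words should not contribute to sentiment.
IGNORED_ENTITIES = {("angry", "birds")}


def score(text):
    """Sum dictionary weights, flipping the sign of a word preceded by 'not'."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = 0.0
    i = 0
    while i < len(words):
        if tuple(words[i:i + 2]) in IGNORED_ENTITIES:
            i += 2  # skip both words of the entity name
            continue
        weight = LEXICON.get(words[i], 0.0)
        if i > 0 and words[i - 1] == "not":
            weight = -weight  # negativity rule
        total += weight
        i += 1
    return total


def label(total):
    """Map a numeric score to one of the five sentiment categories."""
    if total >= 2:
        return "very positive"
    if total > 0:
        return "positive"
    if total == 0:
        return "flat"
    if total > -2:
        return "negative"
    return "very negative"


print(label(score("Angry Birds is good")))  # positive
print(label(score("This is not good")))     # negative
```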
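One way to picture transitive influence with decay is a PageRank-style power iteration over a graph of mentions, in which older mentions pass on less attention. This technique is substituted here purely for illustration and is not stated to be the engine's actual calculation; the function name, the half-life, the damping value, and the `(source, target, age_days)` input format are assumptions.

```python
from collections import defaultdict


def influence(mentions, half_life_days=30.0, damping=0.85, iterations=50):
    """mentions: iterable of (source, target, age_days) attention events."""
    decayed = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    users = set()
    for src, dst, age_days in mentions:
        users.update((src, dst))
        # Older mentions contribute exponentially less attention.
        decayed[src][dst] += 0.5 ** (age_days / half_life_days)
        counts[src] += 1
    n = len(users)
    score = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) / n for u in users}
        for src in users:
            passed = 0.0
            for dst, weight in decayed.get(src, {}).items():
                share = weight / counts[src]
                nxt[dst] += damping * score[src] * share
                passed += share
            # Attention lost to decay (or never given) is spread evenly over all users.
            leftover = damping * score[src] * (1.0 - passed) / n
            for u in users:
                nxt[u] += leftover
        score = nxt
    return score


scores = influence([
    ("a", "star", 0), ("b", "star", 0),     # recent mentions of "star"
    ("a", "old", 365), ("b", "old", 365),   # year-old mentions of "old"
    ("star", "c", 0),                       # "c" is mentioned by an influential user
    ("d", "e", 0),                          # "e" is mentioned by a non-influential user
])
print(sorted(scores, key=scores.get, reverse=True)[:2])
```

In this toy input, "c" outranks "e" because its single mention comes from a user who is itself widely mentioned, and "old" falls behind "star" because its mentions have decayed.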
In some embodiments, social media content collection engine 102 adopts iterative influence calculation to handle the apparent circularity of the influence level (i.e., that an individual gains influence by receiving attention from other influential individuals) by measuring centrality in an attention diagram/graph. As shown in the example depicted in
In the example of
In the example of
In some embodiments, social media content analysis engine 104 provides an activity history view that displays the volume of mentions for a set of keywords over a period of time. Social media content analysis engine 104 provides the user with the ability to select the start and end dates for displaying mention metrics within the view/report. It also enables the user to specify the time windows to display, including by month, week, hour, and minute. Such a view/report is useful for examining historical events and identifying patterns. For non-limiting examples, such a report can be used to:
In some embodiments, social media content analysis engine 104 makes the activity history data available for presentation in real time on a rolling basis. Specifically, minute metrics are available for the last 6-8 hours on a rolling basis, hour metrics are available for last 30 days on a rolling basis, and daily metrics are available at least 6 months back.
In some embodiments, social media content analysis engine 104 allows the user to selectively enable the display of a set of keywords and the associated lines representing the content items containing the keywords on the figure by clicking on the keywords below the figure as shown by the example depicted in
In some embodiments, social media content analysis engine 104 supports Share of Voice (SOV) analysis, which measures the relative change in mentions of a set of keywords in the content items collected from the social network over the period of time as shown by the example depicted in
In some embodiments, social media content analysis engine 104 enables the user to select a time slice window during the period of time for presentation and analysis of the social media content items collected during the time slice window, wherein the time slice window can be by minute, hour, day, week, or month. Social media content analysis engine 104 enables the user to zoom in and out on a specific region of the activity diagram for the time slice window by clicking a region and then holding down the click until the region to zoom into has been selected (click & drag to select). This allows the user to quickly and easily change the range to see the time frame that is relevant to his/her analysis as shown by the example depicted in
In some embodiments, social media content analysis engine 104 enables the user to select and view the Top Posts with a specific time range selected. If a specific point on the activity diagram is selected, then the Top Posts are from just that date and keyword selected. For a non-limiting example, if the top peak of the dark green line was selected, the top posts for #NBA at 6 PM will be shown by the example depicted in
In some embodiments, social media content analysis engine 104 presents the top trending results for posts, links, photos, and videos sorted by one or more of: relevance, date, momentum, velocity, and peak of the keywords/terms during the time frame selected. As referred to hereinafter:
In some embodiments, the social media content analysis engine 104 identifies the most significant posts which were mentioned within the time range selected, with variations in the metrics presented that are important to note. In addition, for all the time ranges from x-date to present (e.g., past 24 hours, past 7 days), the mention and influential mentions are calculated based on the number of all-time mentions. If a specific time slice is selected (e.g., Jan. 1, 2012 to Jan. 31, 2012) then the mention and influence metrics are also scoped to all time and not to just the timeframe specified.
In some embodiments, social media content analysis engine 104 presents a list of the most recent trending metrics for the specified saved search group or for the keywords/terms entered. Each term will include the following metrics: mentions, percent influence, momentum, velocity, peak period as shown by the example depicted in
In some embodiments, social media content analysis engine 104 presents the trending top posts for the keywords and parameters specified, where the view displays the actual post, along with the author of the post, a timestamp of when the post was originally communicated, and the corresponding mention, influential (number of influential mentions), momentum, velocity, and peak metrics. In addition, the profile information of the user on the social network (e.g., Twitter®) is displayed (name, link, bio, latest post, number of posts, number they are following, and number of followers) by highlighting the picture associated with the user's login name on the social network. The user is also enabled to click on the arrows on the right side of the spark line diagram for each post from the view depicted in
In some embodiments, social media content analysis engine 104 presents the trending links, where the view displays the most popular links matching any set of keywords, including domains. By specifying only domains as keywords (e.g., “nytimes.com”), the trending links view returns the most popular links on a specific domain/website (e.g., washingtonpost.com, espn.com) or across the multiple domains entered. For each domain specified, social media content analysis engine 104 will display one or more of the following metrics: mentions, percent influence, momentum, velocity, peak period.
In some embodiments, social media content analysis engine 104 enables the user to input multiple domains for domain analysis in order to quickly identify what links to these domains have the highest mention volume, momentum, velocity or are peaking most recently via peak period metrics as shown in the example depicted in
In some embodiments, social media content analysis engine 104 presents the top trending media (photos) related to the keywords and parameters entered. The results presented can be sorted by one or more of relevance, date, momentum, velocity, and peak as shown by the example depicted in
In some embodiments, social media content analysis engine 104 presents the top trending videos related to the keywords and parameters entered. The results presented can be sorted by one or more of relevance, date, momentum, velocity, and peak. Displayed along with the top video, which is shared on the social network (e.g., Twitter®) from a variety of video sharing sites are the number of mentions containing the video link, number of influential people that posted it, and the momentum, velocity, and peak score. In some embodiments, a spark line is displayed to quickly determine what video is taking off (i.e., trending) or stale. The view of trending videos is very useful for identifying videos associated with events as they unfold. Such view can be used to find videos from individuals on the ground before media outlets pick them up. Users can also isolate what videos are trending the most within a country by only selecting country and not specifying anything else.
In some embodiments, social media content analysis engine 104 presents a cumulative exposure view of the search results or analytics, which returns the gross cumulative exposure for the posts/content items containing the set of keywords over time. This analysis is useful to measure the gross exposure over time from posts matching a target set of keywords. For non-limiting examples, such cumulative exposure view can be used to:
In some embodiments, social media content analysis engine 104 calculates the cumulative exposure by summing the follower counts of all the authors of the posts that match the keywords being queried. This calculation returns overall gross exposure (vs. unduplicated net exposure) so multiple posts from the same author or authors with common followers may result in audience duplication as shown by the example depicted in
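The gross cumulative exposure calculation described above amounts to a running sum of author follower counts over the matching posts. A minimal sketch follows; the function name and the post dictionary fields (`time`, `text`, `followers`) are hypothetical, and keyword matching is reduced to a case-insensitive whole-token comparison.

```python
from itertools import accumulate


def cumulative_exposure(posts, keywords):
    """Return (time, running gross exposure) pairs for posts matching any keyword."""
    wanted = {k.lower() for k in keywords}
    matching = sorted(
        (p for p in posts if wanted & set(p["text"].lower().split())),
        key=lambda p: p["time"],
    )
    # Gross exposure: follower counts are summed without de-duplicating audiences.
    totals = accumulate(p["followers"] for p in matching)
    return [(p["time"], total) for p, total in zip(matching, totals)]


posts = [
    {"time": 1, "text": "go #nba", "followers": 100},
    {"time": 2, "text": "lunch", "followers": 50},
    {"time": 3, "text": "#NBA finals", "followers": 100},  # same-size audience counted again
]
print(cumulative_exposure(posts, ["#nba"]))  # [(1, 100), (3, 200)]
```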
In some embodiments, social media content analysis engine 104 displays top significant posts in the cumulative exposure view for the time range selected in the search parameters. If a specific point on the exposure view is selected then the top posts are from just that date and keyword selected. For a non-limiting example, in the example depicted in
In the example of
For non-limiting examples, the related terms discovered by social media content analysis engine 104 enables the user to:
In some embodiments, social media content analysis engine 104 pre-computes and discovers the related terms by examining a historical archive of recent tweets/posts retrieved from the social network for top trending terms co-occurring with the submitted keywords before searching over the social network. The discovered related terms can then be used together with the keyword(s) submitted by the user to search for the relevant content items in the social media content stream retrieved continuously in real time from the social media network via a social media source fire hose. Alternatively, social media content analysis engine 104 may dynamically discover the related terms by examining the social media content stream in real time as they are being retrieved and apply the related terms discovered to search for relevant social media content items together with the user-submitted keyword(s).
In some embodiments, social media content analysis engine 104 discovers the related terms via a significant post index, which includes citations/posts that contain a link or a re-post to another content item. Social media content analysis engine 104 then applies a weighted frequency analysis to the significant posts containing the submitted keywords and the related terms to discover the related terms within the date range selected.
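A weighted frequency analysis over significant posts could, under simple assumptions, look like the sketch below: each post that contains the submitted keyword contributes its weight to every other term it contains. The function name, the `(text, weight)` input format, and the choice of weight (for instance a re-post count) are hypothetical; the source does not specify the weighting.

```python
from collections import Counter


def related_terms(posts, keyword, top=3, stopwords=frozenset()):
    """posts: iterable of (text, weight); returns the top co-occurring terms."""
    target = keyword.lower()
    counts = Counter()
    for text, weight in posts:
        terms = set(text.lower().split())
        if target in terms:
            for term in terms - {target} - stopwords:
                counts[term] += weight
    return [term for term, _ in counts.most_common(top)]


significant_posts = [
    ("#egypt #jan25 protest", 3),
    ("#egypt #jan25", 2),
    ("#egypt weather", 1),
    ("#nba finals", 10),
]
print(related_terms(significant_posts, "#egypt", top=2))  # ['#jan25', 'protest']
```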
In some embodiments, social media content analysis engine 104 discovers and/or sorts the list of related terms based on a combination of one or more of:
In some embodiments, social media content analysis engine 104 also discovers and/or sorts the related terms based on one or more of: momentum, velocity, peak and influential metrics in addition to correlation scores and mentions (e.g., total number of mentions/retweets for this post, link, image or video over its lifetime) for each of the related terms. The following metrics are based on the timeframe set by the user in the search parameters and are calculated off of a census-based post index for all posts: momentum, velocity, peak, and influence, as described above.
In some embodiments, once the terms related to the set of keywords have been discovered, social media content analysis engine 104 utilizes both to search the social network for the content items (citations, tweets, comments, posts, etc.) containing all or most of the keywords plus the related terms. For a non-limiting example, the top posts found by search via the target/submitted search term and the discovered related term as shown by the example depicted in
In some embodiments, social media content analysis engine 104 supports cross network identification to identify an author and to view the content produced by the same user across different social networks, such as between Twitter® and Blogs, or a review site and a chat site analysts. Specifically, social media content analysis engine 104 compares the user profile photos and/or content of the posts from different sources of social media content and analyzes if the author is the same on those sources. If the same author is identified, social media content analysis engine 104 may assign a common cross network identification to the user. Social media content analysis engine 104 may further present the user's posts over the different social media sources/social networks side-by-side on the same display in such way to enable a viewer to easily toggle between the different social networks to compare the posts by the same user.
In some embodiments, social media content analysis engine 104 supports media identification to distinguish individual authors of social media content items from commercial and news sources. By filtering out commercial and news sources, social media content analysis engine 104 is able to generate reports focused on individuals “on the ground”.
In some embodiments, social media content analysis engine 104 uses a combination of a whitelist and a trained classifier to classify users as media or non-media types. For a non-limiting example, the whitelist can initially be derived from publicly available lists of social media sources and their respective verified accounts, and then grown organically on an ongoing basis.
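For a non-limiting illustration, the combination of a whitelist and a trained classifier may be sketched as a two-stage check. The function names, the 0.5 threshold, and the keyword-based scorer (which merely stands in for a trained classifier) are hypothetical and are not taken from this disclosure.

```python
from typing import Callable, Set

def classify_account(handle: str, profile_text: str, whitelist: Set[str],
                     media_score: Callable[[str], float],
                     threshold: float = 0.5) -> str:
    """Return "media" or "non-media" for one account."""
    # Stage 1: whitelist lookup on the normalized handle.
    if handle.lower() in whitelist:
        return "media"
    # Stage 2: fall back to a classifier score (threshold is an assumption).
    return "media" if media_score(profile_text) >= threshold else "non-media"

def keyword_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier: fraction of cue words present."""
    cues = ("news", "official", "breaking")
    lowered = text.lower()
    return sum(cue in lowered for cue in cues) / len(cues)

WHITELIST = {"examplenews"}  # hypothetical whitelist entry
```

Under these assumptions, an account on the whitelist is labeled media without consulting the classifier, and every other account is labeled by its score.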
In some embodiments, social media content analysis engine 104 may review the user's profile and historical post information to intelligently identify media/news sources the user belongs to. Some of the attributes and features of the user's information being reviewed by social media content analysis engine 104 include but are not limited to:
In some embodiments, social media content analysis engine 104 supports geographic analysis, which returns/presents a view/report on at least some of the social media content items (social mentions) with a set of known geographic locations over a period of time as shown by the example depicted in
In some embodiments, social media content analysis engine 104 shades the world map using a polynomial function that, by default, colors the map based on the raw volume of mentions per geographic location. If the Activity table is re-sorted by “% Activity”, then the world map is refreshed and shaded based on the relative percentage activity for each country. When a shaded location (one of those selected as part of the report parameters) is rolled over, the volume metrics and percent activity are displayed. The table below the map allows the user to see mention and percent metrics for each geographic area. Here, the “% Activity” metric is defined as the number of mentions matching the entered keywords divided by the total overall mentions for the geographic area. In some embodiments, social media content analysis engine 104 may calculate the “% Activity” metric by dividing the total number of posts for the keywords entered by the total number of all posts for that country, which amounts to a share of voice percentage. For a non-limiting example, a 3.1% activity means that 3.1% of tweets found for that country contain the keywords entered during the timeframe specified. In some embodiments, social media content analysis engine 104 enables the user to choose whether to specify latitude/longitude when displaying metrics; if latitude/longitude is not specified, the metrics are calculated based upon the system's inferred geo-location.
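For a non-limiting illustration, the “% Activity” (share of voice) calculation may be sketched as follows. The function name, the handling of areas with no posts, and the per-country counts are hypothetical and are used only for illustration.

```python
def percent_activity(keyword_mentions: int, total_mentions: int) -> float:
    """Posts matching the keywords divided by all posts for the area, as a percentage."""
    if total_mentions <= 0:
        return 0.0  # assumption: an area with no posts reports 0% activity
    return 100.0 * keyword_mentions / total_mentions

# Hypothetical counts per country: (posts matching keywords, all posts)
counts = {"US": (3100, 100000), "BR": (450, 30000)}
activity = {country: round(percent_activity(k, t), 1)
            for country, (k, t) in counts.items()}
```

With these hypothetical counts, the entry for "US" corresponds to the 3.1% activity example given above.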
In the example of
In some embodiments, social media geo tagging engine 106 may identify geo-location of a content item from the profile information of the author/user of the content item, wherein the user's profile contains the user's self-described geographic location. The data point in the user's profile identifies where the user may be (not where they are communicating from) with low confidence (because the information is self-described by the user him/herself) but with relatively high coverage (50-70%). Social media geo tagging engine 106 determines that the location identified in the user's profile is “valid” if the user with that location is generally telling the truth (e.g. people who claim to live in Antarctica are generally not telling the truth).
In some embodiments, social media geo tagging engine 106 may utilize one or more of the following for geo-location identification in addition to the use of lat/long coordinates and the user profile:
In some embodiments, social media geo tagging engine 106 uses posts that carry high-confidence geo-location information as anchors to identify, with a relatively high level of confidence, the geographic locations of other content items whose geographic locations (e.g., geo-coordinates) are not available, thereby significantly increasing the geographic location coverage of the social media content items. Specifically, an archive of historical content items/posts with high-confidence geographic coordinate data can be used as a training set to train a customized probabilistic location classifier. Once trained, the location classifier can then be used to predict, with high accuracy, the actual geographic locations of the content items without geo-coordinates.
During the training process, social media geo tagging engine 106 reverse geocodes the latitude/longitude coordinates of each post in the training set using an internal lookup table. For geo-tagged posts in the United States, social media geo tagging engine 106 assigns the location based on the lat/long point being found within a defined polygon, associating each content item in the training set with the 4-tuple <country, state, county, city> (or <country, admin1, admin2, city> outside of the US). In some embodiments, social media geo tagging engine 106 uses the U.S. Census Bureau TIGER (Topologically Integrated Geographic Encoding and Referencing) shape files as the source of U.S. polygons. For non-U.S. cities, social media geo tagging engine 106 assigns city names if the coordinates fall within a 10-mile radius around the city center, or uses non-U.S. mapping data to improve foreign city assignment. When coordinates fall within the radii of multiple cities due to overlap, social media geo tagging engine 106 may geo-tag the post to one of the cities.
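For a non-limiting illustration, the 10-mile radius test for non-U.S. cities may be sketched with a great-circle (haversine) distance. The function names, the city-center coordinates, and the rule of choosing the nearest center when radii overlap are assumptions made for illustration; the disclosure only states that the post may be geo-tagged to one of the cities.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def assign_city(lat, lon, city_centers, radius_miles=10.0):
    """Return the city whose center is within the radius, or None.

    Assumption: when several radii overlap, the nearest center wins.
    """
    best = None
    for name, (clat, clon) in city_centers.items():
        d = haversine_miles(lat, lon, clat, clon)
        if d <= radius_miles and (best is None or d < best[1]):
            best = (name, d)
    return best[0] if best else None

# Hypothetical lookup table of city centers (approximate coordinates)
CITY_CENTERS = {"Paris": (48.8566, 2.3522), "Versailles": (48.8049, 2.1204)}
```

A point such as (48.8584, 2.2945) lies within 10 miles of both hypothetical centers; under the nearest-center assumption it is assigned to "Paris".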
In some embodiments, the location classifier of the social media geo tagging engine 106 recognizes and extracts a set of features related to geographical location from each of the posts in the training set and calculates an observation set of the extracted features as the cross-product of the location vector and feature set, yielding <feature, location> pairs. For a non-limiting example, the term “Giants” can be associated with the city of “San Francisco” at the city level of <SF Giants, SF> if 75% of the posts containing “Giants” are determined to have originated from San Francisco (<US, CA, SF, SF>), while 25% of the posts are determined to have originated from Oakland (<US, CA, Oakland>) across the San Francisco Bay.
In some embodiments, the information recognized by the location classifier includes but is not limited to:
In some embodiments, social media geo tagging engine 106 aggregates a count of identical <feature, location> pairs and groups them by <feature, location level>, which shows the full distribution of P(location|feature) for that level. Features with few observations or low correlation to any geographical location are discarded.
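For a non-limiting illustration, the aggregation of <feature, location> pairs into a per-level P(location|feature) distribution may be sketched as follows. The function names, the thresholds `min_count` and `min_top_share` used to discard weak features, and the training posts are hypothetical; the disclosure does not specify the discard criteria numerically.

```python
from collections import Counter, defaultdict

LEVELS = ("country", "state", "county", "city")

def observations(features, location):
    """Cross-product of a post's features with its location 4-tuple."""
    for feature in features:
        for level, value in zip(LEVELS, location):
            yield feature, level, value

def train(posts, min_count=3, min_top_share=0.5):
    """Build {(feature, level): {location: P(location|feature)}}."""
    counts = defaultdict(Counter)
    for features, location in posts:
        for feature, level, value in observations(features, location):
            counts[(feature, level)][value] += 1
    model = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        top_share = max(counter.values()) / total
        # Discard features with few observations or low correlation to any location.
        if total < min_count or top_share < min_top_share:
            continue
        model[key] = {loc: n / total for loc, n in counter.items()}
    return model

# Hypothetical training set: (features, <country, state, county, city>)
SF = ("US", "CA", "SF", "SF")
OAK = ("US", "CA", "Alameda", "Oakland")
NY = ("US", "NY", "New York", "New York")
CHI = ("US", "IL", "Cook", "Chicago")
posts = ([(["giants", "the"], SF)] * 3 + [(["giants"], OAK)]
         + [(["the"], NY)] * 3 + [(["the"], CHI)] * 3 + [(["rare"], SF)])
model = train(posts)
```

With this hypothetical data, "giants" keeps a city-level distribution of 75% SF and 25% Oakland, while "the" is discarded at the city level because no single city dominates, and "rare" is discarded for having too few observations.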
Once the location classifier has been trained, social media geo tagging engine 106 continuously applies the location classifier to identify the geographic locations of all social media content items (citations, tweets, posts, etc.) retrieved from a social media network via a social media source fire hose in real time. When a new post lacking geographic (e.g., lat/long) information is found, the trained location classifier of social media geo tagging engine 106 uses the P(location|feature) model generated from the training set to predict the geographic location of the new post based on the features of the new post. Social media geo tagging engine 106 normalizes the output from the location classifier into standard location identifiers around country, state, and city to determine the geographic location of the post.
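For a non-limiting illustration, applying a P(location|feature) model to a post without coordinates may be sketched as follows. The disclosure does not state how the distributions of multiple features are combined; averaging them, as done here, is an assumption, and the model contents and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical trained model: {(feature, level): {location: P(location|feature)}}
MODEL = {
    ("giants", "city"): {"San Francisco": 0.75, "Oakland": 0.25},
    ("bart", "city"): {"San Francisco": 0.5, "Oakland": 0.5},
    ("giants", "country"): {"US": 1.0},
}

def predict(features, model, level):
    """Return (location, averaged score) at the given level, or None if no feature is known."""
    scores = defaultdict(float)
    used = 0
    for feature in features:
        dist = model.get((feature, level))
        if dist is None:
            continue  # feature was discarded or never seen during training
        used += 1
        for location, p in dist.items():
            scores[location] += p
    if used == 0:
        return None
    best = max(scores, key=scores.get)
    # Assumption: average P(location|feature) over the features that matched.
    return best, scores[best] / used

prediction = predict(["giants", "bart", "lunch"], MODEL, "city")
```

Other combination rules, such as summing log-probabilities, would fit the same interface; the averaging rule is chosen here only for readability.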
In some embodiments, once the geographic location of a post has been identified, the social media geo tagging engine 106 may further compare the identified location of the post with the determined geographic locations of prior posts by the same subject/author. The newly identified location is confirmed if it matches the location of the majority of the previous posts by the same author. Otherwise, the location of the majority of the previous posts by the author may be chosen as the geographic location of the new post instead. As a result, 98% of the posts can be geo-tagged at the country level or, in the US, at the city/state level.
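For a non-limiting illustration, the comparison against the author's prior posts may be sketched as a majority check. The function name is hypothetical, and keeping the predicted location when the author has no history or no strict majority location is an assumption not stated in the disclosure.

```python
from collections import Counter

def confirm_location(predicted, prior_locations):
    """Confirm the predicted location or replace it with the author's majority location."""
    if not prior_locations:
        return predicted  # assumption: no history, keep the prediction
    majority, count = Counter(prior_locations).most_common(1)[0]
    if count * 2 <= len(prior_locations):
        return predicted  # assumption: no strict majority, keep the prediction
    # A strict majority exists: it either confirms or overrides the prediction.
    return majority

history = ["SF", "SF", "Oakland"]  # hypothetical prior post locations
```

Here a new post predicted as "Oakland" for this author would be re-tagged as "SF", while a post predicted as "SF" would simply be confirmed.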
One embodiment may be implemented using a conventional general purpose or a specialized digital computer or microprocessor(s) programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art. The invention may also be implemented by the preparation of integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
One embodiment includes a computer program product which is a machine readable medium (media) having instructions stored thereon/in which can be used to program one or more hosts to perform any of the features presented herein. The machine readable medium can include, but is not limited to, one or more types of disks including floppy disks, optical discs, DVD, CD-ROMs, micro drive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data. Stored on any one of the computer readable medium (media), the present invention includes software for controlling both the hardware of the general purpose/specialized computer or microprocessor, and for enabling the computer or microprocessor to interact with a human viewer or other mechanism utilizing the results of the present invention. Such software may include, but is not limited to, device drivers, operating systems, execution environments/containers, and applications.
This application is a continuation-in-part of current copending U.S. application Ser. No. 13/158,992 filed Jun. 13, 2011, which claims the benefit of U.S. Provisional Patent Application No. 61/354,551, 61/354,584, 61/354,556, and 61/354,559, all filed Jun. 14, 2010. U.S. application Ser. No. 13/158,992 is also a continuation in part of U.S. Pat. No. 7,991,725 issued Aug. 2, 2011, a continuation in part of U.S. Pat. No. 8,244,664 issued Aug. 14, 2012, and a continuation in part of current copending U.S. application Ser. No. 12/628,791 filed Dec. 1, 2009. This application is a continuation-in-part of current copending U.S. application Ser. No. 13/660,533 filed Oct. 25, 2012, which claims the benefit of U.S. Provisional Patent Application No. 61/551,833, filed Oct. 26, 2011. This application claims benefit of U.S. Provisional Patent Application No. 61/617,524, filed Mar. 29, 2012, and entitled “Social Analysis System,” and is hereby incorporated herein by reference. This application claims the benefit of U.S. Provisional Patent Application No. 61/618,474, filed Mar. 30, 2012, and entitled “GEO-Tagging Enhancements,” and is hereby incorporated herein by reference.
Number | Date | Country
---|---|---
61354551 | Jun 2010 | US
61354584 | Jun 2010 | US
61354556 | Jun 2010 | US
61354559 | Jun 2010 | US
61617524 | Mar 2012 | US
61618474 | Mar 2012 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 13158992 | Jun 2011 | US
Child | 13853741 | | US
Parent | 12895593 | Sep 2010 | US
Child | 13158992 | | US
Parent | 12628801 | Dec 2009 | US
Child | 12895593 | | US
Parent | 12628791 | Dec 2009 | US
Child | 12628801 | | US