The present invention generally relates to an apparatus, system, and method for understanding communications between users of Internet based social media. More particularly, this invention relates to an apparatus, system, and method for collecting communications exchanged by users of Internet based social media, determining the entities (e.g., people, places, organizations, media, and fictional characters) that are referenced in those communications, determining the author's sentiment about those entities (e.g., love, hate, and indifference), and extracting the author's interests into an inferred user profile, which may be stored in a research database for use in targeted marketing of goods and services.
Automated machine understanding of social media has value because social media statements and actions may reveal the interests, opinions, and personality of the author. Significant technical challenges, however, may exist for understanding social data posts. For example, social data posts may incorporate shorthand notations for entities (e.g., MJ, instead of Michael Jordan) that are discussed in the communication. Social media posts, further, may include poor grammar, slang, and clever or lazy turns of phrase. Accordingly, a need exists for systems and methods for automated machine understanding of social media communications, which incorporate semantic inferences and syntactic analyses to identify and analyze social media statements and actions.
Hence, the present invention is directed to a computer-implemented method performed by a processor for understanding a snapshot of social network information. The method may include accessing social network information associated with a user of social media, collecting a snapshot of social network information associated with the user which comprises a plurality of social media statements, accessing a plurality of subculture models, and analyzing the snapshot of social network information and the plurality of subculture models to identify a weighted set of subcultures that reflects interests of the user. The method may further include analyzing the snapshot of social network information to identify one or more contacts associated with the user, assigning a weight to each contact that reflects the strength of each contact's connection to the user, and generating a personalized language model for the user that is based on the weighted set of subcultures and the set of contacts associated with the user. The personalized language model may include an entity list.
Additionally, the method may include extracting at least one mention of entities that are identified on the entity list from the plurality of social media statements, compiling a list of possible references for the at least one mention of entities extracted from the plurality of social media statements, inferring a weighted posterior distribution over the list of possible references for the at least one mention of entities that are identified on the entity list; and analyzing the weighted posterior distribution to identify a list of disambiguated references for the at least one mention of entities in the snapshot of social network information.
In one aspect, the method may include rating the user's sentiment for the list of disambiguated references and recording the user's sentiment for the list of disambiguated references in a database of inferred user profile opinions. Rating the user's sentiment for the list of disambiguated references may include word-based targeted sentiment analysis and pattern-based targeted sentiment analysis. Pattern-based targeted sentiment analysis may include comparing at least one of the user's plurality of social media statements with a pattern of expressions. The pattern of expressions may include a regular expression, a rating, and a confidence value.
In another aspect, the method may include inferring an updated weighted set of subcultures that reflect interests of the user based on an analysis of the snapshot of social media and the list of disambiguated references.
In another aspect, the method may include recording the updated weighted set of subcultures that reflect interests of the user in a database of inferred user profile interests.
In another aspect, the plurality of subculture models each may include a database of subculture specific entities and a database of subculture specific entity nicknames. Each of the plurality of subculture models further may include a database of subculture specific sentiment patterns. Also, each of the plurality of subculture models further may include a database of subculture specific semantic graph connections. Further still, each of the plurality of subculture models may include a database of subculture specific weighted N-grams. Each of the plurality of subculture models further may include a database of subculture specific co-occurrence frequencies.
The accompanying drawings, which are incorporated herein and constitute part of this specification, illustrate an embodiment of the present invention, and together with the general description given above and the detailed description given below, serve to explain aspects and features of the present invention.
a is a concept map of an entity disambiguation method of the present invention.
These inputs, along with subculture models (112), which may be generated offline from the same inputs, pass through the Social Media Understanding Engine (SMUE) 107, which extracts evidence of the social media user's personality 108, interests 109, opinions 110 and product relationships 111, and records this information in a repository of inferred user profiles.
In the system of
The system of
The usefulness of subculture identification and analysis in understanding social media statements may be demonstrated by evaluating the following illustrative social media post: “I love watching anthony and bryant fight it out.” The entities in this statement, mentioned as “anthony” and “bryant,” are ambiguous. The author knows which entities are referenced and presumes that the communication's audience does too. For instance, the author may presume his audience knows which entities are referenced because (1) he knows the knowledge bases of his intended audience (at least to some extent); (2) he presumes that there is no other pair of entities that matches the two mentions besides his intended references; or (3) some other element of the shared context (e.g., recent events) heavily favors his intended references.
For instance, if the author is a fan of NBA basketball (i.e., in the NBA subculture) and posts often about the NBA, the entities are most likely Carmelo Anthony and Kobe Bryant, two of the top players in that league and therefore two entities commonly referenced by those in that subculture. If the two players played against each other in the past 24 hours, the likelihood of this conclusion is raised further. By contrast, if the author is the mother of a son named Anthony and is not a fan of basketball, then “anthony” likely refers to her son. Similarly, given the “fight it out” clause, an author who is a fan of boxing would likely be referring to two boxers in a recent match. Finally, the “I love” clause indicates that the author is either a fan of the entities or a fan of the activity in which the entities are engaged.
Accordingly, social media understanding may be aided by subculture analysis because a subculture may generally reflect the language, customs and practices of a group of social media users that are connected by a common trait or interest.
In the context of
Although the subculture models of
Many of the methods described above involve comparing subculture-specific data with generic data, then comparing frequencies. Variants of existing techniques such as Pointwise Mutual Information (PMI) and Term Frequency-Inverse document frequency (TF-IDF) may be used for this purpose.
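The frequency comparison described above can be sketched as a PMI-style log ratio between a subculture-specific corpus and a generic background corpus. This is a minimal illustration only; the example corpora, add-one smoothing, and tokenization are assumptions, not part of the claimed method:

```python
import math
from collections import Counter

def pmi_scores(subculture_tokens, generic_tokens):
    """Score how strongly each term is associated with the subculture corpus
    relative to a generic background corpus (a PMI-style log ratio)."""
    sub_counts = Counter(subculture_tokens)
    gen_counts = Counter(generic_tokens)
    sub_total = sum(sub_counts.values())
    gen_total = sum(gen_counts.values())
    scores = {}
    for term, count in sub_counts.items():
        p_sub = count / sub_total
        # Add-one smoothing keeps terms absent from the generic corpus finite.
        p_gen = (gen_counts[term] + 1) / (gen_total + len(gen_counts) + 1)
        scores[term] = math.log(p_sub / p_gen)
    return scores

scores = pmi_scores(
    ["nba", "dunk", "nba", "playoffs", "game"],
    ["game", "weather", "news", "game", "music"],
)
# Terms frequent in the subculture corpus but rare in the generic one score highest.
```

Terms such as "nba" that dominate the subculture corpus receive high positive scores, while terms common to both corpora score near or below zero.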
In view of the above, an exemplary subculture may be modeled by locating available data sources used predominantly or exclusively by its members or representatives and then extracting and analyzing data associated with each element model. The element models may be improved by comparing the subculture-specific data sources with large data sources known to have only trace amounts of data for that subculture. For instance, models for an NBA basketball subculture can be extracted from NBA.com, Wikipedia articles containing “NBA” within category names, Twitter accounts devoted to the NBA, and other websites. To determine which elements of the data source are NBA specific, the data may be cross-referenced with a similar but distinct source, such as subculture data specific to another sport, and with general data, such as a sampling of Wikipedia pages that do not contain NBA as a category. Thus, subculture modeling may attempt to leverage information considered pertinent to a particular topic (or field of study) and strongly associated with the knowledge base of individuals who are active in this area of interest.
Subculture Identification.
This sub-process involves associating a weighted set of subcultures to a user of social media based on an analysis of a snapshot of the user's social media data. The process generates a score for each subculture based on the social media assertions, social media statements and conversations, and user profile. The score may be aggregation of subscores, each of which corresponds to the degree of match between the social data and a single element of the subculture model (see paragraph [0014]). For example, social data text may be matched against the n-gram models of the subculture to determine the degree to which the text expressions fit the model. In a second example, unambiguous entities mentioned in the social data may be cross-referenced to the entity lists of the subculture, resulting in a subscore. The total score, possibly normalized, indicates the degree to which the social media user “identifies” with a subculture.
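The subscore aggregation described above can be sketched as follows. The model structure (sets of n-grams and entities), the counting-based subscores, and the normalization are illustrative assumptions, not the claimed implementation:

```python
def subculture_score(post_tokens, mentioned_entities, model):
    """Sum subscores: n-gram matches plus unambiguous-entity cross-references.
    `model` is a hypothetical dict with 'ngrams' and 'entities' sets."""
    ngram_hits = sum(1 for tok in post_tokens if tok in model["ngrams"])
    entity_hits = sum(1 for e in mentioned_entities if e in model["entities"])
    return ngram_hits + entity_hits

def identify_subcultures(post_tokens, mentioned_entities, models):
    """Score every subculture model, then normalize into a weighted set."""
    raw = {name: subculture_score(post_tokens, mentioned_entities, m)
           for name, m in models.items()}
    total = sum(raw.values()) or 1
    return {name: score / total for name, score in raw.items()}

models = {
    "nba": {"ngrams": {"dunk", "playoffs"}, "entities": {"Kobe Bryant"}},
    "boxing": {"ngrams": {"knockout"}, "entities": {"Floyd Mayweather"}},
}
weights = identify_subcultures(["great", "dunk", "playoffs"], ["Kobe Bryant"], models)
# weights -> {"nba": 1.0, "boxing": 0.0}
```

The normalized total indicates the degree to which the user "identifies" with each subculture, as stated above.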
Personal Entity Extraction.
Personal entity extraction 203 involves creating a set of social media contacts (e.g., Friends, Followers, etc.) for the social media user. The set of personal entities may be gathered through the friend lists and follow lists on social networks. A weighting factor for each personal entity may be determined by combining the following information:
The weighting factor indicates the relative likelihood that an ambiguous reference to a nickname of the personal entity is actually the entity itself. For example, if an author has 4 contacts for which “Anthony” is a valid nickname, then the prior probability that a mention of “Anthony” in a post refers to each will be proportional to the weight induced for each. Many methods may be used to produce an appropriate weighting factor. For example, a +1 score can be applied to an entity or nickname for each interaction found in social media, whereas a listing as a family member can earn a +10 score; a listing as a spouse can earn a +30 score; and a +1 score can be given for simply being a “friend.” The score for each entity or nickname in a group may then be normalized, along with a slot for “other,” to produce a distribution over possibilities for that entity or nickname. Generally, however, a suitable method will produce a weighting that expresses the likelihood of the social media user referring to each entity, given a particular nickname mentioned. For example, a user may have three contacts named “Michael” in their social data. Michael 1 is a spouse and has 10 interactions with the user, for a total score of 40. Michael 2 is a friend with 8 interactions, for a total score of 9. Michael 3 is a friend with no interactions, for a total score of 1. Normalizing the scores of all three Michaels yields the following: Michael 1=0.8, Michael 2=0.18, Michael 3=0.02.
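The illustrative +1/+10/+30 scoring scheme and normalization above can be sketched as follows (the contact data structure is a hypothetical; the "other" slot is omitted for brevity):

```python
def score_contact(relationship, interactions):
    """Score a contact per the illustrative scheme: +1 for a friend,
    +10 for family, +30 for a spouse, plus +1 per observed interaction."""
    base = {"friend": 1, "family": 10, "spouse": 30}[relationship]
    return base + interactions

def nickname_distribution(contacts):
    """Normalize contact scores into a prior distribution for one nickname."""
    scores = {name: score_contact(rel, n) for name, (rel, n) in contacts.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

michaels = {
    "Michael 1": ("spouse", 10),   # 30 + 10 = 40
    "Michael 2": ("friend", 8),    # 1 + 8 = 9
    "Michael 3": ("friend", 0),    # 1 + 0 = 1
}
priors = nickname_distribution(michaels)
# priors -> {"Michael 1": 0.8, "Michael 2": 0.18, "Michael 3": 0.02}
```

The computed priors match the worked Michael example above (40/50, 9/50, and 1/50).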
The personal entity list is treated like a special subculture to which the user belongs with maximum weight.
Personal Language Model Generation.
A user's likelihood to emit phrases (N-grams), entities, and entity groups may be modeled using a weighted combination of that person's subculture models, plus their set of personal entities. Continuing the example from paragraph [0022], if a social media user matches only one subculture, with weight 0.5, and that subculture has the following distribution over Michaels: Michael 4=0.5, Michael 5=0.5, then the mixed distribution over Michaels, given that the personal subculture has weight 1, is obtained by multiplying all priors by their subculture weights and then normalizing. Pre-normalized: Michael 1=0.8, Michael 2=0.18, Michael 3=0.02, Michael 4=0.25, Michael 5=0.25.
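The mixing step described above can be sketched as follows (the data structures are assumptions; the weights and distributions reprise the running Michael example):

```python
def mix_distributions(weighted_sources):
    """Mix per-source nickname priors: scale each source's distribution by
    that source's weight, then normalize over all candidates."""
    mixed = {}
    for weight, dist in weighted_sources:
        for name, p in dist.items():
            mixed[name] = mixed.get(name, 0.0) + weight * p
    total = sum(mixed.values())
    return {name: p / total for name, p in mixed.items()}

personal = {"Michael 1": 0.8, "Michael 2": 0.18, "Michael 3": 0.02}  # weight 1
subculture = {"Michael 4": 0.5, "Michael 5": 0.5}                    # weight 0.5
mixed = mix_distributions([(1.0, personal), (0.5, subculture)])
# Pre-normalized masses: 0.8, 0.18, 0.02, 0.25, 0.25 (sum 1.5), as in the text.
```

Normalizing by the total pre-normalized mass of 1.5 yields, for example, roughly 0.53 for Michael 1 and roughly 0.17 each for Michaels 4 and 5.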
Although a full personal language model may be developed for each user based on this approach, in practice it is not necessary to compute and store the full model for each person. The Entity Disambiguation algorithm of
Entity Disambiguation.
Entity disambiguation may involve the following sub-processes: (1) generating candidate references and priors for each mention; (2) inferring semantic tags for each candidate reference; (3) inducing a conditional random field model; and (4) inferring a most likely assignment.
Referring to
For example, referring to the illustrative social media statement discussed above, the mention “Anthony” could have node values for the NBA player Carmelo Anthony, the user's cousin Anthony Thomas, two other sports players named Anthony, and “other.” The mention “Bryant” could have values for NBA player Kobe Bryant, sportscaster Bryant Gumbel, clothing designer Lane Bryant, and “other.” The joint probability of Carmelo Anthony and Kobe Bryant would be high, whereas the joint for Carmelo Anthony and Lane Bryant would be low. Other factors (induced through processing social media) include the home city of the user and their interests.
Accordingly, the entity disambiguation process of
Preferably, the method for entity disambiguation within a social media conversation may include the following high level steps:
Given an infinitely large corpus, conversations containing every possible combination of entities would be present, and computing the co-occurrence frequency of all combinations would be possible. Defining the joint probability over any set of mentions and corresponding referenced entities would be tedious, but straightforward. In the absence of this theoretical (i.e., infinitely large) corpus, however, the joint probability over any set of mentions and corresponding referenced entities may be approximated using the semantic network that connects any pair of entities.
Referring to
Additionally, the edges of the semantic network may connect semantic objects to more specific semantic objects. For example, sports may have a link pointing to basketball, basketball may have a link that points to NBA Basketball, etc. 501. The network, therefore, may be a directed acyclic graph rooted at the most general node (e.g., ‘object’).
More particularly,
As shown in
CS_a,b = 1 − product{i=1, 2, . . . , n}(1 − cs_a,b,i)
As an ad hoc method for combining this semantic data with real corpus data, pairs of entities with actual co-occurrence frequencies may be given a value between 1.0 and 2.0. One method is to normalize all frequency data to a 0 to 1 range; the total value is then the normalized value plus 1.
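The combination rule CS_a,b = 1 − product(1 − cs_a,b,i) over the n connecting paths, together with the ad hoc 1.0-to-2.0 corpus adjustment, can be sketched as follows (normalizing against a corpus-wide maximum frequency is an assumption used for illustration):

```python
def combined_strength(path_strengths):
    """CS_a,b = 1 - product(1 - cs_a,b,i) over the paths linking entities
    a and b; strength grows monotonically with each additional path."""
    remainder = 1.0
    for cs in path_strengths:
        remainder *= (1.0 - cs)
    return 1.0 - remainder

def corpus_adjusted_strength(freq, max_freq):
    """Map an observed co-occurrence frequency into the 1.0-2.0 range:
    normalize to [0, 1] against the corpus maximum, then add 1."""
    return 1.0 + (freq / max_freq if max_freq else 0.0)

combined_strength([0.5, 0.5])      # two paths of strength 0.5 combine to 0.75
corpus_adjusted_strength(50, 100)  # observed pairs outrank purely semantic ones
```

Because the corpus-adjusted values always exceed 1.0 while the semantic values stay below 1.0, entity pairs with observed co-occurrences dominate pairs linked only through the semantic network.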
By contrast, referring to
Additionally, the semantic network may be amended at any time by adding paths. For example, if we learn that Bryant Gumbel and Carmelo Anthony are both alumni of the same university, an additional path can be added to
Sentiment Analysis.
For many purposes, including suggesting items relevant to the author, it may be useful to know how the author feels about the subjects the author is discussing. Generally, Targeted Sentiment analysis (TS analysis) takes as input
In addition to the rating, a confidence measure may be output for each mention, which indicates the certainty of the system in its rating. The confidence measure may range over the interval [0,1]. For example, “I'd rather not watch the movie Titanic again” indicates a slightly negative sentiment, −0.2, with medium confidence, 0.4. “I LOVE the movie Titanic” is strongly positive, 0.99, with strong confidence, 0.7. If the user is known to rarely use sarcasm, the confidence may be higher.
In a preferred embodiment, sentiment analysis may include targeted word-based analysis methods as follows:
Additionally, the following targeted word-based analysis method may be added:
Additionally, pattern based targeted sentiment analysis may be used to define zero or more subculture-specific linguistic patterns that indicate sentiment. For example, “Go Raiders” is a highly positive statement about a professional football team. The pattern [“Go”] ENTITY is a sports-specific pattern that works across multiple teams and sports, and can be interpreted as positive with very high confidence. Generally, patterns may be implemented as regular expressions over the following items:
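The “Go” ENTITY pattern above can be sketched as a regular expression paired with a rating and a confidence value. The specific regex, the entity list, and the numeric values below are illustrative assumptions:

```python
import re

# A minimal sketch of the sports-specific "Go" ENTITY pattern described above.
# The entity list and the rating/confidence values are illustrative assumptions.
PATTERN = {
    "regex": re.compile(r"\bGo\s+(?P<entity>[A-Z][A-Za-z]+)\b"),
    "rating": 0.9,       # highly positive
    "confidence": 0.95,  # very high confidence
}

def match_pattern(statement, known_entities):
    """Return (entity, rating, confidence) if the pattern fires on a known entity."""
    m = PATTERN["regex"].search(statement)
    if m and m.group("entity") in known_entities:
        return m.group("entity"), PATTERN["rating"], PATTERN["confidence"]
    return None

match_pattern("Go Raiders", {"Raiders"})  # -> ("Raiders", 0.9, 0.95)
```

Because the pattern is keyed to an entity slot rather than a fixed team name, it generalizes across multiple teams and sports, as noted above.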
An exemplary overall targeted sentiment analysis algorithm is as follows.
Evidence aggregation 210: Multiple conversations by a given social media user may reference a given entity. In these cases, the disambiguation algorithm above will produce qualitatively similar assertions, but with different sentiment values and confidence levels. A method may be supplied to unify these sentiment values and confidence levels into a single sentiment value and confidence level for that entity.
One method is to simply average the sentiment values and confidence levels. Another method may assume that the existence of other mentions for an entity inherently raises the confidence for that entity. Intuitively, if a person mentions an entity once, they are more likely to mention that same entity again. For example, if one conversation leads to the inference fan.CarmeloAnthony=0.7 (0.4 confidence) and another conversation leads to fan.CarmeloAnthony=0.8 (0.5 confidence), the sentiment level can average to 0.75 and the confidence can be combined as follows: confidence=1−(1−0.4)*(1−0.5)=0.7. A third method may account for the degree of disagreement in sentiment levels, reducing the confidence as a function of the difference between them. For example, for the inferences fan.CarmeloAnthony=0.7 (0.8 confidence) and fan.CarmeloAnthony=−0.2 (0.8 confidence), the original computed confidence can be multiplied by (2−abs(0.7−(−0.2)))/2=1.1/2=0.55. With no difference in sentiment, the original computed confidence remains the same; with the maximum difference, the confidence becomes 0.
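The second and third aggregation methods described above can be sketched together as follows (a minimal illustration combining the noisy-OR confidence rule with the disagreement discount; combining both in one function is a design choice, not mandated by the text):

```python
def aggregate_evidence(assertions):
    """Unify multiple (sentiment, confidence) assertions for one entity:
    average the sentiments, combine confidences via a noisy-OR rule, then
    discount confidence by the spread in sentiment values."""
    sentiments = [s for s, _ in assertions]
    avg_sentiment = sum(sentiments) / len(sentiments)
    # Noisy-OR combination: each extra mention raises overall confidence.
    remainder = 1.0
    for _, c in assertions:
        remainder *= (1.0 - c)
    confidence = 1.0 - remainder
    # Disagreement discount: multiply by (2 - max spread) / 2.
    spread = max(sentiments) - min(sentiments)
    return avg_sentiment, confidence * (2.0 - spread) / 2.0

# Two agreeing mentions: average 0.75, noisy-OR confidence 1-(0.6*0.5)=0.7,
# then a mild discount for the 0.1 sentiment spread.
aggregate_evidence([(0.7, 0.4), (0.8, 0.5)])
```

With identical sentiments the discount factor is 1 and the noisy-OR confidence passes through unchanged; with maximally opposed sentiments the confidence collapses toward zero, matching the behavior described above.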
A second iteration of subculture identification may be performed. After inferring entities mentioned, overall accuracy may be improved if the weighted set of subcultures is recalculated based on the inferred entities. For example, if the basketball subculture is detected with a small weight (e.g., 0.3) upon initial analysis, but the social media user mentions 10 NBA players in conversations, the weight of the basketball culture should be revised upward. This revision, however, may trigger a re-analysis of the conversations, and would impact results. A discount may be applied on subsequent iterations to prevent continuous processing and to promote a convergence of subculture weights.
Referring to
The social media understanding system 100 may stand alone or may be part of another system. For example, the social media understanding system 100 may be part of a social media marketing system which collects communications exchanged by users of an Internet based social media community, generates a collection of purchase decision profiles for each of those users, researches market conditions for a set of goods and services, and transforms these data into individually customized offers to buy or sell goods and services to those users and their social network contacts. A social marketing system is disclosed in commonly owned, co-pending patent application Ser. No. 13/761,121, entitled, “Apparatus, System, and Methods for Marketing Targeted Products to Users of Social Media,” filed on Feb. 6, 2013, (the '121 patent application). The '121 patent application is incorporated herein by reference in its entirety.
In a second example, the social media understanding system 100 may be part of a system that predicts or analyzes world events based on social media. For example, if many users of the system abruptly begin discussing common entities within a subculture, it may indicate that an important event has happened or will happen related to that entity. This may have great value where social media is the only media source accurately covering the subculture.
While it has been illustrated and described what at present are considered to be preferred embodiments of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made, and equivalents may be substituted for elements thereof without departing from the true scope of the invention. Additionally, features and/or elements from any embodiment may be used singly or in combination with other embodiments. Therefore, it is intended that this invention not be limited to the particular embodiments disclosed herein, but that the invention include all embodiments within the scope and the spirit of the present invention.