The following relates generally to a system and method for identifying experts on social media and more specifically to systems and methods for identifying experts, topics and followers in social media networks that may be used to engage or track a wide and relevant audience for message targeting.
Social media has transformed the way we interact online as individuals and consumers. At the same time, it is transforming the way businesses aim to interact with their customers and fans online. Before social media became mainstream, online marketers and advertisers resorted to the collection of behavioral online information regarding individuals to target their messages. Individuals were primarily targeted based on the topical focus of the sites they visited. For example, sports news sites might display advertising related to the perceived interests of sports fans. The general interests of sports fans would be derived based on third party market research (e.g., males aged 25-35 with interest in sports are also interested in certain types of movies or specific male grooming products).
In the early stages of the social web, bloggers on particular topics with wide followings were identified to endorse or sponsor specific products. At the same time, bloggers started serving advertisements on their blog real estate.
Social media is transforming the way marketers and advertisers spend their budgets. Novel ways to market online are gaining traction both from an academic as well as a practical point of view. In particular, influencer-based targeting in social media has emerged as a very popular way to market on social platforms (such as Twitter and Facebook). Individuals are identified as online experts in particular topics; they are either incentivized to participate in sponsored advertising by spreading the messages to their followers or the platforms automatically insert sponsored messages in their activity streams (as in the case of Twitter/Facebook advertising). Further, they may be targeted with relevant content such that they organically share it with their followers. The goal is to increase brand awareness by increasing the number of impressions (e.g., how many followers see a particular message) and click-throughs to a particular campaign (how many users click on the link embedded in the message), with the ultimate goal of tracking conversions (how many end up purchasing a product).
As an example, with more than 250 million users, Twitter has emerged as a prominent marketing and advertising vehicle in addition to being a prominent social communications platform.
In one aspect a system for identifying one or more experts of a topic on a social network is provided, the system comprising a server in communication over a network with a social network, the server comprising: (a) a user interface unit configured to obtain a topical query representing the topic; (b) an obtaining unit configured to obtain social network data from the social network, the social network data comprising one or more topical lists and a social graph representing user relationships in the social network, each topical list identifying one or more users; (c) a tokenizing unit configured to: (i) tokenize titles of the topical lists and lexically group the tokens into token groupings; and (ii) tokenize the topical query to determine at least one token grouping to which the topical query corresponds; and (d) a processing unit configured to: (i) generate, for each user, a topic signature vector comprising topic signature vector elements corresponding to the token groupings for which the user is identified in the corresponding topical lists; (ii) generate for each topic signature vector element an occurrence count representing the number of times each of the token groupings is identified for the user; (iii) rank the users by their occurrence counts for the at least one token grouping corresponding to the topical query; and (iv) return a selected set of the ranked users as experts in the topic.
In another aspect, a computer network implemented method for identifying one or more experts of a topic on a social network is provided, the method comprising: (a) obtaining a topical query representing the topic; (b) obtaining social network data from the social network, the social network data comprising one or more topical lists and a social graph representing user relationships in the social network, each topical list identifying one or more users; (c) tokenizing titles of the topical lists and lexically grouping the tokens into token groupings; (d) tokenizing the topical query to determine at least one token grouping to which the topical query corresponds; (e) generating, by a processing unit comprising one or more processors, for each user, a topic signature vector comprising topic signature vector elements corresponding to the token groupings for which the user is identified in the corresponding topical lists; (f) generating, by the processing unit, for each topic signature vector element an occurrence count representing the number of times each of the token groupings is identified for the user; (g) ranking the users by their occurrence counts for the at least one token grouping corresponding to the topical query; and (h) returning a selected set of the ranked users as experts in the topic.
The features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
It will be appreciated that for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
It will also be appreciated that any module, unit, component, server, computer, terminal or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
Advertising and marketing on Twitter involves two crucial steps: first, being able to identify who the “experts” are on any topic on the platform, and second, being able to identify sets of users with active “interest” in a particular topic. In the context of Twitter, an expert in a particular topic is represented as an account (user) that primarily produces and shares content related to that topic and has a wide following that actively engages with the produced content (sharing, re-tweeting, etc.). A user may demonstrate interest in a particular topic if, for example, the user follows a number of experts in the topic and engages with the content they produce.
A need exists to identify experts on any given topic and to perform analytical functions on the set of expert accounts for a specific topic, such as determining what other topics they are experts in, what conversations they participate in, and what types of content they share online. A further need exists to identify other users (e.g., followers) that are likely to be interested in a given topic.
The following relates generally to systems and methods for identifying experts on social media. The system is configured to collect data on user interaction, communication and profile information to identify experts, topics and followers in social networks that may be used to engage or track a wide and relevant audience for marketing purposes. In another aspect, such information may be provided via a user interface to enable message targeting decisions.
Social networks like Google+, Facebook, Twitter, and Pinterest have emerged as vehicles for marketing and branding. Marketers seeking to engage with, and advertise to, consumers may wish to identify a network of experts in, and followers of, given topics to whom to market specific content, as interest in a topic may correlate with sales of a given product or service. Without loss of generality, Twitter will be used herein as an example of a social platform from which content may be collected to provide data regarding experts, topics and followers. The techniques described may be applied equally well to any other similar social platform. The terms “follower”, “marketer” and “social networks” are used herein illustratively and in a non-limiting manner. These terms could be substituted for appropriate parties as applicable to alternative implementations.
In another aspect, a system and method is provided for characterizing the expertise of particular social network users among a set of topics, including the generation of a topic signature for each user of a social network. A topic signature comprises a list of all topics of expertise of the user. Additionally, a system and method is provided to produce an aggregate signature. An aggregate signature comprises a list of topics in which a set of users has expertise. Both topic and aggregate signatures can be interpreted as a ranking of topics from most relevant (for the purposes of reaching the largest audience) to least relevant.
Many social networks provide advertising platforms with tools for marketers. For example, in use, a marketer would utilize the Twitter advertising platform in one of the following three ways. Firstly, the advertiser provides a set of Twitter user handles, and Twitter targets advertisements to the followers of these accounts. Being able to identify sets of experts in any topic readily aids advertisers in identifying the most relevant accounts to provide when initiating a Twitter advertising campaign.
Secondly, the advertiser bids on a list of topics on Twitter. Twitter, using its own proprietary algorithms, identifies which users are interested in the topic and subsequently targets those users with messages “promoted” by the advertisers (inserting them in their tweet streams). By analyzing related topics for a topic of interest, advertisers can identify possibly cheaper topics to bid on. For example, if the price for ‘social marketing’ is too high, ‘seo’, a related topic with a relatively lower bid price, may be used instead. The effectiveness of the campaign may be the same, due to the substantial overlap between the two.
Thirdly, the advertisers bid on search keywords (to target searches input to the Twitter search feature). Information on Twitter is temporal by nature and events evolve with time, thus the keywords used in searches evolve over time. When a keyword is used during a search query on Twitter for which an advertisement exists, the platform will display promoted messages (as advertising) along with the search results. Advertisers may wish to identify keywords related to a queried keyword at a given time.
However, the applicants have now determined that advertisers may also be interested in specific users that would be highly relevant as followers of a given user. These new followers should be highly interested in the topics for which the given user has expertise, since the followers desire to follow the given user's messages. Furthermore, the applicants have determined that advertisers may be well served by not just understanding who the experts are in relation to a given topic, but what other topics those users are interested in; for example, by identifying all experts in ‘cloud computing’ with interests in ‘photography’ or experts in ‘food and dining’ with interest in ‘movies’. Such sets of experts can be targets of novel engagement campaigns that attract attention by combining their area of expertise and their interests.
Various interactions between users and social networks such as Twitter result in data generation. Many social media networks record their interactions with their users. The system is configured to obtain interaction data via various sources and/or connectors. The social networks collect and record such data in logs stored on social network nodes or a network-accessible server. A social network may provide access to data collected on user interactions. For example, Twitter provides a Gardenhose streaming API which may be used to access messages and user profile information. Thus, Twitter activity may be stored in files that may be automatically created and maintained by a given server or set of servers. In another aspect, Twitter may be accessed directly via a network connection, such as the internet, and data may be crawled, scraped and indexed. Crawling and scraping may be performed using various techniques by employing varying levels of automation. The obtained data may be stored in a database for ready use by the system.
To accomplish the foregoing, as mentioned above, the server comprises an obtaining unit 109 for obtaining social network data, a tokenizing unit 108 for tokenizing social media messages, a processing unit 102 for processing the social media data to generate each user's topic signature and an aggregate signature for each topic, respectively, and an indexing unit 107 for interaction with the database to locally store the social network data. In the case of the exemplary system that utilizes public Twitter data, obtaining unit 109 fetches three pieces of information from the public Twitter data feed: the actual textual tweet contents from the public Gardenhose streaming API, the Twitter follower graph (who follows whom), and the Twitter lists, as more fully set forth below.
The tokenizing unit 108 tokenizes the Twitter lists to produce topics and associates the topics to the users of the lists. The processing unit 102 uses the output association (of user to topics) from tokenizing unit 108 to instruct indexing unit 107 to store it as a fast-access index “IXD” in the database. The user interface unit 103 can use index “IXD” to process the topic query “q” to return the list of experts “E” as associated with the topics by tokenizing unit 108.
“IXD” supports this by storing (1) an inverted index from topics to the collection of users who belong to a Twitter list associated with each topic, and (2) the total number of topic lists a given user belongs to. Using “IXD”, one can compute, given a specific topic and a specified user, the number of times the user is listed in a Twitter list associated with the specified topic (represented as a “frequency count” or “occurrence count”). The inverted index is used to compute the set of all experts “E” associated with “q” by finding the set intersection of the index entries associated with each topic “t” present in the query “q”. The experts in “E” can be ranked for display using user interface unit 103 by using the frequency count as described above.
To return a list of topics related to “q”, user interface unit 103 consults “IXD” to look up all other topics for all users from “E”; call this set of all topics “A”. From “A”, a list of top-ranked topics is presented to the user using some scoring function (frequency count or tf.idf).
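The following is a minimal Python sketch of how the “IXD” index and the query path described above might be organized. The tokenizer, the function names (build_ixd, experts, related_topics) and the plain frequency-count scoring are illustrative assumptions; the text indicates the actual implementation is built on Apache Lucene.

```python
from collections import Counter, defaultdict

def tokenize(title):
    # Lexical grouping is simplified here to lowercasing and splitting on
    # hyphens/whitespace; a production tokenizer (e.g., Lucene) would do more.
    return [t for t in title.lower().replace("-", " ").split() if t]

def build_ixd(topical_lists):
    """topical_lists: iterable of (list_title, [user, ...]) pairs.
    Returns the two structures "IXD" stores: (1) inverted[token] -> Counter
    of user -> occurrence count, and (2) list_counts[user] -> total lists."""
    inverted = defaultdict(Counter)
    list_counts = Counter()
    for title, users in topical_lists:
        tokens = tokenize(title)
        for user in users:
            list_counts[user] += 1
            for tok in tokens:
                inverted[tok][user] += 1
    return inverted, list_counts

def experts(inverted, query, top_k=10):
    """Set-intersect the index entries of every query token, then rank the
    surviving users by their summed occurrence (frequency) counts."""
    terms = tokenize(query)
    if not terms:
        return []
    candidates = set(inverted[terms[0]])
    for t in terms[1:]:
        candidates &= set(inverted[t])
    return sorted(candidates,
                  key=lambda u: sum(inverted[t][u] for t in terms),
                  reverse=True)[:top_k]

def related_topics(inverted, expert_set, top_k=10):
    """Collect all other topics the experts "E" are listed under (the set "A"
    above) and rank them, here by raw frequency count."""
    scores = Counter()
    for tok, users in inverted.items():
        scores[tok] = sum(c for u, c in users.items() if u in expert_set)
    return [t for t, _ in scores.most_common(top_k)]
```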
Relaxed transactional semantics may be used to increase throughput across multiple threads reading and writing the database tables. Tables for a selected time period may be stored on solid state drives (SSDs) for increased performance. The collection of tables keeping the association between account identifiers and message identifiers may be stored in the database. The indexing unit may retrieve, for any day, the identifiers of all messages produced that day by any set of accounts. The indexing unit may then provide the collection of message identifiers to the database to retrieve the actual messages.
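A minimal sketch of the per-day association between account identifiers and message identifiers; the in-memory dict stands in for the per-day tables described above, and the names are hypothetical.

```python
from collections import defaultdict

# day -> account_id -> [message_id, ...]; in production each day would be a
# separate table (stored on SSD, per the text) rather than an in-memory dict.
day_index = defaultdict(lambda: defaultdict(list))

def record(day, account_id, message_id):
    day_index[day][account_id].append(message_id)

def message_ids_for(day, accounts):
    """Return the identifiers of all messages produced on `day` by any of the
    given accounts; these ids are then handed to the database to retrieve
    the actual messages."""
    table = day_index[day]
    return [mid for a in accounts for mid in table.get(a, [])]
```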
The obtaining unit is configured to collect account relationships, such as which users follow others directly. Certain social networks are configured to permit users to create lists containing a descriptive name (supplied by the creator) and a set of accounts associated with the list (supplied by the creator). For example, a list on “machine learning” may contain all accounts that are experts in, or very related to, the topic of machine learning. The obtaining unit is configured to store the above mentioned data in the database 101.
The obtaining unit is further configured to receive information about which accounts follow others, along with a set of metadata appended by the social network and associated with the accounts, for storage in the database. In an embodiment, this data may be represented by a graph that may be stored in a MYSQL instance. It will be appreciated that another relational database may be used as an alternative to MYSQL. The indexing unit is further configured to index this data. In embodiments, an Apache Lucene index may be used. It will be appreciated that another text search engine library may be used as an alternative to Apache Lucene. This data provides an expertise vector, or a set of all lists a given account is associated with. That information is then directed to Lucene to populate the index of topics as associated with the account. The index supports full Lucene query syntax, including phrase queries and Boolean logic. At the same time, the social graph provides related information about user interests. For example, if a user follows someone with expertise in cooking, one may infer that the user has interest in cooking. Given all accounts followed by a given account, the union of their expertise vectors may produce an interest vector.
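A small sketch of deriving an interest vector as the union of the expertise vectors of an account's followed accounts, per the paragraph above; the data shapes are assumptions.

```python
from collections import Counter

def interest_vector(user, follows, expertise):
    """follows: dict mapping an account to the set of accounts it follows;
    expertise: dict mapping an account to a Counter of topic -> count.
    The union (here, a summed Counter) of the expertise vectors of all
    followed accounts yields the user's interest vector."""
    interests = Counter()
    for followed in follows.get(user, ()):
        interests.update(expertise.get(followed, {}))
    return interests
```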
The server is configured to un-shorten multiple URLs. Since URLs in messages are typically shortened (using popular URL-shortening services like bit.ly or t.co), conducting analysis on the shared domains to provide insight into the source of the content is challenging, as each URL has to be un-shortened (possibly multiple times). Thus the server efficiently un-shortens multiple URLs. Utilizing asynchronous IO, this process may be conducted for tens of thousands of URLs in parallel on a single thread, typically in a short time frame.
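A sketch of the parallel un-shortening step using Python's asyncio with the aiohttp client, which is an assumed library choice since the text does not name one. A single thread drives many concurrent HEAD requests, each following its redirect chain (possibly multiple hops, e.g., bit.ly to t.co to the final domain) to the final URL.

```python
import asyncio
import aiohttp

async def unshorten(session, url):
    # HEAD follows the full redirect chain without downloading bodies.
    try:
        async with session.head(url, allow_redirects=True,
                                timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, str(resp.url)
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, None

async def unshorten_all(urls):
    # One event loop (a single thread) drives tens of thousands of
    # concurrent requests, as described above.
    async with aiohttp.ClientSession() as session:
        return dict(await asyncio.gather(*(unshorten(session, u) for u in urls)))

# resolved = asyncio.run(unshorten_all(["<shortened URL>", ...]))
```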
Given the receipt of some or a combination of the data mentioned above, the processing unit is able to generate useful information and analysis such as a topic signature for a given user, an aggregate signature for a group of users and techniques to automatically identify changes in the aggregate signatures over time for a given query.
The vector may be referred to as the topic signature, assuming a total ordering on all topics (tokens) and assigning a value of zero to the occurrence count for a topic if the account is not associated with that topic at all. Assuming each token to be a unique dimension in a multi-dimensional space, the occurrence counts are normalized to produce the unit topic signature vector in ℓ1 space. Thus, this vector represents the weight of the account being associated with a topic. The ℓ1 space is used and hence the length of the expertise vector is normalized to 1 using the Manhattan norm. The union of all these vectors results in a multi-dimensional space with each unique token corresponding to a dimension.
In an exemplary scenario, consider a user @john that is a member of three lists {toronto-dentist, dentists, music-toronto}. The set of tokens with occurrence counts for this user is {dentist(2), toronto(2), music(1)}. After normalization, the unit topic signature vector becomes

topics(@john)=0.4·ŝdentist+0.4·ŝtoronto+0.2·ŝmusic

The vector above is of unit length in ℓ1 space, with non-zero values across three dimensions and zero across all others.
Considering two more users in the same scenario: @henry belonging to lists {dentists, squash-london, music}, and @susan who is a member of lists {squash, music-london, squash-london}. After considering all three users, a 5-dimensional topic space is produced:

          dentist  toronto  music  squash  london
  @john     0.4      0.4     0.2     0       0
  @henry    0.25     0       0.25    0.25    0.25
  @susan    0        0       0.2     0.4     0.4

The above matrix is a compact form of individually writing the vectors as

topics(@john)=0.4·ŝdentist+0.4·ŝtoronto+0.2·ŝmusic
topics(@henry)=0.25·ŝdentist+0.25·ŝmusic+0.25·ŝsquash+0.25·ŝlondon

and so on.
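A sketch reproducing the @john computation above; the lex_group helper is a crude stand-in for the lexical grouping step (it merely strips a plural “s” so that “dentists” and “dentist” fall into the same token grouping).

```python
from collections import Counter

def lex_group(token):
    # Crude lexical grouping: lowercase and strip a trailing plural "s";
    # the actual grouping step is assumed to be more sophisticated.
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def topic_signature(list_names):
    """Tokenize the lists a user belongs to, count occurrences per token
    grouping, and normalize to unit length under the Manhattan (L1) norm."""
    counts = Counter(lex_group(t) for name in list_names for t in name.split("-"))
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(topic_signature(["toronto-dentist", "dentists", "music-toronto"]))
# {'toronto': 0.4, 'dentist': 0.4, 'music': 0.2} -- matching @john's vector above
```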
The process of computing topic signatures for each user is linear in the number of users and the length of their topic signatures. In this example, the technique of extracting topic signatures is applied to data from 240 million users and 15 million lists. Apache Lucene may be used for implementing the tokenizing unit and indexing unit to tokenize and index the lists.
The generated topic signatures provide for fast response to a request for an expert list in response to a topic query. The index allows queries with full Boolean syntax and is used to quickly return all users having certain topic associations. For example, in response to a query, the query is tokenized and processed lexically in a similar manner to how the lists are processed, as described above.
To describe how the processing unit generates the aggregate signature, the following setting and notation will be used: Let the set of all users be denoted by UM, which has cardinality M. Let u ∈ UM denote a unique user, with unit-normalized topic signature vector topics(u). The vectors are drawn from a multi-dimensional space SN with N dimensions. The matrix representing the signature vectors for all users will then have M×N entries. We denote this matrix as MSE.

If si is a specific dimension in SN, then the signature vector may be represented as topics(u)=w1·ŝ1+w2·ŝ2+ . . . +wN·ŝN, where wi represents the length of the vector across dimension si. The graphing unit generates a social graph spanning all users and denoting follower relationships, represented by (UM, E), where UM is the set of nodes and E is the set of edges. Each user represents a node. Edges are follower relationships, i.e., if a user u follows another user v, then the directed edge from u to v will be part of the set of edges E. Formally, for u, v ∈ UM, follows(u, v)=true ⇔ (u, v) ∈ E. Let q represent a keyword query such as “hurricane sandy” or “pepsi”; more complex queries that include Boolean operators, such as “elections AND (“barack obama” OR “mitt romney”)”, are also permitted. Let R represent the set of message results after evaluating the query against the content of all raw messages obtained by the obtaining unit from the social network and stored in the database. If the search has a time restriction t denoting that only results within the time interval t are of interest, the set of results is Rt. Each entry r ∈ Rt is a message such that matches(q, r)=true, meaning that the query evaluates to true on message r, and posttime(r) ∈ t, namely that the message was posted within the time interval of interest t. Let At denote the set of all unique authors (users) in Rt, i.e., the set of authors of all messages r ∈ Rt: u ∈ At ⇔ ∃r ∈ Rt : author(r)=u.
As MSE can potentially consist of a large number of entries, it is desirable to produce a concise summary of MSE as an aggregate signature. This can be done by first computing the relevance rank of each user in the set UM using the social graph and network ranking algorithms such as PageRank. Then, conditional probability can be utilized to aggregate the topics. Mathematically, this process is: Pr(si)=Σj Pr(si|uj)·Pr(uj) for expertise si and user uj, where Pr(si) denotes the probability of si over the appropriate sample space.
The aggregate signature may be used for two main purposes, namely: a) to obtain a concise view of the topics associated with the messages of interest (denoted by the query q above), which is done by aggregating the expertise vectors of all authors who have authored one of the messages of interest; and b) to rank those topics based on their potential for dissemination within the social network, which is done by summing over the users and their topics/expertise using conditional probability. The utility of such an aggregate signature may be to gain insight into what topics a marketer may associate with q in order to increase the reach and dissemination of messages related to q in the Twitter network, via sharing through re-posting of messages from those accounts sending messages about q.
The processing unit may alternatively generate the aggregate signature of all followers of a particular account (instead of author group for a query). For example, a marketer may be interested in understanding who is talking about the brand Pepsi on Twitter and the aggregate of topics associated with that group. This information may be used to create better advertising content for this group, e.g., if many users are associated with travel, then a good strategy could be to create marketing messages incorporating travel as a theme. That way one aims to capture the attention of the group in multiple ways and increases engagement and content sharing.
Thus, in order to maximize the spread of marketing content to a relevant group, it is desirable to direct resources toward users who can spread the content (e.g., by re-tweeting or re-sharing the message). Hence, when constructing the aggregate signature, not every member of At is considered with equal weight. The aggregate signature aggrsig(q, t) (or correspondingly aggrsig(At)) is computed by taking into account the ability of a user to spread a message.
Given a query q, a time interval t, and its associated set of authors At from the result set Rt, the processing unit generates the aggregate signature aggrsig(q, t), taking into account the potential reach of each author in At. Having retrieved all messages Rt with respect to the query q over time interval t, the processing unit scans through all items in Rt to resolve the set of unique users from Rt as At. The processing unit is operable to generate the topic signatures for each user in At.
One strategy to produce aggregate signatures is to sum up the topic signatures retrieved and normalize them to unit length. However, this method fails to capture the relative importance of each user in disseminating a message to their followers with respect to the query q. Under this scheme all users are assumed equally important as far as the dissemination potential is concerned, which may not actually reflect reality. For example, the set At may contain several users with association in the topic of music but each with very few followers, and few users with association in the topic of travel but each with many followers.
Thus, the processing unit generates an aggregate score which may be referred to as “AGGR” herein. AGGR represents the relative ranking of u ∈ At. Looking at the subgraph induced by At on the original follower graph (UM, E), a user u1 may have a substantial number of followers in (UM, E) but very few followers who also belong to At. The number of followers in At may be more important than the total number of followers across the entire social graph, as the aim is to find users who can disseminate the message to the potentially largest group of relevant users.
To capture these intuitions, the processing unit models this scenario as a Hidden Markov Model (HMM), with each user u ∈ At represented as a node in the hidden layer, and each topic in their topic signatures represented as a node in the output layer. For users u, v ∈ At, if user u follows v, a directed edge is added in the Markov chain from u to v. Transition from one node to another takes place with equal probability; that is, if there are eu edges out of node u, one of the edges is selected for transition with probability 1/eu.
Since the Markov chain may have disconnected components, with a small pre-specified probability α a random jump takes place, and with probability 1−α one of the outgoing edges is selected.
Traversing the Markov chain, while at node u having eu outgoing edges, the probability of transition is computed as follows. If eu is zero, the next node after transition is picked uniformly at random from the set At. Let |At| be the cardinality of the set At. If eu is non-zero, then the next node v is selected with probability

Pr(u→v)=α/|At|+(1−α)/eu if (u, v) ∈ E, and Pr(u→v)=α/|At| otherwise.
This completes the construction of the Markov chain, and an emission probability for the topics is assigned at each node. The symbols being emitted from the HMM are the dimensions of the topic signature. For example, if the topic signature of a user u is

topics(u)=0.5·ŝmusic+0.5·ŝsquash

then one of music or squash is emitted with equal probability when at the node in the HMM associated with this particular user u. Since the topic signatures are of unit length in ℓ1 space, further normalization is not needed to compute symbol emission probabilities. For a topic signature topics(u)=w1·ŝ1+w2·ŝ2+ . . . +wN·ŝN, the symbol si will be emitted with probability wi. Since w1+w2+ . . . +wN=1, the sum of all probabilities will be 1.
Continuing from the example used for the creation of the topic signature above, and assuming each of the three users follows each other, the HMM displayed in the accompanying figure is produced.
Since each of the three users follows the two others, the chain is symmetric and the steady-state distribution is uniform, with Pr(u)=1/3 for each user. As a final check, performed by the processing unit, Pr(dentist)+Pr(music)+Pr(london)+Pr(squash)+Pr(toronto)=1. The resulting aggregate signature is therefore

aggrsig(At)=(13/60)·ŝdentist+(2/15)·ŝtoronto+(13/60)·ŝmusic+(13/60)·ŝsquash+(13/60)·ŝlondon ≈ 0.217·ŝdentist+0.133·ŝtoronto+0.217·ŝmusic+0.217·ŝsquash+0.217·ŝlondon
The HMM has now been defined by the processing unit with a set of nodes, transition probabilities, and emission probabilities for symbols. The steady-state probabilities for this HMM allow the processing unit to compute the aggregate signature across the set of all users At. At steady state, assuming that the probability that a symbol si is seen is prob(si), the aggregate signature will be aggrsig(At)=prob(s1)·ŝ1+prob(s2)·ŝ2+ . . . +prob(sN)·ŝN, which is of unit length in ℓ1 space. To compute the aggregate signature given the steady-state distribution from the Twitter follower graph, the processing unit uses the definition of conditional probability. Observe that for a topic s:
Pr(s)=Σu Pr(s, u)=Σu Pr(s|u)·Pr(u)  (1)
and since Pr(s|u) (the topic probability of a user u) and Pr(u) (the steady-state probability of user u) are independent and known from the Markov chain and preprocessing, the processing unit proceeds to solve the HMM, first for the hidden user layer and then for the emission (topic) layer.
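A sketch of the full aggregation under the model just described: a PageRank-style random walk over At with jump probability α, power iteration to the steady-state distribution Pr(u), then the conditional-probability aggregation of Equation (1). Applied to the three-user example it reproduces the aggregate signature computed above; the defaults (α=0.15, 100 iterations) are assumptions.

```python
import numpy as np

def aggregate_signature(users, follows, topics, alpha=0.15, iters=100):
    """users: the set At; follows: dict user -> set of followed users in At;
    topics: dict user -> dict topic -> weight (unit-L1 topic signature)."""
    n = len(users)
    idx = {u: i for i, u in enumerate(users)}
    # Transition matrix: random jump with probability alpha, otherwise one of
    # the eu outgoing edges with equal probability; dangling nodes jump uniformly.
    P = np.full((n, n), alpha / n)
    for u in users:
        out = [v for v in follows.get(u, ()) if v in idx]
        if out:
            for v in out:
                P[idx[u], idx[v]] += (1 - alpha) / len(out)
        else:
            P[idx[u], :] = 1.0 / n
    # Power iteration to the steady-state distribution Pr(u).
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    # Equation (1): Pr(s) = sum_u Pr(s | u) * Pr(u).
    agg = {}
    for u in users:
        for topic, w in topics[u].items():
            agg[topic] = agg.get(topic, 0.0) + w * pi[idx[u]]
    return agg

users = ["@john", "@henry", "@susan"]
follows = {u: {v for v in users if v != u} for u in users}  # all follow each other
topics = {
    "@john":  {"dentist": 0.4, "toronto": 0.4, "music": 0.2},
    "@henry": {"dentist": 0.25, "squash": 0.25, "london": 0.25, "music": 0.25},
    "@susan": {"squash": 0.4, "london": 0.4, "music": 0.2},
}
print(aggregate_signature(users, follows, topics))
# symmetric graph -> uniform steady state; dentist, music, squash, london
# each ~13/60 (0.2167) and toronto ~2/15 (0.1333), matching the example above
```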
Marketers invest significant effort to change brand perception and association. A change in the audience of a brand could be organic over time, or it may be influenced by an event. For example, numerous marketing efforts attempt to reinvent or reposition brands in new target segments and change the way brands are perceived online or offline. An effort to make a brand more fashionable or trendy may be successful if the people talking about the brand online associate themselves with fashion and/or fashion trends. Thus, such changes, if one is able to identify them, may point to the success or failure of marketing efforts online. Identifying such changes in the conversation around events may further identify parties relevant to a political or academic subject and how insights evolve over time. In an exemplary scenario, the query “hurricane sandy” is considered. The processing unit conducts the search over one-day time intervals for a 92-day period from 1 Oct. 2012 to 31 Dec. 2012.
The processing unit may proceed to generate results as to how aggrsig(q, t) evolves over time for a long time range T consisting of D smaller time intervals, T={t1, t2, . . . , tD}. The processing unit generates the aggregate signature for a given query q for each of the time intervals as ASM(q, T)={aggrsig(q, t1), aggrsig(q, t2), . . . , aggrsig(q, tD)}. The resulting matrix ASM(q, T) has N rows and D columns: the rows each correspond to a topic dimension from SN, and the columns each correspond to a time interval from T. This matrix is referred to as the aggregate signature matrix ASM(q, T) over time T for the query q.
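A short sketch of assembling ASM(q, T) as an N×D matrix from the D per-interval aggregate signatures (dicts such as those produced by the sketch above).

```python
import numpy as np

def build_asm(daily_signatures):
    """daily_signatures: list of D dicts (topic -> weight), one per interval ti.
    Returns (topics, asm) where asm has shape N x D: one row per topic
    dimension of SN, one column per time interval of T."""
    topics = sorted({t for sig in daily_signatures for t in sig})
    asm = np.zeros((len(topics), len(daily_signatures)))
    for j, sig in enumerate(daily_signatures):
        for i, t in enumerate(topics):
            asm[i, j] = sig.get(t, 0.0)
    return topics, asm
```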
Given an aggregate signature matrix ASM(q, T)={aggrsig(q, t1), . . . , aggrsig(q, tD)}, a pre-specified k<D, and a function score that measures the similarity of aggregate signatures, namely score(aggrsig(q, ti), . . . , aggrsig(q, tj)) ∈ ℝ+, define a disjoint, continuous k-partitioning of [1, 2, . . . , D] as Pk:={[b1, e1], [b2, e2], . . . , [bk, ek]} with b1=1, ek=D, bi, ei ∈ ℕ and ei ≥ bi for all i, and bi+1=ei+1 for all i<k, found by solving for argminPk Σi=1..k score(aggrsig(q, tbi), . . . , aggrsig(q, tei)).
In embodiments, the processing unit may iterate over a few values of k and trace the value of the overall score function for each value of k. Points at which large discontinuities arise are typically good candidates for k.
The processing unit selects k groups of continuous days across the 92-day period for a pre-specified k. Each of these k date ranges will represent a distinct aspect of the event. For example, if k were 3, the resulting date ranges could be expected to represent the pre-hurricane period, the period during the hurricane as it passed over New York City, and the post-hurricane period.
Using the notations defined above, given the aggregate signature matrix ASM(q, T) and specified k<D, the processing unit partitions T into k continuous and disjoint intervals. The aim is to group similar time periods together, and this is formalized by defining a scoring function capturing similarity that is minimized. Once the scoring function has been chosen, the problem reduces to that of identifying the optimal partitioning.
The processing unit generates two scoring functions. The first minimizes the total error, represented as the sum of the root-mean-square distances between the average aggregate signature of a collection of signatures and the aggregate signatures in the collection. Given a collection of aggregate signatures ASM={aggrsig(q, t1), aggrsig(q, t2), . . . , aggrsig(q, tD)}, this first measure assesses the distance using the root mean square error:

scoreRMSE(ASM)=Σi=1..D √( (1/N)·Σj (aggrsig(q, ti)j−avgj)² ), where avg=(1/D)·Σi aggrsig(q, ti) is the average aggregate signature and the inner sum is over the N topic dimensions.
The RMSE score increases as the distance between aggregate signatures increases, i.e., when the topics across {t1, t2, . . . , tD} are different, and decreases when the topics are the same. Therefore, with this score function, intervals of time are singled out where the aggregate signatures are very similar to each other.
The second discretizes ASM(q, T) into an indicator matrix of 0s and 1s, and measures similarity as the Hamming distance across neighbouring aggregate signatures. This second error measure involves the discretization of aggregate signatures. The value in each dimension of aggrsig(q, ti) is between 0 and 1. The aggregate signature can be discretized by assigning each dimension the value of 0 or 1. There are many ways to discretize the signature; a statistically sound way is to assess the mean of all the values and assign a value of 1 if it is above some standard deviation of the mean, and 0 otherwise. Denote the discretized aggrsig(q, ti) as aggrsig′(q, ti) and, similarly, the discretized ASM(q, T) matrix as ASM′(q, T). With T1={t1, t2, . . . , tD−1} and T2={t2, . . . , tD}, the score is rewritten in the compact form score=∥ASM′(q, T2)−ASM′(q, T1)∥F using the Frobenius norm, where ∥A∥F=√(ΣiΣj|Aij|²) for a matrix A.
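Sketches of the two score functions under the definitions above; the discretization threshold (one standard deviation above the mean) is one admissible choice, not the only one.

```python
import numpy as np

def score_rmse(segment):
    """segment: an N x d slice of ASM(q, T). Sum over columns of the
    root-mean-square distance between each aggregate signature and the
    segment's average aggregate signature."""
    avg = segment.mean(axis=1, keepdims=True)
    return float(np.sqrt(((segment - avg) ** 2).mean(axis=0)).sum())

def discretize(asm):
    # Assign 1 to dimensions above mean + one standard deviation, else 0.
    threshold = asm.mean() + asm.std()
    return (asm > threshold).astype(int)

def score_discretized(asm):
    """Frobenius norm of the difference between the discretized matrices
    restricted to T2 = {t2..tD} and T1 = {t1..tD-1}: a Hamming-style
    distance across neighbouring aggregate signatures."""
    d = discretize(asm)
    return float(np.linalg.norm(d[:, 1:] - d[:, :-1], ord="fro"))
```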
In further embodiments, given a function score that computes a distance between aggregate signatures of ASM(q, T), the following recurrence may be generated by the processing unit to measure the similarity of aggregate signatures. Bj,k is defined to be the best k-partition score of the first j columns of ASM(q, T) using the given score function:

Bj,k=mini≤j ( Bi−1,k−1 + score(aggrsig(q, ti), . . . , aggrsig(q, tj)) )  (2)
The best K partition of ASM(q, T) for all K<D may be computed using Equation 2. Notice that it would take O(2^D) score evaluations to solve the best K partition of ASM(q, T) for all K in a brute-force way: since there are C(D−1, K−1) ways to produce K disjoint continuous intervals of [1, 2, . . . , D], a brute-force approach would enumerate Σi=1..D C(D−1, i−1)=O(2^D) candidate partitionings.
Looking at the recurrence of Equation 2, the processing unit may pre-compute score(aggrsig(q, ti), . . . , aggrsig(q, tj)) for all i≤j, as it is independent of the recurrence. Evaluating i←argmini≤j (Bi−1,k−1+scorei,j) then takes O(D) steps. When solving for the best K partition of ASM(q, T) for all K<D using dynamic programming, the runtime is dramatically reduced to O(D³). The space requirement can also be optimized by noting that, in Equation 2, Bj,k depends only on the values from the previous iteration. Therefore, after an iteration is complete, the processing unit may discard the optimal interval partitioning and the optimal scoring from the last iteration, bringing the space requirement, excluding the precomputed scores, down to O(D).
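A sketch of the dynamic program of Equation 2 with precomputed interval scores. This version keeps full backpointer tables for readability rather than applying the O(D) space optimization described above; `score` is any function on an N×d slice, such as the sketches following the score definitions.

```python
def best_k_partition(asm, k, score):
    """asm: the N x D matrix ASM(q, T). Returns (best score, intervals) where
    intervals is a list of (begin, end) column ranges, 0-indexed inclusive,
    implementing the recurrence of Equation 2 via dynamic programming."""
    D = asm.shape[1]
    # Precompute score for every contiguous interval [i, j] (O(D^2) calls).
    sc = {(i, j): score(asm[:, i:j + 1]) for i in range(D) for j in range(i, D)}
    INF = float("inf")
    B = [[INF] * (k + 1) for _ in range(D)]    # B[j][m]: best m-partition of cols 0..j
    back = [[None] * (k + 1) for _ in range(D)]
    for j in range(D):
        B[j][1] = sc[(0, j)]
        back[j][1] = 0
    for m in range(2, k + 1):
        for j in range(m - 1, D):
            for i in range(m - 1, j + 1):      # last interval is [i, j]
                cand = B[i - 1][m - 1] + sc[(i, j)]
                if cand < B[j][m]:
                    B[j][m], back[j][m] = cand, i
    # Recover the optimal partitioning from the backpointers.
    intervals, j, m = [], D - 1, k
    while m > 0:
        i = back[j][m]
        intervals.append((i, j))
        j, m = i - 1, m - 1
    return B[D - 1][k], intervals[::-1]
```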
Continuing with the example above, for each day the aggregate signature vector is computed based on everyone who is talking about the hurricane on that day (i.e., users who have posted a message containing the words hurricane and sandy). As time progresses, a natural evolution in the matrix ASM(q, T) for this query is expected. Hurricane Sandy first affected the Caribbean and Bermuda on October 22nd, and Twitter users actively participating in the discussion were topically associated with these regions. As days progressed, more of the American, and subsequently global, audience started discussing the hurricane. As the hurricane traveled from the southeast of the US (Florida, Virginia, the Carolinas) to the mid-Atlantic region (Washington D.C., Maryland, New Jersey), and finally reached New York City, the group of users talking about the hurricane changed. In November, post-hurricane, the discussion shifted further to rebuilding efforts, and those discussing were associated with politics. Intuitively, it is evident that this 92-day time period can be partitioned into discrete time periods that capture the evolution of this story, namely tracing the geographical path of the storm (by observing the topics associated with those talking about it) and then capturing the political discussion centered on re-building efforts.
In embodiments, the processing unit may be configured to perform random sampling of data to speed up this computation without sacrificing quality. This effectively offers a good tradeoff between accuracy and speed.
The system may be configured to process a subset of search results to reduce processing time for a query. Run time may be improved by using random sampling on the set At. Instead of constructing the HMM with all users in At, only a fraction f≤1.0 may be randomly selected.
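A minimal sketch of the sampling step: keep a uniformly random fraction f of the author set At before constructing the HMM; the default fraction is an arbitrary assumption.

```python
import random

def sample_authors(authors, f=0.25, seed=None):
    """Randomly retain a fraction f <= 1.0 of At, trading a little accuracy
    for speed in the HMM construction described above."""
    rng = random.Random(seed)
    k = max(1, int(len(authors) * f))
    return rng.sample(list(authors), k)
```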
As the fraction f is reduced, the number of topics with non-zero weights may also decrease.
Other applications may become apparent.
Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto. The entire disclosures of all references recited above are incorporated herein by reference.