The following relates to systems and methods for estimating influence spread in social networks.
In recent years social media has become a popular way for individuals and consumers to interact online (e.g. on the Internet). Social media also affects the way businesses aim to interact with their customers, fans, and potential customers online.
Users with a wide following on particular topics are identified and used to endorse or sponsor specific products. For example, advertisement space on a popular blogger's website is used to advertise related products and services.
Social network platforms are known to be used to communicate with a targeted group of people, or advertise to a targeted group of people. Examples of social network platforms include (but are not limited to) those known by the trade names Facebook, Twitter, LinkedIn, Tumblr, Instagram, and Pinterest.
Such social network platforms are also used to influence groups of people, since online social networks enable large-scale word-of-mouth marketing. For instance, massive social networks, such as Facebook, Twitter, and Instagram, include billions of users (e.g. data nodes) and trillions of edges (e.g. data links) representing interactions, dictating opinions, and causing viral explosions. Quickly identifying relevant target groups and/or popular or influential individuals, and accurately identifying the influential individuals that should be targeted initially such that the expected number of follow-ups for a particular topic is maximized, can be difficult and computationally expensive, particularly as the number of users within a social network grows.
Below are example embodiments and example aspects of the data infrastructure system and methods for estimating influence spread in a social network. These example embodiments and aspects are non-limiting. Alternative embodiments or additional details, or both, are provided in the accompanying figures and the below detailed description.
In a general example embodiment, a method is provided for determining influence spread in social networks, the method comprising: generating a plurality of samples using a computing device, each sample corresponding to a collection of all edge weights for a social network graph topology; allocating, by the computing device, the plurality of samples into at least one batch, a size of which is determined according to a number of threads and global memory space available in a multi-processor platform; for each batch: parallel processing the samples in that batch using the multi-processor platform to generate results corresponding to a spread of each graph node per sample in that batch; storing results of that batch in the global memory accessible to the multi-processor platform; and sending the results to the computing device; computing, using the computing device, an average spread of each node across all samples in all batches; and determining, from the average spreads, one or more nodes having a largest spread.
In other example embodiments, computing systems and computer readable media are provided that are configured to perform the above method.
Embodiments will now be described by way of example only with reference to the appended drawings wherein:
It will be appreciated that for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the example embodiments described herein. However, it will be understood by those of ordinary skill in the art that the example embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the example embodiments described herein. Also, the description is not to be considered as limiting the scope of the example embodiments described herein.
Influence spread can be efficiently estimated by utilizing a multi-processing platform such as a GPU, a multi-core processor, etc. It is recognized that samples of edge weights in a network graph are processed independently of each other and thus lend themselves to parallelized processing, e.g., using a known sampling method such as the Naïve Sampling or Cohen's Estimator algorithms, particularly in a multi-processing environment that includes many threads, e.g., a GPU-based environment.
Turning now to the figures, social networking platforms 12 include users who generate and post content for others to see, hear, etc. (e.g. via a network of computing devices communicating through websites associated with the social networking platform). Non-limiting examples of social networking platforms 12 are Facebook, Twitter, LinkedIn, Pinterest, Tumblr, Instagram, blogospheres, websites, collaborative wikis, online newsgroups, online forums, emails, and instant messaging services. Currently known and future known social networking platforms 12 may be used with the principles described herein. Social networking platforms 12 can be used to market to, and advertise to, users of the platforms 12. Although the principles described herein may apply to different social networking platforms 12, many of the examples are described with respect to Twitter to aid in the explanation of the principles.
More generally, social networks allow users to easily pass on information to all of their followers (e.g., re-tweet or @reply using Twitter) or friends (e.g., share using Facebook).
The terms “friend” and “follower” are defined below.
The term “follower”, as used herein, refers to a first user account (e.g. the first user account associated with one or more social networking platforms 12 accessed via a computing device) that follows a second user account (e.g. the second user account associated with at least one of the social networking platforms 12 of the first user account and accessed via a computing device), such that content posted by the second user account is published for the first user account to read, consume, etc. For example, when a first user follows a second user, the first user (i.e. the follower) will receive content posted by the second user. In some cases, a follower engages with the content posted by the other user (e.g., by sharing or reposting the content). The second user account is the “followee” and the follower follows the followee.
It will be appreciated that a user account is a known term in the art of computing. In some cases, although not necessarily, a user account is associated with an email address. A user has a user account and is identified to the computing system by a username (or user name). Other terms for username include login name, screen name (or screenname), nickname (or nick) and handle.
A “friend”, as used herein, is used interchangeably with a “followee”. In other words, a friend refers to a user account that another user account can follow. Put another way, a follower follows a friend.
A “social data network” or “social network”, as used herein, includes one or more social data networks based on different social networking platforms 12. For example, a social network based on a first social networking platform 12 and a social network based on a second social networking platform 12 may be combined to generate a combined social data network. A target audience of users may be identified using the combined social data network, also referred to herein simply as a “social data network” or “social network”.
Examples of social media intelligence applications 20 that can use or otherwise benefit from the results generated by the system 10 include, without limitation, Sysomos Influence (for determining top influencers and influencer communities), Sysomos MAP (for viral marketing), etc.
Traditionally, time has not been taken into account when determining the influence spread illustrated in the appended drawings.
One simplified assumption on such a density function is that the time taken by node i to infect node j does not depend on the absolute time at which node i itself is infected, i.e.:

f_ji(t_j|t_i) = f_ji(t_j − t_i).
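For instance, as an illustrative assumption only (the exponential form and the rate λ_ji are not mandated by the system), an exponential transmission model satisfies this property:

f_ji(t_j|t_i) = λ_ji · exp(−λ_ji · (t_j − t_i)), for t_j ≥ t_i,

so that the infection delay along the edge from node i to node j depends only on the elapsed time t_j − t_i.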
Sampling is used to generate the edge weights, as shown in the appended drawings. In reality, it is appreciated that a campaign has either a strict or an effective “deadline” of time T.
For a given sample, the spread of node i is equal to the number of nodes infected by node i within time T. As such, the system 10 is interested in σ(i), the expected spread of node i, i.e. the average number of nodes infected by node i across all samples. This can be generalized to a set of nodes A, namely σ(A). Given the directed graph G(V,E), with vertices V and edges E, the edge weight distributions, and a budget k, an objective is to find a set S of at most k nodes (i.e. a seed set) that maximizes σ(S), or S ← argmax_{A:|A|≤k} σ(A).
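As an illustrative sketch only (the function and type names below are assumptions for exposition, not part of the described system), the spread of a single node in one sample can be computed with a Dijkstra-style search over that sample's edge delays, counting every node reached within the deadline T:

```cpp
// Illustrative sketch (not the system's actual implementation): spread of a
// single seed node within deadline T, for one sample of edge delays.
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; float delay; };          // sampled transmission delay for this edge
using Graph = std::vector<std::vector<Edge>>;  // adjacency list, one entry per node

// Returns the number of nodes (including the seed) infected within time T.
int spread_in_sample(const Graph& g, int seed, float T) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(g.size(), INF);
    using Item = std::pair<float, int>;        // (infection time, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[seed] = 0.0f;
    pq.push({0.0f, seed});
    int infected = 0;
    while (!pq.empty()) {
        auto [t, u] = pq.top(); pq.pop();
        if (t > dist[u]) continue;             // stale queue entry
        ++infected;                            // u is infected at time t <= T
        for (const Edge& e : g[u]) {
            float tv = t + e.delay;            // earliest time u could infect e.to
            if (tv <= T && tv < dist[e.to]) {
                dist[e.to] = tv;
                pq.push({tv, e.to});
            }
        }
    }
    return infected;
}
```

Here each edge's delay plays the role of that sample's edge weight, and a node is counted as infected only if its earliest infection time does not exceed T.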
Accordingly, the problem to be addressed in determining influence spread in the social network is to find a set S of k nodes (i.e. the seed set) that maximizes the expected spread σ(S). To do so, an approximation algorithm can be applied as follows (Kempe et al., 2003):
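The referenced listing (including the subroutine steps 3.1-3.4 mentioned below) is not reproduced here. As an illustrative sketch only (the function names are assumptions, and sigma() stands for any estimator of expected spread, such as the sampling estimators described below), the greedy procedure of Kempe et al. (2003) repeatedly adds the node with the largest estimated marginal gain:

```cpp
// Illustrative sketch of the greedy approximation of Kempe et al. (2003):
// repeatedly add the node giving the largest marginal gain in estimated spread.
#include <functional>
#include <set>

std::set<int> greedy_seed_set(int num_nodes, int k,
                              const std::function<double(const std::set<int>&)>& sigma) {
    std::set<int> S;                                   // chosen seed set
    for (int iter = 0; iter < k; ++iter) {
        int best_node = -1;
        double best_gain = -1.0;
        double base = sigma(S);
        for (int v = 0; v < num_nodes; ++v) {
            if (S.count(v)) continue;                  // already selected
            std::set<int> candidate = S;
            candidate.insert(v);
            double gain = sigma(candidate) - base;     // marginal gain of adding v
            if (gain > best_gain) { best_gain = gain; best_node = v; }
        }
        if (best_node < 0) break;
        S.insert(best_node);
    }
    return S;
}
```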
It has been observed that processing samples of a social network graph is at least in part inherently parallelizable, since each sample is processed independently and thus the samples can be processed in parallel processes or threads. As such, multi-processing platforms 18 such as GPUs are particularly well suited to performing an influence spread calculation, as illustrated in the appended drawings.
A weight generator 48 generates the weights for the edges, which are used in the parallel sample processing. The results of all samples are then averaged by summing the results and dividing that by the number of samples:
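σ̂(i) = (1/N) · Σ_{s=1}^{N} spread_s(i),

where, using notation introduced here for illustration, spread_s(i) denotes the number of nodes infected by node i within time T in sample s and N is the total number of samples.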
Instead of Naïve Sampling, the system can also be configured to have the multi-processor platform 18 utilize other algorithms, such as Cohen's Neighborhood Size Estimation Algorithm, which is faster and also requires fewer samples. The tradeoff when compared to, for example, Naïve Sampling, is speed versus accuracy. With Cohen's Neighborhood Size Estimation Algorithm, the subroutine 3.1-3.4 (see above) can be modified as shown in the appended drawings.
It can be seen that there are an order of magnitude fewer samples, and by looking at neighborhoods instead of number of paths, further performance gains can be achieved.
Naïve Sampling can be considered “embarrassingly parallel” (i.e. where little or no effort is required to separate the problem into a number of parallel tasks) since it has virtually complete independence across samples that are being processed. Typically, the number of samples required is between 100,000 and 1,000,000 to achieve convergence, which motivates acceleration. Cohen's Neighborhood Size Estimation Algorithm requires an inner loop (e.g., with approx. 5-10 inner samples) and an outer loop (e.g., with approx. 10,000 to 50,000 outer samples), and the core randomized algorithm exhibits complete independence across both inner and outer samples. Since the number of samples is also less, it is recognized that it makes more sense to parallelize the outer loop.
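As a hedged, simplified sketch of one way such a neighborhood-size estimator can be realized (the function names, the per-seed search, and the exponential-rank formulation below are illustrative assumptions, not the exact algorithm of the figures), the inner loop draws a handful of random ranks per weight sample, while the outer loop over weight samples is the part that is parallelized across threads:

```cpp
// Illustrative, simplified sketch of an exponential-rank size estimator in the spirit
// of Cohen's method, for a single seed node and a single (outer) weight sample.
// Separate copies of the graph types (Edge2, Graph2) are defined so the sketch stands alone.
#include <algorithm>
#include <functional>
#include <limits>
#include <queue>
#include <random>
#include <utility>
#include <vector>

struct Edge2 { int to; float delay; };
using Graph2 = std::vector<std::vector<Edge2>>;

// Minimum rank among nodes reachable from 'seed' within time T (Dijkstra-bounded).
static float min_reachable_rank(const Graph2& g, int seed, float T,
                                const std::vector<float>& rank) {
    const float INF = std::numeric_limits<float>::infinity();
    std::vector<float> dist(g.size(), INF);
    using Item = std::pair<float, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[seed] = 0.0f; pq.push({0.0f, seed});
    float best = INF;
    while (!pq.empty()) {
        auto [t, u] = pq.top(); pq.pop();
        if (t > dist[u]) continue;
        best = std::min(best, rank[u]);
        for (const Edge2& e : g[u]) {
            float tv = t + e.delay;
            if (tv <= T && tv < dist[e.to]) { dist[e.to] = tv; pq.push({tv, e.to}); }
        }
    }
    return best;
}

// Estimate the number of nodes infected by 'seed' within T using L inner rank samples.
double cohen_style_size_estimate(const Graph2& g, int seed, float T, int L,
                                 std::mt19937& rng) {
    std::exponential_distribution<float> exp1(1.0f);
    double sum_min = 0.0;
    for (int l = 0; l < L; ++l) {                 // inner loop: L rank assignments
        std::vector<float> rank(g.size());
        for (auto& r : rank) r = exp1(rng);       // i.i.d. Exp(1) rank per node
        sum_min += min_reachable_rank(g, seed, T, rank);
    }
    return (L - 1) / sum_min;                     // standard unbiased size estimate
}
```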
There are also space versus speed trade-offs to consider. For example, the weights need to be pre-generated (on the host (CPU) versus on the device (GPU)), and data loads/unloads need to be balanced between the host (CPU) and the device (GPU); thus, as described in more detail below, batch sampling is utilized to process large numbers of samples.
Referring again to the appended drawings, these issues can be addressed by implementing batch sampling/allocation. To do so, one can fix the batch size to a constant size B, such that B samples are passed to the multi-processor platform 18 per batch; in the context of a GPU, this implies that B threads are utilized in parallel and the N samples are processed over N/B batches, as shown in the appended drawings.
An example of a batch processing implementation using a multi-processor platform 18, such as a GPU, is illustrated pictorially in the appended drawings.
Each batch 64 of B samples 50 is processed at each iteration of the processing algorithm to generate a spread 65. Each sample 50 is processed in a thread 66 (or stream) of the multi-processor platform 18, such as a GPU, with a given network topology 14 and a time T (in this example, T=0.5) to obtain the spread values 65. The results of the computations of each sample are stored in the global memory 24 of the multi-processor platform 18.
The CPU 16 collects all spreads computed by the multi-processor platform 18 and passed back to it, and computes the average spread for each node across all samples. From this, the CPU 16 can find the seed with the maximum spread. This process can be repeated a plurality of times until the required number of seeds is found.
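As a hedged, host-side sketch only (the kernel body is a stub and all names and sizes are illustrative assumptions rather than the system's actual code), the batch loop described above can be organized as follows in CUDA, with one sample per GPU thread and the per-batch spreads accumulated on the CPU 16:

```cpp
// Illustrative CUDA sketch of the batch loop: N samples are processed in batches of
// B, one sample per GPU thread; per-batch spreads are copied back and accumulated.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void spread_kernel(const float* d_weights, int num_edges, int num_nodes,
                              int batch_size, float T, float* d_spreads) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;   // sample index within this batch
    if (s >= batch_size) return;
    // A real kernel would traverse the (read-only) topology using the weights for
    // sample s (d_weights + s * num_edges) and record, for each node, how many nodes
    // it infects within time T. The body is reduced to a stub here.
    for (int v = 0; v < num_nodes; ++v)
        d_spreads[(size_t)s * num_nodes + v] = 0.0f; // stub value
}

int main() {
    const int num_nodes = 1000, num_edges = 5000;    // illustrative graph sizes
    const int N = 100000, B = 1000;                  // total samples and batch size
    const float T = 0.5f;                            // campaign deadline

    std::vector<float> h_weights((size_t)B * num_edges);   // filled by the weight generator 48
    std::vector<float> h_spreads((size_t)B * num_nodes);
    std::vector<double> sum_spread(num_nodes, 0.0);

    float *d_weights = nullptr, *d_spreads = nullptr;
    cudaMalloc(&d_weights, h_weights.size() * sizeof(float));
    cudaMalloc(&d_spreads, h_spreads.size() * sizeof(float));

    for (int batch = 0; batch < N / B; ++batch) {
        // Generate (or load) this batch's B samples of edge weights on the host, then
        // copy them to the device, launch one thread per sample, and copy results back.
        cudaMemcpy(d_weights, h_weights.data(), h_weights.size() * sizeof(float),
                   cudaMemcpyHostToDevice);
        spread_kernel<<<(B + 255) / 256, 256>>>(d_weights, num_edges, num_nodes, B, T, d_spreads);
        cudaMemcpy(h_spreads.data(), d_spreads, h_spreads.size() * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (int s = 0; s < B; ++s)                         // accumulate per-node spreads
            for (int v = 0; v < num_nodes; ++v)
                sum_spread[v] += h_spreads[(size_t)s * num_nodes + v];
    }

    int best = 0;                                           // node with largest average spread
    for (int v = 1; v < num_nodes; ++v)
        if (sum_spread[v] > sum_spread[best]) best = v;
    printf("best seed: %d, average spread: %f\n", best, sum_spread[best] / N);

    cudaFree(d_weights);
    cudaFree(d_spreads);
    return 0;
}
```

In this sketch the device buffers are sized for a single batch of B samples, reflecting the global-memory constraint that motivates batching in the first place.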
It has been recognized that the inherent randomness of the influence spread computations can cause poor memory coalescence, causing potential latency problems in a GPU. For example, adjacent threads 66 may need to access edge weights 62 that are far apart in memory. One enhancement addressing this is shown in the appended drawings.
In another enhancement, a 1D texture memory structure can be used for read-only data (weights, topology, etc.). By using texture memory, each time any thread fetches a value from the GPU global memory 24, a whole block of global memory is fetched at once (rather than only the requested value). This can help nearby threads 66 if they are also trying to access nearby locations in the GPU global memory 24, thereby reducing the number of calls to the GPU memory 24, which can improve latency.
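As an illustrative fragment (an assumption about one way to exploit the same read-only data path on more recent CUDA devices, rather than the described 1D texture binding itself), read-only inputs such as the edge weights 62 can be routed through the texture/read-only cache via __ldg() on a const __restrict__ pointer:

```cpp
// Illustrative fragment: reading the (read-only) edge weights through the GPU's
// texture/read-only cache path using __ldg(), so nearby threads benefit from the
// block-wise fetches described above.
__global__ void read_only_example(const float* __restrict__ d_weights,
                                  int num_edges, float* d_out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_edges) {
        float w = __ldg(&d_weights[i]);   // load via the read-only (texture) cache
        d_out[i] = w;                     // placeholder use of the weight
    }
}
```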
In yet another enhancement, the L1 cache can be disabled, resulting in fewer wasteful fetches. The L1 cache is a small pool of memory attached to each streaming multiprocessor in a GPU, and it stores data that are likely to be used often by the processor. In this way, when a new request for data occurs, the data can often be found in the L1 cache instead of being looked up in the global memory 24, which is slower to access. This works well when the access patterns are somewhat predictable. However, in the present example the memory access patterns are semi-random because of sampling, and thus generally unpredictable. This means that the L1 cache often contains data that are not needed, along with the data that are. In some scenarios, it is possible that the majority of cached data is unnecessary for most of the operation time. The L1 cacheline (i.e. the number of bytes the L1 cache fetches) varies from device to device. In one example, the L1 cache fetches 128 bytes of data from device memory each time there is a request that is not found in L1 (i.e. a cache miss), yet only a small portion of this data is used (e.g., 8 bytes). As such, in this example, a large percentage of the fetched data is wasted (120 bytes). If the L1 cache is disabled, then the L2 cache, which cannot be disabled, is used instead. With the L2 cache, 32 bytes are fetched on each cache miss, so a smaller percentage of wasted data is fetched (24 bytes wasted in that case).
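As a hedged example of how such a configuration can be requested on NVIDIA GPUs that support it (the source file name is hypothetical), compiling with the PTX assembler option that caches global loads in L2 only effectively bypasses the L1 cache for those loads:

```
nvcc -Xptxas -dlcm=cg influence_spread.cu -o influence_spread
```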
For each batch 64, the multi-processor platform 18, such as a GPU, is used at step 106 to parallel process the samples 50 in that batch 64. The results of that batch 64 are stored in the global memory 24 at step 108. The results correspond to the spread of each graph node per sample 50 in that batch 64. The results are then sent back to the CPU 16 at step 110, and the CPU 16 then moves to the next batch 64 and sends that data to the multi-processor platform 18 such that steps 106-110 are repeated for all batches 64. The CPU 16 then computes the average spread 65 of each node across all samples 50 in all batches 64 at step 112 in order to determine the node(s) with the largest spread 65.
As indicated above, where the goal is to find a set of seeds, the process described herein can be repeated, as shown in the appended drawings.
The above-described process was demonstrated using the following setup:
System:
Social Graphs:
Sampling range: 100-10,000 samples.
The results of this demonstration are shown in the appended drawings.
Turning to an example configuration of the system, it can be appreciated that social network data includes data about the users of the social network platform, as well as the content generated or organized, or both, by the users. Non-limiting examples of social network data include the user account ID or user name, a description of the user or user account, the messages or other data posted by the user, connections between the user and other users, location information, etc. An example of connections is a “user list”, also herein called a “list”, which includes a name of the list, a description of the list, and one or more other users whom the given user follows. The user list is, for example, created by the given user.
The server 350 includes a processor 352 (e.g., the CPU 16), and a memory device 354. In an example embodiment, the server 350 includes one or more processors (e.g. a central processor system) and a large amount of memory capacity. In another example embodiment, the memory device 354 or memory devices are solid state drives for increased read/write performance. In another example embodiment, multiple servers are used to implement the methods described herein. In other words, in an example embodiment, the server 350 refers to a server system. In another example embodiment, other currently known computing hardware or future known computing hardware is used, or both.
The server 350 also includes a communication device 356 to communicate via the network 346. The network 346 may be a wired or wireless network, or both. In an example embodiment, the server 350 also includes a GUI module 356 for displaying and receiving data via the computing device 348. The server 350 also includes: a social networking data module 360, an indexer module 362, and a user account relationship module 364. Other components or modules may also be utilized by or included in the server 350 even if not shown in this illustrative example. Similarly, other functionality can be implemented by the modules shown in this example.
The server 350 also includes a number of databases, including a data store 368, an index store 370, a profile store 372, and a database for storing community graph information 366.
The social networking data module 360 is used to receive a stream of social networking data. In an example embodiment, millions of new messages are delivered to social networking data module 360 each day, and in real-time. The social networking data received by the social networking data module 360 is stored in the data store 368.
In an example embodiment, only certain types of data are received based on the follower and friend API, such as node and edge connection data. In other words, the message content may or may not be received and stored by the server 350.
The indexer module 362 performs an indexer process on the data in the data store 368 and stores the indexed data in the index store 370. In an example embodiment, the indexed data in the index store 370 can be more easily searched, and the identifiers in the index store can be used to retrieve the actual data (e.g. full messages).
A social network graph is also obtained from the social networking platform server (not shown) and is stored in the social network graph database. The social network graph 14, when given a user as an input to a query, can be used to return all users “following” the queried user.
The profile store 372 stores meta data related to user profiles. Examples of profile related meta data include the aggregate number of followers of a given user, self-disclosed personal information of the given user, location information of the given user, etc. The data in the profile store 372 can be queried.
In an example embodiment, the user account relationship module 364 can use the social network graph 14 and the profile store 372 to determine which users are following a particular user. In other words, a user can be identified as a “friend” or “follower”, or both, with respect to one or more other users. The module 364 may also be configured to determine relationships between user accounts, including reply relationships, mention relationships, and re-post relationships.
The server 350 may also include a community identification module or capability (not shown) that is configured to identify communities (e.g. a cluster of information within a queried topic, such as Topic A) within a topic network. The output from a community identification module comprises a visual identification of clusters (e.g. visually coded), defined as communities of the topic network, that contain common characteristics and/or are affected (e.g. influenced, such as through follower-followee relationships) to a higher degree by other entities (e.g. influencers, experts, high-authority users) in the same community than by those in another community.
The server 350 in this example also includes a data retrieval module 334 (e.g., REST module), a graph update module 336, and an influence spread module 338.
The server 350 is in communication with a cluster of titan graph server machines 349, which has memory devices 353 that store the social graph 14 and an HDFS 332. Each server machine in the titan graph cluster 349 includes a processor 351 and a communication device 355 for indexing and storing the data. Using the communication devices, the server 350 and the cluster of titan graph server machines 349 communicate with each other over the data network 346. While a cluster of server nodes can be used, it will be appreciated that different numbers of server nodes may be used to form the cluster.
The computing device 348 includes a communication device 374 to communicate with the server 350 via the network 346, a processor 376, a memory device 378, a display screen 380, and an Internet browser 382. In an example embodiment, the GUI provided by the server 350 is displayed by the computing device 348 through the Internet browser 382. In another example embodiment, where an analytics application 384 is available on the computing device 348, the GUI is displayed by the computing device through the analytics application 384. It can be appreciated that the display screen 380 may be part of the computing device 348 (e.g. as with a mobile device, a tablet, a laptop, a wearable computing device, etc.) or may be separate from the computing device (e.g. as with a desktop computer, or the like).
Although not shown, various user input devices (e.g. touch screen, roller ball, optical mouse, buttons, keyboard, microphone, etc.) can be used to facilitate interaction between the user and the computing device 348.
It will be appreciated that, in another example embodiment, the system includes multiple server machines. In another example embodiment, there are multiple computing devices that communicate with the one or more servers.
It will also be appreciated that one or more computer readable mediums may collectively store the computer executable instructions that, when executed, perform the computations described herein.
It will also be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.
It will also be appreciated that any module or component exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the system 10, any component of or related to the system 10, etc., or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.
The steps or operations in the flow charts and diagrams described herein are just for example. There may be many variations to these steps or operations without departing from the principles discussed above. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified.
Although the above principles have been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art as outlined in the appended claims.
This application claims priority to U.S. Provisional Patent Application No. 62/316,902 filed on Apr. 1, 2016, entitled “Data Infrastructure and Method for Estimating Influence Spread in Social Networks”, the entire contents of which are incorporated herein by reference.