The present invention relates to the field of social media networks, and more particularly to a system and method of determining the source of a rumor or piece of information from within a social media network.
Social media networks have experienced a meteoric rise in popularity. Information spreads much faster in social networks than through any other communication method as of this writing. This ease of information dissemination can be a double-edged sword, however, as social networks can also be used to spread rumors or computer malware. In such circumstances, detecting and determining the source of rumors or misinformation in a social network becomes valuable as a part of an affected party's damage control.
One potential source of information/misinformation may be a result of a node with a high degree of centrality (e.g., a node with a large number of friends on Facebook). This, however, is unlikely, because, in general, every node in a social network has the potential to spread information/misinformation.
It may be possible to use information from a snapshot of infected nodes to identify the source of information/misinformation. This requires the assumption that all nodes in the network monitor and report their status, which is not practical in large-scale social networks. Furthermore, this assumes that the underlying social graph is a regular tree. In general, however, an underlying social graph can be any type of graph.
It may also be possible to use a subset of nodes (called sensors) in the social network to find the source of information/misinformation. Such methods require a large number of nodes in the network to act as sensors, which is generally impractical. Furthermore, these methods do not consider the varying inter-node relationship strengths.
In view of the foregoing background, a system and method of detecting a source of a rumor in a social media network is disclosed. The system and method involve identifying a plurality of node clusters in the network, each of the plurality of node clusters including a plurality of nodes, each of the plurality of nodes from each of the plurality of node clusters having at least one edge connection defined by a connection to a different node from a same one of the node clusters; identifying a plurality of gateway nodes from each of the plurality of node clusters, each gateway node from each of the plurality of node clusters having at least one weak tie connection with a corresponding gateway node from a different one of the plurality of node clusters; selecting a subset of the plurality of gateway nodes as sensor nodes; measuring arrival times of the information from a source node at each of the sensor nodes; selecting a candidate node cluster from the plurality of node clusters based on high betweenness centrality, the candidate node cluster having a high probability of including the source node from among its corresponding plurality of nodes; selecting a subset of the plurality of nodes in the candidate cluster as candidate sensor nodes; measuring arrival times of the information from a source node at each of the candidate sensor nodes; and selecting a candidate node from the candidate cluster based on high betweenness centrality, the candidate node having a high probability of being the source node.
For a more complete understanding of the present invention, reference is made to the following detailed description of an embodiment considered in conjunction with the accompanying drawings, in which:
The following disclosure is presented to provide an illustration of the general principles of the present invention and is not meant to limit, in any way, the inventive concepts contained herein. Moreover, the particular features described in this section can be used in combination with the other described features in each of the multitude of possible permutations and combinations contained herein.
All terms defined herein should be afforded their broadest possible interpretation, including any implied meanings as dictated by a reading of the specification as well as any words that a person having skill in the art and/or a dictionary, treatise, or similar authority would assign particular meaning.
Further, it should be noted that, as recited in the specification and in the claims appended herein, the singular forms “a,” “an,” and “the” include the plural referents unless otherwise stated. Additionally, the terms “comprises” and “comprising” when used herein specify that certain features are present in that embodiment. However, these terms should not be interpreted to preclude the presence or inclusion of additional steps, operations, features, components, and/or groups thereof.
The present disclosure generally relates to a system and process for finding the source of a rumor or other form of information/misinformation in a social network. More particularly, the present system involves finding a candidate cluster from the plurality of clusters in the network, the candidate cluster having a high probability of containing the source of diffusion of a rumor, and then searching the candidate cluster to locate the specific source node. This process is especially suited to social networks with uncertain inter-node relationship strengths, as the randomness of inter-node relationship strengths is quantified through a probabilistic weighted graph in which the uncertainty in the network is modeled by a probability mass function (pmf).
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces.
Turning to
As seen in
The connections/relationships involved in data sharing may be considered “edges” between the nodes and can be based upon a variety of things, such as personal relationships, geographic proximity, commonly held interests, etc. The connections which offer the least resistance to information being shared from one node to another are considered strong ties (see connections 14aa-ah, 14ba-bh, and 14ca-ch), while the connections which offer the greatest such resistance are considered weak ties (see connections 16a-c). Social networks can thus be viewed as comprising several clusters of nodes (see clusters 18a, 18b, and 18c), each of which has a plurality of strong ties connecting the nodes therein (see connections 14aa-ah, 14ba-bh, and 14ca-ch), while the clusters 18a-c themselves are interconnected via weak ties (see connections 16a, 16b, and 16c). The strong ties within any given cluster of nodes indicate that the nodes therein frequently interact with each other; these ties are responsible for dissemination of information within a cluster. By comparison, the weak ties between the different clusters enable information to go “viral” and spread throughout the various clusters of a social network.
The strength of these connections can be quantified by assigning numerical weights to these connections (i.e., between 0 and 1) to represent the strengths of the relationship between these nodes, where a weight of 0 represents the least resistance to propagation of information (i.e., a strong tie) and a weight of 1 represents the greatest resistance to such propagation along that connection. Since relationships between nodes can rise and fall due to changing circumstances (e.g., losing old friendships and gaining new ones; changing levels of interest in certain subject matter), the strengths of these connections can vary over time. In such circumstances, the system and method disclosed below samples the network and weights of these connections at specific time intervals, yielding data sets that are simpler to analyze.
For example, let $w_{ij}$ be the weight of the $i$th connection at some instance $j$, and let $M_i$ be the number of distinct values for the weight of connection $i$. We can now construct an $|E| \times M$ matrix of weights, $W = [w_1, w_2, \ldots, w_M]$, where $M = \prod_{i=1}^{|E|} M_i$. The $i$th column of this matrix is a vector, $w_i$, with elements representing one possible combination of weights for each connection. We can then construct one graph, $G_i = (V, E, w_i)$, for every vector $w_i$, where $V$ and $E$ denote the set of nodes and connections, respectively. Assuming weight independence among connections, the probability of occurrence of graph $G_i$ is given by the following formula:
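Under the independence assumption, the probability of one possible graph is the product of the probabilities of the weights chosen for its individual connections. The following is a minimal sketch of that enumeration, not the disclosed implementation: the function name `enumerate_graphs`, the per-edge dictionaries, and the numeric values are all hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def enumerate_graphs(edge_pmfs):
    """Yield (weight_vector, probability) for every possible graph.

    edge_pmfs holds one dict per connection, mapping each possible
    weight of that connection to its probability (hypothetical input format).
    """
    supports = [sorted(pmf.items()) for pmf in edge_pmfs]
    for combo in product(*supports):
        weights = tuple(w for w, _ in combo)
        # Independence among connections: multiply per-edge probabilities.
        probability = prod(p for _, p in combo)
        yield weights, probability


# Illustrative pmfs for three connections (made-up numbers).
edge_pmfs = [
    {0.2: 0.7, 0.9: 0.3},  # connection 1 has M_1 = 2 possible weights
    {0.1: 0.5, 0.5: 0.5},  # connection 2 has M_2 = 2 possible weights
    {0.4: 1.0},            # connection 3 has M_3 = 1 possible weight
]

graphs = list(enumerate_graphs(edge_pmfs))
for weights, probability in graphs:
    print(weights, round(probability, 3))
```

The number of enumerated graphs equals the product of the per-connection counts (here 2 x 2 x 1 = 4), and the probabilities sum to 1. Exhaustive enumeration like this is only practical for very small graphs.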
In one example, let the unknown source of the rumor, $v^* \in V$, initiate the rumor at an unknown time $t^*$. Since there is no prior knowledge about $v^*$, all nodes are equally likely to be the source node. Moreover, each node can be either susceptible or infected, and any susceptible node can become infected independently of other nodes. We assume that the rumor diffuses along the shortest path between the source $v^*$ and each node $v \in V$. The time taken for a node $m$ to repost information from $n$ to its own neighbors on the network $G_i$ depends on the strength of the social tie between $m$ and $n$. It takes less time for any piece of information to diffuse inside a dense cluster of strong ties than across weak ties. Moreover, nodes repost what their neighbors posted with different time delay values (e.g., depending on the time of day that they are online). Assuming a Gaussian distribution for the information propagation delay along each edge $e_i \in E$, the time, $d_i$, it takes for information from the corresponding node to reach its susceptible neighbor when the weight of the tie is $w_{ij}$ is statistically distributed as the following:
$d_i \mid w_{ij} \sim \mathcal{N}\left(w_{ij} \cdot \mu_{\max},\ \sigma_{ij}^2\right)$
where the average information propagation delay for the weakest social relationship ($w_{ij} = 1$) is $\mu_{\max}$.
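The delay model above can be illustrated by drawing samples for a strong tie and a weak tie. This is a sketch under stated assumptions: the function name `sample_delay`, the parameter values, and the truncation of negative draws at zero are illustrative choices that do not appear in the disclosure.

```python
import random


def sample_delay(weight, mu_max, sigma, rng):
    """Draw one propagation delay from N(weight * mu_max, sigma^2).

    Negative draws are clipped to zero, since a delay cannot be negative
    (an assumption added for this sketch).
    """
    return max(0.0, rng.gauss(weight * mu_max, sigma))


rng = random.Random(0)
mu_max, sigma = 10.0, 1.0  # made-up values for illustration

strong = [sample_delay(0.1, mu_max, sigma, rng) for _ in range(10_000)]
weak = [sample_delay(1.0, mu_max, sigma, rng) for _ in range(10_000)]

print("mean delay, strong tie (w=0.1):", sum(strong) / len(strong))
print("mean delay, weak tie   (w=1.0):", sum(weak) / len(weak))
```

The sample mean for the weak tie should lie near `mu_max`, while the strong tie yields a much shorter average delay, matching the statement that information diffuses faster across strong ties.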
Referring back to
The first stage 104 begins with identifying and extracting the clusters in a social network (step 108). This process will result in a model of a social network similar to that seen in
Once the gateway nodes 20a-f have been identified, a subset of $k_1$ of these gateway nodes 20a-f is selected from $V^{\text{gate}}$ to act as a set of sensors (step 112). This set of sensors, $S = \{s_1, s_2, \ldots, s_{k_1}\}$, measures the arrival times of information (i.e., when and from what connection a particular rumor arrives at a particular gateway node) to estimate which cluster is the most likely candidate cluster. For instance, nodes 20a and 20e may be chosen as sensors to measure the time at which the rumor arrived at them for the first time.
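Gateway identification can be sketched as follows, assuming cluster labels are already available from a prior community-detection step. The sketch treats every connection that crosses a cluster boundary as a weak tie, which is a simplification; the function name `find_gateway_nodes` and the toy data are hypothetical.

```python
def find_gateway_nodes(edges, cluster_of):
    """Return the nodes that have at least one connection into another cluster.

    edges: iterable of (u, v) pairs.
    cluster_of: dict mapping each node to its cluster label.
    """
    gateways = set()
    for u, v in edges:
        if cluster_of[u] != cluster_of[v]:
            # Both endpoints of an inter-cluster connection are gateway nodes.
            gateways.update((u, v))
    return gateways


# Toy network: two clusters joined by the single connection b-c.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

print(sorted(find_gateway_nodes(edges, cluster_of)))
```

Sensors for the first stage would then be chosen from this gateway set rather than from all nodes in the network.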
With the measurements obtained from the sensor gateway nodes, the system then uses this information to estimate which cluster is most likely to include the source, making that cluster the candidate cluster (step 114). Since the exact time that a source begins spreading information (e.g. a rumor) is typically unknown, measurements regarding the differences in arrival times of a rumor at sensor pairs,
$\Delta t_{i1} = (t_i + t^*) - (t_1 + t^*) = t_i - t_1$
can be used to estimate in which cluster the source is located, where $t_i$ and $t_1$ are the times at which the rumor is received at the $i$th sensor and the first sensor, respectively.
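The cancellation of the unknown start time can be checked numerically. In this minimal sketch, the function name `arrival_time_differences` and the sample values are illustrative assumptions.

```python
import math


def arrival_time_differences(arrival_times):
    """Return each sensor's arrival time relative to the first sensor."""
    t1 = arrival_times[0]
    return [t - t1 for t in arrival_times[1:]]


# Propagation delays to three sensors (made-up values).
delays = [12.0, 15.5, 13.0]

# Two different, unknown start times for the same diffusion.
observed_a = [d + 100.0 for d in delays]
observed_b = [d + 250.0 for d in delays]

diffs_a = arrival_time_differences(observed_a)
diffs_b = arrival_time_differences(observed_b)
print(diffs_a, diffs_b)
```

Both observation sets produce the same difference vector, so the estimator does not need to know when the source started spreading the rumor.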
Let the arrival time difference vector be Δt=(Δt21, Δt31, . . . , Δtk
Due to the lack of prior knowledge as to which node is the source of the rumor, one embodiment of the present invention implements a maximum likelihood estimator (“MLE”), which becomes
where $P(\Delta t \mid v)$ is the probability density function of the observation vector, given that $v$ belongs to the cluster that contains the source of the rumor and that the SI model is used. Considering the statistical distribution of $\Delta t$, the optimal MLE for identifying the candidate cluster $\hat{v}^{(1)}$ is calculated using the following:
where $\mu_{v,i}(r)$ is the mean value of the difference in arrival times between the first and the $(r+1)$th sensors, and $\Lambda_{v,i}(a,b)$ is the cross-correlation of the differences in arrival times between the $a$th and the $b$th sensors. $\Pr(G_i^{\text{gate}})$ is the probability of the $i$th possible gateway graph $G_i^{\text{gate}}$. Assuming independence among edges, the probability of the $j$th possible gateway graph is calculated as the product of the probabilities of its edge weights, $\prod_{i=1}^{|E^{\text{gate}}|} \Pr(w_{ij})$, where $w_{ij}$ $(1 \le i \le |E^{\text{gate}}|)$ are the elements of the $j$th column of the matrix $W^{\text{gate}}$.
Given a typical social network, the number of possible graphs can become extremely large. In order to reduce the complexity of searching for the source of the rumor, one embodiment of the present invention involves searching amongst the $m$ most likely gateway graphs corresponding to the $m$ most likely weight vectors $w_i^{\text{gate}}$, where $m \ll M$. In such circumstances, the MLE calculation for locating the candidate cluster $\hat{v}^{(1)}$ changes to the following:
Once the candidate cluster $\hat{v}^{(1)}$ has been identified, the second stage 106 of the system 100 begins. The second stage 106 begins by graphing the nodes of the candidate cluster, $G_i^{\text{cluster}} = (V^{\text{cluster}}, E^{\text{cluster}}, w_i^{\text{cluster}})$, and selecting a subset of $k_2$ of the nodes of the candidate cluster as a second set of sensors (step 116). Thereafter, similar to step 114, the system 100 searches amongst the $m$ most likely graphs corresponding to the $m$ most likely weight vectors $w_i^{\text{cluster}}$ to locate the source of diffusion within the candidate cluster. Thus, the corresponding optimal MLE is given by
where $\Pr(G_i^{\text{cluster}})$ is the probability of the $i$th possible cluster graph and $\Delta t$ is the observation vector at the sensors. Note that the optimization problems in the MLEs for $\hat{v}^{(1)}$ and $\hat{v}^{(2)}$ have no closed-form solution; thus, a brute-force search is run through all the suspected nodes. The number of suspected nodes is equal to the size of the most likely candidate cluster, which provides the following advantages: (1) for the same level of accuracy, the percentage of sensors required is significantly lower than in the alternative algorithms discussed in Pinto, P. C., Thiran, P., Vetterli, M.: Locating the source of diffusion in large-scale networks. Phys. Rev. Lett. 109, 068702 (2012) and Luo, W., Tay, W. P., Leng, M.: How to identify an infection source with limited observations. Selected Topics in Signal Processing, IEEE Journal of 8(4), 586-597 (2014), which decreases the dimension of the matrix $\Lambda_{v,i}$ in the MLE for $\hat{v}^{(2)}$, thereby reducing its computational complexity; and (2) the likelihood function in the MLE for $\hat{v}^{(2)}$ needs to be calculated for a much smaller number of nodes than all the nodes in the network.
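The brute-force search over suspected nodes can be sketched in a deliberately simplified form. The sketch below is not the disclosed estimator: it uses a single graph instead of the $m$ most likely graphs, and it ignores the cross-correlation matrix by scoring each candidate with a sum of squared errors, which corresponds to a Gaussian likelihood only if the arrival-time differences are assumed independent with equal variance. All names (`mean_delays_from`, `estimate_source`) and the toy graph are hypothetical.

```python
import heapq


def mean_delays_from(source, adjacency):
    """Dijkstra over mean edge delays: expected delay from source to each node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, delay in adjacency[u]:
            nd = d + delay
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def estimate_source(candidates, adjacency, sensors, observed_diffs):
    """Try every candidate and keep the one whose expected arrival-time
    differences (relative to the first sensor) best match the observations.
    Assumes a connected graph so that every sensor is reachable."""
    best_node, best_score = None, float("inf")
    for v in candidates:
        dist = mean_delays_from(v, adjacency)
        expected = [dist[s] - dist[sensors[0]] for s in sensors[1:]]
        score = sum((o - e) ** 2 for o, e in zip(observed_diffs, expected))
        if score < best_score:
            best_node, best_score = v, score
    return best_node


# Toy cluster: a path 0-1-2-3-4 with made-up mean delays per connection.
adjacency = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 2.0)],
    2: [(1, 2.0), (3, 1.0)],
    3: [(2, 1.0), (4, 3.0)],
    4: [(3, 3.0)],
}
sensors = [0, 4, 2]
observed_diffs = [-0.8, -3.2]  # noisy differences relative to sensor 0

estimated = estimate_source(list(adjacency), adjacency, sensors, observed_diffs)
print("estimated source:", estimated)
```

Because only the nodes of the candidate cluster are passed as `candidates`, the cost of the search grows with the cluster size rather than with the size of the whole network.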
Source Localization Algorithm
In one embodiment, an algorithm is used to identify the source of diffusion in a social network with varying relationship strength. The first stage of the proposed algorithm, FindCluster, is depicted in Algorithm 1, shown below. As shown in Algorithm 1, the clusters/communities existing in the network are first discovered using the Louvain method. The time complexity of this method, $O(|V| \log |V|)$, is significantly lower than that of other methods to compute clusters. The gateway graph is constructed using the gateway nodes of these clusters. The algorithm SampleGraph, as seen in Algorithm 4 shown below, is used to generate $m$ of the most likely gateway graphs corresponding to the $m$ most likely weight vectors. Since it is reasonable to expect that any piece of information flows along the shortest paths into the network, the most appropriate sensor nodes will be the nodes with high betweenness centrality, where the betweenness centrality of a node $v$ is defined as
$BC(v) = \sum_{s \neq v \neq t} \frac{N_{s,t}^{SP}(v)}{N_{s,t}^{SP}}$
where $N_{s,t}^{SP}(v)$ is the number of shortest paths from $s$ to $t$ passing through node $v$, and $N_{s,t}^{SP}$ is the total number of shortest paths from node $s$ to node $t$. The number of shortest paths varies with graph size and connectivity, making it difficult to directly compare betweenness centrality (“BC”) values across the possible graphs. Thus, analysis typically focuses on betweenness centrality order, where the nodes are ranked in descending order of BC values, and the node with the highest BC value is given a betweenness centrality score (“BCS”) of 1. In this embodiment, although the size of the graph is the same, the varying weights between the nodes imply varying connectivity. Hence, we approximate the expected BCS for each node $v \in V^{\text{gate}}$ using the $m$ most likely graphs as
where $BCS_i^{\text{gate}}(v)$ is the BCS for the $v$th node in the $i$th possible graph $G_i^{\text{gate}}$. The computational complexity is $O(m \cdot |E^{\text{gate}}| \cdot |V^{\text{gate}}|)$, where $|E^{\text{gate}}|$ and $|V^{\text{gate}}|$ denote the number of edges and nodes in the graph $G_i^{\text{gate}}$, respectively. The algorithm that finds the BCS values, FindBCS, is shown in Algorithm 3 below. As shown in Algorithm 1, FindCluster selects the $k_1$ nodes with the highest betweenness centrality (line 4) and then finds the most likely cluster using the MLE for the candidate cluster $\hat{v}^{(1)}$ (lines 7-9). The algorithm FindSource, shown in Algorithm 2 below, implements the second stage of the rumor localization. As with finding the candidate cluster, FindBCS selects $k_2$ nodes from within the candidate cluster as sensors to measure the arrival times of the rumor (line 4 of Algorithm 3). Finally, the node that maximizes the likelihood value (line 9 of Algorithm 2) is chosen as the source of the information (i.e., rumor).
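[Algorithm 1 (FindCluster), Algorithm 2 (FindSource), Algorithm 3 (FindBCS), and Algorithm 4 (SampleGraph): listings not recoverable from the source text.]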
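The BCS ranking and sensor selection described above can be sketched as follows. This is an illustration rather than a reconstruction of FindBCS: it assumes the BC values for each sampled graph have already been computed elsewhere, it averages the per-graph scores with equal weight across the $m$ graphs, and it breaks ties by node name. The function names and the numeric values are hypothetical.

```python
def bc_scores(bc_values):
    """Rank nodes by descending BC value; the highest-BC node gets score 1."""
    ranked = sorted(bc_values, key=lambda v: (-bc_values[v], v))
    return {v: rank for rank, v in enumerate(ranked, start=1)}


def expected_bcs(bc_per_graph):
    """Average each node's BCS over the sampled graphs (equal weighting assumed)."""
    m = len(bc_per_graph)
    totals = {}
    for bc_values in bc_per_graph:
        for v, score in bc_scores(bc_values).items():
            totals[v] = totals.get(v, 0.0) + score
    return {v: total / m for v, total in totals.items()}


def top_k_sensors(bc_per_graph, k):
    """Pick the k nodes with the best (lowest) expected BCS as sensors."""
    avg = expected_bcs(bc_per_graph)
    return sorted(avg, key=lambda v: (avg[v], v))[:k]


# Made-up BC values for three sampled graphs over the same node set.
bc_per_graph = [
    {"a": 4.0, "b": 2.0, "c": 0.0},
    {"a": 1.0, "b": 3.0, "c": 0.0},
    {"a": 5.0, "b": 1.0, "c": 0.5},
]

print(expected_bcs(bc_per_graph))
print(top_k_sensors(bc_per_graph, 2))
```

Working with ranks rather than raw BC values keeps the scores comparable across sampled graphs whose connectivity differs.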
We performed simulations on a large dataset extracted from the Twitter network. The network was obtained by collecting Twitter users who mentioned “Python” or “data” in their posts and then tracing followers' links up to three hops. The dataset included 23,370 nodes with 33,101 interconnecting edges and had a diameter of 15 nodes.
To quantify the average information propagation delay, we extract the time difference between the time that one node $u$ tweets and the time its neighbor $v$ retweets $u$'s tweet. The mean shift method is used to cluster the propagation delay values. For each edge $e_i$ we have
where $\mu_{il}$ and $n_l$ $(1 \le l \le M_i)$ are the average information propagation delay and the number of points in the $l$th cluster, respectively. $M_i$ is the total number of clusters. Based on the equation $d_i \mid w_{ij} \sim \mathcal{N}(w_{ij} \cdot \mu_{\max}, \sigma_{ij}^2)$,
where μi1=max(μi1, μi2, . . . , μiM
The Louvain method finds 159 clusters. The average number of nodes in each cluster is 146 nodes. Therefore, on average, only 146 nodes need to be searched in the second stage, which results in low computational complexity.
To find a sufficient number of sample graphs m in the equations above, the k-nearest neighbors in uncertain graphs approach may be used. We ran an experiment in which the average shortest path distance is computed for (i) 500 sample graphs (m=500); and (ii) different sizes of sample graphs ranging from 2 to 100 (2≤m≤100). The average shortest path distance in the first case is given by
where Gs is the set of 500 most likely graphs and
is the average shortest path distance in the graph Gi and SPDij is the shortest path distance between nodes i and j. Similarly, the average shortest path distance in the second case is given by
where m varies between 2 and 100. We then calculate the mean square error (MSE) between the average shortest path distances in the first case (as ground-truth value) and in the second case. Since the MSE converges to 0.05 after 30 sample graphs in
Since there is no prior knowledge of the source of diffusion, we generate a uniformly distributed source in [1,|V|]. We simulate the information spread using the SI model. The following results are obtained by averaging over 100 independent runs. The percentage of sensors is fixed at 0.4%.
We investigate the accuracy of the multivariate Gaussian distribution assumption for the observation vector Δt in
We vary the percentage of sensors from 0.1% to 1.0% to illustrate how this parameter affects the average distance error (the distance between the actual and estimated sources). As seen from
The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
It will be understood that the embodiments described herein are merely exemplary and that a person skilled in the art may make many variations and modifications without departing from the spirit and scope of the invention. All such variations and modifications are intended to be included within the scope of the invention as defined in the appended claims.
This application claims priority to Provisional Patent Application Ser. No. 62/104,211, filed Jan. 16, 2015, the disclosure of which is incorporated by reference herein in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
9292493 | Chandramouli et al. | Mar 2016 | B2 |
20040018839 | Andric | Jan 2004 | A1 |
20110173264 | Kelly | Jul 2011 | A1 |
Entry |
---|
Doerr et al.; Why rumors spread so quickly in social networks; Commun. ACM 55(6), 70-75 (2012). |
Hill et al.; Network-based marketing: Identifying likely adopters via consumer networks; Statistical Science 21(2), 256-276 (2006). |
Comin et al.; Identifying the starting point of a spreading process in complex networks; Phys. Rev. E 84, 056105 (2011). |
Sabidussi; The centrality index of a graph; Psychometrika 31(4), 581-603 (1966). |
Newman; The structure and function of complex networks; SIAM Rev. 45, 167-256 (2003). |
Kermack, et al.; A contribution to the mathematical theory of epidemics; Proceedings of the Royal Society of London; Series A 115(772), 700-721 (1927). |
Shah; Detecting sources of computer viruses in networks: theory and experiment; In: Proceedings of the ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems. Sigmetrics '10, pp. 203-214 (2010). |
Shah et al.; Rumors in a network: Who's the culprit?; Information Theory, IEEE Transactions on 57(8), 5163-5181 (2011). |
Shah et al.; Rumor centrality: a universal source detector; In: Proceedings of the 12th ACM SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems. SIGMETRICS '12, pp. 199-210 (2012). |
Luo et al.; Identifying multiple infection sources in a network; In: Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 1483-1489 (2012). |
Prakash et al.; Spoiling culprits in epidemics: How many and which ones?; In: Data Mining (ICDM), 2012 IEEE 12th International Conference On, pp. 11-20 (2012). |
Pinto et al.; Locating the source of diffusion in large-scale networks; Phys. Rev. Lett. 109, 068702 (2012). |
Luo et al.; How to identify an infection source with limited observations; Selected Topics in Signal Processing, IEEE Journal of 8(4), 586-597 (2014). |
Louni et al.; A two-stage algorithm to estimate the source of information diffusion in social media networks; 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS); 329-333 (2014). |
Xiang et al.; Modeling relationship strength in online social networks; In: Proceedings of the 19th International Conference on World Wide Web. WWW '10, pp. 981-990 (2010). |
Weng et al.; Competition among memes in a world with limited attention; Sci. Rep. 2(335) (2012). |
Pfeiffer III et al.; Methods to determine node centrality and clustering in graphs with uncertain structure; In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (2011). |
Bakshy et al.; The role of social networks in information diffusion; In: Proceedings of the 21st International Conference on World Wide Web. WWW '12, pp. 519-528 (2012). |
Granovetter; The strength of weak ties; The American Journal of Sociology 78(6), 1360-1380 (1973). |
Blondel et al.; Fast unfolding of communities in large networks; J. Stat. Mech. P10008 (2008). |
Bonacich; Power and centrality: A family of measures; American Journal of Sociology 92(5), 1170-1182 (1987). |
Comaniciu et al.; Mean shift: a robust approach toward feature space analysis; Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(5), 603-619 (2002). |
Potamias et al.; K-nearest neighbors in uncertain graphs; Proc. VLDB Endow. 3(1-2), 997-1008 (2010). |
Barry, K.: “Ford Bets the Fiesta on Social Networking,” Wired (Apr. 2009). |
Jackson, M.O., Social and Economic Networks, Princeton University Press, Princeton, NJ, USA (Draft Date: Mar. 2008). |
Louni, A., et al., “Who Spread that Rumor: Finding the Source of Information in Large Social Networks with Varying Inter-Node Relationship Strengths” (Jan. 2015). |
Morozov, E., “Swine Flu: Twitter's Power to Misinform,” Foreign Policy (Apr. 28, 2009). |
Newman, M., “Fast Algorithm for Detecting Community Structure in networks,” Phys. Rev. E, vol. 69, p. 066133 (Jun. 18, 2004). |
Strauss, G. et al., “SEC, FBI Probe Fake Tweet That Rocked Stocks,” USA Today (Apr. 24, 2013). |
Number | Date | Country | |
---|---|---|---|
20160212163 A1 | Jul 2016 | US |
Number | Date | Country | |
---|---|---|---|
62104211 | Jan 2015 | US |