The present disclosure generally relates to identifying pathogenic social media accounts, and in particular to systems and methods for detecting pathogenic social media accounts without leveraging network structure, cascade path information, content, or user information.
The spread of harmful misinformation in social media is a pressing problem. Social media accounts that have the capability of spreading such information to viral proportions are referred to as “Pathogenic Social Media” (PSM) accounts. These accounts can include terrorist supporter accounts, water armies, and fake news writers. In addition, these organized groups/accounts spread messages regarding certain topics. Such accounts might be operated by multiple people that tweet/retweet through multiple accounts to promote/degrade an idea, which can influence public opinion. As such, identifying PSM accounts has important applications to countering extremism, the detection of water armies, and fake news campaigns. In Twitter, many of these accounts have been found to be social bots. The PSM accounts that propagate information are key to a malicious information campaign, and detecting such accounts is critical to understanding and stopping such campaigns. However, this is difficult in practice. Existing methods rely on social media message content (e.g. text, image, video, audio, hashtag, URL, or user mentions, etc.), social media user profile information (e.g. user name, photo, URL, number of tweets/retweets, location, etc.), social or network structure (e.g. number of followers, number of followees, actual list of followers, actual list of followees, number of friends, actual friends list, betweenness centrality, pagerank centrality, closeness centrality, in-degree and out-degree centralities, star and clique networks associated with users, average clustering coefficients of each user, etc.), or combinations thereof. However, reliance on information of this type leads to three challenges. First, collection, storage, and processing of message content, user profile, and/or network structure requires significantly more CPU, memory, and network bandwidth. Second, the use of content often necessitates the training of a new model for previously unobserved topics. For example, PSM accounts taking part in elections in the U.S. and Europe will likely leverage different types of content. Third, network structure information is not always available. For example, the FACEBOOK® API does not make this information available without permission of the users (which is likely a non-starter for identifying PSM accounts).
Early detection of PSM accounts is critical as these accounts are likely to be key users in the formation of harmful malicious information campaigns. This is a challenging task for social media authorities for three reasons. First, it is not always easy to manually shut down these accounts. Despite efforts to suspend these accounts, many of them simply return to social media with different accounts. Second, for the most part, the available data is imbalanced, and social network structure, which is at the core of many techniques, is not readily available. Third, PSM accounts often seek to utilize and cultivate large online communities of passive supporters to spread as much harmful information as they can. Consequently, extra efforts need to be dedicated to proposing capabilities that could be deployed by social network firms to counter PSM accounts, regardless of the underlying social network platform.
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.
As described herein, addressing the viral spread of misinformation poses various technical problems. For example, some technologies may rely upon social media message content, social media user profile information, social or network structure, or combinations thereof, and therefore require excess computing power, memory, and network bandwidth. The present novel concept provides a technical solution to these technical problems by defining a sophisticated computational framework that determines how causal a user is with respect to a cascade by comparison to others using historical cascade event data. The concept further differs from previous approaches by focusing on malicious social media accounts that make messages “go viral”. The framework may supplement but does not require the analysis of social media message content, social media user profile information, or social or network structure to identify PSM accounts. Rather, the concept relies primarily on the evaluation of how likely a given social media account is to cause a message to spread virally. Thus, the present disclosure involves significantly less CPU, memory, and network bandwidth, and at the same time, provides a more accurate and efficient system for detecting malicious accounts. Further, because the concept does not rely upon social media content, it can potentially detect pathogenic social media accounts used to spread new types of malicious information where the content differs greatly from training data. This approach is potentially complementary to content-based methods.
Throughout this disclosure, cascades shall be represented as an “action log” (Actions) of tuples, where each tuple (u, m, t)∈Actions corresponds with a user u∈U posting message m∈M at time t∈T. It is assumed that set M includes posts/reposts of a certain original tweet or message. For a given message, only the first occurrence of each user is considered. The present disclosure defines Actionsm as the subset of Actions for a specific message m. Formally, it is defined as Actionsm={(u′, m′, t′)∈Actions s.t. m′=m}.
Definition 1. (m-participant). For a given m∈M, user i is an m-participant if there exists t such that (i, m, t)∈Actions.
Note that users posting tweets/retweets in the early stage of a cascade are the most important ones, since they play a significant role in advertising the message and making it viral. For a given m∈M, an m-participant i “precedes” m-participant j if there exists t<t′ where (i, m, t), (j, m, t′)∈Actions. Thus, key users are defined as a set of users adopting a message in the early stage of its life span. These users appear in the “beginning” of a cascade, before a specified fraction of users join the cascade. The present disclosure defines a key user as follows:
Definition 2. (Key User). For a given message m, m-participant i, and Actionsm, user i is a key user iff user i precedes at least a ϕ fraction of m-participants (formally: |Actionsm|×ϕ≤|{j|∃t′:(j, m, t′)∈Actionsm∧t′>t}|, where (i, m, t)∈Actionsm and ϕ∈(0, 1)).
The notation |⋅| denotes the cardinality of a set. Not all messages are equally important; only a small portion of them become popular. These viral messages are defined as follows:
Definition 3. (Viral Messages). For a given threshold θ, a message m∈M is viral iff |Actionsm|≥θ. Mvir is used to denote the set of viral messages.
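By way of a concrete illustration, the action log and Definitions 2-3 translate directly into code. The following is a minimal Python sketch; the tuple layout and helper names are illustrative assumptions, not part of the disclosure:

```python
from collections import Counter

# Actions: a list of (u, m, t) tuples, e.g. ("user_42", "msg_7", 1456099200).

def actions_for_message(actions, m):
    """Actions_m: the sub-log for message m, keeping only each user's
    first occurrence (per the convention above)."""
    first = {}
    for (u, m2, t) in sorted(actions, key=lambda a: a[2]):
        if m2 == m and u not in first:
            first[u] = (u, m2, t)
    return list(first.values())

def is_key_user(actions_m, i, phi=0.5):
    """Definition 2: i is a key user of m iff i precedes at least a phi
    fraction of the m-participants."""
    t_i = {u: t for (u, _, t) in actions_m}[i]
    later = sum(1 for (_, _, t) in actions_m if t > t_i)
    return later >= phi * len(actions_m)

def viral_messages(actions, theta):
    """Definition 3: M_vir, the set of messages whose cascade size
    reaches the threshold theta."""
    sizes = Counter(m for (_, m, _) in actions)
    return {m for m, n in sizes.items() if n >= theta}
```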
Definition 3 allows the computation of the prior probability of a message (cascade) going viral as follows:

ρ=|Mvir|/|M|  (1)
The probability of a cascade m going viral given that some user i was involved is defined as:

pm|i=P(m∈Mvir|i is an m-participant)=|{m∈Mvir s.t. i is an m-participant}|/|{m∈M s.t. i is an m-participant}|  (2)
The present disclosure is also concerned with two other measures. First, the probability that two users i and j tweet or retweet viral post m chronologically, and both are key users; in other words, these two users are making post m viral:

pi,j=P(m∈Mvir|i and j are key users of m and i precedes j)  (3)
Second, the probability that key user j tweets/retweets viral post m and user i does not tweet/retweet it earlier than j; in other words, only user j is making post m viral:

p¬i,j=P(m∈Mvir|j is a key user of m and i does not precede j)  (4)
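Reading equations (1)-(4) as relative frequencies over the action log, these probabilities can be estimated by counting, as sketched below; the frequency estimators (and p¬i,j, which follows analogously to p_pair) are assumptions layered on the sketch above:

```python
def prior_viral(actions, theta):
    """Equation (1): rho, the fraction of all messages that go viral."""
    messages = {m for (_, m, _) in actions}
    return len(viral_messages(actions, theta)) / len(messages)

def p_viral_given_user(actions, i, theta):
    """Equation (2): the fraction of the messages i participated in
    that went viral."""
    involved = {m for (u, m, _) in actions if u == i}
    m_vir = viral_messages(actions, theta)
    return len(involved & m_vir) / len(involved) if involved else 0.0

def p_pair(actions, i, j, theta, phi=0.5):
    """Equation (3): among cascades where key users i and j both appear
    with i preceding j, the fraction that went viral."""
    m_vir = viral_messages(actions, theta)
    both = viral = 0
    for m in {m for (_, m, _) in actions}:
        am = actions_for_message(actions, m)
        times = {u: t for (u, _, t) in am}
        if (i in times and j in times and times[i] < times[j]
                and is_key_user(am, i, phi) and is_key_user(am, j, phi)):
            both += 1
            viral += m in m_vir
    return viral / both if both else 0.0
```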
Knowing the action log, the aim is to find a set of pathogenic social media (PSM) accounts. These users are associated with the early stages of large information cascades and, once detected, are often deactivated by a social media firm. In the causal framework disclosed herein, a series of causality-based metrics is introduced for identifying PSM users.
The present disclosure adopts a causal inference framework. In particular, the present disclosure expands upon the causal inference framework in two ways: (1) the causal framework addresses the problem of identifying PSM accounts; and (2) the present disclosure extends the single causal metric to a set of metrics. Multiple causality measurements provide a stronger determination of significant causality relationships. For a given viral cascade, the causal framework seeks to identify potential users who likely caused the cascade to go viral. An initial set of criteria is first required for such a causal user. This is done by instantiating the notion of prima facie causes for the particular use case below:
Definition 4. (Prima Facie Causal User). A user u is a prima facie causal user of cascade m iff user u is a key user of m, m∈Mvir, and pm|u>ρ.
For a given cascade m, the phrase “prima facie causal user” is used to describe user i as a prima facie cause for m being viral. In determining whether a given prima facie causal user is causal, other “related” users must be considered. In this disclosure, i and j are m-related if (1.) i and j are both prima facie causal users for m, (2.) i and j are both key users for m, and (3.) i precedes j. Hence, the set of “related users” for user i (denoted R(i)) is defined as follows:
R(i)={j s.t. j≠i, ∃m∈M s.t. i, j are m-related}  (5)
Therefore, pi,j in (3) is the probability that cascade m goes viral given both users i and j, and p¬i,j in (4) is the probability that cascade m goes viral given that key user j tweets/retweets it while key user i does not tweet/retweet m or does not precede j. The idea is that if pi,j−p¬i,j>0, then user i is more likely a cause than j for m to become viral. The Kleinberg-Mishra causality (εK&M) is measured as the average of this quantity to determine how causal a given user i is:

εK&M(i)=Σj∈R(i)(pi,j−p¬i,j)/|R(i)|  (6)
Intuitively, εK&M measures the degree of causality exhibited by user i. In addition, it was found useful to include a few other measures. The relative likelihood causality (εrel) assesses the relative difference between pi,j and p¬i,j, using an infinitesimal constant α to avoid division by zero. This assists in finding new users that may not be prioritized by εK&M. It was also found that if a user mostly appears after users with high values of εK&M, then it is likely to be a PSM account. One could consider all possible combinations of events to capture this situation; however, this approach is computationally expensive. Therefore, Q(j) is defined as follows:
Q(j)={i s.t. j∈R(i)}  (9)
Consider the following example:
Consider two cascades (actions) τ1={A, B, C, D, E, F, G, H} and τ2={N, M, C, A, H, V, S, T}, where the capital letters signify users. The aim is to relate key users with ϕ=0.5 (Definition 2). Table I shows the related users R(.) for each cascade. Note that the final set R(.) for each user is the union of all sets from the cascades. The sets Q(.) for the users of Table I are presented in Table II. The derivation can also be checked programmatically, as sketched below.
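The following sketch derives the R(.) and Q(.) sets for τ1 and τ2 mechanically; for brevity it applies only the key-user and precedence conditions of m-relatedness, omitting the prima facie probability check:

```python
def key_users(cascade, phi=0.5):
    """Users preceding at least a phi fraction of participants, in order."""
    n = len(cascade)
    return [u for pos, u in enumerate(cascade) if (n - pos - 1) >= phi * n]

def related_users(cascades, phi=0.5):
    """R(i): users j that are key users of a common cascade with i,
    with i preceding j (unioned across cascades)."""
    R = {}
    for cascade in cascades:
        keys = key_users(cascade, phi)
        for a, i in enumerate(keys):
            for j in keys[a + 1:]:
                R.setdefault(i, set()).add(j)
    return R

def q_sets(R):
    """Q(j): users i whose related set R(i) contains j."""
    Q = {}
    for i, js in R.items():
        for j in js:
            Q.setdefault(j, set()).add(i)
    return Q

tau1 = ["A", "B", "C", "D", "E", "F", "G", "H"]
tau2 = ["N", "M", "C", "A", "H", "V", "S", "T"]
R = related_users([tau1, tau2])  # e.g. R["C"] == {"D", "A"}
Q = q_sets(R)                    # e.g. Q["C"] == {"A", "B", "N", "M"}
```

With ϕ=0.5, the key users are {A, B, C, D} for τ1 and {N, M, C, A} for τ2, so, for example, C is related to D (from τ1) and to A (from τ2).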
Accordingly, the neighborhood-based causality (εnb) is defined as the average εK&M(i) for all i∈Q(j):

εnb(j)=Σi∈Q(j)εK&M(i)/|Q(j)|  (10)
The intuition behind this metric is that accounts retweeting a message that was tweeted/retweeted by several causal users are potential PSM accounts. The weighted neighborhood-based causality (εwnb) is defined analogously, with each term εK&M(i) weighted by a user-specific weight wi.
The intuition behind the metric εwnb is that the users in Q(j) may not have the same impact on user j, and thus different weights wi are assigned to each user i when averaging εK&M(i).
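Given the pairwise probabilities, the averaging metrics follow directly. A sketch is given below; normalizing εwnb by the total weight is an assumption, as the disclosure leaves the exact treatment of the weights wi open:

```python
def eps_km(i, R, p, p_neg):
    """Kleinberg-Mishra causality (6): average of p[i,j] - p_neg[i,j]
    over the related users R(i)."""
    js = R.get(i, set())
    if not js:
        return 0.0
    return sum(p[(i, j)] - p_neg[(i, j)] for j in js) / len(js)

def eps_nb(j, Q, R, p, p_neg):
    """Neighborhood-based causality (10): average eps_km(i) over Q(j)."""
    users = Q.get(j, set())
    if not users:
        return 0.0
    return sum(eps_km(i, R, p, p_neg) for i in users) / len(users)

def eps_wnb(j, Q, R, p, p_neg, w):
    """Weighted variant: each i in Q(j) contributes eps_km(i) with
    weight w[i], e.g. proportional to i's participation rate."""
    users = Q.get(j, set())
    total = sum(w[i] for i in users)
    if total == 0:
        return 0.0
    return sum(w[i] * eps_km(i, R, p, p_neg) for i in users) / total
```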
The goal is to find potential PSM accounts from the cascades. Assigning a score to each user and applying a threshold-based algorithm is one way of selecting users. As discussed above, causality metrics were defined where each metric, or a combination of metrics, can serve as a strategy for assigning scores. Users with high values for the causality metrics are more likely to be PSM accounts; this relationship between the measurements and the real world is demonstrated by identifying accounts that were eventually deactivated.
Problem 1. (Threshold-Based Problem). Given a causality metric εk where k∈{K&M, rel, nb, wnb}, a threshold θ, and a set of users U, we wish to identify the set {u∈U s.t. εk(u)≥θ}.
It was found that considering a set of cascades as a hypergraph, where the users of each cascade are connected to each other, can better model the PSM accounts. The intuition is that densely connected users with high causality values are the most likely PSM accounts. In other words, the interest is in selecting a user if (1.) it has a score higher than a specific threshold or (2.) it has a lower score but occurs in cascades where high-score users occur. Therefore, the label propagation problem is defined as follows:
Problem 2. (Label Propagation Problem). Given a causality metric εk where k∈{K&M, rel, nb, wnb}, parameters θ and λ, a set of cascades T={τ1, τ2, . . . , τn}, and a set of users U, we wish to identify the sets S1, S2, . . . , Sl, . . . , S|U|, where Sl={u|∀τ∈T, ∀u∈(τ\Sl−1), εk(u)≥(Hlτ−λ)} and Hlτ=min{εk(u) s.t. ∀u∈τ∧u∈∪l′∈[1,l)Sl′}.
In another problem statement, the framework seeks to determine whether causal or decay-based metrics could be leveraged for identifying PSM accounts at an early stage of their life span. Note that each user can now be represented by a causality vector x∈Rd, which is computed by any of the causality metrics (summarized in Table 3 below) over a period of time. The problem of early identification of PSM accounts is formally defined as follows.
Given an action log A and a user u where ∃t s.t. (u, m, t)∈A, the goal is to determine whether u is a PSM account using the causality vector x computed from the portion of A available at the time of prediction.
With respect to Problem 1 discussed above, the framework uses a map-reduce programming model to calculate the causality metrics. In this approach, users with a causality value greater than or equal to a specific threshold are selected. This approach is referred to as the Threshold-Based Selection Approach.
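The disclosure's experiments used Scala Spark; the following PySpark sketch is an illustrative Python translation of the map-reduce computation and thresholding, with the pair layout and sample values purely hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="psm-threshold-selection")

# Map step: emit one (user, p_ij - p_neg_ij) contribution per related pair;
# reduce step: aggregate into (sum, count) and average to obtain eps per user.
contribs = sc.parallelize([("u1", 0.4), ("u1", 0.5), ("u2", 0.1)])
scores = (contribs
          .aggregateByKey((0.0, 0),
                          lambda acc, v: (acc[0] + v, acc[1] + 1),
                          lambda a, b: (a[0] + b[0], a[1] + b[1]))
          .mapValues(lambda s: s[0] / s[1]))  # average over R(i)

theta = 0.7
selected = scores.filter(lambda kv: kv[1] >= theta).keys().collect()
```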
With respect to Problem 2, label propagation algorithms iteratively propagate the labels of a seed set to their neighbors. All nodes, or a subset of nodes in the graph, are usually used as the seed set. A Label Propagation Algorithm (Algorithm 1) is proposed to solve Problem 2. First, users with a causality value greater than or equal to a specific threshold (e.g., θ=0.9) are taken as the seed set. Then, in each iteration, every selected user u can activate a user u′ if the following two conditions are satisfied: (1.) u and u′ have at least one cascade (action) in common, and (2.) εk(u′)≥εk(u)−λ, λ∈(0, 1). Note that a minimum threshold, such as 0.7, is set that all selected users must satisfy. In this algorithm, the inputs are a set of cascades (actions) T, a causality metric εk, and two parameters θ, λ in (0, 1). This algorithm is illustrated by a toy example:
Assuming the minimum acceptable value is set to 0.7, users C and E would not be activated in this algorithm. Assuming the two parameters θ=0.9 and λ=0.1, both users A and G get activated.
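A compact rendering of Algorithm 1 follows. This is a sketch of the iterative selection as described above, with the data structures (cascades as user lists, eps as a per-user score map) assumed for illustration:

```python
def prosel(cascades, eps, theta=0.9, lam=0.1, floor=0.7):
    """Seed with users whose score meets theta; a selected user u then
    activates u' when they share a cascade and eps[u'] >= eps[u] - lam,
    subject to the global minimum threshold (floor)."""
    selected = {u for u in eps if eps[u] >= theta}
    frontier = set(selected)
    while frontier:
        activated = set()
        for cascade in cascades:
            members = set(cascade)
            for u in frontier & members:
                for v in members:
                    if (v not in selected
                            and eps.get(v, 0.0) >= floor
                            and eps[v] >= eps[u] - lam):
                        activated.add(v)
        selected |= activated
        frontier = activated  # newly selected users seed the next round
    return selected
```

On the toy example above, C and E fall below the 0.7 floor and are never activated, while A and G are selected.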
Proposition 1. Given a set of cascades T, a threshold θ, a parameter λ, and causality values εk where k∈{K&M, rel, nb, wnb}, ProSel returns a set of users R={u|εk(u)≥θ, or ∃u′ s.t. u and u′ share a cascade in T, εk(u)≥εk(u′)−λ, and u′ is picked}. Set R is equivalent to the set S in Problem 2.
Proposition 2. The time complexity of Algorithm ProSel is O(|T|×avg(log(|T|))×|U|)
Previous causal metrics do not take the time-decay effect into account while computing causality scores. In other words, they assume a steady trend when computing the causality of users. This is an unrealistic assumption, since the causal effect of users may change over time. Here, a generic decay-based metric is introduced, which addresses the time-decay problem by assigning different weights to different time points of a given time interval, inversely proportional to their distance from t. In more detail, this metric does the following: (1) breaks down the given time interval into shorter time periods, using a sliding time window, (2) deploys an exponential decay function of the form f(x)=e−σx to account for the time-decay effect, and (3) averages the causality values computed over the sliding time windows. ξk is formally defined as follows, which by varying k∈{K&M, rel, nb, wnb} can be used to compute a decay-based version of each of the previous causal metrics.
Here, σ is the scaling parameter of the exponential decay function, T′={t′|t′=t0+j×δ, j∈ℕ∧t′≤t−δ}, and δ is a small fixed amount of time used as the length of the sliding time window Δ=[t′−δ, t′].
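A sketch of this computation follows; the base metric is passed in as a window-restricted callable, and normalizing by the sum of the decay weights (rather than the window count) is an assumption:

```python
import math

def decay_causality(eps_window, t0, t, delta=5.0, sigma=1e-3):
    """Decay-based metric xi_k: slide a window of length delta over
    [t0, t], score each window with the base causality metric, and
    weight windows near t exponentially more."""
    vals, weights = [], []
    t_prime = t0 + delta
    while t_prime <= t:
        w = math.exp(-sigma * (t - t_prime))  # decays with distance from t
        vals.append(w * eps_window(t_prime - delta, t_prime))
        weights.append(w)
        t_prime += delta
    return sum(vals) / sum(weights) if weights else 0.0
```

Here eps_window(a, b) stands for any of the four causal metrics computed only over actions falling in [a, b].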
Having defined the problem of early identification of PSM accounts, the question of whether leveraging the community structure of PSM accounts could boost classification performance is examined, by checking whether users in a given community establish stronger causal-based relationships with each other than with users in the rest of the communities.
To answer this question, since the network structure and its underlying graph are not available, a hypergraph G=(V, E) is built from A by connecting any pair of users who have posted the same message chronologically. In this hypergraph, V is a set of vertices (i.e. users) and E is a set of directed edges between users. For the sake of simplicity, and without loss of generality, the edges of this hypergraph are made undirected. Next, the LOUVAIN algorithm is applied to G to obtain a set of communities C, and a two-sample t-test is conducted with the following hypotheses:

H0: va≥vb, H1: va<vb  (14)
where the null hypothesis is that users in a given community establish weak causal relations with each other as opposed to users in other communities. To work this out, two vectors va and vb are constructed as follows. va is generated by computing the Euclidean distance between the causality vectors of each pair of users (ui, uj) from the same community Cl∈C; therefore, va contains exactly ½Σl=1|C||Cl|·(|Cl|−1) elements in total. Likewise, vb, of size Σl=1|C||Cl|, is constructed by computing the Euclidean distance between each user ui in community Cl∈C and a random user uk chosen from the rest of the communities, i.e., C\Cl. The null hypothesis is rejected at significance level α=0.01 with a p-value of 4.945e-17. Thus, it is concluded that users in the same community are more likely to establish stronger causal relationships with each other than with users in the rest of the communities; the answer to the posed question is thus positive. Note that, for brevity, t-test results are reported only for 10% of the training set; making similar arguments for other percentages is straightforward.
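The construction of va and vb and the one-sided test can be reproduced with scipy, as sketched below under the assumption that each user's causality vector is stored in a dict:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import ttest_ind

def community_ttest(vectors, communities, seed=0):
    """Build v_a (within-community distances) and v_b (distance from each
    user to a random user outside its community), then test H1: v_a < v_b."""
    rng = np.random.default_rng(seed)
    users = list(vectors)
    v_a, v_b = [], []
    for comm in communities:
        comm = list(comm)
        outside = [u for u in users if u not in set(comm)]
        for idx, ui in enumerate(comm):
            for uj in comm[idx + 1:]:
                v_a.append(euclidean(vectors[ui], vectors[uj]))
            v_b.append(euclidean(vectors[ui], rng.choice(outside)))
    return ttest_ind(v_a, v_b, alternative="less")  # reject H0 if p < 0.01
```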
Next, a causal community detection-based classification algorithm, namely C2DC, is utilized that takes into account the causality scores of users and their underlying community structure to perform early detection of PSM accounts. The first step of the algorithm involves finding the communities of the hypergraph using the LOUVAIN algorithm. Each unlabeled user is then assigned the dominant label among its k nearest neighbors (in terms of causality vectors) within its own community.
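A sketch of C2DC's two steps follows, using networkx's LOUVAIN implementation as a stand-in for the community detection package named later, with k=10 as in the experiments; the "normal" fallback label is an assumption:

```python
from collections import Counter
import networkx as nx
import numpy as np

def c2dc(G, causality, labels, k=10):
    """Step 1: find LOUVAIN communities of the hypergraph G. Step 2: give
    each unlabeled user the dominant label among its k nearest labeled
    neighbors (Euclidean distance over causality vectors) in its community."""
    predicted = {}
    for comm in nx.community.louvain_communities(G):
        comm = list(comm)
        labeled = [v for v in comm if v in labels]
        for u in comm:
            if u in labels:
                continue  # training user; label already known
            nearest = sorted(labeled, key=lambda v: np.linalg.norm(
                np.asarray(causality[u]) - np.asarray(causality[v])))[:k]
            votes = Counter(labels[v] for v in nearest)
            predicted[u] = votes.most_common(1)[0][0] if votes else "normal"
    return predicted
```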
A dataset was collected from February 2016 to May 2016, consisting of ISIS-related tweets/retweets in Arabic. The dataset contains different fields, including tweets and associated information such as user ID, retweet ID, hashtags, content, date, and time. The dataset also contains user profile information, including name, number of followers/followees, description, location, etc. About 53M tweets were collected based on 290 hashtags such as Terrorism, State of the Islamic-Caliphate, Rebels, Burqa State, Bashar-Assad, Ahrar Al-Sham, and Syrian Army. In this dataset, about 600K tweets have at least one URL (i.e., event) referencing one of the social media platforms or news outlets. There are about 1.4M paired URLs, denoted by μ1→μ2, indicating a retweet (with the URL μ2) of the original tweet (with the URL μ1). In this disclosure, only the tweets (more than 9M) associated with viral cascades are used. The statistics of the dataset are presented in Table III, discussed in detail below.
Cascades. The present disclosure aims to identify PSM accounts, which in this dataset are mainly social bots or terrorism-supporting accounts that participate in viral cascades. The tweets have been retweeted from 102 to 18,892 times. This leads to more than 35K cascades, which are tweeted or retweeted by more than 1M users. The distribution of the number of cascades versus cascade size is illustrated in the accompanying figures.
Users. There are more than 1M users that have participated in the viral cascades.
User's Current Status. Key users are selected that have tweeted or retweeted a post in its early life span, i.e., among the first half of the users (according to Definition 2, ϕ=0.5), and it is checked whether they are active or not. Accounts are inactive if they are suspended or deleted. More than 88% of the users are active, as shown in Table IV. The statistics of the generator users are also reported. Generator users are those that have initiated a viral cascade. As shown, 90% of the generator users are active as well. Moreover, there are a significant number of cascades with hundreds of inactive users. The number of inactive users in every cascade is illustrated in the accompanying figures.
Generator Users. In this part, only users that have generated the viral tweets are considered. According to Table IV, there are more than 7K active and 800 inactive generator users. That is, more than 10% of the generator users are suspended or deleted, which means they are potentially automated accounts. The distribution of the number of tweets generated by generator users shows that most of the users (whether active or inactive) have generated a few posts (less than or equal to 3), while only a limited number of users have a large number of tweets.
Impact of μ1 on μ2. In this part, the impact of the URL μ1 on μ2 is investigated. The dataset contains 35K cascades (i.e., sequences of events) of different sizes and durations, some of which contain paired URLs in the aforementioned form. After pre-processing and removing duplicate users from cascades (those who retweet themselves multiple times), cascade sizes (i.e. the number of associated postings) vary between 20 and 9,571, and cascades take from 10 seconds to 95 days to finish. The log-log distribution of cascades versus cascade size and the cumulative distribution of cascades are depicted in the accompanying figures.
The statistics of the dataset are presented in Table V. For labeling, the Twitter API is checked to examine whether the users have been suspended (labeled as PSM) or are still active (labeled as normal). According to Table V, 11% of the users in the dataset are PSMs and the others are normal. The total number of PSM accounts suspended by Twitter in each cascade is depicted in the accompanying figures.
Twitter deploys a URL shortener technique to leave more space for content and protect users from malicious sites. To obtain the original URLs, a URL unshortening tool is used. This tool obtains the original links contained in the tweets of the dataset.
A number of major and well-known social media platforms are considered, including Facebook, Instagram, Google, and YouTube. Regarding the dichotomy of mainstream and alternative media, it is notable that most criteria for determining whether a news source counts as either are based on a number of factors, including but not limited to the content and whether or not the source is corporate owned. However, a key difference between these two sources of media comes from the fact that mainstream media is profit oriented, in contrast to alternative media. It is further noted that, for the most part, mainstream media is considered a more credible source than alternative media, although its reputation has recently been tainted by fake news.
Popular news outlets such as The New York Times and The Wall Street Journal are considered mainstream, and less popular ones alternative. Table VI summarizes the total number of paired URLs (i.e. μ1→μ2) in which the original URL (i.e. μ1) corresponds to each social media platform with at least one event in the dataset. Table VII summarizes the total number of paired URLs whose original URL belongs to mainstream and alternative news sources. Table VIII shows the breakdown of the number of paired URLs for the PSM and normal users. Table IX further presents some examples of the mainstream and alternative news URLs used in this work.
Here, the differences between PSM accounts and their counterparts, normal users, are presented through temporal analysis of their posted URLs.
In the previous section, a data analysis demonstrating differences between PSM accounts and normal users in terms of the URLs they post on Twitter was presented. The next step is to assess their impact via a well-known mathematical process called the “Hawkes process”. A Hawkes process is used since the platforms introduced are dependent and thus not disjoint, meaning that they are affected by each other while also having their own background events.
In many scenarios, one has to deal with timestamped events, such as the activity of users on a social network recorded in continuous time. An important task then is to estimate the influence of the nodes based on their timestamp patterns. A point process is a principled framework for modeling such event data, where the dynamics of the point process can be captured by its conditional intensity function:

λ(t)dt=𝔼[dN(t)|ℋt]

where 𝔼[dN(t)|ℋt] is the expected number of events in the interval (t, t+dt] given the historical observations ℋt, and N(t) records the number of events before time t. A point process can be equivalently represented as a counting process N={N(t)|t∈[0,T]} over the time interval [0,T].
The Hawkes process framework has been used in many problems that require modeling complicated event sequences where historical events have an impact on future ones. Examples include, but are not limited to, financial analysis, seismic analysis, and social network modeling. A one-dimensional Hawkes process is a point process Nt with the following particular form of intensity function:

λ(t)=μ+a Σti<t g(t−ti)

where μ≥0 is the background (exogenous) rate, g is a triggering kernel, and a≥0 scales the influence of past events.
Here, the exponential kernel of the form g(t)=ωe−ωt is used, but adapting to other positive kernels is straightforward. The second part of the above formulation captures the self-exciting nature of the point process: the occurrence of events in the past has a positive impact on the occurrence of future ones. Given a sequence of events {ti}i=1n observed in [0, T] and generated from the above intensity function, the log-likelihood function can be obtained as follows:

ℒ=Σi=1n log λ(ti)−∫0T λ(t)dt
Here, the focus is on multi-dimensional Hawkes processes, defined by a U-dimensional point process Ntu, u=1, . . . , U. In other words, U Hawkes processes are coupled with each other: each Hawkes process corresponds to one of the platforms, and the influence between them is modeled using the mutually exciting property of multi-dimensional Hawkes processes. The influence of different events on each other is formally modeled as:

λu(t)=μu+Σi:ti<t auui g(t−ti)

where μu is the background rate of the u-th process and the coefficient auui captures the degree to which an event in dimension ui excites future events in dimension u.
Consider an infectivity matrix A=[auu′]∈ℝU×U, where each entry auu′ captures the strength of influence that events in dimension u′ exert on the occurrence of future events in dimension u.
The goal is to assess the influence of the PSM accounts in the dataset via their posted URLs. The URLs posted by two groups of users are considered: (1) PSM accounts and (2) normal users. For both groups, a Hawkes model with K=7 point processes is fitted, one process for each category of social media platform and news outlet.
In each of the Hawkes models, every process is able to influence all the others, including itself, which allows for the estimation of the strength of the connections between each of the seven categories for both groups of users, in terms of how likely an event (i.e. a posted URL) is to cause subsequent events in each of the groups.
The ADM4 fitting procedure, which combines Lasso and nuclear-norm regularization, is employed to jointly estimate the infectivity matrix A and the exogenous background rates.
Two different sets of URLs, posted by the PSM accounts and by the normal users, are considered by selecting URLs that have at least one event in Twitter (i.e. posted by a user). For each group, event sequences over D=7 dimensions observed during the time period T are constructed, where each sequence of events is of the form S={(ti, ui)}i=1nS, with ti the timestamp and ui the dimension (category) of the i-th event.
The ADM4 estimation is run separately for each group to obtain the corresponding infectivity matrix.
The number of nodes is set to D=7 to reflect the 7 platforms used in this analysis. The level of penalization is set to C=1000, and the ratio of the Lasso-Nuclear regularization mixing parameter is set to 0.5. Finally, the maximum number of iterations for solving the optimization is set to 50, and the tolerance of the solving algorithm is set to 1e-5.
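These hyperparameters (C=1000, Lasso-nuclear mixing ratio 0.5, 50 iterations, tolerance 1e-5) match the interface of the tick library's HawkesADM4 estimator; the sketch below assumes that estimator and uses synthetic timestamps in place of the real event sequences:

```python
import numpy as np
from tick.hawkes import HawkesADM4

# One realization = a list of D arrays of event timestamps, one array per
# platform/news-outlet category (D = 7); synthetic data for illustration.
rng = np.random.default_rng(0)
events = [[np.sort(rng.uniform(0, 100, size=20)) for _ in range(7)]]

model = HawkesADM4(decay=1.0, C=1000, lasso_nuclear_ratio=0.5,
                   max_iter=50, tol=1e-5)
model.fit(events)

infectivity = model.adjacency  # A = [a_uu']: cross-platform influence
baseline = model.baseline      # exogenous background rates mu_u
```

Fitting the model separately on the PSM and normal groups yields the two infectivity matrices compared in the analysis.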
The behavior of the causality metrics is examined next. Users are analyzed based on their current account status in Twitter. A user is labeled as active (inactive) if the account is still active (suspended or deleted).
Kleinberg-Mishra Causality. Users with a causality value εK&M greater than or equal to 0.5 were studied. As expected, inactive users exhibit a different distribution from active users.
Neighborhood-Based Causality. Users with a causality value εnb greater than or equal to 0.5 were studied. As expected, inactive users exhibit a different distribution from active users, as shown in the accompanying figures.
Weighted Neighborhood-Based Causality. This metric is the weighted version of the previous metric (εnb). A weight is assigned to each user in proportion to her participation rate in the viral cascades.
Scala Spark and Python 2.7x were run on an Intel Xeon CPU (1.6 GHz) with 128 GB of RAM running Windows 7. The parameter ϕ of Definition 2 was set to 0.5 for labeling key users; that is, key users are those that participate in the action before the number of participants doubles.
In the following sections, two proposed approaches are considered: (1) the Threshold-Based Selection Approach, selecting users based on a specific threshold, and (2) the Label Propagation Selection Approach, selecting users by applying Algorithm 1. The intuition behind the latter approach is to select a user if it has a score higher than a threshold, or if it has a lower score but occurs in cascades where high-score users exist. The methods are evaluated based on true positives (True Pos), false positives (False Pos), precision, and the average (Avg CS) and median (Med CS) cascade size of the detected PSM accounts. Note that in this problem, precision is the most important metric: labeling an account as PSM means it should be deleted, but removing a real user is costly. Therefore, it is important to have high precision to prevent removing real users.
In the proposed method, all features that could be extracted from the dataset include tweet syntax (average number of hashtags, average number of user mentions, average number of links, average number of special characters), tweet semantics (LDA topics), and user behavior (tweet spread, tweet frequency, tweet repeats). Three existing methods were used to detect PSM accounts: 1) Random selection: this method achieves a precision of 0.11, which also shows that the data is imbalanced and less than 12% of the users are PSM accounts. 2) Sentimetrix (Sentimet.): the data was clustered by the DBSCAN algorithm; the labels were then propagated from 40 initial users to the users in each cluster based on a similarity metric, and Support Vector Machines (SVM) were used to classify the remaining PSM accounts [10]. 3) Classification methods: in this experiment, the same labeled accounts were used as in the previous experiment, and different machine learning algorithms were applied to predict the labels of the other samples. Features were grouped into three categories based on limitations of access to data. First, only content information (Content) is used to detect the PSM accounts. Second, content-independent features (NoCont.) are used to classify users. Third, all features (Allfeat.) are applied to discriminate PSM accounts. The best result for each setting is achieved when applying Random Forest using all features. According to the results, this method achieves the highest precision of 0.16. Note that most of the features used in the previous work and in this baseline take advantage of both content and network structure. However, there are situations where network information and content do not exist; in this situation, the best baseline has a precision of 0.15. The average (Avg CS) and median (Med CS) sizes of the cascades in which the selected PSM accounts have participated were also studied. Table X also illustrates the false positives, true positives, and precision of the different methods.
In this experiment, all users that satisfied the thresholds were selected, and it was checked whether they were active or not. A user is inactive if the account is suspended or closed. Since the dataset is not labeled, inactive users were labeled as PSM accounts. The threshold for all metrics was set to 0.7, except for relative likelihood causality (εrel), which was set to 7. Two types of experiments were conducted: first, user selection for a given causality metric was studied; second, selection using combinations of metrics was studied.
Single Metric Selection. In this experiment, users were selected based on each individual metric. As expected, these metrics can help filter out a significant number of active users and overcome the data imbalance issue. Metric εrel achieves the largest recall in comparison with the other metrics; however, it also has the maximum number of false positives. Table XI shows the performance of each metric. The precision value varies from 0.43 to 0.66, and metric εwnb achieves the best value. Metric εrel finds the more important PSM accounts, with an average cascade size of 567.78 and a median of 211. In general, the detected PSM accounts have participated in larger cascades in comparison with the baseline methods. It was also observed that these metrics cover different regions of the search space; in other words, they select different user sets with little overlap between each other. The common users between each pair of metrics are illustrated in Table XII. Considering the union of all metrics, 36,983 and 30,353 active and inactive users are selected, respectively.
Combination of Metrics Selection. According to Table XII, most metric pairs have more inactive users in common than active users. In this experiment, users that satisfied the threshold for at least three metrics were selected. This yielded 1,636 inactive users out of 2,887 selected ones, which works better than εK&M and εrel but worse than εnb and εwnb. In brief, this approach achieves a precision of 0.57. Moreover, the number of false positives (1,251) was lower than for most of the other metrics.
In label propagation selection, a set of users with high causality scores is selected as seeds; ProSel then iteratively selects users that occur with those seeds and have a score higher than a threshold. The seed set in each iteration is the set of users selected in the previous iteration. The intuition behind this approach is to select a user if it has a score higher than a threshold, or if it has a lower score but occurs in cascades where high-score users occur. The parameters of the ProSel algorithm are set as follows: λ=0.1, θ=0.9, except for relative likelihood causality, where λ=1 and θ=9. Table XIII shows the performance of each metric. The precision of these metrics varies from 0.47 to 0.75, and εwnb achieves the highest precision. Metric εrel, with an average cascade size of 612.04, and εnb, with a median of 230, find the more important PSM accounts. Moreover, the detected PSM accounts have participated in larger cascades compared with threshold-based selection. This approach also produces a much lower number of false positives compared to threshold-based selection. The comparison between this approach and threshold-based selection is illustrated in the accompanying figures.
The number of common users selected by each pair of metrics is also illustrated in Table XIV. It shows that the metrics cover different regions of the search space and identify different sets of users. In total, 10,254 distinct active users and 16,096 inactive ones are selected.
The infectivity matrix for both PSM and normal users is estimated by fitting the Hawkes model described earlier. In this study, this matrix characterizes the strength of the connections between platforms and news sources. More specifically, each weight value represents the connection strength from one platform to another. In other words, each entry in this matrix can be interpreted as the expected number of subsequent events that will occur on the second group after each event on the first.
Overall, the observations demonstrate the effectiveness of leveraging the Hawkes process to quantify the impact of URLs posted by PSM accounts and regular users on the dissemination of content on Twitter. The observations show that PSM accounts and regular users behave differently in terms of the URLs they post on Twitter, in that they have different tastes when disseminating URL links; accordingly, their impacts on subsequent events differ significantly from each other.
Additional experiments were conducted to gauge the effectiveness and efficiency (timeliness) of the above approaches that use the causality metrics as features (1) in a supervised setting, and (2) for computing proximity scores between users in a community detection-based framework.
Experimental Settings
To assess the effectiveness and efficiency of the metrics and the proposed framework in the early detection of PSM accounts, different subsets of size x% of the entire time-line (from Feb. 22, 2016 to May 27, 2016) of the action log A are used, varying x over {10, 20, 30, 40, 50}. Next, for each subset and each user i in the subset, the feature vector x of the corresponding 4 causality scores is computed. The feature vectors are then fed into supervised classifiers and the proposed community detection-based algorithm to perform classification. For the sake of fair comparison, both standard and decay-based metrics were computed. For both, it was empirically found that ρ=0.1 and α=0.001 work well. For the decay-based causality metric, a sliding window of size 5 days (i.e. δ=5) was assumed and σ=0.001 was set, which were found to work well in the experiments. Note that only results for PSM (suspended) accounts are presented. Among the many supervised classifiers evaluated, Random Forest (RF) performed best and is used for reporting the results.
The results for the proposed community detection-based framework and both the causal and decay-based metrics are presented herein. As stated earlier, the LOUVAIN algorithm was used for community detection since it is fast, scales well, and does not require any parameter tuning. For the second part of the proposed algorithm, i.e., computing the k nearest neighbors of each instance and the final labeling procedure, k=10 was set, as it was found to work well for this problem. It must be stressed that using KNN alone does not yield good performance, as demonstrated by the results of KNN trained on the decay-based causality features.
For the sake of fair comparison, all approaches were implemented and run in Python 2.7x using the scikit-learn package. For the LOUVAIN algorithm, the Python community detection package was used. For any approach that required special tuning of parameters, a grid search was conducted to choose the best set of parameters.
To validate the performance of the proposed metrics and framework, the metrics and framework were compared against the following state-of-the-art baselines:
CAUSAL (Anonymous 2017). The proposed decay-based metrics were benchmarked against the causal metrics (Anonymous 2017) in both the supervised and community detection-based settings.
SentiMetrix-DBSCAN (Subrahmanian et al. 2016). This method was the winner of the DARPA Twitter bot challenge contest (Subrahmanian et al. 2016). This baseline deploys several features, including tweet syntax (average number of hashtags, average number of user mentions, average number of links, average number of special characters), tweet semantics (LDA topics), and user behavior (tweet spread, tweet frequency, tweet repeats). For this baseline, 10-fold cross validation was performed and a held-out test set was used for evaluation. This baseline uses a seed set of 100 active and 100 inactive accounts, and then uses DBSCAN to propagate the labels to the remaining accounts based on similarity.
SentiMetrix-RF. This is a variant of the previous baseline in which the DBSCAN part was excluded and an RF classifier was instead trained using only the above features, in order to evaluate the feature set.
For evaluating how accurate the different approaches are in identifying PSM accounts, different metrics are used, including precision, recall, F1-measure, and area under the curve (AUC) for the PSM accounts. The precision, recall, F1-score, and AUC results for each method are shown in the accompanying figures.
The following experiment was conducted to gauge the ability of the different approaches to identify PSM accounts early in their life span. For each approach, it was determined how many of the PSM accounts that were active in the first 10 days of the dataset were correctly classified (i.e., true positives) over time. The framework also needed to keep track of false positives to ensure that a given approach does not merely label each instance as positive; otherwise, a trivial approach that always labels each instance as PSM would achieve the highest performance. In addition, to determine how many days are needed to find these PSM accounts, each classifier was trained using 50% of the first portion of the dataset, and a held-out set was used for the rest for evaluation. Next, the misclassified PSM accounts were passed on to the next portions to see how many of them could be captured over time. The process was repeated until reaching 50% of the action log; each time, the training set was increased by adding the new instances from each portion.
There are 14,841 users in the first subset, of which 3,358 users are PSM. Table XV below shows the number of users from the first portion that (1) are correctly classified as PSM (out of 3,358), and (2) are incorrectly classified as PSM (out of 29,617), over time. According to this table, the community detection-based approaches were able to detect all PSM accounts that were active in the first 10 days of the dataset within about a month of their first activity.
These observations align well with the previous ones, suggesting that the community detection-based framework yields higher classification performance than the other frameworks. Finally, it should be noted that, to keep the computations and discussion simple, a period of 10 days was used; therefore, the exact number of days that must pass is not known, although this still gives a good approximation of the efficiency of a given approach in the early identification of PSM accounts.
Blind validation tests were conducted by a third party using real social media data that the third party used for other products and services. In the test, the third party provided data without ground truth, for which results were provided in return. The third party provided performance metrics compared with an existing approach for identifying social bots. It should be noted that the evaluated version of the PSM Account Detection software was the initial version. The version disclosed has shown improved performance.
Further comparisons with existing methods were made to evaluate true positives and false positives over time.
Finally, as discussed, existing approaches for detecting socialbots and other malicious actors on social media are highly dependent upon content—which leads to a dramatic increase in disk space and RAM requirements.
Turning to the drawings, a network environment 100 is illustrated in which user devices 102 communicate with social media computing devices 104A-104B via a network 106.
The user devices 102 may be generally any form of computing device capable of interacting with the network 106 to access a social media website, such as a mobile device, a personal computer, a laptop, a tablet, a workstation, a smartphone, or other Internet-communicable device. The social media computing devices 104A-104B may include computing devices configured for providing social media websites or aspects thereof and/or any applications or websites where a user may adopt a social media message, and may be implemented in the form of servers, workstations, terminals, mainframes, storage devices, or the like.
As further shown, a computing device 108 may be communicably coupled to the network 106. The computing device 108 may be implemented to provide the functionality described herein, and may include a laptop, desktop computer, server, or other such computing device. Any one of the social media computing devices 104A-B or the computing device 108 may include at least one server hosting a website or providing an application. Such a server (not shown) may be a single server, a plurality of servers with each such server being a physical server or a virtual machine, or a collection of both physical servers and virtual machines. In another implementation, a cloud hosts one or more components of the network environment 100. In this implementation, the user devices 102, any servers, and other resources connected to the network 106 may access one or more servers to access websites, applications, web services, interfaces, storage devices, computing devices, or the like.
Referring to the diagram 200, a computing device 202 may be configured to access social media data and to execute the application 208 described herein. The process flow 300 begins at block 302, in which social media data, including cascades of messages, may be accessed from one or more social media platforms.
Referring to block 304, one or more key users may be identified for analysis. Key users may be defined as users of social media that interact with a cascade or adopt a social media message at an early stage; e.g., within a predetermined timeframe. Key users may be associated with social media user profiles that include at least a name, a possible photo, a URL, and a number of posts/tweets/retweets or other information indicating adoption behavior associated with the user profile, and location.
Referring to block 306, with continuing reference to diagram 200, a subset of the social media data accessed in block 302, associated with the key users, may be applied to one or more causality metrics. Specifically, in some embodiments, the computing device 202 may be configured with at least one application 208, which may define a plurality of modules as defined herein suitable for executing the functionality described herein, i.e. applying causality metrics 210 (as described in Table 3 above) to the data. Using the social media data, the application 208 generates a set of causality values for one or more key users. As shown in block 308, a PSM account may be detected where the causality value of a given key user exceeds a predefined threshold. As further shown, and referenced in block 310, the application 208 may include or otherwise be configured to implement time-decay extensions 212, hyper-graph based machine learning (ML) 214, and label propagation 216 to further define the data.
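Tying the blocks together, a simplified end-to-end sketch of process flow 300 follows. It reuses the illustrative helpers from the earlier sketches, and the per-user score used here is a simplified stand-in for the full set of causality metrics 210:

```python
def detect_psm_accounts(actions, phi=0.5, theta=100, tau=0.7):
    """Access cascade data (block 302), identify key users (block 304),
    score them with a causality measure (block 306), and flag accounts
    whose score meets the threshold (block 308)."""
    flagged = set()
    for m in {m for (_, m, _) in actions}:              # block 302
        sub_log = actions_for_message(actions, m)
        for (u, _, _) in sub_log:
            if not is_key_user(sub_log, u, phi):        # block 304
                continue
            score = p_viral_given_user(actions, u, theta)  # block 306
            if score >= tau:                            # block 308
                flagged.add(u)
    return flagged
```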
Main memory 704 can be Random Access Memory (RAM) or any other dynamic storage device(s) commonly known in the art. Read-only memory 706 can be any static storage device(s) such as Programmable Read-Only Memory (PROM) chips for storing static information such as instructions for processor 702. Mass storage device 707 can be used to store information and instructions. For example, hard disks such as the Adaptec® family of Small Computer Serial Interface (SCSI) drives, an optical disc, an array of disks such as Redundant Array of Independent Disks (RAID), such as the Adaptec® family of RAID drives, or any other mass storage devices, may be used.
Bus 701 communicatively couples processor(s) 702 with the other memory, storage, and communications blocks. Bus 701 can be a PCI/PCI-X, SCSI, or Universal Serial Bus (USB) based system bus (or other) depending on the storage devices used. Removable storage media 705 can be any kind of external hard drives, thumb drives, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), Digital Video Disk-Read Only Memory (DVD-ROM), etc.
Embodiments herein may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to optical discs, CD-ROMs, magneto-optical disks, ROMs, RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments herein may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., modem or network connection).
As shown, main memory 704 may be encoded with the application 208 that supports functionality discussed herein. In other words, aspects of the application 208 (and/or other resources as described herein) can be embodied as software code such as data and/or logic instructions (e.g., code stored in the memory or on another computer readable medium such as a disk) that supports processing functionality according to different embodiments described herein. During operation of one embodiment, processor(s) 702 accesses main memory 704 via the use of bus 701 in order to launch, run, execute, interpret or otherwise perform processes, such as through logic instructions, executing on the processor 702 and based on the application 208 stored in main memory or otherwise tangibly stored.
The description above includes example systems, methods, techniques, instruction sequences, and/or computer program products that embody techniques of the present disclosure. However, it is understood that the described disclosure may be practiced without these specific details. In the present disclosure, the methods disclosed may be implemented as sets of instructions or software readable by a device. Further, it is understood that the specific order or hierarchy of steps in the methods disclosed are instances of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the method can be rearranged while remaining within the disclosed subject matter. The accompanying method claims present elements of the various steps in a sample order, and are not necessarily meant to be limited to the specific order or hierarchy presented.
The described disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable medium may include, but is not limited to optical storage medium (e.g., CD-ROM); magneto-optical storage medium, read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of medium suitable for storing electronic instructions.
Certain embodiments are described herein as being implemented via one or more applications that include metrics which may collectively define modules. Such modules may be hardware-implemented, and may thus include at least one tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. For example, a hardware-implemented module may comprise dedicated circuitry that is permanently configured (e.g., as a special-purpose processor, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware-implemented module may also comprise programmable circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software or firmware to perform certain operations. In some example embodiments, one or more computer systems (e.g., a standalone system, a client and/or server computer system, or a peer-to-peer computer system) or one or more processors may be configured by software (e.g., an application or application portion) as a hardware-implemented module that operates to perform certain operations as described herein.
Accordingly, the term “hardware-implemented module” or “module” encompasses a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware-implemented modules are temporarily configured (e.g., programmed), each of the hardware-implemented modules need not be configured or instantiated at any one instance in time. For example, where the hardware-implemented modules comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware-implemented modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware-implemented module at one instance of time and to constitute a different hardware-implemented module at a different instance of time.
Hardware-implemented modules may provide information to, and/or receive information from, other hardware-implemented modules. Accordingly, the described hardware-implemented modules may be regarded as being communicatively coupled. Where multiple of such hardware-implemented modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the hardware-implemented modules. In embodiments in which multiple hardware-implemented modules are configured or instantiated at different times, communications between such hardware-implemented modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware-implemented modules have access. For example, one hardware-implemented module may perform an operation, and may store the output of that operation in a memory device to which it is communicatively coupled. A further hardware-implemented module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware-implemented modules may also initiate communications with input or output devices.
It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto.
This is a PCT application that claims the benefit of U.S. Provisional Application Ser. No. 62/628,196, filed on Feb. 8, 2018, which is incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/017287 | 2/8/2019 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62628196 | Feb 2018 | US |