A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
With the increased use of online advertising models that reward the owners of web hosts whenever an advertisement is viewed (pay-per-view advertising) or clicked on (pay-per-click advertising), it has become common for people to create hosts solely for the purpose of getting advertisements viewed by users. Such hosts are referred to as “spam” hosts (because of their similarity to “spam” e-mail, i.e., they are unsolicited displays of advertising) to distinguish them from hosts that primarily provide content of value (e.g., news articles, blogs, some service, etc.), other than advertisements.
Because most hosts on the Internet are found through search engines such as those provided by Yahoo! and Google, operators of spam hosts have an incentive to make their hosts appear as close as possible to the top of the results of any given search query entered into a search engine. Spam host operators typically do this by placing random or targeted content and links on the pages of their hosts in order to trick the analysis algorithms used by search engines into ranking the spam host higher in a given query's search results.
Conversely, the operators of the search engines recognize that users performing searches are not interested in viewing spam hosts. Thus, search engines have an incentive to make spam hosts appear near or at the bottom of the search results generated by any given search query, or to omit the spam hosts altogether from the search results. However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host.
Systems and methods for identifying spam hosts are disclosed in which hosts are known to the system and initially classified as spam or non-spam by a baseline classifier. The accuracy of the initial host classifications is then improved by propagating them using a random walk algorithm. The random walk used may be modified in order to obtain a weighted or skewed characterization of the host. The hosts may then be reclassified based on the characterization obtained from the random walk to obtain a final spam/non-spam classification. The final classification may then be used in many different ways, including to filter search results based on host classifications so that spam hosts are not displayed or are displayed last in a results set.
In one aspect, the disclosure describes a method for identifying spam hosts within a set of hosts. The method includes indexing content and links on each host within the set of hosts on a network and classifying each host as spam or non-spam based on an analysis of the indexed information. The method then performs a random walk analysis to generate a characterization of each host and reclassifies one or more hosts as spam hosts based on the characterizations generated by the random walk analysis.
In the method, the random walk analysis may be a modified analysis in which, starting from a first host, a next host is identified by either: randomly selecting a link on the current host to identify the next host; or selecting a host classified as spam by the classifying operation from the set of hosts on the network. Which of the two ways of identifying the next host is used may be determined at each step in the random walk based on a probability factor, so that the random walk is skewed to select spam hosts and hosts linked to spam hosts more often.
In another aspect, the disclosure describes a computer-readable medium storing computer executable instructions for a method of presenting a list of hosts as search results in response to a search query. The method includes receiving, from a requester, a search query requesting a list of hosts matching a search term. In response, the method identifies hosts matching the search term. Then the method assigns a host spamicity value to each host matching the search term based on content and links on that host and also based on a characterization of the host generated by a random walk analysis. The host spamicity value of each host is used to identify the host as either a spam host or a non-spam host. A list of search results identifying the hosts matching the search term is then presented to the requestor, wherein the list is sorted or filtered at least in part based on the host spamicity value of each host in the list.
In yet another aspect, the disclosure describes a system for generating a list of search results that includes a spam host identification module that identifies each of a plurality of hosts as either a spam host or a non-spam host based on content and links on that host and based on a random walk analysis of the hosts. The spam host identification module may include a prediction module that initially classifies each host in the plurality of hosts as either a spam host or a non-spam host based on at least the content on that host. The spam host identification module may further include a random walk module that generates a characterization value for each host based on a random walk analysis of the hosts and a reclassification module that changes the initial classifications for one or more hosts based on a comparison of each host's characterization value and one or more predetermined threshold values. The random walk module may identify a path (e.g., series of hosts) by, starting from a first host, identifying a next host in the path and then repeating the process from the identified host. The identification is performed by either randomly selecting a link on the current host to identify the next host or selecting a host classified as spam by the classifying operation from the set of hosts on the network. The system may also include an index containing information describing the content and links of a set of hosts on a network and a search engine for handling search queries.
These and various other features as well as advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. Additional features are set forth in the description that follows and, in part, will be apparent from the description, or may be learned by practice of the described embodiments. The benefits and features will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
The following drawing figures, which form a part of this application, are illustrative of embodiments of the systems and methods described below and are not meant to limit the scope of the disclosure in any manner, which scope shall be based on the claims appended hereto.
The systems and methods described herein will classify each host 102, 104, 106, 108, 110 as either spam or non-spam. Because each host 102, 104, 106, 108, 110 may provide access to one or more web pages or other types of content, classifying a host as spam will result in the classification of web pages on the host as spam also. For example, in an embodiment in which the TLD “www.cnn.com” is on a host, all the web pages and other content accessible via sub domains under the TLD (e.g., cnn.com/politics/todaysstory.htm and cnn.com/technology/fuelcell.htm) are classified the same as that host 102, 104, 106, 108, 110.
The architecture 100 illustrated is a networked client/server architecture in which some of the computing devices, such as the hosts 102, 104, 106, 108, 110, are referred to as servers in that they serve requests for content (e.g., transmit web pages and perform services in response to requests), and other computing devices that issue requests for content to servers may be referred to as clients. For example, in the embodiment shown the spam identification module 114 is incorporated into a server 112 that can serve web page search requests from clients, such as the client computing device 130 shown. The systems and methods described herein are suitable for use with other architectures as discussed in greater detail below.
Computing devices are well known in the art. For the purposes of this disclosure, a computing device such as the client 130, server 112 or host 102, 104, 106, 108, 110 includes a processor coupled with memory for storing and executing data and software. Computing devices may be provided with operating systems that allow the execution of software applications in order to manipulate data. Examples of operating systems include an operating system suitable for controlling the operation of a networked server computer, such as the WINDOWS XP or WINDOWS 2003 operating systems from MICROSOFT CORPORATION.
In order to store the software and data files, a computing device may include a mass storage device in addition to the memory of the computing device. Local data structures, including discrete web pages such as .HTML files, may be stored on a mass storage device (not shown) that is connected to, or part of, any of the computing devices described herein. A mass storage device includes some form of computer-readable media and provides non-volatile storage of data for later use by one or more computing devices. Although the description of computer-readable media contained herein often refers to a mass storage device, such as a hard disk or CD-ROM drive, it should be appreciated by those skilled in the art that computer-readable media can be any available media that can be accessed by a computing device.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
As discussed above,
As shown in
The server 112 includes a number of modules. The term module is used to remind the reader that the functions and systems described herein may be embodied in a software application executing on a processor, in a piece of hardware purpose-built to perform a function or in a system embodied by a combination of software, hardware and firmware.
The server 112 includes a spam host identification module 114 as shown. As discussed in greater detail below, the spam host identification module 114 takes information previously indexed about the hosts, such as by a search engine 122 or web crawler (not shown), initially classifies all of the hosts as either a spam host or non-spam host, and then, by performing a random walk analysis, characterizes all of the hosts. Based on this characterization, one or more of the hosts may then be reclassified and a final classification as either spam or non-spam is obtained for each host.
In the embodiment shown, the spam host identification module 114 includes a prediction module 116. The prediction module 116 analyzes the information in the index 124 and initially classifies each host with a value indicative of the results of a preliminary analysis of whether that host is a spam host or a non-spam host. This value shall be referred to as the host “spamicity” value. In an embodiment, the spamicity value may be a simple binary 0 or 1, such that, for example, 1 indicates a spam host and 0 indicates a non-spam host (or vice versa). Alternatively, a more complicated algorithm may be used that provides a host spamicity value ranging between an upper limit and a lower limit that reflects the confidence of the algorithm in identifying a host as spam or non-spam.
In the embodiment shown, the spamicity values for hosts evaluated by the spam host identification module 114 are stored in a classification datastore 150. In an alternative embodiment, the spamicity values or equivalent information classifying hosts as spam/non-spam may be stored in the index 124 of host information so that all host information is stored in a single datastore.
In addition to the prediction module, a random walk module 118 is provided as shown. The random walk module 118 uses the spamicity values produced by the prediction module, and the topology of the host graph, to reassign spamicity scores to hosts and reclassify the hosts. Generally, a random walk analysis is performed on a graph structure, which is a set of nodes with links among them. A random walk starts from an arbitrary node in the graph and proceeds to visit other nodes by following links in the graph. Under certain mathematical assumptions, one can define the stationary distribution of the random walk, which is a probability distribution and assigns a value to each node in the graph. That value expresses the probability that the random walk will visit each node, and can be considered a measure of centrality or importance of each node.
One component of the random walk is that of random restart. According to random restart, at each step of the random walk, with probability α the random walk follows a link of the graph, and with probability 1-α it jumps to a random node of the graph. Such a random node can be selected with equal probability among all nodes in the graph, or it can be selected among a subset of the nodes of the graph, for example, from the nodes with domain .edu. The purpose of random restart is twofold. The first is to ensure the mathematical assumptions under which the random walk converges to a well-defined stationary distribution. The second is to bias the importance of certain nodes. For instance, in the previous example in which the random restart is restricted to nodes in the .edu domain, it is clear that educational hosts, or hosts that are closely linked to by educational hosts, will have a higher visiting probability and, thus, more importance.
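By way of illustration only, and not limitation, the random restart described above may be sketched as a simple simulation that counts how often each node is visited. The function name, the toy graph, the value of α (0.85), the number of steps, and the choice to restart whenever a node has no outgoing links are all assumptions made for this sketch and are not taken from the disclosure.

```python
import random

def random_walk_with_restart(links, restart_nodes, alpha=0.85, steps=100_000, seed=0):
    """Estimate visit frequencies of a random walk with random restart.

    links: dict mapping each node to the list of nodes it links to.
    restart_nodes: nodes eligible as restart targets (chosen uniformly).
    alpha: probability of following a link at each step.
    """
    rng = random.Random(seed)
    visits = {node: 0 for node in links}
    current = rng.choice(restart_nodes)
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        # Follow a link with probability alpha; otherwise, or at a node
        # with no outgoing links (an assumption), jump to a restart node.
        if out and rng.random() < alpha:
            current = rng.choice(out)
        else:
            current = rng.choice(restart_nodes)
    return {node: count / steps for node, count in visits.items()}

# Hypothetical four-host graph used only for illustration.
links = {
    "a.edu": ["b.com"],
    "b.com": ["c.com"],
    "c.com": ["b.com"],
    "d.com": ["c.com"],
}
uniform = random_walk_with_restart(links, list(links))   # restart anywhere
biased = random_walk_with_restart(links, ["a.edu"])      # restart only at .edu
```

In this toy graph, restricting the restart to the .edu node raises the visit frequency of that node relative to the uniform restart, and a host with no incoming links that is not a restart target is never visited.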
In the system described herein, the random walk is used to propagate spamicity scores among the hosts, using the initial predictions from the prediction module. To be more specific, a random walk is performed starting from the hosts that the prediction module has marked as spam. The random walk follows a link in the graph (i.e., following a link from the host to identify a next host/node in the graph) with probability α, and it restarts by returning to a spam node with probability 1-α. When returning to the spam node, the target node is selected with a probability proportional to its spamicity prediction value. At the end of this process, the stationary distribution of the random walk defines a new value for each host of the graph. These values are interpreted as the new spamicity scores and they are used to reclassify the hosts as spam and non-spam.
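As a non-limiting sketch, the stationary distribution of the spam-biased random walk described above may be approximated by repeatedly applying one step of the walk to a score vector (power iteration). The function name, the iteration count, the value of α, the treatment of hosts without outgoing links, and the example hosts and spamicity values are assumptions made for illustration.

```python
def spam_biased_scores(links, spamicity, alpha=0.85, iterations=100):
    """Approximate the stationary distribution of a random walk that
    restarts at hosts with probability proportional to their spamicity.

    links: dict mapping each host to the list of hosts it links to.
    spamicity: dict mapping each host to its initial predicted value (>= 0).
    """
    nodes = list(links)
    total = sum(spamicity[n] for n in nodes)
    if total <= 0:
        raise ValueError("at least one host needs a positive spamicity value")
    restart = {n: spamicity[n] / total for n in nodes}
    scores = dict(restart)  # start the walk from the predicted spam hosts
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        jump_mass = 0.0
        for n in nodes:
            out = links[n]
            if out:
                share = alpha * scores[n] / len(out)
                for m in out:
                    nxt[m] += share
                jump_mass += (1 - alpha) * scores[n]
            else:
                # Assumption: a host with no links always restarts.
                jump_mass += scores[n]
        for n in nodes:
            nxt[n] += jump_mass * restart[n]
        scores = nxt
    return scores

# Hypothetical hosts and initial predictions.
links = {
    "spam1": ["spam2", "farm"],
    "spam2": ["spam1"],
    "farm": ["spam1"],
    "news": ["blog"],
    "blog": ["news"],
}
spamicity = {"spam1": 1.0, "spam2": 0.8, "farm": 0.0, "news": 0.0, "blog": 0.0}
scores = spam_biased_scores(links, spamicity)
```

In this example the host "farm" was not initially predicted as spam but receives a positive score because a predicted spam host links to it, while "news" and "blog", which are unreachable from the predicted spam hosts, keep a score of zero.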
The spam host identification module 114 further includes a reclassification module 120 that determines a final classification of each host. The reclassification module 120 uses the host characterization information determined by the random walk module 118 to reclassify one or more of the hosts. The reclassification may be based on the comparison of each host's characterization with one or more predetermined thresholds (e.g., a spam threshold and a non-spam threshold). If a host characterization exceeds a particular threshold (i.e., falls within an identified range of values), the host is reclassified as either spam or non-spam regardless of its initial classification. The thresholds may be empirically determined based on a training set of host information in which hosts are known to be spam or non-spam, as described in detail in the Example below.
In the embodiment shown in
In the embodiment shown, the spam host identification (ID) module 114 is used to classify each of the hosts known to the system as either spam or non-spam. Using the classifications, the search engine then can sort the search results 126 so that spam hosts, e.g., those hosts identified as spam hosts by the spam host ID module 114, are placed at the end of the search results 126. Alternatively, spam hosts may be filtered so that they do not appear in the search results 126 at all.
In yet another embodiment, the spam host ID module 114 may be implemented as a standalone module or service on its own computing device (not shown) that analyzes the data maintained in the index 124 and assigns each host a host spamicity value, which may also be stored in the index 124. Independent search engines on other servers then need only inspect the data maintained in the index 124, which now includes the host spamicity value for each host, so that search results 126 may be easily ordered or filtered based on whether a host is spam or non-spam.
The web pages 204, 210 further include one or more links 206. The links may be hyperlinks or other references to other web pages accessible via the network 101. It is now common for many websites to have many links on each web page. A link may be a user-selectable text, control or icon that causes the user's browser to retrieve the linked page. Alternatively, the link may be a simple address in text that the user's device identifies as a link to the page identified by the address.
Based on the information contained in the index, the method 300 further classifies each host in the index as either spam or non-spam in an initial classification operation 304. The initial classification of hosts as spam or non-spam may be done using any method known in the art, including methods in which the content of the pages of the hosts is analyzed, the number and type of links on each page or a representative set of pages in the host is analyzed, and/or a combination of both. Methods and algorithms for initially classifying a set of host web pages are known in the art. Any such method may be used in order to obtain an initial classification of each of the hosts.
In an embodiment, each host, regardless of the number of web pages associated with the host, is assigned its own spamicity value indicative of whether the host is spam or non-spam. In an alternative embodiment, each web page on a host may be assigned a web page spamicity value, thereby, in effect, modifying the systems described herein to be a spam web page identification system, as opposed to a spam host identification system.
After the hosts have been classified in the initial classification operation 304, a random walk operation 306 is performed. In the random walk operation 306, a “path” is identified starting from a first host through a series of subsequent hosts following links between the hosts as described above with reference to the random walk module of
An embodiment of performing a random walk operation 306 skewed toward spam hosts using a predetermined probability factor is described above with reference to
After the hosts have been characterized by the random walk operation 306, the characterization value for each host is then compared in a comparison operation 308 to one or more predetermined thresholds.
In the embodiment shown, based on the results of the comparison, one of three actions is performed, which actions are referred to collectively as a reclassification operation 310. Depending on which of three conditions is determined from the comparison, the reclassification operation 310 generates a final classification of spam or non-spam for each host. In the first condition, if a characterization value exceeds a predetermined spam threshold, then that host is reclassified as spam regardless of its initial classification assigned in the classification operation 304.
Likewise, a non-spam threshold may also be designated as illustrated. If the characterization for one of the hosts exceeds the predetermined non-spam threshold, then that host may be reclassified with a host classification value indicating it is non-spam regardless of the initial classification determined in the classification operation 304.
The third possible outcome of the reclassification operation 310 is if the characterization value exceeds neither threshold, then the initial classification determined in the classification operation 304 is retained as the final classification for those hosts. These three actions, i.e., reclassifying a host as spam, as non-spam or retaining the initial classification, may be considered a single reclassification operation 310 in which all hosts in each cluster are reclassified if necessary based on the characterization generated by the random walk analysis, which itself was influenced by the initial classifications of the hosts. In an alternative embodiment, the reclassification operation 310 may only consider one threshold when determining a final classification, for example reclassifying only hosts exceeding a spam threshold and for hosts not exceeding the spam threshold retaining their initial classifications as final classifications.
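For purposes of illustration only, the three outcomes of the reclassification operation may be sketched as a single function. The function name and the threshold values are hypothetical. The sketch also assumes that a higher characterization value indicates spam, so that "exceeding" the non-spam threshold is implemented as falling at or below it; other embodiments may define the ranges differently.

```python
def reclassify(initial_is_spam, characterization, spam_threshold, nonspam_threshold):
    """Return the final spam (True) / non-spam (False) classification.

    Assumes higher characterization values indicate spam and that
    nonspam_threshold < spam_threshold.
    """
    if characterization >= spam_threshold:
        return True                # reclassified as spam
    if characterization <= nonspam_threshold:
        return False               # reclassified as non-spam
    return initial_is_spam         # neither threshold met: keep initial class

# Hypothetical thresholds chosen only for the example.
SPAM_T, NONSPAM_T = 0.30, 0.05
final_a = reclassify(False, 0.42, SPAM_T, NONSPAM_T)  # initially non-spam
final_b = reclassify(True, 0.01, SPAM_T, NONSPAM_T)   # initially spam
final_c = reclassify(True, 0.10, SPAM_T, NONSPAM_T)   # between thresholds
```

The single-threshold alternative described above corresponds to omitting the second comparison.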
Regardless of the outcome of the reclassification operation 310, all hosts known to the system or being analyzed under this method are now identified as either a spam host or a non-spam host by the assignment of a final host spamicity value.
In the embodiment shown, a search engine receives a search query including search terms from a user in a receive search operation 402. In response, the search engine identifies hosts matching the search terms in a host identification operation 404. The host identification operation 404 may include identifying specific web pages within hosts that match the search terms from the search query or may only identify host sites that contain pages that match the search query. In order to match a host to a search query, the same content in the index used to classify hosts as spam or non-spam may be used to match the hosts to the query. Such matching algorithms are known in the art and any suitable algorithm may be used to identify hosts matching a search term.
Following the host identification operation 404, a retrieve classification operation 406 is performed. In the embodiment, the spam/non-spam classification derived by the random walk method described with reference to
After the spam/non-spam classification of each host has been retrieved, search results are then filtered and/or sorted, based on whether each host identified in the search results is spam or non-spam and then presented to the requester of the search in a search result presentation operation 408. The search results may be sorted so that spam hosts appear at the bottom of the search results. Alternatively, spam hosts may be identified in some way to the user/requestor in the search result. In yet another embodiment, hosts identified as spam may be filtered out of the search result entirely so that they are not presented to the user in response to the user's search query. Such filtering may be in response to a user default in which the user requests that the search engine not transmit any search results likely to be spam hosts.
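As one hedged example of the presentation operation, the sorting and filtering alternatives may be sketched as follows. The function name, the representation of the retrieved classifications as a dictionary, and the treatment of hosts without a stored classification as non-spam are assumptions for illustration.

```python
def present_results(hosts, is_spam, filter_spam=False):
    """Order or filter matching hosts using their final classification.

    hosts: matching hosts, already ordered by relevance.
    is_spam: dict mapping host -> True (spam) / False (non-spam);
             hosts missing from the dict are treated as non-spam (assumption).
    """
    if filter_spam:
        # Omit spam hosts from the results entirely.
        return [h for h in hosts if not is_spam.get(h, False)]
    # sorted() is stable, so relevance order is kept within each group
    # while spam hosts move to the end of the list.
    return sorted(hosts, key=lambda h: is_spam.get(h, False))

# Hypothetical matching hosts and classifications.
matches = ["news.example", "pills.example", "blog.example"]
labels = {"pills.example": True}
demoted = present_results(matches, labels)
filtered = present_results(matches, labels, filter_spam=True)
```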
The method 400 for generating search results identified in
In yet another embodiment, the systems and methods described above may be used to detect spam registrations in directories of hosts. Hosts identified as spam by the random walk method may then be removed from the registry or flagged as spam.
In yet another embodiment, the systems and methods described above may be used to detect spam or abusive replies in a forum context, e.g., Yahoo! Answers. Again, answers and hosts associated with answers identified as spam by the random walk method may then be removed from the forum or flagged/sorted/displayed as spam.
The following is a description of an embodiment of a spam host identification system and method that was created and tested against a known dataset of spam and non-spam hosts to determine its efficacy. The WEBSPAM-UK2006 dataset, a publicly available Web spam collection, was used as the initial dataset upon which to test the spam host identification system. It is based on a crawl of the .uk domain done in May 2006, including 77.9 million pages and over 3 billion links in about 11,400 hosts.
This reference collection has been tagged at the host level by a group of volunteers. The assessors labeled hosts as “normal”, “borderline” or “spam”, and were paired so that each sampled host was labeled by two persons independently. For the ground truth, only hosts for which the assessors agreed were used, plus the hosts in the collection marked as non-spam because they belong to special domains such as police.uk or .gov.uk.
The benefit of labeling hosts instead of individual pages is that a large coverage can be obtained, meaning that the sample includes several types of Web spam, and the useful link information among them. Since about 2,725 hosts were evaluated by at least two assessors, a tagging with the same resources at the level of pages would have been either completely disconnected (if pages were sampled uniformly at random), or would have had a much smaller coverage (if a subset of sites were picked at the beginning for sampling pages). On the other hand, the drawback of the host-level tagging is that a few hosts contain a mix of spam/non-spam content, which increases the classification errors. Domain-level tagging is another embodiment that could be used.
Evaluation. The evaluation of the overall process is based on a set of measures commonly used in Machine Learning and Information Retrieval. Given a classification algorithm C, we consider its confusion matrix:
                        Prediction: non-spam    Prediction: spam
True label: non-spam             a                     b
True label: spam                 c                     d
where a represents the number of non-spam examples that were correctly classified, b represents the number of non-spam examples that were falsely classified as spam, c represents the number of spam examples that were falsely classified as non-spam, and d represents the number of spam examples that were correctly classified. We consider the following measures: true positive rate (or recall), false positive rate and F-measure. The recall R is d/(c+d). The false positive rate is defined as b/(b+a). The F-measure is defined as F=2PR/(P+R), where P is the precision P=d/(b+d).
For evaluating the classification algorithms, we focus on the F-measure F as it is a standard way of summarizing both precision and recall. We also report the true positive rate and false positive rate as they have a direct interpretation in practice. The true positive rate R is the fraction of spam that is detected (and thus deleted or demoted) by the search engine. The false positive rate is the fraction of non-spam objects that are mistakenly considered to be spam by the automatic classifier.
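The measures defined above may be computed directly from the four confusion matrix entries, as in the following sketch. The function name and the example counts are hypothetical and are not results from the experiment; the sketch assumes all denominators are nonzero.

```python
def evaluate(a, b, c, d):
    """Compute the evaluation measures from confusion matrix entries.

    a: non-spam classified as non-spam   b: non-spam classified as spam
    c: spam classified as non-spam       d: spam classified as spam
    Assumes every denominator below is nonzero.
    """
    recall = d / (c + d)                 # true positive rate R
    false_positive_rate = b / (b + a)
    precision = d / (b + d)              # P
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts chosen only to exercise the formulas.
metrics = evaluate(a=90, b=10, c=5, d=15)
```

For these illustrative counts the recall is 15/20 = 0.75, the false positive rate is 10/100 = 0.10, the precision is 15/25 = 0.60, and the F-measure is 2(0.60)(0.75)/(0.60+0.75), or about 0.667.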
Cross-validation. All the predictions reported here were computed using tenfold cross-validation. For each classifier, the true positive rate, false positive rate and F-measure were reported. A classifier whose performance is to be estimated is trained 10 times, each time using 9 of the 10 partitions as training data and computing the confusion matrix using the remaining partition as test data. The resulting ten confusion matrices are then averaged, and the evaluation metrics are computed on the average confusion matrix.
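The cross-validation procedure may be sketched as follows, by way of example only. The function names, the shuffling and partitioning scheme, the single-feature threshold classifier used as a stand-in for the actual classifier, and the synthetic data are all assumptions made so that the sketch is self-contained.

```python
import random

def cross_validate(examples, train, k=10, seed=0):
    """Average the confusion matrices of k train/test rounds.

    examples: list of (feature, is_spam) pairs.
    train: callable that takes training examples and returns a
           predict(feature) -> bool function.
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal partitions
    total = {"a": 0, "b": 0, "c": 0, "d": 0}
    for i in range(k):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train(training)
        for feature, is_spam in folds[i]:        # held-out partition
            predicted = predict(feature)
            if not is_spam:
                total["b" if predicted else "a"] += 1
            else:
                total["d" if predicted else "c"] += 1
    return {key: count / k for key, count in total.items()}

def train_threshold(training):
    """Hypothetical stand-in classifier: cut halfway between class means."""
    spam = [x for x, y in training if y]
    ham = [x for x, y in training if not y]
    cut = (sum(spam) / len(spam) + sum(ham) / len(ham)) / 2
    return lambda x: x > cut

# Synthetic, well-separated data used only to run the sketch.
data = [(i / 10, False) for i in range(10)] + [(2 + i / 10, True) for i in range(10)]
average_matrix = cross_validate(data, train_threshold)
```

The evaluation measures are then computed once on the averaged entries a, b, c and d, rather than being averaged across the ten rounds.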
For the content analysis, a summary of the content of each host was obtained by taking the first 400 pages reachable by breadth-first search. The summarized sample contains 3.3 million pages. All of the content data used in the rest of this paper were extracted from a summarized version of the crawl. Note that the assessors spent on average 5 minutes per host, so the vast majority of the pages they inspected were contained in the summarized sample.
The spam detection system then used a cost-sensitive decision tree as part of the classification process. Experimentally, this classification algorithm was found to work better than other methods tried. The features used to learn the tree were derived from a combined approach based on link and content analysis to detect different types of Web spam pages.
The features used to build the classifiers were link-based features extracted from the Web graph and host graph, and content-based features extracted from individual pages. For the link-based features, the method described by L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates in “Link-based characterization and detection of Web Spam,” AIRWeb, 2006, was used. For the content-based features, the method described by A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly in “Detecting spam web pages through content analysis,” WWW, pages 83-92, Edinburgh, Scotland, May 2006, was used.
Most of the link-based features were computed for the home page and the page in each host with the maximum PageRank. The remainder of the link features were computed directly over the graph between hosts (obtained by collapsing together pages of the same host).
Degree-related measures. A number of measures were computed related to the in-degree and out-degree of the hosts and their neighbors. In addition, various other measures were considered, such as the edge-reciprocity (the number of links that are reciprocal) and the assortativity (the ratio between the degree of a particular page and the average degree of its neighbors). A total of 16 degree-related features was obtained.
PageRank. PageRank is a well known link-based ranking algorithm that computes a score for each page. Various measures related to the PageRank of a page and the PageRank of its in-link neighbors were calculated to obtain a total of 11 PageRank-based features.
TrustRank. Gyöngyi et al. (Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004) introduced the idea that if a page has high PageRank, but it does not have any relationship with a set of known trusted pages, then it is likely to be a spam page. TrustRank is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank, the spam mass of a page, i.e., the amount of PageRank received from a spammer, may be estimated. The performance of TrustRank depends on the seed set. In the experiment, 3,800 nodes were chosen at random from the Open Directory Project, excluding those that were labeled as spam. It was observed that the relative non-spam mass for the home page of each host (the ratio between the TrustRank score and the PageRank score) is a very effective measure for separating spam from non-spam hosts. However, using this measure alone is not sufficient for building an automatic classifier because it would yield a high number of false positives (around 25%).
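As a simplified illustration of the relative non-spam mass, the following sketch computes a PageRank-style score with a uniform restart and a TrustRank-style score with a restart confined to a trusted seed set, and then takes their ratio. This is not the implementation of Gyöngyi et al.; the function names, the value of α, the normalization of both scores to sum to one, the handling of nodes without links, and the toy graph are assumptions.

```python
def walk_scores(links, restart, alpha=0.85, iterations=100):
    """Power iteration for a random walk with the given restart distribution."""
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in links}
        jump = 0.0
        for n, out in links.items():
            if out:
                for m in out:
                    nxt[m] += alpha * scores[n] / len(out)
                jump += (1 - alpha) * scores[n]
            else:
                jump += scores[n]        # assumption: dead ends restart
        for n in links:
            nxt[n] += jump * restart[n]
        scores = nxt
    return scores

def relative_nonspam_mass(links, trusted):
    """Ratio of the trust-seeded score to the uniform-restart score."""
    n = len(links)
    pagerank = walk_scores(links, {h: 1 / n for h in links})
    seed = {h: (1 / len(trusted) if h in trusted else 0.0) for h in links}
    trustrank = walk_scores(links, seed)
    return {h: trustrank[h] / pagerank[h] for h in links}

# Hypothetical graph with one trusted seed.
links = {
    "news": ["blog"],
    "blog": ["news"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
mass = relative_nonspam_mass(links, trusted={"news"})
```

In this toy graph the hosts that are unreachable from the trusted seed have a relative non-spam mass of zero, while the hosts linked from the seed have a ratio greater than one.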
Truncated PageRank. Becchetti et al. described Truncated PageRank, a variant of PageRank that diminishes the influence of a page to the PageRank score of its neighbors.
Estimation of supporters. Given two nodes x and y, it is said that x is a d-supporter of y, if the shortest path from x to y has length d. Let N_d(x) be the set of the d-supporters of page x. An algorithm for estimating the set N_d(x) for all pages x based on probabilistic counting is described in Becchetti et al. For each page x, the cardinality of the set N_d(x) is an increasing function with respect to d. A measure of interest is the bottleneck number b_d(x) of page x, which we define to be b_d(x) = min_{j≤d} { |N_j(x)| / |N_{j-1}(x)| }. This measure indicates the minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
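A minimal sketch of the bottleneck number follows, using an exact breadth-first search in place of the probabilistic counting of Becchetti et al., which would be needed at Web scale. The sketch assumes that the supporter sets are counted cumulatively, i.e., that the j-th set contains every page whose shortest path to x has length at most j (including x itself at distance 0), which is the reading under which the cardinality grows with the distance. The function names and the toy graph are hypothetical.

```python
from collections import deque

def supporter_counts(links, x, d):
    """Return the supporter set sizes for distances 0..d.

    Assumption: the j-th set holds pages within distance j of x along
    incoming links, with x itself counted at distance 0.
    """
    reverse = {n: [] for n in links}
    for src, outs in links.items():
        for dst in outs:
            reverse.setdefault(dst, []).append(src)
    dist = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue                      # do not explore beyond distance d
        for prev in reverse.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return [sum(1 for v in dist.values() if v <= j) for j in range(d + 1)]

def bottleneck_number(links, x, d):
    """Minimum growth ratio between consecutive supporter sets (d >= 1)."""
    counts = supporter_counts(links, x, d)
    return min(counts[j] / counts[j - 1] for j in range(1, d + 1))

# Hypothetical graph: a and b link to x; c and d link to a; e links to b.
links = {"a": ["x"], "b": ["x"], "c": ["a"], "d": ["a"], "e": ["b"], "x": []}
counts = supporter_counts(links, "x", 2)
bottleneck = bottleneck_number(links, "x", 2)
```

Here the set sizes are 1, 3 and 6, giving growth ratios of 3 and 2 and a bottleneck number of 2.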
For each web page in the data set a number of features were extracted based on the content of the pages. Most of the features reported by Ntoulas et al. were used, with the addition of new ones such as the entropy (see below), which is meant to capture the compressibility of the page. Ntoulas et al. use a set of features that measures the precision and recall of the words in a page with respect to the set of the most popular terms in the whole web collection. A new set of features was also used that measures the precision and recall of the words in a page with respect to the q most frequent terms from a query log, where q=100, 200, 500, 1000.
Number of words in the page, number of words in the title, average word length. For these features, only the words in the visible text of a page were counted, and only words consisting solely of alphanumeric characters were considered.
Fraction of anchor text. This feature was defined as the fraction of the number of words in the anchor text to the total number of words in the page.
Fraction of visible text. The fraction of the number of words in the visible text to the total number of words in the page, including HTML tags and other invisible text, was also used as a feature.
Compression rate. The visible text of the page was compressed using bzip. Compression rate is the ratio of the size of the compressed text to the size of the uncompressed text.
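As a minimal sketch of the compression rate feature, the following uses the bzip2 implementation in Python's standard `bz2` module. Treating "bzip" as bzip2, encoding the text as UTF-8, and returning 0.0 for empty text are assumptions made for illustration only.

```python
import bz2


def compression_rate(visible_text: str) -> float:
    """Ratio of compressed size to uncompressed size of the visible text."""
    raw = visible_text.encode("utf-8")  # encoding choice is an assumption
    if not raw:
        return 0.0  # convention for empty pages, chosen for this sketch
    return len(bz2.compress(raw)) / len(raw)


# Highly repetitive text compresses far better than more varied text.
repetitive = "buy cheap pills " * 500
varied = " ".join(str(i * i) for i in range(2000))
rate_repetitive = compression_rate(repetitive)
rate_varied = compression_rate(varied)
```

A low ratio indicates redundant content, such as keyword stuffing repeated many times on a page.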
Corpus precision and corpus recall. The set of the k most frequent words in the dataset, excluding stopwords, was used to define two classification features. Corpus precision refers to the fraction of words in a page that appear in the set of popular terms. Corpus recall was defined to be the fraction of popular terms that appear in the page. For each of corpus precision and corpus recall, four features were extracted, for k=100, 200, 500 and 1000, giving eight features in total.
Query precision and query recall. The set of the q most popular terms in a query log was used, and query precision and query recall were defined analogously to corpus precision and corpus recall. As with corpus precision and recall, eight features were extracted (four for each measure), for q=100, 200, 500 and 1000.
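The precision and recall features can be sketched as follows. The same function serves for both the corpus and the query-log variants, since only the set of popular terms differs. The tokenizer, the function names, and the choice to count every word occurrence when computing precision are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-cased alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_terms(documents, k, stopwords=frozenset()):
    """The k most frequent non-stopword terms over a collection of texts."""
    counts = Counter(
        w for doc in documents for w in tokenize(doc) if w not in stopwords
    )
    return {w for w, _ in counts.most_common(k)}


def precision_recall(page_text, popular_terms):
    """Precision: fraction of page words found in popular_terms.
    Recall: fraction of popular_terms found in the page."""
    words = tokenize(page_text)
    if not words or not popular_terms:
        return 0.0, 0.0
    precision = sum(w in popular_terms for w in words) / len(words)
    recall = len(popular_terms & set(words)) / len(popular_terms)
    return precision, recall


popular = {"cheap", "pills", "free"}
precision, recall = precision_recall("cheap cheap pills today", popular)
```

Calling `precision_recall` once per term-set size (100, 200, 500, 1000) for both the corpus terms and the query-log terms would produce the sixteen features described above.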
Independent trigram likelihood. A trigram is three consecutive words. Let {p_w} be the probability distribution of trigrams in a page. Let T={w} be the set of all trigrams in a page and k=|T| be the number of distinct trigrams. Then the independent trigram likelihood is a measure of the independence of the distribution of trigrams. It is defined in Ntoulas et al. to be −(1/k) Σ_{w∈T} log p_w.
Entropy of trigrams. The entropy is another measure of the compressibility of a page, in this case more macroscopic than the compressibility ratio feature because it is computed on the distribution of trigrams. The entropy of the distribution of trigrams, {p_w}, is defined as H = −Σ_{w∈T} p_w log p_w.
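The two trigram features can be sketched together. The sketch assumes that p_w is the relative frequency of trigram w among all trigram occurrences in the page, that the independent trigram likelihood takes the form −(1/k) Σ_{w∈T} log p_w used by Ntoulas et al., and that natural logarithms are used. The function names are hypothetical.

```python
import math
from collections import Counter


def trigram_distribution(words):
    """Map each distinct trigram w to its relative frequency p_w."""
    trigrams = [tuple(words[i : i + 3]) for i in range(len(words) - 2)]
    total = len(trigrams)
    return {w: c / total for w, c in Counter(trigrams).items()}


def independent_trigram_likelihood(dist):
    """-(1/k) * sum of log p_w over the k distinct trigrams (assumed form)."""
    k = len(dist)
    if k == 0:
        return 0.0
    return -sum(math.log(p) for p in dist.values()) / k


def trigram_entropy(dist):
    """H = -sum of p_w * log p_w over the distinct trigrams."""
    return -sum(p * math.log(p) for p in dist.values())


# Five distinct words give three distinct, equally likely trigrams.
dist = trigram_distribution("a b c d e".split())
likelihood = independent_trigram_likelihood(dist)
entropy = trigram_entropy(dist)
```

For a uniform distribution over k trigrams both quantities equal log k; a page that repeats a few trigrams many times has lower entropy.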
The above list gives a total of 24 features for each page. In general we found that, for this dataset, the content-based features do not provide as good a separation between spam and non-spam pages as for the data set used in Ntoulas et al. For example, we found that in the dataset we are using, the distributions of average word length in spam and non-spam pages were almost identical. In contrast, for the data set of Ntoulas et al. that particular feature provides very good separation. The same is true for many of the other content features. Some of the best features (judging only from the histograms) are the corpus precision and query precision, which are shown in
In total, 140 link-based features were extracted for each host, as described above, and 24 content-based features were extracted for each page. The content-based features of pages were aggregated in order to obtain content-based features for hosts.
Let h be a host containing m web pages, denoted by the set P={p_1, ..., p_m}. Let p̂ denote the home page of host h and p* denote the page with the largest PageRank among all pages in P. Let c(p) be the 24-dimensional content feature vector of page p. For each host h we form the content-based feature vector c(h) of h as follows:
c(h) = ⟨c(p̂), c(p*), E[c(p)], Var[c(p)]⟩.
Here E[c(p)] is the average of all vectors c(p), p ∈ P, and Var[c(p)] is the variance of c(p), p ∈ P. Therefore, for each host there were 4×24=96 content features. In total, there were 140+96=236 link and content features.
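The aggregation of page-level vectors into a host-level vector can be sketched as below. The sketch assumes the variance is the per-feature population variance and that the home page is identified by its index in the list of pages; neither detail is specified above, and the function name is hypothetical.

```python
from statistics import fmean, pvariance


def host_content_features(page_vectors, pageranks, home_index):
    """Concatenate c(home page), c(max-PageRank page), the per-feature mean,
    and the per-feature variance over all pages of a host."""
    max_pr_index = max(range(len(page_vectors)), key=lambda i: pageranks[i])
    columns = list(zip(*page_vectors))  # one tuple per feature
    mean = [fmean(col) for col in columns]
    variance = [pvariance(col) for col in columns]  # population variance (assumed)
    return (
        list(page_vectors[home_index])
        + list(page_vectors[max_pr_index])
        + mean
        + variance
    )


# Toy host with three pages and 2-dimensional page vectors.
pages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
feats = host_content_features(pages, pageranks=[0.1, 0.5, 0.4], home_index=0)
```

With 24-dimensional page vectors the result has 4×24=96 entries, matching the count given above.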
In the process of aggregating page features, hosts h were ignored for which the home page p̂ or the maxPR page p* is not present in our summary sample. This left a total of 8,944 hosts, out of which 5,622 are labeled; of these, 12% are spam hosts.
The base classifier used in the experiment was the implementation of C4.5 (decision trees) given in Weka (see e.g., I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.). Using both link and content features, the resulting tree used 45 unique features, of which 18 were content features.
In the data used, the non-spam examples outnumber the spam examples to such an extent that the classifier accuracy improves by misclassifying a disproportionate number of spam examples. To minimize the misclassification error, and compensate for the imbalance in class representation in the data, a cost-sensitive decision tree was used. A cost of zero was imposed for correctly classifying the instances, and the cost of misclassifying a spam host as normal was set to be R times more costly than misclassifying a normal host as spam. Table 1 shows the results for different values of R. The value of R is a parameter that can be tuned to balance the true positive rate and the false positive rate. In this case, it was desired to maximize the F-measure. Incidentally note that R=1 is equivalent to having no cost matrix, and is the baseline classifier.
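The effect of the cost ratio R can be illustrated with a minimal sketch of a minimum-expected-cost decision applied to a predicted spam probability. This is a generic cost-sensitive decision rule and is not the Weka cost-sensitive decision tree used in the experiment; the function names are hypothetical. The F-measure here is the usual harmonic mean of precision and recall.

```python
def predict_min_cost(p_spam, R):
    """Pick the label with the lower expected misclassification cost.

    Cost matrix: correct decisions cost 0, calling a normal host spam
    costs 1, and calling a spam host normal costs R.
    """
    cost_if_normal = p_spam * R            # wrong only if the host is spam
    cost_if_spam = (1.0 - p_spam) * 1.0    # wrong only if the host is normal
    return "spam" if cost_if_spam < cost_if_normal else "normal"


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# The same host is labeled differently as R grows.
label_r1 = predict_min_cost(0.3, R=1)    # equivalent to no cost matrix
label_r10 = predict_min_cost(0.3, R=10)
```

Under this rule a host is labeled spam whenever its predicted spam probability exceeds 1/(1+R), so increasing R raises the true positive rate at the expense of more false positives.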
It was then attempted to improve the results of the baseline classifier using bagging. Bagging is a technique that creates an ensemble of classifiers by sampling with replacement from the training set to create N classifiers whose training sets contain the same number of examples as the original training set, but may contain duplicates. The labels of the test set are determined by a majority vote of the classifier ensemble. In general, any classifier can be used as a base classifier, and in this experiment the cost-sensitive decision trees described above were used. Bagging improved our results by reducing the false-positive rate, as shown in Table 2. The decision tree created by bagging was roughly the same size as the tree created without bagging, and used 49 unique features, of which 21 were content features.
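Bagging as described above can be sketched as follows. For brevity the base learner is a toy one-dimensional threshold classifier rather than the cost-sensitive decision trees used in the experiment; the function names, the seed parameter, and the data are illustrative assumptions.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Toy base learner: threshold halfway between the two class means."""
    by_label = {0: [], 1: []}
    for x, y in samples:
        by_label[y].append(x)
    if not by_label[0] or not by_label[1]:
        constant = 1 if by_label[1] else 0  # bootstrap held a single class
        return lambda x: constant
    mean0 = sum(by_label[0]) / len(by_label[0])
    mean1 = sum(by_label[1]) / len(by_label[1])
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 > mean0 else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def bagging_fit(samples, fit, n_classifiers, seed=0):
    """Train n_classifiers models, each on a bootstrap sample that has the
    same size as the training set and is drawn with replacement."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_classifiers):
        bootstrap = [rng.choice(samples) for _ in samples]
        ensemble.append(fit(bootstrap))
    return ensemble


def bagging_predict(ensemble, x):
    """Label x by majority vote of the ensemble."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


train = [(x, 0) for x in range(-10, 0)] + [(x, 1) for x in range(1, 11)]
ensemble = bagging_fit(train, fit_stump, n_classifiers=25)
```

Any learner with the same fit-then-predict shape can be substituted for `fit_stump`, which reflects the remark above that any classifier can serve as the base classifier.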
The results of classification reported in Tables 1 and 2 use both link and content features. Table 3 shows the contribution of each type of feature to the classification. The content features serve to reduce the false-positive rate without diminishing the true positive result, and thus improve the overall performance of the classifier.
Given the above, an embodiment of the methods described above was then analyzed. During the extraction of link-based features, all nodes in the network were anonymous, whereas in this regularization phase the identity (the predicted label or initial host spamicity value) of each node is known.
In the experiment, a prediction-propagation algorithm was used to improve the prediction accuracy obtained from the baseline classifiers. In the method, the graph topology was used to smooth the predictions by propagating them using random walks, following Zhou et al. (D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16:321-328, 2004).
In the experiment, p(h) ∈ [0, 1] was the prediction of a particular classification algorithm C as described above, so that for each host h a value of p(h) equal to 0 indicates non-spam, while a value of 1 indicates spam. Let v^(0) be a vector such that v_h^(0) = p(h) / Σ_{h′∈H} p(h′) is a normalized version of the predicted spamicity.
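Next, v is updated over a number of iterations t in the following way:
v_h^(t) = (1−α) v_h^(0) + α Σ_{g: g→h} v_g^(t−1) / out-degree(g)
in which the sum runs over the hosts g that link to h and out-degree(g) is the out-degree of g. In the limit this process converges to the stationary probabilities of a random walk with restart probability 1−α. When the random walk is restarted it returns to a node with high predicted spamicity.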
After this process was run, the training part of the data was used to learn a threshold parameter, and this threshold subsequently used to classify the testing part as non-spam or spam.
Three forms of random walk were tested: on the host graph, on the transposed host graph (meaning that the activation goes backwards) and on the undirected host graph. Different values of the α parameter were tried, which resulted in improvements over the baseline with α ∈ [0.1, 0.3], implying short chains in the random walk. In Table 4 we report the results when α=0.3, after 10 iterations (this was enough for convergence in this graph).
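The propagation step can be sketched as follows. The sketch assumes the update has the form v_h^(t) = (1−α) v_h^(0) + α Σ_{g→h} v_g^(t−1) / out-degree(g), which is consistent with a random walk with restart probability 1−α as described above; the exact update used in the experiment may differ in detail. The function names and the edge-list representation are hypothetical, and every host appearing in an edge is assumed to have a prediction.

```python
def propagate(predictions, edges, alpha=0.3, iterations=10):
    """Smooth predicted spamicity over a host graph by random walk with restart.

    predictions maps host -> p(h) in [0, 1]; edges is a list of (g, h)
    pairs meaning that host g links to host h.
    """
    total = sum(predictions.values())
    if total <= 0:
        raise ValueError("at least one prediction must be positive")
    v0 = {h: p / total for h, p in predictions.items()}  # normalized start
    out_degree = {h: 0 for h in predictions}
    for g, _ in edges:
        out_degree[g] += 1
    v = dict(v0)
    for _ in range(iterations):
        nxt = {h: (1 - alpha) * v0[h] for h in predictions}  # restart term
        for g, h in edges:
            nxt[h] += alpha * v[g] / out_degree[g]           # walk term
        v = nxt
    return v


def transpose(edges):
    """Reverse every link, so that activation flows backwards."""
    return [(h, g) for g, h in edges]


def undirected(edges):
    """Treat every link as bidirectional."""
    return sorted(set(edges) | set(transpose(edges)))


graph = [("a", "b"), ("b", "a")]
scores = propagate({"a": 1.0, "b": 0.0}, graph, alpha=0.3, iterations=1)
```

Passing `transpose(graph)` or `undirected(graph)` as the edge list gives the other two variants of the walk. Hosts with no out-links lose their propagated mass in this sketch, a simplification that a fuller implementation would need to address.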
As shown in Table 4, the classifier without bagging was improved (and the improvement is statistically significant at the 0.05 confidence level), but the increase of accuracy for the classifier with bagging is small.
Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client level or server level or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible. Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.
Furthermore, the embodiments of methods presented and described as flowcharts in this disclosure are provided by way of example in order to provide a more complete understanding of the technology. The disclosed methods are not limited to the operations, order of operations and logical flow presented herein. Alternative embodiments are contemplated in which the order of the various operations is altered and in which sub-operations described as being part of a larger operation are performed independently.
While various embodiments have been described for purposes of this disclosure, such embodiments should not be deemed to limit the teaching of this disclosure to those embodiments. Various changes and modifications may be made to the elements and operations described above to obtain a result that remains within the scope of the systems and processes described in this disclosure. For example, a spam host identification system could be incorporated into an automated news aggregation system so that pages from spam hosts are not accidentally aggregated as non-spam news items. Numerous other changes may be made that will readily suggest themselves to those skilled in the art and which are encompassed in the spirit of the invention disclosed and as defined in the appended claims.